Hi,

the broken SSD was replaced yesterday evening. All seems fine. I moved the Jenkins workspace and (snapshots of the) virtual machine disks to the new SSD, and the game goes on!

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Monday, August 19, 2013 5:04 PM
> To: dev@lucene.apache.org
> Subject: Lucene tests killed one other SSD - Policeman Jenkins
>
> Hi,
>
> there were some problems with Policeman Jenkins over the last few days. The
> server died 6 times in the last month, recently 2 times in 24 hours. After I
> moved the swap file off the SSD, the failures were no longer fatal for the
> server, but fatal for some Jenkins runs :-)
>
> Finally the SSD became unresponsive, and only after a power cycle was it
> responsive again. The error messages in dmesg look similar to those of other
> dying OCZ Vertex 2 drives.
>
> Now the statistics: During the whole lifetime of this SSD (2.5 years, which is
> the lifetime of the server), it was mostly unused (it was just an "addon",
> provided by the hosting provider, thanks to Serverloft / Plusserver). 1.5 years
> ago, Robert Muir and Mike McCandless decided to use the server of my own
> company, SD DataSolutions, to do more than idle most of the time: We
> installed Jenkins and 2 additional VirtualBox machines on this server after
> the 2012 Lucene Revolution conference, and the "spare" SSD was used for the
> swap file, the Jenkins workspace, and the virtual disks of the Windows and
> Hackintosh machines.
>
> During this time (1 year, 3 months) the SSD did hard work, according to
> SMART:
>
> ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f 112   112   050    Pre-fail Always  -           0/61244435
>   5 Retired_Block_Count     0x0033 100   100   003    Pre-fail Always  -           0
>   9 Power_On_Hours_and_Msec 0x0032 100   100   000    Old_age  Always  -           21904h+48m+22.180s
>  12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           19
> 171 Program_Fail_Count      0x0032 000   000   000    Old_age  Always  -           0
> 172 Erase_Fail_Count        0x0032 000   000   000    Old_age  Always  -           0
> 174 Unexpect_Power_Loss_Ct  0x0030 000   000   000    Old_age  Offline -           6
> 177 Wear_Range_Delta        0x0000 000   000   000    Old_age  Offline -           2
> 181 Program_Fail_Count      0x0032 000   000   000    Old_age  Always  -           0
> 182 Erase_Fail_Count        0x0032 000   000   000    Old_age  Always  -           0
> 187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
> 194 Temperature_Celsius     0x0022 001   129   000    Old_age  Always  -           1 (0 127 0 129)
> 195 ECC_Uncorr_Error_Count  0x001c 112   112   000    Old_age  Offline -           0/61244435
> 196 Reallocated_Event_Count 0x0033 100   100   000    Pre-fail Always  -           0
> 231 SSD_Life_Left           0x0013 096   096   010    Pre-fail Always  -           0
> 233 SandForce_Internal      0x0000 000   000   000    Old_age  Offline -           18752
> 234 SandForce_Internal      0x0032 000   000   000    Old_age  Always  -           53376
> 241 Lifetime_Writes_GiB     0x0032 000   000   000    Old_age  Always  -           53376
> 242 Lifetime_Reads_GiB      0x0032 000   000   000    Old_age  Always  -           22784
>
> The last 2 lines are interesting: about 52 TiB (53376 GiB) were written to it
> and about 22 TiB (22784 GiB) read from it. Swap can be ignored (it was mostly
> unused, as swappiness is low), so our tests are reading and writing a lot!
>
> And unfortunately, after that it died (or almost died) this morning. The cause
> is unclear. It could also be a broken SATA cable, but according to the web the
> error messages seen in "dmesg" can also be caused by a drive failure
> (especially as it is a timeout, not a DMA error)!
> See https://paste.apache.org/bjAH
>
> So just to conclude: Lucene kills SSDs :-) Mike still has one Vertex 3 running
> (his Intel one died before).
>
> Of course, as this is a rented server, the hosting provider will replace the
> SSD (I was able to copy the data off; the Jenkins workspace is not really
> important data, the virtual machines matter more). After that, one more year
> with a new SSD, or will it survive longer? Let's see what type I will get as a
> replacement. I have no idea when it will be replaced, so please excuse any
> Jenkins downtime and, after that, possibly broken builds until everything has
> settled again. At the moment Jenkins is running much slower from the RAID 1
> hard disks (with lots of IOWAITs!).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
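
As a rough cross-check of the lifetime figures in the quoted SMART output (attributes 241 and 242), the raw GiB values can be converted to TiB and decimal TB and averaged per day. This is only a sketch: the function names are made up for illustration, the 456-day period is an assumed reading of "1 year, 3 months", and attributing all traffic to that period is an assumption based on the remark that the SSD was mostly unused before.

```python
# Raw values copied from the quoted SMART table (attributes 241 and 242).
LIFETIME_WRITES_GIB = 53376
LIFETIME_READS_GIB = 22784

# Assumption: "1 year, 3 months" of Jenkins use is taken as 365 + 91 days,
# and all traffic is attributed to that period.
ACTIVE_DAYS = 365 + 91


def gib_to_tib(gib: float) -> float:
    """Convert gibibytes to tebibytes (1 TiB = 1024 GiB)."""
    return gib / 1024


def gib_to_tb(gib: float) -> float:
    """Convert gibibytes to decimal terabytes (1 TB = 10**12 bytes)."""
    return gib * 2**30 / 10**12


def avg_gib_per_day(gib: float, days: int) -> float:
    """Average daily volume in GiB over the given number of days."""
    return gib / days


if __name__ == "__main__":
    for label, gib in (("written", LIFETIME_WRITES_GIB),
                       ("read", LIFETIME_READS_GIB)):
        print(f"{label}: {gib_to_tib(gib):.2f} TiB, {gib_to_tb(gib):.2f} TB, "
              f"{avg_gib_per_day(gib, ACTIVE_DAYS):.0f} GiB/day on average")
```

Under these assumptions the writes come to about 52 TiB, which averages out to a bit over 100 GiB per day.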