I am told that SSDs are spec'd for only 70 full writes before they get an error. The failing block is set aside, but eventually something critical gets hit. So you should probably expect this to happen again.
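(A "full write" here means rewriting the drive's entire capacity once. As a rough sanity check, here is a minimal sketch relating total host writes to full-drive rewrites; the 120 GiB capacity is an assumption for illustration, while 53376 GiB is the Lifetime_Writes_GiB value from the SMART dump quoted below.)

```python
# Rough endurance arithmetic: how many full-drive rewrites a given
# amount of host writes amounts to. Capacity of 120 GiB is assumed
# (Vertex 2 drives shipped in several sizes).
def full_drive_writes(total_written_gib: float, capacity_gib: float) -> float:
    """Number of times the whole drive's capacity has been rewritten."""
    return total_written_gib / capacity_gib

print(round(full_drive_writes(53376, 120)))  # ~445 full-drive rewrites
```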
Karl

-----Original Message-----
From: ext Uwe Schindler [mailto:[email protected]]
Sent: Monday, August 19, 2013 11:04 AM
To: [email protected]
Subject: Lucene tests killed one other SSD - Policeman Jenkins

Hi,

there were some problems with Policeman Jenkins over the last days. The server died 6 times in the last month, most recently twice within 24 hours. After I moved the swap file off the SSD, the failures were no longer fatal for the server, but still fatal for some Jenkins runs :-) Finally the SSD device became unresponsive, and only after a power cycle was it responsive again. The error messages in dmesg look similar to those of other dying OCZ Vertex 2 drives.

Now the statistics: During most of the lifetime of this SSD (2.5 years, which is also the lifetime of the server), it was mostly unused. It was just an "add-on", provided by the hosting provider (thanks to Serverloft / Plusserver). 1.5 years ago, Robert Muir and Mike McCandless decided that the server of my own company, SD DataSolutions, should do more than idle most of the time: after the 2012 Lucene Revolution conference we installed Jenkins and 2 additional VirtualBox machines on it, and the "spare" SSD was used for the swap file, the Jenkins workspace, and the virtual disks of the Windows and Hackintosh machines.
During this time (1 year, 3 months) the SSD did hard work, according to SMART:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f 112   112   050    Pre-fail Always  -           0/61244435
  5 Retired_Block_Count     0x0033 100   100   003    Pre-fail Always  -           0
  9 Power_On_Hours_and_Msec 0x0032 100   100   000    Old_age  Always  -           21904h+48m+22.180s
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           19
171 Program_Fail_Count      0x0032 000   000   000    Old_age  Always  -           0
172 Erase_Fail_Count        0x0032 000   000   000    Old_age  Always  -           0
174 Unexpect_Power_Loss_Ct  0x0030 000   000   000    Old_age  Offline -           6
177 Wear_Range_Delta        0x0000 000   000   000    Old_age  Offline -           2
181 Program_Fail_Count      0x0032 000   000   000    Old_age  Always  -           0
182 Erase_Fail_Count        0x0032 000   000   000    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
194 Temperature_Celsius     0x0022 001   129   000    Old_age  Always  -           1 (0 127 0 129)
195 ECC_Uncorr_Error_Count  0x001c 112   112   000    Old_age  Offline -           0/61244435
196 Reallocated_Event_Count 0x0033 100   100   000    Pre-fail Always  -           0
231 SSD_Life_Left           0x0013 096   096   010    Pre-fail Always  -           0
233 SandForce_Internal      0x0000 000   000   000    Old_age  Offline -           18752
234 SandForce_Internal      0x0032 000   000   000    Old_age  Always  -           53376
241 Lifetime_Writes_GiB     0x0032 000   000   000    Old_age  Always  -           53376
242 Lifetime_Reads_GiB      0x0032 000   000   000    Old_age  Always  -           22784

The last 2 lines are interesting: 53 terabytes were written to it and 22 terabytes read from it. Ignore swap (mostly unused, as swappiness is low), so our tests are reading and writing a lot! And unfortunately, after all that, it died (or almost died) this morning. The cause is unclear (it could also be a broken SATA cable), but according to the web, the given error messages in "dmesg" also seem to be caused by drive failure (especially as it is a timeout, not a DMA error)! See https://paste.apache.org/bjAH

So just to conclude: Lucene kills SSDs :-) Mike still has one Vertex 3 running (his Intel one died before).
Of course, as this is a rented server, the hosting provider will replace the SSD. (I was able to copy the data off; the Jenkins workspace is not really important data, the virtual machines more so.) After that, one more year with a new SSD, or should it survive longer? Let's see what type I get as a replacement. I have no idea when it will be replaced, so excuse any Jenkins downtime, and after that maybe broken builds until everything is settled again. At the moment Jenkins is running much slower from the RAID 1 hard disks (with lots of IOWAITs!).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]
