RE: Lucene tests killed one other SSD - Policeman Jenkins

karl.wright Mon, 19 Aug 2013 08:09:27 -0700

I am told that SSDs are spec'd for only 70 full writes before they get an 
error.  The error block is set aside, but eventually something critical gets 
hit.  So you should probably expect this to happen again.

Karl

-----Original Message-----
From: ext Uwe Schindler [mailto:[email protected]] 
Sent: Monday, August 19, 2013 11:04 AM
To: [email protected]
Subject: Lucene tests killed one other SSD - Policeman Jenkins

Hi,

There have been some problems with Policeman Jenkins over the last few days. 
The server died 6 times in the last month, most recently 2 times within 24 
hours. After I moved the swap file off the SSD, the failures were no longer 
fatal for the server, but still fatal for some Jenkins runs :-)

Finally the SSD became unresponsive, and it only responded again after a power 
cycle. The error messages in dmesg look similar to those from other dying OCZ 
Vertex 2 drives.

Now the statistics: For most of its lifetime (2.5 years, which is also the 
lifetime of the server), this SSD was unused. It was just an add-on provided 
by the hosting provider (thanks to Serverloft / Plusserver). 1.5 years ago, 
Robert Muir and Mike McCandless decided to use the server of my own company, 
SD DataSolutions, for more than idling most of the time: after the 2012 Lucene 
Revolution conference we installed Jenkins and 2 additional VirtualBox 
machines on it, and the "spare" SSD was used for the swap file, the Jenkins 
workspace, and the virtual disks of the Windows and Hackintosh machines.

During this time (1 year, 3 months) the SSD did hard work, according to SMART:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   112   112   050    Pre-fail  Always       -       0/61244435
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       21904h+48m+22.180s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       6
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       2
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   001   129   000    Old_age   Always       -       1 (0 127 0 129)
195 ECC_Uncorr_Error_Count  0x001c   112   112   000    Old_age   Offline      -       0/61244435
196 Reallocated_Event_Count 0x0033   100   100   000    Pre-fail  Always       -       0
231 SSD_Life_Left           0x0013   096   096   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       18752
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       53376
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       53376
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       22784

The last 2 lines are interesting:
about 52 TiB (53376 GiB) were written to it and about 22 TiB (22784 GiB) read 
from it. Swap can be ignored (it is mostly unused, as swappiness is low), so 
our tests are reading and writing a lot!
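
For anyone who wants to redo the arithmetic on the SMART counters quoted 
above, here is a minimal Python sketch. The raw values are copied from 
attributes 9, 241 and 242. The 456-day active period is an assumption 
(1 year, 3 months at roughly 30.4 days per month), and full_drive_writes() 
takes the drive capacity as a parameter because the capacity is not stated 
in this mail. All function and constant names are made up for illustration.

```python
# Unit conversions for the SMART raw values quoted in this mail.
GIB_IN_BYTES = 2**30

# Raw values from the SMART output (attributes 241, 242 and 9).
LIFETIME_WRITES_GIB = 53376
LIFETIME_READS_GIB = 22784
POWER_ON_HOURS = 21904

# Assumption: "1 year, 3 months" of Jenkins use is about 15 * 30.4 days.
ACTIVE_DAYS = 456


def gib_to_tib(gib):
    """Convert GiB to TiB (binary units, 1 TiB = 1024 GiB)."""
    return gib / 1024


def gib_to_tb(gib):
    """Convert GiB to decimal terabytes (1 TB = 10**12 bytes)."""
    return gib * GIB_IN_BYTES / 1e12


def gib_per_day(gib, days):
    """Average volume per day over the given number of days."""
    return gib / days


def full_drive_writes(written_gib, capacity_gib):
    """Number of times the whole drive was written, for a given capacity.

    capacity_gib is hypothetical input: the mail does not state the
    capacity of the SSD.
    """
    return written_gib / capacity_gib


print("written: %.3f TiB = %.2f TB" % (gib_to_tib(LIFETIME_WRITES_GIB),
                                       gib_to_tb(LIFETIME_WRITES_GIB)))
print("read:    %.3f TiB = %.2f TB" % (gib_to_tib(LIFETIME_READS_GIB),
                                       gib_to_tb(LIFETIME_READS_GIB)))
print("average writes: %.0f GiB/day" % gib_per_day(LIFETIME_WRITES_GIB,
                                                   ACTIVE_DAYS))
print("power-on time:  %.2f years" % (POWER_ON_HOURS / 24 / 365))
```

Under these assumptions the write counter works out to 52.125 TiB (about 
57.3 TB in decimal units), or roughly 117 GiB written per day of Jenkins use, 
and the power-on time matches the 2.5 years mentioned above.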

And unfortunately, after that it died (or almost died) this morning. The cause 
is unclear. It could also be a broken SATA cable, but from what I found on the 
web, the error messages in "dmesg" also seem to be caused by drive failure 
(especially as it is a timeout, not a DMA error)! See 
https://paste.apache.org/bjAH

So just to conclude: Lucene kills SSDs :-) Mike still has one Vertex 3 running 
(his Intel one died before).

Of course, as this is a rented server, the hosting provider will replace the 
SSD (I was able to copy the data off; the Jenkins workspace is not really 
important data, but the virtual machines are). After that, one more year with 
a new SSD, or will it survive longer? Let's see what type I get as a 
replacement. I have no idea when it will be replaced, so please excuse any 
Jenkins downtime, and maybe broken builds afterwards until everything is 
settled again. At the moment Jenkins is running much slower from the RAID 1 
hard disks (with lots of IOWAITs!).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]
