Hi,

the broken SSD was replaced yesterday evening. All seems fine. I moved the
Jenkins workspace and (snapshots of the) virtual machine disks to the new SSD,
and the game goes on!

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Uwe Schindler [mailto:u...@thetaphi.de]
> Sent: Monday, August 19, 2013 5:04 PM
> To: dev@lucene.apache.org
> Subject: Lucene tests killed one other SSD - Policeman Jenkins
> 
> Hi,
> 
> there were some problems with Policeman Jenkins over the last few days. The
> server died six times in the last month, most recently twice within 24 hours.
> After I moved the swap file off the SSD, the failures were no longer fatal
> for the server, but still fatal for some Jenkins runs :-)
> 
> Finally the SSD device became unresponsive, and only after a power cycle was
> it responsive again. The error messages in dmesg look similar to those of
> other dying OCZ Vertex 2 drives.
> 
> Now the statistics: During the whole lifetime of this SSD (2.5 years, which
> is also the lifetime of the server), it was mostly unused (it was just an
> "add-on" provided by the hosting provider, thanks to Serverloft /
> Plusserver). 1.5 years ago, Robert Muir and Mike McCandless decided that the
> server of my own company, SD DataSolutions, should do more than idle most of
> the time: we installed Jenkins and 2 additional VirtualBox machines on this
> server after the 2012 Lucene Revolution conference, and the "spare" SSD
> served as the base for the swap file, the Jenkins workspace, and the virtual
> disks of the Windows and Hackintosh machines.
> 
> During this time (1 year, 3 months) the SSD worked hard, according to
> SMART:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   112   112   050    Pre-fail  Always       -       0/61244435
>   5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
>   9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       21904h+48m+22.180s
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       19
> 171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
> 172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
> 174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       6
> 177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       2
> 181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       0
> 182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 194 Temperature_Celsius     0x0022   001   129   000    Old_age   Always       -       1 (0 127 0 129)
> 195 ECC_Uncorr_Error_Count  0x001c   112   112   000    Old_age   Offline      -       0/61244435
> 196 Reallocated_Event_Count 0x0033   100   100   000    Pre-fail  Always       -       0
> 231 SSD_Life_Left           0x0013   096   096   010    Pre-fail  Always       -       0
> 233 SandForce_Internal      0x0000   000   000   000    Old_age   Offline      -       18752
> 234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       53376
> 241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       53376
> 242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       22784
> 
> The last 2 lines are interesting:
> 53 terabytes were written to it and 22 terabytes were read from it. Swap can
> be ignored (it is mostly unused, as swappiness is low), so our tests are
> reading and writing a lot!
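> 
> (A quick back-of-the-envelope check of those two counters, as a minimal
> Python sketch; it assumes the raw values of attributes 241/242 are GiB, as
> smartctl labels them, so the "53 TB" above is really about 52 TiB:)
> 
>     # SMART raw values from the table above (lifetime totals in GiB)
>     writes_gib = 53_376  # attribute 241, Lifetime_Writes_GiB
>     reads_gib = 22_784   # attribute 242, Lifetime_Reads_GiB
> 
>     print(f"written: {writes_gib / 1024:.1f} TiB")  # ~52.1 TiB
>     print(f"read:    {reads_gib / 1024:.1f} TiB")   # ~22.2 TiB
>     # the tests write more than twice as much as they read
>     print(f"GiB written per GiB read: {writes_gib / reads_gib:.2f}")  # ~2.34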
> 
> And unfortunately, after all that, it died (or almost died) this morning.
> The cause is unclear; it could also be a broken SATA cable, but according to
> reports on the web the error messages in "dmesg" can also be caused by drive
> failure (especially as it is a timeout, not a DMA error)! See
> https://paste.apache.org/bjAH
> 
> So just to conclude: Lucene kills SSDs :-) Mike still has one Vertex 3
> running (his Intel one died earlier).
> 
> Of course, as this is a rented server, the hosting provider will replace the
> SSD (I was able to copy the data off; the Jenkins workspace is not really
> important data, the virtual machines more so). After that, will we get one
> more year out of a new SSD, or will it survive longer? Let's see what type I
> get as a replacement. I have no idea when it will be replaced, so please
> excuse any Jenkins downtime, and after that maybe broken builds until
> everything is settled again. At the moment Jenkins is running much more
> slowly from the RAID 1 hard disks (with lots of IOWAITs!).
> 
> Uwe
> 
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
