Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-28 Thread Christian Balzer
Hello, as a follow-up, conclusion and dire warning to all who happen to encounter this failure mode: The server with that failed power loss capacitor SSD had a religious experience 2 days ago and needed a power cycle to revive it. Now in theory the data should have been safe, as the drive had

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Christian Balzer
Hello, On Wed, 3 Aug 2016 13:42:50 +0200 Jan Schermer wrote: > Christian, can you post your values for Power_Loss_Cap_Test on the drive > which is failing? > Sure: --- 175 Power_Loss_Cap_Test 0x0033 001 001 010 Pre-fail Always FAILING_NOW 1 (47 942) --- Now according to the
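Christian's row uses the default `smartctl -A` column layout: ID, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE. The normalized VALUE (001) sits below THRESH (010), which is roughly the condition under which smartctl reports FAILING_NOW. Below is a minimal sketch of that comparison. It assumes whitespace-separated fields and a single-token WHEN_FAILED column. `SmartAttr`, `parse_attr_row` and `is_failing` are hypothetical helpers and are not part of smartmontools.

```python
from dataclasses import dataclass


@dataclass
class SmartAttr:
    id: int
    name: str
    flag: int          # hex FLAG column, e.g. 0x0033
    value: int         # normalized current value
    worst: int         # worst normalized value seen
    thresh: int        # vendor failure threshold (0 = never fails)
    type: str          # Pre-fail / Old_age
    updated: str       # Always / Offline
    when_failed: str   # "-", FAILING_NOW or In_the_past
    raw: str           # raw value, may contain spaces, e.g. "1 (47 942)"


def parse_attr_row(line: str) -> SmartAttr:
    """Parse one row of the default (non-brief) `smartctl -A` table (assumed layout)."""
    f = line.split()
    return SmartAttr(
        id=int(f[0]),
        name=f[1],
        flag=int(f[2], 16),  # int() accepts the "0x" prefix with base 16
        value=int(f[3]),
        worst=int(f[4]),
        thresh=int(f[5]),
        type=f[6],
        updated=f[7],
        when_failed=f[8],
        raw=" ".join(f[9:]),
    )


def is_failing(attr: SmartAttr) -> bool:
    # A zero threshold means the attribute can never trip.
    return attr.thresh > 0 and attr.value <= attr.thresh
```

Fed Christian's row, the parser yields `value=1`, `thresh=10` and `is_failing(...)` is true, matching the FAILING_NOW smartctl printed.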

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
Christian, can you post your values for Power_Loss_Cap_Test on the drive which is failing? Thanks Jan > On 03 Aug 2016, at 13:33, Christian Balzer wrote: > > > Hello, > > yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it > seemed to be such an odd thing

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Christian Balzer
Hello, yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it seemed to be such an odd thing to fail (given that it's not a single capacitor). As for your Reallocated_Sector_Ct, that's really odd and definitely an RMA-worthy issue. For the record, Intel SSDs use (typically 24)

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Daniel Swarbrick
Right, I actually updated to smartmontools 6.5+svn4324, which now properly supports this drive model. Some of the SMART attribute names have changed and make more sense now (and there are no more "Unknowns"): ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE 5

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
I'm a fool, I miscalculated the writes by a factor of 1000, of course :-) 600 GB/month is not much for an S36xx at all, so it must be some sort of defect then... Jan > On 03 Aug 2016, at 12:15, Jan Schermer wrote: > > Make sure you are reading the right attribute and interpreting it
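A factor-of-1000 slip like Jan's is easy to make because the write counters on these drives are not reported in bytes. Below is a sketch of the arithmetic. It assumes the host-write counter is attribute 241, reported in 32 MiB units. smartmontools' drive database names it Host_Writes_32MiB for many Intel models, but check your own `smartctl -A` output. It also approximates a month as 730 power-on hours and uses decimal GB (10^9 bytes). The function names and the sample numbers are hypothetical.

```python
MIB = 1024 ** 2
HOURS_PER_MONTH = 730  # ~365.25 * 24 / 12, rounded


def host_writes_bytes(raw_32mib: int) -> int:
    """Convert an (assumed) Host_Writes_32MiB raw counter to bytes."""
    return raw_32mib * 32 * MIB


def gb_per_month(raw_32mib: int, power_on_hours: int) -> float:
    """Average host writes in decimal GB per month of power-on time."""
    months = power_on_hours / HOURS_PER_MONTH
    return host_writes_bytes(raw_32mib) / 1e9 / months


def gb_to_tb(gb: float) -> float:
    # The factor Jan slipped on: 1 TB = 1000 GB.
    return gb / 1000
```

For example, a counter of 1000 after 730 power-on hours works out to about 33.6 GB/month. At Jan's corrected figure, 600 GB/month is only 0.6 TB/month.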

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
Make sure you are reading the right attribute and interpreting it right. update-smart-drivedb sometimes works wonders :) I wonder what the isdct tool would say the drive's life expectancy is with this workload? Are you really writing ~600TB/month?? Jan > On 03 Aug 2016, at 12:06, Maxime Guyot

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Maxime Guyot
Hi, I haven’t had problems with Power_Loss_Cap_Test so far. Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the “Available Reserved Space” (SMART ID: 232/E8h), the data sheet

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Daniel Swarbrick
Hi Christian, Intel drives are good, but apparently not infallible. I'm watching a DC S3610 480GB die from reallocated sectors. ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE 5 Reallocated_Sector_Ct -O--CK 081 081 000 - 756 9 Power_On_Hours
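Daniel's output uses smartctl's brief attribute format (`-f brief`). In this format, FLAGS is a string of letters instead of a hex value, and a single FAIL column replaces TYPE/UPDATED/WHEN_FAILED. Below is a small decoder for one row. It follows the legend smartctl prints under that table: P prefailure warning, O updated online, S speed/performance, R error rate, C event count, K auto-keep. The helper names are hypothetical.

```python
FLAG_LETTERS = {
    "P": "prefailure warning",
    "O": "updated online",
    "S": "speed/performance",
    "R": "error rate",
    "C": "event count",
    "K": "auto-keep",
}


def decode_brief_flags(flags):
    """Map a brief FLAGS string such as '-O--CK' to descriptions; '-' means unset."""
    return [FLAG_LETTERS[c] for c in flags if c in FLAG_LETTERS]


def parse_brief_row(line):
    """Parse: ID NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE (assumed layout)."""
    f = line.split()
    return {
        "id": int(f[0]),
        "name": f[1],
        "flags": decode_brief_flags(f[2]),
        "value": int(f[3]),
        "worst": int(f[4]),
        "thresh": int(f[5]),
        "fail": f[6],             # '-' if the attribute has never failed
        "raw": " ".join(f[7:]),
    }
```

For Reallocated_Sector_Ct here, THRESH is 000, so SMART will never flag the attribute as failing, however far the normalized value drops. The raw count (756) is the number worth alerting on.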

[ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-02 Thread Christian Balzer
Hello, not a Ceph-specific issue, but this is probably the largest sample size of SSD users I'm familiar with. ^o^ This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a religious experience. It turns out that the SMART check plugin I run to mostly get an early wearout
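Christian relies on a Nagios SMART check mainly for early wear-out warnings. Below is a minimal sketch of that kind of check; it is not the plugin he actually runs. It maps already-parsed attributes to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). Any attribute at or below a nonzero threshold produces CRITICAL, as with the failed Power_Loss_Cap_Test. The attribute name Media_Wearout_Indicator and the warning level of 20 are assumptions for illustration.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin return codes


def evaluate(attrs, wearout_name="Media_Wearout_Indicator", wearout_warn=20):
    """attrs: iterable of (name, normalized_value, thresh) tuples.

    Returns (exit_code, message) in Nagios plugin style.
    """
    attrs = list(attrs)  # iterate more than once safely
    # Any attribute at or below a nonzero threshold is failing now.
    failing = [name for name, value, thresh in attrs if thresh > 0 and value <= thresh]
    if failing:
        return CRITICAL, "CRITICAL: failing " + ", ".join(failing)
    # Otherwise, warn early when the (assumed) wear-out attribute gets low.
    for name, value, _thresh in attrs:
        if name == wearout_name and value <= wearout_warn:
            return WARNING, f"WARNING: {name} at {value}"
    return OK, "OK"
```

A wrapper script would print the message and exit with the returned code (for example via `sys.exit(code)`), since Nagios reads a plugin's state from its exit status.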