Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-28 Thread Christian Balzer

Hello,

as a follow-up, conclusion and dire warning to all who happen to encounter
this failure mode:

The server with that failed power loss capacitor SSD had a religious
experience 2 days ago and needed a power cycle to revive it.

Now, in theory, the data should have been safe, as the drive had minutes to
scribble away its cache.

Alas, what happened is that the SSD bricked itself; it is no longer
accessible, and the only meaningful output from "smartctl -a" is:
"SMART overall-health self-assessment test result: FAILED!"

I'm trying to think of a failure mode where the capacitor would cause
something like this and am coming up blank, so my theories at this time
are:

1. Something more substantial was failing and the error was a symptom, not
the cause.

2. Intel's "we won't let you deal with potentially broken data" rule
strikes again (they brick SSDs that reach maximum wear-out levels), and a
failed power cap triggers the same rule.


Either way, if you ever encounter this problem, get a replacement ASAP;
if the drive is used as a journal SSD, shut down all associated OSDs,
flush the journals, and replace it.
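Sketched as a dry run below; the OSD ids and the journal device are made up for illustration, and the `--flush-journal`/`--mkjournal` flags are the filestore-era ceph-osd options:

```shell
#!/bin/sh
# Dry-run sketch of the journal replacement steps above.
# OSD ids and the journal device are hypothetical; adjust to taste.
JOURNAL_DEV=/dev/sdg
OSDS="10 11 12"

plan() {
  for id in $OSDS; do
    echo "systemctl stop ceph-osd@$id"
    echo "ceph-osd -i $id --flush-journal"
  done
  echo "# swap out $JOURNAL_DEV and recreate the journal partitions, then:"
  for id in $OSDS; do
    echo "ceph-osd -i $id --mkjournal"
    echo "systemctl start ceph-osd@$id"
  done
}
plan
```

Whether it is `systemctl` or an init script depends on the distribution and Ceph release, of course.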

Christian


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Christian Balzer

Hello,

On Wed, 3 Aug 2016 13:42:50 +0200 Jan Schermer wrote:

> Christian, can you post your values for Power_Loss_Cap_Test on the drive 
> which is failing?
>
Sure:
---
175 Power_Loss_Cap_Test 0x0033   001   001   010    Pre-fail  Always       FAILING_NOW 1 (47 942)
---

Now, according to the Intel data sheet, that value of 1 means failed, NOT
the actual buffer time it usually reports, like this on the neighboring SSD:
---
175 Power_Loss_Cap_Test 0x0033   100   100   010    Pre-fail  Always       -   614 (47 944)
---

And my 800GB DC S3610s have more than 10 times the endurance; my guess is
a combination of a larger cache and slower writes:
---
175 Power_Loss_Cap_Test 0x0033   100   100   010    Pre-fail  Always       -   8390 (22 7948)
---
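If one wants to check that programmatically, a sketch against the two attribute lines above (column 4 is the normalized VALUE in smartctl's table layout):

```shell
#!/bin/sh
# Sketch: extract the normalized VALUE column from the two attribute
# lines above; 1 is at/below the threshold of 10 (failed), 100 is fresh.
failed='175 Power_Loss_Cap_Test 0x0033   001   001   010    Pre-fail  Always       FAILING_NOW 1 (47 942)'
healthy='175 Power_Loss_Cap_Test 0x0033   100   100   010    Pre-fail  Always       -   614 (47 944)'

value_of() { echo "$1" | awk '{ print $4 + 0 }'; }   # 4th column = VALUE

value_of "$failed"     # prints 1
value_of "$healthy"    # prints 100
```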

I'll definitely leave that "failing" SSD in place until it has done the
next self-check.

Christian



-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com  

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
Christian, can you post your values for Power_Loss_Cap_Test on the drive which 
is failing?

Thanks
Jan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Christian Balzer

Hello,

yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
seemed such an odd thing to fail (given that it's not a single capacitor).

As for your Reallocated_Sector_Ct, that's really odd and definitely an
RMA-worthy issue.

For the record, Intel SSDs use (typically 24) sectors when doing firmware
upgrades, so this is a totally healthy 3610. ^o^
---
  5 Reallocated_Sector_Ct   0x0032   099   099   000    Old_age   Always       -   24
---

Christian



-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Daniel Swarbrick
Right, I actually updated to smartmontools 6.5+svn4324, which now
properly supports this drive model. Some of the smart attr names have
changed, and make more sense now (and there are no more "Unknowns"):

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    944
  9 Power_On_Hours          -O--CK   100   100   000    -    1067
 12 Power_Cycle_Count       -O--CK   100   100   000    -    7
170 Available_Reservd_Space PO--CK   085   085   010    -    0
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    68
174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    6510 (4 4307)
183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Temperature_Case        -O---K   070   065   000    -    30 (Min/Max 25/35)
192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
194 Temperature_Internal    -O---K   100   100   000    -    30
197 Current_Pending_Sector  -O--C-   100   100   000    -    1100
199 CRC_Error_Count         -OSRCK   100   100   000    -    0
225 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    20
227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    82
228 Workload_Minutes        -O--CK   100   100   000    -    64012
232 Available_Reservd_Space PO--CK   084   084   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
241 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
242 Host_Reads_32MiB        -O--CK   100   100   000    -    92945
243 NAND_Writes_32MiB       -O--CK   100   100   000    -    95289

Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
seems to be holding steady.
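A trend check along these lines can be sketched as below; the raw attribute 5 and normalized attribute 232 values are lifted from the two snapshots posted in this thread, and the comparison logic is just an illustration:

```shell
#!/bin/sh
# Sketch: is reallocation growth actually eating into the reserve?
# Two readings of attr 5 (raw) and attr 232 (normalized), taken from
# the two smartctl tables in this thread.
realloc_then=756;  realloc_now=944
reserve_then=84;   reserve_now=84

growth=$((realloc_now - realloc_then))
echo "reallocated sectors grew by $growth"
if [ "$reserve_now" -lt "$reserve_then" ]; then
  echo "reserve shrinking - expect write errors eventually"
else
  echo "normalized reserve still at $reserve_now"
fi
```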

AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
death. The drive simply disappeared from the controller one day, and
could no longer be detected.

On 03/08/16 12:15, Jan Schermer wrote:
> Make sure you are reading the right attribute and interpreting it right.
> update-smart-drivedb sometimes works wonders :)
> 
> I wonder what the isdct tool would say the drive's life expectancy is with
> this workload? Are you really writing ~600TB/month??
> 
> Jan
> 




Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
I'm a fool, I miscalculated the writes by a factor of 1000 of course :-)
600GB/month is not much for S36xx at all, must be some sort of defect then...
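For reference, a back-of-envelope version of the corrected arithmetic, assuming the write counter is in 32 MiB units (the usual Intel convention) and taking the raw values from Daniel's original table (attribute 241 = 20131, 1065 power-on hours); the 730-hour month is an assumption:

```shell
#!/bin/sh
# Worked arithmetic for the host-write rate discussed above.
units=20131          # attr 241 raw value, in 32 MiB units
hours=1065           # Power_On_Hours

mib=$((units * 32))
gb=$((mib * 1048576 / 1000000000))     # decimal gigabytes
gb_month=$((gb * 730 / hours))
echo "~${gb} GB written, ~${gb_month} GB/month"
```

Which lands in the hundreds of GB per month: trivial for an S36xx, as said.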

Jan


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Jan Schermer
Make sure you are reading the right attribute and interpreting it right.
update-smart-drivedb sometimes works wonders :)

I wonder what the isdct tool would say the drive's life expectancy is with
this workload? Are you really writing ~600TB/month??

Jan


>> On 03/08/16 07:45, Christian Balzer wrote:
>>> 
>>> Hello,
>>> 
>>> not a Ceph specific issue, but this is probably the largest sample size of
>>> SSD users I'm familiar with. ^o^
>>> 
>>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>>> religious experience.
>>> 
>>> It turns out that the SMART check plugin I run to mostly get an early
>>> wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
>>> 200GB DC S3700 used for journals.
>>> 
>>> While SMART is of the opinion that this drive is failing and will explode
>>> spectacularly any moment, that particular failure is of little worry to
>>> me, never mind that I'll eventually replace this unit.
>>> 
>>> What brings me here is that this is the first time in over 3 years that an
>>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>>> this particular failure has been seen by others.
>>> 
>>> That of course entails people actually monitoring for these things. ^o^
>>> 
>>> Thanks,
>>> 
>>> Christian
>>> 

Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Maxime Guyot
Hi,

I haven’t had problems with Power_Loss_Cap_Test so far. 

Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the
“Available Reserved Space” attribute (SMART ID: 232/E8h); the data sheet
(http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
reads:

"This attribute reports the number of reserve blocks remaining. The
normalized value begins at 100 (64h), which corresponds to 100 percent
availability of the reserved space. The threshold value for this attribute
is 10 percent availability."

According to the SMART data you copied, about 84% of the over-provisioning
should be left. Since the drive is pretty young, it might be some form of
defect?
I have a number of S3610s with ~150 DW; all SMART counters are at their
initial values (except for the temperature).
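Under that reading (normalized value starts at 100, attribute trips at 10), the remaining margin can be sketched as:

```shell
#!/bin/sh
# Sketch of the datasheet interpretation above: normalized attr 232
# starts at 100 and fails at 10; 84 is the value from the posted table.
value=84
thresh=10

echo "about ${value}% of the reserved space remains"
margin=$(( (value - thresh) * 100 / (100 - thresh) ))
echo "~${margin}% of the healthy range left before the attribute trips"
```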

Cheers,
Maxime


Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-03 Thread Daniel Swarbrick
Hi Christian,

Intel drives are good, but apparently not infallible. I'm watching a DC
S3610 480GB die from reallocated sectors.

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL  RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -     756
  9 Power_On_Hours          -O--CK   100   100   000    -     1065
 12 Power_Cycle_Count       -O--CK   100   100   000    -     7
175 Program_Fail_Count_Chip PO--CK   100   100   010    -     17454078318
183 Runtime_Bad_Block       -O--CK   100   100   000    -     0
184 End-to-End_Error        PO--CK   100   100   090    -     0
187 Reported_Uncorrect      -O--CK   100   100   000    -     0
190 Airflow_Temperature_Cel -O---K   070   065   000    -     30 (Min/Max 25/35)
192 Power-Off_Retract_Count -O--CK   100   100   000    -     6
194 Temperature_Celsius     -O---K   100   100   000    -     30
197 Current_Pending_Sector  -O--C-   100   100   000    -     1288
199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -     0
228 Power-off_Retract_Count -O--CK   100   100   000    -     63889
232 Available_Reservd_Space PO--CK   084   084   010    -     0
233 Media_Wearout_Indicator -O--CK   100   100   000    -     0
241 Total_LBAs_Written      -O--CK   100   100   000    -     20131
242 Total_LBAs_Read         -O--CK   100   100   000    -     92945

The Reallocated_Sector_Ct is increasing by about one per minute. I'm not
sure how many reserved sectors the drive has, i.e. how long it will be
before it starts throwing write I/O errors.
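As an aside for anyone wanting to track this without eyeballing smartctl every
minute: the attribute table above can be scraped with a few lines of Python.
This is only a rough sketch (not a polished plugin), and it assumes the column
layout `smartctl -A` printed above; note the normalized VALUE of attribute 232
(Available_Reservd_Space) counts down from 100 towards its threshold, so 084
against a threshold of 010 suggests roughly 16% of the spare area is gone:

```python
import re

# One line per SMART attribute, columns as in the smartctl -A output above:
# ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
ATTR_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flags>\S+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<fail>\S+)\s+(?P<raw>\d+)"
)

def parse_attrs(smartctl_output):
    """Return {attribute_name: (value, thresh, raw)} from smartctl -A text."""
    attrs = {}
    for line in smartctl_output.splitlines():
        m = ATTR_RE.match(line)
        if m:
            attrs[m.group("name")] = (
                int(m.group("value")),
                int(m.group("thresh")),
                int(m.group("raw")),
            )
    return attrs

# Sample lines taken from the table above:
sample = """\
  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -     756
197 Current_Pending_Sector  -O--C-   100   100   000    -     1288
232 Available_Reservd_Space PO--CK   084   084   010    -     0
"""

attrs = parse_attrs(sample)
value, thresh, _ = attrs["Available_Reservd_Space"]
print(value, thresh)                      # 84 10
print(attrs["Reallocated_Sector_Ct"][2])  # raw reallocated count: 756
```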

It's a very young drive, with only 1065 hours on the clock, and has not
even done two full drive-writes:

Device Statistics (GP Log 0x04)
Page Offset Size       Value  Description
  1  =====  =  =============  == General Statistics (rev 2) ==
  1  0x008  4             47  Lifetime Power-On Resets
  1  0x018  6     1319318736  Logical Sectors Written
  1  0x020  6      137121729  Number of Write Commands
  1  0x028  6     6091245600  Logical Sectors Read
  1  0x030  6      115252407  Number of Read Commands
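For the record, the "not even two full drive-writes" claim checks out if you do
the arithmetic on the statistics above (assuming 512-byte logical sectors and
the marketed decimal 480GB capacity):

```python
# Numbers from the GP log device statistics above:
sectors_written = 1319318736            # Logical Sectors Written
bytes_written = sectors_written * 512   # 512-byte logical sectors assumed
capacity_bytes = 480 * 1000**3          # 480 GB, decimal, as marketed

drive_writes = bytes_written / capacity_bytes
print(round(bytes_written / 1000**3, 1))  # ~675.5 GB written so far
print(round(drive_writes, 2))             # ~1.41 full drive-writes
```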

Fortunately this drive is not used as a Ceph journal. It's in a mdraid
RAID5 array :-|

Cheers,
Daniel

On 03/08/16 07:45, Christian Balzer wrote:
> 
> Hello,
> 
> not a Ceph specific issue, but this is probably the largest sample size of
> SSD users I'm familiar with. ^o^
> 
> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
> religious experience.
> 
> It turns out that the SMART check plugin I run to mostly get an early
> wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
> 200GB DC S3700 used for journals.
> 
> While SMART is of the opinion that this drive is failing and will explode
> spectacularly any moment, that particular failure is of little worry to
> me, never mind that I'll eventually replace this unit.
> 
> What brings me here is that this is the first time in over 3 years that an
> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
> this particular failure has been seen by others.
> 
> That of course entails people actually monitoring for these things. ^o^
> 
> Thanks,
> 
> Christian
> 




[ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure

2016-08-02 Thread Christian Balzer

Hello,

not a Ceph specific issue, but this is probably the largest sample size of
SSD users I'm familiar with. ^o^

This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
religious experience.

It turns out that the SMART check plugin I run to mostly get an early
wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
200GB DC S3700 used for journals.

While SMART is of the opinion that this drive is failing and will explode
spectacularly any moment, that particular failure is of little worry to
me, never mind that I'll eventually replace this unit.

What brings me here is that this is the first time in over 3 years that an
Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
this particular failure has been seen by others.

That of course entails people actually monitoring for these things. ^o^
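For anyone not already doing so, even a minimal cron-able check will catch
such failures early. A rough sketch, assuming smartmontools is installed and
the device path is adjusted to taste: `smartctl -A` prints the attribute
table, and any Pre-fail/Old_age attribute at or below its threshold shows up
with FAILING_NOW (or In_the_past) in the WHEN_FAILED column:

```python
import subprocess

def failing_attributes(smartctl_a_output):
    """Return attribute lines that smartctl flags as failed (now or in the past)."""
    return [line for line in smartctl_a_output.splitlines()
            if "FAILING_NOW" in line or "In_the_past" in line]

def check_device(dev):
    # Dump the vendor attribute table for one device and filter it.
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    return failing_attributes(out)

# Example against the output posted earlier for the failed journal SSD:
sample = ("175 Power_Loss_Cap_Test 0x0033   001   001   010    Pre-fail"
          "  Always   FAILING_NOW 1 (47 942)")
print(failing_attributes(sample))  # -> the Power_Loss_Cap_Test line
```

Hooking the result into Nagios is then just a matter of exiting non-zero (and
printing the offending lines) when the list is non-empty.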

Thanks,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/