Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure
Hello,

As a follow-up, conclusion and dire warning to all who happen to encounter this failure mode:

The server with that failed power-loss-capacitor SSD had a religious experience 2 days ago and needed a power cycle to revive it. Now in theory the data should have been safe, as the drive had minutes to scribble away its cache. Alas, what happened is that the SSD bricked itself; it's no longer accessible and the only meaningful output from "smartctl -a" is:

"SMART overall-health self-assessment test result: FAILED!"

I'm trying to think of a failure mode where the capacitor would cause something like this and am coming up blank, so my theories at this time are:

1. Something more substantial was failing and the error was a symptom, not the cause.

2. Intel's "we won't let you deal with potentially broken data" rule strikes again (they brick SSDs that reach max wear-out levels) and a failed power cap triggers such a rule.

Either way, if you ever encounter this problem, get a replacement ASAP, and if the drive is used as a journal SSD, shut down all associated OSDs, flush the journals and replace it.

Christian

On Wed, 3 Aug 2016 21:15:22 +0900 Christian Balzer wrote:

> Hello,
>
> On Wed, 3 Aug 2016 13:42:50 +0200 Jan Schermer wrote:
>
> > Christian, can you post your values for Power_Loss_Cap_Test on the drive
> > which is failing?
>
> Sure:
> ---
> 175 Power_Loss_Cap_Test     0x0033   001   001   010   Pre-fail  Always
>     FAILING_NOW  1 (47 942)
> ---
>
> Now according to the Intel data sheet that value of 1 means failed, NOT
> the actual buffer time it usually means, like this on the neighboring SSD:
> ---
> 175 Power_Loss_Cap_Test     0x0033   100   100   010   Pre-fail  Always
>     -            614 (47 944)
> ---
>
> And my 800GB DC S3610s have more than 10 times the endurance; my guess is
> a combo of larger cache and slower writes:
> ---
> 175 Power_Loss_Cap_Test     0x0033   100   100   010   Pre-fail  Always
>     -            8390 (22 7948)
> ---
>
> I'll definitely leave that "failing" SSD in place until it has done the
> next self-check.
>
> Christian
>
> > Thanks
> > Jan
> >
> > > On 03 Aug 2016, at 13:33, Christian Balzer wrote:
> > >
> > > Hello,
> > >
> > > yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
> > > seemed to be such an odd thing to fail (given that's not a single
> > > capacitor).
> > >
> > > As for your Reallocated_Sector_Ct, that's really odd and definitely an
> > > RMA-worthy issue.
> > >
> > > For the record, Intel SSDs use (typically 24) sectors when doing firmware
> > > upgrades, so this is a totally healthy 3610. ^o^
> > > ---
> > > 5 Reallocated_Sector_Ct   0x0032   099   099   000   Old_age   Always
> > >     -            24
> > > ---
> > >
> > > Christian
> > >
> > > On Wed, 3 Aug 2016 13:12:53 +0200 Daniel Swarbrick wrote:
> > >
> > >> Right, I actually updated to smartmontools 6.5+svn4324, which now
> > >> properly supports this drive model.
> > >> Some of the smart attr names have
> > >> changed, and make more sense now (and there are no more "Unknowns"):
> > >>
> > >> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> > >>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    944
> > >>   9 Power_On_Hours          -O--CK   100   100   000    -    1067
> > >>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
> > >> 170 Available_Reservd_Space PO--CK   085   085   010    -    0
> > >> 171 Program_Fail_Count      -O--CK   100   100   000    -    0
> > >> 172 Erase_Fail_Count        -O--CK   100   100   000    -    68
> > >> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> > >> 175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    6510 (4 4307)
> > >> 183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
> > >> 184 End-to-End_Error        PO--CK   100   100   090    -    0
> > >> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> > >> 190 Temperature_Case        -O---K   070   065   000    -    30 (Min/Max 25/35)
> > >> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> > >> 194 Temperature_Internal    -O---K   100   100   000    -    30
> > >> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1100
> > >> 199 CRC_Error_Count         -OSRCK   100   100   000    -    0
> > >> 225 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> > >> 226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    20
> > >> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    82
> > >> 228 Workload_Minutes        -O--CK   100   100   000    -    64012
> > >> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
> > >> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
> > >> 234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
> > >> 241 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> > >> 242 Host_Reads_32MiB        -O--CK   100   100   000    -    92945
Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure
Hello,

On Wed, 3 Aug 2016 13:42:50 +0200 Jan Schermer wrote:

> Christian, can you post your values for Power_Loss_Cap_Test on the drive
> which is failing?

Sure:
---
175 Power_Loss_Cap_Test     0x0033   001   001   010   Pre-fail  Always
    FAILING_NOW  1 (47 942)
---

Now according to the Intel data sheet that value of 1 means failed, NOT
the actual buffer time it usually means, like this on the neighboring SSD:
---
175 Power_Loss_Cap_Test     0x0033   100   100   010   Pre-fail  Always
    -            614 (47 944)
---

And my 800GB DC S3610s have more than 10 times the endurance; my guess is
a combo of larger cache and slower writes:
---
175 Power_Loss_Cap_Test     0x0033   100   100   010   Pre-fail  Always
    -            8390 (22 7948)
---

I'll definitely leave that "failing" SSD in place until it has done the
next self-check.

Christian

> Thanks
> Jan
>
> > On 03 Aug 2016, at 13:33, Christian Balzer wrote:
> >
> > Hello,
> >
> > yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
> > seemed to be such an odd thing to fail (given that's not a single capacitor).
> >
> > As for your Reallocated_Sector_Ct, that's really odd and definitely an
> > RMA-worthy issue.
> >
> > For the record, Intel SSDs use (typically 24) sectors when doing firmware
> > upgrades, so this is a totally healthy 3610. ^o^
> > ---
> > 5 Reallocated_Sector_Ct   0x0032   099   099   000   Old_age   Always
> >     -            24
> > ---
> >
> > Christian
> >
> > On Wed, 3 Aug 2016 13:12:53 +0200 Daniel Swarbrick wrote:
> >
> >> Right, I actually updated to smartmontools 6.5+svn4324, which now
> >> properly supports this drive model.
> >> Some of the smart attr names have
> >> changed, and make more sense now (and there are no more "Unknowns"):
> >>
> >> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> >>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    944
> >>   9 Power_On_Hours          -O--CK   100   100   000    -    1067
> >>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
> >> 170 Available_Reservd_Space PO--CK   085   085   010    -    0
> >> 171 Program_Fail_Count      -O--CK   100   100   000    -    0
> >> 172 Erase_Fail_Count        -O--CK   100   100   000    -    68
> >> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> >> 175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    6510 (4 4307)
> >> 183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
> >> 184 End-to-End_Error        PO--CK   100   100   090    -    0
> >> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> >> 190 Temperature_Case        -O---K   070   065   000    -    30 (Min/Max 25/35)
> >> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> >> 194 Temperature_Internal    -O---K   100   100   000    -    30
> >> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1100
> >> 199 CRC_Error_Count         -OSRCK   100   100   000    -    0
> >> 225 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> >> 226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    20
> >> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    82
> >> 228 Workload_Minutes        -O--CK   100   100   000    -    64012
> >> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
> >> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
> >> 234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
> >> 241 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> >> 242 Host_Reads_32MiB        -O--CK   100   100   000    -    92945
> >> 243 NAND_Writes_32MiB       -O--CK   100   100   000    -    95289
> >>
> >> Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
> >> seems to be holding steady.
> >>
> >> AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
> >> death. The drive simply disappeared from the controller one day, and
> >> could no longer be detected.
> >>
> >> On 03/08/16 12:15, Jan Schermer wrote:
> >>> Make sure you are reading the right attribute and interpreting it right.
> >>> update-smart-drivedb sometimes makes wonders :)
> >>>
> >>> I wonder what the isdct tool would say the drive's life expectancy is with
> >>> this workload? Are you really writing ~600TB/month??
> >>>
> >>> Jan
> >>>
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
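Christian's reading of attribute 175 above can be sketched as a small helper. This is only an illustration of the behavior he describes in the message (a raw value of 1 means the capacitor self-test failed outright; any larger raw value is the measured buffer time from the last test, whose units the thread does not state), not an official Intel decoding:

```python
# Illustrative interpretation of SMART attribute 175 (Power_Loss_Cap_Test)
# on Intel DC-series SSDs, per the raw-value semantics described above.

def interpret_power_loss_cap_test(raw_value: int) -> str:
    """Classify attribute 175's raw value as described in the thread."""
    if raw_value == 1:
        # Per the Intel data sheet reading above: 1 means the test failed,
        # not a 1-unit buffer time.
        return "FAILED"
    return f"OK (measured buffer time: {raw_value})"

print(interpret_power_loss_cap_test(1))     # the failing 200GB DC S3700
print(interpret_power_loss_cap_test(614))   # the healthy neighboring SSD
print(interpret_power_loss_cap_test(8390))  # the 800GB DC S3610
```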
Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure
Christian, can you post your values for Power_Loss_Cap_Test on the drive
which is failing?

Thanks
Jan

> On 03 Aug 2016, at 13:33, Christian Balzer wrote:
>
> Hello,
>
> yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
> seemed to be such an odd thing to fail (given that's not a single capacitor).
>
> As for your Reallocated_Sector_Ct, that's really odd and definitely an
> RMA-worthy issue.
>
> For the record, Intel SSDs use (typically 24) sectors when doing firmware
> upgrades, so this is a totally healthy 3610. ^o^
> ---
> 5 Reallocated_Sector_Ct   0x0032   099   099   000   Old_age   Always
>     -            24
> ---
>
> Christian
>
> On Wed, 3 Aug 2016 13:12:53 +0200 Daniel Swarbrick wrote:
>
>> Right, I actually updated to smartmontools 6.5+svn4324, which now
>> properly supports this drive model. Some of the smart attr names have
>> changed, and make more sense now (and there are no more "Unknowns"):
>>
>> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    944
>>   9 Power_On_Hours          -O--CK   100   100   000    -    1067
>>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
>> 170 Available_Reservd_Space PO--CK   085   085   010    -    0
>> 171 Program_Fail_Count      -O--CK   100   100   000    -    0
>> 172 Erase_Fail_Count        -O--CK   100   100   000    -    68
>> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
>> 175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    6510 (4 4307)
>> 183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
>> 184 End-to-End_Error        PO--CK   100   100   090    -    0
>> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
>> 190 Temperature_Case        -O---K   070   065   000    -    30 (Min/Max 25/35)
>> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
>> 194 Temperature_Internal    -O---K   100   100   000    -    30
>> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1100
>> 199 CRC_Error_Count         -OSRCK   100   100   000    -    0
>> 225 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
>> 226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    20
>> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    82
>> 228 Workload_Minutes        -O--CK   100   100   000    -    64012
>> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
>> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
>> 234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
>> 241 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
>> 242 Host_Reads_32MiB        -O--CK   100   100   000    -    92945
>> 243 NAND_Writes_32MiB       -O--CK   100   100   000    -    95289
>>
>> Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
>> seems to be holding steady.
>>
>> AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
>> death. The drive simply disappeared from the controller one day, and
>> could no longer be detected.
>>
>> On 03/08/16 12:15, Jan Schermer wrote:
>>> Make sure you are reading the right attribute and interpreting it right.
>>> update-smart-drivedb sometimes makes wonders :)
>>>
>>> I wonder what the isdct tool would say the drive's life expectancy is with
>>> this workload? Are you really writing ~600TB/month??
>>>
>>> Jan
>>>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure
Hello,

yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
seemed to be such an odd thing to fail (given that's not a single capacitor).

As for your Reallocated_Sector_Ct, that's really odd and definitely an
RMA-worthy issue.

For the record, Intel SSDs use (typically 24) sectors when doing firmware
upgrades, so this is a totally healthy 3610. ^o^
---
5 Reallocated_Sector_Ct   0x0032   099   099   000   Old_age   Always
    -            24
---

Christian

On Wed, 3 Aug 2016 13:12:53 +0200 Daniel Swarbrick wrote:

> Right, I actually updated to smartmontools 6.5+svn4324, which now
> properly supports this drive model. Some of the smart attr names have
> changed, and make more sense now (and there are no more "Unknowns"):
>
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    944
>   9 Power_On_Hours          -O--CK   100   100   000    -    1067
>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
> 170 Available_Reservd_Space PO--CK   085   085   010    -    0
> 171 Program_Fail_Count      -O--CK   100   100   000    -    0
> 172 Erase_Fail_Count        -O--CK   100   100   000    -    68
> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> 175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    6510 (4 4307)
> 183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
> 184 End-to-End_Error        PO--CK   100   100   090    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 190 Temperature_Case        -O---K   070   065   000    -    30 (Min/Max 25/35)
> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> 194 Temperature_Internal    -O---K   100   100   000    -    30
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1100
> 199 CRC_Error_Count         -OSRCK   100   100   000    -    0
> 225 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> 226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    20
> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    82
> 228 Workload_Minutes        -O--CK   100   100   000    -    64012
> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
> 234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
> 241 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> 242 Host_Reads_32MiB        -O--CK   100   100   000    -    92945
> 243 NAND_Writes_32MiB       -O--CK   100   100   000    -    95289
>
> Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
> seems to be holding steady.
>
> AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
> death. The drive simply disappeared from the controller one day, and
> could no longer be detected.
>
> On 03/08/16 12:15, Jan Schermer wrote:
> > Make sure you are reading the right attribute and interpreting it right.
> > update-smart-drivedb sometimes makes wonders :)
> >
> > I wonder what the isdct tool would say the drive's life expectancy is with
> > this workload? Are you really writing ~600TB/month??
> >
> > Jan
> >

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure
Right, I actually updated to smartmontools 6.5+svn4324, which now
properly supports this drive model. Some of the smart attr names have
changed, and make more sense now (and there are no more "Unknowns"):

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    944
  9 Power_On_Hours          -O--CK   100   100   000    -    1067
 12 Power_Cycle_Count       -O--CK   100   100   000    -    7
170 Available_Reservd_Space PO--CK   085   085   010    -    0
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    68
174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    6510 (4 4307)
183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Temperature_Case        -O---K   070   065   000    -    30 (Min/Max 25/35)
192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
194 Temperature_Internal    -O---K   100   100   000    -    30
197 Current_Pending_Sector  -O--C-   100   100   000    -    1100
199 CRC_Error_Count         -OSRCK   100   100   000    -    0
225 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    20
227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    82
228 Workload_Minutes        -O--CK   100   100   000    -    64012
232 Available_Reservd_Space PO--CK   084   084   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
241 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
242 Host_Reads_32MiB        -O--CK   100   100   000    -    92945
243 NAND_Writes_32MiB       -O--CK   100   100   000    -    95289

Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
seems to be holding steady.

AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
death. The drive simply disappeared from the controller one day, and
could no longer be detected.

On 03/08/16 12:15, Jan Schermer wrote:
> Make sure you are reading the right attribute and interpreting it right.
> update-smart-drivedb sometimes makes wonders :)
>
> I wonder what the isdct tool would say the drive's life expectancy is with
> this workload? Are you really writing ~600TB/month??
>
> Jan
>
Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure
I'm a fool, I miscalculated the writes by a factor of 1000, of course :-)
600GB/month is not much for an S36xx at all; must be some sort of defect then...

Jan

> On 03 Aug 2016, at 12:15, Jan Schermer wrote:
>
> Make sure you are reading the right attribute and interpreting it right.
> update-smart-drivedb sometimes makes wonders :)
>
> I wonder what the isdct tool would say the drive's life expectancy is with
> this workload? Are you really writing ~600TB/month??
>
> Jan
>
>> On 03 Aug 2016, at 12:06, Maxime Guyot wrote:
>>
>> Hi,
>>
>> I haven’t had problems with Power_Loss_Cap_Test so far.
>>
>> Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the
>> “Available Reserved Space” (SMART ID: 232/E8h); the data sheet
>> (http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
>> reads:
>>
>> "This attribute reports the number of reserve blocks remaining. The
>> normalized value begins at 100 (64h), which corresponds to 100 percent
>> availability of the reserved space. The threshold value for this
>> attribute is 10 percent availability."
>>
>> According to the SMART data you copied, it should be about 84% of the
>> over-provisioning left? Since the drive is pretty young, it might be some
>> form of defect?
>>
>> I have a number of S3610s with ~150 DW; all SMART counters are at their
>> initial values (except for the temperature).
>>
>> Cheers,
>> Maxime
>>
>> On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick"
>> <daniel.swarbr...@profitbricks.com> wrote:
>>
>>> Hi Christian,
>>>
>>> Intel drives are good, but apparently not infallible. I'm watching a DC
>>> S3610 480GB die from reallocated sectors.
>>>
>>> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>>>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
>>>   9 Power_On_Hours          -O--CK   100   100   000    -    1065
>>>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
>>> 175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
>>> 183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
>>> 184 End-to-End_Error        PO--CK   100   100   090    -    0
>>> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
>>> 190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
>>> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
>>> 194 Temperature_Celsius     -O---K   100   100   000    -    30
>>> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
>>> 199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
>>> 228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
>>> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
>>> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
>>> 241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
>>> 242 Total_LBAs_Read         -O--CK   100   100   000    -    92945
>>>
>>> The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>>> sure how many reserved sectors the drive has, i.e., how soon before it
>>> starts throwing write IO errors.
>>>
>>> It's a very young drive, with only 1065 hours on the clock, and has not
>>> even done two full drive-writes:
>>>
>>> Device Statistics (GP Log 0x04)
>>> Page Offset Size       Value  Description
>>>   1  =====  =              =  == General Statistics (rev 2) ==
>>>   1  0x008  4              7  Lifetime Power-On Resets
>>>   1  0x018  6     1319318736  Logical Sectors Written
>>>   1  0x020  6      137121729  Number of Write Commands
>>>   1  0x028  6     6091245600  Logical Sectors Read
>>>   1  0x030  6      115252407  Number of Read Commands
>>>
>>> Fortunately this drive is not used as a Ceph journal. It's in an mdraid
>>> RAID5 array :-|
>>>
>>> Cheers,
>>> Daniel
>>>
>>> On 03/08/16 07:45, Christian Balzer wrote:
>>>>
>>>> Hello,
>>>>
>>>> not a Ceph specific issue, but this is probably the largest sample size of
>>>> SSD users I'm familiar with. ^o^
>>>>
>>>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>>>> religious experience.
>>>>
>>>> It turns out that the SMART check plugin I run to mostly get an early
>>>> wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
>>>> 200GB DC S3700 used for journals.
>>>>
>>>> While SMART is of the opinion that this drive is failing and will explode
>>>> spectacularly any moment, that particular failure is of little worry to
>>>> me, never mind that I'll eventually replace this unit.
>>>>
>>>> What brings me here is that this is the first time in over 3 years that an
>>>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>>>> this particular failure has been seen by others.
>>>>
>>>> That of course entails people actually monitoring for these things. ^o^
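Jan's factor-of-1000 slip is easy to make because Host_Writes_32MiB counts 32 MiB units. A quick back-of-the-envelope check against the raw values from Daniel's SMART dump earlier in the thread (Host_Writes_32MiB = 20135, Power_On_Hours = 1067) lands in the hundreds of GB per month, the same order of magnitude as the corrected ~600 GB/month figure:

```python
# Rough write-rate sanity check from the SMART counters quoted above.
# Host_Writes_32MiB counts 32 MiB units, so a raw value of 20135 is
# roughly 630 GiB total written, not 630 TiB.

MIB = 1024 ** 2

host_writes_32mib = 20135  # raw value of attribute 225/241 above
power_on_hours = 1067      # raw value of attribute 9 above

total_bytes = host_writes_32mib * 32 * MIB
total_gib = total_bytes / 1024 ** 3
months = power_on_hours / (30 * 24)  # approximate, 30-day months
print(f"total written: {total_gib:.0f} GiB over {months:.1f} months")
print(f"rate: {total_gib / months:.0f} GiB/month")
```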
Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure
Make sure you are reading the right attribute and interpreting it right.
update-smart-drivedb sometimes makes wonders :)

I wonder what the isdct tool would say the drive's life expectancy is with
this workload? Are you really writing ~600TB/month??

Jan

> On 03 Aug 2016, at 12:06, Maxime Guyot wrote:
>
> Hi,
>
> I haven’t had problems with Power_Loss_Cap_Test so far.
>
> Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the
> “Available Reserved Space” (SMART ID: 232/E8h); the data sheet
> (http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
> reads:
>
> "This attribute reports the number of reserve blocks remaining. The
> normalized value begins at 100 (64h), which corresponds to 100 percent
> availability of the reserved space. The threshold value for this
> attribute is 10 percent availability."
>
> According to the SMART data you copied, it should be about 84% of the
> over-provisioning left? Since the drive is pretty young, it might be some
> form of defect?
>
> I have a number of S3610s with ~150 DW; all SMART counters are at their
> initial values (except for the temperature).
>
> Cheers,
> Maxime
>
> On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick"
> <daniel.swarbr...@profitbricks.com> wrote:
>
>> Hi Christian,
>>
>> Intel drives are good, but apparently not infallible. I'm watching a DC
>> S3610 480GB die from reallocated sectors.
>>
>> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
>>   9 Power_On_Hours          -O--CK   100   100   000    -    1065
>>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
>> 175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
>> 183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
>> 184 End-to-End_Error        PO--CK   100   100   090    -    0
>> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
>> 190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
>> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
>> 194 Temperature_Celsius     -O---K   100   100   000    -    30
>> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
>> 199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
>> 228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
>> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
>> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
>> 241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
>> 242 Total_LBAs_Read         -O--CK   100   100   000    -    92945
>>
>> The Reallocated_Sector_Ct is increasing about once a minute. I'm not
>> sure how many reserved sectors the drive has, i.e., how soon before it
>> starts throwing write IO errors.
>>
>> It's a very young drive, with only 1065 hours on the clock, and has not
>> even done two full drive-writes:
>>
>> Device Statistics (GP Log 0x04)
>> Page Offset Size       Value  Description
>>   1  =====  =              =  == General Statistics (rev 2) ==
>>   1  0x008  4              7  Lifetime Power-On Resets
>>   1  0x018  6     1319318736  Logical Sectors Written
>>   1  0x020  6      137121729  Number of Write Commands
>>   1  0x028  6     6091245600  Logical Sectors Read
>>   1  0x030  6      115252407  Number of Read Commands
>>
>> Fortunately this drive is not used as a Ceph journal. It's in an mdraid
>> RAID5 array :-|
>>
>> Cheers,
>> Daniel
>>
>> On 03/08/16 07:45, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>> not a Ceph specific issue, but this is probably the largest sample size of
>>> SSD users I'm familiar with. ^o^
>>>
>>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>>> religious experience.
>>>
>>> It turns out that the SMART check plugin I run to mostly get an early
>>> wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
>>> 200GB DC S3700 used for journals.
>>>
>>> While SMART is of the opinion that this drive is failing and will explode
>>> spectacularly any moment, that particular failure is of little worry to
>>> me, never mind that I'll eventually replace this unit.
>>>
>>> What brings me here is that this is the first time in over 3 years that an
>>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>>> this particular failure has been seen by others.
>>>
>>> That of course entails people actually monitoring for these things. ^o^
>>>
>>> Thanks,
>>>
>>> Christian
Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure
Hi,

I haven’t had problems with Power_Loss_Cap_Test so far.

Regarding Reallocated_Sector_Ct (SMART ID: 5/05h), you can check the
“Available Reserved Space” (SMART ID: 232/E8h); the data sheet
(http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3610-spec.pdf)
reads:

"This attribute reports the number of reserve blocks remaining. The
normalized value begins at 100 (64h), which corresponds to 100 percent
availability of the reserved space. The threshold value for this
attribute is 10 percent availability."

According to the SMART data you copied, it should be about 84% of the
over-provisioning left? Since the drive is pretty young, it might be some
form of defect?

I have a number of S3610s with ~150 DW; all SMART counters are at their
initial values (except for the temperature).

Cheers,
Maxime

On 03/08/16 11:12, "ceph-users on behalf of Daniel Swarbrick"
<daniel.swarbr...@profitbricks.com> wrote:

> Hi Christian,
>
> Intel drives are good, but apparently not infallible. I'm watching a DC
> S3610 480GB die from reallocated sectors.
>
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
>   9 Power_On_Hours          -O--CK   100   100   000    -    1065
>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
> 175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
> 183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
> 184 End-to-End_Error        PO--CK   100   100   090    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
> 192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
> 194 Temperature_Celsius     -O---K   100   100   000    -    30
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
> 199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
> 228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
> 241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
> 242 Total_LBAs_Read         -O--CK   100   100   000    -    92945
>
> The Reallocated_Sector_Ct is increasing about once a minute. I'm not
> sure how many reserved sectors the drive has, i.e., how soon before it
> starts throwing write IO errors.
>
> It's a very young drive, with only 1065 hours on the clock, and has not
> even done two full drive-writes:
>
> Device Statistics (GP Log 0x04)
> Page Offset Size       Value  Description
>   1  =====  =              =  == General Statistics (rev 2) ==
>   1  0x008  4              7  Lifetime Power-On Resets
>   1  0x018  6     1319318736  Logical Sectors Written
>   1  0x020  6      137121729  Number of Write Commands
>   1  0x028  6     6091245600  Logical Sectors Read
>   1  0x030  6      115252407  Number of Read Commands
>
> Fortunately this drive is not used as a Ceph journal. It's in an mdraid
> RAID5 array :-|
>
> Cheers,
> Daniel
>
> On 03/08/16 07:45, Christian Balzer wrote:
>>
>> Hello,
>>
>> not a Ceph specific issue, but this is probably the largest sample size of
>> SSD users I'm familiar with. ^o^
>>
>> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
>> religious experience.
>>
>> It turns out that the SMART check plugin I run to mostly get an early
>> wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
>> 200GB DC S3700 used for journals.
>>
>> While SMART is of the opinion that this drive is failing and will explode
>> spectacularly any moment, that particular failure is of little worry to
>> me, never mind that I'll eventually replace this unit.
>>
>> What brings me here is that this is the first time in over 3 years that an
>> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
>> this particular failure has been seen by others.
>>
>> That of course entails people actually monitoring for these things. ^o^
>>
>> Thanks,
>>
>> Christian
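Maxime's reading of the data-sheet excerpt maps directly onto the SMART VALUE column. A minimal sketch of that interpretation (the 84 comes from the Available_Reservd_Space line in Daniel's dump; the threshold of 10 is from the quoted data sheet):

```python
# Interpreting Available_Reservd_Space (SMART ID 232/E8h) per the data
# sheet excerpt above: the normalized VALUE starts at 100 (all reserve
# blocks available) and the attribute trips when it falls to the threshold.

def reserve_left(normalized_value, threshold=10):
    """Return (percent of reserved blocks remaining, above-threshold flag)."""
    return normalized_value, normalized_value > threshold

percent, ok = reserve_left(84)  # VALUE column of the failing S3610 above
print(f"~{percent}% of over-provisioning left; above threshold: {ok}")
```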
Re: [ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure
Hi Christian,

Intel drives are good, but apparently not infallible. I'm watching a DC
S3610 480GB die from reallocated sectors.

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
  9 Power_On_Hours          -O--CK   100   100   000    -    1065
 12 Power_Cycle_Count       -O--CK   100   100   000    -    7
175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
194 Temperature_Celsius     -O---K   100   100   000    -    30
197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
232 Available_Reservd_Space PO--CK   084   084   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
242 Total_LBAs_Read         -O--CK   100   100   000    -    92945

The Reallocated_Sector_Ct is increasing about once a minute. I'm not
sure how many reserved sectors the drive has, i.e., how soon before it
starts throwing write IO errors.

It's a very young drive, with only 1065 hours on the clock, and has not
even done two full drive-writes:

Device Statistics (GP Log 0x04)
Page Offset Size       Value  Description
  1  =====  =              =  == General Statistics (rev 2) ==
  1  0x008  4              7  Lifetime Power-On Resets
  1  0x018  6     1319318736  Logical Sectors Written
  1  0x020  6      137121729  Number of Write Commands
  1  0x028  6     6091245600  Logical Sectors Read
  1  0x030  6      115252407  Number of Read Commands

Fortunately this drive is not used as a Ceph journal. It's in an mdraid
RAID5 array :-|

Cheers,
Daniel

On 03/08/16 07:45, Christian Balzer wrote:
>
> Hello,
>
> not a Ceph specific issue, but this is probably the largest sample size of
> SSD users I'm familiar with. ^o^
>
> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
> religious experience.
>
> It turns out that the SMART check plugin I run to mostly get an early
> wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
> 200GB DC S3700 used for journals.
>
> While SMART is of the opinion that this drive is failing and will explode
> spectacularly any moment, that particular failure is of little worry to
> me, never mind that I'll eventually replace this unit.
>
> What brings me here is that this is the first time in over 3 years that an
> Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
> this particular failure has been seen by others.
>
> That of course entails people actually monitoring for these things. ^o^
>
> Thanks,
>
> Christian
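Daniel's "not even done two full drive-writes" claim can be cross-checked from the GP log statistics he quotes, assuming the conventional 512-byte logical sector size:

```python
# Cross-checking "not even two full drive-writes" from the Device
# Statistics (GP Log 0x04) figures above. Assumes 512-byte logical sectors.

SECTOR_BYTES = 512
logical_sectors_written = 1319318736  # from the Device Statistics log above
drive_capacity_bytes = 480 * 1000 ** 3  # 480 GB DC S3610

bytes_written = logical_sectors_written * SECTOR_BYTES
drive_writes = bytes_written / drive_capacity_bytes
print(f"{bytes_written / 1000**3:.0f} GB written, "
      f"{drive_writes:.2f} full drive writes")
```

This works out to roughly 1.4 drive writes, consistent with the message.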
[ceph-users] Intel SSD (DC S3700) Power_Loss_Cap_Test failure
Hello,

not a Ceph specific issue, but this is probably the largest sample size of
SSD users I'm familiar with. ^o^

This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having a
religious experience.

It turns out that the SMART check plugin I run to mostly get an early
wearout warning detected a "Power_Loss_Cap_Test" failure in one of the
200GB DC S3700 used for journals.

While SMART is of the opinion that this drive is failing and will explode
spectacularly any moment, that particular failure is of little worry to
me, never mind that I'll eventually replace this unit.

What brings me here is that this is the first time in over 3 years that an
Intel SSD has shown a (harmless in this case) problem, so I'm wondering if
this particular failure has been seen by others.

That of course entails people actually monitoring for these things. ^o^

Thanks,

Christian

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
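For anyone curious about the gist of such monitoring, here is a minimal sketch of the kind of check a SMART plugin performs: scanning `smartctl -A` attribute output for entries whose WHEN_FAILED column reads FAILING_NOW. The parsing below is illustrative only, not the actual plugin Christian runs; the sample lines are the attributes quoted earlier in this thread:

```python
# Minimal sketch: find SMART attributes flagged FAILING_NOW in the
# attribute table printed by `smartctl -A`. Illustrative parsing only.

def failing_attributes(smartctl_output: str) -> list:
    """Return names of attributes whose WHEN_FAILED column says FAILING_NOW."""
    failing = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        if "FAILING_NOW" in fields and len(fields) > 1:
            failing.append(fields[1])  # attribute name column
    return failing

sample = """\
175 Power_Loss_Cap_Test 0x0033 001 001 010 Pre-fail Always FAILING_NOW 1 (47 942)
  5 Reallocated_Sector_Ct 0x0032 099 099 000 Old_age Always - 24
"""
print(failing_attributes(sample))  # → ['Power_Loss_Cap_Test']
```

In practice one would feed this the output of `subprocess.run(["smartctl", "-A", "/dev/sdX"], ...)` and page an operator on any non-empty result.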