Hello, as a follow-up, conclusion and dire warning to all who happen to encounter this failure mode:
The server with that failed power-loss-capacitor SSD had a religious experience 2 days ago and needed a power cycle to revive it. Now in theory the data should have been safe, as the drive had minutes to scribble away its cache. Alas, what happened is that the SSD bricked itself: it is no longer accessible, and the only meaningful output from "smartctl -a" is:

"SMART overall-health self-assessment test result: FAILED!"

I'm trying to think of a failure mode where the capacitor would cause something like this and am coming up blank, so my theories at this time are:

1. Something more substantial was failing and the capacitor error was a symptom, not the cause.

2. Intel's "we won't let you deal with potentially broken data" rule strikes again (they brick SSDs that reach maximum wear-out levels) and a failed power cap triggers the same rule.

Either way, if you ever encounter this problem, get a replacement ASAP. If the drive is used as a journal SSD, shut down all associated OSDs, flush the journals and replace it.

Christian

On Wed, 3 Aug 2016 21:15:22 +0900 Christian Balzer wrote:

> Hello,
> 
> On Wed, 3 Aug 2016 13:42:50 +0200 Jan Schermer wrote:
> 
> > Christian, can you post your values for Power_Loss_Cap_Test on the drive
> > which is failing?
> 
> Sure:
> ---
> 175 Power_Loss_Cap_Test     0x0033   001   001   010    Pre-fail  Always   FAILING_NOW 1 (47 942)
> ---
> 
> Now according to the Intel data sheet that value of 1 means failed, NOT
> the actual buffer time it usually means, like this on the neighboring SSD:
> ---
> 175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always   -       614 (47 944)
> ---
> 
> And my 800GB DC S3610s have more than 10 times the endurance; my guess is
> a combo of larger cache and slower writes:
> ---
> 175 Power_Loss_Cap_Test     0x0033   100   100   010    Pre-fail  Always   -       8390 (22 7948)
> ---
> 
> I'll definitely leave that "failing" SSD in place until it has done the
> next self-check.
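For anyone who wants to script a check for this condition rather than eyeball it: a minimal sketch that parses the attribute-175 row from the "smartctl -A" table layout shown above and flags the drive when the normalized VALUE has dropped to or below THRESH (which is exactly the 001 vs. 010 situation here). The function name and the assumption that attribute 175 appears at most once are mine, not part of smartmontools.

```python
# Sketch: detect a failing Power_Loss_Cap_Test (SMART attribute 175) in
# "smartctl -A" output. Assumed column layout, as in the tables above:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW

def cap_test_failing(smart_output: str) -> bool:
    """Return True if attribute 175 is at or below its failure threshold."""
    for line in smart_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "175":
            value = int(fields[3])    # current normalized value, e.g. 001
            thresh = int(fields[5])   # failure threshold, e.g. 010
            return value <= thresh    # 1 <= 10 -> the drive is failing
    return False  # attribute not present on this drive model

failing = "175 Power_Loss_Cap_Test 0x0033 001 001 010 Pre-fail Always FAILING_NOW 1 (47 942)"
healthy = "175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 614 (47 944)"
```

Hooked into a cron job or monitoring check, this catches the degradation well before the drive bricks itself the way the one above did.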
> 
> Christian
> 
> > Thanks
> > Jan
> > 
> > > On 03 Aug 2016, at 13:33, Christian Balzer <ch...@gol.com> wrote:
> > > 
> > > Hello,
> > > 
> > > yeah, I was particularly interested in the Power_Loss_Cap_Test bit, as it
> > > seemed to be such an odd thing to fail (given that it's not a single
> > > capacitor).
> > > 
> > > As for your Reallocated_Sector_Ct, that's really odd and definitely an
> > > RMA-worthy issue.
> > > 
> > > For the record, Intel SSDs use (typically 24) sectors when doing firmware
> > > upgrades, so this is a totally healthy 3610. ^o^
> > > ---
> > > 5 Reallocated_Sector_Ct   0x0032   099   099   000    Old_age   Always   -       24
> > > ---
> > > 
> > > Christian
> > > 
> > > On Wed, 3 Aug 2016 13:12:53 +0200 Daniel Swarbrick wrote:
> > > 
> > >> Right, I actually updated to smartmontools 6.5+svn4324, which now
> > >> properly supports this drive model. Some of the smart attr names have
> > >> changed, and make more sense now (and there are no more "Unknowns"):
> > >> 
> > >> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> > >>   5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    944
> > >>   9 Power_On_Hours          -O--CK   100   100   000    -    1067
> > >>  12 Power_Cycle_Count       -O--CK   100   100   000    -    7
> > >> 170 Available_Reservd_Space PO--CK   085   085   010    -    0
> > >> 171 Program_Fail_Count      -O--CK   100   100   000    -    0
> > >> 172 Erase_Fail_Count        -O--CK   100   100   000    -    68
> > >> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> > >> 175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    6510 (4 4307)
> > >> 183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
> > >> 184 End-to-End_Error        PO--CK   100   100   090    -    0
> > >> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> > >> 190 Temperature_Case        -O---K   070   065   000    -    30 (Min/Max 25/35)
> > >> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    6
> > >> 194 Temperature_Internal    -O---K   100   100   000    -    30
> > >> 197 Current_Pending_Sector  -O--C-   100   100   000    -    1100
> > >> 199 CRC_Error_Count         -OSRCK   100   100   000    -    0
> > >> 225 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> > >> 226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    20
> > >> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    82
> > >> 228 Workload_Minutes        -O--CK   100   100   000    -    64012
> > >> 232 Available_Reservd_Space PO--CK   084   084   010    -    0
> > >> 233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
> > >> 234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
> > >> 241 Host_Writes_32MiB       -O--CK   100   100   000    -    20135
> > >> 242 Host_Reads_32MiB        -O--CK   100   100   000    -    92945
> > >> 243 NAND_Writes_32MiB       -O--CK   100   100   000    -    95289
> > >> 
> > >> Reallocated_Sector_Ct is still increasing, but Available_Reservd_Space
> > >> seems to be holding steady.
> > >> 
> > >> AFAIK, we've only had one other S3610 fail, and it seemed to be a sudden
> > >> death. The drive simply disappeared from the controller one day, and
> > >> could no longer be detected.
> > >> 
> > >> On 03/08/16 12:15, Jan Schermer wrote:
> > >>> Make sure you are reading the right attribute and interpreting it
> > >>> right. update-smart-drivedb sometimes works wonders :)
> > >>> 
> > >>> I wonder what the isdct tool would say the drive's life expectancy is
> > >>> with this workload? Are you really writing ~600TB/month??
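Jan's write-volume question can be sanity-checked with quick arithmetic: Intel's Host_Writes_32MiB / NAND_Writes_32MiB counters (attributes 225/241/243 above) report raw counts of 32 MiB units. A small sketch (the helper name and derived quantities are mine, for illustration only):

```python
# Sketch: convert Intel's 32MiB-unit SMART counters into bytes, using the
# raw values from the table above.

MIB = 1024 ** 2

def units_32mib_to_bytes(raw: int) -> int:
    """Attributes 225/241/242/243 report counts of 32 MiB units."""
    return raw * 32 * MIB

host_bytes = units_32mib_to_bytes(20135)   # attr 241 above, ~629 GiB total
nand_bytes = units_32mib_to_bytes(95289)   # attr 243 above, ~2.9 TiB total
write_amplification = nand_bytes / host_bytes  # ~4.7 on this drive
```

So this particular drive has seen well under a terabyte of host writes in its 1067 power-on hours; the ~600 TB/month figure evidently doesn't come from these counters.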
> > >>> 
> > >>> Jan
> > >>> 
> > >> 
> > >> _______________________________________________
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >> 
> > > 
> 

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
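On the "shut down all associated OSDs, flush the journals and replace it" advice from the top of this thread: with FileStore OSDs the usual sequence is to stop each OSD, flush its journal to the object store, swap the SSD and recreate the journal, then restart. A sketch that only *builds* the command list so the order is explicit; the systemd unit names and example OSD ids are assumptions for illustration, and the `--flush-journal`/`--mkjournal` flags should be checked against your ceph-osd version before running anything.

```python
# Sketch: ordered command plan for replacing a failing FileStore journal
# SSD. Nothing is executed; the list is meant to be reviewed first.

def journal_replacement_plan(osd_ids):
    """Return the shell commands, in order, to flush and recreate journals."""
    cmds = []
    for i in osd_ids:
        cmds.append(f"systemctl stop ceph-osd@{i}")      # OSD must be down first
        cmds.append(f"ceph-osd -i {i} --flush-journal")  # drain journal to store
    cmds.append("# ... physically replace the SSD and recreate partitions ...")
    for i in osd_ids:
        cmds.append(f"ceph-osd -i {i} --mkjournal")      # initialize new journal
        cmds.append(f"systemctl start ceph-osd@{i}")
    return cmds

plan = journal_replacement_plan([3, 7])  # hypothetical OSD ids
```

The point of flushing before pulling the drive is exactly the failure mode described above: once the SSD bricks, any journal entries still on it are gone and the OSDs have to be rebuilt rather than just re-journaled.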