Hi Christian,

Intel drives are good, but apparently not infallible. I'm watching a DC S3610 480GB die from reallocated sectors.
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
  9 Power_On_Hours          -O--CK   100   100   000    -    1065
 12 Power_Cycle_Count       -O--CK   100   100   000    -    7
175 Program_Fail_Count_Chip PO--CK   100   100   010    -    17454078318
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        PO--CK   100   100   090    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   070   065   000    -    30 (Min/Max 25/35)
192 Power-Off_Retract_Count -O--CK   100   100   000    -    6
194 Temperature_Celsius     -O---K   100   100   000    -    30
197 Current_Pending_Sector  -O--C-   100   100   000    -    1288
199 UDMA_CRC_Error_Count    -OSRCK   100   100   000    -    0
228 Power-off_Retract_Count -O--CK   100   100   000    -    63889
232 Available_Reservd_Space PO--CK   084   084   010    -    0
233 Media_Wearout_Indicator -O--CK   100   100   000    -    0
241 Total_LBAs_Written      -O--CK   100   100   000    -    20131
242 Total_LBAs_Read         -O--CK   100   100   000    -    92945

The Reallocated_Sector_Ct is increasing about once a minute. I'm not sure how many reserved sectors the drive has, i.e., how soon it will start throwing write IO errors. It's a very young drive, with only 1065 hours on the clock, and it hasn't even done two full drive-writes:

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Description
  1   =====  =           =     == General Statistics (rev 2) ==
  1   0x008  4           7     Lifetime Power-On Resets
  1   0x018  6  1319318736     Logical Sectors Written
  1   0x020  6   137121729     Number of Write Commands
  1   0x028  6  6091245600     Logical Sectors Read
  1   0x030  6   115252407     Number of Read Commands

Fortunately this drive is not used as a Ceph journal. It's in an mdraid RAID5 array :-|

Cheers,
Daniel

On 03/08/16 07:45, Christian Balzer wrote:
>
> Hello,
>
> not a Ceph specific issue, but this is probably the largest sample size
> of SSD users I'm familiar with. ^o^
>
> This morning I was woken at 4:30 by Nagios, one of our Ceph nodes having
> a religious experience.
>
> It turns out that the SMART check plugin I run, mostly to get an early
> wearout warning, detected a "Power_Loss_Cap_Test" failure in one of the
> 200GB DC S3700s used for journals.
>
> While SMART is of the opinion that this drive is failing and will explode
> spectacularly any moment, that particular failure is of little worry to
> me, never mind that I'll eventually replace this unit.
>
> What brings me here is that this is the first time in over 3 years that
> an Intel SSD has shown a (harmless in this case) problem, so I'm
> wondering whether this particular failure has been seen by others.
>
> That of course entails people actually monitoring for these things. ^o^
>
> Thanks,
>
> Christian
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
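P.S. The "not even two full drive-writes" figure above is easy to cross-check from the numbers in my output. A minimal sketch, assuming 512-byte logical sectors and the 32 MiB unit Intel's DC-series drives are understood to use for the raw value of attribute 241 (both assumptions, not vendor-confirmed here):

```python
# Rough cross-check of the wear figures quoted above.
SECTOR_BYTES = 512                # assumed logical sector size
DRIVE_BYTES = 480 * 10**9         # 480 GB, decimal

sectors_written = 1_319_318_736   # GP Log 0x04, Logical Sectors Written
bytes_written = sectors_written * SECTOR_BYTES
drive_writes = bytes_written / DRIVE_BYTES
print(f"{bytes_written / 10**9:.1f} GB written, "
      f"{drive_writes:.2f} full drive-writes")
# -> 675.5 GB written, 1.41 full drive-writes

# Cross-check against SMART attribute 241, whose raw value is assumed
# to count 32 MiB units on Intel DC-series drives:
lbas_written_raw = 20131
print(f"attr 241 -> {lbas_written_raw * 32 * 2**20 / 10**9:.1f} GB")
# -> attr 241 -> 675.5 GB
```

Both routes agree at roughly 675 GB, i.e. about 1.4 drive-writes on a 480GB drive.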
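For anyone who doesn't yet monitor these counters: a minimal sketch of the kind of check Christian describes, parsing `smartctl -A`-style output and flagging growth in the reallocation/pending counters. The attribute IDs are standard; the watched set and the zero threshold are illustrative choices, not vendor guidance, and a real plugin would also compare against the previous run.

```python
import re

# SMART attributes worth alerting on (illustrative selection).
WATCHED = {5: "Reallocated_Sector_Ct", 197: "Current_Pending_Sector"}

def parse_smart_attributes(text):
    """Return {attribute_id: raw_value} from `smartctl -A`-style output."""
    attrs = {}
    for line in text.splitlines():
        # ID# NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
        m = re.match(r"\s*(\d+)\s+(\S+)\s+\S+\s+\d+\s+\d+\s+\d+\s+\S+\s+(\d+)",
                     line)
        if m:
            attrs[int(m.group(1))] = int(m.group(3))
    return attrs

def failing(attrs, limit=0):
    """List watched attributes whose raw value exceeds `limit`."""
    return [(i, WATCHED[i], attrs[i]) for i in WATCHED
            if attrs.get(i, 0) > limit]

# The two worrying lines from the S3610 above:
sample = """\
  5 Reallocated_Sector_Ct   -O--CK   081   081   000    -    756
197 Current_Pending_Sector  -O--C-   100   100   000    -    1288"""

attrs = parse_smart_attributes(sample)
print(failing(attrs))
# -> [(5, 'Reallocated_Sector_Ct', 756), (197, 'Current_Pending_Sector', 1288)]
```

Feeding it the table from my drive correctly flags both counters; on a healthy drive both raw values would be 0 and the list would be empty.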