Re: SMART data & Self tests, not sure if my SSD is on it's last gasp

2020-12-30 Thread Joshua Judson Rosen
Storage still scares me, just as a general principle...,
so I'm basically never going to say "you really have nothing to worry about"...,
but I think I _might_ be able to settle your nerves a little:

On 12/30/20 2:04 PM, Bruce Labitt wrote:
> I think I have a SSD on the way out.  Last reboot took a REALLY long
> time.  Like 30 minutes.
Are you sure your computer wasn't just running an extensive fsck during that boot?

Assuming you're running one of the "ext" filesystem variants (ext4, ext3...),
you can try running dumpe2fs on each of your filesystems and looking at the 
"Last checked" field.
If that's the same as the last time you booted..., there you go.
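
If it helps, here is a minimal Python sketch of that check. It assumes the usual "Field name:   value" layout of the `dumpe2fs -h` header; the sample text and the helper name parse_dumpe2fs_header are made up for illustration and are not from your machine.

```python
def parse_dumpe2fs_header(text):
    """Turn 'Name:   value' lines from a dumpe2fs header into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split at the first colon only, so timestamps keep their colons.
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


# Hypothetical sample, shaped like the header that `dumpe2fs -h` prints.
SAMPLE_HEADER = """\
Filesystem state:         clean
Mount count:              1
Maximum mount count:      30
Last checked:             Wed Dec 30 13:10:02 2020
Check interval:           15552000 (6 months)
"""

header = parse_dumpe2fs_header(SAMPLE_HEADER)
print("Last checked:", header["Last checked"])
```

On a real system you would feed it the header that dumpe2fs prints for each filesystem (it usually needs root) and compare the timestamp with the time of the slow boot.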

IIRC ext3 used to force a periodic full fsck by default. I'm not sure what the intervals were,
what the current defaults are, or when they might have changed. A lot of people liked to
disable them, though, because otherwise the lengthy fsck always seemed to come at the
most unexpected and inopportune times (especially on laptops that might be running battery-only).
The relevant fields in the dumpe2fs output here are "Maximum mount count" and "Check interval".

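As a rough sketch of how to read the "Maximum mount count" and "Check interval" fields: as far as I know, tune2fs treats a maximum mount count of 0 or -1 as "ignore the mount count" and a check interval of 0 as "no time-based check". The function name and the sample strings below are hypothetical.

```python
def periodic_fsck_settings(max_mount_count, check_interval):
    """Interpret two dumpe2fs header values, given as strings.

    Assumption: 0 or -1 mounts disables mount-count checks, and an
    interval of 0 seconds disables time-based checks.
    """
    mounts = int(max_mount_count)
    # "15552000 (6 months)" -> 15552000; "0 (<none>)" -> 0
    seconds = int(check_interval.split()[0])
    return {
        "by_mount_count": mounts > 0,
        "by_interval": seconds > 0,
    }


print(periodic_fsck_settings("30", "15552000 (6 months)"))
print(periodic_fsck_settings("-1", "0 (<none>)"))
```
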
Your smartctl output actually doesn't sound any alarms for me:

> I ran the smart data and self test and the SSD
> passes.  Overall assessment is disk is ok.  I really don't know how to
> interpret what the results are.
> 
> I think the disk is in pre-fail based on the smartctl output below

I think you're misreading the `attribute TYPE' column as an `attribute value summary interpretation'.

"Pre-Fail" doesn't mean "this drive *is* about to fail according to the current value of this attribute",
it just means "this drive *would be* about to fail if the current value were
at or below the value in the THRESH column".

The relevant paragraph from the smartctl manual:

 The Attribute table printed  out  by  smartctl  also  shows  the
 "TYPE"  of  the  Attribute.   Attributes are one of two possible
 types: Pre-failure or Old age.  Pre-failure Attributes are  ones
 which, if less than or equal to their threshold values, indicate
 pending disk failure.  Old age, or usage  Attributes,  are  ones
 which  indicate end-of-product life from old-age or normal aging
 and wearout, if the Attribute value is less than or equal to the
 threshold.   Please  note: the fact that an Attribute is of type
 'Pre-fail' does not mean that your disk is about  to  fail!   It
 only  has  this  meaning  if  the Attribute's current Normalized
 value is less than or equal to the threshold value.
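
Put as code, the rule in that paragraph is a single comparison. This is only a sketch of the rule as the manual states it, not of how smartctl itself is implemented; the function names and the failing example are hypothetical.

```python
def attribute_tripped(value, threshold):
    """True when the normalized VALUE is at or below THRESH."""
    return value <= threshold


def describe(attr_type, value, threshold):
    """Map one attribute row to a short verdict, per the manual's wording."""
    if not attribute_tripped(value, threshold):
        return "ok"
    # TYPE only says what a tripped threshold would mean.
    return "failure predicted" if attr_type == "Pre-fail" else "worn out"


# Raw_Read_Error_Rate from the report: VALUE 100, THRESH 050.
print(describe("Pre-fail", 100, 50))
# Wear_Leveling_Count from the report: VALUE 098, THRESH 010.
print(describe("Pre-fail", 98, 10))
# Hypothetical drive whose value has dropped to its threshold.
print(describe("Pre-fail", 10, 10))
```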


Just going by your smartctl report, this drive looks `practically new' to me...:
the current and `worst ever seen' values are all at 98 to 100, and the closest pre-fail
indicator is `not until it gets down to 50' (and the others are
either `not until it gets down to 10' or `not until it gets down to 1').
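
To make the "how far from the threshold" reading concrete, here is a small Python sketch that computes VALUE minus THRESH for the Pre-fail rows of your report. The rows are copied from your post and joined onto one line each; the function name headroom is just something I made up.

```python
# Pre-fail rows from the posted report, one attribute per line.
ROWS = """\
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
173 Wear_Leveling_Count     0x0033   098   098   010    Pre-fail  Always       -       66
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
"""


def headroom(rows):
    """Return {attribute name: VALUE - THRESH} for Pre-fail attributes."""
    result = {}
    for line in rows.splitlines():
        f = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
        if len(f) < 10 or f[6] != "Pre-fail":
            continue
        result[f[1]] = int(f[3]) - int(f[5])
    return result


margins = headroom(ROWS)
for name, margin in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{name:24} {margin}")
```

The smallest margin here is 50 points (Raw_Read_Error_Rate and End-to-End_Error), which is what I mean by nothing being close to tripping.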

The Power_On_Hours and Power_Cycle_Count figures show that the drive has probably been
in use in a laptop (with typical sleep/wake/powercycle frequency) for a couple of years,
but that's all I see.
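
The back-of-the-envelope arithmetic behind that guess, as a Python sketch. The two raw values come from your report; the 8 and 10 hours per day are my assumptions about laptop use, not anything the drive records.

```python
power_on_hours = 7294   # RAW_VALUE of attribute 9 (Power_On_Hours)
power_cycles = 2511     # RAW_VALUE of attribute 12 (Power_Cycle_Count)

# Average length of one powered-on session.
hours_per_cycle = power_on_hours / power_cycles

# Lower bound on age: the drive powered on around the clock.
days_if_always_on = power_on_hours / 24

# Assumed daily usage (guesses), converted to years of service.
years_at_8h = power_on_hours / 8 / 365
years_at_10h = power_on_hours / 10 / 365

print(f"{hours_per_cycle:.1f} hours per power cycle")
print(f"{days_if_always_on:.0f} days if never switched off")
print(f"{years_at_10h:.1f} to {years_at_8h:.1f} years at 8-10 hours a day")
```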

If you haven't taken a backup recently..., you should do _that_... just because... backups.

It's been a while since I researched `SSD failure modes', but my recollection was
that `suddenly, completely, and without a lot of warning' was pretty typical--
as opposed to the old spinning-platter disc drives for which `first they get hot and noisy'
and `you lose a few sectors first and then can recover the rest' were more normal
(someone who's more up-to-date on this than me, please jump in!). So..., yeah--backups.

And if it's a couple years old, it might be out of its warranty period--
so consider whether that bothers you, I guess?


> 
> /snip
> 
> === START OF INFORMATION SECTION ===
> Model Family: Crucial/Micron RealSSD m4/C400/P400
> Device Model: M4-CT256M4SSD2
> Serial Number:    1247091DC2FF
> LU WWN Device Id: 5 00a075 1091dc2ff
> Firmware Version: 040H
> User Capacity:    256,060,514,304 bytes [256 GB]
> Sector Size:  512 bytes logical/physical
> Rotation Rate:    Solid State Device
> Form Factor:  2.5 inches
> Device is:    In smartctl database [for details use: -P show]
> ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Wed Dec 30 13:49:17 2020 EST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> /snip
> 
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       7294
>    12 Pow

SMART data & Self tests, not sure if my SSD is on it's last gasp

2020-12-30 Thread Bruce Labitt
I think I have an SSD on the way out.  Last reboot took a REALLY long
time.  Like 30 minutes.  I ran the smart data and self test and the SSD
passes.  Overall assessment is disk is ok.  I really don't know how to
interpret what the results are.

I think the disk is in pre-fail based on the smartctl output below

/snip

=== START OF INFORMATION SECTION ===
Model Family: Crucial/Micron RealSSD m4/C400/P400
Device Model: M4-CT256M4SSD2
Serial Number:    1247091DC2FF
LU WWN Device Id: 5 00a075 1091dc2ff
Firmware Version: 040H
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:  512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:  2.5 inches
Device is:    In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Dec 30 13:49:17 2020 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

/snip

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   100   100   001    Old_age   Always       -       7294
 12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       2511
170 Grown_Failing_Block_Ct  0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   001    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
173 Wear_Leveling_Count     0x0033   098   098   010    Pre-fail  Always       -       66
174 Unexpect_Power_Loss_Ct  0x0032   100   100   001    Old_age   Always       -       87
181 Non4k_Aligned_Access    0x0022   100   100   001    Old_age   Always       -       10250 5047 5203
183 SATA_Iface_Downshift    0x0032   100   100   001    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   001    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   001    Old_age   Always       -       0
189 Factory_Bad_Block_Ct    0x000e   100   100   001    Old_age   Always       -       81
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x003a   100   100   001    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   100   100   001    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   001    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   001    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   001    Old_age   Always       -       0
202 Perc_Rated_Life_Used    0x0018   098   098   001    Old_age   Offline      -       2
206 Write_Error_Rate        0x000e   100   100   001    Old_age   Always       -       0

Replace the disk pronto?  Is that what this is telling me?  Or?

I recently copied over many important files to another disk.  And 
downloaded a new OS.  I just hate re-configuring things, and starting 
from scratch, it's such a pain.  Not as painful as a disk crash, but 
close.  I've got loads of stuff I've compiled from source and just 100's 
of things to check or update.  Yes, I'll just have to do it.  It's just 
the week plus of recovery that I'm rebelling against.

Anything else I should do first?  Check something?  Run a test? Any tips 
to make the "recovery" less painful?

___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/