Re: Failing disk advice

Patrick Zaloum Tue, 07 Mar 2017 08:27:12 -0800

In my experience, you will likely get a few more weeks or months of life
out of the drive, but it will die.
Mirko's suggestion of migrating to an n=3 RAID1 setup is also what I would
recommend.


You will notice in your smartctl output that the raw value of
Reallocated_Sector_Ct is 41. That means 41 sectors have already been
remapped to the spare sectors of your drive. The 8 Offline_Uncorrectable /
Current_Pending_Sector entries are probably currently unused sectors that
triggered an I/O error the last time they held data and have not been
rewritten since. For me this often happened on a swap partition, since it
sees a lot of transient writes. The next time the system writes to those
sectors, the write will either fail, in which case the drive marks the
sector as permanently unusable, or succeed and clear the pending count.
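
If you want to keep an eye on those three counters while the array
rebuilds, here is a minimal sketch in Python. It only assumes the column
layout of the `smartctl -A` table you posted; the names WATCHED and
parse_raw_values are my own, not part of smartmontools, and the sample
rows are copied from your output.

```python
# Attribute names as printed by `smartctl -A`; this selection is an assumption
# about which counters matter most for a disk that is on its way out.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_raw_values(smartctl_output, watched=WATCHED):
    """Return {attribute_name: raw_value} for the watched attributes."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in watched:
            values[fields[1]] = int(fields[9])
    return values


# Three rows taken from the smartctl output quoted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       41
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""

print(parse_raw_values(SAMPLE))
# {'Reallocated_Sector_Ct': 41, 'Current_Pending_Sector': 8, 'Offline_Uncorrectable': 8}
```

To run it against the live drive you could feed it the text returned by
subprocess.run(["smartctl", "-A", "/dev/sdc"], capture_output=True,
text=True).stdout (as root). If the reallocated count keeps climbing
between runs, I would not wait for the replacement disk to arrive before
planning the swap.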

Good luck
Patrick

On Mon, Mar 6, 2017 at 8:27 PM, Gregory Seidman <
gsslist+deb...@anthropohedron.net> wrote:

> On Mon, Mar 06, 2017 at 12:17:03PM +0100, Mirko Parthey wrote:
> > On Sun, Mar 05, 2017 at 08:38:27PM -0800, David Christensen wrote:
> > > On 03/05/2017 01:02 PM, Gregory Seidman wrote:
> > > >I have a disk that is reporting SMART errors. It is an active disk in
> > > >a (kernel, not hardware) RAID1 configuration. I also have a hot spare
> > > >in the RAID1, and md hasn't decided it should fail the disk and switch
> > > >to the hot spare. Should I proactively tell md to fail the disk (and
> > > >let the hot spare take over), or should I just wait until md notices a
> > > >problem?
> > >
> > > I'm confused by "I also have a hot spare in the RAID1".  Do you have a
> > > two-member RAID1 with a hot spare, or a three-member RAID1?  I would
> > > prefer the latter:
> > >
> > > https://manpages.debian.org/jessie/mdadm/md.4.en.html
> >
> > Refining this advice a bit, I would convert the spare to a full RAID
> > member now, without explicitly failing the disk that reports SMART
> > errors first.
> > Assuming you have a two-member RAID1 with a hot spare, the command
> > should be similar to this (untested):
> >   mdadm -G /dev/mdX -n 3
> > This ensures you keep redundancy during further maintenance actions.
>
> I was unaware that this was possible. I've run it and mdadm -D reports that
> it is now in the "clean, degraded, rebuilding" state. Thank you! I wish I
> had room in my system to add the fourth (which I've ordered) without
> removing the failing disk, but I do not.
>
> > Which SMART errors do you get, and who reports them?
>
> I get emails sent to root:
>
>         This message was generated by the smartd daemon running on:
>
>            host name:  XXXXXX
>            DNS domain: YYYYYY
>
>         The following warning/error was logged by the smartd daemon:
>
>         Device: /dev/sdc [SAT], 8 Currently unreadable (pending) sectors
>
>         Device info:
>         ST31500341AS, S/N:9VS43CV9, WWN:5-000c50-0208aa9a3, FW:CC1H, 1.50 TB
>
>         For details see host's SYSLOG.
>
>         You can also use the smartctl utility for further investigation.
>         The original message about this issue was sent at Wed Dec 14 00:51:36 2016 EST
>         Another message will be sent in 24 hours if the problem persists.
>
> ...and...
>
>         This message was generated by the smartd daemon running on:
>
>            host name:  XXXXXX
>            DNS domain: YYYYYY
>
>         The following warning/error was logged by the smartd daemon:
>
>         Device: /dev/sdc [SAT], 8 Offline uncorrectable sectors
>
>         Device info:
>         ST31500341AS, S/N:9VS43CV9, WWN:5-000c50-0208aa9a3, FW:CC1H, 1.50 TB
>
>         For details see host's SYSLOG.
>
>         You can also use the smartctl utility for further investigation.
>         The original message about this issue was sent at Wed Dec 14 00:51:37 2016 EST
>         Another message will be sent in 24 hours if the problem persists.
>
> (Yes, I know, I've been letting it do this since mid-December, which is not
> great.)
>
> > What is the output of the following command for the failing drive?
> >   smartctl -A /dev/sdY
>
>         # smartctl -A /dev/sdc
>         smartctl 6.4 2014-10-07 r4002 [i686-linux-3.16.0-4-686-pae] (local build)
>         Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
>
>         === START OF READ SMART DATA SECTION ===
>         SMART Attributes Data Structure revision number: 10
>         Vendor Specific SMART Attributes with Thresholds:
>         ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>           1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       205161943
>           3 Spin_Up_Time            0x0003   100   091   000    Pre-fail  Always       -       0
>           4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1055
>           5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       41
>           7 Seek_Error_Rate         0x000f   092   060   030    Pre-fail  Always       -       1743842168
>           9 Power_On_Hours          0x0032   039   039   000    Old_age   Always       -       53898
>          10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>          12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       85
>         184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
>         187 Reported_Uncorrect      0x0032   097   097   000    Old_age   Always       -       3
>         188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       133146017827
>         189 High_Fly_Writes         0x003a   007   007   000    Old_age   Always       -       93
>         190 Airflow_Temperature_Cel 0x0022   060   040   045    Old_age   Always   In_the_past 40 (Min/Max 26/45 #502)
>         194 Temperature_Celsius     0x0022   040   060   000    Old_age   Always       -       40 (0 18 0 0 0)
>         195 Hardware_ECC_Recovered  0x001a   038   023   000    Old_age   Always       -       205161943
>         197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
>         198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
>         199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
>         240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       53897 (15 186 0)
>         241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       917595486
>         242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1262569510
>
> > Regards,
> > Mirko
>
> Thanks for the help so far,
> --Greg
>
>
