Re: Failing disk advice

Gregory Seidman Mon, 06 Mar 2017 17:27:53 -0800

On Mon, Mar 06, 2017 at 12:17:03PM +0100, Mirko Parthey wrote:
> On Sun, Mar 05, 2017 at 08:38:27PM -0800, David Christensen wrote:
> > On 03/05/2017 01:02 PM, Gregory Seidman wrote:
> > >I have a disk that is reporting SMART errors. It is an active disk in
> > >a (kernel, not hardware) RAID1 configuration. I also have a hot spare
> > >in the RAID1, and md hasn't decided it should fail the disk and switch
> > >to the hot spare. Should I proactively tell md to fail the disk (and
> > >let the hot spare take over), or should I just wait until md notices a
> > >problem?
> > 
> > I'm confused by "I also have a hot spare in the RAID1".  Do you have a
> > two-member RAID1 with a hot spare, or a three-member RAID1?  I would
> > prefer the latter:
> > 
> > https://manpages.debian.org/jessie/mdadm/md.4.en.html
> 
> Refining this advice a bit, I would convert the spare to a full RAID
> member now, without explicitly failing the disk that reports SMART
> errors first.
> Assuming you have a two-member RAID1 with a hot spare, the command
> should be similar to this (untested):
>   mdadm -G /dev/mdX -n 3 
> This ensures you keep redundancy during further maintenance actions.


I was unaware that this was possible. I've run it, and mdadm -D now reports
the array in the "clean, degraded, rebuilding" state. Thank you! I wish I
had room in my system to add the fourth disk (which I've ordered) without
removing the failing one, but I do not.
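
For anyone following along, rebuild progress also shows up in /proc/mdstat.
Below is a minimal sketch of a filter that pulls out the percentage. It
assumes the usual "recovery = NN.N%" progress line, and the function name
rebuild_progress is only illustrative, not part of mdadm:

```shell
# rebuild_progress: read /proc/mdstat-style text on stdin and print the
# percentage from any "recovery" or "resync" progress line.
# Assumption: the progress line looks like "recovery = 12.6% (...)".
rebuild_progress() {
  awk '/(recovery|resync) *=/ {
    # The percentage is the only whitespace-separated field ending in "%".
    for (i = 1; i <= NF; i++) if ($i ~ /%$/) print $i
  }'
}
```

On a live system this would be run as "rebuild_progress < /proc/mdstat". It
prints nothing when no rebuild is in progress.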

> Which SMART errors do you get, and who reports them?

I get emails sent to root:

        This message was generated by the smartd daemon running on:

           host name:  XXXXXX
           DNS domain: YYYYYY

        The following warning/error was logged by the smartd daemon:

        Device: /dev/sdc [SAT], 8 Currently unreadable (pending) sectors

        Device info:
        ST31500341AS, S/N:9VS43CV9, WWN:5-000c50-0208aa9a3, FW:CC1H, 1.50 TB

        For details see host's SYSLOG.

        You can also use the smartctl utility for further investigation.
        The original message about this issue was sent at Wed Dec 14 00:51:36 2016 EST
        Another message will be sent in 24 hours if the problem persists.

...and...

        This message was generated by the smartd daemon running on:

           host name:  XXXXXX
           DNS domain: YYYYYY

        The following warning/error was logged by the smartd daemon:

        Device: /dev/sdc [SAT], 8 Offline uncorrectable sectors

        Device info:
        ST31500341AS, S/N:9VS43CV9, WWN:5-000c50-0208aa9a3, FW:CC1H, 1.50 TB

        For details see host's SYSLOG.

        You can also use the smartctl utility for further investigation.
        The original message about this issue was sent at Wed Dec 14 00:51:37 2016 EST
        Another message will be sent in 24 hours if the problem persists.

(Yes, I know, I've been letting it do this since mid-December, which is not
great.)

> What is the output of the following command for the failing drive?
>   smartctl -A /dev/sdY

        # smartctl -A /dev/sdc  
        smartctl 6.4 2014-10-07 r4002 [i686-linux-3.16.0-4-686-pae] (local build)
        Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

        === START OF READ SMART DATA SECTION ===
        SMART Attributes Data Structure revision number: 10
        Vendor Specific SMART Attributes with Thresholds:
        ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
          1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       205161943
          3 Spin_Up_Time            0x0003   100   091   000    Pre-fail  Always       -       0
          4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1055
          5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       41
          7 Seek_Error_Rate         0x000f   092   060   030    Pre-fail  Always       -       1743842168
          9 Power_On_Hours          0x0032   039   039   000    Old_age   Always       -       53898
         10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
         12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       85
        184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
        187 Reported_Uncorrect      0x0032   097   097   000    Old_age   Always       -       3
        188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       133146017827
        189 High_Fly_Writes         0x003a   007   007   000    Old_age   Always       -       93
        190 Airflow_Temperature_Cel 0x0022   060   040   045    Old_age   Always   In_the_past 40 (Min/Max 26/45 #502)
        194 Temperature_Celsius     0x0022   040   060   000    Old_age   Always       -       40 (0 18 0 0 0)
        195 Hardware_ECC_Recovered  0x001a   038   023   000    Old_age   Always       -       205161943
        197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
        198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
        199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
        240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       53897 (15 186 0)
        241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       917595486
        242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1262569510
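
Since that is a lot of numbers, here is a minimal sketch of a filter that
picks out the sector-health attributes matching the smartd warnings
(IDs 5, 187, 197 and 198). It assumes the ten-column "smartctl -A" layout
shown above with the raw value in the last column; the function name
smart_sector_counts is made up for illustration:

```shell
# smart_sector_counts: read `smartctl -A` output on stdin and print
# NAME=RAW_VALUE for the reallocated/pending/uncorrectable attributes.
# Assumption: columns are ID, name, flag, value, worst, thresh, type,
# updated, when_failed, raw (so the raw value starts at field 10).
smart_sector_counts() {
  awk '$1 ~ /^(5|187|197|198)$/ && NF >= 10 { print $2 "=" $10 }'
}
```

Usage would be "smartctl -A /dev/sdc | smart_sector_counts". Which
attributes are worth watching is a judgment call; these four are only the
ones relevant to the warnings quoted earlier.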

> Regards,
> Mirko

Thanks for the help so far,
--Greg
