Re: Software RAID blocks

2019-01-13 Thread deloptes
Tom Bachreier wrote:

> So it is most likely that I have a problem with the software raid or the
> harddisks, isn't it? SMART is activated on all disks and does not show
> any error.

I don't know exactly, but I replaced all my Seagate drives with WD,
specifically the WD Red 2TB NAS (WD20EFRX). I just ordered a couple of them
for a RAID5 storage solution. From everything I have seen on the consumer
market, the 2TB Red seems to be very good.




Re: Software RAID blocks

2019-01-13 Thread Jens Holzhäuser
Hi!


On Sun, Jan 13, 2019 at 12:27:19PM +0100, Tom Bachreier wrote:
> Last night I got a "blocked for more than 300 seconds" message in syslog -
> see (link valid for 90 days).
> 
> Log summary:
> Jan 13 02:34:44 osprey kernel: [969696.242745] INFO: task md127_raid5:238 
> blocked for more than 300 seconds.
> Jan 13 02:34:44 osprey kernel: [969696.242772] Call Trace:
> Jan 13 02:34:44 osprey kernel: [969696.242789]  ? __schedule+0x2a2/0x870
> Jan 13 02:34:44 osprey kernel: [969696.242995] INFO: task dmcrypt_write:904 
> blocked for more than 300 seconds.
> Jan 13 02:34:44 osprey kernel: [969696.243223] INFO: task jbd2/dm-2-8:917 
> blocked for more than 300 seconds.
> Jan 13 02:34:44 osprey kernel: [969696.243525] INFO: task mpc:6622 blocked 
> for more than 300 seconds.
> Jan 13 02:34:44 osprey kernel: [969696.243997] INFO: task kworker/u8:0:6625 
> blocked for more than 300 seconds.

I am occasionally seeing very similar issues with my RAID1: tasks
blocked for more than 120 seconds, for no obvious reason.

I've started playing around with the vm.dirty_background_ratio and
vm.dirty_ratio kernel parameters, suspecting that slow flushing of the
file system cache is the issue. [1]

While lowering the values does seem to have helped, it has not
completely eliminated the issue, so for me the jury is still out.
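
For reference, a minimal sketch of the kind of tuning I mean; the values
are just what I am experimenting with (the usual defaults are 10 and 20
percent of RAM), not a recommendation:

# flush dirty pages earlier, in smaller batches
$ sysctl -w vm.dirty_background_ratio=5
$ sysctl -w vm.dirty_ratio=10

To keep the settings across reboots, the same two assignments (without
"sysctl -w") can go into a file under /etc/sysctl.d/.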

> In this case I did a
>   $ fdisk -l /dev/sdf
> and everything worked again.

Not sure if/how this would interact with ongoing cache flushing.

Jens


[1] 
https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/



Re: Software RAID blocks

2019-01-13 Thread Reco
On Sun, Jan 13, 2019 at 02:22:09PM +0100, Tom Bachreier wrote:
> 
> Hi Reco!
> 
> Jan 13, 2019, 1:47 PM by recovery...@enotuniq.net:
> 
> > On Sun, Jan 13, 2019 at 01:20:50PM +0100, Tom Bachreier wrote:
> >
> >> Jan 13, 2019, 12:46 PM by recovery...@enotuniq.net:
> >>
> >> > On Sun, Jan 13, 2019 at 12:27:19PM +0100, Tom Bachreier wrote:
> >> >
> >> >> TLDR;
> >> >> My /home on dmcrypt -> software Raid5 blocks irregularly, usually
> >> >> without any error messages.
> >> >>
> >> >> I can get it going again with "fdisk -l /dev/sdx".
> >> >>
> >> >> Do you have any ideas how I can debug this issue further? Is it a
> >> >> dmcrypt, a dm-softraid or a hardware issue?
> >> >>
> >> >
> >> > Let's start with something uncommon:
> >> >
> >>
> >> Thanks for your suggestions.
> >>
> >
> > My suspicion is that either the HDDs' firmware or the disk controller
> > puts some or all of the drives into sleep mode.
> >
> 
> In this case: why don't they wake on a write from dm-raid, but do on a
> read from fdisk? I don't see the logic behind that.

RAID5 may be the reason. If you're reading a short sequence of bytes
from the array, it does not mean you're utilizing all the drives.
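
One way to check that hypothesis while the array is stuck (a sketch,
assuming the same sdb..sdf names as elsewhere in this thread) is to query
each member's power state; 'hdparm -C' only reports the state and should
not wake a sleeping drive:

$ for x in /dev/sd{b..f}; do echo "=== $x ==="; hdparm -C $x; done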


> >> hdparm seems OK. Keep in mind only sdc and sdf are WD drives.
> >>
> >
> > Since you have Seagates, please check them with 'hdparm -Z'.
> >
> 
> I don't think that did much to my drives.

I agree. It seems that the drives' firmware rejected the request.
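
The sense data in your hdparm -Z output points the same way: in
fixed-format sense data, byte 2 is the sense key, and 0x05 means ILLEGAL
REQUEST, i.e. the drive refused the command. If you have sg3-utils
installed, sg_decode_sense should be able to translate the whole buffer
(a sketch, using the first bytes from your output):

$ sg_decode_sense 70 00 05 00 00 00 00 0a 04 51 40 00 21 04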


> $ smartctl -l scterc,70,70 /dev/sdd
> SCT Error Recovery Control set to:
>    Read: 70 (7.0 seconds)
>   Write: 70 (7.0 seconds)
> 

This setting may not survive a power cycle. I happen to have four such
drives, and I just apply it at every reboot.
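
One way to reapply it at boot (a sketch; the unit name and the sdb..sdf
device list are assumptions, and /dev/disk/by-id names would be more
robust against devices being renumbered):

# /etc/systemd/system/scterc.service
[Unit]
Description=Set SCT Error Recovery Control on RAID member disks
After=local-fs.target

[Service]
Type=oneshot
# bash, not sh: the {b..f} brace expansion is a bashism
ExecStart=/bin/bash -c 'for x in /dev/sd{b..f}; do /usr/sbin/smartctl -l scterc,70,70 $x; done'

[Install]
WantedBy=multi-user.target

Enable it once with "systemctl enable scterc.service".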

Reco



Re: Software RAID blocks

2019-01-13 Thread Tom Bachreier


Hi Reco!

Jan 13, 2019, 1:47 PM by recovery...@enotuniq.net:

> On Sun, Jan 13, 2019 at 01:20:50PM +0100, Tom Bachreier wrote:
>
>> Jan 13, 2019, 12:46 PM by recovery...@enotuniq.net:
>>
>> > On Sun, Jan 13, 2019 at 12:27:19PM +0100, Tom Bachreier wrote:
>> >
>> >> TLDR;
>> >> My /home on dmcrypt -> software Raid5 blocks irregularly, usually
>> >> without any error messages.
>> >>
>> >> I can get it going again with "fdisk -l /dev/sdx".
>> >>
>> >> Do you have any ideas how I can debug this issue further? Is it a
>> >> dmcrypt, a dm-softraid or a hardware issue?
>> >>
>> >
>> > Let's start with something uncommon:
>> >
>>
>> Thanks for your suggestions.
>>
>
> My suspicion is that either the HDDs' firmware or the disk controller
> puts some or all of the drives into sleep mode.
>

In this case: why don't they wake on a write from dm-raid, but do on a
read from fdisk? I don't see the logic behind that.


>> hdparm seems OK. Keep in mind only sdc and sdf are WD drives.
>>
>
> Since you have Seagates, please check them with 'hdparm -Z'.
>

I don't think that did much to my drives.

$ hdparm -Z /dev/sdb
/dev/sdb:
disabling Seagate auto powersaving mode
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 40 00 21 04 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

$ hdparm -Z /dev/sde
/dev/sde:
disabling Seagate auto powersaving mode
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 40 00 21 04 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00


> Unsure about that Toshiba drive, though.
>
> I'd be wary of including those in the RAID (a single bad block can
> paralyze your whole RAID):
>
>> DISK: /dev/sdc
>> DISK: /dev/sde
>>

Thanks, I'll keep that in mind. I'll try to replace them in the near future.


> And I'd enable it for sdd.
>

Done.

$ smartctl -l scterc,70,70 /dev/sdd
SCT Error Recovery Control set to:
   Read: 70 (7.0 seconds)
  Write: 70 (7.0 seconds)

Tom





Re: Software RAID blocks

2019-01-13 Thread Reco
Hi.

On Sun, Jan 13, 2019 at 01:20:50PM +0100, Tom Bachreier wrote:
> 
> Hi Reco!
> 
> Jan 13, 2019, 12:46 PM by recovery...@enotuniq.net:
> 
> > On Sun, Jan 13, 2019 at 12:27:19PM +0100, Tom Bachreier wrote:
> >
> >> TLDR;
> >> My /home on dmcrypt -> software Raid5 blocks irregularly, usually
> >> without any error messages.
> >>
> >> I can get it going again with "fdisk -l /dev/sdx".
> >>
> >> Do you have any ideas how I can debug this issue further? Is it a
> >> dmcrypt, a dm-softraid or a hardware issue?
> >>
> >
> > Let's start with something uncommon:
> >
> > for x in /dev/sd{b..f}; do
> >  smartctl -l scterc $x
> >  hdparm -J $x
> > done
>
> Thanks for your suggestions.

My suspicion is that either the HDDs' firmware or the disk controller
puts some or all of the drives into sleep mode.


> hdparm seems OK. Keep in mind only sdc and sdf are WD drives.

Since you have Seagates, please check them with 'hdparm -Z'.
Unsure about that Toshiba drive, though.


That looks good, though:

> /dev/sdc:
> wdidle3  = disabled
> 
> /dev/sdf:
> wdidle3  = disabled


I'd be wary of including those in the RAID (a single bad block can
paralyze your whole RAID):

> -
> And here comes the SCT state:
> 
> DISK: /dev/sdc
> SCT Error Recovery Control command not supported
> 
> 
> DISK: /dev/sde
> SCT Error Recovery Control command not supported
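
Until you replace them, it may be worth watching those two for reallocated
or pending sectors (a sketch; the grep just filters the usual ATA attribute
names smartctl reports):

$ for x in /dev/sdc /dev/sde; do smartctl -A $x | grep -Ei 'realloc|pending|uncorrect'; done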

And I'd enable it for sdd.

Reco



Re: Software RAID blocks

2019-01-13 Thread Tom Bachreier


Hi Reco!

Jan 13, 2019, 12:46 PM by recovery...@enotuniq.net:

> On Sun, Jan 13, 2019 at 12:27:19PM +0100, Tom Bachreier wrote:
>
>> TLDR;
>> My /home on dmcrypt -> software Raid5 blocks irregularly, usually
>> without any error messages.
>>
>> I can get it going again with "fdisk -l /dev/sdx".
>>
>> Do you have any ideas how I can debug this issue further? Is it a
>> dmcrypt, a dm-softraid or a hardware issue?
>>
>
> Let's start with something uncommon:
>
> for x in /dev/sd{b..f}; do
>  smartctl -l scterc $x
>  hdparm -J $x
> done
>
Thanks for your suggestions.

hdparm seems OK. Keep in mind only sdc and sdf are WD drives.

$ hdparm -J /dev/sd[bcdef]
/dev/sdb:
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 a0 00 21 04 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 a0 00 21 04 
00 00 00 be 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 a0 00 21 04 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
wdidle3  = 1 ??

/dev/sdc:
wdidle3  = disabled

/dev/sdd:
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 00 00 21 04 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 00 00 21 04 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
wdidle3  = disabled

/dev/sde:
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 a0 00 21 04 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 a0 00 21 04 
00 00 00 be 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0a 04 51 a0 00 21 04 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
wdidle3  = 1 ??

/dev/sdf:
wdidle3  = disabled

-
And here comes the SCT state:

$ for i in /dev/sd{b..f}; do echo "DISK: ${i}"; smartctl -l scterc "${i}"; done
DISK: /dev/sdb
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-1-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org 


SCT Error Recovery Control:
   Read: 70 (7.0 seconds)
  Write: 70 (7.0 seconds)

DISK: /dev/sdc
SCT Error Recovery Control command not supported

DISK: /dev/sdd
SCT Error Recovery Control:
   Read: Disabled
  Write: Disabled

DISK: /dev/sde
SCT Error Recovery Control command not supported

DISK: /dev/sdf
SCT Error Recovery Control:
   Read: 70 (7.0 seconds)
  Write: 70 (7.0 seconds)

Hope this helps...
Tom



Re: Software RAID blocks

2019-01-13 Thread Reco
Hi.

On Sun, Jan 13, 2019 at 12:27:19PM +0100, Tom Bachreier wrote:
> TLDR;
> My /home on dmcrypt -> software Raid5 blocks irregularly, usually
> without any error messages.
> 
> I can get it going again with "fdisk -l /dev/sdx".
> 
> Do you have any ideas how I can debug this issue further? Is it a
> dmcrypt, a dm-softraid or a hardware issue?

Let's start with something uncommon:

for x in /dev/sd{b..f}; do
  smartctl -l scterc $x
  hdparm -J $x
done

Reco