Re: errors found in extent allocation tree or chunk allocation after power failure

Chris Murphy Wed, 25 Sep 2019 12:08:54 -0700

On Wed, Sep 25, 2019 at 8:50 AM Pallissard, Matthew <m...@pallissard.net> wrote:
>
> Version:
> Kernel: 5.2.2-arch1-1-ARCH #1 SMP PREEMPT Sun Jul 21 19:18:34 UTC 2019 x86_64 
> GNU/Linux


You need to upgrade to arch kernel 5.2.14 or newer (they backported
the fix first appearing in stable 5.2.15). Or you need to downgrade to
5.1 series.
https://lore.kernel.org/linux-btrfs/20190911145542.1125-1-fdman...@kernel.org/T/#u

That's a nasty bug. I don't offhand see evidence that you've hit this
bug. But I'm not certain. So first thing should be to use a different
kernel.

Next, anytime there is a crash or power failure with Btrfs raid56, you
need to do a complete scrub of the volume. Obviously that will take
time, but that's what needs to be done first.

OK actually, before the scrub you need to confirm that each drive's
SCT ERC time is *less* than the kernel's SCSI command timer. e.g.

# smartctl -l scterc /dev/sda
# cat /sys/block/sda/device/timeout

The SCT ERC value is in deciseconds, so convert it to seconds. The
second value is already in seconds. The first value must be shorter.
By default the kernel's command timer per device is 30 seconds;
typical consumer drives have a much longer error recovery time. So
depending on the reply from your drive for that smart command, you
might change either the drive timer or the SCSI command timer - or it
might actually be perfect. NAS specific drives, nearline, and SAS all
tend to have a short SCT ERC by default, around 7 seconds. That's fine.
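
As a rough sketch of that comparison: the function name erc_check is
made up for illustration, and the two numbers are assumed to have been
read by hand from the two commands above. It only does the unit
conversion and the "must be shorter" test.

```shell
# Hypothetical helper, not part of smartmontools or the kernel.
# $1 = SCT ERC value in deciseconds (as reported by smartctl -l scterc)
# $2 = kernel command timer in seconds (from /sys/block/sdX/device/timeout)
erc_check() {
    erc_ds=$1
    timeout_s=$2
    # Compare in deciseconds so no fractional seconds are needed.
    if [ "$erc_ds" -lt $((timeout_s * 10)) ]; then
        echo "ok"
    else
        echo "too long"
    fi
}

erc_check 70 30      # 7.0 s vs 30 s: prints "ok"
erc_check 1200 30    # 120 s vs 30 s: prints "too long"
```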

Note that the smart setting is transient: when the drive powers off, it
goes back to its default. And on reboot, the kernel's command timer
also resets.
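
Since neither setting survives, whatever you choose has to be reapplied
at every boot. A minimal sketch of the two commands follows, printed
rather than executed so they can be reviewed first. The helper name
timer_cmd, the device names, and the values (70 deciseconds, 180
seconds) are assumptions for illustration only, not recommendations
for your drives.

```shell
# Hypothetical helper: print (do not run) the command for one drive.
# mode "erc"     -> set the drive's SCT ERC read/write time (deciseconds)
# mode "timeout" -> set the kernel's SCSI command timer (seconds)
timer_cmd() {
    mode=$1; dev=$2; value=$3
    case "$mode" in
        erc)     echo "smartctl -l scterc,$value,$value /dev/$dev" ;;
        timeout) echo "echo $value > /sys/block/$dev/device/timeout" ;;
        *)       echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

timer_cmd erc sda 70        # shorten the drive timer to 7.0 seconds
timer_cmd timeout sdb 180   # or raise the kernel timer instead (assumed value)
```

Both printed commands need root when actually run, and only one of
them should be needed per drive.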


-- 
Chris Murphy
