On Tue, Aug 20, 2019 at 3:10 PM Peter Chant <p...@petezilla.co.uk> wrote:
>
> Chasing IO errors.  BTRFS: error (device dm-2) in
> btrfs_run_delayed_refs:2907: errno=-5 IO failure
>
>
> I've just had an odd one.
>
> Over the last few days I've noticed a file system blocking, if that is
> the correct term, and this morning go read only.  This resulted in a lot
> of checksum errors.

That doesn't sound good. Checksum errors where? A complete start to
finish dmesg is most useful in this case.


>
> Having spotted the file system go read only in the logs and then noted
> the error message in the subject shortly after booting I assumed a
> hardware error and changed the SATA cable.  That had no effect so I
> isolated the disk and mounted the respective file system degraded.
> Shortly after mounting the degraded file system I had the same error
> again. So I unmounted the file system edited fstab and swapped the disk
> which I thought originally had the error with the one now showing an error.

OK but we don't know anything from what you've told us about what and
whose error, so it's all speculation. Definitely a complete dmesg is
needed.

Or if running systemd-journald to persistent media, you can look up
that boot with journalctl --list-boots, and export just the kernel
messages portion with something like this:

journalctl -b -2 -k -o short-monotonic > journalbtrfshang.txt

That's two boots back, kernel messages only, monotonic time stamp.
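
For your own first pass through that export, a small filter can help
pick out the interesting lines. This is only a sketch: the function
name filter_storage_errors is made up for illustration, and the
pattern list is my assumption of what is worth looking for, not an
exhaustive set. The complete, unfiltered log is still what should be
posted to the list.

```shell
# Read kernel messages on stdin and keep only lines that mention btrfs,
# a libata port (ataN: or ataN.NN:), or a block-layer I/O error.
# The patterns are assumptions and will miss some relevant messages.
filter_storage_errors() {
    grep -E -i 'btrfs|ata[0-9]+(\.[0-9]+)?:|i/o error|blk_update_request'
}
```

Used as: journalctl -b -2 -k -o short-monotonic | filter_storage_errors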

Also useful, if you experience blocked tasks (the system seeming to
hang for a couple of minutes, that sort of thing), is a sysrq+w and
sysrq+t dump. The simple version, as root:

# echo 1 > /proc/sys/kernel/sysrq
# echo w > /proc/sysrq-trigger
# echo t > /proc/sysrq-trigger

Detailed version here:
https://fedoraproject.org/wiki/QA/Sysrq

That will dump a bunch of task info into kernel messages, where it can
be found with dmesg or the above journalctl command. It's useful to
run the echo 1 step before you reproduce the problem, and even more
useful to have the trigger command already typed out in a remote ssh
session, so all you have to do is hit return upon reproducing the
hang. Otherwise it can take a long time to type it all out.
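
One way to keep that down to a single short command is to wrap the
steps in two shell functions, defined as root in the ssh session ahead
of time. This is a rough sketch, not a tested recipe: the names
arm_sysrq and dump_tasks, and the SYSRQ_ENABLE and SYSRQ_TRIGGER
override variables, are hypothetical and exist only so the functions
can be tried against scratch files first.

```shell
# Enable all sysrq functions. SYSRQ_ENABLE is a hypothetical override
# for trying this on a scratch file; by default the real /proc path is used.
arm_sysrq() {
    echo 1 > "${SYSRQ_ENABLE:-/proc/sys/kernel/sysrq}"
}

# Dump blocked tasks (w) and then all tasks (t) into the kernel log.
# SYSRQ_TRIGGER is likewise a hypothetical override for dry runs.
dump_tasks() {
    for key in w t; do
        echo "$key" > "${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    done
}
```

Run arm_sysrq right away, leave dump_tasks typed at the prompt, and
hit return when the hang shows up.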


> Does this sound like a hardware error?  I have ordered a replacement
> drive, if it is not needed as a replacement I will put it into a
> homebrew NAS.
>
> I've hit the issue again.  Hopefully the system is up long enough to
> post this.
>
> I'm a bit worried that trying to track this down disconnecting a disk at
> a time I might hit the btrfs split brain issue.

WDC Reds have an SCT ERC of, I think, 70 deciseconds (7 seconds) by
default, which you can check with 'smartctl -l scterc /dev/sdX' for
each drive. If it's hardware related it probably isn't bad block
related, and if the drive is aware of the problem it'll report it via
libata, so you'll see such messages in the kernel messages.
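
To check every member drive in one go, something like the following
would do. It is a sketch that assumes smartmontools is installed and
that it runs as root; the function name check_scterc is made up, and
the device names in the usage line are placeholders for whatever your
drives are actually called.

```shell
# Print the SCT ERC setting of each device passed as an argument.
# Assumes smartctl (smartmontools) is available and the caller is root.
check_scterc() {
    for dev in "$@"; do
        echo "== $dev"
        smartctl -l scterc "$dev"
    done
}
```

Used as: check_scterc /dev/sda /dev/sdb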


-- 
Chris Murphy
