btrfs ops hang indefinitely (process in D state)

Eugene Crosser Sat, 02 Jul 2016 03:01:02 -0700

Hello,

This may be the same problem as "btrfs lockup".


I have two systems using btrfs for several years. One is my home desktop, it has
root+home ext4 fs on a PCI SSD, and "big stuff" on a btrfs using two hard disks
in RAID1 configuration:

root@pccross:/export# uname -a
Linux pccross 4.7.0-rc2-custom #2 SMP Sat Jun 11 01:13:59 MSK 2016 x86_64 x86_64
x86_64 GNU/Linux # -- Was earlier 4.x version when the problem happened
root@pccross:/export# btrfs --version
btrfs-progs v4.4
root@pccross:/export# btrfs fi show
Label: 'export'  uuid: c94c3ef6-394e-4441-8992-d7033332bdff
        Total devices 2 FS bytes used 1.26TiB
        devid    1 size 3.64TiB used 1.26TiB path /dev/sda
        devid    2 size 3.64TiB used 1.26TiB path /dev/sdb

root@pccross:/export# btrfs fi df /export
Data, RAID1: total=1.26TiB, used=1.25TiB
System, RAID1: total=32.00MiB, used=208.00KiB
Metadata, RAID1: total=5.00GiB, used=3.82GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

A month ago, I moved a directory containing a few Gb from home (ext4) to btrfs
with `mv` command. The command took some minutes and eventually finished without
error. After some hours, a cron job that uses files on btrfs did not run. I
logged in to investigate and realized that its process was in 'D' state, and any
command that I tried that would use btrfs (ls, ...) would enter 'D' state and
stay there indefinitely. There was nothing interesting (that I remember) in
dmesg. Reboot did not help and indeed could not complete because some of startup
jobs use files on btfs, and they hang.

I rebooted without mounting btrfs and ran `btrfsck`. It found and fixed some
inconsistencies (no log, sorry), and I could mount, and since then everything
works, except the directory that I moved disappeared altogether (I had a backup
so could restore it). No debugging material left so this is just for background.

=====
Enter the second system. It is a rented physical server in a datacenter with two
hard disks, joined into a single root btrfs (/dev/sd[ab]1 are swap partitions):

root@dehost:~# uname -a
Linux dehost 3.13.0-91-generic #138-Ubuntu SMP Fri Jun 24 17:00:34 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
root@dehost:~# btrfs --version
Btrfs v3.12
root@dehost:~# btrfs fi show
Label: none  uuid: 67a2708c-f039-4783-a699-6f6be0dac318
        Total devices 2 FS bytes used 442.58GiB
        devid    1 size 2.72TiB used 444.04GiB path /dev/sda2
        devid    2 size 2.72TiB used 444.03GiB path /dev/sdb2

Btrfs v3.12
root@dehost:~# btrfs fi df /
Data, RAID1: total=440.00GiB, used=439.51GiB
System, RAID1: total=32.00MiB, used=72.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, RAID1: total=4.00GiB, used=3.07GiB

A week ago, the system started to become unresponsive every day. Kernel works
(responds to ping) but no processes can start. Looking at the logs after reboot
I noticed that activity stops some time after the start of backup cron job that
covers a set of directories (/etc, /home, /var/mail and some more.). I disabled
the backup job and since then, several days, it did not hang.

=====
My question to the developers: what can I do to (1) recover the filesystem while
it is mounted (I can use recovery netboot system and run `btrfs check` as the
last resort), and (2) provide any useful debugging information to the 
developers?

Thank you,

Eugene

signature.asc
Description: OpenPGP digital signature

btrfs ops hang indefinitely (process in D state)

Reply via email to