Help needed, server is unresponsive after btrfs balance

Moritz M Mon, 04 Feb 2019 03:55:01 -0800

Hi,

I'm running an Ubuntu server with a btrfs RAID1 consisting of three HDDs.


I do balancing daily via

btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4 /


It usually takes between 1 and 10 minutes.

But today the server was unresponsive (no SSH connection and no direct login via keyboard possible) even after 7 hours.

I had a similar situation two weeks ago. I did not find anything and finally checked and repaired the filesystem with

btrfs check --repair /dev/sda3


It found some qgroup-related problems:

enabling repair mode
Checking filesystem on /dev/sda3
UUID: cf8c4bb2-6a75-4e1d-983c-19583a93a546
No device size related problem found
cache and super generation don't match, space cache will be invalidated
Counts for qgroup id: 0/257 are different
our:            referenced 127300112384 referenced compressed 127300112384
disk:           referenced 18446743939800129536 referenced compressed 18446743939800129536
diff:           referenced 261209534464 referenced compressed 261209534464
our:            exclusive 56360521728 exclusive compressed 56360521728
disk:           exclusive 56360521728 exclusive compressed 56360521728

…

Repair qgroup 0/257
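
A side note on the numbers above: the on-disk "referenced" value is close to 2^64, which would be consistent with a counter that went negative and wrapped around as an unsigned 64-bit number. This is only one possible reading of the output. The sketch below checks it with the values copied from the report; the variable names are for illustration only.

```shell
#!/bin/sh
# Values copied from the btrfs check output for qgroup 0/257.
ours=127300112384
diff=261209534464
# Too large for signed 64-bit shell arithmetic, so it is only compared as a string.
disk=18446743939800129536

# If diff = ours - disk (with disk read as a signed number), then disk = ours - diff.
# This subtraction stays within the signed 64-bit range.
signed=$((ours - diff))
echo "on-disk value read as signed 64-bit: $signed"

# Reinterpret the negative number as an unsigned 64-bit value.
wrapped=$(printf '%u' "$signed")
echo "same value as unsigned 64-bit:       $wrapped"

if [ "$wrapped" = "$disk" ]; then
    echo "matches the on-disk value reported by btrfs check"
fi
```

If the last line is printed, the reported "diff" is exactly what one gets when the on-disk counter is treated as a negative number.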


Today I had to boot a live system, mount the btrfs filesystem with
-o skip_balance and cancel the balance there.
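
For reference, these recovery steps correspond roughly to the following sketch. The mount point /mnt, the helper function run and the RUN switch are assumptions for illustration and are not taken from the original session; by default the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of the recovery steps, meant to be run from a live system.
# DEV comes from the report; MNT is an assumed mount point.
DEV=${DEV:-/dev/sda3}
MNT=${MNT:-/mnt}

# Print each command; execute it only when RUN=1 is set.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# skip_balance prevents the interrupted balance from resuming at mount time.
run mount -o skip_balance "$DEV" "$MNT"
# Cancel the paused balance so it does not resume on the next normal mount.
run btrfs balance cancel "$MNT"
```

Running the script with RUN=1 as root would actually perform the mount and the cancel.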

Mounting took about 30 minutes, and in the journalctl output of the live system I found this:

Feb 04 09:42:28 ubuntu kernel: INFO: task btrfs-transacti:7527 blocked for more than 120 seconds.
Feb 04 09:42:28 ubuntu kernel:       Not tainted 4.15.0-29-generic #31-Ubuntu
Feb 04 09:42:28 ubuntu kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 04 09:42:28 ubuntu kernel: btrfs-transacti D 0 7527 2 0x80000000
Feb 04 09:42:28 ubuntu kernel: Call Trace:
Feb 04 09:42:28 ubuntu kernel:  __schedule+0x291/0x8a0
Feb 04 09:42:28 ubuntu kernel:  schedule+0x2c/0x80
Feb 04 09:42:28 ubuntu kernel:  btrfs_commit_transaction+0x81d/0x8f0 [btrfs]
Feb 04 09:42:28 ubuntu kernel:  ? wait_woken+0x80/0x80
Feb 04 09:42:28 ubuntu kernel:  transaction_kthread+0x18d/0x1b0 [btrfs]
Feb 04 09:42:28 ubuntu kernel:  kthread+0x121/0x140
Feb 04 09:42:28 ubuntu kernel:  ? btrfs_cleanup_transaction+0x560/0x560 [btrfs]
Feb 04 09:42:28 ubuntu kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
Feb 04 09:42:28 ubuntu kernel:  ? do_syscall_64+0x73/0x130
Feb 04 09:42:28 ubuntu kernel:  ? SyS_exit_group+0x14/0x20

After rebooting, the server behaved normally. The only thing I could find in the journalctl output was:

Feb 04 02:00:02 server kernel: BTRFS info (device sda3): relocating block group 7246746484736 flags data|raid1
Feb 04 02:05:23 server kernel: BTRFS info (device sda3): found 3 extents
Feb 04 02:06:12 server kernel: BTRFS info (device sda3): found 3 extents
Feb 04 02:07:01 server kernel: BTRFS info (device sda3): relocating block group 7059915407360 flags metadata|raid1


Btrfs balancing starts at 02:00.

Can anybody give me a hint about what causes this?

I suspect some kind of hardware failure but can't find anything. Any idea where to look?


My setup:

Linux server 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

btrfs-progs v4.15.1

Label: 'rootfs'  uuid: cf8c4bb2-6a75-4e1d-983c-19583a93a546

        Total devices 3 FS bytes used 620.55GiB
        devid    1 size 923.13GiB used 446.03GiB path /dev/sdc3
        devid    2 size 923.13GiB used 449.00GiB path /dev/sda3
        devid    3 size 923.13GiB used 447.03GiB path /dev/sdb3

Data, RAID1: total=667.00GiB, used=617.65GiB
System, RAID1: total=32.00MiB, used=176.00KiB
Metadata, RAID1: total=4.00GiB, used=2.90GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


Dmesg output is not provided because there was nothing in it after the reboot.

Thanks

Moritz
