Hi list,

This weekend I had my first btrfs horror story.

system: 3.13.0-49-lowlatency, btrfs-progs v4.1.2

A disclaimer: I know 3.13 is very out of date, but the requirement of
keeping the kernel up to date clashes with my requirement of keeping a stable
system. At the moment I can't disturb my system as I'm doing important work;
upgrading the kernel requires upgrading Ubuntu, which would upgrade a lot of
packages and might lead to problems which I don't have time to fix. One
might argue that in the end I lost time anyway dealing with these btrfs
issues. When I'm done with this current work I will update the whole system,
which will update the kernel in the process.

Story:

btrfs fi show / -> devid 1 size 92.27GiB used 92.27GiB.

Suddenly a 100GB single-device fs goes into a state where it has no more
free space and no new files can be written. 'fi usage' says a minimum of
20GB is free, but metadata is 4.97GiB allocated vs 4.45GiB used. I decide
to do a 'balance -dusage=55', as lower values of usage don't balance
anything. This starts a balance of '0 out of 0 chunks' which goes on for 24h
(status always says 0 considered, -nan% left; dmesg has only 'relocating
block group 32...... flags 36'). This is an OCZ Vertex 3, a quite fast SSD,
so 24h seems excessive to me and I assume that the balance has gone wrong
somehow. I shut down the system to see if the balance will stop. On the first
boot the fs still mounts; on the second boot it no longer mounts.
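
To make the numbers above concrete, here is a small Python sketch of the
arithmetic. It only uses the figures quoted in this mail; the variable and
function names are mine, and it assumes that 'used' in 'fi show' means space
already allocated to chunks rather than space occupied by files.

```python
# Figures quoted in this mail, in GiB. Names are illustrative only.
device_size = 92.27         # 'btrfs fi show': devid 1 size
device_allocated = 92.27    # 'btrfs fi show': devid 1 used (allocated to chunks)
metadata_allocated = 4.97   # 'fi usage': metadata allocated
metadata_used = 4.45        # 'fi usage': metadata used


def headroom(total, used):
    """Return the remaining space in GiB, rounded to two decimals."""
    return round(total - used, 2)


# Space not yet assigned to any chunk: nothing is left for new chunks.
unallocated = headroom(device_size, device_allocated)

# Room left inside the metadata chunks that already exist.
metadata_free = headroom(metadata_allocated, metadata_used)

print(f"unallocated: {unallocated} GiB, metadata headroom: {metadata_free} GiB")
```

So even though about 20GB of data space was free, the device had no
unallocated space left and metadata had only about half a GiB to work with.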

I switch to a NixOS USB pen running kernel 4.1.6, with progs also up to date
(probably 4.1.x). Trying to mount the fs results in a kernel error (
http://pastebin.com/CzryecsX ).

Trying to mount with '-o recovery,ro' hangs the system; a reboot is needed.

I proceed to get the contents of the disk via 'restore'. I then run
btrfs-zero-log, but the fs still doesn't mount, so I run btrfs check --repair
a couple of times (log from before running with '--repair':
http://pastebin.com/VPZLjcXR).

I then try to mount with '-o recovery,ro' and it works!! (Thank you, btrfs
check!!) I proceed to get the data off the disk; this seems to go well, with
no errors in dmesg. I then try to mount without recovery,ro and again the
kernel hangs. One time I had the dmesg window open and was able to see that
it was something about ..._async_reclaim_metadata_space.
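
For reference, the order of the recovery steps described above can be written
down as a small Python sketch. It only builds and prints the command lines, it
does not run anything. The device and directory paths are hypothetical
placeholders, not the ones I used, and 'check --repair' in particular can make
things worse, so this is a record of what I did rather than a recommendation.

```python
def recovery_plan(device, restore_dir, mount_point):
    """Return the commands from this mail, in the order they were tried."""
    return [
        # 1. Copy files off the unmountable fs first.
        ["btrfs", "restore", device, restore_dir],
        # 2. Clear the log tree (did not make the fs mountable here).
        ["btrfs-zero-log", device],
        # 3. Attempt a repair (run a couple of times in my case).
        ["btrfs", "check", "--repair", device],
        # 4. Read-only recovery mount, which worked after the repair.
        ["mount", "-o", "recovery,ro", device, mount_point],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in recovery_plan("/dev/sdX2", "/mnt/rescue", "/mnt/broken"):
        print(" ".join(cmd))
```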

I finally give up on the filesystem and format it. I have an image produced
via btrfs-image if it is of interest.

A couple of notes:

1) I know btrfs is experimental; anything that I might have lost is my fault
(I didn't lose anything).
2) I think this free space problem is a known issue being worked on: when
there is no space left to allocate and more metadata space is needed, the fs
reports no free space. It's in the FAQ. Still, it doesn't seem acceptable to
me that the system can go into a state where in the end the fs is destroyed
or the Linux machine is unusable (can never turn it off because the balance
never stops), especially since there was still a lot of space on the disk. If
the system had somehow rebalanced itself automatically this could have been
avoided. It shouldn't let itself get into this corner. Perhaps a bit of space
should always be kept for emergency rebalancing?
3) I wonder if that balance would have ended or if it had just stalled. It
seems that a reboot with an ongoing or stalled balance will cause fs
destruction. Possibly an issue already fixed in later kernels. Also, why did
it say 0 considered, -nan% left? -nan% looks strange.
4) The system freezing when trying to mount a fs is presumably not supposed
to happen? (This was on the latest kernel.)

Just wanted to share the story in case it is of some use for developers.
Again, possibly this happened due to issues already fixed in later kernels.
Anyway, despite this issue, I'm still quite happy with btrfs overall.

Best regards,
Miguel Negrão
