On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
Hello!
First: Have a merry Christmas and enjoy a quiet time during these days.
Second: Whenever you feel like it, here is a little rant, but also a bug
report:
I have this on a 3.18 kernel on Debian Sid with a BTRFS dual-SSD RAID with
space_cache, skinny metadata extents – are these a problem? – and
compress=lzo:
(There is no known problem with skinny metadata; it's actually more
efficient than the older format. There have been some anecdotes about
mixing skinny and fat metadata, but nothing has ever been demonstrated
to be problematic.)
merkaba:~> btrfs fi sh /home
Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
Total devices 2 FS bytes used 144.41GiB
devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
This filesystem, at the allocation level, is "very full" (see below).
And I had hangs with BTRFS again. This time as I wanted to install tax
return software in a VirtualBox'd Windows XP VM (which I use once a year
because I know of no tax return software for Linux that would be suitable for
Germany, and I frankly don't care about the end of security support, since all
surfing and other network access I do from the Linux box and I only
run the VM behind a firewall).
And thus I try the balance dance again:
ITEM: Balance... it doesn't do what you think it does... 8-)
"Balancing" is something you should almost never need to do. It is only
for cases of changing geometry (adding disks, switching RAID levels,
etc.) of for cases when you've radically changed allocation behaviors
(like you decided to remove all your VM's or you've decided to remove a
mail spool directory full of thousands of tiny files).
People run balance all the time because they think they should. They are
_usually_ incorrect in that belief.
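(For the record, the sort of case where a balance _is_ the right tool is a
geometry change -- say you just added a second device to a single-device
filesystem and want everything rewritten as RAID1. A rough sketch, with
made-up device and mount names:

  # add the new device, then rewrite all block groups as RAID1
  btrfs device add /dev/sdb /mnt
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

That rewrites every block group on purpose, which is exactly the kind of
job balance exists for.)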
merkaba:~> btrfs balance start -dusage=5 -musage=5 /home
ERROR: error during balancing '/home' - No space left on device
ITEM: Running out of space during a balance is not running out of space
for files. BTRFS has two layers of allocation. That is, there are two
levels of abstraction where "no space" can occur.
The first level of allocation is the "making more BTRFS structures out
of raw device space".
The second level is "allocating space for files inside of existing BTRFS
structures".
Balance is the operation of relocating the BTRFS structures and
attempting to increase their order (coincidentally) while doing that.
So, for instance, "relocating block group some_number_here" requires
finding an unallocated expanse of disk and creating a new/empty block group
there of the current relevant block group size (typically data=1G or
metadata=256M if you didn't override these settings while making the
filesystem). You can _easily_ end up lacking a 1G contiguous expanse of
raw allocation space on a nearly-full filesystem.
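You can actually see both layers in the output you already posted:
btrfs fi show reports the raw-device layer (devid 1 size 160.00GiB used
160.00GiB -- i.e. zero unallocated space), while btrfs fi df reports usage
inside the already-allocated block groups. Roughly:

  # raw layer: per-device size vs. space handed out to block groups
  btrfs filesystem show /home
  # chunk layer: how full those block groups are
  btrfs filesystem df /home
  # both views at once, if your btrfs-progs is new enough (3.18+) to have it
  btrfs filesystem usage /home

With size == used on both devices there is no contiguous 1G expanse of raw
space left to create a new data block group in, hence the ENOSPC from balance.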
NOTE :: This does _not_ happen with other filesystems like EXT4 because
building those filesystems creates a static filesystem-level allocation.
That is, 100% of the disk that can be controlled by EXT4 (etc.) is
allocated and initialized at creation time (or on first mount, in the
case of EXT4).
BTRFS is intentionally different because it wants to be able to adapt as
your usage changes. If you first make millions of tiny files then you
will have a lot of metadata extents and virtually no data extents. If
you erase a lot of those and then start making large files the metadata
will tend to go away and then data extents will be created.
Being a chaotic system, you can get into some corner cases that suck,
but in terms of natural evolution it has more benefits than drawbacks.
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=5 -musage=5 /home
ERROR: error during balancing '/home' - No space left on device
There may be more info in syslog - try dmesg | tail
merkaba:~#1> btrfs balance start -dusage=5 /home
.... lots deleted for brevity ....
So basically I am rebalancing everything, without need I bet, causing
more churn on the SSDs than is needed.
Correct, though churn isn't really the issue.
Otherwise the alternative would be to make the BTRFS filesystem larger, I bet.
Correct.
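Since both devices sit on device-mapper (per your fi show output), growing it
would be roughly this -- a sketch, assuming those /dev/mapper names are plain
LVM LVs with free extents in the volume group (if there's dm-crypt in between,
that layer needs resizing too):

  # grow each backing LV...
  lvextend -L +20G /dev/mapper/msata-home
  lvextend -L +20G /dev/mapper/sata-home
  # ...then tell btrfs to claim the new space, per device id
  btrfs filesystem resize 1:max /home
  btrfs filesystem resize 2:max /home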
Well this is still not what I would consider stable. So I will still
Not a question of stability.
See, doing a balance is like doing a sliding-block puzzle. If there isn't
enough room to slide the blocks around then the blocks will not slide
around. You are just out of space, and that results in "out of space"
returns. This is not even an error, just a fact.
http://en.wikipedia.org/wiki/15_puzzle
Meditate on the above link. Then ask yourself what happens if you put in
the number 16. 8-)
The below recommendation is incorrect...
recommend: If you want to use BTRFS on a server and estimate 25 GiB of
usage, make the drive at least 50 GiB or even 100 GiB to be on the safe
side. That is what I recommended for SLES 11 SP2/SP3 BTRFS deployments – but
hey, they meanwhile say "don't", as in "just don't use it at all and use SLES
12 instead, because BTRFS on a 3.0 kernel with a ton of snapper snapshots
is really not anywhere near production or enterprise reliability" (if you
need proof, I think I still have a snapshot of a SLES 11 SP3 VM that broke
overnight due to me having installed an LDAP server for preparing some
training slides). Even a 3.12 kernel seems daring regarding BTRFS, unless
SUSE actively backports fixes.
In kernel log the failed attempts look like this:
Already covered.
[ 209.783437] BTRFS info (device dm-3): relocating block group 501238202368
flags 17
[ 210.116416] BTRFS info (device dm-3): relocating block group 501238202368
flags 17
My expectation for a *stable* and *production quality* filesystem would be:
I never ever get hangs with one kworker running at 100% of one Sandybridge
core *for minutes* on a production filesystem, and that's about it.
Now this is one of several other issues.
ITEM: An SSD plus a good fast controller and default system virtual
memory and disk scheduler activities can completely bog a system down.
You can get into a mode where the system begins doing synchronous writes
of vast expanses of dirty cache. The SSD is so fast that there is
effectively zero "wait for IO time" and the IO subsystem is effectively
locked or just plain busy.
Look at /proc/sys/vm/dirty_background_ratio which is probably set to 10%
of system ram.
You may need/want to change this number to something closer to 4. That's
not a hard suggestion. Some reading and analysis will be needed to find
the best possible tuning for an advanced system.
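If you want to poke at it, something along these lines (the numbers are
illustrative, not a recommendation):

  # see what you have now
  sysctl vm.dirty_background_ratio vm.dirty_ratio
  # start background writeback earlier, e.g. at 4% of RAM
  sysctl -w vm.dirty_background_ratio=4
  # make it stick across reboots
  echo 'vm.dirty_background_ratio = 4' > /etc/sysctl.d/99-writeback.conf

There are also vm.dirty_background_bytes / vm.dirty_bytes if you'd rather
set absolute limits than percentages on a machine with a lot of RAM.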
Especially for a filesystem that claims to still have a good amount of free
space:
merkaba:~> LANG=C df -hT /home
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/msata-home btrfs 160G 146G 25G 86% /home
It does have plenty of free space at the file-storage level. (Which is
not the "balance" level where raw disk is converted into file system
"data" or "metadata" extents.)
(yeah, these don't add up, I account this to compression, but hey, who knows)
No need to "account for" compression.
They add up fine, in the sense that they are separate domains for space
and so are not intended to be taken together. You will notice that you
are not getting "out of space" errors for actually creating/appending files.
In the kernel log I have things like this, though from some earlier time, and
these I have not yet perceived as hangs:
Dec 23 23:33:26 merkaba kernel: [23040.621678] ------------[ cut here
]------------
Dec 23 23:33:26 merkaba kernel: [23040.621792] WARNING: CPU: 3 PID: 308 at
fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]()
Dec 23 23:33:26 merkaba kernel: [23040.621796] Modules linked in: mmc_block ufs
qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c snd_usb_audio snd_usbmidi_lib
snd_rawmidi snd_seq_device hid_generic hid_pl ff_memless usbhid hid nls_utf8
nls_cp437 vfat fat uas usb_storage bnep bluetooth binfmt_misc cpufreq_userspace
cpufreq_stats pci_stub cpufreq_powersave vboxpci(O) cpufreq_conservative
vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ext4 crc16 mbcache jbd2 intel_rapl
x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel snd_hda_codec_hdmi iwldvm aesni_intel snd_hda_codec_conexant
mac80211 aes_x86_64 snd_hda_codec_generic lrw gf128mul glue_helper ablk_helper
cryptd psmouse snd_hda_intel serio_raw iwlwifi pcspkr lpc_ich snd_hda_controller
i2c_i801 mfd_core snd_hda_codec snd_hwdep cfg80211 snd_pcm snd_timer shpchp
thinkpad_acpi nvram snd soundcore rfkill battery ac tpm_tis tpm processor evdev
joydev sbs sbshc coretemp hdaps(O) tp_smapi(O) thinkpad_ec(O) loop firewire_sbp2
fuse ecryptfs autofs4 md_mod btrfs xor raid6_pq microcode dm_mirror dm_region_hash
dm_log dm_mod sg sr_mod sd_mod cdrom crc32c_intel ahci firewire_ohci libahci
sata_sil24 e1000e libata ptp sdhci_pci ehci_pci sdhci firewire_core ehci_hcd
crc_itu_t pps_core mmc_core scsi_mod usbcore usb_common thermal
Dec 23 23:33:26 merkaba kernel: [23040.621978] CPU: 3 PID: 308 Comm:
btrfs-transacti Tainted: G W O 3.18.0-tp520 #14
Dec 23 23:33:26 merkaba kernel: [23040.621982] Hardware name: LENOVO
42433WG/42433WG, BIOS 8AET63WW (1.43 ) 05/08/2013
Dec 23 23:33:26 merkaba kernel: [23040.621985] 0000000000000009
ffff8804044c7d88 ffffffff814a516e 0000000080000000
Dec 23 23:33:26 merkaba kernel: [23040.621992] 0000000000000000
ffff8804044c7dc8 ffffffff8103f83e ffff8804044c7db8
Dec 23 23:33:26 merkaba kernel: [23040.621999] ffffffffc04bd5a1
ffff880037590800 ffff8800a599c320 0000000000000000
Dec 23 23:33:26 merkaba kernel: [23040.622006] Call Trace:
Dec 23 23:33:26 merkaba kernel: [23040.622026] [<ffffffff814a516e>]
dump_stack+0x4f/0x7c
Dec 23 23:33:26 merkaba kernel: [23040.622034] [<ffffffff8103f83e>]
warn_slowpath_common+0x7c/0x96
Dec 23 23:33:26 merkaba kernel: [23040.622104] [<ffffffffc04bd5a1>] ?
btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622111] [<ffffffff8103f8ec>]
warn_slowpath_null+0x15/0x17
Dec 23 23:33:26 merkaba kernel: [23040.622164] [<ffffffffc04bd5a1>]
btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622211] [<ffffffffc047a830>]
btrfs_commit_transaction+0x394/0x8bc [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622254] [<ffffffffc0476dd5>]
transaction_kthread+0xf9/0x1af [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622295] [<ffffffffc0476cdc>] ?
btrfs_cleanup_transaction+0x43a/0x43a [btrfs]
Dec 23 23:33:26 merkaba kernel: [23040.622305] [<ffffffff8105697c>]
kthread+0xb2/0xba
Dec 23 23:33:26 merkaba kernel: [23040.622312] [<ffffffff814a0000>] ?
dcbnl_newmsg+0x14/0xa8
Dec 23 23:33:26 merkaba kernel: [23040.622317] [<ffffffff810568ca>] ?
__kthread_parkme+0x62/0x62
Dec 23 23:33:26 merkaba kernel: [23040.622324] [<ffffffff814a9f6c>]
ret_from_fork+0x7c/0xb0
Dec 23 23:33:26 merkaba kernel: [23040.622329] [<ffffffff810568ca>] ?
__kthread_parkme+0x62/0x62
Dec 23 23:33:26 merkaba kernel: [23040.622334] ---[ end trace 90db5b1c7067cf1d
]---
Not sure about either of these; they _could_ be unrelated bugs that have
since been fixed, since you say they've stopped happening.
Dec 23 23:33:56 merkaba kernel: [23070.671999] ------------[ cut here
]------------
Dec 23 23:33:56 merkaba kernel: [23070.672064] WARNING: CPU: 3 PID: 308 at
fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]()
Dec 23 23:33:56 merkaba kernel: [23070.672067] Modules linked in: mmc_block ufs
qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c snd_usb_audio
snd_usbmidi_lib snd_rawmidi snd_seq_device hid_generic hid_pl ff_memless usbhid
hid nls_utf8 nls_cp437 vfat fat uas usb_storage bnep bluetooth binfmt_misc
cpufreq_userspace cpufreq_stats pci_stub cpufreq_powersave vboxpci(O)
cpufreq_conservative vboxnetadp(O) vboxnetflt(O) vboxdrv(O) ext4 crc16 mbcache
jbd2 intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi iwldvm
aesni_intel snd_hda_codec_conexant mac80211 aes_x86_64 snd_hda_codec_generic
lrw gf128mul glue_helper ablk_helper cryptd psmouse snd_hda_intel serio_raw
iwlwifi pcspkr lpc_ich snd_hda_controller i2c_i801 mfd_core snd_hda_codec
snd_hwdep cfg80211 snd_pcm snd_timer shpchp thinkpad_acpi nvram snd soundcore
rfkill battery ac tpm_tis tpm processor evdev joydev sbs sbshc coretemp
hdaps(O) tp_smapi(O) thinkpad_ec(O) loop firewire_sbp2 fuse ecryptfs autofs4
md_mod btrfs xor raid6_pq
microcode dm_mirror dm_region_hash dm_log dm_mod sg sr_mod sd_mod cdrom
crc32c_intel ahci firewire_ohci libahci sata_sil24 e1000e libata ptp sdhci_pci
ehci_pci sdhci firewire_core ehci_hcd crc_itu_t pps_core mmc_core scsi_mod
usbcore usb_common thermal
Dec 23 23:33:56 merkaba kernel: [23070.672193] CPU: 3 PID: 308 Comm:
btrfs-transacti Tainted: G W O 3.18.0-tp520 #14
Dec 23 23:33:56 merkaba kernel: [23070.672196] Hardware name: LENOVO
42433WG/42433WG, BIOS 8AET63WW (1.43 ) 05/08/2013
Dec 23 23:33:56 merkaba kernel: [23070.672200] 0000000000000009
ffff8804044c7d88 ffffffff814a516e 0000000080000000
Dec 23 23:33:56 merkaba kernel: [23070.672205] 0000000000000000
ffff8804044c7dc8 ffffffff8103f83e ffff8804044c7db8
Dec 23 23:33:56 merkaba kernel: [23070.672209] ffffffffc04bd5a1
ffff880037590800 ffff8802cd6e50a0 0000000000000000
Dec 23 23:33:56 merkaba kernel: [23070.672214] Call Trace:
Dec 23 23:33:56 merkaba kernel: [23070.672222] [<ffffffff814a516e>]
dump_stack+0x4f/0x7c
Dec 23 23:33:56 merkaba kernel: [23070.672229] [<ffffffff8103f83e>]
warn_slowpath_common+0x7c/0x96
Dec 23 23:33:56 merkaba kernel: [23070.672264] [<ffffffffc04bd5a1>] ?
btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672270] [<ffffffff8103f8ec>]
warn_slowpath_null+0x15/0x17
Dec 23 23:33:56 merkaba kernel: [23070.672301] [<ffffffffc04bd5a1>]
btrfs_assert_delayed_root_empty+0x2d/0x2f [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672330] [<ffffffffc047a830>]
btrfs_commit_transaction+0x394/0x8bc [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672357] [<ffffffffc0476dd5>]
transaction_kthread+0xf9/0x1af [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672383] [<ffffffffc0476cdc>] ?
btrfs_cleanup_transaction+0x43a/0x43a [btrfs]
Dec 23 23:33:56 merkaba kernel: [23070.672389] [<ffffffff8105697c>]
kthread+0xb2/0xba
Dec 23 23:33:56 merkaba kernel: [23070.672395] [<ffffffff814a0000>] ?
dcbnl_newmsg+0x14/0xa8
Dec 23 23:33:56 merkaba kernel: [23070.672399] [<ffffffff810568ca>] ?
__kthread_parkme+0x62/0x62
Dec 23 23:33:56 merkaba kernel: [23070.672405] [<ffffffff814a9f6c>]
ret_from_fork+0x7c/0xb0
Dec 23 23:33:56 merkaba kernel: [23070.672409] [<ffffffff810568ca>] ?
__kthread_parkme+0x62/0x62
Dec 23 23:33:56 merkaba kernel: [23070.672412] ---[ end trace 90db5b1c7067cf1e
]---
Dec 23 23:34:26 merkaba kernel: [23100.709530] ------------[ cut here
]------------
The recent hangs today are not in the log; I was upset enough to
forcefully switch off the machine. Tax returns are not my all-time favorite,
but tax returns with a hanging filesystem are no fun at all.
I will upgrade to 3.19, starting with 3.19-rc2.
Let's see what this balance will do.
It currently is here:
merkaba:~> btrfs balance status /home
Balance on '/home' is running
32 out of about 164 chunks balanced (53 considered), 80% left
merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=142.10GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.33GiB
GlobalReserve, single: total=512.00MiB, used=254.31MiB
So on the one hand we are told not to balance needlessly, but then for
stable operation I need to balance nonetheless?
Nope. "needing to Balance" just isn't your problem. Being out of space
for new extents is your problem with the balancing you don't need to do.
Which is different than your VM update problem. And is also different
than your bursty, excessive caching problem.
I've also not seen you say you ever ran a btrfsck. Does a filesystem
check come up clean?
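(For the record, that would be something like the below, run against one of
the raw devices with the filesystem unmounted -- btrfs check is read-only
unless you explicitly add --repair, which you should not do without advice:

  umount /home
  btrfs check /dev/mapper/msata-home
)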
Well, let's see how it will improve things. Last time it did. Considerably.
BTRFS only had these hang problems with 3.15 and 3.16 if the trees allocated
all remaining space. So I expect it to downsize these trees so that some
device space is freed to be allocatable again.
Next I will also defrag the Windows VM image just as an additional safety
net.
Simply copying the file might help you for a while at least. But in the
long term "too much orderliness" for large files ends up being anti-helpful.
e.g. for disk_file.img:
  cp disk_file.img new_disk_file.img; rm disk_file.img; mv new_disk_file.img disk_file.img
Turning off Copy-on-write might be helpful. (This will turn off
compression as well) but it can be anti-helpful too depending on the VM
and how it's used.
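If you want to try No-COW on just that one image, the usual trick (a sketch;
the paths are made up) is to set the attribute on a new empty directory and
then copy the data in, because chattr +C only takes effect on files that
don't have data yet:

  cd /path/to/VMs                        # hypothetical location of the image
  mkdir nocow && chattr +C nocow         # new files created in here inherit +C
  cp --reflink=never winxp.img nocow/    # rewriting the data picks up the attribute
  # with the VM shut down, swap the copies
  mv winxp.img winxp.img.bak && mv nocow/winxp.img .

As noted above, on your compress=lzo mount that file will then also not be
compressed.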
As I learn more about the way BTRFS stores files, particularly deltas to
files, I come to suspect that the "best" storage model for a VM _might_
be exactly the opposite of the usual suggestions. (The most wasteful
possible storage cost grows like the Gauss sum of consecutive integers,
1 + 2 + ... + n = n(n+1)/2, where n is the number of consecutively stored
blocks in the file. Ouch. So a file that is reasonably segmented is "more
efficient".)
With a fast SSD, my research suggests that defragging the disk image is
bad. No-COW is good if you don't snapshot often, but each snapshot puts
the file through one round of COW, which kind of defeats the No-COW if you
do it very often.
But as near as I can tell, starting with an "empty" .qcow file and
growing the system step-wise and _never_ defragging that file tends to
create a chaotically natural expanse that won't hit these corner cases.
(Way more analysis needs to be done here for that to be a real answer.)
As I learn more I discover that being overly aggressive with balance and
defrag of large files is the opposite of good. The system seems to want
to develop a chaotic layout, and trying to make it orderly seems to make
things worse. For very large files like VM images it seems to amplify
the worst parts.
Okay, doing something else now, as BTRFS will hopefully sort things out.
To get good natural performance on my (non-SSD) system while running
VM(s), I set aside a chunk of the system RAM with the movablecore= kernel
boot option (about 1/4 to 1/3 of physical RAM) and turn down the dirty
background ratio to avoid large synchronous cache flush events.
YMMV.
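For reference, on a Debian-ish box that setup amounts to roughly this (the
4G figure is just an example for a 16GiB machine; scale it to that 1/4 to
1/3 rule of thumb):

  # /etc/default/grub -- then run update-grub and reboot
  GRUB_CMDLINE_LINUX_DEFAULT="quiet movablecore=4G"

  # /etc/sysctl.d/99-writeback.conf (same knob as mentioned further up)
  vm.dirty_background_ratio = 4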
Ciao,
Later.