Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)
Martin Steigerwald posted on Thu, 08 Jan 2015 11:18:40 +0100 as excerpted:

> Duncan, I *did* file a bug. I think you misunderstood me...

I understood that, and actually said as much:

> But the recommendation is to file the bugzilla report precisely so it
> does /not/ get lost, and you've done that, so... you've done your part
> there and now comes the enforced patience bit of waiting [...]

My point was simply that, based on the wiki recommendation and the earlier thread mentioned on the wiki, the reason /why/ a bugzi report is preferred over simply reporting it here is that the devs tend to pick bugs and spend some time digging into them, during which they don't look too closely at other reports here, which can get lost, while the bugzi report won't.

Which implies that a failure to respond, either to a thread here or a bug report there, is because they're busy working on other bugs, and that failure to respond immediately isn't to be seen as ignoring the problem -- it is in fact to be expected.

IOW, I was saying that now that the bug is filed, you can sit back and wait in reasonable assurance that it'll be processed in due time, as you've done your bit and now it's up to them to prioritize and process. That's a good thing, and I was commending you for taking the time to file the bug as well. =:^)

... While at the same time commiserating a bit, since I know from experience how hard the wait for a dev reply can be, and that the wait is a sort of enforced patience, since at least for a non-coder like me there's not much else one can do. =:^(

That said, now that I reread it, I can see how what I wrote could appear to be contingent on an assumed /future/ filing of a bug, and that it wasn't as clear as I intended that I was commending you for filing it already, basically saying: be patient, I know how hard the wait can be.

Words! They be tricky! =:^(

-- 
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the
program, he is your master.  Richard Stallman
Re: Regular rebalancing should be unnecessary? (Was: Re: BTRFS free space handling still needs more work: Hangs again)
On Friday, 9 January 2015, 11:04:32, Peter Waller wrote:

> Apologies to those receiving this twice.
>
> On 27 December 2014 at 09:30, Hugo Mills h...@carfax.org.uk wrote:
>> Now, since you're seeing lockups when the space on your disks is all
>> allocated I'd say that's a bug. However, you're the *only* person
>> who's reported this as a regular occurrence. Does this happen with
>> all filesystems you have, or just this one?
>
> I have experienced machine lockups on four separate cloud machines,
> and reported it in a few venues. I think I even reported it on this
> list in the past, but I can't find that right now. Here's a bug report
> to the Ubuntu kernel:
>
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349711
>
> With regular rebalancing and by ensuring the machines have 10% free
> disk (filesystem) space, I don't experience this. Yet I read in this
> thread that regular rebalancing shouldn't be necessary?
>
> FWIW, I'm trying to sell BTRFS to my colleagues, and they view it as a
> stupid filesystem, like the bad old Windows days when you had to
> defragment regularly. They then go on to say they have never
> experienced machine lockups on EXT* (over a fairly significant length
> of time). So what can I tell them? Are we just hitting a bug which is
> likely to get fixed, or must we regularly rebalance? Or is regular
> rebalancing incorrect and regular machine lockups are actually the
> expected behaviour? :-)

I think it should *not* be required. But my practical experience differs from what I think, as I described in great detail here and in this bug report [1]:

[Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

So far I have had these hangs *only* when BTRFS was not able to reserve previously unused and unreserved space on the devices for a new chunk; as long as BTRFS can still allocate a new chunk, it stays fast. That said, not in every situation where BTRFS can't do this does it go slow. So to me it seems that having no unreserved device space to allocate chunks from is a *necessary* but not a *sufficient* criterion for the "kworker uses up 100% of one core" issue I reported.

I suggest that you add your findings to the bug report and also share details there, as it may help to have more data available on when it happens. That said, still no BTRFS developer has looked into the kern.log with SysRq-T triggers I uploaded there.

Robert made a test case which easily triggers the behavior for him; I didn't yet take the time to try out this test case. Maybe you have a chance to? It's somewhere in this thread as a little shell script.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=90401#c0

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: Regular rebalancing should be unnecessary? (Was: Re: BTRFS free space handling still needs more work: Hangs again)
Apologies to those receiving this twice.

On 27 December 2014 at 09:30, Hugo Mills h...@carfax.org.uk wrote:
> Now, since you're seeing lockups when the space on your disks is all
> allocated I'd say that's a bug. However, you're the *only* person
> who's reported this as a regular occurrence. Does this happen with all
> filesystems you have, or just this one?

I have experienced machine lockups on four separate cloud machines, and reported it in a few venues. I think I even reported it on this list in the past, but I can't find that right now. Here's a bug report to the Ubuntu kernel:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349711

With regular rebalancing and by ensuring the machines have 10% free disk (filesystem) space, I don't experience this. Yet I read in this thread that regular rebalancing shouldn't be necessary?

FWIW, I'm trying to sell BTRFS to my colleagues, and they view it as a stupid filesystem, like the bad old Windows days when you had to defragment regularly. They then go on to say they have never experienced machine lockups on EXT* (over a fairly significant length of time). So what can I tell them? Are we just hitting a bug which is likely to get fixed, or must we regularly rebalance? Or is regular rebalancing incorrect and regular machine lockups are actually the expected behaviour? :-)
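(For context on the workaround being discussed: the "regular rebalancing" people run in this situation is usually a filtered balance that only rewrites mostly-empty data chunks, returning their space to the unallocated pool, rather than a full balance. A minimal sketch, assuming a mount point of /mnt and a btrfs-progs version with balance filters; the 50% threshold is illustrative:

  # compact data chunks that are less than 50% used, freeing the rest
  # back to the unallocated device pool
  btrfs balance start -dusage=50 /mnt

  # compare allocated vs. used space before and after
  btrfs fi show /mnt
  btrfs fi df /mnt

A lower -dusage value touches fewer chunks and finishes faster; a full unfiltered balance is rarely needed for this purpose.)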
Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)
On Thursday, 8 January 2015, 05:45:56, you wrote:
> Martin Steigerwald posted on Wed, 07 Jan 2015 20:08:50 +0100 as
> excerpted:
>> No BTRFS developers commented yet on this, neither in this thread nor
>> in the bug report at kernel.org I made.
>
> Just a quick general note on this point...
>
> There has in the past (and I believe referenced on the wiki) been dev
> comment to the effect that on the list they tend to find particular
> reports/threads and work on them until they either fix the issue or
> (when not urgent) decide it must wait for something else first. During
> the time they're busy pursuing such a report, they don't read others
> on the list very closely, and such list-only bug reports may thus get
> dropped on the floor and never worked on.
>
> The recommendation, then, is to report it to the list, and if it is
> not picked up right away and you plan on being around in a few
> weeks/months when they potentially get to it, file a bug on it, so it
> doesn't get dropped on the floor.

Duncan, I *did* file a bug.

[Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)
Martin Steigerwald posted on Wed, 07 Jan 2015 20:08:50 +0100 as excerpted:

> No BTRFS developers commented yet on this, neither in this thread nor
> in the bug report at kernel.org I made.

Just a quick general note on this point...

There has in the past (and I believe referenced on the wiki) been dev comment to the effect that on the list they tend to find particular reports/threads and work on them until they either fix the issue or (when not urgent) decide it must wait for something else first. During the time they're busy pursuing such a report, they don't read others on the list very closely, and such list-only bug reports may thus get dropped on the floor and never worked on.

The recommendation, then, is to report it to the list, and if it is not picked up right away and you plan on being around in a few weeks/months when they potentially get to it, file a bug on it, so it doesn't get dropped on the floor.

With the bugzilla.kernel.org report you've followed the recommendation, but the implication is that you won't necessarily get any comment right away, only later, when they're not immediately busy looking at some other bug. So lack of b.k.o comment in the immediate term doesn't mean they're ignoring the bug or don't value it; it just means they're hot on the trail of something else ATM, and it might take some time to get that first comment engagement.

But the recommendation is to file the bugzilla report precisely so it does /not/ get lost, and you've done that, so... you've done your part there, and now comes the enforced patience bit of waiting for that engagement. If it takes a while, I would keep the bug updated every kernel release or so, with a comment updating status.

(Meanwhile, I've seen no indication of such issues here. Most of my btrfs are 8-24 GiB each, all SSD, mostly dual-device btrfs raid1 for both data and metadata. Maybe I don't run those full enough, however. I do have three mixed-bg-mode sub-GiB btrfs, with one of them, a 256 MiB single-device dup-mode btrfs used as /boot, that tends to run reasonably full, but I've not seen a problem like that there, either. My use-case probably simply doesn't hit the problem.)

-- 
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the
program, he is your master.  Richard Stallman
Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)
On Wed, Jan 07, 2015 at 08:08:50PM +0100, Martin Steigerwald wrote:
> On Tuesday, 6 January 2015, 15:03:23, Zygo Blaxell wrote:
>> ext3 has a related problem when it's nearly full: it will try to
>> search gigabytes of block allocation bitmaps looking for a free
>> block, which can result in a single 'mkdir' call spending 45 minutes
>> reading a large, slow, 99.5% full filesystem.
>
> OK, that's for bitmap access. Ext4 uses extents.

...and the problem doesn't happen to the same degree on ext4 as it did on ext3.

>> So far I've found that problems start when free space drops below 1GB
>> (although it can go as low as 400MB) and problems stop when free
>> space gets above 1GB, even without resizing or balancing the
>> filesystem. I've adjusted free space monitoring thresholds
>> accordingly for now, and it seems to be keeping things working so
>> far.
>
> Just to see whether we are on the same terms: you talk about space
> that BTRFS has not yet reserved for chunks, i.e. the difference
> between size and used in btrfs fi sh, right?

The number I look at for this issue is statvfs() f_bavail (i.e. the "Available" column of /bin/df).

Before the empty-chunk-deallocation code, most of my filesystems would quickly reach a steady state where all space was allocated to chunks, and they stayed that way unless I had to downsize them. Now there is free (non-chunk) space on most of my filesystems.

I'll try monitoring btrfs fi df and btrfs fi show under the failing conditions and see if there are interesting correlations.
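(Since Zygo is watching statvfs() f_bavail, a minimal monitoring sketch in shell, assuming GNU coreutils df and the roughly 1 GiB threshold he describes; the mount point and threshold are illustrative:

  #!/bin/bash
  # warn when statvfs f_bavail (df "Avail", in 1K blocks) falls below 1 GiB
  MOUNT=/mnt/Work
  THRESHOLD_KIB=$((1024 * 1024))
  avail=$(df --output=avail "$MOUNT" | tail -n 1)
  if (( avail < THRESHOLD_KIB )); then
      echo "WARNING: $MOUNT has only ${avail} KiB available"
  fi

Note this measures the space the kernel reports as available, which on btrfs is not the same thing as unallocated device space; the check sketched after Martin's next message looks at the latter.)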
Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)
On Tuesday, 6 January 2015, 15:03:23, Zygo Blaxell wrote:
> On Mon, Dec 29, 2014 at 10:32:00AM +0100, Martin Steigerwald wrote:
>> On Sunday, 28 December 2014, 21:07:05, Zygo Blaxell wrote:
>>> On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
>>> […]
>> Zygo, what are the characteristics of your filesystem? Do you use
>> compress=lzo and skinny metadata as well? How are the chunks
>> allocated? What kind of data do you have on it?
>
> compress-force (default zlib), no skinny-metadata. Chunks are
> d=single, m=dup. Data is a mix of various desktop applications, most
> active file sizes from a few hundred K to a few MB, maybe 300k-400k
> files. No database or VM workloads. Filesystem is 100GB and is usually
> between 98 and 99% full (about 1-2GB free). I have another filesystem
> which has similar problems when it's 99.99% full (it's 13TB, so 0.01%
> is 1.3GB). That filesystem is RAID1 with skinny-metadata and no-holes.
> On various filesystems I have the above CPU-burning problem, a bunch
> of irreproducible random crashes, and a hang with a kernel stack that
> goes through SyS_unlinkat and btrfs_evict_inode.
>
>> Zygo, thanks. That desktop filesystem sounds a bit similar to my use
>> case, with the interesting difference that you have no databases or
>> VMs on it. That said, I use the Windows XP VM rarely, but using it
>> was what made the issue so visible for me. Is your desktop filesystem
>> on SSD?
>
> No, but I recently stumbled across the same symptoms on an 8GB SD card
> on kernel 3.12.24 (Raspberry Pi). When the filesystem got over ~97%
> full, all accesses were blocked for several minutes. I was able to
> work around it by adjusting the threshold on a garbage collector
> daemon (i.e. deleting a lot of expendable files) to keep usage below
> 90%. I didn't try to balance the filesystem, and didn't seem to need
> to.

Interesting.

> ext3 has a related problem when it's nearly full: it will try to
> search gigabytes of block allocation bitmaps looking for a free block,
> which can result in a single 'mkdir' call spending 45 minutes reading
> a large, slow, 99.5% full filesystem.

OK, that's for bitmap access. Ext4 uses extents. BTRFS can use bitmaps as well, but also supports extents and I think uses them for most use cases.

> I'd expect a btrfs filesystem that was nearly full to have a small
> tree of cached free space extents and be able to search it quickly,
> even if the result is negative (i.e. there's no free space). It seems
> to be doing something else... :-P

Yeah :)

>> Do you have the chance to extend one of the affected filesystems to
>> check my theory that this does not happen as long as BTRFS can still
>> allocate new data chunks? If it's right, your FS should be fluent
>> again as long as you see more than 1 GiB free between size and used
>> in btrfs fi sh:
>>
>> Label: none  uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
>>         Total devices 2 FS bytes used 512.00KiB
>>         devid    1 size 10.00GiB used 6.53GiB path /dev/mapper/sata-btrfsraid1
>>         devid    2 size 10.00GiB used 6.53GiB path /dev/mapper/msata-btrfsraid1
>>
>> I suggest going with at least 2-3 GiB, as BTRFS may allocate just one
>> chunk so quickly that you do not have the chance to recognize the
>> difference.
>
> So far I've found that problems start when free space drops below 1GB
> (although it can go as low as 400MB) and problems stop when free space
> gets above 1GB, even without resizing or balancing the filesystem.
> I've adjusted free space monitoring thresholds accordingly for now,
> and it seems to be keeping things working so far.

Just to see whether we are on the same terms: you talk about space that BTRFS has not yet reserved for chunks, i.e. the difference between size and used in btrfs fi sh, right? No BTRFS developer has commented yet on this, neither in this thread nor in the bug report at kernel.org I made.

Well, and if that works for you, we are back to my recommendation: more so than with other filesystems, give BTRFS plenty of free space to operate with -- ideally enough that you always have a minimum of 2-3 GiB of unused device space left for chunk reservation.

One could even do some Nagios/Icinga monitoring plugin for that :)

Thanks,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
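(Since Martin suggests a Nagios/Icinga plugin for exactly this, here is a minimal sketch of such a check in shell. It assumes the GiB-denominated "devid ... size X used Y" lines of btrfs fi show as seen above and a 2 GiB warning threshold; the parsing of the output format is an assumption and would need adapting for other unit suffixes:

  #!/bin/bash
  # Nagios-style check: warn when unallocated btrfs device space
  # (sum over devices of size minus allocated) drops below 2 GiB
  MOUNT=${1:-/home}
  WARN_GIB=2
  unalloc=$(btrfs fi show "$MOUNT" | awk '
      /devid/ { gsub(/GiB/, ""); free += $4 - $6 }
      END     { printf "%.2f", free }')
  if awk -v f="$unalloc" -v w="$WARN_GIB" 'BEGIN { exit !(f < w) }'; then
      echo "WARNING: only ${unalloc} GiB unallocated on $MOUNT"
      exit 1   # Nagios WARNING state
  fi
  echo "OK: ${unalloc} GiB unallocated on $MOUNT"
  exit 0

Newer btrfs-progs can report unallocated space directly via btrfs fi usage, which would make the parsing more robust.)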
Re: BTRFS free space handling still needs more work: Hangs again
On Sunday, 28 December 2014, 16:27:41, Robert White wrote:
> On 12/28/2014 07:42 AM, Martin Steigerwald wrote:
>> On Sunday, 28 December 2014, 06:52:41, Robert White wrote:
>>> On 12/28/2014 04:07 AM, Martin Steigerwald wrote:
>>>> On Saturday, 27 December 2014, 20:03:09, Robert White wrote:
>>>>> Now: The complaining party has verified the minimum, repeatable
>>>>> case of simple file allocation on a very fragmented system, and
>>>>> the responding party and several others have understood and
>>>>> supported the bug.
>>>>
>>>> I didn't yet provide such a test case. My bad. At the moment I can
>>>> only reproduce this "kworker thread using a CPU for minutes" case
>>>> with my /home filesystem. A minimal test case for me would be to be
>>>> able to reproduce it with a fresh BTRFS filesystem. But yet with my
>>>> test case on the fresh BTRFS I get 4800 instead of 270 IOPS.
>>>
>>> A version of the test case to demonstrate absolutely system-clogging
>>> loads is pretty easy to construct.
>>>
>>> Make a raid1 filesystem. Balance it once to make sure the seed
>>> filesystem is fully integrated. Create a bunch of small files that
>>> are at least 4K in size, but are randomly sized. Fill the entire
>>> filesystem with them.
>>>
>>> BASH script:
>>>
>>> typeset -i counter=0
>>> while dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) count=1 2>/dev/null
>>> do
>>>   echo $counter >/dev/null #basically a noop
>>> done
>>>
>>> The while will exit when the dd encounters a full filesystem. Then
>>> delete ~10% of the files with "rm *0". Run the while loop again,
>>> then delete a different 10% with "rm *1". Then again with "rm *2",
>>> etc... Do this a few times, and with each iteration the CPU usage
>>> gets worse and worse. You'll easily get system-wide stalls on all IO
>>> tasks lasting ten or more seconds.
>>
>> Thanks Robert. That's wonderful. I wondered about such a test case
>> already and thought about reproducing it just with fallocate calls
>> instead, to reduce the amount of actual writes done. I.e. just do
>> some silly fallocate, truncating, writing of just some parts with dd
>> seek, and removing things again kind of workload.
>>
>> Feel free to add your test case to the bug report:
>>
>> [Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge
>> core for minutes on random write into big file
>> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>>
>> Anything that helps a BTRFS developer to reproduce it will make it
>> easier to find and fix the root cause.
>>
>> I think I will try with this little critter:
>>
>> merkaba:/mnt/btrfsraid1> cat freespracefragment.sh
>> #!/bin/bash
>> TESTDIR=./test
>> mkdir -p $TESTDIR
>> typeset -i counter=0
>> while true; do
>>     fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((++counter))
>>     echo $counter >/dev/null #basically a noop
>> done
>
> If you don't do the remove/delete passes you won't get as much
> fragmentation... I also noticed that fallocate would not actually
> create the files in my toolset, so I had to touch them first. So the
> theoretical script became e.g.
>
> typeset -i counter=0
> for AA in {0..9}
> do
>   while touch ${TESTDIR}/$((++counter))
>         fallocate -l $((4096 + $RANDOM)) ${TESTDIR}/$((counter))
>   do
>     if ((counter%100 == 0))
>     then
>       echo $counter
>     fi
>   done
>   echo removing ${AA}
>   rm ${TESTDIR}/*${AA}
> done

Hmmm, strange. It did create them here. I had a ton of files in the test directory.

> Meanwhile, on my test rig using fallocate did _not_ result in final
> exhaustion of resources. That is, btrfs fi df /mnt/Work didn't show
> significant changes on a near-full expanse.

Hmmm, I had it running up to the point where it had allocated about 5 GiB in the data chunks. But I stopped it yesterday. It took a long time to get there. It seems to be quite slow at filling a 10 GiB RAID-1 BTRFS. I bet that may be due to lots of forks for the fallocate command.

But it seems my fallocate works differently than yours. I have fallocate from:

merkaba:~> fallocate --version
fallocate von util-linux 2.25.2

> I also never got a failed response back from fallocate, that is, the
> inner loop never terminated. This could be a problem with the system
> call itself or it could be a problem with the application wrapper.

Hmmm, it should return a failure like this:

merkaba:/mnt/btrfsraid1> LANG=C fallocate -l 20G 20g
fallocate: fallocate failed: No space left on device
merkaba:/mnt/btrfsraid1#1> echo $?
1

> Nor did I reach the CPU saturation I expected.

No, I didn't reach it either. Just 5% or so for the script itself, and I didn't see any notable kworker activity. But I stopped it before the filesystem was full.

> e.g.
>
> Gust vm # btrfs fi df /mnt/Work/
> Data, RAID1: total=1.72GiB, used=1.66GiB
> System, RAID1: total=32.00MiB, used=16.00KiB
> Metadata, RAID1: total=256.00MiB, used=57.84MiB
> GlobalReserve, single: total=32.00MiB, used=0.00B
>
> ...time passes while script running...
>
> Gust vm # btrfs fi df /mnt/Work/
> Data, RAID1: total=1.72GiB, used=1.66GiB
> System, RAID1: total=32.00MiB,
Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)
On Sunday, 28 December 2014, 14:56:21, Martin Steigerwald wrote:
> On Sunday, 28 December 2014, 14:40:32, Martin Steigerwald wrote:
>> On Sunday, 28 December 2014, 14:00:19, Martin Steigerwald wrote:
>>> On Saturday, 27 December 2014, 14:55:58, Martin Steigerwald wrote:
>>>> Summarized at Bug 90401 - btrfs kworker thread uses up 100% of a
>>>> Sandybridge core for minutes on random write into big file
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>>>> see below. This is reproducible with fio; no need for Windows XP in
>>>> Virtualbox to reproduce the issue. Next I will try to reproduce it
>>>> with a freshly created filesystem.
>>>> [...]
Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)
On Sunday, 28 December 2014, 21:07:05, Zygo Blaxell wrote:
> On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
>> My simple test case didn't trigger it, and I do not have another
>> twice 160 GiB available on these SSDs to try with a copy of my home
>> filesystem. Then I could safely test without bringing the desktop
>> session to a halt. Maybe someone has an idea on how to enhance my
>> test case in order to reliably trigger the issue. It may be
>> challenging, though. My /home is quite a filesystem. It has a maildir
>> with at least a million files (yeah, I am performance testing KMail
>> and Akonadi to the limit as well!), and it has git repos and this one
>> VM image, and the desktop search and the Akonadi database. In other
>> words: it has been hit nicely with various, mostly random I think,
>> workloads over the last roughly six months. I bet it's not that easy
>> to simulate. Maybe some runs of compilebench to age the filesystem
>> before the fio test?
>>
>> That said, BTRFS performs a lot better. The complete lockups without
>> any CPU usage of 3.15 and 3.16 are gone for sure. That's wonderful.
>> But there is this kworker issue now. I noticed it that gravely just
>> while trying to complete this tax returns stuff with the Windows XP
>> VM. It may have happened otherwise -- I have seen some backtraces in
>> kern.log -- but it didn't last for minutes. So this indeed is of less
>> severity than the full lockups with 3.15 and 3.16.
>>
>> Zygo, what are the characteristics of your filesystem? Do you use
>> compress=lzo and skinny metadata as well? How are the chunks
>> allocated? What kind of data do you have on it?
>
> compress-force (default zlib), no skinny-metadata. Chunks are
> d=single, m=dup. Data is a mix of various desktop applications, most
> active file sizes from a few hundred K to a few MB, maybe 300k-400k
> files. No database or VM workloads. Filesystem is 100GB and is usually
> between 98 and 99% full (about 1-2GB free).
>
> I have another filesystem which has similar problems when it's 99.99%
> full (it's 13TB, so 0.01% is 1.3GB). That filesystem is RAID1 with
> skinny-metadata and no-holes.
>
> On various filesystems I have the above CPU-burning problem, a bunch
> of irreproducible random crashes, and a hang with a kernel stack that
> goes through SyS_unlinkat and btrfs_evict_inode.

Zygo, thanks. That desktop filesystem sounds a bit similar to my use case, with the interesting difference that you have no databases or VMs on it. That said, I use the Windows XP VM rarely, but using it was what made the issue so visible for me. Is your desktop filesystem on SSD?

Do you have the chance to extend one of the affected filesystems to check my theory that this does not happen as long as BTRFS can still allocate new data chunks? If it's right, your FS should be fluent again as long as you see more than 1 GiB free between size and used in btrfs fi sh:

Label: none  uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
        Total devices 2 FS bytes used 512.00KiB
        devid    1 size 10.00GiB used 6.53GiB path /dev/mapper/sata-btrfsraid1
        devid    2 size 10.00GiB used 6.53GiB path /dev/mapper/msata-btrfsraid1

I suggest going with at least 2-3 GiB, as BTRFS may allocate just one chunk so quickly that you do not have the chance to recognize the difference.

Well, and if that works for you, we are back to my recommendation: more so than with other filesystems, give BTRFS plenty of free space to operate with -- ideally enough that you always have a minimum of 2-3 GiB of unused device space left for chunk reservation.

One could even do some Nagios/Icinga monitoring plugin for that :)

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
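(For anyone wanting to test Martin's theory the way he suggests, a sketch of extending the filesystem, assuming it sits on LVM with free extents in the volume group; device and mount point names are taken from his example and are illustrative:

  # grow the backing logical volume by 3 GiB
  lvextend -L +3G /dev/mapper/sata-btrfsraid1

  # let btrfs device 1 grow into the new space
  btrfs filesystem resize 1:max /mnt/btrfsraid1

  # 'size' should now exceed 'used', leaving room for new chunks
  btrfs fi show /mnt/btrfsraid1

If the theory holds, the hangs should stop as soon as a few GiB of unallocated device space are available again.)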
Re: BTRFS free space handling still needs more work: Hangs again
On Sunday, 28 December 2014, 18:04:31, Patrik Lundquist wrote:
> On 28 December 2014 at 13:03, Martin Steigerwald mar...@lichtvoll.de wrote:
>> BTW, I found that the Oracle blog didn't work at all for me. I
>> completed a cycle of defrag, sdelete -c and VBoxManage compact, [...]
>> and it apparently did *nothing* to reduce the size of the file.
>
> They've changed the argument to -z; sdelete -z.

Now how cute is that. Thank you. This did the trick:

martin@merkaba:~/.VirtualBox/HardDisks> VBoxManage modifyhd Winlala.vdi --compact
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
martin@merkaba:~/.VirtualBox/HardDisks> ls -lh
insgesamt 12G
-rw------- 1 martin martin 12G Dez 29 11:00 Winlala.vdi

It had been 20 GiB before.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
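(Piecing together the full shrink workflow from this exchange, for reference; the VDI path is the one from the transcript above, and the guest-side steps are assumptions based on the Oracle blog procedure being discussed:

  # 1. inside the Windows guest: defragment the guest filesystem
  #      defrag C:
  # 2. inside the Windows guest: zero the free space so the host can
  #    reclaim it
  #      sdelete -z C:
  # 3. on the host: compact the zeroed image
  VBoxManage modifyhd ~/.VirtualBox/HardDisks/Winlala.vdi --compact

Older sdelete versions used -c for this step; as Patrik notes, newer ones use -z.)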
Re: BTRFS free space handling still needs more work: Hangs again
On Saturday, 27 December 2014, 16:06:13, Robert White wrote:
>>> I also don't know what kind of tool you are using, but it might be
>>> repeatedly trying and failing to fallocate the file as a single
>>> extent or something equally dumb.
>>
>> Userspace doesn't, as far as I know, get to make that decision. I've
>> just read the fallocate(2) man page, and it says nothing at all about
>> the contiguity of the extent(s) of storage allocated by the call.
>
> Yep, my bad. But as soon as I saw that fio was starting two threads,
> one doing random read/write and another doing sequential read/write,
> both on the same file, it set off my "not just creating a file"
> mindset. Given the delayed write into/through the cache normally done
> by casual file IO, it seemed likely that fio would be doing something
> more aggressive (like using O_DIRECT or repeated fdatasync(), which
> could get very tit-for-tat).

Robert, please get to know fio, or *ask*, before jumping to conclusions. I used this:

[global]
bs=4k
#ioengine=libaio
#iodepth=4
size=4g
#direct=1
runtime=120
filename=ssd.test.file

#[seq-write]
#rw=write
#stonewall

[rand-write]
rw=randwrite
stonewall

In the first test I still tested seq-write, but did you notice the stonewall parameter? It *separates* the two jobs from one another. I.e. fio may be starting two threads, as I think it prepares all threads in advance, yet it executed only *one* at a time. From the manpage of fio:

stonewall, wait_for_previous
    Wait for preceding jobs in the job file to exit before starting
    this one. stonewall implies new_group.

(That said, the first stonewall isn't even needed, but I removed the read jobs from the ssd-test.fio example I used for this job and didn't remember to remove the statement.)

Thank you a lot for your input. I learned some things from it. For example, that the trees for the data handling are in the metadata section. And now I am very clear that btrfs fi df does not display any trees, but the chunk reservation and usage. I think I knew this before, but I somehow thought that was combined with the trees; it isn't, at least not in place -- the trees are stored in the metadata chunks. I'd still not call these "extents", though, cause that's a file-based thing to all I know.

I'll skip theorizing about algorithms here. I prefer to let measurements speak and try to understand those. The best approach to understanding the ones I made, I think, is what Hugo suggested: a developer looks at the sysrq-t outputs. So I personally won't speculate any further about given or not given algorithmic limitations of BTRFS.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: BTRFS free space handling still needs more work: Hangs again
On Saturday, 27 December 2014, 20:03:09, Robert White wrote:
> On 12/27/2014 05:01 PM, Bardur Arantsson wrote:
>> On 2014-12-28 01:25, Robert White wrote:
>>> On 12/27/2014 08:01 AM, Martin Steigerwald wrote:
>>>> From how you write I get the impression that you think everyone
>>>> else besides you is just silly and dumb. Please stop this
>>>> assumption. I may not always get terms right, and I may make a
>>>> mistake, as with the wrong df figure. But I also highly dislike
>>>> feeling treated like someone who doesn't know a thing.
>>>
>>> Nope. I'm a systems theorist and I demand/require variable
>>> isolation. Not a question of silly or dumb but a question of
>>> speaking with sufficient precision and clarity. For instance you
>>> speak of having an "impression" and then decide I've made an
>>> "assumption". I define my position. Explain my terms. Give my
>>> examples. I also risk being utterly wrong, because sometimes being
>>> completely wrong gets others to cut away misconceptions and
>>> assumptions. It annoys some people, but it gets results.
>>
>> Can you please stop this bullshit posturing nonsense? It accomplishes
>> nothing -- if you're right, your other posts will stand for
>> themselves and show that you are indeed the shit when it comes to
>> these matters, but this post (so far, didn't read further)
>> accomplishes nothing other than (possibly) convincing everyone that
>> you're a pompous/self-important ass.
>
> Really? "accomplishes nothing"? 24 hours ago the complaining party was
> talking about:
>
> - Windows XP
> - Tax software
> - Virtual box
> - vdi files
> - defragging
> - balancing
> - data trees
> - system hanging
>
> And the responding party was saying "you are the only person reporting
> this as a regular occurrence", with the implication that the report
> was a duplicate or at least might not get much immediate attention.
>
> Now: The complaining party has verified the minimum, repeatable case
> of simple file allocation on a very fragmented system, and the
> responding party and several others have understood and supported the
> bug.

It was repeatable before. That I go from the application case to a simulated workload case is only natural. Or do you run fio or other load-testing apps as a part of your daily work on your computer (unless you are actually diagnosing performance issues)?

I still *use* the computer with applications. And if that's where I see the performance issue, I report it as such. Then I think about the kind of workload it creates and go from there to simplify it into a reproducible case.

At least I read mail, browse the web, run a VM, and do these kinds of things as daily computer usage, and thus it's likely that performance issues show up like this. Heck, even my server does mail and Owncloud and things. I only use workload generation tools during my teachings or when analysing things, not as part of my daily computer usage. And that doesn't make using a VM any less valid. And if it basically crawls BTRFS to a halt, I report this. It's actually that easy.

> That's not "accomplishing nothing", that's called engaging in
> diagnostics instead of dismissing a complaint, and sticking out the
> diagnostic process until everyone is on the same page. I never
> dismissed Martin. I never disbelieved him. I went through his elements
> one at a time, with examples of what I was taking away from him and
> why they didn't match expectations and experimental evidence. We
> adjusted our positions and communications.

Robert, I received this differently. I received your input partly as wronging me. Granted, that motivated me even more to prove things. But I highly dislike this kind of motivation, as I think I am motivated by myself. I like finding the causes of performance bottlenecks. And I prefer positive motivation over negative.

> So you can call it bullshit posturing nonsense, but I see taking less
> than a day to get to the bottom of a bug report that might not have
> gotten significant attention.

And you attribute all of this to your argumentation? That's bold.

See, Robert, your arguments helped clear up my understanding in some parts, especially on the terms I have not been very familiar with. I am grateful for that. It even helped motivate me to do the further tests, as I got the impression that you had just been arguing that what I am seeing is just the way BTRFS necessarily is, *algorithmically*, and that I was just using it wrongly.

But that said: I have an interest myself in resolving this. I was prepared to give additional input at a given time. But right on this day I was just fed up with things. It motivated me to prove the abysmal performance behaviour in a certain workload. Robert, your arguments contributed, that's true. But still, I did the work of the actual measurements. I spent the hours on doing the measurements, with a slight risk of having to restore from backup in case BTRFS messed things up. I was the one bringing BTRFS to the limits where it actually shows an issue, instead of theorizing about
Re: BTRFS free space handling still needs more work: Hangs again
On Saturday, 27 December 2014, 20:03:09, Robert White wrote:
> Now: The complaining party has verified the minimum, repeatable case
> of simple file allocation on a very fragmented system, and the
> responding party and several others have understood and supported the
> bug.

I didn't yet provide such a test case. At the moment I can only reproduce this "kworker thread using a CPU for minutes" case with my /home filesystem. A minimal test case for me would be to be able to reproduce it with a fresh BTRFS filesystem. But yet with my test case on the fresh BTRFS I get 4800 instead of 270 IOPS.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: BTRFS free space handling still needs more work: Hangs again (further tests)
On Saturday, 27 December 2014, 14:55:58, Martin Steigerwald wrote:
> Summarized at Bug 90401 - btrfs kworker thread uses up 100% of a
> Sandybridge core for minutes on random write into big file
> https://bugzilla.kernel.org/show_bug.cgi?id=90401
> see below. This is reproducible with fio; no need for Windows XP in
> Virtualbox to reproduce the issue. Next I will try to reproduce it
> with a freshly created filesystem.
>
> On Saturday, 27 December 2014, 09:30:43, Hugo Mills wrote:
>> On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
>>> On Friday, 26 December 2014, 14:48:38, Robert White wrote:
>>>> On 12/26/2014 05:37 AM, Martin Steigerwald wrote:
>>>>> Hello!
>>>>>
>>>>> First: Have a merry christmas and enjoy a quiet time in these
>>>>> days.
>>>>>
>>>>> Second: At a time you feel like it, here is a little rant, but
>>>>> also a bug report: I have this on 3.18 kernel on Debian Sid with
>>>>> BTRFS Dual SSD RAID with space_cache, skinny metadata extents --
>>>>> are these a problem? -- and compress=lzo:
>>>>
>>>> (There is no known problem with skinny metadata; it's actually more
>>>> efficient than the older format. There have been some anecdotes
>>>> about mixing the skinny and fat metadata, but nothing has ever been
>>>> demonstrated problematic.)
>>>>
>>>>> merkaba:~> btrfs fi sh /home
>>>>> Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
>>>>>         Total devices 2 FS bytes used 144.41GiB
>>>>>         devid    1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
>>>>>         devid    2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
>>>>> Btrfs v3.17
>>>>>
>>>>> merkaba:~> btrfs fi df /home
>>>>> Data, RAID1: total=154.97GiB, used=141.12GiB
>>>>> System, RAID1: total=32.00MiB, used=48.00KiB
>>>>> Metadata, RAID1: total=5.00GiB, used=3.29GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> This filesystem, at the allocation level, is very full (see below).
>>
>>>>> And I had hangs with BTRFS again. This time as I wanted to install
>>>>> tax return software in a Virtualbox'd Windows XP VM (which I use
>>>>> once a year cause I know no tax return software for Linux which
>>>>> would be suitable for Germany, and I frankly don't care about the
>>>>> end of security cause all surfing and other network access I will
>>>>> do from the Linux box and I only run the VM behind a firewall).
>>>>> And thus I try the balance dance again:
>>>>
>>>> ITEM: Balance... it doesn't do what you think it does... 8-)
>>>>
>>>> Balancing is something you should almost never need to do. It is
>>>> only for cases of changing geometry (adding disks, switching RAID
>>>> levels, etc.) or for cases when you've radically changed allocation
>>>> behaviors (like you decided to remove all your VMs, or you've
>>>> decided to remove a mail spool directory full of thousands of tiny
>>>> files).
>>>>
>>>> People run balance all the time because they think they should.
>>>> They are _usually_ incorrect in that belief.
>>>
>>> I only see the lockups of BTRFS if the trees *occupy* all space on
>>> the device.
>>
>> No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata
>> space. What's more, balance does *not* balance the metadata trees.
>> The remaining space -- 154.97 GiB -- is unstructured storage for file
>> data, and you have some 13 GiB of that available for use.
>>
>> Now, since you're seeing lockups when the space on your disks is all
>> allocated, I'd say that's a bug. However, you're the *only* person
>> who's reported this as a regular occurrence. Does this happen with
>> all filesystems you have, or just this one?
>>
>>> I *never* so far saw it lock up if there is still space BTRFS can
>>> allocate from to *extend* a tree.
>>
>> It's not a tree. It's simply space allocation. It's not even space
>> *usage* you're talking about here -- it's just allocation (i.e. the
>> FS saying "I'm going to use this piece of disk for this purpose").
>>
>>> This may be a bug, but this is what I see. And no amount of "you
>>> should not balance a BTRFS" will make that perception go away. See,
>>> I see the sun coming out in the morning and you tell me no, it
>>> doesn't. Simply put, that is not going to match my perception.
>>
>> Duncan's assertion is correct in its detail. Looking at your space
>> Robert's :) usage, I would not suggest that running a balance is
>> something you need to do.
>>
>> Now, since you have these lockups that seem quite repeatable, there's
>> probably a lurking bug in there, but hacking around with balance
>> every time you hit it isn't going to get the problem solved properly.
>> I think I would suggest the following:
>>
>> - make sure you have some way of logging your dmesg permanently (use
>>   a different filesystem for /var/log, or a serial console, or a
>>   netconsole)
>> - when the lockup happens, hit Alt-SysRq-t a few times
>> - send the dmesg output here, or post to bugzilla.kernel.org
>>
>> That's probably going to give enough
Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare)
On Sunday, 28 December 2014, 14:00:19, Martin Steigerwald wrote:
> On Saturday, 27 December 2014, 14:55:58, Martin Steigerwald wrote:
>> Summarized at Bug 90401 - btrfs kworker thread uses up 100% of a
>> Sandybridge core for minutes on random write into big file
>> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>> see below. This is reproducible with fio; no need for Windows XP in
>> Virtualbox to reproduce the issue. Next I will try to reproduce it
>> with a freshly created filesystem.
>> [...]
Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)
On Sunday, 28 December 2014, 14:40:32, Martin Steigerwald wrote:
> On Sunday, 28 December 2014, 14:00:19, Martin Steigerwald wrote:
>> On Saturday, 27 December 2014, 14:55:58, Martin Steigerwald wrote:
>>> Summarized at Bug 90401 - btrfs kworker thread uses up 100% of a
>>> Sandybridge core for minutes on random write into big file
>>> https://bugzilla.kernel.org/show_bug.cgi?id=90401
>>> see below. This is reproducible with fio; no need for Windows XP in
>>> Virtualbox to reproduce the issue. Next I will try to reproduce it
>>> with a freshly created filesystem.
>>> [...]
Re: BTRFS free space handling still needs more work: Hangs again
On 12/28/2014 04:07 AM, Martin Steigerwald wrote:

Am Samstag, 27. Dezember 2014, 20:03:09 schrieb Robert White:

Now: The complaining party has verified the minimum, repeatable case of simple file allocation on a very fragmented system, and the responding party and several others have understood and supported the bug.

I didn't yet provide such a test case. My bad. At the moment I can only reproduce this "kworker thread using a CPU for minutes" case with my /home filesystem. A minimal test case for me would be to reproduce it with a fresh BTRFS filesystem. But so far, with my test case on a fresh BTRFS, I get 4800 instead of 270 IOPS.

A version of the test case that demonstrates absolutely system-clogging loads is pretty easy to construct. Make a raid1 filesystem. Balance it once to make sure the seed filesystem is fully integrated. Create a bunch of small files that are at least 4K in size, but are randomly sized. Fill the entire filesystem with them. (A sketch assembling this whole procedure follows after this message.)

BASH script:

typeset -i counter=0
while dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) count=1 2>/dev/null
do
    echo $counter >/dev/null # basically a noop
done

The while loop will exit when dd encounters a full filesystem. Then delete ~10% of the files with "rm *0". Run the while loop again, then delete a different 10% with "rm *1". Then again with "rm *2", etc. Do this a few times, and with each iteration the CPU usage gets worse and worse. You'll easily get system-wide stalls on all IO tasks lasting ten or more seconds.

I don't have enough spare storage to do this directly, so I used loopback devices. First I did it with the loopback files in COW mode. Then I did it again with the files in NOCOW mode. (The COW files got thick with overwrites real fast. 8-)

So anyway... after I got through all ten digits on the rm (that is, removing *0, then refilling, then *1, etc.) I figured the FS image was nicely fragmented. At that point it was very easy to spike the kworker to 100% CPU with:

dd if=/dev/urandom of=/mnt/Work/scratch bs=40k

The dd would read 40k (a CPU spike for /dev/urandom processing), then it would write the 40k, and the kworker would peg 100% of one CPU and stay there for a while. Then it would be back to the /dev/urandom spike.

Now, this laptop has been carefully detuned to prevent certain kinds of stalls (particularly the movablecore= reservation, as previously mentioned, to prevent non-responsiveness of the UI), and I had to go through /dev/loop, so that had a smoothing effect... but yep, there were clear kworker spikes that _did_ stop the IO path (the system monitor app, for instance, could not get I/O statistics for ten- and fifteen-second intervals and would stop logging/scrolling).

Progressively larger block sizes on the write path made things progressively worse...

dd if=/dev/urandom of=/mnt/Work/scratch bs=160k

And overwriting the file by just invoking dd again was worse still (presumably from the juggling act), before resulting in a net out-of-space condition. Switching from /dev/urandom to /dev/zero for writing the large file made things worse still -- probably since there were no respites for the kworker to catch up, etc.

ASIDE: Playing with /proc/sys/vm/dirty_{background_,}ratio had lots of interesting and difficult-to-quantify effects on user-space applications. Cutting both values in half (5 and 10 instead of 10 and 20, respectively) seemed to give some relief, but going further got harmful quickly. Diverging the two numbers was odd too. It seemed a little brittle to play with these numbers.

SUPER FREAKY THING...

Every time I removed and recreated "scratch" I would get _radically_ different results for how much I could write into that remaining space and how long it took to do so. In theory I am reusing the exact same storage again and again. I'm not doing compression (the underlying filesystem behind the loop devices has compression, but that would be disabled by the +C attribute). It's not enough space coming-and-going to cause data extents to be reclaimed or displaced by metadata. And the filesystem is otherwise completely unused. But check it out...

Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 1.4952 s, 186 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
1700+0 records in
1700+0 records out
278528000 bytes (279 MB) copied, 292.135 s, 953 kB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/zero of=/mnt/Work/scratch bs=160k count=1700
dd: error writing '/mnt/Work/scratch': No space left on device
93+0 records in
92+0 records out
15073280 bytes (15 MB) copied, 0.0453977 s, 332 MB/s
Gust Work # rm scratch
Gust Work # dd if=/dev/urandom of=/mnt/Work/scratch bs=160k count=1700
dd: error writing '/mnt/Work/scratch': No space left on device
1090+0 records in
1089+0 records out
178421760 bytes (178 MB)
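For reference, here is the procedure described above assembled into one script. This is a minimal, untested sketch: the backing-file paths, sizes, and mount point are illustrative assumptions, not taken from Robert's mails.

#!/bin/bash
# Sketch of the fragmentation test described above (assumed paths/sizes).
set -e
mkdir -p /tmp/btrfs-test /mnt/Work
truncate -s 2G /tmp/btrfs-test/dev0.img
truncate -s 2G /tmp/btrfs-test/dev1.img
# For the NOCOW variant, chattr +C the (still empty) backing files first.
dev0=$(losetup -f --show /tmp/btrfs-test/dev0.img)
dev1=$(losetup -f --show /tmp/btrfs-test/dev1.img)
mkfs.btrfs -d raid1 -m raid1 "$dev0" "$dev1"
mount "$dev0" /mnt/Work
btrfs balance start /mnt/Work        # integrate the seed filesystem

typeset -i counter=0
fill() {
    # Random-sized small files until dd fails with ENOSPC.
    while dd if=/dev/urandom of=/mnt/Work/$((++counter)) \
             bs=$((4096 + RANDOM)) count=1 2>/dev/null
    do :; done
}

fill
for digit in 0 1 2 3 4 5 6 7 8 9; do
    rm -f /mnt/Work/*"$digit"        # delete ~10% of the files
    fill                             # refill the freed space
done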
Re: BTRFS free space handling still needs more work: Hangs again (further tests, as close as I dare, current idea)
Am Sonntag, 28. Dezember 2014, 14:56:21 schrieb Martin Steigerwald:
Am Sonntag, 28. Dezember 2014, 14:40:32 schrieb Martin Steigerwald:
Am Sonntag, 28. Dezember 2014, 14:00:19 schrieb Martin Steigerwald:
Am Samstag, 27. Dezember 2014, 14:55:58 schrieb Martin Steigerwald:

Summarized at Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401
see below. This is reproducible with fio; no need for Windows XP in Virtualbox to reproduce the issue. Next I will try to reproduce with a freshly created filesystem.

Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
On 12/26/2014 05:37 AM, Martin Steigerwald wrote:

Hello! First: Have a merry christmas and enjoy a quiet time in these days. Second: At a time you feel like it, here is a little rant, but also a bug report: I have this on a 3.18 kernel on Debian Sid with a BTRFS dual-SSD RAID with space_cache, skinny metadata extents -- are these a problem? -- and compress=lzo:

(There is no known problem with skinny metadata; it's actually more efficient than the older format. There have been some anecdotes about mixing skinny and fat metadata, but nothing has ever been demonstrated problematic.)

merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
Total devices 2 FS bytes used 144.41GiB
devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17

merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

This filesystem, at the allocation level, is very full (see below).

And I had hangs with BTRFS again. This time as I wanted to install tax-return software in a Virtualbox'd Windows XP VM (which I use once a year, because I know of no tax-return software for Linux that would be suitable for Germany, and I frankly don't care about the end of security, because all surfing and other network access I will do from the Linux box, and I only run the VM behind a firewall). And thus I try the balance dance again:

ITEM: Balance... it doesn't do what you think it does... Balancing is something you should almost never need to do. It is only for cases of changing geometry (adding disks, switching RAID levels, etc.) or for cases when you've radically changed allocation behaviors (like you decided to remove all your VMs, or you've decided to remove a mail spool directory full of thousands of tiny files). People run balance all the time because they think they should. They are _usually_ incorrect in that belief.

I only see the lockups of BTRFS if the trees *occupy* all space on the device.

No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata space. What's more, balance does *not* balance the metadata trees. The remaining space -- 154.97 GiB -- is unstructured storage for file data, and you have some 13 GiB of that available for use. Now, since you're seeing lockups when the space on your disks is all allocated, I'd say that's a bug. However, you're the *only* person who's reported this as a regular occurrence. Does this happen with all filesystems you have, or just this one?

I *never* so far saw it lock up if there is still space BTRFS can allocate from to *extend* a tree.

It's not a tree. It's simply space allocation. It's not even space *usage* you're talking about here -- it's just allocation (i.e. the FS saying "I'm going to use this piece of disk for this purpose").

This may be a bug, but this is what I see. And no amount of "you should not balance a BTRFS" will make that perception go away. See, I see the sun coming out in the morning and you tell me no, it doesn't. Simply put, that is not going to match my perception.

Duncan's assertion is correct in its detail. Looking at your space

Robert's
Re: BTRFS free space handling still needs more work: Hangs again
Am Sonntag, 28. Dezember 2014, 06:52:41 schrieb Robert White:
On 12/28/2014 04:07 AM, Martin Steigerwald wrote:

[...] A version of the test case to demonstrate absolutely system-clogging loads is pretty easy to construct. Make a raid1 filesystem. Balance it once to make sure the seed filesystem is fully integrated. Create a bunch of small files that are at least 4K in size, but are randomly sized. Fill the entire filesystem with them. [...] Do this a few times and with each iteration the CPU usage gets worse and worse. You'll easily get system-wide stalls on all IO tasks lasting ten or more seconds.

Thanks, Robert. That's wonderful. I wondered about such a test case already, and thought about reproducing it just with fallocate calls instead, to reduce the amount of actual writes done. I.e. just do some silly "fallocate, truncate, write just some parts with dd seek, and remove things again" kind of workload.

Feel free to add your test case to the bug report:

[Bug 90401] New: btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

Because anything that helps a BTRFS developer to reproduce this will make it easier to find and fix the root cause.

I think I will try with this little critter:

merkaba:/mnt/btrfsraid1 cat freespracefragment.sh
#!/bin/bash
TESTDIR=./test
mkdir -p $TESTDIR
typeset -i counter=0
while true; do
    fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((++counter))
    echo $counter >/dev/null # basically a noop
done

It takes a while. The script itself is using only a few percent of one core there, while busying out the SSDs more heavily than I thought it would.

But well, I see up to 12000 writes per 10 seconds -- that's not that much, but still it keeps one SSD about 80% busy:

ATOP - merkaba  2014/12/28 16:40:57  --- 10s elapsed
PRC | sys 1.50s | user 3.47s | #proc 367 | #trun 1 | #tslpi 649 | #tslpu 0 | #zombie 0 | clones 839 | no procacct |
CPU | sys 30% | user 38% | irq 1% | idle 293% | wait 37% | steal 0% | guest 0% | curf 1.63GHz | curscal 50% |
cpu | sys 7% | user 11% | irq 1% | idle 75% | cpu000 w 6% | steal 0% | guest 0% | curf 1.25GHz | curscal 39% |
cpu | sys 8% | user 11% | irq 0% | idle 76% | cpu002 w 4% | steal 0% | guest 0% | curf 1.55GHz | curscal 48% |
cpu | sys 7% | user 9% | irq 0% | idle 71% | cpu001 w 13% | steal 0% | guest 0% | curf 1.75GHz | curscal 54% |
cpu | sys 8% | user 7% | irq 0% | idle 71% | cpu003 w 14% | steal 0% | guest 0% | curf 1.96GHz | curscal 61% |
CPL | avg1 1.69 | avg5 1.30 | avg15 0.94 | csw 68387 | intr 36928 | numcpu 4 |
MEM | tot 15.5G | free 3.1G | cache 8.8G | buff 4.2M | slab 1.0G | shmem 210.3M | shrss 79.1M | vmbal 0.0M | hptot 0.0M | hpuse 0.0M |
SWP | tot 12.0G | free 11.5G | vmcom 4.9G | vmlim 19.7G |
LVM | a-btrfsraid1 | busy 80% | read 0 | write 11873 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.31 | avq 1.11 | avio 0.67 ms |
LVM | a-btrfsraid1 | busy 5% | read 0 | write 11873 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 4.31 | avq 2.45 | avio 0.04 ms |
LVM | msata-home | busy 3% | read 0 | write 175 | KiB/r 0 | KiB/w 3 | MBr/s 0.00 | MBw/s 0.06 | avq 1.71 | avio 1.43 ms |
LVM | msata-debian | busy 0% | read 0 | write 10 | KiB/r
Re: BTRFS free space handling still needs more work: Hangs again
Am Sonntag, 28. Dezember 2014, 16:42:20 schrieb Martin Steigerwald:
[...]
Re: BTRFS free space handling still needs more work: Hangs again
On 28 December 2014 at 13:03, Martin Steigerwald mar...@lichtvoll.de wrote:

BTW, I found that the Oracle blog didn't work at all for me. I completed a cycle of defrag, "sdelete -c" and "VBoxManage compact", [...] and it apparently did *nothing* to reduce the size of the file.

They've changed the argument to -z; use "sdelete -z".
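For readers following along, the compaction cycle under discussion looks roughly like this. This is a sketch under assumptions: the VDI path is a placeholder, and the flags are from memory of the tools involved, not from the original mails.

# Inside the Windows guest: defragment, then zero the free space
# (newer sdelete versions use -z for this, as Peter notes):
#     sdelete -z C:
# Then on the Linux host, with the VM powered off, compact the image:
VBoxManage modifyhd /path/to/WindowsXP.vdi --compact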
Re: BTRFS free space handling still needs more work: Hangs again
On 12/28/2014 07:42 AM, Martin Steigerwald wrote:

[...] I think I will try with this little critter:

merkaba:/mnt/btrfsraid1 cat freespracefragment.sh
#!/bin/bash
TESTDIR=./test
mkdir -p $TESTDIR
typeset -i counter=0
while true; do
    fallocate -l $((4096 + $RANDOM)) $TESTDIR/$((++counter))
    echo $counter >/dev/null # basically a noop
done

If you don't do the remove/delete passes you won't get as much fragmentation... I also noticed that fallocate would not actually create the files in my toolset, so I had to touch them first. So the theoretical script became e.g.:

typeset -i counter=0
for AA in {0..9}
do
    while touch ${TESTDIR}/$((++counter))
          fallocate -l $((4096 + $RANDOM)) ${TESTDIR}/$((counter))
    do
        if ((counter%100 == 0))
        then
            echo $counter
        fi
    done
    echo removing ${AA}
    rm ${TESTDIR}/*${AA}
done

Meanwhile, on my test rig, using fallocate did _not_ result in final exhaustion of resources. That is, "btrfs fi df /mnt/Work" didn't show significant changes on a nearly full expanse. I also never got a failed response back from fallocate; that is, the inner loop never terminated. This could be a problem with the system call itself, or it could be a problem with the application wrapper. Nor did I reach the CPU saturation I expected. E.g.:

Gust vm # btrfs fi df /mnt/Work/
Data, RAID1: total=1.72GiB, used=1.66GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=256.00MiB, used=57.84MiB
GlobalReserve, single: total=32.00MiB, used=0.00B

(time passes while the script is running...)

Gust vm # btrfs fi df /mnt/Work/
Data, RAID1: total=1.72GiB, used=1.66GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=256.00MiB, used=57.84MiB
GlobalReserve, single: total=32.00MiB, used=0.00B

So there may be some limiting factor or something. Without the actual writes to the actual file expanse I don't get the stalls. (I added a _touch_ of instrumentation; it makes the various catastrophe events a little more obvious in context. 8-)

mount /dev/whatever /mnt/Work
typeset -i counter=0
for AA in {0..9}
do
    while dd if=/dev/urandom of=/mnt/Work/$((++counter)) bs=$((4096 + $RANDOM)) count=1 2>/dev/null
    do
        if ((counter%100 == 0))
        then
            echo $counter
            if ((counter%1000 == 0))
            then
                btrfs fi df /mnt/Work
            fi
        fi
    done
    btrfs fi df /mnt/Work
    echo removing ${AA}
    rm /mnt/Work/*${AA}
    btrfs fi df /mnt/Work
done

So you definitely need the writes to really see the stalls.

I may try that with my test BTRFS. I could even make it 2x20 GiB RAID 1 as well.

I guess I never mentioned it... I am using 4x1GiB NOCOW files through losetup as the basis of a RAID1. No compression (by virtue of the NOCOW files
Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)
On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:

My simple test case didn't trigger it, and I do not have another twice 160 GiB available on these SSDs to try with a copy of my home filesystem. Then I could safely test without bringing the desktop session to a halt. Maybe someone has an idea on how to enhance my test case in order to reliably trigger the issue.

It may be challenging, though. My /home is quite a filesystem. It has a maildir with at least a million files (yeah, I am performance-testing KMail and Akonadi to the limit as well!), and it has git repos and this one VM image, and the desktop search and the Akonadi database. In other words: it has been hit nicely with various, mostly random I think, workloads over the last roughly six months. I bet it's not that easy to simulate that. Maybe some runs of compilebench to age the filesystem before the fio test?

That said, BTRFS performs a lot better. The complete lockups without any CPU usage of 3.15 and 3.16 are gone for sure. That's wonderful. But there is this kworker issue now. I noticed it that gravely just while trying to complete this tax-returns stuff with the Windows XP VM. Otherwise it may have happened -- I have seen some backtraces in kern.log -- but it didn't last for minutes. So this is indeed of less severity than the full lockups with 3.15 and 3.16.

Zygo, what are the characteristics of your filesystem? Do you use compress=lzo and skinny metadata as well? How are the chunks allocated? What kind of data do you have on it?

compress-force (default zlib), no skinny-metadata. Chunks are d=single, m=dup. Data is a mix of various desktop applications, most active file sizes from a few hundred K to a few MB, maybe 300k-400k files. No database or VM workloads. Filesystem is 100GB and is usually between 98 and 99% full (about 1-2GB free).

I have another filesystem which has similar problems when it's 99.99% full (it's 13TB, so 0.01% is 1.3GB). That filesystem is RAID1 with skinny-metadata and no-holes.

On various filesystems I have the above CPU-burning problem, a bunch of irreproducible random crashes, and a hang with a kernel stack that goes through SyS_unlinkat and btrfs_evict_inode.
Re: BTRFS free space handling still needs more work: Hangs again
Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:
On 12/26/2014 05:37 AM, Martin Steigerwald wrote:

Hello! First: Have a merry christmas and enjoy a quiet time in these days. Second: At a time you feel like it, here is a little rant, but also a bug report: I have this on a 3.18 kernel on Debian Sid with a BTRFS dual-SSD RAID with space_cache, skinny metadata extents -- are these a problem? -- and compress=lzo:

(There is no known problem with skinny metadata; it's actually more efficient than the older format. There have been some anecdotes about mixing skinny and fat metadata, but nothing has ever been demonstrated problematic.)

merkaba:~ btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
Total devices 2 FS bytes used 144.41GiB
devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17

merkaba:~ btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

This filesystem, at the allocation level, is very full (see below).

And I had hangs with BTRFS again. This time as I wanted to install tax-return software in a Virtualbox'd Windows XP VM (which I use once a year, because I know of no tax-return software for Linux that would be suitable for Germany, and I frankly don't care about the end of security, because all surfing and other network access I will do from the Linux box, and I only run the VM behind a firewall). And thus I try the balance dance again:

ITEM: Balance... it doesn't do what you think it does... 8-)

Balancing is something you should almost never need to do. It is only for cases of changing geometry (adding disks, switching RAID levels, etc.) or for cases when you've radically changed allocation behaviors (like you decided to remove all your VMs, or you've decided to remove a mail spool directory full of thousands of tiny files). People run balance all the time because they think they should. They are _usually_ incorrect in that belief.

I only see the lockups of BTRFS if the trees *occupy* all space on the device. I *never* so far saw it lock up if there is still space BTRFS can allocate from to *extend* a tree. This may be a bug, but this is what I see. And no amount of "you should not balance a BTRFS" will make that perception go away. See, I see the sun coming out in the morning and you tell me no, it doesn't. Simply put, that is not going to match my perception.

merkaba:~ btrfs balance start -dusage=5 -musage=5 /home
ERROR: error during balancing '/home' - No space left on device

ITEM: Running out of space during a balance is not running out of space for files. BTRFS has two layers of allocation. That is, there are two levels of abstraction where "no space" can occur.

I understand that *very* well. I know about the allocation of *device* space for trees, and I know about the allocation *inside* a tree.

The first level of allocation is the making of more BTRFS structures out [...]

(Skipped the rest of the explanation, which I already know.)

I also don't buy the "the SSD makes the kworker thread use 100% for minutes" explanation -- *while* these SSDs are basically idling. A Sandybridge core is not exactly slow, and these are still consumer SSDs; we are not talking about a million IOPS here. And again: this does not ever happen when the trees do *not* fully allocate all device space. Even the defragmentation of the Windows XP VM ran fine until the trees allocated all space on the device again. Try to reread the last two sentences in case it hasn't sunk in yet.

That's why I consider it a bug. I totally agree with you that a balance should not be necessary, but in my observation it is. That is the actual bug. And no, no one needs to tell me to nocow the file. Even the extents are no issue: not with SSDs, which provide good enough random access.

My interpretation from what I see is this: BTRFS free space *in tree* handling is still not up to production quality. Now you either try out what I describe and see whether you perceive the same, or, if you don't, please don't argue with my perception. You can argue with my conclusion, but I know what I see here.

Thanks. Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: BTRFS free space handling still needs more work: Hangs again
On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote:
Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White:

[...] People run balance all the time because they think they should. They are _usually_ incorrect in that belief.

I only see the lockups of BTRFS if the trees *occupy* all space on the device.

No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata space. What's more, balance does *not* balance the metadata trees. The remaining space -- 154.97 GiB -- is unstructured storage for file data, and you have some 13 GiB of that available for use.

Now, since you're seeing lockups when the space on your disks is all allocated, I'd say that's a bug. However, you're the *only* person who's reported this as a regular occurrence. Does this happen with all filesystems you have, or just this one?

I *never* so far saw it lock up if there is still space BTRFS can allocate from to *extend* a tree.

It's not a tree. It's simply space allocation. It's not even space *usage* you're talking about here -- it's just allocation (i.e. the FS saying "I'm going to use this piece of disk for this purpose").

This may be a bug, but this is what I see. And no amount of "you should not balance a BTRFS" will make that perception go away. See, I see the sun coming out in the morning and you tell me no, it doesn't. Simply put, that is not going to match my perception.

Duncan's assertion is correct in its detail. Looking at your space usage, I would not suggest that running a balance is something you need to do.

Now, since you have these lockups that seem quite repeatable, there's probably a lurking bug in there, but hacking around with balance every time you hit it isn't going to get the problem solved properly. I think I would suggest the following:

- make sure you have some way of logging your dmesg permanently (use a different filesystem for /var/log, or a serial console, or a netconsole)
- when the lockup happens, hit Alt-SysRq-t a few times
- send the dmesg output here, or post it to bugzilla.kernel.org

That's probably going to give enough information to the developers to work out where the lockup is happening, and is clearly the way forward here.

Hugo.

-- 
Hugo Mills | w.w.w. -- England's batting scorecard
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: 65E74AC0
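Hugo's capture procedure, spelled out as commands. A sketch assuming a root shell that still responds during the stall; these are the standard procfs interfaces for SysRq, and the log path is a placeholder:

# Enable all SysRq functions (equivalent to having Alt-SysRq available):
echo 1 > /proc/sys/kernel/sysrq
# When the lockup happens, dump all task states to the kernel ring buffer:
echo t > /proc/sysrq-trigger
# Save the result somewhere that is not on the affected filesystem:
dmesg > /var/log/sysrq-t.$(date +%s).txt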
Re: BTRFS free space handling still needs more work: Hangs again
Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:

[...]

I only see the lockups of BTRFS if the trees *occupy* all space on the device.

No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata space. What's more, balance does *not* balance the metadata trees. The remaining space -- 154.97 GiB -- is unstructured storage for file data, and you have some 13 GiB of that available for use.

Ok, let me rephrase that: then the space *reserved* for the trees occupies all space on the device. Or okay: when what I see as "total" in the btrfs fi df summary occupies what I see as "size" in btrfs fi sh, i.e. when "used" equals "size" in btrfs fi sh.

What happened here is this: I tried

https://blogs.oracle.com/virtualbox/entry/how_to_compact_your_virtual

in order to regain some space from the Windows XP VDI file. I just wanted to get around upsizing the BTRFS again. And on the defragmentation step in Windows it first ran fast. For about 46-47% there, during that fast phase, btrfs fi df showed that BTRFS was quickly reserving the remaining free device space for data trees (not metadata). Only a while after it did so, it got slow again; basically the Windows defragmentation process stopped at 46-47% altogether, and then after a while even the desktop locked up due to processes being blocked on I/O.

I decided to forget about this downsizing of the Virtualbox VDI file. It will extend again on the next Windows work, and it is already at 18 GB of its maximum 20 GB, so... I dislike the approach anyway, and I don't even understand why the defragmentation step would be necessary, as I think Virtualbox can poke holes into the file for any space not allocated inside the VM, whether it is defragmented or not.

Now, since you're seeing lockups when the space on your disks is all allocated, I'd say that's a bug. However, you're the *only* person who's reported this as a regular occurrence. Does this happen with all filesystems you have, or just this one?

The *only* person? The compression lockups with 3.15 and 3.16 -- quite some people saw them, I thought. For me those lockups also only happened with all space on the device allocated. And those seem to be gone. In regular use it doesn't lock up totally hard anymore. But in the "a process writes a lot into one big no-cowed file" case, it seems it can still get into a lockup -- but this time one where a kworker thread consumes 100% of a CPU for minutes.

I *never* so far saw it lock up if there is still space BTRFS can allocate from to
Re: BTRFS free space handling still needs more work: Hangs again
Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:

I only see the lockups of BTRFS if the trees *occupy* all space on the device.

No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata space. What's more, balance does *not* balance the metadata trees. The remaining space -- 154.97 GiB -- is unstructured storage for file data, and you have some 13 GiB of that available for use.

Now, since you're seeing lockups when the space on your disks is all allocated, I'd say that's a bug. However, you're the *only* person who's reported this as a regular occurrence. Does this happen with all filesystems you have, or just this one?

Okay, just about terms. What I call "trees" is this:

merkaba:~ btrfs fi df /
Data, RAID1: total=27.99GiB, used=17.21GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=596.12MiB
GlobalReserve, single: total=208.00MiB, used=0.00B

For me, each one of Data, System, Metadata and GlobalReserve is what I call a tree. How would you call it? I always thought that BTRFS uses a tree structure not only for metadata, but also for data. But I bet, strictly speaking, that's only to *manage* the chunks it allocates, and what I see above is the actual chunk usage. I.e., to get terms straight, how would you call it? I think my understanding of how BTRFS handles space allocation is quite correct, but I may be using a term incorrectly.

I read "Data, RAID1: total=27.99GiB, used=17.21GiB" as: it reserved 27.99 GiB for data chunks and used 17.21 GiB in these data chunks so far. So I have about 10.5 GiB free in these data chunks at the moment and all is good. What it doesn't tell me at all is how the allocated space is distributed onto these chunks. It may be that some chunks are completely empty, or not. It may be that each chunk has some space allocated to it, but in total there is that amount of free space. I.e., it doesn't tell me anything about the free-space fragmentation inside the chunks.

Yet I still hold my theory that, in the case of heavy writes to a COW'd file, BTRFS seems to prefer to reserve new empty chunks on this /home filesystem of my laptop instead of trying to find free space in existing, only partially empty chunks -- and the lockup only happens when it tries to do the latter. And no, I think it shouldn't lock up then. I also think it's a bug. I never said differently.

And yes, I only ever had this on my /home so far. Not on /, which is also RAID 1 and has had all device space reserved for quite some time; not on /daten, which only holds large files and is single instead of RAID. Also not on the server -- but the server FS still has lots of unallocated device space -- or on the 2 TiB eSATA backup HD, although I do get the impression that BTRFS has started to get slower there as well: at least the rsync-based backup script takes quite long meanwhile, and I see rsync reading from the backup BTRFS and in this case almost fully utilizing the disk for longer times. But unlike my /home, the backup disk has some widely spaced snapshots (about 2-week to 1-month intervals, covering about the last half year). Neither /home nor / on the SSD have snapshots at the moment. So this is happening without snapshots.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: BTRFS free space handling still needs more work: Hangs again
On 12/27/2014 02:54 AM, Martin Steigerwald wrote:

[...]

What happened here is this: I tried https://blogs.oracle.com/virtualbox/entry/how_to_compact_your_virtual in order to regain some space from the Windows XP VDI file. I just wanted to get around upsizing the BTRFS again. And on the defragmentation step in Windows it first ran fast. For about 46-47% there, during that fast phase, btrfs fi df showed that BTRFS was quickly reserving the remaining free device space for data trees (not metadata).

The above statement is word-salad. The storage for data is not a "data tree"; the tree that maps data into a file is metadata. The data is data. There is no data tree.

Only a while after it did so, it got slow again; basically the Windows defragmentation process stopped at 46-47% altogether, and then after a while even the desktop locked up due to processes being blocked on I/O.

If you've over-organized your very-large data files, you can waste some terrific amounts of space.

(ASCII diagram garbled in the archive: it showed four generations of overwrites stacked on one originally contiguous extent, with only a small region -- marked "uuu" -- ever being freed.)

As you write new segments you don't actually free the lower extents unless they are _completely_ obscured end-to-end by a later extent. So if you've _ever_ defragged the BTRFS extent to be fully contiguous, and you've not overwritten each and every byte later, the original expanse is still going to be there. In the above example, only the "uuu" block is ever freed, and only when the fourth generation finally covers the little gap.

In the worst case you can end up with N*(N+1)/2 total blocks used up on disk when only N blocks are visible. (See the Gauss formula for the sum of consecutive integers for why this is the correct approximation for the worst case.)

[--------------]
[-------------]
[------------]
 ...
[-]

Each generation, being one block shorter than the previous one, exposes N blocks in total, one from each generation. So 1+2+3+4+5+...+N blocks are allocated if each overwrite is one block shorter
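As a quick check of that N*(N+1)/2 bound -- a hypothetical worked example, using the 1 MiB file size from the thread's later script:

# N = 256, i.e. a 1 MiB file written in 4 KiB blocks:
echo $(( 256 * 257 / 2 ))       # 32896 blocks pinned in the worst case
echo $(( 256 * 257 / 2 * 4 ))   # = 131584 KiB, roughly 128.5 MiB on disk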
Re: BTRFS free space handling still needs more work: Hangs again
On 12/27/2014 03:11 AM, Martin Steigerwald wrote:
Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills:
[...]

Okay, just about terms.

Terms are _really_ important if you want to file and discuss bugs.

What I call "trees" is this:

merkaba:~ btrfs fi df /
Data, RAID1: total=27.99GiB, used=17.21GiB
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=596.12MiB
GlobalReserve, single: total=208.00MiB, used=0.00B

For me, each one of Data, System, Metadata and GlobalReserve is what I call a tree. How would you call it?

Those are extents, I think. All of the trees are in the metadata. One of the trees is the extent tree. The extent tree is what contains the list of which regions of the disk are data, or metadata, or system metadata (like the superblocks), or the global reserve. Those extents are then filled with the type of information described. But all the trees are in the metadata extents.

I always thought that BTRFS uses a tree structure not only for metadata, but also for data. But I bet, strictly speaking, that's only to *manage* the chunks it allocates, and what I see above is the actual chunk usage. I.e., to get terms straight, how would you call it? I think my understanding of how BTRFS handles space allocation is quite correct, but I may be using a term incorrectly.

I read "Data, RAID1: total=27.99GiB, used=17.21GiB" as: it reserved 27.99 GiB for data chunks and used 17.21 GiB in these data chunks so far. [...] Yet I still hold my theory that, in the case of heavy writes to a COW'd file, BTRFS seems to prefer to reserve new empty chunks on this /home filesystem of my laptop instead of trying to find free space in existing, only partially empty chunks -- and the lockup only happens when it tries to do the latter. And no, I think it shouldn't lock up then. I also think it's a bug. I never said differently.

Partly correct. The system (as I understand it) will try to fill old chunks before allocating new ones. It also prefers the most-empty chunk first. But if you fallocate large extents, they can have trouble finding a home. So let's say you have a systemic process that keeps making 0.51 GiB files: it will then tend to allocate a new 1 GiB data chunk each time (presuming you used default values), because each successive 0.51 GiB region cannot fit in any existing data chunk. (A small sketch of this effect follows at the end of this message.) Excessive snapshotting can also contribute to this effect, but only because it freezes the history. There are some other odd-out cases.

And yes, I only ever had this on my /home so far. Not on /, which is also RAID 1 and has had all device space reserved for quite some time; not on /daten, which only holds large files and is single instead of RAID. Also not on the server -- but the server FS still has lots of unallocated device space -- or on the 2 TiB eSATA backup HD, although I do get the impression that BTRFS has started to get slower there as well: at least the rsync-based backup script takes quite long meanwhile, and I see rsync reading from the backup BTRFS and in this case almost fully utilizing the disk for longer times. But unlike my /home, the backup disk has some widely spaced snapshots (about 2-week to 1-month intervals, covering about the last half year). Neither /home nor / on the SSD have snapshots at the moment. So this is happening without snapshots.

Ciao,
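Robert's 0.51 GiB scenario can be watched directly. A minimal sketch, assuming a scratch block device and mount point (both placeholders); whether the allocator really starts a fresh chunk for every file depends on internal details, so treat this as an illustration of the claim, not a guaranteed outcome:

#!/bin/bash
# Watch chunk allocation while writing files just over half a chunk in size.
mkfs.btrfs -f /dev/sdX1          # placeholder device -- this destroys data!
mount /dev/sdX1 /mnt/scratch
for i in 1 2 3 4 5; do
    fallocate -l 522M /mnt/scratch/big$i   # ~0.51 GiB each
    sync
    # If the claim holds, Data "total" grows by ~1 GiB per file
    # while "used" grows by only ~0.51 GiB:
    btrfs fi df /mnt/scratch
done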
Re: BTRFS free space handling still needs more work: Hangs again
Am Samstag, 27. Dezember 2014, 03:52:56 schrieb Robert White:

My theory from watching the Windows XP defragmentation case is this: for writing into the file, BTRFS needs to actually allocate and use free space within the current allocation -- or, as we seem to have misunderstood each other over the words we use, it needs to fit data into "Data, RAID1: total=144.98GiB, used=140.94GiB", between 144.98 GiB and 140.94 GiB, given that the total space of this tree -- or, if it's not a tree, the chunks that the tree manages -- can *not* be extended anymore.

If your file was actually COW (and you have _not_ been taking snapshots) then there is no extending to be had. But if you are using snapper (which I believe you mentioned previously), then the snapshots cause a write boundary and a layer of copying. Frequently taking snapshots of a COW file is self-defeating. If you are going to take snapshots, then you might as well turn copy-on-write back on and, for the love of pete, stop defragging things.

I don't use any snapshots on the filesystems. None, zero, zilch, nada.

And as I understand it, copy-on-write means: it has to write the new write requests somewhere else. For this it needs to allocate space -- either within existing chunks or in a newly allocated one. So for COW, when writing to a file, it will always need to allocate new space (although it can forget about the old space afterwards, unless there is a snapshot holding it).

Anyway, I got it reproduced. And am about to write a lengthy mail about it. It can easily be reproduced without even using Virtualbox, just with a nice simple fio job.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: BTRFS free space handling still needs more work: Hangs again
On 12/27/2014 05:16 AM, Martin Steigerwald wrote:

[...] And as I understand it, copy-on-write means: it has to write the new write requests somewhere else. For this it needs to allocate space -- either within existing chunks or in a newly allocated one. So for COW, when writing to a file, it will always need to allocate new space (although it can forget about the old space afterwards, unless there is a snapshot holding it).

It can _only_ forget about the space if absolutely _all_ of the old extent is overwritten. So if you write 1MiB, then you go back and overwrite 1MiB-4KiB, then you go back and write 1MiB-8KiB, you've now got 3MiB-12KiB allocated to represent 1MiB of data. No snapshots involved. The worst case is quite well understood:

[..........................]  1MiB
[.........................]   1MiB-4KiB
[........................]    1MiB-8KiB

BTRFS will _NOT_ reclaim part of any extent. So if this kept going, it would take 256 diminishing overwrites, each 4KiB less than the prior (1MiB == 256 4KiB blocks): (256*(256+1))/2 = 32896 4KiB blocks, or about 128.5 MiB of storage allocated and dedicated to representing 1MiB of accessible data. This is a worst case, of course, but it exists and it's _horrible_.

And such a file can be "burped" by doing a copy-and-rename, returning it to a single 1MiB extent. (I don't know if a btrfs defrag would have identical results, but I think it would.) (A one-line sketch of the burp follows below.)

The problem is that there isn't (yet) a COW-safe way to discard partial extents. That is, there is no universally safe way (yet implemented) to turn that first 1MiB into two extents of 1MiB-4KiB and 4KiB in place, so there is no way (yet) to prevent this worst case. Doing things like excessive defragging at the BTRFS level, and defragging inside of a VM, and using certain file types can lead to pretty awful data wastage. YMMV. E.g. too much tidying up and you make a mess. I offered a pseudocode example a few days back on how this problem might be dealt with in future, but I've not seen any feedback on it.

Anyway, I got it reproduced. And am about to write a lengthy mail about it.

Have fun with that lengthy email, but the devs already know about the data-waste profile of the system. They just don't have a good solution yet. Practical use cases involving _not_ defragging and _not_ packing files, or disabling COW and using raw image formats for VM disk storage, are, meanwhile, also well understood.

It can easily be reproduced without even using Virtualbox, just with a nice simple fio job.

Yep. As I've explained twice now.
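The "burp" Robert mentions can be done in one line. A sketch; on btrfs the --reflink=never matters, since a reflinked copy would share the old extents instead of writing fresh ones:

# Rewrite the file so its data lands in fresh, contiguous extents:
cp --reflink=never /some/file /some/file.new && mv /some/file.new /some/file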
Re: BTRFS free space handling still needs more work: Hangs again
Summarized at Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file https://bugzilla.kernel.org/show_bug.cgi?id=90401 see below. This is reproducable with fio, no need for Windows XP in Virtualbox for reproducing the issue. Next I will try to reproduce with a freshly creating filesystem. Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills: On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote: Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White: On 12/26/2014 05:37 AM, Martin Steigerwald wrote: Hello! First: Have a merry christmas and enjoy a quiet time in these days. Second: At a time you feel like it, here is a little rant, but also a bug report: I have this on 3.18 kernel on Debian Sid with BTRFS Dual SSD RAID with space_cache, skinny meta data extents – are these a problem? – and compress=lzo: (there is no known problem with skinny metadata, it's actually more efficient than the older format. There has been some anecdotes about mixing the skinny and fat metadata but nothing has ever been demonstrated problematic.) merkaba:~ btrfs fi sh /home Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a Total devices 2 FS bytes used 144.41GiB devid1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home devid2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home Btrfs v3.17 merkaba:~ btrfs fi df /home Data, RAID1: total=154.97GiB, used=141.12GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.29GiB GlobalReserve, single: total=512.00MiB, used=0.00B This filesystem, at the allocation level, is very full (see below). And I had hangs with BTRFS again. This time as I wanted to install tax return software in Virtualbox´d Windows XP VM (which I use once a year cause I know no tax return software for Linux which would be suitable for Germany and I frankly don´t care about the end of security cause all surfing and other network access I will do from the Linux box and I only run the VM behind a firewall). And thus I try the balance dance again: ITEM: Balance... it doesn't do what you think it does... 8-) Balancing is something you should almost never need to do. It is only for cases of changing geometry (adding disks, switching RAID levels, etc.) of for cases when you've radically changed allocation behaviors (like you decided to remove all your VM's or you've decided to remove a mail spool directory full of thousands of tiny files). People run balance all the time because they think they should. They are _usually_ incorrect in that belief. I only see the lockups of BTRFS is the trees *occupy* all space on the device. No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata space. What's more, balance does *not* balance the metadata trees. The remaining space -- 154.97 GiB -- is unstructured storage for file data, and you have some 13 GiB of that available for use. Now, since you're seeing lockups when the space on your disks is all allocated I'd say that's a bug. However, you're the *only* person who's reported this as a regular occurrence. Does this happen with all filesystems you have, or just this one? I *never* so far saw it lockup if there is still space BTRFS can allocate from to *extend* a tree. It's not a tree. It's simply space allocation. It's not even space *usage* you're talking about here -- it's just allocation (i.e. the FS saying I'm going to use this piece of disk for this purpose). This may be a bug, but this is what I see. 
And no amount of "you should not balance a BTRFS" will make that perception go away. See, I see the sun coming out in the morning and you tell me no, it doesn't. Simply that is not going to match my perception. Duncan's [Robert's :)] assertion is correct in its detail. Looking at your space usage, I would not suggest that running a balance is something you need to do. Now, since you have these lockups that seem quite repeatable, there's probably a lurking bug in there, but hacking around with balance every time you hit it isn't going to get the problem solved properly. I think I would suggest the following:
- make sure you have some way of logging your dmesg permanently (use a different filesystem for /var/log, or a serial console, or a netconsole)
- when the lockup happens, hit Alt-SysRq-t a few times
- send the dmesg output here, or post to bugzilla.kernel.org
That's probably going to give enough information to the developers to work out where the lockup is happening, and is clearly the way forward here. And I got it reproduced. *Perfectly* reproduced, I'd say. But let me run
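For anyone following Hugo's capture recipe, the console keystrokes have a /proc equivalent; a minimal sketch, assuming root and that the resulting file ends up somewhere other than the affected filesystem:

echo 1 > /proc/sys/kernel/sysrq    # allow all SysRq functions
echo t > /proc/sysrq-trigger       # same effect as Alt-SysRq-t: dump all task states to the kernel log
dmesg > /var/tmp/dmesg-lockup.txt  # save the dump for the list or for bugzilla.kernel.org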
Re: BTRFS free space handling still needs more work: Hangs again
On 12/27/2014 05:16 AM, Martin Steigerwald wrote: It can easily be reproduced without even using Virtualbox, just by a nice simple fio job. TL;DR: If you want a worst-case example of consuming a BTRFS filesystem with one single file...

#!/bin/bash
# not tested, so correct any syntax errors
typeset -i counter
for ((counter=250; counter>0; counter--)); do
    dd if=/dev/urandom of=/some/file bs=4k count=$counter
done
exit

Each pass over /some/file is 4k shorter than the previous one, but none of the extents can be deallocated. The file will be 1MiB in size and usage will be something like 125.5MiB (if I've done the math correctly). Larger values of counter will result in quadratically larger amounts of waste. Doing the bad things is very bad... -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS free space handling still needs more work: Hangs again
Am Samstag, 27. Dezember 2014, 05:49:48 schrieb Robert White: Anyway, I got it reproduced. And am about to write a lengthy mail about it. Have fun with that lengthy email, but the devs already know about the data waste profile of the system. They just don't have a good solution yet. Practical use cases involving _not_ defragging and _not_ packing files, or disabling COW and using raw image formats for VM disk storage are, meanwhile, also well understood. Okay, then how about a database? BTRFS is not usable for these kinds of workloads then. And that's about it. Not even on SSD. Yet, what I have shown in my lengthy mail is pathological. It's even abysmal. And yet it only happens when BTRFS is forced to pack things into *existing* chunks. It does not happen when BTRFS can still reserve new chunks and write to them. And this makes all the talk that you should not need to rebalance obsolete, when in practice you need to rebalance to get decent performance. To get out of your SSDs what your SSDs can provide, instead of waiting for BTRFS to finish being busy with itself. Still, I have only yet reproduced it on this /home filesystem. If it is also reproducible on a freshly created filesystem after some runs of the fio job I provided, I'd say that there is a performance bug in BTRFS. And that's it. No talking about technicalities will turn this performance bug observation away. Heck, 254 IOPS from a dual-SSD RAID 1? Are you kidding me? I refuse to believe that this is built into the design, no matter how much you outline its limitations. And if it is? Well… then maybe BTRFS won't save us. Unless you give it a ton of extra free space. Unless you do as I recommend: if you use 25 GB, make it 100 GB big so it will always find enough space to waste. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS free space handling still needs more work: Hangs again
Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White: On 12/27/2014 05:16 AM, Martin Steigerwald wrote: It can easily be reproduced without even using Virtualbox, just by a nice simple fio job. TL;DR: If you want a worst-case example of consuming a BTRFS filesystem with one single file...

#!/bin/bash
# not tested, so correct any syntax errors
typeset -i counter
for ((counter=250; counter>0; counter--)); do
    dd if=/dev/urandom of=/some/file bs=4k count=$counter
done
exit

Each pass over /some/file is 4k shorter than the previous one, but none of the extents can be deallocated. The file will be 1MiB in size and usage will be something like 125.5MiB (if I've done the math correctly). Larger values of counter will result in quadratically larger amounts of waste. Robert, I experienced these hang issues even before the defragmenting case. It happened while I just installed a 400 MiB tax return application to it (that is no joke, it is that big). It happens while just using the VM. Yes, I recommend not to use BTRFS for any VM image or any larger database on rotating storage, for exactly those COW semantics. But on SSD? It's busy looping a CPU core while the flash is basically idling. I refuse to believe that this is by design. I do think there is a *bug*. Either acknowledge it and try to fix it, or say it's by design *without even looking at it closely enough to be sure that it is not a bug* and limit your own possibilities by it. I'd rather see it treated as a bug for now. Come on, 254 IOPS on a filesystem with still 17 GiB of free space while randomly writing to a 4 GiB file. People do these kinds of things. Ditch that defrag Windows XP VM case, I had performance issues even before, just by installing things to it. Databases, VMs, emulators. And heck, even while just *creating* the file with fio, as I showed. Ciao, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS free space handling still needs more work: Hangs again
On 12/27/2014 06:00 AM, Robert White wrote: On 12/27/2014 05:16 AM, Martin Steigerwald wrote: It can easily be reproduced without even using Virtualbox, just by a nice simple fio job. TL;DR: If you want a worst-case example of consuming a BTRFS filesystem with one single file...

#!/bin/bash
# not tested, so correct any syntax errors
typeset -i counter
for ((counter=250; counter>0; counter--)); do
    dd if=/dev/urandom of=/some/file bs=4k count=$counter
done
exit 0

Slight correction: you need to prevent the truncate dd performs by default, and flush the data and metadata to disk after each invocation. So you need the conv= flags.

for ((counter=250; counter>0; counter--)); do
    dd if=/dev/urandom of=some_file conv=notrunc,fsync bs=4k count=$counter
done

Each pass over /some/file is 4k shorter than the previous one, but none of the extents can be deallocated. The file will be 1MiB in size and usage will be something like 125.5MiB (if I've done the math correctly). Larger values of counter will result in quadratically larger amounts of waste. Doing the bad things is very bad... -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS free space handling still needs more work: Hangs again
Am Samstag, 27. Dezember 2014, 15:14:05 schrieb Martin Steigerwald: Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White: On 12/27/2014 05:16 AM, Martin Steigerwald wrote: It can easily be reproduced without even using Virtualbox, just by a nice simple fio job. TL;DR: If you want a worst-case example of consuming a BTRFS filesystem with one single file...

#!/bin/bash
# not tested, so correct any syntax errors
typeset -i counter
for ((counter=250; counter>0; counter--)); do
    dd if=/dev/urandom of=/some/file bs=4k count=$counter
done
exit

Each pass over /some/file is 4k shorter than the previous one, but none of the extents can be deallocated. The file will be 1MiB in size and usage will be something like 125.5MiB (if I've done the math correctly). Larger values of counter will result in quadratically larger amounts of waste. Robert, I experienced these hang issues even before the defragmenting case. It happened while I just installed a 400 MiB tax return application to it (that is no joke, it is that big). It happens while just using the VM. Yes, I recommend not to use BTRFS for any VM image or any larger database on rotating storage, for exactly those COW semantics. But on SSD? It's busy looping a CPU core while the flash is basically idling. I refuse to believe that this is by design. I do think there is a *bug*. Either acknowledge it and try to fix it, or say it's by design *without even looking at it closely enough to be sure that it is not a bug* and limit your own possibilities by it. I'd rather see it treated as a bug for now. Come on, 254 IOPS on a filesystem with still 17 GiB of free space while randomly writing to a 4 GiB file. People do these kinds of things. Ditch that defrag Windows XP VM case, I had performance issues even before, just by installing things to it. Databases, VMs, emulators. And heck, even while just *creating* the file with fio, as I showed. Add to these use cases things like this:

martin@merkaba:~/.local/share/akonadi/db_data/akonadi> ls -lSh | head -5
insgesamt 2,2G
-rw-rw 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
-rw-rw 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
-rw-rw 1 martin martin 23M Dez 27 15:17 pimitemflagrelation.ibd
-rw-rw 1 martin martin 240K Dez 27 15:17 collectiontable.ibd

Or this:

martin@merkaba:~/.local/share/baloo> du -sch * | sort -rh
9,2G insgesamt
8,0G email
1,2G file
51M emailContacts
408K contacts
76K notes
16K calendars

martin@merkaba:~/.local/share/baloo> ls -lSh email | head -5
insgesamt 8,0G
-rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
-rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
-rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
-rw-r--r-- 1 martin martin 63K Dez 27 15:16 postlist.baseA

These will not be as bad as the fio test case, but still these files are written into. They are updated in place. And that's running on every Plasma desktop by default. And on GNOME desktops there is similar stuff. I haven't seen this spike out a kworker yet though, so maybe the workload is light enough not to trigger it that easily. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS free space handling still needs more work: Hangs again
On 12/27/2014 05:55 AM, Martin Steigerwald wrote: Summarized at Bug 90401 - btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file https://bugzilla.kernel.org/show_bug.cgi?id=90401 see below. This is reproducible with fio, no need for Windows XP in Virtualbox to reproduce the issue. Next I will try to reproduce with a freshly created filesystem. Am Samstag, 27. Dezember 2014, 09:30:43 schrieb Hugo Mills: On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote: Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White: On 12/26/2014 05:37 AM, Martin Steigerwald wrote: Hello! First: Have a merry Christmas and enjoy a quiet time in these days. Second: At a time you feel like it, here is a little rant, but also a bug report: I have this on a 3.18 kernel on Debian Sid, with a BTRFS dual-SSD RAID with space_cache, skinny metadata extents – are these a problem? – and compress=lzo: (there is no known problem with skinny metadata; it's actually more efficient than the older format. There have been some anecdotal reports about mixing the skinny and fat metadata, but nothing has ever been demonstrated to be problematic.)

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid: b96c4f72-0523-45ac-a401-f7be73dd624a
        Total devices 2 FS bytes used 144.41GiB
        devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home
Btrfs v3.17

merkaba:~> btrfs fi df /home
Data, RAID1: total=154.97GiB, used=141.12GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.29GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

This filesystem, at the allocation level, is very full (see below). And I had hangs with BTRFS again. This time as I wanted to install tax return software in a Virtualbox'd Windows XP VM (which I use once a year cause I know no tax return software for Linux which would be suitable for Germany, and I frankly don't care about the end of security cause all surfing and other network access I will do from the Linux box, and I only run the VM behind a firewall). And thus I try the balance dance again: ITEM: Balance... it doesn't do what you think it does... 8-) Balancing is something you should almost never need to do. It is only for cases of changing geometry (adding disks, switching RAID levels, etc.) or for cases when you've radically changed allocation behaviors (like you decided to remove all your VMs, or you've decided to remove a mail spool directory full of thousands of tiny files). People run balance all the time because they think they should. They are _usually_ incorrect in that belief. I only see the lockups of BTRFS if the trees *occupy* all space on the device. No, the trees occupy 3.29 GiB of your 5 GiB of mirrored metadata space. What's more, balance does *not* balance the metadata trees. The remaining space -- 154.97 GiB -- is unstructured storage for file data, and you have some 13 GiB of that available for use. Now, since you're seeing lockups when the space on your disks is all allocated, I'd say that's a bug. However, you're the *only* person who's reported this as a regular occurrence. Does this happen with all filesystems you have, or just this one? I *never* so far saw it lock up if there is still space BTRFS can allocate from to *extend* a tree. It's not a tree. It's simply space allocation. It's not even space *usage* you're talking about here -- it's just allocation (i.e. the FS saying "I'm going to use this piece of disk for this purpose").
This may be a bug, but this is what I see. And no amount of "you should not balance a BTRFS" will make that perception go away. See, I see the sun coming out in the morning and you tell me no, it doesn't. Simply that is not going to match my perception. Duncan's [Robert's :)] assertion is correct in its detail. Looking at your space usage, I would not suggest that running a balance is something you need to do. Now, since you have these lockups that seem quite repeatable, there's probably a lurking bug in there, but hacking around with balance every time you hit it isn't going to get the problem solved properly. I think I would suggest the following:
- make sure you have some way of logging your dmesg permanently (use a different filesystem for /var/log, or a serial console, or a netconsole)
- when the lockup happens, hit Alt-SysRq-t a few times
- send the dmesg output here, or post to bugzilla.kernel.org
That's probably going to give enough information to the developers to work out where the lockup is happening, and is clearly the way forward here. And I got it reproduced. *Perfectly* reproduced, I'd say. But let me run the whole story: 1) I downsized my /home BTRFS from dual 170 GiB to dual 160 GiB again. Which gave me:

merkaba:~> btrfs fi sh /home
Label: 'home'  uuid:
Re: BTRFS free space handling still needs more work: Hangs again
On 12/27/2014 06:21 AM, Martin Steigerwald wrote: Am Samstag, 27. Dezember 2014, 15:14:05 schrieb Martin Steigerwald: Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White: On 12/27/2014 05:16 AM, Martin Steigerwald wrote: It can easily be reproduced without even using Virtualbox, just by a nice simple fio job. TL;DR: If you want a worst-case example of consuming a BTRFS filesystem with one single file...

#!/bin/bash
# not tested, so correct any syntax errors
typeset -i counter
for ((counter=250; counter>0; counter--)); do
    dd if=/dev/urandom of=/some/file bs=4k count=$counter
done
exit

Each pass over /some/file is 4k shorter than the previous one, but none of the extents can be deallocated. The file will be 1MiB in size and usage will be something like 125.5MiB (if I've done the math correctly). Larger values of counter will result in quadratically larger amounts of waste. Robert, I experienced these hang issues even before the defragmenting case. It happened while I just installed a 400 MiB tax return application to it (that is no joke, it is that big). It happens while just using the VM. Yes, I recommend not to use BTRFS for any VM image or any larger database on rotating storage, for exactly those COW semantics. But on SSD? It's busy looping a CPU core while the flash is basically idling. I refuse to believe that this is by design. I do think there is a *bug*. Either acknowledge it and try to fix it, or say it's by design *without even looking at it closely enough to be sure that it is not a bug* and limit your own possibilities by it. I'd rather see it treated as a bug for now. Come on, 254 IOPS on a filesystem with still 17 GiB of free space while randomly writing to a 4 GiB file. People do these kinds of things. Ditch that defrag Windows XP VM case, I had performance issues even before, just by installing things to it. Databases, VMs, emulators. And heck, even while just *creating* the file with fio, as I showed. Add to these use cases things like this:

martin@merkaba:~/.local/share/akonadi/db_data/akonadi> ls -lSh | head -5
insgesamt 2,2G
-rw-rw 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
-rw-rw 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
-rw-rw 1 martin martin 23M Dez 27 15:17 pimitemflagrelation.ibd
-rw-rw 1 martin martin 240K Dez 27 15:17 collectiontable.ibd

Or this:

martin@merkaba:~/.local/share/baloo> du -sch * | sort -rh
9,2G insgesamt
8,0G email
1,2G file
51M emailContacts
408K contacts
76K notes
16K calendars

martin@merkaba:~/.local/share/baloo> ls -lSh email | head -5
insgesamt 8,0G
-rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
-rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
-rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
-rw-r--r-- 1 martin martin 63K Dez 27 15:16 postlist.baseA

/usr/bin/du and /usr/bin/df and /bin/ls are all _useless_ for showing the amount of filespace used by a file in BTRFS. Look at a nice paste of the previously described worst-case allocation.
Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.41GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Gust rwhite # for ((counter=250; counter>0; counter--)); do dd if=/dev/urandom of=some_file conv=notrunc,fsync bs=4k count=$counter > /dev/null 2>&1; done
Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.48GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Gust rwhite # du some_file
1000    some_file
Gust rwhite # ls -lh some_file
-rw-rw-r--+ 1 root root 1000K Dec 27 07:00 some_file
Gust rwhite # rm some_file
Gust rwhite # btrfs fi df /
Data, single: total=344.00GiB, used=340.41GiB
System, DUP: total=32.00MiB, used=80.00KiB
Metadata, DUP: total=8.00GiB, used=4.84GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Notice that some_file shows 1000 blocks in du, and 1000K bytes in ls. But notice that data used jumps from 340.41GiB to 340.48GiB when the file is created, then drops back down to 340.41GiB when it's deleted. Now, I have compression turned on, so the amount of growth/shrinkage changes between each run, but it's _way_ more than 1 MiB -- that's like 70MiB (give or take significant rounding in the third place after the decimal). So I wrote this file in a way that leads to it taking up _seventy_ _times_ its base size in actual allocated storage. Real files do not perform this terribly, but they can get pretty ugly in some cases. You _really_ need to learn how the system works and what its best and worst cases look like before you start shouting "bug!". You are using the wrong numbers (e.g. df) for available space, and you don't know how to estimate what your tools _should_ do for the conditions observed. But yes, if you open a file and scribble all over it when your disk is full to
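An aside for anyone reproducing this: filefrag from e2fsprogs shows the extent list behind the du/ls mismatch demonstrated above (on a compressed btrfs it reports the compressed extents, so treat the sizes with some care):

sync                     # let delayed allocation settle first
filefrag -v some_file    # one line per on-disk extent of the file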
Re: BTRFS free space handling still needs more work: Hangs again
Am Samstag, 27. Dezember 2014, 07:14:32 schrieb Robert White: But yes, if you open a file and scribble all over it when your disk is full to within the same order of magnitude as the size of the file you are scribbling on, you will get into a condition where the _application_ will aggressively retry the IO. Particularly if that application is a test program or a virtual machine doing asynchronous IO. That's what those sorts of systems do when they crash against a limit in the underlying system. So yea... out of space plus aggressive writer equals spinning CPU. Before you can assign blame, you need to strace your application to see what call it's making over and over again, to see if it's just being stupid. Robert, I am pretty sure that fio does not retry the I/O. If the I/O returns an error it exits immediately. I don't think BTRFS fails an I/O – there is nothing of that in kern.log or dmesg. But it just needs a very long time for it. And yet, with the "BTRFS *is* *full*" test case I still can't reproduce the 300 IOPS case. I consistently get about 4800 IOPS, which is just about okay IMHO. fio just does random I/O. Aggressively, yes. But it would stop on the *first* *failed* I/O request. I am pretty sure of that. fio is the flexible I/O tester. It has been written mostly by Jens Axboe. Jens is the block maintainer of the Linux kernel. So I kindly ask that before you assume I use crap tools, you have a look at it. From how you write I get the impression that you think everyone else besides you is just silly and dumb. Please stop this assumption. I may not always get terms right, and I may make a mistake, as with the wrong df figure. But I also highly dislike feeling treated like someone who doesn't know a thing. I made my case. I tried to reproduce it in a test case. Now I suggest we wait till someone has had an actual look at the sysrq-t triggers in the 25 MiB kern.log I provided in the bug report. I will now wait for BTRFS developers to comment on this. I think Chris and Josef and other BTRFS developers actually know what fio is, so… either they are interested in that 300 IOPS case I cannot yet reproduce with a fresh filesystem, or not. Even when it is almost as full as it can get and the fio job *barely* completes without a "no space left on device" error, I still get those 4800 IOPS. I tested it and took the first run where it actually completed again, after deleting partial copies of the /usr/bin directory from the test filesystem, as I have shown in my test case (see my other mail with the altered subject line). So for at least a *small* full filesystem, the "filesystem full, so BTRFS has to search for free space aggressively" case *does not* explain what I see with my /home. So either I need a fuller filesystem for the test case, maybe one which carries a million files or more, or one that at least has more chunks to allocate from, or there is more to it and there is something with my /home that makes it even worse. So it isn't just the "filesystem full" case, and the "all free space allocated for chunks" condition also does not suffice, as my test case shows (where BTRFS just won't allocate another data chunk, it seems). Ciao, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
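One caveat on the strace suggestion: strace only shows system calls, so it can watch fio, but a kworker is a kernel thread and makes no system calls to trace. perf is the more suitable tool there; a rough sketch, assuming perf is installed and run as root while the kworker is spinning:

perf record -a -g sleep 10   # sample all CPUs, with call graphs, for 10 seconds
perf report                  # then look for btrfs free-space/allocation paths in the profile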
Re: BTRFS free space handling still needs more work: Hangs again
Am Samstag, 27. Dezember 2014, 07:14:32 schrieb Robert White: On 12/27/2014 06:21 AM, Martin Steigerwald wrote: Am Samstag, 27. Dezember 2014, 15:14:05 schrieb Martin Steigerwald: Am Samstag, 27. Dezember 2014, 06:00:48 schrieb Robert White: On 12/27/2014 05:16 AM, Martin Steigerwald wrote: It can easily be reproduced without even using Virtualbox, just by a nice simple fio job. TL;DR: If you want a worst-case example of consuming a BTRFS filesystem with one single file...

#!/bin/bash
# not tested, so correct any syntax errors
typeset -i counter
for ((counter=250; counter>0; counter--)); do
    dd if=/dev/urandom of=/some/file bs=4k count=$counter
done
exit

Each pass over /some/file is 4k shorter than the previous one, but none of the extents can be deallocated. The file will be 1MiB in size and usage will be something like 125.5MiB (if I've done the math correctly). Larger values of counter will result in quadratically larger amounts of waste. Robert, I experienced these hang issues even before the defragmenting case. It happened while I just installed a 400 MiB tax return application to it (that is no joke, it is that big). It happens while just using the VM. Yes, I recommend not to use BTRFS for any VM image or any larger database on rotating storage, for exactly those COW semantics. But on SSD? It's busy looping a CPU core while the flash is basically idling. I refuse to believe that this is by design. I do think there is a *bug*. Either acknowledge it and try to fix it, or say it's by design *without even looking at it closely enough to be sure that it is not a bug* and limit your own possibilities by it. I'd rather see it treated as a bug for now. Come on, 254 IOPS on a filesystem with still 17 GiB of free space while randomly writing to a 4 GiB file. People do these kinds of things. Ditch that defrag Windows XP VM case, I had performance issues even before, just by installing things to it. Databases, VMs, emulators. And heck, even while just *creating* the file with fio, as I showed. Add to these use cases things like this:

martin@merkaba:~/.local/share/akonadi/db_data/akonadi> ls -lSh | head -5
insgesamt 2,2G
-rw-rw 1 martin martin 1,7G Dez 27 15:17 parttable.ibd
-rw-rw 1 martin martin 488M Dez 27 15:17 pimitemtable.ibd
-rw-rw 1 martin martin 23M Dez 27 15:17 pimitemflagrelation.ibd
-rw-rw 1 martin martin 240K Dez 27 15:17 collectiontable.ibd

Or this:

martin@merkaba:~/.local/share/baloo> du -sch * | sort -rh
9,2G insgesamt
8,0G email
1,2G file
51M emailContacts
408K contacts
76K notes
16K calendars

martin@merkaba:~/.local/share/baloo> ls -lSh email | head -5
insgesamt 8,0G
-rw-r--r-- 1 martin martin 4,0G Dez 27 15:16 postlist.DB
-rw-r--r-- 1 martin martin 3,9G Dez 27 15:16 termlist.DB
-rw-r--r-- 1 martin martin 143M Dez 27 15:16 record.DB
-rw-r--r-- 1 martin martin 63K Dez 27 15:16 postlist.baseA

/usr/bin/du and /usr/bin/df and /bin/ls are all _useless_ for showing the amount of filespace used by a file in BTRFS. Yes. But they are *useful* to demonstrate that there are regular desktop applications which randomly write into huge files. And that was *exactly* the point I was trying to make. Yes, I didn't prove the random aspect. But heck, one is a MySQL and one is a Xapian. I am fairly sure that for a desktop search and for maildir folder indexing there is some random aspect in the workload. Do you agree to that? So what you call bad – and that was exactly the point I was going to make – is going to happen on real systems. Maybe not as fierce as with a fio job, granted.
And for these workloads said /home BTRFS worked fine, but just installing a 400 MiB application into the Windows XP VM already gave me the hang. With more than 8 GiB of free space within the chunks at that time. If BTRFS falls to something like 300 IOPS on a dual-SSD RAID 1 under disk-full conditions on workloads like this, it will fail in real-world scenarios. And again, my recommendation to leave way more free space than with other filesystems still holds. Yes, I saw XFS developer Dave Chinner recommending about 50% free space on XFS for a crazy workload, in case you want the filesystem in a young state even after 10 years. So I am fully aware that filesystems will age. But to *this* extent? After only about six months of actually running this BTRFS RAID 1, which started as a fresh single-device BTRFS that I then balanced as RAID 1 onto the second SSD? I still think it is a bug. Especially as it just does not happen with a simple disk-full condition, and I spent several hours trying to reproduce this worst case. If it only happens with my /home, I am willing to accept that something may be borked with it. And I haven't been able to reproduce it with a clean filesystem yet. So maybe it doesn't happen for others. Then all fine, I recreate the FS and forget about it. But before I do any
Re: BTRFS free space handling still needs more work: Hangs again
On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote: On 12/27/2014 05:55 AM, Martin Steigerwald wrote: [snip] while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU for 10 seconds while allocating a 4 GiB file on a filesystem like:

martin@merkaba:~> LANG=C df -hT /home
Filesystem             Type  Size  Used Avail Use% Mounted on
/dev/mapper/msata-home btrfs 170G  156G   17G  91% /home

where a 4 GiB file should easily fit, no? (And this output is with the 4 GiB file. So it was even 4 GiB more free before.) No. /usr/bin/df is an _approximation_ in BTRFS because of the limits of the statfs() function call. The statfs function call was defined in 1990 and can't understand the dynamic allocation model used in BTRFS, as it assumes fixed geometry for filesystems. You do _not_ have 17G actually available. You need to rely on btrfs fi df and btrfs fi show to figure out how much space you _really_ have. According to this block you have a RAID1 of ~160GB expanse (two 160G disks):

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:26:39 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 152.83GiB
        devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

And according to this block you have about 4.49GiB of data space:

Btrfs v3.17
Data, RAID1: total=154.97GiB, used=149.58GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

154.97 + 5.00 + 0.032 + 0.512 is pretty much as close to 160GiB as you are going to get (those numbers being rounded up in places for human readability). BTRFS has allocated 100% of the raw storage into typed extents. A large datafile can only fit in the 154.97 - 149.58 = 5.39 GiB that remain. I appreciate that this is something of a minor point in the grand scheme of things, but I'm afraid I've lost the enthusiasm to engage with the broader (somewhat rambling, possibly-at-cross-purposes) conversation in this thread. However... Trying to allocate that 4GiB file into that 5.39GiB of space becomes an NP-complete (i.e. very hard) problem if it is very fragmented. This is... badly mistaken, at best. The problem of where to write a file into a set of free extents is definitely *not* an NP-hard problem. It's a P problem, with an O(n log n) solution, where n is the number of free extents in the free space cache. The simple approach: fill the first hole with as many bytes as you can, then move on to the next hole. More complex: order the free extents by size first. Both of these are O(n log n) algorithms, given an efficient general-purpose index of free space. The problem of placing file data isn't a bin-packing problem; it's not like allocating RAM (where each allocation must be contiguous). The items being placed may be split as much as you like, although minimising the amount of splitting is a goal. I suspect that the performance problems that Martin is seeing may indeed be related to free space fragmentation, in that finding and creating all of those tiny extents for a huge file is causing problems. I believe that btrfs isn't alone in this, but it may well be showing the problem to a far greater degree than other FSes. I don't have figures to compare, I'm afraid. I also don't know what kind of tool you are using, but it might be repeatedly trying and failing to fallocate the file as a single extent, or something equally dumb. Userspace doesn't, as far as I know, get to make that decision.
I've just read the fallocate(2) man page, and it says nothing at all about the contiguity of the extent(s) of storage allocated by the call. Hugo. [snip] -- Hugo Mills | O tempura! O moresushi! hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: 65E74AC0 | signature.asc Description: Digital signature
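To make Hugo's "split as much as you like" point concrete, here is a toy sketch with made-up hole sizes of the simple first-fit placement he describes; this is not btrfs code, and it ignores the index maintenance that makes the real accounting O(n log n):

#!/bin/bash
# Toy model: fill the first hole as far as possible, split the rest
# across later holes. All sizes are hypothetical block counts.
request=4000
holes=(1200 300 2500 800)
for h in "${holes[@]}"; do
    (( request == 0 )) && break
    take=$(( h < request ? h : request ))
    echo "place $take blocks into a $h-block hole"
    (( request -= take ))
done
if (( request > 0 )); then
    echo "ENOSPC: $request blocks would not fit"
fi

A single linear pass suffices precisely because the data may be split across holes; only an allocator that had to place the file contiguously would face a packing problem.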
Re: BTRFS free space handling still needs more work: Hangs again
Am Samstag, 27. Dezember 2014, 16:26:42 schrieb Hugo Mills: On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote: On 12/27/2014 05:55 AM, Martin Steigerwald wrote: [snip] while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU for 10 seconds while allocating a 4 GiB file on a filesystem like:

martin@merkaba:~> LANG=C df -hT /home
Filesystem             Type  Size  Used Avail Use% Mounted on
/dev/mapper/msata-home btrfs 170G  156G   17G  91% /home

where a 4 GiB file should easily fit, no? (And this output is with the 4 GiB file. So it was even 4 GiB more free before.) No. /usr/bin/df is an _approximation_ in BTRFS because of the limits of the statfs() function call. The statfs function call was defined in 1990 and can't understand the dynamic allocation model used in BTRFS, as it assumes fixed geometry for filesystems. You do _not_ have 17G actually available. You need to rely on btrfs fi df and btrfs fi show to figure out how much space you _really_ have. According to this block you have a RAID1 of ~160GB expanse (two 160G disks):

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:26:39 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 152.83GiB
        devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

And according to this block you have about 4.49GiB of data space:

Btrfs v3.17
Data, RAID1: total=154.97GiB, used=149.58GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

154.97 + 5.00 + 0.032 + 0.512 is pretty much as close to 160GiB as you are going to get (those numbers being rounded up in places for human readability). BTRFS has allocated 100% of the raw storage into typed extents. A large datafile can only fit in the 154.97 - 149.58 = 5.39 GiB that remain. I appreciate that this is something of a minor point in the grand scheme of things, but I'm afraid I've lost the enthusiasm to engage with the broader (somewhat rambling, possibly-at-cross-purposes) conversation in this thread. However... Trying to allocate that 4GiB file into that 5.39GiB of space becomes an NP-complete (i.e. very hard) problem if it is very fragmented. This is... badly mistaken, at best. The problem of where to write a file into a set of free extents is definitely *not* an NP-hard problem. It's a P problem, with an O(n log n) solution, where n is the number of free extents in the free space cache. The simple approach: fill the first hole with as many bytes as you can, then move on to the next hole. More complex: order the free extents by size first. Both of these are O(n log n) algorithms, given an efficient general-purpose index of free space. The problem of placing file data isn't a bin-packing problem; it's not like allocating RAM (where each allocation must be contiguous). The items being placed may be split as much as you like, although minimising the amount of splitting is a goal. I suspect that the performance problems that Martin is seeing may indeed be related to free space fragmentation, in that finding and creating all of those tiny extents for a huge file is causing problems. I believe that btrfs isn't alone in this, but it may well be showing the problem to a far greater degree than other FSes. I don't have figures to compare, I'm afraid. That's what I wanted to hint at. I suspect an issue with free space fragmentation, and act on what I think I see: btrfs balance minimizes free space fragmentation within chunks.
And that is my whole case on why I think it does help with my /home filesystem. So while btrfs filesystem defragment may help with defragmenting individual files, possibly at the cost of fragmenting free space, at least under filesystem-almost-full conditions, I think to help with free space fragmentation there are only three options at the moment:
1) reformat and restore via rsync or btrfs send from backup (i.e. file based)
2) make the BTRFS in itself bigger
3) btrfs balance at least some chunks, at least those that are not more than 70% or 80% full.
Do you know of any other ways to deal with it? So yes, in case it really is free space fragmentation, I do think a balance may be helpful. Even if usually one should not use a balance. I also don't know what kind of tool you are using, but it might be repeatedly trying and failing to fallocate the file as a single extent, or something equally dumb. Userspace doesn't, as far as I know, get to make that decision. I've just read the fallocate(2) man page, and it says nothing at all about the contiguity of the extent(s) of storage allocated by the call. fio fallocates just once. And then writes, even if the fallocate call fails. It was nice to see at some point as BTRFS
Re: BTRFS free space handling still needs more work: Hangs again
Am Samstag, 27. Dezember 2014, 18:11:21 schrieb Martin Steigerwald: Am Samstag, 27. Dezember 2014, 16:26:42 schrieb Hugo Mills: On Sat, Dec 27, 2014 at 06:54:33AM -0800, Robert White wrote: On 12/27/2014 05:55 AM, Martin Steigerwald wrote: [snip] while fio was just *laying* out the 4 GiB file. Yes, that's 100% system CPU for 10 seconds while allocating a 4 GiB file on a filesystem like:

martin@merkaba:~> LANG=C df -hT /home
Filesystem             Type  Size  Used Avail Use% Mounted on
/dev/mapper/msata-home btrfs 170G  156G   17G  91% /home

where a 4 GiB file should easily fit, no? (And this output is with the 4 GiB file. So it was even 4 GiB more free before.) No. /usr/bin/df is an _approximation_ in BTRFS because of the limits of the statfs() function call. The statfs function call was defined in 1990 and can't understand the dynamic allocation model used in BTRFS, as it assumes fixed geometry for filesystems. You do _not_ have 17G actually available. You need to rely on btrfs fi df and btrfs fi show to figure out how much space you _really_ have. According to this block you have a RAID1 of ~160GB expanse (two 160G disks):

merkaba:~> date; btrfs fi sh /home ; btrfs fi df /home
Sa 27. Dez 13:26:39 CET 2014
Label: 'home'  uuid: [some UUID]
        Total devices 2 FS bytes used 152.83GiB
        devid 1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home
        devid 2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home

And according to this block you have about 4.49GiB of data space:

Btrfs v3.17
Data, RAID1: total=154.97GiB, used=149.58GiB
System, RAID1: total=32.00MiB, used=48.00KiB
Metadata, RAID1: total=5.00GiB, used=3.26GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

154.97 + 5.00 + 0.032 + 0.512 is pretty much as close to 160GiB as you are going to get (those numbers being rounded up in places for human readability). BTRFS has allocated 100% of the raw storage into typed extents. A large datafile can only fit in the 154.97 - 149.58 = 5.39 GiB that remain. I appreciate that this is something of a minor point in the grand scheme of things, but I'm afraid I've lost the enthusiasm to engage with the broader (somewhat rambling, possibly-at-cross-purposes) conversation in this thread. However... Trying to allocate that 4GiB file into that 5.39GiB of space becomes an NP-complete (i.e. very hard) problem if it is very fragmented. This is... badly mistaken, at best. The problem of where to write a file into a set of free extents is definitely *not* an NP-hard problem. It's a P problem, with an O(n log n) solution, where n is the number of free extents in the free space cache. The simple approach: fill the first hole with as many bytes as you can, then move on to the next hole. More complex: order the free extents by size first. Both of these are O(n log n) algorithms, given an efficient general-purpose index of free space. The problem of placing file data isn't a bin-packing problem; it's not like allocating RAM (where each allocation must be contiguous). The items being placed may be split as much as you like, although minimising the amount of splitting is a goal. I suspect that the performance problems that Martin is seeing may indeed be related to free space fragmentation, in that finding and creating all of those tiny extents for a huge file is causing problems. I believe that btrfs isn't alone in this, but it may well be showing the problem to a far greater degree than other FSes. I don't have figures to compare, I'm afraid. That's what I wanted to hint at.
I suspect an issue with free space fragmentation, and act on what I think I see: btrfs balance minimizes free space fragmentation within chunks. And that is my whole case on why I think it does help with my /home filesystem. So while btrfs filesystem defragment may help with defragmenting individual files, possibly at the cost of fragmenting free space, at least under filesystem-almost-full conditions, I think to help with free space fragmentation there are only three options at the moment:
1) reformat and restore via rsync or btrfs send from backup (i.e. file based)
2) make the BTRFS in itself bigger
3) btrfs balance at least some chunks, at least those that are not more than 70% or 80% full.
Do you know of any other ways to deal with it? Yes. 4) Delete some stuff from it or move it over to a different filesystem. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 signature.asc Description: This is a digitally signed message part.
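For what it's worth, option 3 from the list above maps directly onto balance's usage filter; a minimal sketch, where 70 is Martin's threshold and the mount point is an assumption:

# Rewrite only data chunks that are at most 70% used; their contents get
# packed into fewer chunks, and the freed chunks return to the unallocated pool.
btrfs balance start -dusage=70 /home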
Re: BTRFS free space handling still needs more work: Hangs again
On Sat, Dec 27, 2014 at 09:30:43AM +, Hugo Mills wrote: On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote: Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White: On 12/26/2014 05:37 AM, Martin Steigerwald wrote: Now, since you're seeing lockups when the space on your disks is all allocated I'd say that's a bug. However, you're the *only* person who's reported this as a regular occurrence. Does this happen with all filesystems you have, or just this one? I do see something similar, but there are so many problems going on I have no idea which ones to report, and which ones are my own doing. :-P I see lots of CPU being burned when all the disk space is allocated to chunks, but there is still lots of space free (multiple GB) inside the chunks. iotop shows a crapton of disk writes (1-5MB/sec) from one kworker. There are maybe a few kB/sec of writes through the filesystem at the time. The filesystem where I see this most is on a laptop, so the disk writes also hit the CPU again for encryption. There's so much CPU usage it's worth mentioning twice. :-( 'watch cat /proc/12345/stack' on the active processes shows the kernel fairly often in that new chunk deallocator function whose name escapes me at the moment. Deleting a bunch of data then running balance helps return to sane CPU usage...for a while (maybe a week?). It's not technically locked up per se, but when a 5KB download takes a minute or more, most users won't wait around to see the difference. Kernel versions I'm using are 3.17.7 and 3.18.1. signature.asc Description: Digital signature
Re: BTRFS free space handling still needs more work: Hangs again
On Sat, Dec 27, 2014 at 01:28:46PM -0500, Zygo Blaxell wrote: On Sat, Dec 27, 2014 at 09:30:43AM +, Hugo Mills wrote: On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote: Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White: On 12/26/2014 05:37 AM, Martin Steigerwald wrote: Now, since you're seeing lockups when the space on your disks is all allocated I'd say that's a bug. However, you're the *only* person who's reported this as a regular occurrence. Does this happen with all filesystems you have, or just this one? I do see something similar, but there are so many problems going on I have no idea which ones to report, and which ones are my own doing. :-P I see lots of CPU being burned when all the disk space is allocated to chunks, but there is still lots of space free (multiple GB) inside the chunks. iotop shows a crapton of disk writes (1-5MB/sec) from one kworker. There are maybe a few kB/sec of writes through the filesystem at the time. The filesystem where I see this most is on a laptop, so the disk writes also hit the CPU again for encryption. There's so much CPU usage it's worth mentioning twice. :-( 'watch cat /proc/12345/stack' on the active processes shows the kernel fairly often in that new chunk deallocator function whose name escapes me at the moment. Deleting a bunch of data then running balance helps return to sane CPU usage...for a while (maybe a week?). It's not technically locked up per se, but when a 5KB download takes a minute or more, most users won't wait around to see the difference. Kernel versions I'm using are 3.17.7 and 3.18.1. OK, so I'd like to change my statement above. When I first read Martin's problem, I thought that he was referring to a complete, hit-the-power-button kind of lock-up. Given that (erroneous) assumption, I stand by my (now pointless) statement. :) I realised during a brief conversation on IRC that Martin was actually referring to long but temporary periods where the machine is unusable by any process requiring disk activity. There's clearly a number of people seeing that. It doesn't stop it being a major problem, but it does change the interpretation considerably. Hugo. -- Hugo Mills | Mixing mathematics and alcohol is dangerous. Don't hugo@... carfax.org.uk | drink and derive. http://carfax.org.uk/ | PGP: 65E74AC0 | signature.asc Description: Digital signature
Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, just tasks stuck for some time)
Am Samstag, 27. Dezember 2014, 18:40:17 schrieb Hugo Mills: On Sat, Dec 27, 2014 at 01:28:46PM -0500, Zygo Blaxell wrote: On Sat, Dec 27, 2014 at 09:30:43AM +, Hugo Mills wrote: On Sat, Dec 27, 2014 at 10:01:17AM +0100, Martin Steigerwald wrote: Am Freitag, 26. Dezember 2014, 14:48:38 schrieb Robert White: On 12/26/2014 05:37 AM, Martin Steigerwald wrote: Now, since you're seeing lockups when the space on your disks is all allocated I'd say that's a bug. However, you're the *only* person who's reported this as a regular occurrence. Does this happen with all filesystems you have, or just this one? I do see something similar, but there are so many problems going on I have no idea which ones to report, and which ones are my own doing. :-P I see lots of CPU being burned when all the disk space is allocated to chunks, but there is still lots of space free (multiple GB) inside the chunks. iotop shows a crapton of disk writes (1-5MB/sec) from one kworker. There are maybe a few kB/sec of writes through the filesystem at the time. The filesystem where I see this most is on a laptop, so the disk writes also hit the CPU again for encryption. There's so much CPU usage it's worth mentioning twice. :-( 'watch cat /proc/12345/stack' on the active processes shows the kernel fairly often in that new chunk deallocator function whose name escapes me at the moment. Deleting a bunch of data then running balance helps return to sane CPU usage...for a while (maybe a week?). It's not technically locked up per se, but when a 5KB download takes a minute or more, most users won't wait around to see the difference. Kernel versions I'm using are 3.17.7 and 3.18.1. OK, so I'd like to change my statement above. When I first read Martin's problem, I thought that he was referring to a complete, hit-the-power-button kind of lock-up. Given that (erroneous) assumption, I stand by my (now pointless) statement. :) I realised during a brief conversation on IRC that Martin was actually referring to long but temporary periods where the machine is unusable by any process requiring disk activity. There's clearly a number of people seeing that. It doesn't stop it being a major problem, but it does change the interpretation considerably. Ah, then my bet was right about whom I talked with there. :) Yeah, it does not seem to be a complete hang. I thought so initially, cause honestly, after waiting several minutes for my Plasma desktop to come back, I just gave up. Maybe it would have returned at some time. I just didn't have the patience to wait. It now did, at my last testing, where I continued on tty1 (I had all the testing in a screen session) as the desktop session locked up. Some time after the test completed I was able to use that desktop again, and I am still using it. So the issue I see is: one kworker uses 100% of one core for minutes, and while doing so, processes that do I/O to the BTRFS that I test (/home, in my case) seem to be stuck in uninterruptible sleep (D process state). While I see this there is no huge load on the SSDs, so… it seems to be something CPU bound. I didn't yet use an strace on the kworker process – or, at allocation time, on the fio process –; Robert, that's a good suggestion. From a gut feeling I wouldn't be surprised if I see *nothing* in strace, as my bet is that the kworker thread deals with finding free space inside the chunks and deals with some data structures while doing so. But that is really just a gut feeling, and so an strace would be nice. I made a backup yesterday, so I think I can try the strace.
But I also spent a considerable amount of time reproducing it and digging deeper into it, so likely not this weekend anymore, although it even is some fun. But I see myself neglecting other stuff that's important to me as well, so… My simple test case didn't trigger it, and I do not have another twice 160 GiB available on these SSDs to try with a copy of my home filesystem. Then I could safely test without bringing the desktop session to a halt. Maybe someone has an idea on how to enhance my test case in order to reliably trigger the issue. It may be challenging though. My /home is quite a filesystem. It has a maildir with at least a million files (yeah, I am performance testing KMail and Akonadi to the limit as well!), and it has git repos and this one VM image, and the desktop search and the Akonadi database. In other words: it has been hit nicely with various – mostly random, I think – workloads over the last roughly six months. I bet it's not that easy to simulate that. Maybe some runs of compilebench to age the filesystem before the fio test? That said, BTRFS performs a lot better. The complete lockups without any CPU usage of 3.15 and 3.16 are gone for sure. That's wonderful. But there is this kworker issue now. I noticed it that gravely just while trying to complete this tax returns
Re: BTRFS free space handling still needs more work: Hangs again
Semi off-topic questions... On 12/27/2014 08:26 AM, Hugo Mills wrote: This is... badly mistaken, at best. The problem of where to write a file into a set of free extents is definitely *not* an NP-hard problem. It's a P problem, with an O(n log n) solution, where n is the number of free extents in the free space cache. The simple approach: fill the first hole with as many bytes as you can, then move on to the next hole. More complex: order the free extents by size first. Both of these are O(n log n) algorithms, given an efficient general-purpose index of free space. Which algorithm is actually in use? Is any attempt made to keep subsequent allocations in the same data extent? All of best-fit, first-fit, and first-encountered allocation have terrible distribution graphs over time. Without a nod to locality, discontiguous allocation will have staggeringly bad after-effects in terms of read-ahead. The problem of placing file data isn't a bin-packing problem; it's not like allocating RAM (where each allocation must be contiguous). The items being placed may be split as much as you like, although minimising the amount of splitting is a goal. How are compression and re-compression handled? If a linear extent is compressed to find its on-disk size in bytes, and then there isn't an extent large enough to fit it, it has to be cut, then recompressed, then searched for again, right? How does the system look for the right cut? How iterative can this get? Does it always try cutting in half? Does it shave single bytes off the end? Does it add one byte at a time till it reaches the size of the extent it's looking at? Can you get down to a point where you are placing data in five- or ten-byte chunks somehow? (e.g. what's the smallest chunk you can place? Clearly, if I open a multi-megabyte file and update a single word or byte, it's not going to land in metadata, from my reading of the code.) One could easily end up with a couple million free extents of just a few bytes each, particularly if largest-first allocation is used. The degenerate cases here do come straight from the various packing problems. You may not be executing any of those packing algorithms, but once you ignore enough of those issues in the easy cases, your free space will be a fine pink mist suspended in space (both an explosion analogy and a reference to pink noise 8-) ). I suspect that the performance problems that Martin is seeing may indeed be related to free space fragmentation, in that finding and creating all of those tiny extents for a huge file is causing problems. I believe that btrfs isn't alone in this, but it may well be showing the problem to a far greater degree than other FSes. I don't have figures to compare, I'm afraid. I also don't know what kind of tool you are using, but it might be repeatedly trying and failing to fallocate the file as a single extent, or something equally dumb. Userspace doesn't, as far as I know, get to make that decision. I've just read the fallocate(2) man page, and it says nothing at all about the contiguity of the extent(s) of storage allocated by the call. Yep, my bad. But as soon as I saw that fio was starting two threads, one doing random read/write and another doing sequential read/write, both on the same file, it set off my "not just creating a file" mindset. Given the delayed write into/through the cache normally done by casual file I/O, it seemed likely that fio would be doing something more aggressive (like using O_DIRECT or repeated fdatasync(), which could get very tit-for-tat).
Compare that to a VM in which the guest operating system knows it has, and has used, its disk space internally, and the subsequent async activity of the monitor to push that activity out to real storage, which is usually quite pathological... well, you can get into some super pernicious behavior over write ordering and infinite retries. So I was wrong about fallocate per se; applications can be incredibly dumb. For instance, a VM might think it's _inconceivable_ to get an ENOSPC while rewriting data it's just read from a file it knows has no holes, etc. Given how lots of code doesn't even check the results of many function calls... how many times have you seen code that doesn't look at the return value of fwrite() or printf()? Or one that, if it does check, does something like if (bytes_written < size) retry_remainder();? So sure, I was imagining an fallocate() in a loop or something equally dumb. 8-) Hugo. [snip] -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS free space handling still needs more work: Hangs again
On 12/27/2014 08:01 AM, Martin Steigerwald wrote: From how you write I get the impression that you think everyone else besides you is just silly and dumb. Please stop this assumption. I may not always get terms right, and I may make a mistake, as with the wrong df figure. But I also highly dislike feeling treated like someone who doesn't know a thing. Nope. I'm a systems theorist and I demand/require variable isolation. Not a question of silly or dumb, but a question of speaking with sufficient precision and clarity. For instance, you speak of having an impression and then decide I've made an assumption. I define my position. Explain my terms. Give my examples. I also risk being utterly wrong, because sometimes being completely wrong gets others to cut away misconceptions and assumptions. It annoys some people, but it gets results. You've been going around on this topic for how long? And just today Hugo got that your problem is becoming CPU bound (long process) instead of a hard lockup. We've stopped talking about trees and started talking about free space management. We've stopped talking about 17G of free space and gotten down to the 5 or so, plus you've gotten angry at me, tried to prove me an idiot, and so produced test cases and data that are absolutely clear, including steps to reproduce. In real life I work on mission-critical systems that can get people killed when they fail. So I have developed the reflex of tenacity in getting everyone using the same words, talking about the same concepts, giving concrete examples, and generally bringing the discussion to a very precise head. Example: I had two parties in conflict about a system. One party said that every time they did an orderly shutdown, the device would hang in a way that took days to recover from. The other party would examine the device and say "could not reproduce". It turns out that the two parties were doing entirely different (but both correct) sequences for orderly shutdown. They'd been having that conflict for more than a year. But since they both _knew_ what an orderly shutdown was, they _never_ analyzed what they were saying. (It turns out one procedure left a chip in a state where it wouldn't restart until a capacitor discharged, and the other procedure did not.) So yea, when people make statements that everybody understands and those statements don't agree, I start slicing concepts off one at a time... It's not about dumb or silly; it's about exact and accurate descriptions that have been stripped of assumptions and tribal knowledge. And I don't care if I come off looking like the bad guy, because I don't believe in "the bad guy" at all when it comes to making things more clear and getting out of a communications deadlock. My only goal is "less broken". So occasionally annoying... but look... progress! -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS free space handling still needs more work: Hangs again
On 2014-12-28 01:25, Robert White wrote: On 12/27/2014 08:01 AM, Martin Steigerwald wrote: From how you write I get the impression that you think everyone else besides you is just silly and dumb. Please stop this assumption. I may not always get terms right, and I may make a mistake as with the wrong df figure. But I also highly dislike to feel treated like someone who doesn´t know a thing. Nope. I'm a systems theorist and I demand/require variable isolation. Not a question of silly or dumb but a question of speaking with sufficient precision and clarity. For instance you speak of having an impression and then decide I've made an assumption. I define my position. Explain my terms. Give my examples. I also risk being utterly wrong because sometimes being completely wrong gets others to cut away misconceptions and assumptions. It annoys some people, but it gets results. Can you please stop this bullshit posturing nonsense? It accomplishes nothing -- if you're right, your other posts will stand for themselves and show that you are indeed the shit when it comes to these matters, but this post (so far, didn't read further) accomplishes nothing other than (possibly) convincing everyone that you're a pompous/self-important ass. Regards,
Re: BTRFS free space handling still needs more work: Hangs again
On 12/27/2014 05:01 PM, Bardur Arantsson wrote: On 2014-12-28 01:25, Robert White wrote: On 12/27/2014 08:01 AM, Martin Steigerwald wrote: From how you write I get the impression that you think everyone else besides you is just silly and dumb. Please stop this assumption. I may not always get terms right, and I may make a mistake as with the wrong df figure. But I also highly dislike to feel treated like someone who doesn´t know a thing. Nope. I'm a systems theorist and I demand/require variable isolation. Not a question of silly or dumb but a question of speaking with sufficient precision and clarity. For instance you speak of having an impression and then decide I've made an assumption. I define my position. Explain my terms. Give my examples. I also risk being utterly wrong because sometimes being completely wrong gets others to cut away misconceptions and assumptions. It annoys some people, but it gets results. Can you please stop this bullshit posturing nonsense? It accomplishes nothing -- if you're right, your other posts will stand for themselves and show that you are indeed the shit when it comes to these matters, but this post (so far, didn't read further) accomplishes nothing other than (possibly) convincing everyone that you're a pompous/self-important ass. Really? Accomplishes nothing? 24 hours ago the complaining party was talking about - Windows XP - Tax software - Virtual box - vdi files - defragging - balancing - data trees - system hanging And the responding party was saying you are the only person reporting this as a regular occurrence, with the implication that the report was a duplicate or at least might not get much immediate attention. Now: the complaining party has verified the minimum, repeatable case of simple file allocation on a very fragmented system, and the responding party and several others have understood and supported the bug. That's not accomplishing nothing; that's called engaging in diagnostics instead of dismissing a complaint, and sticking out the diagnostic process until everyone is on the same page. I never dismissed Martin. I never disbelieved him. I went through his elements one at a time, with examples of what I was taking away from him and why they didn't match expectations and experimental evidence. We adjusted our positions and communications. So you can call it bullshit posturing nonsense, but what I see is getting to the bottom, in less than a day, of a bug report that might otherwise not have gotten significant attention.
BTRFS free space handling still needs more work: Hangs again
Hello! First: Have a merry Christmas and enjoy a quiet time in these days. Second: At a time you feel like it, here is a little rant, but also a bug report: I have this on a 3.18 kernel on Debian Sid with a BTRFS Dual SSD RAID with space_cache, skinny metadata extents – are these a problem? – and compress=lzo: merkaba:~ btrfs fi sh /home Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a Total devices 2 FS bytes used 144.41GiB devid1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home devid2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home Btrfs v3.17 merkaba:~ btrfs fi df /home Data, RAID1: total=154.97GiB, used=141.12GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.29GiB GlobalReserve, single: total=512.00MiB, used=0.00B And I had hangs with BTRFS again. This time as I wanted to install tax return software in a Virtualbox´d Windows XP VM (which I use once a year cause I know no tax return software for Linux which would be suitable for Germany, and I frankly don´t care about the end of security cause all surfing and other network access I will do from the Linux box and I only run the VM behind a firewall). And thus I try the balance dance again: merkaba:~ btrfs balance start -dusage=5 -musage=5 /home ERROR: error during balancing '/home' - No space left on device There may be more info in syslog - try dmesg | tail merkaba:~#1 btrfs balance start -dusage=5 -musage=5 /home ERROR: error during balancing '/home' - No space left on device There may be more info in syslog - try dmesg | tail merkaba:~#1 btrfs balance start -dusage=5 /home Done, had to relocate 0 out of 164 chunks merkaba:~ btrfs balance start -dusage=10 /home Done, had to relocate 0 out of 164 chunks merkaba:~ btrfs balance start -dusage=20 /home Done, had to relocate 0 out of 164 chunks merkaba:~ btrfs balance start -dusage=30 /home Done, had to relocate 0 out of 164 chunks merkaba:~ btrfs balance start -dusage=40 /home Done, had to relocate 0 out of 164 chunks merkaba:~ btrfs balance start -dusage=50 /home Done, had to relocate 0 out of 164 chunks merkaba:~ btrfs balance start -dusage=60 /home Done, had to relocate 0 out of 164 chunks merkaba:~ btrfs balance start -dusage=70 /home ERROR: error during balancing '/home' - No space left on device There may be more info in syslog - try dmesg | tail merkaba:~#1 btrfs balance start -dusage=70 /home ERROR: error during balancing '/home' - No space left on device There may be more info in syslog - try dmesg | tail merkaba:~#1 btrfs balance start -dusage=70 /home ERROR: error during balancing '/home' - No space left on device There may be more info in syslog - try dmesg | tail merkaba:~#1 btrfs balance start -dusage=65 /home Done, had to relocate 0 out of 164 chunks merkaba:~ btrfs balance start -dusage=67 /home ERROR: error during balancing '/home' - No space left on device There may be more info in syslog - try dmesg | tail merkaba:~#1 btrfs balance start -musage=10 /home ERROR: error during balancing '/home' - No space left on device There may be more info in syslog - try dmesg | tail merkaba:~#1 btrfs balance start -musage=05 /home ERROR: error during balancing '/home' - No space left on device There may be more info in syslog - try dmesg | tail Okay, not really, ey? But merkaba:~ btrfs balance start /home works. So I am basically rebalancing everything, without need I bet, causing more churn to the SSDs than is needed. Otherwise the alternative would be to make the BTRFS larger, I bet. Well, this is still not what I would consider stable.
So I will still recommend: If you want to use BTRFS on a server and estimate 25 GiB of usage, make the drive at least 50 GiB big, or even 100 GiB to be on the safe side. Like I recommended for SLES 11 SP 2/3 BTRFS deployments – but hey, meanwhile they say don´t, as in just don´t use it at all, and use SLES 12 instead, cause BTRFS on a 3.0 kernel with a ton of snapper snapshots is really not anywhere near production or enterprise reliability (if you need proof, I think I still have a snapshot of a SLES 11 SP3 VM that broke overnight due to me having installed an LDAP server for preparing some training slides). Even the 3.12 kernel seems daring regarding BTRFS, unless SUSE actively backports fixes. In the kernel log the failed attempts look like this: [ 209.783437] BTRFS info (device dm-3): relocating block group 501238202368 flags 17 [ 210.116416] BTRFS info (device dm-3): relocating block group 501238202368 flags 17 [ 210.455479] BTRFS info (device dm-3): 1 enospc errors during balance [ 212.915690] BTRFS info (device dm-3): relocating block group 501238202368 flags 17 [ 213.291634] BTRFS info (device dm-3): relocating block group 501238202368 flags 17 [ 213.654145] BTRFS info (device dm-3): 1 enospc errors during balance [ 219.219584] BTRFS info (device dm-3): relocating block group 501238202368 flags 17 [
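(As an aside: the ladder of usage-filtered balances shown above is easy to script rather than type by hand. A minimal sketch -- the mount point and thresholds are illustrative, and it stops at the first failure instead of retrying:)

  #!/bin/sh
  # Walk the usage filter upward; low thresholds compact nearly-empty
  # chunks cheaply, higher thresholds move progressively more data.
  for usage in 5 10 20 30 40 50 60; do
      btrfs balance start -dusage=$usage -musage=$usage /home || break
  done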
Re: BTRFS free space handling still needs more work: Hangs again
On Friday, 26 December 2014 at 14:37:36, you wrote: It currently is here: merkaba:~ btrfs balance status /home Balance on '/home' is running 32 out of about 164 chunks balanced (53 considered), 80% left merkaba:~ btrfs fi df /home Data, RAID1: total=154.97GiB, used=142.10GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.33GiB GlobalReserve, single: total=512.00MiB, used=254.31MiB Now I got this: merkaba:~ btrfs balance start /home ERROR: error during balancing '/home' - No space left on device There may be more info in syslog - try dmesg | tail merkaba:~#1 dmesg | tail [ 4260.276416] BTRFS info (device dm-3): relocating block group 151418568704 flags 17 [ 4274.683349] BTRFS info (device dm-3): found 25089 extents [ 4295.836590] BTRFS info (device dm-3): found 25089 extents [ 4296.026778] BTRFS info (device dm-3): relocating block group 150344826880 flags 17 [ 4312.732021] BTRFS info (device dm-3): found 59388 extents [ 4326.398261] BTRFS info (device dm-3): found 59388 extents [ 4326.813205] BTRFS info (device dm-3): relocating block group 149271085056 flags 17 [ 4347.346540] BTRFS info (device dm-3): found 104739 extents [ 4357.160098] BTRFS info (device dm-3): found 104739 extents [ 4359.304646] BTRFS info (device dm-3): 20 enospc errors during balance And I wonder about: Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7N�r��yb�X��ǧv�^�){.n�+{�n�߲)w*jg����ݢj/���z�ޖ��2 �ޙ�)ߡ�a�����G���h��j:+v���w��٥ These random chars are not supposed to be there: I better run scrub straight after this balance. Ciao, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: BTRFS free space handling still needs more work: Hangs again
On Friday, 26 December 2014 at 15:20:42, you wrote: And I wonder about: Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7N�r��yb�X��ǧv�^�){.n�+{�n�߲)w*jg����ݢj/���z�ޖ��2 �ޙ�)ߡ�a�����G���h��j:+v���w��٥ These random chars are not supposed to be there: I better run scrub straight after this balance. Okay, that's not me, I think. scrub didn´t report any errors, and when I look in the kmail sent folder I don´t see these random chars either, so it seems some server on the wire added the garbage. Let's defragment the file: merkaba:/home/martin/.VirtualBox/HardDisks filefrag Winlala.vdi Winlala.vdi: 41462 extents found merkaba:/home/martin/.VirtualBox/HardDisks btrfs filesystem defragment Winlala.vdi merkaba:/home/martin/.VirtualBox/HardDisks filefrag Winlala.vdi Winlala.vdi: 11735 extents found merkaba:/home/martin/.VirtualBox/HardDisks sync merkaba:/home/martin/.VirtualBox/HardDisks filefrag Winlala.vdi Winlala.vdi: 11735 extents found Okay, that together with: merkaba:~ btrfs fi df /home Data, RAID1: total=151.95GiB, used=144.68GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.25GiB GlobalReserve, single: total=512.00MiB, used=0.00B merkaba:~ btrfs fi sh /home Label: 'home' uuid: […] Total devices 2 FS bytes used 147.94GiB devid1 size 160.00GiB used 156.98GiB path /dev/mapper/msata-home devid2 size 160.00GiB used 156.98GiB path /dev/mapper/sata-home Btrfs v3.17 may do for a while. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: BTRFS free space handling still needs more work: Hangs again
On Friday, 26 December 2014 at 14:37:36, you wrote: I have this on a 3.18 kernel on Debian Sid with a BTRFS Dual SSD RAID with space_cache, skinny metadata extents – are these a problem? – and compress=lzo: merkaba:~ btrfs fi sh /home Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a Total devices 2 FS bytes used 144.41GiB devid1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home devid2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home Btrfs v3.17 merkaba:~ btrfs fi df /home Data, RAID1: total=154.97GiB, used=141.12GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.29GiB GlobalReserve, single: total=512.00MiB, used=0.00B And I had hangs with BTRFS again. This time as I wanted to install tax return software in a Virtualbox´d Windows XP VM (which I use once a year cause I know no tax return software for Linux which would be suitable for Germany, and I frankly don´t care about the end of security cause all surfing and other network access I will do from the Linux box and I only run the VM behind a firewall). These hangs are 100% reproducible for me: 1) Have the compress=lzo, space_cache BTRFS Dual SSD RAID 1, with both devices filled with trees. 2) Have a Windows XP VM in Virtualbox on that BTRFS RAID 1. 3) Press Defragment (in the hope of being able to use sdelete -c and then VBoxManage modifyhd Winlala.vdi --compact to reduce the image size). Gives: one kworker thread using up 100% of a core for minutes, with bursts of btrfs-transaction processes in between, and: Dec 26 16:17:57 merkaba kernel: [ 8102.029438] mce: [Hardware Error]: Machine check events logged Dec 26 16:18:15 merkaba kernel: [ 8119.879230] CPU2: Core temperature above threshold, cpu clock throttled (total events = 54053) Dec 26 16:18:15 merkaba kernel: [ 8119.879232] CPU0: Package temperature above threshold, cpu clock throttled (total events = 89435) Dec 26 16:18:15 merkaba kernel: [ 8119.879234] CPU3: Core temperature above threshold, cpu clock throttled (total events = 54053) Dec 26 16:18:15 merkaba kernel: [ 8119.879235] CPU1: Package temperature above threshold, cpu clock throttled (total events = 89435) Dec 26 16:18:15 merkaba kernel: [ 8119.879237] CPU3: Package temperature above threshold, cpu clock throttled (total events = 89435) Dec 26 16:18:15 merkaba kernel: [ 8119.879245] CPU2: Package temperature above threshold, cpu clock throttled (total events = 89435) Dec 26 16:18:15 merkaba kernel: [ 8119.880218] CPU2: Core temperature/speed normal Dec 26 16:18:15 merkaba kernel: [ 8119.880219] CPU1: Package temperature/speed normal Dec 26 16:18:15 merkaba kernel: [ 8119.880220] CPU3: Core temperature/speed normal Dec 26 16:18:15 merkaba kernel: [ 8119.880221] CPU0: Package temperature/speed normal Dec 26 16:18:15 merkaba kernel: [ 8119.880223] CPU3: Package temperature/speed normal Dec 26 16:18:15 merkaba kernel: [ 8119.880228] CPU2: Package temperature/speed normal Dec 26 16:20:27 merkaba kernel: [ 8252.054015] mce: [Hardware Error]: Machine check events logged Dec 26 16:20:57 merkaba kernel: [ 8281.461874] INFO: task kded4:1959 blocked for more than 120 seconds. Dec 26 16:20:57 merkaba kernel: [ 8281.464106] Tainted: G O 3.18.0-tp520 #14 Dec 26 16:20:57 merkaba kernel: [ 8281.466361] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
Dec 26 16:20:57 merkaba kernel: [ 8281.468760] kded4 D 88040764ce98 0 1959 1 0x Dec 26 16:20:57 merkaba kernel: [ 8281.471112] 8803efa57bb8 0002 8803efa57c00 880407f261c0 Dec 26 16:20:57 merkaba kernel: [ 8281.473462] 8803efa57fd8 88040764c950 00012300 88040764c950 Dec 26 16:20:57 merkaba kernel: [ 8281.475780] 8803efa57ba8 8803eea9a900 8803eea9a904 88040764c950 Dec 26 16:20:57 merkaba kernel: [ 8281.478142] Call Trace: Dec 26 16:20:57 merkaba kernel: [ 8281.480414] [814a6f9a] schedule+0x64/0x66 Dec 26 16:20:57 merkaba kernel: [ 8281.482694] [814a72d3] schedule_preempt_disabled+0x13/0x1f Dec 26 16:20:57 merkaba kernel: [ 8281.484979] [814a8440] __mutex_lock_slowpath+0xab/0x126 Dec 26 16:20:57 merkaba kernel: [ 8281.487271] [81143735] ? lookup_fast+0x173/0x238 Dec 26 16:20:57 merkaba kernel: [ 8281.489534] [814a84ce] mutex_lock+0x13/0x24 Dec 26 16:20:57 merkaba kernel: [ 8281.491811] [81143c45] walk_component+0x69/0x17e Dec 26 16:20:57 merkaba kernel: [ 8281.494092] [81143d88] lookup_last+0x2e/0x30 Dec 26 16:20:57 merkaba kernel: [ 8281.496416] [81145a32] path_lookupat+0x83/0x2d9 Dec 26 16:20:57 merkaba kernel: [ 8281.498733] [8121f38c] ? debug_smp_processor_id+0x17/0x19 Dec 26 16:20:57 merkaba kernel: [ 8281.501074] [8114683c] ? getname_flags+0x31/0x134 Dec 26 16:20:57 merkaba kernel: [ 8281.503338] [81145cad] filename_lookup+0x25/0x7a Dec 26 16:20:57 merkaba
Re: BTRFS free space handling still needs more work: Hangs again
On 12/26/2014 05:37 AM, Martin Steigerwald wrote: Hello! First: Have a merry Christmas and enjoy a quiet time in these days. Second: At a time you feel like it, here is a little rant, but also a bug report: I have this on a 3.18 kernel on Debian Sid with a BTRFS Dual SSD RAID with space_cache, skinny metadata extents – are these a problem? – and compress=lzo: (There is no known problem with skinny metadata; it's actually more efficient than the older format. There have been some anecdotal reports about mixing skinny and fat metadata, but nothing has ever been demonstrated to be problematic.) merkaba:~ btrfs fi sh /home Label: 'home' uuid: b96c4f72-0523-45ac-a401-f7be73dd624a Total devices 2 FS bytes used 144.41GiB devid1 size 160.00GiB used 160.00GiB path /dev/mapper/msata-home devid2 size 160.00GiB used 160.00GiB path /dev/mapper/sata-home Btrfs v3.17 merkaba:~ btrfs fi df /home Data, RAID1: total=154.97GiB, used=141.12GiB System, RAID1: total=32.00MiB, used=48.00KiB Metadata, RAID1: total=5.00GiB, used=3.29GiB GlobalReserve, single: total=512.00MiB, used=0.00B This filesystem, at the allocation level, is very full (see below). And I had hangs with BTRFS again. This time as I wanted to install tax return software in a Virtualbox´d Windows XP VM (which I use once a year cause I know no tax return software for Linux which would be suitable for Germany, and I frankly don´t care about the end of security cause all surfing and other network access I will do from the Linux box and I only run the VM behind a firewall). And thus I try the balance dance again: ITEM: Balance... it doesn't do what you think it does... 8-) Balancing is something you should almost never need to do. It is only for cases of changing geometry (adding disks, switching RAID levels, etc.) or for cases when you've radically changed allocation behaviors (like you decided to remove all your VMs, or you've decided to remove a mail spool directory full of thousands of tiny files). People run balance all the time because they think they should. They are _usually_ incorrect in that belief. merkaba:~ btrfs balance start -dusage=5 -musage=5 /home ERROR: error during balancing '/home' - No space left on device ITEM: Running out of space during a balance is not running out of space for files. BTRFS has two layers of allocation. That is, there are two levels of abstraction where a no-space condition can occur. The first level of allocation is making more BTRFS structures out of raw device space. The second level is allocating space for files inside of existing BTRFS structures. Balance is the operation of relocating the BTRFS structures, and (coincidentally) attempting to increase their order while doing that. So, for instance, relocating block group some_number_here requires finding an unallocated expanse of disk and creating a new/empty block group there of the current relevant block group size (typically data=1G or metadata=256M if you didn't override these settings while making the filesystem). You can _easily_ end up lacking a 1G contiguous expanse of raw allocation space on a nearly-full filesystem. NOTE :: This does _not_ happen with other filesystems like EXT4 because building those filesystems creates a static filesystem-level allocation. That is, 100% of the disk that can be controlled by EXT4 (etc.) is allocated and initialized at initial creation time (or first mount in the case of EXT4). BTRFS is intentionally different because it wants to be able to adapt as your usage changes.
If you first make millions of tiny files then you will have a lot of metadata extents and virtually no data extents. If you erase a lot of those and then start making large files, the metadata will tend to go away and data extents will be created. Being a chaotic system, you can get into some corner cases that suck, but in terms of natural evolution it has more benefits than drawbacks. There may be more info in syslog - try dmesg | tail merkaba:~#1 btrfs balance start -dusage=5 -musage=5 /home ERROR: error during balancing '/home' - No space left on device There may be more info in syslog - try dmesg | tail merkaba:~#1 btrfs balance start -dusage=5 /home [lots deleted for brevity] So I am basically rebalancing everything, without need I bet, causing more churn to the SSDs than is needed. Correct, though churn isn't really the issue. Otherwise the alternative would be to make the BTRFS larger, I bet. Correct. Well, this is still not what I would consider stable. So I will still Not a question of stability. See, doing a balance is like doing a sliding block puzzle. If there isn't enough room to slide the blocks around, then the blocks will not slide around. You are just out of space, and that results in out-of-space returns. This is not even an error, just a fact. http://en.wikipedia.org/wiki/15_puzzle Meditate on the above link. Then ask yourself what happens if you put
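(The two allocation levels described above can be told apart with the commands already used in this thread. A sketch; note the combined "usage" view is an assumption about newer tools -- it appeared around btrfs-progs v3.18, so the v3.17 shown in this thread may lack it:)

  # Level 1: raw device space already handed out to block groups
  # (the per-devid "used" column, compared against the device size).
  btrfs filesystem show /home

  # Level 2: space consumed inside those block groups (used= vs. total=).
  btrfs filesystem df /home

  # Newer btrfs-progs combine both views, including the unallocated remainder.
  btrfs filesystem usage /home

When the first output shows the devices fully "used" while the second still shows a gap between used= and total=, the filesystem is in exactly the state discussed here: full at the first level, roomy at the second.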
Re: BTRFS free space handling still needs more work: Hangs again
Martin Steigerwald posted on Fri, 26 Dec 2014 15:41:23 +0100 as excerpted: On Friday, 26 December 2014 at 15:20:42, you wrote: And I wonder about: Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7N�r��yb�X��ǧv�^�){.n�+{�n�߲)w*jg����ݢj/ ���z�ޖ��2 �ޙ�)ߡ�a�����G���h��j:+v���w��٥ These random chars are not supposed to be there: I better run scrub straight after this balance. Okay, that's not me, I think. scrub didn´t report any errors, and when I look in the kmail sent folder I don´t see these random chars either, so it seems some server on the wire added the garbage. FWIW... They didn't show up here on gmane's list2nntp service (message viewed with pan), either. There were a few strange characters -- your dashes(?) on either side of the "are these a problem?" showed up as the squares containing four digits (0080, 0093) that appear when a font doesn't contain the appropriate character it's being asked to display, and there were a few others, but that's a common charset/font l10n issue, not the apparent line-noise binary corruption shown above. So I'd guess it was either the transmission to your mail service, something at the mail service, or the transmission between them and your mail client, that corrupted it. -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: BTRFS free space handling still needs more work: Hangs again
Martin Steigerwald posted on Fri, 26 Dec 2014 16:59:09 +0100 as excerpted: Dec 26 16:17:57 merkaba kernel: [ 8102.029438] mce: [Hardware Error]: Machine check events logged Dec 26 16:20:27 merkaba kernel: [ 8252.054015] mce: [Hardware Error]: Machine check events logged Have you checked these MCEs? What are they? MCEs are hardware errors. These are *NOT* kernel errors, tho of course they may /trigger/ kernel errors. The reported event codes can be looked up and translated into English. From shortly after the first one until a bit before the second one here you had hardware thermal throttling; the CPUs, on-chip cache, and possibly the memory were working pretty hard. FWIW, I had an AMD machine that would MCE with memory-related errors some time (about a decade) ago. I had ECC RAM, but it was cheap and apparently not quite up to the speeds it was actually rated for. MemTest checked out the memory fine, but especially under high stress it would sometimes have bus/transit-related corruption, which would sometimes (not always) trigger those MCEs. Eventually a BIOS update gave me the ability to turn down the memory timings, and turning them down just one notch made everything rock-stable -- I was even able to decrease some of the wait-states to get a bit of the memory speed back. It just so happened that it was borderline stable at the rated clock, and turning the memory clock down just one notch was all it took. Later, I upgraded the RAM (the bad RAM was two half-gig sticks, back when they were $100+ a piece; I upgraded to four 2-gig sticks), and the new RAM didn't have the problem at all -- the bad RAM sticks simply weren't /quite/ stable at the rated speed, that was it. I run Gentoo, so of course I do a lot of building from sources, and interestingly enough, the thing that turned out to detect the corruption most often was bzip2 compression checksums -- I'd get errors on source decompress prior to the build, rather more often than actual build failures, altho those would happen occasionally as well, while redoing it would work fine -- checksums passed, and I never had a build that actually finished fail to run due to a bad build. Now here's the thing. Of course a decade ago was well before I was running btrfs (FWIW I was running reiserfs at the time, and it seemed pretty resilient given the bad RAM I had), so it was the bzip2 checksums it failed on. But guess what btrfs uses for file integrity: checksums. If your MCEs are either like my memory-related MCEs were, or are similar CPU-cache or CPU related but still something that would affect checksumming, btrfs may well be fighting bad checksums due to the same issues, and that would of course throw all sorts of wrenches into things. Another thing I've seen reported as triggering MCEs is bad power (in that case it was a UPS that was either underpowered or going bad; once it was out of the picture, the MCEs and problems stopped). Now I think you're having other btrfs issues as well, some of which are likely legit bugs. However, your MCEs certainly aren't helping things, and I'd definitely recommend checking up on them to see what's actually happening to your hardware. It may well be that without whatever hardware issues are triggering those MCEs, you may end up with fewer btrfs problems as well. Or maybe not, but it's something to look into, because right now, regardless of whether they're making things worse physically, they're at minimum obscuring a troubleshooting picture that would be clearer without them. -- Duncan - List replies preferred.
No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
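(A sketch of how that lookup might be done, assuming the x86 mcelog userspace tool is installed -- package names vary by distribution:)

  # Decode any machine-check events the kernel has recorded (needs root).
  mcelog

  # Or pull the raw MCE lines out of the kernel log for a first look.
  dmesg | grep -i -e 'machine check' -e 'mce:'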
Re: BTRFS free space handling still needs more work: Hangs again
Robert White posted on Fri, 26 Dec 2014 14:48:38 -0800 as excerpted: ITEM: An SSD plus a good fast controller and default system virtual memory and disk scheduler activities can completely bog a system down. You can get into a mode where the system begins doing synchronous writes of vast expanses of dirty cache. The SSD is so fast that there is effectively zero wait-for-IO time, and the IO subsystem is effectively locked or just plain busy. Look at /proc/sys/vm/dirty_background_ratio, which is probably set to 10% of system RAM. You may need/want to change this number to something closer to 4. That's not a hard suggestion. Some reading and analysis will be needed to find the best possible tuning for an advanced system. FWIW, I can second at least this part, myself. Half of the base problem is that memory speeds have increased far faster than storage speeds. SSDs do help with that, but the problem remains. The other half of the problem is the comparatively huge memory capacity systems have today, with the result being that the default percentages of system RAM that are allowed to be dirty before kicking in background and then foreground flushing, reasonable back when they were introduced, simply aren't reasonable any longer, PARTICULARLY on spinning rust, but even on SSD. vm.dirty_ratio is the percentage of RAM allowed to be dirty before the system kicks into high-priority write-flush mode. vm.dirty_background_ratio is likewise, but where the system starts even worrying about it at all, doing work in the background. Now take my 16 GiB RAM system as an example. The default background setting is 5%; foreground/high-priority, 10%. With 16 gigs RAM, that 10% is 1.6 GiB of dirty pages to flush. A spinning rust drive might do 100 MiB/sec throughput contiguous, but a real-world number is more like 30-50 MiB/sec. At 100 MiB/sec, that 1.6 GiB will take 16+ seconds, during which nothing else can be doing I/O. So let's just divide the speed by 3 and call it 33.3 MiB/sec. Now we're looking at being blocked for nearly 50 seconds to flush all those dirty blocks. And the system doesn't even START worrying about it, at even LOW priority, until it has about 25 seconds worth of full-usage flushing built up! Not only that, but that's *ALSO* 1.6 GiB worth of dirty data that isn't yet written to storage, that would be lost in the event of a crash! Of course there's a timer expiry as well. vm.dirty_writeback_centisecs (that's background) defaults to 499 (~5 seconds); vm.dirty_expire_centisecs defaults to 2999 (~30 seconds). So the first thing to notice is that it's going to take more time to write the dirty data we're allowing to stack up than the expiry time! At least to me, that makes absolutely NO sense! At minimum, we need to reduce the cached writes allowed to stack up to something that can actually be written before they expire, time-wise. Either that, or trying to depend on that 30-second expiry to make sure our dirty data is flushed in something at least /close/ to that isn't going to work so well! So assuming we think the 30 seconds is logical, the /minimum/ we need to do is reduce the size cap by half, to 5% high-priority/foreground (which is, as we saw, about 25 seconds worth), say 2% lower-priority/background. But that's STILL about 800 MiB before it kicks into high-priority mode, at risk in case of a crash, and I still considered that a bit more than I wanted.
So what I ended up with here (set for spinning rust, before I had the SSD) was: vm.dirty_background_ratio = 1 (low-priority flush; that's still ~160 MiB, or about 5 seconds worth of activity at low-30s MiB/sec) vm.dirty_ratio = 3 (high-priority flush; roughly half a GiB, about 15 seconds of activity) vm.dirty_writeback_centisecs = 1000 (10 seconds, background flush timeout; note that the corresponding size cap is ~5 seconds worth, so about a 50% duty cycle, a bit high for background priority, but...) (I left vm.dirty_expire_centisecs at the default, 2999, or ~30 seconds, since I found that an acceptable amount of work to lose in the case of a crash. Again, the corresponding size cap is ~15 seconds worth, so a ~50% duty cycle. This is very reasonable for high priority, since if data is coming in faster than that, it'll trigger high-priority flushing billed to the processes actually dirtying the memory in the first place, thus forcing them to slow down and wait for their IO, in turn allowing other (CPU-bound) processes to run.) And while 15-second interactivity latency during disk thrashing isn't cake, it's at least tolerable, while 50-second latency is HORRIBLE. Meanwhile, with vm.dirty_background_ratio already set to 1, and without knowing whether it can take a decimal such as 0.5 (I could look I suppose, but I don't really have to), that's the lowest I can go there unless I set it to zero. HOWEVER, if I wanted to go lower, I could set the actual size version, vm.dirty_background_bytes,
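(Collected in one place, the settings described above would look like this; the values are the ones given in this message, sized for a 16 GiB machine with spinning rust, and the *_bytes variants mentioned at the end take absolute sizes where finer granularity than 1% is wanted:)

  # Start background writeback early (~160 MiB of dirty data on a 16 GiB box).
  sysctl -w vm.dirty_background_ratio=1
  # Kick in high-priority, process-billed flushing at ~3% of RAM (~half a GiB).
  sysctl -w vm.dirty_ratio=3
  # Wake the background flusher every 10 seconds.
  sysctl -w vm.dirty_writeback_centisecs=1000
  # vm.dirty_expire_centisecs is left at its default (2999, ~30 seconds).

To make these persistent across reboots, the same key = value pairs go in /etc/sysctl.conf.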