Ok, I'm going to revive a year old mail thread here with interesting new info:
On 05/31/2016 03:36 AM, Qu Wenruo wrote:
>
> Hans van Kranenburg wrote on 2016/05/06 23:28 +0200:
>> Hi,
>>
>> I've got a mostly inactive btrfs filesystem inside a virtual machine
>> somewhere that shows interesting behaviour: while no interesting disk
>> activity is going on, btrfs keeps allocating new chunks, a GiB at a time.
>>
>> A picture, telling more than 1000 words:
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/btrfs_usage_ichiban.png
>> (when the amount of allocated/unused goes down, I did a btrfs balance)

That picture is still there, for the idea.

> Nice picture.
> Really better than 1000 words.
>
> AFAIK, the problem may be caused by fragments.

Free space fragmentation is indeed a key thing here. The two major things involved are:

1) the extent allocator, which causes the free space fragmentation
2) the extent allocator, which doesn't handle the fragmentation it just
   caused very well

Let's start with the pictures, instead of too many words. The following two videos are built from png images of the 4 block groups with the highest vaddr. Every 15 minutes a picture is created, and then they're added together:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

And, with autodefrag enabled, which was the first thing I tried as a change:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-13-autodefrag-ichiban.mp4

So, this is why putting your /var/log, /var/lib/mailman and /var/spool on btrfs is a terrible idea: because the allocator keeps walking forward, every file that is created and then removed leaves a blank spot behind.

Autodefrag makes the situation only a little bit better, changing the resulting pattern from a sky full of stars into a snowstorm. Because it takes a few small writes and rewrites them elsewhere, small parts of free space are again left behind.

Just a random idea:
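To make the walking-forward effect concrete, here's a tiny toy simulation. This is not btrfs code in any way: all names, sizes and the create-then-delete workload are invented for illustration. It contrasts a forward-walking allocator (roughly the behaviour described above) with a simple first-fit allocator that reuses the earliest free gap:

```python
import random

def simulate(first_fit, rounds=1000):
    """Allocate 1-unit extents, randomly free old ones, and return the
    highest offset the allocator ever wrote to (the high-water mark)."""
    random.seed(42)
    used = set()   # offsets currently allocated
    live = []      # allocation order, so we can free "old" extents
    cursor = 0     # where the forward-walking allocator continues
    high = 0
    for _ in range(rounds):
        if first_fit:
            pos = 0                  # start scanning from the front...
            while pos in used:
                pos += 1             # ...and reuse the first gap found
        else:
            pos = cursor             # keep walking forward instead
            while pos in used:
                pos += 1
            cursor = pos + 1
        used.add(pos)
        live.append(pos)
        high = max(high, pos)
        # 90% of the time, free a random older extent: this is the
        # create-then-remove pattern that leaves blank spots behind
        if live and random.random() < 0.9:
            used.discard(live.pop(random.randrange(len(live))))
    return high

print("forward-walking high-water mark:", simulate(first_fit=False))
print("first-fit high-water mark:      ", simulate(first_fit=True))
```

The forward-walking variant never looks back, so it marches all the way to the end of its space even though only a handful of extents are alive at any moment, while first-fit keeps everything packed near the front. That's the sky-full-of-stars picture in miniature.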
For this write pattern, always putting new writes in the first available free spot at the beginning of the block group would make a total difference, since the little 4/8KiB gaps would be filled up again all the time, preventing the shotgun blast from spreading all over.

> And even I saw some early prototypes inside the codes to allow btrfs do
> allocation smaller extent than required.
> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)
>
> But it's still prototype and seems no one is really working on it now.
>
> So when btrfs is writing new data, for example, to write about 16M data,
> it will need to allocate a 16M continuous extent, and if it can't find
> large enough space to allocate, then create a new data chunk.
>
> [...]

That's the cluster idea, right? Combining free space fragments into a bigger piece of space to fill with writes?

The fun thing is that this might work, but because of the pattern we end up with, a large write apparently fails (the files downloaded when doing apt-get update by daily cron), which causes a new chunk allocation. This is clearly visible in the videos. Directly after that, the new chunk gets filled with the same pattern, because the extent allocator now continues there, and the next day the same thing happens again, etc...

And voila, there's the answer to my original question.

Now, another surprise:

From the exact moment I did mount -o remount,nossd on this filesystem, the problem vanished:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-07-ichiban-munin-nossd.png

I don't have a new video yet, but I'll set up a cron tonight and post it later.

I'm going to send another mail specifically about the nossd/ssd behaviour and other things I found out last week, but that'll probably be tomorrow.

-- 
Hans van Kranenburg