Re: btrfs is using 25% more disk than it should

Josef Bacik Sat, 20 Dec 2014 03:29:16 -0800

On 12/20/2014 12:52 AM, Zygo Blaxell wrote:

On Fri, Dec 19, 2014 at 04:17:08PM -0500, Josef Bacik wrote:

And for your inode you now have this


inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 302g-4k, offset 4k, diskbytenr 123,
disklen 302g

and in your extent tree you have

extent bytenr 123, len 302g, refs 1
extent bytenr whatever, len 4k, refs 1

See that?  Your file is still the same size, it is still 302g.  If you
cp'ed it right now it would copy 302g of information.  But what you have
actually allocated on disk?  Well that's now 302g + 4k.  Now lets say
your virt thing decides to write to the middle, lets say at offset 12k,
now you have this

inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 3k, offset 4k, diskbytenr 123, disklen 302g
inode 256, file offset 12k, size 4k, offset 0, diskebytenr whatever,
disklen 4k
inode 256, file offset 16k, size 302g - 16k, offset 16k, diskbytenr 123,
disklen 302g

and in the extent tree you have this

extent bytenr 123, len 302g, refs 2
extent bytenr whatever, len 4k, refs 1
extent bytenr notimportant, len 4k, refs 1

See that refs 2 change?  We split the original extent, so we have 2 file
extents pointing to the same physical extents, so we bumped the ref
count.  This will happen over and over again until we have completely
overwritten the original extent, at which point your space usage will go
back down to ~302g.


Wait, *what*?

OK, I did a small experiment, and found that btrfs actually does do
something like this.  Can't argue with fact, though it would be nice if
btrfs could be smarter and drop unused portions of the original extent
sooner.  :-P

So we've thought about changing this, and will eventually, but it's kindof difficult. Above is an example of what happens currently, so thesplit code for file extents is kind of big and scary, check__btrfs_drop_extents. We would have to fix that to adjust thedisk_bytenr and disk_num_bytes, which isn't too bad since we already aredoing this dance and adjusting offset. The trick would be when updatingthe extent references, we would have to split those extents. So say wehave a 128mb extent and we write 4k at 1mb. If we split the extent refswe'd have this afterwards

(note this isn't how they'd be ordered on disk, just written this way soit makes logical sense)


extent bytenr 0, len 1mb, refs 1
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 128mb-4k, refs 1

Ok so now we have 3 extents in the extent tree to describe essentially 2ranges that are in use, but we get back the 4k so that's nice. But waitthere's more! What if we're snapshotted? We can't just drop that 4kbecause somebody else has a reference to it. So what do we do? Well wecould do something like this


extent bytenr 0, len 1mb, refs 1
extent bytenr 0, len 128mb, refs 1
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 128mb-4k, refs 1

This creates all sorts of problems for us. We now have two extents withthe same bytenr but with different lengths. This could be ok, we'd haveto add a bunch of checks to make sure we're looking at the right extent,but it wouldn't be horrible. I imagine we'd be fixing weird corruptionbugs for a few releases though while we found all of the corner cases wemissed.

Then there is the problem of actually returning the free space. Now ifwe drop all of the refs for an extent we know the space is free and wereturn it to the allocator. With the above example we can't do thatanymore, we have to check the extent tree for any area that is leftoverlapping the area we just freed. This add's another search to everybtrfs_free_extent operation, which slows the whole system down and againleaves us with weird corner cases and pain for the users. Plus thiswould be an incompatible format change so would require setting afeature flag in the fs and rolled to voluntarily.

Now I have another solution, but I'm not convinced it's awesome either.Take the same example above, but instead we split the original extentin the extent tree so we avoid all the mess of having overlapping rangesand get this instead


extent bytenr 0, len 1mb, refs 2
extent bytenr 1mb, len 4k, refs 1  <-- part of the original extent
                                        pointed to by the snapshot
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 128mb-4k, refs 2

So yay we've solved the problem of overlapping extents and bonus this isbackwards compatible. So why don't we do this? Well all the reasons Ilisted above about corner cases and much pain for our users. Thiswouldn't require a format change so everybody would get this behaviouras soon as we turned it on, and I feel I would be doing a lot of fsckwork for the next 6 months. Plus we would have to add a 'split'operation to the extent operations that copies all of the extentreferences around and drops the proper reference. Keep in mind thatI've been showing a dumbed down version of extent refs, what it wouldreally look like is this


extent bytenr 0, len 128mb, refs 2
        root 5, owner 256, refs 1
        root 256, owner 256, refs 1

So when we do our split operation we'd copy this extent entry twice,update the two sides with their new offset and len, and drop theoriginal inode from the middle thing, and finally add our new extent.That is a lot more work for one operation than just adding a new entryor removing an old entry. Not only is it more work but it adds moremetadata to the extent root, which makes extent operations moreexpensive which again slows the whole file system down.

Welcome to file system development, you spin the giant wheel of tradeoffs and decide which sucks less for you and your users. Years ago wechose simplicity in one of the more complex areas of btrfs for wastingspace in overwrites. It's not super clear that was the right choice sowe're considering changing it, but as you can see it ain't going to befun, and will require other trade offs which may have unintendedconsequences later on. Thanks,


Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: btrfs is using 25% more disk than it should

Reply via email to