On 12/20/2014 12:52 AM, Zygo Blaxell wrote:
On Fri, Dec 19, 2014 at 04:17:08PM -0500, Josef Bacik wrote:
And for your inode you now have this

inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 302g-4k, offset 4k, diskbytenr 123,
disklen 302g

and in your extent tree you have

extent bytenr 123, len 302g, refs 1
extent bytenr whatever, len 4k, refs 1

See that?  Your file is still the same size, it is still 302g.  If you
cp'ed it right now it would copy 302g of information.  But what you have
actually allocated on disk?  Well that's now 302g + 4k.  Now lets say
your virt thing decides to write to the middle, lets say at offset 12k,
now you have this

inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
disklen 4k
inode 256, file offset 4k, size 3k, offset 4k, diskbytenr 123, disklen 302g
inode 256, file offset 12k, size 4k, offset 0, diskebytenr whatever,
disklen 4k
inode 256, file offset 16k, size 302g - 16k, offset 16k, diskbytenr 123,
disklen 302g

and in the extent tree you have this

extent bytenr 123, len 302g, refs 2
extent bytenr whatever, len 4k, refs 1
extent bytenr notimportant, len 4k, refs 1

See that refs 2 change?  We split the original extent, so we have 2 file
extents pointing to the same physical extents, so we bumped the ref
count.  This will happen over and over again until we have completely
overwritten the original extent, at which point your space usage will go
back down to ~302g.

Wait, *what*?

OK, I did a small experiment, and found that btrfs actually does do
something like this.  Can't argue with fact, though it would be nice if
btrfs could be smarter and drop unused portions of the original extent
sooner.  :-P


So we've thought about changing this, and will eventually, but it's kind of difficult. Above is an example of what happens currently, so the split code for file extents is kind of big and scary, check __btrfs_drop_extents. We would have to fix that to adjust the disk_bytenr and disk_num_bytes, which isn't too bad since we already are doing this dance and adjusting offset. The trick would be when updating the extent references, we would have to split those extents. So say we have a 128mb extent and we write 4k at 1mb. If we split the extent refs we'd have this afterwards

(note this isn't how they'd be ordered on disk, just written this way so it makes logical sense)

extent bytenr 0, len 1mb, refs 1
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 128mb-4k, refs 1

Ok so now we have 3 extents in the extent tree to describe essentially 2 ranges that are in use, but we get back the 4k so that's nice. But wait there's more! What if we're snapshotted? We can't just drop that 4k because somebody else has a reference to it. So what do we do? Well we could do something like this

extent bytenr 0, len 1mb, refs 1
extent bytenr 0, len 128mb, refs 1
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 128mb-4k, refs 1

This creates all sorts of problems for us. We now have two extents with the same bytenr but with different lengths. This could be ok, we'd have to add a bunch of checks to make sure we're looking at the right extent, but it wouldn't be horrible. I imagine we'd be fixing weird corruption bugs for a few releases though while we found all of the corner cases we missed.

Then there is the problem of actually returning the free space. Now if we drop all of the refs for an extent we know the space is free and we return it to the allocator. With the above example we can't do that anymore, we have to check the extent tree for any area that is left overlapping the area we just freed. This add's another search to every btrfs_free_extent operation, which slows the whole system down and again leaves us with weird corner cases and pain for the users. Plus this would be an incompatible format change so would require setting a feature flag in the fs and rolled to voluntarily.

Now I have another solution, but I'm not convinced it's awesome either. Take the same example above, but instead we split the original extent in the extent tree so we avoid all the mess of having overlapping ranges and get this instead

extent bytenr 0, len 1mb, refs 2
extent bytenr 1mb, len 4k, refs 1  <-- part of the original extent
                                        pointed to by the snapshot
extent bytenr 128mb, len 4k, refs 1
extent bytenr 1mb+4k, len 128mb-4k, refs 2

So yay we've solved the problem of overlapping extents and bonus this is backwards compatible. So why don't we do this? Well all the reasons I listed above about corner cases and much pain for our users. This wouldn't require a format change so everybody would get this behaviour as soon as we turned it on, and I feel I would be doing a lot of fsck work for the next 6 months. Plus we would have to add a 'split' operation to the extent operations that copies all of the extent references around and drops the proper reference. Keep in mind that I've been showing a dumbed down version of extent refs, what it would really look like is this

extent bytenr 0, len 128mb, refs 2
        root 5, owner 256, refs 1
        root 256, owner 256, refs 1

So when we do our split operation we'd copy this extent entry twice, update the two sides with their new offset and len, and drop the original inode from the middle thing, and finally add our new extent. That is a lot more work for one operation than just adding a new entry or removing an old entry. Not only is it more work but it adds more metadata to the extent root, which makes extent operations more expensive which again slows the whole file system down.

Welcome to file system development, you spin the giant wheel of trade offs and decide which sucks less for you and your users. Years ago we chose simplicity in one of the more complex areas of btrfs for wasting space in overwrites. It's not super clear that was the right choice so we're considering changing it, but as you can see it ain't going to be fun, and will require other trade offs which may have unintended consequences later on. Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to