Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:
On Wed, Sep 11, 2019 at 01:20:53PM -0400, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
> On 2019-09-10 19:32, webmas...@zedlx.com wrote:
> >
> > Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
> >
> >
> > === I CHALLENGE you and anyone else on this mailing list: ===
> >
> > - Show me an example where splitting an extent requires unsharing,
> > and where this split is needed in order to defrag.
> >
> > Make it clear, write it yourself, I don't want any machine-made outputs.
> >
> Start with the above comment about all writes unsharing the region being
> written to.
>
> Now, extrapolating from there:
>
> Assume you have two files, A and B, each consisting of 64 filesystem
> blocks in a single shared extent. Now assume somebody writes a few bytes
> to the middle of file B, right around the boundary between blocks 31 and
> 32, and that you get similar writes to file A straddling blocks 14-15
> and 47-48.
>
> After all of that, file A will be 5 extents:
>
> * A reflink to blocks 0-13 of the original extent.
> * A single isolated extent consisting of the new blocks 14-15
> * A reflink to blocks 16-46 of the original extent.
> * A single isolated extent consisting of the new blocks 47-48
> * A reflink to blocks 49-63 of the original extent.
>
> And file B will be 3 extents:
>
> * A reflink to blocks 0-30 of the original extent.
> * A single isolated extent consisting of the new blocks 31-32.
> * A reflink to blocks 33-63 of the original extent.
>
> Note that there are a total of four contiguous sequences of blocks that
> are common between both files:
>
> * 0-13
> * 16-30
> * 33-46
> * 49-63
>
> There is no way to completely defragment either file without splitting
> the original extent (which is still there, just not fully referenced by
> either file) unless you rewrite the whole file to a new single extent
> (which would, of course, completely unshare the whole file). In fact,
> if you want to ensure that those shared regions stay reflinked, there's
> no way to defragment either file without _increasing_ the number of
> extents in that file (either file would need 7 extents to properly share
> only those 4 regions), and even then only one of the files could be
> fully defragmented.
>
> Such a situation generally won't happen if you're just dealing with
> read-only snapshots, but is not unusual when dealing with regular files
> that are reflinked (which is not an uncommon situation on some systems,
> as a lot of people have `cp` aliased to reflink things whenever
> possible).
Well, thank you very much for writing this example. It is certainly not
minimal: one write to file A and one write to file B would have been
sufficient to prove your point, so the example contains one extra write,
but that's OK.
Your example proves that I was wrong. I admit: it is impossible to perfectly
defrag one subvolume (in the way I imagined it should be done).
Why? Because, as in your example, there can be files within a SINGLE
subvolume which share their extents with each other. I didn't consider such
a case.
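(For the record, your scenario is easy to reproduce on a scratch btrfs
filesystem. Here is a rough C sketch; the file names, the 4K block size
and the "one extent per initial file" assumption are mine, purely for
illustration. It clones A into B with the FICLONE ioctl and then does the
three small boundary-straddling writes; afterwards 'filefrag -v A' and
'filefrag -v B' should show the split extents.)

/* reproduce-split.c: recreate the A/B extent-splitting example above.
 * Run it in a directory on a btrfs filesystem with 4K blocks. */
#include <fcntl.h>
#include <linux/fs.h>      /* FICLONE */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define BLK 4096           /* assumed filesystem block size */

int main(void)
{
    char buf[BLK];
    memset(buf, 'x', sizeof(buf));

    /* File A: 64 blocks written in one go, ideally landing in one extent. */
    int a = open("A", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (a < 0) { perror("open A"); return 1; }
    for (int i = 0; i < 64; i++)
        if (write(a, buf, BLK) != BLK) { perror("write A"); return 1; }
    fsync(a);

    /* File B: a full reflink copy of A, sharing the same extent. */
    int b = open("B", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (b < 0) { perror("open B"); return 1; }
    if (ioctl(b, FICLONE, a) < 0) { perror("FICLONE"); return 1; }

    /* Small writes straddling block boundaries: 14/15 and 47/48 in A,
     * 31/32 in B.  Each write dirties two blocks and becomes a new
     * two-block extent, splitting the original shared extent. */
    pwrite(a, "new", 3, 15 * BLK - 1);
    pwrite(a, "new", 3, 48 * BLK - 1);
    pwrite(b, "new", 3, 32 * BLK - 1);
    fsync(a);
    fsync(b);

    close(a);
    close(b);
    puts("Now compare 'filefrag -v A' and 'filefrag -v B'.");
    return 0;
}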
On the other hand, I judge this issue to be mostly irrelevant. Why? Because
most of the file sharing will be between subvolumes, not within a subvolume.
When a user creates a reflink to a file in the same subvolume, he is
willingly denying himself the assurance of a perfect defrag. Because, as
your example proves, once there are a few writes to BOTH files, it becomes
impossible to defrag perfectly. So, if the user creates such reflinks, it's
his own wish and his own fault.
Such situations will occur only in some specific circumstances:
a) when the user is reflinking manually
b) when a file is copied from one subvolume into a different file in a
different subvolume.
The situation a) is unusual in normal use of the filesystem. Even when it
occurs, it is an explicit command given by the user, so he should be
willing to accept all the consequences, even the bad ones like imperfect
defrag.
The situation b) is possible, but as far as I know copies are currently not
done that way in btrfs. There should probably be an option to reflink-copy
files from another subvolume; that would be good.
Reflink copies across subvolumes have been working for years. They are
an important component that makes dedupe work when snapshots are present.
I take it that what you say is true, but what I said is that when a user
(or application) makes a normal copy from one subvolume to another, it
won't be a reflink copy. To make such a reflink copy, you need a
btrfs-aware cp or btrfs-aware applications.
So, the reflink copy is a special case, usually explicitly requested by
the user.
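(For reference, an application that wants to make an explicit reflink copy
can do it with the FICLONE / FICLONERANGE ioctls; GNU cp exposes this as
'cp --reflink'. Below is a minimal sketch with placeholder file names,
cloning the whole source into the destination via FICLONERANGE.)

/* reflink-copy.c: make <dst> share its blocks with <src>.
 * FICLONERANGE clones a byte range; FICLONE clones the whole file. */
#include <fcntl.h>
#include <linux/fs.h>     /* FICLONERANGE, struct file_clone_range */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
        return 1;
    }
    int src = open(argv[1], O_RDONLY);
    int dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    struct file_clone_range fcr = {
        .src_fd      = src,
        .src_offset  = 0,
        .src_length  = 0,    /* 0 means "to the end of the source file" */
        .dest_offset = 0,
    };
    if (ioctl(dst, FICLONERANGE, &fcr) < 0) { perror("FICLONERANGE"); return 1; }

    close(src);
    close(dst);
    return 0;
}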
But anyway, it doesn't matter, because most of the sharing will be between
subvolumes, not within a subvolume.
Heh. I'd like you to meet one of my medium-sized filesystems:
Physical size: 8TB
Logical size: 16TB
Average references per extent: 2.03 (not counting snapshots)
Workload: CI build server, VM host
That's a filesystem where over half of the logical data is reflinks to the
other physical data, and 94% of that data is in a single subvol. 7.5TB of
data is unique, the remaining 500GB is referenced an average of 17 times.
We use ordinary applications to make ordinary copies of files, and do
tarball unpacks and source checkouts with reckless abandon, all day long.
Dedupe turns the copies into reflinks as we go, so every copy becomes
a reflink no matter how it was created.
For the VM filesystem image files, it's not uncommon to see a high
reflink rate within a single file as well as reflinks to other files
(like the binary files in the build directories that the VM images are
constructed from). Those reference counts can go into the millions.
OK, but that cannot be helped: either you retain the sharing structure
with imperfect defrag, or you unshare and produce a perfect defrag
which should have somewhat better performance (and pray that the disk
doesn't fill up).
So, if there is some in-subvolume sharing, the defrag won't be 100%
perfect, but that's a minor point. Unimportant.
It's not unimportant; however, the implementation does have to take this
into account, and make sure that defrag can efficiently skip extents that
are too expensive to relocate. If we plan to read an extent fewer than
100 times, it makes no sense to update 20000 references to it--we spend
less total time just doing the 100 slower reads.
Not necessarily. You can defrag at a time of day when there is low
pressure on the disk IO, so updating 20000 references is essentially
free. You are just making those later 100 reads faster.
OK, you are right, there is some limit, but this is such a rare case
that such heavily-referenced extents are best left untouched.
I suggest something along these lines: if there are more than XX
(where XX defaults to 1000) reflinks to an extent, then one or more
copies of the extent should be made such that each has fewer than XX
reflinks to it. The number XX should be user-configurable.
If the numbers are
reversed then it's better to defrag the extent--100 reference updates
are easily outweighed by 20000 faster reads. The kernel doesn't have
enough information to make good decisions about this.
So, just make the number XX user-provided.
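(Just to make that concrete: the per-extent decision could be as simple as
the sketch below, combining my user-provided XX cap with the read-cost
versus update-cost comparison. All the names and numbers are made up by me;
none of this is an existing btrfs interface or tunable.)

/* Hypothetical per-extent policy for a defrag tool.  It only illustrates
 * the trade-off discussed above. */
#include <stdio.h>

enum extent_action { LEAVE_IN_PLACE, RELOCATE, SPLIT_INTO_COPIES };

struct extent_stats {
    unsigned long refs;            /* reflinks pointing at this extent */
    unsigned long expected_reads;  /* rough estimate of future reads   */
};

enum extent_action decide(const struct extent_stats *e,
                          unsigned long xx,        /* user knob, e.g. 1000   */
                          double read_cost,        /* cost of one slow read  */
                          double ref_update_cost)  /* cost of one ref update */
{
    if (e->refs > xx)
        /* Too many references to update; optionally duplicate the extent
         * so that each copy ends up with fewer than XX references. */
        return SPLIT_INTO_COPIES;
    if (e->expected_reads * read_cost > e->refs * ref_update_cost)
        return RELOCATE;        /* e.g. 20000 faster reads beat 100 updates */
    return LEAVE_IN_PLACE;      /* e.g. 100 reads don't justify 20000 updates */
}

int main(void)
{
    struct extent_stats e = { .refs = 100, .expected_reads = 20000 };
    printf("action = %d\n", decide(&e, 1000, 1.0, 1.0));  /* 1 == RELOCATE */
    return 0;
}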
Dedupe has a similar problem--it's rarely worth doing a GB of IO to
save 4K of space, so in practical implementations, a lot of duplicate
blocks have to remain duplicate.
There are some ways to make the kernel dedupe and defrag API process
each reference a little more efficiently, but none will get around this
basic physical problem: some extents are just better off where they are.
OK. If you don't touch those extents, they are still shared. That's
what I wanted.
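(For completeness: the dedupe you describe is driven from userspace through
the FIDEDUPERANGE ioctl. The caller proposes that a range of one file
duplicates a range of another, and the kernel verifies the contents are
identical before turning the copy into shared extents. A rough sketch, with
placeholder file names and a made-up 1 MiB length:)

/* dedupe-one-range.c: ask the kernel to share one identical range. */
#include <fcntl.h>
#include <linux/fs.h>   /* FIDEDUPERANGE, struct file_dedupe_range */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int src = open("copy1", O_RDONLY);
    int dst = open("copy2", O_RDWR);
    if (src < 0 || dst < 0) { perror("open"); return 1; }

    size_t sz = sizeof(struct file_dedupe_range) +
                sizeof(struct file_dedupe_range_info);
    struct file_dedupe_range *r = calloc(1, sz);
    r->src_offset = 0;
    r->src_length = 1024 * 1024;          /* dedupe the first 1 MiB */
    r->dest_count = 1;
    r->info[0].dest_fd = dst;
    r->info[0].dest_offset = 0;

    if (ioctl(src, FIDEDUPERANGE, r) < 0) { perror("FIDEDUPERANGE"); return 1; }

    if (r->info[0].status == FILE_DEDUPE_RANGE_SAME)
        printf("deduped %llu bytes\n",
               (unsigned long long)r->info[0].bytes_deduped);
    else
        printf("ranges differ or dedupe refused (status %d)\n",
               r->info[0].status);

    free(r);
    close(src);
    close(dst);
    return 0;
}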
Userspace has access to some extra data from the user, e.g. "which
snapshots should have their references excluded from defrag because
the entire snapshot will be deleted in a few minutes." That will allow
better defrag cost-benefit decisions than any in-kernel implementation
can make by itself.
Yes, but I think that we are going into too much detail, which diverts
attention from the overall picture and from the big problems. And the
big problem here is: what do we want defrag to do in the general, most
common cases? Because we still haven't agreed on that one, since many
of the people here are ardent followers of the defrag-by-unsharing
ideology.
'btrfs fi defrag' is just one possible userspace implementation, which
implements the "throw entire files at the legacy kernel defrag API one
at a time" algorithm. Unfortunately, nobody seems to have implemented
any other algorithms yet, other than a few toy proof-of-concept demos.
I really don't have a clue what's happening, but if I were to start
working on it (which I won't), then the first things should be:
- creating a way for btrfs to split large extents into smaller ones
(for easier defrag, as the first phase).
- creating a way for btrfs to merge small adjacent extents shared by
the same files into larger extents (as the last phase of defragmenting
a file).
- creating a structure (associative array) for defrag that can track
backlinks, kept updated with each filesystem change by placing hooks
in the filesystem-update routines.
You can't go wrong with this. Whatever details of the defrag operation
change, these three things will be needed by defrag.
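(To illustrate the third point, the backlink structure could be as simple
as a map from an extent to the set of (inode, file offset) pairs that
reference it. Everything below is a made-up userspace sketch, not an
existing btrfs structure; real code would use a proper hash table and hook
into extent allocation and freeing.)

/* backrefs.c: toy backlink map for a defrag tool. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct backref {
    uint64_t inode;        /* file referencing the extent        */
    uint64_t file_offset;  /* where in that file the extent maps */
};

struct extent_backrefs {
    uint64_t extent_start;          /* address of the extent            */
    size_t   count, cap;
    struct backref *refs;
    struct extent_backrefs *next;   /* simple chained list as a stand-in */
};

static struct extent_backrefs *table;

/* Record "inode references extent_start at file_offset". */
void backref_add(uint64_t extent_start, uint64_t inode, uint64_t file_offset)
{
    struct extent_backrefs *e;
    for (e = table; e; e = e->next)
        if (e->extent_start == extent_start)
            break;
    if (!e) {
        e = calloc(1, sizeof(*e));
        e->extent_start = extent_start;
        e->next = table;
        table = e;
    }
    if (e->count == e->cap) {
        e->cap = e->cap ? e->cap * 2 : 4;
        e->refs = realloc(e->refs, e->cap * sizeof(*e->refs));
    }
    e->refs[e->count++] = (struct backref){ inode, file_offset };
}

int main(void)
{
    /* Toy usage: extent at 4096 is shared by two files at different offsets. */
    backref_add(4096, 257, 0);
    backref_add(4096, 258, 65536);
    for (struct extent_backrefs *e = table; e; e = e->next)
        printf("extent %llu has %zu references\n",
               (unsigned long long)e->extent_start, e->count);
    return 0;
}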
Now, to retain the original sharing structure, the defrag has to change the
reflink of extent E55 in file B to point to E70. You are telling me this is
not possible? Bullshit!
This is already possible today and userspace tools can do it--not as
efficiently as possible, but without requiring more than 128M of temporary
space. 'btrfs fi defrag' is not one of those tools.
Please explain to me how this 'defrag has to unshare' story of yours isn't
an intentional attempt to mislead me.
Austin is talking about the btrfs we have, not the btrfs we want.
OK, but then you agree with me that the current defrag is a joke. I mean,
something is better than nothing, and the current defrag isn't
completely useless, but in most circumstances it is either unusable or
not good enough.
I mean, snapshots are a prime feature of btrfs. If not, then why
bother with b-trees? If you wanted subvolumes, checksums and RAID,
then you should have made ext5. B-trees are in btrfs so that there can
be snapshots. But the current defrag works badly with snapshots. It
doesn't defrag them well, and it also unshares data. Bad bad bad.
And if you want to be honest with your users, why don't you place this
info in the wiki? OK, the wiki says "defrag will unshare", but it
doesn't say that defrag also doesn't defragment well.
For example, let's examine the typical home user. If he is using btrfs,
it means he probably wants snapshots of his data. And after a few
snapshots his data is fragmented, and the current defrag can't help
because it does a terrible job in this particular case.
So why don't you write on the wiki: "the defrag is practically unusable
if you use snapshots"? Because that is the truth. Be honest.