On 2019-09-11 13:20, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-10 19:32, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
=== I CHALLENGE you and anyone else on this mailing list: ===
- Show me an example where splitting an extent requires unsharing,
and where that split is needed to defrag.
Make it clear, write it yourself, I don't want any machine-made outputs.
Start with the above comment about all writes unsharing the region
being written to.
Now, extrapolating from there:
Assume you have two files, A and B, each consisting of 64 filesystem
blocks in a single shared extent. Now assume somebody writes a few
bytes to the middle of file B, right around the boundary between
blocks 31 and 32, and that you get similar writes to file A straddling
blocks 14-15 and 47-48.
After all of that, file A will be 5 extents:
* A reflink to blocks 0-13 of the original extent.
* A single isolated extent consisting of the new blocks 14-15
* A reflink to blocks 16-46 of the original extent.
* A single isolated extent consisting of the new blocks 47-48
* A reflink to blocks 49-63 of the original extent.
And file B will be 3 extents:
* A reflink to blocks 0-30 of the original extent.
* A single isolated extent consisting of the new blocks 31-32.
* A reflink to blocks 33-63 of the original extent.
Note that there are a total of four contiguous sequences of blocks
that are common between both files:
* 0-13
* 16-30
* 33-46
* 49-63
There is no way to completely defragment either file without splitting
the original extent (which is still there, just not fully referenced
by either file) unless you rewrite the whole file to a new single
extent (which would, of course, completely unshare the whole file).
In fact, if you want to ensure that those shared regions stay
reflinked, there's no way to defragment either file without
_increasing_ the number of extents in that file (either file would
need 7 extents to properly share only those 4 regions), and even then
only one of the files could be fully defragmented.
Such a situation generally won't happen if you're just dealing with
read-only snapshots, but is not unusual when dealing with regular
files that are reflinked (which is not an uncommon situation on some
systems, as a lot of people have `cp` aliased to reflink things
whenever possible).
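
For concreteness, here's the same scenario as a small Python model
(plain block-range arithmetic, not btrfs code; the function names are
made up):

def apply_cow_writes(length, writes):
    """Return a file's extent list after CoW writes.

    `writes` is a list of inclusive (first_block, last_block) ranges
    that were rewritten; everything else stays a reflink into the
    original extent.
    """
    extents, pos = [], 0
    for first, last in sorted(writes):
        if pos < first:
            extents.append(("reflink", pos, first - 1))
        extents.append(("new", first, last))
        pos = last + 1
    if pos < length:
        extents.append(("reflink", pos, length - 1))
    return extents

def shared_ranges(ext_a, ext_b):
    """Intersect the still-reflinked ranges of two files."""
    shared = []
    for _, a0, a1 in [e for e in ext_a if e[0] == "reflink"]:
        for _, b0, b1 in [e for e in ext_b if e[0] == "reflink"]:
            lo, hi = max(a0, b0), min(a1, b1)
            if lo <= hi:
                shared.append((lo, hi))
    return sorted(shared)

file_a = apply_cow_writes(64, [(14, 15), (47, 48)])
file_b = apply_cow_writes(64, [(31, 32)])
print(len(file_a))                    # 5 extents
print(len(file_b))                    # 3 extents
print(shared_ranges(file_a, file_b))  # [(0, 13), (16, 30), (33, 46), (49, 63)]

The printed shared ranges are exactly the four regions listed above.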
Well, thank you very much for writing this example. Your example is
certainly not minimal: it seems to me that one write to file A and one
write to file B would have been sufficient to prove your point, so the
example contains one extra write, but that's OK.
Your example proves that I was wrong. I admit: it is impossible to
perfectly defrag one subvolume (in the way I imagined it should be done).
Why? Because, as in your example, there can be files within a SINGLE
subvolume which share their extents with each other. I didn't consider
such a case.
On the other hand, I judge this issue to be mostly irrelevant. Why?
Because most of the file sharing will be between subvolumes, not within
a subvolume.
Not necessarily. Even ignoring the case of data deduplication (which
needs to be considered if you care at all about enterprise usage, and is
part of the whole point of using a CoW filesystem), there are existing
applications that actively use reflinks, either directly or indirectly
(via things like the `copy_file_range` system call), and the number of
such applications is growing.
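
As a point of reference, here's roughly what such an indirect reflink
copy looks like from userspace (a sketch assuming Linux and Python
3.8+; the helper name and file names are hypothetical). On btrfs, the
kernel is free to satisfy copy_file_range() by sharing extents rather
than copying bytes:

import os

def reflink_friendly_copy(src_path, dst_path):
    # Copy src to dst through copy_file_range(); on btrfs this may be
    # satisfied by reflinking extents instead of duplicating the data.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        while remaining > 0:
            copied = os.copy_file_range(src.fileno(), dst.fileno(), remaining)
            if copied == 0:  # unexpected EOF
                break
            remaining -= copied

reflink_friendly_copy("a.img", "b.img")  # b.img may share extents with a.img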
When a user creates a reflink to a file in the same
subvolume, he is willingly denying himself the assurance of a perfect
defrag. Because, as your example proves, once there are a few writes to
BOTH files, it becomes impossible to defrag perfectly. So, if the user
creates such reflinks, it's his own wish and his own fault.
The same argument can be made about snapshots. It's an invalid argument
in both cases though because it's not always the user who's creating the
reflinks or snapshots.
Such situations will occur only in some specific circumstances:
a) when the user is reflinking manually
b) when a file is copied from one subvolume into a different file in a
different subvolume.
The situation a) is unusual in normal use of the filesystem. Even when
it occurs, it is the explicit command given by the user, so he should be
willing to accept all the consequences, even the bad ones like imperfect
defrag.
The situation b) is possible, but as far as I know copies are currently
not done that way in btrfs. There should probably be an option to
reflink-copy files from another subvolume; that would be good.
But anyway, it doesn't matter, because most of the sharing will be
between subvolumes, not within a subvolume. So, if there is some
in-subvolume sharing, the defrag won't be 100% perfect, but that's a
minor point. Unimportant.
You're focusing too much on your own use case here. Not everybody uses
snapshots, and there are many people who are using reflinks very
actively within subvolumes, either for deduplication or because it saves
time and space when dealing with multiple copies of mostly identical
trees of files.
About merging extents: a defrag should merge extents ONLY when both
extents are shared by the same files (and when those extents are
neighbours in both files). In other words, defrag should always merge
without unsharing. Let's call that operation "fusing extents", so
that there are no more misunderstandings.
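
To pin the rule down, here is a small Python sketch of the check (the
Ref structure is illustrative, not a btrfs structure, and it assumes
one reference per file per extent):

from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    file: str        # which file references the extent
    offset: int      # logical offset of the extent within that file

def can_fuse(refs_x, refs_y, len_x):
    """True if extents X (length len_x) and Y can fuse without unsharing."""
    if {r.file for r in refs_x} != {r.file for r in refs_y}:
        return False             # shared by different sets of files
    # Y must directly follow X in every referencing file.
    starts_y = {(r.file, r.offset) for r in refs_y}
    return all((r.file, r.offset + len_x) in starts_y for r in refs_x)

# E1 and E2 are adjacent in both A and B -> fusable.
e1 = [Ref("A", 0), Ref("B", 100)]
e2 = [Ref("A", 10), Ref("B", 110)]
print(can_fuse(e1, e2, len_x=10))   # True
# E3 is only referenced by A -> fusing would unshare B's view of E1.
e3 = [Ref("A", 10)]
print(can_fuse(e1, e3, len_x=10))   # False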
And I reiterate: defrag only operates on the file it's passed in. It
needs to for efficiency reasons (we had a reflink aware defrag for a
while a few years back, it got removed because performance limitations
meant it was unusable in the cases where you actually needed it).
Defrag doesn't even know that there are reflinks to the extents it's
operating on.
If the defrag doesn't know about all reflinks, that's bad in my view.
That is a bad defrag. If you had a reflink-aware defrag, and it was
slow, maybe that happened because the implementation was bad. Because, I
don't see any reason why it should be slow. So, you will have to explain
to me what was causing these performance problems.
Given this, defrag isn't willfully unsharing anything, it's just a
side-effect of how it works (since it's rewriting the block layout of
the file in-place).
The current defrag has to unshare because, as you said, it is unaware
of the full reflink structure. If it doesn't know about all reflinks,
it has to unshare; there is no way around that.
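
As a toy illustration of that (plain Python; E55 and E70 are
hypothetical extent names), a reflink-unaware rewrite does this:

files = {
    "A": [("E55", 0)],   # (extent id, logical offset in file)
    "B": [("E55", 0)],
}

def defrag_file(files, name, new_id):
    """Rewrite one file's extents in place, ignoring other references."""
    files[name] = [(new_id, off) for _ext, off in files[name]]

defrag_file(files, "A", "E70")
print(files)  # A -> E70, B still -> E55: the region is now stored twice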
Now factor in that _any_ write will result in unsharing the region
being written to, rounded to the nearest full filesystem block in both
directions (this is mandatory, it's a side effect of the copy-on-write
nature of BTRFS, and is why files that experience heavy internal
rewrites get fragmented very heavily and very quickly on BTRFS).
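
The rounding rule is easy to state precisely; a minimal sketch,
assuming btrfs's usual 4096-byte block size:

BLOCK = 4096

def unshared_block_range(write_offset, write_len, block=BLOCK):
    """Return (first_block, last_block) that a write actually unshares."""
    first = write_offset // block
    last = (write_offset + write_len - 1) // block
    return first, last

# A 10-byte write straddling the block 31/32 boundary unshares both blocks.
print(unshared_block_range(31 * BLOCK + 4091, 10))  # (31, 32)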
You mean: when defrag performs a write, the new data is unshared because
every write is unshared? Really?
Consider there is an extent E55 shared by two files A and B. The defrag
has to move E55 to another location. In order to do that, defrag creates
a new extent E70. It makes it belong to file A by changing the reflink
of extent E55 in file A to point to E70.
Now, to retain the original sharing structure, the defrag has to change
the reflink of extent E55 in file B to point to E70. You are telling me
this is not possible? Bullshit!
Please explain to me how this 'defrag has to unshare' story of yours
isn't an intentional attempt to mislead me.
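
In the same toy model as above, the operation I'm describing is just
this (illustrative names again; a real implementation would have to
find the references, e.g. via backref walks, and update them
atomically):

files = {
    "A": [("E55", 0)],   # (extent id, logical offset in file)
    "B": [("E55", 0)],
}

def relocate_extent(files, old_id, new_id):
    """Copy extent data to a new location, then repoint all reflinks."""
    # (the data copy from old_id to new_id would happen here)
    for refs in files.values():
        for i, (ext, off) in enumerate(refs):
            if ext == old_id:
                refs[i] = (new_id, off)

relocate_extent(files, "E55", "E70")
print(files)  # both A and B now point at E70; sharing is preserved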
As mentioned in the previous email, we actually did have a (mostly)
working reflink-aware defrag a few years back. It got removed because
it had serious performance issues. Note that we're not talking a few
seconds of extra time to defrag a full tree here, we're talking
double-digit _minutes_ of extra time to defrag a moderately sized (low
triple-digit GB) subvolume with dozens of snapshots, _if you were lucky_
(if you weren't, you would be looking at potentially multiple _hours_ of
runtime for the defrag). The performance scaled inversely proportionate
to the number of reflinks involved and the total amount of data in the
subvolume being defragmented, and was pretty bad even in the case of
only a couple of snapshots.
Ultimately, there are a couple of issues at play here:
* Online defrag has to maintain consistency during operation. The
current implementation does this by rewriting the regions being
defragmented (which, most of the time, causes them to become a single
new extent), which avoids a whole lot of otherwise complicated logic
required to make sure things happen correctly, and also means that only
the file being operated on is impacted and only the parts being modified
need to be protected against concurrent writes. Properly handling
reflinks means that _every_ file that shares some part of an extent with
the file being operated on needs to have the reflinked regions locked
for the defrag operation, which has a huge impact on performance. Using
your example, the update to E55 in both files A and B has to happen as
part of the same commit, which can contain no other writes in that
region of the file, otherwise you run the risk of losing writes to file
B that occur while file A is being defragmented. It's not horrible when
it's just a small region in two files, but it becomes a big issue when
dealing with lots of files and/or particularly large extents (extents in
BTRFS can get into the GB range in terms of size when dealing with
really big files).
* Reflinks can reference partial extents. This means, ultimately, that
you may end up having to split extents in odd ways during defrag if you
want to preserve reflinks, and might have to split extents _elsewhere_
that are only tangentially related to the region being defragmented.
See the example in my previous email for a case like this: maintaining
the shared regions as shared when you defragment either file to a
single extent will require splitting extents in the other file (in
either case, whichever file you don't defragment to a single extent
will end up with 7 extents if you force the one that's been
defragmented to be the canonical version; the sketch after this list
works through the numbers). Once you consider that a given extent can
have multiple ranges reflinked from multiple other locations, it gets
even more complicated.
* If you instead sidestep the above point by not letting defrag split
extents, you put a hard lower limit on the amount of fragmentation
present in a file if you want to preserve reflinks. IOW, you can't
defragment files past a certain point. If we go this way, neither of
the two files in the example from my previous email could be
defragmented any further than they already are, because doing so would
require splitting extents.
* Determining all the reflinks to a given region of a given extent is
not a cheap operation, and the information may immediately be stale
(because an operation right after you fetch the info might change
things). We could work around this by locking the extent somehow, but
doing so would be expensive because you would have to hold the lock for
the entire defrag operation.
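
To make the second bullet above concrete, here's the range arithmetic
behind the '7 extents' figure (plain Python over the example from my
previous email, not btrfs code):

shared = [(0, 13), (16, 30), (33, 46), (49, 63)]   # regions reflinked into B

def extents_preserving(shared, length):
    """Split a file of `length` blocks so each shared range stays its own extent."""
    extents, pos = [], 0
    for lo, hi in shared:
        if pos < lo:
            extents.append(("private", pos, lo - 1))
        extents.append(("shared", lo, hi))
        pos = hi + 1
    if pos < length:
        extents.append(("private", pos, length - 1))
    return extents

ext_a = extents_preserving(shared, 64)
print(len(ext_a), ext_a)   # 7 extents, alternating shared and private

Note that each additional shared region adds up to two more extents (the
shared range itself plus any private gap before it), so the floor on
fragmentation only rises as the sharing gets finer-grained.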