Re: Feature requests: online backup - defrag - change RAID level

Austin S. Hemmelgarn Thu, 12 Sep 2019 04:32:03 -0700

On 2019-09-11 17:37, webmas...@zedlx.com wrote:

Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-11 13:20, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-10 19:32, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
Given this, defrag isn't willfully unsharing anything, it's just a side-effect of how it works (since it's rewriting the block layout of the file in-place).
The current defrag has to unshare because, as you said, it is unaware of the full reflink structure. If it doesn't know about all reflinks, it has to unshare, there is no way around that.
Now factor in that _any_ write will result in unsharing the region being written to, rounded to the nearest full filesystem block in both directions (this is mandatory, it's a side effect of the copy-on-write nature of BTRFS, and is why files that experience heavy internal rewrites get fragmented very heavily and very quickly on BTRFS).
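To make the rounding concrete, here is a minimal sketch of the arithmetic only. The 4096-byte block size and the function name unshared_range are assumptions for illustration, not BTRFS code.

```python
BLOCK_SIZE = 4096  # assumed filesystem block size, for illustration only


def unshared_range(offset: int, length: int, block_size: int = BLOCK_SIZE) -> tuple[int, int]:
    """Return the (start, end) byte range that a write would unshare.

    The start is rounded down and the end is rounded up to a block
    boundary, i.e. the write is widened to whole blocks in both directions.
    """
    if length <= 0 or block_size <= 0:
        raise ValueError("length and block_size must be positive")
    start = (offset // block_size) * block_size
    # Ceiling division without floats: -(-a // b)
    end = -(-(offset + length) // block_size) * block_size
    return start, end


if __name__ == "__main__":
    # A 100-byte write in the middle of the second block unshares that whole block.
    print(unshared_range(5000, 100))
    # A 200-byte write straddling a block boundary unshares both blocks.
    print(unshared_range(4000, 200))
```

Under these assumptions, even a tiny write that crosses a block boundary unshares two full blocks, which is the effect described above.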
You mean: when defrag performs a write, the new data is unshared because every write is unshared? Really?
Consider there is an extent E55 shared by two files A and B. The defrag has to move E55 to another location. In order to do that, defrag creates a new extent E70. It makes it belong to file A by changing the reflink of extent E55 in file A to point to E70.
Now, to retain the original sharing structure, the defrag has to change the reflink of extent E55 in file B to point to E70. You are telling me this is not possible? Bullshit!
Please explain to me how this 'defrag has to unshare' story of yours isn't an intentional attempt to mislead me.
As mentioned in the previous email, we actually did have a (mostly) working reflink-aware defrag a few years back. It got removed because it had serious performance issues. Note that we're not talking a few seconds of extra time to defrag a full tree here, we're talking double-digit _minutes_ of extra time to defrag a moderately sized (low triple digit GB) subvolume with dozens of snapshots, _if you were lucky_ (if you weren't, you would be looking at potentially multiple _hours_ of runtime for the defrag). The performance scaled inversely proportional to the number of reflinks involved and the total amount of data in the subvolume being defragmented, and was pretty bad even in the case of only a couple of snapshots.
You cannot ever make the worst program, because an even worse program can be made by slowing down the original by a factor of 2. So, you had a badly implemented defrag. At least you got some experience. Let's see what went wrong.
Ultimately, there are a couple of issues at play here:
* Online defrag has to maintain consistency during operation. The current implementation does this by rewriting the regions being defragmented (which causes them to become a single new extent (most of the time)), which avoids a whole lot of otherwise complicated logic required to make sure things happen correctly, and also means that only the file being operated on is impacted and only the parts being modified need to be protected against concurrent writes. Properly handling reflinks means that _every_ file that shares some part of an extent with the file being operated on needs to have the reflinked regions locked for the defrag operation, which has a huge impact on performance. Using your example, the update to E55 in both files A and B has to happen as part of the same commit, which can contain no other writes in that region of the file, otherwise you run the risk of losing writes to file B that occur while file A is being defragmented.
Nah. I think there is a workaround. You can first (atomically) update A, then whatever, then you can update B later. I know, you're yelling "what if E55 gets updated in B". Doesn't matter. The defrag continues later by searching for a reflink to E55 in B. Then it checks the data contained in E55. If the data matches E70, then it can safely update the reflink in B. Or the defrag can just verify that neither E55 nor E70 have been written to in the meantime. That means they still have the same data.

So, IOW, you don't care if the total space used by the data is instantaneously larger than what you started with? That seems to be at odds with your previous statements, but OK, if we allow for that then this is indeed a non-issue.
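The exchange above can be illustrated with a toy model. This is only a sketch under heavy assumptions: every file maps to exactly one extent, extents are never modified in place, and the class and method names (ToyFS, relocate, reshare) are hypothetical and unrelated to any real BTRFS interface. It shows the deferred update of B and the temporary increase in space use while E55 and E70 both exist.

```python
class ToyFS:
    """Toy model: each file maps to exactly one immutable extent id."""

    def __init__(self):
        self.extents = {}  # extent id -> bytes
        self.files = {}    # file name -> extent id

    def space_used(self):
        # Only extents still referenced by at least one file count as used.
        return sum(len(self.extents[e]) for e in set(self.files.values()))

    def write(self, name, data, new_id):
        # Copy-on-write: a write never changes an extent in place,
        # it points the file at a fresh extent instead.
        self.extents[new_id] = data
        self.files[name] = new_id

    def relocate(self, name, new_id):
        """Step 1: copy the file's extent to new_id and repoint only this file."""
        old_id = self.files[name]
        self.extents[new_id] = self.extents[old_id]
        self.files[name] = new_id
        return old_id

    def reshare(self, name, old_id, new_id):
        """Step 2 (later): repoint another file only if it still uses old_id."""
        if self.files[name] != old_id:
            return False  # the file diverged in the meantime; leave it alone
        self.files[name] = new_id
        return True


if __name__ == "__main__":
    fs = ToyFS()
    fs.extents["E55"] = b"x" * 8
    fs.files = {"A": "E55", "B": "E55"}
    print("before:", fs.space_used())
    old = fs.relocate("A", "E70")
    print("after step 1:", fs.space_used())  # E55 and E70 both referenced
    fs.reshare("B", old, "E70")
    print("after step 2:", fs.space_used())
```

In this model the check in reshare is enough because a write to B can only repoint B, never alter E55. Between the two steps the data exists twice, which is the temporary space overhead raised in the reply.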

It's not horrible when it's just a small region in two files, but it becomes a big issue when dealing with lots of files and/or particularly large extents (extents in BTRFS can get into the GB range in terms of size when dealing with really big files).
You must just split large extents in a smart way. So, in the beginning, the defrag can split large extents (2GB) into smaller ones (32MB) to facilitate more responsive and easier defrag.
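As a rough illustration of the proposed split, the following sketch only does the arithmetic of cutting one byte range into fixed-size pieces. The function name split_extent and the (start, length) tuple representation are assumptions; it says nothing about how, or whether, BTRFS could split extents on disk.

```python
MIB = 1024 * 1024


def split_extent(start: int, length: int, piece: int = 32 * MIB) -> list[tuple[int, int]]:
    """Cut the range [start, start + length) into (offset, size) pieces.

    Every piece has size `piece`, except possibly the last one.
    """
    if length <= 0 or piece <= 0:
        raise ValueError("length and piece must be positive")
    pieces = []
    offset = start
    end = start + length
    while offset < end:
        size = min(piece, end - offset)
        pieces.append((offset, size))
        offset += size
    return pieces


if __name__ == "__main__":
    # The 2GB -> 32MB example from the text, taken as binary units.
    parts = split_extent(0, 2048 * MIB)
    print(len(parts), "pieces of", parts[0][1] // MIB, "MiB")
```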
If you have lots of files, update them one by one. It is possible. Or you can update in big batches. Whatever is faster.

Neither will solve this though. Large numbers of files are an issue because the operation is expensive and has to be done on each file, not because the number of files somehow makes the operation more expensive. It's O(n) relative to files, not higher time complexity.

The point is that the defrag can keep a buffer of "pending operations". Pending operations are those that should be performed in order to keep the original sharing structure. If the defrag gets interrupted, then files in "pending operations" will be unshared. But this should really be some important and urgent interrupt, as the "pending operations" buffer needs at most a second or two to complete its operations.

Depending on the exact situation, it can take well more than a few seconds to complete stuff. Especially if there are lots of reflinks.

* Reflinks can reference partial extents. This means, ultimately, that you may end up having to split extents in odd ways during defrag if you want to preserve reflinks, and might have to split extents _elsewhere_ that are only tangentially related to the region being defragmented. See the example in my previous email for a case like this, maintaining the shared regions as being shared when you defragment either file to a single extent will require splitting extents in the other file (in either case, whichever file you don't defragment to a single extent will end up having 7 extents if you try to force the one that's been defragmented to be the canonical version). Once you consider that a given extent can have multiple ranges reflinked from multiple other locations, it gets even more complicated.
I think that this problem can be solved, and that it can be solved perfectly (the result is a perfectly-defragmented file). But, if it is so hard to do, just skip those problematic extents in the initial version of defrag.
Ultimately, in the super-duper defrag, those partially-referenced extents should be split up by defrag.
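To show why partial references force extra splits, here is a small sketch that computes where a single extent would have to be cut so that every resulting piece is referenced either wholly or not at all by each reflink. The function name split_points and the representation of reflinks as (start, end) byte ranges inside the extent are assumptions made for illustration.

```python
def split_points(extent_len: int, ref_ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return the (start, end) pieces an extent must be cut into.

    `ref_ranges` lists half-open byte ranges of the extent that are
    referenced from elsewhere. Each boundary of each range becomes a cut.
    """
    cuts = {0, extent_len}
    for start, end in ref_ranges:
        if not (0 <= start < end <= extent_len):
            raise ValueError(f"range ({start}, {end}) is outside the extent")
        cuts.update((start, end))
    ordered = sorted(cuts)
    return list(zip(ordered, ordered[1:]))


if __name__ == "__main__":
    # Two overlapping partial references into a 100-byte extent.
    for piece in split_points(100, [(10, 40), (30, 60)]):
        print(piece)
```

Two overlapping partial references already turn one extent into five pieces here, and each additional reference can add up to two more cuts.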
* If you choose to just not handle the above point by not letting defrag split extents, you put a hard lower limit on the amount of fragmentation present in a file if you want to preserve reflinks. IOW, you can't defragment files past a certain point. If we go this way, neither of the two files in the example from my previous email could be defragmented any further than they already are, because doing so would require splitting extents.
Oh, you're reading my thoughts. That's good.
Initial implementation of defrag might be not-so-perfect. It would still be better than the current defrag.
This is not a one-way street. Handling of partially-used extents can be improved in later versions.
* Determining all the reflinks to a given region of a given extent is not a cheap operation, and the information may immediately be stale (because an operation right after you fetch the info might change things). We could work around this by locking the extent somehow, but doing so would be expensive because you would have to hold the lock for the entire defrag operation.
No. DO NOT LOCK TO RETRIEVE REFLINKS.
Instead, you have to create a hook in every function that updates the reflink structure or extents (for example, write-to-file operation). So, when a reflink gets changed, the defrag is immediately notified about this. That way the defrag can keep its data about reflinks in-sync with the filesystem.
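The hook idea can be sketched as a plain observer pattern. This is a userspace toy, not kernel code: ReflinkTable, DefragView, and their methods are hypothetical names, and the model tracks only which files reference which extent ids.

```python
from collections import defaultdict


class ReflinkTable:
    """Toy reflink bookkeeping that notifies listeners on every change."""

    def __init__(self):
        self.refs = defaultdict(set)  # extent id -> set of file names
        self.listeners = []

    def subscribe(self, callback):
        self.listeners.append(callback)

    def set_ref(self, fname, old_extent, new_extent):
        # Every code path that changes a reflink goes through here,
        # which is where the "hook" fires.
        if old_extent is not None:
            self.refs[old_extent].discard(fname)
        self.refs[new_extent].add(fname)
        for callback in self.listeners:
            callback(fname, old_extent, new_extent)


class DefragView:
    """Defrag-side copy of the reflink data, kept current via the hook."""

    def __init__(self, table):
        # One-time snapshot, then incremental updates only.
        self.refs = {ext: set(names) for ext, names in table.refs.items()}
        self.events = 0
        table.subscribe(self.on_change)

    def on_change(self, fname, old_extent, new_extent):
        self.events += 1
        if old_extent is not None:
            self.refs.get(old_extent, set()).discard(fname)
        self.refs.setdefault(new_extent, set()).add(fname)


if __name__ == "__main__":
    table = ReflinkTable()
    table.set_ref("A", None, "E55")
    table.set_ref("B", None, "E55")
    view = DefragView(table)
    table.set_ref("B", "E55", "E71")  # a write to B repoints it
    print(view.refs)
```

Note that the view still needs an initial snapshot of all references, and every change costs one callback per listener, so the sketch moves the cost around rather than removing it.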

This doesn't get around the fact that it's still an expensive operation to enumerate all the reflinks for a given region of a file or extent.

It also allows a very real possibility of a user functionally delaying the defrag operation indefinitely (by triggering a continuous stream of operations that would cause reflink changes for a file being operated on by defrag) if not implemented very carefully.

Also note, this defrag should run as a part of the kernel, not in userspace. Defrag-from-userspace is a nightmare. Defrag has to serialize its operations properly, and it must have knowledge of all other operations in progress. So, it can only operate efficiently as part of the kernel.

Agreed on this point.
