Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
> On 2019-09-11 13:20, webmas...@zedlx.com wrote:
>> Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
>>> On 2019-09-10 19:32, webmas...@zedlx.com wrote:
>>>> Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
>>>>> Given this, defrag isn't willfully unsharing anything, it's just a
>>>>> side-effect of how it works (since it's rewriting the block layout
>>>>> of the file in-place).
>>>> The current defrag has to unshare because, as you said, it is
>>>> unaware of the full reflink structure. If it doesn't know about
>>>> all the reflinks, it has to unshare; there is no way around that.
>>> Now factor in that _any_ write will result in unsharing the region
>>> being written to, rounded to the nearest full filesystem block in
>>> both directions (this is mandatory, it's a side effect of the
>>> copy-on-write nature of BTRFS, and is why files that experience
>>> heavy internal rewrites get fragmented very heavily and very
>>> quickly on BTRFS).
>> You mean: when the defrag performs a write, the new data is
>> unshared because every write is unshared? Really?
>>
>> Consider there is an extent E55 shared by two files A and B. The
>> defrag has to move E55 to another location. In order to do that,
>> the defrag creates a new extent E70. It makes E70 belong to file A
>> by changing file A's reflink, which pointed to E55, to point to E70.
>>
>> Now, to retain the original sharing structure, the defrag has to
>> change the reflink to extent E55 in file B so that it points to E70
>> as well. You are telling me this is not possible? Bullshit!
>>
>> Please explain to me how this 'defrag has to unshare' story of
>> yours isn't an intentional attempt to mislead me.
> As mentioned in the previous email, we actually did have a (mostly)
> working reflink-aware defrag a few years back. It got removed
> because it had serious performance issues. Note that we're not
> talking about a few seconds of extra time to defrag a full tree
> here, we're talking double-digit _minutes_ of extra time to defrag
> a moderate-sized (low triple-digit GB) subvolume with dozens of
> snapshots, _if you were lucky_ (if you weren't, you would be looking
> at potentially multiple _hours_ of runtime for the defrag). The
> performance scaled inversely with the number of reflinks involved
> and with the total amount of data in the subvolume being
> defragmented, and was pretty bad even in the case of only a couple
> of snapshots.
You cannot ever make the worst program, because an even worse program
can be made by slowing down the original by a factor of 2.
So, you had a badly implemented defrag. At least you got some
experience. Let's see what went wrong.
> Ultimately, there are a couple of issues at play here:
>
> * Online defrag has to maintain consistency during operation. The
> current implementation does this by rewriting the regions being
> defragmented, which (most of the time) causes them to become a
> single new extent. That avoids a whole lot of otherwise complicated
> logic required to make sure things happen correctly, and also means
> that only the file being operated on is impacted and only the parts
> being modified need to be protected against concurrent writes.
> Properly handling reflinks means that _every_ file that shares some
> part of an extent with the file being operated on needs to have the
> reflinked regions locked for the defrag operation, which has a huge
> impact on performance. Using your example, the update to E55 in
> both files A and B has to happen as part of the same commit, which
> can contain no other writes in that region of the file; otherwise
> you run the risk of losing writes to file B that occur while file A
> is being defragmented.
Nah. I think there is a workaround. You can first (atomically) update
A, then do whatever else, and update B later. I know, you're yelling
"what if E55 gets updated in B?". Doesn't matter. The defrag continues
later by searching for the reflink to E55 in B. Then it checks the
data contained in E55. If the data matches E70, then it can safely
update the reflink in B. Or the defrag can just verify that neither
E55 nor E70 has been written to in the meantime; that means they
still have the same data.
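
To make that concrete, here is a minimal sketch in C of the deferred
second phase. Everything in it (struct extent, struct reflink, the
generation counters) is hypothetical, invented just for illustration;
none of it is an actual btrfs interface. The generation check stands
in for "verify that neither E55 nor E70 has been written to":

    typedef unsigned long long u64;

    struct extent {
        u64 id;
        u64 generation;        /* bumped on every write into the extent */
    };

    struct reflink {
        struct extent *target; /* extent this file range points at */
    };

    /*
     * Phase 1 (already done, atomically): file A was switched from
     * E55 to the defragmented copy E70.  Phase 2 runs later for file
     * B; gen_e55 and gen_e70 were recorded when the copy was made.
     */
    static int fixup_reflink_in_b(struct reflink *link_in_b,
                                  struct extent *e55, struct extent *e70,
                                  u64 gen_e55, u64 gen_e70)
    {
        /* B was rewritten meanwhile and no longer points at E55:
         * nothing to fix up, B simply stays as it is. */
        if (link_in_b->target != e55)
            return 0;

        /* If either extent was written to since the copy was made,
         * the data may differ, so leave B unshared (the fallback). */
        if (e55->generation != gen_e55 || e70->generation != gen_e70)
            return -1;

        /* Same data in both extents, so switching the reflink is
         * invisible to every reader of file B. */
        link_in_b->target = e70;
        return 0;
    }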
> It's not horrible when it's just a small region in two files, but
> it becomes a big issue when dealing with lots of files and/or
> particularly large extents (extents in BTRFS can get into the GB
> range in terms of size when dealing with really big files).
You just have to split large extents in a smart way. So, at the
start, the defrag can split large extents (say, 2 GB) into smaller
ones (say, 32 MB) to make the defrag easier and more responsive (a
sketch of this pass follows below).

If you have lots of files, update them one by one. It is possible.
Or you can update them in big batches, whatever is faster.
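
Here is what such a splitting pass could look like in C. The helper
is made up for illustration (a real implementation would update the
extent tree and every reflink into the split range), and the 32 MB
chunk size is just the number from the paragraph above, not a tuned
value:

    #include <stdio.h>

    typedef unsigned long long u64;

    #define SPLIT_CHUNK (32ULL * 1024 * 1024)   /* 32 MB pieces */

    /* Replace one large extent [start, start+len) with several
     * smaller ones, printed here instead of actually created. */
    static void split_large_extent(u64 start, u64 len)
    {
        u64 off;

        for (off = 0; off < len; off += SPLIT_CHUNK) {
            u64 piece = len - off < SPLIT_CHUNK ? len - off : SPLIT_CHUNK;
            printf("new extent at %llu, length %llu\n",
                   start + off, piece);
        }
    }

    int main(void)
    {
        split_large_extent(0, 2ULL * 1024 * 1024 * 1024); /* 2 GB */
        return 0;
    }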
The point is that the defrag can keep a buffer of "pending
operations": operations that still have to be performed in order to
keep the original sharing structure. If the defrag gets interrupted,
the files left in the pending buffer end up unshared. But it would
have to be a really important and urgent interrupt to justify that,
as the pending buffer needs at most a second or two to complete its
operations.
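
A rough sketch of such a buffer in C; the names, the 1024-entry cap,
and the flush callback are all invented for illustration:

    typedef unsigned long long u64;

    /* One deferred "fix the reflink in this file" operation. */
    struct pending_op {
        u64 file_id;     /* file still pointing at the old extent */
        u64 old_extent;  /* E55 in the example above */
        u64 new_extent;  /* E70 in the example above */
    };

    #define PENDING_MAX 1024

    struct pending_buf {
        struct pending_op ops[PENDING_MAX];
        unsigned int count;
    };

    /* Queue a fixup; if the buffer is full, flush it first, so it
     * never holds more than a second or two of work. */
    static void pending_add(struct pending_buf *buf, struct pending_op op,
                            void (*flush_one)(struct pending_op *))
    {
        if (buf->count == PENDING_MAX) {
            while (buf->count)
                flush_one(&buf->ops[--buf->count]);
        }
        buf->ops[buf->count++] = op;
    }

    /* On an urgent interrupt the buffer is simply dropped: the queued
     * files stay unshared, which is the fallback, not corruption. */
    static void pending_abort(struct pending_buf *buf)
    {
        buf->count = 0;
    }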
> * Reflinks can reference partial extents. This means, ultimately,
> that you may end up having to split extents in odd ways during
> defrag if you want to preserve reflinks, and might have to split
> extents _elsewhere_ that are only tangentially related to the
> region being defragmented. See the example in my previous email for
> a case like this: maintaining the shared regions as shared when you
> defragment either file to a single extent will require splitting
> extents in the other file (in either case, whichever file you don't
> defragment to a single extent will end up having 7 extents if you
> try to force the one that's been defragmented to be the canonical
> version). Once you consider that a given extent can have multiple
> ranges reflinked from multiple other locations, it gets even more
> complicated.
I think that this problem can be solved, and that it can be solved
perfectly (the result being a perfectly defragmented file). But if it
is so hard to do, just skip those problematic extents in the initial
version of the defrag.

Ultimately, in the super-duper defrag, those partially-referenced
extents should be split up by the defrag.
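
For illustration, here is one way such a split could be computed in
C. The types are hypothetical, and the reflinked ranges are assumed
to be sorted and non-overlapping; each printed piece would become an
independent extent that can then be moved without touching the
others:

    #include <stdio.h>

    typedef unsigned long long u64;

    /* A range of one extent that some other file reflinks; offsets
     * are relative to the start of the extent. */
    struct ref_range {
        u64 start;
        u64 end;
    };

    /* Split an extent of size `len` so every reflinked range gets
     * its own extent. */
    static void split_at_ref_boundaries(u64 len, const struct ref_range *r,
                                        unsigned int n)
    {
        u64 pos = 0;
        unsigned int i;

        for (i = 0; i < n; i++) {
            if (r[i].start > pos)   /* unshared gap before the range */
                printf("piece [%llu, %llu)\n", pos, r[i].start);
            printf("piece [%llu, %llu)  <- shared\n",
                   r[i].start, r[i].end);
            pos = r[i].end;
        }
        if (pos < len)              /* unshared tail */
            printf("piece [%llu, %llu)\n", pos, len);
    }

    int main(void)
    {
        /* A 128 KB extent where another file shares only the middle
         * 32 KB: the split yields three independent pieces. */
        struct ref_range shared = { 48 * 1024, 80 * 1024 };
        split_at_ref_boundaries(128 * 1024, &shared, 1);
        return 0;
    }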
> * If you choose to just not handle the above point by not letting
> defrag split extents, you put a hard lower limit on the amount of
> fragmentation present in a file if you want to preserve reflinks.
> IOW, you can't defragment files past a certain point. If we go
> this way, neither of the two files in the example from my previous
> email could be defragmented any further than they already are,
> because doing so would require splitting extents.
Oh, you're reading my thoughts. That's good.

The initial implementation of defrag might be not-so-perfect. It
would still be better than the current defrag.

This is not a one-way street. Handling of partially-used extents can
be improved in later versions.
> * Determining all the reflinks to a given region of a given extent
> is not a cheap operation, and the information may immediately be
> stale (because an operation right after you fetch the info might
> change things). We could work around this by locking the extent
> somehow, but doing so would be expensive because you would have to
> hold the lock for the entire defrag operation.
No. DO NOT LOCK TO RETRIEVE REFLINKS.

Instead, you have to create a hook in every function that updates the
reflink structure or extents (for example, the write-to-file
operation). So, when a reflink gets changed, the defrag is
immediately notified about it. That way the defrag can keep its data
about reflinks in sync with the filesystem.
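
A sketch of that hook in C, with invented names; this is not an
existing btrfs mechanism, just the shape of the idea:

    typedef unsigned long long u64;

    /* Description of one change to the reflink structure. */
    struct reflink_event {
        u64 file_id;
        u64 extent_id;
        u64 offset, len;
    };

    /* The defrag registers a callback here before it starts. */
    static void (*reflink_hook)(const struct reflink_event *);

    void register_reflink_hook(void (*fn)(const struct reflink_event *))
    {
        reflink_hook = fn;
    }

    /* Called from inside every write/clone/truncate path, while the
     * relevant locks are already held, so the notification is
     * ordered consistently with the change itself. */
    void note_reflink_change(u64 file_id, u64 extent_id,
                             u64 offset, u64 len)
    {
        if (reflink_hook) {
            struct reflink_event ev = { file_id, extent_id, offset, len };
            reflink_hook(&ev);
        }
    }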
Also note, this defrag should run as a part of the kernel, not in
userspace. Defrag-from-userspace is a nightmare. Defrag has to
serialize its operations properly, and it must have knowledge of all
other operations in progress. So, it can only operate efficiently as
part of the kernel.