On 2019-09-12 18:21, General Zed wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-12 15:18, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-11 17:37, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-11 13:20, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-10 19:32, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
Given this, defrag isn't willfully unsharing anything, it's just
a side-effect of how it works (since it's rewriting the block
layout of the file in-place).
The current defrag has to unshare because, as you said, it is unaware
of the full reflink structure. If it doesn't know about all the
reflinks, it has to unshare; there is no way around that.
Now factor in that _any_ write will result in unsharing the
region being written to, rounded to the nearest full filesystem
block in both directions (this is mandatory, it's a side effect
of the copy-on-write nature of BTRFS, and is why files that
experience heavy internal rewrites get fragmented very heavily
and very quickly on BTRFS).
You mean: when defrag performs a write, the new data is unshared
because every write is unshared? Really?
Consider there is an extent E55 shared by two files A and B. The
defrag has to move E55 to another location. In order to do that,
defrag creates a new extent E70. It makes it belong to file A by
changing the reflink of extent E55 in file A to point to E70.
Now, to retain the original sharing structure, the defrag has to
change the reflink of extent E55 in file B to point to E70. You
are telling me this is not possible? Bullshit!
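To make the bookkeeping concrete, here is a minimal sketch in C. All
the types and helper names (relink_file, defrag_shared_extent) are
invented for illustration; this is not the actual btrfs kernel API:

    #include <stddef.h>
    #include <stdint.h>

    /* Invented, simplified model: a file is a list of reflinks, and a
     * reflink points at an extent.  Nothing here is real btrfs code. */
    struct extent  { uint64_t id, disk_start, len; };
    struct reflink { struct extent *ext; uint64_t file_off; };
    struct file    { struct reflink *links; size_t nlinks; };

    /* Retarget every reflink in one file from extent 'from' to 'to'. */
    void relink_file(struct file *f, struct extent *from, struct extent *to)
    {
        for (size_t i = 0; i < f->nlinks; i++)
            if (f->links[i].ext == from)
                f->links[i].ext = to;
    }

    /* Defrag of the shared extent E55: the data is copied to the new
     * location E70 (copy elided here), and then the reflinks in *both*
     * A and B are retargeted, so nothing gets unshared. */
    void defrag_shared_extent(struct file *A, struct file *B,
                              struct extent *e55, struct extent *e70)
    {
        relink_file(A, e55, e70);
        relink_file(B, e55, e70);
        /* E55 can now be freed; A and B still share E70. */
    }
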
Please explain to me how this 'defrag has to unshare' story of
yours isn't an intentional attempt to mislead me.
As mentioned in the previous email, we actually did have a
(mostly) working reflink-aware defrag a few years back. It got
removed because it had serious performance issues. Note that
we're not talking a few seconds of extra time to defrag a full
tree here, we're talking double-digit _minutes_ of extra time to
defrag a moderately sized (low-triple-digit GB) subvolume with
dozens of snapshots, _if you were lucky_ (if you weren't, you
would be looking at potentially multiple _hours_ of runtime for
the defrag). The performance scaled inversely with the number of
reflinks involved and the total amount of data in the subvolume
being defragmented, and was pretty bad even in the
case of only a couple of snapshots.
You cannot ever make the worst program, because an even worse
program can be made by slowing down the original by a factor of 2.
So, you had a badly implemented defrag. At least you got some
experience. Let's see what went wrong.
Ultimately, there are a couple of issues at play here:
* Online defrag has to maintain consistency during operation. The
current implementation does this by rewriting the regions being
defragmented (which causes them to become a single new extent
(most of the time)), which avoids a whole lot of otherwise
complicated logic required to make sure things happen correctly,
and also means that only the file being operated on is impacted
and only the parts being modified need to be protected against
concurrent writes. Properly handling reflinks means that _every_
file that shares some part of an extent with the file being
operated on needs to have the reflinked regions locked for the
defrag operation, which has a huge impact on performance. Using
your example, the update to E55 in both files A and B has to
happen as part of the same commit, which can contain no other
writes in that region of the file, otherwise you run the risk of
losing writes to file B that occur while file A is being
defragmented.
Nah. I think there is a workaround. You can first (atomically)
update A, then whatever, then you can update B later. I know, you're
yelling "what if E55 gets updated in B". Doesn't matter. The defrag
continues later by searching for the reflink to E55 in B. Then it
checks the data contained in E55. If the data matches E70, then
it can safely update the reflink in B. Or the defrag can just
verify that neither E55 nor E70 has been written to in the
meantime. That means they still have the same data.
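A sketch of that later check, again with invented types; I am
assuming each extent carries something like a last-write generation
number, which is not literally how btrfs stores it:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* Invented model: 'last_write_gen' is the id of the last transaction
     * that wrote the extent; 'copy_gen' is the id recorded when the
     * defrag copied E55 into E70. */
    struct extent { uint64_t id, last_write_gen, len; const void *data; };

    /* B's reflink may be retargeted later only if the copy is still
     * valid: either nothing wrote to E55 or E70 since the copy was made,
     * or the bytes still compare equal. */
    bool can_relink_later(const struct extent *e55, const struct extent *e70,
                          uint64_t copy_gen)
    {
        if (e55->last_write_gen <= copy_gen && e70->last_write_gen <= copy_gen)
            return true;                                    /* fast path */
        return e55->len == e70->len &&
               memcmp(e55->data, e70->data, e55->len) == 0; /* slow path */
    }
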
So, IOW, you don't care if the total space used by the data is
instantaneously larger than what you started with? That seems to be
at odds with your previous statements, but OK, if we allow for that
then this is indeed a non-issue.
It is normal and common for defrag operation to use some disk space
while it is running. I estimate that a reasonable limit would be to
use up to 1% of total partition size. So, if a partition size is 100
GB, the defrag can use 1 GB. Let's call this the "defrag operation
space". The defrag should, when started, verify that there is
"sufficient free space" on the partition. If there is not sufficient
free space, the defrag should output a message to the user and
abort. The size of "sufficient free space" must be larger than the
"defrag operation space". I would estimate that a good limit would be
2% of the partition size. "defrag operation space" is a part of
"sufficient free space" while defrag operation is in progress.
If, during defrag operation, sufficient free space drops below 2%,
the defrag should output a message and abort. Another possibility is
for defrag to pause until the user frees some disk space, but this is
not common in other defrag implementations AFAIK.
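For illustration, a rough user-space sketch of this policy.
statvfs() is a real call; the 1% and 2% figures are just my estimates
from above, and check_defrag_space() is a made-up name:

    #include <stdio.h>
    #include <sys/statvfs.h>

    /* Abort unless at least 2% of the partition is free; otherwise
     * report the 1% "defrag operation space" budget via *scratch_bytes. */
    int check_defrag_space(const char *mountpoint,
                           unsigned long long *scratch_bytes)
    {
        struct statvfs s;
        if (statvfs(mountpoint, &s) != 0)
            return -1;
        unsigned long long total = (unsigned long long)s.f_blocks * s.f_frsize;
        unsigned long long avail = (unsigned long long)s.f_bavail * s.f_frsize;
        if (avail < total / 50) {                  /* less than 2% free */
            fprintf(stderr, "defrag: not enough free space, aborting\n");
            return -1;
        }
        *scratch_bytes = total / 100;              /* 1% operation space */
        return 0;
    }
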
It's not horrible when it's just a small region in two files, but
it becomes a big issue when dealing with lots of files and/or
particularly large extents (extents in BTRFS can get into the GB
range in terms of size when dealing with really big files).
You must just split large extents in a smart way. So, in the
beginning, the defrag can split large extents (2 GB) into smaller
ones (32 MB) to facilitate a more responsive and easier defrag.
If you have lots of files, update them one by one. It is possible.
Or you can update them in big batches, whichever is faster.
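A trivial sketch of such a split; the 32 MB target is just the
example figure above, and the callback stands in for whatever
per-piece work the defrag would do:

    #include <stdint.h>

    #define SPLIT_TARGET (32ULL * 1024 * 1024)   /* 32 MB pieces */

    /* Carve one large extent into pieces that can be moved (and have
     * their reflinks retargeted) independently. */
    void split_extent(uint64_t start, uint64_t len,
                      void (*emit)(uint64_t piece_start, uint64_t piece_len))
    {
        while (len > 0) {
            uint64_t piece = len < SPLIT_TARGET ? len : SPLIT_TARGET;
            emit(start, piece);
            start += piece;
            len   -= piece;
        }
    }
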
Neither will solve this, though. Large numbers of files are an issue
because the operation is expensive and has to be done on each file,
not because the number of files somehow makes the operation more
expensive. It's O(n) relative to files, not a higher time complexity.
I would say that updating in big batches helps a lot, to the point
that it gets almost as fast as defragging any other file system. What
the defrag needs to do is write a big bunch of defragged file (data)
extents to the disk, and then update the b-trees. What happens is
that many of the updates to the b-trees would fall into the same disk
sector/extent, so instead of many writes there will be just one write.
Here is the general outline for the implementation (a sketch in C
follows the outline):
- write a big bunch of defragged file extents to disk
- a minimal set of updates of the b-trees that cannot be
delayed is performed (this is nothing or almost nothing in most
circumstances)
- put the rest of required updates of b-trees into "pending
operations buffer"
- analyze the "pending operations buffer", and find out
(approximately) the biggest part of it that can be flushed out by
doing a minimal number of disk writes
- flush out that part of "pending operations buffer"
- repeat
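Here is that outline as a C skeleton. Every struct and helper name is
invented purely to show the order of operations; it is not meant as
working btrfs code:

    #include <stdbool.h>
    #include <stddef.h>

    struct batch;        /* defragged extents written, not yet linked */
    struct pending_ops;  /* deferred b-tree updates                   */

    struct batch *plan_next_batch(void);
    void write_extent_data(struct batch *b);
    void apply_urgent_tree_updates(struct batch *b);
    void queue_tree_updates(struct pending_ops *p, struct batch *b);
    bool pick_cheapest_flush_set(struct pending_ops *p);
    void flush_selected(struct pending_ops *p);

    void defrag_loop(struct pending_ops *pending)
    {
        struct batch *b;
        while ((b = plan_next_batch()) != NULL) {
            write_extent_data(b);            /* write defragged extents     */
            apply_urgent_tree_updates(b);    /* only what cannot be delayed */
            queue_tree_updates(pending, b);  /* defer the rest              */
            if (pick_cheapest_flush_set(pending)) /* analyze the buffer     */
                flush_selected(pending);          /* flush the cheap part   */
        }                                         /* repeat                 */
    }
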
It helps, but you still can't get around having to recompute the new
tree state, and that is going to take time proportional to the number
of nodes that need to change, which in turn is proportional to the
number of files.
Yes, but that is just a computation. The defrag performance mostly
depends on minimizing disk I/O operations, not on computations.
You're assuming the defrag is being done on a system that's otherwise
perfectly idle. In the real world, that rarely, if ever, will be the
case. The system may be doing other things at the same time, and the
more computation the defrag operation has to do, the more likely it is
to negatively impact those other things.
In the past, many good and fast defrag algorithms have been
produced, and I don't see any reason why this project wouldn't also
be able to create such a good algorithm.
Because it's not just the new extent locations you have to compute, you
also need to compute the resultant metadata tree state, and the
resultant extent tree state, and after all of that the resultant
checksum tree state. Yeah, figuring out optimal block layouts is
solved, but you can't get around the overhead of recomputing the new
tree state and all the block checksums for it.
The current defrag has to deal with this too, but it doesn't need to do
as much computation because it's not worried about preserving reflinks
(and therefore defragmenting a single file won't require updates to any
other files).
The point is that the defrag can keep a buffer of "pending
operations". Pending operations are those that should be performed
in order to keep the original sharing structure. If the defrag gets
interrupted, then the files in the "pending operations" buffer will
be unshared. But this should really be some important and urgent
interrupt, as the "pending operations" buffer needs at most a second
or two to complete its operations.
Depending on the exact situation, it can take well more than a few
seconds to complete stuff. Especially if there are lots of reflinks.
Nope. You are quite wrong there.
In the worst case, the "pending operations buffer" will update (write
to disk) all the b-trees. So, the upper limit on time to flush the
"pending operations buffer" equals the time to write the entire
b-tree structure to the disk (into new extents). I estimate that
takes at most a few seconds.
So what you're talking about is journaling the computed state of
defrag operations. That shouldn't be too bad (as long as it's done in
memory instead of on-disk) if you batch the computations properly. I
thought you meant having a buffer of what operations to do, and then
computing them on the fly (which would have significant overhead).
Looks close to what I was thinking. Soon we might be able to
communicate. I'm not sure what you mean by "journaling the computed
state of defrag operations". Maybe it doesn't matter.
Essentially, doing a write-ahead log of pending operations. Journaling
is just the common term for such things when dealing with Linux
filesystems because of ext* and XFS. Based on what you say below, it
sounds like we're on the same page here other than the terminology.
What happens is that the file (extent) data is first written to disk
(defragmented), but the b-tree is not immediately updated. It doesn't
have to be. Even if there is a power loss, nothing happens.
So, the changes that should be done to the b-trees are put into
pending-operations-buffer. When a lot of file (extent) data is written
to disk, such that defrag-operation-space (1 GB) is close to being
exhausted, the pending-operations-buffer is examined in order to attempt
to free as much of defrag-operation-space as possible. The simplest
algorithm is to flush the entire pending-operations-buffer at once. This
reduces the number of writes that update the b-trees because many
changes to the b-trees fall into the same or neighbouring disk sectors.
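A small sketch of why this cuts down the write count: if the pending
updates are sorted by the b-tree leaf block they land in, every
distinct leaf costs one write no matter how many items in it changed.
The types are invented for the example:

    #include <stdint.h>
    #include <stdlib.h>

    struct pending_update { uint64_t leaf_blocknr; /* plus item data */ };

    static int by_leaf(const void *a, const void *b)
    {
        uint64_t la = ((const struct pending_update *)a)->leaf_blocknr;
        uint64_t lb = ((const struct pending_update *)b)->leaf_blocknr;
        return (la > lb) - (la < lb);
    }

    /* Number of leaf writes needed to flush the buffer: the number of
     * distinct leaves touched, not the number of pending updates. */
    size_t count_leaf_writes(struct pending_update *ops, size_t n)
    {
        size_t writes = 0;
        qsort(ops, n, sizeof(*ops), by_leaf);
        for (size_t i = 0; i < n; i++)
            if (i == 0 || ops[i].leaf_blocknr != ops[i - 1].leaf_blocknr)
                writes++;
        return writes;
    }
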
* Reflinks can reference partial extents. This means, ultimately,
that you may end up having to split extents in odd ways during
defrag if you want to preserve reflinks, and might have to split
extents _elsewhere_ that are only tangentially related to the
region being defragmented. See the example in my previous email
for a case like this, maintaining the shared regions as being
shared when you defragment either file to a single extent will
require splitting extents in the other file (in either case,
whichever file you don't defragment to a single extent will end up
having 7 extents if you try to force the one that's been
defragmented to be the canonical version). Once you consider that
a given extent can have multiple ranges reflinked from multiple
other locations, it gets even more complicated.
I think that this problem can be solved, and that it can be solved
perfectly (the result is a perfectly-defragmented file). But, if it
is so hard to do, just skip those problematic extents in the initial
version of the defrag.
Ultimately, in the super-duper defrag, those partially-referenced
extents should be split up by defrag.
* If you choose to just not handle the above point by not letting
defrag split extents, you put a hard lower limit on the amount of
fragmentation present in a file if you want to preserve reflinks.
IOW, you can't defragment files past a certain point. If we go
this way, neither of the two files in the example from my previous
email could be defragmented any further than they already are,
because doing so would require splitting extents.
Oh, you're reading my thoughts. That's good.
The initial implementation of the defrag might not be perfect. It
would still be better than the current defrag.
This is not a one-way street. Handling of partially-used extents
can be improved in later versions.
* Determining all the reflinks to a given region of a given extent
is not a cheap operation, and the information may immediately be
stale (because an operation right after you fetch the info might
change things). We could work around this by locking the extent
somehow, but doing so would be expensive because you would have to
hold the lock for the entire defrag operation.
No. DO NOT LOCK TO RETRIEVE REFLINKS.
Instead, you have to create a hook in every function that updates
the reflink structure or extents (for example, the write-to-file
operation). So, when a reflink gets changed, the defrag is
immediately notified about this. That way the defrag can keep its
data about reflinks in-sync with the filesystem.
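A minimal sketch of such a hook, with invented types and names;
nothing here is existing btrfs code:

    #include <stdint.h>

    /* Every code path that changes a reflink (write, clone, truncate,
     * ...) would call reflink_changed(), so a running defrag can keep
     * its in-memory picture of the sharing structure in sync. */
    struct reflink_event {
        uint64_t inode;        /* file whose mapping changed        */
        uint64_t file_offset;  /* start of the changed range        */
        uint64_t old_extent;   /* extent previously referenced      */
        uint64_t new_extent;   /* extent referenced now (0 if none) */
        uint64_t len;
    };

    typedef void (*reflink_hook_fn)(const struct reflink_event *ev);

    static reflink_hook_fn defrag_hook;  /* NULL while no defrag runs */

    void register_defrag_hook(reflink_hook_fn fn) { defrag_hook = fn; }
    void unregister_defrag_hook(void)             { defrag_hook = 0; }

    /* Called from every operation that rewrites a reflink. */
    void reflink_changed(const struct reflink_event *ev)
    {
        if (defrag_hook)
            defrag_hook(ev);
    }
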
This doesn't get around the fact that it's still an expensive
operation to enumerate all the reflinks for a given region of a file
or extent.
No, you are wrong.
In order to enumerate all the reflinks in a region, the defrag needs
to have another array, which is also kept in memory and in sync with
the filesystem. It is easiest to divide the disk into regions of
equal size, where each region is a few MB large. Let's call this the
"regions-to-extents" array. This array doesn't need to be
associative; it is a plain array.
This in-memory array links regions of the disk to the extents that
are in each region. The array is initialized when the defrag starts.
This array makes the operation of finding all extents of a region
extremely fast.
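A sketch of the layout I have in mind, with invented types. I assume
4 MB regions and, to keep it simple, that an extent is filed under
the region its start address falls into:

    #include <stdint.h>
    #include <stdlib.h>

    #define REGION_SHIFT 22                 /* 4 MB regions */

    struct extent_node {
        uint64_t disk_start, len;
        struct extent_node *next;
    };

    struct region_map {
        uint64_t nregions;
        struct extent_node **slots;         /* one list head per region */
    };

    static inline uint64_t region_of(uint64_t disk_addr)
    {
        return disk_addr >> REGION_SHIFT;   /* plain array index, O(1) */
    }

    struct region_map *region_map_new(uint64_t device_bytes)
    {
        struct region_map *m = calloc(1, sizeof(*m));
        if (!m)
            return NULL;
        m->nregions = (device_bytes >> REGION_SHIFT) + 1;
        m->slots = calloc(m->nregions, sizeof(*m->slots));
        if (!m->slots) {
            free(m);
            return NULL;
        }
        return m;
    }

    /* Extents that cross a region boundary would need an entry in each
     * region they touch; that is omitted here for brevity. */
    void region_map_add(struct region_map *m, struct extent_node *e)
    {
        uint64_t r = region_of(e->disk_start);
        e->next = m->slots[r];
        m->slots[r] = e;
    }
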
That has two issues:
* That's going to be a _lot_ of memory. You still need to be able to
defragment big (dozens plus TB) arrays without needing multiple GB of
RAM just for the defrag operation, otherwise it's not realistically
useful (remember, it was big arrays that had issues with the old
reflink-aware defrag too).
OK, but let's do some calculations here. If regions are 4 MB in size,
the region-extents array for an 8 TB partition would have 2 million
entries. If entries average 64 bytes, that would be:
- a total of 128 MB memory for an 8 TB partition.
Of course, I'm guessing a lot of numbers there, but it should be doable.
Even if we assume as optimistic an estimate as yours (I suspect
it will require more than 64 bytes per entry), that's a lot of
RAM when you look at what it's potentially displacing. That's enough
RAM for receive and transmit buffers for a few hundred thousand network
connections, or for caching multiple hundreds of thousands of dentries,
or a few hundred thousand inodes. Hell, that's enough RAM to run all
the standard network services for a small network (DHCP, DNS, NTP, TFTP,
mDNS relay, UPnP/NAT-PMP, SNMP, IGMP proxy, VPN of your choice) at least
twice over.
* You still have to populate the array in the first place. A sane
implementation wouldn't be keeping it in memory even when defrag is
not running (no way is anybody going to tolerate even dozens of MB of
memory overhead for this), so you're not going to get around the need
to enumerate all the reflinks for a file at least once (during
startup, or when starting to process that file), so you're just moving
the overhead around instead of eliminating it.
Yes, when the defrag starts, the entire b-tree structure is examined
in order for the region-extents array and the extents-backref
associative array to be populated.
So your startup is going to take forever on any reasonably large volume.
This isn't eliminating the overhead; it's just moving it all to one
place. That might make it a bit more efficient than it would be
interspersed throughout the operation, but only because it is reading
all the relevant data at once.
Of course, those two arrays exist only during defrag operation. When
defrag completes, those arrays are deallocated.
It also allows a very real possibility of a user functionally
delaying the defrag operation indefinitely (by triggering a
continuous stream of operations that would cause reflink changes for
a file being operated on by defrag) if not implemented very carefully.
Yes, if a user does something like that, the defrag can be paused or
even aborted. That is normal.
Not really. Most defrag implementations either avoid files that could
reasonably be written to, or freeze writes to the file they're
operating on, or in some other way just sidestep the issue without
delaying the defragmentation process.
There are many ways around this problem, but it really doesn't
matter; those are just details. The initial version of the defrag can
just abort. More mature versions of the defrag can handle this
problem better.
Details like this are the deciding factor for whether something is
sanely usable in certain use cases, as you have yourself found out
(for a lot of users, the fact that defrag can unshare extents is 'just
a detail' that's not worth worrying about).
I wouldn't agree there.
Not every issue is equal. Some issues are more important, some are
trivial, some are tolerable, etc.
The defrag is usually allowed to abort. It can easily be restarted
later. Workaround: you can make a defrag-supervisor program, which
starts a defrag, and if the defrag aborts, restarts it after some
(configurable) amount of time.
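A toy example of such a supervisor. The mount point is a placeholder,
the retry delay comes from the first argument, and it simply reruns
the existing 'btrfs filesystem defragment' command; a real one would
of course run the reflink-aware defrag instead:

    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        unsigned delay = argc > 1 ? (unsigned)atoi(argv[1]) : 600; /* sec */
        for (;;) {
            pid_t pid = fork();
            if (pid == 0) {
                execlp("btrfs", "btrfs", "filesystem", "defragment",
                       "-r", "/mnt/data", (char *)NULL);
                _exit(127);                   /* exec failed */
            }
            int status = 0;
            if (pid < 0 || waitpid(pid, &status, 0) < 0)
                return 1;
            if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
                return 0;                     /* finished cleanly, stop */
            sleep(delay);                     /* aborted: retry later */
        }
    }
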
The fact that the defrag can be functionally deferred indefinitely by a
user means that a user can, with a bit of effort, force degraded
performance for everyone using the system. Aborting the defrag doesn't
solve that, and it's a significant issue for anybody doing shared hosting.
On the other hand, unsharing is not easy to undo.
But, again, this just doesn't matter for some people.
So, those issues are not equal.