Re: Feature requests: online backup - defrag - change RAID level

Austin S. Hemmelgarn Fri, 13 Sep 2019 11:29:51 -0700

On 2019-09-13 12:54, General Zed wrote:

Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-12 18:21, General Zed wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-12 15:18, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-11 17:37, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-11 13:20, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-10 19:32, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
Given this, defrag isn't willfully unsharing anything, it's just a side-effect of how it works (since it's rewriting the block layout of the file in-place).
The current defrag has to unshare because, as you said, it is unaware of the full reflink structure. If it doesn't know about all reflinks, it has to unshare, there is no way around that.
Now factor in that _any_ write will result in unsharing the region being written to, rounded to the nearest full filesystem block in both directions (this is mandatory, it's a side effect of the copy-on-write nature of BTRFS, and is why files that experience heavy internal rewrites get fragmented very heavily and very quickly on BTRFS).
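As a rough illustration of the rounding described above, the following Python sketch computes the byte range a write would unshare. The function name `unshared_range` and the 4 KiB default block size are assumptions made for the example; this is not how BTRFS itself is implemented.

```python
def unshared_range(offset, length, block_size=4096):
    """Return (start, end) of the byte range unshared by a write.

    The written range is widened outward to whole filesystem blocks:
    the start is rounded down and the end is rounded up.
    `block_size` is an assumed value for illustration only.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    start = (offset // block_size) * block_size
    # Ceiling division without floats: -(-a // b)
    end = -(-(offset + length) // block_size) * block_size
    return start, end


# A 100-byte write in the middle of the second block unshares all of it.
print(unshared_range(5000, 100))
# A 200-byte write straddling a block boundary unshares both blocks.
print(unshared_range(4000, 200))
```

Even a very small write therefore unshares at least one full block, and a write that crosses a block boundary unshares every block it touches.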
You mean: when defrag performs a write, the new data is unshared because every write is unshared? Really?
Consider there is an extent E55 shared by two files A and B. The defrag has to move E55 to another location. In order to do that, defrag creates a new extent E70. It makes it belong to file A by changing the reflink of extent E55 in file A to point to E70.
Now, to retain the original sharing structure, the defrag has to change the reflink of extent E55 in file B to point to E70. You are telling me this is not possible? Bullshit!
Please explain to me how this 'defrag has to unshare' story of yours isn't an intentional attempt to mislead me.
As mentioned in the previous email, we actually did have a (mostly) working reflink-aware defrag a few years back. It got removed because it had serious performance issues. Note that we're not talking a few seconds of extra time to defrag a full tree here, we're talking double-digit _minutes_ of extra time to defrag a moderate sized (low triple digit GB) subvolume with dozens of snapshots, _if you were lucky_ (if you weren't, you would be looking at potentially multiple _hours_ of runtime for the defrag). The performance scaled inversely proportionate to the number of reflinks involved and the total amount of data in the subvolume being defragmented, and was pretty bad even in the case of only a couple of snapshots.
You cannot ever make the worst program, because an even worse program can be made by slowing down the original by a factor of 2. So, you had a badly implemented defrag. At least you got some experience. Let's see what went wrong.
Ultimately, there are a couple of issues at play here:
* Online defrag has to maintain consistency during operation. The current implementation does this by rewriting the regions being defragmented (which causes them to become a single new extent (most of the time)), which avoids a whole lot of otherwise complicated logic required to make sure things happen correctly, and also means that only the file being operated on is impacted and only the parts being modified need to be protected against concurrent writes. Properly handling reflinks means that _every_ file that shares some part of an extent with the file being operated on needs to have the reflinked regions locked for the defrag operation, which has a huge impact on performance. Using your example, the update to E55 in both files A and B has to happen as part of the same commit, which can contain no other writes in that region of the file, otherwise you run the risk of losing writes to file B that occur while file A is being defragmented.
Nah. I think there is a workaround. You can first (atomically) update A, then whatever, then you can update B later. I know, you're yelling "what if E55 gets updated in B". Doesn't matter. The defrag continues later by searching for reflink to E55 in B. Then it checks the data contained in E55. If the data matches the E70, then it can safely update the reflink in B. Or the defrag can just verify that neither E55 nor E70 have been written to in the meantime. That means they still have the same data.
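The deferred update proposed above can be pictured with a toy Python model. Everything here is hypothetical: the `Extent` class, its `generation` write counter, and `relink_if_unchanged` are invented for illustration and do not correspond to real BTRFS structures. The sketch covers only the "verify neither extent was written to" variant.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    name: str
    generation: int = 0  # hypothetical counter, bumped on every write


def relink_if_unchanged(file_refs, old, new, old_gen, new_gen):
    """Repoint references from `old` to `new` if neither extent changed.

    `old_gen` and `new_gen` are the generations recorded when the copy
    was made. Returns True if the relink happened, False if it had to
    be skipped because one of the extents was written in the meantime.
    """
    if old.generation != old_gen or new.generation != new_gen:
        return False
    for i, ext in enumerate(file_refs):
        if ext is old:
            file_refs[i] = new
    return True


e55, e70 = Extent("E55"), Extent("E70")
file_a, file_b = [e55], [e55]
snap = (e55.generation, e70.generation)

file_a[0] = e70  # step 1: file A is updated first
# step 2, some time later: file B follows only if nothing changed
print(relink_if_unchanged(file_b, e55, e70, *snap))
```

If the check fails, file B simply keeps its reference to E55, which means that region ends up unshared, as discussed in the replies that follow.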
So, IOW, you don't care if the total space used by the data is instantaneously larger than what you started with? That seems to be at odds with your previous statements, but OK, if we allow for that then this is indeed a non-issue.
It is normal and common for defrag operation to use some disk space while it is running. I estimate that a reasonable limit would be to use up to 1% of total partition size. So, if a partition size is 100 GB, the defrag can use 1 GB. Let's call this "defrag operation space".
The defrag should, when started, verify that there is "sufficient free space" on the partition. In the case that there is not sufficient free space, the defrag should output a message to the user and abort. The size of "sufficient free space" must be larger than the "defrag operation space". I would estimate that a good limit would be 2% of the partition size. "defrag operation space" is a part of "sufficient free space" while defrag operation is in progress.
If, during defrag operation, sufficient free space drops below 2%, the defrag should output a message and abort. Another possibility is for defrag to pause until the user frees some disk space, but this is not common in other defrag implementations AFAIK.
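The two thresholds proposed above reduce to a little integer arithmetic. The following Python sketch uses the 1% and 2% figures from the text; the function names are made up for the example.

```python
def defrag_space_limits(partition_bytes, op_percent=1, min_free_percent=2):
    """Return (defrag_operation_space, sufficient_free_space) in bytes.

    Integer arithmetic is used so the results are exact.
    """
    op_space = partition_bytes * op_percent // 100
    min_free = partition_bytes * min_free_percent // 100
    return op_space, min_free


def may_continue(partition_bytes, free_bytes):
    """True while free space is at or above the 'sufficient' threshold."""
    _, min_free = defrag_space_limits(partition_bytes)
    return free_bytes >= min_free


GB = 10**9
print(defrag_space_limits(100 * GB))   # limits for a 100 GB partition
print(may_continue(100 * GB, 3 * GB))  # enough free space to run
print(may_continue(100 * GB, 1 * GB))  # below 2%: output a message and abort
```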
It's not horrible when it's just a small region in two files, but it becomes a big issue when dealing with lots of files and/or particularly large extents (extents in BTRFS can get into the GB range in terms of size when dealing with really big files).
You must just split large extents in a smart way. So, in the beginning, the defrag can split large extents (2GB) into smaller ones (32MB) to facilitate more responsive and easier defrag.
If you have lots of files, update them one-by-one. It is possible. Or you can update in big batches. Whatever is faster.
Neither will solve this though. Large numbers of files are an issue because the operation is expensive and has to be done on each file, not because the number of files somehow makes the operation more expensive. It's O(n) relative to files, not higher time complexity.
I would say that updating in big batches helps a lot, to the point that it gets almost as fast as defragging any other file system. What defrag needs to do is to write a big bunch of defragged file (data) extents to the disk, and then update the b-trees. What happens is that many of the updates to the b-trees would fall into the same disk sector/extent, so instead of many writes there will be just one write.
Here is the general outline for implementation:
    - write a big bunch of defragged file extents to disk
    - a minimal set of updates of the b-trees that cannot be delayed is performed (this is nothing or almost nothing in most circumstances)
    - put the rest of required updates of b-trees into "pending operations buffer"
    - analyze the "pending operations buffer", and find out (approximately) the biggest part of it that can be flushed out by doing minimal number of disk writes
        - flush out that part of "pending operations buffer"
    - repeat
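The batching idea in the outline above can be sketched in Python as follows. This is a simplified model under stated assumptions: the `PendingOps` class, the notion of a numeric "block number" for a b-tree node, and the `write_block` callback are all hypothetical, and the sketch flushes the whole buffer rather than choosing the cheapest part of it.

```python
from collections import defaultdict


class PendingOps:
    """Buffer of deferred b-tree updates, grouped by target block."""

    def __init__(self):
        self._by_block = defaultdict(list)

    def add(self, block_no, update):
        """Queue one update destined for the given metadata block."""
        self._by_block[block_no].append(update)

    def flush(self, write_block):
        """Write each touched block once, applying all its updates.

        Returns the number of block writes that were issued.
        """
        writes = 0
        for block_no in sorted(self._by_block):
            write_block(block_no, self._by_block[block_no])
            writes += 1
        self._by_block.clear()
        return writes


buf = PendingOps()
for blk, upd in [(7, "a"), (3, "b"), (7, "c"), (3, "d"), (7, "e")]:
    buf.add(blk, upd)
# Five queued updates touch only two blocks, so only two writes are issued.
print(buf.flush(lambda blk, updates: None))
```

The saving depends entirely on how many updates land in the same block, which is the point under dispute in the replies that follow.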
It helps, but you still can't get around having to recompute the new tree state, and that is going to take time proportionate to the number of nodes that need to change, which in turn is proportionate to the number of files.
Yes, but that is just a computation. The defrag performance mostly depends on minimizing disk I/O operations, not on computations.
You're assuming the defrag is being done on a system that's otherwise perfectly idle. In the real world, that rarely, if ever, will be the case. The system may be doing other things at the same time, and the more computation the defrag operation has to do, the more likely it is to negatively impact those other things.
No, I'm not assuming that the system is perfectly idle. I'm assuming that the required computations don't take much CPU time, as is common in a well implemented defrag.

Which also usually doesn't have to do anywhere near as much computation as is needed here.

In the past many good and fast defrag computation algorithms have been produced, and I don't see any reason why this project wouldn't also be able to create such a good algorithm.
Because it's not just the new extent locations you have to compute, you also need to compute the resultant metadata tree state, and the resultant extent tree state, and after all of that the resultant checksum tree state. Yeah, figuring out optimal block layouts is solved, but you can't get around the overhead of recomputing the new tree state and all the block checksums for it.
The current defrag has to deal with this too, but it doesn't need to do as much computation because it's not worried about preserving reflinks (and therefore defragmenting a single file won't require updates to any other files).
Yes, the defrag algorithm needs to compute the new tree state. However, it shouldn't be slow at all. All operations on b-trees can be done in at most N*logN time, which is sufficiently fast. There is no operation there that I can think of that takes N*N or N*M time. So, it should all take little CPU time. Essentially a non-issue.
The ONLY concern that causes N*M time is the presence of sharing. But, even this is unfair, as the computation time will still be N*logN with regards to the total number of reflinks. That is still fast, even for 100 GB metadata with a billion reflinks.
I don't understand why you think that recomputing the new tree state must be slow. Even if there are 100 new tree states that need to be recomputed, there is still no problem. Each metadata update will change only a small portion of b-trees, so the complexity and size of b-trees should not seriously affect the computation time.

Well, let's start with the checksum computations which then need to happen for each block that would be written, which can't be faster than O(n).

Yes, the structural overhead of the b-trees isn't bad by itself, but you have multiple trees that need to be updated in sequence (that is, you have to update one, then update the next based on that one, then update another based on both of the previous two, etc) and a number of other bits of data involved that need to be updated as part of the b-tree update which have worse time complexity than computing the structural changes to the b-trees.

The point is that the defrag can keep a buffer of "pending operations". Pending operations are those that should be performed in order to keep the original sharing structure. If the defrag gets interrupted, then files in "pending operations" will be unshared. But this should really be some important and urgent interrupt, as the "pending operations" buffer needs at most a second or two to complete its operations.
Depending on the exact situation, it can take well more than a few seconds to complete stuff. Especially if there are lots of reflinks.
Nope. You are quite wrong there.
In the worst case, the "pending operations buffer" will update (write to disk) all the b-trees. So, the upper limit on time to flush the "pending operations buffer" equals the time to write the entire b-tree structure to the disk (into new extents). I estimate that takes at most a few seconds.
So what you're talking about is journaling the computed state of defrag operations. That shouldn't be too bad (as long as it's done in memory instead of on-disk) if you batch the computations properly. I thought you meant having a buffer of what operations to do, and then computing them on-the-fly (which would have significant overhead).
Looks close to what I was thinking. Soon we might be able to communicate. I'm not sure what you mean by "journaling the computed state of defrag operations". Maybe it doesn't matter.
Essentially, doing a write-ahead log of pending operations. Journaling is just the common term for such things when dealing with Linux filesystems because of ext* and XFS. Based on what you say below, it sounds like we're on the same page here other than the terminology.
What happens is that file (extent) data is first written to disk (defragmented), but b-tree is not immediately updated. It doesn't have to be. Even if there is a power loss, nothing happens.
So, the changes that should be done to the b-trees are put into pending-operations-buffer. When a lot of file (extent) data is written to disk, such that defrag-operation-space (1 GB) is close to being exhausted, the pending-operations-buffer is examined in order to attempt to free as much of defrag-operation-space as possible. The simplest algorithm is to flush the entire pending-operations-buffer at once. This reduces the number of writes that update the b-trees because many changes to the b-trees fall into the same or neighbouring disk sectors.
* Reflinks can reference partial extents. This means, ultimately, that you may end up having to split extents in odd ways during defrag if you want to preserve reflinks, and might have to split extents _elsewhere_ that are only tangentially related to the region being defragmented. See the example in my previous email for a case like this, maintaining the shared regions as being shared when you defragment either file to a single extent will require splitting extents in the other file (in either case, whichever file you don't defragment to a single extent will end up having 7 extents if you try to force the one that's been defragmented to be the canonical version). Once you consider that a given extent can have multiple ranges reflinked from multiple other locations, it gets even more complicated.
I think that this problem can be solved, and that it can be solved perfectly (the result is a perfectly-defragmented file). But, if it is so hard to do, just skip those problematic extents in initial version of defrag.
Ultimately, in the super-duper defrag, those partially-referenced extents should be split up by defrag.
* If you choose to just not handle the above point by not letting defrag split extents, you put a hard lower limit on the amount of fragmentation present in a file if you want to preserve reflinks. IOW, you can't defragment files past a certain point. If we go this way, neither of the two files in the example from my previous email could be defragmented any further than they already are, because doing so would require splitting extents.
Oh, you're reading my thoughts. That's good.
Initial implementation of defrag might be not-so-perfect. It would still be better than the current defrag.
This is not a one-way street. Handling of partially-used extents can be improved in later versions.
* Determining all the reflinks to a given region of a given extent is not a cheap operation, and the information may immediately be stale (because an operation right after you fetch the info might change things). We could work around this by locking the extent somehow, but doing so would be expensive because you would have to hold the lock for the entire defrag operation.
No. DO NOT LOCK TO RETRIEVE REFLINKS.
Instead, you have to create a hook in every function that updates the reflink structure or extents (for example, write-to-file operation). So, when a reflink gets changed, the defrag is immediately notified about this. That way the defrag can keep its data about reflinks in-sync with the filesystem.
This doesn't get around the fact that it's still an expensive operation to enumerate all the reflinks for a given region of a file or extent.
No, you are wrong.
In order to enumerate all the reflinks in a region, the defrag needs to have another array, which is also kept in memory and in sync with the filesystem. It is easiest to divide the disk into regions of equal size, where each region is a few MB large. Let's call this array "regions-to-extents" array. This array doesn't need to be associative, it is a plain array. This in-memory array links regions of disk to extents that are in the region. The array is initialized when defrag starts.
This array makes the operation of finding all extents of a region extremely fast.
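A minimal Python sketch of such a "regions-to-extents" array is shown below, assuming 4 MiB regions and extents described by a hypothetical `(start, length)` pair keyed by an extent name. The names `REGION_SIZE` and `build_region_index` are invented for the example.

```python
REGION_SIZE = 4 * 1024 * 1024  # assumed region size: 4 MiB


def build_region_index(disk_size, extents):
    """Build a plain list mapping region number -> extents in that region.

    `extents` maps an extent id to (start_offset, length) in bytes.
    An extent that spans several regions is listed in each of them.
    """
    n_regions = -(-disk_size // REGION_SIZE)  # ceiling division
    regions = [[] for _ in range(n_regions)]
    for ext_id, (start, length) in extents.items():
        first = start // REGION_SIZE
        last = (start + length - 1) // REGION_SIZE
        for r in range(first, last + 1):
            regions[r].append(ext_id)
    return regions


MiB = 1024 * 1024
index = build_region_index(
    16 * MiB,
    {"E55": (0, 1 * MiB), "E70": (3 * MiB, 2 * MiB)},
)
# Looking up a region is a single list index.
print(index[0], index[1], index[2])
```

The lookup is cheap once the list exists. The cost of building it and keeping it in memory is a separate matter, which is what the two issues raised next are about.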
That has two issues:
* That's going to be a _lot_ of memory. You still need to be able to defragment big (dozens plus TB) arrays without needing multiple GB of RAM just for the defrag operation, otherwise it's not realistically useful (remember, it was big arrays that had issues with the old reflink-aware defrag too).
Ok, but let's get some calculations there. If regions are 4 MB in size, the region-extents array for an 8 TB partition would have 2 million entries. If entries average 64 bytes, that would be:
 - a total of 128 MB memory for an 8 TB partition.

Of course, I'm guessing a lot of numbers there, but it should be doable.
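The arithmetic behind that estimate can be checked with a few lines of Python. The sketch assumes binary units (MiB, TiB) and uses the 4 MB region size and 64-byte entry size guessed above; `region_index_bytes` is a name made up for the example.

```python
MiB = 1024 ** 2
TiB = 1024 ** 4


def region_index_bytes(partition_bytes, region_bytes=4 * MiB, entry_bytes=64):
    """Return (number_of_entries, total_bytes) for the region array."""
    entries = -(-partition_bytes // region_bytes)  # ceiling division
    return entries, entries * entry_bytes


entries, total = region_index_bytes(8 * TiB)
print(entries)       # about 2 million entries
print(total // MiB)  # total size in MiB
```

The result scales linearly with the entry size, so the estimate is only as good as the guessed 64 bytes per entry.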
Even if we assume such an optimistic estimation as you provide (I suspect it will require more than 64 bytes per-entry), that's a lot of RAM when you look at what it's potentially displacing. That's enough RAM for receive and transmit buffers for a few hundred thousand network connections, or for caching multiple hundreds of thousands of dentries, or a few hundred thousand inodes. Hell, that's enough RAM to run all the standard network services for a small network (DHCP, DNS, NTP, TFTP, mDNS relay, UPnP/NAT-PMP, SNMP, IGMP proxy, VPN of your choice) at least twice over.
That depends on the average size of an extent. If the average size of an extent is around 4 MB, then my numbers should be good. Do you have any data which would suggest that my estimate is wrong? What's the average size of an extent on your filesystems (used space divided by number of extents)?

Depends on what filesystem.

Worst case I have (which I monitor regularly, so I actually have good aggregate data on the actual distribution of extent sizes) is used for backing storage for virtual machine disk images. The arithmetic mean extent size is just barely over 32k, but the median is actually closer to 48k, with the 10th percentile at 4k and the 90th percentile at just over 2M. From what I can tell, this is a pretty typical distribution for this type of usage (high frequency small writes internal to existing files) on BTRFS.

Typical usage on most of my systems when dealing with data sets that include reflinks shows a theoretical average extent size of about 1M, though I suspect the 50th percentile to be a little bit higher than that (I don't regularly check any of those, but the times I have, the 50th percentile has been just a bit than the arithmetic mean, which makes sense given that I have a lot more small files than large ones).

It might be normal on some systems to have larger extents than this, but I somewhat doubt that that will be the case for many potential users.

This "regions-to-extents" array can be further optimized if necessary.
You are not thinking correctly there (misplaced priorities). If the system needs to be defragmented, that's the priority. You can't do comparisons like that, that's unfair debating.
The defrag that I'm proposing should be able to run within common memory limits of today's computer systems. So, it will likely take somewhat less than 700 MB of RAM in most common situations, including the small servers. They all have 700 MB RAM.
700 MB is a lot for a defrag, but there is no way around it. Btrfs is simply a filesystem with such complexity that a good defrag requires a lot of RAM to operate.
If, for some reason, you would like to cover a use-case with constrained RAM conditions, then that is an entirely different concern for a different project. You can't make a project like this to cover ALL the possible circumstances. Some cases have to be left out. Here we are talking about a defrag that is usable in a general and common set of circumstances.

Memory constrained systems come up as a point of discussion pretty regularly when dealing with BTRFS, so they're obviously something users actually care about. You have to keep in mind that it's not unusual in a consumer NAS system to have less than 4GB of RAM, but have arrays well into double digit terabytes in size. Even 128MB of RAM needing to be used for defrag is trashing _a lot_ of cache on such a system.

Please, don't drop special circumstances argument on me. That's not fair.
* You still have to populate the array in the first place. A sane implementation wouldn't be keeping it in memory even when defrag is not running (no way is anybody going to tolerate even dozens of MB of memory overhead for this), so you're not going to get around the need to enumerate all the reflinks for a file at least once (during startup, or when starting to process that file), so you're just moving the overhead around instead of eliminating it.
Yes, when the defrag starts, the entire b-tree structure is examined in order for region-extents array and extents-backref associative array to be populated.
So your startup is going to take forever on any reasonably large volume. This isn't eliminating the overhead, it's just moving it all to one place. That might make it a bit more efficient than it would be interspersed throughout the operation, but only because it is reading all the relevant data at once.
No, the startup will not take forever.

Forever is subjective. Arrays with hundreds of GB of metadata are not unusual, and that's dozens of minutes of just reading data right at the beginning before even considering what to defragment.

I would encourage you to take a closer look at some of the performance issues quota groups face when doing a rescan, as they have to deal with this kind of reflink tracking too, and they take quite a while on any reasonably sized volume.

The startup needs exactly 1 (one) pass through the entire metadata. It needs this to find all the backlinks and to populate the "regions-extents" array. The time to do 1 pass through metadata depends on the metadata size on disk, as entire metadata has to be read out (one piece at a time, you won't keep it all in RAM). In most cases, the time to read the metadata will be less than 1 minute, on an SSD less than 20 seconds.
There is no way around it: to defrag, you eventually need to read all the b-trees, so nothing is lost there.
All computations in this defrag are simple. Finding all reflinks in metadata is simple. It is a single pass metadata read-out.
Of course, those two arrays exist only during defrag operation. When defrag completes, those arrays are deallocated.
It also allows a very real possibility of a user functionally delaying the defrag operation indefinitely (by triggering a continuous stream of operations that would cause reflink changes for a file being operated on by defrag) if not implemented very carefully.
Yes, if a user does something like that, the defrag can be paused or even aborted. That is normal.
Not really. Most defrag implementations either avoid files that could reasonably be written to, or freeze writes to the file they're operating on, or in some other way just sidestep the issue without delaying the defragmentation process.
There are many ways around this problem, but it really doesn't matter, those are just details. The initial version of defrag can just abort. The more mature versions of defrag can have a better handling of this problem.
Details like this are the deciding factor for whether something is sanely usable in certain use cases, as you have yourself found out (for a lot of users, the fact that defrag can unshare extents is 'just a detail' that's not worth worrying about).
I wouldn't agree there.
Not every issue is equal. Some issues are more important, some are trivial, some are tolerable etc...
The defrag is usually allowed to abort. It can easily be restarted later. Workaround: You can make a defrag-supervisor program, which starts a defrag, and if defrag aborts then it is restarted after some (configurable) amount of time.
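A defrag-supervisor along these lines could look roughly like the Python sketch below. It is deliberately abstract: `run_defrag` is a hypothetical callable that returns True when the defrag completes and False when it aborts, and the retry delay and attempt limit are made-up defaults, since no such tool or interface exists in the thread.

```python
import time


def supervise(run_defrag, retry_delay_s=600.0, max_attempts=5,
              sleep=time.sleep):
    """Run `run_defrag` until it succeeds or attempts are exhausted.

    Returns the attempt number that succeeded, or None if every
    attempt aborted. `sleep` is injectable so the wait can be stubbed.
    """
    for attempt in range(1, max_attempts + 1):
        if run_defrag():
            return attempt
        if attempt < max_attempts:
            sleep(retry_delay_s)  # wait the configured time before retrying
    return None


# Example with a stand-in defrag that aborts twice before completing.
outcomes = iter([False, False, True])
print(supervise(lambda: next(outcomes), sleep=lambda s: None))
```

A supervisor of this kind only restarts the operation. It does nothing about the case where the defrag is made to abort every time, which is the objection raised in the reply that follows.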
The fact that the defrag can be functionally deferred indefinitely by a user means that a user can, with a bit of effort, force degraded performance for everyone using the system. Aborting the defrag doesn't solve that, and it's a significant issue for anybody doing shared hosting.
This is a quality-of-implementation issue. Not worthy of consideration at this time. It can be solved.

Then solve it and be done with it, don't just punt it down the road.

You're the one trying to convince the developers to spend _their_ time implementing _your_ idea, so you need to provide enough detail to solve issues that are brought up about your idea.

You can go and pick this kind of stuff all the time, with any system. I mean, because of the FACT that we have never proven that all security holes are eliminated, the computers shouldn't be powered on at all. Therefore, all computers should be shut down immediately and then there is absolutely no need to continue working on the btrfs. It is also impossible to produce the btrfs defrag, because all computers have to be shut down immediately.
Can we have a bit more fair discussion? Please?

I would ask the same. I provided a concrete example of a demonstrable security issue with your proposed implementation that's trivial to verify without even going beyond the described behavior of the implementation. You then dismissed it as a non-issue and tried to explain why my legitimate security concern wasn't even worth trying to think about using an apagogical argument that's only tangentially related to my statement.


On the other hand, unsharing is not easy to get undone.

But, again, this just doesn't matter for some people.


So, those issues are not equal.
