On 2019-09-13 12:54, General Zed wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-12 18:21, General Zed wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-12 15:18, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-11 17:37, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-11 13:20, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
On 2019-09-10 19:32, webmas...@zedlx.com wrote:
Quoting "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
Given this, defrag isn't willfully unsharing anything, it's
just a side-effect of how it works (since it's rewriting the
block layout of the file in-place).
The current defrag has to unshare because, as you said, because
it is unaware of the full reflink structure. If it doesn't know
about all reflinks, it has to unshare, there is no way around
that.
Now factor in that _any_ write will result in unsharing the
region being written to, rounded to the nearest full
filesystem block in both directions (this is mandatory, it's a
side effect of the copy-on-write nature of BTRFS, and is why
files that experience heavy internal rewrites get fragmented
very heavily and very quickly on BTRFS).
You mean: when defrag performs a write, the new data is
unshared because every write is unshared? Really?
Consider there is an extent E55 shared by two files A and B.
The defrag has to move E55 to another location. In order to do
that, defrag creates a new extent E70. It makes it belong to
file A by changing the reflink of extent E55 in file A to point
to E70.
Now, to retain the original sharing structure, the defrag has
to change the reflink of extent E55 in file B to point to E70.
You are telling me this is not possible? Bullshit!
Please explain to me how this 'defrag has to unshare' story of
yours isn't an intentional attempt to mislead me.
As mentioned in the previous email, we actually did have a
(mostly) working reflink-aware defrag a few years back. It got
removed because it had serious performance issues. Note that
we're not talking a few seconds of extra time to defrag a full
tree here, we're talking double-digit _minutes_ of extra time to
defrag a moderate sized (low triple digit GB) subvolume with
dozens of snapshots, _if you were lucky_ (if you weren't, you
would be looking at potentially multiple _hours_ of runtime for
the defrag). The performance scaled inversely proportionate to
the number of reflinks involved and the total amount of data in
the subvolume being defragmented, and was pretty bad even in the
case of only a couple of snapshots.
You cannot ever make the worst program, because an even worse
program can be made by slowing down the original by a factor of 2.
So, you had a badly implemented defrag. At least you got some
experience. Let's see what went wrong.
Ultimately, there are a couple of issues at play here:
* Online defrag has to maintain consistency during operation.
The current implementation does this by rewriting the regions
being defragmented (which causes them to become a single new
extent (most of the time)), which avoids a whole lot of
otherwise complicated logic required to make sure things happen
correctly, and also means that only the file being operated on
is impacted and only the parts being modified need to be
protected against concurrent writes. Properly handling reflinks
means that _every_ file that shares some part of an extent with
the file being operated on needs to have the reflinked regions
locked for the defrag operation, which has a huge impact on
performance. Using your example, the update to E55 in both files
A and B has to happen as part of the same commit, which can
contain no other writes in that region of the file, otherwise
you run the risk of losing writes to file B that occur while
file A is being defragmented.
Nah. I think there is a workaround. You can first (atomically)
update A, then whatever, then you can update B later. I know,
your yelling "what if E55 gets updated in B". Doesn't matter. The
defrag continues later by searching for reflink to E55 in B. Then
it checks the data contained in E55. If the data matches the E70,
then it can safely update the reflink in B. Or the defrag can
just verify that neither E55 nor E70 have been written to in the
meantime. That means they still have the same data.
So, IOW, you don't care if the total space used by the data is
instantaneously larger than what you started with? That seems to
be at odds with your previous statements, but OK, if we allow for
that then this is indeed a non-issue.
It is normal and common for defrag operation to use some disk space
while it is running. I estimate that a reasonable limit would be to
use up to 1% of total partition size. So, if a partition size is
100 GB, the defrag can use 1 GB. Lets call this "defrag operation
space".
The defrag should, when started, verify that there is "sufficient
free space" on the partition. In the case that there is no
sufficient free space, the defrag should output the message to the
user and abort. The size of "sufficient free space" must be larger
than the "defrag operation space". I would estimate that a good
limit would be 2% of the partition size. "defrag operation space"
is a part of "sufficient free space" while defrag operation is in
progress.
If, during defrag operation, sufficient free space drops below 2%,
the defrag should output a message and abort. Another possibility
is for defrag to pause until the user frees some disk space, but
this is not common in other defrag implementations AFAIK.
It's not horrible when it's just a small region in two files,
but it becomes a big issue when dealing with lots of files
and/or particularly large extents (extents in BTRFS can get into
the GB range in terms of size when dealing with really big files).
You must just split large extents in a smart way. So, in the
beginning, the defrag can split large extents (2GB) into smaller
ones (32MB) to facilitate more responsive and easier defrag.
If you have lots of files, update them one-by one. It is
possible. Or you can update in big batches. Whatever is faster.
Neither will solve this though. Large numbers of files are an
issue because the operation is expensive and has to be done on
each file, not because the number of files somehow makes the
operation more espensive. It's O(n) relative to files, not higher
time complexity.
I would say that updating in big batches helps a lot, to the point
that it gets almost as fast as defragging any other file system.
What defrag needs to do is to write a big bunch of defragged file
(data) extents to the disk, and then update the b-trees. What
happens is that many of the updates to the b-trees would fall into
the same disk sector/extent, so instead of many writes there will
be just one write.
Here is the general outline for implementation:
- write a big bunch of defragged file extents to disk
- a minimal set of updates of the b-trees that cannot be
delayed is performed (this is nothing or almost nothing in most
circumstances)
- put the rest of required updates of b-trees into "pending
operations buffer"
- analyze the "pending operations buffer", and find out
(approximately) the biggest part of it that can be flushed out by
doing minimal number of disk writes
- flush out that part of "pending operations buffer"
- repeat
It helps, but you still can't get around having to recompute the new
tree state, and that is going to take time proportionate to the
number of nodes that need to change, which in turn is proportionate
to the number of files.
Yes, but that is just a computation. The defrag performance mostly
depends on minimizing disk I/O operations, not on computations.
You're assuming the defrag is being done on a system that's otherwise
perfectly idle. In the real world, that rarely, if ever, will be the
case, The system may be doing other things at the same time, and the
more computation the defrag operation has to do, the more likely it is
to negatively impact those other things.
No, I'm not assuming that the system is perfectly idle. I'm assuming
that the required computations don't take much CPU time, like it is
common in a well implemented defrag.
Which also usually doesn't have to do anywhere near as much computation
as is needed here.
In the past many good and fast defrag computation algorithms have
been produced, and I don't see any reason why this project wouldn't
be also able to create such a good algorithm.
Because it's not just the new extent locations you have to compute,
you also need to compute the resultant metadata tree state, and the
resultant extent tree state, and after all of that the resultant
checksum tree state. Yeah, figuring out optimal block layouts is
solved, but you can't get around the overhead of recomputing the new
tree state and all the block checksums for it.
The current defrag has to deal with this too, but it doesn't need to
do as much computation because it's not worried about preserving
reflinks (and therefore defragmenting a single file won't require
updates to any other files).
Yes, the defrag algorithm needs to compute the new tree state. However,
it shouldn't be slow at all. All operations on b-trees can be done in at
most N*logN time, which is sufficiently fast. There is no operation
there that I can think of that takes N*N or N*M time. So, it should all
take little CPU time. Essentially a non-issue.
The ONLY concern that causes N*M time is the presence of sharing. But,
even this is unfair, as the computation time will still be N*logN with
regards to the total number of reflinks. That is still fast, even for
100 GB metadata with a billion reflinks.
I don't understand why do you think that recomputing the new tree state
must be slow. Even if there are a 100 new tree states that need to be
recomputed, there is still no problem. Each metadata update will change
only a small portion of b-trees, so the complexity and size of b-trees
should not seriously affect the computation time.
Well, let's start with the checksum computations which then need to
happen for each block that would be written, which can't be faster than
O(n).
Yes, the structural overhead of the b-trees isn't bad by itself, but you
have multiple trees that need to be updated in sequence (that is, you
have to update one, then update the next based on that one, then update
another based on both of the previous two, etc) and a number of other
bits of data involved that need to be updated as part of the b-tree
update which have worse time complexity than computing the structural
changes to the b-trees.
The point is that the defrag can keep a buffer of a "pending
operations". Pending operations are those that should be
performed in order to keep the original sharing structure. If the
defrag gets interrupted, then files in "pending operations" will
be unshared. But this should really be some important and urgent
interrupt, as the "pending operations" buffer needs at most a
second or two to complete its operations.
Depending on the exact situation, it can take well more than a few
seconds to complete stuff. Especially if there are lots of reflinks.
Nope. You are quite wrong there.
In the worst case, the "pending operations buffer" will update
(write to disk) all the b-trees. So, the upper limit on time to
flush the "pending operations buffer" equals the time to write the
entire b-tree structure to the disk (into new extents). I estimate
that takes at most a few seconds.
So what you're talking about is journaling the computed state of
defrag operations. That shouldn't be too bad (as long as it's done
in memory instead of on-disk) if you batch the computations
properly. I thought you meant having a buffer of what operations to
do, and then computing them on-the-fly (which would have significant
overhead)
Looks close to what I was thinking. Soon we might be able to
communicate. I'm not sure what you mean by "journaling the computed
state of defrag operations". Maybe it doesn't matter.
Essentially, doing a write-ahead log of pending operations.
Journaling is just the common term for such things when dealing with
Linux filesystems because of ext* and XFS. Based on what you say
below, it sounds like we're on the same page here other than the
terminology.
What happens is that file (extent) data is first written to disk
(defragmented), but b-tree is not immediately updated. It doesn't
have to be. Even if there is a power loss, nothing happens.
So, the changes that should be done to the b-trees are put into
pending-operations-buffer. When a lot of file (extent) data is
written to disk, such that defrag-operation-space (1 GB) is close to
being exhausted, the pending-operations-buffer is examined in order
to attempt to free as much of defrag-operation-space as possible. The
simplest algorithm is to flush the entire pending-operations-buffer
at once. This reduces the number of writes that update the b-trees
because many changes to the b-trees fall into the same or
neighbouring disk sectors.
* Reflinks can reference partial extents. This means,
ultimately, that you may end up having to split extents in odd
ways during defrag if you want to preserve reflinks, and might
have to split extents _elsewhere_ that are only tangentially
related to the region being defragmented. See the example in my
previous email for a case like this, maintaining the shared
regions as being shared when you defragment either file to a
single extent will require splitting extents in the other file
(in either case, whichever file you don't defragment to a single
extent will end up having 7 extents if you try to force the one
that's been defragmented to be the canonical version). Once you
consider that a given extent can have multiple ranges reflinked
from multiple other locations, it gets even more complicated.
I think that this problem can be solved, and that it can be
solved perfectly (the result is a perfectly-defragmented file).
But, if it is so hard to do, just skip those problematic extents
in initial version of defrag.
Ultimately, in the super-duper defrag, those partially-referenced
extents should be split up by defrag.
* If you choose to just not handle the above point by not
letting defrag split extents, you put a hard lower limit on the
amount of fragmentation present in a file if you want to
preserve reflinks. IOW, you can't defragment files past a
certain point. If we go this way, neither of the two files in
the example from my previous email could be defragmented any
further than they already are, because doing so would require
splitting extents.
Oh, you're reading my thoughts. That's good.
Initial implementation of defrag might be not-so-perfect. It
would still be better than the current defrag.
This is not a one-way street. Handling of partially-used extents
can be improved in later versions.
* Determining all the reflinks to a given region of a given
extent is not a cheap operation, and the information may
immediately be stale (because an operation right after you fetch
the info might change things). We could work around this by
locking the extent somehow, but doing so would be expensive
because you would have to hold the lock for the entire defrag
operation.
No. DO NOT LOCK TO RETRIEVE REFLINKS.
Instead, you have to create a hook in every function that updates
the reflink structure or extents (for exaple, write-to-file
operation). So, when a reflink gets changed, the defrag is
immediately notified about this. That way the defrag can keep its
data about reflinks in-sync with the filesystem.
This doesn't get around the fact that it's still an expensive
operation to enumerate all the reflinks for a given region of a
file or extent.
No, you are wrong.
In order to enumerate all the reflinks in a region, the defrag
needs to have another array, which is also kept in memory and in
sync with the filesystem. It is the easiest to divide the disk into
regions of equal size, where each region is a few MB large. Lets
call this array "regions-to-extents" array. This array doesn't need
to be associative, it is a plain array.
This in-memory array links regions of disk to extents that are in
the region. The array in initialized when defrag starts.
This array makes the operation of finding all extents of a region
extremely fast.
That has two issues:
* That's going to be a _lot_ of memory. You still need to be able
to defragment big (dozens plus TB) arrays without needing multiple
GB of RAM just for the defrag operation, otherwise it's not
realistically useful (remember, it was big arrays that had issues
with the old reflink-aware defrag too).
Ok, but let's get some calculations there. If regions are 4 MB in
size, the region-extents array for an 8 TB partition would have 2
million entries. If entries average 64 bytes, that would be:
- a total of 128 MB memory for an 8 TB partition.
Of course, I'm guessing a lot of numbers there, but it should be doable.
Even if we assume such an optimistic estimation as you provide (I
suspect it will require more than 64 bytes per-entry), that's a lot of
RAM when you look at what it's potentially displacing. That's enough
RAM for receive and transmit buffers for a few hundred thousand
network connections, or for caching multiple hundreds of thousands of
dentries, or a few hundred thousand inodes. Hell, that's enough RAM
to run all the standard network services for a small network (DHCP,
DNS, NTP, TFTP, mDNS relay, UPnP/NAT-PMP, SNMP, IGMP proxy, VPN of
your choice) at least twice over.
That depends on the average size of an extent. If the average size of an
extent is around 4 MB, than my numbers should be good. Do you have any
data which would suggest that my estimate is wrong? What's the average
size of an extent on your filesystems (used space divided by number of
extents)?
Depends on what filesystem.
Worst case I have (which I monitor regularly, so I actually have good
aggregate data on the actual distribution of extent sizes) is used for
backed storage for virtual machine disk images. The arithmetic mean
extent size is just barely over 32k, but the median is actually closer
to 48k, with the 10th percentile at 4k and the 90th percentile at just
over 2M. From what I can tell, this is a pretty typical distribution
for this type of usage (high frequency small writes internal to existing
files) on BTRFS.
Typical usage on most of my systems when dealing with data sets that
include reflinks shows a theoretical average extent size of about 1M,
though I suspect the 50th percentile to be a little bit higher than that
(I don't regularly check any of those, but the times I have the 50th
percentile has been just a bit than the arithmetic mean, which makes
sense given that I have a lot more small files than large ones).
It might be normal on some systems to have larger extents than this, but
I somewhat doubt that that will be the case for many potential users.
This "regions-to-extents" array can be further optimized if necessary.
You are not thinking correctly there (misplaced priorities). If the
system needs to be defragmented, that's the priority. You can't do
comparisons like that, that's unfair debating.
The defrag that I'm proposing should be able to run within common memory
limits of today's computer systems. So, it will likely take somewhat
less than 700 MB of RAM in most common situations, including the small
servers. They all have 700 MB RAM.
700 MB is a lot for a defrag, but there is no way around it. Btrfs is
simply a filesystem with such complexity that a good defrag requires a
lot of RAM to operate.
If, for some reason, you would like to cover a use-case with constrained
RAM conditions, then that is an entirely different concern for a
different project. You can't make a project like this to cover ALL the
possible circumstances. Some cases have to be left out. Here we are
talking about a defrag that is usable in a general and common set of
circumstances.
Memory constrained systems come up as a point of discussion pretty
regularly when dealing with BTRFS, so they're obviously something users
actually care about. You have to keep in mind that it's not unusual in
a consumer NAS system to have less than 4GB of RAM, but have arrays well
into double digit terabytes in size. Even 128MB of RAM needing to be
used for defrag is trashing _a lot_ of cache on such a system.
Please, don't drop special circumstances argument on me. That's not fair.
* You still have to populate the array in the first place. A sane
implementation wouldn't be keeping it in memory even when defrag is
not running (no way is anybody going to tolerate even dozens of MB
of memory overhead for this), so you're not going to get around the
need to enumerate all the reflinks for a file at least once (during
startup, or when starting to process that file), so you're just
moving the overhead around instead of eliminating it.
Yes, when the defrag starts, the entire b-tree structure is examined
in order for region-extents array and extents-backref associative
array to be populated.
So your startup is going to take forever on any reasonably large
volume. This isn't eliminating the overhead, it's just moving it all
to one place. That might make it a bit more efficient than it would
be interspersed throughout the operation, but only because it is
reading all the relevant data at once.
No, the startup will not take forever.
Forever is subjective. Arrays with hundreds of GB of metadata are not
unusual, and that's dozens of minutes of just reading data right at the
beginning before even considering what to defragment.
I would encourage you to take a closer look at some of the performance
issues quota groups face when doing a rescan, as they have to deal with
this kind of reflink tracking too, and they take quite a while on any
reasonably sized volume.
The startup needs exactly 1 (one) pass through the entire metadata. It
needs this to find all the backlinks and to populate the
"regios-extents" array. The time to do 1 pass through metadata depends
on the metadata size on disk, as entire metadata has to be read out (one
piece at a time, you won't keep it all in RAM). In most cases, the
time-to read the metadata will be less than 1 minute, on an SSD less
than 20 seconds.
There is no way around it: to defrag, you eventually need to read all
the b-trees, so nothing is lost there.
All computations in this defrag are simple. Finding all refliks in
metadata is simple. It is a single pass metadata read-out.
Of course, those two arrays exist only during defrag operation. When
defrag completes, those arrays are deallocated.
It also allows a very real possibility of a user functionally
delaying the defrag operation indefinitely (by triggering a
continuous stream of operations that would cause reflink changes
for a file being operated on by defrag) if not implemented very
carefully.
Yes, if a user does something like that, the defrag can be paused
or even aborted. That is normal.
Not really. Most defrag implementations either avoid files that
could reasonably be written to, or freeze writes to the file they're
operating on, or in some other way just sidestep the issue without
delaying the defragmentation process.
There are many ways around this problem, but it really doesn't
matter, those are just details. The initial version of defrag can
just abort. The more mature versions of defrag can have a better
handling of this problem.
Details like this are the deciding factor for whether something is
sanely usable in certain use cases, as you have yourself found out
(for a lot of users, the fact that defrag can unshare extents is
'just a detail' that's not worth worrying about).
I wouldn't agree there.
Not every issue is equal. Some issues are more important, some are
trivial, some are tolerable etc...
The defrag is usually allowed to abort. It can easily be restarted
later. Workaround: You can make a defrag-supervisor program, which
starts a defrag, and if defrag aborts then it is restarted after some
(configurable) amount of time.
The fact that the defrag can be functionally deferred indefinitely by
a user means that a user can, with a bit of effort, force degraded
performance for everyone using the system. Aborting the defrag
doesn't solve that, and it's a significant issue for anybody doing
shared hosting.
This is a quality-of-implementation issue. Not worthy of consideration
at this time. It can be solved.Then solve it and be done with it, don't just punt it down the road.
You're the one trying to convince the developers to spend _their_ time
implementing _your_ idea, so you need to provide enough detail to solve
issues that are brought up about your idea.
You can go and pick this kind of stuff all the time, with any system. I
mean, because of the FACT that we have never proven that all security
holes are eliminated, the computers shouldn't be powered on at all.
Therefore, all computers should be shut down immediately and then there
is absolutely no need to continue working on the btrfs. It is also
impossible to produce the btrfs defrag, because all computers have to be
shut down immediately.
Can we have a bit more fair discussion? Please?
I would ask the same, I provided a concrete example of a demonstrable
security issue with your proposed implementation that's trivial to
verify without even going beyond the described behavior of the
implementation. You then dismissed at as a non-issue and tried to
explain why my legitimate security concern wasn't even worth trying to
think about using apagogical argument that's only tangentially related
to my statement.
On the other hand, unsharing is not easy to get undone.
But, again, it this just doesn't matter for some people.
So, those issues are not equals.