On 2017-02-04 16:10, Kai Krakow wrote:
> On Sat, 04 Feb 2017 20:50:03 +0000, "Jorg Bornschein" <j...@capsec.org>
> wrote:
>> On February 4, 2017 1:07 AM, "Goldwyn Rodrigues" <rgold...@suse.de>
>> wrote:
>>> Yes, please check if disabling quotas makes a difference in
>>> execution time of btrfs balance.
>> Just FYI: With quotas disabled it took ~20h to finish the balance
>> instead of the projected >30 days. Therefore, in my case, there was a
>> speedup by a factor of ~35.
>>
>> And thanks for the quick reply! (and for btrfs in general!)
>>
>> BTW: I'm wondering how much sense it makes to activate the underlying
>> bcache for my raid1 fs again. I guess btrfs chooses randomly (or
>> based on predicted disk latency?) which copy of a given extent to
>> load?
> As far as I know, it currently uses only PID modulo: no round-robin,
> no random value. There are no performance optimizations going into
> btrfs yet because there is still a lot of ongoing feature work.
>
> I think there were patches to include a rotator value in the stripe
> selection, but they don't apply to the current kernel. I tried it once
> and didn't see any subjective difference for normal desktop workloads,
> but that's probably because I use RAID1 for metadata only.

I had tested similar patches myself using raid1 for everything, and saw
near zero improvement unless I explicitly tried to create a worst-case
performance situation. The reality is that the current algorithm is
remarkably close to optimal for most use cases while using an insanely
small amount of processing power and memory compared to an optimal
algorithm (and a truly optimal algorithm is in fact functionally
impossible in almost all cases anyway, because it would require
predicting the future).
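
For reference, the read-copy choice Kai describes above boils down to
roughly the following (a simplified user-space sketch of the PID-modulo
selection, not the actual kernel code):

    import os

    def pick_mirror(num_mirrors):
        """Pick which copy of a raid1 block group to read.

        Simplified model of the current btrfs behavior: the reader's PID
        modulo the number of copies decides, so a given process always
        hits the same copy.  No latency or queue-depth feedback at all.
        """
        return os.getpid() % num_mirrors

    # A raid1 block group has 2 copies; every read issued by this process
    # goes to the same one.
    print("this process reads copy", pick_mirror(2))
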
> MDRAID uses stripe selection based on latency and other measurements
> (like head position). It would be nice if btrfs implemented similar
> functionality. This would also be helpful for selecting a disk when
> there are more disks than stripe sets (for example, I have 3 disks in
> my btrfs array): new blocks could always be written to the most idle
> disk. I think this wasn't covered by the above-mentioned patch.
> Currently, selection is based only on the disk with the most free
> space.

You're confusing read selection and write selection. MDADM and DM-RAID
both use a load-balancing read selection algorithm that takes latency
and other factors into account. However, they use a round-robin write
selection algorithm that only cares about the position of the block in
the virtual device modulo the number of physical devices.
As an example, say you have a 3-disk RAID10 array set up using MDADM
(this is functionally the same as a 3-disk raid1-mode BTRFS filesystem).
Every third block starting from block 0 will be on disks 1 and 2,
every third block starting from block 1 will be on disks 3 and 1, and
every third block starting from block 2 will be on disks 2 and 3. No
latency measurements are taken, literally nothing is factored in except
the block's position in the virtual device.
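
Here is a toy model of that placement rule (assuming md's raid10 "near"
layout with 2 copies; the device numbering and parameters are only for
illustration):

    def md_raid10_near2_devices(block, num_devices=3, copies=2):
        """Return the devices holding each copy of a logical block.

        Toy model of md's raid10 "near" layout: copies of consecutive
        blocks are laid out round-robin across the devices, so placement
        depends only on the block number, never on device load.
        """
        first = (block * copies) % num_devices
        return [(first + i) % num_devices for i in range(copies)]

    for b in range(6):
        devs = [d + 1 for d in md_raid10_near2_devices(b)]  # 1-indexed
        print("block", b, "-> devices", devs)
    # block 0 -> devices [1, 2]
    # block 1 -> devices [3, 1]
    # block 2 -> devices [2, 3]  ...and the pattern repeats every 3 blocks.
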
Now, that said, BTRFS does behave differently under the same
circumstances, but this is because the striping is different for BTRFS.
It happens at the chunk level instead of the block level. If we look at
an example using the same 3 devices as the MDADM example, and then for
simplicity assume that you end up allocating alternating data and
metadata chunks, things might look a bit like this:
* System chunk: Device 1 and 2
* Metadata chunk 0: Device 3 and 1
* Data chunk 0: Device 2 and 3
* Metadata chunk 1: Device 1 and 2
* Data chunk 1: Device 1 and 2
Overall, there is technically a pattern, but it has a very long
repetition period. This is still, however, a near-optimal allocation
pattern given the constraints. It also gives (just like the MDADM and
DM-RAID method) 100% deterministic behavior; the only difference is
that it depends on a slightly different factor. Changing this to select the
most idle disk as you suggest would remove that determinism, increase
the likelihood of sub-optimal layouts in terms of space usage, increase
the number of cases where you could get ENOSPC, and provide near zero
net performance benefit except under heavy load. IOW, it would provide
a pretty negative net benefit.
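
To make that determinism concrete, here is a rough model of the "two
devices with the most free space" selection described above (chunk
sizes, tie-breaking, and device numbering are assumptions of this
sketch, not necessarily what btrfs actually does):

    def allocate_raid1_chunk(free_space, chunk_size):
        """Place one raid1 chunk on the two devices with the most free space.

        free_space: dict of device id -> bytes free.  Deterministic: the
        same state always yields the same pair.
        """
        candidates = sorted(free_space, key=lambda d: (-free_space[d], d))
        chosen = candidates[:2]
        for dev in chosen:
            free_space[dev] -= chunk_size
        return chosen

    # Three equally sized, empty devices and equally sized chunks give a
    # short cycle: (1, 2), (3, 1), (2, 3), ...  Real metadata and data
    # chunks differ in size, which is why the on-disk pattern has a much
    # longer repetition period.
    free = {1: 100, 2: 100, 3: 100}
    for i in range(5):
        print("chunk", i, "->", allocate_raid1_chunk(free, 10))
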
What actually needs to happen to improve write performance is that BTRFS
needs to quit serializing writes when writing chunks across multiple
devices. In the case of a raid1 setup, it writes first to one device,
then the other, alternating back and forth as it updates each extent.
This, combined with the write amplification caused by COW, is what
makes write performance so horrible for BTRFS compared to MDADM or
DM-RAID. It's not that we have bad device selection for writes; it's
that we don't even try to do any kind of practical parallelization
despite it being an embarrassingly parallel task (and yes, that
seriously is what something that's trivial to parallelize is called in
scientific papers...).
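
Purely as an illustration of that point (a user-space sketch with
simulated latencies, nothing to do with the actual kernel I/O paths),
the difference is roughly between these two submission patterns:

    import concurrent.futures
    import time

    def write_copy(device, data, latency=0.05):
        """Stand-in for submitting one mirror copy of an extent to a device."""
        time.sleep(latency)   # pretend this is the device's write latency
        return len(data)

    def write_raid1_serialized(devices, data):
        # Roughly the current behavior: one mirror at a time, so the total
        # time is the sum of the per-device latencies.
        return [write_copy(dev, data) for dev in devices]

    def write_raid1_parallel(devices, data):
        # The "embarrassingly parallel" version: both mirror writes are in
        # flight at once, so the total time is that of the slower device.
        with concurrent.futures.ThreadPoolExecutor() as pool:
            return list(pool.map(lambda dev: write_copy(dev, data), devices))

    for fn in (write_raid1_serialized, write_raid1_parallel):
        start = time.monotonic()
        fn(["/dev/sdx", "/dev/sdy"], b"extent payload")
        print(fn.__name__, "took", round(time.monotonic() - start, 3), "s")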