On 2017-02-04 16:10, Kai Krakow wrote:
> On Sat, 04 Feb 2017 20:50:03 +0000, "Jorg Bornschein" <j...@capsec.org>
> wrote:
>> On February 4, 2017 1:07 AM, "Goldwyn Rodrigues" <rgold...@suse.de>
>> wrote:
>>> Yes, please check if disabling quotas makes a difference in
>>> execution time of btrfs balance.
>> Just FYI: With quotas disabled it took ~20h to finish the balance
>> instead of the projected >30 days. Therefore, in my case, there was a
>> speedup by a factor of ~35.
>>
>> And thanks for the quick reply! (and for btrfs in general!)
>>
>> BTW: I'm wondering how much sense it makes to activate the underlying
>> bcache for my raid1 fs again. I guess btrfs chooses randomly (or
>> based on predicted disk latency?) which copy of a given extent to
>> load?
> As far as I know, it currently uses only PID modulo: no round-robin,
> no random value. There are no performance optimizations going into
> btrfs yet because there is still a lot of ongoing feature work.
>
> I think there were patches to include a rotator value in the stripe
> selection, but they don't apply to the current kernel. I tried it once
> and didn't see any subjective difference for normal desktop workloads,
> but that's probably because I use RAID1 for metadata only.

I had tested similar patches myself using raid1 for everything, and saw
near zero improvement unless I explicitly tried to create a worst-case
performance situation. The reality is that the current algorithm is
remarkably close to optimal for most use cases while using an insanely
small amount of processing power and memory compared to an optimal
algorithm (and a truly optimal algorithm is in fact functionally
impossible in almost all cases anyway, because it would require
predicting the future).
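
For reference, the read-copy choice Kai describes above boils down to
roughly the following (a simplified user-space sketch of the PID-modulo
selection, not the actual kernel code):

    import os

    def pick_mirror(num_mirrors):
        """Pick which copy of a raid1 block group to read.

        Simplified model of the current btrfs behavior: the reader's PID
        modulo the number of copies decides, so a given process always
        hits the same copy.  No latency or queue-depth feedback at all.
        """
        return os.getpid() % num_mirrors

    # A raid1 block group has 2 copies; every read issued by this process
    # goes to the same one.
    print("this process reads copy", pick_mirror(2))
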
> MDRAID uses stripe selection based on latency and other measurements
> (like head position). It would be nice if btrfs implemented similar
> functionality. This would also be helpful for selecting a disk when
> there are more disks than stripe sets (for example, I have 3 disks in
> my btrfs array): new blocks could always be written to the most idle
> disk. I think this wasn't covered by the above-mentioned patch.
> Currently, selection is based only on the disk with the most free
> space.

You're confusing read selection and write selection. MDADM and DM-RAID
both use a load-balancing read selection algorithm that takes latency
and other factors into account. However, they use a round-robin write
selection algorithm that only cares about the position of the block in
the virtual device modulo the number of physical devices.
As an example, say you have a 3-disk RAID10 array set up using MDADM
(this is functionally the same as a 3-disk raid1-mode BTRFS filesystem).
Every third block starting from block 0 will be on disks 1 and 2,
every third block starting from block 1 will be on disks 3 and 1, and
every third block starting from block 2 will be on disks 2 and 3. No
latency measurements are taken, literally nothing is factored in except
the block's position in the virtual device.
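
Here is a toy model of that placement rule (assuming md's raid10 "near"
layout with 2 copies; the device numbering and parameters are only for
illustration):

    def md_raid10_near2_devices(block, num_devices=3, copies=2):
        """Return the devices holding each copy of a logical block.

        Toy model of md's raid10 "near" layout: copies of consecutive
        blocks are laid out round-robin across the devices, so placement
        depends only on the block number, never on device load.
        """
        first = (block * copies) % num_devices
        return [(first + i) % num_devices for i in range(copies)]

    for b in range(6):
        devs = [d + 1 for d in md_raid10_near2_devices(b)]  # 1-indexed
        print("block", b, "-> devices", devs)
    # block 0 -> devices [1, 2]
    # block 1 -> devices [3, 1]
    # block 2 -> devices [2, 3]  ...and the pattern repeats every 3 blocks.
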
Now, that said, BTRFS does behave differently under the same
circumstances, but this is because the striping is different for BTRFS.
It happens at the chunk level instead of the block level. If we look at
an example using the same 3 devices as the MDADM example, and then for
simplicity assume that you end up allocating alternating data and
metadata chunks, things might look a bit like this:
* System chunk: Device 1 and 2
* Metadata chunk 0: Device 3 and 1
* Data chunk 0: Device 2 and 3
* Metadata chunk 1: Device 1 and 2
* Data chunk 1: Device 1 and 2
Overall, there is technically a pattern, but it has a very long
repetition period. This is still, however, a near-optimal allocation
pattern given the constraints. It also gives (just like the MDADM and
DM-RAID method) 100% deterministic behavior; the only difference is
that it depends on a slightly different factor. Changing this to select the
most idle disk as you suggest would remove that determinism, increase
the likelihood of sub-optimal layouts in terms of space usage, increase
the number of cases where you could get ENOSPC, and provide near zero
net performance benefit except under heavy load. IOW, it would provide
a pretty negative net benefit.
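
To make that determinism concrete, here is a rough model of the "two
devices with the most free space" selection described above (chunk
sizes, tie-breaking, and device numbering are assumptions of this
sketch, not necessarily what btrfs actually does):

    def allocate_raid1_chunk(free_space, chunk_size):
        """Place one raid1 chunk on the two devices with the most free space.

        free_space: dict of device id -> bytes free.  Deterministic: the
        same state always yields the same pair.
        """
        candidates = sorted(free_space, key=lambda d: (-free_space[d], d))
        chosen = candidates[:2]
        for dev in chosen:
            free_space[dev] -= chunk_size
        return chosen

    # Three equally sized, empty devices and equally sized chunks give a
    # short cycle: (1, 2), (3, 1), (2, 3), ...  Real metadata and data
    # chunks differ in size, which is why the on-disk pattern has a much
    # longer repetition period.
    free = {1: 100, 2: 100, 3: 100}
    for i in range(5):
        print("chunk", i, "->", allocate_raid1_chunk(free, 10))
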
What actually needs to happen to improve write performance is that BTRFS
needs to quit serializing writes when writing chunks across multiple
devices. In the case of a raid1 setup, it writes first to one device,
then the other, alternating back and forth as it updates each extent.
This, combined with the write amplification caused by COW, is what
makes write performance so horrible for BTRFS compared to MDADM or
DM-RAID. It's not that we have bad device selection for writes; it's
that we don't even try to do any kind of practical parallelization
despite it being an embarrassingly parallel task (and yes, that
seriously is what something that's trivial to parallelize is called in
scientific papers...).
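
Purely as an illustration of that point (a user-space sketch with
simulated latencies, nothing to do with the actual kernel I/O paths),
the difference is roughly between these two submission patterns:

    import concurrent.futures
    import time

    def write_copy(device, data, latency=0.05):
        """Stand-in for submitting one mirror copy of an extent to a device."""
        time.sleep(latency)   # pretend this is the device's write latency
        return len(data)

    def write_raid1_serialized(devices, data):
        # Roughly the current behavior: one mirror at a time, so the total
        # time is the sum of the per-device latencies.
        return [write_copy(dev, data) for dev in devices]

    def write_raid1_parallel(devices, data):
        # The "embarrassingly parallel" version: both mirror writes are in
        # flight at once, so the total time is that of the slower device.
        with concurrent.futures.ThreadPoolExecutor() as pool:
            return list(pool.map(lambda dev: write_copy(dev, data), devices))

    for fn in (write_raid1_serialized, write_raid1_parallel):
        start = time.monotonic()
        fn(["/dev/sdx", "/dev/sdy"], b"extent payload")
        print(fn.__name__, "took", round(time.monotonic() - start, 3), "s")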