Tomasz Chmielewski posted on Sun, 15 May 2016 19:24:47 +0900 as excerpted:
> I'm trying to read two large files in parallel from a 2-disk RAID-1
> btrfs setup (using kernel 4.5.3).
>
> According to iostat, one of the disks is 100% saturated, while the other
> disk is around 0% busy.
>
> Is it expected?
Depends. Btrfs redundancy-raid (raid1/10) has an unoptimized read
algorithm at this time, and parity-raid (raid5/6) remains new and
unstable in terms of parity-recovery and restriping after device loss,
so it isn't recommended except for testing. See below.
> With two readers from the same disk, each file is being read with ~50
> MB/s from disk (with just one reader from disk, the speed goes up to
> around ~150 MB/s).
>
> In md RAID, with many readers, it will try to distribute the reads,
> per the md manual at http://linux.die.net/man/4/md:
>
> Raid1 (...)
> Data is read from any one device. The driver attempts to distribute
> read requests across all devices to maximize performance.
Btrfs' current redundancy-raid read-scheduling algorithm is a pretty
basic, unoptimized even/odd-PID implementation at this point: each
reading task is assigned one copy or the other based on whether its PID
is even or odd. That's suitable for basic use, since a large enough
random set of read tasks will distribute roughly evenly across even and
odd PIDs, and it's well suited to testing, since it's simple and makes
it easy to force reads to just one side, the other, or both, by
arranging for all-even, all-odd, or mixed PIDs. But as you discovered,
it's nowhere near as well optimized as md redundancy-raid.
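For illustration, here's a minimal sketch of that selection logic in C.
The names are hypothetical, not the kernel's actual API (the real logic
lives in fs/btrfs/volumes.c), but the parity test is the essence of it:

    #include <sys/types.h>

    /* Hypothetical sketch of btrfs raid1 read-copy selection;
     * pick_mirror() is an illustrative name, not a kernel symbol. */
    static int pick_mirror(pid_t reader_pid, int num_copies)
    {
            /* num_copies is always 2 for current btrfs raid1/10 */
            return reader_pid % num_copies; /* 0 = even PIDs, 1 = odd */
    }

Which is also why your iostat shows one disk at 100% and the other near
0%: two readers whose PIDs are both even (or both odd) land on the same
device every time.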
Another difference between the two, favoring mdraid1, is that mdraid1
makes N redundant copies across N devices, while btrfs redundancy-raid
in all its forms (raid1/10, and dup on a single device) keeps exactly
two copies, no matter the number of devices. More devices simply give
you more capacity, not more copies: four 1 TB devices in btrfs raid1,
for instance, yield roughly 2 TB usable, still with only two copies of
any given chunk.
OTOH, for those concerned about data integrity, btrfs has one seriously
killer feature that mdraid lacks: btrfs checksums both data and
metadata and verifies the checksum on read-back. On redundancy-raid,
if the first copy fails verification it falls back to the second copy,
and rewrites the bad copy from the good one. One of the things that
distressed me about mdraid is that in all cases, redundancy and parity
alike, it never actually cross-checks either redundant copies or parity
in normal operation: if you get a bad copy and the hardware/firmware
level doesn't detect it, you get a bad copy, and mdraid is none the
wiser. Only during a scrub or device recovery does mdraid actually use
the parity or redundant copies, and even then, for a redundancy scrub,
it simply declares the first copy good, arbitrarily, and rewrites the
others from it if they differ.
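A rough sketch of that verify-and-repair read path, in C. Everything
here except the crc32c default is a hypothetical stand-in for the
kernel internals, just to make the control flow concrete:

    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Stand-ins for kernel internals (illustrative, not real symbols) */
    extern int read_copy(uint64_t block, int copy, void *buf); /* 0=ok */
    extern uint32_t crc32c_buf(const void *buf, size_t len);
    extern uint32_t stored_csum(uint64_t block);
    extern void rewrite_copy(uint64_t block, int copy, const void *buf);

    #define BLOCK_SIZE 4096
    #define NUM_COPIES 2   /* current btrfs raid1: always exactly two */

    int checked_read(uint64_t block, void *buf)
    {
            for (int copy = 0; copy < NUM_COPIES; copy++) {
                    if (read_copy(block, copy, buf) != 0)
                            continue;  /* I/O error: try the other copy */
                    if (crc32c_buf(buf, BLOCK_SIZE) != stored_csum(block))
                            continue;  /* csum mismatch: try the other */
                    if (copy > 0)
                            rewrite_copy(block, 0, buf); /* repair bad copy
                                                            from good one */
                    return 0;          /* verified-good data in buf */
            }
            return -EIO;               /* all copies bad: unrecoverable */
    }

The rewrite_copy() step is the part mdraid can't do in normal
operation: without a per-block checksum, it has no way to tell which
copy is the good one.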
What I'm actually wanting myself is this killer data-integrity
verification feature in combination with N-way mirroring, instead of
just the two-way that current btrfs offers. For me, N=3, three-way
mirroring, would be perfect: with just two-way mirroring, if one copy
is found invalid you'd better /hope/ the second one is good, while with
three-way there are still two fallbacks if one is bad. 4+-way would of
course be even better in that regard, but there's the practical side of
actually buying and housing the devices too, and three-way simply
happens to be my sweet-spot.
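To put rough numbers on that, assume each copy of a given block
independently turns up bad with some small probability p (real-world
failures are of course correlated, so treat this as a lower bound on
the real risk, not a promise):

    P(all copies bad) = p^N for N copies, so with p = 10^-3:
        2-way:  p^2 = 10^-6
        3-way:  p^3 = 10^-9
        4-way:  p^4 = 10^-12

Each additional copy buys roughly three more orders of magnitude at
that p, which is why three-way is my sweet-spot: a big win over
two-way, without buying and housing a fourth device.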
N-way-mirroring is on the roadmap for after parity-raid (the current
raid56), as it'll use some of the same code. However, parity-raid
ended up being rather more complex to implement properly alongside COW
and other btrfs features than the developers expected, so it took far
longer to complete than originally estimated, and as mentioned above
it's still not really stable, as a couple of known bugs remain that
affect restriping and recovery from a lost device. So N-way-mirroring
could be awhile, and if it follows the pattern of parity-raid, it'll be
awhile after that before it's reasonably stable. We're talking
years... but I'm still eagerly anticipating it.
Obviously, once N-way-mirroring gets in, the read-scheduling algorithm
will need revisiting anyway, because even/odd won't cut it when there
are three-plus copies to schedule across. That's when I'd expect some
optimization to occur, effectively as part of N-way-mirroring.
Meanwhile, I've argued before that the unoptimized read-scheduling of
btrfs raid1 remains a prime case-in-point for btrfs' overall stability
status, particularly when mdraid has a much better algorithm already
implemented in the same kernel. Developers tend to be very wary of
so-called premature optimization, where optimizing too early either
locks out otherwise viable extensions later, or forces throwing away
major sections of optimization code as the work is redone to account
for new extensions that don't fit the old optimized code.
That such prime examples as raid1 read-scheduling remain so
under-optimized suggests the developers themselves still consider btrfs
too early in its development for that optimization pass, and thus not
yet fully stable and mature.