Re: btrfs RAID-1 vs md RAID-1?

2016-05-15 Thread Kai Krakow
On Sun, 15 May 2016 19:24:47 +0900,
Tomasz Chmielewski wrote:

> I'm trying to read two large files in parallel from a 2-disk RAID-1 
> btrfs setup (using kernel 4.5.3).
> 
> According to iostat, one of the disks is 100% saturated, while the
> other disk is around 0% busy.
> 
> Is it expected?
> 
> With two readers from the same disk, each file is being read with ~50 
> MB/s from disk (with just one reader from disk, the speed goes up to 
> around ~150 MB/s).
> 
> 
> In md RAID, with many readers, it will try to distribute the reads, 
> per the md manual at http://linux.die.net/man/4/md:
> 
>  Raid1
>  (...)
>  Data is read from any one device. The driver attempts to distribute
>  read requests across all devices to maximise performance.
> 
>  Raid5
>  (...)
>  This also allows more parallelism when reading, as read requests are
>  distributed over all the devices in the array instead of all but one.
> 
> 
> Are there any plans to improve this in btrfs?
> 
> 
> Tomasz Chmielewski
> http://wpkg.org

Here is an idea that could use some improvement:
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/17985


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs RAID-1 vs md RAID-1?

2016-05-15 Thread Duncan
Tomasz Chmielewski posted on Sun, 15 May 2016 19:24:47 +0900 as excerpted:

> I'm trying to read two large files in parallel from a 2-disk RAID-1
> btrfs setup (using kernel 4.5.3).
> 
> According to iostat, one of the disks is 100% saturated, while the other
> disk is around 0% busy.
> 
> Is it expected?

Depends.  Btrfs redundancy-raid (raid1/10) has an unoptimized read 
algorithm at this time (and parity-raid, raid5/6, remains new and 
unstable in terms of parity-recovery and restriping after device loss, 
so isn't recommended except for testing).  See below.

> With two readers from the same disk, each file is being read with ~50
> MB/s from disk (with just one reader from disk, the speed goes up to
> around ~150 MB/s).
> 
> In md RAID, with many readers, it will try to distribute the reads, 
> per the md manual at http://linux.die.net/man/4/md:
> 
>  Raid1 (...)
>  Data is read from any one device. The driver attempts to distribute
>  read requests across all devices to maximize performance.

Btrfs' current redundancy-raid read-scheduling algorithm is a fairly 
basic, unoptimized even/odd-PID implementation at this point.  It's 
suitable for basic use and will parallelize over a large enough random 
set of read tasks, since PIDs distribute across even and odd.  It's 
also well suited to testing, because its simplicity makes it easy to 
force reads onto just one side, the other, or both, simply by arranging 
for all-even, all-odd, or mixed PIDs.  But as you discovered, it's 
nowhere near as well optimized as md redundancy-raid.

Another difference between the two that favors mdraid1 is that the latter 
will make N redundant copies across N devices, while btrfs redundancy 
raid in all forms (raid1/10 and dup on a single device) keeps exactly two 
copies, no matter the number of devices.  More devices simply give you 
more capacity, not more copies, as there are still only two.

OTOH, for those concerned about data integrity, btrfs has one seriously 
killer feature that mdraid lacks -- btrfs checksums both data and 
metadata and verifies a checksum match on read-back, falling back to the 
second copy on redundancy-raid if the first copy fails checksum 
verification, rewriting the bad copy from the good one.  One of the 
things that distressed me about mdraid is that in all cases, redundancy 
and parity alike, it never actually cross-checks either redundant copies 
or parity in normal operation -- if you get a bad copy and the hardware/
firmware level doesn't detect it, you get a bad copy and mdraid is none 
the wiser.  Only during a scrub or device recovery does mdraid actually 
use the parity or redundant copies, and even then, for redundancy-scrub, 
it simply arbitrarily calls the first copy good and rewrites it to the 
others if they differ.

What I'm actually wanting myself is this killer data-integrity 
verification feature in combination with N-way mirroring, instead of the 
two-way that current btrfs offers.  For me, N=3, three-way-mirroring, 
would be perfect: with just two-way-mirroring, if one copy is found 
invalid, you'd better /hope/ the second one is good, while with three-way 
there are still two fallbacks if one is bad.  4+-way would of course be 
even better in that regard, but there's the practical side of actually 
buying and housing the devices too, and 3-way simply happens to be my 
sweet spot.

N-way-mirroring is on the roadmap for after parity-raid (the current 
raid56), as it'll share some of the same code.  However, parity-raid 
turned out to be rather more complex to implement properly alongside COW 
and the other btrfs features than expected, so it took far longer to 
complete than originally estimated, and as mentioned above it's still not 
really stable, as a couple of known bugs remain that affect restriping 
and recovery from a lost device.  So N-way-mirroring could be a while, 
and if it follows the pattern of parity-raid, it'll be a while after that 
before it's reasonably stable.  We're talking years...  But I'm still 
eagerly anticipating it.

Obviously, once N-way-mirroring gets in, they'll need to revisit the 
read-scheduling algorithm anyway, because even/odd won't cut it with 
three-plus-way mirroring.  That's when I'd expect some optimization to 
occur, effectively as part of N-way-mirroring.

Meanwhile, I've argued before that the unoptimized read-scheduling of 
btrfs raid1 remains a prime case in point for btrfs' overall stability 
status, particularly when mdraid has a much better algorithm already 
implemented in the same kernel.  Developers tend to be very wary of 
premature optimization, where optimizing too early either locks out 
otherwise viable extensions later, or forces throwing away major 
sections of optimization code as the optimization is redone to account 
for new extensions that don't work with the old code.

That such prime examples as raid1 read-scheduling remain so under-
optimized suggests btrfs as a whole is still some distance from being 
considered fully stable, let alone optimized.

Re: btrfs RAID-1 vs md RAID-1?

2016-05-15 Thread Anand Jain



On 05/15/2016 06:24 PM, Tomasz Chmielewski wrote:

> I'm trying to read two large files in parallel from a 2-disk RAID-1
> btrfs setup (using kernel 4.5.3).
> 
> According to iostat, one of the disks is 100% saturated, while the other
> disk is around 0% busy.
> 
> Is it expected?


No.



> Are there any plans to improve this in btrfs?


Yes.

Thanks, Anand


Tomasz Chmielewski
http://wpkg.org



btrfs RAID-1 vs md RAID-1?

2016-05-15 Thread Tomasz Chmielewski
I'm trying to read two large files in parallel from a 2-disk RAID-1 
btrfs setup (using kernel 4.5.3).


According to iostat, one of the disks is 100% saturated, while the other 
disk is around 0% busy.


Is it expected?

With two readers from the same disk, each file is being read with ~50 
MB/s from disk (with just one reader from disk, the speed goes up to 
around ~150 MB/s).



In md RAID, with many readers, it will try to distribute the reads, 
per the md manual at http://linux.die.net/man/4/md:


Raid1
(...)
Data is read from any one device. The driver attempts to distribute
read requests across all devices to maximise performance.

Raid5
(...)
This also allows more parallelism when reading, as read requests are
distributed over all the devices in the array instead of all but one.


Are there any plans to improve this in btrfs?


Tomasz Chmielewski
http://wpkg.org
