On 2018-07-19 13:29, Goffredo Baroncelli wrote:
On 07/19/2018 01:43 PM, Austin S. Hemmelgarn wrote:
On 2018-07-18 15:42, Goffredo Baroncelli wrote:
On 07/18/2018 09:20 AM, Duncan wrote:
Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:

On 07/17/2018 11:12 PM, Duncan wrote:
Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:

[...]

When I say orthogonal, it means that these can be combined, i.e. you can have:
- striping (RAID0)
- parity (?)
- striping + parity (e.g. RAID5/6)
- mirroring (RAID1)
- mirroring + striping (RAID10)

However, you can't have mirroring+parity; this means that a notation with
both 'C' (= number of copies) and 'P' (= number of parities) is too verbose.

Yes, you can have mirroring+parity: conceptually it's simply raid5/6 on
top of mirroring, or mirroring on top of raid5/6, much as raid10 is
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
on top of raid0.
And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top 
of....) ???

Seriously, of course you can combine a lot of different profiles; however the 
only ones that make sense are the ones above.
No, there are cases where other configurations make sense.

RAID05 and RAID06 are very widely used, especially on NAS systems where you 
have lots of disks.  The RAID5/6 lower layer mitigates the data loss risk of 
RAID0, and the RAID0 upper-layer mitigates the rebuild scalability issues of 
RAID5/6.  In fact, this is pretty much the standard recommended configuration 
for large ZFS arrays that want to use parity RAID.  This could be reasonably 
easily supported to a rudimentary degree in BTRFS by providing the ability to 
limit the stripe width for the parity profiles.
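
To make the layering concrete, here's a rough sketch (purely illustrative,
nothing like the actual BTRFS chunk allocator; the group count, group width
and stripe unit are numbers picked for the example) of how a RAID05-style
layout maps a logical byte onto independent RAID5 groups, and why a rebuild
stays confined to a single group:

    # Rough sketch, not BTRFS code: map a logical byte to a RAID05-style layout,
    # i.e. a RAID0 upper layer striped across independent RAID5 groups.
    # GROUPS, GROUP_WIDTH and STRIPE_UNIT are made-up example parameters.

    GROUPS = 2                 # RAID0 stripes across 2 RAID5 groups
    GROUP_WIDTH = 5            # each group: 5 disks (4 data + 1 parity per stripe)
    STRIPE_UNIT = 64 * 1024    # bytes per strip

    def locate(logical_offset):
        """Return (group, data_disk_in_group, byte_offset_on_disk)."""
        strip = logical_offset // STRIPE_UNIT
        group = strip % GROUPS                 # RAID0 layer: alternate groups
        strip_in_group = strip // GROUPS
        data_disks = GROUP_WIDTH - 1
        disk = strip_in_group % data_disks     # RAID5 layer (parity rotation ignored)
        disk_offset = (strip_in_group // data_disks) * STRIPE_UNIT \
                      + logical_offset % STRIPE_UNIT
        return group, disk, disk_offset

    # Losing one disk only forces a parity rebuild inside its own 5-disk group;
    # the other group never gets read, which is the rebuild-scalability win.
    for off in (0, 64 * 1024, 128 * 1024, 192 * 1024):
        print(off, locate(off))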

Some people use RAID50 or RAID60, although they are strictly speaking inferior 
in almost all respects to RAID05 and RAID06.

RAID01 is also used on occasion; it ends up having the same storage capacity as 
RAID10, but for some RAID implementations it has a different performance 
envelope and different rebuild characteristics.  Usually, when it is used, 
though, it's software RAID0 on top of hardware RAID1.

RAID51 and RAID61 used to be used, but aren't common now.  They provided an easy 
way to get proper data verification without always incurring the rebuild overhead 
of RAID5/6 and without needing to do checksumming.  They are pretty much useless 
for BTRFS, as it can already tell which copy is correct.

So far you are just repeating what I said: the only useful raid profiles are
- striping
- mirroring
- striping+parity (even limiting the number of disks involved)
- striping+mirroring

No, not quite. At least, not in the combinations you're saying make sense if you are using standard terminology. RAID05 and RAID06 are not the same thing as 'striping+parity' as BTRFS implements that case, and can be significantly more optimized than the trivial implementation of just limiting the number of disks involved in each chunk (by, you know, actually striping just like what we currently call raid10 mode in BTRFS does).


RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might 
actually make sense in BTRFS to provide a backup means of rebuilding blocks 
that fail checksum validation if both copies fail.
If you need further redundancy, it is easy to implement parity3 and parity4 
raid profiles instead of stacking raid6+raid1.
I think you're misunderstanding what I mean here.

RAID15/16 consist of two layers:
* The top layer is regular RAID1, usually limited to two copies.
* The lower layer is RAID5 or RAID6.

This means that the lower layer can validate which of the two copies in the upper layer is correct when they don't agree. It doesn't really provide significantly better redundancy (they can technically sustain more disk failures without failing completely than simple two-copy RAID1 can, but just like BTRFS raid10, they can't reliably survive more than one (or two if you're using RAID6 as the lower layer) disk failure), so it does not do the same thing that higher-order parity does.
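
As a purely conceptual illustration (single XOR parity, no parity rotation,
not how any real array stores data), the arbitration works roughly like this:
when the two RAID1 copies of a strip disagree, rebuild the strip from the rest
of each copy's RAID5 group and trust whichever copy is internally consistent:

    # Toy model of RAID15-style copy arbitration: two RAID1 copies, each stored
    # on its own single-parity (XOR) RAID5 group.  Conceptual only.

    def xor(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    def rebuild(strips, missing, parity):
        """Reconstruct strip `missing` from the other strips plus the parity strip."""
        return xor([s for i, s in enumerate(strips) if i != missing] + [parity])

    def arbitrate(copy_a, copy_b, index):
        """copy_a / copy_b are (strips, parity) holding the same data.
        When the two copies of strip `index` disagree, trust the copy whose
        stored strip matches what its own parity says it should be."""
        for strips, parity in (copy_a, copy_b):
            if rebuild(strips, index, parity) == strips[index]:
                return strips[index]
        return None   # both copies are internally inconsistent: unrecoverable

BTRFS gets the same "which copy is right" answer from its checksums, which is
why this particular trick buys it very little.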


The fact that you can combine striping and mirroring (or parity) makes sense 
because you could have a speed gain (see below).
[....]

As someone else pointed out, md/lvm-raid10 already work like this.
What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
much works this way except with huge (gig size) chunks.

As implemented in BTRFS, raid1 doesn't have striping.

The argument is that because there are only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size chunk allocations tend to
alternate device pairs, so it's effectively striped at the macro level, with
the 1 GiB device-level chunks effectively being huge individual device
strips of 1 GiB.
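
A much-simplified model of why the pairs alternate (the real chunk allocator
has more constraints, but "put each chunk on the devices with the most free
space" is the essential idea):

    # Much-simplified model: every 1 GiB raid1 chunk goes to the two devices
    # with the most free space, which on four equal devices makes the pairs rotate.

    CHUNK = 1   # GiB

    def allocate(free, n_chunks):
        """free: dict device -> free GiB.  Returns the device pair used per chunk."""
        pairs = []
        for _ in range(n_chunks):
            pair = sorted(free, key=lambda d: free[d], reverse=True)[:2]
            for dev in pair:
                free[dev] -= CHUNK
            pairs.append(tuple(sorted(pair)))
        return pairs

    print(allocate({"a": 100, "b": 100, "c": 100, "d": 100}, 4))
    # -> [('a', 'b'), ('c', 'd'), ('a', 'b'), ('c', 'd')]  : macro-level "striping"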

The striping concept is based on the fact that if the "stripe size" is small 
enough you get a speed benefit, because the reads may be performed in parallel from 
different disks.
That's not the only benefit of striping though.  The other big one is that you 
now have one volume that's the combined size of both of the original devices.  
Striping is arguably better for this even if you're using a large stripe size 
because it better balances the wear across the devices than simple 
concatenation.
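
A toy example of the wear argument, assuming two equal devices and a workload
that only ever touches the low half of the address space (sizes made up purely
for illustration):

    # Toy numbers: two 4-unit devices, workload writes only offsets 0-3.

    def device_for(offset, mode, stripe=1, dev_size=4):
        if mode == "concat":                  # concatenation: fill device 0 first
            return offset // dev_size
        return (offset // stripe) % 2         # striping: alternate every `stripe` units

    for mode in ("concat", "stripe"):
        writes = [0, 0]
        for off in range(4):
            writes[device_for(off, mode)] += 1
        print(mode, writes)
    # concat -> [4, 0]  all the wear lands on the first device
    # stripe -> [2, 2]  wear is split evenly, even though nothing ran in parallel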

Striping means that the data is interleaved between the disks with a reasonable 
"block unit". Otherwise, what would be the difference between btrfs-raid0 and 
btrfs-single?
Single mode guarantees that any file less than the chunk size in length will either be completely present or completely absent if one of the devices fails. BTRFS raid0 mode does not provide any such guarantee, and in fact guarantees that every file larger than the stripe unit size (however much gets put on one disk before moving to the next) will lose data if a device fails.

Stupid as it sounds, this matters for some people.
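
A small sketch of that guarantee, using an idealized layout (two devices,
files laid out contiguously from offset zero, strips assigned round-robin)
rather than real btrfs extent placement:

    # Idealized layout: two devices, strips assigned round-robin, files contiguous.
    # Which devices hold at least one byte of a file?

    STRIPE = 64 * 1024            # raid0 stripe unit
    CHUNK  = 1 << 30              # single-mode chunk size, 1 GiB

    def devices_touched(start, length, unit, ndev=2):
        first = start // unit
        last = (start + length - 1) // unit
        return {s % ndev for s in range(first, last + 1)}

    for name, length in (("16 KiB file", 16 * 1024), ("256 KiB file", 256 * 1024)):
        print(name,
              "raid0:",  devices_touched(0, length, STRIPE),
              "single:", devices_touched(0, length, CHUNK))
    # The 16 KiB file sits on one device either way: lost whole or kept whole.
    # The 256 KiB file spans both devices under raid0 (any device loss damages it),
    # but stays inside one 1 GiB chunk under single (all-or-nothing).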


With a "stripe size" of 1GB, it is very unlikely that this would happen.
That's a pretty big assumption.  There are all kinds of access patterns that 
will still distribute the load reasonably evenly across the constituent 
devices, even if they don't parallelize things.

If, for example, all your files are 64k or less, and you only read whole files, 
there's no functional difference between RAID0 with 1GB blocks and RAID0 with 
64k blocks.  Such a workload is not unusual on a very busy mail-server.
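
A quick back-of-the-envelope check of that claim, assuming files are placed
uniformly at random (an idealization, of course): whole-file reads of
64k-or-smaller files never cross a stripe boundary under either stripe size,
and they spread across the two devices about evenly either way:

    import random

    def device(offset, stripe, ndev=2):
        return (offset // stripe) % ndev

    random.seed(0)
    for stripe in (64 * 1024, 1 << 30):
        hits = [0, 0]
        for _ in range(100_000):
            # a 64k-or-smaller file at a random 64k-aligned position in 8 GiB
            off = random.randrange(0, 8 << 30, 64 * 1024)
            hits[device(off, stripe)] += 1
        print(stripe, hits)
    # Both stripe sizes give roughly a 50/50 split, and no read of a <=64k file
    # ever spans two devices, so the two layouts behave the same for this workload.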

I fully agree that 64K may be too much for some workloads; however, I have to 
point out that I still find it difficult to imagine that you can take advantage of 
parallel reads from multiple disks with a 1GB stripe unit for a *common 
workload*. Bear in mind that btrfs inlines small files in the metadata, so 
even if the file is smaller than 64k, a 64k read (or more) will be required in 
order to access it.
Again, mail servers. Each file should be written out as a single extent, which means it's all in one chunk. Delivery and processing need to access _LOTS_ of files on a busy mail server, and the good ones do this with userspace parallelization. BTRFS doesn't parallelize disk accesses from the same userspace execution context (thread if threads are being used, process if not), but it does parallelize access for separate contexts, so if userspace is doing things from multiple threads, so will BTRFS.

FWIW, I actually tested this back when the company I work for still ran their own internal mail server. BTRFS was significantly less optimized back then, but there was no measurable performance difference from userspace between using the single profile for data and using the raid0 profile.

At 1 GiB strip size it doesn't have the typical performance advantage of
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
strips/chunks.