On 2018-07-20 13:13, Goffredo Baroncelli wrote:
On 07/19/2018 09:10 PM, Austin S. Hemmelgarn wrote:
On 2018-07-19 13:29, Goffredo Baroncelli wrote:
[...]

So until now you have been repeating what I said: the only useful raid profiles are
- striping
- mirroring
- striping+parity (even limiting the number of disks involved)
- striping+mirroring

No, not quite.  At least, not in the combinations you're saying make sense if 
you are using standard terminology.  RAID05 and RAID06 are not the same thing 
as 'striping+parity' as BTRFS implements that case, and can be significantly 
more optimized than the trivial implementation of just limiting the number of 
disks involved in each chunk (by, you know, actually striping just like what we 
currently call raid10 mode in BTRFS does).

Could you provide more information?
Just parity by itself is functionally equivalent to a really stupid implementation of 2 or more copies of the data. Setups with only one disk more than the number of parities in RAID5 and RAID6 are called degenerate for this very reason. All sane RAID5/6 implementations do striping across multiple devices internally, and that's almost always what people mean when talking about striping plus parity.

What I'm referring to is different though. Just like RAID10 used to be implemented as RAID1 on top of RAID0, RAID05 is RAID0 on top of RAID5. That is, you're striping your data across multiple RAID5 arrays instead of using one big RAID5 array to store it all. As I mentioned, this mitigates the scaling issues inherent in RAID5 when it comes to rebuilds (namely, the fact that device failure rates go up faster for larger arrays than rebuild times do).

Functionally, such a setup can be implemented in BTRFS by limiting RAID5/6 stripe width, but that will have all kinds of performance limitations compared to actually striping across all of the underlying RAID5 chunks. In fact, it will have the exact same performance limitations you're calling out BTRFS single mode for below.
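
To make the layout difference concrete, here's a rough sketch in plain Python (illustration only: made-up device count and mapping, not how btrfs actually allocates chunks). With a simply limited stripe width, every stripe in the current chunk lands on the same small group of devices; with RAID0 over RAID5, consecutive logical stripes alternate between the sub-arrays, so one large sequential read touches every device.

# Illustration only: 6 devices, RAID5 width of 3 (2 data + 1 parity).
NUM_DEVICES = 6
STRIPE_WIDTH = 3
NUM_SUBARRAYS = NUM_DEVICES // STRIPE_WIDTH

def limited_width(stripe):
    # "Just limit the number of disks per chunk": every stripe in the
    # current chunk stays on the same 3-device group until it fills up.
    group = 0
    return [group * STRIPE_WIDTH + d for d in range(STRIPE_WIDTH)]

def raid0_over_raid5(stripe):
    # RAID05: consecutive logical stripes alternate between the two
    # 3-device RAID5 sub-arrays, so big reads hit all 6 devices.
    group = stripe % NUM_SUBARRAYS
    return [group * STRIPE_WIDTH + d for d in range(STRIPE_WIDTH)]

for s in range(4):
    print(s, limited_width(s), raid0_over_raid5(s))

The second mapping is where the extra performance comes from, and it's exactly what the first one gives up.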



RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might 
actually make sense in BTRFS to provide a backup means of rebuilding blocks 
that fail checksum validation if both copies fail.
If you need further redundancy, it is easier to implement parity3 and parity4 raid profiles than to stack raid6+raid1.
I think you're misunderstanding what I mean here.

RAID15/16 consist of two layers:
* The top layer is regular RAID1, usually limited to two copies.
* The lower layer is RAID5 or RAID6.

This means that the lower layer can validate which of the two copies in the 
upper layer is correct when they don't agree.
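
As a toy illustration of why that arbitration works (single-byte XOR parity, nothing like the real on-disk format): when the two mirrored copies disagree, the copy that is still consistent with its own parity is the one to trust, which is information a bare mirror doesn't have.

from functools import reduce

def xor_parity(strips):
    # RAID5-style parity: byte-wise XOR across the data strips.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

def pick_copy(copies):
    # copies is a list of (data_strips, stored_parity).  Return the first
    # copy whose data still matches its parity; a plain mirror could only
    # tell that the copies differ, not which one is good.
    for strips, parity in copies:
        if xor_parity(strips) == parity:
            return strips
    return None  # every copy is damaged

good = [b"ab", b"cd", b"ef"]
bad = [b"ab", b"XX", b"ef"]   # one strip got corrupted after the fact
parity = xor_parity(good)     # parity as originally written
print(pick_copy([(bad, parity), (good, parity)]))  # -> the intact copy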

This works only because the redundancy is greater than 1. Anyway, BTRFS has checksums, which help a lot in this area.
The checksum helps, but what do you do when all copies fail the checksum? Or, worse yet, what do you do when both copies have the 'right' checksum, but different data? Yes, you could have one more copy, but that just reduces the chances of those cases happening, it doesn't eliminate them.

Note that I'm not necessarily saying it makes sense to have support for this in BTRFS, just that it's a real-world counter-example to your statement that only those combinations make sense. In the case of BTRFS, these would make more sense than RAID51 and RAID61, but they still aren't particularly practical. For classic RAID though, they're really important, because you don't have checksumming (unless you have T10 DIF capable hardware and a RAID implementation that understands how to work with it, but that's rare and expensive) and it makes it easier to resize an array than having three copies (you only need 2 new disks for RAID15 or RAID16 to increase the size of the array, but you need 3 for 3-copy RAID1 or RAID10).

They don't really provide significantly better redundancy (they can technically 
sustain more disk failures without failing completely than simple two-copy 
RAID1 can, but just like BTRFS raid10, they can't reliably survive more than 
one disk failure (or two if you're using RAID6 as the lower layer)), so they do 
not do the same thing that higher-order parity does.


The fact that you can combine striping and mirroring (or parity) makes sense 
because you can get a speed gain (see below).
[....]

As someone else pointed out, md/lvm-raid10 already work like this.
What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
much works this way except with huge (gig size) chunks.

As implemented in BTRFS, raid1 doesn't have striping.

The argument is that because there are only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size, chunk allocations tend to
alternate device pairs, so it's effectively striped at the macro level, with
the 1 GiB device-level chunks effectively being huge individual device
strips of 1 GiB.

The striping concept is based on the fact that if the "stripe size" is small 
enough, you get a speed benefit because the reads can be performed in parallel from 
different disks.
That's not the only benefit of striping though.  The other big one is that you 
now have one volume that's the combined size of both of the original devices.  
Striping is arguably better for this even if you're using a large stripe size 
because it better balances the wear across the devices than simple 
concatenation.
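
For what it's worth, the "small stripes parallelize better" part of this is just address arithmetic. A quick sketch (simple RAID0-style striping with a made-up device count, not btrfs's actual chunk mapping):

def devices_touched(offset, length, stripe_size, num_devices):
    # Under plain striping, logical stripe N lives on device N % num_devices.
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return sorted({s % num_devices for s in range(first, last + 1)})

MiB = 1024 * 1024
GiB = 1024 * MiB

print(devices_touched(0, 4 * MiB, 64 * 1024, 4))  # [0, 1, 2, 3]
print(devices_touched(0, 4 * MiB, 1 * GiB, 4))    # [0]

With a 64 KiB stripe unit a single 4 MiB read is spread over all four devices; with a 1 GiB stripe unit the same read sits entirely on one device, and any parallelism has to come from unrelated reads that happen to land on different strips.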

Striping means that the data is interleaved between the disks with a reasonable 
"block unit". Otherwise, what would be the difference between btrfs-raid0 and 
btrfs-single?
Single mode guarantees that any file less than the chunk size in length will 
either be completely present or completely absent if one of the devices fails.  
BTRFS raid0 mode does not provide any such guarantee, and in fact guarantees 
that any file larger than the stripe unit size (however much gets put on one 
disk before moving to the next) will lose data if a device fails.

Stupid as it sounds, this matters for some people.
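
A minimal sketch of that guarantee, under deliberately simplified assumptions (a 'single' chunk lives wholly on one device, while raid0 stripes every chunk across all devices):

KiB = 1024

def single_file_lost(file_device, failed_device):
    # single profile: a file smaller than the chunk size sits entirely on
    # one device, so it is lost only if that particular device fails.
    return file_device == failed_device

def raid0_file_lost(file_size, stripe_unit):
    # raid0 profile: any file larger than the stripe unit has pieces on
    # more than one device, so *any* single device failure damages it.
    return file_size > stripe_unit

print(single_file_lost(file_device=1, failed_device=0))            # False
print(raid0_file_lost(file_size=256 * KiB, stripe_unit=64 * KiB))  # True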

I think it would be even better to have different filesystems.
Not necessarily. In fact, quite the opposite in most cases, because having separate filesystems pushes the requirement to sort the files onto devices to userspace, which should not have to worry about that.

Put in cluster computing terms (where this kind of file layout is the norm), why exactly should the application software be the component responsible for figuring out what node a given file from a particular dataset is on? Why shouldn't the filesystem itself handle this?



With a "stripe size" of 1GB, it is very unlikely that this would happens.
That's a pretty big assumption.  There are all kinds of access patterns that 
will still distribute the load reasonably evenly across the constituent 
devices, even if they don't parallelize things.

If, for example, all your files are 64k or less, and you only read whole files, 
there's no functional difference between RAID0 with 1GB blocks and RAID0 with 
64k blocks.  Such a workload is not unusual on a very busy mail-server.

I fully agree that 64K may be too much for some workloads, however I have to 
point out that I still find it difficult to imagine that you can take advantage of 
parallel reads from multiple disks with a 1GB stripe unit for a *common 
workload*. Pay attention that btrfs inlines small files in the metadata, so 
even if a file is smaller than 64k, a 64k read (or more) will be required in 
order to access it.

Again, mail servers. Each file should be written out as a single extent, which 
means it's all in one chunk.  Delivery and processing need to access _LOTS_ of 
files on a busy mail server, and the good ones do this with userspace 
parallelization.  BTRFS doesn't parallelize disk accesses from the same 
userspace execution context (thread if threads are being used, process if not), 
but it does parallelize access for separate contexts, so if userspace is doing 
things from multiple threads, so will BTRFS.
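
By userspace parallelization I mean roughly the following kind of thing (a generic sketch, not any particular MTA's code): many independent worker threads each reading whole small files, which gives the filesystem separate execution contexts that it can service concurrently.

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def read_message(path):
    # Each call runs in its own thread, i.e. its own execution context.
    return Path(path).read_bytes()

def process_queue(paths, workers=16):
    # Read a batch of queued messages in parallel, the way the delivery
    # and processing stages of a busy mail server typically do.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_message, paths))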

The parallelization matters only if it is distributed across different disks. The more disks are involved, the more parallelization is possible. As an extreme example, with a stripe unit of 1GB, until the filesystem is smaller than 1GB no parallelization is possible[*] because all the data is on the same disk. And when the filesystem increases in size, the data must be "distant" by more than 1GB to be parallelized.

First, I think you have things slightly backwards, it should be 'until
the filesystem is _larger_ than 1GB' here.

That aside, data locality is not an issue to the degree you might think. For 64k files, that's 16384 files per chunk. That's a minuscule number for a really active mail-server (no, seriously, single subsidiary mail-servers in big companies may be handling queuing and delivery of more than twice that per minute).

[*] Of course it is possible to perform parallel reads on the same disk, but the 
throughput would decrease; maybe the average latency would be better.
Raw throughput, measured simply as how many bytes you can read or write per second, would decrease. Actual effective throughput will not necessarily decrease if you've got a storage device with very low seek times, because being able to load and process files in parallel may allow for much faster actual processing of the data compared to simple serial processing. Latency would depend on the device and the access pattern.


FWIW, I actually tested this back when the company I work for still ran their 
own internal mail server.  BTRFS was significantly less optimized back then, 
but there was no measurable performance difference from userspace between using 
single profile for data or raid0 profile for data.

Despite the btrfs optimization, having a stripe unit of 1GB reduces the likelihood of parallelizing 
the reads. This is because the data to be read in parallel must be "distant" by more than 
the "stripe unit": a smaller stripe unit increases the likelihood of parallel 
reads.

Of course this is not sufficient. In any case, BTRFS should improve its I/O 
scheduler.
Agreed, we need actual parallel access to devices in BTRFS.