On Thu, Dec 24, 2015 at 9:19 AM, Donald Pearson
<donaldwhpear...@gmail.com> wrote:

> Got it.  I'm not the biggest fan of mixing mdraid with btrfs raid in
> order to work around deficiencies.  Hopefully in the future btrfs will
> allow me to select my mirror groups.

As far as I know, mdadm -l raid10 works the same way: you don't have
control over which devices get paired. But what you can do with mdadm
is create the mirrored pairs first, and then stripe across those
arrays. I don't know whether that's better, easier, or necessary to do
with an mdadm container.
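Something like this, as a rough sketch; the device names are
hypothetical and most options are omitted:

  # two mirrored pairs, then a raid0 stripe across them
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
  mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/md0 /dev/md1

That gives you exactly the mirror groups you want, at the cost of the
checksum-based repair you'd get from doing the mirroring in btrfs.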



> The trouble with a mirror of stripes is you take a nasty impact to
> your fault tolerance for dropping drives.  With Raid01 dropping just 1
> drive from each cabinet will fail the entire array because there is
> only one mirror group.  So now it's a choice between fault tolerance
> of dropping drives or fault tolerance of file-level errors.

Right. It's an open question whether btrfs raid10 is more like raid01
because of this. Where it's more like raid10 than 01 is rebuild: with
01, when one drive dies, the entire raid0 array it's in dies and has
to be rebuilt, which is not the case for btrfs. So it has
characteristics of both raid10 and raid01, depending on the mode and
context.

The thing is, the trend in building storage stacks, because drive
capacities are so huge while their performance hasn't scaled at the
same rate, is to build more arrays with fewer drives each, and pool
the arrays with something like ceph or glusterfs.

While controller fault tolerance is a legitimate concern, is a
controller problem more or less likely than a power supply problem? Or
than something else in that particular system just crapping out,
rather than the array attached to it?


> All this makes me ask why?  Why implement Raid10 in this non-standard
> fashion and create this mess of compromise?

Because it was a straightforward extension of how the file system
already behaves. Implementing drive-based copies rather than
chunk-based copies is a totally different strategy that negates how
btrfs does allocation, and it would require things like checking that
mirrored pairs are the same size, plus or minus maybe 1%, similar to
what mdadm does.
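If you want to see how that chunk-based allocation actually lands on
the devices, the usage commands show it per device (the mount point
here is just a placeholder):

  btrfs filesystem usage /mnt
  btrfs device usage /mnt

Each raid10 block group gets striped and mirrored across, roughly,
whichever devices have the most unallocated space when the chunk is
allocated, not across fixed pairs, which is exactly why there's no
knob for picking mirror groups.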

And keep in mind that raid10's tolerance for multiple device failures
is not guaranteed; it's not the case that any additional failure is
OK. It depends on the storage equivalent of aviation's "big sky
theory" of air traffic separation. Yes, the probability of both of
mirror A's drives dying is next to zero, but it's not zero. If you're
building arrays that depend on it being zero, that's not a good idea.
The way to look at it is as a bonus for uptime, rather than something
to depend on in the design. You design for its scalable performance,
which it does have.
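To put a rough number on it, take the 30-drive example from this
thread and assume it were organized as 15 fixed mirrored pairs (a
hypothetical layout, with independent failures): once one drive is
dead, a second random failure has a 1 in 29 chance of hitting that
drive's partner.

  $ echo 'scale=4; 1/29' | bc
  .0344

About 3.4%: small, but nowhere near the zero you'd want to be betting
the array design on.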



>  It's frustrating on the
> user side and makes admins look at alternatives.  All this is because
> I can't define what the mirrored pairs (or beyond in the future) are,
> just to gain elegance in supporting different sized drives?  That can
> be done at the stripe level, it doesn't need to be done at the mirror
> level, and if it were done at the stripe level this issue wouldn't
> exist.

Whether the granularity for mirroring shifts from chunks to drives or
to stripes doesn't matter: a mirrored pair will still have to be the
same size, and the bullseye simply gets bigger, from one drive to two
or more.


> I get it but this really isn't compelling.  This can't be done without
> using a hybrid of mdraid + btrfs; I can already do this in a raid 1+0
> arrangement I just don't benefit from checksumming.  All
> N-way-mirroring is going to give me is the ability to do it in a 0+1
> arrangement which means my filesystem made of 3 trays of 30 drives
> total will be failed with just the failure of 1 drive in each tray and
> that's not acceptable.

OK, so in that case you can't use Btrfs alone to get the fault
tolerance you need. There are other things I'd think an admin would
want in a Btrfs-only solution that Btrfs doesn't have, like a faulty
state for devices and notifications for that state change. This isn't
the only gap; it's just rather a gotcha if you come with the
expectation that raid10 is almost certainly capable of tolerating a
two-disk failure. So I do kind of wonder whether it ought to be called
raid01, even though that's misleading too, but at least not in a way
that causes an overestimation of data availability.
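In the meantime, about the closest you can get is polling the
per-device error counters and wiring them into whatever alerting you
already run (the mount point is a placeholder):

  btrfs device stats /mnt      # read/write/flush/corruption/generation counters
  btrfs device stats -z /mnt   # print and then reset the counters

There's no state change or notification tied to those counters; you
have to script the comparison and the alert yourself, which is the gap
I mean.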


-- 
Chris Murphy