On Tue, Apr 03, 2018 at 09:08:01PM -0600, Chris Murphy wrote:
> On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelli <kreij...@inwind.it>
> wrote:
> > On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> >>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> >>>>> I thought that a possible solution is to create BGs with different
> >>>>> numbers of data disks. E.g., supposing we have a raid6 system with
> >>>>> 6 disks, where 2 are parity disks, we should allocate 3 BGs:
> >>>>>
> >>>>> BG #1: 1 data disk, 2 parity disks
> >>>>> BG #2: 2 data disks, 2 parity disks
> >>>>> BG #3: 4 data disks, 2 parity disks
> >>>>>
> >>>>> For simplicity, the disk-stripe length is assumed to be 4K.
> >>>>>
> >>>>> So if you have a write with a length of 4 KB, this should be placed
> >>>>> in BG#1; if you have a write with a length of 4*3 KB, the first 8 KB
> >>>>> should be placed in BG#2, then the rest in BG#1.
> >>>>>
> >>>>> This would avoid wasting space, even if fragmentation will increase
> >>>>> (but does fragmentation matter with modern solid state disks?).
> >>>
> >>> I don't really see why this would increase fragmentation or waste
> >>> space.
> >>
> >> Oh, wait, yes I do. If there's a write of 6 blocks, we would have
> >> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> >> remaining 2 blocks). It also flips the usual order of "determine size
> >> of extent, then allocate space for it" which might require major
> >> surgery on the btrfs allocator to implement.
> >
> > I have to point out that in any case the extent is physically
> > interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if you
> > want to write 128KB, the first half is written on the first disk, the
> > other on the 2nd disk. If you want to write 96KB, the first 64KB are
> > written on the first disk, the last part on the 2nd, only in a
> > different BG.
> > So yes, there is fragmentation from a logical point of view; from a
> > physical point of view the data is spread across the disks in any case.
> >
> > In any case, you are right, we should gather some data, because the
> > performance impact is not so clear.
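To make the variable-width BG idea concrete, here is a toy sketch of the
greedy placement it seems to imply (Python, purely illustrative--nothing
like this exists in the btrfs allocator; the 4K stripe element and the
1/2/4-data-disk BG widths are taken from Goffredo's example above):

    STRIPE_ELEM = 4096     # one disk-stripe element, per the example above
    BG_WIDTHS = [4, 2, 1]  # data disks per BG: BG#3, BG#2, BG#1 (each plus 2 parity)

    def place_write(length_bytes):
        """Split a write into full-width pieces, widest block group first."""
        blocks = -(-length_bytes // STRIPE_ELEM)   # round up to whole 4K blocks
        placement = []
        while blocks > 0:
            # pick the widest BG whose data stripe still fits in what's left
            width = next(w for w in BG_WIDTHS if w <= blocks)
            placement.append((width, width * STRIPE_ELEM))
            blocks -= width
        return placement

    print(place_write(12 * 1024))  # [(2, 8192), (1, 4096)]  -> 8 KB in BG#2, 4 KB in BG#1
    print(place_write(24 * 1024))  # [(4, 16384), (2, 8192)] -> the 6-block extent split

The 24 KB case is the one where a single logical extent ends up split
across two BGs, which is exactly where the allocator surgery mentioned
above would come in.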
> They're pretty clear, and there's a lot written about small file size
> and parity raid performance being shit, no matter the implementation
> (md, ZFS, Btrfs; hardware maybe less so, just because of all the
> caching and extra processing hardware that's dedicated to the task).

Pretty much everything goes fast if you put a faster non-volatile cache
in front of it.

> The linux-raid@ list is full of optimizations for this that are use
> case specific. One of those that often comes up is how badly suited
> raid56 is for e.g. mail servers: tons of small file reads and writes,
> and all the disk contention that comes up, and it's even worse when
> you lose a disk, and even if you're running raid6 and lose two disks
> it's really god awful. It can be unexpectedly a disqualifying setup
> without prior testing in that condition: can your workload really be
> usable for two or three days in a double degraded state on that raid6?
> *shrug*
>
> Parity raid is well suited for full stripe reads and writes, lots of
> sequential writes. Ergo a small file is anything less than a full
> stripe write. Of course, delayed allocation can end up making for more
> full stripe writes. But now you have more RMW, which is the real
> performance killer, again no matter the raid.

RMW isn't necessary if you have properly configured COW on top. ZFS
doesn't do RMW at all. OTOH for some workloads COW is a step in a
different wrong direction--the btrfs raid5 problems with nodatacow
files can be solved by stripe logging and nothing else.

Some equivalent of autodefrag that repacks your small RAID stripes into
bigger ones will burn 3x your write IOPS eventually--it just lets you
defer the inevitable until a hopefully more convenient time. A
continuously loaded server never has a more convenient time, so it
needs a different solution.

> > I am not worried about having different BGs; we have problems with
> > these because we never developed a tool to handle this issue properly
> > (i.e. a daemon which starts a balance when needed). But I hope that
> > this will be solved in the future.
> >
> > In any case, all the solutions proposed have their trade-offs:
> >
> > - a) as is: write hole bug
> > - b) variable stripe size (like ZFS): big impact on how btrfs handles
> >   extents; limited waste of space
> > - c) logging data before writing: we write the data twice in a short
> >   time window. Moreover, the log area is written several orders of
> >   magnitude more than the other areas; there were some patches around
> > - d) rounding the write up to the stripe size: wastes space; simple
> >   to implement
> > - e) different BGs with different stripe sizes: limited waste of
> >   space; logical fragmentation
>
> I'd say for sure you're worse off with metadata raid5 vs metadata
> raid1. And if there are many devices you might be better off with
> metadata raid1 even on a raid6; it's not an absolute certainty you
> lose the file system with a 2nd drive failure - it depends on the
> device and what chunk copies happen to be on it. But at the least, if
> you have a script or some warning, you can relatively easily rebalance
> ... HMMM
>
> Actually that should be a test. Single drive degraded raid6 with
> metadata raid1: can you do a metadata-only balance to force the
> missing copy of metadata to be replicated again? In theory this should
> be quite fast.

I've done it, but it's not as fast as you might hope. Metadata balances
95% slower than data, and seeks pretty hard (stressing the surviving
drives and sucking performance) while it does so. Also you're likely to
have to fix or work around a couple of btrfs bugs while you do it.
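For reference, a minimal sketch of that degraded-rebalance test, driving
the btrfs CLI from Python (the device name and mount point are made up,
and it assumes the degraded array still has enough devices left to hold
two copies of every metadata chunk):

    import subprocess

    def run(*cmd):
        # print and run one command, stopping on the first failure
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Mount the raid6 degraded (one device missing); metadata is raid1.
    run("mount", "-o", "degraded", "/dev/sdb", "/mnt/test")

    # Metadata-only balance: rewriting the metadata block groups should
    # recreate the raid1 copies that lived on the missing device.
    run("btrfs", "balance", "start", "-m", "/mnt/test")

    # Inspect the resulting chunk layout per device.
    run("btrfs", "filesystem", "usage", "/mnt/test")

Expect it to behave as described above: correct in the end, but slow and
seek-heavy while it runs.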