On Tue, Nov 29, 2016 at 01:49:09PM +0800, Qu Wenruo wrote:
> >>>My proposal requires only a modification to the extent allocator.
> >>>The behavior at the block group layer and scrub remains exactly the same.
> >>>We just need to adjust the allocator slightly to take the RAID5 CoW
> >>>constraints into account.
> >>
> >>Then, you'd need to allow btrfs to split large buffered/direct write into
> >>small extents (not 128M anymore).
> >>Not sure if we need to do extra work for DirectIO.
> >
> >Nope, that's not my proposal.  My proposal is to simply ignore free
> >space whenever it's inside a partially filled raid stripe (optimization:
> >...which was empty at the start of the current transaction).
>
> Still have problems.
>
> Allocator must handle fs under device remove or profile converting (from 4
> disks raid5 to 5 disk raid5/6) correctly.
> Which already seems complex for me.

Those would be allocations in separate block groups with different
stripe widths.  Already handled in btrfs.

> And further more, for fs with more devices, for example, 9 devices RAID5.
> It will be a disaster to just write a 4K data and take up the whole 8 * 64K
> space.
> It will definitely cause huge ENOSPC problem.

If you called fsync() after every 4K, yes; otherwise you can just batch
up small writes into full-size stripes.
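To put numbers on the 9-device example (with the usual 64K stripe
element, a full stripe holds 8 * 64K = 512K of data):

    worst case:  one 4K extent + fsync() per transaction
                 -> 4K used, ~508K of the stripe stranded (~99% wasted)
    batched:     128 * 4K = 512K of delalloc written in one transaction
                 -> the full stripe is written at once, nothing stranded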
The worst case isn't common enough to be a serious problem for a lot of
the common RAID5 use cases (i.e. non-database workloads).  I wouldn't
try running a database on it--I'd use a RAID1 or RAID10 array for that
instead, because the other RAID5 performance issues would be
deal-breakers.

On ZFS the same case degenerates into something like btrfs RAID1 over
the 9 disks, which burns over 50% of the space.  More efficient than
wasting 99% of the space, but still wasteful.

> If you really think it's easy, make an RFC patch, which should be easy if it
> is, then run fstest auto group on it.

I plan to when I get time; however, that could be some months in the
future and I don't want to "claim" the task and stop anyone else from
taking a crack at it in the meantime.

> Easy words won't turn emails into real patch.
>
> >That avoids modifying a stripe with committed data and therefore plugs the
> >write hole.
> >
> >For nodatacow, prealloc (and maybe directio?) extents the behavior
> >wouldn't change (you'd have write hole, but only on data blocks not
> >metadata, and only on files that were already marked as explicitly not
> >requiring data integrity).
> >
> >>And in fact, you're going to support variant max file extent size.
> >
> >The existing extent sizing behavior is not changed *at all* in my proposal,
> >only the allocator's notion of what space is 'free'.
> >
> >We can write an extent across multiple RAID5 stripes so long as we
> >finish writing the entire extent before pointing committed metadata to
> >it.  btrfs does that already, otherwise checksums wouldn't work.
> >
> >>This makes delalloc more complex (Wang enhanced delalloc support for
> >>variant file extent size, to fix the ENOSPC problem for dedupe and
> >>compression).
> >>
> >>This is already much more complex than you expected.
> >
> >The complexity I anticipate is having to deal with two implementations
> >of the free space search, one for free space cache and one for free
> >space tree.
> >
> >It could be as simple as calling the existing allocation functions and
> >just filtering out anything that isn't suitably aligned inside a raid56
> >block group (at least for a proof of concept).
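To sketch what I mean by filtering out unsuitable candidates: something
like the rough pseudo-C below.  None of these names are real btrfs
functions, and the committed-data lookup is only a stub -- it's just
meant to show the shape of the check.

/* Rough sketch only -- made-up types and helpers, not btrfs code.
 * Rule: reject a candidate free extent if it would only partially fill
 * any RAID5/6 stripe that already contained committed data when the
 * current transaction started (i.e. anything that would need RMW). */

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

struct stripe_info {
        u64 full_stripe_len;    /* data bytes per full stripe, e.g. 8 * 64K */
};

/* Stub: a real version would consult the extent/free-space trees, or a
 * per-transaction record of which stripes were empty at transaction start. */
static bool stripe_has_committed_data(struct stripe_info *si, u64 stripe_start)
{
        (void)si;
        (void)stripe_start;
        return true;
}

/* Is [start, start + len) usable under the no-RMW rule? */
static bool candidate_is_safe(struct stripe_info *si, u64 bg_start,
                              u64 start, u64 len)
{
        u64 slen = si->full_stripe_len;
        u64 stripe = bg_start + ((start - bg_start) / slen) * slen;

        for (; stripe < start + len; stripe += slen) {
                /* a stripe we would only partially fill must have been
                 * empty at the start of the transaction */
                bool partial = stripe < start || stripe + slen > start + len;

                if (partial && stripe_has_committed_data(si, stripe))
                        return false;
        }
        return true;
}

The existing free-space search itself wouldn't change at all; this would
just be an extra accept/reject step on whatever it returns, done once for
the free space cache path and once for the free space tree path.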
> >
> >>And this is the *BIGGEST* problem of current btrfs:
> >>No good enough (if there is any) *ISOLATION* for such a complex fs.
> >>
> >>So even a "small" modification can lead to unexpected bugs.
> >>
> >>That's why I want to isolate the fix in the RAID56 layer, not any layer
> >>upwards.
> >
> >I don't think the write hole is fixable in the current raid56 layer, at
> >least not without a nasty brute force solution like a stripe update journal.
> >
> >Any of the fixes I'd want to use fix the problem from outside.
> >
> >>If not possible, I prefer not to do anything yet, until we are sure the
> >>very basic part of RAID56 is stable.
> >>
> >>Thanks,
> >>Qu
> >>
> >>>
> >>>It's not as efficient as the ZFS approach, but it doesn't require an
> >>>incompatible disk format change either.
> >>>
> >>>>>On BTRFS this could be achieved using several BGs (== block group or
> >>>>>chunk), one for each stripe size.
> >>>>>
> >>>>>For example, if a RAID5 filesystem is composed of 4 disks, the
> >>>>>filesystem should have three BGs:
> >>>>>BG #1, composed of two disks (1 data + 1 parity)
> >>>>>BG #2, composed of three disks (2 data + 1 parity)
> >>>>>BG #3, composed of four disks (3 data + 1 parity).
> >>>>
> >>>>Too complicated a bg layout, and further extent allocator modification.
> >>>>
> >>>>More code means more bugs, and I'm pretty sure it will be bug-prone.
> >>>>
> >>>>
> >>>>Although the idea of variable stripe size can somewhat reduce the problem
> >>>>under certain situations.
> >>>>
> >>>>For example, if sectorsize is 64K, and we make the stripe len 32K, and
> >>>>use 3-disk RAID5, we can avoid such write hole problems without
> >>>>modification to the extent/chunk allocator.
> >>>>
> >>>>And I'd prefer to make stripe len a mkfs-time parameter, not possible to
> >>>>modify after mkfs.  To make things easy.
> >>>>
> >>>>Thanks,
> >>>>Qu
> >>>>
> >>>>>
> >>>>>If the data to be written has a size of 4k, it will be allocated to
> >>>>>BG #1.
> >>>>>If the data to be written has a size of 8k, it will be allocated to
> >>>>>BG #2.
> >>>>>If the data to be written has a size of 12k, it will be allocated to
> >>>>>BG #3.
> >>>>>If the data to be written has a size greater than 12k, it will be
> >>>>>allocated to BG #3 until the data fills full stripes; then the
> >>>>>remainder will be stored in BG #1 or BG #2.
> >>>>>
> >>>>>
> >>>>>To avoid unbalanced disk usage, each BG could use all the disks,
> >>>>>even if a stripe uses fewer disks, i.e.:
> >>>>>
> >>>>>DISK1  DISK2  DISK3  DISK4
> >>>>>S1     S1     S1     S2
> >>>>>S2     S2     S3     S3
> >>>>>S3     S4     S4     S4
> >>>>>[....]
> >>>>>
> >>>>>The above shows a BG which uses all four disks, but has a stripe which
> >>>>>spans only 3 disks.
> >>>>>
> >>>>>
> >>>>>Pro:
> >>>>>- btrfs is already capable of handling different BGs in the filesystem;
> >>>>>only the allocator has to change
> >>>>>- no more RMW is required (== higher performance)
> >>>>>
> >>>>>Cons:
> >>>>>- the data will be more fragmented
> >>>>>- the filesystem will have more BGs; this will require a re-balance from
> >>>>>time to time.  But it is an issue which we already know about (even if it
> >>>>>may not be 100% addressed).
> >>>>>
> >>>>>
> >>>>>Thoughts?
> >>>>>
> >>>>>BR
> >>>>>G.Baroncelli
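For reference, the size-to-BG mapping in the quoted proposal above boils
down to something like the following (4-disk RAID5, 4K granularity as in
the quoted 4k/8k/12k example; hypothetical names, illustration only):

/* Quoted proposal, illustrated: BG #1 = 1 data + 1 parity,
 * BG #2 = 2 data + 1 parity, BG #3 = 3 data + 1 parity.
 * Assumes writes are 4K multiples. */

enum bg_width { BG_1_PLUS_1, BG_2_PLUS_1, BG_3_PLUS_1 };

static enum bg_width pick_bg(unsigned long remaining_bytes)
{
        if (remaining_bytes >= 12 * 1024)
                return BG_3_PLUS_1;     /* fill full-width stripes first */
        if (remaining_bytes >= 8 * 1024)
                return BG_2_PLUS_1;
        return BG_1_PLUS_1;             /* 4K tail */
}

/* A write would loop: pick_bg() on the remaining size, write one full
 * stripe of that width, repeat -- so every stripe is written whole and
 * never needs RMW. */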