On Tue, Nov 29, 2016 at 01:49:09PM +0800, Qu Wenruo wrote:
> >>>My proposal requires only a modification to the extent allocator.
> >>>The behavior at the block group layer and scrub remains exactly the same.
> >>>We just need to adjust the allocator slightly to take the RAID5 CoW
> >>>constraints into account.
> >>
> >>Then, you'd need to allow btrfs to split large buffered/direct write into
> >>small extents (not 128M anymore).
> >>Not sure if we need to do extra work for DirectIO.
> >
> >Nope, that's not my proposal.  My proposal is to simply ignore free
> >space whenever it's inside a partially filled raid stripe (optimization:
> >...which was empty at the start of the current transaction).
>
> Still have problems.
>
> Allocator must handle fs under device remove or profile converting (from 4
> disks raid5 to 5 disk raid5/6) correctly.
> Which already seems complex for me.

Those would be allocations in separate block groups with different
stripe widths.  Already handled in btrfs.

> And further more, for fs with more devices, for example, 9 devices RAID5.
> It will be a disaster to just write a 4K data and take up the whole 8 * 64K
> space.
> It will definitely cause huge ENOSPC problem.

If you called fsync() after every 4K, yes; otherwise you can just batch
up small writes into full-size stripes.
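To put numbers on the 9-device example (with the usual 64K stripe
element, a full stripe holds 8 * 64K = 512K of data):

    worst case:  one 4K extent + fsync() per transaction
                 -> 4K used, ~508K of the stripe stranded (~99% wasted)
    batched:     128 * 4K = 512K of delalloc written in one transaction
                 -> the full stripe is written at once, nothing stranded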
The worst case isn't common enough to be a serious problem for a lot of
the common RAID5 use cases (i.e. non-database workloads).  I wouldn't
try running a database on it--I'd use a RAID1 or RAID10 array for that
instead, because the other RAID5 performance issues would be
deal-breakers.

On ZFS the same case degenerates into something like btrfs RAID1 over
the 9 disks, which burns over 50% of the space.  More efficient than
wasting 99% of the space, but still wasteful.

> If you really think it's easy, make an RFC patch, which should be easy if it
> is, then run fstest auto group on it.

I plan to when I get time; however, that could be some months in the
future and I don't want to "claim" the task and stop anyone else from
taking a crack at it in the meantime.

> Easy words won't turn emails into real patch.
>
> >That avoids modifying a stripe with committed data and therefore plugs the
> >write hole.
> >
> >For nodatacow, prealloc (and maybe directio?) extents the behavior
> >wouldn't change (you'd have write hole, but only on data blocks not
> >metadata, and only on files that were already marked as explicitly not
> >requiring data integrity).
> >
> >>And in fact, you're going to support variant max file extent size.
> >
> >The existing extent sizing behavior is not changed *at all* in my proposal,
> >only the allocator's notion of what space is 'free'.
> >
> >We can write an extent across multiple RAID5 stripes so long as we
> >finish writing the entire extent before pointing committed metadata to
> >it.  btrfs does that already, otherwise checksums wouldn't work.
> >
> >>This makes delalloc more complex (Wang enhanced delalloc support for
> >>variant file extent size, to fix the ENOSPC problem for dedupe and
> >>compression).
> >>
> >>This is already much more complex than you expected.
> >
> >The complexity I anticipate is having to deal with two implementations
> >of the free space search, one for free space cache and one for free
> >space tree.
> >
> >It could be as simple as calling the existing allocation functions and
> >just filtering out anything that isn't suitably aligned inside a raid56
> >block group (at least for a proof of concept).
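To sketch what I mean by filtering out unsuitable candidates: something
like the rough pseudo-C below.  None of these names are real btrfs
functions, and the committed-data lookup is only a stub -- it's just
meant to show the shape of the check.

/* Rough sketch only -- made-up types and helpers, not btrfs code.
 * Rule: reject a candidate free extent if it would only partially fill
 * any RAID5/6 stripe that already contained committed data when the
 * current transaction started (i.e. anything that would need RMW). */

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t u64;

struct stripe_info {
        u64 full_stripe_len;    /* data bytes per full stripe, e.g. 8 * 64K */
};

/* Stub: a real version would consult the extent/free-space trees, or a
 * per-transaction record of which stripes were empty at transaction start. */
static bool stripe_has_committed_data(struct stripe_info *si, u64 stripe_start)
{
        (void)si;
        (void)stripe_start;
        return true;
}

/* Is [start, start + len) usable under the no-RMW rule? */
static bool candidate_is_safe(struct stripe_info *si, u64 bg_start,
                              u64 start, u64 len)
{
        u64 slen = si->full_stripe_len;
        u64 stripe = bg_start + ((start - bg_start) / slen) * slen;

        for (; stripe < start + len; stripe += slen) {
                /* a stripe we would only partially fill must have been
                 * empty at the start of the transaction */
                bool partial = stripe < start || stripe + slen > start + len;

                if (partial && stripe_has_committed_data(si, stripe))
                        return false;
        }
        return true;
}

The existing free-space search itself wouldn't change at all; this would
just be an extra accept/reject step on whatever it returns, done once for
the free space cache path and once for the free space tree path.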
> >
> >>And this is the *BIGGEST* problem of current btrfs:
> >>No good enough (if there is any) *ISOLATION* for such a complex fs.
> >>
> >>So even a "small" modification can lead to unexpected bugs.
> >>
> >>That's why I want to isolate the fix in the RAID56 layer, not any layer
> >>upwards.
> >
> >I don't think the write hole is fixable in the current raid56 layer, at
> >least not without a nasty brute force solution like a stripe update journal.
> >
> >Any of the fixes I'd want to use fix the problem from outside.
> >
> >>If not possible, I prefer not to do anything yet, until we are sure the
> >>very basic part of RAID56 is stable.
> >>
> >>Thanks,
> >>Qu
> >>
> >>>
> >>>It's not as efficient as the ZFS approach, but it doesn't require an
> >>>incompatible disk format change either.
> >>>
> >>>>>On BTRFS this could be achieved using several BGs (== block group or
> >>>>>chunk), one for each stripe size.
> >>>>>
> >>>>>For example, if a RAID5 filesystem is composed of 4 disks, the
> >>>>>filesystem should have three BGs:
> >>>>>BG #1, composed of two disks (1 data + 1 parity)
> >>>>>BG #2, composed of three disks (2 data + 1 parity)
> >>>>>BG #3, composed of four disks (3 data + 1 parity).
> >>>>
> >>>>Too complicated a bg layout, and further extent allocator modification.
> >>>>
> >>>>More code means more bugs, and I'm pretty sure it will be bug-prone.
> >>>>
> >>>>
> >>>>Although the idea of variable stripe size can somewhat reduce the problem
> >>>>under certain situations.
> >>>>
> >>>>For example, if sectorsize is 64K, and we make the stripe len 32K, and
> >>>>use 3-disk RAID5, we can avoid such write hole problems without
> >>>>modification to the extent/chunk allocator.
> >>>>
> >>>>And I'd prefer to make stripe len a mkfs-time parameter, not possible to
> >>>>modify after mkfs.  To make things easy.
> >>>>
> >>>>Thanks,
> >>>>Qu
> >>>>
> >>>>>
> >>>>>If the data to be written has a size of 4k, it will be allocated to
> >>>>>BG #1.
> >>>>>If the data to be written has a size of 8k, it will be allocated to
> >>>>>BG #2.
> >>>>>If the data to be written has a size of 12k, it will be allocated to
> >>>>>BG #3.
> >>>>>If the data to be written has a size greater than 12k, it will be
> >>>>>allocated to BG #3 until the data fills full stripes; then the
> >>>>>remainder will be stored in BG #1 or BG #2.
> >>>>>
> >>>>>
> >>>>>To avoid unbalanced disk usage, each BG could use all the disks,
> >>>>>even if a stripe uses fewer disks, i.e.:
> >>>>>
> >>>>>DISK1  DISK2  DISK3  DISK4
> >>>>>S1     S1     S1     S2
> >>>>>S2     S2     S3     S3
> >>>>>S3     S4     S4     S4
> >>>>>[....]
> >>>>>
> >>>>>The above shows a BG which uses all four disks, but has a stripe which
> >>>>>spans only 3 disks.
> >>>>>
> >>>>>
> >>>>>Pro:
> >>>>>- btrfs is already capable of handling different BGs in the filesystem;
> >>>>>only the allocator has to change
> >>>>>- no more RMW is required (== higher performance)
> >>>>>
> >>>>>Cons:
> >>>>>- the data will be more fragmented
> >>>>>- the filesystem will have more BGs; this will require a re-balance from
> >>>>>time to time.  But it is an issue which we already know about (even if it
> >>>>>may not be 100% addressed).
> >>>>>
> >>>>>
> >>>>>Thoughts?
> >>>>>
> >>>>>BR
> >>>>>G.Baroncelli
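For reference, the size-to-BG mapping in the quoted proposal above boils
down to something like the following (4-disk RAID5, 4K granularity as in
the quoted 4k/8k/12k example; hypothetical names, illustration only):

/* Quoted proposal, illustrated: BG #1 = 1 data + 1 parity,
 * BG #2 = 2 data + 1 parity, BG #3 = 3 data + 1 parity.
 * Assumes writes are 4K multiples. */

enum bg_width { BG_1_PLUS_1, BG_2_PLUS_1, BG_3_PLUS_1 };

static enum bg_width pick_bg(unsigned long remaining_bytes)
{
        if (remaining_bytes >= 12 * 1024)
                return BG_3_PLUS_1;     /* fill full-width stripes first */
        if (remaining_bytes >= 8 * 1024)
                return BG_2_PLUS_1;
        return BG_1_PLUS_1;             /* 4K tail */
}

/* A write would loop: pick_bg() on the remaining size, write one full
 * stripe of that width, repeat -- so every stripe is written whole and
 * never needs RMW. */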