On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> > On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > > I thought that a possible solution is to create BGs with different
> > > numbers of data disks. E.g. supposing we have a raid 6 system with 6
> > > disks, where 2 are parity disks, we should allocate 3 BGs:
> > > 
> > > BG #1: 1 data disk, 2 parity disks
> > > BG #2: 2 data disks, 2 parity disks
> > > BG #3: 4 data disks, 2 parity disks
> > > 
> > > For simplicity, the disk-stripe length is assumed = 4K.
> > > 
> > > So if you have a write with a length of 4 KB, this should be placed
> > > in BG #1; if you have a write with a length of 4*3 KB = 12 KB, the
> > > first 8 KB should be placed in BG #2, then the remaining 4 KB in BG #1.
> > > 
> > > This would avoid wasting space, even if fragmentation will
> > > increase (but does fragmentation matter with modern solid
> > > state disks?).
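
If I'm reading this right, the placement rule is just a greedy
decomposition of each write into stripes of 4, 2 and 1 data blocks.
A toy user-space sketch of my reading (4K stripe elements, the 6-disk
raid 6 layout above, obviously nothing like the real btrfs code):

#include <stdio.h>

/* Toy model of the proposed placement rule: split a write of
 * 'blocks' 4K blocks across BG #3 (4 data disks), BG #2 (2 data
 * disks) and BG #1 (1 data disk), widest stripe first. */
static void place_write(unsigned int blocks)
{
	static const unsigned int data_disks[] = { 4, 2, 1 };
	unsigned int i;

	printf("%u-block write:", blocks);
	for (i = 0; i < 3 && blocks; i++) {
		unsigned int stripes = blocks / data_disks[i];

		if (stripes) {
			printf(" %u stripe(s) of %u data block(s);",
			       stripes, data_disks[i]);
			blocks %= data_disks[i];
		}
	}
	printf("\n");
}

int main(void)
{
	place_write(1);	/* 4 KB  -> one stripe in BG #1 */
	place_write(3);	/* 12 KB -> BG #2, then BG #1   */
	return 0;
}
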
> 
> I don't really see why this would increase fragmentation or waste space.

Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
remaining 2 blocks).  It also flips the usual order of "determine size
of extent, then allocate space for it", which might require major surgery
on the btrfs allocator to implement.
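
(In the terms of the sketch above: 6 blocks = one 4-data-block stripe
plus one 2-data-block stripe, so one contiguous 24K write turns into
two separate extents in two different chunks.)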

If we round that write up to 8 blocks (so we can put both pieces in
BG #3), it degenerates into the "pretend partially filled RAID stripes
are completely full" case, something like what ssd_spread already does.
That trades less file fragmentation for more free space fragmentation.
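
Rough numbers for the round-up variant, assuming we always pad the
tail of a write out to a full 4-data-disk stripe (a toy model again,
not a claim about what ssd_spread actually does):

#include <stdio.h>

/* Toy model of the round-up variant: pad every write to a whole
 * number of full-width stripes (4 data blocks here) and count how
 * many blocks end up "pretend full" but unused. */
int main(void)
{
	const unsigned int data_disks = 4;
	unsigned int blocks;

	for (blocks = 1; blocks <= 8; blocks++) {
		unsigned int padded = (blocks + data_disks - 1)
				      / data_disks * data_disks;

		printf("%u-block write -> %u blocks allocated, %u unusable\n",
		       blocks, padded, padded - blocks);
	}
	return 0;
}

So a 6-block write allocates 8 blocks and leaves 2 of them unusable;
that is the free space fragmentation side of the trade-off.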

> The extent size is determined before allocation anyway, all that changes
> in this proposal is where those small extents ultimately land on the disk.
> 
> If anything, it might _reduce_ fragmentation since everything in BG #1
> and BG #2 will be of uniform size.
> 
> It does solve write hole (one transaction per RAID stripe).
> 
> > Also, you're still going to be wasting space; it's just that less space will
> > be wasted, and it will be wasted at the chunk level instead of the block
> > level. That opens up a whole new set of issues to deal with, most
> > significantly that it becomes functionally impossible, without brute-force
> > search techniques, to determine when you will hit the common case of -ENOSPC
> > due to being unable to allocate a new chunk.
> 
> Hopefully the allocator only keeps one small block group of each size
> around at a time.  The allocator can take significant shortcuts because
> the size of every extent in the small block groups is known (they are
> all the same size by definition).
> 
> When a small block group fills up, the next one should occupy the
> most-empty subset of disks--which is the opposite of the usual RAID5/6
> allocation policy.  This will probably lead to "interesting" imbalances
> since there are now two allocators on the filesystem with different goals
> (though it is no worse than -draid5 -mraid1, and I had no problems with
> free space when I was running that).
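
"Most-empty subset of disks" could be as dumb as sorting devices by
free space and taking the first few (user-space sketch with made-up
free-space numbers; the real chunk allocator is of course nothing
like this):

#include <stdio.h>
#include <stdlib.h>

struct disk {
	int id;
	unsigned long long free_bytes;
};

/* Sort descending by free space. */
static int by_free(const void *a, const void *b)
{
	const struct disk *da = a, *db = b;

	if (da->free_bytes == db->free_bytes)
		return 0;
	return da->free_bytes > db->free_bytes ? -1 : 1;
}

/* Toy policy: put the next small BG on the most-empty disks instead
 * of striping as wide as possible like the normal raid5/6 policy. */
int main(void)
{
	struct disk disks[] = {
		{ 0, 900ULL << 30 }, { 1, 100ULL << 30 }, { 2, 700ULL << 30 },
		{ 3, 300ULL << 30 }, { 4, 800ULL << 30 }, { 5, 200ULL << 30 },
	};
	const unsigned int want = 3;	/* e.g. 1 data + 2 parity */
	unsigned int i;

	qsort(disks, 6, sizeof(disks[0]), by_free);
	printf("next 1-data-disk BG goes on disks:");
	for (i = 0; i < want; i++)
		printf(" %d", disks[i].id);
	printf("\n");
	return 0;
}
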
> 
> There will be an increase in the amount of allocated but not usable space,
> though, because now the amount of free space depends on how much data
> is batched up before fsync() or sync().  Probably best to just not count
> any space in the small block groups as 'free' in statvfs terms at all.
> 
> There are a lot of variables implied there.  Without running some
> simulations I have no idea if this is a good idea or not.
> 
> > > From time to time, a rebalance should be performed to empty BG #1
> > > and BG #2. Otherwise a new BG should be allocated.
> 
> That shouldn't be _necessary_ (the filesystem should just allocate
> whatever BGs it needs), though it will improve storage efficiency if it
> is done.
> 
> > > The cost should be comparable to logging/journaling (each write
> > > shorter than a full stripe has to be written two times); the
> > > implementation should be quite easy, because even NOW btrfs supports
> > > BGs with different sets of disks.
> 

