On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
[...]
> It is possible to combine writes from a single transaction into full
> RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> Any partially-filled stripe is effectively read-only and the space within
> it is inaccessible until all data within the stripe is overwritten,
> deleted, or relocated by balance.
>
> btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
> update, but that has a significant write magnification effect (and before
> kernel 4.14, non-trivial CPU load as well).
> 
> btrfs could also just allocate the full stripe to an extent, but emit
> only extent ref items for the blocks that are in use.  No fragmentation
> but lots of extra disk space used.  Also doesn't quite work the same
> way for metadata pages.
> 
> If btrfs adopted the ZFS approach, the extent allocator and all higher
> layers of the filesystem would have to know about--and skip over--the
> parity blocks embedded inside extents.  Making this change would mean
> that some btrfs RAID profiles start interacting with stuff like balance
> and compression which they currently do not.  It would create a new
> block group type and require an incompatible on-disk format change for
> both reads and writes.

I think a possible solution is to create BGs (block groups) with different
numbers of data disks. E.g. suppose we have a raid6 system with 6 disks, 2 of
which are parity disks; we would allocate 3 BGs:

BG #1: 1 data disk, 2 parity disks
BG #2: 2 data disks, 2 parity disks
BG #3: 4 data disks, 2 parity disks

For simplicity, assume a per-disk stripe length of 4 KB.

So if you have a write with a length of 4 KB, it should be placed in BG #1; if
you have a write with a length of 3*4 KB = 12 KB, the first 8 KB should be
placed in BG #2 and the remaining 4 KB in BG #1.
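
As a rough illustration of the placement rule above (not btrfs code, only a
sketch), the split could be computed greedily, widest BG first. The names
STRIPE_UNIT, BLOCK_GROUPS and split_write are made up for this example, and
rounding the write up to whole 4 KB units is an assumption of mine:

```python
STRIPE_UNIT = 4096  # per-disk stripe length assumed in this mail (4 KB)

# Hypothetical table for the 6-disk raid6 example: BG name -> data disks.
BLOCK_GROUPS = {"BG #3": 4, "BG #2": 2, "BG #1": 1}


def split_write(length):
    """Split a write into full stripes, trying the widest BG first.

    Returns a list of (bg_name, bytes) pairs. The length is rounded up to
    a whole number of stripe units (an assumption of this sketch).
    """
    units = -(-length // STRIPE_UNIT)  # ceiling division
    placement = []
    widest_first = sorted(BLOCK_GROUPS.items(), key=lambda kv: -kv[1])
    for name, data_disks in widest_first:
        # Take as many full stripes of this width as fit; the remainder
        # is passed on to the next (narrower) BG.
        full_stripes, units = divmod(units, data_disks)
        if full_stripes:
            placement.append((name, full_stripes * data_disks * STRIPE_UNIT))
    return placement


if __name__ == "__main__":
    for size in (4096, 12288, 16384):
        print(size, split_write(size))
```

With the widths 4, 2 and 1 every write decomposes into full stripes, so no
stripe is left partially filled and no RMW is needed at write time.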

This would avoid wasting space, even though fragmentation would increase (but
does fragmentation matter with modern solid state disks?).

From time to time, a re-balance should be performed to empty BG #1 and BG #2.
Otherwise a new BG has to be allocated.

The cost should be comparable to logging/journaling (any data shorter than a
full stripe has to be written twice); the implementation should be quite easy,
because btrfs already supports BGs with different sets of disks.

BR 
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5