On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
[...]
> It is possible to combine writes from a single transaction into full
> RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> Any partially-filled stripe is effectively read-only and the space within
> it is inaccessible until all data within the stripe is overwritten,
> deleted, or relocated by balance.
>
> btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
> update, but that has a significant write magnification effect (and before
> kernel 4.14, non-trivial CPU load as well).
>
> btrfs could also just allocate the full stripe to an extent, but emit
> only extent ref items for the blocks that are in use. No fragmentation
> but lots of extra disk space used. Also doesn't quite work the same
> way for metadata pages.
>
> If btrfs adopted the ZFS approach, the extent allocator and all higher
> layers of the filesystem would have to know about--and skip over--the
> parity blocks embedded inside extents. Making this change would mean
> that some btrfs RAID profiles start interacting with stuff like balance
> and compression which they currently do not. It would create a new
> block group type and require an incompatible on-disk format change for
> both reads and writes.

I thought that a possible solution would be to create BGs with different
numbers of data disks. E.g., supposing we have a raid 6 system with 6 disks,
2 of which hold parity, we should allocate 3 BGs:

BG #1: 1 data disk, 2 parity disks
BG #2: 2 data disks, 2 parity disks
BG #3: 4 data disks, 2 parity disks

For simplicity, the disk-stripe length is assumed to be 4K.

So if you have a write with a length of 4 KB, it should be placed in BG #1;
if you have a write with a length of 3*4 KB, the first 8 KB should be placed
in BG #2 and the remaining 4 KB in BG #1.

This would avoid wasting space, even though fragmentation would increase
(but does fragmentation matter with modern solid state disks?).

From time to time, a re-balance should be performed to empty BG #1 and
BG #2. Otherwise a new BG should be allocated.

The cost should be comparable to logging/journaling (each write shorter
than a full stripe has to be written twice); the implementation should be
quite easy, because btrfs already supports BGs with different sets of disks.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
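
To make the placement rule concrete, here is a minimal sketch of the split
described in the mail (4 KB goes to BG #1; 12 KB goes as 8 KB to BG #2 plus
4 KB to BG #1). It is only an illustration, not btrfs code: the names
STRIPE_LEN, BG_DATA_DISKS and split_write are hypothetical, and both the
greedy widest-first order and the padding of a tail shorter than 4K up to
one disk stripe are assumptions that the mail does not spell out.

```python
STRIPE_LEN = 4 * 1024  # per-disk stripe length assumed in the mail (4K)

# Data-disk counts of the three hypothetical block groups, widest first:
# BG #3 = 4 data disks, BG #2 = 2 data disks, BG #1 = 1 data disk.
BG_DATA_DISKS = {"BG#3": 4, "BG#2": 2, "BG#1": 1}


def split_write(length):
    """Greedily split a write of `length` bytes into full stripes.

    Returns a list of (block_group, bytes) pairs, widest block group first.
    A tail shorter than one disk stripe is rounded up to STRIPE_LEN
    (assumption) and therefore ends up in BG#1.
    """
    if length <= 0:
        return []
    # Round the write up to a whole number of 4K disk stripes.
    units = -(-length // STRIPE_LEN)
    placement = []
    for name, ndisks in BG_DATA_DISKS.items():
        # How many full stripes of this width fit in what is left?
        count, units = divmod(units, ndisks)
        if count:
            placement.append((name, count * ndisks * STRIPE_LEN))
    return placement


if __name__ == "__main__":
    for size_kb in (4, 12, 20):
        print(size_kb, "KB ->", split_write(size_kb * 1024))
```

Because every piece fills a whole stripe of its block group, no write under
this scheme needs a RMW cycle; the price is that short writes accumulate in
BG #1 and BG #2 until a re-balance moves them into wider stripes.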