On Wed, Jun 21, 2017 at 12:51 AM, Marat Khalili <m...@rqc.ru> wrote:
> On 21/06/17 06:48, Chris Murphy wrote:
>>
>> Another possibility is to ensure a new write is written to a new *not*
>> full stripe, i.e. dynamic stripe size. So if the modification is a 50K
>> file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K
>> parity strip (a full stripe write); write out 1 64K data strip + 1 64K
>> parity strip. In effect, a 4 disk raid5 would quickly get not just 3
>> data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2
>> data + 1 parity chunks, and direct those write to the proper chunk
>> based on size. Anyway that's beyond my ability to assess how much
>> allocator work that is. Balance I'd expect to rewrite everything to
>> max data strips possible; the optimization would only apply to normal
>> operation COW..
>
> This will make some filesystems mostly RAID1, negating all space savings of
> RAID5, won't it?

No. It'd only apply to partial stripe writes, typically small files.
But small-file, metadata-centric workloads suck for raid5 anyway, and
should use raid1. So making the implementation more like raid1 than
raid5 for the RMW case is, I think, still better than Btrfs raid56 RMW
writes being in effect no-COW.

> Isn't it easier to recalculate parity block based using previous state of
> two rewritten strips, parity and data? I don't understand all performance
> implications, but it might scale better with number of devices.

The problem is atomicity. Either the data strip or the parity strip is
overwritten first, and until the other is committed, the file system
is not merely inconsistent, it's basically lying: there's no way to
know for sure after the fact whether the data or the parity was
properly written. The metadata is inconsistent too, because it can
only describe the unmodified state and the successfully modified
state, whereas a third state, "partially modified", is possible and
there's no way to really fix it.

-- 
Chris Murphy
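To make the two points concrete, here is a rough Python sketch. It is
a toy model, not Btrfs code: the helper names (data_strips_for_write,
xor_parity, reconstruct) are made up for illustration, strips are 4
bytes in the demo instead of 64K, and parity is plain bytewise XOR as
in textbook raid5. The first helper shows how many data strips a
dynamically sized stripe would use for a given write; the rest shows
why an in-place RMW that is interrupted between the data write and the
parity write silently breaks reconstruction of a strip that was never
touched.

```python
import math

STRIP = 64 * 1024  # strip size in bytes, matching the 64K example in this thread


def data_strips_for_write(nbytes, ndisks, strip=STRIP):
    """Hypothetical helper: data strips used by a dynamically sized raid5 stripe."""
    max_data = ndisks - 1  # one strip per stripe is reserved for parity
    return min(max(1, math.ceil(nbytes / strip)), max_data)


def xor_parity(strips):
    """Parity strip as the bytewise XOR of equally sized strips."""
    out = bytearray(len(strips[0]))
    for s in strips:
        for i, b in enumerate(s):
            out[i] ^= b
    return bytes(out)


def reconstruct(surviving, parity):
    """Rebuild the one missing data strip from the surviving strips plus parity."""
    return xor_parity(list(surviving) + [parity])


# A 50K write on a 4 disk raid5 needs 1 data strip (+ 1 parity strip).
small_write_strips = data_strips_for_write(50 * 1024, 4)

# Toy full stripe: 3 data strips + 1 parity strip.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_parity([d0, d1, d2])

# Consistent stripe: losing d2 is recoverable.
consistent_rebuild_ok = reconstruct([d0, d1], parity) == d2

# In-place RMW: d0 is overwritten, then a crash happens before the
# parity strip is updated. Parity still describes the old d0.
d0_new = b"ZZZZ"

# Later the device holding d2 fails. Rebuilding d2 from the new d0,
# d1 and the stale parity yields garbage, although d2 was never written.
torn_rebuild_ok = reconstruct([d0_new, d1], parity) == d2

print(small_write_strips, consistent_rebuild_ok, torn_rebuild_ok)
```

Nothing on disk records which of the two strips made it out before
the crash, which is the "partially modified" state described above.
With a COW write to a new 1 data + 1 parity stripe, the old stripe
stays intact until the new one is fully written.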