On Thu, 29 May 2025 16:44:51 +0530 Kundan Kumar <[email protected]>
wrote:
> Currently, pagecache writeback is performed by a single thread. Inodes
> are added to a dirty list, and delayed writeback is triggered. The single
> writeback thread then iterates through the dirty inode list and performs
> the writeback.
>
> This series parallelizes the writeback by allowing multiple writeback
> contexts per backing device (bdi). These writeback contexts are executed
> as separate, independent threads, improving overall parallelism.
>
> Would love to hear feedback in order to move this effort forward.
>
> Design Overview
> ================
> Following Jan Kara's suggestion [1], we have introduced a new bdi
> writeback context within the backing_dev_info structure. Specifically,
> we have created a new structure, bdi_writeback_ctx, which contains
> its own set of members for each writeback context.
>
> struct bdi_writeback_ctx {
> 	struct bdi_writeback wb;
> 	struct list_head wb_list;	/* list of all wbs */
> 	struct radix_tree_root cgwb_tree;
> 	struct rw_semaphore wb_switch_rwsem;
> 	wait_queue_head_t wb_waitq;
> };
>
> There can be multiple writeback contexts in a bdi, which helps in
> achieving writeback parallelism.
>
> struct backing_dev_info {
> 	...
> 	int nr_wb_ctx;
> 	struct bdi_writeback_ctx **wb_ctx_arr;
I don't think the "_arr" adds value. bdi->wb_contexts[i]?
> 	...
> };
>
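If the array naming gets revisited, a small iteration helper might also keep
callers tidy. A rough sketch of what I have in mind, reusing the field names
from the cover letter (the macro and the example caller are mine, not taken
from the series):

#define for_each_bdi_wb_ctx(bdi, ctx, i)				\
	for ((i) = 0;							\
	     (i) < (bdi)->nr_wb_ctx &&					\
	     ((ctx) = (bdi)->wb_ctx_arr[(i)]);				\
	     (i)++)

	/* e.g. kicking background writeback on every context */
	struct bdi_writeback_ctx *ctx;
	int i;

	for_each_bdi_wb_ctx(bdi, ctx, i)
		wb_start_background_writeback(&ctx->wb);
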
> FS geometry and filesystem fragmentation
> ========================================
> The community was concerned that parallelizing writeback would impact
> delayed allocation and increase filesystem fragmentation.
> Our analysis of XFS delayed allocation behavior showed that merging of
> extents occurs within a specific inode. Earlier experiments with multiple
> writeback contexts [2] resulted in increased fragmentation due to the
> same inode being processed by different threads.
>
> To address this, we now affine each inode to a specific writeback
> context, ensuring that delayed allocation works effectively.
>
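Can you say a bit more about how the affinity is chosen? I'm assuming
something like hashing the inode number over the available contexts; a
sketch of my assumption below (not taken from the patches), just to check
I read it right:

static struct bdi_writeback_ctx *
inode_to_wb_ctx(struct backing_dev_info *bdi, struct inode *inode)
{
	/*
	 * Hypothetical mapping: derive the context from the inode number
	 * so a given inode is always written back by the same context,
	 * keeping delayed allocation and extent merging on one thread.
	 */
	return bdi->wb_ctx_arr[inode->i_ino % bdi->nr_wb_ctx];
}

If the mapping is static like this, it would be worth spelling out what
happens when nr_wb_ctx changes (e.g. via the future mount option).
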
> Number of writeback contexts
> ===========================
> The plan is to keep nr_wb_ctx at 1, preserving the default single-threaded
> behavior. However, in the current version we set the number of writeback
> contexts equal to the number of CPUs.
Makes sense. It would be good to test this on a non-SMP machine, if
you can find one ;)
> Later we will make it configurable
> using a mount option, allowing filesystems to choose the optimal number
> of writeback contexts.
>
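For the eventual mount option, I'd expect the default to stay at one context
unless explicitly requested, something along the lines of the sketch below
(the opt_nr_wb_ctx variable and the clamping policy are invented here, not
part of the series):

	/*
	 * Sketch only: clamp an explicit request to the CPU count,
	 * otherwise keep today's single-threaded behaviour.
	 */
	if (opt_nr_wb_ctx > 0)
		bdi->nr_wb_ctx = min_t(int, opt_nr_wb_ctx, num_online_cpus());
	else
		bdi->nr_wb_ctx = 1;

Is that roughly the plan, or would each filesystem pick its own value?
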
> IOPS and throughput
> ===================
> We see significant improvement in IOPS across several filesystems on both
> PMEM and NVMe devices.
>
> Performance gains:
> - On PMEM:
> Base XFS : 544 MiB/s
> Parallel Writeback XFS : 1015 MiB/s (+86%)
> Base EXT4 : 536 MiB/s
> Parallel Writeback EXT4 : 1047 MiB/s (+95%)
>
> - On NVMe:
> Base XFS : 651 MiB/s
> Parallel Writeback XFS : 808 MiB/s (+24%)
> Base EXT4 : 494 MiB/s
> Parallel Writeback EXT4 : 797 MiB/s (+61%)
>
> We also see that there is no increase in filesystem fragmentation.
> # of extents:
> - On XFS (on PMEM):
> Base XFS : 1964
> Parallel Writeback XFS : 1384
>
> - On EXT4 (on PMEM):
> Base EXT4 : 21
> Parallel Writeback EXT4 : 11
Please test the performance on spinning disks, and with more filesystems?