On Thu, 29 May 2025 16:44:51 +0530 Kundan Kumar <kundan.ku...@samsung.com> 
wrote:

> Currently, pagecache writeback is performed by a single thread. Inodes
> are added to a dirty list, and delayed writeback is triggered. The single
> writeback thread then iterates through the dirty inode list and writes
> the inodes back.
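> 
> For reference, that single-threaded path is roughly the following
> (simplified; function names as in fs/fs-writeback.c):
> 
>         __mark_inode_dirty(inode)
>           -> inode is queued on the per-wb b_dirty list
>           -> wb_wakeup_delayed(wb)   /* arms wb->dwork */
> 
>         wb_workfn()                  /* runs off the bdi_wq workqueue */
>           -> wb_do_writeback() -> wb_writeback()
>              -> walks the dirty lists and writes the inodes back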
> 
> This series parallelizes writeback by allowing multiple writeback
> contexts per backing device (bdi). These writeback contexts run as
> separate, independent threads, improving overall parallelism.
> 
> We would love to hear feedback in order to move this effort forward.
> 
> Design Overview
> ================
> Following Jan Kara's suggestion [1], we have introduced a new bdi
> writeback context within the backing_dev_info structure. Specifically,
> we have created a new structure, bdi_writeback_ctx, which holds the
> per-context writeback state:
> 
> struct bdi_writeback_ctx {
>         struct bdi_writeback wb;
>         struct list_head wb_list; /* list of all wbs */
>         struct radix_tree_root cgwb_tree;
>         struct rw_semaphore wb_switch_rwsem;
>         wait_queue_head_t wb_waitq;
> };
> 
> There can be multiple writeback contexts in a bdi, which helps in
> achieving writeback parallelism.
> 
> struct backing_dev_info {
> ...
>         int nr_wb_ctx;
>         struct bdi_writeback_ctx **wb_ctx_arr;

I don't think the "_arr" adds value. bdi->wb_contexts[i]?

> ...
> };
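
So a sync/flush of the whole bdi presumably just fans out over these
contexts?  Something like the below is what I'm picturing (illustration
only, reusing the existing wb_start_background_writeback() helper; the
series' actual code may differ):

	int i;

	for (i = 0; i < bdi->nr_wb_ctx; i++) {
		struct bdi_writeback_ctx *ctx = bdi->wb_ctx_arr[i];

		/* each context has its own worker; wake them all and let
		 * every wb run the same wb_workfn() path as today */
		wb_start_background_writeback(&ctx->wb);
	}
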
> 
> FS geometry and filesystem fragmentation
> ========================================
> The community was concerned that parallelizing writeback would impact
> delayed allocation and increase filesystem fragmentation.
> Our analysis of XFS delayed allocation behavior showed that merging of
> extents occurs within a specific inode. Earlier experiments with multiple
> writeback contexts [2] resulted in increased fragmentation due to the
> same inode being processed by different threads.
> 
> To address this, we now affine each inode to a specific writeback
> context, ensuring that delayed allocation continues to work effectively.
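
I assume the affinity is something like a stable hash of the inode into
one of the contexts?  Roughly (helper name and hash below are my own
placeholders, not from the series):

	static inline struct bdi_writeback_ctx *
	inode_to_wb_ctx(struct backing_dev_info *bdi, struct inode *inode)
	{
		/* a given inode always maps to the same context, so its
		 * delalloc extents are still merged by a single thread */
		return bdi->wb_ctx_arr[inode->i_ino % bdi->nr_wb_ctx];
	}

Any per-inode-stable mapping would do; the important property is that
one inode is never split across writeback contexts.
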
> 
> Number of writeback contexts
> ============================
> The plan is to keep nr_wb_ctx at 1 by default, preserving the existing
> single-threaded behavior. In the current version, however, we set the
> number of writeback contexts equal to the number of CPUs.

Makes sense.  It would be good to test this on a non-SMP machine, if
you can find one ;)

> Later we will make it configurable
> using a mount option, allowing filesystems to choose the optimal number
> of writeback contexts.
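
So for the moment the default presumably boils down to something like
(num_online_cpus() being the obvious helper; I haven't checked where the
series actually sets this):

	/* one writeback context per online CPU, for now */
	bdi->nr_wb_ctx = num_online_cpus();
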
> 
> IOPS and throughput
> ===================
> We see significant improvements in IOPS and throughput across several
> filesystems on both PMEM and NVMe devices.
> 
> Performance gains:
>   - On PMEM:
>       Base XFS                : 544 MiB/s
>       Parallel Writeback XFS  : 1015 MiB/s  (+86%)
>       Base EXT4               : 536 MiB/s
>       Parallel Writeback EXT4 : 1047 MiB/s  (+95%)
> 
>   - On NVMe:
>       Base XFS                : 651 MiB/s
>       Parallel Writeback XFS  : 808 MiB/s  (+24%)
>       Base EXT4               : 494 MiB/s
>       Parallel Writeback EXT4 : 797 MiB/s  (+61%)
> 
> We also see that there is no increase in filesystem fragmentation.
> # of extents:
>   - On XFS (on PMEM):
>       Base XFS                : 1964
>       Parallel Writeback XFS  : 1384
> 
>   - On EXT4 (on PMEM):
>       Base EXT4               : 21
>       Parallel Writeback EXT4 : 11

Please test the performance on spinning disks, and with more filesystems?