On Thu, 29 May 2025 16:44:51 +0530 Kundan Kumar <kundan.ku...@samsung.com> wrote:
> Currently, pagecache writeback is performed by a single thread. Inodes
> are added to a dirty list, and delayed writeback is triggered. The single
> writeback thread then iterates through the dirty inode list and executes
> the writeback.
>
> This series parallelizes the writeback by allowing multiple writeback
> contexts per backing device (bdi). These writeback contexts are executed
> as separate, independent threads, improving overall parallelism.
>
> Would love to hear feedback in order to move this effort forward.
>
> Design Overview
> ===============
> Following Jan Kara's suggestion [1], we have introduced a new bdi
> writeback context within the backing_dev_info structure. Specifically,
> we have created a new structure, bdi_writeback_ctx, which contains
> its own set of members for each writeback context.
>
> struct bdi_writeback_ctx {
>         struct bdi_writeback wb;
>         struct list_head wb_list; /* list of all wbs */
>         struct radix_tree_root cgwb_tree;
>         struct rw_semaphore wb_switch_rwsem;
>         wait_queue_head_t wb_waitq;
> };
>
> There can be multiple writeback contexts in a bdi, which helps in
> achieving writeback parallelism.
>
> struct backing_dev_info {
> ...
>         int nr_wb_ctx;
>         struct bdi_writeback_ctx **wb_ctx_arr;

I don't think the "_arr" adds value.  bdi->wb_contexts[i]?

> ...
> };
>
> FS geometry and filesystem fragmentation
> ========================================
> The community was concerned that parallelizing writeback would impact
> delayed allocation and increase filesystem fragmentation.
> Our analysis of XFS delayed allocation behavior showed that merging of
> extents occurs within a specific inode. Earlier experiments with multiple
> writeback contexts [2] resulted in increased fragmentation due to the
> same inode being processed by different threads.
>
> To address this, we now affine an inode to a specific writeback context,
> ensuring that delayed allocation works effectively.
>
> Number of writeback contexts
> ============================
> The plan is to keep nr_wb_ctx at 1, preserving the default single-threaded
> behavior. However, we set the number of writeback contexts equal to the
> number of CPUs in the current version.

Makes sense.  It would be good to test this on a non-SMP machine, if you
can find one ;)

> Later we will make it configurable
> using a mount option, allowing filesystems to choose the optimal number
> of writeback contexts.
>
> IOPS and throughput
> ===================
> We see significant improvement in IOPS across several filesystems on both
> PMEM and NVMe devices.
>
> Performance gains:
> - On PMEM:
>   Base XFS                : 544 MiB/s
>   Parallel Writeback XFS  : 1015 MiB/s (+86%)
>   Base EXT4               : 536 MiB/s
>   Parallel Writeback EXT4 : 1047 MiB/s (+95%)
>
> - On NVMe:
>   Base XFS                : 651 MiB/s
>   Parallel Writeback XFS  : 808 MiB/s (+24%)
>   Base EXT4               : 494 MiB/s
>   Parallel Writeback EXT4 : 797 MiB/s (+61%)
>
> We also see that there is no increase in filesystem fragmentation.
> # of extents:
> - On XFS (on PMEM):
>   Base XFS                : 1964
>   Parallel Writeback XFS  : 1384
>
> - On EXT4 (on PMEM):
>   Base EXT4               : 21
>   Parallel Writeback EXT4 : 11

Please test the performance on spinning disks, and with more filesystems?
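
For reference, here is a minimal userspace sketch of the affinity scheme
described above: a per-bdi array of writeback contexts, with each inode
hashed to exactly one of them.  demo_bdi, demo_wb_ctx, inode_to_wb_ctx()
and the hash are illustrative stand-ins, not identifiers or code from the
series.

/*
 * Userspace model of per-inode writeback-context affinity.  All names
 * below are hypothetical; only the idea (one bdi -> N contexts, each
 * inode always mapped to the same context) comes from the cover letter.
 */
#include <stdio.h>
#include <stdlib.h>

struct demo_wb_ctx {
	unsigned int id;		/* models one writeback thread */
	unsigned long nr_inodes;	/* inodes affined to this context */
};

struct demo_bdi {
	unsigned int nr_wb_ctx;		/* e.g. number of CPUs */
	struct demo_wb_ctx *wb_ctx;	/* array of contexts */
};

/* Hypothetical helper: a given inode always maps to the same context. */
static unsigned int inode_to_wb_ctx(unsigned long ino, unsigned int nr_ctx)
{
	/* hash_64()-style multiplicative mix; any stable mapping would do */
	unsigned long long h = (unsigned long long)ino * 0x9E3779B97F4A7C15ULL;

	return (unsigned int)((h >> 32) % nr_ctx);
}

int main(void)
{
	struct demo_bdi bdi = { .nr_wb_ctx = 4 };
	unsigned long ino;
	unsigned int i;

	bdi.wb_ctx = calloc(bdi.nr_wb_ctx, sizeof(*bdi.wb_ctx));
	if (!bdi.wb_ctx)
		return 1;
	for (i = 0; i < bdi.nr_wb_ctx; i++)
		bdi.wb_ctx[i].id = i;

	/* Dirty a range of inodes; each lands on exactly one context. */
	for (ino = 1; ino <= 10000; ino++)
		bdi.wb_ctx[inode_to_wb_ctx(ino, bdi.nr_wb_ctx)].nr_inodes++;

	for (i = 0; i < bdi.nr_wb_ctx; i++)
		printf("wb_ctx %u: %lu inodes\n", i, bdi.wb_ctx[i].nr_inodes);

	free(bdi.wb_ctx);
	return 0;
}

The detail that matters for fragmentation is that the mapping is a pure
function of the inode, so every flush of a given inode runs on the same
thread and extent merging during delayed allocation behaves as in the
single-threaded case.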