On Tue, Jun 24, 2025 at 11:29:28AM +0530, Kundan Kumar wrote:
> On Wed, Jun 11, 2025 at 9:21 PM Darrick J. Wong <[email protected]> wrote:
> >
> > On Wed, Jun 04, 2025 at 02:52:34PM +0530, Kundan Kumar wrote:
> > > > > > For xfs we used this command:
> > > > > > xfs_io -c "stat" /mnt/testfile
> > > > > > And for ext4 we used this:
> > > > > > filefrag /mnt/testfile
> > > > >
> > > > > filefrag merges contiguous extents, and only counts up for
> > > > > discontiguous
> > > > > mappings, while fsxattr.nextents counts all extents even if they are
> > > > > contiguous. So you probably want to use filefrag for both cases.
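> > > > >
> > > > > For example, to compare the two counts on the same file
> > > > > (illustrative; the path is an example):
> > > > >
> > > > >   filefrag /mnt/testfile             # merged count: "N extents found"
> > > > >   xfs_io -r -c "stat" /mnt/testfile  # raw count: fsxattr.nextents line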
> > > >
> > > > Got it — thanks for the clarification. We'll switch to using filefrag
> > > > and will share updated extent count numbers accordingly.
> > >
> > > Using filefrag, we recorded extent counts on xfs and ext4 at three
> > > stages:
> > > a. Just after a 1G random write,
> > > b. After a 30-second wait,
> > > c. After unmounting and remounting the filesystem.
> > >
> > > xfs
> > > Base
> > > a. 6251 b. 2526 c. 2526
> > > Parallel writeback
> > > a. 6183 b. 2326 c. 2326
> >
> > Interesting that the mapping record count goes down...
> >
> > I wonder: you said the xfs filesystem has 4 AGs and the system has 12
> > cores, so I guess wb_ctx_arr[] is 12? Do you see a knee point in writeback
> > throughput when the # of wb contexts exceeds the AG count?
> >
> > Though I guess for the (hopefully common) case of pure overwrites, we
> > don't have to do any metadata updates so we wouldn't really hit a
> > scaling limit due to ag count or log contention or whatever. Does that
> > square with what you see?
> >
>
> Hi Darrick,
>
> We analyzed AG count vs. number of writeback contexts to identify any
> knee point. Earlier, wb_ctx_arr[] was fixed at 12; now we varied nr_wb_ctx
> and measured the impact.
>
> To make throughput easier to measure, we made the number of writeback
> contexts configurable; this knob will be exposed in the next series. To
> configure it, we used:
>   echo <nr_wb_ctx> > /sys/class/bdi/259:2/nwritebacks
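>
> For reference, the bdi id can be derived from the device's major:minor
> rather than hard-coded; a minimal sketch (the nwritebacks attribute is
> from this series, the device path is an example):
>
>   dev=/dev/nvme0n1
>   bdi=$(lsblk -dno MAJ:MIN "$dev" | tr -d ' ')   # e.g. "259:2"
>   echo 6 > /sys/class/bdi/"$bdi"/nwritebacks
>   cat /sys/class/bdi/"$bdi"/nwritebacks          # confirm the new value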
>
> In our test, writing 1G across 12 directories, bandwidth scaled steeply
> up to about the allocation group count (4 AGs) and flattened out around
> nr_wb_ctx = 5-6, which looks like the knee point; beyond that, gains
> tapered off. Overall we see roughly a 17x bandwidth increase from base
> (9799 KiB/s) to nr_wb_ctx = 6 (163 MiB/s).
>
> Base (single threaded) : 9799KiB/s
> Parallel Writeback (nr_wb_ctx = 1) : 9727KiB/s
> Parallel Writeback (nr_wb_ctx = 2) : 18.1MiB/s
> Parallel Writeback (nr_wb_ctx = 3) : 46.4MiB/s
> Parallel Writeback (nr_wb_ctx = 4) : 135MiB/s
> Parallel Writeback (nr_wb_ctx = 5) : 160MiB/s
> Parallel Writeback (nr_wb_ctx = 6) : 163MiB/s
Heh, nice!
> Parallel Writeback (nr_wb_ctx = 7) : 162MiB/s
> Parallel Writeback (nr_wb_ctx = 8) : 154MiB/s
> Parallel Writeback (nr_wb_ctx = 9) : 152MiB/s
> Parallel Writeback (nr_wb_ctx = 10) : 145MiB/s
> Parallel Writeback (nr_wb_ctx = 11) : 145MiB/s
> Parallel Writeback (nr_wb_ctx = 12) : 138MiB/s
>
>
> System config
> ===========
> Number of CPUs = 12
> System RAM = 9G
> Number of XFS AGs = 4
> NVMe SSD = 3.84 TB (enterprise SSD PM1733a)
>
> Script
> =====
> mkfs.xfs -f /dev/nvme0n1
> mount /dev/nvme0n1 /mnt
> echo <nr_wb_ctx> > /sys/class/bdi/259:2/nwritebacks
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> for i in {1..12}; do
> mkdir -p /mnt/dir$i
> done
>
> fio job_nvme.fio
>
> umount /mnt
> echo 3 > /proc/sys/vm/drop_caches
> sync
>
> fio job
> =====
> [global]
> bs=4k
> iodepth=1
> rw=randwrite
> ioengine=io_uring
> nrfiles=12
> numjobs=1 # one job per [jobN] section; each section writes into its own directory
> size=1g
> direct=0 # Buffered I/O to trigger writeback
> group_reporting=1
> create_on_open=1
> name=test
>
> [job1]
> directory=/mnt/dir1
>
> [job2]
> directory=/mnt/dir2
> ...
> ...
> [job12]
> directory=/mnt/dir12
>
> > > ext4
> > > Base
> > > a. 7080 b. 7080 c. 11
> > > Parallel writeback
> > > a. 5961 b. 5961 c. 11
> >
> > Hum, that's particularly ... interesting. I wonder what the mapping
> > count behaviors are when you turn off delayed allocation?
> >
> > --D
> >
>
> I attempted to disable delayed allocation by setting allocsize=4096
> during mount (mount -o allocsize=4096 /dev/pmem0 /mnt), but still
> observed a reduction in file fragments after a delay. Is there something
> I'm overlooking?
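>
> For the ext4 case we can also retry with nodelalloc, since allocsize is
> an XFS-specific speculative-preallocation knob; a minimal sketch,
> assuming the same device:
>
>   mount -t ext4 -o nodelalloc /dev/pmem0 /mnt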
Not that I know of. Maybe we should just take the win. :)
--D
> -Kundan
>