Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-07-01 Thread Oliver Sang
hi, Christoph Hellwig,

On Wed, Jun 26, 2024 at 09:54:05PM -0700, Christoph Hellwig wrote:
> On Thu, Jun 27, 2024 at 10:35:38AM +0800, Oliver Sang wrote:
> > 
> > I failed to apply the patch from your previous reply on top of 1122c0c1cc or
> > the current tip of axboe-block/for-next:
> > c1440ed442a58 (axboe-block/for-next) Merge branch 'for-6.11/block' into for-next
> 
> That already includes it.

For the patch in your previous reply [1], the bot applied it automatically as:
* 5c683739f6c2f patch in [1]
* 0fc4bfab2cd45 (tag: next-20240625) Add linux-next specific files for 20240625

For patch set [2], the bot applied it as:
* 6490f979767736 block: move dma_pad_mask into queue_limits
* 278817f42e219b block: remove the fallback case in queue_dma_alignment
* 81afb19d619a04 block: remove disk_update_readahead
* 037d85402b8b83 block: conding style fixup for blk_queue_max_guaranteed_bio
* 4fe67425ae31a8 block: convert features and flags to __bitwise types
* e3c2d2ad4136f2 block: rename BLK_FEAT_MISALIGNED
* 33ead159243d1c block: correctly report cache type
* 6725109120e0ba md: set md-specific flags for all queue limits
*   e6d130064a02f5 Merge branch 'for-6.11/block' into for-next


but both builds failed with the following errors:
  - "ERROR: modpost: \"md_init_stacking_limits\" [drivers/md/raid456.ko] undefined!"
  - "ERROR: modpost: \"md_init_stacking_limits\" [drivers/md/raid1.ko] undefined!"
  - "ERROR: modpost: \"md_init_stacking_limits\" [drivers/md/raid0.ko] undefined!"
  - "ERROR: modpost: \"md_init_stacking_limits\" [drivers/md/raid10.ko] undefined!"


Since you mentioned that the axboe-block/for-next branch already includes the
patch set, I took a snapshot of the branch several days ago, as shown below:

*   bc512ae8cb934 (axboe-block/for-next) Merge branch 'for-6.11/block' into for-next   <---
|\
| * 18048c1af7836 (axboe-block/for-6.11/block) loop: Fix a race between loop detach and loop open
| * 63db4a1f795a1 block: Delete blk_queue_flag_test_and_set()
* | e21d05740862c Merge branch 'for-6.11/block' into for-next
|\|
| * e269537e491da block: clean up the check in blkdev_iomap_begin()
* | 9c6e1f8702d51 Merge branch 'for-6.11/block' into for-next
|\|
| * 69b6517687a4b block: use the right type for stub rq_integrity_vec()
* | c1440ed442a58 Merge branch 'for-6.11/block' into for-next
|\|
| * e94b45d08b5d1 block: move dma_pad_mask into queue_limits                            <---
| * abfc9d810926d block: remove the fallback case in queue_dma_alignment
| * 73781b3b81e76 block: remove disk_update_readahead
| * 3302f6f090522 block: conding style fixup for blk_queue_max_guaranteed_bio
| * fcf865e357f80 block: convert features and flags to __bitwise types
| * ec9b1cf0b0ebf block: rename BLK_FEAT_MISALIGNED
| * 78887d004fb2b block: correctly report cache type
| * 573d5abf3df00 md: set md-specific flags for all queue limits                        <---
* | 72e9cd924fccc Merge branch 'for-6.11/block' into for-next
|\|
| * cf546dd289e0f block: change rq_integrity_vec to respect the iterator                <---

From the results below, it seems the patch set no longer gives a performance
improvement but rather a regression now. Is this expected?

=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/load/md/rootfs/tbox_group/test/testcase:
  gcc-13/performance/4BRD_12G/xfs/x86_64-rhel-8.3/300/RAID0/debian-12-x86_64-20240206.cgz/lkp-csl-2sp3/sync_disk_rw/aim7

cf546dd289e0f6d2 573d5abf3df00c879fbd25774e4 e94b45d08b5d1c230c0f59c3eed bc512ae8cb934ac31470bc825fa
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change %stddev     %change %stddev     %change %stddev
             \          |                \          |                \          |                \
     21493           -19.6%      17278           -19.2%      17371           -19.7%      17264        aim7.jobs-per-min



[1] https://lore.kernel.org/all/znqgf49cvy6w-...@infradead.org/
[2] https://lore.kernel.org/all/20240625145955.115252-2-...@lst.de/

> 
> > 
> > but it applies cleanly on top of linux-next:
> > * 0fc4bfab2cd45 (tag: next-20240625) Add linux-next specific files for 20240625
> > 
> > I've already started the test based on this way of applying it.
> > Is the expectation that the patch should not introduce a performance change
> > compared to 0fc4bfab2cd45?
> > 
> > Or if this way of applying it is not OK, please just give me guidance. Thanks!
> 
> The expectation is that the latest block branch (and thus linux-next)
> doesn't see this performance change.
> 


Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-06-26 Thread Oliver Sang
hi, Christoph Hellwig,

On Tue, Jun 25, 2024 at 08:39:50PM -0700, Christoph Hellwig wrote:
> On Wed, Jun 26, 2024 at 10:10:49AM +0800, Oliver Sang wrote:
> > I'm not sure I understand this test request. As in the title, we see a good
> > improvement in aim7 for 1122c0c1cc, and we didn't observe any other issues
> > with this commit.
> 
> The improvement suggests we are not sending cache flushes when we should
> send them, or at least just handle them in md.

thanks for explanation!
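
If I understand correctly, the concern is that after 1122c0c1cc the md device may no
longer advertise a volatile write cache, so REQ_PREFLUSH/REQ_FUA from the filesystem
get dropped before they reach the member devices, and the aim7 "improvement" is really
just missing cache flushes. A simplified model of my understanding (hypothetical helper
name, not the actual block layer code):

  /* simplified model: flush/FUA bits are stripped when the queue does not
   * advertise a volatile write cache */
  static blk_opf_t strip_flush_if_no_write_cache(struct request_queue *q, blk_opf_t opf)
  {
          if (!(q->limits.features & BLK_FEAT_WRITE_CACHE))
                  opf &= ~(REQ_PREFLUSH | REQ_FUA);
          return opf;
  }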

> 
> > Do you mean this improvement is not expected, or that it exposes some
> > problems instead?
> > Then with the patch below, should the performance go back to the level of
> > the parent of 1122c0c1cc?
> > 
> > Sure! It's our great pleasure to test your patches. I noticed there are
> > [1]
> > https://lore.kernel.org/all/20240625110603.50885-2-...@lst.de/
> > which includes "[PATCH 1/7] md: set md-specific flags for all queue limits"
> > [2]
> > https://lore.kernel.org/all/20240625145955.115252-2-...@lst.de/
> > which includes "[PATCH 1/8] md: set md-specific flags for all queue limits"
> > 
> > Which one do you suggest we test?
> > Do we only need to apply the first patch "md: set md-specific flags for all
> > queue limits" on top of 1122c0c1cc?
> > Then is the expectation that the performance goes back to the parent of
> > 1122c0c1cc?
> 
> Either just the patch in reply or the entire [2] series would be fine.

I failed to apply the patch from your previous reply on top of 1122c0c1cc or the
current tip of axboe-block/for-next:
c1440ed442a58 (axboe-block/for-next) Merge branch 'for-6.11/block' into for-next

but it applies cleanly on top of linux-next:
* 0fc4bfab2cd45 (tag: next-20240625) Add linux-next specific files for 20240625

I've already started the test based on this way of applying it.
Is the expectation that the patch should not introduce a performance change
compared to 0fc4bfab2cd45?

Or if this way of applying it is not OK, please just give me guidance. Thanks!


> 
> Thanks!
> 


Re: [axboe-block:for-next] [block] bd4a633b6f: fsmark.files_per_sec -64.5% regression

2024-06-25 Thread Oliver Sang
hi, Christoph Hellwig,

On Mon, Jun 24, 2024 at 10:35:37AM +0200, Christoph Hellwig wrote:
> This is odd to say at least.  Any chance you can check the value
> of /sys/block/$DEVICE/queue/rotational for the relevant device before
> and after this commit?  And is this an ATA or NVMe SSD?
> 

Yeah, as Niklas mentioned, it's an ATA SSD.

I checked /sys/block/$DEVICE/queue/rotational before and after this commit;
both show '0'. Not sure if this is expected.
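
For reference, my understanding is that after bd4a633b6f this sysfs value is derived
from a queue_limits feature bit rather than a queue flag, roughly as in the sketch
below (hypothetical helper name, and I'm assuming the bit is called BLK_FEAT_ROTATIONAL
in this series), so the unchanged '0' at least suggests the rotational reporting itself
did not change for this device:

  /* sketch of my assumption: "rotational" now comes from the limits feature bits */
  static inline bool queue_is_rotational(struct request_queue *q)
  {
          return q->limits.features & BLK_FEAT_ROTATIONAL;
  }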

Anyway, I noticed you sent a patch [1],
so I applied it on top of bd4a633b6f and found the performance restored.

=========================================================================================
compiler/cpufreq_governor/disk/filesize/fs2/fs/iterations/kconfig/nr_directories/nr_files_per_directory/nr_threads/rootfs/sync_method/tbox_group/test_size/testcase:
  gcc-13/performance/1SSD/9B/nfsv4/btrfs/1x/x86_64-rhel-8.3/16d/256fpd/32t/debian-12-x86_64-20240206.cgz/fsyncBeforeClose/lkp-ivb-2ep2/400M/fsmark

commit:
  1122c0c1cc ("block: move cache control settings out of queue->flags")
  bd4a633b6f ("block: move the nonrot flag to queue_limits")
  e9a0f6a398 = bd4a633b6f + patch [1]

1122c0c1cc71f740 bd4a633b6f7c3c6b6ebc1a07317 e9a0f6a398f162d115d208ad95b
---------------- --------------------------- ---------------------------
         %stddev     %change %stddev     %change %stddev
             \          |                \          |                \
      4177 ±  2%     -64.7%       1475            -1.1%       4130        fsmark.files_per_sec


[1] https://lore.kernel.org/all/20240624173835.76753-1-...@lst.de/


Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-06-25 Thread Oliver Sang
hi, Christoph Hellwig,

On Tue, Jun 25, 2024 at 01:57:35AM -0700, Christoph Hellwig wrote:
> Hi Oliver,
> 
> can you test the patch below?  It restores the previous behavior if
> the device did not have a volatile write cache.  I think at least
> for raid0 and raid1 without bitmap the new behavior actually is correct
> and better, but it will need fixes for other modes.  If the underlying
> devices did have a volatile write cache I'm a bit lost what the problem
> was and this probably won't fix the issue.

I'm not sure I understand this test request. As in the title, we see a good
improvement in aim7 for 1122c0c1cc, and we didn't observe any other issues with
this commit.

Do you mean this improvement is not expected, or that it exposes some problems
instead? Then with the patch below, should the performance go back to the level
of the parent of 1122c0c1cc?

Sure! It's our great pleasure to test your patches. I noticed there are
[1]
https://lore.kernel.org/all/20240625110603.50885-2-...@lst.de/
which includes "[PATCH 1/7] md: set md-specific flags for all queue limits"
[2]
https://lore.kernel.org/all/20240625145955.115252-2-...@lst.de/
which includes "[PATCH 1/8] md: set md-specific flags for all queue limits"

Which one do you suggest we test?
Do we only need to apply the first patch "md: set md-specific flags for all
queue limits" on top of 1122c0c1cc?
Then is the expectation that the performance goes back to the parent of
1122c0c1cc?

thanks

> 
> ---
> From 81c816827197f811e14add7a79220ed9eef6af02 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Tue, 25 Jun 2024 08:48:18 +0200
> Subject: md: set md-specific flags for all queue limits
> 
> The md driver wants to enforce a number of flags for all devices, even
> when not inheriting them from the underlying devices.  To make sure these
> flags survive the queue_limits_set calls that md uses to update the
> queue limits without deriving them from the previous limits add a new
> md_init_stacking_limits helper that calls blk_set_stacking_limits and sets
> these flags.
> 
> Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags")
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/md/md.c | 13 -
>  drivers/md/md.h |  1 +
>  drivers/md/raid0.c  |  2 +-
>  drivers/md/raid1.c  |  2 +-
>  drivers/md/raid10.c |  2 +-
>  drivers/md/raid5.c  |  2 +-
>  6 files changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 69ea54aedd99a1..8368438e58e989 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5853,6 +5853,13 @@ static void mddev_delayed_delete(struct work_struct *ws)
>   kobject_put(&mddev->kobj);
>  }
>  
> +void md_init_stacking_limits(struct queue_limits *lim)
> +{
> + blk_set_stacking_limits(lim);
> + lim->features = BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA |
> + BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT;
> +}
> +
>  struct mddev *md_alloc(dev_t dev, char *name)
>  {
>   /*
> @@ -5871,10 +5878,6 @@ struct mddev *md_alloc(dev_t dev, char *name)
>   int shift;
>   int unit;
>   int error;
> - struct queue_limits lim = {
> - .features   = BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA |
> -   BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT,
> - };
>  
>   /*
>* Wait for any previous instance of this device to be completely
> @@ -5914,7 +5917,7 @@ struct mddev *md_alloc(dev_t dev, char *name)
>*/
>   mddev->hold_active = UNTIL_STOP;
>  
> - disk = blk_alloc_disk(&lim, NUMA_NO_NODE);
> + disk = blk_alloc_disk(NULL, NUMA_NO_NODE);
>   if (IS_ERR(disk)) {
>   error = PTR_ERR(disk);
>   goto out_free_mddev;
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index c4d7ebf9587d07..28cb4b0b6c1740 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -893,6 +893,7 @@ extern int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale);
>  
>  extern int mddev_init(struct mddev *mddev);
>  extern void mddev_destroy(struct mddev *mddev);
> +void md_init_stacking_limits(struct queue_limits *lim);
>  struct mddev *md_alloc(dev_t dev, char *name);
>  void mddev_put(struct mddev *mddev);
>  extern int md_run(struct mddev *mddev);
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index 62634e2a33bd0f..32d58752477847 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -379,7 +379,7 @@ static int raid0_set_limits(struct mddev *mddev)
>   struct queue_limits lim;
>   int err;
>  
> - blk_set_stacking_limits(&lim);
> + md_init_stacking_limits(&lim);
>   lim.max_hw_sectors = mddev->chunk_sectors;
>   lim.max_write_zeroes_sectors = mddev->chunk_sectors;
>   lim.io_min = mddev->chunk_sectors << 9;
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 1a0eba65b8a92b..04a0c2ca173245 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -3194,7 +3194,7 @@ static int raid1_set_limit

Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-26 Thread Oliver Sang
hi, Linus,

On Sun, Nov 26, 2023 at 03:20:58PM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 12:23, Linus Torvalds
>  wrote:
> >
> > IOW, I might have messed up some "trivial cleanup" when prepping for
> > sending it out...
> 
> Bah. Famous last words. One of the "trivial cleanups" made the code
> more "obvious" by renaming the nospec mask as just "mask".
> 
> And that trivial rename broke that patch *entirely*, because now that
> name shadowed the "fmode_t" mask argument.
> 
> Don't even ask how long it took me to go from "I *tested* this,
> dammit, now it doesn't work at all" to "Oh God, I'm so stupid".
> 
> So that nobody else would waste any time on this, attached is a new
> attempt. This time actually tested *after* the changes.

We applied the new patch on top of 0ede61d858 and confirmed the regression is
gone; it's even 3.4% better than 93faf426e3 now.

Tested-by: kernel test robot 

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-12/performance/x86_64-rhel-8.3/thread/16/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/poll2/will-it-scale

commit:
  93faf426e3 ("vfs: shave work on failed file open")
  0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU")
  c712b4365b ("Improve __fget_files_rcu() code generation (and thus 
__fget_light())")

93faf426e3cc000c 0ede61d8589cc2d93aa78230d74 c712b4365b5b4dbe1d1380edd37
---------------- --------------------------- ---------------------------
         %stddev     %change %stddev     %change %stddev
             \          |                \          |                \
    228481 ±  4%      -4.6%     217900 ±  6%     -11.7%     201857 ±  5%  meminfo.DirectMap4k
     89056            -2.0%      87309            -1.6%      87606        proc-vmstat.nr_slab_unreclaimable
     16.28            -0.7%      16.16            -1.0%      16.12        turbostat.RAMWatt
      0.01 ±  9%  +58125.6%       4.17 ±175%  +23253.5%       1.67 ±222%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    781.67 ± 10%      +6.5%     832.50 ± 19%     -14.3%     670.17 ±  4%  perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
     97958 ±  7%      -9.7%      88449 ±  4%      -0.6%      97399 ±  4%  sched_debug.cpu.avg_idle.stddev
      0.00 ± 12%     +24.2%       0.00 ± 17%      -5.2%       0.00 ±  7%  sched_debug.cpu.next_balance.stddev
   6391048            -2.9%    6208584            +3.4%    6605584        will-it-scale.16.threads
    399440            -2.9%     388036            +3.4%     412848        will-it-scale.per_thread_ops
   6391048            -2.9%    6208584            +3.4%    6605584        will-it-scale.workload
     19.99 ±  4%      -2.2       17.74            +1.2       21.18 ±  2%  perf-profile.calltrace.cycles-pp.fput.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      1.27 ±  5%      +0.8        2.11 ±  3%     +31.1       32.36 ±  2%  perf-profile.calltrace.cycles-pp.__fdget.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
     32.69 ±  4%      +5.0       37.70           -32.7        0.00        perf-profile.calltrace.cycles-pp.__fget_light.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
      0.00           +27.9       27.85            +0.0        0.00        perf-profile.calltrace.cycles-pp.__get_file_rcu.__fget_light.do_poll.do_sys_poll.__x64_sys_poll
     20.00 ±  4%      -2.3       17.75            +0.4       20.43 ±  2%  perf-profile.children.cycles-pp.fput
      0.24 ± 10%      -0.1        0.18 ±  2%      -0.1        0.18 ± 10%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      1.48 ±  5%      +0.5        1.98 ±  3%     +30.8       32.32 ±  2%  perf-profile.children.cycles-pp.__fdget
     31.85 ±  4%      +6.0       37.86           -31.8        0.00        perf-profile.children.cycles-pp.__fget_light
      0.00           +27.7       27.67            +0.0        0.00        perf-profile.children.cycles-pp.__get_file_rcu
     30.90 ±  4%     -20.6       10.35 ±  2%     -30.9        0.00        perf-profile.self.cycles-pp.__fget_light
     19.94 ±  4%      -2.4       17.53            -0.3       19.62 ±  2%  perf-profile.self.cycles-pp.fput
      9.81 ±  4%      -2.4        7.42 ±  2%      +1.7       11.51 ±  4%  perf-profile.self.cycles-pp.do_poll
      0.23 ± 11%      -0.1        0.17 ±  4%      -0.1        0.18 ± 11%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.44 ±  7%      +0.0        0.45 ±  5%      +0.1        0.52 ±  4%  perf-profile.self.cycles-pp.__poll
      0.85 ±  4%      +0.1        0.92 ±  3%     +30.3       31.17 ±  2%  perf-profile.self.cycles-pp.__fdget
      0.00           +26.5       26.48            +0.0        0.00        perf-profile.self.cycles-pp.__get_file_rcu
 2.146e+10 ±  2%      +8.5%  2.329e+10 ±  2%      -2.1%  2.101e+10        perf-stat.i.branch-instructions