Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-07-02 Thread Christoph Hellwig
On Mon, Jul 01, 2024 at 04:22:19PM +0800, Oliver Sang wrote:
> From the results below, it seems the patch set no longer introduces a
> performance improvement but rather a regression. Is this expected?

Not having the improvement at least alleviates my concerns about data
integrity.  I'm still curious where it comes from, as it isn't exactly
expected.



Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-07-01 Thread Oliver Sang
Hi, Christoph Hellwig,

On Wed, Jun 26, 2024 at 09:54:05PM -0700, Christoph Hellwig wrote:
> On Thu, Jun 27, 2024 at 10:35:38AM +0800, Oliver Sang wrote:
> > 
> > I failed to apply the patch from your previous reply on top of 1122c0c1cc or
> > the current tip of axboe-block/for-next:
> > c1440ed442a58 (axboe-block/for-next) Merge branch 'for-6.11/block' into for-next
> 
> That already includes it.

For the patch in your previous reply [1],
the bot applied it automatically as:
* 5c683739f6c2f patch in [1]
* 0fc4bfab2cd45 (tag: next-20240625) Add linux-next specific files for 20240625

For patch set [2], the bot applied it as:
* 6490f979767736 block: move dma_pad_mask into queue_limits
* 278817f42e219b block: remove the fallback case in queue_dma_alignment
* 81afb19d619a04 block: remove disk_update_readahead
* 037d85402b8b83 block: conding style fixup for blk_queue_max_guaranteed_bio
* 4fe67425ae31a8 block: convert features and flags to __bitwise types
* e3c2d2ad4136f2 block: rename BLK_FEAT_MISALIGNED
* 33ead159243d1c block: correctly report cache type
* 6725109120e0ba md: set md-specific flags for all queue limits
*   e6d130064a02f5 Merge branch 'for-6.11/block' into for-next


but both builds failed with the following errors:
  - "ERROR: modpost: \"md_init_stacking_limits\" [drivers/md/raid456.ko] 
undefined!"
  - "ERROR: modpost: \"md_init_stacking_limits\" [drivers/md/raid1.ko] 
undefined!"
  - "ERROR: modpost: \"md_init_stacking_limits\" [drivers/md/raid0.ko] 
undefined!"
  - "ERROR: modpost: \"md_init_stacking_limits\" [drivers/md/raid10.ko] 
undefined!"


Since you mentioned that the axboe-block/for-next branch already includes the
patch set, I took a snapshot of the branch several days ago, shown below:

*   bc512ae8cb934 (axboe-block/for-next) Merge branch 'for-6.11/block' into for-next   <---
|\
| * 18048c1af7836 (axboe-block/for-6.11/block) loop: Fix a race between loop detach and loop open
| * 63db4a1f795a1 block: Delete blk_queue_flag_test_and_set()
* | e21d05740862c Merge branch 'for-6.11/block' into for-next
|\|
| * e269537e491da block: clean up the check in blkdev_iomap_begin()
* | 9c6e1f8702d51 Merge branch 'for-6.11/block' into for-next
|\|
| * 69b6517687a4b block: use the right type for stub rq_integrity_vec()
* | c1440ed442a58 Merge branch 'for-6.11/block' into for-next
|\|
| * e94b45d08b5d1 block: move dma_pad_mask into queue_limits   <---
| * abfc9d810926d block: remove the fallback case in queue_dma_alignment
| * 73781b3b81e76 block: remove disk_update_readahead
| * 3302f6f090522 block: conding style fixup for blk_queue_max_guaranteed_bio
| * fcf865e357f80 block: convert features and flags to __bitwise types
| * ec9b1cf0b0ebf block: rename BLK_FEAT_MISALIGNED
| * 78887d004fb2b block: correctly report cache type
| * 573d5abf3df00 md: set md-specific flags for all queue limits   <---
* | 72e9cd924fccc Merge branch 'for-6.11/block' into for-next
|\|
| * cf546dd289e0f block: change rq_integrity_vec to respect the iterator   <---

From the results below, it seems the patch set no longer introduces a
performance improvement but rather a regression. Is this expected?

=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/load/md/rootfs/tbox_group/test/testcase:
  gcc-13/performance/4BRD_12G/xfs/x86_64-rhel-8.3/300/RAID0/debian-12-x86_64-20240206.cgz/lkp-csl-2sp3/sync_disk_rw/aim7

cf546dd289e0f6d2 573d5abf3df00c879fbd25774e4 e94b45d08b5d1c230c0f59c3eed bc512ae8cb934ac31470bc825fa
---------------- --------------------------- --------------------------- ---------------------------
         %stddev     %change %stddev     %change %stddev     %change %stddev
             \          |                \          |                \          |
     21493           -19.6%      17278           -19.2%      17371           -19.7%      17264        aim7.jobs-per-min



[1] https://lore.kernel.org/all/znqgf49cvy6w-...@infradead.org/
[2] https://lore.kernel.org/all/20240625145955.115252-2-...@lst.de/

> 
> > 
> > but it applies cleanly on top of next:
> > * 0fc4bfab2cd45 (tag: next-20240625) Add linux-next specific files for 20240625
> > 
> > I've already started the test based on applying it this way.
> > Is the expectation that the patch should not introduce a performance change
> > compared to 0fc4bfab2cd45?
> > 
> > Or, if this way of applying it is not OK, please give me guidance. Thanks!
> 
> The expectation is that the latest block branch (and thus linux-next)
> doesn't see this performance change.
> 


Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-06-26 Thread Christoph Hellwig
On Thu, Jun 27, 2024 at 10:35:38AM +0800, Oliver Sang wrote:
> 
> I failed to apply the patch from your previous reply on top of 1122c0c1cc or
> the current tip of axboe-block/for-next:
> c1440ed442a58 (axboe-block/for-next) Merge branch 'for-6.11/block' into for-next

That already includes it.

> 
> but it applies cleanly on top of next:
> * 0fc4bfab2cd45 (tag: next-20240625) Add linux-next specific files for 20240625
> 
> I've already started the test based on applying it this way.
> Is the expectation that the patch should not introduce a performance change
> compared to 0fc4bfab2cd45?
> 
> Or, if this way of applying it is not OK, please give me guidance. Thanks!

The expectation is that the latest block branch (and thus linux-next)
doesn't see this performance change.



Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-06-26 Thread Oliver Sang
Hi, Christoph Hellwig,

On Tue, Jun 25, 2024 at 08:39:50PM -0700, Christoph Hellwig wrote:
> On Wed, Jun 26, 2024 at 10:10:49AM +0800, Oliver Sang wrote:
> > I'm not sure I understand this test request. As in the title, we see a good
> > improvement in aim7 for 1122c0c1cc, and we didn't observe other issues with
> > this commit.
> 
> The improvement suggests we are not sending cache flushes when we should
> send them, or should at least handle them in md.

Thanks for the explanation!

> 
> > Do you mean this improvement is not expected, or that it exposes some
> > problems instead? Then, with the patch below, should the performance go
> > back to the level of the parent of 1122c0c1cc?
> > 
> > Sure! It's our great pleasure to test your patches. I noticed there are
> > [1]
> > https://lore.kernel.org/all/20240625110603.50885-2-...@lst.de/
> > which includes "[PATCH 1/7] md: set md-specific flags for all queue limits"
> > [2]
> > https://lore.kernel.org/all/20240625145955.115252-2-...@lst.de/
> > which includes "[PATCH 1/8] md: set md-specific flags for all queue limits"
> > 
> > Which one do you suggest we test?
> > Do we only need to apply the first patch "md: set md-specific flags for all
> > queue limits" on top of 1122c0c1cc?
> > Then is the expectation that the performance goes back to that of the parent
> > of 1122c0c1cc?
> 
Either just the patch in my reply or the entire [2] series would be fine.

I failed to apply the patch from your previous reply on top of 1122c0c1cc or
the current tip of axboe-block/for-next:
c1440ed442a58 (axboe-block/for-next) Merge branch 'for-6.11/block' into for-next

but it applies cleanly on top of next:
* 0fc4bfab2cd45 (tag: next-20240625) Add linux-next specific files for 20240625

I've already started the test based on applying it this way.
Is the expectation that the patch should not introduce a performance change
compared to 0fc4bfab2cd45?

Or, if this way of applying it is not OK, please give me guidance. Thanks!


> 
> Thanks!
> 


Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-06-25 Thread Christoph Hellwig
On Wed, Jun 26, 2024 at 10:10:49AM +0800, Oliver Sang wrote:
> I'm not sure I understand this test request. As in the title, we see a good
> improvement in aim7 for 1122c0c1cc, and we didn't observe other issues with
> this commit.

The improvement suggests we are not sending cache flushes when we should
send them, or should at least handle them in md.
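
Concretely, the block layer's flush machinery only turns REQ_PREFLUSH/REQ_FUA
into flush sequence steps when the queue advertises a volatile write cache. A
simplified sketch of that policy -- not the exact blk-flush.c code, but using
the BLK_FEAT_* flags this series introduces:

	/* If the queue lacks BLK_FEAT_WRITE_CACHE, no flush steps are
	 * generated and no flushes reach the device, so sync-heavy
	 * workloads like aim7's sync_disk_rw look "faster". */
	static unsigned int flush_policy_sketch(struct request *rq)
	{
		unsigned int policy = 0;

		if (blk_rq_sectors(rq))
			policy |= REQ_FSEQ_DATA;
		if (rq->q->limits.features & BLK_FEAT_WRITE_CACHE) {
			if (rq->cmd_flags & REQ_PREFLUSH)
				policy |= REQ_FSEQ_PREFLUSH;
			if (!(rq->q->limits.features & BLK_FEAT_FUA) &&
			    (rq->cmd_flags & REQ_FUA))
				policy |= REQ_FSEQ_POSTFLUSH;
		}
		return policy;
	}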

> Do you mean this improvement is not expected, or that it exposes some
> problems instead? Then, with the patch below, should the performance go
> back to the level of the parent of 1122c0c1cc?
> 
> Sure! It's our great pleasure to test your patches. I noticed there are
> [1]
> https://lore.kernel.org/all/20240625110603.50885-2-...@lst.de/
> which includes "[PATCH 1/7] md: set md-specific flags for all queue limits"
> [2]
> https://lore.kernel.org/all/20240625145955.115252-2-...@lst.de/
> which includes "[PATCH 1/8] md: set md-specific flags for all queue limits"
> 
> Which one do you suggest we test?
> Do we only need to apply the first patch "md: set md-specific flags for all
> queue limits" on top of 1122c0c1cc?
> Then is the expectation that the performance goes back to that of the parent
> of 1122c0c1cc?

Either just the patch in my reply or the entire [2] series would be fine.

Thanks!



Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-06-25 Thread Oliver Sang
Hi, Christoph Hellwig,

On Tue, Jun 25, 2024 at 01:57:35AM -0700, Christoph Hellwig wrote:
> Hi Oliver,
> 
> can you test the patch below?  It restores the previous behavior if
> the device did not have a volatile write cache.  I think at least
> for raid0 and raid1 without a bitmap the new behavior actually is correct
> and better, but it will need fixes for other modes.  If the underlying
> devices did have a volatile write cache, I'm a bit lost as to what the
> problem was, and this probably won't fix the issue.

I'm not sure I understand this test request. As in the title, we see a good
improvement in aim7 for 1122c0c1cc, and we didn't observe other issues with
this commit.

Do you mean this improvement is not expected, or that it exposes some
problems instead? Then, with the patch below, should the performance go
back to the level of the parent of 1122c0c1cc?

Sure! It's our great pleasure to test your patches. I noticed there are
[1]
https://lore.kernel.org/all/20240625110603.50885-2-...@lst.de/
which includes "[PATCH 1/7] md: set md-specific flags for all queue limits"
[2]
https://lore.kernel.org/all/20240625145955.115252-2-...@lst.de/
which includes "[PATCH 1/8] md: set md-specific flags for all queue limits"

Which one do you suggest we test?
Do we only need to apply the first patch "md: set md-specific flags for all
queue limits" on top of 1122c0c1cc?
Then is the expectation that the performance goes back to that of the parent
of 1122c0c1cc?

Thanks

> 
> ---
> From 81c816827197f811e14add7a79220ed9eef6af02 Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Tue, 25 Jun 2024 08:48:18 +0200
> Subject: md: set md-specific flags for all queue limits
> 
> The md driver wants to enforce a number of flags for all devices, even
> when not inheriting them from the underlying devices.  To make sure these
> flags survive the queue_limits_set calls that md uses to update the
> queue limits without deriving them from the previous limits, add a new
> md_init_stacking_limits helper that calls blk_set_stacking_limits and sets
> these flags.
> 
> Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags")
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/md/md.c     | 13 ++++++++-----
>  drivers/md/md.h     |  1 +
>  drivers/md/raid0.c  |  2 +-
>  drivers/md/raid1.c  |  2 +-
>  drivers/md/raid10.c |  2 +-
>  drivers/md/raid5.c  |  2 +-
>  6 files changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 69ea54aedd99a1..8368438e58e989 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5853,6 +5853,13 @@ static void mddev_delayed_delete(struct work_struct *ws)
>   kobject_put(&mddev->kobj);
>  }
>  
> +void md_init_stacking_limits(struct queue_limits *lim)
> +{
> + blk_set_stacking_limits(lim);
> + lim->features = BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA |
> + BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT;
> +}
> +
>  struct mddev *md_alloc(dev_t dev, char *name)
>  {
>   /*
> @@ -5871,10 +5878,6 @@ struct mddev *md_alloc(dev_t dev, char *name)
>   int shift;
>   int unit;
>   int error;
> - struct queue_limits lim = {
> - .features   = BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA |
> -   BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT,
> - };
>  
>   /*
>* Wait for any previous instance of this device to be completely
> @@ -5914,7 +5917,7 @@ struct mddev *md_alloc(dev_t dev, char *name)
>*/
>   mddev->hold_active = UNTIL_STOP;
>  
> - disk = blk_alloc_disk(&lim, NUMA_NO_NODE);
> + disk = blk_alloc_disk(NULL, NUMA_NO_NODE);
>   if (IS_ERR(disk)) {
>   error = PTR_ERR(disk);
>   goto out_free_mddev;
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index c4d7ebf9587d07..28cb4b0b6c1740 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -893,6 +893,7 @@ extern int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale);
>  
>  extern int mddev_init(struct mddev *mddev);
>  extern void mddev_destroy(struct mddev *mddev);
> +void md_init_stacking_limits(struct queue_limits *lim);
>  struct mddev *md_alloc(dev_t dev, char *name);
>  void mddev_put(struct mddev *mddev);
>  extern int md_run(struct mddev *mddev);
> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
> index 62634e2a33bd0f..32d58752477847 100644
> --- a/drivers/md/raid0.c
> +++ b/drivers/md/raid0.c
> @@ -379,7 +379,7 @@ static int raid0_set_limits(struct mddev *mddev)
>   struct queue_limits lim;
>   int err;
>  
> - blk_set_stacking_limits(&lim);
> + md_init_stacking_limits(&lim);
>   lim.max_hw_sectors = mddev->chunk_sectors;
>   lim.max_write_zeroes_sectors = mddev->chunk_sectors;
>   lim.io_min = mddev->chunk_sectors << 9;
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 1a0eba65b8a92b..04a0c2ca173245 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -3194,7 +3194,7 @@ static int raid1_set_limits(struct mddev *mddev)

Re: [axboe-block:for-next] [block] 1122c0c1cc: aim7.jobs-per-min 22.6% improvement

2024-06-25 Thread Christoph Hellwig
Hi Oliver,

can you test the patch below?  It restores the previous behavior if
the device did not have a volatile write cache.  I think at least
for raid0 and raid1 without a bitmap the new behavior actually is correct
and better, but it will need fixes for other modes.  If the underlying
devices did have a volatile write cache, I'm a bit lost as to what the
problem was, and this probably won't fix the issue.
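
The failure mode the patch addresses: md rebuilds its queue_limits from
scratch, blk_set_stacking_limits() leaves the features field empty, and
queue_limits_set() applies exactly what is in the struct, so the cache/FUA
flags are lost unless md re-sets them. Roughly -- a sketch of the sequence,
not the actual md code:

	struct queue_limits lim;

	blk_set_stacking_limits(&lim);	/* lim.features starts out empty */
	/* ... personality fills in limits, rdev limits are stacked ... */
	err = queue_limits_set(mddev->gendisk->queue, &lim);
	/* BLK_FEAT_WRITE_CACHE/BLK_FEAT_FUA are now clear unless re-set,
	 * so cache flushes are silently dropped from then on. */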

---
From 81c816827197f811e14add7a79220ed9eef6af02 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig 
Date: Tue, 25 Jun 2024 08:48:18 +0200
Subject: md: set md-specific flags for all queue limits

The md driver wants to enforce a number of flags for all devices, even
when not inheriting them from the underlying devices.  To make sure these
flags survive the queue_limits_set calls that md uses to update the
queue limits without deriving them from the previous limits, add a new
md_init_stacking_limits helper that calls blk_set_stacking_limits and sets
these flags.

Fixes: 1122c0c1cc71 ("block: move cache control settings out of queue->flags")
Signed-off-by: Christoph Hellwig 
---
 drivers/md/md.c     | 13 ++++++++-----
 drivers/md/md.h     |  1 +
 drivers/md/raid0.c  |  2 +-
 drivers/md/raid1.c  |  2 +-
 drivers/md/raid10.c |  2 +-
 drivers/md/raid5.c  |  2 +-
 6 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 69ea54aedd99a1..8368438e58e989 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5853,6 +5853,13 @@ static void mddev_delayed_delete(struct work_struct *ws)
	kobject_put(&mddev->kobj);
 }
 
+void md_init_stacking_limits(struct queue_limits *lim)
+{
+   blk_set_stacking_limits(lim);
+   lim->features = BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA |
+   BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT;
+}
+
 struct mddev *md_alloc(dev_t dev, char *name)
 {
/*
@@ -5871,10 +5878,6 @@ struct mddev *md_alloc(dev_t dev, char *name)
int shift;
int unit;
int error;
-   struct queue_limits lim = {
-   .features   = BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA |
- BLK_FEAT_IO_STAT | BLK_FEAT_NOWAIT,
-   };
 
/*
 * Wait for any previous instance of this device to be completely
@@ -5914,7 +5917,7 @@ struct mddev *md_alloc(dev_t dev, char *name)
 */
mddev->hold_active = UNTIL_STOP;
 
-   disk = blk_alloc_disk(&lim, NUMA_NO_NODE);
+   disk = blk_alloc_disk(NULL, NUMA_NO_NODE);
if (IS_ERR(disk)) {
error = PTR_ERR(disk);
goto out_free_mddev;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index c4d7ebf9587d07..28cb4b0b6c1740 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -893,6 +893,7 @@ extern int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale);
 
 extern int mddev_init(struct mddev *mddev);
 extern void mddev_destroy(struct mddev *mddev);
+void md_init_stacking_limits(struct queue_limits *lim);
 struct mddev *md_alloc(dev_t dev, char *name);
 void mddev_put(struct mddev *mddev);
 extern int md_run(struct mddev *mddev);
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 62634e2a33bd0f..32d58752477847 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -379,7 +379,7 @@ static int raid0_set_limits(struct mddev *mddev)
struct queue_limits lim;
int err;
 
-   blk_set_stacking_limits(&lim);
+   md_init_stacking_limits(&lim);
lim.max_hw_sectors = mddev->chunk_sectors;
lim.max_write_zeroes_sectors = mddev->chunk_sectors;
lim.io_min = mddev->chunk_sectors << 9;
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 1a0eba65b8a92b..04a0c2ca173245 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3194,7 +3194,7 @@ static int raid1_set_limits(struct mddev *mddev)
struct queue_limits lim;
int err;
 
-   blk_set_stacking_limits(&lim);
+   md_init_stacking_limits(&lim);
lim.max_write_zeroes_sectors = 0;
err = mddev_stack_rdev_limits(mddev, &lim, MDDEV_STACK_INTEGRITY);
if (err) {
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 3334aa803c8380..2a9c4ee982e023 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3974,7 +3974,7 @@ static int raid10_set_queue_limits(struct mddev *mddev)
struct queue_limits lim;
int err;
 
-   blk_set_stacking_limits(&lim);
+   md_init_stacking_limits(&lim);
lim.max_write_zeroes_sectors = 0;
lim.io_min = mddev->chunk_sectors << 9;
lim.io_opt = lim.io_min * raid10_nr_stripes(conf);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0192a6323f09ba..10219205160bbf 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7708,7 +7708,7 @@ static int raid5_set_limits(struct mddev *mddev)
 */
stripe = roundup_pow_of_two(data_disks * (mddev->chunk_sectors << 9));
 
-   blk_set_stacking_limits(&lim);
+   md_init_stacking_limits(&lim);