Re: [PATCH 02/26] sd: move zone limits setup out of sd_read_block_characteristics

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:52 PM, Christoph Hellwig wrote:
> On Tue, Jun 11, 2024 at 02:51:24PM +0900, Damien Le Moal wrote:
>>> -   if (lim->zoned)
>>> +   if (sdkp->device->type == TYPE_ZBC)
>>
>> Nit: use sd_is_zoned() here ?
> 
> Yes.
> 
>>> -   if (!sd_is_zoned(sdkp))
>>> +   if (!sd_is_zoned(sdkp)) {
>>> +   lim->zoned = false;
>>
>> Maybe we should clear the other zone related limits here ? If the drive is
>> reformatted/converted from SMR to CMR (FORMAT WITH PRESET), the other zone
>> limits may be set already, no ?
> 
> blk_validate_zoned_limits already takes care of that.

I do not think it does:

static int blk_validate_zoned_limits(struct queue_limits *lim)
{
if (!lim->zoned) {
if (WARN_ON_ONCE(lim->max_open_zones) ||
WARN_ON_ONCE(lim->max_active_zones) ||
WARN_ON_ONCE(lim->zone_write_granularity) ||
WARN_ON_ONCE(lim->max_zone_append_sectors))
return -EINVAL;
return 0;
}
...

So setting lim->zoned to false without clearing the other limits will
potentially trigger these warnings...
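
To illustrate the concern, here is a minimal user-space model of the validation above (struct and return value simplified; the real kernel function also WARNs via WARN_ON_ONCE): a stale zone limit left set while lim->zoned is false makes validation fail.

```c
#include <stdbool.h>

/* Simplified stand-in for the kernel's struct queue_limits; only the
 * fields from the snippet above are modeled. */
struct queue_limits {
	bool zoned;
	unsigned int max_open_zones;
	unsigned int max_active_zones;
	unsigned int zone_write_granularity;
	unsigned int max_zone_append_sectors;
};

/* Models blk_validate_zoned_limits(): in the kernel each non-zero
 * limit also triggers WARN_ON_ONCE before -EINVAL is returned. */
static int validate_zoned_limits(const struct queue_limits *lim)
{
	if (!lim->zoned) {
		if (lim->max_open_zones || lim->max_active_zones ||
		    lim->zone_write_granularity ||
		    lim->max_zone_append_sectors)
			return -22; /* -EINVAL */
		return 0;
	}
	return 0; /* checks for the zoned case elided */
}
```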

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 02/26] sd: move zone limits setup out of sd_read_block_characteristics

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:54 PM, Christoph Hellwig wrote:
> On Tue, Jun 11, 2024 at 07:52:39AM +0200, Christoph Hellwig wrote:
>>> Maybe we should clear the other zone related limits here ? If the drive is
>>> reformatted/converted from SMR to CMR (FORMAT WITH PRESET), the other zone
>>> limits may be set already, no ?
>>
>> blk_validate_zoned_limits already takes care of that.
> 
> Sorry, brainfart.  The integrity code does that, but not the zoned
> code.  I suspect the core code might be a better place for it,
> though.

Yes. Just replied to your previous email before seeing this one.
I think that:

static int blk_validate_zoned_limits(struct queue_limits *lim)
{
if (!lim->zoned) {
if (WARN_ON_ONCE(lim->max_open_zones) ||
WARN_ON_ONCE(lim->max_active_zones) ||
WARN_ON_ONCE(lim->zone_write_granularity) ||
WARN_ON_ONCE(lim->max_zone_append_sectors))
return -EINVAL;
return 0;
}
...

could be changed into:

static int blk_validate_zoned_limits(struct queue_limits *lim)
{
if (!lim->zoned) {
lim->max_open_zones = 0;
lim->max_active_zones = 0;
lim->zone_write_granularity = 0;
lim->max_zone_append_sectors = 0;
return 0;
}

But then we would not see "bad" drivers. Could have a small

blk_clear_zoned_limits(struct queue_limits *lim)

helper too.
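
A minimal sketch of such a helper, assuming the queue_limits field names from the snippets above (user-space model, not the kernel structure):

```c
#include <stdbool.h>

/* Simplified model of the zone-related queue_limits fields. */
struct queue_limits {
	bool zoned;
	unsigned int max_open_zones;
	unsigned int max_active_zones;
	unsigned int zone_write_granularity;
	unsigned int max_zone_append_sectors;
};

/* Proposed blk_clear_zoned_limits() helper: clear every zone-related
 * limit in one place so callers cannot forget one of them. */
static void blk_clear_zoned_limits(struct queue_limits *lim)
{
	lim->zoned = false;
	lim->max_open_zones = 0;
	lim->max_active_zones = 0;
	lim->zone_write_granularity = 0;
	lim->max_zone_append_sectors = 0;
}
```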

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 08/26] virtio_blk: remove virtblk_update_cache_mode

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> virtblk_update_cache_mode boils down to a single call to
> blk_queue_write_cache.  Remove it in preparation for moving the cache
> control flags into the queue_limits.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 09/26] nbd: move setting the cache control flags to __nbd_set_size

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move setting the cache control flags in nbd in preparation for moving
> these flags into the queue_limits structure.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 10/26] xen-blkfront: don't disable cache flushes when they fail

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> blkfront always had a robust negotiation protocol for detecting a write
> cache.  Stop simply disabling cache flushes when they fail as that is
> a grave error.
> 
> Signed-off-by: Christoph Hellwig 

Looks good to me, but maybe mention the removal of xlvbd_flush() as well ?
And it feels like the "stop disabling cache flushes when they fail" part should
be a fix patch sent separately...

Anyway, for both parts, feel free to add:

Reviewed-by: Damien Le Moal 

> ---
>  drivers/block/xen-blkfront.c | 29 +
>  1 file changed, 9 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 9b4ec3e4908cce..9794ac2d3299d1 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -982,18 +982,6 @@ static const char *flush_info(struct blkfront_info *info)
>   return "barrier or flush: disabled;";
>  }
>  
> -static void xlvbd_flush(struct blkfront_info *info)
> -{
> - blk_queue_write_cache(info->rq, info->feature_flush ? true : false,
> -   info->feature_fua ? true : false);
> - pr_info("blkfront: %s: %s %s %s %s %s %s %s\n",
> - info->gd->disk_name, flush_info(info),
> - "persistent grants:", info->feature_persistent ?
> - "enabled;" : "disabled;", "indirect descriptors:",
> - info->max_indirect_segments ? "enabled;" : "disabled;",
> - "bounce buffer:", info->bounce ? "enabled" : "disabled;");
> -}
> -
>  static int xen_translate_vdev(int vdevice, int *minor, unsigned int *offset)
>  {
>   int major;
> @@ -1162,7 +1150,15 @@ static int xlvbd_alloc_gendisk(blkif_sector_t capacity,
>   info->sector_size = sector_size;
>   info->physical_sector_size = physical_sector_size;
>  
> - xlvbd_flush(info);
> + blk_queue_write_cache(info->rq, info->feature_flush ? true : false,
> +   info->feature_fua ? true : false);
> +
> + pr_info("blkfront: %s: %s %s %s %s %s %s %s\n",
> + info->gd->disk_name, flush_info(info),
> + "persistent grants:", info->feature_persistent ?
> + "enabled;" : "disabled;", "indirect descriptors:",
> + info->max_indirect_segments ? "enabled;" : "disabled;",
> + "bounce buffer:", info->bounce ? "enabled" : "disabled;");
>  
>   if (info->vdisk_info & VDISK_READONLY)
>   set_disk_ro(gd, 1);
> @@ -1622,13 +1618,6 @@ static irqreturn_t blkif_interrupt(int irq, void *dev_id)
>  info->gd->disk_name, op_name(bret.operation));
>   blkif_req(req)->error = BLK_STS_NOTSUPP;
>   }
> - if (unlikely(blkif_req(req)->error)) {
> - if (blkif_req(req)->error == BLK_STS_NOTSUPP)
> - blkif_req(req)->error = BLK_STS_OK;
> - info->feature_fua = 0;
> - info->feature_flush = 0;
> - xlvbd_flush(info);
> - }
>   fallthrough;
>   case BLKIF_OP_READ:
>   case BLKIF_OP_WRITE:

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 11/26] block: freeze the queue in queue_attr_store

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> queue_attr_store updates attributes used to control generating I/O, and
> can cause malformed bios if changed with I/O in flight.  Freeze the queue
> in common code instead of adding it to almost every attribute.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 12/26] block: remove blk_flush_policy

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Fold blk_flush_policy into the only caller to prepare for pending changes
> to it.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 13/26] block: move cache control settings out of queue->flags

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the cache control settings into the queue_limits so that they
> can be set atomically and all I/O is frozen when changing the
> flags.

...so that they can be set atomically with the device queue frozen when
changing the flags.

may be better.

> 
> Add new features and flags field for the driver set flags, and internal
> (usually sysfs-controlled) flags in the block layer.  Note that we'll
> eventually remove enough field from queue_limits to bring it back to the
> previous size.
> 
> The disable flag is inverted compared to the previous meaning, which
> means it now survives a rescan, similar to the max_sectors and
> max_discard_sectors user limits.
> 
> The FLUSH and FUA flags are now inherited by blk_stack_limits, which
> simplified the code in dm a lot, but also causes a slight behavior
> change in that dm-switch and dm-unstripe now advertise a write cache
> despite setting num_flush_bios to 0.  The I/O path will handle this
> gracefully, but as far as I can tell the lack of num_flush_bios
> and thus flush support is a pre-existing data integrity bug in those
> targets that really needs fixing, after which a non-zero num_flush_bios
> should be required in dm for targets that map to underlying devices.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  .../block/writeback_cache_control.rst | 67 +++
>  arch/um/drivers/ubd_kern.c|  2 +-
>  block/blk-core.c  |  2 +-
>  block/blk-flush.c |  9 ++-
>  block/blk-mq-debugfs.c|  2 -
>  block/blk-settings.c  | 29 ++--
>  block/blk-sysfs.c | 29 +---
>  block/blk-wbt.c   |  4 +-
>  drivers/block/drbd/drbd_main.c|  2 +-
>  drivers/block/loop.c  |  9 +--
>  drivers/block/nbd.c   | 14 ++--
>  drivers/block/null_blk/main.c | 12 ++--
>  drivers/block/ps3disk.c   |  7 +-
>  drivers/block/rnbd/rnbd-clt.c | 10 +--
>  drivers/block/ublk_drv.c  |  8 ++-
>  drivers/block/virtio_blk.c| 20 --
>  drivers/block/xen-blkfront.c  |  9 ++-
>  drivers/md/bcache/super.c |  7 +-
>  drivers/md/dm-table.c | 39 +++
>  drivers/md/md.c   |  8 ++-
>  drivers/mmc/core/block.c  | 42 ++--
>  drivers/mmc/core/queue.c  | 12 ++--
>  drivers/mmc/core/queue.h  |  3 +-
>  drivers/mtd/mtd_blkdevs.c |  5 +-
>  drivers/nvdimm/pmem.c |  4 +-
>  drivers/nvme/host/core.c  |  7 +-
>  drivers/nvme/host/multipath.c |  6 --
>  drivers/scsi/sd.c | 28 +---
>  include/linux/blkdev.h| 38 +--
>  29 files changed, 227 insertions(+), 207 deletions(-)
> 
> diff --git a/Documentation/block/writeback_cache_control.rst b/Documentation/block/writeback_cache_control.rst
> index b208488d0aae85..9cfe27f90253c7 100644
> --- a/Documentation/block/writeback_cache_control.rst
> +++ b/Documentation/block/writeback_cache_control.rst
> @@ -46,41 +46,50 @@ worry if the underlying devices need any explicit cache flushing and how
>  the Forced Unit Access is implemented.  The REQ_PREFLUSH and REQ_FUA flags
>  may both be set on a single bio.
>  
> +Feature settings for block drivers
> +--
>  
> -Implementation details for bio based block drivers
> ---
> +For devices that do not support volatile write caches there is no driver
> +support required, the block layer completes empty REQ_PREFLUSH requests before
> +entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from
> +requests that have a payload.
>  
> -These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit
> -directly below the submit_bio interface.  For remapping drivers the REQ_FUA
> -bits need to be propagated to underlying devices, and a global flush needs
> -to be implemented for bios with the REQ_PREFLUSH bit set.  For real device
> -drivers that do not have a volatile cache the REQ_PREFLUSH and REQ_FUA bits
> -on non-empty bios can simply be ignored, and REQ_PREFLUSH requests without
> -data can be completed successfully without doing any work.  Drivers for
> -devices with volatile caches need to implement the support for these
> -flags themselves without any help from the block layer.
> +For devices with volatile write caches the driver needs to tell the block layer
> +that it supports flushing caches by setting the
>  
> +   BLK_FEAT_WRITE_CACHE
>  
> -Implementation details for request_fn based block drivers
> 

Re: [PATCH 14/26] block: move the nonrot flag to queue_limits

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the norot flag into the queue_limits feature field so that it can be

s/norot/nonrot

> set atomically and all I/O is frozen when changing the flag.

and... -> with the queue frozen when... ?

> 
> Use the chance to switch to defaulting to non-rotational and require
> the driver to opt into rotational, which matches the polarity of the
> sysfs interface.
> 
> For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new
> rotational flag is not set as they clearly are not rotational despite
> this being a behavior change.  There are some other drivers that
> unconditionally set the rotational flag to keep the existing behavior
> as they arguably can be used on rotational devices even if that is
> probably not their main use today (e.g. virtio_blk and drbd).
> 
> The flag is automatically inherited in blk_stack_limits matching the
> existing behavior in dm and md.
> 
> Signed-off-by: Christoph Hellwig 

Other than that, looks good to me.

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 15/26] block: move the add_random flag to queue_limits

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the add_random flag into the queue_limits feature field so that it
> can be set atomically and all I/O is frozen when changing the flag.

Same remark as the previous patches for the end of this sentence.

> 
> Note that this also removes code from dm to clear the flag based on
> the underlying devices, which can't be reached as dm devices will
> always start out without the flag set.
> 
> Signed-off-by: Christoph Hellwig 

Other than that, looks OK to me.

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 16/26] block: move the io_stat flag setting to queue_limits

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the io_stat flag into the queue_limits feature field so that it
> can be set atomically and all I/O is frozen when changing the flag.

Why a feature ? It seems more appropriate for io_stat to be a flag rather than
a feature as that is a block layer thing rather than a device characteristic, 
no ?

> 
> Simplify md and dm to set the flag unconditionally instead of avoiding
> setting a simple flag for cases where it already is set by other means,
> which is a bit pointless.
> 
> Signed-off-by: Christoph Hellwig 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 01/26] sd: fix sd_is_zoned

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

Since commit 7437bb73f087 ("block: remove support for the host aware zone
model"), only ZBC devices expose a zoned access model.  sd_is_zoned is
used to check for that and thus return false for host aware devices.

Fixes: 7437bb73f087 ("block: remove support for the host aware zone model")
Signed-off-by: Christoph Hellwig 
---
  drivers/scsi/sd.h | 7 ++-
  drivers/scsi/sd_zbc.c | 7 +--
  2 files changed, 7 insertions(+), 7 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 02/26] sd: move zone limits setup out of sd_read_block_characteristics

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

Move a bit of code that sets up the zone flag and the write granularity
into sd_zbc_read_zones to be with the rest of the zoned limits.

Signed-off-by: Christoph Hellwig 
---
  drivers/scsi/sd.c | 21 +
  drivers/scsi/sd_zbc.c | 13 -
  2 files changed, 13 insertions(+), 21 deletions(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 85b45345a27739..5bfed61c70db8f 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3308,29 +3308,10 @@ static void sd_read_block_characteristics(struct scsi_disk *sdkp,
blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, q);
}
  
-

-#ifdef CONFIG_BLK_DEV_ZONED /* sd_probe rejects ZBD devices early otherwise */
-   if (sdkp->device->type == TYPE_ZBC) {
-   lim->zoned = true;
-
-   /*
-* Per ZBC and ZAC specifications, writes in sequential write
-* required zones of host-managed devices must be aligned to
-* the device physical block size.
-*/
-   lim->zone_write_granularity = sdkp->physical_block_size;
-   } else {
-   /*
-* Host-aware devices are treated as conventional.
-*/
-   lim->zoned = false;
-   }
-#endif /* CONFIG_BLK_DEV_ZONED */
-
if (!sdkp->first_scan)
return;
  
-	if (lim->zoned)
+   if (sdkp->device->type == TYPE_ZBC)

Why not sd_is_zoned()?

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 17/26] block: move the stable_write flag to queue_limits

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the io_stat flag into the queue_limits feature field so that it can

s/io_stat/stable_write

> be set atomically and all I/O is frozen when changing the flag.
> 
> The flag is now inherited by blk_stack_limits, which greatly simplifies
> the code in dm, and fixed md which previously did not pass on the flag
> set on lower devices.
> 
> Signed-off-by: Christoph Hellwig 

Other than the nit above, looks OK to me.

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 18/26] block: move the synchronous flag to queue_limits

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the synchronous flag into the queue_limits feature field so that it
> can be set atomically and all I/O is frozen when changing the flag.
> 
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 03/26] loop: stop using loop_reconfigure_limits in __loop_clr_fd

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

__loop_clr_fd wants to clear all settings on the device.  Prepare for
moving more settings into the block limits by open coding
loop_reconfigure_limits.

Signed-off-by: Christoph Hellwig 
---
  drivers/block/loop.c | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 04/26] loop: always update discard settings in loop_reconfigure_limits

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

Simplify loop_reconfigure_limits by always updating the discard limits.
This adds a little more work to loop_set_block_size, but doesn't change
the outcome as the discard flag won't change.

Signed-off-by: Christoph Hellwig 
---
  drivers/block/loop.c | 10 --
  1 file changed, 4 insertions(+), 6 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 19/26] block: move the nowait flag to queue_limits

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the nowait flag into the queue_limits feature field so that it
> can be set atomically and all I/O is frozen when changing the flag.
> 
> Stacking drivers are simplified in that they now can simply set the
> flag, and blk_stack_limits will clear it when the features is not
> supported by any of the underlying devices.
> 
> Signed-off-by: Christoph Hellwig 


> @@ -1825,9 +1815,7 @@ int dm_table_set_restrictions(struct dm_table *t, 
> struct request_queue *q,
>   int r;
>  
>   if (dm_table_supports_nowait(t))
> - blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);
> - else
> - blk_queue_flag_clear(QUEUE_FLAG_NOWAIT, q);
> + limits->features &= ~BLK_FEAT_NOWAIT;

Shouldn't you set the flag here instead of clearing it ?
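
For reference, a user-space sketch of the polarity being questioned, assuming that under the new scheme stacking drivers start out with the feature bit set and it is only cleared when the table does not support it (the bit value and helper names below are illustrative, not the kernel's):

```c
#include <stdbool.h>

#define BLK_FEAT_NOWAIT (1u << 0) /* illustrative bit value */

struct queue_limits {
	unsigned int features;
};

/* With the stacking default already setting the feature, the dm code
 * should clear it when nowait is NOT supported -- i.e. the condition
 * in the hunk above reads as if it needs to be inverted. */
static void dm_apply_nowait(struct queue_limits *lim, bool supports_nowait)
{
	if (!supports_nowait)
		lim->features &= ~BLK_FEAT_NOWAIT;
}
```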

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 05/26] loop: regularize upgrading the block size for direct I/O

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

The LOOP_CONFIGURE path automatically upgrades the block size to that
of the underlying file for O_DIRECT file descriptors, but the
LOOP_SET_BLOCK_SIZE path does not.  Fix this by lifting the code to
pick the block size into common code.

Signed-off-by: Christoph Hellwig 
---
  drivers/block/loop.c | 25 +++--
  1 file changed, 15 insertions(+), 10 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 20/26] block: move the dax flag to queue_limits

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the dax flag into the queue_limits feature field so that it
> can be set atomically and all I/O is frozen when changing the flag.
> 
> Signed-off-by: Christoph Hellwig 

Looks good.

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 06/26] loop: also use the default block size from an underlying block device

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

Fix the code in loop_reconfigure_limits to pick a default block size for
O_DIRECT file descriptors to also work when the loop device sits on top
of a block device and not just on a regular file on a block device based
file system.

Signed-off-by: Christoph Hellwig 
---
  drivers/block/loop.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 07/26] loop: fold loop_update_rotational into loop_reconfigure_limits

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

This prepares for moving the rotational flag into the queue_limits and
also fixes it for the case where the loop device is backed by a block
device.

Signed-off-by: Christoph Hellwig 
---
  drivers/block/loop.c | 23 ---
  1 file changed, 4 insertions(+), 19 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 08/26] virtio_blk: remove virtblk_update_cache_mode

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

virtblk_update_cache_mode boils down to a single call to
blk_queue_write_cache.  Remove it in preparation for moving the cache
control flags into the queue_limits.

Signed-off-by: Christoph Hellwig 
---
  drivers/block/virtio_blk.c | 13 +++--
  1 file changed, 3 insertions(+), 10 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 09/26] nbd: move setting the cache control flags to __nbd_set_size

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

Move setting the cache control flags in nbd in preparation for moving
these flags into the queue_limits structure.

Signed-off-by: Christoph Hellwig 
---
  drivers/block/nbd.c | 17 +++--
  1 file changed, 7 insertions(+), 10 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 21/26] block: move the poll flag to queue_limits

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the poll flag into the queue_limits feature field so that it
> can be set atomically and all I/O is frozen when changing the flag.
> 
> Stacking drivers are simplified in that they now can simply set the
> flag, and blk_stack_limits will clear it when the features is not
> supported by any of the underlying devices.
> 
> Signed-off-by: Christoph Hellwig 

Kind of the same remark as for io_stat about this not really being a device
feature. But I guess seeing "features" as a queue feature rather than just a
device feature makes it OK to have poll (and io_stat) as a feature rather than
a flag.

So:

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 10/26] xen-blkfront: don't disable cache flushes when they fail

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

blkfront always had a robust negotiation protocol for detecting a write
cache.  Stop simply disabling cache flushes when they fail as that is
a grave error.

Signed-off-by: Christoph Hellwig 
---
  drivers/block/xen-blkfront.c | 29 +
  1 file changed, 9 insertions(+), 20 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 11/26] block: freeze the queue in queue_attr_store

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

queue_attr_store updates attributes used to control generating I/O, and
can cause malformed bios if changed with I/O in flight.  Freeze the queue
in common code instead of adding it to almost every attribute.

Signed-off-by: Christoph Hellwig 
---
  block/blk-mq.c| 5 +++--
  block/blk-sysfs.c | 9 ++---
  2 files changed, 5 insertions(+), 9 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 22/26] block: move the zoned flag into the feature field

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the boolean zoned field into the flags field to reclaim a little
> bit of space.

Nit: flags -> feature flags

> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 12/26] block: remove blk_flush_policy

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

Fold blk_flush_policy into the only caller to prepare for pending changes
to it.

Signed-off-by: Christoph Hellwig 
---
  block/blk-flush.c | 33 +++--
  1 file changed, 15 insertions(+), 18 deletions(-)


Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
--
Dr. Hannes Reinecke  Kernel Storage Architect
h...@suse.de+49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich




Re: [PATCH 23/26] block: move the zone_resetall flag to queue_limits

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the zone_resetall flag into the queue_limits feature field so that
> it can be set atomically and all I/O is frozen when changing the flag.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 24/26] block: move the pci_p2pdma flag to queue_limits

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the pci_p2pdma flag into the queue_limits feature field so that it
> can be set atomically and all I/O is frozen when changing the flag.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 25/26] block: move the skip_tagset_quiesce flag to queue_limits

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the skip_tagset_quiesce flag into the queue_limits feature field so
> that it can be set atomically and all I/O is frozen when changing the
> flag.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 26/26] block: move the bounce flag into the feature field

2024-06-11 Thread Damien Le Moal
On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> Move the bounce field into the flags field to reclaim a little bit of

s/flags/feature

> space.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 02/12] md: add a new enum type sync_action

2024-06-11 Thread Mariusz Tkaczyk
On Mon, 3 Jun 2024 20:58:05 +0800
Yu Kuai  wrote:

> In order to make code related to sync_thread cleaner in following
> patches, also add detail comment about each sync action. And also
> prepare to remove the related recovery_flags in the future.
> 
> Signed-off-by: Yu Kuai 
> ---
>  drivers/md/md.h | 57 -
>  1 file changed, 56 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 170412a65b63..6b9d9246f260 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -34,6 +34,61 @@
>   */
>  #define  MD_FAILFAST (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT)
>  
> +/* Status of sync thread. */
> +enum sync_action {
> + /*
> +  * Represent by MD_RECOVERY_SYNC, start when:
> +  * 1) after assemble, sync data from first rdev to other copies, this
> +  * must be done first before other sync actions and will only execute
> +  * once;
> +  * 2) resize the array(notice that this is not reshape), sync data for
> +  * the new range;
> +  */
> + ACTION_RESYNC,
> + /*
> +  * Represent by MD_RECOVERY_RECOVER, start when:
> +  * 1) for new replacement, sync data based on the replace rdev or
> +  * available copies from other rdev;
> +  * 2) for new member disk while the array is degraded, sync data from
> +  * other rdev;
> +  * 3) reassemble after power failure or re-add a hot removed rdev, sync
> +  * data from first rdev to other copies based on bitmap;
> +  */
> + ACTION_RECOVER,
> + /*
> +  * Represent by MD_RECOVERY_SYNC | MD_RECOVERY_REQUESTED |
> +  * MD_RECOVERY_CHECK, start when user echo "check" to sysfs api
> +  * sync_action, used to check if data copies from different rdev are
> +  * the same. The number of mismatch sectors will be exported to user
> +  * by sysfs api mismatch_cnt;
> +  */
> + ACTION_CHECK,
> + /*
> +  * Represent by MD_RECOVERY_SYNC | MD_RECOVERY_REQUESTED, start when
> +  * user echo "repair" to sysfs api sync_action, usually paired with
> +  * ACTION_CHECK, used to force syncing data once user found that there
> +  * are inconsistent data,
> +  */
> + ACTION_REPAIR,
> + /*
> +  * Represent by MD_RECOVERY_RESHAPE, start when new member disk is added
> +  * to the conf, notice that this is different from spares or
> +  * replacement;
> +  */
> + ACTION_RESHAPE,
> + /*
> +  * Represent by MD_RECOVERY_FROZEN, can be set by sysfs api sync_action
> +  * or internal usage like setting the array read-only, will forbid above
> +  * actions.
> +  */
> + ACTION_FROZEN,
> + /*
> +  * All above actions don't match.
> +  */
> + ACTION_IDLE,
> + NR_SYNC_ACTIONS,
> +};

I'd like the counter to be kept in the same style as the rest of the enum values, like ACTION_COUNT.

Anyway LGTM.

Mariusz
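
As a side note on the sentinel naming being discussed: whatever the trailing enumerator is called, its typical use is sizing per-action tables. A small sketch using the actions from the quoted patch (the name strings here are illustrative):

```c
/* Enum copied from the quoted patch; the trailing sentinel sizes
 * lookup tables and bounds iteration over all actions. */
enum sync_action {
	ACTION_RESYNC,
	ACTION_RECOVER,
	ACTION_CHECK,
	ACTION_REPAIR,
	ACTION_RESHAPE,
	ACTION_FROZEN,
	ACTION_IDLE,
	NR_SYNC_ACTIONS,
};

/* Hypothetical per-action name table sized by the sentinel. */
static const char *const action_name[NR_SYNC_ACTIONS] = {
	[ACTION_RESYNC]  = "resync",
	[ACTION_RECOVER] = "recover",
	[ACTION_CHECK]   = "check",
	[ACTION_REPAIR]  = "repair",
	[ACTION_RESHAPE] = "reshape",
	[ACTION_FROZEN]  = "frozen",
	[ACTION_IDLE]    = "idle",
};
```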



Re: [PATCH 13/26] block: move cache control settings out of queue->flags

2024-06-11 Thread Hannes Reinecke

On 6/11/24 07:19, Christoph Hellwig wrote:

Move the cache control settings into the queue_limits so that they
can be set atomically and all I/O is frozen when changing the
flags.

Add new features and flags field for the driver set flags, and internal
(usually sysfs-controlled) flags in the block layer.  Note that we'll
eventually remove enough field from queue_limits to bring it back to the
previous size.

The disable flag is inverted compared to the previous meaning, which
means it now survives a rescan, similar to the max_sectors and
max_discard_sectors user limits.

The FLUSH and FUA flags are now inherited by blk_stack_limits, which
simplifies the code in dm a lot, but also causes a slight behavior
change in that dm-switch and dm-unstripe now advertise a write cache
despite setting num_flush_bios to 0.  The I/O path will handle this
gracefully, but as far as I can tell the lack of num_flush_bios
and thus flush support is a pre-existing data integrity bug in those
targets that really needs fixing, after which a non-zero num_flush_bios
should be required in dm for targets that map to underlying devices.

Signed-off-by: Christoph Hellwig 
---
  .../block/writeback_cache_control.rst | 67 +++
  arch/um/drivers/ubd_kern.c|  2 +-
  block/blk-core.c  |  2 +-
  block/blk-flush.c |  9 ++-
  block/blk-mq-debugfs.c|  2 -
  block/blk-settings.c  | 29 ++--
  block/blk-sysfs.c | 29 +---
  block/blk-wbt.c   |  4 +-
  drivers/block/drbd/drbd_main.c|  2 +-
  drivers/block/loop.c  |  9 +--
  drivers/block/nbd.c   | 14 ++--
  drivers/block/null_blk/main.c | 12 ++--
  drivers/block/ps3disk.c   |  7 +-
  drivers/block/rnbd/rnbd-clt.c | 10 +--
  drivers/block/ublk_drv.c  |  8 ++-
  drivers/block/virtio_blk.c| 20 --
  drivers/block/xen-blkfront.c  |  9 ++-
  drivers/md/bcache/super.c |  7 +-
  drivers/md/dm-table.c | 39 +++
  drivers/md/md.c   |  8 ++-
  drivers/mmc/core/block.c  | 42 ++--
  drivers/mmc/core/queue.c  | 12 ++--
  drivers/mmc/core/queue.h  |  3 +-
  drivers/mtd/mtd_blkdevs.c |  5 +-
  drivers/nvdimm/pmem.c |  4 +-
  drivers/nvme/host/core.c  |  7 +-
  drivers/nvme/host/multipath.c |  6 --
  drivers/scsi/sd.c | 28 +---
  include/linux/blkdev.h| 38 +--
  29 files changed, 227 insertions(+), 207 deletions(-)

diff --git a/Documentation/block/writeback_cache_control.rst 
b/Documentation/block/writeback_cache_control.rst
index b208488d0aae85..9cfe27f90253c7 100644
--- a/Documentation/block/writeback_cache_control.rst
+++ b/Documentation/block/writeback_cache_control.rst
@@ -46,41 +46,50 @@ worry if the underlying devices need any explicit cache 
flushing and how
  the Forced Unit Access is implemented.  The REQ_PREFLUSH and REQ_FUA flags
  may both be set on a single bio.
  
+Feature settings for block drivers

+--
  
-Implementation details for bio based block drivers

---
+For devices that do not support volatile write caches there is no driver
+support required, the block layer completes empty REQ_PREFLUSH requests before
+entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from
+requests that have a payload.
  
-These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit

-directly below the submit_bio interface.  For remapping drivers the REQ_FUA
-bits need to be propagated to underlying devices, and a global flush needs
-to be implemented for bios with the REQ_PREFLUSH bit set.  For real device
-drivers that do not have a volatile cache the REQ_PREFLUSH and REQ_FUA bits
-on non-empty bios can simply be ignored, and REQ_PREFLUSH requests without
-data can be completed successfully without doing any work.  Drivers for
-devices with volatile caches need to implement the support for these
-flags themselves without any help from the block layer.
+For devices with volatile write caches the driver needs to tell the block layer
+that it supports flushing caches by setting the
  
+   BLK_FEAT_WRITE_CACHE
  
-Implementation details for request_fn based block drivers

--
+flag in the queue_limits feature field.  For devices that also support the FUA
+bit the block layer needs to be told to pass on the REQ_FUA bit by also setting
+the
  
-For devices that do not support vola

Re: [PATCH 01/26] sd: fix sd_is_zoned

2024-06-11 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 


Re: [PATCH 08/26] virtio_blk: remove virtblk_update_cache_mode

2024-06-11 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 


Re: dm-verity requires WQ_UNBOUND with mxs-dcp

2024-06-11 Thread Sven
The driver complains during self-testing; however, those errors seem to
be thrown during aes testing, not sha256 testing:


# dmesg | grep -iE 'crypt|alg:'
[0.316442] alg: extra crypto tests enabled.  This is intended for 
developer use only.

[1.055440] mxs-dcp 228.crypto: Invalid block size!
[1.060927] mxs-dcp 228.crypto: Invalid block size!
[1.164247] mxs-dcp 228.crypto: Invalid block size!
[1.169591] mxs-dcp 228.crypto: Invalid block size!
[1.204738] mxs-dcp 228.crypto: Invalid block size!
[1.210093] mxs-dcp 228.crypto: Invalid block size!
[1.371876] mxs-dcp 228.crypto: Invalid block size!
[1.377280] mxs-dcp 228.crypto: Invalid block size!
[1.394093] mxs-dcp 228.crypto: Invalid block size!
[1.399432] mxs-dcp 228.crypto: Invalid block size!
[1.431509] mxs-dcp 228.crypto: Invalid block size!
[1.436904] mxs-dcp 228.crypto: Invalid block size!

The sha256-dcp self-test is reported as "passed".

I have tried to reproduce the issue by forcing many concurrent sha256 
operations from userspace using AF_ALG but I haven't been able to see 
any issues. Of course, this goes through the userspace crypto API, so it 
might not be the same as the kernel using it directly.
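Roughly, the userspace check I ran boils down to something like this (software
hashlib stands in here, so this sketches the pattern rather than exercising the
DCP driver; the worker and round counts are arbitrary):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def digest_once(data: bytes) -> str:
    """Hash one buffer; each call builds an independent hash context."""
    return hashlib.sha256(data).hexdigest()

def parallel_digests(data: bytes, workers: int = 16, rounds: int = 64) -> set:
    """Hash the same buffer many times concurrently and collect the
    distinct results; a correct implementation yields exactly one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return set(pool.map(digest_once, [data] * (workers * rounds)))

if __name__ == "__main__":
    digests = parallel_digests(b"A" * 4096)
    # More than one distinct digest would indicate a concurrency bug.
    assert len(digests) == 1, f"inconsistent digests: {digests}"
    print("consistent:", next(iter(digests)))
```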


Any tips on how to narrow in on this issue are appreciated :)

Best regards,
Sven

On 6/7/24 7:22 AM, Eric Biggers wrote:

On Thu, Jun 06, 2024 at 03:12:26PM +0200, Sven wrote:

Do you happen to have some more info on the self-tests you mentioned? Can
you point me at documentation? I couldn't really find much useful stuff.

Thanks

On 5/28/24 9:42 PM, Eric Biggers wrote:

Does this driver pass all the crypto self-tests?


The crypto self-tests can be enabled by unsetting
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS, and preferably setting
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS in order to enable the full set of tests.
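In .config terms that corresponds to the fragment:

```
# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y
```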

They might catch something, though they don't test parallel usage of drivers,
which is where it sounds like the issue might be.

In any case the bug here is likely to be in the mxs-dcp driver, especially
considering that drivers tend to be much lower quality than the software crypto
(some driver developers apparently don't even bother to run the tests).

- Eric




[PATCH v2 md-6.11 00/12] md: refactor and some fixes related to sync_thread

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

Changes from v1:
 - respin on the top of md-6.11 branch

Changes from RFC:
 - fix some typos;
 - add patch 7 to prevent some mdadm tests failure;
 - add patch 12 to fix BUG_ON() panic by mdadm test 07revert-grow;

Yu Kuai (12):
  md: rearrange recovery_flags
  md: add a new enum type sync_action
  md: add new helpers for sync_action
  md: factor out helper to start reshape from action_store()
  md: replace sysfs api sync_action with new helpers
  md: remove parameter check_seq for stop_sync_thread()
  md: don't fail action_store() if sync_thread is not registered
  md: use new helers in md_do_sync()
  md: replace last_sync_action with new enum type
  md: factor out helpers for different sync_action in md_do_sync()
  md: pass in max_sectors for pers->sync_request()
  md/raid5: avoid BUG_ON() while continue reshape after reassembling

 drivers/md/dm-raid.c |   2 +-
 drivers/md/md.c  | 437 ++-
 drivers/md/md.h  | 124 +---
 drivers/md/raid1.c   |   5 +-
 drivers/md/raid10.c  |   8 +-
 drivers/md/raid5.c   |  23 ++-
 6 files changed, 388 insertions(+), 211 deletions(-)

-- 
2.39.2




[PATCH v2 md-6.11 01/12] md: rearrange recovery_flags

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

Currently there are lots of flags with the same confusing prefix
"MD_RECOVERY_", and there are two main types of flags: sync thread running
status, for which I prefer the prefix "SYNC_THREAD_", and sync thread action,
for which I prefer the prefix "SYNC_ACTION_".

For now, rearrange the flags and update the comments to improve code
readability; there are no functional changes.

Signed-off-by: Yu Kuai 
---
 drivers/md/md.h | 52 -
 1 file changed, 38 insertions(+), 14 deletions(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 487582058f74..1ee129c6f98f 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -551,22 +551,46 @@ struct mddev {
 };
 
 enum recovery_flags {
+   /* flags for sync thread running status */
+
+   /*
+* set when one of sync action is set and new sync thread need to be
+* registered, or just add/remove spares from conf.
+*/
+   MD_RECOVERY_NEEDED,
+   /* sync thread is running, or about to be started */
+   MD_RECOVERY_RUNNING,
+   /* sync thread needs to be aborted for some reason */
+   MD_RECOVERY_INTR,
+   /* sync thread is done and is waiting to be unregistered */
+   MD_RECOVERY_DONE,
+   /* running sync thread must abort immediately, and not restart */
+   MD_RECOVERY_FROZEN,
+   /* waiting for pers->start() to finish */
+   MD_RECOVERY_WAIT,
+   /* interrupted because io-error */
+   MD_RECOVERY_ERROR,
+
+   /* flags determines sync action */
+
+   /* if just this flag is set, action is resync. */
+   MD_RECOVERY_SYNC,
+   /*
+* paired with MD_RECOVERY_SYNC, if MD_RECOVERY_CHECK is not set,
+* action is repair, means user requested resync.
+*/
+   MD_RECOVERY_REQUESTED,
/*
-* If neither SYNC or RESHAPE are set, then it is a recovery.
+* paired with MD_RECOVERY_SYNC and MD_RECOVERY_REQUESTED, action is
+* check.
 */
-   MD_RECOVERY_RUNNING,/* a thread is running, or about to be started 
*/
-   MD_RECOVERY_SYNC,   /* actually doing a resync, not a recovery */
-   MD_RECOVERY_RECOVER,/* doing recovery, or need to try it. */
-   MD_RECOVERY_INTR,   /* resync needs to be aborted for some reason */
-   MD_RECOVERY_DONE,   /* thread is done and is waiting to be reaped */
-   MD_RECOVERY_NEEDED, /* we might need to start a resync/recover */
-   MD_RECOVERY_REQUESTED,  /* user-space has requested a sync (used with 
SYNC) */
-   MD_RECOVERY_CHECK,  /* user-space request for check-only, no repair 
*/
-   MD_RECOVERY_RESHAPE,/* A reshape is happening */
-   MD_RECOVERY_FROZEN, /* User request to abort, and not restart, any 
action */
-   MD_RECOVERY_ERROR,  /* sync-action interrupted because io-error */
-   MD_RECOVERY_WAIT,   /* waiting for pers->start() to finish */
-   MD_RESYNCING_REMOTE,/* remote node is running resync thread */
+   MD_RECOVERY_CHECK,
+   /* recovery, or need to try it */
+   MD_RECOVERY_RECOVER,
+   /* reshape */
+   MD_RECOVERY_RESHAPE,
+   /* remote node is running resync thread */
+   MD_RESYNCING_REMOTE,
 };
 
 enum md_ro_state {
-- 
2.39.2




[PATCH v2 md-6.11 02/12] md: add a new enum type sync_action

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

In order to make code related to sync_thread cleaner in following
patches, add detailed comments about each sync action, and also
prepare to remove the related recovery_flags in the future.

Signed-off-by: Yu Kuai 
---
 drivers/md/md.h | 57 -
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1ee129c6f98f..e5001d39c82d 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -34,6 +34,61 @@
  */
 #defineMD_FAILFAST (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT)
 
+/* Status of sync thread. */
+enum sync_action {
+   /*
+* Represent by MD_RECOVERY_SYNC, start when:
+* 1) after assemble, sync data from first rdev to other copies, this
+* must be done first before other sync actions and will only execute
+* once;
+* 2) resize the array(notice that this is not reshape), sync data for
+* the new range;
+*/
+   ACTION_RESYNC,
+   /*
+* Represent by MD_RECOVERY_RECOVER, start when:
+* 1) for new replacement, sync data based on the replace rdev or
+* available copies from other rdev;
+* 2) for new member disk while the array is degraded, sync data from
+* other rdev;
+* 3) reassemble after power failure or re-add a hot removed rdev, sync
+* data from first rdev to other copies based on bitmap;
+*/
+   ACTION_RECOVER,
+   /*
+* Represent by MD_RECOVERY_SYNC | MD_RECOVERY_REQUESTED |
+* MD_RECOVERY_CHECK, start when user echo "check" to sysfs api
+   * sync_action, used to check if data copies from different rdev are
+* the same. The number of mismatch sectors will be exported to user
+* by sysfs api mismatch_cnt;
+*/
+   ACTION_CHECK,
+   /*
+* Represent by MD_RECOVERY_SYNC | MD_RECOVERY_REQUESTED, start when
+* user echo "repair" to sysfs api sync_action, usually paired with
+* ACTION_CHECK, used to force syncing data once user found that there
+* are inconsistent data,
+*/
+   ACTION_REPAIR,
+   /*
+* Represent by MD_RECOVERY_RESHAPE, start when new member disk is added
+* to the conf, notice that this is different from spares or
+* replacement;
+*/
+   ACTION_RESHAPE,
+   /*
+* Represent by MD_RECOVERY_FROZEN, can be set by sysfs api sync_action
+* or internal usage like setting the array read-only, will forbid above
+* actions.
+*/
+   ACTION_FROZEN,
+   /*
+* All above actions don't match.
+*/
+   ACTION_IDLE,
+   NR_SYNC_ACTIONS,
+};
+
 /*
  * The struct embedded in rdev is used to serialize IO.
  */
@@ -571,7 +626,7 @@ enum recovery_flags {
/* interrupted because io-error */
MD_RECOVERY_ERROR,
 
-   /* flags determines sync action */
+   /* flags determines sync action, see details in enum sync_action */
 
/* if just this flag is set, action is resync. */
MD_RECOVERY_SYNC,
-- 
2.39.2




[PATCH v2 md-6.11 03/12] md: add new helpers for sync_action

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

The new helpers will get the current sync_action of the array, and will be
used in later patches to make code cleaner.

Signed-off-by: Yu Kuai 
---
 drivers/md/md.c | 79 +
 drivers/md/md.h |  3 ++
 2 files changed, 82 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b9b15aa79496..4ce8d164cde9 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -69,6 +69,16 @@
 #include "md-bitmap.h"
 #include "md-cluster.h"
 
+static const char *action_name[NR_SYNC_ACTIONS] = {
+   [ACTION_RESYNC] = "resync",
+   [ACTION_RECOVER]= "recover",
+   [ACTION_CHECK]  = "check",
+   [ACTION_REPAIR] = "repair",
+   [ACTION_RESHAPE]= "reshape",
+   [ACTION_FROZEN] = "frozen",
+   [ACTION_IDLE]   = "idle",
+};
+
 /* pers_list is a list of registered personalities protected by pers_lock. */
 static LIST_HEAD(pers_list);
 static DEFINE_SPINLOCK(pers_lock);
@@ -4868,6 +4878,75 @@ metadata_store(struct mddev *mddev, const char *buf, 
size_t len)
 static struct md_sysfs_entry md_metadata =
 __ATTR_PREALLOC(metadata_version, S_IRUGO|S_IWUSR, metadata_show, 
metadata_store);
 
+enum sync_action md_sync_action(struct mddev *mddev)
+{
+   unsigned long recovery = mddev->recovery;
+
+   /*
+* frozen has the highest priority, means running sync_thread will be
+* stopped immediately, and no new sync_thread can start.
+*/
+   if (test_bit(MD_RECOVERY_FROZEN, &recovery))
+   return ACTION_FROZEN;
+
+   /*
+* read-only array can't register sync_thread, and it can only
+* add/remove spares.
+*/
+   if (!md_is_rdwr(mddev))
+   return ACTION_IDLE;
+
+   /*
+* idle means no sync_thread is running, and no new sync_thread is
+* requested.
+*/
+   if (!test_bit(MD_RECOVERY_RUNNING, &recovery) &&
+   !test_bit(MD_RECOVERY_NEEDED, &recovery))
+   return ACTION_IDLE;
+
+   if (test_bit(MD_RECOVERY_RESHAPE, &recovery) ||
+   mddev->reshape_position != MaxSector)
+   return ACTION_RESHAPE;
+
+   if (test_bit(MD_RECOVERY_RECOVER, &recovery))
+   return ACTION_RECOVER;
+
+   if (test_bit(MD_RECOVERY_SYNC, &recovery)) {
+   /*
+* MD_RECOVERY_CHECK must be paired with
+* MD_RECOVERY_REQUESTED.
+*/
+   if (test_bit(MD_RECOVERY_CHECK, &recovery))
+   return ACTION_CHECK;
+   if (test_bit(MD_RECOVERY_REQUESTED, &recovery))
+   return ACTION_REPAIR;
+   return ACTION_RESYNC;
+   }
+
+   /*
+* MD_RECOVERY_NEEDED or MD_RECOVERY_RUNNING is set, however, no
+* sync_action is specified.
+*/
+   return ACTION_IDLE;
+}
+
+enum sync_action md_sync_action_by_name(const char *page)
+{
+   enum sync_action action;
+
+   for (action = 0; action < NR_SYNC_ACTIONS; ++action) {
+   if (cmd_match(page, action_name[action]))
+   return action;
+   }
+
+   return NR_SYNC_ACTIONS;
+}
+
+const char *md_sync_action_name(enum sync_action action)
+{
+   return action_name[action];
+}
+
 static ssize_t
 action_show(struct mddev *mddev, char *page)
 {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index e5001d39c82d..88add162b08e 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -864,6 +864,9 @@ extern void md_unregister_thread(struct mddev *mddev, 
struct md_thread __rcu **t
 extern void md_wakeup_thread(struct md_thread __rcu *thread);
 extern void md_check_recovery(struct mddev *mddev);
 extern void md_reap_sync_thread(struct mddev *mddev);
+extern enum sync_action md_sync_action(struct mddev *mddev);
+extern enum sync_action md_sync_action_by_name(const char *page);
+extern const char *md_sync_action_name(enum sync_action action);
 extern void md_write_start(struct mddev *mddev, struct bio *bi);
 extern void md_write_inc(struct mddev *mddev, struct bio *bi);
 extern void md_write_end(struct mddev *mddev);
-- 
2.39.2
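As an aside, the priority order that md_sync_action() encodes (frozen first,
then the read-only/idle checks, then reshape > recover > check/repair/resync)
can be sanity-checked with a tiny userspace model; the flag names mirror the
kernel's, but this is purely illustrative and runs no kernel code:

```python
# Stand-ins for the MD_RECOVERY_* bits used by md_sync_action().
FROZEN, RUNNING, NEEDED, RESHAPE, RECOVER, SYNC, REQUESTED, CHECK = range(8)

def sync_action(flags: set, rdwr: bool = True, reshaping: bool = False) -> str:
    """Mirror the priority order of md_sync_action()."""
    if FROZEN in flags:                 # frozen has the highest priority
        return "frozen"
    if not rdwr:                        # read-only arrays stay idle
        return "idle"
    if RUNNING not in flags and NEEDED not in flags:
        return "idle"                   # nothing running or requested
    if RESHAPE in flags or reshaping:   # reshape_position != MaxSector
        return "reshape"
    if RECOVER in flags:
        return "recover"
    if SYNC in flags:
        if CHECK in flags:              # CHECK is paired with REQUESTED
            return "check"
        if REQUESTED in flags:
            return "repair"
        return "resync"
    return "idle"

if __name__ == "__main__":
    assert sync_action({FROZEN, RUNNING, SYNC}) == "frozen"
    assert sync_action({RUNNING, SYNC, REQUESTED}) == "repair"
    assert sync_action({RUNNING, SYNC}) == "resync"
    print("priority model consistent")
```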




[PATCH v2 md-6.11 04/12] md: factor out helper to start reshape from action_store()

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

There are no functional changes, just to make code cleaner and prepare
for following refactor.

Signed-off-by: Yu Kuai 
---
 drivers/md/md.c | 65 +++--
 1 file changed, 41 insertions(+), 24 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 4ce8d164cde9..b34ae9fbd246 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5071,6 +5071,45 @@ static void frozen_sync_thread(struct mddev *mddev)
mutex_unlock(&mddev->sync_mutex);
 }
 
+static int mddev_start_reshape(struct mddev *mddev)
+{
+   int ret;
+
+   if (mddev->pers->start_reshape == NULL)
+   return -EINVAL;
+
+   ret = mddev_lock(mddev);
+   if (ret)
+   return ret;
+
+   if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
+   mddev_unlock(mddev);
+   return -EBUSY;
+   }
+
+   if (mddev->reshape_position == MaxSector ||
+   mddev->pers->check_reshape == NULL ||
+   mddev->pers->check_reshape(mddev)) {
+   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+   ret = mddev->pers->start_reshape(mddev);
+   if (ret) {
+   mddev_unlock(mddev);
+   return ret;
+   }
+   } else {
+   /*
+* If reshape is still in progress, and md_check_recovery() can
+* continue to reshape, don't restart reshape because data can
+* be corrupted for raid456.
+*/
+   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+   }
+
+   mddev_unlock(mddev);
+   sysfs_notify_dirent_safe(mddev->sysfs_degraded);
+   return 0;
+}
+
 static ssize_t
 action_store(struct mddev *mddev, const char *page, size_t len)
 {
@@ -5090,32 +5129,10 @@ action_store(struct mddev *mddev, const char *page, 
size_t len)
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
} else if (cmd_match(page, "reshape")) {
-   int err;
-   if (mddev->pers->start_reshape == NULL)
-   return -EINVAL;
-   err = mddev_lock(mddev);
-   if (!err) {
-   if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
-   err =  -EBUSY;
-   } else if (mddev->reshape_position == MaxSector ||
-  mddev->pers->check_reshape == NULL ||
-  mddev->pers->check_reshape(mddev)) {
-   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-   err = mddev->pers->start_reshape(mddev);
-   } else {
-   /*
-* If reshape is still in progress, and
-* md_check_recovery() can continue to reshape,
-* don't restart reshape because data can be
-* corrupted for raid456.
-*/
-   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-   }
-   mddev_unlock(mddev);
-   }
+   int err = mddev_start_reshape(mddev);
+
if (err)
return err;
-   sysfs_notify_dirent_safe(mddev->sysfs_degraded);
} else {
if (cmd_match(page, "check"))
set_bit(MD_RECOVERY_CHECK, &mddev->recovery);
-- 
2.39.2




[PATCH v2 md-6.11 05/12] md: replace sysfs api sync_action with new helpers

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

Get rid of the extremely long if/else-if usage, and make the code cleaner.

Signed-off-by: Yu Kuai 
---
 drivers/md/md.c | 94 +++--
 1 file changed, 52 insertions(+), 42 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index b34ae9fbd246..d035cd52e49a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4950,27 +4950,9 @@ const char *md_sync_action_name(enum sync_action action)
 static ssize_t
 action_show(struct mddev *mddev, char *page)
 {
-   char *type = "idle";
-   unsigned long recovery = mddev->recovery;
-   if (test_bit(MD_RECOVERY_FROZEN, &recovery))
-   type = "frozen";
-   else if (test_bit(MD_RECOVERY_RUNNING, &recovery) ||
-   (md_is_rdwr(mddev) && test_bit(MD_RECOVERY_NEEDED, &recovery))) {
-   if (test_bit(MD_RECOVERY_RESHAPE, &recovery))
-   type = "reshape";
-   else if (test_bit(MD_RECOVERY_SYNC, &recovery)) {
-   if (!test_bit(MD_RECOVERY_REQUESTED, &recovery))
-   type = "resync";
-   else if (test_bit(MD_RECOVERY_CHECK, &recovery))
-   type = "check";
-   else
-   type = "repair";
-   } else if (test_bit(MD_RECOVERY_RECOVER, &recovery))
-   type = "recover";
-   else if (mddev->reshape_position != MaxSector)
-   type = "reshape";
-   }
-   return sprintf(page, "%s\n", type);
+   enum sync_action action = md_sync_action(mddev);
+
+   return sprintf(page, "%s\n", md_sync_action_name(action));
 }
 
 /**
@@ -5113,35 +5095,63 @@ static int mddev_start_reshape(struct mddev *mddev)
 static ssize_t
 action_store(struct mddev *mddev, const char *page, size_t len)
 {
+   int ret;
+   enum sync_action action;
+
if (!mddev->pers || !mddev->pers->sync_request)
return -EINVAL;
 
+   action = md_sync_action_by_name(page);
 
-   if (cmd_match(page, "idle"))
-   idle_sync_thread(mddev);
-   else if (cmd_match(page, "frozen"))
-   frozen_sync_thread(mddev);
-   else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
-   return -EBUSY;
-   else if (cmd_match(page, "resync"))
-   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-   else if (cmd_match(page, "recover")) {
-   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-   set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
-   } else if (cmd_match(page, "reshape")) {
-   int err = mddev_start_reshape(mddev);
-
-   if (err)
-   return err;
+   /* TODO: mdadm rely on "idle" to start sync_thread. */
+   if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
+   switch (action) {
+   case ACTION_FROZEN:
+   frozen_sync_thread(mddev);
+   return len;
+   case ACTION_IDLE:
+   idle_sync_thread(mddev);
+   break;
+   case ACTION_RESHAPE:
+   case ACTION_RECOVER:
+   case ACTION_CHECK:
+   case ACTION_REPAIR:
+   case ACTION_RESYNC:
+   return -EBUSY;
+   default:
+   return -EINVAL;
+   }
} else {
-   if (cmd_match(page, "check"))
+   switch (action) {
+   case ACTION_FROZEN:
+   set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+   return len;
+   case ACTION_RESHAPE:
+   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+   ret = mddev_start_reshape(mddev);
+   if (ret)
+   return ret;
+   break;
+   case ACTION_RECOVER:
+   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+   set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
+   break;
+   case ACTION_CHECK:
set_bit(MD_RECOVERY_CHECK, &mddev->recovery);
-   else if (!cmd_match(page, "repair"))
+   fallthrough;
+   case ACTION_REPAIR:
+   set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
+   set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
+   fallthrough;
+   case ACTION_RESYNC:
+   case ACTION_IDLE:
+   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
+   break;
+   default:
return -EINVAL;
-   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-   set_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
-   

[PATCH v2 md-6.11 06/12] md: remove parameter check_seq for stop_sync_thread()

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

The caller will always set MD_RECOVERY_FROZEN if check_seq is true, and
always clear MD_RECOVERY_FROZEN if check_seq is false; hence replace
the parameter with a test_bit() check to make the code cleaner.

Signed-off-by: Yu Kuai 
---
 drivers/md/md.c | 26 +++---
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index d035cd52e49a..44cb18ec1c52 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4961,15 +4961,10 @@ action_show(struct mddev *mddev, char *page)
  * @locked:if set, reconfig_mutex will still be held after this function
  * return; if not set, reconfig_mutex will be released after this
  * function return.
- * @check_seq: if set, only wait for curent running sync_thread to stop, noted
- * that new sync_thread can still start.
  */
-static void stop_sync_thread(struct mddev *mddev, bool locked, bool check_seq)
+static void stop_sync_thread(struct mddev *mddev, bool locked)
 {
-   int sync_seq;
-
-   if (check_seq)
-   sync_seq = atomic_read(&mddev->sync_seq);
+   int sync_seq = atomic_read(&mddev->sync_seq);
 
if (!test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
if (!locked)
@@ -4990,7 +4985,8 @@ static void stop_sync_thread(struct mddev *mddev, bool 
locked, bool check_seq)
 
wait_event(resync_wait,
   !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) ||
-  (check_seq && sync_seq != atomic_read(&mddev->sync_seq)));
+  (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery) &&
+   sync_seq != atomic_read(&mddev->sync_seq)));
 
if (locked)
mddev_lock_nointr(mddev);
@@ -5001,7 +4997,7 @@ void md_idle_sync_thread(struct mddev *mddev)
lockdep_assert_held(&mddev->reconfig_mutex);
 
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-   stop_sync_thread(mddev, true, true);
+   stop_sync_thread(mddev, true);
 }
 EXPORT_SYMBOL_GPL(md_idle_sync_thread);
 
@@ -5010,7 +5006,7 @@ void md_frozen_sync_thread(struct mddev *mddev)
lockdep_assert_held(&mddev->reconfig_mutex);
 
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-   stop_sync_thread(mddev, true, false);
+   stop_sync_thread(mddev, true);
 }
 EXPORT_SYMBOL_GPL(md_frozen_sync_thread);
 
@@ -5035,7 +5031,7 @@ static void idle_sync_thread(struct mddev *mddev)
return;
}
 
-   stop_sync_thread(mddev, false, true);
+   stop_sync_thread(mddev, false);
mutex_unlock(&mddev->sync_mutex);
 }
 
@@ -5049,7 +5045,7 @@ static void frozen_sync_thread(struct mddev *mddev)
return;
}
 
-   stop_sync_thread(mddev, false, false);
+   stop_sync_thread(mddev, false);
mutex_unlock(&mddev->sync_mutex);
 }
 
@@ -6544,7 +6540,7 @@ void md_stop_writes(struct mddev *mddev)
 {
mddev_lock_nointr(mddev);
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-   stop_sync_thread(mddev, true, false);
+   stop_sync_thread(mddev, true);
__md_stop_writes(mddev);
mddev_unlock(mddev);
 }
@@ -6612,7 +6608,7 @@ static int md_set_readonly(struct mddev *mddev)
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
}
 
-   stop_sync_thread(mddev, false, false);
+   stop_sync_thread(mddev, false);
wait_event(mddev->sb_wait,
   !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags));
mddev_lock_nointr(mddev);
@@ -6658,7 +6654,7 @@ static int do_md_stop(struct mddev *mddev, int mode)
set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
}
 
-   stop_sync_thread(mddev, true, false);
+   stop_sync_thread(mddev, true);
 
if (mddev->sysfs_active ||
test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
-- 
2.39.2




[PATCH v2 md-6.11 08/12] md: use new helpers in md_do_sync()

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

Make the code cleaner, and also use the action_name directly in the kernel log:
 - "check" instead of "data-check"
 - "repair" instead of "requested-resync"

Signed-off-by: Yu Kuai 
---
 drivers/md/md.c | 21 +
 drivers/md/md.h |  2 +-
 2 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 86abd0fe0681..5fa7b5f4bc6d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8929,7 +8929,8 @@ void md_do_sync(struct md_thread *thread)
sector_t last_check;
int skipped = 0;
struct md_rdev *rdev;
-   char *desc, *action = NULL;
+   enum sync_action action;
+   const char *desc;
struct blk_plug plug;
int ret;
 
@@ -8960,21 +8961,9 @@ void md_do_sync(struct md_thread *thread)
goto skip;
}
 
-   if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
-   if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) {
-   desc = "data-check";
-   action = "check";
-   } else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) {
-   desc = "requested-resync";
-   action = "repair";
-   } else
-   desc = "resync";
-   } else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
-   desc = "reshape";
-   else
-   desc = "recovery";
-
-   mddev->last_sync_action = action ?: desc;
+   action = md_sync_action(mddev);
+   desc = md_sync_action_name(action);
+   mddev->last_sync_action = desc;
 
/*
 * Before starting a resync we must have set curr_resync to
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 732053b905b2..ee06cb076f8c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -432,7 +432,7 @@ struct mddev {
 * when the sync thread is "frozen" (interrupted) or "idle" (stopped
 * or finished).  It is overwritten when a new sync operation is begun.
 */
-   char*last_sync_action;
+   const char  *last_sync_action;
sector_tcurr_resync;/* last block scheduled 
*/
/* As resync requests can complete out of order, we cannot easily track
 * how much resync has been completed.  So we occasionally pause until
-- 
2.39.2




[PATCH v2 md-6.11 09/12] md: replace last_sync_action with new enum type

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

The only difference is that "none" is removed, and the initial
last_sync_action will be "idle".

On the one hand, this value was introduced by commit c4a395514516
("MD: Remember the last sync operation that was performed"), and the
usage described in the commit message is not affected. On the other hand,
last_sync_action is not used in mdadm or mdmon, nor in any of the tests
that I can find.

Signed-off-by: Yu Kuai 
---
 drivers/md/dm-raid.c | 2 +-
 drivers/md/md.c  | 7 ---
 drivers/md/md.h  | 9 -
 3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index abe88d1e6735..052c00c1eb15 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -3542,7 +3542,7 @@ static void raid_status(struct dm_target *ti, 
status_type_t type,
recovery = rs->md.recovery;
state = decipher_sync_action(mddev, recovery);
progress = rs_get_progress(rs, recovery, state, 
resync_max_sectors);
-   resync_mismatches = (mddev->last_sync_action && 
!strcasecmp(mddev->last_sync_action, "check")) ?
+   resync_mismatches = mddev->last_sync_action == ACTION_CHECK ?
atomic64_read(&mddev->resync_mismatches) : 
0;
 
/* HM FIXME: do we want another state char for raid0? It shows 
'D'/'A'/'-' now */
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 5fa7b5f4bc6d..ab492e885867 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -768,7 +768,7 @@ int mddev_init(struct mddev *mddev)
init_waitqueue_head(&mddev->recovery_wait);
mddev->reshape_position = MaxSector;
mddev->reshape_backwards = 0;
-   mddev->last_sync_action = "none";
+   mddev->last_sync_action = ACTION_IDLE;
mddev->resync_min = 0;
mddev->resync_max = MaxSector;
mddev->level = LEVEL_NONE;
@@ -5149,7 +5149,8 @@ __ATTR_PREALLOC(sync_action, S_IRUGO|S_IWUSR, 
action_show, action_store);
 static ssize_t
 last_sync_action_show(struct mddev *mddev, char *page)
 {
-   return sprintf(page, "%s\n", mddev->last_sync_action);
+   return sprintf(page, "%s\n",
+  md_sync_action_name(mddev->last_sync_action));
 }
 
 static struct md_sysfs_entry md_last_scan_mode = __ATTR_RO(last_sync_action);
@@ -8963,7 +8964,7 @@ void md_do_sync(struct md_thread *thread)
 
action = md_sync_action(mddev);
desc = md_sync_action_name(action);
-   mddev->last_sync_action = desc;
+   mddev->last_sync_action = action;
 
/*
 * Before starting a resync we must have set curr_resync to
diff --git a/drivers/md/md.h b/drivers/md/md.h
index ee06cb076f8c..41781e41d8ff 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -426,13 +426,12 @@ struct mddev {
struct md_thread __rcu  *thread;/* management thread */
struct md_thread __rcu  *sync_thread;   /* doing resync or 
reconstruct */
 
-   /* 'last_sync_action' is initialized to "none".  It is set when a
-* sync operation (i.e "data-check", "requested-resync", "resync",
-* "recovery", or "reshape") is started.  It holds this value even
+   /*
+* Set when a sync operation is started. It holds this value even
 * when the sync thread is "frozen" (interrupted) or "idle" (stopped
-* or finished).  It is overwritten when a new sync operation is begun.
+* or finished). It is overwritten when a new sync operation is begun.
 */
-   const char  *last_sync_action;
+   enum sync_actionlast_sync_action;
sector_tcurr_resync;/* last block scheduled 
*/
/* As resync requests can complete out of order, we cannot easily track
 * how much resync has been completed.  So we occasionally pause until
-- 
2.39.2




[PATCH v2 md-6.11 07/12] md: don't fail action_store() if sync_thread is not registered

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

MD_RECOVERY_RUNNING will always be set when trying to register a new
sync_thread; however, if md_start_sync() turns out to do nothing,
MD_RECOVERY_RUNNING will be cleared again in that case. During the race
window, action_store() will return -EBUSY, which will cause some
mdadm tests to fail. For example:

The test 07reshape5intr will add a new disk to the array, then start a
reshape:

mdadm /dev/md0 --add /dev/xxx
mdadm --grow /dev/md0 -n 3

And add_bound_rdev() from mdadm --add will set MD_RECOVERY_NEEDED;
then, during the race window, mdadm --grow will fail.

Fix the problem by waiting in action_store() during the race window,
and fail only if a sync_thread is registered.

Signed-off-by: Yu Kuai 
---
 drivers/md/md.c | 85 +++--
 drivers/md/md.h |  2 --
 2 files changed, 33 insertions(+), 54 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 44cb18ec1c52..86abd0fe0681 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -753,7 +753,6 @@ int mddev_init(struct mddev *mddev)
 
mutex_init(&mddev->open_mutex);
mutex_init(&mddev->reconfig_mutex);
-   mutex_init(&mddev->sync_mutex);
mutex_init(&mddev->suspend_mutex);
mutex_init(&mddev->bitmap_info.mutex);
INIT_LIST_HEAD(&mddev->disks);
@@ -5021,34 +5020,6 @@ void md_unfrozen_sync_thread(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(md_unfrozen_sync_thread);
 
-static void idle_sync_thread(struct mddev *mddev)
-{
-   mutex_lock(&mddev->sync_mutex);
-   clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-
-   if (mddev_lock(mddev)) {
-   mutex_unlock(&mddev->sync_mutex);
-   return;
-   }
-
-   stop_sync_thread(mddev, false);
-   mutex_unlock(&mddev->sync_mutex);
-}
-
-static void frozen_sync_thread(struct mddev *mddev)
-{
-   mutex_lock(&mddev->sync_mutex);
-   set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
-
-   if (mddev_lock(mddev)) {
-   mutex_unlock(&mddev->sync_mutex);
-   return;
-   }
-
-   stop_sync_thread(mddev, false);
-   mutex_unlock(&mddev->sync_mutex);
-}
-
 static int mddev_start_reshape(struct mddev *mddev)
 {
int ret;
@@ -5056,24 +5027,13 @@ static int mddev_start_reshape(struct mddev *mddev)
if (mddev->pers->start_reshape == NULL)
return -EINVAL;
 
-   ret = mddev_lock(mddev);
-   if (ret)
-   return ret;
-
-   if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
-   mddev_unlock(mddev);
-   return -EBUSY;
-   }
-
if (mddev->reshape_position == MaxSector ||
mddev->pers->check_reshape == NULL ||
mddev->pers->check_reshape(mddev)) {
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
ret = mddev->pers->start_reshape(mddev);
-   if (ret) {
-   mddev_unlock(mddev);
+   if (ret)
return ret;
-   }
} else {
/*
 * If reshape is still in progress, and md_check_recovery() can
@@ -5083,7 +5043,6 @@ static int mddev_start_reshape(struct mddev *mddev)
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
}
 
-   mddev_unlock(mddev);
sysfs_notify_dirent_safe(mddev->sysfs_degraded);
return 0;
 }
@@ -5097,36 +5056,53 @@ action_store(struct mddev *mddev, const char *page, 
size_t len)
if (!mddev->pers || !mddev->pers->sync_request)
return -EINVAL;
 
+retry:
+   if (work_busy(&mddev->sync_work))
+   flush_work(&mddev->sync_work);
+
+   ret = mddev_lock(mddev);
+   if (ret)
+   return ret;
+
+   if (work_busy(&mddev->sync_work)) {
+   mddev_unlock(mddev);
+   goto retry;
+   }
+
action = md_sync_action_by_name(page);
 
/* TODO: mdadm rely on "idle" to start sync_thread. */
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) {
switch (action) {
case ACTION_FROZEN:
-   frozen_sync_thread(mddev);
-   return len;
+   md_frozen_sync_thread(mddev);
+   ret = len;
+   goto out;
case ACTION_IDLE:
-   idle_sync_thread(mddev);
+   md_idle_sync_thread(mddev);
break;
case ACTION_RESHAPE:
case ACTION_RECOVER:
case ACTION_CHECK:
case ACTION_REPAIR:
case ACTION_RESYNC:
-   return -EBUSY;
+   ret = -EBUSY;
+   goto out;
default:
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out;
}
} else {
switch (action) {
 

[PATCH v2 md-6.11 10/12] md: factor out helpers for different sync_action in md_do_sync()

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

Make the code cleaner by replacing the if/else-if chain with a switch;
it's more obvious now what is done for each sync_action. There are no
functional changes.

Signed-off-by: Yu Kuai 
---
 drivers/md/md.c | 123 
 1 file changed, 73 insertions(+), 50 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index ab492e885867..ec2ef4dd42cf 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -8914,6 +8914,77 @@ void md_allow_write(struct mddev *mddev)
 }
 EXPORT_SYMBOL_GPL(md_allow_write);
 
+static sector_t md_sync_max_sectors(struct mddev *mddev,
+   enum sync_action action)
+{
+   switch (action) {
+   case ACTION_RESYNC:
+   case ACTION_CHECK:
+   case ACTION_REPAIR:
+   atomic64_set(&mddev->resync_mismatches, 0);
+   fallthrough;
+   case ACTION_RESHAPE:
+   return mddev->resync_max_sectors;
+   case ACTION_RECOVER:
+   return mddev->dev_sectors;
+   default:
+   return 0;
+   }
+}
+
+static sector_t md_sync_position(struct mddev *mddev, enum sync_action action)
+{
+   sector_t start = 0;
+   struct md_rdev *rdev;
+
+   switch (action) {
+   case ACTION_CHECK:
+   case ACTION_REPAIR:
+   return mddev->resync_min;
+   case ACTION_RESYNC:
+   if (!mddev->bitmap)
+   return mddev->recovery_cp;
+   return 0;
+   case ACTION_RESHAPE:
+   /*
+* If the original node aborts reshaping then we continue the
+* reshaping, so set again to avoid restart reshape from the
+* first beginning
+*/
+   if (mddev_is_clustered(mddev) &&
+   mddev->reshape_position != MaxSector)
+   return mddev->reshape_position;
+   return 0;
+   case ACTION_RECOVER:
+   start = MaxSector;
+   rcu_read_lock();
+   rdev_for_each_rcu(rdev, mddev)
+   if (rdev->raid_disk >= 0 &&
+   !test_bit(Journal, &rdev->flags) &&
+   !test_bit(Faulty, &rdev->flags) &&
+   !test_bit(In_sync, &rdev->flags) &&
+   rdev->recovery_offset < start)
+   start = rdev->recovery_offset;
+   rcu_read_unlock();
+
+   /* If there is a bitmap, we need to make sure all
+* writes that started before we added a spare
+* complete before we start doing a recovery.
+* Otherwise the write might complete and (via
+* bitmap_endwrite) set a bit in the bitmap after the
+* recovery has checked that bit and skipped that
+* region.
+*/
+   if (mddev->bitmap) {
+   mddev->pers->quiesce(mddev, 1);
+   mddev->pers->quiesce(mddev, 0);
+   }
+   return start;
+   default:
+   return MaxSector;
+   }
+}
+
 #define SYNC_MARKS 10
 #defineSYNC_MARK_STEP  (3*HZ)
 #define UPDATE_FREQUENCY (5*60*HZ)
@@ -9032,56 +9103,8 @@ void md_do_sync(struct md_thread *thread)
spin_unlock(&all_mddevs_lock);
} while (mddev->curr_resync < MD_RESYNC_DELAYED);
 
-   j = 0;
-   if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
-   /* resync follows the size requested by the personality,
-* which defaults to physical size, but can be virtual size
-*/
-   max_sectors = mddev->resync_max_sectors;
-   atomic64_set(&mddev->resync_mismatches, 0);
-   /* we don't use the checkpoint if there's a bitmap */
-   if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery))
-   j = mddev->resync_min;
-   else if (!mddev->bitmap)
-   j = mddev->recovery_cp;
-
-   } else if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) {
-   max_sectors = mddev->resync_max_sectors;
-   /*
-* If the original node aborts reshaping then we continue the
-* reshaping, so set j again to avoid restart reshape from the
-* first beginning
-*/
-   if (mddev_is_clustered(mddev) &&
-   mddev->reshape_position != MaxSector)
-   j = mddev->reshape_position;
-   } else {
-   /* recovery follows the physical size of devices */
-   max_sectors = mddev->dev_sectors;
-   j = MaxSector;
-   rcu_read_lock();
-   rdev_for_each_rcu(rdev, mddev)
-   if (rdev->raid_disk >= 0 &&
-   !test_bit(Journal, &rdev->flags) &&
-

[PATCH v2 md-6.11 11/12] md: pass in max_sectors for pers->sync_request()

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

For different sync_actions, the sync_thread will use different
max_sectors; see details in md_sync_max_sectors(). Currently both
md_do_sync() and pers->sync_request() have to compute the same
max_sectors in each iteration. Hence, pass max_sectors in to
pers->sync_request() to avoid the redundant code.

Signed-off-by: Yu Kuai 
---
 drivers/md/md.c | 5 +++--
 drivers/md/md.h | 3 ++-
 drivers/md/raid1.c  | 5 ++---
 drivers/md/raid10.c | 8 ++--
 drivers/md/raid5.c  | 3 +--
 5 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index ec2ef4dd42cf..c0426a6d2fd1 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -9186,7 +9186,8 @@ void md_do_sync(struct md_thread *thread)
if (test_bit(MD_RECOVERY_INTR, &mddev->recovery))
break;
 
-   sectors = mddev->pers->sync_request(mddev, j, &skipped);
+   sectors = mddev->pers->sync_request(mddev, j, max_sectors,
+   &skipped);
if (sectors == 0) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
break;
@@ -9276,7 +9277,7 @@ void md_do_sync(struct md_thread *thread)
mddev->curr_resync_completed = mddev->curr_resync;
sysfs_notify_dirent_safe(mddev->sysfs_completed);
}
-   mddev->pers->sync_request(mddev, max_sectors, &skipped);
+   mddev->pers->sync_request(mddev, max_sectors, max_sectors, &skipped);
 
if (!test_bit(MD_RECOVERY_CHECK, &mddev->recovery) &&
mddev->curr_resync > MD_RESYNC_ACTIVE) {
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 41781e41d8ff..2dc52edec3fe 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -729,7 +729,8 @@ struct md_personality
int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
int (*hot_remove_disk) (struct mddev *mddev, struct md_rdev *rdev);
int (*spare_active) (struct mddev *mddev);
-   sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr, int 
*skipped);
+   sector_t (*sync_request)(struct mddev *mddev, sector_t sector_nr,
+sector_t max_sector, int *skipped);
int (*resize) (struct mddev *mddev, sector_t sectors);
sector_t (*size) (struct mddev *mddev, sector_t sectors, int 
raid_disks);
int (*check_reshape) (struct mddev *mddev);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3d54f30112a0..2bbfb4e682b2 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2756,12 +2756,12 @@ static struct r1bio *raid1_alloc_init_r1buf(struct 
r1conf *conf)
  */
 
 static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
-  int *skipped)
+  sector_t max_sector, int *skipped)
 {
struct r1conf *conf = mddev->private;
struct r1bio *r1_bio;
struct bio *bio;
-   sector_t max_sector, nr_sectors;
+   sector_t nr_sectors;
int disk = -1;
int i;
int wonly = -1;
@@ -2777,7 +2777,6 @@ static sector_t raid1_sync_request(struct mddev *mddev, 
sector_t sector_nr,
if (init_resync(conf))
return 0;
 
-   max_sector = mddev->dev_sectors;
if (sector_nr >= max_sector) {
/* If we aborted, we need to abort the
 * sync on the 'current' bitmap chunk (there will
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index f8d7c02c6ed5..4e804602d1e5 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -3139,12 +3139,12 @@ static void raid10_set_cluster_sync_high(struct r10conf 
*conf)
  */
 
 static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
-int *skipped)
+   sector_t max_sector, int *skipped)
 {
struct r10conf *conf = mddev->private;
struct r10bio *r10_bio;
struct bio *biolist = NULL, *bio;
-   sector_t max_sector, nr_sectors;
+   sector_t nr_sectors;
int i;
int max_sync;
sector_t sync_blocks;
@@ -3174,10 +3174,6 @@ static sector_t raid10_sync_request(struct mddev *mddev, 
sector_t sector_nr,
return 0;
 
  skipped:
-   max_sector = mddev->dev_sectors;
-   if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ||
-   test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery))
-   max_sector = mddev->resync_max_sectors;
if (sector_nr >= max_sector) {
conf->cluster_sync_low = 0;
conf->cluster_sync_high = 0;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a84389311dd1..013adc5ba0e1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6457,11 +6457,10 @@ static sector_t reshape_request(struct mddev *mddev, 
sector_t sector_nr, int *sk
 }
 
 static inline sector_t raid5_sync_request(struct md

[PATCH v2 md-6.11 12/12] md/raid5: avoid BUG_ON() while continue reshape after reassembling

2024-06-11 Thread Yu Kuai
From: Yu Kuai 

Currently, mdadm supports --revert-reshape to abort the reshape while
reassembling, as exercised by the test 07revert-grow. However, the
following BUG_ON() can be triggered by the test:
kernel BUG at drivers/md/raid5.c:6278!
invalid opcode:  [#1] PREEMPT SMP PTI
irq event stamp: 158985
CPU: 6 PID: 891 Comm: md0_reshape Not tainted 6.9.0-03335-g7592a0b0049a #94
RIP: 0010:reshape_request+0x3f1/0xe60
Call Trace:
 
 raid5_sync_request+0x43d/0x550
 md_do_sync+0xb7a/0x2110
 md_thread+0x294/0x2b0
 kthread+0x147/0x1c0
 ret_from_fork+0x59/0x70
 ret_from_fork_asm+0x1a/0x30
 

The root cause is that --revert-reshape updates raid_disks from 5 to 4
while the reshape position is still set. After reassembling the array,
the reshape position is read back from the super block, and during the
reshape the check of 'writepos', which is calculated from the old
reshape position, will fail.

Fix this panic the easy way first, by converting the BUG_ON() to
WARN_ON(), and stopping the reshape if the checks fail.

Note that mdadm must fix --revert-reshape as well, and md/raid should
probably enhance metadata validation too; however, this means
reassembly will fail, and there must then be user tools to fix the
wrong metadata.

Signed-off-by: Yu Kuai 
---
 drivers/md/raid5.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 013adc5ba0e1..547fd15115cd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6254,7 +6254,9 @@ static sector_t reshape_request(struct mddev *mddev, 
sector_t sector_nr, int *sk
safepos = conf->reshape_safe;
sector_div(safepos, data_disks);
if (mddev->reshape_backwards) {
-   BUG_ON(writepos < reshape_sectors);
+   if (WARN_ON(writepos < reshape_sectors))
+   return MaxSector;
+
writepos -= reshape_sectors;
readpos += reshape_sectors;
safepos += reshape_sectors;
@@ -6272,14 +6274,18 @@ static sector_t reshape_request(struct mddev *mddev, 
sector_t sector_nr, int *sk
 * to set 'stripe_addr' which is where we will write to.
 */
if (mddev->reshape_backwards) {
-   BUG_ON(conf->reshape_progress == 0);
+   if (WARN_ON(conf->reshape_progress == 0))
+   return MaxSector;
+
stripe_addr = writepos;
-   BUG_ON((mddev->dev_sectors &
-   ~((sector_t)reshape_sectors - 1))
-  - reshape_sectors - stripe_addr
-  != sector_nr);
+   if (WARN_ON((mddev->dev_sectors &
+   ~((sector_t)reshape_sectors - 1)) -
+   reshape_sectors - stripe_addr != sector_nr))
+   return MaxSector;
} else {
-   BUG_ON(writepos != sector_nr + reshape_sectors);
+   if (WARN_ON(writepos != sector_nr + reshape_sectors))
+   return MaxSector;
+
stripe_addr = sector_nr;
}
 
-- 
2.39.2




Re: [PATCH 02/12] md: add a new enum type sync_action

2024-06-11 Thread Yu Kuai

Hi,

在 2024/06/11 16:31, Mariusz Tkaczyk 写道:

On Mon, 3 Jun 2024 20:58:05 +0800
Yu Kuai  wrote:


In order to make code related to sync_thread cleaner in following
patches, also add detailed comments about each sync action. And also
prepare to remove the related recovery_flags in the future.

Signed-off-by: Yu Kuai 
---
  drivers/md/md.h | 57 -
  1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.h b/drivers/md/md.h
index 170412a65b63..6b9d9246f260 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -34,6 +34,61 @@
   */
  #define   MD_FAILFAST (REQ_FAILFAST_DEV | REQ_FAILFAST_TRANSPORT)
  
+/* Status of sync thread. */

+enum sync_action {
+   /*
+* Represent by MD_RECOVERY_SYNC, start when:
+* 1) after assemble, sync data from first rdev to other copies, this
+* must be done first before other sync actions and will only execute
+* once;
+* 2) resize the array(notice that this is not reshape), sync data
for
+* the new range;
+*/
+   ACTION_RESYNC,
+   /*
+* Represent by MD_RECOVERY_RECOVER, start when:
+* 1) for new replacement, sync data based on the replace rdev or
+* available copies from other rdev;
+* 2) for new member disk while the array is degraded, sync data from
+* other rdev;
+* 3) reassemble after power failure or re-add a hot removed rdev,
sync
+* data from first rdev to other copies based on bitmap;
+*/
+   ACTION_RECOVER,
+   /*
+* Represent by MD_RECOVERY_SYNC | MD_RECOVERY_REQUESTED |
+* MD_RECOVERY_CHECK, start when user echo "check" to sysfs api
+* sync_action, used to check if data copies from different rdev are
+* the same. The number of mismatch sectors will be exported to user
+* by sysfs api mismatch_cnt;
+*/
+   ACTION_CHECK,
+   /*
+* Represent by MD_RECOVERY_SYNC | MD_RECOVERY_REQUESTED, start when
+* user echo "repair" to sysfs api sync_action, usually paired with
+* ACTION_CHECK, used to force syncing data once user found that
there
+* are inconsistent data,
+*/
+   ACTION_REPAIR,
+   /*
+* Represent by MD_RECOVERY_RESHAPE, start when new member disk is
added
+* to the conf, notice that this is different from spares or
+* replacement;
+*/
+   ACTION_RESHAPE,
+   /*
+* Represent by MD_RECOVERY_FROZEN, can be set by sysfs api
sync_action
+* or internal usage like setting the array read-only, will forbid
above
+* actions.
+*/
+   ACTION_FROZEN,
+   /*
+* All above actions don't match.
+*/
+   ACTION_IDLE,
+   NR_SYNC_ACTIONS,
+};


I'd like it if the counter were kept in the same style as the rest of the enum values, like ACTION_COUNT.


Thanks for the review, however, I didn't find this style "xxx_COUNT" in
md code, AFAIK, "NR_xxx" style is used more, for example:

enum stat_group {
STAT_READ,
STAT_WRITE,
STAT_DISCARD,
STAT_FLUSH,

NR_STAT_GROUPS
};

Just to let you know, I'll keep this style in the next version. :)

Thanks,
Kuai


Anyway LGTM.

Mariusz
.






Re: [PATCH v2 md-6.11 00/12] md: refacotor and some fixes related to sync_thread

2024-06-11 Thread Paul Menzel

Dear Yu,


Thank you for your series.


Am 11.06.24 um 15:22 schrieb Yu Kuai:

From: Yu Kuai 


It’d be great if you wrote a small summary of what the fixes are, and 
which patches those are.


Nit: Small typo in the summary/title: refacotor → refactor.


Changes from v1:
  - respin on the top of md-6.11 branch

Changes from RFC:
  - fix some typos;
  - add patch 7 to prevent some mdadm tests failure;
  - add patch 12 to fix BUG_ON() panic by mdadm test 07revert-grow;

Yu Kuai (12):
   md: rearrange recovery_flags
   md: add a new enum type sync_action
   md: add new helpers for sync_action
   md: factor out helper to start reshape from action_store()
   md: replace sysfs api sync_action with new helpers
   md: remove parameter check_seq for stop_sync_thread()
   md: don't fail action_store() if sync_thread is not registered
   md: use new helers in md_do_sync()


hel*p*ers


   md: replace last_sync_action with new enum type
   md: factor out helpers for different sync_action in md_do_sync()
   md: pass in max_sectors for pers->sync_request()
   md/raid5: avoid BUG_ON() while continue reshape after reassembling

  drivers/md/dm-raid.c |   2 +-
  drivers/md/md.c  | 437 ++-
  drivers/md/md.h  | 124 +---
  drivers/md/raid1.c   |   5 +-
  drivers/md/raid10.c  |   8 +-
  drivers/md/raid5.c   |  23 ++-
  6 files changed, 388 insertions(+), 211 deletions(-)



Kind regards,

Paul



Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

2024-06-11 Thread Herbert Xu
On Mon, Jun 10, 2024 at 09:42:58AM -0700, Eric Biggers wrote:
>
> I understand that you think the ahash based API would make it easier to add
> multibuffer support to "authenc(hmac(sha256),cbc(aes))" for IPsec, which seems
> to be a very important use case for you (though it isn't relevant to nearly as
> many systems as dm-verity and fsverity are).  Regardless, the reality is that 
> it
> would be much more difficult to take advantage of multibuffer crypto in the
> IPsec authenc use case than in dm-verity and fsverity.  authenc uses multiple
> underlying algorithms, AES-CBC and HMAC-SHA256, that would both have to use
> multibuffer crypto in order to see a significant benefit, seeing as even if 
> the
> SHA-256 support could be wired up through HMAC-SHA256, encryption would be
> bottlenecked on AES-CBC, especially on Intel CPUs.  It also looks like the 
> IPsec
> code would need a lot of updates to support multibuffer crypto.

The linked-request thing feeds nicely into networking.  In fact
that's where I got the idea of linking them from.  In networking
a large GSO (currently limited to 64K but theoretically we could
make it unlimited) packet is automatically split up into a linked
list of MTU-sized skb's.

Therefore if we switched to a linked-list API networking could
give us the buffers with minimal changes.

BTW, I found an old Intel paper that claims through their multi-
buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM.
I wonder if we could still replicate this today:

https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf
 
> Ultimately, I need to have dm-verity and fsverity be properly optimized in the
> downstreams that are most relevant to me.  If you're not going to allow the
> upstream crypto API to provide the needed functionality in a reasonable way,
> then I'll need to shift my focus to getting this patchset into downstream
> kernels such as Android and Chrome OS instead.

I totally understand that this is your priority.  But please give
me some time to see if we can devise something that works for both
scenarios.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

2024-06-11 Thread Herbert Xu
On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote:
>
> Therefore if we switched to a linked-list API networking could
> give us the buffers with minimal changes.

BTW, this is not just about parallelising hashing.  One of the most
significant benefits of GSO comes not from hardware offload, but
rather from the amortisation of (network) stack overhead.
IOW you're traversing a very deep stack once instead of 40 times
(this is the factor for 64K vs MTU; if we extend beyond 64K (which
we absolutely should do) the benefit would increase as well).

The same should apply to the Crypto API.  So even if this was a
purely software solution with no assembly code at all, it may well
improve GCM performance (at least for users able to feed us bulk
data, like networking).

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



Re: [PATCH v5 00/15] Optimize dm-verity and fsverity using multibuffer hashing

2024-06-11 Thread Ard Biesheuvel
On Tue, 11 Jun 2024 at 05:49, Eric Biggers  wrote:
>
> On many modern CPUs, it is possible to compute the SHA-256 hash of two
> equal-length messages in about the same time as a single message, if all
> the instructions are interleaved.  This is because each SHA-256 (and
> also most other cryptographic hash functions) is inherently serialized
> and therefore can't always take advantage of the CPU's full throughput.
>
> An earlier attempt to support multibuffer hashing in Linux was based
> around the ahash API.  That approach had some major issues, as does the
> alternative ahash-based approach proposed by Herbert (see my response at
> https://lore.kernel.org/linux-crypto/20240610164258.GA3269@sol.localdomain/).
> This patchset instead takes a much simpler approach of just adding a
> synchronous API for hashing equal-length messages.
>

I share Eric's skepticism that shoehorning this into ahash for
theoretical reasons is going to lead anywhere. So I would strongly
prefer this approach. We can always revisit this if/when this generic
multibuffer ahash materializes.

So for this series

Acked-by: Ard Biesheuvel 



Re: [PATCH 08/26] virtio_blk: remove virtblk_update_cache_mode

2024-06-11 Thread Stefan Hajnoczi
On Tue, Jun 11, 2024 at 07:19:08AM +0200, Christoph Hellwig wrote:
> virtblk_update_cache_mode boils down to a single call to
> blk_queue_write_cache.  Remove it in preparation for moving the cache
> control flags into the queue_limits.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/block/virtio_blk.c | 13 +++--
>  1 file changed, 3 insertions(+), 10 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

2024-06-11 Thread Ard Biesheuvel
On Tue, 11 Jun 2024 at 17:21, Herbert Xu  wrote:
>
> On Mon, Jun 10, 2024 at 09:42:58AM -0700, Eric Biggers wrote:
> >
> > I understand that you think the ahash based API would make it easier to add
> > multibuffer support to "authenc(hmac(sha256),cbc(aes))" for IPsec, which 
> > seems
> > to be a very important use case for you (though it isn't relevant to nearly 
> > as
> > many systems as dm-verity and fsverity are).  Regardless, the reality is 
> > that it
> > would be much more difficult to take advantage of multibuffer crypto in the
> > IPsec authenc use case than in dm-verity and fsverity.  authenc uses 
> > multiple
> > underlying algorithms, AES-CBC and HMAC-SHA256, that would both have to use
> > multibuffer crypto in order to see a significant benefit, seeing as even if 
> > the
> > SHA-256 support could be wired up through HMAC-SHA256, encryption would be
> > bottlenecked on AES-CBC, especially on Intel CPUs.  It also looks like the 
> > IPsec
> > code would need a lot of updates to support multibuffer crypto.
>
> The linked-request thing feeds nicely into networking.  In fact
> that's where I got the idea of linking them from.  In networking
> a large GSO (currently limited to 64K but theoretically we could
> make it unlimited) packet is automatically split up into a linked
> list of MTU-sized skb's.
>
> Therefore if we switched to a linked-list API networking could
> give us the buffers with minimal changes.
>
> BTW, I found an old Intel paper that claims through their multi-
> buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM.
> I wonder if we could still replicate this today:
>
> https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf
>

This looks like the whitepaper that describes the buggy multibuffer
code that we ripped out.

> > Ultimately, I need to have dm-verity and fsverity be properly optimized in 
> > the
> > downstreams that are most relevant to me.  If you're not going to allow the
> > upstream crypto API to provide the needed functionality in a reasonable way,
> > then I'll need to shift my focus to getting this patchset into downstream
> > kernels such as Android and Chrome OS instead.
>
> I totally understand that this is your priority.  But please give
> me some time to see if we can devise something that works for both
> scenarios.
>

The issue here is that the CPU based multibuffer approach has rather
tight constraints in terms of input length and the shared prefix, and
so designing a more generic API based on ahash doesn't help at all.
The intel multibuffer code went off into the weeds entirely attempting
to apply this parallel scheme to arbitrary combinations of inputs, so
this is something we know we should avoid.



Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

2024-06-11 Thread Herbert Xu
On Tue, Jun 11, 2024 at 05:46:01PM +0200, Ard Biesheuvel wrote:
>
> The issue here is that the CPU based multibuffer approach has rather
> tight constraints in terms of input length and the shared prefix, and
> so designing a more generic API based on ahash doesn't help at all.
> The intel multibuffer code went off into the weeds entirely attempting
> to apply this parallel scheme to arbitrary combinations of inputs, so
> this is something we know we should avoid.

The sha-mb approach failed because it failed to aggregate the data
properly.  By driving this from the data sink, it was doomed to fail.

The correct way to aggregate data is to do it at the source.  The
user (of the Crypto API) knows exactly how much data they want to
hash and how it's structured.  They should be supplying that info
to the API so it can use multi-buffer where applicable.  Even where
multi-buffer isn't available, they would at least benefit from making
a single indirect call into the Crypto stack instead of N calls.
When N is large (which is almost always the case for TCP) this
produces a non-trivial saving.
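(As an illustrative aside: the amortisation argument above can be
sketched in userspace Python, where hashlib merely stands in for "one
call into the hashing stack" — this is not the kernel Crypto API, and
the 1500-byte MTU chunking is an assumption for illustration. The
digest is identical either way; only the number of API crossings
differs.)

```python
import hashlib

# One large "GSO-sized" buffer, and the same data split into
# MTU-sized chunks (1500 bytes is assumed for illustration).
data = bytes(64 * 1024)
chunks = [data[i:i + 1500] for i in range(0, len(data), 1500)]

# N calls: the hashing stack is entered once per chunk.
h_chunked = hashlib.sha256()
for chunk in chunks:
    h_chunked.update(chunk)

# One call: the source hands over the whole buffer at once.
h_bulk = hashlib.sha256(data)

# Same digest; only the number of stack traversals differs.
assert h_chunked.digest() == h_bulk.digest()
print(f"{len(chunks)} chunked calls vs 1 bulk call, same digest")
```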

Sure I understand that you guys are more than happy with N=2 but
please let me at least try this out and see if we could make this
work for a large value of N.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt



Re: [PATCH v5 00/15] Optimize dm-verity and fsverity using multibuffer hashing

2024-06-11 Thread Sami Tolvanen
Hi Eric,

On Mon, Jun 10, 2024 at 8:49 PM Eric Biggers  wrote:
>
> On many modern CPUs, it is possible to compute the SHA-256 hash of two
> equal-length messages in about the same time as a single message, if all
> the instructions are interleaved.  This is because each SHA-256 (and
> also most other cryptographic hash functions) is inherently serialized
> and therefore can't always take advantage of the CPU's full throughput.
>
> An earlier attempt to support multibuffer hashing in Linux was based
> around the ahash API.  That approach had some major issues, as does the
> alternative ahash-based approach proposed by Herbert (see my response at
> https://lore.kernel.org/linux-crypto/20240610164258.GA3269@sol.localdomain/).
> This patchset instead takes a much simpler approach of just adding a
> synchronous API for hashing equal-length messages.
>
> This works well for dm-verity and fsverity, which use Merkle trees and
> therefore hash large numbers of equal-length messages.
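(Illustrative aside on the quoted point: one level of a Merkle tree is
exactly "hash N equal-length messages". The sketch below uses plain
Python hashlib; a real multibuffer implementation would interleave
these hashes in SIMD registers, which Python cannot express, and the
4096-byte block size and digest layout are assumptions, not the exact
fs-verity/dm-verity on-disk format.)

```python
import hashlib

BLOCK = 4096                 # an assumed Merkle-tree block size
data = bytes(8 * BLOCK)      # eight equal-length leaf blocks
leaves = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

# Every message in this level has the same length -- the constraint
# the proposed synchronous multibuffer API is built around.
digests = [hashlib.sha256(leaf).digest() for leaf in leaves]

# The parent node hashes the concatenated child digests (layout is
# illustrative only).
parent = hashlib.sha256(b"".join(digests)).hexdigest()
print(f"hashed {len(digests)} equal-length leaves")
```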

Thank you for continuing to work on this! Improving dm-verity
performance is a high priority for Android, and this patch series
shows very promising results. FWIW, I would like to see this merged
upstream, and any ahash improvements handled in follow-up patches. For
the series:

Reviewed-by: Sami Tolvanen 

Sami



Re: [PATCH 09/26] nbd: move setting the cache control flags to __nbd_set_size

2024-06-11 Thread Josef Bacik
On Tue, Jun 11, 2024 at 07:19:09AM +0200, Christoph Hellwig wrote:
> Move setting the cache control flags in nbd in preparation for moving
> these flags into the queue_limits structure.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Josef Bacik 

Thanks,

Josef



Re: [PATCH 01/26] sd: fix sd_is_zoned

2024-06-11 Thread Bart Van Assche

On 6/10/24 10:19 PM, Christoph Hellwig wrote:

Since commit 7437bb73f087 ("block: remove support for the host aware zone
model"), only ZBC devices expose a zoned access model.  sd_is_zoned is
used to check for that and thus return false for host aware devices.


Reviewed-by: Bart Van Assche 



Re: [PATCH 03/26] loop: stop using loop_reconfigure_limits in __loop_clr_fd

2024-06-11 Thread Bart Van Assche

On 6/10/24 10:19 PM, Christoph Hellwig wrote:

__loop_clr_fd wants to clear all settings on the device.  Prepare for
moving more settings into the block limits by open coding
loop_reconfigure_limits.


If Damien's comment is addressed, feel free to add:

Reviewed-by: Bart Van Assche 



Re: [PATCH 04/26] loop: always update discard settings in loop_reconfigure_limits

2024-06-11 Thread Bart Van Assche

On 6/10/24 10:19 PM, Christoph Hellwig wrote:

Simplify loop_reconfigure_limits by always updating the discard limits.
This adds a little more work to loop_set_block_size, but doesn't change
the outcome as the discard flag won't change.


Reviewed-by: Bart Van Assche 



Re: [PATCH 05/26] loop: regularize upgrading the block size for direct I/O

2024-06-11 Thread Bart Van Assche

On 6/10/24 10:19 PM, Christoph Hellwig wrote:

The LOOP_CONFIGURE path automatically upgrades the block size to that
of the underlying file for O_DIRECT file descriptors, but the
LOOP_SET_BLOCK_SIZE path does not.  Fix this by lifting the code to
pick the block size into common code.


Reviewed-by: Bart Van Assche 



Re: [PATCH 06/26] loop: also use the default block size from an underlying block device

2024-06-11 Thread Bart Van Assche

On 6/10/24 10:19 PM, Christoph Hellwig wrote:

Fix the code in loop_reconfigure_limits to pick a default block size for
O_DIRECT file descriptors to also work when the loop device sits on top
of a block device and not just on a regular file on a block device based
file system.


Reviewed-by: Bart Van Assche 



Re: [PATCH 07/26] loop: fold loop_update_rotational into loop_reconfigure_limits

2024-06-11 Thread Bart Van Assche

On 6/10/24 10:19 PM, Christoph Hellwig wrote:

This prepares for moving the rotational flag into the queue_limits and
also fixes it for the case where the loop device is backed by a block
device.


Reviewed-by: Bart Van Assche 



Re: [PATCH 08/26] virtio_blk: remove virtblk_update_cache_mode

2024-06-11 Thread Bart Van Assche

On 6/10/24 10:19 PM, Christoph Hellwig wrote:

virtblk_update_cache_mode boils down to a single call to
blk_queue_write_cache.  Remove it in preparation for moving the cache
control flags into the queue_limits.


Reviewed-by: Bart Van Assche 



Re: [PATCH 09/26] nbd: move setting the cache control flags to __nbd_set_size

2024-06-11 Thread Bart Van Assche

On 6/10/24 10:19 PM, Christoph Hellwig wrote:

Move setting the cache control flags in nbd in preparation for moving
these flags into the queue_limits structure.


Reviewed-by: Bart Van Assche 



Re: [PATCH 11/26] block: freeze the queue in queue_attr_store

2024-06-11 Thread Bart Van Assche

On 6/10/24 10:19 PM, Christoph Hellwig wrote:

queue_attr_store updates attributes used to control generating I/O, and
can cause malformed bios if changed with I/O in flight.  Freeze the queue
in common code instead of adding it to almost every attribute.


Reviewed-by: Bart Van Assche 



Re: [PATCH 12/26] block: remove blk_flush_policy

2024-06-11 Thread Bart Van Assche

On 6/10/24 10:19 PM, Christoph Hellwig wrote:

Fold blk_flush_policy into the only caller to prepare for pending changes
to it.

Reviewed-by: Bart Van Assche 



Re: [PATCH 03/11] block: remove the BIP_IP_CHECKSUM flag

2024-06-11 Thread Martin K. Petersen


Christoph,

Sorry about the delay. Travel got in the way.

>> On the wire between controller and target there's only CRC. If I want to
>> write a "bad" CRC to disk, I have switch the controller to CRC mode. The
>> controller can't convert a "bad" IP checksum to a "bad" CRC. The PI test
>> tooling relies heavily on being able to write "bad" things to disk and
>> read them back to validate that we detect the error.
>
> But how do you even toggle the flag?  There is no code to do that.
> And if you already have a special kernel module for that it really
> should just use a passthrough request to take care of that.

A passthrough command to the controller?

> Note that unlike the NOCHECK flag which I just cleaned up because they
> were unused, this one actually does get in the way of the architecture
> of the whole series :( We could add a per-bip csum_type but it would
> feel really weird.

Why would it feel weird? That's how it currently works.

The qualification tool issues a flurry of commands injecting errors at
various places in the stack to identify that the right entity (block
layer, controller, storage device) catch a bad checksum, reference tag,
etc. Being able to enable/disable checking at each place in the stack is
important. I also have code for target that does the same thing in the
reverse direction.

-- 
Martin K. Petersen  Oracle Linux Engineering



Re: [PATCH 02/11] block: remove the unused BIP_{CTRL,DISK}_NOCHECK flags

2024-06-11 Thread Martin K. Petersen


Christoph,

> I can just keep the flags in, they aren't really in the way of anything
> else here.  That being said, if you want opt-in aren't they the wrong
> polarity anyway?

I don't particularly like the polarity. It is an artifact of the fact
that unless otherwise noted, checking will be enabled both at HBA and
storage device. So if we reverse the polarity, it would mean that sd.c,
somewhat counter-intuitively, would enable checking on a bio that has no
bip attached. Since checking is enabled by default, regardless of
whether a bip is provided, it seemed more appropriate to opt in to
disabling the checks.

I believe one of my review comments wrt. to the io_uring passthrough
series was that I'd prefer to see the userland flag have the right
polarity, though. Because at that level, explicitly enabling checking
makes more sense.

I don't really mind reversing the BIP flag polarity either. It's mostly
a historical artifact since non-DIX HBAs would snoop INQUIRY and READ
CAPACITY and transparently enable T10 PI on the wire. DIX moved that
decision to sd.c instead of being done by HBA firmware. But we'd still
want checking to be enabled by default even if no integrity was passed
down from the HBA.

-- 
Martin K. Petersen  Oracle Linux Engineering



Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

2024-06-11 Thread Eric Biggers
On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote:
> 
> BTW, I found an old Intel paper that claims through their multi-
> buffer strategy they were able to make AES-CBC-XCBC beat AES-GCM.
> I wonder if we could still replicate this today:
> 
> https://github.com/intel/intel-ipsec-mb/wiki/doc/fast-multi-buffer-ipsec-implementations-ia-processors-paper.pdf

No, not even close.  Even assuming that the lack of parallelizability in AES-CBC
and AES-XCBC can be entirely compensated for via multibuffer crypto (which
really it can't -- consider single packets, for example), doing AES twice is
much more expensive than doing AES and GHASH.  GHASH is a universal hash
function, and computing a universal hash function is inherently cheaper than
computing a cryptographic hash function.  But also modern Intel CPUs have very
fast carryless multiplication, and it uses a different execution port from what
AES uses.  So the overhead of AES + GHASH over AES alone is very small.  By
doing AES twice, you'd be entirely bottlenecked by the ports that can execute
the AES instructions, while the other ports go nearly unused.  So it would
probably be approaching twice as slow as AES-GCM.

Westmere (2010) through Ivy Bridge (2012) are the only Intel CPUs where
multibuffer AES-CBC-XCBC could plausibly be faster than AES-GCM (given a
sufficiently large number of messages at once), due to the very slow pclmulqdq
instruction on those CPUs.  This is long since fixed, as pclmulqdq became much
faster in Haswell (2013), and faster still in Broadwell.  This is exactly what
that Intel paper shows; they show AES-GCM becoming fastest in "Gen 4", i.e.
Haswell.  The paper is from 2012, so of course they don't show anything after
that.  But AES-GCM has only pulled ahead even more since then.

In theory something like AES-CBC + SHA-256 could be slightly more competitive
than AES-CBC + AES-XCBC.  But it would still be worse than simply doing AES-GCM
-- which again, doesn't need multibuffer, and my recent patches have already
fully optimized for recent x86_64 CPUs.

- Eric



Re: [PATCH v4 6/8] fsverity: improve performance by using multibuffer hashing

2024-06-11 Thread Eric Biggers
On Tue, Jun 11, 2024 at 11:39:08PM +0800, Herbert Xu wrote:
> On Tue, Jun 11, 2024 at 11:21:43PM +0800, Herbert Xu wrote:
> >
> > Therefore if we switched to a linked-list API networking could
> > give us the buffers with minimal changes.
> 
> BTW, this is not just about parallelising hashing.  Just as one of
> the most significant benefits of GSO does not come from hardware
> offload, but rather the amortisation of (network) stack overhead.
> IOW you're traversing a very deep stack once instead of 40 times
> (this is the factor for 64K vs MTU, if we extend beyond 64K (which
> we absolute should do) the benefit would increase as well).
> 
> The same should apply to the Crypto API.  So even if this was a
> purely software solution with no assembly code at all, it may well
> improve GCM performance (at least for users able to feed us bulk
> data, like networking).
> 

At best this would save an indirect call per message, if the underlying
algorithm explicitly added support for it and the user of the API migrated to
the multi-request model.  This alone doesn't seem worth the effort of migrating
to multi-request, especially considering the many other already-possible
optimizations that would not require API changes or migrating users to
multi-request.  The x86_64 AES-GCM is pretty well optimized now after my recent
patches, but there's still an indirect call associated with the use of the SIMD
helper which could be eliminated, saving one per message (already as much as we
could hope to get from multi-request).  authenc on the other hand is almost
totally unoptimized, as I mentioned before; it makes little sense to talk about
any sort of multi-request optimization for it at this point.

- Eric



[PATCH] dm-raid: Fix WARN_ON_ONCE check for sync_thread in raid_resume

2024-06-11 Thread Benjamin Marzinski
dm-raid devices will occasionally trigger the following warning when
being resumed after a table load because MD_RECOVERY_RUNNING is set:

WARNING: CPU: 7 PID: 5660 at drivers/md/dm-raid.c:4105 raid_resume+0xee/0x100 [dm_raid]

The failing check is:
WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));

This check is designed to make sure that the sync thread isn't
registered, but md_check_recovery can set MD_RECOVERY_RUNNING without
the sync_thread ever getting registered. Instead of checking if
MD_RECOVERY_RUNNING is set, check if sync_thread is non-NULL.

Fixes: 16c4770c75b1 ("dm-raid: really frozen sync_thread during suspend")
Suggested-by: Yu Kuai 
Signed-off-by: Benjamin Marzinski 
---
 drivers/md/dm-raid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index abe88d1e6735..74184989fd15 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -4102,7 +4102,7 @@ static void raid_resume(struct dm_target *ti)
rs_set_capacity(rs);
 
WARN_ON_ONCE(!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery));
-   WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+   WARN_ON_ONCE(mddev->sync_thread);
clear_bit(RT_FLAG_RS_FROZEN, &rs->runtime_flags);
mddev_lock_nointr(mddev);
mddev->ro = 0;
-- 
2.45.0




Re: [PATCH] dm-raid: Fix WARN_ON_ONCE check for sync_thread in raid_resume

2024-06-11 Thread Yu Kuai

Hi,

在 2024/06/12 5:55, Benjamin Marzinski 写道:

dm-raid devices will occasionally trigger the following warning when
being resumed after a table load because MD_RECOVERY_RUNNING is set:

WARNING: CPU: 7 PID: 5660 at drivers/md/dm-raid.c:4105 raid_resume+0xee/0x100 [dm_raid]

The failing check is:
WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));

This check is designed to make sure that the sync thread isn't
registered, but md_check_recovery can set MD_RECOVERY_RUNNING without
the sync_thread ever getting registered. Instead of checking if
MD_RECOVERY_RUNNING is set, check if sync_thread is non-NULL.

Fixes: 16c4770c75b1 ("dm-raid: really frozen sync_thread during suspend")
Suggested-by: Yu Kuai 
Signed-off-by: Benjamin Marzinski 
---
  drivers/md/dm-raid.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index abe88d1e6735..74184989fd15 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -4102,7 +4102,7 @@ static void raid_resume(struct dm_target *ti)
rs_set_capacity(rs);
  
  		WARN_ON_ONCE(!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery));

-   WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery));
+   WARN_ON_ONCE(mddev->sync_thread);


sync_thread is protected by rcu, I think this will cause a sparse warning.
Please use:

rcu_dereference_protected(mddev->sync_thread,
lockdep_is_held(&mddev->reconfig_mutex));


Otherwise, LGTM.
Thanks


clear_bit(RT_FLAG_RS_FROZEN, &rs->runtime_flags);
mddev_lock_nointr(mddev);
mddev->ro = 0;






Re: [PATCH 03/11] block: remove the BIP_IP_CHECKSUM flag

2024-06-11 Thread Christoph Hellwig
On Tue, Jun 11, 2024 at 03:51:27PM -0400, Martin K. Petersen wrote:
> > But how do you even toggle the flag?  There is no code to do that.
> > And if you already have a special kernel module for that it really
> > should just use a passthrough request to take care of that.
> 
> A passthrough command to the controller?

A passthrough command to the LU that has a mismatching checksum.

> > Note that unlike the NOCHECK flag which I just cleaned up because they
> > were unused, this one actually does get in the way of the architecture
> > of the whole series :( We could add a per-bip csum_type but it would
> > feel really weird.
> 
> Why would it feel weird? That's how it currently works.

Because there's no way to have it set to anything but the per-queue
one.  

> The qualification tool issues a flurry of commands injecting errors at
> various places in the stack to identify that the right entity (block
> layer, controller, storage device) catch a bad checksum, reference tag,
> etc.

How does it do that?  There's no actual way to make it mismatch.




Re: [PATCH 02/11] block: remove the unused BIP_{CTRL,DISK}_NOCHECK flags

2024-06-11 Thread Christoph Hellwig
I can just leave them in for now.  Or I can remove them and we add them
back with the right polarity and support in nvme when we add an
actual user.  Either way is pretty simple unlike the weird ip checksum
thing.




Re: [PATCH 02/26] sd: move zone limits setup out of sd_read_block_characteristics

2024-06-11 Thread Christoph Hellwig
On Tue, Jun 11, 2024 at 02:51:24PM +0900, Damien Le Moal wrote:
> > -   if (!sd_is_zoned(sdkp))
> > +   if (!sd_is_zoned(sdkp)) {
> > +   lim->zoned = false;
> 
> Maybe we should clear the other zone related limits here ? If the drive is
> reformatted/converted from SMR to CMR (FORMAT WITH PRESET), the other zone
> limits may be set already, no ?

Yes, but we would not end up here.  The device type is constant over
the lifetime of the scsi_device structure, so for it to change we'd have
to fully reprobe the device.

So we don't need to clear any flags, including the actual zoned flag
here.




Re: [PATCH 10/26] xen-blkfront: don't disable cache flushes when they fail

2024-06-11 Thread Christoph Hellwig
On Tue, Jun 11, 2024 at 04:30:39PM +0900, Damien Le Moal wrote:
> On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> > blkfront always had a robust negotiation protocol for detecting a write
> > cache.  Stop simply disabling cache flushes when they fail as that is
> > a grave error.
> > 
> > Signed-off-by: Christoph Hellwig 
> 
> Looks good to me but maybe mention that removal of xlvbd_flush() as well ?
> And it feels like the "stop disabling cache flushes when they fail" part
> should be a fix patch sent separately...

I'll move the patch to the front of the series to get more attention from
the maintainers, but otherwise the xlvbd_flush removal is the really
trivial part here.



Re: [PATCH 13/26] block: move cache control settings out of queue->flags

2024-06-11 Thread Christoph Hellwig
A friendly reminder that I've skipped over the full quote.  Please
properly quote mails if you want your replies to be seen.




Re: [PATCH 13/26] block: move cache control settings out of queue->flags

2024-06-11 Thread Christoph Hellwig
On Tue, Jun 11, 2024 at 04:55:04PM +0900, Damien Le Moal wrote:
> On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> > Move the cache control settings into the queue_limits so that they
> > can be set atomically and all I/O is frozen when changing the
> > flags.
> 
> ...so that they can be set atomically with the device queue frozen when
> changing the flags.
> 
> may be better.

Sure.

If there was anything below I've skipped it after skipping over two
pages of full quotes.




Re: [PATCH 16/26] block: move the io_stat flag setting to queue_limits

2024-06-11 Thread Christoph Hellwig
On Tue, Jun 11, 2024 at 05:09:45PM +0900, Damien Le Moal wrote:
> On 6/11/24 2:19 PM, Christoph Hellwig wrote:
> > Move the io_stat flag into the queue_limits feature field so that it
> > can be set atomically and all I/O is frozen when changing the flag.
> 
> Why a feature ? It seems more appropriate for io_stat to be a flag rather than
> a feature as that is a block layer thing rather than a device characteristic, 
> no ?

Because it must actually be supported by the driver for bio based
drivers.  Then again we also support changing it through sysfs, so
we might actually need both.  At least unlike say the cache it's
not actively harmful when enabled despite not being supported.

I can look into that, but I'll do it in another series after getting
all the driver changes out.



Re: [PATCH 19/26] block: move the nowait flag to queue_limits

2024-06-11 Thread Christoph Hellwig
On Tue, Jun 11, 2024 at 05:16:37PM +0900, Damien Le Moal wrote:
> > @@ -1825,9 +1815,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> > int r;
> >  
> > if (dm_table_supports_nowait(t))
> > -   blk_queue_flag_set(QUEUE_FLAG_NOWAIT, q);
> > -   else
> > -   blk_queue_flag_clear(QUEUE_FLAG_NOWAIT, q);
> > +   limits->features &= ~BLK_FEAT_NOWAIT;
> 
> Shouldn't you set the flag here instead of clearing it ?

No, but the dm_table_supports_nowait check needs to be inverted.
 



Re: [PATCH 21/26] block: move the poll flag to queue_limits

2024-06-11 Thread Christoph Hellwig
On Tue, Jun 11, 2024 at 05:21:07PM +0900, Damien Le Moal wrote:
> Kind of the same remark as for io_stat about this not really being a device
> feature. But I guess seeing "features" as a queue feature rather than just a
> device feature makes it OK to have poll (and io_stat) as a feature rather than
> a flag.

So unlike io_stat this very much is a feature and a feature only as
we don't even allow changing it.  It purely exposes a device (or
rather driver) capability.



[PATCH] multipath.conf.5: fix the description of prio_args for path_latency prio

2024-06-11 Thread 303146950
From: Kou Wenqi 

This aligns the description of prio_args for path_latency prio and
the actual code.

Signed-off-by: Kou Wenqi 
---
 multipath/multipath.conf.5.in | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/multipath/multipath.conf.5.in b/multipath/multipath.conf.5.in
index c788c180..c4ebce97 100644
--- a/multipath/multipath.conf.5.in
+++ b/multipath/multipath.conf.5.in
@@ -437,11 +437,11 @@ Needs a value of the form "io_num=\fI<20>\fR base_num=\fI<10>\fR"
 .TP 8
 .I io_num
 The number of read IOs sent to the current path continuously, used to calculate the average path latency.
-Valid Values: Integer, [2, 200].
+Valid Values: Integer, [20, 200].
 .TP
 .I base_num
-The base number value of logarithmic scale, used to partition different priority ranks. Valid Values: Integer,
-[2, 10]. And Max average latency value is 100s, min average latency value is 1us.
+The base number value of logarithmic scale, used to partition different priority ranks. Valid Values:
+Double-precision floating-point, [1.1, 10]. And Max average latency value is 100s, min average latency value is 1us.
 For example: If base_num=10, the paths will be grouped in priority groups with path latency <=1us, (1us, 10us],
 (10us, 100us], (100us, 1ms], (1ms, 10ms], (10ms, 100ms], (100ms, 1s], (1s, 10s], (10s, 100s], >100s.
 .RE
-- 
2.27.0