date:20240208

Re: [PATCH v3 1/5] zonefs: pass GFP_KERNEL to blkdev_zone_mgmt() call

2024-02-08 Thread Damien Le Moal

On 1/29/24 16:52, Johannes Thumshirn wrote:
> Pass GFP_KERNEL instead of GFP_NOFS to the blkdev_zone_mgmt() call in
> zonefs_zone_mgmt().
> 
> As as zonefs_zone_mgmt() and zonefs_inode_zone_mgmt() are never called
> from a place that can recurse back into the filesystem on memory reclaim,
> it is save to call blkdev_zone_mgmt() with GFP_KERNEL.
> 
> Link: https://lore.kernel.org/all/zzcgxi46ainlc...@casper.infradead.org/
> Signed-off-by: Johannes Thumshirn 
> ---
>  fs/zonefs/super.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 93971742613a..63fbac018c04 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -113,7 +113,7 @@ static int zonefs_zone_mgmt(struct super_block *sb,
>  
>   trace_zonefs_zone_mgmt(sb, z, op);
>   ret = blkdev_zone_mgmt(sb->s_bdev, op, z->z_sector,
> -z->z_size >> SECTOR_SHIFT, GFP_NOFS);
> +z->z_size >> SECTOR_SHIFT, GFP_KERNEL);
>   if (ret) {
>   zonefs_err(sb,
>  "Zone management operation %s at %llu failed %d\n",
> 

Given that zonefs_inode_zone_mgmt() which calls zonefs_zone_mgmt() is only used
for sequential zone inodes, and that these inodes cannot be written with
buffered IOs, I think this is safe as we will never have dirty pages to
writeback for reclaim. So there should be no risk of re-entering the FS on
reclaim with GFP_KERNEL.

So:

Acked-by: Damien Le Moal 

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH 02/26] block: Remove req_bio_endio()

2024-02-08 Thread Damien Le Moal

On 2/6/24 02:28, Bart Van Assche wrote:
> On 2/1/24 23:30, Damien Le Moal wrote:
>> @@ -916,9 +888,8 @@ bool blk_update_request(struct request *req, blk_status_t
>> error,
>>   if (blk_crypto_rq_has_keyslot(req) && nr_bytes >= blk_rq_bytes(req))
>>   __blk_crypto_rq_put_keyslot(req);
>>   -    if (unlikely(error && !blk_rq_is_passthrough(req) &&
>> - !(req->rq_flags & RQF_QUIET)) &&
>> - !test_bit(GD_DEAD, &req->q->disk->state)) {
>> +    if (unlikely(error && !blk_rq_is_passthrough(req) && !quiet) &&
>> +    !test_bit(GD_DEAD, &req->q->disk->state)) {
> 
> The new indentation of !test_bit(GD_DEAD, &req->q->disk->state) looks odd to 
> me

But it is actually correct because that test bit is not part of the unlikely().
Not sure if that is intentional though.

> ...
> 
>>   blk_print_req_error(req, error);
>>   trace_block_rq_error(req, error, nr_bytes);
>>   }
>> @@ -930,12 +901,37 @@ bool blk_update_request(struct request *req,
>> blk_status_t error,
>>   struct bio *bio = req->bio;
>>   unsigned bio_bytes = min(bio->bi_iter.bi_size, nr_bytes);
>>   -    if (bio_bytes == bio->bi_iter.bi_size)
>> +    if (unlikely(error))
>> +    bio->bi_status = error;
>> +
>> +    if (bio_bytes == bio->bi_iter.bi_size) {
>>   req->bio = bio->bi_next;
> 
> The behavior has been changed compared to the original code: the original code
> only tests bio_bytes if error == 0. The new code tests bio_bytes no matter 
> what
> value the 'error' variable has. Is this behavior change intentional?

No change actually. The bio_bytes test was in blk_update_request() already.

> 
> Otherwise this patch looks good to me.
> 
> Thanks,
> 
> Bart.

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH 00/26] Zone write plugging

2024-02-08 Thread Damien Le Moal

On 2/3/24 21:11, Jens Axboe wrote:
>> I forgot to mention that the patches are against Jens block/for-next
>> branch with the addition of Christoph's "clean up blk_mq_submit_bio"
>> patches [1] and my patch "null_blk: Always split BIOs to respect queue
>> limits" [2].
> 
> I figured that was the case, I'll get both of these properly setup in a
> for-6.9/block branch, just wanted -rc3 to get cut first. JFYI that they
> are coming tomorrow.

Jens,

I saw the updated rc3-based for-next branch. Thanks for that. But it seems that
you removed the mq-deadline insert optimization ? Is that in purpose or did I
mess up something ?

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH 00/26] Zone write plugging

2024-02-08 Thread Damien Le Moal

On 2/6/24 10:25, Bart Van Assche wrote:
> On 2/5/24 16:07, Damien Le Moal wrote:
>> On 2/6/24 03:18, Bart Van Assche wrote:
>>> Are there numbers available about the performance differences (bandwidth
>>> and latency) between plugging zoned write bios and zoned write plugging
>>> requests?
>>
>> Finish reading the cover letter. It has lots of measurements with rc2, Jens
>> block/for-next and ZWP...
> Hmm ... as far as I know nobody ever implemented zoned write plugging
> for requests in the block layer core so these numbers can't be in the
> cover letter.

No, I have not implemented zone write plugging for requests as I beleive it
would lead to very similar results as zone write locking, that is, a potential
problem with efficiently using a device in a mixed read/write workload as
having too many plugged writes can lead to read starvation (blocking of read
submission on request allocation when nr_requests is reached).

> Has the bio plugging approach perhaps been chosen because it works
> better for bio-based device mapper drivers?

Not that it "works better" but rather that doing plugging at the BIO level
allows re-using the exact same code for zone append emulation, and write
ordering (if a DM driver wants the block layer to handle that). We had zone
append emulation implemented for DM (for dm-crypt) using BIOs and in scsi sd
driver using requests. ZWP unifies all this and will trivially allow enabling
that emulation for other device types as well (e.g. NVMe ZNS drives that do not
have native zone append support).

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH 25/26] block: Reduce zone write plugging memory usage

2024-02-08 Thread Damien Le Moal

On 2/7/24 06:20, Bart Van Assche wrote:
> On 2/5/24 15:55, Damien Le Moal wrote:
>> The array of struct blk_zone_wplug for the disk is sized for the total 
>> number of
>> zones of the drive. The reason for that is that we want to retain the 
>> wp_offset
>> value for all zones, even if they are not being written. Otherwise, 
>> everytime we
>> start writing a zone, we would need to do a report zones to be able to 
>> emulate
>> zone append operations if the drive requested that.
> 
> We do not need to track wp_offset for empty zones nor for full zones. The data
> structure with plug information would become a lot smaller if it only tracks
> information for zones that are neither empty nor full. If a zone append is
> submitted to a zone and no information is being tracked for that zone, we can
> initialize wp_offset to zero. That may not match the actual write pointer if
> the zone is full but that shouldn't be an issue since write appends submitted
> to a zone that is full fail anyway.

We still need to keep in memory the write pointer offset of zones that are not
being actively written to but have been previously partially written. So I do
not see how excluding empty and full zones from that tracking simplifies
anything at all. And the union of wp offset+zone capacity with a pointer to the
active zone plug structure is not *that* complicated to handle...

-- 
Damien Le Moal
Western Digital Research

Re: [PATCH 6/6] multipathd: don't activate socket activation by default

2024-02-08 Thread Benjamin Marzinski

On Mon, Feb 05, 2024 at 01:46:38PM +0100, Martin Wilck wrote:
> Socket activation will start multipathd on systems that don't have multipath
> hardware. This is often not desired.  On systems that do have multipath
> hardware, OTOH, it is highly recommended to enable multipathd.service directly
> rather than have it started via socket activation.  Therefore don't activate
> the socket by default.  multipathd still supports socket activation, so users
> who find it useful can disable multipathd.service and activate the socket.
> 
> Fixes: https://github.com/opensvc/multipath-tools/issues/76
> Signed-off-by: Martin Wilck 
> Cc: Sergio Durigan Junior 
> Cc: Chris Hofstaedtler 

Reviewed-by: Benjamin Marzinski 

> ---
>  multipathd/multipathd.socket | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/multipathd/multipathd.socket b/multipathd/multipathd.socket
> index c777e5e..6a62f5f 100644
> --- a/multipathd/multipathd.socket
> +++ b/multipathd/multipathd.socket
> @@ -10,4 +10,6 @@ Before=sockets.target
>  ListenStream=@/org/kernel/linux/storage/multipathd
>  
>  [Install]
> -WantedBy=sockets.target
> +# Socket activation for multipathd is disabled by default.
> +# Activate it here if you find it useful.
> +# WantedBy=sockets.target
> -- 
> 2.43.0

Re: [PATCH 5/6] multipath: udev rules: use configured $(bindir) in udev rules

2024-02-08 Thread Benjamin Marzinski

On Mon, Feb 05, 2024 at 01:46:37PM +0100, Martin Wilck wrote:
> This allows us to remove the lumsy MPATH_SBIN_PATH property and
> related tests.
> 
> Signed-off-by: Martin Wilck 

Reviewed-by: Benjamin Marzinski 
> ---
>  .gitignore| 1 +
>  Makefile.inc  | 2 +-
>  multipath/{11-dm-mpath.rules => 11-dm-mpath.rules.in} | 5 +
>  multipath/Makefile| 2 +-
>  multipath/multipath.rules.in  | 5 +
>  5 files changed, 5 insertions(+), 10 deletions(-)
>  rename multipath/{11-dm-mpath.rules => 11-dm-mpath.rules.in} (97%)
> 
> diff --git a/.gitignore b/.gitignore
> index 6890e4a..049ffe8 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -16,6 +16,7 @@ multipath/multipath
>  multipath/multipath.8
>  multipath/multipath.conf.5
>  multipath/multipath.rules
> +multipath/11-dm-mpath.rules
>  multipath/tmpfiles.conf
>  multipathd/multipathd
>  multipathd/multipathd.8
> diff --git a/Makefile.inc b/Makefile.inc
> index 06bdd5e..3bcc7c2 100644
> --- a/Makefile.inc
> +++ b/Makefile.inc
> @@ -148,4 +148,4 @@ NV_VERSION_SCRIPT = $(DEVLIB:%.so=%-nv.version)
>  
>  %:   %.in
>   @echo creating $@
> - $(Q)sed 
> 's:@CONFIGFILE@:'$(configfile)':g;s:@CONFIGDIR@:'$(configdir)':g;s:@STATE_DIR@:'$(statedir)':g;s:@RUNTIME_DIR@:'$(runtimedir)':g;s/@MODPROBE_UNIT@/'$(MODPROBE_UNIT)'/g'
>  $< >$@
> + $(Q)sed 
> 's:@CONFIGFILE@:'$(configfile)':g;s:@CONFIGDIR@:'$(configdir)':g;s:@STATE_DIR@:'$(statedir)':g;s:@RUNTIME_DIR@:'$(runtimedir)':g;s/@MODPROBE_UNIT@/'$(MODPROBE_UNIT)'/g;s:@BINDIR@:'$(bindir)':g'
>  $< >$@
> diff --git a/multipath/11-dm-mpath.rules b/multipath/11-dm-mpath.rules.in
> similarity index 97%
> rename from multipath/11-dm-mpath.rules
> rename to multipath/11-dm-mpath.rules.in
> index 2706809..38a0132 100644
> --- a/multipath/11-dm-mpath.rules
> +++ b/multipath/11-dm-mpath.rules.in
> @@ -35,16 +35,13 @@ ENV{DM_SUBSYSTEM_UDEV_FLAG2}=="1", 
> ENV{MPATH_DEVICE_READY}="0", \
>  # This may not be reliable, as events aren't necessarily received in order.
>  ENV{DM_NR_VALID_PATHS}=="0", ENV{MPATH_DEVICE_READY}="0", GOTO="mpath_action"
>  
> -ENV{MPATH_SBIN_PATH}="/sbin"
> -TEST!="$env{MPATH_SBIN_PATH}/multipath", ENV{MPATH_SBIN_PATH}="/usr/sbin"
> -
>  # Don't run multipath -U during "coldplug" after switching root,
>  # because paths are just being added to the udev db.
>  ACTION=="add", ENV{.MPATH_DEVICE_READY_OLD}=="1", GOTO="paths_ok"
>  
>  # Check the map state directly with multipath -U.
>  # This doesn't attempt I/O on the device.
> -PROGRAM=="$env{MPATH_SBIN_PATH}/multipath -U -v1 %k", GOTO="paths_ok"
> +PROGRAM=="@BINDIR@/multipath -U -v1 %k", GOTO="paths_ok"
>  ENV{MPATH_DEVICE_READY}="0", GOTO="mpath_action"
>  LABEL="paths_ok"
>  
> diff --git a/multipath/Makefile b/multipath/Makefile
> index 0efb9b2..f8c1f5e 100644
> --- a/multipath/Makefile
> +++ b/multipath/Makefile
> @@ -5,7 +5,7 @@ include ../Makefile.inc
>  
>  EXEC  := multipath
>  MANPAGES  := multipath.8 multipath.conf.5
> -GENERATED := $(MANPAGES) multipath.rules tmpfiles.conf
> +GENERATED := $(MANPAGES) multipath.rules tmpfiles.conf 11-dm-mpath.rules
>  
>  CPPFLAGS += -I$(multipathdir) -I$(mpathutildir) -I$(mpathcmddir)
>  CFLAGS += $(BIN_CFLAGS)
> diff --git a/multipath/multipath.rules.in b/multipath/multipath.rules.in
> index 03fa4d7..780bf85 100644
> --- a/multipath/multipath.rules.in
> +++ b/multipath/multipath.rules.in
> @@ -18,9 +18,6 @@ GOTO="end_mpath"
>  
>  LABEL="test_dev"
>  
> -ENV{MPATH_SBIN_PATH}="/sbin"
> -TEST!="$env{MPATH_SBIN_PATH}/multipath", ENV{MPATH_SBIN_PATH}="/usr/sbin"
> -
>  # FIND_MULTIPATHS_WAIT_UNTIL is the timeout (in seconds after the
>  # epoch).
>  IMPORT{db}="FIND_MULTIPATHS_WAIT_UNTIL"
> @@ -31,7 +28,7 @@ IMPORT{db}="DM_MULTIPATH_DEVICE_PATH"
>  
>  # multipath -u sets DM_MULTIPATH_DEVICE_PATH and,
>  # if "find_multipaths smart", also FIND_MULTIPATHS_WAIT_UNTIL.
> -IMPORT{program}=="$env{MPATH_SBIN_PATH}/multipath -u %k", \
> +IMPORT{program}=="@BINDIR@/multipath -u %k", \
>   ENV{.MPATH_CHECK_PASSED}="1"
>  
>  # case 1: this is definitely multipath
> -- 
> 2.43.0

Re: [PATCH 4/6] 11-dm-mpath.rules: handle reloads during coldplug events

2024-02-08 Thread Benjamin Marzinski

On Mon, Feb 05, 2024 at 01:46:36PM +0100, Martin Wilck wrote:
> If a map reload happens while udev is processing rules for a coldplug
> event, DM_SUSPENDED may be set if the respective test in 10-dm.rules
> happens while the device is suspened. This will cause the rules for all
> higher block device layers to be skipped. Record this situation in an udev
> property. The reload operation will trigger another "change" uevent later,
> which would normally be treated as a reload, and be ignored without
> rescanning the device. If a previous "coldplug while suspended" situation is
> detected, perform a full device rescan instead.

For what it's worth, this seems reasonable. But I do see the issue you
pointed out where we can't trust DM_SUSPENDED.  Perhaps we could track
in multipathd if an add event occured for a path between when we issued
a reload, and when we got the change event (unfortunately, change events
can occur for things other than reloads that multipathd triggers, and
the only way I know to accurately associate a uevent with a reload is by
using dm udev cookies. We have those turned off in multipathd and I
think it would be a good bit of work to turn them back on without
causing lots of waiting with locks held in multipathd). 
 
> Signed-off-by: Martin Wilck 
> ---
>  multipath/11-dm-mpath.rules | 22 +-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/multipath/11-dm-mpath.rules b/multipath/11-dm-mpath.rules
> index 8fc4a6f..2706809 100644
> --- a/multipath/11-dm-mpath.rules
> +++ b/multipath/11-dm-mpath.rules
> @@ -9,8 +9,13 @@ ENV{DM_ACTION}=="PATH_*", GOTO="check_ready"
>  
>  ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}!="?*", 
> IMPORT{db}="DM_UDEV_DISABLE_OTHER_RULES_FLAG"
>  ENV{DM_NOSCAN}!="?*", IMPORT{db}="DM_NOSCAN"
> -GOTO="scan_import"
>  
> +# Coldplug event while device is suspended (e.g. during a reload)
> +ACTION=="add", ENV{DM_ACTIVATION}=="1", ENV{DM_SUSPENDED}=="1", \
> + PROGRAM="/bin/logger -t 11-dm-mpath.rules -p daemon.warning \"Coldplug 
> event for suspended device\"", \
> + ENV{DM_COLDPLUG_SUSPENDED}="1"
> +
> +GOTO="scan_import"
>  LABEL="check_ready"
>  
>  IMPORT{db}="DM_DISABLE_OTHER_RULES_FLAG_OLD"
> @@ -53,6 +58,16 @@ ENV{DM_ACTION}=="PATH_FAILED", GOTO="mpath_action"
>  ENV{MPATH_DEVICE_READY}="1"
>  
>  LABEL="mpath_action"
> +
> +# A previous coldplug event occured while the device was suspended.
> +# Activation might have been partially skipped. Activate the device now,
> +# i.e. disable the MPATH_UNCHANGED logic and set DM_ACTIVATION=1.
> +IMPORT{db}="DM_COLDPLUG_SUSPENDED"
> +ENV{DM_COLDPLUG_SUSPENDED}=="1", ENV{DM_SUSPENDED}!="1", \
> + ENV{DM_ACTIVATION}="1", ENV{MPATH_UNCHANGED}="0", \
> + PROGRAM="/bin/logger -t 11-dm-mpath.rules -p daemon.notice \"Forcing 
> activation of previously suspended device\"", \
> + GOTO="force_activation"
> +
>  # DM_SUBSYSTEM_UDEV_FLAG0 is the "RELOAD" flag for multipath subsystem.
>  # Drop the DM_ACTIVATION flag here as mpath reloads tables if any of its
>  # paths are lost/recovered. For any stack above the mpath device, this is not
> @@ -67,6 +82,8 @@ ENV{DM_SUBSYSTEM_UDEV_FLAG0}=="1", \
>  ENV{DM_ACTION}=="PATH_FAILED|PATH_REINSTATED", \
>   ENV{DM_ACTIVATION}="0", ENV{MPATH_UNCHANGED}="1"
>  
> +LABEL="force_activation"
> +
>  # Do not initiate scanning if no path is available,
>  # otherwise there would be a hang or IO error on access.
>  # We'd like to avoid this, especially within udev processing.
> @@ -103,6 +120,9 @@ IMPORT{db}="ID_PART_GPT_AUTO_ROOT"
>  
>  LABEL="import_end"
>  
> +# Reset previous DM_COLDPLUG_SUSPENDED if activation happens now
> +ENV{DM_SUSPENDED}!="1", ENV{DM_ACTIVATION}=="1", 
> ENV{DM_COLDPLUG_SUSPENDED}=""
> +
>  # Multipath maps should take precedence over their members.
>  ENV{DM_UDEV_LOW_PRIORITY_FLAG}!="1", OPTIONS+="link_priority=50"
>  
> -- 
> 2.43.0

Re: [PATCH v4 06/12] dax: Check for data cache aliasing at runtime

2024-02-08 Thread Dan Williams

Mathieu Desnoyers wrote:
> On 2024-02-08 17:37, Dan Williams wrote:
> > Mathieu Desnoyers wrote:
> >> On 2024-02-08 16:39, Dan Williams wrote:
> >> [...]
> >>>
> >>> So per other feedback on earlier patches, I think this hunk deserves to
> >>> be moved to its own patch earlier in the series as a standalone fixup.
> >>
> >> Done.
> >>
> >>>
> >>> Rest of this patch looks good to me.
> >>
> >> Adding your Acked-by to what is left of this patch if OK with you.
> > 
> > You can add:
> > 
> > Reviewed-by: Dan Williams 
> > 
> > ...after that re-org.
> 
> Just to make sure: are you OK with me adding your Reviewed-by
> only for what is left of this patch, or also to the other driver
> patches after integrating your requested changes ?

Sure, if you make all those changes go ahead and propagate my
Reviewed-by across the set.

Re: [PATCH 3/6] 11-dm-mpath.rules: use import logic like 13-dm-disk.rules

2024-02-08 Thread Benjamin Marzinski

On Mon, Feb 05, 2024 at 01:46:35PM +0100, Martin Wilck wrote:
> We have to import the properties if either DM_NOSCAN or
> DM_DISABLE_OTHER_RULES_FLAG is set, because blkid will be skipped
> in both cases. Also, if DM_UDEV_PRIMARY_SOURCE_FLAG is not set,
> it makes no sense to try and import the properties.
> 
> Signed-off-by: Martin Wilck 
> ---
>  multipath/11-dm-mpath.rules | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/multipath/11-dm-mpath.rules b/multipath/11-dm-mpath.rules
> index 43d227c..8fc4a6f 100644
> --- a/multipath/11-dm-mpath.rules
> +++ b/multipath/11-dm-mpath.rules
> @@ -89,7 +89,8 @@ ENV{MPATH_DEVICE_READY}!="0", 
> ENV{.MPATH_DEVICE_READY_OLD}=="0", \
>  # not. If symlinks get lost, systemd may auto-unmount file systems.
>  
>  LABEL="scan_import"
> -ENV{DM_NOSCAN}!="1", GOTO="import_end"
> +ENV{DM_NOSCAN}!="1", ENV{DM_DISABLE_OTHER_RULES_FLAG}!="1", \
> + ENV{DM_UDEV_PRIMARY_SOURCE_FLAG}=="1", GOTO="import_end"
>  IMPORT{db}="ID_FS_TYPE"
>  IMPORT{db}="ID_FS_USAGE"
>  IMPORT{db}="ID_FS_UUID_ENC"
> -- 
> 2.43.0

Doesn't this mean that we will always import the properties if
ENV{DM_UDEV_PRIMARY_SOURCE_FLAG}!="1"

I think you mean

ENV{DM_UDEV_PRIMARY_SOURCE_FLAG}!="1", GOTO="import_end"
ENV{DM_NOSCAN}!="1", ENV{DM_DISABLE_OTHER_RULES_FLAG}!="1", GOTO="import_end"

right?

-Ben

Re: [PATCH 2/6] 11-dm-mpath.rules: fix list of imported properties

2024-02-08 Thread Benjamin Marzinski

On Mon, Feb 05, 2024 at 01:46:34PM +0100, Martin Wilck wrote:
> Make sure we import all properties that are also imported in
> 13-dm-disk.rules. Keep importing ID_FS_TYPE for now to avoid
> breakage, even if 13-dm-disk.rules does not.
> 
> Signed-off-by: Martin Wilck 
> ---
>  multipath/11-dm-mpath.rules | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/multipath/11-dm-mpath.rules b/multipath/11-dm-mpath.rules
> index 2c4d006..43d227c 100644
> --- a/multipath/11-dm-mpath.rules
> +++ b/multipath/11-dm-mpath.rules
> @@ -92,11 +92,13 @@ LABEL="scan_import"
>  ENV{DM_NOSCAN}!="1", GOTO="import_end"
>  IMPORT{db}="ID_FS_TYPE"
>  IMPORT{db}="ID_FS_USAGE"
> -IMPORT{db}="ID_FS_UUID"
>  IMPORT{db}="ID_FS_UUID_ENC"
> -IMPORT{db}="ID_FS_LABEL"
>  IMPORT{db}="ID_FS_LABEL_ENC"
>  IMPORT{db}="ID_FS_VERSION"
> +IMPORT{db}="ID_PART_ENTRY_NAME"
> +IMPORT{db}="ID_PART_ENTRY_UUID"
> +IMPORT{db}="ID_PART_ENTRY_SCHEME"
> +IMPORT{db}="ID_PART_GPT_AUTO_ROOT"
>  
>  LABEL="import_end"
>  
> -- 
> 2.43.0

Reviewed-by: Benjamin Marzinski

Re: [PATCH 1/6] 11-dm-mpath.rules: don't import properties that are already set

2024-02-08 Thread Benjamin Marzinski

On Tue, Feb 06, 2024 at 11:50:46AM +0100, Martin Wilck wrote:
> On Mon, 2024-02-05 at 15:44 -0500, Benjamin Marzinski wrote:
> > On Mon, Feb 05, 2024 at 01:46:33PM +0100, Martin Wilck wrote:
> > > DM_UDEV_DISABLE_OTHER_RULES_FLAG and DM_NOSCAN may be already set
> > > from previous rules, e.g. if the device is suspended. Make sure
> > > we don't overwrite them.
> > > 
> > > DM_DISABLE_OTHER_RULES_FLAG_OLD and MPATH_DEVICE_READY are only
> > > used in this file, and not used in the scan_import code path.
> > > 
> > > Signed-off-by: Martin Wilck 
> > > ---
> > >  multipath/11-dm-mpath.rules | 15 +++
> > >  1 file changed, 11 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/multipath/11-dm-mpath.rules b/multipath/11-dm-
> > > mpath.rules
> > > index c339f52..2c4d006 100644
> > > --- a/multipath/11-dm-mpath.rules
> > > +++ b/multipath/11-dm-mpath.rules
> > > @@ -2,12 +2,19 @@ ACTION!="add|change", GOTO="mpath_end"
> > >  ENV{DM_UDEV_RULES_VSN}!="?*", GOTO="mpath_end"
> > >  ENV{DM_UUID}!="mpath-?*", GOTO="mpath_end"
> > >  
> > > -IMPORT{db}="DM_DISABLE_OTHER_RULES_FLAG_OLD"
> > > -IMPORT{db}="MPATH_DEVICE_READY"
> > > -
> > >  # If this uevent didn't come from dm, don't try to update the
> > >  # device state
> > > -ENV{DM_COOKIE}!="?*", ENV{DM_ACTION}!="PATH_*",
> > > IMPORT{db}="DM_UDEV_DISABLE_OTHER_RULES_FLAG",
> > > IMPORT{db}="DM_NOSCAN", GOTO="scan_import"
> > > +ENV{DM_COOKIE}=="?*", GOTO="check_ready"
> > > +ENV{DM_ACTION}=="PATH_*", GOTO="check_ready"
> > > +
> > > +ENV{DM_UDEV_DISABLE_OTHER_RULES_FLAG}!="?*",
> > > IMPORT{db}="DM_UDEV_DISABLE_OTHER_RULES_FLAG"
> > > +ENV{DM_NOSCAN}!="?*", IMPORT{db}="DM_NOSCAN"
> > > +GOTO="scan_import"
> > > +
> > 
> > If we do this, don't we forget the values for
> > DM_DISABLE_OTHER_RULES_FLAG_OLD and MPATH_DEVICE_READY whenever we
> > get a
> > non-dm uevent? If we skip importing them for a uevent, they're
> > dropped
> > from the database, which means that on the next dm-originated uevent
> > we
> > won't be able to get the old values. right?
> 
> Right, I didn't consider that. It makes sense for MPATH_DEVICE_READY.
> 
> However, while at it, I wonder about our rationale for saving
> DM_UDEV_DISABLE_OTHER_RULES_FLAG in DM_DISABLE_OTHER_RULES_FLAG_OLD.
> 
> In 10-dm.rules, DM_DISABLE_OTHER_RULES_FLAG changes its value only
>  - in DM_UDEV_PRIMARY_SOURCE_FLAG=1 events, according to the value of
> DM_SUSPENDED
>  - for DISK_RO=1 events (switches the flag on) 
> 
> (11-dm-lvm.rules has some additional logic that doesn't matter for
> multipath).
> 
> For all other events, the flag is restored from the udev db in 10-
> dm.rules. Ignoring DISK_RO, it always has the value that DM_SUSPENDED
> had in the last DM_UDEV_PRIMARY_SOURCE_FLAG=1 (aka map reload) event.
> 
> The logic in 11-dm-mpath.rules adds a check for unusable arrays
> on top, setting DM_DISABLE_OTHER_RULES_FLAG if MPATH_DEVICE_READY
> switches from 1 to 0. In this case we save the previous value in
> DM_DISABLE_OTHER_RULES_FLAG_OLD, and restore it from there in a later
> event if MPATH_DEVICE_READY switches back from 0 to 1.
> 
> IMO the following can happen:
> 
>  1. an event arrives while DM_SUSPENDED=1, causing 
> DM_DISABLE_OTHER_RULES_FLAG=1 to be set
>  2. 11-dm-mpath.rules sees no paths and saves
> DM_DISABLE_OTHER_RULES_FLAG=1 to DM_DISABLE_OTHER_RULES_FLAG_OLD
>  3. an event arrives after the device has resumed, DM_SUSPENDED and
> DM_DISABLE_OTHER_RULES_FLAG are cleared in 10-dm.rules
>  4. 11-dm-mpath.rules sees MPATH_DEVICE_READY=1 and restores
> DM_DISABLE_OTHER_RULES_FLAG, setting it to 1
> 
> ... and this would be wrong, no?
> 
> It seems to me that we should not save DM_DISABLE_OTHER_RULES_FLAG_OLD
> between uevents. We must set DM_DISABLE_OTHER_RULES_FLAG=1 to avoid
> upper layer probing if there are no paths, but I suppose we should
> restore the previous value after udev rule execution, e.g. in 99-dm-
> mpath.rules:
> 
> ENV{DM_DISABLE_OTHER_RULES_FLAG_OLD}=="?*", \
>ENV{DM_DISABLE_OTHER_RULES_FLAG}=$env{DM_DISABLE_OTHER_RULES_FLAG_OLD}, \
>ENV{DM_DISABLE_OTHER_RULES_FLAG_OLD}=""
> 
> Am I confusing stuff here?

Nope. That makes a lot of sense.
 
> Unfortunately, testing of my patch series has shown that it's an
> improvement, but it isn't sufficient for completely avoiding races
> between multipathd and udev. DM_SUSPENDED=1 can be avoided, but it's
> still possible that the device gets suspended after the udev rules test
> for supended state and before they run kpartx, blkid, pvscan, or other
> similar commands.
> 
> I am quite clueless about this right now, and would be grateful for
> ideas. Re-implementing LOCK_EX locking would be a possibility for
> systemd >= 250, as noted in the cover letter of the series. But for
> systemd <= 249, I don't see good options. We can't use LOCK_EX, because
> when we release the lock, we have no means to know whether or not a
> race with udev occurred (iow, whether udev tried to access the device
> while we held the lock, fail

Re: [PATCH v5 00/14] dm-raid/md/raid: fix v6.7 regressions

2024-02-08 Thread Benjamin Marzinski

On Thu, Feb 08, 2024 at 12:04:45AM -0800, Song Liu wrote:
> Hi Benjamin,
> 
> On Mon, Feb 5, 2024 at 7:58 PM Benjamin Marzinski  wrote:
> >
> > On Tue, Feb 06, 2024 at 09:36:18AM +0800, Yu Kuai wrote:
> > > Hi!
> > >
> > > 在 2024/02/06 3:35, Benjamin Marzinski 写道:
> > > > Could you run the test with something like
> > > >
> > > > # make check_local T=lvconvert-repair-raid.sh VERBOSE=1 > out 2>&1
> > > >
> > > > and post the output.
> > >
> > > Attached is the output from my VM.
> >
> > Instead of running the tests from the lvm2 git repo, if you run
> >
> > # make -C test install
> >
> > to install the tests, and then create a results directory and run the
> > test from there, do you still see the error in the 6.6 kernel?
> >
> > # make ~/results
> > # cd ~/results
> > # lvm2-testsuite --only lvconvert-repair-raid.sh
> >
> > Running the tests this way will test the installed lvm2 binaries on your
> > system, instead of the ones in the lvm2 git repo. They may be compiled
> > differently.
> 
> I am not able to get reliable results from shell/lvconvert-repair-raid.sh
> either. For 6.6.0 kernel, the test fails. On 6.8-rc1 kernel, the test fails
> sometimes.
> 
> Could you please share more information about your test setup?
> Specifically:
> 1. Which tree/branch/tag are you testing?
> 2. What's the .config used in the tests?
> 3. How do you run the test suite? One test at a time, or all of them
> together?
> 4. How do you handle "test passes sometimes" cases?

So, I have been able to recreate the case where lvconvert-repair-raid.sh
keeps failing. It happens when I tried running the reproducer on a virtual
machine made using a cloud image, instead of one that I manually
installed. I'm not sure why there is a difference. But I can show you
how I can reliably recreate the errors I'm seeing.


Create a new Fedora 39 virtual machine with the following commands (I'm
not sure if it is possible to reproduce this on a machine using less
memory and cpus, but I can try that if you need me to. You probably also
want to pick a faster Fedora Mirror for the image location):
# virt-install --name repair-test --memory 8192 --vcpus 8 --disk size=40 
--graphics none --extra-args "console=ttyS0" --osinfo 
detect=on,name=fedora-unknown --location 
https://download.fedoraproject.org/pub/fedora/linux/releases/39/Server/x86_64/os/

Install to the whole virtual drive, using the default LVM partitioning.
Then ssh into the VM and run the following commands to setup the
lvm2-testsuite and 6.6.0 kernel:

# dnf upgrade grub2
# dnf install -y git gcc bc flex make bison openssl openssl-devel dwarves zstd 
elfutils-libelf-devel libaio-devel lvm2 g++
# git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
# git clone git://sourceware.org/git/lvm2.git
# cd ~/lvm2
# ./configure
# make
# cd ~/linux
# git checkout -b ver6.6 v6.6
# cp /boot/config-`uname -r` .config
# make olddefconfig
# modprobe -a dm_raid dm_delay ext4 raid1 raid10 brd
# yes "" | make localmodconfig
# make -j8
# make modules_install
# make install
# reboot

ssh back into the VM, and run the following commands to run
lvm2-testsuite:

# mount -o remount,dev /tmp
# cd ~/lvm2
# make check T=lvconvert-repair-raid.sh

This should always pass. I ran it 100 times without failure.

To test the patched kernel, run:

# cd ~/linux
# git checkout -b dmraid-fix-v5 v6.8-rc3
# git am ~/dmraid-fix-v5.mbox  ***Apply the v5 patches***
# make olddefconfig
# make -j8
# make modules_install
# make install
# reboot

Rerun the lvm2-testsuite with the same commands as before:

# mount -o remount,dev /tmp
# cd ~/lvm2
# make check T=lvconvert-repair-raid.sh

This fails about 20% of the time, usually at either line 146 or 164. You
can check by running the following command when the test fails.

# grep "STACKTRACE()" 
~/lvm2/test/results/ndev-vanilla\:shell_lvconvert-repair-raid.sh.txt
[ 0:13.152] ## 1 STACKTRACE() called from 
/root/lvm2/test/shell/lvconvert-repair-raid.sh:146

Let me know if you have any questions, or if this doesn't work for you.

-Ben

> Thanks,
> Song

Re: [PATCH v4 06/12] dax: Check for data cache aliasing at runtime

2024-02-08 Thread Mathieu Desnoyers


On 2024-02-08 17:37, Dan Williams wrote:

Mathieu Desnoyers wrote:

On 2024-02-08 16:39, Dan Williams wrote:
[...]


So per other feedback on earlier patches, I think this hunk deserves to
be moved to its own patch earlier in the series as a standalone fixup.


Done.



Rest of this patch looks good to me.


Adding your Acked-by to what is left of this patch if OK with you.


You can add:

Reviewed-by: Dan Williams 

...after that re-org.


Just to make sure: are you OK with me adding your Reviewed-by
only for what is left of this patch, or also to the other driver
patches after integrating your requested changes ?

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v4 06/12] dax: Check for data cache aliasing at runtime

2024-02-08 Thread Dan Williams

Mathieu Desnoyers wrote:
> On 2024-02-08 16:39, Dan Williams wrote:
> [...]
> > 
> > So per other feedback on earlier patches, I think this hunk deserves to
> > be moved to its own patch earlier in the series as a standalone fixup.
> 
> Done.
> 
> > 
> > Rest of this patch looks good to me.
> 
> Adding your Acked-by to what is left of this patch if OK with you.

You can add:

Reviewed-by: Dan Williams 

...after that re-org.

Re: [PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-08 Thread Mathieu Desnoyers


On 2024-02-08 17:12, Andrew Morton wrote:

On Thu, 8 Feb 2024 17:04:52 -0500 Mathieu Desnoyers 
 wrote:


[...]


Should I keep this patch 01/12 within the series for v5 or should I
send it separately ?


Doesn't matter much, but perfectionism does say "standalone patch please".


Will do. I plan to add the following statement to the commit message
to make it clear that there is a dependency between the patch series
and this fix:

[ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v4 09/12] nvdimm/pmem: Cleanup alloc_dax() error handling

2024-02-08 Thread Mathieu Desnoyers


On 2024-02-08 16:55, Dan Williams wrote:
[...]


Oh... I see you cleanup what I was talking about later in the series.

For my taste I don't like to see tech-debt added and then removed later
in the series. The whole series would appear to get smaller by removing
the alloc_dax() returning NULL case from the beginning, and then doing
the EOPNOTSUPP fixups.

...repeat this comment for patch 10, 11, 12.


Done.

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-08 Thread Andrew Morton

On Thu, 8 Feb 2024 17:04:52 -0500 Mathieu Desnoyers 
 wrote:

> > The series seems useful but is at v4 without much sign of review
> > activity.  I think I'll take silence as assent and shall slam it all
> > into -next and see who shouts at me.
> > 
> 
> Thanks Andrew for picking it up! Dan just reacted with feedback that
> will help reducing the patch series size by removing intermediate
> commits. I'll implement the requested changes and post a v5 in a few
> days.

Yup.  I'll leave v4 out there for testers to bet on.

> So far there are not behavior changes requested in Dan's feedback.
> 
> Should I keep this patch 01/12 within the series for v5 or should I
> send it separately ?

Doesn't matter much, but perfectionism does say "standalone patch please".

Re: [PATCH v4 06/12] dax: Check for data cache aliasing at runtime

2024-02-08 Thread Mathieu Desnoyers


On 2024-02-08 16:39, Dan Williams wrote:
[...]


So per other feedback on earlier patches, I think this hunk deserves to
be moved to its own patch earlier in the series as a standalone fixup.


Done.



Rest of this patch looks good to me.


Adding your Acked-by to what is left of this patch if OK with you.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v4 05/12] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers


On 2024-02-08 16:37, Dan Williams wrote:
[...]

+DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) 
virtio_fs_cleanup_dax(_T))


This IS_ERR_OR_NULL is ok because dax_dev is initialized to NULL, but
maybe this could just be IS_ERR() and then initialize @dax_dev to
ERR_PTR(-EOPNOTSUPP)?


Done.

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v4 04/12] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure

2024-02-08 Thread Mathieu Desnoyers


On 2024-02-08 16:36, Dan Williams wrote:
[...]

Just another "ditto" on alloc_dax() returning NULL so that the ternary
can be removed, but otherwise this looks good.


Done.

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v4 03/12] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers


On 2024-02-08 16:34, Dan Williams wrote:
[...]

Similar feedback as the pmem change, lets not propagate the mistake that
alloc_dax() could return NULL, none of the callers of alloc_dax()
properly handled NULL and it was just luck that none of the use cases
tried to use alloc_dax() in the CONFIG_DAX=n case.


Done.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v4 02/12] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers


On 2024-02-08 16:32, Dan Williams wrote:

Mathieu Desnoyers wrote:

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
  drivers/nvdimm/pmem.c | 26 ++
  1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9fe358090720..f1d9f5c6dbac 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -558,19 +558,21 @@ static int pmem_attach_disk(struct device *dev,
disk->bb = &pmem->bb;
  
  	dax_dev = alloc_dax(pmem, &pmem_dax_ops);

-   if (IS_ERR(dax_dev)) {
-   rc = PTR_ERR(dax_dev);
-   goto out;
+   if (IS_ERR_OR_NULL(dax_dev)) {


alloc_dax() should never return NULL. I.e. the lead in before this patch
should fix this misunderstanding:

diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..df2d52b8a245 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
  static inline struct dax_device *alloc_dax(void *private,
 const struct dax_operations *ops)
  {
-   /*
-* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
-* NULL is an error or expected.
-*/
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
  }
  static inline void put_dax(struct dax_device *dax_dev)
  {


+   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;


Then this ternary can be replaced with just a check of which PTR_ERR()
value is being returned.


As you noted, I've introduced this as cleanups in later patches. I don't
mind folding these into their respective per-driver commits and moving
the alloc_dax() hunk earlier in the series.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-08 Thread Mathieu Desnoyers


On 2024-02-08 16:21, Andrew Morton wrote:

On Thu,  8 Feb 2024 13:49:02 -0500 Mathieu Desnoyers 
 wrote:


Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:

out_cleanup_dax:
 kill_dax(pmem->dax_dev);
 put_dax(pmem->dax_dev);


Seems inappropriate that this fix is within this patch series?

otoh I assume dax_add_host() has never failed so it doesn't matter much.


The series seems useful but is at v4 without much sign of review
activity.  I think I'll take silence as assent and shall slam it all
into -next and see who shouts at me.



Thanks Andrew for picking it up! Dan just reacted with feedback that
will help reducing the patch series size by removing intermediate
commits. I'll implement the requested changes and post a v5 in a few
days.

So far there are not behavior changes requested in Dan's feedback.

Should I keep this patch 01/12 within the series for v5 or should I
send it separately ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

RE: [PATCH v4 09/12] nvdimm/pmem: Cleanup alloc_dax() error handling

2024-02-08 Thread Dan Williams

Mathieu Desnoyers wrote:
> Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
> the callers do not have to handle NULL return values anymore.
> 
> Signed-off-by: Mathieu Desnoyers 
> Cc: Alasdair Kergon 
> Cc: Mike Snitzer 
> Cc: Mikulas Patocka 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-...@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvd...@lists.linux.dev
> ---
>  drivers/nvdimm/pmem.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index f1d9f5c6dbac..e9898457a7bd 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -558,8 +558,8 @@ static int pmem_attach_disk(struct device *dev,
>   disk->bb = &pmem->bb;
>  
>   dax_dev = alloc_dax(pmem, &pmem_dax_ops);
> - if (IS_ERR_OR_NULL(dax_dev)) {
> - rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
> + if (IS_ERR(dax_dev)) {
> + rc = PTR_ERR(dax_dev);

Oh... I see you cleanup what I was talking about later in the series.

For my taste I don't like to see tech-debt added and then removed later
in the series. The whole series would appear to get smaller by removing
the alloc_dax() returning NULL case from the beginning, and then doing
the EOPNOTSUPP fixups.

...repeat this comment for patch 10, 11, 12.

RE: [PATCH v4 08/12] dax: Fix incorrect list of data cache aliasing architectures

2024-02-08 Thread Dan Williams

Mathieu Desnoyers wrote:
> commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> prevents DAX from building on architectures with virtually aliased
> dcache with:
> 
>   depends on !(ARM || MIPS || SPARC)
> 
> This check is too broad (e.g. recent ARMv7 don't have virtually aliased
> dcaches), and also misses many other architectures with virtually
> aliased data cache.
> 
> This is a regression introduced in the v4.0 Linux kernel where the
> dax mount option is removed for 32-bit ARMv7 boards which have no data
> cache aliasing, and therefore should work fine with FS_DAX.
> 
> This was turned into the following check in alloc_dax() by a preparatory
> change:
> 
> if (ops && (IS_ENABLED(CONFIG_ARM) ||
> IS_ENABLED(CONFIG_MIPS) ||
> IS_ENABLED(CONFIG_SPARC)))
> return NULL;
> 
> Use cpu_dcache_is_aliasing() instead to figure out whether the environment
> has aliasing data caches.
> 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-...@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvd...@lists.linux.dev
> ---
>  drivers/dax/super.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index ce5bffa86bba..a21a7c262382 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -13,6 +13,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "dax-private.h"
>  
>  /**
> @@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct 
> dax_operations *ops)
>* except for device-dax (NULL operations pointer), which does
>* not use aliased mappings from the kernel.
>*/
> - if (ops && (IS_ENABLED(CONFIG_ARM) ||
> - IS_ENABLED(CONFIG_MIPS) ||
> - IS_ENABLED(CONFIG_SPARC)))
> + if (ops && cpu_dcache_is_aliasing())
>   return ERR_PTR(-EOPNOTSUPP);

Looks good,

Reviewed-by: Dan Williams

RE: [PATCH v4 07/12] Introduce cpu_dcache_is_aliasing() across all architectures

2024-02-08 Thread Dan Williams

Mathieu Desnoyers wrote:
> Introduce a generic way to query whether the data cache is virtually
> aliased on all architectures. Its purpose is to ensure that subsystems
> which are incompatible with virtually aliased data caches (e.g. FS_DAX)
> can reliably query this.
> 
> For data cache aliasing, there are three scenarios dependending on the
> architecture. Here is a breakdown based on my understanding:
> 
> A) The data cache is always aliasing:
> 
> * arc
> * csky
> * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing 
> there.)
> * sh
> * parisc
> 
> B) The data cache aliasing is statically known or depends on querying CPU
>state at runtime:
> 
> * arm (cache_is_vivt() || cache_is_vipt_aliasing())
> * mips (cpu_has_dc_aliases)
> * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
> * sparc32 (vac_cache_size > PAGE_SIZE)
> * sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
> * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)
> 
> C) The data cache is never aliasing:
> 
> * alpha
> * arm64 (aarch64)
> * hexagon
> * loongarch (but with incoherent write buffers, which are disabled since
>  commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to 
> PAGE_SIZE"))
> * microblaze
> * openrisc
> * powerpc
> * riscv
> * s390
> * um
> * x86
> 
> Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
> implement "cpu_dcache_is_aliasing()".
> 
> Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
> cpu_dcache_is_aliasing() simply evaluates to "false".
> 
> Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
> work. This would be useful to gate features like XIP on architectures
> which have aliasing CPU dcache-icache but not CPU dcache-dcache.
> 
> Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
> to clarify that we really mean "CPU data cache" and "CPU cache" to
> eliminate any possible confusion with VFS "dentry cache" and "page
> cache".
> 
> Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-...@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvd...@lists.linux.dev
> ---
>  arch/arc/Kconfig|  1 +
>  arch/arc/include/asm/cachetype.h|  9 +
>  arch/arm/Kconfig|  1 +
>  arch/arm/include/asm/cachetype.h|  2 ++
>  arch/csky/Kconfig   |  1 +
>  arch/csky/include/asm/cachetype.h   |  9 +
>  arch/m68k/Kconfig   |  1 +
>  arch/m68k/include/asm/cachetype.h   |  9 +
>  arch/mips/Kconfig   |  1 +
>  arch/mips/include/asm/cachetype.h   |  9 +
>  arch/nios2/Kconfig  |  1 +
>  arch/nios2/include/asm/cachetype.h  | 10 ++
>  arch/parisc/Kconfig |  1 +
>  arch/parisc/include/asm/cachetype.h |  9 +
>  arch/sh/Kconfig |  1 +
>  arch/sh/include/asm/cachetype.h |  9 +
>  arch/sparc/Kconfig  |  1 +
>  arch/sparc/include/asm/cachetype.h  | 14 ++
>  arch/xtensa/Kconfig |  1 +
>  arch/xtensa/include/asm/cachetype.h | 10 ++
>  include/linux/cacheinfo.h   |  6 ++
>  mm/Kconfig  |  6 ++
>  22 files changed, 112 insertions(+)
>  create mode 100644 arch/arc/include/asm/cachetype.h
>  create mode 100644 arch/csky/include/asm/cachetype.h
>  create mode 100644 arch/m68k/include/asm/cachetype.h
>  create mode 100644 arch/mips/include/asm/cachetype.h
>  create mode 100644 arch/nios2/include/asm/cachetype.h
>  create mode 100644 arch/parisc/include/asm/cachetype.h
>  create mode 100644 arch/sh/include/asm/cachetype.h
>  create mode 100644 arch/sparc/include/asm/cachetype.h
>  create mode 100644 arch/xtensa/include/asm/cachetype.h
> 
[..]
> diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
> index d504eb4b49ab..2cb15fe4fe12 100644
> --- a/include/linux/cacheinfo.h
> +++ b/include/linux/cacheinfo.h
> @@ -138,4 +138,10 @@ static inline int get_cpu_cacheinfo_id(int cpu, int 
> level)
>  #define use_arch_cache_info()(false)
>  #endif
>  
> +#ifndef CONFIG_ARCH_HAS_CPU_CACHE_ALIASING
> +#define cpu_dcache_is_aliasing() false
> +#else
> +#include 
> +#endif
> +
>  #endif /* _LINUX_CACHEINFO_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 57cd378c73d6..db09c9ad15c9 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1016,6 +1016,12 @@ config IDLE_PAGE_TRACKING
> See Documentation/admin-guide/mm/idle_page_tracking.rst for
> more details.
>  
> +# Architectures which implement cpu_d

RE: [PATCH v4 06/12] dax: Check for data cache aliasing at runtime

2024-02-08 Thread Dan Williams

Mathieu Desnoyers wrote:
> Replace the following fs/Kconfig:FS_DAX dependency:
> 
>   depends on !(ARM || MIPS || SPARC)
> 
> By a runtime check within alloc_dax(). This runtime check returns
> ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means
> the kernel is using an aliased mapping) on an architecture which
> has data cache aliasing.
> 
> Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for
> CONFIG_DAX=n for consistency.
> 
> This is done in preparation for using cpu_dcache_is_aliasing() in a
> following change which will properly support architectures which detect
> data cache aliasing at runtime.
> 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-...@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvd...@lists.linux.dev
> ---
>  drivers/dax/super.c | 15 +++
>  fs/Kconfig  |  1 -
>  include/linux/dax.h |  6 +-
>  3 files changed, 16 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0da9232ea175..ce5bffa86bba 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive);
>   * that any fault handlers or operations that might have seen
>   * dax_alive(), have completed.  Any operations that start after
>   * synchronize_srcu() has run will abort upon seeing !dax_alive().
> + *
> + * Note, because alloc_dax() returns an ERR_PTR() on error, callers
> + * typically store its result into a local variable in order to check
> + * the result. Therefore, care must be taken to populate the struct
> + * device dax_dev field make sure the dax_dev is not leaked.
>   */
>  void kill_dax(struct dax_device *dax_dev)
>  {
> @@ -445,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct 
> dax_operations *ops)
>   dev_t devt;
>   int minor;
>  
> + /*
> +  * Unavailable on architectures with virtually aliased data caches,
> +  * except for device-dax (NULL operations pointer), which does
> +  * not use aliased mappings from the kernel.
> +  */
> + if (ops && (IS_ENABLED(CONFIG_ARM) ||
> + IS_ENABLED(CONFIG_MIPS) ||
> + IS_ENABLED(CONFIG_SPARC)))
> + return ERR_PTR(-EOPNOTSUPP);
> +
>   if (WARN_ON_ONCE(ops && !ops->zero_page_range))
>   return ERR_PTR(-EINVAL);
>  
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 42837617a55b..e5efdb3b276b 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -56,7 +56,6 @@ endif # BLOCK
>  config FS_DAX
>   bool "File system based Direct Access (DAX) support"
>   depends on MMU
> - depends on !(ARM || MIPS || SPARC)
>   depends on ZONE_DEVICE || FS_DAX_LIMITED
>   select FS_IOMAP
>   select DAX
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b463502b16e1..df2d52b8a245 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
>  static inline struct dax_device *alloc_dax(void *private,
>   const struct dax_operations *ops)
>  {
> - /*
> -  * Callers should check IS_ENABLED(CONFIG_DAX) to know if this
> -  * NULL is an error or expected.
> -  */
> - return NULL;
> + return ERR_PTR(-EOPNOTSUPP);
>  }

So per other feedback on earlier patches, I think this hunk deserves to
be moved to its own patch earlier in the series as a standalone fixup.

Rest of this patch looks good to me.

RE: [PATCH v4 05/12] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Dan Williams

Mathieu Desnoyers wrote:
> In preparation for checking whether the architecture has data cache
> aliasing within alloc_dax(), modify the error handling of virtio
> virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
> non-fatal.
> 
> For the transition, consider that alloc_dax() returning NULL is the
> same as returning -EOPNOTSUPP.
> 
> Co-developed-by: Dan Williams 
> Signed-off-by: Dan Williams 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Alasdair Kergon 
> Cc: Mike Snitzer 
> Cc: Mikulas Patocka 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-...@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvd...@lists.linux.dev
> ---
>  fs/fuse/virtio_fs.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 5f1be1da92ce..621b1bca2d55 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "fuse_i.h"
>  
> @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
>   put_dax(dax_dev);
>  }
>  
> +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) 
> virtio_fs_cleanup_dax(_T))

This IS_ERR_OR_NULL is ok because dax_dev is initialized to NULL, but
maybe this could just be IS_ERR() and then initialize @dax_dev to
ERR_PTR(-EOPNOTSUPP)?

RE: [PATCH v4 04/12] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure

2024-02-08 Thread Dan Williams

Mathieu Desnoyers wrote:
> In preparation for checking whether the architecture has data cache
> aliasing within alloc_dax(), modify the error handling of dcssblk
> dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures.
> 
> Considering that s390 is not a data cache aliasing architecture,
> and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP
> from alloc_dax() should make dcssblk_add_store() fail.
> 
> For the transition, consider that alloc_dax() returning NULL is the
> same as returning -EOPNOTSUPP.
> 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Alasdair Kergon 
> Cc: Mike Snitzer 
> Cc: Mikulas Patocka 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-...@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvd...@lists.linux.dev
> Cc: linux-s...@vger.kernel.org
> ---
>  drivers/s390/block/dcssblk.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 4b7ecd4fd431..a3010849bfed 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct 
> device_attribute *attr, const char
>   int rc, i, j, num_of_segments;
>   struct dcssblk_dev_info *dev_info;
>   struct segment_info *seg_info, *temp;
> + struct dax_device *dax_dev;
>   char *local_buf;
>   unsigned long seg_byte_size;
>  
> @@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct 
> device_attribute *attr, const char
>   if (rc)
>   goto put_dev;
>  
> - dev_info->dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
> - if (IS_ERR(dev_info->dax_dev)) {
> - rc = PTR_ERR(dev_info->dax_dev);
> - dev_info->dax_dev = NULL;
> + dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
> + if (IS_ERR_OR_NULL(dax_dev)) {
> + rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;

Just another "ditto" on alloc_dax() returning NULL so that the ternary
can be removed, but otherwise this looks good.

RE: [PATCH v4 03/12] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Dan Williams

Mathieu Desnoyers wrote:
> In preparation for checking whether the architecture has data cache
> aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
> to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.
> 
> For the transition, consider that alloc_dax() returning NULL is the
> same as returning -EOPNOTSUPP.
> 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Suggested-by: Dan Williams 
> Signed-off-by: Mathieu Desnoyers 
> Cc: Alasdair Kergon 
> Cc: Mike Snitzer 
> Cc: Mikulas Patocka 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-...@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvd...@lists.linux.dev
> ---
>  drivers/md/dm.c | 17 +
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 23c32cd1f1d8..2fc22cae9089 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device 
> *md)
>  static struct mapped_device *alloc_dev(int minor)
>  {
>   int r, numa_node_id = dm_get_numa_node();
> + struct dax_device *dax_dev;
>   struct mapped_device *md;
>   void *old_md;
>  
> @@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor)
>   md->disk->private_data = md;
>   sprintf(md->disk->disk_name, "dm-%d", minor);
>  
> - if (IS_ENABLED(CONFIG_FS_DAX)) {
> - md->dax_dev = alloc_dax(md, &dm_dax_ops);
> - if (IS_ERR(md->dax_dev)) {
> - md->dax_dev = NULL;
> + dax_dev = alloc_dax(md, &dm_dax_ops);
> + if (IS_ERR_OR_NULL(dax_dev)) {

Similar feedback as the pmem change, lets not propagate the mistake that
alloc_dax() could return NULL, none of the callers of alloc_dax()
properly handled NULL and it was just luck that none of the use cases
tried to use alloc_dax() in the CONFIG_DAX=n case.

RE: [PATCH v4 02/12] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Dan Williams

Mathieu Desnoyers wrote:
> In preparation for checking whether the architecture has data cache
> aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
> pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.
> 
> For the transition, consider that alloc_dax() returning NULL is the
> same as returning -EOPNOTSUPP.
> 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Alasdair Kergon 
> Cc: Mike Snitzer 
> Cc: Mikulas Patocka 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-...@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvd...@lists.linux.dev
> ---
>  drivers/nvdimm/pmem.c | 26 ++
>  1 file changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 9fe358090720..f1d9f5c6dbac 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -558,19 +558,21 @@ static int pmem_attach_disk(struct device *dev,
>   disk->bb = &pmem->bb;
>  
>   dax_dev = alloc_dax(pmem, &pmem_dax_ops);
> - if (IS_ERR(dax_dev)) {
> - rc = PTR_ERR(dax_dev);
> - goto out;
> + if (IS_ERR_OR_NULL(dax_dev)) {

alloc_dax() should never return NULL. I.e. the lead in before this patch
should fix this misunderstanding:

diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..df2d52b8a245 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
 static inline struct dax_device *alloc_dax(void *private,
const struct dax_operations *ops)
 {
-   /*
-* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
-* NULL is an error or expected.
-*/
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 static inline void put_dax(struct dax_device *dax_dev)
 {

> + rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;

Then this ternary can be replaced with just a check of which PTR_ERR()
value is being returned.

RE: [PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-08 Thread Dan Williams

Mathieu Desnoyers wrote:
> Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
> before setting pmem->dax_dev, which therefore issues the two following
> calls on NULL pointers:
> 
> out_cleanup_dax:
> kill_dax(pmem->dax_dev);
> put_dax(pmem->dax_dev);
> 
> Signed-off-by: Mathieu Desnoyers 

Looks good to me.

Reviewed-by: Dan Williams

Re: [PATCH] multipath-tools: remove extra hyphen from CFLAGS std option

2024-02-08 Thread Martin Wilck

On Thu, 2024-02-08 at 04:02 +0100, Xose Vazquez Perez wrote:
> https://gcc.gnu.org/onlinedocs/gcc/C-Dialect-Options.html#index-std-1
> 
> Cc: Martin Wilck 
> Cc: Benjamin Marzinski 
> Cc: Christophe Varoqui 
> Cc: DM-DEVEL ML 
> Signed-off-by: Xose Vazquez Perez 

Reviewed-by: Martin Wilck 
Thanks!

Re: [PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-08 Thread Andrew Morton

On Thu,  8 Feb 2024 13:49:02 -0500 Mathieu Desnoyers 
 wrote:

> Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
> before setting pmem->dax_dev, which therefore issues the two following
> calls on NULL pointers:
> 
> out_cleanup_dax:
> kill_dax(pmem->dax_dev);
> put_dax(pmem->dax_dev);

Seems inappropriate that this fix is within this patch series?

otoh I assume dax_add_host() has never failed so it doesn't matter much.

The series seems useful but is at v4 without much sign of review
activity.  I think I'll take silence as assent and shall slam it all
into -next and see who shouts at me.

Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

2024-02-08 Thread Allen

On Thu, Feb 8, 2024 at 8:56 AM Tejun Heo  wrote:
>
> On Wed, Feb 07, 2024 at 11:02:37AM -0800, Allen wrote:
> > https://github.com/allenpais/for-6.9-bh-conversions
> >
> >  I am holding on to the patch that converts drivers/media/*, as I haven't 
> > found
> > a right way to replace tasklet_[disable/enable] api's. The rest should be 
> > ready
> > in a day or two.
>
> Yeah, we'll need to add something to workqueue to support that. As for the
> rest, looking at the code, I think tasklet_kill() should be converted to
> cancel_work_sync(), not flush_work().
>

  Ah, Thanks for pointing that out. I will update it and get that fixed.

Thanks for the review.

-- 
   - Allen

[PATCH v4 11/12] dcssblk: Cleanup alloc_dax() error handling

2024-02-08 Thread Mathieu Desnoyers

Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 drivers/s390/block/dcssblk.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index a3010849bfed..f363c1d51d9a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -679,8 +679,8 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
goto put_dev;
 
dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (IS_ERR(dax_dev)) {
+   rc = PTR_ERR(dax_dev);
goto put_dev;
}
set_dax_synchronous(dax_dev);
-- 
2.39.2

[PATCH v4 12/12] virtio: Cleanup alloc_dax() error handling

2024-02-08 Thread Mathieu Desnoyers

Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 fs/fuse/virtio_fs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 621b1bca2d55..a28466c2da71 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -809,8 +809,8 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
return 0;
 
dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   int rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (IS_ERR(dax_dev)) {
+   int rc = PTR_ERR(dax_dev);
return rc == -EOPNOTSUPP ? 0 : rc;
}
 
-- 
2.39.2

[PATCH v4 07/12] Introduce cpu_dcache_is_aliasing() across all architectures

2024-02-08 Thread Mathieu Desnoyers

Introduce a generic way to query whether the data cache is virtually
aliased on all architectures. Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g. FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios dependending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
   state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
 commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to 
PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".

Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 include/linux/cacheinfo.h   |  6 ++
 mm/Kconfig  |  6 ++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..7d294a3242a4 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@
 config ARC
def_bool y
select ARC_TIMERS
+   select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index ..05fc7ed59712
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include 
+
+#define cpu_dcache_is_aliasing()   true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..cd13b1788973 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ config ARM
select ARCH_32BIT_OFF_T
select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && 
FRAME_POINTER && !ARM_UNWIND

[PATCH v4 06/12] dax: Check for data cache aliasing at runtime

2024-02-08 Thread Mathieu Desnoyers

Replace the following fs/Kconfig:FS_DAX dependency:

  depends on !(ARM || MIPS || SPARC)

By a runtime check within alloc_dax(). This runtime check returns
ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means
the kernel is using an aliased mapping) on an architecture which
has data cache aliasing.

Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for
CONFIG_DAX=n for consistency.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 15 +++
 fs/Kconfig  |  1 -
 include/linux/dax.h |  6 +-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..ce5bffa86bba 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive);
  * that any fault handlers or operations that might have seen
  * dax_alive(), have completed.  Any operations that start after
  * synchronize_srcu() has run will abort upon seeing !dax_alive().
+ *
+ * Note, because alloc_dax() returns an ERR_PTR() on error, callers
+ * typically store its result into a local variable in order to check
+ * the result. Therefore, care must be taken to populate the struct
+ * device dax_dev field make sure the dax_dev is not leaked.
  */
 void kill_dax(struct dax_device *dax_dev)
 {
@@ -445,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
dev_t devt;
int minor;
 
+   /*
+* Unavailable on architectures with virtually aliased data caches,
+* except for device-dax (NULL operations pointer), which does
+* not use aliased mappings from the kernel.
+*/
+   if (ops && (IS_ENABLED(CONFIG_ARM) ||
+   IS_ENABLED(CONFIG_MIPS) ||
+   IS_ENABLED(CONFIG_SPARC)))
+   return ERR_PTR(-EOPNOTSUPP);
+
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
return ERR_PTR(-EINVAL);
 
diff --git a/fs/Kconfig b/fs/Kconfig
index 42837617a55b..e5efdb3b276b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -56,7 +56,6 @@ endif # BLOCK
 config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
-   depends on !(ARM || MIPS || SPARC)
depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..df2d52b8a245 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
 static inline struct dax_device *alloc_dax(void *private,
const struct dax_operations *ops)
 {
-   /*
-* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
-* NULL is an error or expected.
-*/
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 static inline void put_dax(struct dax_device *dax_dev)
 {
-- 
2.39.2

[PATCH v4 09/12] nvdimm/pmem: Cleanup alloc_dax() error handling

2024-02-08 Thread Mathieu Desnoyers

Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f1d9f5c6dbac..e9898457a7bd 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -558,8 +558,8 @@ static int pmem_attach_disk(struct device *dev,
disk->bb = &pmem->bb;
 
dax_dev = alloc_dax(pmem, &pmem_dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (IS_ERR(dax_dev)) {
+   rc = PTR_ERR(dax_dev);
if (rc != -EOPNOTSUPP)
goto out;
} else {
-- 
2.39.2

[PATCH v4 08/12] dax: Fix incorrect list of data cache aliasing architectures

2024-02-08 Thread Mathieu Desnoyers

commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
prevents DAX from building on architectures with virtually aliased
dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 don't have virtually aliased
dcaches), and also misses many other architectures with virtually
aliased data cache.

This is a regression introduced in the v4.0 Linux kernel where the
dax mount option is removed for 32-bit ARMv7 boards which have no data
cache aliasing, and therefore should work fine with FS_DAX.

This was turned into the following check in alloc_dax() by a preparatory
change:

if (ops && (IS_ENABLED(CONFIG_ARM) ||
IS_ENABLED(CONFIG_MIPS) ||
IS_ENABLED(CONFIG_SPARC)))
return NULL;

Use cpu_dcache_is_aliasing() instead to figure out whether the environment
has aliasing data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ce5bffa86bba..a21a7c262382 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "dax-private.h"
 
 /**
@@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
 * except for device-dax (NULL operations pointer), which does
 * not use aliased mappings from the kernel.
 */
-   if (ops && (IS_ENABLED(CONFIG_ARM) ||
-   IS_ENABLED(CONFIG_MIPS) ||
-   IS_ENABLED(CONFIG_SPARC)))
+   if (ops && cpu_dcache_is_aliasing())
return ERR_PTR(-EOPNOTSUPP);
 
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
-- 
2.39.2

[PATCH v4 10/12] dm: Cleanup alloc_dax() error handling

2024-02-08 Thread Mathieu Desnoyers

Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/md/dm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2fc22cae9089..acdc00bc05be 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2124,8 +2124,8 @@ static struct mapped_device *alloc_dev(int minor)
sprintf(md->disk->disk_name, "dm-%d", minor);
 
dax_dev = alloc_dax(md, &dm_dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   if (IS_ERR(dax_dev) && PTR_ERR(dax_dev) != -EOPNOTSUPP)
+   if (IS_ERR(dax_dev)) {
+   if (PTR_ERR(dax_dev) != -EOPNOTSUPP)
goto bad;
} else {
set_dax_nocache(dax_dev);
-- 
2.39.2

[PATCH v4 05/12] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of virtio
virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
non-fatal.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 fs/fuse/virtio_fs.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce..621b1bca2d55 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "fuse_i.h"
 
@@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
put_dax(dax_dev);
 }
 
+DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) 
virtio_fs_cleanup_dax(_T))
+
 static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs 
*fs)
 {
+   struct dax_device *dax_dev __free(cleanup_dax) = NULL;
struct virtio_shm_region cache_reg;
struct dev_pagemap *pgmap;
bool have_cache;
@@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
if (!IS_ENABLED(CONFIG_FUSE_DAX))
return 0;
 
+   dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   int rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   return rc == -EOPNOTSUPP ? 0 : rc;
+   }
+
/* Get cache region */
have_cache = virtio_get_shm_region(vdev, &cache_reg,
   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
@@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 
0x%llx\n",
__func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
 
-   fs->dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
-   if (IS_ERR(fs->dax_dev))
-   return PTR_ERR(fs->dax_dev);
-
+   fs->dax_dev = no_free_ptr(dax_dev);
return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax,
fs->dax_dev);
 }
-- 
2.39.2

[PATCH v4 03/12] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Suggested-by: Dan Williams 
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/md/dm.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 23c32cd1f1d8..2fc22cae9089 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device 
*md)
 static struct mapped_device *alloc_dev(int minor)
 {
int r, numa_node_id = dm_get_numa_node();
+   struct dax_device *dax_dev;
struct mapped_device *md;
void *old_md;
 
@@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
 
-   if (IS_ENABLED(CONFIG_FS_DAX)) {
-   md->dax_dev = alloc_dax(md, &dm_dax_ops);
-   if (IS_ERR(md->dax_dev)) {
-   md->dax_dev = NULL;
+   dax_dev = alloc_dax(md, &dm_dax_ops);
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   if (IS_ERR(dax_dev) && PTR_ERR(dax_dev) != -EOPNOTSUPP)
goto bad;
-   }
-   set_dax_nocache(md->dax_dev);
-   set_dax_nomc(md->dax_dev);
-   if (dax_add_host(md->dax_dev, md->disk))
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   md->dax_dev = dax_dev;
+   if (dax_add_host(dax_dev, md->disk))
goto bad;
}
 
-- 
2.39.2

[PATCH v4 04/12] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure

2024-02-08 Thread Mathieu Desnoyers

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dcssblk
dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures.

Considering that s390 is not a data cache aliasing architecture,
and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP
from alloc_dax() should make dcssblk_add_store() fail.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 drivers/s390/block/dcssblk.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 4b7ecd4fd431..a3010849bfed 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+   struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
if (rc)
goto put_dev;
 
-   dev_info->dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
-   if (IS_ERR(dev_info->dax_dev)) {
-   rc = PTR_ERR(dev_info->dax_dev);
-   dev_info->dax_dev = NULL;
+   dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
goto put_dev;
}
-   set_dax_synchronous(dev_info->dax_dev);
+   set_dax_synchronous(dax_dev);
+   dev_info->dax_dev = dax_dev;
rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
if (rc)
goto out_dax;
-- 
2.39.2

[PATCH v4 00/12] Introduce cpu_dcache_is_aliasing() to fix DAX regression

2024-02-08 Thread Mathieu Desnoyers

This commit introduced in v4.0 prevents building FS_DAX on 32-bit ARM,
even on ARMv7 which does not have virtually aliased data caches:

commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")

It used to work fine before: I have customers using DAX over pmem on
ARMv7, but this regression will likely prevent them from upgrading their
kernel.

The root of the issue here is the fact that DAX was never designed to
handle virtually aliasing data caches (VIVT and VIPT with aliasing data
cache). It touches the pages through their linear mapping, which is not
consistent with the userspace mappings with virtually aliasing data
caches.

This patch series introduces cpu_dcache_is_aliasing() with the new
Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
architectures. The implementation of cpu_dcache_is_aliasing() is either
evaluated to a constant at compile-time or a runtime check, which is
what is needed on ARM.

With this we can basically narrow down the list of architectures which
are unsupported by DAX to those which are really affected.

Testing done so far:

- Compile allyesconfig on x86-64,
- Compile allyesconfig on x86-64, with FS_DAX=n.
- Compile allyesconfig on x86-64, with DAX=n.
- Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP
  even on x86-64, thus simulating the behavior expected on an
  architecture with data cache aliasing.

There are many more axes to test however. I would welcome Tested-by for:

- affected architectures,
- affected drivers,
- affected filesytems.

Thanks,

Mathieu

Changes since v3:
- Fix a leak on dax_add_host() failure in nvdimm/pmem.
- Split the series into a bissectable sequence of changes.
- Ensure that device-dax use-cases still works on data cache
  aliasing architectures.

Changes since v2:
- Move DAX supported runtime check to alloc_dax(),
- Modify DM to handle alloc_dax() error as non-fatal,
- Remove all filesystem modifications, since the check is now done by
  alloc_dax(),
- rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to
  eliminate confusion with VFS terminology.

Changes since v1:
- The order of the series was completely changed based on the
  feedback received on v1,
- cache_is_aliasing() is renamed to dcache_is_aliasing(),
- ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone,
- dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is
  simplified,
- the dax_is_supported() check was moved to its rightful place in all
  filesystems.

Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org

Mathieu Desnoyers (12):
  nvdimm/pmem: Fix leak on dax_add_host() failure
  nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
  virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dax: Check for data cache aliasing at runtime
  Introduce cpu_dcache_is_aliasing() across all architectures
  dax: Fix incorrect list of data cache aliasing architectures
  nvdimm/pmem: Cleanup alloc_dax() error handling
  dm: Cleanup alloc_dax() error handling
  dcssblk: Cleanup alloc_dax() error handling
  virtio: Cleanup alloc_dax() error handling

 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 drivers/dax/super.c | 14 ++
 drivers/md/dm.c | 17 +
 drivers/nvdimm/pmem.c   | 23 ---
 drivers/s390/block/dcssblk.c| 11 ++-
 fs/Kconfig  |  1 -
 fs/fuse/virtio_fs.c | 15 +++
 include/linux/cacheinfo.h   |  6 ++
 include/linux/dax.h |  6 +-
 mm/Kconfig  |  6 ++
 29 files changed, 165 insertions(+), 34 deleti

[PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-08 Thread Mathieu Desnoyers

Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:

out_cleanup_dax:
kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..9fe358090720 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
set_dax_nomc(dax_dev);
if (is_nvdimm_sync(nd_region))
set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
rc = dax_add_host(dax_dev, disk);
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
-   pmem->dax_dev = dax_dev;
-
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2

[PATCH v4 02/12] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9fe358090720..f1d9f5c6dbac 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -558,19 +558,21 @@ static int pmem_attach_disk(struct device *dev,
disk->bb = &pmem->bb;
 
dax_dev = alloc_dax(pmem, &pmem_dax_ops);
-   if (IS_ERR(dax_dev)) {
-   rc = PTR_ERR(dax_dev);
-   goto out;
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (rc != -EOPNOTSUPP)
+   goto out;
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   if (is_nvdimm_sync(nd_region))
+   set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
+   rc = dax_add_host(dax_dev, disk);
+   if (rc)
+   goto out_cleanup_dax;
+   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
}
-   set_dax_nocache(dax_dev);
-   set_dax_nomc(dax_dev);
-   if (is_nvdimm_sync(nd_region))
-   set_dax_synchronous(dax_dev);
-   pmem->dax_dev = dax_dev;
-   rc = dax_add_host(dax_dev, disk);
-   if (rc)
-   goto out_cleanup_dax;
-   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2

[PATCH 3/3] dm vdo: add documentation details on zones and locking

2024-02-08 Thread Matthew Sakai

Add details describing the vdo zone and thread model to the
documentation comments for major vdo components. Also added
some high-level description of the block map structure.

Signed-off-by: Matthew Sakai 
---
 drivers/md/dm-vdo/block-map.h| 15 +++
 drivers/md/dm-vdo/dedupe.c   |  5 +
 drivers/md/dm-vdo/recovery-journal.h |  4 
 drivers/md/dm-vdo/slab-depot.h   | 16 +++-
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm-vdo/block-map.h b/drivers/md/dm-vdo/block-map.h
index c574bd524bc2..b662c318c2ea 100644
--- a/drivers/md/dm-vdo/block-map.h
+++ b/drivers/md/dm-vdo/block-map.h
@@ -19,6 +19,21 @@
 #include "vio.h"
 #include "wait-queue.h"
 
+/*
+ * The block map is responsible for tracking all the logical to physical 
mappings of a VDO. It
+ * consists of a collection of 60 radix trees gradually allocated as logical 
addresses are used.
+ * Each tree is assigned to a logical zone such that it is easy to compute 
which zone must handle
+ * each logical address. Each logical zone also has a dedicated portion of the 
leaf page cache.
+ *
+ * Each logical zone has a single dedicated queue and thread for performing 
all updates to the
+ * radix trees assigned to that zone. The concurrency guarantees of this 
single-threaded model
+ * allow the code to omit more fine-grained locking for the block map 
structures.
+ *
+ * Load operations must be performed on the admin thread. Normal operations, 
such as reading and
+ * updating mappings, must be performed on the appropriate logical zone 
thread. Save operations
+ * must be launched from the same admin thread as the original load operation.
+ */
+
 enum {
BLOCK_MAP_VIO_POOL_SIZE = 64,
 };
diff --git a/drivers/md/dm-vdo/dedupe.c b/drivers/md/dm-vdo/dedupe.c
index 4b00135511dd..d81065a0951c 100644
--- a/drivers/md/dm-vdo/dedupe.c
+++ b/drivers/md/dm-vdo/dedupe.c
@@ -14,6 +14,11 @@
  * deduplicate against a single block instead of being serialized through a 
PBN read lock. Only one
  * index query is needed for each hash_lock, instead of one for every data_vio.
  *
+ * Hash_locks are assigned to hash_zones by computing a modulus on the hash 
itself. Each hash_zone
+ * has a single dedicated queue and thread for performing all operations on 
the hash_locks assigned
+ * to that zone. The concurrency guarantees of this single-threaded model 
allow the code to omit
+ * more fine-grained locking for the hash_lock structures.
+ *
  * A hash_lock acts like a state machine perhaps more than as a lock. Other 
than the starting and
  * ending states INITIALIZING and BYPASSING, every state represents and is 
held for the duration of
  * an asynchronous operation. All state transitions are performed on the 
thread of the hash_zone
diff --git a/drivers/md/dm-vdo/recovery-journal.h 
b/drivers/md/dm-vdo/recovery-journal.h
index 19fa7ed9648a..d78c6c7da4ea 100644
--- a/drivers/md/dm-vdo/recovery-journal.h
+++ b/drivers/md/dm-vdo/recovery-journal.h
@@ -26,6 +26,10 @@
  * write amplification of writes by providing amortization of slab journal and 
block map page
  * updates.
  *
+ * The recovery journal has a single dedicated queue and thread for performing 
all journal updates.
+ * The concurrency guarantees of this single-threaded model allow the code to 
omit more
+ * fine-grained locking for recovery journal structures.
+ *
  * The journal consists of a set of on-disk blocks arranged as a circular log 
with monotonically
  * increasing sequence numbers. Three sequence numbers serve to define the 
active extent of the
  * journal. The 'head' is the oldest active block in the journal. The 'tail' 
is the end of the
diff --git a/drivers/md/dm-vdo/slab-depot.h b/drivers/md/dm-vdo/slab-depot.h
index efdef566709a..fba293f9713e 100644
--- a/drivers/md/dm-vdo/slab-depot.h
+++ b/drivers/md/dm-vdo/slab-depot.h
@@ -29,11 +29,17 @@
  * a single array of slabs in order to eliminate the need for additional math 
in order to compute
  * which physical zone a PBN is in. It also has a block_allocator per zone.
  *
- * Load operations are required to be performed on a single thread. Normal 
operations are assumed
- * to be performed in the appropriate zone. Allocations and reference count 
updates must be done
- * from the thread of their physical zone. Requests to commit slab journal 
tail blocks from the
- * recovery journal must be done on the journal zone thread. Save operations 
are required to be
- * launched from the same thread as the original load operation.
+ * Each physical zone has a single dedicated queue and thread for performing 
all updates to the
+ * slabs assigned to that zone. The concurrency guarantees of this 
single-threaded model allow the
+ * code to omit more fine-grained locking for the various slab structures. 
Each physical zone
+ * maintains a separate copy of the slab summary to remove the need for 
explicit locking on that
+ * structure as well.
+ *
+ * Load operations must be perfo

[PATCH 1/3] dm vdo: add vdo documentation to device-mapper index

2024-02-08 Thread Matthew Sakai

Signed-off-by: Matthew Sakai 
---
 Documentation/admin-guide/device-mapper/index.rst | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/admin-guide/device-mapper/index.rst 
b/Documentation/admin-guide/device-mapper/index.rst
index cde52cc09645..cc5aec861576 100644
--- a/Documentation/admin-guide/device-mapper/index.rst
+++ b/Documentation/admin-guide/device-mapper/index.rst
@@ -34,6 +34,8 @@ Device Mapper
 switch
 thin-provisioning
 unstriped
+vdo-design
+vdo
 verity
 writecache
 zero
-- 
2.42.0

[PATCH 2/3] dm vdo: add vio lifecycle details to doc

2024-02-08 Thread Matthew Sakai

Add more documentation details for most aspects of the data_vio
read and write processes. Also correct a few minor errors and
rewrite some text for clarity.

Signed-off-by: Matthew Sakai 
---
 .../admin-guide/device-mapper/vdo-design.rst  | 674 --
 1 file changed, 446 insertions(+), 228 deletions(-)

diff --git a/Documentation/admin-guide/device-mapper/vdo-design.rst 
b/Documentation/admin-guide/device-mapper/vdo-design.rst
index c82d51071c7d..93e564540204 100644
--- a/Documentation/admin-guide/device-mapper/vdo-design.rst
+++ b/Documentation/admin-guide/device-mapper/vdo-design.rst
@@ -38,19 +38,27 @@ structures involved in a single write operation to a vdo 
target is larger
 than most other targets. Furthermore, because vdo must operate on small
 block sizes in order to achieve good deduplication rates, acceptable
 performance can only be achieved through parallelism. Therefore, vdo's
-design attempts to be lock-free. Most of a vdo's main data structures are
-designed to be easily divided into "zones" such that any given bio must
-only access a single zone of any zoned structure. Safety with minimal
-locking is achieved by ensuring that during normal operation, each zone is
-assigned to a specific thread, and only that thread will access the portion
-of that data structure in that zone. Associated with each thread is a work
-queue. Each bio is associated with a request object which can be added to a
-work queue when the next phase of its operation requires access to the
-structures in the zone associated with that queue. Although each structure
-may be divided into zones, this division is not reflected in the on-disk
-representation of each data structure. Therefore, the number of zones for
-each structure, and hence the number of threads, is configured each time a
-vdo target is started.
+design attempts to be lock-free.
+
+Most of a vdo's main data structures are designed to be easily divided into
+"zones" such that any given bio must only access a single zone of any zoned
+structure. Safety with minimal locking is achieved by ensuring that during
+normal operation, each zone is assigned to a specific thread, and only that
+thread will access the portion of the data structure in that zone.
+Associated with each thread is a work queue. Each bio is associated with a
+request object (the "data_vio") which will be added to a work queue when
+the next phase of its operation requires access to the structures in the
+zone associated with that queue.
+
+Another way of thinking about this arrangement is that the work queue for
+each zone has an implicit lock on the structures it manages for all its
+operations, because vdo guarantees that no other thread will alter those
+structures.
+
+Although each structure is divided into zones, this division is not
+reflected in the on-disk representation of each data structure. Therefore,
+the number of zones for each structure, and hence the number of threads,
+can be reconfigured each time a vdo target is started.
 
 The Deduplication Index
 ---
@@ -75,17 +83,17 @@ is sufficient to find and eliminate most of the redundancy.
 Each block of data is hashed to produce a 16-byte block name. An index
 record consists of this block name paired with the presumed location of
 that data on the underlying storage. However, it is not possible to
-guarantee that the index is accurate. Most often, this occurs because it is
-too costly to update the index when a block is over-written or discarded.
-Doing so would require either storing the block name along with the blocks,
-which is difficult to do efficiently in block-based storage, or reading and
-rehashing each block before overwriting it. Inaccuracy can also result from
-a hash collision where two different blocks have the same name. In
-practice, this is extremely unlikely, but because vdo does not use a
-cryptographic hash, a malicious workload can be constructed. Because of
-these inaccuracies, vdo treats the locations in the index as hints, and
-reads each indicated block to verify that it is indeed a duplicate before
-sharing the existing block with a new one.
+guarantee that the index is accurate. In the most common case, this occurs
+because it is too costly to update the index when a block is over-written
+or discarded. Doing so would require either storing the block name along
+with the blocks, which is difficult to do efficiently in block-based
+storage, or reading and rehashing each block before overwriting it.
+Inaccuracy can also result from a hash collision where two different blocks
+have the same name. In practice, this is extremely unlikely, but because
+vdo does not use a cryptographic hash, a malicious workload could be
+constructed. Because of these inaccuracies, vdo treats the locations in the
+index as hints, and reads each indicated block to verify that it is indeed
+a duplicate before sharing the existing block with a new one.
 
 Records are collected into groups called c

[PATCH 0/3] dm vdo: improve documentation

2024-02-08 Thread Matthew Sakai

Add details about I/O handling in vdo, and details about
concurrency and locking. Also add vdo documentation to
the device-mapper doc index.

Matthew Sakai (3):
  dm vdo: add vdo documentation to device-mapper index
  dm vdo: add vio lifecycle details to doc
  dm vdo: add documentation details on zones and locking

 .../admin-guide/device-mapper/index.rst   |   2 +
 .../admin-guide/device-mapper/vdo-design.rst  | 674 --
 drivers/md/dm-vdo/block-map.h |  15 +
 drivers/md/dm-vdo/dedupe.c|   5 +
 drivers/md/dm-vdo/recovery-journal.h  |   4 +
 drivers/md/dm-vdo/slab-depot.h|  16 +-
 6 files changed, 483 insertions(+), 233 deletions(-)

-- 
2.42.0

Re: [PATCHSET wq/for-6.9] workqueue: Implement BH workqueue and convert several tasklet users

2024-02-08 Thread Tejun Heo

On Wed, Feb 07, 2024 at 11:02:37AM -0800, Allen wrote:
> https://github.com/allenpais/for-6.9-bh-conversions
> 
>  I am holding on to the patch that converts drivers/media/*, as I haven't 
> found
> a right way to replace tasklet_[disable/enable] api's. The rest should be 
> ready
> in a day or two.

Yeah, we'll need to add something to workqueue to support that. As for the
rest, looking at the code, I think tasklet_kill() should be converted to
cancel_work_sync(), not flush_work().

Thanks.

-- 
tejun

Re: [PATCH v5 00/14] dm-raid/md/raid: fix v6.7 regressions

2024-02-08 Thread Song Liu

Hi Benjamin,

On Mon, Feb 5, 2024 at 7:58 PM Benjamin Marzinski  wrote:
>
> On Tue, Feb 06, 2024 at 09:36:18AM +0800, Yu Kuai wrote:
> > Hi!
> >
> > 在 2024/02/06 3:35, Benjamin Marzinski 写道:
> > > Could you run the test with something like
> > >
> > > # make check_local T=lvconvert-repair-raid.sh VERBOSE=1 > out 2>&1
> > >
> > > and post the output.
> >
> > Attached is the output from my VM.
>
> Instead of running the tests from the lvm2 git repo, if you run
>
> # make -C test install
>
> to install the tests, and then create a results directory and run the
> test from there, do you still see the error in the 6.6 kernel?
>
> # make ~/results
> # cd ~/results
> # lvm2-testsuite --only lvconvert-repair-raid.sh
>
> Running the tests this way will test the installed lvm2 binaries on your
> system, instead of the ones in the lvm2 git repo. They may be compiled
> differently.

I am not able to get reliable results from shell/lvconvert-repair-raid.sh
either. For 6.6.0 kernel, the test fails. On 6.8-rc1 kernel, the test fails
sometimes.

Could you please share more information about your test setup?
Specifically:
1. Which tree/branch/tag are you testing?
2. What's the .config used in the tests?
3. How do you run the test suite? One test at a time, or all of them
together?
4. How do you handle "test passes sometimes" cases?

Thanks,
Song

55 matches

Mail list logo