Re: [PATCH v5 1/8] dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n

2024-02-12 Thread Lukas Wunner
On Mon, Feb 12, 2024 at 11:30:54AM -0500, Mathieu Desnoyers wrote:
> Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for
> CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y
> never returns NULL.

All the callers of alloc_dax() only check for IS_ERR().

Doesn't this result in a change of behavior in all the callers?
Previously they'd ignore the NULL return value and continue,
now they'll error out.

Given that, seems dangerous to add a Fixes tag with a v4.0 commit
and thus risk regressing all stable kernels.

Thanks,

Lukas



Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-12 Thread Lukas Wunner
On Mon, Feb 12, 2024 at 11:30:58AM -0500, Mathieu Desnoyers wrote:
> In preparation for checking whether the architecture has data cache
> aliasing within alloc_dax(), modify the error handling of virtio
> virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
> non-fatal.
> 
> Co-developed-by: Dan Williams 
> Signed-off-by: Dan Williams 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")

That's a v4.0 commit, yet this patch uses DEFINE_FREE() which is
only available in v6.6 but not any earlier stable kernels.

So the Fixes tag feels a bit weird.

Apart from that,
Reviewed-by: Lukas Wunner 



Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-12 Thread Lukas Wunner
On Mon, Feb 12, 2024 at 03:02:46PM -0800, Dan Williams wrote:
> However, Lukas, I think Linus is right, your DEFINE_FREE() should use
> IS_ERR_OR_NULL().

Uh... that's a negative, sir. ;)

IS_ERR_OR_NULL() results in...
* a superfluous NULL pointer check in x509_key_preparse() and
* a superfluous IS_ERR check in x509_cert_parse().

IS_ERR() results *only* in...
* a superfluous IS_ERR check in x509_cert_parse().

I can get rid of the IS_ERR() check by using assume().

I can *not* get rid of the NULL pointer check because the compiler
is compiled with -fno-delete-null-pointer-checks.  (The compiler
seems to ignore __attribute__((returns_nonnull)) due to that.)


> I.e. the problem is trying to use
> __free(x509_free_certificate) in x509_cert_parse().
> 
> > --- a/crypto/asymmetric_keys/x509_cert_parser.c
> > +++ b/crypto/asymmetric_keys/x509_cert_parser.c
> > @@ -60,24 +60,24 @@ void x509_free_certificate(struct x509_certificate 
> > *cert)
> >   */
> >  struct x509_certificate *x509_cert_parse(const void *data, size_t datalen)
> >  {
> > -   struct x509_certificate *cert;
> > -   struct x509_parse_context *ctx;
> > +   struct x509_certificate *cert __free(x509_free_certificate);
> 
> ...make this:
> 
> struct x509_certificate *cert __free(kfree);

That doesn't work I'm afraid.  x509_cert_parse() needs
x509_free_certificate() to be called in the error path,
not kfree().  See the existing code in current mainline:

x509_cert_parse() populates three sub-allocations in
struct x509_certificate (pub, sig, id) and two
sub-sub-allocations (pub->key, pub->params).

So I'd have to add five additional local variables which
get freed by __cleanup().  One of them (pub->key) requires
kfree_sensitive() instead of kfree(), so I'd need an extra
DEFINE_FREE() for that.

I haven't tried it but I suspect the result would look
terrible and David Howells wouldn't like it.


> ...and Mathieu, this should be IS_ERR_OR_NULL() to skip an unnecessary
> call to virtio_fs_cleanup_dax() at function exit that the compiler
> should elide.

My recommendation is to check for !IS_ERR() in the DEFINE_FREE() clause
and amend virtio_fs_cleanup_dax() with a "if (!dax_dev) return;" for
defensiveness in case someone calls it with a NULL pointer.

That's the best solution I could come up with for the x509_certificate
conversion.

Note that even with superfluous checks avoided, __cleanup() causes
gcc-12 to always generate two return paths.  It's very visible in
the generated code that all the stack unwinding code gets duplicated
in every function using __cleanup().  The existing Assembler code
of x509_key_preparse() and x509_cert_parse(), without __cleanup()
invocation, has only a single return path.

So __cleanup() bloats the code regardless of superfluous checks,
but future gcc versions might avoid that.  clang-15 generates much
more compact code (vmlinux is a couple hundred kBytes smaller),
but does weird things such as inlining x509_free_certificate()
in x509_cert_parse().

As you may have guessed, I've spent an inordinate amount of time
down that rabbit hole. ;(

Thanks,

Lukas



[linux-next:master] BUILD REGRESSION ae00c445390b349e070a64dc62f08aa878db7248

2024-02-12 Thread kernel test robot
tree/branch: 
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
branch HEAD: ae00c445390b349e070a64dc62f08aa878db7248  Add linux-next specific 
files for 20240212

Error/Warning reports:

https://lore.kernel.org/oe-kbuild-all/202402122047.ydhrzmm4-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202402130633.bfncwveh-...@intel.com
https://lore.kernel.org/oe-kbuild-all/202402131351.a0fzogeg-...@intel.com

Error/Warning: (recently discovered and may have been fixed)

drivers/md/dm.c:2131:(.text+0x9f34): relocation truncated to fit: 
R_OR1K_INSN_REL_26 against undefined symbol `set_dax_nocache'
drivers/md/dm.c:2131:(.text+0x9f34): undefined reference to `set_dax_nocache'
drivers/md/dm.c:2132:(.text+0x9f3c): relocation truncated to fit: 
R_OR1K_INSN_REL_26 against undefined symbol `set_dax_nomc'
misc.c:(.text+0x14f0): undefined reference to 
`__ubsan_handle_load_invalid_value'
or1k-linux-ld: drivers/md/dm.c:2132:(.text+0x9f3c): undefined reference to 
`set_dax_nomc'
sh4-linux-ld: misc.c:(.text+0x1698): undefined reference to 
`__ubsan_handle_load_invalid_value'

Error/Warning ids grouped by kconfigs:

gcc_recent_errors
|-- alpha-allyesconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- arc-allmodconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- arc-allyesconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- arc-randconfig-002-20240212
|   `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size
|-- arm-allmodconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- arm-allyesconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- arm-randconfig-001-20240212
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- arm64-defconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- csky-allmodconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- csky-allyesconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- csky-randconfig-001-20240212
|   `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size
|-- csky-randconfig-002-20240212
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- csky-randconfig-r061-20240212
|   |-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|   `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size
|-- i386-allmodconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- i386-allyesconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- i386-buildonly-randconfig-002-20240212
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- i386-randconfig-013-20240212
|   `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size
|-- i386-randconfig-015-20240212
|   `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size
|-- loongarch-allmodconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- loongarch-randconfig-002-20240212
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- microblaze-allmodconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- microblaze-allyesconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- mips-allyesconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg
|-- nios2-randconfig-001-20240212
|   `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size
|-- openrisc-allyesconfig
|   `-- 
drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described

Re: [PATCH 25/26] block: Reduce zone write plugging memory usage

2024-02-12 Thread Damien Le Moal
On 2/13/24 03:40, Bart Van Assche wrote:
> On 2/12/24 00:47, Damien Le Moal wrote:
>> Replying to myself as I had an idea:
>> 1) Store the zone capacity in a separate array: 4B * nr_zones needed. Storing
>> "0" as a value for a zone in that array would indicate that the zone is
>> conventional. No additional zone bitmap needed.
>> 2) Use a sparse xarray for managing allocated zone write plugs: 64B per
>> allocated zone write plug needed, which for an SMR drive would generally be 
>> at
>> most 128 * 64B = 8K.
>>
>> So for an SMR drive with 100,000 zones, that would be a total of 408 KB, 
>> instead
>> of the current 1.6 MB. Will try to prototype this to see how performance 
>> goes (I
>> am worried about the xarray lookup overhead in the hot path).
> 
> Hi Damien,
> 
> Are there any zoned devices where the sequential write required zones occur 
> before
> the conventional zones? If not, does this mean that the conventional zones 
> always
> occur before the write pointer zones and also that storing the number of 
> conventional
> zones is sufficient?

Not sure where you want to go with this... In any case, there are SMR drives
which have conventional zones before and after the bulk of the capacity as
sequential write required zones. Conventional zones can be anywhere.

> Are there zoned storage devices where each zone has a different capacity? I 
> have
> not yet encountered any such device. I'm wondering whether a single capacity
> variable would be sufficient for the entire device.

Yes, I did this optimization. Right now, for the 28TB SMR disk case, I am down
to a bitmap for conventional zones (16KB) plus max-open-zones * 64 B for the
zone write plugs. Cannot go lower than that. I am still looking into xarray vs
hash table for the zone write plugs for the overhead/performance.


-- 
Damien Le Moal
Western Digital Research




Re: [PATCH] multipath-tools: update no_path_retry value for IBM/2145

2024-02-12 Thread Xose Vazquez Perez

On 8/26/21 8:47 AM, Martin Wilck wrote:
   ^^^
It is never too late!


On Thu, 2021-08-26 at 00:24 +0200, Xose Vazquez Perez wrote:

Based on current configs:
https://www.ibm.com/docs/en/flashsystem-9x00/8.4.x?topic=system-settings-linux-hosts

Cc: Martin Wilck 
Cc: Benjamin Marzinski 
Cc: Christophe Varoqui 
Cc: DM-DEVEL ML 
Signed-off-by: Xose Vazquez Perez 
---
  libmultipath/hwtable.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libmultipath/hwtable.c b/libmultipath/hwtable.c
index 2a896440..58554cbb 100644
--- a/libmultipath/hwtable.c
+++ b/libmultipath/hwtable.c
@@ -662,7 +662,7 @@ static struct hwentry default_hw[] = {
 /* Storwize family / SAN Volume Controller / Flex
System V7000 / FlashSystem V840/V9000/9100 */
 .vendor    = "IBM",
 .product   = "^2145",
-   .no_path_retry = NO_PATH_RETRY_QUEUE,
+   .no_path_retry = 5,
 .pgpolicy  = GROUP_BY_PRIO,
 .pgfailback    = -FAILBACK_IMMEDIATE,
 .prio_name = PRIO_ALUA,


Ref: https://github.com/opensvc/multipath-tools/issues/6

The question is on which basis IBM came up with this recommendation.
5 (aka 25s) is a rather low value. Some users may encounter unpleasant
surprises if we change the default this way, as it used to be infinite
before.

Using 5, the IBS 2145 would have the 2nd-lowest default in hwtable.c
after Dell PowerStore (3). Symmetrix has 6; all other arrays default to
10 or higher, many default to "queue".

Observing that the above is the documentation for the *Flashsystem*
9200,  I consider it likely that the value ".no_path_retry = 5" would
apply to flash-based IBM storage products, but not to the older
products such as the V7000, which unfortunately use the same device ID.

It'd be helpful if someone from IBM could jump in here...

Pondering the pros and cons, I vote for keeping the current defaults
for now.

Martin


Some history:

first commit 3eb8c380a :
   {
   /* IBM SAN Volume Controller */
   .vendor= "IBM",
   .product   = "2145",
   .getuid= DEFAULT_GETUID,
   .getprio   = "mpath_prio_alua /dev/%n",
   .features  = "1 queue_if_no_path",
   .hwhandler = DEFAULT_HWHANDLER,
   .selector  = DEFAULT_SELECTOR,
   .pgpolicy  = GROUP_BY_PRIO,
   .pgfailback= -FAILBACK_IMMEDIATE,
   .rr_weight = RR_WEIGHT_NONE,
   .no_path_retry = NO_PATH_RETRY_UNDEF,
   .minio = DEFAULT_MINIO,
   .checker_name  = TUR,
   },

NO_PATH_RETRY_UNDEF was removed in b7c3cf014 because it was the default value,
and later "1 queue_if_no_path" was replaced by NO_PATH_RETRY_QUEUE in 87ea76f99

IBM docs recommends:
no_path_retry 5 # or no_path_retry "fail" for some current linux distros

IBM Storage FlashSystem 5200, 5000, 5100, Storwize V5100 and V5000E:
https://www.ibm.com/docs/en/flashsystem-5x00/8.6.x?topic=system-settings-linux-hosts

IBM Storage FlashSystem 7300, 7200 and Storwize V7000:
https://www.ibm.com/docs/en/flashsystem-7x00/8.6.x?topic=system-settings-linux-hosts

IBM FlashSystem V9000:
https://www.ibm.com/docs/en/flashsystem-v9000/8.3.x?topic=system-settings-linux-hosts

IBM Storage FlashSystem 9500, 9200 and 9100:
https://www.ibm.com/docs/en/flashsystem-9x00/8.6.x?topic=system-settings-linux-hosts

Therefore, we should change this value.



Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-12 Thread Dan Williams
[ add Lukas ]

Linus Torvalds wrote:
> On Mon, 12 Feb 2024 at 14:04, Dan Williams  wrote:
> >
> > This works because the internals of virtio_fs_cleanup_dax(), "kill_dax()
> > and put_dax()", know how to handle a NULL @dax_dev. It is still early
> > days with the "cleanup" helpers, but I wonder if anyone else cares that
> > the DEFINE_FREE() above does not check for NULL?
> 
> Well, the main reason for DEFINE_FREE() to check for NULL is not
> correctness, but code generation. See the comment about kfree() in
> :
> 
>  * NOTE: the DEFINE_FREE()'s @free expression includes a NULL test even though
>  * kfree() is fine to be called with a NULL value. This is on purpose. This 
> way
>  * the compiler sees the end of our alloc_obj() function as [...]
> 
> with the full explanation there.
> 
> Now, whether the code wants to actually use the cleanup() helpers for
> a single use-case is debatable.
> 
> But yes, if it does, I suspect it should use !IS_ERR_OR_NULL(ptr).o

I am trying to arrive at a common recommendation given Lukas found that
IS_ERR_OR_NULL() resulted in unwanted NULL checks emitted in the
assembly [1].

He is doing something similar:

http://lore.kernel.org/r/4143b15418c4ecf87ddeceb36813943c3ede17aa.1707734526.git.lu...@wunner.de

...and introduced an assume() helper.

However, Lukas, I think Linus is right, your DEFINE_FREE() should use
IS_ERR_OR_NULL(). I.e. the problem is trying to use
__free(x509_free_certificate) in x509_cert_parse().

> --- a/crypto/asymmetric_keys/x509_cert_parser.c
> +++ b/crypto/asymmetric_keys/x509_cert_parser.c
> @@ -60,24 +60,24 @@ void x509_free_certificate(struct x509_certificate *cert)
>   */
>  struct x509_certificate *x509_cert_parse(const void *data, size_t datalen)
>  {
> -   struct x509_certificate *cert;
> -   struct x509_parse_context *ctx;
> +   struct x509_certificate *cert __free(x509_free_certificate);

...make this:

struct x509_certificate *cert __free(kfree);

...and Mathieu, this should be IS_ERR_OR_NULL() to skip an unnecessary
call to virtio_fs_cleanup_dax() at function exit that the compiler
should elide.



Re: dm-ps-historical-service-time should probably set QUEUE_FLAG_STATS

2024-02-12 Thread Khazhy Kumykov
(urgh, apologies for the html)

We use io_start_time_ns in dm-mpath.c, but the only actual user is
historical-service-time currently. We need
blk_stat_enable_accounting()/blk_stat_disable_accounting() calls
somewhere.

I was considering adding a new multipath feature flag, but we already
have "DM_PS_USE_HR_TIMER", which seems similar in spirit. Would that
re-use make sense?

Khazhy


smime.p7s
Description: S/MIME Cryptographic Signature


+ dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch added to mm-unstable branch

2024-02-12 Thread Andrew Morton


The patch titled
 Subject: dax: fix incorrect list of data cache aliasing architectures
has been added to the -mm mm-unstable branch.  Its filename is
 dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch

This patch will shortly appear at
 
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch

This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
  reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing 
your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

--
From: Mathieu Desnoyers 
Subject: dax: fix incorrect list of data cache aliasing architectures
Date: Mon, 12 Feb 2024 11:31:01 -0500

commit d92576f1167c ("dax: does not work correctly with virtual aliasing
caches") prevents DAX from building on architectures with virtually
aliased dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g.  recent ARMv7 don't have virtually aliased
dcaches), and also misses many other architectures with virtually aliased
data cache.

This is a regression introduced in the v4.0 Linux kernel where the dax
mount option is removed for 32-bit ARMv7 boards which have no data cache
aliasing, and therefore should work fine with FS_DAX.

This was turned into the following check in alloc_dax() by a preparatory
change:

if (ops && (IS_ENABLED(CONFIG_ARM) ||
IS_ENABLED(CONFIG_MIPS) ||
IS_ENABLED(CONFIG_SPARC)))
return NULL;

Use cpu_dcache_is_aliasing() instead to figure out whether the environment
has aliasing data caches.

Link: 
https://lkml.kernel.org/r/20240212163101.19614-9-mathieu.desnoy...@efficios.com
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: Alasdair Kergon 
Cc: Dave Chinner 
Cc: Michael Sclafani 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Signed-off-by: Andrew Morton 
---

 drivers/dax/super.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

--- 
a/drivers/dax/super.c~dax-fix-incorrect-list-of-data-cache-aliasing-architectures
+++ a/drivers/dax/super.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "dax-private.h"
 
 /**
@@ -456,9 +457,7 @@ struct dax_device *alloc_dax(void *priva
 * except for device-dax (NULL operations pointer), which does
 * not use aliased mappings from the kernel.
 */
-   if (ops && (IS_ENABLED(CONFIG_ARM) ||
-   IS_ENABLED(CONFIG_MIPS) ||
-   IS_ENABLED(CONFIG_SPARC)))
+   if (ops && cpu_dcache_is_aliasing())
return ERR_PTR(-EOPNOTSUPP);
 
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
_

Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are

nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch
dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch
nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dcssblk-handle-alloc_dax-eopnotsupp-failure.patch
virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dax-check-for-data-cache-aliasing-at-runtime.patch
introduce-cpu_dcache_is_aliasing-across-all-architectures.patch
dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch




Re: [PATCH v5 00/14] dm-raid/md/raid: fix v6.7 regressions

2024-02-12 Thread Song Liu
Hi Mikulas and DM folks,

This patchset looks good to me. Please help review the dm-raid code
so that we can ship it to 6.8 kernel and back port the fixes to 6.7
stable kernels.

In the longer term, we will invest more into a CI system and include the
lvm2 test suite as part of the tests. Hopefully, this effort will help us
catch similar issues much sooner.

Thanks,
Song

On Thu, Feb 1, 2024 at 1:30 AM Yu Kuai  wrote:
>
> From: Yu Kuai 
>
> Changes in v5:
>  - remove the patch to wait for bio completion while removing dm disk;
>  - add patch 6;
>  - reorder the patches, patch 1-8 is for md/raid, and patch 9-14 is
>  related to dm-raid;
>
> Changes in v4:
>  - add patch 10 to fix a raid456 deadlock(for both md/raid and dm-raid);
>  - add patch 13 to wait for inflight IO completion while removing dm
>  device;
>
> Changes in v3:
>  - fix a problem in patch 5;
>  - add patch 12;
>
> Changes in v2:
>  - replace revert changes for dm-raid with real fixes;
>  - fix dm-raid5 deadlock that exist for a long time, this deadlock is
>  triggered because another problem is fixed in raid5, and instead of
>  deadlock, user will read wrong data before v6.7, patch 9-11;
>
> First regression related to stop sync thread:
>
> The lifetime of sync_thread is designed as following:
>
> 1) Decide want to start sync_thread, set MD_RECOVERY_NEEDED, and wake up
> daemon thread;
> 2) Daemon thread detect that MD_RECOVERY_NEEDED is set, then set
> MD_RECOVERY_RUNNING and register sync_thread;
> 3) Execute md_do_sync() for the actual work, if it's done or
> interrupted, it will set MD_RECOVERY_DONE and wake up daemone thread;
> 4) Daemon thread detect that MD_RECOVERY_DONE is set, then clear
> MD_RECOVERY_RUNNING and unregister sync_thread;
>
> In v6.7, we fix md/raid to follow this design by commit f52f5c71f3d4
> ("md: fix stopping sync thread"), however, dm-raid is not considered at
> that time, and following test will hang:
>
> shell/integrity-caching.sh
> shell/lvconvert-raid-reshape.sh
>
> This patch set fix the broken test by patch 1-4;
>  - patch 1 fix that step 4) is broken by suspended array;
>  - patch 2 fix that step 4) is broken by read-only array;
>  - patch 3 fix that step 3) is broken that md_do_sync() doesn't set
>  MD_RECOVERY_DONE; Noted that this patch will introdece new problem that
>  data will be corrupted, which will be fixed in later patches.
>  - patch 4 fix that setp 1) is broken that sync_thread is register and
>  MD_RECOVERY_RUNNING is set directly, md/raid behaviour, not related to
>  dm-raid;
>
> With patch 1-4, the above test won't hang anymore, however, the test
> will still fail and complain that ext4 is corrupted;
>
>
> Second regression is found by code review, interrupted reshape
> concurrent with IO can deadlock, patch 5;
>
>
> Third regression fix 'active_io' leakage, patch 6;
>
>
> The fifth regression related to frozen sync thread:
>
> Noted that for raid456, if reshape is interrupted, then call
> "pers->start_reshape" will corrupt data. And dm-raid rely on md_do_sync()
> doesn't set MD_RECOVERY_DONE so that new sync_thread won't be registered,
> and patch 3 just break this.
>
>  - Patch 9 fix this problem by interrupting reshape and frozen
>  sync_thread in dm_suspend(), then unfrozen and continue reshape in
> dm_resume(). It's verified that dm-raid tests won't complain that
> ext4 is corrupted anymore.
>  - Patch 10 fix the problem that raid_message() call
>  md_reap_sync_thread() directly, without holding 'reconfig_mutex'.
>
>
> Last regression related to dm-raid456 IO concurrent with reshape:
>
> For raid456, if reshape is still in progress, then IO across reshape
> position will wait for reshape to make progress. However, for dm-raid,
> in following cases reshape will never make progress hence IO will hang:
>
> 1) the array is read-only;
> 2) MD_RECOVERY_WAIT is set;
> 3) MD_RECOVERY_FROZEN is set;
>
> After commit c467e97f079f ("md/raid6: use valid sector values to determine
> if an I/O should wait on the reshape") fix the problem that IO across
> reshape position doesn't wait for reshape, the dm-raid test
> shell/lvconvert-raid-reshape.sh start to hang at raid5_make_request().
>
> For md/raid, the problem doesn't exist because:
>
> 1) If array is read-only, it can switch to read-write by ioctl/sysfs;
> 2) md/raid never set MD_RECOVERY_WAIT;
> 3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold
>'reconfig_mutex' anymore, it can be cleared and reshape can continue by
>sysfs api 'sync_action'.
>
> However, I'm not sure yet how to avoid the problem in dm-raid yet.
>
>  - patch 11,12 fix this problem by detecting the above 3 cases in
>  dm_suspend(), and fail those IO directly.
>
> If user really meet the IO error, then it means they're reading the wrong
> data before c467e97f079f. And it's safe to read/write the array after
> reshape make progress successfully.
>
> There are also some other minor changes: patch 8 and patch 12;
>
>
> Test result (for v4, I don't think it's 

+ introduce-cpu_dcache_is_aliasing-across-all-architectures.patch added to mm-unstable branch

2024-02-12 Thread Andrew Morton


The patch titled
 Subject: Introduce cpu_dcache_is_aliasing() across all architectures
has been added to the -mm mm-unstable branch.  Its filename is
 introduce-cpu_dcache_is_aliasing-across-all-architectures.patch

This patch will shortly appear at
 
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/introduce-cpu_dcache_is_aliasing-across-all-architectures.patch

This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
  reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing 
your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

--
From: Mathieu Desnoyers 
Subject: Introduce cpu_dcache_is_aliasing() across all architectures
Date: Mon, 12 Feb 2024 11:31:00 -0500

Introduce a generic way to query whether the data cache is virtually
aliased on all architectures.  Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g.  FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios dependending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
   state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
 commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to 
PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".

Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Link: 
https://lkml.kernel.org/r/20240212163101.19614-8-mathieu.desnoy...@efficios.com
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: Alasdair Kergon 
Cc: Dave Chinner 
Cc: Michael Sclafani 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Signed-off-by: Andrew Morton 
---

 arch/arc/Kconfig|1 +
 arch/arc/include/asm/cachetype.h|9 +
 arch/arm/Kconfig|1 +
 arch/arm/include/asm/cachetype.h|2 ++
 arch/csky/Kconfig   |1 +
 arch/csky/include/asm/cachetype.h   |9 +
 arch/m68k/Kconfig   |1 +
 arch/m68k/include/asm/cachetype.h   |9 +
 arch/mips/Kconfig   |1 +
 arch/mips/include/asm/cachetype.h   |9 +
 arch/nios2/Kconfig  |1 +
 arch/nios2/include/asm/cachetype.h  |   10 ++
 arch/parisc/Kconfig |1 +
 arch/parisc/include/asm/cachetype.h |9 +
 arch/sh/Kconfig |1 +
 arch/sh/include/asm/cachetype.h |9 +
 arch/sparc/Kconfig  |1 +
 arch/sparc/include/asm/cachetype.h  |   14 ++
 arch/xtensa/Kconfig |1 +
 arch/xtensa/include/asm/cachetype.h |   10 ++
 include/linux/cacheinfo.h   |6 ++
 mm/Kconfig  |6 ++
 22 files changed, 112 insertions(+)

--- /dev/null
+++ a/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include 
+
+#define cpu_dcache_is_aliasing()   true
+
+#endif
--- 

+ dax-check-for-data-cache-aliasing-at-runtime.patch added to mm-unstable branch

2024-02-12 Thread Andrew Morton


The patch titled
 Subject: dax: check for data cache aliasing at runtime
has been added to the -mm mm-unstable branch.  Its filename is
 dax-check-for-data-cache-aliasing-at-runtime.patch

This patch will shortly appear at
 
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/dax-check-for-data-cache-aliasing-at-runtime.patch

This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
  reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing 
your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

--
From: Mathieu Desnoyers 
Subject: dax: check for data cache aliasing at runtime
Date: Mon, 12 Feb 2024 11:30:59 -0500

Replace the following fs/Kconfig:FS_DAX dependency:

  depends on !(ARM || MIPS || SPARC)

By a runtime check within alloc_dax().  This runtime check returns
ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means the
kernel is using an aliased mapping) on an architecture which has data
cache aliasing.

Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n
for consistency.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Link: 
https://lkml.kernel.org/r/20240212163101.19614-7-mathieu.desnoy...@efficios.com
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: Alasdair Kergon 
Cc: Dave Chinner 
Cc: Michael Sclafani 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Signed-off-by: Andrew Morton 
---

 drivers/dax/super.c |   10 ++
 fs/Kconfig  |1 -
 2 files changed, 10 insertions(+), 1 deletion(-)

--- a/drivers/dax/super.c~dax-check-for-data-cache-aliasing-at-runtime
+++ a/drivers/dax/super.c
@@ -451,6 +451,16 @@ struct dax_device *alloc_dax(void *priva
dev_t devt;
int minor;
 
+   /*
+* Unavailable on architectures with virtually aliased data caches,
+* except for device-dax (NULL operations pointer), which does
+* not use aliased mappings from the kernel.
+*/
+   if (ops && (IS_ENABLED(CONFIG_ARM) ||
+   IS_ENABLED(CONFIG_MIPS) ||
+   IS_ENABLED(CONFIG_SPARC)))
+   return ERR_PTR(-EOPNOTSUPP);
+
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
return ERR_PTR(-EINVAL);
 
--- a/fs/Kconfig~dax-check-for-data-cache-aliasing-at-runtime
+++ a/fs/Kconfig
@@ -60,7 +60,6 @@ endif # BLOCK
 config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
-   depends on !(ARM || MIPS || SPARC)
depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
_

Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are

nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch
dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch
nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dcssblk-handle-alloc_dax-eopnotsupp-failure.patch
virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dax-check-for-data-cache-aliasing-at-runtime.patch
introduce-cpu_dcache_is_aliasing-across-all-architectures.patch
dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch




+ virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch added to mm-unstable branch

2024-02-12 Thread Andrew Morton


The patch titled
 Subject: virtio: treat alloc_dax() -EOPNOTSUPP failure as non-fatal
has been added to the -mm mm-unstable branch.  Its filename is
 virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch

This patch will shortly appear at
 
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch

This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
  reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing 
your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

--
From: Mathieu Desnoyers 
Subject: virtio: treat alloc_dax() -EOPNOTSUPP failure as non-fatal
Date: Mon, 12 Feb 2024 11:30:58 -0500

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of virtio
virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
non-fatal.

Link: 
https://lkml.kernel.org/r/20240212163101.19614-6-mathieu.desnoy...@efficios.com
Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: Dave Chinner 
Cc: Michael Sclafani 
Signed-off-by: Andrew Morton 
---

 fs/fuse/virtio_fs.c |   15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

--- a/fs/fuse/virtio_fs.c~virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal
+++ a/fs/fuse/virtio_fs.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "fuse_i.h"
 
@@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *
put_dax(dax_dev);
 }
 
+DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR(_T)) 
virtio_fs_cleanup_dax(_T))
+
 static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs 
*fs)
 {
+   struct dax_device *dax_dev __free(cleanup_dax) = ERR_PTR(-EOPNOTSUPP);
struct virtio_shm_region cache_reg;
struct dev_pagemap *pgmap;
bool have_cache;
@@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct vi
if (!IS_ENABLED(CONFIG_FUSE_DAX))
return 0;
 
+   dax_dev = alloc_dax(fs, _fs_dax_ops);
+   if (IS_ERR(dax_dev)) {
+   int rc = PTR_ERR(dax_dev);
+   return rc == -EOPNOTSUPP ? 0 : rc;
+   }
+
/* Get cache region */
have_cache = virtio_get_shm_region(vdev, _reg,
   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
@@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct vi
dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 
0x%llx\n",
__func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
 
-   fs->dax_dev = alloc_dax(fs, _fs_dax_ops);
-   if (IS_ERR(fs->dax_dev))
-   return PTR_ERR(fs->dax_dev);
-
+   fs->dax_dev = no_free_ptr(dax_dev);
return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax,
fs->dax_dev);
 }
_

Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are

nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch
dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch
nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dcssblk-handle-alloc_dax-eopnotsupp-failure.patch
virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dax-check-for-data-cache-aliasing-at-runtime.patch
introduce-cpu_dcache_is_aliasing-across-all-architectures.patch
dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch




+ dcssblk-handle-alloc_dax-eopnotsupp-failure.patch added to mm-unstable branch

2024-02-12 Thread Andrew Morton


The patch titled
 Subject: dcssblk: handle alloc_dax() -EOPNOTSUPP failure
has been added to the -mm mm-unstable branch.  Its filename is
 dcssblk-handle-alloc_dax-eopnotsupp-failure.patch

This patch will shortly appear at
 
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/dcssblk-handle-alloc_dax-eopnotsupp-failure.patch

This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
  reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing 
your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

--
From: Mathieu Desnoyers 
Subject: dcssblk: handle alloc_dax() -EOPNOTSUPP failure
Date: Mon, 12 Feb 2024 11:30:57 -0500

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dcssblk
dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures.

Considering that s390 is not a data cache aliasing architecture, and
considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP from
alloc_dax() should make dcssblk_add_store() fail.

Link: 
https://lkml.kernel.org/r/20240212163101.19614-5-mathieu.desnoy...@efficios.com
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: Dave Chinner 
Cc: Michael Sclafani 
Signed-off-by: Andrew Morton 
---

 drivers/s390/block/dcssblk.c |   11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/drivers/s390/block/dcssblk.c~dcssblk-handle-alloc_dax-eopnotsupp-failure
+++ a/drivers/s390/block/dcssblk.c
@@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, st
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+   struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, st
if (rc)
goto put_dev;
 
-   dev_info->dax_dev = alloc_dax(dev_info, _dax_ops);
-   if (IS_ERR(dev_info->dax_dev)) {
-   rc = PTR_ERR(dev_info->dax_dev);
-   dev_info->dax_dev = NULL;
+   dax_dev = alloc_dax(dev_info, _dax_ops);
+   if (IS_ERR(dax_dev)) {
+   rc = PTR_ERR(dax_dev);
goto put_dev;
}
-   set_dax_synchronous(dev_info->dax_dev);
+   set_dax_synchronous(dax_dev);
+   dev_info->dax_dev = dax_dev;
rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
if (rc)
goto out_dax;
_

Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are

nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch
dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch
nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dcssblk-handle-alloc_dax-eopnotsupp-failure.patch
virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dax-check-for-data-cache-aliasing-at-runtime.patch
introduce-cpu_dcache_is_aliasing-across-all-architectures.patch
dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch




+ dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch added to mm-unstable branch

2024-02-12 Thread Andrew Morton


The patch titled
 Subject: dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal
has been added to the -mm mm-unstable branch.  Its filename is
 dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch

This patch will shortly appear at
 
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch

This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
  reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing 
your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

--
From: Mathieu Desnoyers 
Subject: dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal
Date: Mon, 12 Feb 2024 11:30:56 -0500

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

Link: 
https://lkml.kernel.org/r/20240212163101.19614-4-mathieu.desnoy...@efficios.com
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Suggested-by: Dan Williams 
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: Dave Chinner 
Cc: Michael Sclafani 
Signed-off-by: Andrew Morton 
---

 drivers/md/dm.c |   17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

--- a/drivers/md/dm.c~dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal
+++ a/drivers/md/dm.c
@@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct
 static struct mapped_device *alloc_dev(int minor)
 {
int r, numa_node_id = dm_get_numa_node();
+   struct dax_device *dax_dev;
struct mapped_device *md;
void *old_md;
 
@@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(i
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
 
-   if (IS_ENABLED(CONFIG_FS_DAX)) {
-   md->dax_dev = alloc_dax(md, _dax_ops);
-   if (IS_ERR(md->dax_dev)) {
-   md->dax_dev = NULL;
+   dax_dev = alloc_dax(md, _dax_ops);
+   if (IS_ERR(dax_dev)) {
+   if (PTR_ERR(dax_dev) != -EOPNOTSUPP)
goto bad;
-   }
-   set_dax_nocache(md->dax_dev);
-   set_dax_nomc(md->dax_dev);
-   if (dax_add_host(md->dax_dev, md->disk))
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   md->dax_dev = dax_dev;
+   if (dax_add_host(dax_dev, md->disk))
goto bad;
}
 
_

Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are

nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch
dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch
nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dcssblk-handle-alloc_dax-eopnotsupp-failure.patch
virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dax-check-for-data-cache-aliasing-at-runtime.patch
introduce-cpu_dcache_is_aliasing-across-all-architectures.patch
dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch




+ nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch added to mm-unstable branch

2024-02-12 Thread Andrew Morton


The patch titled
 Subject: nvdimm/pmem: treat alloc_dax() -EOPNOTSUPP failure as non-fatal
has been added to the -mm mm-unstable branch.  Its filename is
 nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch

This patch will shortly appear at
 
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch

This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
  reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing 
your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

--
From: Mathieu Desnoyers 
Subject: nvdimm/pmem: treat alloc_dax() -EOPNOTSUPP failure as non-fatal
Date: Mon, 12 Feb 2024 11:30:55 -0500

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

[ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]

Link: 
https://lkml.kernel.org/r/20240212163101.19614-3-mathieu.desnoy...@efficios.com
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: Dave Chinner 
Cc: Michael Sclafani 
Signed-off-by: Andrew Morton 
---

 drivers/nvdimm/pmem.c |   22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

--- 
a/drivers/nvdimm/pmem.c~nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal
+++ a/drivers/nvdimm/pmem.c
@@ -560,17 +560,19 @@ static int pmem_attach_disk(struct devic
dax_dev = alloc_dax(pmem, _dax_ops);
if (IS_ERR(dax_dev)) {
rc = PTR_ERR(dax_dev);
-   goto out;
+   if (rc != -EOPNOTSUPP)
+   goto out;
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   if (is_nvdimm_sync(nd_region))
+   set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
+   rc = dax_add_host(dax_dev, disk);
+   if (rc)
+   goto out_cleanup_dax;
+   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
}
-   set_dax_nocache(dax_dev);
-   set_dax_nomc(dax_dev);
-   if (is_nvdimm_sync(nd_region))
-   set_dax_synchronous(dax_dev);
-   pmem->dax_dev = dax_dev;
-   rc = dax_add_host(dax_dev, disk);
-   if (rc)
-   goto out_cleanup_dax;
-   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
_

Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are

nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch
dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch
nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dcssblk-handle-alloc_dax-eopnotsupp-failure.patch
virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch
dax-check-for-data-cache-aliasing-at-runtime.patch
introduce-cpu_dcache_is_aliasing-across-all-architectures.patch
dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch




+ dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch added to mm-unstable branch

2024-02-12 Thread Andrew Morton


The patch titled
 Subject: dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n
has been added to the -mm mm-unstable branch.  Its filename is
 dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch

This patch will shortly appear at
 
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch

This patch will later appear in the mm-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
  reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing 
your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

--
From: Mathieu Desnoyers 
Subject: dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n
Date: Mon, 12 Feb 2024 11:30:54 -0500

Patch series "Introduce cpu_dcache_is_aliasing() to fix DAX regression",
v5.

This commit introduced in v4.0 prevents building FS_DAX on 32-bit ARM,
even on ARMv7 which does not have virtually aliased data caches:

commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")

It used to work fine before: I have customers using DAX over pmem on
ARMv7, but this regression will likely prevent them from upgrading their
kernel.

The root of the issue here is the fact that DAX was never designed to
handle virtually aliasing data caches (VIVT and VIPT with aliasing data
cache).  It touches the pages through their linear mapping, which is not
consistent with the userspace mappings with virtually aliasing data
caches.

This patch series introduces cpu_dcache_is_aliasing() with the new Kconfig
option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
architectures.  The implementation of cpu_dcache_is_aliasing() is either
evaluated to a constant at compile-time or a runtime check, which is what
is needed on ARM.

With this we can basically narrow down the list of architectures which are
unsupported by DAX to those which are really affected.

Testing done so far:

- Compile allyesconfig on x86-64,
- Compile allyesconfig on x86-64, with FS_DAX=n.
- Compile allyesconfig on x86-64, with DAX=n.
- Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP
  even on x86-64, thus simulating the behavior expected on an
  architecture with data cache aliasing.


This patch (of 5):

Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n
to be consistent with the fact that CONFIG_DAX=y never returns NULL.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Link: 
https://lkml.kernel.org/r/20240212163101.19614-1-mathieu.desnoy...@efficios.com
Link: 
https://lkml.kernel.org/r/20240212163101.19614-2-mathieu.desnoy...@efficios.com
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: Dave Chinner 
Cc: Michael Sclafani 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Signed-off-by: Andrew Morton 
---

 drivers/dax/super.c |5 +
 include/linux/dax.h |6 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

--- 
a/drivers/dax/super.c~dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n
+++ a/drivers/dax/super.c
@@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive);
  * that any fault handlers or operations that might have seen
  * dax_alive(), have completed.  Any operations that start after
  * synchronize_srcu() has run will abort upon seeing !dax_alive().
+ *
+ * Note, because alloc_dax() returns an ERR_PTR() on error, callers
+ * typically store its result into a local variable in order to check
+ * the result. Therefore, care must be taken to populate the struct
+ * device dax_dev field make sure the dax_dev is not leaked.
  */
 void kill_dax(struct dax_device *dax_dev)
 {
--- 
a/include/linux/dax.h~dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n
+++ a/include/linux/dax.h
@@ -86,11 +86,7 @@ static inline void *dax_holder(struct da
 static inline struct dax_device *alloc_dax(void *private,
const struct dax_operations *ops)
 {
-   /*
-* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
-* NULL is an error or expected.
-*/
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 static inline void put_dax(struct dax_device 

Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-12 Thread Linus Torvalds
On Mon, 12 Feb 2024 at 14:04, Dan Williams  wrote:
>
> This works because the internals of virtio_fs_cleanup_dax(), "kill_dax()
> and put_dax()", know how to handle a NULL @dax_dev. It is still early
> days with the "cleanup" helpers, but I wonder if anyone else cares that
> the DEFINE_FREE() above does not check for NULL?

Well, the main reason for DEFINE_FREE() to check for NULL is not
correctness, but code generation. See the comment about kfree() in
:

 * NOTE: the DEFINE_FREE()'s @free expression includes a NULL test even though
 * kfree() is fine to be called with a NULL value. This is on purpose. This way
 * the compiler sees the end of our alloc_obj() function as [...]

with the full explanation there.

Now, whether the code wants to actually use the cleanup() helpers for
a single use-case is debatable.

But yes, if it does, I suspect it should use !IS_ERR_OR_NULL(ptr).

Linus



RE: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-12 Thread Dan Williams
Mathieu Desnoyers wrote:
> In preparation for checking whether the architecture has data cache
> aliasing within alloc_dax(), modify the error handling of virtio
> virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
> non-fatal.
> 
> Co-developed-by: Dan Williams 
> Signed-off-by: Dan Williams 
> Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
> caches")
> Signed-off-by: Mathieu Desnoyers 
> Cc: Alasdair Kergon 
> Cc: Mike Snitzer 
> Cc: Mikulas Patocka 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-...@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvd...@lists.linux.dev
> ---
>  fs/fuse/virtio_fs.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 5f1be1da92ce..f9acd9972af2 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "fuse_i.h"
>  
> @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
>   put_dax(dax_dev);
>  }
>  
> +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR(_T)) 
> virtio_fs_cleanup_dax(_T))
> +
>  static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs 
> *fs)
>  {
> + struct dax_device *dax_dev __free(cleanup_dax) = ERR_PTR(-EOPNOTSUPP);
>   struct virtio_shm_region cache_reg;
>   struct dev_pagemap *pgmap;
>   bool have_cache;
> @@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device 
> *vdev, struct virtio_fs *fs)
>   if (!IS_ENABLED(CONFIG_FUSE_DAX))
>   return 0;
>  
> + dax_dev = alloc_dax(fs, _fs_dax_ops);
> + if (IS_ERR(dax_dev)) {
> + int rc = PTR_ERR(dax_dev);
> + return rc == -EOPNOTSUPP ? 0 : rc;
> + }
> +
>   /* Get cache region */
>   have_cache = virtio_get_shm_region(vdev, _reg,
>  (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
> @@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device 
> *vdev, struct virtio_fs *fs)
>   dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 
> 0x%llx\n",
>   __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
>  
> - fs->dax_dev = alloc_dax(fs, _fs_dax_ops);
> - if (IS_ERR(fs->dax_dev))
> - return PTR_ERR(fs->dax_dev);
> -
> + fs->dax_dev = no_free_ptr(dax_dev);

This works because the internals of virtio_fs_cleanup_dax(), "kill_dax()
and put_dax()", know how to handle a NULL @dax_dev. It is still early
days with the "cleanup" helpers, but I wonder if anyone else cares that
the DEFINE_FREE() above does not check for NULL?

I think it is ok, but wanted to point that out for the virtiofs folks
and wonder what the style should be for things like this going forward.

Other growing pains with "cleanup.h" and ERR_PTR is happening over here:

http://lore.kernel.org/r/65ca861e14779_5a7f29...@dwillia2-xfh.jf.intel.com.notmuch

...and that arrived at a similar style as is being used in this patch.
In both cases the cleanup function is called on exit with a NULL
argument.



Re: [PATCH] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-12 Thread fan
On Mon, Feb 12, 2024 at 11:27:22AM -0500, Mathieu Desnoyers wrote:
> Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
> before setting pmem->dax_dev, which therefore issues the two following
> calls on NULL pointers:
> 
> out_cleanup_dax:
> kill_dax(pmem->dax_dev);
> put_dax(pmem->dax_dev);
> 
> Signed-off-by: Mathieu Desnoyers 
> Reviewed-by: Dan Williams 
> Cc: Alasdair Kergon 
> Cc: Mike Snitzer 
> Cc: Mikulas Patocka 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-...@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvd...@lists.linux.dev

Reviewed-by: Fan Ni 

> ---
>  drivers/nvdimm/pmem.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 4e8fdcb3f1c8..9fe358090720 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
>   set_dax_nomc(dax_dev);
>   if (is_nvdimm_sync(nd_region))
>   set_dax_synchronous(dax_dev);
> + pmem->dax_dev = dax_dev;
>   rc = dax_add_host(dax_dev, disk);
>   if (rc)
>   goto out_cleanup_dax;
>   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
> - pmem->dax_dev = dax_dev;
> -
>   rc = device_add_disk(dev, disk, pmem_attribute_groups);
>   if (rc)
>   goto out_remove_host;
> -- 
> 2.39.2
> 



Re: linux-next: Tree for Feb 12 (drivers/md/dm-vdo/thread-registry.c on arch/loongarch/)

2024-02-12 Thread Stephen Rothwell
Hi all,

On Mon, 12 Feb 2024 12:33:08 -0800 Randy Dunlap  wrote:
>
> on loongarch:
> 
> ../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_register_thread':
> ../drivers/md/dm-vdo/thread-registry.c:32:28: error: 'current' undeclared 
> (first use in this function)
>32 | new_thread->task = current;
>   |^~~
> ../drivers/md/dm-vdo/thread-registry.c:32:28: note: each undeclared 
> identifier is reported only once for each function it appears in
> ../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_unregister_thread':
> ../drivers/md/dm-vdo/thread-registry.c:61:37: error: 'current' undeclared 
> (first use in this function)
>61 | if (thread->task == current) {
>   | ^~~
> ../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_lookup_thread':
> ../drivers/md/dm-vdo/thread-registry.c:84:37: error: 'current' undeclared 
> (first use in this function)
>84 | if (thread->task == current) {
>   | ^~~

Caused by commit

  bf28b754d024 ("dm vdo: add thread and synchronization utilities")

from the device-mapper tree.

-- 
Cheers,
Stephen Rothwell


pgp7VS_uhBgoH.pgp
Description: OpenPGP digital signature


Re: linux-next: Tree for Feb 12 (drivers/md/dm-vdo/thread-registry.c on arch/loongarch/)

2024-02-12 Thread Randy Dunlap



On 2/11/24 20:31, Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20240209:
> 
> The mm tree lost its boot failure.

on loongarch:

../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_register_thread':
../drivers/md/dm-vdo/thread-registry.c:32:28: error: 'current' undeclared 
(first use in this function)
   32 | new_thread->task = current;
  |^~~
../drivers/md/dm-vdo/thread-registry.c:32:28: note: each undeclared identifier 
is reported only once for each function it appears in
../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_unregister_thread':
../drivers/md/dm-vdo/thread-registry.c:61:37: error: 'current' undeclared 
(first use in this function)
   61 | if (thread->task == current) {
  | ^~~
../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_lookup_thread':
../drivers/md/dm-vdo/thread-registry.c:84:37: error: 'current' undeclared 
(first use in this function)
   84 | if (thread->task == current) {
  | ^~~


-- 
#Randy



Re: [PATCH 25/26] block: Reduce zone write plugging memory usage

2024-02-12 Thread Bart Van Assche

On 2/11/24 17:09, Damien Le Moal wrote:

On 2/11/24 12:40, Bart Van Assche wrote:

For the use cases I'm interested in a hash table implementation that supports
RCU-lookups probably will work better than an xarray. I think that the hash
table implementation in  supports RCU for lookups, insertion
and removal.


It does, but the API for it is not the easiest, and I do not see how that could
be faster than an xarray, especially as the number of zones grows with high
capacity devices (read: potentially more collisions which will slow zone plug
lookups).


From the xarray documentation: "The XArray implementation is efficient when the
indices used are densely clustered". I think we are dealing with a sparse array
and hence that an xarray may not be the best suited data structure. How about
using a hash table and making the hash table larger if the number of open zones
equals the hash table size? That is possible as follows:
* Instead of using DEFINE_HASHTABLE() or DECLARE_HASHTABLE(), allocate the hash
  table dynamically and use the  struct hlist_head __rcu * data type.
* Use rcu_assign_pointer() to modify that pointer and kfree_rcu() to free old
  versions of the hash table.
* Use rcu_dereference_protected() for hash table lookups.

For an example, see also the output of the following command:
$ git grep -nHw state_bydst

Thanks,

Bart.



Re: [PATCH 25/26] block: Reduce zone write plugging memory usage

2024-02-12 Thread Bart Van Assche

On 2/12/24 00:47, Damien Le Moal wrote:

Replying to myself as I had an idea:
1) Store the zone capacity in a separate array: 4B * nr_zones needed. Storing
"0" as a value for a zone in that array would indicate that the zone is
conventional. No additional zone bitmap needed.
2) Use a sparse xarray for managing allocated zone write plugs: 64B per
allocated zone write plug needed, which for an SMR drive would generally be at
most 128 * 64B = 8K.

So for an SMR drive with 100,000 zones, that would be a total of 408 KB, instead
of the current 1.6 MB. Will try to prototype this to see how performance goes (I
am worried about the xarray lookup overhead in the hot path).


Hi Damien,

Are there any zoned devices where the sequential write required zones occur 
before
the conventional zones? If not, does this mean that the conventional zones 
always
occur before the write pointer zones and also that storing the number of 
conventional
zones is sufficient?

Are there zoned storage devices where each zone has a different capacity? I have
not yet encountered any such device. I'm wondering whether a single capacity
variable would be sufficient for the entire device.

Thank you,

Bart.





Re: [bug report] dm vdo: add statistics reporting

2024-02-12 Thread Matthew Sakai

Hi Dan,

Thanks for this report (and the others). It looks like we may have 
changed our minds about how to do this formatting at some point but 
never cleaned it up. I'll put this on the list of things to improve.


Thanks,
matt

On 2/9/24 08:05, Dan Carpenter wrote:

Hello Matthew Sakai,

The patch 25479d97b12c: "dm vdo: add statistics reporting" from Nov
16, 2023 (linux-next), leads to the following (unpublished) Smatch
static checker warning:

drivers/md/dm-vdo/funnel-workqueue.c:545 vdo_dump_completion_to_buffer() warn: 
why check scnprintf return value?
drivers/md/dm-vdo/message-stats.c:26 write_u64() warn: why check scnprintf 
return value?
drivers/md/dm-vdo/message-stats.c:43 write_u32() warn: why check scnprintf 
return value?
drivers/md/dm-vdo/message-stats.c:60 write_block_count_t() warn: why check 
scnprintf return value?
drivers/md/dm-vdo/message-stats.c:77 write_string() warn: why check scnprintf 
return value?
drivers/md/dm-vdo/message-stats.c:94 write_bool() warn: why check scnprintf 
return value?
drivers/md/dm-vdo/message-stats.c:111 write_u8() warn: why check scnprintf 
return value?

drivers/md/dm-vdo/message-stats.c
 99 static int write_u8(char *prefix,
 100 u8 value,
 101 char *suffix,
 102 char **buf,
 103 unsigned int *maxlen)
 104 {
 105 int count = scnprintf(*buf, *maxlen, "%s%u%s",
 106   prefix == NULL ? "" : prefix,
 107   value,
 108   suffix == NULL ? "" : suffix);
 109 *buf += count;
 110 *maxlen -= count;
--> 111 if (count >= *maxlen)
 
snprintf() returns the number of bytes that would have been printed if
we had enough space and scnprintf() returns the number of bytes that
were printed.  (Both functions exclude the NUL terminator from the
returned count).

It seldom makes sense to check the return from scnprintf().  In this
case count can never be >= *maxlen.  The most it can be is *maxlen - 1.

 112 return VDO_UNEXPECTED_EOF;
 113 return VDO_SUCCESS;
 114 }

The code in vdo_dump_completion_to_buffer() is almost basically correct.
It's just off by one in a totally harmless way and I wouldn't have
reported it if not for these other impossible conditions.

drivers/md/dm-vdo/funnel-workqueue.c
538  void vdo_dump_completion_to_buffer(struct vdo_completion *completion, 
char *buffer,
539 size_t length)
540  {
541  size_t current_length =
542  scnprintf(buffer, length, "%.*s/", TASK_COMM_LEN,
543(completion->my_queue == NULL ? "-" : 
completion->my_queue->name));
544
545  if (current_length < length - 1) {

The highest valid return is current_length == length - 1 but that's
treated as invalid.  If we changed to snprintf() then we could fix the
condition to allow the final character:

current_length = snprintf(...);
if (current_length <= length - 1) {
   

546  get_function_name((void *) completion->callback, 
buffer + current_length,
547length - current_length);
548  }
549  }

regards,
dan carpenter






Re: [PATCH] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-12 Thread Dave Jiang



On 2/12/24 9:27 AM, Mathieu Desnoyers wrote:
> Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
> before setting pmem->dax_dev, which therefore issues the two following
> calls on NULL pointers:
> 
> out_cleanup_dax:
> kill_dax(pmem->dax_dev);
> put_dax(pmem->dax_dev);
> 
> Signed-off-by: Mathieu Desnoyers 
> Reviewed-by: Dan Williams 
> Cc: Alasdair Kergon 
> Cc: Mike Snitzer 
> Cc: Mikulas Patocka 
> Cc: Andrew Morton 
> Cc: Linus Torvalds 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Matthew Wilcox 
> Cc: Arnd Bergmann 
> Cc: Russell King 
> Cc: linux-a...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux...@kvack.org
> Cc: linux-...@vger.kernel.org
> Cc: dm-devel@lists.linux.dev
> Cc: nvd...@lists.linux.dev

Reviewed-by: Dave Jiang 
> ---
>  drivers/nvdimm/pmem.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 4e8fdcb3f1c8..9fe358090720 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
>   set_dax_nomc(dax_dev);
>   if (is_nvdimm_sync(nd_region))
>   set_dax_synchronous(dax_dev);
> + pmem->dax_dev = dax_dev;
>   rc = dax_add_host(dax_dev, disk);
>   if (rc)
>   goto out_cleanup_dax;
>   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
> - pmem->dax_dev = dax_dev;
> -
>   rc = device_add_disk(dev, disk, pmem_attribute_groups);
>   if (rc)
>   goto out_remove_host;



[PATCH v5 8/8] dax: Fix incorrect list of data cache aliasing architectures

2024-02-12 Thread Mathieu Desnoyers
commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
prevents DAX from building on architectures with virtually aliased
dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 don't have virtually aliased
dcaches), and also misses many other architectures with virtually
aliased data cache.

This is a regression introduced in the v4.0 Linux kernel where the
dax mount option is removed for 32-bit ARMv7 boards which have no data
cache aliasing, and therefore should work fine with FS_DAX.

This was turned into the following check in alloc_dax() by a preparatory
change:

if (ops && (IS_ENABLED(CONFIG_ARM) ||
IS_ENABLED(CONFIG_MIPS) ||
IS_ENABLED(CONFIG_SPARC)))
return NULL;

Use cpu_dcache_is_aliasing() instead to figure out whether the environment
has aliasing data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ce5bffa86bba..a21a7c262382 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "dax-private.h"
 
 /**
@@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
 * except for device-dax (NULL operations pointer), which does
 * not use aliased mappings from the kernel.
 */
-   if (ops && (IS_ENABLED(CONFIG_ARM) ||
-   IS_ENABLED(CONFIG_MIPS) ||
-   IS_ENABLED(CONFIG_SPARC)))
+   if (ops && cpu_dcache_is_aliasing())
return ERR_PTR(-EOPNOTSUPP);
 
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
-- 
2.39.2




[PATCH v5 7/8] Introduce cpu_dcache_is_aliasing() across all architectures

2024-02-12 Thread Mathieu Desnoyers
Introduce a generic way to query whether the data cache is virtually
aliased on all architectures. Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g. FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios dependending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
   state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
 commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to 
PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".

Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 include/linux/cacheinfo.h   |  6 ++
 mm/Kconfig  |  6 ++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..7d294a3242a4 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@
 config ARC
def_bool y
select ARC_TIMERS
+   select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index ..05fc7ed59712
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include 
+
+#define cpu_dcache_is_aliasing()   true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..cd13b1788973 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ config ARM
select ARCH_32BIT_OFF_T
select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && 
FRAME_POINTER && !ARM_UNWIND
  

[PATCH v5 6/8] dax: Check for data cache aliasing at runtime

2024-02-12 Thread Mathieu Desnoyers
Replace the following fs/Kconfig:FS_DAX dependency:

  depends on !(ARM || MIPS || SPARC)

By a runtime check within alloc_dax(). This runtime check returns
ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means
the kernel is using an aliased mapping) on an architecture which
has data cache aliasing.

Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for
CONFIG_DAX=n for consistency.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 10 ++
 fs/Kconfig  |  1 -
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 205b888d45bf..ce5bffa86bba 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -450,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
dev_t devt;
int minor;
 
+   /*
+* Unavailable on architectures with virtually aliased data caches,
+* except for device-dax (NULL operations pointer), which does
+* not use aliased mappings from the kernel.
+*/
+   if (ops && (IS_ENABLED(CONFIG_ARM) ||
+   IS_ENABLED(CONFIG_MIPS) ||
+   IS_ENABLED(CONFIG_SPARC)))
+   return ERR_PTR(-EOPNOTSUPP);
+
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
return ERR_PTR(-EINVAL);
 
diff --git a/fs/Kconfig b/fs/Kconfig
index 42837617a55b..e5efdb3b276b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -56,7 +56,6 @@ endif # BLOCK
 config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
-   depends on !(ARM || MIPS || SPARC)
depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
-- 
2.39.2




[PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-12 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of virtio
virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
non-fatal.

Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 fs/fuse/virtio_fs.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce..f9acd9972af2 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "fuse_i.h"
 
@@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
put_dax(dax_dev);
 }
 
+DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR(_T)) 
virtio_fs_cleanup_dax(_T))
+
 static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs 
*fs)
 {
+   struct dax_device *dax_dev __free(cleanup_dax) = ERR_PTR(-EOPNOTSUPP);
struct virtio_shm_region cache_reg;
struct dev_pagemap *pgmap;
bool have_cache;
@@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
if (!IS_ENABLED(CONFIG_FUSE_DAX))
return 0;
 
+   dax_dev = alloc_dax(fs, _fs_dax_ops);
+   if (IS_ERR(dax_dev)) {
+   int rc = PTR_ERR(dax_dev);
+   return rc == -EOPNOTSUPP ? 0 : rc;
+   }
+
/* Get cache region */
have_cache = virtio_get_shm_region(vdev, _reg,
   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
@@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 
0x%llx\n",
__func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
 
-   fs->dax_dev = alloc_dax(fs, _fs_dax_ops);
-   if (IS_ERR(fs->dax_dev))
-   return PTR_ERR(fs->dax_dev);
-
+   fs->dax_dev = no_free_ptr(dax_dev);
return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax,
fs->dax_dev);
 }
-- 
2.39.2




[PATCH v5 4/8] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure

2024-02-12 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dcssblk
dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures.

Considering that s390 is not a data cache aliasing architecture,
and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP
from alloc_dax() should make dcssblk_add_store() fail.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 drivers/s390/block/dcssblk.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 4b7ecd4fd431..f363c1d51d9a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+   struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
if (rc)
goto put_dev;
 
-   dev_info->dax_dev = alloc_dax(dev_info, _dax_ops);
-   if (IS_ERR(dev_info->dax_dev)) {
-   rc = PTR_ERR(dev_info->dax_dev);
-   dev_info->dax_dev = NULL;
+   dax_dev = alloc_dax(dev_info, _dax_ops);
+   if (IS_ERR(dax_dev)) {
+   rc = PTR_ERR(dax_dev);
goto put_dev;
}
-   set_dax_synchronous(dev_info->dax_dev);
+   set_dax_synchronous(dax_dev);
+   dev_info->dax_dev = dax_dev;
rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
if (rc)
goto out_dax;
-- 
2.39.2




[PATCH v5 3/8] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-12 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Suggested-by: Dan Williams 
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/md/dm.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 23c32cd1f1d8..acdc00bc05be 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device 
*md)
 static struct mapped_device *alloc_dev(int minor)
 {
int r, numa_node_id = dm_get_numa_node();
+   struct dax_device *dax_dev;
struct mapped_device *md;
void *old_md;
 
@@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
 
-   if (IS_ENABLED(CONFIG_FS_DAX)) {
-   md->dax_dev = alloc_dax(md, _dax_ops);
-   if (IS_ERR(md->dax_dev)) {
-   md->dax_dev = NULL;
+   dax_dev = alloc_dax(md, _dax_ops);
+   if (IS_ERR(dax_dev)) {
+   if (PTR_ERR(dax_dev) != -EOPNOTSUPP)
goto bad;
-   }
-   set_dax_nocache(md->dax_dev);
-   set_dax_nomc(md->dax_dev);
-   if (dax_add_host(md->dax_dev, md->disk))
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   md->dax_dev = dax_dev;
+   if (dax_add_host(dax_dev, md->disk))
goto bad;
}
 
-- 
2.39.2




[PATCH v5 2/8] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-12 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

[ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9fe358090720..e9898457a7bd 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
dax_dev = alloc_dax(pmem, _dax_ops);
if (IS_ERR(dax_dev)) {
rc = PTR_ERR(dax_dev);
-   goto out;
+   if (rc != -EOPNOTSUPP)
+   goto out;
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   if (is_nvdimm_sync(nd_region))
+   set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
+   rc = dax_add_host(dax_dev, disk);
+   if (rc)
+   goto out_cleanup_dax;
+   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
}
-   set_dax_nocache(dax_dev);
-   set_dax_nomc(dax_dev);
-   if (is_nvdimm_sync(nd_region))
-   set_dax_synchronous(dax_dev);
-   pmem->dax_dev = dax_dev;
-   rc = dax_add_host(dax_dev, disk);
-   if (rc)
-   goto out_cleanup_dax;
-   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2




[PATCH v5 0/8] Introduce cpu_dcache_is_aliasing() to fix DAX regression

2024-02-12 Thread Mathieu Desnoyers
This commit introduced in v4.0 prevents building FS_DAX on 32-bit ARM,
even on ARMv7 which does not have virtually aliased data caches:

commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")

It used to work fine before: I have customers using DAX over pmem on
ARMv7, but this regression will likely prevent them from upgrading their
kernel.

The root of the issue here is the fact that DAX was never designed to
handle virtually aliasing data caches (VIVT and VIPT with aliasing data
cache). It touches the pages through their linear mapping, which is not
consistent with the userspace mappings with virtually aliasing data
caches.

This patch series introduces cpu_dcache_is_aliasing() with the new
Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
architectures. The implementation of cpu_dcache_is_aliasing() is either
evaluated to a constant at compile-time or a runtime check, which is
what is needed on ARM.

With this we can basically narrow down the list of architectures which
are unsupported by DAX to those which are really affected.

Testing done so far:

- Compile allyesconfig on x86-64,
- Compile allyesconfig on x86-64, with FS_DAX=n.
- Compile allyesconfig on x86-64, with DAX=n.
- Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP
  even on x86-64, thus simulating the behavior expected on an
  architecture with data cache aliasing.

There are many more axes to test however. I would welcome Tested-by for:

- affected architectures,
- affected drivers,
- affected filesytems.

[ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]

Thanks,

Mathieu

Changes since v4:
- Move the change which makes alloc_dax() return ERR_PTR(-EOPNOTSUPP)
  when CONFIG_DAX=n earlier in the series,
- Fold driver cleanup patches into their respective per-driver changes.
- Move "nvdimm/pmem: Fix leak on dax_add_host() failure" outside of this
  series.

Changes since v3:
- Fix a leak on dax_add_host() failure in nvdimm/pmem.
- Split the series into a bissectable sequence of changes.
- Ensure that device-dax use-cases still works on data cache
  aliasing architectures.

Changes since v2:
- Move DAX supported runtime check to alloc_dax(),
- Modify DM to handle alloc_dax() error as non-fatal,
- Remove all filesystem modifications, since the check is now done by
  alloc_dax(),
- rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to
  eliminate confusion with VFS terminology.

Changes since v1:
- The order of the series was completely changed based on the
  feedback received on v1,
- cache_is_aliasing() is renamed to dcache_is_aliasing(),
- ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone,
- dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is
  simplified,
- the dax_is_supported() check was moved to its rightful place in all
  filesystems.

Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org

Mathieu Desnoyers (8):
  dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n
  nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
  virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dax: Check for data cache aliasing at runtime
  Introduce cpu_dcache_is_aliasing() across all architectures
  dax: Fix incorrect list of data cache aliasing architectures

 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 drivers/dax/super.c | 14 ++
 drivers/md/dm.c | 17 +
 drivers/nvdimm/pmem.c   | 22 --
 drivers/s390/block/dcssblk.c| 11 ++-
 fs/Kconfig  |  1 -
 fs/fuse/virtio_fs.c | 15 +++
 

[PATCH v5 1/8] dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n

2024-02-12 Thread Mathieu Desnoyers
Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for
CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y
never returns NULL.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 5 +
 include/linux/dax.h | 6 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..205b888d45bf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive);
  * that any fault handlers or operations that might have seen
  * dax_alive(), have completed.  Any operations that start after
  * synchronize_srcu() has run will abort upon seeing !dax_alive().
+ *
+ * Note, because alloc_dax() returns an ERR_PTR() on error, callers
+ * typically store its result into a local variable in order to check
+ * the result. Therefore, care must be taken to populate the struct
+ * device dax_dev field make sure the dax_dev is not leaked.
  */
 void kill_dax(struct dax_device *dax_dev)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..df2d52b8a245 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
 static inline struct dax_device *alloc_dax(void *private,
const struct dax_operations *ops)
 {
-   /*
-* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
-* NULL is an error or expected.
-*/
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 static inline void put_dax(struct dax_device *dax_dev)
 {
-- 
2.39.2




[PATCH] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-12 Thread Mathieu Desnoyers
Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:

out_cleanup_dax:
kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);

Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..9fe358090720 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
set_dax_nomc(dax_dev);
if (is_nvdimm_sync(nd_region))
set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
rc = dax_add_host(dax_dev, disk);
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
-   pmem->dax_dev = dax_dev;
-
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2




Re: [PATCH v3 0/5] block: remove gfp_mask for blkdev_zone_mgmt()

2024-02-12 Thread Jens Axboe


On Sun, 28 Jan 2024 23:52:15 -0800, Johannes Thumshirn wrote:
> Fueled by the LSFMM discussion on removing GFP_NOFS initiated by Willy,
> I've looked into the sole GFP_NOFS allocation in zonefs. As it turned out,
> it is only done for zone management commands and can be removed.
> 
> After digging into more callers of blkdev_zone_mgmt() I came to the
> conclusion that the gfp_mask parameter can be removed alltogether.
> 
> [...]

Applied, thanks!

[1/5] zonefs: pass GFP_KERNEL to blkdev_zone_mgmt() call
  commit: 9105ce591b424771b1502ef9836ca7953c3e0af4
[2/5] dm: dm-zoned: guard blkdev_zone_mgmt with noio scope
  commit: 218082010aceb40b5495ebc30028ede6e30ee755
[3/5] btrfs: zoned: call blkdev_zone_mgmt in nofs scope
  commit: d9d556755f16f6af8d1d8ebac38b83a9263394c5
[4/5] f2fs: guard blkdev_zone_mgmt with nofs scope
  commit: 147ec1c60e3273d21ea1f212c6636f231d6d2771
[5/5] block: remove gfp_flags from blkdev_zone_mgmt
  commit: 71f4ecdbb42addf82b01b734b122a02707fed521

Best regards,
-- 
Jens Axboe






Re: About DM_UDEV_DISABLE_OTHER_RULES_FLAG and DM_NOSCAN

2024-02-12 Thread Martin Wilck
On Mon, 2024-02-12 at 13:32 +0100, Peter Rajnoha wrote:
> On 2/12/24 12:09, Martin Wilck wrote:
> 
> 
> Right, underneath within DM/DM-subystem, we should be able to keep
> and
> restore those reasons for why it has been flagged with
> DM_UDEV_DISABLE_OTHER_RULES_FLAG till now and when the state is good
> enough that we can drop it, we would do it transparently for higher
> (non-dm and non-dm-subsystem) layers. So if there's a case that is
> not
> currently handled by 10-dm.rules or 11-dm-.rules, we can
> fix
> that there. If it's a generic rule that applies to all DM, not just
> subystem, then yes, we can move that to 10-dm.rules (will have a look
> at
> your patch [1]).

I don't think that patch can be used as-is for DM. For multipath, I'm
not aware of any situation where DM_UDEV_DISABLE_OTHER_RULES_FLAG would
be set in the udev cookie, therefore DM_UDEV_DISABLE_OTHER_RULES_FLAG
is essentially an alias for DM_SUSPENDED before 11-dm-mpath.rules
changes it. That's not the case for other targets, in particular LVM.

> > 
> > Well, IIUC the main reason that systemd couldn't use
> > DM_UDEV_DISABLE_OTHER_RULES_FLAG was at least in part due to the
> > fact
> > that the multipath rules used it in a special way that was
> > inconsistent
> > with the rest of DM ;-)
> 
> 
> Aha! Well, honestly, I don't remember the exact details and context
> of
> that fix, but I know we haven't found a better way...
> 
> > 
> > I think there are 3 variants of "unusable":
> > 
> > a) temporarily unusable (just for this event), ignore this uevent
> > and
> > restore previous properties from db.
> > b) unusable, avoid IO, don't scan, don't activate (this is how we
> > use
> > DM_UDEV_DISABLE_OTHER_RULES_FLAG). Upper layers will usually load
> > saved
> > properties from udev db in this case, too.
> > c) like b), but also try to unmount / unconfigure if already used.
> > This
> > is SYSTEMD_READY=0. I don't think DM has a flag with these
> > semantics at
> > this time. I can imagine such a flag being set if a device was
> > reloaded
> > with an incompatible table, but that's rather a corner case.
> > 
> > It's an honorable goal to condense everything into a single
> > variable
> > for consumer rules, but I think it doesn't work if we want the
> > upper
> > layers to be able to distinguish these. We can merge a) and b) I
> > think,
> > because their meaning for upper layers is practically the same, if
> > we
> > get the saving and restoring right.
> > 
> 
> The c) case - well, it's questionable what should be done in that
> case,
> because that means we have literally cut off the underlying device
> while
> it was still in use. Any further IO from higher layers will return IO
> errors, will queue IOs or, in the worse case, issue IOs to a device
> that
> we don't want to anymore.
> 
> Anyway, if I understand correctly, we simply need to signal higher
> layers either:
> 
>   A) device is unusable, forget it and clear all your current extra
> records you have about this device (including removing any custom
> symlinks for udev). That would also map to SYSTEMD_READY=0.
> 
>   B) device is unusable temporarily, restore any records you need,
> wait
> for the DM_UDEV_DISABLE_OTHER_RULES_FLAG=1 to drop to 0 (or being
> unset).
> 
> What do you think about keeping a single
> DM_UDEV_DISABLE_OTHER_RULES_FLAG for this, just having a different
> value, say "2" to denote the B case? Otherwise, we need 2 distinct
> variables (which is harder for others to accept I bet).

Yes, that could work, if the save / restore is implemented cleanly.


> 
> > > The DM_NOSCAN was actually just a helper and a more human
> > > readable
> > > name
> > > for "DM_SUSBYSTEM_UDEV_FLAG0" within LVM subsystem *only* at
> > > first.
> > > It is used during LV initialization - the wiping/zeroing of the
> > > LV
> > > before it is pronounced as usable - that's why, when it is set,
> > > we
> > > signal to "others" the DM_UDEV_DISABLE_OTHER_RULES_FLAG based on
> > > this
> > > flag. However, since we have the 13-dm-disk.rules which manages
> > > the
> > > blkid call for DM devices (and which is ours - owned by DM), we
> > > need
> > > to
> > > signal these rules to avoid calling blkid (as it could see
> > > uninitiliazed/stale data). In this case, we import any previous
> > > scan
> > > results from udev db. It's not like "completely ignore and skip
> > > rules
> > > for this event".
> > 
> > I'm not sure if I understand what you mean with "completely ignore
> > and
> > skip". Upper-level rules usually can't "completely ignore" an
> > uevent if
> > they need to preserve any udev properties across the current event.
> > If
> > the device is unusable in the weaker "don't try IO" sense, the
> > upper
> > rules must import existing properties from the db, just like 13-dm-
> > disk.rules, lest the properties be forgotten by udev. IMO this
> > weaker
> > semantics (corresponding to b) above) is what matters most for
> > upper
> > level rules.
> 
> I didn't consider that there 

Re: About DM_UDEV_DISABLE_OTHER_RULES_FLAG and DM_NOSCAN

2024-02-12 Thread Peter Rajnoha
On 2/12/24 12:09, Martin Wilck wrote:
> On Mon, 2024-02-12 at 10:51 +0100, Peter Rajnoha wrote:
>> On 2/9/24 22:58, Martin Wilck wrote:
>>> Hi Zdenek, Peter, David, Ben,
>>>
>>> I have been pondering the device-mapper udev rules a lot lately,
>>> and I
>>> believe I have found glitches in the logic of the device mapper
>>> udev
>>> flags that I'd like to bring to your attention.
>>>
>>> TL;DR: change I propose:
>>>
>>> * use DM_DISABLE_OTHER_RULES_FLAG consistently as a flag meaning
>>>   "upper layers should leave this device alone until told
>>> otherwise",
>>> saved between uevents in the udev db
>>> * use DM_SUSPENDED consistently as a flag meaning "upper layers
>>>   should keep their hands off this device temporarily", not saved
>>>   in the udev db.
>>> * don't use any other flags in upper layers, including 13-dm-
>>> disk.rules.
>>>
>>> This implies:
>>>
>>> * stop setting DM_DISABLE_OTHER_RULES_FLAG from DM_SUSPENDED
>>>   in 10-dm.rules
>>> * check DM_DISABLE_OTHER_RULES_FLAG instead of DM_NOSCAN in
>>>   13-dm-disk.rules
>>> * check DM_SUSPENDED in 69-dm-lvm.rules, 80-udisks2.rules
>>> * stop using DM_NOSCAN in 11-dm-mpath.rules and 66-kpartx.rules
>>
>> Hi Martin!
>>
>> It is surely possible we could do some improvements and optimizations
>> here - all the rules are a result of long evolution where we were
>> fixing
>> various issues on the way. So if we could shoot down some
>> complexities,
>> that would be great...
>>
>> On the other hand, we need to be really careful, because even a tiny
>> change here can cause troubles if we omit some case.
>>
>> I'll surely go through your suggestions, thanks for diving deep into
>> that! Just please give me some time so I'll remap all the paths
>> through
>> the rules in my head :)
>>
>> Note that we're working with 3 levels of logical rule separation
>> here:
>>
>>   1) DM base (10-dm.rules, 13-dm-disk.rules, 95-dm-notify.rules)
>>   2) DM subsystem (11-dm-lvm.rules, the mpath rules, ...)
>>   3) all the "other"
>>
>> As for the very original intentions of the
>> DM_UDEV_DISABLE_OTHER_RULES_FLAG - this one was supposed to be the
>> *only
>> one* to check for in "other" rules so they don't need to bother about
>> checking other states/variables ("other" here means other than DM and
>> any DM subsystem). One simple variable to check for others - an ideal
>> -
>> otherwise, we would be having a hard time to persuade others why they
>> need to add complex checks in their rules/applications.
> 
> Yes, I understand that, and it makes a lot of sense. 
> 
> The problem arises by saving and restoring to/from the udev db. We've
> got different possible *reasons* (inputs) for a device becoming
> unusable. The value we communicate to upper layers should arguably be a
> "logical or" of all these. But then we can't use a single variable to
> save and restore the state; we must determine the value of all "reason"
> flags (either directly or by restoring from the db, as appropriate for
> the flag in question) and "or"-them into a single "dontuse" flag.
> 
> Example: Currently, we set DM_UDEV_DISABLE_OTHER_RULES_FLAG if
> DM_SUSPENDED is set. If the next spurious uevent arrives, we see
> DM_UDEV_DISABLE_OTHER_RULES_FLAG set, but we don't know if the origin
> was DM_SUSPENDED or a genuine udevflag. In the former case, the device
> would now be usable, while in the latter case it most probably
> wouldn't. For now, I've posted a patch to fix this for the multipath
> rules [1], but I'd like to see it fixed in the general DM rules.
> 

Right, underneath within DM/DM-subystem, we should be able to keep and
restore those reasons for why it has been flagged with
DM_UDEV_DISABLE_OTHER_RULES_FLAG till now and when the state is good
enough that we can drop it, we would do it transparently for higher
(non-dm and non-dm-subsystem) layers. So if there's a case that is not
currently handled by 10-dm.rules or 11-dm-.rules, we can fix
that there. If it's a generic rule that applies to all DM, not just
subystem, then yes, we can move that to 10-dm.rules (will have a look at
your patch [1]).

>> Yes, there's the unlucky exception in the 99-systemd.rules which
>> needs
>> to check the DM_SUSPENDED to import the "SYSTEMD_READY" from udev db,
>> otherwise we had been running into an issue with stopping systemd
>> device
>> units prematurely (actually, this was an issue spotted much much
>> later
>> on after years of using the rules without this rule). 
>> And we haven't
>> found a better way to check for this condition. This is mainly
>> because
>> systemd is also a "device" manager/tracker in some way through its
>> "device" units, besides udev itself. I simply gave up there and
>> admitted
>> the check for DM_SUSPENDED case.
> 
> Well, IIUC the main reason that systemd couldn't use
> DM_UDEV_DISABLE_OTHER_RULES_FLAG was at least in part due to the fact
> that the multipath rules used it in a special way that was inconsistent
> with the rest of DM ;-)


Aha! Well, honestly, I don't 

Re: About DM_UDEV_DISABLE_OTHER_RULES_FLAG and DM_NOSCAN

2024-02-12 Thread Martin Wilck
On Mon, 2024-02-12 at 10:51 +0100, Peter Rajnoha wrote:
> On 2/9/24 22:58, Martin Wilck wrote:
> > Hi Zdenek, Peter, David, Ben,
> > 
> > I have been pondering the device-mapper udev rules a lot lately,
> > and I
> > believe I have found glitches in the logic of the device mapper
> > udev
> > flags that I'd like to bring to your attention.
> > 
> > TL;DR: change I propose:
> > 
> > * use DM_DISABLE_OTHER_RULES_FLAG consistently as a flag meaning
> >   "upper layers should leave this device alone until told
> > otherwise",
> > saved between uevents in the udev db
> > * use DM_SUSPENDED consistently as a flag meaning "upper layers
> >   should keep their hands off this device temporarily", not saved
> >   in the udev db.
> > * don't use any other flags in upper layers, including 13-dm-
> > disk.rules.
> > 
> > This implies:
> > 
> > * stop setting DM_DISABLE_OTHER_RULES_FLAG from DM_SUSPENDED
> >   in 10-dm.rules
> > * check DM_DISABLE_OTHER_RULES_FLAG instead of DM_NOSCAN in
> >   13-dm-disk.rules
> > * check DM_SUSPENDED in 69-dm-lvm.rules, 80-udisks2.rules
> > * stop using DM_NOSCAN in 11-dm-mpath.rules and 66-kpartx.rules
> 
> Hi Martin!
> 
> It is surely possible we could do some improvements and optimizations
> here - all the rules are a result of long evolution where we were
> fixing
> various issues on the way. So if we could shoot down some
> complexities,
> that would be great...
> 
> On the other hand, we need to be really careful, because even a tiny
> change here can cause troubles if we omit some case.
> 
> I'll surely go through your suggestions, thanks for diving deep into
> that! Just please give me some time so I'll remap all the paths
> through
> the rules in my head :)
> 
> Note that we're working with 3 levels of logical rule separation
> here:
> 
>   1) DM base (10-dm.rules, 13-dm-disk.rules, 95-dm-notify.rules)
>   2) DM subsystem (11-dm-lvm.rules, the mpath rules, ...)
>   3) all the "other"
> 
> As for the very original intentions of the
> DM_UDEV_DISABLE_OTHER_RULES_FLAG - this one was supposed to be the
> *only
> one* to check for in "other" rules so they don't need to bother about
> checking other states/variables ("other" here means other than DM and
> any DM subsystem). One simple variable to check for others - an ideal
> -
> otherwise, we would be having a hard time to persuade others why they
> need to add complex checks in their rules/applications.

Yes, I understand that, and it makes a lot of sense. 

The problem arises by saving and restoring to/from the udev db. We've
got different possible *reasons* (inputs) for a device becoming
unusable. The value we communicate to upper layers should arguably be a
"logical or" of all these. But then we can't use a single variable to
save and restore the state; we must determine the value of all "reason"
flags (either directly or by restoring from the db, as appropriate for
the flag in question) and "or"-them into a single "dontuse" flag.

Example: Currently, we set DM_UDEV_DISABLE_OTHER_RULES_FLAG if
DM_SUSPENDED is set. If the next spurious uevent arrives, we see
DM_UDEV_DISABLE_OTHER_RULES_FLAG set, but we don't know if the origin
was DM_SUSPENDED or a genuine udevflag. In the former case, the device
would now be usable, while in the latter case it most probably
wouldn't. For now, I've posted a patch to fix this for the multipath
rules [1], but I'd like to see it fixed in the general DM rules.

> Yes, there's the unlucky exception in the 99-systemd.rules which
> needs
> to check the DM_SUSPENDED to import the "SYSTEMD_READY" from udev db,
> otherwise we had been running into an issue with stopping systemd
> device
> units prematurely (actually, this was an issue spotted much much
> later
> on after years of using the rules without this rule). 
> And we haven't
> found a better way to check for this condition. This is mainly
> because
> systemd is also a "device" manager/tracker in some way through its
> "device" units, besides udev itself. I simply gave up there and
> admitted
> the check for DM_SUSPENDED case.

Well, IIUC the main reason that systemd couldn't use
DM_UDEV_DISABLE_OTHER_RULES_FLAG was at least in part due to the fact
that the multipath rules used it in a special way that was inconsistent
with the rest of DM ;-)

I think there are 3 variants of "unusable":

a) temporarily unusable (just for this event), ignore this uevent and
restore previous properties from db.
b) unusable, avoid IO, don't scan, don't activate (this is how we use
DM_UDEV_DISABLE_OTHER_RULES_FLAG). Upper layers will usually load saved
properties from udev db in this case, too.
c) like b), but also try to unmount / unconfigure if already used. This
is SYSTEMD_READY=0. I don't think DM has a flag with these semantics at
this time. I can imagine such a flag being set if a device was reloaded
with an incompatible table, but that's rather a corner case.

It's an honorable goal to condense everything into a single variable
for consumer rules, but I think 

[PATCH] dm vdo: fix an assert statement in start_restoring_volume_sub_index()

2024-02-12 Thread Harshit Mogalapalli
Use "==" instead of "=" in ASSERT() statement.

Fixes: ef074a31e88e ("dm vdo: implement the volume index")
Signed-off-by: Harshit Mogalapalli 
---
This is based on static analysis with Smatch and only compile tested.
---
 drivers/md/dm-vdo/indexer/volume-index.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/dm-vdo/indexer/volume-index.c 
b/drivers/md/dm-vdo/indexer/volume-index.c
index 5fe34e6c1d9b..d6526fe9bbfc 100644
--- a/drivers/md/dm-vdo/indexer/volume-index.c
+++ b/drivers/md/dm-vdo/indexer/volume-index.c
@@ -830,7 +830,7 @@ static int start_restoring_volume_sub_index(struct 
volume_sub_index *sub_index,
decode_u32_le(buffer, , _list);
decode_u32_le(buffer, , _count);
 
-   result = ASSERT(offset = sizeof(buffer),
+   result = ASSERT(offset == sizeof(buffer),
"%zu bytes decoded of %zu expected", offset,
sizeof(buffer));
if (result != UDS_SUCCESS)
-- 
2.39.3




Re: About DM_UDEV_DISABLE_OTHER_RULES_FLAG and DM_NOSCAN

2024-02-12 Thread Peter Rajnoha
On 2/9/24 22:58, Martin Wilck wrote:
> Hi Zdenek, Peter, David, Ben,
> 
> I have been pondering the device-mapper udev rules a lot lately, and I
> believe I have found glitches in the logic of the device mapper udev
> flags that I'd like to bring to your attention.
> 
> TL;DR: change I propose:
> 
> * use DM_DISABLE_OTHER_RULES_FLAG consistently as a flag meaning
>   "upper layers should leave this device alone until told otherwise",
> saved between uevents in the udev db
> * use DM_SUSPENDED consistently as a flag meaning "upper layers
>   should keep their hands off this device temporarily", not saved
>   in the udev db.
> * don't use any other flags in upper layers, including 13-dm-
> disk.rules.
> 
> This implies:
> 
> * stop setting DM_DISABLE_OTHER_RULES_FLAG from DM_SUSPENDED
>   in 10-dm.rules
> * check DM_DISABLE_OTHER_RULES_FLAG instead of DM_NOSCAN in
>   13-dm-disk.rules
> * check DM_SUSPENDED in 69-dm-lvm.rules, 80-udisks2.rules
> * stop using DM_NOSCAN in 11-dm-mpath.rules and 66-kpartx.rules

Hi Martin!

It is surely possible we could do some improvements and optimizations
here - all the rules are a result of long evolution where we were fixing
various issues on the way. So if we could shoot down some complexities,
that would be great...

On the other hand, we need to be really careful, because even a tiny
change here can cause troubles if we omit some case.

I'll surely go through your suggestions, thanks for diving deep into
that! Just please give me some time so I'll remap all the paths through
the rules in my head :)

Note that we're working with 3 levels of logical rule separation here:

  1) DM base (10-dm.rules, 13-dm-disk.rules, 95-dm-notify.rules)
  2) DM subsystem (11-dm-lvm.rules, the mpath rules, ...)
  3) all the "other"

As for the very original intentions of the
DM_UDEV_DISABLE_OTHER_RULES_FLAG - this one was supposed to be the *only
one* to check for in "other" rules so they don't need to bother about
checking other states/variables ("other" here means other than DM and
any DM subsystem). One simple variable to check for others - an ideal -
otherwise, we would be having a hard time to persuade others why they
need to add complex checks in their rules/applications.

Yes, there's the unlucky exception in the 99-systemd.rules which needs
to check the DM_SUSPENDED to import the "SYSTEMD_READY" from udev db,
otherwise we had been running into an issue with stopping systemd device
units prematurely (actually, this was an issue spotted much much later
on after years of using the rules without this rule). And we haven't
found a better way to check for this condition. This is mainly because
systemd is also a "device" manager/tracker in some way through its
"device" units, besides udev itself. I simply gave up there and admitted
the check for DM_SUSPENDED case.

The DM_NOSCAN was actually just a helper and a more human readable name
for "DM_SUSBYSTEM_UDEV_FLAG0" within LVM subsystem *only* at first.
It is used during LV initialization - the wiping/zeroing of the LV
before it is pronounced as usable - that's why, when it is set, we
signal to "others" the DM_UDEV_DISABLE_OTHER_RULES_FLAG based on this
flag. However, since we have the 13-dm-disk.rules which manages the
blkid call for DM devices (and which is ours - owned by DM), we need to
signal these rules to avoid calling blkid (as it could see
uninitiliazed/stale data). In this case, we import any previous scan
results from udev db. It's not like "completely ignore and skip rules
for this event". Other DM subsystems may get into exactly the same
situation with initialization as LVM, so that's why it ended up as
generic "DM_NOSCAN" so any 11-dm-.rules can use it to signal
13-dm-disk.rules to avoid the blkid scan this way (the "SCAN" here
actually means "blkid scan" so the better name would be
"DM_NO_BLKID_SCAN" probably). The "DM_NOSCAN" is not meant to be
consumed by "others" at all.

The other way round, we also don't want (ideally) to check for
DM_UDEV_DISABLE_OTHER_RULES flag in DM and DM subysystem rules. This
flag is meant for "others" to check. So we have a clear separation of
what is signalled withing "DM world" and "the other world".

The DM_SUSPENDED flag - well, this was supposed to be just internal to
DM and DM subystem rules. All the "others" have the
DM_UDEV_DISABLE_OTHER_RULES_FLAG - they simply don't need to bother for
what exact reason the device is not usable, whether it is a suspended
case or whatever else...

The DM udev rules are designed in a way that all the DM_UDEV_DISABLE_*
can still be changed in DM/DM-subsystem rules based on further findings
and processing of the state. So the original flags from the DM_COOKIE is
just a starting point here.

-- 
Peter




Re: [PATCH 25/26] block: Reduce zone write plugging memory usage

2024-02-12 Thread Damien Le Moal
On 2/12/24 17:23, Damien Le Moal wrote:
> On 2/11/24 12:40, Bart Van Assche wrote:
>> On 2/9/24 16:06, Damien Le Moal wrote:
>>> On 2/10/24 04:36, Bart Van Assche wrote:
 written zones is typically less than 10. Hence, tracking the partially 
 written
>>>
>>> That is far from guaranteed, especially with devices that have no active 
>>> zone
>>> limits like SMR drives.
>>
>> Interesting. The zoned devices I'm working with try to keep data in memory
>> for all zones that are neither empty nor full and hence impose an upper limit
>> on the number of open zones.
>>
>>> But in any case, what exactly is your idea here ? Can you actually suggest
>>> something ? Are you suggesting that a sparse array of zone plugs be used, 
>>> with
>>> an rb-tree or an xarray ? If that is what you are thinking, I can already 
>>> tell
>>> you that this is the first thing I tried to do. Early versions of this work 
>>> used
>>> a sparse xarray of zone plugs. But the problem with such approach is that 
>>> it is
>>> a lot more complicated and there is a need for a single lock to manage that
>>> structure (which is really not good for performance).
>>
>> Hmm ... since the xarray data structure supports RCU I think that locking the
>> entire xarray is only required if the zone condition changes from empty into
>> not empty or from neither empty nor full into full?
>>
>> For the use cases I'm interested in a hash table implementation that supports
>> RCU-lookups probably will work better than an xarray. I think that the hash
>> table implementation in  supports RCU for lookups, 
>> insertion
>> and removal.
> 
> I spent some time digging into this and also revisiting the possibility of 
> using
> an xarray. Conclusion is that this does not work well, at least in no way not
> better than what I did, and most of the time much worse. The reason is that we
> need at the very least to keep this information around:
> 1) If the zone is conventional or not
> 2) The zone capacity of sequential write required zones
> 
> Unless we keep this information, a report zone would be needed before starting
> writing to a zone that does not yet have a zone write plug allocated.
> 
> (1) and (2) above can be trivially combined into a single 32-bits value. But
> that value must exist for all zones. So at the very least, we need nr_zones * 
> 4B
> of memory allocated at all time. For such case (i.e. non-sparse structure),
> xarray or hash table would be more costly in memory than a simple static 
> array.
> 
> Given that we want to allocate/free zone write plugs dynamically as needed, we
> essentially need an array of pointers, so 8B * nr_zones for the base 
> structure.
> From there, ideally, we should be able to use rcu to safely dereference/modify
> the array entries. However, static arrays are not supported by the rcu code 
> from
> what I read.
> 
> Given this, my current approach that uses 16B per zone is the next best thing 
> I
> can think of without introducing a single lock for modifying the array 
> entries.
> 
> If you have any other idea, please share.

Replying to myself as I had an idea:
1) Store the zone capacity in a separate array: 4B * nr_zones needed. Storing
"0" as a value for a zone in that array would indicate that the zone is
conventional. No additional zone bitmap needed.
2) Use a sparse xarray for managing allocated zone write plugs: 64B per
allocated zone write plug needed, which for an SMR drive would generally be at
most 128 * 64B = 8K.

So for an SMR drive with 100,000 zones, that would be a total of 408 KB, instead
of the current 1.6 MB. Will try to prototype this to see how performance goes (I
am worried about the xarray lookup overhead in the hot path).

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH 25/26] block: Reduce zone write plugging memory usage

2024-02-12 Thread Damien Le Moal
On 2/11/24 12:40, Bart Van Assche wrote:
> On 2/9/24 16:06, Damien Le Moal wrote:
>> On 2/10/24 04:36, Bart Van Assche wrote:
>>> written zones is typically less than 10. Hence, tracking the partially 
>>> written
>>
>> That is far from guaranteed, especially with devices that have no active zone
>> limits like SMR drives.
> 
> Interesting. The zoned devices I'm working with try to keep data in memory
> for all zones that are neither empty nor full and hence impose an upper limit
> on the number of open zones.
> 
>> But in any case, what exactly is your idea here ? Can you actually suggest
>> something ? Are you suggesting that a sparse array of zone plugs be used, 
>> with
>> an rb-tree or an xarray ? If that is what you are thinking, I can already 
>> tell
>> you that this is the first thing I tried to do. Early versions of this work 
>> used
>> a sparse xarray of zone plugs. But the problem with such approach is that it 
>> is
>> a lot more complicated and there is a need for a single lock to manage that
>> structure (which is really not good for performance).
> 
> Hmm ... since the xarray data structure supports RCU I think that locking the
> entire xarray is only required if the zone condition changes from empty into
> not empty or from neither empty nor full into full?
> 
> For the use cases I'm interested in a hash table implementation that supports
> RCU-lookups probably will work better than an xarray. I think that the hash
> table implementation in  supports RCU for lookups, 
> insertion
> and removal.

I spent some time digging into this and also revisiting the possibility of using
an xarray. Conclusion is that this does not work well, at least in no way not
better than what I did, and most of the time much worse. The reason is that we
need at the very least to keep this information around:
1) If the zone is conventional or not
2) The zone capacity of sequential write required zones

Unless we keep this information, a report zone would be needed before starting
writing to a zone that does not yet have a zone write plug allocated.

(1) and (2) above can be trivially combined into a single 32-bits value. But
that value must exist for all zones. So at the very least, we need nr_zones * 4B
of memory allocated at all time. For such case (i.e. non-sparse structure),
xarray or hash table would be more costly in memory than a simple static array.

Given that we want to allocate/free zone write plugs dynamically as needed, we
essentially need an array of pointers, so 8B * nr_zones for the base structure.
>From there, ideally, we should be able to use rcu to safely dereference/modify
the array entries. However, static arrays are not supported by the rcu code from
what I read.

Given this, my current approach that uses 16B per zone is the next best thing I
can think of without introducing a single lock for modifying the array entries.

If you have any other idea, please share.

-- 
Damien Le Moal
Western Digital Research




Re: [PATCH v3 0/5] block: remove gfp_mask for blkdev_zone_mgmt()

2024-02-12 Thread Johannes Thumshirn
On 29.01.24 08:52, Johannes Thumshirn wrote:
> Fueled by the LSFMM discussion on removing GFP_NOFS initiated by Willy,
> I've looked into the sole GFP_NOFS allocation in zonefs. As it turned out,
> it is only done for zone management commands and can be removed.
> 
> After digging into more callers of blkdev_zone_mgmt() I came to the
> conclusion that the gfp_mask parameter can be removed alltogether.
> 
> So this series switches all callers of blkdev_zone_mgmt() to either use
> GFP_KERNEL where possible or grab a memalloc_no{fs,io} context.
> 
> The final patch in this series is getting rid of the gfp_mask parameter.
> 
> Link: https://lore.kernel.org/all/zzcgxi46ainlc...@casper.infradead.org/
> 
> ---
> Changes in v3:
> - Fix build error after rebase in dm-zoned-metadata.c
> - Link to v2: 
> https://lore.kernel.org/r/20240125-zonefs_nofs-v2-0-2d975c8c1...@wdc.com
> 
> Changes in v2:
> - guard blkdev_zone_mgmt in dm-zoned-metadata.c with memalloc_noio context
> - Link to v1: 
> https://lore.kernel.org/r/20240123-zonefs_nofs-v1-0-cc0b0308e...@wdc.com
> 
> ---
> Johannes Thumshirn (5):
>zonefs: pass GFP_KERNEL to blkdev_zone_mgmt() call
>dm: dm-zoned: guard blkdev_zone_mgmt with noio scope
>btrfs: zoned: call blkdev_zone_mgmt in nofs scope
>f2fs: guard blkdev_zone_mgmt with nofs scope
>block: remove gfp_flags from blkdev_zone_mgmt
> 
>   block/blk-zoned.c  | 19 ---
>   drivers/md/dm-zoned-metadata.c |  5 -
>   drivers/nvme/target/zns.c  |  5 ++---
>   fs/btrfs/zoned.c   | 35 +--
>   fs/f2fs/segment.c  | 15 ---
>   fs/zonefs/super.c  |  2 +-
>   include/linux/blkdev.h |  2 +-
>   7 files changed, 53 insertions(+), 30 deletions(-)
> ---
> base-commit: 615d300648869c774bd1fe54b4627bb0c20faed4
> change-id: 20240110-zonefs_nofs-dd1e22b2e046
> 
> Best regards,

Hi Jens,

now that every patch is reviewed, can you queue this series please?

Thanks,
Johannes