Re: [PATCH v5 1/8] dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n
On Mon, Feb 12, 2024 at 11:30:54AM -0500, Mathieu Desnoyers wrote: > Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for > CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y > never returns NULL. All the callers of alloc_dax() only check for IS_ERR(). Doesn't this result in a change of behavior in all the callers? Previously they'd ignore the NULL return value and continue, now they'll error out. Given that, seems dangerous to add a Fixes tag with a v4.0 commit and thus risk regressing all stable kernels. Thanks, Lukas
Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
On Mon, Feb 12, 2024 at 11:30:58AM -0500, Mathieu Desnoyers wrote: > In preparation for checking whether the architecture has data cache > aliasing within alloc_dax(), modify the error handling of virtio > virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as > non-fatal. > > Co-developed-by: Dan Williams > Signed-off-by: Dan Williams > Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing > caches") That's a v4.0 commit, yet this patch uses DEFINE_FREE() which is only available in v6.6 but not any earlier stable kernels. So the Fixes tag feels a bit weird. Apart from that, Reviewed-by: Lukas Wunner
Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
On Mon, Feb 12, 2024 at 03:02:46PM -0800, Dan Williams wrote: > However, Lukas, I think Linus is right, your DEFINE_FREE() should use > IS_ERR_OR_NULL(). Uh... that's a negative, sir. ;) IS_ERR_OR_NULL() results in... * a superfluous NULL pointer check in x509_key_preparse() and * a superfluous IS_ERR check in x509_cert_parse(). IS_ERR() results *only* in... * a superfluous IS_ERR check in x509_cert_parse(). I can get rid of the IS_ERR() check by using assume(). I can *not* get rid of the NULL pointer check because the compiler is compiled with -fno-delete-null-pointer-checks. (The compiler seems to ignore __attribute__((returns_nonnull)) due to that.) > I.e. the problem is trying to use > __free(x509_free_certificate) in x509_cert_parse(). > > > --- a/crypto/asymmetric_keys/x509_cert_parser.c > > +++ b/crypto/asymmetric_keys/x509_cert_parser.c > > @@ -60,24 +60,24 @@ void x509_free_certificate(struct x509_certificate > > *cert) > > */ > > struct x509_certificate *x509_cert_parse(const void *data, size_t datalen) > > { > > - struct x509_certificate *cert; > > - struct x509_parse_context *ctx; > > + struct x509_certificate *cert __free(x509_free_certificate); > > ...make this: > > struct x509_certificate *cert __free(kfree); That doesn't work I'm afraid. x509_cert_parse() needs x509_free_certificate() to be called in the error path, not kfree(). See the existing code in current mainline: x509_cert_parse() populates three sub-allocations in struct x509_certificate (pub, sig, id) and two sub-sub-allocations (pub->key, pub->params). So I'd have to add five additional local variables which get freed by __cleanup(). One of them (pub->key) requires kfree_sensitive() instead of kfree(), so I'd need an extra DEFINE_FREE() for that. I haven't tried it but I suspect the result would look terrible and David Howells wouldn't like it. > ...and Mathieu, this should be IS_ERR_OR_NULL() to skip an unnecessary > call to virtio_fs_cleanup_dax() at function exit that the compiler > should elide. My recommendation is to check for !IS_ERR() in the DEFINE_FREE() clause and amend virtio_fs_cleanup_dax() with a "if (!dax_dev) return;" for defensiveness in case someone calls it with a NULL pointer. That's the best solution I could come up with for the x509_certificate conversion. Note that even with superfluous checks avoided, __cleanup() causes gcc-12 to always generate two return paths. It's very visible in the generated code that all the stack unwinding code gets duplicated in every function using __cleanup(). The existing Assembler code of x509_key_preparse() and x509_cert_parse(), without __cleanup() invocation, has only a single return path. So __cleanup() bloats the code regardless of superfluous checks, but future gcc versions might avoid that. clang-15 generates much more compact code (vmlinux is a couple hundred kBytes smaller), but does weird things such as inlining x509_free_certificate() in x509_cert_parse(). As you may have guessed, I've spent an inordinate amount of time down that rabbit hole. ;( Thanks, Lukas
[linux-next:master] BUILD REGRESSION ae00c445390b349e070a64dc62f08aa878db7248
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master branch HEAD: ae00c445390b349e070a64dc62f08aa878db7248 Add linux-next specific files for 20240212 Error/Warning reports: https://lore.kernel.org/oe-kbuild-all/202402122047.ydhrzmm4-...@intel.com https://lore.kernel.org/oe-kbuild-all/202402130633.bfncwveh-...@intel.com https://lore.kernel.org/oe-kbuild-all/202402131351.a0fzogeg-...@intel.com Error/Warning: (recently discovered and may have been fixed) drivers/md/dm.c:2131:(.text+0x9f34): relocation truncated to fit: R_OR1K_INSN_REL_26 against undefined symbol `set_dax_nocache' drivers/md/dm.c:2131:(.text+0x9f34): undefined reference to `set_dax_nocache' drivers/md/dm.c:2132:(.text+0x9f3c): relocation truncated to fit: R_OR1K_INSN_REL_26 against undefined symbol `set_dax_nomc' misc.c:(.text+0x14f0): undefined reference to `__ubsan_handle_load_invalid_value' or1k-linux-ld: drivers/md/dm.c:2132:(.text+0x9f3c): undefined reference to `set_dax_nomc' sh4-linux-ld: misc.c:(.text+0x1698): undefined reference to `__ubsan_handle_load_invalid_value' Error/Warning ids grouped by kconfigs: gcc_recent_errors |-- alpha-allyesconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- arc-allmodconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- arc-allyesconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- arc-randconfig-002-20240212 | `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size |-- arm-allmodconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- arm-allyesconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- arm-randconfig-001-20240212 | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- arm64-defconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- csky-allmodconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- csky-allyesconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- csky-randconfig-001-20240212 | `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size |-- csky-randconfig-002-20240212 | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- csky-randconfig-r061-20240212 | |-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg | `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size |-- i386-allmodconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- i386-allyesconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- i386-buildonly-randconfig-002-20240212 | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- i386-randconfig-013-20240212 | `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size |-- i386-randconfig-015-20240212 | `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size |-- loongarch-allmodconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- loongarch-randconfig-002-20240212 | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- microblaze-allmodconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- microblaze-allyesconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- mips-allyesconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described-in-nvkm_gsp_radix3_sg |-- nios2-randconfig-001-20240212 | `-- fs-ntfs3-frecord.c:warning:unused-variable-i_size |-- openrisc-allyesconfig | `-- drivers-gpu-drm-nouveau-nvkm-subdev-gsp-r535.c:warning:Function-parameter-or-struct-member-gsp-not-described
Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
On 2/13/24 03:40, Bart Van Assche wrote: > On 2/12/24 00:47, Damien Le Moal wrote: >> Replying to myself as I had an idea: >> 1) Store the zone capacity in a separate array: 4B * nr_zones needed. Storing >> "0" as a value for a zone in that array would indicate that the zone is >> conventional. No additional zone bitmap needed. >> 2) Use a sparse xarray for managing allocated zone write plugs: 64B per >> allocated zone write plug needed, which for an SMR drive would generally be >> at >> most 128 * 64B = 8K. >> >> So for an SMR drive with 100,000 zones, that would be a total of 408 KB, >> instead >> of the current 1.6 MB. Will try to prototype this to see how performance >> goes (I >> am worried about the xarray lookup overhead in the hot path). > > Hi Damien, > > Are there any zoned devices where the sequential write required zones occur > before > the conventional zones? If not, does this mean that the conventional zones > always > occur before the write pointer zones and also that storing the number of > conventional > zones is sufficient? Not sure where you want to go with this... In any case, there are SMR drives which have conventional zones before and after the bulk of the capacity as sequential write required zones. Conventional zones can be anywhere. > Are there zoned storage devices where each zone has a different capacity? I > have > not yet encountered any such device. I'm wondering whether a single capacity > variable would be sufficient for the entire device. Yes, I did this optimization. Right now, for the 28TB SMR disk case, I am down to a bitmap for conventional zones (16KB) plus max-open-zones * 64 B for the zone write plugs. Cannot go lower than that. I am still looking into xarray vs hash table for the zone write plugs for the overhead/performance. -- Damien Le Moal Western Digital Research
Re: [PATCH] multipath-tools: update no_path_retry value for IBM/2145
On 8/26/21 8:47 AM, Martin Wilck wrote: ^^^ It is never too late! On Thu, 2021-08-26 at 00:24 +0200, Xose Vazquez Perez wrote: Based on current configs: https://www.ibm.com/docs/en/flashsystem-9x00/8.4.x?topic=system-settings-linux-hosts Cc: Martin Wilck Cc: Benjamin Marzinski Cc: Christophe Varoqui Cc: DM-DEVEL ML Signed-off-by: Xose Vazquez Perez --- libmultipath/hwtable.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libmultipath/hwtable.c b/libmultipath/hwtable.c index 2a896440..58554cbb 100644 --- a/libmultipath/hwtable.c +++ b/libmultipath/hwtable.c @@ -662,7 +662,7 @@ static struct hwentry default_hw[] = { /* Storwize family / SAN Volume Controller / Flex System V7000 / FlashSystem V840/V9000/9100 */ .vendor = "IBM", .product = "^2145", - .no_path_retry = NO_PATH_RETRY_QUEUE, + .no_path_retry = 5, .pgpolicy = GROUP_BY_PRIO, .pgfailback = -FAILBACK_IMMEDIATE, .prio_name = PRIO_ALUA, Ref: https://github.com/opensvc/multipath-tools/issues/6 The question is on which basis IBM came up with this recommendation. 5 (aka 25s) is a rather low value. Some users may encounter unpleasant surprises if we change the default this way, as it used to be infinite before. Using 5, the IBS 2145 would have the 2nd-lowest default in hwtable.c after Dell PowerStore (3). Symmetrix has 6; all other arrays default to 10 or higher, many default to "queue". Observing that the above is the documentation for the *Flashsystem* 9200, I consider it likely that the value ".no_path_retry = 5" would apply to flash-based IBM storage products, but not to the older products such as the V7000, which unfortunately use the same device ID. It'd be helpful if someone from IBM could jump in here... Pondering the pros and cons, I vote for keeping the current defaults for now. Martin Some history: first commit 3eb8c380a : { /* IBM SAN Volume Controller */ .vendor= "IBM", .product = "2145", .getuid= DEFAULT_GETUID, .getprio = "mpath_prio_alua /dev/%n", .features = "1 queue_if_no_path", .hwhandler = DEFAULT_HWHANDLER, .selector = DEFAULT_SELECTOR, .pgpolicy = GROUP_BY_PRIO, .pgfailback= -FAILBACK_IMMEDIATE, .rr_weight = RR_WEIGHT_NONE, .no_path_retry = NO_PATH_RETRY_UNDEF, .minio = DEFAULT_MINIO, .checker_name = TUR, }, NO_PATH_RETRY_UNDEF was removed in b7c3cf014 because it was the default value, and later "1 queue_if_no_path" was replaced by NO_PATH_RETRY_QUEUE in 87ea76f99 IBM docs recommends: no_path_retry 5 # or no_path_retry "fail" for some current linux distros IBM Storage FlashSystem 5200, 5000, 5100, Storwize V5100 and V5000E: https://www.ibm.com/docs/en/flashsystem-5x00/8.6.x?topic=system-settings-linux-hosts IBM Storage FlashSystem 7300, 7200 and Storwize V7000: https://www.ibm.com/docs/en/flashsystem-7x00/8.6.x?topic=system-settings-linux-hosts IBM FlashSystem V9000: https://www.ibm.com/docs/en/flashsystem-v9000/8.3.x?topic=system-settings-linux-hosts IBM Storage FlashSystem 9500, 9200 and 9100: https://www.ibm.com/docs/en/flashsystem-9x00/8.6.x?topic=system-settings-linux-hosts Therefore, we should change this value.
Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
[ add Lukas ] Linus Torvalds wrote: > On Mon, 12 Feb 2024 at 14:04, Dan Williams wrote: > > > > This works because the internals of virtio_fs_cleanup_dax(), "kill_dax() > > and put_dax()", know how to handle a NULL @dax_dev. It is still early > > days with the "cleanup" helpers, but I wonder if anyone else cares that > > the DEFINE_FREE() above does not check for NULL? > > Well, the main reason for DEFINE_FREE() to check for NULL is not > correctness, but code generation. See the comment about kfree() in > : > > * NOTE: the DEFINE_FREE()'s @free expression includes a NULL test even though > * kfree() is fine to be called with a NULL value. This is on purpose. This > way > * the compiler sees the end of our alloc_obj() function as [...] > > with the full explanation there. > > Now, whether the code wants to actually use the cleanup() helpers for > a single use-case is debatable. > > But yes, if it does, I suspect it should use !IS_ERR_OR_NULL(ptr).o I am trying to arrive at a common recommendation given Lukas found that IS_ERR_OR_NULL() resulted in unwanted NULL checks emitted in the assembly [1]. He is doing something similar: http://lore.kernel.org/r/4143b15418c4ecf87ddeceb36813943c3ede17aa.1707734526.git.lu...@wunner.de ...and introduced an assume() helper. However, Lukas, I think Linus is right, your DEFINE_FREE() should use IS_ERR_OR_NULL(). I.e. the problem is trying to use __free(x509_free_certificate) in x509_cert_parse(). > --- a/crypto/asymmetric_keys/x509_cert_parser.c > +++ b/crypto/asymmetric_keys/x509_cert_parser.c > @@ -60,24 +60,24 @@ void x509_free_certificate(struct x509_certificate *cert) > */ > struct x509_certificate *x509_cert_parse(const void *data, size_t datalen) > { > - struct x509_certificate *cert; > - struct x509_parse_context *ctx; > + struct x509_certificate *cert __free(x509_free_certificate); ...make this: struct x509_certificate *cert __free(kfree); ...and Mathieu, this should be IS_ERR_OR_NULL() to skip an unnecessary call to virtio_fs_cleanup_dax() at function exit that the compiler should elide.
Re: dm-ps-historical-service-time should probably set QUEUE_FLAG_STATS
(urgh, apologies for the html) We use io_start_time_ns in dm-mpath.c, but the only actual user is historical-service-time currently. We need blk_stat_enable_accounting()/blk_stat_disable_accounting() calls somewhere. I was considering adding a new multipath feature flag, but we already have "DM_PS_USE_HR_TIMER", which seems similar in spirit. Would that re-use make sense? Khazhy smime.p7s Description: S/MIME Cryptographic Signature
+ dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch added to mm-unstable branch
The patch titled Subject: dax: fix incorrect list of data cache aliasing architectures has been added to the -mm mm-unstable branch. Its filename is dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days -- From: Mathieu Desnoyers Subject: dax: fix incorrect list of data cache aliasing architectures Date: Mon, 12 Feb 2024 11:31:01 -0500 commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased data cache. This is a regression introduced in the v4.0 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no data cache aliasing, and therefore should work fine with FS_DAX. This was turned into the following check in alloc_dax() by a preparatory change: if (ops && (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_MIPS) || IS_ENABLED(CONFIG_SPARC))) return NULL; Use cpu_dcache_is_aliasing() instead to figure out whether the environment has aliasing data caches. Link: https://lkml.kernel.org/r/20240212163101.19614-9-mathieu.desnoy...@efficios.com Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: Alasdair Kergon Cc: Dave Chinner Cc: Michael Sclafani Cc: Mike Snitzer Cc: Mikulas Patocka Signed-off-by: Andrew Morton --- drivers/dax/super.c |5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) --- a/drivers/dax/super.c~dax-fix-incorrect-list-of-data-cache-aliasing-architectures +++ a/drivers/dax/super.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "dax-private.h" /** @@ -456,9 +457,7 @@ struct dax_device *alloc_dax(void *priva * except for device-dax (NULL operations pointer), which does * not use aliased mappings from the kernel. */ - if (ops && (IS_ENABLED(CONFIG_ARM) || - IS_ENABLED(CONFIG_MIPS) || - IS_ENABLED(CONFIG_SPARC))) + if (ops && cpu_dcache_is_aliasing()) return ERR_PTR(-EOPNOTSUPP); if (WARN_ON_ONCE(ops && !ops->zero_page_range)) _ Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dcssblk-handle-alloc_dax-eopnotsupp-failure.patch virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dax-check-for-data-cache-aliasing-at-runtime.patch introduce-cpu_dcache_is_aliasing-across-all-architectures.patch dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch
Re: [PATCH v5 00/14] dm-raid/md/raid: fix v6.7 regressions
Hi Mikulas and DM folks, This patchset looks good to me. Please help review the dm-raid code so that we can ship it to 6.8 kernel and back port the fixes to 6.7 stable kernels. In the longer term, we will invest more into a CI system and include the lvm2 test suite as part of the tests. Hopefully, this effort will help us catch similar issues much sooner. Thanks, Song On Thu, Feb 1, 2024 at 1:30 AM Yu Kuai wrote: > > From: Yu Kuai > > Changes in v5: > - remove the patch to wait for bio completion while removing dm disk; > - add patch 6; > - reorder the patches, patch 1-8 is for md/raid, and patch 9-14 is > related to dm-raid; > > Changes in v4: > - add patch 10 to fix a raid456 deadlock(for both md/raid and dm-raid); > - add patch 13 to wait for inflight IO completion while removing dm > device; > > Changes in v3: > - fix a problem in patch 5; > - add patch 12; > > Changes in v2: > - replace revert changes for dm-raid with real fixes; > - fix dm-raid5 deadlock that exist for a long time, this deadlock is > triggered because another problem is fixed in raid5, and instead of > deadlock, user will read wrong data before v6.7, patch 9-11; > > First regression related to stop sync thread: > > The lifetime of sync_thread is designed as following: > > 1) Decide want to start sync_thread, set MD_RECOVERY_NEEDED, and wake up > daemon thread; > 2) Daemon thread detect that MD_RECOVERY_NEEDED is set, then set > MD_RECOVERY_RUNNING and register sync_thread; > 3) Execute md_do_sync() for the actual work, if it's done or > interrupted, it will set MD_RECOVERY_DONE and wake up daemone thread; > 4) Daemon thread detect that MD_RECOVERY_DONE is set, then clear > MD_RECOVERY_RUNNING and unregister sync_thread; > > In v6.7, we fix md/raid to follow this design by commit f52f5c71f3d4 > ("md: fix stopping sync thread"), however, dm-raid is not considered at > that time, and following test will hang: > > shell/integrity-caching.sh > shell/lvconvert-raid-reshape.sh > > This patch set fix the broken test by patch 1-4; > - patch 1 fix that step 4) is broken by suspended array; > - patch 2 fix that step 4) is broken by read-only array; > - patch 3 fix that step 3) is broken that md_do_sync() doesn't set > MD_RECOVERY_DONE; Noted that this patch will introdece new problem that > data will be corrupted, which will be fixed in later patches. > - patch 4 fix that setp 1) is broken that sync_thread is register and > MD_RECOVERY_RUNNING is set directly, md/raid behaviour, not related to > dm-raid; > > With patch 1-4, the above test won't hang anymore, however, the test > will still fail and complain that ext4 is corrupted; > > > Second regression is found by code review, interrupted reshape > concurrent with IO can deadlock, patch 5; > > > Third regression fix 'active_io' leakage, patch 6; > > > The fifth regression related to frozen sync thread: > > Noted that for raid456, if reshape is interrupted, then call > "pers->start_reshape" will corrupt data. And dm-raid rely on md_do_sync() > doesn't set MD_RECOVERY_DONE so that new sync_thread won't be registered, > and patch 3 just break this. > > - Patch 9 fix this problem by interrupting reshape and frozen > sync_thread in dm_suspend(), then unfrozen and continue reshape in > dm_resume(). It's verified that dm-raid tests won't complain that > ext4 is corrupted anymore. > - Patch 10 fix the problem that raid_message() call > md_reap_sync_thread() directly, without holding 'reconfig_mutex'. > > > Last regression related to dm-raid456 IO concurrent with reshape: > > For raid456, if reshape is still in progress, then IO across reshape > position will wait for reshape to make progress. However, for dm-raid, > in following cases reshape will never make progress hence IO will hang: > > 1) the array is read-only; > 2) MD_RECOVERY_WAIT is set; > 3) MD_RECOVERY_FROZEN is set; > > After commit c467e97f079f ("md/raid6: use valid sector values to determine > if an I/O should wait on the reshape") fix the problem that IO across > reshape position doesn't wait for reshape, the dm-raid test > shell/lvconvert-raid-reshape.sh start to hang at raid5_make_request(). > > For md/raid, the problem doesn't exist because: > > 1) If array is read-only, it can switch to read-write by ioctl/sysfs; > 2) md/raid never set MD_RECOVERY_WAIT; > 3) If MD_RECOVERY_FROZEN is set, mddev_suspend() doesn't hold >'reconfig_mutex' anymore, it can be cleared and reshape can continue by >sysfs api 'sync_action'. > > However, I'm not sure yet how to avoid the problem in dm-raid yet. > > - patch 11,12 fix this problem by detecting the above 3 cases in > dm_suspend(), and fail those IO directly. > > If user really meet the IO error, then it means they're reading the wrong > data before c467e97f079f. And it's safe to read/write the array after > reshape make progress successfully. > > There are also some other minor changes: patch 8 and patch 12; > > > Test result (for v4, I don't think it's
+ introduce-cpu_dcache_is_aliasing-across-all-architectures.patch added to mm-unstable branch
The patch titled Subject: Introduce cpu_dcache_is_aliasing() across all architectures has been added to the -mm mm-unstable branch. Its filename is introduce-cpu_dcache_is_aliasing-across-all-architectures.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/introduce-cpu_dcache_is_aliasing-across-all-architectures.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days -- From: Mathieu Desnoyers Subject: Introduce cpu_dcache_is_aliasing() across all architectures Date: Mon, 12 Feb 2024 11:31:00 -0500 Introduce a generic way to query whether the data cache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For data cache aliasing, there are three scenarios dependending on the architecture. Here is a breakdown based on my understanding: A) The data cache is always aliasing: * arc * csky * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.) * sh * parisc B) The data cache aliasing is statically known or depends on querying CPU state at runtime: * arm (cache_is_vivt() || cache_is_vipt_aliasing()) * mips (cpu_has_dc_aliases) * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE) * sparc32 (vac_cache_size > PAGE_SIZE) * sparc64 (L1DCACHE_SIZE > PAGE_SIZE) * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE) C) The data cache is never aliasing: * alpha * arm64 (aarch64) * hexagon * loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE")) * microblaze * openrisc * powerpc * riscv * s390 * um * x86 Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and implement "cpu_dcache_is_aliasing()". Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus cpu_dcache_is_aliasing() simply evaluates to "false". Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future work. This would be useful to gate features like XIP on architectures which have aliasing CPU dcache-icache but not CPU dcache-dcache. Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache" to clarify that we really mean "CPU data cache" and "CPU cache" to eliminate any possible confusion with VFS "dentry cache" and "page cache". Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Link: https://lkml.kernel.org/r/20240212163101.19614-8-mathieu.desnoy...@efficios.com Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: Alasdair Kergon Cc: Dave Chinner Cc: Michael Sclafani Cc: Mike Snitzer Cc: Mikulas Patocka Signed-off-by: Andrew Morton --- arch/arc/Kconfig|1 + arch/arc/include/asm/cachetype.h|9 + arch/arm/Kconfig|1 + arch/arm/include/asm/cachetype.h|2 ++ arch/csky/Kconfig |1 + arch/csky/include/asm/cachetype.h |9 + arch/m68k/Kconfig |1 + arch/m68k/include/asm/cachetype.h |9 + arch/mips/Kconfig |1 + arch/mips/include/asm/cachetype.h |9 + arch/nios2/Kconfig |1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig |1 + arch/parisc/include/asm/cachetype.h |9 + arch/sh/Kconfig |1 + arch/sh/include/asm/cachetype.h |9 + arch/sparc/Kconfig |1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig |1 + arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h |6 ++ mm/Kconfig |6 ++ 22 files changed, 112 insertions(+) --- /dev/null +++ a/arch/arc/include/asm/cachetype.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_ARC_CACHETYPE_H +#define __ASM_ARC_CACHETYPE_H + +#include + +#define cpu_dcache_is_aliasing() true + +#endif ---
+ dax-check-for-data-cache-aliasing-at-runtime.patch added to mm-unstable branch
The patch titled Subject: dax: check for data cache aliasing at runtime has been added to the -mm mm-unstable branch. Its filename is dax-check-for-data-cache-aliasing-at-runtime.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/dax-check-for-data-cache-aliasing-at-runtime.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days -- From: Mathieu Desnoyers Subject: dax: check for data cache aliasing at runtime Date: Mon, 12 Feb 2024 11:30:59 -0500 Replace the following fs/Kconfig:FS_DAX dependency: depends on !(ARM || MIPS || SPARC) By a runtime check within alloc_dax(). This runtime check returns ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means the kernel is using an aliased mapping) on an architecture which has data cache aliasing. Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n for consistency. This is done in preparation for using cpu_dcache_is_aliasing() in a following change which will properly support architectures which detect data cache aliasing at runtime. Link: https://lkml.kernel.org/r/20240212163101.19614-7-mathieu.desnoy...@efficios.com Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: Alasdair Kergon Cc: Dave Chinner Cc: Michael Sclafani Cc: Mike Snitzer Cc: Mikulas Patocka Signed-off-by: Andrew Morton --- drivers/dax/super.c | 10 ++ fs/Kconfig |1 - 2 files changed, 10 insertions(+), 1 deletion(-) --- a/drivers/dax/super.c~dax-check-for-data-cache-aliasing-at-runtime +++ a/drivers/dax/super.c @@ -451,6 +451,16 @@ struct dax_device *alloc_dax(void *priva dev_t devt; int minor; + /* +* Unavailable on architectures with virtually aliased data caches, +* except for device-dax (NULL operations pointer), which does +* not use aliased mappings from the kernel. +*/ + if (ops && (IS_ENABLED(CONFIG_ARM) || + IS_ENABLED(CONFIG_MIPS) || + IS_ENABLED(CONFIG_SPARC))) + return ERR_PTR(-EOPNOTSUPP); + if (WARN_ON_ONCE(ops && !ops->zero_page_range)) return ERR_PTR(-EINVAL); --- a/fs/Kconfig~dax-check-for-data-cache-aliasing-at-runtime +++ a/fs/Kconfig @@ -60,7 +60,6 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX _ Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dcssblk-handle-alloc_dax-eopnotsupp-failure.patch virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dax-check-for-data-cache-aliasing-at-runtime.patch introduce-cpu_dcache_is_aliasing-across-all-architectures.patch dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch
+ virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch added to mm-unstable branch
The patch titled Subject: virtio: treat alloc_dax() -EOPNOTSUPP failure as non-fatal has been added to the -mm mm-unstable branch. Its filename is virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days -- From: Mathieu Desnoyers Subject: virtio: treat alloc_dax() -EOPNOTSUPP failure as non-fatal Date: Mon, 12 Feb 2024 11:30:58 -0500 In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of virtio virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Link: https://lkml.kernel.org/r/20240212163101.19614-6-mathieu.desnoy...@efficios.com Co-developed-by: Dan Williams Signed-off-by: Dan Williams Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: Dave Chinner Cc: Michael Sclafani Signed-off-by: Andrew Morton --- fs/fuse/virtio_fs.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) --- a/fs/fuse/virtio_fs.c~virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal +++ a/fs/fuse/virtio_fs.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "fuse_i.h" @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void * put_dax(dax_dev); } +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR(_T)) virtio_fs_cleanup_dax(_T)) + static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) { + struct dax_device *dax_dev __free(cleanup_dax) = ERR_PTR(-EOPNOTSUPP); struct virtio_shm_region cache_reg; struct dev_pagemap *pgmap; bool have_cache; @@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct vi if (!IS_ENABLED(CONFIG_FUSE_DAX)) return 0; + dax_dev = alloc_dax(fs, _fs_dax_ops); + if (IS_ERR(dax_dev)) { + int rc = PTR_ERR(dax_dev); + return rc == -EOPNOTSUPP ? 0 : rc; + } + /* Get cache region */ have_cache = virtio_get_shm_region(vdev, _reg, (u8)VIRTIO_FS_SHMCAP_ID_CACHE); @@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct vi dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len); - fs->dax_dev = alloc_dax(fs, _fs_dax_ops); - if (IS_ERR(fs->dax_dev)) - return PTR_ERR(fs->dax_dev); - + fs->dax_dev = no_free_ptr(dax_dev); return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax, fs->dax_dev); } _ Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dcssblk-handle-alloc_dax-eopnotsupp-failure.patch virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dax-check-for-data-cache-aliasing-at-runtime.patch introduce-cpu_dcache_is_aliasing-across-all-architectures.patch dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch
+ dcssblk-handle-alloc_dax-eopnotsupp-failure.patch added to mm-unstable branch
The patch titled Subject: dcssblk: handle alloc_dax() -EOPNOTSUPP failure has been added to the -mm mm-unstable branch. Its filename is dcssblk-handle-alloc_dax-eopnotsupp-failure.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/dcssblk-handle-alloc_dax-eopnotsupp-failure.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days -- From: Mathieu Desnoyers Subject: dcssblk: handle alloc_dax() -EOPNOTSUPP failure Date: Mon, 12 Feb 2024 11:30:57 -0500 In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dcssblk dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures. Considering that s390 is not a data cache aliasing architecture, and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP from alloc_dax() should make dcssblk_add_store() fail. Link: https://lkml.kernel.org/r/20240212163101.19614-5-mathieu.desnoy...@efficios.com Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: Dave Chinner Cc: Michael Sclafani Signed-off-by: Andrew Morton --- drivers/s390/block/dcssblk.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) --- a/drivers/s390/block/dcssblk.c~dcssblk-handle-alloc_dax-eopnotsupp-failure +++ a/drivers/s390/block/dcssblk.c @@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, st int rc, i, j, num_of_segments; struct dcssblk_dev_info *dev_info; struct segment_info *seg_info, *temp; + struct dax_device *dax_dev; char *local_buf; unsigned long seg_byte_size; @@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, st if (rc) goto put_dev; - dev_info->dax_dev = alloc_dax(dev_info, _dax_ops); - if (IS_ERR(dev_info->dax_dev)) { - rc = PTR_ERR(dev_info->dax_dev); - dev_info->dax_dev = NULL; + dax_dev = alloc_dax(dev_info, _dax_ops); + if (IS_ERR(dax_dev)) { + rc = PTR_ERR(dax_dev); goto put_dev; } - set_dax_synchronous(dev_info->dax_dev); + set_dax_synchronous(dax_dev); + dev_info->dax_dev = dax_dev; rc = dax_add_host(dev_info->dax_dev, dev_info->gd); if (rc) goto out_dax; _ Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dcssblk-handle-alloc_dax-eopnotsupp-failure.patch virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dax-check-for-data-cache-aliasing-at-runtime.patch introduce-cpu_dcache_is_aliasing-across-all-architectures.patch dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch
+ dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch added to mm-unstable branch
The patch titled Subject: dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal has been added to the -mm mm-unstable branch. Its filename is dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days -- From: Mathieu Desnoyers Subject: dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal Date: Mon, 12 Feb 2024 11:30:56 -0500 In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dm alloc_dev() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Link: https://lkml.kernel.org/r/20240212163101.19614-4-mathieu.desnoy...@efficios.com Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Suggested-by: Dan Williams Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: Dave Chinner Cc: Michael Sclafani Signed-off-by: Andrew Morton --- drivers/md/dm.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) --- a/drivers/md/dm.c~dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal +++ a/drivers/md/dm.c @@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct static struct mapped_device *alloc_dev(int minor) { int r, numa_node_id = dm_get_numa_node(); + struct dax_device *dax_dev; struct mapped_device *md; void *old_md; @@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(i md->disk->private_data = md; sprintf(md->disk->disk_name, "dm-%d", minor); - if (IS_ENABLED(CONFIG_FS_DAX)) { - md->dax_dev = alloc_dax(md, _dax_ops); - if (IS_ERR(md->dax_dev)) { - md->dax_dev = NULL; + dax_dev = alloc_dax(md, _dax_ops); + if (IS_ERR(dax_dev)) { + if (PTR_ERR(dax_dev) != -EOPNOTSUPP) goto bad; - } - set_dax_nocache(md->dax_dev); - set_dax_nomc(md->dax_dev); - if (dax_add_host(md->dax_dev, md->disk)) + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + md->dax_dev = dax_dev; + if (dax_add_host(dax_dev, md->disk)) goto bad; } _ Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dcssblk-handle-alloc_dax-eopnotsupp-failure.patch virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dax-check-for-data-cache-aliasing-at-runtime.patch introduce-cpu_dcache_is_aliasing-across-all-architectures.patch dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch
+ nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch added to mm-unstable branch
The patch titled Subject: nvdimm/pmem: treat alloc_dax() -EOPNOTSUPP failure as non-fatal has been added to the -mm mm-unstable branch. Its filename is nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days -- From: Mathieu Desnoyers Subject: nvdimm/pmem: treat alloc_dax() -EOPNOTSUPP failure as non-fatal Date: Mon, 12 Feb 2024 11:30:55 -0500 In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of nvdimm/pmem pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. [ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ] Link: https://lkml.kernel.org/r/20240212163101.19614-3-mathieu.desnoy...@efficios.com Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: Dave Chinner Cc: Michael Sclafani Signed-off-by: Andrew Morton --- drivers/nvdimm/pmem.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) --- a/drivers/nvdimm/pmem.c~nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal +++ a/drivers/nvdimm/pmem.c @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct devic dax_dev = alloc_dax(pmem, _dax_ops); if (IS_ERR(dax_dev)) { rc = PTR_ERR(dax_dev); - goto out; + if (rc != -EOPNOTSUPP) + goto out; + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + if (is_nvdimm_sync(nd_region)) + set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; + rc = dax_add_host(dax_dev, disk); + if (rc) + goto out_cleanup_dax; + dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); } - set_dax_nocache(dax_dev); - set_dax_nomc(dax_dev); - if (is_nvdimm_sync(nd_region)) - set_dax_synchronous(dax_dev); - pmem->dax_dev = dax_dev; - rc = dax_add_host(dax_dev, disk); - if (rc) - goto out_cleanup_dax; - dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; _ Patches currently in -mm which might be from mathieu.desnoy...@efficios.com are nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch nvdimm-pmem-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dm-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dcssblk-handle-alloc_dax-eopnotsupp-failure.patch virtio-treat-alloc_dax-eopnotsupp-failure-as-non-fatal.patch dax-check-for-data-cache-aliasing-at-runtime.patch introduce-cpu_dcache_is_aliasing-across-all-architectures.patch dax-fix-incorrect-list-of-data-cache-aliasing-architectures.patch
+ dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch added to mm-unstable branch
The patch titled Subject: dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n has been added to the -mm mm-unstable branch. Its filename is dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days -- From: Mathieu Desnoyers Subject: dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n Date: Mon, 12 Feb 2024 11:30:54 -0500 Patch series "Introduce cpu_dcache_is_aliasing() to fix DAX regression", v5. This commit introduced in v4.0 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased data caches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using DAX over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliasing data caches (VIVT and VIPT with aliasing data cache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings with virtually aliasing data caches. This patch series introduces cpu_dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all architectures. The implementation of cpu_dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Testing done so far: - Compile allyesconfig on x86-64, - Compile allyesconfig on x86-64, with FS_DAX=n. - Compile allyesconfig on x86-64, with DAX=n. - Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP even on x86-64, thus simulating the behavior expected on an architecture with data cache aliasing. This patch (of 5): Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y never returns NULL. This is done in preparation for using cpu_dcache_is_aliasing() in a following change which will properly support architectures which detect data cache aliasing at runtime. Link: https://lkml.kernel.org/r/20240212163101.19614-1-mathieu.desnoy...@efficios.com Link: https://lkml.kernel.org/r/20240212163101.19614-2-mathieu.desnoy...@efficios.com Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: Dave Chinner Cc: Michael Sclafani Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Signed-off-by: Andrew Morton --- drivers/dax/super.c |5 + include/linux/dax.h |6 +- 2 files changed, 6 insertions(+), 5 deletions(-) --- a/drivers/dax/super.c~dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n +++ a/drivers/dax/super.c @@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive); * that any fault handlers or operations that might have seen * dax_alive(), have completed. Any operations that start after * synchronize_srcu() has run will abort upon seeing !dax_alive(). + * + * Note, because alloc_dax() returns an ERR_PTR() on error, callers + * typically store its result into a local variable in order to check + * the result. Therefore, care must be taken to populate the struct + * device dax_dev field make sure the dax_dev is not leaked. */ void kill_dax(struct dax_device *dax_dev) { --- a/include/linux/dax.h~dax-alloc_dax-return-err_ptr-eopnotsupp-for-config_dax=n +++ a/include/linux/dax.h @@ -86,11 +86,7 @@ static inline void *dax_holder(struct da static inline struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) { - /* -* Callers should check IS_ENABLED(CONFIG_DAX) to know if this -* NULL is an error or expected. -*/ - return NULL; + return ERR_PTR(-EOPNOTSUPP); } static inline void put_dax(struct dax_device
Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
On Mon, 12 Feb 2024 at 14:04, Dan Williams wrote: > > This works because the internals of virtio_fs_cleanup_dax(), "kill_dax() > and put_dax()", know how to handle a NULL @dax_dev. It is still early > days with the "cleanup" helpers, but I wonder if anyone else cares that > the DEFINE_FREE() above does not check for NULL? Well, the main reason for DEFINE_FREE() to check for NULL is not correctness, but code generation. See the comment about kfree() in : * NOTE: the DEFINE_FREE()'s @free expression includes a NULL test even though * kfree() is fine to be called with a NULL value. This is on purpose. This way * the compiler sees the end of our alloc_obj() function as [...] with the full explanation there. Now, whether the code wants to actually use the cleanup() helpers for a single use-case is debatable. But yes, if it does, I suspect it should use !IS_ERR_OR_NULL(ptr). Linus
RE: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
Mathieu Desnoyers wrote: > In preparation for checking whether the architecture has data cache > aliasing within alloc_dax(), modify the error handling of virtio > virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as > non-fatal. > > Co-developed-by: Dan Williams > Signed-off-by: Dan Williams > Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing > caches") > Signed-off-by: Mathieu Desnoyers > Cc: Alasdair Kergon > Cc: Mike Snitzer > Cc: Mikulas Patocka > Cc: Andrew Morton > Cc: Linus Torvalds > Cc: Dan Williams > Cc: Vishal Verma > Cc: Dave Jiang > Cc: Matthew Wilcox > Cc: Arnd Bergmann > Cc: Russell King > Cc: linux-a...@vger.kernel.org > Cc: linux-...@vger.kernel.org > Cc: linux-fsde...@vger.kernel.org > Cc: linux...@kvack.org > Cc: linux-...@vger.kernel.org > Cc: dm-devel@lists.linux.dev > Cc: nvd...@lists.linux.dev > --- > fs/fuse/virtio_fs.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c > index 5f1be1da92ce..f9acd9972af2 100644 > --- a/fs/fuse/virtio_fs.c > +++ b/fs/fuse/virtio_fs.c > @@ -16,6 +16,7 @@ > #include > #include > #include > +#include > #include > #include "fuse_i.h" > > @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data) > put_dax(dax_dev); > } > > +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR(_T)) > virtio_fs_cleanup_dax(_T)) > + > static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs > *fs) > { > + struct dax_device *dax_dev __free(cleanup_dax) = ERR_PTR(-EOPNOTSUPP); > struct virtio_shm_region cache_reg; > struct dev_pagemap *pgmap; > bool have_cache; > @@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device > *vdev, struct virtio_fs *fs) > if (!IS_ENABLED(CONFIG_FUSE_DAX)) > return 0; > > + dax_dev = alloc_dax(fs, _fs_dax_ops); > + if (IS_ERR(dax_dev)) { > + int rc = PTR_ERR(dax_dev); > + return rc == -EOPNOTSUPP ? 0 : rc; > + } > + > /* Get cache region */ > have_cache = virtio_get_shm_region(vdev, _reg, > (u8)VIRTIO_FS_SHMCAP_ID_CACHE); > @@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device > *vdev, struct virtio_fs *fs) > dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len > 0x%llx\n", > __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len); > > - fs->dax_dev = alloc_dax(fs, _fs_dax_ops); > - if (IS_ERR(fs->dax_dev)) > - return PTR_ERR(fs->dax_dev); > - > + fs->dax_dev = no_free_ptr(dax_dev); This works because the internals of virtio_fs_cleanup_dax(), "kill_dax() and put_dax()", know how to handle a NULL @dax_dev. It is still early days with the "cleanup" helpers, but I wonder if anyone else cares that the DEFINE_FREE() above does not check for NULL? I think it is ok, but wanted to point that out for the virtiofs folks and wonder what the style should be for things like this going forward. Other growing pains with "cleanup.h" and ERR_PTR is happening over here: http://lore.kernel.org/r/65ca861e14779_5a7f29...@dwillia2-xfh.jf.intel.com.notmuch ...and that arrived at a similar style as is being used in this patch. In both cases the cleanup function is called on exit with a NULL argument.
Re: [PATCH] nvdimm/pmem: Fix leak on dax_add_host() failure
On Mon, Feb 12, 2024 at 11:27:22AM -0500, Mathieu Desnoyers wrote: > Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done > before setting pmem->dax_dev, which therefore issues the two following > calls on NULL pointers: > > out_cleanup_dax: > kill_dax(pmem->dax_dev); > put_dax(pmem->dax_dev); > > Signed-off-by: Mathieu Desnoyers > Reviewed-by: Dan Williams > Cc: Alasdair Kergon > Cc: Mike Snitzer > Cc: Mikulas Patocka > Cc: Andrew Morton > Cc: Linus Torvalds > Cc: Dan Williams > Cc: Vishal Verma > Cc: Dave Jiang > Cc: Matthew Wilcox > Cc: Arnd Bergmann > Cc: Russell King > Cc: linux-a...@vger.kernel.org > Cc: linux-...@vger.kernel.org > Cc: linux-fsde...@vger.kernel.org > Cc: linux...@kvack.org > Cc: linux-...@vger.kernel.org > Cc: dm-devel@lists.linux.dev > Cc: nvd...@lists.linux.dev Reviewed-by: Fan Ni > --- > drivers/nvdimm/pmem.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 4e8fdcb3f1c8..9fe358090720 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev, > set_dax_nomc(dax_dev); > if (is_nvdimm_sync(nd_region)) > set_dax_synchronous(dax_dev); > + pmem->dax_dev = dax_dev; > rc = dax_add_host(dax_dev, disk); > if (rc) > goto out_cleanup_dax; > dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); > - pmem->dax_dev = dax_dev; > - > rc = device_add_disk(dev, disk, pmem_attribute_groups); > if (rc) > goto out_remove_host; > -- > 2.39.2 >
Re: linux-next: Tree for Feb 12 (drivers/md/dm-vdo/thread-registry.c on arch/loongarch/)
Hi all, On Mon, 12 Feb 2024 12:33:08 -0800 Randy Dunlap wrote: > > on loongarch: > > ../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_register_thread': > ../drivers/md/dm-vdo/thread-registry.c:32:28: error: 'current' undeclared > (first use in this function) >32 | new_thread->task = current; > |^~~ > ../drivers/md/dm-vdo/thread-registry.c:32:28: note: each undeclared > identifier is reported only once for each function it appears in > ../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_unregister_thread': > ../drivers/md/dm-vdo/thread-registry.c:61:37: error: 'current' undeclared > (first use in this function) >61 | if (thread->task == current) { > | ^~~ > ../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_lookup_thread': > ../drivers/md/dm-vdo/thread-registry.c:84:37: error: 'current' undeclared > (first use in this function) >84 | if (thread->task == current) { > | ^~~ Caused by commit bf28b754d024 ("dm vdo: add thread and synchronization utilities") from the device-mapper tree. -- Cheers, Stephen Rothwell pgp7VS_uhBgoH.pgp Description: OpenPGP digital signature
Re: linux-next: Tree for Feb 12 (drivers/md/dm-vdo/thread-registry.c on arch/loongarch/)
On 2/11/24 20:31, Stephen Rothwell wrote: > Hi all, > > Changes since 20240209: > > The mm tree lost its boot failure. on loongarch: ../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_register_thread': ../drivers/md/dm-vdo/thread-registry.c:32:28: error: 'current' undeclared (first use in this function) 32 | new_thread->task = current; |^~~ ../drivers/md/dm-vdo/thread-registry.c:32:28: note: each undeclared identifier is reported only once for each function it appears in ../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_unregister_thread': ../drivers/md/dm-vdo/thread-registry.c:61:37: error: 'current' undeclared (first use in this function) 61 | if (thread->task == current) { | ^~~ ../drivers/md/dm-vdo/thread-registry.c: In function 'vdo_lookup_thread': ../drivers/md/dm-vdo/thread-registry.c:84:37: error: 'current' undeclared (first use in this function) 84 | if (thread->task == current) { | ^~~ -- #Randy
Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
On 2/11/24 17:09, Damien Le Moal wrote: On 2/11/24 12:40, Bart Van Assche wrote: For the use cases I'm interested in a hash table implementation that supports RCU-lookups probably will work better than an xarray. I think that the hash table implementation in supports RCU for lookups, insertion and removal. It does, but the API for it is not the easiest, and I do not see how that could be faster than an xarray, especially as the number of zones grows with high capacity devices (read: potentially more collisions which will slow zone plug lookups). From the xarray documentation: "The XArray implementation is efficient when the indices used are densely clustered". I think we are dealing with a sparse array and hence that an xarray may not be the best suited data structure. How about using a hash table and making the hash table larger if the number of open zones equals the hash table size? That is possible as follows: * Instead of using DEFINE_HASHTABLE() or DECLARE_HASHTABLE(), allocate the hash table dynamically and use the struct hlist_head __rcu * data type. * Use rcu_assign_pointer() to modify that pointer and kfree_rcu() to free old versions of the hash table. * Use rcu_dereference_protected() for hash table lookups. For an example, see also the output of the following command: $ git grep -nHw state_bydst Thanks, Bart.
Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
On 2/12/24 00:47, Damien Le Moal wrote: Replying to myself as I had an idea: 1) Store the zone capacity in a separate array: 4B * nr_zones needed. Storing "0" as a value for a zone in that array would indicate that the zone is conventional. No additional zone bitmap needed. 2) Use a sparse xarray for managing allocated zone write plugs: 64B per allocated zone write plug needed, which for an SMR drive would generally be at most 128 * 64B = 8K. So for an SMR drive with 100,000 zones, that would be a total of 408 KB, instead of the current 1.6 MB. Will try to prototype this to see how performance goes (I am worried about the xarray lookup overhead in the hot path). Hi Damien, Are there any zoned devices where the sequential write required zones occur before the conventional zones? If not, does this mean that the conventional zones always occur before the write pointer zones and also that storing the number of conventional zones is sufficient? Are there zoned storage devices where each zone has a different capacity? I have not yet encountered any such device. I'm wondering whether a single capacity variable would be sufficient for the entire device. Thank you, Bart.
Re: [bug report] dm vdo: add statistics reporting
Hi Dan, Thanks for this report (and the others). It looks like we may have changed our minds about how to do this formatting at some point but never cleaned it up. I'll put this on the list of things to improve. Thanks, matt On 2/9/24 08:05, Dan Carpenter wrote: Hello Matthew Sakai, The patch 25479d97b12c: "dm vdo: add statistics reporting" from Nov 16, 2023 (linux-next), leads to the following (unpublished) Smatch static checker warning: drivers/md/dm-vdo/funnel-workqueue.c:545 vdo_dump_completion_to_buffer() warn: why check scnprintf return value? drivers/md/dm-vdo/message-stats.c:26 write_u64() warn: why check scnprintf return value? drivers/md/dm-vdo/message-stats.c:43 write_u32() warn: why check scnprintf return value? drivers/md/dm-vdo/message-stats.c:60 write_block_count_t() warn: why check scnprintf return value? drivers/md/dm-vdo/message-stats.c:77 write_string() warn: why check scnprintf return value? drivers/md/dm-vdo/message-stats.c:94 write_bool() warn: why check scnprintf return value? drivers/md/dm-vdo/message-stats.c:111 write_u8() warn: why check scnprintf return value? drivers/md/dm-vdo/message-stats.c 99 static int write_u8(char *prefix, 100 u8 value, 101 char *suffix, 102 char **buf, 103 unsigned int *maxlen) 104 { 105 int count = scnprintf(*buf, *maxlen, "%s%u%s", 106 prefix == NULL ? "" : prefix, 107 value, 108 suffix == NULL ? "" : suffix); 109 *buf += count; 110 *maxlen -= count; --> 111 if (count >= *maxlen) snprintf() returns the number of bytes that would have been printed if we had enough space and scnprintf() returns the number of bytes that were printed. (Both functions exclude the NUL terminator from the returned count). It seldom makes sense to check the return from scnprintf(). In this case count can never be >= *maxlen. The most it can be is *maxlen - 1. 112 return VDO_UNEXPECTED_EOF; 113 return VDO_SUCCESS; 114 } The code in vdo_dump_completion_to_buffer() is almost basically correct. It's just off by one in a totally harmless way and I wouldn't have reported it if not for these other impossible conditions. drivers/md/dm-vdo/funnel-workqueue.c 538 void vdo_dump_completion_to_buffer(struct vdo_completion *completion, char *buffer, 539 size_t length) 540 { 541 size_t current_length = 542 scnprintf(buffer, length, "%.*s/", TASK_COMM_LEN, 543(completion->my_queue == NULL ? "-" : completion->my_queue->name)); 544 545 if (current_length < length - 1) { The highest valid return is current_length == length - 1 but that's treated as invalid. If we changed to snprintf() then we could fix the condition to allow the final character: current_length = snprintf(...); if (current_length <= length - 1) { 546 get_function_name((void *) completion->callback, buffer + current_length, 547length - current_length); 548 } 549 } regards, dan carpenter
Re: [PATCH] nvdimm/pmem: Fix leak on dax_add_host() failure
On 2/12/24 9:27 AM, Mathieu Desnoyers wrote: > Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done > before setting pmem->dax_dev, which therefore issues the two following > calls on NULL pointers: > > out_cleanup_dax: > kill_dax(pmem->dax_dev); > put_dax(pmem->dax_dev); > > Signed-off-by: Mathieu Desnoyers > Reviewed-by: Dan Williams > Cc: Alasdair Kergon > Cc: Mike Snitzer > Cc: Mikulas Patocka > Cc: Andrew Morton > Cc: Linus Torvalds > Cc: Dan Williams > Cc: Vishal Verma > Cc: Dave Jiang > Cc: Matthew Wilcox > Cc: Arnd Bergmann > Cc: Russell King > Cc: linux-a...@vger.kernel.org > Cc: linux-...@vger.kernel.org > Cc: linux-fsde...@vger.kernel.org > Cc: linux...@kvack.org > Cc: linux-...@vger.kernel.org > Cc: dm-devel@lists.linux.dev > Cc: nvd...@lists.linux.dev Reviewed-by: Dave Jiang > --- > drivers/nvdimm/pmem.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 4e8fdcb3f1c8..9fe358090720 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev, > set_dax_nomc(dax_dev); > if (is_nvdimm_sync(nd_region)) > set_dax_synchronous(dax_dev); > + pmem->dax_dev = dax_dev; > rc = dax_add_host(dax_dev, disk); > if (rc) > goto out_cleanup_dax; > dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); > - pmem->dax_dev = dax_dev; > - > rc = device_add_disk(dev, disk, pmem_attribute_groups); > if (rc) > goto out_remove_host;
[PATCH v5 8/8] dax: Fix incorrect list of data cache aliasing architectures
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased data cache. This is a regression introduced in the v4.0 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no data cache aliasing, and therefore should work fine with FS_DAX. This was turned into the following check in alloc_dax() by a preparatory change: if (ops && (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_MIPS) || IS_ENABLED(CONFIG_SPARC))) return NULL; Use cpu_dcache_is_aliasing() instead to figure out whether the environment has aliasing data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index ce5bffa86bba..a21a7c262382 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "dax-private.h" /** @@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) * except for device-dax (NULL operations pointer), which does * not use aliased mappings from the kernel. */ - if (ops && (IS_ENABLED(CONFIG_ARM) || - IS_ENABLED(CONFIG_MIPS) || - IS_ENABLED(CONFIG_SPARC))) + if (ops && cpu_dcache_is_aliasing()) return ERR_PTR(-EOPNOTSUPP); if (WARN_ON_ONCE(ops && !ops->zero_page_range)) -- 2.39.2
[PATCH v5 7/8] Introduce cpu_dcache_is_aliasing() across all architectures
Introduce a generic way to query whether the data cache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For data cache aliasing, there are three scenarios dependending on the architecture. Here is a breakdown based on my understanding: A) The data cache is always aliasing: * arc * csky * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.) * sh * parisc B) The data cache aliasing is statically known or depends on querying CPU state at runtime: * arm (cache_is_vivt() || cache_is_vipt_aliasing()) * mips (cpu_has_dc_aliases) * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE) * sparc32 (vac_cache_size > PAGE_SIZE) * sparc64 (L1DCACHE_SIZE > PAGE_SIZE) * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE) C) The data cache is never aliasing: * alpha * arm64 (aarch64) * hexagon * loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE")) * microblaze * openrisc * powerpc * riscv * s390 * um * x86 Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and implement "cpu_dcache_is_aliasing()". Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus cpu_dcache_is_aliasing() simply evaluates to "false". Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future work. This would be useful to gate features like XIP on architectures which have aliasing CPU dcache-icache but not CPU dcache-dcache. Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache" to clarify that we really mean "CPU data cache" and "CPU cache" to eliminate any possible confusion with VFS "dentry cache" and "page cache". Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h | 6 ++ mm/Kconfig | 6 ++ 22 files changed, 112 insertions(+) create mode 100644 arch/arc/include/asm/cachetype.h create mode 100644 arch/csky/include/asm/cachetype.h create mode 100644 arch/m68k/include/asm/cachetype.h create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/parisc/include/asm/cachetype.h create mode 100644 arch/sh/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 1b0483c51cc1..7d294a3242a4 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,6 +6,7 @@ config ARC def_bool y select ARC_TIMERS + select ARCH_HAS_CPU_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h new file mode 100644 index ..05fc7ed59712 --- /dev/null +++ b/arch/arc/include/asm/cachetype.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_ARC_CACHETYPE_H +#define __ASM_ARC_CACHETYPE_H + +#include + +#define cpu_dcache_is_aliasing() true + +#endif diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f8567e95f98b..cd13b1788973 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -5,6 +5,7 @@ config ARM select ARCH_32BIT_OFF_T select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
[PATCH v5 6/8] dax: Check for data cache aliasing at runtime
Replace the following fs/Kconfig:FS_DAX dependency: depends on !(ARM || MIPS || SPARC) By a runtime check within alloc_dax(). This runtime check returns ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means the kernel is using an aliased mapping) on an architecture which has data cache aliasing. Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n for consistency. This is done in preparation for using cpu_dcache_is_aliasing() in a following change which will properly support architectures which detect data cache aliasing at runtime. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 10 ++ fs/Kconfig | 1 - 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 205b888d45bf..ce5bffa86bba 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -450,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) dev_t devt; int minor; + /* +* Unavailable on architectures with virtually aliased data caches, +* except for device-dax (NULL operations pointer), which does +* not use aliased mappings from the kernel. +*/ + if (ops && (IS_ENABLED(CONFIG_ARM) || + IS_ENABLED(CONFIG_MIPS) || + IS_ENABLED(CONFIG_SPARC))) + return ERR_PTR(-EOPNOTSUPP); + if (WARN_ON_ONCE(ops && !ops->zero_page_range)) return ERR_PTR(-EINVAL); diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..e5efdb3b276b 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,6 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX -- 2.39.2
[PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of virtio virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Co-developed-by: Dan Williams Signed-off-by: Dan Williams Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- fs/fuse/virtio_fs.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 5f1be1da92ce..f9acd9972af2 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "fuse_i.h" @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data) put_dax(dax_dev); } +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR(_T)) virtio_fs_cleanup_dax(_T)) + static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) { + struct dax_device *dax_dev __free(cleanup_dax) = ERR_PTR(-EOPNOTSUPP); struct virtio_shm_region cache_reg; struct dev_pagemap *pgmap; bool have_cache; @@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) if (!IS_ENABLED(CONFIG_FUSE_DAX)) return 0; + dax_dev = alloc_dax(fs, _fs_dax_ops); + if (IS_ERR(dax_dev)) { + int rc = PTR_ERR(dax_dev); + return rc == -EOPNOTSUPP ? 0 : rc; + } + /* Get cache region */ have_cache = virtio_get_shm_region(vdev, _reg, (u8)VIRTIO_FS_SHMCAP_ID_CACHE); @@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len); - fs->dax_dev = alloc_dax(fs, _fs_dax_ops); - if (IS_ERR(fs->dax_dev)) - return PTR_ERR(fs->dax_dev); - + fs->dax_dev = no_free_ptr(dax_dev); return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax, fs->dax_dev); } -- 2.39.2
[PATCH v5 4/8] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dcssblk dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures. Considering that s390 is not a data cache aliasing architecture, and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP from alloc_dax() should make dcssblk_add_store() fail. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org --- drivers/s390/block/dcssblk.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index 4b7ecd4fd431..f363c1d51d9a 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char int rc, i, j, num_of_segments; struct dcssblk_dev_info *dev_info; struct segment_info *seg_info, *temp; + struct dax_device *dax_dev; char *local_buf; unsigned long seg_byte_size; @@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char if (rc) goto put_dev; - dev_info->dax_dev = alloc_dax(dev_info, _dax_ops); - if (IS_ERR(dev_info->dax_dev)) { - rc = PTR_ERR(dev_info->dax_dev); - dev_info->dax_dev = NULL; + dax_dev = alloc_dax(dev_info, _dax_ops); + if (IS_ERR(dax_dev)) { + rc = PTR_ERR(dax_dev); goto put_dev; } - set_dax_synchronous(dev_info->dax_dev); + set_dax_synchronous(dax_dev); + dev_info->dax_dev = dax_dev; rc = dax_add_host(dev_info->dax_dev, dev_info->gd); if (rc) goto out_dax; -- 2.39.2
[PATCH v5 3/8] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dm alloc_dev() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Suggested-by: Dan Williams Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/md/dm.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 23c32cd1f1d8..acdc00bc05be 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device *md) static struct mapped_device *alloc_dev(int minor) { int r, numa_node_id = dm_get_numa_node(); + struct dax_device *dax_dev; struct mapped_device *md; void *old_md; @@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor) md->disk->private_data = md; sprintf(md->disk->disk_name, "dm-%d", minor); - if (IS_ENABLED(CONFIG_FS_DAX)) { - md->dax_dev = alloc_dax(md, _dax_ops); - if (IS_ERR(md->dax_dev)) { - md->dax_dev = NULL; + dax_dev = alloc_dax(md, _dax_ops); + if (IS_ERR(dax_dev)) { + if (PTR_ERR(dax_dev) != -EOPNOTSUPP) goto bad; - } - set_dax_nocache(md->dax_dev); - set_dax_nomc(md->dax_dev); - if (dax_add_host(md->dax_dev, md->disk)) + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + md->dax_dev = dax_dev; + if (dax_add_host(dax_dev, md->disk)) goto bad; } -- 2.39.2
[PATCH v5 2/8] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of nvdimm/pmem pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. [ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ] Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 9fe358090720..e9898457a7bd 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev, dax_dev = alloc_dax(pmem, _dax_ops); if (IS_ERR(dax_dev)) { rc = PTR_ERR(dax_dev); - goto out; + if (rc != -EOPNOTSUPP) + goto out; + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + if (is_nvdimm_sync(nd_region)) + set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; + rc = dax_add_host(dax_dev, disk); + if (rc) + goto out_cleanup_dax; + dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); } - set_dax_nocache(dax_dev); - set_dax_nomc(dax_dev); - if (is_nvdimm_sync(nd_region)) - set_dax_synchronous(dax_dev); - pmem->dax_dev = dax_dev; - rc = dax_add_host(dax_dev, disk); - if (rc) - goto out_cleanup_dax; - dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; -- 2.39.2
[PATCH v5 0/8] Introduce cpu_dcache_is_aliasing() to fix DAX regression
This commit introduced in v4.0 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased data caches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using DAX over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliasing data caches (VIVT and VIPT with aliasing data cache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings with virtually aliasing data caches. This patch series introduces cpu_dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all architectures. The implementation of cpu_dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Testing done so far: - Compile allyesconfig on x86-64, - Compile allyesconfig on x86-64, with FS_DAX=n. - Compile allyesconfig on x86-64, with DAX=n. - Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP even on x86-64, thus simulating the behavior expected on an architecture with data cache aliasing. There are many more axes to test however. I would welcome Tested-by for: - affected architectures, - affected drivers, - affected filesytems. [ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ] Thanks, Mathieu Changes since v4: - Move the change which makes alloc_dax() return ERR_PTR(-EOPNOTSUPP) when CONFIG_DAX=n earlier in the series, - Fold driver cleanup patches into their respective per-driver changes. - Move "nvdimm/pmem: Fix leak on dax_add_host() failure" outside of this series. Changes since v3: - Fix a leak on dax_add_host() failure in nvdimm/pmem. - Split the series into a bissectable sequence of changes. - Ensure that device-dax use-cases still works on data cache aliasing architectures. Changes since v2: - Move DAX supported runtime check to alloc_dax(), - Modify DM to handle alloc_dax() error as non-fatal, - Remove all filesystem modifications, since the check is now done by alloc_dax(), - rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to eliminate confusion with VFS terminology. Changes since v1: - The order of the series was completely changed based on the feedback received on v1, - cache_is_aliasing() is renamed to dcache_is_aliasing(), - ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone, - dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is simplified, - the dax_is_supported() check was moved to its rightful place in all filesystems. Cc: Andrew Morton Cc: Linus Torvalds Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org Mathieu Desnoyers (8): dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dcssblk: Handle alloc_dax() -EOPNOTSUPP failure virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dax: Check for data cache aliasing at runtime Introduce cpu_dcache_is_aliasing() across all architectures dax: Fix incorrect list of data cache aliasing architectures arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ drivers/dax/super.c | 14 ++ drivers/md/dm.c | 17 + drivers/nvdimm/pmem.c | 22 -- drivers/s390/block/dcssblk.c| 11 ++- fs/Kconfig | 1 - fs/fuse/virtio_fs.c | 15 +++
[PATCH v5 1/8] dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n
Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y never returns NULL. This is done in preparation for using cpu_dcache_is_aliasing() in a following change which will properly support architectures which detect data cache aliasing at runtime. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 5 + include/linux/dax.h | 6 +- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 0da9232ea175..205b888d45bf 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive); * that any fault handlers or operations that might have seen * dax_alive(), have completed. Any operations that start after * synchronize_srcu() has run will abort upon seeing !dax_alive(). + * + * Note, because alloc_dax() returns an ERR_PTR() on error, callers + * typically store its result into a local variable in order to check + * the result. Therefore, care must be taken to populate the struct + * device dax_dev field make sure the dax_dev is not leaked. */ void kill_dax(struct dax_device *dax_dev) { diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..df2d52b8a245 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev) static inline struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) { - /* -* Callers should check IS_ENABLED(CONFIG_DAX) to know if this -* NULL is an error or expected. -*/ - return NULL; + return ERR_PTR(-EOPNOTSUPP); } static inline void put_dax(struct dax_device *dax_dev) { -- 2.39.2
[PATCH] nvdimm/pmem: Fix leak on dax_add_host() failure
Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done before setting pmem->dax_dev, which therefore issues the two following calls on NULL pointers: out_cleanup_dax: kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 4e8fdcb3f1c8..9fe358090720 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev, set_dax_nomc(dax_dev); if (is_nvdimm_sync(nd_region)) set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; rc = dax_add_host(dax_dev, disk); if (rc) goto out_cleanup_dax; dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); - pmem->dax_dev = dax_dev; - rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; -- 2.39.2
Re: [PATCH v3 0/5] block: remove gfp_mask for blkdev_zone_mgmt()
On Sun, 28 Jan 2024 23:52:15 -0800, Johannes Thumshirn wrote: > Fueled by the LSFMM discussion on removing GFP_NOFS initiated by Willy, > I've looked into the sole GFP_NOFS allocation in zonefs. As it turned out, > it is only done for zone management commands and can be removed. > > After digging into more callers of blkdev_zone_mgmt() I came to the > conclusion that the gfp_mask parameter can be removed alltogether. > > [...] Applied, thanks! [1/5] zonefs: pass GFP_KERNEL to blkdev_zone_mgmt() call commit: 9105ce591b424771b1502ef9836ca7953c3e0af4 [2/5] dm: dm-zoned: guard blkdev_zone_mgmt with noio scope commit: 218082010aceb40b5495ebc30028ede6e30ee755 [3/5] btrfs: zoned: call blkdev_zone_mgmt in nofs scope commit: d9d556755f16f6af8d1d8ebac38b83a9263394c5 [4/5] f2fs: guard blkdev_zone_mgmt with nofs scope commit: 147ec1c60e3273d21ea1f212c6636f231d6d2771 [5/5] block: remove gfp_flags from blkdev_zone_mgmt commit: 71f4ecdbb42addf82b01b734b122a02707fed521 Best regards, -- Jens Axboe
Re: About DM_UDEV_DISABLE_OTHER_RULES_FLAG and DM_NOSCAN
On Mon, 2024-02-12 at 13:32 +0100, Peter Rajnoha wrote: > On 2/12/24 12:09, Martin Wilck wrote: > > > Right, underneath within DM/DM-subystem, we should be able to keep > and > restore those reasons for why it has been flagged with > DM_UDEV_DISABLE_OTHER_RULES_FLAG till now and when the state is good > enough that we can drop it, we would do it transparently for higher > (non-dm and non-dm-subsystem) layers. So if there's a case that is > not > currently handled by 10-dm.rules or 11-dm-.rules, we can > fix > that there. If it's a generic rule that applies to all DM, not just > subystem, then yes, we can move that to 10-dm.rules (will have a look > at > your patch [1]). I don't think that patch can be used as-is for DM. For multipath, I'm not aware of any situation where DM_UDEV_DISABLE_OTHER_RULES_FLAG would be set in the udev cookie, therefore DM_UDEV_DISABLE_OTHER_RULES_FLAG is essentially an alias for DM_SUSPENDED before 11-dm-mpath.rules changes it. That's not the case for other targets, in particular LVM. > > > > Well, IIUC the main reason that systemd couldn't use > > DM_UDEV_DISABLE_OTHER_RULES_FLAG was at least in part due to the > > fact > > that the multipath rules used it in a special way that was > > inconsistent > > with the rest of DM ;-) > > > Aha! Well, honestly, I don't remember the exact details and context > of > that fix, but I know we haven't found a better way... > > > > > I think there are 3 variants of "unusable": > > > > a) temporarily unusable (just for this event), ignore this uevent > > and > > restore previous properties from db. > > b) unusable, avoid IO, don't scan, don't activate (this is how we > > use > > DM_UDEV_DISABLE_OTHER_RULES_FLAG). Upper layers will usually load > > saved > > properties from udev db in this case, too. > > c) like b), but also try to unmount / unconfigure if already used. > > This > > is SYSTEMD_READY=0. I don't think DM has a flag with these > > semantics at > > this time. I can imagine such a flag being set if a device was > > reloaded > > with an incompatible table, but that's rather a corner case. > > > > It's an honorable goal to condense everything into a single > > variable > > for consumer rules, but I think it doesn't work if we want the > > upper > > layers to be able to distinguish these. We can merge a) and b) I > > think, > > because their meaning for upper layers is practically the same, if > > we > > get the saving and restoring right. > > > > The c) case - well, it's questionable what should be done in that > case, > because that means we have literally cut off the underlying device > while > it was still in use. Any further IO from higher layers will return IO > errors, will queue IOs or, in the worse case, issue IOs to a device > that > we don't want to anymore. > > Anyway, if I understand correctly, we simply need to signal higher > layers either: > > A) device is unusable, forget it and clear all your current extra > records you have about this device (including removing any custom > symlinks for udev). That would also map to SYSTEMD_READY=0. > > B) device is unusable temporarily, restore any records you need, > wait > for the DM_UDEV_DISABLE_OTHER_RULES_FLAG=1 to drop to 0 (or being > unset). > > What do you think about keeping a single > DM_UDEV_DISABLE_OTHER_RULES_FLAG for this, just having a different > value, say "2" to denote the B case? Otherwise, we need 2 distinct > variables (which is harder for others to accept I bet). Yes, that could work, if the save / restore is implemented cleanly. > > > > The DM_NOSCAN was actually just a helper and a more human > > > readable > > > name > > > for "DM_SUSBYSTEM_UDEV_FLAG0" within LVM subsystem *only* at > > > first. > > > It is used during LV initialization - the wiping/zeroing of the > > > LV > > > before it is pronounced as usable - that's why, when it is set, > > > we > > > signal to "others" the DM_UDEV_DISABLE_OTHER_RULES_FLAG based on > > > this > > > flag. However, since we have the 13-dm-disk.rules which manages > > > the > > > blkid call for DM devices (and which is ours - owned by DM), we > > > need > > > to > > > signal these rules to avoid calling blkid (as it could see > > > uninitiliazed/stale data). In this case, we import any previous > > > scan > > > results from udev db. It's not like "completely ignore and skip > > > rules > > > for this event". > > > > I'm not sure if I understand what you mean with "completely ignore > > and > > skip". Upper-level rules usually can't "completely ignore" an > > uevent if > > they need to preserve any udev properties across the current event. > > If > > the device is unusable in the weaker "don't try IO" sense, the > > upper > > rules must import existing properties from the db, just like 13-dm- > > disk.rules, lest the properties be forgotten by udev. IMO this > > weaker > > semantics (corresponding to b) above) is what matters most for > > upper > > level rules. > > I didn't consider that there
Re: About DM_UDEV_DISABLE_OTHER_RULES_FLAG and DM_NOSCAN
On 2/12/24 12:09, Martin Wilck wrote: > On Mon, 2024-02-12 at 10:51 +0100, Peter Rajnoha wrote: >> On 2/9/24 22:58, Martin Wilck wrote: >>> Hi Zdenek, Peter, David, Ben, >>> >>> I have been pondering the device-mapper udev rules a lot lately, >>> and I >>> believe I have found glitches in the logic of the device mapper >>> udev >>> flags that I'd like to bring to your attention. >>> >>> TL;DR: change I propose: >>> >>> * use DM_DISABLE_OTHER_RULES_FLAG consistently as a flag meaning >>> "upper layers should leave this device alone until told >>> otherwise", >>> saved between uevents in the udev db >>> * use DM_SUSPENDED consistently as a flag meaning "upper layers >>> should keep their hands off this device temporarily", not saved >>> in the udev db. >>> * don't use any other flags in upper layers, including 13-dm- >>> disk.rules. >>> >>> This implies: >>> >>> * stop setting DM_DISABLE_OTHER_RULES_FLAG from DM_SUSPENDED >>> in 10-dm.rules >>> * check DM_DISABLE_OTHER_RULES_FLAG instead of DM_NOSCAN in >>> 13-dm-disk.rules >>> * check DM_SUSPENDED in 69-dm-lvm.rules, 80-udisks2.rules >>> * stop using DM_NOSCAN in 11-dm-mpath.rules and 66-kpartx.rules >> >> Hi Martin! >> >> It is surely possible we could do some improvements and optimizations >> here - all the rules are a result of long evolution where we were >> fixing >> various issues on the way. So if we could shoot down some >> complexities, >> that would be great... >> >> On the other hand, we need to be really careful, because even a tiny >> change here can cause troubles if we omit some case. >> >> I'll surely go through your suggestions, thanks for diving deep into >> that! Just please give me some time so I'll remap all the paths >> through >> the rules in my head :) >> >> Note that we're working with 3 levels of logical rule separation >> here: >> >> 1) DM base (10-dm.rules, 13-dm-disk.rules, 95-dm-notify.rules) >> 2) DM subsystem (11-dm-lvm.rules, the mpath rules, ...) >> 3) all the "other" >> >> As for the very original intentions of the >> DM_UDEV_DISABLE_OTHER_RULES_FLAG - this one was supposed to be the >> *only >> one* to check for in "other" rules so they don't need to bother about >> checking other states/variables ("other" here means other than DM and >> any DM subsystem). One simple variable to check for others - an ideal >> - >> otherwise, we would be having a hard time to persuade others why they >> need to add complex checks in their rules/applications. > > Yes, I understand that, and it makes a lot of sense. > > The problem arises by saving and restoring to/from the udev db. We've > got different possible *reasons* (inputs) for a device becoming > unusable. The value we communicate to upper layers should arguably be a > "logical or" of all these. But then we can't use a single variable to > save and restore the state; we must determine the value of all "reason" > flags (either directly or by restoring from the db, as appropriate for > the flag in question) and "or"-them into a single "dontuse" flag. > > Example: Currently, we set DM_UDEV_DISABLE_OTHER_RULES_FLAG if > DM_SUSPENDED is set. If the next spurious uevent arrives, we see > DM_UDEV_DISABLE_OTHER_RULES_FLAG set, but we don't know if the origin > was DM_SUSPENDED or a genuine udevflag. In the former case, the device > would now be usable, while in the latter case it most probably > wouldn't. For now, I've posted a patch to fix this for the multipath > rules [1], but I'd like to see it fixed in the general DM rules. > Right, underneath within DM/DM-subystem, we should be able to keep and restore those reasons for why it has been flagged with DM_UDEV_DISABLE_OTHER_RULES_FLAG till now and when the state is good enough that we can drop it, we would do it transparently for higher (non-dm and non-dm-subsystem) layers. So if there's a case that is not currently handled by 10-dm.rules or 11-dm-.rules, we can fix that there. If it's a generic rule that applies to all DM, not just subystem, then yes, we can move that to 10-dm.rules (will have a look at your patch [1]). >> Yes, there's the unlucky exception in the 99-systemd.rules which >> needs >> to check the DM_SUSPENDED to import the "SYSTEMD_READY" from udev db, >> otherwise we had been running into an issue with stopping systemd >> device >> units prematurely (actually, this was an issue spotted much much >> later >> on after years of using the rules without this rule). >> And we haven't >> found a better way to check for this condition. This is mainly >> because >> systemd is also a "device" manager/tracker in some way through its >> "device" units, besides udev itself. I simply gave up there and >> admitted >> the check for DM_SUSPENDED case. > > Well, IIUC the main reason that systemd couldn't use > DM_UDEV_DISABLE_OTHER_RULES_FLAG was at least in part due to the fact > that the multipath rules used it in a special way that was inconsistent > with the rest of DM ;-) Aha! Well, honestly, I don't
Re: About DM_UDEV_DISABLE_OTHER_RULES_FLAG and DM_NOSCAN
On Mon, 2024-02-12 at 10:51 +0100, Peter Rajnoha wrote: > On 2/9/24 22:58, Martin Wilck wrote: > > Hi Zdenek, Peter, David, Ben, > > > > I have been pondering the device-mapper udev rules a lot lately, > > and I > > believe I have found glitches in the logic of the device mapper > > udev > > flags that I'd like to bring to your attention. > > > > TL;DR: change I propose: > > > > * use DM_DISABLE_OTHER_RULES_FLAG consistently as a flag meaning > > "upper layers should leave this device alone until told > > otherwise", > > saved between uevents in the udev db > > * use DM_SUSPENDED consistently as a flag meaning "upper layers > > should keep their hands off this device temporarily", not saved > > in the udev db. > > * don't use any other flags in upper layers, including 13-dm- > > disk.rules. > > > > This implies: > > > > * stop setting DM_DISABLE_OTHER_RULES_FLAG from DM_SUSPENDED > > in 10-dm.rules > > * check DM_DISABLE_OTHER_RULES_FLAG instead of DM_NOSCAN in > > 13-dm-disk.rules > > * check DM_SUSPENDED in 69-dm-lvm.rules, 80-udisks2.rules > > * stop using DM_NOSCAN in 11-dm-mpath.rules and 66-kpartx.rules > > Hi Martin! > > It is surely possible we could do some improvements and optimizations > here - all the rules are a result of long evolution where we were > fixing > various issues on the way. So if we could shoot down some > complexities, > that would be great... > > On the other hand, we need to be really careful, because even a tiny > change here can cause troubles if we omit some case. > > I'll surely go through your suggestions, thanks for diving deep into > that! Just please give me some time so I'll remap all the paths > through > the rules in my head :) > > Note that we're working with 3 levels of logical rule separation > here: > > 1) DM base (10-dm.rules, 13-dm-disk.rules, 95-dm-notify.rules) > 2) DM subsystem (11-dm-lvm.rules, the mpath rules, ...) > 3) all the "other" > > As for the very original intentions of the > DM_UDEV_DISABLE_OTHER_RULES_FLAG - this one was supposed to be the > *only > one* to check for in "other" rules so they don't need to bother about > checking other states/variables ("other" here means other than DM and > any DM subsystem). One simple variable to check for others - an ideal > - > otherwise, we would be having a hard time to persuade others why they > need to add complex checks in their rules/applications. Yes, I understand that, and it makes a lot of sense. The problem arises by saving and restoring to/from the udev db. We've got different possible *reasons* (inputs) for a device becoming unusable. The value we communicate to upper layers should arguably be a "logical or" of all these. But then we can't use a single variable to save and restore the state; we must determine the value of all "reason" flags (either directly or by restoring from the db, as appropriate for the flag in question) and "or"-them into a single "dontuse" flag. Example: Currently, we set DM_UDEV_DISABLE_OTHER_RULES_FLAG if DM_SUSPENDED is set. If the next spurious uevent arrives, we see DM_UDEV_DISABLE_OTHER_RULES_FLAG set, but we don't know if the origin was DM_SUSPENDED or a genuine udevflag. In the former case, the device would now be usable, while in the latter case it most probably wouldn't. For now, I've posted a patch to fix this for the multipath rules [1], but I'd like to see it fixed in the general DM rules. > Yes, there's the unlucky exception in the 99-systemd.rules which > needs > to check the DM_SUSPENDED to import the "SYSTEMD_READY" from udev db, > otherwise we had been running into an issue with stopping systemd > device > units prematurely (actually, this was an issue spotted much much > later > on after years of using the rules without this rule). > And we haven't > found a better way to check for this condition. This is mainly > because > systemd is also a "device" manager/tracker in some way through its > "device" units, besides udev itself. I simply gave up there and > admitted > the check for DM_SUSPENDED case. Well, IIUC the main reason that systemd couldn't use DM_UDEV_DISABLE_OTHER_RULES_FLAG was at least in part due to the fact that the multipath rules used it in a special way that was inconsistent with the rest of DM ;-) I think there are 3 variants of "unusable": a) temporarily unusable (just for this event), ignore this uevent and restore previous properties from db. b) unusable, avoid IO, don't scan, don't activate (this is how we use DM_UDEV_DISABLE_OTHER_RULES_FLAG). Upper layers will usually load saved properties from udev db in this case, too. c) like b), but also try to unmount / unconfigure if already used. This is SYSTEMD_READY=0. I don't think DM has a flag with these semantics at this time. I can imagine such a flag being set if a device was reloaded with an incompatible table, but that's rather a corner case. It's an honorable goal to condense everything into a single variable for consumer rules, but I think
[PATCH] dm vdo: fix an assert statement in start_restoring_volume_sub_index()
Use "==" instead of "=" in ASSERT() statement. Fixes: ef074a31e88e ("dm vdo: implement the volume index") Signed-off-by: Harshit Mogalapalli --- This is based on static analysis with Smatch and only compile tested. --- drivers/md/dm-vdo/indexer/volume-index.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/md/dm-vdo/indexer/volume-index.c b/drivers/md/dm-vdo/indexer/volume-index.c index 5fe34e6c1d9b..d6526fe9bbfc 100644 --- a/drivers/md/dm-vdo/indexer/volume-index.c +++ b/drivers/md/dm-vdo/indexer/volume-index.c @@ -830,7 +830,7 @@ static int start_restoring_volume_sub_index(struct volume_sub_index *sub_index, decode_u32_le(buffer, , _list); decode_u32_le(buffer, , _count); - result = ASSERT(offset = sizeof(buffer), + result = ASSERT(offset == sizeof(buffer), "%zu bytes decoded of %zu expected", offset, sizeof(buffer)); if (result != UDS_SUCCESS) -- 2.39.3
Re: About DM_UDEV_DISABLE_OTHER_RULES_FLAG and DM_NOSCAN
On 2/9/24 22:58, Martin Wilck wrote: > Hi Zdenek, Peter, David, Ben, > > I have been pondering the device-mapper udev rules a lot lately, and I > believe I have found glitches in the logic of the device mapper udev > flags that I'd like to bring to your attention. > > TL;DR: change I propose: > > * use DM_DISABLE_OTHER_RULES_FLAG consistently as a flag meaning > "upper layers should leave this device alone until told otherwise", > saved between uevents in the udev db > * use DM_SUSPENDED consistently as a flag meaning "upper layers > should keep their hands off this device temporarily", not saved > in the udev db. > * don't use any other flags in upper layers, including 13-dm- > disk.rules. > > This implies: > > * stop setting DM_DISABLE_OTHER_RULES_FLAG from DM_SUSPENDED > in 10-dm.rules > * check DM_DISABLE_OTHER_RULES_FLAG instead of DM_NOSCAN in > 13-dm-disk.rules > * check DM_SUSPENDED in 69-dm-lvm.rules, 80-udisks2.rules > * stop using DM_NOSCAN in 11-dm-mpath.rules and 66-kpartx.rules Hi Martin! It is surely possible we could do some improvements and optimizations here - all the rules are a result of long evolution where we were fixing various issues on the way. So if we could shoot down some complexities, that would be great... On the other hand, we need to be really careful, because even a tiny change here can cause troubles if we omit some case. I'll surely go through your suggestions, thanks for diving deep into that! Just please give me some time so I'll remap all the paths through the rules in my head :) Note that we're working with 3 levels of logical rule separation here: 1) DM base (10-dm.rules, 13-dm-disk.rules, 95-dm-notify.rules) 2) DM subsystem (11-dm-lvm.rules, the mpath rules, ...) 3) all the "other" As for the very original intentions of the DM_UDEV_DISABLE_OTHER_RULES_FLAG - this one was supposed to be the *only one* to check for in "other" rules so they don't need to bother about checking other states/variables ("other" here means other than DM and any DM subsystem). One simple variable to check for others - an ideal - otherwise, we would be having a hard time to persuade others why they need to add complex checks in their rules/applications. Yes, there's the unlucky exception in the 99-systemd.rules which needs to check the DM_SUSPENDED to import the "SYSTEMD_READY" from udev db, otherwise we had been running into an issue with stopping systemd device units prematurely (actually, this was an issue spotted much much later on after years of using the rules without this rule). And we haven't found a better way to check for this condition. This is mainly because systemd is also a "device" manager/tracker in some way through its "device" units, besides udev itself. I simply gave up there and admitted the check for DM_SUSPENDED case. The DM_NOSCAN was actually just a helper and a more human readable name for "DM_SUSBYSTEM_UDEV_FLAG0" within LVM subsystem *only* at first. It is used during LV initialization - the wiping/zeroing of the LV before it is pronounced as usable - that's why, when it is set, we signal to "others" the DM_UDEV_DISABLE_OTHER_RULES_FLAG based on this flag. However, since we have the 13-dm-disk.rules which manages the blkid call for DM devices (and which is ours - owned by DM), we need to signal these rules to avoid calling blkid (as it could see uninitiliazed/stale data). In this case, we import any previous scan results from udev db. It's not like "completely ignore and skip rules for this event". Other DM subsystems may get into exactly the same situation with initialization as LVM, so that's why it ended up as generic "DM_NOSCAN" so any 11-dm-.rules can use it to signal 13-dm-disk.rules to avoid the blkid scan this way (the "SCAN" here actually means "blkid scan" so the better name would be "DM_NO_BLKID_SCAN" probably). The "DM_NOSCAN" is not meant to be consumed by "others" at all. The other way round, we also don't want (ideally) to check for DM_UDEV_DISABLE_OTHER_RULES flag in DM and DM subysystem rules. This flag is meant for "others" to check. So we have a clear separation of what is signalled withing "DM world" and "the other world". The DM_SUSPENDED flag - well, this was supposed to be just internal to DM and DM subystem rules. All the "others" have the DM_UDEV_DISABLE_OTHER_RULES_FLAG - they simply don't need to bother for what exact reason the device is not usable, whether it is a suspended case or whatever else... The DM udev rules are designed in a way that all the DM_UDEV_DISABLE_* can still be changed in DM/DM-subsystem rules based on further findings and processing of the state. So the original flags from the DM_COOKIE is just a starting point here. -- Peter
Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
On 2/12/24 17:23, Damien Le Moal wrote: > On 2/11/24 12:40, Bart Van Assche wrote: >> On 2/9/24 16:06, Damien Le Moal wrote: >>> On 2/10/24 04:36, Bart Van Assche wrote: written zones is typically less than 10. Hence, tracking the partially written >>> >>> That is far from guaranteed, especially with devices that have no active >>> zone >>> limits like SMR drives. >> >> Interesting. The zoned devices I'm working with try to keep data in memory >> for all zones that are neither empty nor full and hence impose an upper limit >> on the number of open zones. >> >>> But in any case, what exactly is your idea here ? Can you actually suggest >>> something ? Are you suggesting that a sparse array of zone plugs be used, >>> with >>> an rb-tree or an xarray ? If that is what you are thinking, I can already >>> tell >>> you that this is the first thing I tried to do. Early versions of this work >>> used >>> a sparse xarray of zone plugs. But the problem with such approach is that >>> it is >>> a lot more complicated and there is a need for a single lock to manage that >>> structure (which is really not good for performance). >> >> Hmm ... since the xarray data structure supports RCU I think that locking the >> entire xarray is only required if the zone condition changes from empty into >> not empty or from neither empty nor full into full? >> >> For the use cases I'm interested in a hash table implementation that supports >> RCU-lookups probably will work better than an xarray. I think that the hash >> table implementation in supports RCU for lookups, >> insertion >> and removal. > > I spent some time digging into this and also revisiting the possibility of > using > an xarray. Conclusion is that this does not work well, at least in no way not > better than what I did, and most of the time much worse. The reason is that we > need at the very least to keep this information around: > 1) If the zone is conventional or not > 2) The zone capacity of sequential write required zones > > Unless we keep this information, a report zone would be needed before starting > writing to a zone that does not yet have a zone write plug allocated. > > (1) and (2) above can be trivially combined into a single 32-bits value. But > that value must exist for all zones. So at the very least, we need nr_zones * > 4B > of memory allocated at all time. For such case (i.e. non-sparse structure), > xarray or hash table would be more costly in memory than a simple static > array. > > Given that we want to allocate/free zone write plugs dynamically as needed, we > essentially need an array of pointers, so 8B * nr_zones for the base > structure. > From there, ideally, we should be able to use rcu to safely dereference/modify > the array entries. However, static arrays are not supported by the rcu code > from > what I read. > > Given this, my current approach that uses 16B per zone is the next best thing > I > can think of without introducing a single lock for modifying the array > entries. > > If you have any other idea, please share. Replying to myself as I had an idea: 1) Store the zone capacity in a separate array: 4B * nr_zones needed. Storing "0" as a value for a zone in that array would indicate that the zone is conventional. No additional zone bitmap needed. 2) Use a sparse xarray for managing allocated zone write plugs: 64B per allocated zone write plug needed, which for an SMR drive would generally be at most 128 * 64B = 8K. So for an SMR drive with 100,000 zones, that would be a total of 408 KB, instead of the current 1.6 MB. Will try to prototype this to see how performance goes (I am worried about the xarray lookup overhead in the hot path). -- Damien Le Moal Western Digital Research
Re: [PATCH 25/26] block: Reduce zone write plugging memory usage
On 2/11/24 12:40, Bart Van Assche wrote: > On 2/9/24 16:06, Damien Le Moal wrote: >> On 2/10/24 04:36, Bart Van Assche wrote: >>> written zones is typically less than 10. Hence, tracking the partially >>> written >> >> That is far from guaranteed, especially with devices that have no active zone >> limits like SMR drives. > > Interesting. The zoned devices I'm working with try to keep data in memory > for all zones that are neither empty nor full and hence impose an upper limit > on the number of open zones. > >> But in any case, what exactly is your idea here ? Can you actually suggest >> something ? Are you suggesting that a sparse array of zone plugs be used, >> with >> an rb-tree or an xarray ? If that is what you are thinking, I can already >> tell >> you that this is the first thing I tried to do. Early versions of this work >> used >> a sparse xarray of zone plugs. But the problem with such approach is that it >> is >> a lot more complicated and there is a need for a single lock to manage that >> structure (which is really not good for performance). > > Hmm ... since the xarray data structure supports RCU I think that locking the > entire xarray is only required if the zone condition changes from empty into > not empty or from neither empty nor full into full? > > For the use cases I'm interested in a hash table implementation that supports > RCU-lookups probably will work better than an xarray. I think that the hash > table implementation in supports RCU for lookups, > insertion > and removal. I spent some time digging into this and also revisiting the possibility of using an xarray. Conclusion is that this does not work well, at least in no way not better than what I did, and most of the time much worse. The reason is that we need at the very least to keep this information around: 1) If the zone is conventional or not 2) The zone capacity of sequential write required zones Unless we keep this information, a report zone would be needed before starting writing to a zone that does not yet have a zone write plug allocated. (1) and (2) above can be trivially combined into a single 32-bits value. But that value must exist for all zones. So at the very least, we need nr_zones * 4B of memory allocated at all time. For such case (i.e. non-sparse structure), xarray or hash table would be more costly in memory than a simple static array. Given that we want to allocate/free zone write plugs dynamically as needed, we essentially need an array of pointers, so 8B * nr_zones for the base structure. >From there, ideally, we should be able to use rcu to safely dereference/modify the array entries. However, static arrays are not supported by the rcu code from what I read. Given this, my current approach that uses 16B per zone is the next best thing I can think of without introducing a single lock for modifying the array entries. If you have any other idea, please share. -- Damien Le Moal Western Digital Research
Re: [PATCH v3 0/5] block: remove gfp_mask for blkdev_zone_mgmt()
On 29.01.24 08:52, Johannes Thumshirn wrote: > Fueled by the LSFMM discussion on removing GFP_NOFS initiated by Willy, > I've looked into the sole GFP_NOFS allocation in zonefs. As it turned out, > it is only done for zone management commands and can be removed. > > After digging into more callers of blkdev_zone_mgmt() I came to the > conclusion that the gfp_mask parameter can be removed alltogether. > > So this series switches all callers of blkdev_zone_mgmt() to either use > GFP_KERNEL where possible or grab a memalloc_no{fs,io} context. > > The final patch in this series is getting rid of the gfp_mask parameter. > > Link: https://lore.kernel.org/all/zzcgxi46ainlc...@casper.infradead.org/ > > --- > Changes in v3: > - Fix build error after rebase in dm-zoned-metadata.c > - Link to v2: > https://lore.kernel.org/r/20240125-zonefs_nofs-v2-0-2d975c8c1...@wdc.com > > Changes in v2: > - guard blkdev_zone_mgmt in dm-zoned-metadata.c with memalloc_noio context > - Link to v1: > https://lore.kernel.org/r/20240123-zonefs_nofs-v1-0-cc0b0308e...@wdc.com > > --- > Johannes Thumshirn (5): >zonefs: pass GFP_KERNEL to blkdev_zone_mgmt() call >dm: dm-zoned: guard blkdev_zone_mgmt with noio scope >btrfs: zoned: call blkdev_zone_mgmt in nofs scope >f2fs: guard blkdev_zone_mgmt with nofs scope >block: remove gfp_flags from blkdev_zone_mgmt > > block/blk-zoned.c | 19 --- > drivers/md/dm-zoned-metadata.c | 5 - > drivers/nvme/target/zns.c | 5 ++--- > fs/btrfs/zoned.c | 35 +-- > fs/f2fs/segment.c | 15 --- > fs/zonefs/super.c | 2 +- > include/linux/blkdev.h | 2 +- > 7 files changed, 53 insertions(+), 30 deletions(-) > --- > base-commit: 615d300648869c774bd1fe54b4627bb0c20faed4 > change-id: 20240110-zonefs_nofs-dd1e22b2e046 > > Best regards, Hi Jens, now that every patch is reviewed, can you queue this series please? Thanks, Johannes