Re: [lttng-dev] Capturing snapshot on kernel panic
he time > it enters. > > While this doesn't necessarily help your original question of panics, if > you want to snapshot before shutdown or reboot and are using systemd, > it's possible to leave a script or binary in a known directory so that > it's invoked prior to the rest of the shutdown sequence[4]. > > [1]: https://lttng.org/docs/v2.13/#doc-persistent-memory-file-systems <https://lttng.org/docs/v2.13/#doc-persistent-memory-file-systems> > [2]: > https://github.com/systemd/systemd/blob/6533c14997700f74e9ea42121303fc1f5c63e62b/src/shutdown/shutdown.c <https://github.com/systemd/systemd/blob/6533c14997700f74e9ea42121303fc1f5c63e62b/src/shutdown/shutdown.c> > [3]: > https://github.com/systemd/systemd/blob/main/src/shared/reboot-util.c#L77 <https://github.com/systemd/systemd/blob/main/src/shared/reboot-util.c#L77> > [4]: https://www.systutorials.com/docs/linux/man/8-systemd-reboot/ <https://www.systutorials.com/docs/linux/man/8-systemd-reboot/> > > hope this helps, > kienan > >> Would you have any suggestions? >> Thanks for your help, >> Cheers >> Damien >> >> >> >> # Prep output dir >> mkdir /application/trace/ >> rm -rf /application/trace/* >> >> # Create session >> sudo lttng destroy snapshot-trace-session >> sudo lttng create snapshot-trace-session --snapshot >> --output="/application/trace/" >> sudo lttng enable-channel --kernel --num-subbuf=8 channelk >> sudo lttng enable-channel --userspace --num-subbuf=8 channelu >> >> # Configure session >> sudo lttng enable-event --kernel --syscall --all --channel channelk >> sudo lttng enable-event --kernel --tracepoint "sched*" --channel channelk >> sudo lttng enable-event --userspace --all --channel channelu >> sudo lttng add-context -u -t vtid -t procname >> sudo lttng remove-trigger trig_reboot >> sudo lttng add-trigger --name=trig_reboot \ >> --condition=event-rule-matches --type=kernel:syscall:entry \ >> --name=reboot\ >> --action=snapshot-session snapshot-trace-session \ >> --rate-policy=once-after:1 >> >> # start & list info >> sudo lttng start >> sudo lttng list snapshot-trace-session >> sudo lttng list-triggers >> >> # test it... >> sudo reboot >> >> #=== reconnect and Nothing :( >> $ ls -alu /application/trace/ >> drwxr-xr-x 2 u u 4096 May 15 2024 . >> drwxr-xr-x 10 u u 4096 May 15 2024 .. >> >> >> ___ >> lttng-dev mailing list >> lttng-dev@lists.lttng.org <mailto:lttng-dev@lists.lttng.org> >> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev <https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev> > ___ > lttng-dev mailing list > lttng-dev@lists.lttng.org <mailto:lttng-dev@lists.lttng.org> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev <https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev> -- *Damien Berget* ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
[lttng-dev] [RELEASE] LTTng-modules 2.13.13 and 2.12.17 (Linux kernel tracer)
Hi, This is a stable release announcement for the LTTng kernel tracer, an out-of-tree kernel tracer for the Linux kernel. The LTTng project provides low-overhead, correlated userspace and kernel tracing on Linux. Its use of the Common Trace Format and a flexible control interface allows it to fulfill various workloads. * New in these releases: - LTTng-modules 2.13.13: - Introduce support for Linux v6.9. - Removed unused duplicated code, add missing static to function definitions, and add missing includes for function declarations which were observed when building against recent kernels with newer toolchains. We plan to adapt our CI to add jobs that will report warnings as errors when building lttng-modules against recent kernels with a recent tool chain so we can catch and fix those warnings earlier in the future. - In both LTTng-modules 2.12.17 and 2.13.13: - Fix incorrect get_pfnblock_flags_mask prototype which did not match upstream after upstream commit 535b81e209219 (v5.9). Fix the prototype mismatch detection code as well. This affects the event mm_page_alloc_extfrag which uses get_pageblock_migratetype(). Note that because the kernel macro get_pageblock_migratetype was also updated to pass 3 parameters to get_pfnblock_flags_mask as its kernel prototype was updated to expect three parameters, it does not matter that the lttng-modules wrapper expects 4 parameters and provides those 4 parameters to the kernel function. This issue should therefore not affect the runtime behavior. - Instrumentation updates to support EL 8.4+. - Instrumentation updates for RHEL kernels. - Instrumentation updates to the timer subsystem to adapt to changes backported in the 4.19 stable kernels. * Detailed change logs: 2024-05-13 (National Leprechaun Day) LTTng modules 2.13.13 * splice wrapper: Fix missing declaration * page alloc wrapper: Fix get_pfnblock_flags_mask prototype * lttng probe: include events-internal.h * syscalls: Remove unused duplicated code * statedump: Add missing events-internal.h include * lttng-events: Add missing static * event notifier: Add missing static * context callstack: Add missing static * lttng-clock: Add missing lttng/events-internal.h include * lttng-calibrate: Add missing static and include * lttng-bytecode: Remove dead code * lttng-abi: Add missing static to function definitions * ring buffer: Add missing static to function definitions * blkdev wrapper: Fix constness warning * Fix: timer_expire_entry changed in 4.19.312 * Fix: dev_base_lock removed in linux 6.9-rc1 * Fix: mm_compaction_migratepages changed in linux 6.9-rc1 * Fix: ASoC add component to set_bias_level events in linux 6.9-rc1 * Fix: ASoC snd_doc_dapm on linux 6.9-rc1 * Fix: build kvm probe on EL 8.4+ * Fix: support ext4_journal_start on EL 8.4+ * Fix: correct RHEL range for kmem_cache_free define 2024-05-13 (National Leprechaun Day) 2.12.17 * page alloc wrapper: Fix get_pfnblock_flags_mask prototype * Fix: timer_expire_entry changed in 4.19.312 * Fix: build kvm probe on EL 8.4+ * Fix: support ext4_journal_start on EL 8.4+ * Fix: correct RHEL range for kmem_cache_free define Project website: https://lttng.org Documentation: https://lttng.org/docs Download link: https://lttng.org/download -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
Re: [PATCH v3 54/68] selftests/rseq: Drop define _GNU_SOURCE
On 2024-05-09 15:58, Edward Liaw wrote: _GNU_SOURCE is provided by lib.mk, so it should be dropped to prevent redefinition warnings. Fixes: 809216233555 ("selftests/harness: remove use of LINE_MAX") The patch per se looks fine, except for the "Fixes" tag. Commit 809216233555 introduces use of asprintf in kselftest_harness.h which is used by (all ?) selftests, including the rseq ones. However, the rseq selftests each have the #define _GNU_SOURCE, which would have been OK without those further changes. So this patch is more about consolidating where the _GNU_SOURCE is defined, which is OK with me, but not so much about "fixing" an issue with commit 809216233555. A "Fix" is something to be backported to stable kernels, and I don't think this patch reaches that threshold. If anything, this patch removes a warning that gets added by https://lore.kernel.org/lkml/20240509200022.253089-1-edl...@google.com/T/#mf8438d03de6e2b613da4f86d4f60c5fe1c5f8483 within the same series. Arguably, each #define _GNU_SOURCE could have been first protected by a #ifndef guard to eliminate this transient warning, and there would be nothing to "fix" in this consolidation series. Thoughts ? Thanks, Mathieu Reviewed-by: John Hubbard Reviewed-by: Muhammad Usama Anjum Signed-off-by: Edward Liaw --- tools/testing/selftests/rseq/basic_percpu_ops_test.c | 1 - tools/testing/selftests/rseq/basic_test.c| 2 -- tools/testing/selftests/rseq/param_test.c| 1 - tools/testing/selftests/rseq/rseq.c | 2 -- 4 files changed, 6 deletions(-) diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.c b/tools/testing/selftests/rseq/basic_percpu_ops_test.c index 2348d2c20d0a..5961c24ee1ae 100644 --- a/tools/testing/selftests/rseq/basic_percpu_ops_test.c +++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.c @@ -1,5 +1,4 @@ // SPDX-License-Identifier: LGPL-2.1 -#define _GNU_SOURCE #include #include #include diff --git a/tools/testing/selftests/rseq/basic_test.c b/tools/testing/selftests/rseq/basic_test.c index 295eea16466f..1fed749b4bd7 100644 --- a/tools/testing/selftests/rseq/basic_test.c +++ b/tools/testing/selftests/rseq/basic_test.c @@ -2,8 +2,6 @@ /* * Basic test coverage for critical regions and rseq_current_cpu(). */ - -#define _GNU_SOURCE #include #include #include diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c index 2f37961240ca..48a55d94eb72 100644 --- a/tools/testing/selftests/rseq/param_test.c +++ b/tools/testing/selftests/rseq/param_test.c @@ -1,5 +1,4 @@ // SPDX-License-Identifier: LGPL-2.1 -#define _GNU_SOURCE #include #include #include diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c index 96e812bdf8a4..88602889414c 100644 --- a/tools/testing/selftests/rseq/rseq.c +++ b/tools/testing/selftests/rseq/rseq.c @@ -14,8 +14,6 @@ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * Lesser General Public License for more details. */ - -#define _GNU_SOURCE #include #include #include -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [lttng-dev] [PATCH urcu] fix: handle EINTR correctly in get_cpu_mask_from_sysfs
On 2024-05-02 10:32, Michael Jeanson wrote: On 2024-05-02 09:54, Mathieu Desnoyers wrote: On 2024-05-01 19:42, Benjamin Marzinski via lttng-dev wrote: If the read() in get_cpu_mask_from_sysfs() fails with EINTR, the code is supposed to retry, but the while loop condition has (bytes_read > 0), which is false when read() fails with EINTR. The result is that the code exits the loop, having only read part of the string. Use (bytes_read != 0) in the while loop condition instead, since the (bytes_read < 0) case is already handled in the loop. Thanks for the fix ! It is indeed the right thing to do. I would like to integrate this fix into the librseq and libside projects as well though, but I notice the the copy in liburcu is LGPLv2.1 whereas the copy in librseq and libside are MIT. Michael, should we first relicense the liburcu src/compat-smp.h implementation to MIT so it matches the license of the copies in librseq and libside ? Sure, please go ahead. For the records, we also have a copy of this code in lttng-ust, also under MIT license. So liburcu's copy is the only outlier there. Thanks, Mathieu Thanks, Mathieu Signed-off-by: Benjamin Marzinski --- src/compat-smp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/compat-smp.h b/src/compat-smp.h index 31fa979..075a332 100644 --- a/src/compat-smp.h +++ b/src/compat-smp.h @@ -164,7 +164,7 @@ static inline int get_cpu_mask_from_sysfs(char *buf, size_t max_bytes, const cha total_bytes_read += bytes_read; assert(total_bytes_read <= max_bytes); - } while (max_bytes > total_bytes_read && bytes_read > 0); + } while (max_bytes > total_bytes_read && bytes_read != 0); /* * Make sure the mask read is a null terminated string. -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
Re: [lttng-dev] [PATCH urcu] fix: handle EINTR correctly in get_cpu_mask_from_sysfs
On 2024-05-01 19:42, Benjamin Marzinski via lttng-dev wrote: If the read() in get_cpu_mask_from_sysfs() fails with EINTR, the code is supposed to retry, but the while loop condition has (bytes_read > 0), which is false when read() fails with EINTR. The result is that the code exits the loop, having only read part of the string. Use (bytes_read != 0) in the while loop condition instead, since the (bytes_read < 0) case is already handled in the loop. Thanks for the fix ! It is indeed the right thing to do. I would like to integrate this fix into the librseq and libside projects as well though, but I notice the the copy in liburcu is LGPLv2.1 whereas the copy in librseq and libside are MIT. Michael, should we first relicense the liburcu src/compat-smp.h implementation to MIT so it matches the license of the copies in librseq and libside ? Thanks, Mathieu Signed-off-by: Benjamin Marzinski --- src/compat-smp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/compat-smp.h b/src/compat-smp.h index 31fa979..075a332 100644 --- a/src/compat-smp.h +++ b/src/compat-smp.h @@ -164,7 +164,7 @@ static inline int get_cpu_mask_from_sysfs(char *buf, size_t max_bytes, const cha total_bytes_read += bytes_read; assert(total_bytes_read <= max_bytes); - } while (max_bytes > total_bytes_read && bytes_read > 0); + } while (max_bytes > total_bytes_read && bytes_read != 0); /* * Make sure the mask read is a null terminated string. -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
Re: [PATCH] rcu: Remove unreachable logic
On 2024-04-29 13:34, Alan Huang wrote: call_rcu_core is only called from __call_rcu_common with interrupt disabled. This patch thus removes the unreachable logic and the would-be unused 'flags' parameter. Nack. call_rcu_core() receives a @flags parameter which are the _saved_ flags as they were prior to local_irq_save(). This patch highlights a misunderstanding of what the code is actually doing. Thanks, Mathieu Signed-off-by: Alan Huang --- kernel/rcu/tree.c | 35 ++- 1 file changed, 2 insertions(+), 33 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d5507ac1bbf1..b0ea2ebd7769 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2983,7 +2983,7 @@ static void rcutree_enqueue(struct rcu_data *rdp, struct rcu_head *head, rcu_cal * Handle any core-RCU processing required by a call_rcu() invocation. */ static void call_rcu_core(struct rcu_data *rdp, struct rcu_head *head, - rcu_callback_t func, unsigned long flags) + rcu_callback_t func) { rcutree_enqueue(rdp, head, func); /* @@ -2992,37 +2992,6 @@ static void call_rcu_core(struct rcu_data *rdp, struct rcu_head *head, */ if (!rcu_is_watching()) invoke_rcu_core(); - - /* If interrupts were disabled or CPU offline, don't invoke RCU core. */ - if (irqs_disabled_flags(flags) || cpu_is_offline(smp_processor_id())) - return; - - /* -* Force the grace period if too many callbacks or too long waiting. -* Enforce hysteresis, and don't invoke rcu_force_quiescent_state() -* if some other CPU has recently done so. Also, don't bother -* invoking rcu_force_quiescent_state() if the newly enqueued callback -* is the only one waiting for a grace period to complete. -*/ - if (unlikely(rcu_segcblist_n_cbs(>cblist) > -rdp->qlen_last_fqs_check + qhimark)) { - - /* Are we ignoring a completed grace period? */ - note_gp_changes(rdp); - - /* Start a new grace period if one not already started. */ - if (!rcu_gp_in_progress()) { - rcu_accelerate_cbs_unlocked(rdp->mynode, rdp); - } else { - /* Give the grace period a kick. */ - rdp->blimit = DEFAULT_MAX_RCU_BLIMIT; - if (READ_ONCE(rcu_state.n_force_qs) == rdp->n_force_qs_snap && - rcu_segcblist_first_pend_cb(>cblist) != head) - rcu_force_quiescent_state(); - rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs); - rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(>cblist); - } - } } /* @@ -3121,7 +3090,7 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy_in) if (unlikely(rcu_rdp_is_offloaded(rdp))) call_rcu_nocb(rdp, head, func, flags, lazy); else - call_rcu_core(rdp, head, func, flags); + call_rcu_core(rdp, head, func); local_irq_restore(flags); } -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[lttng-dev] [RELEASE] LTTng-UST 2.12.10 and 2.13.8 (Linux user-space tracer)
LTTng-UST, the Linux Trace Toolkit Next Generation Userspace Tracer, is a low-overhead application tracer. The library "liblttng-ust" enables tracing of applications and libraries. New in both 2.12.10 and 2.13.8: * Add close_range wrapper to liblttng-ust-fd.so GNU libc 2.34 implements a new close_range symbol which is used by the ssh client and other applications to close all file descriptors, including those which do not belong to the application. Override this symbol to prevent the application from closing file descriptors actively used by lttng-ust. * Fix: libc wrapper: use initial-exec for malloc_nesting TLS Use the initial-exec TLS model for the malloc_nesting nesting guard variable to ensure that the GNU libc implementation of the TLS access don't trigger infinite recursion by calling the memory allocator wrapper functions, which can happen with global-dynamic. This fixes a liblttng-ust-libc-wrapper.so regression on recent Fedora distributions. * lttng-ust(3): Fix wrong len_type for sequence `len_type' of a sequence field must be of type unsigned integer. Some provided examples in the man page were incorrectly using a type signed integer, resulting in correct compilation, but error while decoding. New in 2.13.8: * ust-tracepoint-event: Add static check of sequences length type Add a compile-time check to validate that unsigned types are used for the length field of sequences. Detailed change logs: 2024-04-19 (National Garlic Day) lttng-ust 2.13.8 * Add close_range wrapper to liblttng-ust-fd.so * ust-tracepoint-event: Add static check of sequences length type * lttng-ust(3): Fix wrong len_type for sequence * Fix: libc wrapper: use initial-exec for malloc_nesting TLS 2024-04-19 (National Garlic Day) lttng-ust 2.12.10 * Add close_range wrapper to liblttng-ust-fd.so * lttng-ust(3): Fix wrong len_type for sequence * Fix: libc wrapper: use initial-exec for malloc_nesting TL Project website: https://lttng.org Documentation: https://lttng.org/docs Download link: https://lttng.org/download -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
Re: [lttng-dev] Software Heritage archival notification for git.liburcu.org
On 2024-04-15 10:20, Michael Jeanson via lttng-dev wrote: On 2024-04-14 20:39, Paul Wise wrote: On Thu, 2024-04-11 at 13:45 -0400, Michael Jeanson wrote: I see no issues with this, thanks for the heads-up. PS: I note that git.liburcu.org and git.lttng.org seem to have identical contents. I wonder if SWH should be archiving just one of them or if we should archive both just in case they get split up? At the moment 'git.liburu.org' is just a CNAME for 'git.lttng.org', we Typo: git.liburcu.org Thanks, Mathieu don't have plans to split them up so far. ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On 2024-04-12 12:28, Beau Belgrave wrote: On Thu, Apr 11, 2024 at 09:52:22PM -0700, Ian Rogers wrote: On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave wrote: In the Open Telemetry profiling SIG [1], we are trying to find a way to grab a tracing association quickly on a per-sample basis. The team at Elastic has a bespoke way to do this [2], however, I'd like to see a more general way to achieve this. The folks I've been talking with seem open to the idea of just having a TLS value for this we could capture Presumably TLS == Thread Local Storage. Yes, the initial idea is to use thread local storage (TLS). It seems to be the fastest option to save a per-thread value that changes at a fast rate. upon each sample. We could then just state, Open Telemetry SDKs should have a TLS value for span correlation. However, we need a way to sample the TLS or other value(s) when a sampling event is generated. This is supported today on Windows via EventActivityIdControl() [3]. Since Open Telemetry works on both Windows and Linux, ideally we can do something as efficient for Linux based workloads. This series is to explore how it would be best possible to collect supporting data from a user process when a profile sample is collected. Having a value stored in TLS makes a lot of sense for this however there are other ways to explore. Whatever is chosen, kernel samples taken in process context should be able to get this supporting data. In these patches on X64 the fsbase and gsbase are used for this. An option to explore suggested by Mathieu Desnoyers is to utilize rseq for processes to register a value location that can be included when profiling if desired. This would allow a tighter contract between user processes and a profiler. It would allow better labeling/categorizing the correlation values. It is hard to understand this idea. Are you saying stash a cookie in TLS for samples to capture to indicate an activity? Restartable sequences are about preemption on a CPU not of a thread, so at least my intuition is that they feel different. You could stash information like this today by changing the thread name which generates comm events. I've wondered about having similar information in some form of reserved for profiling stack slot, for example, to stash a pointer to the name of a function being interpreted. Snapshotting all of a stack is bad performance wise and for security. A stack slot would be able to deal with nesting. You are getting the idea. A slot or tag for a thread would be great! I'm not a fan of overriding the thread comm name (as that already has a use). TLS would be fine, if we could also pass an offset + size + type. Maybe a stack slot that just points to parts of TLS? That way you could have a set of slots that don't require much memory and selectively copy them out of TLS (or where ever those slots point to in user memory). When I was talking to Mathieu about this, it seems that rseq already had a place to potentially put these slots. I'm unsure though how the per thread aspects would work. Mathieu, can you post your ideas here about that? Sure. I'll try to summarize my thoughts here. By all means, let me know if I'm missing important pieces of the puzzle. First of all, here is my understanding of what information we want to share between userspace and kernel. A 128-bit activity ID identifies "uniquely" (as far as a 128-bit random UUID allows) a portion of the dependency chain involved in doing some work (e.g. answer a HTTP request) across one or many participating hosts. Activity IDs have a parent/child relationship: a parent activity ID can create children activity IDs. For instance, if one host has the service "dispatch", another host has a "web server", and a third host has a SQL database, we should be able to follow the chain of activities needed to answer a web query by following those activity IDs, linking them together through parent/child relationships. This usually requires the communication protocols to convey those activity IDs across hosts. The reason why this information must be provided from userspace is because it's userspace that knows where to find those activity IDs within its application-layer communication protocols. With tracing, taking a full trace of the activity ID spans begin/end from all hosts allow reconstructing the activity IDs parent/child relationships, so we typically only need to extract information about activity ID span begin/end with parent/child info to a tracer. Using activity IDs from a kernel profiler is trickier, because we do not have access to the complete span begin/end trace to reconstruct the activity ID parent/child relationship. This is where I suspect we'd want to introduce a notion of "activity ID stack", so a profiler could reconstruct the currently active stack of activity IDs for the current thread by walking that stack. This profiling could be triggered either from an interrupt (s
Re: [lttng-dev] Compile fix for urcu-bp.c
Hi, There are a few things missing before I can take this patch: - Missing commit message describing the issue, - Missing "Signed-off-by" tag. Thanks! Mathieu On 2024-03-29 10:06, Duncan Sands via lttng-dev wrote: --- src/urcu-bp.c +++ src/urcu-bp.c @@ -409,7 +409,7 @@ void expand_arena(struct registry_arena *arena) new_chunk_size_bytes, 0); if (new_chunk != MAP_FAILED) { /* Should not have moved. */ - assert(new_chunk == last_chunk); + urcu_posix_assert(new_chunk == last_chunk); memset((char *) last_chunk + old_chunk_size_bytes, 0, new_chunk_size_bytes - old_chunk_size_bytes); last_chunk->capacity = new_capacity; ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
Re: Appropriate liburcu cache line size for Power
On 2024-03-25 16:34, Nathan Lynch wrote: Mathieu Desnoyers writes: In the powerpc architecture support within the liburcu project [1] we have a cache line size defined as 256 bytes with the following comment: /* Include size of POWER5+ L3 cache lines: 256 bytes */ #define CAA_CACHE_LINE_SIZE 256 I recently received a pull request on github [2] asking to change this to 128 bytes. All the material provided supports that the cache line sizes on powerpc are 128 bytes or less (even L3 on POWER7, POWER8, and POWER9) [3]. I wonder where the 256 bytes L3 cache line size for POWER5+ we have in liburcu comes from, and I wonder if it's the right choice for a cache line size on all powerpc, considering that the Linux kernel cache line size appear to use 128 bytes on recent Power architectures. I recall some benchmark experiments Paul and I did on a 64-core 1.9GHz POWER5+ machine that benefited from a 256 bytes cache line size, and I suppose this is why we came up with this value, but I don't have the detailed specs of that machine. Any feedback on this matter would be appreciated. For what it's worth, I found a copy of an IBM Journal of Research & Development article confirming that POWER5's L3 had a 256-byte line size: Each slice [of the L3] is 12-way set-associative, with 4,096 congruence classes of 256-byte lines managed as two 128-byte sectors to match the L2 line size. https://www.eecg.utoronto.ca/~moshovos/ACA08/readings/power5.pdf I don't know of any reason to prefer 256 over 128 for current Power processors though. Thanks for the pointer. I will add a reference to it in the liburcu source code to explain the cache line size choice. Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH 2/2] ARC: mm: fix new code about cache aliasing
On 2024-03-28 01:39, Vineet Gupta wrote: Manual/partial revert of 8690bbcf3b70 ("Introduce cpu_dcache_is_aliasing() across all architectures") Current generation of ARCv2/ARCv3 based HSxx cores are only PIPT (to software at least). Legacy ARC700 cpus could be VIPT aliasing (based on cache geometry and PAGE_SIZE) however recently that support was ripped out so VIPT aliasing cache is not relevant to ARC anymore. P.S. : This has been discussed a few times on lists [1] P.S.2: Please CC the arch maintainers and/or mailing list before adding such interfaces. Because 8690bbcf3b70 was introducing a tree-wide change affecting all architectures, I CC'd linux-a...@vger.kernel.org. I expected all architecture maintainers to follow that list, which is relatively low volume. I'm sorry that you learn about this after the fact as a result. My intent was to use the list rather than CC about 50 additional people/mailing lists. Of course, if VIPT aliasing is removed from ARC, removing the config ARCH_HAS_CPU_CACHE_ALIASING and using the generic cpu_dcache_is_aliasing() is the way to go. Feel free to add my: Acked-by: Mathieu Desnoyers Thanks, Mathieu [1] http://lists.infradead.org/pipermail/linux-snps-arc/2023-February/006899.html Cc: Mathieu Desnoyers Signed-off-by: Vineet Gupta --- arch/arc/Kconfig | 1 - arch/arc/include/asm/cachetype.h | 9 - 2 files changed, 10 deletions(-) delete mode 100644 arch/arc/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 99d2845f3feb..4092bec198be 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,7 +6,6 @@ config ARC def_bool y select ARC_TIMERS - select ARCH_HAS_CPU_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h deleted file mode 100644 index 05fc7ed59712.. --- a/arch/arc/include/asm/cachetype.h +++ /dev/null @@ -1,9 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef __ASM_ARC_CACHETYPE_H -#define __ASM_ARC_CACHETYPE_H - -#include - -#define cpu_dcache_is_aliasing() true - -#endif -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ___ linux-snps-arc mailing list linux-snps-arc@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-snps-arc
Re: Appropriate liburcu cache line size for Power
On 2024-03-26 03:19, Michael Ellerman wrote: Mathieu Desnoyers writes: Hi, Hi Mathieu, In the powerpc architecture support within the liburcu project [1] we have a cache line size defined as 256 bytes with the following comment: /* Include size of POWER5+ L3 cache lines: 256 bytes */ #define CAA_CACHE_LINE_SIZE 256 I recently received a pull request on github [2] asking to change this to 128 bytes. All the material provided supports that the cache line sizes on powerpc are 128 bytes or less (even L3 on POWER7, POWER8, and POWER9) [3]. I wonder where the 256 bytes L3 cache line size for POWER5+ we have in liburcu comes from, and I wonder if it's the right choice for a cache line size on all powerpc, considering that the Linux kernel cache line size appear to use 128 bytes on recent Power architectures. I recall some benchmark experiments Paul and I did on a 64-core 1.9GHz POWER5+ machine that benefited from a 256 bytes cache line size, and I suppose this is why we came up with this value, but I don't have the detailed specs of that machine. Any feedback on this matter would be appreciated. The ISA doesn't specify the cache line size, other than it is smaller than a page. In practice all the 64-bit IBM server CPUs I'm aware of have used 128 bytes. There are some 64-bit CPUs that use 64 bytes, eg. pasemi PA6T and Freescale e6500. It is possible to discover at runtime via AUXV headers. But that's no use if you want a compile-time constant. Indeed, and this CAA_CACHE_LINE_SIZE is part of the liburcu powerpc ABI, so changing this would require a soname bump, which I don't want to do without really good reasons. I'm happy to run some benchmarks if you can point me at what to run. I had a poke around the repository and found short_bench, but it seemed to run for a very long time. I've created a dedicated test program for this, see: https://github.com/compudj/userspace-rcu-dev/tree/false-sharing example use: On a AMD Ryzen 7 PRO 6850U with Radeon Graphics: for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done ok 1 - Stride 8 bytes, increments per ms per thread: 21320 1..1 ok 1 - Stride 16 bytes, increments per ms per thread: 22657 1..1 ok 1 - Stride 32 bytes, increments per ms per thread: 47599 1..1 ok 1 - Stride 64 bytes, increments per ms per thread: 531364 1..1 ok 1 - Stride 128 bytes, increments per ms per thread: 523634 1..1 ok 1 - Stride 256 bytes, increments per ms per thread: 519402 1..1 ok 1 - Stride 512 bytes, increments per ms per thread: 520651 1..1 Would point to false-sharing starting with cache lines smaller than 64 bytes. I get similar results (false-sharing under 64 bytes) with a AMD EPYC 9654 96-Core Processor. The test programs runs 4 threads by default, which can be overridden with "-t N". This may be needed if you want this to use all cores from a larger machine. See "-h" for options. On a POWER9 (architected), altivec supported: for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done ok 1 - Stride 8 bytes, increments per ms per thread: 12264 1..1 ok 1 - Stride 16 bytes, increments per ms per thread: 12276 1..1 ok 1 - Stride 32 bytes, increments per ms per thread: 25638 1..1 ok 1 - Stride 64 bytes, increments per ms per thread: 39934 1..1 ok 1 - Stride 128 bytes, increments per ms per thread: 53971 1..1 ok 1 - Stride 256 bytes, increments per ms per thread: 53599 1..1 ok 1 - Stride 512 bytes, increments per ms per thread: 53962 1..1 This points at false-sharing below 128 bytes stride. On a e6500, altivec supported, Model 2.0 (pvr 8040 0120) for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done ok 1 - Stride 8 bytes, increments per ms per thread: 9049 1..1 ok 1 - Stride 16 bytes, increments per ms per thread: 9054 1..1 ok 1 - Stride 32 bytes, increments per ms per thread: 18643 1..1 ok 1 - Stride 64 bytes, increments per ms per thread: 37417 1..1 ok 1 - Stride 128 bytes, increments per ms per thread: 37906 1..1 ok 1 - Stride 256 bytes, increments per ms per thread: 37870 1..1 ok 1 - Stride 512 bytes, increments per ms per thread: 37899 1..1 Which points at false-sharing below 64 bytes. I prefer to be cautious about this cache line size value and aim for a value which takes into account the largest known cache line size for an architecture rather than use a too small due to the large overhead caused by false-sharing. Feedback is welcome. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Appropriate liburcu cache line size for Power
Hi, In the powerpc architecture support within the liburcu project [1] we have a cache line size defined as 256 bytes with the following comment: /* Include size of POWER5+ L3 cache lines: 256 bytes */ #define CAA_CACHE_LINE_SIZE 256 I recently received a pull request on github [2] asking to change this to 128 bytes. All the material provided supports that the cache line sizes on powerpc are 128 bytes or less (even L3 on POWER7, POWER8, and POWER9) [3]. I wonder where the 256 bytes L3 cache line size for POWER5+ we have in liburcu comes from, and I wonder if it's the right choice for a cache line size on all powerpc, considering that the Linux kernel cache line size appear to use 128 bytes on recent Power architectures. I recall some benchmark experiments Paul and I did on a 64-core 1.9GHz POWER5+ machine that benefited from a 256 bytes cache line size, and I suppose this is why we came up with this value, but I don't have the detailed specs of that machine. Any feedback on this matter would be appreciated. Thanks! Mathieu [1] https://liburcu.org [2] https://github.com/urcu/userspace-rcu/pull/22 [3] https://www.7-cpu.com/ -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[lttng-dev] [RELEASE] LTTng-modules 2.12.16 and 2.13.12 (Linux kernel tracer)
Hi, This is a release announcement for the currently maintained LTTng-modules Linux kernel tracer stables branches. * New and noteworthy in these releases: Linux kernel v6.8 is now supported by LTTng modules 2.13.12. If you need support for recent kernels (v5.18+), you will need to upgrade to a recent LTTng-modules 2.13.x. Both releases correct issues with SLE kernel version ranges detection. A compilation fix for RHEL 9.3 kernel is present in v2.13.12. Feedback is welcome! Thanks, Mathieu Project website: https://lttng.org Documentation: https://lttng.org/docs Download link: https://lttng.org/download Detailed change logs: 2024-03-21 (National Common Courtesy Day) LTTng modules 2.13.12 * docs: Add supported versions and fix-backport policy * docs: Add links to project resources * Fix: Correct minimum version in jbd2 SLE kernel range * Fix: Handle recent SLE major version codes * Fix: build on sles15sp4 * Compile fixes for RHEL 9.3 kernels * Fix: ext4_discard_preallocations changed in linux 6.8.0-rc3 * Fix: btrfs_get_extent flags and compress_type changed in linux 6.8.0-rc1 * Fix: btrfs_chunk tracepoints changed in linux 6.8.0-rc1 * Fix: strlcpy removed in linux 6.8.0-rc1 * Fix: timer_start changed in linux 6.8.0-rc1 * Fix: sched_stat_runtime changed in linux 6.8.0-rc1 2024-03-21 (National Common Courtesy Day) 2.12.16 * fix: lttng-probe-kvm-x86-mmu build with linux 6.6 * docs: Add supported versions and fix-backport policy * docs: Add links to project resources * Fix: Correct minimum version in jbd2 SLE kernel range * Fix: Handle recent SLE major version codes * Fix: build on sles15sp4 -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()
On 2024-03-20 13:58, Steven Rostedt wrote: On Wed, 20 Mar 2024 13:15:39 -0400 Mathieu Desnoyers wrote: I would like to introduce restart_critical_timings() and place it in locations that have this behavior. Is there any way you could move this to need_resched() rather than sprinkle those everywhere ? Because need_resched() itself does not mean it's going to schedule immediately. I looked at a few locations that need_resched() is called. Most are in idle code where the critical timings are already handled. I'm not sure I'd add it for places like mm/memory.c or drivers/md/bcache/btree.c. A lot of places look to use it more for PREEMPT_NONE situations as a open coded cond_resched(). The main reason this one is particularly an issue, is that it spins as long as the owner is still running. Which may be some time, as here it was 7ms. What I think we should be discussing here is how calling need_resched() should interact with the latency tracked by critical timings. AFAIU, when code explicitly calls need_resched() in a loop, there are two cases: - need_resched() returns false: This means the loop can continue without causing long latency on the system. Technically we could restart the critical timings at this point. - need_resched() returns true: This means the loop should exit quickly and call the scheduler. I would not reset the critical timings there, as whatever code is executed between need_resched() returning true and calling the scheduler is adding to latency. Having stop/start critical timings around idle loops seems to just be an optimization over that. As for mm and driver/md code, what is wrong with doing a critical timings reset when need_resched() returns false ? It would prevent a whole class of false-positives rather than playing whack-a-mole with those that pop up. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()
On 2024-03-20 12:20, Steven Rostedt wrote: From: Steven Rostedt (Google) I'm debugging some latency issues on a Chromebook and the preemptirqsoff tracer hit this: # tracer: preemptirqsoff # # preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty # # latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) #- #| task: init-1 (uid:0 nice:0 policy:0 rt_prio:0) #- # => started at: rwsem_optimistic_spin # => ended at: rwsem_optimistic_spin # # #_--=> CPU# # / _-=> irqs-off # | / _=> need-resched # || / _---=> hardirq/softirq # ||| / _--=> preempt-depth # / _-=> migrate-disable # | / delay # cmd pid || time | caller # \ /|| \|/ <...>-1 1...1.0us!: rwsem_optimistic_spin+0x20/0x194 <-rwsem_optimistic_spin+0x20/0x194 <...>-1 1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 <-rwsem_optimistic_spin+0x140/0x194 <...>-1 1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a <-rwsem_optimistic_spin+0x140/0x194 <...>-1 1.N.1. 7824us : => rwsem_optimistic_spin+0x140/0x194 => rwsem_down_write_slowpath+0xc9/0x3fe => copy_process+0xd08/0x1b8a => kernel_clone+0x94/0x256 => __x64_sys_clone+0x7a/0x9a => do_syscall_64+0x51/0xa1 => entry_SYSCALL_64_after_hwframe+0x5c/0xc6 Which isn't a real issue, as it's in the rwsem_optimistic_spin() code that spins with interrupts enabled, preempt disabled, and checks for need_resched(). If it is true, it breaks out and schedules. Hence, it's hiding real issues, because it can spin for a very long time and this is not the source of the latency I'm tracking down. I would like to introduce restart_critical_timings() and place it in locations that have this behavior. Is there any way you could move this to need_resched() rather than sprinkle those everywhere ? Thanks, Mathieu Signed-off-by: Steven Rostedt (Google) diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index 147feebd508c..e9f97f60bfc0 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -145,6 +145,13 @@ do { \ # define start_critical_timings() do { } while (0) #endif +/* Used in spins that check need_resched() with preemption disabled */ +static inline void restart_critical_timings(void) +{ + stop_critical_timings(); + start_critical_timings(); +} + #ifdef CONFIG_DEBUG_IRQFLAGS extern void warn_bogus_irq_restore(void); #define raw_check_bogus_irq_restore() \ diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c index 2340b6d90ec6..fa7b43e53d20 100644 --- a/kernel/locking/rwsem.c +++ b/kernel/locking/rwsem.c @@ -27,6 +27,7 @@ #include #include #include +#include #include #ifndef CONFIG_PREEMPT_RT @@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem) */ barrier(); + restart_critical_timings(); if (need_resched() || !owner_on_cpu(owner)) { state = OWNER_NONSPINNABLE; break; @@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem) * a writer, need_resched() check needs to be done here. */ if (owner_state != OWNER_WRITER) { + restart_critical_timings(); if (need_resched()) break; if (rt_task(current) && -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug
On 2024-03-07 14:47, Paul E. McKenney wrote: On Thu, Mar 07, 2024 at 08:53:05AM -0500, Mathieu Desnoyers wrote: On 2024-03-06 22:37, Paul E. McKenney wrote: On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote: [...] As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern is concerned, the only valid use-case I can think of is split counters or RCU implementations where there is a single updater doing the increment, and one or more concurrent reader threads that need to snapshot a consistent value with READ_ONCE(). [...] So what would you use that pattern for? One possibility is a per-CPU statistical counter in userspace on a fastpath, in cases where losing the occasional count is OK. Then learning your CPU (and possibly being immediately migrated to some other CPU), READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not) make sense. I suppose the same in the kernel if there was a fastpath so extreme you could not afford to disable preemption. At best, very niche. Or am I suffering a failure of imagination yet again? ;-) The (niche) use-cases I have in mind are split-counters and RCU grace period tracking, where precise counters totals are needed (no lost count). In the kernel, this could be: Thank you for looking into this! - A per-cpu counter, each counter incremented from thread context with preemption disabled (single updater per counter), read concurrently by other threads. WRITE_ONCE/READ_ONCE is useful to make sure there is no store/load tearing there. Atomics on the update would be stronger than necessary on the increment fast-path. But if preemption is disabled, the updater can read the value without READ_ONCE() without risk of concurrent update. Or are you concerned about interrupt handlers? This would have to be a read from the interrupt handler, given that an updated from the interrupt handler could result in a lost count. You are correct that the updater don't need READ_ONCE there. It would however require a WRITE_ONCE() to match READ_ONCE() from concurrent reader threads. - A per-thread counter (e.g. within task_struct), only incremented by the single thread, read by various threads concurrently. Ditto. Right, only WRITE_ONCE() on the single updater, READ_ONCE() on readers. - A counter which increment happens to be already protected by a lock, read by various threads without taking the lock. (technically doable, but I'm not sure I see a relevant use-case for it) In that case, the lock would exclude concurrent updates, so the lock holder would not need READ_ONCE(), correct? Correct. In user-space: - The "per-cpu" counter would have to use rseq for increments to prevent inopportune migrations, which needs to be implemented in assembler anyway. The counter reads would have to use READ_ONCE(). Fair enough! - The per-thread counter (Thread-Local Storage) incremented by a single thread, read by various threads concurrently, is a good target for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in various liburcu implementations which track read-side critical sections per-thread. Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or similar? Not quite, I recall it's more like WRITE_ONCE(x, READ_ONCE(y)) or such, so we can grab the value of the current gp counter and store it into a TLS variable. - Same as for the kernel, a counter increment protected by a lock which needs to be read from various threads concurrently without taking the lock could be a valid use-case, though I fail to see how it is useful due to lack of imagination on my part. ;-) In RCU, we have "WRITE_ONCE(*sp, *sp + 1)" for this use case, though here we have the WRITE_ONCE() but not the READ_ONCE() because we hold the lock excluding any other updates. You are right, the READ_ONCE() is not needed in this case for the updater, only for the concurrent readers. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug
On 2024-03-06 22:37, Paul E. McKenney wrote: On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote: [...] As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern is concerned, the only valid use-case I can think of is split counters or RCU implementations where there is a single updater doing the increment, and one or more concurrent reader threads that need to snapshot a consistent value with READ_ONCE(). [...] So what would you use that pattern for? One possibility is a per-CPU statistical counter in userspace on a fastpath, in cases where losing the occasional count is OK. Then learning your CPU (and possibly being immediately migrated to some other CPU), READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not) make sense. I suppose the same in the kernel if there was a fastpath so extreme you could not afford to disable preemption. At best, very niche. Or am I suffering a failure of imagination yet again? ;-) The (niche) use-cases I have in mind are split-counters and RCU grace period tracking, where precise counters totals are needed (no lost count). In the kernel, this could be: - A per-cpu counter, each counter incremented from thread context with preemption disabled (single updater per counter), read concurrently by other threads. WRITE_ONCE/READ_ONCE is useful to make sure there is no store/load tearing there. Atomics on the update would be stronger than necessary on the increment fast-path. - A per-thread counter (e.g. within task_struct), only incremented by the single thread, read by various threads concurrently. - A counter which increment happens to be already protected by a lock, read by various threads without taking the lock. (technically doable, but I'm not sure I see a relevant use-case for it) In user-space: - The "per-cpu" counter would have to use rseq for increments to prevent inopportune migrations, which needs to be implemented in assembler anyway. The counter reads would have to use READ_ONCE(). - The per-thread counter (Thread-Local Storage) incremented by a single thread, read by various threads concurrently, is a good target for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in various liburcu implementations which track read-side critical sections per-thread. - Same as for the kernel, a counter increment protected by a lock which needs to be read from various threads concurrently without taking the lock could be a valid use-case, though I fail to see how it is useful due to lack of imagination on my part. ;-) I'm possibly missing other use-cases, but those come to mind as not involving racy counter increments. I agree that use-cases are so niche that we probably do _not_ want to introduce ADD_SHARED() for that purpose in a common header file, because I suspect that it would then become misused in plenty of scenarios where the updates are actually racy and would be better served by atomics or local-atomics. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug
On 2024-03-06 21:43, Linus Torvalds wrote: [...] Honestly, this all makes me think that we'd be *much* better off showing the real "handoff" with smp_store_release() and smp_load_acquire(). We've done something similar in liburcu (userspace code) to allow Thread Sanitizer to understand the happens-before relationships within the RCU implementations and lock-free data structures. Moving to load-acquire/store-release (C11 model in our case) allowed us to provide enough happens-before relationship for Thread Sanitizer to understand what is happening under the hood in liburcu and perform relevant race detection of user code. As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern is concerned, the only valid use-case I can think of is split counters or RCU implementations where there is a single updater doing the increment, and one or more concurrent reader threads that need to snapshot a consistent value with READ_ONCE(). Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v2] tracing: Limit trace_marker writes to just 4K
On 2024-03-04 22:34, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Limit the max print event of trace_marker to just 4K string size. This must also be less than the amount that can be held by a trace_seq along with the text that is before the output (like the task name, PID, CPU, state, etc). As trace_seq is made to handle large events (some greater than 4K). Make the max size of a trace_marker write event be 4K which is guaranteed to fit in the trace_seq buffer. Suggested-by: Mathieu Desnoyers From my perspective I only attempted to clarify the point Linus made about limiting the trace_marker input to 4kB. Feel adapt the Suggested-by tag accordingly. Reviewed-by: Mathieu Desnoyers Thanks, Mathieu Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240304192710.4c996...@gandalf.local.home/ - Just make the max limit 4K and not half of the trace_seq size. The trace_seq is already made to handle events greater than 4k. kernel/trace/trace.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 8198bfc54b58..d16b95ca58a7 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7293,6 +7293,8 @@ tracing_free_buffer_release(struct inode *inode, struct file *filp) return 0; } +#define TRACE_MARKER_MAX_SIZE 4096 + static ssize_t tracing_mark_write(struct file *filp, const char __user *ubuf, size_t cnt, loff_t *fpos) @@ -7320,6 +7322,9 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if ((ssize_t)cnt < 0) return -EINVAL; + if (cnt > TRACE_MARKER_MAX_SIZE) + cnt = TRACE_MARKER_MAX_SIZE; + meta_size = sizeof(*entry) + 2; /* add '\0' and possible '\n' */ again: size = cnt + meta_size; @@ -7328,11 +7333,6 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if (cnt < FAULTED_SIZE) size += FAULTED_SIZE - cnt; - if (size > TRACE_SEQ_BUFFER_SIZE) { - cnt -= size - TRACE_SEQ_BUFFER_SIZE; - goto again; - } - buffer = tr->array_buffer.buffer; event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size, tracing_gen_ctx()); -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 21:37, Steven Rostedt wrote: On Mon, 4 Mar 2024 21:35:38 -0500 Steven Rostedt wrote: And it's not for debugging, it's for validation of assumptions made about an upper bound limit defined for a compile-time check, so as the code evolves issues are caught early. validating is debugging. Did Linus put you up to this? To test me to see if I'm learning how to say "No" ;-) No, he did not. I genuinely think that validating size limits like this either at compile time or, when they can vary at runtime like in this case, with a dynamic check, decreases the cognitive load on the reviewers. We can then assume that whatever limit was put in place is actually enforced and not just wishful thinking. If the "header" size upper bound is not validated at runtime, there is not much point in adding the BUILD_BUG_ON() based on that value in the first place, and you should then just add a runtime check that you don't overflow the output buffer before writing the output to it. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:59, Steven Rostedt wrote: On Mon, 4 Mar 2024 20:42:39 -0500 Mathieu Desnoyers wrote: #define TRACE_OUTPUT_META_DATA_MAX_LEN 80 and a runtime check in the code generating this header. This would avoid adding an unchecked upper limit. That would be a lot of complex code that is for debugging something that has never happened in the past 16 years and Linus has already reprimanded me on doing such things. Is it more complex than remembering the iterator position in print_trace_fmt() right before: if (tr->trace_flags & TRACE_ITER_CONTEXT_INFO) { if (iter->iter_flags & TRACE_FILE_LAT_FMT) trace_print_lat_context(iter); else trace_print_context(iter); } and then checking just after that the offset is not beyond 128 bytes ? Perhaps there is even something internal to "iter" that already knows about this offset (?). This would catch any context going beyond its planned upper limit early. Failing early and not just in rare corner cases is good. And it's not for debugging, it's for validation of assumptions made about an upper bound limit defined for a compile-time check, so as the code evolves issues are caught early. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:41, Steven Rostedt wrote: On Mon, 4 Mar 2024 20:35:16 -0500 Steven Rostedt wrote: BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE); That's not the meta size I'm worried about. The sizeof(meta data) is the raw event binary data, which is not related to the size of the event output. # cd /sys/kernel/tracing # echo hello > trace_marker # cat trace [..] <...>-999 [001] . 2296.140373: tracing_mark_write: hello ^^^ This is the meta data that is added to trace_seq That said, the meta data is most likely not going to be more than 128 bytes (it shouldn't be more than 80). I could do as you suggest and create a separate TRACE_MARKER_SIZE and just make sure that it's less than TRACE_SEQ_BUFFER_SIZE (as that's the size of the content) by 128 bytes. /* Added meta data should not be more than 128 bytes */ BUILD_BUG_ON((TRACE_MARKER_MAX_SIZE + 128) > TRACE_SEQ_BUFFER_SIZE); Bonus points if you add #define TRACE_OUTPUT_META_DATA_MAX_LEN 80 and a runtime check in the code generating this header. This would avoid adding an unchecked upper limit. Thanks, Mathieu -- Steve -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 20:35, Steven Rostedt wrote: On Mon, 4 Mar 2024 20:15:57 -0500 Mathieu Desnoyers wrote: On 2024-03-04 19:27, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Since the size of trace_seq's buffer is the max an event can output, have the trace_marker be half of the entire TRACE_SEQ_SIZE, which is 4K. That will keep writes that has meta data written from being dropped (but reported), because the total output of the print event is greater than what the trace_seq can hold. Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2" seems backwards. It's basically using a define of the maximum buffer size for the pretty-printing output and defining the maximum input size of a system call to half of that. I'd rather see, in a header file shared between tracing mark write implementation and output implementation: #define TRACING_MARK_MAX_SIZE 4096 and then a static validation that this input fits within your pretty printing output in the output implementation file: BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE); That's not the meta size I'm worried about. The sizeof(meta data) is the raw event binary data, which is not related to the size of the event output. # cd /sys/kernel/tracing # echo hello > trace_marker # cat trace [..] <...>-999 [001] . 2296.140373: tracing_mark_write: hello ^^^ This is the meta data that is added to trace_seq If this header has a known well-defined upper-limit length, then use that in the BUILD_BUG_ON(). Thanks, Mathieu -- Steve -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE
On 2024-03-04 19:27, Steven Rostedt wrote: From: "Steven Rostedt (Google)" Since the size of trace_seq's buffer is the max an event can output, have the trace_marker be half of the entire TRACE_SEQ_SIZE, which is 4K. That will keep writes that has meta data written from being dropped (but reported), because the total output of the print event is greater than what the trace_seq can hold. Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2" seems backwards. It's basically using a define of the maximum buffer size for the pretty-printing output and defining the maximum input size of a system call to half of that. I'd rather see, in a header file shared between tracing mark write implementation and output implementation: #define TRACING_MARK_MAX_SIZE 4096 and then a static validation that this input fits within your pretty printing output in the output implementation file: BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > TRACE_SEQ_SIZE); This way we clearly document that the tracing mark write input limit is 4kB, rather than something derived from the size of an output buffer. Thanks, Mathieu Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 8198bfc54b58..d68544aef65f 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -7320,6 +7320,17 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if ((ssize_t)cnt < 0) return -EINVAL; + /* +* TRACE_SEQ_SIZE is the total size of trace_seq buffer used +* for output. As the print event outputs more than just +* the string written, keep it smaller than the trace_seq +* as it could drop the event if the extra data makes it bigger +* than what the trace_seq can hold. Half he TRACE_SEQ_SIZE +* is more than enough. +*/ + if (cnt > TRACE_SEQ_SIZE / 2) + cnt = TRACE_SEQ_SIZE / 2; + meta_size = sizeof(*entry) + 2; /* add '\0' and possible '\n' */ again: size = cnt + meta_size; @@ -7328,11 +7339,6 @@ tracing_mark_write(struct file *filp, const char __user *ubuf, if (cnt < FAULTED_SIZE) size += FAULTED_SIZE - cnt; - if (size > TRACE_SEQ_BUFFER_SIZE) { - cnt -= size - TRACE_SEQ_BUFFER_SIZE; - goto again; - } - buffer = tr->array_buffer.buffer; event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size, tracing_gen_ctx()); -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] tracing: Remove precision vsnprintf() check from print event
On 2024-03-04 17:43, Steven Rostedt wrote: From: "Steven Rostedt (Google)" This reverts 60be76eeabb3d ("tracing: Add size check when printing trace_marker output"). The only reason the precision check was added was because of a bug that miscalculated the write size of the string into the ring buffer and it truncated it removing the terminating nul byte. On reading the trace it crashed the kernel. But this was due to the bug in the code that happened during development and should never happen in practice. If anything, the precision can hide bugs where the string in the ring buffer isn't nul terminated and it will not be checked. Link: https://lore.kernel.org/all/c7e7af1a-d30f-4d18-b8e5-af1ef5800...@linux.ibm.com/ Link: https://lore.kernel.org/linux-trace-kernel/20240227125706.04279...@gandalf.local.home Link: https://lore.kernel.org/all/20240302111244.3a167...@gandalf.local.home/ Reported-by: Sachin Sant Fixes: 60be76eeabb3d ("tracing: Add size check when printing trace_marker output") Signed-off-by: Steven Rostedt (Google) This is a step in the right direction IMHO. Reviewed-by: Mathieu Desnoyers Just out of curiosity, is there anything to prevent trace_marker from writing a huge string into the ring buffer in the first place ? Is this limit implicit and based on the page size of the architecture or is it a known fixed limit ? (e.g. 4kB strings). It appears to currently be limited by #define TRACE_SEQ_BUFFER_SIZE (PAGE_SIZE * 2 - \ (sizeof(struct seq_buf) + sizeof(size_t) + sizeof(int))) checked within tracing_mark_write(). I would have hoped for a simpler limit (e.g. 4kB) consistent across architectures. But that would belong to a separate change. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent environment
On 2024-03-01 10:49, Steven Rostedt wrote: On Fri, 1 Mar 2024 13:37:18 +0800 linke wrote: So basically you are worried about read-tearing? That wasn't mentioned in the change log. Yes. Sorry for making this confused, I am not very familiar with this and still learning. No problem. We all have to learn this anyway. Funny part is, if the above timestamp read did a tear, then this would definitely not match, and would return the correct value. That is, the buffer is not empty because the only way for this to get corrupted is if something is in the process of writing to it. I agree with you here. commit = rb_page_commit(commit_page); But if commit_page above is the result of a torn read, the commit field read by rb_page_commit() may not represent a valid value. But commit_page is a word length, and I will argue that any compiler that tears "long" words is broken. ;-) [ For those tuning in, we are discussing ring_buffer_iter_empty() "commit_page = cpu_buffer->commit_page;" racy load. ] I counter-argue that real-world compilers *are* broken based on your personal definition, but we have to deal with them, as documented in Documentation/memory-barriers.txt (see below). What is the added overhead of using a READ_ONCE() there ? Why are we wasting effort trying to guess the compiler behavior if the real-world performance impact is insignificant ? Quote from memory-barrier.txt explaining the purpose of {READ,WRITE}_ONCE(): "(*) For aligned memory locations whose size allows them to be accessed with a single memory-reference instruction, prevents "load tearing" and "store tearing," in which a single large access is replaced by multiple smaller accesses." I agree that {READ,WRITE}_ONCE() are really not needed at initialization, when there are demonstrably no concurrent accesses to the data But trying to eliminate {READ,WRITE}_ONCE() on concurrently accessed fields just adds complexity, prevents static analyzers to properly understand the code and report issues, and just obfuscates the code. Thanks, Mathieu In this case, READ_ONCE() is only needed for the commit_page. But we can at least keep the READ_ONCE() on the commit_page just because it is used in the next instruction. -- Steve -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [lttng-dev] [PATCH] coredump debugging: add a tracepoint to report the coredumping
On 2024-02-23 09:26, Steven Rostedt wrote: On Mon, 19 Feb 2024 13:01:16 -0500 Mathieu Desnoyers wrote: Between "sched_process_exit" and "sched_process_free", the task can still be observed by a trace analysis looking at sched and signal events: it's a zombie at that stage. Looking at the history of this tracepoint, it was added in 2008 by commit 0a16b60758433 ("tracing, sched: LTTng instrumentation - scheduler"). Hmm, LLTng? I wonder who the author was? [ common typo: LLTng -> LTTng ;-) ] Author: Mathieu Desnoyers :-D Mathieu, I would say it's your call on where the tracepoint can be located. You added it, you own it! Wow! that's now 16 years ago :) I've checked with Matthew Khouzam (maintainer of Trace Compass) which care about this tracepoint, and we have not identified any significant impact of moving it on its model of the scheduler, other than slightly changing its timing. I've also checked quickly in lttng-analyses and have not found any code that care about its specific placement. So I would say go ahead and move it earlier in do_exit(), it's fine by me. If you are interested in a bit of archeology, "sched_process_free" originated from my ltt-experimental 0.1.99.13 kernel patch against 2.6.12-rc4-mm2 back in September 2005 (that's 19 years ago). It was a precursor to the LTTng 0.x kernel patchset. https://lttng.org/files/ltt-experimental/patch-2.6.12-rc4-mm2-ltt-exp-0.1.99.13.gz Index: kernel/exit.c === --- a/kernel/exit.c (.../trunk/kernel/linux-2.6.12-rc4-mm2) (revision 41) +++ b/kernel/exit.c (.../branches/mathieu/linux-2.6.12-rc4-mm2) (revision 41) @@ -4,6 +4,7 @@ * Copyright (C) 1991, 1992 Linus Torvalds */ +#include #include #include #include @@ -55,6 +56,7 @@ static void __unhash_process(struct task } REMOVE_LINKS(p); + trace_process_free(p->pid); } void release_task(struct task_struct * p) @@ -832,6 +834,8 @@ fastcall NORET_TYPE void do_exit(long co } exit_mm(tsk); + trace_process_exit(tsk->pid); + exit_sem(tsk); __exit_files(tsk); __exit_fs(tsk); This was a significant improvement over the prior LTT which only had the equivalent of "sched_process_exit", which caused issues with the Linux scheduler model in LTTV due to zombie processes. Here is where it appeared in LTT back in 1999: http://www.opersys.com/ftp/pub/LTT/TracePackage-0.9.0.tgz patch-ltt-2.2.13-991118 diff -urN linux/kernel/exit.c linux-2.2.13/kernel/exit.c --- linux/kernel/exit.c Tue Oct 19 20:14:02 1999 +++ linux-2.2.13/kernel/exit.c Sun Nov 7 23:49:17 1999 @@ -14,6 +14,8 @@ #include #endif +#include + #include #include #include @@ -386,6 +388,8 @@ del_timer(>real_timer); end_bh_atomic(); + TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0); + lock_kernel(); fake_volatile: #ifdef CONFIG_BSD_PROCESS_ACCT And it was moved to its current location (after exit_mm()) a bit later (2001): http://www.opersys.com/ftp/pub/LTT/TraceToolkit-0.9.5pre2.tgz Patches/patch-ltt-linux-2.4.5-vanilla-010909-1.10 diff -urN linux/kernel/exit.c /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c --- linux/kernel/exit.c Fri May 4 17:44:06 2001 +++ /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c Wed Jun 20 12:39:24 2001 @@ -14,6 +14,8 @@ #include #endif +#include + #include #include #include @@ -439,6 +441,8 @@ #endif __exit_mm(tsk); + TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0); + lock_kernel(); sem_exit(); __exit_files(tsk); So this sched_process_exit placement was actually decided by Karim Yaghmour back in the LTT days (2001). I don't think he will mind us moving it around some 23 years later. ;) Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com ___ lttng-dev mailing list lttng-dev@lists.lttng.org https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-02-20 09:19, Steven Rostedt wrote: On Mon, 19 Feb 2024 18:20:32 -0500 Steven Rostedt wrote: Instead of using local_add_return() to reserve the ring buffer data, Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the reservation with the time keeping code. Although, it does not get rid of the double time stamps (before_stamp and write_stamp), using cmpxchg() does get rid of the more complex case when an interrupting event occurs between getting the timestamps and reserving the data, as when that happens, it just tries again instead of dealing with it. Before we had: w = local_read(_page->write); /* get time stamps */ write = local_add_return(length, _page->write); if (write - length == w) { /* do simple case */ } else { /* do complex case */ } By switching the local_add_return() to a local_try_cmpxchg() it can now be: w = local_read(_page->write); again: /* get time stamps */ if (!local_try_cmpxchg(_page->write, , w + length)) goto again; /* do simple case */ Something about this logic is causing __rb_next_reserve() to sometimes always return -EAGAIN and triggering the: RB_WARN_ON(cpu_buffer, ++nr_loops > 1000) Which disables the ring buffer. I'm not sure what it is, but until I do, I'm removing the patch from my queue. Try resetting the info->add_timestamp flags to add_ts_default on goto again within __rb_reserve_next(). Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] dax: add set_dax_nomc() and set_dax_nocache() stub helpers
On 2024-02-16 15:22, Arnd Bergmann wrote: From: Arnd Bergmann In some randconfig builds, the IS_ERR() check appears to not get completely eliminated, resulting in the compiler to insert references to these two functions that cause a link failure: ERROR: modpost: "set_dax_nocache" [drivers/md/dm-mod.ko] undefined! ERROR: modpost: "set_dax_nomc" [drivers/md/dm-mod.ko] undefined! Add more stub functions for the dax-disabled case here to make it build again. Hi Arnd, Note that this is a duplicate of: https://lore.kernel.org/lkml/20240215144633.96437-2-mathieu.desnoy...@efficios.com/ now present in Andrew's tree. The only differences are the subject, commit message and a newline between "set_dax_nomc" and "set_dax_synchronous" in your change. Thanks, Mathieu Fixes: d888f6b0a766 ("dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202402160420.e4qkwogo-...@intel.com/ Signed-off-by: Arnd Bergmann --- include/linux/dax.h | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/include/linux/dax.h b/include/linux/dax.h index df2d52b8a245..4527c10016fb 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -64,6 +64,9 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc); bool dax_write_cache_enabled(struct dax_device *dax_dev); bool dax_synchronous(struct dax_device *dax_dev); void set_dax_synchronous(struct dax_device *dax_dev); +void set_dax_nocache(struct dax_device *dax_dev); +void set_dax_nomc(struct dax_device *dax_dev); + size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i); /* @@ -108,6 +111,12 @@ static inline bool dax_synchronous(struct dax_device *dax_dev) static inline void set_dax_synchronous(struct dax_device *dax_dev) { } +static inline void set_dax_nocache(struct dax_device *dax_dev) +{ +} +static inline void set_dax_nomc(struct dax_device *dax_dev) +{ +} static inline bool daxdev_mapping_supported(struct vm_area_struct *vma, struct dax_device *dax_dev) { @@ -120,9 +129,6 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, } #endif -void set_dax_nocache(struct dax_device *dax_dev); -void set_dax_nomc(struct dax_device *dax_dev); - struct writeback_control; #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX) int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk); -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v2] nvdimm/pmem: Fix leak on dax_add_host() failure
On 2024-02-15 09:43, Mathieu Desnoyers wrote: Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done before setting pmem->dax_dev, which therefore issues the two following calls on NULL pointers: Hi Andrew, I notice that you should update the patch you have in your tree (https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch) with this updated version which includes additional Reviewed-by tags and removes unneeded context that appears to be taken from the previous cover letter. Following your request, I have extracted this patch from the series. Thanks, Mathieu out_cleanup_dax: kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Reviewed-by: Dave Jiang Reviewed-by: Fan Ni Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- Changes since v1: - Add Reviewed-by tags. --- drivers/nvdimm/pmem.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 4e8fdcb3f1c8..9fe358090720 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev, set_dax_nomc(dax_dev); if (is_nvdimm_sync(nd_region)) set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; rc = dax_add_host(dax_dev, disk); if (rc) goto out_cleanup_dax; dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); - pmem->dax_dev = dax_dev; - rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[PATCH v6 8/9] Introduce cpu_dcache_is_aliasing() across all architectures
Introduce a generic way to query whether the data cache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For data cache aliasing, there are three scenarios dependending on the architecture. Here is a breakdown based on my understanding: A) The data cache is always aliasing: * arc * csky * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.) * sh * parisc B) The data cache aliasing is statically known or depends on querying CPU state at runtime: * arm (cache_is_vivt() || cache_is_vipt_aliasing()) * mips (cpu_has_dc_aliases) * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE) * sparc32 (vac_cache_size > PAGE_SIZE) * sparc64 (L1DCACHE_SIZE > PAGE_SIZE) * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE) C) The data cache is never aliasing: * alpha * arm64 (aarch64) * hexagon * loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE")) * microblaze * openrisc * powerpc * riscv * s390 * um * x86 Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and implement "cpu_dcache_is_aliasing()". Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus cpu_dcache_is_aliasing() simply evaluates to "false". Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future work. This would be useful to gate features like XIP on architectures which have aliasing CPU dcache-icache but not CPU dcache-dcache. Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache" to clarify that we really mean "CPU data cache" and "CPU cache" to eliminate any possible confusion with VFS "dentry cache" and "page cache". Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h | 6 ++ mm/Kconfig | 6 ++ 22 files changed, 112 insertions(+) create mode 100644 arch/arc/include/asm/cachetype.h create mode 100644 arch/csky/include/asm/cachetype.h create mode 100644 arch/m68k/include/asm/cachetype.h create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/parisc/include/asm/cachetype.h create mode 100644 arch/sh/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 1b0483c51cc1..7d294a3242a4 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,6 +6,7 @@ config ARC def_bool y select ARC_TIMERS + select ARCH_HAS_CPU_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h new file mode 100644 index ..05fc7ed59712 --- /dev/null +++ b/arch/arc/include/asm/cachetype.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_ARC_CACHETYPE_H +#define __ASM_ARC_CACHETYPE_H + +#include + +#define cpu_dcache_is_aliasing() true + +#endif diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f8567e95f98b..cd13b1788973 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -5,6 +5,7 @@ conf
[PATCH v6 9/9] dax: Fix incorrect list of data cache aliasing architectures
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased data cache. This is a regression introduced in the v4.0 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no data cache aliasing, and therefore should work fine with FS_DAX. This was turned into the following check in alloc_dax() by a preparatory change: if (ops && (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_MIPS) || IS_ENABLED(CONFIG_SPARC))) return NULL; Use cpu_dcache_is_aliasing() instead to figure out whether the environment has aliasing data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index ce5bffa86bba..a21a7c262382 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "dax-private.h" /** @@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) * except for device-dax (NULL operations pointer), which does * not use aliased mappings from the kernel. */ - if (ops && (IS_ENABLED(CONFIG_ARM) || - IS_ENABLED(CONFIG_MIPS) || - IS_ENABLED(CONFIG_SPARC))) + if (ops && cpu_dcache_is_aliasing()) return ERR_PTR(-EOPNOTSUPP); if (WARN_ON_ONCE(ops && !ops->zero_page_range)) -- 2.39.2
[PATCH v6 7/9] dax: Check for data cache aliasing at runtime
Replace the following fs/Kconfig:FS_DAX dependency: depends on !(ARM || MIPS || SPARC) By a runtime check within alloc_dax(). This runtime check returns ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means the kernel is using an aliased mapping) on an architecture which has data cache aliasing. Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n for consistency. This is done in preparation for using cpu_dcache_is_aliasing() in a following change which will properly support architectures which detect data cache aliasing at runtime. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 10 ++ fs/Kconfig | 1 - 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 205b888d45bf..ce5bffa86bba 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -450,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) dev_t devt; int minor; + /* +* Unavailable on architectures with virtually aliased data caches, +* except for device-dax (NULL operations pointer), which does +* not use aliased mappings from the kernel. +*/ + if (ops && (IS_ENABLED(CONFIG_ARM) || + IS_ENABLED(CONFIG_MIPS) || + IS_ENABLED(CONFIG_SPARC))) + return ERR_PTR(-EOPNOTSUPP); + if (WARN_ON_ONCE(ops && !ops->zero_page_range)) return ERR_PTR(-EINVAL); diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..e5efdb3b276b 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,6 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX -- 2.39.2
[PATCH v6 5/9] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dcssblk dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures. Considering that s390 is not a data cache aliasing architecture, and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP from alloc_dax() should make dcssblk_add_store() fail. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Acked-by: Heiko Carstens Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org --- drivers/s390/block/dcssblk.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index 4b7ecd4fd431..f363c1d51d9a 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char int rc, i, j, num_of_segments; struct dcssblk_dev_info *dev_info; struct segment_info *seg_info, *temp; + struct dax_device *dax_dev; char *local_buf; unsigned long seg_byte_size; @@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char if (rc) goto put_dev; - dev_info->dax_dev = alloc_dax(dev_info, _dax_ops); - if (IS_ERR(dev_info->dax_dev)) { - rc = PTR_ERR(dev_info->dax_dev); - dev_info->dax_dev = NULL; + dax_dev = alloc_dax(dev_info, _dax_ops); + if (IS_ERR(dax_dev)) { + rc = PTR_ERR(dax_dev); goto put_dev; } - set_dax_synchronous(dev_info->dax_dev); + set_dax_synchronous(dax_dev); + dev_info->dax_dev = dax_dev; rc = dax_add_host(dev_info->dax_dev, dev_info->gd); if (rc) goto out_dax; -- 2.39.2
[PATCH v6 6/9] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of virtio virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Co-developed-by: Dan Williams Signed-off-by: Dan Williams Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- fs/fuse/virtio_fs.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 5f1be1da92ce..a28466c2da71 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "fuse_i.h" @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data) put_dax(dax_dev); } +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T)) + static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) { + struct dax_device *dax_dev __free(cleanup_dax) = NULL; struct virtio_shm_region cache_reg; struct dev_pagemap *pgmap; bool have_cache; @@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) if (!IS_ENABLED(CONFIG_FUSE_DAX)) return 0; + dax_dev = alloc_dax(fs, _fs_dax_ops); + if (IS_ERR(dax_dev)) { + int rc = PTR_ERR(dax_dev); + return rc == -EOPNOTSUPP ? 0 : rc; + } + /* Get cache region */ have_cache = virtio_get_shm_region(vdev, _reg, (u8)VIRTIO_FS_SHMCAP_ID_CACHE); @@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len); - fs->dax_dev = alloc_dax(fs, _fs_dax_ops); - if (IS_ERR(fs->dax_dev)) - return PTR_ERR(fs->dax_dev); - + fs->dax_dev = no_free_ptr(dax_dev); return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax, fs->dax_dev); } -- 2.39.2
[PATCH v6 3/9] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of nvdimm/pmem pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. [ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ] Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 9fe358090720..e9898457a7bd 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev, dax_dev = alloc_dax(pmem, _dax_ops); if (IS_ERR(dax_dev)) { rc = PTR_ERR(dax_dev); - goto out; + if (rc != -EOPNOTSUPP) + goto out; + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + if (is_nvdimm_sync(nd_region)) + set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; + rc = dax_add_host(dax_dev, disk); + if (rc) + goto out_cleanup_dax; + dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); } - set_dax_nocache(dax_dev); - set_dax_nomc(dax_dev); - if (is_nvdimm_sync(nd_region)) - set_dax_synchronous(dax_dev); - pmem->dax_dev = dax_dev; - rc = dax_add_host(dax_dev, disk); - if (rc) - goto out_cleanup_dax; - dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; -- 2.39.2
[PATCH v6 4/9] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dm alloc_dev() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Suggested-by: Dan Williams Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/md/dm.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 23c32cd1f1d8..acdc00bc05be 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device *md) static struct mapped_device *alloc_dev(int minor) { int r, numa_node_id = dm_get_numa_node(); + struct dax_device *dax_dev; struct mapped_device *md; void *old_md; @@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor) md->disk->private_data = md; sprintf(md->disk->disk_name, "dm-%d", minor); - if (IS_ENABLED(CONFIG_FS_DAX)) { - md->dax_dev = alloc_dax(md, _dax_ops); - if (IS_ERR(md->dax_dev)) { - md->dax_dev = NULL; + dax_dev = alloc_dax(md, _dax_ops); + if (IS_ERR(dax_dev)) { + if (PTR_ERR(dax_dev) != -EOPNOTSUPP) goto bad; - } - set_dax_nocache(md->dax_dev); - set_dax_nomc(md->dax_dev); - if (dax_add_host(md->dax_dev, md->disk)) + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + md->dax_dev = dax_dev; + if (dax_add_host(dax_dev, md->disk)) goto bad; } -- 2.39.2
[PATCH v6 2/9] dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n
Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y never returns NULL. This is done in preparation for using cpu_dcache_is_aliasing() in a following change which will properly support architectures which detect data cache aliasing at runtime. Fixes: 4e4ced93794a ("dax: Move mandatory ->zero_page_range() check in alloc_dax()") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 5 + include/linux/dax.h | 6 +- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 0da9232ea175..205b888d45bf 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive); * that any fault handlers or operations that might have seen * dax_alive(), have completed. Any operations that start after * synchronize_srcu() has run will abort upon seeing !dax_alive(). + * + * Note, because alloc_dax() returns an ERR_PTR() on error, callers + * typically store its result into a local variable in order to check + * the result. Therefore, care must be taken to populate the struct + * device dax_dev field make sure the dax_dev is not leaked. */ void kill_dax(struct dax_device *dax_dev) { diff --git a/include/linux/dax.h b/include/linux/dax.h index e3ffe7c7f01d..9d3e3327af4c 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -88,11 +88,7 @@ static inline void *dax_holder(struct dax_device *dax_dev) static inline struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) { - /* -* Callers should check IS_ENABLED(CONFIG_DAX) to know if this -* NULL is an error or expected. -*/ - return NULL; + return ERR_PTR(-EOPNOTSUPP); } static inline void put_dax(struct dax_device *dax_dev) { -- 2.39.2
[PATCH v6 1/9] dax: add empty static inline for CONFIG_DAX=n
When building a kernel with CONFIG_DAX=n, all uses of set_dax_nocache() and set_dax_nomc() need to be either within regions of code or compile units which are explicitly not compiled, or they need to rely on compiler optimizations to eliminate calls to those undefined symbols. It appears that at least the openrisc and loongarch architectures don't end up eliminating those undefined symbols even if they are provably within code which is eliminated due to conditional branches depending on constants. Implement empty static inline functions for set_dax_nocache() and set_dax_nomc() in CONFIG_DAX=n to ensure those undefined references are removed. Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202402140037.wgfa1kqx-...@intel.com/ Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202402131351.a0fzogeg-...@intel.com/ Fixes: 7ac5360cd4d0 ("dax: remove the copy_from_iter and copy_to_iter methods") Signed-off-by: Mathieu Desnoyers Cc: Christoph Hellwig Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- include/linux/dax.h | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..e3ffe7c7f01d 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -63,6 +63,8 @@ void kill_dax(struct dax_device *dax_dev); void dax_write_cache(struct dax_device *dax_dev, bool wc); bool dax_write_cache_enabled(struct dax_device *dax_dev); bool dax_synchronous(struct dax_device *dax_dev); +void set_dax_nocache(struct dax_device *dax_dev); +void set_dax_nomc(struct dax_device *dax_dev); void set_dax_synchronous(struct dax_device *dax_dev); size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i); @@ -109,6 +111,12 @@ static inline bool dax_synchronous(struct dax_device *dax_dev) { return true; } +static inline void set_dax_nocache(struct dax_device *dax_dev) +{ +} +static inline void set_dax_nomc(struct dax_device *dax_dev) +{ +} static inline void set_dax_synchronous(struct dax_device *dax_dev) { } @@ -124,9 +132,6 @@ static inline size_t dax_recovery_write(struct dax_device *dax_dev, } #endif -void set_dax_nocache(struct dax_device *dax_dev); -void set_dax_nomc(struct dax_device *dax_dev); - struct writeback_control; #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX) int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk); -- 2.39.2
[PATCH v6 0/9] Introduce cpu_dcache_is_aliasing() to fix DAX regression
This commit introduced in v4.0 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased data caches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Even though it used to work fine before. The root of the issue here is the fact that DAX was never designed to handle virtually aliasing data caches (VIVT and VIPT with aliasing data cache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings with virtually aliasing data caches. This patch series introduces cpu_dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all architectures. The implementation of cpu_dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Testing done so far: - Compile allyesconfig on x86-64, - Compile allyesconfig on x86-64, with FS_DAX=n. - Compile allyesconfig on x86-64, with DAX=n. - Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP even on x86-64, thus simulating the behavior expected on an architecture with data cache aliasing. There are many more axes to test however. I would welcome Tested-by for: - affected architectures, - affected drivers, - affected filesytems. [ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ] Thanks, Mathieu Changes since v5: - Add empty static inline set_dax_nocache() and set_dax_nomc() for CONFIG_DAX=n. - Update "Fixes" tag for "dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n". - Check IS_ERR_OR_NULL() before calling virtio_fs_cleanup_dax() within virtio_fs_setup_dax(). Changes since v4: - Move the change which makes alloc_dax() return ERR_PTR(-EOPNOTSUPP) when CONFIG_DAX=n earlier in the series, - Fold driver cleanup patches into their respective per-driver changes. - Move "nvdimm/pmem: Fix leak on dax_add_host() failure" outside of this series. Changes since v3: - Fix a leak on dax_add_host() failure in nvdimm/pmem. - Split the series into a bissectable sequence of changes. - Ensure that device-dax use-cases still works on data cache aliasing architectures. Changes since v2: - Move DAX supported runtime check to alloc_dax(), - Modify DM to handle alloc_dax() error as non-fatal, - Remove all filesystem modifications, since the check is now done by alloc_dax(), - rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to eliminate confusion with VFS terminology. Changes since v1: - The order of the series was completely changed based on the feedback received on v1, - cache_is_aliasing() is renamed to dcache_is_aliasing(), - ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone, - dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is simplified, - the dax_is_supported() check was moved to its rightful place in all filesystems. Cc: Andrew Morton Cc: Linus Torvalds Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org Mathieu Desnoyers (9): dax: add empty static inline for CONFIG_DAX=n dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dcssblk: Handle alloc_dax() -EOPNOTSUPP failure virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dax: Check for data cache aliasing at runtime Introduce cpu_dcache_is_aliasing() across all architectures dax: Fix incorrect list of data cache aliasing architectures arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ drivers
[PATCH v2] nvdimm/pmem: Fix leak on dax_add_host() failure
Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done before setting pmem->dax_dev, which therefore issues the two following calls on NULL pointers: out_cleanup_dax: kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Reviewed-by: Dave Jiang Reviewed-by: Fan Ni Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- Changes since v1: - Add Reviewed-by tags. --- drivers/nvdimm/pmem.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 4e8fdcb3f1c8..9fe358090720 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev, set_dax_nomc(dax_dev); if (is_nvdimm_sync(nd_region)) set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; rc = dax_add_host(dax_dev, disk); if (rc) goto out_cleanup_dax; dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); - pmem->dax_dev = dax_dev; - rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; -- 2.39.2
Re: CPU data cache across reboot/kexec for pmem/dax devices
On 2024-02-09 15:15, Dan Williams wrote: Mathieu Desnoyers wrote: Hi Dan, In the context of extracting user-space trace data when the kernel crashes, the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the resulting device to create/mount a dax-enabled fs (e.g. ext4). We then use this filesystem to mmap() the shared memory files for the tracer. I want to make sure that the very last events from the userspace tracer written to the memory mapped buffers (mmap()) by userspace are present after a warm-reboot (or kexec/kdump). Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush (or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer data is usually overwritten many times before the system actually crashes, and the only thing we really need to make sure is that the cache lines are not invalidated without write back. So I understand that the main use-case for pmem is nvdimm, and that in order to guarantee persistence of the data on power off an explicit pmem_persist() is needed after each "transaction", but for the needs of tracing, is there some kind of architectural guarantee that the data present in the cpu data cache is not invalidated prior to write back in each of those scenarios ? - reboot with bios explicitly not clearing memory, This one gives me pause, because a trip through the BIOS typically means lots of resets and other low level magic, so this would likely require pushing dirty data out of CPU caches prior to entering the BIOS code paths. So this either needs explicit cache flushing or mapping the memory with write-through semantics. That latter one is not supported in the stack today. I would favor the explicit cache flushing approach over write-through semantics for performance reasons: typically the ring buffers are overwritten, so always storing the data to memory would be wasteful. But I would rather do the cache flushing only on crash (die oops/panic notifiers) rather than require the tracer to flush cache lines immediately after each event is produced, again for performance reasons. I have done a crude attempt at registering die oops/panic notifiers from the pmem driver, and flush all pmem areas to memory when die oops/panic notifiers are called (see the untested patch below). I wonder if this is something that would be generally useful ? - kexec/kdump. This should maintain the state of CPU caches. As far as the CPU is concerned it is just long jumping into a new kernel in memory without resetting any CPU cache state. Good to know that this scenario is expected to "Just Work". Thanks, Mathieu diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index e9898457a7bd..b92f18607d39 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -26,12 +26,18 @@ #include #include #include +#include #include #include "pmem.h" #include "btt.h" #include "pfn.h" #include "nd.h" +static int pmem_die_handler(struct notifier_block *self, + unsigned long ev, void *unused); +static int pmem_panic_handler(struct notifier_block *self, + unsigned long ev, void *unused); + static struct device *to_dev(struct pmem_device *pmem) { /* @@ -423,6 +429,9 @@ static void pmem_release_disk(void *__pmem) { struct pmem_device *pmem = __pmem; + atomic_notifier_chain_unregister(_notifier_list, + >panic_notifier); + unregister_die_notifier(>die_notifier); dax_remove_host(pmem->disk); kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); @@ -573,9 +582,20 @@ static int pmem_attach_disk(struct device *dev, goto out_cleanup_dax; dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); } - rc = device_add_disk(dev, disk, pmem_attribute_groups); + pmem->die_notifier.notifier_call = pmem_die_handler; + pmem->die_notifier.priority = -INT_MAX; + rc = register_die_notifier(>die_notifier); if (rc) goto out_remove_host; + pmem->panic_notifier.notifier_call = pmem_panic_handler; + pmem->panic_notifier.priority = -INT_MAX; + rc = atomic_notifier_chain_register(_notifier_list, + >panic_notifier); + if (rc) + goto out_unregister_die; + rc = device_add_disk(dev, disk, pmem_attribute_groups); + if (rc) + goto out_unregister_panic; if (devm_add_action_or_reset(dev, pmem_release_disk, pmem)) return -ENOMEM; @@ -587,6 +607,11 @@ static int pmem_attach_disk(struct device *dev, dev_warn(dev, "'badblocks' notification disabled\n"); return 0; +out_unregister_panic: + atomic_notifier_chain_unregister(_notifi
Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
On 2024-02-12 18:02, Dan Williams wrote: [...] ...and Mathieu, this should be IS_ERR_OR_NULL() to skip an unnecessary call to virtio_fs_cleanup_dax() at function exit that the compiler should elide. OK, so I'll go back to the previous approach for v6: DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T)) and define the variable as: struct dax_device *dax_dev __free(cleanup_dax) = NULL; Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v5 1/8] dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n
On 2024-02-13 14:07, Dan Williams wrote: Lukas Wunner wrote: On Mon, Feb 12, 2024 at 11:30:54AM -0500, Mathieu Desnoyers wrote: Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y never returns NULL. All the callers of alloc_dax() only check for IS_ERR(). Doesn't this result in a change of behavior in all the callers? Previously they'd ignore the NULL return value and continue, now they'll error out. Given that, seems dangerous to add a Fixes tag with a v4.0 commit and thus risk regressing all stable kernels. Oh, good catch, yes that Fixes tag should be: 4e4ced93794a dax: Move mandatory ->zero_page_range() check in alloc_dax() ...as that was the one that updated the alloc_dax() calling convention without fixing up the CONFIG_DAX=n case. This is a pre-requisite for restoring DAX operation on ARM32. I'll change the Fixes tag in this commit to: Fixes: 4e4ced93794a ("dax: Move mandatory ->zero_page_range() check in alloc_dax()") for v6. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
On 2024-02-13 01:25, Lukas Wunner wrote: On Mon, Feb 12, 2024 at 11:30:58AM -0500, Mathieu Desnoyers wrote: In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of virtio virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Co-developed-by: Dan Williams Signed-off-by: Dan Williams Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") That's a v4.0 commit, yet this patch uses DEFINE_FREE() which is only available in v6.6 but not any earlier stable kernels. I asked this question to Greg KH before creating this patch, and his answer was to implement my fix for master, and stable kernels would take care of backporting all the required dependencies. Now if I look at latest 6.1, 5.15, 5.10, 5.4, 4.19 stable kernels, none seem to have include/linux/cleanup.h today. But I suspect that sooner or later relevant master branch fixes will require stable kernels to backport cleanup.h, so why not do it now ? Thanks, Mathieu So the Fixes tag feels a bit weird. Apart from that, Reviewed-by: Lukas Wunner -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[PATCH v5 8/8] dax: Fix incorrect list of data cache aliasing architectures
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased data cache. This is a regression introduced in the v4.0 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no data cache aliasing, and therefore should work fine with FS_DAX. This was turned into the following check in alloc_dax() by a preparatory change: if (ops && (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_MIPS) || IS_ENABLED(CONFIG_SPARC))) return NULL; Use cpu_dcache_is_aliasing() instead to figure out whether the environment has aliasing data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index ce5bffa86bba..a21a7c262382 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "dax-private.h" /** @@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) * except for device-dax (NULL operations pointer), which does * not use aliased mappings from the kernel. */ - if (ops && (IS_ENABLED(CONFIG_ARM) || - IS_ENABLED(CONFIG_MIPS) || - IS_ENABLED(CONFIG_SPARC))) + if (ops && cpu_dcache_is_aliasing()) return ERR_PTR(-EOPNOTSUPP); if (WARN_ON_ONCE(ops && !ops->zero_page_range)) -- 2.39.2
[PATCH v5 7/8] Introduce cpu_dcache_is_aliasing() across all architectures
Introduce a generic way to query whether the data cache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For data cache aliasing, there are three scenarios dependending on the architecture. Here is a breakdown based on my understanding: A) The data cache is always aliasing: * arc * csky * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.) * sh * parisc B) The data cache aliasing is statically known or depends on querying CPU state at runtime: * arm (cache_is_vivt() || cache_is_vipt_aliasing()) * mips (cpu_has_dc_aliases) * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE) * sparc32 (vac_cache_size > PAGE_SIZE) * sparc64 (L1DCACHE_SIZE > PAGE_SIZE) * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE) C) The data cache is never aliasing: * alpha * arm64 (aarch64) * hexagon * loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE")) * microblaze * openrisc * powerpc * riscv * s390 * um * x86 Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and implement "cpu_dcache_is_aliasing()". Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus cpu_dcache_is_aliasing() simply evaluates to "false". Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future work. This would be useful to gate features like XIP on architectures which have aliasing CPU dcache-icache but not CPU dcache-dcache. Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache" to clarify that we really mean "CPU data cache" and "CPU cache" to eliminate any possible confusion with VFS "dentry cache" and "page cache". Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h | 6 ++ mm/Kconfig | 6 ++ 22 files changed, 112 insertions(+) create mode 100644 arch/arc/include/asm/cachetype.h create mode 100644 arch/csky/include/asm/cachetype.h create mode 100644 arch/m68k/include/asm/cachetype.h create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/parisc/include/asm/cachetype.h create mode 100644 arch/sh/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 1b0483c51cc1..7d294a3242a4 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,6 +6,7 @@ config ARC def_bool y select ARC_TIMERS + select ARCH_HAS_CPU_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h new file mode 100644 index ..05fc7ed59712 --- /dev/null +++ b/arch/arc/include/asm/cachetype.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_ARC_CACHETYPE_H +#define __ASM_ARC_CACHETYPE_H + +#include + +#define cpu_dcache_is_aliasing() true + +#endif diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f8567e95f98b..cd13b1788973 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -5,6 +5,7 @@ conf
[PATCH v5 6/8] dax: Check for data cache aliasing at runtime
Replace the following fs/Kconfig:FS_DAX dependency: depends on !(ARM || MIPS || SPARC) By a runtime check within alloc_dax(). This runtime check returns ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means the kernel is using an aliased mapping) on an architecture which has data cache aliasing. Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n for consistency. This is done in preparation for using cpu_dcache_is_aliasing() in a following change which will properly support architectures which detect data cache aliasing at runtime. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 10 ++ fs/Kconfig | 1 - 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 205b888d45bf..ce5bffa86bba 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -450,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) dev_t devt; int minor; + /* +* Unavailable on architectures with virtually aliased data caches, +* except for device-dax (NULL operations pointer), which does +* not use aliased mappings from the kernel. +*/ + if (ops && (IS_ENABLED(CONFIG_ARM) || + IS_ENABLED(CONFIG_MIPS) || + IS_ENABLED(CONFIG_SPARC))) + return ERR_PTR(-EOPNOTSUPP); + if (WARN_ON_ONCE(ops && !ops->zero_page_range)) return ERR_PTR(-EINVAL); diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..e5efdb3b276b 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,6 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX -- 2.39.2
[PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of virtio virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Co-developed-by: Dan Williams Signed-off-by: Dan Williams Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- fs/fuse/virtio_fs.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 5f1be1da92ce..f9acd9972af2 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "fuse_i.h" @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data) put_dax(dax_dev); } +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR(_T)) virtio_fs_cleanup_dax(_T)) + static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) { + struct dax_device *dax_dev __free(cleanup_dax) = ERR_PTR(-EOPNOTSUPP); struct virtio_shm_region cache_reg; struct dev_pagemap *pgmap; bool have_cache; @@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) if (!IS_ENABLED(CONFIG_FUSE_DAX)) return 0; + dax_dev = alloc_dax(fs, _fs_dax_ops); + if (IS_ERR(dax_dev)) { + int rc = PTR_ERR(dax_dev); + return rc == -EOPNOTSUPP ? 0 : rc; + } + /* Get cache region */ have_cache = virtio_get_shm_region(vdev, _reg, (u8)VIRTIO_FS_SHMCAP_ID_CACHE); @@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len); - fs->dax_dev = alloc_dax(fs, _fs_dax_ops); - if (IS_ERR(fs->dax_dev)) - return PTR_ERR(fs->dax_dev); - + fs->dax_dev = no_free_ptr(dax_dev); return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax, fs->dax_dev); } -- 2.39.2
[PATCH v5 4/8] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dcssblk dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures. Considering that s390 is not a data cache aliasing architecture, and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP from alloc_dax() should make dcssblk_add_store() fail. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org --- drivers/s390/block/dcssblk.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index 4b7ecd4fd431..f363c1d51d9a 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char int rc, i, j, num_of_segments; struct dcssblk_dev_info *dev_info; struct segment_info *seg_info, *temp; + struct dax_device *dax_dev; char *local_buf; unsigned long seg_byte_size; @@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char if (rc) goto put_dev; - dev_info->dax_dev = alloc_dax(dev_info, _dax_ops); - if (IS_ERR(dev_info->dax_dev)) { - rc = PTR_ERR(dev_info->dax_dev); - dev_info->dax_dev = NULL; + dax_dev = alloc_dax(dev_info, _dax_ops); + if (IS_ERR(dax_dev)) { + rc = PTR_ERR(dax_dev); goto put_dev; } - set_dax_synchronous(dev_info->dax_dev); + set_dax_synchronous(dax_dev); + dev_info->dax_dev = dax_dev; rc = dax_add_host(dev_info->dax_dev, dev_info->gd); if (rc) goto out_dax; -- 2.39.2
[PATCH v5 3/8] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dm alloc_dev() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Suggested-by: Dan Williams Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/md/dm.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 23c32cd1f1d8..acdc00bc05be 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device *md) static struct mapped_device *alloc_dev(int minor) { int r, numa_node_id = dm_get_numa_node(); + struct dax_device *dax_dev; struct mapped_device *md; void *old_md; @@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor) md->disk->private_data = md; sprintf(md->disk->disk_name, "dm-%d", minor); - if (IS_ENABLED(CONFIG_FS_DAX)) { - md->dax_dev = alloc_dax(md, _dax_ops); - if (IS_ERR(md->dax_dev)) { - md->dax_dev = NULL; + dax_dev = alloc_dax(md, _dax_ops); + if (IS_ERR(dax_dev)) { + if (PTR_ERR(dax_dev) != -EOPNOTSUPP) goto bad; - } - set_dax_nocache(md->dax_dev); - set_dax_nomc(md->dax_dev); - if (dax_add_host(md->dax_dev, md->disk)) + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + md->dax_dev = dax_dev; + if (dax_add_host(dax_dev, md->disk)) goto bad; } -- 2.39.2
[PATCH v5 2/8] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of nvdimm/pmem pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. [ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ] Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 9fe358090720..e9898457a7bd 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev, dax_dev = alloc_dax(pmem, _dax_ops); if (IS_ERR(dax_dev)) { rc = PTR_ERR(dax_dev); - goto out; + if (rc != -EOPNOTSUPP) + goto out; + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + if (is_nvdimm_sync(nd_region)) + set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; + rc = dax_add_host(dax_dev, disk); + if (rc) + goto out_cleanup_dax; + dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); } - set_dax_nocache(dax_dev); - set_dax_nomc(dax_dev); - if (is_nvdimm_sync(nd_region)) - set_dax_synchronous(dax_dev); - pmem->dax_dev = dax_dev; - rc = dax_add_host(dax_dev, disk); - if (rc) - goto out_cleanup_dax; - dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; -- 2.39.2
[PATCH v5 0/8] Introduce cpu_dcache_is_aliasing() to fix DAX regression
This commit introduced in v4.0 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased data caches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using DAX over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliasing data caches (VIVT and VIPT with aliasing data cache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings with virtually aliasing data caches. This patch series introduces cpu_dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all architectures. The implementation of cpu_dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Testing done so far: - Compile allyesconfig on x86-64, - Compile allyesconfig on x86-64, with FS_DAX=n. - Compile allyesconfig on x86-64, with DAX=n. - Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP even on x86-64, thus simulating the behavior expected on an architecture with data cache aliasing. There are many more axes to test however. I would welcome Tested-by for: - affected architectures, - affected drivers, - affected filesytems. [ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ] Thanks, Mathieu Changes since v4: - Move the change which makes alloc_dax() return ERR_PTR(-EOPNOTSUPP) when CONFIG_DAX=n earlier in the series, - Fold driver cleanup patches into their respective per-driver changes. - Move "nvdimm/pmem: Fix leak on dax_add_host() failure" outside of this series. Changes since v3: - Fix a leak on dax_add_host() failure in nvdimm/pmem. - Split the series into a bissectable sequence of changes. - Ensure that device-dax use-cases still works on data cache aliasing architectures. Changes since v2: - Move DAX supported runtime check to alloc_dax(), - Modify DM to handle alloc_dax() error as non-fatal, - Remove all filesystem modifications, since the check is now done by alloc_dax(), - rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to eliminate confusion with VFS terminology. Changes since v1: - The order of the series was completely changed based on the feedback received on v1, - cache_is_aliasing() is renamed to dcache_is_aliasing(), - ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone, - dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is simplified, - the dax_is_supported() check was moved to its rightful place in all filesystems. Cc: Andrew Morton Cc: Linus Torvalds Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org Mathieu Desnoyers (8): dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dcssblk: Handle alloc_dax() -EOPNOTSUPP failure virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dax: Check for data cache aliasing at runtime Introduce cpu_dcache_is_aliasing() across all architectures dax: Fix incorrect list of data cache aliasing architectures arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ drivers/dax/super.c | 14 ++ drivers/md/dm.c | 17 + drivers/nvdimm/pmem.c | 22 -- drivers/s390/block/dcssblk.c| 11 ++- fs/Kconfig
[PATCH v5 1/8] dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n
Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y never returns NULL. This is done in preparation for using cpu_dcache_is_aliasing() in a following change which will properly support architectures which detect data cache aliasing at runtime. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 5 + include/linux/dax.h | 6 +- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 0da9232ea175..205b888d45bf 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive); * that any fault handlers or operations that might have seen * dax_alive(), have completed. Any operations that start after * synchronize_srcu() has run will abort upon seeing !dax_alive(). + * + * Note, because alloc_dax() returns an ERR_PTR() on error, callers + * typically store its result into a local variable in order to check + * the result. Therefore, care must be taken to populate the struct + * device dax_dev field make sure the dax_dev is not leaked. */ void kill_dax(struct dax_device *dax_dev) { diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..df2d52b8a245 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev) static inline struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) { - /* -* Callers should check IS_ENABLED(CONFIG_DAX) to know if this -* NULL is an error or expected. -*/ - return NULL; + return ERR_PTR(-EOPNOTSUPP); } static inline void put_dax(struct dax_device *dax_dev) { -- 2.39.2
[PATCH] nvdimm/pmem: Fix leak on dax_add_host() failure
Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done before setting pmem->dax_dev, which therefore issues the two following calls on NULL pointers: out_cleanup_dax: kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); Signed-off-by: Mathieu Desnoyers Reviewed-by: Dan Williams Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 4e8fdcb3f1c8..9fe358090720 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev, set_dax_nomc(dax_dev); if (is_nvdimm_sync(nd_region)) set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; rc = dax_add_host(dax_dev, disk); if (rc) goto out_cleanup_dax; dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); - pmem->dax_dev = dax_dev; - rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; -- 2.39.2
CPU data cache across reboot/kexec for pmem/dax devices
Hi Dan, In the context of extracting user-space trace data when the kernel crashes, the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the resulting device to create/mount a dax-enabled fs (e.g. ext4). We then use this filesystem to mmap() the shared memory files for the tracer. I want to make sure that the very last events from the userspace tracer written to the memory mapped buffers (mmap()) by userspace are present after a warm-reboot (or kexec/kdump). Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush (or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer data is usually overwritten many times before the system actually crashes, and the only thing we really need to make sure is that the cache lines are not invalidated without write back. So I understand that the main use-case for pmem is nvdimm, and that in order to guarantee persistence of the data on power off an explicit pmem_persist() is needed after each "transaction", but for the needs of tracing, is there some kind of architectural guarantee that the data present in the cpu data cache is not invalidated prior to write back in each of those scenarios ? - reboot with bios explicitly not clearing memory, - kexec/kdump. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v4 06/12] dax: Check for data cache aliasing at runtime
On 2024-02-08 17:37, Dan Williams wrote: Mathieu Desnoyers wrote: On 2024-02-08 16:39, Dan Williams wrote: [...] So per other feedback on earlier patches, I think this hunk deserves to be moved to its own patch earlier in the series as a standalone fixup. Done. Rest of this patch looks good to me. Adding your Acked-by to what is left of this patch if OK with you. You can add: Reviewed-by: Dan Williams ...after that re-org. Just to make sure: are you OK with me adding your Reviewed-by only for what is left of this patch, or also to the other driver patches after integrating your requested changes ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure
On 2024-02-08 17:12, Andrew Morton wrote: On Thu, 8 Feb 2024 17:04:52 -0500 Mathieu Desnoyers wrote: [...] Should I keep this patch 01/12 within the series for v5 or should I send it separately ? Doesn't matter much, but perfectionism does say "standalone patch please". Will do. I plan to add the following statement to the commit message to make it clear that there is a dependency between the patch series and this fix: [ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ] Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v4 09/12] nvdimm/pmem: Cleanup alloc_dax() error handling
On 2024-02-08 16:55, Dan Williams wrote: [...] Oh... I see you cleanup what I was talking about later in the series. For my taste I don't like to see tech-debt added and then removed later in the series. The whole series would appear to get smaller by removing the alloc_dax() returning NULL case from the beginning, and then doing the EOPNOTSUPP fixups. ...repeat this comment for patch 10, 11, 12. Done. Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v4 06/12] dax: Check for data cache aliasing at runtime
On 2024-02-08 16:39, Dan Williams wrote: [...] So per other feedback on earlier patches, I think this hunk deserves to be moved to its own patch earlier in the series as a standalone fixup. Done. Rest of this patch looks good to me. Adding your Acked-by to what is left of this patch if OK with you. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v4 05/12] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
On 2024-02-08 16:37, Dan Williams wrote: [...] +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T)) This IS_ERR_OR_NULL is ok because dax_dev is initialized to NULL, but maybe this could just be IS_ERR() and then initialize @dax_dev to ERR_PTR(-EOPNOTSUPP)? Done. Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v4 04/12] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
On 2024-02-08 16:36, Dan Williams wrote: [...] Just another "ditto" on alloc_dax() returning NULL so that the ternary can be removed, but otherwise this looks good. Done. Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v4 03/12] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
On 2024-02-08 16:34, Dan Williams wrote: [...] Similar feedback as the pmem change, lets not propagate the mistake that alloc_dax() could return NULL, none of the callers of alloc_dax() properly handled NULL and it was just luck that none of the use cases tried to use alloc_dax() in the CONFIG_DAX=n case. Done. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v4 02/12] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
On 2024-02-08 16:32, Dan Williams wrote: Mathieu Desnoyers wrote: In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of nvdimm/pmem pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. For the transition, consider that alloc_dax() returning NULL is the same as returning -EOPNOTSUPP. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 9fe358090720..f1d9f5c6dbac 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -558,19 +558,21 @@ static int pmem_attach_disk(struct device *dev, disk->bb = >bb; dax_dev = alloc_dax(pmem, _dax_ops); - if (IS_ERR(dax_dev)) { - rc = PTR_ERR(dax_dev); - goto out; + if (IS_ERR_OR_NULL(dax_dev)) { alloc_dax() should never return NULL. I.e. the lead in before this patch should fix this misunderstanding: diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..df2d52b8a245 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev) static inline struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) { - /* -* Callers should check IS_ENABLED(CONFIG_DAX) to know if this -* NULL is an error or expected. -*/ - return NULL; + return ERR_PTR(-EOPNOTSUPP); } static inline void put_dax(struct dax_device *dax_dev) { + rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; Then this ternary can be replaced with just a check of which PTR_ERR() value is being returned. As you noted, I've introduced this as cleanups in later patches. I don't mind folding these into their respective per-driver commits and moving the alloc_dax() hunk earlier in the series. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure
On 2024-02-08 16:21, Andrew Morton wrote: On Thu, 8 Feb 2024 13:49:02 -0500 Mathieu Desnoyers wrote: Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done before setting pmem->dax_dev, which therefore issues the two following calls on NULL pointers: out_cleanup_dax: kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); Seems inappropriate that this fix is within this patch series? otoh I assume dax_add_host() has never failed so it doesn't matter much. The series seems useful but is at v4 without much sign of review activity. I think I'll take silence as assent and shall slam it all into -next and see who shouts at me. Thanks Andrew for picking it up! Dan just reacted with feedback that will help reducing the patch series size by removing intermediate commits. I'll implement the requested changes and post a v5 in a few days. So far there are not behavior changes requested in Dan's feedback. Should I keep this patch 01/12 within the series for v5 or should I send it separately ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[PATCH v4 11/12] dcssblk: Cleanup alloc_dax() error handling
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL, the callers do not have to handle NULL return values anymore. Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org --- drivers/s390/block/dcssblk.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index a3010849bfed..f363c1d51d9a 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -679,8 +679,8 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char goto put_dev; dax_dev = alloc_dax(dev_info, _dax_ops); - if (IS_ERR_OR_NULL(dax_dev)) { - rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; + if (IS_ERR(dax_dev)) { + rc = PTR_ERR(dax_dev); goto put_dev; } set_dax_synchronous(dax_dev); -- 2.39.2
[PATCH v4 12/12] virtio: Cleanup alloc_dax() error handling
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL, the callers do not have to handle NULL return values anymore. Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org --- fs/fuse/virtio_fs.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 621b1bca2d55..a28466c2da71 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -809,8 +809,8 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) return 0; dax_dev = alloc_dax(fs, _fs_dax_ops); - if (IS_ERR_OR_NULL(dax_dev)) { - int rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; + if (IS_ERR(dax_dev)) { + int rc = PTR_ERR(dax_dev); return rc == -EOPNOTSUPP ? 0 : rc; } -- 2.39.2
[PATCH v4 07/12] Introduce cpu_dcache_is_aliasing() across all architectures
Introduce a generic way to query whether the data cache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For data cache aliasing, there are three scenarios dependending on the architecture. Here is a breakdown based on my understanding: A) The data cache is always aliasing: * arc * csky * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.) * sh * parisc B) The data cache aliasing is statically known or depends on querying CPU state at runtime: * arm (cache_is_vivt() || cache_is_vipt_aliasing()) * mips (cpu_has_dc_aliases) * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE) * sparc32 (vac_cache_size > PAGE_SIZE) * sparc64 (L1DCACHE_SIZE > PAGE_SIZE) * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE) C) The data cache is never aliasing: * alpha * arm64 (aarch64) * hexagon * loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE")) * microblaze * openrisc * powerpc * riscv * s390 * um * x86 Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and implement "cpu_dcache_is_aliasing()". Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus cpu_dcache_is_aliasing() simply evaluates to "false". Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future work. This would be useful to gate features like XIP on architectures which have aliasing CPU dcache-icache but not CPU dcache-dcache. Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache" to clarify that we really mean "CPU data cache" and "CPU cache" to eliminate any possible confusion with VFS "dentry cache" and "page cache". Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h | 6 ++ mm/Kconfig | 6 ++ 22 files changed, 112 insertions(+) create mode 100644 arch/arc/include/asm/cachetype.h create mode 100644 arch/csky/include/asm/cachetype.h create mode 100644 arch/m68k/include/asm/cachetype.h create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/parisc/include/asm/cachetype.h create mode 100644 arch/sh/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 1b0483c51cc1..7d294a3242a4 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,6 +6,7 @@ config ARC def_bool y select ARC_TIMERS + select ARCH_HAS_CPU_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h new file mode 100644 index ..05fc7ed59712 --- /dev/null +++ b/arch/arc/include/asm/cachetype.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_ARC_CACHETYPE_H +#define __ASM_ARC_CACHETYPE_H + +#include + +#define cpu_dcache_is_aliasing() true + +#endif diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f8567e95f98b..cd13b1788973 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -5,6 +5,7 @@ conf
[PATCH v4 06/12] dax: Check for data cache aliasing at runtime
Replace the following fs/Kconfig:FS_DAX dependency: depends on !(ARM || MIPS || SPARC) By a runtime check within alloc_dax(). This runtime check returns ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means the kernel is using an aliased mapping) on an architecture which has data cache aliasing. Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n for consistency. This is done in preparation for using cpu_dcache_is_aliasing() in a following change which will properly support architectures which detect data cache aliasing at runtime. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 15 +++ fs/Kconfig | 1 - include/linux/dax.h | 6 +- 3 files changed, 16 insertions(+), 6 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 0da9232ea175..ce5bffa86bba 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive); * that any fault handlers or operations that might have seen * dax_alive(), have completed. Any operations that start after * synchronize_srcu() has run will abort upon seeing !dax_alive(). + * + * Note, because alloc_dax() returns an ERR_PTR() on error, callers + * typically store its result into a local variable in order to check + * the result. Therefore, care must be taken to populate the struct + * device dax_dev field make sure the dax_dev is not leaked. */ void kill_dax(struct dax_device *dax_dev) { @@ -445,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) dev_t devt; int minor; + /* +* Unavailable on architectures with virtually aliased data caches, +* except for device-dax (NULL operations pointer), which does +* not use aliased mappings from the kernel. +*/ + if (ops && (IS_ENABLED(CONFIG_ARM) || + IS_ENABLED(CONFIG_MIPS) || + IS_ENABLED(CONFIG_SPARC))) + return ERR_PTR(-EOPNOTSUPP); + if (WARN_ON_ONCE(ops && !ops->zero_page_range)) return ERR_PTR(-EINVAL); diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..e5efdb3b276b 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,6 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..df2d52b8a245 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev) static inline struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) { - /* -* Callers should check IS_ENABLED(CONFIG_DAX) to know if this -* NULL is an error or expected. -*/ - return NULL; + return ERR_PTR(-EOPNOTSUPP); } static inline void put_dax(struct dax_device *dax_dev) { -- 2.39.2
[PATCH v4 09/12] nvdimm/pmem: Cleanup alloc_dax() error handling
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL, the callers do not have to handle NULL return values anymore. Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index f1d9f5c6dbac..e9898457a7bd 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -558,8 +558,8 @@ static int pmem_attach_disk(struct device *dev, disk->bb = >bb; dax_dev = alloc_dax(pmem, _dax_ops); - if (IS_ERR_OR_NULL(dax_dev)) { - rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; + if (IS_ERR(dax_dev)) { + rc = PTR_ERR(dax_dev); if (rc != -EOPNOTSUPP) goto out; } else { -- 2.39.2
[PATCH v4 08/12] dax: Fix incorrect list of data cache aliasing architectures
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased data cache. This is a regression introduced in the v4.0 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no data cache aliasing, and therefore should work fine with FS_DAX. This was turned into the following check in alloc_dax() by a preparatory change: if (ops && (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_MIPS) || IS_ENABLED(CONFIG_SPARC))) return NULL; Use cpu_dcache_is_aliasing() instead to figure out whether the environment has aliasing data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index ce5bffa86bba..a21a7c262382 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "dax-private.h" /** @@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) * except for device-dax (NULL operations pointer), which does * not use aliased mappings from the kernel. */ - if (ops && (IS_ENABLED(CONFIG_ARM) || - IS_ENABLED(CONFIG_MIPS) || - IS_ENABLED(CONFIG_SPARC))) + if (ops && cpu_dcache_is_aliasing()) return ERR_PTR(-EOPNOTSUPP); if (WARN_ON_ONCE(ops && !ops->zero_page_range)) -- 2.39.2
[PATCH v4 10/12] dm: Cleanup alloc_dax() error handling
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL, the callers do not have to handle NULL return values anymore. Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/md/dm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 2fc22cae9089..acdc00bc05be 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2124,8 +2124,8 @@ static struct mapped_device *alloc_dev(int minor) sprintf(md->disk->disk_name, "dm-%d", minor); dax_dev = alloc_dax(md, _dax_ops); - if (IS_ERR_OR_NULL(dax_dev)) { - if (IS_ERR(dax_dev) && PTR_ERR(dax_dev) != -EOPNOTSUPP) + if (IS_ERR(dax_dev)) { + if (PTR_ERR(dax_dev) != -EOPNOTSUPP) goto bad; } else { set_dax_nocache(dax_dev); -- 2.39.2
[PATCH v4 05/12] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of virtio virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. For the transition, consider that alloc_dax() returning NULL is the same as returning -EOPNOTSUPP. Co-developed-by: Dan Williams Signed-off-by: Dan Williams Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- fs/fuse/virtio_fs.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 5f1be1da92ce..621b1bca2d55 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "fuse_i.h" @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data) put_dax(dax_dev); } +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T)) + static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) { + struct dax_device *dax_dev __free(cleanup_dax) = NULL; struct virtio_shm_region cache_reg; struct dev_pagemap *pgmap; bool have_cache; @@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) if (!IS_ENABLED(CONFIG_FUSE_DAX)) return 0; + dax_dev = alloc_dax(fs, _fs_dax_ops); + if (IS_ERR_OR_NULL(dax_dev)) { + int rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; + return rc == -EOPNOTSUPP ? 0 : rc; + } + /* Get cache region */ have_cache = virtio_get_shm_region(vdev, _reg, (u8)VIRTIO_FS_SHMCAP_ID_CACHE); @@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len); - fs->dax_dev = alloc_dax(fs, _fs_dax_ops); - if (IS_ERR(fs->dax_dev)) - return PTR_ERR(fs->dax_dev); - + fs->dax_dev = no_free_ptr(dax_dev); return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax, fs->dax_dev); } -- 2.39.2
[PATCH v4 03/12] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dm alloc_dev() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. For the transition, consider that alloc_dax() returning NULL is the same as returning -EOPNOTSUPP. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Suggested-by: Dan Williams Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/md/dm.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 23c32cd1f1d8..2fc22cae9089 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device *md) static struct mapped_device *alloc_dev(int minor) { int r, numa_node_id = dm_get_numa_node(); + struct dax_device *dax_dev; struct mapped_device *md; void *old_md; @@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor) md->disk->private_data = md; sprintf(md->disk->disk_name, "dm-%d", minor); - if (IS_ENABLED(CONFIG_FS_DAX)) { - md->dax_dev = alloc_dax(md, _dax_ops); - if (IS_ERR(md->dax_dev)) { - md->dax_dev = NULL; + dax_dev = alloc_dax(md, _dax_ops); + if (IS_ERR_OR_NULL(dax_dev)) { + if (IS_ERR(dax_dev) && PTR_ERR(dax_dev) != -EOPNOTSUPP) goto bad; - } - set_dax_nocache(md->dax_dev); - set_dax_nomc(md->dax_dev); - if (dax_add_host(md->dax_dev, md->disk)) + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + md->dax_dev = dax_dev; + if (dax_add_host(dax_dev, md->disk)) goto bad; } -- 2.39.2
[PATCH v4 04/12] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dcssblk dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures. Considering that s390 is not a data cache aliasing architecture, and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP from alloc_dax() should make dcssblk_add_store() fail. For the transition, consider that alloc_dax() returning NULL is the same as returning -EOPNOTSUPP. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org --- drivers/s390/block/dcssblk.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index 4b7ecd4fd431..a3010849bfed 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char int rc, i, j, num_of_segments; struct dcssblk_dev_info *dev_info; struct segment_info *seg_info, *temp; + struct dax_device *dax_dev; char *local_buf; unsigned long seg_byte_size; @@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char if (rc) goto put_dev; - dev_info->dax_dev = alloc_dax(dev_info, _dax_ops); - if (IS_ERR(dev_info->dax_dev)) { - rc = PTR_ERR(dev_info->dax_dev); - dev_info->dax_dev = NULL; + dax_dev = alloc_dax(dev_info, _dax_ops); + if (IS_ERR_OR_NULL(dax_dev)) { + rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; goto put_dev; } - set_dax_synchronous(dev_info->dax_dev); + set_dax_synchronous(dax_dev); + dev_info->dax_dev = dax_dev; rc = dax_add_host(dev_info->dax_dev, dev_info->gd); if (rc) goto out_dax; -- 2.39.2
[PATCH v4 00/12] Introduce cpu_dcache_is_aliasing() to fix DAX regression
This commit introduced in v4.0 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased data caches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using DAX over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliasing data caches (VIVT and VIPT with aliasing data cache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings with virtually aliasing data caches. This patch series introduces cpu_dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all architectures. The implementation of cpu_dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Testing done so far: - Compile allyesconfig on x86-64, - Compile allyesconfig on x86-64, with FS_DAX=n. - Compile allyesconfig on x86-64, with DAX=n. - Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP even on x86-64, thus simulating the behavior expected on an architecture with data cache aliasing. There are many more axes to test however. I would welcome Tested-by for: - affected architectures, - affected drivers, - affected filesytems. Thanks, Mathieu Changes since v3: - Fix a leak on dax_add_host() failure in nvdimm/pmem. - Split the series into a bissectable sequence of changes. - Ensure that device-dax use-cases still works on data cache aliasing architectures. Changes since v2: - Move DAX supported runtime check to alloc_dax(), - Modify DM to handle alloc_dax() error as non-fatal, - Remove all filesystem modifications, since the check is now done by alloc_dax(), - rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to eliminate confusion with VFS terminology. Changes since v1: - The order of the series was completely changed based on the feedback received on v1, - cache_is_aliasing() is renamed to dcache_is_aliasing(), - ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone, - dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is simplified, - the dax_is_supported() check was moved to its rightful place in all filesystems. Cc: Andrew Morton Cc: Linus Torvalds Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org Mathieu Desnoyers (12): nvdimm/pmem: Fix leak on dax_add_host() failure nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dcssblk: Handle alloc_dax() -EOPNOTSUPP failure virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dax: Check for data cache aliasing at runtime Introduce cpu_dcache_is_aliasing() across all architectures dax: Fix incorrect list of data cache aliasing architectures nvdimm/pmem: Cleanup alloc_dax() error handling dm: Cleanup alloc_dax() error handling dcssblk: Cleanup alloc_dax() error handling virtio: Cleanup alloc_dax() error handling arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ drivers/dax/super.c | 14 ++ drivers/md/dm.c | 17 + drivers/nvdimm/pmem.c | 23 --- drivers/s390/block/dcssblk.c| 11 ++- fs/Kconfig | 1 - fs/fuse/virtio_fs.c | 15 +++ include/linux/cacheinfo.h | 6 ++ include/linux/dax.h | 6 +- mm/Kconfig |
[PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure
Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done before setting pmem->dax_dev, which therefore issues the two following calls on NULL pointers: out_cleanup_dax: kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 4e8fdcb3f1c8..9fe358090720 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev, set_dax_nomc(dax_dev); if (is_nvdimm_sync(nd_region)) set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; rc = dax_add_host(dax_dev, disk); if (rc) goto out_cleanup_dax; dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); - pmem->dax_dev = dax_dev; - rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; -- 2.39.2
[PATCH v4 02/12] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of nvdimm/pmem pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. For the transition, consider that alloc_dax() returning NULL is the same as returning -EOPNOTSUPP. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 9fe358090720..f1d9f5c6dbac 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -558,19 +558,21 @@ static int pmem_attach_disk(struct device *dev, disk->bb = >bb; dax_dev = alloc_dax(pmem, _dax_ops); - if (IS_ERR(dax_dev)) { - rc = PTR_ERR(dax_dev); - goto out; + if (IS_ERR_OR_NULL(dax_dev)) { + rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; + if (rc != -EOPNOTSUPP) + goto out; + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + if (is_nvdimm_sync(nd_region)) + set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; + rc = dax_add_host(dax_dev, disk); + if (rc) + goto out_cleanup_dax; + dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); } - set_dax_nocache(dax_dev); - set_dax_nomc(dax_dev); - if (is_nvdimm_sync(nd_region)) - set_dax_synchronous(dax_dev); - pmem->dax_dev = dax_dev; - rc = dax_add_host(dax_dev, disk); - if (rc) - goto out_cleanup_dax; - dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; -- 2.39.2
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
On 2024-02-05 13:34, Vincent Donnefort wrote: On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote: [...] How are the kernel linear mapping and the userspace mapping made coherent on architectures with virtually aliasing data caches ? Ref. https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t Hi Mathieu, Thanks for the pointer. We are in the exact same problem as DAX. We do modify the data through the kernel linear mapping while user-space can read it through its own. I should probably return an error when used with any of the arch ARM || SPARC || MIPS, until cpu_dcache_is_aliasing() introduces a fine-grain differentiation. You might want to use LTTng's ring buffer approach instead. See https://github.com/lttng/lttng-modules/blob/master/src/lib/ringbuffer/ring_buffer_frontend.c#L1202 lib_ring_buffer_flush_read_subbuf_dcache() Basically, whenever user-space grabs a sub-buffer for reading (through lttng-modules's LTTNG_KERNEL_ABI_RING_BUFFER_GET_SUBBUF ioctl), lttng calls flush_dcache_page() on all pages of this subbuffer (I should really change this for a call to flush_dcache_folio() which would be more efficient). Note that doing this is not very efficient on architectures which have coherent data caches and incoherent dcache vs icache: in that case, we issue the flush_dcache_page uselessly. I plan on using the new cpu_dcache_is_aliasing() check once/if it makes it way upstream to remove those useless flushes on architectures which define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, but do not virtually alias the data cache. The equivalent of LTTng's "get subbuf" operation would be the new TRACE_MMAP_IOCTL_GET_READER ioctl in ftrace AFAIU. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer
vmf->page->mapping = vmf->vma->vm_file->f_mapping; + vmf->page->index = vmf->pgoff; + + return 0; +} + +static void tracing_buffers_mmap_close(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = >iter; + struct trace_array *tr = iter->tr; + + ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file); + +#ifdef CONFIG_TRACER_MAX_TRACE + spin_lock(>snapshot_trigger_lock); + if (!WARN_ON(!tr->mapped)) + tr->mapped--; + spin_unlock(>snapshot_trigger_lock); +#endif +} + +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = vma->vm_file->private_data; + struct trace_iterator *iter = >iter; + + WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file)); +} + +static const struct vm_operations_struct tracing_buffers_vmops = { + .open = tracing_buffers_mmap_open, + .close = tracing_buffers_mmap_close, + .fault = tracing_buffers_mmap_fault, +}; + +static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct ftrace_buffer_info *info = filp->private_data; + struct trace_iterator *iter = >iter; + struct trace_array *tr = iter->tr; + int ret = 0; + + if (vma->vm_flags & VM_WRITE) + return -EPERM; + + vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE); + vma->vm_ops = _buffers_vmops; + +#ifdef CONFIG_TRACER_MAX_TRACE + /* +* We hold mmap_lock here. lockdep would be unhappy if we would now take +* trace_types_lock. Instead use the specific snapshot_trigger_lock. +*/ + spin_lock(>snapshot_trigger_lock); + if (tr->snapshot || tr->mapped == UINT_MAX) { + spin_unlock(>snapshot_trigger_lock); + return -EBUSY; + } + tr->mapped++; + spin_unlock(>snapshot_trigger_lock); + + /* Wait for update_max_tr() to observe iter->tr->mapped */ + if (tr->mapped == 1) + synchronize_rcu(); +#endif + ret = ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file); +#ifdef CONFIG_TRACER_MAX_TRACE + if (ret) { + spin_lock(>snapshot_trigger_lock); + iter->tr->mapped--; + spin_unlock(>snapshot_trigger_lock); + } +#endif + return ret; +} + static const struct file_operations tracing_buffers_fops = { .open = tracing_buffers_open, .read = tracing_buffers_read, @@ -8681,6 +8794,7 @@ static const struct file_operations tracing_buffers_fops = { .splice_read= tracing_buffers_splice_read, .unlocked_ioctl = tracing_buffers_ioctl, .llseek = no_llseek, + .mmap = tracing_buffers_mmap, }; static ssize_t diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h index bd312e9afe25..8a96e7a89e6b 100644 --- a/kernel/trace/trace.h +++ b/kernel/trace/trace.h @@ -336,6 +336,7 @@ struct trace_array { boolallocated_snapshot; spinlock_t snapshot_trigger_lock; unsigned intsnapshot; + unsigned intmapped; unsigned long max_latency; #ifdef CONFIG_FSNOTIFY struct dentry *d_max_latency; -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
[RFC PATCH v4 11/12] dcssblk: Cleanup alloc_dax() error handling
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL, the callers do not have to handle NULL return values anymore. Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org --- drivers/s390/block/dcssblk.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index a3010849bfed..f363c1d51d9a 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -679,8 +679,8 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char goto put_dev; dax_dev = alloc_dax(dev_info, _dax_ops); - if (IS_ERR_OR_NULL(dax_dev)) { - rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; + if (IS_ERR(dax_dev)) { + rc = PTR_ERR(dax_dev); goto put_dev; } set_dax_synchronous(dax_dev); -- 2.39.2
[RFC PATCH v4 12/12] virtio: Cleanup alloc_dax() error handling
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL, the callers do not have to handle NULL return values anymore. Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org --- fs/fuse/virtio_fs.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 621b1bca2d55..a28466c2da71 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -809,8 +809,8 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) return 0; dax_dev = alloc_dax(fs, _fs_dax_ops); - if (IS_ERR_OR_NULL(dax_dev)) { - int rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; + if (IS_ERR(dax_dev)) { + int rc = PTR_ERR(dax_dev); return rc == -EOPNOTSUPP ? 0 : rc; } -- 2.39.2
[RFC PATCH v4 09/12] nvdimm/pmem: Cleanup alloc_dax() error handling
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL, the callers do not have to handle NULL return values anymore. Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index f1d9f5c6dbac..e9898457a7bd 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -558,8 +558,8 @@ static int pmem_attach_disk(struct device *dev, disk->bb = >bb; dax_dev = alloc_dax(pmem, _dax_ops); - if (IS_ERR_OR_NULL(dax_dev)) { - rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; + if (IS_ERR(dax_dev)) { + rc = PTR_ERR(dax_dev); if (rc != -EOPNOTSUPP) goto out; } else { -- 2.39.2
[RFC PATCH v4 10/12] dm: Cleanup alloc_dax() error handling
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL, the callers do not have to handle NULL return values anymore. Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/md/dm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 2fc22cae9089..acdc00bc05be 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2124,8 +2124,8 @@ static struct mapped_device *alloc_dev(int minor) sprintf(md->disk->disk_name, "dm-%d", minor); dax_dev = alloc_dax(md, _dax_ops); - if (IS_ERR_OR_NULL(dax_dev)) { - if (IS_ERR(dax_dev) && PTR_ERR(dax_dev) != -EOPNOTSUPP) + if (IS_ERR(dax_dev)) { + if (PTR_ERR(dax_dev) != -EOPNOTSUPP) goto bad; } else { set_dax_nocache(dax_dev); -- 2.39.2
[RFC PATCH v4 07/12] Introduce cpu_dcache_is_aliasing() across all architectures
Introduce a generic way to query whether the data cache is virtually aliased on all architectures. Its purpose is to ensure that subsystems which are incompatible with virtually aliased data caches (e.g. FS_DAX) can reliably query this. For data cache aliasing, there are three scenarios dependending on the architecture. Here is a breakdown based on my understanding: A) The data cache is always aliasing: * arc * csky * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.) * sh * parisc B) The data cache aliasing is statically known or depends on querying CPU state at runtime: * arm (cache_is_vivt() || cache_is_vipt_aliasing()) * mips (cpu_has_dc_aliases) * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE) * sparc32 (vac_cache_size > PAGE_SIZE) * sparc64 (L1DCACHE_SIZE > PAGE_SIZE) * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE) C) The data cache is never aliasing: * alpha * arm64 (aarch64) * hexagon * loongarch (but with incoherent write buffers, which are disabled since commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE")) * microblaze * openrisc * powerpc * riscv * s390 * um * x86 Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and implement "cpu_dcache_is_aliasing()". Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus cpu_dcache_is_aliasing() simply evaluates to "false". Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future work. This would be useful to gate features like XIP on architectures which have aliasing CPU dcache-icache but not CPU dcache-dcache. Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache" to clarify that we really mean "CPU data cache" and "CPU cache" to eliminate any possible confusion with VFS "dentry cache" and "page cache". Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/ Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ include/linux/cacheinfo.h | 6 ++ mm/Kconfig | 6 ++ 22 files changed, 112 insertions(+) create mode 100644 arch/arc/include/asm/cachetype.h create mode 100644 arch/csky/include/asm/cachetype.h create mode 100644 arch/m68k/include/asm/cachetype.h create mode 100644 arch/mips/include/asm/cachetype.h create mode 100644 arch/nios2/include/asm/cachetype.h create mode 100644 arch/parisc/include/asm/cachetype.h create mode 100644 arch/sh/include/asm/cachetype.h create mode 100644 arch/sparc/include/asm/cachetype.h create mode 100644 arch/xtensa/include/asm/cachetype.h diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index 1b0483c51cc1..7d294a3242a4 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -6,6 +6,7 @@ config ARC def_bool y select ARC_TIMERS + select ARCH_HAS_CPU_CACHE_ALIASING select ARCH_HAS_CACHE_LINE_SIZE select ARCH_HAS_DEBUG_VM_PGTABLE select ARCH_HAS_DMA_PREP_COHERENT diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h new file mode 100644 index ..05fc7ed59712 --- /dev/null +++ b/arch/arc/include/asm/cachetype.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_ARC_CACHETYPE_H +#define __ASM_ARC_CACHETYPE_H + +#include + +#define cpu_dcache_is_aliasing() true + +#endif diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index f8567e95f98b..cd13b1788973 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -5,6 +5,7 @@ conf
[RFC PATCH v4 08/12] dax: Fix incorrect list of data cache aliasing architectures
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") prevents DAX from building on architectures with virtually aliased dcache with: depends on !(ARM || MIPS || SPARC) This check is too broad (e.g. recent ARMv7 don't have virtually aliased dcaches), and also misses many other architectures with virtually aliased data cache. This is a regression introduced in the v4.0 Linux kernel where the dax mount option is removed for 32-bit ARMv7 boards which have no data cache aliasing, and therefore should work fine with FS_DAX. This was turned into the following check in alloc_dax() by a preparatory change: if (ops && (IS_ENABLED(CONFIG_ARM) || IS_ENABLED(CONFIG_MIPS) || IS_ENABLED(CONFIG_SPARC))) return NULL; Use cpu_dcache_is_aliasing() instead to figure out whether the environment has aliasing data caches. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index ce5bffa86bba..a21a7c262382 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -13,6 +13,7 @@ #include #include #include +#include #include "dax-private.h" /** @@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) * except for device-dax (NULL operations pointer), which does * not use aliased mappings from the kernel. */ - if (ops && (IS_ENABLED(CONFIG_ARM) || - IS_ENABLED(CONFIG_MIPS) || - IS_ENABLED(CONFIG_SPARC))) + if (ops && cpu_dcache_is_aliasing()) return ERR_PTR(-EOPNOTSUPP); if (WARN_ON_ONCE(ops && !ops->zero_page_range)) -- 2.39.2
[RFC PATCH v4 05/12] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of virtio virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. For the transition, consider that alloc_dax() returning NULL is the same as returning -EOPNOTSUPP. Co-developed-by: Dan Williams Signed-off-by: Dan Williams Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- fs/fuse/virtio_fs.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 5f1be1da92ce..621b1bca2d55 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "fuse_i.h" @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data) put_dax(dax_dev); } +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T)) + static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) { + struct dax_device *dax_dev __free(cleanup_dax) = NULL; struct virtio_shm_region cache_reg; struct dev_pagemap *pgmap; bool have_cache; @@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) if (!IS_ENABLED(CONFIG_FUSE_DAX)) return 0; + dax_dev = alloc_dax(fs, _fs_dax_ops); + if (IS_ERR_OR_NULL(dax_dev)) { + int rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; + return rc == -EOPNOTSUPP ? 0 : rc; + } + /* Get cache region */ have_cache = virtio_get_shm_region(vdev, _reg, (u8)VIRTIO_FS_SHMCAP_ID_CACHE); @@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len); - fs->dax_dev = alloc_dax(fs, _fs_dax_ops); - if (IS_ERR(fs->dax_dev)) - return PTR_ERR(fs->dax_dev); - + fs->dax_dev = no_free_ptr(dax_dev); return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax, fs->dax_dev); } -- 2.39.2
[RFC PATCH v4 06/12] dax: Check for data cache aliasing at runtime
Replace the following fs/Kconfig:FS_DAX dependency: depends on !(ARM || MIPS || SPARC) By a runtime check within alloc_dax(). This runtime check returns ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means the kernel is using an aliased mapping) on an architecture which has data cache aliasing. Change the return value from NULL to PTR_ERR(-EOPNOTSUPP) for CONFIG_DAX=n for consistency. This is done in preparation for using cpu_dcache_is_aliasing() in a following change which will properly support architectures which detect data cache aliasing at runtime. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/dax/super.c | 15 +++ fs/Kconfig | 1 - include/linux/dax.h | 6 +- 3 files changed, 16 insertions(+), 6 deletions(-) diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 0da9232ea175..ce5bffa86bba 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive); * that any fault handlers or operations that might have seen * dax_alive(), have completed. Any operations that start after * synchronize_srcu() has run will abort upon seeing !dax_alive(). + * + * Note, because alloc_dax() returns an ERR_PTR() on error, callers + * typically store its result into a local variable in order to check + * the result. Therefore, care must be taken to populate the struct + * device dax_dev field make sure the dax_dev is not leaked. */ void kill_dax(struct dax_device *dax_dev) { @@ -445,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) dev_t devt; int minor; + /* +* Unavailable on architectures with virtually aliased data caches, +* except for device-dax (NULL operations pointer), which does +* not use aliased mappings from the kernel. +*/ + if (ops && (IS_ENABLED(CONFIG_ARM) || + IS_ENABLED(CONFIG_MIPS) || + IS_ENABLED(CONFIG_SPARC))) + return ERR_PTR(-EOPNOTSUPP); + if (WARN_ON_ONCE(ops && !ops->zero_page_range)) return ERR_PTR(-EINVAL); diff --git a/fs/Kconfig b/fs/Kconfig index 42837617a55b..e5efdb3b276b 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -56,7 +56,6 @@ endif # BLOCK config FS_DAX bool "File system based Direct Access (DAX) support" depends on MMU - depends on !(ARM || MIPS || SPARC) depends on ZONE_DEVICE || FS_DAX_LIMITED select FS_IOMAP select DAX diff --git a/include/linux/dax.h b/include/linux/dax.h index b463502b16e1..df2d52b8a245 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev) static inline struct dax_device *alloc_dax(void *private, const struct dax_operations *ops) { - /* -* Callers should check IS_ENABLED(CONFIG_DAX) to know if this -* NULL is an error or expected. -*/ - return NULL; + return ERR_PTR(-EOPNOTSUPP); } static inline void put_dax(struct dax_device *dax_dev) { -- 2.39.2
[RFC PATCH v4 04/12] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dcssblk dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures. Considering that s390 is not a data cache aliasing architecture, and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP from alloc_dax() should make dcssblk_add_store() fail. For the transition, consider that alloc_dax() returning NULL is the same as returning -EOPNOTSUPP. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org --- drivers/s390/block/dcssblk.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index 4b7ecd4fd431..a3010849bfed 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char int rc, i, j, num_of_segments; struct dcssblk_dev_info *dev_info; struct segment_info *seg_info, *temp; + struct dax_device *dax_dev; char *local_buf; unsigned long seg_byte_size; @@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char if (rc) goto put_dev; - dev_info->dax_dev = alloc_dax(dev_info, _dax_ops); - if (IS_ERR(dev_info->dax_dev)) { - rc = PTR_ERR(dev_info->dax_dev); - dev_info->dax_dev = NULL; + dax_dev = alloc_dax(dev_info, _dax_ops); + if (IS_ERR_OR_NULL(dax_dev)) { + rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; goto put_dev; } - set_dax_synchronous(dev_info->dax_dev); + set_dax_synchronous(dax_dev); + dev_info->dax_dev = dax_dev; rc = dax_add_host(dev_info->dax_dev, dev_info->gd); if (rc) goto out_dax; -- 2.39.2
[RFC PATCH v4 00/12] Introduce cpu_dcache_is_aliasing() to fix DAX regression
This commit introduced in v4.0 prevents building FS_DAX on 32-bit ARM, even on ARMv7 which does not have virtually aliased data caches: commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches") It used to work fine before: I have customers using DAX over pmem on ARMv7, but this regression will likely prevent them from upgrading their kernel. The root of the issue here is the fact that DAX was never designed to handle virtually aliasing data caches (VIVT and VIPT with aliasing data cache). It touches the pages through their linear mapping, which is not consistent with the userspace mappings with virtually aliasing data caches. This patch series introduces cpu_dcache_is_aliasing() with the new Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all architectures. The implementation of cpu_dcache_is_aliasing() is either evaluated to a constant at compile-time or a runtime check, which is what is needed on ARM. With this we can basically narrow down the list of architectures which are unsupported by DAX to those which are really affected. Testing done so far: - Compile allyesconfig on x86-64, - Compile allyesconfig on x86-64, with FS_DAX=n. - Compile allyesconfig on x86-64, with DAX=n. - Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP even on x86-64, thus simulating the behavior expected on an architecture with data cache aliasing. There are many more axes to test however. I would welcome Tested-by for: - affected architectures, - affected drivers, - affected filesytems. Thanks, Mathieu Changes since v3: - Fix a leak on dax_add_host() failure in nvdimm/pmem. - Split the series into a bissectable sequence of changes. - Ensure that device-dax use-cases still works on data cache aliasing architectures. Changes since v2: - Move DAX supported runtime check to alloc_dax(), - Modify DM to handle alloc_dax() error as non-fatal, - Remove all filesystem modifications, since the check is now done by alloc_dax(), - rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to eliminate confusion with VFS terminology. Changes since v1: - The order of the series was completely changed based on the feedback received on v1, - cache_is_aliasing() is renamed to dcache_is_aliasing(), - ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone, - dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is simplified, - the dax_is_supported() check was moved to its rightful place in all filesystems. Cc: Andrew Morton Cc: Linus Torvalds Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev Cc: linux-s...@vger.kernel.org Mathieu Desnoyers (12): nvdimm/pmem: Fix leak on dax_add_host() failure nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dcssblk: Handle alloc_dax() -EOPNOTSUPP failure virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal dax: Check for data cache aliasing at runtime Introduce cpu_dcache_is_aliasing() across all architectures dax: Fix incorrect list of data cache aliasing architectures nvdimm/pmem: Cleanup alloc_dax() error handling dm: Cleanup alloc_dax() error handling dcssblk: Cleanup alloc_dax() error handling virtio: Cleanup alloc_dax() error handling arch/arc/Kconfig| 1 + arch/arc/include/asm/cachetype.h| 9 + arch/arm/Kconfig| 1 + arch/arm/include/asm/cachetype.h| 2 ++ arch/csky/Kconfig | 1 + arch/csky/include/asm/cachetype.h | 9 + arch/m68k/Kconfig | 1 + arch/m68k/include/asm/cachetype.h | 9 + arch/mips/Kconfig | 1 + arch/mips/include/asm/cachetype.h | 9 + arch/nios2/Kconfig | 1 + arch/nios2/include/asm/cachetype.h | 10 ++ arch/parisc/Kconfig | 1 + arch/parisc/include/asm/cachetype.h | 9 + arch/sh/Kconfig | 1 + arch/sh/include/asm/cachetype.h | 9 + arch/sparc/Kconfig | 1 + arch/sparc/include/asm/cachetype.h | 14 ++ arch/xtensa/Kconfig | 1 + arch/xtensa/include/asm/cachetype.h | 10 ++ drivers/dax/super.c | 14 ++ drivers/md/dm.c | 17 + drivers/nvdimm/pmem.c | 23 --- drivers/s390/block/dcssblk.c| 11 ++- fs/Kconfig | 1 - fs/fuse/virtio_fs.c | 15 +++ include/linux/cacheinfo.h | 6 ++ include/linux/dax.h | 6 +- mm/Kconfig |
[RFC PATCH v4 03/12] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of dm alloc_dev() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. For the transition, consider that alloc_dax() returning NULL is the same as returning -EOPNOTSUPP. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Suggested-by: Dan Williams Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/md/dm.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 23c32cd1f1d8..2fc22cae9089 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device *md) static struct mapped_device *alloc_dev(int minor) { int r, numa_node_id = dm_get_numa_node(); + struct dax_device *dax_dev; struct mapped_device *md; void *old_md; @@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor) md->disk->private_data = md; sprintf(md->disk->disk_name, "dm-%d", minor); - if (IS_ENABLED(CONFIG_FS_DAX)) { - md->dax_dev = alloc_dax(md, _dax_ops); - if (IS_ERR(md->dax_dev)) { - md->dax_dev = NULL; + dax_dev = alloc_dax(md, _dax_ops); + if (IS_ERR_OR_NULL(dax_dev)) { + if (IS_ERR(dax_dev) && PTR_ERR(dax_dev) != -EOPNOTSUPP) goto bad; - } - set_dax_nocache(md->dax_dev); - set_dax_nomc(md->dax_dev); - if (dax_add_host(md->dax_dev, md->disk)) + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + md->dax_dev = dax_dev; + if (dax_add_host(dax_dev, md->disk)) goto bad; } -- 2.39.2
[RFC PATCH v4 02/12] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of nvdimm/pmem pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal. For the transition, consider that alloc_dax() returning NULL is the same as returning -EOPNOTSUPP. Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches") Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 9fe358090720..f1d9f5c6dbac 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -558,19 +558,21 @@ static int pmem_attach_disk(struct device *dev, disk->bb = >bb; dax_dev = alloc_dax(pmem, _dax_ops); - if (IS_ERR(dax_dev)) { - rc = PTR_ERR(dax_dev); - goto out; + if (IS_ERR_OR_NULL(dax_dev)) { + rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP; + if (rc != -EOPNOTSUPP) + goto out; + } else { + set_dax_nocache(dax_dev); + set_dax_nomc(dax_dev); + if (is_nvdimm_sync(nd_region)) + set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; + rc = dax_add_host(dax_dev, disk); + if (rc) + goto out_cleanup_dax; + dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); } - set_dax_nocache(dax_dev); - set_dax_nomc(dax_dev); - if (is_nvdimm_sync(nd_region)) - set_dax_synchronous(dax_dev); - pmem->dax_dev = dax_dev; - rc = dax_add_host(dax_dev, disk); - if (rc) - goto out_cleanup_dax; - dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; -- 2.39.2
[RFC PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure
Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done before setting pmem->dax_dev, which therefore issues the two following calls on NULL pointers: out_cleanup_dax: kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); Signed-off-by: Mathieu Desnoyers Cc: Alasdair Kergon Cc: Mike Snitzer Cc: Mikulas Patocka Cc: Andrew Morton Cc: Linus Torvalds Cc: Dan Williams Cc: Vishal Verma Cc: Dave Jiang Cc: Matthew Wilcox Cc: Arnd Bergmann Cc: Russell King Cc: linux-a...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-fsde...@vger.kernel.org Cc: linux...@kvack.org Cc: linux-...@vger.kernel.org Cc: dm-devel@lists.linux.dev Cc: nvd...@lists.linux.dev --- drivers/nvdimm/pmem.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 4e8fdcb3f1c8..9fe358090720 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev, set_dax_nomc(dax_dev); if (is_nvdimm_sync(nd_region)) set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; rc = dax_add_host(dax_dev, disk); if (rc) goto out_cleanup_dax; dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); - pmem->dax_dev = dax_dev; - rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; -- 2.39.2
Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime
On 2024-02-02 15:14, Dan Williams wrote: Mathieu Desnoyers wrote: [..] Thanks for that. All of those need to be done before the fs goes live later in virtio_device_ready(), but before that point nothing should be calling into virtio_fs_dax_ops, so as far as I can see it is safe to change the order. Sounds good, I'll do that. I will soon be ready to send out a RFC v4, which is still only compiled-tested. Do you happen to have some kind of test suite you can use to automate some of the runtime testing ? There is a test suite for the pmem, dm, and dax changes (https://github.com/pmem/ndctl?tab=readme-ov-file#unit-tests), but not automated unfortunately. The NVDIMM maintainer team will run that before pushing patches out to the fixes branch if you just want to lean on that. For the rest I think we will need to depend on tested-by's from s390 + virtio_fs folks, and / or sufficient soak time in linux-next. I suspect this will be necessary. There are just so many combinations of architectures, drivers and filesystems involved here that I don't think it is realistic to try to do all this testing manually on my own. I prefer to voice this up front, so there are no misplaced expectations about testing. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime
On 2024-02-02 14:41, Dan Williams wrote: Mathieu Desnoyers wrote: On 2024-02-02 12:37, Dan Williams wrote: Mathieu Desnoyers wrote: [...] The alternative route I intend to take is to audit all callers of alloc_dax() and make sure they all save the alloc_dax() return value in a struct dax_device * local variable first for the sake of checking for IS_ERR(). This will leave the xyz->dax_dev pointer initialized to NULL in the error case and simplify the rest of error checking. I could maybe get on board with that, but it needs a comment somewhere about the asymmetric subtlety. Is this "somewhere" at every alloc_dax() call site, or do you have something else in mind ? At least kill_dax() should mention the asymmetry I think. Here is what I intend to add: * Note, because alloc_dax() returns an ERR_PTR() on error, callers * typically store its result into a local variable in order to check * the result. Therefore, care must be taken to populate the struct * device dax_dev field make sure the dax_dev is not leaked. The real question is what to do about device-dax. I *think* it is not affected by cpu_dcache aliasing because it never accesses user mappings through a kernel alias. I doubt device-dax is in use on these platforms, but we might need another fixup for that if someone screams about the alloc_dax() behavior change making them lose device-dax access. By "device-dax", I understand you mean drivers/dax/Kconfig:DEV_DAX. Based on your analysis, is alloc_dax() still the right spot where to place this runtime check ? Which call sites are responsible for invoking alloc_dax() for device-dax ? That is in devm_create_dev_dax(). If we know which call sites do not intend to use the kernel linear mapping, we could introduce a flag (or a new variant of the alloc_dax() API) that would either enforce or skip the check. Hmmm, it looks like there is already a natural flag for that. If alloc_dax() is passed a NULL operations pointer it means there are no kernel usages of the aliased mapping. That actually fits rather nicely. Good, I was reaching the same conclusion when I received your reply. I'll do that. It ends up being: /* * Unavailable on architectures with virtually aliased data caches, * except for device-dax (NULL operations pointer), which does * not use aliased mappings from the kernel. */ if (ops && cpu_dcache_is_aliasing()) return ERR_PTR(-EOPNOTSUPP); [..] @@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) if (!IS_ENABLED(CONFIG_FUSE_DAX)) return 0; + dax_dev = alloc_dax(fs, _fs_dax_ops); + if (IS_ERR(dax_dev)) { + int rc = PTR_ERR(dax_dev); + + if (rc == -EOPNOTSUPP) + return 0; + return rc; + } What is gained by moving this allocation here ? The gain is to fail early in virtio_fs_setup_dax() since the fundamental dependency of alloc_dax() success is not met. For example why let the setup progress to devm_memremap_pages() when alloc_dax() is going to return ERR_PTR(-EOPNOTSUPP). What I don't know is whether there is a dependency requiring to do devm_request_mem_region(), devm_kzalloc(), devm_memremap_pages() before calling alloc_dax() ? Those 3 calls are used to populate: fs->window_phys_addr = (phys_addr_t) cache_reg.addr; fs->window_len = (phys_addr_t) cache_reg.len; and then alloc_dax() takes "fs" as private data parameter. So it's unclear to me whether we can swap the invocation order. I suspect that it is not an issue because it is only used to populate dax_dev->private, but I prefer to confirm this with you just to be on the safe side. Thanks for that. All of those need to be done before the fs goes live later in virtio_device_ready(), but before that point nothing should be calling into virtio_fs_dax_ops, so as far as I can see it is safe to change the order. Sounds good, I'll do that. I will soon be ready to send out a RFC v4, which is still only compiled-tested. Do you happen to have some kind of test suite you can use to automate some of the runtime testing ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime
On 2024-02-02 12:37, Dan Williams wrote: Mathieu Desnoyers wrote: [...] The alternative route I intend to take is to audit all callers of alloc_dax() and make sure they all save the alloc_dax() return value in a struct dax_device * local variable first for the sake of checking for IS_ERR(). This will leave the xyz->dax_dev pointer initialized to NULL in the error case and simplify the rest of error checking. I could maybe get on board with that, but it needs a comment somewhere about the asymmetric subtlety. Is this "somewhere" at every alloc_dax() call site, or do you have something else in mind ? return; if (dax_dev->holder_data != NULL) diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 4e8fdcb3f1c8..b69c9e442cf4 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev, dax_dev = alloc_dax(pmem, _dax_ops); if (IS_ERR(dax_dev)) { rc = PTR_ERR(dax_dev); - goto out; + if (rc != -EOPNOTSUPP) + goto out; If I compare the before / after this change, if previously pmem_attach_disk() was called in a configuration with FS_DAX=n, it would result in a NULL pointer dereference. No, alloc_dax() only returns NULL CONFIG_DAX=n case, not the CONFIG_FS_DAX=n case. Indeed, I was wrong there. So that means that pmem devices on ARM have been possible without FS_DAX. So, in order for alloc_dax() returning ERR_PTR(-EOPNOTSUPP) to not regress pmem device availability this error path needs to be changed. Good point. We're moving the depends on !(ARM || MIPS |PARC) from FS_DAX Kconfig to a runtime check in alloc_dax(), which is used whenever DAX=y, which includes configurations that had FS_DAX=n previously. I'll change the error path in pmem_attack_disk to treat -EOPNOTSUPP alloc_dax() return value as non-fatal. This would be an error handling fix all by itself. Do we really want to return successfully if dax is unsupported, or should we return an error here ? Per above, there is no error handling fix, and pmem block device available should not depend on alloc_dax() succeeding. I agree on treating alloc_dax() failure as non-fatal. There is however one error handling fix to nvdimm/pmem which I plan to introduce as an initial patch before this change: nvdimm/pmem: Fix leak on dax_add_host() failure Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done before setting pmem->dax_dev, which therefore issues the two following calls on NULL pointers: out_cleanup_dax: kill_dax(pmem->dax_dev); put_dax(pmem->dax_dev); Signed-off-by: Mathieu Desnoyers diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 4e8fdcb3f1c8..9fe358090720 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev, set_dax_nomc(dax_dev); if (is_nvdimm_sync(nd_region)) set_dax_synchronous(dax_dev); + pmem->dax_dev = dax_dev; rc = dax_add_host(dax_dev, disk); if (rc) goto out_cleanup_dax; dax_write_cache(dax_dev, nvdimm_has_cache(nd_region)); - pmem->dax_dev = dax_dev; - rc = device_add_disk(dev, disk, pmem_attribute_groups); if (rc) goto out_remove_host; The real question is what to do about device-dax. I *think* it is not affected by cpu_dcache aliasing because it never accesses user mappings through a kernel alias. I doubt device-dax is in use on these platforms, but we might need another fixup for that if someone screams about the alloc_dax() behavior change making them lose device-dax access. By "device-dax", I understand you mean drivers/dax/Kconfig:DEV_DAX. Based on your analysis, is alloc_dax() still the right spot where to place this runtime check ? Which call sites are responsible for invoking alloc_dax() for device-dax ? If we know which call sites do not intend to use the kernel linear mapping, we could introduce a flag (or a new variant of the alloc_dax() API) that would either enforce or skip the check. [...] Here what I'm seeing so far: - devm_release_mem_region() is never called after devm_request_mem_region(). Not on error, neither on teardown, devm_release_mem_region() is called from virtio_fs_probe() context. That I guess you mean "devm_request_mem_region()" here. means that when virtio_fs_probe() returns an error the driver core will automatically call devm_request_mem_region(). And "devm_release_mem_region()" here. - pgmap is never freed on error after devm_kzalloc. That is what the "devm_" in devm_kzalloc() does, free the memory on driver-probe failure, or after the driver remove callback is invoked. Got it.
Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime
On 2024-02-01 10:44, Mathieu Desnoyers wrote: On 2024-01-31 17:18, Dan Williams wrote: [...] diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c index 5f1be1da92ce..11053a70f5ab 100644 --- a/fs/fuse/virtio_fs.c +++ b/fs/fuse/virtio_fs.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include "fuse_i.h" @@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data) put_dax(dax_dev); } +DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T)) + static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) So either I'm completely missing how ownership works in this function, or we should be really concerned about the fact that it does no actual cleanup of anything on any error. [...] Here what I'm seeing so far: - devm_release_mem_region() is never called after devm_request_mem_region(). Not on error, neither on teardown, - pgmap is never freed on error after devm_kzalloc. I was indeed missing something: the devm_ family of functions keeps ownership at the device level, so we would not need explicit teardown. { + struct dax_device *dax_dev __free(cleanup_dax) = NULL; struct virtio_shm_region cache_reg; struct dev_pagemap *pgmap; bool have_cache; @@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) if (!IS_ENABLED(CONFIG_FUSE_DAX)) return 0; + dax_dev = alloc_dax(fs, _fs_dax_ops); + if (IS_ERR(dax_dev)) { + int rc = PTR_ERR(dax_dev); + + if (rc == -EOPNOTSUPP) + return 0; + return rc; + } What is gained by moving this allocation here ? I'm still concerned about moving the call to alloc_dax() before the setup of the memory region it will use. Are those completely independent ? + /* Get cache region */ have_cache = virtio_get_shm_region(vdev, _reg, (u8)VIRTIO_FS_SHMCAP_ID_CACHE); @@ -849,10 +862,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs) dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n", __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len); - fs->dax_dev = alloc_dax(fs, _fs_dax_ops); - if (IS_ERR(fs->dax_dev)) - return PTR_ERR(fs->dax_dev); - + fs->dax_dev = no_free_ptr(dax_dev); return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax, fs->dax_dev); } [...] Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime
On 2024-02-01 10:44, Mathieu Desnoyers wrote: [...] diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 4e8fdcb3f1c8..b69c9e442cf4 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev, dax_dev = alloc_dax(pmem, _dax_ops); if (IS_ERR(dax_dev)) { rc = PTR_ERR(dax_dev); - goto out; + if (rc != -EOPNOTSUPP) + goto out; If I compare the before / after this change, if previously pmem_attach_disk() was called in a configuration with FS_DAX=n, it would result in a NULL pointer dereference. I was wrong. drivers/nvdimm/Kconfig has: config BLK_DEV_PMEM select DAX and drivers/nvdimm/Makefile has: obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o nd_pmem-y := pmem.o which means that anything in pmem.c can assume that alloc_dax() is implemented. [...] diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index 4b7ecd4fd431..f911e58a24dd 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -681,12 +681,14 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char if (IS_ERR(dev_info->dax_dev)) { rc = PTR_ERR(dev_info->dax_dev); dev_info->dax_dev = NULL; - goto put_dev; + if (rc != -EOPNOTSUPP) + goto put_dev; config DCSSBLK selects FS_DAX_LIMITED and DAX. I'm not sure what selecting DAX is trying to achieve here, because the Kconfig option is "FS_DAX". So depending on the real motivation behind this select, we may want to consider failure rather than success in the -EOPNOTSUPP case. I missed that alloc_dax() is implemented as not supported based on CONFIG_DAX (not CONFIG_FS_DAX). Therefore DCSSBLK Kconfig does the right thing and always selects DAX, and thus an implemented version of alloc_dax(). This takes care of two of my open questions at least. :) Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com