Re: [lttng-dev] Capturing snapshot on kernel panic

2024-05-16 Thread Mathieu Desnoyers via lttng-dev
the time
 > it enters.
 >
 > While this doesn't necessarily help your original question of panics,
 > if you want to snapshot before shutdown or reboot and are using
 > systemd, it's possible to leave a script or binary in a known
 > directory so that it's invoked prior to the rest of the shutdown
 > sequence [4].
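A minimal sketch of such a shutdown hook follows. It assumes systemd's /usr/lib/systemd/system-shutdown/ drop-in directory (executables there receive the action as `$1`); the session name is illustrative, and whether a session daemon is still reachable that late in shutdown must be verified on the target.

```shell
#!/bin/sh
# Hypothetical drop-in for /usr/lib/systemd/system-shutdown/:
# systemd-shutdown invokes executables there with the action
# ("reboot", "poweroff", "halt" or "kexec") as $1, after services
# are stopped but before the final filesystem remount/unmount.
SESSION=snapshot-trace-session
ACTION="${1:-}"

case "$ACTION" in
reboot|poweroff|halt)
	# Record a snapshot while the output directory is still writable.
	command -v lttng >/dev/null 2>&1 && \
		lttng snapshot record --session="$SESSION"
	;;
esac
```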
 >
 > [1]: https://lttng.org/docs/v2.13/#doc-persistent-memory-file-systems
 > [2]: https://github.com/systemd/systemd/blob/6533c14997700f74e9ea42121303fc1f5c63e62b/src/shutdown/shutdown.c
 > [3]: https://github.com/systemd/systemd/blob/main/src/shared/reboot-util.c#L77
 > [4]: https://www.systutorials.com/docs/linux/man/8-systemd-reboot/
 >
 > hope this helps,
 > kienan
 >
 >> Would you have any suggestions?
 >> Thanks for your help,
 >> Cheers
 >> Damien
 >>
 >> 
 >>
 >> # Prep output dir
 >> mkdir /application/trace/
 >> rm -rf /application/trace/*
 >>
 >> # Create session
 >> sudo lttng destroy snapshot-trace-session
 >> sudo lttng create snapshot-trace-session --snapshot --output="/application/trace/"
 >> sudo lttng enable-channel --kernel --num-subbuf=8 channelk
 >> sudo lttng enable-channel --userspace --num-subbuf=8 channelu
 >>
 >> # Configure session
 >> sudo lttng enable-event --kernel --syscall --all --channel channelk
 >> sudo lttng enable-event --kernel --tracepoint "sched*" --channel channelk
 >> sudo lttng enable-event --userspace --all --channel channelu
 >> sudo lttng add-context -u -t vtid -t procname
 >> sudo lttng remove-trigger trig_reboot
 >> sudo lttng add-trigger --name=trig_reboot \
 >>          --condition=event-rule-matches --type=kernel:syscall:entry --name=reboot \
 >>          --action=snapshot-session snapshot-trace-session \
 >>          --rate-policy=once-after:1
 >>
 >> # start & list info
 >> sudo lttng start
 >> sudo lttng list snapshot-trace-session
 >> sudo lttng list-triggers
 >>
 >> # test it...
 >> sudo reboot
 >>
 >> #=== reconnect and Nothing :(
 >> $ ls -alu /application/trace/
 >> drwxr-xr-x    2 u  u       4096 May 15  2024 .
 >> drwxr-xr-x   10 u  u       4096 May 15  2024 ..
 >>
 >>
 >> ___
 >> lttng-dev mailing list
 >> lttng-dev@lists.lttng.org
 >> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev



--
*Damien Berget*



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



[lttng-dev] [RELEASE] LTTng-modules 2.13.13 and 2.12.17 (Linux kernel tracer)

2024-05-13 Thread Mathieu Desnoyers via lttng-dev

Hi,

This is a stable release announcement for the LTTng kernel tracer,
an out-of-tree tracer for the Linux kernel.

The LTTng project provides low-overhead, correlated userspace and
kernel tracing on Linux. Its use of the Common Trace Format and a
flexible control interface allows it to fulfill various workloads.

* New in these releases:

- LTTng-modules 2.13.13:

  - Introduce support for Linux v6.9.

  - Remove unused duplicated code, add missing static to
function definitions, and add missing includes for function
declarations, all observed when building against recent
kernels with newer toolchains.
We plan to adapt our CI to add jobs that report warnings
as errors when building lttng-modules against recent kernels
with a recent toolchain, so we can catch and fix such warnings
earlier in the future.

- In both LTTng-modules 2.12.17 and 2.13.13:

  - Fix the incorrect get_pfnblock_flags_mask prototype, which stopped
matching upstream after commit 535b81e209219 (v5.9), and fix the
prototype mismatch detection code as well. This affects the
mm_page_alloc_extfrag event, which uses get_pageblock_migratetype().
Note that because the kernel macro get_pageblock_migratetype was also
updated to pass 3 parameters to get_pfnblock_flags_mask when its kernel
prototype changed to expect three parameters, it does not matter that
the lttng-modules wrapper expects 4 parameters and passes those 4
parameters to the kernel function. This issue should therefore not
affect runtime behavior.

  - Instrumentation updates to support EL 8.4+.

  - Instrumentation updates for RHEL kernels.

  - Instrumentation updates to the timer subsystem to adapt to
changes backported in the 4.19 stable kernels.


* Detailed change logs:

2024-05-13 (National Leprechaun Day) LTTng modules 2.13.13
* splice wrapper: Fix missing declaration
* page alloc wrapper: Fix get_pfnblock_flags_mask prototype
* lttng probe: include events-internal.h
* syscalls: Remove unused duplicated code
* statedump: Add missing events-internal.h include
* lttng-events: Add missing static
* event notifier: Add missing static
* context callstack: Add missing static
* lttng-clock: Add missing lttng/events-internal.h include
* lttng-calibrate: Add missing static and include
* lttng-bytecode: Remove dead code
* lttng-abi: Add missing static to function definitions
* ring buffer: Add missing static to function definitions
* blkdev wrapper: Fix constness warning
* Fix: timer_expire_entry changed in 4.19.312
* Fix: dev_base_lock removed in linux 6.9-rc1
* Fix: mm_compaction_migratepages changed in linux 6.9-rc1
* Fix: ASoC add component to set_bias_level events in linux 6.9-rc1
* Fix: ASoC snd_doc_dapm on linux 6.9-rc1
* Fix: build kvm probe on EL 8.4+
* Fix: support ext4_journal_start on EL 8.4+
* Fix: correct RHEL range for kmem_cache_free define

2024-05-13 (National Leprechaun Day) LTTng modules 2.12.17
* page alloc wrapper: Fix get_pfnblock_flags_mask prototype
* Fix: timer_expire_entry changed in 4.19.312
* Fix: build kvm probe on EL 8.4+
* Fix: support ext4_journal_start on EL 8.4+
* Fix: correct RHEL range for kmem_cache_free define

Project website: https://lttng.org
Documentation: https://lttng.org/docs
Download link: https://lttng.org/download

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


Re: [PATCH v3 54/68] selftests/rseq: Drop define _GNU_SOURCE

2024-05-09 Thread Mathieu Desnoyers

On 2024-05-09 15:58, Edward Liaw wrote:

_GNU_SOURCE is provided by lib.mk, so it should be dropped to prevent
redefinition warnings.

Fixes: 809216233555 ("selftests/harness: remove use of LINE_MAX")


The patch per se looks fine, except for the "Fixes" tag.

Commit 809216233555 introduces use of asprintf in kselftest_harness.h
which is used by (all ?) selftests, including the rseq ones. However,
the rseq selftests each have the #define _GNU_SOURCE, which would have
been OK without those further changes.

So this patch is more about consolidating where the _GNU_SOURCE is
defined, which is OK with me, but not so much about "fixing" an
issue with commit 809216233555.

A "Fix" is something to be backported to stable kernels, and I
don't think this patch reaches that threshold.

If anything, this patch removes a warning that gets added by
https://lore.kernel.org/lkml/20240509200022.253089-1-edl...@google.com/T/#mf8438d03de6e2b613da4f86d4f60c5fe1c5f8483
within the same series.

Arguably, each #define _GNU_SOURCE could have been first protected
by a #ifndef guard to eliminate this transient warning, and there
would be nothing to "fix" in this consolidation series.
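For illustration, the guard approach mentioned above could look like this. The helper name and format string are made up for the example; only the `#ifndef` guard and the `asprintf()` dependency reflect the discussion.

```c
#ifndef _GNU_SOURCE		/* guard avoids redefinition warnings when   */
#define _GNU_SOURCE		/* the build system already passes -D_GNU_SOURCE */
#endif

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* asprintf() is the GNU extension that kselftest_harness.h started
 * using in commit 809216233555; it is only declared by <stdio.h>
 * when _GNU_SOURCE is defined before the include. */
static char *format_test_name(const char *fixture, int variant)
{
	char *name = NULL;

	if (asprintf(&name, "%s.%d", fixture, variant) < 0)
		return NULL;
	return name;
}
```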

Thoughts ?

Thanks,

Mathieu


Reviewed-by: John Hubbard 
Reviewed-by: Muhammad Usama Anjum 
Signed-off-by: Edward Liaw 
---
  tools/testing/selftests/rseq/basic_percpu_ops_test.c | 1 -
  tools/testing/selftests/rseq/basic_test.c            | 2 --
  tools/testing/selftests/rseq/param_test.c            | 1 -
  tools/testing/selftests/rseq/rseq.c                  | 2 --
  4 files changed, 6 deletions(-)

diff --git a/tools/testing/selftests/rseq/basic_percpu_ops_test.c b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
index 2348d2c20d0a..5961c24ee1ae 100644
--- a/tools/testing/selftests/rseq/basic_percpu_ops_test.c
+++ b/tools/testing/selftests/rseq/basic_percpu_ops_test.c
@@ -1,5 +1,4 @@
  // SPDX-License-Identifier: LGPL-2.1
-#define _GNU_SOURCE
  #include 
  #include 
  #include 
diff --git a/tools/testing/selftests/rseq/basic_test.c b/tools/testing/selftests/rseq/basic_test.c
index 295eea16466f..1fed749b4bd7 100644
--- a/tools/testing/selftests/rseq/basic_test.c
+++ b/tools/testing/selftests/rseq/basic_test.c
@@ -2,8 +2,6 @@
  /*
   * Basic test coverage for critical regions and rseq_current_cpu().
   */
-
-#define _GNU_SOURCE
  #include 
  #include 
  #include 
diff --git a/tools/testing/selftests/rseq/param_test.c b/tools/testing/selftests/rseq/param_test.c
index 2f37961240ca..48a55d94eb72 100644
--- a/tools/testing/selftests/rseq/param_test.c
+++ b/tools/testing/selftests/rseq/param_test.c
@@ -1,5 +1,4 @@
  // SPDX-License-Identifier: LGPL-2.1
-#define _GNU_SOURCE
  #include 
  #include 
  #include 
diff --git a/tools/testing/selftests/rseq/rseq.c b/tools/testing/selftests/rseq/rseq.c
index 96e812bdf8a4..88602889414c 100644
--- a/tools/testing/selftests/rseq/rseq.c
+++ b/tools/testing/selftests/rseq/rseq.c
@@ -14,8 +14,6 @@
   * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
   * Lesser General Public License for more details.
   */
-
-#define _GNU_SOURCE
  #include 
  #include 
  #include 


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [lttng-dev] [PATCH urcu] fix: handle EINTR correctly in get_cpu_mask_from_sysfs

2024-05-02 Thread Mathieu Desnoyers via lttng-dev

On 2024-05-02 10:32, Michael Jeanson wrote:

On 2024-05-02 09:54, Mathieu Desnoyers wrote:

On 2024-05-01 19:42, Benjamin Marzinski via lttng-dev wrote:

If the read() in get_cpu_mask_from_sysfs() fails with EINTR, the code is
supposed to retry, but the while loop condition has (bytes_read > 0),
which is false when read() fails with EINTR. The result is that the code
exits the loop, having only read part of the string.

Use (bytes_read != 0) in the while loop condition instead, since the
(bytes_read < 0) case is already handled in the loop.


Thanks for the fix ! It is indeed the right thing to do.

I would like to integrate this fix into the librseq and libside
projects as well though, but I notice that the copy in liburcu
is LGPLv2.1 whereas the copies in librseq and libside are
MIT.

Michael, should we first relicense the liburcu src/compat-smp.h
implementation to MIT so it matches the license of the copies
in librseq and libside ?


Sure, please go ahead.


For the records, we also have a copy of this code in lttng-ust,
also under MIT license. So liburcu's copy is the only outlier
there.

Thanks,

Mathieu




Thanks,

Mathieu



Signed-off-by: Benjamin Marzinski 
---
  src/compat-smp.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/compat-smp.h b/src/compat-smp.h
index 31fa979..075a332 100644
--- a/src/compat-smp.h
+++ b/src/compat-smp.h
@@ -164,7 +164,7 @@ static inline int get_cpu_mask_from_sysfs(char *buf, size_t max_bytes, const cha
 
 		total_bytes_read += bytes_read;
 		assert(total_bytes_read <= max_bytes);
-	} while (max_bytes > total_bytes_read && bytes_read > 0);
+	} while (max_bytes > total_bytes_read && bytes_read != 0);
 
 	/*
 	 * Make sure the mask read is a null terminated string.





--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [lttng-dev] [PATCH urcu] fix: handle EINTR correctly in get_cpu_mask_from_sysfs

2024-05-02 Thread Mathieu Desnoyers via lttng-dev

On 2024-05-01 19:42, Benjamin Marzinski via lttng-dev wrote:

If the read() in get_cpu_mask_from_sysfs() fails with EINTR, the code is
supposed to retry, but the while loop condition has (bytes_read > 0),
which is false when read() fails with EINTR. The result is that the code
exits the loop, having only read part of the string.

Use (bytes_read != 0) in the while loop condition instead, since the
(bytes_read < 0) case is already handled in the loop.


Thanks for the fix ! It is indeed the right thing to do.

I would like to integrate this fix into the librseq and libside
projects as well though, but I notice that the copy in liburcu
is LGPLv2.1 whereas the copies in librseq and libside are
MIT.

Michael, should we first relicense the liburcu src/compat-smp.h
implementation to MIT so it matches the license of the copies
in librseq and libside ?

Thanks,

Mathieu



Signed-off-by: Benjamin Marzinski 
---
  src/compat-smp.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/compat-smp.h b/src/compat-smp.h
--- a/src/compat-smp.h
+++ b/src/compat-smp.h
@@ -164,7 +164,7 @@ static inline int get_cpu_mask_from_sysfs(char *buf, size_t max_bytes, const cha
 
 		total_bytes_read += bytes_read;
 		assert(total_bytes_read <= max_bytes);
-	} while (max_bytes > total_bytes_read && bytes_read > 0);
+	} while (max_bytes > total_bytes_read && bytes_read != 0);
 
 	/*
 	 * Make sure the mask read is a null terminated string.


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [PATCH] rcu: Remove unreachable logic

2024-04-29 Thread Mathieu Desnoyers

On 2024-04-29 13:34, Alan Huang wrote:

call_rcu_core is only called from __call_rcu_common with interrupt
disabled. This patch thus removes the unreachable logic and the would-be
unused 'flags' parameter.


Nack.

call_rcu_core() receives a @flags parameter which are the _saved_ flags
as they were prior to local_irq_save().

This patch highlights a misunderstanding of what the code is actually
doing.

Thanks,

Mathieu



Signed-off-by: Alan Huang 
---
  kernel/rcu/tree.c | 35 ++-
  1 file changed, 2 insertions(+), 33 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d5507ac1bbf1..b0ea2ebd7769 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2983,7 +2983,7 @@ static void rcutree_enqueue(struct rcu_data *rdp, struct rcu_head *head, rcu_cal
  * Handle any core-RCU processing required by a call_rcu() invocation.
  */
 static void call_rcu_core(struct rcu_data *rdp, struct rcu_head *head,
-			  rcu_callback_t func, unsigned long flags)
+			  rcu_callback_t func)
 {
 	rcutree_enqueue(rdp, head, func);
 	/*
@@ -2992,37 +2992,6 @@ static void call_rcu_core(struct rcu_data *rdp, struct rcu_head *head,
 	 */
 	if (!rcu_is_watching())
 		invoke_rcu_core();
-
-	/* If interrupts were disabled or CPU offline, don't invoke RCU core. */
-	if (irqs_disabled_flags(flags) || cpu_is_offline(smp_processor_id()))
-		return;
-
-	/*
-	 * Force the grace period if too many callbacks or too long waiting.
-	 * Enforce hysteresis, and don't invoke rcu_force_quiescent_state()
-	 * if some other CPU has recently done so.  Also, don't bother
-	 * invoking rcu_force_quiescent_state() if the newly enqueued callback
-	 * is the only one waiting for a grace period to complete.
-	 */
-	if (unlikely(rcu_segcblist_n_cbs(&rdp->cblist) >
-		     rdp->qlen_last_fqs_check + qhimark)) {
-
-		/* Are we ignoring a completed grace period? */
-		note_gp_changes(rdp);
-
-		/* Start a new grace period if one not already started. */
-		if (!rcu_gp_in_progress()) {
-			rcu_accelerate_cbs_unlocked(rdp->mynode, rdp);
-		} else {
-			/* Give the grace period a kick. */
-			rdp->blimit = DEFAULT_MAX_RCU_BLIMIT;
-			if (READ_ONCE(rcu_state.n_force_qs) == rdp->n_force_qs_snap &&
-			    rcu_segcblist_first_pend_cb(&rdp->cblist) != head)
-				rcu_force_quiescent_state();
-			rdp->n_force_qs_snap = READ_ONCE(rcu_state.n_force_qs);
-			rdp->qlen_last_fqs_check = rcu_segcblist_n_cbs(&rdp->cblist);
-		}
-	}
 }
 
 /*
@@ -3121,7 +3090,7 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy_in)
 	if (unlikely(rcu_rdp_is_offloaded(rdp)))
 		call_rcu_nocb(rdp, head, func, flags, lazy);
 	else
-		call_rcu_core(rdp, head, func, flags);
+		call_rcu_core(rdp, head, func);
 	local_irq_restore(flags);
 }
 


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




[lttng-dev] [RELEASE] LTTng-UST 2.12.10 and 2.13.8 (Linux user-space tracer)

2024-04-19 Thread Mathieu Desnoyers via lttng-dev

LTTng-UST, the Linux Trace Toolkit Next Generation Userspace Tracer,
is a low-overhead application tracer. The library "liblttng-ust" enables
tracing of applications and libraries.

New in both 2.12.10 and 2.13.8:

* Add close_range wrapper to liblttng-ust-fd.so

GNU libc 2.34 implements a new close_range symbol which is used
by the ssh client and other applications to close all file descriptors,
including those which do not belong to the application. Override
this symbol to prevent the application from closing file descriptors
actively used by lttng-ust.

* Fix: libc wrapper: use initial-exec for malloc_nesting TLS

Use the initial-exec TLS model for the malloc_nesting nesting guard
variable to ensure that the GNU libc implementation of the TLS access
does not trigger infinite recursion by calling the memory allocator
wrapper functions, which can happen with the global-dynamic model.

This fixes a liblttng-ust-libc-wrapper.so regression on recent
Fedora distributions.

* lttng-ust(3): Fix wrong len_type for sequence

`len_type' of a sequence field must be of type unsigned integer. Some
examples provided in the man page incorrectly used a signed integer
type, which compiled correctly but produced errors while decoding.

New in 2.13.8:

* ust-tracepoint-event: Add static check of sequences length type

Add a compile-time check to validate that unsigned types are used
for the length field of sequences.

Detailed change logs:

2024-04-19 (National Garlic Day) lttng-ust 2.13.8
* Add close_range wrapper to liblttng-ust-fd.so
* ust-tracepoint-event: Add static check of sequences length type
* lttng-ust(3): Fix wrong len_type for sequence
* Fix: libc wrapper: use initial-exec for malloc_nesting TLS

2024-04-19 (National Garlic Day) lttng-ust 2.12.10
* Add close_range wrapper to liblttng-ust-fd.so
* lttng-ust(3): Fix wrong len_type for sequence
* Fix: libc wrapper: use initial-exec for malloc_nesting TLS


Project website: https://lttng.org
Documentation: https://lttng.org/docs
Download link: https://lttng.org/download

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


Re: [lttng-dev] Software Heritage archival notification for git.liburcu.org

2024-04-15 Thread Mathieu Desnoyers via lttng-dev

On 2024-04-15 10:20, Michael Jeanson via lttng-dev wrote:

On 2024-04-14 20:39, Paul Wise wrote:

On Thu, 2024-04-11 at 13:45 -0400, Michael Jeanson wrote:


I see no issues with this, thanks for the heads-up.


PS: I note that git.liburcu.org and git.lttng.org seem to have
identical contents. I wonder if SWH should be archiving just one of
them or if we should archive both just in case they get split up?


At the moment 'git.liburu.org' is just a CNAME for 'git.lttng.org', we 


Typo: git.liburcu.org

Thanks,

Mathieu


don't have plans to split them up so far.



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [RFC PATCH 0/4] perf: Correlating user process data to samples

2024-04-12 Thread Mathieu Desnoyers

On 2024-04-12 12:28, Beau Belgrave wrote:

On Thu, Apr 11, 2024 at 09:52:22PM -0700, Ian Rogers wrote:

On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave  wrote:


In the Open Telemetry profiling SIG [1], we are trying to find a way to
grab a tracing association quickly on a per-sample basis. The team at
Elastic has a bespoke way to do this [2], however, I'd like to see a
more general way to achieve this. The folks I've been talking with seem
open to the idea of just having a TLS value for this we could capture


Presumably TLS == Thread Local Storage.



Yes, the initial idea is to use thread local storage (TLS). It seems to
be the fastest option to save a per-thread value that changes at a fast
rate.


upon each sample. We could then just state, Open Telemetry SDKs should
have a TLS value for span correlation. However, we need a way to sample
the TLS or other value(s) when a sampling event is generated. This is
supported today on Windows via EventActivityIdControl() [3]. Since
Open Telemetry works on both Windows and Linux, ideally we can do
something as efficient for Linux based workloads.

This series is to explore how it would be best possible to collect
supporting data from a user process when a profile sample is collected.
Having a value stored in TLS makes a lot of sense for this however
there are other ways to explore. Whatever is chosen, kernel samples
taken in process context should be able to get this supporting data.
In these patches on X64 the fsbase and gsbase are used for this.

An option to explore suggested by Mathieu Desnoyers is to utilize rseq
for processes to register a value location that can be included when
profiling if desired. This would allow a tighter contract between user
processes and a profiler.  It would allow better labeling/categorizing
the correlation values.


It is hard to understand this idea. Are you saying stash a cookie in
TLS for samples to capture to indicate an activity? Restartable
sequences are about preemption on a CPU not of a thread, so at least
my intuition is that they feel different. You could stash information
like this today by changing the thread name which generates comm
events. I've wondered about having similar information in some form of
reserved for profiling stack slot, for example, to stash a pointer to
the name of a function being interpreted. Snapshotting all of a stack
is bad performance wise and for security. A stack slot would be able
to deal with nesting.



You are getting the idea. A slot or tag for a thread would be great! I'm
not a fan of overriding the thread comm name (as that already has a
use). TLS would be fine, if we could also pass an offset + size + type.

Maybe a stack slot that just points to parts of TLS? That way you could
have a set of slots that don't require much memory and selectively copy
them out of TLS (or where ever those slots point to in user memory).

When I was talking to Mathieu about this, it seems that rseq already had
a place to potentially put these slots. I'm unsure though how the per
thread aspects would work.

Mathieu, can you post your ideas here about that?


Sure. I'll try to summarize my thoughts here. By all means, let me
know if I'm missing important pieces of the puzzle.

First of all, here is my understanding of what information we want to
share between userspace and kernel.

A 128-bit activity ID identifies "uniquely" (as far as a 128-bit random
UUID allows) a portion of the dependency chain involved in doing some
work (e.g. answer a HTTP request) across one or many participating hosts.

Activity IDs have a parent/child relationship: a parent activity ID can
create children activity IDs.

For instance, if one host has the service "dispatch", another host
has a "web server", and a third host has a SQL database, we should
be able to follow the chain of activities needed to answer a web
query by following those activity IDs, linking them together
through parent/child relationships. This usually requires the
communication protocols to convey those activity IDs across hosts.

The reason why this information must be provided from userspace is
because it's userspace that knows where to find those activity IDs
within its application-layer communication protocols.

With tracing, taking a full trace of the activity ID spans begin/end
from all hosts allow reconstructing the activity IDs parent/child
relationships, so we typically only need to extract information about
activity ID span begin/end with parent/child info to a tracer.

Using activity IDs from a kernel profiler is trickier, because
we do not have access to the complete span begin/end trace to
reconstruct the activity ID parent/child relationship. This is
where I suspect we'd want to introduce a notion of "activity ID
stack", so a profiler could reconstruct the currently active
stack of activity IDs for the current thread by walking that
stack.
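As a rough illustration of that idea, a per-thread activity ID stack could look like the sketch below. The names, depth, and layout are purely hypothetical for discussion, not a proposed ABI; a real design registered through rseq would need a layout the kernel can safely walk at sample time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread "activity ID stack" a profiler could walk:
 * a fixed-depth array of 128-bit IDs in thread-local storage plus a
 * depth counter.  The profiler reads entries [0, activity_depth). */
#define ACTIVITY_STACK_DEPTH 16

struct activity_id {
	uint64_t hi, lo;	/* two halves of a 128-bit random UUID */
};

static __thread struct activity_id activity_stack[ACTIVITY_STACK_DEPTH];
static __thread int activity_depth;

/* Push a new child activity ID when a span begins. */
static int activity_push(uint64_t hi, uint64_t lo)
{
	if (activity_depth >= ACTIVITY_STACK_DEPTH)
		return -1;	/* overflow: drop rather than corrupt */
	activity_stack[activity_depth].hi = hi;
	activity_stack[activity_depth].lo = lo;
	activity_depth++;
	return 0;
}

/* Pop the current activity ID when its span ends. */
static void activity_pop(void)
{
	if (activity_depth > 0)
		activity_depth--;
}
```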

This profiling could be triggered either from an interrupt
(s

Re: [lttng-dev] Compile fix for urcu-bp.c

2024-04-01 Thread Mathieu Desnoyers via lttng-dev

Hi,

There are a few things missing before I can take this patch:

- Missing commit message describing the issue,
- Missing "Signed-off-by" tag.

Thanks!

Mathieu

On 2024-03-29 10:06, Duncan Sands via lttng-dev wrote:

--- src/urcu-bp.c
+++ src/urcu-bp.c
@@ -409,7 +409,7 @@ void expand_arena(struct registry_arena *arena)
  new_chunk_size_bytes, 0);
  if (new_chunk != MAP_FAILED) {
  /* Should not have moved. */
-    assert(new_chunk == last_chunk);
+    urcu_posix_assert(new_chunk == last_chunk);
  memset((char *) last_chunk + old_chunk_size_bytes, 0,
  new_chunk_size_bytes - old_chunk_size_bytes);
  last_chunk->capacity = new_capacity;


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: Appropriate liburcu cache line size for Power

2024-03-28 Thread Mathieu Desnoyers

On 2024-03-25 16:34, Nathan Lynch wrote:

Mathieu Desnoyers  writes:

In the powerpc architecture support within the liburcu project [1]
we have a cache line size defined as 256 bytes with the following
comment:

/* Include size of POWER5+ L3 cache lines: 256 bytes */
#define CAA_CACHE_LINE_SIZE 256

I recently received a pull request on github [2] asking to
change this to 128 bytes. All the material provided supports
that the cache line sizes on powerpc are 128 bytes or less (even
L3 on POWER7, POWER8, and POWER9) [3].

I wonder where the 256 bytes L3 cache line size for POWER5+
we have in liburcu comes from, and I wonder if it's the right choice
for a cache line size on all powerpc, considering that the Linux
kernel cache line size appear to use 128 bytes on recent Power
architectures. I recall some benchmark experiments Paul and I did
on a 64-core 1.9GHz POWER5+ machine that benefited from a 256 bytes
cache line size, and I suppose this is why we came up with this
value, but I don't have the detailed specs of that machine.

Any feedback on this matter would be appreciated.


For what it's worth, I found a copy of an IBM Journal of Research &
Development article confirming that POWER5's L3 had a 256-byte line
size:

   Each slice [of the L3] is 12-way set-associative, with 4,096
   congruence classes of 256-byte lines managed as two 128-byte sectors
   to match the L2 line size.

https://www.eecg.utoronto.ca/~moshovos/ACA08/readings/power5.pdf

I don't know of any reason to prefer 256 over 128 for current Power
processors though.


Thanks for the pointer. I will add a reference to it in the liburcu
source code to explain the cache line size choice.

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [PATCH 2/2] ARC: mm: fix new code about cache aliasing

2024-03-28 Thread Mathieu Desnoyers

On 2024-03-28 01:39, Vineet Gupta wrote:

Manual/partial revert of 8690bbcf3b70 ("Introduce cpu_dcache_is_aliasing() across all architectures")

Current generation of ARCv2/ARCv3 based HSxx cores are only PIPT (to software
at least).

Legacy ARC700 cpus could be VIPT aliasing (based on cache geometry and
PAGE_SIZE) however recently that support was ripped out so VIPT aliasing
cache is not relevant to ARC anymore.

P.S. : This has been discussed a few times on lists [1]
P.S.2: Please CC the arch maintainers and/or mailing list before adding
such interfaces.


Because 8690bbcf3b70 was introducing a tree-wide change affecting all
architectures, I CC'd linux-a...@vger.kernel.org. I expected all
architecture maintainers to follow that list, which is relatively
low volume.

I'm sorry that you learn about this after the fact as a result.
My intent was to use the list rather than CC about 50 additional
people/mailing lists.

Of course, if VIPT aliasing is removed from ARC, removing the
config ARCH_HAS_CPU_CACHE_ALIASING and using the generic
cpu_dcache_is_aliasing() is the way to go. Feel free to add
my:

Acked-by: Mathieu Desnoyers 

Thanks,

Mathieu




[1] 
http://lists.infradead.org/pipermail/linux-snps-arc/2023-February/006899.html

Cc: Mathieu Desnoyers 
Signed-off-by: Vineet Gupta 
---
  arch/arc/Kconfig | 1 -
  arch/arc/include/asm/cachetype.h | 9 -
  2 files changed, 10 deletions(-)
  delete mode 100644 arch/arc/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 99d2845f3feb..4092bec198be 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,7 +6,6 @@
  config ARC
def_bool y
select ARC_TIMERS
-   select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
deleted file mode 100644
index 05fc7ed59712..
--- a/arch/arc/include/asm/cachetype.h
+++ /dev/null
@@ -1,9 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __ASM_ARC_CACHETYPE_H
-#define __ASM_ARC_CACHETYPE_H
-
-#include 
-
-#define cpu_dcache_is_aliasing()   true
-
-#endif


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


___
linux-snps-arc mailing list
linux-snps-arc@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-snps-arc


Re: Appropriate liburcu cache line size for Power

2024-03-26 Thread Mathieu Desnoyers

On 2024-03-26 03:19, Michael Ellerman wrote:

Mathieu Desnoyers  writes:

Hi,


Hi Mathieu,


In the powerpc architecture support within the liburcu project [1]
we have a cache line size defined as 256 bytes with the following
comment:

/* Include size of POWER5+ L3 cache lines: 256 bytes */
#define CAA_CACHE_LINE_SIZE 256

I recently received a pull request on github [2] asking to
change this to 128 bytes. All the material provided supports
that the cache line sizes on powerpc are 128 bytes or less (even
L3 on POWER7, POWER8, and POWER9) [3].

I wonder where the 256 bytes L3 cache line size for POWER5+
we have in liburcu comes from, and I wonder if it's the right choice
for a cache line size on all powerpc, considering that the Linux
kernel cache line size appear to use 128 bytes on recent Power
architectures. I recall some benchmark experiments Paul and I did
on a 64-core 1.9GHz POWER5+ machine that benefited from a 256 bytes
cache line size, and I suppose this is why we came up with this
value, but I don't have the detailed specs of that machine.

Any feedback on this matter would be appreciated.


The ISA doesn't specify the cache line size, other than it is smaller
than a page.

In practice all the 64-bit IBM server CPUs I'm aware of have used 128
bytes. There are some 64-bit CPUs that use 64 bytes, eg. pasemi PA6T and
Freescale e6500.

It is possible to discover at runtime via AUXV headers. But that's no
use if you want a compile-time constant.


Indeed, and this CAA_CACHE_LINE_SIZE is part of the liburcu powerpc ABI,
so changing this would require a soname bump, which I don't want to do
without really good reasons.



I'm happy to run some benchmarks if you can point me at what to run. I
had a poke around the repository and found short_bench, but it seemed to
run for a very long time.


I've created a dedicated test program for this, see:

https://github.com/compudj/userspace-rcu-dev/tree/false-sharing

example use:

On an AMD Ryzen 7 PRO 6850U with Radeon Graphics:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 21320
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 22657
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 47599
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 531364
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 523634
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 519402
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 520651
1..1

This points to false-sharing for strides smaller than
64 bytes. I get similar results (false-sharing under 64 bytes) with an
AMD EPYC 9654 96-Core Processor.

The test program runs 4 threads by default, which can be overridden
with "-t N". This may be needed if you want it to use all cores of
a larger machine. See "-h" for options.

On a POWER9 (architected), altivec supported:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 12264
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 12276
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 25638
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 39934
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 53971
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 53599
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 53962
1..1

This points at false-sharing below 128 bytes stride.

On a e6500, altivec supported, Model 2.0 (pvr 8040 0120)

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 9049
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 9054
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 18643
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 37417
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 37906
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 37870
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 37899
1..1

Which points at false-sharing below 64 bytes.

I prefer to be cautious about this cache line size value and aim for
a value which takes into account the largest known cache line size
for an architecture, rather than use one that is too small, given the
large overhead caused by false-sharing.

Feedback is welcome.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Appropriate liburcu cache line size for Power

2024-03-24 Thread Mathieu Desnoyers

Hi,

In the powerpc architecture support within the liburcu project [1]
we have a cache line size defined as 256 bytes with the following
comment:

/* Include size of POWER5+ L3 cache lines: 256 bytes */
#define CAA_CACHE_LINE_SIZE 256

I recently received a pull request on github [2] asking to
change this to 128 bytes. All the material provided supports
that the cache line sizes on powerpc are 128 bytes or less (even
L3 on POWER7, POWER8, and POWER9) [3].

I wonder where the 256 bytes L3 cache line size for POWER5+
we have in liburcu comes from, and whether it's the right choice
for a cache line size on all powerpc, considering that the Linux
kernel cache line size appears to be 128 bytes on recent Power
architectures. I recall some benchmark experiments Paul and I did
on a 64-core 1.9GHz POWER5+ machine that benefited from a 256-byte
cache line size, and I suppose this is why we came up with this
value, but I don't have the detailed specs of that machine.

Any feedback on this matter would be appreciated.

Thanks!

Mathieu

[1] https://liburcu.org
[2] https://github.com/urcu/userspace-rcu/pull/22
[3] https://www.7-cpu.com/


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


[lttng-dev] [RELEASE] LTTng-modules 2.12.16 and 2.13.12 (Linux kernel tracer)

2024-03-21 Thread Mathieu Desnoyers via lttng-dev

Hi,

This is a release announcement for the currently maintained
LTTng-modules Linux kernel tracer stable branches.

* New and noteworthy in these releases:

Linux kernel v6.8 is now supported by LTTng modules 2.13.12. If you need
support for recent kernels (v5.18+), you will need to upgrade to a
recent LTTng-modules 2.13.x.

Both releases correct issues with SLE kernel version ranges detection.

A compilation fix for RHEL 9.3 kernel is present in v2.13.12.

Feedback is welcome!

Thanks,

Mathieu

Project website: https://lttng.org
Documentation: https://lttng.org/docs
Download link: https://lttng.org/download

Detailed change logs:

2024-03-21 (National Common Courtesy Day) LTTng modules 2.13.12
* docs: Add supported versions and fix-backport policy
* docs: Add links to project resources
* Fix: Correct minimum version in jbd2 SLE kernel range
* Fix: Handle recent SLE major version codes
* Fix: build on sles15sp4
* Compile fixes for RHEL 9.3 kernels
* Fix: ext4_discard_preallocations changed in linux 6.8.0-rc3
* Fix: btrfs_get_extent flags and compress_type changed in linux 
6.8.0-rc1
* Fix: btrfs_chunk tracepoints changed in linux 6.8.0-rc1
* Fix: strlcpy removed in linux 6.8.0-rc1
* Fix: timer_start changed in linux 6.8.0-rc1
* Fix: sched_stat_runtime changed in linux 6.8.0-rc1

2024-03-21 (National Common Courtesy Day) 2.12.16
* fix: lttng-probe-kvm-x86-mmu build with linux 6.6
* docs: Add supported versions and fix-backport policy
* docs: Add links to project resources
* Fix: Correct minimum version in jbd2 SLE kernel range
* Fix: Handle recent SLE major version codes
* Fix: build on sles15sp4

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Mathieu Desnoyers

On 2024-03-20 13:58, Steven Rostedt wrote:

On Wed, 20 Mar 2024 13:15:39 -0400
Mathieu Desnoyers  wrote:


I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.


Is there any way you could move this to need_resched() rather than
sprinkle those everywhere ?


Because need_resched() itself does not mean it's going to schedule
immediately. I looked at a few locations that need_resched() is called.
Most are in idle code where the critical timings are already handled.

I'm not sure I'd add it for places like mm/memory.c or 
drivers/md/bcache/btree.c.

A lot of places look to use it more for PREEMPT_NONE situations as an
open-coded cond_resched().

The main reason this one is particularly an issue, is that it spins as long
as the owner is still running. Which may be some time, as here it was 7ms.


What I think we should be discussing here is how calling need_resched()
should interact with the latency tracked by critical timings.

AFAIU, when code explicitly calls need_resched() in a loop, there are
two cases:

- need_resched() returns false: This means the loop can continue without
  causing long latency on the system. Technically we could restart the
  critical timings at this point.

- need_resched() returns true: This means the loop should exit quickly
  and call the scheduler. I would not reset the critical timings there,
  as whatever code is executed between need_resched() returning true
  and calling the scheduler is adding to latency.

Having stop/start critical timings around idle loops seems to just be
an optimization over that.

As for mm and driver/md code, what is wrong with doing a critical
timings reset when need_resched() returns false ? It would prevent
a whole class of false-positives rather than playing whack-a-mole with
those that pop up.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [RFC][PATCH] tracing: Introduce restart_critical_timings()

2024-03-20 Thread Mathieu Desnoyers

On 2024-03-20 12:20, Steven Rostedt wrote:

From: Steven Rostedt (Google) 

I'm debugging some latency issues on a Chromebook and the preemptirqsoff
tracer hit this:

# tracer: preemptirqsoff
#
# preemptirqsoff latency trace v1.1.5 on 5.15.148-21853-g165fd2387469-dirty
# 
# latency: 7813 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
#-
#| task: init-1 (uid:0 nice:0 policy:0 rt_prio:0)
#-
#  => started at: rwsem_optimistic_spin
#  => ended at:   rwsem_optimistic_spin
#
#
#_--=> CPU#
#   / _-=> irqs-off
#  | / _=> need-resched
#  || / _---=> hardirq/softirq
#  ||| / _--=> preempt-depth
#   / _-=> migrate-disable
#  | / delay
#  cmd pid || time  |   caller
# \   /||  \|/
<...>-1 1...1.0us!: rwsem_optimistic_spin+0x20/0x194 
<-rwsem_optimistic_spin+0x20/0x194
<...>-1 1.N.1. 7813us : rwsem_optimistic_spin+0x140/0x194 
<-rwsem_optimistic_spin+0x140/0x194
<...>-1 1.N.1. 7814us+: tracer_preempt_on+0x4c/0x6a 
<-rwsem_optimistic_spin+0x140/0x194
<...>-1 1.N.1. 7824us : 
  => rwsem_optimistic_spin+0x140/0x194
  => rwsem_down_write_slowpath+0xc9/0x3fe
  => copy_process+0xd08/0x1b8a
  => kernel_clone+0x94/0x256
  => __x64_sys_clone+0x7a/0x9a
  => do_syscall_64+0x51/0xa1
  => entry_SYSCALL_64_after_hwframe+0x5c/0xc6

Which isn't a real issue, as it's in the rwsem_optimistic_spin() code that
spins with interrupts enabled, preempt disabled, and checks for
need_resched(). If it is true, it breaks out and schedules.

Hence, it's hiding real issues, because it can spin for a very long time
and this is not the source of the latency I'm tracking down.

I would like to introduce restart_critical_timings() and place it in
locations that have this behavior.


Is there any way you could move this to need_resched() rather than
sprinkle those everywhere ?

Thanks,

Mathieu



Signed-off-by: Steven Rostedt (Google) 

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 147feebd508c..e9f97f60bfc0 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -145,6 +145,13 @@ do {   \
  # define start_critical_timings() do { } while (0)
  #endif
  
+/* Used in spins that check need_resched() with preemption disabled */

+static inline void restart_critical_timings(void)
+{
+   stop_critical_timings();
+   start_critical_timings();
+}
+
  #ifdef CONFIG_DEBUG_IRQFLAGS
  extern void warn_bogus_irq_restore(void);
  #define raw_check_bogus_irq_restore() \
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2340b6d90ec6..fa7b43e53d20 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -27,6 +27,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  
  #ifndef CONFIG_PREEMPT_RT

@@ -780,6 +781,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem)
 */
barrier();
  
+		restart_critical_timings();

if (need_resched() || !owner_on_cpu(owner)) {
state = OWNER_NONSPINNABLE;
break;
@@ -912,6 +914,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 * a writer, need_resched() check needs to be done here.
 */
if (owner_state != OWNER_WRITER) {
+   restart_critical_timings();
if (need_resched())
        break;
if (rt_task(current) &&


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Mathieu Desnoyers

On 2024-03-07 14:47, Paul E. McKenney wrote:

On Thu, Mar 07, 2024 at 08:53:05AM -0500, Mathieu Desnoyers wrote:

On 2024-03-06 22:37, Paul E. McKenney wrote:

On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote:

[...]



As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern
is concerned, the only valid use-case I can think of is
split counters or RCU implementations where there is a
single updater doing the increment, and one or more
concurrent reader threads that need to snapshot a
consistent value with READ_ONCE().



[...]


So what would you use that pattern for?

One possibility is a per-CPU statistical counter in userspace on a
fastpath, in cases where losing the occasional count is OK.  Then learning
your CPU (and possibly being immediately migrated to some other CPU),
READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not)
make sense.

I suppose the same in the kernel if there was a fastpath so extreme you
could not afford to disable preemption.

At best, very niche.

Or am I suffering a failure of imagination yet again?  ;-)


The (niche) use-cases I have in mind are split-counters and RCU
grace period tracking, where precise counters totals are needed
(no lost count).

In the kernel, this could be:


Thank you for looking into this!


- A per-cpu counter, each counter incremented from thread context with
   preemption disabled (single updater per counter), read concurrently by
   other threads. WRITE_ONCE/READ_ONCE is useful to make sure there
   is no store/load tearing there. Atomics on the update would be stronger
   than necessary on the increment fast-path.


But if preemption is disabled, the updater can read the value without
READ_ONCE() without risk of concurrent update.  Or are you concerned about
interrupt handlers?  This would have to be a read from the interrupt
handler, given that an updated from the interrupt handler could result
in a lost count.


You are correct that the updater doesn't need READ_ONCE() there. It
does, however, require a WRITE_ONCE() to match the READ_ONCE() from
concurrent reader threads.




- A per-thread counter (e.g. within task_struct), only incremented by the
   single thread, read by various threads concurrently.


Ditto.


Right, only WRITE_ONCE() on the single updater, READ_ONCE() on readers.




- A counter which increment happens to be already protected by a lock, read
   by various threads without taking the lock. (technically doable, but
   I'm not sure I see a relevant use-case for it)


In that case, the lock would exclude concurrent updates, so the lock
holder would not need READ_ONCE(), correct?


Correct.




In user-space:

- The "per-cpu" counter would have to use rseq for increments to prevent
   inopportune migrations, which needs to be implemented in assembler anyway.
   The counter reads would have to use READ_ONCE().


Fair enough!


- The per-thread counter (Thread-Local Storage) incremented by a single
   thread, read by various threads concurrently, is a good target
   for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in
   various liburcu implementations which track read-side critical sections
   per-thread.


Agreed, but do any of these use WRITE_ONCE(x, READ_ONCE(x) + 1) or
similar?


Not quite, I recall it's more like WRITE_ONCE(x, READ_ONCE(y)) or such,
so we can grab the value of the current gp counter and store it into a
TLS variable.





- Same as for the kernel, a counter increment protected by a lock which
   needs to be read from various threads concurrently without taking
   the lock could be a valid use-case, though I fail to see how it is
   useful due to lack of imagination on my part. ;-)


In RCU, we have "WRITE_ONCE(*sp, *sp + 1)" for this use case, though
here we have the WRITE_ONCE() but not the READ_ONCE() because we hold
the lock excluding any other updates.


You are right, the READ_ONCE() is not needed in this case for the
updater, only for the concurrent readers.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-07 Thread Mathieu Desnoyers

On 2024-03-06 22:37, Paul E. McKenney wrote:

On Wed, Mar 06, 2024 at 10:06:21PM -0500, Mathieu Desnoyers wrote:

[...]



As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern
is concerned, the only valid use-case I can think of is
split counters or RCU implementations where there is a
single updater doing the increment, and one or more
concurrent reader threads that need to snapshot a
consistent value with READ_ONCE().



[...]


So what would you use that pattern for?

One possibility is a per-CPU statistical counter in userspace on a
fastpath, in cases where losing the occasional count is OK.  Then learning
your CPU (and possibly being immediately migrated to some other CPU),
READ_ONCE() of the count, increment, and WRITE_ONCE() might (or might not)
make sense.

I suppose the same in the kernel if there was a fastpath so extreme you
could not afford to disable preemption.

At best, very niche.

Or am I suffering a failure of imagination yet again?  ;-)


The (niche) use-cases I have in mind are split-counters and RCU
grace period tracking, where precise counters totals are needed
(no lost count).

In the kernel, this could be:

- A per-cpu counter, each counter incremented from thread context with
  preemption disabled (single updater per counter), read concurrently by
  other threads. WRITE_ONCE/READ_ONCE is useful to make sure there
  is no store/load tearing there. Atomics on the update would be stronger
  than necessary on the increment fast-path.

- A per-thread counter (e.g. within task_struct), only incremented by the
  single thread, read by various threads concurrently.

- A counter which increment happens to be already protected by a lock, read
  by various threads without taking the lock. (technically doable, but
  I'm not sure I see a relevant use-case for it)

In user-space:

- The "per-cpu" counter would have to use rseq for increments to prevent
  inopportune migrations, which needs to be implemented in assembler anyway.
  The counter reads would have to use READ_ONCE().

- The per-thread counter (Thread-Local Storage) incremented by a single
  thread, read by various threads concurrently, is a good target
  for WRITE_ONCE()/READ_ONCE() pairing. This is actually what we do in
  various liburcu implementations which track read-side critical sections
  per-thread.

- Same as for the kernel, a counter increment protected by a lock which
  needs to be read from various threads concurrently without taking
  the lock could be a valid use-case, though I fail to see how it is
  useful due to lack of imagination on my part. ;-)

I'm possibly missing other use-cases, but those come to mind as not
involving racy counter increments.

I agree that use-cases are so niche that we probably do _not_ want to
introduce ADD_SHARED() for that purpose in a common header file,
because I suspect that it would then become misused in plenty of
scenarios where the updates are actually racy and would be better
served by atomics or local-atomics.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] rcutorture: Fix rcu_torture_pipe_update_one()/rcu_torture_writer() data race and concurrency bug

2024-03-06 Thread Mathieu Desnoyers

On 2024-03-06 21:43, Linus Torvalds wrote:
[...]


Honestly, this all makes me think that we'd be *much* better off
showing the real "handoff" with smp_store_release() and
smp_load_acquire().


We've done something similar in liburcu (userspace code) to allow
Thread Sanitizer to understand the happens-before relationships
within the RCU implementations and lock-free data structures.

Moving to load-acquire/store-release (C11 model in our case)
allowed us to provide enough happens-before relationship for
Thread Sanitizer to understand what is happening under the
hood in liburcu and perform relevant race detection of user
code.

As far as the WRITE_ONCE(x, READ_ONCE(x) + 1) pattern
is concerned, the only valid use-case I can think of is
split counters or RCU implementations where there is a
single updater doing the increment, and one or more
concurrent reader threads that need to snapshot a
consistent value with READ_ONCE().

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v2] tracing: Limit trace_marker writes to just 4K

2024-03-05 Thread Mathieu Desnoyers

On 2024-03-04 22:34, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

Limit the max print event of trace_marker to just a 4K string size. This
must also be less than the amount that can be held by a trace_seq along
with the text that is before the output (like the task name, PID, CPU,
state, etc.). As trace_seq is made to handle large events (some greater
than 4K), make the max size of a trace_marker write event 4K, which is
guaranteed to fit in the trace_seq buffer.

Suggested-by: Mathieu Desnoyers 


From my perspective I only attempted to clarify the point Linus made
about limiting the trace_marker input to 4kB. Feel free to adapt the
Suggested-by tag accordingly.

Reviewed-by: Mathieu Desnoyers 

Thanks,

Mathieu


Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20240304192710.4c996...@gandalf.local.home/

- Just make the max limit 4K and not half of the trace_seq size.
   The trace_seq is already made to handle events greater than 4k.

  kernel/trace/trace.c | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..d16b95ca58a7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7293,6 +7293,8 @@ tracing_free_buffer_release(struct inode *inode, struct 
file *filp)
return 0;
  }
  
+#define TRACE_MARKER_MAX_SIZE		4096

+
  static ssize_t
  tracing_mark_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *fpos)
@@ -7320,6 +7322,9 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
if ((ssize_t)cnt < 0)
return -EINVAL;
  
+	if (cnt > TRACE_MARKER_MAX_SIZE)

+   cnt = TRACE_MARKER_MAX_SIZE;
+
meta_size = sizeof(*entry) + 2;  /* add '\0' and possible '\n' */
   again:
size = cnt + meta_size;
@@ -7328,11 +7333,6 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
if (cnt < FAULTED_SIZE)
size += FAULTED_SIZE - cnt;
  
-	if (size > TRACE_SEQ_BUFFER_SIZE) {

-   cnt -= size - TRACE_SEQ_BUFFER_SIZE;
-   goto again;
-   }
-
buffer = tr->array_buffer.buffer;
event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
tracing_gen_ctx());


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE

2024-03-04 Thread Mathieu Desnoyers

On 2024-03-04 21:37, Steven Rostedt wrote:

On Mon, 4 Mar 2024 21:35:38 -0500
Steven Rostedt  wrote:


And it's not for debugging, it's for validation of assumptions
made about an upper bound limit defined for a compile-time
check, so as the code evolves issues are caught early.


validating is debugging.


Did Linus put you up to this? To test me to see if I'm learning how to say "No" 
;-)


No, he did not. I genuinely think that validating size limits like
this either at compile time or, when they can vary at runtime like
in this case, with a dynamic check, decreases the cognitive
load on the reviewers. We can then assume that whatever limit
was put in place is actually enforced and not just wishful
thinking.

If the "header" size upper bound is not validated at runtime, there
is not much point in adding the BUILD_BUG_ON() based on that value
in the first place, and you should then just add a runtime check that
you don't overflow the output buffer before writing the output to it.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE

2024-03-04 Thread Mathieu Desnoyers

On 2024-03-04 20:59, Steven Rostedt wrote:

On Mon, 4 Mar 2024 20:42:39 -0500
Mathieu Desnoyers  wrote:


#define TRACE_OUTPUT_META_DATA_MAX_LEN  80

and a runtime check in the code generating this header.

This would avoid adding an unchecked upper limit.


That would be a lot of complex code that is for debugging something that
has never happened in the past 16 years and Linus has already reprimanded
me on doing such things.


Is it more complex than remembering the iterator position in
print_trace_fmt() right before:

if (tr->trace_flags & TRACE_ITER_CONTEXT_INFO) {
if (iter->iter_flags & TRACE_FILE_LAT_FMT)
trace_print_lat_context(iter);
else
trace_print_context(iter);
}

and then checking just after that the offset is not beyond 128
bytes ? Perhaps there is even something internal to "iter"
that already knows about this offset (?).

This would catch any context going beyond its planned upper
limit early. Failing early and not just in rare corner cases
is good.

And it's not for debugging, it's for validation of assumptions
made about an upper bound limit defined for a compile-time
check, so as the code evolves issues are caught early.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE

2024-03-04 Thread Mathieu Desnoyers

On 2024-03-04 20:41, Steven Rostedt wrote:

On Mon, 4 Mar 2024 20:35:16 -0500
Steven Rostedt  wrote:


BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > 
TRACE_SEQ_SIZE);


That's not the meta size I'm worried about. The sizeof(meta data) is the
raw event binary data, which is not related to the size of the event output.

  # cd /sys/kernel/tracing
  # echo hello > trace_marker
  # cat trace
[..]
<...>-999 [001] .  2296.140373: tracing_mark_write: hello
^^^
This is the meta data that is added to trace_seq


That said, the meta data is most likely not going to be more than 128 bytes
(it shouldn't be more than 80).

I could do as you suggest and create a separate TRACE_MARKER_SIZE and just
make sure that it's less than TRACE_SEQ_BUFFER_SIZE (as that's the size of
the content) by 128 bytes.

/* Added meta data should not be more than 128 bytes */
BUILD_BUG_ON((TRACE_MARKER_MAX_SIZE + 128) > TRACE_SEQ_BUFFER_SIZE);


Bonus points if you add

#define TRACE_OUTPUT_META_DATA_MAX_LEN  80

and a runtime check in the code generating this header.

This would avoid adding an unchecked upper limit.

Thanks,

Mathieu



-- Steve


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE

2024-03-04 Thread Mathieu Desnoyers

On 2024-03-04 20:35, Steven Rostedt wrote:

On Mon, 4 Mar 2024 20:15:57 -0500
Mathieu Desnoyers  wrote:


On 2024-03-04 19:27, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

Since the size of trace_seq's buffer is the max an event can output, have
the trace_marker be half of the entire TRACE_SEQ_SIZE, which is 4K. That
will keep writes that have meta data added from being dropped (but
reported) because the total output of the print event is greater than
what the trace_seq can hold.


Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2"
seems backwards. It's basically using a define of the maximum
buffer size for the pretty-printing output and defining the maximum
input size of a system call to half of that.

I'd rather see, in a header file shared between tracing mark
write implementation and output implementation:

#define TRACING_MARK_MAX_SIZE   4096

and then a static validation that this input fits within your
pretty printing output in the output implementation file:

BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > 
TRACE_SEQ_SIZE);


That's not the meta size I'm worried about. The sizeof(meta data) is the
raw event binary data, which is not related to the size of the event output.

  # cd /sys/kernel/tracing
  # echo hello > trace_marker
  # cat trace
[..]
<...>-999 [001] .  2296.140373: tracing_mark_write: hello
^^^
This is the meta data that is added to trace_seq


If this header has a known well-defined upper-limit length, then use
that in the BUILD_BUG_ON().

Thanks,

Mathieu



-- Steve
--

Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] tracing: Have trace_marker writes be just half of TRACE_SEQ_SIZE

2024-03-04 Thread Mathieu Desnoyers

On 2024-03-04 19:27, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

Since the size of trace_seq's buffer is the max an event can output, have
the trace_marker be half of the entire TRACE_SEQ_SIZE, which is 4K. That
will keep writes that have meta data added from being dropped (but
reported) because the total output of the print event is greater than
what the trace_seq can hold.


Defining the trace_mark limit in terms of "TRACE_SEQ_SIZE / 2"
seems backwards. It's basically using a define of the maximum
buffer size for the pretty-printing output and defining the maximum
input size of a system call to half of that.

I'd rather see, in a header file shared between tracing mark
write implementation and output implementation:

#define TRACING_MARK_MAX_SIZE   4096

and then a static validation that this input fits within your
pretty printing output in the output implementation file:

BUILD_BUG_ON(TRACING_MARK_MAX_SIZE + sizeof(meta data stuff...) > 
TRACE_SEQ_SIZE);

This way we clearly document that the tracing mark write
input limit is 4kB, rather than something derived from
the size of an output buffer.

Thanks,

Mathieu



Signed-off-by: Steven Rostedt (Google) 
---
  kernel/trace/trace.c | 16 +++-
  1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..d68544aef65f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7320,6 +7320,17 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
if ((ssize_t)cnt < 0)
return -EINVAL;
  
+	/*

+* TRACE_SEQ_SIZE is the total size of trace_seq buffer used
+* for output. As the print event outputs more than just
+* the string written, keep it smaller than the trace_seq
+* as it could drop the event if the extra data makes it bigger
+* than what the trace_seq can hold. Half the TRACE_SEQ_SIZE
+* is more than enough.
+*/
+   if (cnt > TRACE_SEQ_SIZE / 2)
+   cnt = TRACE_SEQ_SIZE / 2;
+
meta_size = sizeof(*entry) + 2;  /* add '\0' and possible '\n' */
   again:
size = cnt + meta_size;
@@ -7328,11 +7339,6 @@ tracing_mark_write(struct file *filp, const char __user 
*ubuf,
if (cnt < FAULTED_SIZE)
size += FAULTED_SIZE - cnt;
  
-	if (size > TRACE_SEQ_BUFFER_SIZE) {

-   cnt -= size - TRACE_SEQ_BUFFER_SIZE;
-   goto again;
-   }
-
buffer = tr->array_buffer.buffer;
event = __trace_buffer_lock_reserve(buffer, TRACE_PRINT, size,
        tracing_gen_ctx());


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] tracing: Remove precision vsnprintf() check from print event

2024-03-04 Thread Mathieu Desnoyers

On 2024-03-04 17:43, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

This reverts 60be76eeabb3d ("tracing: Add size check when printing
trace_marker output"). The only reason the precision check was added
was because of a bug that miscalculated the write size of the string into
the ring buffer and it truncated it removing the terminating nul byte. On
reading the trace it crashed the kernel. But this was due to the bug in
the code that happened during development and should never happen in
practice. If anything, the precision can hide bugs where the string in the
ring buffer isn't nul terminated and it will not be checked.

Link: 
https://lore.kernel.org/all/c7e7af1a-d30f-4d18-b8e5-af1ef5800...@linux.ibm.com/
Link: 
https://lore.kernel.org/linux-trace-kernel/20240227125706.04279...@gandalf.local.home
Link: https://lore.kernel.org/all/20240302111244.3a167...@gandalf.local.home/

Reported-by: Sachin Sant 
Fixes: 60be76eeabb3d ("tracing: Add size check when printing trace_marker 
output")
Signed-off-by: Steven Rostedt (Google) 


This is a step in the right direction IMHO.

Reviewed-by: Mathieu Desnoyers 

Just out of curiosity, is there anything to prevent trace_marker from
writing a huge string into the ring buffer in the first place? Is this
limit implicit and based on the page size of the architecture, or is it
a known fixed limit? (e.g. 4 kB strings).

It appears to currently be limited by

#define TRACE_SEQ_BUFFER_SIZE   (PAGE_SIZE * 2 - \
(sizeof(struct seq_buf) + sizeof(size_t) + sizeof(int)))

checked within tracing_mark_write().

I would have hoped for a simpler limit (e.g. 4kB) consistent across
architectures. But that would belong to a separate change.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] ring-buffer: use READ_ONCE() to read cpu_buffer->commit_page in concurrent environment

2024-03-01 Thread Mathieu Desnoyers

On 2024-03-01 10:49, Steven Rostedt wrote:

On Fri, 1 Mar 2024 13:37:18 +0800
linke  wrote:


So basically you are worried about read-tearing?

That wasn't mentioned in the change log.


Yes. Sorry for the confusion; I am not very familiar with this and am
still learning.


No problem. We all have to learn this anyway.




Funny part is, if the above timestamp read did a tear, then this would
definitely not match, and would return the correct value. That is, the
buffer is not empty because the only way for this to get corrupted is if
something is in the process of writing to it.


I agree with you here.

commit = rb_page_commit(commit_page);

But if commit_page above is the result of a torn read, the commit field
read by rb_page_commit() may not represent a valid value.


But commit_page is a word length, and I will argue that any compiler that
tears "long" words is broken. ;-)


[ For those tuning in, we are discussing ring_buffer_iter_empty()
  "commit_page = cpu_buffer->commit_page;" racy load. ]

I counter-argue that real-world compilers *are* broken based on your
personal definition, but we have to deal with them, as documented
in Documentation/memory-barriers.txt (see below).

What is the added overhead of using a READ_ONCE() there? Why are
we wasting effort trying to guess the compiler behavior if the
real-world performance impact is insignificant?

Quote from memory-barriers.txt explaining the purpose of {READ,WRITE}_ONCE():

"(*) For aligned memory locations whose size allows them to be accessed
 with a single memory-reference instruction, prevents "load tearing"
 and "store tearing," in which a single large access is replaced by
 multiple smaller accesses."

I agree that {READ,WRITE}_ONCE() are really not needed at initialization,
when there are demonstrably no concurrent accesses to the data.

But trying to eliminate {READ,WRITE}_ONCE() on concurrently accessed fields
just adds complexity, prevents static analyzers from properly understanding
the code and reporting issues, and obfuscates the code.

Thanks,

Mathieu





In this case, READ_ONCE() is only needed for the commit_page.


But we can at least keep the READ_ONCE() on the commit_page just because it
is used in the next instruction.

-- Steve


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [lttng-dev] [PATCH] coredump debugging: add a tracepoint to report the coredumping

2024-02-23 Thread Mathieu Desnoyers via lttng-dev

On 2024-02-23 09:26, Steven Rostedt wrote:

On Mon, 19 Feb 2024 13:01:16 -0500
Mathieu Desnoyers  wrote:


Between "sched_process_exit" and "sched_process_free", the task can still be
observed by a trace analysis looking at sched and signal events: it's a zombie 
at
that stage.


Looking at the history of this tracepoint, it was added in 2008 by commit
0a16b60758433 ("tracing, sched: LTTng instrumentation - scheduler").
Hmm, LLTng? I wonder who the author was?


[ common typo: LLTng -> LTTng ;-) ]



   Author: Mathieu Desnoyers 

  :-D

Mathieu, I would say it's your call on where the tracepoint can be located.
You added it, you own it!


Wow! that's now 16 years ago :)

I've checked with Matthew Khouzam (maintainer of Trace Compass),
which cares about this tracepoint, and we have not identified any
significant impact of moving it on its model of the scheduler, other
than slightly changing its timing.

I've also checked quickly in lttng-analyses and have not found
any code that cares about its specific placement.

So I would say go ahead and move it earlier in do_exit(), it's
fine by me.

If you are interested in a bit of archeology, "sched_process_free"
originated from my ltt-experimental 0.1.99.13 kernel patch against
2.6.12-rc4-mm2 back in September 2005 (that's 19 years ago). It was
a precursor to the LTTng 0.x kernel patchset.

https://lttng.org/files/ltt-experimental/patch-2.6.12-rc4-mm2-ltt-exp-0.1.99.13.gz

Index: kernel/exit.c
===
--- a/kernel/exit.c (.../trunk/kernel/linux-2.6.12-rc4-mm2) (revision 41)
+++ b/kernel/exit.c (.../branches/mathieu/linux-2.6.12-rc4-mm2) 
(revision 41)
@@ -4,6 +4,7 @@
  *  Copyright (C) 1991, 1992  Linus Torvalds
  */
 
+#include 

 #include 
 #include 
 #include 
@@ -55,6 +56,7 @@ static void __unhash_process(struct task
}
 
 	REMOVE_LINKS(p);

+  trace_process_free(p->pid);
 }
 
 void release_task(struct task_struct * p)

@@ -832,6 +834,8 @@ fastcall NORET_TYPE void do_exit(long co
}
exit_mm(tsk);
 
+	trace_process_exit(tsk->pid);

+
exit_sem(tsk);
__exit_files(tsk);
__exit_fs(tsk);

This was a significant improvement over the prior LTT which only
had the equivalent of "sched_process_exit", which caused issues
with the Linux scheduler model in LTTV due to zombie processes.

Here is where it appeared in LTT back in 1999:

http://www.opersys.com/ftp/pub/LTT/TracePackage-0.9.0.tgz

patch-ltt-2.2.13-991118

diff -urN linux/kernel/exit.c linux-2.2.13/kernel/exit.c
--- linux/kernel/exit.c Tue Oct 19 20:14:02 1999
+++ linux-2.2.13/kernel/exit.c  Sun Nov  7 23:49:17 1999
@@ -14,6 +14,8 @@
 #include 
 #endif
 
+#include 

+
 #include 
 #include 
 #include 
@@ -386,6 +388,8 @@
del_timer(>real_timer);
end_bh_atomic();
 
+	TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0);

+
lock_kernel();
 fake_volatile:
 #ifdef CONFIG_BSD_PROCESS_ACCT

And it was moved to its current location (after exit_mm()) a bit
later (2001):

http://www.opersys.com/ftp/pub/LTT/TraceToolkit-0.9.5pre2.tgz

Patches/patch-ltt-linux-2.4.5-vanilla-010909-1.10

diff -urN linux/kernel/exit.c /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c
--- linux/kernel/exit.c Fri May  4 17:44:06 2001
+++ /ext2/home/karym/kernel/linux-2.4.5/kernel/exit.c   Wed Jun 20 12:39:24 2001
@@ -14,6 +14,8 @@
 #include 
 #endif
 
+#include 

+
 #include 
 #include 
 #include 
@@ -439,6 +441,8 @@
 #endif
__exit_mm(tsk);
 
+	TRACE_PROCESS(TRACE_EV_PROCESS_EXIT, 0, 0);

+
lock_kernel();
sem_exit();
__exit_files(tsk);

So this sched_process_exit placement was actually decided
by Karim Yaghmour back in the LTT days (2001). I don't think
he will mind us moving it around some 23 years later. ;)

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

___
lttng-dev mailing list
lttng-dev@lists.lttng.org
https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev


Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop

2024-02-20 Thread Mathieu Desnoyers

On 2024-02-20 09:19, Steven Rostedt wrote:

On Mon, 19 Feb 2024 18:20:32 -0500
Steven Rostedt  wrote:


Instead of using local_add_return() to reserve the ring buffer data,
Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the
reservation with the time keeping code.

Although, it does not get rid of the double time stamps (before_stamp and
write_stamp), using cmpxchg() does get rid of the more complex case when
an interrupting event occurs between getting the timestamps and reserving
the data, as when that happens, it just tries again instead of dealing
with it.

Before we had:

w = local_read(_page->write);
/* get time stamps */
write = local_add_return(length, _page->write);
if (write - length == w) {
/* do simple case */
} else {
/* do complex case */
}

By switching the local_add_return() to a local_try_cmpxchg() it can now be:

 w = local_read(_page->write);
  again:
/* get time stamps */
if (!local_try_cmpxchg(_page->write, , w + length))
goto again;

 /* do simple case */


Something about this logic is causing __rb_reserve_next() to sometimes get
stuck always returning -EAGAIN, triggering the:

 RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)

Which disables the ring buffer.

I'm not sure what it is, but until I do, I'm removing the patch from my
queue.


Try resetting the info->add_timestamp flags to add_ts_default on goto again
within __rb_reserve_next().

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] dax: add set_dax_nomc() and set_dax_nocache() stub helpers

2024-02-16 Thread Mathieu Desnoyers

On 2024-02-16 15:22, Arnd Bergmann wrote:

From: Arnd Bergmann 

In some randconfig builds, the IS_ERR() check appears to not get completely
eliminated, resulting in the compiler to insert references to these two
functions that cause a link failure:

ERROR: modpost: "set_dax_nocache" [drivers/md/dm-mod.ko] undefined!
ERROR: modpost: "set_dax_nomc" [drivers/md/dm-mod.ko] undefined!

Add more stub functions for the dax-disabled case here to make it build again.


Hi Arnd,

Note that this is a duplicate of:

https://lore.kernel.org/lkml/20240215144633.96437-2-mathieu.desnoy...@efficios.com/

now present in Andrew's tree.

The only differences are the subject, commit message and a newline between 
"set_dax_nomc"
and "set_dax_synchronous" in your change.

Thanks,

Mathieu



Fixes: d888f6b0a766 ("dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal")
Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202402160420.e4qkwogo-...@intel.com/
Signed-off-by: Arnd Bergmann 
---
  include/linux/dax.h | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index df2d52b8a245..4527c10016fb 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -64,6 +64,9 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc);
  bool dax_write_cache_enabled(struct dax_device *dax_dev);
  bool dax_synchronous(struct dax_device *dax_dev);
  void set_dax_synchronous(struct dax_device *dax_dev);
+void set_dax_nocache(struct dax_device *dax_dev);
+void set_dax_nomc(struct dax_device *dax_dev);
+
  size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i);
  /*
@@ -108,6 +111,12 @@ static inline bool dax_synchronous(struct dax_device 
*dax_dev)
  static inline void set_dax_synchronous(struct dax_device *dax_dev)
  {
  }
+static inline void set_dax_nocache(struct dax_device *dax_dev)
+{
+}
+static inline void set_dax_nomc(struct dax_device *dax_dev)
+{
+}
  static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
struct dax_device *dax_dev)
  {
@@ -120,9 +129,6 @@ static inline size_t dax_recovery_write(struct dax_device 
*dax_dev,
  }
  #endif
  
-void set_dax_nocache(struct dax_device *dax_dev);

-void set_dax_nomc(struct dax_device *dax_dev);
-
  struct writeback_control;
  #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
  int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk);


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v2] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-16 Thread Mathieu Desnoyers

On 2024-02-15 09:43, Mathieu Desnoyers wrote:

Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:


Hi Andrew,

I notice that you should update the patch you have in your tree
(https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/nvdimm-pmem-fix-leak-on-dax_add_host-failure.patch)
with this updated version which includes additional Reviewed-by tags and
removes unneeded context that appears to be taken from the previous cover
letter.

Following your request, I have extracted this patch from the series.

Thanks,

Mathieu



out_cleanup_dax:
 kill_dax(pmem->dax_dev);
 put_dax(pmem->dax_dev);

Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Reviewed-by: Dave Jiang 
Reviewed-by: Fan Ni 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
Changes since v1:
- Add Reviewed-by tags.
---
  drivers/nvdimm/pmem.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..9fe358090720 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
set_dax_nomc(dax_dev);
if (is_nvdimm_sync(nd_region))
set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
rc = dax_add_host(dax_dev, disk);
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
-   pmem->dax_dev = dax_dev;
-
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




[PATCH v6 8/9] Introduce cpu_dcache_is_aliasing() across all architectures

2024-02-15 Thread Mathieu Desnoyers
Introduce a generic way to query whether the data cache is virtually
aliased on all architectures. Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g. FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios, depending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
   state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
 commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to 
PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".

Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 include/linux/cacheinfo.h   |  6 ++
 mm/Kconfig  |  6 ++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..7d294a3242a4 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@
 config ARC
def_bool y
select ARC_TIMERS
+   select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index ..05fc7ed59712
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include 
+
+#define cpu_dcache_is_aliasing()   true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..cd13b1788973 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ conf

[PATCH v6 9/9] dax: Fix incorrect list of data cache aliasing architectures

2024-02-15 Thread Mathieu Desnoyers
commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
prevents DAX from building on architectures with virtually aliased
dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 don't have virtually aliased
dcaches), and also misses many other architectures with virtually
aliased data cache.

This is a regression introduced in the v4.0 Linux kernel where the
dax mount option is removed for 32-bit ARMv7 boards which have no data
cache aliasing, and therefore should work fine with FS_DAX.

This was turned into the following check in alloc_dax() by a preparatory
change:

if (ops && (IS_ENABLED(CONFIG_ARM) ||
IS_ENABLED(CONFIG_MIPS) ||
IS_ENABLED(CONFIG_SPARC)))
return NULL;

Use cpu_dcache_is_aliasing() instead to figure out whether the environment
has aliasing data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ce5bffa86bba..a21a7c262382 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "dax-private.h"
 
 /**
@@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
 * except for device-dax (NULL operations pointer), which does
 * not use aliased mappings from the kernel.
 */
-   if (ops && (IS_ENABLED(CONFIG_ARM) ||
-   IS_ENABLED(CONFIG_MIPS) ||
-   IS_ENABLED(CONFIG_SPARC)))
+   if (ops && cpu_dcache_is_aliasing())
return ERR_PTR(-EOPNOTSUPP);
 
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
-- 
2.39.2




[PATCH v6 7/9] dax: Check for data cache aliasing at runtime

2024-02-15 Thread Mathieu Desnoyers
Replace the following fs/Kconfig:FS_DAX dependency:

  depends on !(ARM || MIPS || SPARC)

By a runtime check within alloc_dax(). This runtime check returns
ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means
the kernel is using an aliased mapping) on an architecture which
has data cache aliasing.

Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
CONFIG_DAX=n for consistency.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 10 ++
 fs/Kconfig  |  1 -
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 205b888d45bf..ce5bffa86bba 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -450,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
dev_t devt;
int minor;
 
+   /*
+* Unavailable on architectures with virtually aliased data caches,
+* except for device-dax (NULL operations pointer), which does
+* not use aliased mappings from the kernel.
+*/
+   if (ops && (IS_ENABLED(CONFIG_ARM) ||
+   IS_ENABLED(CONFIG_MIPS) ||
+   IS_ENABLED(CONFIG_SPARC)))
+   return ERR_PTR(-EOPNOTSUPP);
+
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
return ERR_PTR(-EINVAL);
 
diff --git a/fs/Kconfig b/fs/Kconfig
index 42837617a55b..e5efdb3b276b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -56,7 +56,6 @@ endif # BLOCK
 config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
-   depends on !(ARM || MIPS || SPARC)
depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
-- 
2.39.2




[PATCH v6 5/9] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure

2024-02-15 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dcssblk
dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures.

Considering that s390 is not a data cache aliasing architecture,
and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP
from alloc_dax() should make dcssblk_add_store() fail.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Acked-by: Heiko Carstens 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 drivers/s390/block/dcssblk.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 4b7ecd4fd431..f363c1d51d9a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+   struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
if (rc)
goto put_dev;
 
-   dev_info->dax_dev = alloc_dax(dev_info, _dax_ops);
-   if (IS_ERR(dev_info->dax_dev)) {
-   rc = PTR_ERR(dev_info->dax_dev);
-   dev_info->dax_dev = NULL;
+   dax_dev = alloc_dax(dev_info, _dax_ops);
+   if (IS_ERR(dax_dev)) {
+   rc = PTR_ERR(dax_dev);
goto put_dev;
}
-   set_dax_synchronous(dev_info->dax_dev);
+   set_dax_synchronous(dax_dev);
+   dev_info->dax_dev = dax_dev;
rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
if (rc)
goto out_dax;
-- 
2.39.2




[PATCH v6 6/9] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-15 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of virtio
virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
non-fatal.

Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 fs/fuse/virtio_fs.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce..a28466c2da71 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "fuse_i.h"
 
@@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
put_dax(dax_dev);
 }
 
+DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) 
virtio_fs_cleanup_dax(_T))
+
 static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs 
*fs)
 {
+   struct dax_device *dax_dev __free(cleanup_dax) = NULL;
struct virtio_shm_region cache_reg;
struct dev_pagemap *pgmap;
bool have_cache;
@@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
if (!IS_ENABLED(CONFIG_FUSE_DAX))
return 0;
 
+   dax_dev = alloc_dax(fs, _fs_dax_ops);
+   if (IS_ERR(dax_dev)) {
+   int rc = PTR_ERR(dax_dev);
+   return rc == -EOPNOTSUPP ? 0 : rc;
+   }
+
/* Get cache region */
have_cache = virtio_get_shm_region(vdev, _reg,
   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
@@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
dev_dbg(>dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 
0x%llx\n",
__func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
 
-   fs->dax_dev = alloc_dax(fs, _fs_dax_ops);
-   if (IS_ERR(fs->dax_dev))
-   return PTR_ERR(fs->dax_dev);
-
+   fs->dax_dev = no_free_ptr(dax_dev);
return devm_add_action_or_reset(>dev, virtio_fs_cleanup_dax,
fs->dax_dev);
 }
-- 
2.39.2




[PATCH v6 3/9] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-15 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

[ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9fe358090720..e9898457a7bd 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
dax_dev = alloc_dax(pmem, _dax_ops);
if (IS_ERR(dax_dev)) {
rc = PTR_ERR(dax_dev);
-   goto out;
+   if (rc != -EOPNOTSUPP)
+   goto out;
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   if (is_nvdimm_sync(nd_region))
+   set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
+   rc = dax_add_host(dax_dev, disk);
+   if (rc)
+   goto out_cleanup_dax;
+   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
}
-   set_dax_nocache(dax_dev);
-   set_dax_nomc(dax_dev);
-   if (is_nvdimm_sync(nd_region))
-   set_dax_synchronous(dax_dev);
-   pmem->dax_dev = dax_dev;
-   rc = dax_add_host(dax_dev, disk);
-   if (rc)
-   goto out_cleanup_dax;
-   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2




[PATCH v6 4/9] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-15 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Suggested-by: Dan Williams 
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/md/dm.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 23c32cd1f1d8..acdc00bc05be 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device 
*md)
 static struct mapped_device *alloc_dev(int minor)
 {
int r, numa_node_id = dm_get_numa_node();
+   struct dax_device *dax_dev;
struct mapped_device *md;
void *old_md;
 
@@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
 
-   if (IS_ENABLED(CONFIG_FS_DAX)) {
-   md->dax_dev = alloc_dax(md, _dax_ops);
-   if (IS_ERR(md->dax_dev)) {
-   md->dax_dev = NULL;
+   dax_dev = alloc_dax(md, _dax_ops);
+   if (IS_ERR(dax_dev)) {
+   if (PTR_ERR(dax_dev) != -EOPNOTSUPP)
goto bad;
-   }
-   set_dax_nocache(md->dax_dev);
-   set_dax_nomc(md->dax_dev);
-   if (dax_add_host(md->dax_dev, md->disk))
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   md->dax_dev = dax_dev;
+   if (dax_add_host(dax_dev, md->disk))
goto bad;
}
 
-- 
2.39.2




[PATCH v6 2/9] dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n

2024-02-15 Thread Mathieu Desnoyers
Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y
never returns NULL.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Fixes: 4e4ced93794a ("dax: Move mandatory ->zero_page_range() check in 
alloc_dax()")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 5 +
 include/linux/dax.h | 6 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..205b888d45bf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive);
  * that any fault handlers or operations that might have seen
  * dax_alive(), have completed.  Any operations that start after
  * synchronize_srcu() has run will abort upon seeing !dax_alive().
+ *
+ * Note, because alloc_dax() returns an ERR_PTR() on error, callers
+ * typically store its result into a local variable in order to check
+ * the result. Therefore, care must be taken to populate the struct
+ * device dax_dev field to make sure the dax_dev is not leaked.
  */
 void kill_dax(struct dax_device *dax_dev)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e3ffe7c7f01d..9d3e3327af4c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -88,11 +88,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
 static inline struct dax_device *alloc_dax(void *private,
const struct dax_operations *ops)
 {
-   /*
-* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
-* NULL is an error or expected.
-*/
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 static inline void put_dax(struct dax_device *dax_dev)
 {
-- 
2.39.2




[PATCH v6 1/9] dax: add empty static inline for CONFIG_DAX=n

2024-02-15 Thread Mathieu Desnoyers
When building a kernel with CONFIG_DAX=n, all uses of set_dax_nocache()
and set_dax_nomc() need to be either within regions of code or compile
units which are explicitly not compiled, or they need to rely on compiler
optimizations to eliminate calls to those undefined symbols.

It appears that at least the openrisc and loongarch architectures don't
end up eliminating those undefined symbols even if they are provably
within code which is eliminated due to conditional branches depending on
constants.

Implement empty static inline functions for set_dax_nocache() and
set_dax_nomc() in CONFIG_DAX=n to ensure those undefined references are
removed.
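The pattern used here — real prototypes when the feature is configured in, empty static inlines when it is not — can be sketched in userspace as follows. CONFIG_DAX is mocked with a plain macro, and feature_calls_compile() is a hypothetical caller showing that call sites still compile with no undefined references.

```c
/* Mock of the Kconfig symbol; 0 corresponds to CONFIG_DAX=n. */
#define CONFIG_DAX 0

struct dax_device;

#if CONFIG_DAX
/* Real implementations live in a compiled object when the feature is on. */
void set_dax_nocache(struct dax_device *dax_dev);
void set_dax_nomc(struct dax_device *dax_dev);
#else
/* Empty stubs: calls compile to nothing, so no undefined symbol can
 * survive even if the optimizer fails to eliminate the call site. */
static inline void set_dax_nocache(struct dax_device *dax_dev) { }
static inline void set_dax_nomc(struct dax_device *dax_dev) { }
#endif

/* Hypothetical call site, analogous to the dm/pmem drivers. */
static inline int feature_calls_compile(void)
{
	set_dax_nocache((struct dax_device *)0);
	set_dax_nomc((struct dax_device *)0);
	return 1;
}
```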

Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202402140037.wgfa1kqx-...@intel.com/
Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202402131351.a0fzogeg-...@intel.com/
Fixes: 7ac5360cd4d0 ("dax: remove the copy_from_iter and copy_to_iter methods")
Signed-off-by: Mathieu Desnoyers 
Cc: Christoph Hellwig 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 include/linux/dax.h | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..e3ffe7c7f01d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -63,6 +63,8 @@ void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
 bool dax_write_cache_enabled(struct dax_device *dax_dev);
 bool dax_synchronous(struct dax_device *dax_dev);
+void set_dax_nocache(struct dax_device *dax_dev);
+void set_dax_nomc(struct dax_device *dax_dev);
 void set_dax_synchronous(struct dax_device *dax_dev);
 size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
void *addr, size_t bytes, struct iov_iter *i);
@@ -109,6 +111,12 @@ static inline bool dax_synchronous(struct dax_device 
*dax_dev)
 {
return true;
 }
+static inline void set_dax_nocache(struct dax_device *dax_dev)
+{
+}
+static inline void set_dax_nomc(struct dax_device *dax_dev)
+{
+}
 static inline void set_dax_synchronous(struct dax_device *dax_dev)
 {
 }
@@ -124,9 +132,6 @@ static inline size_t dax_recovery_write(struct dax_device 
*dax_dev,
 }
 #endif
 
-void set_dax_nocache(struct dax_device *dax_dev);
-void set_dax_nomc(struct dax_device *dax_dev);
-
 struct writeback_control;
 #if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
 int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk);
-- 
2.39.2




[PATCH v6 0/9] Introduce cpu_dcache_is_aliasing() to fix DAX regression

2024-02-15 Thread Mathieu Desnoyers
This commit, introduced in v4.0, prevents building FS_DAX on 32-bit ARM,
even on ARMv7 which does not have virtually aliased data caches:

commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")

even though it used to work fine before.

The root of the issue here is the fact that DAX was never designed to
handle virtually aliasing data caches (VIVT and VIPT with aliasing data
cache). It touches the pages through their linear mapping, which is not
consistent with the userspace mappings with virtually aliasing data
caches.

This patch series introduces cpu_dcache_is_aliasing() with the new
Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
architectures. The implementation of cpu_dcache_is_aliasing() is either
evaluated to a constant at compile-time or a runtime check, which is
what is needed on ARM.

With this we can basically narrow down the list of architectures which
are unsupported by DAX to those which are really affected.

Testing done so far:

- Compile allyesconfig on x86-64,
- Compile allyesconfig on x86-64, with FS_DAX=n.
- Compile allyesconfig on x86-64, with DAX=n.
- Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP
  even on x86-64, thus simulating the behavior expected on an
  architecture with data cache aliasing.

There are many more axes to test however. I would welcome Tested-by for:

- affected architectures,
- affected drivers,
- affected filesystems.

[ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]

Thanks,

Mathieu

Changes since v5:
- Add empty static inline set_dax_nocache() and set_dax_nomc() for
  CONFIG_DAX=n.
- Update "Fixes" tag for "dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP)
  for CONFIG_DAX=n".
- Check IS_ERR_OR_NULL() before calling virtio_fs_cleanup_dax()
  within virtio_fs_setup_dax().

Changes since v4:
- Move the change which makes alloc_dax() return ERR_PTR(-EOPNOTSUPP)
  when CONFIG_DAX=n earlier in the series,
- Fold driver cleanup patches into their respective per-driver changes.
- Move "nvdimm/pmem: Fix leak on dax_add_host() failure" outside of this
  series.

Changes since v3:
- Fix a leak on dax_add_host() failure in nvdimm/pmem.
- Split the series into a bisectable sequence of changes.
- Ensure that device-dax use-cases still works on data cache
  aliasing architectures.

Changes since v2:
- Move DAX supported runtime check to alloc_dax(),
- Modify DM to handle alloc_dax() error as non-fatal,
- Remove all filesystem modifications, since the check is now done by
  alloc_dax(),
- rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to
  eliminate confusion with VFS terminology.

Changes since v1:
- The order of the series was completely changed based on the
  feedback received on v1,
- cache_is_aliasing() is renamed to dcache_is_aliasing(),
- ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone,
- dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is
  simplified,
- the dax_is_supported() check was moved to its rightful place in all
  filesystems.

Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org

Mathieu Desnoyers (9):
  dax: add empty static inline for CONFIG_DAX=n
  dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n
  nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
  virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dax: Check for data cache aliasing at runtime
  Introduce cpu_dcache_is_aliasing() across all architectures
  dax: Fix incorrect list of data cache aliasing architectures

 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 drivers

[PATCH v2] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-15 Thread Mathieu Desnoyers
Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:

out_cleanup_dax:
kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);

Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Reviewed-by: Dave Jiang 
Reviewed-by: Fan Ni 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
Changes since v1:
- Add Reviewed-by tags.
---
 drivers/nvdimm/pmem.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..9fe358090720 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
set_dax_nomc(dax_dev);
if (is_nvdimm_sync(nd_region))
set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
rc = dax_add_host(dax_dev, disk);
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
-   pmem->dax_dev = dax_dev;
-
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2




Re: CPU data cache across reboot/kexec for pmem/dax devices

2024-02-13 Thread Mathieu Desnoyers

On 2024-02-09 15:15, Dan Williams wrote:

Mathieu Desnoyers wrote:

Hi Dan,

In the context of extracting user-space trace data when the kernel crashes,
the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area
of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the
resulting device to create/mount a dax-enabled fs (e.g. ext4).

We then use this filesystem to mmap() the shared memory files for the tracer.
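Concretely, such a setup can be sketched as follows. This is an illustrative configuration only: the device name, sizes, and mount point are assumptions, and the memmap= range must correspond to RAM that is actually free on the target system.

```shell
# 1) Reserve 512M of volatile RAM as a legacy pmem region at physical
#    offset 4G via the kernel command line (pick a free range):
#        memmap=512M!4G
#    After reboot this appears as /dev/pmem0.

# 2) Put a DAX-capable filesystem on it and mount it with -o dax:
mkfs.ext4 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem

# 3) Make LTTng-UST place its shared memory ring buffers there:
lttng create my-session --shm-path=/mnt/pmem
```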

I want to make sure that the very last events from the userspace tracer written
to the memory mapped buffers (mmap()) by userspace are present after a
warm-reboot (or kexec/kdump).

Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush
(or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer
data is usually overwritten many times before the system actually crashes, and
the only thing we really need to make sure is that the cache lines are not
invalidated without write back.

So I understand that the main use-case for pmem is nvdimm, and that in order to
guarantee persistence of the data on power off an explicit pmem_persist() is
needed after each "transaction", but for the needs of tracing, is there some
kind of architectural guarantee that the data present in the cpu data cache
is not invalidated prior to write-back in each of those scenarios?

- reboot with bios explicitly not clearing memory,


This one gives me pause, because a trip through the BIOS typically means
lots of resets and other low level magic, so this would likely require
pushing dirty data out of CPU caches prior to entering the BIOS code
paths.

So this either needs explicit cache flushing or mapping the memory with
write-through semantics. That latter one is not supported in the stack
today.


I would favor the explicit cache flushing approach over write-through semantics
for performance reasons: typically the ring buffers are overwritten, so always
storing the data to memory would be wasteful.

But I would rather do the cache flushing only on crash (die oops/panic 
notifiers)
rather than require the tracer to flush cache lines immediately after each event
is produced, again for performance reasons.

I have done a crude attempt at registering die oops/panic notifiers from the
pmem driver, and flushing all pmem areas to memory when the die oops/panic
notifiers are called (see the untested patch below). I wonder if this is
something that would be generally useful?




- kexec/kdump.


This should maintain the state of CPU caches. As far as the CPU is
concerned it is just long jumping into a new kernel in memory without
resetting any CPU cache state.


Good to know that this scenario is expected to "Just Work".

Thanks,

Mathieu


diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index e9898457a7bd..b92f18607d39 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -26,12 +26,18 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "pmem.h"
 #include "btt.h"
 #include "pfn.h"
 #include "nd.h"
 
+static int pmem_die_handler(struct notifier_block *self,
+   unsigned long ev, void *unused);
+static int pmem_panic_handler(struct notifier_block *self,
+   unsigned long ev, void *unused);
+
 static struct device *to_dev(struct pmem_device *pmem)
 {
/*
@@ -423,6 +429,9 @@ static void pmem_release_disk(void *__pmem)
 {
struct pmem_device *pmem = __pmem;
 
+	atomic_notifier_chain_unregister(&panic_notifier_list,
+   &pmem->panic_notifier);
+   unregister_die_notifier(&pmem->die_notifier);
dax_remove_host(pmem->disk);
kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);
@@ -573,9 +582,20 @@ static int pmem_attach_disk(struct device *dev,
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
}
-   rc = device_add_disk(dev, disk, pmem_attribute_groups);
+   pmem->die_notifier.notifier_call = pmem_die_handler;
+   pmem->die_notifier.priority = -INT_MAX;
+   rc = register_die_notifier(&pmem->die_notifier);
if (rc)
goto out_remove_host;
+   pmem->panic_notifier.notifier_call = pmem_panic_handler;
+   pmem->panic_notifier.priority = -INT_MAX;
+   rc = atomic_notifier_chain_register(&panic_notifier_list,
+   &pmem->panic_notifier);
+   if (rc)
+   goto out_unregister_die;
+   rc = device_add_disk(dev, disk, pmem_attribute_groups);
+   if (rc)
+   goto out_unregister_panic;
if (devm_add_action_or_reset(dev, pmem_release_disk, pmem))
return -ENOMEM;
 
@@ -587,6 +607,11 @@ static int pmem_attach_disk(struct device *dev,

dev_warn(dev, "'badblocks' notification disabled\n");
return 0;
 
+out_unregister_panic:

+   atomic_notifier_chain_unregister(&panic_notifi

Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-13 Thread Mathieu Desnoyers

On 2024-02-12 18:02, Dan Williams wrote:
[...]

...and Mathieu, this should be IS_ERR_OR_NULL() to skip an unnecessary
call to virtio_fs_cleanup_dax() at function exit that the compiler
should elide.


OK, so I'll go back to the previous approach for v6:

DEFINE_FREE(cleanup_dax, struct dax_device *, if (!IS_ERR_OR_NULL(_T)) 
virtio_fs_cleanup_dax(_T))

and define the variable as:

struct dax_device *dax_dev __free(cleanup_dax) = NULL;
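For context, DEFINE_FREE()/__free() from include/linux/cleanup.h are thin wrappers around the compiler's cleanup attribute. A hypothetical userspace sketch of the underlying mechanism (all names here are illustrative, not the kernel macros themselves):

```c
#include <stdlib.h>

/* Counts how many times the cleanup handler actually ran. */
static int cleanup_calls;

/* The cleanup handler receives a pointer to the annotated variable.
 * The NULL check mirrors the !IS_ERR_OR_NULL() guard discussed above:
 * a sentinel initial value must not be "freed". */
static void free_buf(char **p)
{
	if (*p) {
		free(*p);
		cleanup_calls++;
	}
}

#define __free_buf __attribute__((cleanup(free_buf)))

/* The buffer is released automatically when the variable goes out of
 * scope, on every return path. */
static int use_scoped_buffer(void)
{
	char *buf __free_buf = malloc(16);

	if (!buf)
		return -1;
	buf[0] = 'x';
	return 0;	/* free_buf() runs here */
}
```

This is why the sentinel initial value matters: whatever the variable holds at scope exit — NULL, an ERR_PTR(), or a live object — is handed to the cleanup handler, which must distinguish the cases.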

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v5 1/8] dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n

2024-02-13 Thread Mathieu Desnoyers

On 2024-02-13 14:07, Dan Williams wrote:

Lukas Wunner wrote:

On Mon, Feb 12, 2024 at 11:30:54AM -0500, Mathieu Desnoyers wrote:

Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y
never returns NULL.


All the callers of alloc_dax() only check for IS_ERR().

Doesn't this result in a change of behavior in all the callers?
Previously they'd ignore the NULL return value and continue,
now they'll error out.

Given that, seems dangerous to add a Fixes tag with a v4.0 commit
and thus risk regressing all stable kernels.


Oh, good catch, yes that Fixes tag should be:

4e4ced93794a dax: Move mandatory ->zero_page_range() check in alloc_dax()

...as that was the one that updated the alloc_dax() calling convention
without fixing up the CONFIG_DAX=n case.

This is a pre-requisite for restoring DAX operation on ARM32.


I'll change the Fixes tag in this commit to:

Fixes: 4e4ced93794a ("dax: Move mandatory ->zero_page_range() check in 
alloc_dax()")

for v6.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-13 Thread Mathieu Desnoyers

On 2024-02-13 01:25, Lukas Wunner wrote:

On Mon, Feb 12, 2024 at 11:30:58AM -0500, Mathieu Desnoyers wrote:

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of virtio
virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
non-fatal.

Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")


That's a v4.0 commit, yet this patch uses DEFINE_FREE() which is
only available in v6.6 but not any earlier stable kernels.


I asked this question to Greg KH before creating this patch, and his
answer was to implement my fix for master, and stable kernels would take
care of backporting all the required dependencies.

Now if I look at latest 6.1, 5.15, 5.10, 5.4, 4.19 stable kernels,
none seem to have include/linux/cleanup.h today. But I suspect that
sooner or later relevant master branch fixes will require stable
kernels to backport cleanup.h, so why not do it now?

Thanks,

Mathieu




So the Fixes tag feels a bit weird.

Apart from that,
Reviewed-by: Lukas Wunner 


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




[PATCH v5 8/8] dax: Fix incorrect list of data cache aliasing architectures

2024-02-12 Thread Mathieu Desnoyers
commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
prevents DAX from building on architectures with virtually aliased
dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 cores don't have virtually
aliased dcaches), and it also misses many other architectures with
virtually aliased data caches.

This is a regression introduced in the v4.0 Linux kernel where the
dax mount option is removed for 32-bit ARMv7 boards which have no data
cache aliasing, and therefore should work fine with FS_DAX.

This was turned into the following check in alloc_dax() by a preparatory
change:

	if (ops && (IS_ENABLED(CONFIG_ARM) ||
		    IS_ENABLED(CONFIG_MIPS) ||
		    IS_ENABLED(CONFIG_SPARC)))
		return ERR_PTR(-EOPNOTSUPP);

Use cpu_dcache_is_aliasing() instead to figure out whether the environment
has aliasing data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ce5bffa86bba..a21a7c262382 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "dax-private.h"
 
 /**
@@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
 * except for device-dax (NULL operations pointer), which does
 * not use aliased mappings from the kernel.
 */
-   if (ops && (IS_ENABLED(CONFIG_ARM) ||
-   IS_ENABLED(CONFIG_MIPS) ||
-   IS_ENABLED(CONFIG_SPARC)))
+   if (ops && cpu_dcache_is_aliasing())
return ERR_PTR(-EOPNOTSUPP);
 
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
-- 
2.39.2




[PATCH v5 7/8] Introduce cpu_dcache_is_aliasing() across all architectures

2024-02-12 Thread Mathieu Desnoyers
Introduce a generic way to query whether the data cache is virtually
aliased on all architectures. Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g. FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios depending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
   state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
 commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to 
PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".

Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 include/linux/cacheinfo.h   |  6 ++
 mm/Kconfig  |  6 ++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..7d294a3242a4 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@
 config ARC
def_bool y
select ARC_TIMERS
+   select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index ..05fc7ed59712
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include 
+
+#define cpu_dcache_is_aliasing()   true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..cd13b1788973 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ conf

[PATCH v5 6/8] dax: Check for data cache aliasing at runtime

2024-02-12 Thread Mathieu Desnoyers
Replace the following fs/Kconfig:FS_DAX dependency:

  depends on !(ARM || MIPS || SPARC)

By a runtime check within alloc_dax(). This runtime check returns
ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means
the kernel is using an aliased mapping) on an architecture which
has data cache aliasing.

Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
CONFIG_DAX=n for consistency.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 10 ++
 fs/Kconfig  |  1 -
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 205b888d45bf..ce5bffa86bba 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -450,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
dev_t devt;
int minor;
 
+   /*
+* Unavailable on architectures with virtually aliased data caches,
+* except for device-dax (NULL operations pointer), which does
+* not use aliased mappings from the kernel.
+*/
+   if (ops && (IS_ENABLED(CONFIG_ARM) ||
+   IS_ENABLED(CONFIG_MIPS) ||
+   IS_ENABLED(CONFIG_SPARC)))
+   return ERR_PTR(-EOPNOTSUPP);
+
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
return ERR_PTR(-EINVAL);
 
diff --git a/fs/Kconfig b/fs/Kconfig
index 42837617a55b..e5efdb3b276b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -56,7 +56,6 @@ endif # BLOCK
 config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
-   depends on !(ARM || MIPS || SPARC)
depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
-- 
2.39.2




[PATCH v5 5/8] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-12 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of virtio
virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
non-fatal.

Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 fs/fuse/virtio_fs.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce..f9acd9972af2 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "fuse_i.h"
 
@@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
put_dax(dax_dev);
 }
 
+DEFINE_FREE(cleanup_dax, struct dax_device *, if (!IS_ERR(_T)) virtio_fs_cleanup_dax(_T))
+
 static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs 
*fs)
 {
+   struct dax_device *dax_dev __free(cleanup_dax) = ERR_PTR(-EOPNOTSUPP);
struct virtio_shm_region cache_reg;
struct dev_pagemap *pgmap;
bool have_cache;
@@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
if (!IS_ENABLED(CONFIG_FUSE_DAX))
return 0;
 
+   dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
+   if (IS_ERR(dax_dev)) {
+   int rc = PTR_ERR(dax_dev);
+   return rc == -EOPNOTSUPP ? 0 : rc;
+   }
+
/* Get cache region */
have_cache = virtio_get_shm_region(vdev, &cache_reg,
   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
@@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 
0x%llx\n",
__func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
 
-   fs->dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
-   if (IS_ERR(fs->dax_dev))
-   return PTR_ERR(fs->dax_dev);
-
+   fs->dax_dev = no_free_ptr(dax_dev);
return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax,
fs->dax_dev);
 }
-- 
2.39.2




[PATCH v5 4/8] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure

2024-02-12 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dcssblk
dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures.

Considering that s390 is not a data cache aliasing architecture,
and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP
from alloc_dax() should make dcssblk_add_store() fail.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 drivers/s390/block/dcssblk.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 4b7ecd4fd431..f363c1d51d9a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+   struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
if (rc)
goto put_dev;
 
-   dev_info->dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
-   if (IS_ERR(dev_info->dax_dev)) {
-   rc = PTR_ERR(dev_info->dax_dev);
-   dev_info->dax_dev = NULL;
+   dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
+   if (IS_ERR(dax_dev)) {
+   rc = PTR_ERR(dax_dev);
goto put_dev;
}
-   set_dax_synchronous(dev_info->dax_dev);
+   set_dax_synchronous(dax_dev);
+   dev_info->dax_dev = dax_dev;
rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
if (rc)
goto out_dax;
-- 
2.39.2




[PATCH v5 3/8] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-12 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Suggested-by: Dan Williams 
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/md/dm.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 23c32cd1f1d8..acdc00bc05be 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device 
*md)
 static struct mapped_device *alloc_dev(int minor)
 {
int r, numa_node_id = dm_get_numa_node();
+   struct dax_device *dax_dev;
struct mapped_device *md;
void *old_md;
 
@@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
 
-   if (IS_ENABLED(CONFIG_FS_DAX)) {
-   md->dax_dev = alloc_dax(md, &dm_dax_ops);
-   if (IS_ERR(md->dax_dev)) {
-   md->dax_dev = NULL;
+   dax_dev = alloc_dax(md, &dm_dax_ops);
+   if (IS_ERR(dax_dev)) {
+   if (PTR_ERR(dax_dev) != -EOPNOTSUPP)
goto bad;
-   }
-   set_dax_nocache(md->dax_dev);
-   set_dax_nomc(md->dax_dev);
-   if (dax_add_host(md->dax_dev, md->disk))
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   md->dax_dev = dax_dev;
+   if (dax_add_host(dax_dev, md->disk))
goto bad;
}
 
-- 
2.39.2




[PATCH v5 2/8] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-12 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

[ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9fe358090720..e9898457a7bd 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
dax_dev = alloc_dax(pmem, &pmem_dax_ops);
if (IS_ERR(dax_dev)) {
rc = PTR_ERR(dax_dev);
-   goto out;
+   if (rc != -EOPNOTSUPP)
+   goto out;
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   if (is_nvdimm_sync(nd_region))
+   set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
+   rc = dax_add_host(dax_dev, disk);
+   if (rc)
+   goto out_cleanup_dax;
+   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
}
-   set_dax_nocache(dax_dev);
-   set_dax_nomc(dax_dev);
-   if (is_nvdimm_sync(nd_region))
-   set_dax_synchronous(dax_dev);
-   pmem->dax_dev = dax_dev;
-   rc = dax_add_host(dax_dev, disk);
-   if (rc)
-   goto out_cleanup_dax;
-   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2




[PATCH v5 0/8] Introduce cpu_dcache_is_aliasing() to fix DAX regression

2024-02-12 Thread Mathieu Desnoyers
This commit, introduced in v4.0, prevents building FS_DAX on 32-bit ARM,
even on ARMv7 which does not have virtually aliased data caches:

commit d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")

It used to work fine before: I have customers using DAX over pmem on
ARMv7, but this regression will likely prevent them from upgrading their
kernel.

The root of the issue here is the fact that DAX was never designed to
handle virtually aliasing data caches (VIVT and VIPT with aliasing data
cache). It touches the pages through their linear mapping, which is not
consistent with the userspace mappings with virtually aliasing data
caches.

This patch series introduces cpu_dcache_is_aliasing() with the new
Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
architectures. The implementation of cpu_dcache_is_aliasing() is either
evaluated to a constant at compile-time or a runtime check, which is
what is needed on ARM.

With this we can basically narrow down the list of architectures which
are unsupported by DAX to those which are really affected.

Testing done so far:

- Compile allyesconfig on x86-64,
- Compile allyesconfig on x86-64, with FS_DAX=n.
- Compile allyesconfig on x86-64, with DAX=n.
- Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP
  even on x86-64, thus simulating the behavior expected on an
  architecture with data cache aliasing.

There are many more axes to test however. I would welcome Tested-by for:

- affected architectures,
- affected drivers,
- affected filesystems.

[ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]

Thanks,

Mathieu

Changes since v4:
- Move the change which makes alloc_dax() return ERR_PTR(-EOPNOTSUPP)
  when CONFIG_DAX=n earlier in the series,
- Fold driver cleanup patches into their respective per-driver changes.
- Move "nvdimm/pmem: Fix leak on dax_add_host() failure" outside of this
  series.

Changes since v3:
- Fix a leak on dax_add_host() failure in nvdimm/pmem.
- Split the series into a bisectable sequence of changes.
- Ensure that device-dax use-cases still work on data cache
  aliasing architectures.

Changes since v2:
- Move DAX supported runtime check to alloc_dax(),
- Modify DM to handle alloc_dax() error as non-fatal,
- Remove all filesystem modifications, since the check is now done by
  alloc_dax(),
- rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to
  eliminate confusion with VFS terminology.

Changes since v1:
- The order of the series was completely changed based on the
  feedback received on v1,
- cache_is_aliasing() is renamed to dcache_is_aliasing(),
- ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone,
- dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is
  simplified,
- the dax_is_supported() check was moved to its rightful place in all
  filesystems.

Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org

Mathieu Desnoyers (8):
  dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n
  nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
  virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dax: Check for data cache aliasing at runtime
  Introduce cpu_dcache_is_aliasing() across all architectures
  dax: Fix incorrect list of data cache aliasing architectures

 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 drivers/dax/super.c | 14 ++
 drivers/md/dm.c | 17 +
 drivers/nvdimm/pmem.c   | 22 --
 drivers/s390/block/dcssblk.c| 11 ++-
 fs/Kconfig

[PATCH v5 1/8] dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n

2024-02-12 Thread Mathieu Desnoyers
Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y
never returns NULL.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 5 +
 include/linux/dax.h | 6 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..205b888d45bf 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive);
  * that any fault handlers or operations that might have seen
  * dax_alive(), have completed.  Any operations that start after
  * synchronize_srcu() has run will abort upon seeing !dax_alive().
+ *
+ * Note, because alloc_dax() returns an ERR_PTR() on error, callers
+ * typically store its result into a local variable in order to check
+ * the result. Therefore, care must be taken to populate the struct
+ * device dax_dev field to make sure the dax_dev is not leaked.
  */
 void kill_dax(struct dax_device *dax_dev)
 {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..df2d52b8a245 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
 static inline struct dax_device *alloc_dax(void *private,
const struct dax_operations *ops)
 {
-   /*
-* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
-* NULL is an error or expected.
-*/
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 static inline void put_dax(struct dax_device *dax_dev)
 {
-- 
2.39.2




[PATCH] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-12 Thread Mathieu Desnoyers
Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:

out_cleanup_dax:
kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);

Signed-off-by: Mathieu Desnoyers 
Reviewed-by: Dan Williams 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..9fe358090720 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
set_dax_nomc(dax_dev);
if (is_nvdimm_sync(nd_region))
set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
rc = dax_add_host(dax_dev, disk);
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
-   pmem->dax_dev = dax_dev;
-
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2




CPU data cache across reboot/kexec for pmem/dax devices

2024-02-09 Thread Mathieu Desnoyers

Hi Dan,

In the context of extracting user-space trace data when the kernel crashes,
the LTTng user-space tracer recommends using nvdimm/pmem to reserve an area
of physical (volatile) RAM at boot (memmap=nn[KMG]!ss[KMG]), and use the
resulting device to create/mount a dax-enabled fs (e.g. ext4).

We then use this filesystem to mmap() the shared memory files for the tracer.

I want to make sure that the very last events written by the userspace
tracer to the memory-mapped (mmap()) buffers are present after a
warm reboot (or kexec/kdump).

Note that the LTTng user-space tracer (LTTng-UST) does *not* issue any clflush
(or equivalent pmem_persist() from libpmem) for performance reasons: ring buffer
data is usually overwritten many times before the system actually crashes, and
the only thing we really need to make sure is that the cache lines are not
invalidated without write back.

So I understand that the main use-case for pmem is nvdimm, and that in order to
guarantee persistence of the data on power off an explicit pmem_persist() is
needed after each "transaction", but for the needs of tracing, is there some
kind of architectural guarantee that the data present in the cpu data cache
is not invalidated prior to write back in each of those scenarios ?

- reboot with bios explicitly not clearing memory,
- kexec/kdump.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [PATCH v4 06/12] dax: Check for data cache aliasing at runtime

2024-02-08 Thread Mathieu Desnoyers

On 2024-02-08 17:37, Dan Williams wrote:

Mathieu Desnoyers wrote:

On 2024-02-08 16:39, Dan Williams wrote:
[...]


So per other feedback on earlier patches, I think this hunk deserves to
be moved to its own patch earlier in the series as a standalone fixup.


Done.



Rest of this patch looks good to me.


Adding your Acked-by to what is left of this patch if OK with you.


You can add:

Reviewed-by: Dan Williams 

...after that re-org.


Just to make sure: are you OK with me adding your Reviewed-by
only for what is left of this patch, or also to the other driver
patches after integrating your requested changes ?

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-08 Thread Mathieu Desnoyers

On 2024-02-08 17:12, Andrew Morton wrote:

On Thu, 8 Feb 2024 17:04:52 -0500 Mathieu Desnoyers 
 wrote:


[...]


Should I keep this patch 01/12 within the series for v5 or should I
send it separately ?


Doesn't matter much, but perfectionism does say "standalone patch please".


Will do. I plan to add the following statement to the commit message
to make it clear that there is a dependency between the patch series
and this fix:

[ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v4 09/12] nvdimm/pmem: Cleanup alloc_dax() error handling

2024-02-08 Thread Mathieu Desnoyers

On 2024-02-08 16:55, Dan Williams wrote:
[...]


Oh... I see you cleanup what I was talking about later in the series.

For my taste I don't like to see tech-debt added and then removed later
in the series. The whole series would appear to get smaller by removing
the alloc_dax() returning NULL case from the beginning, and then doing
the EOPNOTSUPP fixups.

...repeat this comment for patch 10, 11, 12.


Done.

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v4 06/12] dax: Check for data cache aliasing at runtime

2024-02-08 Thread Mathieu Desnoyers

On 2024-02-08 16:39, Dan Williams wrote:
[...]


So per other feedback on earlier patches, I think this hunk deserves to
be moved to its own patch earlier in the series as a standalone fixup.


Done.



Rest of this patch looks good to me.


Adding your Acked-by to what is left of this patch if OK with you.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v4 05/12] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers

On 2024-02-08 16:37, Dan Williams wrote:
[...]

+DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) 
virtio_fs_cleanup_dax(_T))


This IS_ERR_OR_NULL is ok because dax_dev is initialized to NULL, but
maybe this could just be IS_ERR() and then initialize @dax_dev to
ERR_PTR(-EOPNOTSUPP)?


Done.

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v4 04/12] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure

2024-02-08 Thread Mathieu Desnoyers

On 2024-02-08 16:36, Dan Williams wrote:
[...]

Just another "ditto" on alloc_dax() returning NULL so that the ternary
can be removed, but otherwise this looks good.


Done.

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v4 03/12] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers

On 2024-02-08 16:34, Dan Williams wrote:
[...]

Similar feedback as the pmem change, lets not propagate the mistake that
alloc_dax() could return NULL, none of the callers of alloc_dax()
properly handled NULL and it was just luck that none of the use cases
tried to use alloc_dax() in the CONFIG_DAX=n case.


Done.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v4 02/12] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers

On 2024-02-08 16:32, Dan Williams wrote:

Mathieu Desnoyers wrote:

In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
  drivers/nvdimm/pmem.c | 26 ++
  1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9fe358090720..f1d9f5c6dbac 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -558,19 +558,21 @@ static int pmem_attach_disk(struct device *dev,
disk->bb = &pmem->bb;
  
	dax_dev = alloc_dax(pmem, &pmem_dax_ops);

-   if (IS_ERR(dax_dev)) {
-   rc = PTR_ERR(dax_dev);
-   goto out;
+   if (IS_ERR_OR_NULL(dax_dev)) {


alloc_dax() should never return NULL. I.e. the lead in before this patch
should fix this misunderstanding:

diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..df2d52b8a245 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
  static inline struct dax_device *alloc_dax(void *private,
 const struct dax_operations *ops)
  {
-   /*
-* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
-* NULL is an error or expected.
-*/
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
  }
  static inline void put_dax(struct dax_device *dax_dev)
  {


+   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;


Then this ternary can be replaced with just a check of which PTR_ERR()
value is being returned.


As you noted, I've introduced this as cleanups in later patches. I don't
mind folding these into their respective per-driver commits and moving
the alloc_dax() hunk earlier in the series.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-08 Thread Mathieu Desnoyers

On 2024-02-08 16:21, Andrew Morton wrote:

On Thu,  8 Feb 2024 13:49:02 -0500 Mathieu Desnoyers 
 wrote:


Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:

out_cleanup_dax:
 kill_dax(pmem->dax_dev);
 put_dax(pmem->dax_dev);


Seems inappropriate that this fix is within this patch series?

otoh I assume dax_add_host() has never failed so it doesn't matter much.


The series seems useful but is at v4 without much sign of review
activity.  I think I'll take silence as assent and shall slam it all
into -next and see who shouts at me.



Thanks Andrew for picking it up! Dan just reacted with feedback that
will help reducing the patch series size by removing intermediate
commits. I'll implement the requested changes and post a v5 in a few
days.

So far there are not behavior changes requested in Dan's feedback.

Should I keep this patch 01/12 within the series for v5 or should I
send it separately ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




[PATCH v4 11/12] dcssblk: Cleanup alloc_dax() error handling

2024-02-08 Thread Mathieu Desnoyers
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 drivers/s390/block/dcssblk.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index a3010849bfed..f363c1d51d9a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -679,8 +679,8 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
goto put_dev;
 
dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (IS_ERR(dax_dev)) {
+   rc = PTR_ERR(dax_dev);
goto put_dev;
}
set_dax_synchronous(dax_dev);
-- 
2.39.2




[PATCH v4 12/12] virtio: Cleanup alloc_dax() error handling

2024-02-08 Thread Mathieu Desnoyers
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 fs/fuse/virtio_fs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 621b1bca2d55..a28466c2da71 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -809,8 +809,8 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
return 0;
 
dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   int rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (IS_ERR(dax_dev)) {
+   int rc = PTR_ERR(dax_dev);
return rc == -EOPNOTSUPP ? 0 : rc;
}
 
-- 
2.39.2




[PATCH v4 07/12] Introduce cpu_dcache_is_aliasing() across all architectures

2024-02-08 Thread Mathieu Desnoyers
Introduce a generic way to query whether the data cache is virtually
aliased on all architectures. Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g. FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios depending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
   state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
 commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to 
PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".
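The generic fallback implied by that description (the actual include/linux/cacheinfo.h hunk is not quoted in this diffstat excerpt, so this is a sketch reconstructed from the changelog, not the verbatim patch) boils down to:

```c
/* Sketch of the generic definition: architectures that select
 * ARCH_HAS_CPU_CACHE_ALIASING provide their own cpu_dcache_is_aliasing()
 * in <asm/cachetype.h>; everyone else gets a compile-time "false".
 */
#ifdef CONFIG_ARCH_HAS_CPU_CACHE_ALIASING
#include <asm/cachetype.h>
#else
#define cpu_dcache_is_aliasing()	false
#endif
```

This keeps the check free on never-aliasing architectures while allowing arm, mips, and the others in group B) to answer at runtime.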

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".

Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 include/linux/cacheinfo.h   |  6 ++
 mm/Kconfig  |  6 ++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..7d294a3242a4 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@
 config ARC
def_bool y
select ARC_TIMERS
+   select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index ..05fc7ed59712
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include 
+
+#define cpu_dcache_is_aliasing()   true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..cd13b1788973 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ conf

[PATCH v4 06/12] dax: Check for data cache aliasing at runtime

2024-02-08 Thread Mathieu Desnoyers
Replace the following fs/Kconfig:FS_DAX dependency:

  depends on !(ARM || MIPS || SPARC)

By a runtime check within alloc_dax(). This runtime check returns
ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means
the kernel is using an aliased mapping) on an architecture which
has data cache aliasing.

Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
CONFIG_DAX=n for consistency.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing 
caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 15 +++
 fs/Kconfig  |  1 -
 include/linux/dax.h |  6 +-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..ce5bffa86bba 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive);
  * that any fault handlers or operations that might have seen
  * dax_alive(), have completed.  Any operations that start after
  * synchronize_srcu() has run will abort upon seeing !dax_alive().
+ *
+ * Note, because alloc_dax() returns an ERR_PTR() on error, callers
+ * typically store its result into a local variable in order to check
+ * the result. Therefore, care must be taken to populate the struct
+ * device dax_dev field to make sure the dax_dev is not leaked.
  */
 void kill_dax(struct dax_device *dax_dev)
 {
@@ -445,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
dev_t devt;
int minor;
 
+   /*
+* Unavailable on architectures with virtually aliased data caches,
+* except for device-dax (NULL operations pointer), which does
+* not use aliased mappings from the kernel.
+*/
+   if (ops && (IS_ENABLED(CONFIG_ARM) ||
+   IS_ENABLED(CONFIG_MIPS) ||
+   IS_ENABLED(CONFIG_SPARC)))
+   return ERR_PTR(-EOPNOTSUPP);
+
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
return ERR_PTR(-EINVAL);
 
diff --git a/fs/Kconfig b/fs/Kconfig
index 42837617a55b..e5efdb3b276b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -56,7 +56,6 @@ endif # BLOCK
 config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
-   depends on !(ARM || MIPS || SPARC)
depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..df2d52b8a245 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
 static inline struct dax_device *alloc_dax(void *private,
const struct dax_operations *ops)
 {
-   /*
-* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
-* NULL is an error or expected.
-*/
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 static inline void put_dax(struct dax_device *dax_dev)
 {
-- 
2.39.2




[PATCH v4 09/12] nvdimm/pmem: Cleanup alloc_dax() error handling

2024-02-08 Thread Mathieu Desnoyers
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f1d9f5c6dbac..e9898457a7bd 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -558,8 +558,8 @@ static int pmem_attach_disk(struct device *dev,
	disk->bb = &pmem->bb;
 
	dax_dev = alloc_dax(pmem, &pmem_dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (IS_ERR(dax_dev)) {
+   rc = PTR_ERR(dax_dev);
if (rc != -EOPNOTSUPP)
goto out;
} else {
-- 
2.39.2




[PATCH v4 08/12] dax: Fix incorrect list of data cache aliasing architectures

2024-02-08 Thread Mathieu Desnoyers
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
prevents DAX from building on architectures with virtually aliased
dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 CPUs don't have virtually
aliased dcaches), and it also misses many other architectures with
virtually aliased data caches.

This is a regression introduced in the v4.0 Linux kernel: the dax
mount option was removed for 32-bit ARMv7 boards which have no data
cache aliasing, and therefore should work fine with FS_DAX.

This was turned into the following check in alloc_dax() by a preparatory
change:

if (ops && (IS_ENABLED(CONFIG_ARM) ||
IS_ENABLED(CONFIG_MIPS) ||
IS_ENABLED(CONFIG_SPARC)))
return NULL;

Use cpu_dcache_is_aliasing() instead to figure out whether the environment
has aliasing data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ce5bffa86bba..a21a7c262382 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include <linux/cacheinfo.h>
 #include "dax-private.h"
 
 /**
@@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct 
dax_operations *ops)
 * except for device-dax (NULL operations pointer), which does
 * not use aliased mappings from the kernel.
 */
-   if (ops && (IS_ENABLED(CONFIG_ARM) ||
-   IS_ENABLED(CONFIG_MIPS) ||
-   IS_ENABLED(CONFIG_SPARC)))
+   if (ops && cpu_dcache_is_aliasing())
return ERR_PTR(-EOPNOTSUPP);
 
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
-- 
2.39.2




[PATCH v4 10/12] dm: Cleanup alloc_dax() error handling

2024-02-08 Thread Mathieu Desnoyers
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/md/dm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2fc22cae9089..acdc00bc05be 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2124,8 +2124,8 @@ static struct mapped_device *alloc_dev(int minor)
sprintf(md->disk->disk_name, "dm-%d", minor);
 
	dax_dev = alloc_dax(md, &dm_dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   if (IS_ERR(dax_dev) && PTR_ERR(dax_dev) != -EOPNOTSUPP)
+   if (IS_ERR(dax_dev)) {
+   if (PTR_ERR(dax_dev) != -EOPNOTSUPP)
goto bad;
} else {
set_dax_nocache(dax_dev);
-- 
2.39.2




[PATCH v4 05/12] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of virtio
virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
non-fatal.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 fs/fuse/virtio_fs.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce..621b1bca2d55 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "fuse_i.h"
 
@@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
put_dax(dax_dev);
 }
 
+DEFINE_FREE(cleanup_dax, struct dax_device *, if (!IS_ERR_OR_NULL(_T)) 
virtio_fs_cleanup_dax(_T))
+
 static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs 
*fs)
 {
+   struct dax_device *dax_dev __free(cleanup_dax) = NULL;
struct virtio_shm_region cache_reg;
struct dev_pagemap *pgmap;
bool have_cache;
@@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
if (!IS_ENABLED(CONFIG_FUSE_DAX))
return 0;
 
+   dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   int rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   return rc == -EOPNOTSUPP ? 0 : rc;
+   }
+
/* Get cache region */
	have_cache = virtio_get_shm_region(vdev, &cache_reg,
   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
@@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
	dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 
0x%llx\n",
__func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
 
-   fs->dax_dev = alloc_dax(fs, _fs_dax_ops);
-   if (IS_ERR(fs->dax_dev))
-   return PTR_ERR(fs->dax_dev);
-
+   fs->dax_dev = no_free_ptr(dax_dev);
	return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax,
fs->dax_dev);
 }
-- 
2.39.2




[PATCH v4 03/12] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Suggested-by: Dan Williams 
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/md/dm.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 23c32cd1f1d8..2fc22cae9089 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device 
*md)
 static struct mapped_device *alloc_dev(int minor)
 {
int r, numa_node_id = dm_get_numa_node();
+   struct dax_device *dax_dev;
struct mapped_device *md;
void *old_md;
 
@@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
 
-   if (IS_ENABLED(CONFIG_FS_DAX)) {
-   md->dax_dev = alloc_dax(md, &dm_dax_ops);
-   if (IS_ERR(md->dax_dev)) {
-   md->dax_dev = NULL;
+   dax_dev = alloc_dax(md, &dm_dax_ops);
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   if (IS_ERR(dax_dev) && PTR_ERR(dax_dev) != -EOPNOTSUPP)
goto bad;
-   }
-   set_dax_nocache(md->dax_dev);
-   set_dax_nomc(md->dax_dev);
-   if (dax_add_host(md->dax_dev, md->disk))
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   md->dax_dev = dax_dev;
+   if (dax_add_host(dax_dev, md->disk))
goto bad;
}
 
-- 
2.39.2




[PATCH v4 04/12] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure

2024-02-08 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dcssblk
dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures.

Considering that s390 is not a data cache aliasing architecture,
and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP
from alloc_dax() should make dcssblk_add_store() fail.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 drivers/s390/block/dcssblk.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 4b7ecd4fd431..a3010849bfed 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+   struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
if (rc)
goto put_dev;
 
-   dev_info->dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
-   if (IS_ERR(dev_info->dax_dev)) {
-   rc = PTR_ERR(dev_info->dax_dev);
-   dev_info->dax_dev = NULL;
+   dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
goto put_dev;
}
-   set_dax_synchronous(dev_info->dax_dev);
+   set_dax_synchronous(dax_dev);
+   dev_info->dax_dev = dax_dev;
rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
if (rc)
goto out_dax;
-- 
2.39.2




[PATCH v4 00/12] Introduce cpu_dcache_is_aliasing() to fix DAX regression

2024-02-08 Thread Mathieu Desnoyers
This commit, introduced in v4.0, prevents building FS_DAX on 32-bit ARM,
even on ARMv7 which does not have virtually aliased data caches:

commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")

It used to work fine before: I have customers using DAX over pmem on
ARMv7, but this regression will likely prevent them from upgrading their
kernel.

The root of the issue here is the fact that DAX was never designed to
handle virtually aliasing data caches (VIVT and VIPT with aliasing data
cache). It touches the pages through their linear mapping, which is not
consistent with the userspace mappings with virtually aliasing data
caches.

This patch series introduces cpu_dcache_is_aliasing() with the new
Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
architectures. The implementation of cpu_dcache_is_aliasing() is either
evaluated to a constant at compile-time or a runtime check, which is
what is needed on ARM.

With this we can basically narrow down the list of architectures which
are unsupported by DAX to those which are really affected.

Testing done so far:

- Compile allyesconfig on x86-64,
- Compile allyesconfig on x86-64, with FS_DAX=n.
- Compile allyesconfig on x86-64, with DAX=n.
- Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP
  even on x86-64, thus simulating the behavior expected on an
  architecture with data cache aliasing.

There are many more axes to test however. I would welcome Tested-by for:

- affected architectures,
- affected drivers,
- affected filesystems.

Thanks,

Mathieu

Changes since v3:
- Fix a leak on dax_add_host() failure in nvdimm/pmem.
- Split the series into a bisectable sequence of changes.
- Ensure that device-dax use-cases still works on data cache
  aliasing architectures.

Changes since v2:
- Move DAX supported runtime check to alloc_dax(),
- Modify DM to handle alloc_dax() error as non-fatal,
- Remove all filesystem modifications, since the check is now done by
  alloc_dax(),
- rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to
  eliminate confusion with VFS terminology.

Changes since v1:
- The order of the series was completely changed based on the
  feedback received on v1,
- cache_is_aliasing() is renamed to dcache_is_aliasing(),
- ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone,
- dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is
  simplified,
- the dax_is_supported() check was moved to its rightful place in all
  filesystems.

Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org

Mathieu Desnoyers (12):
  nvdimm/pmem: Fix leak on dax_add_host() failure
  nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
  virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dax: Check for data cache aliasing at runtime
  Introduce cpu_dcache_is_aliasing() across all architectures
  dax: Fix incorrect list of data cache aliasing architectures
  nvdimm/pmem: Cleanup alloc_dax() error handling
  dm: Cleanup alloc_dax() error handling
  dcssblk: Cleanup alloc_dax() error handling
  virtio: Cleanup alloc_dax() error handling

 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 drivers/dax/super.c | 14 ++
 drivers/md/dm.c | 17 +
 drivers/nvdimm/pmem.c   | 23 ---
 drivers/s390/block/dcssblk.c| 11 ++-
 fs/Kconfig  |  1 -
 fs/fuse/virtio_fs.c | 15 +++
 include/linux/cacheinfo.h   |  6 ++
 include/linux/dax.h |  6 +-
 mm/Kconfig  | 

[PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-08 Thread Mathieu Desnoyers
Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:

out_cleanup_dax:
kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..9fe358090720 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
set_dax_nomc(dax_dev);
if (is_nvdimm_sync(nd_region))
set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
rc = dax_add_host(dax_dev, disk);
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
-   pmem->dax_dev = dax_dev;
-
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2




[PATCH v4 02/12] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-08 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9fe358090720..f1d9f5c6dbac 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -558,19 +558,21 @@ static int pmem_attach_disk(struct device *dev,
	disk->bb = &pmem->bb;
 
	dax_dev = alloc_dax(pmem, &pmem_dax_ops);
-   if (IS_ERR(dax_dev)) {
-   rc = PTR_ERR(dax_dev);
-   goto out;
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (rc != -EOPNOTSUPP)
+   goto out;
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   if (is_nvdimm_sync(nd_region))
+   set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
+   rc = dax_add_host(dax_dev, disk);
+   if (rc)
+   goto out_cleanup_dax;
+   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
}
-   set_dax_nocache(dax_dev);
-   set_dax_nomc(dax_dev);
-   if (is_nvdimm_sync(nd_region))
-   set_dax_synchronous(dax_dev);
-   pmem->dax_dev = dax_dev;
-   rc = dax_add_host(dax_dev, disk);
-   if (rc)
-   goto out_cleanup_dax;
-   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2




Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer

2024-02-05 Thread Mathieu Desnoyers

On 2024-02-05 13:34, Vincent Donnefort wrote:

On Mon, Feb 05, 2024 at 11:55:08AM -0500, Mathieu Desnoyers wrote:

[...]



How are the kernel linear mapping and the userspace mapping made coherent
on architectures with virtually aliasing data caches ?

Ref. 
https://lore.kernel.org/lkml/20240202210019.88022-1-mathieu.desnoy...@efficios.com/T/#t


Hi Mathieu,

Thanks for the pointer.

We are in the exact same problem as DAX. We do modify the data through the
kernel linear mapping while user-space can read it through its own. I should
probably return an error when used with any of the arch ARM || SPARC || MIPS,
until cpu_dcache_is_aliasing() introduces a fine-grain differentiation.


You might want to use LTTng's ring buffer approach instead. See

https://github.com/lttng/lttng-modules/blob/master/src/lib/ringbuffer/ring_buffer_frontend.c#L1202

lib_ring_buffer_flush_read_subbuf_dcache()

Basically, whenever user-space grabs a sub-buffer for reading (through
lttng-modules's LTTNG_KERNEL_ABI_RING_BUFFER_GET_SUBBUF ioctl), lttng
calls flush_dcache_page() on all pages of this subbuffer (I should
really change this to a call to flush_dcache_folio(), which would be
more efficient).

Note that doing this is not very efficient on architectures which have
coherent data caches and incoherent dcache vs icache: in that case,
we issue the flush_dcache_page uselessly. I plan on using the new
cpu_dcache_is_aliasing() check once/if it makes it way upstream to
remove those useless flushes on architectures which define
ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE, but do not virtually alias the
data cache.

The equivalent of LTTng's "get subbuf" operation would be
the new TRACE_MMAP_IOCTL_GET_READER ioctl in ftrace AFAIU.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v14 4/6] tracing: Allow user-space mapping of the ring-buffer

2024-02-05 Thread Mathieu Desnoyers
 vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+   vmf->page->index = vmf->pgoff;
+
+   return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+   struct trace_array *tr = iter->tr;
+
+   ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file);
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+   spin_lock(&tr->snapshot_trigger_lock);
+   if (!WARN_ON(!tr->mapped))
+   tr->mapped--;
+   spin_unlock(&tr->snapshot_trigger_lock);
+#endif
+}
+
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file));
+}
+
+static const struct vm_operations_struct tracing_buffers_vmops = {
+   .open   = tracing_buffers_mmap_open,
+   .close  = tracing_buffers_mmap_close,
+   .fault  = tracing_buffers_mmap_fault,
+};
+
+static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = filp->private_data;
+   struct trace_iterator *iter = >iter;
+   struct trace_array *tr = iter->tr;
+   int ret = 0;
+
+   if (vma->vm_flags & VM_WRITE)
+   return -EPERM;
+
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE);
+   vma->vm_ops = &tracing_buffers_vmops;
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+   /*
+* We hold mmap_lock here. lockdep would be unhappy if we would now take
+* trace_types_lock. Instead use the specific snapshot_trigger_lock.
+*/
+   spin_lock(&tr->snapshot_trigger_lock);
+   if (tr->snapshot || tr->mapped == UINT_MAX) {
+   spin_unlock(&tr->snapshot_trigger_lock);
+   return -EBUSY;
+   }
+   tr->mapped++;
+   spin_unlock(&tr->snapshot_trigger_lock);
+
+   /* Wait for update_max_tr() to observe iter->tr->mapped */
+   if (tr->mapped == 1)
+   synchronize_rcu();
+#endif
+   ret = ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file);
+#ifdef CONFIG_TRACER_MAX_TRACE
+   if (ret) {
+   spin_lock(&tr->snapshot_trigger_lock);
+   iter->tr->mapped--;
+   spin_unlock(&tr->snapshot_trigger_lock);
+   }
+#endif
+   return ret;
+}
+
  static const struct file_operations tracing_buffers_fops = {
.open   = tracing_buffers_open,
.read   = tracing_buffers_read,
@@ -8681,6 +8794,7 @@ static const struct file_operations tracing_buffers_fops 
= {
.splice_read= tracing_buffers_splice_read,
.unlocked_ioctl = tracing_buffers_ioctl,
.llseek = no_llseek,
+   .mmap   = tracing_buffers_mmap,
  };
  
  static ssize_t

diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index bd312e9afe25..8a96e7a89e6b 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -336,6 +336,7 @@ struct trace_array {
	bool			allocated_snapshot;
	spinlock_t		snapshot_trigger_lock;
	unsigned int		snapshot;
+	unsigned int		mapped;
	unsigned long		max_latency;
  #ifdef CONFIG_FSNOTIFY
struct dentry   *d_max_latency;


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




[RFC PATCH v4 11/12] dcssblk: Cleanup alloc_dax() error handling

2024-02-02 Thread Mathieu Desnoyers
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 drivers/s390/block/dcssblk.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index a3010849bfed..f363c1d51d9a 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -679,8 +679,8 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char
goto put_dev;
 
	dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (IS_ERR(dax_dev)) {
+   rc = PTR_ERR(dax_dev);
goto put_dev;
}
set_dax_synchronous(dax_dev);
-- 
2.39.2




[RFC PATCH v4 12/12] virtio: Cleanup alloc_dax() error handling

2024-02-02 Thread Mathieu Desnoyers
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 fs/fuse/virtio_fs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 621b1bca2d55..a28466c2da71 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -809,8 +809,8 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, 
struct virtio_fs *fs)
return 0;
 
	dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   int rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (IS_ERR(dax_dev)) {
+   int rc = PTR_ERR(dax_dev);
return rc == -EOPNOTSUPP ? 0 : rc;
}
 
-- 
2.39.2




[RFC PATCH v4 09/12] nvdimm/pmem: Cleanup alloc_dax() error handling

2024-02-02 Thread Mathieu Desnoyers
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f1d9f5c6dbac..e9898457a7bd 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -558,8 +558,8 @@ static int pmem_attach_disk(struct device *dev,
disk->bb = >bb;
 
	dax_dev = alloc_dax(pmem, &pmem_dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (IS_ERR(dax_dev)) {
+   rc = PTR_ERR(dax_dev);
if (rc != -EOPNOTSUPP)
goto out;
} else {
-- 
2.39.2




[RFC PATCH v4 10/12] dm: Cleanup alloc_dax() error handling

2024-02-02 Thread Mathieu Desnoyers
Now that alloc_dax() returns ERR_PTR(-EOPNOTSUPP) rather than NULL,
the callers do not have to handle NULL return values anymore.

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/md/dm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 2fc22cae9089..acdc00bc05be 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2124,8 +2124,8 @@ static struct mapped_device *alloc_dev(int minor)
sprintf(md->disk->disk_name, "dm-%d", minor);
 
dax_dev = alloc_dax(md, _dax_ops);
-   if (IS_ERR_OR_NULL(dax_dev)) {
-   if (IS_ERR(dax_dev) && PTR_ERR(dax_dev) != -EOPNOTSUPP)
+   if (IS_ERR(dax_dev)) {
+   if (PTR_ERR(dax_dev) != -EOPNOTSUPP)
goto bad;
} else {
set_dax_nocache(dax_dev);
-- 
2.39.2




[RFC PATCH v4 07/12] Introduce cpu_dcache_is_aliasing() across all architectures

2024-02-02 Thread Mathieu Desnoyers
Introduce a generic way to query whether the data cache is virtually
aliased on all architectures. Its purpose is to ensure that subsystems
which are incompatible with virtually aliased data caches (e.g. FS_DAX)
can reliably query this.

For data cache aliasing, there are three scenarios depending on the
architecture. Here is a breakdown based on my understanding:

A) The data cache is always aliasing:

* arc
* csky
* m68k (note: shared memory mappings are incoherent? SHMLBA is missing there.)
* sh
* parisc

B) The data cache aliasing is statically known or depends on querying CPU
   state at runtime:

* arm (cache_is_vivt() || cache_is_vipt_aliasing())
* mips (cpu_has_dc_aliases)
* nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
* sparc32 (vac_cache_size > PAGE_SIZE)
* sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
* xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)

C) The data cache is never aliasing:

* alpha
* arm64 (aarch64)
* hexagon
* loongarch (but with incoherent write buffers, which are disabled since
  commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
* microblaze
* openrisc
* powerpc
* riscv
* s390
* um
* x86

Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
implement "cpu_dcache_is_aliasing()".

Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
cpu_dcache_is_aliasing() simply evaluates to "false".

Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
work. This would be useful to gate features like XIP on architectures
which have aliasing CPU dcache-icache but not CPU dcache-dcache.

Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
to clarify that we really mean "CPU data cache" and "CPU cache" to
eliminate any possible confusion with VFS "dentry cache" and "page
cache".
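The grouping above boils down to a simple preprocessor override scheme. Here is a minimal userspace model of it (assumption: a simplified sketch for illustration, not the kernel's actual include/linux/cacheinfo.h or asm/cachetype.h headers): architectures in groups A and B define the macro in their asm/cachetype.h, and the generic header supplies a constant-false fallback for everyone else.

```c
#include <stdbool.h>

/* An arch header (asm/cachetype.h) would provide the predicate, e.g.
 * on arc:
 *     #define cpu_dcache_is_aliasing() true
 * or, on arm, expand to a runtime check along the lines of:
 *     #define cpu_dcache_is_aliasing() \
 *         (cache_is_vivt() || cache_is_vipt_aliasing())
 */

/* Generic fallback: group C architectures never define the macro, so
 * the predicate is a compile-time constant false and any
 * aliasing-handling branch guarded by it can be optimized away. */
#ifndef cpu_dcache_is_aliasing
#define cpu_dcache_is_aliasing() false
#endif
```

Because the fallback is a macro rather than a weak function, `if (cpu_dcache_is_aliasing())` costs nothing on non-aliasing architectures.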

Link: https://lore.kernel.org/lkml/20030910210416.ga24...@mail.jlokier.co.uk/
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 include/linux/cacheinfo.h   |  6 ++
 mm/Kconfig  |  6 ++
 22 files changed, 112 insertions(+)
 create mode 100644 arch/arc/include/asm/cachetype.h
 create mode 100644 arch/csky/include/asm/cachetype.h
 create mode 100644 arch/m68k/include/asm/cachetype.h
 create mode 100644 arch/mips/include/asm/cachetype.h
 create mode 100644 arch/nios2/include/asm/cachetype.h
 create mode 100644 arch/parisc/include/asm/cachetype.h
 create mode 100644 arch/sh/include/asm/cachetype.h
 create mode 100644 arch/sparc/include/asm/cachetype.h
 create mode 100644 arch/xtensa/include/asm/cachetype.h

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index 1b0483c51cc1..7d294a3242a4 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -6,6 +6,7 @@
 config ARC
def_bool y
select ARC_TIMERS
+   select ARCH_HAS_CPU_CACHE_ALIASING
select ARCH_HAS_CACHE_LINE_SIZE
select ARCH_HAS_DEBUG_VM_PGTABLE
select ARCH_HAS_DMA_PREP_COHERENT
diff --git a/arch/arc/include/asm/cachetype.h b/arch/arc/include/asm/cachetype.h
new file mode 100644
index ..05fc7ed59712
--- /dev/null
+++ b/arch/arc/include/asm/cachetype.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_ARC_CACHETYPE_H
+#define __ASM_ARC_CACHETYPE_H
+
+#include <linux/types.h>
+
+#define cpu_dcache_is_aliasing()   true
+
+#endif
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index f8567e95f98b..cd13b1788973 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -5,6 +5,7 @@ conf

[RFC PATCH v4 08/12] dax: Fix incorrect list of data cache aliasing architectures

2024-02-02 Thread Mathieu Desnoyers
commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
prevents DAX from building on architectures with virtually aliased
dcache with:

  depends on !(ARM || MIPS || SPARC)

This check is too broad (e.g. recent ARMv7 don't have virtually aliased
dcaches), and also misses many other architectures with virtually
aliased data cache.

This is a regression introduced in the v4.0 Linux kernel where the
dax mount option is removed for 32-bit ARMv7 boards which have no data
cache aliasing, and therefore should work fine with FS_DAX.

This was turned into the following check in alloc_dax() by a preparatory
change:

if (ops && (IS_ENABLED(CONFIG_ARM) ||
IS_ENABLED(CONFIG_MIPS) ||
IS_ENABLED(CONFIG_SPARC)))
return NULL;

Use cpu_dcache_is_aliasing() instead to figure out whether the environment
has aliasing data caches.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ce5bffa86bba..a21a7c262382 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include <linux/cacheinfo.h>
 #include "dax-private.h"
 
 /**
@@ -455,9 +456,7 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
 * except for device-dax (NULL operations pointer), which does
 * not use aliased mappings from the kernel.
 */
-   if (ops && (IS_ENABLED(CONFIG_ARM) ||
-   IS_ENABLED(CONFIG_MIPS) ||
-   IS_ENABLED(CONFIG_SPARC)))
+   if (ops && cpu_dcache_is_aliasing())
return ERR_PTR(-EOPNOTSUPP);
 
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
-- 
2.39.2




[RFC PATCH v4 05/12] virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-02 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of virtio
virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
non-fatal.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 
Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 fs/fuse/virtio_fs.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce..621b1bca2d55 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include <linux/cleanup.h>
 #include 
 #include "fuse_i.h"
 
@@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
put_dax(dax_dev);
 }
 
+DEFINE_FREE(cleanup_dax, struct dax_device *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T))
+
 static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs 
*fs)
 {
+   struct dax_device *dax_dev __free(cleanup_dax) = NULL;
struct virtio_shm_region cache_reg;
struct dev_pagemap *pgmap;
bool have_cache;
@@ -804,6 +808,12 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
if (!IS_ENABLED(CONFIG_FUSE_DAX))
return 0;
 
+   dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   int rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   return rc == -EOPNOTSUPP ? 0 : rc;
+   }
+
/* Get cache region */
have_cache = virtio_get_shm_region(vdev, &cache_reg,
   (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
@@ -849,10 +859,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n",
__func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
 
-   fs->dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
-   if (IS_ERR(fs->dax_dev))
-   return PTR_ERR(fs->dax_dev);
-
+   fs->dax_dev = no_free_ptr(dax_dev);
return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax,
fs->dax_dev);
 }
-- 
2.39.2
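The DEFINE_FREE()/__free()/no_free_ptr() pattern the patch above relies on can be modeled in plain C with the compiler's `cleanup` variable attribute. This is a userspace sketch (assumption: `cleanup_dax`, `__free_dax`, `no_free_ptr_model`, `setup_fail`, and `setup_ok` are invented stand-ins for illustration, not the kernel's linux/cleanup.h macros):

```c
static int cleanups; /* counts how often the cleanup handler fired */

/* Stands in for virtio_fs_cleanup_dax(): runs when an owning pointer
 * goes out of scope still holding a resource. */
static void cleanup_dax(void **p)
{
	if (*p)
		cleanups++;
}
#define __free_dax __attribute__((cleanup(cleanup_dax)))

/* Model of no_free_ptr(): transfer ownership out of the guarded
 * variable so the scope-exit cleanup becomes a no-op. */
static void *no_free_ptr_model(void **p)
{
	void *val = *p;
	*p = 0;
	return val;
}

static int token; /* dummy resource */

/* Error path: returning early releases the resource automatically,
 * which is what makes the patch's early "return rc" exits leak-free. */
static int setup_fail(void)
{
	void *dax __free_dax = &token;
	return -1; /* cleanup_dax() runs here */
}

/* Success path: ownership is handed to the caller and the cleanup is
 * suppressed, mirroring "fs->dax_dev = no_free_ptr(dax_dev);". */
static void *setup_ok(void)
{
	void *dax __free_dax = &token;
	return no_free_ptr_model(&dax);
}
```

The `cleanup` attribute is a GCC/Clang extension, which is what the kernel's scope-based cleanup helpers are built on.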




[RFC PATCH v4 06/12] dax: Check for data cache aliasing at runtime

2024-02-02 Thread Mathieu Desnoyers
Replace the following fs/Kconfig:FS_DAX dependency:

  depends on !(ARM || MIPS || SPARC)

By a runtime check within alloc_dax(). This runtime check returns
ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means
the kernel is using an aliased mapping) on an architecture which
has data cache aliasing.

Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
CONFIG_DAX=n for consistency.

This is done in preparation for using cpu_dcache_is_aliasing() in a
following change which will properly support architectures which detect
data cache aliasing at runtime.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/dax/super.c | 15 +++
 fs/Kconfig  |  1 -
 include/linux/dax.h |  6 +-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea175..ce5bffa86bba 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -319,6 +319,11 @@ EXPORT_SYMBOL_GPL(dax_alive);
  * that any fault handlers or operations that might have seen
  * dax_alive(), have completed.  Any operations that start after
  * synchronize_srcu() has run will abort upon seeing !dax_alive().
+ *
+ * Note, because alloc_dax() returns an ERR_PTR() on error, callers
+ * typically store its result into a local variable in order to check
+ * the result. Therefore, care must be taken to populate the struct
+ * device dax_dev field to make sure the dax_dev is not leaked.
  */
 void kill_dax(struct dax_device *dax_dev)
 {
@@ -445,6 +450,16 @@ struct dax_device *alloc_dax(void *private, const struct dax_operations *ops)
dev_t devt;
int minor;
 
+   /*
+* Unavailable on architectures with virtually aliased data caches,
+* except for device-dax (NULL operations pointer), which does
+* not use aliased mappings from the kernel.
+*/
+   if (ops && (IS_ENABLED(CONFIG_ARM) ||
+   IS_ENABLED(CONFIG_MIPS) ||
+   IS_ENABLED(CONFIG_SPARC)))
+   return ERR_PTR(-EOPNOTSUPP);
+
if (WARN_ON_ONCE(ops && !ops->zero_page_range))
return ERR_PTR(-EINVAL);
 
diff --git a/fs/Kconfig b/fs/Kconfig
index 42837617a55b..e5efdb3b276b 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -56,7 +56,6 @@ endif # BLOCK
 config FS_DAX
bool "File system based Direct Access (DAX) support"
depends on MMU
-   depends on !(ARM || MIPS || SPARC)
depends on ZONE_DEVICE || FS_DAX_LIMITED
select FS_IOMAP
select DAX
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b463502b16e1..df2d52b8a245 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -86,11 +86,7 @@ static inline void *dax_holder(struct dax_device *dax_dev)
 static inline struct dax_device *alloc_dax(void *private,
const struct dax_operations *ops)
 {
-   /*
-* Callers should check IS_ENABLED(CONFIG_DAX) to know if this
-* NULL is an error or expected.
-*/
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 static inline void put_dax(struct dax_device *dax_dev)
 {
-- 
2.39.2




[RFC PATCH v4 04/12] dcssblk: Handle alloc_dax() -EOPNOTSUPP failure

2024-02-02 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dcssblk
dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures.

Considering that s390 is not a data cache aliasing architecture,
and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP
from alloc_dax() should make dcssblk_add_store() fail.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org
---
 drivers/s390/block/dcssblk.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 4b7ecd4fd431..a3010849bfed 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -549,6 +549,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
+   struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
 
@@ -677,13 +678,13 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
if (rc)
goto put_dev;
 
-   dev_info->dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
-   if (IS_ERR(dev_info->dax_dev)) {
-   rc = PTR_ERR(dev_info->dax_dev);
-   dev_info->dax_dev = NULL;
+   dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
goto put_dev;
}
-   set_dax_synchronous(dev_info->dax_dev);
+   set_dax_synchronous(dax_dev);
+   dev_info->dax_dev = dax_dev;
rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
if (rc)
goto out_dax;
-- 
2.39.2




[RFC PATCH v4 00/12] Introduce cpu_dcache_is_aliasing() to fix DAX regression

2024-02-02 Thread Mathieu Desnoyers
This commit, introduced in v4.0, prevents building FS_DAX on 32-bit ARM,
even on ARMv7 which does not have virtually aliased data caches:

commit d92576f1167c ("dax: does not work correctly with virtual aliasing caches")

It used to work fine before: I have customers using DAX over pmem on
ARMv7, but this regression will likely prevent them from upgrading their
kernel.

The root of the issue here is the fact that DAX was never designed to
handle virtually aliasing data caches (VIVT and VIPT with aliasing data
cache). It touches the pages through their linear mapping, which is not
consistent with the userspace mappings with virtually aliasing data
caches.

This patch series introduces cpu_dcache_is_aliasing() with the new
Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
architectures. The implementation of cpu_dcache_is_aliasing() is either
evaluated to a constant at compile-time or a runtime check, which is
what is needed on ARM.

With this we can basically narrow down the list of architectures which
are unsupported by DAX to those which are really affected.

Testing done so far:

- Compile allyesconfig on x86-64,
- Compile allyesconfig on x86-64, with FS_DAX=n.
- Compile allyesconfig on x86-64, with DAX=n.
- Boot test after modifying alloc_dax() to force returning -EOPNOTSUPP
  even on x86-64, thus simulating the behavior expected on an
  architecture with data cache aliasing.
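The boot test in the last bullet — forcing alloc_dax() to fail as it would on an aliasing architecture and checking that callers survive — can be sketched in a userspace model. Everything here is an assumption-laden stand-in for illustration: the `ERR_PTR`/`IS_ERR` stubs mirror the kernel's err.h conventions, and `attach()` compresses the caller pattern used by patches 02-05.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stubs mirroring the kernel's ERR_PTR machinery: error
 * codes are encoded in the top 4095 values of the pointer range. */
#define MAX_ERRNO 4095
static void *ERR_PTR(long error) { return (void *)error; }
static long PTR_ERR(const void *ptr) { return (long)ptr; }
static bool IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

struct dax_device;

/* Simulated alloc_dax() on an aliasing architecture: any caller that
 * passes a non-NULL ops pointer (i.e. uses kernel-side mappings) is
 * refused, as in "if (ops && cpu_dcache_is_aliasing())". */
static struct dax_device *alloc_dax(void *private, const void *ops)
{
	(void)private;
	if (ops)
		return ERR_PTR(-EOPNOTSUPP);
	return NULL; /* device-dax path elided in this sketch */
}

/* Caller treating -EOPNOTSUPP as non-fatal: the device comes up
 * without DAX instead of failing to attach. */
static int attach(void)
{
	static const int dummy_ops = 0; /* placeholder ops table */
	struct dax_device *dax_dev = alloc_dax(NULL, &dummy_ops);

	if (IS_ERR(dax_dev)) {
		int rc = (int)PTR_ERR(dax_dev);

		if (rc != -EOPNOTSUPP)
			return rc;
		/* continue without DAX support */
	}
	return 0;
}
```

Note the asymmetry discussed later in the thread: the device's dax_dev field is only populated on success, so it stays NULL on the non-fatal path.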

There are many more axes to test however. I would welcome Tested-by for:

- affected architectures,
- affected drivers,
- affected filesystems.

Thanks,

Mathieu

Changes since v3:
- Fix a leak on dax_add_host() failure in nvdimm/pmem.
- Split the series into a bisectable sequence of changes.
- Ensure that device-dax use-cases still works on data cache
  aliasing architectures.

Changes since v2:
- Move DAX supported runtime check to alloc_dax(),
- Modify DM to handle alloc_dax() error as non-fatal,
- Remove all filesystem modifications, since the check is now done by
  alloc_dax(),
- rename "dcache" and "cache" to "cpu dcache" and "cpu cache" to
  eliminate confusion with VFS terminology.

Changes since v1:
- The order of the series was completely changed based on the
  feedback received on v1,
- cache_is_aliasing() is renamed to dcache_is_aliasing(),
- ARCH_HAS_CACHE_ALIASING_DYNAMIC is gone,
- dcache_is_aliasing() vs ARCH_HAS_CACHE_ALIASING relationship is
  simplified,
- the dax_is_supported() check was moved to its rightful place in all
  filesystems.

Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
Cc: linux-s...@vger.kernel.org

Mathieu Desnoyers (12):
  nvdimm/pmem: Fix leak on dax_add_host() failure
  nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dcssblk: Handle alloc_dax() -EOPNOTSUPP failure
  virtio: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal
  dax: Check for data cache aliasing at runtime
  Introduce cpu_dcache_is_aliasing() across all architectures
  dax: Fix incorrect list of data cache aliasing architectures
  nvdimm/pmem: Cleanup alloc_dax() error handling
  dm: Cleanup alloc_dax() error handling
  dcssblk: Cleanup alloc_dax() error handling
  virtio: Cleanup alloc_dax() error handling

 arch/arc/Kconfig|  1 +
 arch/arc/include/asm/cachetype.h|  9 +
 arch/arm/Kconfig|  1 +
 arch/arm/include/asm/cachetype.h|  2 ++
 arch/csky/Kconfig   |  1 +
 arch/csky/include/asm/cachetype.h   |  9 +
 arch/m68k/Kconfig   |  1 +
 arch/m68k/include/asm/cachetype.h   |  9 +
 arch/mips/Kconfig   |  1 +
 arch/mips/include/asm/cachetype.h   |  9 +
 arch/nios2/Kconfig  |  1 +
 arch/nios2/include/asm/cachetype.h  | 10 ++
 arch/parisc/Kconfig |  1 +
 arch/parisc/include/asm/cachetype.h |  9 +
 arch/sh/Kconfig |  1 +
 arch/sh/include/asm/cachetype.h |  9 +
 arch/sparc/Kconfig  |  1 +
 arch/sparc/include/asm/cachetype.h  | 14 ++
 arch/xtensa/Kconfig |  1 +
 arch/xtensa/include/asm/cachetype.h | 10 ++
 drivers/dax/super.c | 14 ++
 drivers/md/dm.c | 17 +
 drivers/nvdimm/pmem.c   | 23 ---
 drivers/s390/block/dcssblk.c| 11 ++-
 fs/Kconfig  |  1 -
 fs/fuse/virtio_fs.c | 15 +++
 include/linux/cacheinfo.h   |  6 ++
 include/linux/dax.h |  6 +-
 mm/Kconfig  | 

[RFC PATCH v4 03/12] dm: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-02 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Suggested-by: Dan Williams 
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/md/dm.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 23c32cd1f1d8..2fc22cae9089 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2054,6 +2054,7 @@ static void cleanup_mapped_device(struct mapped_device *md)
 static struct mapped_device *alloc_dev(int minor)
 {
int r, numa_node_id = dm_get_numa_node();
+   struct dax_device *dax_dev;
struct mapped_device *md;
void *old_md;
 
@@ -2122,15 +2123,15 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
 
-   if (IS_ENABLED(CONFIG_FS_DAX)) {
-   md->dax_dev = alloc_dax(md, &dm_dax_ops);
-   if (IS_ERR(md->dax_dev)) {
-   md->dax_dev = NULL;
+   dax_dev = alloc_dax(md, &dm_dax_ops);
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   if (IS_ERR(dax_dev) && PTR_ERR(dax_dev) != -EOPNOTSUPP)
goto bad;
-   }
-   set_dax_nocache(md->dax_dev);
-   set_dax_nomc(md->dax_dev);
-   if (dax_add_host(md->dax_dev, md->disk))
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   md->dax_dev = dax_dev;
+   if (dax_add_host(dax_dev, md->disk))
goto bad;
}
 
-- 
2.39.2




[RFC PATCH v4 02/12] nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal

2024-02-02 Thread Mathieu Desnoyers
In preparation for checking whether the architecture has data cache
aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

For the transition, consider that alloc_dax() returning NULL is the
same as returning -EOPNOTSUPP.

Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 9fe358090720..f1d9f5c6dbac 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -558,19 +558,21 @@ static int pmem_attach_disk(struct device *dev,
disk->bb = &pmem->bb;
 
dax_dev = alloc_dax(pmem, &pmem_dax_ops);
-   if (IS_ERR(dax_dev)) {
-   rc = PTR_ERR(dax_dev);
-   goto out;
+   if (IS_ERR_OR_NULL(dax_dev)) {
+   rc = IS_ERR(dax_dev) ? PTR_ERR(dax_dev) : -EOPNOTSUPP;
+   if (rc != -EOPNOTSUPP)
+   goto out;
+   } else {
+   set_dax_nocache(dax_dev);
+   set_dax_nomc(dax_dev);
+   if (is_nvdimm_sync(nd_region))
+   set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
+   rc = dax_add_host(dax_dev, disk);
+   if (rc)
+   goto out_cleanup_dax;
+   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
}
-   set_dax_nocache(dax_dev);
-   set_dax_nomc(dax_dev);
-   if (is_nvdimm_sync(nd_region))
-   set_dax_synchronous(dax_dev);
-   pmem->dax_dev = dax_dev;
-   rc = dax_add_host(dax_dev, disk);
-   if (rc)
-   goto out_cleanup_dax;
-   dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2




[RFC PATCH v4 01/12] nvdimm/pmem: Fix leak on dax_add_host() failure

2024-02-02 Thread Mathieu Desnoyers
Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:

out_cleanup_dax:
kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);

Signed-off-by: Mathieu Desnoyers 
Cc: Alasdair Kergon 
Cc: Mike Snitzer 
Cc: Mikulas Patocka 
Cc: Andrew Morton 
Cc: Linus Torvalds 
Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Matthew Wilcox 
Cc: Arnd Bergmann 
Cc: Russell King 
Cc: linux-a...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux...@kvack.org
Cc: linux-...@vger.kernel.org
Cc: dm-devel@lists.linux.dev
Cc: nvd...@lists.linux.dev
---
 drivers/nvdimm/pmem.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..9fe358090720 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
set_dax_nomc(dax_dev);
if (is_nvdimm_sync(nd_region))
set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
rc = dax_add_host(dax_dev, disk);
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
-   pmem->dax_dev = dax_dev;
-
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;
-- 
2.39.2




Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

2024-02-02 Thread Mathieu Desnoyers

On 2024-02-02 15:14, Dan Williams wrote:

Mathieu Desnoyers wrote:
[..]

Thanks for that. All of those need to be done before the fs goes live
later in virtio_device_ready(), but before that point nothing should be
calling into virtio_fs_dax_ops, so as far as I can see it is safe to
change the order.


Sounds good, I'll do that.

I will soon be ready to send out a RFC v4, which is still only
compiled-tested. Do you happen to have some kind of test suite
you can use to automate some of the runtime testing ?


There is a test suite for the pmem, dm, and dax changes
(https://github.com/pmem/ndctl?tab=readme-ov-file#unit-tests), but not
automated unfortunately. The NVDIMM maintainer team will run that before
pushing patches out to the fixes branch if you just want to lean on
that. For the rest I think we will need to depend on tested-by's from
s390 + virtio_fs folks, and / or sufficient soak time in linux-next.


I suspect this will be necessary. There are just so many combinations
of architectures, drivers and filesystems involved here that I don't
think it is realistic to try to do all this testing manually on my own.

I prefer to voice this up front, so there are no misplaced expectations
about testing.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

2024-02-02 Thread Mathieu Desnoyers

On 2024-02-02 14:41, Dan Williams wrote:

Mathieu Desnoyers wrote:

On 2024-02-02 12:37, Dan Williams wrote:

Mathieu Desnoyers wrote:

[...]





The alternative route I intend to take is to audit all callers
of alloc_dax() and make sure they all save the alloc_dax() return
value in a struct dax_device * local variable first for the sake
of checking for IS_ERR(). This will leave the xyz->dax_dev pointer
initialized to NULL in the error case and simplify the rest of
error checking.


I could maybe get on board with that, but it needs a comment somewhere
about the asymmetric subtlety.


Is this "somewhere" at every alloc_dax() call site, or do you have
something else in mind ?


At least kill_dax() should mention the asymmetry I think.


Here is what I intend to add:

 * Note, because alloc_dax() returns an ERR_PTR() on error, callers
 * typically store its result into a local variable in order to check
 * the result. Therefore, care must be taken to populate the struct
 * device dax_dev field to make sure the dax_dev is not leaked.




The real question is what to do about device-dax. I *think* it is not
affected by cpu_dcache aliasing because it never accesses user mappings
through a kernel alias. I doubt device-dax is in use on these platforms,
but we might need another fixup for that if someone screams about the
alloc_dax() behavior change making them lose device-dax access.


By "device-dax", I understand you mean drivers/dax/Kconfig:DEV_DAX.

Based on your analysis, is alloc_dax() still the right spot where
to place this runtime check ? Which call sites are responsible
for invoking alloc_dax() for device-dax ?


That is in devm_create_dev_dax().


If we know which call sites do not intend to use the kernel linear
mapping, we could introduce a flag (or a new variant of the alloc_dax()
API) that would either enforce or skip the check.


Hmmm, it looks like there is already a natural flag for that. If
alloc_dax() is passed a NULL operations pointer it means there are no
kernel usages of the aliased mapping. That actually fits rather nicely.


Good, I was reaching the same conclusion when I received your reply.
I'll do that. It ends up being:

/*
 * Unavailable on architectures with virtually aliased data caches,
 * except for device-dax (NULL operations pointer), which does
 * not use aliased mappings from the kernel.
 */
if (ops && cpu_dcache_is_aliasing())
return ERR_PTR(-EOPNOTSUPP);



[..]

@@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
if (!IS_ENABLED(CONFIG_FUSE_DAX))
return 0;

+	dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);

+   if (IS_ERR(dax_dev)) {
+   int rc = PTR_ERR(dax_dev);
+
+   if (rc == -EOPNOTSUPP)
+   return 0;
+   return rc;
+   }


What is gained by moving this allocation here ?


The gain is to fail early in virtio_fs_setup_dax() since the fundamental
dependency of alloc_dax() success is not met. For example why let the
setup progress to devm_memremap_pages() when alloc_dax() is going to
return ERR_PTR(-EOPNOTSUPP).


What I don't know is whether there is a dependency requiring to do
devm_request_mem_region(), devm_kzalloc(), devm_memremap_pages()
before calling alloc_dax() ?

Those 3 calls are used to populate:

  fs->window_phys_addr = (phys_addr_t) cache_reg.addr;
  fs->window_len = (phys_addr_t) cache_reg.len;

and then alloc_dax() takes "fs" as private data parameter. So it's
unclear to me whether we can swap the invocation order. I suspect
that it is not an issue because it is only used to populate
dax_dev->private, but I prefer to confirm this with you just to be
on the safe side.


Thanks for that. All of those need to be done before the fs goes live
later in virtio_device_ready(), but before that point nothing should be
calling into virtio_fs_dax_ops, so as far as I can see it is safe to
change the order.


Sounds good, I'll do that.

I will soon be ready to send out a RFC v4, which is still only
compiled-tested. Do you happen to have some kind of test suite
you can use to automate some of the runtime testing ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

2024-02-02 Thread Mathieu Desnoyers

On 2024-02-02 12:37, Dan Williams wrote:

Mathieu Desnoyers wrote:

[...]





The alternative route I intend to take is to audit all callers
of alloc_dax() and make sure they all save the alloc_dax() return
value in a struct dax_device * local variable first for the sake
of checking for IS_ERR(). This will leave the xyz->dax_dev pointer
initialized to NULL in the error case and simplify the rest of
error checking.


I could maybe get on board with that, but it needs a comment somewhere
about the asymmetric subtlety.


Is this "somewhere" at every alloc_dax() call site, or do you have
something else in mind ?







return;
   
   	if (dax_dev->holder_data != NULL)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..b69c9e442cf4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
dax_dev = alloc_dax(pmem, &pmem_dax_ops);
if (IS_ERR(dax_dev)) {
rc = PTR_ERR(dax_dev);
-   goto out;
+   if (rc != -EOPNOTSUPP)
+   goto out;


Comparing before / after this change: previously, if pmem_attach_disk()
was called in a configuration with FS_DAX=n, it would result in a NULL
pointer dereference.


No, alloc_dax() only returns NULL in the CONFIG_DAX=n case, not the
CONFIG_FS_DAX=n case.


Indeed, I was wrong there.


So that means that pmem devices on ARM have been
possible without FS_DAX. So, in order for alloc_dax() returning
ERR_PTR(-EOPNOTSUPP) to not regress pmem device availability this error
path needs to be changed.

Good point. We're moving the "depends on !(ARM || MIPS || SPARC)" from the
FS_DAX Kconfig to a runtime check in alloc_dax(), which is used whenever
DAX=y, including configurations that previously had FS_DAX=n.

I'll change the error path in pmem_attach_disk() to treat the -EOPNOTSUPP
alloc_dax() return value as non-fatal.
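The non-fatal handling could look like the following user-space sketch (the ERR_PTR macros, `alloc_dax_stub()` and `pmem_attach_sketch()` are stand-ins, not the real kernel code; only the control flow mirrors the proposed change): -EOPNOTSUPP degrades to a pmem block device without DAX, while any other error still aborts the probe.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define ERR_PTR(err) ((void *)(long)(err))
#define PTR_ERR(ptr) ((long)(ptr))
#define IS_ERR(ptr)  ((unsigned long)(ptr) >= (unsigned long)-4095)

struct dax_device { int dummy; };

/* Stub for alloc_dax(): returns ERR_PTR(err) when err is nonzero. */
static struct dax_device *alloc_dax_stub(long err)
{
	static struct dax_device dev;

	return err ? ERR_PTR(err) : &dev;
}

/*
 * Mirrors the proposed pmem_attach_disk() error path: DAX support is
 * optional, so -EOPNOTSUPP falls through to a working block device,
 * while any other error remains fatal for the probe.
 */
static long pmem_attach_sketch(long err, int *has_dax)
{
	struct dax_device *dax_dev = alloc_dax_stub(err);

	*has_dax = 0;
	if (IS_ERR(dax_dev)) {
		long rc = PTR_ERR(dax_dev);

		if (rc != -EOPNOTSUPP)
			return rc;	/* fatal: abort the probe */
		return 0;		/* no DAX, but pmem still works */
	}
	*has_dax = 1;
	return 0;
}
```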




This would be an error handling fix all by itself. Do we really want
to return successfully if dax is unsupported, or should we return
an error here ?


Per above, there is no error handling fix needed, and pmem block device
availability should not depend on alloc_dax() succeeding.


I agree on treating alloc_dax() failure as non-fatal. There is
however one error handling fix to nvdimm/pmem which I plan to
introduce as an initial patch before this change:

nvdimm/pmem: Fix leak on dax_add_host() failure

Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done

before setting pmem->dax_dev, which therefore issues the two following
calls on NULL pointers:

out_cleanup_dax:

kill_dax(pmem->dax_dev);
put_dax(pmem->dax_dev);

    Signed-off-by: Mathieu Desnoyers 


diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..9fe358090720 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -566,12 +566,11 @@ static int pmem_attach_disk(struct device *dev,
set_dax_nomc(dax_dev);
if (is_nvdimm_sync(nd_region))
set_dax_synchronous(dax_dev);
+   pmem->dax_dev = dax_dev;
rc = dax_add_host(dax_dev, disk);
if (rc)
goto out_cleanup_dax;
dax_write_cache(dax_dev, nvdimm_has_cache(nd_region));
-   pmem->dax_dev = dax_dev;
-
rc = device_add_disk(dev, disk, pmem_attribute_groups);
if (rc)
goto out_remove_host;



The real question is what to do about device-dax. I *think* it is not
affected by cpu_dcache aliasing because it never accesses user mappings
through a kernel alias. I doubt device-dax is in use on these platforms,
but we might need another fixup for that if someone screams about the
alloc_dax() behavior change making them lose device-dax access.


By "device-dax", I understand you mean drivers/dax/Kconfig:DEV_DAX.

Based on your analysis, is alloc_dax() still the right spot where
to place this runtime check ? Which call sites are responsible
for invoking alloc_dax() for device-dax ?

If we know which call sites do not intend to use the kernel linear
mapping, we could introduce a flag (or a new variant of the alloc_dax()
API) that would either enforce or skip the check.

[...]



Here what I'm seeing so far:

- devm_release_mem_region() is never called after devm_request_mem_region(),
neither on error nor on teardown,


devm_release_mem_region() is called from virtio_fs_probe() context. That


I guess you mean "devm_request_mem_region()" here.


means that when virtio_fs_probe() returns an error the driver core will
automatically call devm_request_mem_region().


And "devm_release_mem_region()" here.




- pgmap is never freed on error after devm_kzalloc.


That is what the "devm_" in devm_kzalloc() does: free the memory on
driver-probe failure, or after the driver remove callback is invoked.


Got it.
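As a rough user-space analogy for that devm_ lifetime rule (entirely illustrative; the kernel's devres implementation in drivers/base/devres.c is far more involved, and `devm_zalloc()`/`devres_release_all()` here are invented names): allocations are recorded against the device and released in one pass when probe fails or the device goes away, so drivers need no per-allocation cleanup.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_RES 8

/* Toy "device" that tracks the resources allocated against it. */
struct device {
	void *res[MAX_RES];
	int nr_res;
};

/* devm_kzalloc() analogue: the allocation is owned by the device. */
static void *devm_zalloc(struct device *dev, size_t size)
{
	void *p = calloc(1, size);

	if (p && dev->nr_res < MAX_RES)
		dev->res[dev->nr_res++] = p;
	return p;
}

/*
 * Called by the "driver core" on probe failure or device removal:
 * every devm allocation is released in reverse order, with no
 * driver-specific cleanup code required.
 */
static void devres_release_all(struct device *dev)
{
	for (int i = dev->nr_res - 1; i >= 0; i--)
		free(dev->res[i]);
	dev->nr_res = 0;
}
```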






 

Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

2024-02-02 Thread Mathieu Desnoyers

On 2024-02-01 10:44, Mathieu Desnoyers wrote:

On 2024-01-31 17:18, Dan Williams wrote:


[...]



diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce..11053a70f5ab 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -16,6 +16,7 @@
  #include 
  #include 
  #include 
+#include <linux/cleanup.h>
  #include 
  #include "fuse_i.h"
@@ -795,8 +796,11 @@ static void virtio_fs_cleanup_dax(void *data)
  put_dax(dax_dev);
  }
+DEFINE_FREE(cleanup_dax, struct dax_dev *, if (!IS_ERR_OR_NULL(_T)) virtio_fs_cleanup_dax(_T))

+
  static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)


So either I'm completely missing how ownership works in this function, or
we should be really concerned about the fact that it does no actual
cleanup of anything on any error.

[...]


Here what I'm seeing so far:

- devm_release_mem_region() is never called after devm_request_mem_region(),
  neither on error nor on teardown,
- pgmap is never freed on error after devm_kzalloc.


I was indeed missing something: the devm_ family of functions
keeps ownership at the device level, so we would not need explicit
teardown.




  {
+    struct dax_device *dax_dev __free(cleanup_dax) = NULL;
  struct virtio_shm_region cache_reg;
  struct dev_pagemap *pgmap;
  bool have_cache;
@@ -804,6 +808,15 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)

  if (!IS_ENABLED(CONFIG_FUSE_DAX))
  return 0;
+    dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
+    if (IS_ERR(dax_dev)) {
+    int rc = PTR_ERR(dax_dev);
+
+    if (rc == -EOPNOTSUPP)
+    return 0;
+    return rc;
+    }


What is gained by moving this allocation here ?


I'm still concerned about moving the call to alloc_dax() before
the setup of the memory region it will use. Are those completely
independent ?




+
  /* Get cache region */
  have_cache = virtio_get_shm_region(vdev, &cache_reg,
 (u8)VIRTIO_FS_SHMCAP_ID_CACHE);
@@ -849,10 +862,7 @@ static int virtio_fs_setup_dax(struct virtio_device *vdev, struct virtio_fs *fs)
  dev_dbg(&vdev->dev, "%s: window kaddr 0x%px phys_addr 0x%llx len 0x%llx\n",
  __func__, fs->window_kaddr, cache_reg.addr, cache_reg.len);
-    fs->dax_dev = alloc_dax(fs, &virtio_fs_dax_ops);
-    if (IS_ERR(fs->dax_dev))
-    return PTR_ERR(fs->dax_dev);
-
+    fs->dax_dev = no_free_ptr(dax_dev);
  return devm_add_action_or_reset(&vdev->dev, virtio_fs_cleanup_dax,
  fs->dax_dev);
  }
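For readers unfamiliar with it, the DEFINE_FREE()/__free()/no_free_ptr() pattern in the diff is built on the compiler's `cleanup` variable attribute (GCC/Clang). Here is a minimal user-space approximation (the names `free_buf`, `no_free_ptr_sketch` and `make_or_fail` are invented for illustration): the scope-exit handler frees the pointer on every early return, and ownership escapes the scope only by stealing the pointer so the handler sees NULL.

```c
#include <assert.h>
#include <stdlib.h>

static int cleanup_calls;

/* Scope-exit handler: frees the buffer unless ownership was taken. */
static void free_buf(char **p)
{
	if (*p) {
		cleanup_calls++;
		free(*p);
	}
}

/* no_free_ptr() analogue: steal the pointer so cleanup sees NULL. */
static char *no_free_ptr_sketch(char **p)
{
	char *ret = *p;

	*p = NULL;
	return ret;
}

static char *make_or_fail(int fail)
{
	/* Freed automatically at scope exit unless ownership escapes. */
	char *buf __attribute__((cleanup(free_buf))) = malloc(16);

	if (!buf || fail)
		return NULL;	/* buf freed here by the cleanup handler */
	return no_free_ptr_sketch(&buf);	/* ownership escapes */
}
```

This is why the diff can return early on alloc_dax() failure without leaking, and why the success path must go through no_free_ptr() before storing into fs->dax_dev.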




[...]

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [RFC PATCH v3 2/4] dax: Check for data cache aliasing at runtime

2024-02-02 Thread Mathieu Desnoyers

On 2024-02-01 10:44, Mathieu Desnoyers wrote:
[...]

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 4e8fdcb3f1c8..b69c9e442cf4 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -560,17 +560,19 @@ static int pmem_attach_disk(struct device *dev,
  dax_dev = alloc_dax(pmem, &pmem_dax_ops);
  if (IS_ERR(dax_dev)) {
  rc = PTR_ERR(dax_dev);
-    goto out;
+    if (rc != -EOPNOTSUPP)
+    goto out;


Comparing before / after this change: previously, if pmem_attach_disk()
was called in a configuration with FS_DAX=n, it would result in a NULL
pointer dereference.


I was wrong. drivers/nvdimm/Kconfig has:

config BLK_DEV_PMEM
select DAX

and

drivers/nvdimm/Makefile has:

obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
nd_pmem-y := pmem.o

which means that anything in pmem.c can assume that alloc_dax() is
implemented.
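The reason "select DAX" guarantees an implemented alloc_dax() is the usual Kconfig header-stub pattern; a simplified, compilable illustration follows (the real declarations live in include/linux/dax.h and take more parameters, so treat this as a sketch, not the actual API):

```c
#include <assert.h>
#include <stddef.h>

/* What "select DAX" in Kconfig effectively guarantees at build time. */
#define CONFIG_DAX 1

struct dax_device { int dummy; };

#ifdef CONFIG_DAX
/* Real implementation (lives in drivers/dax/super.c in the kernel). */
static struct dax_device *alloc_dax(void *private)
{
	static struct dax_device dev;

	(void)private;
	return &dev;
}
#else
/* !CONFIG_DAX stub: per the discussion above, callers get NULL. */
static struct dax_device *alloc_dax(void *private)
{
	(void)private;
	return NULL;
}
#endif
```

Because pmem.c and dcssblk.c are only built when their Kconfig options select DAX, the stub branch is never the one they link against.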

[...]

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 4b7ecd4fd431..f911e58a24dd 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -681,12 +681,14 @@ dcssblk_add_store(struct device *dev, struct 
device_attribute *attr, const char

  if (IS_ERR(dev_info->dax_dev)) {
  rc = PTR_ERR(dev_info->dax_dev);
  dev_info->dax_dev = NULL;
-    goto put_dev;
+    if (rc != -EOPNOTSUPP)
+    goto put_dev;


config DCSSBLK selects FS_DAX_LIMITED and DAX.

I'm not sure what selecting DAX is trying to achieve here, because the
Kconfig option is "FS_DAX".

So depending on the real motivation behind this select, we may want to
consider failure rather than success in the -EOPNOTSUPP case.



I missed that the "not supported" stub of alloc_dax() is selected based on
CONFIG_DAX (not CONFIG_FS_DAX).

Therefore DCSSBLK Kconfig does the right thing and always selects DAX,
and thus an implemented version of alloc_dax().

This takes care of two of my open questions at least. :)

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



