Re: [RFC PATCH] rpmsg: glink: Add bounds check on tx path

2024-03-19 Thread Deepak Kumar Singh




On 1/29/2024 10:03 PM, Michal Koutný wrote:

On Mon, Jan 29, 2024 at 04:18:36PM +0530, Deepak Kumar Singh 
 wrote:

There is already a patch posted for similar problem -
https://lore.kernel.org/all/20231201110631.669085-1-quic_dee...@quicinc.com/


I was not aware, thanks for the pointer.

Do you plan to update your patch to "just" bail-out/zero instead of
using slightly random values (as pointed out by Bjorn)?

Michal

Hi Michal,
Yes, i will be fixing those comments and re post patch.



Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread syzbot
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any 
issue:

Reported-and-tested-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com

Tested on:

commit: 52998cdd Merge branch '6.8/scsi-staging' into 6.8/scsi..
git tree:   git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
kernel config:  https://syzkaller.appspot.com/x/.config?x=7b1f286a7e950707
dashboard link: https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
2.40

Note: no patches were applied.
Note: testing is done by a robot and is best-effort only.



[PATCH v3 2/2] memory tier: dax/kmem: abstract memory types put

2024-03-19 Thread Ho-Ren (Jack) Chuang
Abstract `kmem_put_memory_types()` into `mt_put_memory_types()` to
accommodate various memory types and enhance flexibility,
similar to `mt_find_alloc_memory_type()`.

Signed-off-by: Ho-Ren (Jack) Chuang 
---
 drivers/dax/kmem.c   |  7 +--
 include/linux/memory-tiers.h |  6 ++
 mm/memory-tiers.c| 12 
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index de1333aa7b3e..01399e5b53b2 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -66,13 +66,8 @@ static struct memory_dev_type 
*kmem_find_alloc_memory_type(int adist)
 
 static void kmem_put_memory_types(void)
 {
-   struct memory_dev_type *mtype, *mtn;
-
mutex_lock(&kmem_memory_type_lock);
-   list_for_each_entry_safe(mtype, mtn, &kmem_memory_types, list) {
-   list_del(&mtype->list);
-   put_memory_type(mtype);
-   }
+   mt_put_memory_types(&kmem_memory_types);
mutex_unlock(&kmem_memory_type_lock);
 }
 
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index b2135334ac18..a44c03c2ba3a 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -50,6 +50,7 @@ int mt_set_default_dram_perf(int nid, struct 
access_coordinate *perf,
 int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
 struct memory_dev_type *mt_find_alloc_memory_type(int adist,
struct list_head 
*memory_types);
+void mt_put_memory_types(struct list_head *memory_types);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -143,5 +144,10 @@ struct memory_dev_type *mt_find_alloc_memory_type(int 
adist, struct list_head *m
 {
return NULL;
 }
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+
+}
 #endif /* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index d9b96b21b65a..6246c28546ba 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -662,6 +662,18 @@ struct memory_dev_type *mt_find_alloc_memory_type(int 
adist, struct list_head *m
 }
 EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
 
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+   struct memory_dev_type *mtype, *mtn;
+
+   list_for_each_entry_safe(mtype, mtn, memory_types, list) {
+   list_del(&mtype->list);
+   put_memory_type(mtype);
+   }
+}
+EXPORT_SYMBOL_GPL(mt_put_memory_types);
+
 /*
  * This is invoked via late_initcall() to create
  * CPUless memory tiers after HMAT info is ready or
-- 
Ho-Ren (Jack) Chuang




[PATCH v3 1/2] memory tier: dax/kmem: create CPUless memory tiers after obtaining HMAT info

2024-03-19 Thread Ho-Ren (Jack) Chuang
The current implementation treats emulated memory devices, such as
CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to
distinguish them. Thus, we modify the tiered memory initialization process
to introduce a delay specifically for CPUless NUMA nodes. This delay
ensures that the memory tier initialization for these nodes is deferred
until HMAT information is obtained during the boot process. Finally,
demotion tables are recalculated at the end.

More details:
* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be excluded
in the late init.

* Abstract common functions into `mt_find_alloc_memory_type()`
Since different memory devices require finding or allocating a memory type,
these common steps are abstracted into a single function,
`mt_find_alloc_memory_type()`, enhancing code scalability and conciseness.

* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT information.
If no HMAT is specified, it falls back to using the default DRAM tier.

* Change adist calculation code to use another new lock, `mt_perf_lock`.
In the current implementation, iterating through CPUlist nodes requires
holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
trying to acquire the same lock, leading to a potential deadlock.
Therefore, we propose introducing a standalone `mt_perf_lock` to protect
`default_dram_perf`. This approach not only avoids deadlock but also
prevents holding a large lock simultaneously.

* Upgrade `set_node_memory_tier` to support additional cases, including
  default DRAM, late CPUless, and hot-plugged initializations.
To cover hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information is
available.

* Introduce `default_memory_types` for those memory types that are not
  initialized by device drivers.
Because late initialized memory and default DRAM memory need to be managed,
a default memory type is created for storing all memory types that are
not initialized by device drivers and as a fallback.

Signed-off-by: Ho-Ren (Jack) Chuang 
Signed-off-by: Hao Xiang 
---
 drivers/dax/kmem.c   | 13 +
 include/linux/memory-tiers.h |  7 +++
 mm/memory-tiers.c| 94 +---
 3 files changed, 95 insertions(+), 19 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 42ee360cf4e3..de1333aa7b3e 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
 
 static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
 {
-   bool found = false;
struct memory_dev_type *mtype;
 
mutex_lock(&kmem_memory_type_lock);
-   list_for_each_entry(mtype, &kmem_memory_types, list) {
-   if (mtype->adistance == adist) {
-   found = true;
-   break;
-   }
-   }
-   if (!found) {
-   mtype = alloc_memory_type(adist);
-   if (!IS_ERR(mtype))
-   list_add(&mtype->list, &kmem_memory_types);
-   }
+   mtype = mt_find_alloc_memory_type(adist, &kmem_memory_types);
mutex_unlock(&kmem_memory_type_lock);
 
return mtype;
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 69e781900082..b2135334ac18 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -48,6 +48,8 @@ int mt_calc_adistance(int node, int *adist);
 int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
 const char *source);
 int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
+struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+   struct list_head 
*memory_types);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -136,5 +138,10 @@ static inline int mt_perf_to_adistance(struct 
access_coordinate *perf, int *adis
 {
return -EIO;
 }
+
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head 
*memory_types)
+{
+   return NULL;
+}
 #endif /* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0537664620e5..d9b96b21b65a 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -36,6 +37,11 @@ struct node_mem

[PATCH v3 0/2] Improved Memory Tier Creation for CPUless NUMA Nodes

2024-03-19 Thread Ho-Ren (Jack) Chuang
When a memory device, such as CXL1.1 type3 memory, is emulated as
normal memory (E820_TYPE_RAM), the memory device is indistinguishable
from normal DRAM in terms of memory tiering with the current implementation.
The current memory tiering assigns all detected normal memory nodes
to the same DRAM tier. This results in normal memory devices with
different attributions being unable to be assigned to the correct memory tier,
leading to the inability to migrate pages between different types of memory.
https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/

This patchset automatically resolves the issues. It delays the initialization
of memory tiers for CPUless NUMA nodes until they obtain HMAT information
and after all devices are initialized at boot time, eliminating the need
for user intervention. If no HMAT is specified, it falls back to
using `default_dram_type`.

Example usecase:
We have CXL memory on the host, and we create VMs with a new system memory
device backed by host CXL memory. We inject CXL memory performance attributes
through QEMU, and the guest now sees memory nodes with performance attributes
in HMAT. With this change, we enable the guest kernel to construct
the correct memory tiering for the memory nodes.

-v3:
 Thanks to Ying's comments,
 * Make the newly added code independent of HMAT
 * Upgrade set_node_memory_tier to support more cases
 * Put all non-driver-initialized memory types into default_memory_types
   instead of using hmat_memory_types
 * find_alloc_memory_type -> mt_find_alloc_memory_type
-v2:
 Thanks to Ying's comments,
 * Rewrite cover letter & patch description
 * Rename functions, don't use _hmat
 * Abstract common functions into find_alloc_memory_type()
 * Use the expected way to use set_node_memory_tier instead of modifying it
 * 
https://lore.kernel.org/lkml/20240312061729.1997111-1-horenchu...@bytedance.com/T/#u
-v1:
 * 
https://lore.kernel.org/lkml/20240301082248.3456086-1-horenchu...@bytedance.com/T/#u

Ho-Ren (Jack) Chuang (2):
  memory tier: dax/kmem: create CPUless memory tiers after obtaining
HMAT info
  memory tier: dax/kmem: abstract memory types put

 drivers/dax/kmem.c   |  20 +--
 include/linux/memory-tiers.h |  13 +
 mm/memory-tiers.c| 106 ---
 3 files changed, 114 insertions(+), 25 deletions(-)

-- 
Ho-Ren (Jack) Chuang




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Gavin Shan
On 3/20/24 10:49, Michael S. Tsirkin wrote:> 

I think you are wasting the time with these tests. Even if it helps what
does this tell us? Try setting a flag as I suggested elsewhere.
Then check it in vhost.
Or here's another idea - possibly easier. Copy the high bits from index
into ring itself. Then vhost can check that head is synchronized with
index.

Warning: completely untested, not even compiled. But should give you
the idea. If this works btw we should consider making this official in
the spec.


  static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 6f7e5010a673..79456706d0bd 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
/* Put entry in available array (but don't update avail->idx until they
 * do sync). */
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
-   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
+   u16 headwithflag = head | (q->split.avail_idx_shadow & 
~(vq->split.vring.num - 1));
+   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
headwithflag);
  
  	/* Descriptors and available array need to be set before we expose the

 * new available array entries. */

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..bd8f7c763caa 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct 
vhost_virtqueue *vq,
  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
   __virtio16 *head, int idx)
  {
-   return vhost_get_avail(vq, *head,
+   unsigned i = idx;
+   unsigned flag = i & ~(vq->num - 1);
+   unsigned val = vhost_get_avail(vq, *head,
   &vq->avail->ring[idx & (vq->num - 1)]);
+   unsigned valflag = val & ~(vq->num - 1);
+
+   WARN_ON(valflag != flag);
+
+   return val & (vq->num - 1);
  }
  


Thanks, Michael. The code is already self-explanatory. Since vq->num is 256, I 
just
squeezed the last_avail_idx to the high byte. Unfortunately, I'm unable to hit
the WARN_ON(). Does it mean the low byte is stale (or corrupted) while the high
byte is still correct and valid?

avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
vq->split.vring.avail->ring[avail] =
cpu_to_virtio16(_vq->vdev, head | (avail << 8));


head = vhost16_to_cpu(vq, ring_head);
WARN_ON((head >> 8) != (vq->last_avail_idx % vq->num));
head = head & 0xff;

One question: Does QEMU has any chance writing data to the available queue when
vhost is enabled? My previous understanding is no, the queue is totally owned by
vhost instead of QEMU.

Before this patch was posted, I had debugging code to record last 16 
transactions
to the available and used queue from guest and host side. It did reveal the 
wrong
head was fetched from the available queue.

[   11.785745]  virtqueue_get_buf_ctx_split 
[   11.786238] virtio_net virtio0: output.0:id 74 is not a head!
[   11.786655] head to be released: 036 077
[   11.786952]
[   11.786952] avail_idx:
[   11.787234] 000  63985  <--
[   11.787237] 001  63986
[   11.787444] 002  63987
[   11.787632] 003  63988
[   11.787821] 004  63989
[   11.788006] 005  63990
[   11.788194] 006  63991
[   11.788381] 007  63992
[   11.788567] 008  63993
[   11.788772] 009  63994
[   11.788957] 010  63995
[   11.789141] 011  63996
[   11.789327] 012  63997
[   11.789515] 013  63998
[   11.789701] 014  63999
[   11.789886] 015  64000
[   11.790068]
[   11.790068] avail_head:
[   11.790529] 000  075  <--
[   11.790718] 001  036
[   11.790890] 002  077
[   11.791061] 003  129
[   11.791231] 004  072
[   11.791400] 005  130
[   11.791574] 006  015
[   11.791748] 007  074
[   11.791918] 008  130
[   11.792094] 009  130
[   11.792263] 010  074
[   11.792437] 011  015
[   11.792617] 012  072
[   11.792788] 013  129
[   11.792961] 014  077// The last two heads from guest to host: 077, 036
[   11.793134] 015  036

[root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost

avail_idx
000  63998
001  64000
002  63954  <---
003  63955
004  63956
005  63974
006  63981
007  63984
008  63986
009  63987
010  63988
011  63989
012  63992
013  63993
014  63995
015  63997

avail_head
000  074
001  015
002  072
003  129
004  074// The last two heads seen by vhost is: 074, 036
005  036
006  075  <---
007  036
008  077
009  129
010  072
011  130
012  015
013  074
014  130
015  130

used_idx
000  64000
001  63882  <---
002  63889
003  63891
004  63898
005  63936
006  63942
007  63946
008  63949
009  63953
010  63957
011  63981
012  63990
013  63992
014  63993
015  63999

used_head
000  072
001  129
002  074  // The 

Re: [PATCH v6 7/8] selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD"

2024-03-19 Thread Google
Hi Ye,

Sorry for replying late.

On Fri, 15 Mar 2024 14:55:39 +0800
Ye Bin  wrote:

> This patch adds test cases for new print format type "%pd/%pD".The test cases
> test the following items:
> 1. Test README if add "%pd/%pD" type;
> 2. Test "%pd" type for dput();
> 3. Test "%pD" type for vfs_read();
> 
> This test case require enable CONFIG_HAVE_FUNCTION_ARG_ACCESS_API 
> configuration.
> 
> Signed-off-by: Ye Bin 
> ---
>  .../ftrace/test.d/kprobe/kprobe_args_vfs.tc   | 43 +++
>  1 file changed, 43 insertions(+)
>  create mode 100644 
> tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc
> 
> diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc 
> b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc
> new file mode 100644
> index ..8bea9a75a331
> --- /dev/null
> +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_args_vfs.tc
> @@ -0,0 +1,43 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +# description: Kprobe event VFS type argument
> +# requires: kprobe_events "%pd/%pD":README
> +
> +: "Test argument %pd/%pD in README"
> +grep -q "%pd/%pD" README

If you put this check in "# requires" line, you don't need to check it again.

> +
> +: "Test argument %pd with name"
> +echo 'p:testprobe dput name=$arg1:%pd' > kprobe_events
> +echo 1 > events/kprobes/testprobe/enable
> +grep -q "1" events/kprobes/testprobe/enable

So accessing "enable" file here,

> +echo 0 > events/kprobes/testprobe/enable
> +grep "dput" trace | grep -q "enable"

and check it.

OK, the test case itself looks good to me.

Thanks,

> +echo "" > kprobe_events
> +echo "" > trace
> +
> +: "Test argument %pd without name"
> +echo 'p:testprobe dput $arg1:%pd' > kprobe_events
> +echo 1 > events/kprobes/testprobe/enable
> +grep -q "1" events/kprobes/testprobe/enable
> +echo 0 > events/kprobes/testprobe/enable
> +grep "dput" trace | grep -q "enable"
> +echo "" > kprobe_events
> +echo "" > trace
> +
> +: "Test argument %pD with name"
> +echo 'p:testprobe vfs_read name=$arg1:%pD' > kprobe_events
> +echo 1 > events/kprobes/testprobe/enable
> +grep -q "1" events/kprobes/testprobe/enable
> +echo 0 > events/kprobes/testprobe/enable
> +grep "vfs_read" trace | grep -q "enable"
> +echo "" > kprobe_events
> +echo "" > trace
> +
> +: "Test argument %pD without name"
> +echo 'p:testprobe vfs_read $arg1:%pD' > kprobe_events
> +echo 1 > events/kprobes/testprobe/enable
> +grep -q "1"  events/kprobes/testprobe/enable
> +echo 0 > events/kprobes/testprobe/enable
> +grep "vfs_read" trace | grep -q "enable"
> +echo "" > kprobe_events
> +echo "" > trace
> -- 
> 2.31.1
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2 0/3] uprobes: two common case speed ups

2024-03-19 Thread Google
On Tue, 19 Mar 2024 09:19:19 -0700
Andrii Nakryiko  wrote:

> On Mon, Mar 18, 2024 at 9:21 PM Masami Hiramatsu  wrote:
> >
> > Hi,
> >
> > On Mon, 18 Mar 2024 11:17:25 -0700
> > Andrii Nakryiko  wrote:
> >
> > > This patch set implements two speed ups for uprobe/uretprobe runtime 
> > > execution
> > > path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
> > > system-wide (non-PID-specific) uprobes (patch #3). Please see individual
> > > patches for details.
> >
> > This series looks good to me. Let me pick it on probes/for-next.
> 
> Great, at least I guessed the Git repo right, if not the branch.
> Thanks for pulling it in! I assume some other uprobe-related follow up
> patches should be based on probes/for-next as well, right?

Yes, I'll pick those on linux-trace tree's probes/* branchs
(if there is an urgent patch, it will go through probes/fixes)

Thank you! 

> 
> >
> > Thanks!
> >
> > >
> > > v1->v2:
> > >   - rebased onto trace/core branch of tracing tree, hopefully I guessed 
> > > right;
> > >   - simplified user_cpu_buffer usage further (Oleg Nesterov);
> > >   - simplified patch #3, just moved speculative check outside of lock 
> > > (Oleg);
> > >   - added Reviewed-by from Jiri Olsa.
> > >
> > > Andrii Nakryiko (3):
> > >   uprobes: encapsulate preparation of uprobe args buffer
> > >   uprobes: prepare uprobe args buffer lazily
> > >   uprobes: add speculative lockless system-wide uprobe filter check
> > >
> > >  kernel/trace/trace_uprobe.c | 103 +---
> > >  1 file changed, 59 insertions(+), 44 deletions(-)
> > >
> > > --
> > > 2.43.0
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Wed, Mar 20, 2024 at 09:56:58AM +1000, Gavin Shan wrote:
> On 3/20/24 04:22, Will Deacon wrote:
> > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > On 3/19/24 02:59, Will Deacon wrote:
> > > > >drivers/virtio/virtio_ring.c | 12 +---
> > > > >1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > virtqueue *_vq,
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > head);
> > > > > - /* Descriptors and available array need to be set before we 
> > > > > expose the
> > > > > -  * new available array entries. */
> > > > > - virtio_wmb(vq->weak_barriers);
> > > > > + /*
> > > > > +  * Descriptors and available array need to be set before we 
> > > > > expose
> > > > > +  * the new available array entries. virtio_wmb() should be 
> > > > > enough
> > > > > +  * to ensuere the order theoretically. However, a stronger 
> > > > > barrier
> > > > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > +  * by the host (vhost). A stronger barrier should work for other
> > > > > +  * architectures, but performance loss is expected.
> > > > > +  */
> > > > > + virtio_mb(false);
> > > > >   vq->split.avail_idx_shadow++;
> > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > >   
> > > > > vq->split.avail_idx_shadow);
> > > > 
> > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > > > here, especially when ordering accesses to coherent memory.
> > > > 
> > > > In practice, either the larger timing different from the DSB or the fact
> > > > that you're going from a Store->Store barrier to a full barrier is what
> > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > (e.g. via __smb_mb()).
> > > > 
> > > > We definitely shouldn't take changes like this without a proper
> > > > explanation of what is going on.
> > > > 
> > > 
> > > Thanks for your comments, Will.
> > > 
> > > Yes, DMB should work for us. However, it seems this instruction has 
> > > issues on
> > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB 
> > > works
> > > from hardware level. I agree it's not the solution to replace DMB with DSB
> > > before we fully understand the root cause.
> > > 
> > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > issue like
> > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > 
> > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > {
> > >  :
> > >  /* Put entry in available array (but don't update avail->idx 
> > > until they
> > >   * do sync). */
> > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >  vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > 
> > >  /* Descriptors and available array need to be set before we 
> > > expose the
> > >   * new available array entries. */
> > >  // Broken: virtio_wmb(vq->weak_barriers);
> > >  // Broken: __dma_mb();
> > >  // Work:   __mb();
> > >  // Work:   __smp_mb();
> > 
> > It's pretty weird that __dma_mb() is "broken" but __smp_mb() "works". How
> > confident are you in that result?
> > 
> 
> Yes, __dma_mb() is even stronger than __smp_mb(). I retried the test, showing
> that both __dma_mb() and __smp_mb() work for us. I had too many tests 
> yesterday
> and something may have been messed up.
> 
> Instruction Hitting times in 10 tests
> -
> __smp_wmb() 8
> __smp_mb()  0
> __dma_wmb() 7
> __dma_mb()  0
> __mb()  0
> __wmb() 0
> 
> It's strange that __smp_mb() works, but __smp_wmb() fails. It seems we need a
> read barrier here. I will try WRITE_ONCE() + __smp_wmb() as suggested by 
> Michael
> in another reply. Will update the result soon.
> 
> Thanks,
> Gavin


I think you are wasting the time with these tests. Even if it helps what
does this tell us? Try setting a flag as I suggested elsewhere.
Then check it in vhost.
Or here's another idea - possibly easier. Copy the high bits from index
into ring itself. Then vhost can check that head is synchronized with
index.

Warning: completely untested, not even compiled. But should give you
the idea. If this works btw we should consider making this official in
the spec.


 static inline int vhost_get_avail_flags(struct vh

Re: [PATCH 0/8] tracing: Persistent traces across a reboot or crash

2024-03-19 Thread Steven Rostedt
On Sat, 9 Mar 2024 12:40:51 -0800
Kees Cook  wrote:

> The part I'd like to get wired up sanely is having pstore find the
> nvdimm area automatically, but it never quite happened:
> https://lore.kernel.org/lkml/CAGXu5jLtmb3qinZnX3rScUJLUFdf+pRDVPjy=cs4kutw9tl...@mail.gmail.com/

The automatic detection is what I'm looking for.

> 
> > Thanks for the info. We use pstore on ChromeOS, but it is currently
> > restricted to 1MB which is too small for the tracing buffers. From what
> > I understand, it's also in a specific location where there's only 1MB
> > available for contiguous memory.  
> 
> That's the area that is specifically hardware backed with persistent
> RAM.
> 
> > I'm looking at finding a way to get consistent memory outside that
> > range. That's what I'll be doing next week ;-)
> > 
> > But this code was just to see if I could get a single contiguous range
> > of memory mapped to ftrace, and this patch set does exactly that.  
> 
> Well, please take a look at pstore. It should be able to do everything
> you mention already; it just needs a way to define multiple regions if
> you want to use an area outside of the persistent ram area defined by
> Chrome OS's platform driver.

I'm not exactly sure how to use pstore here. At boot up I just need some
consistent memory reserved for the tracing buffer. It just needs to be the
same location at every boot up.

I don't need a front end. If you mean a way to access it from user space.
The front end is the tracefs directory, as I need all the features that the
tracefs directory gives.

I'm going to look to see how pstore is set up in ChromeOS and see if I can
use whatever it does to allocate another location.

-- Steve



Re: [PATCH 1/2] compiler_types: Ensure __diag_clang() is always available

2024-03-19 Thread Justin Stitt
Hi,

On Tue, Mar 19, 2024 at 09:07:52AM -0700, Nathan Chancellor wrote:
> Attempting to use __diag_clang() and build with GCC results in a build
> error:
> 
>   include/linux/compiler_types.h:468:38: error: 'ignore' undeclared (first 
> use in this function); did you mean 'inode'?
> 468 | __diag_ ## compiler(version, ignore, option)
> |  ^~
> 
> This error occurs because __diag_clang() is only defined in
> compiler-clang.h, which is only included when using clang as the
> compiler. This error has not been seen before because __diag_clang() has
> only been used in __diag_ignore_all(), which is defined in both
> compiler-clang.h and compiler-gcc.h.
> 
> Add an empty stub for __diag_clang() in compiler_types.h, so that it is
> always defined and just becomes a no-op when using GCC.
> 
> Fixes: f014a00bbeb0 ("compiler-clang.h: Add __diag infrastructure for clang")
> Signed-off-by: Nathan Chancellor 

Reviewed-by: Justin Stitt 

> ---
>  include/linux/compiler_types.h | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
> index 3e64ec0f7ac8..fb0c3ff5497d 100644
> --- a/include/linux/compiler_types.h
> +++ b/include/linux/compiler_types.h
> @@ -461,6 +461,10 @@ struct ftrace_likely_data {
>  #define __diag_GCC(version, severity, string)
>  #endif
>  
> +#ifndef __diag_clang
> +#define __diag_clang(version, severity, string)
> +#endif
> +
>  #define __diag_push()__diag(push)
>  #define __diag_pop() __diag(pop)
>  
> 
> -- 
> 2.44.0

Thanks
Justin



Re: [PATCH 2/2] tracing: Ignore -Wstring-compare with diagnostic macros

2024-03-19 Thread Steven Rostedt
On Tue, 19 Mar 2024 17:30:41 -0700
Justin Stitt  wrote:


> > diff --git a/include/trace/stages/stage6_event_callback.h 
> > b/include/trace/stages/stage6_event_callback.h
> > index 83da83a0c14f..56a4eea5a48e 100644
> > --- a/include/trace/stages/stage6_event_callback.h
> > +++ b/include/trace/stages/stage6_event_callback.h
> > @@ -35,9 +35,14 @@
> > do {\
> > char *__str__ = __get_str(dst); \
> > int __len__ = __get_dynamic_array_len(dst) - 1; \
> > +   __diag_push();  \
> > +   __diag_ignore(clang, 11, "-Wstring-compare",\
> > + "__builtin_constant_p() ensures strcmp()" \
> > + "will be used for string literals");  \
> > WARN_ON_ONCE(__builtin_constant_p(src) ?\
> >  strcmp((src), __data_offsets.dst##_ptr_) : \
> >  (src) != __data_offsets.dst##_ptr_);   \  
> 
> What exactly is the point of the literal string comparison? Why
> doesn't strcmp do the trick?

This is done in the hotpath, and is only for debugging. The string passed
into __string() should be the same as the string passed into __assign_str().

But this is moot as I ended up always using strcmp() even if it will slow
down the recording of the event.

Next merge window the src parameter (along with the strcmp() checks) are
going away.

-- Steve


> 
> > +   __diag_pop();   \
> > memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
> >EVENT_NULL_STR, __len__);\
> > __str__[__len__] = '\0';\
> >
> > --
> > 2.44.0
> >  




Re: [PATCH 2/2] tracing: Ignore -Wstring-compare with diagnostic macros

2024-03-19 Thread Justin Stitt
On Tue, Mar 19, 2024 at 9:08 AM Nathan Chancellor  wrote:
>
> Commit b1afefa62ca9 ("tracing: Use strcmp() in __assign_str() WARN_ON()
> check") addressed a clang warning, -Wstring-compare, with the use of
> __builtin_constant_p() to dispatch to strcmp() if the source string is a
> string literal and a direct comparison if not. Unfortunately, even with
> this change, the warning is still present because __builtin_constant_p()
> is not evaluated at this stage of the pipeline, so clang still thinks
> the else branch could occur for this situation:
>
>   include/trace/events/sunrpc.h:705:4: error: result of comparison against a 
> string literal is unspecified (use an explicit string comparison function 
> instead) [-Werror,-Wstring-compare]
>   ...
>   include/trace/stages/stage6_event_callback.h:40:15: note: expanded from 
> macro '__assign_str'
>  40 |  (src) != __data_offsets.dst##_ptr_);   
> \
> |^
>   ...
>
> Use the compiler diagnostic macros to disable this warning around the
> WARN_ON_ONCE() expression since a string comparison function, strcmp(),
> will always be used for the comparison of string literals.
>
> Fixes: b1afefa62ca9 ("tracing: Use strcmp() in __assign_str() WARN_ON() 
> check")
> Reported-by: Linux Kernel Functional Testing 
> Closes: 
> https://lore.kernel.org/all/CA+G9fYs=otkazs6g1p1ewadfr0qoe6lgovsohqkxmfxoteo...@mail.gmail.com/
> Signed-off-by: Nathan Chancellor 
> ---
>  include/trace/stages/stage6_event_callback.h | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/include/trace/stages/stage6_event_callback.h 
> b/include/trace/stages/stage6_event_callback.h
> index 83da83a0c14f..56a4eea5a48e 100644
> --- a/include/trace/stages/stage6_event_callback.h
> +++ b/include/trace/stages/stage6_event_callback.h
> @@ -35,9 +35,14 @@
> do {\
> char *__str__ = __get_str(dst); \
> int __len__ = __get_dynamic_array_len(dst) - 1; \
> +   __diag_push();  \
> +   __diag_ignore(clang, 11, "-Wstring-compare",\
> + "__builtin_constant_p() ensures strcmp()" \
> + "will be used for string literals");  \
> WARN_ON_ONCE(__builtin_constant_p(src) ?\
>  strcmp((src), __data_offsets.dst##_ptr_) : \
>  (src) != __data_offsets.dst##_ptr_);   \

What exactly is the point of the literal string comparison? Why
doesn't strcmp do the trick?

> +   __diag_pop();   \
> memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
>EVENT_NULL_STR, __len__);\
> __str__[__len__] = '\0';\
>
> --
> 2.44.0
>



Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Gavin Shan

On 3/20/24 04:22, Will Deacon wrote:

On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:

On 3/19/24 02:59, Will Deacon wrote:

   drivers/virtio/virtio_ring.c | 12 +---
   1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 49299b1f9ec7..7d852811c912 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
-   /* Descriptors and available array need to be set before we expose the
-* new available array entries. */
-   virtio_wmb(vq->weak_barriers);
+   /*
+* Descriptors and available array need to be set before we expose
+* the new available array entries. virtio_wmb() should be enough
+* to ensuere the order theoretically. However, a stronger barrier
+* is needed by ARM64. Otherwise, the stale data can be observed
+* by the host (vhost). A stronger barrier should work for other
+* architectures, but performance loss is expected.
+*/
+   virtio_mb(false);
vq->split.avail_idx_shadow++;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);


Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
here, especially when ordering accesses to coherent memory.

In practice, either the larger timing different from the DSB or the fact
that you're going from a Store->Store barrier to a full barrier is what
makes things "work" for you. Have you tried, for example, a DMB SY
(e.g. via __smb_mb()).

We definitely shouldn't take changes like this without a proper
explanation of what is going on.



Thanks for your comments, Will.

Yes, DMB should work for us. However, it seems this instruction has issues on
NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
from hardware level. I agree it's not the solution to replace DMB with DSB
before we fully understand the root cause.

I tried the possible replacement like below. __smp_mb() can avoid the issue like
__mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.

static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
{
 :
 /* Put entry in available array (but don't update avail->idx until they
  * do sync). */
 avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
 vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);

 /* Descriptors and available array need to be set before we expose the
  * new available array entries. */
 // Broken: virtio_wmb(vq->weak_barriers);
 // Broken: __dma_mb();
 // Work:   __mb();
 // Work:   __smp_mb();


It's pretty weird that __dma_mb() is "broken" but __smp_mb() "works". How
confident are you in that result?



Yes, __dma_mb() is even stronger than __smp_mb(). I retried the test, showing
that both __dma_mb() and __smp_mb() work for us. I had too many tests yesterday
and something may have been messed up.

Instruction Hitting times in 10 tests
-
__smp_wmb() 8
__smp_mb()  0
__dma_wmb() 7
__dma_mb()  0
__mb()  0
__wmb() 0

It's strange that __smp_mb() works, but __smp_wmb() fails. It seems we need a
read barrier here. I will try WRITE_ONCE() + __smp_wmb() as suggested by Michael
in another reply. Will update the result soon.

Thanks,
Gavin




Re: [PATCH v2] workqueue: add function in event of workqueue_activate_work

2024-03-19 Thread kassey li




on 2024/3/8 10:23, Steven Rostedt wrote:

On Fri, 8 Mar 2024 10:18:18 +0800
Kassey Li  wrote:


The trace event "workqueue_activate_work" only print work struct.
However, function is the region of interest in a full sequence of work.
Current workqueue_activate_work trace event output:

 workqueue_activate_work: work struct ff88b4a0f450

With this change, workqueue_activate_work will print the function name,
align with workqueue_queue_work/execute_start/execute_end event.

 workqueue_activate_work: work struct ff80413a78b8 
function=vmstat_update

Signed-off-by: Kassey Li 
---
Changelog:
v1: 
https://lore.kernel.org/all/20240308010929.1955339-1-quic_yinga...@quicinc.com/
v1->v2:
- do not follow checkpatch in TRACE_EVENT() macros
- add sample "workqueue_activate_work: work struct ff80413a78b8 
function=vmstat_update"


 From a tracing POV,

Reviewed-by: Steven Rostedt (Google) 

-- Steve


hello, Tejun, may you have a chance to review this change ?




---
  include/trace/events/workqueue.h | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/workqueue.h b/include/trace/events/workqueue.h
index 262d52021c23..6ef5b7254070 100644
--- a/include/trace/events/workqueue.h
+++ b/include/trace/events/workqueue.h
@@ -64,13 +64,15 @@ TRACE_EVENT(workqueue_activate_work,
  
  	TP_STRUCT__entry(

__field( void *,work)
+   __field( void *,function)
),
  
  	TP_fast_assign(

__entry->work= work;
+   __entry->function= work->func;
),
  
-	TP_printk("work struct %p", __entry->work)

+   TP_printk("work struct %p function=%ps ", __entry->work, 
__entry->function)
  );
  
  /**







Re: [GIT PULL] tracing/tools: Updates for 6.9

2024-03-19 Thread Steven Rostedt
On Mon, 18 Mar 2024 18:41:13 +0100
Daniel Bristot de Oliveira  wrote:

> Steven,
> 
> Tracing tooling updates for 6.9
> 
> Tracing:
> - Update makefiles for latency-collector and RTLA,
>   using tools/build/ makefiles like perf does, inheriting
>   its benefits. For example, having a proper way to
>   handle dependencies.
> 
> - The timerlat tracer has an interface for any tool to use.
>   rtla timerlat tool uses this interface dispatching its
>   own threads as workload. But, rtla timerlat could also be
>   used for any other process. So, add 'rtla timerlat -U'
>   option, allowing the timerlat tool to measure the latency of
>   any task using the timerlat tracer interface.
> 
> Verification:
> - Update makefiles for verification/rv, using tools/build/
>   makefiles like perf does, inheriting its benefits.
>   For example, having a proper way to handle dependencies.
> 
> 
> Please pull the latest trace-tools-v6.9 tree, which can be found at:
> 
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bristot/linux.git
> trace-tools-v6.9

Looks like you just built on top of a random commit from Linus's tree:

commit f6cef5f8c37f58a3bc95b3754c3ae98e086631ca
Merge: 906a93befec8 8f06fb458539
Author: Linus Torvalds 
Date:   Sun Mar 17 16:59:33 2024 -0700

Merge tag 'i3c/for-6.9' of 
git://git.kernel.org/pub/scm/linux/kernel/git/i3c/linux

Linus prefers basing off of real tags or previous pulls from us.

Can you rebase your changes on v6.8 and resend?

  $ git checkout v6.8
  $ git cherry-pick f6cef5f8c37f58a3bc95b3754c3ae98e086631ca..trace-tools-v6.9

Appears to work fine.

Thanks!

-- Steve


> 
> Tag SHA1: 2eb09a97c56af3c27bd9dcebccb495f70d56d5c0
> Head SHA1: 9c63d9f58a42b979a42bcaed534d9246996ac0d9
> 
> 
> Daniel Bristot de Oliveira (4):
>   tools/tracing: Use tools/build makefiles on latency-collector
>   tools/rtla: Use tools/build makefiles to build rtla
>   tools/verification: Use tools/build makefiles on rv
>   tools/rtla: Add -U/--user-load option to timerlat
> 
> 
>  .../tools/rtla/common_timerlat_options.rst |   6 +
>  tools/tracing/latency/.gitignore   |   5 +-
>  tools/tracing/latency/Build|   1 +
>  tools/tracing/latency/Makefile | 105 --
>  tools/tracing/latency/Makefile.config  |  30 +++
>  tools/tracing/rtla/.gitignore  |   7 +-
>  tools/tracing/rtla/Build   |   1 +
>  tools/tracing/rtla/Makefile| 217 
> +++--
>  tools/tracing/rtla/Makefile.config |  47 +
>  tools/tracing/rtla/Makefile.rtla   |  80 
>  tools/tracing/rtla/Makefile.standalone |  26 +++
>  tools/tracing/rtla/sample/timerlat_load.py |  74 +++
>  tools/tracing/rtla/src/Build   |  11 ++
>  tools/tracing/rtla/src/timerlat_hist.c |  16 +-
>  tools/tracing/rtla/src/timerlat_top.c  |  14 +-
>  tools/verification/rv/.gitignore   |   6 +
>  tools/verification/rv/Build|   1 +
>  tools/verification/rv/Makefile | 207 +++-
>  tools/verification/rv/Makefile.config  |  47 +
>  tools/verification/rv/Makefile.rv  |  51 +
>  tools/verification/rv/src/Build|   4 +
>  21 files changed, 650 insertions(+), 306 deletions(-)
>  create mode 100644 tools/tracing/latency/Build
>  create mode 100644 tools/tracing/latency/Makefile.config
>  create mode 100644 tools/tracing/rtla/Build
>  create mode 100644 tools/tracing/rtla/Makefile.config
>  create mode 100644 tools/tracing/rtla/Makefile.rtla
>  create mode 100644 tools/tracing/rtla/Makefile.standalone
>  create mode 100644 tools/tracing/rtla/sample/timerlat_load.py
>  create mode 100644 tools/tracing/rtla/src/Build
>  create mode 100644 tools/verification/rv/.gitignore
>  create mode 100644 tools/verification/rv/Build
>  create mode 100644 tools/verification/rv/Makefile.config
>  create mode 100644 tools/verification/rv/Makefile.rv
>  create mode 100644 tools/verification/rv/src/Build
> ---




Re: [PATCH 0/2] tracing: Fully silence instance of -Wstring-compare

2024-03-19 Thread Nathan Chancellor
On Tue, Mar 19, 2024 at 06:15:09PM -0400, Steven Rostedt wrote:
> On Tue, 19 Mar 2024 09:07:51 -0700
> Nathan Chancellor  wrote:
> 
> > Hi all,
> > 
> > This series fully resolves the new instance of -Wstring-compare from
> > within the __assign_str() macro. The first patch resolves a build
> > failure with GCC that would be seen with just the second patch applied.
> > The second patch actually hides the warning.
> > 
> > NOTE: This is based on trace/for-next, which does not contain the
> > minimum LLVM version bump that occurred later in the current merge
> > window, so this uses
> > 
> >   __diag_ignore(clang, 11, ...
> > 
> > instead of
> > 
> >   __diag_ignore(clang, 13, ...
> > 
> > which will be required when this is merged into Linus's tree. If you can
> > base this series on a tree that has the merge commit e5eb28f6d1af ("Merge
> > tag 'mm-nonmm-stable-2024-03-14-09-36' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm") in it, then that
> > change can be done at application time, rather than merge time (or I can
> > send a v2). It would be really nice for this to make the merge window so
> > that this warning does not proliferate into other trees that base on
> > -rc1.
> > 
> 
> I'm guessing this isn't needed with the last update.

Correct, this is resolved in Linus's tree and should be in -next
tomorrow. The first patch may be worth sending standalone, I'll think
about sending it later but that can go in via some other tree, not
yours.

Cheers,
Nathan



Re: [PATCH 0/2] tracing: Fully silence instance of -Wstring-compare

2024-03-19 Thread Steven Rostedt
On Tue, 19 Mar 2024 09:07:51 -0700
Nathan Chancellor  wrote:

> Hi all,
> 
> This series fully resolves the new instance of -Wstring-compare from
> within the __assign_str() macro. The first patch resolves a build
> failure with GCC that would be seen with just the second patch applied.
> The second patch actually hides the warning.
> 
> NOTE: This is based on trace/for-next, which does not contain the
> minimum LLVM version bump that occurred later in the current merge
> window, so this uses
> 
>   __diag_ignore(clang, 11, ...
> 
> instead of
> 
>   __diag_ignore(clang, 13, ...
> 
> which will be required when this is merged into Linus's tree. If you can
> base this series on a tree that has the merge commit e5eb28f6d1af ("Merge
> tag 'mm-nonmm-stable-2024-03-14-09-36' of
> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm") in it, then that
> change can be done at application time, rather than merge time (or I can
> send a v2). It would be really nice for this to make the merge window so
> that this warning does not proliferate into other trees that base on
> -rc1.
> 

I'm guessing this isn't needed with the last update.

-- Steve



Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Mike Christie
On 3/19/24 12:19 PM, Stefan Hajnoczi wrote:
> On Tue, Mar 19, 2024 at 03:40:53AM -0400, Michael S. Tsirkin wrote:
>> On Tue, Mar 19, 2024 at 12:32:26AM -0700, syzbot wrote:
>>> Hello,
>>>
>>> syzbot found the following issue on:
>>>
>>> HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of git://git.kernel.org/p..
>>> git tree:   upstream
>>> console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
>>> dashboard link: https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
>>> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
>>> Debian) 2.40
>>>
>>> Downloadable assets:
>>> disk image: 
>>> https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
>>> vmlinux: 
>>> https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
>>> kernel image: 
>>> https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz
>>>
>>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>>> Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
>>>
>>> Key type pkcs7_test registered
>>> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
>>> io scheduler mq-deadline registered
>>> io scheduler kyber registered
>>> io scheduler bfq registered
>>> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
>>> ACPI: button: Power Button [PWRF]
>>> input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
>>> ACPI: button: Sleep Button [SLPF]
>>> ioatdma: Intel(R) QuickData Technology Driver 5.00
>>> ACPI: \_SB_.LNKC: Enabled at IRQ 11
>>> virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
>>> ACPI: \_SB_.LNKD: Enabled at IRQ 10
>>> virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
>>> ACPI: \_SB_.LNKB: Enabled at IRQ 10
>>> virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
>>> virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
>>> N_HDLC line discipline registered with maxframe=4096
>>> Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
>>> 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
>>> 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
>>> 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
>>> 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
>>> Non-volatile memory driver v1.3
>>> Linux agpgart interface v0.103
>>> ACPI: bus type drm_connector registered
>>> [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
>>> [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
>>> Console: switching to colour frame buffer device 128x48
>>> platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
>>> usbcore: registered new interface driver udl
>>> brd: module loaded
>>> loop: module loaded
>>> zram: Added device: zram0
>>> null_blk: disk nullb0 created
>>> null_blk: module loaded
>>> Guest personality initialized and is inactive
>>> VMCI host device registered (name=vmci, major=10, minor=118)
>>> Initialized host personality
>>> usbcore: registered new interface driver rtsx_usb
>>> usbcore: registered new interface driver viperboard
>>> usbcore: registered new interface driver dln2
>>> usbcore: registered new interface driver pn533_usb
>>> nfcsim 0.2 initialized
>>> usbcore: registered new interface driver port100
>>> usbcore: registered new interface driver nfcmrvl
>>> Loading iSCSI transport class v2.0-870.
>>> virtio_scsi virtio0: 1/0/0 default/read/poll queues
>>> [ cut here ]
>>> refcount_t: decrement hit 0; leaking memory.
>>> WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 
>>> refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
>>> Modules linked in:
>>> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
>>> 6.8.0-syzkaller-11567-gb3603fcb79b1 #0
>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
>>> Google 02/29/2024
>>> RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
>>> Code: b2 00 00 00 e8 57 d4 f2 fc 5b 5d c3 cc cc cc cc e8 4b d4 f2 fc c6 05 
>>> 0c f9 ef 0a 01 90 48 c7 c7 a0 5d 1e 8c e8 b7 75 b5 fc 90 <0f> 0b 90 90 eb 
>>> d9 e8 2b d4 f2 fc c6 05 e9 f8 ef 0a 01 90 48 c7 c7
>>> RSP: :c9066e18 EFLAGS: 00010246
>>> RAX: 76f86e452fcad900 RBX: 8880210d2aec RCX: 888016ac8000
>>> RDX:  RSI:  RDI: 
>>> RBP: 0004 R08: 8157ffe2 R09: fbfff1c396e0
>>> R10: dc00 R11: fbfff1c396e0 R12: ea000502cdc0
>>> R13: ea000502cdc8 R14: 1d4000a059b9 R15: 
>>> FS:  () GS:8880b940() knlGS:
>>> CS:  0010 DS:  ES:  CR0: 80050033
>>> CR2: 88823000 CR3: 0e132000 CR4: 003506f0
>>> DR0:  DR1:  DR2: 
>>> DR3:  DR6: fffe0ff0 DR7: 0

Re: [GIT PULL] virtio: features, fixes

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 11:03:44AM -0700, Linus Torvalds wrote:
> On Tue, 19 Mar 2024 at 00:41, Michael S. Tsirkin  wrote:
> >
> > virtio: features, fixes
> >
> > Per vq sizes in vdpa.
> > Info query for block devices support in vdpa.
> > DMA sync callbacks in vduse.
> >
> > Fixes, cleanups.
> 
> Grr. I thought the merge message was a bit too terse, but I let it slide.
> 
> But only after pushing it out do I notice that not only was the pull
> request message overly terse, you had also rebased this all just
> moments before sending the pull request and didn't even give a hit of
> a reason for that.
> 
> So I missed that, and the merge is out now, but this was NOT OK.
> 
> Yes, rebasing happens. But last-minute rebasing needs to be explained,
> not some kind of nasty surprise after-the-fact.
> 
> And that pull request explanation was really borderline even *without*
> that issue.
> 
> Linus

OK thanks Linus and sorry. I did that rebase for testing then I thought
hey history looks much nicer now why don't I switch to that.  Just goes
to show not to do this thing past midnight, I write better merge
messages at sane hours, too.

-- 
MST




Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Stefan Hajnoczi
#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
52998cdd8d3438df9a77c858a827b8932da1bb28

This is the last time virtio_scsi.c was touched. If the test passes then
the issue is probably in another subsystem and we can bisect more recent
commits. If it fails, then older virtio_scsi.c commits need to be
bisected.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Will Deacon
On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> On 3/19/24 02:59, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue is reported by Yihuang Yu who have 'netperf' test on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in the VM hosted by grace-hopper machine,
> > > while the 'netperf' server is running on grace-grace machine.
> > > 
> > > The VM is started with virtio-net and vhost has been enabled.
> > > We observe a error message spew from VM and then soft-lockup
> > > report. The error message indicates the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to the endless effort
> > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordred write to the available ring element and available index.
> > > 
> > >/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> > >-accel kvm -machine virt,gic-version=host\
> > >   : \
> > >-netdev tap,id=vnet0,vhost=on\
> > >-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > 
> > >[   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > 
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > > ARM64. It should work for other architectures, but performance loss is
> > > expected.
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Reported-by: Yihuang Yu 
> > > Signed-off-by: Gavin Shan 
> > > ---
> > >   drivers/virtio/virtio_ring.c | 12 +---
> > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > virtqueue *_vq,
> > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > - /* Descriptors and available array need to be set before we expose the
> > > -  * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > +  * Descriptors and available array need to be set before we expose
> > > +  * the new available array entries. virtio_wmb() should be enough
> > > +  * to ensuere the order theoretically. However, a stronger barrier
> > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > +  * by the host (vhost). A stronger barrier should work for other
> > > +  * architectures, but performance loss is expected.
> > > +  */
> > > + virtio_mb(false);
> > >   vq->split.avail_idx_shadow++;
> > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >   
> > > vq->split.avail_idx_shadow);
> > 
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> > 
> > In practice, either the larger timing different from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smb_mb()).
> > 
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
> > 
> 
> Thanks for your comments, Will.
> 
> Yes, DMB should work for us. However, it seems this instruction has issues on
> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> from hardware level. I agree it's not the solution to replace DMB with DSB
> before we fully understand the root cause.
> 
> I tried the possible replacement like below. __smp_mb() can avoid the issue 
> like
> __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> 
> static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> {
> :
> /* Put entry in available array (but don't update avail->idx until 
> they
>  * do sync). */
> avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> 
> /* Descriptors and available array need to be set before we expose the
>  * new available array entries. */
> // Broken: virtio_wmb(vq->weak_barriers);
> // Broken: __dma_mb();
> // Work:   __mb();
> // Work:   __smp_mb();

It's

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Will Deacon
On Tue, Mar 19, 2024 at 03:36:31AM -0400, Michael S. Tsirkin wrote:
> On Mon, Mar 18, 2024 at 04:59:24PM +, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > virtqueue *_vq,
> > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > >  
> > > - /* Descriptors and available array need to be set before we expose the
> > > -  * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > +  * Descriptors and available array need to be set before we expose
> > > +  * the new available array entries. virtio_wmb() should be enough
> > > +  * to ensuere the order theoretically. However, a stronger barrier
> > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > +  * by the host (vhost). A stronger barrier should work for other
> > > +  * architectures, but performance loss is expected.
> > > +  */
> > > + virtio_mb(false);
> > >   vq->split.avail_idx_shadow++;
> > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >   vq->split.avail_idx_shadow);
> > 
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> > 
> > In practice, either the larger timing different from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smb_mb()).
> > 
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
> 
> Just making sure: so on this system, how do
> smp_wmb() and wmb() differ? smb_wmb is normally for synchronizing
> with kernel running on another CPU and we are doing something
> unusual in virtio when we use it to synchronize with host
> as opposed to the guest - e.g. CONFIG_SMP is special cased
> because of this:
> 
> #define virt_wmb() do { kcsan_wmb(); __smp_wmb(); } while (0)
> 
> Note __smp_wmb not smp_wmb which would be a NOP on UP.

I think that should be fine (as long as the NOP is a barrier() for the
compiler).

wmb() uses a DSB, but that is only really relevant to endpoint ordering
with real I/O devices, e.g. if for some reason you need to guarantee
that a write has made it all the way to the device before proceeding.
Even then you're slightly at the mercy of the memory type and the
interconnect not giving an early acknowledgement, so the extra ordering
is rarely needed in practice and we don't even provide it for our I/O
accessors (e.g. writel() and readl()).

So for virtio between two CPUs using coherent memory, DSB is a red
herring.

Will



Re: [GIT PULL] virtio: features, fixes

2024-03-19 Thread Linus Torvalds
On Tue, 19 Mar 2024 at 00:41, Michael S. Tsirkin  wrote:
>
> virtio: features, fixes
>
> Per vq sizes in vdpa.
> Info query for block devices support in vdpa.
> DMA sync callbacks in vduse.
>
> Fixes, cleanups.

Grr. I thought the merge message was a bit too terse, but I let it slide.

But only after pushing it out do I notice that not only was the pull
request message overly terse, you had also rebased this all just
moments before sending the pull request and didn't even give a hit of
a reason for that.

So I missed that, and the merge is out now, but this was NOT OK.

Yes, rebasing happens. But last-minute rebasing needs to be explained,
not some kind of nasty surprise after-the-fact.

And that pull request explanation was really borderline even *without*
that issue.

Linus



Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 01:19:23PM -0400, Stefan Hajnoczi wrote:
> On Tue, Mar 19, 2024 at 03:40:53AM -0400, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 12:32:26AM -0700, syzbot wrote:
> > > Hello,
> > > 
> > > syzbot found the following issue on:
> > > 
> > > HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of 
> > > git://git.kernel.org/p..
> > > git tree:   upstream
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
> > > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > > Debian) 2.40
> > > 
> > > Downloadable assets:
> > > disk image: 
> > > https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
> > > vmlinux: 
> > > https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
> > > kernel image: 
> > > https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz
> > > 
> > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > commit:
> > > Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
> > > 
> > > Key type pkcs7_test registered
> > > Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> > > io scheduler mq-deadline registered
> > > io scheduler kyber registered
> > > io scheduler bfq registered
> > > input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> > > ACPI: button: Power Button [PWRF]
> > > input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> > > ACPI: button: Sleep Button [SLPF]
> > > ioatdma: Intel(R) QuickData Technology Driver 5.00
> > > ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > > virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> > > ACPI: \_SB_.LNKD: Enabled at IRQ 10
> > > virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> > > ACPI: \_SB_.LNKB: Enabled at IRQ 10
> > > virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> > > virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> > > N_HDLC line discipline registered with maxframe=4096
> > > Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > > 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > > 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> > > 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> > > 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> > > Non-volatile memory driver v1.3
> > > Linux agpgart interface v0.103
> > > ACPI: bus type drm_connector registered
> > > [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> > > [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> > > Console: switching to colour frame buffer device 128x48
> > > platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> > > usbcore: registered new interface driver udl
> > > brd: module loaded
> > > loop: module loaded
> > > zram: Added device: zram0
> > > null_blk: disk nullb0 created
> > > null_blk: module loaded
> > > Guest personality initialized and is inactive
> > > VMCI host device registered (name=vmci, major=10, minor=118)
> > > Initialized host personality
> > > usbcore: registered new interface driver rtsx_usb
> > > usbcore: registered new interface driver viperboard
> > > usbcore: registered new interface driver dln2
> > > usbcore: registered new interface driver pn533_usb
> > > nfcsim 0.2 initialized
> > > usbcore: registered new interface driver port100
> > > usbcore: registered new interface driver nfcmrvl
> > > Loading iSCSI transport class v2.0-870.
> > > virtio_scsi virtio0: 1/0/0 default/read/poll queues
> > > [ cut here ]
> > > refcount_t: decrement hit 0; leaking memory.
> > > WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 
> > > refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> > > Modules linked in:
> > > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> > > 6.8.0-syzkaller-11567-gb3603fcb79b1 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > > Google 02/29/2024
> > > RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> > > Code: b2 00 00 00 e8 57 d4 f2 fc 5b 5d c3 cc cc cc cc e8 4b d4 f2 fc c6 
> > > 05 0c f9 ef 0a 01 90 48 c7 c7 a0 5d 1e 8c e8 b7 75 b5 fc 90 <0f> 0b 90 90 
> > > eb d9 e8 2b d4 f2 fc c6 05 e9 f8 ef 0a 01 90 48 c7 c7
> > > RSP: :c9066e18 EFLAGS: 00010246
> > > RAX: 76f86e452fcad900 RBX: 8880210d2aec RCX: 888016ac8000
> > > RDX:  RSI:  RDI: 
> > > RBP: 0004 R08: 8157ffe2 R09: fbfff1c396e0
> > > R10: dc00 R11: fbfff1c396e0 R12: ea000502cdc0
> > > R13: ea000502cdc8 R14: 1d4000a059b9 R15: 
> > > FS:  () GS:8880b940() 
> > > knlGS:
> > > CS:  

[PATCH] tracing: Just use strcmp() for testing __string() and __assign_str() match

2024-03-19 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

As __assign_str() no longer uses its "src" parameter, there's a check to
make sure nothing depends on it being different than what was passed to
__string(). It originally just compared the pointer passed to __string()
with the pointer passed into __assign_str() via the "src" parameter. But
there's a couple of outliers that just pass in a quoted string constant,
where comparing the pointers is UB to the compiler, as the compiler is
free to create multiple copies of the same string constant.

Instead, just use strcmp(). It may slow down the trace event, but this
will eventually be removed.

Also, fix the issue of passing NULL to strcmp() by adding a WARN_ON() to
make sure that both "src" and the pointer saved in __string() are either
both NULL or have content, and then checking if "src" is not NULL before
performing the strcmp().

Link: 
https://lore.kernel.org/all/CAHk-=wjxX16kWd=uxG5wzqt=axoydf1bgwokk+qvmao0zh7...@mail.gmail.com/

Fixes: b1afefa62ca9 ("tracing: Use strcmp() in __assign_str() WARN_ON() check")
Reported-by: Linus Torvalds 
Signed-off-by: Steven Rostedt (Google) 
---
 include/trace/stages/stage6_event_callback.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/trace/stages/stage6_event_callback.h 
b/include/trace/stages/stage6_event_callback.h
index 83da83a0c14f..3690e677263f 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -35,9 +35,8 @@
do {\
char *__str__ = __get_str(dst); \
int __len__ = __get_dynamic_array_len(dst) - 1; \
-   WARN_ON_ONCE(__builtin_constant_p(src) ?\
-strcmp((src), __data_offsets.dst##_ptr_) : \
-(src) != __data_offsets.dst##_ptr_);   \
+   WARN_ON_ONCE(!(void *)(src) != !(void 
*)__data_offsets.dst##_ptr_); \
+   WARN_ON_ONCE((src) && strcmp((src), 
__data_offsets.dst##_ptr_)); \
memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
   EVENT_NULL_STR, __len__);\
__str__[__len__] = '\0';\
-- 
2.43.0




Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Stefan Hajnoczi
On Tue, Mar 19, 2024 at 03:40:53AM -0400, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 12:32:26AM -0700, syzbot wrote:
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of git://git.kernel.org/p..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
> > dashboard link: https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
> > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > Debian) 2.40
> > 
> > Downloadable assets:
> > disk image: 
> > https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
> > vmlinux: 
> > https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
> > kernel image: 
> > https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
> > 
> > Key type pkcs7_test registered
> > Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> > io scheduler mq-deadline registered
> > io scheduler kyber registered
> > io scheduler bfq registered
> > input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> > ACPI: button: Power Button [PWRF]
> > input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> > ACPI: button: Sleep Button [SLPF]
> > ioatdma: Intel(R) QuickData Technology Driver 5.00
> > ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> > ACPI: \_SB_.LNKD: Enabled at IRQ 10
> > virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> > ACPI: \_SB_.LNKB: Enabled at IRQ 10
> > virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> > virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> > N_HDLC line discipline registered with maxframe=4096
> > Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> > 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> > 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> > Non-volatile memory driver v1.3
> > Linux agpgart interface v0.103
> > ACPI: bus type drm_connector registered
> > [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> > [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> > Console: switching to colour frame buffer device 128x48
> > platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> > usbcore: registered new interface driver udl
> > brd: module loaded
> > loop: module loaded
> > zram: Added device: zram0
> > null_blk: disk nullb0 created
> > null_blk: module loaded
> > Guest personality initialized and is inactive
> > VMCI host device registered (name=vmci, major=10, minor=118)
> > Initialized host personality
> > usbcore: registered new interface driver rtsx_usb
> > usbcore: registered new interface driver viperboard
> > usbcore: registered new interface driver dln2
> > usbcore: registered new interface driver pn533_usb
> > nfcsim 0.2 initialized
> > usbcore: registered new interface driver port100
> > usbcore: registered new interface driver nfcmrvl
> > Loading iSCSI transport class v2.0-870.
> > virtio_scsi virtio0: 1/0/0 default/read/poll queues
> > [ cut here ]
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 
> > refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> > Modules linked in:
> > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> > 6.8.0-syzkaller-11567-gb3603fcb79b1 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 02/29/2024
> > RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> > Code: b2 00 00 00 e8 57 d4 f2 fc 5b 5d c3 cc cc cc cc e8 4b d4 f2 fc c6 05 
> > 0c f9 ef 0a 01 90 48 c7 c7 a0 5d 1e 8c e8 b7 75 b5 fc 90 <0f> 0b 90 90 eb 
> > d9 e8 2b d4 f2 fc c6 05 e9 f8 ef 0a 01 90 48 c7 c7
> > RSP: :c9066e18 EFLAGS: 00010246
> > RAX: 76f86e452fcad900 RBX: 8880210d2aec RCX: 888016ac8000
> > RDX:  RSI:  RDI: 
> > RBP: 0004 R08: 8157ffe2 R09: fbfff1c396e0
> > R10: dc00 R11: fbfff1c396e0 R12: ea000502cdc0
> > R13: ea000502cdc8 R14: 1d4000a059b9 R15: 
> > FS:  () GS:8880b940() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 88823000 CR3: 0e132000 CR4: 003506f0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  
> > 

Re: [GIT PULL] virtio: features, fixes

2024-03-19 Thread pr-tracker-bot
The pull request you sent on Tue, 19 Mar 2024 03:41:43 -0400:

> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/d95fcdf4961d27a3d17e5c7728367197adc89b8d

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html



Re: [PATCH v2 0/3] uprobes: two common case speed ups

2024-03-19 Thread Andrii Nakryiko
On Mon, Mar 18, 2024 at 9:21 PM Masami Hiramatsu  wrote:
>
> Hi,
>
> On Mon, 18 Mar 2024 11:17:25 -0700
> Andrii Nakryiko  wrote:
>
> > This patch set implements two speed ups for uprobe/uretprobe runtime 
> > execution
> > path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
> > system-wide (non-PID-specific) uprobes (patch #3). Please see individual
> > patches for details.
>
> This series looks good to me. Let me pick it on probes/for-next.

Great, at least I guessed the Git repo right, if not the branch.
Thanks for pulling it in! I assume some other uprobe-related follow up
patches should be based on probes/for-next as well, right?

>
> Thanks!
>
> >
> > v1->v2:
> >   - rebased onto trace/core branch of tracing tree, hopefully I guessed 
> > right;
> >   - simplified user_cpu_buffer usage further (Oleg Nesterov);
> >   - simplified patch #3, just moved speculative check outside of lock 
> > (Oleg);
> >   - added Reviewed-by from Jiri Olsa.
> >
> > Andrii Nakryiko (3):
> >   uprobes: encapsulate preparation of uprobe args buffer
> >   uprobes: prepare uprobe args buffer lazily
> >   uprobes: add speculative lockless system-wide uprobe filter check
> >
> >  kernel/trace/trace_uprobe.c | 103 +---
> >  1 file changed, 59 insertions(+), 44 deletions(-)
> >
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH 2/2] tracing: Ignore -Wstring-compare with diagnostic macros

2024-03-19 Thread Nathan Chancellor
Commit b1afefa62ca9 ("tracing: Use strcmp() in __assign_str() WARN_ON()
check") addressed a clang warning, -Wstring-compare, with the use of
__builtin_constant_p() to dispatch to strcmp() if the source string is a
string literal and a direct comparison if not. Unfortunately, even with
this change, the warning is still present because __builtin_constant_p()
is not evaluated at this stage of the pipeline, so clang still thinks
the else branch could occur for this situation:

  include/trace/events/sunrpc.h:705:4: error: result of comparison against a 
string literal is unspecified (use an explicit string comparison function 
instead) [-Werror,-Wstring-compare]
  ...
  include/trace/stages/stage6_event_callback.h:40:15: note: expanded from macro 
'__assign_str'
 40 |  (src) != __data_offsets.dst##_ptr_); 
  \
|^
  ...

Use the compiler diagnostic macros to disable this warning around the
WARN_ON_ONCE() expression since a string comparison function, strcmp(),
will always be used for the comparison of string literals.

Fixes: b1afefa62ca9 ("tracing: Use strcmp() in __assign_str() WARN_ON() check")
Reported-by: Linux Kernel Functional Testing 
Closes: 
https://lore.kernel.org/all/CA+G9fYs=otkazs6g1p1ewadfr0qoe6lgovsohqkxmfxoteo...@mail.gmail.com/
Signed-off-by: Nathan Chancellor 
---
 include/trace/stages/stage6_event_callback.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/trace/stages/stage6_event_callback.h 
b/include/trace/stages/stage6_event_callback.h
index 83da83a0c14f..56a4eea5a48e 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -35,9 +35,14 @@
do {\
char *__str__ = __get_str(dst); \
int __len__ = __get_dynamic_array_len(dst) - 1; \
+   __diag_push();  \
+   __diag_ignore(clang, 11, "-Wstring-compare",\
+ "__builtin_constant_p() ensures strcmp()" \
+ "will be used for string literals");  \
WARN_ON_ONCE(__builtin_constant_p(src) ?\
 strcmp((src), __data_offsets.dst##_ptr_) : \
 (src) != __data_offsets.dst##_ptr_);   \
+   __diag_pop();   \
memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
   EVENT_NULL_STR, __len__);\
__str__[__len__] = '\0';\

-- 
2.44.0




[PATCH 0/2] tracing: Fully silence instance of -Wstring-compare

2024-03-19 Thread Nathan Chancellor
Hi all,

This series fully resolves the new instance of -Wstring-compare from
within the __assign_str() macro. The first patch resolves a build
failure with GCC that would be seen with just the second patch applied.
The second patch actually hides the warning.

NOTE: This is based on trace/for-next, which does not contain the
minimum LLVM version bump that occurred later in the current merge
window, so this uses

  __diag_ignore(clang, 11, ...

instead of

  __diag_ignore(clang, 13, ...

which will be required when this is merged into Linus's tree. If you can
base this series on a tree that has the merge commit e5eb28f6d1af ("Merge
tag 'mm-nonmm-stable-2024-03-14-09-36' of
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm") in it, then that
change can be done at application time, rather than merge time (or I can
send a v2). It would be really nice for this to make the merge window so
that this warning does not proliferate into other trees that base on
-rc1.

---
Nathan Chancellor (2):
  compiler_types: Ensure __diag_clang() is always available
  tracing: Ignore -Wstring-compare with diagnostic macros

 include/linux/compiler_types.h   | 4 
 include/trace/stages/stage6_event_callback.h | 5 +
 2 files changed, 9 insertions(+)
---
base-commit: 7604256cecef34a82333d9f78262d3180f4eb525
change-id: 20240319-tracing-fully-silence-wstring-compare-e71e2fd17b2a

Best regards,
-- 
Nathan Chancellor 




[PATCH 1/2] compiler_types: Ensure __diag_clang() is always available

2024-03-19 Thread Nathan Chancellor
Attempting to use __diag_clang() and build with GCC results in a build
error:

  include/linux/compiler_types.h:468:38: error: 'ignore' undeclared (first use 
in this function); did you mean 'inode'?
468 | __diag_ ## compiler(version, ignore, option)
|  ^~

This error occurs because __diag_clang() is only defined in
compiler-clang.h, which is only included when using clang as the
compiler. This error has not been seen before because __diag_clang() has
only been used in __diag_ignore_all(), which is defined in both
compiler-clang.h and compiler-gcc.h.

Add an empty stub for __diag_clang() in compiler_types.h, so that it is
always defined and just becomes a no-op when using GCC.

Fixes: f014a00bbeb0 ("compiler-clang.h: Add __diag infrastructure for clang")
Signed-off-by: Nathan Chancellor 
---
 include/linux/compiler_types.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 3e64ec0f7ac8..fb0c3ff5497d 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -461,6 +461,10 @@ struct ftrace_likely_data {
 #define __diag_GCC(version, severity, string)
 #endif
 
+#ifndef __diag_clang
+#define __diag_clang(version, severity, string)
+#endif
+
 #define __diag_push()  __diag(push)
 #define __diag_pop()   __diag(pop)
 

-- 
2.44.0




Re: [PATCH v13 2/4] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2024-03-19 Thread Tanmay Shah



On 3/19/24 12:29 AM, Krzysztof Kozlowski wrote:
> On 11/03/2024 18:59, Tanmay Shah wrote:
>> From: Radhey Shyam Pandey 
>> 
>> Introduce bindings for TCM memory address space on AMD-xilinx Zynq
>> UltraScale+ platform. It will help in defining TCM in device-tree
>> and make it's access platform agnostic and data-driven.
>> 
>> Tightly-coupled memories(TCMs) are low-latency memory that provides
>> predictable instruction execution and predictable data load/store
>> timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory
>> banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
>> 
>> The TCM resources(reg, reg-names and power-domain) are documented for
>> each TCM in the R5 node. The reg and reg-names are made as required
>> properties as we don't want to hardcode TCM addresses for future
>> platforms and for zu+ legacy implementation will ensure that the
>> old dts w/o reg/reg-names works and stable ABI is maintained.
>> 
>> It also extends the examples for TCM split and lockstep modes.
>> 
>> Signed-off-by: Radhey Shyam Pandey 
>> Signed-off-by: Tanmay Shah 
>> ---
> 
> I responded under my reviewed-tag, but to be clear, also here:
> 
> This patch has is not ready. Please do not merge.
> 

Glad we could catch this before merging this.
I will wait for your reply on other thread for refactoring.

> Best regards,
> Krzysztof
> 




[ANNOUNCE] 5.10.212-rt104

2024-03-19 Thread Luis Claudio R. Goncalves
Hello RT-list!

I'm pleased to announce the 5.10.212-rt104 stable release.

This release is an update to the new stable 5.10.212 version and no RT
changes have been performed.

You can get this release via the git tree at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

  branch: v5.10-rt
  Head SHA1: 6f80a051d66ad3ad791960ddedf516b139da8dd8

Or to build 5.10.212-rt104 directly, the following patches should be applied:

  https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.10.tar.xz

  https://www.kernel.org/pub/linux/kernel/v5.x/patch-5.10.212.xz

  
https://www.kernel.org/pub/linux/kernel/projects/rt/5.10/older/patch-5.10.212-rt104.patch.xz

Signing key fingerprint:

  9354 0649 9972 8D31 D464  D140 F394 A423 F8E6 7C26

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Luis




Re: [PATCH 3/3] drivers: remoteproc: add Versal and Versal-NET support

2024-03-19 Thread Tanmay Shah



On 3/19/24 12:25 AM, Krzysztof Kozlowski wrote:
> On 19/03/2024 02:06, Tanmay Shah wrote:
>> 
>> 
>> On 3/17/24 1:55 PM, Krzysztof Kozlowski wrote:
>>> On 15/03/2024 22:15, Tanmay Shah wrote:
 AMD-Xilinx Versal and Versal-NET are successor of ZynqMP platform. ZynqMP
 remoteproc driver is mostly compatible with new platforms except few
 platform specific differences.

 Versal has same IP of cortex-R5 cores hence maintained compatible string
 same as ZynqMP platform. However, hardcode TCM addresses are not
 supported for new platforms and must be provided in device-tree as per
 new bindings. This makes TCM representation data-driven and easy to
 maintain. This check is provided in the driver.

 For Versal-NET platform, TCM doesn't need to be configured in lockstep
 mode or split mode. Hence that call to PMC firmware is avoided in the
 driver for Versal-NET platform.

 Signed-off-by: Tanmay Shah 
 ---
  drivers/remoteproc/xlnx_r5_remoteproc.c | 19 +++
  1 file changed, 15 insertions(+), 4 deletions(-)

 diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
 b/drivers/remoteproc/xlnx_r5_remoteproc.c
 index d4a22caebaad..193bc159d1b4 100644
 --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
 +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
 @@ -323,9 +323,12 @@ static int zynqmp_r5_set_mode(struct zynqmp_r5_core 
 *r5_core,
return ret;
}
  
 -  ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
 -  if (ret < 0)
 -  dev_err(r5_core->dev, "failed to configure TCM\n");
 +  /* TCM configuration is not needed in versal-net */
 +  if (device_is_compatible(r5_core->dev, "xlnx,zynqmp-r5f")) {
 +  ret = zynqmp_pm_set_tcm_config(r5_core->pm_domain_id, tcm_mode);
 +  if (ret < 0)
 +  dev_err(r5_core->dev, "failed to configure TCM\n");
 +  }
  
return ret;
  }
 @@ -933,10 +936,17 @@ static int zynqmp_r5_core_init(struct 
 zynqmp_r5_cluster *cluster,
int ret, i;
  
r5_core = cluster->r5_cores[0];
 +
 +  /*
 +   * New platforms must use device tree for TCM parsing.
 +   * Only ZynqMP uses hardcode TCM.
 +   */
if (of_find_property(r5_core->np, "reg", NULL))
ret = zynqmp_r5_get_tcm_node_from_dt(cluster);
 -  else
 +  else if (of_machine_is_compatible("xlnx,zynqmp"))
ret = zynqmp_r5_get_tcm_node(cluster);
>>>
>>> That's poor code. Your drivers should not depend on platform. I don't
>>> understand why you need to do this and how is even related to this patch.
>> 
>> You are correct, ideally this shouldn't be needed. However, this driver 
>> contains
>> hardcode TCM addresses that were used before TCM bindings were designed and 
>> available in
>> device-tree. This check is provided to maintain backward compatibility with 
>> device-tree
>> where TCM isn't expected.
>> 
>> For new platforms (Versal and Versal-NET) TCM must be provided in 
>> device-tree and for
>> ZynqMP if it's not in device-tree then to maintain backward compatibility 
>> hardcode
>> addresses are used.
> 
> That does not work like this. You cannot bind to some sort of different
> compatible. If you disagree, please list the compatibles the driver
> binds to.

Okay understood. I believe then removing hardcode addresses makes more sense in 
that
case. So, driver will become same for all the devices.

> 
> Best regards,
> Krzysztof
> 




Re: [PATCH 2/3] dt-bindings: remoteproc: add Versal-NET platform

2024-03-19 Thread Tanmay Shah



On 3/19/24 12:30 AM, Krzysztof Kozlowski wrote:
> On 19/03/2024 01:51, Tanmay Shah wrote:
>> Hello Krzysztof,
>> 
>> Thanks for reviews. Please find my comments below.
>> 
>> On 3/17/24 1:53 PM, Krzysztof Kozlowski wrote:
>>> On 15/03/2024 22:15, Tanmay Shah wrote:
 AMD-Xilinx Versal-NET platform is successor of Versal platform. It
 contains multiple clusters of cortex-R52 real-time processing units.
 Each cluster contains two cores of cortex-R52 processors. Each cluster
 can be configured in lockstep mode or split mode.

 Each R52 core is assigned 128KB of TCM memory. ATCM memory is 64KB, BTCM
 and CTCM memoreis are 32KB each. Each TCM memory has its own dedicated
 power-domain that needs to be requested before using it.

 Signed-off-by: Tanmay Shah 
 ---
  .../remoteproc/xlnx,zynqmp-r5fss.yaml | 220 +++---
  1 file changed, 184 insertions(+), 36 deletions(-)

 diff --git 
 a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml 
 b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
 index 711da0272250..55654ee02eef 100644
 --- a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
 +++ b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
 @@ -18,7 +18,9 @@ description: |
  
  properties:
compatible:
 -const: xlnx,zynqmp-r5fss
 +enum:
 +  - xlnx,zynqmp-r5fss
 +  - xlnx,versal-net-r52fss
  
"#address-cells":
  const: 2
 @@ -64,7 +66,9 @@ patternProperties:
  
  properties:
compatible:
 -const: xlnx,zynqmp-r5f
 +enum:
 +  - xlnx,zynqmp-r5f
 +  - xlnx,versal-net-r52f
  
reg:
  minItems: 1
 @@ -135,9 +139,11 @@ required:
  allOf:
- if:
properties:
 -xlnx,cluster-mode:
 -  enum:
 -- 1
 +compatible:
 +  contains:
 +enum:
 +  - xlnx,versal-net-r52fss
>>>
>>> Why do you touch these lines?
>>>
 +
  then:
patternProperties:
  "^r5f@[0-9a-f]+$":
 @@ -149,16 +155,14 @@ allOf:
items:
  - description: ATCM internal memory
  - description: BTCM internal memory
 -- description: extra ATCM memory in lockstep mode
 -- description: extra BTCM memory in lockstep mode
 +- description: CTCM internal memory
  
  reg-names:
minItems: 1
items:
 -- const: atcm0
 -- const: btcm0
 -- const: atcm1
 -- const: btcm1
 +- const: atcm
 +- const: btcm
 +- const: ctcm
  
  power-domains:
minItems: 2
 @@ -166,33 +170,70 @@ allOf:
  - description: RPU core power domain
  - description: ATCM power domain
  - description: BTCM power domain
 -- description: second ATCM power domain
 -- description: second BTCM power domain
 +- description: CTCM power domain
  
  else:
 -  patternProperties:
 -"^r5f@[0-9a-f]+$":
 -  type: object
 -
 -  properties:
 -reg:
 -  minItems: 1
 -  items:
 -- description: ATCM internal memory
 -- description: BTCM internal memory
 -
 -reg-names:
 -  minItems: 1
 -  items:
 -- const: atcm0
 -- const: btcm0
 -
 -power-domains:
 -  minItems: 2
 -  items:
 -- description: RPU core power domain
 -- description: ATCM power domain
 -- description: BTCM power domain
 +  allOf:
 +- if:
 +properties:
 +  xlnx,cluster-mode:
 +enum:
 +  - 1
>>>
>>> Whatever you did here, is not really readable. You have now multiple
>>> if:then:if:then embedded.
>> 
>> For ZynqMP platform, TCM can be configured differently in lockstep mode
>> and split mode.
>> 
>> For Versal-NET no such configuration is available, but new CTCM memory
>> is added.
>> 
>> So, I am trying to achieve following representation of TCM for both:
>> 
>> if: versal-net compatible
>> then:
>>   ATCM - 64KB
>>   BTCM - 32KB
>>   CTCM - 32KB
>> 
>> else: (ZynqMP compatible)
>>   if:
>> xlnx,cluster-mode (lockstep mode)
>>   then:
>> ATCM0 - 64KB
>> BTCM0 - 64K

Re: [PATCH v2] net/ipv4: add tracepoint for icmp_send

2024-03-19 Thread Steven Rostedt
On Tue, 19 Mar 2024 20:13:52 +0800 (CST)
 wrote:

> From: Peilin He
> 
> Introduce a tracepoint for icmp_send, which can help users to get more
> detail information conveniently when icmp abnormal events happen.
> 
> 1. Giving an usecase example:
> =
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain detailed information about the UDP
> packet loss, including the type, code, source address, destination address,
> source port, and destination port. This facilitates the trouble-shooting
> of UDP packet loss issues especially for those network-service
> applications.
> 
> 2. Operation Instructions:
> ==
> Switch to the tracing directory.
> cd /sys/kernel/debug/tracing

FYI, that directory is obsolete. Please always reference /sys/kernel/tracing.

> Filter for destination port unreachable.
> echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
> echo 1 > events/icmp/icmp_send/enable
> 
> 3. Result View:
> 
>  udp_client_erro-11370   [002] ...s.12   124.728002:
>  icmp_send: icmp_send: type=3, code=3.
>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>  skbaddr=589b167a
> 
> v1->v2:
> Some fixes according to
> https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
>   1. adjust the trace_icmp_send() to more protocols than UDP.
>   2. move the calling of trace_icmp_send after sanity checks
>  in __icmp_send().
> 
> Signed-off-by: Peilin He
> Reviewed-by: xu xin 
> Reviewed-by: Yunkai Zhang 
> Cc: Yang Yang 
> Cc: Liu Chun 
> Cc: Xuexin Jiang 
> ---
>  include/trace/events/icmp.h | 64 
> +
>  net/ipv4/icmp.c |  4 +++
>  2 files changed, 68 insertions(+)
>  create mode 100644 include/trace/events/icmp.h
> 
> diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
> new file mode 100644
> index ..c3dc337be7bc
> --- /dev/null
> +++ b/include/trace/events/icmp.h
> @@ -0,0 +1,64 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM icmp
> +
> +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_ICMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(icmp_send,
> +
> + TP_PROTO(const struct sk_buff *skb, int type, int code),
> +
> + TP_ARGS(skb, type, code),
> +
> + TP_STRUCT__entry(
> + __field(__u16, sport)
> + __field(__u16, dport)
> + __field(int, type)
> + __field(int, code)
> + __array(__u8, saddr, 4)
> + __array(__u8, daddr, 4)
> + __field(const void *, skbaddr)
> + __field(unsigned short, ulen)

Note, to prevent holes, I usually suggest pointers and longs go first,
followed by ints, and then end with char.

__field(const void *, skbaddr)
__field(int, type)
__field(int, code)
__array(__u8, saddr, 4)
__array(__u8, daddr, 4)
__field(__u16, sport)
__field(__u16, dport)
__field(unsigned short, ulen)

-- Steve


> + ),
> +
> + TP_fast_assign(
> + struct iphdr *iph = ip_hdr(skb);
> + int proto_4 = iph->protocol;
> + __be32 *p32;
> +
> + __entry->skbaddr = skb;
> + __entry->type = type;
> + __entry->code = code;
> +
> + if (proto_4 == IPPROTO_UDP) {
> + struct udphdr *uh = udp_hdr(skb);
> + __entry->sport = ntohs(uh->source);
> + __entry->dport = ntohs(uh->dest);
> + __entry->ulen = ntohs(uh->len);
> + } else {
> + __entry->sport = 0;
> + __entry->dport = 0;
> + __entry->ulen = 0;
> + }
> +
> + p32 = (__be32 *) __entry->saddr;
> + *p32 = iph->saddr;
> +
> + p32 = (__be32 *) __entry->daddr;
> + *p32 = iph->daddr;
> + ),
> +
> + TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u 
> ulen=%d skbaddr=%p",
> + __entry->type, __entry->code,
> + __entry->saddr, __entry->sport, __entry->daddr,
> + __entry->dport, __entry->ulen, __entry->skbaddr)
> +);
>

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-19 Thread Peter Hilber
While the virtio-comment list is not available, now also CC'ing Parav,
which may be interested in this virtio-rtc spec related discussion thread.

On 14.03.24 15:19, David Woodhouse wrote:
> On 14 March 2024 11:13:37 CET, Peter Hilber  
> wrote:
>>> To a certain extent, as long as the virtio-rtc device is designed to expose 
>>> time precisely and unambiguously, it's less important if the Linux kernel 
>>> *today* can use that. Although of course we should strive for that. Let's 
>>> be...well, *unambiguous*, I suppose... that we've changed topics to discuss 
>>> that though.
>>>
>>
>> As Virtio is extensible (unlike hardware), my approach is to mostly specify
>> only what also has a PoC user and a use case.
> 
> If we get memory-mapped (X, Y, Z, ±x, ±y) I'll have a user and a use case on 
> day one. Otherwise, as I said in my first response, I can go do that as a 
> separate device and decide that virtio_rtc doesn't meet our needs (especially 
> for maintaining accuracy over LM).

We plan to add 

- leap second indication,

- UTC-to-TAI offset,

- clock smearing indication (including the noon-to-noon linear smearing
  variant which seems to be somewhat popular), and

- clock accuracy indication

to the initial spec and to the PoC implementation.

However, due to resource restrictions, we cannot ourselves add the
memory-mapped clock to the initial spec.

Everyone is very welcome to contribute the memory-mapped clock to the spec,
and I think it might then still make it to the initial version.

> 
> My main concern for virto_rtc is that we avoid *ambiguity*. Yes, I get that 
> it's extensible but we don't want a v1.0 of the spec, implemented by various 
> hypervisors, which still leaves guests not knowing what the actual time is. 
> That would not be good. And even UTC without a leap second indicator has that 
> problem.

Agreed. That should be addressed by the above changes.

Best regards,

Peter



[PATCH v2] net/ipv4: add tracepoint for icmp_send

2024-03-19 Thread xu.xin16
From: Peilin He

Introduce a tracepoint for icmp_send, which can help users to get more
detail information conveniently when icmp abnormal events happen.

1. Giving an usecase example:
=
When an application experiences packet loss due to an unreachable UDP
destination port, the kernel will send an exception message through the
icmp_send function. By adding a trace point for icmp_send, developers or
system administrators can obtain detailed information about the UDP
packet loss, including the type, code, source address, destination address,
source port, and destination port. This facilitates the trouble-shooting
of UDP packet loss issues especially for those network-service
applications.

2. Operation Instructions:
==
Switch to the tracing directory.
cd /sys/kernel/debug/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable

3. Result View:

 udp_client_erro-11370   [002] ...s.12   124.728002:
 icmp_send: icmp_send: type=3, code=3.
 From 127.0.0.1:41895 to 127.0.0.1: ulen=23
 skbaddr=589b167a

v1->v2:
Some fixes according to
https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sztrnkru_tnuwqhufqtjvjsv-nz1x...@mail.gmail.com/
1. adjust the trace_icmp_send() to more protocols than UDP.
2. move the calling of trace_icmp_send after sanity checks
   in __icmp_send().

Signed-off-by: Peilin He
Reviewed-by: xu xin 
Reviewed-by: Yunkai Zhang 
Cc: Yang Yang 
Cc: Liu Chun 
Cc: Xuexin Jiang 
---
 include/trace/events/icmp.h | 64 +
 net/ipv4/icmp.c |  4 +++
 2 files changed, 68 insertions(+)
 create mode 100644 include/trace/events/icmp.h

diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
new file mode 100644
index ..c3dc337be7bc
--- /dev/null
+++ b/include/trace/events/icmp.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM icmp
+
+#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_ICMP_H
+
+#include 
+#include 
+
+TRACE_EVENT(icmp_send,
+
+   TP_PROTO(const struct sk_buff *skb, int type, int code),
+
+   TP_ARGS(skb, type, code),
+
+   TP_STRUCT__entry(
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __field(int, type)
+   __field(int, code)
+   __array(__u8, saddr, 4)
+   __array(__u8, daddr, 4)
+   __field(const void *, skbaddr)
+   __field(unsigned short, ulen)
+   ),
+
+   TP_fast_assign(
+   struct iphdr *iph = ip_hdr(skb);
+   int proto_4 = iph->protocol;
+   __be32 *p32;
+
+   __entry->skbaddr = skb;
+   __entry->type = type;
+   __entry->code = code;
+
+   if (proto_4 == IPPROTO_UDP) {
+   struct udphdr *uh = udp_hdr(skb);
+   __entry->sport = ntohs(uh->source);
+   __entry->dport = ntohs(uh->dest);
+   __entry->ulen = ntohs(uh->len);
+   } else {
+   __entry->sport = 0;
+   __entry->dport = 0;
+   __entry->ulen = 0;
+   }
+
+   p32 = (__be32 *) __entry->saddr;
+   *p32 = iph->saddr;
+
+   p32 = (__be32 *) __entry->daddr;
+   *p32 = iph->daddr;
+   ),
+
+   TP_printk("icmp_send: type=%d, code=%d. From %pI4:%u to %pI4:%u 
ulen=%d skbaddr=%p",
+   __entry->type, __entry->code,
+   __entry->saddr, __entry->sport, __entry->daddr,
+   __entry->dport, __entry->ulen, __entry->skbaddr)
+);
+
+#endif /* _TRACE_ICMP_H */
+
+/* This part must be outside protection */
+#include 
\ No newline at end of file
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e63a3bf99617..21fb41257fe9 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -92,6 +92,8 @@
 #include 
 #include 
 #include 
+#define CREATE_TRACE_POINTS
+#include 

 /*
  * Build xmit assembly blocks
@@ -672,6 +674,8 @@ void __icmp_send(struct sk_buff *skb_in, int type, int 
code, __be32 info,
}
}

+   trace_icmp_send(skb_in, type, code);
+
/* Needed by both icmp_global_allow and icmp_xmit_lock */
local_bh_disable();

-- 
2.15.2



Re: [PATCH v7 0/3] vduse: add support for networking devices

2024-03-19 Thread Michael S. Tsirkin
On Thu, Feb 29, 2024 at 11:16:04AM +0100, Maxime Coquelin wrote:
> Hello Michael,
> 
> On 2/1/24 09:40, Michael S. Tsirkin wrote:
> > On Thu, Feb 01, 2024 at 09:34:11AM +0100, Maxime Coquelin wrote:
> > > Hi Jason,
> > > 
> > > It looks like all patches got acked by you.
> > > Any blocker to queue the series for next release?
> > > 
> > > Thanks,
> > > Maxime
> > 
> > I think it's good enough at this point. Will put it in
> > linux-next shortly.
> > 
> 
> I fetched linux-next and it seems the series is not in yet.
> Is there anything to be reworked on my side?
> 
> Thanks,
> Maxime

I am sorry I messed up. It was in a wrong branch and was not
pushed so of course it did not get tested and I kept wondering
why. I pushed it in my tree but it is too late to put it upstream
for this cycle. Assuming Linus merges my tree
with no drama, I will make an effort not to rebase my tree below them
so these patches will keep their hashes, you can use them already.

-- 
MST




Re: [PATCH] net: hns3: tracing: fix hclgevf trace event strings

2024-03-19 Thread Simon Horman
On Thu, Mar 14, 2024 at 12:00:27PM -0400, Steven Rostedt wrote:
> On Thu, 14 Mar 2024 15:39:28 +0100
> Paolo Abeni  wrote:
> 
> > On Wed, 2024-03-13 at 09:34 -0400, Steven Rostedt wrote:

...

> > > Fixes: d8355240cf8fb ("net: hns3: add trace event support for PF/VF 
> > > mailbox")  
> > 
> > checkpactch in strict mode complains the hash is not 12 char long.
> 
> Hmm, I wonder why my git blame gives me 13 characters in the sha. (I cut
> and pasted it from git blame). My git config has:
> 
> [core]  
> abbrev = 12

I wonder if there is a collusion at 12 chars in your local tree.

...



Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-19 Thread Tobias Huschle

On 2024-03-19 09:29, Michael S. Tsirkin wrote:

On Tue, Mar 19, 2024 at 09:21:06AM +0100, Tobias Huschle wrote:

On 2024-03-15 11:31, Michael S. Tsirkin wrote:
> On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
> > On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> > >
>
> Could you remind me pls, what is the kworker doing specifically that
> vhost is relying on?

The kworker is handling the actual data moving in memory if I'm not
mistaking.


I think that is the vhost process itself. Maybe you mean the
guest thread versus the vhost thread then?


My understanding was that vhost writes data into a file descriptor which 
then triggers eventfd.


That's at least how I read the vhost code if I remember correctly.

The handler beneath (the kworker) then runs the actual instructions that 
move the data to the receiving vhost on the other end of the connection.


Again, I might be wrong here.



Re: [PATCH 2/2] virtio_balloon: Treat stats requests as wakeup events

2024-03-19 Thread David Hildenbrand

On 18.03.24 10:10, David Stevens wrote:

From: David Stevens 

Treat stats requests as wakeup events to ensure that the driver responds
to device requests in a timely manner.

Signed-off-by: David Stevens 
---
  drivers/virtio/virtio_balloon.c | 75 -
  1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7fe7ef5f1c77..402dec98e08c 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -121,11 +121,14 @@ struct virtio_balloon {
struct page_reporting_dev_info pr_dev_info;
  
  	/* State for keeping the wakeup_source active while adjusting the balloon */

-   spinlock_t adjustment_lock;
-   bool adjustment_signal_pending;
-   bool adjustment_in_progress;
+   spinlock_t wakeup_lock;
+   bool processing_wakeup_event;
+   u32 wakeup_signal_mask;
  };
  
+#define ADJUSTMENT_WAKEUP_SIGNAL (1 << 0)

+#define STATS_WAKEUP_SIGNAL (1 << 1)


I'd suggest a different naming like:

VIRTIO_BALLOON_WAKEUP_SIGNAL_ADJUST
VIRTIO_BALLOON_WAKEUP_SIGNAL_STATS


Apart from that, nothing jumped at me.

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 06:08:27PM +1000, Gavin Shan wrote:
> On 3/19/24 17:09, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
> > > 
> > > On 3/19/24 16:43, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> > > > > On 3/19/24 16:09, Michael S. Tsirkin wrote:
> > > > > 
> > > > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > > > @@ -687,9 +687,15 @@ static inline int 
> > > > > > > > > virtqueue_add_split(struct virtqueue *_vq,
> > > > > > > > >   avail = vq->split.avail_idx_shadow & 
> > > > > > > > > (vq->split.vring.num - 1);
> > > > > > > > >   vq->split.vring.avail->ring[avail] = 
> > > > > > > > > cpu_to_virtio16(_vq->vdev, head);
> > > > > > > > > - /* Descriptors and available array need to be set 
> > > > > > > > > before we expose the
> > > > > > > > > -  * new available array entries. */
> > > > > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > > > > + /*
> > > > > > > > > +  * Descriptors and available array need to be set 
> > > > > > > > > before we expose
> > > > > > > > > +  * the new available array entries. virtio_wmb() should 
> > > > > > > > > be enough
> > > > > > > > > +  * to ensuere the order theoretically. However, a 
> > > > > > > > > stronger barrier
> > > > > > > > > +  * is needed by ARM64. Otherwise, the stale data can be 
> > > > > > > > > observed
> > > > > > > > > +  * by the host (vhost). A stronger barrier should work 
> > > > > > > > > for other
> > > > > > > > > +  * architectures, but performance loss is expected.
> > > > > > > > > +  */
> > > > > > > > > + virtio_mb(false);
> > > > > > > > >   vq->split.avail_idx_shadow++;
> > > > > > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > > > >   
> > > > > > > > > vq->split.avail_idx_shadow);
> > > > > > > > 
> > > > > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > > > > solution
> > > > > > > > here, especially when ordering accesses to coherent memory.
> > > > > > > > 
> > > > > > > > In practice, either the larger timing different from the DSB or 
> > > > > > > > the fact
> > > > > > > > that you're going from a Store->Store barrier to a full barrier 
> > > > > > > > is what
> > > > > > > > makes things "work" for you. Have you tried, for example, a DMB 
> > > > > > > > SY
> > > > > > > > (e.g. via __smb_mb()).
> > > > > > > > 
> > > > > > > > We definitely shouldn't take changes like this without a proper
> > > > > > > > explanation of what is going on.
> > > > > > > > 
> > > > > > > 
> > > > > > > Thanks for your comments, Will.
> > > > > > > 
> > > > > > > Yes, DMB should work for us. However, it seems this instruction 
> > > > > > > has issues on
> > > > > > > NVidia's grace-hopper. It's hard for me to understand how DMB and 
> > > > > > > DSB works
> > > > > > > from hardware level. I agree it's not the solution to replace DMB 
> > > > > > > with DSB
> > > > > > > before we fully understand the root cause.
> > > > > > > 
> > > > > > > I tried the possible replacement like below. __smp_mb() can avoid 
> > > > > > > the issue like
> > > > > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) 
> > > > > > > doesn't.
> > > > > > > 
> > > > > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > > > > {
> > > > > > >:
> > > > > > >/* Put entry in available array (but don't update 
> > > > > > > avail->idx until they
> > > > > > > * do sync). */
> > > > > > >avail = vq->split.avail_idx_shadow & 
> > > > > > > (vq->split.vring.num - 1);
> > > > > > >vq->split.vring.avail->ring[avail] = 
> > > > > > > cpu_to_virtio16(_vq->vdev, head);
> > > > > > > 
> > > > > > >/* Descriptors and available array need to be set 
> > > > > > > before we expose the
> > > > > > > * new available array entries. */
> > > > > > >// Broken: virtio_wmb(vq->weak_barriers);
> > > > > > >// Broken: __dma_mb();
> > > > > > >// Work:   __mb();
> > > > > > >// Work:   __smp_mb();
> > > > > > >// Work:   __ndelay(100);
> > > > > > >// Work:   __ndelay(10);
> > > > > > >// Broken: __ndelay(9);
> > > > > > > 
> > > > > > >   vq->split.avail_idx_shadow++;
> > > > > > >vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > >
> > > > > > > vq->split.avail_idx_shadow);
> > > > > > 
> > > > > > What if you stick __ndelay here?
> > > > > > 
> > > > > 
> > > > >  /* Put entry in availabl

Re: [PATCH 1/2] virtio_balloon: Give the balloon its own wakeup source

2024-03-19 Thread David Hildenbrand

On 18.03.24 10:10, David Stevens wrote:

From: David Stevens 

Wakeup sources don't support nesting multiple events, so sharing a
single object between multiple drivers can result in one driver
overriding the wakeup event processing period specified by another
driver. Have the virtio balloon driver use the wakeup source of the
device it is bound to rather than the wakeup source of the parent
device, to avoid conflicts with the transport layer.

Note that although the virtio balloon's virtio_device itself isn't what
actually wakes up the device, it is responsible for processing wakeup
events. In the same way that EPOLLWAKEUP uses a dedicated wakeup_source
to prevent suspend when userspace is processing wakeup events, a
dedicated wakeup_source is necessary when processing wakeup events in a
higher layer in the kernel.

Fixes: b12fbc3f787e ("virtio_balloon: stay awake while adjusting balloon")
Signed-off-by: David Stevens 
---
  drivers/virtio/virtio_balloon.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1f5b3dd31fcf..7fe7ef5f1c77 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -450,7 +450,7 @@ static void start_update_balloon_size(struct virtio_balloon 
*vb)
vb->adjustment_signal_pending = true;
if (!vb->adjustment_in_progress) {
vb->adjustment_in_progress = true;
-   pm_stay_awake(vb->vdev->dev.parent);
+   pm_stay_awake(&vb->vdev->dev);
}
spin_unlock_irqrestore(&vb->adjustment_lock, flags);
  
@@ -462,7 +462,7 @@ static void end_update_balloon_size(struct virtio_balloon *vb)

spin_lock_irq(&vb->adjustment_lock);
if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
vb->adjustment_in_progress = false;
-   pm_relax(vb->vdev->dev.parent);
+   pm_relax(&vb->vdev->dev);
}
spin_unlock_irq(&vb->adjustment_lock);
  }
@@ -1028,6 +1028,7 @@ static int virtballoon_probe(struct virtio_device *vdev)
}
  
  	spin_lock_init(&vb->adjustment_lock);


Can we add a comment here why we have to do that?


+   device_set_wakeup_capable(&vb->vdev->dev, true);
  
  	virtio_device_ready(vdev);
  


Absolutely not an expert on the details, but I assume this is fine.

Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 09:21:06AM +0100, Tobias Huschle wrote:
> On 2024-03-15 11:31, Michael S. Tsirkin wrote:
> > On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:
> > > On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
> > > >
> > 
> > Could you remind me pls, what is the kworker doing specifically that
> > vhost is relying on?
> 
> The kworker is handling the actual data moving in memory if I'm not
> mistaking.

I think that is the vhost process itself. Maybe you mean the
guest thread versus the vhost thread then?




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
> On 3/19/24 16:10, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > > On 3/19/24 02:59, Will Deacon wrote:
> [...]
> > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > > virtqueue *_vq,
> > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > > head);
> > > > > > -   /* Descriptors and available array need to be set before we 
> > > > > > expose the
> > > > > > -* new available array entries. */
> > > > > > -   virtio_wmb(vq->weak_barriers);
> > > > > > +   /*
> > > > > > +* Descriptors and available array need to be set before we 
> > > > > > expose
> > > > > > +* the new available array entries. virtio_wmb() should be 
> > > > > > enough
> > > > > > +* to ensuere the order theoretically. However, a stronger 
> > > > > > barrier
> > > > > > +* is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > +* by the host (vhost). A stronger barrier should work for other
> > > > > > +* architectures, but performance loss is expected.
> > > > > > +*/
> > > > > > +   virtio_mb(false);
> > > > > > vq->split.avail_idx_shadow++;
> > > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > 
> > > > > > vq->split.avail_idx_shadow);
> > > > > 
> > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > solution
> > > > > here, especially when ordering accesses to coherent memory.
> > > > > 
> > > > > In practice, either the larger timing different from the DSB or the 
> > > > > fact
> > > > > that you're going from a Store->Store barrier to a full barrier is 
> > > > > what
> > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > (e.g. via __smb_mb()).
> > > > > 
> > > > > We definitely shouldn't take changes like this without a proper
> > > > > explanation of what is going on.
> > > > > 
> > > > 
> > > > Thanks for your comments, Will.
> > > > 
> > > > Yes, DMB should work for us. However, it seems this instruction has 
> > > > issues on
> > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB 
> > > > works
> > > > from hardware level. I agree it's not the solution to replace DMB with 
> > > > DSB
> > > > before we fully understand the root cause.
> > > > 
> > > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > > issue like
> > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > 
> > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > {
> > > >  :
> > > >  /* Put entry in available array (but don't update avail->idx 
> > > > until they
> > > >   * do sync). */
> > > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > >  vq->split.vring.avail->ring[avail] = 
> > > > cpu_to_virtio16(_vq->vdev, head);
> > > > 
> > > >  /* Descriptors and available array need to be set before we 
> > > > expose the
> > > >   * new available array entries. */
> > > >  // Broken: virtio_wmb(vq->weak_barriers);
> > > >  // Broken: __dma_mb();
> > > >  // Work:   __mb();
> > > >  // Work:   __smp_mb();
> > 
> > Did you try __smp_wmb ? And wmb?
> > 
> 
> virtio_wmb(false) is equivalent to __smb_wmb(), which is broken.


> __wmb() works either. No issue found with it.

So this is 
arch/arm64/include/asm/barrier.h:#define __smp_wmb()dmb(ishst)

versus

arch/arm64/include/asm/barrier.h:#define __wmb()dsb(st)


right?

Really interesting. And you are saying dma_wmb does not work either:

arch/arm64/include/asm/barrier.h:#define __dma_wmb()dmb(oshst)


Really strange.




However I found this:
https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Memory-attributes/Cacheable-and-shareable-memory-attributes



Going by this picture, all CPUs are in the innner shareable domain so
ishst should be enough to synchronize, right?


However, there are two points that give me pause here:


Inner shareable
This represents a shareability domain that can be shared by multiple
processors, but not necessarily all of the agents in the system. A
system might have multiple Inner Shareable domains. An operation that
affects one Inner Shareable domain does not affect other Inner Shareable
domains in the system. An example of such a domain might be a quad-core
Cortex-A57 cluster.


Point 1

Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-03-19 Thread Tobias Huschle

On 2024-03-15 11:31, Michael S. Tsirkin wrote:

On Fri, Mar 15, 2024 at 09:33:49AM +0100, Tobias Huschle wrote:

On Thu, Mar 14, 2024 at 11:09:25AM -0400, Michael S. Tsirkin wrote:
>


Could you remind me pls, what is the kworker doing specifically that
vhost is relying on?


The kworker is handling the actual data moving in memory if I'm not 
mistaking.





Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Gavin Shan

On 3/19/24 17:09, Michael S. Tsirkin wrote:

On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:


On 3/19/24 16:43, Michael S. Tsirkin wrote:

On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:

On 3/19/24 16:09, Michael S. Tsirkin wrote:


diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 49299b1f9ec7..7d852811c912 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
-   /* Descriptors and available array need to be set before we expose the
-* new available array entries. */
-   virtio_wmb(vq->weak_barriers);
+   /*
+* Descriptors and available array need to be set before we expose
+* the new available array entries. virtio_wmb() should be enough
+* to ensuere the order theoretically. However, a stronger barrier
+* is needed by ARM64. Otherwise, the stale data can be observed
+* by the host (vhost). A stronger barrier should work for other
+* architectures, but performance loss is expected.
+*/
+   virtio_mb(false);
vq->split.avail_idx_shadow++;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);


Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
here, especially when ordering accesses to coherent memory.

In practice, either the larger timing different from the DSB or the fact
that you're going from a Store->Store barrier to a full barrier is what
makes things "work" for you. Have you tried, for example, a DMB SY
(e.g. via __smb_mb()).

We definitely shouldn't take changes like this without a proper
explanation of what is going on.



Thanks for your comments, Will.

Yes, DMB should work for us. However, it seems this instruction has issues on
NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
from hardware level. I agree it's not the solution to replace DMB with DSB
before we fully understand the root cause.

I tried the possible replacement like below. __smp_mb() can avoid the issue like
__mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.

static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
{
   :
   /* Put entry in available array (but don't update avail->idx until 
they
* do sync). */
   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
head);

   /* Descriptors and available array need to be set before we expose 
the
* new available array entries. */
   // Broken: virtio_wmb(vq->weak_barriers);
   // Broken: __dma_mb();
   // Work:   __mb();
   // Work:   __smp_mb();
   // Work:   __ndelay(100);
   // Work:   __ndelay(10);
   // Broken: __ndelay(9);

  vq->split.avail_idx_shadow++;
   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
   vq->split.avail_idx_shadow);


What if you stick __ndelay here?



 /* Put entry in available array (but don't update avail->idx until they
   * do sync). */
  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
  vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);

  /* Descriptors and available array need to be set before we expose the
   * new available array entries. */
  virtio_wmb(vq->weak_barriers);
  vq->split.avail_idx_shadow++;
  vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
  vq->split.avail_idx_shadow);
  /* Try __ndelay(x) here as Michael suggested
   *
   * Work:  __ndelay(200);possiblly make it hard to reproduce
   * Broken:__ndelay(100);
   * Broken:__ndelay(20);
   * Broken:__ndelay(10);
   */
  __ndelay(200);


So we see that just changing the timing masks the race.
What are you using on the host side? vhost or qemu?



__ndelay(200) may make the issue harder to be reproduce as I understand.
More delays here will give vhost relief, reducing the race.

The issue is only reproducible when vhost is turned on. Otherwise, we
aren't able to hit the issue.

-netdev 
tap,id=vnet0,vhost=true,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0



Given it's vhost, it's also possible that the issue is host side.
I wonder what happens if we stick a delay or a stronger barrier
in vhost.c - either here:

 /* Make sure buf

[GIT PULL] virtio: features, fixes

2024-03-19 Thread Michael S. Tsirkin
The following changes since commit e8f897f4afef0031fe618a8e94127a0934896aba:

  Linux 6.8 (2024-03-10 13:38:09 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 5da7137de79ca6ffae3ace77050588cdf5263d33:

  virtio_net: rename free_old_xmit_skbs to free_old_xmit (2024-03-19 03:19:22 
-0400)


virtio: features, fixes

Per vq sizes in vdpa.
Info query for block devices support in vdpa.
DMA sync callbacks in vduse.

Fixes, cleanups.

Signed-off-by: Michael S. Tsirkin 


Andrew Melnychenko (1):
  vhost: Added pad cleanup if vnet_hdr is not present.

David Hildenbrand (1):
  virtio: reenable config if freezing device failed

Jason Wang (2):
  virtio-net: convert rx mode setting to use workqueue
  virtio-net: add cond_resched() to the command waiting loop

Jonah Palmer (1):
  vdpa/mlx5: Allow CVQ size changes

Maxime Coquelin (1):
  vduse: implement DMA sync callbacks

Ricardo B. Marliere (2):
  vdpa: make vdpa_bus const
  virtio: make virtio_bus const

Shannon Nelson (1):
  vdpa/pds: fixes for VF vdpa flr-aer handling

Steve Sistare (2):
  vdpa_sim: reset must not run
  vdpa: skip suspend/resume ops if not DRIVER_OK

Suzuki K Poulose (1):
  virtio: uapi: Drop __packed attribute in linux/virtio_pci.h

Xuan Zhuo (3):
  virtio: packed: fix unmap leak for indirect desc table
  virtio_net: unify the code for recycling the xmit ptr
  virtio_net: rename free_old_xmit_skbs to free_old_xmit

Zhu Lingshan (20):
  vhost-vdpa: uapi to support reporting per vq size
  vDPA: introduce get_vq_size to vdpa_config_ops
  vDPA/ifcvf: implement vdpa_config_ops.get_vq_size
  vp_vdpa: implement vdpa_config_ops.get_vq_size
  eni_vdpa: implement vdpa_config_ops.get_vq_size
  vdpa_sim: implement vdpa_config_ops.get_vq_size for vDPA simulator
  vduse: implement vdpa_config_ops.get_vq_size for vduse
  virtio_vdpa: create vqs with the actual size
  vDPA/ifcvf: get_max_vq_size to return max size
  vDPA/ifcvf: implement vdpa_config_ops.get_vq_num_min
  vDPA: report virtio-block capacity to user space
  vDPA: report virtio-block max segment size to user space
  vDPA: report virtio-block block-size to user space
  vDPA: report virtio-block max segments in a request to user space
  vDPA: report virtio-block MQ info to user space
  vDPA: report virtio-block topology info to user space
  vDPA: report virtio-block discarding configuration to user space
  vDPA: report virtio-block write zeroes configuration to user space
  vDPA: report virtio-block read-only info to user space
  vDPA: report virtio-blk flush info to user space

 drivers/net/virtio_net.c | 151 +++-
 drivers/vdpa/alibaba/eni_vdpa.c  |   8 ++
 drivers/vdpa/ifcvf/ifcvf_base.c  |  11 +-
 drivers/vdpa/ifcvf/ifcvf_base.h  |   2 +
 drivers/vdpa/ifcvf/ifcvf_main.c  |  15 +++
 drivers/vdpa/mlx5/net/mlx5_vnet.c|  13 ++-
 drivers/vdpa/pds/aux_drv.c   |   2 +-
 drivers/vdpa/pds/vdpa_dev.c  |  20 +++-
 drivers/vdpa/pds/vdpa_dev.h  |   1 +
 drivers/vdpa/vdpa.c  | 214 ++-
 drivers/vdpa/vdpa_sim/vdpa_sim.c |  15 ++-
 drivers/vdpa/vdpa_user/iova_domain.c |  27 -
 drivers/vdpa/vdpa_user/iova_domain.h |   8 ++
 drivers/vdpa/vdpa_user/vduse_dev.c   |  34 ++
 drivers/vdpa/virtio_pci/vp_vdpa.c|   8 ++
 drivers/vhost/net.c  |   3 +
 drivers/vhost/vdpa.c |  14 +++
 drivers/virtio/virtio.c  |   6 +-
 drivers/virtio/virtio_ring.c |   6 +-
 drivers/virtio/virtio_vdpa.c |   5 +-
 include/linux/vdpa.h |   6 +
 include/uapi/linux/vdpa.h|  17 +++
 include/uapi/linux/vhost.h   |   7 ++
 include/uapi/linux/virtio_pci.h  |  10 +-
 24 files changed, 521 insertions(+), 82 deletions(-)




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Gavin Shan

On 3/19/24 17:04, Michael S. Tsirkin wrote:

On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:

On 3/19/24 16:10, Michael S. Tsirkin wrote:

On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:

On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:

On 3/19/24 02:59, Will Deacon wrote:

[...]

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 49299b1f9ec7..7d852811c912 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
*_vq,
avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
-   /* Descriptors and available array need to be set before we expose the
-* new available array entries. */
-   virtio_wmb(vq->weak_barriers);
+   /*
+* Descriptors and available array need to be set before we expose
+* the new available array entries. virtio_wmb() should be enough
+* to ensuere the order theoretically. However, a stronger barrier
+* is needed by ARM64. Otherwise, the stale data can be observed
+* by the host (vhost). A stronger barrier should work for other
+* architectures, but performance loss is expected.
+*/
+   virtio_mb(false);
vq->split.avail_idx_shadow++;
vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
vq->split.avail_idx_shadow);


Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
here, especially when ordering accesses to coherent memory.

In practice, either the larger timing different from the DSB or the fact
that you're going from a Store->Store barrier to a full barrier is what
makes things "work" for you. Have you tried, for example, a DMB SY
(e.g. via __smb_mb()).

We definitely shouldn't take changes like this without a proper
explanation of what is going on.



Thanks for your comments, Will.

Yes, DMB should work for us. However, it seems this instruction has issues on
NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
from hardware level. I agree it's not the solution to replace DMB with DSB
before we fully understand the root cause.

I tried the possible replacement like below. __smp_mb() can avoid the issue like
__mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.

static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
{
  :
  /* Put entry in available array (but don't update avail->idx until 
they
   * do sync). */
  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
  vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);

  /* Descriptors and available array need to be set before we expose the
   * new available array entries. */
  // Broken: virtio_wmb(vq->weak_barriers);
  // Broken: __dma_mb();
  // Work:   __mb();
  // Work:   __smp_mb();


Did you try __smp_wmb ? And wmb?



virtio_wmb(false) is equivalent to __smb_wmb(), which is broken.

__wmb() works either. No issue found with it.


Oh interesting. So how do smp_mb() and wmb() disassemble on this
platform? Can you please check?



I don't see they have been translated wrongly on Nvidia's grace-hopper:

===> virtio_wmb(vq->weak_barriers)

0x8000807b07c8 <+1168>:  ldrbw0, [x20, #66]
0x8000807b07cc <+1172>:  cbz w0, 0x8000807b089c 

0x8000807b07d0 <+1176>:  dmb ishst // same to __smp_wmb()
:
0x8000807b089c <+1380>:  dmb oshst // same to __dma_wmb()
0x8000807b08a0 <+1384>:  b   0x8000807b07d4 


===> wmb()

0x8000807b07c8 <+1168>:  dsb st





  // Work:   __ndelay(100);
  // Work:   __ndelay(10);
  // Broken: __ndelay(9);

 vq->split.avail_idx_shadow++;
  vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
  vq->split.avail_idx_shadow);


What if you stick __ndelay here?


And keep virtio_wmb above?



The result has been shared through a separate reply.




  vq->num_added++;

  pr_debug("Added buffer head %i to %p\n", head, vq);
  END_USE(vq);
  :
}

I also tried to measure the consumed time for various barrier-relative 
instructions using
ktime_get_ns() which should have consumed most of the time. __smb_mb() is 
slower than
__smp_wmb() but faster than __mb()

  Instruction   Range of used time in ns
  --
  __smp_wmb()   [32  1128032]
  __smp_mb()[32  1160096]
  __mb()[32  1162496]



Thanks,
Gavin




Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 12:32:26AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of git://git.kernel.org/p..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
> kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
> dashboard link: https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> 
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
> 
> Key type pkcs7_test registered
> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> io scheduler mq-deadline registered
> io scheduler kyber registered
> io scheduler bfq registered
> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> ACPI: button: Power Button [PWRF]
> input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> ACPI: button: Sleep Button [SLPF]
> ioatdma: Intel(R) QuickData Technology Driver 5.00
> ACPI: \_SB_.LNKC: Enabled at IRQ 11
> virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKD: Enabled at IRQ 10
> virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> ACPI: \_SB_.LNKB: Enabled at IRQ 10
> virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> N_HDLC line discipline registered with maxframe=4096
> Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> Non-volatile memory driver v1.3
> Linux agpgart interface v0.103
> ACPI: bus type drm_connector registered
> [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> Console: switching to colour frame buffer device 128x48
> platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> usbcore: registered new interface driver udl
> brd: module loaded
> loop: module loaded
> zram: Added device: zram0
> null_blk: disk nullb0 created
> null_blk: module loaded
> Guest personality initialized and is inactive
> VMCI host device registered (name=vmci, major=10, minor=118)
> Initialized host personality
> usbcore: registered new interface driver rtsx_usb
> usbcore: registered new interface driver viperboard
> usbcore: registered new interface driver dln2
> usbcore: registered new interface driver pn533_usb
> nfcsim 0.2 initialized
> usbcore: registered new interface driver port100
> usbcore: registered new interface driver nfcmrvl
> Loading iSCSI transport class v2.0-870.
> virtio_scsi virtio0: 1/0/0 default/read/poll queues
> [ cut here ]
> refcount_t: decrement hit 0; leaking memory.
> WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 
> lib/refcount.c:31
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-syzkaller-11567-gb3603fcb79b1 
> #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 02/29/2024
> RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> Code: b2 00 00 00 e8 57 d4 f2 fc 5b 5d c3 cc cc cc cc e8 4b d4 f2 fc c6 05 0c 
> f9 ef 0a 01 90 48 c7 c7 a0 5d 1e 8c e8 b7 75 b5 fc 90 <0f> 0b 90 90 eb d9 e8 
> 2b d4 f2 fc c6 05 e9 f8 ef 0a 01 90 48 c7 c7
> RSP: :c9066e18 EFLAGS: 00010246
> RAX: 76f86e452fcad900 RBX: 8880210d2aec RCX: 888016ac8000
> RDX:  RSI:  RDI: 
> RBP: 0004 R08: 8157ffe2 R09: fbfff1c396e0
> R10: dc00 R11: fbfff1c396e0 R12: ea000502cdc0
> R13: ea000502cdc8 R14: 1d4000a059b9 R15: 
> FS:  () GS:8880b940() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 88823000 CR3: 0e132000 CR4: 003506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  
>  reset_page_owner include/linux/page_owner.h:25 [inline]
>  free_pages_prepare mm/page_alloc.c:1141 [inline]
>  __free_pages_ok+0xc54/0xd80 mm/page_alloc.c:1270
>  make_alloc_exact+0xa3/0xf0 mm/page_alloc.c:4829
>  vring_alloc_queue drivers/virtio/virtio_ring.c:319 

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Mon, Mar 18, 2024 at 04:59:24PM +, Will Deacon wrote:
> On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > The issue is reported by Yihuang Yu who have 'netperf' test on
> > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > client is started in the VM hosted by grace-hopper machine,
> > while the 'netperf' server is running on grace-grace machine.
> > 
> > The VM is started with virtio-net and vhost has been enabled.
> > We observe a error message spew from VM and then soft-lockup
> > report. The error message indicates the data associated with
> > the descriptor (index: 135) has been released, and the queue
> > is marked as broken. It eventually leads to the endless effort
> > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > and soft-lockup. The stale index 135 is fetched from the available
> > ring and published to the used ring by vhost, meaning we have
> > disordred write to the available ring element and available index.
> > 
> >   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> >   -accel kvm -machine virt,gic-version=host\
> >  : \
> >   -netdev tap,id=vnet0,vhost=on\
> >   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > 
> >   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > 
> > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
> > ARM64. It should work for other architectures, but performance loss is
> > expected.
> > 
> > Cc: sta...@vger.kernel.org
> > Reported-by: Yihuang Yu 
> > Signed-off-by: Gavin Shan 
> > ---
> >  drivers/virtio/virtio_ring.c | 12 +---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 49299b1f9ec7..7d852811c912 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
> > *_vq,
> > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> >  
> > -   /* Descriptors and available array need to be set before we expose the
> > -* new available array entries. */
> > -   virtio_wmb(vq->weak_barriers);
> > +   /*
> > +* Descriptors and available array need to be set before we expose
> > +* the new available array entries. virtio_wmb() should be enough
> > +* to ensuere the order theoretically. However, a stronger barrier
> > +* is needed by ARM64. Otherwise, the stale data can be observed
> > +* by the host (vhost). A stronger barrier should work for other
> > +* architectures, but performance loss is expected.
> > +*/
> > +   virtio_mb(false);
> > vq->split.avail_idx_shadow++;
> > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > vq->split.avail_idx_shadow);
> 
> Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> here, especially when ordering accesses to coherent memory.
> 
> In practice, either the larger timing different from the DSB or the fact
> that you're going from a Store->Store barrier to a full barrier is what
> makes things "work" for you. Have you tried, for example, a DMB SY
> (e.g. via __smb_mb()).
> 
> We definitely shouldn't take changes like this without a proper
> explanation of what is going on.
> 
> Will

Just making sure: so on this system, how do
smp_wmb() and wmb() differ? smb_wmb is normally for synchronizing
with kernel running on another CPU and we are doing something
unusual in virtio when we use it to synchronize with host
as opposed to the guest - e.g. CONFIG_SMP is special cased
because of this:

#define virt_wmb() do { kcsan_wmb(); __smp_wmb(); } while (0)

Note __smp_wmb not smp_wmb which would be a NOP on UP.


-- 
MST




[syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread syzbot
Hello,

syzbot found the following issue on:

HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of git://git.kernel.org/p..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
dashboard link: https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
2.40

Downloadable assets:
disk image: 
https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
vmlinux: 
https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
kernel image: 
https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com

Key type pkcs7_test registered
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
io scheduler mq-deadline registered
io scheduler kyber registered
io scheduler bfq registered
input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
ACPI: button: Power Button [PWRF]
input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
ACPI: button: Sleep Button [SLPF]
ioatdma: Intel(R) QuickData Technology Driver 5.00
ACPI: \_SB_.LNKC: Enabled at IRQ 11
virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
ACPI: \_SB_.LNKD: Enabled at IRQ 10
virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
ACPI: \_SB_.LNKB: Enabled at IRQ 10
virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
N_HDLC line discipline registered with maxframe=4096
Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
Non-volatile memory driver v1.3
Linux agpgart interface v0.103
ACPI: bus type drm_connector registered
[drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
[drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
Console: switching to colour frame buffer device 128x48
platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
usbcore: registered new interface driver udl
brd: module loaded
loop: module loaded
zram: Added device: zram0
null_blk: disk nullb0 created
null_blk: module loaded
Guest personality initialized and is inactive
VMCI host device registered (name=vmci, major=10, minor=118)
Initialized host personality
usbcore: registered new interface driver rtsx_usb
usbcore: registered new interface driver viperboard
usbcore: registered new interface driver dln2
usbcore: registered new interface driver pn533_usb
nfcsim 0.2 initialized
usbcore: registered new interface driver port100
usbcore: registered new interface driver nfcmrvl
Loading iSCSI transport class v2.0-870.
virtio_scsi virtio0: 1/0/0 default/read/poll queues
[ cut here ]
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 
lib/refcount.c:31
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.8.0-syzkaller-11567-gb3603fcb79b1 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
02/29/2024
RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
Code: b2 00 00 00 e8 57 d4 f2 fc 5b 5d c3 cc cc cc cc e8 4b d4 f2 fc c6 05 0c 
f9 ef 0a 01 90 48 c7 c7 a0 5d 1e 8c e8 b7 75 b5 fc 90 <0f> 0b 90 90 eb d9 e8 2b 
d4 f2 fc c6 05 e9 f8 ef 0a 01 90 48 c7 c7
RSP: :c9066e18 EFLAGS: 00010246
RAX: 76f86e452fcad900 RBX: 8880210d2aec RCX: 888016ac8000
RDX:  RSI:  RDI: 
RBP: 0004 R08: 8157ffe2 R09: fbfff1c396e0
R10: dc00 R11: fbfff1c396e0 R12: ea000502cdc0
R13: ea000502cdc8 R14: 1d4000a059b9 R15: 
FS:  () GS:8880b940() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 88823000 CR3: 0e132000 CR4: 003506f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 
 reset_page_owner include/linux/page_owner.h:25 [inline]
 free_pages_prepare mm/page_alloc.c:1141 [inline]
 __free_pages_ok+0xc54/0xd80 mm/page_alloc.c:1270
 make_alloc_exact+0xa3/0xf0 mm/page_alloc.c:4829
 vring_alloc_queue drivers/virtio/virtio_ring.c:319 [inline]
 vring_alloc_queue_split+0x20a/0x600 drivers/virtio/virtio_ring.c:1108
 vring_create_virtqueue_split+0xc6/0x310 drivers/virtio/virtio_ring.c:1158
 vring_create_virtqueue+0xca/0x110 drivers/virtio/virtio_ring.c:2683
 setup_vq+0xe9/0x2d0 drivers/virtio/v

Re: [PATCH v3] vduse: Fix off by one in vduse_dev_mmap()

2024-03-19 Thread Michael S. Tsirkin
On Wed, Feb 28, 2024 at 09:24:07PM +0300, Dan Carpenter wrote:
> The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> reading one element beyond the end of the array.
> 
> Add an array_index_nospec() as well to prevent speculation issues.
> 
> Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> Signed-off-by: Dan Carpenter 

Thanks a lot!
I assume this will be squashed in the relevant patch when that is
re-spun.

> ---
> v2: add array_index_nospec()
> v3: I accidentally corrupted v2.  Try again.
> 
>  drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index b7a1fb88c506..eb914084c650 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1532,9 +1532,10 @@ static int vduse_dev_mmap(struct file *file, struct 
> vm_area_struct *vma)
>   if ((vma->vm_flags & VM_SHARED) == 0)
>   return -EINVAL;
>  
> - if (index > dev->vq_num)
> + if (index >= dev->vq_num)
>   return -EINVAL;
>  
> + index = array_index_nospec(index, dev->vq_num);
>   vq = dev->vqs[index];
>   vaddr = vq->vdpa_reconnect_vaddr;
>   if (vaddr == 0)
> -- 
> 2.43.0




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 04:49:50PM +1000, Gavin Shan wrote:
> 
> On 3/19/24 16:43, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 04:38:49PM +1000, Gavin Shan wrote:
> > > On 3/19/24 16:09, Michael S. Tsirkin wrote:
> > > 
> > > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > > > virtqueue *_vq,
> > > > > > >   avail = vq->split.avail_idx_shadow & 
> > > > > > > (vq->split.vring.num - 1);
> > > > > > >   vq->split.vring.avail->ring[avail] = 
> > > > > > > cpu_to_virtio16(_vq->vdev, head);
> > > > > > > - /* Descriptors and available array need to be set before we 
> > > > > > > expose the
> > > > > > > -  * new available array entries. */
> > > > > > > - virtio_wmb(vq->weak_barriers);
> > > > > > > + /*
> > > > > > > +  * Descriptors and available array need to be set before we 
> > > > > > > expose
> > > > > > > +  * the new available array entries. virtio_wmb() should be 
> > > > > > > enough
> > > > > > > +  * to ensuere the order theoretically. However, a stronger 
> > > > > > > barrier
> > > > > > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > > +  * by the host (vhost). A stronger barrier should work for other
> > > > > > > +  * architectures, but performance loss is expected.
> > > > > > > +  */
> > > > > > > + virtio_mb(false);
> > > > > > >   vq->split.avail_idx_shadow++;
> > > > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > >   
> > > > > > > vq->split.avail_idx_shadow);
> > > > > > 
> > > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > > solution
> > > > > > here, especially when ordering accesses to coherent memory.
> > > > > > 
> > > > > > In practice, either the larger timing different from the DSB or the 
> > > > > > fact
> > > > > > that you're going from a Store->Store barrier to a full barrier is 
> > > > > > what
> > > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > > (e.g. via __smb_mb()).
> > > > > > 
> > > > > > We definitely shouldn't take changes like this without a proper
> > > > > > explanation of what is going on.
> > > > > > 
> > > > > 
> > > > > Thanks for your comments, Will.
> > > > > 
> > > > > Yes, DMB should work for us. However, it seems this instruction has 
> > > > > issues on
> > > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB 
> > > > > works
> > > > > from hardware level. I agree it's not the solution to replace DMB 
> > > > > with DSB
> > > > > before we fully understand the root cause.
> > > > > 
> > > > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > > > issue like
> > > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) 
> > > > > doesn't.
> > > > > 
> > > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > > {
> > > > >   :
> > > > >   /* Put entry in available array (but don't update 
> > > > > avail->idx until they
> > > > >* do sync). */
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 
> > > > > 1);
> > > > >   vq->split.vring.avail->ring[avail] = 
> > > > > cpu_to_virtio16(_vq->vdev, head);
> > > > > 
> > > > >   /* Descriptors and available array need to be set before we 
> > > > > expose the
> > > > >* new available array entries. */
> > > > >   // Broken: virtio_wmb(vq->weak_barriers);
> > > > >   // Broken: __dma_mb();
> > > > >   // Work:   __mb();
> > > > >   // Work:   __smp_mb();
> > > > >   // Work:   __ndelay(100);
> > > > >   // Work:   __ndelay(10);
> > > > >   // Broken: __ndelay(9);
> > > > > 
> > > > >  vq->split.avail_idx_shadow++;
> > > > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > >   
> > > > > vq->split.avail_idx_shadow);
> > > > 
> > > > What if you stick __ndelay here?
> > > > 
> > > 
> > > /* Put entry in available array (but don't update avail->idx 
> > > until they
> > >   * do sync). */
> > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >  vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > 
> > >  /* Descriptors and available array need to be set before we 
> > > expose the
> > >   * new available array entries. */
> > >  virtio_wmb(vq->weak_barriers);
> > >  vq->split.avail_idx_shadow++;
> > >  vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Michael S. Tsirkin
On Tue, Mar 19, 2024 at 04:54:15PM +1000, Gavin Shan wrote:
> On 3/19/24 16:10, Michael S. Tsirkin wrote:
> > On Tue, Mar 19, 2024 at 02:09:34AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> > > > On 3/19/24 02:59, Will Deacon wrote:
> [...]
> > > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > > b/drivers/virtio/virtio_ring.c
> > > > > > index 49299b1f9ec7..7d852811c912 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > > > > virtqueue *_vq,
> > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > > > > head);
> > > > > > -   /* Descriptors and available array need to be set before we 
> > > > > > expose the
> > > > > > -* new available array entries. */
> > > > > > -   virtio_wmb(vq->weak_barriers);
> > > > > > +   /*
> > > > > > +* Descriptors and available array need to be set before we 
> > > > > > expose
> > > > > > +* the new available array entries. virtio_wmb() should be 
> > > > > > enough
> > > > > > +* to ensuere the order theoretically. However, a stronger 
> > > > > > barrier
> > > > > > +* is needed by ARM64. Otherwise, the stale data can be observed
> > > > > > +* by the host (vhost). A stronger barrier should work for other
> > > > > > +* architectures, but performance loss is expected.
> > > > > > +*/
> > > > > > +   virtio_mb(false);
> > > > > > vq->split.avail_idx_shadow++;
> > > > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > > > > 
> > > > > > vq->split.avail_idx_shadow);
> > > > > 
> > > > > Replacing a DMB with a DSB is _very_ unlikely to be the correct 
> > > > > solution
> > > > > here, especially when ordering accesses to coherent memory.
> > > > > 
> > > > > In practice, either the larger timing different from the DSB or the 
> > > > > fact
> > > > > that you're going from a Store->Store barrier to a full barrier is 
> > > > > what
> > > > > makes things "work" for you. Have you tried, for example, a DMB SY
> > > > > (e.g. via __smb_mb()).
> > > > > 
> > > > > We definitely shouldn't take changes like this without a proper
> > > > > explanation of what is going on.
> > > > > 
> > > > 
> > > > Thanks for your comments, Will.
> > > > 
> > > > Yes, DMB should work for us. However, it seems this instruction has 
> > > > issues on
> > > > NVidia's grace-hopper. It's hard for me to understand how DMB and DSB 
> > > > works
> > > > from hardware level. I agree it's not the solution to replace DMB with 
> > > > DSB
> > > > before we fully understand the root cause.
> > > > 
> > > > I tried the possible replacement like below. __smp_mb() can avoid the 
> > > > issue like
> > > > __mb() does. __ndelay(10) can avoid the issue, but __ndelay(9) doesn't.
> > > > 
> > > > static inline int virtqueue_add_split(struct virtqueue *_vq, ...)
> > > > {
> > > >  :
> > > >  /* Put entry in available array (but don't update avail->idx 
> > > > until they
> > > >   * do sync). */
> > > >  avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > >  vq->split.vring.avail->ring[avail] = 
> > > > cpu_to_virtio16(_vq->vdev, head);
> > > > 
> > > >  /* Descriptors and available array need to be set before we 
> > > > expose the
> > > >   * new available array entries. */
> > > >  // Broken: virtio_wmb(vq->weak_barriers);
> > > >  // Broken: __dma_mb();
> > > >  // Work:   __mb();
> > > >  // Work:   __smp_mb();
> > 
> > Did you try __smp_wmb ? And wmb?
> > 
> 
> virtio_wmb(false) is equivalent to __smb_wmb(), which is broken.
> 
> __wmb() works either. No issue found with it.

Oh interesting. So how do smp_mb() and wmb() disassemble on this
platform? Can you please check?


> > > >  // Work:   __ndelay(100);
> > > >  // Work:   __ndelay(10);
> > > >  // Broken: __ndelay(9);
> > > > 
> > > > vq->split.avail_idx_shadow++;
> > > >  vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > > >  
> > > > vq->split.avail_idx_shadow);
> > > 
> > > What if you stick __ndelay here?
> > 
> > And keep virtio_wmb above?
> > 
> 
> The result has been shared through a separate reply.
> 
> > > 
> > > >  vq->num_added++;
> > > > 
> > > >  pr_debug("Added buffer head %i to %p\n", head, vq);
> > > >  END_USE(vq);
> > > >  :
> > > > }
> > > > 
> > > > I also tried to measure the consumed time for various barrier-relative 
> > > > instructions using
> > > > ktime_get_ns() which should have consumed most of the time. __smb_mb() 
> > > > is slower than
> > > > __smp_wmb() but faster than __mb()
> > > > 
> > >