Re: [PATCH v5 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-26 Thread Huang, Ying
"Ho-Ren (Jack) Chuang"  writes:

> The current implementation treats emulated memory devices, such as
> CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> (E820_TYPE_RAM). However, these emulated devices have different
> characteristics than traditional DRAM, making it important to
> distinguish them. Thus, we modify the tiered memory initialization process
> to introduce a delay specifically for CPUless NUMA nodes. This delay
> ensures that the memory tier initialization for these nodes is deferred
> until HMAT information is obtained during the boot process. Finally,
> demotion tables are recalculated at the end.
>
> * late_initcall(memory_tier_late_init);
> Some device drivers may have initialized memory tiers between
> `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> online memory nodes and configuring memory tiers. They should be excluded
> in the late init.
>
> * Handle cases where there is no HMAT when creating memory tiers
> There is a scenario where a CPUless node does not provide HMAT information.
> If no HMAT is specified, it falls back to using the default DRAM tier.
>
> * Introduce another new lock `default_dram_perf_lock` for adist calculation
> In the current implementation, iterating through CPUless nodes requires
> holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> trying to acquire the same lock, leading to a potential deadlock.
> Therefore, we propose introducing a standalone `default_dram_perf_lock` to
> protect `default_dram_perf_*`. This approach not only avoids the deadlock
> but also avoids holding one large lock for unrelated data. Besides, this
> patch slightly shortens the lock hold time by moving the lock closer to
> the data it protects.
>
> * Upgrade `set_node_memory_tier` to support additional cases, including
>   default DRAM, late CPUless, and hot-plugged initializations.
> To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> handle cases where memtype is not initialized and where HMAT information is
> available.
>
> * Introduce `default_memory_types` for those memory types that are not
>   initialized by device drivers.
> Because late-initialized memory and default DRAM memory need to be managed,
> a default memory type list is created to hold all memory types that are
> not initialized by device drivers, and to serve as a fallback.
>
> * Fix a deadlock bug in `mt_perf_to_adistance`
> An error path in `mt_perf_to_adistance` returned without releasing the
> lock; unlock before returning the error.
>
> Signed-off-by: Ho-Ren (Jack) Chuang 
> Signed-off-by: Hao Xiang 
> ---
>  mm/memory-tiers.c | 85 +++
>  1 file changed, 72 insertions(+), 13 deletions(-)
>
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 974af10cfdd8..610db9581ba4 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -36,6 +36,11 @@ struct node_memory_type_map {
>  
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
> +/*
> + * The list is used to store all memory types that are not created
> + * by a device driver.
> + */
> +static LIST_HEAD(default_memory_types);
>  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
>  struct memory_dev_type *default_dram_type;
>  
> @@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
>  
>  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
>  
> +/* The lock is used to protect `default_dram_perf*` info and nid. */
> +static DEFINE_MUTEX(default_dram_perf_lock);
>  static bool default_dram_perf_error;
>  static struct access_coordinate default_dram_perf;
>  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> @@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct memory_dev_type *mem
>  static struct memory_tier *set_node_memory_tier(int node)
>  {
>   struct memory_tier *memtier;
> - struct memory_dev_type *memtype;
> + struct memory_dev_type *mtype = default_dram_type;
> + int adist = MEMTIER_ADISTANCE_DRAM;
>   pg_data_t *pgdat = NODE_DATA(node);
>  
>  
> @@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int node)
>   if (!node_state(node, N_MEMORY))
>   return ERR_PTR(-EINVAL);
>  
> - __init_node_memory_type(node, default_dram_type);
> + mt_calc_adistance(node, &adist);
> + if (node_memory_types[node].memtype == NULL) {
> + mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
> + if (IS_ERR(mtype)) {
> + mtype = default_dram_type;
> + pr_info("Failed to allocate a memory type. Fall back.\n");
> + }
> + }
>  
> - memtype = node_memory_types[node].memtype;
> - node_set(node, memtype->nodes);
> - memtier = find_create_memory_tier(memtype);
> + __init_node_memory_type(node, mtype);
> +

[PATCH v5 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-26 Thread Ho-Ren (Jack) Chuang
The current implementation treats emulated memory devices, such as
CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to
distinguish them. Thus, we modify the tiered memory initialization process
to introduce a delay specifically for CPUless NUMA nodes. This delay
ensures that the memory tier initialization for these nodes is deferred
until HMAT information is obtained during the boot process. Finally,
demotion tables are recalculated at the end.

* late_initcall(memory_tier_late_init);
Some device drivers may have initialized memory tiers between
`memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
online memory nodes and configuring memory tiers. They should be excluded
in the late init.
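
For illustration, the deferred pass could look roughly like this (a sketch
only: the corresponding hunk is truncated in this archive, and anything
beyond the names used in this description is illustrative):

  static int __init memory_tier_late_init(void)
  {
          int nid;

          mutex_lock(&memory_tier_lock);
          for_each_node_state(nid, N_MEMORY) {
                  /* Skip nodes that a device driver already initialized
                   * between memory_tier_init() and this late initcall.
                   */
                  if (node_memory_types[nid].memtype)
                          continue;

                  set_node_memory_tier(nid);
          }
          establish_demotion_targets();
          mutex_unlock(&memory_tier_lock);

          return 0;
  }
  late_initcall(memory_tier_late_init);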

* Handle cases where there is no HMAT when creating memory tiers
There is a scenario where a CPUless node does not provide HMAT information.
If no HMAT is specified, it falls back to using the default DRAM tier.

* Introduce another new lock `default_dram_perf_lock` for adist calculation
In the current implementation, iterating through CPUless nodes requires
holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
trying to acquire the same lock, leading to a potential deadlock.
Therefore, we propose introducing a standalone `default_dram_perf_lock` to
protect `default_dram_perf_*`. This approach not only avoids the deadlock
but also avoids holding one large lock for unrelated data. Besides, this
patch slightly shortens the lock hold time by moving the lock closer to
the data it protects.
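
The pattern, roughly (a sketch of the locking change, not the exact hunk):

  static DEFINE_MUTEX(default_dram_perf_lock);

  int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
                               const char *source)
  {
          int rc = 0;

          mutex_lock(&default_dram_perf_lock);
          if (default_dram_perf_error) {
                  rc = -EIO;
                  goto out;
          }
          /* ... record or compare perf against default_dram_perf ... */
  out:
          mutex_unlock(&default_dram_perf_lock);
          return rc;
  }

With this split, `mt_calc_adistance()` only ever takes the smaller lock, so
it can safely be called while `memory_tier_lock` is held.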

* Upgrade `set_node_memory_tier` to support additional cases, including
  default DRAM, late CPUless, and hot-plugged initializations.
To cover hot-plugged memory nodes, `mt_calc_adistance()` and
`mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
handle cases where memtype is not initialized and where HMAT information is
available.

* Introduce `default_memory_types` for those memory types that are not
  initialized by device drivers.
Because late-initialized memory and default DRAM memory need to be managed,
a default memory type list is created to hold all memory types that are
not initialized by device drivers, and to serve as a fallback.

* Fix a deadlock bug in `mt_perf_to_adistance`
An error path in `mt_perf_to_adistance` returned without releasing the
lock; unlock before returning the error.
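
The fixed error path, roughly (a sketch; the corresponding hunk is not
shown above):

  int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
  {
          mutex_lock(&default_dram_perf_lock);
          if (default_dram_perf_error) {
                  /* This path previously returned with the lock held. */
                  mutex_unlock(&default_dram_perf_lock);
                  return -EIO;
          }
          /* ... compute *adist from perf and default_dram_perf ... */
          mutex_unlock(&default_dram_perf_lock);
          return 0;
  }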

Signed-off-by: Ho-Ren (Jack) Chuang 
Signed-off-by: Hao Xiang 
---
 mm/memory-tiers.c | 85 +++
 1 file changed, 72 insertions(+), 13 deletions(-)

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 974af10cfdd8..610db9581ba4 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -36,6 +36,11 @@ struct node_memory_type_map {
 
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+/*
+ * The list is used to store all memory types that are not created
+ * by a device driver.
+ */
+static LIST_HEAD(default_memory_types);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
 struct memory_dev_type *default_dram_type;
 
@@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;
 
 static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
 
+/* The lock is used to protect `default_dram_perf*` info and nid. */
+static DEFINE_MUTEX(default_dram_perf_lock);
 static bool default_dram_perf_error;
 static struct access_coordinate default_dram_perf;
 static int default_dram_perf_ref_nid = NUMA_NO_NODE;
@@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct memory_dev_type *mem
 static struct memory_tier *set_node_memory_tier(int node)
 {
struct memory_tier *memtier;
-   struct memory_dev_type *memtype;
+   struct memory_dev_type *mtype = default_dram_type;
+   int adist = MEMTIER_ADISTANCE_DRAM;
pg_data_t *pgdat = NODE_DATA(node);
 
 
@@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int node)
if (!node_state(node, N_MEMORY))
return ERR_PTR(-EINVAL);
 
-   __init_node_memory_type(node, default_dram_type);
+   mt_calc_adistance(node, &adist);
+   if (node_memory_types[node].memtype == NULL) {
+   mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
+   if (IS_ERR(mtype)) {
+   mtype = default_dram_type;
+   pr_info("Failed to allocate a memory type. Fall back.\n");
+   }
+   }
 
-   memtype = node_memory_types[node].memtype;
-   node_set(node, memtype->nodes);
-   memtier = find_create_memory_tier(memtype);
+   __init_node_memory_type(node, mtype);
+
+   mtype = node_memory_types[node].memtype;
+   node_set(node, mtype->nodes);
+   memtier = find_create_memory_tier(mtype);
if (!IS_ERR(memtier))

[PATCH v5 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types

2024-03-26 Thread Ho-Ren (Jack) Chuang
Since different memory devices all need to find, allocate, and put
memory types, this patch abstracts these common steps into shared helpers,
making the code more scalable and concise.

Signed-off-by: Ho-Ren (Jack) Chuang 
---
 drivers/dax/kmem.c   | 20 ++--
 include/linux/memory-tiers.h | 13 +
 mm/memory-tiers.c| 32 
 3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 42ee360cf4e3..01399e5b53b2 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
 
 static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
 {
-   bool found = false;
struct memory_dev_type *mtype;
 
mutex_lock(&kmem_memory_type_lock);
-   list_for_each_entry(mtype, &kmem_memory_types, list) {
-   if (mtype->adistance == adist) {
-   found = true;
-   break;
-   }
-   }
-   if (!found) {
-   mtype = alloc_memory_type(adist);
-   if (!IS_ERR(mtype))
-   list_add(&mtype->list, &kmem_memory_types);
-   }
+   mtype = mt_find_alloc_memory_type(adist, &kmem_memory_types);
mutex_unlock(&kmem_memory_type_lock);
 
return mtype;
@@ -77,13 +66,8 @@ static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
 
 static void kmem_put_memory_types(void)
 {
-   struct memory_dev_type *mtype, *mtn;
-
mutex_lock(&kmem_memory_type_lock);
-   list_for_each_entry_safe(mtype, mtn, &kmem_memory_types, list) {
-   list_del(&mtype->list);
-   put_memory_type(mtype);
-   }
+   mt_put_memory_types(&kmem_memory_types);
mutex_unlock(&kmem_memory_type_lock);
 }
 
diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 69e781900082..a44c03c2ba3a 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
 int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
 const char *source);
 int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
+struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+   struct list_head *memory_types);
+void mt_put_memory_types(struct list_head *memory_types);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct access_coordinate *perf, int *adis
 {
return -EIO;
 }
+
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
+{
+   return NULL;
+}
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+
+}
 #endif /* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0537664620e5..974af10cfdd8 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);
 
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
+{
+   bool found = false;
+   struct memory_dev_type *mtype;
+
+   list_for_each_entry(mtype, memory_types, list) {
+   if (mtype->adistance == adist) {
+   found = true;
+   break;
+   }
+   }
+   if (!found) {
+   mtype = alloc_memory_type(adist);
+   if (!IS_ERR(mtype))
+   list_add(&mtype->list, memory_types);
+   }
+
+   return mtype;
+}
+EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+   struct memory_dev_type *mtype, *mtn;
+
+   list_for_each_entry_safe(mtype, mtn, memory_types, list) {
+   list_del(&mtype->list);
+   put_memory_type(mtype);
+   }
+}
+EXPORT_SYMBOL_GPL(mt_put_memory_types);
+
 static void dump_hmem_attrs(struct access_coordinate *coord, const char *prefix)
 {
pr_info(
-- 
Ho-Ren (Jack) Chuang




[PATCH v5 0/2] Improved Memory Tier Creation for CPUless NUMA Nodes

2024-03-26 Thread Ho-Ren (Jack) Chuang
When a memory device, such as CXL1.1 type3 memory, is emulated as
normal memory (E820_TYPE_RAM), the memory device is indistinguishable
from normal DRAM in terms of memory tiering with the current implementation.
The current memory tiering assigns all detected normal memory nodes
to the same DRAM tier. This results in normal memory devices with
different attributes being unable to be assigned to the correct memory tier,
leading to the inability to migrate pages between different types of memory.
https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/

This patchset automatically resolves the issues. It delays the initialization
of memory tiers for CPUless NUMA nodes until they obtain HMAT information
and after all devices are initialized at boot time, eliminating the need
for user intervention. If no HMAT is specified, it falls back to
using `default_dram_type`.

Example usecase:
We have CXL memory on the host, and we create VMs with a new system memory
device backed by host CXL memory. We inject CXL memory performance attributes
through QEMU, and the guest now sees memory nodes with performance attributes
in HMAT. With this change, we enable the guest kernel to construct
the correct memory tiering for the memory nodes.
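
For reference, such performance attributes can be injected with QEMU's
hmat-lb options, roughly like this (an illustrative sketch only; the latency
and bandwidth values are made up, and node 1 stands for the CPUless node
backed by host CXL memory):

  qemu-system-aarch64 ... \
    -machine virt,hmat=on \
    -object memory-backend-ram,id=m0,size=4G \
    -object memory-backend-ram,id=m1,size=4G \
    -numa node,nodeid=0,cpus=0,memdev=m0 \
    -numa node,nodeid=1,memdev=m1 \
    -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
    -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=100G \
    -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=100 \
    -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=20G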

-v5:
 Thanks to Ying's comments,
 * Add comments about what is protected by `default_dram_perf_lock`
 * Fix an uninitialized pointer mtype
 * Slightly shorten the time holding `default_dram_perf_lock`
 * Fix a deadlock bug in `mt_perf_to_adistance`
-v4:
 Thanks to Ying's comments,
 * Remove redundant code
 * Reorganize patches accordingly
 * https://lore.kernel.org/lkml/20240322070356.315922-1-horenchu...@bytedance.com/T/#u
-v3:
 Thanks to Ying's comments,
 * Make the newly added code independent of HMAT
 * Upgrade set_node_memory_tier to support more cases
 * Put all non-driver-initialized memory types into default_memory_types
   instead of using hmat_memory_types
 * find_alloc_memory_type -> mt_find_alloc_memory_type
 * https://lore.kernel.org/lkml/20240320061041.3246828-1-horenchu...@bytedance.com/T/#u
-v2:
 Thanks to Ying's comments,
 * Rewrite cover letter & patch description
 * Rename functions, don't use _hmat
 * Abstract common functions into find_alloc_memory_type()
 * Use the expected way to use set_node_memory_tier instead of modifying it
 * https://lore.kernel.org/lkml/20240312061729.1997111-1-horenchu...@bytedance.com/T/#u
-v1:
 * https://lore.kernel.org/lkml/20240301082248.3456086-1-horenchu...@bytedance.com/T/#u


Ho-Ren (Jack) Chuang (2):
  memory tier: dax/kmem: introduce an abstract layer for finding,
allocating, and putting memory types
  memory tier: create CPUless memory tiers after obtaining HMAT info

 drivers/dax/kmem.c   |  20 +-
 include/linux/memory-tiers.h |  13 
 mm/memory-tiers.c| 117 +++
 3 files changed, 119 insertions(+), 31 deletions(-)

-- 
Ho-Ren (Jack) Chuang




Re: [PATCH v2 2/2] vhost: Add smp_rmb() in vhost_enable_notify()

2024-03-26 Thread Gavin Shan

On 3/27/24 12:41, Jason Wang wrote:

On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:


A smp_rmb() has been missed in vhost_enable_notify(), inspired by
Will Deacon . Otherwise, it's not ensured the
available ring entries pushed by guest can be observed by vhost
in time, leading to stale available ring entries fetched by vhost
in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
grace-hopper (ARM64) platform.

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
   -accel kvm -machine virt,gic-version=host -cpu host  \
   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
   -m 4096M,slots=16,maxmem=64G \
   -object memory-backend-ram,id=mem0,size=4096M\
:   \
   -netdev tap,id=vnet0,vhost=true  \
   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
:
   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
   virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_enable_notify(). Note that it
should be safe until vq->avail_idx is changed by commit d3bb267bbdcb
("vhost: cache avail index in vhost_enable_notify()").

Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
Cc:  # v5.18+
Reported-by: Yihuang Yu 
Signed-off-by: Gavin Shan 
---
  drivers/vhost/vhost.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 00445ab172b3..58f9d6a435f0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2847,9 +2847,18 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
   &vq->avail->idx, r);
 return false;
 }
+
 vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (vq->avail_idx != vq->last_avail_idx) {
+   /* Similar to what's done in vhost_get_vq_desc(), we need
+* to ensure the available ring entries have been exposed
+* by guest.
+*/
+   smp_rmb();
+   return true;
+   }

-   return vq->avail_idx != vq->last_avail_idx;
+   return false;


So we only care about the case when vhost_enable_notify() returns true.

In that case, I think you want to order with vhost_get_vq_desc():

last_avail_idx = vq->last_avail_idx;

if (vq->avail_idx == vq->last_avail_idx) { /* false */
}

vhost_get_avail_head(vq, &ring_head, last_avail_idx)

Assuming I understand the patch correctly.

Acked-by: Jason Wang 



Jason, thanks for your review and comments. Your understanding is exactly
what I understood.




  }
  EXPORT_SYMBOL_GPL(vhost_enable_notify);



Thanks,
Gavin




Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-26 Thread Gavin Shan

On 3/27/24 12:44, Jason Wang wrote:

On Wed, Mar 27, 2024 at 10:34 AM Jason Wang  wrote:

On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:


A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
Will Deacon . Otherwise, it's not ensured the
available ring entries pushed by guest can be observed by vhost
in time, leading to stale available ring entries fetched by vhost
in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
grace-hopper (ARM64) platform.

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
   -accel kvm -machine virt,gic-version=host -cpu host  \
   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
   -m 4096M,slots=16,maxmem=64G \
   -object memory-backend-ram,id=mem0,size=4096M\
:   \
   -netdev tap,id=vnet0,vhost=true  \
   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
:
   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
   virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
should be safe until vq->avail_idx is changed by commit 275bf960ac697
("vhost: better detection of available buffers").

Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
Cc:  # v4.11+
Reported-by: Yihuang Yu 
Signed-off-by: Gavin Shan 
---
  drivers/vhost/vhost.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..00445ab172b3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
r = vhost_get_avail_idx(vq, &avail_idx);
 if (unlikely(r))
 return false;
+
 vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (vq->avail_idx != vq->last_avail_idx) {
+   /* Similar to what's done in vhost_get_vq_desc(), we need
+* to ensure the available ring entries have been exposed
+* by guest.
+*/


We need to be more verbose here. For example, which load needs to be
ordered with which load.

The rmb in vhost_get_vq_desc() is used to order the load of avail idx
and the load of head. It is paired with e.g virtio_wmb() in
virtqueue_add_split().

vhost_vq_avail_empty() are mostly used as a hint in
vhost_net_busy_poll() which is under the protection of the vq mutex.

An exception is the tx_can_batch(), but in that case it doesn't even
want to read the head.


Ok, if it is needed only in that path, maybe we can move the barriers there.



[cc Will Deacon]

Jason, appreciate for your review and comments. I think PATCH[1/2] is
the fix for the hypothesis, meaning PATCH[2/2] is the real fix. However,
it would be nice to fix all of them in one shoot. I will try with PATCH[2/2]
only to see if our issue will disappear or not. However, the issue still
exists if PATCH[2/2] is missed.

Firstly, we were failing on the transmit queue and {tvq, rvq}->busyloop_timeout
== false if I remember correctly. So the added smp_rmb() in vhost_vq_avail_empty()
is only a concern to tx_can_batch(). A mutex isn't enough to ensure the order
for the available index and available ring entry (head). For example,
vhost_vq_avail_empty() called by tx_can_batch() can see the next available
index, but its corresponding available ring entry (head) may not be seen by
vhost yet if smp_rmb() is missed. In the next call to get_tx_bufs(), the
available ring entry (head) hasn't arrived yet, leading to a stale available
ring entry (head) being fetched.

  handle_tx_copy
    get_tx_bufs // smp_rmb() won't be executed when vq->avail_idx != vq->last_avail_idx
tx_can_batch
  vhost_vq_avail_empty  // vq->avail_idx is updated from vq->avail->idx

The reason why I added smp_rmb() to vhost_vq_avail_empty() is that the function
is an exposed API, even though it's only used by drivers/vhost/net.c at present.
It means the API has been broken internally. So it seems more appropriate to fix
it up in vhost_vq_avail_empty() so that the API's users needn't worry about the
memory access order.





+   smp_rmb();
+   return false;
+   }

-   return vq->avail_idx == vq->last_avail_idx;
+   return true;
  }
  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);


Thanks,
Gavin




Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-26 Thread Jason Wang
On Wed, Mar 27, 2024 at 10:34 AM Jason Wang  wrote:
>
> On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:
> >
> > A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> > Will Deacon . Otherwise, it's not ensured the
> > available ring entries pushed by guest can be observed by vhost
> > in time, leading to stale available ring entries fetched by vhost
> > in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
> > grace-hopper (ARM64) platform.
> >
> >   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> >   -accel kvm -machine virt,gic-version=host -cpu host  \
> >   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
> >   -m 4096M,slots=16,maxmem=64G \
> >   -object memory-backend-ram,id=mem0,size=4096M\
> >:   \
> >   -netdev tap,id=vnet0,vhost=true  \
> >   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
> >:
> >   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
> >   virtio_net virtio0: output.0:id 100 is not a head!
> >
> > Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> > should be safe until vq->avail_idx is changed by commit 275bf960ac697
> > ("vhost: better detection of available buffers").
> >
> > Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> > Cc:  # v4.11+
> > Reported-by: Yihuang Yu 
> > Signed-off-by: Gavin Shan 
> > ---
> >  drivers/vhost/vhost.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..00445ab172b3 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> > r = vhost_get_avail_idx(vq, &avail_idx);
> > if (unlikely(r))
> > return false;
> > +
> > vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> > +   if (vq->avail_idx != vq->last_avail_idx) {
> > +   /* Similar to what's done in vhost_get_vq_desc(), we need
> > +* to ensure the available ring entries have been exposed
> > +* by guest.
> > +*/
>
> We need to be more verbose here. For example, which load needs to be
> ordered with which load.
>
> The rmb in vhost_get_vq_desc() is used to order the load of avail idx
> and the load of head. It is paired with e.g virtio_wmb() in
> virtqueue_add_split().
>
> vhost_vq_avail_empty() are mostly used as a hint in
> vhost_net_busy_poll() which is under the protection of the vq mutex.
>
> An exception is the tx_can_batch(), but in that case it doesn't even
> want to read the head.

Ok, if it is needed only in that path, maybe we can move the barriers there.

Thanks

>
> Thanks
>
>
> > +   smp_rmb();
> > +   return false;
> > +   }
> >
> > -   return vq->avail_idx == vq->last_avail_idx;
> > +   return true;
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
> >
> > --
> > 2.44.0
> >




Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-26 Thread Edgecombe, Rick P
On Tue, 2024-03-26 at 13:57 +0200, Jarkko Sakkinen wrote:
> In which conditions which path is used during the initialization of mm
> and why is this the case? It is an open claim in the current form.

There is an arch_pick_mmap_layout() that architectures can have their own
rules for. There is also a generic one. It gets called during exec.
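
For context, the generic version looks roughly like this (simplified from
mm/util.c as of this thread; some architectures override it entirely):

  void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
  {
          unsigned long random_factor = 0UL;

          if (current->flags & PF_RANDOMIZE)
                  random_factor = arch_mmap_rnd();

          if (mmap_is_legacy(rlim_stack)) {
                  /* Legacy layout: bottom-up allocation. */
                  mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
                  mm->get_unmapped_area = arch_get_unmapped_area;
          } else {
                  /* Modern layout: top-down allocation. */
                  mm->mmap_base = mmap_base(random_factor, rlim_stack);
                  mm->get_unmapped_area = arch_get_unmapped_area_topdown;
          }
  }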

> 
> That would be nice to have documented for the sake of being complete
> description. I have zero doubts of the claim being untrue.

...being untrue?



Re: [PATCH v2 2/2] vhost: Add smp_rmb() in vhost_enable_notify()

2024-03-26 Thread Jason Wang
On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:
>
> A smp_rmb() has been missed in vhost_enable_notify(), inspired by
> Will Deacon . Otherwise, it's not ensured the
> available ring entries pushed by guest can be observed by vhost
> in time, leading to stale available ring entries fetched by vhost
> in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
> grace-hopper (ARM64) platform.
>
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host -cpu host  \
>   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
>   -m 4096M,slots=16,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=4096M\
>:   \
>   -netdev tap,id=vnet0,vhost=true  \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>:
>   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
>   virtio_net virtio0: output.0:id 100 is not a head!
>
> Add the missed smp_rmb() in vhost_enable_notify(). Note that it
> should be safe until vq->avail_idx is changed by commit d3bb267bbdcb
> ("vhost: cache avail index in vhost_enable_notify()").
>
> Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
> Cc:  # v5.18+
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/vhost/vhost.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 00445ab172b3..58f9d6a435f0 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2847,9 +2847,18 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> &vq->avail->idx, r);
> return false;
> }
> +
> vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> +   if (vq->avail_idx != vq->last_avail_idx) {
> +   /* Similar to what's done in vhost_get_vq_desc(), we need
> +* to ensure the available ring entries have been exposed
> +* by guest.
> +*/
> +   smp_rmb();
> +   return true;
> +   }
>
> -   return vq->avail_idx != vq->last_avail_idx;
> +   return false;

So we only care about the case when vhost_enable_notify() returns true.

In that case, I think you want to order with vhost_get_vq_desc():

last_avail_idx = vq->last_avail_idx;

if (vq->avail_idx == vq->last_avail_idx) { /* false */
}

vhost_get_avail_head(vq, &ring_head, last_avail_idx)

Assuming I understand the patch correctly.

Acked-by: Jason Wang 

Thanks

>  }
>  EXPORT_SYMBOL_GPL(vhost_enable_notify);
>
> --
> 2.44.0
>




Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-26 Thread Jason Wang
On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:
>
> A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> Will Deacon . Otherwise, it's not ensured the
> available ring entries pushed by guest can be observed by vhost
> in time, leading to stale available ring entries fetched by vhost
> in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
> grace-hopper (ARM64) platform.
>
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host -cpu host  \
>   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
>   -m 4096M,slots=16,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=4096M\
>:   \
>   -netdev tap,id=vnet0,vhost=true  \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>:
>   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
>   virtio_net virtio0: output.0:id 100 is not a head!
>
> Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> should be safe until vq->avail_idx is changed by commit 275bf960ac697
> ("vhost: better detection of available buffers").
>
> Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> Cc:  # v4.11+
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/vhost/vhost.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..00445ab172b3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> r = vhost_get_avail_idx(vq, &avail_idx);
> if (unlikely(r))
> return false;
> +
> vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> +   if (vq->avail_idx != vq->last_avail_idx) {
> +   /* Similar to what's done in vhost_get_vq_desc(), we need
> +* to ensure the available ring entries have been exposed
> +* by guest.
> +*/

We need to be more verbose here. For example, which load needs to be
ordered with which load.

The rmb in vhost_get_vq_desc() is used to order the load of avail idx
and the load of head. It is paired with e.g virtio_wmb() in
virtqueue_add_split().

vhost_vq_avail_empty() are mostly used as a hint in
vhost_net_busy_poll() which is under the protection of the vq mutex.

An exception is the tx_can_batch(), but in that case it doesn't even
want to read the head.

Thanks


> +   smp_rmb();
> +   return false;
> +   }
>
> -   return vq->avail_idx == vq->last_avail_idx;
> +   return true;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
>
> --
> 2.44.0
>




Re: [PATCH net v2 2/2] virtio_net: Do not send RSS key if it is not supported

2024-03-26 Thread Heng Qi




On 3/26/2024 11:19 PM, Breno Leitao wrote:

There is a bug when setting the RSS options in virtio_net that can break
the whole machine, getting the kernel into an infinite loop.

Running the following command in any QEMU virtual machine with virtionet
will reproduce this problem:

 # ethtool -X eth0  hfunc toeplitz

This is how the problem happens:

1) ethtool_set_rxfh() calls virtnet_set_rxfh()

2) virtnet_set_rxfh() calls virtnet_commit_rss_command()

3) virtnet_commit_rss_command() populates 4 entries for the rss
scatter-gather

4) Since the command above does not have a key, the last scatter-gather
entry will be zeroed, since rss_key_size == 0.
sg_buf_size = vi->rss_key_size;

5) This buffer is passed to qemu, but qemu is not happy with a buffer
with zero length, and does the following in virtqueue_map_desc() (QEMU
function):

   if (!sz) {
   virtio_error(vdev, "virtio: zero sized buffers are not allowed");

6) virtio_error() (also a QEMU function) sets the device as broken

 vdev->broken = true;

7) QEMU bails out and does not respond to this crazy kernel.

8) The kernel is waiting for the response to come back (function
virtnet_send_command())

9) The kernel waits, doing the following:

   while (!virtqueue_get_buf(vi->cvq, &tmp) &&
 !virtqueue_is_broken(vi->cvq))
  cpu_relax();

10) Neither of the conditions above becomes true, thus the kernel
loops here forever. Keep in mind that virtqueue_is_broken() does
not look at the qemu `vdev->broken` flag, so it never realizes that the
virtio device is broken on the QEMU side.

Fix it by not sending RSS commands if the feature is not available in
the device.

Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Cc: sta...@vger.kernel.org
Cc: qemu-de...@nongnu.org
Signed-off-by: Breno Leitao 
---
  drivers/net/virtio_net.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c640fdf28fc5..e6b0eaf08ac2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3809,6 +3809,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
struct virtnet_info *vi = netdev_priv(dev);
int i;
  
+	if (!vi->has_rss && !vi->has_rss_hash_report)
+		return -EOPNOTSUPP;
+


Why not make the second patch the first? This seems to work better.
Or squash them into one patch.

Apart from these and Xuan's comments.

For series:

        Reviewed-by: Heng Qi 

Regards,
Heng


if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
rxfh->hfunc != ETH_RSS_HASH_TOP)
return -EOPNOTSUPP;





Re: [PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported

2024-03-26 Thread Xuan Zhuo
On Tue, 26 Mar 2024 08:19:08 -0700, Breno Leitao  wrote:
> Do not set virtnet_info->rss_indir_table_size if RSS is not available
> for the device.
>
> Currently, rss_indir_table_size is set if either has_rss or
> has_rss_hash_report is available, but, it should only be set if has_rss
> is set.
>
> On the virtnet_set_rxfh(), return an invalid command if the request has
> indirection table set, but virtnet does not support RSS.
>
> Suggested-by: Heng Qi 
> Signed-off-by: Breno Leitao 
> ---
>  drivers/net/virtio_net.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c22d1118a133..c640fdf28fc5 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
>   rxfh->hfunc != ETH_RSS_HASH_TOP)
>   return -EOPNOTSUPP;
>
> + if (rxfh->indir && !vi->has_rss)
> + return -EINVAL;
> +
>   if (rxfh->indir) {

Put !vi->has_rss here?

Thanks.


>   for (i = 0; i < vi->rss_indir_table_size; ++i)
>   vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
> @@ -4729,13 +4732,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>   if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
>   vi->has_rss_hash_report = true;
>
> - if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
>   vi->has_rss = true;
>
> - if (vi->has_rss || vi->has_rss_hash_report) {
>   vi->rss_indir_table_size =
>   virtio_cread16(vdev, offsetof(struct virtio_net_config,
>   rss_max_indirection_table_length));
> + }
> +
> + if (vi->has_rss || vi->has_rss_hash_report) {
>   vi->rss_key_size =
>   virtio_cread8(vdev, offsetof(struct virtio_net_config, 
> rss_max_key_size));
>
> --
> 2.43.0
>



Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-26 Thread Gavin Shan



On 3/27/24 09:14, Gavin Shan wrote:

On 3/27/24 01:46, Will Deacon wrote:

On Tue, Mar 26, 2024 at 11:43:13AM +, Will Deacon wrote:

Ok, long shot after eyeballing the vhost code, but does the diff below
help at all? It looks like vhost_vq_avail_empty() can advance the value
saved in 'vq->avail_idx' but without the read barrier, possibly confusing
vhost_get_vq_desc() in polling mode.

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..87bff710331a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 return false;
 vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   smp_rmb();
 return vq->avail_idx == vq->last_avail_idx;
  }
  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);



Thanks, Will. I already noticed smp_rmb() has been missed in
vhost_vq_avail_empty(). The issue still exists after smp_rmb() is added here.
However, I'm inspired by your suggestion and rechecked the code again. It
seems another smp_rmb() has been missed in vhost_enable_notify().

With smp_rmb() added to vhost_vq_avail_empty() and vhost_enable_notify(), I'm
unable to hit the issue. I will try more times to make sure the issue is
really resolved. After that, I will post formal patches for review.



Thanks again, Will. The formal patches have been sent for review.

https://lkml.org/lkml/2024/3/27/40

Thanks,
Gavin




Re: [PATCH v2 0/2] vhost: Fix stale available ring entries

2024-03-26 Thread Gavin Shan

On 3/27/24 09:38, Gavin Shan wrote:

The issue was reported by Yihuang Yu on NVidia's grace-hopper (ARM64)
platform. The wrong head (available ring entry) is seen by the guest
when running 'netperf' on the guest and running 'netserver' on another
NVidia's grace-grace machine.

   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
   -accel kvm -machine virt,gic-version=host -cpu host  \
   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
   -m 4096M,slots=16,maxmem=64G \
   -object memory-backend-ram,id=mem0,size=4096M\
:   \
   -netdev tap,id=tap0,vhost=true   \
   -device virtio-net-pci,bus=pcie.8,netdev=tap0,mac=52:54:00:f1:26:b0
:
   guest# ifconfig eth0 | grep 'inet addr'
   inet addr:10.26.1.220
   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
   virtio_net virtio0: output.0:id 100 is not a head!

There is missed smp_rmb() in vhost_vq_avail_empty() and vhost_enable_notify().
Without smp_rmb(), vq->avail_idx is increased but the available ring
entries haven't arrived at the vhost side yet. So a stale available ring
entry can be fetched in vhost_get_vq_desc().

Fix it by adding smp_rmb() in those two functions. Note that I need
two patches so that they can be easily picked up by the stable kernel.
With the changes, I'm unable to hit the issue again.

Gavin Shan (2):
   vhost: Add smp_rmb() in vhost_vq_avail_empty()
   vhost: Add smp_rmb() in vhost_enable_notify()

  drivers/vhost/vhost.c | 22 --
  1 file changed, 20 insertions(+), 2 deletions(-)



Sorry, I was supposed to copy Will. Amending for it.

Thanks,
Gavin




Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-26 Thread Google
On Mon, 25 Mar 2024 19:04:59 +
Jonathan Haslam  wrote:

> Hi Masami,
> 
> > > This change has been tested against production workloads that exhibit
> > > significant contention on the spinlock and an almost order of magnitude
> > > reduction for mean uprobe execution time is observed (28 -> 3.5 
> > > microsecs).
> > 
> > Looks good to me.
> > 
> > Acked-by: Masami Hiramatsu (Google) 
> > 
> > BTW, how did you measure the overhead? I think spinlock overhead
> > will depend on how much lock contention happens.
> 
> Absolutely. I have the original production workload to test this with and
> a derived one that mimics this test case. The production case has ~24
> threads running on a 192 core system which access 14 USDTs around 1.5
> million times per second in total (across all USDTs). My test case is
> similar but can drive a higher rate of USDT access across more threads and
> therefore generate higher contention.

Thanks for the info. So this result is measured on a large enough machine
with high parallelism, where lock contention matters.
Can you also include this information with the numbers in the next version?

Thank you,

> 
> All measurements are done using bpftrace scripts around relevant parts of
> code in uprobes.c and application code.
> 
> Jon.
> 
> > 
> > Thank you,
> > 
> > > 
> > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > 
> > > Signed-off-by: Jonathan Haslam 
> > > ---
> > >  kernel/events/uprobes.c | 22 +++---
> > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 929e98c62965..42bf9b6e8bc0 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > >   */
> > >  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
> > >  
> > > -static DEFINE_SPINLOCK(uprobes_treelock);   /* serialize rbtree access */
> > > +static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree access */
> > >  
> > >  #define UPROBES_HASH_SZ  13
> > >  /* serialize uprobe->pending_list */
> > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > >  {
> > >   struct uprobe *uprobe;
> > >  
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > >   uprobe = __find_uprobe(inode, offset);
> > > - spin_unlock(&uprobes_treelock);
> > > + read_unlock(&uprobes_treelock);
> > >  
> > >   return uprobe;
> > >  }
> > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > >  {
> > >   struct uprobe *u;
> > >  
> > > - spin_lock(&uprobes_treelock);
> > > + write_lock(&uprobes_treelock);
> > >   u = __insert_uprobe(uprobe);
> > > - spin_unlock(&uprobes_treelock);
> > > + write_unlock(&uprobes_treelock);
> > >  
> > >   return u;
> > >  }
> > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > >   if (WARN_ON(!uprobe_is_active(uprobe)))
> > >   return;
> > >  
> > > - spin_lock(&uprobes_treelock);
> > > + write_lock(&uprobes_treelock);
> > >   rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > - spin_unlock(&uprobes_treelock);
> > > + write_unlock(&uprobes_treelock);
> > >   RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > >   put_uprobe(uprobe);
> > >  }
> > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > >   min = vaddr_to_offset(vma, start);
> > >   max = min + (end - start) - 1;
> > >  
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > >   n = find_node_in_range(inode, min, max);
> > >   if (n) {
> > >   for (t = n; t; t = rb_prev(t)) {
> > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > >   get_uprobe(u);
> > >   }
> > >   }
> > > - spin_unlock(&uprobes_treelock);
> > > + read_unlock(&uprobes_treelock);
> > >  }
> > >  
> > >  /* @vma contains reference counter, not the probed instruction. */
> > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > >   min = vaddr_to_offset(vma, start);
> > >   max = min + (end - start) - 1;
> > >  
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > >   n = find_node_in_range(inode, min, max);
> > > - spin_unlock(&uprobes_treelock);
> > > + read_unlock(&uprobes_treelock);
> > >  
> > >   return !!n;
> > >  }
> > > -- 
> > > 2.43.0
> > > 
> > 
> > 
> > -- 
> > Masami Hiramatsu (Google) 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-26 Thread Google
On Tue, 26 Mar 2024 09:01:47 -0700
Andrii Nakryiko  wrote:

> On Sun, Mar 24, 2024 at 8:03 PM Masami Hiramatsu  wrote:
> >
> > On Thu, 21 Mar 2024 07:57:35 -0700
> > Jonathan Haslam  wrote:
> >
> > > Active uprobes are stored in an RB tree and accesses to this tree are
> > > dominated by read operations. Currently these accesses are serialized by
> > > a spinlock but this leads to enormous contention when large numbers of
> > > threads are executing active probes.
> > >
> > > This patch converts the spinlock used to serialize access to the
> > > uprobes_tree RB tree into a reader-writer spinlock. This lock type
> > > aligns naturally with the overwhelmingly read-only nature of the tree
> > > usage here. Although the addition of reader-writer spinlocks are
> > > discouraged [0], this fix is proposed as an interim solution while an
> > > RCU based approach is implemented (that work is in a nascent form). This
> > > fix also has the benefit of being trivial, self contained and therefore
> > > simple to backport.
> > >
> > > This change has been tested against production workloads that exhibit
> > > significant contention on the spinlock and an almost order of magnitude
> > > reduction for mean uprobe execution time is observed (28 -> 3.5 
> > > microsecs).
> >
> > Looks good to me.
> >
> > Acked-by: Masami Hiramatsu (Google) 
> 
> Masami,
> 
> Given the discussion around per-cpu rw semaphore and need for
> (internal) batched attachment API for uprobes, do you think you can
> apply this patch as is for now? We can then gain initial improvements
> in scalability that are also easy to backport, and Jonathan will work
> on a more complete solution based on per-cpu RW semaphore, as
> suggested by Ingo.

Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
I would like to wait for the next version.
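
For reference, the per-CPU rw semaphore variant under discussion might look
roughly like this (a sketch, not a posted patch; the API is in
include/linux/percpu-rwsem.h):

  static DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);

  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
  {
          struct uprobe *uprobe;

          /* Readers touch only per-CPU state, so the hot lookup path
           * avoids bouncing a shared lock cacheline; writers pay the
           * cost instead, which fits the read-mostly tree usage.
           */
          percpu_down_read(&uprobes_treelock);
          uprobe = __find_uprobe(inode, offset);
          percpu_up_read(&uprobes_treelock);

          return uprobe;
  }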

Thank you,

> 
> >
> > BTW, how did you measure the overhead? I think spinlock overhead
> > will depend on how much lock contention happens.
> >
> > Thank you,
> >
> > >
> > > [0] https://docs.kernel.org/locking/spinlocks.html
> > >
> > > Signed-off-by: Jonathan Haslam 
> > > ---
> > >  kernel/events/uprobes.c | 22 +++---
> > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 929e98c62965..42bf9b6e8bc0 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > >   */
> > >  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
> > >
> > > -static DEFINE_SPINLOCK(uprobes_treelock);   /* serialize rbtree access */
> > > +static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree access */
> > >
> > >  #define UPROBES_HASH_SZ  13
> > >  /* serialize uprobe->pending_list */
> > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > >  {
> > >   struct uprobe *uprobe;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > >   uprobe = __find_uprobe(inode, offset);
> > > - spin_unlock(&uprobes_treelock);
> > > + read_unlock(&uprobes_treelock);
> > >
> > >   return uprobe;
> > >  }
> > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > >  {
> > >   struct uprobe *u;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + write_lock(&uprobes_treelock);
> > >   u = __insert_uprobe(uprobe);
> > > - spin_unlock(&uprobes_treelock);
> > > + write_unlock(&uprobes_treelock);
> > >
> > >   return u;
> > >  }
> > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > >   if (WARN_ON(!uprobe_is_active(uprobe)))
> > >   return;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + write_lock(&uprobes_treelock);
> > >   rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > - spin_unlock(&uprobes_treelock);
> > > + write_unlock(&uprobes_treelock);
> > >   RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > >   put_uprobe(uprobe);
> > >  }
> > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > >   min = vaddr_to_offset(vma, start);
> > >   max = min + (end - start) - 1;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > >   n = find_node_in_range(inode, min, max);
> > >   if (n) {
> > >   for (t = n; t; t = rb_prev(t)) {
> > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > >   get_uprobe(u);
> > >   }
> > >   }
> > > - spin_unlock(&uprobes_treelock);
> > > + read_unlock(&uprobes_treelock);
> > >  }
> > >
> > >  /* @vma contains reference counter, not the probed instruction. */
> > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > >   min = vaddr_to_offset(vma, start);
> > >   max = min + (end - start) - 1;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > >  

[PATCH v2 2/2] vhost: Add smp_rmb() in vhost_enable_notify()

2024-03-26 Thread Gavin Shan
A smp_rmb() has been missed in vhost_enable_notify(), inspired by
Will Deacon . Otherwise, it's not ensured the
available ring entries pushed by guest can be observed by vhost
in time, leading to stale available ring entries fetched by vhost
in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
grace-hopper (ARM64) platform.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
  -accel kvm -machine virt,gic-version=host -cpu host  \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M\
   :   \
  -netdev tap,id=vnet0,vhost=true  \
  -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
   :
  guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
  virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_enable_notify(). Note that it
should be safe until vq->avail_idx is changed by commit d3bb267bbdcb
("vhost: cache avail index in vhost_enable_notify()").

Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
Cc:  # v5.18+
Reported-by: Yihuang Yu 
Signed-off-by: Gavin Shan 
---
 drivers/vhost/vhost.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 00445ab172b3..58f9d6a435f0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2847,9 +2847,18 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
   &vq->avail->idx, r);
return false;
}
+
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (vq->avail_idx != vq->last_avail_idx) {
+   /* Similar to what's done in vhost_get_vq_desc(), we need
+* to ensure the available ring entries have been exposed
+* by guest.
+*/
+   smp_rmb();
+   return true;
+   }
 
-   return vq->avail_idx != vq->last_avail_idx;
+   return false;
 }
 EXPORT_SYMBOL_GPL(vhost_enable_notify);
 
-- 
2.44.0




[PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-26 Thread Gavin Shan
A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
Will Deacon . Otherwise, it's not ensured the
available ring entries pushed by guest can be observed by vhost
in time, leading to stale available ring entries fetched by vhost
in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
grace-hopper (ARM64) platform.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
  -accel kvm -machine virt,gic-version=host -cpu host  \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M\
   :   \
  -netdev tap,id=vnet0,vhost=true  \
  -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
   :
  guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
  virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
should be safe until vq->avail_idx is changed by commit 275bf960ac697
("vhost: better detection of available buffers").

Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
Cc:  # v4.11+
Reported-by: Yihuang Yu 
Signed-off-by: Gavin Shan 
---
 drivers/vhost/vhost.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..00445ab172b3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
r = vhost_get_avail_idx(vq, &avail_idx);
if (unlikely(r))
return false;
+
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+   if (vq->avail_idx != vq->last_avail_idx) {
+   /* Similar to what's done in vhost_get_vq_desc(), we need
+* to ensure the available ring entries have been exposed
+* by guest.
+*/
+   smp_rmb();
+   return false;
+   }
 
-   return vq->avail_idx == vq->last_avail_idx;
+   return true;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
 
-- 
2.44.0




[PATCH v2 0/2] vhost: Fix stale available ring entries

2024-03-26 Thread Gavin Shan
The issue was reported by Yihuang Yu on NVidia's grace-hopper (ARM64)
platform. The wrong head (available ring entry) is seen by the guest
when running 'netperf' on the guest and running 'netserver' on another
NVidia's grace-grace machine.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
  -accel kvm -machine virt,gic-version=host -cpu host  \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M\
   :   \
  -netdev tap,id=tap0,vhost=true   \
  -device virtio-net-pci,bus=pcie.8,netdev=tap0,mac=52:54:00:f1:26:b0
   :
  guest# ifconfig eth0 | grep 'inet addr'
  inet addr:10.26.1.220
  guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
  virtio_net virtio0: output.0:id 100 is not a head!

There is missed smp_rmb() in vhost_vq_avail_empty() and vhost_enable_notify().
Without smp_rmb(), vq->avail_idx is increased but the available ring
entries haven't arrived at the vhost side yet. So a stale available ring
entry can be fetched in vhost_get_vq_desc().

Fix it by adding smp_rmb() in those two functions. Note that I need
two patches so that they can be easily picked up by the stable kernel.
With the changes, I'm unable to hit the issue again.
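
In outline, the required pairing looks like this (simplified; the guest side
is from virtqueue_add_split() in drivers/virtio/virtio_ring.c, the vhost
side is what the two patches add):

  /* Guest: publish the ring entry before publishing the index. */
  vq->split.vring.avail->ring[avail] = cpu_to_virtio16(vdev, head);
  virtio_wmb(vq->weak_barriers);
  vq->split.avail_idx_shadow++;
  vq->split.vring.avail->idx = cpu_to_virtio16(vdev, vq->split.avail_idx_shadow);

  /* Vhost: order the avail->idx load before the later load of the ring
   * entry (pairs with the guest's virtio_wmb() above).
   */
  vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
  if (vq->avail_idx != vq->last_avail_idx)
          smp_rmb();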

Gavin Shan (2):
  vhost: Add smp_rmb() in vhost_vq_avail_empty()
  vhost: Add smp_rmb() in vhost_enable_notify()

 drivers/vhost/vhost.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

-- 
2.44.0




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-26 Thread Gavin Shan



On 3/27/24 01:46, Will Deacon wrote:

On Tue, Mar 26, 2024 at 11:43:13AM +, Will Deacon wrote:

Ok, long shot after eyeballing the vhost code, but does the diff below
help at all? It looks like vhost_vq_avail_empty() can advance the value
saved in 'vq->avail_idx' but without the read barrier, possibly confusing
vhost_get_vq_desc() in polling mode.

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..87bff710331a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 return false;
 vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
  
+   smp_rmb();

 return vq->avail_idx == vq->last_avail_idx;
  }
  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);



Thanks, Will. I already noticed smp_rmb() has been missed in
vhost_vq_avail_empty(). The issue still exists after smp_rmb() is added here.
However, I'm inspired by your suggestion and rechecked the code again. It
seems another smp_rmb() has been missed in vhost_enable_notify().

With smp_rmb() added to vhost_vq_avail_empty() and vhost_enable_notify(), I'm
unable to hit the issue. I will try more times to make sure the issue is
really resolved. After that, I will post formal patches for review.

Thanks,
Gavin




Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory

2024-03-26 Thread SeongJae Park
On Mon, 25 Mar 2024 15:53:03 -0700 SeongJae Park  wrote:

> On Mon, 25 Mar 2024 21:01:04 +0900 Honggyu Kim  wrote:
[...]
> > On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park  wrote:
> > > On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim  wrote:
[...]
> >
> > I would like to hear how you think about this.
> 
> So, to summarize my humble opinion,
> 
> 1. I like the idea of having two actions.  But I'd like to use names other 
> than
>'promote' and 'demote'.
> 2. I still prefer having a filter for the page granularity access re-check.
> 
[...]
> > I will join the DAMON Beer/Coffee/Tea Chat tomorrow as scheduled so I
> > can talk more about this issue.
> 
> Looking forward to chatting with you :)

We met and discussed this topic in the chat series yesterday.  Sharing
the summary here to keep the discussion open.

Honggyu thankfully accepted my humble suggestions on the last reply.  Honggyu
will post the third version of this patchset soon.  The patchset will implement
two new DAMOS actions, namely MIGRATE_HOT and MIGRATE_COLD.  Those will migrate
the DAMOS target regions to a user-specified NUMA node, but will have different
prioritization score functions.  As the names imply, they will prioritize hotter
regions and colder regions, respectively.

Honggyu, please feel free to fix if there is anything wrong or missed.

And thanks again to Honggyu for patiently keeping this discussion productive
and for their awesome work.


Thanks,
SJ

[...]



[syzbot] [virtualization?] net boot error: WARNING: refcount bug in __free_pages_ok

2024-03-26 Thread syzbot
Hello,

syzbot found the following issue on:

HEAD commit:c1fd3a9433a2 Merge branch 'there-are-some-bugfix-for-the-h..
git tree:   net
console output: https://syzkaller.appspot.com/x/log.txt?x=134f4c8118
kernel config:  https://syzkaller.appspot.com/x/.config?x=a5e4ca7f025e9172
dashboard link: https://syzkaller.appspot.com/bug?extid=84f677a274bd8b05f6cb
compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
2.40

Downloadable assets:
disk image: 
https://storage.googleapis.com/syzbot-assets/89219dafdd42/disk-c1fd3a94.raw.xz
vmlinux: 
https://storage.googleapis.com/syzbot-assets/d962e40c0da9/vmlinux-c1fd3a94.xz
kernel image: 
https://storage.googleapis.com/syzbot-assets/248b8f5eb3a1/bzImage-c1fd3a94.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+84f677a274bd8b05f...@syzkaller.appspotmail.com

Key type pkcs7_test registered
Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
io scheduler mq-deadline registered
io scheduler kyber registered
io scheduler bfq registered
input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
ACPI: button: Power Button [PWRF]
input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
ACPI: button: Sleep Button [SLPF]
ioatdma: Intel(R) QuickData Technology Driver 5.00
ACPI: \_SB_.LNKC: Enabled at IRQ 11
virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
ACPI: \_SB_.LNKD: Enabled at IRQ 10
virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
ACPI: \_SB_.LNKB: Enabled at IRQ 10
virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
N_HDLC line discipline registered with maxframe=4096
Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
Non-volatile memory driver v1.3
Linux agpgart interface v0.103
ACPI: bus type drm_connector registered
[drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
[drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
Console: switching to colour frame buffer device 128x48
platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
usbcore: registered new interface driver udl
brd: module loaded
loop: module loaded
zram: Added device: zram0
null_blk: disk nullb0 created
null_blk: module loaded
Guest personality initialized and is inactive
VMCI host device registered (name=vmci, major=10, minor=118)
Initialized host personality
usbcore: registered new interface driver rtsx_usb
usbcore: registered new interface driver viperboard
usbcore: registered new interface driver dln2
usbcore: registered new interface driver pn533_usb
nfcsim 0.2 initialized
usbcore: registered new interface driver port100
usbcore: registered new interface driver nfcmrvl
Loading iSCSI transport class v2.0-870.
virtio_scsi virtio0: 1/0/0 default/read/poll queues
[ cut here ]
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 1 PID: 1 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 
lib/refcount.c:31
Modules linked in:
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.8.0-syzkaller-12856-gc1fd3a9433a2 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
02/29/2024
RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
Code: b2 00 00 00 e8 97 cf e9 fc 5b 5d c3 cc cc cc cc e8 8b cf e9 fc c6 05 6c 
6b e8 0a 01 90 48 c7 c7 e0 34 1f 8c e8 27 6c ac fc 90 <0f> 0b 90 90 eb d9 e8 6b 
cf e9 fc c6 05 49 6b e8 0a 01 90 48 c7 c7
RSP: :c9066e18 EFLAGS: 00010246
RAX: 57706ef3c4162200 RBX: 88801f8f468c RCX: 8880166d8000
RDX:  RSI:  RDI: 
RBP: 0004 R08: 815800c2 R09: fbfff1c396e0
R10: dc00 R11: fbfff1c396e0 R12: ea850dc0
R13: ea850dc8 R14: 1d400010a1b9 R15: 
FS:  () GS:8880b950() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 0e132000 CR4: 003506f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 
 reset_page_owner include/linux/page_owner.h:25 [inline]
 free_pages_prepare mm/page_alloc.c:1141 [inline]
 __free_pages_ok+0xc60/0xd90 mm/page_alloc.c:1270
 make_alloc_exact+0xa3/0xf0 mm/page_alloc.c:4829
 vring_alloc_queue drivers/virtio/virtio_ring.c:319 [inline]
 vring_alloc_queue_split+0x20a/0x600 drivers/virtio/virtio_ring.c:1108
 vring_create_virtqueue_split+0xc6/0x310 drivers/virtio/virtio_ring.c:1158
 vring_create_virtqueue+0xca/0x110 drivers/virtio/virtio_ring.c:2683
 setup_vq+0xe9/0x2d0 

Re: [PATCH v4 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware

2024-03-26 Thread Arnaud POULIQUEN



On 3/25/24 17:51, Mathieu Poirier wrote:
> On Fri, Mar 08, 2024 at 03:47:08PM +0100, Arnaud Pouliquen wrote:
>> The new TEE remoteproc device is used to manage remote firmware in a
>> secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
>> introduced to delegate the loading of the firmware to the trusted
>> execution context. In such cases, the firmware should be signed and
>> adhere to the image format defined by the TEE.
>>
>> Signed-off-by: Arnaud Pouliquen 
>> ---
>> Updates from V3:
>> - remove support of the attach use case. Will be addressed in a separate
>>   thread,
>> - add st_rproc_tee_ops::parse_fw ops,
>> - inverse call of devm_rproc_alloc()and tee_rproc_register() to manage cross
>>   reference between the rproc struct and the tee_rproc struct in tee_rproc.c.
>> ---
>>  drivers/remoteproc/stm32_rproc.c | 60 +---
>>  1 file changed, 56 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/remoteproc/stm32_rproc.c 
>> b/drivers/remoteproc/stm32_rproc.c
>> index 8cd838df4e92..13df33c78aa2 100644
>> --- a/drivers/remoteproc/stm32_rproc.c
>> +++ b/drivers/remoteproc/stm32_rproc.c
>> @@ -20,6 +20,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  
>>  #include "remoteproc_internal.h"
>> @@ -49,6 +50,9 @@
>>  #define M4_STATE_STANDBY4
>>  #define M4_STATE_CRASH  5
>>  
>> +/* Remote processor unique identifier aligned with the Trusted Execution 
>> Environment definitions */
> 
> Why is this the case?  At least from the kernel side it is possible to call
> tee_rproc_register() with any kind of value, why is there a need to be any
> kind of alignment with the TEE?


The proc_id is used to identify a processor in systems with multiple
coprocessors.

For instance, we can have a system with a DSP and a modem. We would use the
same TEE service, but the TEE driver will probably be different, and the same
goes for the signature key. In such a case, the proc ID identifies the
processor you want to address.
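
As a hypothetical sketch (the IDs and variable names below are made up;
only the tee_rproc_register() signature comes from this series), a
platform with two coprocessors could register them as:

	#define MY_DSP_PROC_ID		0	/* hypothetical; must match the TEE side */
	#define MY_MODEM_PROC_ID	1	/* hypothetical */

	/* one registration per coprocessor, each with its own unique ID */
	trproc = tee_rproc_register(dev, dsp_rproc, MY_DSP_PROC_ID);
	if (IS_ERR(trproc))
		return PTR_ERR(trproc);

	trproc = tee_rproc_register(dev, modem_rproc, MY_MODEM_PROC_ID);
	if (IS_ERR(trproc))
		return PTR_ERR(trproc);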


> 
>> +#define STM32_MP1_M4_PROC_ID0
>> +
>>  struct stm32_syscon {
>>  struct regmap *map;
>>  u32 reg;
>> @@ -257,6 +261,19 @@ static int stm32_rproc_release(struct rproc *rproc)
>>  return 0;
>>  }
>>  
>> +static int stm32_rproc_tee_stop(struct rproc *rproc)
>> +{
>> +int err;
>> +
>> +stm32_rproc_request_shutdown(rproc);
>> +
>> +err = tee_rproc_stop(rproc);
>> +if (err)
>> +return err;
>> +
>> +return stm32_rproc_release(rproc);
>> +}
>> +
>>  static int stm32_rproc_prepare(struct rproc *rproc)
>>  {
>>  struct device *dev = rproc->dev.parent;
>> @@ -693,8 +710,19 @@ static const struct rproc_ops st_rproc_ops = {
>>  .get_boot_addr  = rproc_elf_get_boot_addr,
>>  };
>>  
>> +static const struct rproc_ops st_rproc_tee_ops = {
>> +.prepare= stm32_rproc_prepare,
>> +.start  = tee_rproc_start,
>> +.stop   = stm32_rproc_tee_stop,
>> +.kick   = stm32_rproc_kick,
>> +.load   = tee_rproc_load_fw,
>> +.parse_fw   = tee_rproc_parse_fw,
>> +.find_loaded_rsc_table = tee_rproc_find_loaded_rsc_table,
>> +};
>> +
>>  static const struct of_device_id stm32_rproc_match[] = {
>> -{ .compatible = "st,stm32mp1-m4" },
>> +{.compatible = "st,stm32mp1-m4",},
>> +{.compatible = "st,stm32mp1-m4-tee",},
>>  {},
>>  };
>>  MODULE_DEVICE_TABLE(of, stm32_rproc_match);
>> @@ -853,6 +881,7 @@ static int stm32_rproc_probe(struct platform_device 
>> *pdev)
>>  struct device *dev = >dev;
>>  struct stm32_rproc *ddata;
>>  struct device_node *np = dev->of_node;
>> +struct tee_rproc *trproc = NULL;
>>  struct rproc *rproc;
>>  unsigned int state;
>>  int ret;
>> @@ -861,9 +890,26 @@ static int stm32_rproc_probe(struct platform_device 
>> *pdev)
>>  if (ret)
>>  return ret;
>>  
>> -rproc = devm_rproc_alloc(dev, np->name, _rproc_ops, NULL, 
>> sizeof(*ddata));
>> -if (!rproc)
>> -return -ENOMEM;
>> +if (of_device_is_compatible(np, "st,stm32mp1-m4-tee")) {
>> +/*
>> + * Delegate the firmware management to the secure context.
>> + * The firmware loaded has to be signed.
>> + */
>> +rproc = devm_rproc_alloc(dev, np->name, _rproc_tee_ops, 
>> NULL, sizeof(*ddata));
>> +if (!rproc)
>> +return -ENOMEM;
>> +
>> +trproc = tee_rproc_register(dev, rproc, STM32_MP1_M4_PROC_ID);
>> +if (IS_ERR(trproc)) {
>> +dev_err_probe(dev, PTR_ERR(trproc),
>> +  "signed firmware not supported by TEE\n");
>> +return PTR_ERR(trproc);
>> +}
>> +} else {
>> +rproc = devm_rproc_alloc(dev, np->name, _rproc_ops, NULL, 
>> sizeof(*ddata));
>> +if (!rproc)
>> +return -ENOMEM;
>> +}
>>  
>>  ddata = rproc->priv;
>>  
>> 

[RFC PATCH v2 4/4] tracing/timer: use __print_sym()

2024-03-26 Thread Johannes Berg
From: Johannes Berg 

Use the new __print_sym() in the timer tracing, just to show
how to convert something. This adds ~80 bytes of .text for a
saving of ~1.5K of data in my builds.

Note the format changes from

print fmt: "success=%d dependency=%s", REC->success, 
__print_symbolic(REC->dependency, { 0, "NONE" }, { (1 << 0), "POSIX_TIMER" }, { 
(1 << 1), "PERF_EVENTS" }, { (1 << 2), "SCHED" }, { (1 << 3), "CLOCK_UNSTABLE" 
}, { (1 << 4), "RCU" }, { (1 << 5), "RCU_EXP" })

to

print fmt: "success=%d dependency=%s", REC->success, 
__print_symbolic(REC->dependency, { 0, "NONE" }, { 1, "POSIX_TIMER" }, { 2, 
"PERF_EVENTS" }, { 4, "SCHED" }, { 8, "CLOCK_UNSTABLE" }, { 16, "RCU" }, { 32, 
"RCU_EXP" })

This is because the values are now printed by the show function as
plain decimal values.

Signed-off-by: Johannes Berg 
---
 include/trace/events/timer.h | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 1ef58a04fc57..d483abffed78 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -402,26 +402,18 @@ TRACE_EVENT(itimer_expire,
 #undef tick_dep_mask_name
 #undef tick_dep_name_end
 
-/* The MASK will convert to their bits and they need to be processed too */
-#define tick_dep_name(sdep) TRACE_DEFINE_ENUM(TICK_DEP_BIT_##sdep); \
-   TRACE_DEFINE_ENUM(TICK_DEP_MASK_##sdep);
-#define tick_dep_name_end(sdep)  TRACE_DEFINE_ENUM(TICK_DEP_BIT_##sdep); \
-   TRACE_DEFINE_ENUM(TICK_DEP_MASK_##sdep);
-/* NONE only has a mask defined for it */
-#define tick_dep_mask_name(sdep) TRACE_DEFINE_ENUM(TICK_DEP_MASK_##sdep);
-
-TICK_DEP_NAMES
-
-#undef tick_dep_name
-#undef tick_dep_mask_name
-#undef tick_dep_name_end
-
 #define tick_dep_name(sdep) { TICK_DEP_MASK_##sdep, #sdep },
 #define tick_dep_mask_name(sdep) { TICK_DEP_MASK_##sdep, #sdep },
 #define tick_dep_name_end(sdep) { TICK_DEP_MASK_##sdep, #sdep }
 
+TRACE_DEFINE_SYM_LIST(tick_dep_names, TICK_DEP_NAMES);
+
+#undef tick_dep_name
+#undef tick_dep_mask_name
+#undef tick_dep_name_end
+
 #define show_tick_dep_name(val)\
-   __print_symbolic(val, TICK_DEP_NAMES)
+   __print_sym(val, tick_dep_names)
 
 TRACE_EVENT(tick_stop,
 
-- 
2.44.0




[RFC PATCH v2 1/4] tracing: add __print_sym() to replace __print_symbolic()

2024-03-26 Thread Johannes Berg
From: Johannes Berg 

The way __print_symbolic() works is limited and inefficient
in multiple ways:
 - you can only use it with a static list of symbols, but
   e.g. the SKB dropreasons are now a dynamic list

 - it builds the list in memory _three_ times, so it takes
   a lot of memory:
   - The print_fmt contains the list (since it's passed to
 the macro there). This actually contains the names
 _twice_, which is fixed up at runtime.
   - TRACE_DEFINE_ENUM() puts a 24-byte struct trace_eval_map
 for every entry, plus the string pointed to by it, which
 cannot be deduplicated with the strings in the print_fmt
   - The in-kernel symbolic printing creates yet another list
 of struct trace_print_flags for trace_print_symbols_seq()

 - it also requires runtime fixup during init, which is a lot
   of string parsing due to the print_fmt fixup

Introduce __print_sym() to - over time - replace the old one.
We can easily extend this to __print_flags() later, but I
cared only about the SKB dropreasons for now, which use only
__print_symbolic().

This new __print_sym() requires only a single list of items,
created by TRACE_DEFINE_SYM_LIST(), or can even use another
already existing list by using TRACE_DEFINE_SYM_FNS() with
lookup and show methods.
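
As a sketch of the two forms (the names here are illustrative; the macro
and __print_sym() usage mirror the later patches in this series):

	/* static list of { value, name } pairs, built from a name-list macro */
	TRACE_DEFINE_SYM_LIST(my_state_names, MY_STATE_NAMES);
	/* ... then in TP_printk(): */
	__print_sym(__entry->state, my_state_names)

	/* or reuse an existing (possibly dynamic) list via callbacks */
	TRACE_DEFINE_SYM_FNS(my_state, my_state_lookup, my_state_show);
	/* ... then in TP_printk(): */
	__print_sym(__entry->state, my_state)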

Then, instead of doing an init-time fixup, just do this at the
time when userspace reads the print_fmt. This way, dynamically
updated lists are possible.

For userspace, nothing actually changes, because the print_fmt
is shown exactly the same way the old __print_symbolic() was.

This adds about 4k .text in my test builds, but that'll be
more than paid for by the actual conversions.

Signed-off-by: Johannes Berg 
---
v2:
 - fix RCU
 - use ':' as separator to simplify the code, that's
   still not valid in a C identifier
---
 include/asm-generic/vmlinux.lds.h  |  3 +-
 include/linux/module.h |  2 +
 include/linux/trace_events.h   |  7 ++
 include/linux/tracepoint.h | 20 +
 include/trace/stages/init.h| 54 +
 include/trace/stages/stage2_data_offsets.h |  6 ++
 include/trace/stages/stage3_trace_output.h |  9 +++
 include/trace/stages/stage7_class_define.h |  3 +
 kernel/module/main.c   |  3 +
 kernel/trace/trace_events.c| 90 +-
 kernel/trace/trace_output.c| 45 +++
 11 files changed, 239 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index f7749d0f2562..88de434578a5 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -256,7 +256,8 @@
 #define FTRACE_EVENTS()
\
. = ALIGN(8);   \
BOUNDED_SECTION(_ftrace_events) \
-   BOUNDED_SECTION_BY(_ftrace_eval_map, _ftrace_eval_maps)
+   BOUNDED_SECTION_BY(_ftrace_eval_map, _ftrace_eval_maps) \
+   BOUNDED_SECTION(_ftrace_sym_defs)
 #else
 #define FTRACE_EVENTS()
 #endif
diff --git a/include/linux/module.h b/include/linux/module.h
index 1153b0d99a80..571e5e8f17b6 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -524,6 +524,8 @@ struct module {
unsigned int num_trace_events;
struct trace_eval_map **trace_evals;
unsigned int num_trace_evals;
+   struct trace_sym_def **trace_sym_defs;
+   unsigned int num_trace_sym_defs;
 #endif
 #ifdef CONFIG_FTRACE_MCOUNT_RECORD
unsigned int num_ftrace_callsites;
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 6f9bdfb09d1d..bc7045d535d0 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -27,6 +27,13 @@ const char *trace_print_flags_seq(struct trace_seq *p, const 
char *delim,
 const char *trace_print_symbols_seq(struct trace_seq *p, unsigned long val,
const struct trace_print_flags 
*symbol_array);
 
+const char *trace_print_sym_seq(struct trace_seq *p, unsigned long long val,
+   const char *(*lookup)(unsigned long long val));
+const char *trace_sym_lookup(const struct trace_sym_entry *list,
+size_t len, unsigned long long value);
+void trace_sym_show(struct seq_file *m,
+   const struct trace_sym_entry *list, size_t len);
+
 #if BITS_PER_LONG == 32
 const char *trace_print_flags_seq_u64(struct trace_seq *p, const char *delim,
  unsigned long long flags,
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 689b6d71590e..cc3b387953d1 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -31,6 +31,24 @@ struct trace_eval_map {
unsigned long   eval_value;
 };
 
+struct trace_sym_def {
+   const char  *system;
+   const char  

[RFC PATCH v2 3/4] net: drop_monitor: use drop_reason_lookup()

2024-03-26 Thread Johannes Berg
From: Johannes Berg 

Now that we have drop_reason_lookup(), we can just use it for
drop_monitor as well, rather than exporting the list itself.

Signed-off-by: Johannes Berg 
---
 include/net/dropreason.h |  4 
 net/core/drop_monitor.c  | 18 +++---
 net/core/skbuff.c|  6 +++---
 3 files changed, 6 insertions(+), 22 deletions(-)

diff --git a/include/net/dropreason.h b/include/net/dropreason.h
index c157070b5303..0e2195ccf2cd 100644
--- a/include/net/dropreason.h
+++ b/include/net/dropreason.h
@@ -38,10 +38,6 @@ struct drop_reason_list {
size_t n_reasons;
 };
 
-/* Note: due to dynamic registrations, access must be under RCU */
-extern const struct drop_reason_list __rcu *
-drop_reasons_by_subsys[SKB_DROP_REASON_SUBSYS_NUM];
-
 #ifdef CONFIG_TRACEPOINTS
 const char *drop_reason_lookup(unsigned long long value);
 void drop_reason_show(struct seq_file *m);
diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index b0f221d658be..185c43e5b501 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -610,9 +610,8 @@ static int net_dm_packet_report_fill(struct sk_buff *msg, 
struct sk_buff *skb,
 size_t payload_len)
 {
struct net_dm_skb_cb *cb = NET_DM_SKB_CB(skb);
-   const struct drop_reason_list *list = NULL;
-   unsigned int subsys, subsys_reason;
char buf[NET_DM_MAX_SYMBOL_LEN];
+   const char *reason_str;
struct nlattr *attr;
void *hdr;
int rc;
@@ -630,19 +629,8 @@ static int net_dm_packet_report_fill(struct sk_buff *msg, 
struct sk_buff *skb,
goto nla_put_failure;
 
rcu_read_lock();
-   subsys = u32_get_bits(cb->reason, SKB_DROP_REASON_SUBSYS_MASK);
-   if (subsys < SKB_DROP_REASON_SUBSYS_NUM)
-   list = rcu_dereference(drop_reasons_by_subsys[subsys]);
-   subsys_reason = cb->reason & ~SKB_DROP_REASON_SUBSYS_MASK;
-   if (!list ||
-   subsys_reason >= list->n_reasons ||
-   !list->reasons[subsys_reason] ||
-   strlen(list->reasons[subsys_reason]) > NET_DM_MAX_REASON_LEN) {
-   list = 
rcu_dereference(drop_reasons_by_subsys[SKB_DROP_REASON_SUBSYS_CORE]);
-   subsys_reason = SKB_DROP_REASON_NOT_SPECIFIED;
-   }
-   if (nla_put_string(msg, NET_DM_ATTR_REASON,
-  list->reasons[subsys_reason])) {
+   reason_str = drop_reason_lookup(cb->reason);
+   if (nla_put_string(msg, NET_DM_ATTR_REASON, reason_str)) {
rcu_read_unlock();
goto nla_put_failure;
}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 012b48da8810..a8065c40a270 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -141,13 +141,11 @@ static const struct drop_reason_list drop_reasons_core = {
.n_reasons = ARRAY_SIZE(drop_reasons),
 };
 
-const struct drop_reason_list __rcu *
+static const struct drop_reason_list __rcu *
 drop_reasons_by_subsys[SKB_DROP_REASON_SUBSYS_NUM] = {
[SKB_DROP_REASON_SUBSYS_CORE] = RCU_INITIALIZER(_reasons_core),
 };
-EXPORT_SYMBOL(drop_reasons_by_subsys);
 
-#ifdef CONFIG_TRACEPOINTS
 const char *drop_reason_lookup(unsigned long long value)
 {
unsigned long long subsys_id = value >> SKB_DROP_REASON_SUBSYS_SHIFT;
@@ -164,7 +162,9 @@ const char *drop_reason_lookup(unsigned long long value)
return NULL;
return subsys->reasons[reason];
 }
+EXPORT_SYMBOL(drop_reason_lookup);
 
+#ifdef CONFIG_TRACEPOINTS
 void drop_reason_show(struct seq_file *m)
 {
u32 subsys_id;
-- 
2.44.0




[RFC PATCH v2 2/4] net: dropreason: use new __print_sym() in tracing

2024-03-26 Thread Johannes Berg
From: Johannes Berg 

The __print_symbolic() could only ever print the core
drop reasons, since that's the way the infrastructure
works. Now that we have __print_sym() with all the
advantages mentioned in that commit, convert to that
and get all the drop reasons from all subsystems. As
we already have a list of them, that's really easy.

This is a little bit of .text (~100 bytes in my build)
and saves a lot of .data (~17k).

Signed-off-by: Johannes Berg 
---
 include/net/dropreason.h   |  5 +
 include/trace/events/skb.h | 16 +++---
 net/core/skbuff.c  | 43 ++
 3 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/include/net/dropreason.h b/include/net/dropreason.h
index 56cb7be92244..c157070b5303 100644
--- a/include/net/dropreason.h
+++ b/include/net/dropreason.h
@@ -42,6 +42,11 @@ struct drop_reason_list {
 extern const struct drop_reason_list __rcu *
 drop_reasons_by_subsys[SKB_DROP_REASON_SUBSYS_NUM];
 
+#ifdef CONFIG_TRACEPOINTS
+const char *drop_reason_lookup(unsigned long long value);
+void drop_reason_show(struct seq_file *m);
+#endif
+
 void drop_reasons_register_subsys(enum skb_drop_reason_subsys subsys,
  const struct drop_reason_list *list);
 void drop_reasons_unregister_subsys(enum skb_drop_reason_subsys subsys);
diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 07e0715628ec..8a1a63f9e796 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -8,15 +8,9 @@
 #include 
 #include 
 #include 
+#include 
 
-#undef FN
-#define FN(reason) TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason);
-DEFINE_DROP_REASON(FN, FN)
-
-#undef FN
-#undef FNe
-#define FN(reason) { SKB_DROP_REASON_##reason, #reason },
-#define FNe(reason){ SKB_DROP_REASON_##reason, #reason }
+TRACE_DEFINE_SYM_FNS(drop_reason, drop_reason_lookup, drop_reason_show);
 
 /*
  * Tracepoint for free an sk_buff:
@@ -44,13 +38,9 @@ TRACE_EVENT(kfree_skb,
 
TP_printk("skbaddr=%p protocol=%u location=%pS reason: %s",
  __entry->skbaddr, __entry->protocol, __entry->location,
- __print_symbolic(__entry->reason,
-  DEFINE_DROP_REASON(FN, FNe)))
+ __print_sym(__entry->reason, drop_reason ))
 );
 
-#undef FN
-#undef FNe
-
 TRACE_EVENT(consume_skb,
 
TP_PROTO(struct sk_buff *skb, void *location),
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b99127712e67..012b48da8810 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -147,6 +147,49 @@ drop_reasons_by_subsys[SKB_DROP_REASON_SUBSYS_NUM] = {
 };
 EXPORT_SYMBOL(drop_reasons_by_subsys);
 
+#ifdef CONFIG_TRACEPOINTS
+const char *drop_reason_lookup(unsigned long long value)
+{
+   unsigned long long subsys_id = value >> SKB_DROP_REASON_SUBSYS_SHIFT;
+   u32 reason = value & ~SKB_DROP_REASON_SUBSYS_MASK;
+   const struct drop_reason_list *subsys;
+
+   if (subsys_id >= SKB_DROP_REASON_SUBSYS_NUM)
+   return NULL;
+
+   subsys = rcu_dereference(drop_reasons_by_subsys[subsys_id]);
+   if (!subsys)
+   return NULL;
+   if (reason >= subsys->n_reasons)
+   return NULL;
+   return subsys->reasons[reason];
+}
+
+void drop_reason_show(struct seq_file *m)
+{
+   u32 subsys_id;
+
+   rcu_read_lock();
+   for (subsys_id = 0; subsys_id < SKB_DROP_REASON_SUBSYS_NUM; 
subsys_id++) {
+   const struct drop_reason_list *subsys;
+   u32 i;
+
+   subsys = rcu_dereference(drop_reasons_by_subsys[subsys_id]);
+   if (!subsys)
+   continue;
+
+   for (i = 0; i < subsys->n_reasons; i++) {
+   if (!subsys->reasons[i])
+   continue;
+   seq_printf(m, ", { %u, \"%s\" }",
+  (subsys_id << SKB_DROP_REASON_SUBSYS_SHIFT) 
| i,
+  subsys->reasons[i]);
+   }
+   }
+   rcu_read_unlock();
+}
+#endif
+
 /**
  * drop_reasons_register_subsys - register another drop reason subsystem
  * @subsys: the subsystem to register, must not be the core
-- 
2.44.0




[RFC PATCH v2 0/4] tracing: improve symbolic printing

2024-03-26 Thread Johannes Berg
As I mentioned before, it's annoying to see this in dropreason tracing
with trace-cmd:

 irq/65-iwlwifi:-401   [000]22.79: kfree_skb:
skbaddr=0x6a89b000 protocol=0 location=ieee80211_rx_handlers_result+0x21a 
reason: 0x2

and much nicer to see

 irq/65-iwlwifi:-401   [000]22.79: kfree_skb:
skbaddr=0x69142000 protocol=0 location=ieee80211_rx_handlers_result+0x21a 
reason: RX_DROP_MONITOR

The reason is that the __print_symbolic() string that trace-cmd parses
is created at build time from the long list of _core_ drop reasons,
while the drop reasons are now more dynamic.

So I came up with __print_sym(), which is similar, except it doesn't
build the big list of numbers at build time but rather at runtime,
which is actually a big memory saving too. Building the list then, at
the time userspace reads it for recording, lets us include all the
known reasons.

v2:
 - rebased on 6.9-rc1
 - always search for __print_sym() and get rid of the DYNPRINT flag
   and associated code; I think ideally we'll just remove the older
   __print_symbolic() entirely
 - use ':' as the separator instead of "//" since that makes searching
   for it much easier and it's still not a valid char in an identifier
 - fix RCU

johannes




Re: [PATCH v4 1/4] remoteproc: Add TEE support

2024-03-26 Thread Arnaud POULIQUEN
Hello Mathieu,

On 3/25/24 17:46, Mathieu Poirier wrote:
> On Fri, Mar 08, 2024 at 03:47:05PM +0100, Arnaud Pouliquen wrote:
>> Add a remoteproc TEE (Trusted Execution Environment) driver
>> that will be probed by the TEE bus. If the associated Trusted
>> application is supported on secure part this device offers a client
> 
> Device or driver?  I thought I touched on that before.

Right, I changed the first instance and missed this one

> 
>> interface to load a firmware in the secure part.
>> This firmware could be authenticated by the secure trusted application.
>>
>> Signed-off-by: Arnaud Pouliquen 
>> ---
>> Updates from V3:
>> - rework TEE_REMOTEPROC description in Kconfig
>> - fix some namings
>> - add tee_rproc_parse_fw  to support rproc_ops::parse_fw
>> - add proc::tee_interface;
>> - add rproc struct as parameter of the tee_rproc_register() function
>> ---
>>  drivers/remoteproc/Kconfig  |  10 +
>>  drivers/remoteproc/Makefile |   1 +
>>  drivers/remoteproc/tee_remoteproc.c | 434 
>>  include/linux/remoteproc.h  |   4 +
>>  include/linux/tee_remoteproc.h  | 112 +++
>>  5 files changed, 561 insertions(+)
>>  create mode 100644 drivers/remoteproc/tee_remoteproc.c
>>  create mode 100644 include/linux/tee_remoteproc.h
>>
>> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
>> index 48845dc8fa85..2cf1431b2b59 100644
>> --- a/drivers/remoteproc/Kconfig
>> +++ b/drivers/remoteproc/Kconfig
>> @@ -365,6 +365,16 @@ config XLNX_R5_REMOTEPROC
>>  
>>It's safe to say N if not interested in using RPU r5f cores.
>>  
>> +
>> +config TEE_REMOTEPROC
>> +tristate "remoteproc support by a TEE application"
> 
> s/remoteproc/Remoteproc
> 
>> +depends on OPTEE
>> +help
>> +  Support a remote processor with a TEE application. The Trusted
>> +  Execution Context is responsible for loading the trusted firmware
>> +  image and managing the remote processor's lifecycle.
>> +  This can be either built-in or a loadable module.
>> +
>>  endif # REMOTEPROC
>>  
>>  endmenu
>> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
>> index 91314a9b43ce..fa8daebce277 100644
>> --- a/drivers/remoteproc/Makefile
>> +++ b/drivers/remoteproc/Makefile
>> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC)  += rcar_rproc.o
>>  obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o
>>  obj-$(CONFIG_ST_SLIM_REMOTEPROC)+= st_slim_rproc.o
>>  obj-$(CONFIG_STM32_RPROC)   += stm32_rproc.o
>> +obj-$(CONFIG_TEE_REMOTEPROC)+= tee_remoteproc.o
>>  obj-$(CONFIG_TI_K3_DSP_REMOTEPROC)  += ti_k3_dsp_remoteproc.o
>>  obj-$(CONFIG_TI_K3_R5_REMOTEPROC)   += ti_k3_r5_remoteproc.o
>>  obj-$(CONFIG_XLNX_R5_REMOTEPROC)+= xlnx_r5_remoteproc.o
>> diff --git a/drivers/remoteproc/tee_remoteproc.c 
>> b/drivers/remoteproc/tee_remoteproc.c
>> new file mode 100644
>> index ..c855210e52e3
>> --- /dev/null
>> +++ b/drivers/remoteproc/tee_remoteproc.c
>> @@ -0,0 +1,434 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * Copyright (C) STMicroelectronics 2024 - All Rights Reserved
>> + * Author: Arnaud Pouliquen 
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include "remoteproc_internal.h"
>> +
>> +#define MAX_TEE_PARAM_ARRY_MEMBER   4
>> +
>> +/*
>> + * Authentication of the firmware and load in the remote processor memory
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + * [in]  params[1].memref:  buffer containing the firmware image
>> + */
>> +#define TA_RPROC_FW_CMD_LOAD_FW 1
>> +
>> +/*
>> + * Start the remote processor
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + */
>> +#define TA_RPROC_FW_CMD_START_FW2
>> +
>> +/*
>> + * Stop the remote processor
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + */
>> +#define TA_RPROC_FW_CMD_STOP_FW 3
>> +
>> +/*
>> + * Return the address of the resource table, or 0 if not found
>> + * No check is done to verify that the address returned is accessible by
>> + * the non secure context. If the resource table is loaded in a protected
>> + * memory the access by the non secure context will lead to a data abort.
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the remote processor
>> + * [out]  params[1].value.a:32bit LSB resource table memory address
>> + * [out]  params[1].value.b:32bit MSB resource table memory address
>> + * [out]  params[2].value.a:32bit LSB resource table memory size
>> + * [out]  params[2].value.b:32bit MSB resource table memory size
>> + */
>> +#define TA_RPROC_FW_CMD_GET_RSC_TABLE   4
>> +
>> +/*
>> + * Return the address of the core dump
>> + *
>> + * [in]  params[0].value.a: unique 32bit identifier of the 

Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Alexandre Ghiti

On 26/03/2024 17:49, Jarkko Sakkinen wrote:

On Tue Mar 26, 2024 at 3:57 PM EET, Alexandre Ghiti wrote:

Hi Jarkko,

On 25/03/2024 22:55, Jarkko Sakkinen wrote:

Tracing with kprobes while running a monolithic kernel is currently
impossible due to the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v5:
- No changes, except removing the alloc_execmem() call, which should have
been part of the previous patch.
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
this patch set now.
v2:
- Better late than never right? :-)
- Focus only on RISC-V for now to make the patch more digestible. This
is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
   arch/riscv/Kconfig  |  1 +
   arch/riscv/kernel/Makefile  |  3 +++
   arch/riscv/kernel/execmem.c | 22 ++
   3 files changed, 26 insertions(+)
   create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
   
   obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o

   obj-$(CONFIG_MODULES)+= module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
   obj-$(CONFIG_MODULE_SECTIONS)+= module-sections.o
   
   obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o

diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)

The parameter name is needed here. I guess this could just as well
pass gfp through to vmalloc from the caller, since kprobes does call
module_alloc() with GFP_KERNEL set on RISC-V.


+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}


The __vmalloc_node_range() line ^^ must be from an old kernel since we
added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix
module_alloc() that did not reset the linear mapping permissions").
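
For reference, a sketch of the allocator with both review points addressed
(the gfp parameter named and passed through, and VM_FLUSH_RESET_PERMS set
as commit 749b94b08005 does for module_alloc()); illustrative only, not
the final patch:

	void *alloc_execmem(unsigned long size, gfp_t gfp)
	{
		return __vmalloc_node_range(size, 1, MODULES_VADDR,
					    MODULES_END, gfp,
					    PAGE_KERNEL, VM_FLUSH_RESET_PERMS,
					    NUMA_NO_NODE,
					    __builtin_return_address(0));
	}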

In addition, I guess module_alloc() should now use alloc_execmem() right?

Ack for the first comment. For the second, it is up to arch/ to choose
whether to have shared or separate allocators.

So if you want, I can change it that way, but I did not want to make the
call myself.



I'd say module_alloc() should use alloc_execmem() then since there are 
no differences for now.
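
That is, roughly (a sketch):

	void *module_alloc(unsigned long size)
	{
		return alloc_execmem(size, GFP_KERNEL);
	}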






+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}


I remember Mike Rapoport sent a patchset to introduce an API for
executable memory allocation
(https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/).
How does this intersect with your work? I don't know the status of his
patchset though.

Thanks,

Alex

I also made a patch set for kprobes back in 2022:

https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/

I think Calvin's, Mike's, and my early patch sets have the same
problem: they try to cover all architectures at once. Further,
Calvin's and Mike's work also tries to cover the tracing subsystems
at once.

I feel that my relatively small patch set is a better starting point:
it deals only with the trivial kprobes case (which is more of a leaf
than e.g. BPF, which is more of an orchestrator tool), and it
implements one arch whose dog food I actually eat.

Arch code is always something where you need genuine understanding,
so full architecture coverage from day one is just too risky for
stability. Linux is better 

Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-26 Thread Steven Rostedt
On Tue, 26 Mar 2024 09:16:33 -0700
Andrii Nakryiko  wrote:

> > It's no different than lockdep. Test boxes should have it enabled, but
> > there's no reason to have this enabled in a production system.
> >  
> 
> I tend to agree with Steven here (which is why I sent this patch as it
> is), but I'm happy to do it as an opt-out, if Masami insists. Please
> do let me know if I need to send v2 or this one is actually the one
> we'll end up using. Thanks!

Masami,

Are you OK with just keeping it set to N?

We could have other options, like PROVE_LOCKING, enable it.

-- Steve



Re: [PATCH v7 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Conor Dooley
On Tue, Mar 26, 2024 at 03:46:16PM +0200, Jarkko Sakkinen wrote:
> Tracing with kprobes while running a monolithic kernel is currently
> impossible due to the kernel module allocator dependency.
> 
> Address the issue by implementing textmem API for RISC-V.

This doesn't compile for nommu:
  /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:10:46: error: 
'MODULES_VADDR' undeclared (first use in this function)
  /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:11:37: error: 'MODULES_END' 
undeclared (first use in this function)
  /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:14:1: error: control 
reaches end of non-void function [-Werror=return-type]
Clang builds also report:
../arch/riscv/kernel/execmem.c:8:56: warning: omitting the parameter name in a 
function definition is a C2x extension [-Wc2x-extensions]

> 
> Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
> stack
> Link: 
> https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # 
> continuation
> Signed-off-by: Jarkko Sakkinen 
> ---
> v5-v7:
> - No changes.
> v4:
> - Include linux/execmem.h.
> v3:
> - Architecture independent parts have been split to separate patches.
> - Do not change arch/riscv/kernel/module.c as it is out of scope for
>   this patch set now.

Meta comment. I dunno when v1 was sent, but can you please relax with
submitting new versions of your patches? There are conversations
ongoing on v5 at the moment, while this is a more recent version. v2
seems to have been sent on the 23rd, and there have been 5 versions in
the last day:
https://patchwork.kernel.org/project/linux-riscv/list/?submitter=195059=*

Could you please also try and use a cover letter for patchsets, ideally
with a consistent subject? Otherwise I have to manually mark stuff as
superseded.

Thanks,
Conor.

> v2:
> - Better late than never right? :-)
> - Focus only on RISC-V for now to make the patch more digestible. This
>   is the arch where I use the patch on a daily basis to help with QA.
> - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.


signature.asc
Description: PGP signature


[PATCH net-next v4 2/2] net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb

2024-03-26 Thread Balazs Scheidler
The udp_fail_queue_rcv_skb() tracepoint lacks any details on the source
and destination IP/port, whereas this information can be critical in the
case of UDP/syslog.
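
The tracepoint can be exercised with the usual tracefs flow (assuming
tracefs is mounted at /sys/kernel/tracing; the sample output line is
illustrative and follows the TP_printk() format below):

  # echo 1 > /sys/kernel/tracing/events/udp/udp_fail_queue_rcv_skb/enable
  # cat /sys/kernel/tracing/trace_pipe
  ... udp_fail_queue_rcv_skb: rc=-12 family=AF_INET src=10.0.0.1:514 dest=10.0.0.2:514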

Signed-off-by: Balazs Scheidler 
---
 include/trace/events/udp.h | 29 -
 net/ipv4/udp.c |  2 +-
 net/ipv6/udp.c |  3 ++-
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/udp.h b/include/trace/events/udp.h
index 336fe272889f..62bebe2a6ece 100644
--- a/include/trace/events/udp.h
+++ b/include/trace/events/udp.h
@@ -7,24 +7,43 @@
 
 #include 
 #include 
+#include 
 
 TRACE_EVENT(udp_fail_queue_rcv_skb,
 
-   TP_PROTO(int rc, struct sock *sk),
+   TP_PROTO(int rc, struct sock *sk, struct sk_buff *skb),
 
-   TP_ARGS(rc, sk),
+   TP_ARGS(rc, sk, skb),
 
TP_STRUCT__entry(
__field(int, rc)
-   __field(__u16, lport)
+
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __field(__u16, family)
+   __array(__u8, saddr, sizeof(struct sockaddr_in6))
+   __array(__u8, daddr, sizeof(struct sockaddr_in6))
),
 
TP_fast_assign(
+   const struct udphdr *uh = (const struct udphdr *)udp_hdr(skb);
+
__entry->rc = rc;
-   __entry->lport = inet_sk(sk)->inet_num;
+
+   /* for filtering use */
+   __entry->sport = ntohs(uh->source);
+   __entry->dport = ntohs(uh->dest);
+   __entry->family = sk->sk_family;
+
+   memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+   memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
+
+   TP_STORE_ADDR_PORTS_SKB(__entry, skb, uh);
),
 
-   TP_printk("rc=%d port=%hu", __entry->rc, __entry->lport)
+   TP_printk("rc=%d family=%s src=%pISpc dest=%pISpc", __entry->rc,
+ show_family_name(__entry->family),
+ __entry->saddr, __entry->daddr)
 );
 
 #endif /* _TRACE_UDP_H */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 661d0e0d273f..531882f321f2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2049,8 +2049,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
drop_reason = SKB_DROP_REASON_PROTO_MEM;
}
UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
+   trace_udp_fail_queue_rcv_skb(rc, sk, skb);
kfree_skb_reason(skb, drop_reason);
-   trace_udp_fail_queue_rcv_skb(rc, sk);
return -1;
}
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 7c1e6469d091..2e4dc5e6137b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -658,8 +659,8 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
drop_reason = SKB_DROP_REASON_PROTO_MEM;
}
UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
+   trace_udp_fail_queue_rcv_skb(rc, sk, skb);
kfree_skb_reason(skb, drop_reason);
-   trace_udp_fail_queue_rcv_skb(rc, sk);
return -1;
}
 
-- 
2.40.1




[PATCH net-next v4 1/2] net: port TP_STORE_ADDR_PORTS_SKB macro to be tcp/udp independent

2024-03-26 Thread Balazs Scheidler
This patch moves TP_STORE_ADDR_PORTS_SKB() to a common header and removes
the TCP-specific implementation details.

Previously the macro assumed the skb passed as an argument was a TCP packet.
The implementation now takes a pointer to the L4 header as an argument and
uses it to extract the source/destination ports, which happen to be named
the same in "struct tcphdr" and "struct udphdr".

Reviewed-by: Jason Xing 
Signed-off-by: Balazs Scheidler 
---
 include/trace/events/net_probe_common.h | 40 ++
 include/trace/events/tcp.h  | 45 ++---
 2 files changed, 42 insertions(+), 43 deletions(-)

diff --git a/include/trace/events/net_probe_common.h 
b/include/trace/events/net_probe_common.h
index b1f9a4d3ee13..5e33f91bdea3 100644
--- a/include/trace/events/net_probe_common.h
+++ b/include/trace/events/net_probe_common.h
@@ -70,4 +70,44 @@
TP_STORE_V4MAPPED(__entry, saddr, daddr)
 #endif
 
+#define TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)   \
+   do {\
+   struct sockaddr_in *v4 = (void *)__entry->saddr;\
+   \
+   v4->sin_family = AF_INET;   \
+   v4->sin_port = protoh->source;  \
+   v4->sin_addr.s_addr = ip_hdr(skb)->saddr;   \
+   v4 = (void *)__entry->daddr;\
+   v4->sin_family = AF_INET;   \
+   v4->sin_port = protoh->dest;\
+   v4->sin_addr.s_addr = ip_hdr(skb)->daddr;   \
+   } while (0)
+
+#if IS_ENABLED(CONFIG_IPV6)
+
+#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)  \
+   do {\
+   const struct iphdr *iph = ip_hdr(skb);  \
+   \
+   if (iph->version == 6) {\
+   struct sockaddr_in6 *v6 = (void *)__entry->saddr; \
+   \
+   v6->sin6_family = AF_INET6; \
+   v6->sin6_port = protoh->source; \
+   v6->sin6_addr = ipv6_hdr(skb)->saddr;   \
+   v6 = (void *)__entry->daddr;\
+   v6->sin6_family = AF_INET6; \
+   v6->sin6_port = protoh->dest;   \
+   v6->sin6_addr = ipv6_hdr(skb)->daddr;   \
+   } else  \
+   TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh); \
+   } while (0)
+
+#else
+
+#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh)  \
+   TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh)
+
+#endif
+
 #endif
diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 3c08a0846c47..1db95175c1e5 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -273,48 +273,6 @@ TRACE_EVENT(tcp_probe,
  __entry->skbaddr, __entry->skaddr)
 );
 
-#define TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb)   \
-   do {\
-   const struct tcphdr *th = (const struct tcphdr *)skb->data; \
-   struct sockaddr_in *v4 = (void *)__entry->saddr;\
-   \
-   v4->sin_family = AF_INET;   \
-   v4->sin_port = th->source;  \
-   v4->sin_addr.s_addr = ip_hdr(skb)->saddr;   \
-   v4 = (void *)__entry->daddr;\
-   v4->sin_family = AF_INET;   \
-   v4->sin_port = th->dest;\
-   v4->sin_addr.s_addr = ip_hdr(skb)->daddr;   \
-   } while (0)
-
-#if IS_ENABLED(CONFIG_IPV6)
-
-#define TP_STORE_ADDR_PORTS_SKB(__entry, skb)  \
-   do {\
-   const struct iphdr *iph = ip_hdr(skb);  \
-   \
-   if (iph->version == 6) {\
-   const struct tcphdr *th = (const struct tcphdr 
*)skb->data; \
-   struct sockaddr_in6 *v6 = (void *)__entry->saddr; \
-   \

Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-26 Thread Abdellatif El Khlifi
Hi Mathieu,

> > > > > > > > > > This is an initial patchset for allowing to turn on and off 
> > > > > > > > > > the remote processor.
> > > > > > > > > > The FW is already loaded before the Corstone-1000 SoC is 
> > > > > > > > > > powered on and this
> > > > > > > > > > is done through the FPGA board bootloader in case of the 
> > > > > > > > > > FPGA target. Or by the Corstone-1000 FVP model
> > > > > > > > > > (emulator).
> > > > > > > > > >
> > > > > > > > > From the above I take it that booting with a preloaded
> > > > > > > > > firmware is a
> > > > > > > > > scenario that needs to be supported and not just a temporary 
> > > > > > > > > stage.
> > > > > > > >
> > > > > > > > The current status of the Corstone-1000 SoC requires that there 
> > > > > > > > is
> > > > > > > > a preloaded firmware for the external core. Preloading is done 
> > > > > > > > externally
> > > > > > > > either through the FPGA bootloader or the emulator (FVP) before 
> > > > > > > > powering
> > > > > > > > on the SoC.
> > > > > > > >
> > > > > > >
> > > > > > > Ok
> > > > > > >
> > > > > > > > Corstone-1000 will be upgraded so that the A core running Linux is
> > > > > > > > able to share memory with the remote core and to access the remote
> > > > > > > > core's memory so Linux can copy the firmware to it. These HW changes
> > > > > > > > are still in progress. This is why this patchset relies on a preloaded
> > > > > > > > firmware, and it's step 1 of adding remoteproc support for Corstone.
> > > > > > > >
> > > > > > >
> > > > > > > Ok, so there is a HW problem where A core and M core can't see 
> > > > > > > each other's
> > > > > > > memory, preventing the A core from copying the firmware image to 
> > > > > > > the proper
> > > > > > > location.
> > > > > > >
> > > > > > > When the HW is fixed, will there be a need to support scenarios 
> > > > > > > where the
> > > > > > > firmware image has been preloaded into memory?
> > > > > >
> > > > > > No, this scenario won't apply when we get the HW upgrade. No need 
> > > > > > for an
> > > > > > external entity anymore. The firmware(s) will all be files in the 
> > > > > > linux filesystem.
> > > > > >
> > > > >
> > > > > Very well.  I am willing to continue with this driver but it does so 
> > > > > little that
> > > > > I wonder if it wouldn't simply be better to move forward with 
> > > > > upstreaming when
> > > > > the HW is fixed.  The choice is yours.
> > > > >
> > > >
> > > > I think Robin has raised a few points that need clarification. I think it
> > > > was done as part of the DT binding patch. I share those concerns, and I
> > > > was heading toward the same concerns with the questions I asked on the
> > > > Corstone device tree changes.
> > > >
> > >
> > > I also agree with Robin's point of view.  Proceeding with an initial
> > > driver with minimal functionality doesn't preclude having complete
> > > bindings.  But that said and as I pointed out, it might be better to
> > > wait for the HW to be fixed before moving forward.
> >
> > We checked with the HW teams. The missing features will be implemented but
> > this will take time.
> >
> > The foundation driver as it is right now is still valuable for people
> > wanting to know how to power-control Corstone external systems in a
> > future-proof manner (even in the incomplete state). We prefer to address
> > all the review comments made so it can be merged. This includes making the
> > DT binding as complete as possible, as you advised. Then, once the HW is
> > ready, I'll implement the comms and the FW reload part. Is that OK, please?
> >
> 
> I'm in agreement with that plan as long as we agree the current
> preloaded heuristic is temporary and is not a valid long term
> scenario.

Yes, that's the plan, no problem.

Cheers,
Abdellatif



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 6:38 PM EET, Mark Rutland wrote:
> On Wed, Mar 27, 2024 at 12:24:03AM +0900, Masami Hiramatsu wrote:
> > On Tue, 26 Mar 2024 14:46:10 +
> > Mark Rutland  wrote:
> > > 
> > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > > > I think, we'd better to introduce `alloc_execmem()`,
> > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
> > > > 
> > > >   config HAVE_ALLOC_EXECMEM
> > > > bool
> > > > 
> > > >   config ALLOC_EXECMEM
> > > > bool "Executable trampoline memory allocation"
> > > > depends on MODULES || HAVE_ALLOC_EXECMEM
> > > > 
> > > > And define fallback macro to module_alloc() like this.
> > > > 
> > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > > > #define alloc_execmem(size, gfp)module_alloc(size)
> > > > #endif
> > > 
> > > Please can we *not* do this? I think this is abstracting at the wrong 
> > > level (as
> > > I mentioned on the prior execmem proposals).
> > > 
> > > Different executable allocations can have different requirements. For 
> > > example,
> > > on arm64 modules need to be within 2G of the kernel image, but the 
> > > kprobes XOL
> > > areas can be anywhere in the kernel VA space.
> > > 
> > > Forcing those behind the same interface makes things *harder* for 
> > > architectures
> > > and/or makes the common code more complicated (if that ends up having to 
> > > track
> > > all those different requirements). From my PoV it'd be much better to have
> > > separate kprobes_alloc_*() functions for kprobes which an architecture 
> > > can then
> > > choose to implement using a common library if it wants to.
> > > 
> > > I took a look at doing that using the core ifdeffery fixups from Jarkko's 
> > > v6,
> > > and it looks pretty clean to me (and works in testing on arm64):
> > > 
> > >   
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > > 
> > > Could we please start with that approach, with kprobe-specific alloc/free 
> > > code
> > > provided by the architecture?
> > 
> > OK, as far as I can read the code, this method also works and is neat
> > (and minimally intrusive). I actually found that exposing CONFIG_ALLOC_EXECMEM
> > to the user does not help; it should be an internal change, so hiding this
> > change from the user is the better choice. Then there is no reason to
> > introduce the new alloc_execmem; just expanding kprobe_alloc_insn_page()
> > is reasonable.
> > 
> > Mark, can you send this series here, so that others can review/test it?
>
> I've written up a cover letter and sent that out:
>   
>   https://lore.kernel.org/lkml/20240326163624.3253157-1-mark.rutl...@arm.com/
>
> Mark.

Ya, saw it thanks!

BR, Jarkko



Re: [PATCH 11/12] [v4] kallsyms: rework symbol lookup return codes

2024-03-26 Thread Arnd Bergmann
On Tue, Mar 26, 2024, at 18:06, Steven Rostedt wrote:
> On Tue, 26 Mar 2024 15:53:38 +0100
> Arnd Bergmann  wrote:
>
>> -const char *
>> +int
>>  ftrace_mod_address_lookup(unsigned long addr, unsigned long *size,
>> unsigned long *off, char **modname, char *sym)
>>  {
>>  struct ftrace_mod_map *mod_map;
>> -const char *ret = NULL;
>> +int ret;
>
> This needs to be ret = 0;

Fixed now, thanks!

I'll send a v5 in a few days 

Arnd



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 6:15 PM EET, Calvin Owens wrote:
> On Wednesday 03/27 at 00:24 +0900, Masami Hiramatsu wrote:
> > On Tue, 26 Mar 2024 14:46:10 +
> > Mark Rutland  wrote:
> > 
> > > Hi Masami,
> > > 
> > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > > > Hi Jarkko,
> > > > 
> > > > On Sun, 24 Mar 2024 01:29:08 +0200
> > > > Jarkko Sakkinen  wrote:
> > > > 
> > > > > Tracing with kprobes while running a monolithic kernel is currently
> > > > > impossible due to the kernel module allocator dependency.
> > > > > 
> > > > > Address the issue by allowing architectures to implement 
> > > > > module_alloc()
> > > > > and module_memfree() independent of the module subsystem. An arch tree
> > > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > > > > 
> > > > > Realize the feature on RISC-V by separating allocator to 
> > > > > module_alloc.c
> > > > > and implementing module_memfree().
> > > > 
> > > > Even though, this involves changes in arch-independent part. So it 
> > > > should
> > > > be solved by generic way. Did you checked Calvin's thread?
> > > > 
> > > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> > > > 
> > > > I think, we'd better to introduce `alloc_execmem()`,
> > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first
> > > > 
> > > >   config HAVE_ALLOC_EXECMEM
> > > > bool
> > > > 
> > > >   config ALLOC_EXECMEM
> > > > bool "Executable trampoline memory allocation"
> > > > depends on MODULES || HAVE_ALLOC_EXECMEM
> > > > 
> > > > And define fallback macro to module_alloc() like this.
> > > > 
> > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > > > #define alloc_execmem(size, gfp)module_alloc(size)
> > > > #endif
> > > 
> > > Please can we *not* do this? I think this is abstracting at the wrong 
> > > level (as
> > > I mentioned on the prior execmem proposals).
> > > 
> > > Different executable allocations can have different requirements. For 
> > > example,
> > > on arm64 modules need to be within 2G of the kernel image, but the 
> > > kprobes XOL
> > > areas can be anywhere in the kernel VA space.
> > > 
> > > Forcing those behind the same interface makes things *harder* for 
> > > architectures
> > > and/or makes the common code more complicated (if that ends up having to 
> > > track
> > > all those different requirements). From my PoV it'd be much better to have
> > > separate kprobes_alloc_*() functions for kprobes which an architecture 
> > > can then
> > > choose to implement using a common library if it wants to.
> > > 
> > > I took a look at doing that using the core ifdeffery fixups from Jarkko's 
> > > v6,
> > > and it looks pretty clean to me (and works in testing on arm64):
> > > 
> > >   
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > > 
> > > Could we please start with that approach, with kprobe-specific alloc/free 
> > > code
> > > provided by the architecture?
>
> Heh, I also noticed that dead !RWX branch in arm64 patch_map(), I was
> about to send a patch to remove it.
>
> > OK, as far as I can read the code, this method also works and is neat
> > (and minimally intrusive). I actually found that exposing CONFIG_ALLOC_EXECMEM
> > to the user does not help; it should be an internal change, so hiding this
> > change from the user is the better choice. Then there is no reason to
> > introduce the new alloc_execmem; just expanding kprobe_alloc_insn_page()
> > is reasonable.
>
> I'm happy with this, it solves the first half of my problem. But I want
> eBPF to work in the !MODULES case too.
>
> I think Mark's approach can work for bpf as well, without needing to
> touch module_alloc() at all? So I might be able to drop that first patch
> entirely.

Yeah, I think we're aligned. Later on, if/when you send the bpf series
please also cc me and I might possibly test those patches too.

BR, Jarkko



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 5:24 PM EET, Masami Hiramatsu (Google) wrote:
> On Tue, 26 Mar 2024 14:46:10 +
> Mark Rutland  wrote:
>
> > Hi Masami,
> > 
> > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > > Hi Jarkko,
> > > 
> > > On Sun, 24 Mar 2024 01:29:08 +0200
> > > Jarkko Sakkinen  wrote:
> > > 
> > > > Tracing with kprobes while running a monolithic kernel is currently
> > > > impossible due to the kernel module allocator dependency.
> > > > 
> > > > Address the issue by allowing architectures to implement module_alloc()
> > > > and module_memfree() independent of the module subsystem. An arch tree
> > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > > > 
> > > > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > > > and implementing module_memfree().
> > > 
> > > Even though, this involves changes in the arch-independent part. So it should
> > > be solved in a generic way. Did you check Calvin's thread?
> > > 
> > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> > > 
> > > I think we'd better introduce `alloc_execmem()`,
> > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM first:
> > > 
> > >   config HAVE_ALLOC_EXECMEM
> > >           bool
> > > 
> > >   config ALLOC_EXECMEM
> > >           bool "Executable trampoline memory allocation"
> > >           depends on MODULES || HAVE_ALLOC_EXECMEM
> > > 
> > > And define a fallback macro to module_alloc() like this.
> > > 
> > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > > #define alloc_execmem(size, gfp)  module_alloc(size)
> > > #endif
> > 
> > Please can we *not* do this? I think this is abstracting at the wrong level 
> > (as
> > I mentioned on the prior execmem proposals).
> > 
> > Different executable allocations can have different requirements. For 
> > example,
> > on arm64 modules need to be within 2G of the kernel image, but the kprobes 
> > XOL
> > areas can be anywhere in the kernel VA space.
> > 
> > Forcing those behind the same interface makes things *harder* for 
> > architectures
> > and/or makes the common code more complicated (if that ends up having to 
> > track
> > all those different requirements). From my PoV it'd be much better to have
> > separate kprobes_alloc_*() functions for kprobes which an architecture can 
> > then
> > choose to implement using a common library if it wants to.
> > 
> > I took a look at doing that using the core ifdeffery fixups from Jarkko's 
> > v6,
> > and it looks pretty clean to me (and works in testing on arm64):
> > 
> >   
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > 
> > Could we please start with that approach, with kprobe-specific alloc/free 
> > code
> > provided by the architecture?
>
> OK, as far as I can read the code, this method also works and is neat!
> (and with minimal intrusion). I actually found that exposing
> CONFIG_ALLOC_EXECMEM to the user does not help; it should be an internal
> change. So hiding this change from the user is the better choice. Then there
> is no reason to introduce the new alloc_execmem(); just expanding
> kprobe_alloc_insn_page() is reasonable.
>
> Mark, can you send this series here, so that others can review/test it?

I'm totally fine with this, but it would be best if it could also carry
the riscv part. Mark, even if you are only able to compile-test that
part, I can check that it works.

BR, Jarkko



Re: [PATCH 11/12] [v4] kallsyms: rework symbol lookup return codes

2024-03-26 Thread Steven Rostedt
On Tue, 26 Mar 2024 15:53:38 +0100
Arnd Bergmann  wrote:

> -const char *
> +int
>  ftrace_mod_address_lookup(unsigned long addr, unsigned long *size,
>  unsigned long *off, char **modname, char *sym)
>  {
>   struct ftrace_mod_map *mod_map;
> - const char *ret = NULL;
> + int ret;

This needs to be ret = 0;

>  
>   /* mod_map is freed via call_rcu() */
>   preempt_disable();

As here we have:

	list_for_each_entry_rcu(mod_map, &ftrace_mod_maps, list) {
		ret = ftrace_func_address_lookup(mod_map, addr, size, off, sym);
		if (ret) {
			if (modname)
				*modname = mod_map->mod->name;
			break;
		}
	}
	preempt_enable();

	return ret;
}

Where it is possible for the loop never to be executed.

-- Steve
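
Putting the two comments together, the fixed function would read roughly
as follows (reconstructed from the quoted hunks above, not taken verbatim
from the patch):

	int
	ftrace_mod_address_lookup(unsigned long addr, unsigned long *size,
				  unsigned long *off, char **modname, char *sym)
	{
		struct ftrace_mod_map *mod_map;
		int ret = 0;	/* must start at 0: the list may be empty */

		/* mod_map is freed via call_rcu() */
		preempt_disable();
		list_for_each_entry_rcu(mod_map, &ftrace_mod_maps, list) {
			ret = ftrace_func_address_lookup(mod_map, addr, size, off, sym);
			if (ret) {
				if (modname)
					*modname = mod_map->mod->name;
				break;
			}
		}
		preempt_enable();

		return ret;
	}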



Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-26 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 5:05 PM EET, Masami Hiramatsu (Google) wrote:
> > According to kconfig-language.txt:
> > 
> > "select should be used with care. select will force a symbol to a value
> > without visiting the dependencies."
> > 
> > So the problem here lies in the KPROBES config entry using a select statement
> > to pick ALLOC_EXECMEM. It will not take the depends on statement into
> > account and thus will allow selecting kprobes without any allocator in
> > place.
>
> OK, in that case "depends on" is good.

Yeah, I did not remember this at all. I only recalled it when I started to
get linking errors when compiling just the first patch... It's a bit of an
unintuitive twist in kconfig :-)

BR, Jarkko
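
A sketch of the arrangement being settled on here, using the symbol names
proposed earlier in the thread (illustrative only; the point is that
KPROBES uses "depends on" rather than "select"):

	config HAVE_ALLOC_EXECMEM
		bool

	config ALLOC_EXECMEM
		bool "Executable trampoline memory allocation"
		depends on MODULES || HAVE_ALLOC_EXECMEM

	config KPROBES
		bool "Kernel probes"
		depends on ALLOC_EXECMEM	# not "select": dependencies are honored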



Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 6:49 PM EET, Jarkko Sakkinen wrote:
> On Tue Mar 26, 2024 at 3:57 PM EET, Alexandre Ghiti wrote:
> > Hi Jarkko,
> >
> > On 25/03/2024 22:55, Jarkko Sakkinen wrote:
> > > Tracing with kprobes while running a monolithic kernel is currently
> > > impossible due to the kernel module allocator dependency.
> > >
> > > Address the issue by implementing textmem API for RISC-V.
> > >
> > > Link: https://www.sochub.fi # for power on testing new SoC's with a 
> > > minimal stack
> > > Link: 
> > > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
> > > # continuation
> > > Signed-off-by: Jarkko Sakkinen 
> > > ---
> > > v5:
> > > - No changes, except removing the alloc_execmem() call, which should have
> > >been part of the previous patch.
> > > v4:
> > > - Include linux/execmem.h.
> > > v3:
> > > - Architecture independent parts have been split to separate patches.
> > > - Do not change arch/riscv/kernel/module.c as it is out of scope for
> > >this patch set now.
> > > v2:
> > > - Better late than never right? :-)
> > > - Focus only on RISC-V for now to make the patch more digestible. This
> > >is the arch where I use the patch on a daily basis to help with QA.
> > > - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
> > > ---
> > >   arch/riscv/Kconfig  |  1 +
> > >   arch/riscv/kernel/Makefile  |  3 +++
> > >   arch/riscv/kernel/execmem.c | 22 ++
> > >   3 files changed, 26 insertions(+)
> > >   create mode 100644 arch/riscv/kernel/execmem.c
> > >
> > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > > index e3142ce531a0..499512fb17ff 100644
> > > --- a/arch/riscv/Kconfig
> > > +++ b/arch/riscv/Kconfig
> > > @@ -132,6 +132,7 @@ config RISCV
> > >   select HAVE_KPROBES if !XIP_KERNEL
> > >   select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
> > >   select HAVE_KRETPROBES if !XIP_KERNEL
> > > + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
> > >   # https://github.com/ClangBuiltLinux/linux/issues/1881
> > >   select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
> > >   select HAVE_MOVE_PMD
> > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > > index 604d6bf7e476..337797f10d3e 100644
> > > --- a/arch/riscv/kernel/Makefile
> > > +++ b/arch/riscv/kernel/Makefile
> > > @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP)   += cpu_ops.o
> > >   
> > >   obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
> > >   obj-$(CONFIG_MODULES)   += module.o
> > > +ifeq ($(CONFIG_ALLOC_EXECMEM),y)
> > > +obj-y	+= execmem.o
> > > +endif
> > >   obj-$(CONFIG_MODULE_SECTIONS)   += module-sections.o
> > >   
> > >   obj-$(CONFIG_CPU_PM)+= suspend_entry.o suspend.o
> > > diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
> > > new file mode 100644
> > > index ..3e52522ead32
> > > --- /dev/null
> > > +++ b/arch/riscv/kernel/execmem.c
> > > @@ -0,0 +1,22 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
>
> The parameter name needs to be here. I guess this could just as well
> pass gfp through to vmalloc from the caller, as kprobes does call
> module_alloc() with GFP_KERNEL set on RISC-V.
>
> > > +{
> > > + return __vmalloc_node_range(size, 1, MODULES_VADDR,
> > > + MODULES_END, GFP_KERNEL,
> > > + PAGE_KERNEL, 0, NUMA_NO_NODE,
> > > + __builtin_return_address(0));
> > > +}
> >
> >
> > The __vmalloc_node_range() line ^^ must be from an old kernel since we 
> > added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix 
> > module_alloc() that did not reset the linear mapping permissions").
> >
> > In addition, I guess module_alloc() should now use alloc_execmem() right?
>
> Ack for the first comment. For the 2nd, it is up to arch/ to choose
> whether to have shared or separate allocators.
>
> So if you want, I can change it that way, but I did not want to make the
> call myself.
>
> >
> >
> > > +
> > > +void free_execmem(void *region)
> > > +{
> > > + if (in_interrupt())
> > > + pr_warn("In interrupt context: vmalloc may not work.\n");
> > > +
> > > + vfree(region);
> > > +}
> >
> >
> > I remember Mike Rapoport sent a patchset to introduce an API for 
> > executable memory allocation 
> > (https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/),
> >  
> > how does this intersect with your work? I don't know the status of his 
> > patchset though.
> >
> > Thanks,
> >
> > Alex
>
> I have also made a patch set for kprobes back in 2022:
>
> https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/
>
> I think Calvin's, Mike's, and my early patch sets have the same
> problem: they try to choke all architectures at once.

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 4:46 PM EET, Mark Rutland wrote:
> Hi Masami,
>
> On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > Hi Jarkko,
> > 
> > On Sun, 24 Mar 2024 01:29:08 +0200
> > Jarkko Sakkinen  wrote:
> > 
> > > Tracing with kprobes while running a monolithic kernel is currently
> > > impossible due to the kernel module allocator dependency.
> > > 
> > > Address the issue by allowing architectures to implement module_alloc()
> > > and module_memfree() independent of the module subsystem. An arch tree
> > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > > 
> > > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > > and implementing module_memfree().
> > 
> > Even though, this involves changes in the arch-independent part. So it should
> > be solved in a generic way. Did you check Calvin's thread?
> > 
> > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> > 
> > I think we'd better introduce `alloc_execmem()`,
> > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM first:
> > 
> >   config HAVE_ALLOC_EXECMEM
> >           bool
> > 
> >   config ALLOC_EXECMEM
> >           bool "Executable trampoline memory allocation"
> >           depends on MODULES || HAVE_ALLOC_EXECMEM
> > 
> > And define a fallback macro to module_alloc() like this.
> > 
> > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > #define alloc_execmem(size, gfp) module_alloc(size)
> > #endif
>
> Please can we *not* do this? I think this is abstracting at the wrong level 
> (as
> I mentioned on the prior execmem proposals).
>
> Different executable allocations can have different requirements. For example,
> on arm64 modules need to be within 2G of the kernel image, but the kprobes XOL
> areas can be anywhere in the kernel VA space.
>
> Forcing those behind the same interface makes things *harder* for 
> architectures
> and/or makes the common code more complicated (if that ends up having to track
> all those different requirements). From my PoV it'd be much better to have
> separate kprobes_alloc_*() functions for kprobes which an architecture can 
> then
> choose to implement using a common library if it wants to.
>
> I took a look at doing that using the core ifdeffery fixups from Jarkko's v6,
> and it looks pretty clean to me (and works in testing on arm64):
>
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
>
> Could we please start with that approach, with kprobe-specific alloc/free code
> provided by the architecture?

How should we move forward?

I'm fine with someone picking up the pieces of my work as long as the
riscv side is also included. I can also continue rotating this, whatever works.

>
> Thanks,
> Mark.

BR, Jarkko



Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 3:57 PM EET, Alexandre Ghiti wrote:
> Hi Jarkko,
>
> On 25/03/2024 22:55, Jarkko Sakkinen wrote:
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible due to the kernel module allocator dependency.
> >
> > Address the issue by implementing textmem API for RISC-V.
> >
> > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
> > stack
> > Link: 
> > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # 
> > continuation
> > Signed-off-by: Jarkko Sakkinen 
> > ---
> > v5:
> > - No changes, except removing the alloc_execmem() call, which should have
> >been part of the previous patch.
> > v4:
> > - Include linux/execmem.h.
> > v3:
> > - Architecture independent parts have been split to separate patches.
> > - Do not change arch/riscv/kernel/module.c as it is out of scope for
> >this patch set now.
> > v2:
> > - Better late than never right? :-)
> > - Focus only on RISC-V for now to make the patch more digestible. This
> >is the arch where I use the patch on a daily basis to help with QA.
> > - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
> > ---
> >   arch/riscv/Kconfig  |  1 +
> >   arch/riscv/kernel/Makefile  |  3 +++
> >   arch/riscv/kernel/execmem.c | 22 ++
> >   3 files changed, 26 insertions(+)
> >   create mode 100644 arch/riscv/kernel/execmem.c
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index e3142ce531a0..499512fb17ff 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -132,6 +132,7 @@ config RISCV
> > select HAVE_KPROBES if !XIP_KERNEL
> > select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
> > select HAVE_KRETPROBES if !XIP_KERNEL
> > +   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
> > # https://github.com/ClangBuiltLinux/linux/issues/1881
> > select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
> > select HAVE_MOVE_PMD
> > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
> > index 604d6bf7e476..337797f10d3e 100644
> > --- a/arch/riscv/kernel/Makefile
> > +++ b/arch/riscv/kernel/Makefile
> > @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
> >   
> >   obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
> >   obj-$(CONFIG_MODULES) += module.o
> > +ifeq ($(CONFIG_ALLOC_EXECMEM),y)
> > +obj-y  += execmem.o
> > +endif
> >   obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o
> >   
> >   obj-$(CONFIG_CPU_PM)  += suspend_entry.o suspend.o
> > diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
> > new file mode 100644
> > index ..3e52522ead32
> > --- /dev/null
> > +++ b/arch/riscv/kernel/execmem.c
> > @@ -0,0 +1,22 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +void *alloc_execmem(unsigned long size, gfp_t /* gfp */)

The parameter name needs to be here. I guess this could just as well
pass gfp through to vmalloc from the caller, as kprobes does call
module_alloc() with GFP_KERNEL set on RISC-V.

> > +{
> > +   return __vmalloc_node_range(size, 1, MODULES_VADDR,
> > +   MODULES_END, GFP_KERNEL,
> > +   PAGE_KERNEL, 0, NUMA_NO_NODE,
> > +   __builtin_return_address(0));
> > +}
>
>
> The __vmalloc_node_range() line ^^ must be from an old kernel since we 
> added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix 
> module_alloc() that did not reset the linear mapping permissions").
>
> In addition, I guess module_alloc() should now use alloc_execmem() right?

Ack for the first comment. For the 2nd, it is up to arch/ to choose
whether to have shared or separate allocators.

So if you want, I can change it that way, but I did not want to make the
call myself.

>
>
> > +
> > +void free_execmem(void *region)
> > +{
> > +   if (in_interrupt())
> > +   pr_warn("In interrupt context: vmalloc may not work.\n");
> > +
> > +   vfree(region);
> > +}
>
>
> I remember Mike Rapoport sent a patchset to introduce an API for 
> executable memory allocation 
> (https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/), 
> how does this intersect with your work? I don't know the status of his 
> patchset though.
>
> Thanks,
>
> Alex

I have also made a patch set for kprobes back in 2022:

https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/

I think Calvin's, Mike's, and my early patch sets have the same
problem: they try to choke all architectures at once. And further,
Calvin's and Mike's work also tries to cover all the tracing subsystems
at once.

I feel that my relatively small patch set, which deals only with the
trivial kprobe case (which is more of a leaf than e.g. bpf, which
is more like an orchestrator tool) and implements one arch whose
dog food I actually eat, is a better starting point.

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Mark Rutland
On Tue, Mar 26, 2024 at 09:15:14AM -0700, Calvin Owens wrote:
> On Wednesday 03/27 at 00:24 +0900, Masami Hiramatsu wrote:
> > On Tue, 26 Mar 2024 14:46:10 +
> > Mark Rutland  wrote:
> > > Different executable allocations can have different requirements. For 
> > > example,
> > > on arm64 modules need to be within 2G of the kernel image, but the 
> > > kprobes XOL
> > > areas can be anywhere in the kernel VA space.
> > > 
> > > Forcing those behind the same interface makes things *harder* for 
> > > architectures
> > > and/or makes the common code more complicated (if that ends up having to 
> > > track
> > > all those different requirements). From my PoV it'd be much better to have
> > > separate kprobes_alloc_*() functions for kprobes which an architecture 
> > > can then
> > > choose to implement using a common library if it wants to.
> > > 
> > > I took a look at doing that using the core ifdeffery fixups from Jarkko's 
> > > v6,
> > > and it looks pretty clean to me (and works in testing on arm64):
> > > 
> > >   
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > > 
> > > Could we please start with that approach, with kprobe-specific alloc/free 
> > > code
> > > provided by the architecture?
> 
> Heh, I also noticed that dead !RWX branch in arm64 patch_map(), I was
> about to send a patch to remove it.
> 
> OK, as far as I can read the code, this method also works and is neat!
> (and with minimal intrusion). I actually found that exposing
> CONFIG_ALLOC_EXECMEM to the user does not help; it should be an internal
> change. So hiding this change from the user is the better choice. Then there
> is no reason to introduce the new alloc_execmem(); just expanding
> kprobe_alloc_insn_page() is reasonable.
> 
> I'm happy with this, it solves the first half of my problem. But I want
> eBPF to work in the !MODULES case too.
> 
> I think Mark's approach can work for bpf as well, without needing to
> touch module_alloc() at all? So I might be able to drop that first patch
> entirely.

I'd be very happy with eBPF following the same approach, with BPF-specific
alloc/free functions that we can implement in arch code.

IIUC eBPF code *does* want to be within range of the core kernel image, so for
arm64 we'd want to factor some common logic out of module_alloc() and into
something that module_alloc() and "bpf_alloc()" (or whatever it would be
called) could use. So I don't think we'd necessarily save on touching
module_alloc(), but I think the resulting split would be better.

Thanks,
Mark.
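
A rough sketch of that factoring, with hypothetical names: on arm64 both
module_alloc() and the BPF JIT want to be within branch range of the
kernel image, so the range selection could live in one shared helper.
arm64_alloc_code() is an invented name, and the real module_alloc() has
additional fallback logic this omits:

	static void *arm64_alloc_code(unsigned long size)
	{
		/* both modules and BPF want to be in branch range of the kernel image */
		return __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
					    module_alloc_base + SZ_2G, GFP_KERNEL,
					    PAGE_KERNEL, 0, NUMA_NO_NODE,
					    __builtin_return_address(0));
	}

	void *module_alloc(unsigned long size)
	{
		return arm64_alloc_code(size);
	}

	void *bpf_jit_alloc_exec(unsigned long size)
	{
		return arm64_alloc_code(size);
	}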



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Mark Rutland
On Wed, Mar 27, 2024 at 12:24:03AM +0900, Masami Hiramatsu wrote:
> On Tue, 26 Mar 2024 14:46:10 +
> Mark Rutland  wrote:
> > 
> > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > > I think we'd better introduce `alloc_execmem()`,
> > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM first:
> > > 
> > >   config HAVE_ALLOC_EXECMEM
> > >           bool
> > > 
> > >   config ALLOC_EXECMEM
> > >           bool "Executable trampoline memory allocation"
> > >           depends on MODULES || HAVE_ALLOC_EXECMEM
> > > 
> > > And define a fallback macro to module_alloc() like this.
> > > 
> > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > > #define alloc_execmem(size, gfp)  module_alloc(size)
> > > #endif
> > 
> > Please can we *not* do this? I think this is abstracting at the wrong level 
> > (as
> > I mentioned on the prior execmem proposals).
> > 
> > Different executable allocations can have different requirements. For 
> > example,
> > on arm64 modules need to be within 2G of the kernel image, but the kprobes 
> > XOL
> > areas can be anywhere in the kernel VA space.
> > 
> > Forcing those behind the same interface makes things *harder* for 
> > architectures
> > and/or makes the common code more complicated (if that ends up having to 
> > track
> > all those different requirements). From my PoV it'd be much better to have
> > separate kprobes_alloc_*() functions for kprobes which an architecture can 
> > then
> > choose to implement using a common library if it wants to.
> > 
> > I took a look at doing that using the core ifdeffery fixups from Jarkko's 
> > v6,
> > and it looks pretty clean to me (and works in testing on arm64):
> > 
> >   
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > 
> > Could we please start with that approach, with kprobe-specific alloc/free 
> > code
> > provided by the architecture?
> 
> OK, as far as I can read the code, this method also works and is neat!
> (and with minimal intrusion). I actually found that exposing
> CONFIG_ALLOC_EXECMEM to the user does not help; it should be an internal
> change. So hiding this change from the user is the better choice. Then there
> is no reason to introduce the new alloc_execmem(); just expanding
> kprobe_alloc_insn_page() is reasonable.
> 
> Mark, can you send this series here, so that others can review/test it?

I've written up a cover letter and sent that out:
  
  https://lore.kernel.org/lkml/20240326163624.3253157-1-mark.rutl...@arm.com/

Mark.



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-26 Thread Andrii Nakryiko
On Mon, Mar 25, 2024 at 3:11 PM Steven Rostedt  wrote:
>
> On Mon, 25 Mar 2024 11:38:48 +0900
> Masami Hiramatsu (Google)  wrote:
>
> > On Fri, 22 Mar 2024 09:03:23 -0700
> > Andrii Nakryiko  wrote:
> >
> > > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> > > control whether ftrace low-level code performs additional
> > > rcu_is_watching()-based validation logic in an attempt to catch noinstr
> > > violations.
> > >
> > > This check is expected to never be true in practice and would be best
> > > controlled with extra config to let users decide if they are willing to
> > > pay the price.
> >
> > Hmm, to me it sounds like "WARN_ON(something) will never be true in practice,
> > so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > is OK, but that should be set to Y by default. If you have already verified
> > that your system never make it true and you want to optimize your ftrace
> > path, you can manually set it to N at your own risk.
> >
>
> Really, it's for debugging. I would argue that it should *not* be default y.
> Peter added this to find all the locations that could be called where RCU
> is not watching. But the issue I have is that it *does cause
> overhead* with function tracing.
>
> I believe we found pretty much all locations that were an issue, and we
> should now just make it an option for developers.
>
> It's no different than lockdep. Test boxes should have it enabled, but
> there's no reason to have this enabled in a production system.
>

I tend to agree with Steven here (which is why I sent this patch as it
is), but I'm happy to do it as an opt-out, if Masami insists. Please
do let me know if I need to send v2 or this one is actually the one
we'll end up using. Thanks!

> -- Steve
>
>
> > >
> > > Cc: Steven Rostedt 
> > > Cc: Masami Hiramatsu 
> > > Cc: Paul E. McKenney 
> > > Signed-off-by: Andrii Nakryiko 
> > > ---
> > >  include/linux/trace_recursion.h |  2 +-
> > >  kernel/trace/Kconfig| 13 +
> > >  2 files changed, 14 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
> > > index d48cd92d2364..24ea8ac049b4 100644
> > > --- a/include/linux/trace_recursion.h
> > > +++ b/include/linux/trace_recursion.h
> > > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
> > >  # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
> > >  #endif
> > >
> > > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > >  # define trace_warn_on_no_rcu(ip)  \
> > > ({  \
> > > bool __ret = !rcu_is_watching();\
> >
> > BTW, maybe we can add "unlikely" in the next "if" line?
> >
> > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > > index 61c541c36596..19bce4e217d6 100644
> > > --- a/kernel/trace/Kconfig
> > > +++ b/kernel/trace/Kconfig
> > > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> > >   This file can be reset, but the limit can not change in
> > >   size at runtime.
> > >
> > > +config FTRACE_VALIDATE_RCU_IS_WATCHING
> > > +   bool "Validate RCU is on during ftrace recursion check"
> > > +   depends on FUNCTION_TRACER
> > > +   depends on ARCH_WANTS_NO_INSTR
> >
> >   default y
> >
> > > +   help
> > > + All callbacks that attach to the function tracing have some sort
> > > + of protection against recursion. This option performs additional
> > > + checks to make sure RCU is on when ftrace callbacks recurse.
> > > +
> > > + This will add more overhead to all ftrace-based invocations.
> >
> >   ... invocations, but keep it safe.
> >
> > > +
> > > + If unsure, say N
> >
> >   If unsure, say Y
> >
> > Thank you,
> >
> > > +
> > >  config RING_BUFFER_RECORD_RECURSION
> > > bool "Record functions that recurse in the ring buffer"
> > > depends on FTRACE_RECORD_RECURSION
> > > --
> > > 2.43.0
> > >
> >
> >
>



Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Calvin Owens
On Wednesday 03/27 at 00:24 +0900, Masami Hiramatsu wrote:
> On Tue, 26 Mar 2024 14:46:10 +
> Mark Rutland  wrote:
> 
> > Hi Masami,
> > 
> > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > > Hi Jarkko,
> > > 
> > > On Sun, 24 Mar 2024 01:29:08 +0200
> > > Jarkko Sakkinen  wrote:
> > > 
> > > > Tracing with kprobes while running a monolithic kernel is currently
> > > > impossible due to the kernel module allocator dependency.
> > > > 
> > > > Address the issue by allowing architectures to implement module_alloc()
> > > > and module_memfree() independent of the module subsystem. An arch tree
> > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > > > 
> > > > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > > > and implementing module_memfree().
> > > 
> > > Even though, this involves changes in the arch-independent part. So it should
> > > be solved in a generic way. Did you check Calvin's thread?
> > > 
> > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> > > 
> > > I think we'd better introduce `alloc_execmem()`,
> > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM first:
> > > 
> > >   config HAVE_ALLOC_EXECMEM
> > >           bool
> > > 
> > >   config ALLOC_EXECMEM
> > >           bool "Executable trampoline memory allocation"
> > >           depends on MODULES || HAVE_ALLOC_EXECMEM
> > > 
> > > And define a fallback macro to module_alloc() like this.
> > > 
> > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > > #define alloc_execmem(size, gfp)  module_alloc(size)
> > > #endif
> > 
> > Please can we *not* do this? I think this is abstracting at the wrong level 
> > (as
> > I mentioned on the prior execmem proposals).
> > 
> > Different executable allocations can have different requirements. For 
> > example,
> > on arm64 modules need to be within 2G of the kernel image, but the kprobes 
> > XOL
> > areas can be anywhere in the kernel VA space.
> > 
> > Forcing those behind the same interface makes things *harder* for 
> > architectures
> > and/or makes the common code more complicated (if that ends up having to 
> > track
> > all those different requirements). From my PoV it'd be much better to have
> > separate kprobes_alloc_*() functions for kprobes which an architecture can 
> > then
> > choose to implement using a common library if it wants to.
> > 
> > I took a look at doing that using the core ifdeffery fixups from Jarkko's 
> > v6,
> > and it looks pretty clean to me (and works in testing on arm64):
> > 
> >   
> > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> > 
> > Could we please start with that approach, with kprobe-specific alloc/free 
> > code
> > provided by the architecture?

Heh, I also noticed that dead !RWX branch in arm64 patch_map(), I was
about to send a patch to remove it.

> OK, as far as I can read the code, this method also works and is neat!
> (and with minimal intrusion). I actually found that exposing
> CONFIG_ALLOC_EXECMEM to the user does not help; it should be an internal
> change. So hiding this change from the user is the better choice. Then there
> is no reason to introduce the new alloc_execmem(); just expanding
> kprobe_alloc_insn_page() is reasonable.

I'm happy with this, it solves the first half of my problem. But I want
eBPF to work in the !MODULES case too.

I think Mark's approach can work for bpf as well, without needing to
touch module_alloc() at all? So I might be able to drop that first patch
entirely.

https://lore.kernel.org/all/a6b162aed1e6fea7f565ef9dd0204d6f2284bcce.1709676663.git.jcalvinow...@gmail.com/

Thanks,
Calvin

> Mark, can you send this series here, so that others can review/test it?
> 
> Thank you!
> 
> 
> > 
> > Thanks,
> > Mark.
> 
> 
> -- 
> Masami Hiramatsu (Google) 



Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-26 Thread Andrii Nakryiko
On Sun, Mar 24, 2024 at 8:03 PM Masami Hiramatsu  wrote:
>
> On Thu, 21 Mar 2024 07:57:35 -0700
> Jonathan Haslam  wrote:
>
> > Active uprobes are stored in an RB tree and accesses to this tree are
> > dominated by read operations. Currently these accesses are serialized by
> > a spinlock but this leads to enormous contention when large numbers of
> > threads are executing active probes.
> >
> > This patch converts the spinlock used to serialize access to the
> > uprobes_tree RB tree into a reader-writer spinlock. This lock type
> > aligns naturally with the overwhelmingly read-only nature of the tree
> > usage here. Although the addition of reader-writer spinlocks is
> > discouraged [0], this fix is proposed as an interim solution while an
> > RCU based approach is implemented (that work is in a nascent form). This
> > fix also has the benefit of being trivial, self contained and therefore
> > simple to backport.
> >
> > This change has been tested against production workloads that exhibit
> > significant contention on the spinlock, and an almost order-of-magnitude
> > reduction in mean uprobe execution time is observed (28 -> 3.5 microsecs).
>
> Looks good to me.
>
> Acked-by: Masami Hiramatsu (Google) 

Masami,

Given the discussion around the per-CPU rw semaphore and the need for
an (internal) batched attachment API for uprobes, do you think you can
apply this patch as is for now? We can then gain initial improvements
in scalability that are also easy to backport, and Jonathan will work
on a more complete solution based on per-cpu RW semaphore, as
suggested by Ingo.
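
For reference, the per-CPU rwsem variant being discussed would look
roughly like the sketch below, shown against the same find_uprobe() that
the patch touches (illustrative only, not a tested change):

	static DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);

	static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
	{
		struct uprobe *uprobe;

		/* readers stay cheap and scalable; writers (probe
		 * register/unregister) pay for the synchronization */
		percpu_down_read(&uprobes_treelock);
		uprobe = __find_uprobe(inode, offset);
		percpu_up_read(&uprobes_treelock);

		return uprobe;
	}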

>
> BTW, how did you measure the overhead? I think spinlock overhead
> will depend on how much lock contention happens.
>
> Thank you,
>
> >
> > [0] https://docs.kernel.org/locking/spinlocks.html
> >
> > Signed-off-by: Jonathan Haslam 
> > ---
> >  kernel/events/uprobes.c | 22 +++---
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 929e98c62965..42bf9b6e8bc0 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> >   */
> >  #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
> >
> > -static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
> > +static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree access */
> >
> >  #define UPROBES_HASH_SZ  13
> >  /* serialize uprobe->pending_list */
> > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> >  {
> >   struct uprobe *uprobe;
> >
> > -	spin_lock(&uprobes_treelock);
> > +	read_lock(&uprobes_treelock);
> >  	uprobe = __find_uprobe(inode, offset);
> > -	spin_unlock(&uprobes_treelock);
> > +	read_unlock(&uprobes_treelock);
> >
> >   return uprobe;
> >  }
> > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> >  {
> >  	struct uprobe *u;
> >
> > -	spin_lock(&uprobes_treelock);
> > +	write_lock(&uprobes_treelock);
> >  	u = __insert_uprobe(uprobe);
> > -	spin_unlock(&uprobes_treelock);
> > +	write_unlock(&uprobes_treelock);
> >
> >   return u;
> >  }
> > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> >   if (WARN_ON(!uprobe_is_active(uprobe)))
> >   return;
> >
> > -	spin_lock(&uprobes_treelock);
> > +	write_lock(&uprobes_treelock);
> >  	rb_erase(&uprobe->rb_node, &uprobes_tree);
> > -	spin_unlock(&uprobes_treelock);
> > +	write_unlock(&uprobes_treelock);
> >  	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> >   put_uprobe(uprobe);
> >  }
> > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> >   min = vaddr_to_offset(vma, start);
> >   max = min + (end - start) - 1;
> >
> > -	spin_lock(&uprobes_treelock);
> > +	read_lock(&uprobes_treelock);
> >  	n = find_node_in_range(inode, min, max);
> >  	if (n) {
> >  		for (t = n; t; t = rb_prev(t)) {
> > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> >  			get_uprobe(u);
> >  		}
> >  	}
> > -	spin_unlock(&uprobes_treelock);
> > +	read_unlock(&uprobes_treelock);
> >  }
> >
> >  /* @vma contains reference counter, not the probed instruction. */
> > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> >   min = vaddr_to_offset(vma, start);
> >   max = min + (end - start) - 1;
> >
> > -	spin_lock(&uprobes_treelock);
> > +	read_lock(&uprobes_treelock);
> >  	n = find_node_in_range(inode, min, max);
> > -	spin_unlock(&uprobes_treelock);
> > +	read_unlock(&uprobes_treelock);
> >
> >   return !!n;
> >  }
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-26 Thread Will Deacon
On Tue, Mar 26, 2024 at 11:43:13AM +, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 09:38:55AM +, Keir Fraser wrote:
> > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > Secondly, the debugging code is enhanced so that the available head for
> > > > (last_avail_idx - 1) is read twice and recorded. It means the available
> > > > head for one specific available index is read twice. I do see that the
> > > > available heads differ between the consecutive reads. More details
> > > > are shared below.
> > > > 
> > > > From the guest side
> > > > ===
> > > > 
> > > > virtio_net virtio0: output.0:id 86 is not a head!
> > > > head to be released: 047 062 112
> > > > 
> > > > avail_idx:
> > > > 000  49665
> > > > 001  49666  <--
> > > >  :
> > > > 015  49664
> > > 
> > > what are these #s 49665 and so on?
> > > and how large is the ring?
> > > I am guessing 49664 is the index, the ring size is 16, and
> > > 49664 % 16 == 0
> > 
> > More than that, 49664 % 256 == 0
> > 
> > So again there seems to be an error in the vicinity of roll-over of
> > the idx low byte, as I observed in the earlier log. Surely this is
> > more than coincidence?
> 
> Yeah, I'd still really like to see the disassembly for both sides of the
> protocol here. Gavin, is that something you're able to provide? Worst
> case, the host and guest vmlinux objects would be a starting point.
> 
> Personally, I'd be fairly surprised if this was a hardware issue.

Ok, long shot after eyeballing the vhost code, but does the diff below
help at all? It looks like vhost_vq_avail_empty() can advance the value
saved in 'vq->avail_idx' but without the read barrier, possibly confusing
vhost_get_vq_desc() in polling mode.

Will

--->8

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..87bff710331a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 		return false;
 	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
+	smp_rmb();
 	return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
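
For context, a rough pseudocode sketch of the ordering the added barrier
is meant to enforce, paired with the guest's usual publish sequence (this
is illustrative, not lifted verbatim from either codebase):

	/* guest (virtio driver) side -- publish sequence */
	vring.desc[head] = ...;				/* 1: fill the descriptor */
	virtio_wmb();					/* order 1 before 2 */
	vring.avail->idx++;				/* 2: publish the new index */

	/* host (vhost) side -- poll sequence */
	avail_idx = READ_ONCE(vring.avail->idx);	/* observes 2 */
	smp_rmb();					/* the barrier added above: order 2 before 3 */
	head = vring.avail->ring[last_avail_idx];	/* 3: must see the entry from 1 */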




Re: [PATCH] [v3] module: don't ignore sysfs_create_link() failures

2024-03-26 Thread Andy Shevchenko
On Tue, Mar 26, 2024 at 03:57:18PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The sysfs_create_link() return code is marked as __must_check, but the
> module_add_driver() function tries hard to not care, by assigning the
> return code to a variable. When building with 'make W=1', gcc still
> warns because this variable is only assigned but not used:
> 
> drivers/base/module.c: In function 'module_add_driver':
> drivers/base/module.c:36:6: warning: variable 'no_warn' set but not used 
> [-Wunused-but-set-variable]
> 
> Rework the code to properly unwind and return the error code to the
> caller. My reading of the original code was that it tries to
> not fail when the links already exist, so keep ignoring -EEXIST
> errors.

> Cc: Luis Chamberlain 
> Cc: linux-modu...@vger.kernel.org
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 

Wondering if you can move these to be after the --- line to avoid polluting
the commit message. This will have the same effect and still be archived on
lore, but on the pro side it will unburden the commit message(s) from
unneeded noise.

...

> + error = module_add_driver(drv->owner, drv);
> + if (error) {
> + printk(KERN_ERR "%s: failed to create module links for %s\n",
> + __func__, drv->name);

What's wrong with pr_err()? Even if it's not the style used here, in new
pieces of code this can be improved beforehand. That way we reduce technical
debt instead of adding to it.

> + goto out_detach;
> + }
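
The pr_err() form being suggested would look something like this (an
illustrative rewrite of the quoted hunk, not a tested change):

	error = module_add_driver(drv->owner, drv);
	if (error) {
		pr_err("%s: failed to create module links for %s\n",
		       __func__, drv->name);
		goto out_detach;
	}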

...

> +int module_add_driver(struct module *mod, struct device_driver *drv)
>  {
>   char *driver_name;
> - int no_warn;
> + int ret;

I would move it...

>   struct module_kobject *mk = NULL;

...to be here.

-- 
With Best Regards,
Andy Shevchenko





Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Google
On Tue, 26 Mar 2024 14:46:10 +
Mark Rutland  wrote:

> Hi Masami,
> 
> On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> > Hi Jarkko,
> > 
> > On Sun, 24 Mar 2024 01:29:08 +0200
> > Jarkko Sakkinen  wrote:
> > 
> > > Tracing with kprobes while running a monolithic kernel is currently
> > > impossible due to the kernel module allocator dependency.
> > > 
> > > Address the issue by allowing architectures to implement module_alloc()
> > > and module_memfree() independent of the module subsystem. An arch tree
> > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > > 
> > > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > > and implementing module_memfree().
> > 
> > Even though, this involves changes in the arch-independent part. So it should
> > be solved in a generic way. Did you check Calvin's thread?
> > 
> > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> > 
> > I think we'd better introduce `alloc_execmem()`,
> > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM first:
> > 
> >   config HAVE_ALLOC_EXECMEM
> >           bool
> > 
> >   config ALLOC_EXECMEM
> >           bool "Executable trampoline memory allocation"
> >           depends on MODULES || HAVE_ALLOC_EXECMEM
> > 
> > And define a fallback macro to module_alloc() like this.
> > 
> > #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> > #define alloc_execmem(size, gfp) module_alloc(size)
> > #endif
> 
> Please can we *not* do this? I think this is abstracting at the wrong level 
> (as
> I mentioned on the prior execmem proposals).
> 
> Different executable allocations can have different requirements. For example,
> on arm64 modules need to be within 2G of the kernel image, but the kprobes XOL
> areas can be anywhere in the kernel VA space.
> 
> Forcing those behind the same interface makes things *harder* for 
> architectures
> and/or makes the common code more complicated (if that ends up having to track
> all those different requirements). From my PoV it'd be much better to have
> separate kprobes_alloc_*() functions for kprobes which an architecture can 
> then
> choose to implement using a common library if it wants to.
> 
> I took a look at doing that using the core ifdeffery fixups from Jarkko's v6,
> and it looks pretty clean to me (and works in testing on arm64):
> 
>   
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
> 
> Could we please start with that approach, with kprobe-specific alloc/free code
> provided by the architecture?

OK, as far as I can read the code, this method also works and is neat!
(and with minimal intrusion). I actually found that exposing
CONFIG_ALLOC_EXECMEM to the user does not help; it should be an internal
change. So hiding this change from the user is the better choice. Then there
is no reason to introduce the new alloc_execmem(); just expanding
kprobe_alloc_insn_page() is reasonable.

Mark, can you send this series here, so that others can review/test it?

Thank you!


> 
> Thanks,
> Mark.


-- 
Masami Hiramatsu (Google) 



[PATCH net v2 2/2] virtio_net: Do not send RSS key if it is not supported

2024-03-26 Thread Breno Leitao
There is a bug when setting the RSS options in virtio_net that can break
the whole machine, getting the kernel into an infinite loop.

Running the following command in any QEMU virtual machine with virtio-net
will reproduce this problem:

# ethtool -X eth0 hfunc toeplitz

This is how the problem happens:

1) ethtool_set_rxfh() calls virtnet_set_rxfh()

2) virtnet_set_rxfh() calls virtnet_commit_rss_command()

3) virtnet_commit_rss_command() populates 4 entries for the rss
scatter-gather

4) Since the command above does not have a key, the last
scatter-gather entry will be zeroed, since rss_key_size == 0:
	sg_buf_size = vi->rss_key_size;

5) This buffer is passed to qemu, but qemu is not happy with a buffer
with zero length, and does the following in virtqueue_map_desc() (a QEMU
function):

	if (!sz) {
		virtio_error(vdev, "virtio: zero sized buffers are not allowed");

6) virtio_error() (also a QEMU function) sets the device as broken:

vdev->broken = true;

7) QEMU bails out, and does not respond to this crazy kernel.

8) The kernel is waiting for the response to come back (function
virtnet_send_command())

9) The kernel waits, doing the following:

	while (!virtqueue_get_buf(vi->cvq, &tmp) &&
	       !virtqueue_is_broken(vi->cvq))
		cpu_relax();

10) Neither of the functions above returns true, thus the kernel
loops here forever. Keep in mind that virtqueue_is_broken() does
not look at QEMU's `vdev->broken`, so it never realizes that the
virtio device is broken on the QEMU side.
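
For reference, virtqueue_is_broken() only checks the flag on the driver
side; a sketch of the drivers/virtio/virtio_ring.c helper (close to, but
possibly not verbatim from, the current tree):

	bool virtqueue_is_broken(const struct virtqueue *_vq)
	{
		const struct vring_virtqueue *vq = to_vvq(_vq);

		/* vq->broken is only ever set by the guest driver itself,
		 * so it cannot observe QEMU's vdev->broken. */
		return READ_ONCE(vq->broken);
	}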

Fix it by not sending RSS commands if the feature is not available in
the device.

Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Cc: sta...@vger.kernel.org
Cc: qemu-de...@nongnu.org
Signed-off-by: Breno Leitao 
---
 drivers/net/virtio_net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c640fdf28fc5..e6b0eaf08ac2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3809,6 +3809,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
struct virtnet_info *vi = netdev_priv(dev);
int i;
 
+   if (!vi->has_rss && !vi->has_rss_hash_report)
+   return -EOPNOTSUPP;
+
if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
rxfh->hfunc != ETH_RSS_HASH_TOP)
return -EOPNOTSUPP;
-- 
2.43.0




[PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported

2024-03-26 Thread Breno Leitao
Do not set virtnet_info->rss_indir_table_size if RSS is not available
for the device.

Currently, rss_indir_table_size is set if either has_rss or
has_rss_hash_report is available, but, it should only be set if has_rss
is set.

In virtnet_set_rxfh(), return -EINVAL if the request has the
indirection table set but virtnet does not support RSS.

Suggested-by: Heng Qi 
Signed-off-by: Breno Leitao 
---
 drivers/net/virtio_net.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..c640fdf28fc5 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
rxfh->hfunc != ETH_RSS_HASH_TOP)
return -EOPNOTSUPP;
 
+   if (rxfh->indir && !vi->has_rss)
+   return -EINVAL;
+
if (rxfh->indir) {
for (i = 0; i < vi->rss_indir_table_size; ++i)
vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
@@ -4729,13 +4732,15 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
vi->has_rss_hash_report = true;
 
-   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
vi->has_rss = true;
 
-   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_indir_table_size =
virtio_cread16(vdev, offsetof(struct virtio_net_config,
rss_max_indirection_table_length));
+   }
+
+   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_key_size =
virtio_cread8(vdev, offsetof(struct virtio_net_config, 
rss_max_key_size));
 
-- 
2.43.0




Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-26 Thread Google
On Tue, 26 Mar 2024 15:18:21 +0200
"Jarkko Sakkinen"  wrote:

> On Tue Mar 26, 2024 at 4:01 AM EET, Jarkko Sakkinen wrote:
> > On Tue Mar 26, 2024 at 3:31 AM EET, Jarkko Sakkinen wrote:
> > > > > +#endif /* _LINUX_EXECMEM_H */
> > > > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > > > index 9d9095e81792..87fd8c14a938 100644
> > > > > --- a/kernel/kprobes.c
> > > > > +++ b/kernel/kprobes.c
> > > > > @@ -44,6 +44,7 @@
> > > > >  #include 
> > > > >  #include 
> > > > >  #include 
> > > > > +#include 
> > > > >  
> > > > >  #define KPROBE_HASH_BITS 6
> > > > >  #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> > > > > @@ -113,17 +114,17 @@ enum kprobe_slot_state {
> > > > >  void __weak *alloc_insn_page(void)
> > > > >  {
> > > > >   /*
> > > > > -  * Use module_alloc() so this page is within +/- 2GB of where 
> > > > > the
> > > > > +  * Use alloc_execmem() so this page is within +/- 2GB of where 
> > > > > the
> > > > >* kernel image and loaded module images reside. This is 
> > > > > required
> > > > >* for most of the architectures.
> > > > >* (e.g. x86-64 needs this to handle the %rip-relative fixups.)
> > > > >*/
> > > > > - return module_alloc(PAGE_SIZE);
> > > > > + return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
> > > > >  }
> > > > >  
> > > > >  static void free_insn_page(void *page)
> > > > >  {
> > > > > - module_memfree(page);
> > > > > + free_execmem(page);
> > > > >  }
> > > > >  
> > > > >  struct kprobe_insn_cache kprobe_insn_slots = {
> > > > > @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
> > > > >   goto out;
> > > > >   }
> > > > >  
> > > > > +#ifdef CONFIG_MODULES
> > > >
> > > > You don't need this block, because these APIs have dummy functions.
> > >
> > > Hmm... I'll verify this tomorrow.
> >
> > It depends on having struct module available given "(*probed_mod)->state".

Ah, indeed. We need a module_state() function to avoid it.

> >
> > It is non-existent unless CONFIG_MODULES is set given how things are
> > flagged in include/linux/module.h.
> 
> Hey, noticed kconfig issue.
> 
> According to kconfig-language.txt:
> 
> "select should be used with care. select will force a symbol to a value
> without visiting the dependencies."
> 
> So the problem here lies in the KPROBES config entry using a select statement
> to pick ALLOC_EXECMEM. It will not take the depends on statement into
> account and thus will allow selecting kprobes without any allocator in
> place.

OK, in that case "depends on" is good.

> 
> So to address this I'd suggest using a depends on statement also for
> describing the relation between KPROBES and ALLOC_EXECMEM. It does not make
> life worse than before for anyone, because even with the current kernel
> you have to select MODULES before you can move forward with kprobes.

Yeah, since ALLOC_EXECMEM is enabled by default.

Thank you!

> 
> BR, Jarkko


-- 
Masami Hiramatsu (Google) 



[PATCH] [v3] module: don't ignore sysfs_create_link() failures

2024-03-26 Thread Arnd Bergmann
From: Arnd Bergmann 

The sysfs_create_link() return code is marked as __must_check, but the
module_add_driver() function tries hard to not care, by assigning the
return code to a variable. When building with 'make W=1', gcc still
warns because this variable is only assigned but not used:

drivers/base/module.c: In function 'module_add_driver':
drivers/base/module.c:36:6: warning: variable 'no_warn' set but not used 
[-Wunused-but-set-variable]

Rework the code to properly unwind and return the error code to the
caller. My reading of the original code was that it tries to
not fail when the links already exist, so keep ignoring -EEXIST
errors.

Cc: Luis Chamberlain 
Cc: linux-modu...@vger.kernel.org
Cc: Greg Kroah-Hartman 
Cc: "Rafael J. Wysocki" 
Fixes: e17e0f51aeea ("Driver core: show drivers in /sys/module/")
See-also: 4a7fb6363f2d ("add __must_check to device management code")
Signed-off-by: Arnd Bergmann 
---
v3: make error handling stricter, add unwinding,
 fix build fail with CONFIG_MODULES=n
v2: rework to actually handle the error. I have not tested the
error handling beyond build testing, so please review carefully.
---
 drivers/base/base.h   |  9 ++---
 drivers/base/bus.c|  9 -
 drivers/base/module.c | 42 +++---
 3 files changed, 45 insertions(+), 15 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 0738ccad08b2..db4f910e8e36 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -192,11 +192,14 @@ extern struct kset *devices_kset;
 void devices_kset_move_last(struct device *dev);
 
 #if defined(CONFIG_MODULES) && defined(CONFIG_SYSFS)
-void module_add_driver(struct module *mod, struct device_driver *drv);
+int module_add_driver(struct module *mod, struct device_driver *drv);
 void module_remove_driver(struct device_driver *drv);
 #else
-static inline void module_add_driver(struct module *mod,
-struct device_driver *drv) { }
+static inline int module_add_driver(struct module *mod,
+   struct device_driver *drv)
+{
+   return 0;
+}
 static inline void module_remove_driver(struct device_driver *drv) { }
 #endif
 
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index daee55c9b2d9..ffea0728b8b2 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -674,7 +674,12 @@ int bus_add_driver(struct device_driver *drv)
if (error)
goto out_del_list;
}
-   module_add_driver(drv->owner, drv);
+   error = module_add_driver(drv->owner, drv);
+   if (error) {
+   printk(KERN_ERR "%s: failed to create module links for %s\n",
+   __func__, drv->name);
+   goto out_detach;
+   }
 
 	error = driver_create_file(drv, &driver_attr_uevent);
if (error) {
@@ -699,6 +704,8 @@ int bus_add_driver(struct device_driver *drv)
 
return 0;
 
+out_detach:
+   driver_detach(drv);
 out_del_list:
 	klist_del(&drv->knode_bus);
 out_unregister:
diff --git a/drivers/base/module.c b/drivers/base/module.c
index 46ad4d636731..d16b5c8e5473 100644
--- a/drivers/base/module.c
+++ b/drivers/base/module.c
@@ -30,14 +30,14 @@ static void module_create_drivers_dir(struct module_kobject *mk)
mutex_unlock(_dir_mutex);
 }
 
-void module_add_driver(struct module *mod, struct device_driver *drv)
+int module_add_driver(struct module *mod, struct device_driver *drv)
 {
char *driver_name;
-   int no_warn;
+   int ret;
struct module_kobject *mk = NULL;
 
if (!drv)
-   return;
+   return 0;
 
if (mod)
mk = >mkobj;
@@ -56,17 +56,37 @@ void module_add_driver(struct module *mod, struct device_driver *drv)
}
 
if (!mk)
-   return;
+   return 0;
+
+   ret = sysfs_create_link(&drv->p->kobj, &mk->kobj, "module");
+   if (ret)
+   return ret;
 
-   /* Don't check return codes; these calls are idempotent */
-   no_warn = sysfs_create_link(&drv->p->kobj, &mk->kobj, "module");
driver_name = make_driver_name(drv);
-   if (driver_name) {
-   module_create_drivers_dir(mk);
-   no_warn = sysfs_create_link(mk->drivers_dir, &drv->p->kobj,
-   driver_name);
-   kfree(driver_name);
+   if (!driver_name) {
+   ret = -ENOMEM;
+   goto out;
+   }
+
+   module_create_drivers_dir(mk);
+   if (!mk->drivers_dir) {
+   ret = -EINVAL;
+   goto out;
}
+
+   ret = sysfs_create_link(mk->drivers_dir, &drv->p->kobj, driver_name);
+   if (ret)
+   goto out;
+
+   kfree(driver_name);
+
+   return 0;
+out:
+   sysfs_remove_link(&drv->p->kobj, "module");
+   sysfs_remove_link(mk->drivers_dir, driver_name);
+   kfree(driver_name);
+
+   return ret;
 }
 
 void 

[PATCH 11/12] [v4] kallsyms: rework symbol lookup return codes

2024-03-26 Thread Arnd Bergmann
From: Arnd Bergmann 

Building with W=1 in some configurations produces a false positive
warning for kallsyms:

kernel/kallsyms.c: In function '__sprint_symbol.isra':
kernel/kallsyms.c:503:17: error: 'strcpy' source argument is the same as 
destination [-Werror=restrict]
  503 | strcpy(buffer, name);
  | ^~~~

This originally showed up while building with -O3, but later started
happening in other configurations as well, depending on inlining
decisions. The underlying issue is that the local 'name' variable is
always initialized to be the same as 'buffer' in the called functions
that fill the buffer, which gcc notices while inlining, though it fails
to see that the address check always skips the copy.

The calling conventions here are rather unusual, as all of the internal
lookup functions (bpf_address_lookup, ftrace_mod_address_lookup,
ftrace_func_address_lookup, module_address_lookup and
kallsyms_lookup_buildid) already use the provided buffer and either return
the address of that buffer to indicate success, or NULL for failure,
but the callers are written to also expect an arbitrary other buffer
to be returned.

Rework the calling conventions to return the length of the filled buffer
instead of its address, which is simpler and easier to follow as well
as avoiding the warning. Leave only the kallsyms_lookup() calling conventions
unchanged, since that is called from 16 different functions and
adapting this would be a much bigger change.

Link: https://lore.kernel.org/all/20200107214042.855757-1-a...@arndb.de/
Reviewed-by: Luis Chamberlain 
Acked-by: Steven Rostedt (Google) 
Signed-off-by: Arnd Bergmann 
---
v4: fix string length
v3: use strscpy() instead of strlcpy()
v2: complete rewrite after the first patch was rejected (in 2020). This
is now one of only two warnings that are in the way of enabling
-Wextra/-Wrestrict by default.
---
 include/linux/filter.h   | 14 +++---
 include/linux/ftrace.h   |  6 +++---
 include/linux/module.h   | 14 +++---
 kernel/bpf/core.c        |  7 +++
 kernel/kallsyms.c        | 23 ---
 kernel/module/kallsyms.c | 26 +-
 kernel/trace/ftrace.c    | 13 +
 7 files changed, 50 insertions(+), 53 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c99bc3df2d28..9d4a7c6f023e 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1168,18 +1168,18 @@ static inline bool bpf_jit_kallsyms_enabled(void)
return false;
 }
 
-const char *__bpf_address_lookup(unsigned long addr, unsigned long *size,
+int __bpf_address_lookup(unsigned long addr, unsigned long *size,
 unsigned long *off, char *sym);
 bool is_bpf_text_address(unsigned long addr);
 int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
char *sym);
 struct bpf_prog *bpf_prog_ksym_find(unsigned long addr);
 
-static inline const char *
+static inline int
 bpf_address_lookup(unsigned long addr, unsigned long *size,
   unsigned long *off, char **modname, char *sym)
 {
-   const char *ret = __bpf_address_lookup(addr, size, off, sym);
+   int ret = __bpf_address_lookup(addr, size, off, sym);
 
if (ret && modname)
*modname = NULL;
@@ -1223,11 +1223,11 @@ static inline bool bpf_jit_kallsyms_enabled(void)
return false;
 }
 
-static inline const char *
+static inline int
 __bpf_address_lookup(unsigned long addr, unsigned long *size,
 unsigned long *off, char *sym)
 {
-   return NULL;
+   return 0;
 }
 
 static inline bool is_bpf_text_address(unsigned long addr)
@@ -1246,11 +1246,11 @@ static inline struct bpf_prog *bpf_prog_ksym_find(unsigned long addr)
return NULL;
 }
 
-static inline const char *
+static inline int
 bpf_address_lookup(unsigned long addr, unsigned long *size,
   unsigned long *off, char **modname, char *sym)
 {
-   return NULL;
+   return 0;
 }
 
 static inline void bpf_prog_kallsyms_add(struct bpf_prog *fp)
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 54d53f345d14..56834a3fa9be 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -87,15 +87,15 @@ struct ftrace_direct_func;
 
 #if defined(CONFIG_FUNCTION_TRACER) && defined(CONFIG_MODULES) && \
defined(CONFIG_DYNAMIC_FTRACE)
-const char *
+int
 ftrace_mod_address_lookup(unsigned long addr, unsigned long *size,
   unsigned long *off, char **modname, char *sym);
 #else
-static inline const char *
+static inline int
 ftrace_mod_address_lookup(unsigned long addr, unsigned long *size,
   unsigned long *off, char **modname, char *sym)
 {
-   return NULL;
+   return 0;
 }
 #endif
 
diff --git a/include/linux/module.h b/include/linux/module.h
index 1153b0d99a80..118c36366b35 100644
--- a/include/linux/module.h
+++ 

Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Mark Rutland
Hi Masami,

On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> Hi Jarkko,
> 
> On Sun, 24 Mar 2024 01:29:08 +0200
> Jarkko Sakkinen  wrote:
> 
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible due to the kernel module allocator dependency.
> > 
> > Address the issue by allowing architectures to implement module_alloc()
> > and module_memfree() independent of the module subsystem. An arch tree
> > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> > 
> > Realize the feature on RISC-V by separating allocator to module_alloc.c
> > and implementing module_memfree().
> 
> Even though this involves changes in the arch-independent part, it should
> be solved in a generic way. Did you check Calvin's thread?
> 
> https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
> 
> I think we'd better introduce `alloc_execmem()`,
> CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM first
> 
>   config HAVE_ALLOC_EXECMEM
>   bool
> 
>   config ALLOC_EXECMEM
>   bool "Executable trampoline memory allocation"
>   depends on MODULES || HAVE_ALLOC_EXECMEM
> 
> And define fallback macro to module_alloc() like this.
> 
> #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> #define alloc_execmem(size, gfp)  module_alloc(size)
> #endif

Please can we *not* do this? I think this is abstracting at the wrong level (as
I mentioned on the prior execmem proposals).

Different executable allocations can have different requirements. For example,
on arm64 modules need to be within 2G of the kernel image, but the kprobes XOL
areas can be anywhere in the kernel VA space.

Forcing those behind the same interface makes things *harder* for architectures
and/or makes the common code more complicated (if that ends up having to track
all those different requirements). From my PoV it'd be much better to have
separate kprobes_alloc_*() functions for kprobes which an architecture can then
choose to implement using a common library if it wants to.
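
As a rough sketch of the shape I mean (the names are illustrative, not
taken from the branch below; the body mirrors arm64's current
alloc_insn_page()):

    /* kprobe-specific hooks, decoupled from module_alloc() */
    void *kprobes_alloc_insn_page(void)
    {
        /* arm64 XOL pages can live anywhere in the vmalloc space */
        return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
                                    GFP_KERNEL, PAGE_KERNEL_ROX,
                                    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
                                    __builtin_return_address(0));
    }

    void kprobes_free_insn_page(void *page)
    {
        vfree(page);
    }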

I took a look at doing that using the core ifdeffery fixups from Jarkko's v6,
and it looks pretty clean to me (and works in testing on arm64):

  
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules

Could we please start with that approach, with kprobe-specific alloc/free code
provided by the architecture?

Thanks,
Mark.



Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-26 Thread Mathieu Poirier
On Mon, 25 Mar 2024 at 11:13, Abdellatif El Khlifi
 wrote:
>
> Hi Mathieu,
>
> > > > > > > > > This is an initial patchset for allowing to turn on and off 
> > > > > > > > > the remote processor.
> > > > > > > > > The FW is already loaded before the Corstone-1000 SoC is 
> > > > > > > > > powered on and this
> > > > > > > > > is done through the FPGA board bootloader in case of the FPGA 
> > > > > > > > > target. Or by the Corstone-1000 FVP model
> > > > > > > > > (emulator).
> > > > > > > > >
> > > > > > > > From the above I take it that booting with a preloaded
> > > > > > > > firmware is a
> > > > > > > > scenario that needs to be supported and not just a temporary 
> > > > > > > > stage.
> > > > > > >
> > > > > > > The current status of the Corstone-1000 SoC requires that there is
> > > > > > > a preloaded firmware for the external core. Preloading is done 
> > > > > > > externally
> > > > > > > either through the FPGA bootloader or the emulator (FVP) before 
> > > > > > > powering
> > > > > > > on the SoC.
> > > > > > >
> > > > > >
> > > > > > Ok
> > > > > >
> > > > > > > Corstone-1000 will be upgraded in a way that the A core running 
> > > > > > > Linux is able
> > > > > > > to share memory with the remote core and also being able to 
> > > > > > > access the remote
> > > > > > > > core memory so Linux can copy the firmware to it. These HW changes
> > > > > > > > are still in progress.
> > > > > > > This is why this patchset is relying on a preloaded firmware. And 
> > > > > > > it's the step 1
> > > > > > > of adding remoteproc support for Corstone.
> > > > > > >
> > > > > >
> > > > > > Ok, so there is a HW problem where A core and M core can't see each 
> > > > > > other's
> > > > > > memory, preventing the A core from copying the firmware image to 
> > > > > > the proper
> > > > > > location.
> > > > > >
> > > > > > When the HW is fixed, will there be a need to support scenarios 
> > > > > > where the
> > > > > > firmware image has been preloaded into memory?
> > > > >
> > > > > No, this scenario won't apply when we get the HW upgrade. No need for 
> > > > > an
> > > > > external entity anymore. The firmware(s) will all be files in the 
> > > > > linux filesystem.
> > > > >
> > > >
> > > > Very well.  I am willing to continue with this driver but it does so 
> > > > little that
> > > > I wonder if it wouldn't simply be better to move forward with 
> > > > upstreaming when
> > > > the HW is fixed.  The choice is yours.
> > > >
> > >
> > > I think Robin has raised a few points that need clarification. I think
> > > it was done as part of the DT binding patch. I share those concerns and
> > > I wanted to reach the same concerns through the questions I asked on the
> > > corstone device tree changes.
> > >
> >
> > I also agree with Robin's point of view.  Proceeding with an initial
> > driver with minimal functionality doesn't preclude having complete
> > bindings.  But that said and as I pointed out, it might be better to
> > wait for the HW to be fixed before moving forward.
>
> We checked with the HW teams. The missing features will be implemented but
> this will take time.
>
> The foundation driver as it is right now is still valuable for people wanting 
> to
> know how to power control Corstone external systems in a future proof manner
> (even in the incomplete state). We prefer to address all the review comments
> made so it can be merged. This includes making the DT binding as complete as
> possible as you advised. Then, once the HW is ready, I'll implement the comms
> and the FW reload part. Is that OK please ?
>

I'm in agreement with that plan as long as we agree the current
preloaded heuristic is temporary and is not a valid long term
scenario.

> Cheers,
> Abdellatif



[PATCH v9 3/3] remoteproc: qcom: Remove minidump related data from qcom_common.c

2024-03-26 Thread Mukesh Ojha
As the minidump-specific data structures and functions have moved under
config QCOM_RPROC_MINIDUMP, remove the minidump-specific data from
drivers/remoteproc/qcom_common.c.

Signed-off-by: Mukesh Ojha 
---
Changes in v9:
 - Change in patch order.
 - rebased it.

v8: https://lore.kernel.org/lkml/20240131105734.13090-1-quic_mo...@quicinc.com/
v7: https://lore.kernel.org/lkml/20240109153200.12848-1-quic_mo...@quicinc.com/
v6: 
https://lore.kernel.org/lkml/1700864395-1479-1-git-send-email-quic_mo...@quicinc.com/
v5: 
https://lore.kernel.org/lkml/1694429639-21484-1-git-send-email-quic_mo...@quicinc.com/
v4: 
https://lore.kernel.org/lkml/1687955688-20809-1-git-send-email-quic_mo...@quicinc.com/

 drivers/remoteproc/qcom_common.c | 160 ---
 1 file changed, 160 deletions(-)

diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c
index 03e5f5d533eb..085fd73fa23a 100644
--- a/drivers/remoteproc/qcom_common.c
+++ b/drivers/remoteproc/qcom_common.c
@@ -17,7 +17,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "remoteproc_internal.h"
 #include "qcom_common.h"
@@ -26,61 +25,6 @@
 #define to_smd_subdev(d) container_of(d, struct qcom_rproc_subdev, subdev)
 #define to_ssr_subdev(d) container_of(d, struct qcom_rproc_ssr, subdev)
 
-#define MAX_NUM_OF_SS   10
-#define MAX_REGION_NAME_LENGTH  16
-#define SBL_MINIDUMP_SMEM_ID   602
-#define MINIDUMP_REGION_VALID  ('V' << 24 | 'A' << 16 | 'L' << 8 | 'I' 
<< 0)
-#define MINIDUMP_SS_ENCR_DONE  ('D' << 24 | 'O' << 16 | 'N' << 8 | 'E' 
<< 0)
-#define MINIDUMP_SS_ENABLED('E' << 24 | 'N' << 16 | 'B' << 8 | 'L' 
<< 0)
-
-/**
- * struct minidump_region - Minidump region
- * @name   : Name of the region to be dumped
- * @seq_num:   : Use to differentiate regions with same name.
- * @valid  : This entry to be dumped (if set to 1)
- * @address: Physical address of region to be dumped
- * @size   : Size of the region
- */
-struct minidump_region {
-   charname[MAX_REGION_NAME_LENGTH];
-   __le32  seq_num;
-   __le32  valid;
-   __le64  address;
-   __le64  size;
-};
-
-/**
- * struct minidump_subsystem - Subsystem's SMEM Table of content
- * @status : Subsystem toc init status
- * @enabled : if set to 1, this region would be copied during coredump
- * @encryption_status: Encryption status for this subsystem
- * @encryption_required : Decides to encrypt the subsystem regions or not
- * @region_count : Number of regions added in this subsystem toc
- * @regions_baseptr : regions base pointer of the subsystem
- */
-struct minidump_subsystem {
-   __le32  status;
-   __le32  enabled;
-   __le32  encryption_status;
-   __le32  encryption_required;
-   __le32  region_count;
-   __le64  regions_baseptr;
-};
-
-/**
- * struct minidump_global_toc - Global Table of Content
- * @status : Global Minidump init status
- * @md_revision : Minidump revision
- * @enabled : Minidump enable status
- * @subsystems : Array of subsystems toc
- */
-struct minidump_global_toc {
-   __le32  status;
-   __le32  md_revision;
-   __le32  enabled;
-   struct minidump_subsystem   subsystems[MAX_NUM_OF_SS];
-};
-
 struct qcom_ssr_subsystem {
const char *name;
struct srcu_notifier_head notifier_list;
@@ -90,110 +34,6 @@ struct qcom_ssr_subsystem {
 static LIST_HEAD(qcom_ssr_subsystem_list);
 static DEFINE_MUTEX(qcom_ssr_subsys_lock);
 
-static void qcom_minidump_cleanup(struct rproc *rproc)
-{
-   struct rproc_dump_segment *entry, *tmp;
-
-   list_for_each_entry_safe(entry, tmp, &rproc->dump_segments, node) {
-   list_del(&entry->node);
-   kfree(entry->priv);
-   kfree(entry);
-   }
-}
-
-static int qcom_add_minidump_segments(struct rproc *rproc, struct 
minidump_subsystem *subsystem,
-   void (*rproc_dumpfn_t)(struct rproc *rproc, struct 
rproc_dump_segment *segment,
-   void *dest, size_t offset, size_t size))
-{
-   struct minidump_region __iomem *ptr;
-   struct minidump_region region;
-   int seg_cnt, i;
-   dma_addr_t da;
-   size_t size;
-   char *name;
-
-   if (WARN_ON(!list_empty(&rproc->dump_segments))) {
-   dev_err(&rproc->dev, "dump segment list already populated\n");
-   return -EUCLEAN;
-   }
-
-   seg_cnt = le32_to_cpu(subsystem->region_count);
-   ptr = ioremap((unsigned long)le64_to_cpu(subsystem->regions_baseptr),
- seg_cnt * sizeof(struct minidump_region));
-   if (!ptr)
-   return -EFAULT;
-
-   for (i = 0; i < seg_cnt; i++) {
-   memcpy_fromio(&region, ptr + i, sizeof(region));
-   if (le32_to_cpu(region.valid) == MINIDUMP_REGION_VALID) {
-   name = kstrndup(region.name, 

[PATCH v9 2/3] remoteproc: qcom_q6v5_pas: Use qcom_rproc_minidump()

2024-03-26 Thread Mukesh Ojha
Now that all the minidump-specific data structures have moved to
minidump-specific files, and qcom_rproc_minidump() and qcom_minidump()
are implementation-wise exactly the same, the name qcom_rproc_minidump
makes more sense as it collects the minidump for the remoteproc
processors. So let's use qcom_rproc_minidump(); qcom_minidump() and the
minidump-related code will then be removed from
drivers/remoteproc/qcom_common.c.

Signed-off-by: Mukesh Ojha 
---
Changes in v9:
 - Change in patch order from its last version.
 - Rebased it.

v8: https://lore.kernel.org/lkml/20240131105734.13090-1-quic_mo...@quicinc.com/
v7: https://lore.kernel.org/lkml/20240109153200.12848-1-quic_mo...@quicinc.com/
v6: 
https://lore.kernel.org/lkml/1700864395-1479-1-git-send-email-quic_mo...@quicinc.com/
v5: 
https://lore.kernel.org/lkml/1694429639-21484-1-git-send-email-quic_mo...@quicinc.com/
v4: 
https://lore.kernel.org/lkml/1687955688-20809-1-git-send-email-quic_mo...@quicinc.com/

 drivers/remoteproc/Kconfig | 1 +
 drivers/remoteproc/qcom_q6v5_pas.c | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
index 48845dc8fa85..cea960749e2c 100644
--- a/drivers/remoteproc/Kconfig
+++ b/drivers/remoteproc/Kconfig
@@ -166,6 +166,7 @@ config QCOM_PIL_INFO
 
 config QCOM_RPROC_COMMON
tristate
+   select QCOM_RPROC_MINIDUMP
 
 config QCOM_Q6V5_COMMON
tristate
diff --git a/drivers/remoteproc/qcom_q6v5_pas.c 
b/drivers/remoteproc/qcom_q6v5_pas.c
index 54d8005d40a3..b39f87dfd9c0 100644
--- a/drivers/remoteproc/qcom_q6v5_pas.c
+++ b/drivers/remoteproc/qcom_q6v5_pas.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include <soc/qcom/qcom_minidump.h>
 
 #include "qcom_common.h"
 #include "qcom_pil_info.h"
@@ -141,7 +142,7 @@ static void adsp_minidump(struct rproc *rproc)
if (rproc->dump_conf == RPROC_COREDUMP_DISABLED)
return;
 
-   qcom_minidump(rproc, adsp->minidump_id, adsp_segment_dump);
+   qcom_rproc_minidump(rproc, adsp->minidump_id, adsp_segment_dump);
 }
 
 static int adsp_pds_enable(struct qcom_adsp *adsp, struct device **pds,
-- 
2.7.4




[PATCH v9 1/3] soc: qcom: Add qcom_rproc_minidump module

2024-03-26 Thread Mukesh Ojha
Add a qcom_rproc_minidump module in preparation for removing
minidump-specific code from drivers/remoteproc/qcom_common.c
and for providing the needed exported API; this also helps to
abstract the minidump-specific data layout from Qualcomm's
remoteproc driver.

It is just a copy of the qcom_minidump() functionality from
drivers/remoteproc/qcom_common.c into a separate file under
qcom_rproc_minidump().

Signed-off-by: Mukesh Ojha 
---
Changes in v9:
 - Added the copyright of the source file drivers/remoteproc/qcom_common.c
   to qcom_rproc_minidump.c
 - Dissociated it from the minidump series, as this can go separately and
   the minidump series can take a dependency on the data structure files.

Not much has changed in these three patches from the previous version;
however, the links to their older versions are given below.

v8: https://lore.kernel.org/lkml/20240131105734.13090-1-quic_mo...@quicinc.com/
v7: https://lore.kernel.org/lkml/20240109153200.12848-1-quic_mo...@quicinc.com/
v6: 
https://lore.kernel.org/lkml/1700864395-1479-1-git-send-email-quic_mo...@quicinc.com/
v5: 
https://lore.kernel.org/lkml/1694429639-21484-1-git-send-email-quic_mo...@quicinc.com/
v4: 
https://lore.kernel.org/lkml/1687955688-20809-1-git-send-email-quic_mo...@quicinc.com/

 drivers/soc/qcom/Kconfig  |  10 +++
 drivers/soc/qcom/Makefile |   1 +
 drivers/soc/qcom/qcom_minidump_internal.h |  64 +
 drivers/soc/qcom/qcom_rproc_minidump.c| 115 ++
 include/soc/qcom/qcom_minidump.h  |  23 ++
 5 files changed, 213 insertions(+)
 create mode 100644 drivers/soc/qcom/qcom_minidump_internal.h
 create mode 100644 drivers/soc/qcom/qcom_rproc_minidump.c
 create mode 100644 include/soc/qcom/qcom_minidump.h

diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig
index 5af33b0e3470..ed23e0275c22 100644
--- a/drivers/soc/qcom/Kconfig
+++ b/drivers/soc/qcom/Kconfig
@@ -277,4 +277,14 @@ config QCOM_PBS
  This module provides the APIs to the client drivers that wants to 
send the
  PBS trigger event to the PBS RAM.
 
+config QCOM_RPROC_MINIDUMP
+   tristate "QCOM Remoteproc Minidump Support"
+   depends on ARCH_QCOM || COMPILE_TEST
+   depends on QCOM_SMEM
+   help
+ Enablement of the core Minidump feature is controlled from the boot
+ firmware side, so if it is enabled from firmware, this config allows
+ Linux to query the predefined Minidump segments associated with the
+ remote processor, check their validity, and collect the dump when the
+ remote processor crashes during its recovery.
 endmenu
diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
index ca0bece0dfff..44664589263d 100644
--- a/drivers/soc/qcom/Makefile
+++ b/drivers/soc/qcom/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_QCOM_ICC_BWMON)  += icc-bwmon.o
 qcom_ice-objs  += ice.o
 obj-$(CONFIG_QCOM_INLINE_CRYPTO_ENGINE)+= qcom_ice.o
 obj-$(CONFIG_QCOM_PBS) +=  qcom-pbs.o
+obj-$(CONFIG_QCOM_RPROC_MINIDUMP)  += qcom_rproc_minidump.o
diff --git a/drivers/soc/qcom/qcom_minidump_internal.h 
b/drivers/soc/qcom/qcom_minidump_internal.h
new file mode 100644
index ..71709235b196
--- /dev/null
+++ b/drivers/soc/qcom/qcom_minidump_internal.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved.
+ */
+
+#ifndef _QCOM_MINIDUMP_INTERNAL_H_
+#define _QCOM_MINIDUMP_INTERNAL_H_
+
+#define MAX_NUM_OF_SS   10
+#define MAX_REGION_NAME_LENGTH  16
+#define SBL_MINIDUMP_SMEM_ID   602
+#define MINIDUMP_REGION_VALID ('V' << 24 | 'A' << 16 | 'L' << 8 | 'I' << 0)
+#define MINIDUMP_SS_ENCR_DONE ('D' << 24 | 'O' << 16 | 'N' << 8 | 'E' << 0)
+#define MINIDUMP_SS_ENABLED   ('E' << 24 | 'N' << 16 | 'B' << 8 | 'L' << 0)
+
+/**
+ * struct minidump_region - Minidump region
+ * @name   : Name of the region to be dumped
+ * @seq_num:   : Use to differentiate regions with same name.
+ * @valid  : This entry to be dumped (if set to 1)
+ * @address: Physical address of region to be dumped
+ * @size   : Size of the region
+ */
+struct minidump_region {
+   charname[MAX_REGION_NAME_LENGTH];
+   __le32  seq_num;
+   __le32  valid;
+   __le64  address;
+   __le64  size;
+};
+
+/**
+ * struct minidump_subsystem - Subsystem's SMEM Table of content
+ * @status : Subsystem toc init status
+ * @enabled : if set to 1, this region would be copied during coredump
+ * @encryption_status: Encryption status for this subsystem
+ * @encryption_required : Decides to encrypt the subsystem regions or not
+ * @region_count : Number of regions added in this subsystem toc
+ * @regions_baseptr : regions base pointer of the subsystem
+ */
+struct minidump_subsystem {
+   __le32  status;
+   __le32  enabled;
+   __le32  encryption_status;
+   __le32  encryption_required;
+   __le32  

Re: [PATCH] [v2] module: don't ignore sysfs_create_link() failures

2024-03-26 Thread Arnd Bergmann
On Sat, Mar 23, 2024, at 17:50, Greg Kroah-Hartman wrote:
> On Fri, Mar 22, 2024 at 06:39:11PM +0100, Arnd Bergmann wrote:
>> diff --git a/drivers/base/bus.c b/drivers/base/bus.c
>> index daee55c9b2d9..7ef75b60d331 100644
>> --- a/drivers/base/bus.c
>> +++ b/drivers/base/bus.c
>> @@ -674,7 +674,12 @@ int bus_add_driver(struct device_driver *drv)
>>  if (error)
>>  goto out_del_list;
>>  }
>> -module_add_driver(drv->owner, drv);
>> +error = module_add_driver(drv->owner, drv);
>> +if (error) {
>> +printk(KERN_ERR "%s: failed to create module links for %s\n",
>> +__func__, drv->name);
>> +goto out_del_list;
>
> Don't we need to walk back the driver_attach() call here if this fails?

Yes, fixed now. There are still some other calls right after
it that print an error but don't cause bus_add_driver() to fail
though. We may want to add similar unwinding there, but that
feels like it should be a separate patch.
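
The v3 unwinding looks roughly like this (a sketch; the out_detach label
is my naming, not necessarily what ends up in the patch):

    error = module_add_driver(drv->owner, drv);
    if (error) {
        printk(KERN_ERR "%s: failed to create module links for %s\n",
               __func__, drv->name);
        goto out_detach;
    }
    ...
 out_detach:
    driver_detach(drv);
 out_del_list:
    klist_del(&priv->knode_bus);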

>>  
>>  if (!mk)
>> -return;
>> +return 0;
>> +
>> +ret = sysfs_create_link(&drv->p->kobj, &mk->kobj, "module");
>> +if (ret && ret != -EEXIST)
>
> Why would EEXIST happen here?  How can this be called twice?
>

My impression was that the lack of error handling and the
comment were about a case where that might happen
intentionally. I've removed it now as I couldn't find any
evidence that this is really needed. I suppose we would
find out in testing if we do.

 Arnd



Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Alexandre Ghiti

Hi Jarkko,

On 25/03/2024 22:55, Jarkko Sakkinen wrote:

Tracing with kprobes while running a monolithic kernel is currently
impossible due to the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v5:
- No changes, except removing the alloc_execmem() call which should have
   been part of the previous patch.
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
   this patch set now.
v2:
- Better late than never right? :-)
- Focus only on RISC-V for now to make the patch more digestible. This
   is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
  arch/riscv/Kconfig  |  1 +
  arch/riscv/kernel/Makefile  |  3 +++
  arch/riscv/kernel/execmem.c | 22 ++
  3 files changed, 26 insertions(+)
  create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
  
  obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o

  obj-$(CONFIG_MODULES) += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
  obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o
  
  obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o

diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}



The __vmalloc_node_range() line ^^ must be from an old kernel since we 
added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix 
module_alloc() that did not reset the linear mapping permissions").


In addition, I guess module_alloc() should now use alloc_execmem() right?



+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}



I remember Mike Rapoport sent a patchset to introduce an API for 
executable memory allocation 
(https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/), 
how does this intersect with your work? I don't know the status of his 
patchset though.


Thanks,

Alex




[PATCH v7 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n

2024-03-26 Thread Jarkko Sakkinen
Tracing with kprobes while running a monolithic kernel is currently
impossible due to the kernel module allocator dependency.

Address the issue by implementing textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal 
stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ 
# continuation
Signed-off-by: Jarkko Sakkinen 
---
v5-v7:
- No changes.
v4:
- Include linux/execmem.h.
v3:
- Architecture independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for
  this patch set now.
v2:
- Better late than never right? :-)
- Focus only on RISC-V for now to make the patch more digestible. This
  is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/riscv/Kconfig  |  1 +
 arch/riscv/kernel/Makefile  |  3 +++
 arch/riscv/kernel/execmem.c | 22 ++
 3 files changed, 26 insertions(+)
 create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
select HAVE_KPROBES if !XIP_KERNEL
select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
select HAVE_KRETPROBES if !XIP_KERNEL
+   select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
# https://github.com/ClangBuiltLinux/linux/issues/1881
select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)  += module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y  += execmem.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)  += module-sections.o
 
 obj-$(CONFIG_CPU_PM)   += suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include 
+#include 
+#include 
+#include 
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+   return __vmalloc_node_range(size, 1, MODULES_VADDR,
+   MODULES_END, GFP_KERNEL,
+   PAGE_KERNEL, 0, NUMA_NO_NODE,
+   __builtin_return_address(0));
+}
+
+void free_execmem(void *region)
+{
+   if (in_interrupt())
+   pr_warn("In interrupt context: vmalloc may not work.\n");
+
+   vfree(region);
+}
-- 
2.44.0




[PATCH v7 1/2] kprobes: Implement trampoline memory allocator for tracing

2024-03-26 Thread Jarkko Sakkinen
Tracing with kprobes while running a monolithic kernel is currently
impossible because CONFIG_KPROBES depends on CONFIG_MODULES.

Introduce alloc_execmem() and free_execmem() for allocating executable
memory. If an arch implements these functions, it can mark this up with
the HAVE_ALLOC_EXECMEM kconfig flag.

The second new kconfig flag is ALLOC_EXECMEM, which can be selected if
either MODULES is selected or HAVE_ALLOC_EXECMEM is supported by the arch. If
HAVE_ALLOC_EXECMEM is not supported by an arch, module_alloc() and
module_memfree() are used as a fallback, thus retaining backwards
compatibility to earlier kernel versions.

This will allow architectures to enable kprobe tracing without requiring
module support to be enabled.

The support can be implemented with four easy steps:

1. Implement alloc_execmem().
2. Implement free_execmem().
3. Edit arch/<arch>/Makefile.
4. Set HAVE_ALLOC_EXECMEM in arch/<arch>/Kconfig.

Link: 
https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
Suggested-by: Masami Hiramatsu 
Signed-off-by: Jarkko Sakkinen 
---
v7:
- Use "depends on" for ALLOC_EXECMEM instead of "select"
- Reduced and narrowed CONFIG_MODULES checks further in kprobes.c.
v6:
- Use null pointer for notifiers and register the module notifier only if
  IS_ENABLED(CONFIG_MODULES) is set.
- Fixed typo in the commit message and wrote more verbose description
  of the feature.
v5:
- alloc_execmem() was missing GFP_KERNEL parameter. The patch set did
  compile because 2/2 had the fixup (leaked there when rebasing the
  patch set).
v4:
- Squashed a couple of unrequired CONFIG_MODULES checks.
- See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
v3:
- A new patch added.
- For IS_DEFINED() I need advice as I could not really find that many
  locations where it would be applicable.
---
 arch/Kconfig| 17 +++-
 include/linux/execmem.h | 13 +
 kernel/kprobes.c| 53 ++---
 kernel/trace/trace_kprobe.c | 15 +--
 4 files changed, 73 insertions(+), 25 deletions(-)
 create mode 100644 include/linux/execmem.h

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..5e9735f60f3c 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
+   depends on ALLOC_EXECMEM
select KALLSYMS
select TASKS_RCU if PREEMPTION
help
@@ -215,6 +215,21 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_ALLOC_EXECMEM
+   bool
+   help
+ Architectures that select this option are capable of allocating
+ trampoline executable memory for tracing subsystems, independently
+ of the kernel module subsystem.
+
+config ALLOC_EXECMEM
+   bool "Executable (trampoline) memory allocation"
+   default y
+   depends on MODULES || HAVE_ALLOC_EXECMEM
+   help
+ Select this for executable (trampoline) memory. It can be enabled when
+ either the module allocator or an arch-specific allocator is available.
+
 config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
bool
help
diff --git a/include/linux/execmem.h b/include/linux/execmem.h
new file mode 100644
index ..ae2ff151523a
--- /dev/null
+++ b/include/linux/execmem.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_EXECMEM_H
+#define _LINUX_EXECMEM_H
+
+#ifdef CONFIG_HAVE_ALLOC_EXECMEM
+void *alloc_execmem(unsigned long size, gfp_t gfp);
+void free_execmem(void *region);
+#else
+#define alloc_execmem(size, gfp)   module_alloc(size)
+#define free_execmem(region)   module_memfree(region)
+#endif
+
+#endif /* _LINUX_EXECMEM_H */
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index 9d9095e81792..13bef5de315c 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include <linux/execmem.h>
 
 #define KPROBE_HASH_BITS 6
 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
@@ -113,17 +114,17 @@ enum kprobe_slot_state {
 void __weak *alloc_insn_page(void)
 {
/*
-* Use module_alloc() so this page is within +/- 2GB of where the
+* Use alloc_execmem() so this page is within +/- 2GB of where the
 * kernel image and loaded module images reside. This is required
 * for most of the architectures.
 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
 */
-   return module_alloc(PAGE_SIZE);
+   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
 }
 
 static void free_insn_page(void *page)
 {
-   module_memfree(page);
+   free_execmem(page);
 }
 
 struct kprobe_insn_cache kprobe_insn_slots = {
@@ -1592,6 +1593,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
goto out;
}
 
+#ifdef CONFIG_MODULES
/*
 * If the module freed '.init.text', we couldn't 

Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-26 Thread Jason Xing
On Tue, Mar 26, 2024 at 9:18 PM Eric Dumazet  wrote:
>
> On Tue, Mar 26, 2024 at 11:44 AM Jason Xing  wrote:
>
> > Well, it's a pity that it seems that we are about to abandon this
> > method but it's not that friendly to the users who are unable to
> > deploy BPF...
>
> It is a pity these tracepoint patches are consuming a lot of reviewer
> time, just because
> some people 'can not deploy BPF'

Sure, not everyone can do this easily. The phenomenon still exists and
we cannot ignore it. Do you remember that about a month ago someone
submitted a patch introducing a new tracepoint, and I then replied
and asked you whether it's necessary that we replace most of the tracepoints
with BPF? Now I realise and accept the fact...

I'll keep reviewing such patches and hope it can give you maintainers
a break. I don't mind taking some time to do it, after all it's not a
bad thing to help some people.

>
> Well, I came up with more ideas about how to improve the
> > trace function in recent days. The motivation of doing this is that I
> > encountered some issues which could be traced/diagnosed by using trace
> > effortlessly without writing some bpftrace codes again and again. The
> > status of trace seems not active but many people are still using it, I
> > believe.
>
> 'Writing bpftrace codes again and again' is not a good reason to add
> maintenance costs
> to linux networking stack.

I'm just saying :)



Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-26 Thread Eric Dumazet
On Tue, Mar 26, 2024 at 11:44 AM Jason Xing  wrote:

> Well, it's a pity that it seems that we are about to abandon this
> method but it's not that friendly to the users who are unable to
> deploy BPF...

It is a pity these tracepoint patches are consuming a lot of reviewer
time, just because
some people 'can not deploy BPF'

Well, I came up with more ideas about how to improve the
> trace function in recent days. The motivation of doing this is that I
> encountered some issues which could be traced/diagnosed by using trace
> effortlessly without writing some bpftrace codes again and again. The
> status of trace seems not active but many people are still using it, I
> believe.

'Writing bpftrace codes again and again' is not a good reason to add
maintenance costs
to linux networking stack.



Re: [PATCH v5 1/2] kprobes: textmem API

2024-03-26 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 4:01 AM EET, Jarkko Sakkinen wrote:
> On Tue Mar 26, 2024 at 3:31 AM EET, Jarkko Sakkinen wrote:
> > > > +#endif /* _LINUX_EXECMEM_H */
> > > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > > index 9d9095e81792..87fd8c14a938 100644
> > > > --- a/kernel/kprobes.c
> > > > +++ b/kernel/kprobes.c
> > > > @@ -44,6 +44,7 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > +#include 
> > > >  
> > > >  #define KPROBE_HASH_BITS 6
> > > >  #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> > > > @@ -113,17 +114,17 @@ enum kprobe_slot_state {
> > > >  void __weak *alloc_insn_page(void)
> > > >  {
> > > > /*
> > > > -* Use module_alloc() so this page is within +/- 2GB of where 
> > > > the
> > > > +* Use alloc_execmem() so this page is within +/- 2GB of where 
> > > > the
> > > >  * kernel image and loaded module images reside. This is 
> > > > required
> > > >  * for most of the architectures.
> > > >  * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
> > > >  */
> > > > -   return module_alloc(PAGE_SIZE);
> > > > +   return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
> > > >  }
> > > >  
> > > >  static void free_insn_page(void *page)
> > > >  {
> > > > -   module_memfree(page);
> > > > +   free_execmem(page);
> > > >  }
> > > >  
> > > >  struct kprobe_insn_cache kprobe_insn_slots = {
> > > > @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct 
> > > > kprobe *p,
> > > > goto out;
> > > > }
> > > >  
> > > > +#ifdef CONFIG_MODULES
> > >
> > > You don't need this block, because these APIs have dummy functions.
> >
> > Hmm... I'll verify this tomorrow.
>
> It depends on having struct module available given "(*probed_mod)->state".
>
> It is non-existent unless CONFIG_MODULES is set given how things are
> flagged in include/linux/module.h.

Hey, I noticed a kconfig issue.

According to kconfig-language.txt:

"select should be used with care. select will force a symbol to a value
without visiting the dependencies."

So the problem here lies in the KPROBES config entry using a select
statement to pick ALLOC_EXECMEM. It will not take the depends on
statement into account and thus will allow selecting kprobes without any
allocator in place.

So to address this I'd suggest using a depends on statement also for
describing the relation between KPROBES and ALLOC_EXECMEM. It does not make
life worse than before for anyone, because even with the current kernel
you have to select MODULES before you can move forward with kprobes.
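
I.e., roughly (a sketch):

    config KPROBES
            bool "Kprobes"
            depends on HAVE_KPROBES
            depends on ALLOC_EXECMEM    # instead of "select ALLOC_EXECMEM"
            select KALLSYMS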

BR, Jarkko



[PATCH v2 2/2] ARM: dts: qcom: Add support for Motorola Moto G (2013)

2024-03-26 Thread Stanislav Jakubek
Add a device tree for the Motorola Moto G (2013) smartphone based
on the Qualcomm MSM8226 SoC.

Initially supported features:
  - Buttons (Volume Down/Up, Power)
  - eMMC
  - Hall Effect Sensor
  - SimpleFB display
  - TMP108 temperature sensor
  - Vibrator

Note: the dhob and shob reserved-memory regions are seemingly a part of some
Motorola specific (firmware?) mechanism, see [1].

[1] 
https://github.com/LineageOS/android_kernel_motorola_msm8226/blob/cm-14.1/Documentation/devicetree/bindings/misc/hob_ram.txt

Signed-off-by: Stanislav Jakubek 
---
Changes in V2:
  - split hob-ram reserved-memory region into dhob and shob
  - add a note and a link to downstream documentation with more
information about these regions

 arch/arm/boot/dts/qcom/Makefile   |   1 +
 .../boot/dts/qcom/msm8226-motorola-falcon.dts | 359 ++
 2 files changed, 360 insertions(+)
 create mode 100644 arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dts

diff --git a/arch/arm/boot/dts/qcom/Makefile b/arch/arm/boot/dts/qcom/Makefile
index 6478a39b3be5..3eacbf5c0785 100644
--- a/arch/arm/boot/dts/qcom/Makefile
+++ b/arch/arm/boot/dts/qcom/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 dtb-$(CONFIG_ARCH_QCOM) += \
+   msm8226-motorola-falcon.dtb \
qcom-apq8016-sbc.dtb \
qcom-apq8026-asus-sparrow.dtb \
qcom-apq8026-huawei-sturgeon.dtb \
diff --git a/arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dts 
b/arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dts
new file mode 100644
index ..029e1b1659c9
--- /dev/null
+++ b/arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dts
@@ -0,0 +1,359 @@
+// SPDX-License-Identifier: BSD-3-Clause
+
+/dts-v1/;
+
+#include "qcom-msm8226.dtsi"
+#include "pm8226.dtsi"
+
+/delete-node/ &smem_region;
+
+/ {
+   model = "Motorola Moto G (2013)";
+   compatible = "motorola,falcon", "qcom,msm8226";
+   chassis-type = "handset";
+
+   aliases {
+   mmc0 = &sdhc_1;
+   };
+
+   chosen {
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges;
+
+   framebuffer@3200000 {
+   compatible = "simple-framebuffer";
+   reg = <0x03200000 0x800000>;
+   width = <720>;
+   height = <1280>;
+   stride = <(720 * 3)>;
+   format = "r8g8b8";
+   vsp-supply = <&reg_lcd_pos>;
+   vsn-supply = <&reg_lcd_neg>;
+   vddio-supply = <&vddio_disp_vreg>;
+   };
+   };
+
+   gpio-keys {
+   compatible = "gpio-keys";
+
+   event-hall-sensor {
+   label = "Hall Effect Sensor";
+   gpios = <&tlmm 51 GPIO_ACTIVE_LOW>;
+   linux,input-type = <EV_SW>;
+   linux,code = <SW_LID>;
+   linux,can-disable;
+   };
+
+   key-volume-up {
+   label = "Volume Up";
+   gpios = <&tlmm 106 GPIO_ACTIVE_LOW>;
+   linux,code = <KEY_VOLUMEUP>;
+   debounce-interval = <15>;
+   };
+   };
+
+   vddio_disp_vreg: regulator-vddio-disp {
+   compatible = "regulator-fixed";
+   regulator-name = "vddio_disp";
+   gpio = <&tlmm 34 GPIO_ACTIVE_HIGH>;
+   vin-supply = <&pm8226_l8>;
+   startup-delay-us = <300>;
+   enable-active-high;
+   regulator-boot-on;
+   };
+
+   reserved-memory {
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges;
+
+   framebuffer@3200000 {
+   reg = <0x03200000 0x800000>;
+   no-map;
+   };
+
+   dhob@f500000 {
+   reg = <0x0f500000 0x40000>;
+   no-map;
+   };
+
+   shob@f540000 {
+   reg = <0x0f540000 0x2000>;
+   no-map;
+   };
+
+   smem_region: smem@fa00000 {
+   reg = <0x0fa00000 0x100000>;
+   no-map;
+   };
+
+   /* Actually <0x0fa00000 0x500000>, but first 0x100000 is smem */
+   reserved@fb00000 {
+   reg = <0x0fb00000 0x400000>;
+   no-map;
+   };
+   };
+};
+
+&blsp1_i2c3 {
+   status = "okay";
+
+   regulator@3e {
+   compatible = "ti,tps65132";
+   reg = <0x3e>;
+   pinctrl-0 = <&reg_lcd_default>;
+   pinctrl-names = "default";
+
+   reg_lcd_pos: outp {
+   regulator-name = "outp";
+   regulator-min-microvolt = <4000000>;
+   regulator-max-microvolt = <6000000>;
+   regulator-active-discharge = <1>;
+

[PATCH v2 1/2] dt-bindings: arm: qcom: Add Motorola Moto G (2013)

2024-03-26 Thread Stanislav Jakubek
Document the Motorola Moto G (2013), which is a smartphone based
on the Qualcomm MSM8226 SoC.

Acked-by: Krzysztof Kozlowski 
Signed-off-by: Stanislav Jakubek 
---
Changes in V2:
  - collect Krzysztof's A-b

 Documentation/devicetree/bindings/arm/qcom.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/arm/qcom.yaml 
b/Documentation/devicetree/bindings/arm/qcom.yaml
index 66beaac60e1d..d2910982ae86 100644
--- a/Documentation/devicetree/bindings/arm/qcom.yaml
+++ b/Documentation/devicetree/bindings/arm/qcom.yaml
@@ -137,6 +137,7 @@ properties:
   - microsoft,dempsey
   - microsoft,makepeace
   - microsoft,moneypenny
+  - motorola,falcon
   - samsung,s3ve3g
   - const: qcom,msm8226
 
-- 
2.34.1




Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-26 Thread Jarkko Sakkinen
On Tue Mar 26, 2024 at 4:16 AM EET, Rick Edgecombe wrote:
> The mm_struct contains a function pointer *get_unmapped_area(), which
> is set to either arch_get_unmapped_area() or
> arch_get_unmapped_area_topdown() during the initialization of the mm.
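
The claim in question, concretely, if I'm reading mm/util.c right (a
simplified sketch of arch_pick_mmap_layout(), not the patch itself):

    if (mmap_is_legacy(rlim_stack))
        mm->get_unmapped_area = arch_get_unmapped_area;
    else
        mm->get_unmapped_area = arch_get_unmapped_area_topdown;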

Under which conditions is each path used during the initialization of the
mm, and why is this the case? It is an open claim in the current form.

That would be nice to have documented for the sake of a complete
description. I have no doubt that the claim is true.

BR, Jarkko



Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-26 Thread Jonthan Haslam
> > > Have you considered/measured per-CPU RW semaphores?
> >
> > No I hadn't but thanks hugely for suggesting it! In initial measurements
> > it seems to be between 20-100% faster than the RW spinlocks! Apologies for
> > all the exclamation marks but I'm very excited. I'll do some more testing
> > tomorrow but so far it's looking very good.
> >
> 
> Documentation ([0]) says that locking for writing calls
> synchronize_rcu(), is that right? If that's true, attaching multiple
> uprobes (including just attaching a single BPF multi-uprobe) will take
> a really long time. We need to confirm we are not significantly
> regressing this. And if we do, we need to take measures in the BPF
> multi-uprobe attachment code path to make sure that a single
> multi-uprobe attachment is still fast.
> 
> If my worries above turn out to be true, it still feels like a first
> good step should be landing this patch as is (and get it backported to
> older kernels), and then have percpu rw-semaphore as a final (and a
> bit more invasive) solution (it's RCU-based, so feels like a good
> primitive to settle on), making sure to not regress multi-uprobes
> (we'll probably will need some batched API for multiple uprobes).
> 
> Thoughts?

Agreed. In the percpu_down_write() path we call rcu_sync_enter() which is
what calls into synchronize_rcu(). I haven't done the measurements yet but
I would imagine this has to regress probe attachment, at least in the
uncontended case. Of course, reads are by far the dominant mode here but
we probably shouldn't punish writes excessively. I will do some
measurements to quantify the write penalty here.

I agree that a batched interface for probe attachment is needed here. The
usual mode of operation for us is that we have a number of USDTs (uprobes)
in hand and we want to enable and disable them in one shot. Removing the
need to do multiple locking operations is definitely an efficiency
improvement that needs to be done. Tie that together with per-CPU RW
semaphores and this should scale extremely well in both a read and write
case.
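
A minimal sketch of what I have in mind, assuming we settle on a
percpu_rw_semaphore (the names mirror the existing uprobes code, but this
is illustrative only, not a finished patch):

    static DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);

    static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
    {
        struct uprobe *uprobe;

        /* read side: cheap per-CPU counter, no cache-line bouncing */
        percpu_down_read(&uprobes_treelock);
        uprobe = __find_uprobe(inode, offset);
        percpu_up_read(&uprobes_treelock);

        return uprobe;
    }

    static void insert_uprobe(struct uprobe *uprobe)
    {
        /* write side: calls rcu_sync_enter(), hence the attach cost */
        percpu_down_write(&uprobes_treelock);
        __insert_uprobe(uprobe);
        percpu_up_write(&uprobes_treelock);
    }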

Jon.



Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-26 Thread Will Deacon
On Tue, Mar 26, 2024 at 09:38:55AM +, Keir Fraser wrote:
> On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > Secondly, the debugging code is enhanced so that the available head for
> > > (last_avail_idx - 1) is read for twice and recorded. It means the 
> > > available
> > > head for one specific available index is read for twice. I do see the
> > > available heads are different from the consecutive reads. More details
> > > are shared as below.
> > > 
> > > From the guest side
> > > ===
> > > 
> > > virtio_net virtio0: output.0:id 86 is not a head!
> > > head to be released: 047 062 112
> > > 
> > > avail_idx:
> > > 000  49665
> > > 001  49666  <--
> > >  :
> > > 015  49664
> > 
> > what are these #s 49665 and so on?
> > and how large is the ring?
> > I am guessing 49664 is the index ring size is 16 and
> > 49664 % 16 == 0
> 
> More than that, 49664 % 256 == 0
> 
> So again there seems to be an error in the vicinity of roll-over of
> the idx low byte, as I observed in the earlier log. Surely this is
> more than coincidence?

Yeah, I'd still really like to see the disassembly for both sides of the
protocol here. Gavin, is that something you're able to provide? Worst
case, the host and guest vmlinux objects would be a starting point.

Personally, I'd be fairly surprised if this was a hardware issue.
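
For reference, the guest-side publish path in question looks roughly like
this (simplified from virtqueue_add_split(); the comment is mine):

    avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
    vq->split.vring.avail->ring[avail] = cpu_to_virtio16(vdev, head);

    /* the new ring[] entry must be visible before the index update */
    virtio_wmb(vq->weak_barriers);
    vq->split.avail_idx_shadow++;
    vq->split.vring.avail->idx = cpu_to_virtio16(vdev,
                                                 vq->split.avail_idx_shadow);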

Will



Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-26 Thread Paolo Abeni
On Tue, 2024-03-26 at 18:43 +0800, Jason Xing wrote:
> On Tue, Mar 26, 2024 at 6:29 PM Paolo Abeni  wrote:
> > 
> > On Tue, 2024-03-26 at 12:14 +0800, Jason Xing wrote:
> > > On Mon, Mar 25, 2024 at 11:43 AM Jason Xing  
> > > wrote:
> > > > 
> > > > From: Jason Xing 
> > > > 
> > > > Using the macro for other tracepoints makes them more concise.
> > > > No functional change.
> > > > 
> > > > Jason Xing (3):
> > > >   trace: move to TP_STORE_ADDRS related macro to net_probe_common.h
> > > >   trace: use TP_STORE_ADDRS() macro in inet_sk_error_report()
> > > >   trace: use TP_STORE_ADDRS() macro in inet_sock_set_state()
> > > > 
> > > >  include/trace/events/net_probe_common.h | 29 
> > > >  include/trace/events/sock.h | 35 -
> > > 
> > > I just noticed that some trace files in include/trace directory (like
> > > net_probe_common.h, sock.h, skb.h, net.h, sock.h, udp.h, sctp.h,
> > > qdisc.h, neigh.h, napi.h, icmp.h, ...) are not owned by networking
> > > folks while some files (like tcp.h) have been maintained by specific
> > > maintainers/experts (like Eric) because they belong to one specific
> > > area. I wonder if we can get more networking guys involved in net
> > > tracing.
> > > 
> > > I'm not sure if 1) we can put those files into the "NETWORKING
> > > [GENERAL]" category, or 2) we can create a new category to include
> > > them all.
> > 
> > I think all the files you mentioned are not under networking because of
> > a MAINTAINERS file inaccuracy, and we could move them there accordingly.
> 
> Yes, they are not under the networking category currently. So how
> could we move them? The MAINTAINER file doesn't have all the specific
> categories which are suitable for each of the trace files.

I think there is no need for other categories: adding the explicit 'F:'
entries for such files in the NETWORKING [GENERAL] section should fit.

> > > I know people start using BPF to trace them all instead, but I can see
> > > some good advantages of those hooks implemented in the kernel, say:
> > > 1) help those machines which are not easy to use BPF tools.
> > > 2) insert the tracepoint in the middle of some functions which cannot
> > > be replaced by bpf kprobe.
> > > 3) if we have enough tracepoints, we can generate a timeline to
> > > know/detect which flow/skb spends unexpected time at which point.
> > > ...
> > > We can do many things in this area, I think :)
> > > 
> > > What do you think about this, Jakub, Paolo, Eric ?
> > 
> > I agree tracepoints are useful, but I think the general agreement is
> > that they are the 'old way', we should try to avoid their
> > proliferation.
> 
> Well, it's a pity that it seems that we are about to abandon this
> method but it's not that friendly to the users who are unable to
> deploy BPF... Well, I came up with more ideas about how to improve the
> trace function in recent days. The motivation of doing this is that I
> encountered some issues which could be traced/diagnosed by using trace
> effortlessly without writing some bpftrace codes again and again. The
> status of trace seems not active but many people are still using it, I
> believe.

I don't think we should abandon it completely. My understanding is that
we should think carefully before adding new tracepoints, and generally
speaking, avoid adding 'too many' of them.

Cheers,

Paolo





Re: [PATCH net-next v2 3/3] tcp: add location into reset trace process

2024-03-26 Thread Paolo Abeni
On Mon, 2024-03-25 at 14:28 +0800, Jason Xing wrote:
> From: Jason Xing 
> 
> In addition to knowing the 4-tuple of the flow which generates RST,
> the reason why it does so is very important, because we have some
> cases where the RST should be sent and we have no clue which one
> exactly.
> 
> Adding the location of the reset process can help us more, like what
> trace_kfree_skb does.
> 
> Signed-off-by: Jason Xing 
> ---
>  include/trace/events/tcp.h | 14 ++
>  net/ipv4/tcp_ipv4.c|  2 +-
>  net/ipv4/tcp_output.c  |  2 +-
>  net/ipv6/tcp_ipv6.c|  2 +-
>  4 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> index a13eb2147a02..8f6c1a07503c 100644
> --- a/include/trace/events/tcp.h
> +++ b/include/trace/events/tcp.h
> @@ -109,13 +109,17 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
>   */
>  TRACE_EVENT(tcp_send_reset,
>  
> - TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
> + TP_PROTO(
> + const struct sock *sk,
> + const struct sk_buff *skb,
> + void *location),

Very minor nit: the above lines should be aligned with the open
bracket.
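
I.e., something like:

	TP_PROTO(const struct sock *sk,
		 const struct sk_buff *skb,
		 void *location),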

No need to repost just for this, but let's wait for Eric's feedback.

Cheers,

Paolo




Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-26 Thread patchwork-bot+netdevbpf
Hello:

This series was applied to netdev/net-next.git (main)
by Paolo Abeni :

On Mon, 25 Mar 2024 11:43:44 +0800 you wrote:
> From: Jason Xing 
> 
> Using the macro for other tracepoints makes them more concise.
> No functional change.
> 
> Jason Xing (3):
>   trace: move to TP_STORE_ADDRS related macro to net_probe_common.h
>   trace: use TP_STORE_ADDRS() macro in inet_sk_error_report()
>   trace: use TP_STORE_ADDRS() macro in inet_sock_set_state()
> 
> [...]

Here is the summary with links:
  - [net-next,1/3] trace: move to TP_STORE_ADDRS related macro to 
net_probe_common.h
https://git.kernel.org/netdev/net-next/c/b3af9045b482
  - [net-next,2/3] trace: use TP_STORE_ADDRS() macro in inet_sk_error_report()
https://git.kernel.org/netdev/net-next/c/a24c855a5ef2
  - [net-next,3/3] trace: use TP_STORE_ADDRS() macro in inet_sock_set_state()
https://git.kernel.org/netdev/net-next/c/646700ce23f4

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html





Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-26 Thread Jason Xing
On Tue, Mar 26, 2024 at 6:29 PM Paolo Abeni  wrote:
>
> On Tue, 2024-03-26 at 12:14 +0800, Jason Xing wrote:
> > On Mon, Mar 25, 2024 at 11:43 AM Jason Xing  
> > wrote:
> > >
> > > From: Jason Xing 
> > >
> > > Using the macro for other tracepoints makes them more concise.
> > > No functional change.
> > >
> > > Jason Xing (3):
> > >   trace: move to TP_STORE_ADDRS related macro to net_probe_common.h
> > >   trace: use TP_STORE_ADDRS() macro in inet_sk_error_report()
> > >   trace: use TP_STORE_ADDRS() macro in inet_sock_set_state()
> > >
> > >  include/trace/events/net_probe_common.h | 29 
> > >  include/trace/events/sock.h | 35 -
> >
> > I just noticed that some trace files in include/trace directory (like
> > net_probe_common.h, sock.h, skb.h, net.h, sock.h, udp.h, sctp.h,
> > qdisc.h, neigh.h, napi.h, icmp.h, ...) are not owned by networking
> > folks while some files (like tcp.h) have been maintained by specific
> > maintainers/experts (like Eric) because they belong to one specific
> > area. I wonder if we can get more networking guys involved in net
> > tracing.
> >
> > I'm not sure if 1) we can put those files into the "NETWORKING
> > [GENERAL]" category, or 2) we can create a new category to include
> > them all.
>
> I think all the files you mentioned are not under networking because of
> a MAINTAINERS file inaccuracy, and we could move them there accordingly.

Yes, they are not under the networking category currently. So how
could we move them? The MAINTAINERS file doesn't have all the specific
categories which are suitable for each of the trace files.

> >
> > I know people start using BPF to trace them all instead, but I can see
> > some good advantages of those hooks implemented in the kernel, say:
> > 1) help those machines which are not easy to use BPF tools.
> > 2) insert the tracepoint in the middle of some functions which cannot
> > be replaced by bpf kprobe.
> > 3) if we have enough tracepoints, we can generate a timeline to
> > know/detect which flow/skb spends unexpected time at which point.
> > ...
> > We can do many things in this area, I think :)
> >
> > What do you think about this, Jakub, Paolo, Eric ?
>
> I agree tracepoints are useful, but I think the general agreement is
> that they are the 'old way', we should try to avoid their
> proliferation.

Well, it's a pity that it seems that we are about to abandon this
method but it's not that friendly to the users who are unable to
deploy BPF... Well, I came up with more ideas about how to improve the
trace function in recent days. The motivation of doing this is that I
encountered some issues which could be traced/diagnosed by using trace
effortlessly without writing some bpftrace codes again and again. The
status of trace seems not active but many people are still using it, I
believe.

Thanks,
Jason

>
> Cheers,
>
> Paolo
>



Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro

2024-03-26 Thread Paolo Abeni
On Tue, 2024-03-26 at 12:14 +0800, Jason Xing wrote:
> On Mon, Mar 25, 2024 at 11:43 AM Jason Xing  wrote:
> > 
> > From: Jason Xing 
> > 
> > Using the macro for other tracepoints makes them more concise.
> > No functional change.
> > 
> > Jason Xing (3):
> >   trace: move to TP_STORE_ADDRS related macro to net_probe_common.h
> >   trace: use TP_STORE_ADDRS() macro in inet_sk_error_report()
> >   trace: use TP_STORE_ADDRS() macro in inet_sock_set_state()
> > 
> >  include/trace/events/net_probe_common.h | 29 
> >  include/trace/events/sock.h | 35 -
> 
> I just noticed that some trace files in include/trace directory (like
> net_probe_common.h, sock.h, skb.h, net.h, sock.h, udp.h, sctp.h,
> qdisc.h, neigh.h, napi.h, icmp.h, ...) are not owned by networking
> folks while some files (like tcp.h) have been maintained by specific
> maintainers/experts (like Eric) because they belong to one specific
> area. I wonder if we can get more networking guys involved in net
> tracing.
> 
> I'm not sure if 1) we can put those files into the "NETWORKING
> [GENERAL]" category, or 2) we can create a new category to include
> them all.

I think all the files you mentioned are not under networking because of
a MAINTAINERS file inaccuracy, and we could move them there accordingly.
> 
> I know people start using BPF to trace them all instead, but I can see
> some good advantages of those hooks implemented in the kernel, say:
> 1) help those machines which are not easy to use BPF tools.
> 2) insert the tracepoint in the middle of some functions which cannot
> be replaced by bpf kprobe.
> 3) if we have enough tracepoints, we can generate a timeline to
> know/detect which flow/skb spends unexpected time at which point.
> ...
> We can do many things in this area, I think :)
> 
> What do you think about this, Jakub, Paolo, Eric ?

I agree tracepoints are useful, but I think the general agreement is
that they are the 'old way', we should try to avoid their
proliferation. 

Cheers,

Paolo




Re: [PATCH 2/2] ARM: dts: qcom: Add support for Motorola Moto G (2013)

2024-03-26 Thread Konrad Dybcio
On 25.03.2024 9:25 PM, Stanislav Jakubek wrote:
> On Mon, Mar 25, 2024 at 08:28:27PM +0100, Konrad Dybcio wrote:
>> On 24.03.2024 3:04 PM, Stanislav Jakubek wrote:
>>> Add a device tree for the Motorola Moto G (2013) smartphone based
>>> on the Qualcomm MSM8226 SoC.
>>>
>>> Initially supported features:
>>>   - Buttons (Volume Down/Up, Power)
>>>   - eMMC
>>>   - Hall Effect Sensor
>>>   - SimpleFB display
>>>   - TMP108 temperature sensor
>>>   - Vibrator
>>>
>>> Signed-off-by: Stanislav Jakubek 
>>> ---
>>
>> [...]
>>
>>> +   hob-ram@f500000 {
>>> +   reg = <0x0f500000 0x40000>,
>>> + <0x0f540000 0x2000>;
>>> +   no-map;
>>> +   };
>>
>> Any reason it's in two parts? Should it be one contiguous region, or
>> two separate nodes?
>>
>> lgtm otherwise
> 
> Hi Konrad, I copied this from downstream as-is.
> According to the downstream docs [1]:
> 
> HOB RAM MMAP Device provides ability for userspace to access the
> hand over block memory to read out modem related parameters.
> 
> And the two regs are the "DHOB partition" and "SHOB partition".

Oh right, Motorola made some inventions here...

> 
> I suppose this is something Motorola (firmware?) specific (since the
> downstream compatible is mmi,hob_ram [2]).
> Should I split this into 2 nodes - dhob@f50 and shob@f54?

Yes please, and add the downstream txt link to the commit message in case
somebody is curious down the line.

Konrad



[PATCH v19 RESEND 4/5] Documentation: tracing: Add ring-buffer mapping

2024-03-26 Thread Vincent Donnefort
It is now possible to mmap() a ring-buffer to stream its content. Add
some documentation and a code example.

Signed-off-by: Vincent Donnefort 

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
timerlat-tracer
intel_th
ring-buffer-design
+   ring-buffer-map
stm
sys-t
coresight/index
diff --git a/Documentation/trace/ring-buffer-map.rst 
b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..0426ab4bcf3d
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,106 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Tracefs ring-buffer memory mapping
+==================================
+
+:Author: Vincent Donnefort 
+
+Overview
+========
+
+The Tracefs ring-buffer memory map provides an efficient method to stream data,
+as no memory copy is necessary. The application mapping the ring-buffer then
+becomes a consumer of that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+====================
+
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and a
+description. It is referred to as the meta-page. One of the most important
+fields of the meta-page is the reader. It contains the sub-buffer ID which can
+be safely read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID. It
+is therefore straightforward to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+	reader_id = meta->reader->id;
+	reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one
+using the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also
+updates the meta-page fields.
+
+Limitations
+===========
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (neither the overall size of the ring-buffer nor the size of each
+subbuf). It is also not possible to use snapshot, and splice will copy the
+ring-buffer data instead of using the copyless swap from the ring buffer.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable, just like concurrent readers
+on trace_pipe would be.
+
+Example
+=======
+
+.. code-block:: c
+
+	#include <fcntl.h>
+	#include <stdio.h>
+	#include <stdlib.h>
+	#include <unistd.h>
+
+	#include <linux/trace_mmap.h>
+
+	#include <sys/mman.h>
+	#include <sys/ioctl.h>
+
+	#define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+	int main(void)
+	{
+		int page_size = getpagesize(), fd, reader_id;
+		unsigned long meta_len, data_len;
+		struct trace_buffer_meta *meta;
+		void *map, *reader, *data;
+
+		fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+		if (fd < 0)
+			exit(EXIT_FAILURE);
+
+		/* Map the meta-page first: it describes the buffer geometry. */
+		map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+		if (map == MAP_FAILED)
+			exit(EXIT_FAILURE);
+
+		meta = (struct trace_buffer_meta *)map;
+		meta_len = meta->meta_page_size;
+
+		printf("entries:    %llu\n", meta->entries);
+		printf("overrun:    %llu\n", meta->overrun);
+		printf("read:       %llu\n", meta->read);
+		printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+		/* The sub-buffers follow the meta-page in the mapping. */
+		data_len = meta->subbuf_size * meta->nr_subbufs;
+		data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len);
+		if (data == MAP_FAILED)
+			exit(EXIT_FAILURE);
+
+		/* Swap in a reader sub-buffer containing unread data. */
+		if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+			exit(EXIT_FAILURE);
+
+		reader_id = meta->reader.id;
+		reader = data + meta->subbuf_size * reader_id;
+
+		printf("Current reader address: %p\n", reader);
+
+		munmap(data, data_len);
+		munmap(meta, meta_len);
+		close(fd);
+
+		return 0;
+	}
-- 
2.44.0.396.g6e790dbe36-goog




[PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-03-26 Thread Vincent Donnefort
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Mapping will prevent snapshot and buffer size modifications.

CC: 
Signed-off-by: Vincent Donnefort 
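
For illustration, a consumer loop built on this ioctl could look roughly as
follows (a sketch, not part of the patch; read_subbuf() is a hypothetical
helper that parses one sub-buffer):

	/* Repeatedly swap the reader sub-buffer and consume it. */
	for (;;) {
		/* Waits until buffer_percent is reached unless O_NONBLOCK. */
		if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
			break;

		reader_id = meta->reader.id;
		read_subbuf(data + meta->subbuf_size * reader_id,
			    meta->reader.read);
	}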

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index ffcd8dfcaa4f..d25b9d504a7c 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
__u64   Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 233d1af39fff..0f37aa9860fd 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1191,6 +1191,12 @@ static void tracing_snapshot_instance_cond(struct 
trace_array *tr,
return;
}
 
+   if (tr->mapped) {
+   trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+   trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+   return;
+   }
+
local_irq_save(flags);
update_max_tr(tr, current, smp_processor_id(), cond_data);
local_irq_restore(flags);
@@ -1323,7 +1329,7 @@ static int tracing_arm_snapshot_locked(struct trace_array 
*tr)
	lockdep_assert_held(&trace_types_lock);

	spin_lock(&tr->snapshot_trigger_lock);
-	if (tr->snapshot == UINT_MAX) {
+	if (tr->snapshot == UINT_MAX || tr->mapped) {
		spin_unlock(&tr->snapshot_trigger_lock);
return -EBUSY;
}
@@ -6068,7 +6074,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
	if (tr->current_trace == &nop_trace)
return;
-   
+
tr->current_trace->enabled--;
 
if (tr->current_trace->reset)
@@ -8194,15 +8200,32 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
	struct trace_iterator *iter = &info->iter;
+   int err;
+
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent,
+  NULL, NULL);
+   if (err)
+   return err;
+   }
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
 
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
	mutex_lock(&trace_types_lock);
 
/* Make sure the waiters see the new wait_index */
@@ -8214,6 +8237,94 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   return VM_FAULT_SIGBUS;
+}
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+static int get_snapshot_map(struct trace_array *tr)
+{
+   int err = 0;
+
+   /*
+* Called with mmap_lock held. lockdep would be unhappy if we would now
+* take trace_types_lock. Instead use the specific
+* snapshot_trigger_lock.
+*/
+	spin_lock(&tr->snapshot_trigger_lock);
+
+   if (tr->snapshot || tr->mapped == UINT_MAX)
+   err = -EBUSY;
+   else
+   tr->mapped++;
+
+	spin_unlock(&tr->snapshot_trigger_lock);
+
+   /* Wait for update_max_tr() to observe iter->tr->mapped */
+   if (tr->mapped == 1)
+   synchronize_rcu();
+
+   return err;
+
+}
+static void put_snapshot_map(struct 

[PATCH v19 RESEND 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-03-26 Thread Vincent Donnefort
In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer; currently they are
  only identified through their in-kernel VA.

  A meta-page, where ring-buffer statistics and a description of the
  current reader are stored

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

CC: 
Signed-off-by: Vincent Donnefort 
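
As a sketch of the mechanism (not part of the patch): the per-CPU
subbuf_ids[] table introduced below is what keeps the ID stable across the
mapping, so the kernel side can translate a user-visible ID back to a
sub-buffer address with a simple lookup (hypothetical helper):

	static void *subbuf_id_to_va(struct ring_buffer_per_cpu *cpu_buffer,
				     u32 id)
	{
		/* subbuf_ids[] maps an external ID to the subbuf kernel VA. */
		return (void *)cpu_buffer->subbuf_ids[id];
	}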

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu,
+   struct vm_area_struct *vma);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..ffcd8dfcaa4f
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include 
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:	Size of this meta-page.
+ * @meta_struct_len:	Size of this structure.
+ * @subbuf_size:	Size of each sub-buffer.
+ * @nr_subbufs:		Number of sub-buffers in the ring-buffer, including the reader.
+ * @reader.lost_events:	Number of events lost at the time of the reader swap.
+ * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
+ * @reader.read:	Number of bytes read on the reader subbuf.
+ * @flags:		Placeholder for now, 0 until new features are supported.
+ * @entries:		Number of entries in the ring-buffer.
+ * @overrun:		Number of entries lost in the ring-buffer.
+ * @read:		Number of entries that have been read.
+ * @Reserved1:		Reserved for future use.
+ * @Reserved2:		Reserved for future use.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cc9ebe593571..1dc932e7963c 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -338,6 +339,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -484,6 +486,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned intmapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to subbuf VA */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -1599,6 +1607,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+	mutex_init(&cpu_buffer->mapping_lock);
 
bpage = kzalloc_node(ALIGN(sizeof(*bpage), 

[PATCH v19 RESEND 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP

2024-03-26 Thread Vincent Donnefort
In preparation for the ring-buffer memory mapping, allocate compound
pages for the ring-buffer sub-buffers to enable us to map them to
user-space with vm_insert_pages().

Signed-off-by: Vincent Donnefort 
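
To sketch why __GFP_COMP matters here (illustrative only, assuming the
mapping code added later in the series): an order-N compound allocation
lets every sub-page of a sub-buffer be inserted into a user VMA in one
batch via vm_insert_pages(). 'pages' is a preallocated array of
1 << order entries:

	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_COMP | __GFP_ZERO,
				order);
	nr = 1UL << order;
	for (i = 0; i < nr; i++)
		pages[i] = page + i;	/* head page and its tail pages */

	err = vm_insert_pages(vma, vaddr, pages, &nr);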

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 25476ead681b..cc9ebe593571 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu 
*cpu_buffer,
	list_add(&bpage->list, pages);
 
page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
-   mflags | __GFP_ZERO,
+   mflags | __GFP_COMP | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
@@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
 
cpu_buffer->reader_page = bpage;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | 
__GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
@@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, 
int cpu)
goto out;
 
page = alloc_pages_node(cpu_to_node(cpu),
-   GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
+   GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | 
__GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page) {
kfree(bpage);
-- 
2.44.0.396.g6e790dbe36-goog




[PATCH v19 RESEND 0/5] Introducing trace buffer mapping by user-space

2024-03-26 Thread Vincent Donnefort
The tracing ring-buffers can be stored on disk or sent to the network
without any copy via splice. However, the latter doesn't allow real-time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Support for this new feature can already be found in libtracefs from
version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v18 -> v19:
  * Use VM_PFNMAP and vm_insert_pages
  * Allocate ring-buffer subbufs with __GFP_COMP
  * Pad the meta-page with the zero-page to align on the subbuf_order
  * Extend the ring-buffer test with mmap() dedicated suite

v17 -> v18:
  * Fix lockdep_assert_held
  * Fix spin_lock_init typo
  * Fix CONFIG_TRACER_MAX_TRACE typo

v16 -> v17:
  * Documentation and comments improvements.
  * Create get/put_snapshot_map() for clearer code.
  * Replace kzalloc with kcalloc.
  * Fix -ENOMEM handling in rb_alloc_meta_page().
  * Move flush(cpu_buffer->reader_page) behind the reader lock.
  * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer
mutex. (removes READ_ONCE/WRITE_ONCE accesses).

v15 -> v16:
  * Add comment for the dcache flush.
  * Remove now unnecessary WRITE_ONCE for the meta-page.

v14 -> v15:
  * Add meta-page and reader-page flush. Intends to fix the mapping
for VIVT and aliasing-VIPT data caches.
  * -EPERM on VM_EXEC.
  * Fix build warning !CONFIG_TRACER_MAX_TRACE.

v13 -> v14:
  * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
  * on unmap, sync meta-page teardown with the reader_lock instead of
the synchronize_rcu.
  * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
(intends to fix a lockdep issue)
  * Add kerneldoc for flags and Reserved fields.
  * Add kselftest for snapshot/map mutual exclusion.

v12 -> v13:
  * Swap subbufs_{touched,lost} for Reserved fields.
  * Add a flag field in the meta-page.
  * Fix CONFIG_TRACER_MAX_TRACE.
  * Rebase on top of trace/urgent.
  * Add a comment for try_unregister_trigger()

v11 -> v12:
  * Fix code sample mmap bug.
  * Add logging in sample code.
  * Reset tracer in selftest.
  * Add a refcount for the snapshot users.
  * Prevent mapping when there are snapshot users and vice versa.
  * Refine the meta-page.
  * Fix types in the meta-page.
  * Collect Reviewed-by.

v10 -> v11:
  * Add Documentation and code sample.
  * Add a selftest.
  * Move all the update to the meta-page into a single
rb_update_meta_page().
  * rb_update_meta_page() is now called from
ring_buffer_map_get_reader() to fix NOBLOCK callers.
  * kerneldoc for struct trace_meta_page.
  * Add a patch to zero all the ring-buffer allocations.

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
   - pages_lost / pages_read (allow to compute how full is the
 ring-buffer)
   - read (allow to compute how many entries can be read)
   - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

Vincent Donnefort (5):
  ring-buffer: allocate sub-buffers with __GFP_COMP
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Allow user-space mapping of the ring-buffer
  Documentation: tracing: Add ring-buffer mapping
  ring-buffer/selftest: Add ring-buffer mapping test

 Documentation/trace/index.rst |   1 +
 

[PATCH v19 4/5] Documentation: tracing: Add ring-buffer mapping

2024-03-26 Thread Vincent Donnefort
It is now possible to mmap() a ring-buffer to stream its content. Add
some documentation and a code example.

Signed-off-by: Vincent Donnefort 

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
timerlat-tracer
intel_th
ring-buffer-design
+   ring-buffer-map
stm
sys-t
coresight/index
diff --git a/Documentation/trace/ring-buffer-map.rst 
b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..0426ab4bcf3d
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,106 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Tracefs ring-buffer memory mapping
+==================================
+
+:Author: Vincent Donnefort 
+
+Overview
+========
+
+The Tracefs ring-buffer memory map provides an efficient method to stream data,
+as no memory copy is necessary. The application mapping the ring-buffer then
+becomes a consumer of that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+====================
+
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and a
+description. It is referred to as the meta-page. One of the most important
+fields of the meta-page is the reader. It contains the sub-buffer ID which can
+be safely read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID. It
+is therefore straightforward to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+	reader_id = meta->reader->id;
+	reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one
+using the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also
+updates the meta-page fields.
+
+Limitations
+===========
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (neither the overall size of the ring-buffer nor the size of each
+subbuf). It is also not possible to use snapshot, and splice will copy the
+ring-buffer data instead of using the copyless swap from the ring buffer.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable, just like concurrent readers
+on trace_pipe would be.
+
+Example
+=======
+
+.. code-block:: c
+
+	#include <fcntl.h>
+	#include <stdio.h>
+	#include <stdlib.h>
+	#include <unistd.h>
+
+	#include <linux/trace_mmap.h>
+
+	#include <sys/mman.h>
+	#include <sys/ioctl.h>
+
+	#define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+	int main(void)
+	{
+		int page_size = getpagesize(), fd, reader_id;
+		unsigned long meta_len, data_len;
+		struct trace_buffer_meta *meta;
+		void *map, *reader, *data;
+
+		fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+		if (fd < 0)
+			exit(EXIT_FAILURE);
+
+		/* Map the meta-page first: it describes the buffer geometry. */
+		map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+		if (map == MAP_FAILED)
+			exit(EXIT_FAILURE);
+
+		meta = (struct trace_buffer_meta *)map;
+		meta_len = meta->meta_page_size;
+
+		printf("entries:    %llu\n", meta->entries);
+		printf("overrun:    %llu\n", meta->overrun);
+		printf("read:       %llu\n", meta->read);
+		printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+		/* The sub-buffers follow the meta-page in the mapping. */
+		data_len = meta->subbuf_size * meta->nr_subbufs;
+		data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len);
+		if (data == MAP_FAILED)
+			exit(EXIT_FAILURE);
+
+		/* Swap in a reader sub-buffer containing unread data. */
+		if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+			exit(EXIT_FAILURE);
+
+		reader_id = meta->reader.id;
+		reader = data + meta->subbuf_size * reader_id;
+
+		printf("Current reader address: %p\n", reader);
+
+		munmap(data, data_len);
+		munmap(meta, meta_len);
+		close(fd);
+
+		return 0;
+	}
-- 
2.44.0.396.g6e790dbe36-goog




[PATCH v19 3/5] tracing: Allow user-space mapping of the ring-buffer

2024-03-26 Thread Vincent Donnefort
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Mapping will prevent snapshot and buffer size modifications.

Signed-off-by: Vincent Donnefort 

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index ffcd8dfcaa4f..d25b9d504a7c 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
__u64   Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 233d1af39fff..0f37aa9860fd 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1191,6 +1191,12 @@ static void tracing_snapshot_instance_cond(struct 
trace_array *tr,
return;
}
 
+   if (tr->mapped) {
+   trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+   trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+   return;
+   }
+
local_irq_save(flags);
update_max_tr(tr, current, smp_processor_id(), cond_data);
local_irq_restore(flags);
@@ -1323,7 +1329,7 @@ static int tracing_arm_snapshot_locked(struct trace_array 
*tr)
	lockdep_assert_held(&trace_types_lock);

	spin_lock(&tr->snapshot_trigger_lock);
-	if (tr->snapshot == UINT_MAX) {
+	if (tr->snapshot == UINT_MAX || tr->mapped) {
		spin_unlock(&tr->snapshot_trigger_lock);
return -EBUSY;
}
@@ -6068,7 +6074,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
	if (tr->current_trace == &nop_trace)
return;
-   
+
tr->current_trace->enabled--;
 
if (tr->current_trace->reset)
@@ -8194,15 +8200,32 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
	struct trace_iterator *iter = &info->iter;
+   int err;
+
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent,
+  NULL, NULL);
+   if (err)
+   return err;
+   }
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
 
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
	mutex_lock(&trace_types_lock);
 
/* Make sure the waiters see the new wait_index */
@@ -8214,6 +8237,94 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   return VM_FAULT_SIGBUS;
+}
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+static int get_snapshot_map(struct trace_array *tr)
+{
+   int err = 0;
+
+   /*
+* Called with mmap_lock held. lockdep would be unhappy if we would now
+* take trace_types_lock. Instead use the specific
+* snapshot_trigger_lock.
+*/
+	spin_lock(&tr->snapshot_trigger_lock);
+
+   if (tr->snapshot || tr->mapped == UINT_MAX)
+   err = -EBUSY;
+   else
+   tr->mapped++;
+
+	spin_unlock(&tr->snapshot_trigger_lock);
+
+   /* Wait for update_max_tr() to observe iter->tr->mapped */
+   if (tr->mapped == 1)
+   synchronize_rcu();
+
+   return err;
+
+}
+static void put_snapshot_map(struct 

[PATCH v19 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-03-26 Thread Vincent Donnefort
In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer; currently they are
  only identified through their in-kernel VA.

  A meta-page, where ring-buffer statistics and a description of the
  current reader are stored

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu,
+   struct vm_area_struct *vma);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..ffcd8dfcaa4f
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include 
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:	Size of this meta-page.
+ * @meta_struct_len:	Size of this structure.
+ * @subbuf_size:	Size of each sub-buffer.
+ * @nr_subbufs:		Number of sub-buffers in the ring-buffer, including the reader.
+ * @reader.lost_events:	Number of events lost at the time of the reader swap.
+ * @reader.id:		subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1]
+ * @reader.read:	Number of bytes read on the reader subbuf.
+ * @flags:		Placeholder for now, 0 until new features are supported.
+ * @entries:		Number of entries in the ring-buffer.
+ * @overrun:		Number of entries lost in the ring-buffer.
+ * @read:		Number of entries that have been read.
+ * @Reserved1:		Reserved for future use.
+ * @Reserved2:		Reserved for future use.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index cc9ebe593571..1dc932e7963c 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -338,6 +339,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -484,6 +486,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned intmapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to subbuf VA */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -1599,6 +1607,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+	mutex_init(&cpu_buffer->mapping_lock);
 
bpage = kzalloc_node(ALIGN(sizeof(*bpage), 

[PATCH v19 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP

2024-03-26 Thread Vincent Donnefort
In preparation for the ring-buffer memory mapping, allocate compound
pages for the ring-buffer sub-buffers to enable us to map them to
user-space with vm_insert_pages().

Signed-off-by: Vincent Donnefort 

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 25476ead681b..cc9ebe593571 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu 
*cpu_buffer,
	list_add(&bpage->list, pages);
 
page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
-   mflags | __GFP_ZERO,
+   mflags | __GFP_COMP | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
@@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
 
cpu_buffer->reader_page = bpage;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | 
__GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
@@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, 
int cpu)
goto out;
 
page = alloc_pages_node(cpu_to_node(cpu),
-   GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
+   GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | 
__GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page) {
kfree(bpage);
-- 
2.44.0.396.g6e790dbe36-goog




[PATCH v19 0/5] Introducing trace buffer mapping by user-space

2024-03-26 Thread Vincent Donnefort
The tracing ring-buffers can be stored on disk or sent to the network
without any copy via splice. However, the latter doesn't allow real-time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Support for this new feature can already be found in libtracefs from
version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v18 -> v19:
  * Use VM_PFNMAP and vm_insert_pages
  * Allocate ring-buffer subbufs with __GFP_COMP
  * Pad the meta-page with the zero-page to align on the subbuf_order
  * Extend the ring-buffer test with mmap() dedicated suite

v17 -> v18:
  * Fix lockdep_assert_held
  * Fix spin_lock_init typo
  * Fix CONFIG_TRACER_MAX_TRACE typo

v16 -> v17:
  * Documentation and comments improvements.
  * Create get/put_snapshot_map() for clearer code.
  * Replace kzalloc with kcalloc.
  * Fix -ENOMEM handling in rb_alloc_meta_page().
  * Move flush(cpu_buffer->reader_page) behind the reader lock.
  * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer
mutex. (removes READ_ONCE/WRITE_ONCE accesses).

v15 -> v16:
  * Add comment for the dcache flush.
  * Remove now unnecessary WRITE_ONCE for the meta-page.

v14 -> v15:
  * Add meta-page and reader-page flush. Intends to fix the mapping
for VIVT and aliasing-VIPT data caches.
  * -EPERM on VM_EXEC.
  * Fix build warning !CONFIG_TRACER_MAX_TRACE.

v13 -> v14:
  * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
  * on unmap, sync meta-page teardown with the reader_lock instead of
the synchronize_rcu.
  * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
(intends to fix a lockdep issue)
  * Add kerneldoc for flags and Reserved fields.
  * Add kselftest for snapshot/map mutual exclusion.

v12 -> v13:
  * Swap subbufs_{touched,lost} for Reserved fields.
  * Add a flag field in the meta-page.
  * Fix CONFIG_TRACER_MAX_TRACE.
  * Rebase on top of trace/urgent.
  * Add a comment for try_unregister_trigger()

v11 -> v12:
  * Fix code sample mmap bug.
  * Add logging in sample code.
  * Reset tracer in selftest.
  * Add a refcount for the snapshot users.
  * Prevent mapping when there are snapshot users and vice versa.
  * Refine the meta-page.
  * Fix types in the meta-page.
  * Collect Reviewed-by.

v10 -> v11:
  * Add Documentation and code sample.
  * Add a selftest.
  * Move all the update to the meta-page into a single
rb_update_meta_page().
  * rb_update_meta_page() is now called from
ring_buffer_map_get_reader() to fix NOBLOCK callers.
  * kerneldoc for struct trace_meta_page.
  * Add a patch to zero all the ring-buffer allocations.

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
   - pages_lost / pages_read (allow to compute how full is the
 ring-buffer)
   - read (allow to compute how many entries can be read)
   - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

Vincent Donnefort (5):
  ring-buffer: allocate sub-buffers with __GFP_COMP
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Allow user-space mapping of the ring-buffer
  Documentation: tracing: Add ring-buffer mapping
  ring-buffer/selftest: Add ring-buffer mapping test

 Documentation/trace/index.rst |   1 +
 

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-26 Thread Keir Fraser
On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> On Mon, Mar 25, 2024 at 05:34:29PM +1000, Gavin Shan wrote:
> > 
> > On 3/20/24 17:14, Michael S. Tsirkin wrote:
> > > On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> > > > On 3/20/24 10:49, Michael S. Tsirkin wrote:>
> > > > > diff --git a/drivers/virtio/virtio_ring.c 
> > > > > b/drivers/virtio/virtio_ring.c
> > > > > index 6f7e5010a673..79456706d0bd 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct 
> > > > > virtqueue *_vq,
> > > > >   /* Put entry in available array (but don't update avail->idx 
> > > > > until they
> > > > >* do sync). */
> > > > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > > -	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > > +	u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> > > > > +	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
> > > > >   /* Descriptors and available array need to be set before we 
> > > > > expose the
> > > > >* new available array entries. */
> > > > > 
> > 
> > Ok, Michael. I continued with my debugging code. It still looks like a
> > hardware bug on NVIDIA's Grace Hopper. I really think NVIDIA needs to be
> > involved in the discussion, as suggested by you.
> 
> Do you have a support contact at Nvidia to report this?
> 
> > Firstly, I bind the vhost process and vCPU thread to CPU#71 and CPU#70.
> > Note that I have only one vCPU in my configuration.
> 
> Interesting, but is the guest built with CONFIG_SMP set?

arm64 is always built with CONFIG_SMP.

> > Secondly, the debugging code is enhanced so that the available head for
> > (last_avail_idx - 1) is read twice and recorded. It means the available
> > head for one specific available index is read twice. I do see that the
> > available heads differ between the two consecutive reads. More details
> > are shared below.
> > 
> > From the guest side
> > ===
> > 
> > virtio_net virtio0: output.0:id 86 is not a head!
> > head to be released: 047 062 112
> > 
> > avail_idx:
> > 000  49665
> > 001  49666  <--
> >  :
> > 015  49664
> 
> what are these #s 49665 and so on?
> and how large is the ring?
> I am guessing 49664 is the index, the ring size is 16, and
> 49664 % 16 == 0

More than that, 49664 % 256 == 0

So again there seems to be an error in the vicinity of roll-over of
the idx low byte, as I observed in the earlier log. Surely this is
more than coincidence?
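
(As a quick illustrative check of that arithmetic, not from the trace
itself: the low byte of the 16-bit avail_idx wraps every 256 entries, and
49664 sits exactly on such a boundary.)

	uint16_t idx = 49664;	/* == 0xC200 */
	/* (idx & 0xff) == 0, i.e. idx % 256 == 0: low byte just wrapped. */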

 -- Keir

> > avail_head:
> 
> 
> is this the avail ring contents?
> 
> > 000  062
> > 001  047  <--
> >  :
> > 015  112
> 
> 
> What are these arrows pointing at, btw?
> 
> 
> > From the host side
> > ==
> > 
> > avail_idx
> > 000  49663
> > 001  49666  <---
> >  :
> > 
> > avail_head
> > 000  062  (062)
> > 001  047  (047)  <---
> >  :
> > 015  086  (112)  // head 086 is returned from the first read,
> >  // but head 112 is returned from the second read
> > 
> > vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx 
> > 49664
> > 
> > Thanks,
> > Gavin
> 
> OK, thanks. So this proves it is actually the avail ring value.
> 
> -- 
> MST
> 



[PATCH 1/2] LoongArch: KVM: Add steal time support in kvm side

2024-03-26 Thread Bibo Mao
The steal time feature is added here on the KVM side. A VM can query the
supported features provided by the KVM hypervisor; the feature
KVM_FEATURE_STEAL_TIME is added here. Like x86, the steal time structure
is saved in guest memory, and one hypercall function, KVM_HCALL_FUNC_NOTIFY,
is added to notify KVM to enable the feature.

One vCPU attr ioctl command, KVM_LOONGARCH_VCPU_PVTIME_CTRL, is added to
save and restore the base address of the steal time structure when a VM
is migrated.

It needs hypercall instruction emulation handling, so it depends on this
patchset:
https://lore.kernel.org/all/20240201031950.3225626-1-maob...@loongson.cn/

Signed-off-by: Bibo Mao 
---
 arch/loongarch/include/asm/kvm_host.h  |   7 ++
 arch/loongarch/include/asm/kvm_para.h  |  10 +++
 arch/loongarch/include/asm/loongarch.h |   1 +
 arch/loongarch/include/uapi/asm/kvm.h  |   4 +
 arch/loongarch/kvm/exit.c  |  35 ++--
 arch/loongarch/kvm/vcpu.c  | 120 +
 6 files changed, 172 insertions(+), 5 deletions(-)
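
For illustration, user space (e.g. a VMM) would drive the new attr during
migration roughly like this (a sketch; only KVM_LOONGARCH_VCPU_PVTIME_CTRL
and KVM_LOONGARCH_VCPU_PVTIME_GPA come from this patch, the rest is the
generic vCPU device-attr API):

	/* Save/restore the steal time GPA across migration. */
	__u64 gpa;
	struct kvm_device_attr attr = {
		.group = KVM_LOONGARCH_VCPU_PVTIME_CTRL,
		.attr  = KVM_LOONGARCH_VCPU_PVTIME_GPA,
		.addr  = (__u64)&gpa,
	};

	ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);	/* on the source */
	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);	/* on the destination */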

diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index c53946f8ef9f..1d1eaa124349 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
 #define KVM_PRIVATE_MEM_SLOTS  0
 
 #define KVM_HALT_POLL_NS_DEFAULT   50
+#define KVM_REQ_RECORD_STEAL   KVM_ARCH_REQ(1)
 
 #define KVM_GUESTDBG_VALID_MASK(KVM_GUESTDBG_ENABLE | \
KVM_GUESTDBG_USE_SW_BP | KVM_GUESTDBG_SINGLESTEP)
@@ -199,6 +200,12 @@ struct kvm_vcpu_arch {
struct kvm_mp_state mp_state;
/* cpucfg */
u32 cpucfg[KVM_MAX_CPUCFG_REGS];
+   /* paravirt steal time */
+   struct {
+   u64 guest_addr;
+   u64 last_steal;
+   struct gfn_to_hva_cache cache;
+   } st;
 };
 
 static inline unsigned long readl_sw_gcsr(struct loongarch_csrs *csr, int reg)
diff --git a/arch/loongarch/include/asm/kvm_para.h 
b/arch/loongarch/include/asm/kvm_para.h
index 56775554402a..5fb89e20432d 100644
--- a/arch/loongarch/include/asm/kvm_para.h
+++ b/arch/loongarch/include/asm/kvm_para.h
@@ -12,6 +12,7 @@
 #define KVM_HCALL_CODE_SWDBG   1
 #define KVM_HCALL_PV_SERVICE   HYPERCALL_CODE(HYPERVISOR_KVM, 
KVM_HCALL_CODE_PV_SERVICE)
 #define  KVM_HCALL_FUNC_PV_IPI 1
+#define  KVM_HCALL_FUNC_NOTIFY 2
 #define KVM_HCALL_SWDBGHYPERCALL_CODE(HYPERVISOR_KVM, 
KVM_HCALL_CODE_SWDBG)
 
 /*
@@ -21,6 +22,15 @@
 #define KVM_HCALL_INVALID_CODE -1UL
 #define KVM_HCALL_INVALID_PARAMETER-2UL
 
+#define KVM_STEAL_PHYS_VALID   BIT_ULL(0)
+#define KVM_STEAL_PHYS_MASKGENMASK_ULL(63, 6)
+struct kvm_steal_time {
+   __u64 steal;
+   __u32 version;
+   __u32 flags;
+   __u32 pad[12];
+};
+
 /*
  * Hypercall interface for KVM hypervisor
  *
diff --git a/arch/loongarch/include/asm/loongarch.h 
b/arch/loongarch/include/asm/loongarch.h
index 0ad36704cb4b..ab6a5e93c280 100644
--- a/arch/loongarch/include/asm/loongarch.h
+++ b/arch/loongarch/include/asm/loongarch.h
@@ -168,6 +168,7 @@
 #define  KVM_SIGNATURE "KVM\0"
 #define CPUCFG_KVM_FEATURE (CPUCFG_KVM_BASE + 4)
 #define  KVM_FEATURE_PV_IPIBIT(1)
+#define  KVM_FEATURE_STEAL_TIMEBIT(2)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/loongarch/include/uapi/asm/kvm.h 
b/arch/loongarch/include/uapi/asm/kvm.h
index 8f78b23672ac..286b5ce93a57 100644
--- a/arch/loongarch/include/uapi/asm/kvm.h
+++ b/arch/loongarch/include/uapi/asm/kvm.h
@@ -80,7 +80,11 @@ struct kvm_fpu {
 #define LOONGARCH_REG_64(TYPE, REG)(TYPE | KVM_REG_SIZE_U64 | (REG << 
LOONGARCH_REG_SHIFT))
 #define KVM_IOC_CSRID(REG) LOONGARCH_REG_64(KVM_REG_LOONGARCH_CSR, 
REG)
 #define KVM_IOC_CPUCFG(REG)
LOONGARCH_REG_64(KVM_REG_LOONGARCH_CPUCFG, REG)
+
+/* Device Control API on vcpu fd */
 #define KVM_LOONGARCH_VCPU_CPUCFG  0
+#define KVM_LOONGARCH_VCPU_PVTIME_CTRL 1
+#define  KVM_LOONGARCH_VCPU_PVTIME_GPA 0
 
 struct kvm_debug_exit_arch {
 };
diff --git a/arch/loongarch/kvm/exit.c b/arch/loongarch/kvm/exit.c
index d71172e2568e..c774e5803f7f 100644
--- a/arch/loongarch/kvm/exit.c
+++ b/arch/loongarch/kvm/exit.c
@@ -209,7 +209,7 @@ int kvm_emu_idle(struct kvm_vcpu *vcpu)
 static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, larch_inst inst)
 {
int rd, rj;
-   unsigned int index;
+   unsigned int index, ret;
unsigned long plv;
 
rd = inst.reg2_format.rd;
@@ -240,10 +240,13 @@ static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, 
larch_inst inst)
vcpu->arch.gprs[rd] = 0;
break;
case CPUCFG_KVM_FEATURE:
-   if ((plv & CSR_CRMD_PLV) == PLV_KERN)
-   vcpu->arch.gprs[rd] = KVM_FEATURE_PV_IPI;
-   else
-   vcpu->arch.gprs[rd] = 0;
+   ret = 0;
+

[PATCH 2/2] LoongArch: Add steal time support in guest side

2024-03-26 Thread Bibo Mao
A percpu struct kvm_steal_time is added here; its size is 64 bytes and it
is also aligned to 64 bytes, so that the whole structure fits in one
physical page.

When a vCPU is onlined, the function pv_register_steal_time() is called.
This function passes the physical address of struct kvm_steal_time and
tells the hypervisor to enable steal time. When a vCPU goes offline, the
physical address is set to 0 to tell the hypervisor to disable steal time.

Signed-off-by: Bibo Mao 
---
 arch/loongarch/include/asm/paravirt.h |   5 +
 arch/loongarch/kernel/paravirt.c  | 130 ++
 arch/loongarch/kernel/time.c  |   2 +
 3 files changed, 137 insertions(+)
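
As context for the reader loop in para_steal_clock() below, the even/odd
version protocol it relies on can be sketched from the writer's point of
view like this (illustrative only; the actual update is performed by the
hypervisor in patch 1/2):

	/* Writer side: an odd version marks an update in progress. */
	st->version += 1;	/* becomes odd */
	smp_wmb();
	st->steal += delta;
	smp_wmb();
	st->version += 1;	/* becomes even: update complete */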

diff --git a/arch/loongarch/include/asm/paravirt.h 
b/arch/loongarch/include/asm/paravirt.h
index 58f7b7b89f2c..fe27fb5e82b8 100644
--- a/arch/loongarch/include/asm/paravirt.h
+++ b/arch/loongarch/include/asm/paravirt.h
@@ -17,11 +17,16 @@ static inline u64 paravirt_steal_clock(int cpu)
 }
 
 int pv_ipi_init(void);
+int __init pv_time_init(void);
 #else
 static inline int pv_ipi_init(void)
 {
return 0;
 }
 
+static inline int pv_time_init(void)
+{
+   return 0;
+}
 #endif // CONFIG_PARAVIRT
 #endif
diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c
index 9044ed62045c..56182c64ab38 100644
--- a/arch/loongarch/kernel/paravirt.c
+++ b/arch/loongarch/kernel/paravirt.c
@@ -5,10 +5,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock;
 
 static u64 native_steal_clock(int cpu)
 {
@@ -17,6 +20,57 @@ static u64 native_steal_clock(int cpu)
 
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
 
+static bool steal_acc = true;
+static int __init parse_no_stealacc(char *arg)
+{
+   steal_acc = false;
+   return 0;
+}
+early_param("no-steal-acc", parse_no_stealacc);
+
+static u64 para_steal_clock(int cpu)
+{
+   u64 steal;
+   struct kvm_steal_time *src;
+   int version;
+
+	src = &per_cpu(steal_time, cpu);
+   do {
+
+   version = src->version;
+   /* Make sure that the version is read before the steal */
+   virt_rmb();
+   steal = src->steal;
+   /* Make sure that the steal is read before the next version */
+   virt_rmb();
+
+   } while ((version & 1) || (version != src->version));
+   return steal;
+}
+
+static int pv_register_steal_time(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_steal_time *st;
+   unsigned long addr;
+
+   if (!has_steal_clock)
+   return -EPERM;
+
+	st = &per_cpu(steal_time, cpu);
+   addr = per_cpu_ptr_to_phys(st);
+
+   /* The whole structure kvm_steal_time should be one page */
+   if (PFN_DOWN(addr) != PFN_DOWN(addr + sizeof(*st))) {
+   pr_warn("Illegal PV steal time addr %lx\n", addr);
+   return -EFAULT;
+   }
+
+   addr |= KVM_STEAL_PHYS_VALID;
+   kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, addr);
+   return 0;
+}
+
 #ifdef CONFIG_SMP
 static void pv_send_ipi_single(int cpu, unsigned int action)
 {
@@ -110,6 +164,32 @@ static void pv_init_ipi(void)
if (r < 0)
panic("SWI0 IRQ request failed\n");
 }
+
+static void pv_disable_steal_time(void)
+{
+   if (has_steal_clock)
+		kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, 0);
+}
+
+static int pv_cpu_online(unsigned int cpu)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   pv_register_steal_time();
+   local_irq_restore(flags);
+   return 0;
+}
+
+static int pv_cpu_down_prepare(unsigned int cpu)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   pv_disable_steal_time();
+   local_irq_restore(flags);
+   return 0;
+}
 #endif
 
 static bool kvm_para_available(void)
@@ -149,3 +229,53 @@ int __init pv_ipi_init(void)
 
return 1;
 }
+
+static void pv_cpu_reboot(void *unused)
+{
+   pv_disable_steal_time();
+}
+
+static int pv_reboot_notify(struct notifier_block *nb, unsigned long code,
+   void *unused)
+{
+   on_each_cpu(pv_cpu_reboot, NULL, 1);
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block pv_reboot_nb = {
+   .notifier_call  = pv_reboot_notify,
+};
+
+int __init pv_time_init(void)
+{
+   int feature;
+
+   if (!cpu_has_hypervisor)
+   return 0;
+   if (!kvm_para_available())
+   return 0;
+
+   feature = read_cpucfg(CPUCFG_KVM_FEATURE);
+   if (!(feature & KVM_FEATURE_STEAL_TIME))
+   return 0;
+
+   has_steal_clock = 1;
+   if (pv_register_steal_time()) {
+   has_steal_clock = 0;
+   return 0;
+   }
+
+	register_reboot_notifier(&pv_reboot_nb);
+   static_call_update(pv_steal_clock, 

[PATCH 1/2] LoongArch: KVM: Add steal time support in kvm side

2024-03-26 Thread Bibo Mao
The steal time feature is added here on the KVM side. A VM can query the
supported features provided by the KVM hypervisor; the feature
KVM_FEATURE_STEAL_TIME is added here. Like x86, the steal time structure
is saved in guest memory, and one hypercall function, KVM_HCALL_FUNC_NOTIFY,
is added to notify KVM to enable the feature.

One vCPU attr ioctl command, KVM_LOONGARCH_VCPU_PVTIME_CTRL, is added to
save and restore the base address of the steal time structure when a VM
is migrated.

It needs hypercall instruction emulation handling, so it depends on this
patchset:
https://lore.kernel.org/all/20240201031950.3225626-1-maob...@loongson.cn/

Signed-off-by: Bibo Mao 
---
 arch/loongarch/include/asm/kvm_host.h  |   7 ++
 arch/loongarch/include/asm/kvm_para.h  |  10 +++
 arch/loongarch/include/asm/loongarch.h |   1 +
 arch/loongarch/include/uapi/asm/kvm.h  |   4 +
 arch/loongarch/kvm/exit.c  |  35 ++--
 arch/loongarch/kvm/vcpu.c  | 120 +
 6 files changed, 172 insertions(+), 5 deletions(-)

diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index c53946f8ef9f..1d1eaa124349 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
 #define KVM_PRIVATE_MEM_SLOTS  0
 
 #define KVM_HALT_POLL_NS_DEFAULT   50
+#define KVM_REQ_RECORD_STEAL   KVM_ARCH_REQ(1)
 
 #define KVM_GUESTDBG_VALID_MASK(KVM_GUESTDBG_ENABLE | \
KVM_GUESTDBG_USE_SW_BP | KVM_GUESTDBG_SINGLESTEP)
@@ -199,6 +200,12 @@ struct kvm_vcpu_arch {
struct kvm_mp_state mp_state;
/* cpucfg */
u32 cpucfg[KVM_MAX_CPUCFG_REGS];
+   /* paravirt steal time */
+   struct {
+   u64 guest_addr;
+   u64 last_steal;
+   struct gfn_to_hva_cache cache;
+   } st;
 };
 
 static inline unsigned long readl_sw_gcsr(struct loongarch_csrs *csr, int reg)
diff --git a/arch/loongarch/include/asm/kvm_para.h 
b/arch/loongarch/include/asm/kvm_para.h
index 56775554402a..5fb89e20432d 100644
--- a/arch/loongarch/include/asm/kvm_para.h
+++ b/arch/loongarch/include/asm/kvm_para.h
@@ -12,6 +12,7 @@
 #define KVM_HCALL_CODE_SWDBG   1
 #define KVM_HCALL_PV_SERVICE   HYPERCALL_CODE(HYPERVISOR_KVM, 
KVM_HCALL_CODE_PV_SERVICE)
 #define  KVM_HCALL_FUNC_PV_IPI 1
+#define  KVM_HCALL_FUNC_NOTIFY 2
 #define KVM_HCALL_SWDBGHYPERCALL_CODE(HYPERVISOR_KVM, 
KVM_HCALL_CODE_SWDBG)
 
 /*
@@ -21,6 +22,15 @@
 #define KVM_HCALL_INVALID_CODE -1UL
 #define KVM_HCALL_INVALID_PARAMETER-2UL
 
+#define KVM_STEAL_PHYS_VALID   BIT_ULL(0)
+#define KVM_STEAL_PHYS_MASKGENMASK_ULL(63, 6)
+struct kvm_steal_time {
+   __u64 steal;
+   __u32 version;
+   __u32 flags;
+   __u32 pad[12];
+};
+
 /*
  * Hypercall interface for KVM hypervisor
  *
diff --git a/arch/loongarch/include/asm/loongarch.h 
b/arch/loongarch/include/asm/loongarch.h
index 0ad36704cb4b..ab6a5e93c280 100644
--- a/arch/loongarch/include/asm/loongarch.h
+++ b/arch/loongarch/include/asm/loongarch.h
@@ -168,6 +168,7 @@
 #define  KVM_SIGNATURE "KVM\0"
 #define CPUCFG_KVM_FEATURE (CPUCFG_KVM_BASE + 4)
 #define  KVM_FEATURE_PV_IPIBIT(1)
+#define  KVM_FEATURE_STEAL_TIMEBIT(2)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/loongarch/include/uapi/asm/kvm.h 
b/arch/loongarch/include/uapi/asm/kvm.h
index 8f78b23672ac..286b5ce93a57 100644
--- a/arch/loongarch/include/uapi/asm/kvm.h
+++ b/arch/loongarch/include/uapi/asm/kvm.h
@@ -80,7 +80,11 @@ struct kvm_fpu {
 #define LOONGARCH_REG_64(TYPE, REG)(TYPE | KVM_REG_SIZE_U64 | (REG << 
LOONGARCH_REG_SHIFT))
 #define KVM_IOC_CSRID(REG) LOONGARCH_REG_64(KVM_REG_LOONGARCH_CSR, 
REG)
 #define KVM_IOC_CPUCFG(REG)
LOONGARCH_REG_64(KVM_REG_LOONGARCH_CPUCFG, REG)
+
+/* Device Control API on vcpu fd */
 #define KVM_LOONGARCH_VCPU_CPUCFG  0
+#define KVM_LOONGARCH_VCPU_PVTIME_CTRL 1
+#define  KVM_LOONGARCH_VCPU_PVTIME_GPA 0
 
 struct kvm_debug_exit_arch {
 };
diff --git a/arch/loongarch/kvm/exit.c b/arch/loongarch/kvm/exit.c
index d71172e2568e..c774e5803f7f 100644
--- a/arch/loongarch/kvm/exit.c
+++ b/arch/loongarch/kvm/exit.c
@@ -209,7 +209,7 @@ int kvm_emu_idle(struct kvm_vcpu *vcpu)
 static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, larch_inst inst)
 {
int rd, rj;
-   unsigned int index;
+   unsigned int index, ret;
unsigned long plv;
 
rd = inst.reg2_format.rd;
@@ -240,10 +240,13 @@ static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, 
larch_inst inst)
vcpu->arch.gprs[rd] = 0;
break;
case CPUCFG_KVM_FEATURE:
-   if ((plv & CSR_CRMD_PLV) == PLV_KERN)
-   vcpu->arch.gprs[rd] = KVM_FEATURE_PV_IPI;
-   else
-   vcpu->arch.gprs[rd] = 0;
+   ret = 0;
+

[PATCH 2/2] LoongArch: Add steal time support in guest side

2024-03-26 Thread Bibo Mao
A percpu struct kvm_steal_time is added here; its size is 64 bytes and it
is also aligned to 64 bytes, so that the whole structure fits in one
physical page.

When a vCPU is onlined, the function pv_register_steal_time() is called.
This function passes the physical address of struct kvm_steal_time and
tells the hypervisor to enable steal time. When a vCPU goes offline, the
physical address is set to 0 to tell the hypervisor to disable steal time.

Signed-off-by: Bibo Mao 
---
 arch/loongarch/include/asm/paravirt.h |   5 +
 arch/loongarch/kernel/paravirt.c  | 130 ++
 arch/loongarch/kernel/time.c  |   2 +
 3 files changed, 137 insertions(+)

diff --git a/arch/loongarch/include/asm/paravirt.h b/arch/loongarch/include/asm/paravirt.h
index 58f7b7b89f2c..fe27fb5e82b8 100644
--- a/arch/loongarch/include/asm/paravirt.h
+++ b/arch/loongarch/include/asm/paravirt.h
@@ -17,11 +17,16 @@ static inline u64 paravirt_steal_clock(int cpu)
 }
 
 int pv_ipi_init(void);
+int __init pv_time_init(void);
 #else
 static inline int pv_ipi_init(void)
 {
return 0;
 }
 
+static inline int pv_time_init(void)
+{
+   return 0;
+}
 #endif // CONFIG_PARAVIRT
 #endif
diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c
index 9044ed62045c..56182c64ab38 100644
--- a/arch/loongarch/kernel/paravirt.c
+++ b/arch/loongarch/kernel/paravirt.c
@@ -5,10 +5,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock;
 
 static u64 native_steal_clock(int cpu)
 {
@@ -17,6 +20,57 @@ static u64 native_steal_clock(int cpu)
 
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
 
+static bool steal_acc = true;
+static int __init parse_no_stealacc(char *arg)
+{
+   steal_acc = false;
+   return 0;
+}
+early_param("no-steal-acc", parse_no_stealacc);
+
+static u64 para_steal_clock(int cpu)
+{
+   u64 steal;
+   struct kvm_steal_time *src;
+   int version;
+
+	src = &per_cpu(steal_time, cpu);
+   do {
+
+   version = src->version;
+   /* Make sure that the version is read before the steal */
+   virt_rmb();
+   steal = src->steal;
+   /* Make sure that the steal is read before the next version */
+   virt_rmb();
+
+   } while ((version & 1) || (version != src->version));
+   return steal;
+}
+
+static int pv_register_steal_time(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_steal_time *st;
+   unsigned long addr;
+
+   if (!has_steal_clock)
+   return -EPERM;
+
+	st = &per_cpu(steal_time, cpu);
+   addr = per_cpu_ptr_to_phys(st);
+
+	/* The whole structure kvm_steal_time must stay within one physical page */
+	if (PFN_DOWN(addr) != PFN_DOWN(addr + sizeof(*st) - 1)) {
+   pr_warn("Illegal PV steal time addr %lx\n", addr);
+   return -EFAULT;
+   }
+
+   addr |= KVM_STEAL_PHYS_VALID;
+   kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, addr);
+   return 0;
+}
+
 #ifdef CONFIG_SMP
 static void pv_send_ipi_single(int cpu, unsigned int action)
 {
@@ -110,6 +164,32 @@ static void pv_init_ipi(void)
if (r < 0)
panic("SWI0 IRQ request failed\n");
 }
+
+static void pv_disable_steal_time(void)
+{
+   if (has_steal_clock)
+		kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, 0);
+}
+
+static int pv_cpu_online(unsigned int cpu)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   pv_register_steal_time();
+   local_irq_restore(flags);
+   return 0;
+}
+
+static int pv_cpu_down_prepare(unsigned int cpu)
+{
+   unsigned long flags;
+
+   local_irq_save(flags);
+   pv_disable_steal_time();
+   local_irq_restore(flags);
+   return 0;
+}
 #endif
 
 static bool kvm_para_available(void)
@@ -149,3 +229,53 @@ int __init pv_ipi_init(void)
 
return 1;
 }
+
+static void pv_cpu_reboot(void *unused)
+{
+   pv_disable_steal_time();
+}
+
+static int pv_reboot_notify(struct notifier_block *nb, unsigned long code,
+   void *unused)
+{
+   on_each_cpu(pv_cpu_reboot, NULL, 1);
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block pv_reboot_nb = {
+   .notifier_call  = pv_reboot_notify,
+};
+
+int __init pv_time_init(void)
+{
+   int feature;
+
+   if (!cpu_has_hypervisor)
+   return 0;
+   if (!kvm_para_available())
+   return 0;
+
+   feature = read_cpucfg(CPUCFG_KVM_FEATURE);
+   if (!(feature & KVM_FEATURE_STEAL_TIME))
+   return 0;
+
+   has_steal_clock = 1;
+   if (pv_register_steal_time()) {
+   has_steal_clock = 0;
+   return 0;
+   }
+
+	register_reboot_notifier(&pv_reboot_nb);
+   static_call_update(pv_steal_clock, 
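The retry loop in para_steal_clock() above pairs with a seqcount-like
update protocol on the hypervisor side (implemented in patch 1/2 of
this series, not shown here). A minimal sketch of the writer's half,
with illustrative names: the version is made odd while an update is in
flight and even again once it completes, so a reader retries whenever
it sees an odd or changed version.

/* Illustrative writer side of the version protocol; the real update
 * lives in the KVM host patch, this is only a sketch. */
static void steal_time_writer_update(struct kvm_steal_time *st, u64 steal)
{
	st->version += 1;	/* now odd: update in progress */
	smp_wmb();		/* order the version bump before the data */
	st->steal = steal;
	smp_wmb();		/* order the data before the final bump */
	st->version += 1;	/* now even: update complete */
}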

[PATCH 0/2] LoongArch: Add steal time support

2024-03-26 Thread Bibo Mao
The para-virt steal time feature is added on both the kvm side and the
guest kernel side. As on other architectures, the steal time structure
comes from guest memory, and a pseudo register is used to save/restore
the base address of the steal time structure, so that vm migration is
supported as well.

Bibo Mao (2):
  LoongArch: KVM: Add steal time support in kvm side
  LoongArch: Add steal time support in guest side

 arch/loongarch/include/asm/kvm_host.h  |   7 ++
 arch/loongarch/include/asm/kvm_para.h  |  10 ++
 arch/loongarch/include/asm/loongarch.h |   1 +
 arch/loongarch/include/asm/paravirt.h  |   5 +
 arch/loongarch/include/uapi/asm/kvm.h  |   4 +
 arch/loongarch/kernel/paravirt.c   | 130 +
 arch/loongarch/kernel/time.c   |   2 +
 arch/loongarch/kvm/exit.c  |  35 ++-
 arch/loongarch/kvm/vcpu.c  | 120 +++
 9 files changed, 309 insertions(+), 5 deletions(-)


base-commit: 2ac2b1665d3fbec6ca709dd6ef3ea05f4a51ee4c
-- 
2.39.3
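On the migration point: patch 1/2 exposes the steal-time base address
as a vcpu device attribute (KVM_LOONGARCH_VCPU_PVTIME_CTRL /
KVM_LOONGARCH_VCPU_PVTIME_GPA in the uapi header). Assuming it follows
the usual KVM device-attr convention, a VMM could save and restore it
across migration roughly as below; this is a sketch, not code from the
series.

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* req is KVM_GET_DEVICE_ATTR on the source and KVM_SET_DEVICE_ATTR
 * on the target; *gpa holds the steal-time base address. */
static int pvtime_gpa_xfer(int vcpu_fd, unsigned long req, __u64 *gpa)
{
	struct kvm_device_attr attr = {
		.group = KVM_LOONGARCH_VCPU_PVTIME_CTRL,
		.attr  = KVM_LOONGARCH_VCPU_PVTIME_GPA,
		.addr  = (__u64)(unsigned long)gpa,
	};

	return ioctl(vcpu_fd, req, &attr);
}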




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-26 Thread Michael S. Tsirkin
On Mon, Mar 25, 2024 at 05:34:29PM +1000, Gavin Shan wrote:
> 
> On 3/20/24 17:14, Michael S. Tsirkin wrote:
> > On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> > > On 3/20/24 10:49, Michael S. Tsirkin wrote:
> > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > index 6f7e5010a673..79456706d0bd 100644
> > > > --- a/drivers/virtio/virtio_ring.c
> > > > +++ b/drivers/virtio/virtio_ring.c
> > > > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> > > > 	/* Put entry in available array (but don't update avail->idx until they
> > > > 	 * do sync). */
> > > > 	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > > > -	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > > > +	u16 headwithflag = head | (vq->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> > > > +	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
> > > > 	/* Descriptors and available array need to be set before we expose the
> > > > 	 * new available array entries. */
> > > > 
> 
> Ok, Michael. I continued with my debugging code. It still looks like a
> hardware bug on NVidia's grace-hopper. I really think NVidia needs to be
> involved for the discussion, as suggested by you.

Do you have a support contact at Nvidia to report this?

> Firstly, I bind the vhost process and vCPU thread to CPU#71 and CPU#70.
> Note that I have only one vCPU in my configuration.

Interesting but is guest built with CONFIG_SMP set?

> Secondly, the debugging code is enhanced so that the available head for
> (last_avail_idx - 1) is read for twice and recorded. It means the available
> head for one specific available index is read for twice. I do see the
> available heads are different from the consecutive reads. More details
> are shared as below.
> 
> From the guest side
> ===
> 
> virtio_net virtio0: output.0:id 86 is not a head!
> head to be released: 047 062 112
> 
> avail_idx:
> 000  49665
> 001  49666  <--
>  :
> 015  49664

what are these #s 49665 and so on?
and how large is the ring?
I am guessing 49664 is the index, the ring size is 16, and
49664 % 16 == 0
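(For concreteness, the slot is the free-running index masked by the
power-of-two ring size, per the virtqueue_add_split() hunk quoted
earlier; a sketch:

static u16 split_ring_slot(u16 avail_idx_shadow, unsigned int num)
{
	/* num is a power of two, so masking equals modulo:
	 * with num == 16, 49664 & 15 == 0. */
	return avail_idx_shadow & (num - 1);
}

so 49664 would indeed land in slot 0 of a 16-entry ring.)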

> avail_head:


is this the avail ring contents?

> 000  062
> 001  047  <--
>  :
> 015  112


What are these arrows pointing at, btw?


> From the host side
> ==
> 
> avail_idx
> 000  49663
> 001  49666  <---
>  :
> 
> avail_head
> 000  062  (062)
> 001  047  (047)  <---
>  :
> 015  086  (112)  // head 086 is returned from the first read,
>  // but head 112 is returned from the second read
> 
> vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx 49664
> 
> Thanks,
> Gavin

OK thanks so this proves it is actually the avail ring value.

-- 
MST




Re: [PATCH net-next v3 1/2] net: port TP_STORE_ADDR_PORTS_SKB macro to be tcp/udp independent

2024-03-26 Thread Jason Xing
On Mon, Mar 25, 2024 at 6:29 PM Balazs Scheidler  wrote:
>
> This patch moves TP_STORE_ADDR_PORTS_SKB() to a common header and removes
> the TCP specific implementation details.
>
> Previously the macro assumed the skb passed as an argument is a
> TCP packet, the implementation now uses an argument to the L4 header and
> uses that to extract the source/destination ports, which happen
> to be named the same in "struct tcphdr" and "struct udphdr"
>
> Signed-off-by: Balazs Scheidler 

The patch itself looks good to me, feel free to add:
Reviewed-by: Jason Xing 
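For readers outside the thread, the trick the patch relies on can be
sketched like this (illustrative macro, not the actual tracepoint
code): struct tcphdr and struct udphdr both name their port fields
source and dest, so an untyped macro parameter pointing at the L4
header expands correctly for either protocol.

/* Sketch only: works for both struct tcphdr * and struct udphdr *
 * because the member names match. */
#define TP_STORE_PORTS_SKETCH(__entry, l4h)		\
	do {						\
		__entry->sport = ntohs((l4h)->source);	\
		__entry->dport = ntohs((l4h)->dest);	\
	} while (0)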



Re: [PATCH net-next v3 2/2] net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb

2024-03-26 Thread Jason Xing
On Tue, Mar 26, 2024 at 10:28 AM Jakub Kicinski  wrote:
>
> On Mon, 25 Mar 2024 11:29:18 +0100 Balazs Scheidler wrote:
> > +memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
> > +memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
>
> Indent with tabs please, checkpatch says:
>
> ERROR: code indent should use tabs where possible
> #59: FILE: include/trace/events/udp.h:38:
> +memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));$
>
> WARNING: please, no spaces at the start of a line
> #59: FILE: include/trace/events/udp.h:38:
> +memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));$
>
> ERROR: code indent should use tabs where possible
> #60: FILE: include/trace/events/udp.h:39:
> +memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));$
>
> WARNING: please, no spaces at the start of a line
> #60: FILE: include/trace/events/udp.h:39:
> +memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));$

More than this, it would be better to put "From Balazs Scheidler
" in the first line of each patch to
eliminate the mismatched email address warning.

Link (Jakub referred to):
https://patchwork.kernel.org/project/netdevbpf/patch/34a9c221a6d644f18c826a1beddba58af6b7a64c.1711361723.git.balazs.scheid...@axoflow.com/
Detailed info: 
https://netdev.bots.linux.dev/static/nipa/837832/13601927/checkpatch/stdout

> --
> pw-bot: cr
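The fix checkpatch is asking for is mechanical; the two flagged lines
re-indented with leading tabs would look like this (sketch of the
requested change, context abbreviated):

	memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
	memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));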



Re: [External] Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-03-26 Thread Ho-Ren (Jack) Chuang
On Mon, Mar 25, 2024 at 8:08 PM Huang, Ying  wrote:
>
> "Ho-Ren (Jack) Chuang"  writes:
>
> > On Fri, Mar 22, 2024 at 1:41 AM Huang, Ying  wrote:
> >>
> >> "Ho-Ren (Jack) Chuang"  writes:
> >>
> >> > The current implementation treats emulated memory devices, such as
> >> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal 
> >> > memory
> >> > (E820_TYPE_RAM). However, these emulated devices have different
> >> > characteristics than traditional DRAM, making it important to
> >> > distinguish them. Thus, we modify the tiered memory initialization 
> >> > process
> >> > to introduce a delay specifically for CPUless NUMA nodes. This delay
> >> > ensures that the memory tier initialization for these nodes is deferred
> >> > until HMAT information is obtained during the boot process. Finally,
> >> > demotion tables are recalculated at the end.
> >> >
> >> > * late_initcall(memory_tier_late_init);
> >> > Some device drivers may have initialized memory tiers between
> >> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
> >> > online memory nodes and configuring memory tiers. They should be excluded
> >> > in the late init.
> >> >
> >> > * Handle cases where there is no HMAT when creating memory tiers
> >> > There is a scenario where a CPUless node does not provide HMAT 
> >> > information.
> >> > If no HMAT is specified, it falls back to using the default DRAM tier.
> >> >
> >> > * Introduce another new lock `default_dram_perf_lock` for adist 
> >> > calculation
> >> > In the current implementation, iterating through CPUlist nodes requires
> >> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end 
> >> > up
> >> > trying to acquire the same lock, leading to a potential deadlock.
> >> > Therefore, we propose introducing a standalone `default_dram_perf_lock` 
> >> > to
> >> > protect `default_dram_perf_*`. This approach not only avoids deadlock
> >> > but also prevents holding a large lock simultaneously.
> >> >
> >> > * Upgrade `set_node_memory_tier` to support additional cases, including
> >> >   default DRAM, late CPUless, and hot-plugged initializations.
> >> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> >> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
> >> > handle cases where memtype is not initialized and where HMAT information 
> >> > is
> >> > available.
> >> >
> >> > * Introduce `default_memory_types` for those memory types that are not
> >> >   initialized by device drivers.
> >> > Because late initialized memory and default DRAM memory need to be 
> >> > managed,
> >> > a default memory type is created for storing all memory types that are
> >> > not initialized by device drivers and as a fallback.
> >> >
> >> > Signed-off-by: Ho-Ren (Jack) Chuang 
> >> > Signed-off-by: Hao Xiang 
> >> > ---
> >> >  mm/memory-tiers.c | 73 ---
> >> >  1 file changed, 63 insertions(+), 10 deletions(-)
> >> >
> >> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> >> > index 974af10cfdd8..9396330fa162 100644
> >> > --- a/mm/memory-tiers.c
> >> > +++ b/mm/memory-tiers.c
> >> > @@ -36,6 +36,11 @@ struct node_memory_type_map {
> >> >
> >> >  static DEFINE_MUTEX(memory_tier_lock);
> >> >  static LIST_HEAD(memory_tiers);
> >> > +/*
> >> > + * The list is used to store all memory types that are not created
> >> > + * by a device driver.
> >> > + */
> >> > +static LIST_HEAD(default_memory_types);
> >> >  static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
> >> >  struct memory_dev_type *default_dram_type;
> >> >
> >> > @@ -108,6 +113,7 @@ static struct demotion_nodes *node_demotion 
> >> > __read_mostly;
> >> >
> >> >  static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);
> >> >
> >> > +static DEFINE_MUTEX(default_dram_perf_lock);
> >>
> >> Better to add comments about what is protected by this lock.
> >>
> >
> > Thank you. I will add a comment like this:
> > + /* The lock is used to protect `default_dram_perf*` info and nid. */
> > +static DEFINE_MUTEX(default_dram_perf_lock);
> >
> > I also found an error path was not handled and
> > found the lock could be put closer to what it protects.
> > I will have them fixed in V5.
> >
> >> >  static bool default_dram_perf_error;
> >> >  static struct access_coordinate default_dram_perf;
> >> >  static int default_dram_perf_ref_nid = NUMA_NO_NODE;
> >> > @@ -505,7 +511,8 @@ static inline void __init_node_memory_type(int node, 
> >> > struct memory_dev_type *mem
> >> >  static struct memory_tier *set_node_memory_tier(int node)
> >> >  {
> >> >   struct memory_tier *memtier;
> >> > - struct memory_dev_type *memtype;
> >> > + struct memory_dev_type *mtype;
> >>
> >> mtype may be referenced without initialization now below.
> >>
> >
> > Good catch! Thank you.
> >
> > Please check below.
> > I may have found a potential NULL pointer dereference.
> >
> >> > + int adist = MEMTIER_ADISTANCE_DRAM;
> >> >
