[PATCH v5 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
The current implementation treats emulated memory devices, such as CXL 1.1
type 3 memory, as normal DRAM when they are emulated as normal memory
(E820_TYPE_RAM). However, these emulated devices have different
characteristics than traditional DRAM, making it important to distinguish
them. Thus, we modify the tiered memory initialization process to introduce
a delay specifically for CPUless NUMA nodes. This delay ensures that the
memory tier initialization for these nodes is deferred until HMAT
information is obtained during the boot process. Finally, demotion tables
are recalculated at the end.

* late_initcall(memory_tier_late_init);
  Some device drivers may have initialized memory tiers between
  `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing
  online memory nodes and configuring memory tiers. They should be excluded
  from the late init.

* Handle cases where there is no HMAT when creating memory tiers.
  There is a scenario in which a CPUless node does not provide HMAT
  information. If no HMAT is specified, it falls back to using the default
  DRAM tier.

* Introduce a new lock, `default_dram_perf_lock`, for the adist calculation.
  In the current implementation, iterating through CPUless nodes requires
  holding `memory_tier_lock`. However, `mt_calc_adistance()` ends up trying
  to acquire the same lock, leading to a potential deadlock. We therefore
  introduce a standalone `default_dram_perf_lock` to protect
  `default_dram_perf_*`. This approach not only avoids the deadlock but also
  avoids holding one large lock for everything. In addition, this patch
  slightly shortens the lock hold time by placing the lock acquisition
  closer to what it protects.

* Upgrade `set_node_memory_tier` to support additional cases, including
  default DRAM, late CPUless, and hot-plugged initializations.
  To cover hot-plugged memory nodes, `mt_calc_adistance()` and
  `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to
  handle cases where memtype is not initialized and where HMAT information
  is available.

* Introduce `default_memory_types` for those memory types that are not
  initialized by device drivers.
  Because late-initialized memory and default DRAM memory need to be
  managed, a default memory type list is created for storing all memory
  types that are not initialized by device drivers, and as a fallback.

* Fix a deadlock bug in `mt_perf_to_adistance`.
  Because an error path was not handled properly in `mt_perf_to_adistance`,
  unlock before returning the error.

Signed-off-by: Ho-Ren (Jack) Chuang
Signed-off-by: Hao Xiang
---
 mm/memory-tiers.c | 85 +++
 1 file changed, 72 insertions(+), 13 deletions(-)

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 974af10cfdd8..610db9581ba4 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -36,6 +36,11 @@ struct node_memory_type_map {

 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+/*
+ * The list is used to store all memory types that are not created
+ * by a device driver.
+ */
+static LIST_HEAD(default_memory_types);
 static struct node_memory_type_map node_memory_types[MAX_NUMNODES];
 struct memory_dev_type *default_dram_type;

@@ -108,6 +113,8 @@ static struct demotion_nodes *node_demotion __read_mostly;

 static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms);

+/* The lock is used to protect `default_dram_perf*` info and nid. */
+static DEFINE_MUTEX(default_dram_perf_lock);
 static bool default_dram_perf_error;
 static struct access_coordinate default_dram_perf;
 static int default_dram_perf_ref_nid = NUMA_NO_NODE;
@@ -505,7 +512,8 @@ static inline void __init_node_memory_type(int node, struct memory_dev_type *memtype)
 static struct memory_tier *set_node_memory_tier(int node)
 {
 	struct memory_tier *memtier;
-	struct memory_dev_type *memtype;
+	struct memory_dev_type *mtype = default_dram_type;
+	int adist = MEMTIER_ADISTANCE_DRAM;
 	pg_data_t *pgdat = NODE_DATA(node);

@@ -514,11 +522,20 @@ static struct memory_tier *set_node_memory_tier(int node)
 	if (!node_state(node, N_MEMORY))
 		return ERR_PTR(-EINVAL);

-	__init_node_memory_type(node, default_dram_type);
+	mt_calc_adistance(node, &adist);
+	if (node_memory_types[node].memtype == NULL) {
+		mtype = mt_find_alloc_memory_type(adist, &default_memory_types);
+		if (IS_ERR(mtype)) {
+			mtype = default_dram_type;
+			pr_info("Failed to allocate a memory type. Fall back.\n");
+		}
+	}

-	memtype = node_memory_types[node].memtype;
-	node_set(node, memtype->nodes);
-	memtier = find_create_memory_tier(memtype);
+	__init_node_memory_type(node, mtype);
+
+	mtype = node_memory_types[node].memtype;
+	node_set(node, mtype->nodes);
+	memtier = find_create_memory_tier(mtype);
 	if (!IS_ERR(memtier))
[PATCH v5 1/2] memory tier: dax/kmem: introduce an abstract layer for finding, allocating, and putting memory types
Since different memory devices require finding, allocating, and putting
memory types, these common steps are abstracted in this patch, enhancing
the scalability and conciseness of the code.

Signed-off-by: Ho-Ren (Jack) Chuang
---
 drivers/dax/kmem.c           | 20 ++--
 include/linux/memory-tiers.h | 13 +
 mm/memory-tiers.c            | 32
 3 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 42ee360cf4e3..01399e5b53b2 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);

 static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
 {
-	bool found = false;
 	struct memory_dev_type *mtype;

 	mutex_lock(&kmem_memory_type_lock);
-	list_for_each_entry(mtype, &kmem_memory_types, list) {
-		if (mtype->adistance == adist) {
-			found = true;
-			break;
-		}
-	}
-	if (!found) {
-		mtype = alloc_memory_type(adist);
-		if (!IS_ERR(mtype))
-			list_add(&mtype->list, &kmem_memory_types);
-	}
+	mtype = mt_find_alloc_memory_type(adist, &kmem_memory_types);
 	mutex_unlock(&kmem_memory_type_lock);

 	return mtype;
@@ -77,13 +66,8 @@ static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)

 static void kmem_put_memory_types(void)
 {
-	struct memory_dev_type *mtype, *mtn;
-
 	mutex_lock(&kmem_memory_type_lock);
-	list_for_each_entry_safe(mtype, mtn, &kmem_memory_types, list) {
-		list_del(&mtype->list);
-		put_memory_type(mtype);
-	}
+	mt_put_memory_types(&kmem_memory_types);
 	mutex_unlock(&kmem_memory_type_lock);
 }

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 69e781900082..a44c03c2ba3a 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -48,6 +48,9 @@ int mt_calc_adistance(int node, int *adist);
 int mt_set_default_dram_perf(int nid, struct access_coordinate *perf,
 			     const char *source);
 int mt_perf_to_adistance(struct access_coordinate *perf, int *adist);
+struct memory_dev_type *mt_find_alloc_memory_type(int adist,
+						  struct list_head *memory_types);
+void mt_put_memory_types(struct list_head *memory_types);
 #ifdef CONFIG_MIGRATION
 int next_demotion_node(int node);
 void node_get_allowed_targets(pg_data_t *pgdat, nodemask_t *targets);
@@ -136,5 +139,15 @@ static inline int mt_perf_to_adistance(struct access_coordinate *perf, int *adist)
 {
 	return -EIO;
 }
+
+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
+{
+	return NULL;
+}
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+
+}
 #endif	/* CONFIG_NUMA */
 #endif  /* _LINUX_MEMORY_TIERS_H */

diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 0537664620e5..974af10cfdd8 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -623,6 +623,38 @@ void clear_node_memory_type(int node, struct memory_dev_type *memtype)
 }
 EXPORT_SYMBOL_GPL(clear_node_memory_type);

+struct memory_dev_type *mt_find_alloc_memory_type(int adist, struct list_head *memory_types)
+{
+	bool found = false;
+	struct memory_dev_type *mtype;
+
+	list_for_each_entry(mtype, memory_types, list) {
+		if (mtype->adistance == adist) {
+			found = true;
+			break;
+		}
+	}
+	if (!found) {
+		mtype = alloc_memory_type(adist);
+		if (!IS_ERR(mtype))
+			list_add(&mtype->list, memory_types);
+	}
+
+	return mtype;
+}
+EXPORT_SYMBOL_GPL(mt_find_alloc_memory_type);
+
+void mt_put_memory_types(struct list_head *memory_types)
+{
+	struct memory_dev_type *mtype, *mtn;
+
+	list_for_each_entry_safe(mtype, mtn, memory_types, list) {
+		list_del(&mtype->list);
+		put_memory_type(mtype);
+	}
+}
+EXPORT_SYMBOL_GPL(mt_put_memory_types);
+
 static void dump_hmem_attrs(struct access_coordinate *coord, const char *prefix)
 {
 	pr_info(

--
Ho-Ren (Jack) Chuang
[PATCH v5 0/2] Improved Memory Tier Creation for CPUless NUMA Nodes
When a memory device, such as CXL 1.1 type 3 memory, is emulated as normal
memory (E820_TYPE_RAM), the memory device is indistinguishable from normal
DRAM in terms of memory tiering with the current implementation. The
current memory tiering assigns all detected normal memory nodes to the
same DRAM tier. This results in normal memory devices with different
attributes being unable to be assigned to the correct memory tier, leading
to the inability to migrate pages between different types of memory.

https://lore.kernel.org/linux-mm/ph0pr08mb7955e9f08ccb64f23963b5c3a8...@ph0pr08mb7955.namprd08.prod.outlook.com/T/

This patchset automatically resolves the issues. It delays the
initialization of memory tiers for CPUless NUMA nodes until they obtain
HMAT information and after all devices are initialized at boot time,
eliminating the need for user intervention. If no HMAT is specified, it
falls back to using `default_dram_type`.

Example use case: we have CXL memory on the host, and we create VMs with a
new system memory device backed by host CXL memory. We inject CXL memory
performance attributes through QEMU, and the guest now sees memory nodes
with performance attributes in HMAT. With this change, we enable the guest
kernel to construct the correct memory tiering for the memory nodes.
-v5:
   Thanks to Ying's comments,
   * Add comments about what is protected by `default_dram_perf_lock`
   * Fix an uninitialized pointer mtype
   * Slightly shorten the time holding `default_dram_perf_lock`
   * Fix a deadlock bug in `mt_perf_to_adistance`
-v4:
   Thanks to Ying's comments,
   * Remove redundant code
   * Reorganize patches accordingly
   * https://lore.kernel.org/lkml/20240322070356.315922-1-horenchu...@bytedance.com/T/#u
-v3:
   Thanks to Ying's comments,
   * Make the newly added code independent of HMAT
   * Upgrade set_node_memory_tier to support more cases
   * Put all non-driver-initialized memory types into default_memory_types
     instead of using hmat_memory_types
   * find_alloc_memory_type -> mt_find_alloc_memory_type
   * https://lore.kernel.org/lkml/20240320061041.3246828-1-horenchu...@bytedance.com/T/#u
-v2:
   Thanks to Ying's comments,
   * Rewrite cover letter & patch description
   * Rename functions, don't use _hmat
   * Abstract common functions into find_alloc_memory_type()
   * Use the expected way to use set_node_memory_tier instead of modifying it
   * https://lore.kernel.org/lkml/20240312061729.1997111-1-horenchu...@bytedance.com/T/#u
-v1:
   * https://lore.kernel.org/lkml/20240301082248.3456086-1-horenchu...@bytedance.com/T/#u

Ho-Ren (Jack) Chuang (2):
  memory tier: dax/kmem: introduce an abstract layer for finding,
    allocating, and putting memory types
  memory tier: create CPUless memory tiers after obtaining HMAT info

 drivers/dax/kmem.c           |  20 +-
 include/linux/memory-tiers.h |  13
 mm/memory-tiers.c            | 117 +++
 3 files changed, 119 insertions(+), 31 deletions(-)

--
Ho-Ren (Jack) Chuang
Re: [PATCH v2 2/2] vhost: Add smp_rmb() in vhost_enable_notify()
On 3/27/24 12:41, Jason Wang wrote:
> On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan wrote:
> >
> > A smp_rmb() has been missed in vhost_enable_notify(), inspired by
> > Will Deacon. Otherwise, it's not ensured the available ring entries
> > pushed by guest can be observed by vhost in time, leading to stale
> > available ring entries fetched by vhost in vhost_get_vq_desc(), as
> > reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.
> >
> >   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> >   -accel kvm -machine virt,gic-version=host -cpu host \
> >   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
> >   -m 4096M,slots=16,maxmem=64G \
> >   -object memory-backend-ram,id=mem0,size=4096M \
> >    : \
> >   -netdev tap,id=vnet0,vhost=true \
> >   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
> >    :
> >   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
> >   virtio_net virtio0: output.0:id 100 is not a head!
> >
> > Add the missed smp_rmb() in vhost_enable_notify(). Note that it
> > should be safe until vq->avail_idx is changed by commit d3bb267bbdcb
> > ("vhost: cache avail index in vhost_enable_notify()").
> >
> > Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
> > Cc:  # v5.18+
> > Reported-by: Yihuang Yu
> > Signed-off-by: Gavin Shan
> > ---
> >  drivers/vhost/vhost.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 00445ab172b3..58f9d6a435f0 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2847,9 +2847,18 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> >  			       &vq->avail->idx, r);
> >  		return false;
> >  	}
> > +
> >  	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> > +	if (vq->avail_idx != vq->last_avail_idx) {
> > +		/* Similar to what's done in vhost_get_vq_desc(), we need
> > +		 * to ensure the available ring entries have been exposed
> > +		 * by guest.
> > +		 */
> > +		smp_rmb();
> > +		return true;
> > +	}
> >
> > -	return vq->avail_idx != vq->last_avail_idx;
> > +	return false;
>
> So we only care about the case when vhost_enable_notify() returns true.
> In that case, I think you want to order with vhost_get_vq_desc():
>
>   last_avail_idx = vq->last_avail_idx;
>   if (vq->avail_idx == vq->last_avail_idx) { /* false */ }
>   vhost_get_avail_head(vq, &ring_head, last_avail_idx)
>
> Assuming I understand the patch correctly.
>
> Acked-by: Jason Wang

Jason, thanks for your review and comments. Your understanding is exactly
what I understood.

> >  }
> >  EXPORT_SYMBOL_GPL(vhost_enable_notify);

Thanks,
Gavin
Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()
On 3/27/24 12:44, Jason Wang wrote:
> On Wed, Mar 27, 2024 at 10:34 AM Jason Wang wrote:
> > On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan wrote:
> > >
> > > A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> > > Will Deacon. Otherwise, it's not ensured the available ring entries
> > > pushed by guest can be observed by vhost in time, leading to stale
> > > available ring entries fetched by vhost in vhost_get_vq_desc(), as
> > > reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.
> > >
> > >   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> > >   -accel kvm -machine virt,gic-version=host -cpu host \
> > >   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
> > >   -m 4096M,slots=16,maxmem=64G \
> > >   -object memory-backend-ram,id=mem0,size=4096M \
> > >    : \
> > >   -netdev tap,id=vnet0,vhost=true \
> > >   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
> > >    :
> > >   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
> > >   virtio_net virtio0: output.0:id 100 is not a head!
> > >
> > > Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> > > should be safe until vq->avail_idx is changed by commit 275bf960ac697
> > > ("vhost: better detection of available buffers").
> > >
> > > Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> > > Cc:  # v4.11+
> > > Reported-by: Yihuang Yu
> > > Signed-off-by: Gavin Shan
> > > ---
> > >  drivers/vhost/vhost.c | 11 ++-
> > >  1 file changed, 10 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 045f666b4f12..00445ab172b3 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> > >  	r = vhost_get_avail_idx(vq, &avail_idx);
> > >  	if (unlikely(r))
> > >  		return false;
> > > +
> > >  	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> > > +	if (vq->avail_idx != vq->last_avail_idx) {
> > > +		/* Similar to what's done in vhost_get_vq_desc(), we need
> > > +		 * to ensure the available ring entries have been exposed
> > > +		 * by guest.
> > > +		 */
> >
> > We need to be more verbose here. For example, which load needs to be
> > ordered with which load.
> >
> > The rmb in vhost_get_vq_desc() is used to order the load of avail idx
> > and the load of head. It is paired with e.g. virtio_wmb() in
> > virtqueue_add_split().
> >
> > vhost_vq_avail_empty() is mostly used as a hint in
> > vhost_net_busy_poll(), which is under the protection of the vq mutex.
> > An exception is tx_can_batch(), but in that case it doesn't even want
> > to read the head.
>
> Ok, if it is needed only in that path, maybe we can move the barriers
> there.

[cc Will Deacon]

Jason, I appreciate your review and comments. I think PATCH[1/2] is the
fix for the hypothesis, meaning PATCH[2/2] is the real fix. However, it
would be nice to fix all of them in one shot. I will try with PATCH[2/2]
only to see if our issue disappears or not. However, the issue still
exists if PATCH[2/2] is missed.

Firstly, we were failing on the transmit queue and
{tvq, rvq}->busyloop_timeout == false, if I remember correctly. So the
added smp_rmb() in vhost_vq_avail_empty() is only a concern to
tx_can_batch().

A mutex isn't enough to ensure the order for the available index and the
available ring entry (head). For example, vhost_vq_avail_empty() called by
tx_can_batch() can see the next available index, but its corresponding
available ring entry (head) may not be seen by vhost yet if smp_rmb() is
missed. In the next call to get_tx_bufs(), where the available ring entry
(head) hasn't arrived yet, a stale available ring entry (head) is fetched.

  handle_tx_copy
    get_tx_bufs            // smp_rmb() won't be executed when
                           // vq->avail_idx != vq->last_avail_idx
    tx_can_batch
      vhost_vq_avail_empty // vq->avail_idx is updated from vq->avail->idx

The reason why I added smp_rmb() to vhost_vq_avail_empty() is that the
function is an exposed API, even though it's only used by
drivers/vhost/net.c at present. It means the API has been broken
internally, so it seems more appropriate to fix it in
vhost_vq_avail_empty() so that the API's users needn't worry about memory
access order.

> > > +		smp_rmb();
> > > +		return false;
> > > +	}
> > >
> > > -	return vq->avail_idx == vq->last_avail_idx;
> > > +	return true;
> > >  }
> > >  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

Thanks,
Gavin
Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()
On Wed, Mar 27, 2024 at 10:34 AM Jason Wang wrote:
>
> On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan wrote:
> >
> > A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> > Will Deacon. Otherwise, it's not ensured the available ring entries
> > pushed by guest can be observed by vhost in time, leading to stale
> > available ring entries fetched by vhost in vhost_get_vq_desc(), as
> > reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.
> >
> >   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
> >   -accel kvm -machine virt,gic-version=host -cpu host \
> >   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
> >   -m 4096M,slots=16,maxmem=64G \
> >   -object memory-backend-ram,id=mem0,size=4096M \
> >    : \
> >   -netdev tap,id=vnet0,vhost=true \
> >   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
> >    :
> >   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
> >   virtio_net virtio0: output.0:id 100 is not a head!
> >
> > Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> > should be safe until vq->avail_idx is changed by commit 275bf960ac697
> > ("vhost: better detection of available buffers").
> >
> > Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> > Cc:  # v4.11+
> > Reported-by: Yihuang Yu
> > Signed-off-by: Gavin Shan
> > ---
> >  drivers/vhost/vhost.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..00445ab172b3 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
> >  	r = vhost_get_avail_idx(vq, &avail_idx);
> >  	if (unlikely(r))
> >  		return false;
> > +
> >  	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> > +	if (vq->avail_idx != vq->last_avail_idx) {
> > +		/* Similar to what's done in vhost_get_vq_desc(), we need
> > +		 * to ensure the available ring entries have been exposed
> > +		 * by guest.
> > +		 */
>
> We need to be more verbose here. For example, which load needs to be
> ordered with which load.
>
> The rmb in vhost_get_vq_desc() is used to order the load of avail idx
> and the load of head. It is paired with e.g. virtio_wmb() in
> virtqueue_add_split().
>
> vhost_vq_avail_empty() is mostly used as a hint in
> vhost_net_busy_poll(), which is under the protection of the vq mutex.
> An exception is tx_can_batch(), but in that case it doesn't even want
> to read the head.

Ok, if it is needed only in that path, maybe we can move the barriers
there.

Thanks

> Thanks
>
> > +		smp_rmb();
> > +		return false;
> > +	}
> >
> > -	return vq->avail_idx == vq->last_avail_idx;
> > +	return true;
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
> >
> > --
> > 2.44.0
Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag
On Tue, 2024-03-26 at 13:57 +0200, Jarkko Sakkinen wrote:
> In which conditions which path is used during the initialization of mm
> and why is this the case? It is an open claim in the current form.

There is an arch_pick_mmap_layout() that arches can have their own rules
for. There is also a generic one. It gets called during exec.

> That would be nice to have documented for the sake of being complete
> description. I have zero doubts of the claim being untrue.

...being untrue?
Re: [PATCH v2 2/2] vhost: Add smp_rmb() in vhost_enable_notify()
On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan wrote:
>
> A smp_rmb() has been missed in vhost_enable_notify(), inspired by
> Will Deacon. Otherwise, it's not ensured the available ring entries
> pushed by guest can be observed by vhost in time, leading to stale
> available ring entries fetched by vhost in vhost_get_vq_desc(), as
> reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.
>
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>   -accel kvm -machine virt,gic-version=host -cpu host \
>   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
>   -m 4096M,slots=16,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=4096M \
>    : \
>   -netdev tap,id=vnet0,vhost=true \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>    :
>   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
>   virtio_net virtio0: output.0:id 100 is not a head!
>
> Add the missed smp_rmb() in vhost_enable_notify(). Note that it
> should be safe until vq->avail_idx is changed by commit d3bb267bbdcb
> ("vhost: cache avail index in vhost_enable_notify()").
>
> Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
> Cc:  # v5.18+
> Reported-by: Yihuang Yu
> Signed-off-by: Gavin Shan
> ---
>  drivers/vhost/vhost.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 00445ab172b3..58f9d6a435f0 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2847,9 +2847,18 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  			       &vq->avail->idx, r);
>  		return false;
>  	}
> +
>  	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> +	if (vq->avail_idx != vq->last_avail_idx) {
> +		/* Similar to what's done in vhost_get_vq_desc(), we need
> +		 * to ensure the available ring entries have been exposed
> +		 * by guest.
> +		 */
> +		smp_rmb();
> +		return true;
> +	}
>
> -	return vq->avail_idx != vq->last_avail_idx;
> +	return false;

So we only care about the case when vhost_enable_notify() returns true.
In that case, I think you want to order with vhost_get_vq_desc():

  last_avail_idx = vq->last_avail_idx;
  if (vq->avail_idx == vq->last_avail_idx) { /* false */ }
  vhost_get_avail_head(vq, &ring_head, last_avail_idx)

Assuming I understand the patch correctly.

Acked-by: Jason Wang

Thanks

>  }
>  EXPORT_SYMBOL_GPL(vhost_enable_notify);
>
> --
> 2.44.0
Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()
On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan wrote:
>
> A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> Will Deacon. Otherwise, it's not ensured the available ring entries
> pushed by guest can be observed by vhost in time, leading to stale
> available ring entries fetched by vhost in vhost_get_vq_desc(), as
> reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.
>
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
>   -accel kvm -machine virt,gic-version=host -cpu host \
>   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
>   -m 4096M,slots=16,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=4096M \
>    : \
>   -netdev tap,id=vnet0,vhost=true \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>    :
>   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
>   virtio_net virtio0: output.0:id 100 is not a head!
>
> Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> should be safe until vq->avail_idx is changed by commit 275bf960ac697
> ("vhost: better detection of available buffers").
>
> Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> Cc:  # v4.11+
> Reported-by: Yihuang Yu
> Signed-off-by: Gavin Shan
> ---
>  drivers/vhost/vhost.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..00445ab172b3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  	r = vhost_get_avail_idx(vq, &avail_idx);
>  	if (unlikely(r))
>  		return false;
> +
>  	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> +	if (vq->avail_idx != vq->last_avail_idx) {
> +		/* Similar to what's done in vhost_get_vq_desc(), we need
> +		 * to ensure the available ring entries have been exposed
> +		 * by guest.
> +		 */

We need to be more verbose here. For example, which load needs to be
ordered with which load.

The rmb in vhost_get_vq_desc() is used to order the load of avail idx
and the load of head. It is paired with e.g. virtio_wmb() in
virtqueue_add_split().

vhost_vq_avail_empty() is mostly used as a hint in vhost_net_busy_poll(),
which is under the protection of the vq mutex. An exception is
tx_can_batch(), but in that case it doesn't even want to read the head.

Thanks

> +		smp_rmb();
> +		return false;
> +	}
>
> -	return vq->avail_idx == vq->last_avail_idx;
> +	return true;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
>
> --
> 2.44.0
Re: [PATCH net v2 2/2] virtio_net: Do not send RSS key if it is not supported
在 2024/3/26 下午11:19, Breno Leitao 写道: There is a bug when setting the RSS options in virtio_net that can break the whole machine, getting the kernel into an infinite loop. Running the following command in any QEMU virtual machine with virtionet will reproduce this problem: # ethtool -X eth0 hfunc toeplitz This is how the problem happens: 1) ethtool_set_rxfh() calls virtnet_set_rxfh() 2) virtnet_set_rxfh() calls virtnet_commit_rss_command() 3) virtnet_commit_rss_command() populates 4 entries for the rss scatter-gather 4) Since the command above does not have a key, then the last scatter-gatter entry will be zeroed, since rss_key_size == 0. sg_buf_size = vi->rss_key_size; 5) This buffer is passed to qemu, but qemu is not happy with a buffer with zero length, and do the following in virtqueue_map_desc() (QEMU function): if (!sz) { virtio_error(vdev, "virtio: zero sized buffers are not allowed"); 6) virtio_error() (also QEMU function) set the device as broken vdev->broken = true; 7) Qemu bails out, and do not repond this crazy kernel. 8) The kernel is waiting for the response to come back (function virtnet_send_command()) 9) The kernel is waiting doing the following : while (!virtqueue_get_buf(vi->cvq, ) && !virtqueue_is_broken(vi->cvq)) cpu_relax(); 10) None of the following functions above is true, thus, the kernel loops here forever. Keeping in mind that virtqueue_is_broken() does not look at the qemu `vdev->broken`, so, it never realizes that the vitio is broken at QEMU side. Fix it by not sending RSS commands if the feature is not available in the device. 
Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Cc: sta...@vger.kernel.org
Cc: qemu-de...@nongnu.org
Signed-off-by: Breno Leitao
---
 drivers/net/virtio_net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c640fdf28fc5..e6b0eaf08ac2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3809,6 +3809,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
 	struct virtnet_info *vi = netdev_priv(dev);
 	int i;

+	if (!vi->has_rss && !vi->has_rss_hash_report)
+		return -EOPNOTSUPP;
+

Why not make the second patch the first? That ordering seems to work better. Or squash them into one patch. Apart from these and Xuan's comments, for the series:

Reviewed-by: Heng Qi

Regards,
Heng

 	if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
 	    rxfh->hfunc != ETH_RSS_HASH_TOP)
 		return -EOPNOTSUPP;
Re: [PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported
On Tue, 26 Mar 2024 08:19:08 -0700, Breno Leitao wrote:
> Do not set virtnet_info->rss_indir_table_size if RSS is not available
> for the device.
>
> Currently, rss_indir_table_size is set if either has_rss or
> has_rss_hash_report is available, but, it should only be set if has_rss
> is set.
>
> On the virtnet_set_rxfh(), return an invalid command if the request has
> indirection table set, but virtnet does not support RSS.
>
> Suggested-by: Heng Qi
> Signed-off-by: Breno Leitao
> ---
>  drivers/net/virtio_net.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c22d1118a133..c640fdf28fc5 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
>  	    rxfh->hfunc != ETH_RSS_HASH_TOP)
>  		return -EOPNOTSUPP;
>
> +	if (rxfh->indir && !vi->has_rss)
> +		return -EINVAL;
> +
>  	if (rxfh->indir) {

Put !vi->has_rss here? Thanks.

>  		for (i = 0; i < vi->rss_indir_table_size; ++i)
>  			vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
> @@ -4729,13 +4732,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
>  		vi->has_rss_hash_report = true;
>
> -	if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
>  		vi->has_rss = true;
>
> -	if (vi->has_rss || vi->has_rss_hash_report) {
>  		vi->rss_indir_table_size =
>  			virtio_cread16(vdev, offsetof(struct virtio_net_config,
>  						      rss_max_indirection_table_length));
> +	}
> +
> +	if (vi->has_rss || vi->has_rss_hash_report) {
>  		vi->rss_key_size =
>  			virtio_cread8(vdev, offsetof(struct virtio_net_config,
>  						     rss_max_key_size));
>
> --
> 2.43.0
>
Re: [PATCH] virtio_ring: Fix the stale index in available ring
On 3/27/24 09:14, Gavin Shan wrote:

On 3/27/24 01:46, Will Deacon wrote:

On Tue, Mar 26, 2024 at 11:43:13AM +0000, Will Deacon wrote:

Ok, long shot after eyeballing the vhost code, but does the diff below help at all? It looks like vhost_vq_avail_empty() can advance the value saved in 'vq->avail_idx' but without the read barrier, possibly confusing vhost_get_vq_desc() in polling mode.

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..87bff710331a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 		return false;
 	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+	smp_rmb();

 	return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

Thanks, Will. I had already noticed that an smp_rmb() was missing in vhost_vq_avail_empty(). The issue still exists after smp_rmb() is added here. However, inspired by your suggestion, I rechecked the code and it seems another smp_rmb() has been missed in vhost_enable_notify(). With smp_rmb() added to both vhost_vq_avail_empty() and vhost_enable_notify(), I'm unable to hit the issue. I will run the test more times to make sure the issue is really resolved. After that, I will post formal patches for review. Thanks again, Will.

The formal patches have been sent for review.

https://lkml.org/lkml/2024/3/27/40

Thanks,
Gavin
Re: [PATCH v2 0/2] vhost: Fix stale available ring entries
On 3/27/24 09:38, Gavin Shan wrote:

The issue was reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform. The wrong head (available ring entry) is seen by the guest when running 'netperf' on the guest and running 'netserver' on another NVidia grace-grace machine.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
  -accel kvm -machine virt,gic-version=host -cpu host \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M \
  : \
  -netdev tap,id=tap0,vhost=true \
  -device virtio-net-pci,bus=pcie.8,netdev=tap0,mac=52:54:00:f1:26:b0
  :
  guest# ifconfig eth0 | grep 'inet addr'
  inet addr:10.26.1.220
  guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
  virtio_net virtio0: output.0:id 100 is not a head!

An smp_rmb() is missing in vhost_vq_avail_empty() and vhost_enable_notify(). Without smp_rmb(), vq->avail_idx is increased while the corresponding available ring entries haven't yet become visible to the vhost side, so a stale available ring entry can be fetched in vhost_get_vq_desc(). Fix it by adding smp_rmb() in those two functions. Note that I made this two patches so that they can be easily picked up by the stable kernels. With the changes, I'm unable to hit the issue again.

Gavin Shan (2):
  vhost: Add smp_rmb() in vhost_vq_avail_empty()
  vhost: Add smp_rmb() in vhost_enable_notify()

 drivers/vhost/vhost.c | 22 ++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

Sorry, I was supposed to copy Will. Amending for that now.

Thanks,
Gavin
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Mon, 25 Mar 2024 19:04:59 + Jonathan Haslam wrote:
> Hi Masami,
>
> > > This change has been tested against production workloads that exhibit
> > > significant contention on the spinlock and an almost order of magnitude
> > > reduction for mean uprobe execution time is observed (28 -> 3.5
> > > microsecs).
> >
> > Looks good to me.
> >
> > Acked-by: Masami Hiramatsu (Google)
> >
> > BTW, how did you measure the overhead? I think spinlock overhead
> > will depend on how much lock contention happens.
>
> Absolutely. I have the original production workload to test this with and
> a derived one that mimics this test case. The production case has ~24
> threads running on a 192 core system which access 14 USDTs around 1.5
> million times per second in total (across all USDTs). My test case is
> similar but can drive a higher rate of USDT access across more threads and
> therefore generate higher contention.

Thanks for the info. So this result was measured on a sufficiently large machine with high parallelism, where lock contention matters. Can you also include this information with the numbers in the next version?

Thank you,

> All measurements are done using bpftrace scripts around relevant parts of
> code in uprobes.c and application code.
>
> Jon.
> > > > > Thank you,
> > > >
> > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > >
> > > Signed-off-by: Jonathan Haslam
> > > ---
> > >  kernel/events/uprobes.c | 22 +++++++++++-----------
> > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 929e98c62965..42bf9b6e8bc0 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > >   */
> > >  #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
> > >
> > > -static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
> > > +static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
> > >
> > >  #define UPROBES_HASH_SZ	13
> > >  /* serialize uprobe->pending_list */
> > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > >  {
> > >  	struct uprobe *uprobe;
> > >
> > > -	spin_lock(&uprobes_treelock);
> > > +	read_lock(&uprobes_treelock);
> > >  	uprobe = __find_uprobe(inode, offset);
> > > -	spin_unlock(&uprobes_treelock);
> > > +	read_unlock(&uprobes_treelock);
> > >
> > >  	return uprobe;
> > >  }
> > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > >  {
> > >  	struct uprobe *u;
> > >
> > > -	spin_lock(&uprobes_treelock);
> > > +	write_lock(&uprobes_treelock);
> > >  	u = __insert_uprobe(uprobe);
> > > -	spin_unlock(&uprobes_treelock);
> > > +	write_unlock(&uprobes_treelock);
> > >
> > >  	return u;
> > >  }
> > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > >  	if (WARN_ON(!uprobe_is_active(uprobe)))
> > >  		return;
> > >
> > > -	spin_lock(&uprobes_treelock);
> > > +	write_lock(&uprobes_treelock);
> > >  	rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > -	spin_unlock(&uprobes_treelock);
> > > +	write_unlock(&uprobes_treelock);
> > >  	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > >  	put_uprobe(uprobe);
> > >  }
> > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > >  	min = vaddr_to_offset(vma, start);
> > >  	max = min + (end - start) - 1;
> > >
> > > -	spin_lock(&uprobes_treelock);
> > > +	read_lock(&uprobes_treelock);
> > >  	n = find_node_in_range(inode, min, max);
> > >  	if (n) {
> > >  		for (t = n; t; t = rb_prev(t)) {
> > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > >  			get_uprobe(u);
> > >  		}
> > >  	}
> > > -	spin_unlock(&uprobes_treelock);
> > > +	read_unlock(&uprobes_treelock);
> > >  }
> > >
> > >  /* @vma contains reference counter, not the probed instruction. */
> > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > >  	min = vaddr_to_offset(vma, start);
> > >  	max = min + (end - start) - 1;
> > >
> > > -	spin_lock(&uprobes_treelock);
> > > +	read_lock(&uprobes_treelock);
> > >  	n = find_node_in_range(inode, min, max);
> > > -	spin_unlock(&uprobes_treelock);
> > > +	read_unlock(&uprobes_treelock);
> > >
> > >  	return !!n;
> > >  }
> > > --
> > > 2.43.0
> > >
> >
> > --
> > Masami Hiramatsu (Google)

--
Masami Hiramatsu (Google)
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Tue, 26 Mar 2024 09:01:47 -0700 Andrii Nakryiko wrote: > On Sun, Mar 24, 2024 at 8:03 PM Masami Hiramatsu wrote: > > > > On Thu, 21 Mar 2024 07:57:35 -0700 > > Jonathan Haslam wrote: > > > > > Active uprobes are stored in an RB tree and accesses to this tree are > > > dominated by read operations. Currently these accesses are serialized by > > > a spinlock but this leads to enormous contention when large numbers of > > > threads are executing active probes. > > > > > > This patch converts the spinlock used to serialize access to the > > > uprobes_tree RB tree into a reader-writer spinlock. This lock type > > > aligns naturally with the overwhelmingly read-only nature of the tree > > > usage here. Although the addition of reader-writer spinlocks are > > > discouraged [0], this fix is proposed as an interim solution while an > > > RCU based approach is implemented (that work is in a nascent form). This > > > fix also has the benefit of being trivial, self contained and therefore > > > simple to backport. > > > > > > This change has been tested against production workloads that exhibit > > > significant contention on the spinlock and an almost order of magnitude > > > reduction for mean uprobe execution time is observed (28 -> 3.5 > > > microsecs). > > > > Looks good to me. > > > > Acked-by: Masami Hiramatsu (Google) > > Masami, > > Given the discussion around per-cpu rw semaphore and need for > (internal) batched attachment API for uprobes, do you think you can > apply this patch as is for now? We can then gain initial improvements > in scalability that are also easy to backport, and Jonathan will work > on a more complete solution based on per-cpu RW semaphore, as > suggested by Ingo. Yeah, it is interesting to use per-cpu rw semaphore on uprobe. I would like to wait for the next version. Thank you, > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > will depend on how much lock contention happens. 
> > > > Thank you,
> > >
> > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > >
> > > Signed-off-by: Jonathan Haslam
> > > ---
> > >  kernel/events/uprobes.c | 22 +++++++++++-----------
> > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 929e98c62965..42bf9b6e8bc0 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > >   */
> > >  #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
> > >
> > > -static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
> > > +static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
> > >
> > >  #define UPROBES_HASH_SZ	13
> > >  /* serialize uprobe->pending_list */
> > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > >  {
> > >  	struct uprobe *uprobe;
> > >
> > > -	spin_lock(&uprobes_treelock);
> > > +	read_lock(&uprobes_treelock);
> > >  	uprobe = __find_uprobe(inode, offset);
> > > -	spin_unlock(&uprobes_treelock);
> > > +	read_unlock(&uprobes_treelock);
> > >
> > >  	return uprobe;
> > >  }
> > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > >  {
> > >  	struct uprobe *u;
> > >
> > > -	spin_lock(&uprobes_treelock);
> > > +	write_lock(&uprobes_treelock);
> > >  	u = __insert_uprobe(uprobe);
> > > -	spin_unlock(&uprobes_treelock);
> > > +	write_unlock(&uprobes_treelock);
> > >
> > >  	return u;
> > >  }
> > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > >  	if (WARN_ON(!uprobe_is_active(uprobe)))
> > >  		return;
> > >
> > > -	spin_lock(&uprobes_treelock);
> > > +	write_lock(&uprobes_treelock);
> > >  	rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > -	spin_unlock(&uprobes_treelock);
> > > +	write_unlock(&uprobes_treelock);
> > >  	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > >  	put_uprobe(uprobe);
> > >  }
> > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > >  	min = vaddr_to_offset(vma, start);
> > >  	max = min + (end - start) - 1;
> > >
> > > -	spin_lock(&uprobes_treelock);
> > > +	read_lock(&uprobes_treelock);
> > >  	n = find_node_in_range(inode, min, max);
> > >  	if (n) {
> > >  		for (t = n; t; t = rb_prev(t)) {
> > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > >  			get_uprobe(u);
> > >  		}
> > >  	}
> > > -	spin_unlock(&uprobes_treelock);
> > > +	read_unlock(&uprobes_treelock);
> > >  }
> > >
> > >  /* @vma contains reference counter, not the probed instruction. */
> > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > >  	min = vaddr_to_offset(vma, start);
> > >  	max = min + (end - start) - 1;
> > >
> > > -	spin_lock(&uprobes_treelock);
> > > +	read_lock(&uprobes_treelock);
> > >
[PATCH v2 2/2] vhost: Add smp_rmb() in vhost_enable_notify()
An smp_rmb() has been missed in vhost_enable_notify(), inspired by Will Deacon. Without it, it is not guaranteed that the available ring entries pushed by the guest can be observed by vhost in time, leading to stale available ring entries being fetched by vhost in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
  -accel kvm -machine virt,gic-version=host -cpu host \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M \
  : \
  -netdev tap,id=vnet0,vhost=true \
  -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
  :
  guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
  virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_enable_notify(). Note that it should be safe until vq->avail_idx is changed by commit d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()").

Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
Cc:  # v5.18+
Reported-by: Yihuang Yu
Signed-off-by: Gavin Shan
---
 drivers/vhost/vhost.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 00445ab172b3..58f9d6a435f0 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2847,9 +2847,18 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 			       &vq->avail->idx, r);
 		return false;
 	}
+
 	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+	if (vq->avail_idx != vq->last_avail_idx) {
+		/* Similar to what's done in vhost_get_vq_desc(), we need
+		 * to ensure the available ring entries have been exposed
+		 * by guest.
+		 */
+		smp_rmb();
+		return true;
+	}

-	return vq->avail_idx != vq->last_avail_idx;
+	return false;
 }
 EXPORT_SYMBOL_GPL(vhost_enable_notify);
--
2.44.0
[PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()
An smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by Will Deacon. Without it, it is not guaranteed that the available ring entries pushed by the guest can be observed by vhost in time, leading to stale available ring entries being fetched by vhost in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
  -accel kvm -machine virt,gic-version=host -cpu host \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M \
  : \
  -netdev tap,id=vnet0,vhost=true \
  -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
  :
  guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
  virtio_net virtio0: output.0:id 100 is not a head!

Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it should be safe until vq->avail_idx is changed by commit 275bf960ac697 ("vhost: better detection of available buffers").

Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
Cc:  # v4.11+
Reported-by: Yihuang Yu
Signed-off-by: Gavin Shan
---
 drivers/vhost/vhost.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..00445ab172b3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 	r = vhost_get_avail_idx(vq, &avail_idx);
 	if (unlikely(r))
 		return false;
+
 	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+	if (vq->avail_idx != vq->last_avail_idx) {
+		/* Similar to what's done in vhost_get_vq_desc(), we need
+		 * to ensure the available ring entries have been exposed
+		 * by guest.
+		 */
+		smp_rmb();
+		return false;
+	}

-	return vq->avail_idx == vq->last_avail_idx;
+	return true;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
--
2.44.0
[PATCH v2 0/2] vhost: Fix stale available ring entries
The issue was reported by Yihuang Yu on NVidia's grace-hopper (ARM64) platform. The wrong head (available ring entry) is seen by the guest when running 'netperf' on the guest and running 'netserver' on another NVidia grace-grace machine.

  /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \
  -accel kvm -machine virt,gic-version=host -cpu host \
  -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
  -m 4096M,slots=16,maxmem=64G \
  -object memory-backend-ram,id=mem0,size=4096M \
  : \
  -netdev tap,id=tap0,vhost=true \
  -device virtio-net-pci,bus=pcie.8,netdev=tap0,mac=52:54:00:f1:26:b0
  :
  guest# ifconfig eth0 | grep 'inet addr'
  inet addr:10.26.1.220
  guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
  virtio_net virtio0: output.0:id 100 is not a head!

An smp_rmb() is missing in vhost_vq_avail_empty() and vhost_enable_notify(). Without smp_rmb(), vq->avail_idx is increased while the corresponding available ring entries haven't yet become visible to the vhost side, so a stale available ring entry can be fetched in vhost_get_vq_desc(). Fix it by adding smp_rmb() in those two functions. Note that I made this two patches so that they can be easily picked up by the stable kernels. With the changes, I'm unable to hit the issue again.

Gavin Shan (2):
  vhost: Add smp_rmb() in vhost_vq_avail_empty()
  vhost: Add smp_rmb() in vhost_enable_notify()

 drivers/vhost/vhost.c | 22 ++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

--
2.44.0
Re: [PATCH] virtio_ring: Fix the stale index in available ring
On 3/27/24 01:46, Will Deacon wrote:

On Tue, Mar 26, 2024 at 11:43:13AM +0000, Will Deacon wrote:

Ok, long shot after eyeballing the vhost code, but does the diff below help at all? It looks like vhost_vq_avail_empty() can advance the value saved in 'vq->avail_idx' but without the read barrier, possibly confusing vhost_get_vq_desc() in polling mode.

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..87bff710331a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 		return false;
 	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+	smp_rmb();

 	return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);

Thanks, Will. I had already noticed that an smp_rmb() was missing in vhost_vq_avail_empty(). The issue still exists after smp_rmb() is added here. However, inspired by your suggestion, I rechecked the code and it seems another smp_rmb() has been missed in vhost_enable_notify(). With smp_rmb() added to both vhost_vq_avail_empty() and vhost_enable_notify(), I'm unable to hit the issue. I will run the test more times to make sure the issue is really resolved. After that, I will post formal patches for review.

Thanks,
Gavin
Re: [RFC PATCH v2 0/7] DAMON based 2-tier memory management for CXL memory
On Mon, 25 Mar 2024 15:53:03 -0700 SeongJae Park wrote:
> On Mon, 25 Mar 2024 21:01:04 +0900 Honggyu Kim wrote:
[...]
> > On Fri, 22 Mar 2024 09:32:23 -0700 SeongJae Park wrote:
> > > On Fri, 22 Mar 2024 18:02:23 +0900 Honggyu Kim wrote:
[...]
> > > > I would like to hear how you think about this.
>
> So, to summarize my humble opinion,
>
> 1. I like the idea of having two actions. But I'd like to use names other than 'promote' and 'demote'.
> 2. I still prefer having a filter for the page-granularity access re-check.
[...]
> > I will join the DAMON Beer/Coffee/Tea Chat tomorrow as scheduled, so I can talk more about this issue.
>
> Looking forward to chatting with you :)

We met and discussed this topic in the chat series yesterday. Sharing the summary here to keep the discussion open.

Honggyu kindly accepted my humble suggestions in the last reply. Honggyu will post the third version of this patchset soon. The patchset will implement two new DAMOS actions, namely MIGRATE_HOT and MIGRATE_COLD. Those will migrate the DAMOS target regions to a user-specified NUMA node, but will have different prioritization score functions. As the names imply, they will prioritize hotter regions and colder regions, respectively.

Honggyu, please feel free to correct me if anything is wrong or missing. And thanks again to Honggyu for patiently keeping this discussion productive, and for their awesome work.

Thanks,
SJ

[...]
[syzbot] [virtualization?] net boot error: WARNING: refcount bug in __free_pages_ok
Hello, syzbot found the following issue on: HEAD commit:c1fd3a9433a2 Merge branch 'there-are-some-bugfix-for-the-h.. git tree: net console output: https://syzkaller.appspot.com/x/log.txt?x=134f4c8118 kernel config: https://syzkaller.appspot.com/x/.config?x=a5e4ca7f025e9172 dashboard link: https://syzkaller.appspot.com/bug?extid=84f677a274bd8b05f6cb compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 Downloadable assets: disk image: https://storage.googleapis.com/syzbot-assets/89219dafdd42/disk-c1fd3a94.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/d962e40c0da9/vmlinux-c1fd3a94.xz kernel image: https://storage.googleapis.com/syzbot-assets/248b8f5eb3a1/bzImage-c1fd3a94.xz IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+84f677a274bd8b05f...@syzkaller.appspotmail.com Key type pkcs7_test registered Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239) io scheduler mq-deadline registered io scheduler kyber registered io scheduler bfq registered input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0 ACPI: button: Power Button [PWRF] input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1 ACPI: button: Sleep Button [SLPF] ioatdma: Intel(R) QuickData Technology Driver 5.00 ACPI: \_SB_.LNKC: Enabled at IRQ 11 virtio-pci :00:03.0: virtio_pci: leaving for legacy driver ACPI: \_SB_.LNKD: Enabled at IRQ 10 virtio-pci :00:04.0: virtio_pci: leaving for legacy driver ACPI: \_SB_.LNKB: Enabled at IRQ 10 virtio-pci :00:06.0: virtio_pci: leaving for legacy driver virtio-pci :00:07.0: virtio_pci: leaving for legacy driver N_HDLC line discipline registered with maxframe=4096 Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A 00:06: ttyS3 at I/O 0x2e8 
(irq = 7, base_baud = 115200) is a 16550A Non-volatile memory driver v1.3 Linux agpgart interface v0.103 ACPI: bus type drm_connector registered [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0 [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1 Console: switching to colour frame buffer device 128x48 platform vkms: [drm] fb0: vkmsdrmfb frame buffer device usbcore: registered new interface driver udl brd: module loaded loop: module loaded zram: Added device: zram0 null_blk: disk nullb0 created null_blk: module loaded Guest personality initialized and is inactive VMCI host device registered (name=vmci, major=10, minor=118) Initialized host personality usbcore: registered new interface driver rtsx_usb usbcore: registered new interface driver viperboard usbcore: registered new interface driver dln2 usbcore: registered new interface driver pn533_usb nfcsim 0.2 initialized usbcore: registered new interface driver port100 usbcore: registered new interface driver nfcmrvl Loading iSCSI transport class v2.0-870. virtio_scsi virtio0: 1/0/0 default/read/poll queues [ cut here ] refcount_t: decrement hit 0; leaking memory. 
WARNING: CPU: 1 PID: 1 at lib/refcount.c:31 refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31 Modules linked in: CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.8.0-syzkaller-12856-gc1fd3a9433a2 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/29/2024 RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31 Code: b2 00 00 00 e8 97 cf e9 fc 5b 5d c3 cc cc cc cc e8 8b cf e9 fc c6 05 6c 6b e8 0a 01 90 48 c7 c7 e0 34 1f 8c e8 27 6c ac fc 90 <0f> 0b 90 90 eb d9 e8 6b cf e9 fc c6 05 49 6b e8 0a 01 90 48 c7 c7 RSP: :c9066e18 EFLAGS: 00010246 RAX: 57706ef3c4162200 RBX: 88801f8f468c RCX: 8880166d8000 RDX: RSI: RDI: RBP: 0004 R08: 815800c2 R09: fbfff1c396e0 R10: dc00 R11: fbfff1c396e0 R12: ea850dc0 R13: ea850dc8 R14: 1d400010a1b9 R15: FS: () GS:8880b950() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 0e132000 CR4: 003506f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1141 [inline] __free_pages_ok+0xc60/0xd90 mm/page_alloc.c:1270 make_alloc_exact+0xa3/0xf0 mm/page_alloc.c:4829 vring_alloc_queue drivers/virtio/virtio_ring.c:319 [inline] vring_alloc_queue_split+0x20a/0x600 drivers/virtio/virtio_ring.c:1108 vring_create_virtqueue_split+0xc6/0x310 drivers/virtio/virtio_ring.c:1158 vring_create_virtqueue+0xca/0x110 drivers/virtio/virtio_ring.c:2683 setup_vq+0xe9/0x2d0
Re: [PATCH v4 4/4] remoteproc: stm32: Add support of an OP-TEE TA to load the firmware
On 3/25/24 17:51, Mathieu Poirier wrote:
> On Fri, Mar 08, 2024 at 03:47:08PM +0100, Arnaud Pouliquen wrote:
>> The new TEE remoteproc device is used to manage remote firmware in a
>> secure, trusted context. The 'st,stm32mp1-m4-tee' compatibility is
>> introduced to delegate the loading of the firmware to the trusted
>> execution context. In such cases, the firmware should be signed and
>> adhere to the image format defined by the TEE.
>>
>> Signed-off-by: Arnaud Pouliquen
>> ---
>> Updates from V3:
>> - remove support of the attach use case. Will be addressed in a separate thread,
>> - add st_rproc_tee_ops::parse_fw ops,
>> - inverse call of devm_rproc_alloc() and tee_rproc_register() to manage cross
>>   reference between the rproc struct and the tee_rproc struct in tee_rproc.c.
>> ---
>>  drivers/remoteproc/stm32_rproc.c | 60 +++++++++++++++++++++++++++----
>>  1 file changed, 56 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/remoteproc/stm32_rproc.c b/drivers/remoteproc/stm32_rproc.c
>> index 8cd838df4e92..13df33c78aa2 100644
>> --- a/drivers/remoteproc/stm32_rproc.c
>> +++ b/drivers/remoteproc/stm32_rproc.c
>> @@ -20,6 +20,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>  #include
>>
>>  #include "remoteproc_internal.h"
>> @@ -49,6 +50,9 @@
>>  #define M4_STATE_STANDBY	4
>>  #define M4_STATE_CRASH	5
>>
>> +/* Remote processor unique identifier aligned with the Trusted Execution Environment definitions */
>
> Why is this the case? At least from the kernel side it is possible to call
> tee_rproc_register() with any kind of value, why is there a need to be any
> kind of alignment with the TEE?

The proc_id is used to identify a processor on systems with multiple coprocessors. For instance, we can have a system with a DSP and a modem. We would use the same TEE service, but the TEE driver would probably be different, and the same goes for the signature key. In such a case, the proc ID identifies the processor you want to address.
>
>> +#define STM32_MP1_M4_PROC_ID	0
>> +
>>  struct stm32_syscon {
>>  	struct regmap *map;
>>  	u32 reg;
>> @@ -257,6 +261,19 @@ static int stm32_rproc_release(struct rproc *rproc)
>>  	return 0;
>>  }
>>
>> +static int stm32_rproc_tee_stop(struct rproc *rproc)
>> +{
>> +	int err;
>> +
>> +	stm32_rproc_request_shutdown(rproc);
>> +
>> +	err = tee_rproc_stop(rproc);
>> +	if (err)
>> +		return err;
>> +
>> +	return stm32_rproc_release(rproc);
>> +}
>> +
>>  static int stm32_rproc_prepare(struct rproc *rproc)
>>  {
>>  	struct device *dev = rproc->dev.parent;
>> @@ -693,8 +710,19 @@ static const struct rproc_ops st_rproc_ops = {
>>  	.get_boot_addr = rproc_elf_get_boot_addr,
>>  };
>>
>> +static const struct rproc_ops st_rproc_tee_ops = {
>> +	.prepare	= stm32_rproc_prepare,
>> +	.start		= tee_rproc_start,
>> +	.stop		= stm32_rproc_tee_stop,
>> +	.kick		= stm32_rproc_kick,
>> +	.load		= tee_rproc_load_fw,
>> +	.parse_fw	= tee_rproc_parse_fw,
>> +	.find_loaded_rsc_table = tee_rproc_find_loaded_rsc_table,
>> +};
>> +
>>  static const struct of_device_id stm32_rproc_match[] = {
>> -	{ .compatible = "st,stm32mp1-m4" },
>> +	{.compatible = "st,stm32mp1-m4",},
>> +	{.compatible = "st,stm32mp1-m4-tee",},
>>  	{},
>>  };
>>  MODULE_DEVICE_TABLE(of, stm32_rproc_match);
>> @@ -853,6 +881,7 @@ static int stm32_rproc_probe(struct platform_device *pdev)
>>  	struct device *dev = &pdev->dev;
>>  	struct stm32_rproc *ddata;
>>  	struct device_node *np = dev->of_node;
>> +	struct tee_rproc *trproc = NULL;
>>  	struct rproc *rproc;
>>  	unsigned int state;
>>  	int ret;
>> @@ -861,9 +890,26 @@ static int stm32_rproc_probe(struct platform_device *pdev)
>>  	if (ret)
>>  		return ret;
>>
>> -	rproc = devm_rproc_alloc(dev, np->name, &st_rproc_ops, NULL, sizeof(*ddata));
>> -	if (!rproc)
>> -		return -ENOMEM;
>> +	if (of_device_is_compatible(np, "st,stm32mp1-m4-tee")) {
>> +		/*
>> +		 * Delegate the firmware management to the secure context.
>> +		 * The firmware loaded has to be signed.
>> +		 */
>> +		rproc = devm_rproc_alloc(dev, np->name, &st_rproc_tee_ops, NULL, sizeof(*ddata));
>> +		if (!rproc)
>> +			return -ENOMEM;
>> +
>> +		trproc = tee_rproc_register(dev, rproc, STM32_MP1_M4_PROC_ID);
>> +		if (IS_ERR(trproc)) {
>> +			dev_err_probe(dev, PTR_ERR(trproc),
>> +				      "signed firmware not supported by TEE\n");
>> +			return PTR_ERR(trproc);
>> +		}
>> +	} else {
>> +		rproc = devm_rproc_alloc(dev, np->name, &st_rproc_ops, NULL, sizeof(*ddata));
>> +		if (!rproc)
>> +			return -ENOMEM;
>> +	}
>>
>>  	ddata = rproc->priv;
>>
>>
[RFC PATCH v2 4/4] tracing/timer: use __print_sym()
From: Johannes Berg Use the new __print_sym() in the timer tracing, just to show how to convert something. This adds ~80 bytes of .text for a saving of ~1.5K of data in my builds. Note the format changes from print fmt: "success=%d dependency=%s", REC->success, __print_symbolic(REC->dependency, { 0, "NONE" }, { (1 << 0), "POSIX_TIMER" }, { (1 << 1), "PERF_EVENTS" }, { (1 << 2), "SCHED" }, { (1 << 3), "CLOCK_UNSTABLE" }, { (1 << 4), "RCU" }, { (1 << 5), "RCU_EXP" }) to print fmt: "success=%d dependency=%s", REC->success, __print_symbolic(REC->dependency, { 0, "NONE" }, { 1, "POSIX_TIMER" }, { 2, "PERF_EVENTS" }, { 4, "SCHED" }, { 8, "CLOCK_UNSTABLE" }, { 16, "RCU" }, { 32, "RCU_EXP" }) Since the values are now just printed in the show function as pure decimal values. Signed-off-by: Johannes Berg --- include/trace/events/timer.h | 22 +++--- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h index 1ef58a04fc57..d483abffed78 100644 --- a/include/trace/events/timer.h +++ b/include/trace/events/timer.h @@ -402,26 +402,18 @@ TRACE_EVENT(itimer_expire, #undef tick_dep_mask_name #undef tick_dep_name_end -/* The MASK will convert to their bits and they need to be processed too */ -#define tick_dep_name(sdep) TRACE_DEFINE_ENUM(TICK_DEP_BIT_##sdep); \ - TRACE_DEFINE_ENUM(TICK_DEP_MASK_##sdep); -#define tick_dep_name_end(sdep) TRACE_DEFINE_ENUM(TICK_DEP_BIT_##sdep); \ - TRACE_DEFINE_ENUM(TICK_DEP_MASK_##sdep); -/* NONE only has a mask defined for it */ -#define tick_dep_mask_name(sdep) TRACE_DEFINE_ENUM(TICK_DEP_MASK_##sdep); - -TICK_DEP_NAMES - -#undef tick_dep_name -#undef tick_dep_mask_name -#undef tick_dep_name_end - #define tick_dep_name(sdep) { TICK_DEP_MASK_##sdep, #sdep }, #define tick_dep_mask_name(sdep) { TICK_DEP_MASK_##sdep, #sdep }, #define tick_dep_name_end(sdep) { TICK_DEP_MASK_##sdep, #sdep } +TRACE_DEFINE_SYM_LIST(tick_dep_names, TICK_DEP_NAMES); + +#undef tick_dep_name +#undef 
tick_dep_mask_name +#undef tick_dep_name_end + #define show_tick_dep_name(val)\ - __print_symbolic(val, TICK_DEP_NAMES) + __print_sym(val, tick_dep_names) TRACE_EVENT(tick_stop, -- 2.44.0
[RFC PATCH v2 1/4] tracing: add __print_sym() to replace __print_symbolic()
From: Johannes Berg The way __print_symbolic() works is limited and inefficient in multiple ways: - you can only use it with a static list of symbols, but e.g. the SKB dropreasons are now a dynamic list - it builds the list in memory _three_ times, so it takes a lot of memory: - The print_fmt contains the list (since it's passed to the macro there). This actually contains the names _twice_, which is fixed up at runtime. - TRACE_DEFINE_ENUM() puts a 24-byte struct trace_eval_map for every entry, plus the string pointed to by it, which cannot be deduplicated with the strings in the print_fmt - The in-kernel symbolic printing creates yet another list of struct trace_print_flags for trace_print_symbols_seq() - it also requires runtime fixup during init, which is a lot of string parsing due to the print_fmt fixup Introduce __print_sym() to - over time - replace the old one. We can easily extend this also to __print_flags later, but I cared only about the SKB dropreasons for now, which has only __print_symbolic(). This new __print_sym() requires only a single list of items, created by TRACE_DEFINE_SYM_LIST(), or can even use another already existing list by using TRACE_DEFINE_SYM_FNS() with lookup and show methods. Then, instead of doing an init-time fixup, just do this at the time when userspace reads the print_fmt. This way, dynamically updated lists are possible. For userspace, nothing actually changes, because the print_fmt is shown exactly the same way the old __print_symbolic() was. This adds about 4k .text in my test builds, but that'll be more than paid for by the actual conversions. 
Signed-off-by: Johannes Berg --- v2: - fix RCU - use ':' as separator to simplify the code, that's still not valid in a C identifier --- include/asm-generic/vmlinux.lds.h | 3 +- include/linux/module.h | 2 + include/linux/trace_events.h | 7 ++ include/linux/tracepoint.h | 20 + include/trace/stages/init.h| 54 + include/trace/stages/stage2_data_offsets.h | 6 ++ include/trace/stages/stage3_trace_output.h | 9 +++ include/trace/stages/stage7_class_define.h | 3 + kernel/module/main.c | 3 + kernel/trace/trace_events.c| 90 +- kernel/trace/trace_output.c| 45 +++ 11 files changed, 239 insertions(+), 3 deletions(-) diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index f7749d0f2562..88de434578a5 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -256,7 +256,8 @@ #define FTRACE_EVENTS() \ . = ALIGN(8); \ BOUNDED_SECTION(_ftrace_events) \ - BOUNDED_SECTION_BY(_ftrace_eval_map, _ftrace_eval_maps) + BOUNDED_SECTION_BY(_ftrace_eval_map, _ftrace_eval_maps) \ + BOUNDED_SECTION(_ftrace_sym_defs) #else #define FTRACE_EVENTS() #endif diff --git a/include/linux/module.h b/include/linux/module.h index 1153b0d99a80..571e5e8f17b6 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -524,6 +524,8 @@ struct module { unsigned int num_trace_events; struct trace_eval_map **trace_evals; unsigned int num_trace_evals; + struct trace_sym_def **trace_sym_defs; + unsigned int num_trace_sym_defs; #endif #ifdef CONFIG_FTRACE_MCOUNT_RECORD unsigned int num_ftrace_callsites; diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index 6f9bdfb09d1d..bc7045d535d0 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -27,6 +27,13 @@ const char *trace_print_flags_seq(struct trace_seq *p, const char *delim, const char *trace_print_symbols_seq(struct trace_seq *p, unsigned long val, const struct trace_print_flags *symbol_array); +const char *trace_print_sym_seq(struct trace_seq *p, 
unsigned long long val, + const char *(*lookup)(unsigned long long val)); +const char *trace_sym_lookup(const struct trace_sym_entry *list, +size_t len, unsigned long long value); +void trace_sym_show(struct seq_file *m, + const struct trace_sym_entry *list, size_t len); + #if BITS_PER_LONG == 32 const char *trace_print_flags_seq_u64(struct trace_seq *p, const char *delim, unsigned long long flags, diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index 689b6d71590e..cc3b387953d1 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -31,6 +31,24 @@ struct trace_eval_map { unsigned long eval_value; }; +struct trace_sym_def { + const char *system; + const char
[RFC PATCH v2 3/4] net: drop_monitor: use drop_reason_lookup()
From: Johannes Berg Now that we have drop_reason_lookup(), we can just use it for drop_monitor as well, rather than exporting the list itself. Signed-off-by: Johannes Berg --- include/net/dropreason.h | 4 net/core/drop_monitor.c | 18 +++--- net/core/skbuff.c| 6 +++--- 3 files changed, 6 insertions(+), 22 deletions(-) diff --git a/include/net/dropreason.h b/include/net/dropreason.h index c157070b5303..0e2195ccf2cd 100644 --- a/include/net/dropreason.h +++ b/include/net/dropreason.h @@ -38,10 +38,6 @@ struct drop_reason_list { size_t n_reasons; }; -/* Note: due to dynamic registrations, access must be under RCU */ -extern const struct drop_reason_list __rcu * -drop_reasons_by_subsys[SKB_DROP_REASON_SUBSYS_NUM]; - #ifdef CONFIG_TRACEPOINTS const char *drop_reason_lookup(unsigned long long value); void drop_reason_show(struct seq_file *m); diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c index b0f221d658be..185c43e5b501 100644 --- a/net/core/drop_monitor.c +++ b/net/core/drop_monitor.c @@ -610,9 +610,8 @@ static int net_dm_packet_report_fill(struct sk_buff *msg, struct sk_buff *skb, size_t payload_len) { struct net_dm_skb_cb *cb = NET_DM_SKB_CB(skb); - const struct drop_reason_list *list = NULL; - unsigned int subsys, subsys_reason; char buf[NET_DM_MAX_SYMBOL_LEN]; + const char *reason_str; struct nlattr *attr; void *hdr; int rc; @@ -630,19 +629,8 @@ static int net_dm_packet_report_fill(struct sk_buff *msg, struct sk_buff *skb, goto nla_put_failure; rcu_read_lock(); - subsys = u32_get_bits(cb->reason, SKB_DROP_REASON_SUBSYS_MASK); - if (subsys < SKB_DROP_REASON_SUBSYS_NUM) - list = rcu_dereference(drop_reasons_by_subsys[subsys]); - subsys_reason = cb->reason & ~SKB_DROP_REASON_SUBSYS_MASK; - if (!list || - subsys_reason >= list->n_reasons || - !list->reasons[subsys_reason] || - strlen(list->reasons[subsys_reason]) > NET_DM_MAX_REASON_LEN) { - list = rcu_dereference(drop_reasons_by_subsys[SKB_DROP_REASON_SUBSYS_CORE]); - subsys_reason = 
SKB_DROP_REASON_NOT_SPECIFIED; - } - if (nla_put_string(msg, NET_DM_ATTR_REASON, - list->reasons[subsys_reason])) { + reason_str = drop_reason_lookup(cb->reason); + if (nla_put_string(msg, NET_DM_ATTR_REASON, reason_str)) { rcu_read_unlock(); goto nla_put_failure; } diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 012b48da8810..a8065c40a270 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -141,13 +141,11 @@ static const struct drop_reason_list drop_reasons_core = { .n_reasons = ARRAY_SIZE(drop_reasons), }; -const struct drop_reason_list __rcu * +static const struct drop_reason_list __rcu * drop_reasons_by_subsys[SKB_DROP_REASON_SUBSYS_NUM] = { [SKB_DROP_REASON_SUBSYS_CORE] = RCU_INITIALIZER(&drop_reasons_core), }; -EXPORT_SYMBOL(drop_reasons_by_subsys); -#ifdef CONFIG_TRACEPOINTS const char *drop_reason_lookup(unsigned long long value) { unsigned long long subsys_id = value >> SKB_DROP_REASON_SUBSYS_SHIFT; @@ -164,7 +162,9 @@ const char *drop_reason_lookup(unsigned long long value) return NULL; return subsys->reasons[reason]; } +EXPORT_SYMBOL(drop_reason_lookup); +#ifdef CONFIG_TRACEPOINTS void drop_reason_show(struct seq_file *m) { u32 subsys_id; -- 2.44.0
[RFC PATCH v2 2/4] net: dropreason: use new __print_sym() in tracing
From: Johannes Berg The __print_symbolic() could only ever print the core drop reasons, since that's the way the infrastructure works. Now that we have __print_sym() with all the advantages mentioned in that commit, convert to that and get all the drop reasons from all subsystems. As we already have a list of them, that's really easy. This is a little bit of .text (~100 bytes in my build) and saves a lot of .data (~17k). Signed-off-by: Johannes Berg --- include/net/dropreason.h | 5 + include/trace/events/skb.h | 16 +++--- net/core/skbuff.c | 43 ++ 3 files changed, 51 insertions(+), 13 deletions(-) diff --git a/include/net/dropreason.h b/include/net/dropreason.h index 56cb7be92244..c157070b5303 100644 --- a/include/net/dropreason.h +++ b/include/net/dropreason.h @@ -42,6 +42,11 @@ struct drop_reason_list { extern const struct drop_reason_list __rcu * drop_reasons_by_subsys[SKB_DROP_REASON_SUBSYS_NUM]; +#ifdef CONFIG_TRACEPOINTS +const char *drop_reason_lookup(unsigned long long value); +void drop_reason_show(struct seq_file *m); +#endif + void drop_reasons_register_subsys(enum skb_drop_reason_subsys subsys, const struct drop_reason_list *list); void drop_reasons_unregister_subsys(enum skb_drop_reason_subsys subsys); diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h index 07e0715628ec..8a1a63f9e796 100644 --- a/include/trace/events/skb.h +++ b/include/trace/events/skb.h @@ -8,15 +8,9 @@ #include #include #include +#include -#undef FN -#define FN(reason) TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason); -DEFINE_DROP_REASON(FN, FN) - -#undef FN -#undef FNe -#define FN(reason) { SKB_DROP_REASON_##reason, #reason }, -#define FNe(reason){ SKB_DROP_REASON_##reason, #reason } +TRACE_DEFINE_SYM_FNS(drop_reason, drop_reason_lookup, drop_reason_show); /* * Tracepoint for free an sk_buff: @@ -44,13 +38,9 @@ TRACE_EVENT(kfree_skb, TP_printk("skbaddr=%p protocol=%u location=%pS reason: %s", __entry->skbaddr, __entry->protocol, __entry->location, - 
__print_symbolic(__entry->reason, - DEFINE_DROP_REASON(FN, FNe))) + __print_sym(__entry->reason, drop_reason )) ); -#undef FN -#undef FNe - TRACE_EVENT(consume_skb, TP_PROTO(struct sk_buff *skb, void *location), diff --git a/net/core/skbuff.c b/net/core/skbuff.c index b99127712e67..012b48da8810 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -147,6 +147,49 @@ drop_reasons_by_subsys[SKB_DROP_REASON_SUBSYS_NUM] = { }; EXPORT_SYMBOL(drop_reasons_by_subsys); +#ifdef CONFIG_TRACEPOINTS +const char *drop_reason_lookup(unsigned long long value) +{ + unsigned long long subsys_id = value >> SKB_DROP_REASON_SUBSYS_SHIFT; + u32 reason = value & ~SKB_DROP_REASON_SUBSYS_MASK; + const struct drop_reason_list *subsys; + + if (subsys_id >= SKB_DROP_REASON_SUBSYS_NUM) + return NULL; + + subsys = rcu_dereference(drop_reasons_by_subsys[subsys_id]); + if (!subsys) + return NULL; + if (reason >= subsys->n_reasons) + return NULL; + return subsys->reasons[reason]; +} + +void drop_reason_show(struct seq_file *m) +{ + u32 subsys_id; + + rcu_read_lock(); + for (subsys_id = 0; subsys_id < SKB_DROP_REASON_SUBSYS_NUM; subsys_id++) { + const struct drop_reason_list *subsys; + u32 i; + + subsys = rcu_dereference(drop_reasons_by_subsys[subsys_id]); + if (!subsys) + continue; + + for (i = 0; i < subsys->n_reasons; i++) { + if (!subsys->reasons[i]) + continue; + seq_printf(m, ", { %u, \"%s\" }", + (subsys_id << SKB_DROP_REASON_SUBSYS_SHIFT) | i, + subsys->reasons[i]); + } + } + rcu_read_unlock(); +} +#endif + /** * drop_reasons_register_subsys - register another drop reason subsystem * @subsys: the subsystem to register, must not be the core -- 2.44.0
[RFC PATCH v2 0/4] tracing: improve symbolic printing
As I mentioned before, it's annoying to see this in dropreason tracing with trace-cmd: irq/65-iwlwifi:-401 [000]22.79: kfree_skb: skbaddr=0x6a89b000 protocol=0 location=ieee80211_rx_handlers_result+0x21a reason: 0x2 and much nicer to see irq/65-iwlwifi:-401 [000]22.79: kfree_skb: skbaddr=0x69142000 protocol=0 location=ieee80211_rx_handlers_result+0x21a reason: RX_DROP_MONITOR The reason for this is that the __print_symbolic() string that trace-cmd parses is created at build time from the long list of _core_ drop reasons, but the drop reasons are now more dynamic. So I came up with __print_sym(), which is similar, except it doesn't build the big list of numbers at build time but rather at runtime, which is actually a big memory saving too. And building it at runtime, at the moment userspace is recording, lets us include all the known reasons. v2: - rebased on 6.9-rc1 - always search for __print_sym() and get rid of the DYNPRINT flag and associated code; I think ideally we'll just remove the older __print_symbolic() entirely - use ':' as the separator instead of "//" since that makes searching for it much easier and it's still not a valid char in an identifier - fix RCU johannes
Re: [PATCH v4 1/4] remoteproc: Add TEE support
Hello Mathieu, On 3/25/24 17:46, Mathieu Poirier wrote: > On Fri, Mar 08, 2024 at 03:47:05PM +0100, Arnaud Pouliquen wrote: >> Add a remoteproc TEE (Trusted Execution Environment) driver >> that will be probed by the TEE bus. If the associated Trusted >> application is supported on secure part this device offers a client > > Device or driver? I thought I touched on that before. Right, I changed the first instance and missed this one > >> interface to load a firmware in the secure part. >> This firmware could be authenticated by the secure trusted application. >> >> Signed-off-by: Arnaud Pouliquen >> --- >> Updates from V3: >> - rework TEE_REMOTEPROC description in Kconfig >> - fix some namings >> - add tee_rproc_parse_fw to support rproc_ops::parse_fw >> - add proc::tee_interface; >> - add rproc struct as parameter of the tee_rproc_register() function >> --- >> drivers/remoteproc/Kconfig | 10 + >> drivers/remoteproc/Makefile | 1 + >> drivers/remoteproc/tee_remoteproc.c | 434 >> include/linux/remoteproc.h | 4 + >> include/linux/tee_remoteproc.h | 112 +++ >> 5 files changed, 561 insertions(+) >> create mode 100644 drivers/remoteproc/tee_remoteproc.c >> create mode 100644 include/linux/tee_remoteproc.h >> >> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig >> index 48845dc8fa85..2cf1431b2b59 100644 >> --- a/drivers/remoteproc/Kconfig >> +++ b/drivers/remoteproc/Kconfig >> @@ -365,6 +365,16 @@ config XLNX_R5_REMOTEPROC >> >>It's safe to say N if not interested in using RPU r5f cores. >> >> + >> +config TEE_REMOTEPROC >> +tristate "remoteproc support by a TEE application" > > s/remoteproc/Remoteproc > >> +depends on OPTEE >> +help >> + Support a remote processor with a TEE application. The Trusted >> + Execution Context is responsible for loading the trusted firmware >> + image and managing the remote processor's lifecycle. >> + This can be either built-in or a loadable module. 
>> + >> endif # REMOTEPROC >> >> endmenu >> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile >> index 91314a9b43ce..fa8daebce277 100644 >> --- a/drivers/remoteproc/Makefile >> +++ b/drivers/remoteproc/Makefile >> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC) += rcar_rproc.o >> obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o >> obj-$(CONFIG_ST_SLIM_REMOTEPROC)+= st_slim_rproc.o >> obj-$(CONFIG_STM32_RPROC) += stm32_rproc.o >> +obj-$(CONFIG_TEE_REMOTEPROC)+= tee_remoteproc.o >> obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o >> obj-$(CONFIG_TI_K3_R5_REMOTEPROC) += ti_k3_r5_remoteproc.o >> obj-$(CONFIG_XLNX_R5_REMOTEPROC)+= xlnx_r5_remoteproc.o >> diff --git a/drivers/remoteproc/tee_remoteproc.c >> b/drivers/remoteproc/tee_remoteproc.c >> new file mode 100644 >> index ..c855210e52e3 >> --- /dev/null >> +++ b/drivers/remoteproc/tee_remoteproc.c >> @@ -0,0 +1,434 @@ >> +// SPDX-License-Identifier: GPL-2.0-or-later >> +/* >> + * Copyright (C) STMicroelectronics 2024 - All Rights Reserved >> + * Author: Arnaud Pouliquen >> + */ >> + >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> + >> +#include "remoteproc_internal.h" >> + >> +#define MAX_TEE_PARAM_ARRY_MEMBER 4 >> + >> +/* >> + * Authentication of the firmware and load in the remote processor memory >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + * [in] params[1].memref: buffer containing the image of the >> buffer >> + */ >> +#define TA_RPROC_FW_CMD_LOAD_FW 1 >> + >> +/* >> + * Start the remote processor >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + */ >> +#define TA_RPROC_FW_CMD_START_FW2 >> + >> +/* >> + * Stop the remote processor >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + */ >> +#define TA_RPROC_FW_CMD_STOP_FW 3 >> + >> +/* >> + * Return the address of the resource table, or 0 if not found >> + * No 
check is done to verify that the address returned is accessible by >> + * the non secure context. If the resource table is loaded in a protected >> + * memory the access by the non secure context will lead to a data abort. >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the remote processor >> + * [out] params[1].value.a:32bit LSB resource table memory address >> + * [out] params[1].value.b:32bit MSB resource table memory address >> + * [out] params[2].value.a:32bit LSB resource table memory size >> + * [out] params[2].value.b:32bit MSB resource table memory size >> + */ >> +#define TA_RPROC_FW_CMD_GET_RSC_TABLE 4 >> + >> +/* >> + * Return the address of the core dump >> + * >> + * [in] params[0].value.a: unique 32bit identifier of the
Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On 26/03/2024 17:49, Jarkko Sakkinen wrote: On Tue Mar 26, 2024 at 3:57 PM EET, Alexandre Ghiti wrote: Hi Jarkko, On 25/03/2024 22:55, Jarkko Sakkinen wrote: Tracing with kprobes while running a monolithic kernel is currently impossible due to the kernel module allocator dependency. Address the issue by implementing a textmem API for RISC-V. Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation Signed-off-by: Jarkko Sakkinen --- v5: - No changes, except removing the alloc_execmem() call, which should have been part of the previous patch. v4: - Include linux/execmem.h. v3: - Architecture-independent parts have been split into separate patches. - Do not change arch/riscv/kernel/module.c as it is out of scope for this patch set now. v2: - Better late than never, right? :-) - Focus only on RISC-V for now to make the patch more digestible. This is the arch where I use the patch on a daily basis to help with QA. - Introduce the HAVE_KPROBES_ALLOC flag to help with more gradual migration.
--- arch/riscv/Kconfig | 1 + arch/riscv/kernel/Makefile | 3 +++ arch/riscv/kernel/execmem.c | 22 ++ 3 files changed, 26 insertions(+) create mode 100644 arch/riscv/kernel/execmem.c diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index e3142ce531a0..499512fb17ff 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -132,6 +132,7 @@ config RISCV select HAVE_KPROBES if !XIP_KERNEL select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL select HAVE_KRETPROBES if !XIP_KERNEL + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL # https://github.com/ClangBuiltLinux/linux/issues/1881 select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD select HAVE_MOVE_PMD diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile index 604d6bf7e476..337797f10d3e 100644 --- a/arch/riscv/kernel/Makefile +++ b/arch/riscv/kernel/Makefile @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o obj-$(CONFIG_MODULES)+= module.o +ifeq ($(CONFIG_ALLOC_EXECMEM),y) +obj-y += execmem.o +endif obj-$(CONFIG_MODULE_SECTIONS)+= module-sections.o obj-$(CONFIG_CPU_PM) += suspend_entry.o suspend.o diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c new file mode 100644 index ..3e52522ead32 --- /dev/null +++ b/arch/riscv/kernel/execmem.c @@ -0,0 +1,22 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include + +void *alloc_execmem(unsigned long size, gfp_t /* gfp */) Need to have the parameter name here. I guess this could just as well pass through gfp to vmalloc from the caller as kprobes does call module_alloc() with GFP_KERNEL set in RISC-V. +{ + return __vmalloc_node_range(size, 1, MODULES_VADDR, + MODULES_END, GFP_KERNEL, + PAGE_KERNEL, 0, NUMA_NO_NODE, + __builtin_return_address(0)); +} The __vmalloc_node_range() line ^^ must be from an old kernel since we added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix module_alloc() that did not reset the linear mapping permissions"). 
In addition, I guess module_alloc() should now use alloc_execmem(), right? Ack for the first comment. For the second, it is up to arch/ to choose whether to have shared or separate allocators, so if you want I can change it that way, but I did not want to make the call myself. I'd say module_alloc() should use alloc_execmem() then, since there are no differences for now. + +void free_execmem(void *region) +{ + if (in_interrupt()) + pr_warn("In interrupt context: vmalloc may not work.\n"); + + vfree(region); +} I remember Mike Rapoport sent a patchset to introduce an API for executable memory allocation (https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/); how does this intersect with your work? I don't know the status of his patchset though. Thanks, Alex I also made a patch set for kprobes back in 2022: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ I think Calvin's, Mike's, and my early patch sets have the same problem: they try to choke all architectures at once. And further, Calvin's and Mike's work also tries to cover the tracing subsystems at once. I feel that my relatively small patch set, which deals only with trivial kprobes (which are more in the leaf than e.g. bpf, which is more like an orchestrator tool) and implements one arch whose dog food I actually eat, is a better starting point. Arch code is always something where you need to have genuine understanding, so full architecture coverage from day one is just too risky for stability. Linux is better
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, 26 Mar 2024 09:16:33 -0700 Andrii Nakryiko wrote: > > It's no different than lockdep. Test boxes should have it enabled, but > > there's no reason to have this enabled in a production system. > > > > I tend to agree with Steven here (which is why I sent this patch as it > is), but I'm happy to do it as an opt-out if Masami insists. Please > do let me know if I need to send a v2 or whether this one is actually the one > we'll end up using. Thanks! Masami, Are you OK with just keeping it set to N? We could have other options, like PROVE_LOCKING, enable it. -- Steve
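Steve's suggestion — keep the knob default-off but let debug configs turn it on — maps onto a small Kconfig fragment. This is a hedged sketch only: the symbol name and help text are placeholders based on the thread's subject, not necessarily the exact option the patch introduces.

```kconfig
# Hypothetical sketch: option name and wording are assumptions.
config FTRACE_VALIDATE_RCU_IS_WATCHING
	bool "Validate RCU is watching in ftrace handlers"
	depends on FTRACE
	default y if PROVE_LOCKING
	help
	  Adds an extra rcu_is_watching() sanity check to function
	  tracing entry points. Useful on test machines (like lockdep);
	  leave disabled on production systems to avoid the overhead.
```

The `default y if PROVE_LOCKING` line is the mechanism Steve alludes to: test boxes that already enable lockdep-style checking get this validation automatically, while everyone else defaults to N.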
Re: [PATCH v7 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue, Mar 26, 2024 at 03:46:16PM +0200, Jarkko Sakkinen wrote: > Tracing with kprobes while running a monolithic kernel is currently > impossible due to the kernel module allocator dependency. > > Address the issue by implementing a textmem API for RISC-V. This doesn't compile for nommu: /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:10:46: error: 'MODULES_VADDR' undeclared (first use in this function) /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:11:37: error: 'MODULES_END' undeclared (first use in this function) /build/tmp.3xucsBhqDV/arch/riscv/kernel/execmem.c:14:1: error: control reaches end of non-void function [-Werror=return-type] Clang builds also report: ../arch/riscv/kernel/execmem.c:8:56: warning: omitting the parameter name in a function definition is a C2x extension [-Wc2x-extensions] > > Link: https://www.sochub.fi # for power on testing new SoC's with a minimal > stack > Link: > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # > continuation > Signed-off-by: Jarkko Sakkinen > --- > v5-v7: > - No changes. > v4: > - Include linux/execmem.h. > v3: > - Architecture-independent parts have been split into separate patches. > - Do not change arch/riscv/kernel/module.c as it is out of scope for > this patch set now. Meta comment. I dunno when v1 was sent, but can you please relax with submitting new versions of your patches? There are conversations ongoing on v5 at the moment, while this is a more recent version. v2 seems to have been sent on the 23rd and there have been 5 versions in the last day: https://patchwork.kernel.org/project/linux-riscv/list/?submitter=195059=* Could you please also try to use a cover letter for patchsets, ideally with a consistent subject? Otherwise I have to manually mark stuff as superseded. Thanks, Conor. > v2: > - Better late than never, right? :-) > - Focus only on RISC-V for now to make the patch more digestible. This > is the arch where I use the patch on a daily basis to help with QA.
> - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration. signature.asc Description: PGP signature
[PATCH net-next v4 2/2] net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb
The udp_fail_queue_rcv_skb() tracepoint lacks any details on the source and destination IP/port whereas this information can be critical in case of UDP/syslog. Signed-off-by: Balazs Scheidler --- include/trace/events/udp.h | 29 - net/ipv4/udp.c | 2 +- net/ipv6/udp.c | 3 ++- 3 files changed, 27 insertions(+), 7 deletions(-) diff --git a/include/trace/events/udp.h b/include/trace/events/udp.h index 336fe272889f..62bebe2a6ece 100644 --- a/include/trace/events/udp.h +++ b/include/trace/events/udp.h @@ -7,24 +7,43 @@ #include #include +#include TRACE_EVENT(udp_fail_queue_rcv_skb, - TP_PROTO(int rc, struct sock *sk), + TP_PROTO(int rc, struct sock *sk, struct sk_buff *skb), - TP_ARGS(rc, sk), + TP_ARGS(rc, sk, skb), TP_STRUCT__entry( __field(int, rc) - __field(__u16, lport) + + __field(__u16, sport) + __field(__u16, dport) + __field(__u16, family) + __array(__u8, saddr, sizeof(struct sockaddr_in6)) + __array(__u8, daddr, sizeof(struct sockaddr_in6)) ), TP_fast_assign( + const struct udphdr *uh = (const struct udphdr *)udp_hdr(skb); + __entry->rc = rc; - __entry->lport = inet_sk(sk)->inet_num; + + /* for filtering use */ + __entry->sport = ntohs(uh->source); + __entry->dport = ntohs(uh->dest); + __entry->family = sk->sk_family; + + memset(__entry->saddr, 0, sizeof(struct sockaddr_in6)); + memset(__entry->daddr, 0, sizeof(struct sockaddr_in6)); + + TP_STORE_ADDR_PORTS_SKB(__entry, skb, uh); ), - TP_printk("rc=%d port=%hu", __entry->rc, __entry->lport) + TP_printk("rc=%d family=%s src=%pISpc dest=%pISpc", __entry->rc, + show_family_name(__entry->family), + __entry->saddr, __entry->daddr) ); #endif /* _TRACE_UDP_H */ diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 661d0e0d273f..531882f321f2 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2049,8 +2049,8 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) drop_reason = SKB_DROP_REASON_PROTO_MEM; } UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite); + trace_udp_fail_queue_rcv_skb(rc, sk, skb); 
kfree_skb_reason(skb, drop_reason); - trace_udp_fail_queue_rcv_skb(rc, sk); return -1; } diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index 7c1e6469d091..2e4dc5e6137b 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include @@ -658,8 +659,8 @@ static int __udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb) drop_reason = SKB_DROP_REASON_PROTO_MEM; } UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS, is_udplite); + trace_udp_fail_queue_rcv_skb(rc, sk, skb); kfree_skb_reason(skb, drop_reason); - trace_udp_fail_queue_rcv_skb(rc, sk); return -1; } -- 2.40.1
[PATCH net-next v4 1/2] net: port TP_STORE_ADDR_PORTS_SKB macro to be tcp/udp independent
This patch moves TP_STORE_ADDR_PORTS_SKB() to a common header and removes the TCP specific implementation details. Previously the macro assumed the skb passed as an argument is a TCP packet, the implementation now uses an argument to the L4 header and uses that to extract the source/destination ports, which happen to be named the same in "struct tcphdr" and "struct udphdr" Reviewed-by: Jason Xing Signed-off-by: Balazs Scheidler --- include/trace/events/net_probe_common.h | 40 ++ include/trace/events/tcp.h | 45 ++--- 2 files changed, 42 insertions(+), 43 deletions(-) diff --git a/include/trace/events/net_probe_common.h b/include/trace/events/net_probe_common.h index b1f9a4d3ee13..5e33f91bdea3 100644 --- a/include/trace/events/net_probe_common.h +++ b/include/trace/events/net_probe_common.h @@ -70,4 +70,44 @@ TP_STORE_V4MAPPED(__entry, saddr, daddr) #endif +#define TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh) \ + do {\ + struct sockaddr_in *v4 = (void *)__entry->saddr;\ + \ + v4->sin_family = AF_INET; \ + v4->sin_port = protoh->source; \ + v4->sin_addr.s_addr = ip_hdr(skb)->saddr; \ + v4 = (void *)__entry->daddr;\ + v4->sin_family = AF_INET; \ + v4->sin_port = protoh->dest;\ + v4->sin_addr.s_addr = ip_hdr(skb)->daddr; \ + } while (0) + +#if IS_ENABLED(CONFIG_IPV6) + +#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh) \ + do {\ + const struct iphdr *iph = ip_hdr(skb); \ + \ + if (iph->version == 6) {\ + struct sockaddr_in6 *v6 = (void *)__entry->saddr; \ + \ + v6->sin6_family = AF_INET6; \ + v6->sin6_port = protoh->source; \ + v6->sin6_addr = ipv6_hdr(skb)->saddr; \ + v6 = (void *)__entry->daddr;\ + v6->sin6_family = AF_INET6; \ + v6->sin6_port = protoh->dest; \ + v6->sin6_addr = ipv6_hdr(skb)->daddr; \ + } else \ + TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh); \ + } while (0) + +#else + +#define TP_STORE_ADDR_PORTS_SKB(__entry, skb, protoh) \ + TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb, protoh) + +#endif + #endif diff --git a/include/trace/events/tcp.h 
b/include/trace/events/tcp.h index 3c08a0846c47..1db95175c1e5 100644 --- a/include/trace/events/tcp.h +++ b/include/trace/events/tcp.h @@ -273,48 +273,6 @@ TRACE_EVENT(tcp_probe, __entry->skbaddr, __entry->skaddr) ); -#define TP_STORE_ADDR_PORTS_SKB_V4(__entry, skb) \ - do {\ - const struct tcphdr *th = (const struct tcphdr *)skb->data; \ - struct sockaddr_in *v4 = (void *)__entry->saddr;\ - \ - v4->sin_family = AF_INET; \ - v4->sin_port = th->source; \ - v4->sin_addr.s_addr = ip_hdr(skb)->saddr; \ - v4 = (void *)__entry->daddr;\ - v4->sin_family = AF_INET; \ - v4->sin_port = th->dest;\ - v4->sin_addr.s_addr = ip_hdr(skb)->daddr; \ - } while (0) - -#if IS_ENABLED(CONFIG_IPV6) - -#define TP_STORE_ADDR_PORTS_SKB(__entry, skb) \ - do {\ - const struct iphdr *iph = ip_hdr(skb); \ - \ - if (iph->version == 6) {\ - const struct tcphdr *th = (const struct tcphdr *)skb->data; \ - struct sockaddr_in6 *v6 = (void *)__entry->saddr; \ - \
Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver
Hi Mathieu, > > > > > > > > > > This is an initial patchset for allowing the remote processor to be > > > > > > > > > > turned on and off. > > > > > > > > > > The FW is already loaded before the Corstone-1000 SoC is > > > > > > > > > > powered on, and this > > > > > > > > > > is done through the FPGA board bootloader in the case of the > > > > > > > > > > FPGA target, or by the Corstone-1000 FVP model > > > > > > > > > > (emulator). > > > > > > > > > > > > > > > > > > From the above I take it that booting with a preloaded > > > > > > > > > firmware is a > > > > > > > > > scenario that needs to be supported and not just a temporary > > > > > > > > > stage. > > > > > > > > > > > > > > > > The current status of the Corstone-1000 SoC requires that there > > > > > > > > is > > > > > > > > a preloaded firmware for the external core. Preloading is done > > > > > > > > externally > > > > > > > > either through the FPGA bootloader or the emulator (FVP) before > > > > > > > > powering > > > > > > > > on the SoC. > > > > > > > > > > > > > > > > > > > > > > Ok > > > > > > > > > > > > > > > Corstone-1000 will be upgraded in a way that the A core running > > > > > > > > Linux is able > > > > > > > > to share memory with the remote core and is also able to > > > > > > > > access the remote > > > > > > > > core's memory so Linux can copy the firmware to it. These HW changes > > > > > > > > are still in progress. > > > > > > > > This is why this patchset relies on a preloaded firmware. And it's step 1 > > > > > > > > of adding remoteproc support for Corstone. > > > > > > > > > > > > > > > > > > > > > > Ok, so there is a HW problem where the A core and M core can't see > > > > > > > each other's > > > > > > > memory, preventing the A core from copying the firmware image to > > > > > > > the proper > > > > > > > location.
> > > > > > > > > > > > > > When the HW is fixed, will there be a need to support scenarios > > > > > > > where the > > > > > > > firmware image has been preloaded into memory? > > > > > > > > > > > > No, this scenario won't apply once we get the HW upgrade. No need > > > > > > for an > > > > > > external entity anymore. The firmware(s) will all be files in the > > > > > > Linux filesystem. > > > > > > > > > > > > > > > > Very well. I am willing to continue with this driver, but it does so > > > > > little that > > > > > I wonder if it wouldn't simply be better to move forward with > > > > > upstreaming when > > > > > the HW is fixed. The choice is yours. > > > > > > > > > > > > > I think Robin has raised a few points that need clarification; I think that > > > > was > > > > done as part of the DT binding patch. I share those concerns, and I was > > > > working toward the same concerns with the questions I asked on the > > > > Corstone > > > > device tree changes. > > > > > > > > > > I also agree with Robin's point of view. Proceeding with an initial > > > driver with minimal functionality doesn't preclude having complete > > > bindings. But that said, and as I pointed out, it might be better to > > > wait for the HW to be fixed before moving forward. > > > > We checked with the HW teams. The missing features will be implemented, but > > this will take time. > > > > The foundation driver as it is right now is still valuable for people > > wanting to > > know how to power control Corstone external systems in a future-proof manner > > (even in the incomplete state). We prefer to address all the review comments > > made so it can be merged. This includes making the DT binding as complete as > > possible, as you advised. Then, once the HW is ready, I'll implement the > > comms > > and the FW reload part. Is that OK, please? > > > > I'm in agreement with that plan as long as we agree the current > preloaded heuristic is temporary and is not a valid long-term > scenario.
Yes, that's the plan, no problem. Cheers, Abdellatif
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 6:38 PM EET, Mark Rutland wrote: > On Wed, Mar 27, 2024 at 12:24:03AM +0900, Masami Hiramatsu wrote: > > On Tue, 26 Mar 2024 14:46:10 + > > Mark Rutland wrote: > > > > > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote: > > > > I think we'd better introduce `alloc_execmem()`, > > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > > > > > > > config HAVE_ALLOC_EXECMEM > > > > bool > > > > > > > > config ALLOC_EXECMEM > > > > bool "Executable trampoline memory allocation" > > > > depends on MODULES || HAVE_ALLOC_EXECMEM > > > > > > > > And define a fallback macro to module_alloc() like this. > > > > > > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > > > > #define alloc_execmem(size, gfp) module_alloc(size) > > > > #endif > > > > > > Please can we *not* do this? I think this is abstracting at the wrong > > > level (as > > > I mentioned on the prior execmem proposals). > > > > > > Different executable allocations can have different requirements. For > > > example, > > > on arm64 modules need to be within 2G of the kernel image, but the > > > kprobes XOL > > > areas can be anywhere in the kernel VA space. > > > > > > Forcing those behind the same interface makes things *harder* for > > > architectures > > > and/or makes the common code more complicated (if that ends up having to > > > track > > > all those different requirements). From my PoV it'd be much better to have > > > separate kprobes_alloc_*() functions for kprobes which an architecture > > > can then > > > choose to implement using a common library if it wants to.
> > > > > > I took a look at doing that using the core ifdeffery fixups from Jarkko's > > > v6, > > > and it looks pretty clean to me (and works in testing on arm64): > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules > > > > > > Could we please start with that approach, with kprobe-specific alloc/free > > > code > > > provided by the architecture? > > > > OK, as far as I can read the code, this method also works and is neat! > > (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM > > to the user does not help; it should be an internal change. So hiding this > > change > > from the user is the better choice. Then there is no reason to introduce the new > > alloc_execmem; just expanding kprobe_alloc_insn_page() is reasonable. > > > > Mark, can you send this series here, so that others can review/test it? > I've written up a cover letter and sent that out: > > https://lore.kernel.org/lkml/20240326163624.3253157-1-mark.rutl...@arm.com/ > > Mark. Ya, saw it, thanks! BR, Jarkko
Re: [PATCH 11/12] [v4] kallsyms: rework symbol lookup return codes
On Tue, Mar 26, 2024, at 18:06, Steven Rostedt wrote: > On Tue, 26 Mar 2024 15:53:38 +0100 > Arnd Bergmann wrote: > >> -const char * >> +int >> ftrace_mod_address_lookup(unsigned long addr, unsigned long *size, >> unsigned long *off, char **modname, char *sym) >> { >> struct ftrace_mod_map *mod_map; >> -const char *ret = NULL; >> +int ret; > > This needs to be ret = 0; Fixed now, thanks! I'll send a v5 in a few days. Arnd
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 6:15 PM EET, Calvin Owens wrote: > On Wednesday 03/27 at 00:24 +0900, Masami Hiramatsu wrote: > > On Tue, 26 Mar 2024 14:46:10 + > > Mark Rutland wrote: > > > > > Hi Masami, > > > > > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote: > > > > Hi Jarkko, > > > > > > > > On Sun, 24 Mar 2024 01:29:08 +0200 > > > > Jarkko Sakkinen wrote: > > > > > > > > > Tracing with kprobes while running a monolithic kernel is currently > > > > > impossible due the kernel module allocator dependency. > > > > > > > > > > Address the issue by allowing architectures to implement > > > > > module_alloc() > > > > > and module_memfree() independent of the module subsystem. An arch tree > > > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file. > > > > > > > > > > Realize the feature on RISC-V by separating allocator to > > > > > module_alloc.c > > > > > and implementing module_memfree(). > > > > > > > > Even though, this involves changes in arch-independent part. So it > > > > should > > > > be solved by generic way. Did you checked Calvin's thread? > > > > > > > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/ > > > > > > > > I think, we'd better to introduce `alloc_execmem()`, > > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > > > > > > > config HAVE_ALLOC_EXECMEM > > > > bool > > > > > > > > config ALLOC_EXECMEM > > > > bool "Executable trampline memory allocation" > > > > depends on MODULES || HAVE_ALLOC_EXECMEM > > > > > > > > And define fallback macro to module_alloc() like this. > > > > > > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > > > > #define alloc_execmem(size, gfp)module_alloc(size) > > > > #endif > > > > > > Please can we *not* do this? I think this is abstracting at the wrong > > > level (as > > > I mentioned on the prior execmem proposals). > > > > > > Different exectuable allocations can have different requirements. 
For > > > example, > > > on arm64 modules need to be within 2G of the kernel image, but the > > > kprobes XOL > > > areas can be anywhere in the kernel VA space. > > > > > > Forcing those behind the same interface makes things *harder* for > > > architectures > > > and/or makes the common code more complicated (if that ends up having to > > > track > > > all those different requirements). From my PoV it'd be much better to have > > > separate kprobes_alloc_*() functions for kprobes which an architecture > > > can then > > > choose to implement using a common library if it wants to. > > > > > > I took a look at doing that using the core ifdeffery fixups from Jarkko's > > > v6, > > > and it looks pretty clean to me (and works in testing on arm64): > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules > > > > > > Could we please start with that approach, with kprobe-specific alloc/free > > > code > > > provided by the architecture? > > Heh, I also noticed that dead !RWX branch in arm64 patch_map(), I was > about to send a patch to remove it. > > > OK, as far as I can read the code, this method also works and neat! > > (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM > > to user does not help, it should be an internal change. So hiding this > > change > > from user is better choice. Then there is no reason to introduce the new > > alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable. > > I'm happy with this, it solves the first half of my problem. But I want > eBPF to work in the !MODULES case too. > > I think Mark's approach can work for bpf as well, without needing to > touch module_alloc() at all? So I might be able to drop that first patch > entirely. Yeah, I think we're aligned. Later on, if/when you send the bpf series please also cc me and I might possibly test those patches too. BR, Jarkko
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 5:24 PM EET, Masami Hiramatsu (Google) wrote: > On Tue, 26 Mar 2024 14:46:10 + > Mark Rutland wrote: > > > Hi Masami, > > > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote: > > > Hi Jarkko, > > > > > > On Sun, 24 Mar 2024 01:29:08 +0200 > > > Jarkko Sakkinen wrote: > > > > > > > Tracing with kprobes while running a monolithic kernel is currently > > > > impossible due to the kernel module allocator dependency. > > > > > > > > Address the issue by allowing architectures to implement module_alloc() > > > > and module_memfree() independent of the module subsystem. An arch tree > > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file. > > > > > > > > Realize the feature on RISC-V by separating the allocator into module_alloc.c > > > > and implementing module_memfree(). > > > > > > Even though this involves changes in the arch-independent part, it should > > > be solved in a generic way. Did you check Calvin's thread? > > > > > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/ > > > > > > I think we'd better introduce `alloc_execmem()`, > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > > > > > config HAVE_ALLOC_EXECMEM > > > bool > > > > > > config ALLOC_EXECMEM > > > bool "Executable trampoline memory allocation" > > > depends on MODULES || HAVE_ALLOC_EXECMEM > > > > > > And define a fallback macro to module_alloc() like this. > > > > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > > > #define alloc_execmem(size, gfp) module_alloc(size) > > > #endif > > > > Please can we *not* do this? I think this is abstracting at the wrong level > > (as > > I mentioned on the prior execmem proposals). > > > > Different executable allocations can have different requirements. For > > example, > > on arm64 modules need to be within 2G of the kernel image, but the kprobes > > XOL > > areas can be anywhere in the kernel VA space.
> > > > Forcing those behind the same interface makes things *harder* for > > architectures > > and/or makes the common code more complicated (if that ends up having to > > track > > all those different requirements). From my PoV it'd be much better to have > > separate kprobes_alloc_*() functions for kprobes which an architecture can > > then > > choose to implement using a common library if it wants to. > > > > I took a look at doing that using the core ifdeffery fixups from Jarkko's > > v6, > > and it looks pretty clean to me (and works in testing on arm64): > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules > > > > Could we please start with that approach, with kprobe-specific alloc/free > > code > > provided by the architecture? > > OK, as far as I can read the code, this method also works and neat! > (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM > to user does not help, it should be an internal change. So hiding this change > from user is better choice. Then there is no reason to introduce the new > alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable. > > Mark, can you send this series here, so that others can review/test it? I'm totally fine with this but yeah best would be if it could carry the riscv part. Mark, even if you have only possibility to compile test that part I can check that it works. BR, Jarkko
Re: [PATCH 11/12] [v4] kallsyms: rework symbol lookup return codes
On Tue, 26 Mar 2024 15:53:38 +0100 Arnd Bergmann wrote:

> -const char *
> +int
> ftrace_mod_address_lookup(unsigned long addr, unsigned long *size,
> 			  unsigned long *off, char **modname, char *sym)
> {
> 	struct ftrace_mod_map *mod_map;
> -	const char *ret = NULL;
> +	int ret;

This needs to be ret = 0;

>
> 	/* mod_map is freed via call_rcu() */
> 	preempt_disable();

As here we have:

	list_for_each_entry_rcu(mod_map, &ftrace_mod_maps, list) {
		ret = ftrace_func_address_lookup(mod_map, addr, size, off, sym);
		if (ret) {
			if (modname)
				*modname = mod_map->mod->name;
			break;
		}
	}
	preempt_enable();

	return ret;
}

Where it is possible for the loop never to be executed.

-- Steve
Re: [PATCH v5 1/2] kprobes: textmem API
On Tue Mar 26, 2024 at 5:05 PM EET, Masami Hiramatsu (Google) wrote: > > According to kconfig-language.txt: > > > > "select should be used with care. select will force a symbol to a value > > without visiting the dependencies." > > > > So the problem here lies in the KPROBES config entry using a select statement > > to pick ALLOC_EXECMEM. It will not take the depends on statement into > > account and thus will allow selecting kprobes without any allocator in > > place. > > OK, in that case "depends on" is good. Yeah, I did not remember this at all. I only recalled it when I started to get linking errors while compiling just the first patch... It's a bit of an unintuitive twist in kconfig :-) BR, Jarkko
Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 6:49 PM EET, Jarkko Sakkinen wrote: > On Tue Mar 26, 2024 at 3:57 PM EET, Alexandre Ghiti wrote: > > Hi Jarkko, > > > > On 25/03/2024 22:55, Jarkko Sakkinen wrote: > > > Tracing with kprobes while running a monolithic kernel is currently > > > impossible due to the kernel module allocator dependency. > > > > > > Address the issue by implementing the textmem API for RISC-V. > > > > > > Link: https://www.sochub.fi # for power-on testing new SoCs with a > > > minimal stack > > > Link: > > > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ > > > # continuation > > > Signed-off-by: Jarkko Sakkinen > > > --- > > > v5: > > > - No changes, except removing the alloc_execmem() call which should have > > >been part of the previous patch. > > > v4: > > > - Include linux/execmem.h. > > > v3: > > > - Architecture-independent parts have been split into separate patches. > > > - Do not change arch/riscv/kernel/module.c as it is out of scope for > > >this patch set now. > > > v2: > > > - Better late than never, right? :-) > > > - Focus only on RISC-V for now to make the patch more digestible. This > > >is the arch where I use the patch on a daily basis to help with QA. > > > - Introduce the HAVE_KPROBES_ALLOC flag to help with more gradual migration.
> > > --- > > > arch/riscv/Kconfig | 1 + > > > arch/riscv/kernel/Makefile | 3 +++ > > > arch/riscv/kernel/execmem.c | 22 ++ > > > 3 files changed, 26 insertions(+) > > > create mode 100644 arch/riscv/kernel/execmem.c > > > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > > > index e3142ce531a0..499512fb17ff 100644 > > > --- a/arch/riscv/Kconfig > > > +++ b/arch/riscv/Kconfig > > > @@ -132,6 +132,7 @@ config RISCV > > > select HAVE_KPROBES if !XIP_KERNEL > > > select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL > > > select HAVE_KRETPROBES if !XIP_KERNEL > > > + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL > > > # https://github.com/ClangBuiltLinux/linux/issues/1881 > > > select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD > > > select HAVE_MOVE_PMD > > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile > > > index 604d6bf7e476..337797f10d3e 100644 > > > --- a/arch/riscv/kernel/Makefile > > > +++ b/arch/riscv/kernel/Makefile > > > @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o > > > > > > obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o > > > obj-$(CONFIG_MODULES) += module.o > > > +ifeq ($(CONFIG_ALLOC_EXECMEM),y) > > > +obj-y+= execmem.o > > > +endif > > > obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o > > > > > > obj-$(CONFIG_CPU_PM)+= suspend_entry.o suspend.o > > > diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c > > > new file mode 100644 > > > index ..3e52522ead32 > > > --- /dev/null > > > +++ b/arch/riscv/kernel/execmem.c > > > @@ -0,0 +1,22 @@ > > > +// SPDX-License-Identifier: GPL-2.0-or-later > > > + > > > +#include > > > +#include > > > +#include > > > +#include > > > + > > > +void *alloc_execmem(unsigned long size, gfp_t /* gfp */) > > Need to have the parameter name here. I guess this could just as well > pass through gfp to vmalloc from the caller as kprobes does call > module_alloc() with GFP_KERNEL set in RISC-V. 
> > > > +{ > > > + return __vmalloc_node_range(size, 1, MODULES_VADDR, > > > + MODULES_END, GFP_KERNEL, > > > + PAGE_KERNEL, 0, NUMA_NO_NODE, > > > + __builtin_return_address(0)); > > > +} > > > > > > The __vmalloc_node_range() line ^^ must be from an old kernel since we > > added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix > > module_alloc() that did not reset the linear mapping permissions"). > > > > In addition, I guess module_alloc() should now use alloc_execmem() right? > > Ack for the first comment. For the 2nd, it is up to arch/ to choose > whether to have shared or separate allocators. > > So if you want, I can change it that way, but I did not want to make the > call myself. > > > > > > > > + > > > +void free_execmem(void *region) > > > +{ > > > + if (in_interrupt()) > > > + pr_warn("In interrupt context: vmalloc may not work.\n"); > > > + > > > + vfree(region); > > > +} > > > > > > I remember Mike Rapoport sent a patchset to introduce an API for > > executable memory allocation > > (https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/), > > > > how does this intersect with your work? I don't know the status of his > > patchset though. > > > > Thanks, > > > > Alex > > I have also made a patch set for kprobes, in 2022: > > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ > > I think Calvin's, Mike's, and my early patch sets have the same > problem: they try to choke all architectures at
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 4:46 PM EET, Mark Rutland wrote: > Hi Masami, > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote: > > Hi Jarkko, > > > > On Sun, 24 Mar 2024 01:29:08 +0200 > > Jarkko Sakkinen wrote: > > > > > Tracing with kprobes while running a monolithic kernel is currently > > > impossible due to the kernel module allocator dependency. > > > > > > Address the issue by allowing architectures to implement module_alloc() > > > and module_memfree() independent of the module subsystem. An arch tree > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file. > > > > > > Realize the feature on RISC-V by separating the allocator into module_alloc.c > > > and implementing module_memfree(). > > > > Even though this involves changes in the arch-independent part, it should > > be solved in a generic way. Did you check Calvin's thread? > > > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/ > > > > I think we'd better introduce `alloc_execmem()`, > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > > > config HAVE_ALLOC_EXECMEM > > bool > > > > config ALLOC_EXECMEM > > bool "Executable trampoline memory allocation" > > depends on MODULES || HAVE_ALLOC_EXECMEM > > > > And define a fallback macro to module_alloc() like this. > > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > > #define alloc_execmem(size, gfp) module_alloc(size) > > #endif > > Please can we *not* do this? I think this is abstracting at the wrong level > (as > I mentioned on the prior execmem proposals). > > Different executable allocations can have different requirements. For > example, > on arm64 modules need to be within 2G of the kernel image, but the kprobes XOL > areas can be anywhere in the kernel VA space. > > Forcing those behind the same interface makes things *harder* for > architectures > and/or makes the common code more complicated (if that ends up having to track > all those different requirements).
From my PoV it'd be much better to have > separate kprobes_alloc_*() functions for kprobes which an architecture can > then > choose to implement using a common library if it wants to. > > I took a look at doing that using the core ifdeffery fixups from Jarkko's v6, > and it looks pretty clean to me (and works in testing on arm64): > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules > > Could we please start with that approach, with kprobe-specific alloc/free code > provided by the architecture? How should we move forward? I'm fine with someone picking up the pieces of my work as long as the riscv side is also included. I can also continue rotating this, whatever works. > > Thanks, > Mark. BR, Jarkko
Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue Mar 26, 2024 at 3:57 PM EET, Alexandre Ghiti wrote: > Hi Jarkko, > > On 25/03/2024 22:55, Jarkko Sakkinen wrote: > > Tracing with kprobes while running a monolithic kernel is currently > > impossible due to the kernel module allocator dependency. > > > > Address the issue by implementing the textmem API for RISC-V. > > > > Link: https://www.sochub.fi # for power-on testing new SoCs with a minimal > > stack > > Link: > > https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # > > continuation > > Signed-off-by: Jarkko Sakkinen > > --- > > v5: > > - No changes, except removing the alloc_execmem() call which should have > >been part of the previous patch. > > v4: > > - Include linux/execmem.h. > > v3: > > - Architecture-independent parts have been split into separate patches. > > - Do not change arch/riscv/kernel/module.c as it is out of scope for > >this patch set now. > > v2: > > - Better late than never, right? :-) > > - Focus only on RISC-V for now to make the patch more digestible. This > >is the arch where I use the patch on a daily basis to help with QA. > > - Introduce the HAVE_KPROBES_ALLOC flag to help with more gradual migration.
> > --- > > arch/riscv/Kconfig | 1 + > > arch/riscv/kernel/Makefile | 3 +++ > > arch/riscv/kernel/execmem.c | 22 ++ > > 3 files changed, 26 insertions(+) > > create mode 100644 arch/riscv/kernel/execmem.c > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > > index e3142ce531a0..499512fb17ff 100644 > > --- a/arch/riscv/Kconfig > > +++ b/arch/riscv/Kconfig > > @@ -132,6 +132,7 @@ config RISCV > > select HAVE_KPROBES if !XIP_KERNEL > > select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL > > select HAVE_KRETPROBES if !XIP_KERNEL > > + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL > > # https://github.com/ClangBuiltLinux/linux/issues/1881 > > select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD > > select HAVE_MOVE_PMD > > diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile > > index 604d6bf7e476..337797f10d3e 100644 > > --- a/arch/riscv/kernel/Makefile > > +++ b/arch/riscv/kernel/Makefile > > @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o > > > > obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o > > obj-$(CONFIG_MODULES) += module.o > > +ifeq ($(CONFIG_ALLOC_EXECMEM),y) > > +obj-y += execmem.o > > +endif > > obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o > > > > obj-$(CONFIG_CPU_PM) += suspend_entry.o suspend.o > > diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c > > new file mode 100644 > > index ..3e52522ead32 > > --- /dev/null > > +++ b/arch/riscv/kernel/execmem.c > > @@ -0,0 +1,22 @@ > > +// SPDX-License-Identifier: GPL-2.0-or-later > > + > > +#include > > +#include > > +#include > > +#include > > + > > +void *alloc_execmem(unsigned long size, gfp_t /* gfp */) Need to have the parameter name here. I guess this could just as well pass through gfp to vmalloc from the caller as kprobes does call module_alloc() with GFP_KERNEL set in RISC-V. 
> > +{ > > + return __vmalloc_node_range(size, 1, MODULES_VADDR, > > + MODULES_END, GFP_KERNEL, > > + PAGE_KERNEL, 0, NUMA_NO_NODE, > > + __builtin_return_address(0)); > > +} > > > The __vmalloc_node_range() line ^^ must be from an old kernel since we > added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix > module_alloc() that did not reset the linear mapping permissions"). > > In addition, I guess module_alloc() should now use alloc_execmem() right? Ack for the first comment. For the 2nd it is up to arch/ to choose whether to have shared or separate allocators. So if you want I can change it that way but did not want to make the call myself. > > > > + > > +void free_execmem(void *region) > > +{ > > + if (in_interrupt()) > > + pr_warn("In interrupt context: vmalloc may not work.\n"); > > + > > + vfree(region); > > +} > > > I remember Mike Rapoport sent a patchset to introduce an API for > executable memory allocation > (https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/), > how does this intersect with your work? I don't know the status of his > patchset though. > > Thanks, > > Alex I have also made a patch set for kprobes in the 2022: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ I think this Calvin's, Mike's and my early patch set have the same problem: they try to choke all architectures at once. And further, Calvin's and Mike's work also try to cover also tracing subsystems at once. I feel that my relatively small patch set which deals only with trivial kprobe (which is more in the leaf than e.g. bpf which is more like orchestrator tool) and implements one arch of which dog food I actually eat is a better starting
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue, Mar 26, 2024 at 09:15:14AM -0700, Calvin Owens wrote: > On Wednesday 03/27 at 00:24 +0900, Masami Hiramatsu wrote: > > On Tue, 26 Mar 2024 14:46:10 + > > Mark Rutland wrote: > > > Different exectuable allocations can have different requirements. For > > > example, > > > on arm64 modules need to be within 2G of the kernel image, but the > > > kprobes XOL > > > areas can be anywhere in the kernel VA space. > > > > > > Forcing those behind the same interface makes things *harder* for > > > architectures > > > and/or makes the common code more complicated (if that ends up having to > > > track > > > all those different requirements). From my PoV it'd be much better to have > > > separate kprobes_alloc_*() functions for kprobes which an architecture > > > can then > > > choose to implement using a common library if it wants to. > > > > > > I took a look at doing that using the core ifdeffery fixups from Jarkko's > > > v6, > > > and it looks pretty clean to me (and works in testing on arm64): > > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules > > > > > > Could we please start with that approach, with kprobe-specific alloc/free > > > code > > > provided by the architecture? > > Heh, I also noticed that dead !RWX branch in arm64 patch_map(), I was > about to send a patch to remove it. > > > OK, as far as I can read the code, this method also works and neat! > > (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM > > to user does not help, it should be an internal change. So hiding this > > change > > from user is better choice. Then there is no reason to introduce the new > > alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable. > > I'm happy with this, it solves the first half of my problem. But I want > eBPF to work in the !MODULES case too. > > I think Mark's approach can work for bpf as well, without needing to > touch module_alloc() at all? 
So I might be able to drop that first patch > entirely. I'd be very happy with eBPF following the same approach, with BPF-specific alloc/free functions that we can implement in arch code. IIUC eBPF code *does* want to be within range of the core kernel image, so for arm64 we'd want to factor some common logic out of module_alloc() and into something that module_alloc() and "bpf_alloc()" (or whatever it would be called) could use. So I don't think we'd necessarily save on touching module_alloc(), but I think the resulting split would be better. Thanks, Mark.
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Wed, Mar 27, 2024 at 12:24:03AM +0900, Masami Hiramatsu wrote: > On Tue, 26 Mar 2024 14:46:10 + > Mark Rutland wrote: > > > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote: > > > I think, we'd better to introduce `alloc_execmem()`, > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > > > > > config HAVE_ALLOC_EXECMEM > > > bool > > > > > > config ALLOC_EXECMEM > > > bool "Executable trampline memory allocation" > > > depends on MODULES || HAVE_ALLOC_EXECMEM > > > > > > And define fallback macro to module_alloc() like this. > > > > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > > > #define alloc_execmem(size, gfp) module_alloc(size) > > > #endif > > > > Please can we *not* do this? I think this is abstracting at the wrong level > > (as > > I mentioned on the prior execmem proposals). > > > > Different exectuable allocations can have different requirements. For > > example, > > on arm64 modules need to be within 2G of the kernel image, but the kprobes > > XOL > > areas can be anywhere in the kernel VA space. > > > > Forcing those behind the same interface makes things *harder* for > > architectures > > and/or makes the common code more complicated (if that ends up having to > > track > > all those different requirements). From my PoV it'd be much better to have > > separate kprobes_alloc_*() functions for kprobes which an architecture can > > then > > choose to implement using a common library if it wants to. > > > > I took a look at doing that using the core ifdeffery fixups from Jarkko's > > v6, > > and it looks pretty clean to me (and works in testing on arm64): > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules > > > > Could we please start with that approach, with kprobe-specific alloc/free > > code > > provided by the architecture? > > OK, as far as I can read the code, this method also works and neat! > (and minimum intrusion). 
I actually found that exposing CONFIG_ALLOC_EXECMEM > to user does not help, it should be an internal change. So hiding this change > from user is better choice. Then there is no reason to introduce the new > alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable. > > Mark, can you send this series here, so that others can review/test it? I've written up a cover letter and sent that out: https://lore.kernel.org/lkml/20240326163624.3253157-1-mark.rutl...@arm.com/ Mark.
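The `alloc_execmem()` fallback macro discussed in this thread can be modelled in plain userspace C. The sketch below uses a malloc-backed stand-in for `module_alloc()`; the names mirror Masami's proposal but everything here is hypothetical, not the kernel implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's module_alloc() (hypothetical). */
static void *module_alloc(size_t size)
{
	return malloc(size);
}

/*
 * When an architecture provides no dedicated executable-memory
 * allocator (no CONFIG_HAVE_ALLOC_EXECMEM), fall back to
 * module_alloc(), dropping the gfp argument -- the shape of the
 * fallback macro quoted above.
 */
#ifndef CONFIG_HAVE_ALLOC_EXECMEM
#define GFP_KERNEL 0
#define alloc_execmem(size, gfp) module_alloc(size)
#endif
```

Mark's counter-argument is that this single interface cannot express per-user placement constraints (modules within 2G of the kernel image vs. kprobes XOL pages anywhere), which is why the thread converges on kprobe-specific alloc/free hooks instead.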
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, Mar 25, 2024 at 3:11 PM Steven Rostedt wrote: > > On Mon, 25 Mar 2024 11:38:48 +0900 > Masami Hiramatsu (Google) wrote: > > > On Fri, 22 Mar 2024 09:03:23 -0700 > > Andrii Nakryiko wrote: > > > > > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to > > > control whether ftrace low-level code performs additional > > > rcu_is_watching()-based validation logic in an attempt to catch noinstr > > > violations. > > > > > > This check is expected to never be true in practice and would be best > > > controlled with extra config to let users decide if they are willing to > > > pay the price. > > > > Hmm, for me, it sounds like "WARN_ON(something) never be true in practice > > so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > is OK, but tht should be set to Y by default. If you have already verified > > that your system never make it true and you want to optimize your ftrace > > path, you can manually set it to N at your own risk. > > > > Really, it's for debugging. I would argue that it should *not* be default y. > Peter added this to find all the locations that could be called where RCU > is not watching. But the issue I have is that this is that it *does cause > overhead* with function tracing. > > I believe we found pretty much all locations that were an issue, and we > should now just make it an option for developers. > > It's no different than lockdep. Test boxes should have it enabled, but > there's no reason to have this enabled in a production system. > I tend to agree with Steven here (which is why I sent this patch as it is), but I'm happy to do it as an opt-out, if Masami insists. Please do let me know if I need to send v2 or this one is actually the one we'll end up using. Thanks! > -- Steve > > > > > > > > Cc: Steven Rostedt > > > Cc: Masami Hiramatsu > > > Cc: Paul E. 
McKenney > > > Signed-off-by: Andrii Nakryiko > > > --- > > > include/linux/trace_recursion.h | 2 +- > > > kernel/trace/Kconfig| 13 + > > > 2 files changed, 14 insertions(+), 1 deletion(-) > > > > > > diff --git a/include/linux/trace_recursion.h > > > b/include/linux/trace_recursion.h > > > index d48cd92d2364..24ea8ac049b4 100644 > > > --- a/include/linux/trace_recursion.h > > > +++ b/include/linux/trace_recursion.h > > > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, > > > unsigned long parent_ip); > > > # define do_ftrace_record_recursion(ip, pip) do { } while (0) > > > #endif > > > > > > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR > > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > > # define trace_warn_on_no_rcu(ip) \ > > > ({ \ > > > bool __ret = !rcu_is_watching();\ > > > > BTW, maybe we can add "unlikely" in the next "if" line? > > > > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > > > index 61c541c36596..19bce4e217d6 100644 > > > --- a/kernel/trace/Kconfig > > > +++ b/kernel/trace/Kconfig > > > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE > > > This file can be reset, but the limit can not change in > > > size at runtime. > > > > > > +config FTRACE_VALIDATE_RCU_IS_WATCHING > > > + bool "Validate RCU is on during ftrace recursion check" > > > + depends on FUNCTION_TRACER > > > + depends on ARCH_WANTS_NO_INSTR > > > > default y > > > > > + help > > > + All callbacks that attach to the function tracing have some sort > > > + of protection against recursion. This option performs additional > > > + checks to make sure RCU is on when ftrace callbacks recurse. > > > + > > > + This will add more overhead to all ftrace-based invocations. > > > > ... invocations, but keep it safe. 
> > > > > + > > > + If unsure, say N > > > > If unsure, say Y > > > > Thank you, > > > > > + > > > config RING_BUFFER_RECORD_RECURSION > > > bool "Record functions that recurse in the ring buffer" > > > depends on FTRACE_RECORD_RECURSION > > > -- > > > 2.43.0 > > > > > > > >
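The config-gated check being debated compiles down to a macro that either performs the `rcu_is_watching()` test or costs nothing. A simplified userspace model (the real macro also records the recursion and warns; the `rcu_watching` flag here is a hypothetical stand-in) shows the shape, including the `unlikely()` hint Masami suggests:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's rcu_is_watching() state. */
static bool rcu_watching = true;

static bool rcu_is_watching(void)
{
	return rcu_watching;
}

#define CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING 1

#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
/*
 * Validation enabled: flag the (normally never-taken) !rcu case.
 * __builtin_expect(..., 0) is the expansion of unlikely(), telling the
 * compiler to optimize the fast path where RCU is watching.
 */
# define trace_warn_on_no_rcu(ip) (__builtin_expect(!rcu_is_watching(), 0))
#else
/* Validation compiled out: zero overhead on the tracing fast path. */
# define trace_warn_on_no_rcu(ip) (false)
#endif
```

This makes the trade-off in the thread concrete: with the option off, the macro is a constant and the branch disappears entirely; with it on, every ftrace entry pays for one extra predicted-not-taken test.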
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Wednesday 03/27 at 00:24 +0900, Masami Hiramatsu wrote: > On Tue, 26 Mar 2024 14:46:10 + > Mark Rutland wrote: > > > Hi Masami, > > > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote: > > > Hi Jarkko, > > > > > > On Sun, 24 Mar 2024 01:29:08 +0200 > > > Jarkko Sakkinen wrote: > > > > > > > Tracing with kprobes while running a monolithic kernel is currently > > > > impossible due the kernel module allocator dependency. > > > > > > > > Address the issue by allowing architectures to implement module_alloc() > > > > and module_memfree() independent of the module subsystem. An arch tree > > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file. > > > > > > > > Realize the feature on RISC-V by separating allocator to module_alloc.c > > > > and implementing module_memfree(). > > > > > > Even though, this involves changes in arch-independent part. So it should > > > be solved by generic way. Did you checked Calvin's thread? > > > > > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/ > > > > > > I think, we'd better to introduce `alloc_execmem()`, > > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > > > > > config HAVE_ALLOC_EXECMEM > > > bool > > > > > > config ALLOC_EXECMEM > > > bool "Executable trampline memory allocation" > > > depends on MODULES || HAVE_ALLOC_EXECMEM > > > > > > And define fallback macro to module_alloc() like this. > > > > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > > > #define alloc_execmem(size, gfp) module_alloc(size) > > > #endif > > > > Please can we *not* do this? I think this is abstracting at the wrong level > > (as > > I mentioned on the prior execmem proposals). > > > > Different exectuable allocations can have different requirements. For > > example, > > on arm64 modules need to be within 2G of the kernel image, but the kprobes > > XOL > > areas can be anywhere in the kernel VA space. 
> > > > Forcing those behind the same interface makes things *harder* for > > architectures > > and/or makes the common code more complicated (if that ends up having to > > track > > all those different requirements). From my PoV it'd be much better to have > > separate kprobes_alloc_*() functions for kprobes which an architecture can > > then > > choose to implement using a common library if it wants to. > > > > I took a look at doing that using the core ifdeffery fixups from Jarkko's > > v6, > > and it looks pretty clean to me (and works in testing on arm64): > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules > > > > Could we please start with that approach, with kprobe-specific alloc/free > > code > > provided by the architecture? Heh, I also noticed that dead !RWX branch in arm64 patch_map(), I was about to send a patch to remove it. > OK, as far as I can read the code, this method also works and neat! > (and minimum intrusion). I actually found that exposing CONFIG_ALLOC_EXECMEM > to user does not help, it should be an internal change. So hiding this change > from user is better choice. Then there is no reason to introduce the new > alloc_execmem, but just expand kprobe_alloc_insn_page() is reasonable. I'm happy with this, it solves the first half of my problem. But I want eBPF to work in the !MODULES case too. I think Mark's approach can work for bpf as well, without needing to touch module_alloc() at all? So I might be able to drop that first patch entirely. https://lore.kernel.org/all/a6b162aed1e6fea7f565ef9dd0204d6f2284bcce.1709676663.git.jcalvinow...@gmail.com/ Thanks, Calvin > Mark, can you send this series here, so that others can review/test it? > > Thank you! > > > > > > Thanks, > > Mark. > > > -- > Masami Hiramatsu (Google)
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Sun, Mar 24, 2024 at 8:03 PM Masami Hiramatsu wrote: > > On Thu, 21 Mar 2024 07:57:35 -0700 > Jonathan Haslam wrote: > > > Active uprobes are stored in an RB tree and accesses to this tree are > > dominated by read operations. Currently these accesses are serialized by > > a spinlock but this leads to enormous contention when large numbers of > > threads are executing active probes. > > > > This patch converts the spinlock used to serialize access to the > > uprobes_tree RB tree into a reader-writer spinlock. This lock type > > aligns naturally with the overwhelmingly read-only nature of the tree > > usage here. Although the addition of reader-writer spinlocks are > > discouraged [0], this fix is proposed as an interim solution while an > > RCU based approach is implemented (that work is in a nascent form). This > > fix also has the benefit of being trivial, self contained and therefore > > simple to backport. > > > > This change has been tested against production workloads that exhibit > > significant contention on the spinlock and an almost order of magnitude > > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). > > Looks good to me. > > Acked-by: Masami Hiramatsu (Google) Masami, Given the discussion around per-cpu rw semaphore and need for (internal) batched attachment API for uprobes, do you think you can apply this patch as is for now? We can then gain initial improvements in scalability that are also easy to backport, and Jonathan will work on a more complete solution based on per-cpu RW semaphore, as suggested by Ingo. > > BTW, how did you measure the overhead? I think spinlock overhead > will depend on how much lock contention happens. 
> 
> Thank you,
> 
> > [0] https://docs.kernel.org/locking/spinlocks.html
> >
> > Signed-off-by: Jonathan Haslam
> > ---
> >  kernel/events/uprobes.c | 22 +++---
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 929e98c62965..42bf9b6e8bc0 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> >   */
> >  #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
> >
> > -static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
> > +static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
> >
> >  #define UPROBES_HASH_SZ	13
> >  /* serialize uprobe->pending_list */
> > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> >  {
> >  	struct uprobe *uprobe;
> >
> > -	spin_lock(&uprobes_treelock);
> > +	read_lock(&uprobes_treelock);
> >  	uprobe = __find_uprobe(inode, offset);
> > -	spin_unlock(&uprobes_treelock);
> > +	read_unlock(&uprobes_treelock);
> >
> >  	return uprobe;
> >  }
> > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> >  {
> >  	struct uprobe *u;
> >
> > -	spin_lock(&uprobes_treelock);
> > +	write_lock(&uprobes_treelock);
> >  	u = __insert_uprobe(uprobe);
> > -	spin_unlock(&uprobes_treelock);
> > +	write_unlock(&uprobes_treelock);
> >
> >  	return u;
> >  }
> > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> >  	if (WARN_ON(!uprobe_is_active(uprobe)))
> >  		return;
> >
> > -	spin_lock(&uprobes_treelock);
> > +	write_lock(&uprobes_treelock);
> >  	rb_erase(&uprobe->rb_node, &uprobes_tree);
> > -	spin_unlock(&uprobes_treelock);
> > +	write_unlock(&uprobes_treelock);
> >  	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> >  	put_uprobe(uprobe);
> >  }
> > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> >  	min = vaddr_to_offset(vma, start);
> >  	max = min + (end - start) - 1;
> >
> > -	spin_lock(&uprobes_treelock);
> > +	read_lock(&uprobes_treelock);
> >  	n = find_node_in_range(inode, min, max);
> >  	if (n) {
> >  		for (t = n; t; t = rb_prev(t)) {
> > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> >  			get_uprobe(u);
> >  		}
> >  	}
> > -	spin_unlock(&uprobes_treelock);
> > +	read_unlock(&uprobes_treelock);
> >  }
> >
> >  /* @vma contains reference counter, not the probed instruction. */
> > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> >  	min = vaddr_to_offset(vma, start);
> >  	max = min + (end - start) - 1;
> >
> > -	spin_lock(&uprobes_treelock);
> > +	read_lock(&uprobes_treelock);
> >  	n = find_node_in_range(inode, min, max);
> > -	spin_unlock(&uprobes_treelock);
> > +	read_unlock(&uprobes_treelock);
> >
> >  	return !!n;
> >  }
> > --
> > 2.43.0
> >
> >
> 
> --
> Masami Hiramatsu (Google)
Re: [PATCH] virtio_ring: Fix the stale index in available ring
On Tue, Mar 26, 2024 at 11:43:13AM +, Will Deacon wrote: > On Tue, Mar 26, 2024 at 09:38:55AM +, Keir Fraser wrote: > > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote: > > > > Secondly, the debugging code is enhanced so that the available head for > > > > (last_avail_idx - 1) is read for twice and recorded. It means the > > > > available > > > > head for one specific available index is read for twice. I do see the > > > > available heads are different from the consecutive reads. More details > > > > are shared as below. > > > > > > > > From the guest side > > > > === > > > > > > > > virtio_net virtio0: output.0:id 86 is not a head! > > > > head to be released: 047 062 112 > > > > > > > > avail_idx: > > > > 000 49665 > > > > 001 49666 <-- > > > > : > > > > 015 49664 > > > > > > what are these #s 49665 and so on? > > > and how large is the ring? > > > I am guessing 49664 is the index ring size is 16 and > > > 49664 % 16 == 0 > > > > More than that, 49664 % 256 == 0 > > > > So again there seems to be an error in the vicinity of roll-over of > > the idx low byte, as I observed in the earlier log. Surely this is > > more than coincidence? > > Yeah, I'd still really like to see the disassembly for both sides of the > protocol here. Gavin, is that something you're able to provide? Worst > case, the host and guest vmlinux objects would be a starting point. > > Personally, I'd be fairly surprised if this was a hardware issue. Ok, long shot after eyeballing the vhost code, but does the diff below help at all? It looks like vhost_vq_avail_empty() can advance the value saved in 'vq->avail_idx' but without the read barrier, possibly confusing vhost_get_vq_desc() in polling mode. 
Will

--->8

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..87bff710331a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 		return false;
 
 	vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+	smp_rmb();
 
 	return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
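The `smp_rmb()` Will adds pairs with the guest driver's write barrier: the consumer must not read ring contents with an ordering older than the index that advertises them. In C11-atomics terms this is acquire/release pairing; the toy model below is a userspace sketch of that idea, not the vhost code itself:

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy ring: the producer publishes data, then advances avail_idx. */
static int ring_data;
static atomic_ushort avail_idx;
static unsigned short last_avail_idx;

static void producer_publish(int v)
{
	ring_data = v;
	/* Release: ring contents become visible before the index update. */
	atomic_store_explicit(&avail_idx, (unsigned short)(last_avail_idx + 1),
			      memory_order_release);
}

/*
 * Consumer-side emptiness check, modelling vhost_vq_avail_empty():
 * the acquire load plays the role of the added smp_rmb().  Without
 * that ordering, a later read of ring_data could observe stale
 * contents even though the index looked fresh -- the symptom Gavin
 * reported as "id X is not a head!".
 */
static int ring_empty(void)
{
	unsigned short idx = atomic_load_explicit(&avail_idx,
						  memory_order_acquire);
	return idx == last_avail_idx;
}
```

On a single thread the ordering is invisible, but the data/index protocol is the same one the virtqueue uses across the guest/host boundary.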
Re: [PATCH] [v3] module: don't ignore sysfs_create_link() failures
On Tue, Mar 26, 2024 at 03:57:18PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann
>
> The sysfs_create_link() return code is marked as __must_check, but the
> module_add_driver() function tries hard to not care, by assigning the
> return code to a variable. When building with 'make W=1', gcc still
> warns because this variable is only assigned but not used:
>
> drivers/base/module.c: In function 'module_add_driver':
> drivers/base/module.c:36:6: warning: variable 'no_warn' set but not used [-Wunused-but-set-variable]
>
> Rework the code to properly unwind and return the error code to the
> caller. My reading of the original code was that it tries to
> not fail when the links already exist, so keep ignoring -EEXIST
> errors.

> Cc: Luis Chamberlain
> Cc: linux-modu...@vger.kernel.org
> Cc: Greg Kroah-Hartman
> Cc: "Rafael J. Wysocki"

Wondering if you can move these to be after the --- line to avoid
polluting the commit message. This will have the same effect and still
be archived on lore, but on the pro side it unloads the commit
message(s) from unneeded noise.

...

> +	error = module_add_driver(drv->owner, drv);
> +	if (error) {
> +		printk(KERN_ERR "%s: failed to create module links for %s\n",
> +			__func__, drv->name);

What's wrong with pr_err()? Even if it's not the style used here, new
pieces of code can be improved beforehand. That way we reduce technical
debt instead of adding to it.

> +		goto out_detach;
> +	}

...

> +int module_add_driver(struct module *mod, struct device_driver *drv)
>  {
>  	char *driver_name;
> -	int no_warn;
> +	int ret;

I would move it...

>  	struct module_kobject *mk = NULL;

...to be here.

--
With Best Regards,
Andy Shevchenko
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
On Tue, 26 Mar 2024 14:46:10 + Mark Rutland wrote: > Hi Masami, > > On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote: > > Hi Jarkko, > > > > On Sun, 24 Mar 2024 01:29:08 +0200 > > Jarkko Sakkinen wrote: > > > > > Tracing with kprobes while running a monolithic kernel is currently > > > impossible due the kernel module allocator dependency. > > > > > > Address the issue by allowing architectures to implement module_alloc() > > > and module_memfree() independent of the module subsystem. An arch tree > > > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file. > > > > > > Realize the feature on RISC-V by separating allocator to module_alloc.c > > > and implementing module_memfree(). > > > > Even though, this involves changes in arch-independent part. So it should > > be solved by generic way. Did you checked Calvin's thread? > > > > https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/ > > > > I think, we'd better to introduce `alloc_execmem()`, > > CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first > > > > config HAVE_ALLOC_EXECMEM > > bool > > > > config ALLOC_EXECMEM > > bool "Executable trampline memory allocation" > > depends on MODULES || HAVE_ALLOC_EXECMEM > > > > And define fallback macro to module_alloc() like this. > > > > #ifndef CONFIG_HAVE_ALLOC_EXECMEM > > #define alloc_execmem(size, gfp)module_alloc(size) > > #endif > > Please can we *not* do this? I think this is abstracting at the wrong level > (as > I mentioned on the prior execmem proposals). > > Different exectuable allocations can have different requirements. For example, > on arm64 modules need to be within 2G of the kernel image, but the kprobes XOL > areas can be anywhere in the kernel VA space. > > Forcing those behind the same interface makes things *harder* for > architectures > and/or makes the common code more complicated (if that ends up having to track > all those different requirements). 
> From my PoV it'd be much better to have separate kprobes_alloc_*()
> functions for kprobes which an architecture can then choose to
> implement using a common library if it wants to.
>
> I took a look at doing that using the core ifdeffery fixups from
> Jarkko's v6, and it looks pretty clean to me (and works in testing on
> arm64):
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules
>
> Could we please start with that approach, with kprobe-specific
> alloc/free code provided by the architecture?

OK, as far as I can read the code, this method also works and is neat
(and minimally intrusive). I actually found that exposing
CONFIG_ALLOC_EXECMEM to the user does not help; it should be an
internal change, so hiding this change from the user is the better
choice. Then there is no reason to introduce the new alloc_execmem;
just expanding kprobe_alloc_insn_page() is reasonable.

Mark, can you send this series here, so that others can review/test it?

Thank you!

>
> Thanks,
> Mark.

--
Masami Hiramatsu (Google)
[PATCH net v2 2/2] virtio_net: Do not send RSS key if it is not supported
There is a bug when setting the RSS options in virtio_net that can break
the whole machine, getting the kernel into an infinite loop.

Running the following command in any QEMU virtual machine with
virtio-net will reproduce this problem:

	# ethtool -X eth0 hfunc toeplitz

This is how the problem happens:

1) ethtool_set_rxfh() calls virtnet_set_rxfh()

2) virtnet_set_rxfh() calls virtnet_commit_rss_command()

3) virtnet_commit_rss_command() populates 4 entries for the rss
scatter-gather

4) Since the command above does not have a key, the last scatter-gather
entry will be zeroed, since rss_key_size == 0.

	sg_buf_size = vi->rss_key_size;

5) This buffer is passed to QEMU, but QEMU is not happy with a
zero-length buffer, and does the following in virtqueue_map_desc()
(QEMU function):

	if (!sz) {
		virtio_error(vdev, "virtio: zero sized buffers are not allowed");

6) virtio_error() (also a QEMU function) sets the device as broken

	vdev->broken = true;

7) QEMU bails out and never responds to the kernel.

8) The kernel is waiting for the response to come back (function
virtnet_send_command())

9) The kernel waits doing the following:

	while (!virtqueue_get_buf(vi->cvq, &tmp) &&
	       !virtqueue_is_broken(vi->cvq))
		cpu_relax();

10) Neither of the conditions above ever becomes true, so the kernel
loops here forever. Keep in mind that virtqueue_is_broken() does not
look at the QEMU-side `vdev->broken`, so it never realizes that the
virtio device is broken on the QEMU side.

Fix it by not sending RSS commands if the feature is not available in
the device.
Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Cc: sta...@vger.kernel.org
Cc: qemu-de...@nongnu.org
Signed-off-by: Breno Leitao
---
 drivers/net/virtio_net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c640fdf28fc5..e6b0eaf08ac2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3809,6 +3809,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
 	struct virtnet_info *vi = netdev_priv(dev);
 	int i;
 
+	if (!vi->has_rss && !vi->has_rss_hash_report)
+		return -EOPNOTSUPP;
+
 	if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
 	    rxfh->hfunc != ETH_RSS_HASH_TOP)
 		return -EOPNOTSUPP;
-- 
2.43.0
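The fix boils down to failing fast before any control-queue message is built, so a zero-length scatter-gather entry can never reach the host. A userspace model of the added guard (the struct fields and errno value mirror the patch; everything else is a hypothetical stand-in):

```c
#include <assert.h>
#include <errno.h>

/* Minimal stand-in for the relevant virtnet_info feature flags. */
struct virtnet_info {
	int has_rss;
	int has_rss_hash_report;
};

/*
 * Mirror of the check added to virtnet_set_rxfh(): if the device
 * negotiated neither RSS feature, reject the ethtool request up front
 * instead of assembling a command whose key buffer would have length
 * rss_key_size == 0.
 */
static int set_rxfh(const struct virtnet_info *vi)
{
	if (!vi->has_rss && !vi->has_rss_hash_report)
		return -EOPNOTSUPP;
	return 0;	/* would go on to build the RSS command */
}
```

Returning -EOPNOTSUPP here is also the semantically right answer for ethtool: the operation genuinely is not supported by the device, rather than being a malformed request.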
[PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported
Do not set virtnet_info->rss_indir_table_size if RSS is not available
for the device.

Currently, rss_indir_table_size is set if either has_rss or
has_rss_hash_report is available, but it should only be set if has_rss
is set.

In virtnet_set_rxfh(), return -EINVAL if the request has an indirection
table set but virtnet does not support RSS.

Suggested-by: Heng Qi
Signed-off-by: Breno Leitao
---
 drivers/net/virtio_net.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..c640fdf28fc5 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
 	    rxfh->hfunc != ETH_RSS_HASH_TOP)
 		return -EOPNOTSUPP;
 
+	if (rxfh->indir && !vi->has_rss)
+		return -EINVAL;
+
 	if (rxfh->indir) {
 		for (i = 0; i < vi->rss_indir_table_size; ++i)
 			vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
@@ -4729,13 +4732,15 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
 		vi->has_rss_hash_report = true;
 
-	if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
 		vi->has_rss = true;
 
-	if (vi->has_rss || vi->has_rss_hash_report) {
 		vi->rss_indir_table_size =
 			virtio_cread16(vdev, offsetof(struct virtio_net_config,
 				rss_max_indirection_table_length));
+	}
+
+	if (vi->has_rss || vi->has_rss_hash_report) {
 		vi->rss_key_size =
 			virtio_cread8(vdev, offsetof(struct virtio_net_config,
 				rss_max_key_size));
-- 
2.43.0
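The two halves of the patch — probe-time gating of the indirection table size and the -EINVAL check in set_rxfh — can be modelled together in a few lines of userspace C (struct and error values mirror the patch; the table size of 128 and the helper names are hypothetical):

```c
#include <assert.h>
#include <errno.h>

struct vinfo {
	int has_rss;
	int has_rss_hash_report;
	int rss_indir_table_size;
};

/*
 * Probe-time gating as in the patch: only VIRTIO_NET_F_RSS grants an
 * indirection table; hash-report alone must leave the size at zero.
 */
static void probe(struct vinfo *vi, int f_rss, int f_hash_report)
{
	vi->has_rss_hash_report = f_hash_report;
	vi->has_rss = f_rss;
	vi->rss_indir_table_size = f_rss ? 128 : 0;	/* 128 is arbitrary */
}

/*
 * A request carrying an indirection table needs has_rss, not merely
 * hash reporting -- otherwise the loop copying rxfh->indir would walk
 * a zero-sized table.
 */
static int set_indir(const struct vinfo *vi, const int *indir)
{
	if (indir && !vi->has_rss)
		return -EINVAL;
	return 0;
}
```

The pairing matters: without the probe-time change, a hash-report-only device would advertise a table size it cannot honor; without the set_rxfh check, userspace could still hand in a table that the device has no way to apply.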
Re: [PATCH v5 1/2] kprobes: textmem API
On Tue, 26 Mar 2024 15:18:21 +0200
"Jarkko Sakkinen" wrote:

> On Tue Mar 26, 2024 at 4:01 AM EET, Jarkko Sakkinen wrote:
> > On Tue Mar 26, 2024 at 3:31 AM EET, Jarkko Sakkinen wrote:
> > > > > +#endif /* _LINUX_EXECMEM_H */
> > > > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> > > > > index 9d9095e81792..87fd8c14a938 100644
> > > > > --- a/kernel/kprobes.c
> > > > > +++ b/kernel/kprobes.c
> > > > > @@ -44,6 +44,7 @@
> > > > >  #include
> > > > >  #include
> > > > >  #include
> > > > > +#include <linux/execmem.h>
> > > > >
> > > > >  #define KPROBE_HASH_BITS 6
> > > > >  #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS)
> > > > > @@ -113,17 +114,17 @@ enum kprobe_slot_state {
> > > > >  void __weak *alloc_insn_page(void)
> > > > >  {
> > > > >  	/*
> > > > > -	 * Use module_alloc() so this page is within +/- 2GB of where the
> > > > > +	 * Use alloc_execmem() so this page is within +/- 2GB of where the
> > > > >  	 * kernel image and loaded module images reside. This is required
> > > > >  	 * for most of the architectures.
> > > > >  	 * (e.g. x86-64 needs this to handle the %rip-relative fixups.)
> > > > >  	 */
> > > > > -	return module_alloc(PAGE_SIZE);
> > > > > +	return alloc_execmem(PAGE_SIZE, GFP_KERNEL);
> > > > >  }
> > > > >
> > > > >  static void free_insn_page(void *page)
> > > > >  {
> > > > > -	module_memfree(page);
> > > > > +	free_execmem(page);
> > > > >  }
> > > > >
> > > > >  struct kprobe_insn_cache kprobe_insn_slots = {
> > > > > @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
> > > > >  		goto out;
> > > > >  	}
> > > > >
> > > > > +#ifdef CONFIG_MODULES
> > > >
> > > > You don't need this block, because these APIs have dummy functions.
> > >
> > > Hmm... I'll verify this tomorrow.
> >
> > It depends on having struct module available given "(*probed_mod)->state".

Ah, indeed. We need module_state() function to avoid it.
> > > > It is non-existent unless CONFIG_MODULES is set given how things are > > flagged in include/linux/module.h. > > Hey, noticed kconfig issue. > > According to kconfig-language.txt: > > "select should be used with care. select will force a symbol to a value > without visiting the dependencies." > > So the problem here lies in KPROBES config entry using select statement > to pick ALLOC_EXECMEM. It will not take the depends on statement into > account and thus will allow to select kprobes without any allocator in > place. OK, in that case "depend on" is good. > > So to address this I'd suggest to use depends on statement also for > describing relation between KPROBES and ALLOC_EXECMEM. It does not make > life worse than before for anyone because even with the current kernel > you have to select MODULES before you can move forward with kprobes. Yeah, since ALLOC_EXECMEM is enabled by default. Thank you! > > BR, Jarkko -- Masami Hiramatsu (Google)
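Jarkko's point about `select` versus `depends on` can be illustrated with a Kconfig fragment. This is a sketch of the relationship he suggests, not the final upstream Kconfig; the KPROBES entry shown here is simplified:

```kconfig
config HAVE_ALLOC_EXECMEM
	bool

config ALLOC_EXECMEM
	bool "Executable trampoline memory allocation"
	depends on MODULES || HAVE_ALLOC_EXECMEM

config KPROBES
	bool "Kernel probes"
	# 'select ALLOC_EXECMEM' would force the symbol on without
	# visiting its 'depends on' line, allowing kprobes to be enabled
	# with no allocator in place; 'depends on' makes the user (or
	# defconfig) satisfy the allocator requirement first.
	depends on ALLOC_EXECMEM
```

This is exactly the hazard kconfig-language.txt warns about: `select` sets a symbol's value directly, so any constraints expressed on the selected symbol are silently bypassed.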
[PATCH] [v3] module: don't ignore sysfs_create_link() failures
From: Arnd Bergmann

The sysfs_create_link() return code is marked as __must_check, but the
module_add_driver() function tries hard to not care, by assigning the
return code to a variable. When building with 'make W=1', gcc still
warns because this variable is only assigned but not used:

  drivers/base/module.c: In function 'module_add_driver':
  drivers/base/module.c:36:6: warning: variable 'no_warn' set but not used [-Wunused-but-set-variable]

Rework the code to properly unwind and return the error code to the
caller. My reading of the original code was that it tries to not fail
when the links already exist, so keep ignoring -EEXIST errors.

Cc: Luis Chamberlain
Cc: linux-modu...@vger.kernel.org
Cc: Greg Kroah-Hartman
Cc: "Rafael J. Wysocki"
Fixes: e17e0f51aeea ("Driver core: show drivers in /sys/module/")
See-also: 4a7fb6363f2d ("add __must_check to device management code")
Signed-off-by: Arnd Bergmann
---
v3: make error handling stricter, add unwinding, fix build failure with
    CONFIG_MODULES=n
v2: rework to actually handle the error. I have not tested the error
    handling beyond build testing, so please review carefully.
---
 drivers/base/base.h   |  9 ++---
 drivers/base/bus.c    |  9 -
 drivers/base/module.c | 42 +++---
 3 files changed, 45 insertions(+), 15 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 0738ccad08b2..db4f910e8e36 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -192,11 +192,14 @@ extern struct kset *devices_kset;
 void devices_kset_move_last(struct device *dev);
 
 #if defined(CONFIG_MODULES) && defined(CONFIG_SYSFS)
-void module_add_driver(struct module *mod, struct device_driver *drv);
+int module_add_driver(struct module *mod, struct device_driver *drv);
 void module_remove_driver(struct device_driver *drv);
 #else
-static inline void module_add_driver(struct module *mod,
-				     struct device_driver *drv) { }
+static inline int module_add_driver(struct module *mod,
+				    struct device_driver *drv)
+{
+	return 0;
+}
 static inline void module_remove_driver(struct device_driver *drv) { }
 #endif
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index daee55c9b2d9..ffea0728b8b2 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -674,7 +674,12 @@ int bus_add_driver(struct device_driver *drv)
 		if (error)
 			goto out_del_list;
 	}
-	module_add_driver(drv->owner, drv);
+	error = module_add_driver(drv->owner, drv);
+	if (error) {
+		printk(KERN_ERR "%s: failed to create module links for %s\n",
+			__func__, drv->name);
+		goto out_detach;
+	}
 
 	error = driver_create_file(drv, &driver_attr_uevent);
 	if (error) {
@@ -699,6 +704,8 @@ int bus_add_driver(struct device_driver *drv)
 
 	return 0;
 
+out_detach:
+	driver_detach(drv);
 out_del_list:
 	klist_del(&drv->knode_bus);
 out_unregister:
diff --git a/drivers/base/module.c b/drivers/base/module.c
index 46ad4d636731..d16b5c8e5473 100644
--- a/drivers/base/module.c
+++ b/drivers/base/module.c
@@ -30,14 +30,14 @@ static void module_create_drivers_dir(struct module_kobject *mk)
 	mutex_unlock(&drivers_dir_mutex);
 }
 
-void module_add_driver(struct module *mod, struct device_driver *drv)
+int module_add_driver(struct module *mod, struct device_driver *drv)
 {
 	char *driver_name;
-	int no_warn;
+	int ret;
 	struct module_kobject *mk = NULL;
 
 	if (!drv)
-		return;
+		return 0;
 
 	if (mod)
 		mk = &mod->mkobj;
@@ -56,17 +56,37 @@ void module_add_driver(struct module *mod, struct device_driver *drv)
 	}
 
 	if (!mk)
-		return;
+		return 0;
+
+	ret = sysfs_create_link(&drv->p->kobj, &mk->kobj, "module");
+	if (ret)
+		return ret;
 
-	/* Don't check return codes; these calls are idempotent */
-	no_warn = sysfs_create_link(&drv->p->kobj, &mk->kobj, "module");
 	driver_name = make_driver_name(drv);
-	if (driver_name) {
-		module_create_drivers_dir(mk);
-		no_warn = sysfs_create_link(mk->drivers_dir, &drv->p->kobj,
-					    driver_name);
-		kfree(driver_name);
+	if (!driver_name) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	module_create_drivers_dir(mk);
+	if (!mk->drivers_dir) {
+		ret = -EINVAL;
+		goto out;
 	}
+
+	ret = sysfs_create_link(mk->drivers_dir, &drv->p->kobj, driver_name);
+	if (ret)
+		goto out;
+
+	kfree(driver_name);
+
+	return 0;
+out:
+	sysfs_remove_link(&drv->p->kobj, "module");
+	sysfs_remove_link(mk->drivers_dir, driver_name);
+	kfree(driver_name);
+
+	return ret;
 }
 
 void
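The rework follows the kernel's usual unwind idiom: each step either succeeds or jumps to a single exit label that undoes the earlier steps, so the caller sees either full success or no side effects. A userspace skeleton of that shape (resources faked with flags; names hypothetical):

```c
#include <assert.h>
#include <errno.h>

static int link_created, dir_created;

/* Fake "resource" helpers standing in for sysfs_create_link() etc. */
static int create_link(int ok)
{
	if (ok)
		link_created = 1;
	return ok ? 0 : -EEXIST;
}

static void remove_link(void)
{
	link_created = 0;
}

/*
 * Same structure as the reworked module_add_driver(): a failure before
 * any state is created returns directly; a failure after partial setup
 * jumps to 'out', which unwinds what was done so far.
 */
static int add_driver(int link_ok, int dir_ok)
{
	int ret;

	ret = create_link(link_ok);
	if (ret)
		return ret;	/* nothing to unwind yet */

	if (!dir_ok) {
		ret = -EINVAL;
		goto out;
	}
	dir_created = 1;

	return 0;
out:
	remove_link();
	return ret;
}
```

The payoff is the property Arnd's patch restores: the `int` return value is meaningful, so `bus_add_driver()` can check it and detach the driver instead of silently continuing with half-created sysfs links.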
[PATCH 11/12] [v4] kallsyms: rework symbol lookup return codes
From: Arnd Bergmann Building with W=1 in some configurations produces a false positive warning for kallsyms: kernel/kallsyms.c: In function '__sprint_symbol.isra': kernel/kallsyms.c:503:17: error: 'strcpy' source argument is the same as destination [-Werror=restrict] 503 | strcpy(buffer, name); | ^~~~ This originally showed up while building with -O3, but later started happening in other configurations as well, depending on inlining decisions. The underlying issue is that the local 'name' variable is always initialized to the be the same as 'buffer' in the called functions that fill the buffer, which gcc notices while inlining, though it could see that the address check always skips the copy. The calling conventions here are rather unusual, as all of the internal lookup functions (bpf_address_lookup, ftrace_mod_address_lookup, ftrace_func_address_lookup, module_address_lookup and kallsyms_lookup_buildid) already use the provided buffer and either return the address of that buffer to indicate success, or NULL for failure, but the callers are written to also expect an arbitrary other buffer to be returned. Rework the calling conventions to return the length of the filled buffer instead of its address, which is simpler and easier to follow as well as avoiding the warning. Leave only the kallsyms_lookup() calling conventions unchanged, since that is called from 16 different functions and adapting this would be a much bigger change. Link: https://lore.kernel.org/all/20200107214042.855757-1-a...@arndb.de/ Reviewed-by: Luis Chamberlain Acked-by: Steven Rostedt (Google) Signed-off-by: Arnd Bergmann --- v4: fix string length v3: use strscpy() instead of strlcpy() v2: complete rewrite after the first patch was rejected (in 2020). This is now one of only two warnings that are in the way of enabling -Wextra/-Wrestrict by default. 
---
 include/linux/filter.h   | 14 +++---
 include/linux/ftrace.h   |  6 +++---
 include/linux/module.h   | 14 +++---
 kernel/bpf/core.c        |  7 +++
 kernel/kallsyms.c        | 23 ---
 kernel/module/kallsyms.c | 26 +-
 kernel/trace/ftrace.c    | 13 +
 7 files changed, 50 insertions(+), 53 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c99bc3df2d28..9d4a7c6f023e 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1168,18 +1168,18 @@ static inline bool bpf_jit_kallsyms_enabled(void)
 	return false;
 }
 
-const char *__bpf_address_lookup(unsigned long addr, unsigned long *size,
+int __bpf_address_lookup(unsigned long addr, unsigned long *size,
 				 unsigned long *off, char *sym);
 bool is_bpf_text_address(unsigned long addr);
 int bpf_get_kallsym(unsigned int symnum, unsigned long *value, char *type,
 		    char *sym);
 struct bpf_prog *bpf_prog_ksym_find(unsigned long addr);
 
-static inline const char *
+static inline int
 bpf_address_lookup(unsigned long addr, unsigned long *size,
 		   unsigned long *off, char **modname, char *sym)
 {
-	const char *ret = __bpf_address_lookup(addr, size, off, sym);
+	int ret = __bpf_address_lookup(addr, size, off, sym);
 
 	if (ret && modname)
 		*modname = NULL;
@@ -1223,11 +1223,11 @@ static inline bool bpf_jit_kallsyms_enabled(void)
 	return false;
 }
 
-static inline const char *
+static inline int
 __bpf_address_lookup(unsigned long addr, unsigned long *size,
 		     unsigned long *off, char *sym)
 {
-	return NULL;
+	return 0;
 }
 
 static inline bool is_bpf_text_address(unsigned long addr)
@@ -1246,11 +1246,11 @@ static inline struct bpf_prog *bpf_prog_ksym_find(unsigned long addr)
 	return NULL;
 }
 
-static inline const char *
+static inline int
 bpf_address_lookup(unsigned long addr, unsigned long *size,
 		   unsigned long *off, char **modname, char *sym)
 {
-	return NULL;
+	return 0;
 }
 
 static inline void bpf_prog_kallsyms_add(struct bpf_prog *fp)
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 54d53f345d14..56834a3fa9be 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -87,15 +87,15 @@ struct ftrace_direct_func;
 
 #if defined(CONFIG_FUNCTION_TRACER) && defined(CONFIG_MODULES) && \
 	defined(CONFIG_DYNAMIC_FTRACE)
-const char *
+int
 ftrace_mod_address_lookup(unsigned long addr, unsigned long *size,
 			  unsigned long *off, char **modname, char *sym);
 #else
-static inline const char *
+static inline int
 ftrace_mod_address_lookup(unsigned long addr, unsigned long *size,
 			  unsigned long *off, char **modname, char *sym)
 {
-	return NULL;
+	return 0;
 }
 #endif
 
diff --git a/include/linux/module.h b/include/linux/module.h
index 1153b0d99a80..118c36366b35 100644
--- a/include/linux/module.h
+++
Re: [PATCH v2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
Hi Masami,

On Mon, Mar 25, 2024 at 11:56:32AM +0900, Masami Hiramatsu wrote:
> Hi Jarkko,
>
> On Sun, 24 Mar 2024 01:29:08 +0200
> Jarkko Sakkinen wrote:
>
> > Tracing with kprobes while running a monolithic kernel is currently
> > impossible due to the kernel module allocator dependency.
> >
> > Address the issue by allowing architectures to implement module_alloc()
> > and module_memfree() independent of the module subsystem. An arch tree
> > can signal this by setting HAVE_KPROBES_ALLOC in its Kconfig file.
> >
> > Realize the feature on RISC-V by separating the allocator to
> > module_alloc.c and implementing module_memfree().
>
> Even though, this involves changes in the arch-independent part, so it
> should be solved in a generic way. Did you check Calvin's thread?
>
> https://lore.kernel.org/all/cover.1709676663.git.jcalvinow...@gmail.com/
>
> I think we'd better introduce `alloc_execmem()`,
> CONFIG_HAVE_ALLOC_EXECMEM and CONFIG_ALLOC_EXECMEM at first:
>
> config HAVE_ALLOC_EXECMEM
> 	bool
>
> config ALLOC_EXECMEM
> 	bool "Executable trampoline memory allocation"
> 	depends on MODULES || HAVE_ALLOC_EXECMEM
>
> And define a fallback macro to module_alloc() like this:
>
> #ifndef CONFIG_HAVE_ALLOC_EXECMEM
> #define alloc_execmem(size, gfp) module_alloc(size)
> #endif

Please can we *not* do this? I think this is abstracting at the wrong level (as I mentioned on the prior execmem proposals).

Different executable allocations can have different requirements. For example, on arm64 modules need to be within 2G of the kernel image, but the kprobes XOL areas can be anywhere in the kernel VA space. Forcing those behind the same interface makes things *harder* for architectures and/or makes the common code more complicated (if that ends up having to track all those different requirements).

From my PoV it'd be much better to have separate kprobes_alloc_*() functions for kprobes which an architecture can then choose to implement using a common library if it wants to.
I took a look at doing that using the core ifdeffery fixups from Jarkko's v6, and it looks pretty clean to me (and works in testing on arm64): https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=kprobes/without-modules Could we please start with that approach, with kprobe-specific alloc/free code provided by the architecture? Thanks, Mark.
Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver
On Mon, 25 Mar 2024 at 11:13, Abdellatif El Khlifi wrote: > > Hi Mathieu, > > > > > > > > > > This is an initial patchset for allowing to turn on and off > > > > > > > > > the remote processor. > > > > > > > > > The FW is already loaded before the Corstone-1000 SoC is > > > > > > > > > powered on and this > > > > > > > > > is done through the FPGA board bootloader in case of the FPGA > > > > > > > > > target. Or by the Corstone-1000 FVP model > > > > > > > > > (emulator). > > > > > > > > > > > > > > > > > >From the above I take it that booting with a preloaded > > > > > > > > >firmware is a > > > > > > > > scenario that needs to be supported and not just a temporary > > > > > > > > stage. > > > > > > > > > > > > > > The current status of the Corstone-1000 SoC requires that there is > > > > > > > a preloaded firmware for the external core. Preloading is done > > > > > > > externally > > > > > > > either through the FPGA bootloader or the emulator (FVP) before > > > > > > > powering > > > > > > > on the SoC. > > > > > > > > > > > > > > > > > > > Ok > > > > > > > > > > > > > Corstone-1000 will be upgraded in a way that the A core running > > > > > > > Linux is able > > > > > > > to share memory with the remote core and also being able to > > > > > > > access the remote > > > > > > > core memory so Linux can copy the firmware to. This HW changes > > > > > > > are still > > > > > > > This is why this patchset is relying on a preloaded firmware. And > > > > > > > it's the step 1 > > > > > > > of adding remoteproc support for Corstone. > > > > > > > > > > > > > > > > > > > Ok, so there is a HW problem where A core and M core can't see each > > > > > > other's > > > > > > memory, preventing the A core from copying the firmware image to > > > > > > the proper > > > > > > location. > > > > > > > > > > > > When the HW is fixed, will there be a need to support scenarios > > > > > > where the > > > > > > firmware image has been preloaded into memory? 
> > > > > > > > > > No, this scenario won't apply when we get the HW upgrade. No need for > > > > > an > > > > > external entity anymore. The firmware(s) will all be files in the > > > > > linux filesystem. > > > > > > > > > > > > > Very well. I am willing to continue with this driver but it does so > > > > little that > > > > I wonder if it wouldn't simply be better to move forward with > > > > upstreaming when > > > > the HW is fixed. The choice is yours. > > > > > > > > > > I think Robin has raised few points that need clarification. I think it > > > was > > > done as part of DT binding patch. I share those concerns and I wanted to > > > reaching to the same concerns by starting the questions I asked on > > > corstone > > > device tree changes. > > > > > > > I also agree with Robin's point of view. Proceeding with an initial > > driver with minimal functionality doesn't preclude having complete > > bindings. But that said and as I pointed out, it might be better to > > wait for the HW to be fixed before moving forward. > > We checked with the HW teams. The missing features will be implemented but > this will take time. > > The foundation driver as it is right now is still valuable for people wanting > to > know how to power control Corstone external systems in a future proof manner > (even in the incomplete state). We prefer to address all the review comments > made so it can be merged. This includes making the DT binding as complete as > possible as you advised. Then, once the HW is ready, I'll implement the comms > and the FW reload part. Is that OK please ? > I'm in agreement with that plan as long as we agree the current preloaded heuristic is temporary and is not a valid long term scenario. > Cheers, > Abdellatif
[PATCH v9 3/3] remoteproc: qcom: Remove minidump related data from qcom_common.c
As minidump specific data structure and functions move under config QCOM_RPROC_MINIDUMP, so remove minidump specific data from driver/remoteproc/qcom_common.c . Signed-off-by: Mukesh Ojha --- Changes in v9: - Change in patch order. - rebased it. v8: https://lore.kernel.org/lkml/20240131105734.13090-1-quic_mo...@quicinc.com/ v7: https://lore.kernel.org/lkml/20240109153200.12848-1-quic_mo...@quicinc.com/ v6: https://lore.kernel.org/lkml/1700864395-1479-1-git-send-email-quic_mo...@quicinc.com/ v5: https://lore.kernel.org/lkml/1694429639-21484-1-git-send-email-quic_mo...@quicinc.com/ v4: https://lore.kernel.org/lkml/1687955688-20809-1-git-send-email-quic_mo...@quicinc.com/ drivers/remoteproc/qcom_common.c | 160 --- 1 file changed, 160 deletions(-) diff --git a/drivers/remoteproc/qcom_common.c b/drivers/remoteproc/qcom_common.c index 03e5f5d533eb..085fd73fa23a 100644 --- a/drivers/remoteproc/qcom_common.c +++ b/drivers/remoteproc/qcom_common.c @@ -17,7 +17,6 @@ #include #include #include -#include #include "remoteproc_internal.h" #include "qcom_common.h" @@ -26,61 +25,6 @@ #define to_smd_subdev(d) container_of(d, struct qcom_rproc_subdev, subdev) #define to_ssr_subdev(d) container_of(d, struct qcom_rproc_ssr, subdev) -#define MAX_NUM_OF_SS 10 -#define MAX_REGION_NAME_LENGTH 16 -#define SBL_MINIDUMP_SMEM_ID 602 -#define MINIDUMP_REGION_VALID ('V' << 24 | 'A' << 16 | 'L' << 8 | 'I' << 0) -#define MINIDUMP_SS_ENCR_DONE ('D' << 24 | 'O' << 16 | 'N' << 8 | 'E' << 0) -#define MINIDUMP_SS_ENABLED('E' << 24 | 'N' << 16 | 'B' << 8 | 'L' << 0) - -/** - * struct minidump_region - Minidump region - * @name : Name of the region to be dumped - * @seq_num: : Use to differentiate regions with same name. 
- * @valid : This entry to be dumped (if set to 1)
- * @address: Physical address of region to be dumped
- * @size : Size of the region
- */
-struct minidump_region {
-	char name[MAX_REGION_NAME_LENGTH];
-	__le32 seq_num;
-	__le32 valid;
-	__le64 address;
-	__le64 size;
-};
-
-/**
- * struct minidump_subsystem - Subsystem's SMEM Table of content
- * @status : Subsystem toc init status
- * @enabled : if set to 1, this region would be copied during coredump
- * @encryption_status: Encryption status for this subsystem
- * @encryption_required : Decides to encrypt the subsystem regions or not
- * @region_count : Number of regions added in this subsystem toc
- * @regions_baseptr : regions base pointer of the subsystem
- */
-struct minidump_subsystem {
-	__le32 status;
-	__le32 enabled;
-	__le32 encryption_status;
-	__le32 encryption_required;
-	__le32 region_count;
-	__le64 regions_baseptr;
-};
-
-/**
- * struct minidump_global_toc - Global Table of Content
- * @status : Global Minidump init status
- * @md_revision : Minidump revision
- * @enabled : Minidump enable status
- * @subsystems : Array of subsystems toc
- */
-struct minidump_global_toc {
-	__le32 status;
-	__le32 md_revision;
-	__le32 enabled;
-	struct minidump_subsystem subsystems[MAX_NUM_OF_SS];
-};
-
 struct qcom_ssr_subsystem {
 	const char *name;
 	struct srcu_notifier_head notifier_list;
@@ -90,110 +34,6 @@ struct qcom_ssr_subsystem {
 static LIST_HEAD(qcom_ssr_subsystem_list);
 static DEFINE_MUTEX(qcom_ssr_subsys_lock);
 
-static void qcom_minidump_cleanup(struct rproc *rproc)
-{
-	struct rproc_dump_segment *entry, *tmp;
-
-	list_for_each_entry_safe(entry, tmp, &rproc->dump_segments, node) {
-		list_del(&entry->node);
-		kfree(entry->priv);
-		kfree(entry);
-	}
-}
-
-static int qcom_add_minidump_segments(struct rproc *rproc, struct minidump_subsystem *subsystem,
-			void (*rproc_dumpfn_t)(struct rproc *rproc, struct rproc_dump_segment *segment,
-				void *dest, size_t offset, size_t size))
-{
-	struct minidump_region __iomem *ptr;
-	struct minidump_region region;
-	int seg_cnt, i;
-	dma_addr_t da;
-	size_t size;
-	char *name;
-
-	if (WARN_ON(!list_empty(&rproc->dump_segments))) {
-		dev_err(&rproc->dev, "dump segment list already populated\n");
-		return -EUCLEAN;
-	}
-
-	seg_cnt = le32_to_cpu(subsystem->region_count);
-	ptr = ioremap((unsigned long)le64_to_cpu(subsystem->regions_baseptr),
-		      seg_cnt * sizeof(struct minidump_region));
-	if (!ptr)
-		return -EFAULT;
-
-	for (i = 0; i < seg_cnt; i++) {
-		memcpy_fromio(&region, ptr + i, sizeof(region));
-		if (le32_to_cpu(region.valid) == MINIDUMP_REGION_VALID) {
-			name = kstrndup(region.name,
[PATCH v9 2/3] remoteproc: qcom_q6v5_pas: Use qcom_rproc_minidump()
Now that all the minidump-specific data structures have been moved to minidump-specific files, and implementation-wise qcom_rproc_minidump() and qcom_minidump() are exactly the same, the name qcom_rproc_minidump() makes more sense, as it collects the minidump for the remoteproc processors. So let's use qcom_rproc_minidump(); qcom_minidump() and the minidump-related stuff will then be removed from drivers/remoteproc/qcom_common.c.

Signed-off-by: Mukesh Ojha
---
Changes in v9:
 - Change in patch order from its last version.
 - Rebased it.

v8: https://lore.kernel.org/lkml/20240131105734.13090-1-quic_mo...@quicinc.com/
v7: https://lore.kernel.org/lkml/20240109153200.12848-1-quic_mo...@quicinc.com/
v6: https://lore.kernel.org/lkml/1700864395-1479-1-git-send-email-quic_mo...@quicinc.com/
v5: https://lore.kernel.org/lkml/1694429639-21484-1-git-send-email-quic_mo...@quicinc.com/
v4: https://lore.kernel.org/lkml/1687955688-20809-1-git-send-email-quic_mo...@quicinc.com/

 drivers/remoteproc/Kconfig         | 1 +
 drivers/remoteproc/qcom_q6v5_pas.c | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
index 48845dc8fa85..cea960749e2c 100644
--- a/drivers/remoteproc/Kconfig
+++ b/drivers/remoteproc/Kconfig
@@ -166,6 +166,7 @@ config QCOM_PIL_INFO
 
 config QCOM_RPROC_COMMON
 	tristate
+	select QCOM_RPROC_MINIDUMP
 
 config QCOM_Q6V5_COMMON
 	tristate
diff --git a/drivers/remoteproc/qcom_q6v5_pas.c b/drivers/remoteproc/qcom_q6v5_pas.c
index 54d8005d40a3..b39f87dfd9c0 100644
--- a/drivers/remoteproc/qcom_q6v5_pas.c
+++ b/drivers/remoteproc/qcom_q6v5_pas.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include
 
 #include "qcom_common.h"
 #include "qcom_pil_info.h"
@@ -141,7 +142,7 @@ static void adsp_minidump(struct rproc *rproc)
 	if (rproc->dump_conf == RPROC_COREDUMP_DISABLED)
 		return;
 
-	qcom_minidump(rproc, adsp->minidump_id, adsp_segment_dump);
+	qcom_rproc_minidump(rproc, adsp->minidump_id, adsp_segment_dump);
 }
 
 static int
adsp_pds_enable(struct qcom_adsp *adsp, struct device **pds, -- 2.7.4
[PATCH v9 1/3] soc: qcom: Add qcom_rproc_minidump module
Add qcom_rproc_minidump module in a preparation to remove minidump specific code from driver/remoteproc/qcom_common.c and provide needed exported API, this as well helps to abstract minidump specific data layout from qualcomm's remoteproc driver. It is just a copying of qcom_minidump() functionality from driver/remoteproc/qcom_common.c into a separate file under qcom_rproc_minidump(). Signed-off-by: Mukesh Ojha --- Changes in v9: - Added source file driver/remoteproc/qcom_common.c copyright to qcom_rproc_minidump.c - Dissociated it from minidump series as this can go separately and minidump can put it dependency for the data structure files. Nothing much changed in these three patches from previous version, However, giving the link of their older versions. v8: https://lore.kernel.org/lkml/20240131105734.13090-1-quic_mo...@quicinc.com/ v7: https://lore.kernel.org/lkml/20240109153200.12848-1-quic_mo...@quicinc.com/ v6: https://lore.kernel.org/lkml/1700864395-1479-1-git-send-email-quic_mo...@quicinc.com/ v5: https://lore.kernel.org/lkml/1694429639-21484-1-git-send-email-quic_mo...@quicinc.com/ v4: https://lore.kernel.org/lkml/1687955688-20809-1-git-send-email-quic_mo...@quicinc.com/ drivers/soc/qcom/Kconfig | 10 +++ drivers/soc/qcom/Makefile | 1 + drivers/soc/qcom/qcom_minidump_internal.h | 64 + drivers/soc/qcom/qcom_rproc_minidump.c| 115 ++ include/soc/qcom/qcom_minidump.h | 23 ++ 5 files changed, 213 insertions(+) create mode 100644 drivers/soc/qcom/qcom_minidump_internal.h create mode 100644 drivers/soc/qcom/qcom_rproc_minidump.c create mode 100644 include/soc/qcom/qcom_minidump.h diff --git a/drivers/soc/qcom/Kconfig b/drivers/soc/qcom/Kconfig index 5af33b0e3470..ed23e0275c22 100644 --- a/drivers/soc/qcom/Kconfig +++ b/drivers/soc/qcom/Kconfig @@ -277,4 +277,14 @@ config QCOM_PBS This module provides the APIs to the client drivers that wants to send the PBS trigger event to the PBS RAM. 
+config QCOM_RPROC_MINIDUMP
+	tristate "QCOM Remoteproc Minidump Support"
+	depends on ARCH_QCOM || COMPILE_TEST
+	depends on QCOM_SMEM
+	help
+	  Enablement of the core Minidump feature is controlled from the boot
+	  firmware side, so if it is enabled from firmware, this config allows
+	  Linux to query the predefined Minidump segments associated with the
+	  remote processor, check their validity, and collect the dump on a
+	  remote processor crash during its recovery.
 endmenu
diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
index ca0bece0dfff..44664589263d 100644
--- a/drivers/soc/qcom/Makefile
+++ b/drivers/soc/qcom/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_QCOM_ICC_BWMON)	+= icc-bwmon.o
 qcom_ice-objs			+= ice.o
 obj-$(CONFIG_QCOM_INLINE_CRYPTO_ENGINE)	+= qcom_ice.o
 obj-$(CONFIG_QCOM_PBS)		+= qcom-pbs.o
+obj-$(CONFIG_QCOM_RPROC_MINIDUMP)	+= qcom_rproc_minidump.o
diff --git a/drivers/soc/qcom/qcom_minidump_internal.h b/drivers/soc/qcom/qcom_minidump_internal.h
new file mode 100644
index ..71709235b196
--- /dev/null
+++ b/drivers/soc/qcom/qcom_minidump_internal.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2023 Qualcomm Innovation Center, Inc. All rights reserved.
+ */
+
+#ifndef _QCOM_MINIDUMP_INTERNAL_H_
+#define _QCOM_MINIDUMP_INTERNAL_H_
+
+#define MAX_NUM_OF_SS 10
+#define MAX_REGION_NAME_LENGTH 16
+#define SBL_MINIDUMP_SMEM_ID 602
+#define MINIDUMP_REGION_VALID	('V' << 24 | 'A' << 16 | 'L' << 8 | 'I' << 0)
+#define MINIDUMP_SS_ENCR_DONE	('D' << 24 | 'O' << 16 | 'N' << 8 | 'E' << 0)
+#define MINIDUMP_SS_ENABLED	('E' << 24 | 'N' << 16 | 'B' << 8 | 'L' << 0)
+
+/**
+ * struct minidump_region - Minidump region
+ * @name : Name of the region to be dumped
+ * @seq_num: : Use to differentiate regions with same name.
+ * @valid : This entry to be dumped (if set to 1)
+ * @address: Physical address of region to be dumped
+ * @size : Size of the region
+ */
+struct minidump_region {
+	char name[MAX_REGION_NAME_LENGTH];
+	__le32 seq_num;
+	__le32 valid;
+	__le64 address;
+	__le64 size;
+};
+
+/**
+ * struct minidump_subsystem - Subsystem's SMEM Table of content
+ * @status : Subsystem toc init status
+ * @enabled : if set to 1, this region would be copied during coredump
+ * @encryption_status: Encryption status for this subsystem
+ * @encryption_required : Decides to encrypt the subsystem regions or not
+ * @region_count : Number of regions added in this subsystem toc
+ * @regions_baseptr : regions base pointer of the subsystem
+ */
+struct minidump_subsystem {
+	__le32 status;
+	__le32 enabled;
+	__le32 encryption_status;
+	__le32 encryption_required;
+	__le32
Re: [PATCH] [v2] module: don't ignore sysfs_create_link() failures
On Sat, Mar 23, 2024, at 17:50, Greg Kroah-Hartman wrote:
> On Fri, Mar 22, 2024 at 06:39:11PM +0100, Arnd Bergmann wrote:
>> diff --git a/drivers/base/bus.c b/drivers/base/bus.c
>> index daee55c9b2d9..7ef75b60d331 100644
>> --- a/drivers/base/bus.c
>> +++ b/drivers/base/bus.c
>> @@ -674,7 +674,12 @@ int bus_add_driver(struct device_driver *drv)
>>  		if (error)
>>  			goto out_del_list;
>>  	}
>> -	module_add_driver(drv->owner, drv);
>> +	error = module_add_driver(drv->owner, drv);
>> +	if (error) {
>> +		printk(KERN_ERR "%s: failed to create module links for %s\n",
>> +		       __func__, drv->name);
>> +		goto out_del_list;
>
> Don't we need to walk back the driver_attach() call here if this fails?

Yes, fixed now. There are still some other calls right after it that print an error but don't cause bus_add_driver() to fail though. We may want to add similar unwinding there, but that feels like it should be a separate patch.

>> 	if (!mk)
>> -		return;
>> +		return 0;
>> +
>> +	ret = sysfs_create_link(&drv->p->kobj, &mk->kobj, "module");
>> +	if (ret && ret != -EEXIST)
>
> Why would EEXIST happen here? How can this be called twice?

My impression was that the lack of error handling and the comment was about a case where that might happen intentionally. I've removed it now as I couldn't find any evidence that this is really needed. I suppose we would find out in testing if we do.

     Arnd
Re: [PATCH v5 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
Hi Jarkko, On 25/03/2024 22:55, Jarkko Sakkinen wrote: Tacing with kprobes while running a monolithic kernel is currently impossible due the kernel module allocator dependency. Address the issue by implementing textmem API for RISC-V. Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation Signed-off-by: Jarkko Sakkinen --- v5: - No changes, expect removing alloc_execmem() call which should have been part of the previous patch. v4: - Include linux/execmem.h. v3: - Architecture independent parts have been split to separate patches. - Do not change arch/riscv/kernel/module.c as it is out of scope for this patch set now. v2: - Better late than never right? :-) - Focus only to RISC-V for now to make the patch more digestable. This is the arch where I use the patch on a daily basis to help with QA. - Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration. --- arch/riscv/Kconfig | 1 + arch/riscv/kernel/Makefile | 3 +++ arch/riscv/kernel/execmem.c | 22 ++ 3 files changed, 26 insertions(+) create mode 100644 arch/riscv/kernel/execmem.c diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index e3142ce531a0..499512fb17ff 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -132,6 +132,7 @@ config RISCV select HAVE_KPROBES if !XIP_KERNEL select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL select HAVE_KRETPROBES if !XIP_KERNEL + select HAVE_ALLOC_EXECMEM if !XIP_KERNEL # https://github.com/ClangBuiltLinux/linux/issues/1881 select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD select HAVE_MOVE_PMD diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile index 604d6bf7e476..337797f10d3e 100644 --- a/arch/riscv/kernel/Makefile +++ b/arch/riscv/kernel/Makefile @@ -73,6 +73,9 @@ obj-$(CONFIG_SMP) += cpu_ops.o obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o obj-$(CONFIG_MODULES) += module.o +ifeq ($(CONFIG_ALLOC_EXECMEM),y) +obj-y 
+= execmem.o +endif obj-$(CONFIG_MODULE_SECTIONS) += module-sections.o obj-$(CONFIG_CPU_PM) += suspend_entry.o suspend.o diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c new file mode 100644 index ..3e52522ead32 --- /dev/null +++ b/arch/riscv/kernel/execmem.c @@ -0,0 +1,22 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include +#include + +void *alloc_execmem(unsigned long size, gfp_t /* gfp */) +{ + return __vmalloc_node_range(size, 1, MODULES_VADDR, + MODULES_END, GFP_KERNEL, + PAGE_KERNEL, 0, NUMA_NO_NODE, + __builtin_return_address(0)); +} The __vmalloc_node_range() line ^^ must be from an old kernel since we added VM_FLUSH_RESET_PERMS in 6.8, see 749b94b08005 ("riscv: Fix module_alloc() that did not reset the linear mapping permissions"). In addition, I guess module_alloc() should now use alloc_execmem() right? + +void free_execmem(void *region) +{ + if (in_interrupt()) + pr_warn("In interrupt context: vmalloc may not work.\n"); + + vfree(region); +} I remember Mike Rapoport sent a patchset to introduce an API for executable memory allocation (https://lore.kernel.org/linux-mm/20230918072955.2507221-1-r...@kernel.org/), how does this intersect with your work? I don't know the status of his patchset though. Thanks, Alex
[PATCH v7 2/2] arch/riscv: Enable kprobes when CONFIG_MODULES=n
Tracing with kprobes while running a monolithic kernel is currently impossible due to the kernel module allocator dependency.

Address the issue by implementing the textmem API for RISC-V.

Link: https://www.sochub.fi # for power on testing new SoC's with a minimal stack
Link: https://lore.kernel.org/all/2022060814.3054333-1-jar...@profian.com/ # continuation
Signed-off-by: Jarkko Sakkinen
---
v5-v7:
- No changes.
v4:
- Include linux/execmem.h.
v3:
- Architecture-independent parts have been split to separate patches.
- Do not change arch/riscv/kernel/module.c as it is out of scope for this patch set now.
v2:
- Better late than never right? :-)
- Focus only on RISC-V for now to make the patch more digestible. This is the arch where I use the patch on a daily basis to help with QA.
- Introduce HAVE_KPROBES_ALLOC flag to help with more gradual migration.
---
 arch/riscv/Kconfig          |  1 +
 arch/riscv/kernel/Makefile  |  3 +++
 arch/riscv/kernel/execmem.c | 22 ++
 3 files changed, 26 insertions(+)
 create mode 100644 arch/riscv/kernel/execmem.c

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e3142ce531a0..499512fb17ff 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -132,6 +132,7 @@ config RISCV
 	select HAVE_KPROBES if !XIP_KERNEL
 	select HAVE_KPROBES_ON_FTRACE if !XIP_KERNEL
 	select HAVE_KRETPROBES if !XIP_KERNEL
+	select HAVE_ALLOC_EXECMEM if !XIP_KERNEL
 	# https://github.com/ClangBuiltLinux/linux/issues/1881
 	select HAVE_LD_DEAD_CODE_DATA_ELIMINATION if !LD_IS_LLD
 	select HAVE_MOVE_PMD
diff --git a/arch/riscv/kernel/Makefile b/arch/riscv/kernel/Makefile
index 604d6bf7e476..337797f10d3e 100644
--- a/arch/riscv/kernel/Makefile
+++ b/arch/riscv/kernel/Makefile
@@ -73,6 +73,9 @@ obj-$(CONFIG_SMP)		+= cpu_ops.o
 
 obj-$(CONFIG_RISCV_BOOT_SPINWAIT) += cpu_ops_spinwait.o
 obj-$(CONFIG_MODULES)		+= module.o
+ifeq ($(CONFIG_ALLOC_EXECMEM),y)
+obj-y				+= execmem.o
+endif
 obj-$(CONFIG_MODULE_SECTIONS)	+= module-sections.o
 
 obj-$(CONFIG_CPU_PM)		+= suspend_entry.o suspend.o
diff --git a/arch/riscv/kernel/execmem.c b/arch/riscv/kernel/execmem.c
new file mode 100644
index ..3e52522ead32
--- /dev/null
+++ b/arch/riscv/kernel/execmem.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include
+#include
+#include
+#include
+
+void *alloc_execmem(unsigned long size, gfp_t /* gfp */)
+{
+	return __vmalloc_node_range(size, 1, MODULES_VADDR,
+				    MODULES_END, GFP_KERNEL,
+				    PAGE_KERNEL, 0, NUMA_NO_NODE,
+				    __builtin_return_address(0));
+}
+
+void free_execmem(void *region)
+{
+	if (in_interrupt())
+		pr_warn("In interrupt context: vmalloc may not work.\n");
+
+	vfree(region);
+}
-- 
2.44.0
[PATCH v7 1/2] kprobes: Implement trampoline memory allocator for tracing
Tracing with kprobes while running a monolithic kernel is currently impossible because CONFIG_KPROBES depends on CONFIG_MODULES.

Introduce alloc_execmem() and free_execmem() for allocating executable memory. If an arch implements these functions, it can mark this up with the HAVE_ALLOC_EXECMEM kconfig flag. The second new kconfig flag is ALLOC_EXECMEM, which can be selected if either MODULES is selected or HAVE_ALLOC_EXECMEM is supported by the arch. If HAVE_ALLOC_EXECMEM is not supported by an arch, module_alloc() and module_memfree() are used as a fallback, thus retaining backwards compatibility with earlier kernel versions.

This will allow architectures to enable kprobes tracing without requiring module support. The support can be implemented with four easy steps:

1. Implement alloc_execmem().
2. Implement free_execmem().
3. Edit arch/<arch>/Makefile.
4. Set HAVE_ALLOC_EXECMEM in arch/<arch>/Kconfig.

Link: https://lore.kernel.org/all/20240325115632.04e37297491cadfbbf382...@kernel.org/
Suggested-by: Masami Hiramatsu
Signed-off-by: Jarkko Sakkinen
---
v7:
- Use "depends on" for ALLOC_EXECMEM instead of "select".
- Reduced and narrowed CONFIG_MODULES checks further in kprobes.c.
v6:
- Use null pointer for notifiers and register the module notifier only if IS_ENABLED(CONFIG_MODULES) is set.
- Fixed typo in the commit message and wrote a more verbose description of the feature.
v5:
- alloc_execmem() was missing the GFP_KERNEL parameter. The patch set did compile because 2/2 had the fixup (leaked there when rebasing the patch set).
v4:
- Squashed a couple of unrequired CONFIG_MODULES checks.
- See https://lore.kernel.org/all/d034m18d63ec.2y11d954ys...@kernel.org/
v3:
- A new patch added.
- For IS_DEFINED() I need advice as I could not really find that many locations where it would be applicable.
---
 arch/Kconfig                | 17 +++-
 include/linux/execmem.h     | 13 +
 kernel/kprobes.c            | 53 ++---
 kernel/trace/trace_kprobe.c | 15 +--
 4 files changed, 73 insertions(+), 25 deletions(-)
 create mode 100644 include/linux/execmem.h

diff --git a/arch/Kconfig b/arch/Kconfig
index a5af0edd3eb8..5e9735f60f3c 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,8 +52,8 @@ config GENERIC_ENTRY
 
 config KPROBES
 	bool "Kprobes"
-	depends on MODULES
 	depends on HAVE_KPROBES
+	depends on ALLOC_EXECMEM
 	select KALLSYMS
 	select TASKS_RCU if PREEMPTION
 	help
@@ -215,6 +215,21 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
 	bool
 
+config HAVE_ALLOC_EXECMEM
+	bool
+	help
+	  Architectures that select this option are capable of allocating trampoline
+	  executable memory for tracing subsystems, independently of the kernel module
+	  subsystem.
+
+config ALLOC_EXECMEM
+	bool "Executable (trampoline) memory allocation"
+	default y
+	depends on MODULES || HAVE_ALLOC_EXECMEM
+	help
+	  Select this for executable (trampoline) memory. Can be enabled when either
+	  module allocator or arch-specific allocator is available.
+ config ARCH_CORRECT_STACKTRACE_ON_KRETPROBE bool help diff --git a/include/linux/execmem.h b/include/linux/execmem.h new file mode 100644 index ..ae2ff151523a --- /dev/null +++ b/include/linux/execmem.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_EXECMEM_H +#define _LINUX_EXECMEM_H + +#ifdef CONFIG_HAVE_ALLOC_EXECMEM +void *alloc_execmem(unsigned long size, gfp_t gfp); +void free_execmem(void *region); +#else +#define alloc_execmem(size, gfp) module_alloc(size) +#define free_execmem(region) module_memfree(region) +#endif + +#endif /* _LINUX_EXECMEM_H */ diff --git a/kernel/kprobes.c b/kernel/kprobes.c index 9d9095e81792..13bef5de315c 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -44,6 +44,7 @@ #include #include #include +#include #define KPROBE_HASH_BITS 6 #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS) @@ -113,17 +114,17 @@ enum kprobe_slot_state { void __weak *alloc_insn_page(void) { /* -* Use module_alloc() so this page is within +/- 2GB of where the +* Use alloc_execmem() so this page is within +/- 2GB of where the * kernel image and loaded module images reside. This is required * for most of the architectures. * (e.g. x86-64 needs this to handle the %rip-relative fixups.) */ - return module_alloc(PAGE_SIZE); + return alloc_execmem(PAGE_SIZE, GFP_KERNEL); } static void free_insn_page(void *page) { - module_memfree(page); + free_execmem(page); } struct kprobe_insn_cache kprobe_insn_slots = { @@ -1592,6 +1593,7 @@ static int check_kprobe_address_safe(struct kprobe *p, goto out; } +#ifdef CONFIG_MODULES /* * If the module freed '.init.text', we couldn't
Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro
On Tue, Mar 26, 2024 at 9:18 PM Eric Dumazet wrote: > > On Tue, Mar 26, 2024 at 11:44 AM Jason Xing wrote: > > > Well, it's a pity that it seems that we are about to abandon this > > method but it's not that friendly to the users who are unable to > > deploy BPF... > > It is a pity these tracepoint patches are consuming a lot of reviewer > time, just because > some people 'can not deploy BPF' Sure, not everyone can do this easily. The phenomenon still exists and we cannot ignore it. Do you remember that about a month ago someone submitted one patch introducing a new tracepoint and then I replied to/asked you if it's necessary that we replace most of the tracepoints with BPF? Now I realise and accept the fact... I'll keep reviewing such patches and hope it can give you maintainers a break. I don't mind taking some time to do it, after all it's not a bad thing to help some people. > > Well, I came up with more ideas about how to improve the > > trace function in recent days. The motivation of doing this is that I > > encountered some issues which could be traced/diagnosed by using trace > > effortlessly without writing some bpftrace codes again and again. The > > status of trace seems not active but many people are still using it, I > > believe. > > 'Writing bpftrace codes again and again' is not a good reason to add > maintenance costs > to linux networking stack. I'm just saying :)
Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro
On Tue, Mar 26, 2024 at 11:44 AM Jason Xing wrote: > Well, it's a pity that it seems that we are about to abandon this > method but it's not that friendly to the users who are unable to > deploy BPF... It is a pity these tracepoint patches are consuming a lot of reviewer time, just because some people 'can not deploy BPF' Well, I came up with more ideas about how to improve the > trace function in recent days. The motivation of doing this is that I > encountered some issues which could be traced/diagnosed by using trace > effortlessly without writing some bpftrace codes again and again. The > status of trace seems not active but many people are still using it, I > believe. 'Writing bpftrace codes again and again' is not a good reason to add maintenance costs to linux networking stack.
Re: [PATCH v5 1/2] kprobes: textmem API
On Tue Mar 26, 2024 at 4:01 AM EET, Jarkko Sakkinen wrote: > On Tue Mar 26, 2024 at 3:31 AM EET, Jarkko Sakkinen wrote: > > > > +#endif /* _LINUX_EXECMEM_H */ > > > > diff --git a/kernel/kprobes.c b/kernel/kprobes.c > > > > index 9d9095e81792..87fd8c14a938 100644 > > > > --- a/kernel/kprobes.c > > > > +++ b/kernel/kprobes.c > > > > @@ -44,6 +44,7 @@ > > > > #include > > > > #include > > > > #include > > > > +#include > > > > > > > > #define KPROBE_HASH_BITS 6 > > > > #define KPROBE_TABLE_SIZE (1 << KPROBE_HASH_BITS) > > > > @@ -113,17 +114,17 @@ enum kprobe_slot_state { > > > > void __weak *alloc_insn_page(void) > > > > { > > > > /* > > > > -* Use module_alloc() so this page is within +/- 2GB of where > > > > the > > > > +* Use alloc_execmem() so this page is within +/- 2GB of where > > > > the > > > > * kernel image and loaded module images reside. This is > > > > required > > > > * for most of the architectures. > > > > * (e.g. x86-64 needs this to handle the %rip-relative fixups.) > > > > */ > > > > - return module_alloc(PAGE_SIZE); > > > > + return alloc_execmem(PAGE_SIZE, GFP_KERNEL); > > > > } > > > > > > > > static void free_insn_page(void *page) > > > > { > > > > - module_memfree(page); > > > > + free_execmem(page); > > > > } > > > > > > > > struct kprobe_insn_cache kprobe_insn_slots = { > > > > @@ -1580,6 +1581,7 @@ static int check_kprobe_address_safe(struct > > > > kprobe *p, > > > > goto out; > > > > } > > > > > > > > +#ifdef CONFIG_MODULES > > > > > > You don't need this block, because these APIs have dummy functions. > > > > Hmm... I'll verify this tomorrow. > > It depends on having struct module available given "(*probed_mod)->state". > > It is non-existent unless CONFIG_MODULES is set given how things are > flagged in include/linux/module.h. Hey, noticed kconfig issue. According to kconfig-language.txt: "select should be used with care. select will force a symbol to a value without visiting the dependencies." 
So the problem here lies in the KPROBES config entry using a select statement to pick ALLOC_EXECMEM. select will not take the depends on statement into account and thus will allow enabling kprobes without any allocator in place. To address this I'd suggest using a depends on statement also for describing the relation between KPROBES and ALLOC_EXECMEM. It does not make life worse than before for anyone, because even with the current kernel you have to select MODULES before you can move forward with kprobes. BR, Jarkko
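The distinction Jarkko describes can be sketched as two alternative Kconfig fragments (illustrative only, not the actual tree; the two KPROBES entries are mutually exclusive variants shown side by side):

```kconfig
# Variant 1: select. KPROBES would force ALLOC_EXECMEM to y without
# visiting ALLOC_EXECMEM's own "depends on MODULES || HAVE_ALLOC_EXECMEM",
# so kprobes could be enabled with no allocator available.
config KPROBES
	bool "Kprobes"
	depends on HAVE_KPROBES
	select ALLOC_EXECMEM

# Variant 2: depends on. KPROBES only becomes visible once ALLOC_EXECMEM
# itself is enableable, i.e. once a module or arch allocator exists.
config KPROBES
	bool "Kprobes"
	depends on HAVE_KPROBES
	depends on ALLOC_EXECMEM
```

This matches kconfig-language.txt's warning that select forces a symbol to a value without visiting its dependencies.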
[PATCH v2 2/2] ARM: dts: qcom: Add support for Motorola Moto G (2013)
Add a device tree for the Motorola Moto G (2013) smartphone based on the Qualcomm MSM8226 SoC.

Initially supported features:
- Buttons (Volume Down/Up, Power)
- eMMC
- Hall Effect Sensor
- SimpleFB display
- TMP108 temperature sensor
- Vibrator

Note: the dhob and shob reserved-memory regions are seemingly a part of some Motorola specific (firmware?) mechanism, see [1].

[1] https://github.com/LineageOS/android_kernel_motorola_msm8226/blob/cm-14.1/Documentation/devicetree/bindings/misc/hob_ram.txt

Signed-off-by: Stanislav Jakubek
---
Changes in V2:
- split hob-ram reserved-memory region into dhob and shob
- add a note and a link to downstream documentation with more
  information about these regions

 arch/arm/boot/dts/qcom/Makefile                |   1 +
 .../boot/dts/qcom/msm8226-motorola-falcon.dts  | 359 ++
 2 files changed, 360 insertions(+)
 create mode 100644 arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dts

diff --git a/arch/arm/boot/dts/qcom/Makefile b/arch/arm/boot/dts/qcom/Makefile
index 6478a39b3be5..3eacbf5c0785 100644
--- a/arch/arm/boot/dts/qcom/Makefile
+++ b/arch/arm/boot/dts/qcom/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 dtb-$(CONFIG_ARCH_QCOM) += \
+	msm8226-motorola-falcon.dtb \
 	qcom-apq8016-sbc.dtb \
 	qcom-apq8026-asus-sparrow.dtb \
 	qcom-apq8026-huawei-sturgeon.dtb \
diff --git a/arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dts b/arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dts
new file mode 100644
index ..029e1b1659c9
--- /dev/null
+++ b/arch/arm/boot/dts/qcom/msm8226-motorola-falcon.dts
@@ -0,0 +1,359 @@
+// SPDX-License-Identifier: BSD-3-Clause
+
+/dts-v1/;
+
+#include "qcom-msm8226.dtsi"
+#include "pm8226.dtsi"
+
+/delete-node/ _region;
+
+/ {
+	model = "Motorola Moto G (2013)";
+	compatible = "motorola,falcon", "qcom,msm8226";
+	chassis-type = "handset";
+
+	aliases {
+		mmc0 = _1;
+	};
+
+	chosen {
+		#address-cells = <1>;
+		#size-cells = <1>;
+		ranges;
+
+		framebuffer@320 {
+			compatible = "simple-framebuffer";
+			reg = <0x0320 0x80>;
+			width = <720>;
+			height = <1280>;
+			stride = <(720 * 3)>;
+			format = "r8g8b8";
+			vsp-supply = <_lcd_pos>;
+			vsn-supply = <_lcd_neg>;
+			vddio-supply = <_disp_vreg>;
+		};
+	};
+
+	gpio-keys {
+		compatible = "gpio-keys";
+
+		event-hall-sensor {
+			label = "Hall Effect Sensor";
+			gpios = < 51 GPIO_ACTIVE_LOW>;
+			linux,input-type = ;
+			linux,code = ;
+			linux,can-disable;
+		};
+
+		key-volume-up {
+			label = "Volume Up";
+			gpios = < 106 GPIO_ACTIVE_LOW>;
+			linux,code = ;
+			debounce-interval = <15>;
+		};
+	};
+
+	vddio_disp_vreg: regulator-vddio-disp {
+		compatible = "regulator-fixed";
+		regulator-name = "vddio_disp";
+		gpio = < 34 GPIO_ACTIVE_HIGH>;
+		vin-supply = <_l8>;
+		startup-delay-us = <300>;
+		enable-active-high;
+		regulator-boot-on;
+	};
+
+	reserved-memory {
+		#address-cells = <1>;
+		#size-cells = <1>;
+		ranges;
+
+		framebuffer@320 {
+			reg = <0x0320 0x80>;
+			no-map;
+		};
+
+		dhob@f50 {
+			reg = <0x0f50 0x4>;
+			no-map;
+		};
+
+		shob@f54 {
+			reg = <0x0f54 0x2000>;
+			no-map;
+		};
+
+		smem_region: smem@fa0 {
+			reg = <0x0fa0 0x10>;
+			no-map;
+		};
+
+		/* Actually <0x0fa0 0x50>, but first 10 is smem */
+		reserved@fb0 {
+			reg = <0x0fb0 0x40>;
+			no-map;
+		};
+	};
+};
+
+_i2c3 {
+	status = "okay";
+
+	regulator@3e {
+		compatible = "ti,tps65132";
+		reg = <0x3e>;
+		pinctrl-0 = <_lcd_default>;
+		pinctrl-names = "default";
+
+		reg_lcd_pos: outp {
+			regulator-name = "outp";
+			regulator-min-microvolt = <400>;
+			regulator-max-microvolt = <600>;
+			regulator-active-discharge = <1>;
+
[PATCH v2 1/2] dt-bindings: arm: qcom: Add Motorola Moto G (2013)
Document the Motorola Moto G (2013), which is a smartphone based on the Qualcomm MSM8226 SoC.

Acked-by: Krzysztof Kozlowski
Signed-off-by: Stanislav Jakubek
---
Changes in V2:
- collect Krzysztof's A-b

 Documentation/devicetree/bindings/arm/qcom.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/arm/qcom.yaml b/Documentation/devicetree/bindings/arm/qcom.yaml
index 66beaac60e1d..d2910982ae86 100644
--- a/Documentation/devicetree/bindings/arm/qcom.yaml
+++ b/Documentation/devicetree/bindings/arm/qcom.yaml
@@ -137,6 +137,7 @@ properties:
           - microsoft,dempsey
           - microsoft,makepeace
           - microsoft,moneypenny
+          - motorola,falcon
           - samsung,s3ve3g
       - const: qcom,msm8226
-- 
2.34.1
Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag
On Tue Mar 26, 2024 at 4:16 AM EET, Rick Edgecombe wrote: > The mm_struct contains a function pointer *get_unmapped_area(), which > is set to either arch_get_unmapped_area() or > arch_get_unmapped_area_topdown() during the initialization of the mm. Under which conditions is each path used during the initialization of the mm, and why is that the case? It is an open claim in the current form. It would be nice to have that documented for the sake of a complete description. I have zero doubt that the claim itself is true. BR, Jarkko
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
> > > Have you considered/measured per-CPU RW semaphores? > > > > No I hadn't but thanks hugely for suggesting it! In initial measurements > > it seems to be between 20-100% faster than the RW spinlocks! Apologies for > > all the exclamation marks but I'm very excited. I'll do some more testing > > tomorrow but so far it's looking very good. > > > > Documentation ([0]) says that locking for writing calls > synchronize_rcu(), is that right? If that's true, attaching multiple > uprobes (including just attaching a single BPF multi-uprobe) will take > a really long time. We need to confirm we are not significantly > regressing this. And if we do, we need to take measures in the BPF > multi-uprobe attachment code path to make sure that a single > multi-uprobe attachment is still fast. > > If my worries above turn out to be true, it still feels like a first > good step should be landing this patch as is (and get it backported to > older kernels), and then have percpu rw-semaphore as a final (and a > bit more invasive) solution (it's RCU-based, so feels like a good > primitive to settle on), making sure to not regress multi-uprobes > (we'll probably will need some batched API for multiple uprobes). > > Thoughts? Agreed. In the percpu_down_write() path we call rcu_sync_enter() which is what calls into synchronize_rcu(). I haven't done the measurements yet but I would imagine this has to regress probe attachment, at least in the uncontended case. Of course, reads are by far the dominant mode here but we probably shouldn't punish writes excessively. I will do some measurements to quantify the write penalty here. I agree that a batched interface for probe attachment is needed here. The usual mode of operation for us is that we have a number of USDTs (uprobes) in hand and we want to enable and disable them in one shot. Removing the need to do multiple locking operations is definitely an efficiency improvement that needs to be done. 
Tie that together with per-CPU RW semaphores and this should scale extremely well in both a read and write case. Jon.
Re: [PATCH] virtio_ring: Fix the stale index in available ring
On Tue, Mar 26, 2024 at 09:38:55AM +, Keir Fraser wrote: > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote: > > > Secondly, the debugging code is enhanced so that the available head for > > > (last_avail_idx - 1) is read for twice and recorded. It means the > > > available > > > head for one specific available index is read for twice. I do see the > > > available heads are different from the consecutive reads. More details > > > are shared as below. > > > > > > From the guest side > > > === > > > > > > virtio_net virtio0: output.0:id 86 is not a head! > > > head to be released: 047 062 112 > > > > > > avail_idx: > > > 000 49665 > > > 001 49666 <-- > > > : > > > 015 49664 > > > > what are these #s 49665 and so on? > > and how large is the ring? > > I am guessing 49664 is the index ring size is 16 and > > 49664 % 16 == 0 > > More than that, 49664 % 256 == 0 > > So again there seems to be an error in the vicinity of roll-over of > the idx low byte, as I observed in the earlier log. Surely this is > more than coincidence? Yeah, I'd still really like to see the disassembly for both sides of the protocol here. Gavin, is that something you're able to provide? Worst case, the host and guest vmlinux objects would be a starting point. Personally, I'd be fairly surprised if this was a hardware issue. Will
Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro
On Tue, 2024-03-26 at 18:43 +0800, Jason Xing wrote: > On Tue, Mar 26, 2024 at 6:29 PM Paolo Abeni wrote: > > > > On Tue, 2024-03-26 at 12:14 +0800, Jason Xing wrote: > > > On Mon, Mar 25, 2024 at 11:43 AM Jason Xing > > > wrote: > > > > > > > > From: Jason Xing > > > > > > > > Using the macro for other tracepoints use to be more concise. > > > > No functional change. > > > > > > > > Jason Xing (3): > > > > trace: move to TP_STORE_ADDRS related macro to net_probe_common.h > > > > trace: use TP_STORE_ADDRS() macro in inet_sk_error_report() > > > > trace: use TP_STORE_ADDRS() macro in inet_sock_set_state() > > > > > > > > include/trace/events/net_probe_common.h | 29 > > > > include/trace/events/sock.h | 35 - > > > > > > I just noticed that some trace files in include/trace directory (like > > > net_probe_common.h, sock.h, skb.h, net.h, sock.h, udp.h, sctp.h, > > > qdisc.h, neigh.h, napi.h, icmp.h, ...) are not owned by networking > > > folks while some files (like tcp.h) have been maintained by specific > > > maintainers/experts (like Eric) because they belong to one specific > > > area. I wonder if we can get more networking guys involved in net > > > tracing. > > > > > > I'm not sure if 1) we can put those files into the "NETWORKING > > > [GENERAL]" category, or 2) we can create a new category to include > > > them all. > > > > I think all the file you mentioned are not under networking because of > > MAINTAINER file inaccuracy, and we could move there them accordingly. > > Yes, they are not under the networking category currently. So how > could we move them? The MAINTAINER file doesn't have all the specific > categories which are suitable for each of the trace files. I think there is no need for other categories: adding the explicit 'F:' entries for such files in the NETWORKING [GENERAL] section should fit. 
> > > I know people start using BPF to trace them all instead, but I can see > > > some good advantages of those hooks implemented in the kernel, say: > > > 1) help those machines which are not easy to use BPF tools. > > > 2) insert the tracepoint in the middle of some functions which cannot > > > be replaced by bpf kprobe. > > > 3) if we have enough tracepoints, we can generate a timeline to > > > know/detect which flow/skb spends unexpected time at which point. > > > ... > > > We can do many things in this area, I think :) > > > > > > What do you think about this, Jakub, Paolo, Eric ? > > > > I agree tracepoints are useful, but I think the general agreement is > > that they are the 'old way', we should try to avoid their > > proliferation. > > Well, it's a pity that it seems that we are about to abandon this > method but it's not that friendly to the users who are unable to > deploy BPF... Well, I came up with more ideas about how to improve the > trace function in recent days. The motivation of doing this is that I > encountered some issues which could be traced/diagnosed by using trace > effortlessly without writing some bpftrace codes again and again. The > status of trace seems not active but many people are still using it, I > believe. I don't think we should abandon it completely. My understanding is that we should think carefully before adding new tracepoints, and generally speaking, avoid adding 'too many' of them. Cheers, Paolo
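Paolo's suggestion of explicit 'F:' entries would amount to something like the following MAINTAINERS fragment (a sketch only; the exact file list would need vetting, and the section's existing M:/L: lines are elided):

```text
NETWORKING [GENERAL]
...
F:	include/trace/events/napi.h
F:	include/trace/events/net.h
F:	include/trace/events/net_probe_common.h
F:	include/trace/events/skb.h
F:	include/trace/events/sock.h
```

With such entries, get_maintainer.pl would route patches touching these headers to the netdev maintainers without creating any new category.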
Re: [PATCH net-next v2 3/3] tcp: add location into reset trace process
On Mon, 2024-03-25 at 14:28 +0800, Jason Xing wrote: > From: Jason Xing > > In addition to knowing the 4-tuple of the flow which generates RST, > the reason why it does so is very important because we have some > cases where the RST should be sent and have no clue which one > exactly. > > Adding location of reset process can help us more, like what > trace_kfree_skb does. > > Signed-off-by: Jason Xing > --- > include/trace/events/tcp.h | 14 ++ > net/ipv4/tcp_ipv4.c| 2 +- > net/ipv4/tcp_output.c | 2 +- > net/ipv6/tcp_ipv6.c| 2 +- > 4 files changed, 13 insertions(+), 7 deletions(-) > > diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h > index a13eb2147a02..8f6c1a07503c 100644 > --- a/include/trace/events/tcp.h > +++ b/include/trace/events/tcp.h > @@ -109,13 +109,17 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb, > */ > TRACE_EVENT(tcp_send_reset, > > - TP_PROTO(const struct sock *sk, const struct sk_buff *skb), > + TP_PROTO( > + const struct sock *sk, > + const struct sk_buff *skb, > + void *location), Very minor nit: the above lines should be aligned with the open bracket. No need to repost just for this, but let's wait for Eric's feedback. Cheers, Paolo
Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro
Hello: This series was applied to netdev/net-next.git (main) by Paolo Abeni : On Mon, 25 Mar 2024 11:43:44 +0800 you wrote: > From: Jason Xing > > Using the macro for other tracepoints use to be more concise. > No functional change. > > Jason Xing (3): > trace: move to TP_STORE_ADDRS related macro to net_probe_common.h > trace: use TP_STORE_ADDRS() macro in inet_sk_error_report() > trace: use TP_STORE_ADDRS() macro in inet_sock_set_state() > > [...] Here is the summary with links: - [net-next,1/3] trace: move to TP_STORE_ADDRS related macro to net_probe_common.h https://git.kernel.org/netdev/net-next/c/b3af9045b482 - [net-next,2/3] trace: use TP_STORE_ADDRS() macro in inet_sk_error_report() https://git.kernel.org/netdev/net-next/c/a24c855a5ef2 - [net-next,3/3] trace: use TP_STORE_ADDRS() macro in inet_sock_set_state() https://git.kernel.org/netdev/net-next/c/646700ce23f4 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro
On Tue, Mar 26, 2024 at 6:29 PM Paolo Abeni wrote: > > On Tue, 2024-03-26 at 12:14 +0800, Jason Xing wrote: > > On Mon, Mar 25, 2024 at 11:43 AM Jason Xing > > wrote: > > > > > > From: Jason Xing > > > > > > Using the macro for other tracepoints use to be more concise. > > > No functional change. > > > > > > Jason Xing (3): > > > trace: move to TP_STORE_ADDRS related macro to net_probe_common.h > > > trace: use TP_STORE_ADDRS() macro in inet_sk_error_report() > > > trace: use TP_STORE_ADDRS() macro in inet_sock_set_state() > > > > > > include/trace/events/net_probe_common.h | 29 > > > include/trace/events/sock.h | 35 - > > > > I just noticed that some trace files in include/trace directory (like > > net_probe_common.h, sock.h, skb.h, net.h, sock.h, udp.h, sctp.h, > > qdisc.h, neigh.h, napi.h, icmp.h, ...) are not owned by networking > > folks while some files (like tcp.h) have been maintained by specific > > maintainers/experts (like Eric) because they belong to one specific > > area. I wonder if we can get more networking guys involved in net > > tracing. > > > > I'm not sure if 1) we can put those files into the "NETWORKING > > [GENERAL]" category, or 2) we can create a new category to include > > them all. > > I think all the file you mentioned are not under networking because of > MAINTAINER file inaccuracy, and we could move there them accordingly. Yes, they are not under the networking category currently. So how could we move them? The MAINTAINER file doesn't have all the specific categories which are suitable for each of the trace files. > > > > I know people start using BPF to trace them all instead, but I can see > > some good advantages of those hooks implemented in the kernel, say: > > 1) help those machines which are not easy to use BPF tools. > > 2) insert the tracepoint in the middle of some functions which cannot > > be replaced by bpf kprobe. 
> > 3) if we have enough tracepoints, we can generate a timeline to > > know/detect which flow/skb spends unexpected time at which point. > > ... > > We can do many things in this area, I think :) > > > > What do you think about this, Jakub, Paolo, Eric ? > > I agree tracepoints are useful, but I think the general agreement is > that they are the 'old way', we should try to avoid their > proliferation. Well, it's a pity that it seems that we are about to abandon this method but it's not that friendly to the users who are unable to deploy BPF... Well, I came up with more ideas about how to improve the trace function in recent days. The motivation of doing this is that I encountered some issues which could be traced/diagnosed by using trace effortlessly without writing some bpftrace codes again and again. The status of trace seems not active but many people are still using it, I believe. Thanks, Jason > > Cheers, > > Paolo >
Re: [PATCH net-next 0/3] trace: use TP_STORE_ADDRS macro
On Tue, 2024-03-26 at 12:14 +0800, Jason Xing wrote: > On Mon, Mar 25, 2024 at 11:43 AM Jason Xing wrote: > > > > From: Jason Xing > > > > Using the macro for other tracepoints use to be more concise. > > No functional change. > > > > Jason Xing (3): > > trace: move to TP_STORE_ADDRS related macro to net_probe_common.h > > trace: use TP_STORE_ADDRS() macro in inet_sk_error_report() > > trace: use TP_STORE_ADDRS() macro in inet_sock_set_state() > > > > include/trace/events/net_probe_common.h | 29 > > include/trace/events/sock.h | 35 - > > I just noticed that some trace files in include/trace directory (like > net_probe_common.h, sock.h, skb.h, net.h, sock.h, udp.h, sctp.h, > qdisc.h, neigh.h, napi.h, icmp.h, ...) are not owned by networking > folks while some files (like tcp.h) have been maintained by specific > maintainers/experts (like Eric) because they belong to one specific > area. I wonder if we can get more networking guys involved in net > tracing. > > I'm not sure if 1) we can put those files into the "NETWORKING > [GENERAL]" category, or 2) we can create a new category to include > them all. I think all the file you mentioned are not under networking because of MAINTAINER file inaccuracy, and we could move there them accordingly. > > I know people start using BPF to trace them all instead, but I can see > some good advantages of those hooks implemented in the kernel, say: > 1) help those machines which are not easy to use BPF tools. > 2) insert the tracepoint in the middle of some functions which cannot > be replaced by bpf kprobe. > 3) if we have enough tracepoints, we can generate a timeline to > know/detect which flow/skb spends unexpected time at which point. > ... > We can do many things in this area, I think :) > > What do you think about this, Jakub, Paolo, Eric ? I agree tracepoints are useful, but I think the general agreement is that they are the 'old way', we should try to avoid their proliferation. Cheers, Paolo
Re: [PATCH 2/2] ARM: dts: qcom: Add support for Motorola Moto G (2013)
On 25.03.2024 9:25 PM, Stanislav Jakubek wrote: > On Mon, Mar 25, 2024 at 08:28:27PM +0100, Konrad Dybcio wrote: >> On 24.03.2024 3:04 PM, Stanislav Jakubek wrote: >>> Add a device tree for the Motorola Moto G (2013) smartphone based >>> on the Qualcomm MSM8226 SoC. >>> >>> Initially supported features: >>> - Buttons (Volume Down/Up, Power) >>> - eMMC >>> - Hall Effect Sensor >>> - SimpleFB display >>> - TMP108 temperature sensor >>> - Vibrator >>> >>> Signed-off-by: Stanislav Jakubek >>> --- >> >> [...] >> >>> + hob-ram@f50 { >>> + reg = <0x0f50 0x4>, >>> + <0x0f54 0x2000>; >>> + no-map; >>> + }; >> >> Any reason it's in two parts? Should it be one contiguous region, or >> two separate nodes? >> >> lgtm otherwise > > Hi Konrad, I copied this from downstream as-is. > According to the downstream docs [1]: > > HOB RAM MMAP Device provides ability for userspace to access the > hand over block memory to read out modem related parameters. > > And the two regs are the "DHOB partition" and "SHOB partition". Oh right, motorola made some inventions here.. > > I suppose this is something Motorola (firmware?) specific (since the > downstream compatible is mmi,hob_ram [2]). > Should I split this into 2 nodes - dhob@f50 and shob@f54? Yes please and add the downstream txt link to the commit message in case somebody was curious down the line. Konrad
[PATCH v19 RESEND 4/5] Documentation: tracing: Add ring-buffer mapping
It is now possible to mmap() a ring-buffer to stream its content. Add some documentation and a code example.

Signed-off-by: Vincent Donnefort

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
    timerlat-tracer
    intel_th
    ring-buffer-design
+   ring-buffer-map
    stm
    sys-t
    coresight/index
diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..0426ab4bcf3d
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,106 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Tracefs ring-buffer memory mapping
+==================================
+
+:Author: Vincent Donnefort
+
+Overview
+========
+
+Tracefs ring-buffer memory map provides an efficient method to stream data
+as no memory copy is necessary. The application mapping the ring-buffer then
+becomes a consumer for that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+====================
+
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and
+description. It is referred to as the meta-page. One of the most important
+fields of the meta-page is the reader. It contains the sub-buffer ID which can
+be safely read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID. It is
+therefore effortless to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+        reader_id = meta->reader->id;
+        reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one using
+the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates
+the meta-page fields.
+
+Limitations
+===========
+
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (either by increasing the entire size of the ring-buffer or each
+subbuf). It is also not possible to use snapshot, and splice will copy the ring
+buffer data instead of using the copyless swap from the ring buffer.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable, just like concurrent readers on
+trace_pipe would be.
+
+Example
+=======
+
+.. code-block:: c
+
+        #include
+        #include
+        #include
+        #include
+
+        #include
+
+        #include
+        #include
+
+        #define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+        int main(void)
+        {
+                int page_size = getpagesize(), fd, reader_id;
+                unsigned long meta_len, data_len;
+                struct trace_buffer_meta *meta;
+                void *map, *reader, *data;
+
+                fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+                if (fd < 0)
+                        exit(EXIT_FAILURE);
+
+                map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+                if (map == MAP_FAILED)
+                        exit(EXIT_FAILURE);
+
+                meta = (struct trace_buffer_meta *)map;
+                meta_len = meta->meta_page_size;
+
+                printf("entries:    %llu\n", meta->entries);
+                printf("overrun:    %llu\n", meta->overrun);
+                printf("read:       %llu\n", meta->read);
+                printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+                data_len = meta->subbuf_size * meta->nr_subbufs;
+                data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len);
+                if (data == MAP_FAILED)
+                        exit(EXIT_FAILURE);
+
+                if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+                        exit(EXIT_FAILURE);
+
+                reader_id = meta->reader.id;
+                reader = data + meta->subbuf_size * reader_id;
+
+                printf("Current reader address: %p\n", reader);
+
+                munmap(data, data_len);
+                munmap(meta, meta_len);
+                close(fd);
+
+                return 0;
+        }
-- 
2.44.0.396.g6e790dbe36-goog
[PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer
Currently, user-space extracts data from the ring-buffer via splice, which is handy for storage or network sharing. However, due to splice limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first element of the mapping is the meta-page. It is followed by each subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
  ...

It is therefore easy to translate a subbuf ID into an offset in the mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to point to the next reader containing unread data.

Mapping will prevent snapshot and buffer size modifications. 
CC:
Signed-off-by: Vincent Donnefort

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index ffcd8dfcaa4f..d25b9d504a7c 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
 	__u64	Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER	_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 233d1af39fff..0f37aa9860fd 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1191,6 +1191,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr,
 		return;
 	}
 
+	if (tr->mapped) {
+		trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+		trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+		return;
+	}
+
 	local_irq_save(flags);
 	update_max_tr(tr, current, smp_processor_id(), cond_data);
 	local_irq_restore(flags);
@@ -1323,7 +1329,7 @@ static int tracing_arm_snapshot_locked(struct trace_array *tr)
 	lockdep_assert_held(_types_lock);
 
 	spin_lock(>snapshot_trigger_lock);
-	if (tr->snapshot == UINT_MAX) {
+	if (tr->snapshot == UINT_MAX || tr->mapped) {
 		spin_unlock(>snapshot_trigger_lock);
 		return -EBUSY;
 	}
@@ -6068,7 +6074,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
 	if (tr->current_trace == _trace)
 		return;
-
+
 	tr->current_trace->enabled--;
 
 	if (tr->current_trace->reset)
@@ -8194,15 +8200,32 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 	return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct ftrace_buffer_info *info = file->private_data;
 	struct trace_iterator *iter = >iter;
+	int err;
+
+	if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+		if (!(file->f_flags & O_NONBLOCK)) {
+			err = ring_buffer_wait(iter->array_buffer->buffer,
+					       iter->cpu_file,
+					       iter->tr->buffer_percent,
+					       NULL, NULL);
+			if (err)
+				return err;
+		}
 
-	if (cmd)
-		return -ENOIOCTLCMD;
+		return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+						  iter->cpu_file);
+	} else if (cmd) {
+		return -ENOTTY;
+	}
 
+	/*
+	 * An ioctl call with cmd 0 to the ring buffer file will wake up all
+	 * waiters
+	 */
 	mutex_lock(_types_lock);
 
 	/* Make sure the waiters see the new wait_index */
@@ -8214,6 +8237,94 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned
 	return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+static int get_snapshot_map(struct trace_array *tr)
+{
+	int err = 0;
+
+	/*
+	 * Called with mmap_lock held. lockdep would be unhappy if we would now
+	 * take trace_types_lock. Instead use the specific
+	 * snapshot_trigger_lock.
+	 */
+	spin_lock(>snapshot_trigger_lock);
+
+	if (tr->snapshot || tr->mapped == UINT_MAX)
+		err = -EBUSY;
+	else
+		tr->mapped++;
+
+	spin_unlock(>snapshot_trigger_lock);
+
+	/* Wait for update_max_tr() to observe iter->tr->mapped */
+	if (tr->mapped == 1)
+		synchronize_rcu();
+
+	return err;
+
+}
+static void put_snapshot_map(struct
[PATCH v19 RESEND 2/5] ring-buffer: Introducing ring-buffer mapping functions
In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, as they are currently only identified through their in-kernel VA. A meta-page, where ring-buffer statistics and a description of the current reader are stored. The linear mapping exposes the meta-page and each subbuf of the ring-buffer, ordered by their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. CC: Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index dc5ae4e96aee..96d2140b471e 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -223,4 +225,8 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu, + struct vm_area_struct *vma); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size: Size of this meta-page. + * @meta_struct_len: Size of this structure.
+ * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs: Number of subbufs in the ring-buffer, including the reader. + * @reader.lost_events: Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. + */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index cc9ebe593571..1dc932e7963c 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -338,6 +339,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned long real_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id; /* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +486,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned int mapped; + struct mutex mapping_lock; + unsigned long *subbuf_ids; /* ID to subbuf VA */ + struct trace_buffer_meta *meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ long nr_pages_to_update; struct list_head new_pages; /* new pages to add */ @@ -1599,6 +1607,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters); init_waitqueue_head(&cpu_buffer->irq_work.waiters); init_waitqueue_head(&cpu_buffer->irq_work.full_waiters); + mutex_init(&cpu_buffer->mapping_lock); bpage = kzalloc_node(ALIGN(sizeof(*bpage),
[PATCH v19 RESEND 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP
In preparation for the ring-buffer memory mapping, allocate compound pages for the ring-buffer sub-buffers to enable us to map them to user-space with vm_insert_pages(). Signed-off-by: Vincent Donnefort diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 25476ead681b..cc9ebe593571 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, list_add(&bpage->list, pages); page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), - mflags | __GFP_ZERO, + mflags | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) goto free_pages; @@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu) cpu_buffer->reader_page = bpage; - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO, + page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) goto fail_free_reader; @@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu) goto out; page = alloc_pages_node(cpu_to_node(cpu), - GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO, + GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | __GFP_ZERO, cpu_buffer->buffer->subbuf_order); if (!page) { kfree(bpage); -- 2.44.0.396.g6e790dbe36-goog
[PATCH v19 RESEND 0/5] Introducing trace buffer mapping by user-space
The tracing ring-buffers can be stored on disk or sent to network without any copy via splice. However the latter doesn't allow real-time processing of the traces. A solution is to give userspace direct access to the ring-buffer pages via a mapping. An application can now become a consumer of the ring-buffer, in a similar fashion to what trace_pipe offers. Support for this new feature can already be found in libtracefs from version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE. Vincent v18 -> v19: * Use VM_PFNMAP and vm_insert_pages * Allocate ring-buffer subbufs with __GFP_COMP * Pad the meta-page with the zero-page to align on the subbuf_order * Extend the ring-buffer test with mmap() dedicated suite v17 -> v18: * Fix lockdep_assert_held * Fix spin_lock_init typo * Fix CONFIG_TRACER_MAX_TRACE typo v16 -> v17: * Documentation and comments improvements. * Create get/put_snapshot_map() for clearer code. * Replace kzalloc with kcalloc. * Fix -ENOMEM handling in rb_alloc_meta_page(). * Move flush(cpu_buffer->reader_page) behind the reader lock. * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer mutex. (removes READ_ONCE/WRITE_ONCE accesses). v15 -> v16: * Add comment for the dcache flush. * Remove now unnecessary WRITE_ONCE for the meta-page. v14 -> v15: * Add meta-page and reader-page flush. Intends to fix the mapping for VIVT and aliasing-VIPT data caches. * -EPERM on VM_EXEC. * Fix build warning !CONFIG_TRACER_MAX_TRACE. v13 -> v14: * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu) * on unmap, sync meta-page teardown with the reader_lock instead of the synchronize_rcu. * Add a dedicated spinlock for trace_array ->snapshot and ->mapped. (intends to fix a lockdep issue) * Add kerneldoc for flags and Reserved fields. * Add kselftest for snapshot/map mutual exclusion. v12 -> v13: * Swap subbufs_{touched,lost} for Reserved fields. * Add a flag field in the meta-page. * Fix CONFIG_TRACER_MAX_TRACE.
* Rebase on top of trace/urgent. * Add a comment for try_unregister_trigger() v11 -> v12: * Fix code sample mmap bug. * Add logging in sample code. * Reset tracer in selftest. * Add a refcount for the snapshot users. * Prevent mapping when there are snapshot users and vice versa. * Refine the meta-page. * Fix types in the meta-page. * Collect Reviewed-by. v10 -> v11: * Add Documentation and code sample. * Add a selftest. * Move all the update to the meta-page into a single rb_update_meta_page(). * rb_update_meta_page() is now called from ring_buffer_map_get_reader() to fix NOBLOCK callers. * kerneldoc for struct trace_meta_page. * Add a patch to zero all the ring-buffer allocations. v9 -> v10: * Refactor rb_update_meta_page() * In-loop declaration for foreach_subbuf_page() * Check for cpu_buffer->mapped overflow v8 -> v9: * Fix the unlock path in ring_buffer_map() * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a) v7 -> v8: * Drop the subbufs renaming into bpages * Use subbuf as a name when relevant v6 -> v7: * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/ * Support for subbufs * Rename subbufs into bpages v5 -> v6: * Rebase on next-20230802. * (unsigned long) -> (void *) cast for virt_to_page(). * Add a wait for the GET_READER_PAGE ioctl. * Move writer fields update (overrun/pages_lost/entries/pages_touched) in the irq_work. * Rearrange id in struct buffer_page. * Rearrange the meta-page. * ring_buffer_meta_page -> trace_buffer_meta_page. * Add meta_struct_len into the meta-page. v4 -> v5: * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3) v3 -> v4: * Add to the meta-page: - pages_lost / pages_read (allow to compute how full is the ring-buffer) - read (allow to compute how many entries can be read) - A reader_page struct. 
* Rename ring_buffer_meta_header -> ring_buffer_meta * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page * Properly consume events on ring_buffer_map_get_reader_page() with rb_advance_reader(). v2 -> v3: * Remove data page list (for non-consuming read) ** Implies removing order > 0 meta-page * Add a new meta page field ->read * Rename ring_buffer_meta_page_header into ring_buffer_meta_header v1 -> v2: * Hide data_pages from the userspace struct * Fix META_PAGE_MAX_PAGES * Support for order > 0 meta-page * Add missing page->mapping. Vincent Donnefort (5): ring-buffer: allocate sub-buffers with __GFP_COMP ring-buffer: Introducing ring-buffer mapping functions tracing: Allow user-space mapping of the ring-buffer Documentation: tracing: Add ring-buffer mapping ring-buffer/selftest: Add ring-buffer mapping test Documentation/trace/index.rst | 1 +
[PATCH v19 4/5] Documentation: tracing: Add ring-buffer mapping
It is now possible to mmap() a ring-buffer to stream its content. Add some documentation and a code example. Signed-off-by: Vincent Donnefort diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst index 5092d6c13af5..0b300901fd75 100644 --- a/Documentation/trace/index.rst +++ b/Documentation/trace/index.rst @@ -29,6 +29,7 @@ Linux Tracing Technologies timerlat-tracer intel_th ring-buffer-design + ring-buffer-map stm sys-t coresight/index diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst new file mode 100644 index ..0426ab4bcf3d --- /dev/null +++ b/Documentation/trace/ring-buffer-map.rst @@ -0,0 +1,106 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================================== +Tracefs ring-buffer memory mapping +================================== + +:Author: Vincent Donnefort + +Overview +======== +Tracefs ring-buffer memory map provides an efficient method to stream data as no memory copy is necessary. The application mapping the ring-buffer then becomes a consumer of that ring-buffer, in a similar fashion to trace_pipe. + +Memory mapping setup +==================== +The mapping works with a mmap() of the trace_pipe_raw interface. + +The first system page of the mapping contains ring-buffer statistics and description. It is referred to as the meta-page. One of the most important fields of the meta-page is the reader. It contains the sub-buffer ID which can be safely read by the mapper (see ring-buffer-design.rst). + +The meta-page is followed by all the sub-buffers, ordered by ascending ID. It is therefore effortless to know where the reader starts in the mapping: + +.. code-block:: c + +reader_id = meta->reader->id; +reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; + When the application is done with the current reader, it can get a new one using the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates the meta-page fields.
+ +Limitations +=========== +When a mapping is in place on a Tracefs ring-buffer, it is not possible to resize it (either by increasing the entire size of the ring-buffer or the size of each subbuf). It is also not possible to use snapshot, and the mapping causes splice to copy the ring-buffer data instead of using the copyless swap from the ring-buffer. + +Concurrent readers (either another application mapping that ring-buffer or the kernel with trace_pipe) are allowed but not recommended. They will compete for the ring-buffer and the output is unpredictable, just like concurrent readers on trace_pipe would be. + +Example +======= + +.. code-block:: c + +#include <fcntl.h> +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> + +#include <linux/trace_mmap.h> + +#include <sys/mman.h> +#include <sys/ioctl.h> + +#define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw" + +int main(void) +{ +int page_size = getpagesize(), fd, reader_id; +unsigned long meta_len, data_len; +struct trace_buffer_meta *meta; +void *map, *reader, *data; + +fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK); +if (fd < 0) +exit(EXIT_FAILURE); + +map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0); +if (map == MAP_FAILED) +exit(EXIT_FAILURE); + +meta = (struct trace_buffer_meta *)map; +meta_len = meta->meta_page_size; + +printf("entries:%llu\n", meta->entries); +printf("overrun:%llu\n", meta->overrun); +printf("read: %llu\n", meta->read); +printf("nr_subbufs: %u\n", meta->nr_subbufs); + +data_len = meta->subbuf_size * meta->nr_subbufs; +data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len); +if (data == MAP_FAILED) +exit(EXIT_FAILURE); + +if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0) +exit(EXIT_FAILURE); + +reader_id = meta->reader.id; +reader = data + meta->subbuf_size * reader_id; + +printf("Current reader address: %p\n", reader); + +munmap(data, data_len); +munmap(meta, meta_len); +close(fd); + +return 0; +} -- 2.44.0.396.g6e790dbe36-goog
Re: [PATCH] virtio_ring: Fix the stale index in available ring
On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote: > On Mon, Mar 25, 2024 at 05:34:29PM +1000, Gavin Shan wrote: > > > > On 3/20/24 17:14, Michael S. Tsirkin wrote: > > > On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:> > > > > > diff --git a/drivers/virtio/virtio_ring.c > > > > > b/drivers/virtio/virtio_ring.c > > > > > index 6f7e5010a673..79456706d0bd 100644 > > > > > --- a/drivers/virtio/virtio_ring.c > > > > > +++ b/drivers/virtio/virtio_ring.c > > > > > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct > > > > > virtqueue *_vq, > > > > > /* Put entry in available array (but don't update avail->idx > > > > > until they > > > > >* do sync). */ > > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1); > > > > > - vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, > > > > > head); > > > > > + u16 headwithflag = head | (vq->split.avail_idx_shadow & > > > > > ~(vq->split.vring.num - 1)); > > > > > + vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, > > > > > headwithflag); > > > > > /* Descriptors and available array need to be set before we > > > > > expose the > > > > >* new available array entries. */ > > > > > > > > > Ok, Michael. I continued with my debugging code. It still looks like a > > hardware bug on NVidia's grace-hopper. I really think NVidia needs to be > > involved for the discussion, as suggested by you. > > Do you have a support contact at Nvidia to report this? > > > Firstly, I bind the vhost process and vCPU thread to CPU#71 and CPU#70. > > Note that I have only one vCPU in my configuration. > > Interesting but is guest built with CONFIG_SMP set? arm64 is always built CONFIG_SMP. > > Secondly, the debugging code is enhanced so that the available head for > > (last_avail_idx - 1) is read twice and recorded. It means the available > > head for one specific available index is read twice.
I do see the > > available heads are different from the consecutive reads. More details > > are shared as below. > > > > From the guest side > > === > > > > virtio_net virtio0: output.0:id 86 is not a head! > > head to be released: 047 062 112 > > > > avail_idx: > > 000 49665 > > 001 49666 <-- > > : > > 015 49664 > > what are these #s 49665 and so on? > and how large is the ring? > I am guessing 49664 is the index ring size is 16 and > 49664 % 16 == 0 More than that, 49664 % 256 == 0 So again there seems to be an error in the vicinity of roll-over of the idx low byte, as I observed in the earlier log. Surely this is more than coincidence? -- Keir > > avail_head: > > > is this the avail ring contents? > > > 000 062 > > 001 047 <-- > > : > > 015 112 > > > What are these arrows pointing at, btw? > > > > From the host side > > == > > > > avail_idx > > 000 49663 > > 001 49666 <--- > > : > > > > avail_head > > 000 062 (062) > > 001 047 (047) <--- > > : > > 015 086 (112) // head 086 is returned from the first read, > > // but head 112 is returned from the second read > > > > vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx > > 49664 > > > > Thanks, > > Gavin > > OK thanks so this proves it is actually the avail ring value. > > -- > MST >
[PATCH 1/2] LoongArch: KVM: Add steal time support in kvm side
Steal time feature is added here in kvm side, VM can search supported features provided by KVM hypervisor, feature KVM_FEATURE_STEAL_TIME is added here. Like x86, steal time structure is saved in guest memory, one hypercall function KVM_HCALL_FUNC_NOTIFY is added to notify KVM to enable the feature. One cpu attr ioctl command KVM_LOONGARCH_VCPU_PVTIME_CTRL is added to save and restore base address of steal time structure when VM is migrated. Since it needs hypercall instruction emulation handling, and it is dependent on this patchset: https://lore.kernel.org/all/20240201031950.3225626-1-maob...@loongson.cn/ Signed-off-by: Bibo Mao --- arch/loongarch/include/asm/kvm_host.h | 7 ++ arch/loongarch/include/asm/kvm_para.h | 10 +++ arch/loongarch/include/asm/loongarch.h | 1 + arch/loongarch/include/uapi/asm/kvm.h | 4 + arch/loongarch/kvm/exit.c | 35 ++-- arch/loongarch/kvm/vcpu.c | 120 + 6 files changed, 172 insertions(+), 5 deletions(-) diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h index c53946f8ef9f..1d1eaa124349 100644 --- a/arch/loongarch/include/asm/kvm_host.h +++ b/arch/loongarch/include/asm/kvm_host.h @@ -30,6 +30,7 @@ #define KVM_PRIVATE_MEM_SLOTS 0 #define KVM_HALT_POLL_NS_DEFAULT 50 +#define KVM_REQ_RECORD_STEAL KVM_ARCH_REQ(1) #define KVM_GUESTDBG_VALID_MASK(KVM_GUESTDBG_ENABLE | \ KVM_GUESTDBG_USE_SW_BP | KVM_GUESTDBG_SINGLESTEP) @@ -199,6 +200,12 @@ struct kvm_vcpu_arch { struct kvm_mp_state mp_state; /* cpucfg */ u32 cpucfg[KVM_MAX_CPUCFG_REGS]; + /* paravirt steal time */ + struct { + u64 guest_addr; + u64 last_steal; + struct gfn_to_hva_cache cache; + } st; }; static inline unsigned long readl_sw_gcsr(struct loongarch_csrs *csr, int reg) diff --git a/arch/loongarch/include/asm/kvm_para.h b/arch/loongarch/include/asm/kvm_para.h index 56775554402a..5fb89e20432d 100644 --- a/arch/loongarch/include/asm/kvm_para.h +++ b/arch/loongarch/include/asm/kvm_para.h @@ -12,6 +12,7 @@ #define KVM_HCALL_CODE_SWDBG 1 #define 
KVM_HCALL_PV_SERVICE HYPERCALL_CODE(HYPERVISOR_KVM, KVM_HCALL_CODE_PV_SERVICE) #define KVM_HCALL_FUNC_PV_IPI 1 +#define KVM_HCALL_FUNC_NOTIFY 2 #define KVM_HCALL_SWDBGHYPERCALL_CODE(HYPERVISOR_KVM, KVM_HCALL_CODE_SWDBG) /* @@ -21,6 +22,15 @@ #define KVM_HCALL_INVALID_CODE -1UL #define KVM_HCALL_INVALID_PARAMETER-2UL +#define KVM_STEAL_PHYS_VALID BIT_ULL(0) +#define KVM_STEAL_PHYS_MASKGENMASK_ULL(63, 6) +struct kvm_steal_time { + __u64 steal; + __u32 version; + __u32 flags; + __u32 pad[12]; +}; + /* * Hypercall interface for KVM hypervisor * diff --git a/arch/loongarch/include/asm/loongarch.h b/arch/loongarch/include/asm/loongarch.h index 0ad36704cb4b..ab6a5e93c280 100644 --- a/arch/loongarch/include/asm/loongarch.h +++ b/arch/loongarch/include/asm/loongarch.h @@ -168,6 +168,7 @@ #define KVM_SIGNATURE "KVM\0" #define CPUCFG_KVM_FEATURE (CPUCFG_KVM_BASE + 4) #define KVM_FEATURE_PV_IPIBIT(1) +#define KVM_FEATURE_STEAL_TIMEBIT(2) #ifndef __ASSEMBLY__ diff --git a/arch/loongarch/include/uapi/asm/kvm.h b/arch/loongarch/include/uapi/asm/kvm.h index 8f78b23672ac..286b5ce93a57 100644 --- a/arch/loongarch/include/uapi/asm/kvm.h +++ b/arch/loongarch/include/uapi/asm/kvm.h @@ -80,7 +80,11 @@ struct kvm_fpu { #define LOONGARCH_REG_64(TYPE, REG)(TYPE | KVM_REG_SIZE_U64 | (REG << LOONGARCH_REG_SHIFT)) #define KVM_IOC_CSRID(REG) LOONGARCH_REG_64(KVM_REG_LOONGARCH_CSR, REG) #define KVM_IOC_CPUCFG(REG) LOONGARCH_REG_64(KVM_REG_LOONGARCH_CPUCFG, REG) + +/* Device Control API on vcpu fd */ #define KVM_LOONGARCH_VCPU_CPUCFG 0 +#define KVM_LOONGARCH_VCPU_PVTIME_CTRL 1 +#define KVM_LOONGARCH_VCPU_PVTIME_GPA 0 struct kvm_debug_exit_arch { }; diff --git a/arch/loongarch/kvm/exit.c b/arch/loongarch/kvm/exit.c index d71172e2568e..c774e5803f7f 100644 --- a/arch/loongarch/kvm/exit.c +++ b/arch/loongarch/kvm/exit.c @@ -209,7 +209,7 @@ int kvm_emu_idle(struct kvm_vcpu *vcpu) static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, larch_inst inst) { int rd, rj; - unsigned int index; + unsigned int 
index, ret; unsigned long plv; rd = inst.reg2_format.rd; @@ -240,10 +240,13 @@ static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, larch_inst inst) vcpu->arch.gprs[rd] = 0; break; case CPUCFG_KVM_FEATURE: - if ((plv & CSR_CRMD_PLV) == PLV_KERN) - vcpu->arch.gprs[rd] = KVM_FEATURE_PV_IPI; - else - vcpu->arch.gprs[rd] = 0; + ret = 0; +
[PATCH 2/2] LoongArch: Add steal time support in guest side
Percpu struct kvm_steal_time is added here, its size is 64 bytes and also defined as 64 bytes, so that the whole structure is in one physical page. When vcpu is onlined, function pv_register_steal_time() is called. This function will pass physical address of struct kvm_steal_time and tells hypervisor to enable steal time. When vcpu is offline, physical address is set as 0 and tells hypervisor to disable steal time. Signed-off-by: Bibo Mao --- arch/loongarch/include/asm/paravirt.h | 5 + arch/loongarch/kernel/paravirt.c | 130 ++ arch/loongarch/kernel/time.c | 2 + 3 files changed, 137 insertions(+) diff --git a/arch/loongarch/include/asm/paravirt.h b/arch/loongarch/include/asm/paravirt.h index 58f7b7b89f2c..fe27fb5e82b8 100644 --- a/arch/loongarch/include/asm/paravirt.h +++ b/arch/loongarch/include/asm/paravirt.h @@ -17,11 +17,16 @@ static inline u64 paravirt_steal_clock(int cpu) } int pv_ipi_init(void); +int __init pv_time_init(void); #else static inline int pv_ipi_init(void) { return 0; } +static inline int pv_time_init(void) +{ + return 0; +} #endif // CONFIG_PARAVIRT #endif diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c index 9044ed62045c..56182c64ab38 100644 --- a/arch/loongarch/kernel/paravirt.c +++ b/arch/loongarch/kernel/paravirt.c @@ -5,10 +5,13 @@ #include #include #include +#include #include struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; +static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); +static int has_steal_clock; static u64 native_steal_clock(int cpu) { @@ -17,6 +20,57 @@ static u64 native_steal_clock(int cpu) DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +static bool steal_acc = true; +static int __init parse_no_stealacc(char *arg) +{ + steal_acc = false; + return 0; +} +early_param("no-steal-acc", parse_no_stealacc); + +static u64 para_steal_clock(int cpu) +{ + u64 steal; + struct kvm_steal_time *src; + int version; + + src = _cpu(steal_time, cpu); 
+ do { + + version = src->version; + /* Make sure that the version is read before the steal */ + virt_rmb(); + steal = src->steal; + /* Make sure that the steal is read before the next version */ + virt_rmb(); + + } while ((version & 1) || (version != src->version)); + return steal; +} + +static int pv_register_steal_time(void) +{ + int cpu = smp_processor_id(); + struct kvm_steal_time *st; + unsigned long addr; + + if (!has_steal_clock) + return -EPERM; + + st = _cpu(steal_time, cpu); + addr = per_cpu_ptr_to_phys(st); + + /* The whole structure kvm_steal_time should be one page */ + if (PFN_DOWN(addr) != PFN_DOWN(addr + sizeof(*st))) { + pr_warn("Illegal PV steal time addr %lx\n", addr); + return -EFAULT; + } + + addr |= KVM_STEAL_PHYS_VALID; + kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, addr); + return 0; +} + #ifdef CONFIG_SMP static void pv_send_ipi_single(int cpu, unsigned int action) { @@ -110,6 +164,32 @@ static void pv_init_ipi(void) if (r < 0) panic("SWI0 IRQ request failed\n"); } + +static void pv_disable_steal_time(void) +{ + if (has_steal_clock) + kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, 0); +} + +static int pv_cpu_online(unsigned int cpu) +{ + unsigned long flags; + + local_irq_save(flags); + pv_register_steal_time(); + local_irq_restore(flags); + return 0; +} + +static int pv_cpu_down_prepare(unsigned int cpu) +{ + unsigned long flags; + + local_irq_save(flags); + pv_disable_steal_time(); + local_irq_restore(flags); + return 0; +} #endif static bool kvm_para_available(void) @@ -149,3 +229,53 @@ int __init pv_ipi_init(void) return 1; } + +static void pv_cpu_reboot(void *unused) +{ + pv_disable_steal_time(); +} + +static int pv_reboot_notify(struct notifier_block *nb, unsigned long code, + void *unused) +{ + on_each_cpu(pv_cpu_reboot, NULL, 1); + return NOTIFY_DONE; +} + +static struct notifier_block pv_reboot_nb = { + .notifier_call = pv_reboot_notify, +}; + +int __init pv_time_init(void) +{ + int feature; + + if 
(!cpu_has_hypervisor) + return 0; + if (!kvm_para_available()) + return 0; + + feature = read_cpucfg(CPUCFG_KVM_FEATURE); + if (!(feature & KVM_FEATURE_STEAL_TIME)) + return 0; + + has_steal_clock = 1; + if (pv_register_steal_time()) { + has_steal_clock = 0; + return 0; + } + + register_reboot_notifier(&pv_reboot_nb); + static_call_update(pv_steal_clock,
[PATCH 0/2] LoongArch: Add steal time support
Para-virt feature steal time is added in both kvm and guest kernel side. It is similar to other architectures: the steal time structure comes from guest memory, and a pseudo register is used to save/restore the base address of the steal time structure, so that VM migration is also supported. Bibo Mao (2): LoongArch: KVM: Add steal time support in kvm side LoongArch: Add steal time support in guest side arch/loongarch/include/asm/kvm_host.h | 7 ++ arch/loongarch/include/asm/kvm_para.h | 10 ++ arch/loongarch/include/asm/loongarch.h | 1 + arch/loongarch/include/asm/paravirt.h | 5 + arch/loongarch/include/uapi/asm/kvm.h | 4 + arch/loongarch/kernel/paravirt.c | 130 + arch/loongarch/kernel/time.c | 2 + arch/loongarch/kvm/exit.c | 35 ++- arch/loongarch/kvm/vcpu.c | 120 +++ 9 files changed, 309 insertions(+), 5 deletions(-) base-commit: 2ac2b1665d3fbec6ca709dd6ef3ea05f4a51ee4c -- 2.39.3
Re: [PATCH] virtio_ring: Fix the stale index in available ring
On Mon, Mar 25, 2024 at 05:34:29PM +1000, Gavin Shan wrote: > > On 3/20/24 17:14, Michael S. Tsirkin wrote: > > On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote: > > > On 3/20/24 10:49, Michael S. Tsirkin wrote:> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > > > index 6f7e5010a673..79456706d0bd 100644 > > > > --- a/drivers/virtio/virtio_ring.c > > > > +++ b/drivers/virtio/virtio_ring.c > > > > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct > > > > virtqueue *_vq, > > > > /* Put entry in available array (but don't update avail->idx > > > > until they > > > > * do sync). */ > > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1); > > > > - vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, > > > > head); > > > > + u16 headwithflag = head | (q->split.avail_idx_shadow & > > > > ~(vq->split.vring.num - 1)); > > > > + vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, > > > > headwithflag); > > > > /* Descriptors and available array need to be set before we > > > > expose the > > > > * new available array entries. */ > > > > > > Ok, Michael. I continued with my debugging code. It still looks like a > hardware bug on NVidia's grace-hopper. I really think NVidia needs to be > involved for the discussion, as suggested by you. Do you have a support contact at Nvidia to report this? > Firstly, I bind the vhost process and vCPU thread to CPU#71 and CPU#70. > Note that I have only one vCPU in my configuration. Interesting but is guest built with CONFIG_SMP set? > Secondly, the debugging code is enhanced so that the available head for > (last_avail_idx - 1) is read for twice and recorded. It means the available > head for one specific available index is read for twice. I do see the > available heads are different from the consecutive reads. More details > are shared as below. > > From the guest side > === > > virtio_net virtio0: output.0:id 86 is not a head! 
> head to be released: 047 062 112 > > avail_idx: > 000 49665 > 001 49666 <-- > : > 015 49664 what are these #s 49665 and so on? and how large is the ring? I am guessing 49664 is the index ring size is 16 and 49664 % 16 == 0 > avail_head: is this the avail ring contents? > 000 062 > 001 047 <-- > : > 015 112 What are these arrows pointing at, btw? > From the host side > == > > avail_idx > 000 49663 > 001 49666 <--- > : > > avail_head > 000 062 (062) > 001 047 (047) <--- > : > 015 086 (112) // head 086 is returned from the first read, > // but head 112 is returned from the second read > > vhost_get_vq_desc: Inconsistent head in two read (86 -> 112) for avail_idx > 49664 > > Thanks, > Gavin OK thanks so this proves it is actually the avail ring value. -- MST
Re: [PATCH net-next v3 1/2] net: port TP_STORE_ADDR_PORTS_SKB macro to be tcp/udp independent
On Mon, Mar 25, 2024 at 6:29 PM Balazs Scheidler wrote: > > This patch moves TP_STORE_ADDR_PORTS_SKB() to a common header and removes > the TCP specific implementation details. > > Previously the macro assumed the skb passed as an argument is a > TCP packet, the implementation now uses an argument to the L4 header and > uses that to extract the source/destination ports, which happen > to be named the same in "struct tcphdr" and "struct udphdr" > > Signed-off-by: Balazs Scheidler The patch itself looks good to me, feel free to add: Reviewed-by: Jason Xing
Re: [PATCH net-next v3 2/2] net: udp: add IP/port data to the tracepoint udp/udp_fail_queue_rcv_skb
On Tue, Mar 26, 2024 at 10:28 AM Jakub Kicinski wrote: > > On Mon, 25 Mar 2024 11:29:18 +0100 Balazs Scheidler wrote: > > +memset(__entry->saddr, 0, sizeof(struct sockaddr_in6)); > > +memset(__entry->daddr, 0, sizeof(struct sockaddr_in6)); > > Indent with tabs please, checkpatch says: > > ERROR: code indent should use tabs where possible > #59: FILE: include/trace/events/udp.h:38: > +memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));$ > > WARNING: please, no spaces at the start of a line > #59: FILE: include/trace/events/udp.h:38: > +memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));$ > > ERROR: code indent should use tabs where possible > #60: FILE: include/trace/events/udp.h:39: > +memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));$ > > WARNING: please, no spaces at the start of a line > #60: FILE: include/trace/events/udp.h:39: > +memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));$ More than this, it would be better to put "From Balazs Scheidler " in the first line of each patch to eliminate the mismatched email address warning. Link (Jakub referred to): https://patchwork.kernel.org/project/netdevbpf/patch/34a9c221a6d644f18c826a1beddba58af6b7a64c.1711361723.git.balazs.scheid...@axoflow.com/ Detailed info: https://netdev.bots.linux.dev/static/nipa/837832/13601927/checkpatch/stdout > -- > pw-bot: cr
Re: [External] Re: [PATCH v4 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
On Mon, Mar 25, 2024 at 8:08 PM Huang, Ying wrote: > > "Ho-Ren (Jack) Chuang" writes: > > > On Fri, Mar 22, 2024 at 1:41 AM Huang, Ying wrote: > >> > >> "Ho-Ren (Jack) Chuang" writes: > >> > >> > The current implementation treats emulated memory devices, such as > >> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal > >> > memory > >> > (E820_TYPE_RAM). However, these emulated devices have different > >> > characteristics than traditional DRAM, making it important to > >> > distinguish them. Thus, we modify the tiered memory initialization > >> > process > >> > to introduce a delay specifically for CPUless NUMA nodes. This delay > >> > ensures that the memory tier initialization for these nodes is deferred > >> > until HMAT information is obtained during the boot process. Finally, > >> > demotion tables are recalculated at the end. > >> > > >> > * late_initcall(memory_tier_late_init); > >> > Some device drivers may have initialized memory tiers between > >> > `memory_tier_init()` and `memory_tier_late_init()`, potentially bringing > >> > online memory nodes and configuring memory tiers. They should be excluded > >> > in the late init. > >> > > >> > * Handle cases where there is no HMAT when creating memory tiers > >> > There is a scenario where a CPUless node does not provide HMAT > >> > information. > >> > If no HMAT is specified, it falls back to using the default DRAM tier. > >> > > >> > * Introduce another new lock `default_dram_perf_lock` for adist > >> > calculation > >> > In the current implementation, iterating through CPUlist nodes requires > >> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end > >> > up > >> > trying to acquire the same lock, leading to a potential deadlock. > >> > Therefore, we propose introducing a standalone `default_dram_perf_lock` > >> > to > >> > protect `default_dram_perf_*`. This approach not only avoids deadlock > >> > but also prevents holding a large lock simultaneously. 
> >> > > >> > * Upgrade `set_node_memory_tier` to support additional cases, including > >> > default DRAM, late CPUless, and hot-plugged initializations. > >> > To cover hot-plugged memory nodes, `mt_calc_adistance()` and > >> > `mt_find_alloc_memory_type()` are moved into `set_node_memory_tier()` to > >> > handle cases where memtype is not initialized and where HMAT information > >> > is > >> > available. > >> > > >> > * Introduce `default_memory_types` for those memory types that are not > >> > initialized by device drivers. > >> > Because late initialized memory and default DRAM memory need to be > >> > managed, > >> > a default memory type is created for storing all memory types that are > >> > not initialized by device drivers and as a fallback. > >> > > >> > Signed-off-by: Ho-Ren (Jack) Chuang > >> > Signed-off-by: Hao Xiang > >> > --- > >> > mm/memory-tiers.c | 73 --- > >> > 1 file changed, 63 insertions(+), 10 deletions(-) > >> > > >> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c > >> > index 974af10cfdd8..9396330fa162 100644 > >> > --- a/mm/memory-tiers.c > >> > +++ b/mm/memory-tiers.c > >> > @@ -36,6 +36,11 @@ struct node_memory_type_map { > >> > > >> > static DEFINE_MUTEX(memory_tier_lock); > >> > static LIST_HEAD(memory_tiers); > >> > +/* > >> > + * The list is used to store all memory types that are not created > >> > + * by a device driver. > >> > + */ > >> > +static LIST_HEAD(default_memory_types); > >> > static struct node_memory_type_map node_memory_types[MAX_NUMNODES]; > >> > struct memory_dev_type *default_dram_type; > >> > > >> > @@ -108,6 +113,7 @@ static struct demotion_nodes *node_demotion > >> > __read_mostly; > >> > > >> > static BLOCKING_NOTIFIER_HEAD(mt_adistance_algorithms); > >> > > >> > +static DEFINE_MUTEX(default_dram_perf_lock); > >> > >> Better to add comments about what is protected by this lock. > >> > > > > Thank you. I will add a comment like this: > > + /* The lock is used to protect `default_dram_perf*` info and nid. 
*/ > > +static DEFINE_MUTEX(default_dram_perf_lock); > > > > I also found an error path was not handled and > > found the lock could be put closer to what it protects. > > I will have them fixed in V5. > > > >> > static bool default_dram_perf_error; > >> > static struct access_coordinate default_dram_perf; > >> > static int default_dram_perf_ref_nid = NUMA_NO_NODE; > >> > @@ -505,7 +511,8 @@ static inline void __init_node_memory_type(int node, > >> > struct memory_dev_type *mem > >> > static struct memory_tier *set_node_memory_tier(int node) > >> > { > >> > struct memory_tier *memtier; > >> > - struct memory_dev_type *memtype; > >> > >> mtype may be referenced without initialization now below. > >> > > > > Good catch! Thank you. > > > > Please check below. > > I may have found a potential NULL pointer dereference. > > > >> > + int adist = MEMTIER_ADISTANCE_DRAM; > >> >