Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect
On Wed, Apr 17, 2024 at 4:51 PM Jason Xing wrote: > > From: Jason Xing > > In production, there are so many cases about why the RST skb is sent but > we don't have a very convenient/fast method to detect the exact underlying > reasons. > > RST is implemented in two kinds: passive kind (like tcp_v4_send_reset()) > and active kind (like tcp_send_active_reset()). The former can be traced > carefully 1) in TCP, with the help of drop reasons, which is based on > Eric's idea[1], 2) in MPTCP, with the help of reset options defined in > RFC 8684. The latter is relatively independent, which should be > implemented on our own. > > In this series, I focus on the fundamental implement mostly about how > the rstreason mechnism works and give the detailed passive part as an > example, not including the active reset part. In future, we can go > further and refine those NOT_SPECIFIED reasons. > > Here are some examples when tracing: > -0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x > skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED > -0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x > skaddr=x src=x dest=x state=x reason=NO_SOCKET > > [1] > Link: > https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/ > > v6 > 1. add back casts, or else they are treated as error. > > v5 > Link: > https://lore.kernel.org/all/2024045630.38420-1-kerneljasonx...@gmail.com/ > 1. address format issue (like reverse xmas tree) (Eric, Paolo) > 2. remove unnecessary casts. (Eric) > 3. introduce a helper used in mptcp active reset. See patch 6. (Paolo) > > v4 > Link: > https://lore.kernel.org/all/20240409100934.37725-1-kerneljasonx...@gmail.com/ > 1. passing 'enum sk_rst_reason' for readability when tracing (Antoine) > > v3 > Link: > https://lore.kernel.org/all/20240404072047.11490-1-kerneljasonx...@gmail.com/ > 1. rebase (mptcp part) and address what Mat suggested. > > v2 > Link: https://lore.kernel.org/all/20240403185033.47ebc...@kernel.org/ > 1. rebase against the latest net-next tree > > > > Jason Xing (7): > net: introduce rstreason to detect why the RST is sent > rstreason: prepare for passive reset > rstreason: prepare for active reset > tcp: support rstreason for passive reset > mptcp: support rstreason for passive reset > mptcp: introducing a helper into active reset logic > rstreason: make it work in trace world > > include/net/request_sock.h | 4 +- > include/net/rstreason.h| 93 ++ > include/net/tcp.h | 3 +- > include/trace/events/tcp.h | 37 +-- > net/dccp/ipv4.c| 10 ++-- > net/dccp/ipv6.c| 10 ++-- > net/dccp/minisocks.c | 3 +- > net/ipv4/tcp.c | 15 -- > net/ipv4/tcp_ipv4.c| 14 +++--- > net/ipv4/tcp_minisocks.c | 3 +- > net/ipv4/tcp_output.c | 5 +- > net/ipv4/tcp_timer.c | 9 ++-- > net/ipv6/tcp_ipv6.c| 17 --- > net/mptcp/protocol.c | 2 +- > net/mptcp/protocol.h | 11 + > net/mptcp/subflow.c| 27 --- > 16 files changed, 216 insertions(+), 47 deletions(-) > create mode 100644 include/net/rstreason.h > > -- > 2.37.3 > Hello maintainers, I'm not sure why the patch series has been changed to 'Changes Requested', until now I don't think I need to change something. Should I repost this series (keeping the v6 tag) and then wait for more comments? Thanks, Jason
Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection
On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin wrote: > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote: > > Add the function vduse_alloc_reconnnect_info_mem > > and vduse_alloc_reconnnect_info_mem > > These functions allow vduse to allocate and free memory for reconnection > > information. The amount of memory allocated is vq_num pages. > > Each VQS will map its own page where the reconnection information will be > > saved > > > > Signed-off-by: Cindy Lu > > --- > > drivers/vdpa/vdpa_user/vduse_dev.c | 40 ++ > > 1 file changed, 40 insertions(+) > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > > b/drivers/vdpa/vdpa_user/vduse_dev.c > > index ef3c9681941e..2da659d5f4a8 100644 > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > > @@ -65,6 +65,7 @@ struct vduse_virtqueue { > > int irq_effective_cpu; > > struct cpumask irq_affinity; > > struct kobject kobj; > > + unsigned long vdpa_reconnect_vaddr; > > }; > > > > struct vduse_dev; > > @@ -1105,6 +1106,38 @@ static void vduse_vq_update_effective_cpu(struct > > vduse_virtqueue *vq) > > > > vq->irq_effective_cpu = curr_cpu; > > } > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev) > > +{ > > + unsigned long vaddr = 0; > > + struct vduse_virtqueue *vq; > > + > > + for (int i = 0; i < dev->vq_num; i++) { > > + /*page 0~ vq_num save the reconnect info for vq*/ > > + vq = dev->vqs[i]; > > + vaddr = get_zeroed_page(GFP_KERNEL); > > > I don't get why you insist on stealing kernel memory for something > that is just used by userspace to store data for its own use. > Userspace does not lack ways to persist data, for example, > create a regular file anywhere in the filesystem. Good point. So the motivation here is to: 1) be self contained, no dependency for high speed persist data storage like tmpfs 2) standardize the format in uAPI which allows reconnection from arbitrary userspace, unfortunately, such effort was removed in new versions If the above doesn't make sense, we don't need to offer those pages by VDUSE. Thanks > > > > > + if (vaddr == 0) > > + return -ENOMEM; > > + > > + vq->vdpa_reconnect_vaddr = vaddr; > > + } > > + > > + return 0; > > +} > > + > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev) > > +{ > > + struct vduse_virtqueue *vq; > > + > > + for (int i = 0; i < dev->vq_num; i++) { > > + vq = dev->vqs[i]; > > + > > + if (vq->vdpa_reconnect_vaddr) > > + free_page(vq->vdpa_reconnect_vaddr); > > + vq->vdpa_reconnect_vaddr = 0; > > + } > > + > > + return 0; > > +} > > > > static long vduse_dev_ioctl(struct file *file, unsigned int cmd, > > unsigned long arg) > > @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name) > > mutex_unlock(>lock); > > return -EBUSY; > > } > > + vduse_free_reconnnect_info_mem(dev); > > + > > dev->connected = true; > > mutex_unlock(>lock); > > > > @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct vduse_dev_config > > *config, > > ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num); > > if (ret) > > goto err_vqs; > > + ret = vduse_alloc_reconnnect_info_mem(dev); > > + if (ret < 0) > > + goto err_mem; > > > > __module_get(THIS_MODULE); > > > > return 0; > > err_vqs: > > device_destroy(_class, MKDEV(MAJOR(vduse_major), dev->minor)); > > +err_mem: > > + vduse_free_reconnnect_info_mem(dev); > > err_dev: > > idr_remove(_idr, dev->minor); > > err_idr: > > -- > > 2.43.0 >
Re: [PATCH net-next v9] virtio_net: Support RX hash XDP hint
On Wed, Apr 17, 2024 at 3:20 PM Liang Chen wrote: > > The RSS hash report is a feature that's part of the virtio specification. > Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost > (still a work in progress as per [1]) support this feature. While the > capability to obtain the RSS hash has been enabled in the normal path, > it's currently missing in the XDP path. Therefore, we are introducing > XDP hints through kfuncs to allow XDP programs to access the RSS hash. > > 1. > https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r > > Signed-off-by: Liang Chen > --- Acked-by: Jason Wang Thanks
Re: [PATCH v12 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup
On 16/04/2024 3:20 pm, Haitao Huang wrote: From: Kristen Carlson Accardi Currently in the EPC page allocation, the kernel simply fails the allocation when the current EPC cgroup fails to charge due to its usage reaching limit. This is not ideal. When that happens, a better way is to reclaim EPC page(s) from the current EPC cgroup (and/or its descendants) to reduce its usage so the new allocation can succeed. Add the basic building blocks to support per-cgroup reclamation. Currently the kernel only has one place to reclaim EPC pages: the global EPC LRU list. To support the "per-cgroup" EPC reclaim, maintain an LRU list for each EPC cgroup, and introduce a "cgroup" variant function to reclaim EPC pages from a given EPC cgroup and its descendants. Currently the kernel does the global EPC reclaim in sgx_reclaim_page(). It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16) pages. Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN pages from the global LRU, and then tries to reclaim these pages at once for better performance. Implement the "cgroup" variant EPC reclaim in a similar way, but keep the implementation simple: 1) change sgx_reclaim_pages() to take an LRU as input, and return the pages that are "scanned" and attempted for reclamation (but not necessarily reclaimed successfully); 2) loop the given EPC cgroup and its descendants and do the new sgx_reclaim_pages() until SGX_NR_TO_SCAN pages are "scanned". This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC cgroup, and only moves to its descendants when there's no enough reclaimable EPC pages to "scan" in its LRU. It should be enough for most cases. Note, this simple implementation doesn't _exactly_ mimic the current global EPC reclaim (which always tries to do the actual reclaim in batch of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN reclaimable pages, the actual reclaim of EPC pages will be split into smaller batches _across_ multiple LRUs with each being smaller than SGX_NR_TO_SCAN pages. A more precise way to mimic the current global EPC reclaim would be to have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages _across_ the given EPC cgroup _AND_ its descendants, and then do the actual reclaim in one batch. But this is unnecessarily complicated at this stage. Alternatively, the current sgx_reclaim_pages() could be changed to return the actual "reclaimed" pages, but not "scanned" pages. However, the reclamation is a lengthy process, forcing a successful reclamation of predetermined number of pages may block the caller for too long. And that may not be acceptable in some synchronous contexts, e.g., in serving an ioctl(). With this building block in place, add synchronous reclamation support in sgx_cgroup_try_charge(): trigger a call to sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the caller allows synchronous reclaim as indicated by s newly added parameter. A later patch will add support for asynchronous reclamation reusing sgx_cgroup_reclaim_pages(). Note all reclaimable EPC pages are still tracked in the global LRU thus no per-cgroup reclamation is actually active at the moment. Per-cgroup tracking and reclamation will be turned on in the end after all necessary infrastructure is in place. Nit: "all necessary infrastructures are in place", or, "all necessary building blocks are in place". ? Co-developed-by: Sean Christopherson Signed-off-by: Sean Christopherson Signed-off-by: Kristen Carlson Accardi Co-developed-by: Haitao Huang Signed-off-by: Haitao Huang Tested-by: Jarkko Sakkinen --- Reviewed-by: Kai Huang More nitpickings below: [...] -static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg) +static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum sgx_reclaim reclaim) Let's still wrap the text on 80-character basis. I guess most people are more used to that. [...] - epc_page = list_first_entry_or_null(_global_lru.reclaimable, - struct sgx_epc_page, list); + epc_page = list_first_entry_or_null(>reclaimable, struct sgx_epc_page, list); Ditto.
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Tue, Apr 16, 2024 at 12:23 AM Mike Rapoport wrote: > > On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote: > > On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote: > > > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote: > > > > +/** > > > > + * enum execmem_type - types of executable memory ranges > > > > + * > > > > + * There are several subsystems that allocate executable memory. > > > > + * Architectures define different restrictions on placement, > > > > + * permissions, alignment and other parameters for memory that can be > > > > used > > > > + * by these subsystems. > > > > + * Types in this enum identify subsystems that allocate executable > > > > memory > > > > + * and let architectures define parameters for ranges suitable for > > > > + * allocations by each subsystem. > > > > + * > > > > + * @EXECMEM_DEFAULT: default parameters that would be used for types > > > > that > > > > + * are not explcitly defined. > > > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections > > > > + * @EXECMEM_KPROBES: parameters for kprobes > > > > + * @EXECMEM_FTRACE: parameters for ftrace > > > > + * @EXECMEM_BPF: parameters for BPF > > > > + * @EXECMEM_TYPE_MAX: > > > > + */ > > > > +enum execmem_type { > > > > + EXECMEM_DEFAULT, > > > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT, > > > > + EXECMEM_KPROBES, > > > > + EXECMEM_FTRACE, > > > > + EXECMEM_BPF, > > > > + EXECMEM_TYPE_MAX, > > > > +}; > > > > > > Can we please get a break-down of how all these types are actually > > > different from one another? > > > > > > I'm thinking some platforms have a tiny immediate space (arm64 comes to > > > mind) and has less strict placement constraints for some of them? > > > > Yeah, and really I'd *much* rather deal with that in arch code, as I have > > said > > several times. > > > > For arm64 we have two bsaic restrictions: > > > > 1) Direct branches can go +/-128M > >We can expand this range by having direct branches go to PLTs, at a > >performance cost. > > > > 2) PREL32 relocations can go +/-2G > >We cannot expand this further. > > > > * We don't need to allocate memory for ftrace. We do not use trampolines. > > > > * Kprobes XOL areas don't care about either of those; we don't place any > > PC-relative instructions in those. Maybe we want to in future. > > > > * Modules care about both; we'd *prefer* to place them within +/-128M of all > > other kernel/module code, but if there's no space we can use PLTs and > > expand > > that to +/-2G. Since modules can refreence other modules, that ends up > > actually being halved, and modules have to fit within some 2G window that > > also covers the kernel. Is +/- 2G enough for all realistic use cases? If so, I guess we don't really need EXECMEM_ANYWHERE below? > > > > * I'm not sure about BPF's requirements; it seems happy doing the same as > > modules. > > BPF are happy with vmalloc(). > > > So if we *must* use a common execmem allocator, what we'd reall want is our > > own > > types, e.g. > > > > EXECMEM_ANYWHERE > > EXECMEM_NOPLT > > EXECMEM_PREL32 > > > > ... and then we use those in arch code to implement module_alloc() and > > friends. > > I'm looking at execmem_types more as definition of the consumers, maybe I > should have named the enum execmem_consumer at the first place. I think looking at execmem_type from consumers' point of view adds unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe, and bpf (and maybe also module text) all have the same requirements. Did I miss something? IOW, we have enum execmem_type { EXECMEM_DEFAULT, EXECMEM_TEXT, EXECMEM_KPROBES = EXECMEM_TEXT, EXECMEM_FTRACE = EXECMEM_TEXT, EXECMEM_BPF = EXECMEM_TEXT, /* we may end up without _KPROBE, _FTRACE, _BPF */ EXECMEM_DATA, /* rw */ EXECMEM_RO_DATA, EXECMEM_RO_AFTER_INIT, EXECMEM_TYPE_MAX, }; Does this make sense? Thanks, Song
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Wed Apr 17, 2024 at 1:04 AM EEST, Haitao Huang wrote: > On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen > wrote: > > > On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote: > >> I did declare the configs in the config file but I missed it in my patch > >> as stated earlier. IIUC, that would not cause this error though. > >> > >> Maybe I should exit with the skip code if no CGROUP_MISC (no more > >> CGROUP_SGX_EPC) is configured? > > > > OK, so I wanted to do a distro kernel test here, and used the default > > OpenSUSE kernel config. I need to check if it has CGROUP_MISC set. > > > >> tools/testing/selftests$ find . -name README > >> ./futex/README > >> ./tc-testing/README > >> ./net/forwarding/README > >> ./powerpc/nx-gzip/README > >> ./ftrace/README > >> ./arm64/signal/README > >> ./arm64/fp/README > >> ./arm64/README > >> ./zram/README > >> ./livepatch/README > >> ./resctrl/README > > > > So is there a README because of override timeout parameter? Maybe it > > should be just set to a high enough value? > > > > BR, Jarkko > > > > > From the docs, I think we are supposed to use override. > See: https://docs.kernel.org/dev-tools/kselftest.html#timeout-for-selftests OK, fair enough :-) I did not know this. BR, Jarkko
Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES
Hi Mike, On Thu, 11 Apr 2024 19:00:50 +0300 Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" > > kprobes depended on CONFIG_MODULES because it has to allocate memory for > code. > > Since code allocations are now implemented with execmem, kprobes can be > enabled in non-modular kernels. > > Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside > modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the > dependency of CONFIG_KPROBES on CONFIG_MODULES. Thanks for this work, but this conflicts with the latest fix in v6.9-rc4. Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in function body? We have enough dummy functions for that, so it should not make a problem. Thank you, > > Signed-off-by: Mike Rapoport (IBM) > --- > arch/Kconfig| 2 +- > kernel/kprobes.c| 43 + > kernel/trace/trace_kprobe.c | 11 ++ > 3 files changed, 37 insertions(+), 19 deletions(-) > > diff --git a/arch/Kconfig b/arch/Kconfig > index bc9e8e5dccd5..68177adf61a0 100644 > --- a/arch/Kconfig > +++ b/arch/Kconfig > @@ -52,9 +52,9 @@ config GENERIC_ENTRY > > config KPROBES > bool "Kprobes" > - depends on MODULES > depends on HAVE_KPROBES > select KALLSYMS > + select EXECMEM > select TASKS_RCU if PREEMPTION > help > Kprobes allows you to trap at almost any kernel address and > diff --git a/kernel/kprobes.c b/kernel/kprobes.c > index 047ca629ce49..90c056853e6f 100644 > --- a/kernel/kprobes.c > +++ b/kernel/kprobes.c > @@ -1580,6 +1580,7 @@ static int check_kprobe_address_safe(struct kprobe *p, > goto out; > } > > +#ifdef CONFIG_MODULES > /* Check if 'p' is probing a module. */ > *probed_mod = __module_text_address((unsigned long) p->addr); > if (*probed_mod) { > @@ -1603,6 +1604,8 @@ static int check_kprobe_address_safe(struct kprobe *p, > ret = -ENOENT; > } > } > +#endif > + > out: > preempt_enable(); > jump_label_unlock(); > @@ -2482,24 +2485,6 @@ int kprobe_add_area_blacklist(unsigned long start, > unsigned long end) > return 0; > } > > -/* Remove all symbols in given area from kprobe blacklist */ > -static void kprobe_remove_area_blacklist(unsigned long start, unsigned long > end) > -{ > - struct kprobe_blacklist_entry *ent, *n; > - > - list_for_each_entry_safe(ent, n, _blacklist, list) { > - if (ent->start_addr < start || ent->start_addr >= end) > - continue; > - list_del(>list); > - kfree(ent); > - } > -} > - > -static void kprobe_remove_ksym_blacklist(unsigned long entry) > -{ > - kprobe_remove_area_blacklist(entry, entry + 1); > -} > - > int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long > *value, > char *type, char *sym) > { > @@ -2564,6 +2549,25 @@ static int __init populate_kprobe_blacklist(unsigned > long *start, > return ret ? : arch_populate_kprobe_blacklist(); > } > > +#ifdef CONFIG_MODULES > +/* Remove all symbols in given area from kprobe blacklist */ > +static void kprobe_remove_area_blacklist(unsigned long start, unsigned long > end) > +{ > + struct kprobe_blacklist_entry *ent, *n; > + > + list_for_each_entry_safe(ent, n, _blacklist, list) { > + if (ent->start_addr < start || ent->start_addr >= end) > + continue; > + list_del(>list); > + kfree(ent); > + } > +} > + > +static void kprobe_remove_ksym_blacklist(unsigned long entry) > +{ > + kprobe_remove_area_blacklist(entry, entry + 1); > +} > + > static void add_module_kprobe_blacklist(struct module *mod) > { > unsigned long start, end; > @@ -2665,6 +2669,7 @@ static struct notifier_block kprobe_module_nb = { > .notifier_call = kprobes_module_callback, > .priority = 0 > }; > +#endif > > void kprobe_free_init_mem(void) > { > @@ -2724,8 +2729,10 @@ static int __init init_kprobes(void) > err = arch_init_kprobes(); > if (!err) > err = register_die_notifier(_exceptions_nb); > +#ifdef CONFIG_MODULES > if (!err) > err = register_module_notifier(_module_nb); > +#endif > > kprobes_initialized = (err == 0); > kprobe_sysctls_init(); > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c > index 14099cc17fc9..f0610137d6a3 100644 > --- a/kernel/trace/trace_kprobe.c > +++ b/kernel/trace/trace_kprobe.c > @@ -111,6 +111,7 @@ static nokprobe_inline bool > trace_kprobe_within_module(struct trace_kprobe *tk, > return strncmp(module_name(mod), name, len) == 0 && name[len] == ':'; > } > > +#ifdef CONFIG_MODULES > static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe > *tk) > { > char *p; > @@ -129,6 +130,12 @@ static nokprobe_inline bool >
Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()
On Thu, 11 Apr 2024 19:00:41 +0300 Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" > > module_alloc() is used everywhere as a mean to allocate memory for code. > > Beside being semantically wrong, this unnecessarily ties all subsystems > that need to allocate code, such as ftrace, kprobes and BPF to modules and > puts the burden of code allocation to the modules code. > > Several architectures override module_alloc() because of various > constraints where the executable memory can be located and this causes > additional obstacles for improvements of code allocation. > > Start splitting code allocation from modules by introducing execmem_alloc() > and execmem_free() APIs. > > Initially, execmem_alloc() is a wrapper for module_alloc() and > execmem_free() is a replacement of module_memfree() to allow updating all > call sites to use the new APIs. > > Since architectures define different restrictions on placement, > permissions, alignment and other parameters for memory that can be used by > different subsystems that allocate executable memory, execmem_alloc() takes > a type argument, that will be used to identify the calling subsystem and to > allow architectures define parameters for ranges suitable for that > subsystem. > This looks good to me for the kprobe part. Acked-by: Masami Hiramatsu (Google) Thank you, > Signed-off-by: Mike Rapoport (IBM) > --- > arch/powerpc/kernel/kprobes.c| 6 ++-- > arch/s390/kernel/ftrace.c| 4 +-- > arch/s390/kernel/kprobes.c | 4 +-- > arch/s390/kernel/module.c| 5 +-- > arch/sparc/net/bpf_jit_comp_32.c | 8 ++--- > arch/x86/kernel/ftrace.c | 6 ++-- > arch/x86/kernel/kprobes/core.c | 4 +-- > include/linux/execmem.h | 57 > include/linux/moduleloader.h | 3 -- > kernel/bpf/core.c| 6 ++-- > kernel/kprobes.c | 8 ++--- > kernel/module/Kconfig| 1 + > kernel/module/main.c | 25 +- > mm/Kconfig | 3 ++ > mm/Makefile | 1 + > mm/execmem.c | 26 +++ > 16 files changed, 122 insertions(+), 45 deletions(-) > create mode 100644 include/linux/execmem.h > create mode 100644 mm/execmem.c > > diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c > index bbca90a5e2ec..9fcd01bb2ce6 100644 > --- a/arch/powerpc/kernel/kprobes.c > +++ b/arch/powerpc/kernel/kprobes.c > @@ -19,8 +19,8 @@ > #include > #include > #include > -#include > #include > +#include > #include > #include > #include > @@ -130,7 +130,7 @@ void *alloc_insn_page(void) > { > void *page; > > - page = module_alloc(PAGE_SIZE); > + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); > if (!page) > return NULL; > > @@ -142,7 +142,7 @@ void *alloc_insn_page(void) > } > return page; > error: > - module_memfree(page); > + execmem_free(page); > return NULL; > } > > diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c > index c46381ea04ec..798249ef5646 100644 > --- a/arch/s390/kernel/ftrace.c > +++ b/arch/s390/kernel/ftrace.c > @@ -7,13 +7,13 @@ > * Author(s): Martin Schwidefsky > */ > > -#include > #include > #include > #include > #include > #include > #include > +#include > #include > #include > #include > @@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void) > { > const char *start, *end; > > - ftrace_plt = module_alloc(PAGE_SIZE); > + ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE); > if (!ftrace_plt) > panic("cannot allocate ftrace plt\n"); > > diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c > index f0cf20d4b3c5..3c1b1be744de 100644 > --- a/arch/s390/kernel/kprobes.c > +++ b/arch/s390/kernel/kprobes.c > @@ -9,7 +9,6 @@ > > #define pr_fmt(fmt) "kprobes: " fmt > > -#include > #include > #include > #include > @@ -21,6 +20,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -38,7 +38,7 @@ void *alloc_insn_page(void) > { > void *page; > > - page = module_alloc(PAGE_SIZE); > + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); > if (!page) > return NULL; > set_memory_rox((unsigned long)page, 1); > diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c > index 42215f9404af..ac97a905e8cd 100644 > --- a/arch/s390/kernel/module.c > +++ b/arch/s390/kernel/module.c > @@ -21,6 +21,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -76,7 +77,7 @@ void *module_alloc(unsigned long size) > #ifdef CONFIG_FUNCTION_TRACER > void module_arch_cleanup(struct module *mod) > { > - module_memfree(mod->arch.trampolines_start); > + execmem_free(mod->arch.trampolines_start); > } > #endif > > @@ -510,7 +511,7 @@
Re: [PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads
Sorry, this is actually v3. (miss-configured...) Thanks, On Thu, 18 Apr 2024 05:46:34 +0900 "Masami Hiramatsu (Google)" wrote: > From: Masami Hiramatsu (Google) > > Currently, kprobe event checks whether the target symbol name is unique > or not, so that it does not put a probe on an unexpected place. But this > skips the check if the target is on a module because the module may not > be loaded. > > To fix this issue, this patch checks the number of probe target symbols > in a target module when the module is loaded. If the probe is not on the > unique name symbols in the module, it will be rejected at that point. > > Note that the symbol which has a unique name in the target module, > it will be accepted even if there are same-name symbols in the > kernel or other modules, > > Signed-off-by: Masami Hiramatsu (Google) > --- > Changes in v3: > - Update the patch description. > Updated from last October post, which was dropped by test failure: > > https://lore.kernel.org/linux-trace-kernel/169854904604.132316.12500381416261460174.stgit@devnote2/ > Changes in v2: > - Fix to skip checking uniqueness if the target module is not loaded. > - Fix register_module_trace_kprobe() to pass correct symbol name. > - Fix to call __register_trace_kprobe() from module callback. > --- > kernel/trace/trace_kprobe.c | 125 > --- > 1 file changed, 81 insertions(+), 44 deletions(-) > > diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c > index c68d4e830fbe..0113afe2662d 100644 > --- a/kernel/trace/trace_kprobe.c > +++ b/kernel/trace/trace_kprobe.c > @@ -670,6 +670,21 @@ static int register_trace_kprobe(struct trace_kprobe *tk) > return ret; > } > > +static int validate_module_probe_symbol(const char *modname, const char > *symbol); > + > +static int register_module_trace_kprobe(struct module *mod, struct > trace_kprobe *tk) > +{ > + const char *p; > + int ret = 0; > + > + p = strchr(trace_kprobe_symbol(tk), ':'); > + if (p) > + ret = validate_module_probe_symbol(module_name(mod), p + 1); > + if (!ret) > + ret = __register_trace_kprobe(tk); > + return ret; > +} > + > /* Module notifier call back, checking event on the module */ > static int trace_kprobe_module_callback(struct notifier_block *nb, > unsigned long val, void *data) > @@ -688,7 +703,7 @@ static int trace_kprobe_module_callback(struct > notifier_block *nb, > if (trace_kprobe_within_module(tk, mod)) { > /* Don't need to check busy - this should have gone. */ > __unregister_trace_kprobe(tk); > - ret = __register_trace_kprobe(tk); > + ret = register_module_trace_kprobe(mod, tk); > if (ret) > pr_warn("Failed to re-register probe %s on %s: > %d\n", > trace_probe_name(>tp), > @@ -729,17 +744,68 @@ static int count_mod_symbols(void *data, const char > *name, unsigned long unused) > return 0; > } > > -static unsigned int number_of_same_symbols(char *func_name) > +static unsigned int number_of_same_symbols(const char *mod, const char > *func_name) > { > struct sym_count_ctx ctx = { .count = 0, .name = func_name }; > > - kallsyms_on_each_match_symbol(count_symbols, func_name, ); > + if (!mod) > + kallsyms_on_each_match_symbol(count_symbols, func_name, > ); > > - module_kallsyms_on_each_symbol(NULL, count_mod_symbols, ); > + module_kallsyms_on_each_symbol(mod, count_mod_symbols, ); > > return ctx.count; > } > > +static int validate_module_probe_symbol(const char *modname, const char > *symbol) > +{ > + unsigned int count = number_of_same_symbols(modname, symbol); > + > + if (count > 1) { > + /* > + * Users should use ADDR to remove the ambiguity of > + * using KSYM only. > + */ > + return -EADDRNOTAVAIL; > + } else if (count == 0) { > + /* > + * We can return ENOENT earlier than when register the > + * kprobe. > + */ > + return -ENOENT; > + } > + return 0; > +} > + > +static int validate_probe_symbol(char *symbol) > +{ > + struct module *mod = NULL; > + char *modname = NULL, *p; > + int ret = 0; > + > + p = strchr(symbol, ':'); > + if (p) { > + modname = symbol; > + symbol = p + 1; > + *p = '\0'; > + /* Return 0 (defer) if the module does not exist yet. */ > + rcu_read_lock_sched(); > + mod = find_module(modname); > + if (mod && !try_module_get(mod)) > + mod = NULL; > + rcu_read_unlock_sched(); > + if (!mod) > + goto out; > + } > + >
[PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads
From: Masami Hiramatsu (Google) Currently, kprobe event checks whether the target symbol name is unique or not, so that it does not put a probe on an unexpected place. But this skips the check if the target is on a module because the module may not be loaded. To fix this issue, this patch checks the number of probe target symbols in a target module when the module is loaded. If the probe is not on the unique name symbols in the module, it will be rejected at that point. Note that the symbol which has a unique name in the target module, it will be accepted even if there are same-name symbols in the kernel or other modules, Signed-off-by: Masami Hiramatsu (Google) --- Changes in v3: - Update the patch description. Updated from last October post, which was dropped by test failure: https://lore.kernel.org/linux-trace-kernel/169854904604.132316.12500381416261460174.stgit@devnote2/ Changes in v2: - Fix to skip checking uniqueness if the target module is not loaded. - Fix register_module_trace_kprobe() to pass correct symbol name. - Fix to call __register_trace_kprobe() from module callback. --- kernel/trace/trace_kprobe.c | 125 --- 1 file changed, 81 insertions(+), 44 deletions(-) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index c68d4e830fbe..0113afe2662d 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -670,6 +670,21 @@ static int register_trace_kprobe(struct trace_kprobe *tk) return ret; } +static int validate_module_probe_symbol(const char *modname, const char *symbol); + +static int register_module_trace_kprobe(struct module *mod, struct trace_kprobe *tk) +{ + const char *p; + int ret = 0; + + p = strchr(trace_kprobe_symbol(tk), ':'); + if (p) + ret = validate_module_probe_symbol(module_name(mod), p + 1); + if (!ret) + ret = __register_trace_kprobe(tk); + return ret; +} + /* Module notifier call back, checking event on the module */ static int trace_kprobe_module_callback(struct notifier_block *nb, unsigned long val, void *data) @@ -688,7 +703,7 @@ static int trace_kprobe_module_callback(struct notifier_block *nb, if (trace_kprobe_within_module(tk, mod)) { /* Don't need to check busy - this should have gone. */ __unregister_trace_kprobe(tk); - ret = __register_trace_kprobe(tk); + ret = register_module_trace_kprobe(mod, tk); if (ret) pr_warn("Failed to re-register probe %s on %s: %d\n", trace_probe_name(>tp), @@ -729,17 +744,68 @@ static int count_mod_symbols(void *data, const char *name, unsigned long unused) return 0; } -static unsigned int number_of_same_symbols(char *func_name) +static unsigned int number_of_same_symbols(const char *mod, const char *func_name) { struct sym_count_ctx ctx = { .count = 0, .name = func_name }; - kallsyms_on_each_match_symbol(count_symbols, func_name, ); + if (!mod) + kallsyms_on_each_match_symbol(count_symbols, func_name, ); - module_kallsyms_on_each_symbol(NULL, count_mod_symbols, ); + module_kallsyms_on_each_symbol(mod, count_mod_symbols, ); return ctx.count; } +static int validate_module_probe_symbol(const char *modname, const char *symbol) +{ + unsigned int count = number_of_same_symbols(modname, symbol); + + if (count > 1) { + /* +* Users should use ADDR to remove the ambiguity of +* using KSYM only. +*/ + return -EADDRNOTAVAIL; + } else if (count == 0) { + /* +* We can return ENOENT earlier than when register the +* kprobe. +*/ + return -ENOENT; + } + return 0; +} + +static int validate_probe_symbol(char *symbol) +{ + struct module *mod = NULL; + char *modname = NULL, *p; + int ret = 0; + + p = strchr(symbol, ':'); + if (p) { + modname = symbol; + symbol = p + 1; + *p = '\0'; + /* Return 0 (defer) if the module does not exist yet. */ + rcu_read_lock_sched(); + mod = find_module(modname); + if (mod && !try_module_get(mod)) + mod = NULL; + rcu_read_unlock_sched(); + if (!mod) + goto out; + } + + ret = validate_module_probe_symbol(modname, symbol); +out: + if (p) + *p = ':'; + if (mod) + module_put(mod); + return ret; +} + static int trace_kprobe_entry_handler(struct kretprobe_instance *ri,
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On 4/17/24 01:48, Willy Tarreau wrote: On Wed, Apr 17, 2024 at 10:16:26AM +0200, Greg KH wrote: at the scripts used by stable developers - and maybe at the ML server - to catch different variations won't hurt, as it sounds likely that people will end messing up with a big name like "do-not-apply-to-stable", typing instead things like: do_not_apply_to_stable dont-apply-to-stable and other variants. I want this very explicit that someone does not want this applied, and that it has a reason to do so. And if getting the email right to do so is the issue with that, that's fine. This is a very rare case that almost no one should normally hit. For using a comparable approach in haproxy on a daily basis, I do see the value in this. We just mark a lot of fixes "no backport needed" or "no backport needed unless blablabla" for everything that is only relevant to the dev tree, and that's a huge time saver for those working on the backports later. Maybe "not-for-stable" would be both shorter and easier to remember BTW ? Yes, "not-for-stable" looks like a good name to me. -- Florian
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
Em Wed, 17 Apr 2024 10:16:26 +0200 Greg KH escreveu: > On Wed, Apr 17, 2024 at 09:09:26AM +0100, Mauro Carvalho Chehab wrote: > > Em Wed, 17 Apr 2024 09:48:18 +0200 > > Thorsten Leemhuis escreveu: > > > > > Hi kernel.org helpdesk! > > > > > > Could you please create the email alias > > > do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null, > > > just like sta...@kernel.org does? > > > > > > That's an idea GregKH brought up a few days ago here: > > > https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/ > > > > > > To quote: > > > > > > > How about: > > > > cc: # Reason goes here, and > > > > must be present > > > > > > > > and we can make that address be routed to /dev/null just like > > > > is? > > > > > > There was some discussion about using something shorter, but in the end > > > there was no strong opposition and the thread ended a a few days ago. > > > > Heh, a shorter name would make it a lot easier to remember, specially > > since not wanting a patch to go to stable is an exception... I bet > > I'll never remember the right syntax, needing to look at the docs > > every time it would be used. > > > > IMO, something like: > > no-stable > > or > > nostable > > > > would do the trick and would be a lot easier to remember. > > > > Btw, IMO, it won't hurt accepting more than one variant that > > could be allowed, e. g. using a regular expression like: > > > > (do)?[-_]?(nt|not?).*stable > > That's not going to work at the mailing list server, or with my scripts, > sorry. A setting like that would likely be at exim (or similar). Most smtp servers allow some sort of wildcards, as those are used to pass or not e-mails to list servers and/or handle custom mail forward rules. At client level, one could use dovecot with pigeonhole to have sieve rules to filter e-mails. That's what I do here. > > at the scripts used by stable developers - and maybe at the ML server - to > > catch different variations won't hurt, as it sounds likely that people will > > end messing up with a big name like "do-not-apply-to-stable", typing > > instead things like: > > > > do_not_apply_to_stable > > dont-apply-to-stable > > > > and other variants. > > I want this very explicit that someone does not want this applied, and > that it has a reason to do so. And if getting the email right to do so > is the issue with that, that's fine. This is a very rare case that > almost no one should normally hit. Yeah, agreed: those are very rare exceptions. I remember just one or two cases where a media fix patch couldn't be queued to stable. The already-existing workflow worked for those. > And again, if maintainers don't want their tree to have Fixes: and > AUTOBOT run on them, we can easily add that to our lists. That's the > simpler and more explicit thing to do for those that do not want to have > anything backported they do not explicitly mark as such (some subsystems > do this already, like kvm and -mm and xfs, it's fine!). This all is > here because of maintainers who do NOT want to do that. From my side, I'm fine with whatever-explicit-long-filter-email. Regards, Mauro
[PATCH RT 0/1] Linux v4.19.312-rt134-rc1
Dear RT Folks, This is the RT stable review cycle of patch 4.19.312-rt134-rc1. Please scream at me if I messed something up. Please test the patches too. The -rc release is also available on kernel.org https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git on the v4.19-rt-next branch. If all goes well, this patch will be converted to the next main release on 2024-04-24. Signing key fingerprint: 5BF6 7BC5 0826 72CA BB45 ACAE 587C 5ECA 5D0A 306C All keys used for the above files and repositories can be found on the following git repository: git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git Enjoy! Daniel Changes from v4.19.307-rt133: Daniel Wagner (1): Linux 4.19.312-rt134 localversion-rt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- 2.44.0
[PATCH RT 1/1] Linux 4.19.312-rt134
v4.19.312-rt134-rc1 stable review patch. If anyone has any objections, please let me know. --- Signed-off-by: Daniel Wagner --- localversion-rt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/localversion-rt b/localversion-rt index c2c7e0fb6685..6067da4c8c99 100644 --- a/localversion-rt +++ b/localversion-rt @@ -1 +1 @@ --rt133 +-rt134 -- 2.44.0
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On Wed, Apr 17, 2024 at 03:38:28PM +0200, Greg KH wrote: > > That afaics makes them useless for the stable team (Greg may correct > > me > > if I'm wrong here), as they deal with the commits and have no easy, > > fast, and reliable way to look up the patch posting to query this. Or is > > the "patch basement" available somehow in git for each commit and I just > > missed that? > > You are correct, as-is, that would make it useless for my tools. > > BUT I could, if it's possible, just look up the original in lore somehow > and parse that. If it's there, does anyone have a "simple" way to map a > git commit back to a lore message if it does NOT have a Link: line in > it? Our current way of doing it is going by patch-id, and it's not great either, because there is more than one way to create a patch from a git commit. We've discovered this when Linus recommended that people send patches with the --histogram algorithm, which broke a bunch of our stuff. :) E.g. here's a recent commit that has a Fixes: git show c0297e7dd50795d559f3534887a6de1756b35d0f | git patch-id --stable | cut -d' ' -f1 c2f5c42a5a3bc05ffacd9679dd367e4a2207b018 it successfully maps to the patch: https://lore.kernel.org/all/?q=patchid%3Ac2f5c42a5a3bc05ffacd9679dd367e4a2207b018 I only put this here for academic purposes -- I really don't want you to go that route, because it's fragile (more fragile than git notes, and that's saying something). -K
Re: [PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads
On Tue, 16 Apr 2024 00:47:26 -0400 Steven Rostedt wrote: > On Mon, 15 Apr 2024 18:40:23 +0900 > "Masami Hiramatsu (Google)" wrote: > > > Check the number of probe target symbols in the target module when > > the module is loaded. If the probe is not on the unique name symbols > > in the module, it will be rejected at that point. > > > > Note that the symbol which has a unique name in the target module, > > it will be accepted even if there are same-name symbols in the > > kernel or other modules, > > This says what it does, but doesn't explain why it is doing it. > What's the purpose of this patch? Thank you for pointing it out, I just reused the description which I sent last year. It needs to be updated. - Currently, kprobe event checks whether the target symbol name is unique, so that it does not put a probe on unexpected place. But this skips the check if the target is on a module. This fixes the issue by checking the symbol is unique in the target module if the target is on a module. - Thanks! > > -- Steve -- Masami Hiramatsu (Google)
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On Wed, Apr 17, 2024 at 03:21:12PM +0200, Thorsten Leemhuis wrote: > On 17.04.24 14:52, Konstantin Ryabitsev wrote: > > On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote: > >> > >> Could you please create the email alias > >> do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null, > >> just like sta...@kernel.org does? > >> > >> That's an idea GregKH brought up a few days ago here: > >> https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/ > >> > >> To quote: > >> > >>> How about: > >>> cc: # Reason goes here, and must be > >>> present > >>> > >>> and we can make that address be routed to /dev/null just like > >>> is? > > First, many thx for your feedback. > > > That would make it into actual commits and probably irk maintainers and > > Linus, no? > > Yup. > > > I also don't really love the idea of overloading email > > addresses with additional semantics. Using Cc: stable kinda makes sense, > > even if it's not a real email address (but it could become at some > > point), but this feels different. > > Okay. > > > In general, I feel this information belongs in the patch basement (the > > place where change-id, base-commit, etc goes). E.g.: > > > > stable-autosel: ignore > > [This fix requires a feature that is only present in mainline] > > > > This allows passing along structured information that can be parsed by > > automated tooling without putting it into the commit. > > That afaics makes them useless for the stable team (Greg may correct me > if I'm wrong here), as they deal with the commits and have no easy, > fast, and reliable way to look up the patch posting to query this. Or is > the "patch basement" available somehow in git for each commit and I just > missed that? You are correct, as-is, that would make it useless for my tools. BUT I could, if it's possible, just look up the original in lore somehow and parse that. If it's there, does anyone have a "simple" way to map a git commit back to a lore message if it does NOT have a Link: line in it? I guess I should look at setting up a local copy of lei to dig through the git record of lkml? But what about patches that aren't cc: to lkml, I don't want to have to hit lore.kernel.org for each query, that's going to not be nice. > Guess in a better world we might use "git notes" for this, but we > currently do not use them because they iirc are somewhat tricky (they > are easily lost on rebases iirc is one of the reasons ) -- and starting > to use them just for this is not worth it. git notes never works for anything, and everyone always mentions it :) > >> There was some discussion about using something shorter, but in the end > >> there was no strong opposition and the thread ended a a few days ago. > > I feel this is a significant change to the workflow, so I would like the > > workflows list to have another go at this topic. :) > > FWIW, we could go back to what I initially proposed: use the existing > stable tag with a pre-defined comment to mark patches that AUTOSEL et. > al. should not pick up: > https://lore.kernel.org/all/c0a08b160b286e8c98549eedb37404c6e784cf8a.1712812895.git.li...@leemhuis.info/ If you can pick a better string, possibly, yes. But in the end, your proposal seems to imply: cc: sta...@kernel.org # Psych! Just kidding, never backport this! but really, that's just mean, and again, this is a VERY rare case you are trying to automate here. We have MUCH better and simpler ways for maintainers to not have their subsystems scanned for stuff like this, why are we spending all of our time on this topic? thanks, greg k-h
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On Wed, Apr 17, 2024 at 03:21:12PM +0200, Thorsten Leemhuis wrote: > > In general, I feel this information belongs in the patch basement > > (the place where change-id, base-commit, etc goes). E.g.: > > > > stable-autosel: ignore > > [This fix requires a feature that is only present in mainline] > > > > This allows passing along structured information that can be parsed by > > automated tooling without putting it into the commit. > > That afaics makes them useless for the stable team (Greg may correct me > if I'm wrong here), as they deal with the commits and have no easy, > fast, and reliable way to look up the patch posting to query this. Or is > the "patch basement" available somehow in git for each commit and I just > missed that? Ah, okay, my assumption was wrong, then. How about meeting things halfway, then: Cc: stable+noauto...@kernel.org # Reason -K
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On 17.04.24 14:52, Konstantin Ryabitsev wrote: > On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote: >> >> Could you please create the email alias >> do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null, >> just like sta...@kernel.org does? >> >> That's an idea GregKH brought up a few days ago here: >> https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/ >> >> To quote: >> >>> How about: >>> cc: # Reason goes here, and must be >>> present >>> >>> and we can make that address be routed to /dev/null just like >>> is? First, many thx for your feedback. > That would make it into actual commits and probably irk maintainers and > Linus, no? Yup. > I also don't really love the idea of overloading email > addresses with additional semantics. Using Cc: stable kinda makes sense, > even if it's not a real email address (but it could become at some > point), but this feels different. Okay. > In general, I feel this information belongs in the patch basement (the > place where change-id, base-commit, etc goes). E.g.: > > stable-autosel: ignore > [This fix requires a feature that is only present in mainline] > > This allows passing along structured information that can be parsed by > automated tooling without putting it into the commit. That afaics makes them useless for the stable team (Greg may correct me if I'm wrong here), as they deal with the commits and have no easy, fast, and reliable way to look up the patch posting to query this. Or is the "patch basement" available somehow in git for each commit and I just missed that? Guess in a better world we might use "git notes" for this, but we currently do not use them because they iirc are somewhat tricky (they are easily lost on rebases iirc is one of the reasons ) -- and starting to use them just for this is not worth it. >> There was some discussion about using something shorter, but in the end >> there was no strong opposition and the thread ended a a few days ago. > I feel this is a significant change to the workflow, so I would like the > workflows list to have another go at this topic. :) FWIW, we could go back to what I initially proposed: use the existing stable tag with a pre-defined comment to mark patches that AUTOSEL et. al. should not pick up: https://lore.kernel.org/all/c0a08b160b286e8c98549eedb37404c6e784cf8a.1712812895.git.li...@leemhuis.info/ Ciao, Thorsten
Re: [PATCH net-next v2 07/15] mm: page_frag: add '_va' suffix to page_frag API
On 2024/4/17 0:12, Alexander H Duyck wrote: > On Mon, 2024-04-15 at 21:19 +0800, Yunsheng Lin wrote: >> Currently most of the API for page_frag API is returning >> 'virtual address' as output or expecting 'virtual address' >> as input, in order to differentiate the API handling between >> 'virtual address' and 'struct page', add '_va' suffix to the >> corresponding API mirroring the page_pool_alloc_va() API of >> the page_pool. >> >> Signed-off-by: Yunsheng Lin > > This patch is a total waste of time. By that logic we should be > renaming __get_free_pages since it essentially does the same thing. > > This just seems like more code changes for the sake of adding code > changes rather than fixing anything. In my opinion it should be dropped > from the set. The rename is to support different use case as mentioned below in patch 14: "Depending on different use cases, callers expecting to deal with va, page or both va and page for them may call page_frag_alloc_va*, page_frag_alloc_pg*, or page_frag_alloc* API accordingly." Naming is hard anyway, I am open to better API naming for the above use cases. > > . >
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On 4/17/24 2:52 PM, Konstantin Ryabitsev wrote: > On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote: >> Hi kernel.org helpdesk! >> >> Could you please create the email alias >> do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null, >> just like sta...@kernel.org does? >> >> That's an idea GregKH brought up a few days ago here: >> https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/ >> >> To quote: >> >> > How about: >> >cc: # Reason goes here, and must be >> > present >> > >> > and we can make that address be routed to /dev/null just like >> > is? > > That would make it into actual commits and probably irk maintainers and > Linus, no? I also don't really love the idea of overloading email > addresses with additional semantics. Using Cc: stable kinda makes sense, > even if it's not a real email address (but it could become at some > point), but this feels different. > > In general, I feel this information belongs in the patch basement (the > place where change-id, base-commit, etc goes). E.g.: > > stable-autosel: ignore > [This fix requires a feature that is only present in mainline] > > This allows passing along structured information that can be parsed by > automated tooling without putting it into the commit. But isn't it the actual commit what the stable tooling parses? >> There was some discussion about using something shorter, but in the end >> there was no strong opposition and the thread ended a a few days ago. > > I feel this is a significant change to the workflow, so I would like the > workflows list to have another go at this topic. :) > > -K >
Re: [PATCH net-next] ipvs: allow some sysctls in non-init user namespaces
Hello, On Tue, 16 Apr 2024, Alexander Mikhalitsyn wrote: > Let's make all IPVS sysctls visible and RO even when > network namespace is owned by non-initial user namespace. > > Let's make a few sysctls to be writable: > - conntrack > - conn_reuse_mode > - expire_nodest_conn > - expire_quiescent_template > > I'm trying to be conservative with this to prevent > introducing any security issues in there. Maybe, > we can allow more sysctls to be writable, but let's > do this on-demand and when we see real use-case. > > This list of sysctls was chosen because I can't > see any security risks allowing them and also > Kubernetes uses [2] these specific sysctls. > > This patch is motivated by user request in the LXC > project [1]. > > [1] https://github.com/lxc/lxc/issues/4278 > [2] > https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103 > > Cc: Stéphane Graber > Cc: Christian Brauner > Cc: Julian Anastasov > Cc: Simon Horman > Cc: Pablo Neira Ayuso > Cc: Jozsef Kadlecsik > Cc: Florian Westphal > Signed-off-by: Alexander Mikhalitsyn > --- > net/netfilter/ipvs/ip_vs_ctl.c | 18 +++--- > 1 file changed, 15 insertions(+), 3 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c > index 143a341bbc0a..92a818c2f783 100644 > --- a/net/netfilter/ipvs/ip_vs_ctl.c > +++ b/net/netfilter/ipvs/ip_vs_ctl.c > @@ -4285,10 +4285,22 @@ static int __net_init > ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs) As the list of privileged vars is short I prefer to use a bool and to make only some vars read-only: bool unpriv = false; > if (tbl == NULL) > return -ENOMEM; > > - /* Don't export sysctls to unprivileged users */ > + /* Let's show all sysctls in non-init user namespace-owned > + * net namespaces, but make them read-only. > + * > + * Allow only a few specific sysctls to be writable. > + */ > if (net->user_ns != _user_ns) { Here we should just set: unpriv = true; > - tbl[0].procname = NULL; > - ctl_table_size = 0; > + for (idx = 0; idx < ARRAY_SIZE(vs_vars); idx++) { > + if (!tbl[idx].procname) > + continue; > + > + if (!((strcmp(tbl[idx].procname, "conntrack") > == 0) || > + (strcmp(tbl[idx].procname, > "conn_reuse_mode") == 0) || > + (strcmp(tbl[idx].procname, > "expire_nodest_conn") == 0) || > + (strcmp(tbl[idx].procname, > "expire_quiescent_template") == 0))) > + tbl[idx].mode = 0444; > + } > } > } else > tbl = vs_vars; And below at every place to use: if (unpriv) tbl[idx].mode = 0444; for the following 4 privileged sysctl vars: - sync_qlen_max: - allocates messages in kernel context - this needs better tunning in another patch - sync_sock_size: - allocates messages in kernel context - run_estimation: - for now, better init ns to decide if to use est stats - est_nice: - for now, better init ns to decide the value - debug_level: - already set to 0444 I.e. these vars allocate resources (mem, CPU) without proper control, so for now we will just copy them from init ns without allowing writing. And they are vars that are not tuned often. Also we do not know which netns is supposed to be the privileged one, some solutions move all devices out of init_net, so we can not decide where to use lower limits. OTOH, "amemthresh" is not privileged but needs single READ_ONCE for sysctl_amemthresh in update_defense_level() due to the possible div by zero if we allow writing to anyone, eg.: int amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0); ... nomem = availmem < amemthresh; ... use only amemthresh All other vars can be writable. Regards -- Julian Anastasov
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote: > Hi kernel.org helpdesk! > > Could you please create the email alias > do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null, > just like sta...@kernel.org does? > > That's an idea GregKH brought up a few days ago here: > https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/ > > To quote: > > > How about: > > cc: # Reason goes here, and must be > > present > > > > and we can make that address be routed to /dev/null just like > > is? That would make it into actual commits and probably irk maintainers and Linus, no? I also don't really love the idea of overloading email addresses with additional semantics. Using Cc: stable kinda makes sense, even if it's not a real email address (but it could become at some point), but this feels different. In general, I feel this information belongs in the patch basement (the place where change-id, base-commit, etc goes). E.g.: stable-autosel: ignore [This fix requires a feature that is only present in mainline] This allows passing along structured information that can be parsed by automated tooling without putting it into the commit. > There was some discussion about using something shorter, but in the end > there was no strong opposition and the thread ended a a few days ago. I feel this is a significant change to the workflow, so I would like the workflows list to have another go at this topic. :) -K
[syzbot] [bpf?] [trace?] possible deadlock in __send_signal_locked
Hello, syzbot found the following issue on: HEAD commit:96fca68c4fbf Merge tag 'nfsd-6.9-3' of git://git.kernel.or.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=13967bb318 kernel config: https://syzkaller.appspot.com/x/.config?x=85dbe39cf8e4f599 dashboard link: https://syzkaller.appspot.com/bug?extid=6e3b6eab5bd4ed584a38 compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 Unfortunately, I don't have any reproducer for this issue yet. Downloadable assets: disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-96fca68c.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/d6d7a71ca443/vmlinux-96fca68c.xz kernel image: https://storage.googleapis.com/syzbot-assets/accb76ce6c9c/bzImage-96fca68c.xz IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+6e3b6eab5bd4ed584...@syzkaller.appspotmail.com == WARNING: possible circular locking dependency detected 6.9.0-rc4-syzkaller-00031-g96fca68c4fbf #0 Not tainted -- syz-executor.0/7699 is trying to acquire lock: 88806b53d998 (>lock){-.-.}-{2:2}, at: __queue_work+0x23a/0x1020 kernel/workqueue.c:2346 but task is already holding lock: 888023446620 (>signalfd_wqh){}-{2:2}, at: __wake_up_common_lock kernel/sched/wait.c:105 [inline] 888023446620 (>signalfd_wqh){}-{2:2}, at: __wake_up+0x1c/0x60 kernel/sched/wait.c:127 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (>signalfd_wqh){}-{2:2}: __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x3a/0x60 kernel/locking/spinlock.c:162 __wake_up_common_lock kernel/sched/wait.c:105 [inline] __wake_up+0x1c/0x60 kernel/sched/wait.c:127 signalfd_notify include/linux/signalfd.h:22 [inline] __send_signal_locked+0x951/0x11c0 kernel/signal.c:1168 do_notify_parent+0xeb4/0x1040 kernel/signal.c:2143 exit_notify kernel/exit.c:754 [inline] do_exit+0x1369/0x2c10 kernel/exit.c:898 do_group_exit+0xd3/0x2a0 kernel/exit.c:1027 __do_sys_exit_group kernel/exit.c:1038 [inline] __se_sys_exit_group kernel/exit.c:1036 [inline] __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:1036 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #2 (>siglock){-...}-{2:2}: __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x3a/0x60 kernel/locking/spinlock.c:162 __lock_task_sighand+0xc2/0x340 kernel/signal.c:1414 lock_task_sighand include/linux/sched/signal.h:746 [inline] do_send_sig_info kernel/signal.c:1300 [inline] group_send_sig_info+0x290/0x300 kernel/signal.c:1453 bpf_send_signal_common+0x2e8/0x3a0 kernel/trace/bpf_trace.c:881 bpf_send_signal_thread kernel/trace/bpf_trace.c:898 [inline] bpf_send_signal_thread+0x16/0x20 kernel/trace/bpf_trace.c:896 ___bpf_prog_run+0x3e51/0xabd0 kernel/bpf/core.c:1997 __bpf_prog_run32+0xc1/0x100 kernel/bpf/core.c:2236 bpf_dispatcher_nop_func include/linux/bpf.h:1234 [inline] __bpf_prog_run include/linux/filter.h:657 [inline] bpf_prog_run include/linux/filter.h:664 [inline] __bpf_trace_run kernel/trace/bpf_trace.c:2381 [inline] bpf_trace_run4+0x176/0x460 kernel/trace/bpf_trace.c:2422 __bpf_trace_mmap_lock_acquire_returned+0x134/0x180 include/trace/events/mmap_lock.h:52 trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:52 [inline] __mmap_lock_do_trace_acquire_returned+0x456/0x790 mm/mmap_lock.c:237 __mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline] mmap_write_lock include/linux/mmap_lock.h:109 [inline] __do_sys_set_mempolicy_home_node+0x574/0x860 mm/mempolicy.c:1568 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #1 (lock#11){+.+.}-{2:2}: local_lock_acquire include/linux/local_lock_internal.h:29 [inline] __mmap_lock_do_trace_acquire_returned+0x97/0x790 mm/mmap_lock.c:237 __mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline] mmap_read_trylock include/linux/mmap_lock.h:166 [inline] stack_map_get_build_id_offset+0x5df/0x7d0 kernel/bpf/stackmap.c:141 __bpf_get_stack+0x6bf/0x700 kernel/bpf/stackmap.c:449 bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1985 [inline] bpf_get_stack_raw_tp+0x124/0x160 kernel/trace/bpf_trace.c:1975 ___bpf_prog_run+0x3e51/0xabd0 kernel/bpf/core.c:1997
Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection
On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote: > Add the function vduse_alloc_reconnnect_info_mem > and vduse_alloc_reconnnect_info_mem > These functions allow vduse to allocate and free memory for reconnection > information. The amount of memory allocated is vq_num pages. > Each VQS will map its own page where the reconnection information will be > saved > > Signed-off-by: Cindy Lu > --- > drivers/vdpa/vdpa_user/vduse_dev.c | 40 ++ > 1 file changed, 40 insertions(+) > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c > b/drivers/vdpa/vdpa_user/vduse_dev.c > index ef3c9681941e..2da659d5f4a8 100644 > --- a/drivers/vdpa/vdpa_user/vduse_dev.c > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c > @@ -65,6 +65,7 @@ struct vduse_virtqueue { > int irq_effective_cpu; > struct cpumask irq_affinity; > struct kobject kobj; > + unsigned long vdpa_reconnect_vaddr; > }; > > struct vduse_dev; > @@ -1105,6 +1106,38 @@ static void vduse_vq_update_effective_cpu(struct > vduse_virtqueue *vq) > > vq->irq_effective_cpu = curr_cpu; > } > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev) > +{ > + unsigned long vaddr = 0; > + struct vduse_virtqueue *vq; > + > + for (int i = 0; i < dev->vq_num; i++) { > + /*page 0~ vq_num save the reconnect info for vq*/ > + vq = dev->vqs[i]; > + vaddr = get_zeroed_page(GFP_KERNEL); I don't get why you insist on stealing kernel memory for something that is just used by userspace to store data for its own use. Userspace does not lack ways to persist data, for example, create a regular file anywhere in the filesystem. > + if (vaddr == 0) > + return -ENOMEM; > + > + vq->vdpa_reconnect_vaddr = vaddr; > + } > + > + return 0; > +} > + > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev) > +{ > + struct vduse_virtqueue *vq; > + > + for (int i = 0; i < dev->vq_num; i++) { > + vq = dev->vqs[i]; > + > + if (vq->vdpa_reconnect_vaddr) > + free_page(vq->vdpa_reconnect_vaddr); > + vq->vdpa_reconnect_vaddr = 0; > + } > + > + return 0; > +} > > static long vduse_dev_ioctl(struct file *file, unsigned int cmd, > unsigned long arg) > @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name) > mutex_unlock(>lock); > return -EBUSY; > } > + vduse_free_reconnnect_info_mem(dev); > + > dev->connected = true; > mutex_unlock(>lock); > > @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct vduse_dev_config > *config, > ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num); > if (ret) > goto err_vqs; > + ret = vduse_alloc_reconnnect_info_mem(dev); > + if (ret < 0) > + goto err_mem; > > __module_get(THIS_MODULE); > > return 0; > err_vqs: > device_destroy(_class, MKDEV(MAJOR(vduse_major), dev->minor)); > +err_mem: > + vduse_free_reconnnect_info_mem(dev); > err_dev: > idr_remove(_idr, dev->minor); > err_idr: > -- > 2.43.0
Re: [PATCH net-next v6 1/7] net: introduce rstreason to detect why the RST is sent
Hello Eric, On Wed, Apr 17, 2024 at 5:02 PM Eric Dumazet wrote: > > On Wed, Apr 17, 2024 at 10:51 AM Jason Xing wrote: > > > > From: Jason Xing > > > > Add a new standalone file for the easy future extension to support > > both active reset and passive reset in the TCP/DCCP/MPTCP protocols. > > > > This patch only does the preparations for reset reason mechanism, > > nothing else changes. > > > > The reset reasons are divided into three parts: > > 1) reuse drop reasons for passive reset in TCP > > 2) reuse MP_TCPRST option for MPTCP > > 3) our own reasons > > > > I will implement the basic codes of active/passive reset reason in > > those three protocols, which is not complete for this moment. But > > it provides a new chance to let other people add more reasons into > > it:) > > > > Signed-off-by: Jason Xing > > My original suggestion was to use normal values in 'enum > skb_drop_reason', even if there was not necessarily a 'drop' > in the common sense. > > https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/ > > This would avoid these ugly casts later, even casting an enum to other > ones is not very logical. Thanks for your comment. It's a little bit tricky. That's the reason I documented and commented on this in the rstreason.h file. I hope it's not that hard to understand. > Going through an u32 pivot is quite a hack. > > If you feel the need to put them in a special group, this is fine by me. Yes, rst reasons only partially rely on the drop reason mechanism to support passive rst for TCP well, but not supporting other cases. My final goal is to cover all the cases for the future, so I wish I can put it into a separate group, then people like me who find it useful can introduce more reasons into it. Thanks, Jason
Re: [PATCH net-next v6 1/7] net: introduce rstreason to detect why the RST is sent
On Wed, Apr 17, 2024 at 10:51 AM Jason Xing wrote: > > From: Jason Xing > > Add a new standalone file for the easy future extension to support > both active reset and passive reset in the TCP/DCCP/MPTCP protocols. > > This patch only does the preparations for reset reason mechanism, > nothing else changes. > > The reset reasons are divided into three parts: > 1) reuse drop reasons for passive reset in TCP > 2) reuse MP_TCPRST option for MPTCP > 3) our own reasons > > I will implement the basic codes of active/passive reset reason in > those three protocols, which is not complete for this moment. But > it provides a new chance to let other people add more reasons into > it:) > > Signed-off-by: Jason Xing My original suggestion was to use normal values in 'enum skb_drop_reason', even if there was not necessarily a 'drop' in the common sense. https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/ This would avoid these ugly casts later, even casting an enum to other ones is not very logical. Going through an u32 pivot is quite a hack. If you feel the need to put them in a special group, this is fine by me.
[PATCH 1/1] virtio: Add support for the virtio suspend feature
Add support for the VIRTIO_F_SUSPEND feature. When this feature is negotiated, power management can use it to suspend virtio devices instead of resorting to resetting the devices entirely. Signed-off-by: David Stevens --- drivers/virtio/virtio.c| 32 ++ drivers/virtio/virtio_pci_common.c | 29 +++ drivers/virtio/virtio_pci_modern.c | 19 ++ include/linux/virtio.h | 2 ++ include/uapi/linux/virtio_config.h | 10 +- 5 files changed, 74 insertions(+), 18 deletions(-) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index f4080692b351..cd11495a5098 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -1,5 +1,6 @@ // SPDX-License-Identifier: GPL-2.0-only #include +#include #include #include #include @@ -580,6 +581,37 @@ int virtio_device_restore(struct virtio_device *dev) return ret; } EXPORT_SYMBOL_GPL(virtio_device_restore); + +static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool enabled) +{ + u8 status, target; + + status = dev->config->get_status(dev); + if (enabled) + target = status | VIRTIO_CONFIG_S_SUSPEND; + else + target = status & ~VIRTIO_CONFIG_S_SUSPEND; + dev->config->set_status(dev, target); + + while ((status = dev->config->get_status(dev)) != target) { + if (status & VIRTIO_CONFIG_S_NEEDS_RESET) + return -EIO; + mdelay(10); + } + return 0; +} + +int virtio_device_suspend(struct virtio_device *dev) +{ + return virtio_device_set_suspend_bit(dev, true); +} +EXPORT_SYMBOL_GPL(virtio_device_suspend); + +int virtio_device_resume(struct virtio_device *dev) +{ + return virtio_device_set_suspend_bit(dev, false); +} +EXPORT_SYMBOL_GPL(virtio_device_resume); #endif static int virtio_init(void) diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c index b655fccaf773..4d542de05970 100644 --- a/drivers/virtio/virtio_pci_common.c +++ b/drivers/virtio/virtio_pci_common.c @@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev) return virtio_device_restore(_dev->vdev); } -static bool vp_supports_pm_no_reset(struct device *dev) +static int virtio_pci_suspend(struct device *dev) { struct pci_dev *pci_dev = to_pci_dev(dev); - u16 pmcsr; - - if (!pci_dev->pm_cap) - return false; - - pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, ); - if (PCI_POSSIBLE_ERROR(pmcsr)) { - dev_err(dev, "Unable to query pmcsr"); - return false; - } + struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev); - return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET; -} + if (virtio_has_feature(_dev->vdev, VIRTIO_F_SUSPEND)) + return virtio_device_suspend(_dev->vdev); -static int virtio_pci_suspend(struct device *dev) -{ - return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev); + return virtio_pci_freeze(dev); } static int virtio_pci_resume(struct device *dev) { - return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev); + struct pci_dev *pci_dev = to_pci_dev(dev); + struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev); + + if (virtio_has_feature(_dev->vdev, VIRTIO_F_SUSPEND)) + return virtio_device_resume(_dev->vdev); + + return virtio_pci_restore(dev); } static const struct dev_pm_ops virtio_pci_pm_ops = { diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c index f62b530aa3b5..ac8734526b8d 100644 --- a/drivers/virtio/virtio_pci_modern.c +++ b/drivers/virtio/virtio_pci_modern.c @@ -209,6 +209,22 @@ static void vp_modern_avq_deactivate(struct virtio_device *vdev) __virtqueue_break(admin_vq->info.vq); } +static bool vp_supports_pm_no_reset(struct pci_dev *pci_dev) +{ + u16 pmcsr; + + if (!pci_dev->pm_cap) + return false; + + pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, ); + if (PCI_POSSIBLE_ERROR(pmcsr)) { + dev_err(_dev->dev, "Unable to query pmcsr"); + return false; + } + + return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET; +} + static void vp_transport_features(struct virtio_device *vdev, u64 features) { struct virtio_pci_device *vp_dev = to_vp_device(vdev); @@ -223,6 +239,9 @@ static void vp_transport_features(struct virtio_device *vdev, u64 features) if (features & BIT_ULL(VIRTIO_F_ADMIN_VQ)) __virtio_set_bit(vdev, VIRTIO_F_ADMIN_VQ); + + if (features & BIT_ULL(VIRTIO_F_SUSPEND) && vp_supports_pm_no_reset(pci_dev)) + __virtio_set_bit(vdev, VIRTIO_F_SUSPEND); } static int __vp_check_common_size_one_feature(struct virtio_device *vdev, u32 fbit, diff --git
[PATCH 0/1] virtio: Add suspend support
This series implements support for the virtio device suspend feature that is under discussion. Unfortunately, the virtio mailing list is currently being migrated, so recent discussion of the proposal is not archived anywhere. There current version of the proposal is a combination of [1] and [2]. [1] https://lore.kernel.org/all/20230906081637.32185-3-lingshan@intel.com/ [2] https://lists.oasis-open.org/archives/virtio-comment/202402/msg00088.html David Stevens (1): virtio: Add support for the virtio suspend feature drivers/virtio/virtio.c| 32 ++ drivers/virtio/virtio_pci_common.c | 29 +++ drivers/virtio/virtio_pci_modern.c | 19 ++ include/linux/virtio.h | 2 ++ include/uapi/linux/virtio_config.h | 10 +- 5 files changed, 74 insertions(+), 18 deletions(-) base-commit: e8f897f4afef0031fe618a8e94127a0934896aba -- 2.44.0.683.g7961c838ac-goog
Re: [External] Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info
On Wed, Apr 10, 2024 at 9:51 AM Jonathan Cameron wrote: > > On Tue, 9 Apr 2024 12:02:31 -0700 > "Ho-Ren (Jack) Chuang" wrote: > > > Hi Jonathan, > > > > On Tue, Apr 9, 2024 at 9:12 AM Jonathan Cameron > > wrote: > > > > > > On Fri, 5 Apr 2024 15:43:47 -0700 > > > "Ho-Ren (Jack) Chuang" wrote: > > > > > > > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron > > > > wrote: > > > > > > > > > > On Fri, 5 Apr 2024 00:07:06 + > > > > > "Ho-Ren (Jack) Chuang" wrote: > > > > > > > > > > > The current implementation treats emulated memory devices, such as > > > > > > CXL1.1 type3 memory, as normal DRAM when they are emulated as > > > > > > normal memory > > > > > > (E820_TYPE_RAM). However, these emulated devices have different > > > > > > characteristics than traditional DRAM, making it important to > > > > > > distinguish them. Thus, we modify the tiered memory initialization > > > > > > process > > > > > > to introduce a delay specifically for CPUless NUMA nodes. This delay > > > > > > ensures that the memory tier initialization for these nodes is > > > > > > deferred > > > > > > until HMAT information is obtained during the boot process. Finally, > > > > > > demotion tables are recalculated at the end. > > > > > > > > > > > > * late_initcall(memory_tier_late_init); > > > > > > Some device drivers may have initialized memory tiers between > > > > > > `memory_tier_init()` and `memory_tier_late_init()`, potentially > > > > > > bringing > > > > > > online memory nodes and configuring memory tiers. They should be > > > > > > excluded > > > > > > in the late init. > > > > > > > > > > > > * Handle cases where there is no HMAT when creating memory tiers > > > > > > There is a scenario where a CPUless node does not provide HMAT > > > > > > information. > > > > > > If no HMAT is specified, it falls back to using the default DRAM > > > > > > tier. > > > > > > > > > > > > * Introduce another new lock `default_dram_perf_lock` for adist > > > > > > calculation > > > > > > In the current implementation, iterating through CPUlist nodes > > > > > > requires > > > > > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will > > > > > > end up > > > > > > trying to acquire the same lock, leading to a potential deadlock. > > > > > > Therefore, we propose introducing a standalone > > > > > > `default_dram_perf_lock` to > > > > > > protect `default_dram_perf_*`. This approach not only avoids > > > > > > deadlock > > > > > > but also prevents holding a large lock simultaneously. > > > > > > > > > > > > * Upgrade `set_node_memory_tier` to support additional cases, > > > > > > including > > > > > > default DRAM, late CPUless, and hot-plugged initializations. > > > > > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and > > > > > > `mt_find_alloc_memory_type()` are moved into > > > > > > `set_node_memory_tier()` to > > > > > > handle cases where memtype is not initialized and where HMAT > > > > > > information is > > > > > > available. > > > > > > > > > > > > * Introduce `default_memory_types` for those memory types that are > > > > > > not > > > > > > initialized by device drivers. > > > > > > Because late initialized memory and default DRAM memory need to be > > > > > > managed, > > > > > > a default memory type is created for storing all memory types that > > > > > > are > > > > > > not initialized by device drivers and as a fallback. > > > > > > > > > > > > Signed-off-by: Ho-Ren (Jack) Chuang > > > > > > Signed-off-by: Hao Xiang > > > > > > Reviewed-by: "Huang, Ying" > > > > > > > > > > Hi - one remaining question. Why can't we delay init for all nodes > > > > > to either drivers or your fallback late_initcall code. > > > > > It would be nice to reduce possible code paths. > > > > > > > > I try not to change too much of the existing code structure in > > > > this patchset. > > > > > > > > To me, postponing/moving all memory tier registrations to > > > > late_initcall() is another possible action item for the next patchset. > > > > > > > > After tier_mem(), hmat_init() is called, which requires registering > > > > `default_dram_type` info. This is when `default_dram_type` is needed. > > > > However, it is indeed possible to postpone the latter part, > > > > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can > > > > indeed be split into two parts, and the latter part can be moved to > > > > late_initcall() to be processed together. > > > > > > > > Doing this all memory-type drivers have to call late_initcall() to > > > > register a memory tier. I’m not sure how many they are? > > > > > > > > What do you guys think? > > > > > > Gut feeling - if you are going to move it for some cases, move it for > > > all of them. Then we only have to test once ;) > > > > > > J > > > > Thank you for your reminder! I agree~ That's why I'm considering > > changing them in the next patchset because of the amount of changes. > > And also, this patchset already contains too many
[PATCH net-next v6 6/7] mptcp: introducing a helper into active reset logic
From: Jason Xing Since we have mapped every mptcp reset reason definition in enum sk_rst_reason, introducing a new helper can cover some missing places where we have already set the subflow->reset_reason. Note: using SK_RST_REASON_NOT_SPECIFIED is the same as SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown. So we can convert it directly. Suggested-by: Paolo Abeni Signed-off-by: Jason Xing --- Link: https://lore.kernel.org/all/2d3ea199eef53cf6a0c48e21abdee0eefbdee927.ca...@redhat.com/ --- net/mptcp/protocol.c | 4 +--- net/mptcp/protocol.h | 11 +++ net/mptcp/subflow.c | 6 ++ 3 files changed, 14 insertions(+), 7 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 065967086492..4b13ca362efa 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -21,7 +21,6 @@ #endif #include #include -#include #include #include "protocol.h" #include "mib.h" @@ -2570,8 +2569,7 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk) slow = lock_sock_fast(tcp_sk); if (tcp_sk->sk_state != TCP_CLOSE) { - tcp_send_active_reset(tcp_sk, GFP_ATOMIC, - SK_RST_REASON_NOT_SPECIFIED); + mptcp_send_active_reset_reason(tcp_sk); tcp_set_state(tcp_sk, TCP_CLOSE); } unlock_sock_fast(tcp_sk, slow); diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index fdfa843e2d88..82ef2f42a1bc 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -12,6 +12,7 @@ #include #include #include +#include #include "mptcp_pm_gen.h" @@ -581,6 +582,16 @@ mptcp_subflow_ctx_reset(struct mptcp_subflow_context *subflow) WRITE_ONCE(subflow->local_id, -1); } +static inline void +mptcp_send_active_reset_reason(struct sock *sk) +{ + struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk); + enum sk_rst_reason reason; + + reason = convert_mptcp_reason(subflow->reset_reason); + tcp_send_active_reset(sk, GFP_ATOMIC, reason); +} + static inline u64 mptcp_subflow_get_map_offset(const struct mptcp_subflow_context *subflow) { diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index bde4a7fdee82..4783d558863c 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -20,7 +20,6 @@ #include #endif #include -#include #include "protocol.h" #include "mib.h" @@ -424,7 +423,7 @@ void mptcp_subflow_reset(struct sock *ssk) /* must hold: tcp_done() could drop last reference on parent */ sock_hold(sk); - tcp_send_active_reset(ssk, GFP_ATOMIC, SK_RST_REASON_NOT_SPECIFIED); + mptcp_send_active_reset_reason(ssk); tcp_done(ssk); if (!test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, _sk(sk)->flags)) mptcp_schedule_work(sk); @@ -1362,8 +1361,7 @@ static bool subflow_check_data_avail(struct sock *ssk) tcp_set_state(ssk, TCP_CLOSE); while ((skb = skb_peek(>sk_receive_queue))) sk_eat_skb(ssk, skb); - tcp_send_active_reset(ssk, GFP_ATOMIC, - SK_RST_REASON_NOT_SPECIFIED); + mptcp_send_active_reset_reason(ssk); WRITE_ONCE(subflow->data_avail, false); return false; } -- 2.37.3
[PATCH net-next v6 7/7] rstreason: make it work in trace world
From: Jason Xing At last, we should let it work by introducing this reset reason in trace world. One of the possible expected outputs is: ... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx state=TCP_ESTABLISHED reason=NOT_SPECIFIED Signed-off-by: Jason Xing Reviewed-by: Steven Rostedt (Google) --- include/trace/events/tcp.h | 37 + net/ipv4/tcp_ipv4.c| 2 +- net/ipv4/tcp_output.c | 2 +- net/ipv6/tcp_ipv6.c| 2 +- 4 files changed, 36 insertions(+), 7 deletions(-) diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h index 5c04a61a11c2..b1455cbc0634 100644 --- a/include/trace/events/tcp.h +++ b/include/trace/events/tcp.h @@ -11,6 +11,7 @@ #include #include #include +#include /* * tcp event with arguments sk and skb @@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb, TP_ARGS(sk, skb) ); +#undef FN1 +#define FN1(reason)TRACE_DEFINE_ENUM(SK_RST_REASON_##reason); +#undef FN2 +#define FN2(reason)TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason); +DEFINE_RST_REASON(FN1, FN1) + +#undef FN1 +#undef FNe1 +#define FN1(reason){ SK_RST_REASON_##reason, #reason }, +#define FNe1(reason) { SK_RST_REASON_##reason, #reason } + +#undef FN2 +#undef FNe2 +#define FN2(reason){ SKB_DROP_REASON_##reason, #reason }, +#define FNe2(reason) { SKB_DROP_REASON_##reason, #reason } /* * skb of trace_tcp_send_reset is the skb that caused RST. In case of * active reset, skb should be NULL */ TRACE_EVENT(tcp_send_reset, - TP_PROTO(const struct sock *sk, const struct sk_buff *skb), + TP_PROTO(const struct sock *sk, +const struct sk_buff *skb, +const enum sk_rst_reason reason), - TP_ARGS(sk, skb), + TP_ARGS(sk, skb, reason), TP_STRUCT__entry( __field(const void *, skbaddr) __field(const void *, skaddr) __field(int, state) + __field(enum sk_rst_reason, reason) __array(__u8, saddr, sizeof(struct sockaddr_in6)) __array(__u8, daddr, sizeof(struct sockaddr_in6)) ), @@ -113,14 +132,24 @@ TRACE_EVENT(tcp_send_reset, */ TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, entry->saddr); } + __entry->reason = reason; ), - TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s", + TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s reason=%s", __entry->skbaddr, __entry->skaddr, __entry->saddr, __entry->daddr, - __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN") + __entry->state ? show_tcp_state_name(__entry->state) : "UNKNOWN", + __entry->reason < RST_REASON_START ? + __print_symbolic(__entry->reason, DEFINE_DROP_REASON(FN2, FNe2)) : + __print_symbolic(__entry->reason, DEFINE_RST_REASON(FN1, FNe1))) ); +#undef FN1 +#undef FNe1 + +#undef FN2 +#undef FNe2 + /* * tcp event with arguments sk * diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index d78412cf8566..461b4d2b7cfe 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -871,7 +871,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb, if (sk) arg.bound_dev_if = sk->sk_bound_dev_if; - trace_tcp_send_reset(sk, skb); + trace_tcp_send_reset(sk, skb, reason); BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) != offsetof(struct inet_timewait_sock, tw_bound_dev_if)); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 276d9d541b01..b08ffb17d5a0 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3612,7 +3612,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority, /* skb of trace_tcp_send_reset() keeps the skb that caused RST, * skb here is different to the troublesome skb, so use NULL */ - trace_tcp_send_reset(sk, NULL); + trace_tcp_send_reset(sk, NULL, SK_RST_REASON_NOT_SPECIFIED); } /* Send a crossed SYN-ACK during socket establishment. diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index c46095fb596c..6a4736ec3df0 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1133,7 +1133,7 @@ static void tcp_v6_send_reset(const struct sock *sk, struct sk_buff *skb, label = ip6_flowlabel(ipv6h); } - trace_tcp_send_reset(sk, skb); + trace_tcp_send_reset(sk, skb, reason); tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1, ipv6_get_dsfield(ipv6h), label, priority, txhash, -- 2.37.3
[PATCH net-next v6 5/7] mptcp: support rstreason for passive reset
From: Jason Xing It relys on what reset options in the skb are as rfc8684 says. Reusing this logic can save us much energy. This patch replaces most of the prior NOT_SPECIFIED reasons. Signed-off-by: Jason Xing --- net/mptcp/subflow.c | 22 +- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index ac867d277860..bde4a7fdee82 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -309,8 +309,13 @@ static struct dst_entry *subflow_v4_route_req(const struct sock *sk, return dst; dst_release(dst); - if (!req->syncookie) - tcp_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + if (!req->syncookie) { + struct mptcp_ext *mpext = mptcp_get_ext(skb); + enum sk_rst_reason reason; + + reason = convert_mptcp_reason(mpext->reset_reason); + tcp_request_sock_ops.send_reset(sk, skb, reason); + } return NULL; } @@ -377,8 +382,13 @@ static struct dst_entry *subflow_v6_route_req(const struct sock *sk, return dst; dst_release(dst); - if (!req->syncookie) - tcp6_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + if (!req->syncookie) { + struct mptcp_ext *mpext = mptcp_get_ext(skb); + enum sk_rst_reason reason; + + reason = convert_mptcp_reason(mpext->reset_reason); + tcp6_request_sock_ops.send_reset(sk, skb, reason); + } return NULL; } #endif @@ -783,6 +793,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, struct mptcp_subflow_request_sock *subflow_req; struct mptcp_options_received mp_opt; bool fallback, fallback_is_fatal; + enum sk_rst_reason reason; struct mptcp_sock *owner; struct sock *child; @@ -913,7 +924,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock *sk, tcp_rsk(req)->drop_req = true; inet_csk_prepare_for_destroy_sock(child); tcp_done(child); - req->rsk_ops->send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + reason = convert_mptcp_reason(mptcp_get_ext(skb)->reset_reason); + req->rsk_ops->send_reset(sk, skb, reason); /* The last child reference will be released by the caller */ return child; -- 2.37.3
[PATCH net-next v6 4/7] tcp: support rstreason for passive reset
From: Jason Xing Reuse the dropreason logic to show the exact reason of tcp reset, so we don't need to implement those duplicated reset reasons. This patch replaces all the prior NOT_SPECIFIED reasons. Signed-off-by: Jason Xing --- net/ipv4/tcp_ipv4.c | 8 net/ipv6/tcp_ipv6.c | 8 2 files changed, 8 insertions(+), 8 deletions(-) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 418d11902fa7..d78412cf8566 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1936,7 +1936,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - tcp_v4_send_reset(rsk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(rsk, skb, (u32)reason); discard: kfree_skb_reason(skb, reason); /* Be careful here. If this function gets more complicated and @@ -2278,7 +2278,7 @@ int tcp_v4_rcv(struct sk_buff *skb) } else { drop_reason = tcp_child_process(sk, nsk, skb); if (drop_reason) { - tcp_v4_send_reset(nsk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(nsk, skb, (u32)drop_reason); goto discard_and_relse; } sock_put(sk); @@ -2357,7 +2357,7 @@ int tcp_v4_rcv(struct sk_buff *skb) bad_packet: __TCP_INC_STATS(net, TCP_MIB_INERRS); } else { - tcp_v4_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(NULL, skb, (u32)drop_reason); } discard_it: @@ -2409,7 +2409,7 @@ int tcp_v4_rcv(struct sk_buff *skb) tcp_v4_timewait_ack(sk, skb); break; case TCP_TW_RST: - tcp_v4_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v4_send_reset(sk, skb, (u32)drop_reason); inet_twsk_deschedule_put(inet_twsk(sk)); goto discard_it; case TCP_TW_SUCCESS:; diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 017f6293b5f4..c46095fb596c 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1680,7 +1680,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(sk, skb, (u32)reason); discard: if (opt_skb) __kfree_skb(opt_skb); @@ -1865,7 +1865,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) } else { drop_reason = tcp_child_process(sk, nsk, skb); if (drop_reason) { - tcp_v6_send_reset(nsk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(nsk, skb, (u32)drop_reason); goto discard_and_relse; } sock_put(sk); @@ -1942,7 +1942,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) bad_packet: __TCP_INC_STATS(net, TCP_MIB_INERRS); } else { - tcp_v6_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(NULL, skb, (u32)drop_reason); } discard_it: @@ -1998,7 +1998,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) tcp_v6_timewait_ack(sk, skb); break; case TCP_TW_RST: - tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); + tcp_v6_send_reset(sk, skb, (u32)drop_reason); inet_twsk_deschedule_put(inet_twsk(sk)); goto discard_it; case TCP_TW_SUCCESS: -- 2.37.3
[PATCH net-next v6 3/7] rstreason: prepare for active reset
From: Jason Xing Like what we did to passive reset: only passing possible reset reason in each active reset path. No functional changes. Signed-off-by: Jason Xing --- include/net/tcp.h | 3 ++- net/ipv4/tcp.c| 15 ++- net/ipv4/tcp_output.c | 3 ++- net/ipv4/tcp_timer.c | 9 ++--- net/mptcp/protocol.c | 4 +++- net/mptcp/subflow.c | 5 +++-- 6 files changed, 26 insertions(+), 13 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index b935e1ae4caf..adeacc9aa28a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -670,7 +670,8 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue, void tcp_send_probe0(struct sock *); int tcp_write_wakeup(struct sock *, int mib); void tcp_send_fin(struct sock *sk); -void tcp_send_active_reset(struct sock *sk, gfp_t priority); +void tcp_send_active_reset(struct sock *sk, gfp_t priority, + enum sk_rst_reason reason); int tcp_send_synack(struct sock *); void tcp_push_one(struct sock *, unsigned int mss_now); void __tcp_send_ack(struct sock *sk, u32 rcv_nxt); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index f23b9ea5..4ec0f4feee00 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -275,6 +275,7 @@ #include #include #include +#include #include #include @@ -2811,7 +2812,8 @@ void __tcp_close(struct sock *sk, long timeout) /* Unread data was tossed, zap the connection. */ NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE); tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, sk->sk_allocation); + tcp_send_active_reset(sk, sk->sk_allocation, + SK_RST_REASON_NOT_SPECIFIED); } else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) { /* Check zero linger _after_ checking for unread data. */ sk->sk_prot->disconnect(sk, 0); @@ -2885,7 +2887,8 @@ void __tcp_close(struct sock *sk, long timeout) struct tcp_sock *tp = tcp_sk(sk); if (READ_ONCE(tp->linger2) < 0) { tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONLINGER); } else { @@ -2903,7 +2906,8 @@ void __tcp_close(struct sock *sk, long timeout) if (sk->sk_state != TCP_CLOSE) { if (tcp_check_oom(sk, 0)) { tcp_set_state(sk, TCP_CLOSE); - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONMEMORY); } else if (!check_net(sock_net(sk))) { @@ -3007,7 +3011,7 @@ int tcp_disconnect(struct sock *sk, int flags) /* The last check adjusts for discrepancy of Linux wrt. RFC * states */ - tcp_send_active_reset(sk, gfp_any()); + tcp_send_active_reset(sk, gfp_any(), SK_RST_REASON_NOT_SPECIFIED); WRITE_ONCE(sk->sk_err, ECONNRESET); } else if (old_state == TCP_SYN_SENT) WRITE_ONCE(sk->sk_err, ECONNRESET); @@ -4564,7 +4568,8 @@ int tcp_abort(struct sock *sk, int err) smp_wmb(); sk_error_report(sk); if (tcp_need_reset(sk->sk_state)) - tcp_send_active_reset(sk, GFP_ATOMIC); + tcp_send_active_reset(sk, GFP_ATOMIC, + SK_RST_REASON_NOT_SPECIFIED); tcp_done(sk); } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 61119d42b0fd..276d9d541b01 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3586,7 +3586,8 @@ void tcp_send_fin(struct sock *sk) * was unread data in the receive queue. This behavior is recommended * by RFC 2525, section 2.17. -DaveM */ -void tcp_send_active_reset(struct sock *sk, gfp_t priority) +void tcp_send_active_reset(struct sock *sk, gfp_t priority, + enum sk_rst_reason reason) { struct sk_buff *skb; diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 976db57b95d4..83fe7f62f7f1 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -22,6 +22,7 @@ #include #include #include +#include static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk) { @@ -127,7 +128,8 @@ static int tcp_out_of_resources(struct sock *sk, bool do_reset) (!tp->snd_wnd && !tp->packets_out))
[PATCH net-next v6 2/7] rstreason: prepare for passive reset
From: Jason Xing Adjust the parameter and support passing reason of reset which is for now NOT_SPECIFIED. No functional changes. Signed-off-by: Jason Xing --- include/net/request_sock.h | 4 +++- net/dccp/ipv4.c| 10 ++ net/dccp/ipv6.c| 10 ++ net/dccp/minisocks.c | 3 ++- net/ipv4/tcp_ipv4.c| 12 +++- net/ipv4/tcp_minisocks.c | 3 ++- net/ipv6/tcp_ipv6.c| 15 +-- net/mptcp/subflow.c| 8 +--- 8 files changed, 40 insertions(+), 25 deletions(-) diff --git a/include/net/request_sock.h b/include/net/request_sock.h index 004e651e6067..bdc737832da6 100644 --- a/include/net/request_sock.h +++ b/include/net/request_sock.h @@ -18,6 +18,7 @@ #include #include +#include struct request_sock; struct sk_buff; @@ -34,7 +35,8 @@ struct request_sock_ops { void(*send_ack)(const struct sock *sk, struct sk_buff *skb, struct request_sock *req); void(*send_reset)(const struct sock *sk, - struct sk_buff *skb); + struct sk_buff *skb, + enum sk_rst_reason reason); void(*destructor)(struct request_sock *req); void(*syn_ack_timeout)(const struct request_sock *req); }; diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c index 9fc9cea4c251..ff41bd6f99c3 100644 --- a/net/dccp/ipv4.c +++ b/net/dccp/ipv4.c @@ -24,6 +24,7 @@ #include #include #include +#include #include "ackvec.h" #include "ccid.h" @@ -521,7 +522,8 @@ static int dccp_v4_send_response(const struct sock *sk, struct request_sock *req return err; } -static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb) +static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb, + enum sk_rst_reason reason) { int err; const struct iphdr *rxiph; @@ -706,7 +708,7 @@ int dccp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - dccp_v4_ctl_send_reset(sk, skb); + dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); kfree_skb(skb); return 0; } @@ -869,7 +871,7 @@ static int dccp_v4_rcv(struct sk_buff *skb) if (nsk == sk) { reqsk_put(req); } else if (dccp_child_process(sk, nsk, skb)) { - dccp_v4_ctl_send_reset(sk, skb); + dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); goto discard_and_relse; } else { sock_put(sk); @@ -909,7 +911,7 @@ static int dccp_v4_rcv(struct sk_buff *skb) if (dh->dccph_type != DCCP_PKT_RESET) { DCCP_SKB_CB(skb)->dccpd_reset_code = DCCP_RESET_CODE_NO_CONNECTION; - dccp_v4_ctl_send_reset(sk, skb); + dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); } discard_it: diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c index c8ca703dc331..85f4b8fdbe5e 100644 --- a/net/dccp/ipv6.c +++ b/net/dccp/ipv6.c @@ -29,6 +29,7 @@ #include #include #include +#include #include "dccp.h" #include "ipv6.h" @@ -256,7 +257,8 @@ static void dccp_v6_reqsk_destructor(struct request_sock *req) kfree_skb(inet_rsk(req)->pktopts); } -static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb) +static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff *rxskb, + enum sk_rst_reason reason) { const struct ipv6hdr *rxip6h; struct sk_buff *skb; @@ -656,7 +658,7 @@ static int dccp_v6_do_rcv(struct sock *sk, struct sk_buff *skb) return 0; reset: - dccp_v6_ctl_send_reset(sk, skb); + dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); discard: if (opt_skb != NULL) __kfree_skb(opt_skb); @@ -762,7 +764,7 @@ static int dccp_v6_rcv(struct sk_buff *skb) if (nsk == sk) { reqsk_put(req); } else if (dccp_child_process(sk, nsk, skb)) { - dccp_v6_ctl_send_reset(sk, skb); + dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); goto discard_and_relse; } else { sock_put(sk); @@ -801,7 +803,7 @@ static int dccp_v6_rcv(struct sk_buff *skb) if (dh->dccph_type != DCCP_PKT_RESET) { DCCP_SKB_CB(skb)->dccpd_reset_code = DCCP_RESET_CODE_NO_CONNECTION; - dccp_v6_ctl_send_reset(sk, skb); + dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED); } discard_it: diff --git a/net/dccp/minisocks.c
[PATCH net-next v6 1/7] net: introduce rstreason to detect why the RST is sent
From: Jason Xing Add a new standalone file for the easy future extension to support both active reset and passive reset in the TCP/DCCP/MPTCP protocols. This patch only does the preparations for reset reason mechanism, nothing else changes. The reset reasons are divided into three parts: 1) reuse drop reasons for passive reset in TCP 2) reuse MP_TCPRST option for MPTCP 3) our own reasons I will implement the basic codes of active/passive reset reason in those three protocols, which is not complete for this moment. But it provides a new chance to let other people add more reasons into it:) Signed-off-by: Jason Xing --- include/net/rstreason.h | 93 + 1 file changed, 93 insertions(+) create mode 100644 include/net/rstreason.h diff --git a/include/net/rstreason.h b/include/net/rstreason.h new file mode 100644 index ..0c3fa55fa62f --- /dev/null +++ b/include/net/rstreason.h @@ -0,0 +1,93 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ + +#ifndef _LINUX_RSTREASON_H +#define _LINUX_RSTREASON_H +#include + +#define DEFINE_RST_REASON(FN, FNe) \ + FN(MPTCP_RST_EUNSPEC) \ + FN(MPTCP_RST_EMPTCP)\ + FN(MPTCP_RST_ERESOURCE) \ + FN(MPTCP_RST_EPROHIBIT) \ + FN(MPTCP_RST_EWQ2BIG) \ + FN(MPTCP_RST_EBADPERF) \ + FN(MPTCP_RST_EMIDDLEBOX)\ + FN(NOT_SPECIFIED) \ + FNe(MAX) + +#define RST_REASON_START (SKB_DROP_REASON_MAX + 1) + +/* There are three parts in order: + * 1) 0 - SKB_DROP_REASON_MAX: rely on drop reasons for passive reset in TCP + * 2) SKB_DROP_REASON_MAX + 1 - MPTCP_RST_EMIDDLEBOX: for MPTCP use + * 3) MPTCP_RST_EMIDDLEBOX - SK_RST_REASON_MAX: independent reset reason + */ +enum sk_rst_reason { + /* Leave this 'blank' part (0-SKB_DROP_REASON_MAX) for the reuse +* of skb drop reason because rst reason relies on what drop reason +* indicates exactly why it could happen. +*/ + + /* Copy from include/uapi/linux/mptcp.h. +* These reset fields will not be changed since they adhere to +* RFC 8684. So do not touch them. I'm going to list each definition +* of them respectively. +*/ + /* Unspecified error. +* This is the default error; it implies that the subflow is no +* longer available. The presence of this option shows that the +* RST was generated by an MPTCP-aware device. +*/ + SK_RST_REASON_MPTCP_RST_EUNSPEC = RST_REASON_START, + /* MPTCP-specific error. +* An error has been detected in the processing of MPTCP options. +* This is the usual reason code to return in the cases where a RST +* is being sent to close a subflow because of an invalid response. +*/ + SK_RST_REASON_MPTCP_RST_EMPTCP, + /* Lack of resources. +* This code indicates that the sending host does not have enough +* resources to support the terminated subflow. +*/ + SK_RST_REASON_MPTCP_RST_ERESOURCE, + /* Administratively prohibited. +* This code indicates that the requested subflow is prohibited by +* the policies of the sending host. +*/ + SK_RST_REASON_MPTCP_RST_EPROHIBIT, + /* Too much outstanding data. +* This code indicates that there is an excessive amount of data +* that needs to be transmitted over the terminated subflow while +* having already been acknowledged over one or more other subflows. +* This may occur if a path has been unavailable for a short period +* and it is more efficient to reset and start again than it is to +* retransmit the queued data. +*/ + SK_RST_REASON_MPTCP_RST_EWQ2BIG, + /* Unacceptable performance. +* This code indicates that the performance of this subflow was +* too low compared to the other subflows of this Multipath TCP +* connection. +*/ + SK_RST_REASON_MPTCP_RST_EBADPERF, + /* Middlebox interference. +* Middlebox interference has been detected over this subflow, +* making MPTCP signaling invalid. For example, this may be sent +* if the checksum does not validate. +*/ + SK_RST_REASON_MPTCP_RST_EMIDDLEBOX, + + /* For the real standalone socket reset reason, we start from here */ + SK_RST_REASON_NOT_SPECIFIED, + + /* Maximum of socket reset reasons. +* It shouldn't be used as a real 'reason'. +*/ + SK_RST_REASON_MAX, +}; + +static inline enum sk_rst_reason convert_mptcp_reason(u32 reason) +{ + return reason += RST_REASON_START; +} +#endif -- 2.37.3
[PATCH net-next v6 0/7] Implement reset reason mechanism to detect
From: Jason Xing In production, there are so many cases about why the RST skb is sent but we don't have a very convenient/fast method to detect the exact underlying reasons. RST is implemented in two kinds: passive kind (like tcp_v4_send_reset()) and active kind (like tcp_send_active_reset()). The former can be traced carefully 1) in TCP, with the help of drop reasons, which is based on Eric's idea[1], 2) in MPTCP, with the help of reset options defined in RFC 8684. The latter is relatively independent, which should be implemented on our own. In this series, I focus on the fundamental implement mostly about how the rstreason mechnism works and give the detailed passive part as an example, not including the active reset part. In future, we can go further and refine those NOT_SPECIFIED reasons. Here are some examples when tracing: -0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED -0 [002] ..s1. 1830.262425: tcp_send_reset: skbaddr=x skaddr=x src=x dest=x state=x reason=NO_SOCKET [1] Link: https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/ v6 1. add back casts, or else they are treated as error. v5 Link: https://lore.kernel.org/all/2024045630.38420-1-kerneljasonx...@gmail.com/ 1. address format issue (like reverse xmas tree) (Eric, Paolo) 2. remove unnecessary casts. (Eric) 3. introduce a helper used in mptcp active reset. See patch 6. (Paolo) v4 Link: https://lore.kernel.org/all/20240409100934.37725-1-kerneljasonx...@gmail.com/ 1. passing 'enum sk_rst_reason' for readability when tracing (Antoine) v3 Link: https://lore.kernel.org/all/20240404072047.11490-1-kerneljasonx...@gmail.com/ 1. rebase (mptcp part) and address what Mat suggested. v2 Link: https://lore.kernel.org/all/20240403185033.47ebc...@kernel.org/ 1. rebase against the latest net-next tree Jason Xing (7): net: introduce rstreason to detect why the RST is sent rstreason: prepare for passive reset rstreason: prepare for active reset tcp: support rstreason for passive reset mptcp: support rstreason for passive reset mptcp: introducing a helper into active reset logic rstreason: make it work in trace world include/net/request_sock.h | 4 +- include/net/rstreason.h| 93 ++ include/net/tcp.h | 3 +- include/trace/events/tcp.h | 37 +-- net/dccp/ipv4.c| 10 ++-- net/dccp/ipv6.c| 10 ++-- net/dccp/minisocks.c | 3 +- net/ipv4/tcp.c | 15 -- net/ipv4/tcp_ipv4.c| 14 +++--- net/ipv4/tcp_minisocks.c | 3 +- net/ipv4/tcp_output.c | 5 +- net/ipv4/tcp_timer.c | 9 ++-- net/ipv6/tcp_ipv6.c| 17 --- net/mptcp/protocol.c | 2 +- net/mptcp/protocol.h | 11 + net/mptcp/subflow.c| 27 --- 16 files changed, 216 insertions(+), 47 deletions(-) create mode 100644 include/net/rstreason.h -- 2.37.3
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On Wed, Apr 17, 2024 at 10:16:26AM +0200, Greg KH wrote: > > at the scripts used by stable developers - and maybe at the ML server - to > > catch different variations won't hurt, as it sounds likely that people will > > end messing up with a big name like "do-not-apply-to-stable", typing > > instead things like: > > > > do_not_apply_to_stable > > dont-apply-to-stable > > > > and other variants. > > I want this very explicit that someone does not want this applied, and > that it has a reason to do so. And if getting the email right to do so > is the issue with that, that's fine. This is a very rare case that > almost no one should normally hit. For using a comparable approach in haproxy on a daily basis, I do see the value in this. We just mark a lot of fixes "no backport needed" or "no backport needed unless blablabla" for everything that is only relevant to the dev tree, and that's a huge time saver for those working on the backports later. Maybe "not-for-stable" would be both shorter and easier to remember BTW ? Regards, Willy
Re: [PATCH v5 5/5] Documentation: Add reconnect process for VDUSE
On Tue, Apr 16, 2024 at 11:46 AM Jason Wang wrote: > > On Fri, Apr 12, 2024 at 9:31 PM Cindy Lu wrote: > > > > Add a document explaining the reconnect process, including what the > > Userspace App needs to do and how it works with the kernel. > > > > Signed-off-by: Cindy Lu > > --- > > Documentation/userspace-api/vduse.rst | 41 +++ > > 1 file changed, 41 insertions(+) > > > > diff --git a/Documentation/userspace-api/vduse.rst > > b/Documentation/userspace-api/vduse.rst > > index bdb880e01132..7faa83462e78 100644 > > --- a/Documentation/userspace-api/vduse.rst > > +++ b/Documentation/userspace-api/vduse.rst > > @@ -231,3 +231,44 @@ able to start the dataplane processing as follows: > > after the used ring is filled. > > > > For more details on the uAPI, please see include/uapi/linux/vduse.h. > > + > > +HOW VDUSE devices reconnection works > > + > > +1. What is reconnection? > > + > > + When the userspace application loads, it should establish a connection > > + to the vduse kernel device. Sometimes,the userspace application exists, > > I guess you meant "exists"? If yes, it should be better to say "exits > unexpectedly" > > > + and we want to support its restart and connect to the kernel device > > again > > + > > +2. How can I support reconnection in a userspace application? > > Better to say "How reconnection is supported"? > > > + > > +2.1 During initialization, the userspace application should first verify > > the > > +existence of the device "/dev/vduse/vduse_name". > > +If it doesn't exist, it means this is the first-time for connection. > > goto step 2.2 > > +If it exists, it means this is a reconnection, and we should goto step > > 2.3 > > + > > +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on > > +/dev/vduse/control. > > +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for > > +the reconnect information. The total memory size is > > PAGE_SIZE*vq_mumber. > > I think we need to mention that this should be part of the previous > "VDUSE devices are created as follows"? > > > + > > +2.3 Check if the information is suitable for reconnect > > +If this is reconnection : > > +Before attempting to reconnect, The userspace application needs to use > > the > > +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, > > VDUSE_DEV_GET_FEATURES...) > > +to get the information from kernel. > > +Please review the information and confirm if it is suitable to > > reconnect. > > Need to define "review" here and how to decide if it is not suitable > to reconnect. > > > + > > +2.4 Userspace application needs to mmap the memory to userspace > > +The userspace application requires mapping one page for every vq. > > These pages > > +should be used to save vq-related information during system running. > > Not a native speaker, but it looks better with > > "should be used by the userspace to store virtqueue specific information". > > > Additionally, > > +the application must define its own structure to store information for > > reconnection. > > + > > +2.5 Completed the initialization and running the application. > > +While the application is running, it is important to store relevant > > information > > +about reconnections in mapped pages. > > I think we need some link/code examples to demonstrate what needs to be > stored. > > > When calling the ioctl VDUSE_VQ_GET_INFO to > > +get vq information, it's necessary to check whether it's a > > reconnection. > > Better with some examples of codes. > > > If it is > > +a reconnection, the vq-related information must be get from the mapped > > pages. > > + > > +2.6 When the Userspace application exits, it is necessary to unmap all the > > +pages for reconnection > > This seems to be unnecessary, for example there could be an unexpected exit. > > Thanks > Thanks Jason, I will send a new version Thanks Cindy > > -- > > 2.43.0 > > >
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On Wed, Apr 17, 2024 at 09:09:26AM +0100, Mauro Carvalho Chehab wrote: > Em Wed, 17 Apr 2024 09:48:18 +0200 > Thorsten Leemhuis escreveu: > > > Hi kernel.org helpdesk! > > > > Could you please create the email alias > > do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null, > > just like sta...@kernel.org does? > > > > That's an idea GregKH brought up a few days ago here: > > https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/ > > > > To quote: > > > > > How about: > > > cc: # Reason goes here, and must be > > > present > > > > > > and we can make that address be routed to /dev/null just like > > > is? > > > > There was some discussion about using something shorter, but in the end > > there was no strong opposition and the thread ended a a few days ago. > > Heh, a shorter name would make it a lot easier to remember, specially > since not wanting a patch to go to stable is an exception... I bet > I'll never remember the right syntax, needing to look at the docs > every time it would be used. > > IMO, something like: > no-stable > or > nostable > > would do the trick and would be a lot easier to remember. > > Btw, IMO, it won't hurt accepting more than one variant that > could be allowed, e. g. using a regular expression like: > > (do)?[-_]?(nt|not?).*stable That's not going to work at the mailing list server, or with my scripts, sorry. > at the scripts used by stable developers - and maybe at the ML server - to > catch different variations won't hurt, as it sounds likely that people will > end messing up with a big name like "do-not-apply-to-stable", typing > instead things like: > > do_not_apply_to_stable > dont-apply-to-stable > > and other variants. I want this very explicit that someone does not want this applied, and that it has a reason to do so. And if getting the email right to do so is the issue with that, that's fine. This is a very rare case that almost no one should normally hit. And again, if maintainers don't want their tree to have Fixes: and AUTOBOT run on them, we can easily add that to our lists. That's the simpler and more explicit thing to do for those that do not want to have anything backported they do not explicitly mark as such (some subsystems do this already, like kvm and -mm and xfs, it's fine!). This all is here because of maintainers who do NOT want to do that. thanks, greg k-h
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
Em Wed, 17 Apr 2024 09:48:18 +0200 Thorsten Leemhuis escreveu: > Hi kernel.org helpdesk! > > Could you please create the email alias > do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null, > just like sta...@kernel.org does? > > That's an idea GregKH brought up a few days ago here: > https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/ > > To quote: > > > How about: > > cc: # Reason goes here, and must be > > present > > > > and we can make that address be routed to /dev/null just like > > is? > > There was some discussion about using something shorter, but in the end > there was no strong opposition and the thread ended a a few days ago. Heh, a shorter name would make it a lot easier to remember, specially since not wanting a patch to go to stable is an exception... I bet I'll never remember the right syntax, needing to look at the docs every time it would be used. IMO, something like: no-stable or nostable would do the trick and would be a lot easier to remember. Btw, IMO, it won't hurt accepting more than one variant that could be allowed, e. g. using a regular expression like: (do)?[-_]?(nt|not?).*stable at the scripts used by stable developers - and maybe at the ML server - to catch different variations won't hurt, as it sounds likely that people will end messing up with a big name like "do-not-apply-to-stable", typing instead things like: do_not_apply_to_stable dont-apply-to-stable and other variants. Just my 2 cents. Regards, Mauro
Re: [PATCH net-next v9] virtio_net: Support RX hash XDP hint
在 2024/4/17 下午3:18, Liang Chen 写道: The RSS hash report is a feature that's part of the virtio specification. Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost (still a work in progress as per [1]) support this feature. While the capability to obtain the RSS hash has been enabled in the normal path, it's currently missing in the XDP path. Therefore, we are introducing XDP hints through kfuncs to allow XDP programs to access the RSS hash. 1. https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r Signed-off-by: Liang Chen Reviewed-by: Heng Qi Thanks. --- Changes from v8: - move max table macro out of uAPI Changes from v7: - use table lookup for rss hash type Changes from v6: - fix a coding style issue Changes from v5: - Preservation of the hash value has been dropped, following the conclusion from discussions in V3 reviews. The virtio_net driver doesn't accessing/using the virtio_net_hdr after the XDP program execution, so nothing tragic should happen. As to the xdp program, if it smashes the entry in virtio header, it is likely buggy anyways. Additionally, looking up the Intel IGC driver, it also does not bother with this particular aspect. --- drivers/net/virtio_net.c | 43 1 file changed, 43 insertions(+) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index c22d1118a133..eb99bf6c555e 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -4621,6 +4621,48 @@ static void virtnet_set_big_packets(struct virtnet_info *vi, const int mtu) } } +#define VIRTIO_NET_HASH_REPORT_MAX_TABLE 10 +static enum xdp_rss_hash_type +virtnet_xdp_rss_type[VIRTIO_NET_HASH_REPORT_MAX_TABLE] = { + [VIRTIO_NET_HASH_REPORT_NONE] = XDP_RSS_TYPE_NONE, + [VIRTIO_NET_HASH_REPORT_IPv4] = XDP_RSS_TYPE_L3_IPV4, + [VIRTIO_NET_HASH_REPORT_TCPv4] = XDP_RSS_TYPE_L4_IPV4_TCP, + [VIRTIO_NET_HASH_REPORT_UDPv4] = XDP_RSS_TYPE_L4_IPV4_UDP, + [VIRTIO_NET_HASH_REPORT_IPv6] = XDP_RSS_TYPE_L3_IPV6, + [VIRTIO_NET_HASH_REPORT_TCPv6] = XDP_RSS_TYPE_L4_IPV6_TCP, + [VIRTIO_NET_HASH_REPORT_UDPv6] = XDP_RSS_TYPE_L4_IPV6_UDP, + [VIRTIO_NET_HASH_REPORT_IPv6_EX] = XDP_RSS_TYPE_L3_IPV6_EX, + [VIRTIO_NET_HASH_REPORT_TCPv6_EX] = XDP_RSS_TYPE_L4_IPV6_TCP_EX, + [VIRTIO_NET_HASH_REPORT_UDPv6_EX] = XDP_RSS_TYPE_L4_IPV6_UDP_EX +}; + +static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash, + enum xdp_rss_hash_type *rss_type) +{ + const struct xdp_buff *xdp = (void *)_ctx; + struct virtio_net_hdr_v1_hash *hdr_hash; + struct virtnet_info *vi; + u16 hash_report; + + if (!(xdp->rxq->dev->features & NETIF_F_RXHASH)) + return -ENODATA; + + vi = netdev_priv(xdp->rxq->dev); + hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len); + hash_report = __le16_to_cpu(hdr_hash->hash_report); + + if (hash_report >= VIRTIO_NET_HASH_REPORT_MAX_TABLE) + hash_report = VIRTIO_NET_HASH_REPORT_NONE; + + *rss_type = virtnet_xdp_rss_type[hash_report]; + *hash = __le32_to_cpu(hdr_hash->hash_value); + return 0; +} + +static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = { + .xmo_rx_hash= virtnet_xdp_rx_hash, +}; + static int virtnet_probe(struct virtio_device *vdev) { int i, err = -ENOMEM; @@ -4747,6 +4789,7 @@ static int virtnet_probe(struct virtio_device *vdev) VIRTIO_NET_RSS_HASH_TYPE_UDP_EX); dev->hw_features |= NETIF_F_RXHASH; + dev->xdp_metadata_ops = _xdp_metadata_ops; } if (vi->has_rss_hash_report)
Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote: > Hi kernel.org helpdesk! > > Could you please create the email alias > do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null, > just like sta...@kernel.org does? > > That's an idea GregKH brought up a few days ago here: > https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/ > > To quote: > > > How about: > > cc: # Reason goes here, and must be > > present > > > > and we can make that address be routed to /dev/null just like > > is? > > There was some discussion about using something shorter, but in the end > there was no strong opposition and the thread ended a a few days ago. > > The goal is to have a tag that developers can use in Linux kenrel > commits that have a Fixes: tag, but nevertheless should not be > backported by the stable-team without explicit request. The thread > linked above explains this in more detail. Once the address exists I'll > resubmit the patches in question that will document the tag. > > Is asking for this here like this the right way? If I need to file a > ticket somewhere or some ack from a higher authority, just let me know! I approve this message :) thanks, greg k-h
Re: [PATCH net-next v9] virtio_net: Support RX hash XDP hint
On 17/04/2024 09.18, Liang Chen wrote: The RSS hash report is a feature that's part of the virtio specification. Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost (still a work in progress as per [1]) support this feature. While the capability to obtain the RSS hash has been enabled in the normal path, it's currently missing in the XDP path. Therefore, we are introducing XDP hints through kfuncs to allow XDP programs to access the RSS hash. 1. https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r Signed-off-by: Liang Chen LGTM Acked-by: Jesper Dangaard Brouer
Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null
Hi kernel.org helpdesk! Could you please create the email alias do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null, just like sta...@kernel.org does? That's an idea GregKH brought up a few days ago here: https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/ To quote: > How about: > cc: # Reason goes here, and must be > present > > and we can make that address be routed to /dev/null just like > is? There was some discussion about using something shorter, but in the end there was no strong opposition and the thread ended a a few days ago. The goal is to have a tag that developers can use in Linux kenrel commits that have a Fixes: tag, but nevertheless should not be backported by the stable-team without explicit request. The thread linked above explains this in more detail. Once the address exists I'll resubmit the patches in question that will document the tag. Is asking for this here like this the right way? If I need to file a ticket somewhere or some ack from a higher authority, just let me know! Ciao, Thorsten
[PATCH net-next v9] virtio_net: Support RX hash XDP hint
The RSS hash report is a feature that's part of the virtio specification. Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost (still a work in progress as per [1]) support this feature. While the capability to obtain the RSS hash has been enabled in the normal path, it's currently missing in the XDP path. Therefore, we are introducing XDP hints through kfuncs to allow XDP programs to access the RSS hash. 1. https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r Signed-off-by: Liang Chen --- Changes from v8: - move max table macro out of uAPI Changes from v7: - use table lookup for rss hash type Changes from v6: - fix a coding style issue Changes from v5: - Preservation of the hash value has been dropped, following the conclusion from discussions in V3 reviews. The virtio_net driver doesn't accessing/using the virtio_net_hdr after the XDP program execution, so nothing tragic should happen. As to the xdp program, if it smashes the entry in virtio header, it is likely buggy anyways. Additionally, looking up the Intel IGC driver, it also does not bother with this particular aspect. --- drivers/net/virtio_net.c | 43 1 file changed, 43 insertions(+) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index c22d1118a133..eb99bf6c555e 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -4621,6 +4621,48 @@ static void virtnet_set_big_packets(struct virtnet_info *vi, const int mtu) } } +#define VIRTIO_NET_HASH_REPORT_MAX_TABLE 10 +static enum xdp_rss_hash_type +virtnet_xdp_rss_type[VIRTIO_NET_HASH_REPORT_MAX_TABLE] = { + [VIRTIO_NET_HASH_REPORT_NONE] = XDP_RSS_TYPE_NONE, + [VIRTIO_NET_HASH_REPORT_IPv4] = XDP_RSS_TYPE_L3_IPV4, + [VIRTIO_NET_HASH_REPORT_TCPv4] = XDP_RSS_TYPE_L4_IPV4_TCP, + [VIRTIO_NET_HASH_REPORT_UDPv4] = XDP_RSS_TYPE_L4_IPV4_UDP, + [VIRTIO_NET_HASH_REPORT_IPv6] = XDP_RSS_TYPE_L3_IPV6, + [VIRTIO_NET_HASH_REPORT_TCPv6] = XDP_RSS_TYPE_L4_IPV6_TCP, + [VIRTIO_NET_HASH_REPORT_UDPv6] = XDP_RSS_TYPE_L4_IPV6_UDP, + [VIRTIO_NET_HASH_REPORT_IPv6_EX] = XDP_RSS_TYPE_L3_IPV6_EX, + [VIRTIO_NET_HASH_REPORT_TCPv6_EX] = XDP_RSS_TYPE_L4_IPV6_TCP_EX, + [VIRTIO_NET_HASH_REPORT_UDPv6_EX] = XDP_RSS_TYPE_L4_IPV6_UDP_EX +}; + +static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash, + enum xdp_rss_hash_type *rss_type) +{ + const struct xdp_buff *xdp = (void *)_ctx; + struct virtio_net_hdr_v1_hash *hdr_hash; + struct virtnet_info *vi; + u16 hash_report; + + if (!(xdp->rxq->dev->features & NETIF_F_RXHASH)) + return -ENODATA; + + vi = netdev_priv(xdp->rxq->dev); + hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len); + hash_report = __le16_to_cpu(hdr_hash->hash_report); + + if (hash_report >= VIRTIO_NET_HASH_REPORT_MAX_TABLE) + hash_report = VIRTIO_NET_HASH_REPORT_NONE; + + *rss_type = virtnet_xdp_rss_type[hash_report]; + *hash = __le32_to_cpu(hdr_hash->hash_value); + return 0; +} + +static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = { + .xmo_rx_hash= virtnet_xdp_rx_hash, +}; + static int virtnet_probe(struct virtio_device *vdev) { int i, err = -ENOMEM; @@ -4747,6 +4789,7 @@ static int virtnet_probe(struct virtio_device *vdev) VIRTIO_NET_RSS_HASH_TYPE_UDP_EX); dev->hw_features |= NETIF_F_RXHASH; + dev->xdp_metadata_ops = _xdp_metadata_ops; } if (vi->has_rss_hash_report) -- 2.40.1