Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-17 Thread Jason Xing
On Wed, Apr 17, 2024 at 4:51 PM Jason Xing  wrote:
>
> From: Jason Xing 
>
> In production, there are so many cases about why the RST skb is sent but
> we don't have a very convenient/fast method to detect the exact underlying
> reasons.
>
> RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
> and active kind (like tcp_send_active_reset()). The former can be traced
> carefully 1) in TCP, with the help of drop reasons, which is based on
> Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
> RFC 8684. The latter is relatively independent, which should be
> implemented on our own.
>
> In this series, I focus on the fundamental implement mostly about how
> the rstreason mechnism works and give the detailed passive part as an
> example, not including the active reset part. In future, we can go
> further and refine those NOT_SPECIFIED reasons.
>
> Here are some examples when tracing:
> -0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
> skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
> -0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
> skaddr=x src=x dest=x state=x reason=NO_SOCKET
>
> [1]
> Link: 
> https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/
>
> v6
> 1. add back casts, or else they are treated as error.
>
> v5
> Link: 
> https://lore.kernel.org/all/2024045630.38420-1-kerneljasonx...@gmail.com/
> 1. address format issue (like reverse xmas tree) (Eric, Paolo)
> 2. remove unnecessary casts. (Eric)
> 3. introduce a helper used in mptcp active reset. See patch 6. (Paolo)
>
> v4
> Link: 
> https://lore.kernel.org/all/20240409100934.37725-1-kerneljasonx...@gmail.com/
> 1. passing 'enum sk_rst_reason' for readability when tracing (Antoine)
>
> v3
> Link: 
> https://lore.kernel.org/all/20240404072047.11490-1-kerneljasonx...@gmail.com/
> 1. rebase (mptcp part) and address what Mat suggested.
>
> v2
> Link: https://lore.kernel.org/all/20240403185033.47ebc...@kernel.org/
> 1. rebase against the latest net-next tree
>
>
>
> Jason Xing (7):
>   net: introduce rstreason to detect why the RST is sent
>   rstreason: prepare for passive reset
>   rstreason: prepare for active reset
>   tcp: support rstreason for passive reset
>   mptcp: support rstreason for passive reset
>   mptcp: introducing a helper into active reset logic
>   rstreason: make it work in trace world
>
>  include/net/request_sock.h |  4 +-
>  include/net/rstreason.h| 93 ++
>  include/net/tcp.h  |  3 +-
>  include/trace/events/tcp.h | 37 +--
>  net/dccp/ipv4.c| 10 ++--
>  net/dccp/ipv6.c| 10 ++--
>  net/dccp/minisocks.c   |  3 +-
>  net/ipv4/tcp.c | 15 --
>  net/ipv4/tcp_ipv4.c| 14 +++---
>  net/ipv4/tcp_minisocks.c   |  3 +-
>  net/ipv4/tcp_output.c  |  5 +-
>  net/ipv4/tcp_timer.c   |  9 ++--
>  net/ipv6/tcp_ipv6.c| 17 ---
>  net/mptcp/protocol.c   |  2 +-
>  net/mptcp/protocol.h   | 11 +
>  net/mptcp/subflow.c| 27 ---
>  16 files changed, 216 insertions(+), 47 deletions(-)
>  create mode 100644 include/net/rstreason.h
>
> --
> 2.37.3
>

Hello maintainers,

I'm not sure why the patch series has been changed to 'Changes
Requested', until now I don't think I need to change something.

Should I repost this series (keeping the v6 tag) and then wait for
more comments?

Thanks,
Jason



Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-17 Thread Jason Wang
On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin  wrote:
>
> On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > Add the function vduse_alloc_reconnnect_info_mem
> > and vduse_alloc_reconnnect_info_mem
> > These functions allow vduse to allocate and free memory for reconnection
> > information. The amount of memory allocated is vq_num pages.
> > Each VQS will map its own page where the reconnection information will be 
> > saved
> >
> > Signed-off-by: Cindy Lu 
> > ---
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 ++
> >  1 file changed, 40 insertions(+)
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index ef3c9681941e..2da659d5f4a8 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> >   int irq_effective_cpu;
> >   struct cpumask irq_affinity;
> >   struct kobject kobj;
> > + unsigned long vdpa_reconnect_vaddr;
> >  };
> >
> >  struct vduse_dev;
> > @@ -1105,6 +1106,38 @@ static void vduse_vq_update_effective_cpu(struct 
> > vduse_virtqueue *vq)
> >
> >   vq->irq_effective_cpu = curr_cpu;
> >  }
> > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> > +{
> > + unsigned long vaddr = 0;
> > + struct vduse_virtqueue *vq;
> > +
> > + for (int i = 0; i < dev->vq_num; i++) {
> > + /*page 0~ vq_num save the reconnect info for vq*/
> > + vq = dev->vqs[i];
> > + vaddr = get_zeroed_page(GFP_KERNEL);
>
>
> I don't get why you insist on stealing kernel memory for something
> that is just used by userspace to store data for its own use.
> Userspace does not lack ways to persist data, for example,
> create a regular file anywhere in the filesystem.

Good point. So the motivation here is to:

1) be self contained, no dependency for high speed persist data
storage like tmpfs
2) standardize the format in uAPI which allows reconnection from
arbitrary userspace, unfortunately, such effort was removed in new
versions

If the above doesn't make sense, we don't need to offer those pages by VDUSE.

Thanks


>
>
>
> > + if (vaddr == 0)
> > + return -ENOMEM;
> > +
> > + vq->vdpa_reconnect_vaddr = vaddr;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev)
> > +{
> > + struct vduse_virtqueue *vq;
> > +
> > + for (int i = 0; i < dev->vq_num; i++) {
> > + vq = dev->vqs[i];
> > +
> > + if (vq->vdpa_reconnect_vaddr)
> > + free_page(vq->vdpa_reconnect_vaddr);
> > + vq->vdpa_reconnect_vaddr = 0;
> > + }
> > +
> > + return 0;
> > +}
> >
> >  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >   unsigned long arg)
> > @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name)
> >   mutex_unlock(>lock);
> >   return -EBUSY;
> >   }
> > + vduse_free_reconnnect_info_mem(dev);
> > +
> >   dev->connected = true;
> >   mutex_unlock(>lock);
> >
> > @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct vduse_dev_config 
> > *config,
> >   ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num);
> >   if (ret)
> >   goto err_vqs;
> > + ret = vduse_alloc_reconnnect_info_mem(dev);
> > + if (ret < 0)
> > + goto err_mem;
> >
> >   __module_get(THIS_MODULE);
> >
> >   return 0;
> >  err_vqs:
> >   device_destroy(_class, MKDEV(MAJOR(vduse_major), dev->minor));
> > +err_mem:
> > + vduse_free_reconnnect_info_mem(dev);
> >  err_dev:
> >   idr_remove(_idr, dev->minor);
> >  err_idr:
> > --
> > 2.43.0
>




Re: [PATCH net-next v9] virtio_net: Support RX hash XDP hint

2024-04-17 Thread Jason Wang
On Wed, Apr 17, 2024 at 3:20 PM Liang Chen  wrote:
>
> The RSS hash report is a feature that's part of the virtio specification.
> Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> (still a work in progress as per [1]) support this feature. While the
> capability to obtain the RSS hash has been enabled in the normal path,
> it's currently missing in the XDP path. Therefore, we are introducing
> XDP hints through kfuncs to allow XDP programs to access the RSS hash.
>
> 1.
> https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r
>
> Signed-off-by: Liang Chen 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH v12 08/14] x86/sgx: Add basic EPC reclamation flow for cgroup

2024-04-17 Thread Huang, Kai




On 16/04/2024 3:20 pm, Haitao Huang wrote:

From: Kristen Carlson Accardi 

Currently in the EPC page allocation, the kernel simply fails the
allocation when the current EPC cgroup fails to charge due to its usage
reaching limit.  This is not ideal. When that happens, a better way is
to reclaim EPC page(s) from the current EPC cgroup (and/or its
descendants) to reduce its usage so the new allocation can succeed.

Add the basic building blocks to support per-cgroup reclamation.

Currently the kernel only has one place to reclaim EPC pages: the global
EPC LRU list.  To support the "per-cgroup" EPC reclaim, maintain an LRU
list for each EPC cgroup, and introduce a "cgroup" variant function to
reclaim EPC pages from a given EPC cgroup and its descendants.

Currently the kernel does the global EPC reclaim in sgx_reclaim_page().
It always tries to reclaim EPC pages in batch of SGX_NR_TO_SCAN (16)
pages.  Specifically, it always "scans", or "isolates" SGX_NR_TO_SCAN
pages from the global LRU, and then tries to reclaim these pages at once
for better performance.

Implement the "cgroup" variant EPC reclaim in a similar way, but keep
the implementation simple: 1) change sgx_reclaim_pages() to take an LRU
as input, and return the pages that are "scanned" and attempted for
reclamation (but not necessarily reclaimed successfully); 2) loop the
given EPC cgroup and its descendants and do the new sgx_reclaim_pages()
until SGX_NR_TO_SCAN pages are "scanned".

This implementation, encapsulated in sgx_cgroup_reclaim_pages(), always
tries to reclaim SGX_NR_TO_SCAN pages from the LRU of the given EPC
cgroup, and only moves to its descendants when there's no enough
reclaimable EPC pages to "scan" in its LRU.  It should be enough for
most cases.

Note, this simple implementation doesn't _exactly_ mimic the current
global EPC reclaim (which always tries to do the actual reclaim in batch
of SGX_NR_TO_SCAN pages): when LRUs have less than SGX_NR_TO_SCAN
reclaimable pages, the actual reclaim of EPC pages will be split into
smaller batches _across_ multiple LRUs with each being smaller than
SGX_NR_TO_SCAN pages.

A more precise way to mimic the current global EPC reclaim would be to
have a new function to only "scan" (or "isolate") SGX_NR_TO_SCAN pages
_across_ the given EPC cgroup _AND_ its descendants, and then do the
actual reclaim in one batch.  But this is unnecessarily complicated at
this stage.

Alternatively, the current sgx_reclaim_pages() could be changed to
return the actual "reclaimed" pages, but not "scanned" pages. However,
the reclamation is a lengthy process, forcing a successful reclamation
of predetermined number of pages may block the caller for too long. And
that may not be acceptable in some synchronous contexts, e.g., in
serving an ioctl().

With this building block in place, add synchronous reclamation support
in sgx_cgroup_try_charge(): trigger a call to
sgx_cgroup_reclaim_pages() if the cgroup reaches its limit and the
caller allows synchronous reclaim as indicated by s newly added
parameter.

A later patch will add support for asynchronous reclamation reusing
sgx_cgroup_reclaim_pages().

Note all reclaimable EPC pages are still tracked in the global LRU thus
no per-cgroup reclamation is actually active at the moment. Per-cgroup
tracking and reclamation will be turned on in the end after all
necessary infrastructure is in place.


Nit:

"all necessary infrastructures are in place", or, "all necessary 
building blocks are in place".


?



Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Tested-by: Jarkko Sakkinen 
---


Reviewed-by: Kai Huang 

More nitpickings below:

[...]


-static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg)
+static inline int sgx_cgroup_try_charge(struct sgx_cgroup *sgx_cg, enum 
sgx_reclaim reclaim)


Let's still wrap the text on 80-character basis.

I guess most people are more used to that.

[...]


-   epc_page = list_first_entry_or_null(_global_lru.reclaimable,
-   struct sgx_epc_page, list);
+   epc_page = list_first_entry_or_null(>reclaimable, struct 
sgx_epc_page, list);


Ditto.



Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-17 Thread Song Liu
On Tue, Apr 16, 2024 at 12:23 AM Mike Rapoport  wrote:
>
> On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote:
> > On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > > > +/**
> > > > + * enum execmem_type - types of executable memory ranges
> > > > + *
> > > > + * There are several subsystems that allocate executable memory.
> > > > + * Architectures define different restrictions on placement,
> > > > + * permissions, alignment and other parameters for memory that can be 
> > > > used
> > > > + * by these subsystems.
> > > > + * Types in this enum identify subsystems that allocate executable 
> > > > memory
> > > > + * and let architectures define parameters for ranges suitable for
> > > > + * allocations by each subsystem.
> > > > + *
> > > > + * @EXECMEM_DEFAULT: default parameters that would be used for types 
> > > > that
> > > > + * are not explcitly defined.
> > > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > > > + * @EXECMEM_KPROBES: parameters for kprobes
> > > > + * @EXECMEM_FTRACE: parameters for ftrace
> > > > + * @EXECMEM_BPF: parameters for BPF
> > > > + * @EXECMEM_TYPE_MAX:
> > > > + */
> > > > +enum execmem_type {
> > > > + EXECMEM_DEFAULT,
> > > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > > > + EXECMEM_KPROBES,
> > > > + EXECMEM_FTRACE,
> > > > + EXECMEM_BPF,
> > > > + EXECMEM_TYPE_MAX,
> > > > +};
> > >
> > > Can we please get a break-down of how all these types are actually
> > > different from one another?
> > >
> > > I'm thinking some platforms have a tiny immediate space (arm64 comes to
> > > mind) and has less strict placement constraints for some of them?
> >
> > Yeah, and really I'd *much* rather deal with that in arch code, as I have 
> > said
> > several times.
> >
> > For arm64 we have two bsaic restrictions:
> >
> > 1) Direct branches can go +/-128M
> >We can expand this range by having direct branches go to PLTs, at a
> >performance cost.
> >
> > 2) PREL32 relocations can go +/-2G
> >We cannot expand this further.
> >
> > * We don't need to allocate memory for ftrace. We do not use trampolines.
> >
> > * Kprobes XOL areas don't care about either of those; we don't place any
> >   PC-relative instructions in those. Maybe we want to in future.
> >
> > * Modules care about both; we'd *prefer* to place them within +/-128M of all
> >   other kernel/module code, but if there's no space we can use PLTs and 
> > expand
> >   that to +/-2G. Since modules can refreence other modules, that ends up
> >   actually being halved, and modules have to fit within some 2G window that
> >   also covers the kernel.

Is +/- 2G enough for all realistic use cases? If so, I guess we don't
really need
EXECMEM_ANYWHERE below?

> >
> > * I'm not sure about BPF's requirements; it seems happy doing the same as
> >   modules.
>
> BPF are happy with vmalloc().
>
> > So if we *must* use a common execmem allocator, what we'd reall want is our 
> > own
> > types, e.g.
> >
> >   EXECMEM_ANYWHERE
> >   EXECMEM_NOPLT
> >   EXECMEM_PREL32
> >
> > ... and then we use those in arch code to implement module_alloc() and 
> > friends.
>
> I'm looking at execmem_types more as definition of the consumers, maybe I
> should have named the enum execmem_consumer at the first place.

I think looking at execmem_type from consumers' point of view adds
unnecessary complexity. IIUC, for most (if not all) archs, ftrace, kprobe,
and bpf (and maybe also module text) all have the same requirements.
Did I miss something?

IOW, we have

enum execmem_type {
EXECMEM_DEFAULT,
EXECMEM_TEXT,
EXECMEM_KPROBES = EXECMEM_TEXT,
EXECMEM_FTRACE = EXECMEM_TEXT,
EXECMEM_BPF = EXECMEM_TEXT,  /* we may end up without
_KPROBE, _FTRACE, _BPF */
EXECMEM_DATA,  /* rw */
EXECMEM_RO_DATA,
EXECMEM_RO_AFTER_INIT,
EXECMEM_TYPE_MAX,
};

Does this make sense?

Thanks,
Song



Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-17 Thread Jarkko Sakkinen
On Wed Apr 17, 2024 at 1:04 AM EEST, Haitao Huang wrote:
> On Tue, 16 Apr 2024 11:08:11 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Tue Apr 16, 2024 at 5:54 PM EEST, Haitao Huang wrote:
> >> I did declare the configs in the config file but I missed it in my patch
> >> as stated earlier. IIUC, that would not cause this error though.
> >>
> >> Maybe I should exit with the skip code if no CGROUP_MISC (no more
> >> CGROUP_SGX_EPC) is configured?
> >
> > OK, so I wanted to do a distro kernel test here, and used the default
> > OpenSUSE kernel config. I need to check if it has CGROUP_MISC set.
> >
> >> tools/testing/selftests$ find . -name README
> >> ./futex/README
> >> ./tc-testing/README
> >> ./net/forwarding/README
> >> ./powerpc/nx-gzip/README
> >> ./ftrace/README
> >> ./arm64/signal/README
> >> ./arm64/fp/README
> >> ./arm64/README
> >> ./zram/README
> >> ./livepatch/README
> >> ./resctrl/README
> >
> > So is there a README because of override timeout parameter? Maybe it
> > should be just set to a high enough value?
> >
> > BR, Jarkko
> >
>
>
>  From the docs, I think we are supposed to use override.
> See: https://docs.kernel.org/dev-tools/kselftest.html#timeout-for-selftests

OK, fair enough :-) I did not know this.

BR, Jarkko



Re: [PATCH v4 14/15] kprobes: remove dependency on CONFIG_MODULES

2024-04-17 Thread Masami Hiramatsu
Hi Mike,

On Thu, 11 Apr 2024 19:00:50 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (IBM)" 
> 
> kprobes depended on CONFIG_MODULES because it has to allocate memory for
> code.
> 
> Since code allocations are now implemented with execmem, kprobes can be
> enabled in non-modular kernels.
> 
> Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
> modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
> dependency of CONFIG_KPROBES on CONFIG_MODULES.

Thanks for this work, but this conflicts with the latest fix in v6.9-rc4.
Also, can you use IS_ENABLED(CONFIG_MODULES) instead of #ifdefs in
function body? We have enough dummy functions for that, so it should
not make a problem.

Thank you,

> 
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/Kconfig|  2 +-
>  kernel/kprobes.c| 43 +
>  kernel/trace/trace_kprobe.c | 11 ++
>  3 files changed, 37 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index bc9e8e5dccd5..68177adf61a0 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -52,9 +52,9 @@ config GENERIC_ENTRY
>  
>  config KPROBES
>   bool "Kprobes"
> - depends on MODULES
>   depends on HAVE_KPROBES
>   select KALLSYMS
> + select EXECMEM
>   select TASKS_RCU if PREEMPTION
>   help
> Kprobes allows you to trap at almost any kernel address and
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index 047ca629ce49..90c056853e6f 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -1580,6 +1580,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
>   goto out;
>   }
>  
> +#ifdef CONFIG_MODULES
>   /* Check if 'p' is probing a module. */
>   *probed_mod = __module_text_address((unsigned long) p->addr);
>   if (*probed_mod) {
> @@ -1603,6 +1604,8 @@ static int check_kprobe_address_safe(struct kprobe *p,
>   ret = -ENOENT;
>   }
>   }
> +#endif
> +
>  out:
>   preempt_enable();
>   jump_label_unlock();
> @@ -2482,24 +2485,6 @@ int kprobe_add_area_blacklist(unsigned long start, 
> unsigned long end)
>   return 0;
>  }
>  
> -/* Remove all symbols in given area from kprobe blacklist */
> -static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
> end)
> -{
> - struct kprobe_blacklist_entry *ent, *n;
> -
> - list_for_each_entry_safe(ent, n, _blacklist, list) {
> - if (ent->start_addr < start || ent->start_addr >= end)
> - continue;
> - list_del(>list);
> - kfree(ent);
> - }
> -}
> -
> -static void kprobe_remove_ksym_blacklist(unsigned long entry)
> -{
> - kprobe_remove_area_blacklist(entry, entry + 1);
> -}
> -
>  int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long 
> *value,
>  char *type, char *sym)
>  {
> @@ -2564,6 +2549,25 @@ static int __init populate_kprobe_blacklist(unsigned 
> long *start,
>   return ret ? : arch_populate_kprobe_blacklist();
>  }
>  
> +#ifdef CONFIG_MODULES
> +/* Remove all symbols in given area from kprobe blacklist */
> +static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
> end)
> +{
> + struct kprobe_blacklist_entry *ent, *n;
> +
> + list_for_each_entry_safe(ent, n, _blacklist, list) {
> + if (ent->start_addr < start || ent->start_addr >= end)
> + continue;
> + list_del(>list);
> + kfree(ent);
> + }
> +}
> +
> +static void kprobe_remove_ksym_blacklist(unsigned long entry)
> +{
> + kprobe_remove_area_blacklist(entry, entry + 1);
> +}
> +
>  static void add_module_kprobe_blacklist(struct module *mod)
>  {
>   unsigned long start, end;
> @@ -2665,6 +2669,7 @@ static struct notifier_block kprobe_module_nb = {
>   .notifier_call = kprobes_module_callback,
>   .priority = 0
>  };
> +#endif
>  
>  void kprobe_free_init_mem(void)
>  {
> @@ -2724,8 +2729,10 @@ static int __init init_kprobes(void)
>   err = arch_init_kprobes();
>   if (!err)
>   err = register_die_notifier(_exceptions_nb);
> +#ifdef CONFIG_MODULES
>   if (!err)
>   err = register_module_notifier(_module_nb);
> +#endif
>  
>   kprobes_initialized = (err == 0);
>   kprobe_sysctls_init();
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index 14099cc17fc9..f0610137d6a3 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -111,6 +111,7 @@ static nokprobe_inline bool 
> trace_kprobe_within_module(struct trace_kprobe *tk,
>   return strncmp(module_name(mod), name, len) == 0 && name[len] == ':';
>  }
>  
> +#ifdef CONFIG_MODULES
>  static nokprobe_inline bool trace_kprobe_module_exist(struct trace_kprobe 
> *tk)
>  {
>   char *p;
> @@ -129,6 +130,12 @@ static nokprobe_inline bool 
> 

Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-17 Thread Google
On Thu, 11 Apr 2024 19:00:41 +0300
Mike Rapoport  wrote:

> From: "Mike Rapoport (IBM)" 
> 
> module_alloc() is used everywhere as a mean to allocate memory for code.
> 
> Beside being semantically wrong, this unnecessarily ties all subsystems
> that need to allocate code, such as ftrace, kprobes and BPF to modules and
> puts the burden of code allocation to the modules code.
> 
> Several architectures override module_alloc() because of various
> constraints where the executable memory can be located and this causes
> additional obstacles for improvements of code allocation.
> 
> Start splitting code allocation from modules by introducing execmem_alloc()
> and execmem_free() APIs.
> 
> Initially, execmem_alloc() is a wrapper for module_alloc() and
> execmem_free() is a replacement of module_memfree() to allow updating all
> call sites to use the new APIs.
> 
> Since architectures define different restrictions on placement,
> permissions, alignment and other parameters for memory that can be used by
> different subsystems that allocate executable memory, execmem_alloc() takes
> a type argument, that will be used to identify the calling subsystem and to
> allow architectures define parameters for ranges suitable for that
> subsystem.
> 

This looks good to me for the kprobe part.

Acked-by: Masami Hiramatsu (Google) 

Thank you,

> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/powerpc/kernel/kprobes.c|  6 ++--
>  arch/s390/kernel/ftrace.c|  4 +--
>  arch/s390/kernel/kprobes.c   |  4 +--
>  arch/s390/kernel/module.c|  5 +--
>  arch/sparc/net/bpf_jit_comp_32.c |  8 ++---
>  arch/x86/kernel/ftrace.c |  6 ++--
>  arch/x86/kernel/kprobes/core.c   |  4 +--
>  include/linux/execmem.h  | 57 
>  include/linux/moduleloader.h |  3 --
>  kernel/bpf/core.c|  6 ++--
>  kernel/kprobes.c |  8 ++---
>  kernel/module/Kconfig|  1 +
>  kernel/module/main.c | 25 +-
>  mm/Kconfig   |  3 ++
>  mm/Makefile  |  1 +
>  mm/execmem.c | 26 +++
>  16 files changed, 122 insertions(+), 45 deletions(-)
>  create mode 100644 include/linux/execmem.h
>  create mode 100644 mm/execmem.c
> 
> diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
> index bbca90a5e2ec..9fcd01bb2ce6 100644
> --- a/arch/powerpc/kernel/kprobes.c
> +++ b/arch/powerpc/kernel/kprobes.c
> @@ -19,8 +19,8 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -130,7 +130,7 @@ void *alloc_insn_page(void)
>  {
>   void *page;
>  
> - page = module_alloc(PAGE_SIZE);
> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>   if (!page)
>   return NULL;
>  
> @@ -142,7 +142,7 @@ void *alloc_insn_page(void)
>   }
>   return page;
>  error:
> - module_memfree(page);
> + execmem_free(page);
>   return NULL;
>  }
>  
> diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
> index c46381ea04ec..798249ef5646 100644
> --- a/arch/s390/kernel/ftrace.c
> +++ b/arch/s390/kernel/ftrace.c
> @@ -7,13 +7,13 @@
>   *   Author(s): Martin Schwidefsky 
>   */
>  
> -#include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void)
>  {
>   const char *start, *end;
>  
> - ftrace_plt = module_alloc(PAGE_SIZE);
> + ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE);
>   if (!ftrace_plt)
>   panic("cannot allocate ftrace plt\n");
>  
> diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
> index f0cf20d4b3c5..3c1b1be744de 100644
> --- a/arch/s390/kernel/kprobes.c
> +++ b/arch/s390/kernel/kprobes.c
> @@ -9,7 +9,6 @@
>  
>  #define pr_fmt(fmt) "kprobes: " fmt
>  
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -21,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -38,7 +38,7 @@ void *alloc_insn_page(void)
>  {
>   void *page;
>  
> - page = module_alloc(PAGE_SIZE);
> + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
>   if (!page)
>   return NULL;
>   set_memory_rox((unsigned long)page, 1);
> diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
> index 42215f9404af..ac97a905e8cd 100644
> --- a/arch/s390/kernel/module.c
> +++ b/arch/s390/kernel/module.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
>  #ifdef CONFIG_FUNCTION_TRACER
>  void module_arch_cleanup(struct module *mod)
>  {
> - module_memfree(mod->arch.trampolines_start);
> + execmem_free(mod->arch.trampolines_start);
>  }
>  #endif
>  
> @@ -510,7 +511,7 @@ 

Re: [PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads

2024-04-17 Thread Google
Sorry, this is actually v3. (miss-configured...)

Thanks,

On Thu, 18 Apr 2024 05:46:34 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Currently, kprobe event checks whether the target symbol name is unique
> or not, so that it does not put a probe on an unexpected place. But this
> skips the check if the target is on a module because the module may not
> be loaded.
> 
> To fix this issue, this patch checks the number of probe target symbols
> in a target module when the module is loaded. If the probe is not on the
> unique name symbols in the module, it will be rejected at that point.
> 
> Note that the symbol which has a unique name in the target module,
> it will be accepted even if there are same-name symbols in the
> kernel or other modules,
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  Changes in v3:
>   - Update the patch description.
>  Updated from last October post, which was dropped by test failure:
> 
> https://lore.kernel.org/linux-trace-kernel/169854904604.132316.12500381416261460174.stgit@devnote2/
>  Changes in v2:
>   - Fix to skip checking uniqueness if the target module is not loaded.
>   - Fix register_module_trace_kprobe() to pass correct symbol name.
>   - Fix to call __register_trace_kprobe() from module callback.
> ---
>  kernel/trace/trace_kprobe.c |  125 
> ---
>  1 file changed, 81 insertions(+), 44 deletions(-)
> 
> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index c68d4e830fbe..0113afe2662d 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -670,6 +670,21 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
>   return ret;
>  }
>  
> +static int validate_module_probe_symbol(const char *modname, const char 
> *symbol);
> +
> +static int register_module_trace_kprobe(struct module *mod, struct 
> trace_kprobe *tk)
> +{
> + const char *p;
> + int ret = 0;
> +
> + p = strchr(trace_kprobe_symbol(tk), ':');
> + if (p)
> + ret = validate_module_probe_symbol(module_name(mod), p + 1);
> + if (!ret)
> + ret = __register_trace_kprobe(tk);
> + return ret;
> +}
> +
>  /* Module notifier call back, checking event on the module */
>  static int trace_kprobe_module_callback(struct notifier_block *nb,
>  unsigned long val, void *data)
> @@ -688,7 +703,7 @@ static int trace_kprobe_module_callback(struct 
> notifier_block *nb,
>   if (trace_kprobe_within_module(tk, mod)) {
>   /* Don't need to check busy - this should have gone. */
>   __unregister_trace_kprobe(tk);
> - ret = __register_trace_kprobe(tk);
> + ret = register_module_trace_kprobe(mod, tk);
>   if (ret)
>   pr_warn("Failed to re-register probe %s on %s: 
> %d\n",
>   trace_probe_name(>tp),
> @@ -729,17 +744,68 @@ static int count_mod_symbols(void *data, const char 
> *name, unsigned long unused)
>   return 0;
>  }
>  
> -static unsigned int number_of_same_symbols(char *func_name)
> +static unsigned int number_of_same_symbols(const char *mod, const char 
> *func_name)
>  {
>   struct sym_count_ctx ctx = { .count = 0, .name = func_name };
>  
> - kallsyms_on_each_match_symbol(count_symbols, func_name, );
> + if (!mod)
> + kallsyms_on_each_match_symbol(count_symbols, func_name, 
> );
>  
> - module_kallsyms_on_each_symbol(NULL, count_mod_symbols, );
> + module_kallsyms_on_each_symbol(mod, count_mod_symbols, );
>  
>   return ctx.count;
>  }
>  
> +static int validate_module_probe_symbol(const char *modname, const char 
> *symbol)
> +{
> + unsigned int count = number_of_same_symbols(modname, symbol);
> +
> + if (count > 1) {
> + /*
> +  * Users should use ADDR to remove the ambiguity of
> +  * using KSYM only.
> +  */
> + return -EADDRNOTAVAIL;
> + } else if (count == 0) {
> + /*
> +  * We can return ENOENT earlier than when register the
> +  * kprobe.
> +  */
> + return -ENOENT;
> + }
> + return 0;
> +}
> +
> +static int validate_probe_symbol(char *symbol)
> +{
> + struct module *mod = NULL;
> + char *modname = NULL, *p;
> + int ret = 0;
> +
> + p = strchr(symbol, ':');
> + if (p) {
> + modname = symbol;
> + symbol = p + 1;
> + *p = '\0';
> + /* Return 0 (defer) if the module does not exist yet. */
> + rcu_read_lock_sched();
> + mod = find_module(modname);
> + if (mod && !try_module_get(mod))
> + mod = NULL;
> + rcu_read_unlock_sched();
> + if (!mod)
> + goto out;
> + }
> +
> 

[PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads

2024-04-17 Thread Masami Hiramatsu (Google)
From: Masami Hiramatsu (Google) 

Currently, kprobe event checks whether the target symbol name is unique
or not, so that it does not put a probe on an unexpected place. But this
skips the check if the target is on a module because the module may not
be loaded.

To fix this issue, this patch checks the number of probe target symbols
in a target module when the module is loaded. If the probe is not on the
unique name symbols in the module, it will be rejected at that point.

Note that the symbol which has a unique name in the target module,
it will be accepted even if there are same-name symbols in the
kernel or other modules,

Signed-off-by: Masami Hiramatsu (Google) 
---
 Changes in v3:
  - Update the patch description.
 Updated from last October post, which was dropped by test failure:

https://lore.kernel.org/linux-trace-kernel/169854904604.132316.12500381416261460174.stgit@devnote2/
 Changes in v2:
  - Fix to skip checking uniqueness if the target module is not loaded.
  - Fix register_module_trace_kprobe() to pass correct symbol name.
  - Fix to call __register_trace_kprobe() from module callback.
---
 kernel/trace/trace_kprobe.c |  125 ---
 1 file changed, 81 insertions(+), 44 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index c68d4e830fbe..0113afe2662d 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -670,6 +670,21 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
return ret;
 }
 
+static int validate_module_probe_symbol(const char *modname, const char 
*symbol);
+
+static int register_module_trace_kprobe(struct module *mod, struct 
trace_kprobe *tk)
+{
+   const char *p;
+   int ret = 0;
+
+   p = strchr(trace_kprobe_symbol(tk), ':');
+   if (p)
+   ret = validate_module_probe_symbol(module_name(mod), p + 1);
+   if (!ret)
+   ret = __register_trace_kprobe(tk);
+   return ret;
+}
+
 /* Module notifier call back, checking event on the module */
 static int trace_kprobe_module_callback(struct notifier_block *nb,
   unsigned long val, void *data)
@@ -688,7 +703,7 @@ static int trace_kprobe_module_callback(struct 
notifier_block *nb,
if (trace_kprobe_within_module(tk, mod)) {
/* Don't need to check busy - this should have gone. */
__unregister_trace_kprobe(tk);
-   ret = __register_trace_kprobe(tk);
+   ret = register_module_trace_kprobe(mod, tk);
if (ret)
pr_warn("Failed to re-register probe %s on %s: 
%d\n",
trace_probe_name(>tp),
@@ -729,17 +744,68 @@ static int count_mod_symbols(void *data, const char 
*name, unsigned long unused)
return 0;
 }
 
-static unsigned int number_of_same_symbols(char *func_name)
+static unsigned int number_of_same_symbols(const char *mod, const char 
*func_name)
 {
struct sym_count_ctx ctx = { .count = 0, .name = func_name };
 
-   kallsyms_on_each_match_symbol(count_symbols, func_name, );
+   if (!mod)
+   kallsyms_on_each_match_symbol(count_symbols, func_name, 
);
 
-   module_kallsyms_on_each_symbol(NULL, count_mod_symbols, );
+   module_kallsyms_on_each_symbol(mod, count_mod_symbols, );
 
return ctx.count;
 }
 
+static int validate_module_probe_symbol(const char *modname, const char 
*symbol)
+{
+   unsigned int count = number_of_same_symbols(modname, symbol);
+
+   if (count > 1) {
+   /*
+* Users should use ADDR to remove the ambiguity of
+* using KSYM only.
+*/
+   return -EADDRNOTAVAIL;
+   } else if (count == 0) {
+   /*
+* We can return ENOENT earlier than when register the
+* kprobe.
+*/
+   return -ENOENT;
+   }
+   return 0;
+}
+
+static int validate_probe_symbol(char *symbol)
+{
+   struct module *mod = NULL;
+   char *modname = NULL, *p;
+   int ret = 0;
+
+   p = strchr(symbol, ':');
+   if (p) {
+   modname = symbol;
+   symbol = p + 1;
+   *p = '\0';
+   /* Return 0 (defer) if the module does not exist yet. */
+   rcu_read_lock_sched();
+   mod = find_module(modname);
+   if (mod && !try_module_get(mod))
+   mod = NULL;
+   rcu_read_unlock_sched();
+   if (!mod)
+   goto out;
+   }
+
+   ret = validate_module_probe_symbol(modname, symbol);
+out:
+   if (p)
+   *p = ':';
+   if (mod)
+   module_put(mod);
+   return ret;
+}
+
 static int trace_kprobe_entry_handler(struct kretprobe_instance *ri,
  

Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Florian Fainelli

On 4/17/24 01:48, Willy Tarreau wrote:

On Wed, Apr 17, 2024 at 10:16:26AM +0200, Greg KH wrote:

at the scripts used by stable developers - and maybe at the ML server - to
catch different variations won't hurt, as it sounds likely that people will
end messing up with a big name like "do-not-apply-to-stable", typing
instead things like:

do_not_apply_to_stable
dont-apply-to-stable

and other variants.


I want this very explicit that someone does not want this applied, and
that it has a reason to do so.  And if getting the email right to do so
is the issue with that, that's fine.  This is a very rare case that
almost no one should normally hit.


For using a comparable approach in haproxy on a daily basis, I do see
the value in this. We just mark a lot of fixes "no backport needed" or
"no backport needed unless blablabla" for everything that is only
relevant to the dev tree, and that's a huge time saver for those working
on the backports later.

Maybe "not-for-stable" would be both shorter and easier to remember BTW ?


Yes, "not-for-stable" looks like a good name to me.
--
Florian




Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Mauro Carvalho Chehab
Em Wed, 17 Apr 2024 10:16:26 +0200
Greg KH  escreveu:

> On Wed, Apr 17, 2024 at 09:09:26AM +0100, Mauro Carvalho Chehab wrote:
> > Em Wed, 17 Apr 2024 09:48:18 +0200
> > Thorsten Leemhuis  escreveu:
> >   
> > > Hi kernel.org helpdesk!
> > > 
> > > Could you please create the email alias
> > > do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
> > > just like sta...@kernel.org does?
> > > 
> > > That's an idea GregKH brought up a few days ago here:
> > > https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/
> > > 
> > > To quote:
> > >   
> > > > How about:
> > > > cc:  # Reason goes here, and 
> > > > must be present
> > > > 
> > > > and we can make that address be routed to /dev/null just like
> > > >  is?
> > > 
> > > There was some discussion about using something shorter, but in the end
> > > there was no strong opposition and the thread ended a a few days ago.  
> > 
> > Heh, a shorter name would make it a lot easier to remember, specially
> > since not wanting a patch to go to stable is an exception... I bet
> > I'll never remember the right syntax, needing to look at the docs
> > every time it would be used.
> > 
> > IMO, something like:
> > no-stable
> > or
> > nostable
> > 
> > would do the trick and would be a lot easier to remember.
> > 
> > Btw, IMO, it won't hurt accepting more than one variant that
> > could be allowed, e. g. using a regular expression like:
> > 
> > (do)?[-_]?(nt|not?).*stable  
> 
> That's not going to work at the mailing list server, or with my scripts,
> sorry.

A setting like that would likely be at exim (or similar). Most smtp servers
allow some sort of wildcards, as those are used to pass or not e-mails to
list servers and/or handle custom mail forward rules.

At client level, one could use dovecot with pigeonhole to have sieve
rules to filter e-mails. That's what I do here.
 
> > at the scripts used by stable developers - and maybe at the ML server - to
> > catch different variations won't hurt, as it sounds likely that people will
> > end messing up with a big name like "do-not-apply-to-stable", typing
> > instead things like:
> > 
> > do_not_apply_to_stable
> > dont-apply-to-stable
> > 
> > and other variants.  
> 
> I want this very explicit that someone does not want this applied, and
> that it has a reason to do so.  And if getting the email right to do so
> is the issue with that, that's fine.  This is a very rare case that
> almost no one should normally hit.

Yeah, agreed: those are very rare exceptions. I remember just one or 
two cases where a media fix patch couldn't be queued to stable. 
The already-existing workflow worked for those.

> And again, if maintainers don't want their tree to have Fixes: and
> AUTOBOT run on them, we can easily add that to our lists.  That's the
> simpler and more explicit thing to do for those that do not want to have
> anything backported they do not explicitly mark as such (some subsystems
> do this already, like kvm and -mm and xfs, it's fine!).  This all is
> here because of maintainers who do NOT want to do that.

From my side, I'm fine with whatever-explicit-long-filter-email.

Regards,
Mauro




[PATCH RT 0/1] Linux v4.19.312-rt134-rc1

2024-04-17 Thread Daniel Wagner
Dear RT Folks,

This is the RT stable review cycle of patch 4.19.312-rt134-rc1.

Please scream at me if I messed something up. Please test the patches
too.

The -rc release is also available on kernel.org

  https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git

on the v4.19-rt-next branch.

If all goes well, this patch will be converted to the next main
release on 2024-04-24.

Signing key fingerprint:

  5BF6 7BC5 0826 72CA BB45  ACAE 587C 5ECA 5D0A 306C

All keys used for the above files and repositories can be found on the
following git repository:

   git://git.kernel.org/pub/scm/docs/kernel/pgpkeys.git

Enjoy!
Daniel

Changes from v4.19.307-rt133:


Daniel Wagner (1):
  Linux 4.19.312-rt134

 localversion-rt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.44.0




[PATCH RT 1/1] Linux 4.19.312-rt134

2024-04-17 Thread Daniel Wagner
v4.19.312-rt134-rc1 stable review patch.
If anyone has any objections, please let me know.

---


Signed-off-by: Daniel Wagner 
---
 localversion-rt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/localversion-rt b/localversion-rt
index c2c7e0fb6685..6067da4c8c99 100644
--- a/localversion-rt
+++ b/localversion-rt
@@ -1 +1 @@
--rt133
+-rt134
-- 
2.44.0




Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Konstantin Ryabitsev
On Wed, Apr 17, 2024 at 03:38:28PM +0200, Greg KH wrote:
> > That afaics makes them useless for the stable team (Greg may correct 
> > me
> > if I'm wrong here), as they deal with the commits and have no easy,
> > fast, and reliable way to look up the patch posting to query this. Or is
> > the "patch basement" available somehow in git for each commit and I just
> > missed that?
> 
> You are correct, as-is, that would make it useless for my tools.
> 
> BUT I could, if it's possible, just look up the original in lore somehow
> and parse that.  If it's there, does anyone have a "simple" way to map a
> git commit back to a lore message if it does NOT have a Link: line in
> it?

Our current way of doing it is going by patch-id, and it's not great 
either, because there is more than one way to create a patch from a git 
commit. We've discovered this when Linus recommended that people send 
patches with the --histogram algorithm, which broke a bunch of our 
stuff. :)

E.g. here's a recent commit that has a Fixes:

  git show c0297e7dd50795d559f3534887a6de1756b35d0f | git patch-id --stable | 
cut -d' ' -f1
  c2f5c42a5a3bc05ffacd9679dd367e4a2207b018

it successfully maps to the patch:
https://lore.kernel.org/all/?q=patchid%3Ac2f5c42a5a3bc05ffacd9679dd367e4a2207b018

I only put this here for academic purposes -- I really don't want you to 
go that route, because it's fragile (more fragile than git notes, and 
that's saying something).

-K



Re: [PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads

2024-04-17 Thread Google
On Tue, 16 Apr 2024 00:47:26 -0400
Steven Rostedt  wrote:

> On Mon, 15 Apr 2024 18:40:23 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > Check the number of probe target symbols in the target module when
> > the module is loaded. If the probe is not on the unique name symbols
> > in the module, it will be rejected at that point.
> > 
> > Note that the symbol which has a unique name in the target module,
> > it will be accepted even if there are same-name symbols in the
> > kernel or other modules,
> 
> This says what it does, but doesn't explain why it is doing it.
> What's the purpose of this patch?

Thank you for pointing it out, I just reused the description which I
sent last year. It needs to be updated.

-
Currently, kprobe event checks whether the target symbol name is
unique, so that it does not put a probe on unexpected place. But this
skips the check if the target is on a module. This fixes the issue
by checking the symbol is unique in the target module if the target
is on a module.
-

Thanks!

> 
> -- Steve


-- 
Masami Hiramatsu (Google) 



Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Greg KH
On Wed, Apr 17, 2024 at 03:21:12PM +0200, Thorsten Leemhuis wrote:
> On 17.04.24 14:52, Konstantin Ryabitsev wrote:
> > On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote:
> >>
> >> Could you please create the email alias
> >> do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
> >> just like sta...@kernel.org does?
> >>
> >> That's an idea GregKH brought up a few days ago here:
> >> https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/
> >>
> >> To quote:
> >>
> >>> How about:
> >>>   cc:  # Reason goes here, and must be 
> >>> present
> >>>
> >>> and we can make that address be routed to /dev/null just like
> >>>  is?
> 
> First, many thx for your feedback.
> 
> > That would make it into actual commits and probably irk maintainers and 
> > Linus, no?
> 
> Yup.
> 
> > I also don't really love the idea of overloading email 
> > addresses with additional semantics. Using Cc: stable kinda makes sense, 
> > even if it's not a real email address (but it could become at some 
> > point), but this feels different.
> 
> Okay.
> 
> > In general, I feel this information belongs in the patch basement (the 
> > place where change-id, base-commit, etc goes). E.g.:
> > 
> > stable-autosel: ignore
> > [This fix requires a feature that is only present in mainline]
> > 
> > This allows passing along structured information that can be parsed by 
> > automated tooling without putting it into the commit.
> 
> That afaics makes them useless for the stable team (Greg may correct me
> if I'm wrong here), as they deal with the commits and have no easy,
> fast, and reliable way to look up the patch posting to query this. Or is
> the "patch basement" available somehow in git for each commit and I just
> missed that?

You are correct, as-is, that would make it useless for my tools.

BUT I could, if it's possible, just look up the original in lore somehow
and parse that.  If it's there, does anyone have a "simple" way to map a
git commit back to a lore message if it does NOT have a Link: line in
it?

I guess I should look at setting up a local copy of lei to dig through
the git record of lkml?  But what about patches that aren't cc: to lkml,
I don't want to have to hit lore.kernel.org for each query, that's going
to not be nice.

> Guess in a better world we might use "git notes" for this, but we
> currently do not use them because they iirc are somewhat tricky (they
> are easily lost on rebases iirc is one of the reasons ) -- and starting
> to use them just for this is not worth it.

git notes never works for anything, and everyone always mentions it :)

> >> There was some discussion about using something shorter, but in the end
> >> there was no strong opposition and the thread ended a a few days ago.
> > I feel this is a significant change to the workflow, so I would like the 
> > workflows list to have another go at this topic. :)
> 
> FWIW, we could go back to what I initially proposed: use the existing
> stable tag with a pre-defined comment to mark patches that AUTOSEL et.
> al. should not pick up:
> https://lore.kernel.org/all/c0a08b160b286e8c98549eedb37404c6e784cf8a.1712812895.git.li...@leemhuis.info/

If you can pick a better string, possibly, yes.

But in the end, your proposal seems to imply:

cc: sta...@kernel.org   # Psych!  Just kidding, never backport this!

but really, that's just mean, and again, this is a VERY rare case you
are trying to automate here.  We have MUCH better and simpler ways for
maintainers to not have their subsystems scanned for stuff like this,
why are we spending all of our time on this topic?

thanks,

greg k-h



Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Konstantin Ryabitsev
On Wed, Apr 17, 2024 at 03:21:12PM +0200, Thorsten Leemhuis wrote:
> > In general, I feel this information belongs in the patch basement 
> > (the place where change-id, base-commit, etc goes). E.g.:
> > 
> > stable-autosel: ignore
> > [This fix requires a feature that is only present in mainline]
> > 
> > This allows passing along structured information that can be parsed by 
> > automated tooling without putting it into the commit.
> 
> That afaics makes them useless for the stable team (Greg may correct me
> if I'm wrong here), as they deal with the commits and have no easy,
> fast, and reliable way to look up the patch posting to query this. Or is
> the "patch basement" available somehow in git for each commit and I just
> missed that?

Ah, okay, my assumption was wrong, then.

How about meeting things halfway, then:

Cc: stable+noauto...@kernel.org # Reason

-K



Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Thorsten Leemhuis
On 17.04.24 14:52, Konstantin Ryabitsev wrote:
> On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote:
>>
>> Could you please create the email alias
>> do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
>> just like sta...@kernel.org does?
>>
>> That's an idea GregKH brought up a few days ago here:
>> https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/
>>
>> To quote:
>>
>>> How about:
>>> cc:  # Reason goes here, and must be 
>>> present
>>>
>>> and we can make that address be routed to /dev/null just like
>>>  is?

First, many thx for your feedback.

> That would make it into actual commits and probably irk maintainers and 
> Linus, no?

Yup.

> I also don't really love the idea of overloading email 
> addresses with additional semantics. Using Cc: stable kinda makes sense, 
> even if it's not a real email address (but it could become at some 
> point), but this feels different.

Okay.

> In general, I feel this information belongs in the patch basement (the 
> place where change-id, base-commit, etc goes). E.g.:
> 
> stable-autosel: ignore
> [This fix requires a feature that is only present in mainline]
> 
> This allows passing along structured information that can be parsed by 
> automated tooling without putting it into the commit.

That afaics makes them useless for the stable team (Greg may correct me
if I'm wrong here), as they deal with the commits and have no easy,
fast, and reliable way to look up the patch posting to query this. Or is
the "patch basement" available somehow in git for each commit and I just
missed that?

Guess in a better world we might use "git notes" for this, but we
currently do not use them because they iirc are somewhat tricky (they
are easily lost on rebases iirc is one of the reasons ) -- and starting
to use them just for this is not worth it.

>> There was some discussion about using something shorter, but in the end
>> there was no strong opposition and the thread ended a a few days ago.
> I feel this is a significant change to the workflow, so I would like the 
> workflows list to have another go at this topic. :)

FWIW, we could go back to what I initially proposed: use the existing
stable tag with a pre-defined comment to mark patches that AUTOSEL et.
al. should not pick up:
https://lore.kernel.org/all/c0a08b160b286e8c98549eedb37404c6e784cf8a.1712812895.git.li...@leemhuis.info/

Ciao, Thorsten



Re: [PATCH net-next v2 07/15] mm: page_frag: add '_va' suffix to page_frag API

2024-04-17 Thread Yunsheng Lin
On 2024/4/17 0:12, Alexander H Duyck wrote:
> On Mon, 2024-04-15 at 21:19 +0800, Yunsheng Lin wrote:
>> Currently most of the API for page_frag API is returning
>> 'virtual address' as output or expecting 'virtual address'
>> as input, in order to differentiate the API handling between
>> 'virtual address' and 'struct page', add '_va' suffix to the
>> corresponding API mirroring the page_pool_alloc_va() API of
>> the page_pool.
>>
>> Signed-off-by: Yunsheng Lin 
> 
> This patch is a total waste of time. By that logic we should be
> renaming __get_free_pages since it essentially does the same thing.
> 
> This just seems like more code changes for the sake of adding code
> changes rather than fixing anything. In my opinion it should be dropped
> from the set.

The rename is to support different use case as mentioned below in patch
14:
"Depending on different use cases, callers expecting to deal with va, page or
both va and page for them may call page_frag_alloc_va*, page_frag_alloc_pg*,
or page_frag_alloc* API accordingly."

Naming is hard anyway, I am open to better API naming for the above use cases.

> 
> .
> 



Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Vlastimil Babka
On 4/17/24 2:52 PM, Konstantin Ryabitsev wrote:
> On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote:
>> Hi kernel.org helpdesk!
>> 
>> Could you please create the email alias
>> do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
>> just like sta...@kernel.org does?
>> 
>> That's an idea GregKH brought up a few days ago here:
>> https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/
>> 
>> To quote:
>> 
>> > How about:
>> >cc:  # Reason goes here, and must be 
>> > present
>> > 
>> > and we can make that address be routed to /dev/null just like
>> >  is?
> 
> That would make it into actual commits and probably irk maintainers and 
> Linus, no? I also don't really love the idea of overloading email 
> addresses with additional semantics. Using Cc: stable kinda makes sense, 
> even if it's not a real email address (but it could become at some 
> point), but this feels different.
> 
> In general, I feel this information belongs in the patch basement (the 
> place where change-id, base-commit, etc goes). E.g.:
> 
> stable-autosel: ignore
> [This fix requires a feature that is only present in mainline]
> 
> This allows passing along structured information that can be parsed by 
> automated tooling without putting it into the commit.

But isn't it the actual commit what the stable tooling parses?

>> There was some discussion about using something shorter, but in the end
>> there was no strong opposition and the thread ended a a few days ago.
> 
> I feel this is a significant change to the workflow, so I would like the 
> workflows list to have another go at this topic. :)
> 
> -K
> 




Re: [PATCH net-next] ipvs: allow some sysctls in non-init user namespaces

2024-04-17 Thread Julian Anastasov

Hello,

On Tue, 16 Apr 2024, Alexander Mikhalitsyn wrote:

> Let's make all IPVS sysctls visible and RO even when
> network namespace is owned by non-initial user namespace.
> 
> Let's make a few sysctls to be writable:
> - conntrack
> - conn_reuse_mode
> - expire_nodest_conn
> - expire_quiescent_template
> 
> I'm trying to be conservative with this to prevent
> introducing any security issues in there. Maybe,
> we can allow more sysctls to be writable, but let's
> do this on-demand and when we see real use-case.
> 
> This list of sysctls was chosen because I can't
> see any security risks allowing them and also
> Kubernetes uses [2] these specific sysctls.
> 
> This patch is motivated by user request in the LXC
> project [1].
> 
> [1] https://github.com/lxc/lxc/issues/4278
> [2] 
> https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
> 
> Cc: Stéphane Graber 
> Cc: Christian Brauner 
> Cc: Julian Anastasov 
> Cc: Simon Horman 
> Cc: Pablo Neira Ayuso 
> Cc: Jozsef Kadlecsik 
> Cc: Florian Westphal 
> Signed-off-by: Alexander Mikhalitsyn 
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 143a341bbc0a..92a818c2f783 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -4285,10 +4285,22 @@ static int __net_init 
> ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)

As the list of privileged vars is short I prefer
to use a bool and to make only some vars read-only:

bool unpriv = false;

>   if (tbl == NULL)
>   return -ENOMEM;
>  
> - /* Don't export sysctls to unprivileged users */
> + /* Let's show all sysctls in non-init user namespace-owned
> +  * net namespaces, but make them read-only.
> +  *
> +  * Allow only a few specific sysctls to be writable.
> +  */
>   if (net->user_ns != _user_ns) {

Here we should just set: unpriv = true;

> - tbl[0].procname = NULL;
> - ctl_table_size = 0;
> + for (idx = 0; idx < ARRAY_SIZE(vs_vars); idx++) {
> + if (!tbl[idx].procname)
> + continue;
> +
> + if (!((strcmp(tbl[idx].procname, "conntrack") 
> == 0) ||
> +   (strcmp(tbl[idx].procname, 
> "conn_reuse_mode") == 0) ||
> +   (strcmp(tbl[idx].procname, 
> "expire_nodest_conn") == 0) ||
> +   (strcmp(tbl[idx].procname, 
> "expire_quiescent_template") == 0)))
> + tbl[idx].mode = 0444;
> + }
>   }
>   } else
>   tbl = vs_vars;

And below at every place to use:

if (unpriv)
tbl[idx].mode = 0444;

for the following 4 privileged sysctl vars:

- sync_qlen_max:
- allocates messages in kernel context
- this needs better tunning in another patch

- sync_sock_size:
- allocates messages in kernel context

- run_estimation:
- for now, better init ns to decide if to use est stats

- est_nice:
- for now, better init ns to decide the value

- debug_level:
- already set to 0444

I.e. these vars allocate resources (mem, CPU) without
proper control, so for now we will just copy them from init ns
without allowing writing. And they are vars that are not tuned
often. Also we do not know which netns is supposed to be the
privileged one, some solutions move all devices out of init_net,
so we can not decide where to use lower limits.

OTOH, "amemthresh" is not privileged but needs single READ_ONCE 
for sysctl_amemthresh in update_defense_level() due to the possible
div by zero if we allow writing to anyone, eg.:

int amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
...
nomem = availmem < amemthresh;
... use only amemthresh

All other vars can be writable.

Regards

--
Julian Anastasov 

Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Konstantin Ryabitsev
On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote:
> Hi kernel.org helpdesk!
> 
> Could you please create the email alias
> do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
> just like sta...@kernel.org does?
> 
> That's an idea GregKH brought up a few days ago here:
> https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/
> 
> To quote:
> 
> > How about:
> > cc:  # Reason goes here, and must be 
> > present
> > 
> > and we can make that address be routed to /dev/null just like
> >  is?

That would make it into actual commits and probably irk maintainers and 
Linus, no? I also don't really love the idea of overloading email 
addresses with additional semantics. Using Cc: stable kinda makes sense, 
even if it's not a real email address (but it could become at some 
point), but this feels different.

In general, I feel this information belongs in the patch basement (the 
place where change-id, base-commit, etc goes). E.g.:

stable-autosel: ignore
[This fix requires a feature that is only present in mainline]

This allows passing along structured information that can be parsed by 
automated tooling without putting it into the commit.

> There was some discussion about using something shorter, but in the end
> there was no strong opposition and the thread ended a a few days ago.

I feel this is a significant change to the workflow, so I would like the 
workflows list to have another go at this topic. :)

-K



[syzbot] [bpf?] [trace?] possible deadlock in __send_signal_locked

2024-04-17 Thread syzbot
Hello,

syzbot found the following issue on:

HEAD commit:96fca68c4fbf Merge tag 'nfsd-6.9-3' of git://git.kernel.or..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=13967bb318
kernel config:  https://syzkaller.appspot.com/x/.config?x=85dbe39cf8e4f599
dashboard link: https://syzkaller.appspot.com/bug?extid=6e3b6eab5bd4ed584a38
compiler:   gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 
2.40

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image (non-bootable): 
https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-96fca68c.raw.xz
vmlinux: 
https://storage.googleapis.com/syzbot-assets/d6d7a71ca443/vmlinux-96fca68c.xz
kernel image: 
https://storage.googleapis.com/syzbot-assets/accb76ce6c9c/bzImage-96fca68c.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+6e3b6eab5bd4ed584...@syzkaller.appspotmail.com

==
WARNING: possible circular locking dependency detected
6.9.0-rc4-syzkaller-00031-g96fca68c4fbf #0 Not tainted
--
syz-executor.0/7699 is trying to acquire lock:
88806b53d998 (>lock){-.-.}-{2:2}, at: __queue_work+0x23a/0x1020 
kernel/workqueue.c:2346

but task is already holding lock:
888023446620 (>signalfd_wqh){}-{2:2}, at: 
__wake_up_common_lock kernel/sched/wait.c:105 [inline]
888023446620 (>signalfd_wqh){}-{2:2}, at: __wake_up+0x1c/0x60 
kernel/sched/wait.c:127

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (>signalfd_wqh){}-{2:2}:
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0x3a/0x60 kernel/locking/spinlock.c:162
   __wake_up_common_lock kernel/sched/wait.c:105 [inline]
   __wake_up+0x1c/0x60 kernel/sched/wait.c:127
   signalfd_notify include/linux/signalfd.h:22 [inline]
   __send_signal_locked+0x951/0x11c0 kernel/signal.c:1168
   do_notify_parent+0xeb4/0x1040 kernel/signal.c:2143
   exit_notify kernel/exit.c:754 [inline]
   do_exit+0x1369/0x2c10 kernel/exit.c:898
   do_group_exit+0xd3/0x2a0 kernel/exit.c:1027
   __do_sys_exit_group kernel/exit.c:1038 [inline]
   __se_sys_exit_group kernel/exit.c:1036 [inline]
   __x64_sys_exit_group+0x3e/0x50 kernel/exit.c:1036
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #2 (>siglock){-...}-{2:2}:
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0x3a/0x60 kernel/locking/spinlock.c:162
   __lock_task_sighand+0xc2/0x340 kernel/signal.c:1414
   lock_task_sighand include/linux/sched/signal.h:746 [inline]
   do_send_sig_info kernel/signal.c:1300 [inline]
   group_send_sig_info+0x290/0x300 kernel/signal.c:1453
   bpf_send_signal_common+0x2e8/0x3a0 kernel/trace/bpf_trace.c:881
   bpf_send_signal_thread kernel/trace/bpf_trace.c:898 [inline]
   bpf_send_signal_thread+0x16/0x20 kernel/trace/bpf_trace.c:896
   ___bpf_prog_run+0x3e51/0xabd0 kernel/bpf/core.c:1997
   __bpf_prog_run32+0xc1/0x100 kernel/bpf/core.c:2236
   bpf_dispatcher_nop_func include/linux/bpf.h:1234 [inline]
   __bpf_prog_run include/linux/filter.h:657 [inline]
   bpf_prog_run include/linux/filter.h:664 [inline]
   __bpf_trace_run kernel/trace/bpf_trace.c:2381 [inline]
   bpf_trace_run4+0x176/0x460 kernel/trace/bpf_trace.c:2422
   __bpf_trace_mmap_lock_acquire_returned+0x134/0x180 
include/trace/events/mmap_lock.h:52
   trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:52 
[inline]
   __mmap_lock_do_trace_acquire_returned+0x456/0x790 mm/mmap_lock.c:237
   __mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline]
   mmap_write_lock include/linux/mmap_lock.h:109 [inline]
   __do_sys_set_mempolicy_home_node+0x574/0x860 mm/mempolicy.c:1568
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xcf/0x260 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #1 (lock#11){+.+.}-{2:2}:
   local_lock_acquire include/linux/local_lock_internal.h:29 [inline]
   __mmap_lock_do_trace_acquire_returned+0x97/0x790 mm/mmap_lock.c:237
   __mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline]
   mmap_read_trylock include/linux/mmap_lock.h:166 [inline]
   stack_map_get_build_id_offset+0x5df/0x7d0 kernel/bpf/stackmap.c:141
   __bpf_get_stack+0x6bf/0x700 kernel/bpf/stackmap.c:449
   bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1985 [inline]
   bpf_get_stack_raw_tp+0x124/0x160 kernel/trace/bpf_trace.c:1975
   ___bpf_prog_run+0x3e51/0xabd0 kernel/bpf/core.c:1997
   

Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-17 Thread Michael S. Tsirkin
On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> Add the function vduse_alloc_reconnnect_info_mem
> and vduse_alloc_reconnnect_info_mem
> These functions allow vduse to allocate and free memory for reconnection
> information. The amount of memory allocated is vq_num pages.
> Each VQS will map its own page where the reconnection information will be 
> saved
> 
> Signed-off-by: Cindy Lu 
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 40 ++
>  1 file changed, 40 insertions(+)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index ef3c9681941e..2da659d5f4a8 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -65,6 +65,7 @@ struct vduse_virtqueue {
>   int irq_effective_cpu;
>   struct cpumask irq_affinity;
>   struct kobject kobj;
> + unsigned long vdpa_reconnect_vaddr;
>  };
>  
>  struct vduse_dev;
> @@ -1105,6 +1106,38 @@ static void vduse_vq_update_effective_cpu(struct 
> vduse_virtqueue *vq)
>  
>   vq->irq_effective_cpu = curr_cpu;
>  }
> +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> +{
> + unsigned long vaddr = 0;
> + struct vduse_virtqueue *vq;
> +
> + for (int i = 0; i < dev->vq_num; i++) {
> + /*page 0~ vq_num save the reconnect info for vq*/
> + vq = dev->vqs[i];
> + vaddr = get_zeroed_page(GFP_KERNEL);


I don't get why you insist on stealing kernel memory for something
that is just used by userspace to store data for its own use.
Userspace does not lack ways to persist data, for example,
create a regular file anywhere in the filesystem.



> + if (vaddr == 0)
> + return -ENOMEM;
> +
> + vq->vdpa_reconnect_vaddr = vaddr;
> + }
> +
> + return 0;
> +}
> +
> +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev)
> +{
> + struct vduse_virtqueue *vq;
> +
> + for (int i = 0; i < dev->vq_num; i++) {
> + vq = dev->vqs[i];
> +
> + if (vq->vdpa_reconnect_vaddr)
> + free_page(vq->vdpa_reconnect_vaddr);
> + vq->vdpa_reconnect_vaddr = 0;
> + }
> +
> + return 0;
> +}
>  
>  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>   unsigned long arg)
> @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name)
>   mutex_unlock(>lock);
>   return -EBUSY;
>   }
> + vduse_free_reconnnect_info_mem(dev);
> +
>   dev->connected = true;
>   mutex_unlock(>lock);
>  
> @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct vduse_dev_config 
> *config,
>   ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num);
>   if (ret)
>   goto err_vqs;
> + ret = vduse_alloc_reconnnect_info_mem(dev);
> + if (ret < 0)
> + goto err_mem;
>  
>   __module_get(THIS_MODULE);
>  
>   return 0;
>  err_vqs:
>   device_destroy(_class, MKDEV(MAJOR(vduse_major), dev->minor));
> +err_mem:
> + vduse_free_reconnnect_info_mem(dev);
>  err_dev:
>   idr_remove(_idr, dev->minor);
>  err_idr:
> -- 
> 2.43.0




Re: [PATCH net-next v6 1/7] net: introduce rstreason to detect why the RST is sent

2024-04-17 Thread Jason Xing
Hello Eric,

On Wed, Apr 17, 2024 at 5:02 PM Eric Dumazet  wrote:
>
> On Wed, Apr 17, 2024 at 10:51 AM Jason Xing  wrote:
> >
> > From: Jason Xing 
> >
> > Add a new standalone file for the easy future extension to support
> > both active reset and passive reset in the TCP/DCCP/MPTCP protocols.
> >
> > This patch only does the preparations for reset reason mechanism,
> > nothing else changes.
> >
> > The reset reasons are divided into three parts:
> > 1) reuse drop reasons for passive reset in TCP
> > 2) reuse MP_TCPRST option for MPTCP
> > 3) our own reasons
> >
> > I will implement the basic codes of active/passive reset reason in
> > those three protocols, which is not complete for this moment. But
> > it provides a new chance to let other people add more reasons into
> > it:)
> >
> > Signed-off-by: Jason Xing 
>
> My original suggestion was to use normal values in  'enum
> skb_drop_reason', even if there was not necessarily a 'drop'
> in the common sense.
>
> https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/
>
> This would avoid these ugly casts later, even casting an enum to other
> ones is not very logical.

Thanks for your comment.

It's a little bit tricky. That's the reason I documented and commented
on this in the rstreason.h file. I hope it's not that hard to
understand.

> Going through an u32 pivot is quite a hack.
>
> If you feel the need to put them in a special group, this is fine by me.

Yes, rst reasons only partially rely on the drop reason mechanism to
support passive rst for TCP well, but not supporting other cases. My
final goal is to cover all the cases for the future, so I wish I can
put it into a separate group, then people like me who find it useful
can introduce more reasons into it.

Thanks,
Jason



Re: [PATCH net-next v6 1/7] net: introduce rstreason to detect why the RST is sent

2024-04-17 Thread Eric Dumazet
On Wed, Apr 17, 2024 at 10:51 AM Jason Xing  wrote:
>
> From: Jason Xing 
>
> Add a new standalone file for the easy future extension to support
> both active reset and passive reset in the TCP/DCCP/MPTCP protocols.
>
> This patch only does the preparations for reset reason mechanism,
> nothing else changes.
>
> The reset reasons are divided into three parts:
> 1) reuse drop reasons for passive reset in TCP
> 2) reuse MP_TCPRST option for MPTCP
> 3) our own reasons
>
> I will implement the basic codes of active/passive reset reason in
> those three protocols, which is not complete for this moment. But
> it provides a new chance to let other people add more reasons into
> it:)
>
> Signed-off-by: Jason Xing 

My original suggestion was to use normal values in  'enum
skb_drop_reason', even if there was not necessarily a 'drop'
in the common sense.

https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

This would avoid these ugly casts later, even casting an enum to other
ones is not very logical.
Going through an u32 pivot is quite a hack.

If you feel the need to put them in a special group, this is fine by me.



[PATCH 1/1] virtio: Add support for the virtio suspend feature

2024-04-17 Thread David Stevens
Add support for the VIRTIO_F_SUSPEND feature. When this feature is
negotiated, power management can use it to suspend virtio devices
instead of resorting to resetting the devices entirely.

Signed-off-by: David Stevens 
---
 drivers/virtio/virtio.c| 32 ++
 drivers/virtio/virtio_pci_common.c | 29 +++
 drivers/virtio/virtio_pci_modern.c | 19 ++
 include/linux/virtio.h |  2 ++
 include/uapi/linux/virtio_config.h | 10 +-
 5 files changed, 74 insertions(+), 18 deletions(-)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index f4080692b351..cd11495a5098 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0-only
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -580,6 +581,37 @@ int virtio_device_restore(struct virtio_device *dev)
return ret;
 }
 EXPORT_SYMBOL_GPL(virtio_device_restore);
+
+static int virtio_device_set_suspend_bit(struct virtio_device *dev, bool 
enabled)
+{
+   u8 status, target;
+
+   status = dev->config->get_status(dev);
+   if (enabled)
+   target = status | VIRTIO_CONFIG_S_SUSPEND;
+   else
+   target = status & ~VIRTIO_CONFIG_S_SUSPEND;
+   dev->config->set_status(dev, target);
+
+   while ((status = dev->config->get_status(dev)) != target) {
+   if (status & VIRTIO_CONFIG_S_NEEDS_RESET)
+   return -EIO;
+   mdelay(10);
+   }
+   return 0;
+}
+
+int virtio_device_suspend(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, true);
+}
+EXPORT_SYMBOL_GPL(virtio_device_suspend);
+
+int virtio_device_resume(struct virtio_device *dev)
+{
+   return virtio_device_set_suspend_bit(dev, false);
+}
+EXPORT_SYMBOL_GPL(virtio_device_resume);
 #endif
 
 static int virtio_init(void)
diff --git a/drivers/virtio/virtio_pci_common.c 
b/drivers/virtio/virtio_pci_common.c
index b655fccaf773..4d542de05970 100644
--- a/drivers/virtio/virtio_pci_common.c
+++ b/drivers/virtio/virtio_pci_common.c
@@ -495,31 +495,26 @@ static int virtio_pci_restore(struct device *dev)
return virtio_device_restore(_dev->vdev);
 }
 
-static bool vp_supports_pm_no_reset(struct device *dev)
+static int virtio_pci_suspend(struct device *dev)
 {
struct pci_dev *pci_dev = to_pci_dev(dev);
-   u16 pmcsr;
-
-   if (!pci_dev->pm_cap)
-   return false;
-
-   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, );
-   if (PCI_POSSIBLE_ERROR(pmcsr)) {
-   dev_err(dev, "Unable to query pmcsr");
-   return false;
-   }
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
 
-   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
-}
+   if (virtio_has_feature(_dev->vdev, VIRTIO_F_SUSPEND))
+   return virtio_device_suspend(_dev->vdev);
 
-static int virtio_pci_suspend(struct device *dev)
-{
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_freeze(dev);
+   return virtio_pci_freeze(dev);
 }
 
 static int virtio_pci_resume(struct device *dev)
 {
-   return vp_supports_pm_no_reset(dev) ? 0 : virtio_pci_restore(dev);
+   struct pci_dev *pci_dev = to_pci_dev(dev);
+   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
+
+   if (virtio_has_feature(_dev->vdev, VIRTIO_F_SUSPEND))
+   return virtio_device_resume(_dev->vdev);
+
+   return virtio_pci_restore(dev);
 }
 
 static const struct dev_pm_ops virtio_pci_pm_ops = {
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index f62b530aa3b5..ac8734526b8d 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -209,6 +209,22 @@ static void vp_modern_avq_deactivate(struct virtio_device 
*vdev)
__virtqueue_break(admin_vq->info.vq);
 }
 
+static bool vp_supports_pm_no_reset(struct pci_dev *pci_dev)
+{
+   u16 pmcsr;
+
+   if (!pci_dev->pm_cap)
+   return false;
+
+   pci_read_config_word(pci_dev, pci_dev->pm_cap + PCI_PM_CTRL, );
+   if (PCI_POSSIBLE_ERROR(pmcsr)) {
+   dev_err(_dev->dev, "Unable to query pmcsr");
+   return false;
+   }
+
+   return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
+}
+
 static void vp_transport_features(struct virtio_device *vdev, u64 features)
 {
struct virtio_pci_device *vp_dev = to_vp_device(vdev);
@@ -223,6 +239,9 @@ static void vp_transport_features(struct virtio_device 
*vdev, u64 features)
 
if (features & BIT_ULL(VIRTIO_F_ADMIN_VQ))
__virtio_set_bit(vdev, VIRTIO_F_ADMIN_VQ);
+
+   if (features & BIT_ULL(VIRTIO_F_SUSPEND) && 
vp_supports_pm_no_reset(pci_dev))
+   __virtio_set_bit(vdev, VIRTIO_F_SUSPEND);
 }
 
 static int __vp_check_common_size_one_feature(struct virtio_device *vdev, u32 
fbit,
diff --git 

[PATCH 0/1] virtio: Add suspend support

2024-04-17 Thread David Stevens
This series implements support for the virtio device suspend feature
that is under discussion. Unfortunately, the virtio mailing list is
currently being migrated, so recent discussion of the proposal is not
archived anywhere. There current version of the proposal is a
combination of [1] and [2].

[1] https://lore.kernel.org/all/20230906081637.32185-3-lingshan@intel.com/
[2] https://lists.oasis-open.org/archives/virtio-comment/202402/msg00088.html

David Stevens (1):
  virtio: Add support for the virtio suspend feature

 drivers/virtio/virtio.c| 32 ++
 drivers/virtio/virtio_pci_common.c | 29 +++
 drivers/virtio/virtio_pci_modern.c | 19 ++
 include/linux/virtio.h |  2 ++
 include/uapi/linux/virtio_config.h | 10 +-
 5 files changed, 74 insertions(+), 18 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
-- 
2.44.0.683.g7961c838ac-goog




Re: [External] Re: [PATCH v11 2/2] memory tier: create CPUless memory tiers after obtaining HMAT info

2024-04-17 Thread Ho-Ren (Jack) Chuang
On Wed, Apr 10, 2024 at 9:51 AM Jonathan Cameron
 wrote:
>
> On Tue, 9 Apr 2024 12:02:31 -0700
> "Ho-Ren (Jack) Chuang"  wrote:
>
> > Hi Jonathan,
> >
> > On Tue, Apr 9, 2024 at 9:12 AM Jonathan Cameron
> >  wrote:
> > >
> > > On Fri, 5 Apr 2024 15:43:47 -0700
> > > "Ho-Ren (Jack) Chuang"  wrote:
> > >
> > > > On Fri, Apr 5, 2024 at 7:03 AM Jonathan Cameron
> > > >  wrote:
> > > > >
> > > > > On Fri,  5 Apr 2024 00:07:06 +
> > > > > "Ho-Ren (Jack) Chuang"  wrote:
> > > > >
> > > > > > The current implementation treats emulated memory devices, such as
> > > > > > CXL1.1 type3 memory, as normal DRAM when they are emulated as 
> > > > > > normal memory
> > > > > > (E820_TYPE_RAM). However, these emulated devices have different
> > > > > > characteristics than traditional DRAM, making it important to
> > > > > > distinguish them. Thus, we modify the tiered memory initialization 
> > > > > > process
> > > > > > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > > > > > ensures that the memory tier initialization for these nodes is 
> > > > > > deferred
> > > > > > until HMAT information is obtained during the boot process. Finally,
> > > > > > demotion tables are recalculated at the end.
> > > > > >
> > > > > > * late_initcall(memory_tier_late_init);
> > > > > > Some device drivers may have initialized memory tiers between
> > > > > > `memory_tier_init()` and `memory_tier_late_init()`, potentially 
> > > > > > bringing
> > > > > > online memory nodes and configuring memory tiers. They should be 
> > > > > > excluded
> > > > > > in the late init.
> > > > > >
> > > > > > * Handle cases where there is no HMAT when creating memory tiers
> > > > > > There is a scenario where a CPUless node does not provide HMAT 
> > > > > > information.
> > > > > > If no HMAT is specified, it falls back to using the default DRAM 
> > > > > > tier.
> > > > > >
> > > > > > * Introduce another new lock `default_dram_perf_lock` for adist 
> > > > > > calculation
> > > > > > In the current implementation, iterating through CPUlist nodes 
> > > > > > requires
> > > > > > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will 
> > > > > > end up
> > > > > > trying to acquire the same lock, leading to a potential deadlock.
> > > > > > Therefore, we propose introducing a standalone 
> > > > > > `default_dram_perf_lock` to
> > > > > > protect `default_dram_perf_*`. This approach not only avoids 
> > > > > > deadlock
> > > > > > but also prevents holding a large lock simultaneously.
> > > > > >
> > > > > > * Upgrade `set_node_memory_tier` to support additional cases, 
> > > > > > including
> > > > > >   default DRAM, late CPUless, and hot-plugged initializations.
> > > > > > To cover hot-plugged memory nodes, `mt_calc_adistance()` and
> > > > > > `mt_find_alloc_memory_type()` are moved into 
> > > > > > `set_node_memory_tier()` to
> > > > > > handle cases where memtype is not initialized and where HMAT 
> > > > > > information is
> > > > > > available.
> > > > > >
> > > > > > * Introduce `default_memory_types` for those memory types that are 
> > > > > > not
> > > > > >   initialized by device drivers.
> > > > > > Because late initialized memory and default DRAM memory need to be 
> > > > > > managed,
> > > > > > a default memory type is created for storing all memory types that 
> > > > > > are
> > > > > > not initialized by device drivers and as a fallback.
> > > > > >
> > > > > > Signed-off-by: Ho-Ren (Jack) Chuang 
> > > > > > Signed-off-by: Hao Xiang 
> > > > > > Reviewed-by: "Huang, Ying" 
> > > > >
> > > > > Hi - one remaining question. Why can't we delay init for all nodes
> > > > > to either drivers or your fallback late_initcall code.
> > > > > It would be nice to reduce possible code paths.
> > > >
> > > > I try not to change too much of the existing code structure in
> > > > this patchset.
> > > >
> > > > To me, postponing/moving all memory tier registrations to
> > > > late_initcall() is another possible action item for the next patchset.
> > > >
> > > > After tier_mem(), hmat_init() is called, which requires registering
> > > > `default_dram_type` info. This is when `default_dram_type` is needed.
> > > > However, it is indeed possible to postpone the latter part,
> > > > set_node_memory_tier(), to `late_init(). So, memory_tier_init() can
> > > > indeed be split into two parts, and the latter part can be moved to
> > > > late_initcall() to be processed together.
> > > >
> > > > Doing this all memory-type drivers have to call late_initcall() to
> > > > register a memory tier. I’m not sure how many they are?
> > > >
> > > > What do you guys think?
> > >
> > > Gut feeling - if you are going to move it for some cases, move it for
> > > all of them.  Then we only have to test once ;)
> > >
> > > J
> >
> > Thank you for your reminder! I agree~ That's why I'm considering
> > changing them in the next patchset because of the amount of changes.
> > And also, this patchset already contains too many 

[PATCH net-next v6 6/7] mptcp: introducing a helper into active reset logic

2024-04-17 Thread Jason Xing
From: Jason Xing 

Since we have mapped every mptcp reset reason definition
in enum sk_rst_reason, introducing a new helper can cover
some missing places where we have already set the
subflow->reset_reason.

Note: using SK_RST_REASON_NOT_SPECIFIED is the same as
SK_RST_REASON_MPTCP_RST_EUNSPEC. They are both unknown.
So we can convert it directly.

Suggested-by: Paolo Abeni 
Signed-off-by: Jason Xing 
---
Link: 
https://lore.kernel.org/all/2d3ea199eef53cf6a0c48e21abdee0eefbdee927.ca...@redhat.com/
---
 net/mptcp/protocol.c |  4 +---
 net/mptcp/protocol.h | 11 +++
 net/mptcp/subflow.c  |  6 ++
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 065967086492..4b13ca362efa 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -21,7 +21,6 @@
 #endif
 #include 
 #include 
-#include 
 #include 
 #include "protocol.h"
 #include "mib.h"
@@ -2570,8 +2569,7 @@ static void mptcp_check_fastclose(struct mptcp_sock *msk)
 
slow = lock_sock_fast(tcp_sk);
if (tcp_sk->sk_state != TCP_CLOSE) {
-   tcp_send_active_reset(tcp_sk, GFP_ATOMIC,
- SK_RST_REASON_NOT_SPECIFIED);
+   mptcp_send_active_reset_reason(tcp_sk);
tcp_set_state(tcp_sk, TCP_CLOSE);
}
unlock_sock_fast(tcp_sk, slow);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index fdfa843e2d88..82ef2f42a1bc 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mptcp_pm_gen.h"
 
@@ -581,6 +582,16 @@ mptcp_subflow_ctx_reset(struct mptcp_subflow_context 
*subflow)
WRITE_ONCE(subflow->local_id, -1);
 }
 
+static inline void
+mptcp_send_active_reset_reason(struct sock *sk)
+{
+   struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
+   enum sk_rst_reason reason;
+
+   reason = convert_mptcp_reason(subflow->reset_reason);
+   tcp_send_active_reset(sk, GFP_ATOMIC, reason);
+}
+
 static inline u64
 mptcp_subflow_get_map_offset(const struct mptcp_subflow_context *subflow)
 {
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index bde4a7fdee82..4783d558863c 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -20,7 +20,6 @@
 #include 
 #endif
 #include 
-#include 
 
 #include "protocol.h"
 #include "mib.h"
@@ -424,7 +423,7 @@ void mptcp_subflow_reset(struct sock *ssk)
/* must hold: tcp_done() could drop last reference on parent */
sock_hold(sk);
 
-   tcp_send_active_reset(ssk, GFP_ATOMIC, SK_RST_REASON_NOT_SPECIFIED);
+   mptcp_send_active_reset_reason(ssk);
tcp_done(ssk);
if (!test_and_set_bit(MPTCP_WORK_CLOSE_SUBFLOW, _sk(sk)->flags))
mptcp_schedule_work(sk);
@@ -1362,8 +1361,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
tcp_set_state(ssk, TCP_CLOSE);
while ((skb = skb_peek(>sk_receive_queue)))
sk_eat_skb(ssk, skb);
-   tcp_send_active_reset(ssk, GFP_ATOMIC,
- SK_RST_REASON_NOT_SPECIFIED);
+   mptcp_send_active_reset_reason(ssk);
WRITE_ONCE(subflow->data_avail, false);
return false;
}
-- 
2.37.3




[PATCH net-next v6 7/7] rstreason: make it work in trace world

2024-04-17 Thread Jason Xing
From: Jason Xing 

At last, we should let it work by introducing this reset reason in
trace world.

One of the possible expected outputs is:
... tcp_send_reset: skbaddr=xxx skaddr=xxx src=xxx dest=xxx
state=TCP_ESTABLISHED reason=NOT_SPECIFIED

Signed-off-by: Jason Xing 
Reviewed-by: Steven Rostedt (Google) 
---
 include/trace/events/tcp.h | 37 +
 net/ipv4/tcp_ipv4.c|  2 +-
 net/ipv4/tcp_output.c  |  2 +-
 net/ipv6/tcp_ipv6.c|  2 +-
 4 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 5c04a61a11c2..b1455cbc0634 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * tcp event with arguments sk and skb
@@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb,
TP_ARGS(sk, skb)
 );
 
+#undef FN1
+#define FN1(reason)TRACE_DEFINE_ENUM(SK_RST_REASON_##reason);
+#undef FN2
+#define FN2(reason)TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason);
+DEFINE_RST_REASON(FN1, FN1)
+
+#undef FN1
+#undef FNe1
+#define FN1(reason){ SK_RST_REASON_##reason, #reason },
+#define FNe1(reason)   { SK_RST_REASON_##reason, #reason }
+
+#undef FN2
+#undef FNe2
+#define FN2(reason){ SKB_DROP_REASON_##reason, #reason },
+#define FNe2(reason)   { SKB_DROP_REASON_##reason, #reason }
 /*
  * skb of trace_tcp_send_reset is the skb that caused RST. In case of
  * active reset, skb should be NULL
  */
 TRACE_EVENT(tcp_send_reset,
 
-   TP_PROTO(const struct sock *sk, const struct sk_buff *skb),
+   TP_PROTO(const struct sock *sk,
+const struct sk_buff *skb,
+const enum sk_rst_reason reason),
 
-   TP_ARGS(sk, skb),
+   TP_ARGS(sk, skb, reason),
 
TP_STRUCT__entry(
__field(const void *, skbaddr)
__field(const void *, skaddr)
__field(int, state)
+   __field(enum sk_rst_reason, reason)
__array(__u8, saddr, sizeof(struct sockaddr_in6))
__array(__u8, daddr, sizeof(struct sockaddr_in6))
),
@@ -113,14 +132,24 @@ TRACE_EVENT(tcp_send_reset,
 */
TP_STORE_ADDR_PORTS_SKB(skb, th, entry->daddr, 
entry->saddr);
}
+   __entry->reason = reason;
),
 
-   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s",
+   TP_printk("skbaddr=%p skaddr=%p src=%pISpc dest=%pISpc state=%s 
reason=%s",
  __entry->skbaddr, __entry->skaddr,
  __entry->saddr, __entry->daddr,
- __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN")
+ __entry->state ? show_tcp_state_name(__entry->state) : 
"UNKNOWN",
+ __entry->reason < RST_REASON_START ?
+   __print_symbolic(__entry->reason, 
DEFINE_DROP_REASON(FN2, FNe2)) :
+   __print_symbolic(__entry->reason, 
DEFINE_RST_REASON(FN1, FNe1)))
 );
 
+#undef FN1
+#undef FNe1
+
+#undef FN2
+#undef FNe2
+
 /*
  * tcp event with arguments sk
  *
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d78412cf8566..461b4d2b7cfe 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -871,7 +871,7 @@ static void tcp_v4_send_reset(const struct sock *sk, struct 
sk_buff *skb,
if (sk)
arg.bound_dev_if = sk->sk_bound_dev_if;
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb, reason);
 
BUILD_BUG_ON(offsetof(struct sock, sk_bound_dev_if) !=
 offsetof(struct inet_timewait_sock, tw_bound_dev_if));
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 276d9d541b01..b08ffb17d5a0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3612,7 +3612,7 @@ void tcp_send_active_reset(struct sock *sk, gfp_t 
priority,
/* skb of trace_tcp_send_reset() keeps the skb that caused RST,
 * skb here is different to the troublesome skb, so use NULL
 */
-   trace_tcp_send_reset(sk, NULL);
+   trace_tcp_send_reset(sk, NULL, SK_RST_REASON_NOT_SPECIFIED);
 }
 
 /* Send a crossed SYN-ACK during socket establishment.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index c46095fb596c..6a4736ec3df0 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1133,7 +1133,7 @@ static void tcp_v6_send_reset(const struct sock *sk, 
struct sk_buff *skb,
label = ip6_flowlabel(ipv6h);
}
 
-   trace_tcp_send_reset(sk, skb);
+   trace_tcp_send_reset(sk, skb, reason);
 
tcp_v6_send_response(sk, skb, seq, ack_seq, 0, 0, 0, oif, 1,
 ipv6_get_dsfield(ipv6h), label, priority, txhash,
-- 
2.37.3




[PATCH net-next v6 5/7] mptcp: support rstreason for passive reset

2024-04-17 Thread Jason Xing
From: Jason Xing 

It relys on what reset options in the skb are as rfc8684 says. Reusing
this logic can save us much energy. This patch replaces most of the prior
NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing 
---
 net/mptcp/subflow.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index ac867d277860..bde4a7fdee82 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -309,8 +309,13 @@ static struct dst_entry *subflow_v4_route_req(const struct 
sock *sk,
return dst;
 
dst_release(dst);
-   if (!req->syncookie)
-   tcp_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   if (!req->syncookie) {
+   struct mptcp_ext *mpext = mptcp_get_ext(skb);
+   enum sk_rst_reason reason;
+
+   reason = convert_mptcp_reason(mpext->reset_reason);
+   tcp_request_sock_ops.send_reset(sk, skb, reason);
+   }
return NULL;
 }
 
@@ -377,8 +382,13 @@ static struct dst_entry *subflow_v6_route_req(const struct 
sock *sk,
return dst;
 
dst_release(dst);
-   if (!req->syncookie)
-   tcp6_request_sock_ops.send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   if (!req->syncookie) {
+   struct mptcp_ext *mpext = mptcp_get_ext(skb);
+   enum sk_rst_reason reason;
+
+   reason = convert_mptcp_reason(mpext->reset_reason);
+   tcp6_request_sock_ops.send_reset(sk, skb, reason);
+   }
return NULL;
 }
 #endif
@@ -783,6 +793,7 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
struct mptcp_subflow_request_sock *subflow_req;
struct mptcp_options_received mp_opt;
bool fallback, fallback_is_fatal;
+   enum sk_rst_reason reason;
struct mptcp_sock *owner;
struct sock *child;
 
@@ -913,7 +924,8 @@ static struct sock *subflow_syn_recv_sock(const struct sock 
*sk,
tcp_rsk(req)->drop_req = true;
inet_csk_prepare_for_destroy_sock(child);
tcp_done(child);
-   req->rsk_ops->send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   reason = convert_mptcp_reason(mptcp_get_ext(skb)->reset_reason);
+   req->rsk_ops->send_reset(sk, skb, reason);
 
/* The last child reference will be released by the caller */
return child;
-- 
2.37.3




[PATCH net-next v6 4/7] tcp: support rstreason for passive reset

2024-04-17 Thread Jason Xing
From: Jason Xing 

Reuse the dropreason logic to show the exact reason of tcp reset,
so we don't need to implement those duplicated reset reasons.
This patch replaces all the prior NOT_SPECIFIED reasons.

Signed-off-by: Jason Xing 
---
 net/ipv4/tcp_ipv4.c | 8 
 net/ipv6/tcp_ipv6.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 418d11902fa7..d78412cf8566 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1936,7 +1936,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   tcp_v4_send_reset(rsk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(rsk, skb, (u32)reason);
 discard:
kfree_skb_reason(skb, reason);
/* Be careful here. If this function gets more complicated and
@@ -2278,7 +2278,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
} else {
drop_reason = tcp_child_process(sk, nsk, skb);
if (drop_reason) {
-   tcp_v4_send_reset(nsk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(nsk, skb, (u32)drop_reason);
goto discard_and_relse;
}
sock_put(sk);
@@ -2357,7 +2357,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
 bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
-   tcp_v4_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(NULL, skb, (u32)drop_reason);
}
 
 discard_it:
@@ -2409,7 +2409,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
tcp_v4_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
-   tcp_v4_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v4_send_reset(sk, skb, (u32)drop_reason);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
case TCP_TW_SUCCESS:;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 017f6293b5f4..c46095fb596c 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1680,7 +1680,7 @@ int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(sk, skb, (u32)reason);
 discard:
if (opt_skb)
__kfree_skb(opt_skb);
@@ -1865,7 +1865,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
} else {
drop_reason = tcp_child_process(sk, nsk, skb);
if (drop_reason) {
-   tcp_v6_send_reset(nsk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(nsk, skb, (u32)drop_reason);
goto discard_and_relse;
}
sock_put(sk);
@@ -1942,7 +1942,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
 bad_packet:
__TCP_INC_STATS(net, TCP_MIB_INERRS);
} else {
-   tcp_v6_send_reset(NULL, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(NULL, skb, (u32)drop_reason);
}
 
 discard_it:
@@ -1998,7 +1998,7 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff 
*skb)
tcp_v6_timewait_ack(sk, skb);
break;
case TCP_TW_RST:
-   tcp_v6_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
+   tcp_v6_send_reset(sk, skb, (u32)drop_reason);
inet_twsk_deschedule_put(inet_twsk(sk));
goto discard_it;
case TCP_TW_SUCCESS:
-- 
2.37.3




[PATCH net-next v6 3/7] rstreason: prepare for active reset

2024-04-17 Thread Jason Xing
From: Jason Xing 

Like what we did to passive reset:
only passing possible reset reason in each active reset path.

No functional changes.

Signed-off-by: Jason Xing 
---
 include/net/tcp.h |  3 ++-
 net/ipv4/tcp.c| 15 ++-
 net/ipv4/tcp_output.c |  3 ++-
 net/ipv4/tcp_timer.c  |  9 ++---
 net/mptcp/protocol.c  |  4 +++-
 net/mptcp/subflow.c   |  5 +++--
 6 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index b935e1ae4caf..adeacc9aa28a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -670,7 +670,8 @@ int tcp_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 void tcp_send_probe0(struct sock *);
 int tcp_write_wakeup(struct sock *, int mib);
 void tcp_send_fin(struct sock *sk);
-void tcp_send_active_reset(struct sock *sk, gfp_t priority);
+void tcp_send_active_reset(struct sock *sk, gfp_t priority,
+  enum sk_rst_reason reason);
 int tcp_send_synack(struct sock *);
 void tcp_push_one(struct sock *, unsigned int mss_now);
 void __tcp_send_ack(struct sock *sk, u32 rcv_nxt);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f23b9ea5..4ec0f4feee00 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -275,6 +275,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2811,7 +2812,8 @@ void __tcp_close(struct sock *sk, long timeout)
/* Unread data was tossed, zap the connection. */
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, sk->sk_allocation);
+   tcp_send_active_reset(sk, sk->sk_allocation,
+ SK_RST_REASON_NOT_SPECIFIED);
} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
/* Check zero linger _after_ checking for unread data. */
sk->sk_prot->disconnect(sk, 0);
@@ -2885,7 +2887,8 @@ void __tcp_close(struct sock *sk, long timeout)
struct tcp_sock *tp = tcp_sk(sk);
if (READ_ONCE(tp->linger2) < 0) {
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONLINGER);
} else {
@@ -2903,7 +2906,8 @@ void __tcp_close(struct sock *sk, long timeout)
if (sk->sk_state != TCP_CLOSE) {
if (tcp_check_oom(sk, 0)) {
tcp_set_state(sk, TCP_CLOSE);
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
__NET_INC_STATS(sock_net(sk),
LINUX_MIB_TCPABORTONMEMORY);
} else if (!check_net(sock_net(sk))) {
@@ -3007,7 +3011,7 @@ int tcp_disconnect(struct sock *sk, int flags)
/* The last check adjusts for discrepancy of Linux wrt. RFC
 * states
 */
-   tcp_send_active_reset(sk, gfp_any());
+   tcp_send_active_reset(sk, gfp_any(), 
SK_RST_REASON_NOT_SPECIFIED);
WRITE_ONCE(sk->sk_err, ECONNRESET);
} else if (old_state == TCP_SYN_SENT)
WRITE_ONCE(sk->sk_err, ECONNRESET);
@@ -4564,7 +4568,8 @@ int tcp_abort(struct sock *sk, int err)
smp_wmb();
sk_error_report(sk);
if (tcp_need_reset(sk->sk_state))
-   tcp_send_active_reset(sk, GFP_ATOMIC);
+   tcp_send_active_reset(sk, GFP_ATOMIC,
+ SK_RST_REASON_NOT_SPECIFIED);
tcp_done(sk);
}
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 61119d42b0fd..276d9d541b01 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3586,7 +3586,8 @@ void tcp_send_fin(struct sock *sk)
  * was unread data in the receive queue.  This behavior is recommended
  * by RFC 2525, section 2.17.  -DaveM
  */
-void tcp_send_active_reset(struct sock *sk, gfp_t priority)
+void tcp_send_active_reset(struct sock *sk, gfp_t priority,
+  enum sk_rst_reason reason)
 {
struct sk_buff *skb;
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 976db57b95d4..83fe7f62f7f1 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static u32 tcp_clamp_rto_to_user_timeout(const struct sock *sk)
 {
@@ -127,7 +128,8 @@ static int tcp_out_of_resources(struct sock *sk, bool 
do_reset)
(!tp->snd_wnd && !tp->packets_out))
  

[PATCH net-next v6 2/7] rstreason: prepare for passive reset

2024-04-17 Thread Jason Xing
From: Jason Xing 

Adjust the parameter and support passing reason of reset which
is for now NOT_SPECIFIED. No functional changes.

Signed-off-by: Jason Xing 
---
 include/net/request_sock.h |  4 +++-
 net/dccp/ipv4.c| 10 ++
 net/dccp/ipv6.c| 10 ++
 net/dccp/minisocks.c   |  3 ++-
 net/ipv4/tcp_ipv4.c| 12 +++-
 net/ipv4/tcp_minisocks.c   |  3 ++-
 net/ipv6/tcp_ipv6.c| 15 +--
 net/mptcp/subflow.c|  8 +---
 8 files changed, 40 insertions(+), 25 deletions(-)

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 004e651e6067..bdc737832da6 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -18,6 +18,7 @@
 #include 
 
 #include 
+#include 
 
 struct request_sock;
 struct sk_buff;
@@ -34,7 +35,8 @@ struct request_sock_ops {
void(*send_ack)(const struct sock *sk, struct sk_buff *skb,
struct request_sock *req);
void(*send_reset)(const struct sock *sk,
- struct sk_buff *skb);
+ struct sk_buff *skb,
+ enum sk_rst_reason reason);
void(*destructor)(struct request_sock *req);
void(*syn_ack_timeout)(const struct request_sock *req);
 };
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 9fc9cea4c251..ff41bd6f99c3 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "ackvec.h"
 #include "ccid.h"
@@ -521,7 +522,8 @@ static int dccp_v4_send_response(const struct sock *sk, 
struct request_sock *req
return err;
 }
 
-static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb)
+static void dccp_v4_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb,
+  enum sk_rst_reason reason)
 {
int err;
const struct iphdr *rxiph;
@@ -706,7 +708,7 @@ int dccp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
return 0;
 
 reset:
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
kfree_skb(skb);
return 0;
 }
@@ -869,7 +871,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
} else if (dccp_child_process(sk, nsk, skb)) {
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
goto discard_and_relse;
} else {
sock_put(sk);
@@ -909,7 +911,7 @@ static int dccp_v4_rcv(struct sk_buff *skb)
if (dh->dccph_type != DCCP_PKT_RESET) {
DCCP_SKB_CB(skb)->dccpd_reset_code =
DCCP_RESET_CODE_NO_CONNECTION;
-   dccp_v4_ctl_send_reset(sk, skb);
+   dccp_v4_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
}
 
 discard_it:
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index c8ca703dc331..85f4b8fdbe5e 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "dccp.h"
 #include "ipv6.h"
@@ -256,7 +257,8 @@ static void dccp_v6_reqsk_destructor(struct request_sock 
*req)
kfree_skb(inet_rsk(req)->pktopts);
 }
 
-static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb)
+static void dccp_v6_ctl_send_reset(const struct sock *sk, struct sk_buff 
*rxskb,
+  enum sk_rst_reason reason)
 {
const struct ipv6hdr *rxip6h;
struct sk_buff *skb;
@@ -656,7 +658,7 @@ static int dccp_v6_do_rcv(struct sock *sk, struct sk_buff 
*skb)
return 0;
 
 reset:
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
 discard:
if (opt_skb != NULL)
__kfree_skb(opt_skb);
@@ -762,7 +764,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
} else if (dccp_child_process(sk, nsk, skb)) {
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, 
SK_RST_REASON_NOT_SPECIFIED);
goto discard_and_relse;
} else {
sock_put(sk);
@@ -801,7 +803,7 @@ static int dccp_v6_rcv(struct sk_buff *skb)
if (dh->dccph_type != DCCP_PKT_RESET) {
DCCP_SKB_CB(skb)->dccpd_reset_code =
DCCP_RESET_CODE_NO_CONNECTION;
-   dccp_v6_ctl_send_reset(sk, skb);
+   dccp_v6_ctl_send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
}
 
 discard_it:
diff --git a/net/dccp/minisocks.c 

[PATCH net-next v6 1/7] net: introduce rstreason to detect why the RST is sent

2024-04-17 Thread Jason Xing
From: Jason Xing 

Add a new standalone file for the easy future extension to support
both active reset and passive reset in the TCP/DCCP/MPTCP protocols.

This patch only does the preparations for reset reason mechanism,
nothing else changes.

The reset reasons are divided into three parts:
1) reuse drop reasons for passive reset in TCP
2) reuse MP_TCPRST option for MPTCP
3) our own reasons

I will implement the basic codes of active/passive reset reason in
those three protocols, which is not complete for this moment. But
it provides a new chance to let other people add more reasons into
it:)

Signed-off-by: Jason Xing 
---
 include/net/rstreason.h | 93 +
 1 file changed, 93 insertions(+)
 create mode 100644 include/net/rstreason.h

diff --git a/include/net/rstreason.h b/include/net/rstreason.h
new file mode 100644
index ..0c3fa55fa62f
--- /dev/null
+++ b/include/net/rstreason.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef _LINUX_RSTREASON_H
+#define _LINUX_RSTREASON_H
+#include 
+
+#define DEFINE_RST_REASON(FN, FNe) \
+   FN(MPTCP_RST_EUNSPEC)   \
+   FN(MPTCP_RST_EMPTCP)\
+   FN(MPTCP_RST_ERESOURCE) \
+   FN(MPTCP_RST_EPROHIBIT) \
+   FN(MPTCP_RST_EWQ2BIG)   \
+   FN(MPTCP_RST_EBADPERF)  \
+   FN(MPTCP_RST_EMIDDLEBOX)\
+   FN(NOT_SPECIFIED)   \
+   FNe(MAX)
+
+#define RST_REASON_START (SKB_DROP_REASON_MAX + 1)
+
+/* There are three parts in order:
+ * 1) 0 - SKB_DROP_REASON_MAX: rely on drop reasons for passive reset in TCP
+ * 2) SKB_DROP_REASON_MAX + 1 - MPTCP_RST_EMIDDLEBOX: for MPTCP use
+ * 3) MPTCP_RST_EMIDDLEBOX - SK_RST_REASON_MAX: independent reset reason
+ */
+enum sk_rst_reason {
+   /* Leave this 'blank' part (0-SKB_DROP_REASON_MAX) for the reuse
+* of skb drop reason because rst reason relies on what drop reason
+* indicates exactly why it could happen.
+*/
+
+   /* Copy from include/uapi/linux/mptcp.h.
+* These reset fields will not be changed since they adhere to
+* RFC 8684. So do not touch them. I'm going to list each definition
+* of them respectively.
+*/
+   /* Unspecified error.
+* This is the default error; it implies that the subflow is no
+* longer available. The presence of this option shows that the
+* RST was generated by an MPTCP-aware device.
+*/
+   SK_RST_REASON_MPTCP_RST_EUNSPEC = RST_REASON_START,
+   /* MPTCP-specific error.
+* An error has been detected in the processing of MPTCP options.
+* This is the usual reason code to return in the cases where a RST
+* is being sent to close a subflow because of an invalid response.
+*/
+   SK_RST_REASON_MPTCP_RST_EMPTCP,
+   /* Lack of resources.
+* This code indicates that the sending host does not have enough
+* resources to support the terminated subflow.
+*/
+   SK_RST_REASON_MPTCP_RST_ERESOURCE,
+   /* Administratively prohibited.
+* This code indicates that the requested subflow is prohibited by
+* the policies of the sending host.
+*/
+   SK_RST_REASON_MPTCP_RST_EPROHIBIT,
+   /* Too much outstanding data.
+* This code indicates that there is an excessive amount of data
+* that needs to be transmitted over the terminated subflow while
+* having already been acknowledged over one or more other subflows.
+* This may occur if a path has been unavailable for a short period
+* and it is more efficient to reset and start again than it is to
+* retransmit the queued data.
+*/
+   SK_RST_REASON_MPTCP_RST_EWQ2BIG,
+   /* Unacceptable performance.
+* This code indicates that the performance of this subflow was
+* too low compared to the other subflows of this Multipath TCP
+* connection.
+*/
+   SK_RST_REASON_MPTCP_RST_EBADPERF,
+   /* Middlebox interference.
+* Middlebox interference has been detected over this subflow,
+* making MPTCP signaling invalid. For example, this may be sent
+* if the checksum does not validate.
+*/
+   SK_RST_REASON_MPTCP_RST_EMIDDLEBOX,
+
+   /* For the real standalone socket reset reason, we start from here */
+   SK_RST_REASON_NOT_SPECIFIED,
+
+   /* Maximum of socket reset reasons.
+* It shouldn't be used as a real 'reason'.
+*/
+   SK_RST_REASON_MAX,
+};
+
+static inline enum sk_rst_reason convert_mptcp_reason(u32 reason)
+{
+   return reason += RST_REASON_START;
+}
+#endif
-- 
2.37.3




[PATCH net-next v6 0/7] Implement reset reason mechanism to detect

2024-04-17 Thread Jason Xing
From: Jason Xing 

In production, there are so many cases about why the RST skb is sent but
we don't have a very convenient/fast method to detect the exact underlying
reasons.

RST is implemented in two kinds: passive kind (like tcp_v4_send_reset())
and active kind (like tcp_send_active_reset()). The former can be traced
carefully 1) in TCP, with the help of drop reasons, which is based on
Eric's idea[1], 2) in MPTCP, with the help of reset options defined in
RFC 8684. The latter is relatively independent, which should be
implemented on our own.

In this series, I focus on the fundamental implement mostly about how
the rstreason mechnism works and give the detailed passive part as an
example, not including the active reset part. In future, we can go
further and refine those NOT_SPECIFIED reasons.

Here are some examples when tracing:
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NOT_SPECIFIED
-0   [002] ..s1.  1830.262425: tcp_send_reset: skbaddr=x
skaddr=x src=x dest=x state=x reason=NO_SOCKET

[1]
Link: 
https://lore.kernel.org/all/CANn89iJw8x-LqgsWOeJQQvgVg6DnL5aBRLi10QN2WBdr+X4k=w...@mail.gmail.com/

v6
1. add back casts, or else they are treated as error.

v5
Link: 
https://lore.kernel.org/all/2024045630.38420-1-kerneljasonx...@gmail.com/
1. address format issue (like reverse xmas tree) (Eric, Paolo)
2. remove unnecessary casts. (Eric)
3. introduce a helper used in mptcp active reset. See patch 6. (Paolo)

v4
Link: 
https://lore.kernel.org/all/20240409100934.37725-1-kerneljasonx...@gmail.com/
1. passing 'enum sk_rst_reason' for readability when tracing (Antoine)

v3
Link: 
https://lore.kernel.org/all/20240404072047.11490-1-kerneljasonx...@gmail.com/
1. rebase (mptcp part) and address what Mat suggested.

v2
Link: https://lore.kernel.org/all/20240403185033.47ebc...@kernel.org/
1. rebase against the latest net-next tree



Jason Xing (7):
  net: introduce rstreason to detect why the RST is sent
  rstreason: prepare for passive reset
  rstreason: prepare for active reset
  tcp: support rstreason for passive reset
  mptcp: support rstreason for passive reset
  mptcp: introducing a helper into active reset logic
  rstreason: make it work in trace world

 include/net/request_sock.h |  4 +-
 include/net/rstreason.h| 93 ++
 include/net/tcp.h  |  3 +-
 include/trace/events/tcp.h | 37 +--
 net/dccp/ipv4.c| 10 ++--
 net/dccp/ipv6.c| 10 ++--
 net/dccp/minisocks.c   |  3 +-
 net/ipv4/tcp.c | 15 --
 net/ipv4/tcp_ipv4.c| 14 +++---
 net/ipv4/tcp_minisocks.c   |  3 +-
 net/ipv4/tcp_output.c  |  5 +-
 net/ipv4/tcp_timer.c   |  9 ++--
 net/ipv6/tcp_ipv6.c| 17 ---
 net/mptcp/protocol.c   |  2 +-
 net/mptcp/protocol.h   | 11 +
 net/mptcp/subflow.c| 27 ---
 16 files changed, 216 insertions(+), 47 deletions(-)
 create mode 100644 include/net/rstreason.h

-- 
2.37.3




Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Willy Tarreau
On Wed, Apr 17, 2024 at 10:16:26AM +0200, Greg KH wrote:
> > at the scripts used by stable developers - and maybe at the ML server - to
> > catch different variations won't hurt, as it sounds likely that people will
> > end messing up with a big name like "do-not-apply-to-stable", typing
> > instead things like:
> > 
> > do_not_apply_to_stable
> > dont-apply-to-stable
> > 
> > and other variants.
> 
> I want this very explicit that someone does not want this applied, and
> that it has a reason to do so.  And if getting the email right to do so
> is the issue with that, that's fine.  This is a very rare case that
> almost no one should normally hit.

For using a comparable approach in haproxy on a daily basis, I do see
the value in this. We just mark a lot of fixes "no backport needed" or
"no backport needed unless blablabla" for everything that is only
relevant to the dev tree, and that's a huge time saver for those working
on the backports later.

Maybe "not-for-stable" would be both shorter and easier to remember BTW ?

Regards,
Willy



Re: [PATCH v5 5/5] Documentation: Add reconnect process for VDUSE

2024-04-17 Thread Cindy Lu
On Tue, Apr 16, 2024 at 11:46 AM Jason Wang  wrote:
>
> On Fri, Apr 12, 2024 at 9:31 PM Cindy Lu  wrote:
> >
> > Add a document explaining the reconnect process, including what the
> > Userspace App needs to do and how it works with the kernel.
> >
> > Signed-off-by: Cindy Lu 
> > ---
> >  Documentation/userspace-api/vduse.rst | 41 +++
> >  1 file changed, 41 insertions(+)
> >
> > diff --git a/Documentation/userspace-api/vduse.rst 
> > b/Documentation/userspace-api/vduse.rst
> > index bdb880e01132..7faa83462e78 100644
> > --- a/Documentation/userspace-api/vduse.rst
> > +++ b/Documentation/userspace-api/vduse.rst
> > @@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
> > after the used ring is filled.
> >
> >  For more details on the uAPI, please see include/uapi/linux/vduse.h.
> > +
> > +HOW VDUSE devices reconnection works
> > +
> > +1. What is reconnection?
> > +
> > +   When the userspace application loads, it should establish a connection
> > +   to the vduse kernel device. Sometimes,the userspace application exists,
>
> I guess you meant "exists"? If yes, it should be better to say "exits
> unexpectedly"
>
> > +   and we want to support its restart and connect to the kernel device 
> > again
> > +
> > +2. How can I support reconnection in a userspace application?
>
> Better to say "How reconnection is supported"?
>
> > +
> > +2.1 During initialization, the userspace application should first verify 
> > the
> > +existence of the device "/dev/vduse/vduse_name".
> > +If it doesn't exist, it means this is the first-time for connection. 
> > goto step 2.2
> > +If it exists, it means this is a reconnection, and we should goto step 
> > 2.3
> > +
> > +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> > +/dev/vduse/control.
> > +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for
> > +the reconnect information. The total memory size is 
> > PAGE_SIZE*vq_mumber.
>
> I think we need to mention that this should be part of the previous
> "VDUSE devices are created as follows"?
>
> > +
> > +2.3 Check if the information is suitable for reconnect
> > +If this is reconnection :
> > +Before attempting to reconnect, The userspace application needs to use 
> > the
> > +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
> > VDUSE_DEV_GET_FEATURES...)
> > +to get the information from kernel.
> > +Please review the information and confirm if it is suitable to 
> > reconnect.
>
> Need to define "review" here and how to decide if it is not suitable
> to reconnect.
>
> > +
> > +2.4 Userspace application needs to mmap the memory to userspace
> > +The userspace application requires mapping one page for every vq. 
> > These pages
> > +should be used to save vq-related information during system running.
>
> Not a native speaker, but it looks better with
>
> "should be used by the userspace to store virtqueue specific information".
>
> > Additionally,
> > +the application must define its own structure to store information for 
> > reconnection.
> > +
> > +2.5 Completed the initialization and running the application.
> > +While the application is running, it is important to store relevant 
> > information
> > +about reconnections in mapped pages.
>
> I think we need some link/code examples to demonstrate what needs to be 
> stored.
>
> > When calling the ioctl VDUSE_VQ_GET_INFO to
> > +get vq information, it's necessary to check whether it's a 
> > reconnection.
>
> Better with some examples of codes.
>
> > If it is
> > +a reconnection, the vq-related information must be get from the mapped 
> > pages.
> > +
> > +2.6 When the Userspace application exits, it is necessary to unmap all the
> > +pages for reconnection
>
> This seems to be unnecessary, for example there could be an unexpected exit.
>
> Thanks
>
Thanks Jason, I will send a new version
Thanks
Cindy
> > --
> > 2.43.0
> >
>




Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Greg KH
On Wed, Apr 17, 2024 at 09:09:26AM +0100, Mauro Carvalho Chehab wrote:
> Em Wed, 17 Apr 2024 09:48:18 +0200
> Thorsten Leemhuis  escreveu:
> 
> > Hi kernel.org helpdesk!
> > 
> > Could you please create the email alias
> > do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
> > just like sta...@kernel.org does?
> > 
> > That's an idea GregKH brought up a few days ago here:
> > https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/
> > 
> > To quote:
> > 
> > > How about:
> > >   cc:  # Reason goes here, and must be 
> > > present
> > > 
> > > and we can make that address be routed to /dev/null just like
> > >  is?  
> > 
> > There was some discussion about using something shorter, but in the end
> > there was no strong opposition and the thread ended a a few days ago.
> 
> Heh, a shorter name would make it a lot easier to remember, specially
> since not wanting a patch to go to stable is an exception... I bet
> I'll never remember the right syntax, needing to look at the docs
> every time it would be used.
> 
> IMO, something like:
>   no-stable
> or
>   nostable
> 
> would do the trick and would be a lot easier to remember.
> 
> Btw, IMO, it won't hurt accepting more than one variant that
> could be allowed, e. g. using a regular expression like:
> 
>   (do)?[-_]?(nt|not?).*stable

That's not going to work at the mailing list server, or with my scripts,
sorry.

> at the scripts used by stable developers - and maybe at the ML server - to
> catch different variations won't hurt, as it sounds likely that people will
> end messing up with a big name like "do-not-apply-to-stable", typing
> instead things like:
> 
>   do_not_apply_to_stable
>   dont-apply-to-stable
> 
> and other variants.

I want this very explicit that someone does not want this applied, and
that it has a reason to do so.  And if getting the email right to do so
is the issue with that, that's fine.  This is a very rare case that
almost no one should normally hit.

And again, if maintainers don't want their tree to have Fixes: and
AUTOBOT run on them, we can easily add that to our lists.  That's the
simpler and more explicit thing to do for those that do not want to have
anything backported they do not explicitly mark as such (some subsystems
do this already, like kvm and -mm and xfs, it's fine!).  This all is
here because of maintainers who do NOT want to do that.

thanks,

greg k-h



Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Mauro Carvalho Chehab
Em Wed, 17 Apr 2024 09:48:18 +0200
Thorsten Leemhuis  escreveu:

> Hi kernel.org helpdesk!
> 
> Could you please create the email alias
> do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
> just like sta...@kernel.org does?
> 
> That's an idea GregKH brought up a few days ago here:
> https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/
> 
> To quote:
> 
> > How about:
> > cc:  # Reason goes here, and must be 
> > present
> > 
> > and we can make that address be routed to /dev/null just like
> >  is?  
> 
> There was some discussion about using something shorter, but in the end
> there was no strong opposition and the thread ended a a few days ago.

Heh, a shorter name would make it a lot easier to remember, specially
since not wanting a patch to go to stable is an exception... I bet
I'll never remember the right syntax, needing to look at the docs
every time it would be used.

IMO, something like:
no-stable
or
nostable

would do the trick and would be a lot easier to remember.

Btw, IMO, it won't hurt accepting more than one variant that
could be allowed, e. g. using a regular expression like:

(do)?[-_]?(nt|not?).*stable

at the scripts used by stable developers - and maybe at the ML server - to
catch different variations won't hurt, as it sounds likely that people will
end messing up with a big name like "do-not-apply-to-stable", typing
instead things like:

do_not_apply_to_stable
dont-apply-to-stable

and other variants.

Just my 2 cents.

Regards,
Mauro



Re: [PATCH net-next v9] virtio_net: Support RX hash XDP hint

2024-04-17 Thread Heng Qi




在 2024/4/17 下午3:18, Liang Chen 写道:

The RSS hash report is a feature that's part of the virtio specification.
Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
(still a work in progress as per [1]) support this feature. While the
capability to obtain the RSS hash has been enabled in the normal path,
it's currently missing in the XDP path. Therefore, we are introducing
XDP hints through kfuncs to allow XDP programs to access the RSS hash.

1.
https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r

Signed-off-by: Liang Chen 


Reviewed-by: Heng Qi 

Thanks.


---
   Changes from v8:
- move max table macro out of uAPI
   Changes from v7:
- use table lookup for rss hash type
   Changes from v6:
- fix a coding style issue
   Changes from v5:
- Preservation of the hash value has been dropped, following the conclusion
   from discussions in V3 reviews. The virtio_net driver doesn't
   accessing/using the virtio_net_hdr after the XDP program execution, so
   nothing tragic should happen. As to the xdp program, if it smashes the
   entry in virtio header, it is likely buggy anyways. Additionally, looking
   up the Intel IGC driver,  it also does not bother with this particular
   aspect.
---
  drivers/net/virtio_net.c | 43 
  1 file changed, 43 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..eb99bf6c555e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4621,6 +4621,48 @@ static void virtnet_set_big_packets(struct virtnet_info 
*vi, const int mtu)
}
  }
  
+#define VIRTIO_NET_HASH_REPORT_MAX_TABLE  10

+static enum xdp_rss_hash_type
+virtnet_xdp_rss_type[VIRTIO_NET_HASH_REPORT_MAX_TABLE] = {
+   [VIRTIO_NET_HASH_REPORT_NONE] = XDP_RSS_TYPE_NONE,
+   [VIRTIO_NET_HASH_REPORT_IPv4] = XDP_RSS_TYPE_L3_IPV4,
+   [VIRTIO_NET_HASH_REPORT_TCPv4] = XDP_RSS_TYPE_L4_IPV4_TCP,
+   [VIRTIO_NET_HASH_REPORT_UDPv4] = XDP_RSS_TYPE_L4_IPV4_UDP,
+   [VIRTIO_NET_HASH_REPORT_IPv6] = XDP_RSS_TYPE_L3_IPV6,
+   [VIRTIO_NET_HASH_REPORT_TCPv6] = XDP_RSS_TYPE_L4_IPV6_TCP,
+   [VIRTIO_NET_HASH_REPORT_UDPv6] = XDP_RSS_TYPE_L4_IPV6_UDP,
+   [VIRTIO_NET_HASH_REPORT_IPv6_EX] = XDP_RSS_TYPE_L3_IPV6_EX,
+   [VIRTIO_NET_HASH_REPORT_TCPv6_EX] = XDP_RSS_TYPE_L4_IPV6_TCP_EX,
+   [VIRTIO_NET_HASH_REPORT_UDPv6_EX] = XDP_RSS_TYPE_L4_IPV6_UDP_EX
+};
+
+static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
+  enum xdp_rss_hash_type *rss_type)
+{
+   const struct xdp_buff *xdp = (void *)_ctx;
+   struct virtio_net_hdr_v1_hash *hdr_hash;
+   struct virtnet_info *vi;
+   u16 hash_report;
+
+   if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
+   return -ENODATA;
+
+   vi = netdev_priv(xdp->rxq->dev);
+   hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len);
+   hash_report = __le16_to_cpu(hdr_hash->hash_report);
+
+   if (hash_report >= VIRTIO_NET_HASH_REPORT_MAX_TABLE)
+   hash_report = VIRTIO_NET_HASH_REPORT_NONE;
+
+   *rss_type = virtnet_xdp_rss_type[hash_report];
+   *hash = __le32_to_cpu(hdr_hash->hash_value);
+   return 0;
+}
+
+static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = {
+   .xmo_rx_hash= virtnet_xdp_rx_hash,
+};
+
  static int virtnet_probe(struct virtio_device *vdev)
  {
int i, err = -ENOMEM;
@@ -4747,6 +4789,7 @@ static int virtnet_probe(struct virtio_device *vdev)
  VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
  
  		dev->hw_features |= NETIF_F_RXHASH;

+   dev->xdp_metadata_ops = _xdp_metadata_ops;
}
  
  	if (vi->has_rss_hash_report)





Re: Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Greg KH
On Wed, Apr 17, 2024 at 09:48:18AM +0200, Thorsten Leemhuis wrote:
> Hi kernel.org helpdesk!
> 
> Could you please create the email alias
> do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
> just like sta...@kernel.org does?
> 
> That's an idea GregKH brought up a few days ago here:
> https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/
> 
> To quote:
> 
> > How about:
> > cc:  # Reason goes here, and must be 
> > present
> > 
> > and we can make that address be routed to /dev/null just like
> >  is?
> 
> There was some discussion about using something shorter, but in the end
> there was no strong opposition and the thread ended a a few days ago.
> 
> The goal is to have a tag that developers can use in Linux kenrel
> commits that have a Fixes: tag, but nevertheless should not be
> backported by the stable-team without explicit request. The thread
> linked above explains this in more detail. Once the address exists I'll
> resubmit the patches in question that will document the tag.
> 
> Is asking for this here like this the right way? If I need to file a
> ticket somewhere or some ack from a higher authority, just let me know!

I approve this message :)

thanks,

greg k-h



Re: [PATCH net-next v9] virtio_net: Support RX hash XDP hint

2024-04-17 Thread Jesper Dangaard Brouer




On 17/04/2024 09.18, Liang Chen wrote:

The RSS hash report is a feature that's part of the virtio specification.
Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
(still a work in progress as per [1]) support this feature. While the
capability to obtain the RSS hash has been enabled in the normal path,
it's currently missing in the XDP path. Therefore, we are introducing
XDP hints through kfuncs to allow XDP programs to access the RSS hash.

1.
https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r

Signed-off-by: Liang Chen


LGTM

Acked-by: Jesper Dangaard Brouer 



Please create the email alias do-not-apply-to-sta...@kernel.org -> /dev/null

2024-04-17 Thread Thorsten Leemhuis
Hi kernel.org helpdesk!

Could you please create the email alias
do-not-apply-to-sta...@kernel.org which redirects all mail to /dev/null,
just like sta...@kernel.org does?

That's an idea GregKH brought up a few days ago here:
https://lore.kernel.org/all/2024041123-earthling-primarily-4656@gregkh/

To quote:

> How about:
>   cc:  # Reason goes here, and must be 
> present
> 
> and we can make that address be routed to /dev/null just like
>  is?

There was some discussion about using something shorter, but in the end
there was no strong opposition and the thread ended a a few days ago.

The goal is to have a tag that developers can use in Linux kenrel
commits that have a Fixes: tag, but nevertheless should not be
backported by the stable-team without explicit request. The thread
linked above explains this in more detail. Once the address exists I'll
resubmit the patches in question that will document the tag.

Is asking for this here like this the right way? If I need to file a
ticket somewhere or some ack from a higher authority, just let me know!

Ciao, Thorsten




[PATCH net-next v9] virtio_net: Support RX hash XDP hint

2024-04-17 Thread Liang Chen
The RSS hash report is a feature that's part of the virtio specification.
Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
(still a work in progress as per [1]) support this feature. While the
capability to obtain the RSS hash has been enabled in the normal path,
it's currently missing in the XDP path. Therefore, we are introducing
XDP hints through kfuncs to allow XDP programs to access the RSS hash.

1.
https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r

Signed-off-by: Liang Chen 
---
  Changes from v8:
- move max table macro out of uAPI
  Changes from v7:
- use table lookup for rss hash type
  Changes from v6:
- fix a coding style issue
  Changes from v5:
- Preservation of the hash value has been dropped, following the conclusion
  from discussions in V3 reviews. The virtio_net driver doesn't
  accessing/using the virtio_net_hdr after the XDP program execution, so
  nothing tragic should happen. As to the xdp program, if it smashes the
  entry in virtio header, it is likely buggy anyways. Additionally, looking
  up the Intel IGC driver,  it also does not bother with this particular
  aspect.
---
 drivers/net/virtio_net.c | 43 
 1 file changed, 43 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..eb99bf6c555e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4621,6 +4621,48 @@ static void virtnet_set_big_packets(struct virtnet_info 
*vi, const int mtu)
}
 }
 
+#define VIRTIO_NET_HASH_REPORT_MAX_TABLE  10
+static enum xdp_rss_hash_type
+virtnet_xdp_rss_type[VIRTIO_NET_HASH_REPORT_MAX_TABLE] = {
+   [VIRTIO_NET_HASH_REPORT_NONE] = XDP_RSS_TYPE_NONE,
+   [VIRTIO_NET_HASH_REPORT_IPv4] = XDP_RSS_TYPE_L3_IPV4,
+   [VIRTIO_NET_HASH_REPORT_TCPv4] = XDP_RSS_TYPE_L4_IPV4_TCP,
+   [VIRTIO_NET_HASH_REPORT_UDPv4] = XDP_RSS_TYPE_L4_IPV4_UDP,
+   [VIRTIO_NET_HASH_REPORT_IPv6] = XDP_RSS_TYPE_L3_IPV6,
+   [VIRTIO_NET_HASH_REPORT_TCPv6] = XDP_RSS_TYPE_L4_IPV6_TCP,
+   [VIRTIO_NET_HASH_REPORT_UDPv6] = XDP_RSS_TYPE_L4_IPV6_UDP,
+   [VIRTIO_NET_HASH_REPORT_IPv6_EX] = XDP_RSS_TYPE_L3_IPV6_EX,
+   [VIRTIO_NET_HASH_REPORT_TCPv6_EX] = XDP_RSS_TYPE_L4_IPV6_TCP_EX,
+   [VIRTIO_NET_HASH_REPORT_UDPv6_EX] = XDP_RSS_TYPE_L4_IPV6_UDP_EX
+};
+
+static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
+  enum xdp_rss_hash_type *rss_type)
+{
+   const struct xdp_buff *xdp = (void *)_ctx;
+   struct virtio_net_hdr_v1_hash *hdr_hash;
+   struct virtnet_info *vi;
+   u16 hash_report;
+
+   if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
+   return -ENODATA;
+
+   vi = netdev_priv(xdp->rxq->dev);
+   hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len);
+   hash_report = __le16_to_cpu(hdr_hash->hash_report);
+
+   if (hash_report >= VIRTIO_NET_HASH_REPORT_MAX_TABLE)
+   hash_report = VIRTIO_NET_HASH_REPORT_NONE;
+
+   *rss_type = virtnet_xdp_rss_type[hash_report];
+   *hash = __le32_to_cpu(hdr_hash->hash_value);
+   return 0;
+}
+
+static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = {
+   .xmo_rx_hash= virtnet_xdp_rx_hash,
+};
+
 static int virtnet_probe(struct virtio_device *vdev)
 {
int i, err = -ENOMEM;
@@ -4747,6 +4789,7 @@ static int virtnet_probe(struct virtio_device *vdev)
  VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
 
dev->hw_features |= NETIF_F_RXHASH;
+   dev->xdp_metadata_ops = _xdp_metadata_ops;
}
 
if (vi->has_rss_hash_report)
-- 
2.40.1