Re: [PATCH v10 03/36] x86: tracing: Add ftrace_regs definition in the header

2024-05-23 Thread Google
On Thu, 23 May 2024 19:14:59 -0400
Steven Rostedt  wrote:

> On Tue,  7 May 2024 23:08:35 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > From: Masami Hiramatsu (Google) 
> > 
> > Add ftrace_regs definition for x86_64 in the ftrace header to
> > clarify what register will be accessible from ftrace_regs.
> > 
> > Signed-off-by: Masami Hiramatsu (Google) 
> > ---
> >  Changes in v3:
> >   - Add rip to be saved.
> >  Changes in v2:
> >   - Newly added.
> > ---
> >  arch/x86/include/asm/ftrace.h |6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> > index cf88cc8cc74d..c88bf47f46da 100644
> > --- a/arch/x86/include/asm/ftrace.h
> > +++ b/arch/x86/include/asm/ftrace.h
> > @@ -36,6 +36,12 @@ static inline unsigned long ftrace_call_adjust(unsigned 
> > long addr)
> >  
> >  #ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> >  struct ftrace_regs {
> > +   /*
> > +* On the x86_64, the ftrace_regs saves;
> > +* rax, rcx, rdx, rdi, rsi, r8, r9, rbp, rip and rsp.
> > +* Also orig_ax is used for passing direct trampoline address.
> > +* x86_32 doesn't support ftrace_regs.
> 
> Should add a comment that if fregs->regs.cs is set, then all of the pt_regs
> is valid.

But what about rbx and r1*? Only regs->cs should be care for pt_regs?
Or, did you mean "the ftrace_regs is valid"?

> And x86_32 does support ftrace_regs, it just doesn't support
> having a subset of it.

Oh, thanks. I'll update the comment about x86_32.

Thank you,

> 
> -- Steve
> 
> 
> > +*/
> > struct pt_regs  regs;
> >  };
> >  
> 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v10 01/36] tracing: Add a comment about ftrace_regs definition

2024-05-23 Thread Google
On Thu, 23 May 2024 19:10:31 -0400
Steven Rostedt  wrote:

> On Tue,  7 May 2024 23:08:12 +0900
> "Masami Hiramatsu (Google)"  wrote:
> 
> > From: Masami Hiramatsu (Google) 
> > 
> > To clarify what will be expected on ftrace_regs, add a comment to the
> > architecture independent definition of the ftrace_regs.
> > 
> > Signed-off-by: Masami Hiramatsu (Google) 
> > Acked-by: Mark Rutland 
> > ---
> >  Changes in v8:
> >   - Update that the saved registers depends on the context.
> >  Changes in v3:
> >   - Add instruction pointer
> >  Changes in v2:
> >   - newly added.
> > ---
> >  include/linux/ftrace.h |   26 ++
> >  1 file changed, 26 insertions(+)
> > 
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > index 54d53f345d14..b81f1afa82a1 100644
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -118,6 +118,32 @@ extern int ftrace_enabled;
> >  
> >  #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> >  
> > +/**
> > + * ftrace_regs - ftrace partial/optimal register set
> > + *
> > + * ftrace_regs represents a group of registers which is used at the
> > + * function entry and exit. There are three types of registers.
> > + *
> > + * - Registers for passing the parameters to callee, including the stack
> > + *   pointer. (e.g. rcx, rdx, rdi, rsi, r8, r9 and rsp on x86_64)
> > + * - Registers for passing the return values to caller.
> > + *   (e.g. rax and rdx on x86_64)
> > + * - Registers for hooking the function call and return including the
> > + *   frame pointer (the frame pointer is architecture/config dependent)
> > + *   (e.g. rip, rbp and rsp for x86_64)
> > + *
> > + * Also, architecture dependent fields can be used for internal process.
> > + * (e.g. orig_ax on x86_64)
> > + *
> > + * On the function entry, those registers will be restored except for
> > + * the stack pointer, so that user can change the function parameters
> > + * and instruction pointer (e.g. live patching.)
> > + * On the function exit, only registers which is used for return values
> > + * are restored.
> 
> I wonder if we should also add a note about some architectures in some
> circumstances may store all pt_regs in ftrace_regs. For example, if an
> architecture supports FTRACE_WITH_REGS, it may pass the pt_regs within the
> ftrace_regs. If that is the case, then ftrace_get_regs() called on it will
> return a pointer to a valid pt_regs, or NULL if it is not supported or the
> ftrace_regs does not have a all the registers.

Agreed. That case also should be noted. Thanks for pointing!


> 
> -- Steve
> 
> 
> > + *
> > + * NOTE: user *must not* access regs directly, only do it via APIs, because
> > + * the member can be changed according to the architecture.
> > + */
> >  struct ftrace_regs {
> > struct pt_regs  regs;
> >  };
> 


-- 
Masami Hiramatsu (Google) 



Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

2024-05-23 Thread John Groves
On 24/05/23 03:57PM, Miklos Szeredi wrote:
> [trimming CC list]
> 
> On Thu, 23 May 2024 at 04:49, John Groves  wrote:
> 
> > - memmap=! will reserve a pretend pmem device at 
> > 
> > - memmap=$ will reserve a pretend dax device at 
> > 
> 
> Doesn't get me a /dev/dax or /dev/pmem
> 
> Complete qemu command line:
> 
> qemu-kvm -s -serial none -parallel none -kernel
> /home/mszeredi/git/linux/arch/x86/boot/bzImage -drive
> format=raw,file=/home/mszeredi/root_fs,index=0,if=virtio -drive
> format=raw,file=/home/mszeredi/images/ubd1,index=1,if=virtio -chardev
> stdio,id=virtiocon0,signal=off -device virtio-serial -device
> virtconsole,chardev=virtiocon0 -cpu host -m 8G -net user -net
> nic,model=virtio -fsdev local,security_model=none,id=fsdev0,path=/home
> -device virtio-9p-pci,fsdev=fsdev0,mount_tag=hostshare -device
> virtio-rng-pci -smp 4 -append 'root=/dev/vda console=hvc0
> memmap=4G$4G'
> 
> root@kvm:~/famfs# scripts/chk_efi.sh
> This system is neither Ubuntu nor Fedora. It is identified as debian.
> /sys/firmware/efi not found; probably not efi
>  not found; probably nof efi
> /boot/efi/EFI not found; probably not efi
> /boot/efi/EFI/BOOT not found; probably not efi
> /boot/efi/EFI/ not found; probably not efi
> /boot/efi/EFI//grub.cfg not found; probably nof efi
> Probably not efi; errs=6
> 
> Thanks,
> Miklos


Apologies, but I'm short on time at the moment - going into a long holiday
weekend in the US with family plans. I should be focused again by middle of
next week.

But can you check /proc/cmdline to see of the memmap arg got through without
getting mangled? The '$' tends to get fubar'd. You might need \$, or I've seen
the need for \\\$. If it's un-mangled, there should be a dax device.

If that doesn't work, it's worth trying '!' instead, which I think would give
you a pmem device - if the arg gets through (but ! is less likely to get
horked). That pmem device can be converted to devdax...

Regards,
John




Re: [PATCH v10 03/36] x86: tracing: Add ftrace_regs definition in the header

2024-05-23 Thread Steven Rostedt
On Tue,  7 May 2024 23:08:35 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> Add ftrace_regs definition for x86_64 in the ftrace header to
> clarify what register will be accessible from ftrace_regs.
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  Changes in v3:
>   - Add rip to be saved.
>  Changes in v2:
>   - Newly added.
> ---
>  arch/x86/include/asm/ftrace.h |6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
> index cf88cc8cc74d..c88bf47f46da 100644
> --- a/arch/x86/include/asm/ftrace.h
> +++ b/arch/x86/include/asm/ftrace.h
> @@ -36,6 +36,12 @@ static inline unsigned long ftrace_call_adjust(unsigned 
> long addr)
>  
>  #ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
>  struct ftrace_regs {
> + /*
> +  * On the x86_64, the ftrace_regs saves;
> +  * rax, rcx, rdx, rdi, rsi, r8, r9, rbp, rip and rsp.
> +  * Also orig_ax is used for passing direct trampoline address.
> +  * x86_32 doesn't support ftrace_regs.

Should add a comment that if fregs->regs.cs is set, then all of the pt_regs
is valid. And x86_32 does support ftrace_regs, it just doesn't support
having a subset of it.

-- Steve


> +  */
>   struct pt_regs  regs;
>  };
>  




Re: [PATCH v10 01/36] tracing: Add a comment about ftrace_regs definition

2024-05-23 Thread Steven Rostedt
On Tue,  7 May 2024 23:08:12 +0900
"Masami Hiramatsu (Google)"  wrote:

> From: Masami Hiramatsu (Google) 
> 
> To clarify what will be expected on ftrace_regs, add a comment to the
> architecture independent definition of the ftrace_regs.
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> Acked-by: Mark Rutland 
> ---
>  Changes in v8:
>   - Update that the saved registers depends on the context.
>  Changes in v3:
>   - Add instruction pointer
>  Changes in v2:
>   - newly added.
> ---
>  include/linux/ftrace.h |   26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index 54d53f345d14..b81f1afa82a1 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -118,6 +118,32 @@ extern int ftrace_enabled;
>  
>  #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
>  
> +/**
> + * ftrace_regs - ftrace partial/optimal register set
> + *
> + * ftrace_regs represents a group of registers which is used at the
> + * function entry and exit. There are three types of registers.
> + *
> + * - Registers for passing the parameters to callee, including the stack
> + *   pointer. (e.g. rcx, rdx, rdi, rsi, r8, r9 and rsp on x86_64)
> + * - Registers for passing the return values to caller.
> + *   (e.g. rax and rdx on x86_64)
> + * - Registers for hooking the function call and return including the
> + *   frame pointer (the frame pointer is architecture/config dependent)
> + *   (e.g. rip, rbp and rsp for x86_64)
> + *
> + * Also, architecture dependent fields can be used for internal process.
> + * (e.g. orig_ax on x86_64)
> + *
> + * On the function entry, those registers will be restored except for
> + * the stack pointer, so that user can change the function parameters
> + * and instruction pointer (e.g. live patching.)
> + * On the function exit, only registers which is used for return values
> + * are restored.

I wonder if we should also add a note about some architectures in some
circumstances may store all pt_regs in ftrace_regs. For example, if an
architecture supports FTRACE_WITH_REGS, it may pass the pt_regs within the
ftrace_regs. If that is the case, then ftrace_get_regs() called on it will
return a pointer to a valid pt_regs, or NULL if it is not supported or the
ftrace_regs does not have a all the registers.

-- Steve


> + *
> + * NOTE: user *must not* access regs directly, only do it via APIs, because
> + * the member can be changed according to the architecture.
> + */
>  struct ftrace_regs {
>   struct pt_regs  regs;
>  };




Re: [PATCH v2 1/1] x86/vector: Fix vector leak during CPU offline

2024-05-23 Thread Thomas Gleixner
On Wed, May 22 2024 at 15:02, Dongli Zhang wrote:
> The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
> interrupt affinity reconfiguration via procfs. Instead, the change is
> deferred until the next instance of the interrupt being triggered on the
> original CPU.
>
> When the interrupt next triggers on the original CPU, the new affinity is
> enforced within __irq_move_irq(). A vector is allocated from the new CPU,
> but if the old vector on the original CPU remains online, it is not
> immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the
> reclaiming process is delayed until the next trigger of the interrupt on
> the new CPU.
>
> Upon the subsequent triggering of the interrupt on the new CPU,
> irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
> remains online. Subsequently, the timer on the old CPU iterates over its
> vector_cleanup list, reclaiming old vectors.
>
> However, a rare scenario arises if the old CPU is outgoing before the
> interrupt triggers again on the new CPU. The irq_force_complete_move() may
> not have the chance to be invoked on the outgoing CPU to reclaim the old
> apicd->prev_vector. This is because the interrupt isn't currently affine to
> the outgoing CPU, and irq_needs_fixup() returns false. Even though
> __vector_schedule_cleanup() is later called on the new CPU, it doesn't
> reclaim apicd->prev_vector; instead, it simply resets both
> apicd->move_in_progress and apicd->prev_vector to 0.
>
> As a result, the vector remains unreclaimed in vector_matrix, leading to a
> CPU vector leak.
>
> To address this issue, move the invocation of irq_force_complete_move()
> before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the
> interrupt is currently or used to affine to the outgoing CPU. Additionally,
> reclaim the vector in __vector_schedule_cleanup() as well, following a
> warning message, although theoretically it should never see
> apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU.

Nice change log!



Re: [GIT PULL v2] virtio: features, fixes, cleanups

2024-05-23 Thread pr-tracker-bot
The pull request you sent on Thu, 23 May 2024 02:00:17 -0400:

> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/2ef32ad2241340565c35baf77fc95053c84eeeb0

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html



Re: [PATCH] x86/paravirt: Disable virt spinlock when CONFIG_PARAVIRT_SPINLOCKS disabled

2024-05-23 Thread Dave Hansen
On 5/23/24 11:39, Jürgen Groß wrote:
>>
>> Let's just keep it simple.  How about the attached patch?
> 
> Simple indeed. The attachment is empty. 😛

Let's try this again.diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5358d43886ad..c193c9e60a1b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -55,8 +55,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
 
 void __init native_pv_lock_init(void)
 {
-	if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
-	!boot_cpu_has(X86_FEATURE_HYPERVISOR))
+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		static_branch_disable(&virt_spin_lock_key);
 }
 


Re: [PATCH] x86/paravirt: Disable virt spinlock when CONFIG_PARAVIRT_SPINLOCKS disabled

2024-05-23 Thread Jürgen Groß

On 23.05.24 18:30, Dave Hansen wrote:

On 5/16/24 06:02, Chen Yu wrote:

Performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
is disabled the virt_spin_lock_key is set to true on bare-metal.
The qspinlock degenerates to test-and-set spinlock, which decrease the
performance on bare-metal.

Fix this by disabling virt_spin_lock_key if CONFIG_PARAVIRT_SPINLOCKS
is not set, or it is on bare-metal.


This is missing some background:

The kernel can change spinlock behavior when running as a guest.  But
this guest-friendly behavior causes performance problems on bare metal.
So there's a 'virt_spin_lock_key' static key to switch between the two
modes.

The static key is always enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior).

... then describe the regression and the fix


diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5358d43886ad..ee51c0949ed8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -55,7 +55,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
  
  void __init native_pv_lock_init(void)

  {
-   if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
+   if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) ||
!boot_cpu_has(X86_FEATURE_HYPERVISOR))
static_branch_disable(&virt_spin_lock_key);
  }

This gets used at a single site:

 if (pv_enabled())
 goto pv_queue;

 if (virt_spin_lock(lock))
 return;

which is logically:

if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS))
goto ...; // don't look at virt_spin_lock_key

if (virt_spin_lock_key)
return; // On virt, but non-paravirt.  Did Test-and-Set
// spinlock.

So I _think_ Arnd was trying to optimize native_pv_lock_init() away when
it's going to get skipped over anyway by the 'goto'.

But this took me at least 30 minutes of scratching my head and trying to
untangle the whole thing.  It's all far too subtle for my taste, and all
of that to save a few bytes of init text in a configuration that's
probably not even used very often (PARAVIRT=y, but PARAVIRT_SPINLOCKS=n).

Let's just keep it simple.  How about the attached patch?


Simple indeed. The attachment is empty. :-p


Juergen



Re: [PATCH] riscv: Fix early ftrace nop patching

2024-05-23 Thread patchwork-bot+linux-riscv
Hello:

This patch was applied to riscv/linux.git (for-next)
by Palmer Dabbelt :

On Thu, 23 May 2024 13:51:34 +0200 you wrote:
> Commit c97bf629963e ("riscv: Fix text patching when IPI are used")
> converted ftrace_make_nop() to use patch_insn_write() which does not
> emit any icache flush relying entirely on __ftrace_modify_code() to do
> that.
> 
> But we missed that ftrace_make_nop() was called very early directly when
> converting mcount calls into nops (actually on riscv it converts 2B nops
> emitted by the compiler into 4B nops).
> 
> [...]

Here is the summary with links:
  - riscv: Fix early ftrace nop patching
https://git.kernel.org/riscv/c/6ca445d8af0e

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html





Re: [PATCH v2 1/2] drivers: remoteproc: xlnx: add attach detach support

2024-05-23 Thread Tanmay Shah



On 5/23/24 12:05 PM, Mathieu Poirier wrote:
> On Wed, May 22, 2024 at 09:36:26AM -0500, Tanmay Shah wrote:
>> 
>> 
>> On 5/21/24 12:56 PM, Mathieu Poirier wrote:
>> > Hi Tanmay,
>> > 
>> > On Fri, May 10, 2024 at 05:51:25PM -0700, Tanmay Shah wrote:
>> >> It is possible that remote processor is already running before
>> >> linux boot or remoteproc platform driver probe. Implement required
>> >> remoteproc framework ops to provide resource table address and
>> >> connect or disconnect with remote processor in such case.
>> >> 
>> >> Signed-off-by: Tanmay Shah 
>> >> ---
>> >> 
>> >> Changes in v2:
>> >>   - Fix following sparse warnings
>> >> 
>> >> drivers/remoteproc/xlnx_r5_remoteproc.c:827:21: sparse:expected 
>> >> struct rsc_tbl_data *rsc_data_va
>> >> drivers/remoteproc/xlnx_r5_remoteproc.c:844:18: sparse:expected 
>> >> struct resource_table *rsc_addr
>> >> drivers/remoteproc/xlnx_r5_remoteproc.c:898:24: sparse:expected void 
>> >> volatile [noderef] __iomem *addr
>> >> 
>> >>  drivers/remoteproc/xlnx_r5_remoteproc.c | 164 +++-
>> >>  1 file changed, 160 insertions(+), 4 deletions(-)
>> >> 
>> >> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
>> >> b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> >> index 84243d1dff9f..039370cffa32 100644
>> >> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
>> >> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
>> >> @@ -25,6 +25,10 @@
>> >>  /* RX mailbox client buffer max length */
>> >>  #define MBOX_CLIENT_BUF_MAX  (IPI_BUF_LEN_MAX + \
>> >>sizeof(struct zynqmp_ipi_message))
>> >> +
>> >> +#define RSC_TBL_XLNX_MAGIC   ((uint32_t)'x' << 24 | (uint32_t)'a' << 
>> >> 16 | \
>> >> +  (uint32_t)'m' << 8 | (uint32_t)'p')
>> >> +
>> >>  /*
>> >>   * settings for RPU cluster mode which
>> >>   * reflects possible values of xlnx,cluster-mode dt-property
>> >> @@ -73,6 +77,15 @@ struct mbox_info {
>> >>   struct mbox_chan *rx_chan;
>> >>  };
>> >>  
>> >> +/* Xilinx Platform specific data structure */
>> >> +struct rsc_tbl_data {
>> >> + const int version;
>> >> + const u32 magic_num;
>> >> + const u32 comp_magic_num;
>> > 
>> > Why is a complement magic number needed?
>> 
>> Actually magic number is 64-bit. There is good chance that
>> firmware can have 32-bit op-code or data same as magic number, but very less
>> chance of its complement in the next address. So, we can assume magic number
>> is 64-bit. 
>>
> 
> So why not having a magic number that is a u64?
> 
>> > 
>> >> + const u32 rsc_tbl_size;
>> >> + const uintptr_t rsc_tbl;
>> >> +} __packed;
>> >> +
>> >>  /*
>> >>   * Hardcoded TCM bank values. This will stay in driver to maintain 
>> >> backward
>> >>   * compatibility with device-tree that does not have TCM information.
>> >> @@ -95,20 +108,24 @@ static const struct mem_bank_data 
>> >> zynqmp_tcm_banks_lockstep[] = {
>> >>  /**
>> >>   * struct zynqmp_r5_core
>> >>   *
>> >> + * @rsc_tbl_va: resource table virtual address
>> >>   * @dev: device of RPU instance
>> >>   * @np: device node of RPU instance
>> >>   * @tcm_bank_count: number TCM banks accessible to this RPU
>> >>   * @tcm_banks: array of each TCM bank data
>> >>   * @rproc: rproc handle
>> >> + * @rsc_tbl_size: resource table size retrieved from remote
>> >>   * @pm_domain_id: RPU CPU power domain id
>> >>   * @ipi: pointer to mailbox information
>> >>   */
>> >>  struct zynqmp_r5_core {
>> >> + struct resource_table *rsc_tbl_va;
>> > 
>> > Shouldn't this be of type "void __iomem *"?  Did sparse give you trouble 
>> > on that
>> > one?
>> 
>> I fixed sparse warnings with typecast below [1].
>> 
> 
> My point is, ioremap_wc() returns a "void__iomem *" so why not using that
> instead of a "struct resource_table *"?

Ack.

> 
> 
>> > 
>> >>   struct device *dev;
>> >>   struct device_node *np;
>> >>   int tcm_bank_count;
>> >>   struct mem_bank_data **tcm_banks;
>> >>   struct rproc *rproc;
>> >> + u32 rsc_tbl_size;
>> >>   u32 pm_domain_id;
>> >>   struct mbox_info *ipi;
>> >>  };
>> >> @@ -621,10 +638,19 @@ static int zynqmp_r5_rproc_prepare(struct rproc 
>> >> *rproc)
>> >>  {
>> >>   int ret;
>> >>  
>> >> - ret = add_tcm_banks(rproc);
>> >> - if (ret) {
>> >> - dev_err(&rproc->dev, "failed to get TCM banks, err %d\n", ret);
>> >> - return ret;
>> >> + /**
>> > 
>> > Using "/**" is for comments that will endup in the documentation, which I 
>> > don't
>> > think is needed here.  Please correct throughout the patch.
>> 
>> Thanks. Ack, I will use only /* format.
>> 
>> > 
>> >> +  * For attach/detach use case, Firmware is already loaded so
>> >> +  * TCM isn't really needed at all. Also, for security TCM can be
>> >> +  * locked in such case and linux may not have access at all.
>> >> +  * So avoid adding TCM banks. TCM power-domains requested during attach
>> >> +  * callback.
>> >> +  */
>> >> + if (rproc->state != RPROC_DETACHED) {
>> >> + ret = add_tcm_banks(rproc);
>> >> + 

Re: [PATCH v2 1/2] drivers: remoteproc: xlnx: add attach detach support

2024-05-23 Thread Mathieu Poirier
On Wed, May 22, 2024 at 09:36:26AM -0500, Tanmay Shah wrote:
> 
> 
> On 5/21/24 12:56 PM, Mathieu Poirier wrote:
> > Hi Tanmay,
> > 
> > On Fri, May 10, 2024 at 05:51:25PM -0700, Tanmay Shah wrote:
> >> It is possible that remote processor is already running before
> >> linux boot or remoteproc platform driver probe. Implement required
> >> remoteproc framework ops to provide resource table address and
> >> connect or disconnect with remote processor in such case.
> >> 
> >> Signed-off-by: Tanmay Shah 
> >> ---
> >> 
> >> Changes in v2:
> >>   - Fix following sparse warnings
> >> 
> >> drivers/remoteproc/xlnx_r5_remoteproc.c:827:21: sparse:expected struct 
> >> rsc_tbl_data *rsc_data_va
> >> drivers/remoteproc/xlnx_r5_remoteproc.c:844:18: sparse:expected struct 
> >> resource_table *rsc_addr
> >> drivers/remoteproc/xlnx_r5_remoteproc.c:898:24: sparse:expected void 
> >> volatile [noderef] __iomem *addr
> >> 
> >>  drivers/remoteproc/xlnx_r5_remoteproc.c | 164 +++-
> >>  1 file changed, 160 insertions(+), 4 deletions(-)
> >> 
> >> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
> >> b/drivers/remoteproc/xlnx_r5_remoteproc.c
> >> index 84243d1dff9f..039370cffa32 100644
> >> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c
> >> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
> >> @@ -25,6 +25,10 @@
> >>  /* RX mailbox client buffer max length */
> >>  #define MBOX_CLIENT_BUF_MAX   (IPI_BUF_LEN_MAX + \
> >> sizeof(struct zynqmp_ipi_message))
> >> +
> >> +#define RSC_TBL_XLNX_MAGIC((uint32_t)'x' << 24 | (uint32_t)'a' << 
> >> 16 | \
> >> +   (uint32_t)'m' << 8 | (uint32_t)'p')
> >> +
> >>  /*
> >>   * settings for RPU cluster mode which
> >>   * reflects possible values of xlnx,cluster-mode dt-property
> >> @@ -73,6 +77,15 @@ struct mbox_info {
> >>struct mbox_chan *rx_chan;
> >>  };
> >>  
> >> +/* Xilinx Platform specific data structure */
> >> +struct rsc_tbl_data {
> >> +  const int version;
> >> +  const u32 magic_num;
> >> +  const u32 comp_magic_num;
> > 
> > Why is a complement magic number needed?
> 
> Actually magic number is 64-bit. There is good chance that
> firmware can have 32-bit op-code or data same as magic number, but very less
> chance of its complement in the next address. So, we can assume magic number
> is 64-bit. 
>

So why not having a magic number that is a u64?

> > 
> >> +  const u32 rsc_tbl_size;
> >> +  const uintptr_t rsc_tbl;
> >> +} __packed;
> >> +
> >>  /*
> >>   * Hardcoded TCM bank values. This will stay in driver to maintain 
> >> backward
> >>   * compatibility with device-tree that does not have TCM information.
> >> @@ -95,20 +108,24 @@ static const struct mem_bank_data 
> >> zynqmp_tcm_banks_lockstep[] = {
> >>  /**
> >>   * struct zynqmp_r5_core
> >>   *
> >> + * @rsc_tbl_va: resource table virtual address
> >>   * @dev: device of RPU instance
> >>   * @np: device node of RPU instance
> >>   * @tcm_bank_count: number TCM banks accessible to this RPU
> >>   * @tcm_banks: array of each TCM bank data
> >>   * @rproc: rproc handle
> >> + * @rsc_tbl_size: resource table size retrieved from remote
> >>   * @pm_domain_id: RPU CPU power domain id
> >>   * @ipi: pointer to mailbox information
> >>   */
> >>  struct zynqmp_r5_core {
> >> +  struct resource_table *rsc_tbl_va;
> > 
> > Shouldn't this be of type "void __iomem *"?  Did sparse give you trouble on 
> > that
> > one?
> 
> I fixed sparse warnings with typecast below [1].
> 

My point is, ioremap_wc() returns a "void__iomem *" so why not using that
instead of a "struct resource_table *"?


> > 
> >>struct device *dev;
> >>struct device_node *np;
> >>int tcm_bank_count;
> >>struct mem_bank_data **tcm_banks;
> >>struct rproc *rproc;
> >> +  u32 rsc_tbl_size;
> >>u32 pm_domain_id;
> >>struct mbox_info *ipi;
> >>  };
> >> @@ -621,10 +638,19 @@ static int zynqmp_r5_rproc_prepare(struct rproc 
> >> *rproc)
> >>  {
> >>int ret;
> >>  
> >> -  ret = add_tcm_banks(rproc);
> >> -  if (ret) {
> >> -  dev_err(&rproc->dev, "failed to get TCM banks, err %d\n", ret);
> >> -  return ret;
> >> +  /**
> > 
> > Using "/**" is for comments that will endup in the documentation, which I 
> > don't
> > think is needed here.  Please correct throughout the patch.
> 
> Thanks. Ack, I will use only /* format.
> 
> > 
> >> +   * For attach/detach use case, Firmware is already loaded so
> >> +   * TCM isn't really needed at all. Also, for security TCM can be
> >> +   * locked in such case and linux may not have access at all.
> >> +   * So avoid adding TCM banks. TCM power-domains requested during attach
> >> +   * callback.
> >> +   */
> >> +  if (rproc->state != RPROC_DETACHED) {
> >> +  ret = add_tcm_banks(rproc);
> >> +  if (ret) {
> >> +  dev_err(&rproc->dev, "failed to get TCM banks, err 
> >> %d\n", ret);
> >> +  return ret;
> >> +  }
> >>  

[PATCH] ipvs: Avoid unnecessary calls to skb_is_gso_sctp

2024-05-23 Thread Ismael Luceno
In the context of the SCTP SNAT/DNAT handler, these calls can only
return true.

Ref: e10d3ba4d434 ("ipvs: Fix checksumming on GSO of SCTP packets")
Signed-off-by: Ismael Luceno 
CC: Pablo Neira Ayuso 
CC: Michal Kubeček 
CC: Simon Horman 
CC: Julian Anastasov 
CC: lvs-de...@vger.kernel.org
CC: netfilter-de...@vger.kernel.org
CC: net...@vger.kernel.org
CC: coret...@netfilter.org
---
 net/netfilter/ipvs/ip_vs_proto_sctp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c 
b/net/netfilter/ipvs/ip_vs_proto_sctp.c
index 1e689c714127..83e452916403 100644
--- a/net/netfilter/ipvs/ip_vs_proto_sctp.c
+++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c
@@ -126,7 +126,7 @@ sctp_snat_handler(struct sk_buff *skb, struct 
ip_vs_protocol *pp,
if (sctph->source != cp->vport || payload_csum ||
skb->ip_summed == CHECKSUM_PARTIAL) {
sctph->source = cp->vport;
-   if (!skb_is_gso(skb) || !skb_is_gso_sctp(skb))
+   if (!skb_is_gso(skb))
sctp_nat_csum(skb, sctph, sctphoff);
} else {
skb->ip_summed = CHECKSUM_UNNECESSARY;
@@ -175,7 +175,7 @@ sctp_dnat_handler(struct sk_buff *skb, struct 
ip_vs_protocol *pp,
(skb->ip_summed == CHECKSUM_PARTIAL &&
 !(skb_dst(skb)->dev->features & NETIF_F_SCTP_CRC))) {
sctph->dest = cp->dport;
-   if (!skb_is_gso(skb) || !skb_is_gso_sctp(skb))
+   if (!skb_is_gso(skb))
sctp_nat_csum(skb, sctph, sctphoff);
} else if (skb->ip_summed != CHECKSUM_PARTIAL) {
skb->ip_summed = CHECKSUM_UNNECESSARY;
-- 
2.44.0




Re: [PATCH] x86/paravirt: Disable virt spinlock when CONFIG_PARAVIRT_SPINLOCKS disabled

2024-05-23 Thread Dave Hansen
On 5/16/24 06:02, Chen Yu wrote:
> Performance drop is reported when running encode/decode workload and
> BenchSEE cache sub-workload.
> Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
> native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
> is disabled the virt_spin_lock_key is set to true on bare-metal.
> The qspinlock degenerates to test-and-set spinlock, which decrease the
> performance on bare-metal.
> 
> Fix this by disabling virt_spin_lock_key if CONFIG_PARAVIRT_SPINLOCKS
> is not set, or it is on bare-metal.

This is missing some background:

The kernel can change spinlock behavior when running as a guest.  But
this guest-friendly behavior causes performance problems on bare metal.
So there's a 'virt_spin_lock_key' static key to switch between the two
modes.

The static key is always enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior).

... then describe the regression and the fix

> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 5358d43886ad..ee51c0949ed8 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -55,7 +55,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
>  
>  void __init native_pv_lock_init(void)
>  {
> - if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
> + if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) ||
>   !boot_cpu_has(X86_FEATURE_HYPERVISOR))
>   static_branch_disable(&virt_spin_lock_key);
>  }
This gets used at a single site:

if (pv_enabled())
goto pv_queue;

if (virt_spin_lock(lock))
return;

which is logically:

if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS))
goto ...; // don't look at virt_spin_lock_key

if (virt_spin_lock_key)
return; // On virt, but non-paravirt.  Did Test-and-Set
// spinlock.

So I _think_ Arnd was trying to optimize native_pv_lock_init() away when
it's going to get skipped over anyway by the 'goto'.

But this took me at least 30 minutes of scratching my head and trying to
untangle the whole thing.  It's all far too subtle for my taste, and all
of that to save a few bytes of init text in a configuration that's
probably not even used very often (PARAVIRT=y, but PARAVIRT_SPINLOCKS=n).

Let's just keep it simple.  How about the attached patch?

Re: [PATCH v2] sched/rt: Clean up usage of rt_task()

2024-05-23 Thread Steven Rostedt
On Wed, 15 May 2024 23:05:36 +0100
Qais Yousef  wrote:
> diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
> index df3aca89d4f5..5cb88b748ad6 100644
> --- a/include/linux/sched/deadline.h
> +++ b/include/linux/sched/deadline.h
> @@ -10,8 +10,6 @@
>  
>  #include 
>  
> -#define MAX_DL_PRIO  0
> -
>  static inline int dl_prio(int prio)
>  {
>   if (unlikely(prio < MAX_DL_PRIO))
> @@ -19,6 +17,10 @@ static inline int dl_prio(int prio)
>   return 0;
>  }
>  
> +/*
> + * Returns true if a task has a priority that belongs to DL class. PI-boosted
> + * tasks will return true. Use dl_policy() to ignore PI-boosted tasks.
> + */
>  static inline int dl_task(struct task_struct *p)
>  {
>   return dl_prio(p->prio);
> diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
> index ab83d85e1183..6ab43b4f72f9 100644
> --- a/include/linux/sched/prio.h
> +++ b/include/linux/sched/prio.h
> @@ -14,6 +14,7 @@
>   */
>  
>  #define MAX_RT_PRIO  100
> +#define MAX_DL_PRIO  0
>  
>  #define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH)
>  #define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2)
> diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
> index b2b9e6eb9683..a055dd68a77c 100644
> --- a/include/linux/sched/rt.h
> +++ b/include/linux/sched/rt.h
> @@ -7,18 +7,43 @@
>  struct task_struct;
>  
>  static inline int rt_prio(int prio)
> +{
> + if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO))
> + return 1;
> + return 0;
> +}
> +
> +static inline int realtime_prio(int prio)
>  {
>   if (unlikely(prio < MAX_RT_PRIO))
>   return 1;
>   return 0;
>  }

I'm thinking we should change the above to bool (separate patch), as
returning an int may give one the impression that it returns the actual
priority number. Having it return bool will clear that up.

In fact, if we are touching theses functions, might as well change all of
them to bool when returning true/false. Just to make it easier to
understand what they are doing.

>  
> +/*
> + * Returns true if a task has a priority that belongs to RT class. PI-boosted
> + * tasks will return true. Use rt_policy() to ignore PI-boosted tasks.
> + */
>  static inline int rt_task(struct task_struct *p)
>  {
>   return rt_prio(p->prio);
>  }
>  
> -static inline bool task_is_realtime(struct task_struct *tsk)
> +/*
> + * Returns true if a task has a priority that belongs to RT or DL classes.
> + * PI-boosted tasks will return true. Use realtime_task_policy() to ignore
> + * PI-boosted tasks.
> + */
> +static inline int realtime_task(struct task_struct *p)
> +{
> + return realtime_prio(p->prio);
> +}
> +
> +/*
> + * Returns true if a task has a policy that belongs to RT or DL classes.
> + * PI-boosted tasks will return false.
> + */
> +static inline bool realtime_task_policy(struct task_struct *tsk)
>  {
>   int policy = tsk->policy;
>  



> diff --git a/kernel/trace/trace_sched_wakeup.c 
> b/kernel/trace/trace_sched_wakeup.c
> index 0469a04a355f..19d737742e29 100644
> --- a/kernel/trace/trace_sched_wakeup.c
> +++ b/kernel/trace/trace_sched_wakeup.c
> @@ -545,7 +545,7 @@ probe_wakeup(void *ignore, struct task_struct *p)
>*  - wakeup_dl handles tasks belonging to sched_dl class only.
>*/
>   if (tracing_dl || (wakeup_dl && !dl_task(p)) ||
> - (wakeup_rt && !dl_task(p) && !rt_task(p)) ||
> + (wakeup_rt && !realtime_task(p)) ||
>   (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= 
> current->prio)))
>   return;
>  

Reviewed-by: Steven Rostedt (Google) 





Re: [RFC PATCH 0/5] vsock/virtio: Add support for multi-devices

2024-05-23 Thread Michael S. Tsirkin
On Fri, May 17, 2024 at 10:46:02PM +0800, Xuewei Niu wrote:
>  include/linux/virtio_vsock.h|   2 +-
>  include/net/af_vsock.h  |  25 ++-
>  include/uapi/linux/virtio_vsock.h   |   1 +
>  include/uapi/linux/vm_sockets.h |  14 ++
>  net/vmw_vsock/af_vsock.c| 116 +--
>  net/vmw_vsock/virtio_transport.c| 255 ++--
>  net/vmw_vsock/virtio_transport_common.c |  16 +-
>  net/vmw_vsock/vsock_loopback.c  |   4 +-
>  8 files changed, 352 insertions(+), 81 deletions(-)

As any change to virtio device/driver interface, this has to
go through the virtio TC. Please subscribe at
virtio-comment+subscr...@lists.linux.dev and then
contact the TC at virtio-comm...@lists.linux.dev

You will likely eventually need to write a spec draft document, too.

-- 
MST




Re: [PATCH] livepatch: introduce klp_func called interface

2024-05-23 Thread Dan Carpenter
On Sun, May 19, 2024 at 03:43:43PM +0800, Wardenjohn wrote:
> Livepatch module usually used to modify kernel functions.
> If the patched function have bug, it may cause serious result
> such as kernel crash.
> 
> This commit introduce a read only interface of livepatch
> sysfs interface. If a livepatch function is called, this
> sysfs interface "called" of the patched function will
> set to be 1.
> 
> /sys/kernel/livepatchcalled
> 
> This value "called" is quite necessary for kernel stability assurance for 
> livepatching
> module of a running system. Testing process is important before a livepatch 
> module
> apply to a production system. With this interface, testing process can easily
> find out which function is successfully called. Any testing process can make 
> sure they
> have successfully cover all the patched function that changed with the help 
> of this interface.
> ---

Always run your patches through checkpatch.

So this patch is so that testers can see if a function has been called?
Can you not get the same information from gcov or ftrace?

There are style issues with the patch, but it's not so important until
the design is agreed on.

regards,
dan carpenter




Re: [PATCH] riscv: Fix early ftrace nop patching

2024-05-23 Thread Conor Dooley
On Thu, May 23, 2024 at 01:51:34PM +0200, Alexandre Ghiti wrote:
> Commit c97bf629963e ("riscv: Fix text patching when IPI are used")
> converted ftrace_make_nop() to use patch_insn_write() which does not
> emit any icache flush relying entirely on __ftrace_modify_code() to do
> that.
> 
> But we missed that ftrace_make_nop() was called very early directly when
> converting mcount calls into nops (actually on riscv it converts 2B nops
> emitted by the compiler into 4B nops).
> 
> This caused crashes on multiple HW as reported by Conor and Björn since
> the booting core could have half-patched instructions in its icache
> which would trigger an illegal instruction trap: fix this by emitting a
> local flush icache when early patching nops.
> 
> Fixes: c97bf629963e ("riscv: Fix text patching when IPI are used")
> Signed-off-by: Alexandre Ghiti 

Reported-by: Conor Dooley 
Tested-by: Conor Dooley 

Thanks for the quick fix Alex :)


signature.asc
Description: PGP signature


Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

2024-05-23 Thread Miklos Szeredi
[trimming CC list]

On Thu, 23 May 2024 at 04:49, John Groves  wrote:

> - memmap=! will reserve a pretend pmem device at 
> 
> - memmap=$ will reserve a pretend dax device at 

Doesn't get me a /dev/dax or /dev/pmem

Complete qemu command line:

qemu-kvm -s -serial none -parallel none -kernel
/home/mszeredi/git/linux/arch/x86/boot/bzImage -drive
format=raw,file=/home/mszeredi/root_fs,index=0,if=virtio -drive
format=raw,file=/home/mszeredi/images/ubd1,index=1,if=virtio -chardev
stdio,id=virtiocon0,signal=off -device virtio-serial -device
virtconsole,chardev=virtiocon0 -cpu host -m 8G -net user -net
nic,model=virtio -fsdev local,security_model=none,id=fsdev0,path=/home
-device virtio-9p-pci,fsdev=fsdev0,mount_tag=hostshare -device
virtio-rng-pci -smp 4 -append 'root=/dev/vda console=hvc0
memmap=4G$4G'

root@kvm:~/famfs# scripts/chk_efi.sh
This system is neither Ubuntu nor Fedora. It is identified as debian.
/sys/firmware/efi not found; probably not efi
 not found; probably nof efi
/boot/efi/EFI not found; probably not efi
/boot/efi/EFI/BOOT not found; probably not efi
/boot/efi/EFI/ not found; probably not efi
/boot/efi/EFI//grub.cfg not found; probably nof efi
Probably not efi; errs=6

Thanks,
Miklos



Re: [PATCH] riscv: Fix early ftrace nop patching

2024-05-23 Thread Björn Töpel
Alexandre Ghiti  writes:

> Commit c97bf629963e ("riscv: Fix text patching when IPI are used")
> converted ftrace_make_nop() to use patch_insn_write() which does not
> emit any icache flush relying entirely on __ftrace_modify_code() to do
> that.
>
> But we missed that ftrace_make_nop() was called very early directly when
> converting mcount calls into nops (actually on riscv it converts 2B nops
> emitted by the compiler into 4B nops).
>
> This caused crashes on multiple HW as reported by Conor and Björn since
> the booting core could have half-patched instructions in its icache
> which would trigger an illegal instruction trap: fix this by emitting a
> local flush icache when early patching nops.
>
> Fixes: c97bf629963e ("riscv: Fix text patching when IPI are used")
> Signed-off-by: Alexandre Ghiti 

Nice!

I've manged to reproduce the crash on the VisionFive2 board (however
only triggered when CONFIG_RELOCATABLE=y), and can verify that this fix
solves the issue.

Reviewed-by: Björn Töpel 
Tested-by: Björn Töpel 




[PATCHv7 9/9] man2: Add uretprobe syscall page

2024-05-23 Thread Jiri Olsa
Adding man page for new uretprobe syscall.

Reviewed-by: Alejandro Colomar 
Signed-off-by: Jiri Olsa 
---
 man/man2/uretprobe.2 | 56 
 1 file changed, 56 insertions(+)
 create mode 100644 man/man2/uretprobe.2

diff --git a/man/man2/uretprobe.2 b/man/man2/uretprobe.2
new file mode 100644
index ..cf1c2b0d852e
--- /dev/null
+++ b/man/man2/uretprobe.2
@@ -0,0 +1,56 @@
+.\" Copyright (C) 2024, Jiri Olsa 
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH uretprobe 2 (date) "Linux man-pages (unreleased)"
+.SH NAME
+uretprobe \- execute pending return uprobes
+.SH SYNOPSIS
+.nf
+.B int uretprobe(void)
+.fi
+.SH DESCRIPTION
+The
+.BR uretprobe ()
+system call is an alternative to breakpoint instructions for triggering return
+uprobe consumers.
+.P
+Calls to
+.BR uretprobe ()
+system call are only made from the user-space trampoline provided by the 
kernel.
+Calls from any other place result in a
+.BR SIGILL .
+.SH RETURN VALUE
+The
+.BR uretprobe ()
+system call return value is architecture-specific.
+.SH ERRORS
+.TP
+.B SIGILL
+The
+.BR uretprobe ()
+system call was called by a user-space program.
+.SH VERSIONS
+Details of the
+.BR uretprobe ()
+system call behavior vary across systems.
+.SH STANDARDS
+None.
+.SH HISTORY
+TBD
+.SH NOTES
+The
+.BR uretprobe ()
+system call was initially introduced for the x86_64 architecture
+where it was shown to be faster than breakpoint traps.
+It might be extended to other architectures.
+.P
+The
+.BR uretprobe ()
+system call exists only to allow the invocation of return uprobe consumers.
+It should
+.B never
+be called directly.
+Details of the arguments (if any) passed to
+.BR uretprobe ()
+and the return value are architecture-specific.
-- 
2.45.1




[PATCHv7 bpf-next 8/9] selftests/bpf: Add uretprobe shadow stack test

2024-05-23 Thread Jiri Olsa
Adding uretprobe shadow stack test that runs all existing
uretprobe tests with shadow stack enabled if it's available.

Signed-off-by: Jiri Olsa 
---
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 60 +++
 1 file changed, 60 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 3ef324c2db50..fda456401284 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -9,6 +9,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include "uprobe_syscall.skel.h"
 #include "uprobe_syscall_executed.skel.h"
 
@@ -297,6 +300,56 @@ static void test_uretprobe_syscall_call(void)
close(go[1]);
close(go[0]);
 }
+
+/*
+ * Borrowed from tools/testing/selftests/x86/test_shadow_stack.c.
+ *
+ * For use in inline enablement of shadow stack.
+ *
+ * The program can't return from the point where shadow stack gets enabled
+ * because there will be no address on the shadow stack. So it can't use
+ * syscall() for enablement, since it is a function.
+ *
+ * Based on code from nolibc.h. Keep a copy here because this can't pull
+ * in all of nolibc.h.
+ */
+#define ARCH_PRCTL(arg1, arg2) \
+({ \
+   long _ret;  \
+   register long _num  asm("eax") = __NR_arch_prctl;   \
+   register long _arg1 asm("rdi") = (long)(arg1);  \
+   register long _arg2 asm("rsi") = (long)(arg2);  \
+   \
+   asm volatile (  \
+   "syscall\n" \
+   : "=a"(_ret)\
+   : "r"(_arg1), "r"(_arg2),   \
+ "0"(_num) \
+   : "rcx", "r11", "memory", "cc"  \
+   );  \
+   _ret;   \
+})
+
+#ifndef ARCH_SHSTK_ENABLE
+#define ARCH_SHSTK_ENABLE  0x5001
+#define ARCH_SHSTK_DISABLE 0x5002
+#define ARCH_SHSTK_SHSTK   (1ULL <<  0)
+#endif
+
+static void test_uretprobe_shadow_stack(void)
+{
+   if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) {
+   test__skip();
+   return;
+   }
+
+   /* Run all of the uretprobe tests. */
+   test_uretprobe_regs_equal();
+   test_uretprobe_regs_change();
+   test_uretprobe_syscall_call();
+
+   ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
+}
 #else
 static void test_uretprobe_regs_equal(void)
 {
@@ -312,6 +365,11 @@ static void test_uretprobe_syscall_call(void)
 {
test__skip();
 }
+
+static void test_uretprobe_shadow_stack(void)
+{
+   test__skip();
+}
 #endif
 
 void test_uprobe_syscall(void)
@@ -322,4 +380,6 @@ void test_uprobe_syscall(void)
test_uretprobe_regs_change();
if (test__start_subtest("uretprobe_syscall_call"))
test_uretprobe_syscall_call();
+   if (test__start_subtest("uretprobe_shadow_stack"))
+   test_uretprobe_shadow_stack();
 }
-- 
2.45.1




[PATCHv7 bpf-next 7/9] selftests/bpf: Add uretprobe syscall call from user space test

2024-05-23 Thread Jiri Olsa
Adding test to verify that when called from outside of the
trampoline provided by kernel, the uretprobe syscall will cause
calling process to receive SIGILL signal and the attached bpf
program is not executed.

Acked-by: Andrii Nakryiko 
Reviewed-by: Masami Hiramatsu (Google) 
Signed-off-by: Jiri Olsa 
---
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 95 +++
 .../bpf/progs/uprobe_syscall_executed.c   | 17 
 2 files changed, 112 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 1a50cd35205d..3ef324c2db50 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -7,7 +7,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include "uprobe_syscall.skel.h"
+#include "uprobe_syscall_executed.skel.h"
 
 __naked unsigned long uretprobe_regs_trigger(void)
 {
@@ -209,6 +212,91 @@ static void test_uretprobe_regs_change(void)
}
 }
 
+#ifndef __NR_uretprobe
+#define __NR_uretprobe 462
+#endif
+
+__naked unsigned long uretprobe_syscall_call_1(void)
+{
+   /*
+* Pretend we are uretprobe trampoline to trigger the return
+* probe invocation in order to verify we get SIGILL.
+*/
+   asm volatile (
+   "pushq %rax\n"
+   "pushq %rcx\n"
+   "pushq %r11\n"
+   "movq $" __stringify(__NR_uretprobe) ", %rax\n"
+   "syscall\n"
+   "popq %r11\n"
+   "popq %rcx\n"
+   "retq\n"
+   );
+}
+
+__naked unsigned long uretprobe_syscall_call(void)
+{
+   asm volatile (
+   "call uretprobe_syscall_call_1\n"
+   "retq\n"
+   );
+}
+
+static void test_uretprobe_syscall_call(void)
+{
+   LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
+   .retprobe = true,
+   );
+   struct uprobe_syscall_executed *skel;
+   int pid, status, err, go[2], c;
+
+   if (ASSERT_OK(pipe(go), "pipe"))
+   return;
+
+   skel = uprobe_syscall_executed__open_and_load();
+   if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+   goto cleanup;
+
+   pid = fork();
+   if (!ASSERT_GE(pid, 0, "fork"))
+   goto cleanup;
+
+   /* child */
+   if (pid == 0) {
+   close(go[1]);
+
+   /* wait for parent's kick */
+   err = read(go[0], &c, 1);
+   if (err != 1)
+   exit(-1);
+
+   uretprobe_syscall_call();
+   _exit(0);
+   }
+
+   skel->links.test = bpf_program__attach_uprobe_multi(skel->progs.test, 
pid,
+   "/proc/self/exe",
+   
"uretprobe_syscall_call", &opts);
+   if (!ASSERT_OK_PTR(skel->links.test, 
"bpf_program__attach_uprobe_multi"))
+   goto cleanup;
+
+   /* kick the child */
+   write(go[1], &c, 1);
+   err = waitpid(pid, &status, 0);
+   ASSERT_EQ(err, pid, "waitpid");
+
+   /* verify the child got killed with SIGILL */
+   ASSERT_EQ(WIFSIGNALED(status), 1, "WIFSIGNALED");
+   ASSERT_EQ(WTERMSIG(status), SIGILL, "WTERMSIG");
+
+   /* verify the uretprobe program wasn't called */
+   ASSERT_EQ(skel->bss->executed, 0, "executed");
+
+cleanup:
+   uprobe_syscall_executed__destroy(skel);
+   close(go[1]);
+   close(go[0]);
+}
 #else
 static void test_uretprobe_regs_equal(void)
 {
@@ -219,6 +307,11 @@ static void test_uretprobe_regs_change(void)
 {
test__skip();
 }
+
+static void test_uretprobe_syscall_call(void)
+{
+   test__skip();
+}
 #endif
 
 void test_uprobe_syscall(void)
@@ -227,4 +320,6 @@ void test_uprobe_syscall(void)
test_uretprobe_regs_equal();
if (test__start_subtest("uretprobe_regs_change"))
test_uretprobe_regs_change();
+   if (test__start_subtest("uretprobe_syscall_call"))
+   test_uretprobe_syscall_call();
 }
diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c 
b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
new file mode 100644
index ..0d7f1a7db2e2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include 
+#include 
+
+struct pt_regs regs;
+
+char _license[] SEC("license") = "GPL";
+
+int executed = 0;
+
+SEC("uretprobe.multi")
+int test(struct pt_regs *regs)
+{
+   executed = 1;
+   return 0;
+}
-- 
2.45.1




[PATCHv7 bpf-next 6/9] selftests/bpf: Add uretprobe syscall test for regs changes

2024-05-23 Thread Jiri Olsa
Adding test that creates uprobe consumer on uretprobe which changes some
of the registers. Making sure the changed registers are propagated to the
user space when the ureptobe syscall trampoline is used on x86_64.

To be able to do this, adding support to bpf_testmod to create uprobe via
new attribute file:
  /sys/kernel/bpf_testmod_uprobe

This file is expecting file offset and creates related uprobe on current
process exe file and removes existing uprobe if offset is 0. The can be
only single uprobe at any time.

The uprobe has specific consumer that changes registers used in ureprobe
syscall trampoline and which are later checked in the test.

Acked-by: Andrii Nakryiko 
Reviewed-by: Masami Hiramatsu (Google) 
Signed-off-by: Jiri Olsa 
---
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 123 +-
 .../selftests/bpf/prog_tests/uprobe_syscall.c |  67 ++
 2 files changed, 189 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c 
b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 2a18bd320e92..b0132a342bb5 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "bpf_testmod.h"
 #include "bpf_testmod_kfunc.h"
 
@@ -358,6 +359,119 @@ static struct bin_attribute bin_attr_bpf_testmod_file 
__ro_after_init = {
.write = bpf_testmod_test_write,
 };
 
+/* bpf_testmod_uprobe sysfs attribute is so far enabled for x86_64 only,
+ * please see test_uretprobe_regs_change test
+ */
+#ifdef __x86_64__
+
+static int
+uprobe_ret_handler(struct uprobe_consumer *self, unsigned long func,
+  struct pt_regs *regs)
+
+{
+   regs->ax  = 0x12345678deadbeef;
+   regs->cx  = 0x87654321feebdaed;
+   regs->r11 = (u64) -1;
+   return true;
+}
+
+struct testmod_uprobe {
+   struct path path;
+   loff_t offset;
+   struct uprobe_consumer consumer;
+};
+
+static DEFINE_MUTEX(testmod_uprobe_mutex);
+
+static struct testmod_uprobe uprobe = {
+   .consumer.ret_handler = uprobe_ret_handler,
+};
+
+static int testmod_register_uprobe(loff_t offset)
+{
+   int err = -EBUSY;
+
+   if (uprobe.offset)
+   return -EBUSY;
+
+   mutex_lock(&testmod_uprobe_mutex);
+
+   if (uprobe.offset)
+   goto out;
+
+   err = kern_path("/proc/self/exe", LOOKUP_FOLLOW, &uprobe.path);
+   if (err)
+   goto out;
+
+   err = uprobe_register_refctr(d_real_inode(uprobe.path.dentry),
+offset, 0, &uprobe.consumer);
+   if (err)
+   path_put(&uprobe.path);
+   else
+   uprobe.offset = offset;
+
+out:
+   mutex_unlock(&testmod_uprobe_mutex);
+   return err;
+}
+
+static void testmod_unregister_uprobe(void)
+{
+   mutex_lock(&testmod_uprobe_mutex);
+
+   if (uprobe.offset) {
+   uprobe_unregister(d_real_inode(uprobe.path.dentry),
+ uprobe.offset, &uprobe.consumer);
+   uprobe.offset = 0;
+   }
+
+   mutex_unlock(&testmod_uprobe_mutex);
+}
+
+static ssize_t
+bpf_testmod_uprobe_write(struct file *file, struct kobject *kobj,
+struct bin_attribute *bin_attr,
+char *buf, loff_t off, size_t len)
+{
+   unsigned long offset = 0;
+   int err = 0;
+
+   if (kstrtoul(buf, 0, &offset))
+   return -EINVAL;
+
+   if (offset)
+   err = testmod_register_uprobe(offset);
+   else
+   testmod_unregister_uprobe();
+
+   return err ?: strlen(buf);
+}
+
+static struct bin_attribute bin_attr_bpf_testmod_uprobe_file __ro_after_init = 
{
+   .attr = { .name = "bpf_testmod_uprobe", .mode = 0666, },
+   .write = bpf_testmod_uprobe_write,
+};
+
+static int register_bpf_testmod_uprobe(void)
+{
+   return sysfs_create_bin_file(kernel_kobj, 
&bin_attr_bpf_testmod_uprobe_file);
+}
+
+static void unregister_bpf_testmod_uprobe(void)
+{
+   testmod_unregister_uprobe();
+   sysfs_remove_bin_file(kernel_kobj, &bin_attr_bpf_testmod_uprobe_file);
+}
+
+#else
+static int register_bpf_testmod_uprobe(void)
+{
+   return 0;
+}
+
+static void unregister_bpf_testmod_uprobe(void) { }
+#endif
+
 BTF_KFUNCS_START(bpf_testmod_common_kfunc_ids)
 BTF_ID_FLAGS(func, bpf_iter_testmod_seq_new, KF_ITER_NEW)
 BTF_ID_FLAGS(func, bpf_iter_testmod_seq_next, KF_ITER_NEXT | KF_RET_NULL)
@@ -912,7 +1026,13 @@ static int bpf_testmod_init(void)
return -EINVAL;
sock = NULL;
mutex_init(&sock_lock);
-   return sysfs_create_bin_file(kernel_kobj, &bin_attr_bpf_testmod_file);
+   ret = sysfs_create_bin_file(kernel_kobj, &bin_attr_bpf_testmod_file);
+   if (ret < 0)
+   return ret;
+   ret = register_bpf_testmod_uprobe();
+   if (ret < 0)
+   return ret;
+   

[PATCHv7 bpf-next 5/9] selftests/bpf: Add uretprobe syscall test for regs integrity

2024-05-23 Thread Jiri Olsa
Add uretprobe syscall test that compares register values before
and after the uretprobe is hit. It also compares the register
values seen from attached bpf program.

Acked-by: Andrii Nakryiko 
Reviewed-by: Masami Hiramatsu (Google) 
Signed-off-by: Jiri Olsa 
---
 tools/include/linux/compiler.h|   4 +
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 163 ++
 .../selftests/bpf/progs/uprobe_syscall.c  |  15 ++
 3 files changed, 182 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall.c

diff --git a/tools/include/linux/compiler.h b/tools/include/linux/compiler.h
index 8a63a9913495..6f7f22ac9da5 100644
--- a/tools/include/linux/compiler.h
+++ b/tools/include/linux/compiler.h
@@ -62,6 +62,10 @@
 #define __nocf_check __attribute__((nocf_check))
 #endif
 
+#ifndef __naked
+#define __naked __attribute__((__naked__))
+#endif
+
 /* Are two types/vars the same type (ignoring qualifiers)? */
 #ifndef __same_type
 # define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
new file mode 100644
index ..311ac19d8992
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -0,0 +1,163 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+
+#ifdef __x86_64__
+
+#include 
+#include 
+#include 
+#include "uprobe_syscall.skel.h"
+
+__naked unsigned long uretprobe_regs_trigger(void)
+{
+   asm volatile (
+   "movq $0xdeadbeef, %rax\n"
+   "ret\n"
+   );
+}
+
+__naked void uretprobe_regs(struct pt_regs *before, struct pt_regs *after)
+{
+   asm volatile (
+   "movq %r15,   0(%rdi)\n"
+   "movq %r14,   8(%rdi)\n"
+   "movq %r13,  16(%rdi)\n"
+   "movq %r12,  24(%rdi)\n"
+   "movq %rbp,  32(%rdi)\n"
+   "movq %rbx,  40(%rdi)\n"
+   "movq %r11,  48(%rdi)\n"
+   "movq %r10,  56(%rdi)\n"
+   "movq  %r9,  64(%rdi)\n"
+   "movq  %r8,  72(%rdi)\n"
+   "movq %rax,  80(%rdi)\n"
+   "movq %rcx,  88(%rdi)\n"
+   "movq %rdx,  96(%rdi)\n"
+   "movq %rsi, 104(%rdi)\n"
+   "movq %rdi, 112(%rdi)\n"
+   "movq   $0, 120(%rdi)\n" /* orig_rax */
+   "movq   $0, 128(%rdi)\n" /* rip  */
+   "movq   $0, 136(%rdi)\n" /* cs   */
+   "pushf\n"
+   "pop %rax\n"
+   "movq %rax, 144(%rdi)\n" /* eflags   */
+   "movq %rsp, 152(%rdi)\n" /* rsp  */
+   "movq   $0, 160(%rdi)\n" /* ss   */
+
+   /* save 2nd argument */
+   "pushq %rsi\n"
+   "call uretprobe_regs_trigger\n"
+
+   /* save  return value and load 2nd argument pointer to rax */
+   "pushq %rax\n"
+   "movq 8(%rsp), %rax\n"
+
+   "movq %r15,   0(%rax)\n"
+   "movq %r14,   8(%rax)\n"
+   "movq %r13,  16(%rax)\n"
+   "movq %r12,  24(%rax)\n"
+   "movq %rbp,  32(%rax)\n"
+   "movq %rbx,  40(%rax)\n"
+   "movq %r11,  48(%rax)\n"
+   "movq %r10,  56(%rax)\n"
+   "movq  %r9,  64(%rax)\n"
+   "movq  %r8,  72(%rax)\n"
+   "movq %rcx,  88(%rax)\n"
+   "movq %rdx,  96(%rax)\n"
+   "movq %rsi, 104(%rax)\n"
+   "movq %rdi, 112(%rax)\n"
+   "movq   $0, 120(%rax)\n" /* orig_rax */
+   "movq   $0, 128(%rax)\n" /* rip  */
+   "movq   $0, 136(%rax)\n" /* cs   */
+
+   /* restore return value and 2nd argument */
+   "pop %rax\n"
+   "pop %rsi\n"
+
+   "movq %rax,  80(%rsi)\n"
+
+   "pushf\n"
+   "pop %rax\n"
+
+   "movq %rax, 144(%rsi)\n" /* eflags   */
+   "movq %rsp, 152(%rsi)\n" /* rsp  */
+   "movq   $0, 160(%rsi)\n" /* ss   */
+   "ret\n"
+);
+}
+
+static void test_uretprobe_regs_equal(void)
+{
+   struct uprobe_syscall *skel = NULL;
+   struct pt_regs before = {}, after = {};
+   unsigned long *pb = (unsigned long *) &before;
+   unsigned long *pa = (unsigned long *) &after;
+   unsigned long *pp;
+   unsigned int i, cnt;
+   int err;
+
+   skel = uprobe_syscall__open_and_load();
+   if (!ASSERT_OK_PTR(skel, "uprobe_syscall__open_and_load"))
+   goto cleanup;
+
+   err = uprobe_syscall__attach(skel);
+   if (!ASSERT_OK(err, "uprobe_syscall__attach"))
+   goto cleanup;
+
+   uretprobe_regs(&before, &after);
+
+   pp = (unsigned long *) &skel->bss->regs;
+   cnt = sizeof(before)/sizeof(*pb);
+
+ 

[PATCHv7 bpf-next 4/9] selftests/x86: Add return uprobe shadow stack test

2024-05-23 Thread Jiri Olsa
Adding return uprobe test for shadow stack and making sure it's
working properly. Borrowed some of the code from bpf selftests.

Signed-off-by: Jiri Olsa 
---
 .../testing/selftests/x86/test_shadow_stack.c | 145 ++
 1 file changed, 145 insertions(+)

diff --git a/tools/testing/selftests/x86/test_shadow_stack.c 
b/tools/testing/selftests/x86/test_shadow_stack.c
index 757e6527f67e..e3501b7e2ecc 100644
--- a/tools/testing/selftests/x86/test_shadow_stack.c
+++ b/tools/testing/selftests/x86/test_shadow_stack.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Define the ABI defines if needed, so people can run the tests
@@ -681,6 +682,144 @@ int test_32bit(void)
return !segv_triggered;
 }
 
+static int parse_uint_from_file(const char *file, const char *fmt)
+{
+   int err, ret;
+   FILE *f;
+
+   f = fopen(file, "re");
+   if (!f) {
+   err = -errno;
+   printf("failed to open '%s': %d\n", file, err);
+   return err;
+   }
+   err = fscanf(f, fmt, &ret);
+   if (err != 1) {
+   err = err == EOF ? -EIO : -errno;
+   printf("failed to parse '%s': %d\n", file, err);
+   fclose(f);
+   return err;
+   }
+   fclose(f);
+   return ret;
+}
+
+static int determine_uprobe_perf_type(void)
+{
+   const char *file = "/sys/bus/event_source/devices/uprobe/type";
+
+   return parse_uint_from_file(file, "%d\n");
+}
+
+static int determine_uprobe_retprobe_bit(void)
+{
+   const char *file = 
"/sys/bus/event_source/devices/uprobe/format/retprobe";
+
+   return parse_uint_from_file(file, "config:%d\n");
+}
+
+static ssize_t get_uprobe_offset(const void *addr)
+{
+   size_t start, end, base;
+   char buf[256];
+   bool found = false;
+   FILE *f;
+
+   f = fopen("/proc/self/maps", "r");
+   if (!f)
+   return -errno;
+
+   while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) 
== 4) {
+   if (buf[2] == 'x' && (uintptr_t)addr >= start && 
(uintptr_t)addr < end) {
+   found = true;
+   break;
+   }
+   }
+
+   fclose(f);
+
+   if (!found)
+   return -ESRCH;
+
+   return (uintptr_t)addr - start + base;
+}
+
+static __attribute__((noinline)) void uretprobe_trigger(void)
+{
+   asm volatile ("");
+}
+
+/*
+ * This test setups return uprobe, which is sensitive to shadow stack
+ * (crashes without extra fix). After executing the uretprobe we fail
+ * the test if we receive SIGSEGV, no crash means we're good.
+ *
+ * Helper functions above borrowed from bpf selftests.
+ */
+static int test_uretprobe(void)
+{
+   const size_t attr_sz = sizeof(struct perf_event_attr);
+   const char *file = "/proc/self/exe";
+   int bit, fd = 0, type, err = 1;
+   struct perf_event_attr attr;
+   struct sigaction sa = {};
+   ssize_t offset;
+
+   type = determine_uprobe_perf_type();
+   if (type < 0) {
+   if (type == -ENOENT)
+   printf("[SKIP]\tUretprobe test, uprobes are not 
available\n");
+   return 0;
+   }
+
+   offset = get_uprobe_offset(uretprobe_trigger);
+   if (offset < 0)
+   return 1;
+
+   bit = determine_uprobe_retprobe_bit();
+   if (bit < 0)
+   return 1;
+
+   sa.sa_sigaction = segv_gp_handler;
+   sa.sa_flags = SA_SIGINFO;
+   if (sigaction(SIGSEGV, &sa, NULL))
+   return 1;
+
+   /* Setup return uprobe through perf event interface. */
+   memset(&attr, 0, attr_sz);
+   attr.size = attr_sz;
+   attr.type = type;
+   attr.config = 1 << bit;
+   attr.config1 = (__u64) (unsigned long) file;
+   attr.config2 = offset;
+
+   fd = syscall(__NR_perf_event_open, &attr, 0 /* pid */, -1 /* cpu */,
+-1 /* group_fd */, PERF_FLAG_FD_CLOEXEC);
+   if (fd < 0)
+   goto out;
+
+   if (sigsetjmp(jmp_buffer, 1))
+   goto out;
+
+   ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK);
+
+   /*
+* This either segfaults and goes through sigsetjmp above
+* or succeeds and we're good.
+*/
+   uretprobe_trigger();
+
+   printf("[OK]\tUretprobe test\n");
+   err = 0;
+
+out:
+   ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK);
+   signal(SIGSEGV, SIG_DFL);
+   if (fd)
+   close(fd);
+   return err;
+}
+
 void segv_handler_ptrace(int signum, siginfo_t *si, void *uc)
 {
/* The SSP adjustment caused a segfault. */
@@ -867,6 +1006,12 @@ int main(int argc, char *argv[])
goto out;
}
 
+   if (test_uretprobe()) {
+   ret = 1;
+   printf("[FAIL]\turetprobe test\n");
+   goto out;
+   }
+
return ret;
 
 out:
-- 
2.45.1




[PATCHv7 bpf-next 3/9] uprobe: Add uretprobe syscall to speed up return probe

2024-05-23 Thread Jiri Olsa
Adding uretprobe syscall instead of trap to speed up return probe.

At the moment the uretprobe setup/path is:

  - install entry uprobe

  - when the uprobe is hit, it overwrites probed function's return address
on stack with address of the trampoline that contains breakpoint
instruction

  - the breakpoint trap code handles the uretprobe consumers execution and
jumps back to original return address

This patch replaces the above trampoline's breakpoint instruction with new
ureprobe syscall call. This syscall does exactly the same job as the trap
with some more extra work:

  - syscall trampoline must save original value for rax/r11/rcx registers
on stack - rax is set to syscall number and r11/rcx are changed and
used by syscall instruction

  - the syscall code reads the original values of those registers and
restore those values in task's pt_regs area

  - only caller from trampoline exposed in '[uprobes]' is allowed,
the process will receive SIGILL signal otherwise

Even with some extra work, using the uretprobes syscall shows speed
improvement (compared to using standard breakpoint):

  On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)

  current:
uretprobe-nop  :1.498 ± 0.000M/s
uretprobe-push :1.448 ± 0.001M/s
uretprobe-ret  :0.816 ± 0.001M/s

  with the fix:
uretprobe-nop  :1.969 ± 0.002M/s  < 31% speed up
uretprobe-push :1.910 ± 0.000M/s  < 31% speed up
uretprobe-ret  :0.934 ± 0.000M/s  < 14% speed up

  On Amd (AMD Ryzen 7 5700U)

  current:
uretprobe-nop  :0.778 ± 0.001M/s
uretprobe-push :0.744 ± 0.001M/s
uretprobe-ret  :0.540 ± 0.001M/s

  with the fix:
uretprobe-nop  :0.860 ± 0.001M/s  < 10% speed up
uretprobe-push :0.818 ± 0.001M/s  < 10% speed up
uretprobe-ret  :0.578 ± 0.000M/s  <  7% speed up

The performance test spawns a thread that runs loop which triggers
uprobe with attached bpf program that increments the counter that
gets printed in results above.

The uprobe (and uretprobe) kind is determined by which instruction
is being patched with breakpoint instruction. That's also important
for uretprobes, because uprobe is installed for each uretprobe.

The performance test is part of bpf selftests:
  tools/testing/selftests/bpf/run_bench_uprobes.sh

Note at the moment uretprobe syscall is supported only for native
64-bit process, compat process still uses standard breakpoint.

Note that when shadow stack is enabled the uretprobe syscall returns
via iret, which is slower than return via sysret, but won't cause the
shadow stack violation.

Suggested-by: Andrii Nakryiko 
Reviewed-by: Oleg Nesterov 
Reviewed-by: Masami Hiramatsu (Google) 
Acked-by: Andrii Nakryiko 
Signed-off-by: Oleg Nesterov 
Signed-off-by: Jiri Olsa 
---
 arch/x86/include/asm/shstk.h |   2 +
 arch/x86/kernel/shstk.c  |   5 ++
 arch/x86/kernel/uprobes.c| 117 +++
 include/linux/uprobes.h  |   3 +
 kernel/events/uprobes.c  |  24 ---
 5 files changed, 144 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 896909f306e3..4cb77e004615 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -22,6 +22,7 @@ void shstk_free(struct task_struct *p);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
 int shstk_update_last_frame(unsigned long val);
+bool shstk_is_enabled(void);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
   unsigned long arg2) { return -EINVAL; }
@@ -33,6 +34,7 @@ static inline void shstk_free(struct task_struct *p) {}
 static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
 static inline int restore_signal_shadow_stack(void) { return 0; }
 static inline int shstk_update_last_frame(unsigned long val) { return 0; }
+static inline bool shstk_is_enabled(void) { return false; }
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 9797d4cdb78a..059685612362 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -588,3 +588,8 @@ int shstk_update_last_frame(unsigned long val)
ssp = get_user_shstk_addr();
return write_user_shstk_64((u64 __user *)ssp, (u64)val);
 }
+
+bool shstk_is_enabled(void)
+{
+   return features_enabled(ARCH_SHSTK_SHSTK);
+}
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6402fb3089d2..5a952c5ea66b 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -308,6 +309,122 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, 
struct insn *insn, bool
 }
 
 #ifdef CONFIG_X86_64
+
+asm (
+   ".pushsection .rodata\n"
+   ".global uretprobe_trampoline_entry\n"
+ 

[PATCHv7 bpf-next 2/9] uprobe: Wire up uretprobe system call

2024-05-23 Thread Jiri Olsa
Wiring up uretprobe system call, which comes in following changes.
We need to do the wiring before, because the uretprobe implementation
needs the syscall number.

Note at the moment uretprobe syscall is supported only for native
64-bit process.

Reviewed-by: Oleg Nesterov 
Reviewed-by: Masami Hiramatsu (Google) 
Acked-by: Andrii Nakryiko 
Signed-off-by: Jiri Olsa 
---
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/linux/syscalls.h   | 2 ++
 include/uapi/asm-generic/unistd.h  | 5 -
 kernel/sys_ni.c| 2 ++
 4 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index cc78226ffc35..47dfea0a827c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
 459common  lsm_get_self_attr   sys_lsm_get_self_attr
 460common  lsm_set_self_attr   sys_lsm_set_self_attr
 461common  lsm_list_modulessys_lsm_list_modules
+46264  uretprobe   sys_uretprobe
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e619ac10cd23..5318e0e76799 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -972,6 +972,8 @@ asmlinkage long sys_lsm_list_modules(u64 *ids, u32 *size, 
u32 flags);
 /* x86 */
 asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
 
+asmlinkage long sys_uretprobe(void);
+
 /* pciconfig: alpha, arm, arm64, ia64, sparc */
 asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
unsigned long off, unsigned long len,
diff --git a/include/uapi/asm-generic/unistd.h 
b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..8a747cd1d735 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
 
+#define __NR_uretprobe 462
+__SYSCALL(__NR_uretprobe, sys_uretprobe)
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..be6195e0d078 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+
+COND_SYSCALL(uretprobe);
-- 
2.45.1




[PATCHv7 bpf-next 1/9] x86/shstk: Make return uprobe work with shadow stack

2024-05-23 Thread Jiri Olsa
Currently the application with enabled shadow stack will crash
if it sets up return uprobe. The reason is the uretprobe kernel
code changes the user space task's stack, but does not update
shadow stack accordingly.

Adding new functions to update values on shadow stack and using
them in uprobe code to keep shadow stack in sync with uretprobe
changes to user stack.

Reviewed-by: Oleg Nesterov 
Fixes: 488af8ea7131 ("x86/shstk: Wire in shadow stack interface")
Signed-off-by: Jiri Olsa 
---
 arch/x86/include/asm/shstk.h |  2 ++
 arch/x86/kernel/shstk.c  | 11 +++
 arch/x86/kernel/uprobes.c|  7 ++-
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 42fee8959df7..896909f306e3 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -21,6 +21,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *p, 
unsigned long clon
 void shstk_free(struct task_struct *p);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
+int shstk_update_last_frame(unsigned long val);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
   unsigned long arg2) { return -EINVAL; }
@@ -31,6 +32,7 @@ static inline unsigned long shstk_alloc_thread_stack(struct 
task_struct *p,
 static inline void shstk_free(struct task_struct *p) {}
 static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
 static inline int restore_signal_shadow_stack(void) { return 0; }
+static inline int shstk_update_last_frame(unsigned long val) { return 0; }
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 6f1e9883f074..9797d4cdb78a 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -577,3 +577,14 @@ long shstk_prctl(struct task_struct *task, int option, 
unsigned long arg2)
return wrss_control(true);
return -EINVAL;
 }
+
+int shstk_update_last_frame(unsigned long val)
+{
+   unsigned long ssp;
+
+   if (!features_enabled(ARCH_SHSTK_SHSTK))
+   return 0;
+
+   ssp = get_user_shstk_addr();
+   return write_user_shstk_64((u64 __user *)ssp, (u64)val);
+}
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..6402fb3089d2 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1076,8 +1076,13 @@ arch_uretprobe_hijack_return_addr(unsigned long 
trampoline_vaddr, struct pt_regs
return orig_ret_vaddr;
 
nleft = copy_to_user((void __user *)regs->sp, &trampoline_vaddr, 
rasize);
-   if (likely(!nleft))
+   if (likely(!nleft)) {
+   if (shstk_update_last_frame(trampoline_vaddr)) {
+   force_sig(SIGSEGV);
+   return -1;
+   }
return orig_ret_vaddr;
+   }
 
if (nleft != rasize) {
pr_err("return address clobbered: pid=%d, %%sp=%#lx, 
%%ip=%#lx\n",
-- 
2.45.1




[PATCHv7 bpf-next 0/9] uprobe: uretprobe speed up

2024-05-23 Thread Jiri Olsa
hi,
as part of the effort on speeding up the uprobes [0] coming with
return uprobe optimization by using syscall instead of the trap
on the uretprobe trampoline.

The speed up depends on instruction type that uprobe is installed
and depends on specific HW type, please check patch 1 for details.

Patches 1-8 are based on bpf-next/master, but patch 2 and 3 are
apply-able on linux-trace.git tree probes/for-next branch.
Patch 9 is based on man-pages master.

v7 changes:
- fixes in man page [Alejandro Colomar]
- fixed patch #1 fixes tag [Oleg]

Also available at:
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  uretprobe_syscall

thanks,
jirka


Notes to check list items in Documentation/process/adding-syscalls.rst:

- System Call Alternatives
  New syscall seems like the best way in here, because we need
  just to quickly enter kernel with no extra arguments processing,
  which we'd need to do if we decided to use another syscall.

- Designing the API: Planning for Extension
  The uretprobe syscall is very specific and most likely won't be
  extended in the future.

  At the moment it does not take any arguments and even if it does
  in future, it's allowed to be called only from trampoline prepared
  by kernel, so there'll be no broken user.

- Designing the API: Other Considerations
  N/A because uretprobe syscall does not return reference to kernel
  object.

- Proposing the API
  Wiring up of the uretprobe system call is in separate change,
  selftests and man page changes are part of the patchset.

- Generic System Call Implementation
  There's no CONFIG option for the new functionality because it
  keeps the same behaviour from the user POV.

- x86 System Call Implementation
  It's 64-bit syscall only.

- Compatibility System Calls (Generic)
  N/A uretprobe syscall has no arguments and is not supported
  for compat processes.

- Compatibility System Calls (x86)
  N/A uretprobe syscall is not supported for compat processes.

- System Calls Returning Elsewhere
  N/A.

- Other Details
  N/A.

- Testing
  Adding new bpf selftests and ran ltp on top of this change.

- Man Page
  Attached.

- Do not call System Calls in the Kernel
  N/A.


[0] https://lore.kernel.org/bpf/ZeCXHKJ--iYYbmLj@krava/
---
Jiri Olsa (8):
  x86/shstk: Make return uprobe work with shadow stack
  uprobe: Wire up uretprobe system call
  uprobe: Add uretprobe syscall to speed up return probe
  selftests/x86: Add return uprobe shadow stack test
  selftests/bpf: Add uretprobe syscall test for regs integrity
  selftests/bpf: Add uretprobe syscall test for regs changes
  selftests/bpf: Add uretprobe syscall call from user space test
  selftests/bpf: Add uretprobe shadow stack test

 arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
 arch/x86/include/asm/shstk.h|   4 +
 arch/x86/kernel/shstk.c |  16 
 arch/x86/kernel/uprobes.c   | 124 
-
 include/linux/syscalls.h|   2 +
 include/linux/uprobes.h |   3 +
 include/uapi/asm-generic/unistd.h   |   5 +-
 kernel/events/uprobes.c |  24 --
 kernel/sys_ni.c |   2 +
 tools/include/linux/compiler.h  |   4 +
 tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c   | 123 
-
 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 385 
+++
 tools/testing/selftests/bpf/progs/uprobe_syscall.c  |  15 
 tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c |  17 
 tools/testing/selftests/x86/test_shadow_stack.c | 145 
++
 15 files changed, 860 insertions(+), 10 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c

Jiri Olsa (1):
  man2: Add uretprobe syscall page

 man/man2/uretprobe.2 | 56 

 1 file changed, 56 insertions(+)
 create mode 100644 man/man2/uretprobe.2



[PATCH] riscv: Fix early ftrace nop patching

2024-05-23 Thread Alexandre Ghiti
Commit c97bf629963e ("riscv: Fix text patching when IPI are used")
converted ftrace_make_nop() to use patch_insn_write() which does not
emit any icache flush relying entirely on __ftrace_modify_code() to do
that.

But we missed that ftrace_make_nop() was called very early directly when
converting mcount calls into nops (actually on riscv it converts 2B nops
emitted by the compiler into 4B nops).

This caused crashes on multiple HW as reported by Conor and Björn since
the booting core could have half-patched instructions in its icache
which would trigger an illegal instruction trap: fix this by emitting a
local flush icache when early patching nops.

Fixes: c97bf629963e ("riscv: Fix text patching when IPI are used")
Signed-off-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/cacheflush.h | 6 ++
 arch/riscv/kernel/ftrace.c  | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/arch/riscv/include/asm/cacheflush.h 
b/arch/riscv/include/asm/cacheflush.h
index dd8d07146116..ce79c558a4c8 100644
--- a/arch/riscv/include/asm/cacheflush.h
+++ b/arch/riscv/include/asm/cacheflush.h
@@ -13,6 +13,12 @@ static inline void local_flush_icache_all(void)
asm volatile ("fence.i" ::: "memory");
 }
 
+static inline void local_flush_icache_range(unsigned long start,
+   unsigned long end)
+{
+   local_flush_icache_all();
+}
+
 #define PG_dcache_clean PG_arch_1
 
 static inline void flush_dcache_folio(struct folio *folio)
diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index 4f4987a6d83d..32e7c401dfb4 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -120,6 +120,9 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace 
*rec)
out = ftrace_make_nop(mod, rec, MCOUNT_ADDR);
mutex_unlock(&text_mutex);
 
+   if (!mod)
+   local_flush_icache_range(rec->ip, rec->ip + MCOUNT_INSN_SIZE);
+
return out;
 }
 
-- 
2.39.2




Re: [RFC PATCH 0/5] vsock/virtio: Add support for multi-devices

2024-05-23 Thread Stefano Garzarella

Hi,
thanks for this RFC!

On Fri, May 17, 2024 at 10:46:02PM GMT, Xuewei Niu wrote:

# Motivition

Vsock is a lightweight and widely used data exchange mechanism between host
and guest. Kata Containers, a secure container runtime, leverages the
capability to exchange control data between the shim and the kata-agent.

The Linux kernel only supports one vsock device for virtio-vsock transport,
resulting in the following limitations:

* Poor performance isolation: All vsock connections share the same
virtqueue.


This might be fixed if we implement multi-queue in virtio-vsock.


* Cannot enable more than one backend: Virtio-vsock, vhost-vsock, and
vhost-user-vsock cannot be enabled simultaneously on the transport.

We’d like to transfer networking data, such as TSI (Transparent Socket
Impersonation), over vsock via the vhost-user protocol to reduce overhead.
However, by default, the vsock device is occupied by the kata-agent.

# Usages

Principle: **Supporting virtio-vsock multi-devices while also being
compatible with existing ones.**

## Connection from Guest to Host

There are two valuable questions to take about:

1. How to be compatible with the existing usages?
2. How do we specify a virtio-vsock device?

### Question 1

Before we delve into question 1, I'd like to provide a piece of pseudocode
as an example of one of the existing use cases from the guest's
perspective.

Assuming there is one virtio-vsock device with CID 4. One of existing
usages to connect to host is shown as following.

```
fd = socket(AF_VSOCK);
connect(fd, 2, 1234);
n = write(fd, buffer);
```

The result is that a connection is established from the guest (4, ?) to the
host (2, 1234), where "?" denotes a random port.

In the context of multi-devices, there are more than two devices. If the
users don’t specify one CID explicitly, the kernel becomes confused about
which device to use. The new implementation should be compatible with the
old one.

We expanded the virtio-vsock specification to address this issue. The
specification now includes a new field called "order".

```
struct virtio_vsock_config {
 __le64 guest_cid;
 __le64 order;
} _attribute_((packed));
```

In the phase of virtio-vsock driver probing, the guest kernel reads 
from

VMM to get the order of each device. **We stipulate that the device with the
smallest order is regarded as the default device**(this mechanism functions
as a 'default gateway' in networking).

Assuming there are three virtio-vsock devices: device1 (CID=3), device2
(CID=4), and device3 (CID=5). The arrangement of the list is as follows
from the perspective of the guest kernel:

```
virtio_vsock_list =
virtio_vsock { cid: 4, order: 0 } -> virtio_vsock { cid: 3, order: 1 } -> 
virtio_vsock { cid: 5, order: 10 }
```

At this time, the guest kernel realizes that the device2 (CID=4) is the
default device. Execute the same code as before.

```
fd = socket(AF_VSOCK);
connect(fd, 2, 1234);
n = write(fd, buffer);
```

A connection will be established from the guest (4, ?) to the host (2, 1234).


It seems that only the one with order 0 is used here though, so what is 
the ordering for?
Wouldn't it suffice to simply indicate the default device (e.g., like 
the default gateway for networking)?




### Question 2

Now, the user wants to specify a device instead of the default one. An
explicit binding operation is required to be performed.

Use the device (CID=3), where “-1” represents any port, the kernel will


We have a macro: VMADDR_PORT_ANY (which is -1)


search an available port automatically.

```
fd = socket(AF_VSOCK);
bind(fd, 3, -1);
connect(fd, 2, 1234);)
n = write(fd, buffer);
```

Use the device (CID=4).

```
fd = socket(AF_VSOCK);
bind(fd, 4, -1);
connect(fd, 2, 1234);
n = write(fd, buffer);
```

## Connection from Host to Guest

Connection from host to guest is quite similar to the existing usages. The
device’s CID is specified by the bind operation.

Listen at the device (CID=3)’s port 1.

```
fd = socket(AF_VSOCK);
bind(fd, 3, 1);
listen(fd);
new_fd = accept(fd, &host_cid, &host_port);
n = write(fd, buffer);
```

Listen at the device (CID=4)’s port 1.

```
fd = socket(AF_VSOCK);
bind(fd, 4, 1);
listen(fd);
new_fd = accept(fd, &host_cid, &host_port);
n = write(fd, buffer);
```

# Use Cases

We've completed a POC with Kata Containers, Ztunnel, which is a
purpose-built per-node proxy for Istio ambient mesh, and TSI. Please refer
to the following link for more details.

Link: https://bit.ly/4bdPJbU


Thank you for this RFC, I left several comments in the patches, we still 
have some work to do, but I think it is something we can support :-)


Here I summarize the things that I think we need to fix:
1. Avoid adding transport-specific things in af_vsock.c
   We need to have a generic API to allow other transports to implement
   the same functionality.
2. We need to add negotiation of a new feature in virtio/vhost transports
   We need to enable or disable support depending on whether t

Re: [RFC PATCH 5/5] vsock: Add an ioctl request to get all CIDs

2024-05-23 Thread Stefano Garzarella

On Fri, May 17, 2024 at 10:46:07PM GMT, Xuewei Niu wrote:

The new request is called `IOCTL_VM_SOCKETS_GET_LOCAL_CIDS`. And the old
one, `IOCTL_VM_SOCKETS_GET_LOCAL_CID` is retained.

For the transport that supports multi-devices:

* `IOCTL_VM_SOCKETS_GET_LOCAL_CID` returns "-1";


What about returning the default CID (lower prio)?

* `IOCTL_VM_SOCKETS_GET_LOCAL_CIDS` returns a vector of CIDS. The usage 
is

shown as following.

```
struct vsock_local_cids local_cids;
if ((ret = ioctl(fd, IOCTL_VM_SOCKETS_GET_LOCAL_CIDS, &local_cids))) {
   perror("failed to get cids");
   exit(1);
}
for (i = 0; i
---
include/net/af_vsock.h  |  7 +++
include/uapi/linux/vm_sockets.h |  8 
net/vmw_vsock/af_vsock.c| 19 +++
3 files changed, 34 insertions(+)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 25f7dc3d602d..2febc816e388 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -264,4 +264,11 @@ static inline bool vsock_msgzerocopy_allow(const struct 
vsock_transport *t)
{
return t->msgzerocopy_allow && t->msgzerocopy_allow();
}
+
+/ IOCTL /
+/* Type of return value of IOCTL_VM_SOCKETS_GET_LOCAL_CIDS. */
+struct vsock_local_cids {
+   int nr;
+   unsigned int data[MAX_VSOCK_NUM];
+};
#endif /* __AF_VSOCK_H__ */
diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
index 36ca5023293a..01f73fb7af5a 100644
--- a/include/uapi/linux/vm_sockets.h
+++ b/include/uapi/linux/vm_sockets.h
@@ -195,8 +195,16 @@ struct sockaddr_vm {

#define MAX_VSOCK_NUM 16


Okay, now I see why you need this in the UAPI, but pleace try to follow
other defines.

What about VM_SOCKETS_MAX_DEVS ?



+/* Return actual context id if the transport not support vsock
+ * multi-devices. Otherwise, return `-1U`.
+ */
+
#define IOCTL_VM_SOCKETS_GET_LOCAL_CID  _IO(7, 0xb9)

+/* Only available in transports that support multiple devices. */
+
+#define IOCTL_VM_SOCKETS_GET_LOCAL_CIDS _IOR(7, 0xba, struct 
vsock_local_cids)
+
/* MSG_ZEROCOPY notifications are encoded in the standard error format,
 * sock_extended_err. See Documentation/networking/msg_zerocopy.rst in
 * kernel source tree for more details.
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 3b34be802bf2..2ea2ff52f15b 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -2454,6 +2454,7 @@ static long vsock_dev_do_ioctl(struct file *filp,
u32 __user *p = ptr;
u32 cid = VMADDR_CID_ANY;
int retval = 0;
+   struct vsock_local_cids local_cids;

switch (cmd) {
case IOCTL_VM_SOCKETS_GET_LOCAL_CID:
@@ -2469,6 +2470,24 @@ static long vsock_dev_do_ioctl(struct file *filp,
retval = -EFAULT;
break;

+   case IOCTL_VM_SOCKETS_GET_LOCAL_CIDS:
+   if (!transport_g2h || !transport_g2h->get_local_cids)
+   goto fault;
+
+   rcu_read_lock();
+   local_cids.nr = transport_g2h->get_local_cids(local_cids.data);
+   rcu_read_unlock();
+
+   if (local_cids.nr < 0 ||
+   copy_to_user(p, &local_cids, sizeof(local_cids)))
+   goto fault;
+
+   break;
+
+fault:
+   retval = -EFAULT;
+   break;
+
default:
retval = -ENOIOCTLCMD;
}
--
2.34.1






Re: [RFC PATCH 4/5] vsock: seqpacket_allow adapts to multi-devices

2024-05-23 Thread Stefano Garzarella

On Fri, May 17, 2024 at 10:46:06PM GMT, Xuewei Niu wrote:

Adds a new argument, named "src_cid", to let them know which `virtio_vsock`
to be selected.

Signed-off-by: Xuewei Niu 
---
include/net/af_vsock.h   |  2 +-
net/vmw_vsock/af_vsock.c | 15 +--
net/vmw_vsock/virtio_transport.c |  4 ++--
net/vmw_vsock/vsock_loopback.c   |  4 ++--
4 files changed, 18 insertions(+), 7 deletions(-)


Same for this.



diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 0151296a0bc5..25f7dc3d602d 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -143,7 +143,7 @@ struct vsock_transport {
 int flags);
int (*seqpacket_enqueue)(struct vsock_sock *vsk, struct msghdr *msg,
 size_t len);
-   bool (*seqpacket_allow)(u32 remote_cid);
+   bool (*seqpacket_allow)(u32 src_cid, u32 remote_cid);
u32 (*seqpacket_has_data)(struct vsock_sock *vsk);

/* Notification. */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index da06ddc940cd..3b34be802bf2 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -470,10 +470,12 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct 
vsock_sock *psk)
{
const struct vsock_transport *new_transport;
struct sock *sk = sk_vsock(vsk);
-   unsigned int remote_cid = vsk->remote_addr.svm_cid;
+   unsigned int src_cid, remote_cid;
__u8 remote_flags;
int ret;

+   remote_cid = vsk->remote_addr.svm_cid;
+
/* If the packet is coming with the source and destination CIDs higher
 * than VMADDR_CID_HOST, then a vsock channel where all the packets are
 * forwarded to the host should be established. Then the host will
@@ -527,8 +529,17 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct 
vsock_sock *psk)
return -ENODEV;

if (sk->sk_type == SOCK_SEQPACKET) {
+   if (vsk->local_addr.svm_cid == VMADDR_CID_ANY) {
+   if (new_transport->get_default_cid)
+   src_cid = new_transport->get_default_cid();
+   else
+   src_cid = new_transport->get_local_cid();
+   } else {
+   src_cid = vsk->local_addr.svm_cid;
+   }
+
if (!new_transport->seqpacket_allow ||
-   !new_transport->seqpacket_allow(remote_cid)) {
+   !new_transport->seqpacket_allow(src_cid, remote_cid)) {
module_put(new_transport->module);
return -ESOCKTNOSUPPORT;
}
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 998b22e5ce36..0bddcbd906a2 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -615,14 +615,14 @@ static struct virtio_transport virtio_transport = {
.can_msgzerocopy = virtio_transport_can_msgzerocopy,
};

-static bool virtio_transport_seqpacket_allow(u32 remote_cid)
+static bool virtio_transport_seqpacket_allow(u32 src_cid, u32 remote_cid)
{
struct virtio_vsock *vsock;
bool seqpacket_allow;

seqpacket_allow = false;
rcu_read_lock();
-   vsock = rcu_dereference(the_virtio_vsock);
+   vsock = virtio_transport_get_virtio_vsock(src_cid);
if (vsock)
seqpacket_allow = vsock->seqpacket_allow;
rcu_read_unlock();
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 6dea6119f5b2..b94358f5bb2c 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -46,7 +46,7 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
return 0;
}

-static bool vsock_loopback_seqpacket_allow(u32 remote_cid);
+static bool vsock_loopback_seqpacket_allow(u32 src_cid, u32 remote_cid);
static bool vsock_loopback_msgzerocopy_allow(void)
{
return true;
@@ -104,7 +104,7 @@ static struct virtio_transport loopback_transport = {
.send_pkt = vsock_loopback_send_pkt,
};

-static bool vsock_loopback_seqpacket_allow(u32 remote_cid)
+static bool vsock_loopback_seqpacket_allow(u32 src_cid, u32 remote_cid)
{
return true;
}
--
2.34.1






Re: [RFC PATCH 3/5] vsock/virtio: can_msgzerocopy adapts to multi-devices

2024-05-23 Thread Stefano Garzarella

On Fri, May 17, 2024 at 10:46:05PM GMT, Xuewei Niu wrote:

Adds a new argument, named "cid", to let them know which `virtio_vsock` to
be selected.

Signed-off-by: Xuewei Niu 
---
include/linux/virtio_vsock.h| 2 +-
net/vmw_vsock/virtio_transport.c| 5 ++---
net/vmw_vsock/virtio_transport_common.c | 6 +++---
3 files changed, 6 insertions(+), 7 deletions(-)


Every commit in linux must be working to support bisection. So these 
changes should be made before adding multi-device support.




diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c82089dee0c8..21bfd5e0c2e7 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -168,7 +168,7 @@ struct virtio_transport {
 * extra checks and can perform zerocopy transmission by
 * default.
 */
-   bool (*can_msgzerocopy)(int bufs_num);
+   bool (*can_msgzerocopy)(u32 cid, int bufs_num);
};

ssize_t
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 93d25aeafb83..998b22e5ce36 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -521,14 +521,13 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}

-static bool virtio_transport_can_msgzerocopy(int bufs_num)
+static bool virtio_transport_can_msgzerocopy(u32 cid, int bufs_num)
{
struct virtio_vsock *vsock;
bool res = false;

rcu_read_lock();
-
-   vsock = rcu_dereference(the_virtio_vsock);
+   vsock = virtio_transport_get_virtio_vsock(cid);
if (vsock) {
struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX];

diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index bed75a41419e..e7315d7b9af1 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -39,7 +39,7 @@ virtio_transport_get_ops(struct vsock_sock *vsk)

static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops,
   struct virtio_vsock_pkt_info *info,
-  size_t pkt_len)
+  size_t pkt_len, unsigned int cid)
{
struct iov_iter *iov_iter;

@@ -62,7 +62,7 @@ static bool virtio_transport_can_zcopy(const struct 
virtio_transport *t_ops,
int pages_to_send = iov_iter_npages(iov_iter, MAX_SKB_FRAGS);

/* +1 is for packet header. */
-   return t_ops->can_msgzerocopy(pages_to_send + 1);
+   return t_ops->can_msgzerocopy(cid, pages_to_send + 1);
}

return true;
@@ -375,7 +375,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock 
*vsk,
info->msg->msg_flags &= ~MSG_ZEROCOPY;

if (info->msg->msg_flags & MSG_ZEROCOPY)
-   can_zcopy = virtio_transport_can_zcopy(t_ops, info, 
pkt_len);
+   can_zcopy = virtio_transport_can_zcopy(t_ops, info, 
pkt_len, src_cid);

if (can_zcopy)
max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
--
2.34.1






Re: [RFC PATCH 2/5] vsock/virtio: Add support for multi-devices

2024-05-23 Thread Stefano Garzarella

On Fri, May 17, 2024 at 10:46:04PM GMT, Xuewei Niu wrote:

The maximum number of devices is limited by `MAX_VSOCK_NUM`.

Extends `vsock_transport` struct with 4 methods to support multi-devices:

* `get_virtio_vsock()`: It receives a CID, and returns a struct of virtio
 vsock. This method is designed to select a vsock device by its CID.
* `get_default_cid()`: It receives nothing, returns the default CID of the
 first vsock device registered to the kernel.
* `get_local_cids()`: It returns a vector of vsock devices' CIDs.
* `compare_order()`: It receives two different CIDs, named "left" and
 "right" respectively. It returns "-1" while the "left" is behind the
 "right". Otherwise, return "1".

`get_local_cid()` is retained, but returns "-1" if the transport supports
multi-devices.

Replaces the single instance of `virtio_vsock` with a list, named
`virtio_vsock_list`. The devices are inserted into the list when probing.

The kernel will deny devices from being registered if there are conflicts
existing in CIDs or orders.

Signed-off-by: Xuewei Niu 
---
include/net/af_vsock.h  |  16 ++
include/uapi/linux/vm_sockets.h |   6 +
net/vmw_vsock/af_vsock.c|  82 ++--
net/vmw_vsock/virtio_transport.c| 246 ++--
net/vmw_vsock/virtio_transport_common.c |  10 +-
5 files changed, 293 insertions(+), 67 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 535701efc1e5..0151296a0bc5 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -174,6 +174,22 @@ struct vsock_transport {

/* Addressing. */
u32 (*get_local_cid)(void);
+   /* Held rcu read lock by the caller. */


We should also explain why the rcu is needed.


+   struct virtio_vsock *(*get_virtio_vsock)(unsigned int cid);


af_vsock supports several transports (i.e. HyperV, VMCI, VIRTIO/VHOST,
loobpack), so we need to be generic here.

In addition, the pointer returned by this function is never used, so
why we need this?


+   unsigned int (*get_default_cid)(void);
+   /* Get an list containing all the CIDs of registered vsock.   Return
+* the length of the list.
+*
+* Held rcu read lock by the caller.
+*/
+   int (*get_local_cids)(unsigned int *local_cids);


Why int? get_local_cid() returns an u32, we should do the same.

In addition, can we remove get_local_cid() and implement 
get_local_cids() for all the transports?



+   /* Compare the order of two devices.  Given the guest CIDs of two
+* different devices, returns -1 while the left is behind the right.
+* Otherwise, return 1.
+*
+* Held rcu read lock by the caller.
+*/
+   int (*compare_order)(unsigned int left, unsigned int right);


Please check better the type for CIDs all over the place.



/* Read a single skb */
int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
index ed07181d4eff..36ca5023293a 100644
--- a/include/uapi/linux/vm_sockets.h
+++ b/include/uapi/linux/vm_sockets.h
@@ -189,6 +189,12 @@ struct sockaddr_vm {
   sizeof(__u8)];
};

+/* The maximum number of vsock devices.  Each vsock device has an exclusive
+ * context id.
+ */
+
+#define MAX_VSOCK_NUM 16


This is used internally in AF_VSOCK, I don't think we should expose it
in the UAPI.



+
#define IOCTL_VM_SOCKETS_GET_LOCAL_CID  _IO(7, 0xb9)

/* MSG_ZEROCOPY notifications are encoded in the standard error format,
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 54ba7316f808..da06ddc940cd 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -234,19 +234,45 @@ static void __vsock_remove_connected(struct vsock_sock 
*vsk)

static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
{
-   struct vsock_sock *vsk;
+   struct vsock_sock *vsk, *any_vsk = NULL;

+   rcu_read_lock();


Why the rcu is needed?

	list_for_each_entry(vsk, vsock_bound_sockets(addr), bound_table) 
	{

+   /* The highest priority: full match. */
if (vsock_addr_equals_addr(addr, &vsk->local_addr))
-   return sk_vsock(vsk);
+   goto out;

-   if (addr->svm_port == vsk->local_addr.svm_port &&
-   (vsk->local_addr.svm_cid == VMADDR_CID_ANY ||
-addr->svm_cid == VMADDR_CID_ANY))
-   return sk_vsock(vsk);
+   /* Port match */
+   if (addr->svm_port == vsk->local_addr.svm_port) {
+   /* The second priority: local cid is VMADDR_CID_ANY. */
+   if (vsk->local_addr.svm_cid == VMADDR_CID_ANY)
+   goto out;
+
+   /* The third priority: local cid isn't VMADDR_CID_ANY. 
*/
+   if (addr->svm_cid == VMADDR_CI

Re: [RFC PATCH 1/5] vsock/virtio: Extend virtio-vsock spec with an "order" field

2024-05-23 Thread Stefano Garzarella

As Alyssa suggested, we should discuss spec changes in the virtio ML.
BTW as long as this is an RFC, it's fine. Just be sure, though, to 
remember to merge the change in the specification first versus the 
patches in Linux.
So I recommend that you don't send a non-RFC set into Linux until you 
have agreed on the changes to the specification.


On Fri, May 17, 2024 at 10:46:03PM GMT, Xuewei Niu wrote:

The "order" field determines the location of the device in the linked list,
the device with CID 4, having a smallest order, is in the first place, and
so forth.


Do we really need an order, or would it suffice to just indicate the 
device to be used by default? (as the default gateway in networking)




Rules:

* It doesn’t have to be continuous;
* It cannot exist conflicts;
* It is optional for the mode of a single device, but is required for the
 mode of multiple devices.


We should also add a feature to support this new field.



Signed-off-by: Xuewei Niu 
---
include/uapi/linux/virtio_vsock.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/virtio_vsock.h 
b/include/uapi/linux/virtio_vsock.h
index 64738838bee5..b62ec7d2ab1e 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -43,6 +43,7 @@

struct virtio_vsock_config {
__le64 guest_cid;
+   __le64 order;


Do we really need 64 bits for the order?


} __attribute__((packed));

enum virtio_vsock_event_id {
--
2.34.1






Re: [PATCH] x86/paravirt: Disable virt spinlock when CONFIG_PARAVIRT_SPINLOCKS disabled

2024-05-23 Thread Juergen Gross

On 16.05.24 15:02, Chen Yu wrote:

Performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
is disabled the virt_spin_lock_key is set to true on bare-metal.
The qspinlock degenerates to test-and-set spinlock, which decrease the
performance on bare-metal.

Fix this by disabling virt_spin_lock_key if CONFIG_PARAVIRT_SPINLOCKS
is not set, or it is on bare-metal.

Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function 
warning")
Suggested-by: Qiuxu Zhuo 
Reported-by: Prem Nath Dey 
Reported-by: Xiaoping Zhou 
Signed-off-by: Chen Yu 


Reviewed-by: Juergen Gross 


Juergen


---
  arch/x86/kernel/paravirt.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5358d43886ad..ee51c0949ed8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -55,7 +55,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
  
  void __init native_pv_lock_init(void)

  {
-   if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
+   if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) ||
!boot_cpu_has(X86_FEATURE_HYPERVISOR))
static_branch_disable(&virt_spin_lock_key);
  }




OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key


OpenPGP_signature.asc
Description: OpenPGP digital signature


[PATCH] ring-buffer: Align meta-page to sub-buffers for improved TLB usage

2024-05-23 Thread Vincent Donnefort
Previously, the mapped ring-buffer layout caused misalignment between
the meta-page and sub-buffers when the sub-buffer size was not a
multiple of PAGE_SIZE. This prevented hardware with larger TLB entries
from utilizing them effectively.

Add a padding with the zero-page between the meta-page and sub-buffers.
Also update the ring-buffer map_test to verify that padding.

Signed-off-by: Vincent Donnefort 

-- 

This is based on the mm-unstable branch [1] as it depends on David's work [2]
for allowing the zero-page in vm_insert_page().

[1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
[2] https://lore.kernel.org/all/20240522125713.775114-1-da...@redhat.com

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7345a8b625fb..acaab4d4288f 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6148,10 +6148,10 @@ static void rb_setup_ids_meta_page(struct 
ring_buffer_per_cpu *cpu_buffer,
/* install subbuf ID to kern VA translation */
cpu_buffer->subbuf_ids = subbuf_ids;
 
-   meta->meta_page_size = PAGE_SIZE;
meta->meta_struct_len = sizeof(*meta);
meta->nr_subbufs = nr_subbufs;
meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+   meta->meta_page_size = meta->subbuf_size;
 
rb_update_meta_page(cpu_buffer);
 }
@@ -6238,6 +6238,12 @@ static int __rb_map_vma(struct ring_buffer_per_cpu 
*cpu_buffer,
!(vma->vm_flags & VM_MAYSHARE))
return -EPERM;
 
+   subbuf_order = cpu_buffer->buffer->subbuf_order;
+   subbuf_pages = 1 << subbuf_order;
+
+   if (subbuf_order && pgoff % subbuf_pages)
+   return -EINVAL;
+
/*
 * Make sure the mapping cannot become writable later. Also tell the VM
 * to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND).
@@ -6247,11 +6253,8 @@ static int __rb_map_vma(struct ring_buffer_per_cpu 
*cpu_buffer,
 
lockdep_assert_held(&cpu_buffer->mapping_lock);
 
-   subbuf_order = cpu_buffer->buffer->subbuf_order;
-   subbuf_pages = 1 << subbuf_order;
-
nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
-   nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+   nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff; /* + meta-page */
 
vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
if (!vma_pages || vma_pages > nr_pages)
@@ -6264,20 +6267,20 @@ static int __rb_map_vma(struct ring_buffer_per_cpu 
*cpu_buffer,
return -ENOMEM;
 
if (!pgoff) {
+   unsigned long meta_page_padding;
+
pages[p++] = virt_to_page(cpu_buffer->meta_page);
 
/*
-* TODO: Align sub-buffers on their size, once
-* vm_insert_pages() supports the zero-page.
+* Pad with the zero-page to align the meta-page with the
+* sub-buffers.
 */
+   meta_page_padding = subbuf_pages - 1;
+   while (meta_page_padding-- && p < nr_pages)
+   pages[p++] = ZERO_PAGE(vma->vm_start + (PAGE_SIZE * p));
} else {
/* Skip the meta-page */
-   pgoff--;
-
-   if (pgoff % subbuf_pages) {
-   err = -EINVAL;
-   goto out;
-   }
+   pgoff -= subbuf_pages;
 
s += pgoff / subbuf_pages;
}
diff --git a/tools/testing/selftests/ring-buffer/map_test.c 
b/tools/testing/selftests/ring-buffer/map_test.c
index a9006fa7097e..4bb0192e43f3 100644
--- a/tools/testing/selftests/ring-buffer/map_test.c
+++ b/tools/testing/selftests/ring-buffer/map_test.c
@@ -228,6 +228,20 @@ TEST_F(map, data_mmap)
data = mmap(NULL, data_len, PROT_READ, MAP_SHARED,
desc->cpu_fd, meta_len);
ASSERT_EQ(data, MAP_FAILED);
+
+   /* Verify meta-page padding */
+   if (desc->meta->meta_page_size > getpagesize()) {
+   void *addr;
+
+   data_len = desc->meta->meta_page_size;
+   data = mmap(NULL, data_len,
+   PROT_READ, MAP_SHARED, desc->cpu_fd, 0);
+   ASSERT_NE(data, MAP_FAILED);
+
+   addr = (void *)((unsigned long)data + getpagesize());
+   ASSERT_EQ(*((int *)addr), 0);
+   munmap(data, data_len);
+   }
 }
 
 FIXTURE(snapshot) {

base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
-- 
2.45.1.288.g0e0cd299f1-goog




Re: [RFC PATCH 1/5] vsock/virtio: Extend virtio-vsock spec with an "order" field

2024-05-23 Thread Alyssa Ross
(CCing virtio-comment, since this proposes adding a field to a struct
that is standardized[1] in the VIRTIO spec, so changes to the Linux
implementation should presumably be coordinated with changes to the
spec.)

[1]: 
https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4780004

On Fri, May 17, 2024 at 10:46:03PM +0800, Xuewei Niu wrote:
> The "order" field determines the location of the device in the linked list,
> the device with CID 4, having a smallest order, is in the first place, and
> so forth.
>
> Rules:
>
> * It doesn’t have to be continuous;
> * It cannot exist conflicts;
> * It is optional for the mode of a single device, but is required for the
>   mode of multiple devices.
>
> Signed-off-by: Xuewei Niu 
> ---
>  include/uapi/linux/virtio_vsock.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/uapi/linux/virtio_vsock.h 
> b/include/uapi/linux/virtio_vsock.h
> index 64738838bee5..b62ec7d2ab1e 100644
> --- a/include/uapi/linux/virtio_vsock.h
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -43,6 +43,7 @@
>
>  struct virtio_vsock_config {
>   __le64 guest_cid;
> + __le64 order;
>  } __attribute__((packed));
>
>  enum virtio_vsock_event_id {
> --
> 2.34.1
>


signature.asc
Description: PGP signature