Re: [PATCH v10 03/36] x86: tracing: Add ftrace_regs definition in the header
On Thu, 23 May 2024 19:14:59 -0400 Steven Rostedt wrote: > On Tue, 7 May 2024 23:08:35 +0900 > "Masami Hiramatsu (Google)" wrote: > > > From: Masami Hiramatsu (Google) > > > > Add ftrace_regs definition for x86_64 in the ftrace header to > > clarify what register will be accessible from ftrace_regs. > > > > Signed-off-by: Masami Hiramatsu (Google) > > --- > > Changes in v3: > > - Add rip to be saved. > > Changes in v2: > > - Newly added. > > --- > > arch/x86/include/asm/ftrace.h |6 ++ > > 1 file changed, 6 insertions(+) > > > > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h > > index cf88cc8cc74d..c88bf47f46da 100644 > > --- a/arch/x86/include/asm/ftrace.h > > +++ b/arch/x86/include/asm/ftrace.h > > @@ -36,6 +36,12 @@ static inline unsigned long ftrace_call_adjust(unsigned > > long addr) > > > > #ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS > > struct ftrace_regs { > > + /* > > +* On the x86_64, the ftrace_regs saves; > > +* rax, rcx, rdx, rdi, rsi, r8, r9, rbp, rip and rsp. > > +* Also orig_ax is used for passing direct trampoline address. > > +* x86_32 doesn't support ftrace_regs. > > Should add a comment that if fregs->regs.cs is set, then all of the pt_regs > is valid. But what about rbx and r1*? Only regs->cs should be care for pt_regs? Or, did you mean "the ftrace_regs is valid"? > And x86_32 does support ftrace_regs, it just doesn't support > having a subset of it. Oh, thanks. I'll update the comment about x86_32. Thank you, > > -- Steve > > > > +*/ > > struct pt_regs regs; > > }; > > > > -- Masami Hiramatsu (Google)
Re: [PATCH v10 01/36] tracing: Add a comment about ftrace_regs definition
On Thu, 23 May 2024 19:10:31 -0400 Steven Rostedt wrote: > On Tue, 7 May 2024 23:08:12 +0900 > "Masami Hiramatsu (Google)" wrote: > > > From: Masami Hiramatsu (Google) > > > > To clarify what will be expected on ftrace_regs, add a comment to the > > architecture independent definition of the ftrace_regs. > > > > Signed-off-by: Masami Hiramatsu (Google) > > Acked-by: Mark Rutland > > --- > > Changes in v8: > > - Update that the saved registers depends on the context. > > Changes in v3: > > - Add instruction pointer > > Changes in v2: > > - newly added. > > --- > > include/linux/ftrace.h | 26 ++ > > 1 file changed, 26 insertions(+) > > > > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h > > index 54d53f345d14..b81f1afa82a1 100644 > > --- a/include/linux/ftrace.h > > +++ b/include/linux/ftrace.h > > @@ -118,6 +118,32 @@ extern int ftrace_enabled; > > > > #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS > > > > +/** > > + * ftrace_regs - ftrace partial/optimal register set > > + * > > + * ftrace_regs represents a group of registers which is used at the > > + * function entry and exit. There are three types of registers. > > + * > > + * - Registers for passing the parameters to callee, including the stack > > + * pointer. (e.g. rcx, rdx, rdi, rsi, r8, r9 and rsp on x86_64) > > + * - Registers for passing the return values to caller. > > + * (e.g. rax and rdx on x86_64) > > + * - Registers for hooking the function call and return including the > > + * frame pointer (the frame pointer is architecture/config dependent) > > + * (e.g. rip, rbp and rsp for x86_64) > > + * > > + * Also, architecture dependent fields can be used for internal process. > > + * (e.g. orig_ax on x86_64) > > + * > > + * On the function entry, those registers will be restored except for > > + * the stack pointer, so that user can change the function parameters > > + * and instruction pointer (e.g. live patching.) 
> > + * On the function exit, only registers which is used for return values > > + * are restored. > > I wonder if we should also add a note about some architectures in some > circumstances may store all pt_regs in ftrace_regs. For example, if an > architecture supports FTRACE_WITH_REGS, it may pass the pt_regs within the > ftrace_regs. If that is the case, then ftrace_get_regs() called on it will > return a pointer to a valid pt_regs, or NULL if it is not supported or the > ftrace_regs does not have all the registers. Agreed. That case also should be noted. Thanks for pointing that out! > > -- Steve > > > > + * > > + * NOTE: user *must not* access regs directly, only do it via APIs, because > > + * the member can be changed according to the architecture. > > + */ > > struct ftrace_regs { > > struct pt_regs regs; > > }; > -- Masami Hiramatsu (Google)
Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
On 24/05/23 03:57PM, Miklos Szeredi wrote: > [trimming CC list] > > On Thu, 23 May 2024 at 04:49, John Groves wrote: > > > - memmap=! will reserve a pretend pmem device at > > > > - memmap=$ will reserve a pretend dax device at > > > > Doesn't get me a /dev/dax or /dev/pmem > > Complete qemu command line: > > qemu-kvm -s -serial none -parallel none -kernel > /home/mszeredi/git/linux/arch/x86/boot/bzImage -drive > format=raw,file=/home/mszeredi/root_fs,index=0,if=virtio -drive > format=raw,file=/home/mszeredi/images/ubd1,index=1,if=virtio -chardev > stdio,id=virtiocon0,signal=off -device virtio-serial -device > virtconsole,chardev=virtiocon0 -cpu host -m 8G -net user -net > nic,model=virtio -fsdev local,security_model=none,id=fsdev0,path=/home > -device virtio-9p-pci,fsdev=fsdev0,mount_tag=hostshare -device > virtio-rng-pci -smp 4 -append 'root=/dev/vda console=hvc0 > memmap=4G$4G' > > root@kvm:~/famfs# scripts/chk_efi.sh > This system is neither Ubuntu nor Fedora. It is identified as debian. > /sys/firmware/efi not found; probably not efi > not found; probably nof efi > /boot/efi/EFI not found; probably not efi > /boot/efi/EFI/BOOT not found; probably not efi > /boot/efi/EFI/ not found; probably not efi > /boot/efi/EFI//grub.cfg not found; probably nof efi > Probably not efi; errs=6 > > Thanks, > Miklos Apologies, but I'm short on time at the moment - going into a long holiday weekend in the US with family plans. I should be focused again by the middle of next week. But can you check /proc/cmdline to see if the memmap arg got through without getting mangled? The '$' tends to get fubar'd. You might need \$, or I've seen the need for \\\$. If it's un-mangled, there should be a dax device. If that doesn't work, it's worth trying '!' instead, which I think would give you a pmem device - if the arg gets through (but ! is less likely to get horked). That pmem device can be converted to devdax... Regards, John
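The check John suggests (did the memmap argument survive quoting?) can be sketched as a small string test. This is only an illustration under stated assumptions: `memmap_has_separator()` is a hypothetical helper, not part of famfs or the kernel, and it works on a string the caller has already read from /proc/cmdline.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: find the first "memmap=" argument in a kernel
 * command line string and report whether its value still contains the
 * given separator ('$' or '!'). Returns false if there is no memmap=
 * argument or if the separator was lost on the way to the kernel.
 */
static bool memmap_has_separator(const char *cmdline, char sep)
{
	const char *p = cmdline;

	while ((p = strstr(p, "memmap=")) != NULL) {
		/* Only accept a match that starts a whitespace-separated token. */
		if (p == cmdline || p[-1] == ' ') {
			const char *val = p + strlen("memmap=");
			size_t len = strcspn(val, " \n");

			return memchr(val, sep, len) != NULL;
		}
		p++;
	}
	return false;
}
```

A false result for '$' on a command line that was meant to carry `memmap=4G$4G` would point at the quoting problem described above rather than at the dax setup itself.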
Re: [PATCH v10 03/36] x86: tracing: Add ftrace_regs definition in the header
On Tue, 7 May 2024 23:08:35 +0900 "Masami Hiramatsu (Google)" wrote: > From: Masami Hiramatsu (Google) > > Add ftrace_regs definition for x86_64 in the ftrace header to > clarify what register will be accessible from ftrace_regs. > > Signed-off-by: Masami Hiramatsu (Google) > --- > Changes in v3: > - Add rip to be saved. > Changes in v2: > - Newly added. > --- > arch/x86/include/asm/ftrace.h |6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h > index cf88cc8cc74d..c88bf47f46da 100644 > --- a/arch/x86/include/asm/ftrace.h > +++ b/arch/x86/include/asm/ftrace.h > @@ -36,6 +36,12 @@ static inline unsigned long ftrace_call_adjust(unsigned > long addr) > > #ifdef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS > struct ftrace_regs { > + /* > + * On the x86_64, the ftrace_regs saves; > + * rax, rcx, rdx, rdi, rsi, r8, r9, rbp, rip and rsp. > + * Also orig_ax is used for passing direct trampoline address. > + * x86_32 doesn't support ftrace_regs. Should add a comment that if fregs->regs.cs is set, then all of the pt_regs is valid. And x86_32 does support ftrace_regs, it just doesn't support having a subset of it. -- Steve > + */ > struct pt_regs regs; > }; >
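A minimal userspace sketch of the convention Steve describes, assuming that a non-zero `cs` marks a fully populated register set. The structs are toy stand-ins, not the kernel's `struct pt_regs` or `struct ftrace_regs`, and `model_ftrace_get_regs()` is a hypothetical name that only mirrors the role of `ftrace_get_regs()`.

```c
#include <stddef.h>

/* Toy stand-in for struct pt_regs: only the fields this sketch needs. */
struct model_pt_regs {
	unsigned long ip;
	unsigned long sp;
	unsigned long cs;	/* assumed non-zero only when all registers were saved */
};

/* Toy stand-in for struct ftrace_regs, wrapping pt_regs as in the patch. */
struct model_ftrace_regs {
	struct model_pt_regs regs;
};

/*
 * Return the full register set when it is valid, or NULL when only the
 * partial (argument, return value, hooking) registers were saved.
 */
static struct model_pt_regs *
model_ftrace_get_regs(struct model_ftrace_regs *fregs)
{
	if (!fregs->regs.cs)
		return NULL;
	return &fregs->regs;
}
```

A callback written against this model has to handle the NULL case and fall back to accessors that only touch the partial set.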
Re: [PATCH v10 01/36] tracing: Add a comment about ftrace_regs definition
On Tue, 7 May 2024 23:08:12 +0900 "Masami Hiramatsu (Google)" wrote: > From: Masami Hiramatsu (Google) > > To clarify what will be expected on ftrace_regs, add a comment to the > architecture independent definition of the ftrace_regs. > > Signed-off-by: Masami Hiramatsu (Google) > Acked-by: Mark Rutland > --- > Changes in v8: > - Update that the saved registers depends on the context. > Changes in v3: > - Add instruction pointer > Changes in v2: > - newly added. > --- > include/linux/ftrace.h | 26 ++ > 1 file changed, 26 insertions(+) > > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h > index 54d53f345d14..b81f1afa82a1 100644 > --- a/include/linux/ftrace.h > +++ b/include/linux/ftrace.h > @@ -118,6 +118,32 @@ extern int ftrace_enabled; > > #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS > > +/** > + * ftrace_regs - ftrace partial/optimal register set > + * > + * ftrace_regs represents a group of registers which is used at the > + * function entry and exit. There are three types of registers. > + * > + * - Registers for passing the parameters to callee, including the stack > + * pointer. (e.g. rcx, rdx, rdi, rsi, r8, r9 and rsp on x86_64) > + * - Registers for passing the return values to caller. > + * (e.g. rax and rdx on x86_64) > + * - Registers for hooking the function call and return including the > + * frame pointer (the frame pointer is architecture/config dependent) > + * (e.g. rip, rbp and rsp for x86_64) > + * > + * Also, architecture dependent fields can be used for internal process. > + * (e.g. orig_ax on x86_64) > + * > + * On the function entry, those registers will be restored except for > + * the stack pointer, so that user can change the function parameters > + * and instruction pointer (e.g. live patching.) > + * On the function exit, only registers which is used for return values > + * are restored. I wonder if we should also add a note about some architectures in some circumstances may store all pt_regs in ftrace_regs. 
For example, if an architecture supports FTRACE_WITH_REGS, it may pass the pt_regs within the ftrace_regs. If that is the case, then ftrace_get_regs() called on it will return a pointer to a valid pt_regs, or NULL if it is not supported or the ftrace_regs does not have all the registers. -- Steve > + * > + * NOTE: user *must not* access regs directly, only do it via APIs, because > + * the member can be changed according to the architecture. > + */ > struct ftrace_regs { > struct pt_regs regs; > };
Re: [PATCH v2 1/1] x86/vector: Fix vector leak during CPU offline
On Wed, May 22 2024 at 15:02, Dongli Zhang wrote: > The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of > interrupt affinity reconfiguration via procfs. Instead, the change is > deferred until the next instance of the interrupt being triggered on the > original CPU. > > When the interrupt next triggers on the original CPU, the new affinity is > enforced within __irq_move_irq(). A vector is allocated from the new CPU, > but if the old vector on the original CPU remains online, it is not > immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the > reclaiming process is delayed until the next trigger of the interrupt on > the new CPU. > > Upon the subsequent triggering of the interrupt on the new CPU, > irq_complete_move() adds a task to the old CPU's vector_cleanup list if it > remains online. Subsequently, the timer on the old CPU iterates over its > vector_cleanup list, reclaiming old vectors. > > However, a rare scenario arises if the old CPU is outgoing before the > interrupt triggers again on the new CPU. The irq_force_complete_move() may > not have the chance to be invoked on the outgoing CPU to reclaim the old > apicd->prev_vector. This is because the interrupt isn't currently affine to > the outgoing CPU, and irq_needs_fixup() returns false. Even though > __vector_schedule_cleanup() is later called on the new CPU, it doesn't > reclaim apicd->prev_vector; instead, it simply resets both > apicd->move_in_progress and apicd->prev_vector to 0. > > As a result, the vector remains unreclaimed in vector_matrix, leading to a > CPU vector leak. > > To address this issue, move the invocation of irq_force_complete_move() > before the irq_needs_fixup() call to reclaim apicd->prev_vector, if the > interrupt is currently or used to affine to the outgoing CPU. 
Additionally, > reclaim the vector in __vector_schedule_cleanup() as well, following a > warning message, although theoretically it should never see > apicd->move_in_progress with apicd->prev_cpu pointing to an offline CPU. Nice change log!
Re: [GIT PULL v2] virtio: features, fixes, cleanups
The pull request you sent on Thu, 23 May 2024 02:00:17 -0400: > https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/2ef32ad2241340565c35baf77fc95053c84eeeb0 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html
Re: [PATCH] x86/paravirt: Disable virt spinlock when CONFIG_PARAVIRT_SPINLOCKS disabled
On 5/23/24 11:39, Jürgen Groß wrote:
>>
>> Let's just keep it simple. How about the attached patch?
>
> Simple indeed. The attachment is empty. 😛

Let's try this again.

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5358d43886ad..c193c9e60a1b 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -55,8 +55,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
 
 void __init native_pv_lock_init(void)
 {
-	if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
-	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
+	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
 		static_branch_disable(&virt_spin_lock_key);
 }
Re: [PATCH] x86/paravirt: Disable virt spinlock when CONFIG_PARAVIRT_SPINLOCKS disabled
On 23.05.24 18:30, Dave Hansen wrote: On 5/16/24 06:02, Chen Yu wrote: Performance drop is reported when running encode/decode workload and BenchSEE cache sub-workload. Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS is disabled the virt_spin_lock_key is set to true on bare-metal. The qspinlock degenerates to test-and-set spinlock, which decrease the performance on bare-metal. Fix this by disabling virt_spin_lock_key if CONFIG_PARAVIRT_SPINLOCKS is not set, or it is on bare-metal. This is missing some background: The kernel can change spinlock behavior when running as a guest. But this guest-friendly behavior causes performance problems on bare metal. So there's a 'virt_spin_lock_key' static key to switch between the two modes. The static key is always enabled by default (run in guest mode) and should be disabled for bare metal (and in some guests that want native behavior). ... then describe the regression and the fix diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 5358d43886ad..ee51c0949ed8 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -55,7 +55,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key); void __init native_pv_lock_init(void) { - if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && + if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !boot_cpu_has(X86_FEATURE_HYPERVISOR)) static_branch_disable(&virt_spin_lock_key); } This gets used at a single site: if (pv_enabled()) goto pv_queue; if (virt_spin_lock(lock)) return; which is logically: if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS)) goto ...; // don't look at virt_spin_lock_key if (virt_spin_lock_key) return; // On virt, but non-paravirt. Did Test-and-Set // spinlock. So I _think_ Arnd was trying to optimize native_pv_lock_init() away when it's going to get skipped over anyway by the 'goto'. 
But this took me at least 30 minutes of scratching my head and trying to untangle the whole thing. It's all far too subtle for my taste, and all of that to save a few bytes of init text in a configuration that's probably not even used very often (PARAVIRT=y, but PARAVIRT_SPINLOCKS=n). Let's just keep it simple. How about the attached patch? Simple indeed. The attachment is empty. :-p Juergen
Re: [PATCH] riscv: Fix early ftrace nop patching
Hello: This patch was applied to riscv/linux.git (for-next) by Palmer Dabbelt : On Thu, 23 May 2024 13:51:34 +0200 you wrote: > Commit c97bf629963e ("riscv: Fix text patching when IPI are used") > converted ftrace_make_nop() to use patch_insn_write() which does not > emit any icache flush relying entirely on __ftrace_modify_code() to do > that. > > But we missed that ftrace_make_nop() was called very early directly when > converting mcount calls into nops (actually on riscv it converts 2B nops > emitted by the compiler into 4B nops). > > [...] Here is the summary with links: - riscv: Fix early ftrace nop patching https://git.kernel.org/riscv/c/6ca445d8af0e You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH v2 1/2] drivers: remoteproc: xlnx: add attach detach support
On 5/23/24 12:05 PM, Mathieu Poirier wrote: > On Wed, May 22, 2024 at 09:36:26AM -0500, Tanmay Shah wrote: >> >> >> On 5/21/24 12:56 PM, Mathieu Poirier wrote: >> > Hi Tanmay, >> > >> > On Fri, May 10, 2024 at 05:51:25PM -0700, Tanmay Shah wrote: >> >> It is possible that remote processor is already running before >> >> linux boot or remoteproc platform driver probe. Implement required >> >> remoteproc framework ops to provide resource table address and >> >> connect or disconnect with remote processor in such case. >> >> >> >> Signed-off-by: Tanmay Shah >> >> --- >> >> >> >> Changes in v2: >> >> - Fix following sparse warnings >> >> >> >> drivers/remoteproc/xlnx_r5_remoteproc.c:827:21: sparse:expected >> >> struct rsc_tbl_data *rsc_data_va >> >> drivers/remoteproc/xlnx_r5_remoteproc.c:844:18: sparse:expected >> >> struct resource_table *rsc_addr >> >> drivers/remoteproc/xlnx_r5_remoteproc.c:898:24: sparse:expected void >> >> volatile [noderef] __iomem *addr >> >> >> >> drivers/remoteproc/xlnx_r5_remoteproc.c | 164 +++- >> >> 1 file changed, 160 insertions(+), 4 deletions(-) >> >> >> >> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c >> >> b/drivers/remoteproc/xlnx_r5_remoteproc.c >> >> index 84243d1dff9f..039370cffa32 100644 >> >> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c >> >> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c >> >> @@ -25,6 +25,10 @@ >> >> /* RX mailbox client buffer max length */ >> >> #define MBOX_CLIENT_BUF_MAX (IPI_BUF_LEN_MAX + \ >> >>sizeof(struct zynqmp_ipi_message)) >> >> + >> >> +#define RSC_TBL_XLNX_MAGIC ((uint32_t)'x' << 24 | (uint32_t)'a' << >> >> 16 | \ >> >> + (uint32_t)'m' << 8 | (uint32_t)'p') >> >> + >> >> /* >> >> * settings for RPU cluster mode which >> >> * reflects possible values of xlnx,cluster-mode dt-property >> >> @@ -73,6 +77,15 @@ struct mbox_info { >> >> struct mbox_chan *rx_chan; >> >> }; >> >> >> >> +/* Xilinx Platform specific data structure */ >> >> +struct rsc_tbl_data { >> >> + const int version; >> >> + 
const u32 magic_num; >> >> + const u32 comp_magic_num; >> > >> > Why is a complement magic number needed? >> >> Actually magic number is 64-bit. There is good chance that >> firmware can have 32-bit op-code or data same as magic number, but very less >> chance of its complement in the next address. So, we can assume magic number >> is 64-bit. >> > > So why not having a magic number that is a u64? > >> > >> >> + const u32 rsc_tbl_size; >> >> + const uintptr_t rsc_tbl; >> >> +} __packed; >> >> + >> >> /* >> >> * Hardcoded TCM bank values. This will stay in driver to maintain >> >> backward >> >> * compatibility with device-tree that does not have TCM information. >> >> @@ -95,20 +108,24 @@ static const struct mem_bank_data >> >> zynqmp_tcm_banks_lockstep[] = { >> >> /** >> >> * struct zynqmp_r5_core >> >> * >> >> + * @rsc_tbl_va: resource table virtual address >> >> * @dev: device of RPU instance >> >> * @np: device node of RPU instance >> >> * @tcm_bank_count: number TCM banks accessible to this RPU >> >> * @tcm_banks: array of each TCM bank data >> >> * @rproc: rproc handle >> >> + * @rsc_tbl_size: resource table size retrieved from remote >> >> * @pm_domain_id: RPU CPU power domain id >> >> * @ipi: pointer to mailbox information >> >> */ >> >> struct zynqmp_r5_core { >> >> + struct resource_table *rsc_tbl_va; >> > >> > Shouldn't this be of type "void __iomem *"? Did sparse give you trouble >> > on that >> > one? >> >> I fixed sparse warnings with typecast below [1]. >> > > My point is, ioremap_wc() returns a "void__iomem *" so why not using that > instead of a "struct resource_table *"? Ack. 
> > >> > >> >> struct device *dev; >> >> struct device_node *np; >> >> int tcm_bank_count; >> >> struct mem_bank_data **tcm_banks; >> >> struct rproc *rproc; >> >> + u32 rsc_tbl_size; >> >> u32 pm_domain_id; >> >> struct mbox_info *ipi; >> >> }; >> >> @@ -621,10 +638,19 @@ static int zynqmp_r5_rproc_prepare(struct rproc >> >> *rproc) >> >> { >> >> int ret; >> >> >> >> - ret = add_tcm_banks(rproc); >> >> - if (ret) { >> >> - dev_err(&rproc->dev, "failed to get TCM banks, err %d\n", ret); >> >> - return ret; >> >> + /** >> > >> > Using "/**" is for comments that will endup in the documentation, which I >> > don't >> > think is needed here. Please correct throughout the patch. >> >> Thanks. Ack, I will use only /* format. >> >> > >> >> + * For attach/detach use case, Firmware is already loaded so >> >> + * TCM isn't really needed at all. Also, for security TCM can be >> >> + * locked in such case and linux may not have access at all. >> >> + * So avoid adding TCM banks. TCM power-domains requested during attach >> >> + * callback. >> >> + */ >> >> + if (rproc->state != RPROC_DETACHED) { >> >> + ret = add_tcm_banks(rproc); >> >> +
Re: [PATCH v2 1/2] drivers: remoteproc: xlnx: add attach detach support
On Wed, May 22, 2024 at 09:36:26AM -0500, Tanmay Shah wrote: > > > On 5/21/24 12:56 PM, Mathieu Poirier wrote: > > Hi Tanmay, > > > > On Fri, May 10, 2024 at 05:51:25PM -0700, Tanmay Shah wrote: > >> It is possible that remote processor is already running before > >> linux boot or remoteproc platform driver probe. Implement required > >> remoteproc framework ops to provide resource table address and > >> connect or disconnect with remote processor in such case. > >> > >> Signed-off-by: Tanmay Shah > >> --- > >> > >> Changes in v2: > >> - Fix following sparse warnings > >> > >> drivers/remoteproc/xlnx_r5_remoteproc.c:827:21: sparse:expected struct > >> rsc_tbl_data *rsc_data_va > >> drivers/remoteproc/xlnx_r5_remoteproc.c:844:18: sparse:expected struct > >> resource_table *rsc_addr > >> drivers/remoteproc/xlnx_r5_remoteproc.c:898:24: sparse:expected void > >> volatile [noderef] __iomem *addr > >> > >> drivers/remoteproc/xlnx_r5_remoteproc.c | 164 +++- > >> 1 file changed, 160 insertions(+), 4 deletions(-) > >> > >> diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c > >> b/drivers/remoteproc/xlnx_r5_remoteproc.c > >> index 84243d1dff9f..039370cffa32 100644 > >> --- a/drivers/remoteproc/xlnx_r5_remoteproc.c > >> +++ b/drivers/remoteproc/xlnx_r5_remoteproc.c > >> @@ -25,6 +25,10 @@ > >> /* RX mailbox client buffer max length */ > >> #define MBOX_CLIENT_BUF_MAX (IPI_BUF_LEN_MAX + \ > >> sizeof(struct zynqmp_ipi_message)) > >> + > >> +#define RSC_TBL_XLNX_MAGIC((uint32_t)'x' << 24 | (uint32_t)'a' << > >> 16 | \ > >> + (uint32_t)'m' << 8 | (uint32_t)'p') > >> + > >> /* > >> * settings for RPU cluster mode which > >> * reflects possible values of xlnx,cluster-mode dt-property > >> @@ -73,6 +77,15 @@ struct mbox_info { > >>struct mbox_chan *rx_chan; > >> }; > >> > >> +/* Xilinx Platform specific data structure */ > >> +struct rsc_tbl_data { > >> + const int version; > >> + const u32 magic_num; > >> + const u32 comp_magic_num; > > > > Why is a complement magic number 
needed? > > Actually magic number is 64-bit. There is good chance that > firmware can have 32-bit op-code or data same as magic number, but very less > chance of its complement in the next address. So, we can assume magic number > is 64-bit. > So why not having a magic number that is a u64? > > > >> + const u32 rsc_tbl_size; > >> + const uintptr_t rsc_tbl; > >> +} __packed; > >> + > >> /* > >> * Hardcoded TCM bank values. This will stay in driver to maintain > >> backward > >> * compatibility with device-tree that does not have TCM information. > >> @@ -95,20 +108,24 @@ static const struct mem_bank_data > >> zynqmp_tcm_banks_lockstep[] = { > >> /** > >> * struct zynqmp_r5_core > >> * > >> + * @rsc_tbl_va: resource table virtual address > >> * @dev: device of RPU instance > >> * @np: device node of RPU instance > >> * @tcm_bank_count: number TCM banks accessible to this RPU > >> * @tcm_banks: array of each TCM bank data > >> * @rproc: rproc handle > >> + * @rsc_tbl_size: resource table size retrieved from remote > >> * @pm_domain_id: RPU CPU power domain id > >> * @ipi: pointer to mailbox information > >> */ > >> struct zynqmp_r5_core { > >> + struct resource_table *rsc_tbl_va; > > > > Shouldn't this be of type "void __iomem *"? Did sparse give you trouble on > > that > > one? > > I fixed sparse warnings with typecast below [1]. > My point is, ioremap_wc() returns a "void__iomem *" so why not using that instead of a "struct resource_table *"? 
> > > >>struct device *dev; > >>struct device_node *np; > >>int tcm_bank_count; > >>struct mem_bank_data **tcm_banks; > >>struct rproc *rproc; > >> + u32 rsc_tbl_size; > >>u32 pm_domain_id; > >>struct mbox_info *ipi; > >> }; > >> @@ -621,10 +638,19 @@ static int zynqmp_r5_rproc_prepare(struct rproc > >> *rproc) > >> { > >>int ret; > >> > >> - ret = add_tcm_banks(rproc); > >> - if (ret) { > >> - dev_err(&rproc->dev, "failed to get TCM banks, err %d\n", ret); > >> - return ret; > >> + /** > > > > Using "/**" is for comments that will endup in the documentation, which I > > don't > > think is needed here. Please correct throughout the patch. > > Thanks. Ack, I will use only /* format. > > > > >> + * For attach/detach use case, Firmware is already loaded so > >> + * TCM isn't really needed at all. Also, for security TCM can be > >> + * locked in such case and linux may not have access at all. > >> + * So avoid adding TCM banks. TCM power-domains requested during attach > >> + * callback. > >> + */ > >> + if (rproc->state != RPROC_DETACHED) { > >> + ret = add_tcm_banks(rproc); > >> + if (ret) { > >> + dev_err(&rproc->dev, "failed to get TCM banks, err > >> %d\n", ret); > >> + return ret; > >> + } > >>
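The magic-number question above can be made concrete with a small sketch. It assumes that "complement" means the bitwise complement of the 32-bit magic; the quoted part of the patch does not show the actual check, so the struct and both helpers below are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as RSC_TBL_XLNX_MAGIC in the patch: the bytes 'x','a','m','p'. */
#define MODEL_RSC_TBL_MAGIC ((uint32_t)'x' << 24 | (uint32_t)'a' << 16 | \
			     (uint32_t)'m' << 8 | (uint32_t)'p')

/* Hypothetical trimmed-down header: only the two magic fields. */
struct model_rsc_hdr {
	uint32_t magic_num;
	uint32_t comp_magic_num;
};

/* Variant 1: two 32-bit words, the second assumed to be ~magic. */
static bool hdr_valid_pair(const struct model_rsc_hdr *h)
{
	return h->magic_num == MODEL_RSC_TBL_MAGIC &&
	       h->comp_magic_num == (uint32_t)~MODEL_RSC_TBL_MAGIC;
}

/* Variant 2: one 64-bit constant carrying the same 64 bits of signature. */
static bool hdr_valid_u64(uint64_t magic64)
{
	const uint64_t expect = (uint64_t)MODEL_RSC_TBL_MAGIC << 32 |
				(uint32_t)~MODEL_RSC_TBL_MAGIC;

	return magic64 == expect;
}
```

Both variants compare 64 bits of signature. One difference worth keeping in mind is that the in-memory word order of a u64 depends on endianness, so the two layouts are not byte-for-byte interchangeable.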
[PATCH] ipvs: Avoid unnecessary calls to skb_is_gso_sctp
In the context of the SCTP SNAT/DNAT handler, these calls can only return true. Ref: e10d3ba4d434 ("ipvs: Fix checksumming on GSO of SCTP packets") Signed-off-by: Ismael Luceno CC: Pablo Neira Ayuso CC: Michal Kubeček CC: Simon Horman CC: Julian Anastasov CC: lvs-de...@vger.kernel.org CC: netfilter-de...@vger.kernel.org CC: net...@vger.kernel.org CC: coret...@netfilter.org --- net/netfilter/ipvs/ip_vs_proto_sctp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c b/net/netfilter/ipvs/ip_vs_proto_sctp.c index 1e689c714127..83e452916403 100644 --- a/net/netfilter/ipvs/ip_vs_proto_sctp.c +++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c @@ -126,7 +126,7 @@ sctp_snat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp, if (sctph->source != cp->vport || payload_csum || skb->ip_summed == CHECKSUM_PARTIAL) { sctph->source = cp->vport; - if (!skb_is_gso(skb) || !skb_is_gso_sctp(skb)) + if (!skb_is_gso(skb)) sctp_nat_csum(skb, sctph, sctphoff); } else { skb->ip_summed = CHECKSUM_UNNECESSARY; @@ -175,7 +175,7 @@ sctp_dnat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp, (skb->ip_summed == CHECKSUM_PARTIAL && !(skb_dst(skb)->dev->features & NETIF_F_SCTP_CRC))) { sctph->dest = cp->dport; - if (!skb_is_gso(skb) || !skb_is_gso_sctp(skb)) + if (!skb_is_gso(skb)) sctp_nat_csum(skb, sctph, sctphoff); } else if (skb->ip_summed != CHECKSUM_PARTIAL) { skb->ip_summed = CHECKSUM_UNNECESSARY; -- 2.44.0
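The simplification can be checked with a small boolean model. It assumes the invariant stated in the commit message, namely that inside the SCTP handlers a GSO skb is always an SCTP GSO skb; the function names are illustrative and take plain flags rather than a `struct sk_buff`.

```c
#include <stdbool.h>

/* Condition before the patch: recompute unless the skb is SCTP GSO. */
static bool need_csum_old(bool is_gso, bool is_gso_sctp)
{
	return !is_gso || !is_gso_sctp;
}

/* Condition after the patch: recompute unless the skb is GSO at all. */
static bool need_csum_new(bool is_gso)
{
	return !is_gso;
}
```

The two conditions differ only for a GSO skb that is not SCTP GSO, which is the case the commit message rules out for these handlers.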
Re: [PATCH] x86/paravirt: Disable virt spinlock when CONFIG_PARAVIRT_SPINLOCKS disabled
On 5/16/24 06:02, Chen Yu wrote: > Performance drop is reported when running encode/decode workload and > BenchSEE cache sub-workload. > Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused > native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS > is disabled the virt_spin_lock_key is set to true on bare-metal. > The qspinlock degenerates to test-and-set spinlock, which decrease the > performance on bare-metal. > > Fix this by disabling virt_spin_lock_key if CONFIG_PARAVIRT_SPINLOCKS > is not set, or it is on bare-metal. This is missing some background: The kernel can change spinlock behavior when running as a guest. But this guest-friendly behavior causes performance problems on bare metal. So there's a 'virt_spin_lock_key' static key to switch between the two modes. The static key is always enabled by default (run in guest mode) and should be disabled for bare metal (and in some guests that want native behavior). ... then describe the regression and the fix > diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c > index 5358d43886ad..ee51c0949ed8 100644 > --- a/arch/x86/kernel/paravirt.c > +++ b/arch/x86/kernel/paravirt.c > @@ -55,7 +55,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key); > > void __init native_pv_lock_init(void) > { > - if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) && > + if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || > !boot_cpu_has(X86_FEATURE_HYPERVISOR)) > static_branch_disable(&virt_spin_lock_key); > } This gets used at a single site: if (pv_enabled()) goto pv_queue; if (virt_spin_lock(lock)) return; which is logically: if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS)) goto ...; // don't look at virt_spin_lock_key if (virt_spin_lock_key) return; // On virt, but non-paravirt. Did Test-and-Set // spinlock. So I _think_ Arnd was trying to optimize native_pv_lock_init() away when it's going to get skipped over anyway by the 'goto'. 
But this took me at least 30 minutes of scratching my head and trying to untangle the whole thing. It's all far too subtle for my taste, and all of that to save a few bytes of init text in a configuration that's probably not even used very often (PARAVIRT=y, but PARAVIRT_SPINLOCKS=n). Let's just keep it simple. How about the attached patch?
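The conditions discussed in this thread can be compared with a small userspace model. It is a sketch, not kernel code: the enum and function names are made up, the static key is modelled as a plain bool that starts out true, and the third variant is the hypervisor-only check proposed as the simplification.

```c
#include <stdbool.h>

/* The three conditions discussed in the thread for disabling the key. */
enum init_variant {
	VARIANT_CURRENT,	/* IS_ENABLED(PARAVIRT_SPINLOCKS) && !HYPERVISOR */
	VARIANT_CHEN_YU,	/* !IS_ENABLED(PARAVIRT_SPINLOCKS) || !HYPERVISOR */
	VARIANT_HV_ONLY,	/* !HYPERVISOR */
};

/*
 * Model of native_pv_lock_init(): the key defaults to true, and this
 * returns its value after init for a given config and platform.
 */
static bool virt_spin_lock_key_after_init(enum init_variant v,
					  bool pv_spinlocks, bool hypervisor)
{
	bool disable;

	switch (v) {
	case VARIANT_CURRENT:
		disable = pv_spinlocks && !hypervisor;
		break;
	case VARIANT_CHEN_YU:
		disable = !pv_spinlocks || !hypervisor;
		break;
	default:
		disable = !hypervisor;
		break;
	}
	return !disable;
}
```

On bare metal with PARAVIRT_SPINLOCKS=n the current condition leaves the key enabled, which is the reported regression; both alternatives disable it. The alternatives differ only for a guest with PARAVIRT_SPINLOCKS=n, where the hypervisor-only check keeps the key enabled.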
Re: [PATCH v2] sched/rt: Clean up usage of rt_task()
On Wed, 15 May 2024 23:05:36 +0100 Qais Yousef wrote: > diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h > index df3aca89d4f5..5cb88b748ad6 100644 > --- a/include/linux/sched/deadline.h > +++ b/include/linux/sched/deadline.h > @@ -10,8 +10,6 @@ > > #include > > -#define MAX_DL_PRIO 0 > - > static inline int dl_prio(int prio) > { > if (unlikely(prio < MAX_DL_PRIO)) > @@ -19,6 +17,10 @@ static inline int dl_prio(int prio) > return 0; > } > > +/* > + * Returns true if a task has a priority that belongs to DL class. PI-boosted > + * tasks will return true. Use dl_policy() to ignore PI-boosted tasks. > + */ > static inline int dl_task(struct task_struct *p) > { > return dl_prio(p->prio); > diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h > index ab83d85e1183..6ab43b4f72f9 100644 > --- a/include/linux/sched/prio.h > +++ b/include/linux/sched/prio.h > @@ -14,6 +14,7 @@ > */ > > #define MAX_RT_PRIO 100 > +#define MAX_DL_PRIO 0 > > #define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH) > #define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2) > diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h > index b2b9e6eb9683..a055dd68a77c 100644 > --- a/include/linux/sched/rt.h > +++ b/include/linux/sched/rt.h > @@ -7,18 +7,43 @@ > struct task_struct; > > static inline int rt_prio(int prio) > +{ > + if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO)) > + return 1; > + return 0; > +} > + > +static inline int realtime_prio(int prio) > { > if (unlikely(prio < MAX_RT_PRIO)) > return 1; > return 0; > } I'm thinking we should change the above to bool (separate patch), as returning an int may give one the impression that it returns the actual priority number. Having it return bool will clear that up. In fact, if we are touching theses functions, might as well change all of them to bool when returning true/false. Just to make it easier to understand what they are doing. 
> > +/* > + * Returns true if a task has a priority that belongs to RT class. PI-boosted > + * tasks will return true. Use rt_policy() to ignore PI-boosted tasks. > + */ > static inline int rt_task(struct task_struct *p) > { > return rt_prio(p->prio); > } > > -static inline bool task_is_realtime(struct task_struct *tsk) > +/* > + * Returns true if a task has a priority that belongs to RT or DL classes. > + * PI-boosted tasks will return true. Use realtime_task_policy() to ignore > + * PI-boosted tasks. > + */ > +static inline int realtime_task(struct task_struct *p) > +{ > + return realtime_prio(p->prio); > +} > + > +/* > + * Returns true if a task has a policy that belongs to RT or DL classes. > + * PI-boosted tasks will return false. > + */ > +static inline bool realtime_task_policy(struct task_struct *tsk) > { > int policy = tsk->policy; > > diff --git a/kernel/trace/trace_sched_wakeup.c > b/kernel/trace/trace_sched_wakeup.c > index 0469a04a355f..19d737742e29 100644 > --- a/kernel/trace/trace_sched_wakeup.c > +++ b/kernel/trace/trace_sched_wakeup.c > @@ -545,7 +545,7 @@ probe_wakeup(void *ignore, struct task_struct *p) >* - wakeup_dl handles tasks belonging to sched_dl class only. >*/ > if (tracing_dl || (wakeup_dl && !dl_task(p)) || > - (wakeup_rt && !dl_task(p) && !rt_task(p)) || > + (wakeup_rt && !realtime_task(p)) || > (!dl_task(p) && (p->prio >= wakeup_prio || p->prio >= > current->prio))) > return; > Reviewed-by: Steven Rostedt (Google)
Re: [RFC PATCH 0/5] vsock/virtio: Add support for multi-devices
On Fri, May 17, 2024 at 10:46:02PM +0800, Xuewei Niu wrote: > include/linux/virtio_vsock.h| 2 +- > include/net/af_vsock.h | 25 ++- > include/uapi/linux/virtio_vsock.h | 1 + > include/uapi/linux/vm_sockets.h | 14 ++ > net/vmw_vsock/af_vsock.c| 116 +-- > net/vmw_vsock/virtio_transport.c| 255 ++-- > net/vmw_vsock/virtio_transport_common.c | 16 +- > net/vmw_vsock/vsock_loopback.c | 4 +- > 8 files changed, 352 insertions(+), 81 deletions(-) As any change to virtio device/driver interface, this has to go through the virtio TC. Please subscribe at virtio-comment+subscr...@lists.linux.dev and then contact the TC at virtio-comm...@lists.linux.dev You will likely eventually need to write a spec draft document, too. -- MST
Re: [PATCH] livepatch: introduce klp_func called interface
On Sun, May 19, 2024 at 03:43:43PM +0800, Wardenjohn wrote: > Livepatch module usually used to modify kernel functions. > If the patched function have bug, it may cause serious result > such as kernel crash. > > This commit introduce a read only interface of livepatch > sysfs interface. If a livepatch function is called, this > sysfs interface "called" of the patched function will > set to be 1. > > /sys/kernel/livepatchcalled > > This value "called" is quite necessary for kernel stability assurance for > livepatching > module of a running system. Testing process is important before a livepatch > module > apply to a production system. With this interface, testing process can easily > find out which function is successfully called. Any testing process can make > sure they > have successfully cover all the patched function that changed with the help > of this interface. > --- Always run your patches through checkpatch. So this patch is so that testers can see if a function has been called? Can you not get the same information from gcov or ftrace? There are style issues with the patch, but it's not so important until the design is agreed on. regards, dan carpenter
Re: [PATCH] riscv: Fix early ftrace nop patching
On Thu, May 23, 2024 at 01:51:34PM +0200, Alexandre Ghiti wrote: > Commit c97bf629963e ("riscv: Fix text patching when IPI are used") > converted ftrace_make_nop() to use patch_insn_write() which does not > emit any icache flush relying entirely on __ftrace_modify_code() to do > that. > > But we missed that ftrace_make_nop() was called very early directly when > converting mcount calls into nops (actually on riscv it converts 2B nops > emitted by the compiler into 4B nops). > > This caused crashes on multiple HW as reported by Conor and Björn since > the booting core could have half-patched instructions in its icache > which would trigger an illegal instruction trap: fix this by emitting a > local flush icache when early patching nops. > > Fixes: c97bf629963e ("riscv: Fix text patching when IPI are used") > Signed-off-by: Alexandre Ghiti Reported-by: Conor Dooley Tested-by: Conor Dooley Thanks for the quick fix Alex :)
Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
[trimming CC list] On Thu, 23 May 2024 at 04:49, John Groves wrote: > - memmap=! will reserve a pretend pmem device at > > - memmap=$ will reserve a pretend dax device at Doesn't get me a /dev/dax or /dev/pmem Complete qemu command line: qemu-kvm -s -serial none -parallel none -kernel /home/mszeredi/git/linux/arch/x86/boot/bzImage -drive format=raw,file=/home/mszeredi/root_fs,index=0,if=virtio -drive format=raw,file=/home/mszeredi/images/ubd1,index=1,if=virtio -chardev stdio,id=virtiocon0,signal=off -device virtio-serial -device virtconsole,chardev=virtiocon0 -cpu host -m 8G -net user -net nic,model=virtio -fsdev local,security_model=none,id=fsdev0,path=/home -device virtio-9p-pci,fsdev=fsdev0,mount_tag=hostshare -device virtio-rng-pci -smp 4 -append 'root=/dev/vda console=hvc0 memmap=4G$4G' root@kvm:~/famfs# scripts/chk_efi.sh This system is neither Ubuntu nor Fedora. It is identified as debian. /sys/firmware/efi not found; probably not efi not found; probably nof efi /boot/efi/EFI not found; probably not efi /boot/efi/EFI/BOOT not found; probably not efi /boot/efi/EFI/ not found; probably not efi /boot/efi/EFI//grub.cfg not found; probably nof efi Probably not efi; errs=6 Thanks, Miklos
Re: [PATCH] riscv: Fix early ftrace nop patching
Alexandre Ghiti writes: > Commit c97bf629963e ("riscv: Fix text patching when IPI are used") > converted ftrace_make_nop() to use patch_insn_write() which does not > emit any icache flush relying entirely on __ftrace_modify_code() to do > that. > > But we missed that ftrace_make_nop() was called very early directly when > converting mcount calls into nops (actually on riscv it converts 2B nops > emitted by the compiler into 4B nops). > > This caused crashes on multiple HW as reported by Conor and Björn since > the booting core could have half-patched instructions in its icache > which would trigger an illegal instruction trap: fix this by emitting a > local flush icache when early patching nops. > > Fixes: c97bf629963e ("riscv: Fix text patching when IPI are used") > Signed-off-by: Alexandre Ghiti Nice! I've managed to reproduce the crash on the VisionFive2 board (however only triggered when CONFIG_RELOCATABLE=y), and can verify that this fix solves the issue. Reviewed-by: Björn Töpel Tested-by: Björn Töpel
[PATCHv7 9/9] man2: Add uretprobe syscall page
Adding man page for new uretprobe syscall. Reviewed-by: Alejandro Colomar Signed-off-by: Jiri Olsa --- man/man2/uretprobe.2 | 56 1 file changed, 56 insertions(+) create mode 100644 man/man2/uretprobe.2 diff --git a/man/man2/uretprobe.2 b/man/man2/uretprobe.2 new file mode 100644 index ..cf1c2b0d852e --- /dev/null +++ b/man/man2/uretprobe.2 @@ -0,0 +1,56 @@ +.\" Copyright (C) 2024, Jiri Olsa +.\" +.\" SPDX-License-Identifier: Linux-man-pages-copyleft +.\" +.TH uretprobe 2 (date) "Linux man-pages (unreleased)" +.SH NAME +uretprobe \- execute pending return uprobes +.SH SYNOPSIS +.nf +.B int uretprobe(void) +.fi +.SH DESCRIPTION +The +.BR uretprobe () +system call is an alternative to breakpoint instructions for triggering return +uprobe consumers. +.P +Calls to +.BR uretprobe () +system call are only made from the user-space trampoline provided by the kernel. +Calls from any other place result in a +.BR SIGILL . +.SH RETURN VALUE +The +.BR uretprobe () +system call return value is architecture-specific. +.SH ERRORS +.TP +.B SIGILL +The +.BR uretprobe () +system call was called by a user-space program. +.SH VERSIONS +Details of the +.BR uretprobe () +system call behavior vary across systems. +.SH STANDARDS +None. +.SH HISTORY +TBD +.SH NOTES +The +.BR uretprobe () +system call was initially introduced for the x86_64 architecture +where it was shown to be faster than breakpoint traps. +It might be extended to other architectures. +.P +The +.BR uretprobe () +system call exists only to allow the invocation of return uprobe consumers. +It should +.B never +be called directly. +Details of the arguments (if any) passed to +.BR uretprobe () +and the return value are architecture-specific. -- 2.45.1
[PATCHv7 bpf-next 8/9] selftests/bpf: Add uretprobe shadow stack test
Adding uretprobe shadow stack test that runs all existing uretprobe tests with shadow stack enabled if it's available. Signed-off-by: Jiri Olsa --- .../selftests/bpf/prog_tests/uprobe_syscall.c | 60 +++ 1 file changed, 60 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c index 3ef324c2db50..fda456401284 100644 --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c @@ -9,6 +9,9 @@ #include #include #include +#include +#include +#include #include "uprobe_syscall.skel.h" #include "uprobe_syscall_executed.skel.h" @@ -297,6 +300,56 @@ static void test_uretprobe_syscall_call(void) close(go[1]); close(go[0]); } + +/* + * Borrowed from tools/testing/selftests/x86/test_shadow_stack.c. + * + * For use in inline enablement of shadow stack. + * + * The program can't return from the point where shadow stack gets enabled + * because there will be no address on the shadow stack. So it can't use + * syscall() for enablement, since it is a function. + * + * Based on code from nolibc.h. Keep a copy here because this can't pull + * in all of nolibc.h. + */ +#define ARCH_PRCTL(arg1, arg2) \ +({ \ + long _ret; \ + register long _num asm("eax") = __NR_arch_prctl; \ + register long _arg1 asm("rdi") = (long)(arg1); \ + register long _arg2 asm("rsi") = (long)(arg2); \ + \ + asm volatile ( \ + "syscall\n" \ + : "=a"(_ret)\ + : "r"(_arg1), "r"(_arg2), \ + "0"(_num) \ + : "rcx", "r11", "memory", "cc" \ + ); \ + _ret; \ +}) + +#ifndef ARCH_SHSTK_ENABLE +#define ARCH_SHSTK_ENABLE 0x5001 +#define ARCH_SHSTK_DISABLE 0x5002 +#define ARCH_SHSTK_SHSTK (1ULL << 0) +#endif + +static void test_uretprobe_shadow_stack(void) +{ + if (ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK)) { + test__skip(); + return; + } + + /* Run all of the uretprobe tests. 
*/ + test_uretprobe_regs_equal(); + test_uretprobe_regs_change(); + test_uretprobe_syscall_call(); + + ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK); +} #else static void test_uretprobe_regs_equal(void) { @@ -312,6 +365,11 @@ static void test_uretprobe_syscall_call(void) { test__skip(); } + +static void test_uretprobe_shadow_stack(void) +{ + test__skip(); +} #endif void test_uprobe_syscall(void) @@ -322,4 +380,6 @@ void test_uprobe_syscall(void) test_uretprobe_regs_change(); if (test__start_subtest("uretprobe_syscall_call")) test_uretprobe_syscall_call(); + if (test__start_subtest("uretprobe_shadow_stack")) + test_uretprobe_shadow_stack(); } -- 2.45.1
[PATCHv7 bpf-next 7/9] selftests/bpf: Add uretprobe syscall call from user space test
Adding test to verify that when called from outside of the trampoline provided by kernel, the uretprobe syscall will cause calling process to receive SIGILL signal and the attached bpf program is not executed. Acked-by: Andrii Nakryiko Reviewed-by: Masami Hiramatsu (Google) Signed-off-by: Jiri Olsa --- .../selftests/bpf/prog_tests/uprobe_syscall.c | 95 +++ .../bpf/progs/uprobe_syscall_executed.c | 17 2 files changed, 112 insertions(+) create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c index 1a50cd35205d..3ef324c2db50 100644 --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c @@ -7,7 +7,10 @@ #include #include #include +#include +#include #include "uprobe_syscall.skel.h" +#include "uprobe_syscall_executed.skel.h" __naked unsigned long uretprobe_regs_trigger(void) { @@ -209,6 +212,91 @@ static void test_uretprobe_regs_change(void) } } +#ifndef __NR_uretprobe +#define __NR_uretprobe 462 +#endif + +__naked unsigned long uretprobe_syscall_call_1(void) +{ + /* +* Pretend we are uretprobe trampoline to trigger the return +* probe invocation in order to verify we get SIGILL. 
+*/ + asm volatile ( + "pushq %rax\n" + "pushq %rcx\n" + "pushq %r11\n" + "movq $" __stringify(__NR_uretprobe) ", %rax\n" + "syscall\n" + "popq %r11\n" + "popq %rcx\n" + "retq\n" + ); +} + +__naked unsigned long uretprobe_syscall_call(void) +{ + asm volatile ( + "call uretprobe_syscall_call_1\n" + "retq\n" + ); +} + +static void test_uretprobe_syscall_call(void) +{ + LIBBPF_OPTS(bpf_uprobe_multi_opts, opts, + .retprobe = true, + ); + struct uprobe_syscall_executed *skel; + int pid, status, err, go[2], c; + + if (!ASSERT_OK(pipe(go), "pipe")) + return; + + skel = uprobe_syscall_executed__open_and_load(); + if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load")) + goto cleanup; + + pid = fork(); + if (!ASSERT_GE(pid, 0, "fork")) + goto cleanup; + + /* child */ + if (pid == 0) { + close(go[1]); + + /* wait for parent's kick */ + err = read(go[0], &c, 1); + if (err != 1) + exit(-1); + + uretprobe_syscall_call(); + _exit(0); + } + + skel->links.test = bpf_program__attach_uprobe_multi(skel->progs.test, pid, + "/proc/self/exe", + "uretprobe_syscall_call", &opts); + if (!ASSERT_OK_PTR(skel->links.test, "bpf_program__attach_uprobe_multi")) + goto cleanup; + + /* kick the child */ + write(go[1], &c, 1); + err = waitpid(pid, &status, 0); + ASSERT_EQ(err, pid, "waitpid"); + + /* verify the child got killed with SIGILL */ + ASSERT_EQ(WIFSIGNALED(status), 1, "WIFSIGNALED"); + ASSERT_EQ(WTERMSIG(status), SIGILL, "WTERMSIG"); + + /* verify the uretprobe program wasn't called */ + ASSERT_EQ(skel->bss->executed, 0, "executed"); + +cleanup: + uprobe_syscall_executed__destroy(skel); + close(go[1]); + close(go[0]); +} #else static void test_uretprobe_regs_equal(void) { @@ -219,6 +307,11 @@ static void test_uretprobe_regs_change(void) { test__skip(); } + +static void test_uretprobe_syscall_call(void) +{ + test__skip(); +} #endif void test_uprobe_syscall(void) @@ -227,4 +320,6 @@ void test_uprobe_syscall(void) test_uretprobe_regs_equal(); if 
(test__start_subtest("uretprobe_regs_change")) test_uretprobe_regs_change(); + if (test__start_subtest("uretprobe_syscall_call")) + test_uretprobe_syscall_call(); } diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c new file mode 100644 index ..0d7f1a7db2e2 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c @@ -0,0 +1,17 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "vmlinux.h" +#include +#include + +struct pt_regs regs; + +char _license[] SEC("license") = "GPL"; + +int executed = 0; + +SEC("uretprobe.multi") +int test(struct pt_regs *regs) +{ + executed = 1; + return 0; +} -- 2.45.1
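The parent-side check in the test above (fork a child, wait for it, confirm it died from SIGILL) can be sketched in plain POSIX C without any BPF or uprobe setup. This is only an illustration of the waitpid() status decoding; the helper names run_child_and_get_signal(), raise_sigill() and do_nothing() are hypothetical, and the child raises SIGILL itself rather than calling the uretprobe syscall.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run fn() in a forked child and report how the
 * child ended. Returns the terminating signal number, 0 if the child
 * exited normally, or -1 if fork()/waitpid() failed.
 */
static int run_child_and_get_signal(void (*fn)(void))
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		fn();
		_exit(0);
	}

	if (waitpid(pid, &status, 0) != pid)
		return -1;

	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Stand-in for the child that triggers SIGILL in the selftest. */
static void raise_sigill(void)
{
	signal(SIGILL, SIG_DFL);	/* make sure the default action applies */
	raise(SIGILL);
}

/* Child body that exits cleanly, for comparison. */
static void do_nothing(void)
{
}
```

Calling run_child_and_get_signal(raise_sigill) is expected to return SIGILL, which corresponds to the WIFSIGNALED/WTERMSIG assertions in the selftest.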
[PATCHv7 bpf-next 6/9] selftests/bpf: Add uretprobe syscall test for regs changes
Adding test that creates uprobe consumer on uretprobe which changes some of the registers. Making sure the changed registers are propagated to the user space when the uretprobe syscall trampoline is used on x86_64. To be able to do this, adding support to bpf_testmod to create uprobe via new attribute file: /sys/kernel/bpf_testmod_uprobe This file is expecting file offset and creates related uprobe on current process exe file and removes existing uprobe if offset is 0. There can be only a single uprobe at any time. The uprobe has specific consumer that changes registers used in uretprobe syscall trampoline and which are later checked in the test. Acked-by: Andrii Nakryiko Reviewed-by: Masami Hiramatsu (Google) Signed-off-by: Jiri Olsa --- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 123 +- .../selftests/bpf/prog_tests/uprobe_syscall.c | 67 ++ 2 files changed, 189 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index 2a18bd320e92..b0132a342bb5 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -18,6 +18,7 @@ #include #include #include +#include #include "bpf_testmod.h" #include "bpf_testmod_kfunc.h" @@ -358,6 +359,119 @@ static struct bin_attribute bin_attr_bpf_testmod_file __ro_after_init = { .write = bpf_testmod_test_write, }; +/* bpf_testmod_uprobe sysfs attribute is so far enabled for x86_64 only, + * please see test_uretprobe_regs_change test + */ +#ifdef __x86_64__ + +static int +uprobe_ret_handler(struct uprobe_consumer *self, unsigned long func, + struct pt_regs *regs) + +{ + regs->ax = 0x12345678deadbeef; + regs->cx = 0x87654321feebdaed; + regs->r11 = (u64) -1; + return true; +} + +struct testmod_uprobe { + struct path path; + loff_t offset; + struct uprobe_consumer consumer; +}; + +static DEFINE_MUTEX(testmod_uprobe_mutex); + +static struct testmod_uprobe uprobe = { + 
.consumer.ret_handler = uprobe_ret_handler, +}; + +static int testmod_register_uprobe(loff_t offset) +{ + int err = -EBUSY; + + if (uprobe.offset) + return -EBUSY; + + mutex_lock(&testmod_uprobe_mutex); + + if (uprobe.offset) + goto out; + + err = kern_path("/proc/self/exe", LOOKUP_FOLLOW, &uprobe.path); + if (err) + goto out; + + err = uprobe_register_refctr(d_real_inode(uprobe.path.dentry), +offset, 0, &uprobe.consumer); + if (err) + path_put(&uprobe.path); + else + uprobe.offset = offset; + +out: + mutex_unlock(&testmod_uprobe_mutex); + return err; +} + +static void testmod_unregister_uprobe(void) +{ + mutex_lock(&testmod_uprobe_mutex); + + if (uprobe.offset) { + uprobe_unregister(d_real_inode(uprobe.path.dentry), + uprobe.offset, &uprobe.consumer); + uprobe.offset = 0; + } + + mutex_unlock(&testmod_uprobe_mutex); +} + +static ssize_t +bpf_testmod_uprobe_write(struct file *file, struct kobject *kobj, +struct bin_attribute *bin_attr, +char *buf, loff_t off, size_t len) +{ + unsigned long offset = 0; + int err = 0; + + if (kstrtoul(buf, 0, &offset)) + return -EINVAL; + + if (offset) + err = testmod_register_uprobe(offset); + else + testmod_unregister_uprobe(); + + return err ?: strlen(buf); +} + +static struct bin_attribute bin_attr_bpf_testmod_uprobe_file __ro_after_init = { + .attr = { .name = "bpf_testmod_uprobe", .mode = 0666, }, + .write = bpf_testmod_uprobe_write, +}; + +static int register_bpf_testmod_uprobe(void) +{ + return sysfs_create_bin_file(kernel_kobj, &bin_attr_bpf_testmod_uprobe_file); +} + +static void unregister_bpf_testmod_uprobe(void) +{ + testmod_unregister_uprobe(); + sysfs_remove_bin_file(kernel_kobj, &bin_attr_bpf_testmod_uprobe_file); +} + +#else +static int register_bpf_testmod_uprobe(void) +{ + return 0; +} + +static void unregister_bpf_testmod_uprobe(void) { } +#endif + BTF_KFUNCS_START(bpf_testmod_common_kfunc_ids) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_new, KF_ITER_NEW) BTF_ID_FLAGS(func, bpf_iter_testmod_seq_next, KF_ITER_NEXT | 
KF_RET_NULL) @@ -912,7 +1026,13 @@ static int bpf_testmod_init(void) return -EINVAL; sock = NULL; mutex_init(&sock_lock); - return sysfs_create_bin_file(kernel_kobj, &bin_attr_bpf_testmod_file); + ret = sysfs_create_bin_file(kernel_kobj, &bin_attr_bpf_testmod_file); + if (ret < 0) + return ret; + ret = register_bpf_testmod_uprobe(); + if (ret < 0) + return ret; +
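The contract of the bpf_testmod_uprobe attribute described in the commit message (a non-zero offset registers the single uprobe, 0 removes it, and a second registration is refused) can be sketched in user space as follows. This is a hedged model only: sketch_offset and sketch_uprobe_write() are hypothetical names, strtoul() stands in for the kernel's kstrtoul(), and no uprobe is actually registered.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the module's single registered uprobe offset. */
static unsigned long sketch_offset;

/*
 * Model of the write handler's behaviour:
 *   - unparsable input      -> -EINVAL
 *   - offset == 0           -> drop the current uprobe (if any)
 *   - offset != 0, slot free -> remember the offset
 *   - offset != 0, slot busy -> -EBUSY
 */
static int sketch_uprobe_write(const char *buf)
{
	unsigned long offset;
	char *end;

	errno = 0;
	offset = strtoul(buf, &end, 0);
	/* Accept an optional trailing newline, as sysfs writes often carry one. */
	if (end == buf || errno || (*end != '\0' && *end != '\n'))
		return -EINVAL;

	if (!offset) {
		sketch_offset = 0;
		return 0;
	}

	if (sketch_offset)
		return -EBUSY;

	sketch_offset = offset;
	return 0;
}
```

Keeping a single slot means the test has to write 0 before it can install a uprobe at a different offset.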
[PATCHv7 bpf-next 5/9] selftests/bpf: Add uretprobe syscall test for regs integrity
Add uretprobe syscall test that compares register values before and after the uretprobe is hit. It also compares the register values seen from attached bpf program. Acked-by: Andrii Nakryiko Reviewed-by: Masami Hiramatsu (Google) Signed-off-by: Jiri Olsa --- tools/include/linux/compiler.h| 4 + .../selftests/bpf/prog_tests/uprobe_syscall.c | 163 ++ .../selftests/bpf/progs/uprobe_syscall.c | 15 ++ 3 files changed, 182 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall.c diff --git a/tools/include/linux/compiler.h b/tools/include/linux/compiler.h index 8a63a9913495..6f7f22ac9da5 100644 --- a/tools/include/linux/compiler.h +++ b/tools/include/linux/compiler.h @@ -62,6 +62,10 @@ #define __nocf_check __attribute__((nocf_check)) #endif +#ifndef __naked +#define __naked __attribute__((__naked__)) +#endif + /* Are two types/vars the same type (ignoring qualifiers)? */ #ifndef __same_type # define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b)) diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c new file mode 100644 index ..311ac19d8992 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c @@ -0,0 +1,163 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include + +#ifdef __x86_64__ + +#include +#include +#include +#include "uprobe_syscall.skel.h" + +__naked unsigned long uretprobe_regs_trigger(void) +{ + asm volatile ( + "movq $0xdeadbeef, %rax\n" + "ret\n" + ); +} + +__naked void uretprobe_regs(struct pt_regs *before, struct pt_regs *after) +{ + asm volatile ( + "movq %r15, 0(%rdi)\n" + "movq %r14, 8(%rdi)\n" + "movq %r13, 16(%rdi)\n" + "movq %r12, 24(%rdi)\n" + "movq %rbp, 32(%rdi)\n" + "movq %rbx, 40(%rdi)\n" + "movq %r11, 48(%rdi)\n" + "movq %r10, 56(%rdi)\n" + "movq %r9, 64(%rdi)\n" + "movq %r8, 72(%rdi)\n" + "movq %rax, 80(%rdi)\n" + "movq %rcx, 
88(%rdi)\n" + "movq %rdx, 96(%rdi)\n" + "movq %rsi, 104(%rdi)\n" + "movq %rdi, 112(%rdi)\n" + "movq $0, 120(%rdi)\n" /* orig_rax */ + "movq $0, 128(%rdi)\n" /* rip */ + "movq $0, 136(%rdi)\n" /* cs */ + "pushf\n" + "pop %rax\n" + "movq %rax, 144(%rdi)\n" /* eflags */ + "movq %rsp, 152(%rdi)\n" /* rsp */ + "movq $0, 160(%rdi)\n" /* ss */ + + /* save 2nd argument */ + "pushq %rsi\n" + "call uretprobe_regs_trigger\n" + + /* save return value and load 2nd argument pointer to rax */ + "pushq %rax\n" + "movq 8(%rsp), %rax\n" + + "movq %r15, 0(%rax)\n" + "movq %r14, 8(%rax)\n" + "movq %r13, 16(%rax)\n" + "movq %r12, 24(%rax)\n" + "movq %rbp, 32(%rax)\n" + "movq %rbx, 40(%rax)\n" + "movq %r11, 48(%rax)\n" + "movq %r10, 56(%rax)\n" + "movq %r9, 64(%rax)\n" + "movq %r8, 72(%rax)\n" + "movq %rcx, 88(%rax)\n" + "movq %rdx, 96(%rax)\n" + "movq %rsi, 104(%rax)\n" + "movq %rdi, 112(%rax)\n" + "movq $0, 120(%rax)\n" /* orig_rax */ + "movq $0, 128(%rax)\n" /* rip */ + "movq $0, 136(%rax)\n" /* cs */ + + /* restore return value and 2nd argument */ + "pop %rax\n" + "pop %rsi\n" + + "movq %rax, 80(%rsi)\n" + + "pushf\n" + "pop %rax\n" + + "movq %rax, 144(%rsi)\n" /* eflags */ + "movq %rsp, 152(%rsi)\n" /* rsp */ + "movq $0, 160(%rsi)\n" /* ss */ + "ret\n" +); +} + +static void test_uretprobe_regs_equal(void) +{ + struct uprobe_syscall *skel = NULL; + struct pt_regs before = {}, after = {}; + unsigned long *pb = (unsigned long *) &before; + unsigned long *pa = (unsigned long *) &after; + unsigned long *pp; + unsigned int i, cnt; + int err; + + skel = uprobe_syscall__open_and_load(); + if (!ASSERT_OK_PTR(skel, "uprobe_syscall__open_and_load")) + goto cleanup; + + err = uprobe_syscall__attach(skel); + if (!ASSERT_OK(err, "uprobe_syscall__attach")) + goto cleanup; + + uretprobe_regs(&before, &after); + + pp = (unsigned long *) &skel->bss->regs; + cnt = sizeof(before)/sizeof(*pb); + +
[PATCHv7 bpf-next 4/9] selftests/x86: Add return uprobe shadow stack test
Adding return uprobe test for shadow stack and making sure it's working properly. Borrowed some of the code from bpf selftests. Signed-off-by: Jiri Olsa --- .../testing/selftests/x86/test_shadow_stack.c | 145 ++ 1 file changed, 145 insertions(+) diff --git a/tools/testing/selftests/x86/test_shadow_stack.c b/tools/testing/selftests/x86/test_shadow_stack.c index 757e6527f67e..e3501b7e2ecc 100644 --- a/tools/testing/selftests/x86/test_shadow_stack.c +++ b/tools/testing/selftests/x86/test_shadow_stack.c @@ -34,6 +34,7 @@ #include #include #include +#include /* * Define the ABI defines if needed, so people can run the tests @@ -681,6 +682,144 @@ int test_32bit(void) return !segv_triggered; } +static int parse_uint_from_file(const char *file, const char *fmt) +{ + int err, ret; + FILE *f; + + f = fopen(file, "re"); + if (!f) { + err = -errno; + printf("failed to open '%s': %d\n", file, err); + return err; + } + err = fscanf(f, fmt, &ret); + if (err != 1) { + err = err == EOF ? -EIO : -errno; + printf("failed to parse '%s': %d\n", file, err); + fclose(f); + return err; + } + fclose(f); + return ret; +} + +static int determine_uprobe_perf_type(void) +{ + const char *file = "/sys/bus/event_source/devices/uprobe/type"; + + return parse_uint_from_file(file, "%d\n"); +} + +static int determine_uprobe_retprobe_bit(void) +{ + const char *file = "/sys/bus/event_source/devices/uprobe/format/retprobe"; + + return parse_uint_from_file(file, "config:%d\n"); +} + +static ssize_t get_uprobe_offset(const void *addr) +{ + size_t start, end, base; + char buf[256]; + bool found = false; + FILE *f; + + f = fopen("/proc/self/maps", "r"); + if (!f) + return -errno; + + while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) { + if (buf[2] == 'x' && (uintptr_t)addr >= start && (uintptr_t)addr < end) { + found = true; + break; + } + } + + fclose(f); + + if (!found) + return -ESRCH; + + return (uintptr_t)addr - start + base; +} + +static __attribute__((noinline)) void 
uretprobe_trigger(void) +{ + asm volatile (""); +} + +/* + * This test setups return uprobe, which is sensitive to shadow stack + * (crashes without extra fix). After executing the uretprobe we fail + * the test if we receive SIGSEGV, no crash means we're good. + * + * Helper functions above borrowed from bpf selftests. + */ +static int test_uretprobe(void) +{ + const size_t attr_sz = sizeof(struct perf_event_attr); + const char *file = "/proc/self/exe"; + int bit, fd = 0, type, err = 1; + struct perf_event_attr attr; + struct sigaction sa = {}; + ssize_t offset; + + type = determine_uprobe_perf_type(); + if (type < 0) { + if (type == -ENOENT) + printf("[SKIP]\tUretprobe test, uprobes are not available\n"); + return 0; + } + + offset = get_uprobe_offset(uretprobe_trigger); + if (offset < 0) + return 1; + + bit = determine_uprobe_retprobe_bit(); + if (bit < 0) + return 1; + + sa.sa_sigaction = segv_gp_handler; + sa.sa_flags = SA_SIGINFO; + if (sigaction(SIGSEGV, &sa, NULL)) + return 1; + + /* Setup return uprobe through perf event interface. */ + memset(&attr, 0, attr_sz); + attr.size = attr_sz; + attr.type = type; + attr.config = 1 << bit; + attr.config1 = (__u64) (unsigned long) file; + attr.config2 = offset; + + fd = syscall(__NR_perf_event_open, &attr, 0 /* pid */, -1 /* cpu */, +-1 /* group_fd */, PERF_FLAG_FD_CLOEXEC); + if (fd < 0) + goto out; + + if (sigsetjmp(jmp_buffer, 1)) + goto out; + + ARCH_PRCTL(ARCH_SHSTK_ENABLE, ARCH_SHSTK_SHSTK); + + /* +* This either segfaults and goes through sigsetjmp above +* or succeeds and we're good. +*/ + uretprobe_trigger(); + + printf("[OK]\tUretprobe test\n"); + err = 0; + +out: + ARCH_PRCTL(ARCH_SHSTK_DISABLE, ARCH_SHSTK_SHSTK); + signal(SIGSEGV, SIG_DFL); + if (fd) + close(fd); + return err; +} + void segv_handler_ptrace(int signum, siginfo_t *si, void *uc) { /* The SSP adjustment caused a segfault. 
*/ @@ -867,6 +1006,12 @@ int main(int argc, char *argv[]) goto out; } + if (test_uretprobe()) { + ret = 1; + printf("[FAIL]\turetprobe test\n"); + goto out; + } + return ret; out: -- 2.45.1
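The offset calculation in get_uprobe_offset() above (address minus mapping start plus the mapping's file offset) can be tried on a single /proc/self/maps style line. The sketch below is an illustration under assumptions: offset_from_maps_line() is a hypothetical helper, it parses one line passed in as a string instead of reading the file, and the sample line in the usage note is made up.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: given one line in /proc/<pid>/maps format and an
 * address, return the file offset of that address if the line describes
 * an executable mapping containing it, or -1 otherwise.
 */
static long offset_from_maps_line(const char *line, uintptr_t addr)
{
	unsigned long start, end, base;
	char perms[8];

	/* Fields: start-end perms file_offset ... */
	if (sscanf(line, "%lx-%lx %7s %lx", &start, &end, perms, &base) != 4)
		return -1;

	/* perms looks like "r-xp"; index 2 is the execute bit. */
	if (perms[2] != 'x' || addr < start || addr >= end)
		return -1;

	return (long)(addr - start + base);
}
```

For a line such as "00400000-00452000 r-xp 00001000 08:02 173521 /usr/bin/x" and address 0x400123, the helper returns 0x1123, which is the value a uprobe on that file would need as its offset.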
[PATCHv7 bpf-next 3/9] uprobe: Add uretprobe syscall to speed up return probe
Adding uretprobe syscall instead of trap to speed up return probe.

At the moment the uretprobe setup/path is:

  - install entry uprobe

  - when the uprobe is hit, it overwrites the probed function's return
    address on the stack with the address of the trampoline that contains
    a breakpoint instruction

  - the breakpoint trap code handles the uretprobe consumers' execution and
    jumps back to the original return address

This patch replaces the above trampoline's breakpoint instruction with a new
uretprobe syscall call. This syscall does exactly the same job as the trap,
with some extra work:

  - the syscall trampoline must save the original values of the rax/r11/rcx
    registers on the stack, since rax is set to the syscall number and
    r11/rcx are changed and used by the syscall instruction

  - the syscall code reads the original values of those registers and
    restores them in the task's pt_regs area

  - only a call from the trampoline exposed in '[uprobes]' is allowed;
    otherwise the process receives a SIGILL signal

Even with the extra work, using the uretprobe syscall shows a speed
improvement (compared to using the standard breakpoint):

  On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)

    current:
      uretprobe-nop  :    1.498 ± 0.000M/s
      uretprobe-push :    1.448 ± 0.001M/s
      uretprobe-ret  :    0.816 ± 0.001M/s

    with the fix:
      uretprobe-nop  :    1.969 ± 0.002M/s  < 31% speed up
      uretprobe-push :    1.910 ± 0.000M/s  < 31% speed up
      uretprobe-ret  :    0.934 ± 0.000M/s  < 14% speed up

  On AMD (AMD Ryzen 7 5700U)

    current:
      uretprobe-nop  :    0.778 ± 0.001M/s
      uretprobe-push :    0.744 ± 0.001M/s
      uretprobe-ret  :    0.540 ± 0.001M/s

    with the fix:
      uretprobe-nop  :    0.860 ± 0.001M/s  < 10% speed up
      uretprobe-push :    0.818 ± 0.001M/s  < 10% speed up
      uretprobe-ret  :    0.578 ± 0.000M/s  < 7% speed up

The performance test spawns a thread that runs a loop which triggers a uprobe
with an attached bpf program; the program increments the counter that gets
printed in the results above.

The uprobe (and uretprobe) kind is determined by which instruction is being
patched with the breakpoint instruction.
That's also important for uretprobes, because uprobe is installed for each uretprobe. The performance test is part of bpf selftests: tools/testing/selftests/bpf/run_bench_uprobes.sh Note at the moment uretprobe syscall is supported only for native 64-bit process, compat process still uses standard breakpoint. Note that when shadow stack is enabled the uretprobe syscall returns via iret, which is slower than return via sysret, but won't cause the shadow stack violation. Suggested-by: Andrii Nakryiko Reviewed-by: Oleg Nesterov Reviewed-by: Masami Hiramatsu (Google) Acked-by: Andrii Nakryiko Signed-off-by: Oleg Nesterov Signed-off-by: Jiri Olsa --- arch/x86/include/asm/shstk.h | 2 + arch/x86/kernel/shstk.c | 5 ++ arch/x86/kernel/uprobes.c| 117 +++ include/linux/uprobes.h | 3 + kernel/events/uprobes.c | 24 --- 5 files changed, 144 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h index 896909f306e3..4cb77e004615 100644 --- a/arch/x86/include/asm/shstk.h +++ b/arch/x86/include/asm/shstk.h @@ -22,6 +22,7 @@ void shstk_free(struct task_struct *p); int setup_signal_shadow_stack(struct ksignal *ksig); int restore_signal_shadow_stack(void); int shstk_update_last_frame(unsigned long val); +bool shstk_is_enabled(void); #else static inline long shstk_prctl(struct task_struct *task, int option, unsigned long arg2) { return -EINVAL; } @@ -33,6 +34,7 @@ static inline void shstk_free(struct task_struct *p) {} static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; } static inline int restore_signal_shadow_stack(void) { return 0; } static inline int shstk_update_last_frame(unsigned long val) { return 0; } +static inline bool shstk_is_enabled(void) { return false; } #endif /* CONFIG_X86_USER_SHADOW_STACK */ #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c index 9797d4cdb78a..059685612362 100644 --- a/arch/x86/kernel/shstk.c +++ b/arch/x86/kernel/shstk.c @@ -588,3 +588,8 
@@ int shstk_update_last_frame(unsigned long val) ssp = get_user_shstk_addr(); return write_user_shstk_64((u64 __user *)ssp, (u64)val); } + +bool shstk_is_enabled(void) +{ + return features_enabled(ARCH_SHSTK_SHSTK); +} diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c index 6402fb3089d2..5a952c5ea66b 100644 --- a/arch/x86/kernel/uprobes.c +++ b/arch/x86/kernel/uprobes.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include @@ -308,6 +309,122 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, struct insn *insn, bool } #ifdef CONFIG_X86_64 + +asm ( + ".pushsection .rodata\n" + ".global uretprobe_trampoline_entry\n" +
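The "speed up" percentages quoted in the commit message above appear to be the relative throughput gain, (after - before) / before. A small sketch for rechecking them; speedup_percent() is a hypothetical helper and is not part of the benchmark script.

```c
/*
 * Hypothetical helper: relative throughput gain in percent, given the
 * "current" and "with the fix" rates in M/s from the commit message.
 */
static double speedup_percent(double before_mops, double after_mops)
{
	return (after_mops - before_mops) / before_mops * 100.0;
}
```

For the Intel uretprobe-nop row, speedup_percent(1.498, 1.969) gives roughly 31.4, consistent with the quoted 31%; for the AMD uretprobe-ret row, speedup_percent(0.540, 0.578) gives roughly 7.0.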
[PATCHv7 bpf-next 2/9] uprobe: Wire up uretprobe system call
Wiring up uretprobe system call, which comes in following changes.
We need to do the wiring before, because the uretprobe implementation
needs the syscall number.

Note at the moment uretprobe syscall is supported only for native
64-bit process.

Reviewed-by: Oleg Nesterov
Reviewed-by: Masami Hiramatsu (Google)
Acked-by: Andrii Nakryiko
Signed-off-by: Jiri Olsa
---
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/linux/syscalls.h               | 2 ++
 include/uapi/asm-generic/unistd.h      | 5 ++++-
 kernel/sys_ni.c                        | 2 ++
 4 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index cc78226ffc35..47dfea0a827c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
 459	common	lsm_get_self_attr	sys_lsm_get_self_attr
 460	common	lsm_set_self_attr	sys_lsm_set_self_attr
 461	common	lsm_list_modules	sys_lsm_list_modules
+462	64	uretprobe		sys_uretprobe
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e619ac10cd23..5318e0e76799 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -972,6 +972,8 @@ asmlinkage long sys_lsm_list_modules(u64 *ids, u32 *size, u32 flags);
 /* x86 */
 asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
 
+asmlinkage long sys_uretprobe(void);
+
 /* pciconfig: alpha, arm, arm64, ia64, sparc */
 asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
 				unsigned long off, unsigned long len,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..8a747cd1d735 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
 
+#define __NR_uretprobe 462
+__SYSCALL(__NR_uretprobe, sys_uretprobe)
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..be6195e0d078 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+
+COND_SYSCALL(uretprobe);
-- 
2.45.1
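Editorial note on the wiring patch: the sketch below is a user-space toy model, not kernel code. It only illustrates why the patch reserves number 462, bumps the syscall count to 463, and adds a COND_SYSCALL() fallback: a number with no implementation behind it has to resolve to a stub that returns -ENOSYS. Every `demo_` name is hypothetical; only the numbers 462 and 463 come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NR_URETPROBE 462	/* number reserved by the patch */
#define DEMO_NR_SYSCALLS  463	/* one past the highest number */

typedef long (*demo_syscall_fn)(void);

/* Fallback for slots with no implementation, in the spirit of sys_ni_syscall(). */
static long demo_ni_syscall(void)
{
	return -ENOSYS;
}

/* Hypothetical stand-in for the real handler; it just reports success. */
static long demo_sys_uretprobe(void)
{
	return 0;
}

/* Build a table with or without the new entry wired up. */
static void demo_init_table(demo_syscall_fn *table, int wire_uretprobe)
{
	assert(table != NULL);
	for (size_t i = 0; i < DEMO_NR_SYSCALLS; i++)
		table[i] = NULL;
	if (wire_uretprobe)
		table[DEMO_NR_URETPROBE] = demo_sys_uretprobe;
}

/* Look up a handler by number; unwired or out-of-range numbers fail. */
static long demo_dispatch(const demo_syscall_fn *table, unsigned int nr)
{
	if (nr >= DEMO_NR_SYSCALLS || !table[nr])
		return demo_ni_syscall();
	return table[nr]();
}
```

With the entry left unwired, dispatching 462 yields -ENOSYS; once wired, the handler runs. The real kernel builds its table at compile time from syscall_64.tbl rather than at run time as done here.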
[PATCHv7 bpf-next 1/9] x86/shstk: Make return uprobe work with shadow stack
Currently the application with enabled shadow stack will crash
if it sets up return uprobe. The reason is the uretprobe kernel
code changes the user space task's stack, but does not update
shadow stack accordingly.

Adding new functions to update values on shadow stack and using
them in uprobe code to keep shadow stack in sync with uretprobe
changes to user stack.

Reviewed-by: Oleg Nesterov
Fixes: 488af8ea7131 ("x86/shstk: Wire in shadow stack interface")
Signed-off-by: Jiri Olsa
---
 arch/x86/include/asm/shstk.h |  2 ++
 arch/x86/kernel/shstk.c      | 11 +++++++++++
 arch/x86/kernel/uprobes.c    |  7 ++++++-
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
index 42fee8959df7..896909f306e3 100644
--- a/arch/x86/include/asm/shstk.h
+++ b/arch/x86/include/asm/shstk.h
@@ -21,6 +21,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *p, unsigned long clon
 void shstk_free(struct task_struct *p);
 int setup_signal_shadow_stack(struct ksignal *ksig);
 int restore_signal_shadow_stack(void);
+int shstk_update_last_frame(unsigned long val);
 #else
 static inline long shstk_prctl(struct task_struct *task, int option,
 			       unsigned long arg2) { return -EINVAL; }
@@ -31,6 +32,7 @@ static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
 static inline void shstk_free(struct task_struct *p) {}
 static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
 static inline int restore_signal_shadow_stack(void) { return 0; }
+static inline int shstk_update_last_frame(unsigned long val) { return 0; }
 #endif /* CONFIG_X86_USER_SHADOW_STACK */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
index 6f1e9883f074..9797d4cdb78a 100644
--- a/arch/x86/kernel/shstk.c
+++ b/arch/x86/kernel/shstk.c
@@ -577,3 +577,14 @@ long shstk_prctl(struct task_struct *task, int option, unsigned long arg2)
 		return wrss_control(true);
 	return -EINVAL;
 }
+
+int shstk_update_last_frame(unsigned long val)
+{
+	unsigned long ssp;
+
+	if (!features_enabled(ARCH_SHSTK_SHSTK))
+		return 0;
+
+	ssp = get_user_shstk_addr();
+	return write_user_shstk_64((u64 __user *)ssp, (u64)val);
+}
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..6402fb3089d2 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -1076,8 +1076,13 @@ arch_uretprobe_hijack_return_addr(unsigned long trampoline_vaddr, struct pt_regs
 		return orig_ret_vaddr;
 
 	nleft = copy_to_user((void __user *)regs->sp, &trampoline_vaddr, rasize);
-	if (likely(!nleft))
+	if (likely(!nleft)) {
+		if (shstk_update_last_frame(trampoline_vaddr)) {
+			force_sig(SIGSEGV);
+			return -1;
+		}
 		return orig_ret_vaddr;
+	}
 
 	if (nleft != rasize) {
 		pr_err("return address clobbered: pid=%d, %%sp=%#lx, %%ip=%#lx\n",
-- 
2.45.1
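Editorial note on the shadow-stack fix: the following is a toy model, offered as a sketch under stated assumptions rather than a description of the hardware. It reduces the user stack and the shadow stack to one return slot each and shows why rewriting only the former makes the next return fail. All `demo_` names and the `sync_shstk` switch are hypothetical; only the idea of updating the last shadow-stack frame comes from the patch.

```c
#include <assert.h>

/* Toy model: one return slot on the normal stack, one on the shadow stack. */
struct demo_task {
	unsigned long stack_ret;	/* return address on the user stack */
	unsigned long shstk_ret;	/* copy kept on the shadow stack */
	int shstk_enabled;
};

/* Same shape as shstk_update_last_frame(): a no-op when the feature is off. */
static int demo_shstk_update_last_frame(struct demo_task *t, unsigned long val)
{
	assert(t);
	if (!t->shstk_enabled)
		return 0;
	t->shstk_ret = val;
	return 0;
}

/* Replace the return address with the trampoline; optionally sync the shadow copy. */
static unsigned long demo_hijack_return_addr(struct demo_task *t,
					     unsigned long trampoline,
					     int sync_shstk)
{
	unsigned long orig = t->stack_ret;

	t->stack_ret = trampoline;
	if (sync_shstk)
		demo_shstk_update_last_frame(t, trampoline);
	return orig;
}

/* Model of the check done on return: both copies must agree when enabled. */
static int demo_return_allowed(const struct demo_task *t)
{
	return !t->shstk_enabled || t->stack_ret == t->shstk_ret;
}
```

Calling `demo_hijack_return_addr()` with `sync_shstk` set to 0 corresponds to the behaviour before the patch (the return is rejected), and with 1 to the behaviour after it.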
[PATCHv7 bpf-next 0/9] uprobe: uretprobe speed up
hi,
as part of the effort on speeding up the uprobes [0] coming with
return uprobe optimization by using syscall instead of the trap
on the uretprobe trampoline.

The speed up depends on instruction type that uprobe is installed
and depends on specific HW type, please check patch 1 for details.

Patches 1-8 are based on bpf-next/master, but patch 2 and 3 are
apply-able on linux-trace.git tree probes/for-next branch.
Patch 9 is based on man-pages master.

v7 changes:
  - fixes in man page [Alejandro Colomar]
  - fixed patch #1 fixes tag [Oleg]

Also available at:
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  uretprobe_syscall

thanks,
jirka


Notes to check list items in Documentation/process/adding-syscalls.rst:

- System Call Alternatives
    New syscall seems like the best way in here, because we need
    just to quickly enter kernel with no extra arguments processing,
    which we'd need to do if we decided to use another syscall.

- Designing the API: Planning for Extension
    The uretprobe syscall is very specific and most likely won't be
    extended in the future.

    At the moment it does not take any arguments and even if it does
    in future, it's allowed to be called only from trampoline prepared
    by kernel, so there'll be no broken user.

- Designing the API: Other Considerations
    N/A because uretprobe syscall does not return reference to kernel
    object.

- Proposing the API
    Wiring up of the uretprobe system call is in separate change,
    selftests and man page changes are part of the patchset.

- Generic System Call Implementation
    There's no CONFIG option for the new functionality because it
    keeps the same behaviour from the user POV.

- x86 System Call Implementation
    It's 64-bit syscall only.

- Compatibility System Calls (Generic)
    N/A uretprobe syscall has no arguments and is not supported
    for compat processes.

- Compatibility System Calls (x86)
    N/A uretprobe syscall is not supported for compat processes.

- System Calls Returning Elsewhere
    N/A.

- Other Details
    N/A.

- Testing
    Adding new bpf selftests and ran ltp on top of this change.

- Man Page
    Attached.

- Do not call System Calls in the Kernel
    N/A.


[0] https://lore.kernel.org/bpf/ZeCXHKJ--iYYbmLj@krava/
---
Jiri Olsa (8):
      x86/shstk: Make return uprobe work with shadow stack
      uprobe: Wire up uretprobe system call
      uprobe: Add uretprobe syscall to speed up return probe
      selftests/x86: Add return uprobe shadow stack test
      selftests/bpf: Add uretprobe syscall test for regs integrity
      selftests/bpf: Add uretprobe syscall test for regs changes
      selftests/bpf: Add uretprobe syscall call from user space test
      selftests/bpf: Add uretprobe shadow stack test

 arch/x86/entry/syscalls/syscall_64.tbl                      | 1 +
 arch/x86/include/asm/shstk.h                                | 4 +
 arch/x86/kernel/shstk.c                                     | 16
 arch/x86/kernel/uprobes.c                                   | 124 -
 include/linux/syscalls.h                                    | 2 +
 include/linux/uprobes.h                                     | 3 +
 include/uapi/asm-generic/unistd.h                           | 5 +-
 kernel/events/uprobes.c                                     | 24 --
 kernel/sys_ni.c                                             | 2 +
 tools/include/linux/compiler.h                              | 4 +
 tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c       | 123 -
 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c     | 385 +++
 tools/testing/selftests/bpf/progs/uprobe_syscall.c          | 15
 tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c | 17
 tools/testing/selftests/x86/test_shadow_stack.c             | 145 ++
 15 files changed, 860 insertions(+), 10 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c

Jiri Olsa (1):
      man2: Add uretprobe syscall page

 man/man2/uretprobe.2 | 56
 1 file changed, 56 insertions(+)
 create mode 100644 man/man2/uretprobe.2
[PATCH] riscv: Fix early ftrace nop patching
Commit c97bf629963e ("riscv: Fix text patching when IPI are used") converted
ftrace_make_nop() to use patch_insn_write() which does not emit any icache
flush relying entirely on __ftrace_modify_code() to do that.

But we missed that ftrace_make_nop() was called very early directly when
converting mcount calls into nops (actually on riscv it converts 2B nops
emitted by the compiler into 4B nops).

This caused crashes on multiple HW as reported by Conor and Björn since the
booting core could have half-patched instructions in its icache which would
trigger an illegal instruction trap: fix this by emitting a local flush
icache when early patching nops.

Fixes: c97bf629963e ("riscv: Fix text patching when IPI are used")
Signed-off-by: Alexandre Ghiti
---
 arch/riscv/include/asm/cacheflush.h | 6 ++++++
 arch/riscv/kernel/ftrace.c          | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
index dd8d07146116..ce79c558a4c8 100644
--- a/arch/riscv/include/asm/cacheflush.h
+++ b/arch/riscv/include/asm/cacheflush.h
@@ -13,6 +13,12 @@ static inline void local_flush_icache_all(void)
 	asm volatile ("fence.i" ::: "memory");
 }
 
+static inline void local_flush_icache_range(unsigned long start,
+					    unsigned long end)
+{
+	local_flush_icache_all();
+}
+
 #define PG_dcache_clean PG_arch_1
 
 static inline void flush_dcache_folio(struct folio *folio)
diff --git a/arch/riscv/kernel/ftrace.c b/arch/riscv/kernel/ftrace.c
index 4f4987a6d83d..32e7c401dfb4 100644
--- a/arch/riscv/kernel/ftrace.c
+++ b/arch/riscv/kernel/ftrace.c
@@ -120,6 +120,9 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec)
 	out = ftrace_make_nop(mod, rec, MCOUNT_ADDR);
 	mutex_unlock(&text_mutex);
 
+	if (!mod)
+		local_flush_icache_range(rec->ip, rec->ip + MCOUNT_INSN_SIZE);
+
 	return out;
 }
 
-- 
2.39.2
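Editorial note on the early nop patching: as a rough user-space analogy (not the kernel implementation), the sketch below overwrites two 2-byte compressed nops with one 4-byte nop and then asks the compiler to flush the instruction cache for just that range. It assumes a GCC- or Clang-compatible compiler that provides `__builtin___clear_cache`; the `demo_` helpers and `DEMO_` constants are hypothetical, and the buffer is ordinary data, so the flush is illustrative only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_C_NOP     0x0001u		/* RISC-V compressed nop, 2 bytes */
#define DEMO_NOP4      0x00000013u	/* RISC-V "addi x0, x0, 0", 4 bytes */
#define DEMO_INSN_SIZE 4

/* Fill a buffer with 2-byte nops, as the compiler would emit them. */
static void demo_fill_cnops(uint8_t *buf, size_t len)
{
	uint16_t cnop = DEMO_C_NOP;

	for (size_t i = 0; i + sizeof(cnop) <= len; i += sizeof(cnop))
		memcpy(buf + i, &cnop, sizeof(cnop));
}

/* Overwrite two 2-byte nops at buf + off with one 4-byte nop, then flush. */
static void demo_patch_nop(uint8_t *buf, size_t off)
{
	uint32_t insn = DEMO_NOP4;

	assert(buf);
	memcpy(buf + off, &insn, sizeof(insn));
	/* Local flush of the patched range only, so no stale half-patched bytes remain. */
	__builtin___clear_cache((char *)(buf + off),
				(char *)(buf + off + DEMO_INSN_SIZE));
}
```

The point mirrored from the patch is the ordering: the flush happens immediately after the write, on the core that did the patching, rather than being left to a later global step.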
Re: [RFC PATCH 0/5] vsock/virtio: Add support for multi-devices
Hi, thanks for this RFC! On Fri, May 17, 2024 at 10:46:02PM GMT, Xuewei Niu wrote: # Motivition Vsock is a lightweight and widely used data exchange mechanism between host and guest. Kata Containers, a secure container runtime, leverages the capability to exchange control data between the shim and the kata-agent. The Linux kernel only supports one vsock device for virtio-vsock transport, resulting in the following limitations: * Poor performance isolation: All vsock connections share the same virtqueue. This might be fixed if we implement multi-queue in virtio-vsock. * Cannot enable more than one backend: Virtio-vsock, vhost-vsock, and vhost-user-vsock cannot be enabled simultaneously on the transport. We’d like to transfer networking data, such as TSI (Transparent Socket Impersonation), over vsock via the vhost-user protocol to reduce overhead. However, by default, the vsock device is occupied by the kata-agent. # Usages Principle: **Supporting virtio-vsock multi-devices while also being compatible with existing ones.** ## Connection from Guest to Host There are two valuable questions to take about: 1. How to be compatible with the existing usages? 2. How do we specify a virtio-vsock device? ### Question 1 Before we delve into question 1, I'd like to provide a piece of pseudocode as an example of one of the existing use cases from the guest's perspective. Assuming there is one virtio-vsock device with CID 4. One of existing usages to connect to host is shown as following. ``` fd = socket(AF_VSOCK); connect(fd, 2, 1234); n = write(fd, buffer); ``` The result is that a connection is established from the guest (4, ?) to the host (2, 1234), where "?" denotes a random port. In the context of multi-devices, there are more than two devices. If the users don’t specify one CID explicitly, the kernel becomes confused about which device to use. The new implementation should be compatible with the old one. We expanded the virtio-vsock specification to address this issue. 
The specification now includes a new field called "order". ``` struct virtio_vsock_config { __le64 guest_cid; __le64 order; } _attribute_((packed)); ``` In the phase of virtio-vsock driver probing, the guest kernel reads from VMM to get the order of each device. **We stipulate that the device with the smallest order is regarded as the default device**(this mechanism functions as a 'default gateway' in networking). Assuming there are three virtio-vsock devices: device1 (CID=3), device2 (CID=4), and device3 (CID=5). The arrangement of the list is as follows from the perspective of the guest kernel: ``` virtio_vsock_list = virtio_vsock { cid: 4, order: 0 } -> virtio_vsock { cid: 3, order: 1 } -> virtio_vsock { cid: 5, order: 10 } ``` At this time, the guest kernel realizes that the device2 (CID=4) is the default device. Execute the same code as before. ``` fd = socket(AF_VSOCK); connect(fd, 2, 1234); n = write(fd, buffer); ``` A connection will be established from the guest (4, ?) to the host (2, 1234). It seems that only the one with order 0 is used here though, so what is the ordering for? Wouldn't it suffice to simply indicate the default device (e.g., like the default gateway for networking)? ### Question 2 Now, the user wants to specify a device instead of the default one. An explicit binding operation is required to be performed. Use the device (CID=3), where “-1” represents any port, the kernel will We have a macro: VMADDR_PORT_ANY (which is -1) search an available port automatically. ``` fd = socket(AF_VSOCK); bind(fd, 3, -1); connect(fd, 2, 1234);) n = write(fd, buffer); ``` Use the device (CID=4). ``` fd = socket(AF_VSOCK); bind(fd, 4, -1); connect(fd, 2, 1234); n = write(fd, buffer); ``` ## Connection from Host to Guest Connection from host to guest is quite similar to the existing usages. The device’s CID is specified by the bind operation. Listen at the device (CID=3)’s port 1. 
``` fd = socket(AF_VSOCK); bind(fd, 3, 1); listen(fd); new_fd = accept(fd, &host_cid, &host_port); n = write(fd, buffer); ``` Listen at the device (CID=4)’s port 1. ``` fd = socket(AF_VSOCK); bind(fd, 4, 1); listen(fd); new_fd = accept(fd, &host_cid, &host_port); n = write(fd, buffer); ``` # Use Cases We've completed a POC with Kata Containers, Ztunnel, which is a purpose-built per-node proxy for Istio ambient mesh, and TSI. Please refer to the following link for more details. Link: https://bit.ly/4bdPJbU Thank you for this RFC, I left several comments in the patches, we still have some work to do, but I think it is something we can support :-) Here I summarize the things that I think we need to fix: 1. Avoid adding transport-specific things in af_vsock.c We need to have a generic API to allow other transports to implement the same functionality. 2. We need to add negotiation of a new feature in virtio/vhost transports We need to enable or disable support depending on whether t
Re: [RFC PATCH 5/5] vsock: Add an ioctl request to get all CIDs
On Fri, May 17, 2024 at 10:46:07PM GMT, Xuewei Niu wrote: The new request is called `IOCTL_VM_SOCKETS_GET_LOCAL_CIDS`. And the old one, `IOCTL_VM_SOCKETS_GET_LOCAL_CID` is retained. For the transport that supports multi-devices: * `IOCTL_VM_SOCKETS_GET_LOCAL_CID` returns "-1"; What about returning the default CID (lower prio)? * `IOCTL_VM_SOCKETS_GET_LOCAL_CIDS` returns a vector of CIDS. The usage is shown as following. ``` struct vsock_local_cids local_cids; if ((ret = ioctl(fd, IOCTL_VM_SOCKETS_GET_LOCAL_CIDS, &local_cids))) { perror("failed to get cids"); exit(1); } for (i = 0; i --- include/net/af_vsock.h | 7 +++ include/uapi/linux/vm_sockets.h | 8 net/vmw_vsock/af_vsock.c| 19 +++ 3 files changed, 34 insertions(+) diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h index 25f7dc3d602d..2febc816e388 100644 --- a/include/net/af_vsock.h +++ b/include/net/af_vsock.h @@ -264,4 +264,11 @@ static inline bool vsock_msgzerocopy_allow(const struct vsock_transport *t) { return t->msgzerocopy_allow && t->msgzerocopy_allow(); } + +/ IOCTL / +/* Type of return value of IOCTL_VM_SOCKETS_GET_LOCAL_CIDS. */ +struct vsock_local_cids { + int nr; + unsigned int data[MAX_VSOCK_NUM]; +}; #endif /* __AF_VSOCK_H__ */ diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h index 36ca5023293a..01f73fb7af5a 100644 --- a/include/uapi/linux/vm_sockets.h +++ b/include/uapi/linux/vm_sockets.h @@ -195,8 +195,16 @@ struct sockaddr_vm { #define MAX_VSOCK_NUM 16 Okay, now I see why you need this in the UAPI, but pleace try to follow other defines. What about VM_SOCKETS_MAX_DEVS ? +/* Return actual context id if the transport not support vsock + * multi-devices. Otherwise, return `-1U`. + */ + #define IOCTL_VM_SOCKETS_GET_LOCAL_CID _IO(7, 0xb9) +/* Only available in transports that support multiple devices. 
*/ + +#define IOCTL_VM_SOCKETS_GET_LOCAL_CIDS _IOR(7, 0xba, struct vsock_local_cids) + /* MSG_ZEROCOPY notifications are encoded in the standard error format, * sock_extended_err. See Documentation/networking/msg_zerocopy.rst in * kernel source tree for more details. diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 3b34be802bf2..2ea2ff52f15b 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -2454,6 +2454,7 @@ static long vsock_dev_do_ioctl(struct file *filp, u32 __user *p = ptr; u32 cid = VMADDR_CID_ANY; int retval = 0; + struct vsock_local_cids local_cids; switch (cmd) { case IOCTL_VM_SOCKETS_GET_LOCAL_CID: @@ -2469,6 +2470,24 @@ static long vsock_dev_do_ioctl(struct file *filp, retval = -EFAULT; break; + case IOCTL_VM_SOCKETS_GET_LOCAL_CIDS: + if (!transport_g2h || !transport_g2h->get_local_cids) + goto fault; + + rcu_read_lock(); + local_cids.nr = transport_g2h->get_local_cids(local_cids.data); + rcu_read_unlock(); + + if (local_cids.nr < 0 || + copy_to_user(p, &local_cids, sizeof(local_cids))) + goto fault; + + break; + +fault: + retval = -EFAULT; + break; + default: retval = -ENOIOCTLCMD; } -- 2.34.1
Re: [RFC PATCH 4/5] vsock: seqpacket_allow adapts to multi-devices
On Fri, May 17, 2024 at 10:46:06PM GMT, Xuewei Niu wrote: Adds a new argument, named "src_cid", to let them know which `virtio_vsock` to be selected. Signed-off-by: Xuewei Niu --- include/net/af_vsock.h | 2 +- net/vmw_vsock/af_vsock.c | 15 +-- net/vmw_vsock/virtio_transport.c | 4 ++-- net/vmw_vsock/vsock_loopback.c | 4 ++-- 4 files changed, 18 insertions(+), 7 deletions(-) Same for this. diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h index 0151296a0bc5..25f7dc3d602d 100644 --- a/include/net/af_vsock.h +++ b/include/net/af_vsock.h @@ -143,7 +143,7 @@ struct vsock_transport { int flags); int (*seqpacket_enqueue)(struct vsock_sock *vsk, struct msghdr *msg, size_t len); - bool (*seqpacket_allow)(u32 remote_cid); + bool (*seqpacket_allow)(u32 src_cid, u32 remote_cid); u32 (*seqpacket_has_data)(struct vsock_sock *vsk); /* Notification. */ diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index da06ddc940cd..3b34be802bf2 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -470,10 +470,12 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk) { const struct vsock_transport *new_transport; struct sock *sk = sk_vsock(vsk); - unsigned int remote_cid = vsk->remote_addr.svm_cid; + unsigned int src_cid, remote_cid; __u8 remote_flags; int ret; + remote_cid = vsk->remote_addr.svm_cid; + /* If the packet is coming with the source and destination CIDs higher * than VMADDR_CID_HOST, then a vsock channel where all the packets are * forwarded to the host should be established. 
Then the host will @@ -527,8 +529,17 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk) return -ENODEV; if (sk->sk_type == SOCK_SEQPACKET) { + if (vsk->local_addr.svm_cid == VMADDR_CID_ANY) { + if (new_transport->get_default_cid) + src_cid = new_transport->get_default_cid(); + else + src_cid = new_transport->get_local_cid(); + } else { + src_cid = vsk->local_addr.svm_cid; + } + if (!new_transport->seqpacket_allow || - !new_transport->seqpacket_allow(remote_cid)) { + !new_transport->seqpacket_allow(src_cid, remote_cid)) { module_put(new_transport->module); return -ESOCKTNOSUPPORT; } diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c index 998b22e5ce36..0bddcbd906a2 100644 --- a/net/vmw_vsock/virtio_transport.c +++ b/net/vmw_vsock/virtio_transport.c @@ -615,14 +615,14 @@ static struct virtio_transport virtio_transport = { .can_msgzerocopy = virtio_transport_can_msgzerocopy, }; -static bool virtio_transport_seqpacket_allow(u32 remote_cid) +static bool virtio_transport_seqpacket_allow(u32 src_cid, u32 remote_cid) { struct virtio_vsock *vsock; bool seqpacket_allow; seqpacket_allow = false; rcu_read_lock(); - vsock = rcu_dereference(the_virtio_vsock); + vsock = virtio_transport_get_virtio_vsock(src_cid); if (vsock) seqpacket_allow = vsock->seqpacket_allow; rcu_read_unlock(); diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c index 6dea6119f5b2..b94358f5bb2c 100644 --- a/net/vmw_vsock/vsock_loopback.c +++ b/net/vmw_vsock/vsock_loopback.c @@ -46,7 +46,7 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk) return 0; } -static bool vsock_loopback_seqpacket_allow(u32 remote_cid); +static bool vsock_loopback_seqpacket_allow(u32 src_cid, u32 remote_cid); static bool vsock_loopback_msgzerocopy_allow(void) { return true; @@ -104,7 +104,7 @@ static struct virtio_transport loopback_transport = { .send_pkt = vsock_loopback_send_pkt, }; -static bool vsock_loopback_seqpacket_allow(u32 
remote_cid) +static bool vsock_loopback_seqpacket_allow(u32 src_cid, u32 remote_cid) { return true; } -- 2.34.1
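Editorial note on the source-CID selection in this patch: the hunk in vsock_assign_transport() picks the CID passed to seqpacket_allow() in three steps. A minimal stand-alone sketch of that decision follows; the `demo_` types and helpers are hypothetical, and `DEMO_VMADDR_CID_ANY` simply mirrors the value of VMADDR_CID_ANY (-1U).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_VMADDR_CID_ANY UINT32_MAX	/* same value as VMADDR_CID_ANY (-1U) */

typedef uint32_t (*demo_cid_fn)(void);

/* Hypothetical cut-down transport with only the two CID callbacks. */
struct demo_transport {
	demo_cid_fn get_default_cid;	/* optional: multi-device transports */
	demo_cid_fn get_local_cid;	/* always present */
};

/* Example callbacks used to exercise the selection logic. */
static uint32_t demo_default_is_4(void) { return 4; }
static uint32_t demo_local_is_7(void) { return 7; }

/*
 * An explicitly bound CID wins; otherwise use the transport's default
 * CID if it has one, and fall back to the single local CID.
 */
static uint32_t demo_pick_src_cid(const struct demo_transport *t,
				  uint32_t bound_cid)
{
	assert(t && t->get_local_cid);
	if (bound_cid != DEMO_VMADDR_CID_ANY)
		return bound_cid;
	if (t->get_default_cid)
		return t->get_default_cid();
	return t->get_local_cid();
}
```

A socket bound to CID 3 keeps 3; an unbound socket on a multi-device transport gets the default CID; an unbound socket on a single-device transport gets the local CID.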
Re: [RFC PATCH 3/5] vsock/virtio: can_msgzerocopy adapts to multi-devices
On Fri, May 17, 2024 at 10:46:05PM GMT, Xuewei Niu wrote: Adds a new argument, named "cid", to let them know which `virtio_vsock` to be selected. Signed-off-by: Xuewei Niu --- include/linux/virtio_vsock.h| 2 +- net/vmw_vsock/virtio_transport.c| 5 ++--- net/vmw_vsock/virtio_transport_common.c | 6 +++--- 3 files changed, 6 insertions(+), 7 deletions(-) Every commit in linux must be working to support bisection. So these changes should be made before adding multi-device support. diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h index c82089dee0c8..21bfd5e0c2e7 100644 --- a/include/linux/virtio_vsock.h +++ b/include/linux/virtio_vsock.h @@ -168,7 +168,7 @@ struct virtio_transport { * extra checks and can perform zerocopy transmission by * default. */ - bool (*can_msgzerocopy)(int bufs_num); + bool (*can_msgzerocopy)(u32 cid, int bufs_num); }; ssize_t diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c index 93d25aeafb83..998b22e5ce36 100644 --- a/net/vmw_vsock/virtio_transport.c +++ b/net/vmw_vsock/virtio_transport.c @@ -521,14 +521,13 @@ static void virtio_vsock_rx_done(struct virtqueue *vq) queue_work(virtio_vsock_workqueue, &vsock->rx_work); } -static bool virtio_transport_can_msgzerocopy(int bufs_num) +static bool virtio_transport_can_msgzerocopy(u32 cid, int bufs_num) { struct virtio_vsock *vsock; bool res = false; rcu_read_lock(); - - vsock = rcu_dereference(the_virtio_vsock); + vsock = virtio_transport_get_virtio_vsock(cid); if (vsock) { struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX]; diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c index bed75a41419e..e7315d7b9af1 100644 --- a/net/vmw_vsock/virtio_transport_common.c +++ b/net/vmw_vsock/virtio_transport_common.c @@ -39,7 +39,7 @@ virtio_transport_get_ops(struct vsock_sock *vsk) static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops, struct virtio_vsock_pkt_info *info, - size_t pkt_len) + size_t 
pkt_len, unsigned int cid) { struct iov_iter *iov_iter; @@ -62,7 +62,7 @@ static bool virtio_transport_can_zcopy(const struct virtio_transport *t_ops, int pages_to_send = iov_iter_npages(iov_iter, MAX_SKB_FRAGS); /* +1 is for packet header. */ - return t_ops->can_msgzerocopy(pages_to_send + 1); + return t_ops->can_msgzerocopy(cid, pages_to_send + 1); } return true; @@ -375,7 +375,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk, info->msg->msg_flags &= ~MSG_ZEROCOPY; if (info->msg->msg_flags & MSG_ZEROCOPY) - can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len); + can_zcopy = virtio_transport_can_zcopy(t_ops, info, pkt_len, src_cid); if (can_zcopy) max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, -- 2.34.1
Re: [RFC PATCH 2/5] vsock/virtio: Add support for multi-devices
On Fri, May 17, 2024 at 10:46:04PM GMT, Xuewei Niu wrote: The maximum number of devices is limited by `MAX_VSOCK_NUM`. Extends `vsock_transport` struct with 4 methods to support multi-devices: * `get_virtio_vsock()`: It receives a CID, and returns a struct of virtio vsock. This method is designed to select a vsock device by its CID. * `get_default_cid()`: It receives nothing, returns the default CID of the first vsock device registered to the kernel. * `get_local_cids()`: It returns a vector of vsock devices' CIDs. * `compare_order()`: It receives two different CIDs, named "left" and "right" respectively. It returns "-1" while the "left" is behind the "right". Otherwise, return "1". `get_local_cid()` is retained, but returns "-1" if the transport supports multi-devices. Replaces the single instance of `virtio_vsock` with a list, named `virtio_vsock_list`. The devices are inserted into the list when probing. The kernel will deny devices from being registered if there are conflicts existing in CIDs or orders. Signed-off-by: Xuewei Niu --- include/net/af_vsock.h | 16 ++ include/uapi/linux/vm_sockets.h | 6 + net/vmw_vsock/af_vsock.c| 82 ++-- net/vmw_vsock/virtio_transport.c| 246 ++-- net/vmw_vsock/virtio_transport_common.c | 10 +- 5 files changed, 293 insertions(+), 67 deletions(-) diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h index 535701efc1e5..0151296a0bc5 100644 --- a/include/net/af_vsock.h +++ b/include/net/af_vsock.h @@ -174,6 +174,22 @@ struct vsock_transport { /* Addressing. */ u32 (*get_local_cid)(void); + /* Held rcu read lock by the caller. */ We should also explain why the rcu is needed. + struct virtio_vsock *(*get_virtio_vsock)(unsigned int cid); af_vsock supports several transports (i.e. HyperV, VMCI, VIRTIO/VHOST, loobpack), so we need to be generic here. In addition, the pointer returned by this function is never used, so why we need this? 
+ unsigned int (*get_default_cid)(void); + /* Get an list containing all the CIDs of registered vsock. Return +* the length of the list. +* +* Held rcu read lock by the caller. +*/ + int (*get_local_cids)(unsigned int *local_cids); Why int? get_local_cid() returns an u32, we should do the same. In addition, can we remove get_local_cid() and implement get_local_cids() for all the transports? + /* Compare the order of two devices. Given the guest CIDs of two +* different devices, returns -1 while the left is behind the right. +* Otherwise, return 1. +* +* Held rcu read lock by the caller. +*/ + int (*compare_order)(unsigned int left, unsigned int right); Please check better the type for CIDs all over the place. /* Read a single skb */ int (*read_skb)(struct vsock_sock *, skb_read_actor_t); diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h index ed07181d4eff..36ca5023293a 100644 --- a/include/uapi/linux/vm_sockets.h +++ b/include/uapi/linux/vm_sockets.h @@ -189,6 +189,12 @@ struct sockaddr_vm { sizeof(__u8)]; }; +/* The maximum number of vsock devices. Each vsock device has an exclusive + * context id. + */ + +#define MAX_VSOCK_NUM 16 This is used internally in AF_VSOCK, I don't think we should expose it in the UAPI. + #define IOCTL_VM_SOCKETS_GET_LOCAL_CID _IO(7, 0xb9) /* MSG_ZEROCOPY notifications are encoded in the standard error format, diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c index 54ba7316f808..da06ddc940cd 100644 --- a/net/vmw_vsock/af_vsock.c +++ b/net/vmw_vsock/af_vsock.c @@ -234,19 +234,45 @@ static void __vsock_remove_connected(struct vsock_sock *vsk) static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr) { - struct vsock_sock *vsk; + struct vsock_sock *vsk, *any_vsk = NULL; + rcu_read_lock(); Why the rcu is needed? list_for_each_entry(vsk, vsock_bound_sockets(addr), bound_table) { + /* The highest priority: full match. 
*/ if (vsock_addr_equals_addr(addr, &vsk->local_addr)) - return sk_vsock(vsk); + goto out; - if (addr->svm_port == vsk->local_addr.svm_port && - (vsk->local_addr.svm_cid == VMADDR_CID_ANY || -addr->svm_cid == VMADDR_CID_ANY)) - return sk_vsock(vsk); + /* Port match */ + if (addr->svm_port == vsk->local_addr.svm_port) { + /* The second priority: local cid is VMADDR_CID_ANY. */ + if (vsk->local_addr.svm_cid == VMADDR_CID_ANY) + goto out; + + /* The third priority: local cid isn't VMADDR_CID_ANY. */ + if (addr->svm_cid == VMADDR_CI
Re: [RFC PATCH 1/5] vsock/virtio: Extend virtio-vsock spec with an "order" field
As Alyssa suggested, we should discuss spec changes in the virtio ML.
BTW as long as this is an RFC, it's fine. Just be sure, though, to
remember to merge the change in the specification first versus the
patches in Linux. So I recommend that you don't send a non-RFC set into
Linux until you have agreed on the changes to the specification.

On Fri, May 17, 2024 at 10:46:03PM GMT, Xuewei Niu wrote:
> The "order" field determines the location of the device in the linked list,
> the device with CID 4, having a smallest order, is in the first place, and
> so forth.

Do we really need an order, or would it suffice to just indicate the
device to be used by default? (as the default gateway in networking)

> Rules:
>
> * It doesn’t have to be continuous;
> * It cannot exist conflicts;
> * It is optional for the mode of a single device, but is required for the mode
>   of multiple devices.

We should also add a feature to support this new field.

> Signed-off-by: Xuewei Niu
> ---
>  include/uapi/linux/virtio_vsock.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> index 64738838bee5..b62ec7d2ab1e 100644
> --- a/include/uapi/linux/virtio_vsock.h
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -43,6 +43,7 @@
>
>  struct virtio_vsock_config {
>  	__le64 guest_cid;
> +	__le64 order;

Do we really need 64 bits for the order?

>  } __attribute__((packed));
>
>  enum virtio_vsock_event_id {
> --
> 2.34.1
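Editorial note on the proposed "order" rules: as a sketch of how the rules quoted in this message could be checked, the code below rejects a device set with conflicting CIDs or orders and otherwise returns the device with the smallest order as the default. It is not the RFC's implementation; `struct demo_vsock_dev` and both helpers are hypothetical, and the three-device example in the test reuses the CIDs and orders given in the RFC cover letter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device record: the guest CID plus the proposed order. */
struct demo_vsock_dev {
	uint64_t cid;
	uint64_t order;
};

/* Return 1 if any two devices share a CID or an order value. */
static int demo_has_conflict(const struct demo_vsock_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		for (size_t j = i + 1; j < n; j++)
			if (devs[i].cid == devs[j].cid ||
			    devs[i].order == devs[j].order)
				return 1;
	return 0;
}

/*
 * Return the default device (smallest order), or NULL when the set is
 * empty or violates the no-conflict rule. Orders need not be contiguous.
 */
static const struct demo_vsock_dev *
demo_default_dev(const struct demo_vsock_dev *devs, size_t n)
{
	const struct demo_vsock_dev *best = NULL;

	assert(devs || n == 0);
	if (n == 0 || demo_has_conflict(devs, n))
		return NULL;
	for (size_t i = 0; i < n; i++)
		if (!best || devs[i].order < best->order)
			best = &devs[i];
	return best;
}
```

Only the minimum is ever consulted here, which is the reviewer's point: a single "default device" indication would carry the same information for this use.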
Re: [PATCH] x86/paravirt: Disable virt spinlock when CONFIG_PARAVIRT_SPINLOCKS disabled
On 16.05.24 15:02, Chen Yu wrote:
> Performance drop is reported when running encode/decode workload and
> BenchSEE cache sub-workload.
> Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
> native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
> is disabled the virt_spin_lock_key is set to true on bare-metal.
> The qspinlock degenerates to test-and-set spinlock, which decrease the
> performance on bare-metal.
>
> Fix this by disabling virt_spin_lock_key if CONFIG_PARAVIRT_SPINLOCKS
> is not set, or it is on bare-metal.
>
> Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning")
> Suggested-by: Qiuxu Zhuo
> Reported-by: Prem Nath Dey
> Reported-by: Xiaoping Zhou
> Signed-off-by: Chen Yu

Reviewed-by: Juergen Gross


Juergen

> ---
>  arch/x86/kernel/paravirt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index 5358d43886ad..ee51c0949ed8 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -55,7 +55,7 @@ DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
>
>  void __init native_pv_lock_init(void)
>  {
> -	if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
> +	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) ||
>  	    !boot_cpu_has(X86_FEATURE_HYPERVISOR))
>  		static_branch_disable(&virt_spin_lock_key);
>  }
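Editorial note on the one-line condition change: the two conditions in the diff can be compared as plain boolean functions. This is only a truth-table sketch; the `demo_` helpers are hypothetical and take the config option and the hypervisor CPU flag as ordinary arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition before the fix: the key is disabled only with PV spinlocks built in. */
static bool demo_old_disables_key(bool pv_spinlocks, bool hypervisor)
{
	return pv_spinlocks && !hypervisor;
}

/* Condition after the fix: also disable when PV spinlocks are not built in. */
static bool demo_new_disables_key(bool pv_spinlocks, bool hypervisor)
{
	return !pv_spinlocks || !hypervisor;
}

/* The key is defined true by default; it stays on unless init disables it. */
static bool demo_key_enabled(bool disabled_at_init)
{
	return !disabled_at_init;
}
```

The regressing case is bare metal with CONFIG_PARAVIRT_SPINLOCKS off: the old condition leaves the key enabled (test-and-set fallback), the new one disables it. With the option on, both conditions agree on bare metal and in a guest.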
[PATCH] ring-buffer: Align meta-page to sub-buffers for improved TLB usage
Previously, the mapped ring-buffer layout caused misalignment between the meta-page and sub-buffers when the sub-buffer size was not a multiple of PAGE_SIZE. This prevented hardware with larger TLB entries from utilizing them effectively. Add a padding with the zero-page between the meta-page and sub-buffers. Also update the ring-buffer map_test to verify that padding. Signed-off-by: Vincent Donnefort -- This is based on the mm-unstable branch [1] as it depends on David's work [2] for allowing the zero-page in vm_insert_page(). [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git [2] https://lore.kernel.org/all/20240522125713.775114-1-da...@redhat.com diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 7345a8b625fb..acaab4d4288f 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -6148,10 +6148,10 @@ static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer, /* install subbuf ID to kern VA translation */ cpu_buffer->subbuf_ids = subbuf_ids; - meta->meta_page_size = PAGE_SIZE; meta->meta_struct_len = sizeof(*meta); meta->nr_subbufs = nr_subbufs; meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE; + meta->meta_page_size = meta->subbuf_size; rb_update_meta_page(cpu_buffer); } @@ -6238,6 +6238,12 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, !(vma->vm_flags & VM_MAYSHARE)) return -EPERM; + subbuf_order = cpu_buffer->buffer->subbuf_order; + subbuf_pages = 1 << subbuf_order; + + if (subbuf_order && pgoff % subbuf_pages) + return -EINVAL; + /* * Make sure the mapping cannot become writable later. Also tell the VM * to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). 
@@ -6247,11 +6253,8 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, lockdep_assert_held(&cpu_buffer->mapping_lock); - subbuf_order = cpu_buffer->buffer->subbuf_order; - subbuf_pages = 1 << subbuf_order; - nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */ - nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */ + nr_pages = ((nr_subbufs + 1) << subbuf_order) - pgoff; /* + meta-page */ vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT; if (!vma_pages || vma_pages > nr_pages) @@ -6264,20 +6267,20 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, return -ENOMEM; if (!pgoff) { + unsigned long meta_page_padding; + pages[p++] = virt_to_page(cpu_buffer->meta_page); /* -* TODO: Align sub-buffers on their size, once -* vm_insert_pages() supports the zero-page. +* Pad with the zero-page to align the meta-page with the +* sub-buffers. */ + meta_page_padding = subbuf_pages - 1; + while (meta_page_padding-- && p < nr_pages) + pages[p++] = ZERO_PAGE(vma->vm_start + (PAGE_SIZE * p)); } else { /* Skip the meta-page */ - pgoff--; - - if (pgoff % subbuf_pages) { - err = -EINVAL; - goto out; - } + pgoff -= subbuf_pages; s += pgoff / subbuf_pages; } diff --git a/tools/testing/selftests/ring-buffer/map_test.c b/tools/testing/selftests/ring-buffer/map_test.c index a9006fa7097e..4bb0192e43f3 100644 --- a/tools/testing/selftests/ring-buffer/map_test.c +++ b/tools/testing/selftests/ring-buffer/map_test.c @@ -228,6 +228,20 @@ TEST_F(map, data_mmap) data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, desc->cpu_fd, meta_len); ASSERT_EQ(data, MAP_FAILED); + + /* Verify meta-page padding */ + if (desc->meta->meta_page_size > getpagesize()) { + void *addr; + + data_len = desc->meta->meta_page_size; + data = mmap(NULL, data_len, + PROT_READ, MAP_SHARED, desc->cpu_fd, 0); + ASSERT_NE(data, MAP_FAILED); + + addr = (void *)((unsigned long)data + getpagesize()); + ASSERT_EQ(*((int *)addr), 0); + munmap(data, data_len); + } } 
FIXTURE(snapshot) { base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed -- 2.45.1.288.g0e0cd299f1-goog
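The layout change in this patch can be summarized with a little page arithmetic. The following is a userspace sketch derived from the expressions in the diff, not kernel code; the helper names are made up, and nr_subbufs is assumed to include the reader sub-buffer as in __rb_map_vma().

```c
#include <assert.h>
#include <limits.h>

/* Number of system pages in one sub-buffer (subbuf_pages in the patch). */
static unsigned long subbuf_pages(unsigned int subbuf_order)
{
	assert(subbuf_order < sizeof(unsigned long) * CHAR_BIT);
	return 1UL << subbuf_order;
}

/* Zero-pages mapped right after the meta-page to pad it to a sub-buffer. */
static unsigned long meta_padding_pages(unsigned int subbuf_order)
{
	return subbuf_pages(subbuf_order) - 1;
}

/* Page offset of sub-buffer idx before the patch: one meta-page, then data. */
static unsigned long subbuf_pgoff_old(unsigned long idx, unsigned int order)
{
	return 1 + (idx << order);
}

/* Page offset after the patch: the meta area takes a whole sub-buffer slot. */
static unsigned long subbuf_pgoff_new(unsigned long idx, unsigned int order)
{
	return (idx + 1) << order;
}

/* Mappable pages from offset 0, before and after the patch. */
static unsigned long total_pages_old(unsigned long nr_subbufs, unsigned int order)
{
	return (nr_subbufs << order) + 1;
}

static unsigned long total_pages_new(unsigned long nr_subbufs, unsigned int order)
{
	return (nr_subbufs + 1) << order;
}
```

For order 0 both layouts are identical. For order 2, the first sub-buffer moves from page offset 1 to page offset 4, so every sub-buffer starts on a multiple of its own size, at the cost of three zero-page mappings.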
Re: [RFC PATCH 1/5] vsock/virtio: Extend virtio-vsock spec with an "order" field
(CCing virtio-comment, since this proposes adding a field to a struct that is standardized[1] in the VIRTIO spec, so changes to the Linux implementation should presumably be coordinated with changes to the spec.) [1]: https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-4780004 On Fri, May 17, 2024 at 10:46:03PM +0800, Xuewei Niu wrote: > The "order" field determines the location of the device in the linked list, > the device with CID 4, having a smallest order, is in the first place, and > so forth. > > Rules: > > * It doesn’t have to be continuous; > * It cannot exist conflicts; > * It is optional for the mode of a single device, but is required for the > mode of multiple devices. > > Signed-off-by: Xuewei Niu > --- > include/uapi/linux/virtio_vsock.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/include/uapi/linux/virtio_vsock.h > b/include/uapi/linux/virtio_vsock.h > index 64738838bee5..b62ec7d2ab1e 100644 > --- a/include/uapi/linux/virtio_vsock.h > +++ b/include/uapi/linux/virtio_vsock.h > @@ -43,6 +43,7 @@ > > struct virtio_vsock_config { > __le64 guest_cid; > + __le64 order; > } __attribute__((packed)); > > enum virtio_vsock_event_id { > -- > 2.34.1
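The rules quoted above (values need not be contiguous, no two devices may share a value, smallest order first) can be sketched as a sorted insertion. This is a userspace illustration under assumptions: struct vsock_dev and vsock_dev_insert() are hypothetical names that do not appear in the series, and returning -EEXIST on a conflict is a choice made for this example only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device record; not a structure from the patch. */
struct vsock_dev {
	uint64_t guest_cid;
	uint64_t order;
	struct vsock_dev *next;
};

/*
 * Insert dev into the singly linked list at *head, keeping ascending order.
 * Gaps between order values are allowed. Returns 0 on success, or -EEXIST
 * if a device with the same order is already in the list.
 */
static int vsock_dev_insert(struct vsock_dev **head, struct vsock_dev *dev)
{
	struct vsock_dev **pos = head;

	assert(head && dev);

	/* Walk to the first entry whose order is not smaller than dev's. */
	while (*pos && (*pos)->order < dev->order)
		pos = &(*pos)->next;

	if (*pos && (*pos)->order == dev->order)
		return -EEXIST;

	dev->next = *pos;
	*pos = dev;
	return 0;
}
```

Inserting devices with orders 30, 10 and 20 in that sequence yields the list 10, 20, 30, and a second device with order 20 is rejected.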