Re: [PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths
Verma, Vishal L wrote: > > > @@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, > > > struct device_attribute *attr, > > > if (!victim) > > > return -ENXIO; > > > > > > - rc = down_write_killable(_region_rwsem); > > > - if (rc) > > > - return rc; > > > - rc = down_write_killable(_dev_rwsem); > > > - if (rc) { > > > - up_write(_region_rwsem); > > > - return rc; > > > - } > > > + device_lock(dev); > > > + device_lock(victim); > > > dev_dax = to_dev_dax(victim); > > > + rc = down_write_killable(_dev_rwsem); > > > > This begs the question, why down_write_killable(), but not > > device_lock_interruptible()? > > Do you mean change the device_lock()s to device_lock_interruptible() in > addition to the taking the rwsem (i.e. not instead of the rwsem..)? I mean convert the rwsem to drop _killable. > I guess I just restored what was there previously - but the > interruptible variant makes sense, I can make that change. So the original code did device_lock(), then the rework added killable rwsem (deleted device_lock()), and now the fixes add device_lock() back. So now that there is a mix of killable/interruptible lock usage all the locks should agree. Since there really is no risk of these operations being long running there is no driving need to make them killable/interruptible, so go with the simple option.
Re: [EXTERNAL] Re: [PATCH v2 1/2] remoteproc: k3-r5: Wait for core0 power-up before powering up core1
Hello, On 26/04/24 22:39, Mathieu Poirier wrote: Good day, On Wed, Apr 24, 2024 at 06: 35: 03PM +0530, Beleswar Padhi wrote: > From: Apurva Nandan > > PSC controller has a limitation that it can only power-up the second core > when the first core is in ON ZjQcmQRYFpfptBannerStart This message was sent from outside of Texas Instruments. Do not click links or open attachments unless you recognize the source of this email and know the content is safe. If you wish to report this message to IT Security, please forward the message as an attachment to phish...@list.ti.com ZjQcmQRYFpfptBannerEnd Good day, On Wed, Apr 24, 2024 at 06:35:03PM +0530, Beleswar Padhi wrote: > From: Apurva Nandan > > PSC controller has a limitation that it can only power-up the second core > when the first core is in ON state. Power-state for core0 should be equal > to or higher than core1, else the kernel is seen hanging during rproc > loading. > > Make the powering up of cores sequential, by waiting for the current core > to power-up before proceeding to the next core, with a timeout of 2sec. > Add a wait queue event in k3_r5_cluster_rproc_init call, that will wait > for the current core to be released from reset before proceeding with the > next core. > > Fixes: 6dedbd1d5443 ("remoteproc: k3-r5: Add a remoteproc driver for R5F subsystem") > > Signed-off-by: Apurva Nandan You need to add your own SoB as well. > --- > drivers/remoteproc/ti_k3_r5_remoteproc.c | 28 > 1 file changed, 28 insertions(+) > > diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c > index ad3415a3851b..5a9bd5d4a2ea 100644 > --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c > +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c > @@ -103,12 +103,14 @@ struct k3_r5_soc_data { > * @dev: cached device pointer > * @mode: Mode to configure the Cluster - Split or LockStep > * @cores: list of R5 cores within the cluster > + * @core_transition: wait queue to sync core state changes > * @soc_data: SoC-specific feature data for a R5FSS > */ > struct k3_r5_cluster { >struct device *dev; >enum cluster_mode mode; >struct list_head cores; > + wait_queue_head_t core_transition; >const struct k3_r5_soc_data *soc_data; > }; > > @@ -128,6 +130,7 @@ struct k3_r5_cluster { > * @atcm_enable: flag to control ATCM enablement > * @btcm_enable: flag to control BTCM enablement > * @loczrama: flag to dictate which TCM is at device address 0x0 > + * @released_from_reset: flag to signal when core is out of reset > */ > struct k3_r5_core { >struct list_head elem; > @@ -144,6 +147,7 @@ struct k3_r5_core { >u32 atcm_enable; >u32 btcm_enable; >u32 loczrama; > + bool released_from_reset; > }; > > /** > @@ -460,6 +464,8 @@ static int k3_r5_rproc_prepare(struct rproc *rproc) >ret); >return ret; >} > + core->released_from_reset = true; > + wake_up_interruptible(>core_transition); > > /* > * Newer IP revisions like on J7200 SoCs support h/w auto-initialization > @@ -1140,6 +1146,7 @@ static int k3_r5_rproc_configure_mode(struct k3_r5_rproc *kproc) >return ret; >} > > + core->released_from_reset = c_state; I understand why this is needed but it line could be very cryptic for people trying to understand this driver. Please add a comment to describe what is happening here. Thanks for the review. I will send v3 addressing these comments shortly! >ret = ti_sci_proc_get_status(core->tsp, _vec, , , > ); >if (ret < 0) { > @@ -1280,6 +1287,26 @@ static int k3_r5_cluster_rproc_init(struct platform_device *pdev) >cluster->mode == CLUSTER_MODE_SINGLECPU || >cluster->mode == CLUSTER_MODE_SINGLECORE) >break; > + > + /* > + * R5 cores require to be powered on sequentially, core0 > + * should be in higher power state than core1 in a cluster > + * So, wait for current core to power up before proceeding > + * to next core and put timeout of 2sec for each core. > + * > + * This waiting mechanism is necessary because > + * rproc_auto_boot_callback() for core1 can be called before > + * core0 due to thread execution order. > + */ > + ret = wait_event_interruptible_timeout(cluster->core_transition, > + core->released_from_reset, > + msecs_to_jiffies(2000)); > + if (ret <= 0) { > + dev_err(dev, > + "Timed out waiting for %s core to power up!\n", > + rproc->name); > + return ret; > + } >} > > return 0; > @@ -1709,6 +1736,7 @@ static int k3_r5_probe(struct platform_device *pdev) >cluster->dev = dev; >
Re: [PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths
On Mon, 2024-04-29 at 18:25 -0700, Dan Williams wrote: > Vishal Verma wrote: > > Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local > > rwsem") > > was a bit overzealous in eliminating device_lock() usage, and ended up > > removing a couple of lock acquisitions which were needed, and as a > > result, fix some of the conditional locking missteps that the above > > commit introduced in unregister_dax_dev() and unregister_dax_mapping(). > > I think it makes sense to tell the story a bit about why the > delete_store() conversion was problematic, because the > unregister_dev_dax() changes were just a knock-on effect to fixing the > delete_store() flow. > > Something like: > > --- > commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local > rwsem") > aimed to undo device_lock() abuses for protecting changes to dax-driver > internal data-structures like the dax_region resource tree to > device-dax-instance range structures. However, the device_lock() was > legitamately > enforcing that devices to be deleted were not current actively attached > to any driver nor assigned any capacity from the region. > --- > > ...you can fill in a couple notes about the knock-on fixups after that > was restored. Sounds good, updated! > > > > > @@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, > > struct device_attribute *attr, > > if (!victim) > > return -ENXIO; > > > > - rc = down_write_killable(_region_rwsem); > > - if (rc) > > - return rc; > > - rc = down_write_killable(_dev_rwsem); > > - if (rc) { > > - up_write(_region_rwsem); > > - return rc; > > - } > > + device_lock(dev); > > + device_lock(victim); > > dev_dax = to_dev_dax(victim); > > + rc = down_write_killable(_dev_rwsem); > > This begs the question, why down_write_killable(), but not > device_lock_interruptible()? Do you mean change the device_lock()s to device_lock_interruptible() in addition to the taking the rwsem (i.e. not instead of the rwsem..)? I guess I just restored what was there previously - but the interruptible variant makes sense, I can make that change. > > I do not expect any of this is long running so likely down_write() is > sufficient here, especially since the heaviest locks to acquire are > already held by the time rwsem is considered. > > Other than that this looks good to me: > > You can include my Reviewed-by on the next posting. Thanks for the review Dan!
[PATCH v2 1/2] LoongArch: KVM: Add steal time support in kvm side
Steal time feature is added here in kvm side, VM can search supported features provided by KVM hypervisor, feature KVM_FEATURE_STEAL_TIME is added here. Like x86, steal time structure is saved in guest memory, one hypercall function KVM_HCALL_FUNC_NOTIFY is added to notify KVM to enable the feature. One cpu attr ioctl command KVM_LOONGARCH_VCPU_PVTIME_CTRL is added to save and restore base address of steal time structure when VM is migrated. Since it needs hypercall instruction emulation handling, and it is dependent on this patchset: https://lore.kernel.org/all/20240428100518.1642324-1-maob...@loongson.cn/ Signed-off-by: Bibo Mao --- arch/loongarch/include/asm/kvm_host.h | 7 ++ arch/loongarch/include/asm/kvm_para.h | 10 +++ arch/loongarch/include/asm/loongarch.h | 1 + arch/loongarch/include/uapi/asm/kvm.h | 4 + arch/loongarch/kvm/exit.c | 29 +- arch/loongarch/kvm/vcpu.c | 120 + 6 files changed, 169 insertions(+), 2 deletions(-) diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h index 1921750d4b4c..30bda553c54d 100644 --- a/arch/loongarch/include/asm/kvm_host.h +++ b/arch/loongarch/include/asm/kvm_host.h @@ -30,6 +30,7 @@ #define KVM_PRIVATE_MEM_SLOTS 0 #define KVM_HALT_POLL_NS_DEFAULT 50 +#define KVM_REQ_RECORD_STEAL KVM_ARCH_REQ(1) #define KVM_GUESTDBG_VALID_MASK(KVM_GUESTDBG_ENABLE | \ KVM_GUESTDBG_USE_SW_BP | KVM_GUESTDBG_SINGLESTEP) @@ -198,6 +199,12 @@ struct kvm_vcpu_arch { struct kvm_mp_state mp_state; /* cpucfg */ u32 cpucfg[KVM_MAX_CPUCFG_REGS]; + /* paravirt steal time */ + struct { + u64 guest_addr; + u64 last_steal; + struct gfn_to_hva_cache cache; + } st; }; static inline unsigned long readl_sw_gcsr(struct loongarch_csrs *csr, int reg) diff --git a/arch/loongarch/include/asm/kvm_para.h b/arch/loongarch/include/asm/kvm_para.h index 56775554402a..5fb89e20432d 100644 --- a/arch/loongarch/include/asm/kvm_para.h +++ b/arch/loongarch/include/asm/kvm_para.h @@ -12,6 +12,7 @@ #define KVM_HCALL_CODE_SWDBG 1 #define KVM_HCALL_PV_SERVICE HYPERCALL_CODE(HYPERVISOR_KVM, KVM_HCALL_CODE_PV_SERVICE) #define KVM_HCALL_FUNC_PV_IPI 1 +#define KVM_HCALL_FUNC_NOTIFY 2 #define KVM_HCALL_SWDBGHYPERCALL_CODE(HYPERVISOR_KVM, KVM_HCALL_CODE_SWDBG) /* @@ -21,6 +22,15 @@ #define KVM_HCALL_INVALID_CODE -1UL #define KVM_HCALL_INVALID_PARAMETER-2UL +#define KVM_STEAL_PHYS_VALID BIT_ULL(0) +#define KVM_STEAL_PHYS_MASKGENMASK_ULL(63, 6) +struct kvm_steal_time { + __u64 steal; + __u32 version; + __u32 flags; + __u32 pad[12]; +}; + /* * Hypercall interface for KVM hypervisor * diff --git a/arch/loongarch/include/asm/loongarch.h b/arch/loongarch/include/asm/loongarch.h index 0ad36704cb4b..ab6a5e93c280 100644 --- a/arch/loongarch/include/asm/loongarch.h +++ b/arch/loongarch/include/asm/loongarch.h @@ -168,6 +168,7 @@ #define KVM_SIGNATURE "KVM\0" #define CPUCFG_KVM_FEATURE (CPUCFG_KVM_BASE + 4) #define KVM_FEATURE_PV_IPIBIT(1) +#define KVM_FEATURE_STEAL_TIMEBIT(2) #ifndef __ASSEMBLY__ diff --git a/arch/loongarch/include/uapi/asm/kvm.h b/arch/loongarch/include/uapi/asm/kvm.h index 8f78b23672ac..286b5ce93a57 100644 --- a/arch/loongarch/include/uapi/asm/kvm.h +++ b/arch/loongarch/include/uapi/asm/kvm.h @@ -80,7 +80,11 @@ struct kvm_fpu { #define LOONGARCH_REG_64(TYPE, REG)(TYPE | KVM_REG_SIZE_U64 | (REG << LOONGARCH_REG_SHIFT)) #define KVM_IOC_CSRID(REG) LOONGARCH_REG_64(KVM_REG_LOONGARCH_CSR, REG) #define KVM_IOC_CPUCFG(REG) LOONGARCH_REG_64(KVM_REG_LOONGARCH_CPUCFG, REG) + +/* Device Control API on vcpu fd */ #define KVM_LOONGARCH_VCPU_CPUCFG 0 +#define KVM_LOONGARCH_VCPU_PVTIME_CTRL 1 +#define KVM_LOONGARCH_VCPU_PVTIME_GPA 0 struct kvm_debug_exit_arch { }; diff --git a/arch/loongarch/kvm/exit.c b/arch/loongarch/kvm/exit.c index 680aeaa4aeeb..af0f1c46e4eb 100644 --- a/arch/loongarch/kvm/exit.c +++ b/arch/loongarch/kvm/exit.c @@ -209,7 +209,7 @@ int kvm_emu_idle(struct kvm_vcpu *vcpu) static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, larch_inst inst) { int rd, rj; - unsigned int index; + unsigned int index, ret; rd = inst.reg2_format.rd; rj = inst.reg2_format.rj; @@ -232,7 +232,10 @@ static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, larch_inst inst) vcpu->arch.gprs[rd] = *(unsigned int *)KVM_SIGNATURE; break; case CPUCFG_KVM_FEATURE: - vcpu->arch.gprs[rd] = KVM_FEATURE_PV_IPI; + ret = KVM_FEATURE_PV_IPI; + if (sched_info_on()) + ret |= KVM_FEATURE_STEAL_TIME; +
[PATCH v2 2/2] LoongArch: Add steal time support in guest side
Percpu struct kvm_steal_time is added here, its size is 64 bytes and also defined as 64 bytes, so that the whole structure is in one physical page. When vcpu is onlined, function pv_enable_steal_time() is called. This function will pass guest physical address of struct kvm_steal_time and tells hypervisor to enable steal time. When vcpu is offline, physical address is set as 0 and tells hypervisor to disable steal time. Here is output of vmstat on guest when there is workload on both host and guest. It includes steal time stat information. procs ---memory-- -io -system-- --cpu- r b swpd free inact active bibo in cs us sy id wa st 15 1 0 7583616 184112 72208200 162 52 31 6 43 0 20 17 0 0 7583616 184704 721920 0 6318 6885 5 60 8 5 22 16 0 0 7583616 185392 721440 0 1766 1081 0 49 0 1 50 16 0 0 7583616 184816 723040 0 6300 6166 4 62 12 2 20 18 0 0 7583632 184480 722400 0 2814 1754 2 58 4 1 35 Signed-off-by: Bibo Mao --- arch/loongarch/Kconfig| 11 +++ arch/loongarch/include/asm/paravirt.h | 5 + arch/loongarch/kernel/paravirt.c | 131 ++ arch/loongarch/kernel/time.c | 2 + 4 files changed, 149 insertions(+) diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 0a1540a8853e..f3a03c33a052 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -592,6 +592,17 @@ config PARAVIRT over full virtualization. However, when run without a hypervisor the kernel is theoretically slower and slightly larger. +config PARAVIRT_TIME_ACCOUNTING + bool "Paravirtual steal time accounting" + select PARAVIRT + help + Select this option to enable fine granularity task steal time + accounting. Time spent executing other tasks in parallel with + the current vCPU is discounted from the vCPU power. To account for + that, there can be a small performance impact. + + If in doubt, say N here. + config ARCH_SUPPORTS_KEXEC def_bool y diff --git a/arch/loongarch/include/asm/paravirt.h b/arch/loongarch/include/asm/paravirt.h index 58f7b7b89f2c..fe27fb5e82b8 100644 --- a/arch/loongarch/include/asm/paravirt.h +++ b/arch/loongarch/include/asm/paravirt.h @@ -17,11 +17,16 @@ static inline u64 paravirt_steal_clock(int cpu) } int pv_ipi_init(void); +int __init pv_time_init(void); #else static inline int pv_ipi_init(void) { return 0; } +static inline int pv_time_init(void) +{ + return 0; +} #endif // CONFIG_PARAVIRT #endif diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c index 9044ed62045c..3f83afc7e2d2 100644 --- a/arch/loongarch/kernel/paravirt.c +++ b/arch/loongarch/kernel/paravirt.c @@ -5,10 +5,13 @@ #include #include #include +#include #include struct static_key paravirt_steal_enabled; struct static_key paravirt_steal_rq_enabled; +static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); +static int has_steal_clock; static u64 native_steal_clock(int cpu) { @@ -17,6 +20,57 @@ static u64 native_steal_clock(int cpu) DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock); +static bool steal_acc = true; +static int __init parse_no_stealacc(char *arg) +{ + steal_acc = false; + return 0; +} +early_param("no-steal-acc", parse_no_stealacc); + +static u64 para_steal_clock(int cpu) +{ + u64 steal; + struct kvm_steal_time *src; + int version; + + src = _cpu(steal_time, cpu); + do { + + version = src->version; + /* Make sure that the version is read before the steal */ + virt_rmb(); + steal = src->steal; + /* Make sure that the steal is read before the next version */ + virt_rmb(); + + } while ((version & 1) || (version != src->version)); + return steal; +} + +static int pv_enable_steal_time(void) +{ + int cpu = smp_processor_id(); + struct kvm_steal_time *st; + unsigned long addr; + + if (!has_steal_clock) + return -EPERM; + + st = _cpu(steal_time, cpu); + addr = per_cpu_ptr_to_phys(st); + + /* The whole structure kvm_steal_time should be one page */ + if (PFN_DOWN(addr) != PFN_DOWN(addr + sizeof(*st))) { + pr_warn("Illegal PV steal time addr %lx\n", addr); + return -EFAULT; + } + + addr |= KVM_STEAL_PHYS_VALID; + kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, addr); + return 0; +} + #ifdef CONFIG_SMP static void pv_send_ipi_single(int cpu, unsigned int action) { @@ -110,6 +164,32 @@ static void pv_init_ipi(void) if (r < 0) panic("SWI0 IRQ request failed\n"); } + +static void pv_disable_steal_time(void) +{ + if (has_steal_clock) +
[PATCH v2 0/2] LoongArch: Add steal time support
Para-virt feature steal time is added in both kvm and guest kernel side. It is silimar with other architectures, steal time structure comes from guest memory, also pseduo register is used to save/restore base address of steal time structure, so that vm migration is supported also. --- v2: 1. Add PARAVIRT_TIME_ACCOUNTING kconfig option in file arch/loongarch/Kconfig 2. Function name change such as replace pv_register_steal_time with pv_enable_steal_time etc --- Bibo Mao (2): LoongArch: KVM: Add steal time support in kvm side LoongArch: Add steal time support in guest side arch/loongarch/Kconfig | 11 +++ arch/loongarch/include/asm/kvm_host.h | 7 ++ arch/loongarch/include/asm/kvm_para.h | 10 ++ arch/loongarch/include/asm/loongarch.h | 1 + arch/loongarch/include/asm/paravirt.h | 5 + arch/loongarch/include/uapi/asm/kvm.h | 4 + arch/loongarch/kernel/paravirt.c | 131 + arch/loongarch/kernel/time.c | 2 + arch/loongarch/kvm/exit.c | 29 +- arch/loongarch/kvm/vcpu.c | 120 ++ 10 files changed, 318 insertions(+), 2 deletions(-) base-commit: 2c8159388952f530bd260e097293ccc0209240be -- 2.39.3
Re: [PATCH v12 12/14] x86/sgx: Turn on per-cgroup EPC reclamation
On Mon, 29 Apr 2024 17:18:05 -0500, Huang, Kai wrote: /* @@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag */ static inline bool sgx_can_reclaim(void) { -return !list_empty(_global_lru.reclaimable); +return !sgx_cgroup_lru_empty(misc_cg_root()) || + !list_empty(_global_lru.reclaimable); } Shouldn't this be: if (IS_ENABLED(CONFIG_CGROUP_MISC)) return !sgx_cgroup_lru_empty(misc_cg_root()); else return !list_empty(_global_lru.reclaimable); ? In this way, it is consistent with the sgx_reclaim_pages_global() below. I changed to this way because sgx_cgroup_lru_empty() is now defined in both KConfig cases. And it seems better to minimize use of the KConfig variables based on earlier feedback (some are yours). Don't really have strong preference here. So let me know one way of the other. But IMHO your code could be confusing, e.g., it can be interpreted as: The EPC pages can be managed by both the cgroup LRUs and the sgx_global_lru simultaneously at runtime when CONFIG_CGROUP_MISC is on. Which is not true. So we should make the code clearly reflect the true behaviour. Ok, I'll change back. Thanks Haitao
Re: [PATCH v2 4/4] dax/bus.c: Use the right locking mode (read vs write) in size_show
Vishal Verma wrote: > In size_show(), the dax_dev_rwsem only needs a read lock, but was > acquiring a write lock. Change it to down_read_interruptible() so it > doesn't unnecessarily hold a write lock. > > Cc: Dan Williams > Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local > rwsem") > Signed-off-by: Vishal Verma > --- > drivers/dax/bus.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) Looks good, Reviewed-by: Dan Williams
Re: [PATCH v2 3/4] dax/bus.c: Don't use down_write_killable for non-user processes
Vishal Verma wrote: > Change an instance of down_write_killable() to a simple down_write() where > there is no user process that might want to interrupt the operation. > > Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local > rwsem") > Reported-by: Dan Williams > Signed-off-by: Vishal Verma > --- > drivers/dax/bus.c | 6 +- > 1 file changed, 1 insertion(+), 5 deletions(-) Looks good, Reviewed-by: Dan Williams
Re: [PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths
Vishal Verma wrote: > Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local > rwsem") > was a bit overzealous in eliminating device_lock() usage, and ended up > removing a couple of lock acquisitions which were needed, and as a > result, fix some of the conditional locking missteps that the above > commit introduced in unregister_dax_dev() and unregister_dax_mapping(). I think it makes sense to tell the story a bit about why the delete_store() conversion was problematic, because the unregister_dev_dax() changes were just a knock-on effect to fixing the delete_store() flow. Something like: --- commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local rwsem") aimed to undo device_lock() abuses for protecting changes to dax-driver internal data-structures like the dax_region resource tree to device-dax-instance range structures. However, the device_lock() was legitamately enforcing that devices to be deleted were not current actively attached to any driver nor assigned any capacity from the region. --- ...you can fill in a couple notes about the knock-on fixups after that was restored. > Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local > rwsem") > Reported-by: Dan Williams > Signed-off-by: Vishal Verma > --- > drivers/dax/bus.c | 44 ++-- > 1 file changed, 10 insertions(+), 34 deletions(-) > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index 7924dd542a13..4e04b228b080 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c > @@ -465,26 +465,17 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax) > trim_dev_dax_range(dev_dax); > } > > -static void __unregister_dev_dax(void *dev) > +static void unregister_dev_dax(void *dev) > { > struct dev_dax *dev_dax = to_dev_dax(dev); > > dev_dbg(dev, "%s\n", __func__); > > + down_write(_region_rwsem); > kill_dev_dax(dev_dax); > device_del(dev); > free_dev_dax_ranges(dev_dax); > put_device(dev); > -} > - > -static void unregister_dev_dax(void *dev) > -{ > - if (rwsem_is_locked(_region_rwsem)) > - return __unregister_dev_dax(dev); > - > - if (WARN_ON_ONCE(down_write_killable(_region_rwsem) != 0)) > - return; > - __unregister_dev_dax(dev); > up_write(_region_rwsem); > } > > @@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, struct > device_attribute *attr, > if (!victim) > return -ENXIO; > > - rc = down_write_killable(_region_rwsem); > - if (rc) > - return rc; > - rc = down_write_killable(_dev_rwsem); > - if (rc) { > - up_write(_region_rwsem); > - return rc; > - } > + device_lock(dev); > + device_lock(victim); > dev_dax = to_dev_dax(victim); > + rc = down_write_killable(_dev_rwsem); This begs the question, why down_write_killable(), but not device_lock_interruptible()? I do not expect any of this is long running so likely down_write() is sufficient here, especially since the heaviest locks to acquire are already held by the time rwsem is considered. Other than that this looks good to me: You can include my Reviewed-by on the next posting.
Re: [PATCH v2 1/4] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts
Vishal Verma wrote: > In [1], Dan points out that all of the WARN_ON_ONCE() usage in the > referenced patch should be replaced with lockdep_assert_held, or > lockdep_held_assert_write(). Replace these as appropriate. > > Link: > https://lore.kernel.org/r/65f0b5ef41817_aa2229...@dwillia2-mobl3.amr.corp.intel.com.notmuch > [1] > Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local > rwsem") > Cc: Andrew Morton > Reported-by: Dan Williams > Signed-off-by: Vishal Verma > --- > drivers/dax/bus.c | 16 > 1 file changed, 8 insertions(+), 8 deletions(-) Looks good, Reviewed-by: Dan Williams
BUG: unable to handle kernel paging request in do_split
Hello. We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. Recently, our team has discovered a issue in Linux kernel 6.7. Attached to the email were a PoC file of the issue. Stack dump: BUG: unable to handle page fault for address: ed110c2fd97f #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 7ffd0067 P4D 7ffd0067 PUD 0 Oops: [#1] PREEMPT SMP KASAN NOPTI CPU: 0 PID: 24082 Comm: syz-executor.3 Not tainted 6.7.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 RIP: 0010:do_split+0xfef/0x1e10 fs/ext4/namei.c:2047 Code: d2 0f 85 38 0b 00 00 8b 45 00 89 84 24 84 00 00 00 41 8d 45 ff 48 8d 1c c3 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1 ea 03 <0f> b6 14 02 48 89 d8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 ef RSP: 0018:c90001e9f858 EFLAGS: 00010a02 RAX: dc00 RBX: 617ecbf8 RCX: c9001048f000 RDX: 11110c2fd97f RSI: 823364ab RDI: 0005 RBP: 8880617ecc00 R08: 0005 R09: R10: R11: R12: dc00 R13: R14: R15: 88801ee8d2b0 FS: 7f191402a640() GS:88802c60() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: ed110c2fd97f CR3: 5500a000 CR4: 00750ef0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: make_indexed_dir+0x1158/0x1540 fs/ext4/namei.c:2342 ext4_add_entry+0xcd0/0xe80 fs/ext4/namei.c:2454 ext4_add_nondir+0x90/0x2b0 fs/ext4/namei.c:2795 ext4_symlink+0x539/0x9e0 fs/ext4/namei.c:3436 vfs_symlink fs/namei.c:4464 [inline] vfs_symlink+0x3f6/0x640 fs/namei.c:4448 do_symlinkat+0x245/0x2f0 fs/namei.c:4490 __do_sys_symlink fs/namei.c:4511 [inline] __se_sys_symlink fs/namei.c:4509 [inline] __x64_sys_symlink+0x79/0xa0 fs/namei.c:4509 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7f191329002d Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:7f191402a028 EFLAGS: 0246 ORIG_RAX: 0058 RAX: ffda RBX: 7f19133cbf80 RCX: 7f191329002d RDX: RSI: 2e40 RDI: 20001640 RBP: 7f19132f14d0 R08: R09: R10: R11: 0246 R12: R13: 000b R14: 7f19133cbf80 R15: 7f191400a000 Modules linked in: CR2: ed110c2fd97f ---[ end trace ]--- RIP: 0010:do_split+0xfef/0x1e10 fs/ext4/namei.c:2047 Code: d2 0f 85 38 0b 00 00 8b 45 00 89 84 24 84 00 00 00 41 8d 45 ff 48 8d 1c c3 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1 ea 03 <0f> b6 14 02 48 89 d8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 ef RSP: 0018:c90001e9f858 EFLAGS: 00010a02 RAX: dc00 RBX: 617ecbf8 RCX: c9001048f000 RDX: 11110c2fd97f RSI: 823364ab RDI: 0005 RBP: 8880617ecc00 R08: 0005 R09: R10: R11: R12: dc00 R13: R14: R15: 88801ee8d2b0 FS: 7f191402a640() GS:88802c60() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: ed110c2fd97f CR3: 5500a000 CR4: 00750ef0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Code disassembly (best guess): 0: d2 0f rorb %cl,(%rdi) 2: 85 38 test %edi,(%rax) 4: 0b 00 or (%rax),%eax 6: 00 8b 45 00 89 84 add%cl,-0x7b76ffbb(%rbx) c: 24 84 and$0x84,%al e: 00 00 add%al,(%rax) 10: 00 41 8dadd%al,-0x73(%rcx) 13: 45 ff 48 8d rex.RB decl -0x73(%r8) 17: 1c c3 sbb$0xc3,%al 19: 48 b8 00 00 00 00 00movabs $0xdc00,%rax 20: fc ff df 23: 48 89 damov%rbx,%rdx 26: 48 c1 ea 03 shr$0x3,%rdx * 2a: 0f b6 14 02 movzbl (%rdx,%rax,1),%edx <-- trapping instruction 2e: 48 89 d8mov%rbx,%rax 31: 83 e0 07and$0x7,%eax 34: 83 c0 03add$0x3,%eax 37: 38 d0 cmp%dl,%al 39: 7c 08 jl 0x43 3b: 84 d2 test %dl,%dl 3d: 0f .byte 0xf 3e: 85 ef test %ebp,%edi Thank you
[PATCH v4 3/4] remoteproc: mediatek: Support setting DRAM and IPI shared buffer sizes
The SCP on different chips will require different DRAM sizes and IPI shared buffer sizes based on varying requirements. Signed-off-by: Olivia Wen Reviewed-by: AngeloGioacchino Del Regno --- drivers/remoteproc/mtk_common.h | 11 -- drivers/remoteproc/mtk_scp.c | 84 +++- drivers/remoteproc/mtk_scp_ipi.c | 7 +++- 3 files changed, 79 insertions(+), 23 deletions(-) diff --git a/drivers/remoteproc/mtk_common.h b/drivers/remoteproc/mtk_common.h index 6d7736a..fd5c539 100644 --- a/drivers/remoteproc/mtk_common.h +++ b/drivers/remoteproc/mtk_common.h @@ -78,7 +78,6 @@ #define MT8195_L2TCM_OFFSET0x850d0 #define SCP_FW_VER_LEN 32 -#define SCP_SHARE_BUFFER_SIZE 288 struct scp_run { u32 signaled; @@ -97,6 +96,11 @@ struct scp_ipi_desc { struct mtk_scp; +struct mtk_scp_sizes_data { + size_t max_dram_size; + size_t ipi_share_buffer_size; +}; + struct mtk_scp_of_data { int (*scp_clk_get)(struct mtk_scp *scp); int (*scp_before_load)(struct mtk_scp *scp); @@ -110,6 +114,7 @@ struct mtk_scp_of_data { u32 host_to_scp_int_bit; size_t ipi_buf_offset; + const struct mtk_scp_sizes_data *scp_sizes; }; struct mtk_scp_of_cluster { @@ -141,10 +146,10 @@ struct mtk_scp { struct scp_ipi_desc ipi_desc[SCP_IPI_MAX]; bool ipi_id_ack[SCP_IPI_MAX]; wait_queue_head_t ack_wq; + u8 *share_buf; void *cpu_addr; dma_addr_t dma_addr; - size_t dram_size; struct rproc_subdev *rpmsg_subdev; @@ -162,7 +167,7 @@ struct mtk_scp { struct mtk_share_obj { u32 id; u32 len; - u8 share_buf[SCP_SHARE_BUFFER_SIZE]; + u8 *share_buf; }; void scp_memcpy_aligned(void __iomem *dst, const void *src, unsigned int len); diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c index 6295148..e281d28 100644 --- a/drivers/remoteproc/mtk_scp.c +++ b/drivers/remoteproc/mtk_scp.c @@ -20,7 +20,6 @@ #include "mtk_common.h" #include "remoteproc_internal.h" -#define MAX_CODE_SIZE 0x50 #define SECTION_NAME_IPI_BUFFER ".ipi_buffer" /** @@ -94,14 +93,15 @@ static void scp_ipi_handler(struct mtk_scp *scp) { struct mtk_share_obj __iomem *rcv_obj = scp->recv_buf; struct scp_ipi_desc *ipi_desc = scp->ipi_desc; - u8 tmp_data[SCP_SHARE_BUFFER_SIZE]; scp_ipi_handler_t handler; u32 id = readl(_obj->id); u32 len = readl(_obj->len); + const struct mtk_scp_sizes_data *scp_sizes; - if (len > SCP_SHARE_BUFFER_SIZE) { - dev_err(scp->dev, "ipi message too long (len %d, max %d)", len, - SCP_SHARE_BUFFER_SIZE); + scp_sizes = scp->data->scp_sizes; + if (len > scp_sizes->ipi_share_buffer_size) { + dev_err(scp->dev, "ipi message too long (len %d, max %zd)", len, + scp_sizes->ipi_share_buffer_size); return; } if (id >= SCP_IPI_MAX) { @@ -117,8 +117,9 @@ static void scp_ipi_handler(struct mtk_scp *scp) return; } - memcpy_fromio(tmp_data, _obj->share_buf, len); - handler(tmp_data, len, ipi_desc[id].priv); + memset(scp->share_buf, 0, scp_sizes->ipi_share_buffer_size); + memcpy_fromio(scp->share_buf, _obj->share_buf, len); + handler(scp->share_buf, len, ipi_desc[id].priv); scp_ipi_unlock(scp, id); scp->ipi_id_ack[id] = true; @@ -133,6 +134,8 @@ static int scp_ipi_init(struct mtk_scp *scp, const struct firmware *fw) { int ret; size_t buf_sz, offset; + size_t share_buf_offset; + const struct mtk_scp_sizes_data *scp_sizes; /* read the ipi buf addr from FW itself first */ ret = scp_elf_read_ipi_buf_addr(scp, fw, ); @@ -152,12 +155,15 @@ static int scp_ipi_init(struct mtk_scp *scp, const struct firmware *fw) return -EOVERFLOW; } + scp_sizes = scp->data->scp_sizes; scp->recv_buf = (struct mtk_share_obj __iomem *) (scp->sram_base + offset); + share_buf_offset = sizeof(scp->recv_buf->id) + + sizeof(scp->recv_buf->len) + scp_sizes->ipi_share_buffer_size; scp->send_buf = (struct mtk_share_obj __iomem *) - (scp->sram_base + offset + sizeof(*scp->recv_buf)); - memset_io(scp->recv_buf, 0, sizeof(*scp->recv_buf)); - memset_io(scp->send_buf, 0, sizeof(*scp->send_buf)); + (scp->sram_base + offset + share_buf_offset); + memset_io(scp->recv_buf, 0, share_buf_offset); + memset_io(scp->send_buf, 0, share_buf_offset); return 0; } @@ -741,14 +747,16 @@ static int scp_start(struct rproc *rproc) static void *mt8183_scp_da_to_va(struct mtk_scp *scp, u64 da, size_t len) { int offset; + const struct mtk_scp_sizes_data *scp_sizes; + scp_sizes =
[PATCH v4 2/4] remoteproc: mediatek: Support MT8188 SCP core 1
MT8188 SCP has two RISC-V cores which is similar to MT8195 but without L1TCM. We've added MT8188-specific functions to configure L1TCM in multicore setups. Signed-off-by: Olivia Wen Reviewed-by: AngeloGioacchino Del Regno --- drivers/remoteproc/mtk_scp.c | 146 ++- 1 file changed, 143 insertions(+), 3 deletions(-) diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c index 6751829..6295148 100644 --- a/drivers/remoteproc/mtk_scp.c +++ b/drivers/remoteproc/mtk_scp.c @@ -471,6 +471,86 @@ static int mt8186_scp_before_load(struct mtk_scp *scp) return 0; } +static int mt8188_scp_l2tcm_on(struct mtk_scp *scp) +{ + struct mtk_scp_of_cluster *scp_cluster = scp->cluster; + + mutex_lock(_cluster->cluster_lock); + + if (scp_cluster->l2tcm_refcnt == 0) { + /* clear SPM interrupt, SCP2SPM_IPC_CLR */ + writel(0xff, scp->cluster->reg_base + MT8192_SCP2SPM_IPC_CLR); + + /* Power on L2TCM */ + scp_sram_power_on(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_0, 0); + scp_sram_power_on(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_1, 0); + scp_sram_power_on(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_2, 0); + scp_sram_power_on(scp->cluster->reg_base + MT8192_L1TCM_SRAM_PDN, 0); + } + + scp_cluster->l2tcm_refcnt += 1; + + mutex_unlock(_cluster->cluster_lock); + + return 0; +} + +static int mt8188_scp_before_load(struct mtk_scp *scp) +{ + writel(1, scp->cluster->reg_base + MT8192_CORE0_SW_RSTN_SET); + + mt8188_scp_l2tcm_on(scp); + + scp_sram_power_on(scp->cluster->reg_base + MT8192_CPU0_SRAM_PD, 0); + + /* enable MPU for all memory regions */ + writel(0xff, scp->cluster->reg_base + MT8192_CORE0_MEM_ATT_PREDEF); + + return 0; +} + +static int mt8188_scp_c1_before_load(struct mtk_scp *scp) +{ + u32 sec_ctrl; + struct mtk_scp *scp_c0; + struct mtk_scp_of_cluster *scp_cluster = scp->cluster; + + scp->data->scp_reset_assert(scp); + + mt8188_scp_l2tcm_on(scp); + + scp_sram_power_on(scp->cluster->reg_base + MT8195_CPU1_SRAM_PD, 0); + + /* enable MPU for all memory regions */ + writel(0xff, scp->cluster->reg_base + MT8195_CORE1_MEM_ATT_PREDEF); + + /* +* The L2TCM_OFFSET_RANGE and L2TCM_OFFSET shift the destination address +* on SRAM when SCP core 1 accesses SRAM. +* +* This configuration solves booting the SCP core 0 and core 1 from +* different SRAM address because core 0 and core 1 both boot from +* the head of SRAM by default. this must be configured before boot SCP core 1. +* +* The value of L2TCM_OFFSET_RANGE is from the viewpoint of SCP core 1. +* When SCP core 1 issues address within the range (L2TCM_OFFSET_RANGE), +* the address will be added with a fixed offset (L2TCM_OFFSET) on the bus. +* The shift action is tranparent to software. +*/ + writel(0, scp->cluster->reg_base + MT8195_L2TCM_OFFSET_RANGE_0_LOW); + writel(scp->sram_size, scp->cluster->reg_base + MT8195_L2TCM_OFFSET_RANGE_0_HIGH); + + scp_c0 = list_first_entry(_cluster->mtk_scp_list, struct mtk_scp, elem); + writel(scp->sram_phys - scp_c0->sram_phys, scp->cluster->reg_base + MT8195_L2TCM_OFFSET); + + /* enable SRAM offset when fetching instruction and data */ + sec_ctrl = readl(scp->cluster->reg_base + MT8195_SEC_CTRL); + sec_ctrl |= MT8195_CORE_OFFSET_ENABLE_I | MT8195_CORE_OFFSET_ENABLE_D; + writel(sec_ctrl, scp->cluster->reg_base + MT8195_SEC_CTRL); + + return 0; +} + static int mt8192_scp_before_load(struct mtk_scp *scp) { /* clear SPM interrupt, SCP2SPM_IPC_CLR */ @@ -717,6 +797,47 @@ static void mt8183_scp_stop(struct mtk_scp *scp) writel(0, scp->cluster->reg_base + MT8183_WDT_CFG); } +static void mt8188_scp_l2tcm_off(struct mtk_scp *scp) +{ + struct mtk_scp_of_cluster *scp_cluster = scp->cluster; + + mutex_lock(_cluster->cluster_lock); + + if (scp_cluster->l2tcm_refcnt > 0) + scp_cluster->l2tcm_refcnt -= 1; + + if (scp_cluster->l2tcm_refcnt == 0) { + /* Power off L2TCM */ + scp_sram_power_off(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_0, 0); + scp_sram_power_off(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_1, 0); + scp_sram_power_off(scp->cluster->reg_base + MT8192_L2TCM_SRAM_PD_2, 0); + scp_sram_power_off(scp->cluster->reg_base + MT8192_L1TCM_SRAM_PDN, 0); + } + + mutex_unlock(_cluster->cluster_lock); +} + +static void mt8188_scp_stop(struct mtk_scp *scp) +{ + mt8188_scp_l2tcm_off(scp); + + scp_sram_power_off(scp->cluster->reg_base + MT8192_CPU0_SRAM_PD, 0); + + /* Disable SCP watchdog */ + writel(0, scp->cluster->reg_base +
[PATCH v4 0/4] Support MT8188 SCP core 1
Change in v4: Updating the description of PATCH v4 4/4. Olivia Wen (4): dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP remoteproc: mediatek: Support MT8188 SCP core 1 remoteproc: mediatek: Support setting DRAM and IPI shared buffer sizes remoteproc: mediatek: Add IMGSYS IPI command .../devicetree/bindings/remoteproc/mtk,scp.yaml| 2 + drivers/remoteproc/mtk_common.h| 11 +- drivers/remoteproc/mtk_scp.c | 230 +++-- drivers/remoteproc/mtk_scp_ipi.c | 7 +- include/linux/remoteproc/mtk_scp.h | 1 + 5 files changed, 225 insertions(+), 26 deletions(-) -- 2.6.4
[PATCH v4 4/4] remoteproc: mediatek: Add IMGSYS IPI command
Add an IPI command definition for communication with IMGSYS through SCP mailbox. Signed-off-by: Olivia Wen --- include/linux/remoteproc/mtk_scp.h | 1 + 1 file changed, 1 insertion(+) diff --git a/include/linux/remoteproc/mtk_scp.h b/include/linux/remoteproc/mtk_scp.h index 7c2b7cc9..344ff41 100644 --- a/include/linux/remoteproc/mtk_scp.h +++ b/include/linux/remoteproc/mtk_scp.h @@ -43,6 +43,7 @@ enum scp_ipi_id { SCP_IPI_CROS_HOST_CMD, SCP_IPI_VDEC_LAT, SCP_IPI_VDEC_CORE, + SCP_IPI_IMGSYS_CMD, SCP_IPI_NS_SERVICE = 0xFF, SCP_IPI_MAX = 0x100, }; -- 2.6.4
[PATCH v4 1/4] dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP
Under different applications, the MT8188 SCP can be used as single-core or dual-core. Signed-off-by: Olivia Wen Acked-by: Krzysztof Kozlowski Reviewed-by: AngeloGioacchino Del Regno --- Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml index 507f98f..c5dc3c2 100644 --- a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml +++ b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml @@ -19,6 +19,7 @@ properties: - mediatek,mt8183-scp - mediatek,mt8186-scp - mediatek,mt8188-scp + - mediatek,mt8188-scp-dual - mediatek,mt8192-scp - mediatek,mt8195-scp - mediatek,mt8195-scp-dual @@ -194,6 +195,7 @@ allOf: properties: compatible: enum: +- mediatek,mt8188-scp-dual - mediatek,mt8195-scp-dual then: properties: -- 2.6.4
Re: [PATCH v2 0/4] vhost: Cleanup
On 4/30/24 04:50, Michael S. Tsirkin wrote: On Mon, Apr 29, 2024 at 08:13:56PM +1000, Gavin Shan wrote: This is suggested by Michael S. Tsirkin according to [1] and the goal is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it, the caller of the function needn't to worry about memory barriers. Since we're here, other cleanups are also applied. [1] https://lore.kernel.org/virtualization/20240327155750-mutt-send-email-...@kernel.org/ Patch 1 makes some sense, gave some comments. Rest I think we should just drop. Sure, v3 has been sent with PATCH[v2 2/3/4] dropped. Please take a look when you getting a chance. v3: https://lore.kernel.org/virtualization/20240429232748.642356-1-gs...@redhat.com/T/#u Thanks, Gavin
[PATCH v3] vhost: Improve vhost_get_avail_idx() with smp_rmb()
From: "Michael S. Tsirkin" All the callers of vhost_get_avail_idx() are concerned with the memory barrier, imposed by smp_rmb() to ensure the order of the available ring entry read and avail_idx read. Improve vhost_get_avail_idx() so that smp_rmb() is executed when the avail_idx is accessed. With it, the callers needn't to worry about the memory barrier. As a side benefit, we also validate the index on all paths now, which will hopefully help to catch future errors earlier. Note that current code is inconsistent in how the errors are handled. They are treated as an empty ring in some places, but as non-empty ring in other places. This patch doesn't attempt to change the existing behaviour. No functional change intended. Signed-off-by: Michael S. Tsirkin Reviewed-by: Gavin Shan Acked-by: Will Deacon --- v3: Improved commit log and comments as Michael suggested --- drivers/vhost/vhost.c | 105 +- 1 file changed, 42 insertions(+), 63 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 8995730ce0bf..60d9592eff7b 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1290,10 +1290,36 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) mutex_unlock(>vqs[i]->mutex); } -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, - __virtio16 *idx) +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) { - return vhost_get_avail(vq, *idx, >avail->idx); + __virtio16 idx; + int r; + + r = vhost_get_avail(vq, idx, >avail->idx); + if (unlikely(r < 0)) { + vq_err(vq, "Failed to access available index at %p (%d)\n", + >avail->idx, r); + return r; + } + + /* Check it isn't doing very strange thing with available indexes */ + vq->avail_idx = vhost16_to_cpu(vq, idx); + if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) { + vq_err(vq, "Invalid available index change from %u to %u", + vq->last_avail_idx, vq->avail_idx); + return -EINVAL; + } + + /* We're done if there is nothing new */ + if (vq->avail_idx == vq->last_avail_idx) + return 0; + + /* +* We updated vq->avail_idx so we need a memory barrier between +* the index read above and the caller reading avail ring entries. +*/ + smp_rmb(); + return 1; } static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, @@ -2498,38 +2524,17 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, { struct vring_desc desc; unsigned int i, head, found = 0; - u16 last_avail_idx; - __virtio16 avail_idx; + u16 last_avail_idx = vq->last_avail_idx; __virtio16 ring_head; int ret, access; - /* Check it isn't doing very strange things with descriptor numbers. */ - last_avail_idx = vq->last_avail_idx; - if (vq->avail_idx == vq->last_avail_idx) { - if (unlikely(vhost_get_avail_idx(vq, _idx))) { - vq_err(vq, "Failed to access avail idx at %p\n", - >avail->idx); - return -EFAULT; - } - vq->avail_idx = vhost16_to_cpu(vq, avail_idx); - - if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) { - vq_err(vq, "Guest moved avail index from %u to %u", - last_avail_idx, vq->avail_idx); - return -EFAULT; - } + ret = vhost_get_avail_idx(vq); + if (unlikely(ret < 0)) + return ret; - /* If there's nothing new since last we looked, return -* invalid. -*/ - if (vq->avail_idx == last_avail_idx) + if (!ret) return vq->num; - - /* Only get avail ring entries after they have been -* exposed by guest. -*/ - smp_rmb(); } /* Grab the next descriptor number they're advertising, and increment @@ -2790,35 +2795,21 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n); /* return true if we're sure that avaiable ring is empty */ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq) { - __virtio16 avail_idx; int r; if (vq->avail_idx != vq->last_avail_idx) return false; - r = vhost_get_avail_idx(vq, _idx); - if (unlikely(r)) - return false; - - vq->avail_idx = vhost16_to_cpu(vq, avail_idx); - if (vq->avail_idx != vq->last_avail_idx) { - /* Since we have updated avail_idx, the following -* call to vhost_get_vq_desc() will read available -* ring
Re: [PATCH v2 1/4] vhost: Improve vhost_get_avail_idx() with smp_rmb()
On 4/30/24 04:44, Michael S. Tsirkin wrote: On Mon, Apr 29, 2024 at 08:13:57PM +1000, Gavin Shan wrote: From: "Michael S. Tsirkin" All the callers of vhost_get_avail_idx() are concerned to the memory *with* the memory barrier Thanks, will be corrected in v3. barrier, imposed by smp_rmb() to ensure the order of the available ring entry read and avail_idx read. Improve vhost_get_avail_idx() so that smp_rmb() is executed when the avail_idx is advanced. accessed, not advanced. guest advances it. smp_rmb() is executed only when vp->last_avail_idx != vp->avail_idx. I used 'advanced' to indicate the condition. 'accessed' is also correct since the 'advanced' case included to 'accessed' case. With it, the callers needn't to worry about the memory barrier. No functional change intended. I'd add: As a side benefit, we also validate the index on all paths now, which will hopefully help catch future errors earlier. Note: current code is inconsistent in how it handles errors: some places treat it as an empty ring, others - non empty. This patch does not attempt to change the existing behaviour. Ok, I will integrate this to v3's commit log. Signed-off-by: Michael S. Tsirkin [gshan: repainted vhost_get_avail_idx()] ?repainted? It's just a indicator to say the changes aren't simply copied from [1]. Some follow-up changes are also applied. So it needs to be reviewed. I will drop this in v3. Reviewed-by: Gavin Shan Acked-by: Will Deacon --- drivers/vhost/vhost.c | 106 +- 1 file changed, 42 insertions(+), 64 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 8995730ce0bf..7aa623117aab 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1290,10 +1290,36 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) mutex_unlock(>vqs[i]->mutex); } -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, - __virtio16 *idx) +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) { - return vhost_get_avail(vq, *idx, >avail->idx); + __virtio16 idx; + int r; + + r = vhost_get_avail(vq, idx, >avail->idx); + if (unlikely(r < 0)) { + vq_err(vq, "Failed to access available index at %p (%d)\n", + >avail->idx, r); + return r; + } + + /* Check it isn't doing very strange thing with available indexes */ + vq->avail_idx = vhost16_to_cpu(vq, idx); + if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) { + vq_err(vq, "Invalid available index change from %u to %u", + vq->last_avail_idx, vq->avail_idx); + return -EINVAL; + } + + /* We're done if there is nothing new */ + if (vq->avail_idx == vq->last_avail_idx) + return 0; + + /* +* We updated vq->avail_idx so we need a memory barrier between +* the index read above and the caller reading avail ring entries. +*/ + smp_rmb(); + return 1; } static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, @@ -2498,38 +2524,17 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, { struct vring_desc desc; unsigned int i, head, found = 0; - u16 last_avail_idx; - __virtio16 avail_idx; + u16 last_avail_idx = vq->last_avail_idx; __virtio16 ring_head; int ret, access; - /* Check it isn't doing very strange things with descriptor numbers. */ - last_avail_idx = vq->last_avail_idx; - if (vq->avail_idx == vq->last_avail_idx) { - if (unlikely(vhost_get_avail_idx(vq, _idx))) { - vq_err(vq, "Failed to access avail idx at %p\n", - >avail->idx); - return -EFAULT; - } - vq->avail_idx = vhost16_to_cpu(vq, avail_idx); - - if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) { - vq_err(vq, "Guest moved avail index from %u to %u", - last_avail_idx, vq->avail_idx); - return -EFAULT; - } + ret = vhost_get_avail_idx(vq); + if (unlikely(ret < 0)) + return ret; - /* If there's nothing new since last we looked, return -* invalid. -*/ - if (vq->avail_idx == last_avail_idx) + if (!ret) return vq->num; - - /* Only get avail ring entries after they have been -* exposed by guest. -*/ - smp_rmb(); } /* Grab the next descriptor number they're advertising, and increment @@ -2790,35 +2795,20 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n); /* return true if we're sure that
Re: [PATCH v2 2/4] vhost: Drop variable last_avail_idx in vhost_get_vq_desc()
On 4/30/24 04:45, Michael S. Tsirkin wrote: On Mon, Apr 29, 2024 at 08:13:58PM +1000, Gavin Shan wrote: The local variable @last_avail_idx is equivalent to vq->last_avail_idx. So the code can be simplified a bit by dropping the local variable @last_avail_idx. No functional change intended. Signed-off-by: Gavin Shan --- drivers/vhost/vhost.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 7aa623117aab..b278c0333a66 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -2524,7 +2524,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, { struct vring_desc desc; unsigned int i, head, found = 0; - u16 last_avail_idx = vq->last_avail_idx; __virtio16 ring_head; int ret, access; @@ -2539,10 +2538,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, /* Grab the next descriptor number they're advertising, and increment * the index we've seen. */ - if (unlikely(vhost_get_avail_head(vq, _head, last_avail_idx))) { + if (unlikely(vhost_get_avail_head(vq, _head, vq->last_avail_idx))) { vq_err(vq, "Failed to read head: idx %d address %p\n", - last_avail_idx, - >avail->ring[last_avail_idx % vq->num]); + vq->last_avail_idx, + >avail->ring[vq->last_avail_idx % vq->num]); return -EFAULT; } I don't see the big advantage and the line is long now. The point is to avoid the local variable @last_avail_idx since it's equivalent to vq->last_avail_idx, as stated in the commit log. Besides, it paves the way for PATCH[v2 3/4] where the whole logic fetching the head and sanity check is moved to vhost_get_avail_head(), so that vhost_get_vq_desc() is simplified I will drop PATCH[2, 3, 4] as you suggested. Thanks, Gavin
Re: [PATCH RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code
On Wed, Apr 24, 2024 at 5:02 PM Andrii Nakryiko wrote: > > At the lowest level, rethook-based kretprobes on x86-64 architecture go > through arch_rethoook_trampoline() function, manually written in > assembly, which calls into a simple arch_rethook_trampoline_callback() > function, written in C, and only doing a few straightforward field > assignments, before calling further into rethook_trampoline_handler(), > which handles kretprobe callbacks generically. > > Looking at simplicity of arch_rethook_trampoline_callback(), it seems > not really worthwhile to spend an extra function call just to do 4 or > 5 assignments. As such, this patch proposes to "inline" > arch_rethook_trampoline_callback() into arch_rethook_trampoline() by > manually implementing it in an assembly code. > > This has two motivations. First, we do get a bit of runtime speed up by > avoiding function calls. Using BPF selftests's bench tool, we see > 0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe > triggering code path: > > BEFORE (latest probes/for-next) > === > kretprobe : 10.455 ± 0.024M/s > kretprobe-multi: 11.150 ± 0.012M/s > > AFTER (probes/for-next + this patch) > > kretprobe : 10.540 ± 0.009M/s (+0.8%) > kretprobe-multi: 11.219 ± 0.042M/s (+0.6%) > > Second, and no less importantly for some specialized use cases, this > avoids unnecessarily "polluting" LBR records with an extra function call > (recorded as a jump by CPU). This is the case for the retsnoop ([0]) > tool, which relies havily on capturing LBR records to provide users with > lots of insight into kernel internals. > > This RFC patch is only inlining this function for x86-64, but it's > possible to do that for 32-bit x86 arch as well and then remove > arch_rethook_trampoline_callback() implementation altogether. Please let > me know if this change is acceptable and whether I should complete it > with 32-bit "inlining" as well. Thanks! > > [0] > https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr > > Signed-off-by: Andrii Nakryiko > --- > arch/x86/kernel/asm-offsets_64.c | 4 > arch/x86/kernel/rethook.c| 37 +++- > 2 files changed, 36 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kernel/asm-offsets_64.c > b/arch/x86/kernel/asm-offsets_64.c > index bb65371ea9df..5c444abc540c 100644 > --- a/arch/x86/kernel/asm-offsets_64.c > +++ b/arch/x86/kernel/asm-offsets_64.c > @@ -42,6 +42,10 @@ int main(void) > ENTRY(r14); > ENTRY(r15); > ENTRY(flags); > + ENTRY(ip); > + ENTRY(cs); > + ENTRY(ss); > + ENTRY(orig_ax); > BLANK(); > #undef ENTRY > > diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c > index 8a1c0111ae79..3e1c01beebd1 100644 > --- a/arch/x86/kernel/rethook.c > +++ b/arch/x86/kernel/rethook.c > @@ -6,6 +6,7 @@ > #include > #include > #include > +#include > > #include "kprobes/common.h" > > @@ -34,10 +35,36 @@ asm( > " pushq %rsp\n" > " pushfq\n" > SAVE_REGS_STRING > - " movq %rsp, %rdi\n" > - " call arch_rethook_trampoline_callback\n" > + " movq %rsp, %rdi\n" /* $rdi points to regs */ > + /* fixup registers */ > + /* regs->cs = __KERNEL_CS; */ > + " movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) > "(%rdi)\n" > + /* regs->ip = (unsigned long)_rethook_trampoline; */ > + " movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) > "(%rdi)\n" > + /* regs->orig_ax = ~0UL; */ > + " movq $0x, " __stringify(pt_regs_orig_ax) > "(%rdi)\n" > + /* regs->sp += 2*sizeof(long); */ > + " addq $16, " __stringify(pt_regs_sp) "(%rdi)\n" > + /* 2nd arg is frame_pointer = (long *)(regs + 1); */ > + " lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n" BTW, all this __stringify() ugliness can be avoided if we move this assembly into its own .S file, like lots of other assembly functions in arch/x86/kernel subdir. That has another benefit of generating better line information in DWARF for those assembly instructions. It's lots more work, so before I do this, I'd like to get confirmation that this change is acceptable in principle. > + /* > +* The return address at 'frame_pointer' is recovered by the > +* arch_rethook_fixup_return() which called from this > +* rethook_trampoline_handler(). > +*/ > + " call rethook_trampoline_handler\n" > + /* > +* Copy FLAGS to 'pt_regs::ss' so we can do RET right after POPF. > +* > +* We don't save/restore %rax below, because we ignore > +* rethook_trampoline_handler result. > +* > +* *(unsigned long *)>ss = regs->flags; > +*/ > + " mov " __stringify(pt_regs_flags) "(%rsp), %rax\n" >
Re: [PATCH bpf-next] bpf: add support to read cpu_entry in bpf program
On 4/27/24 8:18 AM, Florian Lehner wrote: Add new field "cpu_entry" to bpf_perf_event_data which could be read by bpf programs attached to perf events. The value contains the CPU value recorded by specifying sample_type with PERF_SAMPLE_CPU when calling perf_event_open(). You can use bpf_cast_to_kern_ctx kfunc which can cast 'struct bpf_perf_event_data' ctx to 'struct bpf_perf_event_data_kern'. struct bpf_perf_event_data_kern { bpf_user_pt_regs_t *regs; struct perf_sample_data *data; struct perf_event *event; }; You can access bpf_perf_event_data_kern->data and then to access 'cpu_entry' field. Signed-off-by: Florian Lehner --- include/uapi/linux/bpf_perf_event.h | 4 kernel/trace/bpf_trace.c | 13 + tools/include/uapi/linux/bpf_perf_event.h | 4 3 files changed, 21 insertions(+) diff --git a/include/uapi/linux/bpf_perf_event.h b/include/uapi/linux/bpf_perf_event.h index eb1b9d21250c..4856b4396ece 100644 --- a/include/uapi/linux/bpf_perf_event.h +++ b/include/uapi/linux/bpf_perf_event.h @@ -14,6 +14,10 @@ struct bpf_perf_event_data { bpf_user_pt_regs_t regs; __u64 sample_period; __u64 addr; + struct { + u32 cpu; + u32 reserved; + } cpu_entry; }; #endif /* _UAPI__LINUX_BPF_PERF_EVENT_H__ */ diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index afb232b1d7c2..2b303221af5c 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -2176,6 +2176,11 @@ static bool pe_prog_is_valid_access(int off, int size, enum bpf_access_type type if (!bpf_ctx_narrow_access_ok(off, size, size_u64)) return false; break; + case bpf_ctx_range(struct bpf_perf_event_data, cpu_entry): + bpf_ctx_record_field_size(info, size_u64); + if (!bpf_ctx_narrow_access_ok(off, size, size_u64)) + return false; + break; default: if (size != sizeof(long)) return false; @@ -2208,6 +2213,14 @@ static u32 pe_prog_convert_ctx_access(enum bpf_access_type type, bpf_target_off(struct perf_sample_data, addr, 8, target_size)); break; + case offsetof(struct bpf_perf_event_data, cpu_entry): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_perf_event_data_kern, + data), si->dst_reg, si->src_reg, + offsetof(struct bpf_perf_event_data_kern, data)); + *insn++ = BPF_LDX_MEM(BPF_DW, si->dst_reg, si->dst_reg, + bpf_target_off(struct perf_sample_data, cpu_entry, 8, +target_size)); + break; default: *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_perf_event_data_kern, regs), si->dst_reg, si->src_reg, diff --git a/tools/include/uapi/linux/bpf_perf_event.h b/tools/include/uapi/linux/bpf_perf_event.h index eb1b9d21250c..4856b4396ece 100644 --- a/tools/include/uapi/linux/bpf_perf_event.h +++ b/tools/include/uapi/linux/bpf_perf_event.h @@ -14,6 +14,10 @@ struct bpf_perf_event_data { bpf_user_pt_regs_t regs; __u64 sample_period; __u64 addr; + struct { + u32 cpu; + u32 reserved; + } cpu_entry; }; #endif /* _UAPI__LINUX_BPF_PERF_EVENT_H__ */
Re: [PATCH v12 12/14] x86/sgx: Turn on per-cgroup EPC reclamation
/* @@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag */ static inline bool sgx_can_reclaim(void) { - return !list_empty(_global_lru.reclaimable); + return !sgx_cgroup_lru_empty(misc_cg_root()) || + !list_empty(_global_lru.reclaimable); } Shouldn't this be: if (IS_ENABLED(CONFIG_CGROUP_MISC)) return !sgx_cgroup_lru_empty(misc_cg_root()); else return !list_empty(_global_lru.reclaimable); ? In this way, it is consistent with the sgx_reclaim_pages_global() below. I changed to this way because sgx_cgroup_lru_empty() is now defined in both KConfig cases. And it seems better to minimize use of the KConfig variables based on earlier feedback (some are yours). Don't really have strong preference here. So let me know one way of the other. But IMHO your code could be confusing, e.g., it can be interpreted as: The EPC pages can be managed by both the cgroup LRUs and the sgx_global_lru simultaneously at runtime when CONFIG_CGROUP_MISC is on. Which is not true. So we should make the code clearly reflect the true behaviour.
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Sun, Apr 28, 2024 at 4:25 PM Steven Rostedt wrote: > > On Thu, 25 Apr 2024 13:31:53 -0700 > Andrii Nakryiko wrote: > > I'm just coming back from Japan (work and then a vacation), and > catching up on my email during the 6 hour layover in Detroit. > > > Hey Masami, > > > > I can't really review most of that code as I'm completely unfamiliar > > with all those inner workings of fprobe/ftrace/function_graph. I left > > a few comments where there were somewhat more obvious BPF-related > > pieces. > > > > But I also did run our BPF benchmarks on probes/for-next as a baseline > > and then with your series applied on top. Just to see if there are any > > regressions. I think it will be a useful data point for you. > > > > You should be already familiar with the bench tool we have in BPF > > selftests (I used it on some other patches for your tree). > > I should get familiar with your tools too. > It's a nifty and self-contained tool to do some micro-benchmarking, I replied to Masami with a few details on how to build and use it. > > > > BASELINE > > > > kprobe : 24.634 ± 0.205M/s > > kprobe-multi : 28.898 ± 0.531M/s > > kretprobe : 10.478 ± 0.015M/s > > kretprobe-multi: 11.012 ± 0.063M/s > > > > THIS PATCH SET ON TOP > > = > > kprobe : 25.144 ± 0.027M/s (+2%) > > kprobe-multi : 28.909 ± 0.074M/s > > kretprobe :9.482 ± 0.008M/s (-9.5%) > > kretprobe-multi: 13.688 ± 0.027M/s (+24%) > > > > These numbers are pretty stable and look to be more or less representative. > > Thanks for running this. > > > > > As you can see, kprobes got a bit faster, kprobe-multi seems to be > > about the same, though. > > > > Then (I suppose they are "legacy") kretprobes got quite noticeably > > slower, almost by 10%. Not sure why, but looks real after re-running > > benchmarks a bunch of times and getting stable results. > > > > On the other hand, multi-kretprobes got significantly faster (+24%!). > > Again, I don't know if it is expected or not, but it's a nice > > improvement. > > > > If you have any idea why kretprobes would get so much slower, it would > > be nice to look into that and see if you can mitigate the regression > > somehow. Thanks! > > My guess is that this patch set helps generic use cases for tracing the > return of functions, but will likely add more overhead for single use > cases. That is, kretprobe is made to be specific for a single function, > but kretprobe-multi is more generic. Hence the generic version will > improve at the sacrifice of the specific function. I did expect as much. > > That said, I think there's probably a lot of low hanging fruit that can > be done to this series to help improve the kretprobe performance. I'm > not sure we can get back to the baseline, but I'm hoping we can at > least make it much better than that 10% slowdown. That would certainly be appreciated, thanks! But I'm also considering trying to switch to multi-kprobe/kretprobe automatically on libbpf side, whenever possible, so that users can get the best performance. There might still be situations where this can't be done, so singular kprobe/kretprobe can't be completely deprecated, but multi variants seems to be universally faster, so I'm going to make them a default (I need to handle some backwards compat aspect, but that's libbpf-specific stuff you shouldn't be concerned with). > > I'll be reviewing this patch set this week as I recover from jetlag. > > -- Steve
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu wrote: > > Hi Andrii, > > On Thu, 25 Apr 2024 13:31:53 -0700 > Andrii Nakryiko wrote: > > > Hey Masami, > > > > I can't really review most of that code as I'm completely unfamiliar > > with all those inner workings of fprobe/ftrace/function_graph. I left > > a few comments where there were somewhat more obvious BPF-related > > pieces. > > > > But I also did run our BPF benchmarks on probes/for-next as a baseline > > and then with your series applied on top. Just to see if there are any > > regressions. I think it will be a useful data point for you. > > Thanks for testing! > > > > > You should be already familiar with the bench tool we have in BPF > > selftests (I used it on some other patches for your tree). > > What patches we need? > You mean for this `bench` tool? They are part of BPF selftests (under tools/testing/selftests/bpf), you can build them by running: $ make RELEASE=1 -j$(nproc) bench After that you'll get a self-container `bench` binary, which has all the self-contained benchmarks. You might also find a small script (benchs/run_bench_trigger.sh inside BPF selftests directory) helpful, it collects final summary of the benchmark run and optionally accepts a specific set of benchmarks. So you can use it like this: $ benchs/run_bench_trigger.sh kprobe kprobe-multi kprobe : 18.731 ± 0.639M/s kprobe-multi : 23.938 ± 0.612M/s By default it will run a wider set of benchmarks (no uprobes, but a bunch of extra fentry/fexit tests and stuff like this). > > > > BASELINE > > > > kprobe : 24.634 ± 0.205M/s > > kprobe-multi : 28.898 ± 0.531M/s > > kretprobe : 10.478 ± 0.015M/s > > kretprobe-multi: 11.012 ± 0.063M/s > > > > THIS PATCH SET ON TOP > > = > > kprobe : 25.144 ± 0.027M/s (+2%) > > kprobe-multi : 28.909 ± 0.074M/s > > kretprobe :9.482 ± 0.008M/s (-9.5%) > > kretprobe-multi: 13.688 ± 0.027M/s (+24%) > > This looks good. Kretprobe should also use kretprobe-multi (fprobe) > eventually because it should be a single callback version of > kretprobe-multi. > > > > > These numbers are pretty stable and look to be more or less representative. > > > > As you can see, kprobes got a bit faster, kprobe-multi seems to be > > about the same, though. > > > > Then (I suppose they are "legacy") kretprobes got quite noticeably > > slower, almost by 10%. Not sure why, but looks real after re-running > > benchmarks a bunch of times and getting stable results. > > Hmm, kretprobe on x86 should use ftrace + rethook even with my series. > So nothing should be changed. Maybe cache access pattern has been > changed? > I'll check it with tracefs (to remove the effect from bpf related changes) > > > > > On the other hand, multi-kretprobes got significantly faster (+24%!). > > Again, I don't know if it is expected or not, but it's a nice > > improvement. > > Thanks! > > > > > If you have any idea why kretprobes would get so much slower, it would > > be nice to look into that and see if you can mitigate the regression > > somehow. Thanks! > > OK, let me check it. > > Thank you! > > > > > > > > 51 files changed, 2325 insertions(+), 882 deletions(-) > > > create mode 100644 > > > tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc > > > > > > -- > > > Masami Hiramatsu (Google) > > > > > > -- > Masami Hiramatsu (Google)
Re: [PATCH v2 0/4] vhost: Cleanup
On Mon, Apr 29, 2024 at 08:13:56PM +1000, Gavin Shan wrote: > This is suggested by Michael S. Tsirkin according to [1] and the goal > is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it, > the caller of the function needn't to worry about memory barriers. Since > we're here, other cleanups are also applied. > > [1] > https://lore.kernel.org/virtualization/20240327155750-mutt-send-email-...@kernel.org/ Patch 1 makes some sense, gave some comments. Rest I think we should just drop. > PATCH[1] improves vhost_get_avail_idx() so that smp_rmb() is applied if > needed. Besides, the sanity checks on the retrieved available > queue index are also squeezed to vhost_get_avail_idx() > PATCH[2] drops the local variable @last_avail_idx since it's equivalent > to vq->last_avail_idx > PATCH[3] improves vhost_get_avail_head(), similar to what we're doing > for vhost_get_avail_idx(), so that the relevant sanity checks > on the head are squeezed to vhost_get_avail_head() > PATCH[4] Reformat vhost_{get, put}_user() by using tab instead of space > as the terminator for each line > > Gavin Shan (3): > vhost: Drop variable last_avail_idx in vhost_get_vq_desc() > vhost: Improve vhost_get_avail_head() > vhost: Reformat vhost_{get, put}_user() > > Michael S. Tsirkin (1): > vhost: Improve vhost_get_avail_idx() with smp_rmb() > > drivers/vhost/vhost.c | 215 +++--- > 1 file changed, 97 insertions(+), 118 deletions(-) > > Changelog > = > v2: > * Improve vhost_get_avail_idx() as Michael suggested in [1] > as above (Michael) > * Correct @head's type from 'unsigned int' to 'int' > (l...@intel.com) > > -- > 2.44.0
Re: [PATCH v2 4/4] vhost: Reformat vhost_{get, put}_user()
On Mon, Apr 29, 2024 at 08:14:00PM +1000, Gavin Shan wrote: > Reformat the macros to use tab as the terminator for each line so > that it looks clean. > > No functional change intended. > > Signed-off-by: Gavin Shan Just messes up history for no real gain. > --- > drivers/vhost/vhost.c | 60 +-- > 1 file changed, 30 insertions(+), 30 deletions(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index 4ddb9ec2fe46..c1ed5e750521 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -1207,21 +1207,22 @@ static inline void __user *__vhost_get_user(struct > vhost_virtqueue *vq, > return __vhost_get_user_slow(vq, addr, size, type); > } > > -#define vhost_put_user(vq, x, ptr) \ > -({ \ > - int ret; \ > - if (!vq->iotlb) { \ > - ret = __put_user(x, ptr); \ > - } else { \ > - __typeof__(ptr) to = \ > +#define vhost_put_user(vq, x, ptr) \ > +({ \ > + int ret;\ > + if (!vq->iotlb) { \ > + ret = __put_user(x, ptr); \ > + } else {\ > + __typeof__(ptr) to =\ > (__typeof__(ptr)) __vhost_get_user(vq, ptr, \ > - sizeof(*ptr), VHOST_ADDR_USED); \ > - if (to != NULL) \ > - ret = __put_user(x, to); \ > - else \ > - ret = -EFAULT; \ > - } \ > - ret; \ > + sizeof(*ptr), \ > + VHOST_ADDR_USED); \ > + if (to != NULL) \ > + ret = __put_user(x, to);\ > + else\ > + ret = -EFAULT; \ > + } \ > + ret;\ > }) > > static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) > @@ -1252,22 +1253,21 @@ static inline int vhost_put_used_idx(struct > vhost_virtqueue *vq) > >used->idx); > } > > -#define vhost_get_user(vq, x, ptr, type) \ > -({ \ > - int ret; \ > - if (!vq->iotlb) { \ > - ret = __get_user(x, ptr); \ > - } else { \ > - __typeof__(ptr) from = \ > - (__typeof__(ptr)) __vhost_get_user(vq, ptr, \ > -sizeof(*ptr), \ > -type); \ > - if (from != NULL) \ > - ret = __get_user(x, from); \ > - else \ > - ret = -EFAULT; \ > - } \ > - ret; \ > +#define vhost_get_user(vq, x, ptr, type) \ > +({ \ > + int ret;\ > + if (!vq->iotlb) { \ > + ret = __get_user(x, ptr); \ > + } else {\ > + __typeof__(ptr) from = \ > + (__typeof__(ptr)) __vhost_get_user(vq, ptr, \ > + sizeof(*ptr), type);\ > + if (from != NULL) \ > + ret = __get_user(x, from); \ > + else\ > + ret = -EFAULT; \ > + } \ > + ret;\ > }) > > #define vhost_get_avail(vq, x, ptr) \ > -- > 2.44.0
Re: [PATCH v2 3/4] vhost: Improve vhost_get_avail_head()
On Mon, Apr 29, 2024 at 08:13:59PM +1000, Gavin Shan wrote: > Improve vhost_get_avail_head() so that the head or errno is returned. > With it, the relevant sanity checks are squeezed to vhost_get_avail_head() > and vhost_get_vq_desc() is further simplified. > > No functional change intended. > > Signed-off-by: Gavin Shan I don't see what does this moving code around achieve. > --- > drivers/vhost/vhost.c | 50 ++- > 1 file changed, 26 insertions(+), 24 deletions(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index b278c0333a66..4ddb9ec2fe46 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -1322,11 +1322,27 @@ static inline int vhost_get_avail_idx(struct > vhost_virtqueue *vq) > return 1; > } > > -static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, > -__virtio16 *head, int idx) > +static inline int vhost_get_avail_head(struct vhost_virtqueue *vq) > { > - return vhost_get_avail(vq, *head, > ->avail->ring[idx & (vq->num - 1)]); > + __virtio16 head; > + int r; > + > + r = vhost_get_avail(vq, head, > + >avail->ring[vq->last_avail_idx & (vq->num - > 1)]); > + if (unlikely(r)) { > + vq_err(vq, "Failed to read head: index %u address %p\n", > +vq->last_avail_idx, > +>avail->ring[vq->last_avail_idx & (vq->num - 1)]); > + return r; > + } > + > + r = vhost16_to_cpu(vq, head); > + if (unlikely(r >= vq->num)) { > + vq_err(vq, "Invalid head %d (%u)\n", r, vq->num); > + return -EINVAL; > + } > + > + return r; > } > > static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq, > @@ -2523,9 +2539,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, > struct vhost_log *log, unsigned int *log_num) > { > struct vring_desc desc; > - unsigned int i, head, found = 0; > - __virtio16 ring_head; > - int ret, access; > + unsigned int i, found = 0; > + int head, ret, access; > > if (vq->avail_idx == vq->last_avail_idx) { > ret = vhost_get_avail_idx(vq); > @@ -2536,23 +2551,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, > return vq->num; > } > > - /* Grab the next descriptor number they're advertising, and increment > - * the index we've seen. */ > - if (unlikely(vhost_get_avail_head(vq, _head, vq->last_avail_idx))) > { > - vq_err(vq, "Failed to read head: idx %d address %p\n", > -vq->last_avail_idx, > ->avail->ring[vq->last_avail_idx % vq->num]); > - return -EFAULT; > - } > - > - head = vhost16_to_cpu(vq, ring_head); > - > - /* If their number is silly, that's an error. */ > - if (unlikely(head >= vq->num)) { > - vq_err(vq, "Guest says index %u > %u is available", > -head, vq->num); > - return -EINVAL; > - } > + /* Grab the next descriptor number they're advertising */ > + head = vhost_get_avail_head(vq); > + if (unlikely(head < 0)) > + return head; > > /* When we start there are none of either input nor output. */ > *out_num = *in_num = 0; > -- > 2.44.0
Re: [PATCH v2 2/4] vhost: Drop variable last_avail_idx in vhost_get_vq_desc()
On Mon, Apr 29, 2024 at 08:13:58PM +1000, Gavin Shan wrote: > The local variable @last_avail_idx is equivalent to vq->last_avail_idx. > So the code can be simplified a bit by dropping the local variable > @last_avail_idx. > > No functional change intended. > > Signed-off-by: Gavin Shan > --- > drivers/vhost/vhost.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index 7aa623117aab..b278c0333a66 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -2524,7 +2524,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, > { > struct vring_desc desc; > unsigned int i, head, found = 0; > - u16 last_avail_idx = vq->last_avail_idx; > __virtio16 ring_head; > int ret, access; > > @@ -2539,10 +2538,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, > > /* Grab the next descriptor number they're advertising, and increment >* the index we've seen. */ > - if (unlikely(vhost_get_avail_head(vq, _head, last_avail_idx))) { > + if (unlikely(vhost_get_avail_head(vq, _head, vq->last_avail_idx))) > { > vq_err(vq, "Failed to read head: idx %d address %p\n", > -last_avail_idx, > ->avail->ring[last_avail_idx % vq->num]); > +vq->last_avail_idx, > +>avail->ring[vq->last_avail_idx % vq->num]); > return -EFAULT; > } I don't see the big advantage and the line is long now. > > -- > 2.44.0
Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr
On Sun, 28 Apr 2024 20:28:37 -0400 Steven Rostedt wrote: > > Looking for any suggestion or solution, appreciate. > > Yeah, I do not think eventfs should be involved in this. It needs to be > protected at a higher level (in the synthetic/dynamic event code). > > I'm just coming back from Japan, and I'll need to take a deeper look at > this after I recover from my jetlag. OK, so I guess the eventfs nodes need an optional release callback. Here's the right way to do that. I added a "release" function to the passed in entry array that allows for calling a release function when the eventfs_inode is freed. Then in code for creating events, I call event_file_get() on the file being assigned and have the freeing of the "enable" file have the release function that will call event_file_put() on that file structure. Does this fix it for you? -- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 894c6ca1e500..dc97c19f9e0a 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -84,10 +84,17 @@ enum { static void release_ei(struct kref *ref) { struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref); + const struct eventfs_entry *entry; struct eventfs_root_inode *rei; WARN_ON_ONCE(!ei->is_freed); + for (int i = 0; i < ei->nr_entries; i++) { + entry = >entries[i]; + if (entry->release) + entry->release(entry->name, ei->data); + } + kfree(ei->entry_attrs); kfree_const(ei->name); if (ei->is_events) { diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h index 7a5fe17b6bf9..d03f74658716 100644 --- a/include/linux/tracefs.h +++ b/include/linux/tracefs.h @@ -62,6 +62,8 @@ struct eventfs_file; typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, const struct file_operations **fops); +typedef void (*eventfs_release)(const char *name, void *data); + /** * struct eventfs_entry - dynamically created eventfs file call back handler * @name: Then name of the dynamic file in an eventfs directory @@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, struct eventfs_entry { const char *name; eventfs_callbackcallback; + eventfs_release release; }; struct eventfs_inode; diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index 52f75c36bbca..d14c84281f2b 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -2552,6 +2552,14 @@ static int event_callback(const char *name, umode_t *mode, void **data, return 0; } +/* The file is incremented on creation and freeing the enable file decrements it */ +static void event_release(const char *name, void *data) +{ + struct trace_event_file *file = data; + + event_file_put(file); +} + static int event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file) { @@ -2566,6 +2574,7 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file) { .name = "enable", .callback = event_callback, + .release= event_release, }, { .name = "filter", @@ -2634,6 +2643,9 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file) return ret; } + /* Gets decremented on freeing of the "enable" file */ + event_file_get(file); + return 0; }
Re: [PATCH v2 1/4] vhost: Improve vhost_get_avail_idx() with smp_rmb()
On Mon, Apr 29, 2024 at 08:13:57PM +1000, Gavin Shan wrote: > From: "Michael S. Tsirkin" > > All the callers of vhost_get_avail_idx() are concerned to the memory *with* the memory barrier > barrier, imposed by smp_rmb() to ensure the order of the available > ring entry read and avail_idx read. > > Improve vhost_get_avail_idx() so that smp_rmb() is executed when > the avail_idx is advanced. accessed, not advanced. guest advances it. > With it, the callers needn't to worry > about the memory barrier. > > No functional change intended. I'd add: As a side benefit, we also validate the index on all paths now, which will hopefully help catch future errors earlier. Note: current code is inconsistent in how it handles errors: some places treat it as an empty ring, others - non empty. This patch does not attempt to change the existing behaviour. > Signed-off-by: Michael S. Tsirkin > [gshan: repainted vhost_get_avail_idx()] ?repainted? > Reviewed-by: Gavin Shan > Acked-by: Will Deacon > --- > drivers/vhost/vhost.c | 106 +- > 1 file changed, 42 insertions(+), 64 deletions(-) > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index 8995730ce0bf..7aa623117aab 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -1290,10 +1290,36 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) > mutex_unlock(>vqs[i]->mutex); > } > > -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, > - __virtio16 *idx) > +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) > { > - return vhost_get_avail(vq, *idx, >avail->idx); > + __virtio16 idx; > + int r; > + > + r = vhost_get_avail(vq, idx, >avail->idx); > + if (unlikely(r < 0)) { > + vq_err(vq, "Failed to access available index at %p (%d)\n", > +>avail->idx, r); > + return r; > + } > + > + /* Check it isn't doing very strange thing with available indexes */ > + vq->avail_idx = vhost16_to_cpu(vq, idx); > + if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) { > + vq_err(vq, "Invalid available index change from %u to %u", > +vq->last_avail_idx, vq->avail_idx); > + return -EINVAL; > + } > + > + /* We're done if there is nothing new */ > + if (vq->avail_idx == vq->last_avail_idx) > + return 0; > + > + /* > + * We updated vq->avail_idx so we need a memory barrier between > + * the index read above and the caller reading avail ring entries. > + */ > + smp_rmb(); > + return 1; > } > > static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, > @@ -2498,38 +2524,17 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, > { > struct vring_desc desc; > unsigned int i, head, found = 0; > - u16 last_avail_idx; > - __virtio16 avail_idx; > + u16 last_avail_idx = vq->last_avail_idx; > __virtio16 ring_head; > int ret, access; > > - /* Check it isn't doing very strange things with descriptor numbers. */ > - last_avail_idx = vq->last_avail_idx; > - > if (vq->avail_idx == vq->last_avail_idx) { > - if (unlikely(vhost_get_avail_idx(vq, _idx))) { > - vq_err(vq, "Failed to access avail idx at %p\n", > - >avail->idx); > - return -EFAULT; > - } > - vq->avail_idx = vhost16_to_cpu(vq, avail_idx); > - > - if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) { > - vq_err(vq, "Guest moved avail index from %u to %u", > - last_avail_idx, vq->avail_idx); > - return -EFAULT; > - } > + ret = vhost_get_avail_idx(vq); > + if (unlikely(ret < 0)) > + return ret; > > - /* If there's nothing new since last we looked, return > - * invalid. > - */ > - if (vq->avail_idx == last_avail_idx) > + if (!ret) > return vq->num; > - > - /* Only get avail ring entries after they have been > - * exposed by guest. > - */ > - smp_rmb(); > } > > /* Grab the next descriptor number they're advertising, and increment > @@ -2790,35 +2795,20 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n); > /* return true if we're sure that avaiable ring is empty */ > bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq) > { > - __virtio16 avail_idx; > int r; > > if (vq->avail_idx != vq->last_avail_idx) > return false; > > - r = vhost_get_avail_idx(vq, _idx); > - if (unlikely(r)) > - return false; > - > - vq->avail_idx = vhost16_to_cpu(vq, avail_idx); > - if
Re: [syzbot] [bpf?] [trace?] WARNING in group_send_sig_info
On 4/27/24 9:34 AM, syzbot wrote: Hello, syzbot found the following issue on: HEAD commit:443574b03387 riscv, bpf: Fix kfunc parameters incompatibil.. git tree: bpf console output: https://syzkaller.appspot.com/x/log.txt?x=11ca8fe718 kernel config: https://syzkaller.appspot.com/x/.config?x=6fb1be60a193d440 dashboard link: https://syzkaller.appspot.com/bug?extid=1902c6d326478ce2dfb0 compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 Unfortunately, I don't have any reproducer for this issue yet. Downloadable assets: disk image: https://storage.googleapis.com/syzbot-assets/3f355021a085/disk-443574b0.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/44cf4de7472a/vmlinux-443574b0.xz kernel image: https://storage.googleapis.com/syzbot-assets/a99a36c7ad65/bzImage-443574b0.xz IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+1902c6d326478ce2d...@syzkaller.appspotmail.com [ cut here ] raw_local_irq_restore() called with IRQs enabled WARNING: CPU: 1 PID: 7785 at kernel/locking/irqflag-debug.c:10 warn_bogus_irq_restore+0x29/0x40 kernel/locking/irqflag-debug.c:10 Modules linked in: CPU: 1 PID: 7785 Comm: syz-executor.3 Not tainted 6.8.0-syzkaller-05236-g443574b03387 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 RIP: 0010:warn_bogus_irq_restore+0x29/0x40 kernel/locking/irqflag-debug.c:10 Code: 90 f3 0f 1e fa 90 80 3d de 59 01 04 00 74 06 90 c3 cc cc cc cc c6 05 cf 59 01 04 01 90 48 c7 c7 20 ba aa 8b e8 f8 d5 e7 f5 90 <0f> 0b 90 90 90 c3 cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 0f 1f RSP: 0018:c9000399fbb8 EFLAGS: 00010246 RAX: 4aede97b00455d00 RBX: 192000733f7c RCX: 88802a129e00 RDX: RSI: RDI: RBP: c9000399fc50 R08: 8157cc12 R09: 1110172a51a2 R10: dc00 R11: ed10172a51a3 R12: dc00 R13: 192000733f78 R14: c9000399fbe0 R15: 0246 FS: 7ae76480() GS:8880b950() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7ffc27e190f8 CR3: 6cb5 CR4: 003506f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:151 [inline] _raw_spin_unlock_irqrestore+0x120/0x140 kernel/locking/spinlock.c:194 spin_unlock_irqrestore include/linux/spinlock.h:406 [inline] unlock_task_sighand include/linux/sched/signal.h:754 [inline] do_send_sig_info kernel/signal.c:1302 [inline] group_send_sig_info+0x2e0/0x310 kernel/signal.c:1453 bpf_send_signal_common+0x2dd/0x430 kernel/trace/bpf_trace.c:881 bpf_send_signal kernel/trace/bpf_trace.c:886 [inline] bpf_send_signal+0x19/0x30 kernel/trace/bpf_trace.c:884 bpf_prog_8cc4ff36b5985b6a+0x1d/0x1f bpf_dispatcher_nop_func include/linux/bpf.h:1234 [inline] __bpf_prog_run include/linux/filter.h:650 [inline] bpf_prog_run include/linux/filter.h:664 [inline] __bpf_trace_run kernel/trace/bpf_trace.c:2381 [inline] bpf_trace_run2+0x375/0x420 kernel/trace/bpf_trace.c:2420 trace_sys_exit include/trace/events/syscalls.h:44 [inline] syscall_exit_work+0x153/0x170 kernel/entry/common.c:163 syscall_exit_to_user_mode_prepare kernel/entry/common.c:194 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:199 [inline] syscall_exit_to_user_mode+0x273/0x360 kernel/entry/common.c:212 do_syscall_64+0x10a/0x240 arch/x86/entry/common.c:89 entry_SYSCALL_64_after_hwframe+0x6d/0x75 The following are related functions. struct sighand_struct *__lock_task_sighand(struct task_struct *tsk, unsigned long *flags) { struct sighand_struct *sighand; rcu_read_lock(); for (;;) { sighand = rcu_dereference(tsk->sighand); if (unlikely(sighand == NULL)) break; /* * This sighand can be already freed and even reused, but * we rely on SLAB_TYPESAFE_BY_RCU and sighand_ctor() which * initializes ->siglock: this slab can't go away, it has * the same object type, ->siglock can't be reinitialized. * * We need to ensure that tsk->sighand is still the same * after we take the lock, we can race with de_thread() or * __exit_signal(). In the latter case the next iteration * must see ->sighand == NULL. */ spin_lock_irqsave(>siglock, *flags); if (likely(sighand == rcu_access_pointer(tsk->sighand))) break; spin_unlock_irqrestore(>siglock, *flags); } rcu_read_unlock();
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Mon, 29 Apr 2024 11:43:05 -0500, Jarkko Sakkinen wrote: On Mon Apr 29, 2024 at 7:18 PM EEST, Haitao Huang wrote: Hi Jarkko On Sun, 28 Apr 2024 17:03:17 -0500, Jarkko Sakkinen wrote: > On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote: >> On 4/16/24 07:15, Jarkko Sakkinen wrote: >> > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote: >> > Yes, exactly. I'd take one week break and cycle the kselftest part >> > internally a bit as I said my previous response. I'm sure that there >> > is experise inside Intel how to implement it properly. I.e. take some >> > time to find the right person, and wait as long as that person has a >> > bit of bandwidth to go through the test and suggest modifications. >> >> Folks, I worry that this series is getting bogged down in the selftests. >> Yes, selftests are important. But getting _some_ tests in the kernel >> is substantially more important than getting perfect tests. >> >> I don't think Haitao needs to "cycle" this back inside Intel. > > The problem with the tests was that they are hard to run anything else > than Ubuntu (and perhaps Debian). It is hopefully now taken care of. > Selftests do not have to be perfect but at minimum they need to be > runnable. > > I need ret-test the latest series because it is possible that I did not > have right flags (I was travelling few days thus have not done it yet). > > BR, Jarkko > Let me know if you want me to send v13 before testing or you can just use the sgx_cg_upstream_v12_plus branch in my repo. Also thanks for the "Reviewed-by" tags for other patches. But I've not got "Reviewed-by" from you for patches #8-12 (not sure I missed). Could you go through those alsoe when you get chance? So, I compiled v12 branch. Was the only difference in selftests? I can just copy them to the device. BR, Jarkko The only other functional change is using BUG_ON() when workqueue allocation fails in sgx_cgroup_init(). It should not affect testing results. Thanks Haitao
Re: [PATCHv3 bpf-next 6/7] selftests/bpf: Add uretprobe compat test
On Mon, Apr 29, 2024 at 12:39 AM Jiri Olsa wrote: > > On Fri, Apr 26, 2024 at 11:06:53AM -0700, Andrii Nakryiko wrote: > > On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa wrote: > > > > > > Adding test that adds return uprobe inside 32 bit task > > > and verify the return uprobe and attached bpf programs > > > get properly executed. > > > > > > Signed-off-by: Jiri Olsa > > > --- > > > tools/testing/selftests/bpf/.gitignore| 1 + > > > tools/testing/selftests/bpf/Makefile | 6 ++- > > > .../selftests/bpf/prog_tests/uprobe_syscall.c | 40 +++ > > > .../bpf/progs/uprobe_syscall_compat.c | 13 ++ > > > 4 files changed, 59 insertions(+), 1 deletion(-) > > > create mode 100644 > > > tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c > > > > > > diff --git a/tools/testing/selftests/bpf/.gitignore > > > b/tools/testing/selftests/bpf/.gitignore > > > index f1aebabfb017..69d71223c0dd 100644 > > > --- a/tools/testing/selftests/bpf/.gitignore > > > +++ b/tools/testing/selftests/bpf/.gitignore > > > @@ -45,6 +45,7 @@ test_cpp > > > /veristat > > > /sign-file > > > /uprobe_multi > > > +/uprobe_compat > > > *.ko > > > *.tmp > > > xskxceiver > > > diff --git a/tools/testing/selftests/bpf/Makefile > > > b/tools/testing/selftests/bpf/Makefile > > > index edc73f8f5aef..d170b63eca62 100644 > > > --- a/tools/testing/selftests/bpf/Makefile > > > +++ b/tools/testing/selftests/bpf/Makefile > > > @@ -134,7 +134,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr > > > test_skb_cgroup_id_user \ > > > xskxceiver xdp_redirect_multi xdp_synproxy veristat > > > xdp_hw_metadata \ > > > xdp_features bpf_test_no_cfi.ko > > > > > > -TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi > > > +TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi > > > uprobe_compat > > > > you need to add uprobe_compat to TRUNNER_EXTRA_FILES as well, no? > > ah right > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > > b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > > index 9233210a4c33..3770254d893b 100644 > > > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > > @@ -11,6 +11,7 @@ > > > #include > > > #include "uprobe_syscall.skel.h" > > > #include "uprobe_syscall_call.skel.h" > > > +#include "uprobe_syscall_compat.skel.h" > > > > > > __naked unsigned long uretprobe_regs_trigger(void) > > > { > > > @@ -291,6 +292,35 @@ static void test_uretprobe_syscall_call(void) > > > "read_trace_pipe_iter"); > > > ASSERT_EQ(found, 0, "found"); > > > } > > > + > > > +static void trace_pipe_compat_cb(const char *str, void *data) > > > +{ > > > + if (strstr(str, "uretprobe compat") != NULL) > > > + (*(int *)data)++; > > > +} > > > + > > > +static void test_uretprobe_compat(void) > > > +{ > > > + struct uprobe_syscall_compat *skel = NULL; > > > + int err, found = 0; > > > + > > > + skel = uprobe_syscall_compat__open_and_load(); > > > + if (!ASSERT_OK_PTR(skel, "uprobe_syscall_compat__open_and_load")) > > > + goto cleanup; > > > + > > > + err = uprobe_syscall_compat__attach(skel); > > > + if (!ASSERT_OK(err, "uprobe_syscall_compat__attach")) > > > + goto cleanup; > > > + > > > + system("./uprobe_compat"); > > > + > > > + ASSERT_OK(read_trace_pipe_iter(trace_pipe_compat_cb, , > > > 1000), > > > +"read_trace_pipe_iter"); > > > > why so complicated? can't you just set global variable that it was called > > hm, we execute separate uprobe_compat (32bit) process that triggers the bpf > program, so we can't use global variable.. using the trace_pipe was the only > thing that was easy to do you need child process to trigger uprobe, but you could have installed BPF program from parent process (you'd need to make child wait for parent to be ready, with normal pipe() like we do in other places). I think generally the less work forked child process does, the better. All those ASSERT() failures won't produce any output in child process, unless you run tests in verbose mode, because we haven't implemented some form of sending all the logs back to the parent process and so they are completely lost. But that's a separate topic. Either way, consider using pipe() to coordinate waiting from child on parent being ready, but otherwise do all the BPF-related heavy lifting from parent (you can attach BPF programs to specific PID using bpf_program__attach_uprobe() easily, it's not declarative, but simple enough). > > jirka > > > > > > + ASSERT_EQ(found, 1, "found"); > > > + > > > +cleanup: > > > + uprobe_syscall_compat__destroy(skel); > > > +} > > > #else > > > static void test_uretprobe_regs_equal(void) > > > { > > > @@ -306,6 +336,11 @@ static void test_uretprobe_syscall_call(void) > > > { > >
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
On Mon Apr 29, 2024 at 7:18 PM EEST, Haitao Huang wrote: > Hi Jarkko > > On Sun, 28 Apr 2024 17:03:17 -0500, Jarkko Sakkinen > wrote: > > > On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote: > >> On 4/16/24 07:15, Jarkko Sakkinen wrote: > >> > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote: > >> > Yes, exactly. I'd take one week break and cycle the kselftest part > >> > internally a bit as I said my previous response. I'm sure that there > >> > is experise inside Intel how to implement it properly. I.e. take some > >> > time to find the right person, and wait as long as that person has a > >> > bit of bandwidth to go through the test and suggest modifications. > >> > >> Folks, I worry that this series is getting bogged down in the selftests. > >> Yes, selftests are important. But getting _some_ tests in the kernel > >> is substantially more important than getting perfect tests. > >> > >> I don't think Haitao needs to "cycle" this back inside Intel. > > > > The problem with the tests was that they are hard to run anything else > > than Ubuntu (and perhaps Debian). It is hopefully now taken care of. > > Selftests do not have to be perfect but at minimum they need to be > > runnable. > > > > I need ret-test the latest series because it is possible that I did not > > have right flags (I was travelling few days thus have not done it yet). > > > > BR, Jarkko > > > > Let me know if you want me to send v13 before testing or you can just use > the sgx_cg_upstream_v12_plus branch in my repo. > > Also thanks for the "Reviewed-by" tags for other patches. But I've not got > "Reviewed-by" from you for patches #8-12 (not sure I missed). Could you go > through those alsoe when you get chance? So, I compiled v12 branch. Was the only difference in selftests? I can just copy them to the device. BR, Jarkko
Re: [PATCHv3 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test
On Mon, Apr 29, 2024 at 12:33 AM Jiri Olsa wrote: > > On Fri, Apr 26, 2024 at 11:03:29AM -0700, Andrii Nakryiko wrote: > > On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa wrote: > > > > > > Adding test to verify that when called from outside of the > > > trampoline provided by kernel, the uretprobe syscall will cause > > > calling process to receive SIGILL signal and the attached bpf > > > program is no executed. > > > > > > Signed-off-by: Jiri Olsa > > > --- > > > .../selftests/bpf/prog_tests/uprobe_syscall.c | 92 +++ > > > .../selftests/bpf/progs/uprobe_syscall_call.c | 15 +++ > > > 2 files changed, 107 insertions(+) > > > create mode 100644 > > > tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > > > > > > See nits below, but overall LGTM > > > > Acked-by: Andrii Nakryiko > > > > [...] > > > > > @@ -219,6 +301,11 @@ static void test_uretprobe_regs_change(void) > > > { > > > test__skip(); > > > } > > > + > > > +static void test_uretprobe_syscall_call(void) > > > +{ > > > + test__skip(); > > > +} > > > #endif > > > > > > void test_uprobe_syscall(void) > > > @@ -228,3 +315,8 @@ void test_uprobe_syscall(void) > > > if (test__start_subtest("uretprobe_regs_change")) > > > test_uretprobe_regs_change(); > > > } > > > + > > > +void serial_test_uprobe_syscall_call(void) > > > > does it need to be serial? non-serial are still run sequentially > > within a process (there is no multi-threading), it's more about some > > global effects on system. > > plz see below > > > > > > +{ > > > + test_uretprobe_syscall_call(); > > > +} > > > diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > > b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > > new file mode 100644 > > > index ..5ea03bb47198 > > > --- /dev/null > > > +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > > @@ -0,0 +1,15 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +#include "vmlinux.h" > > > +#include > > > +#include > > > + > > > +struct pt_regs regs; > > > + > > > +char _license[] SEC("license") = "GPL"; > > > + > > > +SEC("uretprobe//proc/self/exe:uretprobe_syscall_call") > > > +int uretprobe(struct pt_regs *regs) > > > +{ > > > + bpf_printk("uretprobe called"); > > > > debugging leftover? we probably don't want to pollute trace_pipe from test > > the reason for this is to make sure the bpf program was not executed, > > the test makes sure the child gets killed with SIGILL and also that > the bpf program was not executed by checking the trace_pipe and > making sure nothing was received > > the trace_pipe reading is also why it's serial you could have attached BPF program from parent process and use a global variable (and thus eliminate all the trace_pipe system-wide dependency), but ok, it's fine by me the way this is done > > jirka > > > > > > + return 0; > > > +} > > > -- > > > 2.44.0 > > >
Re: [PATCH v7 00/16] mm: jit/text allocator
On Mon, Apr 29, 2024 at 03:16:04PM +0300, Mike Rapoport wrote: > From: "Mike Rapoport (IBM)" > > Hi, > > The patches are also available in git: > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v7 > > v7 changes: > * define MODULE_{VADDR,END} for riscv32 to fix the build and avoid > #ifdefs in a function body > * add Acks, thanks everybody Thanks, I've pushed this to modules-next for further exposure / testing. Given the status of testing so far with prior revisions, in that only a few issues were found and that those were fixed, and the status of reviews, this just might be ripe for v6.10. Luis
Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing
Hi Jarkko On Sun, 28 Apr 2024 17:03:17 -0500, Jarkko Sakkinen wrote: On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote: On 4/16/24 07:15, Jarkko Sakkinen wrote: > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote: > Yes, exactly. I'd take one week break and cycle the kselftest part > internally a bit as I said my previous response. I'm sure that there > is experise inside Intel how to implement it properly. I.e. take some > time to find the right person, and wait as long as that person has a > bit of bandwidth to go through the test and suggest modifications. Folks, I worry that this series is getting bogged down in the selftests. Yes, selftests are important. But getting _some_ tests in the kernel is substantially more important than getting perfect tests. I don't think Haitao needs to "cycle" this back inside Intel. The problem with the tests was that they are hard to run anything else than Ubuntu (and perhaps Debian). It is hopefully now taken care of. Selftests do not have to be perfect but at minimum they need to be runnable. I need ret-test the latest series because it is possible that I did not have right flags (I was travelling few days thus have not done it yet). BR, Jarkko Let me know if you want me to send v13 before testing or you can just use the sgx_cg_upstream_v12_plus branch in my repo. Also thanks for the "Reviewed-by" tags for other patches. But I've not got "Reviewed-by" from you for patches #8-12 (not sure I missed). Could you go through those alsoe when you get chance? Thanks Haitao
Re: [PATCH v12 12/14] x86/sgx: Turn on per-cgroup EPC reclamation
On Mon, 29 Apr 2024 05:49:13 -0500, Huang, Kai wrote: +/* + * Get the per-cgroup or global LRU list that tracks the given reclaimable page. + */ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_page) { +#ifdef CONFIG_CGROUP_MISC + /* +* epc_page->sgx_cg here is never NULL during a reclaimable epc_page's +* life between sgx_alloc_epc_page() and sgx_free_epc_page(): +* +* In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from + * sgx_get_current_cg() which is the misc cgroup of the current task, or +* the root by default even if the misc cgroup is disabled by kernel +* command line. +* +* epc_page->sgx_cg is only unset by sgx_free_epc_page(). +* +* This function is never used before sgx_alloc_epc_page() or after +* sgx_free_epc_page(). +*/ + return _page->sgx_cg->lru; +#else return _global_lru; +#endif } /* @@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page *epc_pag */ static inline bool sgx_can_reclaim(void) { - return !list_empty(_global_lru.reclaimable); + return !sgx_cgroup_lru_empty(misc_cg_root()) || + !list_empty(_global_lru.reclaimable); } Shouldn't this be: if (IS_ENABLED(CONFIG_CGROUP_MISC)) return !sgx_cgroup_lru_empty(misc_cg_root()); else return !list_empty(_global_lru.reclaimable); ? In this way, it is consistent with the sgx_reclaim_pages_global() below. I changed to this way because sgx_cgroup_lru_empty() is now defined in both KConfig cases. And it seems better to minimize use of the KConfig variables based on earlier feedback (some are yours). Don't really have strong preference here. So let me know one way of the other. static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); @@ -404,7 +426,10 @@ static bool sgx_should_reclaim(unsigned long watermark) static void sgx_reclaim_pages_global(struct mm_struct *charge_mm) { - sgx_reclaim_pages(_global_lru, charge_mm); + if (IS_ENABLED(CONFIG_CGROUP_MISC)) + sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm); + else + sgx_reclaim_pages(_global_lru, charge_mm); } /* @@ -414,6 +439,14 @@ static void sgx_reclaim_pages_global(struct mm_struct *charge_mm) */ void sgx_reclaim_direct(void) { + struct sgx_cgroup *sgx_cg = sgx_get_current_cg(); + + /* Make sure there are some free pages at cgroup level */ + if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) { + sgx_cgroup_reclaim_pages(misc_from_sgx(sgx_cg), current->mm); + sgx_put_cg(sgx_cg); + } Empty line. Sure + /* Make sure there are some free pages at global level */ if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) Looking at the code, to me sgx_should_reclaim() is a little bit vague because from the name we don't know whether it interally checks the current cgroup or the global. It's better to rename to sgx_should_reclaim_global(). Ditto for sgx_can_reclaim() -> sgx_can_reclaim_global(). And I think you can do the renaming in the previous patch, because in the changelog of your previous patch, it seems you have called out the two functions are for global reclaim. ok Thanks Haitao
Re: [PATCH 1/2] dt-bindings: remoteproc: qcom,smd-edge: Mark qcom,ipc as deprecated
On Thu, 25 Apr 2024 21:14:30 +0200, Luca Weiss wrote: > Deprecate the qcom,ipc way of accessing the mailbox in favor of the > 'mboxes' property. > > Update the example to use mboxes. > > Signed-off-by: Luca Weiss > --- > Documentation/devicetree/bindings/remoteproc/qcom,smd-edge.yaml | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > Reviewed-by: Rob Herring (Arm)
Re: [PATCH 2/2] dt-bindings: soc: qcom,smp2p: Mark qcom,ipc as deprecated
On Thu, 25 Apr 2024 21:14:31 +0200, Luca Weiss wrote: > Deprecate the qcom,ipc way of accessing the mailbox in favor of the > 'mboxes' property. > > Update the example to use mboxes. > > Signed-off-by: Luca Weiss > --- > Documentation/devicetree/bindings/soc/qcom/qcom,smp2p.yaml | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > Reviewed-by: Rob Herring (Arm)
Re: [PATCH v9 29/36] bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled
On Thu, 25 Apr 2024 13:09:32 -0700 Andrii Nakryiko wrote: > On Mon, Apr 15, 2024 at 6:22 AM Masami Hiramatsu (Google) > wrote: > > > > From: Masami Hiramatsu (Google) > > > > Enable kprobe_multi feature if CONFIG_FPROBE is enabled. The pt_regs is > > converted from ftrace_regs by ftrace_partial_regs(), thus some registers > > may always returns 0. But it should be enough for function entry (access > > arguments) and exit (access return value). > > > > Signed-off-by: Masami Hiramatsu (Google) > > Acked-by: Florent Revest > > --- > > Changes from previous series: NOTHING, Update against the new series. > > --- > > kernel/trace/bpf_trace.c | 22 +- > > 1 file changed, 9 insertions(+), 13 deletions(-) > > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > > index e51a6ef87167..57b1174030c9 100644 > > --- a/kernel/trace/bpf_trace.c > > +++ b/kernel/trace/bpf_trace.c > > @@ -2577,7 +2577,7 @@ static int __init bpf_event_init(void) > > fs_initcall(bpf_event_init); > > #endif /* CONFIG_MODULES */ > > > > -#if defined(CONFIG_FPROBE) && defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS) > > +#ifdef CONFIG_FPROBE > > struct bpf_kprobe_multi_link { > > struct bpf_link link; > > struct fprobe fp; > > @@ -2600,6 +2600,8 @@ struct user_syms { > > char *buf; > > }; > > > > +static DEFINE_PER_CPU(struct pt_regs, bpf_kprobe_multi_pt_regs); > > this is a waste if CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST=y, right? > Can we guard it? Good catch! Yes, we can guard it. > > > > + > > static int copy_user_syms(struct user_syms *us, unsigned long __user > > *usyms, u32 cnt) > > { > > unsigned long __user usymbol; > > @@ -2792,13 +2794,14 @@ static u64 bpf_kprobe_multi_entry_ip(struct > > bpf_run_ctx *ctx) > > > > static int > > kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link, > > - unsigned long entry_ip, struct pt_regs *regs) > > + unsigned long entry_ip, struct ftrace_regs > > *fregs) > > { > > struct bpf_kprobe_multi_run_ctx run_ctx = { > > .link = link, > > .entry_ip = entry_ip, > > }; > > struct bpf_run_ctx *old_run_ctx; > > + struct pt_regs *regs; > > int err; > > > > if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) { > > @@ -2809,6 +2812,7 @@ kprobe_multi_link_prog_run(struct > > bpf_kprobe_multi_link *link, > > > > migrate_disable(); > > rcu_read_lock(); > > + regs = ftrace_partial_regs(fregs, > > this_cpu_ptr(_kprobe_multi_pt_regs)); > > and then pass NULL if defined(CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST)? Indeed. Thank you! > > > > old_run_ctx = bpf_set_run_ctx(_ctx.run_ctx); > > err = bpf_prog_run(link->link.prog, regs); > > bpf_reset_run_ctx(old_run_ctx); > > @@ -2826,13 +2830,9 @@ kprobe_multi_link_handler(struct fprobe *fp, > > unsigned long fentry_ip, > > void *data) > > { > > struct bpf_kprobe_multi_link *link; > > - struct pt_regs *regs = ftrace_get_regs(fregs); > > - > > - if (!regs) > > - return 0; > > > > link = container_of(fp, struct bpf_kprobe_multi_link, fp); > > - kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs); > > + kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs); > > return 0; > > } > > > > @@ -2842,13 +2842,9 @@ kprobe_multi_link_exit_handler(struct fprobe *fp, > > unsigned long fentry_ip, > >void *data) > > { > > struct bpf_kprobe_multi_link *link; > > - struct pt_regs *regs = ftrace_get_regs(fregs); > > - > > - if (!regs) > > - return; > > > > link = container_of(fp, struct bpf_kprobe_multi_link, fp); > > - kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs); > > + kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs); > > } > > > > static int symbols_cmp_r(const void *a, const void *b, const void *priv) > > @@ -3107,7 +3103,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr > > *attr, struct bpf_prog *pr > > kvfree(cookies); > > return err; > > } > > -#else /* !CONFIG_FPROBE || !CONFIG_DYNAMIC_FTRACE_WITH_REGS */ > > +#else /* !CONFIG_FPROBE */ > > int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct > > bpf_prog *prog) > > { > > return -EOPNOTSUPP; > > > > -- Masami Hiramatsu (Google)
Re: [PATCH v9 36/36] fgraph: Skip recording calltime/rettime if it is not nneeded
On Thu, 25 Apr 2024 13:15:08 -0700 Andrii Nakryiko wrote: > On Mon, Apr 15, 2024 at 6:25 AM Masami Hiramatsu (Google) > wrote: > > > > From: Masami Hiramatsu (Google) > > > > Skip recording calltime and rettime if the fgraph_ops does not need it. > > This is a kind of performance optimization for fprobe. Since the fprobe > > user does not use these entries, recording timestamp in fgraph is just > > a overhead (e.g. eBPF, ftrace). So introduce the skip_timestamp flag, > > and all fgraph_ops sets this flag, skip recording calltime and rettime. > > > > Suggested-by: Jiri Olsa > > Signed-off-by: Masami Hiramatsu (Google) > > --- > > Changes in v9: > > - Newly added. > > --- > > include/linux/ftrace.h |2 ++ > > kernel/trace/fgraph.c | 46 > > +++--- > > kernel/trace/fprobe.c |1 + > > 3 files changed, 42 insertions(+), 7 deletions(-) > > > > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h > > index d845a80a3d56..06fc7cbef897 100644 > > --- a/include/linux/ftrace.h > > +++ b/include/linux/ftrace.h > > @@ -1156,6 +1156,8 @@ struct fgraph_ops { > > struct ftrace_ops ops; /* for the hash lists */ > > void*private; > > int idx; > > + /* If skip_timestamp is true, this does not record timestamps. */ > > + boolskip_timestamp; > > }; > > > > void *fgraph_reserve_data(int idx, int size_bytes); > > diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c > > index 7556fbbae323..a5722537bb79 100644 > > --- a/kernel/trace/fgraph.c > > +++ b/kernel/trace/fgraph.c > > @@ -131,6 +131,7 @@ DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph); > > int ftrace_graph_active; > > > > static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE]; > > +static bool fgraph_skip_timestamp; > > > > /* LRU index table for fgraph_array */ > > static int fgraph_lru_table[FGRAPH_ARRAY_SIZE]; > > @@ -475,7 +476,7 @@ void ftrace_graph_stop(void) > > static int > > ftrace_push_return_trace(unsigned long ret, unsigned long func, > > unsigned long frame_pointer, unsigned long *retp, > > -int fgraph_idx) > > +int fgraph_idx, bool skip_ts) > > { > > struct ftrace_ret_stack *ret_stack; > > unsigned long long calltime; > > @@ -498,8 +499,12 @@ ftrace_push_return_trace(unsigned long ret, unsigned > > long func, > > ret_stack = get_ret_stack(current, current->curr_ret_stack, ); > > if (ret_stack && ret_stack->func == func && > > get_fgraph_type(current, index + FGRAPH_RET_INDEX) == > > FGRAPH_TYPE_BITMAP && > > - !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, > > fgraph_idx)) > > + !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, > > fgraph_idx)) { > > + /* If previous one skips calltime, update it. */ > > + if (!skip_ts && !ret_stack->calltime) > > + ret_stack->calltime = trace_clock_local(); > > return index + FGRAPH_RET_INDEX; > > + } > > > > val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | > > FGRAPH_RET_INDEX; > > > > @@ -517,7 +522,10 @@ ftrace_push_return_trace(unsigned long ret, unsigned > > long func, > > return -EBUSY; > > } > > > > - calltime = trace_clock_local(); > > + if (skip_ts) > > would it be ok to add likely() here to keep the least-overhead code path > linear? It's not "likely", but hmm, yes as you said. We can keep the least overhead. OK, let me add likely. Thank you, > > > + calltime = 0LL; > > + else > > + calltime = trace_clock_local(); > > > > index = READ_ONCE(current->curr_ret_stack); > > ret_stack = RET_STACK(current, index); > > @@ -601,7 +609,8 @@ int function_graph_enter_regs(unsigned long ret, > > unsigned long func, > > trace.func = func; > > trace.depth = ++current->curr_ret_depth; > > > > - index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0); > > + index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0, > > +fgraph_skip_timestamp); > > if (index < 0) > > goto out; > > > > @@ -654,7 +663,8 @@ int function_graph_enter_ops(unsigned long ret, > > unsigned long func, > > return -ENODEV; > > > > /* Use start for the distance to ret_stack (skipping over reserve) > > */ > > - index = ftrace_push_return_trace(ret, func, frame_pointer, retp, > > gops->idx); > > + index = ftrace_push_return_trace(ret, func, frame_pointer, retp, > > gops->idx, > > +gops->skip_timestamp); > > if (index < 0) > > return index; > > type = get_fgraph_type(current, index); > > @@ -732,6
Re: 回复:WARNING in current_check_refer_path
On Mon, Apr 29, 2024 at 05:16:57PM +0800, Ubisectech Sirius wrote: > > Hello, > > > Thanks for the report. Could you please provide a reproducer? > > > Regards, > > Mickaël > > Hi. > The Poc file has seed to you as attachment. Indeed, but could you please trim down the file. There are 650 lines, most of them are irrelevant. > > > On Sun, Apr 28, 2024 at 10:47:02AM +0800, Ubisectech Sirius wrote: > >> Hello. > >> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. > >> Recently, our team has discovered a issue in Linux kernel 6.7. Attached to > >> the email were a PoC file of the issue. > >> > >> Stack dump: > >> > > > loop3: detected capacity change from 0 to 1024 > > > [ cut here ] > > > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access > > > security/landlock/fs.c:598 [inline] > > > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access > > > security/landlock/fs.c:578 [inline] > > > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 > > > current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758 > > > Modules linked in: > > > CPU: 0 PID: 30368 Comm: syz-executor.3 Not tainted 6.7.0 #2 > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 > > > 04/01/2014 > > > RIP: 0010:get_mode_access security/landlock/fs.c:598 [inline] > > > RIP: 0010:get_mode_access security/landlock/fs.c:578 [inline] > > > RIP: 0010:current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758 > > > Code: e9 76 fb ff ff 41 bc fe ff ff ff e9 6b fb ff ff e8 00 99 77 fd 90 > > > 0f 0b 90 41 bc f3 ff ff ff e9 57 fb ff ff e8 ec 98 77 fd 90 <0f> 0b 90 31 > > > db e9 86 f9 ff ff bb 00 08 00 00 e9 7c f9 ff ff 41 ba > > > RSP: 0018:c90001fb7ba0 EFLAGS: 00010212 > > > RAX: 0bc5 RBX: 88805feeb7b0 RCX: c90006e15000 > > > RDX: 0004 RSI: 84125d64 RDI: 0003 > > > RBP: 8880123c5608 R08: 0003 R09: c000 > > > R10: f000 R11: R12: 88805d32fc00 > > > R13: 8880123c5608 R14: R15: 0001 > > > FS: 7fd70c4d8640() GS:88802c60() > > > knlGS: > > > CS: 0010 DS: ES: CR0: 80050033 > > > CR2: 001b2c136000 CR3: 5b2a CR4: 00750ef0 > > > DR0: DR1: DR2: > > > DR3: DR6: fffe0ff0 DR7: 0400 > > > PKRU: 5554 > > > Call Trace: > > > > > > security_path_rename+0x124/0x230 security/security.c:1828 > > > do_renameat2+0x9f6/0xd30 fs/namei.c:4983 > > > __do_sys_rename fs/namei.c:5042 [inline] > > > __se_sys_rename fs/namei.c:5040 [inline] > > > __x64_sys_rename+0x81/0xa0 fs/namei.c:5040 > > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > > do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83 > > > entry_SYSCALL_64_after_hwframe+0x6f/0x77 > > > RIP: 0033:0x7fd70b6900ed > > > Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 > > > f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 > > > ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 > > > RSP: 002b:7fd70c4d8028 EFLAGS: 0246 ORIG_RAX: 0052 > > > RAX: ffda RBX: 7fd70b7cbf80 RCX: 7fd70b6900ed > >> RDX: RSI: 2140 RDI: 2100 > > > RBP: 7fd70b6f14a6 R08: R09: > > > R10: R11: 0246 R12: > > > R13: 000b R14: 7fd70b7cbf80 R15: 7fd70c4b8000 > > > > > > > > > Thank you for taking the time to read this email and we look forward to > > > working with you further. > > > > > > > > > > > > > > > > > > > > > > > > > >
Re: [v3] tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()
… > > it jumps to the label 'out' instead of 'fail' by mistake.In the result, … > > Looks good to me. * Do you care for a typo in this change description? * Would you like to read any improved (patch) version descriptions (or changelogs)? Regards, Markus
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
Hi Andrii, On Thu, 25 Apr 2024 13:31:53 -0700 Andrii Nakryiko wrote: > Hey Masami, > > I can't really review most of that code as I'm completely unfamiliar > with all those inner workings of fprobe/ftrace/function_graph. I left > a few comments where there were somewhat more obvious BPF-related > pieces. > > But I also did run our BPF benchmarks on probes/for-next as a baseline > and then with your series applied on top. Just to see if there are any > regressions. I think it will be a useful data point for you. Thanks for testing! > > You should be already familiar with the bench tool we have in BPF > selftests (I used it on some other patches for your tree). What patches we need? > > BASELINE > > kprobe : 24.634 ± 0.205M/s > kprobe-multi : 28.898 ± 0.531M/s > kretprobe : 10.478 ± 0.015M/s > kretprobe-multi: 11.012 ± 0.063M/s > > THIS PATCH SET ON TOP > = > kprobe : 25.144 ± 0.027M/s (+2%) > kprobe-multi : 28.909 ± 0.074M/s > kretprobe :9.482 ± 0.008M/s (-9.5%) > kretprobe-multi: 13.688 ± 0.027M/s (+24%) This looks good. Kretprobe should also use kretprobe-multi (fprobe) eventually because it should be a single callback version of kretprobe-multi. > > These numbers are pretty stable and look to be more or less representative. > > As you can see, kprobes got a bit faster, kprobe-multi seems to be > about the same, though. > > Then (I suppose they are "legacy") kretprobes got quite noticeably > slower, almost by 10%. Not sure why, but looks real after re-running > benchmarks a bunch of times and getting stable results. Hmm, kretprobe on x86 should use ftrace + rethook even with my series. So nothing should be changed. Maybe cache access pattern has been changed? I'll check it with tracefs (to remove the effect from bpf related changes) > > On the other hand, multi-kretprobes got significantly faster (+24%!). > Again, I don't know if it is expected or not, but it's a nice > improvement. Thanks! > > If you have any idea why kretprobes would get so much slower, it would > be nice to look into that and see if you can mitigate the regression > somehow. Thanks! OK, let me check it. Thank you! > > > > 51 files changed, 2325 insertions(+), 882 deletions(-) > > create mode 100644 > > tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc > > > > -- > > Masami Hiramatsu (Google) > > -- Masami Hiramatsu (Google)
Re: [PATCH v3] tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()
Hi LuMing, On Sat, 27 Apr 2024 08:23:47 +0100 lumingyindet...@126.com wrote: > From: LuMingYin > > If traceprobe_parse_probe_arg_body() failed to allocate 'parg->fmt', > it jumps to the label 'out' instead of 'fail' by mistake.In the result, > the buffer 'tmp' is not freed in this case and leaks its memory. > > Thus jump to the label 'fail' in that error case. > Looks good to me. Acked-by: Masami Hiramatsu (Google) Thank you! > Fixes: 032330abd08b ("tracing/probes: Cleanup probe argument parser") > Signed-off-by: LuMingYin > --- > kernel/trace/trace_probe.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c > index c09fa6fc636e..81c319b92038 100644 > --- a/kernel/trace/trace_probe.c > +++ b/kernel/trace/trace_probe.c > @@ -1466,7 +1466,7 @@ static int traceprobe_parse_probe_arg_body(const char > *argv, ssize_t *size, > parg->fmt = kmalloc(len, GFP_KERNEL); > if (!parg->fmt) { > ret = -ENOMEM; > - goto out; > + goto fail; > } > snprintf(parg->fmt, len, "%s[%d]", parg->type->fmttype, >parg->count); > -- > 2.25.1 > > -- Masami Hiramatsu (Google)
Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race
On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote: > Two enclave threads may try to add and remove the same enclave page > simultaneously (e.g., if the SGX runtime supports both lazy allocation > and `MADV_DONTNEED` semantics). Consider this race: > > 1. T1 performs page removal in sgx_encl_remove_pages() and stops right >after removing the page table entry and right before re-acquiring the >enclave lock to EREMOVE and xa_erase(>page_array) the page. > 2. T2 tries to access the page, and #PF[not_present] is raised. The >condition to EAUG in sgx_vma_fault() is not satisfied because the >page is still present in encl->page_array, thus the SGX driver >assumes that the fault happened because the page was swapped out. The >driver continues on a code path that installs a page table entry >*without* performing EAUG. > 3. The enclave page metadata is in inconsistent state: the PTE is >installed but there was no EAUG. Thus, T2 in userspace infinitely >receives SIGSEGV on this page (and EACCEPT always fails). > > Fix this by making sure that T1 (the page-removing thread) always wins > this data race. In particular, the page-being-removed is marked as such, > and T2 retries until the page is fully removed. > > Fixes: 9849bb27152c ("x86/sgx: Support complete page removal") > Cc: sta...@vger.kernel.org > Signed-off-by: Dmitrii Kuvaiskii > --- > arch/x86/kernel/cpu/sgx/encl.c | 3 ++- > arch/x86/kernel/cpu/sgx/encl.h | 3 +++ > arch/x86/kernel/cpu/sgx/ioctl.c | 1 + > 3 files changed, 6 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > index 41f14b1a3025..7ccd8b2fce5f 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.c > +++ b/arch/x86/kernel/cpu/sgx/encl.c > @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct > sgx_encl *encl, > > /* Entry successfully located. */ > if (entry->epc_page) { > - if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED) > + if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED | > +SGX_ENCL_PAGE_BEING_REMOVED)) > return ERR_PTR(-EBUSY); > > return entry; > diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h > index f94ff14c9486..fff5f2293ae7 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.h > +++ b/arch/x86/kernel/cpu/sgx/encl.h > @@ -25,6 +25,9 @@ > /* 'desc' bit marking that the page is being reclaimed. */ > #define SGX_ENCL_PAGE_BEING_RECLAIMEDBIT(3) > > +/* 'desc' bit marking that the page is being removed. */ > +#define SGX_ENCL_PAGE_BEING_REMOVED BIT(2) > + > struct sgx_encl_page { > unsigned long desc; > unsigned long vm_max_prot_bits:8; > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c > index b65ab214bdf5..c542d4dd3e64 100644 > --- a/arch/x86/kernel/cpu/sgx/ioctl.c > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c > @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl, >* Do not keep encl->lock because of dependency on >* mmap_lock acquired in sgx_zap_enclave_ptes(). >*/ > + entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED; > mutex_unlock(>lock); > > sgx_zap_enclave_ptes(encl, addr); It is somewhat trivial to NAK this as the commit message does not do any effort describing the new flag. By default at least I have strong opposition against any new flags related to reclaiming even if it needs a bit of extra synchronization work in the user space. One way to describe concurrency scenarios would be to take example from https://www.kernel.org/doc/Documentation/memory-barriers.txt I.e. see the examples with CPU 1 and CPU 2. BR, Jarkko
Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
On Mon Apr 29, 2024 at 4:22 PM EEST, Jarkko Sakkinen wrote: > On Mon Apr 29, 2024 at 4:04 PM EEST, Jarkko Sakkinen wrote: > > > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux > > > fault handler so that no signal is sent to userspace, and (2) by > > > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no > > > EREMOVE is performed. > > > > What is the collateral damage caused by ENCLS[EREMOVE]? > > Have you measured cost of eremove on an empty page? > > I tried to lookup for a thread from lore because I have a faint memory > that it was concluded that its cost irrelevant. Please correct if I'm > wrong. Also pseudocode for EREMOVE supports this as it just returns without actually doing anything. BR, Jarkko
Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
On Mon Apr 29, 2024 at 4:04 PM EEST, Jarkko Sakkinen wrote: > > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux > > fault handler so that no signal is sent to userspace, and (2) by > > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no > > EREMOVE is performed. > > What is the collateral damage caused by ENCLS[EREMOVE]? Have you measured cost of eremove on an empty page? I tried to lookup for a thread from lore because I have a faint memory that it was concluded that its cost irrelevant. Please correct if I'm wrong. BR, Jarkko
Re: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows
On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote: > SGX runtimes such as Gramine may implement EDMM-based lazy allocation of > enclave pages and may support MADV_DONTNEED semantics [1]. The former > implies #PF-based page allocation, and the latter implies the usage of > SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl. > > A trivial program like below (run under Gramine and with EDMM enabled) > stresses these two flows in the SGX driver and hangs: > > /* repeatedly touch different enclave pages at random and mix with > * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */ > static void* thread_func(void* arg) { > size_t num_pages = 0xA000 / page_size; > for (int i = 0; i < 5000; i++) { > size_t page = get_random_ulong() % num_pages; > char data = READ_ONCE(((char*)arg)[page * page_size]); > > page = get_random_ulong() % num_pages; > madvise(arg + page * page_size, page_size, MADV_DONTNEED); > } > } > > addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0); > pthread_t threads[16]; > for (int i = 0; i < 16; i++) > pthread_create([i], NULL, thread_func, addr); I'm not convinced that kernel is the problem here but it could be also how Gramine is implemented. So maybe you could make a better case of that. The example looks a bit artificial to me. > > This program uncovers two data races in the SGX driver. The remaining > patches describe and fix these races. > > I performed several stress tests to verify that there are no other data > races (at least with the test program above): > > - On Icelake server with 128GB of PRMRR (EPC), without madvise(). This > stresses the first data race. A Gramine SGX test suite running in the > background for additional stressing. Result: 1,000 runs without hangs > (result without the first bug fix: hangs every time). > - On Icelake server with 128GB of PRMRR (EPC), with madvise(). This > stresses the second data race. A Gramine SGX test suite running in the > background for additional stressing. Result: 1,000 runs without hangs > (result with the first bug fix but without the second bug fix: hangs > approx. once in 50 runs). > - On Icelake server with 4GB of PRMRR (EPC), with madvise(). This > additionally stresses the enclave page swapping flows. Two Gramine SGX > test suites running in the background for additional stressing of > swapping (I observe 100% CPU utilization from ksgxd which confirms that > swapping happens). Result: 1,000 runs without hangs. > > (Sorry for the previous copy of this email, accidentally sent to > sta...@vger.kernel.org. Failed to use `--suppress-cc` during a test send.) > > Dmitrii Kuvaiskii (2): > x86/sgx: Resolve EAUG race where losing thread returns SIGBUS > x86/sgx: Resolve EREMOVE page vs EAUG page data race > > arch/x86/kernel/cpu/sgx/encl.c | 10 +++--- > arch/x86/kernel/cpu/sgx/encl.h | 3 +++ > arch/x86/kernel/cpu/sgx/ioctl.c | 1 + > 3 files changed, 11 insertions(+), 3 deletions(-) BR, Jarkko
Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote: > Two enclave threads may try to access the same non-present enclave page > simultaneously (e.g., if the SGX runtime supports lazy allocation). The > threads will end up in sgx_encl_eaug_page(), racing to acquire the > enclave lock. The winning thread will perform EAUG, set up the page > table entry, and insert the page into encl->page_array. The losing > thread will then get -EBUSY on xa_insert(>page_array) and proceed > to error handling path. And that path removes page. Not sure I got gist of this tbh. > This error handling path contains two bugs: (1) SIGBUS is sent to > userspace even though the enclave page is correctly installed by another > thread, and (2) sgx_encl_free_epc_page() is called that performs EREMOVE > even though the enclave page was never intended to be removed. The first > bug is less severe because it impacts only the user space; the second > bug is more severe because it also impacts the OS state by ripping the > page (added by the winning thread) from the enclave. > > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux > fault handler so that no signal is sent to userspace, and (2) by > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no > EREMOVE is performed. What is the collateral damage caused by ENCLS[EREMOVE]? > > Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized > enclave") > Cc: sta...@vger.kernel.org > Reported-by: Marcelina Kościelnicka > Suggested-by: Reinette Chatre > Signed-off-by: Dmitrii Kuvaiskii > --- > arch/x86/kernel/cpu/sgx/encl.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > index 279148e72459..41f14b1a3025 100644 > --- a/arch/x86/kernel/cpu/sgx/encl.c > +++ b/arch/x86/kernel/cpu/sgx/encl.c > @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct > vm_area_struct *vma, >* If ret == -EBUSY then page was created in another flow while >* running without encl->lock >*/ > - if (ret) > + if (ret) { > + if (ret == -EBUSY) > + vmret = VM_FAULT_NOPAGE; > goto err_out_shrink; > + } > > pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page); > pginfo.addr = encl_page->desc & PAGE_MASK; > @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct > vm_area_struct *vma, > err_out_shrink: > sgx_encl_shrink(encl, va_page); > err_out_epc: > - sgx_encl_free_epc_page(epc_page); > + sgx_free_epc_page(epc_page); This ignores check for the page being reclaimer tracked, i.e. it does changes that have been ignored in the commit message. > err_out_unlock: > mutex_unlock(>lock); > kfree(encl_page); BR, Jarkko
[PATCH v7 16/16] bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of
From: "Mike Rapoport (IBM)" BPF just-in-time compiler depended on CONFIG_MODULES because it used module_alloc() to allocate memory for the generated code. Since code allocations are now implemented with execmem, drop dependency of CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM. Suggested-by: Björn Töpel Signed-off-by: Mike Rapoport (IBM) --- kernel/bpf/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig index bc25f5098a25..f999e4e0b344 100644 --- a/kernel/bpf/Kconfig +++ b/kernel/bpf/Kconfig @@ -43,7 +43,7 @@ config BPF_JIT bool "Enable BPF Just In Time compiler" depends on BPF depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT - depends on MODULES + select EXECMEM help BPF programs are normally handled by a BPF interpreter. This option allows the kernel to generate native code when a program is loaded -- 2.43.0
[PATCH v7 15/16] kprobes: remove dependency on CONFIG_MODULES
From: "Mike Rapoport (IBM)" kprobes depended on CONFIG_MODULES because it has to allocate memory for code. Since code allocations are now implemented with execmem, kprobes can be enabled in non-modular kernels. Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the dependency of CONFIG_KPROBES on CONFIG_MODULES. Signed-off-by: Mike Rapoport (IBM) Acked-by: Masami Hiramatsu (Google) --- arch/Kconfig| 2 +- include/linux/module.h | 9 ++ kernel/kprobes.c| 55 +++-- kernel/trace/trace_kprobe.c | 20 +- 4 files changed, 63 insertions(+), 23 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 4fd0daa54e6c..caa459964f09 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -52,9 +52,9 @@ config GENERIC_ENTRY config KPROBES bool "Kprobes" - depends on MODULES depends on HAVE_KPROBES select KALLSYMS + select EXECMEM select TASKS_RCU if PREEMPTION help Kprobes allows you to trap at almost any kernel address and diff --git a/include/linux/module.h b/include/linux/module.h index 1153b0d99a80..ffa1c603163c 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -605,6 +605,11 @@ static inline bool module_is_live(struct module *mod) return mod->state != MODULE_STATE_GOING; } +static inline bool module_is_coming(struct module *mod) +{ +return mod->state == MODULE_STATE_COMING; +} + struct module *__module_text_address(unsigned long addr); struct module *__module_address(unsigned long addr); bool is_module_address(unsigned long addr); @@ -857,6 +862,10 @@ void *dereference_module_function_descriptor(struct module *mod, void *ptr) return ptr; } +static inline bool module_is_coming(struct module *mod) +{ + return false; +} #endif /* CONFIG_MODULES */ #ifdef CONFIG_SYSFS diff --git a/kernel/kprobes.c b/kernel/kprobes.c index ddd7cdc16edf..ca2c6cbd42d2 100644 --- a/kernel/kprobes.c +++ b/kernel/kprobes.c @@ -1588,7 +1588,7 @@ static int check_kprobe_address_safe(struct kprobe *p, } /* Get module refcount and reject __init functions for loaded modules. */ - if (*probed_mod) { + if (IS_ENABLED(CONFIG_MODULES) && *probed_mod) { /* * We must hold a refcount of the probed module while updating * its code to prohibit unexpected unloading. @@ -1603,12 +1603,13 @@ static int check_kprobe_address_safe(struct kprobe *p, * kprobes in there. */ if (within_module_init((unsigned long)p->addr, *probed_mod) && - (*probed_mod)->state != MODULE_STATE_COMING) { + !module_is_coming(*probed_mod)) { module_put(*probed_mod); *probed_mod = NULL; ret = -ENOENT; } } + out: preempt_enable(); jump_label_unlock(); @@ -2488,24 +2489,6 @@ int kprobe_add_area_blacklist(unsigned long start, unsigned long end) return 0; } -/* Remove all symbols in given area from kprobe blacklist */ -static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end) -{ - struct kprobe_blacklist_entry *ent, *n; - - list_for_each_entry_safe(ent, n, _blacklist, list) { - if (ent->start_addr < start || ent->start_addr >= end) - continue; - list_del(>list); - kfree(ent); - } -} - -static void kprobe_remove_ksym_blacklist(unsigned long entry) -{ - kprobe_remove_area_blacklist(entry, entry + 1); -} - int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value, char *type, char *sym) { @@ -2570,6 +2553,25 @@ static int __init populate_kprobe_blacklist(unsigned long *start, return ret ? : arch_populate_kprobe_blacklist(); } +#ifdef CONFIG_MODULES +/* Remove all symbols in given area from kprobe blacklist */ +static void kprobe_remove_area_blacklist(unsigned long start, unsigned long end) +{ + struct kprobe_blacklist_entry *ent, *n; + + list_for_each_entry_safe(ent, n, _blacklist, list) { + if (ent->start_addr < start || ent->start_addr >= end) + continue; + list_del(>list); + kfree(ent); + } +} + +static void kprobe_remove_ksym_blacklist(unsigned long entry) +{ + kprobe_remove_area_blacklist(entry, entry + 1); +} + static void add_module_kprobe_blacklist(struct module *mod) { unsigned long start, end; @@ -2672,6 +2674,17 @@ static struct notifier_block kprobe_module_nb = { .priority = 0 }; +static int kprobe_register_module_notifier(void) +{ + return register_module_notifier(_module_nb); +} +#else +static int
[PATCH v7 14/16] powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate
From: "Mike Rapoport (IBM)" There are places where CONFIG_MODULES guards the code that depends on memory allocation being done with module_alloc(). Replace CONFIG_MODULES with CONFIG_EXECMEM in such places. Signed-off-by: Mike Rapoport (IBM) --- arch/powerpc/Kconfig | 2 +- arch/powerpc/include/asm/kasan.h | 2 +- arch/powerpc/kernel/head_8xx.S | 4 ++-- arch/powerpc/kernel/head_book3s_32.S | 6 +++--- arch/powerpc/lib/code-patching.c | 2 +- arch/powerpc/mm/book3s32/mmu.c | 2 +- 6 files changed, 9 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 1c4be3373686..2e586733a464 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -285,7 +285,7 @@ config PPC select IOMMU_HELPER if PPC64 select IRQ_DOMAIN select IRQ_FORCED_THREADING - select KASAN_VMALLOCif KASAN && MODULES + select KASAN_VMALLOCif KASAN && EXECMEM select LOCK_MM_AND_FIND_VMA select MMU_GATHER_PAGE_SIZE select MMU_GATHER_RCU_TABLE_FREE diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h index 365d2720097c..b5bbb94c51f6 100644 --- a/arch/powerpc/include/asm/kasan.h +++ b/arch/powerpc/include/asm/kasan.h @@ -19,7 +19,7 @@ #define KASAN_SHADOW_SCALE_SHIFT 3 -#if defined(CONFIG_MODULES) && defined(CONFIG_PPC32) +#if defined(CONFIG_EXECMEM) && defined(CONFIG_PPC32) #define KASAN_KERN_START ALIGN_DOWN(PAGE_OFFSET - SZ_256M, SZ_256M) #else #define KASAN_KERN_START PAGE_OFFSET diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 647b0b445e89..edc479a7c2bc 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -199,12 +199,12 @@ instruction_counter: mfspr r10, SPRN_SRR0 /* Get effective address of fault */ INVALIDATE_ADJACENT_PAGES_CPU15(r10, r11) mtspr SPRN_MD_EPN, r10 -#ifdef CONFIG_MODULES +#ifdef CONFIG_EXECMEM mfcrr11 compare_to_kernel_boundary r10, r10 #endif mfspr r10, SPRN_M_TWB /* Get level 1 table */ -#ifdef CONFIG_MODULES +#ifdef CONFIG_EXECMEM blt+3f rlwinm r10, r10, 0, 20, 31 orisr10, r10, (swapper_pg_dir - PAGE_OFFSET)@ha diff --git a/arch/powerpc/kernel/head_book3s_32.S b/arch/powerpc/kernel/head_book3s_32.S index c1d89764dd22..57196883a00e 100644 --- a/arch/powerpc/kernel/head_book3s_32.S +++ b/arch/powerpc/kernel/head_book3s_32.S @@ -419,14 +419,14 @@ InstructionTLBMiss: */ /* Get PTE (linux-style) and check access */ mfspr r3,SPRN_IMISS -#ifdef CONFIG_MODULES +#ifdef CONFIG_EXECMEM lis r1, TASK_SIZE@h /* check if kernel address */ cmplw 0,r1,r3 #endif mfspr r2, SPRN_SDR1 li r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC rlwinm r2, r2, 28, 0xf000 -#ifdef CONFIG_MODULES +#ifdef CONFIG_EXECMEM li r0, 3 bgt-112f lis r2, (swapper_pg_dir - PAGE_OFFSET)@ha /* if kernel address, use */ @@ -442,7 +442,7 @@ InstructionTLBMiss: andc. r1,r1,r2/* check access & ~permission */ bne-InstructionAddressInvalid /* return if access not permitted */ /* Convert linux-style PTE to low word of PPC-style PTE */ -#ifdef CONFIG_MODULES +#ifdef CONFIG_EXECMEM rlwimi r2, r0, 0, 31, 31 /* userspace ? -> PP lsb */ #endif ori r1, r1, 0xe06 /* clear out reserved bits */ diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c index c6ab46156cda..7af791446ddf 100644 --- a/arch/powerpc/lib/code-patching.c +++ b/arch/powerpc/lib/code-patching.c @@ -225,7 +225,7 @@ void __init poking_init(void) static unsigned long get_patch_pfn(void *addr) { - if (IS_ENABLED(CONFIG_MODULES) && is_vmalloc_or_module_addr(addr)) + if (IS_ENABLED(CONFIG_EXECMEM) && is_vmalloc_or_module_addr(addr)) return vmalloc_to_pfn(addr); else return __pa_symbol(addr) >> PAGE_SHIFT; diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c index 100f999871bc..625fe7d08e06 100644 --- a/arch/powerpc/mm/book3s32/mmu.c +++ b/arch/powerpc/mm/book3s32/mmu.c @@ -184,7 +184,7 @@ unsigned long __init mmu_mapin_ram(unsigned long base, unsigned long top) static bool is_module_segment(unsigned long addr) { - if (!IS_ENABLED(CONFIG_MODULES)) + if (!IS_ENABLED(CONFIG_EXECMEM)) return false; if (addr < ALIGN_DOWN(MODULES_VADDR, SZ_256M)) return false; -- 2.43.0
[PATCH v7 13/16] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
From: "Mike Rapoport (IBM)" Dynamic ftrace must allocate memory for code and this was impossible without CONFIG_MODULES. With execmem separated from the modules code, execmem_text_alloc() is available regardless of CONFIG_MODULES. Remove dependency of dynamic ftrace on CONFIG_MODULES and make CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig. Signed-off-by: Mike Rapoport (IBM) --- arch/x86/Kconfig | 1 + arch/x86/kernel/ftrace.c | 10 -- 2 files changed, 1 insertion(+), 10 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 4474bf32d0a4..f2917ccf4fb4 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -34,6 +34,7 @@ config X86_64 select SWIOTLB select ARCH_HAS_ELFCORE_COMPAT select ZONE_DMA32 + select EXECMEM if DYNAMIC_FTRACE config FORCE_DYNAMIC_FTRACE def_bool y diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c index c8ddb7abda7c..8da0e66ca22d 100644 --- a/arch/x86/kernel/ftrace.c +++ b/arch/x86/kernel/ftrace.c @@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command) /* Currently only x86_64 supports dynamic trampolines */ #ifdef CONFIG_X86_64 -#ifdef CONFIG_MODULES -/* Module allocation simplifies allocating memory for code */ static inline void *alloc_tramp(unsigned long size) { return execmem_alloc(EXECMEM_FTRACE, size); @@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp) { execmem_free(tramp); } -#else -/* Trampolines can only be created if modules are supported */ -static inline void *alloc_tramp(unsigned long size) -{ - return NULL; -} -static inline void tramp_free(void *tramp) { } -#endif /* Defined as markers to the end of the ftrace default trampolines */ extern void ftrace_regs_caller_end(void); -- 2.43.0
[PATCH v7 12/16] arch: make execmem setup available regardless of CONFIG_MODULES
From: "Mike Rapoport (IBM)" execmem does not depend on modules, on the contrary modules use execmem. To make execmem available when CONFIG_MODULES=n, for instance for kprobes, split execmem_params initialization out from arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y Signed-off-by: Mike Rapoport (IBM) Reviewed-by: Philippe Mathieu-Daudé --- arch/arm/kernel/module.c | 43 -- arch/arm/mm/init.c | 45 +++ arch/arm64/kernel/module.c | 140 - arch/arm64/mm/init.c | 140 + arch/loongarch/kernel/module.c | 19 - arch/loongarch/mm/init.c | 21 + arch/mips/kernel/module.c | 22 -- arch/mips/mm/init.c| 23 ++ arch/nios2/kernel/module.c | 20 - arch/nios2/mm/init.c | 21 + arch/parisc/kernel/module.c| 20 - arch/parisc/mm/init.c | 23 +- arch/powerpc/kernel/module.c | 63 --- arch/powerpc/mm/mem.c | 64 +++ arch/riscv/kernel/module.c | 34 arch/riscv/mm/init.c | 35 + arch/s390/kernel/module.c | 27 --- arch/s390/mm/init.c| 30 +++ arch/sparc/kernel/module.c | 19 - arch/sparc/mm/Makefile | 2 + arch/sparc/mm/execmem.c| 21 + arch/x86/kernel/module.c | 27 --- arch/x86/mm/init.c | 29 +++ 23 files changed, 453 insertions(+), 435 deletions(-) create mode 100644 arch/sparc/mm/execmem.c diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index a98fdf6ff26c..677f218f7e84 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -12,57 +12,14 @@ #include #include #include -#include #include #include -#include -#include #include #include #include #include -#ifdef CONFIG_XIP_KERNEL -/* - * The XIP kernel text is mapped in the module area for modules and - * some other stuff to work without any indirect relocations. - * MODULES_VADDR is redefined here and not in asm/memory.h to avoid - * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off. - */ -#undef MODULES_VADDR -#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK) -#endif - -#ifdef CONFIG_MMU -static struct execmem_info execmem_info __ro_after_init; - -struct execmem_info __init *execmem_arch_setup(void) -{ - unsigned long fallback_start = 0, fallback_end = 0; - - if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) { - fallback_start = VMALLOC_START; - fallback_end = VMALLOC_END; - } - - execmem_info = (struct execmem_info){ - .ranges = { - [EXECMEM_DEFAULT] = { - .start = MODULES_VADDR, - .end= MODULES_END, - .pgprot = PAGE_KERNEL_EXEC, - .alignment = 1, - .fallback_start = fallback_start, - .fallback_end = fallback_end, - }, - }, - }; - - return _info; -} -#endif - bool module_init_section(const char *name) { return strstarts(name, ".init") || diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index e8c6f4be0ce1..5345d218899a 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -486,3 +487,47 @@ void free_initrd_mem(unsigned long start, unsigned long end) free_reserved_area((void *)start, (void *)end, -1, "initrd"); } #endif + +#ifdef CONFIG_EXECMEM + +#ifdef CONFIG_XIP_KERNEL +/* + * The XIP kernel text is mapped in the module area for modules and + * some other stuff to work without any indirect relocations. + * MODULES_VADDR is redefined here and not in asm/memory.h to avoid + * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off. + */ +#undef MODULES_VADDR +#define MODULES_VADDR (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK) +#endif + +#ifdef CONFIG_MMU +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) +{ + unsigned long fallback_start = 0, fallback_end = 0; + + if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) { + fallback_start = VMALLOC_START; + fallback_end = VMALLOC_END; + } + + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL_EXEC, + .alignment = 1, + .fallback_start = fallback_start, + .fallback_end = fallback_end, +
[PATCH v7 11/16] powerpc: extend execmem_params for kprobes allocations
From: "Mike Rapoport (IBM)" powerpc overrides kprobes::alloc_insn_page() to remove writable permissions when STRICT_MODULE_RWX is on. Add definition of EXECMEM_KRPOBES to execmem_params to allow using the generic kprobes::alloc_insn_page() with the desired permissions. As powerpc uses breakpoint instructions to inject kprobes, it does not need to constrain kprobe allocations to the modules area and can use the entire vmalloc address space. Signed-off-by: Mike Rapoport (IBM) --- arch/powerpc/kernel/kprobes.c | 20 arch/powerpc/kernel/module.c | 7 +++ 2 files changed, 7 insertions(+), 20 deletions(-) diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index 9fcd01bb2ce6..14c5ddec3056 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -126,26 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long addr, unsigned long offse return (kprobe_opcode_t *)(addr + offset); } -void *alloc_insn_page(void) -{ - void *page; - - page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); - if (!page) - return NULL; - - if (strict_module_rwx_enabled()) { - int err = set_memory_rox((unsigned long)page, 1); - - if (err) - goto error; - } - return page; -error: - execmem_free(page); - return NULL; -} - int arch_prepare_kprobe(struct kprobe *p) { int ret = 0; diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c index ac80559015a3..2a23cf7e141b 100644 --- a/arch/powerpc/kernel/module.c +++ b/arch/powerpc/kernel/module.c @@ -94,6 +94,7 @@ static struct execmem_info execmem_info __ro_after_init; struct execmem_info __init *execmem_arch_setup(void) { + pgprot_t kprobes_prot = strict_module_rwx_enabled() ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC; pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : PAGE_KERNEL_EXEC; unsigned long fallback_start = 0, fallback_end = 0; unsigned long start, end; @@ -132,6 +133,12 @@ struct execmem_info __init *execmem_arch_setup(void) .fallback_start = fallback_start, .fallback_end = fallback_end, }, + [EXECMEM_KPROBES] = { + .start = VMALLOC_START, + .end= VMALLOC_END, + .pgprot = kprobes_prot, + .alignment = 1, + }, [EXECMEM_MODULE_DATA] = { .start = VMALLOC_START, .end= VMALLOC_END, -- 2.43.0
[PATCH v7 10/16] arm64: extend execmem_info for generated code allocations
From: "Mike Rapoport (IBM)" The memory allocations for kprobes and BPF on arm64 can be placed anywhere in vmalloc address space and currently this is implemented with overrides of alloc_insn_page() and bpf_jit_alloc_exec() in arm64. Define EXECMEM_KPROBES and EXECMEM_BPF ranges in arm64::execmem_info and drop overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Signed-off-by: Mike Rapoport (IBM) Acked-by: Will Deacon --- arch/arm64/kernel/module.c | 12 arch/arm64/kernel/probes/kprobes.c | 7 --- arch/arm64/net/bpf_jit_comp.c | 11 --- 3 files changed, 12 insertions(+), 18 deletions(-) diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index b7a7a23f9f8f..a52240ea084b 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -146,6 +146,18 @@ struct execmem_info __init *execmem_arch_setup(void) .fallback_start = fallback_start, .fallback_end = fallback_end, }, + [EXECMEM_KPROBES] = { + .start = VMALLOC_START, + .end= VMALLOC_END, + .pgprot = PAGE_KERNEL_ROX, + .alignment = 1, + }, + [EXECMEM_BPF] = { + .start = VMALLOC_START, + .end= VMALLOC_END, + .pgprot = PAGE_KERNEL, + .alignment = 1, + }, }, }; diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c index 327855a11df2..4268678d0e86 100644 --- a/arch/arm64/kernel/probes/kprobes.c +++ b/arch/arm64/kernel/probes/kprobes.c @@ -129,13 +129,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p) return 0; } -void *alloc_insn_page(void) -{ - return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END, - GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS, - NUMA_NO_NODE, __builtin_return_address(0)); -} - /* arm kprobe: install breakpoint in text */ void __kprobes arch_arm_kprobe(struct kprobe *p) { diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c index 122021f9bdfc..456f5af239fc 100644 --- a/arch/arm64/net/bpf_jit_comp.c +++ b/arch/arm64/net/bpf_jit_comp.c @@ -1793,17 +1793,6 @@ u64 bpf_jit_alloc_exec_limit(void) return VMALLOC_END - VMALLOC_START; } -void *bpf_jit_alloc_exec(unsigned long size) -{ - /* Memory is intended to be executable, reset the pointer tag. */ - return kasan_reset_tag(vmalloc(size)); -} - -void bpf_jit_free_exec(void *addr) -{ - return vfree(addr); -} - /* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */ bool bpf_jit_supports_subprog_tailcalls(void) { -- 2.43.0
[PATCH v7 09/16] riscv: extend execmem_params for generated code allocations
From: "Mike Rapoport (IBM)" The memory allocations for kprobes and BPF on RISC-V are not placed in the modules area and these custom allocations are implemented with overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Define MODULES_VADDR and MODULES_END as VMALLOC_START and VMALLOC_END for 32 bit and slightly reorder execmem_params initialization to support both 32 and 64 bit variants, define EXECMEM_KPROBES and EXECMEM_BPF ranges in riscv::execmem_params and drop overrides of alloc_insn_page() and bpf_jit_alloc_exec(). Signed-off-by: Mike Rapoport (IBM) Reviewed-by: Alexandre Ghiti --- arch/riscv/include/asm/pgtable.h | 3 +++ arch/riscv/kernel/module.c | 14 +- arch/riscv/kernel/probes/kprobes.c | 10 -- arch/riscv/net/bpf_jit_core.c | 13 - 4 files changed, 16 insertions(+), 24 deletions(-) diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index 9f8ea0e33eb1..5f21814e438e 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -55,6 +55,9 @@ #define MODULES_LOWEST_VADDR (KERNEL_LINK_ADDR - SZ_2G) #define MODULES_VADDR (PFN_ALIGN((unsigned long)&_end) - SZ_2G) #define MODULES_END(PFN_ALIGN((unsigned long)&_start)) +#else +#define MODULES_VADDR VMALLOC_START +#define MODULES_ENDVMALLOC_END #endif /* diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c index 182904127ba0..0e6415f00fca 100644 --- a/arch/riscv/kernel/module.c +++ b/arch/riscv/kernel/module.c @@ -906,7 +906,7 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab, return 0; } -#if defined(CONFIG_MMU) && defined(CONFIG_64BIT) +#ifdef CONFIG_MMU static struct execmem_info execmem_info __ro_after_init; struct execmem_info __init *execmem_arch_setup(void) @@ -919,6 +919,18 @@ struct execmem_info __init *execmem_arch_setup(void) .pgprot = PAGE_KERNEL, .alignment = 1, }, + [EXECMEM_KPROBES] = { + .start = VMALLOC_START, + .end= VMALLOC_END, + .pgprot = PAGE_KERNEL_READ_EXEC, + .alignment = 1, + }, + [EXECMEM_BPF] = { + .start = BPF_JIT_REGION_START, + .end= BPF_JIT_REGION_END, + .pgprot = PAGE_KERNEL, + .alignment = PAGE_SIZE, + }, }, }; diff --git a/arch/riscv/kernel/probes/kprobes.c b/arch/riscv/kernel/probes/kprobes.c index 2f08c14a933d..e64f2f3064eb 100644 --- a/arch/riscv/kernel/probes/kprobes.c +++ b/arch/riscv/kernel/probes/kprobes.c @@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p) return 0; } -#ifdef CONFIG_MMU -void *alloc_insn_page(void) -{ - return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END, -GFP_KERNEL, PAGE_KERNEL_READ_EXEC, -VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, -__builtin_return_address(0)); -} -#endif - /* install breakpoint in text */ void __kprobes arch_arm_kprobe(struct kprobe *p) { diff --git a/arch/riscv/net/bpf_jit_core.c b/arch/riscv/net/bpf_jit_core.c index 6b3acac30c06..e238fdbd5dbc 100644 --- a/arch/riscv/net/bpf_jit_core.c +++ b/arch/riscv/net/bpf_jit_core.c @@ -219,19 +219,6 @@ u64 bpf_jit_alloc_exec_limit(void) return BPF_JIT_REGION_SIZE; } -void *bpf_jit_alloc_exec(unsigned long size) -{ - return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START, - BPF_JIT_REGION_END, GFP_KERNEL, - PAGE_KERNEL, 0, NUMA_NO_NODE, - __builtin_return_address(0)); -} - -void bpf_jit_free_exec(void *addr) -{ - return vfree(addr); -} - void *bpf_arch_text_copy(void *dst, void *src, size_t len) { int ret; -- 2.43.0
[PATCH v7 08/16] mm/execmem, arch: convert remaining overrides of module_alloc to execmem
From: "Mike Rapoport (IBM)" Extend execmem parameters to accommodate more complex overrides of module_alloc() by architectures. This includes specification of a fallback range required by arm, arm64 and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for allocation of KASAN shadow required by s390 and x86 and support for late initialization of execmem required by arm64. The core implementation of execmem_alloc() takes care of suppressing warnings when the initial allocation fails but there is a fallback range defined. Signed-off-by: Mike Rapoport (IBM) Acked-by: Will Deacon Acked-by: Song Liu --- arch/Kconfig | 8 arch/arm/kernel/module.c | 41 arch/arm64/Kconfig | 1 + arch/arm64/kernel/module.c | 55 ++ arch/powerpc/kernel/module.c | 60 +++-- arch/s390/kernel/module.c| 54 +++--- arch/x86/kernel/module.c | 70 +++-- include/linux/execmem.h | 30 ++- include/linux/moduleloader.h | 12 -- kernel/module/main.c | 26 +++-- mm/execmem.c | 75 ++-- 11 files changed, 247 insertions(+), 185 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 65afb1de48b3..4fd0daa54e6c 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -960,6 +960,14 @@ config ARCH_WANTS_MODULES_DATA_IN_VMALLOC For architectures like powerpc/32 which have constraints on module allocation and need to allocate module data outside of module area. +config ARCH_WANTS_EXECMEM_LATE + bool + help + For architectures that do not allocate executable memory early on + boot, but rather require its initialization late when there is + enough entropy for module space randomization, for instance + arm64. + config HAVE_IRQ_EXIT_ON_IRQ_STACK bool help diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c index e74d84f58b77..a98fdf6ff26c 100644 --- a/arch/arm/kernel/module.c +++ b/arch/arm/kernel/module.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -34,23 +35,31 @@ #endif #ifdef CONFIG_MMU -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { - gfp_t gfp_mask = GFP_KERNEL; - void *p; - - /* Silence the initial allocation */ - if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) - gfp_mask |= __GFP_NOWARN; - - p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE, - __builtin_return_address(0)); - if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p) - return p; - return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END, - GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE, - __builtin_return_address(0)); + unsigned long fallback_start = 0, fallback_end = 0; + + if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) { + fallback_start = VMALLOC_START; + fallback_end = VMALLOC_END; + } + + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL_EXEC, + .alignment = 1, + .fallback_start = fallback_start, + .fallback_end = fallback_end, + }, + }, + }; + + return _info; } #endif diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 7b11c98b3e84..74b34a78b7ac 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -105,6 +105,7 @@ config ARM64 select ARCH_WANT_FRAME_POINTERS select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36) select ARCH_WANT_LD_ORPHAN_WARN + select ARCH_WANTS_EXECMEM_LATE if EXECMEM select ARCH_WANTS_NO_INSTR select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES select ARCH_HAS_UBSAN diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index e92da4da1b2a..b7a7a23f9f8f 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -108,41 +109,47 @@ static int __init module_init_limits(void) return 0; } -subsys_initcall(module_init_limits); -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { -
[PATCH v7 07/16] mm/execmem, arch: convert simple overrides of module_alloc to execmem
From: "Mike Rapoport (IBM)" Several architectures override module_alloc() only to define address range for code allocations different than VMALLOC address space. Provide a generic implementation in execmem that uses the parameters for address space ranges, required alignment and page protections provided by architectures. The architectures must fill execmem_info structure and implement execmem_arch_setup() that returns a pointer to that structure. This way the execmem initialization won't be called from every architecture, but rather from a central place, namely a core_initcall() in execmem. The execmem provides execmem_alloc() API that wraps __vmalloc_node_range() with the parameters defined by the architectures. If an architecture does not implement execmem_arch_setup(), execmem_alloc() will fall back to module_alloc(). Signed-off-by: Mike Rapoport (IBM) Acked-by: Song Liu --- arch/loongarch/kernel/module.c | 19 -- arch/mips/kernel/module.c | 20 -- arch/nios2/kernel/module.c | 21 --- arch/parisc/kernel/module.c| 24 arch/riscv/kernel/module.c | 24 arch/sparc/kernel/module.c | 20 -- include/linux/execmem.h| 47 mm/execmem.c | 67 -- mm/mm_init.c | 2 + 9 files changed, 210 insertions(+), 34 deletions(-) diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c index c7d0338d12c1..ca6dd7ea1610 100644 --- a/arch/loongarch/kernel/module.c +++ b/arch/loongarch/kernel/module.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -490,10 +491,22 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab, return 0; } -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0)); + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL, + .alignment = 1, + }, + }, + }; + + return _info; } static void module_init_ftrace_plt(const Elf_Ehdr *hdr, diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c index 9a6c96014904..59225a3cf918 100644 --- a/arch/mips/kernel/module.c +++ b/arch/mips/kernel/module.c @@ -20,6 +20,7 @@ #include #include #include +#include #include struct mips_hi16 { @@ -32,11 +33,22 @@ static LIST_HEAD(dbe_list); static DEFINE_SPINLOCK(dbe_lock); #ifdef MODULES_VADDR -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, - __builtin_return_address(0)); + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL, + .alignment = 1, + }, + }, + }; + + return _info; } #endif diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c index 9c97b7513853..0d1ee86631fc 100644 --- a/arch/nios2/kernel/module.c +++ b/arch/nios2/kernel/module.c @@ -18,15 +18,26 @@ #include #include #include +#include #include -void *module_alloc(unsigned long size) +static struct execmem_info execmem_info __ro_after_init; + +struct execmem_info __init *execmem_arch_setup(void) { - return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, - GFP_KERNEL, PAGE_KERNEL_EXEC, - VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, - __builtin_return_address(0)); + execmem_info = (struct execmem_info){ + .ranges = { + [EXECMEM_DEFAULT] = { + .start = MODULES_VADDR, + .end= MODULES_END, + .pgprot = PAGE_KERNEL_EXEC, + .alignment = 1, + }, + }, + }; + + return _info; } int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab, diff --git
[PATCH v7 06/16] mm: introduce execmem_alloc() and execmem_free()
From: "Mike Rapoport (IBM)" module_alloc() is used everywhere as a mean to allocate memory for code. Beside being semantically wrong, this unnecessarily ties all subsystems that need to allocate code, such as ftrace, kprobes and BPF to modules and puts the burden of code allocation to the modules code. Several architectures override module_alloc() because of various constraints where the executable memory can be located and this causes additional obstacles for improvements of code allocation. Start splitting code allocation from modules by introducing execmem_alloc() and execmem_free() APIs. Initially, execmem_alloc() is a wrapper for module_alloc() and execmem_free() is a replacement of module_memfree() to allow updating all call sites to use the new APIs. Since architectures define different restrictions on placement, permissions, alignment and other parameters for memory that can be used by different subsystems that allocate executable memory, execmem_alloc() takes a type argument, that will be used to identify the calling subsystem and to allow architectures define parameters for ranges suitable for that subsystem. No functional changes. Signed-off-by: Mike Rapoport (IBM) Acked-by: Masami Hiramatsu (Google) Acked-by: Song Liu --- arch/powerpc/kernel/kprobes.c| 6 ++-- arch/s390/kernel/ftrace.c| 4 +-- arch/s390/kernel/kprobes.c | 4 +-- arch/s390/kernel/module.c| 5 +-- arch/sparc/net/bpf_jit_comp_32.c | 8 ++--- arch/x86/kernel/ftrace.c | 6 ++-- arch/x86/kernel/kprobes/core.c | 4 +-- include/linux/execmem.h | 57 include/linux/moduleloader.h | 3 -- kernel/bpf/core.c| 6 ++-- kernel/kprobes.c | 8 ++--- kernel/module/Kconfig| 1 + kernel/module/main.c | 25 +- mm/Kconfig | 3 ++ mm/Makefile | 1 + mm/execmem.c | 32 ++ 16 files changed, 128 insertions(+), 45 deletions(-) create mode 100644 include/linux/execmem.h create mode 100644 mm/execmem.c diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index bbca90a5e2ec..9fcd01bb2ce6 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -19,8 +19,8 @@ #include #include #include -#include #include +#include #include #include #include @@ -130,7 +130,7 @@ void *alloc_insn_page(void) { void *page; - page = module_alloc(PAGE_SIZE); + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); if (!page) return NULL; @@ -142,7 +142,7 @@ void *alloc_insn_page(void) } return page; error: - module_memfree(page); + execmem_free(page); return NULL; } diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c index c46381ea04ec..798249ef5646 100644 --- a/arch/s390/kernel/ftrace.c +++ b/arch/s390/kernel/ftrace.c @@ -7,13 +7,13 @@ * Author(s): Martin Schwidefsky */ -#include #include #include #include #include #include #include +#include #include #include #include @@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void) { const char *start, *end; - ftrace_plt = module_alloc(PAGE_SIZE); + ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE); if (!ftrace_plt) panic("cannot allocate ftrace plt\n"); diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c index f0cf20d4b3c5..3c1b1be744de 100644 --- a/arch/s390/kernel/kprobes.c +++ b/arch/s390/kernel/kprobes.c @@ -9,7 +9,6 @@ #define pr_fmt(fmt) "kprobes: " fmt -#include #include #include #include @@ -21,6 +20,7 @@ #include #include #include +#include #include #include #include @@ -38,7 +38,7 @@ void *alloc_insn_page(void) { void *page; - page = module_alloc(PAGE_SIZE); + page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE); if (!page) return NULL; set_memory_rox((unsigned long)page, 1); diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c index 42215f9404af..ac97a905e8cd 100644 --- a/arch/s390/kernel/module.c +++ b/arch/s390/kernel/module.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -76,7 +77,7 @@ void *module_alloc(unsigned long size) #ifdef CONFIG_FUNCTION_TRACER void module_arch_cleanup(struct module *mod) { - module_memfree(mod->arch.trampolines_start); + execmem_free(mod->arch.trampolines_start); } #endif @@ -510,7 +511,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct module *me, size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size); numpages = DIV_ROUND_UP(size, PAGE_SIZE); - start = module_alloc(numpages * PAGE_SIZE); + start = execmem_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE); if (!start)
[PATCH v7 05/16] module: make module_memory_{alloc,free} more self-contained
From: "Mike Rapoport (IBM)" Move the logic related to the memory allocation and freeing into module_memory_alloc() and module_memory_free(). Signed-off-by: Mike Rapoport (IBM) --- kernel/module/main.c | 64 +++- 1 file changed, 39 insertions(+), 25 deletions(-) diff --git a/kernel/module/main.c b/kernel/module/main.c index e1e8a7a9d6c1..5b82b069e0d3 100644 --- a/kernel/module/main.c +++ b/kernel/module/main.c @@ -1203,15 +1203,44 @@ static bool mod_mem_use_vmalloc(enum mod_mem_type type) mod_mem_type_is_core_data(type); } -static void *module_memory_alloc(unsigned int size, enum mod_mem_type type) +static int module_memory_alloc(struct module *mod, enum mod_mem_type type) { + unsigned int size = PAGE_ALIGN(mod->mem[type].size); + void *ptr; + + mod->mem[type].size = size; + if (mod_mem_use_vmalloc(type)) - return vzalloc(size); - return module_alloc(size); + ptr = vmalloc(size); + else + ptr = module_alloc(size); + + if (!ptr) + return -ENOMEM; + + /* +* The pointer to these blocks of memory are stored on the module +* structure and we keep that around so long as the module is +* around. We only free that memory when we unload the module. +* Just mark them as not being a leak then. The .init* ELF +* sections *do* get freed after boot so we *could* treat them +* slightly differently with kmemleak_ignore() and only grey +* them out as they work as typical memory allocations which +* *do* eventually get freed, but let's just keep things simple +* and avoid *any* false positives. +*/ + kmemleak_not_leak(ptr); + + memset(ptr, 0, size); + mod->mem[type].base = ptr; + + return 0; } -static void module_memory_free(void *ptr, enum mod_mem_type type) +static void module_memory_free(struct module *mod, enum mod_mem_type type) { + void *ptr = mod->mem[type].base; + if (mod_mem_use_vmalloc(type)) vfree(ptr); else @@ -1229,12 +1258,12 @@ static void free_mod_mem(struct module *mod) /* Free lock-classes; relies on the preceding sync_rcu(). */ lockdep_free_key_range(mod_mem->base, mod_mem->size); if (mod_mem->size) - module_memory_free(mod_mem->base, type); + module_memory_free(mod, type); } /* MOD_DATA hosts mod, so free it at last */ lockdep_free_key_range(mod->mem[MOD_DATA].base, mod->mem[MOD_DATA].size); - module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA); + module_memory_free(mod, MOD_DATA); } /* Free a module, remove from lists, etc. */ @@ -2225,7 +2254,6 @@ static int find_module_sections(struct module *mod, struct load_info *info) static int move_module(struct module *mod, struct load_info *info) { int i; - void *ptr; enum mod_mem_type t = 0; int ret = -ENOMEM; @@ -2234,26 +2262,12 @@ static int move_module(struct module *mod, struct load_info *info) mod->mem[type].base = NULL; continue; } - mod->mem[type].size = PAGE_ALIGN(mod->mem[type].size); - ptr = module_memory_alloc(mod->mem[type].size, type); - /* - * The pointer to these blocks of memory are stored on the module - * structure and we keep that around so long as the module is - * around. We only free that memory when we unload the module. - * Just mark them as not being a leak then. The .init* ELF - * sections *do* get freed after boot so we *could* treat them - * slightly differently with kmemleak_ignore() and only grey - * them out as they work as typical memory allocations which - * *do* eventually get freed, but let's just keep things simple - * and avoid *any* false positives. -*/ - kmemleak_not_leak(ptr); - if (!ptr) { + + ret = module_memory_alloc(mod, type); + if (ret) { t = type; goto out_enomem; } - memset(ptr, 0, mod->mem[type].size); - mod->mem[type].base = ptr; } /* Transfer each section which specifies SHF_ALLOC */ @@ -2296,7 +2310,7 @@ static int move_module(struct module *mod, struct load_info *info) return 0; out_enomem: for (t--; t >= 0; t--) - module_memory_free(mod->mem[t].base, t); + module_memory_free(mod, t); return ret; } -- 2.43.0
[PATCH v7 04/16] sparc: simplify module_alloc()
From: "Mike Rapoport (IBM)" Define MODULES_VADDR and MODULES_END as VMALLOC_START and VMALLOC_END for 32-bit and reduce module_alloc() to __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, ...) as with the new defines the allocations becomes identical for both 32 and 64 bits. While on it, drop unused include of Suggested-by: Sam Ravnborg Signed-off-by: Mike Rapoport (IBM) Reviewed-by: Sam Ravnborg --- arch/sparc/include/asm/pgtable_32.h | 2 ++ arch/sparc/kernel/module.c | 25 + 2 files changed, 3 insertions(+), 24 deletions(-) diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h index 9e85d57ac3f2..62bcafe38b1f 100644 --- a/arch/sparc/include/asm/pgtable_32.h +++ b/arch/sparc/include/asm/pgtable_32.h @@ -432,6 +432,8 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma, #define VMALLOC_START _AC(0xfe60,UL) #define VMALLOC_END _AC(0xffc0,UL) +#define MODULES_VADDR VMALLOC_START +#define MODULES_END VMALLOC_END /* We provide our own get_unmapped_area to cope with VA holes for userland */ #define HAVE_ARCH_UNMAPPED_AREA diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c index 66c45a2764bc..d37adb2a0b54 100644 --- a/arch/sparc/kernel/module.c +++ b/arch/sparc/kernel/module.c @@ -21,35 +21,12 @@ #include "entry.h" -#ifdef CONFIG_SPARC64 - -#include - -static void *module_map(unsigned long size) +void *module_alloc(unsigned long size) { - if (PAGE_ALIGN(size) > MODULES_LEN) - return NULL; return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0)); } -#else -static void *module_map(unsigned long size) -{ - return vmalloc(size); -} -#endif /* CONFIG_SPARC64 */ - -void *module_alloc(unsigned long size) -{ - void *ret; - - ret = module_map(size); - if (ret) - memset(ret, 0, size); - - return ret; -} /* Make generic code ignore STT_REGISTER dummy undefined symbols. */ int module_frob_arch_sections(Elf_Ehdr *hdr, -- 2.43.0
[PATCH v7 03/16] nios2: define virtual address space for modules
From: "Mike Rapoport (IBM)" nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26 cannot reach all of vmalloc address space. Define module space as 32MiB below the kernel base and switch nios2 to use vmalloc for module allocations. Suggested-by: Thomas Gleixner Acked-by: Dinh Nguyen Acked-by: Song Liu Signed-off-by: Mike Rapoport (IBM) --- arch/nios2/include/asm/pgtable.h | 5 - arch/nios2/kernel/module.c | 19 --- 2 files changed, 8 insertions(+), 16 deletions(-) diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h index d052dfcbe8d3..eab87c6beacb 100644 --- a/arch/nios2/include/asm/pgtable.h +++ b/arch/nios2/include/asm/pgtable.h @@ -25,7 +25,10 @@ #include #define VMALLOC_START CONFIG_NIOS2_KERNEL_MMU_REGION_BASE -#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1) +#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1) + +#define MODULES_VADDR (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M) +#define MODULES_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1) struct mm_struct; diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c index 76e0a42d6e36..9c97b7513853 100644 --- a/arch/nios2/kernel/module.c +++ b/arch/nios2/kernel/module.c @@ -21,23 +21,12 @@ #include -/* - * Modules should NOT be allocated with kmalloc for (obvious) reasons. - * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach - * from 0x8000 (vmalloc area) to 0xc (kernel) (kmalloc returns - * addresses in 0xc000) - */ void *module_alloc(unsigned long size) { - if (size == 0) - return NULL; - return kmalloc(size, GFP_KERNEL); -} - -/* Free memory returned from module_alloc */ -void module_memfree(void *module_region) -{ - kfree(module_region); + return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, + GFP_KERNEL, PAGE_KERNEL_EXEC, + VM_FLUSH_RESET_PERMS, NUMA_NO_NODE, + __builtin_return_address(0)); } int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab, -- 2.43.0
[PATCH v7 02/16] mips: module: rename MODULE_START to MODULES_VADDR
From: "Mike Rapoport (IBM)" and MODULE_END to MODULES_END to match other architectures that define custom address space for modules. Signed-off-by: Mike Rapoport (IBM) --- arch/mips/include/asm/pgtable-64.h | 4 ++-- arch/mips/kernel/module.c | 4 ++-- arch/mips/mm/fault.c | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/arch/mips/include/asm/pgtable-64.h b/arch/mips/include/asm/pgtable-64.h index 20ca48c1b606..c0109aff223b 100644 --- a/arch/mips/include/asm/pgtable-64.h +++ b/arch/mips/include/asm/pgtable-64.h @@ -147,8 +147,8 @@ #if defined(CONFIG_MODULES) && defined(KBUILD_64BIT_SYM32) && \ VMALLOC_START != CKSSEG /* Load modules into 32bit-compatible segment. */ -#define MODULE_START CKSSEG -#define MODULE_END (FIXADDR_START-2*PAGE_SIZE) +#define MODULES_VADDR CKSSEG +#define MODULES_END(FIXADDR_START-2*PAGE_SIZE) #endif #define pte_ERROR(e) \ diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c index 7b2fbaa9cac5..9a6c96014904 100644 --- a/arch/mips/kernel/module.c +++ b/arch/mips/kernel/module.c @@ -31,10 +31,10 @@ struct mips_hi16 { static LIST_HEAD(dbe_list); static DEFINE_SPINLOCK(dbe_lock); -#ifdef MODULE_START +#ifdef MODULES_VADDR void *module_alloc(unsigned long size) { - return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END, + return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, __builtin_return_address(0)); } diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index aaa9a242ebba..37fedeaca2e9 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -83,8 +83,8 @@ static void __do_page_fault(struct pt_regs *regs, unsigned long write, if (unlikely(address >= VMALLOC_START && address <= VMALLOC_END)) goto VMALLOC_FAULT_TARGET; -#ifdef MODULE_START - if (unlikely(address >= MODULE_START && address < MODULE_END)) +#ifdef MODULES_VADDR + if (unlikely(address >= MODULES_VADDR && address < MODULES_END)) goto VMALLOC_FAULT_TARGET; #endif -- 2.43.0
[PATCH v7 01/16] arm64: module: remove unneeded call to kasan_alloc_module_shadow()
From: "Mike Rapoport (IBM)" Since commit f6f37d9320a1 ("arm64: select KASAN_VMALLOC for SW/HW_TAGS modes") KASAN_VMALLOC is always enabled when KASAN is on. This means that allocations in module_alloc() will be tracked by KASAN protection for vmalloc() and that kasan_alloc_module_shadow() will be always an empty inline and there is no point in calling it. Drop meaningless call to kasan_alloc_module_shadow() from module_alloc(). Signed-off-by: Mike Rapoport (IBM) --- arch/arm64/kernel/module.c | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c index 47e0be610bb6..e92da4da1b2a 100644 --- a/arch/arm64/kernel/module.c +++ b/arch/arm64/kernel/module.c @@ -141,11 +141,6 @@ void *module_alloc(unsigned long size) __func__); } - if (p && (kasan_alloc_module_shadow(p, size, GFP_KERNEL) < 0)) { - vfree(p); - return NULL; - } - /* Memory is intended to be executable, reset the pointer tag. */ return kasan_reset_tag(p); } -- 2.43.0
[PATCH v7 00/16] mm: jit/text allocator
From: "Mike Rapoport (IBM)" Hi, The patches are also available in git: https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v7 v7 changes: * define MODULE_{VADDR,END} for riscv32 to fix the build and avoid #ifdefs in a function body * add Acks, thanks everybody v6: https://lore.kernel.org/all/20240426082854.7355-1-r...@kernel.org * restore patch "arm64: extend execmem_info for generated code allocations" that disappeared in v5 rebase * update execmem initialization so that by default it will be initialized early while late initialization will be an opt-in v5: https://lore.kernel.org/all/20240422094436.3625171-1-r...@kernel.org * rebase on v6.9-rc4 to avoid a conflict in kprobes * add copyrights to mm/execmem.c (Luis) * fix spelling (Ingo) * define MODULES_VADDDR for sparc (Sam) * consistently initialize struct execmem_info (Peter) * reduce #ifdefs in function bodies in kprobes (Masami) v4: https://lore.kernel.org/all/20240411160051.2093261-1-r...@kernel.org * rebase on v6.9-rc2 * rename execmem_params to execmem_info and execmem_arch_params() to execmem_arch_setup() * use single execmem_alloc() API instead of execmem_{text,data}_alloc() (Song) * avoid extra copy of execmem parameters (Rick) * run execmem_init() as core_initcall() except for the architectures that may allocated text really early (currently only x86) (Will) * add acks for some of arm64 and riscv changes, thanks Will and Alexandre * new commits: - drop call to kasan_alloc_module_shadow() on arm64 because it's not needed anymore - rename MODULE_START to MODULES_VADDR on MIPS - use CONFIG_EXECMEM instead of CONFIG_MODULES on powerpc as per Christophe: https://lore.kernel.org/all/79062fa3-3402-47b3-8920-9231ad05e...@csgroup.eu/ v3: https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org * add type parameter to execmem allocation APIs * remove BPF dependency on modules v2: https://lore.kernel.org/all/20230616085038.4121892-1-r...@kernel.org * Separate "module" and "others" allocations with execmem_text_alloc() and jit_text_alloc() * Drop ROX entailment on x86 * Add ack for nios2 changes, thanks Dinh Nguyen v1: https://lore.kernel.org/all/20230601101257.530867-1-r...@kernel.org = Cover letter from v1 (sligtly updated) = module_alloc() is used everywhere as a mean to allocate memory for code. Beside being semantically wrong, this unnecessarily ties all subsystmes that need to allocate code, such as ftrace, kprobes and BPF to modules and puts the burden of code allocation to the modules code. Several architectures override module_alloc() because of various constraints where the executable memory can be located and this causes additional obstacles for improvements of code allocation. A centralized infrastructure for code allocation allows allocations of executable memory as ROX, and future optimizations such as caching large pages for better iTLB performance and providing sub-page allocations for users that only need small jit code snippets. Rick Edgecombe proposed perm_alloc extension to vmalloc [1] and Song Liu proposed execmem_alloc [2], but both these approaches were targeting BPF allocations and lacked the ground work to abstract executable allocations and split them from the modules core. Thomas Gleixner suggested to express module allocation restrictions and requirements as struct mod_alloc_type_params [3] that would define ranges, protections and other parameters for different types of allocations used by modules and following that suggestion Song separated allocations of different types in modules (commit ac3b43283923 ("module: replace module_layout with module_memory")) and posted "Type aware module allocator" set [4]. I liked the idea of parametrising code allocation requirements as a structure, but I believe the original proposal and Song's module allocator was too module centric, so I came up with these patches. This set splits code allocation from modules by introducing execmem_alloc() and and execmem_free(), APIs, replaces call sites of module_alloc() and module_memfree() with the new APIs and implements core text and related allocations in a central place. Instead of architecture specific overrides for module_alloc(), the architectures that require non-default behaviour for text allocation must fill execmem_info structure and implement execmem_arch_setup() that returns a pointer to that structure. If an architecture does not implement execmem_arch_setup(), the defaults compatible with the current modules::module_alloc() are used. Since architectures define different restrictions on placement, permissions, alignment and other parameters for memory that can be used by different subsystems that allocate executable memory, execmem APIs take a type argument, that will be used to identify the calling subsystem and to allow architectures to define parameters for ranges suitable for that subsystem. The new infrastructure allows decoupling of BPF, kprobes and
Re: [PATCH 5.15,5.10,5.4,4.19 0/2] Fix warning when tracing with large filenames
On Wed, Apr 24, 2024 at 07:20:07PM -0300, Thadeu Lima de Souza Cascardo wrote: > The warning described on patch "tracing: Increase PERF_MAX_TRACE_SIZE to > handle Sentinel1 and docker together" can be triggered with a perf probe on > do_execve with a large path. As PATH_MAX is larger than PERF_MAX_TRACE_SIZE > (2048 before the patch), the warning will trigger. > > The fix was included in 5.16, so backporting to 5.15 and earlier LTS > kernels. Also included is a patch that better describes the attempted > allocation size. All now queued up, thanks. greg k-h
[PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race
Two enclave threads may try to add and remove the same enclave page simultaneously (e.g., if the SGX runtime supports both lazy allocation and `MADV_DONTNEED` semantics). Consider this race: 1. T1 performs page removal in sgx_encl_remove_pages() and stops right after removing the page table entry and right before re-acquiring the enclave lock to EREMOVE and xa_erase(>page_array) the page. 2. T2 tries to access the page, and #PF[not_present] is raised. The condition to EAUG in sgx_vma_fault() is not satisfied because the page is still present in encl->page_array, thus the SGX driver assumes that the fault happened because the page was swapped out. The driver continues on a code path that installs a page table entry *without* performing EAUG. 3. The enclave page metadata is in inconsistent state: the PTE is installed but there was no EAUG. Thus, T2 in userspace infinitely receives SIGSEGV on this page (and EACCEPT always fails). Fix this by making sure that T1 (the page-removing thread) always wins this data race. In particular, the page-being-removed is marked as such, and T2 retries until the page is fully removed. Fixes: 9849bb27152c ("x86/sgx: Support complete page removal") Cc: sta...@vger.kernel.org Signed-off-by: Dmitrii Kuvaiskii --- arch/x86/kernel/cpu/sgx/encl.c | 3 ++- arch/x86/kernel/cpu/sgx/encl.h | 3 +++ arch/x86/kernel/cpu/sgx/ioctl.c | 1 + 3 files changed, 6 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 41f14b1a3025..7ccd8b2fce5f 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl, /* Entry successfully located. */ if (entry->epc_page) { - if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED) + if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED | + SGX_ENCL_PAGE_BEING_REMOVED)) return ERR_PTR(-EBUSY); return entry; diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h index f94ff14c9486..fff5f2293ae7 100644 --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -25,6 +25,9 @@ /* 'desc' bit marking that the page is being reclaimed. */ #define SGX_ENCL_PAGE_BEING_RECLAIMED BIT(3) +/* 'desc' bit marking that the page is being removed. */ +#define SGX_ENCL_PAGE_BEING_REMOVEDBIT(2) + struct sgx_encl_page { unsigned long desc; unsigned long vm_max_prot_bits:8; diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c index b65ab214bdf5..c542d4dd3e64 100644 --- a/arch/x86/kernel/cpu/sgx/ioctl.c +++ b/arch/x86/kernel/cpu/sgx/ioctl.c @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl, * Do not keep encl->lock because of dependency on * mmap_lock acquired in sgx_zap_enclave_ptes(). */ + entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED; mutex_unlock(>lock); sgx_zap_enclave_ptes(encl, addr); -- 2.34.1
[PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
Two enclave threads may try to access the same non-present enclave page simultaneously (e.g., if the SGX runtime supports lazy allocation). The threads will end up in sgx_encl_eaug_page(), racing to acquire the enclave lock. The winning thread will perform EAUG, set up the page table entry, and insert the page into encl->page_array. The losing thread will then get -EBUSY on xa_insert(>page_array) and proceed to error handling path. This error handling path contains two bugs: (1) SIGBUS is sent to userspace even though the enclave page is correctly installed by another thread, and (2) sgx_encl_free_epc_page() is called that performs EREMOVE even though the enclave page was never intended to be removed. The first bug is less severe because it impacts only the user space; the second bug is more severe because it also impacts the OS state by ripping the page (added by the winning thread) from the enclave. Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux fault handler so that no signal is sent to userspace, and (2) by replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no EREMOVE is performed. Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave") Cc: sta...@vger.kernel.org Reported-by: Marcelina Kościelnicka Suggested-by: Reinette Chatre Signed-off-by: Dmitrii Kuvaiskii --- arch/x86/kernel/cpu/sgx/encl.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 279148e72459..41f14b1a3025 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma, * If ret == -EBUSY then page was created in another flow while * running without encl->lock */ - if (ret) + if (ret) { + if (ret == -EBUSY) + vmret = VM_FAULT_NOPAGE; goto err_out_shrink; + } pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page); pginfo.addr = encl_page->desc & PAGE_MASK; @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma, err_out_shrink: sgx_encl_shrink(encl, va_page); err_out_epc: - sgx_encl_free_epc_page(epc_page); + sgx_free_epc_page(epc_page); err_out_unlock: mutex_unlock(>lock); kfree(encl_page); -- 2.34.1
[PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows
SGX runtimes such as Gramine may implement EDMM-based lazy allocation of enclave pages and may support MADV_DONTNEED semantics [1]. The former implies #PF-based page allocation, and the latter implies the usage of SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl. A trivial program like below (run under Gramine and with EDMM enabled) stresses these two flows in the SGX driver and hangs: /* repeatedly touch different enclave pages at random and mix with * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */ static void* thread_func(void* arg) { size_t num_pages = 0xA000 / page_size; for (int i = 0; i < 5000; i++) { size_t page = get_random_ulong() % num_pages; char data = READ_ONCE(((char*)arg)[page * page_size]); page = get_random_ulong() % num_pages; madvise(arg + page * page_size, page_size, MADV_DONTNEED); } } addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0); pthread_t threads[16]; for (int i = 0; i < 16; i++) pthread_create([i], NULL, thread_func, addr); This program uncovers two data races in the SGX driver. The remaining patches describe and fix these races. I performed several stress tests to verify that there are no other data races (at least with the test program above): - On Icelake server with 128GB of PRMRR (EPC), without madvise(). This stresses the first data race. A Gramine SGX test suite running in the background for additional stressing. Result: 1,000 runs without hangs (result without the first bug fix: hangs every time). - On Icelake server with 128GB of PRMRR (EPC), with madvise(). This stresses the second data race. A Gramine SGX test suite running in the background for additional stressing. Result: 1,000 runs without hangs (result with the first bug fix but without the second bug fix: hangs approx. once in 50 runs). - On Icelake server with 4GB of PRMRR (EPC), with madvise(). This additionally stresses the enclave page swapping flows. Two Gramine SGX test suites running in the background for additional stressing of swapping (I observe 100% CPU utilization from ksgxd which confirms that swapping happens). Result: 1,000 runs without hangs. (Sorry for the previous copy of this email, accidentally sent to sta...@vger.kernel.org. Failed to use `--suppress-cc` during a test send.) Dmitrii Kuvaiskii (2): x86/sgx: Resolve EAUG race where losing thread returns SIGBUS x86/sgx: Resolve EREMOVE page vs EAUG page data race arch/x86/kernel/cpu/sgx/encl.c | 10 +++--- arch/x86/kernel/cpu/sgx/encl.h | 3 +++ arch/x86/kernel/cpu/sgx/ioctl.c | 1 + 3 files changed, 11 insertions(+), 3 deletions(-) -- 2.34.1
Re: [PATCH v12 12/14] x86/sgx: Turn on per-cgroup EPC reclamation
> +/* > + * Get the per-cgroup or global LRU list that tracks the given reclaimable > page. > + */ > static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page > *epc_page) > { > +#ifdef CONFIG_CGROUP_MISC > + /* > + * epc_page->sgx_cg here is never NULL during a reclaimable epc_page's > + * life between sgx_alloc_epc_page() and sgx_free_epc_page(): > + * > + * In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from > + * sgx_get_current_cg() which is the misc cgroup of the current task, or > + * the root by default even if the misc cgroup is disabled by kernel > + * command line. > + * > + * epc_page->sgx_cg is only unset by sgx_free_epc_page(). > + * > + * This function is never used before sgx_alloc_epc_page() or after > + * sgx_free_epc_page(). > + */ > + return _page->sgx_cg->lru; > +#else > return _global_lru; > +#endif > } > > /* > @@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct > sgx_epc_page *epc_pag > */ > static inline bool sgx_can_reclaim(void) > { > - return !list_empty(_global_lru.reclaimable); > + return !sgx_cgroup_lru_empty(misc_cg_root()) || > +!list_empty(_global_lru.reclaimable); > } Shouldn't this be: if (IS_ENABLED(CONFIG_CGROUP_MISC)) return !sgx_cgroup_lru_empty(misc_cg_root()); else return !list_empty(_global_lru.reclaimable); ? In this way, it is consistent with the sgx_reclaim_pages_global() below. > > static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0); > @@ -404,7 +426,10 @@ static bool sgx_should_reclaim(unsigned long watermark) > > static void sgx_reclaim_pages_global(struct mm_struct *charge_mm) > { > - sgx_reclaim_pages(_global_lru, charge_mm); > + if (IS_ENABLED(CONFIG_CGROUP_MISC)) > + sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm); > + else > + sgx_reclaim_pages(_global_lru, charge_mm); > } > > /* > @@ -414,6 +439,14 @@ static void sgx_reclaim_pages_global(struct mm_struct > *charge_mm) > */ > void sgx_reclaim_direct(void) > { > + struct sgx_cgroup *sgx_cg = sgx_get_current_cg(); > + > + /* Make sure there are some free pages at cgroup level */ > + if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) { > + sgx_cgroup_reclaim_pages(misc_from_sgx(sgx_cg), current->mm); > + sgx_put_cg(sgx_cg); > + } Empty line. > + /* Make sure there are some free pages at global level */ > if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) Looking at the code, to me sgx_should_reclaim() is a little bit vague because from the name we don't know whether it interally checks the current cgroup or the global. It's better to rename to sgx_should_reclaim_global(). Ditto for sgx_can_reclaim() -> sgx_can_reclaim_global(). And I think you can do the renaming in the previous patch, because in the changelog of your previous patch, it seems you have called out the two functions are for global reclaim. > sgx_reclaim_pages_global(current->mm); > } > @@ -616,6 +649,12 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, > enum sgx_reclaim reclaim) > break; > } > > + /* > + * At this point, the usage within this cgroup is under its > + * limit but there is no physical page left for allocation. > + * Perform a global reclaim to get some pages released from any > + * cgroup with reclaimable pages. > + */ > sgx_reclaim_pages_global(current->mm); > cond_resched(); > }
Re: [PATCH 0/4] vhost: Cleanup
On 4/29/24 17:02, Michael S. Tsirkin wrote: On Tue, Apr 23, 2024 at 01:24:03PM +1000, Gavin Shan wrote: This is suggested by Michael S. Tsirkin according to [1] and the goal is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it, the caller of the function needn't to worry about memory barriers. Since we're here, other cleanups are also applied. Gavin I suggested another approach. 1. Start with the patch I sent (vhost: order avail ring reads after index updates) just do a diff against latest. simplify error handling a bit. 2. Do any other cleanups on top. My apologies, Michael. I didn't see your patch until now [1] [1] https://lore.kernel.org/virtualization/20240327155750-mutt-send-email-...@kernel.org/ v2 was sent with your changes integrated and other cleanup are applied on top of it. Please take a look when you getting a chance. v2: https://lore.kernel.org/virtualization/20240429101400.617007-1-gs...@redhat.com/T/#t Thanks, Gavin
[PATCH v2 4/4] vhost: Reformat vhost_{get, put}_user()
Reformat the macros to use tab as the terminator for each line so that it looks clean. No functional change intended. Signed-off-by: Gavin Shan --- drivers/vhost/vhost.c | 60 +-- 1 file changed, 30 insertions(+), 30 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 4ddb9ec2fe46..c1ed5e750521 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1207,21 +1207,22 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq, return __vhost_get_user_slow(vq, addr, size, type); } -#define vhost_put_user(vq, x, ptr) \ -({ \ - int ret; \ - if (!vq->iotlb) { \ - ret = __put_user(x, ptr); \ - } else { \ - __typeof__(ptr) to = \ +#define vhost_put_user(vq, x, ptr) \ +({ \ + int ret;\ + if (!vq->iotlb) { \ + ret = __put_user(x, ptr); \ + } else {\ + __typeof__(ptr) to =\ (__typeof__(ptr)) __vhost_get_user(vq, ptr, \ - sizeof(*ptr), VHOST_ADDR_USED); \ - if (to != NULL) \ - ret = __put_user(x, to); \ - else \ - ret = -EFAULT; \ - } \ - ret; \ + sizeof(*ptr), \ + VHOST_ADDR_USED); \ + if (to != NULL) \ + ret = __put_user(x, to);\ + else\ + ret = -EFAULT; \ + } \ + ret;\ }) static inline int vhost_put_avail_event(struct vhost_virtqueue *vq) @@ -1252,22 +1253,21 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq) >used->idx); } -#define vhost_get_user(vq, x, ptr, type) \ -({ \ - int ret; \ - if (!vq->iotlb) { \ - ret = __get_user(x, ptr); \ - } else { \ - __typeof__(ptr) from = \ - (__typeof__(ptr)) __vhost_get_user(vq, ptr, \ - sizeof(*ptr), \ - type); \ - if (from != NULL) \ - ret = __get_user(x, from); \ - else \ - ret = -EFAULT; \ - } \ - ret; \ +#define vhost_get_user(vq, x, ptr, type) \ +({ \ + int ret;\ + if (!vq->iotlb) { \ + ret = __get_user(x, ptr); \ + } else {\ + __typeof__(ptr) from = \ + (__typeof__(ptr)) __vhost_get_user(vq, ptr, \ + sizeof(*ptr), type);\ + if (from != NULL) \ + ret = __get_user(x, from); \ + else\ + ret = -EFAULT; \ + } \ + ret;\ }) #define vhost_get_avail(vq, x, ptr) \ -- 2.44.0
[PATCH v2 3/4] vhost: Improve vhost_get_avail_head()
Improve vhost_get_avail_head() so that the head or errno is returned. With it, the relevant sanity checks are squeezed to vhost_get_avail_head() and vhost_get_vq_desc() is further simplified. No functional change intended. Signed-off-by: Gavin Shan --- drivers/vhost/vhost.c | 50 ++- 1 file changed, 26 insertions(+), 24 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index b278c0333a66..4ddb9ec2fe46 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1322,11 +1322,27 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) return 1; } -static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, - __virtio16 *head, int idx) +static inline int vhost_get_avail_head(struct vhost_virtqueue *vq) { - return vhost_get_avail(vq, *head, - >avail->ring[idx & (vq->num - 1)]); + __virtio16 head; + int r; + + r = vhost_get_avail(vq, head, + >avail->ring[vq->last_avail_idx & (vq->num - 1)]); + if (unlikely(r)) { + vq_err(vq, "Failed to read head: index %u address %p\n", + vq->last_avail_idx, + >avail->ring[vq->last_avail_idx & (vq->num - 1)]); + return r; + } + + r = vhost16_to_cpu(vq, head); + if (unlikely(r >= vq->num)) { + vq_err(vq, "Invalid head %d (%u)\n", r, vq->num); + return -EINVAL; + } + + return r; } static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq, @@ -2523,9 +2539,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, struct vhost_log *log, unsigned int *log_num) { struct vring_desc desc; - unsigned int i, head, found = 0; - __virtio16 ring_head; - int ret, access; + unsigned int i, found = 0; + int head, ret, access; if (vq->avail_idx == vq->last_avail_idx) { ret = vhost_get_avail_idx(vq); @@ -2536,23 +2551,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, return vq->num; } - /* Grab the next descriptor number they're advertising, and increment -* the index we've seen. */ - if (unlikely(vhost_get_avail_head(vq, _head, vq->last_avail_idx))) { - vq_err(vq, "Failed to read head: idx %d address %p\n", - vq->last_avail_idx, - >avail->ring[vq->last_avail_idx % vq->num]); - return -EFAULT; - } - - head = vhost16_to_cpu(vq, ring_head); - - /* If their number is silly, that's an error. */ - if (unlikely(head >= vq->num)) { - vq_err(vq, "Guest says index %u > %u is available", - head, vq->num); - return -EINVAL; - } + /* Grab the next descriptor number they're advertising */ + head = vhost_get_avail_head(vq); + if (unlikely(head < 0)) + return head; /* When we start there are none of either input nor output. */ *out_num = *in_num = 0; -- 2.44.0
[PATCH v2 2/4] vhost: Drop variable last_avail_idx in vhost_get_vq_desc()
The local variable @last_avail_idx is equivalent to vq->last_avail_idx. So the code can be simplified a bit by dropping the local variable @last_avail_idx. No functional change intended. Signed-off-by: Gavin Shan --- drivers/vhost/vhost.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 7aa623117aab..b278c0333a66 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -2524,7 +2524,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, { struct vring_desc desc; unsigned int i, head, found = 0; - u16 last_avail_idx = vq->last_avail_idx; __virtio16 ring_head; int ret, access; @@ -2539,10 +2538,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, /* Grab the next descriptor number they're advertising, and increment * the index we've seen. */ - if (unlikely(vhost_get_avail_head(vq, _head, last_avail_idx))) { + if (unlikely(vhost_get_avail_head(vq, _head, vq->last_avail_idx))) { vq_err(vq, "Failed to read head: idx %d address %p\n", - last_avail_idx, - >avail->ring[last_avail_idx % vq->num]); + vq->last_avail_idx, + >avail->ring[vq->last_avail_idx % vq->num]); return -EFAULT; } -- 2.44.0
[PATCH v2 1/4] vhost: Improve vhost_get_avail_idx() with smp_rmb()
From: "Michael S. Tsirkin" All the callers of vhost_get_avail_idx() are concerned to the memory barrier, imposed by smp_rmb() to ensure the order of the available ring entry read and avail_idx read. Improve vhost_get_avail_idx() so that smp_rmb() is executed when the avail_idx is advanced. With it, the callers needn't to worry about the memory barrier. No functional change intended. Signed-off-by: Michael S. Tsirkin [gshan: repainted vhost_get_avail_idx()] Reviewed-by: Gavin Shan Acked-by: Will Deacon --- drivers/vhost/vhost.c | 106 +- 1 file changed, 42 insertions(+), 64 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 8995730ce0bf..7aa623117aab 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1290,10 +1290,36 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) mutex_unlock(>vqs[i]->mutex); } -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq, - __virtio16 *idx) +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) { - return vhost_get_avail(vq, *idx, >avail->idx); + __virtio16 idx; + int r; + + r = vhost_get_avail(vq, idx, >avail->idx); + if (unlikely(r < 0)) { + vq_err(vq, "Failed to access available index at %p (%d)\n", + >avail->idx, r); + return r; + } + + /* Check it isn't doing very strange thing with available indexes */ + vq->avail_idx = vhost16_to_cpu(vq, idx); + if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) { + vq_err(vq, "Invalid available index change from %u to %u", + vq->last_avail_idx, vq->avail_idx); + return -EINVAL; + } + + /* We're done if there is nothing new */ + if (vq->avail_idx == vq->last_avail_idx) + return 0; + + /* +* We updated vq->avail_idx so we need a memory barrier between +* the index read above and the caller reading avail ring entries. +*/ + smp_rmb(); + return 1; } static inline int vhost_get_avail_head(struct vhost_virtqueue *vq, @@ -2498,38 +2524,17 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, { struct vring_desc desc; unsigned int i, head, found = 0; - u16 last_avail_idx; - __virtio16 avail_idx; + u16 last_avail_idx = vq->last_avail_idx; __virtio16 ring_head; int ret, access; - /* Check it isn't doing very strange things with descriptor numbers. */ - last_avail_idx = vq->last_avail_idx; - if (vq->avail_idx == vq->last_avail_idx) { - if (unlikely(vhost_get_avail_idx(vq, _idx))) { - vq_err(vq, "Failed to access avail idx at %p\n", - >avail->idx); - return -EFAULT; - } - vq->avail_idx = vhost16_to_cpu(vq, avail_idx); - - if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) { - vq_err(vq, "Guest moved avail index from %u to %u", - last_avail_idx, vq->avail_idx); - return -EFAULT; - } + ret = vhost_get_avail_idx(vq); + if (unlikely(ret < 0)) + return ret; - /* If there's nothing new since last we looked, return -* invalid. -*/ - if (vq->avail_idx == last_avail_idx) + if (!ret) return vq->num; - - /* Only get avail ring entries after they have been -* exposed by guest. -*/ - smp_rmb(); } /* Grab the next descriptor number they're advertising, and increment @@ -2790,35 +2795,20 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n); /* return true if we're sure that avaiable ring is empty */ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq) { - __virtio16 avail_idx; int r; if (vq->avail_idx != vq->last_avail_idx) return false; - r = vhost_get_avail_idx(vq, _idx); - if (unlikely(r)) - return false; - - vq->avail_idx = vhost16_to_cpu(vq, avail_idx); - if (vq->avail_idx != vq->last_avail_idx) { - /* Since we have updated avail_idx, the following -* call to vhost_get_vq_desc() will read available -* ring entries. Make sure that read happens after -* the avail_idx read. -*/ - smp_rmb(); - return false; - } - - return true; + /* Treat error as non-empty here */ + r = vhost_get_avail_idx(vq); + return r == 0; } EXPORT_SYMBOL_GPL(vhost_vq_avail_empty); /* OK, now we need to know
[PATCH v2 0/4] vhost: Cleanup
This is suggested by Michael S. Tsirkin according to [1] and the goal is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it, the caller of the function needn't to worry about memory barriers. Since we're here, other cleanups are also applied. [1] https://lore.kernel.org/virtualization/20240327155750-mutt-send-email-...@kernel.org/ PATCH[1] improves vhost_get_avail_idx() so that smp_rmb() is applied if needed. Besides, the sanity checks on the retrieved available queue index are also squeezed to vhost_get_avail_idx() PATCH[2] drops the local variable @last_avail_idx since it's equivalent to vq->last_avail_idx PATCH[3] improves vhost_get_avail_head(), similar to what we're doing for vhost_get_avail_idx(), so that the relevant sanity checks on the head are squeezed to vhost_get_avail_head() PATCH[4] Reformat vhost_{get, put}_user() by using tab instead of space as the terminator for each line Gavin Shan (3): vhost: Drop variable last_avail_idx in vhost_get_vq_desc() vhost: Improve vhost_get_avail_head() vhost: Reformat vhost_{get, put}_user() Michael S. Tsirkin (1): vhost: Improve vhost_get_avail_idx() with smp_rmb() drivers/vhost/vhost.c | 215 +++--- 1 file changed, 97 insertions(+), 118 deletions(-) Changelog = v2: * Improve vhost_get_avail_idx() as Michael suggested in [1] as above (Michael) * Correct @head's type from 'unsigned int' to 'int' (l...@intel.com) -- 2.44.0
回复:WARNING in current_check_refer_path
> Hello, > Thanks for the report. Could you please provide a reproducer? > Regards, > Mickaël Hi. The Poc file has seed to you as attachment. > On Sun, Apr 28, 2024 at 10:47:02AM +0800, Ubisectech Sirius wrote: >> Hello. >> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. >> Recently, our team has discovered a issue in Linux kernel 6.7. Attached to >> the email were a PoC file of the issue. >> >> Stack dump: >> > > loop3: detected capacity change from 0 to 1024 > > [ cut here ] > > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access > > security/landlock/fs.c:598 [inline] > > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access > > security/landlock/fs.c:578 [inline] > > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 > > current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758 > > Modules linked in: > > CPU: 0 PID: 30368 Comm: syz-executor.3 Not tainted 6.7.0 #2 > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 > > 04/01/2014 > > RIP: 0010:get_mode_access security/landlock/fs.c:598 [inline] > > RIP: 0010:get_mode_access security/landlock/fs.c:578 [inline] > > RIP: 0010:current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758 > > Code: e9 76 fb ff ff 41 bc fe ff ff ff e9 6b fb ff ff e8 00 99 77 fd 90 0f > > 0b 90 41 bc f3 ff ff ff e9 57 fb ff ff e8 ec 98 77 fd 90 <0f> 0b 90 31 db > > e9 86 f9 ff ff bb 00 08 00 00 e9 7c f9 ff ff 41 ba > > RSP: 0018:c90001fb7ba0 EFLAGS: 00010212 > > RAX: 0bc5 RBX: 88805feeb7b0 RCX: c90006e15000 > > RDX: 0004 RSI: 84125d64 RDI: 0003 > > RBP: 8880123c5608 R08: 0003 R09: c000 > > R10: f000 R11: R12: 88805d32fc00 > > R13: 8880123c5608 R14: R15: 0001 > > FS: 7fd70c4d8640() GS:88802c60() knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: 001b2c136000 CR3: 5b2a CR4: 00750ef0 > > DR0: DR1: DR2: > > DR3: DR6: fffe0ff0 DR7: 0400 > > PKRU: 5554 > > Call Trace: > > > > security_path_rename+0x124/0x230 security/security.c:1828 > > do_renameat2+0x9f6/0xd30 fs/namei.c:4983 > > __do_sys_rename fs/namei.c:5042 [inline] > > __se_sys_rename fs/namei.c:5040 [inline] > > __x64_sys_rename+0x81/0xa0 fs/namei.c:5040 > > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > > do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83 > > entry_SYSCALL_64_after_hwframe+0x6f/0x77 > > RIP: 0033:0x7fd70b6900ed > > Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 > > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff > > ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 > > RSP: 002b:7fd70c4d8028 EFLAGS: 0246 ORIG_RAX: 0052 > > RAX: ffda RBX: 7fd70b7cbf80 RCX: 7fd70b6900ed >> RDX: RSI: 2140 RDI: 2100 > > RBP: 7fd70b6f14a6 R08: R09: > > R10: R11: 0246 R12: > > R13: 000b R14: 7fd70b7cbf80 R15: 7fd70c4b8000 > > > > > > Thank you for taking the time to read this email and we look forward to > > working with you further. > > > > > > > > > > > > > > > poc.c Description: Binary data
[PATCH net-next v8] net/ipv4: add tracepoint for icmp_send
From: Peilin He Introduce a tracepoint for icmp_send, which can help users to get more detail information conveniently when icmp abnormal events happen. 1. Giving an usecase example: = When an application experiences packet loss due to an unreachable UDP destination port, the kernel will send an exception message through the icmp_send function. By adding a trace point for icmp_send, developers or system administrators can obtain detailed information about the UDP packet loss, including the type, code, source address, destination address, source port, and destination port. This facilitates the trouble-shooting of UDP packet loss issues especially for those network-service applications. 2. Operation Instructions: == Switch to the tracing directory. cd /sys/kernel/tracing Filter for destination port unreachable. echo "type==3 && code==3" > events/icmp/icmp_send/filter Enable trace event. echo 1 > events/icmp/icmp_send/enable 3. Result View: udp_client_erro-11370 [002] ...s.12 124.728002: icmp_send: icmp_send: type=3, code=3. From 127.0.0.1:41895 to 127.0.0.1: ulen=23 skbaddr=589b167a Change log = v7->v8: Some fixes according to https://lore.kernel.org/all/CANn89iKNtKmN5im8K4dSZGqAV8=e3bZb z5ahxbcnbjfxk5v...@mail.gmail.com/ 1.Combine all variable definitions together without mixing code and variables. v6->v7: Some fixes according to https://lore.kernel.org/all/20240425081210.720a4...@kernel.org/ 1. Fix patch format issues. v5->v6: Some fixes according to https://lore.kernel.org/all/20240413161319.ga853...@kernel.org/ 1.Resubmit patches based on the latest net-next code. v4->v5: Some fixes according to https://lore.kernel.org/all/CAL+tcoDeXXh+zcRk4PHnUk8ELnx=CE2pc cqs7sfm0y9ak-e...@mail.gmail.com/ 1.Adjust the position of trace_icmp_send() to before icmp_push_reply(). v3->v4: Some fixes according to https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJz BzkJFVgCTdXBx=y...@mail.gmail.com/ 1.Add legality check for UDP header in SKB. 2.Target this patch for net-next. v2->v3: Some fixes according to https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/ 1. Change the tracking directory to/sys/kernel/tracking. 2. Adjust the layout of the TP-STRUCT_entry parameter structure. v1->v2: Some fixes according to https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHu fqtjvjsv-nz1x...@mail.gmail.com/ 1. adjust the trace_icmp_send() to more protocols than UDP. 2. move the calling of trace_icmp_send after sanity checks in __icmp_send(). Signed-off-by: Peilin He Signed-off-by: xu xin Reviewed-by: Yunkai Zhang Cc: Yang Yang Cc: Liu Chun Cc: Xuexin Jiang --- include/trace/events/icmp.h | 67 + net/ipv4/icmp.c | 4 +++ 2 files changed, 71 insertions(+) create mode 100644 include/trace/events/icmp.h diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h new file mode 100644 index ..31559796949a --- /dev/null +++ b/include/trace/events/icmp.h @@ -0,0 +1,67 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM icmp + +#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_ICMP_H + +#include +#include + +TRACE_EVENT(icmp_send, + + TP_PROTO(const struct sk_buff *skb, int type, int code), + + TP_ARGS(skb, type, code), + + TP_STRUCT__entry( + __field(const void *, skbaddr) + __field(int, type) + __field(int, code) + __array(__u8, saddr, 4) + __array(__u8, daddr, 4) + __field(__u16, sport) + __field(__u16, dport) + __field(unsigned short, ulen) + ), + + TP_fast_assign( + struct iphdr *iph = ip_hdr(skb); + struct udphdr *uh = udp_hdr(skb); + int proto_4 = iph->protocol; + __be32 *p32; + + __entry->skbaddr = skb; + __entry->type = type; + __entry->code = code; + + if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head || + (u8 *)uh + sizeof(struct udphdr) + > skb_tail_pointer(skb)) { + __entry->sport = 0; + __entry->dport = 0; + __entry->ulen = 0; + } else { + __entry->sport = ntohs(uh->source); + __entry->dport = ntohs(uh->dest); + __entry->ulen = ntohs(uh->len); + } + + p32 = (__be32 *) __entry->saddr; +
Re: WARNING in current_check_refer_path
Hello, Thanks for the report. Could you please provide a reproducer? Regards, Mickaël On Sun, Apr 28, 2024 at 10:47:02AM +0800, Ubisectech Sirius wrote: > Hello. > We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. > Recently, our team has discovered a issue in Linux kernel 6.7. Attached to > the email were a PoC file of the issue. > > Stack dump: > > loop3: detected capacity change from 0 to 1024 > [ cut here ] > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access > security/landlock/fs.c:598 [inline] > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access > security/landlock/fs.c:578 [inline] > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 > current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758 > Modules linked in: > CPU: 0 PID: 30368 Comm: syz-executor.3 Not tainted 6.7.0 #2 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 > 04/01/2014 > RIP: 0010:get_mode_access security/landlock/fs.c:598 [inline] > RIP: 0010:get_mode_access security/landlock/fs.c:578 [inline] > RIP: 0010:current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758 > Code: e9 76 fb ff ff 41 bc fe ff ff ff e9 6b fb ff ff e8 00 99 77 fd 90 0f 0b > 90 41 bc f3 ff ff ff e9 57 fb ff ff e8 ec 98 77 fd 90 <0f> 0b 90 31 db e9 86 > f9 ff ff bb 00 08 00 00 e9 7c f9 ff ff 41 ba > RSP: 0018:c90001fb7ba0 EFLAGS: 00010212 > RAX: 0bc5 RBX: 88805feeb7b0 RCX: c90006e15000 > RDX: 0004 RSI: 84125d64 RDI: 0003 > RBP: 8880123c5608 R08: 0003 R09: c000 > R10: f000 R11: R12: 88805d32fc00 > R13: 8880123c5608 R14: R15: 0001 > FS: 7fd70c4d8640() GS:88802c60() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 001b2c136000 CR3: 5b2a CR4: 00750ef0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > PKRU: 5554 > Call Trace: > > security_path_rename+0x124/0x230 security/security.c:1828 > do_renameat2+0x9f6/0xd30 fs/namei.c:4983 > __do_sys_rename fs/namei.c:5042 [inline] > __se_sys_rename fs/namei.c:5040 [inline] > __x64_sys_rename+0x81/0xa0 fs/namei.c:5040 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x6f/0x77 > RIP: 0033:0x7fd70b6900ed > Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 > 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 > 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 > RSP: 002b:7fd70c4d8028 EFLAGS: 0246 ORIG_RAX: 0052 > RAX: ffda RBX: 7fd70b7cbf80 RCX: 7fd70b6900ed > RDX: RSI: 2140 RDI: 2100 > RBP: 7fd70b6f14a6 R08: R09: > R10: R11: 0246 R12: > R13: 000b R14: 7fd70b7cbf80 R15: 7fd70c4b8000 > > > Thank you for taking the time to read this email and we look forward to > working with you further. > > > > > > > > >
[PATCH V1] soc: qcom: smp2p: Introduce tracepoint support
Introduce tracepoint support for smp2p providing useful logging for communication between clients. Signed-off-by: Sudeepgoud Patil Signed-off-by: Deepak Kumar Singh --- drivers/soc/qcom/Makefile | 1 + drivers/soc/qcom/smp2p.c | 10 drivers/soc/qcom/trace-smp2p.h | 99 ++ 3 files changed, 110 insertions(+) create mode 100644 drivers/soc/qcom/trace-smp2p.h diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile index ca0bece0dfff..30c1bf645501 100644 --- a/drivers/soc/qcom/Makefile +++ b/drivers/soc/qcom/Makefile @@ -23,6 +23,7 @@ qcom_rpmh-y += rpmh.o obj-$(CONFIG_QCOM_SMD_RPM) += rpm-proc.o smd-rpm.o obj-$(CONFIG_QCOM_SMEM) += smem.o obj-$(CONFIG_QCOM_SMEM_STATE) += smem_state.o +CFLAGS_smp2p.o := -I$(src) obj-$(CONFIG_QCOM_SMP2P) += smp2p.o obj-$(CONFIG_QCOM_SMSM)+= smsm.o obj-$(CONFIG_QCOM_SOCINFO) += socinfo.o diff --git a/drivers/soc/qcom/smp2p.c b/drivers/soc/qcom/smp2p.c index a21241cbeec7..dde8513641ae 100644 --- a/drivers/soc/qcom/smp2p.c +++ b/drivers/soc/qcom/smp2p.c @@ -20,6 +20,9 @@ #include #include +#define CREATE_TRACE_POINTS +#include "trace-smp2p.h" + /* * The Shared Memory Point to Point (SMP2P) protocol facilitates communication * of a single 32-bit value between two processors. Each value has a single @@ -191,6 +194,7 @@ static void qcom_smp2p_do_ssr_ack(struct qcom_smp2p *smp2p) struct smp2p_smem_item *out = smp2p->out; u32 val; + trace_smp2p_ssr_ack(smp2p->remote_pid); smp2p->ssr_ack = !smp2p->ssr_ack; val = out->flags & ~BIT(SMP2P_FLAGS_RESTART_ACK_BIT); @@ -213,6 +217,7 @@ static void qcom_smp2p_negotiate(struct qcom_smp2p *smp2p) smp2p->ssr_ack_enabled = true; smp2p->negotiation_done = true; + trace_smp2p_negotiate(smp2p->remote_pid, smp2p->ssr_ack_enabled); } } @@ -251,6 +256,8 @@ static void qcom_smp2p_notify_in(struct qcom_smp2p *smp2p) status = val ^ entry->last_value; entry->last_value = val; + trace_smp2p_notify_in(smp2p->remote_pid, entry->name, status, val); + /* No changes of this entry? */ if (!status) continue; @@ -406,6 +413,9 @@ static int smp2p_update_bits(void *data, u32 mask, u32 value) writel(val, entry->value); spin_unlock_irqrestore(>lock, flags); + trace_smp2p_update_bits(entry->smp2p->remote_pid, + entry->name, orig, val); + if (val != orig) qcom_smp2p_kick(entry->smp2p); diff --git a/drivers/soc/qcom/trace-smp2p.h b/drivers/soc/qcom/trace-smp2p.h new file mode 100644 index ..c61afab23f0c --- /dev/null +++ b/drivers/soc/qcom/trace-smp2p.h @@ -0,0 +1,99 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved. + */ + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM qcom_smp2p + +#if !defined(__QCOM_SMP2P_TRACE_H__) || defined(TRACE_HEADER_MULTI_READ) +#define __QCOM_SMP2P_TRACE_H__ + +#include + +TRACE_EVENT(smp2p_ssr_ack, + TP_PROTO(unsigned int remote_pid), + TP_ARGS(remote_pid), + TP_STRUCT__entry( + __field(u32, remote_pid) + ), + TP_fast_assign( + __entry->remote_pid = remote_pid; + ), + TP_printk("%d: SSR detected, doing SSR Handshake", + __entry->remote_pid + ) +); + +TRACE_EVENT(smp2p_negotiate, + TP_PROTO(unsigned int remote_pid, bool ssr_ack_enabled), + TP_ARGS(remote_pid, ssr_ack_enabled), + TP_STRUCT__entry( + __field(u32, remote_pid) + __field(bool, ssr_ack_enabled) + ), + TP_fast_assign( + __entry->remote_pid = remote_pid; + __entry->ssr_ack_enabled = ssr_ack_enabled; + ), + TP_printk("%d: state=open ssr_ack=%d", + __entry->remote_pid, + __entry->ssr_ack_enabled + ) +); + +TRACE_EVENT(smp2p_notify_in, + TP_PROTO(unsigned int remote_pid, const char *name, unsigned long status, u32 val), + TP_ARGS(remote_pid, name, status, val), + TP_STRUCT__entry( + __field(u32, remote_pid) + __string(name, name) + __field(unsigned long, status) + __field(u32, val) + ), + TP_fast_assign( + __entry->remote_pid = remote_pid; + __assign_str(name, name); + __entry->status = status; + __entry->val = val; + ), + TP_printk("%d: %s: status:0x%0lx val:0x%0x", + __entry->remote_pid, + __get_str(name), + __entry->status, + __entry->val + ) +); + +TRACE_EVENT(smp2p_update_bits, + TP_PROTO(unsigned int remote_pid, const char *name, u32 orig, u32 val), +
Re: [PATCHv3 bpf-next 6/7] selftests/bpf: Add uretprobe compat test
On Fri, Apr 26, 2024 at 11:06:53AM -0700, Andrii Nakryiko wrote: > On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa wrote: > > > > Adding test that adds return uprobe inside 32 bit task > > and verify the return uprobe and attached bpf programs > > get properly executed. > > > > Signed-off-by: Jiri Olsa > > --- > > tools/testing/selftests/bpf/.gitignore| 1 + > > tools/testing/selftests/bpf/Makefile | 6 ++- > > .../selftests/bpf/prog_tests/uprobe_syscall.c | 40 +++ > > .../bpf/progs/uprobe_syscall_compat.c | 13 ++ > > 4 files changed, 59 insertions(+), 1 deletion(-) > > create mode 100644 > > tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c > > > > diff --git a/tools/testing/selftests/bpf/.gitignore > > b/tools/testing/selftests/bpf/.gitignore > > index f1aebabfb017..69d71223c0dd 100644 > > --- a/tools/testing/selftests/bpf/.gitignore > > +++ b/tools/testing/selftests/bpf/.gitignore > > @@ -45,6 +45,7 @@ test_cpp > > /veristat > > /sign-file > > /uprobe_multi > > +/uprobe_compat > > *.ko > > *.tmp > > xskxceiver > > diff --git a/tools/testing/selftests/bpf/Makefile > > b/tools/testing/selftests/bpf/Makefile > > index edc73f8f5aef..d170b63eca62 100644 > > --- a/tools/testing/selftests/bpf/Makefile > > +++ b/tools/testing/selftests/bpf/Makefile > > @@ -134,7 +134,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr > > test_skb_cgroup_id_user \ > > xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata > > \ > > xdp_features bpf_test_no_cfi.ko > > > > -TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi > > +TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi > > uprobe_compat > > you need to add uprobe_compat to TRUNNER_EXTRA_FILES as well, no? ah right > > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > index 9233210a4c33..3770254d893b 100644 > > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > @@ -11,6 +11,7 @@ > > #include > > #include "uprobe_syscall.skel.h" > > #include "uprobe_syscall_call.skel.h" > > +#include "uprobe_syscall_compat.skel.h" > > > > __naked unsigned long uretprobe_regs_trigger(void) > > { > > @@ -291,6 +292,35 @@ static void test_uretprobe_syscall_call(void) > > "read_trace_pipe_iter"); > > ASSERT_EQ(found, 0, "found"); > > } > > + > > +static void trace_pipe_compat_cb(const char *str, void *data) > > +{ > > + if (strstr(str, "uretprobe compat") != NULL) > > + (*(int *)data)++; > > +} > > + > > +static void test_uretprobe_compat(void) > > +{ > > + struct uprobe_syscall_compat *skel = NULL; > > + int err, found = 0; > > + > > + skel = uprobe_syscall_compat__open_and_load(); > > + if (!ASSERT_OK_PTR(skel, "uprobe_syscall_compat__open_and_load")) > > + goto cleanup; > > + > > + err = uprobe_syscall_compat__attach(skel); > > + if (!ASSERT_OK(err, "uprobe_syscall_compat__attach")) > > + goto cleanup; > > + > > + system("./uprobe_compat"); > > + > > + ASSERT_OK(read_trace_pipe_iter(trace_pipe_compat_cb, , 1000), > > +"read_trace_pipe_iter"); > > why so complicated? can't you just set global variable that it was called hm, we execute separate uprobe_compat (32bit) process that triggers the bpf program, so we can't use global variable.. using the trace_pipe was the only thing that was easy to do jirka > > > + ASSERT_EQ(found, 1, "found"); > > + > > +cleanup: > > + uprobe_syscall_compat__destroy(skel); > > +} > > #else > > static void test_uretprobe_regs_equal(void) > > { > > @@ -306,6 +336,11 @@ static void test_uretprobe_syscall_call(void) > > { > > test__skip(); > > } > > + > > +static void test_uretprobe_compat(void) > > +{ > > + test__skip(); > > +} > > #endif > > > > void test_uprobe_syscall(void) > > @@ -320,3 +355,8 @@ void serial_test_uprobe_syscall_call(void) > > { > > test_uretprobe_syscall_call(); > > } > > + > > +void serial_test_uprobe_syscall_compat(void) > > and then no need for serial_test? > > > +{ > > + test_uretprobe_compat(); > > +} > > diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c > > b/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c > > new file mode 100644 > > index ..f8adde7f08e2 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c > > @@ -0,0 +1,13 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +#include > > +#include > > +#include > > + > > +char _license[] SEC("license") = "GPL"; > > + > > +SEC("uretprobe.multi/./uprobe_compat:main") > > +int uretprobe_compat(struct pt_regs *ctx) > > +{ > > + bpf_printk("uretprobe compat\n"); > > + return 0; > > +} > > -- > > 2.44.0 > >
Re: [PATCHv3 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test
On Fri, Apr 26, 2024 at 11:03:29AM -0700, Andrii Nakryiko wrote: > On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa wrote: > > > > Adding test to verify that when called from outside of the > > trampoline provided by kernel, the uretprobe syscall will cause > > calling process to receive SIGILL signal and the attached bpf > > program is no executed. > > > > Signed-off-by: Jiri Olsa > > --- > > .../selftests/bpf/prog_tests/uprobe_syscall.c | 92 +++ > > .../selftests/bpf/progs/uprobe_syscall_call.c | 15 +++ > > 2 files changed, 107 insertions(+) > > create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > > > See nits below, but overall LGTM > > Acked-by: Andrii Nakryiko > > [...] > > > @@ -219,6 +301,11 @@ static void test_uretprobe_regs_change(void) > > { > > test__skip(); > > } > > + > > +static void test_uretprobe_syscall_call(void) > > +{ > > + test__skip(); > > +} > > #endif > > > > void test_uprobe_syscall(void) > > @@ -228,3 +315,8 @@ void test_uprobe_syscall(void) > > if (test__start_subtest("uretprobe_regs_change")) > > test_uretprobe_regs_change(); > > } > > + > > +void serial_test_uprobe_syscall_call(void) > > does it need to be serial? non-serial are still run sequentially > within a process (there is no multi-threading), it's more about some > global effects on system. plz see below > > > +{ > > + test_uretprobe_syscall_call(); > > +} > > diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > new file mode 100644 > > index ..5ea03bb47198 > > --- /dev/null > > +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > @@ -0,0 +1,15 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > +#include "vmlinux.h" > > +#include > > +#include > > + > > +struct pt_regs regs; > > + > > +char _license[] SEC("license") = "GPL"; > > + > > +SEC("uretprobe//proc/self/exe:uretprobe_syscall_call") > > +int uretprobe(struct pt_regs *regs) > > +{ > > + bpf_printk("uretprobe called"); > > debugging leftover? we probably don't want to pollute trace_pipe from test the reason for this is to make sure the bpf program was not executed, the test makes sure the child gets killed with SIGILL and also that the bpf program was not executed by checking the trace_pipe and making sure nothing was received the trace_pipe reading is also why it's serial jirka > > > + return 0; > > +} > > -- > > 2.44.0 > >
[PATCH] livepatch.h: Add comment to klp transition state
From: Wardenjohn livepatch.h use KLP_UNDEFINED\KLP_UNPATCHED\KLP_PATCHED for klp transition state. When livepatch is ready but idle, using KLP_UNDEFINED seems very confusing. In order not to introduce potential risks to kernel, just update comment to these state. --- include/linux/livepatch.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h index 9b9b38e89563..b6a214f2f8e3 100644 --- a/include/linux/livepatch.h +++ b/include/linux/livepatch.h @@ -18,9 +18,9 @@ #if IS_ENABLED(CONFIG_LIVEPATCH) /* task patch states */ -#define KLP_UNDEFINED -1 -#define KLP_UNPATCHED 0 -#define KLP_PATCHED 1 +#define KLP_UNDEFINED -1 /* idle, no transition in progress */ +#define KLP_UNPATCHED 0 /* transitioning to unpatched state */ +#define KLP_PATCHED 1 /* transitioning to patched state */ /** * struct klp_func - function structure for live patching -- 2.37.3
Re: [PATCH 0/4] vhost: Cleanup
On Tue, Apr 23, 2024 at 01:24:03PM +1000, Gavin Shan wrote: > This is suggested by Michael S. Tsirkin according to [1] and the goal > is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it, > the caller of the function needn't to worry about memory barriers. Since > we're here, other cleanups are also applied. Gavin I suggested another approach. 1. Start with the patch I sent (vhost: order avail ring reads after index updates) just do a diff against latest. simplify error handling a bit. 2. Do any other cleanups on top. > [1] > https://lore.kernel.org/virtualization/20240327075940-mutt-send-email-...@kernel.org/ > > PATCH[1] drops the local variable @last_avail_idx since it's equivalent > to vq->last_avail_idx > PATCH[2] improves vhost_get_avail_idx() so that smp_rmb() is applied if > needed. Besides, the sanity checks on the retrieved available > queue index are also squeezed to vhost_get_avail_idx() > PATCH[3] improves vhost_get_avail_head(), similar to what we're doing > for vhost_get_avail_idx(), so that the relevant sanity checks > on the head are squeezed to vhost_get_avail_head() > PATCH[4] Reformat vhost_{get, put}_user() by using tab instead of space > as the terminator for each line > > Gavin Shan (4): > vhost: Drop variable last_avail_idx in vhost_get_vq_desc() > vhost: Improve vhost_get_avail_idx() with smp_rmb() > vhost: Improve vhost_get_avail_head() > vhost: Reformat vhost_{get, put}_user() > > drivers/vhost/vhost.c | 199 +++--- > 1 file changed, 88 insertions(+), 111 deletions(-) > > -- > 2.44.0