Re: [PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths

2024-04-29 Thread Dan Williams
Verma, Vishal L wrote:
> > > @@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, 
> > > struct device_attribute *attr,
> > >   if (!victim)
> > >   return -ENXIO;
> > >  
> > > - rc = down_write_killable(&dax_region_rwsem);
> > > - if (rc)
> > > - return rc;
> > > - rc = down_write_killable(&dax_dev_rwsem);
> > > - if (rc) {
> > > - up_write(&dax_region_rwsem);
> > > - return rc;
> > > - }
> > > + device_lock(dev);
> > > + device_lock(victim);
> > >   dev_dax = to_dev_dax(victim);
> > > + rc = down_write_killable(&dax_dev_rwsem);
> > 
> > This begs the question, why down_write_killable(), but not
> > device_lock_interruptible()?
> 
> Do you mean change the device_lock()s to device_lock_interruptible() in
> addition to the taking the rwsem (i.e. not instead of the rwsem..)?

I mean convert the rwsem to drop _killable.

> I guess I just restored what was there previously - but the
> interruptible variant makes sense, I can make that change.

So the original code did device_lock(), then the rework added a killable
rwsem (and deleted the device_lock()), and now the fixes add device_lock()
back. Now that there is a mix of killable/interruptible lock usage, all the
locks should agree.

Since there really is no risk of these operations being long running, there
is no driving need to make them killable/interruptible, so go with the
simple option.
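
For illustration, a rough sketch of that simple option in delete_store()
(not the posted patch; names follow the quoted hunk and error handling is
elided):

	/*
	 * Sketch only: unconditional device_lock() plus a plain,
	 * non-killable down_write(), since the section is short-lived.
	 * dax_dev_rwsem, victim and dev_dax are the names used in the
	 * quoted dax/bus.c hunk.
	 */
	device_lock(dev);
	device_lock(victim);
	dev_dax = to_dev_dax(victim);
	down_write(&dax_dev_rwsem);	/* was down_write_killable() */

	/* ... verify the victim is idle and unregister it ... */

	up_write(&dax_dev_rwsem);
	device_unlock(victim);
	device_unlock(dev);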



Re: [EXTERNAL] Re: [PATCH v2 1/2] remoteproc: k3-r5: Wait for core0 power-up before powering up core1

2024-04-29 Thread Beleswar Prasad Padhi

Hello,

On 26/04/24 22:39, Mathieu Poirier wrote:

Good day,

On Wed, Apr 24, 2024 at 06:35:03PM +0530, Beleswar Padhi wrote:
> From: Apurva Nandan 
> 
> PSC controller has a limitation that it can only power-up the second core

> when the first core is in ON state. Power-state for core0 should be equal
> to or higher than core1, else the kernel is seen hanging during rproc
> loading.
> 
> Make the powering up of cores sequential, by waiting for the current core

> to power-up before proceeding to the next core, with a timeout of 2sec.
> Add a wait queue event in k3_r5_cluster_rproc_init call, that will wait
> for the current core to be released from reset before proceeding with the
> next core.
> 
> Fixes: 6dedbd1d5443 ("remoteproc: k3-r5: Add a remoteproc driver for R5F subsystem")
> 
> Signed-off-by: Apurva Nandan 


You need to add your own SoB as well.

> ---
>  drivers/remoteproc/ti_k3_r5_remoteproc.c | 28 
>  1 file changed, 28 insertions(+)
> 
> diff --git a/drivers/remoteproc/ti_k3_r5_remoteproc.c b/drivers/remoteproc/ti_k3_r5_remoteproc.c

> index ad3415a3851b..5a9bd5d4a2ea 100644
> --- a/drivers/remoteproc/ti_k3_r5_remoteproc.c
> +++ b/drivers/remoteproc/ti_k3_r5_remoteproc.c
> @@ -103,12 +103,14 @@ struct k3_r5_soc_data {
>   * @dev: cached device pointer
>   * @mode: Mode to configure the Cluster - Split or LockStep
>   * @cores: list of R5 cores within the cluster
> + * @core_transition: wait queue to sync core state changes
>   * @soc_data: SoC-specific feature data for a R5FSS
>   */
>  struct k3_r5_cluster {
>struct device *dev;
>enum cluster_mode mode;
>struct list_head cores;
> +  wait_queue_head_t core_transition;
>const struct k3_r5_soc_data *soc_data;
>  };
>  
> @@ -128,6 +130,7 @@ struct k3_r5_cluster {

>   * @atcm_enable: flag to control ATCM enablement
>   * @btcm_enable: flag to control BTCM enablement
>   * @loczrama: flag to dictate which TCM is at device address 0x0
> + * @released_from_reset: flag to signal when core is out of reset
>   */
>  struct k3_r5_core {
>struct list_head elem;
> @@ -144,6 +147,7 @@ struct k3_r5_core {
>u32 atcm_enable;
>u32 btcm_enable;
>u32 loczrama;
> +  bool released_from_reset;
>  };
>  
>  /**

> @@ -460,6 +464,8 @@ static int k3_r5_rproc_prepare(struct rproc *rproc)
>ret);
>return ret;
>}
> +  core->released_from_reset = true;
> +  wake_up_interruptible(&cluster->core_transition);
>  
>  	/*

> * Newer IP revisions like on J7200 SoCs support h/w auto-initialization
> @@ -1140,6 +1146,7 @@ static int k3_r5_rproc_configure_mode(struct 
k3_r5_rproc *kproc)
>return ret;
>}
>  
> +	core->released_from_reset = c_state;


I understand why this is needed, but this line could be very cryptic for people
trying to understand this driver. Please add a comment to describe what is
happening here.

Thanks for the review. I will send v3 addressing these comments shortly!


>   ret = ti_sci_proc_get_status(core->tsp, &boot_vec, &cfg, &ctrl, &stat);
>if (ret < 0) {
> @@ -1280,6 +1287,26 @@ static int k3_r5_cluster_rproc_init(struct 
platform_device *pdev)
>cluster->mode == CLUSTER_MODE_SINGLECPU ||
>cluster->mode == CLUSTER_MODE_SINGLECORE)
>break;
> +
> +  /*
> +   * R5 cores require to be powered on sequentially, core0
> +   * should be in higher power state than core1 in a cluster
> +   * So, wait for current core to power up before proceeding
> +   * to next core and put timeout of 2sec for each core.
> +   *
> +   * This waiting mechanism is necessary because
> +   * rproc_auto_boot_callback() for core1 can be called before
> +   * core0 due to thread execution order.
> +   */
> +  ret = wait_event_interruptible_timeout(cluster->core_transition,
> + core->released_from_reset,
> + msecs_to_jiffies(2000));
> +  if (ret <= 0) {
> +  dev_err(dev,
> +  "Timed out waiting for %s core to power up!\n",
> +  rproc->name);
> +  return ret;
> +  }
>}
>  
>  	return 0;

> @@ -1709,6 +1736,7 @@ static int k3_r5_probe(struct platform_device *pdev)
>cluster->dev = dev;
>

Re: [PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths

2024-04-29 Thread Verma, Vishal L
On Mon, 2024-04-29 at 18:25 -0700, Dan Williams wrote:
> Vishal Verma wrote:
> > Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
> > rwsem")
> > was a bit overzealous in eliminating device_lock() usage, and ended up
> > removing a couple of lock acquisitions which were needed, and as a
> > result, fix some of the conditional locking missteps that the above
> > commit introduced in unregister_dax_dev() and unregister_dax_mapping().
> 
> I think it makes sense to tell the story a bit about why the
> delete_store() conversion was problematic, because the
> unregister_dev_dax() changes were just a knock-on effect to fixing the
> delete_store() flow.
> 
> Something like:
> 
> ---
> commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
> rwsem")
> aimed to undo device_lock() abuses for protecting changes to dax-driver
> internal data-structures like the dax_region resource tree to
> device-dax-instance range structures. However, the device_lock() was
> legitimately enforcing that devices to be deleted were not currently
> attached to any driver nor assigned any capacity from the region.
> ---
> 
> ...you can fill in a couple notes about the knock-on fixups after that
> was restored.

Sounds good, updated!

> 
> >  
> > @@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, 
> > struct device_attribute *attr,
> >     if (!victim)
> >     return -ENXIO;
> >  
> > -   rc = down_write_killable(&dax_region_rwsem);
> > -   if (rc)
> > -   return rc;
> > -   rc = down_write_killable(&dax_dev_rwsem);
> > -   if (rc) {
> > -   up_write(&dax_region_rwsem);
> > -   return rc;
> > -   }
> > +   device_lock(dev);
> > +   device_lock(victim);
> >     dev_dax = to_dev_dax(victim);
> > +   rc = down_write_killable(&dax_dev_rwsem);
> 
> This begs the question, why down_write_killable(), but not
> device_lock_interruptible()?

Do you mean change the device_lock()s to device_lock_interruptible() in
addition to the taking the rwsem (i.e. not instead of the rwsem..)?
I guess I just restored what was there previously - but the
interruptible variant makes sense, I can make that change.

> 
> I do not expect any of this is long running so likely down_write() is
> sufficient here, especially since the heaviest locks to acquire are
> already held by the time rwsem is considered.
> 
> Other than that this looks good to me:
> 
> You can include my Reviewed-by on the next posting.

Thanks for the review Dan!



[PATCH v2 1/2] LoongArch: KVM: Add steal time support in kvm side

2024-04-29 Thread Bibo Mao
Steal time support is added here on the KVM side. A VM can query the
features provided by the KVM hypervisor; the feature KVM_FEATURE_STEAL_TIME
is added here. Like x86, the steal time structure is saved in guest memory,
and one hypercall function, KVM_HCALL_FUNC_NOTIFY, is added to notify KVM
to enable the feature.

One vCPU attr ioctl command, KVM_LOONGARCH_VCPU_PVTIME_CTRL, is added to
save and restore the base address of the steal time structure when a VM is
migrated.

It needs hypercall instruction emulation handling and therefore depends on
this patchset:
https://lore.kernel.org/all/20240428100518.1642324-1-maob...@loongson.cn/
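
For context, a hedged sketch of how a VMM might use the new vCPU attribute
during migration (this is an assumption based on the attr names above and
the generic KVM device-attr ioctls, not code from this series; error
handling omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Save the steal-time GPA on the source, restore it on the destination. */
static uint64_t pvtime_save_gpa(int vcpu_fd)
{
	uint64_t gpa = 0;
	struct kvm_device_attr attr = {
		.group = KVM_LOONGARCH_VCPU_PVTIME_CTRL,
		.attr  = KVM_LOONGARCH_VCPU_PVTIME_GPA,
		.addr  = (uint64_t)&gpa,
	};

	ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);
	return gpa;
}

static void pvtime_restore_gpa(int vcpu_fd, uint64_t gpa)
{
	struct kvm_device_attr attr = {
		.group = KVM_LOONGARCH_VCPU_PVTIME_CTRL,
		.attr  = KVM_LOONGARCH_VCPU_PVTIME_GPA,
		.addr  = (uint64_t)&gpa,
	};

	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}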

Signed-off-by: Bibo Mao 
---
 arch/loongarch/include/asm/kvm_host.h  |   7 ++
 arch/loongarch/include/asm/kvm_para.h  |  10 +++
 arch/loongarch/include/asm/loongarch.h |   1 +
 arch/loongarch/include/uapi/asm/kvm.h  |   4 +
 arch/loongarch/kvm/exit.c  |  29 +-
 arch/loongarch/kvm/vcpu.c  | 120 +
 6 files changed, 169 insertions(+), 2 deletions(-)

diff --git a/arch/loongarch/include/asm/kvm_host.h 
b/arch/loongarch/include/asm/kvm_host.h
index 1921750d4b4c..30bda553c54d 100644
--- a/arch/loongarch/include/asm/kvm_host.h
+++ b/arch/loongarch/include/asm/kvm_host.h
@@ -30,6 +30,7 @@
 #define KVM_PRIVATE_MEM_SLOTS  0
 
 #define KVM_HALT_POLL_NS_DEFAULT   50
+#define KVM_REQ_RECORD_STEAL   KVM_ARCH_REQ(1)
 
 #define KVM_GUESTDBG_VALID_MASK(KVM_GUESTDBG_ENABLE | \
KVM_GUESTDBG_USE_SW_BP | KVM_GUESTDBG_SINGLESTEP)
@@ -198,6 +199,12 @@ struct kvm_vcpu_arch {
struct kvm_mp_state mp_state;
/* cpucfg */
u32 cpucfg[KVM_MAX_CPUCFG_REGS];
+   /* paravirt steal time */
+   struct {
+   u64 guest_addr;
+   u64 last_steal;
+   struct gfn_to_hva_cache cache;
+   } st;
 };
 
 static inline unsigned long readl_sw_gcsr(struct loongarch_csrs *csr, int reg)
diff --git a/arch/loongarch/include/asm/kvm_para.h 
b/arch/loongarch/include/asm/kvm_para.h
index 56775554402a..5fb89e20432d 100644
--- a/arch/loongarch/include/asm/kvm_para.h
+++ b/arch/loongarch/include/asm/kvm_para.h
@@ -12,6 +12,7 @@
 #define KVM_HCALL_CODE_SWDBG   1
 #define KVM_HCALL_PV_SERVICE   HYPERCALL_CODE(HYPERVISOR_KVM, 
KVM_HCALL_CODE_PV_SERVICE)
 #define  KVM_HCALL_FUNC_PV_IPI 1
+#define  KVM_HCALL_FUNC_NOTIFY 2
 #define KVM_HCALL_SWDBGHYPERCALL_CODE(HYPERVISOR_KVM, 
KVM_HCALL_CODE_SWDBG)
 
 /*
@@ -21,6 +22,15 @@
 #define KVM_HCALL_INVALID_CODE -1UL
 #define KVM_HCALL_INVALID_PARAMETER-2UL
 
+#define KVM_STEAL_PHYS_VALID   BIT_ULL(0)
+#define KVM_STEAL_PHYS_MASKGENMASK_ULL(63, 6)
+struct kvm_steal_time {
+   __u64 steal;
+   __u32 version;
+   __u32 flags;
+   __u32 pad[12];
+};
+
 /*
  * Hypercall interface for KVM hypervisor
  *
diff --git a/arch/loongarch/include/asm/loongarch.h 
b/arch/loongarch/include/asm/loongarch.h
index 0ad36704cb4b..ab6a5e93c280 100644
--- a/arch/loongarch/include/asm/loongarch.h
+++ b/arch/loongarch/include/asm/loongarch.h
@@ -168,6 +168,7 @@
 #define  KVM_SIGNATURE "KVM\0"
 #define CPUCFG_KVM_FEATURE (CPUCFG_KVM_BASE + 4)
 #define  KVM_FEATURE_PV_IPIBIT(1)
+#define  KVM_FEATURE_STEAL_TIMEBIT(2)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/loongarch/include/uapi/asm/kvm.h 
b/arch/loongarch/include/uapi/asm/kvm.h
index 8f78b23672ac..286b5ce93a57 100644
--- a/arch/loongarch/include/uapi/asm/kvm.h
+++ b/arch/loongarch/include/uapi/asm/kvm.h
@@ -80,7 +80,11 @@ struct kvm_fpu {
 #define LOONGARCH_REG_64(TYPE, REG)(TYPE | KVM_REG_SIZE_U64 | (REG << 
LOONGARCH_REG_SHIFT))
 #define KVM_IOC_CSRID(REG) LOONGARCH_REG_64(KVM_REG_LOONGARCH_CSR, 
REG)
 #define KVM_IOC_CPUCFG(REG)
LOONGARCH_REG_64(KVM_REG_LOONGARCH_CPUCFG, REG)
+
+/* Device Control API on vcpu fd */
 #define KVM_LOONGARCH_VCPU_CPUCFG  0
+#define KVM_LOONGARCH_VCPU_PVTIME_CTRL 1
+#define  KVM_LOONGARCH_VCPU_PVTIME_GPA 0
 
 struct kvm_debug_exit_arch {
 };
diff --git a/arch/loongarch/kvm/exit.c b/arch/loongarch/kvm/exit.c
index 680aeaa4aeeb..af0f1c46e4eb 100644
--- a/arch/loongarch/kvm/exit.c
+++ b/arch/loongarch/kvm/exit.c
@@ -209,7 +209,7 @@ int kvm_emu_idle(struct kvm_vcpu *vcpu)
 static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, larch_inst inst)
 {
int rd, rj;
-   unsigned int index;
+   unsigned int index, ret;
 
rd = inst.reg2_format.rd;
rj = inst.reg2_format.rj;
@@ -232,7 +232,10 @@ static int kvm_emu_cpucfg(struct kvm_vcpu *vcpu, 
larch_inst inst)
vcpu->arch.gprs[rd] = *(unsigned int *)KVM_SIGNATURE;
break;
case CPUCFG_KVM_FEATURE:
-   vcpu->arch.gprs[rd] = KVM_FEATURE_PV_IPI;
+   ret = KVM_FEATURE_PV_IPI;
+   if (sched_info_on())
+   ret |= KVM_FEATURE_STEAL_TIME;
+  

[PATCH v2 2/2] LoongArch: Add steal time support in guest side

2024-04-29 Thread Bibo Mao
A percpu struct kvm_steal_time is added here; it is 64 bytes in size and
also aligned to 64 bytes, so that the whole structure sits in one physical
page.

When a vCPU is onlined, pv_enable_steal_time() is called. This function
passes the guest physical address of struct kvm_steal_time and tells the
hypervisor to enable steal time. When a vCPU is offlined, the physical
address is set to 0 to tell the hypervisor to disable steal time.

Here is the output of vmstat on the guest when there is a workload on both
the host and the guest; it includes the steal time ('st') column.

procs ---memory-- -io -system-- --cpu-
 r  b   swpd   free  inact active   bibo   in   cs us sy id wa st
15  1  0 7583616 184112  72208200  162   52 31  6 43  0 20
17  0  0 7583616 184704  721920 0 6318 6885  5 60  8  5 22
16  0  0 7583616 185392  721440 0 1766 1081  0 49  0  1 50
16  0  0 7583616 184816  723040 0 6300 6166  4 62 12  2 20
18  0  0 7583632 184480  722400 0 2814 1754  2 58  4  1 35

Signed-off-by: Bibo Mao 
---
 arch/loongarch/Kconfig|  11 +++
 arch/loongarch/include/asm/paravirt.h |   5 +
 arch/loongarch/kernel/paravirt.c  | 131 ++
 arch/loongarch/kernel/time.c  |   2 +
 4 files changed, 149 insertions(+)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 0a1540a8853e..f3a03c33a052 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -592,6 +592,17 @@ config PARAVIRT
  over full virtualization.  However, when run without a hypervisor
  the kernel is theoretically slower and slightly larger.
 
+config PARAVIRT_TIME_ACCOUNTING
+   bool "Paravirtual steal time accounting"
+   select PARAVIRT
+   help
+ Select this option to enable fine granularity task steal time
+ accounting. Time spent executing other tasks in parallel with
+ the current vCPU is discounted from the vCPU power. To account for
+ that, there can be a small performance impact.
+
+ If in doubt, say N here.
+
 config ARCH_SUPPORTS_KEXEC
def_bool y
 
diff --git a/arch/loongarch/include/asm/paravirt.h 
b/arch/loongarch/include/asm/paravirt.h
index 58f7b7b89f2c..fe27fb5e82b8 100644
--- a/arch/loongarch/include/asm/paravirt.h
+++ b/arch/loongarch/include/asm/paravirt.h
@@ -17,11 +17,16 @@ static inline u64 paravirt_steal_clock(int cpu)
 }
 
 int pv_ipi_init(void);
+int __init pv_time_init(void);
 #else
 static inline int pv_ipi_init(void)
 {
return 0;
 }
 
+static inline int pv_time_init(void)
+{
+   return 0;
+}
 #endif // CONFIG_PARAVIRT
 #endif
diff --git a/arch/loongarch/kernel/paravirt.c b/arch/loongarch/kernel/paravirt.c
index 9044ed62045c..3f83afc7e2d2 100644
--- a/arch/loongarch/kernel/paravirt.c
+++ b/arch/loongarch/kernel/paravirt.c
@@ -5,10 +5,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock;
 
 static u64 native_steal_clock(int cpu)
 {
@@ -17,6 +20,57 @@ static u64 native_steal_clock(int cpu)
 
 DEFINE_STATIC_CALL(pv_steal_clock, native_steal_clock);
 
+static bool steal_acc = true;
+static int __init parse_no_stealacc(char *arg)
+{
+   steal_acc = false;
+   return 0;
+}
+early_param("no-steal-acc", parse_no_stealacc);
+
+static u64 para_steal_clock(int cpu)
+{
+   u64 steal;
+   struct kvm_steal_time *src;
+   int version;
+
+   src = &per_cpu(steal_time, cpu);
+   do {
+
+   version = src->version;
+   /* Make sure that the version is read before the steal */
+   virt_rmb();
+   steal = src->steal;
+   /* Make sure that the steal is read before the next version */
+   virt_rmb();
+
+   } while ((version & 1) || (version != src->version));
+   return steal;
+}
+
+static int pv_enable_steal_time(void)
+{
+   int cpu = smp_processor_id();
+   struct kvm_steal_time *st;
+   unsigned long addr;
+
+   if (!has_steal_clock)
+   return -EPERM;
+
+   st = &per_cpu(steal_time, cpu);
+   addr = per_cpu_ptr_to_phys(st);
+
+   /* The whole structure kvm_steal_time should be one page */
+   if (PFN_DOWN(addr) != PFN_DOWN(addr + sizeof(*st))) {
+   pr_warn("Illegal PV steal time addr %lx\n", addr);
+   return -EFAULT;
+   }
+
+   addr |= KVM_STEAL_PHYS_VALID;
+   kvm_hypercall2(KVM_HCALL_FUNC_NOTIFY, KVM_FEATURE_STEAL_TIME, addr);
+   return 0;
+}
+
 #ifdef CONFIG_SMP
 static void pv_send_ipi_single(int cpu, unsigned int action)
 {
@@ -110,6 +164,32 @@ static void pv_init_ipi(void)
if (r < 0)
panic("SWI0 IRQ request failed\n");
 }
+
+static void pv_disable_steal_time(void)
+{
+   if (has_steal_clock)
+   

[PATCH v2 0/2] LoongArch: Add steal time support

2024-04-29 Thread Bibo Mao
The para-virt steal time feature is added on both the KVM and guest kernel
sides. It is similar to other architectures: the steal time structure comes
from guest memory, and a pseudo register is used to save/restore the base
address of the steal time structure, so that VM migration is supported as
well.

---
v2:
  1. Add PARAVIRT_TIME_ACCOUNTING kconfig option in file
arch/loongarch/Kconfig
  2. Function name change such as replace pv_register_steal_time with
pv_enable_steal_time etc

---
Bibo Mao (2):
  LoongArch: KVM: Add steal time support in kvm side
  LoongArch: Add steal time support in guest side

 arch/loongarch/Kconfig |  11 +++
 arch/loongarch/include/asm/kvm_host.h  |   7 ++
 arch/loongarch/include/asm/kvm_para.h  |  10 ++
 arch/loongarch/include/asm/loongarch.h |   1 +
 arch/loongarch/include/asm/paravirt.h  |   5 +
 arch/loongarch/include/uapi/asm/kvm.h  |   4 +
 arch/loongarch/kernel/paravirt.c   | 131 +
 arch/loongarch/kernel/time.c   |   2 +
 arch/loongarch/kvm/exit.c  |  29 +-
 arch/loongarch/kvm/vcpu.c  | 120 ++
 10 files changed, 318 insertions(+), 2 deletions(-)


base-commit: 2c8159388952f530bd260e097293ccc0209240be
-- 
2.39.3




Re: [PATCH v12 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-29 Thread Haitao Huang

On Mon, 29 Apr 2024 17:18:05 -0500, Huang, Kai  wrote:




 /*
@@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list  
*sgx_lru_list(struct sgx_epc_page *epc_pag

  */
 static inline bool sgx_can_reclaim(void)
 {
-return !list_empty(&sgx_global_lru.reclaimable);
+return !sgx_cgroup_lru_empty(misc_cg_root()) ||
+   !list_empty(&sgx_global_lru.reclaimable);
 }


Shouldn't this be:

if (IS_ENABLED(CONFIG_CGROUP_MISC))
return !sgx_cgroup_lru_empty(misc_cg_root());
else
return !list_empty(&sgx_global_lru.reclaimable);
?

In this way, it is consistent with the sgx_reclaim_pages_global()  
below.


 I changed it to this way because sgx_cgroup_lru_empty() is now defined in
both Kconfig cases, and it seems better to minimize use of the Kconfig
variables based on earlier feedback (some of it yours).
I don't really have a strong preference here, so let me know one way or the
other.




But IMHO your code could be confusing, e.g., it can be interpreted as:

   The EPC pages can be managed by both the cgroup LRUs and the
   sgx_global_lru simultaneously at runtime when CONFIG_CGROUP_MISC
   is on.

Which is not true.

So we should make the code clearly reflect the true behaviour.



Ok, I'll change back.
Thanks
Haitao
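
For reference, the shape being agreed on above would look roughly like this
(a sketch only, assuming the helper names from the quoted hunk):

static inline bool sgx_can_reclaim(void)
{
	/* Cgroup LRUs own the pages when the misc cgroup controller is built in */
	if (IS_ENABLED(CONFIG_CGROUP_MISC))
		return !sgx_cgroup_lru_empty(misc_cg_root());

	return !list_empty(&sgx_global_lru.reclaimable);
}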



Re: [PATCH v2 4/4] dax/bus.c: Use the right locking mode (read vs write) in size_show

2024-04-29 Thread Dan Williams
Vishal Verma wrote:
> In size_show(), the dax_dev_rwsem only needs a read lock, but was
> acquiring a write lock. Change it to down_read_interruptible() so it
> doesn't unnecessarily hold a write lock.
> 
> Cc: Dan Williams 
> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
> rwsem")
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Looks good,

Reviewed-by: Dan Williams 



Re: [PATCH v2 3/4] dax/bus.c: Don't use down_write_killable for non-user processes

2024-04-29 Thread Dan Williams
Vishal Verma wrote:
> Change an instance of down_write_killable() to a simple down_write() where
> there is no user process that might want to interrupt the operation.
> 
> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
> rwsem")
> Reported-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)

Looks good,

Reviewed-by: Dan Williams 



Re: [PATCH v2 2/4] dax/bus.c: fix locking for unregister_dax_dev / unregister_dax_mapping paths

2024-04-29 Thread Dan Williams
Vishal Verma wrote:
> Commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
> rwsem")
> was a bit overzealous in eliminating device_lock() usage, and ended up
> removing a couple of lock acquisitions which were needed, and as a
> result, fix some of the conditional locking missteps that the above
> commit introduced in unregister_dax_dev() and unregister_dax_mapping().

I think it makes sense to tell the story a bit about why the
delete_store() conversion was problematic, because the
unregister_dev_dax() changes were just a knock-on effect to fixing the
delete_store() flow.

Something like:

---
commit c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
rwsem")
aimed to undo device_lock() abuses for protecting changes to dax-driver
internal data-structures like the dax_region resource tree to
device-dax-instance range structures. However, the device_lock() was
legitimately enforcing that devices to be deleted were not currently
attached to any driver nor assigned any capacity from the region.
---

...you can fill in a couple notes about the knock-on fixups after that
was restored.

> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
> rwsem")
> Reported-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 44 ++--
>  1 file changed, 10 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 7924dd542a13..4e04b228b080 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -465,26 +465,17 @@ static void free_dev_dax_ranges(struct dev_dax *dev_dax)
>   trim_dev_dax_range(dev_dax);
>  }
>  
> -static void __unregister_dev_dax(void *dev)
> +static void unregister_dev_dax(void *dev)
>  {
>   struct dev_dax *dev_dax = to_dev_dax(dev);
>  
>   dev_dbg(dev, "%s\n", __func__);
>  
> + down_write(&dax_region_rwsem);
>   kill_dev_dax(dev_dax);
>   device_del(dev);
>   free_dev_dax_ranges(dev_dax);
>   put_device(dev);
> -}
> -
> -static void unregister_dev_dax(void *dev)
> -{
> - if (rwsem_is_locked(&dax_region_rwsem))
> - return __unregister_dev_dax(dev);
> -
> - if (WARN_ON_ONCE(down_write_killable(&dax_region_rwsem) != 0))
> - return;
> - __unregister_dev_dax(dev);
>   up_write(&dax_region_rwsem);
>  }
>  
> @@ -560,15 +551,12 @@ static ssize_t delete_store(struct device *dev, struct 
> device_attribute *attr,
>   if (!victim)
>   return -ENXIO;
>  
> - rc = down_write_killable(&dax_region_rwsem);
> - if (rc)
> - return rc;
> - rc = down_write_killable(&dax_dev_rwsem);
> - if (rc) {
> - up_write(&dax_region_rwsem);
> - return rc;
> - }
> + device_lock(dev);
> + device_lock(victim);
>   dev_dax = to_dev_dax(victim);
> + rc = down_write_killable(&dax_dev_rwsem);

This begs the question, why down_write_killable(), but not
device_lock_interruptible()?

I do not expect any of this is long running so likely down_write() is
sufficient here, especially since the heaviest locks to acquire are
already held by the time rwsem is considered.

Other than that this looks good to me:

You can include my Reviewed-by on the next posting.



Re: [PATCH v2 1/4] dax/bus.c: replace WARN_ON_ONCE() with lockdep asserts

2024-04-29 Thread Dan Williams
Vishal Verma wrote:
> In [1], Dan points out that all of the WARN_ON_ONCE() usage in the
> referenced patch should be replaced with lockdep_assert_held() or
> lockdep_assert_held_write(). Replace these as appropriate.
> 
> Link: 
> https://lore.kernel.org/r/65f0b5ef41817_aa2229...@dwillia2-mobl3.amr.corp.intel.com.notmuch
>  [1]
> Fixes: c05ae9d85b47 ("dax/bus.c: replace driver-core lock usage by a local 
> rwsem")
> Cc: Andrew Morton 
> Reported-by: Dan Williams 
> Signed-off-by: Vishal Verma 
> ---
>  drivers/dax/bus.c | 16 
>  1 file changed, 8 insertions(+), 8 deletions(-)

Looks good,

Reviewed-by: Dan Williams 



BUG: unable to handle kernel paging request in do_split

2024-04-29 Thread Ubisectech Sirius
Hello.
We are the Ubisectech Sirius Team, the vulnerability lab of China ValiantSec.
Recently, our team discovered an issue in Linux kernel 6.7. Attached to this
email is a PoC file for the issue.

Stack dump:
BUG: unable to handle page fault for address: ed110c2fd97f
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 7ffd0067 P4D 7ffd0067 PUD 0
Oops:  [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 PID: 24082 Comm: syz-executor.3 Not tainted 6.7.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:do_split+0xfef/0x1e10 fs/ext4/namei.c:2047
Code: d2 0f 85 38 0b 00 00 8b 45 00 89 84 24 84 00 00 00 41 8d 45 ff 48 8d 1c 
c3 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1 ea 03 <0f> b6 14 02 48 89 d8 83 
e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 ef
RSP: 0018:c90001e9f858 EFLAGS: 00010a02
RAX: dc00 RBX: 617ecbf8 RCX: c9001048f000
RDX: 11110c2fd97f RSI: 823364ab RDI: 0005
RBP: 8880617ecc00 R08: 0005 R09: 
R10:  R11:  R12: dc00
R13:  R14:  R15: 88801ee8d2b0
FS:  7f191402a640() GS:88802c60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: ed110c2fd97f CR3: 5500a000 CR4: 00750ef0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
PKRU: 5554
Call Trace:
 
 make_indexed_dir+0x1158/0x1540 fs/ext4/namei.c:2342
 ext4_add_entry+0xcd0/0xe80 fs/ext4/namei.c:2454
 ext4_add_nondir+0x90/0x2b0 fs/ext4/namei.c:2795
 ext4_symlink+0x539/0x9e0 fs/ext4/namei.c:3436
 vfs_symlink fs/namei.c:4464 [inline]
 vfs_symlink+0x3f6/0x640 fs/namei.c:4448
 do_symlinkat+0x245/0x2f0 fs/namei.c:4490
 __do_sys_symlink fs/namei.c:4511 [inline]
 __se_sys_symlink fs/namei.c:4509 [inline]
 __x64_sys_symlink+0x79/0xa0 fs/namei.c:4509
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7f191329002d
Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 
c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:7f191402a028 EFLAGS: 0246 ORIG_RAX: 0058
RAX: ffda RBX: 7f19133cbf80 RCX: 7f191329002d
RDX:  RSI: 2e40 RDI: 20001640
RBP: 7f19132f14d0 R08:  R09: 
R10:  R11: 0246 R12: 
R13: 000b R14: 7f19133cbf80 R15: 7f191400a000
 
Modules linked in:
CR2: ed110c2fd97f
---[ end trace  ]---
RIP: 0010:do_split+0xfef/0x1e10 fs/ext4/namei.c:2047
Code: d2 0f 85 38 0b 00 00 8b 45 00 89 84 24 84 00 00 00 41 8d 45 ff 48 8d 1c 
c3 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1 ea 03 <0f> b6 14 02 48 89 d8 83 
e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 ef
RSP: 0018:c90001e9f858 EFLAGS: 00010a02
RAX: dc00 RBX: 617ecbf8 RCX: c9001048f000
RDX: 11110c2fd97f RSI: 823364ab RDI: 0005
RBP: 8880617ecc00 R08: 0005 R09: 
R10:  R11:  R12: dc00
R13:  R14:  R15: 88801ee8d2b0
FS:  7f191402a640() GS:88802c60() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: ed110c2fd97f CR3: 5500a000 CR4: 00750ef0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
PKRU: 5554

Code disassembly (best guess):
   0:   d2 0f   rorb   %cl,(%rdi)
   2:   85 38   test   %edi,(%rax)
   4:   0b 00   or (%rax),%eax
   6:   00 8b 45 00 89 84   add%cl,-0x7b76ffbb(%rbx)
   c:   24 84   and$0x84,%al
   e:   00 00   add%al,(%rax)
  10:   00 41 8dadd%al,-0x73(%rcx)
  13:   45 ff 48 8d rex.RB decl -0x73(%r8)
  17:   1c c3   sbb$0xc3,%al
  19:   48 b8 00 00 00 00 00movabs $0xdc00,%rax
  20:   fc ff df
  23:   48 89 damov%rbx,%rdx
  26:   48 c1 ea 03 shr$0x3,%rdx
* 2a:   0f b6 14 02 movzbl (%rdx,%rax,1),%edx <-- trapping 
instruction
  2e:   48 89 d8mov%rbx,%rax
  31:   83 e0 07and$0x7,%eax
  34:   83 c0 03add$0x3,%eax
  37:   38 d0   cmp%dl,%al
  39:   7c 08   jl 0x43
  3b:   84 d2   test   %dl,%dl
  3d:   0f  .byte 0xf
  3e:   85 ef   test   %ebp,%edi


Thank you 

[PATCH v4 3/4] remoteproc: mediatek: Support setting DRAM and IPI shared buffer sizes

2024-04-29 Thread Olivia Wen
The SCP on different chips will require different DRAM sizes and IPI
shared buffer sizes based on varying requirements.
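
As an illustration of the mechanism (the values below are placeholders,
not taken from the patch), each SoC supplies its own sizes through the new
mtk_scp_sizes_data introduced in this patch:

/* Hypothetical per-SoC entry; only the shape matches the patch below. */
static const struct mtk_scp_sizes_data example_scp_sizes = {
	.max_dram_size         = 0x500000, /* placeholder DRAM budget */
	.ipi_share_buffer_size = 288,      /* the old fixed SCP_SHARE_BUFFER_SIZE */
};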

Signed-off-by: Olivia Wen 
Reviewed-by: AngeloGioacchino Del Regno 

---
 drivers/remoteproc/mtk_common.h  | 11 --
 drivers/remoteproc/mtk_scp.c | 84 +++-
 drivers/remoteproc/mtk_scp_ipi.c |  7 +++-
 3 files changed, 79 insertions(+), 23 deletions(-)

diff --git a/drivers/remoteproc/mtk_common.h b/drivers/remoteproc/mtk_common.h
index 6d7736a..fd5c539 100644
--- a/drivers/remoteproc/mtk_common.h
+++ b/drivers/remoteproc/mtk_common.h
@@ -78,7 +78,6 @@
 #define MT8195_L2TCM_OFFSET0x850d0
 
 #define SCP_FW_VER_LEN 32
-#define SCP_SHARE_BUFFER_SIZE  288
 
 struct scp_run {
u32 signaled;
@@ -97,6 +96,11 @@ struct scp_ipi_desc {
 
 struct mtk_scp;
 
+struct mtk_scp_sizes_data {
+   size_t max_dram_size;
+   size_t ipi_share_buffer_size;
+};
+
 struct mtk_scp_of_data {
int (*scp_clk_get)(struct mtk_scp *scp);
int (*scp_before_load)(struct mtk_scp *scp);
@@ -110,6 +114,7 @@ struct mtk_scp_of_data {
u32 host_to_scp_int_bit;
 
size_t ipi_buf_offset;
+   const struct mtk_scp_sizes_data *scp_sizes;
 };
 
 struct mtk_scp_of_cluster {
@@ -141,10 +146,10 @@ struct mtk_scp {
struct scp_ipi_desc ipi_desc[SCP_IPI_MAX];
bool ipi_id_ack[SCP_IPI_MAX];
wait_queue_head_t ack_wq;
+   u8 *share_buf;
 
void *cpu_addr;
dma_addr_t dma_addr;
-   size_t dram_size;
 
struct rproc_subdev *rpmsg_subdev;
 
@@ -162,7 +167,7 @@ struct mtk_scp {
 struct mtk_share_obj {
u32 id;
u32 len;
-   u8 share_buf[SCP_SHARE_BUFFER_SIZE];
+   u8 *share_buf;
 };
 
 void scp_memcpy_aligned(void __iomem *dst, const void *src, unsigned int len);
diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c
index 6295148..e281d28 100644
--- a/drivers/remoteproc/mtk_scp.c
+++ b/drivers/remoteproc/mtk_scp.c
@@ -20,7 +20,6 @@
 #include "mtk_common.h"
 #include "remoteproc_internal.h"
 
-#define MAX_CODE_SIZE 0x50
 #define SECTION_NAME_IPI_BUFFER ".ipi_buffer"
 
 /**
@@ -94,14 +93,15 @@ static void scp_ipi_handler(struct mtk_scp *scp)
 {
struct mtk_share_obj __iomem *rcv_obj = scp->recv_buf;
struct scp_ipi_desc *ipi_desc = scp->ipi_desc;
-   u8 tmp_data[SCP_SHARE_BUFFER_SIZE];
scp_ipi_handler_t handler;
u32 id = readl(&rcv_obj->id);
u32 len = readl(&rcv_obj->len);
+   const struct mtk_scp_sizes_data *scp_sizes;
 
-   if (len > SCP_SHARE_BUFFER_SIZE) {
-   dev_err(scp->dev, "ipi message too long (len %d, max %d)", len,
-   SCP_SHARE_BUFFER_SIZE);
+   scp_sizes = scp->data->scp_sizes;
+   if (len > scp_sizes->ipi_share_buffer_size) {
+   dev_err(scp->dev, "ipi message too long (len %d, max %zd)", len,
+   scp_sizes->ipi_share_buffer_size);
return;
}
if (id >= SCP_IPI_MAX) {
@@ -117,8 +117,9 @@ static void scp_ipi_handler(struct mtk_scp *scp)
return;
}
 
-   memcpy_fromio(tmp_data, &rcv_obj->share_buf, len);
-   handler(tmp_data, len, ipi_desc[id].priv);
+   memset(scp->share_buf, 0, scp_sizes->ipi_share_buffer_size);
+   memcpy_fromio(scp->share_buf, &rcv_obj->share_buf, len);
+   handler(scp->share_buf, len, ipi_desc[id].priv);
scp_ipi_unlock(scp, id);
 
scp->ipi_id_ack[id] = true;
@@ -133,6 +134,8 @@ static int scp_ipi_init(struct mtk_scp *scp, const struct 
firmware *fw)
 {
int ret;
size_t buf_sz, offset;
+   size_t share_buf_offset;
+   const struct mtk_scp_sizes_data *scp_sizes;
 
/* read the ipi buf addr from FW itself first */
ret = scp_elf_read_ipi_buf_addr(scp, fw, &offset);
@@ -152,12 +155,15 @@ static int scp_ipi_init(struct mtk_scp *scp, const struct 
firmware *fw)
return -EOVERFLOW;
}
 
+   scp_sizes = scp->data->scp_sizes;
scp->recv_buf = (struct mtk_share_obj __iomem *)
(scp->sram_base + offset);
+   share_buf_offset = sizeof(scp->recv_buf->id)
+   + sizeof(scp->recv_buf->len) + scp_sizes->ipi_share_buffer_size;
scp->send_buf = (struct mtk_share_obj __iomem *)
-   (scp->sram_base + offset + sizeof(*scp->recv_buf));
-   memset_io(scp->recv_buf, 0, sizeof(*scp->recv_buf));
-   memset_io(scp->send_buf, 0, sizeof(*scp->send_buf));
+   (scp->sram_base + offset + share_buf_offset);
+   memset_io(scp->recv_buf, 0, share_buf_offset);
+   memset_io(scp->send_buf, 0, share_buf_offset);
 
return 0;
 }
@@ -741,14 +747,16 @@ static int scp_start(struct rproc *rproc)
 static void *mt8183_scp_da_to_va(struct mtk_scp *scp, u64 da, size_t len)
 {
int offset;
+   const struct mtk_scp_sizes_data *scp_sizes;
 
+   scp_sizes = 

[PATCH v4 2/4] remoteproc: mediatek: Support MT8188 SCP core 1

2024-04-29 Thread Olivia Wen
MT8188 SCP has two RISC-V cores which is similar to MT8195 but without
L1TCM. We've added MT8188-specific functions to configure L1TCM in
multicore setups.

Signed-off-by: Olivia Wen 
Reviewed-by: AngeloGioacchino Del Regno 

---
 drivers/remoteproc/mtk_scp.c | 146 ++-
 1 file changed, 143 insertions(+), 3 deletions(-)

diff --git a/drivers/remoteproc/mtk_scp.c b/drivers/remoteproc/mtk_scp.c
index 6751829..6295148 100644
--- a/drivers/remoteproc/mtk_scp.c
+++ b/drivers/remoteproc/mtk_scp.c
@@ -471,6 +471,86 @@ static int mt8186_scp_before_load(struct mtk_scp *scp)
return 0;
 }
 
+static int mt8188_scp_l2tcm_on(struct mtk_scp *scp)
+{
+   struct mtk_scp_of_cluster *scp_cluster = scp->cluster;
+
+   mutex_lock(&scp_cluster->cluster_lock);
+
+   if (scp_cluster->l2tcm_refcnt == 0) {
+   /* clear SPM interrupt, SCP2SPM_IPC_CLR */
+   writel(0xff, scp->cluster->reg_base + MT8192_SCP2SPM_IPC_CLR);
+
+   /* Power on L2TCM */
+   scp_sram_power_on(scp->cluster->reg_base + 
MT8192_L2TCM_SRAM_PD_0, 0);
+   scp_sram_power_on(scp->cluster->reg_base + 
MT8192_L2TCM_SRAM_PD_1, 0);
+   scp_sram_power_on(scp->cluster->reg_base + 
MT8192_L2TCM_SRAM_PD_2, 0);
+   scp_sram_power_on(scp->cluster->reg_base + 
MT8192_L1TCM_SRAM_PDN, 0);
+   }
+
+   scp_cluster->l2tcm_refcnt += 1;
+
+   mutex_unlock(&scp_cluster->cluster_lock);
+
+   return 0;
+}
+
+static int mt8188_scp_before_load(struct mtk_scp *scp)
+{
+   writel(1, scp->cluster->reg_base + MT8192_CORE0_SW_RSTN_SET);
+
+   mt8188_scp_l2tcm_on(scp);
+
+   scp_sram_power_on(scp->cluster->reg_base + MT8192_CPU0_SRAM_PD, 0);
+
+   /* enable MPU for all memory regions */
+   writel(0xff, scp->cluster->reg_base + MT8192_CORE0_MEM_ATT_PREDEF);
+
+   return 0;
+}
+
+static int mt8188_scp_c1_before_load(struct mtk_scp *scp)
+{
+   u32 sec_ctrl;
+   struct mtk_scp *scp_c0;
+   struct mtk_scp_of_cluster *scp_cluster = scp->cluster;
+
+   scp->data->scp_reset_assert(scp);
+
+   mt8188_scp_l2tcm_on(scp);
+
+   scp_sram_power_on(scp->cluster->reg_base + MT8195_CPU1_SRAM_PD, 0);
+
+   /* enable MPU for all memory regions */
+   writel(0xff, scp->cluster->reg_base + MT8195_CORE1_MEM_ATT_PREDEF);
+
+   /*
+* The L2TCM_OFFSET_RANGE and L2TCM_OFFSET shift the destination address
+* on SRAM when SCP core 1 accesses SRAM.
+*
+* This configuration solves booting the SCP core 0 and core 1 from
+* different SRAM address because core 0 and core 1 both boot from
+* the head of SRAM by default. this must be configured before boot SCP 
core 1.
+*
+* The value of L2TCM_OFFSET_RANGE is from the viewpoint of SCP core 1.
+* When SCP core 1 issues address within the range (L2TCM_OFFSET_RANGE),
+* the address will be added with a fixed offset (L2TCM_OFFSET) on the 
bus.
+* The shift action is tranparent to software.
+*/
+   writel(0, scp->cluster->reg_base + MT8195_L2TCM_OFFSET_RANGE_0_LOW);
+   writel(scp->sram_size, scp->cluster->reg_base + 
MT8195_L2TCM_OFFSET_RANGE_0_HIGH);
+
+   scp_c0 = list_first_entry(&scp_cluster->mtk_scp_list, struct mtk_scp, elem);
+   writel(scp->sram_phys - scp_c0->sram_phys, scp->cluster->reg_base + 
MT8195_L2TCM_OFFSET);
+
+   /* enable SRAM offset when fetching instruction and data */
+   sec_ctrl = readl(scp->cluster->reg_base + MT8195_SEC_CTRL);
+   sec_ctrl |= MT8195_CORE_OFFSET_ENABLE_I | MT8195_CORE_OFFSET_ENABLE_D;
+   writel(sec_ctrl, scp->cluster->reg_base + MT8195_SEC_CTRL);
+
+   return 0;
+}
+
 static int mt8192_scp_before_load(struct mtk_scp *scp)
 {
/* clear SPM interrupt, SCP2SPM_IPC_CLR */
@@ -717,6 +797,47 @@ static void mt8183_scp_stop(struct mtk_scp *scp)
writel(0, scp->cluster->reg_base + MT8183_WDT_CFG);
 }
 
+static void mt8188_scp_l2tcm_off(struct mtk_scp *scp)
+{
+   struct mtk_scp_of_cluster *scp_cluster = scp->cluster;
+
+   mutex_lock(&scp_cluster->cluster_lock);
+
+   if (scp_cluster->l2tcm_refcnt > 0)
+   scp_cluster->l2tcm_refcnt -= 1;
+
+   if (scp_cluster->l2tcm_refcnt == 0) {
+   /* Power off L2TCM */
+   scp_sram_power_off(scp->cluster->reg_base + 
MT8192_L2TCM_SRAM_PD_0, 0);
+   scp_sram_power_off(scp->cluster->reg_base + 
MT8192_L2TCM_SRAM_PD_1, 0);
+   scp_sram_power_off(scp->cluster->reg_base + 
MT8192_L2TCM_SRAM_PD_2, 0);
+   scp_sram_power_off(scp->cluster->reg_base + 
MT8192_L1TCM_SRAM_PDN, 0);
+   }
+
+   mutex_unlock(&scp_cluster->cluster_lock);
+}
+
+static void mt8188_scp_stop(struct mtk_scp *scp)
+{
+   mt8188_scp_l2tcm_off(scp);
+
+   scp_sram_power_off(scp->cluster->reg_base + MT8192_CPU0_SRAM_PD, 0);
+
+   /* Disable SCP watchdog */
+   writel(0, scp->cluster->reg_base + 

[PATCH v4 0/4] Support MT8188 SCP core 1

2024-04-29 Thread Olivia Wen
Change in v4:
Updating the description of PATCH v4 4/4.

Olivia Wen (4):
  dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP
  remoteproc: mediatek: Support MT8188 SCP core 1
  remoteproc: mediatek: Support setting DRAM and IPI shared buffer sizes
  remoteproc: mediatek: Add IMGSYS IPI command

 .../devicetree/bindings/remoteproc/mtk,scp.yaml|   2 +
 drivers/remoteproc/mtk_common.h|  11 +-
 drivers/remoteproc/mtk_scp.c   | 230 +++--
 drivers/remoteproc/mtk_scp_ipi.c   |   7 +-
 include/linux/remoteproc/mtk_scp.h |   1 +
 5 files changed, 225 insertions(+), 26 deletions(-)

-- 
2.6.4




[PATCH v4 4/4] remoteproc: mediatek: Add IMGSYS IPI command

2024-04-29 Thread Olivia Wen
Add an IPI command definition for communication with IMGSYS through
SCP mailbox.

Signed-off-by: Olivia Wen 
---
 include/linux/remoteproc/mtk_scp.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/remoteproc/mtk_scp.h 
b/include/linux/remoteproc/mtk_scp.h
index 7c2b7cc9..344ff41 100644
--- a/include/linux/remoteproc/mtk_scp.h
+++ b/include/linux/remoteproc/mtk_scp.h
@@ -43,6 +43,7 @@ enum scp_ipi_id {
SCP_IPI_CROS_HOST_CMD,
SCP_IPI_VDEC_LAT,
SCP_IPI_VDEC_CORE,
+   SCP_IPI_IMGSYS_CMD,
SCP_IPI_NS_SERVICE = 0xFF,
SCP_IPI_MAX = 0x100,
 };
-- 
2.6.4




[PATCH v4 1/4] dt-bindings: remoteproc: mediatek: Support MT8188 dual-core SCP

2024-04-29 Thread Olivia Wen
Under different applications, the MT8188 SCP can be used as single-core
or dual-core.

Signed-off-by: Olivia Wen 
Acked-by: Krzysztof Kozlowski 
Reviewed-by: AngeloGioacchino Del Regno 


---
 Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml 
b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
index 507f98f..c5dc3c2 100644
--- a/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/mtk,scp.yaml
@@ -19,6 +19,7 @@ properties:
   - mediatek,mt8183-scp
   - mediatek,mt8186-scp
   - mediatek,mt8188-scp
+  - mediatek,mt8188-scp-dual
   - mediatek,mt8192-scp
   - mediatek,mt8195-scp
   - mediatek,mt8195-scp-dual
@@ -194,6 +195,7 @@ allOf:
   properties:
 compatible:
   enum:
+- mediatek,mt8188-scp-dual
 - mediatek,mt8195-scp-dual
 then:
   properties:
-- 
2.6.4




Re: [PATCH v2 0/4] vhost: Cleanup

2024-04-29 Thread Gavin Shan

On 4/30/24 04:50, Michael S. Tsirkin wrote:

On Mon, Apr 29, 2024 at 08:13:56PM +1000, Gavin Shan wrote:

This is suggested by Michael S. Tsirkin according to [1], and the goal
is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it,
the callers of the function needn't worry about memory barriers. Since
we're here, other cleanups are also applied.

[1] 
https://lore.kernel.org/virtualization/20240327155750-mutt-send-email-...@kernel.org/



Patch 1 makes some sense, gave some comments. Rest I think we should
just drop.



Sure, v3 has been sent with PATCH[v2 2/3/4] dropped. Please take a look
when you get a chance.

v3: 
https://lore.kernel.org/virtualization/20240429232748.642356-1-gs...@redhat.com/T/#u

Thanks,
Gavin




[PATCH v3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-29 Thread Gavin Shan
From: "Michael S. Tsirkin" 

All the callers of vhost_get_avail_idx() are concerned with the
memory barrier, imposed by smp_rmb() to ensure the order of the
available ring entry read and avail_idx read.

Improve vhost_get_avail_idx() so that smp_rmb() is executed when
the avail_idx is accessed. With it, the callers needn't worry
about the memory barrier. As a side benefit, we also validate the
index on all paths now, which will hopefully help to catch future
errors earlier.

Note that current code is inconsistent in how the errors are handled.
They are treated as an empty ring in some places, but as non-empty
ring in other places. This patch doesn't attempt to change the existing
behaviour.

No functional change intended.
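
The resulting calling convention, as used by the callers converted below,
is roughly (sketch, not part of the diff):

	ret = vhost_get_avail_idx(vq);
	if (unlikely(ret < 0))
		return ret;	/* access failure or invalid index */
	if (!ret)
		return vq->num;	/* nothing new since we last looked */
	/* ret == 1: avail_idx advanced; smp_rmb() already done inside */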

Signed-off-by: Michael S. Tsirkin 
Reviewed-by: Gavin Shan 
Acked-by: Will Deacon 
---
v3: Improved commit log and comments as Michael suggested
---
 drivers/vhost/vhost.c | 105 +-
 1 file changed, 42 insertions(+), 63 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 8995730ce0bf..60d9592eff7b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1290,10 +1290,36 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
mutex_unlock(&d->vqs[i]->mutex);
 }
 
-static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
- __virtio16 *idx)
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
 {
-   return vhost_get_avail(vq, *idx, &vq->avail->idx);
+   __virtio16 idx;
+   int r;
+
+   r = vhost_get_avail(vq, idx, &vq->avail->idx);
+   if (unlikely(r < 0)) {
+   vq_err(vq, "Failed to access available index at %p (%d)\n",
+  &vq->avail->idx, r);
+   return r;
+   }
+
+   /* Check it isn't doing very strange thing with available indexes */
+   vq->avail_idx = vhost16_to_cpu(vq, idx);
+   if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) {
+   vq_err(vq, "Invalid available index change from %u to %u",
+  vq->last_avail_idx, vq->avail_idx);
+   return -EINVAL;
+   }
+
+   /* We're done if there is nothing new */
+   if (vq->avail_idx == vq->last_avail_idx)
+   return 0;
+
+   /*
+* We updated vq->avail_idx so we need a memory barrier between
+* the index read above and the caller reading avail ring entries.
+*/
+   smp_rmb();
+   return 1;
 }
 
 static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
@@ -2498,38 +2524,17 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 {
struct vring_desc desc;
unsigned int i, head, found = 0;
-   u16 last_avail_idx;
-   __virtio16 avail_idx;
+   u16 last_avail_idx = vq->last_avail_idx;
__virtio16 ring_head;
int ret, access;
 
-   /* Check it isn't doing very strange things with descriptor numbers. */
-   last_avail_idx = vq->last_avail_idx;
-
if (vq->avail_idx == vq->last_avail_idx) {
-   if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
-   vq_err(vq, "Failed to access avail idx at %p\n",
-   &vq->avail->idx);
-   return -EFAULT;
-   }
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-
-   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-   vq_err(vq, "Guest moved avail index from %u to %u",
-   last_avail_idx, vq->avail_idx);
-   return -EFAULT;
-   }
+   ret = vhost_get_avail_idx(vq);
+   if (unlikely(ret < 0))
+   return ret;
 
-   /* If there's nothing new since last we looked, return
-* invalid.
-*/
-   if (vq->avail_idx == last_avail_idx)
+   if (!ret)
return vq->num;
-
-   /* Only get avail ring entries after they have been
-* exposed by guest.
-*/
-   smp_rmb();
}
 
/* Grab the next descriptor number they're advertising, and increment
@@ -2790,35 +2795,21 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 /* return true if we're sure that avaiable ring is empty */
 bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-   __virtio16 avail_idx;
int r;
 
if (vq->avail_idx != vq->last_avail_idx)
return false;
 
-   r = vhost_get_avail_idx(vq, &avail_idx);
-   if (unlikely(r))
-   return false;
-
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-   if (vq->avail_idx != vq->last_avail_idx) {
-   /* Since we have updated avail_idx, the following
-* call to vhost_get_vq_desc() will read available
-* ring 

Re: [PATCH v2 1/4] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-29 Thread Gavin Shan

On 4/30/24 04:44, Michael S. Tsirkin wrote:

On Mon, Apr 29, 2024 at 08:13:57PM +1000, Gavin Shan wrote:

From: "Michael S. Tsirkin" 

All the callers of vhost_get_avail_idx() are concerned to the memory


*with* the memory barrier



Thanks, will be corrected in v3.


barrier, imposed by smp_rmb() to ensure the order of the available
ring entry read and avail_idx read.

Improve vhost_get_avail_idx() so that smp_rmb() is executed when
the avail_idx is advanced.


accessed, not advanced. guest advances it.



smp_rmb() is executed only when vq->last_avail_idx != vq->avail_idx.
I used 'advanced' to indicate that condition. 'accessed' is also
correct, since the 'advanced' case is included in the 'accessed' case.


With it, the callers needn't to worry
about the memory barrier.

No functional change intended.


I'd add:

As a side benefit, we also validate the index on all paths now, which
will hopefully help catch future errors earlier.

Note: current code is inconsistent in how it handles errors:
some places treat it as an empty ring, others - non empty.
This patch does not attempt to change the existing behaviour.



Ok, I will integrate this to v3's commit log.





Signed-off-by: Michael S. Tsirkin 
[gshan: repainted vhost_get_avail_idx()]


?repainted?



It's just an indicator to say the changes aren't simply copied from
[1]; some follow-up changes are also applied, so it needs to be
reviewed. I will drop this in v3.


Reviewed-by: Gavin Shan 
Acked-by: Will Deacon 
---
  drivers/vhost/vhost.c | 106 +-
  1 file changed, 42 insertions(+), 64 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 8995730ce0bf..7aa623117aab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1290,10 +1290,36 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
mutex_unlock(&d->vqs[i]->mutex);
  }
  
-static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,

- __virtio16 *idx)
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
  {
-   return vhost_get_avail(vq, *idx, &vq->avail->idx);
+   __virtio16 idx;
+   int r;
+
+   r = vhost_get_avail(vq, idx, &vq->avail->idx);
+   if (unlikely(r < 0)) {
+   vq_err(vq, "Failed to access available index at %p (%d)\n",
+  &vq->avail->idx, r);
+   return r;
+   }
+
+   /* Check it isn't doing very strange thing with available indexes */
+   vq->avail_idx = vhost16_to_cpu(vq, idx);
+   if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) {
+   vq_err(vq, "Invalid available index change from %u to %u",
+  vq->last_avail_idx, vq->avail_idx);
+   return -EINVAL;
+   }
+
+   /* We're done if there is nothing new */
+   if (vq->avail_idx == vq->last_avail_idx)
+   return 0;
+
+   /*
+* We updated vq->avail_idx so we need a memory barrier between
+* the index read above and the caller reading avail ring entries.
+*/
+   smp_rmb();
+   return 1;
  }
  
  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,

@@ -2498,38 +2524,17 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
  {
struct vring_desc desc;
unsigned int i, head, found = 0;
-   u16 last_avail_idx;
-   __virtio16 avail_idx;
+   u16 last_avail_idx = vq->last_avail_idx;
__virtio16 ring_head;
int ret, access;
  
-	/* Check it isn't doing very strange things with descriptor numbers. */

-   last_avail_idx = vq->last_avail_idx;
-
if (vq->avail_idx == vq->last_avail_idx) {
-   if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
-   vq_err(vq, "Failed to access avail idx at %p\n",
-   &vq->avail->idx);
-   return -EFAULT;
-   }
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-
-   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-   vq_err(vq, "Guest moved avail index from %u to %u",
-   last_avail_idx, vq->avail_idx);
-   return -EFAULT;
-   }
+   ret = vhost_get_avail_idx(vq);
+   if (unlikely(ret < 0))
+   return ret;
  
-		/* If there's nothing new since last we looked, return

-* invalid.
-*/
-   if (vq->avail_idx == last_avail_idx)
+   if (!ret)
return vq->num;
-
-   /* Only get avail ring entries after they have been
-* exposed by guest.
-*/
-   smp_rmb();
}
  
  	/* Grab the next descriptor number they're advertising, and increment

@@ -2790,35 +2795,20 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
  /* return true if we're sure that 

Re: [PATCH v2 2/4] vhost: Drop variable last_avail_idx in vhost_get_vq_desc()

2024-04-29 Thread Gavin Shan

On 4/30/24 04:45, Michael S. Tsirkin wrote:

On Mon, Apr 29, 2024 at 08:13:58PM +1000, Gavin Shan wrote:

The local variable @last_avail_idx is equivalent to vq->last_avail_idx.
So the code can be simplified a bit by dropping the local variable
@last_avail_idx.

No functional change intended.

Signed-off-by: Gavin Shan 
---
  drivers/vhost/vhost.c | 7 +++
  1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7aa623117aab..b278c0333a66 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2524,7 +2524,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
  {
struct vring_desc desc;
unsigned int i, head, found = 0;
-   u16 last_avail_idx = vq->last_avail_idx;
__virtio16 ring_head;
int ret, access;
  
@@ -2539,10 +2538,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
  
  	/* Grab the next descriptor number they're advertising, and increment

 * the index we've seen. */
-   if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
+   if (unlikely(vhost_get_avail_head(vq, &ring_head, vq->last_avail_idx))) {
vq_err(vq, "Failed to read head: idx %d address %p\n",
-  last_avail_idx,
-  &vq->avail->ring[last_avail_idx % vq->num]);
+  vq->last_avail_idx,
+  &vq->avail->ring[vq->last_avail_idx % vq->num]);
return -EFAULT;
}


I don't see the big advantage and the line is long now.



The point is to avoid the local variable @last_avail_idx since it's equivalent
to vq->last_avail_idx, as stated in the commit log. Besides, it paves the way
for PATCH[v2 3/4], where the whole logic of fetching the head and the sanity
check is moved to vhost_get_avail_head(), so that vhost_get_vq_desc() is
simplified.

I will drop PATCH[2, 3, 4] as you suggested.

Thanks,
Gavin




Re: [PATCH RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code

2024-04-29 Thread Andrii Nakryiko
On Wed, Apr 24, 2024 at 5:02 PM Andrii Nakryiko  wrote:
>
> At the lowest level, rethook-based kretprobes on x86-64 architecture go
> through the arch_rethook_trampoline() function, manually written in
> assembly, which calls into a simple arch_rethook_trampoline_callback()
> function, written in C, and only doing a few straightforward field
> assignments, before calling further into rethook_trampoline_handler(),
> which handles kretprobe callbacks generically.
>
> Looking at simplicity of arch_rethook_trampoline_callback(), it seems
> not really worthwhile to spend an extra function call just to do 4 or
> 5 assignments. As such, this patch proposes to "inline"
> arch_rethook_trampoline_callback() into arch_rethook_trampoline() by
> manually implementing it in an assembly code.
>
> This has two motivations. First, we do get a bit of runtime speed up by
> avoiding function calls. Using BPF selftests's bench tool, we see
> 0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe
> triggering code path:
>
> BEFORE (latest probes/for-next)
> ===
> kretprobe  :   10.455 ± 0.024M/s
> kretprobe-multi:   11.150 ± 0.012M/s
>
> AFTER (probes/for-next + this patch)
> 
> kretprobe  :   10.540 ± 0.009M/s (+0.8%)
> kretprobe-multi:   11.219 ± 0.042M/s (+0.6%)
>
> Second, and no less importantly for some specialized use cases, this
> avoids unnecessarily "polluting" LBR records with an extra function call
> (recorded as a jump by CPU). This is the case for the retsnoop ([0])
> tool, which relies heavily on capturing LBR records to provide users with
> lots of insight into kernel internals.
>
> This RFC patch is only inlining this function for x86-64, but it's
> possible to do that for 32-bit x86 arch as well and then remove
> arch_rethook_trampoline_callback() implementation altogether. Please let
> me know if this change is acceptable and whether I should complete it
> with 32-bit "inlining" as well. Thanks!
>
>   [0] 
> https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr
>
> Signed-off-by: Andrii Nakryiko 
> ---
>  arch/x86/kernel/asm-offsets_64.c |  4 
>  arch/x86/kernel/rethook.c| 37 +++-
>  2 files changed, 36 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/asm-offsets_64.c 
> b/arch/x86/kernel/asm-offsets_64.c
> index bb65371ea9df..5c444abc540c 100644
> --- a/arch/x86/kernel/asm-offsets_64.c
> +++ b/arch/x86/kernel/asm-offsets_64.c
> @@ -42,6 +42,10 @@ int main(void)
> ENTRY(r14);
> ENTRY(r15);
> ENTRY(flags);
> +   ENTRY(ip);
> +   ENTRY(cs);
> +   ENTRY(ss);
> +   ENTRY(orig_ax);
> BLANK();
>  #undef ENTRY
>
> diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c
> index 8a1c0111ae79..3e1c01beebd1 100644
> --- a/arch/x86/kernel/rethook.c
> +++ b/arch/x86/kernel/rethook.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "kprobes/common.h"
>
> @@ -34,10 +35,36 @@ asm(
> "   pushq %rsp\n"
> "   pushfq\n"
> SAVE_REGS_STRING
> -   "   movq %rsp, %rdi\n"
> -   "   call arch_rethook_trampoline_callback\n"
> +   "   movq %rsp, %rdi\n" /* $rdi points to regs */
> +   /* fixup registers */
> +   /* regs->cs = __KERNEL_CS; */
> +   "   movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) 
> "(%rdi)\n"
> +   /* regs->ip = (unsigned long)_rethook_trampoline; */
> +   "   movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) 
> "(%rdi)\n"
> +   /* regs->orig_ax = ~0UL; */
> +   "   movq $0x, " __stringify(pt_regs_orig_ax) 
> "(%rdi)\n"
> +   /* regs->sp += 2*sizeof(long); */
> +   "   addq $16, " __stringify(pt_regs_sp) "(%rdi)\n"
> +   /* 2nd arg is frame_pointer = (long *)(regs + 1); */
> +   "   lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n"

BTW, all this __stringify() ugliness can be avoided if we move this
assembly into its own .S file, like lots of other assembly functions
in arch/x86/kernel subdir. That has another benefit of generating
better line information in DWARF for those assembly instructions. It's
lots more work, so before I do this, I'd like to get confirmation that
this change is acceptable in principle.

> +   /*
> +* The return address at 'frame_pointer' is recovered by the
> +* arch_rethook_fixup_return() which called from this
> +* rethook_trampoline_handler().
> +*/
> +   "   call rethook_trampoline_handler\n"
> +   /*
> +* Copy FLAGS to 'pt_regs::ss' so we can do RET right after POPF.
> +*
> +* We don't save/restore %rax below, because we ignore
> +* rethook_trampoline_handler result.
> +*
> +* *(unsigned long *)>ss = regs->flags;
> +*/
> +   "   mov " __stringify(pt_regs_flags) "(%rsp), %rax\n"
> 

Re: [PATCH bpf-next] bpf: add support to read cpu_entry in bpf program

2024-04-29 Thread Yonghong Song



On 4/27/24 8:18 AM, Florian Lehner wrote:

Add new field "cpu_entry" to bpf_perf_event_data which could be read by
bpf programs attached to perf events. The value contains the CPU value
recorded by specifying sample_type with PERF_SAMPLE_CPU when calling
perf_event_open().


You can use the bpf_cast_to_kern_ctx() kfunc to cast the 'struct
bpf_perf_event_data' ctx to 'struct bpf_perf_event_data_kern'.

struct bpf_perf_event_data_kern {
bpf_user_pt_regs_t *regs;
struct perf_sample_data *data;
struct perf_event *event;
};

You can then access bpf_perf_event_data_kern->data and read the 'cpu_entry'
field from it.
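
For example, something along these lines (untested sketch; the program
section and the BPF_CORE_READ() usage are just illustrative, not the only
way to do the read):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_core_read.h>

  extern void *bpf_cast_to_kern_ctx(void *obj) __ksym;

  char _license[] SEC("license") = "GPL";

  SEC("perf_event")
  int handle_pe(struct bpf_perf_event_data *ctx)
  {
          struct bpf_perf_event_data_kern *kctx;
          u32 cpu;

          kctx = bpf_cast_to_kern_ctx(ctx);
          /* cpu_entry is only filled in when the event was opened with
           * PERF_SAMPLE_CPU in sample_type.
           */
          cpu = BPF_CORE_READ(kctx, data, cpu_entry, cpu);
          bpf_printk("sample on cpu %u", cpu);
          return 0;
  }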



Signed-off-by: Florian Lehner 
---
  include/uapi/linux/bpf_perf_event.h   |  4 
  kernel/trace/bpf_trace.c  | 13 +
  tools/include/uapi/linux/bpf_perf_event.h |  4 
  3 files changed, 21 insertions(+)

diff --git a/include/uapi/linux/bpf_perf_event.h 
b/include/uapi/linux/bpf_perf_event.h
index eb1b9d21250c..4856b4396ece 100644
--- a/include/uapi/linux/bpf_perf_event.h
+++ b/include/uapi/linux/bpf_perf_event.h
@@ -14,6 +14,10 @@ struct bpf_perf_event_data {
bpf_user_pt_regs_t regs;
__u64 sample_period;
__u64 addr;
+   struct {
+   u32 cpu;
+   u32 reserved;
+   }   cpu_entry;
  };
  
  #endif /* _UAPI__LINUX_BPF_PERF_EVENT_H__ */

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index afb232b1d7c2..2b303221af5c 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -2176,6 +2176,11 @@ static bool pe_prog_is_valid_access(int off, int size, 
enum bpf_access_type type
if (!bpf_ctx_narrow_access_ok(off, size, size_u64))
return false;
break;
+   case bpf_ctx_range(struct bpf_perf_event_data, cpu_entry):
+   bpf_ctx_record_field_size(info, size_u64);
+   if (!bpf_ctx_narrow_access_ok(off, size, size_u64))
+   return false;
+   break;
default:
if (size != sizeof(long))
return false;
@@ -2208,6 +2213,14 @@ static u32 pe_prog_convert_ctx_access(enum 
bpf_access_type type,
  bpf_target_off(struct perf_sample_data, 
addr, 8,
 target_size));
break;
+   case offsetof(struct bpf_perf_event_data, cpu_entry):
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct 
bpf_perf_event_data_kern,
+  data), si->dst_reg, 
si->src_reg,
+ offsetof(struct bpf_perf_event_data_kern, 
data));
+   *insn++ = BPF_LDX_MEM(BPF_DW, si->dst_reg, si->dst_reg,
+ bpf_target_off(struct perf_sample_data, 
cpu_entry, 8,
+target_size));
+   break;
default:
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct 
bpf_perf_event_data_kern,
   regs), si->dst_reg, 
si->src_reg,
diff --git a/tools/include/uapi/linux/bpf_perf_event.h 
b/tools/include/uapi/linux/bpf_perf_event.h
index eb1b9d21250c..4856b4396ece 100644
--- a/tools/include/uapi/linux/bpf_perf_event.h
+++ b/tools/include/uapi/linux/bpf_perf_event.h
@@ -14,6 +14,10 @@ struct bpf_perf_event_data {
bpf_user_pt_regs_t regs;
__u64 sample_period;
__u64 addr;
+   struct {
+   u32 cpu;
+   u32 reserved;
+   }   cpu_entry;
  };
  
  #endif /* _UAPI__LINUX_BPF_PERF_EVENT_H__ */




Re: [PATCH v12 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-29 Thread Huang, Kai




 /*
@@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list 
*sgx_lru_list(struct sgx_epc_page *epc_pag

  */
 static inline bool sgx_can_reclaim(void)
 {
-    return !list_empty(&sgx_global_lru.reclaimable);
+    return !sgx_cgroup_lru_empty(misc_cg_root()) ||
+   !list_empty(&sgx_global_lru.reclaimable);
 }


Shouldn't this be:

if (IS_ENABLED(CONFIG_CGROUP_MISC))
    return !sgx_cgroup_lru_empty(misc_cg_root());
else
    return !list_empty(&sgx_global_lru.reclaimable);
?

In this way, it is consistent with the sgx_reclaim_pages_global() below.



I changed to this way because sgx_cgroup_lru_empty() is now defined in 
both KConfig cases.
And it seems better to minimize use of the KConfig variables based on 
earlier feedback (some are yours).
Don't really have a strong preference here. So let me know, one way or the 
other.




But IMHO your code could be confusing, e.g., it can be interpreted as:

  The EPC pages can be managed by both the cgroup LRUs and the
  sgx_global_lru simultaneously at runtime when CONFIG_CGROUP_MISC
  is on.

Which is not true.

So we should make the code clearly reflect the true behaviour.



Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-29 Thread Andrii Nakryiko
On Sun, Apr 28, 2024 at 4:25 PM Steven Rostedt  wrote:
>
> On Thu, 25 Apr 2024 13:31:53 -0700
> Andrii Nakryiko  wrote:
>
> I'm just coming back from Japan (work and then a vacation), and
> catching up on my email during the 6 hour layover in Detroit.
>
> > Hey Masami,
> >
> > I can't really review most of that code as I'm completely unfamiliar
> > with all those inner workings of fprobe/ftrace/function_graph. I left
> > a few comments where there were somewhat more obvious BPF-related
> > pieces.
> >
> > But I also did run our BPF benchmarks on probes/for-next as a baseline
> > and then with your series applied on top. Just to see if there are any
> > regressions. I think it will be a useful data point for you.
> >
> > You should be already familiar with the bench tool we have in BPF
> > selftests (I used it on some other patches for your tree).
>
> I should get familiar with your tools too.
>

It's a nifty and self-contained tool to do some micro-benchmarking, I
replied to Masami with a few details on how to build and use it.

> >
> > BASELINE
> > 
> > kprobe :   24.634 ± 0.205M/s
> > kprobe-multi   :   28.898 ± 0.531M/s
> > kretprobe  :   10.478 ± 0.015M/s
> > kretprobe-multi:   11.012 ± 0.063M/s
> >
> > THIS PATCH SET ON TOP
> > =
> > kprobe :   25.144 ± 0.027M/s (+2%)
> > kprobe-multi   :   28.909 ± 0.074M/s
> > kretprobe  :9.482 ± 0.008M/s (-9.5%)
> > kretprobe-multi:   13.688 ± 0.027M/s (+24%)
> >
> > These numbers are pretty stable and look to be more or less representative.
>
> Thanks for running this.
>
> >
> > As you can see, kprobes got a bit faster, kprobe-multi seems to be
> > about the same, though.
> >
> > Then (I suppose they are "legacy") kretprobes got quite noticeably
> > slower, almost by 10%. Not sure why, but looks real after re-running
> > benchmarks a bunch of times and getting stable results.
> >
> > On the other hand, multi-kretprobes got significantly faster (+24%!).
> > Again, I don't know if it is expected or not, but it's a nice
> > improvement.
> >
> > If you have any idea why kretprobes would get so much slower, it would
> > be nice to look into that and see if you can mitigate the regression
> > somehow. Thanks!
>
> My guess is that this patch set helps generic use cases for tracing the
> return of functions, but will likely add more overhead for single use
> cases. That is, kretprobe is made to be specific for a single function,
> but kretprobe-multi is more generic. Hence the generic version will
> improve at the sacrifice of the specific function. I did expect as much.
>
> That said, I think there's probably a lot of low hanging fruit that can
> be done to this series to help improve the kretprobe performance. I'm
> not sure we can get back to the baseline, but I'm hoping we can at
> least make it much better than that 10% slowdown.

That would certainly be appreciated, thanks!

But I'm also considering trying to switch to multi-kprobe/kretprobe
automatically on libbpf side, whenever possible, so that users can get
the best performance. There might still be situations where this can't
be done, so singular kprobe/kretprobe can't be completely deprecated,
but multi variants seems to be universally faster, so I'm going to
make them a default (I need to handle some backwards compat aspect,
but that's libbpf-specific stuff you shouldn't be concerned with).

>
> I'll be reviewing this patch set this week as I recover from jetlag.
>
> -- Steve



Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-29 Thread Andrii Nakryiko
On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu  wrote:
>
> Hi Andrii,
>
> On Thu, 25 Apr 2024 13:31:53 -0700
> Andrii Nakryiko  wrote:
>
> > Hey Masami,
> >
> > I can't really review most of that code as I'm completely unfamiliar
> > with all those inner workings of fprobe/ftrace/function_graph. I left
> > a few comments where there were somewhat more obvious BPF-related
> > pieces.
> >
> > But I also did run our BPF benchmarks on probes/for-next as a baseline
> > and then with your series applied on top. Just to see if there are any
> > regressions. I think it will be a useful data point for you.
>
> Thanks for testing!
>
> >
> > You should be already familiar with the bench tool we have in BPF
> > selftests (I used it on some other patches for your tree).
>
> What patches we need?
>

You mean for this `bench` tool? They are part of BPF selftests (under
tools/testing/selftests/bpf), you can build them by running:

$ make RELEASE=1 -j$(nproc) bench

After that you'll get a self-contained `bench` binary, which has all
the self-contained benchmarks.

You might also find a small script (benchs/run_bench_trigger.sh inside
BPF selftests directory) helpful, it collects final summary of the
benchmark run and optionally accepts a specific set of benchmarks. So
you can use it like this:

$ benchs/run_bench_trigger.sh kprobe kprobe-multi
kprobe :   18.731 ± 0.639M/s
kprobe-multi   :   23.938 ± 0.612M/s

By default it will run a wider set of benchmarks (no uprobes, but a
bunch of extra fentry/fexit tests and stuff like this).

> >
> > BASELINE
> > 
> > kprobe :   24.634 ± 0.205M/s
> > kprobe-multi   :   28.898 ± 0.531M/s
> > kretprobe  :   10.478 ± 0.015M/s
> > kretprobe-multi:   11.012 ± 0.063M/s
> >
> > THIS PATCH SET ON TOP
> > =
> > kprobe :   25.144 ± 0.027M/s (+2%)
> > kprobe-multi   :   28.909 ± 0.074M/s
> > kretprobe  :9.482 ± 0.008M/s (-9.5%)
> > kretprobe-multi:   13.688 ± 0.027M/s (+24%)
>
> This looks good. Kretprobe should also use kretprobe-multi (fprobe)
> eventually because it should be a single callback version of
> kretprobe-multi.
>
> >
> > These numbers are pretty stable and look to be more or less representative.
> >
> > As you can see, kprobes got a bit faster, kprobe-multi seems to be
> > about the same, though.
> >
> > Then (I suppose they are "legacy") kretprobes got quite noticeably
> > slower, almost by 10%. Not sure why, but looks real after re-running
> > benchmarks a bunch of times and getting stable results.
>
> Hmm, kretprobe on x86 should use ftrace + rethook even with my series.
> So nothing should be changed. Maybe cache access pattern has been
> changed?
> I'll check it with tracefs (to remove the effect from bpf related changes)
>
> >
> > On the other hand, multi-kretprobes got significantly faster (+24%!).
> > Again, I don't know if it is expected or not, but it's a nice
> > improvement.
>
> Thanks!
>
> >
> > If you have any idea why kretprobes would get so much slower, it would
> > be nice to look into that and see if you can mitigate the regression
> > somehow. Thanks!
>
> OK, let me check it.
>
> Thank you!
>
> >
> >
> > >  51 files changed, 2325 insertions(+), 882 deletions(-)
> > >  create mode 100644 
> > > tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc
> > >
> > > --
> > > Masami Hiramatsu (Google) 
> > >
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCH v2 0/4] vhost: Cleanup

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:56PM +1000, Gavin Shan wrote:
> This is suggested by Michael S. Tsirkin according to [1] and the goal
> is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it,
> the caller of the function needn't worry about memory barriers. Since
> we're here, other cleanups are also applied.
> 
> [1] 
> https://lore.kernel.org/virtualization/20240327155750-mutt-send-email-...@kernel.org/


Patch 1 makes some sense, gave some comments. Rest I think we should
just drop.

> PATCH[1] improves vhost_get_avail_idx() so that smp_rmb() is applied if
>  needed. Besides, the sanity checks on the retrieved available
>  queue index are also squeezed to vhost_get_avail_idx()
> PATCH[2] drops the local variable @last_avail_idx since it's equivalent
>  to vq->last_avail_idx
> PATCH[3] improves vhost_get_avail_head(), similar to what we're doing
>  for vhost_get_avail_idx(), so that the relevant sanity checks
>  on the head are squeezed to vhost_get_avail_head()
> PATCH[4] Reformat vhost_{get, put}_user() by using tab instead of space
>  as the terminator for each line
> 
> Gavin Shan (3):
>   vhost: Drop variable last_avail_idx in vhost_get_vq_desc()
>   vhost: Improve vhost_get_avail_head()
>   vhost: Reformat vhost_{get, put}_user()
> 
> Michael S. Tsirkin (1):
>   vhost: Improve vhost_get_avail_idx() with smp_rmb()
> 
>  drivers/vhost/vhost.c | 215 +++---
>  1 file changed, 97 insertions(+), 118 deletions(-)
> 
> Changelog
> =
> v2:
>   * Improve vhost_get_avail_idx() as Michael suggested in [1]
> as above (Michael)
>   * Correct @head's type from 'unsigned int' to 'int'
> (l...@intel.com)
> 
> -- 
> 2.44.0




Re: [PATCH v2 4/4] vhost: Reformat vhost_{get, put}_user()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:14:00PM +1000, Gavin Shan wrote:
> Reformat the macros to use tab as the terminator for each line so
> that it looks clean.
> 
> No functional change intended.
> 
> Signed-off-by: Gavin Shan 

Just messes up history for no real gain.

> ---
>  drivers/vhost/vhost.c | 60 +--
>  1 file changed, 30 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 4ddb9ec2fe46..c1ed5e750521 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1207,21 +1207,22 @@ static inline void __user *__vhost_get_user(struct 
> vhost_virtqueue *vq,
>   return __vhost_get_user_slow(vq, addr, size, type);
>  }
>  
> -#define vhost_put_user(vq, x, ptr)   \
> -({ \
> - int ret; \
> - if (!vq->iotlb) { \
> - ret = __put_user(x, ptr); \
> - } else { \
> - __typeof__(ptr) to = \
> +#define vhost_put_user(vq, x, ptr)   \
> +({   \
> + int ret;\
> + if (!vq->iotlb) {   \
> + ret = __put_user(x, ptr);   \
> + } else {\
> + __typeof__(ptr) to =\
>   (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> -   sizeof(*ptr), VHOST_ADDR_USED); \
> - if (to != NULL) \
> - ret = __put_user(x, to); \
> - else \
> - ret = -EFAULT;  \
> - } \
> - ret; \
> + sizeof(*ptr),   \
> + VHOST_ADDR_USED);   \
> + if (to != NULL) \
> + ret = __put_user(x, to);\
> + else\
> + ret = -EFAULT;  \
> + }   \
> + ret;\
>  })
>  
>  static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> @@ -1252,22 +1253,21 @@ static inline int vhost_put_used_idx(struct 
> vhost_virtqueue *vq)
>   &vq->used->idx);
>  }
>  
> -#define vhost_get_user(vq, x, ptr, type) \
> -({ \
> - int ret; \
> - if (!vq->iotlb) { \
> - ret = __get_user(x, ptr); \
> - } else { \
> - __typeof__(ptr) from = \
> - (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> -sizeof(*ptr), \
> -type); \
> - if (from != NULL) \
> - ret = __get_user(x, from); \
> - else \
> - ret = -EFAULT; \
> - } \
> - ret; \
> +#define vhost_get_user(vq, x, ptr, type) \
> +({   \
> + int ret;\
> + if (!vq->iotlb) {   \
> + ret = __get_user(x, ptr);   \
> + } else {\
> + __typeof__(ptr) from =  \
> + (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
> + sizeof(*ptr), type);\
> + if (from != NULL)   \
> + ret = __get_user(x, from);  \
> + else\
> + ret = -EFAULT;  \
> + }   \
> + ret;\
>  })
>  
>  #define vhost_get_avail(vq, x, ptr) \
> -- 
> 2.44.0




Re: [PATCH v2 3/4] vhost: Improve vhost_get_avail_head()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:59PM +1000, Gavin Shan wrote:
> Improve vhost_get_avail_head() so that the head or errno is returned.
> With it, the relevant sanity checks are squeezed to vhost_get_avail_head()
> and vhost_get_vq_desc() is further simplified.
> 
> No functional change intended.
> 
> Signed-off-by: Gavin Shan 

I don't see what does this moving code around achieve.

> ---
>  drivers/vhost/vhost.c | 50 ++-
>  1 file changed, 26 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index b278c0333a66..4ddb9ec2fe46 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1322,11 +1322,27 @@ static inline int vhost_get_avail_idx(struct 
> vhost_virtqueue *vq)
>   return 1;
>  }
>  
> -static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> -__virtio16 *head, int idx)
> +static inline int vhost_get_avail_head(struct vhost_virtqueue *vq)
>  {
> - return vhost_get_avail(vq, *head,
> -        &vq->avail->ring[idx & (vq->num - 1)]);
> + __virtio16 head;
> + int r;
> +
> + r = vhost_get_avail(vq, head,
> +     &vq->avail->ring[vq->last_avail_idx & (vq->num - 1)]);
> + if (unlikely(r)) {
> + vq_err(vq, "Failed to read head: index %u address %p\n",
> +vq->last_avail_idx,
> +       &vq->avail->ring[vq->last_avail_idx & (vq->num - 1)]);
> + return r;
> + }
> +
> + r = vhost16_to_cpu(vq, head);
> + if (unlikely(r >= vq->num)) {
> + vq_err(vq, "Invalid head %d (%u)\n", r, vq->num);
> + return -EINVAL;
> + }
> +
> + return r;
>  }
>  
>  static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> @@ -2523,9 +2539,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> struct vhost_log *log, unsigned int *log_num)
>  {
>   struct vring_desc desc;
> - unsigned int i, head, found = 0;
> - __virtio16 ring_head;
> - int ret, access;
> + unsigned int i, found = 0;
> + int head, ret, access;
>  
>   if (vq->avail_idx == vq->last_avail_idx) {
>   ret = vhost_get_avail_idx(vq);
> @@ -2536,23 +2551,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>   return vq->num;
>   }
>  
> - /* Grab the next descriptor number they're advertising, and increment
> -  * the index we've seen. */
> - if (unlikely(vhost_get_avail_head(vq, &ring_head, vq->last_avail_idx))) {
> - vq_err(vq, "Failed to read head: idx %d address %p\n",
> -vq->last_avail_idx,
> -        &vq->avail->ring[vq->last_avail_idx % vq->num]);
> - return -EFAULT;
> - }
> -
> - head = vhost16_to_cpu(vq, ring_head);
> -
> - /* If their number is silly, that's an error. */
> - if (unlikely(head >= vq->num)) {
> - vq_err(vq, "Guest says index %u > %u is available",
> -head, vq->num);
> - return -EINVAL;
> - }
> + /* Grab the next descriptor number they're advertising */
> + head = vhost_get_avail_head(vq);
> + if (unlikely(head < 0))
> + return head;
>  
>   /* When we start there are none of either input nor output. */
>   *out_num = *in_num = 0;
> -- 
> 2.44.0




Re: [PATCH v2 2/4] vhost: Drop variable last_avail_idx in vhost_get_vq_desc()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:58PM +1000, Gavin Shan wrote:
> The local variable @last_avail_idx is equivalent to vq->last_avail_idx.
> So the code can be simplified a bit by dropping the local variable
> @last_avail_idx.
> 
> No functional change intended.
> 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/vhost/vhost.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 7aa623117aab..b278c0333a66 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2524,7 +2524,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  {
>   struct vring_desc desc;
>   unsigned int i, head, found = 0;
> - u16 last_avail_idx = vq->last_avail_idx;
>   __virtio16 ring_head;
>   int ret, access;
>  
> @@ -2539,10 +2538,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  
>   /* Grab the next descriptor number they're advertising, and increment
>* the index we've seen. */
> - if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
> + if (unlikely(vhost_get_avail_head(vq, &ring_head, vq->last_avail_idx))) {
>   vq_err(vq, "Failed to read head: idx %d address %p\n",
> -last_avail_idx,
> -        &vq->avail->ring[last_avail_idx % vq->num]);
> +vq->last_avail_idx,
> +        &vq->avail->ring[vq->last_avail_idx % vq->num]);
>   return -EFAULT;
>   }

I don't see the big advantage and the line is long now.

>  
> -- 
> 2.44.0




Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr

2024-04-29 Thread Steven Rostedt
On Sun, 28 Apr 2024 20:28:37 -0400
Steven Rostedt  wrote:

> > Looking for any suggestion or solution, appreciate.  
> 
> Yeah, I do not think eventfs should be involved in this. It needs to be
> protected at a higher level (in the synthetic/dynamic event code).
> 
> I'm just coming back from Japan, and I'll need to take a deeper look at
> this after I recover from my jetlag.

OK, so I guess the eventfs nodes need an optional release callback. Here's
the right way to do that. I added a "release" function to the passed in
entry array that allows for calling a release function when the
eventfs_inode is freed. Then in code for creating events, I call
event_file_get() on the file being assigned and have the freeing of the
"enable" file have the release function that will call event_file_put() on
that file structure.

Does this fix it for you?

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 894c6ca1e500..dc97c19f9e0a 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -84,10 +84,17 @@ enum {
 static void release_ei(struct kref *ref)
 {
struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, 
kref);
+   const struct eventfs_entry *entry;
struct eventfs_root_inode *rei;
 
WARN_ON_ONCE(!ei->is_freed);
 
+   for (int i = 0; i < ei->nr_entries; i++) {
+   entry = &ei->entries[i];
+   if (entry->release)
+   entry->release(entry->name, ei->data);
+   }
+
kfree(ei->entry_attrs);
kfree_const(ei->name);
if (ei->is_events) {
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 7a5fe17b6bf9..d03f74658716 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -62,6 +62,8 @@ struct eventfs_file;
 typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
const struct file_operations **fops);
 
+typedef void (*eventfs_release)(const char *name, void *data);
+
 /**
  * struct eventfs_entry - dynamically created eventfs file call back handler
  * @name:  Then name of the dynamic file in an eventfs directory
@@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t 
*mode, void **data,
 struct eventfs_entry {
const char  *name;
eventfs_callbackcallback;
+   eventfs_release release;
 };
 
 struct eventfs_inode;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 52f75c36bbca..d14c84281f2b 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2552,6 +2552,14 @@ static int event_callback(const char *name, umode_t 
*mode, void **data,
return 0;
 }
 
+/* The file is incremented on creation and freeing the enable file decrements 
it */
+static void event_release(const char *name, void *data)
+{
+   struct trace_event_file *file = data;
+
+   event_file_put(file);
+}
+
 static int
 event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 {
@@ -2566,6 +2574,7 @@ event_create_dir(struct eventfs_inode *parent, struct 
trace_event_file *file)
{
.name   = "enable",
.callback   = event_callback,
+   .release= event_release,
},
{
.name   = "filter",
@@ -2634,6 +2643,9 @@ event_create_dir(struct eventfs_inode *parent, struct 
trace_event_file *file)
return ret;
}
 
+   /* Gets decremented on freeing of the "enable" file */
+   event_file_get(file);
+
return 0;
 }
 



Re: [PATCH v2 1/4] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-29 Thread Michael S. Tsirkin
On Mon, Apr 29, 2024 at 08:13:57PM +1000, Gavin Shan wrote:
> From: "Michael S. Tsirkin" 
> 
> All the callers of vhost_get_avail_idx() are concerned to the memory

*with* the memory barrier

> barrier, imposed by smp_rmb() to ensure the order of the available
> ring entry read and avail_idx read.
> 
> Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> the avail_idx is advanced.

accessed, not advanced. guest advances it.

> With it, the callers needn't to worry
> about the memory barrier.
> 
> No functional change intended.

I'd add:

As a side benefit, we also validate the index on all paths now, which
will hopefully help catch future errors earlier.

Note: current code is inconsistent in how it handles errors:
some places treat it as an empty ring, others - non empty.
This patch does not attempt to change the existing behaviour.



> Signed-off-by: Michael S. Tsirkin 
> [gshan: repainted vhost_get_avail_idx()]

?repainted?

> Reviewed-by: Gavin Shan 
> Acked-by: Will Deacon 
> ---
>  drivers/vhost/vhost.c | 106 +-
>  1 file changed, 42 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 8995730ce0bf..7aa623117aab 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1290,10 +1290,36 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>   mutex_unlock(&d->vqs[i]->mutex);
>  }
>  
> -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> -   __virtio16 *idx)
> +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
>  {
> - return vhost_get_avail(vq, *idx, &vq->avail->idx);
> + __virtio16 idx;
> + int r;
> +
> + r = vhost_get_avail(vq, idx, &vq->avail->idx);
> + if (unlikely(r < 0)) {
> + vq_err(vq, "Failed to access available index at %p (%d)\n",
> +        &vq->avail->idx, r);
> + return r;
> + }
> +
> + /* Check it isn't doing very strange thing with available indexes */
> + vq->avail_idx = vhost16_to_cpu(vq, idx);
> + if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) {
> + vq_err(vq, "Invalid available index change from %u to %u",
> +vq->last_avail_idx, vq->avail_idx);
> + return -EINVAL;
> + }
> +
> + /* We're done if there is nothing new */
> + if (vq->avail_idx == vq->last_avail_idx)
> + return 0;
> +
> + /*
> +  * We updated vq->avail_idx so we need a memory barrier between
> +  * the index read above and the caller reading avail ring entries.
> +  */
> + smp_rmb();
> + return 1;
>  }
>  
>  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> @@ -2498,38 +2524,17 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  {
>   struct vring_desc desc;
>   unsigned int i, head, found = 0;
> - u16 last_avail_idx;
> - __virtio16 avail_idx;
> + u16 last_avail_idx = vq->last_avail_idx;
>   __virtio16 ring_head;
>   int ret, access;
>  
> - /* Check it isn't doing very strange things with descriptor numbers. */
> - last_avail_idx = vq->last_avail_idx;
> -
>   if (vq->avail_idx == vq->last_avail_idx) {
> - if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> - vq_err(vq, "Failed to access avail idx at %p\n",
> -        &vq->avail->idx);
> - return -EFAULT;
> - }
> - vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> -
> - if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
> - vq_err(vq, "Guest moved avail index from %u to %u",
> - last_avail_idx, vq->avail_idx);
> - return -EFAULT;
> - }
> + ret = vhost_get_avail_idx(vq);
> + if (unlikely(ret < 0))
> + return ret;
>  
> - /* If there's nothing new since last we looked, return
> -  * invalid.
> -  */
> - if (vq->avail_idx == last_avail_idx)
> + if (!ret)
>   return vq->num;
> -
> - /* Only get avail ring entries after they have been
> -  * exposed by guest.
> -  */
> - smp_rmb();
>   }
>  
>   /* Grab the next descriptor number they're advertising, and increment
> @@ -2790,35 +2795,20 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
>  /* return true if we're sure that avaiable ring is empty */
>  bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
>  {
> - __virtio16 avail_idx;
>   int r;
>  
>   if (vq->avail_idx != vq->last_avail_idx)
>   return false;
>  
> - r = vhost_get_avail_idx(vq, &avail_idx);
> - if (unlikely(r))
> - return false;
> -
> - vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> - if 

Re: [syzbot] [bpf?] [trace?] WARNING in group_send_sig_info

2024-04-29 Thread Yonghong Song



On 4/27/24 9:34 AM, syzbot wrote:

Hello,

syzbot found the following issue on:

HEAD commit:443574b03387 riscv, bpf: Fix kfunc parameters incompatibil..
git tree:   bpf
console output: https://syzkaller.appspot.com/x/log.txt?x=11ca8fe718
kernel config:  https://syzkaller.appspot.com/x/.config?x=6fb1be60a193d440
dashboard link: https://syzkaller.appspot.com/bug?extid=1902c6d326478ce2dfb0
compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
2.40

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: 
https://storage.googleapis.com/syzbot-assets/3f355021a085/disk-443574b0.raw.xz
vmlinux: 
https://storage.googleapis.com/syzbot-assets/44cf4de7472a/vmlinux-443574b0.xz
kernel image: 
https://storage.googleapis.com/syzbot-assets/a99a36c7ad65/bzImage-443574b0.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+1902c6d326478ce2d...@syzkaller.appspotmail.com

[ cut here ]
raw_local_irq_restore() called with IRQs enabled
WARNING: CPU: 1 PID: 7785 at kernel/locking/irqflag-debug.c:10 
warn_bogus_irq_restore+0x29/0x40 kernel/locking/irqflag-debug.c:10
Modules linked in:
CPU: 1 PID: 7785 Comm: syz-executor.3 Not tainted 
6.8.0-syzkaller-05236-g443574b03387 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
03/27/2024
RIP: 0010:warn_bogus_irq_restore+0x29/0x40 kernel/locking/irqflag-debug.c:10
Code: 90 f3 0f 1e fa 90 80 3d de 59 01 04 00 74 06 90 c3 cc cc cc cc c6 05 cf 59 01 
04 01 90 48 c7 c7 20 ba aa 8b e8 f8 d5 e7 f5 90 <0f> 0b 90 90 90 c3 cc cc cc cc 
66 2e 0f 1f 84 00 00 00 00 00 0f 1f
RSP: 0018:c9000399fbb8 EFLAGS: 00010246

RAX: 4aede97b00455d00 RBX: 192000733f7c RCX: 88802a129e00
RDX:  RSI:  RDI: 
RBP: c9000399fc50 R08: 8157cc12 R09: 1110172a51a2
R10: dc00 R11: ed10172a51a3 R12: dc00
R13: 192000733f78 R14: c9000399fbe0 R15: 0246
FS:  7ae76480() GS:8880b950() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7ffc27e190f8 CR3: 6cb5 CR4: 003506f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
  
  __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:151 [inline]
  _raw_spin_unlock_irqrestore+0x120/0x140 kernel/locking/spinlock.c:194
  spin_unlock_irqrestore include/linux/spinlock.h:406 [inline]
  unlock_task_sighand include/linux/sched/signal.h:754 [inline]
  do_send_sig_info kernel/signal.c:1302 [inline]
  group_send_sig_info+0x2e0/0x310 kernel/signal.c:1453
  bpf_send_signal_common+0x2dd/0x430 kernel/trace/bpf_trace.c:881
  bpf_send_signal kernel/trace/bpf_trace.c:886 [inline]
  bpf_send_signal+0x19/0x30 kernel/trace/bpf_trace.c:884
  bpf_prog_8cc4ff36b5985b6a+0x1d/0x1f
  bpf_dispatcher_nop_func include/linux/bpf.h:1234 [inline]
  __bpf_prog_run include/linux/filter.h:650 [inline]
  bpf_prog_run include/linux/filter.h:664 [inline]
  __bpf_trace_run kernel/trace/bpf_trace.c:2381 [inline]
  bpf_trace_run2+0x375/0x420 kernel/trace/bpf_trace.c:2420
  trace_sys_exit include/trace/events/syscalls.h:44 [inline]
  syscall_exit_work+0x153/0x170 kernel/entry/common.c:163
  syscall_exit_to_user_mode_prepare kernel/entry/common.c:194 [inline]
  __syscall_exit_to_user_mode_work kernel/entry/common.c:199 [inline]
  syscall_exit_to_user_mode+0x273/0x360 kernel/entry/common.c:212
  do_syscall_64+0x10a/0x240 arch/x86/entry/common.c:89
  entry_SYSCALL_64_after_hwframe+0x6d/0x75


The following are related functions.

struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
   unsigned long *flags)
{
struct sighand_struct *sighand;

rcu_read_lock();
for (;;) {
sighand = rcu_dereference(tsk->sighand);
if (unlikely(sighand == NULL))
break;

/*
 * This sighand can be already freed and even reused, but
 * we rely on SLAB_TYPESAFE_BY_RCU and sighand_ctor() which
 * initializes ->siglock: this slab can't go away, it has
 * the same object type, ->siglock can't be reinitialized.
 *
 * We need to ensure that tsk->sighand is still the same
 * after we take the lock, we can race with de_thread() or
 * __exit_signal(). In the latter case the next iteration
 * must see ->sighand == NULL.
 */
spin_lock_irqsave(&sighand->siglock, *flags);
if (likely(sighand == rcu_access_pointer(tsk->sighand)))
break;
spin_unlock_irqrestore(&sighand->siglock, *flags);
}
rcu_read_unlock();


Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-29 Thread Haitao Huang
On Mon, 29 Apr 2024 11:43:05 -0500, Jarkko Sakkinen   
wrote:



On Mon Apr 29, 2024 at 7:18 PM EEST, Haitao Huang wrote:

Hi Jarkko

On Sun, 28 Apr 2024 17:03:17 -0500, Jarkko Sakkinen 
wrote:

> On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote:
>> On 4/16/24 07:15, Jarkko Sakkinen wrote:
>> > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
>> > Yes, exactly. I'd take one week break and cycle the kselftest part
>> > internally a bit as I said my previous response. I'm sure that there
>> > is expertise inside Intel how to implement it properly. I.e. take some
>> > time to find the right person, and wait as long as that person has a
>> > bit of bandwidth to go through the test and suggest modifications.
>>
>> Folks, I worry that this series is getting bogged down in the selftests.
>>  Yes, selftests are important.  But getting _some_ tests in the kernel
>> is substantially more important than getting perfect tests.
>>
>> I don't think Haitao needs to "cycle" this back inside Intel.
>
> The problem with the tests was that they are hard to run on anything other
> than Ubuntu (and perhaps Debian). It is hopefully now taken care of.
> Selftests do not have to be perfect but at minimum they need to be
> runnable.
>
> I need to re-test the latest series because it is possible that I did not
> have the right flags (I was travelling for a few days, thus have not done it
> yet).
>
> BR, Jarkko
>

Let me know if you want me to send v13 before testing or you can just use
the sgx_cg_upstream_v12_plus branch in my repo.

Also thanks for the "Reviewed-by" tags for other patches. But I've not got
"Reviewed-by" from you for patches #8-12 (not sure if I missed them). Could
you go through those also when you get a chance?


So, I compiled v12 branch. Was the only difference in selftests?

I can just copy them to the device.

BR, Jarkko

The only other functional change is using BUG_ON() when workqueue  
allocation fails in sgx_cgroup_init(). It should not affect testing  
results.


Thanks
Haitao




Re: [PATCHv3 bpf-next 6/7] selftests/bpf: Add uretprobe compat test

2024-04-29 Thread Andrii Nakryiko
On Mon, Apr 29, 2024 at 12:39 AM Jiri Olsa  wrote:
>
> On Fri, Apr 26, 2024 at 11:06:53AM -0700, Andrii Nakryiko wrote:
> > On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa  wrote:
> > >
> > > Adding test that adds return uprobe inside 32 bit task
> > > and verify the return uprobe and attached bpf programs
> > > get properly executed.
> > >
> > > Signed-off-by: Jiri Olsa 
> > > ---
> > >  tools/testing/selftests/bpf/.gitignore|  1 +
> > >  tools/testing/selftests/bpf/Makefile  |  6 ++-
> > >  .../selftests/bpf/prog_tests/uprobe_syscall.c | 40 +++
> > >  .../bpf/progs/uprobe_syscall_compat.c | 13 ++
> > >  4 files changed, 59 insertions(+), 1 deletion(-)
> > >  create mode 100644 
> > > tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/.gitignore 
> > > b/tools/testing/selftests/bpf/.gitignore
> > > index f1aebabfb017..69d71223c0dd 100644
> > > --- a/tools/testing/selftests/bpf/.gitignore
> > > +++ b/tools/testing/selftests/bpf/.gitignore
> > > @@ -45,6 +45,7 @@ test_cpp
> > >  /veristat
> > >  /sign-file
> > >  /uprobe_multi
> > > +/uprobe_compat
> > >  *.ko
> > >  *.tmp
> > >  xskxceiver
> > > diff --git a/tools/testing/selftests/bpf/Makefile 
> > > b/tools/testing/selftests/bpf/Makefile
> > > index edc73f8f5aef..d170b63eca62 100644
> > > --- a/tools/testing/selftests/bpf/Makefile
> > > +++ b/tools/testing/selftests/bpf/Makefile
> > > @@ -134,7 +134,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr 
> > > test_skb_cgroup_id_user \
> > > xskxceiver xdp_redirect_multi xdp_synproxy veristat 
> > > xdp_hw_metadata \
> > > xdp_features bpf_test_no_cfi.ko
> > >
> > > -TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi
> > > +TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi 
> > > uprobe_compat
> >
> > you need to add uprobe_compat to TRUNNER_EXTRA_FILES as well, no?
>
> ah right
>
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
> > > b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > > index 9233210a4c33..3770254d893b 100644
> > > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > > @@ -11,6 +11,7 @@
> > >  #include 
> > >  #include "uprobe_syscall.skel.h"
> > >  #include "uprobe_syscall_call.skel.h"
> > > +#include "uprobe_syscall_compat.skel.h"
> > >
> > >  __naked unsigned long uretprobe_regs_trigger(void)
> > >  {
> > > @@ -291,6 +292,35 @@ static void test_uretprobe_syscall_call(void)
> > >  "read_trace_pipe_iter");
> > > ASSERT_EQ(found, 0, "found");
> > >  }
> > > +
> > > +static void trace_pipe_compat_cb(const char *str, void *data)
> > > +{
> > > +   if (strstr(str, "uretprobe compat") != NULL)
> > > +   (*(int *)data)++;
> > > +}
> > > +
> > > +static void test_uretprobe_compat(void)
> > > +{
> > > +   struct uprobe_syscall_compat *skel = NULL;
> > > +   int err, found = 0;
> > > +
> > > +   skel = uprobe_syscall_compat__open_and_load();
> > > +   if (!ASSERT_OK_PTR(skel, "uprobe_syscall_compat__open_and_load"))
> > > +   goto cleanup;
> > > +
> > > +   err = uprobe_syscall_compat__attach(skel);
> > > +   if (!ASSERT_OK(err, "uprobe_syscall_compat__attach"))
> > > +   goto cleanup;
> > > +
> > > +   system("./uprobe_compat");
> > > +
> > > +   ASSERT_OK(read_trace_pipe_iter(trace_pipe_compat_cb, , 
> > > 1000),
> > > +"read_trace_pipe_iter");
> >
> > why so complicated? can't you just set global variable that it was called
>
> hm, we execute separate uprobe_compat (32bit) process that triggers the bpf
> program, so we can't use global variable.. using the trace_pipe was the only
> thing that was easy to do

you need child process to trigger uprobe, but you could have installed
BPF program from parent process (you'd need to make child wait for
parent to be ready, with normal pipe() like we do in other places).

I think generally the less work forked child process does, the better.
All those ASSERT() failures won't produce any output in child process,
unless you run tests in verbose mode, because we haven't implemented
some form of sending all the logs back to the parent process and so
they are completely lost. But that's a separate topic.

Either way, consider using pipe() to coordinate waiting from child on
parent being ready, but otherwise do all the BPF-related heavy lifting
from parent (you can attach BPF programs to specific PID using
bpf_program__attach_uprobe() easily, it's not declarative, but simple
enough).
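
Roughly (untested sketch; "called" is assumed to be a global variable in the
BPF object that the uretprobe program sets, and the traced symbol name is
just an example):

  LIBBPF_OPTS(bpf_uprobe_opts, opts,
          .retprobe = true,
          .func_name = "main",
  );
  int go[2];
  pid_t pid;

  if (!ASSERT_OK(pipe(go), "pipe"))
          return;

  pid = fork();
  if (pid == 0) {
          char c;

          /* child: wait until parent has attached the uretprobe, then run */
          read(go[0], &c, 1);
          execl("./uprobe_compat", "uprobe_compat", NULL);
          exit(1);
  }

  /* parent: attach the return uprobe for that PID only */
  skel->links.uretprobe = bpf_program__attach_uprobe_opts(
                  skel->progs.uretprobe, pid, "./uprobe_compat",
                  0 /* no extra offset past the symbol */, &opts);
  if (ASSERT_OK_PTR(skel->links.uretprobe, "attach_uprobe")) {
          write(go[1], "x", 1);
          waitpid(pid, NULL, 0);
          /* global variable instead of scanning trace_pipe */
          ASSERT_EQ(skel->bss->called, 1, "called");
  }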

>
> jirka
>
> >
> > > +   ASSERT_EQ(found, 1, "found");
> > > +
> > > +cleanup:
> > > +   uprobe_syscall_compat__destroy(skel);
> > > +}
> > >  #else
> > >  static void test_uretprobe_regs_equal(void)
> > >  {
> > > @@ -306,6 +336,11 @@ static void test_uretprobe_syscall_call(void)
> > >  {
> > 

Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-29 Thread Jarkko Sakkinen
On Mon Apr 29, 2024 at 7:18 PM EEST, Haitao Huang wrote:
> Hi Jarkko
>
> On Sun, 28 Apr 2024 17:03:17 -0500, Jarkko Sakkinen   
> wrote:
>
> > On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote:
> >> On 4/16/24 07:15, Jarkko Sakkinen wrote:
> >> > On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
> >> > Yes, exactly. I'd take one week break and cycle the kselftest part
> >> > internally a bit as I said my previous response. I'm sure that there
> >> > is expertise inside Intel how to implement it properly. I.e. take some
> >> > time to find the right person, and wait as long as that person has a
> >> > bit of bandwidth to go through the test and suggest modifications.
> >>
> >> Folks, I worry that this series is getting bogged down in the selftests.
> >>  Yes, selftests are important.  But getting _some_ tests in the kernel
> >> is substantially more important than getting perfect tests.
> >>
> >> I don't think Haitao needs to "cycle" this back inside Intel.
> >
> > The problem with the tests was that they are hard to run on anything other
> > than Ubuntu (and perhaps Debian). It is hopefully now taken care of.
> > Selftests do not have to be perfect but at minimum they need to be
> > runnable.
> >
> > I need to re-test the latest series because it is possible that I did not
> > have the right flags (I was travelling for a few days, thus have not done
> > it yet).
> >
> > BR, Jarkko
> >
>
> Let me know if you want me to send v13 before testing or you can just use  
> the sgx_cg_upstream_v12_plus branch in my repo.
>
> Also thanks for the "Reviewed-by" tags for other patches. But I've not got  
> "Reviewed-by" from you for patches #8-12 (not sure I missed). Could you go  
> through those alsoe when you get chance?

So, I compiled v12 branch. Was the only difference in selftests?

I can just copy them to the device.

BR, Jarkko



Re: [PATCHv3 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test

2024-04-29 Thread Andrii Nakryiko
On Mon, Apr 29, 2024 at 12:33 AM Jiri Olsa  wrote:
>
> On Fri, Apr 26, 2024 at 11:03:29AM -0700, Andrii Nakryiko wrote:
> > On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa  wrote:
> > >
> > > Adding test to verify that when called from outside of the
> > > trampoline provided by kernel, the uretprobe syscall will cause
> > > calling process to receive SIGILL signal and the attached bpf
> > > program is not executed.
> > >
> > > Signed-off-by: Jiri Olsa 
> > > ---
> > >  .../selftests/bpf/prog_tests/uprobe_syscall.c | 92 +++
> > >  .../selftests/bpf/progs/uprobe_syscall_call.c | 15 +++
> > >  2 files changed, 107 insertions(+)
> > >  create mode 100644 
> > > tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
> > >
> >
> > See nits below, but overall LGTM
> >
> > Acked-by: Andrii Nakryiko 
> >
> > [...]
> >
> > > @@ -219,6 +301,11 @@ static void test_uretprobe_regs_change(void)
> > >  {
> > > test__skip();
> > >  }
> > > +
> > > +static void test_uretprobe_syscall_call(void)
> > > +{
> > > +   test__skip();
> > > +}
> > >  #endif
> > >
> > >  void test_uprobe_syscall(void)
> > > @@ -228,3 +315,8 @@ void test_uprobe_syscall(void)
> > > if (test__start_subtest("uretprobe_regs_change"))
> > > test_uretprobe_regs_change();
> > >  }
> > > +
> > > +void serial_test_uprobe_syscall_call(void)
> >
> > does it need to be serial? non-serial are still run sequentially
> > within a process (there is no multi-threading), it's more about some
> > global effects on system.
>
> plz see below
>
> >
> > > +{
> > > +   test_uretprobe_syscall_call();
> > > +}
> > > diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c 
> > > b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
> > > new file mode 100644
> > > index ..5ea03bb47198
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
> > > @@ -0,0 +1,15 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include "vmlinux.h"
> > > +#include 
> > > +#include 
> > > +
> > > +struct pt_regs regs;
> > > +
> > > +char _license[] SEC("license") = "GPL";
> > > +
> > > +SEC("uretprobe//proc/self/exe:uretprobe_syscall_call")
> > > +int uretprobe(struct pt_regs *regs)
> > > +{
> > > +   bpf_printk("uretprobe called");
> >
> > debugging leftover? we probably don't want to pollute trace_pipe from test
>
> the reason for this is to make sure the bpf program was not executed,
>
> the test makes sure the child gets killed with SIGILL and also that
> the bpf program was not executed by checking the trace_pipe and
> making sure nothing was received
>
> the trace_pipe reading is also why it's serial

you could have attached BPF program from parent process and use a
global variable (and thus eliminate all the trace_pipe system-wide
dependency), but ok, it's fine by me the way this is done

>
> jirka
>
> >
> > > +   return 0;
> > > +}
> > > --
> > > 2.44.0
> > >



Re: [PATCH v7 00/16] mm: jit/text allocator

2024-04-29 Thread Luis Chamberlain
On Mon, Apr 29, 2024 at 03:16:04PM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Hi,
> 
> The patches are also available in git:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v7
> 
> v7 changes:
> * define MODULE_{VADDR,END} for riscv32 to fix the build and avoid
>   #ifdefs in a function body
> * add Acks, thanks everybody

Thanks, I've pushed this to modules-next for further exposure / testing.
Given the status of testing so far with prior revisions, in that only a
few issues were found and that those were fixed, and the status of
reviews, this just might be ripe for v6.10.

  Luis



Re: [PATCH v12 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-04-29 Thread Haitao Huang

Hi Jarkko

On Sun, 28 Apr 2024 17:03:17 -0500, Jarkko Sakkinen   
wrote:



On Fri Apr 26, 2024 at 5:28 PM EEST, Dave Hansen wrote:

On 4/16/24 07:15, Jarkko Sakkinen wrote:
> On Tue Apr 16, 2024 at 8:42 AM EEST, Huang, Kai wrote:
> Yes, exactly. I'd take one week break and cycle the kselftest part
> internally a bit as I said my previous response. I'm sure that there
> is expertise inside Intel how to implement it properly. I.e. take some
> time to find the right person, and wait as long as that person has a
> bit of bandwidth to go through the test and suggest modifications.

Folks, I worry that this series is getting bogged down in the selftests.
 Yes, selftests are important.  But getting _some_ tests in the kernel
is substantially more important than getting perfect tests.

I don't think Haitao needs to "cycle" this back inside Intel.


The problem with the tests was that they are hard to run on anything other
than Ubuntu (and perhaps Debian). It is hopefully now taken care of.
Selftests do not have to be perfect but at minimum they need to be
runnable.

I need to re-test the latest series because it is possible that I did not
have the right flags (I was travelling for a few days, thus have not done it
yet).

BR, Jarkko



Let me know if you want me to send v13 before testing or you can just use  
the sgx_cg_upstream_v12_plus branch in my repo.


Also thanks for the "Reviewed-by" tags for other patches. But I've not got  
"Reviewed-by" from you for patches #8-12 (not sure I missed). Could you go  
through those alsoe when you get chance?


Thanks
Haitao





Re: [PATCH v12 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-29 Thread Haitao Huang

On Mon, 29 Apr 2024 05:49:13 -0500, Huang, Kai  wrote:




+/*
+ * Get the per-cgroup or global LRU list that tracks the given  
reclaimable page.

+ */
 static inline struct sgx_epc_lru_list *sgx_lru_list(struct  
sgx_epc_page *epc_page)

 {
+#ifdef CONFIG_CGROUP_MISC
+   /*
+* epc_page->sgx_cg here is never NULL during a reclaimable epc_page's
+* life between sgx_alloc_epc_page() and sgx_free_epc_page():
+*
+* In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from
+	 * sgx_get_current_cg() which is the misc cgroup of the current task,  
or

+* the root by default even if the misc cgroup is disabled by kernel
+* command line.
+*
+* epc_page->sgx_cg is only unset by sgx_free_epc_page().
+*
+* This function is never used before sgx_alloc_epc_page() or after
+* sgx_free_epc_page().
+*/
+   return &epc_page->sgx_cg->lru;
+#else
	return &sgx_global_lru;
+#endif
 }

 /*
@@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list  
*sgx_lru_list(struct sgx_epc_page *epc_pag

  */
 static inline bool sgx_can_reclaim(void)
 {
-   return !list_empty(&sgx_global_lru.reclaimable);
+   return !sgx_cgroup_lru_empty(misc_cg_root()) ||
+  !list_empty(&sgx_global_lru.reclaimable);
 }


Shouldn't this be:

if (IS_ENABLED(CONFIG_CGROUP_MISC))
return !sgx_cgroup_lru_empty(misc_cg_root());
else
    return !list_empty(&sgx_global_lru.reclaimable);
?

In this way, it is consistent with the sgx_reclaim_pages_global() below.



I changed to this way because sgx_cgroup_lru_empty() is now defined in  
both KConfig cases.
And it seems better to minimize use of the KConfig variables based on  
earlier feedback (some are yours).
Don't really have a strong preference here. So let me know, one way or the  
other.




 static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
@@ -404,7 +426,10 @@ static bool sgx_should_reclaim(unsigned long  
watermark)


 static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
 {
-   sgx_reclaim_pages(&sgx_global_lru, charge_mm);
+   if (IS_ENABLED(CONFIG_CGROUP_MISC))
+   sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
+   else
+   sgx_reclaim_pages(&sgx_global_lru, charge_mm);
 }

 /*
@@ -414,6 +439,14 @@ static void sgx_reclaim_pages_global(struct  
mm_struct *charge_mm)

  */
 void sgx_reclaim_direct(void)
 {
+   struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
+
+   /* Make sure there are some free pages at cgroup level */
+   if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) {
+   sgx_cgroup_reclaim_pages(misc_from_sgx(sgx_cg), current->mm);
+   sgx_put_cg(sgx_cg);
+   }


Empty line.



Sure


+   /* Make sure there are some free pages at global level */
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))


Looking at the code, to me sgx_should_reclaim() is a little bit vague
because from the name we don't know whether it internally checks the
current cgroup or the global one.  
It's better to rename to sgx_should_reclaim_global().


Ditto for sgx_can_reclaim() -> sgx_can_reclaim_global().

And I think you can do the renaming in the previous patch, because in the
changelog of your previous patch, it seems you have called out the two
functions are for global reclaim.



ok

Thanks
Haitao



Re: [PATCH 1/2] dt-bindings: remoteproc: qcom,smd-edge: Mark qcom,ipc as deprecated

2024-04-29 Thread Rob Herring


On Thu, 25 Apr 2024 21:14:30 +0200, Luca Weiss wrote:
> Deprecate the qcom,ipc way of accessing the mailbox in favor of the
> 'mboxes' property.
> 
> Update the example to use mboxes.
> 
> Signed-off-by: Luca Weiss 
> ---
>  Documentation/devicetree/bindings/remoteproc/qcom,smd-edge.yaml | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 

Reviewed-by: Rob Herring (Arm) 




Re: [PATCH 2/2] dt-bindings: soc: qcom,smp2p: Mark qcom,ipc as deprecated

2024-04-29 Thread Rob Herring


On Thu, 25 Apr 2024 21:14:31 +0200, Luca Weiss wrote:
> Deprecate the qcom,ipc way of accessing the mailbox in favor of the
> 'mboxes' property.
> 
> Update the example to use mboxes.
> 
> Signed-off-by: Luca Weiss 
> ---
>  Documentation/devicetree/bindings/soc/qcom/qcom,smp2p.yaml | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 

Reviewed-by: Rob Herring (Arm) 




Re: [PATCH v9 29/36] bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled

2024-04-29 Thread Google
On Thu, 25 Apr 2024 13:09:32 -0700
Andrii Nakryiko  wrote:

> On Mon, Apr 15, 2024 at 6:22 AM Masami Hiramatsu (Google)
>  wrote:
> >
> > From: Masami Hiramatsu (Google) 
> >
> > Enable kprobe_multi feature if CONFIG_FPROBE is enabled. The pt_regs is
> > converted from ftrace_regs by ftrace_partial_regs(), thus some registers
> > may always returns 0. But it should be enough for function entry (access
> > arguments) and exit (access return value).
> >
> > Signed-off-by: Masami Hiramatsu (Google) 
> > Acked-by: Florent Revest 
> > ---
> >  Changes from previous series: NOTHING, Update against the new series.
> > ---
> >  kernel/trace/bpf_trace.c |   22 +-
> >  1 file changed, 9 insertions(+), 13 deletions(-)
> >
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index e51a6ef87167..57b1174030c9 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -2577,7 +2577,7 @@ static int __init bpf_event_init(void)
> >  fs_initcall(bpf_event_init);
> >  #endif /* CONFIG_MODULES */
> >
> > -#if defined(CONFIG_FPROBE) && defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS)
> > +#ifdef CONFIG_FPROBE
> >  struct bpf_kprobe_multi_link {
> > struct bpf_link link;
> > struct fprobe fp;
> > @@ -2600,6 +2600,8 @@ struct user_syms {
> > char *buf;
> >  };
> >
> > +static DEFINE_PER_CPU(struct pt_regs, bpf_kprobe_multi_pt_regs);
> 
> this is a waste if CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST=y, right?
> Can we guard it?

Good catch! Yes, we can guard it.

> 
> 
> > +
> >  static int copy_user_syms(struct user_syms *us, unsigned long __user 
> > *usyms, u32 cnt)
> >  {
> > unsigned long __user usymbol;
> > @@ -2792,13 +2794,14 @@ static u64 bpf_kprobe_multi_entry_ip(struct 
> > bpf_run_ctx *ctx)
> >
> >  static int
> >  kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link,
> > -  unsigned long entry_ip, struct pt_regs *regs)
> > +  unsigned long entry_ip, struct ftrace_regs 
> > *fregs)
> >  {
> > struct bpf_kprobe_multi_run_ctx run_ctx = {
> > .link = link,
> > .entry_ip = entry_ip,
> > };
> > struct bpf_run_ctx *old_run_ctx;
> > +   struct pt_regs *regs;
> > int err;
> >
> > if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
> > @@ -2809,6 +2812,7 @@ kprobe_multi_link_prog_run(struct 
> > bpf_kprobe_multi_link *link,
> >
> > migrate_disable();
> > rcu_read_lock();
> > +   regs = ftrace_partial_regs(fregs, 
> > this_cpu_ptr(&bpf_kprobe_multi_pt_regs));
> 
> and then pass NULL if defined(CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST)?

Indeed.

Thank you!
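
Something like below (untested sketch, the helper name is just illustrative),
so that the per-CPU buffer only exists when ftrace_regs can not simply be
cast to pt_regs, and we pass NULL otherwise:

#ifndef CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST
static DEFINE_PER_CPU(struct pt_regs, bpf_kprobe_multi_pt_regs);
#define bpf_kprobe_multi_pt_regs_ptr()	this_cpu_ptr(&bpf_kprobe_multi_pt_regs)
#else
#define bpf_kprobe_multi_pt_regs_ptr()	NULL
#endif

	...
	regs = ftrace_partial_regs(fregs, bpf_kprobe_multi_pt_regs_ptr());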

> 
> 
> > old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
> > err = bpf_prog_run(link->link.prog, regs);
> > bpf_reset_run_ctx(old_run_ctx);
> > @@ -2826,13 +2830,9 @@ kprobe_multi_link_handler(struct fprobe *fp, 
> > unsigned long fentry_ip,
> >   void *data)
> >  {
> > struct bpf_kprobe_multi_link *link;
> > -   struct pt_regs *regs = ftrace_get_regs(fregs);
> > -
> > -   if (!regs)
> > -   return 0;
> >
> > link = container_of(fp, struct bpf_kprobe_multi_link, fp);
> > -   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs);
> > +   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs);
> > return 0;
> >  }
> >
> > @@ -2842,13 +2842,9 @@ kprobe_multi_link_exit_handler(struct fprobe *fp, 
> > unsigned long fentry_ip,
> >void *data)
> >  {
> > struct bpf_kprobe_multi_link *link;
> > -   struct pt_regs *regs = ftrace_get_regs(fregs);
> > -
> > -   if (!regs)
> > -   return;
> >
> > link = container_of(fp, struct bpf_kprobe_multi_link, fp);
> > -   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs);
> > +   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs);
> >  }
> >
> >  static int symbols_cmp_r(const void *a, const void *b, const void *priv)
> > @@ -3107,7 +3103,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr 
> > *attr, struct bpf_prog *pr
> > kvfree(cookies);
> > return err;
> >  }
> > -#else /* !CONFIG_FPROBE || !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
> > +#else /* !CONFIG_FPROBE */
> >  int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct 
> > bpf_prog *prog)
> >  {
> > return -EOPNOTSUPP;
> >
> >


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v9 36/36] fgraph: Skip recording calltime/rettime if it is not nneeded

2024-04-29 Thread Google
On Thu, 25 Apr 2024 13:15:08 -0700
Andrii Nakryiko  wrote:

> On Mon, Apr 15, 2024 at 6:25 AM Masami Hiramatsu (Google)
>  wrote:
> >
> > From: Masami Hiramatsu (Google) 
> >
> > Skip recording calltime and rettime if the fgraph_ops does not need it.
> > This is a kind of performance optimization for fprobe. Since the fprobe
> > user does not use these entries, recording timestamps in fgraph is just
> > overhead (e.g. eBPF, ftrace). So introduce the skip_timestamp flag; if
> > all fgraph_ops set this flag, skip recording calltime and rettime.
> >
> > Suggested-by: Jiri Olsa 
> > Signed-off-by: Masami Hiramatsu (Google) 
> > ---
> >  Changes in v9:
> >   - Newly added.
> > ---
> >  include/linux/ftrace.h |2 ++
> >  kernel/trace/fgraph.c  |   46 
> > +++---
> >  kernel/trace/fprobe.c  |1 +
> >  3 files changed, 42 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> > index d845a80a3d56..06fc7cbef897 100644
> > --- a/include/linux/ftrace.h
> > +++ b/include/linux/ftrace.h
> > @@ -1156,6 +1156,8 @@ struct fgraph_ops {
> > struct ftrace_ops   ops; /* for the hash lists */
> > void*private;
> > int idx;
> > +   /* If skip_timestamp is true, this does not record timestamps. */
> > +   boolskip_timestamp;
> >  };
> >
> >  void *fgraph_reserve_data(int idx, int size_bytes);
> > diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> > index 7556fbbae323..a5722537bb79 100644
> > --- a/kernel/trace/fgraph.c
> > +++ b/kernel/trace/fgraph.c
> > @@ -131,6 +131,7 @@ DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph);
> >  int ftrace_graph_active;
> >
> >  static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];
> > +static bool fgraph_skip_timestamp;
> >
> >  /* LRU index table for fgraph_array */
> >  static int fgraph_lru_table[FGRAPH_ARRAY_SIZE];
> > @@ -475,7 +476,7 @@ void ftrace_graph_stop(void)
> >  static int
> >  ftrace_push_return_trace(unsigned long ret, unsigned long func,
> >  unsigned long frame_pointer, unsigned long *retp,
> > -int fgraph_idx)
> > +int fgraph_idx, bool skip_ts)
> >  {
> > struct ftrace_ret_stack *ret_stack;
> > unsigned long long calltime;
> > @@ -498,8 +499,12 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> > long func,
> > ret_stack = get_ret_stack(current, current->curr_ret_stack, &index);
> > if (ret_stack && ret_stack->func == func &&
> > get_fgraph_type(current, index + FGRAPH_RET_INDEX) == 
> > FGRAPH_TYPE_BITMAP &&
> > -   !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, 
> > fgraph_idx))
> > +   !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, 
> > fgraph_idx)) {
> > +   /* If previous one skips calltime, update it. */
> > +   if (!skip_ts && !ret_stack->calltime)
> > +   ret_stack->calltime = trace_clock_local();
> > return index + FGRAPH_RET_INDEX;
> > +   }
> >
> > val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | 
> > FGRAPH_RET_INDEX;
> >
> > @@ -517,7 +522,10 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> > long func,
> > return -EBUSY;
> > }
> >
> > -   calltime = trace_clock_local();
> > +   if (skip_ts)
> 
> would it be ok to add likely() here to keep the least-overhead code path 
> linear?

It's not "likely", but hmm, yes as you said. We can keep the least overhead.
OK, let me add likely. 
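
For illustration, that would presumably turn the assignment below into
something like this (sketch only, not the actual follow-up change):

	if (likely(skip_ts))
		calltime = 0LL;
	else
		calltime = trace_clock_local();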

Thank you,

> 
> > +   calltime = 0LL;
> > +   else
> > +   calltime = trace_clock_local();
> >
> > index = READ_ONCE(current->curr_ret_stack);
> > ret_stack = RET_STACK(current, index);
> > @@ -601,7 +609,8 @@ int function_graph_enter_regs(unsigned long ret, 
> > unsigned long func,
> > trace.func = func;
> > trace.depth = ++current->curr_ret_depth;
> >
> > -   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0);
> > +   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0,
> > +fgraph_skip_timestamp);
> > if (index < 0)
> > goto out;
> >
> > @@ -654,7 +663,8 @@ int function_graph_enter_ops(unsigned long ret, 
> > unsigned long func,
> > return -ENODEV;
> >
> > /* Use start for the distance to ret_stack (skipping over reserve) 
> > */
> > -   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 
> > gops->idx);
> > +   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 
> > gops->idx,
> > +gops->skip_timestamp);
> > if (index < 0)
> > return index;
> > type = get_fgraph_type(current, index);
> > @@ -732,6 

Re: Re: WARNING in current_check_refer_path

2024-04-29 Thread Mickaël Salaün
On Mon, Apr 29, 2024 at 05:16:57PM +0800, Ubisectech Sirius wrote:
> > Hello,
> 
> > Thanks for the report.  Could you please provide a reproducer?
> 
> > Regards,
> > Mickaël
> 
> Hi.
>   The PoC file has been sent to you as an attachment.

Indeed, but could you please trim down the file? There are 650 lines,
and most of them are irrelevant.

> 
> > On Sun, Apr 28, 2024 at 10:47:02AM +0800, Ubisectech Sirius wrote:
> >> Hello.
> >> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> >> Recently, our team has discovered an issue in Linux kernel 6.7. Attached to 
> >> the email is a PoC file for the issue.
> >> 
> >> Stack dump:
> >> 
> > > loop3: detected capacity change from 0 to 1024
> > > [ cut here ]
> > > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access 
> > > security/landlock/fs.c:598 [inline]
> > > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access 
> > > security/landlock/fs.c:578 [inline]
> > > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 
> > > current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758
> > > Modules linked in:
> > > CPU: 0 PID: 30368 Comm: syz-executor.3 Not tainted 6.7.0 #2
> > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 
> > > 04/01/2014
> > > RIP: 0010:get_mode_access security/landlock/fs.c:598 [inline]
> > > RIP: 0010:get_mode_access security/landlock/fs.c:578 [inline]
> > > RIP: 0010:current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758
> > > Code: e9 76 fb ff ff 41 bc fe ff ff ff e9 6b fb ff ff e8 00 99 77 fd 90 
> > > 0f 0b 90 41 bc f3 ff ff ff e9 57 fb ff ff e8 ec 98 77 fd 90 <0f> 0b 90 31 
> > > db e9 86 f9 ff ff bb 00 08 00 00 e9 7c f9 ff ff 41 ba
> > > RSP: 0018:c90001fb7ba0 EFLAGS: 00010212
> > > RAX: 0bc5 RBX: 88805feeb7b0 RCX: c90006e15000
> > > RDX: 0004 RSI: 84125d64 RDI: 0003
> > > RBP: 8880123c5608 R08: 0003 R09: c000
> > > R10: f000 R11:  R12: 88805d32fc00
> > > R13: 8880123c5608 R14:  R15: 0001
> > > FS:  7fd70c4d8640() GS:88802c60() 
> > > knlGS:
> > > CS:  0010 DS:  ES:  CR0: 80050033
> > > CR2: 001b2c136000 CR3: 5b2a CR4: 00750ef0
> > > DR0:  DR1:  DR2: 
> > > DR3:  DR6: fffe0ff0 DR7: 0400
> > > PKRU: 5554
> > > Call Trace:
> > >  
> > >  security_path_rename+0x124/0x230 security/security.c:1828
> > >  do_renameat2+0x9f6/0xd30 fs/namei.c:4983
> > >  __do_sys_rename fs/namei.c:5042 [inline]
> > >  __se_sys_rename fs/namei.c:5040 [inline]
> > >  __x64_sys_rename+0x81/0xa0 fs/namei.c:5040
> > >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> > >  do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83
> > >  entry_SYSCALL_64_after_hwframe+0x6f/0x77
> > > RIP: 0033:0x7fd70b6900ed
> > > Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 
> > > f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 
> > > ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> > > RSP: 002b:7fd70c4d8028 EFLAGS: 0246 ORIG_RAX: 0052
> > > RAX: ffda RBX: 7fd70b7cbf80 RCX: 7fd70b6900ed
> >>  RDX:  RSI: 2140 RDI: 2100
> > > RBP: 7fd70b6f14a6 R08:  R09: 
> > > R10:  R11: 0246 R12: 
> > > R13: 000b R14: 7fd70b7cbf80 R15: 7fd70c4b8000
> > >  
> > > 
> > > Thank you for taking the time to read this email and we look forward to 
> > > working with you further.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> > 
> 
> 





Re: [v3] tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()

2024-04-29 Thread Markus Elfring
…
> > it jumps to the label 'out' instead of 'fail' by mistake.In the result,
…
>
> Looks good to me.

* Do you care for a typo in this change description?

* Would you like to read any improved (patch) version descriptions (or 
changelogs)?

Regards,
Markus



Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-29 Thread Google
Hi Andrii,

On Thu, 25 Apr 2024 13:31:53 -0700
Andrii Nakryiko  wrote:

> Hey Masami,
> 
> I can't really review most of that code as I'm completely unfamiliar
> with all those inner workings of fprobe/ftrace/function_graph. I left
> a few comments where there were somewhat more obvious BPF-related
> pieces.
> 
> But I also did run our BPF benchmarks on probes/for-next as a baseline
> and then with your series applied on top. Just to see if there are any
> regressions. I think it will be a useful data point for you.

Thanks for testing!

> 
> You should be already familiar with the bench tool we have in BPF
> selftests (I used it on some other patches for your tree).

Which patches do we need?

> 
> BASELINE
> 
> kprobe :   24.634 ± 0.205M/s
> kprobe-multi   :   28.898 ± 0.531M/s
> kretprobe  :   10.478 ± 0.015M/s
> kretprobe-multi:   11.012 ± 0.063M/s
> 
> THIS PATCH SET ON TOP
> =
> kprobe :   25.144 ± 0.027M/s (+2%)
> kprobe-multi   :   28.909 ± 0.074M/s
> kretprobe  :9.482 ± 0.008M/s (-9.5%)
> kretprobe-multi:   13.688 ± 0.027M/s (+24%)

This looks good. Kretprobe should eventually also use kretprobe-multi
(fprobe), since it is effectively a single-callback version of
kretprobe-multi.

> 
> These numbers are pretty stable and look to be more or less representative.
> 
> As you can see, kprobes got a bit faster, kprobe-multi seems to be
> about the same, though.
> 
> Then (I suppose they are "legacy") kretprobes got quite noticeably
> slower, almost by 10%. Not sure why, but looks real after re-running
> benchmarks a bunch of times and getting stable results.

Hmm, kretprobe on x86 should use ftrace + rethook even with my series,
so nothing should have changed. Maybe the cache access pattern has
changed?
I'll check it with tracefs (to remove the effect of the BPF-related changes).

> 
> On the other hand, multi-kretprobes got significantly faster (+24%!).
> Again, I don't know if it is expected or not, but it's a nice
> improvement.

Thanks!

> 
> If you have any idea why kretprobes would get so much slower, it would
> be nice to look into that and see if you can mitigate the regression
> somehow. Thanks!

OK, let me check it.

Thank you!

> 
> 
> >  51 files changed, 2325 insertions(+), 882 deletions(-)
> >  create mode 100644 
> > tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc
> >
> > --
> > Masami Hiramatsu (Google) 
> >


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v3] tracing/probes: Fix memory leak in traceprobe_parse_probe_arg_body()

2024-04-29 Thread Google
Hi LuMing,

On Sat, 27 Apr 2024 08:23:47 +0100
lumingyindet...@126.com wrote:

> From: LuMingYin 
> 
> If traceprobe_parse_probe_arg_body() failed to allocate 'parg->fmt',
> it jumps to the label 'out' instead of 'fail' by mistake.In the result,
> the buffer 'tmp' is not freed in this case and leaks its memory.
> 
> Thus jump to the label 'fail' in that error case.
> 

Looks good to me.

Acked-by: Masami Hiramatsu (Google) 


Thank you!

> Fixes: 032330abd08b ("tracing/probes: Cleanup probe argument parser")
> Signed-off-by: LuMingYin 
> ---
>  kernel/trace/trace_probe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
> index c09fa6fc636e..81c319b92038 100644
> --- a/kernel/trace/trace_probe.c
> +++ b/kernel/trace/trace_probe.c
> @@ -1466,7 +1466,7 @@ static int traceprobe_parse_probe_arg_body(const char 
> *argv, ssize_t *size,
>   parg->fmt = kmalloc(len, GFP_KERNEL);
>   if (!parg->fmt) {
>   ret = -ENOMEM;
> - goto out;
> + goto fail;
>   }
>   snprintf(parg->fmt, len, "%s[%d]", parg->type->fmttype,
>parg->count);
> -- 
> 2.25.1
> 
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

2024-04-29 Thread Jarkko Sakkinen
On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> Two enclave threads may try to add and remove the same enclave page
> simultaneously (e.g., if the SGX runtime supports both lazy allocation
> and `MADV_DONTNEED` semantics). Consider this race:
>
> 1. T1 performs page removal in sgx_encl_remove_pages() and stops right
>after removing the page table entry and right before re-acquiring the
>enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
> 2. T2 tries to access the page, and #PF[not_present] is raised. The
>condition to EAUG in sgx_vma_fault() is not satisfied because the
>page is still present in encl->page_array, thus the SGX driver
>assumes that the fault happened because the page was swapped out. The
>driver continues on a code path that installs a page table entry
>*without* performing EAUG.
> 3. The enclave page metadata is in inconsistent state: the PTE is
>installed but there was no EAUG. Thus, T2 in userspace infinitely
>receives SIGSEGV on this page (and EACCEPT always fails).
>
> Fix this by making sure that T1 (the page-removing thread) always wins
> this data race. In particular, the page-being-removed is marked as such,
> and T2 retries until the page is fully removed.
>
> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Dmitrii Kuvaiskii 
> ---
>  arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
>  arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
>  arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
>  3 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 41f14b1a3025..7ccd8b2fce5f 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct 
> sgx_encl *encl,
>  
>   /* Entry successfully located. */
>   if (entry->epc_page) {
> - if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> + if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
> +SGX_ENCL_PAGE_BEING_REMOVED))
>   return ERR_PTR(-EBUSY);
>  
>   return entry;
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index f94ff14c9486..fff5f2293ae7 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -25,6 +25,9 @@
>  /* 'desc' bit marking that the page is being reclaimed. */
>  #define SGX_ENCL_PAGE_BEING_RECLAIMEDBIT(3)
>  
> +/* 'desc' bit marking that the page is being removed. */
> +#define SGX_ENCL_PAGE_BEING_REMOVED  BIT(2)
> +
>  struct sgx_encl_page {
>   unsigned long desc;
>   unsigned long vm_max_prot_bits:8;
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index b65ab214bdf5..c542d4dd3e64 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>* Do not keep encl->lock because of dependency on
>* mmap_lock acquired in sgx_zap_enclave_ptes().
>*/
> + entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
> + mutex_unlock(&encl->lock);
>  
>   sgx_zap_enclave_ptes(encl, addr);

It is somewhat trivial to NAK this, as the commit message makes
no effort to describe the new flag. By default, at least, I have
a strong opposition to any new flags related to reclaiming, even
if avoiding them needs a bit of extra synchronization work in
user space.

One way to describe concurrency scenarios would be to take an
example from https://www.kernel.org/doc/Documentation/memory-barriers.txt

I.e. see the examples with CPU 1 and CPU 2.
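
For illustration, the race from the commit message could be laid out in
that style roughly as follows (sketch only, derived from the description
above):

	CPU 1 (removal)                      CPU 2 (fault)
	==================================   ==================================
	sgx_encl_remove_pages()
	  zap PTE, drop encl->lock
	                                     #PF on the page
	                                     sgx_vma_fault():
	                                       page still in encl->page_array,
	                                       so install PTE without EAUG
	  re-acquire encl->lock
	  EREMOVE + xa_erase() the page
	                                     userspace: EACCEPT fails,
	                                     SIGSEGV on every access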

BR, Jarkko



Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-04-29 Thread Jarkko Sakkinen
On Mon Apr 29, 2024 at 4:22 PM EEST, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 4:04 PM EEST, Jarkko Sakkinen wrote:
> > > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
> > > fault handler so that no signal is sent to userspace, and (2) by
> > > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
> > > EREMOVE is performed.
> >
> > What is the collateral damage caused by ENCLS[EREMOVE]?
>
> Have you measured the cost of EREMOVE on an empty page?
>
> I tried to look up a thread on lore because I have a faint memory
> that it was concluded that its cost is irrelevant. Please correct me
> if I'm wrong.

Also, the pseudocode for EREMOVE supports this, as it just returns
without actually doing anything.

BR, Jarkko



Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-04-29 Thread Jarkko Sakkinen
On Mon Apr 29, 2024 at 4:04 PM EEST, Jarkko Sakkinen wrote:
> > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
> > fault handler so that no signal is sent to userspace, and (2) by
> > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
> > EREMOVE is performed.
>
> What is the collateral damage caused by ENCLS[EREMOVE]?

Have you measured the cost of EREMOVE on an empty page?

I tried to look up a thread on lore because I have a faint memory
that it was concluded that its cost is irrelevant. Please correct me
if I'm wrong.

BR, Jarkko



Re: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

2024-04-29 Thread Jarkko Sakkinen
On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
> enclave pages and may support MADV_DONTNEED semantics [1]. The former
> implies #PF-based page allocation, and the latter implies the usage of
> SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.
>
> A trivial program like below (run under Gramine and with EDMM enabled)
> stresses these two flows in the SGX driver and hangs:
>
> /* repeatedly touch different enclave pages at random and mix with
>  * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
> static void* thread_func(void* arg) {
> size_t num_pages = 0xA000 / page_size;
> for (int i = 0; i < 5000; i++) {
> size_t page = get_random_ulong() % num_pages;
> char data = READ_ONCE(((char*)arg)[page * page_size]);
>
> page = get_random_ulong() % num_pages;
> madvise(arg + page * page_size, page_size, MADV_DONTNEED);
> }
> }
>
> addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
> pthread_t threads[16];
> for (int i = 0; i < 16; i++)
> pthread_create(&threads[i], NULL, thread_func, addr);

I'm not convinced that the kernel is the problem here; it could also be
how Gramine is implemented.

So maybe you could make a better case for that. The example looks a bit
artificial to me.

>
> This program uncovers two data races in the SGX driver. The remaining
> patches describe and fix these races.
>
> I performed several stress tests to verify that there are no other data
> races (at least with the test program above):
>
> - On Icelake server with 128GB of PRMRR (EPC), without madvise(). This
>   stresses the first data race. A Gramine SGX test suite running in the
>   background for additional stressing. Result: 1,000 runs without hangs
>   (result without the first bug fix: hangs every time).
> - On Icelake server with 128GB of PRMRR (EPC), with madvise(). This
>   stresses the second data race. A Gramine SGX test suite running in the
>   background for additional stressing. Result: 1,000 runs without hangs
>   (result with the first bug fix but without the second bug fix: hangs
>   approx. once in 50 runs).
> - On Icelake server with 4GB of PRMRR (EPC), with madvise(). This
>   additionally stresses the enclave page swapping flows. Two Gramine SGX
>   test suites running in the background for additional stressing of
>   swapping (I observe 100% CPU utilization from ksgxd which confirms that
>   swapping happens). Result: 1,000 runs without hangs.
>
> (Sorry for the previous copy of this email, accidentally sent to
> sta...@vger.kernel.org. Failed to use `--suppress-cc` during a test send.)
>
> Dmitrii Kuvaiskii (2):
>   x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
>   x86/sgx: Resolve EREMOVE page vs EAUG page data race
>
>  arch/x86/kernel/cpu/sgx/encl.c  | 10 +++---
>  arch/x86/kernel/cpu/sgx/encl.h  |  3 +++
>  arch/x86/kernel/cpu/sgx/ioctl.c |  1 +
>  3 files changed, 11 insertions(+), 3 deletions(-)

BR, Jarkko



Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-04-29 Thread Jarkko Sakkinen
On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> Two enclave threads may try to access the same non-present enclave page
> simultaneously (e.g., if the SGX runtime supports lazy allocation). The
> threads will end up in sgx_encl_eaug_page(), racing to acquire the
> enclave lock. The winning thread will perform EAUG, set up the page
> table entry, and insert the page into encl->page_array. The losing
> thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
> to error handling path.

And that path removes the page. Not sure I got the gist of this, tbh.
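
For reference, the sequence being described appears to be roughly the
following (illustrative sketch, based only on the commit message above):

	Thread A (winner)                    Thread B (loser)
	==================================   ==================================
	#PF, sgx_encl_eaug_page()            #PF on the same page
	take encl->lock                      wait on encl->lock
	EAUG, install PTE,
	xa_insert() the page
	drop encl->lock                      take encl->lock
	                                     xa_insert() -> -EBUSY
	                                     error path: sgx_encl_free_epc_page()
	                                     does EREMOVE, SIGBUS to userspace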

> This error handling path contains two bugs: (1) SIGBUS is sent to
> userspace even though the enclave page is correctly installed by another
> thread, and (2) sgx_encl_free_epc_page() is called that performs EREMOVE
> even though the enclave page was never intended to be removed. The first
> bug is less severe because it impacts only the user space; the second
> bug is more severe because it also impacts the OS state by ripping the
> page (added by the winning thread) from the enclave.
>
> Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
> fault handler so that no signal is sent to userspace, and (2) by
> replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
> EREMOVE is performed.

What is the collateral damage caused by ENCLS[EREMOVE]?

>
> Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized 
> enclave")
> Cc: sta...@vger.kernel.org
> Reported-by: Marcelina Kościelnicka 
> Suggested-by: Reinette Chatre 
> Signed-off-by: Dmitrii Kuvaiskii 
> ---
>  arch/x86/kernel/cpu/sgx/encl.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 279148e72459..41f14b1a3025 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct 
> vm_area_struct *vma,
>* If ret == -EBUSY then page was created in another flow while
>* running without encl->lock
>*/
> - if (ret)
> + if (ret) {
> + if (ret == -EBUSY)
> + vmret = VM_FAULT_NOPAGE;
>   goto err_out_shrink;
> + }
>  
>   pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
>   pginfo.addr = encl_page->desc & PAGE_MASK;
> @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct 
> vm_area_struct *vma,
>  err_out_shrink:
>   sgx_encl_shrink(encl, va_page);
>  err_out_epc:
> - sgx_encl_free_epc_page(epc_page);
> + sgx_free_epc_page(epc_page);

This ignores the check for the page being reclaimer-tracked, i.e. it
makes a change that is not mentioned in the commit message.

>  err_out_unlock:
>   mutex_unlock(&encl->lock);
>   kfree(encl_page);


BR, Jarkko



[PATCH v7 16/16] bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

BPF just-in-time compiler depended on CONFIG_MODULES because it used
module_alloc() to allocate memory for the generated code.

Since code allocations are now implemented with execmem, drop dependency of
CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM.
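
With that in place, the JIT's allocation helpers presumably reduce to
something like the following (sketch only; the actual call sites in
kernel/bpf/core.c were converted earlier in the series):

void *bpf_jit_alloc_exec(unsigned long size)
{
        return execmem_alloc(EXECMEM_BPF, size);
}

void bpf_jit_free_exec(void *addr)
{
        execmem_free(addr);
}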

Suggested-by: Björn Töpel 
Signed-off-by: Mike Rapoport (IBM) 
---
 kernel/bpf/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index bc25f5098a25..f999e4e0b344 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -43,7 +43,7 @@ config BPF_JIT
bool "Enable BPF Just In Time compiler"
depends on BPF
depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
-   depends on MODULES
+   select EXECMEM
help
  BPF programs are normally handled by a BPF interpreter. This option
  allows the kernel to generate native code when a program is loaded
-- 
2.43.0




[PATCH v7 15/16] kprobes: remove dependency on CONFIG_MODULES

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

kprobes depended on CONFIG_MODULES because it had to allocate memory for
code.

Since code allocations are now implemented with execmem, kprobes can be
enabled in non-modular kernels.

Add #ifdef CONFIG_MODULE guards for the code dealing with kprobes inside
modules, make CONFIG_KPROBES select CONFIG_EXECMEM and drop the
dependency of CONFIG_KPROBES on CONFIG_MODULES.

Signed-off-by: Mike Rapoport (IBM) 
Acked-by: Masami Hiramatsu (Google) 
---
 arch/Kconfig|  2 +-
 include/linux/module.h  |  9 ++
 kernel/kprobes.c| 55 +++--
 kernel/trace/trace_kprobe.c | 20 +-
 4 files changed, 63 insertions(+), 23 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4fd0daa54e6c..caa459964f09 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -52,9 +52,9 @@ config GENERIC_ENTRY
 
 config KPROBES
bool "Kprobes"
-   depends on MODULES
depends on HAVE_KPROBES
select KALLSYMS
+   select EXECMEM
select TASKS_RCU if PREEMPTION
help
  Kprobes allows you to trap at almost any kernel address and
diff --git a/include/linux/module.h b/include/linux/module.h
index 1153b0d99a80..ffa1c603163c 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -605,6 +605,11 @@ static inline bool module_is_live(struct module *mod)
return mod->state != MODULE_STATE_GOING;
 }
 
+static inline bool module_is_coming(struct module *mod)
+{
+return mod->state == MODULE_STATE_COMING;
+}
+
 struct module *__module_text_address(unsigned long addr);
 struct module *__module_address(unsigned long addr);
 bool is_module_address(unsigned long addr);
@@ -857,6 +862,10 @@ void *dereference_module_function_descriptor(struct module 
*mod, void *ptr)
return ptr;
 }
 
+static inline bool module_is_coming(struct module *mod)
+{
+   return false;
+}
 #endif /* CONFIG_MODULES */
 
 #ifdef CONFIG_SYSFS
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index ddd7cdc16edf..ca2c6cbd42d2 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -1588,7 +1588,7 @@ static int check_kprobe_address_safe(struct kprobe *p,
}
 
/* Get module refcount and reject __init functions for loaded modules. 
*/
-   if (*probed_mod) {
+   if (IS_ENABLED(CONFIG_MODULES) && *probed_mod) {
/*
 * We must hold a refcount of the probed module while updating
 * its code to prohibit unexpected unloading.
@@ -1603,12 +1603,13 @@ static int check_kprobe_address_safe(struct kprobe *p,
 * kprobes in there.
 */
if (within_module_init((unsigned long)p->addr, *probed_mod) &&
-   (*probed_mod)->state != MODULE_STATE_COMING) {
+   !module_is_coming(*probed_mod)) {
module_put(*probed_mod);
*probed_mod = NULL;
ret = -ENOENT;
}
}
+
 out:
preempt_enable();
jump_label_unlock();
@@ -2488,24 +2489,6 @@ int kprobe_add_area_blacklist(unsigned long start, 
unsigned long end)
return 0;
 }
 
-/* Remove all symbols in given area from kprobe blacklist */
-static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
-{
-   struct kprobe_blacklist_entry *ent, *n;
-
-   list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
-   if (ent->start_addr < start || ent->start_addr >= end)
-   continue;
-   list_del(&ent->list);
-   kfree(ent);
-   }
-}
-
-static void kprobe_remove_ksym_blacklist(unsigned long entry)
-{
-   kprobe_remove_area_blacklist(entry, entry + 1);
-}
-
 int __weak arch_kprobe_get_kallsym(unsigned int *symnum, unsigned long *value,
   char *type, char *sym)
 {
@@ -2570,6 +2553,25 @@ static int __init populate_kprobe_blacklist(unsigned 
long *start,
return ret ? : arch_populate_kprobe_blacklist();
 }
 
+#ifdef CONFIG_MODULES
+/* Remove all symbols in given area from kprobe blacklist */
+static void kprobe_remove_area_blacklist(unsigned long start, unsigned long 
end)
+{
+   struct kprobe_blacklist_entry *ent, *n;
+
+   list_for_each_entry_safe(ent, n, &kprobe_blacklist, list) {
+   if (ent->start_addr < start || ent->start_addr >= end)
+   continue;
+   list_del(&ent->list);
+   kfree(ent);
+   }
+}
+
+static void kprobe_remove_ksym_blacklist(unsigned long entry)
+{
+   kprobe_remove_area_blacklist(entry, entry + 1);
+}
+
 static void add_module_kprobe_blacklist(struct module *mod)
 {
unsigned long start, end;
@@ -2672,6 +2674,17 @@ static struct notifier_block kprobe_module_nb = {
.priority = 0
 };
 
+static int kprobe_register_module_notifier(void)
+{
+   return register_module_notifier(&kprobe_module_nb);
+}
+#else
+static int 

[PATCH v7 14/16] powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

There are places where CONFIG_MODULES guards the code that depends on
memory allocation being done with module_alloc().

Replace CONFIG_MODULES with CONFIG_EXECMEM in such places.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/powerpc/Kconfig | 2 +-
 arch/powerpc/include/asm/kasan.h | 2 +-
 arch/powerpc/kernel/head_8xx.S   | 4 ++--
 arch/powerpc/kernel/head_book3s_32.S | 6 +++---
 arch/powerpc/lib/code-patching.c | 2 +-
 arch/powerpc/mm/book3s32/mmu.c   | 2 +-
 6 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..2e586733a464 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -285,7 +285,7 @@ config PPC
select IOMMU_HELPER if PPC64
select IRQ_DOMAIN
select IRQ_FORCED_THREADING
-   select KASAN_VMALLOCif KASAN && MODULES
+   select KASAN_VMALLOCif KASAN && EXECMEM
select LOCK_MM_AND_FIND_VMA
select MMU_GATHER_PAGE_SIZE
select MMU_GATHER_RCU_TABLE_FREE
diff --git a/arch/powerpc/include/asm/kasan.h b/arch/powerpc/include/asm/kasan.h
index 365d2720097c..b5bbb94c51f6 100644
--- a/arch/powerpc/include/asm/kasan.h
+++ b/arch/powerpc/include/asm/kasan.h
@@ -19,7 +19,7 @@
 
 #define KASAN_SHADOW_SCALE_SHIFT   3
 
-#if defined(CONFIG_MODULES) && defined(CONFIG_PPC32)
+#if defined(CONFIG_EXECMEM) && defined(CONFIG_PPC32)
 #define KASAN_KERN_START   ALIGN_DOWN(PAGE_OFFSET - SZ_256M, SZ_256M)
 #else
 #define KASAN_KERN_START   PAGE_OFFSET
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 647b0b445e89..edc479a7c2bc 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -199,12 +199,12 @@ instruction_counter:
mfspr   r10, SPRN_SRR0  /* Get effective address of fault */
INVALIDATE_ADJACENT_PAGES_CPU15(r10, r11)
mtspr   SPRN_MD_EPN, r10
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
mfcr    r11
compare_to_kernel_boundary r10, r10
 #endif
mfspr   r10, SPRN_M_TWB /* Get level 1 table */
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
blt+    3f
rlwinm  r10, r10, 0, 20, 31
orisr10, r10, (swapper_pg_dir - PAGE_OFFSET)@ha
diff --git a/arch/powerpc/kernel/head_book3s_32.S 
b/arch/powerpc/kernel/head_book3s_32.S
index c1d89764dd22..57196883a00e 100644
--- a/arch/powerpc/kernel/head_book3s_32.S
+++ b/arch/powerpc/kernel/head_book3s_32.S
@@ -419,14 +419,14 @@ InstructionTLBMiss:
  */
/* Get PTE (linux-style) and check access */
mfspr   r3,SPRN_IMISS
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
lis r1, TASK_SIZE@h /* check if kernel address */
cmplw   0,r1,r3
 #endif
mfspr   r2, SPRN_SDR1
li  r1,_PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_EXEC
rlwinm  r2, r2, 28, 0xf000
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
li  r0, 3
bgt-    112f
lis r2, (swapper_pg_dir - PAGE_OFFSET)@ha   /* if kernel address, 
use */
@@ -442,7 +442,7 @@ InstructionTLBMiss:
andc.   r1,r1,r2        /* check access & ~permission */
bne-InstructionAddressInvalid /* return if access not permitted */
/* Convert linux-style PTE to low word of PPC-style PTE */
-#ifdef CONFIG_MODULES
+#ifdef CONFIG_EXECMEM
rlwimi  r2, r0, 0, 31, 31   /* userspace ? -> PP lsb */
 #endif
ori r1, r1, 0xe06   /* clear out reserved bits */
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index c6ab46156cda..7af791446ddf 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -225,7 +225,7 @@ void __init poking_init(void)
 
 static unsigned long get_patch_pfn(void *addr)
 {
-   if (IS_ENABLED(CONFIG_MODULES) && is_vmalloc_or_module_addr(addr))
+   if (IS_ENABLED(CONFIG_EXECMEM) && is_vmalloc_or_module_addr(addr))
return vmalloc_to_pfn(addr);
else
return __pa_symbol(addr) >> PAGE_SHIFT;
diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
index 100f999871bc..625fe7d08e06 100644
--- a/arch/powerpc/mm/book3s32/mmu.c
+++ b/arch/powerpc/mm/book3s32/mmu.c
@@ -184,7 +184,7 @@ unsigned long __init mmu_mapin_ram(unsigned long base, 
unsigned long top)
 
 static bool is_module_segment(unsigned long addr)
 {
-   if (!IS_ENABLED(CONFIG_MODULES))
+   if (!IS_ENABLED(CONFIG_EXECMEM))
return false;
if (addr < ALIGN_DOWN(MODULES_VADDR, SZ_256M))
return false;
-- 
2.43.0




[PATCH v7 13/16] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Dynamic ftrace must allocate memory for code and this was impossible
without CONFIG_MODULES.

With execmem separated from the modules code, execmem_text_alloc() is
available regardless of CONFIG_MODULES.

Remove dependency of dynamic ftrace on CONFIG_MODULES and make
CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/x86/Kconfig |  1 +
 arch/x86/kernel/ftrace.c | 10 --
 2 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4474bf32d0a4..f2917ccf4fb4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
+   select EXECMEM if DYNAMIC_FTRACE
 
 config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index c8ddb7abda7c..8da0e66ca22d 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -261,8 +261,6 @@ void arch_ftrace_update_code(int command)
 /* Currently only x86_64 supports dynamic trampolines */
 #ifdef CONFIG_X86_64
 
-#ifdef CONFIG_MODULES
-/* Module allocation simplifies allocating memory for code */
 static inline void *alloc_tramp(unsigned long size)
 {
return execmem_alloc(EXECMEM_FTRACE, size);
@@ -271,14 +269,6 @@ static inline void tramp_free(void *tramp)
 {
execmem_free(tramp);
 }
-#else
-/* Trampolines can only be created if modules are supported */
-static inline void *alloc_tramp(unsigned long size)
-{
-   return NULL;
-}
-static inline void tramp_free(void *tramp) { }
-#endif
 
 /* Defined as markers to the end of the ftrace default trampolines */
 extern void ftrace_regs_caller_end(void);
-- 
2.43.0




[PATCH v7 12/16] arch: make execmem setup available regardless of CONFIG_MODULES

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

execmem does not depend on modules; on the contrary, modules use
execmem.

To make execmem available when CONFIG_MODULES=n, for instance for
kprobes, split execmem_params initialization out from
arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y

Signed-off-by: Mike Rapoport (IBM) 
Reviewed-by: Philippe Mathieu-Daudé 
---
 arch/arm/kernel/module.c   |  43 --
 arch/arm/mm/init.c |  45 +++
 arch/arm64/kernel/module.c | 140 -
 arch/arm64/mm/init.c   | 140 +
 arch/loongarch/kernel/module.c |  19 -
 arch/loongarch/mm/init.c   |  21 +
 arch/mips/kernel/module.c  |  22 --
 arch/mips/mm/init.c|  23 ++
 arch/nios2/kernel/module.c |  20 -
 arch/nios2/mm/init.c   |  21 +
 arch/parisc/kernel/module.c|  20 -
 arch/parisc/mm/init.c  |  23 +-
 arch/powerpc/kernel/module.c   |  63 ---
 arch/powerpc/mm/mem.c  |  64 +++
 arch/riscv/kernel/module.c |  34 
 arch/riscv/mm/init.c   |  35 +
 arch/s390/kernel/module.c  |  27 ---
 arch/s390/mm/init.c|  30 +++
 arch/sparc/kernel/module.c |  19 -
 arch/sparc/mm/Makefile |   2 +
 arch/sparc/mm/execmem.c|  21 +
 arch/x86/kernel/module.c   |  27 ---
 arch/x86/mm/init.c |  29 +++
 23 files changed, 453 insertions(+), 435 deletions(-)
 create mode 100644 arch/sparc/mm/execmem.c

diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index a98fdf6ff26c..677f218f7e84 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -12,57 +12,14 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
-#include 
-#include 
 
 #include 
 #include 
 #include 
 #include 
 
-#ifdef CONFIG_XIP_KERNEL
-/*
- * The XIP kernel text is mapped in the module area for modules and
- * some other stuff to work without any indirect relocations.
- * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
- * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
- */
-#undef MODULES_VADDR
-#define MODULES_VADDR  (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
-#endif
-
-#ifdef CONFIG_MMU
-static struct execmem_info execmem_info __ro_after_init;
-
-struct execmem_info __init *execmem_arch_setup(void)
-{
-   unsigned long fallback_start = 0, fallback_end = 0;
-
-   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
-   fallback_start = VMALLOC_START;
-   fallback_end = VMALLOC_END;
-   }
-
-   execmem_info = (struct execmem_info){
-   .ranges = {
-   [EXECMEM_DEFAULT] = {
-   .start  = MODULES_VADDR,
-   .end= MODULES_END,
-   .pgprot = PAGE_KERNEL_EXEC,
-   .alignment = 1,
-   .fallback_start = fallback_start,
-   .fallback_end   = fallback_end,
-   },
-   },
-   };
-
-   return &execmem_info;
-}
-#endif
-
 bool module_init_section(const char *name)
 {
return strstarts(name, ".init") ||
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index e8c6f4be0ce1..5345d218899a 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -486,3 +487,47 @@ void free_initrd_mem(unsigned long start, unsigned long 
end)
free_reserved_area((void *)start, (void *)end, -1, "initrd");
 }
 #endif
+
+#ifdef CONFIG_EXECMEM
+
+#ifdef CONFIG_XIP_KERNEL
+/*
+ * The XIP kernel text is mapped in the module area for modules and
+ * some other stuff to work without any indirect relocations.
+ * MODULES_VADDR is redefined here and not in asm/memory.h to avoid
+ * recompiling the whole kernel when CONFIG_XIP_KERNEL is turned on/off.
+ */
+#undef MODULES_VADDR
+#define MODULES_VADDR  (((unsigned long)_exiprom + ~PMD_MASK) & PMD_MASK)
+#endif
+
+#ifdef CONFIG_MMU
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
+{
+   unsigned long fallback_start = 0, fallback_end = 0;
+
+   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+   fallback_start = VMALLOC_START;
+   fallback_end = VMALLOC_END;
+   }
+
+   execmem_info = (struct execmem_info){
+   .ranges = {
+   [EXECMEM_DEFAULT] = {
+   .start  = MODULES_VADDR,
+   .end= MODULES_END,
+   .pgprot = PAGE_KERNEL_EXEC,
+   .alignment = 1,
+   .fallback_start = fallback_start,
+   .fallback_end   = fallback_end,
+  

[PATCH v7 11/16] powerpc: extend execmem_params for kprobes allocations

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

powerpc overrides kprobes::alloc_insn_page() to remove writable
permissions when STRICT_MODULE_RWX is on.

Add definition of EXECMEM_KRPOBES to execmem_params to allow using the
generic kprobes::alloc_insn_page() with the desired permissions.

As powerpc uses breakpoint instructions to inject kprobes, it does not
need to constrain kprobe allocations to the modules area and can use the
entire vmalloc address space.
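
In other words, powerpc can presumably now rely on the generic allocator,
roughly (sketch only; the read-only-execute protection comes from the
EXECMEM_KPROBES pgprot defined in the hunk below, not from an arch
override):

void *alloc_insn_page(void)
{
        return execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
}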

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/powerpc/kernel/kprobes.c | 20 
 arch/powerpc/kernel/module.c  |  7 +++
 2 files changed, 7 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 9fcd01bb2ce6..14c5ddec3056 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -126,26 +126,6 @@ kprobe_opcode_t *arch_adjust_kprobe_addr(unsigned long 
addr, unsigned long offse
return (kprobe_opcode_t *)(addr + offset);
 }
 
-void *alloc_insn_page(void)
-{
-   void *page;
-
-   page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
-   if (!page)
-   return NULL;
-
-   if (strict_module_rwx_enabled()) {
-   int err = set_memory_rox((unsigned long)page, 1);
-
-   if (err)
-   goto error;
-   }
-   return page;
-error:
-   execmem_free(page);
-   return NULL;
-}
-
 int arch_prepare_kprobe(struct kprobe *p)
 {
int ret = 0;
diff --git a/arch/powerpc/kernel/module.c b/arch/powerpc/kernel/module.c
index ac80559015a3..2a23cf7e141b 100644
--- a/arch/powerpc/kernel/module.c
+++ b/arch/powerpc/kernel/module.c
@@ -94,6 +94,7 @@ static struct execmem_info execmem_info __ro_after_init;
 
 struct execmem_info __init *execmem_arch_setup(void)
 {
+   pgprot_t kprobes_prot = strict_module_rwx_enabled() ? PAGE_KERNEL_ROX : 
PAGE_KERNEL_EXEC;
pgprot_t prot = strict_module_rwx_enabled() ? PAGE_KERNEL : 
PAGE_KERNEL_EXEC;
unsigned long fallback_start = 0, fallback_end = 0;
unsigned long start, end;
@@ -132,6 +133,12 @@ struct execmem_info __init *execmem_arch_setup(void)
.fallback_start = fallback_start,
.fallback_end   = fallback_end,
},
+   [EXECMEM_KPROBES] = {
+   .start  = VMALLOC_START,
+   .end= VMALLOC_END,
+   .pgprot = kprobes_prot,
+   .alignment = 1,
+   },
[EXECMEM_MODULE_DATA] = {
.start  = VMALLOC_START,
.end= VMALLOC_END,
-- 
2.43.0




[PATCH v7 10/16] arm64: extend execmem_info for generated code allocations

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

The memory allocations for kprobes and BPF on arm64 can be placed
anywhere in vmalloc address space and currently this is implemented with
overrides of alloc_insn_page() and bpf_jit_alloc_exec() in arm64.

Define EXECMEM_KPROBES and EXECMEM_BPF ranges in arm64::execmem_info and
drop overrides of alloc_insn_page() and bpf_jit_alloc_exec().

Signed-off-by: Mike Rapoport (IBM) 
Acked-by: Will Deacon 
---
 arch/arm64/kernel/module.c | 12 
 arch/arm64/kernel/probes/kprobes.c |  7 ---
 arch/arm64/net/bpf_jit_comp.c  | 11 ---
 3 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index b7a7a23f9f8f..a52240ea084b 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -146,6 +146,18 @@ struct execmem_info __init *execmem_arch_setup(void)
.fallback_start = fallback_start,
.fallback_end   = fallback_end,
},
+   [EXECMEM_KPROBES] = {
+   .start  = VMALLOC_START,
+   .end= VMALLOC_END,
+   .pgprot = PAGE_KERNEL_ROX,
+   .alignment = 1,
+   },
+   [EXECMEM_BPF] = {
+   .start  = VMALLOC_START,
+   .end= VMALLOC_END,
+   .pgprot = PAGE_KERNEL,
+   .alignment = 1,
+   },
},
};
 
diff --git a/arch/arm64/kernel/probes/kprobes.c 
b/arch/arm64/kernel/probes/kprobes.c
index 327855a11df2..4268678d0e86 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -129,13 +129,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
return 0;
 }
 
-void *alloc_insn_page(void)
-{
-   return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
-   GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
-   NUMA_NO_NODE, __builtin_return_address(0));
-}
-
 /* arm kprobe: install breakpoint in text */
 void __kprobes arch_arm_kprobe(struct kprobe *p)
 {
diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index 122021f9bdfc..456f5af239fc 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -1793,17 +1793,6 @@ u64 bpf_jit_alloc_exec_limit(void)
return VMALLOC_END - VMALLOC_START;
 }
 
-void *bpf_jit_alloc_exec(unsigned long size)
-{
-   /* Memory is intended to be executable, reset the pointer tag. */
-   return kasan_reset_tag(vmalloc(size));
-}
-
-void bpf_jit_free_exec(void *addr)
-{
-   return vfree(addr);
-}
-
 /* Indicate the JIT backend supports mixing bpf2bpf and tailcalls. */
 bool bpf_jit_supports_subprog_tailcalls(void)
 {
-- 
2.43.0




[PATCH v7 09/16] riscv: extend execmem_params for generated code allocations

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

The memory allocations for kprobes and BPF on RISC-V are not placed in
the modules area and these custom allocations are implemented with
overrides of alloc_insn_page() and  bpf_jit_alloc_exec().

Define MODULES_VADDR and MODULES_END as VMALLOC_START and VMALLOC_END for
32 bit and slightly reorder execmem_params initialization to support both
32 and 64 bit variants, define EXECMEM_KPROBES and EXECMEM_BPF ranges in
riscv::execmem_params and drop overrides of alloc_insn_page() and
bpf_jit_alloc_exec().

Signed-off-by: Mike Rapoport (IBM) 
Reviewed-by: Alexandre Ghiti 
---
 arch/riscv/include/asm/pgtable.h   |  3 +++
 arch/riscv/kernel/module.c | 14 +-
 arch/riscv/kernel/probes/kprobes.c | 10 --
 arch/riscv/net/bpf_jit_core.c  | 13 -
 4 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 9f8ea0e33eb1..5f21814e438e 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -55,6 +55,9 @@
 #define MODULES_LOWEST_VADDR   (KERNEL_LINK_ADDR - SZ_2G)
 #define MODULES_VADDR  (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
 #define MODULES_END(PFN_ALIGN((unsigned long)&_start))
+#else
+#define MODULES_VADDR  VMALLOC_START
+#define MODULES_ENDVMALLOC_END
 #endif
 
 /*
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index 182904127ba0..0e6415f00fca 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -906,7 +906,7 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char 
*strtab,
return 0;
 }
 
-#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+#ifdef CONFIG_MMU
 static struct execmem_info execmem_info __ro_after_init;
 
 struct execmem_info __init *execmem_arch_setup(void)
@@ -919,6 +919,18 @@ struct execmem_info __init *execmem_arch_setup(void)
.pgprot = PAGE_KERNEL,
.alignment = 1,
},
+   [EXECMEM_KPROBES] = {
+   .start  = VMALLOC_START,
+   .end= VMALLOC_END,
+   .pgprot = PAGE_KERNEL_READ_EXEC,
+   .alignment = 1,
+   },
+   [EXECMEM_BPF] = {
+   .start  = BPF_JIT_REGION_START,
+   .end= BPF_JIT_REGION_END,
+   .pgprot = PAGE_KERNEL,
+   .alignment = PAGE_SIZE,
+   },
},
};
 
diff --git a/arch/riscv/kernel/probes/kprobes.c 
b/arch/riscv/kernel/probes/kprobes.c
index 2f08c14a933d..e64f2f3064eb 100644
--- a/arch/riscv/kernel/probes/kprobes.c
+++ b/arch/riscv/kernel/probes/kprobes.c
@@ -104,16 +104,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
return 0;
 }
 
-#ifdef CONFIG_MMU
-void *alloc_insn_page(void)
-{
-   return  __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
-GFP_KERNEL, PAGE_KERNEL_READ_EXEC,
-VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-__builtin_return_address(0));
-}
-#endif
-
 /* install breakpoint in text */
 void __kprobes arch_arm_kprobe(struct kprobe *p)
 {
diff --git a/arch/riscv/net/bpf_jit_core.c b/arch/riscv/net/bpf_jit_core.c
index 6b3acac30c06..e238fdbd5dbc 100644
--- a/arch/riscv/net/bpf_jit_core.c
+++ b/arch/riscv/net/bpf_jit_core.c
@@ -219,19 +219,6 @@ u64 bpf_jit_alloc_exec_limit(void)
return BPF_JIT_REGION_SIZE;
 }
 
-void *bpf_jit_alloc_exec(unsigned long size)
-{
-   return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
-   BPF_JIT_REGION_END, GFP_KERNEL,
-   PAGE_KERNEL, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
-}
-
-void bpf_jit_free_exec(void *addr)
-{
-   return vfree(addr);
-}
-
 void *bpf_arch_text_copy(void *dst, void *src, size_t len)
 {
int ret;
-- 
2.43.0




[PATCH v7 08/16] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Extend execmem parameters to accommodate more complex overrides of
module_alloc() by architectures.

This includes specification of a fallback range required by arm, arm64
and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for
allocation of KASAN shadow required by s390 and x86 and support for
late initialization of execmem required by arm64.

The core implementation of execmem_alloc() takes care of suppressing
warnings when the initial allocation fails but there is a fallback range
defined.
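
For illustration, the fallback behaviour described above could look
roughly like this (a sketch using the execmem_range fields added in this
patch, not the literal mm/execmem.c code):

static void *execmem_vmalloc(struct execmem_range *range, size_t size)
{
        bool has_fallback = !!range->fallback_start;
        gfp_t gfp_flags = has_fallback ? GFP_KERNEL | __GFP_NOWARN : GFP_KERNEL;
        void *p;

        /* try the primary range first, silently if a fallback exists */
        p = __vmalloc_node_range(size, range->alignment, range->start,
                                 range->end, gfp_flags, range->pgprot,
                                 VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
                                 __builtin_return_address(0));
        if (!p && has_fallback)
                p = __vmalloc_node_range(size, range->alignment,
                                         range->fallback_start,
                                         range->fallback_end, GFP_KERNEL,
                                         range->pgprot, VM_FLUSH_RESET_PERMS,
                                         NUMA_NO_NODE,
                                         __builtin_return_address(0));
        return p;
}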

Signed-off-by: Mike Rapoport (IBM) 
Acked-by: Will Deacon 
Acked-by: Song Liu 
---
 arch/Kconfig |  8 
 arch/arm/kernel/module.c | 41 
 arch/arm64/Kconfig   |  1 +
 arch/arm64/kernel/module.c   | 55 ++
 arch/powerpc/kernel/module.c | 60 +++--
 arch/s390/kernel/module.c| 54 +++---
 arch/x86/kernel/module.c | 70 +++--
 include/linux/execmem.h  | 30 ++-
 include/linux/moduleloader.h | 12 --
 kernel/module/main.c | 26 +++--
 mm/execmem.c | 75 ++--
 11 files changed, 247 insertions(+), 185 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 65afb1de48b3..4fd0daa54e6c 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -960,6 +960,14 @@ config ARCH_WANTS_MODULES_DATA_IN_VMALLOC
  For architectures like powerpc/32 which have constraints on module
  allocation and need to allocate module data outside of module area.
 
+config ARCH_WANTS_EXECMEM_LATE
+   bool
+   help
+ For architectures that do not allocate executable memory early on
+ boot, but rather require its initialization late when there is
+ enough entropy for module space randomization, for instance
+ arm64.
+
 config HAVE_IRQ_EXIT_ON_IRQ_STACK
bool
help
diff --git a/arch/arm/kernel/module.c b/arch/arm/kernel/module.c
index e74d84f58b77..a98fdf6ff26c 100644
--- a/arch/arm/kernel/module.c
+++ b/arch/arm/kernel/module.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -34,23 +35,31 @@
 #endif
 
 #ifdef CONFIG_MMU
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-   gfp_t gfp_mask = GFP_KERNEL;
-   void *p;
-
-   /* Silence the initial allocation */
-   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS))
-   gfp_mask |= __GFP_NOWARN;
-
-   p = __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   gfp_mask, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
-   if (!IS_ENABLED(CONFIG_ARM_MODULE_PLTS) || p)
-   return p;
-   return __vmalloc_node_range(size, 1,  VMALLOC_START, VMALLOC_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
+   unsigned long fallback_start = 0, fallback_end = 0;
+
+   if (IS_ENABLED(CONFIG_ARM_MODULE_PLTS)) {
+   fallback_start = VMALLOC_START;
+   fallback_end = VMALLOC_END;
+   }
+
+   execmem_info = (struct execmem_info){
+   .ranges = {
+   [EXECMEM_DEFAULT] = {
+   .start  = MODULES_VADDR,
+   .end= MODULES_END,
+   .pgprot = PAGE_KERNEL_EXEC,
+   .alignment = 1,
+   .fallback_start = fallback_start,
+   .fallback_end   = fallback_end,
+   },
+   },
+   };
+
+   return &execmem_info;
 }
 #endif
 
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7b11c98b3e84..74b34a78b7ac 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -105,6 +105,7 @@ config ARM64
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES 
&& !ARM64_VA_BITS_36)
select ARCH_WANT_LD_ORPHAN_WARN
+   select ARCH_WANTS_EXECMEM_LATE if EXECMEM
select ARCH_WANTS_NO_INSTR
select ARCH_WANTS_THP_SWAP if ARM64_4K_PAGES
select ARCH_HAS_UBSAN
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index e92da4da1b2a..b7a7a23f9f8f 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -108,41 +109,47 @@ static int __init module_init_limits(void)
 
return 0;
 }
-subsys_initcall(module_init_limits);
 
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-   

[PATCH v7 07/16] mm/execmem, arch: convert simple overrides of module_alloc to execmem

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Several architectures override module_alloc() only to define address
range for code allocations different than VMALLOC address space.

Provide a generic implementation in execmem that uses the parameters for
address space ranges, required alignment and page protections provided
by architectures.

The architectures must fill execmem_info structure and implement
execmem_arch_setup() that returns a pointer to that structure. This way the
execmem initialization won't be called from every architecture, but rather
from a central place, namely a core_initcall() in execmem.

The execmem provides execmem_alloc() API that wraps __vmalloc_node_range()
with the parameters defined by the architectures.  If an architecture does
not implement execmem_arch_setup(), execmem_alloc() will fall back to
module_alloc().
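
For illustration, the behaviour described above amounts to roughly the
following (sketch, not the literal mm/execmem.c code; execmem_info here
stands for whatever the core stores from execmem_arch_setup()):

void *execmem_alloc(enum execmem_type type, size_t size)
{
        struct execmem_range *range;

        if (!execmem_info)      /* arch did not provide execmem_arch_setup() */
                return module_alloc(size);

        range = &execmem_info->ranges[type];
        return __vmalloc_node_range(size, range->alignment, range->start,
                                    range->end, GFP_KERNEL, range->pgprot,
                                    VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
                                    __builtin_return_address(0));
}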

Signed-off-by: Mike Rapoport (IBM) 
Acked-by: Song Liu 
---
 arch/loongarch/kernel/module.c | 19 --
 arch/mips/kernel/module.c  | 20 --
 arch/nios2/kernel/module.c | 21 ---
 arch/parisc/kernel/module.c| 24 
 arch/riscv/kernel/module.c | 24 
 arch/sparc/kernel/module.c | 20 --
 include/linux/execmem.h| 47 
 mm/execmem.c   | 67 --
 mm/mm_init.c   |  2 +
 9 files changed, 210 insertions(+), 34 deletions(-)

diff --git a/arch/loongarch/kernel/module.c b/arch/loongarch/kernel/module.c
index c7d0338d12c1..ca6dd7ea1610 100644
--- a/arch/loongarch/kernel/module.c
+++ b/arch/loongarch/kernel/module.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -490,10 +491,22 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char 
*strtab,
return 0;
 }
 
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE, 
__builtin_return_address(0));
+   execmem_info = (struct execmem_info){
+   .ranges = {
+   [EXECMEM_DEFAULT] = {
+   .start  = MODULES_VADDR,
+   .end= MODULES_END,
+   .pgprot = PAGE_KERNEL,
+   .alignment = 1,
+   },
+   },
+   };
+
+   return &execmem_info;
 }
 
 static void module_init_ftrace_plt(const Elf_Ehdr *hdr,
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 9a6c96014904..59225a3cf918 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 struct mips_hi16 {
@@ -32,11 +33,22 @@ static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);
 
 #ifdef MODULES_VADDR
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
-   __builtin_return_address(0));
+   execmem_info = (struct execmem_info){
+   .ranges = {
+   [EXECMEM_DEFAULT] = {
+   .start  = MODULES_VADDR,
+   .end= MODULES_END,
+   .pgprot = PAGE_KERNEL,
+   .alignment = 1,
+   },
+   },
+   };
+
+   return &execmem_info;
 }
 #endif
 
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 9c97b7513853..0d1ee86631fc 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -18,15 +18,26 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
-void *module_alloc(unsigned long size)
+static struct execmem_info execmem_info __ro_after_init;
+
+struct execmem_info __init *execmem_arch_setup(void)
 {
-   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
-   GFP_KERNEL, PAGE_KERNEL_EXEC,
-   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
-   __builtin_return_address(0));
+   execmem_info = (struct execmem_info){
+   .ranges = {
+   [EXECMEM_DEFAULT] = {
+   .start  = MODULES_VADDR,
+   .end= MODULES_END,
+   .pgprot = PAGE_KERNEL_EXEC,
+   .alignment = 1,
+   },
+   },
+   };
+
+   return &execmem_info;
 }
 
 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
diff --git 

[PATCH v7 06/16] mm: introduce execmem_alloc() and execmem_free()

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

module_alloc() is used everywhere as a means to allocate memory for code.

Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules and
puts the burden of code allocation on the modules code.

Several architectures override module_alloc() because of various
constraints where the executable memory can be located and this causes
additional obstacles for improvements of code allocation.

Start splitting code allocation from modules by introducing execmem_alloc()
and execmem_free() APIs.

Initially, execmem_alloc() is a wrapper for module_alloc() and
execmem_free() is a replacement of module_memfree() to allow updating all
call sites to use the new APIs.

Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem_alloc() takes
a type argument, that will be used to identify the calling subsystem and to
allow architectures define parameters for ranges suitable for that
subsystem.

No functional changes.
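
At this stage the new API is essentially a thin shim, roughly (sketch
only; the type argument just identifies the calling subsystem for now):

void *execmem_alloc(enum execmem_type type, size_t size)
{
        return module_alloc(size);
}

void execmem_free(void *ptr)
{
        /*
         * This memory may be read-only; freeing such memory from interrupt
         * context is not supported by vmalloc, same as for module_memfree().
         */
        WARN_ON(in_interrupt());
        vfree(ptr);
}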

Signed-off-by: Mike Rapoport (IBM) 
Acked-by: Masami Hiramatsu (Google) 
Acked-by: Song Liu 
---
 arch/powerpc/kernel/kprobes.c|  6 ++--
 arch/s390/kernel/ftrace.c|  4 +--
 arch/s390/kernel/kprobes.c   |  4 +--
 arch/s390/kernel/module.c|  5 +--
 arch/sparc/net/bpf_jit_comp_32.c |  8 ++---
 arch/x86/kernel/ftrace.c |  6 ++--
 arch/x86/kernel/kprobes/core.c   |  4 +--
 include/linux/execmem.h  | 57 
 include/linux/moduleloader.h |  3 --
 kernel/bpf/core.c|  6 ++--
 kernel/kprobes.c |  8 ++---
 kernel/module/Kconfig|  1 +
 kernel/module/main.c | 25 +-
 mm/Kconfig   |  3 ++
 mm/Makefile  |  1 +
 mm/execmem.c | 32 ++
 16 files changed, 128 insertions(+), 45 deletions(-)
 create mode 100644 include/linux/execmem.h
 create mode 100644 mm/execmem.c

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index bbca90a5e2ec..9fcd01bb2ce6 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -19,8 +19,8 @@
 #include 
 #include 
 #include 
-#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -130,7 +130,7 @@ void *alloc_insn_page(void)
 {
void *page;
 
-   page = module_alloc(PAGE_SIZE);
+   page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
if (!page)
return NULL;
 
@@ -142,7 +142,7 @@ void *alloc_insn_page(void)
}
return page;
 error:
-   module_memfree(page);
+   execmem_free(page);
return NULL;
 }
 
diff --git a/arch/s390/kernel/ftrace.c b/arch/s390/kernel/ftrace.c
index c46381ea04ec..798249ef5646 100644
--- a/arch/s390/kernel/ftrace.c
+++ b/arch/s390/kernel/ftrace.c
@@ -7,13 +7,13 @@
  *   Author(s): Martin Schwidefsky 
  */
 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -220,7 +220,7 @@ static int __init ftrace_plt_init(void)
 {
const char *start, *end;
 
-   ftrace_plt = module_alloc(PAGE_SIZE);
+   ftrace_plt = execmem_alloc(EXECMEM_FTRACE, PAGE_SIZE);
if (!ftrace_plt)
panic("cannot allocate ftrace plt\n");
 
diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
index f0cf20d4b3c5..3c1b1be744de 100644
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -9,7 +9,6 @@
 
 #define pr_fmt(fmt) "kprobes: " fmt
 
-#include 
 #include 
 #include 
 #include 
@@ -21,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -38,7 +38,7 @@ void *alloc_insn_page(void)
 {
void *page;
 
-   page = module_alloc(PAGE_SIZE);
+   page = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
if (!page)
return NULL;
set_memory_rox((unsigned long)page, 1);
diff --git a/arch/s390/kernel/module.c b/arch/s390/kernel/module.c
index 42215f9404af..ac97a905e8cd 100644
--- a/arch/s390/kernel/module.c
+++ b/arch/s390/kernel/module.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -76,7 +77,7 @@ void *module_alloc(unsigned long size)
 #ifdef CONFIG_FUNCTION_TRACER
 void module_arch_cleanup(struct module *mod)
 {
-   module_memfree(mod->arch.trampolines_start);
+   execmem_free(mod->arch.trampolines_start);
 }
 #endif
 
@@ -510,7 +511,7 @@ static int module_alloc_ftrace_hotpatch_trampolines(struct 
module *me,
 
size = FTRACE_HOTPATCH_TRAMPOLINES_SIZE(s->sh_size);
numpages = DIV_ROUND_UP(size, PAGE_SIZE);
-   start = module_alloc(numpages * PAGE_SIZE);
+   start = execmem_alloc(EXECMEM_FTRACE, numpages * PAGE_SIZE);
if (!start)

[PATCH v7 05/16] module: make module_memory_{alloc,free} more self-contained

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Move the logic related to the memory allocation and freeing into
module_memory_alloc() and module_memory_free().

Signed-off-by: Mike Rapoport (IBM) 
---
 kernel/module/main.c | 64 +++-
 1 file changed, 39 insertions(+), 25 deletions(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index e1e8a7a9d6c1..5b82b069e0d3 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1203,15 +1203,44 @@ static bool mod_mem_use_vmalloc(enum mod_mem_type type)
mod_mem_type_is_core_data(type);
 }
 
-static void *module_memory_alloc(unsigned int size, enum mod_mem_type type)
+static int module_memory_alloc(struct module *mod, enum mod_mem_type type)
 {
+   unsigned int size = PAGE_ALIGN(mod->mem[type].size);
+   void *ptr;
+
+   mod->mem[type].size = size;
+
if (mod_mem_use_vmalloc(type))
-   return vzalloc(size);
-   return module_alloc(size);
+   ptr = vmalloc(size);
+   else
+   ptr = module_alloc(size);
+
+   if (!ptr)
+   return -ENOMEM;
+
+   /*
+* The pointer to these blocks of memory are stored on the module
+* structure and we keep that around so long as the module is
+* around. We only free that memory when we unload the module.
+* Just mark them as not being a leak then. The .init* ELF
+* sections *do* get freed after boot so we *could* treat them
+* slightly differently with kmemleak_ignore() and only grey
+* them out as they work as typical memory allocations which
+* *do* eventually get freed, but let's just keep things simple
+* and avoid *any* false positives.
+*/
+   kmemleak_not_leak(ptr);
+
+   memset(ptr, 0, size);
+   mod->mem[type].base = ptr;
+
+   return 0;
 }
 
-static void module_memory_free(void *ptr, enum mod_mem_type type)
+static void module_memory_free(struct module *mod, enum mod_mem_type type)
 {
+   void *ptr = mod->mem[type].base;
+
if (mod_mem_use_vmalloc(type))
vfree(ptr);
else
@@ -1229,12 +1258,12 @@ static void free_mod_mem(struct module *mod)
/* Free lock-classes; relies on the preceding sync_rcu(). */
lockdep_free_key_range(mod_mem->base, mod_mem->size);
if (mod_mem->size)
-   module_memory_free(mod_mem->base, type);
+   module_memory_free(mod, type);
}
 
/* MOD_DATA hosts mod, so free it at last */
lockdep_free_key_range(mod->mem[MOD_DATA].base, 
mod->mem[MOD_DATA].size);
-   module_memory_free(mod->mem[MOD_DATA].base, MOD_DATA);
+   module_memory_free(mod, MOD_DATA);
 }
 
 /* Free a module, remove from lists, etc. */
@@ -2225,7 +2254,6 @@ static int find_module_sections(struct module *mod, 
struct load_info *info)
 static int move_module(struct module *mod, struct load_info *info)
 {
int i;
-   void *ptr;
enum mod_mem_type t = 0;
int ret = -ENOMEM;
 
@@ -2234,26 +2262,12 @@ static int move_module(struct module *mod, struct 
load_info *info)
mod->mem[type].base = NULL;
continue;
}
-   mod->mem[type].size = PAGE_ALIGN(mod->mem[type].size);
-   ptr = module_memory_alloc(mod->mem[type].size, type);
-   /*
- * The pointer to these blocks of memory are stored on the 
module
- * structure and we keep that around so long as the module is
- * around. We only free that memory when we unload the module.
- * Just mark them as not being a leak then. The .init* ELF
- * sections *do* get freed after boot so we *could* treat them
- * slightly differently with kmemleak_ignore() and only grey
- * them out as they work as typical memory allocations which
- * *do* eventually get freed, but let's just keep things simple
- * and avoid *any* false positives.
-*/
-   kmemleak_not_leak(ptr);
-   if (!ptr) {
+
+   ret = module_memory_alloc(mod, type);
+   if (ret) {
t = type;
goto out_enomem;
}
-   memset(ptr, 0, mod->mem[type].size);
-   mod->mem[type].base = ptr;
}
 
/* Transfer each section which specifies SHF_ALLOC */
@@ -2296,7 +2310,7 @@ static int move_module(struct module *mod, struct 
load_info *info)
return 0;
 out_enomem:
for (t--; t >= 0; t--)
-   module_memory_free(mod->mem[t].base, t);
+   module_memory_free(mod, t);
return ret;
 }
 
-- 
2.43.0




[PATCH v7 04/16] sparc: simplify module_alloc()

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Define MODULES_VADDR and MODULES_END as VMALLOC_START and VMALLOC_END
for 32-bit and reduce module_alloc() to

__vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, ...)

as with the new defines the allocation becomes identical for both 32
and 64 bits.

While on it, drop unused include of 

Suggested-by: Sam Ravnborg 
Signed-off-by: Mike Rapoport (IBM) 
Reviewed-by: Sam Ravnborg 
---
 arch/sparc/include/asm/pgtable_32.h |  2 ++
 arch/sparc/kernel/module.c  | 25 +
 2 files changed, 3 insertions(+), 24 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_32.h 
b/arch/sparc/include/asm/pgtable_32.h
index 9e85d57ac3f2..62bcafe38b1f 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -432,6 +432,8 @@ static inline int io_remap_pfn_range(struct vm_area_struct 
*vma,
 
 #define VMALLOC_START   _AC(0xfe60,UL)
 #define VMALLOC_END _AC(0xffc0,UL)
+#define MODULES_VADDR   VMALLOC_START
+#define MODULES_END VMALLOC_END
 
 /* We provide our own get_unmapped_area to cope with VA holes for userland */
 #define HAVE_ARCH_UNMAPPED_AREA
diff --git a/arch/sparc/kernel/module.c b/arch/sparc/kernel/module.c
index 66c45a2764bc..d37adb2a0b54 100644
--- a/arch/sparc/kernel/module.c
+++ b/arch/sparc/kernel/module.c
@@ -21,35 +21,12 @@
 
 #include "entry.h"
 
-#ifdef CONFIG_SPARC64
-
-#include 
-
-static void *module_map(unsigned long size)
+void *module_alloc(unsigned long size)
 {
-   if (PAGE_ALIGN(size) > MODULES_LEN)
-   return NULL;
return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
__builtin_return_address(0));
 }
-#else
-static void *module_map(unsigned long size)
-{
-   return vmalloc(size);
-}
-#endif /* CONFIG_SPARC64 */
-
-void *module_alloc(unsigned long size)
-{
-   void *ret;
-
-   ret = module_map(size);
-   if (ret)
-   memset(ret, 0, size);
-
-   return ret;
-}
 
 /* Make generic code ignore STT_REGISTER dummy undefined symbols.  */
 int module_frob_arch_sections(Elf_Ehdr *hdr,
-- 
2.43.0




[PATCH v7 03/16] nios2: define virtual address space for modules

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

nios2 uses kmalloc() to implement module_alloc() because CALL26/PCREL26
cannot reach all of vmalloc address space.

Define module space as 32MiB below the kernel base and switch nios2 to
use vmalloc for module allocations.

Suggested-by: Thomas Gleixner 
Acked-by: Dinh Nguyen 
Acked-by: Song Liu 
Signed-off-by: Mike Rapoport (IBM) 
---
 arch/nios2/include/asm/pgtable.h |  5 -
 arch/nios2/kernel/module.c   | 19 ---
 2 files changed, 8 insertions(+), 16 deletions(-)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index d052dfcbe8d3..eab87c6beacb 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -25,7 +25,10 @@
 #include 
 
 #define VMALLOC_START  CONFIG_NIOS2_KERNEL_MMU_REGION_BASE
-#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
+#define VMALLOC_END(CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M - 1)
+
+#define MODULES_VADDR  (CONFIG_NIOS2_KERNEL_REGION_BASE - SZ_32M)
+#define MODULES_END(CONFIG_NIOS2_KERNEL_REGION_BASE - 1)
 
 struct mm_struct;
 
diff --git a/arch/nios2/kernel/module.c b/arch/nios2/kernel/module.c
index 76e0a42d6e36..9c97b7513853 100644
--- a/arch/nios2/kernel/module.c
+++ b/arch/nios2/kernel/module.c
@@ -21,23 +21,12 @@
 
 #include 
 
-/*
- * Modules should NOT be allocated with kmalloc for (obvious) reasons.
- * But we do it for now to avoid relocation issues. CALL26/PCREL26 cannot reach
- * from 0x8000 (vmalloc area) to 0xc (kernel) (kmalloc returns
- * addresses in 0xc000)
- */
 void *module_alloc(unsigned long size)
 {
-   if (size == 0)
-   return NULL;
-   return kmalloc(size, GFP_KERNEL);
-}
-
-/* Free memory returned from module_alloc */
-void module_memfree(void *module_region)
-{
-   kfree(module_region);
+   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
+   GFP_KERNEL, PAGE_KERNEL_EXEC,
+   VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
+   __builtin_return_address(0));
 }
 
 int apply_relocate_add(Elf32_Shdr *sechdrs, const char *strtab,
-- 
2.43.0




[PATCH v7 02/16] mips: module: rename MODULE_START to MODULES_VADDR

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

and MODULE_END to MODULES_END to match other architectures that define
custom address space for modules.

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/mips/include/asm/pgtable-64.h | 4 ++--
 arch/mips/kernel/module.c  | 4 ++--
 arch/mips/mm/fault.c   | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/mips/include/asm/pgtable-64.h 
b/arch/mips/include/asm/pgtable-64.h
index 20ca48c1b606..c0109aff223b 100644
--- a/arch/mips/include/asm/pgtable-64.h
+++ b/arch/mips/include/asm/pgtable-64.h
@@ -147,8 +147,8 @@
 #if defined(CONFIG_MODULES) && defined(KBUILD_64BIT_SYM32) && \
VMALLOC_START != CKSSEG
 /* Load modules into 32bit-compatible segment. */
-#define MODULE_START   CKSSEG
-#define MODULE_END (FIXADDR_START-2*PAGE_SIZE)
+#define MODULES_VADDR  CKSSEG
+#define MODULES_END(FIXADDR_START-2*PAGE_SIZE)
 #endif
 
 #define pte_ERROR(e) \
diff --git a/arch/mips/kernel/module.c b/arch/mips/kernel/module.c
index 7b2fbaa9cac5..9a6c96014904 100644
--- a/arch/mips/kernel/module.c
+++ b/arch/mips/kernel/module.c
@@ -31,10 +31,10 @@ struct mips_hi16 {
 static LIST_HEAD(dbe_list);
 static DEFINE_SPINLOCK(dbe_lock);
 
-#ifdef MODULE_START
+#ifdef MODULES_VADDR
 void *module_alloc(unsigned long size)
 {
-   return __vmalloc_node_range(size, 1, MODULE_START, MODULE_END,
+   return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
GFP_KERNEL, PAGE_KERNEL, 0, NUMA_NO_NODE,
__builtin_return_address(0));
 }
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index aaa9a242ebba..37fedeaca2e9 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -83,8 +83,8 @@ static void __do_page_fault(struct pt_regs *regs, unsigned 
long write,
 
if (unlikely(address >= VMALLOC_START && address <= VMALLOC_END))
goto VMALLOC_FAULT_TARGET;
-#ifdef MODULE_START
-   if (unlikely(address >= MODULE_START && address < MODULE_END))
+#ifdef MODULES_VADDR
+   if (unlikely(address >= MODULES_VADDR && address < MODULES_END))
goto VMALLOC_FAULT_TARGET;
 #endif
 
-- 
2.43.0




[PATCH v7 01/16] arm64: module: remove unneeded call to kasan_alloc_module_shadow()

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Since commit f6f37d9320a1 ("arm64: select KASAN_VMALLOC for SW/HW_TAGS
modes") KASAN_VMALLOC is always enabled when KASAN is on. This means
that allocations in module_alloc() will be tracked by KASAN protection
for vmalloc() and that kasan_alloc_module_shadow() will always be an
empty inline and there is no point in calling it.

Drop meaningless call to kasan_alloc_module_shadow() from
module_alloc().

Signed-off-by: Mike Rapoport (IBM) 
---
 arch/arm64/kernel/module.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 47e0be610bb6..e92da4da1b2a 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -141,11 +141,6 @@ void *module_alloc(unsigned long size)
__func__);
}
 
-   if (p && (kasan_alloc_module_shadow(p, size, GFP_KERNEL) < 0)) {
-   vfree(p);
-   return NULL;
-   }
-
/* Memory is intended to be executable, reset the pointer tag. */
return kasan_reset_tag(p);
 }
-- 
2.43.0




[PATCH v7 00/16] mm: jit/text allocator

2024-04-29 Thread Mike Rapoport
From: "Mike Rapoport (IBM)" 

Hi,

The patches are also available in git:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v7

v7 changes:
* define MODULES_{VADDR,END} for riscv32 to fix the build and avoid
  #ifdefs in a function body
* add Acks, thanks everybody

v6: https://lore.kernel.org/all/20240426082854.7355-1-r...@kernel.org
* restore patch "arm64: extend execmem_info for generated code
  allocations" that disappeared in v5 rebase
* update execmem initialization so that by default it will be
  initialized early while late initialization will be an opt-in

v5: https://lore.kernel.org/all/20240422094436.3625171-1-r...@kernel.org
* rebase on v6.9-rc4 to avoid a conflict in kprobes
* add copyrights to mm/execmem.c (Luis)
* fix spelling (Ingo)
* define MODULES_VADDR for sparc (Sam)
* consistently initialize struct execmem_info (Peter)
* reduce #ifdefs in function bodies in kprobes (Masami) 

v4: https://lore.kernel.org/all/20240411160051.2093261-1-r...@kernel.org
* rebase on v6.9-rc2
* rename execmem_params to execmem_info and execmem_arch_params() to
  execmem_arch_setup()
* use single execmem_alloc() API instead of execmem_{text,data}_alloc() (Song)
* avoid extra copy of execmem parameters (Rick)
* run execmem_init() as core_initcall() except for the architectures that
  may allocate text really early (currently only x86) (Will)
* add acks for some of arm64 and riscv changes, thanks Will and Alexandre
* new commits:
  - drop call to kasan_alloc_module_shadow() on arm64 because it's not
needed anymore
  - rename MODULE_START to MODULES_VADDR on MIPS
  - use CONFIG_EXECMEM instead of CONFIG_MODULES on powerpc as per Christophe:
https://lore.kernel.org/all/79062fa3-3402-47b3-8920-9231ad05e...@csgroup.eu/

v3: https://lore.kernel.org/all/20230918072955.2507221-1-r...@kernel.org
* add type parameter to execmem allocation APIs
* remove BPF dependency on modules

v2: https://lore.kernel.org/all/20230616085038.4121892-1-r...@kernel.org
* Separate "module" and "others" allocations with execmem_text_alloc()
and jit_text_alloc()
* Drop ROX entailment on x86
* Add ack for nios2 changes, thanks Dinh Nguyen

v1: https://lore.kernel.org/all/20230601101257.530867-1-r...@kernel.org

= Cover letter from v1 (slightly updated) =

module_alloc() is used everywhere as a means to allocate memory for code.

Besides being semantically wrong, this unnecessarily ties all subsystems
that need to allocate code, such as ftrace, kprobes and BPF, to modules and
puts the burden of code allocation on the modules code.

Several architectures override module_alloc() because of various
constraints on where the executable memory can be located, and this causes
additional obstacles for improvements of code allocation.

A centralized infrastructure for code allocation allows allocations of
executable memory as ROX, and future optimizations such as caching large
pages for better iTLB performance and providing sub-page allocations for
users that only need small jit code snippets.

Rick Edgecombe proposed perm_alloc extension to vmalloc [1] and Song Liu
proposed execmem_alloc [2], but both these approaches were targeting BPF
allocations and lacked the ground work to abstract executable allocations
and split them from the modules core.

Thomas Gleixner suggested to express module allocation restrictions and
requirements as struct mod_alloc_type_params [3] that would define ranges,
protections and other parameters for different types of allocations used by
modules and following that suggestion Song separated allocations of
different types in modules (commit ac3b43283923 ("module: replace
module_layout with module_memory")) and posted "Type aware module
allocator" set [4].

I liked the idea of parametrising code allocation requirements as a
structure, but I believe the original proposal and Song's module allocator
were too module-centric, so I came up with these patches.

This set splits code allocation from modules by introducing the
execmem_alloc() and execmem_free() APIs, replaces call sites of
module_alloc() and module_memfree() with the new APIs and implements core
text and related allocations in a central place.

Instead of architecture specific overrides for module_alloc(), the
architectures that require non-default behaviour for text allocation must
fill execmem_info structure and implement execmem_arch_setup() that returns
a pointer to that structure. If an architecture does not implement
execmem_arch_setup(), the defaults compatible with the current
modules::module_alloc() are used.

Since architectures define different restrictions on placement,
permissions, alignment and other parameters for memory that can be used by
different subsystems that allocate executable memory, execmem APIs
take a type argument that will be used to identify the calling subsystem
and to allow architectures to define parameters for ranges suitable for that
subsystem.
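
For a concrete (but hypothetical) illustration of such a setup, an
architecture that keeps modules in a dedicated window but is happy to put
kprobes instruction pages anywhere in vmalloc space could describe that
along these lines (field names are the ones used by the per-architecture
patches in this series; the values are made up):

	static struct execmem_info execmem_info __ro_after_init;

	struct execmem_info __init *execmem_arch_setup(void)
	{
		execmem_info = (struct execmem_info){
			.ranges = {
				[EXECMEM_DEFAULT] = {
					.start	= MODULES_VADDR,
					.end	= MODULES_END,
					.pgprot	= PAGE_KERNEL,
					.alignment = 1,
				},
				[EXECMEM_KPROBES] = {
					.start	= VMALLOC_START,
					.end	= VMALLOC_END,
					.pgprot	= PAGE_KERNEL_ROX,
					.alignment = 1,
				},
			},
		};

		return &execmem_info;
	}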

The new infrastructure allows decoupling of BPF, kprobes and 

Re: [PATCH 5.15,5.10,5.4,4.19 0/2] Fix warning when tracing with large filenames

2024-04-29 Thread Greg KH
On Wed, Apr 24, 2024 at 07:20:07PM -0300, Thadeu Lima de Souza Cascardo wrote:
> The warning described on patch "tracing: Increase PERF_MAX_TRACE_SIZE to
> handle Sentinel1 and docker together" can be triggered with a perf probe on
> do_execve with a large path. As PATH_MAX is larger than PERF_MAX_TRACE_SIZE
> (2048 before the patch), the warning will trigger.
> 
> The fix was included in 5.16, so backporting to 5.15 and earlier LTS
> kernels. Also included is a patch that better describes the attempted
> allocation size.

All now queued up, thanks.

greg k-h



[PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

2024-04-29 Thread Dmitrii Kuvaiskii
Two enclave threads may try to add and remove the same enclave page
simultaneously (e.g., if the SGX runtime supports both lazy allocation
and `MADV_DONTNEED` semantics). Consider this race:

1. T1 performs page removal in sgx_encl_remove_pages() and stops right
   after removing the page table entry and right before re-acquiring the
   enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
2. T2 tries to access the page, and #PF[not_present] is raised. The
   condition to EAUG in sgx_vma_fault() is not satisfied because the
   page is still present in encl->page_array, thus the SGX driver
   assumes that the fault happened because the page was swapped out. The
   driver continues on a code path that installs a page table entry
   *without* performing EAUG.
3. The enclave page metadata is in an inconsistent state: the PTE is
   installed but there was no EAUG. Thus, T2 in userspace infinitely
   receives SIGSEGV on this page (and EACCEPT always fails).

Fix this by making sure that T1 (the page-removing thread) always wins
this data race. In particular, the page-being-removed is marked as such,
and T2 retries until the page is fully removed.

Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
Cc: sta...@vger.kernel.org
Signed-off-by: Dmitrii Kuvaiskii 
---
 arch/x86/kernel/cpu/sgx/encl.c  | 3 ++-
 arch/x86/kernel/cpu/sgx/encl.h  | 3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 41f14b1a3025..7ccd8b2fce5f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct 
sgx_encl *encl,
 
/* Entry successfully located. */
if (entry->epc_page) {
-   if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+   if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
+  SGX_ENCL_PAGE_BEING_REMOVED))
return ERR_PTR(-EBUSY);
 
return entry;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fff5f2293ae7 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -25,6 +25,9 @@
 /* 'desc' bit marking that the page is being reclaimed. */
 #define SGX_ENCL_PAGE_BEING_RECLAIMED  BIT(3)
 
+/* 'desc' bit marking that the page is being removed. */
+#define SGX_ENCL_PAGE_BEING_REMOVEDBIT(2)
+
 struct sgx_encl_page {
unsigned long desc;
unsigned long vm_max_prot_bits:8;
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..c542d4dd3e64 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
 * Do not keep encl->lock because of dependency on
 * mmap_lock acquired in sgx_zap_enclave_ptes().
 */
+   entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
mutex_unlock(>lock);
 
sgx_zap_enclave_ptes(encl, addr);
-- 
2.34.1




[PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

2024-04-29 Thread Dmitrii Kuvaiskii
Two enclave threads may try to access the same non-present enclave page
simultaneously (e.g., if the SGX runtime supports lazy allocation). The
threads will end up in sgx_encl_eaug_page(), racing to acquire the
enclave lock. The winning thread will perform EAUG, set up the page
table entry, and insert the page into encl->page_array. The losing
thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
to the error handling path.

This error handling path contains two bugs: (1) SIGBUS is sent to
userspace even though the enclave page is correctly installed by another
thread, and (2) sgx_encl_free_epc_page() is called, which performs EREMOVE
even though the enclave page was never intended to be removed. The first
bug is less severe because it impacts only the user space; the second
bug is more severe because it also impacts the OS state by ripping the
page (added by the winning thread) from the enclave.

Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
fault handler so that no signal is sent to userspace, and (2) by
replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
EREMOVE is performed.

Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized 
enclave")
Cc: sta...@vger.kernel.org
Reported-by: Marcelina Kościelnicka 
Suggested-by: Reinette Chatre 
Signed-off-by: Dmitrii Kuvaiskii 
---
 arch/x86/kernel/cpu/sgx/encl.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..41f14b1a3025 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct 
*vma,
 * If ret == -EBUSY then page was created in another flow while
 * running without encl->lock
 */
-   if (ret)
+   if (ret) {
+   if (ret == -EBUSY)
+   vmret = VM_FAULT_NOPAGE;
goto err_out_shrink;
+   }
 
pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
pginfo.addr = encl_page->desc & PAGE_MASK;
@@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct 
*vma,
 err_out_shrink:
sgx_encl_shrink(encl, va_page);
 err_out_epc:
-   sgx_encl_free_epc_page(epc_page);
+   sgx_free_epc_page(epc_page);
 err_out_unlock:
mutex_unlock(>lock);
kfree(encl_page);
-- 
2.34.1




[PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

2024-04-29 Thread Dmitrii Kuvaiskii
SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
enclave pages and may support MADV_DONTNEED semantics [1]. The former
implies #PF-based page allocation, and the latter implies the usage of
SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.

A trivial program like below (run under Gramine and with EDMM enabled)
stresses these two flows in the SGX driver and hangs:

/* repeatedly touch different enclave pages at random and mix with
 * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
static void* thread_func(void* arg) {
size_t num_pages = 0xA000 / page_size;
for (int i = 0; i < 5000; i++) {
size_t page = get_random_ulong() % num_pages;
char data = READ_ONCE(((char*)arg)[page * page_size]);

page = get_random_ulong() % num_pages;
madvise(arg + page * page_size, page_size, MADV_DONTNEED);
}
}

addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
pthread_t threads[16];
for (int i = 0; i < 16; i++)
pthread_create(&threads[i], NULL, thread_func, addr);

This program uncovers two data races in the SGX driver. The remaining
patches describe and fix these races.

I performed several stress tests to verify that there are no other data
races (at least with the test program above):

- On Icelake server with 128GB of PRMRR (EPC), without madvise(). This
  stresses the first data race. A Gramine SGX test suite running in the
  background for additional stressing. Result: 1,000 runs without hangs
  (result without the first bug fix: hangs every time).
- On Icelake server with 128GB of PRMRR (EPC), with madvise(). This
  stresses the second data race. A Gramine SGX test suite running in the
  background for additional stressing. Result: 1,000 runs without hangs
  (result with the first bug fix but without the second bug fix: hangs
  approx. once in 50 runs).
- On Icelake server with 4GB of PRMRR (EPC), with madvise(). This
  additionally stresses the enclave page swapping flows. Two Gramine SGX
  test suites running in the background for additional stressing of
  swapping (I observe 100% CPU utilization from ksgxd which confirms that
  swapping happens). Result: 1,000 runs without hangs.

(Sorry for the previous copy of this email, accidentally sent to
sta...@vger.kernel.org. Failed to use `--suppress-cc` during a test send.)

Dmitrii Kuvaiskii (2):
  x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  x86/sgx: Resolve EREMOVE page vs EAUG page data race

 arch/x86/kernel/cpu/sgx/encl.c  | 10 +++---
 arch/x86/kernel/cpu/sgx/encl.h  |  3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c |  1 +
 3 files changed, 11 insertions(+), 3 deletions(-)

-- 
2.34.1




Re: [PATCH v12 12/14] x86/sgx: Turn on per-cgroup EPC reclamation

2024-04-29 Thread Huang, Kai

> +/*
> + * Get the per-cgroup or global LRU list that tracks the given reclaimable 
> page.
> + */
>  static inline struct sgx_epc_lru_list *sgx_lru_list(struct sgx_epc_page 
> *epc_page)
>  {
> +#ifdef CONFIG_CGROUP_MISC
> + /*
> +  * epc_page->sgx_cg here is never NULL during a reclaimable epc_page's
> +  * life between sgx_alloc_epc_page() and sgx_free_epc_page():
> +  *
> +  * In sgx_alloc_epc_page(), epc_page->sgx_cg is set to the return from
> +  * sgx_get_current_cg() which is the misc cgroup of the current task, or
> +  * the root by default even if the misc cgroup is disabled by kernel
> +  * command line.
> +  *
> +  * epc_page->sgx_cg is only unset by sgx_free_epc_page().
> +  *
> +  * This function is never used before sgx_alloc_epc_page() or after
> +  * sgx_free_epc_page().
> +  */
> + return _page->sgx_cg->lru;
> +#else
>   return _global_lru;
> +#endif
>  }
>  
>  /*
> @@ -42,7 +63,8 @@ static inline struct sgx_epc_lru_list *sgx_lru_list(struct 
> sgx_epc_page *epc_pag
>   */
>  static inline bool sgx_can_reclaim(void)
>  {
> - return !list_empty(_global_lru.reclaimable);
> + return !sgx_cgroup_lru_empty(misc_cg_root()) ||
> +!list_empty(_global_lru.reclaimable);
>  }

Shouldn't this be:

	if (IS_ENABLED(CONFIG_CGROUP_MISC))
		return !sgx_cgroup_lru_empty(misc_cg_root());
	else
		return !list_empty(&sgx_global_lru.reclaimable);
?

In this way, it is consistent with the sgx_reclaim_pages_global() below.

>  
>  static atomic_long_t sgx_nr_free_pages = ATOMIC_LONG_INIT(0);
> @@ -404,7 +426,10 @@ static bool sgx_should_reclaim(unsigned long watermark)
>  
>  static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)
>  {
> - sgx_reclaim_pages(_global_lru, charge_mm);
> + if (IS_ENABLED(CONFIG_CGROUP_MISC))
> + sgx_cgroup_reclaim_pages(misc_cg_root(), charge_mm);
> + else
> + sgx_reclaim_pages(_global_lru, charge_mm);
>  }
>  
>  /*
> @@ -414,6 +439,14 @@ static void sgx_reclaim_pages_global(struct mm_struct 
> *charge_mm)
>   */
>  void sgx_reclaim_direct(void)
>  {
> + struct sgx_cgroup *sgx_cg = sgx_get_current_cg();
> +
> + /* Make sure there are some free pages at cgroup level */
> + if (sgx_cg && sgx_cgroup_should_reclaim(sgx_cg)) {
> + sgx_cgroup_reclaim_pages(misc_from_sgx(sgx_cg), current->mm);
> + sgx_put_cg(sgx_cg);
> + }

Empty line.

> + /* Make sure there are some free pages at global level */
>   if (sgx_should_reclaim(SGX_NR_LOW_PAGES))

Looking at the code, to me sgx_should_reclaim() is a little bit vague
because from the name we don't know whether it internally checks the
current cgroup or the global state.

It's better to rename to sgx_should_reclaim_global().

Ditto for sgx_can_reclaim() -> sgx_can_reclaim_global().

And I think you can do the renaming in the previous patch, because in the
changelog of your previous patch, it seems you have called out that the
two functions are for global reclaim.

>   sgx_reclaim_pages_global(current->mm);
>  }
> @@ -616,6 +649,12 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, 
> enum sgx_reclaim reclaim)
>   break;
>   }
>  
> + /*
> +  * At this point, the usage within this cgroup is under its
> +  * limit but there is no physical page left for allocation.
> +  * Perform a global reclaim to get some pages released from any
> +  * cgroup with reclaimable pages.
> +  */
>   sgx_reclaim_pages_global(current->mm);
>   cond_resched();
>   }



Re: [PATCH 0/4] vhost: Cleanup

2024-04-29 Thread Gavin Shan

On 4/29/24 17:02, Michael S. Tsirkin wrote:

On Tue, Apr 23, 2024 at 01:24:03PM +1000, Gavin Shan wrote:

This is suggested by Michael S. Tsirkin according to [1] and the goal
is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it,
the caller of the function needn't to worry about memory barriers. Since
we're here, other cleanups are also applied.



Gavin I suggested another approach.
1. Start with the patch I sent (vhost: order avail ring reads after
index updates) just do a diff against latest.
simplify error handling a bit.
2. Do any other cleanups on top.



My apologies, Michael. I didn't see your patch until now [1]

  [1] 
https://lore.kernel.org/virtualization/20240327155750-mutt-send-email-...@kernel.org/

v2 was sent with your changes integrated and other cleanups applied on
top of it. Please take a look when you get a chance.

  v2: 
https://lore.kernel.org/virtualization/20240429101400.617007-1-gs...@redhat.com/T/#t

Thanks,
Gavin




[PATCH v2 4/4] vhost: Reformat vhost_{get, put}_user()

2024-04-29 Thread Gavin Shan
Reformat the macros to use tabs before the line-continuation backslashes
so that they look clean.

No functional change intended.

Signed-off-by: Gavin Shan 
---
 drivers/vhost/vhost.c | 60 +--
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4ddb9ec2fe46..c1ed5e750521 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1207,21 +1207,22 @@ static inline void __user *__vhost_get_user(struct 
vhost_virtqueue *vq,
return __vhost_get_user_slow(vq, addr, size, type);
 }
 
-#define vhost_put_user(vq, x, ptr) \
-({ \
-   int ret; \
-   if (!vq->iotlb) { \
-   ret = __put_user(x, ptr); \
-   } else { \
-   __typeof__(ptr) to = \
+#define vhost_put_user(vq, x, ptr) \
+({ \
+   int ret;\
+   if (!vq->iotlb) {   \
+   ret = __put_user(x, ptr);   \
+   } else {\
+   __typeof__(ptr) to =\
(__typeof__(ptr)) __vhost_get_user(vq, ptr, \
- sizeof(*ptr), VHOST_ADDR_USED); \
-   if (to != NULL) \
-   ret = __put_user(x, to); \
-   else \
-   ret = -EFAULT;  \
-   } \
-   ret; \
+   sizeof(*ptr),   \
+   VHOST_ADDR_USED);   \
+   if (to != NULL) \
+   ret = __put_user(x, to);\
+   else\
+   ret = -EFAULT;  \
+   }   \
+   ret;\
 })
 
 static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
@@ -1252,22 +1253,21 @@ static inline int vhost_put_used_idx(struct 
vhost_virtqueue *vq)
  >used->idx);
 }
 
-#define vhost_get_user(vq, x, ptr, type)   \
-({ \
-   int ret; \
-   if (!vq->iotlb) { \
-   ret = __get_user(x, ptr); \
-   } else { \
-   __typeof__(ptr) from = \
-   (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
-  sizeof(*ptr), \
-  type); \
-   if (from != NULL) \
-   ret = __get_user(x, from); \
-   else \
-   ret = -EFAULT; \
-   } \
-   ret; \
+#define vhost_get_user(vq, x, ptr, type)   \
+({ \
+   int ret;\
+   if (!vq->iotlb) {   \
+   ret = __get_user(x, ptr);   \
+   } else {\
+   __typeof__(ptr) from =  \
+   (__typeof__(ptr)) __vhost_get_user(vq, ptr, \
+   sizeof(*ptr), type);\
+   if (from != NULL)   \
+   ret = __get_user(x, from);  \
+   else\
+   ret = -EFAULT;  \
+   }   \
+   ret;\
 })
 
 #define vhost_get_avail(vq, x, ptr) \
-- 
2.44.0




[PATCH v2 3/4] vhost: Improve vhost_get_avail_head()

2024-04-29 Thread Gavin Shan
Improve vhost_get_avail_head() so that the head or an error code is
returned. With it, the relevant sanity checks are squeezed into
vhost_get_avail_head() and vhost_get_vq_desc() is further simplified.

No functional change intended.

Signed-off-by: Gavin Shan 
---
 drivers/vhost/vhost.c | 50 ++-
 1 file changed, 26 insertions(+), 24 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b278c0333a66..4ddb9ec2fe46 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1322,11 +1322,27 @@ static inline int vhost_get_avail_idx(struct 
vhost_virtqueue *vq)
return 1;
 }
 
-static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
-  __virtio16 *head, int idx)
+static inline int vhost_get_avail_head(struct vhost_virtqueue *vq)
 {
-   return vhost_get_avail(vq, *head,
-  >avail->ring[idx & (vq->num - 1)]);
+   __virtio16 head;
+   int r;
+
+   r = vhost_get_avail(vq, head,
+   >avail->ring[vq->last_avail_idx & (vq->num - 
1)]);
+   if (unlikely(r)) {
+   vq_err(vq, "Failed to read head: index %u address %p\n",
+  vq->last_avail_idx,
+  >avail->ring[vq->last_avail_idx & (vq->num - 1)]);
+   return r;
+   }
+
+   r = vhost16_to_cpu(vq, head);
+   if (unlikely(r >= vq->num)) {
+   vq_err(vq, "Invalid head %d (%u)\n", r, vq->num);
+   return -EINVAL;
+   }
+
+   return r;
 }
 
 static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
@@ -2523,9 +2539,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
  struct vhost_log *log, unsigned int *log_num)
 {
struct vring_desc desc;
-   unsigned int i, head, found = 0;
-   __virtio16 ring_head;
-   int ret, access;
+   unsigned int i, found = 0;
+   int head, ret, access;
 
if (vq->avail_idx == vq->last_avail_idx) {
ret = vhost_get_avail_idx(vq);
@@ -2536,23 +2551,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
return vq->num;
}
 
-   /* Grab the next descriptor number they're advertising, and increment
-* the index we've seen. */
-   if (unlikely(vhost_get_avail_head(vq, _head, vq->last_avail_idx))) 
{
-   vq_err(vq, "Failed to read head: idx %d address %p\n",
-  vq->last_avail_idx,
-  >avail->ring[vq->last_avail_idx % vq->num]);
-   return -EFAULT;
-   }
-
-   head = vhost16_to_cpu(vq, ring_head);
-
-   /* If their number is silly, that's an error. */
-   if (unlikely(head >= vq->num)) {
-   vq_err(vq, "Guest says index %u > %u is available",
-  head, vq->num);
-   return -EINVAL;
-   }
+   /* Grab the next descriptor number they're advertising */
+   head = vhost_get_avail_head(vq);
+   if (unlikely(head < 0))
+   return head;
 
/* When we start there are none of either input nor output. */
*out_num = *in_num = 0;
-- 
2.44.0




[PATCH v2 2/4] vhost: Drop variable last_avail_idx in vhost_get_vq_desc()

2024-04-29 Thread Gavin Shan
The local variable @last_avail_idx is equivalent to vq->last_avail_idx.
So the code can be simplified a bit by dropping the local variable
@last_avail_idx.

No functional change intended.

Signed-off-by: Gavin Shan 
---
 drivers/vhost/vhost.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7aa623117aab..b278c0333a66 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2524,7 +2524,6 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 {
struct vring_desc desc;
unsigned int i, head, found = 0;
-   u16 last_avail_idx = vq->last_avail_idx;
__virtio16 ring_head;
int ret, access;
 
@@ -2539,10 +2538,10 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 
/* Grab the next descriptor number they're advertising, and increment
 * the index we've seen. */
-   if (unlikely(vhost_get_avail_head(vq, _head, last_avail_idx))) {
+   if (unlikely(vhost_get_avail_head(vq, _head, vq->last_avail_idx))) 
{
vq_err(vq, "Failed to read head: idx %d address %p\n",
-  last_avail_idx,
-  >avail->ring[last_avail_idx % vq->num]);
+  vq->last_avail_idx,
+  >avail->ring[vq->last_avail_idx % vq->num]);
return -EFAULT;
}
 
-- 
2.44.0




[PATCH v2 1/4] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-04-29 Thread Gavin Shan
From: "Michael S. Tsirkin" 

All the callers of vhost_get_avail_idx() are concerned with the memory
barrier, imposed by smp_rmb() to ensure the order of the available
ring entry read and the avail_idx read.

Improve vhost_get_avail_idx() so that smp_rmb() is executed when
the avail_idx is advanced. With it, the callers needn't worry
about the memory barrier.

No functional change intended.
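
For reference, the caller pattern that results from this (taken from the
vhost_get_vq_desc() hunk below) is:

	ret = vhost_get_avail_idx(vq);
	if (unlikely(ret < 0))
		return ret;		/* access error or bogus index */
	if (!ret)
		return vq->num;		/* nothing new since the last check */
	/*
	 * ret == 1: avail_idx advanced and the smp_rmb() has already been
	 * issued, so the available ring entries can be read right away.
	 */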

Signed-off-by: Michael S. Tsirkin 
[gshan: repainted vhost_get_avail_idx()]
Reviewed-by: Gavin Shan 
Acked-by: Will Deacon 
---
 drivers/vhost/vhost.c | 106 +-
 1 file changed, 42 insertions(+), 64 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 8995730ce0bf..7aa623117aab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1290,10 +1290,36 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
mutex_unlock(>vqs[i]->mutex);
 }
 
-static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
- __virtio16 *idx)
+static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
 {
-   return vhost_get_avail(vq, *idx, >avail->idx);
+   __virtio16 idx;
+   int r;
+
+   r = vhost_get_avail(vq, idx, >avail->idx);
+   if (unlikely(r < 0)) {
+   vq_err(vq, "Failed to access available index at %p (%d)\n",
+  >avail->idx, r);
+   return r;
+   }
+
+   /* Check it isn't doing very strange thing with available indexes */
+   vq->avail_idx = vhost16_to_cpu(vq, idx);
+   if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) {
+   vq_err(vq, "Invalid available index change from %u to %u",
+  vq->last_avail_idx, vq->avail_idx);
+   return -EINVAL;
+   }
+
+   /* We're done if there is nothing new */
+   if (vq->avail_idx == vq->last_avail_idx)
+   return 0;
+
+   /*
+* We updated vq->avail_idx so we need a memory barrier between
+* the index read above and the caller reading avail ring entries.
+*/
+   smp_rmb();
+   return 1;
 }
 
 static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
@@ -2498,38 +2524,17 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 {
struct vring_desc desc;
unsigned int i, head, found = 0;
-   u16 last_avail_idx;
-   __virtio16 avail_idx;
+   u16 last_avail_idx = vq->last_avail_idx;
__virtio16 ring_head;
int ret, access;
 
-   /* Check it isn't doing very strange things with descriptor numbers. */
-   last_avail_idx = vq->last_avail_idx;
-
if (vq->avail_idx == vq->last_avail_idx) {
-   if (unlikely(vhost_get_avail_idx(vq, _idx))) {
-   vq_err(vq, "Failed to access avail idx at %p\n",
-   >avail->idx);
-   return -EFAULT;
-   }
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-
-   if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
-   vq_err(vq, "Guest moved avail index from %u to %u",
-   last_avail_idx, vq->avail_idx);
-   return -EFAULT;
-   }
+   ret = vhost_get_avail_idx(vq);
+   if (unlikely(ret < 0))
+   return ret;
 
-   /* If there's nothing new since last we looked, return
-* invalid.
-*/
-   if (vq->avail_idx == last_avail_idx)
+   if (!ret)
return vq->num;
-
-   /* Only get avail ring entries after they have been
-* exposed by guest.
-*/
-   smp_rmb();
}
 
/* Grab the next descriptor number they're advertising, and increment
@@ -2790,35 +2795,20 @@ EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
 /* return true if we're sure that avaiable ring is empty */
 bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
 {
-   __virtio16 avail_idx;
int r;
 
if (vq->avail_idx != vq->last_avail_idx)
return false;
 
-   r = vhost_get_avail_idx(vq, _idx);
-   if (unlikely(r))
-   return false;
-
-   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-   if (vq->avail_idx != vq->last_avail_idx) {
-   /* Since we have updated avail_idx, the following
-* call to vhost_get_vq_desc() will read available
-* ring entries. Make sure that read happens after
-* the avail_idx read.
-*/
-   smp_rmb();
-   return false;
-   }
-
-   return true;
+   /* Treat error as non-empty here */
+   r = vhost_get_avail_idx(vq);
+   return r == 0;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
 
 /* OK, now we need to know 

[PATCH v2 0/4] vhost: Cleanup

2024-04-29 Thread Gavin Shan
This is suggested by Michael S. Tsirkin according to [1] and the goal
is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it,
the caller of the function needn't worry about memory barriers. Since
we're here, other cleanups are also applied.

[1] 
https://lore.kernel.org/virtualization/20240327155750-mutt-send-email-...@kernel.org/

PATCH[1] improves vhost_get_avail_idx() so that smp_rmb() is applied if
 needed. Besides, the sanity checks on the retrieved available
 queue index are also squeezed to vhost_get_avail_idx()
PATCH[2] drops the local variable @last_avail_idx since it's equivalent
 to vq->last_avail_idx
PATCH[3] improves vhost_get_avail_head(), similar to what we're doing
 for vhost_get_avail_idx(), so that the relevant sanity checks
 on the head are squeezed to vhost_get_avail_head()
PATCH[4] Reformat vhost_{get, put}_user() by using tab instead of space
 as the terminator for each line

Gavin Shan (3):
  vhost: Drop variable last_avail_idx in vhost_get_vq_desc()
  vhost: Improve vhost_get_avail_head()
  vhost: Reformat vhost_{get, put}_user()

Michael S. Tsirkin (1):
  vhost: Improve vhost_get_avail_idx() with smp_rmb()

 drivers/vhost/vhost.c | 215 +++---
 1 file changed, 97 insertions(+), 118 deletions(-)

Changelog
=
v2:
  * Improve vhost_get_avail_idx() as Michael suggested in [1]
as above (Michael)
  * Correct @head's type from 'unsigned int' to 'int'
(l...@intel.com)

-- 
2.44.0




回复:WARNING in current_check_refer_path

2024-04-29 Thread Ubisectech Sirius
> Hello,

> Thanks for the report.  Could you please provide a reproducer?

> Regards,
> Mickaël

Hi.
  The PoC file has been sent to you as an attachment.

> On Sun, Apr 28, 2024 at 10:47:02AM +0800, Ubisectech Sirius wrote:
>> Hello.
>> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
>> Recently, our team has discovered a issue in Linux kernel 6.7. Attached to 
>> the email were a PoC file of the issue.
>> 
>> Stack dump:
>> 
> > loop3: detected capacity change from 0 to 1024
> > [ cut here ]
> > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access 
> > security/landlock/fs.c:598 [inline]
> > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access 
> > security/landlock/fs.c:578 [inline]
> > WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 
> > current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758
> > Modules linked in:
> > CPU: 0 PID: 30368 Comm: syz-executor.3 Not tainted 6.7.0 #2
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 
> > 04/01/2014
> > RIP: 0010:get_mode_access security/landlock/fs.c:598 [inline]
> > RIP: 0010:get_mode_access security/landlock/fs.c:578 [inline]
> > RIP: 0010:current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758
> > Code: e9 76 fb ff ff 41 bc fe ff ff ff e9 6b fb ff ff e8 00 99 77 fd 90 0f 
> > 0b 90 41 bc f3 ff ff ff e9 57 fb ff ff e8 ec 98 77 fd 90 <0f> 0b 90 31 db 
> > e9 86 f9 ff ff bb 00 08 00 00 e9 7c f9 ff ff 41 ba
> > RSP: 0018:c90001fb7ba0 EFLAGS: 00010212
> > RAX: 0bc5 RBX: 88805feeb7b0 RCX: c90006e15000
> > RDX: 0004 RSI: 84125d64 RDI: 0003
> > RBP: 8880123c5608 R08: 0003 R09: c000
> > R10: f000 R11:  R12: 88805d32fc00
> > R13: 8880123c5608 R14:  R15: 0001
> > FS:  7fd70c4d8640() GS:88802c60() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 001b2c136000 CR3: 5b2a CR4: 00750ef0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > PKRU: 5554
> > Call Trace:
> >  
> >  security_path_rename+0x124/0x230 security/security.c:1828
> >  do_renameat2+0x9f6/0xd30 fs/namei.c:4983
> >  __do_sys_rename fs/namei.c:5042 [inline]
> >  __se_sys_rename fs/namei.c:5040 [inline]
> >  __x64_sys_rename+0x81/0xa0 fs/namei.c:5040
> >  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
> >  do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83
> >  entry_SYSCALL_64_after_hwframe+0x6f/0x77
> > RIP: 0033:0x7fd70b6900ed
> > Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 
> > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff 
> > ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> > RSP: 002b:7fd70c4d8028 EFLAGS: 0246 ORIG_RAX: 0052
> > RAX: ffda RBX: 7fd70b7cbf80 RCX: 7fd70b6900ed
>>  RDX:  RSI: 2140 RDI: 2100
> > RBP: 7fd70b6f14a6 R08:  R09: 
> > R10:  R11: 0246 R12: 
> > R13: 000b R14: 7fd70b7cbf80 R15: 7fd70c4b8000
> >  
> > 
> > Thank you for taking the time to read this email and we look forward to 
> > working with you further.
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> 




poc.c
Description: Binary data


[PATCH net-next v8] net/ipv4: add tracepoint for icmp_send  

2024-04-29 Thread xu.xin16
From: Peilin He 

Introduce a tracepoint for icmp_send, which can help users conveniently
get more detailed information when abnormal ICMP events happen.

1. Giving a use-case example:
=
When an application experiences packet loss due to an unreachable UDP
destination port, the kernel will send an exception message through the
icmp_send function. By adding a tracepoint for icmp_send, developers or
system administrators can obtain detailed information about the UDP
packet loss, including the type, code, source address, destination address,
source port, and destination port. This facilitates the troubleshooting
of UDP packet loss issues, especially for network-service
applications.

2. Operation Instructions:
==
Switch to the tracing directory.
cd /sys/kernel/tracing
Filter for destination port unreachable.
echo "type==3 && code==3" > events/icmp/icmp_send/filter
Enable trace event.
echo 1 > events/icmp/icmp_send/enable

3. Result View:

 udp_client_erro-11370   [002] ...s.12   124.728002:
 icmp_send: icmp_send: type=3, code=3.
 From 127.0.0.1:41895 to 127.0.0.1: ulen=23
 skbaddr=589b167a
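
For completeness, a hypothetical user-space trigger for the output above
(port 9999 is just an example of a UDP port nobody listens on; sending to
it makes the kernel emit the type=3/code=3 ICMP error traced here):

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int main(void)
	{
		struct sockaddr_in dst = {
			.sin_family = AF_INET,
			.sin_port = htons(9999),
		};
		int fd = socket(AF_INET, SOCK_DGRAM, 0);
		char msg[] = "trigger icmp_send";

		inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);
		sendto(fd, msg, sizeof(msg), 0,
		       (struct sockaddr *)&dst, sizeof(dst));
		close(fd);
		return 0;
	}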

Change log
=
v7->v8:
Some fixes according to
https://lore.kernel.org/all/CANn89iKNtKmN5im8K4dSZGqAV8=e3bZb
z5ahxbcnbjfxk5v...@mail.gmail.com/
1.Combine all variable definitions together without mixing 
code and variables.

v6->v7:
Some fixes according to
https://lore.kernel.org/all/20240425081210.720a4...@kernel.org/
1. Fix patch format issues.

v5->v6:
Some fixes according to
https://lore.kernel.org/all/20240413161319.ga853...@kernel.org/
1.Resubmit patches based on the latest net-next code.

v4->v5:
Some fixes according to
https://lore.kernel.org/all/CAL+tcoDeXXh+zcRk4PHnUk8ELnx=CE2pc
cqs7sfm0y9ak-e...@mail.gmail.com/
1.Adjust the position of trace_icmp_send() to before icmp_push_reply().

v3->v4:
Some fixes according to
https://lore.kernel.org/all/CANn89i+EFEr7VHXNdOi59Ba_R1nFKSBJz
BzkJFVgCTdXBx=y...@mail.gmail.com/
1.Add legality check for UDP header in SKB.
2.Target this patch for net-next.

v2->v3:
Some fixes according to
https://lore.kernel.org/all/20240319102549.7f7f6...@gandalf.local.home/
1. Change the tracking directory to/sys/kernel/tracking.
2. Adjust the layout of the TP-STRUCT_entry parameter structure.

v1->v2:
Some fixes according to
https://lore.kernel.org/all/CANn89iL-y9e_VFpdw=sZtRnKRu_tnUwqHu
fqtjvjsv-nz1x...@mail.gmail.com/
1. adjust the trace_icmp_send() to more protocols than UDP.
2. move the calling of trace_icmp_send after sanity checks
in __icmp_send().

Signed-off-by: Peilin He 
Signed-off-by: xu xin 
Reviewed-by: Yunkai Zhang 
Cc: Yang Yang 
Cc: Liu Chun 
Cc: Xuexin Jiang 
---
 include/trace/events/icmp.h | 67 +
 net/ipv4/icmp.c |  4 +++
 2 files changed, 71 insertions(+)
 create mode 100644 include/trace/events/icmp.h

diff --git a/include/trace/events/icmp.h b/include/trace/events/icmp.h
new file mode 100644
index ..31559796949a
--- /dev/null
+++ b/include/trace/events/icmp.h
@@ -0,0 +1,67 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM icmp
+
+#if !defined(_TRACE_ICMP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_ICMP_H
+
+#include 
+#include 
+
+TRACE_EVENT(icmp_send,
+
+   TP_PROTO(const struct sk_buff *skb, int type, int code),
+
+   TP_ARGS(skb, type, code),
+
+   TP_STRUCT__entry(
+   __field(const void *, skbaddr)
+   __field(int, type)
+   __field(int, code)
+   __array(__u8, saddr, 4)
+   __array(__u8, daddr, 4)
+   __field(__u16, sport)
+   __field(__u16, dport)
+   __field(unsigned short, ulen)
+   ),
+
+   TP_fast_assign(
+   struct iphdr *iph = ip_hdr(skb);
+   struct udphdr *uh = udp_hdr(skb);
+   int proto_4 = iph->protocol;
+   __be32 *p32;
+
+   __entry->skbaddr = skb;
+   __entry->type = type;
+   __entry->code = code;
+
+   if (proto_4 != IPPROTO_UDP || (u8 *)uh < skb->head ||
+   (u8 *)uh + sizeof(struct udphdr)
+   > skb_tail_pointer(skb)) {
+   __entry->sport = 0;
+   __entry->dport = 0;
+   __entry->ulen = 0;
+   } else {
+   __entry->sport = ntohs(uh->source);
+   __entry->dport = ntohs(uh->dest);
+   __entry->ulen = ntohs(uh->len);
+   }
+
+   p32 = (__be32 *) __entry->saddr;
+ 

Re: WARNING in current_check_refer_path

2024-04-29 Thread Mickaël Salaün
Hello,

Thanks for the report.  Could you please provide a reproducer?

Regards,
 Mickaël


On Sun, Apr 28, 2024 at 10:47:02AM +0800, Ubisectech Sirius wrote:
> Hello.
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> Recently, our team has discovered an issue in Linux kernel 6.7. Attached to
> the email is a PoC file of the issue.
> 
> Stack dump:
> 
> loop3: detected capacity change from 0 to 1024
> [ cut here ]
> WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access 
> security/landlock/fs.c:598 [inline]
> WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 get_mode_access 
> security/landlock/fs.c:578 [inline]
> WARNING: CPU: 0 PID: 30368 at security/landlock/fs.c:598 
> current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758
> Modules linked in:
> CPU: 0 PID: 30368 Comm: syz-executor.3 Not tainted 6.7.0 #2
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 
> 04/01/2014
> RIP: 0010:get_mode_access security/landlock/fs.c:598 [inline]
> RIP: 0010:get_mode_access security/landlock/fs.c:578 [inline]
> RIP: 0010:current_check_refer_path+0x955/0xa60 security/landlock/fs.c:758
> Code: e9 76 fb ff ff 41 bc fe ff ff ff e9 6b fb ff ff e8 00 99 77 fd 90 0f 0b 
> 90 41 bc f3 ff ff ff e9 57 fb ff ff e8 ec 98 77 fd 90 <0f> 0b 90 31 db e9 86 
> f9 ff ff bb 00 08 00 00 e9 7c f9 ff ff 41 ba
> RSP: 0018:c90001fb7ba0 EFLAGS: 00010212
> RAX: 0bc5 RBX: 88805feeb7b0 RCX: c90006e15000
> RDX: 0004 RSI: 84125d64 RDI: 0003
> RBP: 8880123c5608 R08: 0003 R09: c000
> R10: f000 R11:  R12: 88805d32fc00
> R13: 8880123c5608 R14:  R15: 0001
> FS:  7fd70c4d8640() GS:88802c60() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 001b2c136000 CR3: 5b2a CR4: 00750ef0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> PKRU: 5554
> Call Trace:
>  
>  security_path_rename+0x124/0x230 security/security.c:1828
>  do_renameat2+0x9f6/0xd30 fs/namei.c:4983
>  __do_sys_rename fs/namei.c:5042 [inline]
>  __se_sys_rename fs/namei.c:5040 [inline]
>  __x64_sys_rename+0x81/0xa0 fs/namei.c:5040
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x6f/0x77
> RIP: 0033:0x7fd70b6900ed
> Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7fd70c4d8028 EFLAGS: 0246 ORIG_RAX: 0052
> RAX: ffda RBX: 7fd70b7cbf80 RCX: 7fd70b6900ed
> RDX:  RSI: 2140 RDI: 2100
> RBP: 7fd70b6f14a6 R08:  R09: 
> R10:  R11: 0246 R12: 
> R13: 000b R14: 7fd70b7cbf80 R15: 7fd70c4b8000
>  
> 
> Thank you for taking the time to read this email and we look forward to 
> working with you further.
> 
> 
> 
> 
> 
> 
> 
> 
> 





[PATCH V1] soc: qcom: smp2p: Introduce tracepoint support

2024-04-29 Thread Sudeepgoud Patil
Introduce tracepoint support for smp2p, providing useful logging of the
communication between clients.
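
As a hypothetical illustration (the "stop" con_id and the device are made
up), an outbound smp2p entry is driven through the generic smem-state API,
so a client call like the one below ends up in smp2p_update_bits() and is
now visible through the smp2p_update_bits tracepoint:

	#include <linux/bits.h>
	#include <linux/device.h>
	#include <linux/err.h>
	#include <linux/soc/qcom/smem_state.h>

	static int example_signal_remote(struct device *dev)
	{
		struct qcom_smem_state *state;
		unsigned int bit;

		state = qcom_smem_state_get(dev, "stop", &bit);
		if (IS_ERR(state))
			return PTR_ERR(state);

		return qcom_smem_state_update_bits(state, BIT(bit), BIT(bit));
	}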

Signed-off-by: Sudeepgoud Patil 
Signed-off-by: Deepak Kumar Singh 
---
 drivers/soc/qcom/Makefile  |  1 +
 drivers/soc/qcom/smp2p.c   | 10 
 drivers/soc/qcom/trace-smp2p.h | 99 ++
 3 files changed, 110 insertions(+)
 create mode 100644 drivers/soc/qcom/trace-smp2p.h

diff --git a/drivers/soc/qcom/Makefile b/drivers/soc/qcom/Makefile
index ca0bece0dfff..30c1bf645501 100644
--- a/drivers/soc/qcom/Makefile
+++ b/drivers/soc/qcom/Makefile
@@ -23,6 +23,7 @@ qcom_rpmh-y   += rpmh.o
 obj-$(CONFIG_QCOM_SMD_RPM) += rpm-proc.o smd-rpm.o
 obj-$(CONFIG_QCOM_SMEM) += smem.o
 obj-$(CONFIG_QCOM_SMEM_STATE) += smem_state.o
+CFLAGS_smp2p.o := -I$(src)
 obj-$(CONFIG_QCOM_SMP2P)   += smp2p.o
 obj-$(CONFIG_QCOM_SMSM)+= smsm.o
 obj-$(CONFIG_QCOM_SOCINFO) += socinfo.o
diff --git a/drivers/soc/qcom/smp2p.c b/drivers/soc/qcom/smp2p.c
index a21241cbeec7..dde8513641ae 100644
--- a/drivers/soc/qcom/smp2p.c
+++ b/drivers/soc/qcom/smp2p.c
@@ -20,6 +20,9 @@
 #include 
 #include 
 
+#define CREATE_TRACE_POINTS
+#include "trace-smp2p.h"
+
 /*
  * The Shared Memory Point to Point (SMP2P) protocol facilitates communication
  * of a single 32-bit value between two processors.  Each value has a single
@@ -191,6 +194,7 @@ static void qcom_smp2p_do_ssr_ack(struct qcom_smp2p *smp2p)
struct smp2p_smem_item *out = smp2p->out;
u32 val;
 
+   trace_smp2p_ssr_ack(smp2p->remote_pid);
smp2p->ssr_ack = !smp2p->ssr_ack;
 
val = out->flags & ~BIT(SMP2P_FLAGS_RESTART_ACK_BIT);
@@ -213,6 +217,7 @@ static void qcom_smp2p_negotiate(struct qcom_smp2p *smp2p)
smp2p->ssr_ack_enabled = true;
 
smp2p->negotiation_done = true;
+   trace_smp2p_negotiate(smp2p->remote_pid, 
smp2p->ssr_ack_enabled);
}
 }
 
@@ -251,6 +256,8 @@ static void qcom_smp2p_notify_in(struct qcom_smp2p *smp2p)
status = val ^ entry->last_value;
entry->last_value = val;
 
+   trace_smp2p_notify_in(smp2p->remote_pid, entry->name, status, 
val);
+
/* No changes of this entry? */
if (!status)
continue;
@@ -406,6 +413,9 @@ static int smp2p_update_bits(void *data, u32 mask, u32 
value)
writel(val, entry->value);
spin_unlock_irqrestore(>lock, flags);
 
+   trace_smp2p_update_bits(entry->smp2p->remote_pid,
+   entry->name, orig, val);
+
if (val != orig)
qcom_smp2p_kick(entry->smp2p);
 
diff --git a/drivers/soc/qcom/trace-smp2p.h b/drivers/soc/qcom/trace-smp2p.h
new file mode 100644
index ..c61afab23f0c
--- /dev/null
+++ b/drivers/soc/qcom/trace-smp2p.h
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2024 Qualcomm Innovation Center, Inc. All rights reserved.
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM qcom_smp2p
+
+#if !defined(__QCOM_SMP2P_TRACE_H__) || defined(TRACE_HEADER_MULTI_READ)
+#define __QCOM_SMP2P_TRACE_H__
+
+#include 
+
+TRACE_EVENT(smp2p_ssr_ack,
+   TP_PROTO(unsigned int remote_pid),
+   TP_ARGS(remote_pid),
+   TP_STRUCT__entry(
+   __field(u32, remote_pid)
+   ),
+   TP_fast_assign(
+   __entry->remote_pid = remote_pid;
+   ),
+   TP_printk("%d: SSR detected, doing SSR Handshake",
+   __entry->remote_pid
+   )
+);
+
+TRACE_EVENT(smp2p_negotiate,
+   TP_PROTO(unsigned int remote_pid, bool ssr_ack_enabled),
+   TP_ARGS(remote_pid, ssr_ack_enabled),
+   TP_STRUCT__entry(
+   __field(u32, remote_pid)
+   __field(bool, ssr_ack_enabled)
+   ),
+   TP_fast_assign(
+   __entry->remote_pid = remote_pid;
+   __entry->ssr_ack_enabled = ssr_ack_enabled;
+   ),
+   TP_printk("%d: state=open ssr_ack=%d",
+   __entry->remote_pid,
+   __entry->ssr_ack_enabled
+   )
+);
+
+TRACE_EVENT(smp2p_notify_in,
+   TP_PROTO(unsigned int remote_pid, const char *name, unsigned long 
status, u32 val),
+   TP_ARGS(remote_pid, name, status, val),
+   TP_STRUCT__entry(
+   __field(u32, remote_pid)
+   __string(name, name)
+   __field(unsigned long, status)
+   __field(u32, val)
+   ),
+   TP_fast_assign(
+   __entry->remote_pid = remote_pid;
+   __assign_str(name, name);
+   __entry->status = status;
+   __entry->val = val;
+   ),
+   TP_printk("%d: %s: status:0x%0lx val:0x%0x",
+   __entry->remote_pid,
+   __get_str(name),
+   __entry->status,
+   __entry->val
+   )
+);
+
+TRACE_EVENT(smp2p_update_bits,
+   TP_PROTO(unsigned int remote_pid, const char *name, u32 orig, u32 val),
+   TP_ARGS(remote_pid, name, orig, val),
+   TP_STRUCT__entry(
+   __field(u32, remote_pid)
+   __string(name, name)
+   __field(u32, orig)
+   __field(u32, val)
+   ),
+   TP_fast_assign(
+   __entry->remote_pid = remote_pid;
+   __assign_str(name, name);
+   __entry->orig = orig;
+   __entry->val = val;
+   ),
+   TP_printk("%d: %s: orig:0x%0x new:0x%0x",
+   __entry->remote_pid,
+   __get_str(name),
+   __entry->orig,
+   __entry->val
+   )
+);
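
The remainder of the header is cut off in this archive. A trace header pulled
in under CREATE_TRACE_POINTS conventionally closes its include guard and then
points the trace infrastructure at the source directory, which is also why the
Makefile hunk above adds CFLAGS_smp2p.o := -I$(src). A minimal sketch of the
assumed trailer:

    #endif /* __QCOM_SMP2P_TRACE_H__ */

    /* This part must be outside protection */
    #undef TRACE_INCLUDE_PATH
    #define TRACE_INCLUDE_PATH .

    #undef TRACE_INCLUDE_FILE
    #define TRACE_INCLUDE_FILE trace-smp2p

    #include <trace/define_trace.h>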

Re: [PATCHv3 bpf-next 6/7] selftests/bpf: Add uretprobe compat test

2024-04-29 Thread Jiri Olsa
On Fri, Apr 26, 2024 at 11:06:53AM -0700, Andrii Nakryiko wrote:
> On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa  wrote:
> >
> > Adding a test that attaches a return uprobe inside a 32-bit task
> > and verifies that the return uprobe and the attached bpf programs
> > get properly executed.
> >
> > Signed-off-by: Jiri Olsa 
> > ---
> >  tools/testing/selftests/bpf/.gitignore|  1 +
> >  tools/testing/selftests/bpf/Makefile  |  6 ++-
> >  .../selftests/bpf/prog_tests/uprobe_syscall.c | 40 +++
> >  .../bpf/progs/uprobe_syscall_compat.c | 13 ++
> >  4 files changed, 59 insertions(+), 1 deletion(-)
> >  create mode 100644 
> > tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c
> >
> > diff --git a/tools/testing/selftests/bpf/.gitignore 
> > b/tools/testing/selftests/bpf/.gitignore
> > index f1aebabfb017..69d71223c0dd 100644
> > --- a/tools/testing/selftests/bpf/.gitignore
> > +++ b/tools/testing/selftests/bpf/.gitignore
> > @@ -45,6 +45,7 @@ test_cpp
> >  /veristat
> >  /sign-file
> >  /uprobe_multi
> > +/uprobe_compat
> >  *.ko
> >  *.tmp
> >  xskxceiver
> > diff --git a/tools/testing/selftests/bpf/Makefile 
> > b/tools/testing/selftests/bpf/Makefile
> > index edc73f8f5aef..d170b63eca62 100644
> > --- a/tools/testing/selftests/bpf/Makefile
> > +++ b/tools/testing/selftests/bpf/Makefile
> > @@ -134,7 +134,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr 
> > test_skb_cgroup_id_user \
> > xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata 
> > \
> > xdp_features bpf_test_no_cfi.ko
> >
> > -TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi
> > +TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi 
> > uprobe_compat
> 
> you need to add uprobe_compat to TRUNNER_EXTRA_FILES as well, no?

ah right

> > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
> > b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > index 9233210a4c33..3770254d893b 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > @@ -11,6 +11,7 @@
> >  #include 
> >  #include "uprobe_syscall.skel.h"
> >  #include "uprobe_syscall_call.skel.h"
> > +#include "uprobe_syscall_compat.skel.h"
> >
> >  __naked unsigned long uretprobe_regs_trigger(void)
> >  {
> > @@ -291,6 +292,35 @@ static void test_uretprobe_syscall_call(void)
> >  "read_trace_pipe_iter");
> > ASSERT_EQ(found, 0, "found");
> >  }
> > +
> > +static void trace_pipe_compat_cb(const char *str, void *data)
> > +{
> > +   if (strstr(str, "uretprobe compat") != NULL)
> > +   (*(int *)data)++;
> > +}
> > +
> > +static void test_uretprobe_compat(void)
> > +{
> > +   struct uprobe_syscall_compat *skel = NULL;
> > +   int err, found = 0;
> > +
> > +   skel = uprobe_syscall_compat__open_and_load();
> > +   if (!ASSERT_OK_PTR(skel, "uprobe_syscall_compat__open_and_load"))
> > +   goto cleanup;
> > +
> > +   err = uprobe_syscall_compat__attach(skel);
> > +   if (!ASSERT_OK(err, "uprobe_syscall_compat__attach"))
> > +   goto cleanup;
> > +
> > +   system("./uprobe_compat");
> > +
> > +   ASSERT_OK(read_trace_pipe_iter(trace_pipe_compat_cb, &found, 1000),
> > +"read_trace_pipe_iter");
> 
> why so complicated? can't you just set global variable that it was called

hm, we execute a separate uprobe_compat (32-bit) process that triggers the bpf
program, so we can't use a global variable.. using the trace_pipe was the only
thing that was easy to do

jirka
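
For context, uprobe_compat is a separate 32-bit helper binary that the test
spawns via system(); the uretprobe is attached to its main(), so simply
reaching and returning from main() is enough to fire it. A hypothetical,
minimal sketch of such a helper (the real source ships with the patch):

    /*
     * uprobe_compat.c - built as a 32-bit binary so the kernel takes the
     * compat uretprobe path; the bpf program attaches to main() below.
     */
    int main(void)
    {
            return 0;
    }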

> 
> > +   ASSERT_EQ(found, 1, "found");
> > +
> > +cleanup:
> > +   uprobe_syscall_compat__destroy(skel);
> > +}
> >  #else
> >  static void test_uretprobe_regs_equal(void)
> >  {
> > @@ -306,6 +336,11 @@ static void test_uretprobe_syscall_call(void)
> >  {
> > test__skip();
> >  }
> > +
> > +static void test_uretprobe_compat(void)
> > +{
> > +   test__skip();
> > +}
> >  #endif
> >
> >  void test_uprobe_syscall(void)
> > @@ -320,3 +355,8 @@ void serial_test_uprobe_syscall_call(void)
> >  {
> > test_uretprobe_syscall_call();
> >  }
> > +
> > +void serial_test_uprobe_syscall_compat(void)
> 
> and then no need for serial_test?
> 
> > +{
> > +   test_uretprobe_compat();
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c 
> > b/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c
> > new file mode 100644
> > index ..f8adde7f08e2
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c
> > @@ -0,0 +1,13 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include 
> > +#include 
> > +#include 
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +SEC("uretprobe.multi/./uprobe_compat:main")
> > +int uretprobe_compat(struct pt_regs *ctx)
> > +{
> > +   bpf_printk("uretprobe compat\n");
> > +   return 0;
> > +}
> > --
> > 2.44.0
> >



Re: [PATCHv3 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test

2024-04-29 Thread Jiri Olsa
On Fri, Apr 26, 2024 at 11:03:29AM -0700, Andrii Nakryiko wrote:
> On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa  wrote:
> >
> > Adding test to verify that when called from outside of the
> > trampoline provided by kernel, the uretprobe syscall will cause
> > calling process to receive SIGILL signal and the attached bpf
> > program is no executed.
> >
> > Signed-off-by: Jiri Olsa 
> > ---
> >  .../selftests/bpf/prog_tests/uprobe_syscall.c | 92 +++
> >  .../selftests/bpf/progs/uprobe_syscall_call.c | 15 +++
> >  2 files changed, 107 insertions(+)
> >  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
> >
> 
> See nits below, but overall LGTM
> 
> Acked-by: Andrii Nakryiko 
> 
> [...]
> 
> > @@ -219,6 +301,11 @@ static void test_uretprobe_regs_change(void)
> >  {
> > test__skip();
> >  }
> > +
> > +static void test_uretprobe_syscall_call(void)
> > +{
> > +   test__skip();
> > +}
> >  #endif
> >
> >  void test_uprobe_syscall(void)
> > @@ -228,3 +315,8 @@ void test_uprobe_syscall(void)
> > if (test__start_subtest("uretprobe_regs_change"))
> > test_uretprobe_regs_change();
> >  }
> > +
> > +void serial_test_uprobe_syscall_call(void)
> 
> does it need to be serial? non-serial are still run sequentially
> within a process (there is no multi-threading), it's more about some
> global effects on system.

plz see below

> 
> > +{
> > +   test_uretprobe_syscall_call();
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c 
> > b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
> > new file mode 100644
> > index ..5ea03bb47198
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
> > @@ -0,0 +1,15 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include "vmlinux.h"
> > +#include 
> > +#include 
> > +
> > +struct pt_regs regs;
> > +
> > +char _license[] SEC("license") = "GPL";
> > +
> > +SEC("uretprobe//proc/self/exe:uretprobe_syscall_call")
> > +int uretprobe(struct pt_regs *regs)
> > +{
> > +   bpf_printk("uretprobe called");
> 
> debugging leftover? we probably don't want to pollute trace_pipe from test

the reason for this is to make sure the bpf program was not executed:

the test makes sure the child gets killed with SIGILL, and also verifies
that the bpf program was not executed by checking the trace_pipe and
making sure nothing was received

the trace_pipe reading is also why the test is serial

jirka
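
In other words, the test forks a child that invokes the uretprobe syscall
directly, outside of the kernel-provided trampoline, and the parent checks
that the child dies with SIGILL while nothing shows up in trace_pipe. A
rough, hypothetical sketch of that fork/SIGILL check (the syscall number
below is an assumption; use whatever __NR_uretprobe the series assigns):

    #include <assert.h>
    #include <signal.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #ifndef __NR_uretprobe
    #define __NR_uretprobe 335   /* assumed x86_64 number, adjust to the series */
    #endif

    int main(void)
    {
            pid_t pid = fork();

            if (pid == 0) {
                    /* not called from the uretprobe trampoline -> expect SIGILL */
                    syscall(__NR_uretprobe);
                    _exit(0);   /* should never be reached */
            } else {
                    int status;

                    waitpid(pid, &status, 0);
                    assert(WIFSIGNALED(status) && WTERMSIG(status) == SIGILL);
            }
            return 0;
    }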

> 
> > +   return 0;
> > +}
> > --
> > 2.44.0
> >



[PATCH] livepatch.h: Add comment to klp transition state

2024-04-29 Thread zhangwarden
From: Wardenjohn 

livepatch.h uses KLP_UNDEFINED/KLP_UNPATCHED/KLP_PATCHED for the klp
transition state. When livepatch is ready but idle, using KLP_UNDEFINED
seems very confusing. In order not to introduce potential risks to the
kernel, just update the comments on these states.
---
 include/linux/livepatch.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/livepatch.h b/include/linux/livepatch.h
index 9b9b38e89563..b6a214f2f8e3 100644
--- a/include/linux/livepatch.h
+++ b/include/linux/livepatch.h
@@ -18,9 +18,9 @@
 #if IS_ENABLED(CONFIG_LIVEPATCH)
 
 /* task patch states */
-#define KLP_UNDEFINED  -1
-#define KLP_UNPATCHED   0
-#define KLP_PATCHED 1
+#define KLP_UNDEFINED  -1 /* idle, no transition in progress */
+#define KLP_UNPATCHED   0 /* transitioning to unpatched state */
+#define KLP_PATCHED 1 /* transitioning to patched state */
 
 /**
  * struct klp_func - function structure for live patching
-- 
2.37.3
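
For readers unfamiliar with these constants: the livepatch transition code
holds a target state that is KLP_PATCHED or KLP_UNPATCHED only while a
transition is in flight, and drops back to KLP_UNDEFINED once the transition
completes, which is the "idle" meaning the new comments spell out. A
simplified sketch of that usage, loosely modeled on
kernel/livepatch/transition.c (not the actual kernel code):

    static int klp_target_state = KLP_UNDEFINED;    /* idle by default */

    static void start_transition(bool patch)
    {
            /* a transition is now in progress */
            klp_target_state = patch ? KLP_PATCHED : KLP_UNPATCHED;
            /* ... migrate every task towards klp_target_state ... */
    }

    static void complete_transition(void)
    {
            /* ... all tasks have reached the target state ... */
            klp_target_state = KLP_UNDEFINED;       /* back to idle */
    }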




Re: [PATCH 0/4] vhost: Cleanup

2024-04-29 Thread Michael S. Tsirkin
On Tue, Apr 23, 2024 at 01:24:03PM +1000, Gavin Shan wrote:
> This is suggested by Michael S. Tsirkin according to [1] and the goal
> is to apply smp_rmb() inside vhost_get_avail_idx() if needed. With it,
> callers of the function needn't worry about memory barriers. While
> we're here, other cleanups are also applied.


Gavin I suggested another approach.
1. Start with the patch I sent (vhost: order avail ring reads after
   index updates) just do a diff against latest.
   simplify error handling a bit.
2. Do any other cleanups on top.

> [1] 
> https://lore.kernel.org/virtualization/20240327075940-mutt-send-email-...@kernel.org/
> 
> PATCH[1] drops the local variable @last_avail_idx since it's equivalent
>  to vq->last_avail_idx
> PATCH[2] improves vhost_get_avail_idx() so that smp_rmb() is applied if
>  needed. Besides, the sanity checks on the retrieved available
>  queue index are also squeezed to vhost_get_avail_idx()
> PATCH[3] improves vhost_get_avail_head(), similar to what we're doing
>  for vhost_get_avail_idx(), so that the relevant sanity checks
>  on the head are squeezed to vhost_get_avail_head()
> PATCH[4] Reformat vhost_{get, put}_user() by using tab instead of space
>  as the terminator for each line
> 
> Gavin Shan (4):
>   vhost: Drop variable last_avail_idx in vhost_get_vq_desc()
>   vhost: Improve vhost_get_avail_idx() with smp_rmb()
>   vhost: Improve vhost_get_avail_head()
>   vhost: Reformat vhost_{get, put}_user()
> 
>  drivers/vhost/vhost.c | 199 +++---
>  1 file changed, 88 insertions(+), 111 deletions(-)
> 
> -- 
> 2.44.0
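
For reference, the idea being discussed, ordering avail ring reads after the
index read by folding the barrier into the helper, looks roughly like the
sketch below. This is a hedged illustration of the approach, not the actual
vhost code:

    static int vhost_get_avail_idx(struct vhost_virtqueue *vq)
    {
            __virtio16 idx;
            int r;

            r = vhost_get_avail(vq, idx, &vq->avail->idx);
            if (unlikely(r))
                    return r;

            vq->avail_idx = vhost16_to_cpu(vq, idx);
            if (vq->avail_idx == vq->last_avail_idx)
                    return 0;       /* nothing new */

            /* order subsequent avail ring reads after the index read */
            smp_rmb();
            return 1;               /* new entries are available */
    }

Callers that previously issued their own smp_rmb() after comparing the two
indices can then drop it and simply act on the return value.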