Re: [PATCH v13 14/14] selftests/sgx: Add scripts for EPC cgroup testing

2024-05-02 Thread Jarkko Sakkinen
On Tue Apr 30, 2024 at 10:51 PM EEST, Haitao Huang wrote:
> With different cgroups, the script starts one or multiple concurrent SGX
> selftests (test_sgx), each to run the unclobbered_vdso_oversubscribed
> test case, which loads an enclave of EPC size equal to the EPC capacity
> available on the platform. The script checks results against the
> expectation set for each cgroup and reports success or failure.
>
> The script creates 3 different cgroups at the beginning with following
> expectations:
>
> 1) small - intentionally small enough to fail the test loading an
> enclave of size equal to the capacity.
> 2) large - large enough to run up to 4 concurrent tests but fail some if
> more than 4 concurrent tests are run. The script starts 4 expecting at
> least one test to pass, and then starts 5 expecting at least one test
> to fail.
> 3) larger - limit is the same as the capacity, large enough to run lots of
> concurrent tests. The script starts 8 of them and expects all pass.
> Then it reruns the same test with one process randomly killed and
> usage checked to be zero after all processes exit.
>
> The script also includes a test with low mem_cg limit and large sgx_epc
> limit to verify that the RAM used for per-cgroup reclamation is charged
> to a proper mem_cg. For this test, it turns off swapping before start,
> and turns swapping back on afterwards.
>
> Add README to document how to run the tests.
>
> Signed-off-by: Haitao Huang 

Here's the transcript:

make: Entering directory '/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx'
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c main.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c load.c -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c sigstruct.c -o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c call.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-c sign_key.S -o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o
gcc -Wall -Werror -g 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include -fPIC 
-o /home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_sgx 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/main.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/load.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sigstruct.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/call.o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/sign_key.o -z noexecstack 
-lcrypto
gcc -Wall -Werror -static-pie -nostdlib -ffreestanding -fPIE 
-fno-stack-protector -mrdrnd 
-I/home/jarkko/linux-tpmdd/tools/testing/selftests/../../../tools/include 
test_encl.c test_encl_bootstrap.S -o 
/home/jarkko/linux-tpmdd/tools/testing/selftests/sgx/test_encl.elf 
-Wl,-T,test_encl.lds,--build-id=none
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: 
warning: /tmp/ccToNCLw.o: missing .note.GNU-stack section implies executable 
stack
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: NOTE: 
This behaviour is deprecated and will be removed in a future version of the 
linker
TAP version 13
1..2
# timeout set to 120
# selftests: sgx: test_sgx
# TAP version 13
# 1..16
# # Starting 16 tests from 1 test cases.
# #  RUN   enclave.unclobbered_vdso ...
# #OK  enclave.unclobbered_vdso
# ok 1 enclave.unclobbered_vdso
# #  RUN   enclave.unclobbered_vdso_oversubscribed ...
# #OK  enclave.unclobbered_vdso_oversubscribed
# ok 2 enclave.unclobbered_vdso_oversubscribed
# #  RUN   enclave.unclobbered_vdso_oversubscribed_remove ...
# # main.c:402:unclobbered_vdso_oversubscribed_remove:Creating an enclave with 
98566144 bytes heap may take a while ...
# # main.c:457:unclobbered_vdso_oversubscribed_remove:Changing type of 98566144 
bytes to trimmed may take a while ...
# # main.c:473:unclobbered_vdso_oversubscribed_remove:Entering enclave to run 
EACCEPT for each page of 98566144 bytes may take a while ...
# # main.c:494:unclobbered_vdso_oversubscribed_remove:Removing 98566144 bytes 
from enclave may take a while ...
# #OK  enclave.unclobbered_vdso_oversubscribed_remove
# ok 3 enclave.unclobbered_vdso_oversubscribed_remove
# #  RUN   enclave.clobbered_vdso ...
# #OK  enclave.clobbered_vdso
# ok 4 enclave.clobbered_vdso
# #  RUN   enclave.clobbered_vdso_and_user_function ...
# #OK  enclave.clobbered_vdso_and_user_function
# ok 5 

Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr

2024-05-02 Thread 吳澤南
On Thu, 2024-05-02 at 10:09 -0400, Steven Rostedt wrote:
>
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  On Thu, 2 May 2024 06:49:18 +
> Tze-nan Wu (吳澤南)  wrote:
> 
> > Good news, this patch works, the test has passed, no more Kasan
> report
> > in my environment.
> 
> Great to hear!
> 
> > 
> > my environment:
> >  arm64 + kasan + swtag based kasan + kernel-6.6.18
> > 
> > Really appreciate, and learn a lot from the patch.
> 
> Can I add:
> 
> Tested-by: Tze-nan Wu (吳澤南) 
> 
> ?
> 
> -- Steve

Certainly, glad to be list as a tester.

-- Tze-nan


Re: [PATCH v7 00/16] mm: jit/text allocator

2024-05-02 Thread Liviu Dudau
On Thu, May 02, 2024 at 04:07:05PM -0700, Luis Chamberlain wrote:
> On Thu, May 02, 2024 at 11:50:36PM +0100, Liviu Dudau wrote:
> > On Mon, Apr 29, 2024 at 09:29:20AM -0700, Luis Chamberlain wrote:
> > > On Mon, Apr 29, 2024 at 03:16:04PM +0300, Mike Rapoport wrote:
> > > > From: "Mike Rapoport (IBM)" 
> > > > 
> > > > Hi,
> > > > 
> > > > The patches are also available in git:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v7
> > > > 
> > > > v7 changes:
> > > > * define MODULE_{VADDR,END} for riscv32 to fix the build and avoid
> > > >   #ifdefs in a function body
> > > > * add Acks, thanks everybody
> > > 
> > > Thanks, I've pushed this to modules-next for further exposure / testing.
> > > Given the status of testing so far with prior revisions, in that only a
> > > few issues were found and that those were fixed, and the status of
> > > reviews, this just might be ripe for v6.10.
> > 
> > Looks like there is still some work needed. I've picked up next-20240501
> > and on arch/mips with CONFIG_MODULE_COMPRESS_XZ=y and 
> > CONFIG_MODULE_DECOMPRESS=y
> > I fail to load any module:
> > 
> > # modprobe rfkill
> > [11746.539090] Invalid ELF header magic: != ELF
> > [11746.587149] execmem: unable to allocate memory
> > modprobe: can't load module rfkill (kernel/net/rfkill/rfkill.ko.xz): Out of 
> > memory
> > 
> > The (hopefully) relevant parts of my .config:
> 
> Thanks for the report! Any chance we can get you to try a bisection? I
> think it should take 2-3 test boots. To help reduce scope you try 
> modules-next:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=modules-next
> 
> Then can you check by resetting your tree to commmit 3fbe6c2f820a76 (mm:
> introduce execmem_alloc() and execmem_free()"). I suspect that should
> boot, so your bad commit would be the tip 3c2c250cb3a5fbb ("bpf: remove
> CONFIG_BPF_JIT dependency on CONFIG_MODULES of").
> 
> That gives us only a few commits to bisect:
> 
> git log --oneline 3fbe6c2f820a76bc36d5546bda85832f57c8fce2..
> 3c2c250cb3a5 (HEAD -> modules-next, korg/modules-next) bpf: remove 
> CONFIG_BPF_JIT dependency on CONFIG_MODULES of
> 11e8e65cce5c kprobes: remove dependency on CONFIG_MODULES
> e10cbc38697b powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where 
> appropriate
> 4da3d38f24c5 x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
> 13ae3d74ee70 arch: make execmem setup available regardless of CONFIG_MODULES
> 460bbbc70a47 powerpc: extend execmem_params for kprobes allocations
> e1a14069b5b4 arm64: extend execmem_info for generated code allocations
> 971e181c6585 riscv: extend execmem_params for generated code allocations
> 0fa276f26721 mm/execmem, arch: convert remaining overrides of module_alloc to 
> execmem
> 022cef244287 mm/execmem, arch: convert simple overrides of module_alloc to 
> execmem
> 
> With 2-3 boots we should be to tell which is the bad commit.

Looks like 0fa276f26721 is the first bad commit.

$ git bisect log
# bad: [3c2c250cb3a5fbbccc4a4ff4c9354c54af91f02c] bpf: remove CONFIG_BPF_JIT 
dependency on CONFIG_MODULES of
# good: [3fbe6c2f820a76bc36d5546bda85832f57c8fce2] mm: introduce 
execmem_alloc() and execmem_free()
git bisect start '3c2c250cb3a5' '3fbe6c2f820a76'
# bad: [460bbbc70a47e929b1936ca68979f3b79f168fc6] powerpc: extend 
execmem_params for kprobes allocations
git bisect bad 460bbbc70a47e929b1936ca68979f3b79f168fc6
# bad: [0fa276f26721e0ffc2ae9c7cf67dcc005b43c67e] mm/execmem, arch: convert 
remaining overrides of module_alloc to execmem
git bisect bad 0fa276f26721e0ffc2ae9c7cf67dcc005b43c67e
# good: [022cef2442870db738a366d3b7a636040c081859] mm/execmem, arch: convert 
simple overrides of module_alloc to execmem
git bisect good 022cef2442870db738a366d3b7a636040c081859
# first bad commit: [0fa276f26721e0ffc2ae9c7cf67dcc005b43c67e] mm/execmem, 
arch: convert remaining overrides of module_alloc to execmem

Maybe MIPS also needs a ARCH_WANTS_EXECMEM_LATE?

Best regards,
Liviu

> 
>   Luis
> 

-- 
Everyone who uses computers frequently has had, from time to time,
a mad desire to attack the precocious abacus with an axe.
  -- John D. Clark, Ignition!



Re: [PATCH resend ftrace] Asynchronous grace period for register_ftrace_direct()

2024-05-02 Thread Steven Rostedt
On Thu, 2 May 2024 16:13:59 -0700
"Paul E. McKenney"  wrote:

> Very good, and thank you!
> 
> I will drop it from RCU as soon as it shows up in either -next or in
> mainline.

Sounds good.

I'm currently working on updates to get into -rc7 and plan to add my next
work on top of that (I know, I know, it's probably the latest release I had
for next, but things are still being worked on).

-- Steve



Re: [PATCH v13 11/14] x86/sgx: Abstract check for global reclaimable pages

2024-05-02 Thread Huang, Kai




On 1/05/2024 7:51 am, Haitao Huang wrote:

From: Kristen Carlson Accardi 

For the global reclaimer to determine if any page available for
reclamation at the global level, it currently only checks for emptiness
of the global LRU. That will be inadequate when pages are tracked in
multiple LRUs, one per cgroup. For this purpose, create a new helper,
sgx_can_reclaim_global(), to abstract this check. Currently it only
checks the global LRU, later will check emptiness of LRUs of all cgroups
when per-cgroup tracking is turned on.

Replace all the checks for emptiness of the global LRU,
list_empty(_global_lru.reclaimable), with calls to
sgx_can_reclaim_global().

Rename sgx_should_reclaim() to sgx_should_reclaim_global() as it is used
to check if a global reclamation should be performed.

Co-developed-by: Sean Christopherson 
Signed-off-by: Sean Christopherson 
Signed-off-by: Kristen Carlson Accardi 
Co-developed-by: Haitao Huang 
Signed-off-by: Haitao Huang 
Tested-by: Jarkko Sakkinen 
---


Free free to add:

Reviewed-by: Kai Huang 

One thing below:

[...]


-static bool sgx_should_reclaim(unsigned long watermark)
+static bool sgx_should_reclaim_global(unsigned long watermark)
  {
return atomic_long_read(_nr_free_pages) < watermark &&
-  !list_empty(_global_lru.reclaimable);
+   sgx_can_reclaim_global();
  }
  
  static void sgx_reclaim_pages_global(struct mm_struct *charge_mm)

@@ -405,7 +413,7 @@ static void sgx_reclaim_pages_global(struct mm_struct 
*charge_mm)
   */
  void sgx_reclaim_direct(void)
  {
-   if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
+   if (sgx_should_reclaim_global(SGX_NR_LOW_PAGES))
sgx_reclaim_pages_global(current->mm);
  }
  


Hmm.. Sorry for not pointing out in the previous version.

Perhaps it makes more sense to do the rename in the patch ...

  x86/sgx: Add basic EPC reclamation flow for cgroup

... where we have actually introduced the concept of per-cgroup reclaim, 
and we literally have below change in that patch:


 void sgx_reclaim_direct(void)
 {
if (sgx_should_reclaim(SGX_NR_LOW_PAGES))
-   sgx_reclaim_pages();
+   sgx_reclaim_pages_global();
 }

So in that patch, the sgx_should_reclaim() literally just means we 
should do gloabl reclaim, but not the per-cgruop reclaim.  Thus, perhaps 
we just do the renaming here together with the new 
sgx_reclaim_pages_global().


If there's a new version needed, please move the renaming to that patch?



Re: [PATCH resend ftrace] Asynchronous grace period for register_ftrace_direct()

2024-05-02 Thread Paul E. McKenney
On Thu, May 02, 2024 at 05:31:00PM -0400, Steven Rostedt wrote:
> On Wed, 1 May 2024 20:31:06 -0700
> "Paul E. McKenney"  wrote:
> 
> > On Thu, May 02, 2024 at 11:05:01AM +0900, Masami Hiramatsu wrote:
> > > On Wed, 1 May 2024 16:12:37 -0700
> > > "Paul E. McKenney"  wrote:
> > >   
> > > > Note that the immediate pressure for this patch should be relieved by 
> > > > the
> > > > NAPI patch series [1], but this sort of problem could easily arise 
> > > > again.
> > > > 
> > > > When running heavy test workloads with KASAN enabled, RCU Tasks grace
> > > > periods can extend for many tens of seconds, significantly slowing
> > > > trace registration.  Therefore, make the registration-side RCU Tasks
> > > > grace period be asynchronous via call_rcu_tasks().  
> > > 
> > > Good catch! AFAICS, there is no reason to wait for synchronization
> > > when adding a new direct trampoline.
> > > This looks good to me.
> > > 
> > > Reviewed-by: Masami Hiramatsu (Google)   
> > 
> > Thank you very much!  I will apply this on my next rebase.
> 
> I can take it.
> 
> It's not a bug fix but just an performance improvement, so it can go into
> the next merge window.

Very good, and thank you!

I will drop it from RCU as soon as it shows up in either -next or in
mainline.

Thanx, Paul

> -- Steve
> 
> 
> 
> > 
> > > Thank you,
> > >   
> > > > [1]
> > > > https://lore.kernel.org/all/cover.1710877680.git@cloudflare.com/
> > > > 
> > > > Reported-by: Jakub Kicinski 
> > > > Reported-by: Alexei Starovoitov 
> > > > Reported-by: Chris Mason 
> > > > Signed-off-by: Paul E. McKenney 
> > > > Cc: Steven Rostedt 
> > > > Cc: Masami Hiramatsu 
> > > > Cc: Mark Rutland 
> > > > Cc: Mathieu Desnoyers 
> > > > Cc: 
> > > > 
> > > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > > > index 6c96b30f3d63b..32ea92934268c 100644
> > > > --- a/kernel/trace/ftrace.c
> > > > +++ b/kernel/trace/ftrace.c
> > > > @@ -5365,6 +5365,13 @@ static void
> > > > remove_direct_functions_hash(struct ftrace_hash *hash, unsigned long }
> > > >  }
> > > >  
> > > > +static void register_ftrace_direct_cb(struct rcu_head *rhp)
> > > > +{
> > > > +   struct ftrace_hash *fhp = container_of(rhp, struct
> > > > ftrace_hash, rcu); +
> > > > +   free_ftrace_hash(fhp);
> > > > +}
> > > > +
> > > >  /**
> > > >   * register_ftrace_direct - Call a custom trampoline directly
> > > >   * for multiple functions registered in @ops
> > > > @@ -5463,10 +5470,8 @@ int register_ftrace_direct(struct ftrace_ops
> > > > *ops, unsigned long addr) out_unlock:
> > > > mutex_unlock(_mutex);
> > > >  
> > > > -   if (free_hash && free_hash != EMPTY_HASH) {
> > > > -   synchronize_rcu_tasks();
> > > > -   free_ftrace_hash(free_hash);
> > > > -   }
> > > > +   if (free_hash && free_hash != EMPTY_HASH)
> > > > +   call_rcu_tasks(_hash->rcu,
> > > > register_ftrace_direct_cb); 
> > > > if (new_hash)
> > > > free_ftrace_hash(new_hash);  
> > > 
> > > 
> > > -- 
> > > Masami Hiramatsu (Google)   
> 



[PATCH 1/2] drivers: remoteproc: xlnx: add attach detach support

2024-05-02 Thread Tanmay Shah
It is possible that remote processor is already running before
linux boot or remoteproc platform driver probe. Implement required
remoteproc framework ops to provide resource table address and
connect or disconnect with remote processor in such case.

Signed-off-by: Tanmay Shah 
---
 drivers/remoteproc/xlnx_r5_remoteproc.c | 164 +++-
 1 file changed, 160 insertions(+), 4 deletions(-)

diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
b/drivers/remoteproc/xlnx_r5_remoteproc.c
index 84243d1dff9f..af7aff5e9098 100644
--- a/drivers/remoteproc/xlnx_r5_remoteproc.c
+++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
@@ -25,6 +25,10 @@
 /* RX mailbox client buffer max length */
 #define MBOX_CLIENT_BUF_MAX(IPI_BUF_LEN_MAX + \
 sizeof(struct zynqmp_ipi_message))
+
+#define RSC_TBL_XLNX_MAGIC ((uint32_t)'x' << 24 | (uint32_t)'a' << 16 | \
+(uint32_t)'m' << 8 | (uint32_t)'p')
+
 /*
  * settings for RPU cluster mode which
  * reflects possible values of xlnx,cluster-mode dt-property
@@ -73,6 +77,15 @@ struct mbox_info {
struct mbox_chan *rx_chan;
 };
 
+/* Xilinx Platform specific data structure */
+struct rsc_tbl_data {
+   const int version;
+   const u32 magic_num;
+   const u32 comp_magic_num;
+   const u32 rsc_tbl_size;
+   const uintptr_t rsc_tbl;
+} __packed;
+
 /*
  * Hardcoded TCM bank values. This will stay in driver to maintain backward
  * compatibility with device-tree that does not have TCM information.
@@ -95,20 +108,24 @@ static const struct mem_bank_data 
zynqmp_tcm_banks_lockstep[] = {
 /**
  * struct zynqmp_r5_core
  *
+ * @rsc_tbl_va: resource table virtual address
  * @dev: device of RPU instance
  * @np: device node of RPU instance
  * @tcm_bank_count: number TCM banks accessible to this RPU
  * @tcm_banks: array of each TCM bank data
  * @rproc: rproc handle
+ * @rsc_tbl_size: resource table size retrieved from remote
  * @pm_domain_id: RPU CPU power domain id
  * @ipi: pointer to mailbox information
  */
 struct zynqmp_r5_core {
+   struct resource_table *rsc_tbl_va;
struct device *dev;
struct device_node *np;
int tcm_bank_count;
struct mem_bank_data **tcm_banks;
struct rproc *rproc;
+   u32 rsc_tbl_size;
u32 pm_domain_id;
struct mbox_info *ipi;
 };
@@ -621,10 +638,19 @@ static int zynqmp_r5_rproc_prepare(struct rproc *rproc)
 {
int ret;
 
-   ret = add_tcm_banks(rproc);
-   if (ret) {
-   dev_err(>dev, "failed to get TCM banks, err %d\n", ret);
-   return ret;
+   /**
+* For attach/detach use case, Firmware is already loaded so
+* TCM isn't really needed at all. Also, for security TCM can be
+* locked in such case and linux may not have access at all.
+* So avoid adding TCM banks. TCM power-domains requested during attach
+* callback.
+*/
+   if (rproc->state != RPROC_DETACHED) {
+   ret = add_tcm_banks(rproc);
+   if (ret) {
+   dev_err(>dev, "failed to get TCM banks, err 
%d\n", ret);
+   return ret;
+   }
}
 
ret = add_mem_regions_carveout(rproc);
@@ -662,6 +688,123 @@ static int zynqmp_r5_rproc_unprepare(struct rproc *rproc)
return 0;
 }
 
+static struct resource_table *zynqmp_r5_get_loaded_rsc_table(struct rproc 
*rproc,
+size_t *size)
+{
+   struct zynqmp_r5_core *r5_core;
+
+   r5_core = rproc->priv;
+
+   *size = r5_core->rsc_tbl_size;
+
+   return r5_core->rsc_tbl_va;
+}
+
+static int zynqmp_r5_get_rsc_table_va(struct zynqmp_r5_core *r5_core)
+{
+   struct device *dev = r5_core->dev;
+   struct rsc_tbl_data *rsc_data_va;
+   struct resource_table *rsc_addr;
+   struct resource res_mem;
+   struct device_node *np;
+   int ret;
+
+   /**
+* It is expected from remote processor firmware to provide resource
+* table address via struct rsc_tbl_data data structure.
+* Start address of first entry under "memory-region" property list
+* contains that data structure which holds resource table address, size
+* and some magic number to validate correct resource table entry.
+*/
+   np = of_parse_phandle(r5_core->np, "memory-region", 0);
+   if (!np) {
+   dev_err(dev, "failed to get memory region dev node\n");
+   return -EINVAL;
+   }
+
+   ret = of_address_to_resource(np, 0, _mem);
+   if (ret) {
+   dev_err(dev, "failed to get memory-region resource addr\n");
+   return -EINVAL;
+   }
+
+   rsc_data_va = devm_ioremap_wc(dev, res_mem.start,
+ sizeof(struct rsc_tbl_data));
+   if (!rsc_data_va) {
+   dev_err(dev, "failed to map resource table data 

[PATCH 2/2] drivers: remoteproc: xlnx: add sram support

2024-05-02 Thread Tanmay Shah
AMD-Xilinx zynqmp platform contains on-chip sram memory (OCM).
R5 cores can access OCM and access is faster than DDR memory but slower
than TCM memories available. Sram region can have optional multiple
power-domains.

Signed-off-by: Tanmay Shah 
---
 drivers/remoteproc/xlnx_r5_remoteproc.c | 221 +++-
 1 file changed, 220 insertions(+), 1 deletion(-)

diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
b/drivers/remoteproc/xlnx_r5_remoteproc.c
index af7aff5e9098..47c08b013152 100644
--- a/drivers/remoteproc/xlnx_r5_remoteproc.c
+++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
@@ -56,6 +56,21 @@ struct mem_bank_data {
char *bank_name;
 };
 
+/**
+ * struct zynqmp_sram_bank - sram bank description
+ *
+ * @sram_res: sram address region information
+ * @power_domains: Array of pm domain id
+ * @num_pd: total pm domain id count
+ * @da: device address of sram
+ */
+struct zynqmp_sram_bank {
+   struct resource sram_res;
+   int *power_domains;
+   int num_pd;
+   u32 da;
+};
+
 /**
  * struct mbox_info
  *
@@ -109,6 +124,8 @@ static const struct mem_bank_data 
zynqmp_tcm_banks_lockstep[] = {
  * struct zynqmp_r5_core
  *
  * @rsc_tbl_va: resource table virtual address
+ * @sram: Array of sram memories assigned to this core
+ * @num_sram: number of sram for this core
  * @dev: device of RPU instance
  * @np: device node of RPU instance
  * @tcm_bank_count: number TCM banks accessible to this RPU
@@ -120,6 +137,8 @@ static const struct mem_bank_data 
zynqmp_tcm_banks_lockstep[] = {
  */
 struct zynqmp_r5_core {
struct resource_table *rsc_tbl_va;
+   struct zynqmp_sram_bank **sram;
+   int num_sram;
struct device *dev;
struct device_node *np;
int tcm_bank_count;
@@ -483,6 +502,69 @@ static int add_mem_regions_carveout(struct rproc *rproc)
return 0;
 }
 
+static int add_sram_carveouts(struct rproc *rproc)
+{
+   struct zynqmp_r5_core *r5_core = rproc->priv;
+   struct rproc_mem_entry *rproc_mem;
+   struct zynqmp_sram_bank *sram;
+   dma_addr_t dma_addr;
+   int da, i, j, ret;
+   size_t len;
+
+   for (i = 0; i < r5_core->num_sram; i++) {
+   sram = r5_core->sram[i];
+
+   dma_addr = (dma_addr_t)sram->sram_res.start;
+   len = resource_size(>sram_res);
+   da = sram->da;
+
+   for (j = 0; j < sram->num_pd; j++) {
+   ret = zynqmp_pm_request_node(sram->power_domains[j],
+
ZYNQMP_PM_CAPABILITY_ACCESS, 0,
+
ZYNQMP_PM_REQUEST_ACK_BLOCKING);
+   if (ret < 0) {
+   dev_err(r5_core->dev,
+   "failed to request on SRAM pd 0x%x",
+   sram->power_domains[j]);
+   goto fail_sram;
+   } else {
+   pr_err("sram pd 0x%x request success\n",
+  sram->power_domains[j]);
+   }
+   }
+
+   /* Register associated reserved memory regions */
+   rproc_mem = rproc_mem_entry_init(>dev, NULL,
+(dma_addr_t)dma_addr,
+len, da,
+zynqmp_r5_mem_region_map,
+zynqmp_r5_mem_region_unmap,
+sram->sram_res.name);
+
+   rproc_add_carveout(rproc, rproc_mem);
+   rproc_coredump_add_segment(rproc, da, len);
+
+   dev_err(>dev, "sram carveout %s addr=%llx, da=0x%x, 
size=0x%lx",
+   sram->sram_res.name, dma_addr, da, len);
+   }
+
+   return 0;
+
+fail_sram:
+   /* Release current sram pd. */
+   while (--j >= 0)
+   zynqmp_pm_release_node(sram->power_domains[j]);
+
+   /* Release previously requested sram pd. */
+   while (--i >= 0) {
+   sram = r5_core->sram[i];
+   for (j = 0; j < sram->num_pd; j++)
+   zynqmp_pm_release_node(sram->power_domains[j]);
+   }
+
+   return ret;
+}
+
 /*
  * tcm_mem_unmap()
  * @rproc: single R5 core's corresponding rproc instance
@@ -659,6 +741,12 @@ static int zynqmp_r5_rproc_prepare(struct rproc *rproc)
return ret;
}
 
+   ret = add_sram_carveouts(rproc);
+   if (ret) {
+   dev_err(>dev, "failed to get sram carveout %d\n", ret);
+   return ret;
+   }
+
return 0;
 }
 
@@ -673,8 +761,9 @@ static int zynqmp_r5_rproc_prepare(struct rproc *rproc)
 static int zynqmp_r5_rproc_unprepare(struct rproc *rproc)
 {
struct zynqmp_r5_core *r5_core;
+   struct zynqmp_sram_bank *sram;
u32 

[PATCH 0/2] remoteproc: xlnx: Add attach detach ops and sram support

2024-05-02 Thread Tanmay Shah
Attach detach ops are needed to connect to remote processor that is
running before remoteproc driver is probed. Implement remoteproc
framework ops that enables such use case on AMD-Xilinx platforms.

Remote processor can also use On Chip sram Memory (OCM) for various
purpose. For example, for fast code execution or data access compare
to DDR memory. Such sram region is made available to remoteproc nodes
via "sram" property. Add support in driver to parse and use OCM memory
via sram property.

Tanmay Shah (2):
  drivers: remoteproc: xlnx: add attach detach support
  drivers: remoteproc: xlnx: add sram support

 drivers/remoteproc/xlnx_r5_remoteproc.c | 385 +++-
 1 file changed, 380 insertions(+), 5 deletions(-)


base-commit: 0496190c4d42965acb31b9da1b6dac3509791062
-- 
2.25.1




Re: [PATCH v2 1/2] tracing/user_events: Fix non-spaced field matching

2024-05-02 Thread Steven Rostedt
On Thu, 2 May 2024 15:58:53 -0700
Beau Belgrave  wrote:

> It's not an issue on the matching/logic. However, you do get an extra
> byte alloc (which doesn't bother me in this edge case).

Figured as much, but since there was no mention of it, I decided to bring
it up.

I'll take this as-is then.

-- Steve



Re: [PATCH v7 00/16] mm: jit/text allocator

2024-05-02 Thread Luis Chamberlain
On Thu, May 02, 2024 at 11:50:36PM +0100, Liviu Dudau wrote:
> On Mon, Apr 29, 2024 at 09:29:20AM -0700, Luis Chamberlain wrote:
> > On Mon, Apr 29, 2024 at 03:16:04PM +0300, Mike Rapoport wrote:
> > > From: "Mike Rapoport (IBM)" 
> > > 
> > > Hi,
> > > 
> > > The patches are also available in git:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v7
> > > 
> > > v7 changes:
> > > * define MODULE_{VADDR,END} for riscv32 to fix the build and avoid
> > >   #ifdefs in a function body
> > > * add Acks, thanks everybody
> > 
> > Thanks, I've pushed this to modules-next for further exposure / testing.
> > Given the status of testing so far with prior revisions, in that only a
> > few issues were found and that those were fixed, and the status of
> > reviews, this just might be ripe for v6.10.
> 
> Looks like there is still some work needed. I've picked up next-20240501
> and on arch/mips with CONFIG_MODULE_COMPRESS_XZ=y and 
> CONFIG_MODULE_DECOMPRESS=y
> I fail to load any module:
> 
> # modprobe rfkill
> [11746.539090] Invalid ELF header magic: != ELF
> [11746.587149] execmem: unable to allocate memory
> modprobe: can't load module rfkill (kernel/net/rfkill/rfkill.ko.xz): Out of 
> memory
> 
> The (hopefully) relevant parts of my .config:

Thanks for the report! Any chance we can get you to try a bisection? I
think it should take 2-3 test boots. To help reduce scope you try modules-next:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=modules-next

Then can you check by resetting your tree to commmit 3fbe6c2f820a76 (mm:
introduce execmem_alloc() and execmem_free()"). I suspect that should
boot, so your bad commit would be the tip 3c2c250cb3a5fbb ("bpf: remove
CONFIG_BPF_JIT dependency on CONFIG_MODULES of").

That gives us only a few commits to bisect:

git log --oneline 3fbe6c2f820a76bc36d5546bda85832f57c8fce2..
3c2c250cb3a5 (HEAD -> modules-next, korg/modules-next) bpf: remove 
CONFIG_BPF_JIT dependency on CONFIG_MODULES of
11e8e65cce5c kprobes: remove dependency on CONFIG_MODULES
e10cbc38697b powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where 
appropriate
4da3d38f24c5 x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
13ae3d74ee70 arch: make execmem setup available regardless of CONFIG_MODULES
460bbbc70a47 powerpc: extend execmem_params for kprobes allocations
e1a14069b5b4 arm64: extend execmem_info for generated code allocations
971e181c6585 riscv: extend execmem_params for generated code allocations
0fa276f26721 mm/execmem, arch: convert remaining overrides of module_alloc to 
execmem
022cef244287 mm/execmem, arch: convert simple overrides of module_alloc to 
execmem

With 2-3 boots we should be to tell which is the bad commit.

  Luis



Re: [PATCH v7 00/16] mm: jit/text allocator

2024-05-02 Thread Liviu Dudau
On Mon, Apr 29, 2024 at 09:29:20AM -0700, Luis Chamberlain wrote:
> On Mon, Apr 29, 2024 at 03:16:04PM +0300, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" 
> > 
> > Hi,
> > 
> > The patches are also available in git:
> > https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=execmem/v7
> > 
> > v7 changes:
> > * define MODULE_{VADDR,END} for riscv32 to fix the build and avoid
> >   #ifdefs in a function body
> > * add Acks, thanks everybody
> 
> Thanks, I've pushed this to modules-next for further exposure / testing.
> Given the status of testing so far with prior revisions, in that only a
> few issues were found and that those were fixed, and the status of
> reviews, this just might be ripe for v6.10.

Looks like there is still some work needed. I've picked up next-20240501
and on arch/mips with CONFIG_MODULE_COMPRESS_XZ=y and CONFIG_MODULE_DECOMPRESS=y
I fail to load any module:

# modprobe rfkill
[11746.539090] Invalid ELF header magic: != ELF
[11746.587149] execmem: unable to allocate memory
modprobe: can't load module rfkill (kernel/net/rfkill/rfkill.ko.xz): Out of 
memory

The (hopefully) relevant parts of my .config:

CONFIG_HAVE_KERNEL_XZ=y
CONFIG_MIPS=y
CONFIG_RALINK=y
CONFIG_SOC_MT7621=y
CONFIG_EXECMEM=y
CONFIG_MODULES_USE_ELF_REL=y
CONFIG_MODULES=y
# CONFIG_MODULE_DEBUG is not set
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODULE_UNLOAD_TAINT_TRACKING is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_MODULE_SIG is not set
# CONFIG_MODULE_COMPRESS_NONE is not set
# CONFIG_MODULE_COMPRESS_GZIP is not set
CONFIG_MODULE_COMPRESS_XZ=y
# CONFIG_MODULE_COMPRESS_ZSTD is not set
CONFIG_MODULE_DECOMPRESS=y
# CONFIG_MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS is not set
CONFIG_MODULES_TREE_LOOKUP=y


Best regards,
Liviu


> 
>   Luis
> 

-- 
Everyone who uses computers frequently has had, from time to time,
a mad desire to attack the precocious abacus with an axe.
  -- John D. Clark, Ignition!



Re: [PATCH v2 1/2] tracing/user_events: Fix non-spaced field matching

2024-05-02 Thread Beau Belgrave
On Thu, May 02, 2024 at 05:16:34PM -0400, Steven Rostedt wrote:
> On Tue, 23 Apr 2024 16:23:37 +
> Beau Belgrave  wrote:
> 
> > When the ABI was updated to prevent same name w/different args, it
> > missed an important corner case when fields don't end with a space.
> > Typically, space is used for fields to help separate them, like
> > "u8 field1; u8 field2". If no spaces are used, like
> > "u8 field1;u8 field2", then the parsing works for the first time.
> > However, the match check fails on a subsequent register, leading to
> > confusion.
> > 
> > This is because the match check uses argv_split() and assumes that all
> > fields will be split upon the space. When spaces are used, we get back
> > { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }.
> > This causes a mismatch, and the user program gets back -EADDRINUSE.
> > 
> > Add a method to detect this case before calling argv_split(). If found
> > force a space after the field separator character ';'. This ensures all
> > cases work properly for matching.
> > 
> > With this fix, the following are all treated as matching:
> > u8 field1;u8 field2
> > u8 field1; u8 field2
> > u8 field1;\tu8 field2
> > u8 field1;\nu8 field2
> 
> I'm curious, what happens if you have: "u8 field1; u8 field2;" ?
> 

You'll get an extra whitespace during the copy, assuming it was really:
"u8 field1;u8 field2"

If it had spaces, this code wouldn't run.

> Do you care? As you will then create "u8 field1; u8 field2; "
> 
> but I'm guessing the extra whitespace at the end doesn't affect anything.
> 

Right, you get an extra byte allocated, but the argv_split() with ignore
it. The compare will work correctly (I've verified this just now to
double check).

IE these all match:
"Test u8 a; u8 b; "
"Test u8 a; u8 b;"
"Test u8 a; u8 b"

> 
> > 
> > Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different 
> > args event")
> > Signed-off-by: Beau Belgrave 
> > ---
> >  kernel/trace/trace_events_user.c | 76 +++-
> >  1 file changed, 75 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/trace/trace_events_user.c 
> > b/kernel/trace/trace_events_user.c
> > index 70d428c394b6..82b191f33a28 100644
> > --- a/kernel/trace/trace_events_user.c
> > +++ b/kernel/trace/trace_events_user.c
> > @@ -1989,6 +1989,80 @@ static int user_event_set_tp_name(struct user_event 
> > *user)
> > return 0;
> >  }
> >  
> > +/*
> > + * Counts how many ';' without a trailing space are in the args.
> > + */
> > +static int count_semis_no_space(char *args)
> > +{
> > +   int count = 0;
> > +
> > +   while ((args = strchr(args, ';'))) {
> > +   args++;
> > +
> > +   if (!isspace(*args))
> > +   count++;
> 
> This will count that "..;" 
> 
> This is most likely not an issue, but since I didn't see this case
> anywhere, I figured I bring it up just to confirm that it's not an issue.
> 

It's not an issue on the matching/logic. However, you do get an extra
byte alloc (which doesn't bother me in this edge case).

Thanks,
-Beau

> -- Steve
> 
> 
> > +   }
> > +
> > +   return count;
> > +}
> > +
> > +/*
> > + * Copies the arguments while ensuring all ';' have a trailing space.
> > + */
> > +static char *insert_space_after_semis(char *args, int count)
> > +{
> > +   char *fixed, *pos;
> > +   int len;
> > +
> > +   len = strlen(args) + count;
> > +   fixed = kmalloc(len + 1, GFP_KERNEL);
> > +
> > +   if (!fixed)
> > +   return NULL;
> > +
> > +   pos = fixed;
> > +
> > +   /* Insert a space after ';' if there is no trailing space. */
> > +   while (*args) {
> > +   *pos = *args++;
> > +
> > +   if (*pos++ == ';' && !isspace(*args))
> > +   *pos++ = ' ';
> > +   }
> > +
> > +   *pos = '\0';
> > +
> > +   return fixed;
> > +}
> > +



Re: [PATCHv4 7/7] man2: Add uretprobe syscall page

2024-05-02 Thread Alejandro Colomar
Hi Jiri,

On Thu, May 02, 2024 at 10:13:12PM +0200, Jiri Olsa wrote:
> > You could add a HISTORY section.
> 
> ok, IIUC for this syscall it should contain just kernel version where
> it got merged, right?

Yep.

> 
> > 
> > Have a lovely day!
> 
> thanks for review,
> jirka

Thanks for the page.

Have a lovely night!
Alex

-- 

A client is hiring kernel driver, mm, and/or crypto developers;
contact me if interested.


signature.asc
Description: PGP signature


Re: [PATCH resend ftrace] Asynchronous grace period for register_ftrace_direct()

2024-05-02 Thread Steven Rostedt
On Wed, 1 May 2024 20:31:06 -0700
"Paul E. McKenney"  wrote:

> On Thu, May 02, 2024 at 11:05:01AM +0900, Masami Hiramatsu wrote:
> > On Wed, 1 May 2024 16:12:37 -0700
> > "Paul E. McKenney"  wrote:
> >   
> > > Note that the immediate pressure for this patch should be relieved by the
> > > NAPI patch series [1], but this sort of problem could easily arise again.
> > > 
> > > When running heavy test workloads with KASAN enabled, RCU Tasks grace
> > > periods can extend for many tens of seconds, significantly slowing
> > > trace registration.  Therefore, make the registration-side RCU Tasks
> > > grace period be asynchronous via call_rcu_tasks().  
> > 
> > Good catch! AFAICS, there is no reason to wait for synchronization
> > when adding a new direct trampoline.
> > This looks good to me.
> > 
> > Reviewed-by: Masami Hiramatsu (Google)   
> 
> Thank you very much!  I will apply this on my next rebase.

I can take it.

It's not a bug fix but just an performance improvement, so it can go into
the next merge window.

-- Steve



> 
> > Thank you,
> >   
> > > [1]
> > > https://lore.kernel.org/all/cover.1710877680.git@cloudflare.com/
> > > 
> > > Reported-by: Jakub Kicinski 
> > > Reported-by: Alexei Starovoitov 
> > > Reported-by: Chris Mason 
> > > Signed-off-by: Paul E. McKenney 
> > > Cc: Steven Rostedt 
> > > Cc: Masami Hiramatsu 
> > > Cc: Mark Rutland 
> > > Cc: Mathieu Desnoyers 
> > > Cc: 
> > > 
> > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > > index 6c96b30f3d63b..32ea92934268c 100644
> > > --- a/kernel/trace/ftrace.c
> > > +++ b/kernel/trace/ftrace.c
> > > @@ -5365,6 +5365,13 @@ static void
> > > remove_direct_functions_hash(struct ftrace_hash *hash, unsigned long }
> > >  }
> > >  
> > > +static void register_ftrace_direct_cb(struct rcu_head *rhp)
> > > +{
> > > + struct ftrace_hash *fhp = container_of(rhp, struct
> > > ftrace_hash, rcu); +
> > > + free_ftrace_hash(fhp);
> > > +}
> > > +
> > >  /**
> > >   * register_ftrace_direct - Call a custom trampoline directly
> > >   * for multiple functions registered in @ops
> > > @@ -5463,10 +5470,8 @@ int register_ftrace_direct(struct ftrace_ops
> > > *ops, unsigned long addr) out_unlock:
> > >   mutex_unlock(_mutex);
> > >  
> > > - if (free_hash && free_hash != EMPTY_HASH) {
> > > - synchronize_rcu_tasks();
> > > - free_ftrace_hash(free_hash);
> > > - }
> > > + if (free_hash && free_hash != EMPTY_HASH)
> > > + call_rcu_tasks(_hash->rcu,
> > > register_ftrace_direct_cb); 
> > >   if (new_hash)
> > >   free_ftrace_hash(new_hash);  
> > 
> > 
> > -- 
> > Masami Hiramatsu (Google)   




Re: [PATCH v2 1/2] tracing/user_events: Fix non-spaced field matching

2024-05-02 Thread Steven Rostedt
On Tue, 23 Apr 2024 16:23:37 +
Beau Belgrave  wrote:

> When the ABI was updated to prevent same name w/different args, it
> missed an important corner case when fields don't end with a space.
> Typically, space is used for fields to help separate them, like
> "u8 field1; u8 field2". If no spaces are used, like
> "u8 field1;u8 field2", then the parsing works for the first time.
> However, the match check fails on a subsequent register, leading to
> confusion.
> 
> This is because the match check uses argv_split() and assumes that all
> fields will be split upon the space. When spaces are used, we get back
> { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }.
> This causes a mismatch, and the user program gets back -EADDRINUSE.
> 
> Add a method to detect this case before calling argv_split(). If found
> force a space after the field separator character ';'. This ensures all
> cases work properly for matching.
> 
> With this fix, the following are all treated as matching:
> u8 field1;u8 field2
> u8 field1; u8 field2
> u8 field1;\tu8 field2
> u8 field1;\nu8 field2

I'm curious, what happens if you have: "u8 field1; u8 field2;" ?

Do you care? As you will then create "u8 field1; u8 field2; "

but I'm guessing the extra whitespace at the end doesn't affect anything.


> 
> Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different 
> args event")
> Signed-off-by: Beau Belgrave 
> ---
>  kernel/trace/trace_events_user.c | 76 +++-
>  1 file changed, 75 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_events_user.c 
> b/kernel/trace/trace_events_user.c
> index 70d428c394b6..82b191f33a28 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -1989,6 +1989,80 @@ static int user_event_set_tp_name(struct user_event 
> *user)
>   return 0;
>  }
>  
> +/*
> + * Counts how many ';' without a trailing space are in the args.
> + */
> +static int count_semis_no_space(char *args)
> +{
> + int count = 0;
> +
> + while ((args = strchr(args, ';'))) {
> + args++;
> +
> + if (!isspace(*args))
> + count++;

This will count that "..;" 

This is most likely not an issue, but since I didn't see this case
anywhere, I figured I bring it up just to confirm that it's not an issue.

-- Steve


> + }
> +
> + return count;
> +}
> +
> +/*
> + * Copies the arguments while ensuring all ';' have a trailing space.
> + */
> +static char *insert_space_after_semis(char *args, int count)
> +{
> + char *fixed, *pos;
> + int len;
> +
> + len = strlen(args) + count;
> + fixed = kmalloc(len + 1, GFP_KERNEL);
> +
> + if (!fixed)
> + return NULL;
> +
> + pos = fixed;
> +
> + /* Insert a space after ';' if there is no trailing space. */
> + while (*args) {
> + *pos = *args++;
> +
> + if (*pos++ == ';' && !isspace(*args))
> + *pos++ = ' ';
> + }
> +
> + *pos = '\0';
> +
> + return fixed;
> +}
> +



Re: [PATCH v3] ftrace: Fix possible use-after-free issue in ftrace_location()

2024-05-02 Thread Steven Rostedt
On Wed, 17 Apr 2024 11:28:30 +0800
Zheng Yejian  wrote:

> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index da1710499698..e05d3e3dc06a 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1581,7 +1581,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long 
> start, unsigned long end)
>  }
>  
>  /**
> - * ftrace_location_range - return the first address of a traced location
> + * ftrace_location_range_rcu - return the first address of a traced location

kerneldoc comments are for external functions. You need to move this down
to ftrace_location_range() as here you are commenting a local static function.

But I have to ask, why did you create this static function anyway? There's
only one user of it (the ftrace_location_range()). Why didn't you just
simply add the rcu locking there?

unsigned long ftrace_location_range(unsigned long start, unsigned long end)
{
struct dyn_ftrace *rec;
unsigned long ip = 0;

rcu_read_lock();
rec = lookup_rec(start, end);
if (rec)
ip = rec->ip;
rcu_read_unlock();

return ip;
}

-- Steve


>   *   if it touches the given ip range
>   * @start: start of range to search.
>   * @end: end of range to search (inclusive). @end points to the last byte
> @@ -1592,7 +1592,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long 
> start, unsigned long end)
>   * that is either a NOP or call to the function tracer. It checks the ftrace
>   * internal tables to determine if the address belongs or not.
>   */
> -unsigned long ftrace_location_range(unsigned long start, unsigned long end)
> +static unsigned long ftrace_location_range_rcu(unsigned long start, unsigned 
> long end)
>  {
>   struct dyn_ftrace *rec;
>  
> @@ -1603,6 +1603,16 @@ unsigned long ftrace_location_range(unsigned long 
> start, unsigned long end)
>   return 0;
>  }
>  
> +unsigned long ftrace_location_range(unsigned long start, unsigned long end)
> +{
> + unsigned long loc;
> +
> + rcu_read_lock();
> + loc = ftrace_location_range_rcu(start, end);
> + rcu_read_unlock();
> + return loc;
> +}



Re: [PATCHv4 7/7] man2: Add uretprobe syscall page

2024-05-02 Thread Jiri Olsa
On Thu, May 02, 2024 at 03:43:27PM +0200, Alejandro Colomar wrote:
> Hi Jiri,
> 
> On Thu, May 02, 2024 at 02:23:13PM +0200, Jiri Olsa wrote:
> > Adding man page for new uretprobe syscall.
> > 
> > Signed-off-by: Jiri Olsa 
> > ---
> >  man2/uretprobe.2 | 45 +
> >  1 file changed, 45 insertions(+)
> >  create mode 100644 man2/uretprobe.2
> > 
> > diff --git a/man2/uretprobe.2 b/man2/uretprobe.2
> > new file mode 100644
> > index ..08fe6a670430
> > --- /dev/null
> > +++ b/man2/uretprobe.2
> > @@ -0,0 +1,45 @@
> > +.\" Copyright (C) 2024, Jiri Olsa 
> > +.\"
> > +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> > +.\"
> > +.TH uretprobe 2 (date) "Linux man-pages (unreleased)"
> > +.SH NAME
> > +uretprobe \- execute pending return uprobes
> > +.SH SYNOPSIS
> > +.nf
> > +.B int uretprobe(void)
> > +.fi
> > +.SH DESCRIPTION
> > +Kernel is using
> > +.BR uretprobe()
> > +syscall to trigger uprobe return probe consumers instead of using
> > +standard breakpoint instruction.
> > +
> 
> Please use .P instead of a blank.  See man-pages(7):
> 
>Formatting conventions (general)
>  Paragraphs should be separated by suitable markers (usually either
>  .P or .IP).  Do not separate paragraphs using blank lines, as this
>  results in poor rendering in some output formats  (such  as  Post‐
>  Script and PDF).

ok, will do

> 
> > +The uretprobe syscall is not supposed to be called directly by user, it's 
> > allowed
> 
> s/by user/by the user/

ok

> 
> > +to be invoked only through user space trampoline provided by kernel.
> 
> s/user space/user-space/

ok

> 
> Missing a few 'the' too, here and in the rest of the page.

ok, will check

> 
> > +When called from outside of this trampoline, the calling process will 
> > receive
> > +.BR SIGILL .
> > +
> > +.SH RETURN VALUE
> > +.BR uretprobe()
> 
> You're missing a space here:
> 
> .BR uretprobe ()

ok

> 
> > +return value is specific for given architecture.
> > +
> > +.SH VERSIONS
> > +This syscall is not specified in POSIX,
> > +and details of its behavior vary across systems.
> > +.SH STANDARDS
> > +None.
> 
> You could add a HISTORY section.

ok, IIUC for this syscall it should contain just kernel version where
it got merged, right?

> 
> Have a lovely day!

thanks for review,
jirka



[PATCH v3 6/6] eventfs: Have "events" directory get permissions from its parent

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The events directory gets its permissions from the root inode. But this
can cause an inconsistency if the instances directory changes its
permissions, as the permissions of the created directories under it should
inherit the permissions of the instances directory when directories under
it are created.

Currently the behavior is:

 # cd /sys/kernel/tracing
 # chgrp 1002 instances
 # mkdir instances/foo
 # ls -l instances/foo
[..]
 -r--r-  1 root lkp  0 May  1 18:55 buffer_total_size_kb
 -rw-r-  1 root lkp  0 May  1 18:55 current_tracer
 -rw-r-  1 root lkp  0 May  1 18:55 error_log
 drwxr-xr-x  1 root root 0 May  1 18:55 events
 --w---  1 root lkp  0 May  1 18:55 free_buffer
 drwxr-x---  2 root lkp  0 May  1 18:55 options
 drwxr-x--- 10 root lkp  0 May  1 18:55 per_cpu
 -rw-r-  1 root lkp  0 May  1 18:55 set_event

All the files and directories under "foo" has the "lkp" group except the
"events" directory. That's because its getting its default value from the
mount point instead of its parent.

Have the "events" directory make its default value based on its parent's
permissions. That now gives:

 # ls -l instances/foo
[..]
 -rw-r-  1 root lkp 0 May  1 21:16 buffer_subbuf_size_kb
 -r--r-  1 root lkp 0 May  1 21:16 buffer_total_size_kb
 -rw-r-  1 root lkp 0 May  1 21:16 current_tracer
 -rw-r-  1 root lkp 0 May  1 21:16 error_log
 drwxr-xr-x  1 root lkp 0 May  1 21:16 events
 --w---  1 root lkp 0 May  1 21:16 free_buffer
 drwxr-x---  2 root lkp 0 May  1 21:16 options
 drwxr-x--- 10 root lkp 0 May  1 21:16 per_cpu
 -rw-r-  1 root lkp 0 May  1 21:16 set_event

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 6e08405892ae..a878cea70f4c 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -37,6 +37,7 @@ static DEFINE_MUTEX(eventfs_mutex);
 
 struct eventfs_root_inode {
struct eventfs_inodeei;
+   struct inode*parent_inode;
struct dentry   *events_dir;
 };
 
@@ -226,12 +227,23 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
 
 static void update_events_attr(struct eventfs_inode *ei, struct super_block 
*sb)
 {
-   struct inode *root;
+   struct eventfs_root_inode *rei;
+   struct inode *parent;
+
+   rei = get_root_inode(ei);
+
+   /* Use the parent inode permissions unless root set its permissions */
+   parent = rei->parent_inode;
 
-   /* Get the tracefs root inode. */
-   root = d_inode(sb->s_root);
-   ei->attr.uid = root->i_uid;
-   ei->attr.gid = root->i_gid;
+   if (rei->ei.attr.mode & EVENTFS_SAVE_UID)
+   ei->attr.uid = rei->ei.attr.uid;
+   else
+   ei->attr.uid = parent->i_uid;
+
+   if (rei->ei.attr.mode & EVENTFS_SAVE_GID)
+   ei->attr.gid = rei->ei.attr.gid;
+   else
+   ei->attr.gid = parent->i_gid;
 }
 
 static void set_top_events_ownership(struct inode *inode)
@@ -817,6 +829,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
// Note: we have a ref to the dentry from tracefs_start_creating()
rei = get_root_inode(ei);
rei->events_dir = dentry;
+   rei->parent_inode = d_inode(dentry->d_sb->s_root);
 
ei->entries = entries;
ei->nr_entries = size;
@@ -826,10 +839,15 @@ struct eventfs_inode *eventfs_create_events_dir(const 
char *name, struct dentry
uid = d_inode(dentry->d_parent)->i_uid;
gid = d_inode(dentry->d_parent)->i_gid;
 
-   /* This is used as the default ownership of the files and directories */
ei->attr.uid = uid;
ei->attr.gid = gid;
 
+   /*
+* When the "events" directory is created, it takes on the
+* permissions of its parent. But can be reset on remount.
+*/
+   ei->attr.mode |= EVENTFS_SAVE_UID | EVENTFS_SAVE_GID;
+
INIT_LIST_HEAD(>children);
INIT_LIST_HEAD(>list);
 
-- 
2.43.0





[PATCH v3 5/6] eventfs: Do not treat events directory different than other directories

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Treat the events directory the same as other directories when it comes to
permissions. The events directory was considered different because it's
dentry is persistent, whereas the other directory dentries are created
when accessed. But the way tracefs now does its ownership by using the
root dentry's permissions as the default permissions, the events directory
can get out of sync when a remount is performed setting the group and user
permissions.

Remove the special case for the events directory on setting the
attributes. This allows the updates caused by remount to work properly as
well as simplifies the code.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 9dacf65c0b6e..6e08405892ae 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -206,21 +206,7 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
 * determined by the parent directory.
 */
if (dentry->d_inode->i_mode & S_IFDIR) {
-   /*
-* The events directory dentry is never freed, unless its
-* part of an instance that is deleted. It's attr is the
-* default for its child files and directories.
-* Do not update it. It's not used for its own mode or 
ownership.
-*/
-   if (ei->is_events) {
-   /* But it still needs to know if it was modified */
-   if (iattr->ia_valid & ATTR_UID)
-   ei->attr.mode |= EVENTFS_SAVE_UID;
-   if (iattr->ia_valid & ATTR_GID)
-   ei->attr.mode |= EVENTFS_SAVE_GID;
-   } else {
-   update_attr(>attr, iattr);
-   }
+   update_attr(>attr, iattr);
 
} else {
name = dentry->d_name.name;
-- 
2.43.0





[PATCH v3 2/6] tracefs: Reset permissions on remount if permissions are options

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

There's an inconsistency with the way permissions are handled in tracefs.
Because the permissions are generated when accessed, they default to the
root inode's permission if they were never set by the user. If the user
sets the permissions, then a flag is set and the permissions are saved via
the inode (for tracefs files) or an internal attribute field (for
eventfs).

But if a remount happens that specify the permissions, all the files that
were not changed by the user gets updated, but the ones that were are not.
If the user were to remount the file system with a given permission, then
all files and directories within that file system should be updated.

This can cause security issues if a file's permission was updated but the
admin forgot about it. They could incorrectly think that remounting with
permissions set would update all files, but miss some.

For example:

 # cd /sys/kernel/tracing
 # chgrp 1002 current_tracer
 # ls -l
[..]
 -rw-r-  1 root root 0 May  1 21:25 buffer_size_kb
 -rw-r-  1 root root 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r-  1 root root 0 May  1 21:25 buffer_total_size_kb
 -rw-r-  1 root lkp  0 May  1 21:25 current_tracer
 -rw-r-  1 root root 0 May  1 21:25 dynamic_events
 -r--r-  1 root root 0 May  1 21:25 dyn_ftrace_total_info
 -r--r-  1 root root 0 May  1 21:25 enabled_functions

Where current_tracer now has group "lkp".

 # mount -o remount,gid=1001 .
 # ls -l
 -rw-r-  1 root tracing 0 May  1 21:25 buffer_size_kb
 -rw-r-  1 root tracing 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r-  1 root tracing 0 May  1 21:25 buffer_total_size_kb
 -rw-r-  1 root lkp 0 May  1 21:25 current_tracer
 -rw-r-  1 root tracing 0 May  1 21:25 dynamic_events
 -r--r-  1 root tracing 0 May  1 21:25 dyn_ftrace_total_info
 -r--r-  1 root tracing 0 May  1 21:25 enabled_functions

Everything changed but the "current_tracer".

Add a new link list that keeps track of all the tracefs_inodes which has
the permission flags that tell if the file/dir should use the root inode's
permission or not. Then on remount, clear all the flags so that the
default behavior of using the root inode's permission is done for all
files and directories.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 29 ++
 fs/tracefs/inode.c   | 65 +++-
 fs/tracefs/internal.h|  7 -
 3 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cc8b838bbe62..15a2a9c3c62b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -308,6 +308,35 @@ static const struct file_operations 
eventfs_file_operations = {
.llseek = generic_file_llseek,
 };
 
+/*
+ * On a remount of tracefs, if UID or GID options are set, then
+ * the mount point inode permissions should be used.
+ * Reset the saved permission flags appropriately.
+ */
+void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool 
update_gid)
+{
+   struct eventfs_inode *ei = ti->private;
+
+   if (!ei)
+   return;
+
+   if (update_uid)
+   ei->attr.mode &= ~EVENTFS_SAVE_UID;
+
+   if (update_gid)
+   ei->attr.mode &= ~EVENTFS_SAVE_GID;
+
+   if (!ei->entry_attrs)
+   return;
+
+   for (int i = 0; i < ei->nr_entries; i++) {
+   if (update_uid)
+   ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID;
+   if (update_gid)
+   ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID;
+   }
+}
+
 /* Return the evenfs_inode of the "events" directory */
 static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 5545e6bf7d26..52aa14bd2994 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -30,20 +30,47 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;
 
+/*
+ * Keep track of all tracefs_inodes in order to update their
+ * flags if necessary on a remount.
+ */
+static DEFINE_SPINLOCK(tracefs_inode_lock);
+static LIST_HEAD(tracefs_inodes);
+
 static struct inode *tracefs_alloc_inode(struct super_block *sb)
 {
struct tracefs_inode *ti;
+   unsigned long flags;
 
ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL);
if (!ti)
return NULL;
 
+   spin_lock_irqsave(_inode_lock, flags);
+   list_add_rcu(>list, _inodes);
+   spin_unlock_irqrestore(_inode_lock, flags);
+
return >vfs_inode;
 }
 
+static void tracefs_free_inode_rcu(struct rcu_head *rcu)
+{
+   struct tracefs_inode *ti;
+
+   ti = container_of(rcu, struct tracefs_inode, rcu);
+   kmem_cache_free(tracefs_inode_cachep, 

[PATCH v3 4/6] eventfs: Do not differentiate the toplevel events directory

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The toplevel events directory is really no different than the events
directory of instances. Having the two be different caused
inconsistencies and made it harder to fix the permissions bugs.

Make all events directories act the same.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 29 -
 fs/tracefs/internal.h|  7 +++
 2 files changed, 11 insertions(+), 25 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 15a2a9c3c62b..9dacf65c0b6e 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -68,7 +68,6 @@ enum {
EVENTFS_SAVE_MODE   = BIT(16),
EVENTFS_SAVE_UID= BIT(17),
EVENTFS_SAVE_GID= BIT(18),
-   EVENTFS_TOPLEVEL= BIT(19),
 };
 
 #define EVENTFS_MODE_MASK  (EVENTFS_SAVE_MODE - 1)
@@ -239,14 +238,10 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
return ret;
 }
 
-static void update_top_events_attr(struct eventfs_inode *ei, struct 
super_block *sb)
+static void update_events_attr(struct eventfs_inode *ei, struct super_block 
*sb)
 {
struct inode *root;
 
-   /* Only update if the "events" was on the top level */
-   if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
-   return;
-
/* Get the tracefs root inode. */
root = d_inode(sb->s_root);
ei->attr.uid = root->i_uid;
@@ -259,10 +254,10 @@ static void set_top_events_ownership(struct inode *inode)
struct eventfs_inode *ei = ti->private;
 
/* The top events directory doesn't get automatically updated */
-   if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
+   if (!ei || !ei->is_events)
return;
 
-   update_top_events_attr(ei, inode->i_sb);
+   update_events_attr(ei, inode->i_sb);
 
if (!(ei->attr.mode & EVENTFS_SAVE_UID))
inode->i_uid = ei->attr.uid;
@@ -291,7 +286,7 @@ static int eventfs_permission(struct mnt_idmap *idmap,
return generic_permission(idmap, inode, mask);
 }
 
-static const struct inode_operations eventfs_root_dir_inode_operations = {
+static const struct inode_operations eventfs_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
@@ -359,7 +354,7 @@ static struct eventfs_inode *eventfs_find_events(struct 
dentry *dentry)
// Walk upwards until you find the events inode
} while (!ei->is_events);
 
-   update_top_events_attr(ei, dentry->d_sb);
+   update_events_attr(ei, dentry->d_sb);
 
return ei;
 }
@@ -465,7 +460,7 @@ static struct dentry *lookup_dir_entry(struct dentry 
*dentry,
update_inode_attr(dentry, inode, >attr,
  S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
 
-   inode->i_op = _root_dir_inode_operations;
+   inode->i_op = _dir_inode_operations;
inode->i_fop = _file_operations;
 
/* All directories will have the same inode number */
@@ -845,14 +840,6 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
uid = d_inode(dentry->d_parent)->i_uid;
gid = d_inode(dentry->d_parent)->i_gid;
 
-   /*
-* If the events directory is of the top instance, then parent
-* is NULL. Set the attr.mode to reflect this and its permissions will
-* default to the tracefs root dentry.
-*/
-   if (!parent)
-   ei->attr.mode = EVENTFS_TOPLEVEL;
-
/* This is used as the default ownership of the files and directories */
ei->attr.uid = uid;
ei->attr.gid = gid;
@@ -861,13 +848,13 @@ struct eventfs_inode *eventfs_create_events_dir(const 
char *name, struct dentry
INIT_LIST_HEAD(>list);
 
ti = get_tracefs(inode);
-   ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+   ti->flags |= TRACEFS_EVENT_INODE;
ti->private = ei;
 
inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
inode->i_uid = uid;
inode->i_gid = gid;
-   inode->i_op = _root_dir_inode_operations;
+   inode->i_op = _dir_inode_operations;
inode->i_fop = _file_operations;
 
dentry->d_fsdata = get_ei(ei);
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 29f0c75b..f704d8348357 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -4,10 +4,9 @@
 
 enum {
TRACEFS_EVENT_INODE = BIT(1),
-   TRACEFS_EVENT_TOP_INODE = BIT(2),
-   TRACEFS_GID_PERM_SET= BIT(3),
-   TRACEFS_UID_PERM_SET= BIT(4),
-   TRACEFS_INSTANCE_INODE  = BIT(5),
+   TRACEFS_GID_PERM_SET= BIT(2),
+   

[PATCH v3 1/6] eventfs: Free all of the eventfs_inode after RCU

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The freeing of eventfs_inode via a kfree_rcu() callback. But the content
of the eventfs_inode was being freed after the last kref. This is
dangerous, as changes are being made that can access the content of an
eventfs_inode from an RCU loop.

Instead of using kfree_rcu() use call_rcu() that calls a function to do
all the freeing of the eventfs_inode after a RCU grace period has expired.

Cc: sta...@vger.kernel.org
Fixes: 43aa6f97c2d03 ("eventfs: Get rid of dentry pointers without refcounts")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index f5510e26f0f6..cc8b838bbe62 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -73,6 +73,21 @@ enum {
 
 #define EVENTFS_MODE_MASK  (EVENTFS_SAVE_MODE - 1)
 
+static void free_ei_rcu(struct rcu_head *rcu)
+{
+   struct eventfs_inode *ei = container_of(rcu, struct eventfs_inode, rcu);
+   struct eventfs_root_inode *rei;
+
+   kfree(ei->entry_attrs);
+   kfree_const(ei->name);
+   if (ei->is_events) {
+   rei = get_root_inode(ei);
+   kfree(rei);
+   } else {
+   kfree(ei);
+   }
+}
+
 /*
  * eventfs_inode reference count management.
  *
@@ -85,7 +100,6 @@ static void release_ei(struct kref *ref)
 {
struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, 
kref);
const struct eventfs_entry *entry;
-   struct eventfs_root_inode *rei;
 
WARN_ON_ONCE(!ei->is_freed);
 
@@ -95,14 +109,7 @@ static void release_ei(struct kref *ref)
entry->release(entry->name, ei->data);
}
 
-   kfree(ei->entry_attrs);
-   kfree_const(ei->name);
-   if (ei->is_events) {
-   rei = get_root_inode(ei);
-   kfree_rcu(rei, ei.rcu);
-   } else {
-   kfree_rcu(ei, rcu);
-   }
+   call_rcu(>rcu, free_ei_rcu);
 }
 
 static inline void put_ei(struct eventfs_inode *ei)
-- 
2.43.0





[PATCH v3 3/6] tracefs: Still use mount point as default permissions for instances

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If the instances directory's permissions were never change, then have it
and its children use the mount point permissions as the default.

Currently, the permissions of instance directories are determined by the
instance directory's permissions itself. But if the tracefs file system is
remounted and changes the permissions, the instance directory and its
children should use the new permission.

But because both the instance directory and its children use the instance
directory's inode for permissions, it misses the update.

To demonstrate this:

  # cd /sys/kernel/tracing/
  # mkdir instances/foo
  # ls -ld instances/foo
 drwxr-x--- 5 root root 0 May  1 19:07 instances/foo
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 18:57 instances
  # ls -ld current_tracer
 -rw-r- 1 root root 0 May  1 18:57 current_tracer

  # mount -o remount,gid=1002 .
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 18:57 instances
  # ls -ld instances/foo/
 drwxr-x--- 5 root root 0 May  1 19:07 instances/foo/
  # ls -ld current_tracer
 -rw-r- 1 root lkp 0 May  1 18:57 current_tracer

Notice that changing the group id to that of "lkp" did not affect the
instances directory nor its children. It should have been:

  # ls -ld current_tracer
 -rw-r- 1 root root 0 May  1 19:19 current_tracer
  # ls -ld instances/foo/
 drwxr-x--- 5 root root 0 May  1 19:25 instances/foo/
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 19:19 instances

  # mount -o remount,gid=1002 .
  # ls -ld current_tracer
 -rw-r- 1 root lkp 0 May  1 19:19 current_tracer
  # ls -ld instances
 drwxr-x--- 3 root lkp 0 May  1 19:19 instances
  # ls -ld instances/foo/
 drwxr-x--- 5 root lkp 0 May  1 19:25 instances/foo/

Where all files were updated by the remount gid update.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/inode.c | 27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 52aa14bd2994..417c840e6403 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -180,16 +180,39 @@ static void set_tracefs_inode_owner(struct inode *inode)
 {
struct tracefs_inode *ti = get_tracefs(inode);
struct inode *root_inode = ti->private;
+   kuid_t uid;
+   kgid_t gid;
+
+   uid = root_inode->i_uid;
+   gid = root_inode->i_gid;
+
+   /*
+* If the root is not the mount point, then check the root's
+* permissions. If it was never set, then default to the
+* mount point.
+*/
+   if (root_inode != d_inode(root_inode->i_sb->s_root)) {
+   struct tracefs_inode *rti;
+
+   rti = get_tracefs(root_inode);
+   root_inode = d_inode(root_inode->i_sb->s_root);
+
+   if (!(rti->flags & TRACEFS_UID_PERM_SET))
+   uid = root_inode->i_uid;
+
+   if (!(rti->flags & TRACEFS_GID_PERM_SET))
+   gid = root_inode->i_gid;
+   }
 
/*
 * If this inode has never been referenced, then update
 * the permissions to the superblock.
 */
if (!(ti->flags & TRACEFS_UID_PERM_SET))
-   inode->i_uid = root_inode->i_uid;
+   inode->i_uid = uid;
 
if (!(ti->flags & TRACEFS_GID_PERM_SET))
-   inode->i_gid = root_inode->i_gid;
+   inode->i_gid = gid;
 }
 
 static int tracefs_permission(struct mnt_idmap *idmap,
-- 
2.43.0





[PATCH v3 0/6] tracefs/eventfs: Fix inconsistent permissions

2024-05-02 Thread Steven Rostedt
The tracefs and eventfs permissions are created dynamically based
on what the mount point inode has or the instances directory inode has.
But the way it worked had some inconsistencies that could lead to
security issues as the file system is not behaving like admins would
expect.

The files and directories could ignore the remount option that changes
the gid or uid ownerships, leaving files susceptable to access that
is not expected. This happens if a file had its value changed previously
and then a remount changed all the files permissions. The one that
was changed previously would not be affected.

This change set resolves these inconsistencies.

This also fixes the test_ownership.tc test as it would pass on the
first time it is run, but fail on the second time, because of the
inconsistant state of the permissions. Now you can run that test
multiple times and it will always pass.

Changes since v2: 
https://lore.kernel.org/linux-trace-kernel/20240502151547.973653...@goodmis.org/

- The eventfs_inode freeing was incorrect. The kref_put() would call
  release_ei() that freed the contents of the eventfs_inode then
  call kfree_rcu() on the eventfs_inode itself. The contents of the
  eventfs_inode needs to be freed after the RCU synchronization as
  well. The patches here add even more cases where that's a requirement.

- Add a iput callback for the tracefs_inode to clear the TRACEFS_EVENT_INODE
  flag. This will prevent the clearing of flags in remount to go into
  the eventfs_remount() function. A RCU grace cycle happens between
  the clearing of this flag and where the eventfs_inode is freed, so
  it is OK if the iteration is happening at the same time, as it is
  done under rcu_read_lock().

Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20240502030024.062275...@goodmis.org/

- Testing showed that taking a mutex when freeing the tracefs_inode
  caused a lockdep splat as it can happen in the RCU softirq context.
  Convert the mutex to a spinlock for adding and removing the node
  from the link list, and free the node via call_rcu() so that the
  iteration of the list only needs to be protected by rcu_read_lock().


Steven Rostedt (Google) (6):
  eventfs: Free all of the eventfs_inode after RCU
  tracefs: Reset permissions on remount if permissions are options
  tracefs: Still use mount point as default permissions for instances
  eventfs: Do not differentiate the toplevel events directory
  eventfs: Do not treat events directory different than other directories
  eventfs: Have "events" directory get permissions from its parent


 fs/tracefs/event_inode.c | 127 ---
 fs/tracefs/inode.c   |  92 --
 fs/tracefs/internal.h|  14 --
 3 files changed, 175 insertions(+), 58 deletions(-)



Re: [PATCHv4 bpf-next 0/7] uprobe: uretprobe speed up

2024-05-02 Thread Jiri Olsa
On Thu, May 02, 2024 at 09:43:02AM -0700, Andrii Nakryiko wrote:
> On Thu, May 2, 2024 at 5:23 AM Jiri Olsa  wrote:
> >
> > hi,
> > as part of the effort on speeding up the uprobes [0] coming with
> > return uprobe optimization by using syscall instead of the trap
> > on the uretprobe trampoline.
> >
> > The speed up depends on instruction type that uprobe is installed
> > and depends on specific HW type, please check patch 1 for details.
> >
> > Patches 1-6 are based on bpf-next/master, but path 1 and 2 are
> > apply-able on linux-trace.git tree probes/for-next branch.
> > Patch 7 is based on man-pages master.
> >
> > v4 changes:
> >   - added acks [Oleg,Andrii,Masami]
> >   - reworded the man page and adding more info to NOTE section [Masami]
> >   - rewrote bpf tests not to use trace_pipe [Andrii]
> >   - cc-ed linux-man list
> >
> > Also available at:
> >   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
> >   uretprobe_syscall
> >
> 
> It looks great to me, thanks! Unfortunately BPF CI build is broken,
> probably due to some of the Makefile additions, please investigate and
> fix (or we'll need to fix something on BPF CI side), but it looks like
> you'll need another revision, unfortunately.
> 
> pw-bot: cr
> 
>   [0] 
> https://github.com/kernel-patches/bpf/actions/runs/8923849088/job/24509002194

yes, I think it's missing the 32-bit libc for uprobe_compat binary,
probably it needs to be added to github.com:libbpf/ci.git 
setup-build-env/action.yml ?
hm but I'm not sure how to test it, need to check

> 
> 
> 
> But while we are at it.
> 
> Masami, Oleg,
> 
> What should be the logistics of landing this? Can/should we route this
> through the bpf-next tree, given there are lots of BPF-based
> selftests? Or you want to take this through
> linux-trace/probes/for-next? In the latter case, it's probably better
> to apply only the first two patches to probes/for-next and the rest
> should still go through the bpf-next tree (otherwise we are running

I think this was the plan, previously mentioned in here:
https://lore.kernel.org/bpf/20240423000943.478ccf1e735a63c6c1b4c...@kernel.org/

> into conflicts in BPF selftests). Previously we were handling such
> cross-tree dependencies by creating a named branch or tag, and merging
> it into bpf-next (so that all SHAs are preserved). It's a bunch of
> extra work for everyone involved, so the simplest way would be to just
> land through bpf-next, of course. But let me know your preferences.
> 
> Thanks!
> 
> > thanks,
> > jirka
> >
> >
> > Notes to check list items in Documentation/process/adding-syscalls.rst:
> >
> > - System Call Alternatives
> >   New syscall seems like the best way in here, becase we need
> 
> typo (thanks, Gmail): because

ok

> 
> >   just to quickly enter kernel with no extra arguments processing,
> >   which we'd need to do if we decided to use another syscall.
> >
> > - Designing the API: Planning for Extension
> >   The uretprobe syscall is very specific and most likely won't be
> >   extended in the future.
> >
> >   At the moment it does not take any arguments and even if it does
> >   in future, it's allowed to be called only from trampoline prepared
> >   by kernel, so there'll be no broken user.
> >
> > - Designing the API: Other Considerations
> >   N/A because uretprobe syscall does not return reference to kernel
> >   object.
> >
> > - Proposing the API
> >   Wiring up of the uretprobe system call si in separate change,
> 
> typo: is

ok, thanks

jirka



Re: [PATCHv4 bpf-next 0/7] uprobe: uretprobe speed up

2024-05-02 Thread Andrii Nakryiko
On Thu, May 2, 2024 at 5:23 AM Jiri Olsa  wrote:
>
> hi,
> as part of the effort on speeding up the uprobes [0] coming with
> return uprobe optimization by using syscall instead of the trap
> on the uretprobe trampoline.
>
> The speed up depends on instruction type that uprobe is installed
> and depends on specific HW type, please check patch 1 for details.
>
> Patches 1-6 are based on bpf-next/master, but path 1 and 2 are
> apply-able on linux-trace.git tree probes/for-next branch.
> Patch 7 is based on man-pages master.
>
> v4 changes:
>   - added acks [Oleg,Andrii,Masami]
>   - reworded the man page and adding more info to NOTE section [Masami]
>   - rewrote bpf tests not to use trace_pipe [Andrii]
>   - cc-ed linux-man list
>
> Also available at:
>   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>   uretprobe_syscall
>

It looks great to me, thanks! Unfortunately BPF CI build is broken,
probably due to some of the Makefile additions, please investigate and
fix (or we'll need to fix something on BPF CI side), but it looks like
you'll need another revision, unfortunately.

pw-bot: cr

  [0] 
https://github.com/kernel-patches/bpf/actions/runs/8923849088/job/24509002194



But while we are at it.

Masami, Oleg,

What should be the logistics of landing this? Can/should we route this
through the bpf-next tree, given there are lots of BPF-based
selftests? Or you want to take this through
linux-trace/probes/for-next? In the latter case, it's probably better
to apply only the first two patches to probes/for-next and the rest
should still go through the bpf-next tree (otherwise we are running
into conflicts in BPF selftests). Previously we were handling such
cross-tree dependencies by creating a named branch or tag, and merging
it into bpf-next (so that all SHAs are preserved). It's a bunch of
extra work for everyone involved, so the simplest way would be to just
land through bpf-next, of course. But let me know your preferences.

Thanks!

> thanks,
> jirka
>
>
> Notes to check list items in Documentation/process/adding-syscalls.rst:
>
> - System Call Alternatives
>   New syscall seems like the best way in here, becase we need

typo (thanks, Gmail): because

>   just to quickly enter kernel with no extra arguments processing,
>   which we'd need to do if we decided to use another syscall.
>
> - Designing the API: Planning for Extension
>   The uretprobe syscall is very specific and most likely won't be
>   extended in the future.
>
>   At the moment it does not take any arguments and even if it does
>   in future, it's allowed to be called only from trampoline prepared
>   by kernel, so there'll be no broken user.
>
> - Designing the API: Other Considerations
>   N/A because uretprobe syscall does not return reference to kernel
>   object.
>
> - Proposing the API
>   Wiring up of the uretprobe system call si in separate change,

typo: is

>   selftests and man page changes are part of the patchset.
>
> - Generic System Call Implementation
>   There's no CONFIG option for the new functionality because it
>   keeps the same behaviour from the user POV.
>
> - x86 System Call Implementation
>   It's 64-bit syscall only.
>
> - Compatibility System Calls (Generic)
>   N/A uretprobe syscall has no arguments and is not supported
>   for compat processes.
>
> - Compatibility System Calls (x86)
>   N/A uretprobe syscall is not supported for compat processes.
>
> - System Calls Returning Elsewhere
>   N/A.
>
> - Other Details
>   N/A.
>
> - Testing
>   Adding new bpf selftests and ran ltp on top of this change.
>
> - Man Page
>   Attached.
>
> - Do not call System Calls in the Kernel
>   N/A.
>
>
> [0] https://lore.kernel.org/bpf/ZeCXHKJ--iYYbmLj@krava/
> ---
> Jiri Olsa (6):
>   uprobe: Wire up uretprobe system call
>   uprobe: Add uretprobe syscall to speed up return probe
>   selftests/bpf: Add uretprobe syscall test for regs integrity
>   selftests/bpf: Add uretprobe syscall test for regs changes
>   selftests/bpf: Add uretprobe syscall call from user space test
>   selftests/bpf: Add uretprobe compat test
>
>  arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
>  arch/x86/kernel/uprobes.c   | 115 
> 
>  include/linux/syscalls.h|   2 +
>  include/linux/uprobes.h |   3 +
>  include/uapi/asm-generic/unistd.h   |   5 +-
>  kernel/events/uprobes.c |  24 --
>  kernel/sys_ni.c |   2 +
>  tools/include/linux/compiler.h  |   4 +
>  tools/testing/selftests/bpf/.gitignore  |   1 +
>  tools/testing/selftests/bpf/Makefile|   7 +-
>  tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c   | 123 
> -
>  

Re: [PATCHv4 bpf-next 6/7] selftests/bpf: Add uretprobe compat test

2024-05-02 Thread Andrii Nakryiko
On Thu, May 2, 2024 at 5:24 AM Jiri Olsa  wrote:
>
> Adding test that adds return uprobe inside 32-bit task
> and verify the return uprobe and attached bpf programs
> get properly executed.
>
> Reviewed-by: Masami Hiramatsu (Google) 
> Signed-off-by: Jiri Olsa 
> ---
>  tools/testing/selftests/bpf/.gitignore|  1 +
>  tools/testing/selftests/bpf/Makefile  |  7 ++-
>  .../selftests/bpf/prog_tests/uprobe_syscall.c | 60 +++
>  3 files changed, 67 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/.gitignore 
> b/tools/testing/selftests/bpf/.gitignore
> index f1aebabfb017..69d71223c0dd 100644
> --- a/tools/testing/selftests/bpf/.gitignore
> +++ b/tools/testing/selftests/bpf/.gitignore
> @@ -45,6 +45,7 @@ test_cpp
>  /veristat
>  /sign-file
>  /uprobe_multi
> +/uprobe_compat
>  *.ko
>  *.tmp
>  xskxceiver
> diff --git a/tools/testing/selftests/bpf/Makefile 
> b/tools/testing/selftests/bpf/Makefile
> index 82247aeef857..a94352162290 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -133,7 +133,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr 
> test_skb_cgroup_id_user \
> xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \
> xdp_features bpf_test_no_cfi.ko
>
> -TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi
> +TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi 
> uprobe_compat
>
>  ifneq ($(V),1)
>  submake_extras := feature_display=0
> @@ -631,6 +631,7 @@ TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read 
> $(OUTPUT)/bpf_testmod.ko  \
>$(OUTPUT)/xdp_synproxy   \
>$(OUTPUT)/sign-file  \
>$(OUTPUT)/uprobe_multi   \
> +  $(OUTPUT)/uprobe_compat  \
>ima_setup.sh \
>verify_sig_setup.sh  \
>$(wildcard progs/btf_dump_test_case_*.c) \
> @@ -752,6 +753,10 @@ $(OUTPUT)/uprobe_multi: uprobe_multi.c
> $(call msg,BINARY,,$@)
> $(Q)$(CC) $(CFLAGS) -O0 $(LDFLAGS) $^ $(LDLIBS) -o $@
>
> +$(OUTPUT)/uprobe_compat:
> +   $(call msg,BINARY,,$@)
> +   $(Q)echo "int main() { return 0; }" | $(CC) $(CFLAGS) -xc -m32 -O0 - 
> -o $@
> +
>  EXTRA_CLEAN := $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)  \
> prog_tests/tests.h map_tests/tests.h verifier/tests.h   \
> feature bpftool \
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
> b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index c6fdb8c59ea3..bfea9a0368a4 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> @@ -5,6 +5,7 @@
>  #ifdef __x86_64__
>
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -297,6 +298,58 @@ static void test_uretprobe_syscall_call(void)
> close(go[1]);
> close(go[0]);
>  }
> +
> +static void test_uretprobe_compat(void)
> +{
> +   LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
> +   .retprobe = true,
> +   );
> +   struct uprobe_syscall_executed *skel;
> +   int err, go[2], pid, c, status;
> +
> +   if (pipe(go))
> +   return;

ASSERT_OK() missing, like in the previous patch

Thanks for switching to pipe() + global variable instead of using trace_pipe.

Acked-by: Andrii Nakryiko 

> +
> +   skel = uprobe_syscall_executed__open_and_load();
> +   if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
> +   goto cleanup;
> +

[...]



Re: [PATCHv4 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test

2024-05-02 Thread Andrii Nakryiko
On Thu, May 2, 2024 at 5:24 AM Jiri Olsa  wrote:
>
> Adding test to verify that when called from outside of the
> trampoline provided by kernel, the uretprobe syscall will cause
> calling process to receive SIGILL signal and the attached bpf
> program is not executed.
>
> Reviewed-by: Masami Hiramatsu (Google) 
> Signed-off-by: Jiri Olsa 
> ---
>  .../selftests/bpf/prog_tests/uprobe_syscall.c | 95 +++
>  .../bpf/progs/uprobe_syscall_executed.c   | 17 
>  2 files changed, 112 insertions(+)
>  create mode 100644 
> tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
> b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index 1a50cd35205d..c6fdb8c59ea3 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> @@ -7,7 +7,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include "uprobe_syscall.skel.h"
> +#include "uprobe_syscall_executed.skel.h"
>
>  __naked unsigned long uretprobe_regs_trigger(void)
>  {
> @@ -209,6 +212,91 @@ static void test_uretprobe_regs_change(void)
> }
>  }
>
> +#ifndef __NR_uretprobe
> +#define __NR_uretprobe 462
> +#endif
> +
> +__naked unsigned long uretprobe_syscall_call_1(void)
> +{
> +   /*
> +* Pretend we are uretprobe trampoline to trigger the return
> +* probe invocation in order to verify we get SIGILL.
> +*/
> +   asm volatile (
> +   "pushq %rax\n"
> +   "pushq %rcx\n"
> +   "pushq %r11\n"
> +   "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> +   "syscall\n"
> +   "popq %r11\n"
> +   "popq %rcx\n"
> +   "retq\n"
> +   );
> +}
> +
> +__naked unsigned long uretprobe_syscall_call(void)
> +{
> +   asm volatile (
> +   "call uretprobe_syscall_call_1\n"
> +   "retq\n"
> +   );
> +}
> +
> +static void test_uretprobe_syscall_call(void)
> +{
> +   LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
> +   .retprobe = true,
> +   );
> +   struct uprobe_syscall_executed *skel;
> +   int pid, status, err, go[2], c;
> +
> +   if (pipe(go))
> +   return;

very unlikely to fail, but still, ASSERT_OK() would be in order here

But regardless:

Acked-by: Andrii Nakryiko 

[...]



Re: (subset) [PATCH v2] dt-bindings: mfd: qcom,spmi-pmic: Add pbs to SPMI device types

2024-05-02 Thread Lee Jones
On Fri, 12 Apr 2024 16:22:53 +0200, Luca Weiss wrote:
> Add the PBS (Programmable Boot Sequencer) to the list of devices.
> 
> 

Applied, thanks!

[1/1] dt-bindings: mfd: qcom,spmi-pmic: Add pbs to SPMI device types
  commit: a1f3b5edaf18b1c71a537032c4a6537bde2ad5e9

--
Lee Jones [李琼斯]




Re: [PATCH v2 1/5] tracefs: Reset permissions on remount if permissions are options

2024-05-02 Thread Steven Rostedt
On Thu, 02 May 2024 11:15:48 -0400
Steven Rostedt  wrote:

> +/*
> + * On a remount of tracefs, if UID or GID options are set, then
> + * the mount point inode permissions should be used.
> + * Reset the saved permission flags appropriately.
> + */
> +void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool 
> update_gid)
> +{
> + struct eventfs_inode *ei = ti->private;

And I just realized there's a race here too :-p

I need to set ti->private = NULL before freeing the ei, and probably free
the ei via RCU.

-- Steve


> +
> + if (!ei)
> + return;
> +
> + if (update_uid)
> + ei->attr.mode &= ~EVENTFS_SAVE_UID;
> +
> + if (update_gid)
> + ei->attr.mode &= ~EVENTFS_SAVE_GID;
> +
> + if (!ei->entry_attrs)
> + return;
> +
> + for (int i = 0; i < ei->nr_entries; i++) {
> + if (update_uid)
> + ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID;
> + if (update_gid)
> + ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID;
> + }
> +}
> +



[PATCH v2 5/5] eventfs: Have "events" directory get permissions from its parent

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The events directory gets its permissions from the root inode. But this
can cause an inconsistency if the instances directory changes its
permissions, as the permissions of the created directories under it should
inherit the permissions of the instances directory when directories under
it are created.

Currently the behavior is:

 # cd /sys/kernel/tracing
 # chgrp 1002 instances
 # mkdir instances/foo
 # ls -l instances/foo
[..]
 -r--r-  1 root lkp  0 May  1 18:55 buffer_total_size_kb
 -rw-r-  1 root lkp  0 May  1 18:55 current_tracer
 -rw-r-  1 root lkp  0 May  1 18:55 error_log
 drwxr-xr-x  1 root root 0 May  1 18:55 events
 --w---  1 root lkp  0 May  1 18:55 free_buffer
 drwxr-x---  2 root lkp  0 May  1 18:55 options
 drwxr-x--- 10 root lkp  0 May  1 18:55 per_cpu
 -rw-r-  1 root lkp  0 May  1 18:55 set_event

All the files and directories under "foo" has the "lkp" group except the
"events" directory. That's because its getting its default value from the
mount point instead of its parent.

Have the "events" directory make its default value based on its parent's
permissions. That now gives:

 # ls -l instances/foo
[..]
 -rw-r-  1 root lkp 0 May  1 21:16 buffer_subbuf_size_kb
 -r--r-  1 root lkp 0 May  1 21:16 buffer_total_size_kb
 -rw-r-  1 root lkp 0 May  1 21:16 current_tracer
 -rw-r-  1 root lkp 0 May  1 21:16 error_log
 drwxr-xr-x  1 root lkp 0 May  1 21:16 events
 --w---  1 root lkp 0 May  1 21:16 free_buffer
 drwxr-x---  2 root lkp 0 May  1 21:16 options
 drwxr-x--- 10 root lkp 0 May  1 21:16 per_cpu
 -rw-r-  1 root lkp 0 May  1 21:16 set_event

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index ff4ee3dad56a..3ea32159d556 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -37,6 +37,7 @@ static DEFINE_MUTEX(eventfs_mutex);
 
 struct eventfs_root_inode {
struct eventfs_inodeei;
+   struct inode*parent_inode;
struct dentry   *events_dir;
 };
 
@@ -219,12 +220,23 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
 
 static void update_events_attr(struct eventfs_inode *ei, struct super_block 
*sb)
 {
-   struct inode *root;
+   struct eventfs_root_inode *rei;
+   struct inode *parent;
+
+   rei = get_root_inode(ei);
+
+   /* Use the parent inode permissions unless root set its permissions */
+   parent = rei->parent_inode;
 
-   /* Get the tracefs root inode. */
-   root = d_inode(sb->s_root);
-   ei->attr.uid = root->i_uid;
-   ei->attr.gid = root->i_gid;
+   if (rei->ei.attr.mode & EVENTFS_SAVE_UID)
+   ei->attr.uid = rei->ei.attr.uid;
+   else
+   ei->attr.uid = parent->i_uid;
+
+   if (rei->ei.attr.mode & EVENTFS_SAVE_GID)
+   ei->attr.gid = rei->ei.attr.gid;
+   else
+   ei->attr.gid = parent->i_gid;
 }
 
 static void set_top_events_ownership(struct inode *inode)
@@ -810,6 +822,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
// Note: we have a ref to the dentry from tracefs_start_creating()
rei = get_root_inode(ei);
rei->events_dir = dentry;
+   rei->parent_inode = d_inode(dentry->d_sb->s_root);
 
ei->entries = entries;
ei->nr_entries = size;
@@ -819,10 +832,15 @@ struct eventfs_inode *eventfs_create_events_dir(const 
char *name, struct dentry
uid = d_inode(dentry->d_parent)->i_uid;
gid = d_inode(dentry->d_parent)->i_gid;
 
-   /* This is used as the default ownership of the files and directories */
ei->attr.uid = uid;
ei->attr.gid = gid;
 
+   /*
+* When the "events" directory is created, it takes on the
+* permissions of its parent. But can be reset on remount.
+*/
+   ei->attr.mode |= EVENTFS_SAVE_UID | EVENTFS_SAVE_GID;
+
INIT_LIST_HEAD(>children);
INIT_LIST_HEAD(>list);
 
-- 
2.43.0





[PATCH v2 4/5] eventfs: Do not treat events directory different than other directories

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Treat the events directory the same as other directories when it comes to
permissions. The events directory was considered different because it's
dentry is persistent, whereas the other directory dentries are created
when accessed. But the way tracefs now does its ownership by using the
root dentry's permissions as the default permissions, the events directory
can get out of sync when a remount is performed setting the group and user
permissions.

Remove the special case for the events directory on setting the
attributes. This allows the updates caused by remount to work properly as
well as simplifies the code.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2971609946e5..ff4ee3dad56a 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -199,21 +199,7 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
 * determined by the parent directory.
 */
if (dentry->d_inode->i_mode & S_IFDIR) {
-   /*
-* The events directory dentry is never freed, unless its
-* part of an instance that is deleted. It's attr is the
-* default for its child files and directories.
-* Do not update it. It's not used for its own mode or 
ownership.
-*/
-   if (ei->is_events) {
-   /* But it still needs to know if it was modified */
-   if (iattr->ia_valid & ATTR_UID)
-   ei->attr.mode |= EVENTFS_SAVE_UID;
-   if (iattr->ia_valid & ATTR_GID)
-   ei->attr.mode |= EVENTFS_SAVE_GID;
-   } else {
-   update_attr(>attr, iattr);
-   }
+   update_attr(>attr, iattr);
 
} else {
name = dentry->d_name.name;
-- 
2.43.0





[PATCH v2 0/5] tracefs/eventfs: Fix inconsistent permissions

2024-05-02 Thread Steven Rostedt


The tracefs and eventfs permissions are created dynamically based
on what the mount point inode has or the instances directory inode has.
But the way it worked had some inconsistencies that could lead to
security issues as the file system is not behaving like admins would
expect.

The files and directories could ignore the remount option that changes
the gid or uid ownerships, leaving files susceptable to access that
is not expected. This happens if a file had its value changed previously
and then a remount changed all the files permissions. The one that
was changed previously would not be affected.

This change set resolves these inconsistencies.

This also fixes the test_ownership.tc test as it would pass on the
first time it is run, but fail on the second time, because of the
inconsistant state of the permissions. Now you can run that test
multiple times and it will always pass.

Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20240502030024.062275...@goodmis.org/

- Testing showed that taking a mutex when freeing the tracefs_inode
  caused a lockdep splat as it can happen in the RCU softirq context.
  Convert the mutex to a spinlock for adding and removing the node
  from the link list, and free the node via call_rcu() so that the
  iteration of the list only needs to be protected by rcu_read_lock().


Steven Rostedt (Google) (5):
  tracefs: Reset permissions on remount if permissions are options
  tracefs: Still use mount point as default permissions for instances
  eventfs: Do not differentiate the toplevel events directory
  eventfs: Do not treat events directory different than other directories
  eventfs: Have "events" directory get permissions from its parent


 fs/tracefs/event_inode.c | 102 ---
 fs/tracefs/inode.c   |  77 +--
 fs/tracefs/internal.h|  14 ---
 3 files changed, 144 insertions(+), 49 deletions(-)



[PATCH v2 2/5] tracefs: Still use mount point as default permissions for instances

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

If the instances directory's permissions were never change, then have it
and its children use the mount point permissions as the default.

Currently, the permissions of instance directories are determined by the
instance directory's permissions itself. But if the tracefs file system is
remounted and changes the permissions, the instance directory and its
children should use the new permission.

But because both the instance directory and its children use the instance
directory's inode for permissions, it misses the update.

To demonstrate this:

  # cd /sys/kernel/tracing/
  # mkdir instances/foo
  # ls -ld instances/foo
 drwxr-x--- 5 root root 0 May  1 19:07 instances/foo
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 18:57 instances
  # ls -ld current_tracer
 -rw-r- 1 root root 0 May  1 18:57 current_tracer

  # mount -o remount,gid=1002 .
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 18:57 instances
  # ls -ld instances/foo/
 drwxr-x--- 5 root root 0 May  1 19:07 instances/foo/
  # ls -ld current_tracer
 -rw-r- 1 root lkp 0 May  1 18:57 current_tracer

Notice that changing the group id to that of "lkp" did not affect the
instances directory nor its children. It should have been:

  # ls -ld current_tracer
 -rw-r- 1 root root 0 May  1 19:19 current_tracer
  # ls -ld instances/foo/
 drwxr-x--- 5 root root 0 May  1 19:25 instances/foo/
  # ls -ld instances
 drwxr-x--- 3 root root 0 May  1 19:19 instances

  # mount -o remount,gid=1002 .
  # ls -ld current_tracer
 -rw-r- 1 root lkp 0 May  1 19:19 current_tracer
  # ls -ld instances
 drwxr-x--- 3 root lkp 0 May  1 19:19 instances
  # ls -ld instances/foo/
 drwxr-x--- 5 root lkp 0 May  1 19:25 instances/foo/

Where all files were updated by the remount gid update.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/inode.c | 27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 5f2417df1c43..e1bf65efdbb3 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -180,16 +180,39 @@ static void set_tracefs_inode_owner(struct inode *inode)
 {
struct tracefs_inode *ti = get_tracefs(inode);
struct inode *root_inode = ti->private;
+   kuid_t uid;
+   kgid_t gid;
+
+   uid = root_inode->i_uid;
+   gid = root_inode->i_gid;
+
+   /*
+* If the root is not the mount point, then check the root's
+* permissions. If it was never set, then default to the
+* mount point.
+*/
+   if (root_inode != d_inode(root_inode->i_sb->s_root)) {
+   struct tracefs_inode *rti;
+
+   rti = get_tracefs(root_inode);
+   root_inode = d_inode(root_inode->i_sb->s_root);
+
+   if (!(rti->flags & TRACEFS_UID_PERM_SET))
+   uid = root_inode->i_uid;
+
+   if (!(rti->flags & TRACEFS_GID_PERM_SET))
+   gid = root_inode->i_gid;
+   }
 
/*
 * If this inode has never been referenced, then update
 * the permissions to the superblock.
 */
if (!(ti->flags & TRACEFS_UID_PERM_SET))
-   inode->i_uid = root_inode->i_uid;
+   inode->i_uid = uid;
 
if (!(ti->flags & TRACEFS_GID_PERM_SET))
-   inode->i_gid = root_inode->i_gid;
+   inode->i_gid = gid;
 }
 
 static int tracefs_permission(struct mnt_idmap *idmap,
-- 
2.43.0





[PATCH v2 3/5] eventfs: Do not differentiate the toplevel events directory

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The toplevel events directory is really no different than the events
directory of instances. Having the two be different caused
inconsistencies and made it harder to fix the permissions bugs.

Make all events directories act the same.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 29 -
 fs/tracefs/internal.h|  7 +++
 2 files changed, 11 insertions(+), 25 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a754c6725a52..2971609946e5 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -68,7 +68,6 @@ enum {
EVENTFS_SAVE_MODE   = BIT(16),
EVENTFS_SAVE_UID= BIT(17),
EVENTFS_SAVE_GID= BIT(18),
-   EVENTFS_TOPLEVEL= BIT(19),
 };
 
 #define EVENTFS_MODE_MASK  (EVENTFS_SAVE_MODE - 1)
@@ -232,14 +231,10 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
return ret;
 }
 
-static void update_top_events_attr(struct eventfs_inode *ei, struct 
super_block *sb)
+static void update_events_attr(struct eventfs_inode *ei, struct super_block 
*sb)
 {
struct inode *root;
 
-   /* Only update if the "events" was on the top level */
-   if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
-   return;
-
/* Get the tracefs root inode. */
root = d_inode(sb->s_root);
ei->attr.uid = root->i_uid;
@@ -252,10 +247,10 @@ static void set_top_events_ownership(struct inode *inode)
struct eventfs_inode *ei = ti->private;
 
/* The top events directory doesn't get automatically updated */
-   if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
+   if (!ei || !ei->is_events)
return;
 
-   update_top_events_attr(ei, inode->i_sb);
+   update_events_attr(ei, inode->i_sb);
 
if (!(ei->attr.mode & EVENTFS_SAVE_UID))
inode->i_uid = ei->attr.uid;
@@ -284,7 +279,7 @@ static int eventfs_permission(struct mnt_idmap *idmap,
return generic_permission(idmap, inode, mask);
 }
 
-static const struct inode_operations eventfs_root_dir_inode_operations = {
+static const struct inode_operations eventfs_dir_inode_operations = {
.lookup = eventfs_root_lookup,
.setattr= eventfs_set_attr,
.getattr= eventfs_get_attr,
@@ -352,7 +347,7 @@ static struct eventfs_inode *eventfs_find_events(struct 
dentry *dentry)
// Walk upwards until you find the events inode
} while (!ei->is_events);
 
-   update_top_events_attr(ei, dentry->d_sb);
+   update_events_attr(ei, dentry->d_sb);
 
return ei;
 }
@@ -458,7 +453,7 @@ static struct dentry *lookup_dir_entry(struct dentry 
*dentry,
update_inode_attr(dentry, inode, >attr,
  S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO);
 
-   inode->i_op = _root_dir_inode_operations;
+   inode->i_op = _dir_inode_operations;
inode->i_fop = _file_operations;
 
/* All directories will have the same inode number */
@@ -838,14 +833,6 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
uid = d_inode(dentry->d_parent)->i_uid;
gid = d_inode(dentry->d_parent)->i_gid;
 
-   /*
-* If the events directory is of the top instance, then parent
-* is NULL. Set the attr.mode to reflect this and its permissions will
-* default to the tracefs root dentry.
-*/
-   if (!parent)
-   ei->attr.mode = EVENTFS_TOPLEVEL;
-
/* This is used as the default ownership of the files and directories */
ei->attr.uid = uid;
ei->attr.gid = gid;
@@ -854,13 +841,13 @@ struct eventfs_inode *eventfs_create_events_dir(const 
char *name, struct dentry
INIT_LIST_HEAD(>list);
 
ti = get_tracefs(inode);
-   ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+   ti->flags |= TRACEFS_EVENT_INODE;
ti->private = ei;
 
inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
inode->i_uid = uid;
inode->i_gid = gid;
-   inode->i_op = _root_dir_inode_operations;
+   inode->i_op = _dir_inode_operations;
inode->i_fop = _file_operations;
 
dentry->d_fsdata = get_ei(ei);
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 29f0c75b..f704d8348357 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -4,10 +4,9 @@
 
 enum {
TRACEFS_EVENT_INODE = BIT(1),
-   TRACEFS_EVENT_TOP_INODE = BIT(2),
-   TRACEFS_GID_PERM_SET= BIT(3),
-   TRACEFS_UID_PERM_SET= BIT(4),
-   TRACEFS_INSTANCE_INODE  = BIT(5),
+   TRACEFS_GID_PERM_SET= BIT(2),
+   

[PATCH v2 1/5] tracefs: Reset permissions on remount if permissions are options

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

There's an inconsistency with the way permissions are handled in tracefs.
Because the permissions are generated when accessed, they default to the
root inode's permission if they were never set by the user. If the user
sets the permissions, then a flag is set and the permissions are saved via
the inode (for tracefs files) or an internal attribute field (for
eventfs).

But if a remount happens that specify the permissions, all the files that
were not changed by the user gets updated, but the ones that were are not.
If the user were to remount the file system with a given permission, then
all files and directories within that file system should be updated.

This can cause security issues if a file's permission was updated but the
admin forgot about it. They could incorrectly think that remounting with
permissions set would update all files, but miss some.

For example:

 # cd /sys/kernel/tracing
 # chgrp 1002 current_tracer
 # ls -l
[..]
 -rw-r-  1 root root 0 May  1 21:25 buffer_size_kb
 -rw-r-  1 root root 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r-  1 root root 0 May  1 21:25 buffer_total_size_kb
 -rw-r-  1 root lkp  0 May  1 21:25 current_tracer
 -rw-r-  1 root root 0 May  1 21:25 dynamic_events
 -r--r-  1 root root 0 May  1 21:25 dyn_ftrace_total_info
 -r--r-  1 root root 0 May  1 21:25 enabled_functions

Where current_tracer now has group "lkp".

 # mount -o remount,gid=1001 .
 # ls -l
 -rw-r-  1 root tracing 0 May  1 21:25 buffer_size_kb
 -rw-r-  1 root tracing 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r-  1 root tracing 0 May  1 21:25 buffer_total_size_kb
 -rw-r-  1 root lkp 0 May  1 21:25 current_tracer
 -rw-r-  1 root tracing 0 May  1 21:25 dynamic_events
 -r--r-  1 root tracing 0 May  1 21:25 dyn_ftrace_total_info
 -r--r-  1 root tracing 0 May  1 21:25 enabled_functions

Everything changed but the "current_tracer".

Add a new link list that keeps track of all the tracefs_inodes which has
the permission flags that tell if the file/dir should use the root inode's
permission or not. Then on remount, clear all the flags so that the
default behavior of using the root inode's permission is done for all
files and directories.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default 
ownership")
Signed-off-by: Steven Rostedt (Google) 
---
 fs/tracefs/event_inode.c | 29 +++
 fs/tracefs/inode.c   | 50 +++-
 fs/tracefs/internal.h|  7 +-
 3 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index f5510e26f0f6..a754c6725a52 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -301,6 +301,35 @@ static const struct file_operations 
eventfs_file_operations = {
.llseek = generic_file_llseek,
 };
 
+/*
+ * On a remount of tracefs, if UID or GID options are set, then
+ * the mount point inode permissions should be used.
+ * Reset the saved permission flags appropriately.
+ */
+void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool 
update_gid)
+{
+   struct eventfs_inode *ei = ti->private;
+
+   if (!ei)
+   return;
+
+   if (update_uid)
+   ei->attr.mode &= ~EVENTFS_SAVE_UID;
+
+   if (update_gid)
+   ei->attr.mode &= ~EVENTFS_SAVE_GID;
+
+   if (!ei->entry_attrs)
+   return;
+
+   for (int i = 0; i < ei->nr_entries; i++) {
+   if (update_uid)
+   ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID;
+   if (update_gid)
+   ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID;
+   }
+}
+
 /* Return the evenfs_inode of the "events" directory */
 static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 5545e6bf7d26..5f2417df1c43 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -30,20 +30,47 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;
 
+/*
+ * Keep track of all tracefs_inodes in order to update their
+ * flags if necessary on a remount.
+ */
+static DEFINE_SPINLOCK(tracefs_inode_lock);
+static LIST_HEAD(tracefs_inodes);
+
 static struct inode *tracefs_alloc_inode(struct super_block *sb)
 {
struct tracefs_inode *ti;
+   unsigned long flags;
 
ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL);
if (!ti)
return NULL;
 
+   spin_lock_irqsave(_inode_lock, flags);
+   list_add_rcu(>list, _inodes);
+   spin_unlock_irqrestore(_inode_lock, flags);
+
return >vfs_inode;
 }
 
+static void tracefs_free_inode_rcu(struct rcu_head *rcu)
+{
+   struct tracefs_inode *ti;
+
+   ti = container_of(rcu, struct tracefs_inode, rcu);
+   

Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr

2024-05-02 Thread Steven Rostedt
On Thu, 2 May 2024 06:49:18 +
Tze-nan Wu (吳澤南)  wrote:

> Good news, this patch works, the test has passed, no more Kasan report
> in my environment.

Great to hear!

> 
> my environment:
>  arm64 + kasan + swtag based kasan + kernel-6.6.18
> 
> Really appreciate, and learn a lot from the patch.

Can I add:

Tested-by: Tze-nan Wu (吳澤南) 

?

-- Steve



Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-05-02 Thread Steven Rostedt
On Thu, 2 May 2024 14:38:32 +0100
Vincent Donnefort  wrote:

> > > + while (s < nr_subbufs && p < nr_pages) {
> > > + struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
> > > + int off = 0;
> > > +
> > > + for (; off < (1 << (subbuf_order)); off++, page++) {
> > > + if (p >= nr_pages)
> > > + break;
> > > +
> > > + pages[p++] = page;
> > > + }
> > > + s++;
> > > + }
> > > +
> > > + err = vm_insert_pages(vma, vma->vm_start, pages, _pages);  
> > 
> > Nit: I did not immediately understand if we could end here with p < nr_pages
> > (IOW, pages[] not completely filled).
> > 
> > One source of confusion is the "s < nr_subbufs" check in the while loop: why
> > is "p < nr_pages" insufficient?  
> 
> Hum, indeed, the "s < nr_subbufs" check is superfluous, nr_pages, is already
> capped by the number of subbufs, there's no way we can overflow subbuf_ids[].

We can keep it as is, or perhaps change it to:

while (p < nr_pages) {
struct page *page;
int off = 0;

if (WARN_ON_ONCE(s >= nr_subbufs))
break;

page = virt_to_page(cpu_buffer->subbuf_ids[s]);
for (; off < (1 << (subbuf_order)); off++, page++) {
if (p >= nr_pages)
break;

pages[p++] = page;
}
s++;
}

I don't like having an unchecked dependency between s and p.

-- Steve



Re: [PATCHv4 7/7] man2: Add uretprobe syscall page

2024-05-02 Thread Alejandro Colomar
Hi Jiri,

On Thu, May 02, 2024 at 02:23:13PM +0200, Jiri Olsa wrote:
> Adding man page for new uretprobe syscall.
> 
> Signed-off-by: Jiri Olsa 
> ---
>  man2/uretprobe.2 | 45 +
>  1 file changed, 45 insertions(+)
>  create mode 100644 man2/uretprobe.2
> 
> diff --git a/man2/uretprobe.2 b/man2/uretprobe.2
> new file mode 100644
> index ..08fe6a670430
> --- /dev/null
> +++ b/man2/uretprobe.2
> @@ -0,0 +1,45 @@
> +.\" Copyright (C) 2024, Jiri Olsa 
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH uretprobe 2 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +uretprobe \- execute pending return uprobes
> +.SH SYNOPSIS
> +.nf
> +.B int uretprobe(void)
> +.fi
> +.SH DESCRIPTION
> +Kernel is using
> +.BR uretprobe()
> +syscall to trigger uprobe return probe consumers instead of using
> +standard breakpoint instruction.
> +

Please use .P instead of a blank.  See man-pages(7):

   Formatting conventions (general)
 Paragraphs should be separated by suitable markers (usually either
 .P or .IP).  Do not separate paragraphs using blank lines, as this
 results in poor rendering in some output formats  (such  as  Post‐
 Script and PDF).

> +The uretprobe syscall is not supposed to be called directly by user, it's 
> allowed

s/by user/by the user/

> +to be invoked only through user space trampoline provided by kernel.

s/user space/user-space/

Missing a few 'the' too, here and in the rest of the page.

> +When called from outside of this trampoline, the calling process will receive
> +.BR SIGILL .
> +
> +.SH RETURN VALUE
> +.BR uretprobe()

You're missing a space here:

.BR uretprobe ()

> +return value is specific for given architecture.
> +
> +.SH VERSIONS
> +This syscall is not specified in POSIX,
> +and details of its behavior vary across systems.
> +.SH STANDARDS
> +None.

You could add a HISTORY section.

Have a lovely day!
Alex

> +.SH NOTES
> +.BR uretprobe()
> +syscall is initially introduced on x86-64 architecture, because doing syscall
> +is faster than doing breakpoint trap on it. It might be extended to other
> +architectures.
> +
> +.BR uretprobe()
> +syscall exists only to allow the invocation of return uprobe consumers.
> +It should
> +.B never
> +be called directly.
> +Details of the arguments (if any) passed to
> +.BR uretprobe ()
> +and the return value are specific for given architecture.
> -- 
> 2.44.0
> 
> 

-- 

A client is hiring kernel driver, mm, and/or crypto developers;
contact me if interested.


signature.asc
Description: PGP signature


Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-05-02 Thread Vincent Donnefort
On Thu, May 02, 2024 at 03:30:32PM +0200, David Hildenbrand wrote:
> On 30.04.24 13:13, Vincent Donnefort wrote:
> > In preparation for allowing the user-space to map a ring-buffer, add
> > a set of mapping functions:
> > 
> >ring_buffer_{map,unmap}()
> > 
> > And controls on the ring-buffer:
> > 
> >ring_buffer_map_get_reader()  /* swap reader and head */
> > 
> > Mapping the ring-buffer also involves:
> > 
> >A unique ID for each subbuf of the ring-buffer, currently they are
> >only identified through their in-kernel VA.
> > 
> >A meta-page, where are stored ring-buffer statistics and a
> >description for the current reader
> > 
> > The linear mapping exposes the meta-page, and each subbuf of the
> > ring-buffer, ordered following their unique ID, assigned during the
> > first mapping.
> > 
> > Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
> > size will remain unmodified and the splice enabling functions will in
> > reality simply memcpy the data instead of swapping subbufs.
> > 
> > CC: 
> > Signed-off-by: Vincent Donnefort 
> > 
> > diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> > index dc5ae4e96aee..96d2140b471e 100644
> > --- a/include/linux/ring_buffer.h
> > +++ b/include/linux/ring_buffer.h
> 
> [...]
> 
> > +/*
> > + *   +--+  pgoff == 0
> > + *   |   meta page  |
> > + *   +--+  pgoff == 1
> > + *   | subbuffer 0  |
> > + *   |  |
> > + *   +--+  pgoff == (1 + (1 << subbuf_order))
> > + *   | subbuffer 1  |
> > + *   |  |
> > + * ...
> > + */
> > +#ifdef CONFIG_MMU
> > +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
> > +   struct vm_area_struct *vma)
> > +{
> > +   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
> > +   unsigned int subbuf_pages, subbuf_order;
> > +   struct page **pages;
> > +   int p = 0, s = 0;
> > +   int err;
> > +
> > +   /* Refuse MP_PRIVATE or writable mappings */
> > +   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
> > +   !(vma->vm_flags & VM_MAYSHARE))
> > +   return -EPERM;
> > +
> > +   /*
> > +* Make sure the mapping cannot become writable later. Also tell the VM
> > +* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally,
> > +* prevent migration, GUP and dump (VM_IO).
> > +*/
> > +   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);
> > +
> > +   lockdep_assert_held(_buffer->mapping_lock);
> > +
> > +   subbuf_order = cpu_buffer->buffer->subbuf_order;
> > +   subbuf_pages = 1 << subbuf_order;
> > +
> > +   nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
> > +   nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
> > +
> > +   vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> > +   if (!vma_pages || vma_pages > nr_pages)
> > +   return -EINVAL;
> > +
> > +   nr_pages = vma_pages;
> > +
> > +   pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
> > +   if (!pages)
> > +   return -ENOMEM;
> > +
> > +   if (!pgoff) {
> > +   pages[p++] = virt_to_page(cpu_buffer->meta_page);
> > +
> > +   /*
> > +* TODO: Align sub-buffers on their size, once
> > +* vm_insert_pages() supports the zero-page.
> > +*/
> > +   } else {
> > +   /* Skip the meta-page */
> > +   pgoff--;
> > +
> > +   if (pgoff % subbuf_pages) {
> > +   err = -EINVAL;
> > +   goto out;
> > +   }
> > +
> > +   s += pgoff / subbuf_pages;
> > +   }
> > +
> > +   while (s < nr_subbufs && p < nr_pages) {
> > +   struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
> > +   int off = 0;
> > +
> > +   for (; off < (1 << (subbuf_order)); off++, page++) {
> > +   if (p >= nr_pages)
> > +   break;
> > +
> > +   pages[p++] = page;
> > +   }
> > +   s++;
> > +   }
> > +
> > +   err = vm_insert_pages(vma, vma->vm_start, pages, _pages);
> 
> Nit: I did not immediately understand if we could end here with p < nr_pages
> (IOW, pages[] not completely filled).
> 
> One source of confusion is the "s < nr_subbufs" check in the while loop: why
> is "p < nr_pages" insufficient?

Hum, indeed, the "s < nr_subbufs" check is superfluous, nr_pages, is already
capped by the number of subbufs, there's no way we can overflow subbuf_ids[].

> 
> 
> For the MM bits:
> 
> Acked-by: David Hildenbrand 

Thanks a lot for having a look at the series, very much appreciated!

> 
> 
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 



Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions

2024-05-02 Thread David Hildenbrand

On 30.04.24 13:13, Vincent Donnefort wrote:

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

   ring_buffer_{map,unmap}()

And controls on the ring-buffer:

   ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

   A unique ID for each subbuf of the ring-buffer, currently they are
   only identified through their in-kernel VA.

   A meta-page, where are stored ring-buffer statistics and a
   description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

CC: 
Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dc5ae4e96aee..96d2140b471e 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h


[...]


+/*
+ *   +--+  pgoff == 0
+ *   |   meta page  |
+ *   +--+  pgoff == 1
+ *   | subbuffer 0  |
+ *   |  |
+ *   +--+  pgoff == (1 + (1 << subbuf_order))
+ *   | subbuffer 1  |
+ *   |  |
+ * ...
+ */
+#ifdef CONFIG_MMU
+static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
+   struct vm_area_struct *vma)
+{
+   unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
+   unsigned int subbuf_pages, subbuf_order;
+   struct page **pages;
+   int p = 0, s = 0;
+   int err;
+
+   /* Refuse MP_PRIVATE or writable mappings */
+   if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
+   !(vma->vm_flags & VM_MAYSHARE))
+   return -EPERM;
+
+   /*
+* Make sure the mapping cannot become writable later. Also tell the VM
+* to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally,
+* prevent migration, GUP and dump (VM_IO).
+*/
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);
+
+   lockdep_assert_held(_buffer->mapping_lock);
+
+   subbuf_order = cpu_buffer->buffer->subbuf_order;
+   subbuf_pages = 1 << subbuf_order;
+
+   nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
+   nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
+
+   vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+   if (!vma_pages || vma_pages > nr_pages)
+   return -EINVAL;
+
+   nr_pages = vma_pages;
+
+   pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+   if (!pages)
+   return -ENOMEM;
+
+   if (!pgoff) {
+   pages[p++] = virt_to_page(cpu_buffer->meta_page);
+
+   /*
+* TODO: Align sub-buffers on their size, once
+* vm_insert_pages() supports the zero-page.
+*/
+   } else {
+   /* Skip the meta-page */
+   pgoff--;
+
+   if (pgoff % subbuf_pages) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   s += pgoff / subbuf_pages;
+   }
+
+   while (s < nr_subbufs && p < nr_pages) {
+   struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+   int off = 0;
+
+   for (; off < (1 << (subbuf_order)); off++, page++) {
+   if (p >= nr_pages)
+   break;
+
+   pages[p++] = page;
+   }
+   s++;
+   }
+
+   err = vm_insert_pages(vma, vma->vm_start, pages, _pages);


Nit: I did not immediately understand if we could end here with p < 
nr_pages (IOW, pages[] not completely filled).


One source of confusion is the "s < nr_subbufs" check in the while loop: 
why is "p < nr_pages" insufficient?



For the MM bits:

Acked-by: David Hildenbrand 



--
Cheers,

David / dhildenb




Re: [PATCH v22 1/5] ring-buffer: Allocate sub-buffers with __GFP_COMP

2024-05-02 Thread David Hildenbrand

On 30.04.24 13:13, Vincent Donnefort wrote:

In preparation for the ring-buffer memory mapping, allocate compound
pages for the ring-buffer sub-buffers to enable us to map them to
user-space with vm_insert_pages().

Signed-off-by: Vincent Donnefort 


Acked-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH] eventfs/tracing: Add callback for release of an eventfs_inode

2024-05-02 Thread Steven Rostedt
On Wed, 1 May 2024 23:56:26 +0900
Masami Hiramatsu (Google)  wrote:

> Looks good to me.
> 
> Reviewed-by: Masami Hiramatsu (Google) 

Thanks Masami,

Although Tze-nan pointed out a issue with this patch.

I just published v2, can you review that one too?

Thanks,

-- Steve



[PATCH v2] eventfs/tracing: Add callback for release of an eventfs_inode

2024-05-02 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Synthetic events create and destroy tracefs files when they are created
and removed. The tracing subsystem has its own file descriptor
representing the state of the events attached to the tracefs files.
There's a race between the eventfs files and this file descriptor of the
tracing system where the following can cause an issue:

With two scripts 'A' and 'B' doing:

  Script 'A':
echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
while :
do
  echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
done

  Script 'B':
echo > /sys/kernel/tracing/synthetic_events

Script 'A' creates a synthetic event "hello" and then just writes zero
into its enable file.

Script 'B' removes all synthetic events (including the newly created
"hello" event).

What happens is that the opening of the "enable" file has:

 {
struct trace_event_file *file = inode->i_private;
int ret;

ret = tracing_check_open_get_tr(file->tr);
 [..]

But deleting the events frees the "file" descriptor, and a "use after
free" happens with the dereference at "file->tr".

The file descriptor does have a reference counter, but there needs to be a
way to decrement it from the eventfs when the eventfs_inode is removed
that represents this file descriptor.

Add an optional "release" callback to the eventfs_entry array structure,
that gets called when the eventfs file is about to be removed. This allows
for the creating on the eventfs file to increment the tracing file
descriptor ref counter. When the eventfs file is deleted, it can call the
release function that will call the put function for the tracing file
descriptor.

This will protect the tracing file from being freed while a eventfs file
that references it is being opened.

Link: 
https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-tze-nan...@mediatek.com/

Cc: sta...@vger.kernel.org
Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: Tze-nan wu 
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20240430142327.7bff8...@gandalf.local.home/

- Do not call release function on error creating the eventfs_inode

 fs/tracefs/event_inode.c| 23 +--
 include/linux/tracefs.h |  3 +++
 kernel/trace/trace_events.c | 12 
 3 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 894c6ca1e500..f5510e26f0f6 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -84,10 +84,17 @@ enum {
 static void release_ei(struct kref *ref)
 {
struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, 
kref);
+   const struct eventfs_entry *entry;
struct eventfs_root_inode *rei;
 
WARN_ON_ONCE(!ei->is_freed);
 
+   for (int i = 0; i < ei->nr_entries; i++) {
+   entry = >entries[i];
+   if (entry->release)
+   entry->release(entry->name, ei->data);
+   }
+
kfree(ei->entry_attrs);
kfree_const(ei->name);
if (ei->is_events) {
@@ -112,6 +119,18 @@ static inline void free_ei(struct eventfs_inode *ei)
}
 }
 
+/*
+ * Called when creation of an ei fails, do not call release() functions.
+ */
+static inline void cleanup_ei(struct eventfs_inode *ei)
+{
+   if (ei) {
+   /* Set nr_entries to 0 to prevent release() function being 
called */
+   ei->nr_entries = 0;
+   free_ei(ei);
+   }
+}
+
 static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei)
 {
if (ei)
@@ -734,7 +753,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, 
struct eventfs_inode
 
/* Was the parent freed? */
if (list_empty(>list)) {
-   free_ei(ei);
+   cleanup_ei(ei);
ei = NULL;
}
return ei;
@@ -835,7 +854,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
return ei;
 
  fail:
-   free_ei(ei);
+   cleanup_ei(ei);
tracefs_failed_creating(dentry);
return ERR_PTR(-ENOMEM);
 }
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 7a5fe17b6bf9..d03f74658716 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -62,6 +62,8 @@ struct eventfs_file;
 typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
const struct file_operations **fops);
 
+typedef void (*eventfs_release)(const char *name, void *data);
+
 /**
  * struct eventfs_entry - dynamically created eventfs file call back handler
  * @name:  Then name of the dynamic file in an eventfs directory
@@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t 
*mode, void **data,
 struct eventfs_entry {
const char  *name;
eventfs_callback   

[PATCHv4 7/7] man2: Add uretprobe syscall page

2024-05-02 Thread Jiri Olsa
Adding man page for new uretprobe syscall.

Signed-off-by: Jiri Olsa 
---
 man2/uretprobe.2 | 45 +
 1 file changed, 45 insertions(+)
 create mode 100644 man2/uretprobe.2

diff --git a/man2/uretprobe.2 b/man2/uretprobe.2
new file mode 100644
index ..08fe6a670430
--- /dev/null
+++ b/man2/uretprobe.2
@@ -0,0 +1,45 @@
+.\" Copyright (C) 2024, Jiri Olsa 
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH uretprobe 2 (date) "Linux man-pages (unreleased)"
+.SH NAME
+uretprobe \- execute pending return uprobes
+.SH SYNOPSIS
+.nf
+.B int uretprobe(void)
+.fi
+.SH DESCRIPTION
+Kernel is using
+.BR uretprobe()
+syscall to trigger uprobe return probe consumers instead of using
+standard breakpoint instruction.
+
+The uretprobe syscall is not supposed to be called directly by user, it's 
allowed
+to be invoked only through user space trampoline provided by kernel.
+When called from outside of this trampoline, the calling process will receive
+.BR SIGILL .
+
+.SH RETURN VALUE
+.BR uretprobe()
+return value is specific for given architecture.
+
+.SH VERSIONS
+This syscall is not specified in POSIX,
+and details of its behavior vary across systems.
+.SH STANDARDS
+None.
+.SH NOTES
+.BR uretprobe()
+syscall is initially introduced on x86-64 architecture, because doing syscall
+is faster than doing breakpoint trap on it. It might be extended to other
+architectures.
+
+.BR uretprobe()
+syscall exists only to allow the invocation of return uprobe consumers.
+It should
+.B never
+be called directly.
+Details of the arguments (if any) passed to
+.BR uretprobe ()
+and the return value are specific for given architecture.
-- 
2.44.0




[PATCHv4 bpf-next 6/7] selftests/bpf: Add uretprobe compat test

2024-05-02 Thread Jiri Olsa
Adding test that adds return uprobe inside 32-bit task
and verify the return uprobe and attached bpf programs
get properly executed.

Reviewed-by: Masami Hiramatsu (Google) 
Signed-off-by: Jiri Olsa 
---
 tools/testing/selftests/bpf/.gitignore|  1 +
 tools/testing/selftests/bpf/Makefile  |  7 ++-
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 60 +++
 3 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/.gitignore 
b/tools/testing/selftests/bpf/.gitignore
index f1aebabfb017..69d71223c0dd 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -45,6 +45,7 @@ test_cpp
 /veristat
 /sign-file
 /uprobe_multi
+/uprobe_compat
 *.ko
 *.tmp
 xskxceiver
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 82247aeef857..a94352162290 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -133,7 +133,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr 
test_skb_cgroup_id_user \
xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \
xdp_features bpf_test_no_cfi.ko
 
-TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi
+TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi 
uprobe_compat
 
 ifneq ($(V),1)
 submake_extras := feature_display=0
@@ -631,6 +631,7 @@ TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read 
$(OUTPUT)/bpf_testmod.ko  \
   $(OUTPUT)/xdp_synproxy   \
   $(OUTPUT)/sign-file  \
   $(OUTPUT)/uprobe_multi   \
+  $(OUTPUT)/uprobe_compat  \
   ima_setup.sh \
   verify_sig_setup.sh  \
   $(wildcard progs/btf_dump_test_case_*.c) \
@@ -752,6 +753,10 @@ $(OUTPUT)/uprobe_multi: uprobe_multi.c
$(call msg,BINARY,,$@)
$(Q)$(CC) $(CFLAGS) -O0 $(LDFLAGS) $^ $(LDLIBS) -o $@
 
+$(OUTPUT)/uprobe_compat:
+   $(call msg,BINARY,,$@)
+   $(Q)echo "int main() { return 0; }" | $(CC) $(CFLAGS) -xc -m32 -O0 - -o 
$@
+
 EXTRA_CLEAN := $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)  \
prog_tests/tests.h map_tests/tests.h verifier/tests.h   \
feature bpftool \
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index c6fdb8c59ea3..bfea9a0368a4 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -5,6 +5,7 @@
 #ifdef __x86_64__
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -297,6 +298,58 @@ static void test_uretprobe_syscall_call(void)
close(go[1]);
close(go[0]);
 }
+
+static void test_uretprobe_compat(void)
+{
+   LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
+   .retprobe = true,
+   );
+   struct uprobe_syscall_executed *skel;
+   int err, go[2], pid, c, status;
+
+   if (pipe(go))
+   return;
+
+   skel = uprobe_syscall_executed__open_and_load();
+   if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+   goto cleanup;
+
+   pid = fork();
+   if (pid < 0)
+   goto cleanup;
+
+   /* child */
+   if (pid == 0) {
+   close(go[1]);
+
+   /* wait for parent's kick */
+   err = read(go[0], , 1);
+   if (err != 1)
+   exit(-1);
+   execl("./uprobe_compat", "./uprobe_compat", NULL);
+   exit(-1);
+   }
+
+   skel->links.test = bpf_program__attach_uprobe_multi(skel->progs.test, 
pid,
+   "./uprobe_compat", 
"main", );
+   if (!ASSERT_OK_PTR(skel->links.test, 
"bpf_program__attach_uprobe_multi"))
+   goto cleanup;
+
+   /* kick the child */
+   write(go[1], , 1);
+   err = waitpid(pid, , 0);
+   ASSERT_EQ(err, pid, "waitpid");
+
+   /* verify the child exited normally and the bpf program got executed */
+   ASSERT_EQ(WIFEXITED(status), 1, "WIFEXITED");
+   ASSERT_EQ(WEXITSTATUS(status), 0, "WEXITSTATUS");
+   ASSERT_EQ(skel->bss->executed, 1, "executed");
+
+cleanup:
+   uprobe_syscall_executed__destroy(skel);
+   close(go[0]);
+   close(go[1]);
+}
 #else
 static void test_uretprobe_regs_equal(void)
 {
@@ -312,6 +365,11 @@ static void test_uretprobe_syscall_call(void)
 {
test__skip();
 }
+
+static void test_uretprobe_compat(void)
+{
+   test__skip();
+}
 #endif
 
 void test_uprobe_syscall(void)
@@ -322,4 +380,6 @@ void test_uprobe_syscall(void)

[PATCHv4 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test

2024-05-02 Thread Jiri Olsa
Adding test to verify that when called from outside of the
trampoline provided by kernel, the uretprobe syscall will cause
calling process to receive SIGILL signal and the attached bpf
program is not executed.

Reviewed-by: Masami Hiramatsu (Google) 
Signed-off-by: Jiri Olsa 
---
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 95 +++
 .../bpf/progs/uprobe_syscall_executed.c   | 17 
 2 files changed, 112 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c

diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
index 1a50cd35205d..c6fdb8c59ea3 100644
--- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -7,7 +7,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include "uprobe_syscall.skel.h"
+#include "uprobe_syscall_executed.skel.h"
 
 __naked unsigned long uretprobe_regs_trigger(void)
 {
@@ -209,6 +212,91 @@ static void test_uretprobe_regs_change(void)
}
 }
 
+#ifndef __NR_uretprobe
+#define __NR_uretprobe 462
+#endif
+
+__naked unsigned long uretprobe_syscall_call_1(void)
+{
+   /*
+* Pretend we are uretprobe trampoline to trigger the return
+* probe invocation in order to verify we get SIGILL.
+*/
+   asm volatile (
+   "pushq %rax\n"
+   "pushq %rcx\n"
+   "pushq %r11\n"
+   "movq $" __stringify(__NR_uretprobe) ", %rax\n"
+   "syscall\n"
+   "popq %r11\n"
+   "popq %rcx\n"
+   "retq\n"
+   );
+}
+
+__naked unsigned long uretprobe_syscall_call(void)
+{
+   asm volatile (
+   "call uretprobe_syscall_call_1\n"
+   "retq\n"
+   );
+}
+
+static void test_uretprobe_syscall_call(void)
+{
+   LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
+   .retprobe = true,
+   );
+   struct uprobe_syscall_executed *skel;
+   int pid, status, err, go[2], c;
+
+   if (pipe(go))
+   return;
+
+   skel = uprobe_syscall_executed__open_and_load();
+   if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
+   goto cleanup;
+
+   pid = fork();
+   if (!ASSERT_GE(pid, 0, "fork"))
+   goto cleanup;
+
+   /* child */
+   if (pid == 0) {
+   close(go[1]);
+
+   /* wait for parent's kick */
+   err = read(go[0], , 1);
+   if (err != 1)
+   exit(-1);
+
+   uretprobe_syscall_call();
+   _exit(0);
+   }
+
+   skel->links.test = bpf_program__attach_uprobe_multi(skel->progs.test, 
pid,
+   "/proc/self/exe",
+   
"uretprobe_syscall_call", );
+   if (!ASSERT_OK_PTR(skel->links.test, 
"bpf_program__attach_uprobe_multi"))
+   goto cleanup;
+
+   /* kick the child */
+   write(go[1], , 1);
+   err = waitpid(pid, , 0);
+   ASSERT_EQ(err, pid, "waitpid");
+
+   /* verify the child got killed with SIGILL */
+   ASSERT_EQ(WIFSIGNALED(status), 1, "WIFSIGNALED");
+   ASSERT_EQ(WTERMSIG(status), SIGILL, "WTERMSIG");
+
+   /* verify the uretprobe program wasn't called */
+   ASSERT_EQ(skel->bss->executed, 0, "executed");
+
+cleanup:
+   uprobe_syscall_executed__destroy(skel);
+   close(go[1]);
+   close(go[0]);
+}
 #else
 static void test_uretprobe_regs_equal(void)
 {
@@ -219,6 +307,11 @@ static void test_uretprobe_regs_change(void)
 {
test__skip();
 }
+
+static void test_uretprobe_syscall_call(void)
+{
+   test__skip();
+}
 #endif
 
 void test_uprobe_syscall(void)
@@ -227,4 +320,6 @@ void test_uprobe_syscall(void)
test_uretprobe_regs_equal();
if (test__start_subtest("uretprobe_regs_change"))
test_uretprobe_regs_change();
+   if (test__start_subtest("uretprobe_syscall_call"))
+   test_uretprobe_syscall_call();
 }
diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c 
b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
new file mode 100644
index ..0d7f1a7db2e2
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
@@ -0,0 +1,17 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include 
+#include 
+
+struct pt_regs regs;
+
+char _license[] SEC("license") = "GPL";
+
+int executed = 0;
+
+SEC("uretprobe.multi")
+int test(struct pt_regs *regs)
+{
+   executed = 1;
+   return 0;
+}
-- 
2.44.0




[PATCHv4 bpf-next 4/7] selftests/bpf: Add uretprobe syscall test for regs changes

2024-05-02 Thread Jiri Olsa
Adding test that creates uprobe consumer on uretprobe which changes some
of the registers. Making sure the changed registers are propagated to the
user space when the ureptobe syscall trampoline is used on x86_64.

To be able to do this, adding support to bpf_testmod to create uprobe via
new attribute file:
  /sys/kernel/bpf_testmod_uprobe

This file is expecting file offset and creates related uprobe on current
process exe file and removes existing uprobe if offset is 0. The can be
only single uprobe at any time.

The uprobe has specific consumer that changes registers used in ureprobe
syscall trampoline and which are later checked in the test.

Acked-by: Andrii Nakryiko 
Reviewed-by: Masami Hiramatsu (Google) 
Signed-off-by: Jiri Olsa 
---
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   | 123 +-
 .../selftests/bpf/prog_tests/uprobe_syscall.c |  67 ++
 2 files changed, 189 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c 
b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index eb2b78552ca2..27a12d125b9f 100644
--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "bpf_testmod.h"
 #include "bpf_testmod_kfunc.h"
 
@@ -343,6 +344,119 @@ static struct bin_attribute bin_attr_bpf_testmod_file 
__ro_after_init = {
.write = bpf_testmod_test_write,
 };
 
+/* bpf_testmod_uprobe sysfs attribute is so far enabled for x86_64 only,
+ * please see test_uretprobe_regs_change test
+ */
+#ifdef __x86_64__
+
+static int
+uprobe_ret_handler(struct uprobe_consumer *self, unsigned long func,
+  struct pt_regs *regs)
+
+{
+   regs->ax  = 0x12345678deadbeef;
+   regs->cx  = 0x87654321feebdaed;
+   regs->r11 = (u64) -1;
+   return true;
+}
+
+struct testmod_uprobe {
+   struct path path;
+   loff_t offset;
+   struct uprobe_consumer consumer;
+};
+
+static DEFINE_MUTEX(testmod_uprobe_mutex);
+
+static struct testmod_uprobe uprobe = {
+   .consumer.ret_handler = uprobe_ret_handler,
+};
+
+static int testmod_register_uprobe(loff_t offset)
+{
+   int err = -EBUSY;
+
+   if (uprobe.offset)
+   return -EBUSY;
+
+   mutex_lock(_uprobe_mutex);
+
+   if (uprobe.offset)
+   goto out;
+
+   err = kern_path("/proc/self/exe", LOOKUP_FOLLOW, );
+   if (err)
+   goto out;
+
+   err = uprobe_register_refctr(d_real_inode(uprobe.path.dentry),
+offset, 0, );
+   if (err)
+   path_put();
+   else
+   uprobe.offset = offset;
+
+out:
+   mutex_unlock(_uprobe_mutex);
+   return err;
+}
+
+static void testmod_unregister_uprobe(void)
+{
+   mutex_lock(_uprobe_mutex);
+
+   if (uprobe.offset) {
+   uprobe_unregister(d_real_inode(uprobe.path.dentry),
+ uprobe.offset, );
+   uprobe.offset = 0;
+   }
+
+   mutex_unlock(_uprobe_mutex);
+}
+
+static ssize_t
+bpf_testmod_uprobe_write(struct file *file, struct kobject *kobj,
+struct bin_attribute *bin_attr,
+char *buf, loff_t off, size_t len)
+{
+   unsigned long offset;
+   int err;
+
+   if (kstrtoul(buf, 0, ))
+   return -EINVAL;
+
+   if (offset)
+   err = testmod_register_uprobe(offset);
+   else
+   testmod_unregister_uprobe();
+
+   return err ?: strlen(buf);
+}
+
+static struct bin_attribute bin_attr_bpf_testmod_uprobe_file __ro_after_init = 
{
+   .attr = { .name = "bpf_testmod_uprobe", .mode = 0666, },
+   .write = bpf_testmod_uprobe_write,
+};
+
+static int register_bpf_testmod_uprobe(void)
+{
+   return sysfs_create_bin_file(kernel_kobj, 
_attr_bpf_testmod_uprobe_file);
+}
+
+static void unregister_bpf_testmod_uprobe(void)
+{
+   testmod_unregister_uprobe();
+   sysfs_remove_bin_file(kernel_kobj, _attr_bpf_testmod_uprobe_file);
+}
+
+#else
+static int register_bpf_testmod_uprobe(void)
+{
+   return 0;
+}
+
+static void unregister_bpf_testmod_uprobe(void) { }
+#endif
+
 BTF_KFUNCS_START(bpf_testmod_common_kfunc_ids)
 BTF_ID_FLAGS(func, bpf_iter_testmod_seq_new, KF_ITER_NEW)
 BTF_ID_FLAGS(func, bpf_iter_testmod_seq_next, KF_ITER_NEXT | KF_RET_NULL)
@@ -655,7 +769,13 @@ static int bpf_testmod_init(void)
return ret;
if (bpf_fentry_test1(0) < 0)
return -EINVAL;
-   return sysfs_create_bin_file(kernel_kobj, _attr_bpf_testmod_file);
+   ret = sysfs_create_bin_file(kernel_kobj, _attr_bpf_testmod_file);
+   if (ret < 0)
+   return ret;
+   ret = register_bpf_testmod_uprobe();
+   if (ret < 0)
+   return ret;
+   return 0;
 }
 
 static void bpf_testmod_exit(void)
@@ -669,6 +789,7 @@ static void bpf_testmod_exit(void)
  

Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-05-02 Thread Tobias Huschle
On Wed, May 01, 2024 at 11:31:02AM -0400, Michael S. Tsirkin wrote:
> On Wed, May 01, 2024 at 12:51:51PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 30, 2024 at 12:50:05PM +0200, Tobias Huschle wrote:
<...>
> > 
> > I'm still wondering why exactly it is imperative for t2 to preempt t1.
> > Is there some unexpressed serialization / spin-waiting ?
> 
> 
> I am not sure but I think the point is that t2 is a kworker. It is
> much cheaper to run it right now when we are already in the kernel
> than return to userspace, let it run for a bit then interrupt it
> and then run t2.
> Right, Tobias?
> 

That would be correct, the optimal scenario would be that t1, the vhost
does its thing, wakes up t2, the kworker, makes sure t2 executes immediately,
then gets control again and continues watching its loops without ever 
leaving kernel space.



[PATCHv4 bpf-next 3/7] selftests/bpf: Add uretprobe syscall test for regs integrity

2024-05-02 Thread Jiri Olsa
Add uretprobe syscall test that compares register values before
and after the uretprobe is hit. It also compares the register
values seen from attached bpf program.

Acked-by: Andrii Nakryiko 
Reviewed-by: Masami Hiramatsu (Google) 
Signed-off-by: Jiri Olsa 
---
 tools/include/linux/compiler.h|   4 +
 .../selftests/bpf/prog_tests/uprobe_syscall.c | 163 ++
 .../selftests/bpf/progs/uprobe_syscall.c  |  15 ++
 3 files changed, 182 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall.c

diff --git a/tools/include/linux/compiler.h b/tools/include/linux/compiler.h
index 8a63a9913495..6f7f22ac9da5 100644
--- a/tools/include/linux/compiler.h
+++ b/tools/include/linux/compiler.h
@@ -62,6 +62,10 @@
 #define __nocf_check __attribute__((nocf_check))
 #endif
 
+#ifndef __naked
+#define __naked __attribute__((__naked__))
+#endif
+
 /* Are two types/vars the same type (ignoring qualifiers)? */
 #ifndef __same_type
 # define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
new file mode 100644
index ..311ac19d8992
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
@@ -0,0 +1,163 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+
+#ifdef __x86_64__
+
+#include 
+#include 
+#include 
+#include "uprobe_syscall.skel.h"
+
+__naked unsigned long uretprobe_regs_trigger(void)
+{
+   asm volatile (
+   "movq $0xdeadbeef, %rax\n"
+   "ret\n"
+   );
+}
+
+__naked void uretprobe_regs(struct pt_regs *before, struct pt_regs *after)
+{
+   asm volatile (
+   "movq %r15,   0(%rdi)\n"
+   "movq %r14,   8(%rdi)\n"
+   "movq %r13,  16(%rdi)\n"
+   "movq %r12,  24(%rdi)\n"
+   "movq %rbp,  32(%rdi)\n"
+   "movq %rbx,  40(%rdi)\n"
+   "movq %r11,  48(%rdi)\n"
+   "movq %r10,  56(%rdi)\n"
+   "movq  %r9,  64(%rdi)\n"
+   "movq  %r8,  72(%rdi)\n"
+   "movq %rax,  80(%rdi)\n"
+   "movq %rcx,  88(%rdi)\n"
+   "movq %rdx,  96(%rdi)\n"
+   "movq %rsi, 104(%rdi)\n"
+   "movq %rdi, 112(%rdi)\n"
+   "movq   $0, 120(%rdi)\n" /* orig_rax */
+   "movq   $0, 128(%rdi)\n" /* rip  */
+   "movq   $0, 136(%rdi)\n" /* cs   */
+   "pushf\n"
+   "pop %rax\n"
+   "movq %rax, 144(%rdi)\n" /* eflags   */
+   "movq %rsp, 152(%rdi)\n" /* rsp  */
+   "movq   $0, 160(%rdi)\n" /* ss   */
+
+   /* save 2nd argument */
+   "pushq %rsi\n"
+   "call uretprobe_regs_trigger\n"
+
+   /* save  return value and load 2nd argument pointer to rax */
+   "pushq %rax\n"
+   "movq 8(%rsp), %rax\n"
+
+   "movq %r15,   0(%rax)\n"
+   "movq %r14,   8(%rax)\n"
+   "movq %r13,  16(%rax)\n"
+   "movq %r12,  24(%rax)\n"
+   "movq %rbp,  32(%rax)\n"
+   "movq %rbx,  40(%rax)\n"
+   "movq %r11,  48(%rax)\n"
+   "movq %r10,  56(%rax)\n"
+   "movq  %r9,  64(%rax)\n"
+   "movq  %r8,  72(%rax)\n"
+   "movq %rcx,  88(%rax)\n"
+   "movq %rdx,  96(%rax)\n"
+   "movq %rsi, 104(%rax)\n"
+   "movq %rdi, 112(%rax)\n"
+   "movq   $0, 120(%rax)\n" /* orig_rax */
+   "movq   $0, 128(%rax)\n" /* rip  */
+   "movq   $0, 136(%rax)\n" /* cs   */
+
+   /* restore return value and 2nd argument */
+   "pop %rax\n"
+   "pop %rsi\n"
+
+   "movq %rax,  80(%rsi)\n"
+
+   "pushf\n"
+   "pop %rax\n"
+
+   "movq %rax, 144(%rsi)\n" /* eflags   */
+   "movq %rsp, 152(%rsi)\n" /* rsp  */
+   "movq   $0, 160(%rsi)\n" /* ss   */
+   "ret\n"
+);
+}
+
+static void test_uretprobe_regs_equal(void)
+{
+   struct uprobe_syscall *skel = NULL;
+   struct pt_regs before = {}, after = {};
+   unsigned long *pb = (unsigned long *) 
+   unsigned long *pa = (unsigned long *) 
+   unsigned long *pp;
+   unsigned int i, cnt;
+   int err;
+
+   skel = uprobe_syscall__open_and_load();
+   if (!ASSERT_OK_PTR(skel, "uprobe_syscall__open_and_load"))
+   goto cleanup;
+
+   err = uprobe_syscall__attach(skel);
+   if (!ASSERT_OK(err, "uprobe_syscall__attach"))
+   goto cleanup;
+
+   uretprobe_regs(, );
+
+   pp = (unsigned long *) >bss->regs;
+   cnt = sizeof(before)/sizeof(*pb);
+
+   for (i = 0; i < cnt; i++) {

[PATCHv4 bpf-next 2/7] uprobe: Add uretprobe syscall to speed up return probe

2024-05-02 Thread Jiri Olsa
Adding uretprobe syscall instead of trap to speed up return probe.

At the moment the uretprobe setup/path is:

  - install entry uprobe

  - when the uprobe is hit, it overwrites probed function's return address
on stack with address of the trampoline that contains breakpoint
instruction

  - the breakpoint trap code handles the uretprobe consumers execution and
jumps back to original return address

This patch replaces the above trampoline's breakpoint instruction with new
ureprobe syscall call. This syscall does exactly the same job as the trap
with some more extra work:

  - syscall trampoline must save original value for rax/r11/rcx registers
on stack - rax is set to syscall number and r11/rcx are changed and
used by syscall instruction

  - the syscall code reads the original values of those registers and
restore those values in task's pt_regs area

  - only caller from trampoline exposed in '[uprobes]' is allowed,
the process will receive SIGILL signal otherwise

Even with some extra work, using the uretprobes syscall shows speed
improvement (compared to using standard breakpoint):

  On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)

  current:
uretprobe-nop  :1.498 ± 0.000M/s
uretprobe-push :1.448 ± 0.001M/s
uretprobe-ret  :0.816 ± 0.001M/s

  with the fix:
uretprobe-nop  :1.969 ± 0.002M/s  < 31% speed up
uretprobe-push :1.910 ± 0.000M/s  < 31% speed up
uretprobe-ret  :0.934 ± 0.000M/s  < 14% speed up

  On Amd (AMD Ryzen 7 5700U)

  current:
uretprobe-nop  :0.778 ± 0.001M/s
uretprobe-push :0.744 ± 0.001M/s
uretprobe-ret  :0.540 ± 0.001M/s

  with the fix:
uretprobe-nop  :0.860 ± 0.001M/s  < 10% speed up
uretprobe-push :0.818 ± 0.001M/s  < 10% speed up
uretprobe-ret  :0.578 ± 0.000M/s  <  7% speed up

The performance test spawns a thread that runs loop which triggers
uprobe with attached bpf program that increments the counter that
gets printed in results above.

The uprobe (and uretprobe) kind is determined by which instruction
is being patched with breakpoint instruction. That's also important
for uretprobes, because uprobe is installed for each uretprobe.

The performance test is part of bpf selftests:
  tools/testing/selftests/bpf/run_bench_uprobes.sh

Note at the moment uretprobe syscall is supported only for native
64-bit process, compat process still uses standard breakpoint.

Suggested-by: Andrii Nakryiko 
Reviewed-by: Oleg Nesterov 
Reviewed-by: Masami Hiramatsu (Google) 
Acked-by: Andrii Nakryiko 
Signed-off-by: Oleg Nesterov 
Signed-off-by: Jiri Olsa 
---
 arch/x86/kernel/uprobes.c | 115 ++
 include/linux/uprobes.h   |   3 +
 kernel/events/uprobes.c   |  24 +---
 3 files changed, 135 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 6c07f6daaa22..81e6ee95784d 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -308,6 +309,120 @@ static int uprobe_init_insn(struct arch_uprobe *auprobe, 
struct insn *insn, bool
 }
 
 #ifdef CONFIG_X86_64
+
+asm (
+   ".pushsection .rodata\n"
+   ".global uretprobe_syscall_entry\n"
+   "uretprobe_syscall_entry:\n"
+   "pushq %rax\n"
+   "pushq %rcx\n"
+   "pushq %r11\n"
+   "movq $" __stringify(__NR_uretprobe) ", %rax\n"
+   "syscall\n"
+   ".global uretprobe_syscall_check\n"
+   "uretprobe_syscall_check:\n"
+   "popq %r11\n"
+   "popq %rcx\n"
+
+   /* The uretprobe syscall replaces stored %rax value with final
+* return address, so we don't restore %rax in here and just
+* call ret.
+*/
+   "retq\n"
+   ".global uretprobe_syscall_end\n"
+   "uretprobe_syscall_end:\n"
+   ".popsection\n"
+);
+
+extern u8 uretprobe_syscall_entry[];
+extern u8 uretprobe_syscall_check[];
+extern u8 uretprobe_syscall_end[];
+
+void *arch_uprobe_trampoline(unsigned long *psize)
+{
+   static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
+   struct pt_regs *regs = task_pt_regs(current);
+
+   /*
+* At the moment the uretprobe syscall trampoline is supported
+* only for native 64-bit process, the compat process still uses
+* standard breakpoint.
+*/
+   if (user_64bit_mode(regs)) {
+   *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
+   return uretprobe_syscall_entry;
+   }
+
+   *psize = UPROBE_SWBP_INSN_SIZE;
+   return 
+}
+
+static unsigned long trampoline_check_ip(void)
+{
+   unsigned long tramp = uprobe_get_trampoline_vaddr();
+
+   return tramp + (uretprobe_syscall_check - uretprobe_syscall_entry);
+}
+
+SYSCALL_DEFINE0(uretprobe)
+{
+   struct pt_regs *regs = task_pt_regs(current);
+   unsigned long err, ip, sp, r11_cx_ax[3];
+
+   if (regs->ip 

[PATCHv4 bpf-next 1/7] uprobe: Wire up uretprobe system call

2024-05-02 Thread Jiri Olsa
Wiring up uretprobe system call, which comes in following changes.
We need to do the wiring before, because the uretprobe implementation
needs the syscall number.

Note at the moment uretprobe syscall is supported only for native
64-bit process.

Reviewed-by: Oleg Nesterov 
Reviewed-by: Masami Hiramatsu (Google) 
Acked-by: Andrii Nakryiko 
Signed-off-by: Jiri Olsa 
---
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/linux/syscalls.h   | 2 ++
 include/uapi/asm-generic/unistd.h  | 5 -
 kernel/sys_ni.c| 2 ++
 4 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..af0a33ab06ee 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -383,6 +383,7 @@
 459common  lsm_get_self_attr   sys_lsm_get_self_attr
 460common  lsm_set_self_attr   sys_lsm_set_self_attr
 461common  lsm_list_modulessys_lsm_list_modules
+46264  uretprobe   sys_uretprobe
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e619ac10cd23..5318e0e76799 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -972,6 +972,8 @@ asmlinkage long sys_lsm_list_modules(u64 *ids, u32 *size, 
u32 flags);
 /* x86 */
 asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
 
+asmlinkage long sys_uretprobe(void);
+
 /* pciconfig: alpha, arm, arm64, ia64, sparc */
 asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
unsigned long off, unsigned long len,
diff --git a/include/uapi/asm-generic/unistd.h 
b/include/uapi/asm-generic/unistd.h
index 75f00965ab15..8a747cd1d735 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
 #define __NR_lsm_list_modules 461
 __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
 
+#define __NR_uretprobe 462
+__SYSCALL(__NR_uretprobe, sys_uretprobe)
+
 #undef __NR_syscalls
-#define __NR_syscalls 462
+#define __NR_syscalls 463
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index faad00cce269..be6195e0d078 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
 
 /* restartable sequence */
 COND_SYSCALL(rseq);
+
+COND_SYSCALL(uretprobe);
-- 
2.44.0




[PATCHv4 bpf-next 0/7] uprobe: uretprobe speed up

2024-05-02 Thread Jiri Olsa
hi,
as part of the effort on speeding up the uprobes [0] coming with
return uprobe optimization by using syscall instead of the trap
on the uretprobe trampoline.

The speed up depends on instruction type that uprobe is installed
and depends on specific HW type, please check patch 1 for details.

Patches 1-6 are based on bpf-next/master, but path 1 and 2 are
apply-able on linux-trace.git tree probes/for-next branch.
Patch 7 is based on man-pages master.

v4 changes:
  - added acks [Oleg,Andrii,Masami]
  - reworded the man page and adding more info to NOTE section [Masami]
  - rewrote bpf tests not to use trace_pipe [Andrii]
  - cc-ed linux-man list

Also available at:
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  uretprobe_syscall

thanks,
jirka


Notes to check list items in Documentation/process/adding-syscalls.rst:

- System Call Alternatives
  New syscall seems like the best way in here, becase we need
  just to quickly enter kernel with no extra arguments processing,
  which we'd need to do if we decided to use another syscall.

- Designing the API: Planning for Extension
  The uretprobe syscall is very specific and most likely won't be
  extended in the future.

  At the moment it does not take any arguments and even if it does
  in future, it's allowed to be called only from trampoline prepared
  by kernel, so there'll be no broken user.

- Designing the API: Other Considerations
  N/A because uretprobe syscall does not return reference to kernel
  object.

- Proposing the API
  Wiring up of the uretprobe system call si in separate change,
  selftests and man page changes are part of the patchset.

- Generic System Call Implementation
  There's no CONFIG option for the new functionality because it
  keeps the same behaviour from the user POV.

- x86 System Call Implementation
  It's 64-bit syscall only.

- Compatibility System Calls (Generic)
  N/A uretprobe syscall has no arguments and is not supported
  for compat processes.

- Compatibility System Calls (x86)
  N/A uretprobe syscall is not supported for compat processes.

- System Calls Returning Elsewhere
  N/A.

- Other Details
  N/A.

- Testing
  Adding new bpf selftests and ran ltp on top of this change.

- Man Page
  Attached.

- Do not call System Calls in the Kernel
  N/A.


[0] https://lore.kernel.org/bpf/ZeCXHKJ--iYYbmLj@krava/
---
Jiri Olsa (6):
  uprobe: Wire up uretprobe system call
  uprobe: Add uretprobe syscall to speed up return probe
  selftests/bpf: Add uretprobe syscall test for regs integrity
  selftests/bpf: Add uretprobe syscall test for regs changes
  selftests/bpf: Add uretprobe syscall call from user space test
  selftests/bpf: Add uretprobe compat test

 arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
 arch/x86/kernel/uprobes.c   | 115 

 include/linux/syscalls.h|   2 +
 include/linux/uprobes.h |   3 +
 include/uapi/asm-generic/unistd.h   |   5 +-
 kernel/events/uprobes.c |  24 --
 kernel/sys_ni.c |   2 +
 tools/include/linux/compiler.h  |   4 +
 tools/testing/selftests/bpf/.gitignore  |   1 +
 tools/testing/selftests/bpf/Makefile|   7 +-
 tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c   | 123 
-
 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c | 382 
+++
 tools/testing/selftests/bpf/progs/uprobe_syscall.c  |  15 
 tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c |  17 +
 14 files changed, 691 insertions(+), 10 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall.c
 create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c

Jiri Olsa (1):
  man2: Add uretprobe syscall page

 man2/uretprobe.2 | 45 +
 1 file changed, 45 insertions(+)
 create mode 100644 man2/uretprobe.2



Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-05-02 Thread Tobias Huschle
On Wed, May 01, 2024 at 12:51:51PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 30, 2024 at 12:50:05PM +0200, Tobias Huschle wrote:
<...>
> > 
> > Let's now assume, that ocassionally, task 2 runs a little bit longer than
> > task 1. In CFS, this means, that task 2 can close the vruntime gap by a
> > bit, but, it can easily remain below the value of task 1. Task 2 would 
> > still get picked on wake up.
> > With EEVDF, in its current form, task 2 will now get a negative lag, which
> > in turn, will cause it not being picked on the next wake up.
> 
> Right, so I've been working on changes where tasks will be able to
> 'earn' credit when sleeping. Specifically, keeping dequeued tasks on the
> runqueue will allow them to burn off negative lag. Once they get picked
> again they are guaranteed to have zero (or more) lag. If by that time
> they've not been woken up again, they get dequeued with 0-lag.
> 
> (placement with 0-lag will ensure eligibility doesn't inhibit the pick,
> but is not sufficient to ensure a pick)
> 
> However, this alone will not be sufficient to get the behaviour you
> want. Notably, even at 0-lag the virtual deadline will still be after
> the virtual deadline of the already running task -- assuming they have
> equal request sizes.
> 
> That is, IIUC, you want your task 2 (kworker) to always preempt task 1
> (vhost), right? So even if tsak 2 were to have 0-lag, placing it would
> be something like:
> 
> t1  |-<
> t2|-<
> V-|-

Exactly, the kworker should be picked. I experimented with that a bit as
well and forced all tasks to have 0-lag on wake-up but got the results
you are mentioning here. Only if I would give the kworker (in general all
woken up tasks) a lag >0, with 1 being already sufficient, the kworker 
would be picked consistently.

> 
> So t1 has started at | with a virtual deadline at <. Then a short
> while later -- V will have advanced a little -- it wakes t2 with 0-lag,
> but as you can observe, its virtual deadline will be later than t1's and
> as such it will never get picked, even though they're both eligible.
> 
> > So, it seems we have a change in the level of how far the both variants 
> > look 
> > into the past. CFS being willing to take more history into account, whereas
> > EEVDF does not (with update_entity_lag setting the lag value from scratch, 
> > and place_entity not taking the original vruntime into account).
> >
> > All of this can be seen as correct by design, a task consumes more time
> > than the others, so it has to give way to others. The big difference
> > is now, that CFS allowed a task to collect some bonus by constantly using 
> > less CPU time than others and trading that time against ocassionally taking
> > more CPU time. EEVDF could do the same thing, by allowing the accumulation
> > of positive lag, which can then be traded against the one time the task
> > would get negative lag. This might clash with other EEVDF assumptions 
> > though.
> 
> Right, so CFS was a pure virtual runtime based scheduler, while EEVDF
> considers both virtual runtime (for eligibility, which ties to fairness)
> but primarily virtual deadline (for timeliness).
> 
> If you want to make EEVDF force pick a task by modifying vruntime you
> have to place it with lag > request (slice) such that the virtual
> deadline of the newly placed task is before the already running task,
> yielding both eligibility and earliest deadline.
> 
> Consistently placing tasks with such large (positive) lag will affect
> fairness though, they're basically always runnable, so barring external
> throttling, they'll starve you.

I was concerned about that as well. Tampering with the lag value will help
in this particular scenario but might cause problems with others.

> 
> > The patch below fixes the degredation, but is not at all aligned with what 
> > EEVDF wants to achieve, but it helps as an indicator that my hypothesis is
> > correct.
> > 
> > So, what does this now mean for the vhost regression we were discussing?
> > 
> > 1. The behavior of the scheduler changed with regard to wake-up scenarios.
> > 2. vhost in its current form relies on the way how CFS works by assuming 
> >that the kworker always gets scheduled.
> 
> How does it assume this? Also, this is a performance issue, not a
> correctness issue, right?

vhost runs a while(true) loop to go over its queues. After each iteration it 
runs cond_resched to give other tasks a chance to run. So, it will never be
pre-empted by the kworker, since the kworker has no handle to do so since vhost
is running in kernel space. This means, that the wake up path is the only
chance for the kworker to get selected.

So that assumption is of a very implicit nature.

In fact, vhost will run forever until migration hits due do an issue that I 
assume in the cgroup context. See here:
https://lore.kernel.org/all/20240228161023.14310-1-husc...@linux.ibm.com/
Fixing this issue still lets 

Re: [PATCH 00/20] sh: Fix missing prototypes

2024-05-02 Thread John Paul Adrian Glaubitz
On Fri, 2024-03-01 at 22:02 +0100, Geert Uytterhoeven wrote:
>   Hi all,
> 
> This patch series fixes several "no previous prototype for "
> warnings when building a kernel for SuperH.
> 
> Known issues:
>   - The various warnings about cache functions are not yet fixed, but
> I didn't want to hold off the rest of this series,
>   - sdk7786_defconfig needs "[PATCH/RFC] locking/spinlocks: Make __raw_*
> lock ops static" [1],
>   - Probably there are more warnings to fix, I didn't build all
> defconfigs.
> 
> This has been boot-tested on landisk and on qemu/rts7751r2d.
> 
> Thanks for your comments!
> 
> [1] 
> https://lore.kernel.org/linux-sh/c395b02613572131568bc1fd1bc456d20d1a5426.1709325647.git.geert+rene...@glider.be
> 
> Geert Uytterhoeven (20):
>   sh: pgtable: Fix missing prototypes
>   sh: fpu: Add missing forward declarations
>   sh: syscall: Add missing forward declaration for sys_cacheflush()
>   sh: tlb: Add missing forward declaration for handle_tlbmiss()
>   sh: return_address: Add missing #include 
>   sh: traps: Add missing #include 
>   sh: hw_breakpoint: Add missing forward declaration for
> arch_bp_generic_fields()
>   sh: boot: Add proper forward declarations
>   sh: ftrace: Fix missing prototypes
>   sh: nommu: Add missing #include 
>   sh: math-emu: Add missing #include 
>   sh: dma: Remove unused dmac_search_free_channel()
>   sh: sh2a: Add missing #include 
>   sh: sh7786: Remove unused sh7786_usb_use_exclock()
>   sh: smp: Fix missing prototypes
>   sh: kprobes: Merge arch_copy_kprobe() into arch_prepare_kprobe()
>   sh: kprobes: Make trampoline_probe_handler() static
>   sh: kprobes: Remove unneeded kprobe_opcode_t casts
>   sh: dwarf: Make dwarf_lookup_fde() static
>   [RFC] sh: dma: Remove unused functionality
> 
>  arch/sh/boot/compressed/cache.c |   3 +
>  arch/sh/boot/compressed/cache.h |  10 ++
>  arch/sh/boot/compressed/misc.c  |   8 +-
>  arch/sh/boot/compressed/misc.h  |   9 ++
>  arch/sh/drivers/dma/dma-api.c   | 143 
>  arch/sh/include/asm/dma.h   |   7 --
>  arch/sh/include/asm/fpu.h   |   3 +
>  arch/sh/include/asm/ftrace.h|  10 ++
>  arch/sh/include/asm/hw_breakpoint.h |   2 +
>  arch/sh/include/asm/syscalls.h  |   1 +
>  arch/sh/include/asm/tlb.h   |   4 +
>  arch/sh/kernel/cpu/sh2a/opcode_helper.c |   2 +
>  arch/sh/kernel/cpu/sh4a/setup-sh7786.c  |  14 ---
>  arch/sh/kernel/dwarf.c  |   2 +-
>  arch/sh/kernel/kprobes.c|  13 +--
>  arch/sh/kernel/return_address.c |   2 +
>  arch/sh/kernel/smp.c|   4 +-
>  arch/sh/kernel/traps.c  |  10 +-
>  arch/sh/kernel/traps_32.c   |   1 +
>  arch/sh/math-emu/math.c |   2 +
>  arch/sh/mm/nommu.c  |   2 +
>  arch/sh/mm/pgtable.c|   4 +-
>  arch/sh/mm/tlbex_32.c   |   1 +
>  23 files changed, 68 insertions(+), 189 deletions(-)
>  create mode 100644 arch/sh/boot/compressed/cache.h
>  create mode 100644 arch/sh/boot/compressed/misc.h

Applied to my sh-linux tree in the for-next branch.

Thanks,
Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)

2024-05-02 Thread Peter Zijlstra
On Wed, May 01, 2024 at 11:31:02AM -0400, Michael S. Tsirkin wrote:
> On Wed, May 01, 2024 at 12:51:51PM +0200, Peter Zijlstra wrote:

> > I'm still wondering why exactly it is imperative for t2 to preempt t1.
> > Is there some unexpressed serialization / spin-waiting ?
> 
> 
> I am not sure but I think the point is that t2 is a kworker. It is
> much cheaper to run it right now when we are already in the kernel
> than return to userspace, let it run for a bit then interrupt it
> and then run t2.
> Right, Tobias?

So that is fundamentally a consequence of using a kworker.

So I tried to have a quick peek at vhost to figure out why you're using
kworkers... but no luck :/

Also, when I look at drivers/vhost/ it seems to implement it's own
worker and not use normal workqueues or even kthread_worker. Did we
really need yet another copy of all that?

Anyway, I tried to have a quick look at the code, but I can't seem to
get a handle on what and why it's doing things.



Re: [PATCH v15 3/4] dts: zynqmp: add properties for TCM in remoteproc

2024-05-02 Thread Michal Simek




On 4/12/24 20:37, Tanmay Shah wrote:

Add properties as per new bindings in zynqmp remoteproc node
to represent TCM address and size.

This patch also adds alternative remoteproc node to represent
remoteproc cluster in split mode. By default lockstep mode is
enabled and users should disable it before using split mode
dts. Both device-tree nodes can't be used simultaneously one
of them must be disabled. For zcu102-1.0 and zcu102-1.1 board
remoteproc split mode dts node is enabled and lockstep mode
dts is disabled.

Signed-off-by: Tanmay Shah 
---
  .../boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts  |  8 +++
  arch/arm64/boot/dts/xilinx/zynqmp.dtsi| 67 +--
  2 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts 
b/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts
index c8f71a1aec89..495ca94b45db 100644
--- a/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts
+++ b/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts
@@ -14,6 +14,14 @@ / {
compatible = "xlnx,zynqmp-zcu102-rev1.0", "xlnx,zynqmp-zcu102", 
"xlnx,zynqmp";
  };
  
+_split {

+   status = "okay";
+};
+
+_lockstep {
+   status = "disabled";
+};
+
   {
#address-cells = <1>;
#size-cells = <1>;
diff --git a/arch/arm64/boot/dts/xilinx/zynqmp.dtsi 
b/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
index 25d20d803230..ef31b0fc73d1 100644
--- a/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
+++ b/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
@@ -260,19 +260,76 @@ fpga_full: fpga-full {
ranges;
};
  
-	remoteproc {

+   rproc_lockstep: remoteproc@ffe0 {
compatible = "xlnx,zynqmp-r5fss";
xlnx,cluster-mode = <1>;
+   xlnx,tcm-mode = <1>;
  
-		r5f-0 {

+   #address-cells = <2>;
+   #size-cells = <2>;
+
+   ranges = <0x0 0x0 0x0 0xffe0 0x0 0x1>,
+<0x0 0x2 0x0 0xffe2 0x0 0x1>,
+<0x0 0x1 0x0 0xffe1 0x0 0x1>,
+<0x0 0x3 0x0 0xffe3 0x0 0x1>;
+
+   r5f@0 {
+   compatible = "xlnx,zynqmp-r5f";
+   reg = <0x0 0x0 0x0 0x1>,
+ <0x0 0x2 0x0 0x1>,
+ <0x0 0x1 0x0 0x1>,
+ <0x0 0x3 0x0 0x1>;
+   reg-names = "atcm0", "btcm0", "atcm1", "btcm1";
+   power-domains = <_firmware PD_RPU_0>,
+   <_firmware PD_R5_0_ATCM>,
+   <_firmware PD_R5_0_BTCM>,
+   <_firmware PD_R5_1_ATCM>,
+   <_firmware PD_R5_1_BTCM>;
+   memory-region = <_0_fw_image>;
+   };
+
+   r5f@1 {
+   compatible = "xlnx,zynqmp-r5f";
+   reg = <0x1 0x0 0x0 0x1>, <0x1 0x2 0x0 0x1>;
+   reg-names = "atcm0", "btcm0";
+   power-domains = <_firmware PD_RPU_1>,
+   <_firmware PD_R5_1_ATCM>,
+   <_firmware PD_R5_1_BTCM>;
+   memory-region = <_1_fw_image>;
+   };
+   };
+
+   rproc_split: remoteproc-split@ffe0 {
+   status = "disabled";
+   compatible = "xlnx,zynqmp-r5fss";
+   xlnx,cluster-mode = <0>;
+   xlnx,tcm-mode = <0>;
+
+   #address-cells = <2>;
+   #size-cells = <2>;
+
+   ranges = <0x0 0x0 0x0 0xffe0 0x0 0x1>,
+<0x0 0x2 0x0 0xffe2 0x0 0x1>,
+<0x1 0x0 0x0 0xffe9 0x0 0x1>,
+<0x1 0x2 0x0 0xffeb 0x0 0x1>;
+
+   r5f@0 {
compatible = "xlnx,zynqmp-r5f";
-   power-domains = <_firmware PD_RPU_0>;
+   reg = <0x0 0x0 0x0 0x1>, <0x0 0x2 0x0 0x1>;
+   reg-names = "atcm0", "btcm0";
+   power-domains = <_firmware PD_RPU_0>,
+   <_firmware PD_R5_0_ATCM>,
+   <_firmware PD_R5_0_BTCM>;
memory-region = <_0_fw_image>;
};
  
-		r5f-1 {

+   r5f@1 {
compatible = "xlnx,zynqmp-r5f";
-   power-domains = <_firmware PD_RPU_1>;
+   reg = <0x1 0x0 0x0 0x1>, <0x1 0x2 0x0 0x1>;
+   reg-names = "atcm0", "btcm0";
+   power-domains = <_firmware PD_RPU_1>,
+   <_firmware PD_R5_1_ATCM>,
+   <_firmware PD_R5_1_BTCM>;
 

Re: [PATCH 20/20] [RFC] sh: dma: Remove unused functionality

2024-05-02 Thread John Paul Adrian Glaubitz
Hi Geert,

On Thu, 2024-05-02 at 09:03 +0200, Geert Uytterhoeven wrote:
> On Wed, May 1, 2024 at 3:58 PM John Paul Adrian Glaubitz
>  wrote:
> > On Wed, 2024-05-01 at 11:12 +0200, John Paul Adrian Glaubitz wrote:
> > > On Fri, 2024-03-01 at 22:02 +0100, Geert Uytterhoeven wrote:
> > > > dma_extend(), get_dma_info_by_name(), register_chan_caps(), and
> > > > request_dma_bycap() are unused.  Remove them, and all related code.
> > > > 
> > > > Signed-off-by: Geert Uytterhoeven 
> 
> > > I assume we could re-add these again in case we need them, but it would 
> > > be good
> > > if Yoshinori could comment on whether we should keep these functions or 
> > > not.
> > 
> > I was wondering: Could there be any userland tools using these DMA 
> > functions?
> 
> They cannot be called from userspace, as there is no API for that.
> They can only be called from inside the kernel, or from a kernel module
> (possibly out-of-tree).

OK, thanks for the confirmation. Then I think it's safe to remove them.

I will apply both your series tonight and the rest of the patches except
for the one that moves the paging_init() around as it turns out the current
positioning is intentional.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913



Re: [PATCH 20/20] [RFC] sh: dma: Remove unused functionality

2024-05-02 Thread Geert Uytterhoeven
Hi Adrian,

On Wed, May 1, 2024 at 3:58 PM John Paul Adrian Glaubitz
 wrote:
> On Wed, 2024-05-01 at 11:12 +0200, John Paul Adrian Glaubitz wrote:
> > On Fri, 2024-03-01 at 22:02 +0100, Geert Uytterhoeven wrote:
> > > dma_extend(), get_dma_info_by_name(), register_chan_caps(), and
> > > request_dma_bycap() are unused.  Remove them, and all related code.
> > >
> > > Signed-off-by: Geert Uytterhoeven 

> > I assume we could re-add these again in case we need them, but it would be 
> > good
> > if Yoshinori could comment on whether we should keep these functions or not.
>
> I was wondering: Could there be any userland tools using these DMA functions?

They cannot be called from userspace, as there is no API for that.
They can only be called from inside the kernel, or from a kernel module
(possibly out-of-tree).

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds



Re: [PATCH 12/20] sh: dma: Remove unused dmac_search_free_channel()

2024-05-02 Thread Geert Uytterhoeven
Hi Adrian,

On Wed, May 1, 2024 at 11:09 AM John Paul Adrian Glaubitz
 wrote:
> On Fri, 2024-03-01 at 22:02 +0100, Geert Uytterhoeven wrote:
> > arch/sh/drivers/dma/dma-api.c:164:5: warning: no previous prototype for 
> > 'dmac_search_free_channel' [-Wmissing-prototypes]
> >
> > dmac_search_free_channel() never had a user in upstream, remove it.
> >
> > Signed-off-by: Geert Uytterhoeven 
> > ---
> > dma_extend(), get_dma_info_by_name(), register_chan_caps(), and
> > request_dma_bycap() are also unused, but don't trigger warnings
> > ---
>
> I assume the other functions didn't trigger a warning because their symbols
> were exported. Correct me if I'm wrong.

No, because they have a forward declaration in arch/sh/include/asm/dma.h.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds



Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr

2024-05-02 Thread 吳澤南
On Wed, 2024-05-01 at 23:50 -0400, Steven Rostedt wrote:
>
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  On Thu, 2 May 2024 03:10:24 +
> Tze-nan Wu (吳澤南)  wrote:
> 
> > >   
> > Sorry for my late reply, I'm testing the patch on my machine now. 
> > Test will be done in four hours.
> > 
> > There's something I'm worrying about in the patch,
> > what I'm worrying about is commented in the code below.
> > 
> > /kernel/trace/trace_events.c:
> >   static int
> >   event_create_dir(struct eventfs_inode *parent, 
> >   struct trace_event_file *file) 
> >   {
> > ...
> > ...
> > ...
> > nr_entries = ARRAY_SIZE(event_entries);
> > 
> > name = trace_event_name(call);
> > 
> > +event_file_get(file);// Line A
> > ^
> > // Should we move the "event_file_get" to here, instead  
> > // of calling it at line C?
> > // Due to Line B could eventually invoke "event_file_put".
> > //   eventfs_create_dir -> free_ei ->put_ei -> kref_put 
> > //  -> release_ei -> event_release -> event_file_put
> > // Not sure if this is a potential risk? If Line B do
> call   
> > // event_file_put,"event_file_put" will be called prior to
> > // "event_file_get", could corrupt the reference of the
> file.
> 
> No, but you do bring up a good point. The release should not be
> called on
> error, but it looks like it possibly can be.
> 
> > 
> > ei = eventfs_create_dir(name, e_events,// Line B 
> >  event_entries, nr_entries, file);
> > if (IS_ERR(ei)) {
> > pr_warn("Could not create tracefs '%s'
> directory\n", 
> > name);
> > return -1;
> > }
> > file->ei = ei;
> > 
> > ret = event_define_fields(call);
> > if (ret < 0) {
> > pr_warn("Could not initialize trace point
> events/%s\n",
> > name);
> > return ret;
> >^  
> >// Maybe we chould have similar concern if we return here.
> >// Due to the event_inode had been created, but we did not
> call 
> >// event_file_get. 
> >// Could it lead to some issues in the future while freeing 
> >// event_indoe?
> > }
> > 
> > 
> > -event_file_get(file);   //Line C
> > return 0;
> >   }
> 
> This prevents the release() function from being called on failure of
> creating the ei.
> 
> Can you try this patch instead?
> 
> -- Steve
> 
Good news, this patch works, the test has passed, no more Kasan report
in my environment.

my environment:
 arm64 + kasan + swtag based kasan + kernel-6.6.18

Really appreciate, and learn a lot from the patch.

--Tze-nan
> diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
> index 894c6ca1e500..f5510e26f0f6 100644
> --- a/fs/tracefs/event_inode.c
> +++ b/fs/tracefs/event_inode.c
> @@ -84,10 +84,17 @@ enum {
>  static void release_ei(struct kref *ref)
>  {
>  struct eventfs_inode *ei = container_of(ref, struct eventfs_inode,
> kref);
> +const struct eventfs_entry *entry;
>  struct eventfs_root_inode *rei;
>  
>  WARN_ON_ONCE(!ei->is_freed);
>  
> +for (int i = 0; i < ei->nr_entries; i++) {
> +entry = >entries[i];
> +if (entry->release)
> +entry->release(entry->name, ei->data);
> +}
> +
>  kfree(ei->entry_attrs);
>  kfree_const(ei->name);
>  if (ei->is_events) {
> @@ -112,6 +119,18 @@ static inline void free_ei(struct eventfs_inode
> *ei)
>  }
>  }
>  
> +/*
> + * Called when creation of an ei fails, do not call release()
> functions.
> + */
> +static inline void cleanup_ei(struct eventfs_inode *ei)
> +{
> +if (ei) {
> +/* Set nr_entries to 0 to prevent release() function being called */
> +ei->nr_entries = 0;
> +free_ei(ei);
> +}
> +}
> +
>  static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei)
>  {
>  if (ei)
> @@ -734,7 +753,7 @@ struct eventfs_inode *eventfs_create_dir(const
> char *name, struct eventfs_inode
>  
>  /* Was the parent freed? */
>  if (list_empty(>list)) {
> -free_ei(ei);
> +cleanup_ei(ei);
>  ei = NULL;
>  }
>  return ei;
> @@ -835,7 +854,7 @@ struct eventfs_inode
> *eventfs_create_events_dir(const char *name, struct dentry
>  return ei;
>  
>   fail:
> -free_ei(ei);
> +cleanup_ei(ei);
>  tracefs_failed_creating(dentry);
>  return ERR_PTR(-ENOMEM);
>  }
> diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
> index 7a5fe17b6bf9..d03f74658716 100644
> --- a/include/linux/tracefs.h
> +++ b/include/linux/tracefs.h
> @@ -62,6 +62,8 @@ struct eventfs_file;
>  typedef int (*eventfs_callback)(const char *name, umode_t *mode,
> void **data,
>  const struct file_operations **fops);
>  
> +typedef void (*eventfs_release)(const char *name, void *data);
> +
>  /**
>   * struct eventfs_entry - dynamically created eventfs file call back
> handler
>   * @name:Then name