[PATCH v2 0/4] tracefs/eventfs: Fix failed second run of test_ownership
The test_ownership test of the kselftests was failing again. That's because
the original fix was incorrect, and a fix to a race condition showed how the
original fix was broken. Instead of using tracefs_inodes to find the
eventfs_inode that needs to be reset on remount, use the "events" directory
descriptor to descend into its files and directories to catch all changes.

Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240522124504.28982...@gandalf.local.home

- Added other fixes underneath and rebased it on:
  https://lore.kernel.org/lkml/20240522164320.469785...@goodmis.org/

- The real fix is to not use the tracefs_inodes to find the eventfs_inodes
  that need to be cleared on remount. Instead, the events descriptor needs
  to be used to descend its directories and files to update their attributes.

- The d_iput callback logic was misplaced. It should be done in the
  drop_inode callback.

git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
eventfs/urgent

Head SHA1: 41b7db11bcac4638fa489c58d35e7d2146b665ab

Steven Rostedt (Google) (4):
      eventfs: Keep the directories from having the same inode number as files
      tracefs: Update inode permissions on remount
      eventfs: Update all the eventfs_inodes from the events descriptor
      tracefs: Clear EVENT_INODE flag in tracefs_drop_inode()

 fs/tracefs/event_inode.c | 57 +---
 fs/tracefs/inode.c       | 48
 2 files changed, 73 insertions(+), 32 deletions(-)
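The core idea of the series can be illustrated with a toy userspace sketch (this is not the kernel code; the structures and names here are simplified stand-ins): instead of scanning every tracefs_inode, start at the "events" eventfs_inode and recursively descend into its children, resetting each one's ownership on remount.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of "descend from the events descriptor": a simplified
 * stand-in for an eventfs_inode, holding ownership plus children.
 */
struct toy_ei {
	int uid, gid;
	struct toy_ei *children[4];
	int nr_children;
};

/* On remount, update this node, then every directory/file under it. */
static void remount_update(struct toy_ei *ei, int uid, int gid)
{
	ei->uid = uid;
	ei->gid = gid;

	for (int i = 0; i < ei->nr_children; i++)
		remount_update(ei->children[i], uid, gid);
}
```

This captures why the descent catches all changes: every reachable child is visited exactly once from the single "events" root, with no dependence on which inodes happen to be instantiated.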
Re: [PATCH] tracefs: Remove unneeded buggy tracefs iput callback
On Wed, 22 May 2024 12:45:04 -0400
Steven Rostedt wrote:

> From: "Steven Rostedt (Google)"
>
> The iput callback was added because the remount could call into the
> eventfs code and touch the ei->entry_attrs array, which could have been
> freed when an eventfs directory is freed (via a synthetic event). But the
> entry_attrs was freed incorrectly and has since been fixed to be freed
> after the last reference of the ei is done.
>
> The iput clears the TRACEFS_EVENT_INODE flag of the tracefs_inode,
> preventing it from calling the eventfs_remount() function. But the iput
> can be called after the last reference to the inode is done but the
> eventfs_inode still exists, causing eventfs_remount() not to be called
> on a tracefs_inode when it should be.

Testing this more, I found that the iput is still needed, as the deletion
of the eventfs inodes can happen before the inode is released.

Will produce a v2 that handles this properly.

-- Steve
[PATCH] tracefs: Remove unneeded buggy tracefs iput callback
From: "Steven Rostedt (Google)"

The iput callback was added because the remount could call into the
eventfs code and touch the ei->entry_attrs array, which could have been
freed when an eventfs directory is freed (via a synthetic event). But the
entry_attrs was freed incorrectly and has since been fixed to be freed
after the last reference of the ei is done.

The iput clears the TRACEFS_EVENT_INODE flag of the tracefs_inode,
preventing it from calling the eventfs_remount() function. But the iput
can be called after the last reference to the inode is done but the
eventfs_inode still exists, causing eventfs_remount() not to be called
on a tracefs_inode when it should be.

Link: https://lore.kernel.org/all/cak7lnarxgaww3kh9jgrnh4vk6fr8ldknkf3wq8nhmwjrvwj...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Reported-by: Masahiro Yamada
Fixes: ee4e0379475e4 ("eventfs: Free all of the eventfs_inode after RCU")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/inode.c | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 9252e0d78ea2..62ca9c23b93c 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -455,22 +455,7 @@ static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags)
 	return !(ei && ei->is_freed);
 }
 
-static void tracefs_d_iput(struct dentry *dentry, struct inode *inode)
-{
-	struct tracefs_inode *ti = get_tracefs(inode);
-
-	/*
-	 * This inode is being freed and cannot be used for
-	 * eventfs. Clear the flag so that it doesn't call into
-	 * eventfs during the remount flag updates. The eventfs_inode
-	 * gets freed after an RCU cycle, so the content will still
-	 * be safe if the iteration is going on now.
-	 */
-	ti->flags &= ~TRACEFS_EVENT_INODE;
-}
-
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_d_iput,
 	.d_revalidate = tracefs_d_revalidate,
 	.d_release = tracefs_d_release,
 };
-- 
2.43.0
Re: [PATCH] tools/latency-collector: fix -Wformat-security compile warns
On Tue, 21 May 2024 09:11:08 -0600
Shuah Khan wrote:

> Any thoughts on this patch?

Sorry, this one fell through the cracks. Daniel Bristot has been
maintaining his tools and I thought this was one of his changes.

I'll take a look at it.

-- Steve
Re: UBSAN: shift-out-of-bounds in validate_sb_layout
On Mon, 20 May 2024 15:02:26 +0800
"Ubisectech Sirius" wrote:

> Hello.
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec.
> Recently, our team has discovered an issue in Linux kernel 6.7. Attached
> to the email was a PoC file for the issue.
>
> Stack dump:
> UBSAN: shift-out-of-bounds in fs/bcachefs/super-io.c:310:18
> shift exponent 127 is too large for 32-bit type 'int'
> CPU: 0 PID: 14408 Comm: syz-executor.3 Not tainted 6.7.0 #2
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:88 [inline]
>  dump_stack_lvl+0x136/0x150 lib/dump_stack.c:106
>  ubsan_epilogue lib/ubsan.c:217 [inline]
>  __ubsan_handle_shift_out_of_bounds+0x24b/0x430 lib/ubsan.c:387
>  validate_sb_layout.cold+0x1a/0x51 fs/bcachefs/super-io.c:310
>  bch2_read_super+0x980/0x1000 fs/bcachefs/super-io.c:786
>  bch2_fs_open+0x471/0x3890 fs/bcachefs/super.c:1922
>  bch2_mount+0x538/0x13c0 fs/bcachefs/fs.c:1863
>  legacy_get_tree+0x109/0x220 fs/fs_context.c:662
>  vfs_get_tree+0x93/0x380 fs/super.c:1771
>  do_new_mount fs/namespace.c:3337 [inline]
>  path_mount+0x679/0x1e40 fs/namespace.c:3664
>  do_mount fs/namespace.c:3677 [inline]
>  __do_sys_mount fs/namespace.c:3886 [inline]
>  __se_sys_mount fs/namespace.c:3863 [inline]
>  __x64_sys_mount+0x287/0x310 fs/namespace.c:3863
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x6f/0x77
> RIP: 0033:0x7f41e1091b3e
> Code: 48 c7 c0 ff ff ff ff eb aa e8 be 0d 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7f41e1d22e38 EFLAGS: 0202 ORIG_RAX: 00a5
> RAX: ffda RBX: 5d82 RCX: 7f41e1091b3e
> RDX: 20005d80 RSI: 2100 RDI: 7f41e1d22e90
> RBP: 7f41e1d22ed0 R08: 7f41e1d22ed0 R09: 0080
> R10: 0080 R11: 0202 R12: 20005d80
> R13: 2100 R14: 7f41e1d22e90 R15: 20005e00
> Thank you for taking the time to read this email and we look forward to
> working with you further.

I'm not sure why this is getting Cc'd to linux-trace-kernel. That's for
anything to do with the tracing code (trace events, tracepoints, kprobes,
uprobes, function tracer, etc.). What part of tracing is this for?

-- Steve
Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks
On Fri, 17 May 2024 15:40:08 +0200
Petr Pavlu wrote:

> The reader code in rb_get_reader_page() swaps a new reader page into the
> ring buffer by doing cmpxchg on old->list.prev->next to point it to the
> new page. Following that, if the operation is successful,
> old->list.next->prev gets updated too. This means the underlying
> doubly-linked list is temporarily inconsistent, page->prev->next or
> page->next->prev might not be equal back to page for some page in the
> ring buffer.
>
> The resize operation in ring_buffer_resize() can be invoked in parallel.
> It calls rb_check_pages() which can detect the described inconsistency
> and stop further tracing:
>
> [ 190.271762] [ cut here ]
> [ 190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 rb_check_pages.isra.0+0x6a/0xa0
> [ 190.271789] Modules linked in: [...]
> [ 190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 skx_edac(E):1
> [ 190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G E 6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f
> [ 190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014
> [ 190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0
> [ 190.272023] Code: [...]
> [ 190.272028] RSP: 0018:9c37463abb70 EFLAGS: 00010206
> [ 190.272034] RAX: 8eba04b6cb80 RBX: 0007 RCX: 8eba01f13d80
> [ 190.272038] RDX: 8eba01f130c0 RSI: 8eba04b6cd00 RDI: 8eba0004c700
> [ 190.272042] RBP: 8eba0004c700 R08: 00010002 R09:
> [ 190.272045] R10: 7f52 R11: 8eba7f60 R12: 8eba0004c720
> [ 190.272049] R13: 8eba00223a00 R14: 0008 R15: 8eba067a8000
> [ 190.272053] FS: 7f1bd64752c0() GS:8eba7f68() knlGS:
> [ 190.272057] CS: 0010 DS: ES: CR0: 80050033
> [ 190.272061] CR2: 7f1bd6662590 CR3: 00010291e001 CR4: 00370ef0
> [ 190.272070] DR0: DR1: DR2:
> [ 190.272073] DR3: DR6: fffe0ff0 DR7: 0400
> [ 190.272077] Call Trace:
> [ 190.272098] <TASK>
> [ 190.272189] ring_buffer_resize+0x2ab/0x460
> [ 190.272199] __tracing_resize_ring_buffer.part.0+0x23/0xa0
> [ 190.272206] tracing_resize_ring_buffer+0x65/0x90
> [ 190.272216] tracing_entries_write+0x74/0xc0
> [ 190.272225] vfs_write+0xf5/0x420
> [ 190.272248] ksys_write+0x67/0xe0
> [ 190.272256] do_syscall_64+0x82/0x170
> [ 190.272363] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 190.272373] RIP: 0033:0x7f1bd657d263
> [ 190.272381] Code: [...]
> [ 190.272385] RSP: 002b:7ffe72b643f8 EFLAGS: 0246 ORIG_RAX: 0001
> [ 190.272391] RAX: ffda RBX: 0002 RCX: 7f1bd657d263
> [ 190.272395] RDX: 0002 RSI: 555a6eb538e0 RDI: 0001
> [ 190.272398] RBP: 555a6eb538e0 R08: 000a R09:
> [ 190.272401] R10: 555a6eb55190 R11: 0246 R12: 7f1bd6662500
> [ 190.272404] R13: 0002 R14: 7f1bd6667c00 R15: 0002
> [ 190.272412] </TASK>
> [ 190.272414] ---[ end trace ]---
>
> Note that ring_buffer_resize() calls rb_check_pages() only if the parent
> trace_buffer has recording disabled. Recent commit d78ab792705c
> ("tracing: Stop current tracer when resizing buffer") causes that it is
> now always the case, which makes it more likely to experience this issue.
>
> The window to hit this race is nonetheless very small. To help
> reproduce it, one can add a delay loop in rb_get_reader_page():
>
>  ret = rb_head_page_replace(reader, cpu_buffer->reader_page);
>  if (!ret)
>  	goto spin;
>  for (unsigned i = 0; i < 1U << 26; i++)  /* inserted delay loop */
>  	__asm__ __volatile__ ("" : : : "memory");
>  rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list;
>
> ... and then run the following commands on the target system:
>
>  echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
>  while true; do
>  	echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
>  	echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
>  done &
>  while true; do
>  	for i in /sys/kernel/tracing/per_cpu/*; do
>  		timeout 0.1 cat $i/trace_pipe; sleep 0.2
>  	done
>  done
>
> To fix the problem, make sure ring_buffer_resize() doesn't invoke
> rb_check_pages() concurrently with a reader operating on the same
> ring_buffer_per_cpu by taking its cpu_buffer->reader_lock.

Definitely a bug. Thanks for catching it. But...

> Fixes: 659f451ff213 ("ring-buffer: Add integrity check at end of iter read")
> Signed-off-by: Petr Pavlu
> ---
>  kernel/trace/ring_buffer.c | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git
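The shape of the fix described in this thread can be modeled in a few lines of userspace code (a toy, not the kernel implementation; the names mirror the kernel ones only loosely): a writer temporarily breaks the doubly-linked-list invariant while holding a lock, so the consistency check must take the same lock before validating, just as ring_buffer_resize() must take cpu_buffer->reader_lock before calling rb_check_pages().

```c
#include <pthread.h>

/* Toy two-node ring with next/prev links, like the sub-buffer list. */
struct node { struct node *next, *prev; };

static pthread_mutex_t reader_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node a, b;

static void link_ring(void)
{
	a.next = &b; b.prev = &a;
	b.next = &a; a.prev = &b;
}

/* Swap-like update: the list is briefly inconsistent, but only under the lock. */
static void update_under_lock(struct node *repl)
{
	pthread_mutex_lock(&reader_lock);
	a.next = repl;			/* invariant broken here... */
	repl->prev = &a;
	repl->next = b.next;
	b.next->prev = repl;		/* ...and restored before unlock */
	pthread_mutex_unlock(&reader_lock);
}

/* The check must serialize against writers, mirroring the proposed fix. */
static int check_pages(void)
{
	int ok;

	pthread_mutex_lock(&reader_lock);
	ok = (a.next->prev == &a) && (a.prev->next == &a);
	pthread_mutex_unlock(&reader_lock);
	return ok;
}
```

Without the lock in `check_pages()`, a checker running between the two link updates would see `a.next->prev != &a` — the same transient state that trips the WARN in rb_check_pages().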
Re: [PATCH] tracing/treewide: Remove second parameter of __assign_str()
On Fri, 17 May 2024 10:36:37 -0700
Guenter Roeck wrote:

> Building csky:allmodconfig (and others) ... failed
> --
> Error log:
> In file included from include/trace/trace_events.h:419,
>     from include/trace/define_trace.h:102,
>     from drivers/cxl/core/trace.h:737,
>     from drivers/cxl/core/trace.c:8:
> drivers/cxl/core/./trace.h:383:1: error: macro "__assign_str" passed 2 arguments, but takes just 1
>
> This is with the patch applied on top of v6.9-8410-gff2632d7d08e.
> So far that seems to be the only build failure.
> Introduced with commit 6aec00139d3a8 ("cxl/core: Add region info to
> cxl_general_media and cxl_dram events"). Guess we'll see more of those
> towards the end of the commit window.

Looks like I made this patch just before this commit was pulled into
Linus's tree, which is why I'll apply it and rerun the above again,
probably on Tuesday of next week against Linus's latest.

This patch made it through both an allyesconfig and an allmodconfig, but
on the commit I had applied it to, which was:

  1b294a1f3561 ("Merge tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next")

I'll be compiling those two builds after I update it then.

-- Steve
Re: [PATCH] tracing/treewide: Remove second parameter of __assign_str()
On Fri, 17 May 2024 10:08:51 +0300
Jani Nikula wrote:

> On Thu, 16 May 2024, Steven Rostedt wrote:
> > There's over 700 users of __assign_str() and because coccinelle does not
> > handle the TRACE_EVENT() macro I ended up using the following sed script:
> >
> >   git grep -l __assign_str | while read a ; do
> >     sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
> >     mv /tmp/test-file $a;
> >   done
>
> Try 'sed -i' ;)

I've always been nervous about trusting -i ;-)

> >  .../drm/i915/display/intel_display_trace.h | 56 -
>
> On i915,
>
> Acked-by: Jani Nikula

Thanks,

-- Steve
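As a quick sanity check of the sed expression from the thread above, here it is applied to a sample two-argument call (the sample line is hypothetical, not taken from a real TRACE_EVENT): it rewrites `__assign_str(name, src)` to the one-argument form while leaving the trailing semicolon intact.

```shell
# Run the script's sed expression over one sample assignment line.
line='__assign_str(name, __get_str(name));'
out=$(printf '%s\n' "$line" | sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/')
echo "$out"
```

The capture group keeps everything up to (and including) the last non-space character before the comma, and `[^;]*` swallows the rest of the second argument, which is why nested parentheses in the dropped argument are handled correctly.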
[PATCH] ring-buffer: Add cast to unsigned long addr passed to virt_to_page()
From: "Steven Rostedt (Google)"

The sub-buffer pages are held in an unsigned long array, and when it is
passed to virt_to_page() a cast is needed.

Link: https://lore.kernel.org/all/20240515124808.06279...@canb.auug.org.au/

Fixes: 117c39200d9d ("ring-buffer: Introducing ring-buffer mapping functions")
Reported-by: Stephen Rothwell
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/ring_buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a02c7a52a0f5..7345a8b625fb 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6283,7 +6283,7 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
 	}
 
 	while (p < nr_pages) {
-		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+		struct page *page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
 		int off = 0;
 
 		if (WARN_ON_ONCE(s >= nr_subbufs)) {
-- 
2.43.0
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Fri, 10 May 2024 12:03:12 +0100
Vincent Donnefort wrote:

> > I'm not particularly happy about us calling vm_insert_pages with NULL
> > pointers stored in pages.
> >
> > Should we instead do
> >
> >   if (WARN_ON_ONCE(s >= nr_subbufs)) {
> >           err = -EINVAL;
> >           goto out;
> >   }
> >
> > ?
>
> I could also set nr_pages = p in the event of s >= nr_subbufs... but
> really that shouldn't happen, so let's return an error.

I'm good with this. It should never happen anyway.

-- Steve
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Tue, 30 Apr 2024 12:13:51 +0100
Vincent Donnefort wrote:

> +#ifdef CONFIG_MMU
> +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
> +			struct vm_area_struct *vma)
> +{
> +	unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
> +	unsigned int subbuf_pages, subbuf_order;
> +	struct page **pages;
> +	int p = 0, s = 0;
> +	int err;
> +
> +	/* Refuse MP_PRIVATE or writable mappings */
> +	if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
> +	    !(vma->vm_flags & VM_MAYSHARE))
> +		return -EPERM;
> +
> +	/*
> +	 * Make sure the mapping cannot become writable later. Also tell the VM
> +	 * to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally,
> +	 * prevent migration, GUP and dump (VM_IO).
> +	 */
> +	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);

Do we really need the VM_IO?

When testing this in gdb, I would get:

  (gdb) p tmap->map->subbuf_size
  Cannot access memory at address 0x77fc2008

It appears that you can't ptrace IO memory. When I removed that flag,
gdb has no problem reading that memory.

I think we should drop that flag.

Can you send a v23 with that removed, Shuah's update, and also the
change below:

> +
> +	lockdep_assert_held(&cpu_buffer->mapping_lock);
> +
> +	subbuf_order = cpu_buffer->buffer->subbuf_order;
> +	subbuf_pages = 1 << subbuf_order;
> +
> +	nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
> +	nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
> +
> +	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +	if (!vma_pages || vma_pages > nr_pages)
> +		return -EINVAL;
> +
> +	nr_pages = vma_pages;
> +
> +	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	if (!pgoff) {
> +		pages[p++] = virt_to_page(cpu_buffer->meta_page);
> +
> +		/*
> +		 * TODO: Align sub-buffers on their size, once
> +		 * vm_insert_pages() supports the zero-page.
> +		 */
> +	} else {
> +		/* Skip the meta-page */
> +		pgoff--;
> +
> +		if (pgoff % subbuf_pages) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		s += pgoff / subbuf_pages;
> +	}
> +
> +	while (s < nr_subbufs && p < nr_pages) {
> +		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
> +		int off = 0;
> +
> +		for (; off < (1 << (subbuf_order)); off++, page++) {
> +			if (p >= nr_pages)
> +				break;
> +
> +			pages[p++] = page;
> +		}
> +		s++;
> +	}

The above can be made to:

	while (p < nr_pages) {
		struct page *page;
		int off = 0;

		if (WARN_ON_ONCE(s >= nr_subbufs))
			break;

		page = virt_to_page(cpu_buffer->subbuf_ids[s]);

		for (; off < (1 << (subbuf_order)); off++, page++) {
			if (p >= nr_pages)
				break;

			pages[p++] = page;
		}
		s++;
	}

Thanks.

-- Steve

> +
> +	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
> +
> +out:
> +	kfree(pages);
> +
> +	return err;
> +}
> +#else
> +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
> +			struct vm_area_struct *vma)
> +{
> +	return -EOPNOTSUPP;
> +}
> +#endif
Re: [PATCH net-next v9] net/ipv4: add tracepoint for icmp_send
On Tue, 7 May 2024 15:41:03 +0800 (CST) wrote:

> From: Peilin He
>
> Introduce a tracepoint for icmp_send, which can help users to get more
> detailed information conveniently when icmp abnormal events happen.
>
> 1. Giving a usecase example:
> ============================
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain detailed information about the UDP
> packet loss, including the type, code, source address, destination address,
> source port, and destination port. This facilitates the trouble-shooting
> of UDP packet loss issues, especially for those network-service
> applications.
>
> 2. Operation Instructions:
> ==========================
> Switch to the tracing directory.
>   cd /sys/kernel/tracing
> Filter for destination port unreachable.
>   echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
>   echo 1 > events/icmp/icmp_send/enable
>
> 3. Result View:
> ===============
>  udp_client_erro-11370 [002] ...s.12 124.728002:
>  icmp_send: icmp_send: type=3, code=3.
>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>  skbaddr=589b167a
>
> Signed-off-by: Peilin He
> Signed-off-by: xu xin
> Reviewed-by: Yunkai Zhang
> Cc: Yang Yang
> Cc: Liu Chun
> Cc: Xuexin Jiang
> ---

From just a tracing point of view:

Reviewed-by: Steven Rostedt (Google)

-- Steve
Re: [PATCH v8 13/17] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
On Sun, 5 May 2024 17:25:56 +0300
Mike Rapoport wrote:

> From: "Mike Rapoport (IBM)"
>
> Dynamic ftrace must allocate memory for code and this was impossible
> without CONFIG_MODULES.
>
> With execmem separated from the modules code, execmem_text_alloc() is
> available regardless of CONFIG_MODULES.
>
> Remove dependency of dynamic ftrace on CONFIG_MODULES and make
> CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.
>
> Signed-off-by: Mike Rapoport (IBM)
> ---
>  arch/x86/Kconfig         |  1 +
>  arch/x86/kernel/ftrace.c | 10 --
>  2 files changed, 1 insertion(+), 10 deletions(-)

Reviewed-by: Steven Rostedt (Google)

-- Steve
Re: [PATCH v8 06/17] mm: introduce execmem_alloc() and execmem_free()
On Sun, 5 May 2024 17:25:49 +0300
Mike Rapoport wrote:

> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index 70139d9d2e01..c8ddb7abda7c 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -25,6 +25,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>
> @@ -261,15 +262,14 @@ void arch_ftrace_update_code(int command)
>  #ifdef CONFIG_X86_64
>
>  #ifdef CONFIG_MODULES
> -#include
>  /* Module allocation simplifies allocating memory for code */
>  static inline void *alloc_tramp(unsigned long size)
>  {
> -	return module_alloc(size);
> +	return execmem_alloc(EXECMEM_FTRACE, size);
>  }
>  static inline void tramp_free(void *tramp)
>  {
> -	module_memfree(tramp);
> +	execmem_free(tramp);
>  }
>  #else
>  /* Trampolines can only be created if modules are supported */
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c

Acked-by: Steven Rostedt (Google)

-- Steve
Re: ftrace_direct_func_count ?
On Sat, 4 May 2024 13:35:26 +
"Dr. David Alan Gilbert" wrote:

> Hi,
>   I've just posted a patch 'ftrace: Remove unused list 'ftrace_direct_funcs''
> that clears out some old code, but while at it I noticed the global
> 'ftrace_direct_func_count'.
>
> As far as I can tell, it's never assigned (or initialised) but it is tested:
>
> kernel/trace/fgraph.c:
>   #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
>   	/*
>   	 * Skip graph tracing if the return location is served by direct trampoline,
>   	 * since call sequence and return addresses are unpredictable anyway.
>   	 * Ex: BPF trampoline may call original function and may skip frame
>   	 * depending on type of BPF programs attached.
>   	 */
>   	if (ftrace_direct_func_count &&
>   	    ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE))
>   		return -EBUSY;
>   #endif
>
> So I wasn't sure whether it was just safe to nuke that section
> or whether it really needed fixing?

Yes, after commit 8788ca164eb4bad ("ftrace: Remove the legacy
_ftrace_direct API") that variable is no longer used.

-- Steve
Re: [PATCH resend ftrace] Asynchronous grace period for register_ftrace_direct()
On Thu, 2 May 2024 16:13:59 -0700
"Paul E. McKenney" wrote:

> Very good, and thank you!
>
> I will drop it from RCU as soon as it shows up in either -next or in
> mainline.

Sounds good. I'm currently working on updates to get into -rc7 and plan
to add my next work on top of that (I know, I know, it's probably the
latest release I had for next, but things are still being worked on).

-- Steve
Re: [PATCH v2 1/2] tracing/user_events: Fix non-spaced field matching
On Thu, 2 May 2024 15:58:53 -0700
Beau Belgrave wrote:

> It's not an issue on the matching/logic. However, you do get an extra
> byte alloc (which doesn't bother me in this edge case).

Figured as much, but since there was no mention of it, I decided to
bring it up.

I'll take this as-is then.

-- Steve
Re: [PATCH resend ftrace] Asynchronous grace period for register_ftrace_direct()
On Wed, 1 May 2024 20:31:06 -0700
"Paul E. McKenney" wrote:

> On Thu, May 02, 2024 at 11:05:01AM +0900, Masami Hiramatsu wrote:
> > On Wed, 1 May 2024 16:12:37 -0700
> > "Paul E. McKenney" wrote:
> >
> > > Note that the immediate pressure for this patch should be relieved by the
> > > NAPI patch series [1], but this sort of problem could easily arise again.
> > >
> > > When running heavy test workloads with KASAN enabled, RCU Tasks grace
> > > periods can extend for many tens of seconds, significantly slowing
> > > trace registration. Therefore, make the registration-side RCU Tasks
> > > grace period be asynchronous via call_rcu_tasks().
> >
> > Good catch! AFAICS, there is no reason to wait for synchronization
> > when adding a new direct trampoline.
> > This looks good to me.
> >
> > Reviewed-by: Masami Hiramatsu (Google)
>
> Thank you very much! I will apply this on my next rebase.

I can take it. It's not a bug fix but just a performance improvement, so
it can go into the next merge window.

-- Steve

> > Thank you,
> >
> > > [1] https://lore.kernel.org/all/cover.1710877680.git@cloudflare.com/
> > >
> > > Reported-by: Jakub Kicinski
> > > Reported-by: Alexei Starovoitov
> > > Reported-by: Chris Mason
> > > Signed-off-by: Paul E. McKenney
> > > Cc: Steven Rostedt
> > > Cc: Masami Hiramatsu
> > > Cc: Mark Rutland
> > > Cc: Mathieu Desnoyers
> > > Cc:
> > >
> > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > > index 6c96b30f3d63b..32ea92934268c 100644
> > > --- a/kernel/trace/ftrace.c
> > > +++ b/kernel/trace/ftrace.c
> > > @@ -5365,6 +5365,13 @@ static void remove_direct_functions_hash(struct ftrace_hash *hash, unsigned long
> > >  	}
> > >  }
> > >
> > > +static void register_ftrace_direct_cb(struct rcu_head *rhp)
> > > +{
> > > +	struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu);
> > > +
> > > +	free_ftrace_hash(fhp);
> > > +}
> > > +
> > >  /**
> > >   * register_ftrace_direct - Call a custom trampoline directly
> > >   * for multiple functions registered in @ops
> > > @@ -5463,10 +5470,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
> > >  out_unlock:
> > >  	mutex_unlock(&direct_mutex);
> > >
> > > -	if (free_hash && free_hash != EMPTY_HASH) {
> > > -		synchronize_rcu_tasks();
> > > -		free_ftrace_hash(free_hash);
> > > -	}
> > > +	if (free_hash && free_hash != EMPTY_HASH)
> > > +		call_rcu_tasks(&free_hash->rcu, register_ftrace_direct_cb);
> > >
> > >  	if (new_hash)
> > >  		free_ftrace_hash(new_hash);
> >
> > --
> > Masami Hiramatsu (Google)
Re: [PATCH v2 1/2] tracing/user_events: Fix non-spaced field matching
On Tue, 23 Apr 2024 16:23:37 +0000
Beau Belgrave wrote:

> When the ABI was updated to prevent same name w/different args, it
> missed an important corner case when fields don't end with a space.
> Typically, space is used for fields to help separate them, like
> "u8 field1; u8 field2". If no spaces are used, like
> "u8 field1;u8 field2", then the parsing works for the first time.
> However, the match check fails on a subsequent register, leading to
> confusion.
>
> This is because the match check uses argv_split() and assumes that all
> fields will be split upon the space. When spaces are used, we get back
> { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }.
> This causes a mismatch, and the user program gets back -EADDRINUSE.
>
> Add a method to detect this case before calling argv_split(). If found,
> force a space after the field separator character ';'. This ensures all
> cases work properly for matching.
>
> With this fix, the following are all treated as matching:
>   u8 field1;u8 field2
>   u8 field1; u8 field2
>   u8 field1;\tu8 field2
>   u8 field1;\nu8 field2

I'm curious, what happens if you have: "u8 field1; u8 field2;" ?

Do you care? As you will then create "u8 field1; u8 field2; " but I'm
guessing the extra whitespace at the end doesn't affect anything.

> Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different args event")
> Signed-off-by: Beau Belgrave
> ---
>  kernel/trace/trace_events_user.c | 76 +++-
>  1 file changed, 75 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> index 70d428c394b6..82b191f33a28 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -1989,6 +1989,80 @@ static int user_event_set_tp_name(struct user_event *user)
>  	return 0;
>  }
>
> +/*
> + * Counts how many ';' without a trailing space are in the args.
> + */
> +static int count_semis_no_space(char *args)
> +{
> +	int count = 0;
> +
> +	while ((args = strchr(args, ';'))) {
> +		args++;
> +
> +		if (!isspace(*args))
> +			count++;

This will also count a ';' at the end of the string (as in "..;").

This is most likely not an issue, but since I didn't see this case
anywhere, I figured I'd bring it up just to confirm that it's not an
issue.

-- Steve

> +	}
> +
> +	return count;
> +}
> +
> +/*
> + * Copies the arguments while ensuring all ';' have a trailing space.
> + */
> +static char *insert_space_after_semis(char *args, int count)
> +{
> +	char *fixed, *pos;
> +	int len;
> +
> +	len = strlen(args) + count;
> +	fixed = kmalloc(len + 1, GFP_KERNEL);
> +
> +	if (!fixed)
> +		return NULL;
> +
> +	pos = fixed;
> +
> +	/* Insert a space after ';' if there is no trailing space. */
> +	while (*args) {
> +		*pos = *args++;
> +
> +		if (*pos++ == ';' && !isspace(*args))
> +			*pos++ = ' ';
> +	}
> +
> +	*pos = '\0';
> +
> +	return fixed;
> +}
> +
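The two helpers quoted in the patch above can be exercised in a small userspace harness (a sketch using `malloc` in place of `kmalloc`; otherwise the logic follows the patch):

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Counts how many ';' without a trailing space are in the args. */
static int count_semis_no_space(const char *args)
{
	int count = 0;

	while ((args = strchr(args, ';'))) {
		args++;

		if (!isspace((unsigned char)*args))
			count++;
	}

	return count;
}

/* Copies the arguments while ensuring all ';' have a trailing space. */
static char *insert_space_after_semis(const char *args, int count)
{
	char *fixed, *pos;
	size_t len = strlen(args) + count;

	fixed = malloc(len + 1);	/* kmalloc() in the kernel patch */
	if (!fixed)
		return NULL;

	pos = fixed;

	while (*args) {
		*pos = *args++;

		if (*pos++ == ';' && !isspace((unsigned char)*args))
			*pos++ = ' ';
	}

	*pos = '\0';

	return fixed;
}
```

With this, `"u8 field1;u8 field2"` is normalized to `"u8 field1; u8 field2"`, so both spellings match the same registered event; an input that already has spaces after every ';' passes through unchanged.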
Re: [PATCH v3] ftrace: Fix possible use-after-free issue in ftrace_location()
On Wed, 17 Apr 2024 11:28:30 +0800
Zheng Yejian wrote:

> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index da1710499698..e05d3e3dc06a 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1581,7 +1581,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, unsigned long end)
>  }
>
>  /**
> - * ftrace_location_range - return the first address of a traced location
> + * ftrace_location_range_rcu - return the first address of a traced location

kerneldoc comments are for external functions. You need to move this down
to ftrace_location_range() as here you are commenting a local static
function.

But I have to ask, why did you create this static function anyway?
There's only one user of it (the ftrace_location_range()). Why didn't
you just simply add the rcu locking there?

	unsigned long ftrace_location_range(unsigned long start, unsigned long end)
	{
		struct dyn_ftrace *rec;
		unsigned long ip = 0;

		rcu_read_lock();
		rec = lookup_rec(start, end);
		if (rec)
			ip = rec->ip;
		rcu_read_unlock();

		return ip;
	}

-- Steve

>  *	if it touches the given ip range
>  * @start: start of range to search.
>  * @end: end of range to search (inclusive). @end points to the last byte
> @@ -1592,7 +1592,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, unsigned long end)
>  * that is either a NOP or call to the function tracer. It checks the ftrace
>  * internal tables to determine if the address belongs or not.
>  */
> -unsigned long ftrace_location_range(unsigned long start, unsigned long end)
> +static unsigned long ftrace_location_range_rcu(unsigned long start, unsigned long end)
>  {
>  	struct dyn_ftrace *rec;
>
> @@ -1603,6 +1603,16 @@ unsigned long ftrace_location_range(unsigned long start, unsigned long end)
>  	return 0;
>  }
>
> +unsigned long ftrace_location_range(unsigned long start, unsigned long end)
> +{
> +	unsigned long loc;
> +
> +	rcu_read_lock();
> +	loc = ftrace_location_range_rcu(start, end);
> +	rcu_read_unlock();
> +	return loc;
> +}
[PATCH v3 6/6] eventfs: Have "events" directory get permissions from its parent
From: "Steven Rostedt (Google)"

The events directory gets its permissions from the root inode. But this
can cause an inconsistency if the instances directory changes its
permissions, as the permissions of the created directories under it
should inherit the permissions of the instances directory when
directories under it are created.

Currently the behavior is:

 # cd /sys/kernel/tracing
 # chgrp 1002 instances
 # mkdir instances/foo
 # ls -l instances/foo
 [..]
 -r--r-----  1 root lkp  0 May  1 18:55 buffer_total_size_kb
 -rw-r-----  1 root lkp  0 May  1 18:55 current_tracer
 -rw-r-----  1 root lkp  0 May  1 18:55 error_log
 drwxr-xr-x  1 root root 0 May  1 18:55 events
 --w-------  1 root lkp  0 May  1 18:55 free_buffer
 drwxr-x---  2 root lkp  0 May  1 18:55 options
 drwxr-x--- 10 root lkp  0 May  1 18:55 per_cpu
 -rw-r-----  1 root lkp  0 May  1 18:55 set_event

All the files and directories under "foo" have the "lkp" group except
the "events" directory. That's because it's getting its default value
from the mount point instead of its parent.

Have the "events" directory make its default value based on its parent's
permissions. That now gives:

 # ls -l instances/foo
 [..]
 -rw-r-----  1 root lkp  0 May  1 21:16 buffer_subbuf_size_kb
 -r--r-----  1 root lkp  0 May  1 21:16 buffer_total_size_kb
 -rw-r-----  1 root lkp  0 May  1 21:16 current_tracer
 -rw-r-----  1 root lkp  0 May  1 21:16 error_log
 drwxr-xr-x  1 root lkp  0 May  1 21:16 events
 --w-------  1 root lkp  0 May  1 21:16 free_buffer
 drwxr-x---  2 root lkp  0 May  1 21:16 options
 drwxr-x--- 10 root lkp  0 May  1 21:16 per_cpu
 -rw-r-----  1 root lkp  0 May  1 21:16 set_event

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 6e08405892ae..a878cea70f4c 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -37,6 +37,7 @@ static DEFINE_MUTEX(eventfs_mutex);
 
 struct eventfs_root_inode {
 	struct eventfs_inode		ei;
+	struct inode			*parent_inode;
 	struct dentry			*events_dir;
 };
 
@@ -226,12 +227,23 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
 
 static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb)
 {
-	struct inode *root;
+	struct eventfs_root_inode *rei;
+	struct inode *parent;
+
+	rei = get_root_inode(ei);
+
+	/* Use the parent inode permissions unless root set its permissions */
+	parent = rei->parent_inode;
 
-	/* Get the tracefs root inode. */
-	root = d_inode(sb->s_root);
-	ei->attr.uid = root->i_uid;
-	ei->attr.gid = root->i_gid;
+	if (rei->ei.attr.mode & EVENTFS_SAVE_UID)
+		ei->attr.uid = rei->ei.attr.uid;
+	else
+		ei->attr.uid = parent->i_uid;
+
+	if (rei->ei.attr.mode & EVENTFS_SAVE_GID)
+		ei->attr.gid = rei->ei.attr.gid;
+	else
+		ei->attr.gid = parent->i_gid;
 }
 
 static void set_top_events_ownership(struct inode *inode)
@@ -817,6 +829,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	// Note: we have a ref to the dentry from tracefs_start_creating()
 	rei = get_root_inode(ei);
 	rei->events_dir = dentry;
+	rei->parent_inode = d_inode(dentry->d_sb->s_root);
 
 	ei->entries = entries;
 	ei->nr_entries = size;
@@ -826,10 +839,15 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	uid = d_inode(dentry->d_parent)->i_uid;
 	gid = d_inode(dentry->d_parent)->i_gid;
 
-	/* This is used as the default ownership of the files and directories */
 	ei->attr.uid = uid;
 	ei->attr.gid = gid;
 
+	/*
+	 * When the "events" directory is created, it takes on the
+	 * permissions of its parent. But can be reset on remount.
+	 */
+	ei->attr.mode |= EVENTFS_SAVE_UID | EVENTFS_SAVE_GID;
+
 	INIT_LIST_HEAD(&ei->children);
 	INIT_LIST_HEAD(&ei->list);
-- 
2.43.0
[PATCH v3 5/6] eventfs: Do not treat events directory different than other directories
From: "Steven Rostedt (Google)"

Treat the events directory the same as other directories when it comes to permissions. The events directory was considered different because its dentry is persistent, whereas the other directory dentries are created when accessed. But with the way tracefs now derives default ownership from the root dentry's permissions, the events directory can get out of sync when a remount sets the group and user permissions.

Remove the special case for the events directory on setting the attributes. This allows the updates caused by remount to work properly as well as simplifying the code.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 9dacf65c0b6e..6e08405892ae 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -206,21 +206,7 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
	 * determined by the parent directory.
	 */
	if (dentry->d_inode->i_mode & S_IFDIR) {
-		/*
-		 * The events directory dentry is never freed, unless its
-		 * part of an instance that is deleted. It's attr is the
-		 * default for its child files and directories.
-		 * Do not update it. It's not used for its own mode or ownership.
-		 */
-		if (ei->is_events) {
-			/* But it still needs to know if it was modified */
-			if (iattr->ia_valid & ATTR_UID)
-				ei->attr.mode |= EVENTFS_SAVE_UID;
-			if (iattr->ia_valid & ATTR_GID)
-				ei->attr.mode |= EVENTFS_SAVE_GID;
-		} else {
-			update_attr(&ei->attr, iattr);
-		}
+		update_attr(&ei->attr, iattr);
	} else {
		name = dentry->d_name.name;
-- 
2.43.0
[PATCH v3 2/6] tracefs: Reset permissions on remount if permissions are options
From: "Steven Rostedt (Google)"

There's an inconsistency with the way permissions are handled in tracefs. Because the permissions are generated when accessed, they default to the root inode's permission if they were never set by the user. If the user sets the permissions, then a flag is set and the permissions are saved via the inode (for tracefs files) or an internal attribute field (for eventfs). But if a remount happens that specifies the permissions, all the files that were not changed by the user get updated, but the ones that were are not. If the user were to remount the file system with a given permission, then all files and directories within that file system should be updated.

This can cause security issues if a file's permission was updated but the admin forgot about it. They could incorrectly think that remounting with permissions set would update all files, but miss some. For example:

# cd /sys/kernel/tracing
# chgrp 1002 current_tracer
# ls -l
[..]
-rw-r- 1 root root 0 May 1 21:25 buffer_size_kb
-rw-r- 1 root root 0 May 1 21:25 buffer_subbuf_size_kb
-r--r- 1 root root 0 May 1 21:25 buffer_total_size_kb
-rw-r- 1 root lkp 0 May 1 21:25 current_tracer
-rw-r- 1 root root 0 May 1 21:25 dynamic_events
-r--r- 1 root root 0 May 1 21:25 dyn_ftrace_total_info
-r--r- 1 root root 0 May 1 21:25 enabled_functions

Where current_tracer now has group "lkp".

# mount -o remount,gid=1001 .
# ls -l
-rw-r- 1 root tracing 0 May 1 21:25 buffer_size_kb
-rw-r- 1 root tracing 0 May 1 21:25 buffer_subbuf_size_kb
-r--r- 1 root tracing 0 May 1 21:25 buffer_total_size_kb
-rw-r- 1 root lkp 0 May 1 21:25 current_tracer
-rw-r- 1 root tracing 0 May 1 21:25 dynamic_events
-r--r- 1 root tracing 0 May 1 21:25 dyn_ftrace_total_info
-r--r- 1 root tracing 0 May 1 21:25 enabled_functions

Everything changed but the "current_tracer".
Add a new linked list that keeps track of all the tracefs_inodes, which have the permission flags that tell if the file/dir should use the root inode's permission or not. Then on remount, clear all the flags so that the default behavior of using the root inode's permission is done for all files and directories.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 29 ++
 fs/tracefs/inode.c       | 65 +++-
 fs/tracefs/internal.h    | 7 -
 3 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cc8b838bbe62..15a2a9c3c62b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -308,6 +308,35 @@ static const struct file_operations eventfs_file_operations = {
	.llseek		= generic_file_llseek,
 };

+/*
+ * On a remount of tracefs, if UID or GID options are set, then
+ * the mount point inode permissions should be used.
+ * Reset the saved permission flags appropriately.
+ */
+void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool update_gid)
+{
+	struct eventfs_inode *ei = ti->private;
+
+	if (!ei)
+		return;
+
+	if (update_uid)
+		ei->attr.mode &= ~EVENTFS_SAVE_UID;
+
+	if (update_gid)
+		ei->attr.mode &= ~EVENTFS_SAVE_GID;
+
+	if (!ei->entry_attrs)
+		return;
+
+	for (int i = 0; i < ei->nr_entries; i++) {
+		if (update_uid)
+			ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID;
+		if (update_gid)
+			ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID;
+	}
+}
+
 /* Return the evenfs_inode of the "events" directory */
 static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 5545e6bf7d26..52aa14bd2994 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -30,20 +30,47 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;

+/*
+ * Keep track of all tracefs_inodes in order to update their
+ * flags if necessary on a remount.
+ */
+static DEFINE_SPINLOCK(tracefs_inode_lock);
+static LIST_HEAD(tracefs_inodes);
+
 static struct inode *tracefs_alloc_inode(struct super_block *sb)
 {
	struct tracefs_inode *ti;
+	unsigned long flags;

	ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL);
	if (!ti)
		return NULL;

+	spin_lock_irqsave(&tracefs_inode_lock, flags);
+	list_add_rcu(&ti->list, &tracefs_inodes);
+	spin_unlock_irqrestore(&tracefs_inode_lock, flags);
+
	return &ti->vfs_inode;
 }

+static void tracefs_free_inode_rcu(struct rcu_head *rcu)
+{
+	struct tracefs_inode *ti;
+
+
[PATCH v3 4/6] eventfs: Do not differentiate the toplevel events directory
From: "Steven Rostedt (Google)" The toplevel events directory is really no different than the events directory of instances. Having the two be different caused inconsistencies and made it harder to fix the permissions bugs. Make all events directories act the same. Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 29 - fs/tracefs/internal.h| 7 +++ 2 files changed, 11 insertions(+), 25 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 15a2a9c3c62b..9dacf65c0b6e 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -68,7 +68,6 @@ enum { EVENTFS_SAVE_MODE = BIT(16), EVENTFS_SAVE_UID= BIT(17), EVENTFS_SAVE_GID= BIT(18), - EVENTFS_TOPLEVEL= BIT(19), }; #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) @@ -239,14 +238,10 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } -static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb) +static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb) { struct inode *root; - /* Only update if the "events" was on the top level */ - if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) - return; - /* Get the tracefs root inode. 
*/ root = d_inode(sb->s_root); ei->attr.uid = root->i_uid; @@ -259,10 +254,10 @@ static void set_top_events_ownership(struct inode *inode) struct eventfs_inode *ei = ti->private; /* The top events directory doesn't get automatically updated */ - if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + if (!ei || !ei->is_events) return; - update_top_events_attr(ei, inode->i_sb); + update_events_attr(ei, inode->i_sb); if (!(ei->attr.mode & EVENTFS_SAVE_UID)) inode->i_uid = ei->attr.uid; @@ -291,7 +286,7 @@ static int eventfs_permission(struct mnt_idmap *idmap, return generic_permission(idmap, inode, mask); } -static const struct inode_operations eventfs_root_dir_inode_operations = { +static const struct inode_operations eventfs_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, @@ -359,7 +354,7 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) // Walk upwards until you find the events inode } while (!ei->is_events); - update_top_events_attr(ei, dentry->d_sb); + update_events_attr(ei, dentry->d_sb); return ei; } @@ -465,7 +460,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, update_inode_attr(dentry, inode, >attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; /* All directories will have the same inode number */ @@ -845,14 +840,6 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry uid = d_inode(dentry->d_parent)->i_uid; gid = d_inode(dentry->d_parent)->i_gid; - /* -* If the events directory is of the top instance, then parent -* is NULL. Set the attr.mode to reflect this and its permissions will -* default to the tracefs root dentry. 
-*/ - if (!parent) - ei->attr.mode = EVENTFS_TOPLEVEL; - /* This is used as the default ownership of the files and directories */ ei->attr.uid = uid; ei->attr.gid = gid; @@ -861,13 +848,13 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry INIT_LIST_HEAD(>list); ti = get_tracefs(inode); - ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; + ti->flags |= TRACEFS_EVENT_INODE; ti->private = ei; inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO; inode->i_uid = uid; inode->i_gid = gid; - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; dentry->d_fsdata = get_ei(ei); diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 29f0c75b..f704d8348357 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -4,10 +4,9 @@ enum { TRACEFS_EVENT_INODE = BIT(1), - TRACEFS_EVENT_TOP_INODE = BIT(2), - TRACEFS_GID_PERM_SET=
[PATCH v3 1/6] eventfs: Free all of the eventfs_inode after RCU
From: "Steven Rostedt (Google)"

The eventfs_inode is freed via a kfree_rcu() callback, but the content of the eventfs_inode was being freed immediately after the last kref was released. This is dangerous, as changes are being made that can access the content of an eventfs_inode from an RCU loop.

Instead of using kfree_rcu(), use call_rcu() with a function that does all the freeing of the eventfs_inode after an RCU grace period has expired.

Cc: sta...@vger.kernel.org
Fixes: 43aa6f97c2d03 ("eventfs: Get rid of dentry pointers without refcounts")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index f5510e26f0f6..cc8b838bbe62 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -73,6 +73,21 @@ enum {

 #define EVENTFS_MODE_MASK	(EVENTFS_SAVE_MODE - 1)

+static void free_ei_rcu(struct rcu_head *rcu)
+{
+	struct eventfs_inode *ei = container_of(rcu, struct eventfs_inode, rcu);
+	struct eventfs_root_inode *rei;
+
+	kfree(ei->entry_attrs);
+	kfree_const(ei->name);
+	if (ei->is_events) {
+		rei = get_root_inode(ei);
+		kfree(rei);
+	} else {
+		kfree(ei);
+	}
+}
+
 /*
  * eventfs_inode reference count management.
  *
@@ -85,7 +100,6 @@ static void release_ei(struct kref *ref)
 {
	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
	const struct eventfs_entry *entry;
-	struct eventfs_root_inode *rei;

	WARN_ON_ONCE(!ei->is_freed);

@@ -95,14 +109,7 @@ static void release_ei(struct kref *ref)
		entry->release(entry->name, ei->data);
	}

-	kfree(ei->entry_attrs);
-	kfree_const(ei->name);
-	if (ei->is_events) {
-		rei = get_root_inode(ei);
-		kfree_rcu(rei, ei.rcu);
-	} else {
-		kfree_rcu(ei, rcu);
-	}
+	call_rcu(&ei->rcu, free_ei_rcu);
 }

 static inline void put_ei(struct eventfs_inode *ei)
-- 
2.43.0
[PATCH v3 3/6] tracefs: Still use mount point as default permissions for instances
From: "Steven Rostedt (Google)"

If the instances directory's permissions were never changed, then have it and its children use the mount point permissions as the default.

Currently, the permissions of instance directories are determined by the instance directory's permissions itself. But if the tracefs file system is remounted with new permissions, the instance directory and its children should use the new permission. Because both the instance directory and its children use the instance directory's inode for permissions, it misses the update.

To demonstrate this:

# cd /sys/kernel/tracing/
# mkdir instances/foo
# ls -ld instances/foo
drwxr-x--- 5 root root 0 May 1 19:07 instances/foo
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 18:57 instances
# ls -ld current_tracer
-rw-r- 1 root root 0 May 1 18:57 current_tracer

# mount -o remount,gid=1002 .
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 18:57 instances
# ls -ld instances/foo/
drwxr-x--- 5 root root 0 May 1 19:07 instances/foo/
# ls -ld current_tracer
-rw-r- 1 root lkp 0 May 1 18:57 current_tracer

Notice that changing the group id to that of "lkp" did not affect the instances directory nor its children. It should have been:

# ls -ld current_tracer
-rw-r- 1 root root 0 May 1 19:19 current_tracer
# ls -ld instances/foo/
drwxr-x--- 5 root root 0 May 1 19:25 instances/foo/
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 19:19 instances

# mount -o remount,gid=1002 .
# ls -ld current_tracer
-rw-r- 1 root lkp 0 May 1 19:19 current_tracer
# ls -ld instances
drwxr-x--- 3 root lkp 0 May 1 19:19 instances
# ls -ld instances/foo/
drwxr-x--- 5 root lkp 0 May 1 19:25 instances/foo/

Where all files were updated by the remount gid update.
Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/inode.c | 27 +-- 1 file changed, 25 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 52aa14bd2994..417c840e6403 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -180,16 +180,39 @@ static void set_tracefs_inode_owner(struct inode *inode) { struct tracefs_inode *ti = get_tracefs(inode); struct inode *root_inode = ti->private; + kuid_t uid; + kgid_t gid; + + uid = root_inode->i_uid; + gid = root_inode->i_gid; + + /* +* If the root is not the mount point, then check the root's +* permissions. If it was never set, then default to the +* mount point. +*/ + if (root_inode != d_inode(root_inode->i_sb->s_root)) { + struct tracefs_inode *rti; + + rti = get_tracefs(root_inode); + root_inode = d_inode(root_inode->i_sb->s_root); + + if (!(rti->flags & TRACEFS_UID_PERM_SET)) + uid = root_inode->i_uid; + + if (!(rti->flags & TRACEFS_GID_PERM_SET)) + gid = root_inode->i_gid; + } /* * If this inode has never been referenced, then update * the permissions to the superblock. */ if (!(ti->flags & TRACEFS_UID_PERM_SET)) - inode->i_uid = root_inode->i_uid; + inode->i_uid = uid; if (!(ti->flags & TRACEFS_GID_PERM_SET)) - inode->i_gid = root_inode->i_gid; + inode->i_gid = gid; } static int tracefs_permission(struct mnt_idmap *idmap, -- 2.43.0
[PATCH v3 0/6] tracefs/eventfs: Fix inconsistent permissions
The tracefs and eventfs permissions are created dynamically based on what the mount point inode or the instances directory inode has. But the way it worked had some inconsistencies that could lead to security issues, as the file system was not behaving like admins would expect. The files and directories could ignore the remount option that changes the gid or uid ownerships, leaving files susceptible to access that is not expected. This happens if a file had its value changed previously and then a remount changed all the files' permissions: the one that was changed previously would not be affected. This change set resolves these inconsistencies.

This also fixes the test_ownership.tc test, as it would pass the first time it is run but fail the second time because of the inconsistent state of the permissions. Now you can run that test multiple times and it will always pass.

Changes since v2: https://lore.kernel.org/linux-trace-kernel/20240502151547.973653...@goodmis.org/

- The eventfs_inode freeing was incorrect. The kref_put() would call release_ei(), which freed the contents of the eventfs_inode and then called kfree_rcu() on the eventfs_inode itself. The contents of the eventfs_inode need to be freed after the RCU synchronization as well. The patches here add even more cases where that's a requirement.

- Add an iput callback for the tracefs_inode to clear the TRACEFS_EVENT_INODE flag. This prevents the clearing of flags in remount from going into the eventfs_remount() function. An RCU grace cycle happens between the clearing of this flag and where the eventfs_inode is freed, so it is OK if the iteration is happening at the same time, as it is done under rcu_read_lock().

Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240502030024.062275...@goodmis.org/

- Testing showed that taking a mutex when freeing the tracefs_inode caused a lockdep splat, as it can happen in the RCU softirq context.
Convert the mutex to a spinlock for adding and removing the node from the linked list, and free the node via call_rcu() so that the iteration of the list only needs to be protected by rcu_read_lock().

Steven Rostedt (Google) (6):
      eventfs: Free all of the eventfs_inode after RCU
      tracefs: Reset permissions on remount if permissions are options
      tracefs: Still use mount point as default permissions for instances
      eventfs: Do not differentiate the toplevel events directory
      eventfs: Do not treat events directory different than other directories
      eventfs: Have "events" directory get permissions from its parent

 fs/tracefs/event_inode.c | 127 ---
 fs/tracefs/inode.c       | 92 --
 fs/tracefs/internal.h    | 14 --
 3 files changed, 175 insertions(+), 58 deletions(-)
Re: [PATCH v2 1/5] tracefs: Reset permissions on remount if permissions are options
On Thu, 02 May 2024 11:15:48 -0400 Steven Rostedt wrote: > +/* > + * On a remount of tracefs, if UID or GID options are set, then > + * the mount point inode permissions should be used. > + * Reset the saved permission flags appropriately. > + */ > +void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool > update_gid) > +{ > + struct eventfs_inode *ei = ti->private; And I just realized there's a race here too :-p I need to set ti->private = NULL before freeing the ei, and probably free the ei via RCU. -- Steve > + > + if (!ei) > + return; > + > + if (update_uid) > + ei->attr.mode &= ~EVENTFS_SAVE_UID; > + > + if (update_gid) > + ei->attr.mode &= ~EVENTFS_SAVE_GID; > + > + if (!ei->entry_attrs) > + return; > + > + for (int i = 0; i < ei->nr_entries; i++) { > + if (update_uid) > + ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID; > + if (update_gid) > + ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID; > + } > +} > +
[PATCH v2 5/5] eventfs: Have "events" directory get permissions from its parent
From: "Steven Rostedt (Google)"

The events directory gets its permissions from the root inode. But this can cause an inconsistency if the instances directory changes its permissions, as directories created under it should inherit the permissions of the instances directory. Currently the behavior is:

# cd /sys/kernel/tracing
# chgrp 1002 instances
# mkdir instances/foo
# ls -l instances/foo
[..]
-r--r- 1 root lkp 0 May 1 18:55 buffer_total_size_kb
-rw-r- 1 root lkp 0 May 1 18:55 current_tracer
-rw-r- 1 root lkp 0 May 1 18:55 error_log
drwxr-xr-x 1 root root 0 May 1 18:55 events
--w--- 1 root lkp 0 May 1 18:55 free_buffer
drwxr-x--- 2 root lkp 0 May 1 18:55 options
drwxr-x--- 10 root lkp 0 May 1 18:55 per_cpu
-rw-r- 1 root lkp 0 May 1 18:55 set_event

All the files and directories under "foo" have the "lkp" group except the "events" directory. That's because it's getting its default value from the mount point instead of its parent. Have the "events" directory make its default value based on its parent's permissions. That now gives:

# ls -l instances/foo
[..]
-rw-r- 1 root lkp 0 May 1 21:16 buffer_subbuf_size_kb -r--r- 1 root lkp 0 May 1 21:16 buffer_total_size_kb -rw-r- 1 root lkp 0 May 1 21:16 current_tracer -rw-r- 1 root lkp 0 May 1 21:16 error_log drwxr-xr-x 1 root lkp 0 May 1 21:16 events --w--- 1 root lkp 0 May 1 21:16 free_buffer drwxr-x--- 2 root lkp 0 May 1 21:16 options drwxr-x--- 10 root lkp 0 May 1 21:16 per_cpu -rw-r- 1 root lkp 0 May 1 21:16 set_event Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 30 -- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index ff4ee3dad56a..3ea32159d556 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -37,6 +37,7 @@ static DEFINE_MUTEX(eventfs_mutex); struct eventfs_root_inode { struct eventfs_inodeei; + struct inode*parent_inode; struct dentry *events_dir; }; @@ -219,12 +220,23 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb) { - struct inode *root; + struct eventfs_root_inode *rei; + struct inode *parent; + + rei = get_root_inode(ei); + + /* Use the parent inode permissions unless root set its permissions */ + parent = rei->parent_inode; - /* Get the tracefs root inode. 
*/ - root = d_inode(sb->s_root); - ei->attr.uid = root->i_uid; - ei->attr.gid = root->i_gid; + if (rei->ei.attr.mode & EVENTFS_SAVE_UID) + ei->attr.uid = rei->ei.attr.uid; + else + ei->attr.uid = parent->i_uid; + + if (rei->ei.attr.mode & EVENTFS_SAVE_GID) + ei->attr.gid = rei->ei.attr.gid; + else + ei->attr.gid = parent->i_gid; } static void set_top_events_ownership(struct inode *inode) @@ -810,6 +822,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry // Note: we have a ref to the dentry from tracefs_start_creating() rei = get_root_inode(ei); rei->events_dir = dentry; + rei->parent_inode = d_inode(dentry->d_sb->s_root); ei->entries = entries; ei->nr_entries = size; @@ -819,10 +832,15 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry uid = d_inode(dentry->d_parent)->i_uid; gid = d_inode(dentry->d_parent)->i_gid; - /* This is used as the default ownership of the files and directories */ ei->attr.uid = uid; ei->attr.gid = gid; + /* +* When the "events" directory is created, it takes on the +* permissions of its parent. But can be reset on remount. +*/ + ei->attr.mode |= EVENTFS_SAVE_UID | EVENTFS_SAVE_GID; + INIT_LIST_HEAD(>children); INIT_LIST_HEAD(>list); -- 2.43.0
[PATCH v2 4/5] eventfs: Do not treat events directory different than other directories
From: "Steven Rostedt (Google)"

Treat the events directory the same as other directories when it comes to permissions. The events directory was considered different because its dentry is persistent, whereas the other directory dentries are created when accessed. But with the way tracefs now derives default ownership from the root dentry's permissions, the events directory can get out of sync when a remount sets the group and user permissions.

Remove the special case for the events directory on setting the attributes. This allows the updates caused by remount to work properly as well as simplifying the code.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2971609946e5..ff4ee3dad56a 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -199,21 +199,7 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
	 * determined by the parent directory.
	 */
	if (dentry->d_inode->i_mode & S_IFDIR) {
-		/*
-		 * The events directory dentry is never freed, unless its
-		 * part of an instance that is deleted. It's attr is the
-		 * default for its child files and directories.
-		 * Do not update it. It's not used for its own mode or ownership.
-		 */
-		if (ei->is_events) {
-			/* But it still needs to know if it was modified */
-			if (iattr->ia_valid & ATTR_UID)
-				ei->attr.mode |= EVENTFS_SAVE_UID;
-			if (iattr->ia_valid & ATTR_GID)
-				ei->attr.mode |= EVENTFS_SAVE_GID;
-		} else {
-			update_attr(&ei->attr, iattr);
-		}
+		update_attr(&ei->attr, iattr);
	} else {
		name = dentry->d_name.name;
-- 
2.43.0
[PATCH v2 0/5] tracefs/eventfs: Fix inconsistent permissions
The tracefs and eventfs permissions are created dynamically based on what the mount point inode or the instances directory inode has. But the way it worked had some inconsistencies that could lead to security issues, as the file system was not behaving like admins would expect. The files and directories could ignore the remount option that changes the gid or uid ownerships, leaving files susceptible to access that is not expected. This happens if a file had its value changed previously and then a remount changed all the files' permissions: the one that was changed previously would not be affected. This change set resolves these inconsistencies.

This also fixes the test_ownership.tc test, as it would pass the first time it is run but fail the second time because of the inconsistent state of the permissions. Now you can run that test multiple times and it will always pass.

Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240502030024.062275...@goodmis.org/

- Testing showed that taking a mutex when freeing the tracefs_inode caused a lockdep splat, as it can happen in the RCU softirq context. Convert the mutex to a spinlock for adding and removing the node from the linked list, and free the node via call_rcu() so that the iteration of the list only needs to be protected by rcu_read_lock().

Steven Rostedt (Google) (5):
      tracefs: Reset permissions on remount if permissions are options
      tracefs: Still use mount point as default permissions for instances
      eventfs: Do not differentiate the toplevel events directory
      eventfs: Do not treat events directory different than other directories
      eventfs: Have "events" directory get permissions from its parent

 fs/tracefs/event_inode.c | 102 ---
 fs/tracefs/inode.c       | 77 +--
 fs/tracefs/internal.h    | 14 ---
 3 files changed, 144 insertions(+), 49 deletions(-)
[PATCH v2 2/5] tracefs: Still use mount point as default permissions for instances
From: "Steven Rostedt (Google)"

If the instances directory's permissions were never changed, then have it and its children use the mount point permissions as the default.

Currently, the permissions of instance directories are determined by the instance directory's permissions itself. But if the tracefs file system is remounted with new permissions, the instance directory and its children should use the new permission. Because both the instance directory and its children use the instance directory's inode for permissions, it misses the update.

To demonstrate this:

# cd /sys/kernel/tracing/
# mkdir instances/foo
# ls -ld instances/foo
drwxr-x--- 5 root root 0 May 1 19:07 instances/foo
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 18:57 instances
# ls -ld current_tracer
-rw-r- 1 root root 0 May 1 18:57 current_tracer

# mount -o remount,gid=1002 .
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 18:57 instances
# ls -ld instances/foo/
drwxr-x--- 5 root root 0 May 1 19:07 instances/foo/
# ls -ld current_tracer
-rw-r- 1 root lkp 0 May 1 18:57 current_tracer

Notice that changing the group id to that of "lkp" did not affect the instances directory nor its children. It should have been:

# ls -ld current_tracer
-rw-r- 1 root root 0 May 1 19:19 current_tracer
# ls -ld instances/foo/
drwxr-x--- 5 root root 0 May 1 19:25 instances/foo/
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 19:19 instances

# mount -o remount,gid=1002 .
# ls -ld current_tracer
-rw-r- 1 root lkp 0 May 1 19:19 current_tracer
# ls -ld instances
drwxr-x--- 3 root lkp 0 May 1 19:19 instances
# ls -ld instances/foo/
drwxr-x--- 5 root lkp 0 May 1 19:25 instances/foo/

Where all files were updated by the remount gid update.
Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/inode.c | 27 +-- 1 file changed, 25 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 5f2417df1c43..e1bf65efdbb3 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -180,16 +180,39 @@ static void set_tracefs_inode_owner(struct inode *inode) { struct tracefs_inode *ti = get_tracefs(inode); struct inode *root_inode = ti->private; + kuid_t uid; + kgid_t gid; + + uid = root_inode->i_uid; + gid = root_inode->i_gid; + + /* +* If the root is not the mount point, then check the root's +* permissions. If it was never set, then default to the +* mount point. +*/ + if (root_inode != d_inode(root_inode->i_sb->s_root)) { + struct tracefs_inode *rti; + + rti = get_tracefs(root_inode); + root_inode = d_inode(root_inode->i_sb->s_root); + + if (!(rti->flags & TRACEFS_UID_PERM_SET)) + uid = root_inode->i_uid; + + if (!(rti->flags & TRACEFS_GID_PERM_SET)) + gid = root_inode->i_gid; + } /* * If this inode has never been referenced, then update * the permissions to the superblock. */ if (!(ti->flags & TRACEFS_UID_PERM_SET)) - inode->i_uid = root_inode->i_uid; + inode->i_uid = uid; if (!(ti->flags & TRACEFS_GID_PERM_SET)) - inode->i_gid = root_inode->i_gid; + inode->i_gid = gid; } static int tracefs_permission(struct mnt_idmap *idmap, -- 2.43.0
[PATCH v2 3/5] eventfs: Do not differentiate the toplevel events directory
From: "Steven Rostedt (Google)" The toplevel events directory is really no different than the events directory of instances. Having the two be different caused inconsistencies and made it harder to fix the permissions bugs. Make all events directories act the same. Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 29 - fs/tracefs/internal.h| 7 +++ 2 files changed, 11 insertions(+), 25 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index a754c6725a52..2971609946e5 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -68,7 +68,6 @@ enum { EVENTFS_SAVE_MODE = BIT(16), EVENTFS_SAVE_UID= BIT(17), EVENTFS_SAVE_GID= BIT(18), - EVENTFS_TOPLEVEL= BIT(19), }; #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) @@ -232,14 +231,10 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } -static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb) +static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb) { struct inode *root; - /* Only update if the "events" was on the top level */ - if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) - return; - /* Get the tracefs root inode. 
*/ root = d_inode(sb->s_root); ei->attr.uid = root->i_uid; @@ -252,10 +247,10 @@ static void set_top_events_ownership(struct inode *inode) struct eventfs_inode *ei = ti->private; /* The top events directory doesn't get automatically updated */ - if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + if (!ei || !ei->is_events) return; - update_top_events_attr(ei, inode->i_sb); + update_events_attr(ei, inode->i_sb); if (!(ei->attr.mode & EVENTFS_SAVE_UID)) inode->i_uid = ei->attr.uid; @@ -284,7 +279,7 @@ static int eventfs_permission(struct mnt_idmap *idmap, return generic_permission(idmap, inode, mask); } -static const struct inode_operations eventfs_root_dir_inode_operations = { +static const struct inode_operations eventfs_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, @@ -352,7 +347,7 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) // Walk upwards until you find the events inode } while (!ei->is_events); - update_top_events_attr(ei, dentry->d_sb); + update_events_attr(ei, dentry->d_sb); return ei; } @@ -458,7 +453,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, update_inode_attr(dentry, inode, >attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; /* All directories will have the same inode number */ @@ -838,14 +833,6 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry uid = d_inode(dentry->d_parent)->i_uid; gid = d_inode(dentry->d_parent)->i_gid; - /* -* If the events directory is of the top instance, then parent -* is NULL. Set the attr.mode to reflect this and its permissions will -* default to the tracefs root dentry. 
-*/ - if (!parent) - ei->attr.mode = EVENTFS_TOPLEVEL; - /* This is used as the default ownership of the files and directories */ ei->attr.uid = uid; ei->attr.gid = gid; @@ -854,13 +841,13 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry INIT_LIST_HEAD(>list); ti = get_tracefs(inode); - ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; + ti->flags |= TRACEFS_EVENT_INODE; ti->private = ei; inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO; inode->i_uid = uid; inode->i_gid = gid; - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; dentry->d_fsdata = get_ei(ei); diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 29f0c75b..f704d8348357 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -4,10 +4,9 @@ enum { TRACEFS_EVENT_INODE = BIT(1), - TRACEFS_EVENT_TOP_INODE = BIT(2), - TRACEFS_GID_PERM_SET=
[PATCH v2 1/5] tracefs: Reset permissions on remount if permissions are options
From: "Steven Rostedt (Google)" There's an inconsistency with the way permissions are handled in tracefs. Because the permissions are generated when accessed, they default to the root inode's permission if they were never set by the user. If the user sets the permissions, then a flag is set and the permissions are saved via the inode (for tracefs files) or an internal attribute field (for eventfs). But if a remount happens that specify the permissions, all the files that were not changed by the user gets updated, but the ones that were are not. If the user were to remount the file system with a given permission, then all files and directories within that file system should be updated. This can cause security issues if a file's permission was updated but the admin forgot about it. They could incorrectly think that remounting with permissions set would update all files, but miss some. For example: # cd /sys/kernel/tracing # chgrp 1002 current_tracer # ls -l [..] -rw-r- 1 root root 0 May 1 21:25 buffer_size_kb -rw-r- 1 root root 0 May 1 21:25 buffer_subbuf_size_kb -r--r- 1 root root 0 May 1 21:25 buffer_total_size_kb -rw-r- 1 root lkp 0 May 1 21:25 current_tracer -rw-r- 1 root root 0 May 1 21:25 dynamic_events -r--r- 1 root root 0 May 1 21:25 dyn_ftrace_total_info -r--r- 1 root root 0 May 1 21:25 enabled_functions Where current_tracer now has group "lkp". # mount -o remount,gid=1001 . # ls -l -rw-r- 1 root tracing 0 May 1 21:25 buffer_size_kb -rw-r- 1 root tracing 0 May 1 21:25 buffer_subbuf_size_kb -r--r- 1 root tracing 0 May 1 21:25 buffer_total_size_kb -rw-r- 1 root lkp 0 May 1 21:25 current_tracer -rw-r- 1 root tracing 0 May 1 21:25 dynamic_events -r--r- 1 root tracing 0 May 1 21:25 dyn_ftrace_total_info -r--r- 1 root tracing 0 May 1 21:25 enabled_functions Everything changed but the "current_tracer". 
Add a new link list that keeps track of all the tracefs_inodes which has the permission flags that tell if the file/dir should use the root inode's permission or not. Then on remount, clear all the flags so that the default behavior of using the root inode's permission is done for all files and directories. Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 29 +++ fs/tracefs/inode.c | 50 +++- fs/tracefs/internal.h| 7 +- 3 files changed, 84 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index f5510e26f0f6..a754c6725a52 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -301,6 +301,35 @@ static const struct file_operations eventfs_file_operations = { .llseek = generic_file_llseek, }; +/* + * On a remount of tracefs, if UID or GID options are set, then + * the mount point inode permissions should be used. + * Reset the saved permission flags appropriately. 
+ */ +void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool update_gid) +{ + struct eventfs_inode *ei = ti->private; + + if (!ei) + return; + + if (update_uid) + ei->attr.mode &= ~EVENTFS_SAVE_UID; + + if (update_gid) + ei->attr.mode &= ~EVENTFS_SAVE_GID; + + if (!ei->entry_attrs) + return; + + for (int i = 0; i < ei->nr_entries; i++) { + if (update_uid) + ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID; + if (update_gid) + ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID; + } +} + /* Return the evenfs_inode of the "events" directory */ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) { diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 5545e6bf7d26..5f2417df1c43 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -30,20 +30,47 @@ static struct vfsmount *tracefs_mount; static int tracefs_mount_count; static bool tracefs_registered; +/* + * Keep track of all tracefs_inodes in order to update their + * flags if necessary on a remount. + */ +static DEFINE_SPINLOCK(tracefs_inode_lock); +static LIST_HEAD(tracefs_inodes); + static struct inode *tracefs_alloc_inode(struct super_block *sb) { struct tracefs_inode *ti; + unsigned long flags; ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL); if (!ti) return NULL; + spin_lock_irqsave(_inode_lock, flags); + list_add_rcu(>list, _inodes); + spin_unlock_irqrestore(_inode_lock, flags); + return >vfs_inode; } +static void tracefs_free_inode_rcu(struct rcu_head *rcu) +{ + struct tracefs
Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr
On Thu, 2 May 2024 06:49:18 + Tze-nan Wu (吳澤南) wrote: > Good news, this patch works, the test has passed, no more Kasan report > in my environment. Great to hear! > > my environment: > arm64 + kasan + swtag based kasan + kernel-6.6.18 > > Really appreciate, and learn a lot from the patch. Can I add: Tested-by: Tze-nan Wu (吳澤南) ? -- Steve
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Thu, 2 May 2024 14:38:32 +0100 Vincent Donnefort wrote:

> > > +	while (s < nr_subbufs && p < nr_pages) {
> > > +		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
> > > +		int off = 0;
> > > +
> > > +		for (; off < (1 << (subbuf_order)); off++, page++) {
> > > +			if (p >= nr_pages)
> > > +				break;
> > > +
> > > +			pages[p++] = page;
> > > +		}
> > > +		s++;
> > > +	}
> > > +
> > > +	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
> >
> > Nit: I did not immediately understand if we could end here with p < nr_pages
> > (IOW, pages[] not completely filled).
> >
> > One source of confusion is the "s < nr_subbufs" check in the while loop: why
> > is "p < nr_pages" insufficient?
>
> Hum, indeed, the "s < nr_subbufs" check is superfluous, nr_pages is already
> capped by the number of subbufs, there's no way we can overflow subbuf_ids[].

We can keep it as is, or perhaps change it to:

	while (p < nr_pages) {
		struct page *page;
		int off = 0;

		if (WARN_ON_ONCE(s >= nr_subbufs))
			break;

		page = virt_to_page(cpu_buffer->subbuf_ids[s]);

		for (; off < (1 << (subbuf_order)); off++, page++) {
			if (p >= nr_pages)
				break;

			pages[p++] = page;
		}
		s++;
	}

I don't like having an unchecked dependency between s and p.

-- Steve
Re: [PATCH] eventfs/tracing: Add callback for release of an eventfs_inode
On Wed, 1 May 2024 23:56:26 +0900 Masami Hiramatsu (Google) wrote:

> Looks good to me.
>
> Reviewed-by: Masami Hiramatsu (Google)

Thanks Masami. Tze-nan pointed out an issue with this patch, though, so I just published v2. Can you review that one too?

Thanks,

-- Steve
[PATCH v2] eventfs/tracing: Add callback for release of an eventfs_inode
From: "Steven Rostedt (Google)" Synthetic events create and destroy tracefs files when they are created and removed. The tracing subsystem has its own file descriptor representing the state of the events attached to the tracefs files. There's a race between the eventfs files and this file descriptor of the tracing system where the following can cause an issue: With two scripts 'A' and 'B' doing: Script 'A': echo "hello int aaa" > /sys/kernel/tracing/synthetic_events while : do echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable done Script 'B': echo > /sys/kernel/tracing/synthetic_events Script 'A' creates a synthetic event "hello" and then just writes zero into its enable file. Script 'B' removes all synthetic events (including the newly created "hello" event). What happens is that the opening of the "enable" file has: { struct trace_event_file *file = inode->i_private; int ret; ret = tracing_check_open_get_tr(file->tr); [..] But deleting the events frees the "file" descriptor, and a "use after free" happens with the dereference at "file->tr". The file descriptor does have a reference counter, but there needs to be a way to decrement it from the eventfs when the eventfs_inode is removed that represents this file descriptor. Add an optional "release" callback to the eventfs_entry array structure, that gets called when the eventfs file is about to be removed. This allows for the creating on the eventfs file to increment the tracing file descriptor ref counter. When the eventfs file is deleted, it can call the release function that will call the put function for the tracing file descriptor. This will protect the tracing file from being freed while a eventfs file that references it is being opened. 
Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-tze-nan...@mediatek.com/ Cc: sta...@vger.kernel.org Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: Tze-nan wu Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240430142327.7bff8...@gandalf.local.home/ - Do not call release function on error creating the eventfs_inode fs/tracefs/event_inode.c| 23 +-- include/linux/tracefs.h | 3 +++ kernel/trace/trace_events.c | 12 3 files changed, 36 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 894c6ca1e500..f5510e26f0f6 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -84,10 +84,17 @@ enum { static void release_ei(struct kref *ref) { struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref); + const struct eventfs_entry *entry; struct eventfs_root_inode *rei; WARN_ON_ONCE(!ei->is_freed); + for (int i = 0; i < ei->nr_entries; i++) { + entry = >entries[i]; + if (entry->release) + entry->release(entry->name, ei->data); + } + kfree(ei->entry_attrs); kfree_const(ei->name); if (ei->is_events) { @@ -112,6 +119,18 @@ static inline void free_ei(struct eventfs_inode *ei) } } +/* + * Called when creation of an ei fails, do not call release() functions. + */ +static inline void cleanup_ei(struct eventfs_inode *ei) +{ + if (ei) { + /* Set nr_entries to 0 to prevent release() function being called */ + ei->nr_entries = 0; + free_ei(ei); + } +} + static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei) { if (ei) @@ -734,7 +753,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode /* Was the parent freed? 
*/ if (list_empty(>list)) { - free_ei(ei); + cleanup_ei(ei); ei = NULL; } return ei; @@ -835,7 +854,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry return ei; fail: - free_ei(ei); + cleanup_ei(ei); tracefs_failed_creating(dentry); return ERR_PTR(-ENOMEM); } diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h index 7a5fe17b6bf9..d03f74658716 100644 --- a/include/linux/tracefs.h +++ b/include/linux/tracefs.h @@ -62,6 +62,8 @@ struct eventfs_file; typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, const struct file_operations **fops); +typedef void (*eventfs_release)(const char *name, void *data); + /** * struct eventfs_entry - dynamically created eventfs file call back handler * @name: Then name of the dynamic file in an eventfs directory @@ -72,6 +74,7 @@ typedef int (*eve
Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr
On Thu, 2 May 2024 03:10:24 + Tze-nan Wu (吳澤南) wrote: > > > Sorry for my late reply, I'm testing the patch on my machine now. > Test will be done in four hours. > > There's something I'm worrying about in the patch, > what I'm worrying about is commented in the code below. > > /kernel/trace/trace_events.c: > static int > event_create_dir(struct eventfs_inode *parent, > struct trace_event_file *file) > { > ... > ... > ... > nr_entries = ARRAY_SIZE(event_entries); > > name = trace_event_name(call); > > +event_file_get(file);// Line A > ^ > // Should we move the "event_file_get" to here, instead > // of calling it at line C? > // Due to Line B could eventually invoke "event_file_put". > // eventfs_create_dir -> free_ei ->put_ei -> kref_put > // -> release_ei -> event_release -> event_file_put > // Not sure if this is a potential risk? If Line B do call > // event_file_put,"event_file_put" will be called prior to > // "event_file_get", could corrupt the reference of the file. No, but you do bring up a good point. The release should not be called on error, but it looks like it possibly can be. > > ei = eventfs_create_dir(name, e_events,// Line B > event_entries, nr_entries, file); > if (IS_ERR(ei)) { > pr_warn("Could not create tracefs '%s' directory\n", > name); > return -1; > } > file->ei = ei; > > ret = event_define_fields(call); > if (ret < 0) { > pr_warn("Could not initialize trace point events/%s\n", > name); > return ret; >^ >// Maybe we chould have similar concern if we return here. >// Due to the event_inode had been created, but we did not call >// event_file_get. >// Could it lead to some issues in the future while freeing >// event_indoe? > } > > > -event_file_get(file); //Line C > return 0; > } This prevents the release() function from being called on failure of creating the ei. Can you try this patch instead? 
-- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 894c6ca1e500..f5510e26f0f6 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -84,10 +84,17 @@ enum { static void release_ei(struct kref *ref) { struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref); + const struct eventfs_entry *entry; struct eventfs_root_inode *rei; WARN_ON_ONCE(!ei->is_freed); + for (int i = 0; i < ei->nr_entries; i++) { + entry = >entries[i]; + if (entry->release) + entry->release(entry->name, ei->data); + } + kfree(ei->entry_attrs); kfree_const(ei->name); if (ei->is_events) { @@ -112,6 +119,18 @@ static inline void free_ei(struct eventfs_inode *ei) } } +/* + * Called when creation of an ei fails, do not call release() functions. + */ +static inline void cleanup_ei(struct eventfs_inode *ei) +{ + if (ei) { + /* Set nr_entries to 0 to prevent release() function being called */ + ei->nr_entries = 0; + free_ei(ei); + } +} + static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei) { if (ei) @@ -734,7 +753,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode /* Was the parent freed? 
*/ if (list_empty(>list)) { - free_ei(ei); + cleanup_ei(ei); ei = NULL; } return ei; @@ -835,7 +854,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry return ei; fail: - free_ei(ei); + cleanup_ei(ei); tracefs_failed_creating(dentry); return ERR_PTR(-ENOMEM); } diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h index 7a5fe17b6bf9..d03f74658716 100644 --- a/include/linux/tracefs.h +++ b/include/linux/tracefs.h @@ -62,6 +62,8 @@ struct eventfs_file; typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, const struct file_operations **fops); +typedef void (*eventfs_release)(const char *name, void *data); + /** * struct eventfs_entry - dynamically created eventfs file call back handler * @name: Then name of the dynamic file in an eventfs directory @@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, struct eventfs_entry { const char *name; eventfs_callbackcallback; + eventfs_release release; }; struct eventfs_inode; diff --git a/kernel/trace/trace_events.c
[PATCH 5/5] eventfs: Have "events" directory get permissions from its parent
From: "Steven Rostedt (Google)" The events directory gets its permissions from the root inode. But this can cause an inconsistency if the instances directory changes its permissions, as the permissions of the created directories under it should inherit the permissions of the instances directory when directories under it are created. Currently the behavior is: # cd /sys/kernel/tracing # chgrp 1002 instances # mkdir instances/foo # ls -l instances/foo [..] -r--r- 1 root lkp 0 May 1 18:55 buffer_total_size_kb -rw-r- 1 root lkp 0 May 1 18:55 current_tracer -rw-r- 1 root lkp 0 May 1 18:55 error_log drwxr-xr-x 1 root root 0 May 1 18:55 events --w--- 1 root lkp 0 May 1 18:55 free_buffer drwxr-x--- 2 root lkp 0 May 1 18:55 options drwxr-x--- 10 root lkp 0 May 1 18:55 per_cpu -rw-r- 1 root lkp 0 May 1 18:55 set_event All the files and directories under "foo" has the "lkp" group except the "events" directory. That's because its getting its default value from the mount point instead of its parent. Have the "events" directory make its default value based on its parent's permissions. That now gives: # ls -l instances/foo [..] 
-rw-r- 1 root lkp 0 May 1 21:16 buffer_subbuf_size_kb -r--r- 1 root lkp 0 May 1 21:16 buffer_total_size_kb -rw-r- 1 root lkp 0 May 1 21:16 current_tracer -rw-r- 1 root lkp 0 May 1 21:16 error_log drwxr-xr-x 1 root lkp 0 May 1 21:16 events --w--- 1 root lkp 0 May 1 21:16 free_buffer drwxr-x--- 2 root lkp 0 May 1 21:16 options drwxr-x--- 10 root lkp 0 May 1 21:16 per_cpu -rw-r- 1 root lkp 0 May 1 21:16 set_event Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 30 -- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 624a0e4a8e29..f77800398a54 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -37,6 +37,7 @@ static DEFINE_MUTEX(eventfs_mutex); struct eventfs_root_inode { struct eventfs_inodeei; + struct inode*parent_inode; struct dentry *events_dir; }; @@ -207,12 +208,23 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb) { - struct inode *root; + struct eventfs_root_inode *rei; + struct inode *parent; + + rei = get_root_inode(ei); + + /* Use the parent inode permissions unless root set its permissions */ + parent = rei->parent_inode; - /* Get the tracefs root inode. 
*/ - root = d_inode(sb->s_root); - ei->attr.uid = root->i_uid; - ei->attr.gid = root->i_gid; + if (rei->ei.attr.mode & EVENTFS_SAVE_UID) + ei->attr.uid = rei->ei.attr.uid; + else + ei->attr.uid = parent->i_uid; + + if (rei->ei.attr.mode & EVENTFS_SAVE_GID) + ei->attr.gid = rei->ei.attr.gid; + else + ei->attr.gid = parent->i_gid; } static void set_top_events_ownership(struct inode *inode) @@ -798,6 +810,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry // Note: we have a ref to the dentry from tracefs_start_creating() rei = get_root_inode(ei); rei->events_dir = dentry; + rei->parent_inode = d_inode(dentry->d_sb->s_root); ei->entries = entries; ei->nr_entries = size; @@ -807,10 +820,15 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry uid = d_inode(dentry->d_parent)->i_uid; gid = d_inode(dentry->d_parent)->i_gid; - /* This is used as the default ownership of the files and directories */ ei->attr.uid = uid; ei->attr.gid = gid; + /* +* When the "events" directory is created, it takes on the +* permissions of its parent. But can be reset on remount. +*/ + ei->attr.mode |= EVENTFS_SAVE_UID | EVENTFS_SAVE_GID; + INIT_LIST_HEAD(>children); INIT_LIST_HEAD(>list); -- 2.43.0
[PATCH 4/5] eventfs: Do not treat events directory different than other directories
From: "Steven Rostedt (Google)" Treat the events directory the same as other directories when it comes to permissions. The events directory was considered different because it's dentry is persistent, whereas the other directory dentries are created when accessed. But the way tracefs now does its ownership by using the root dentry's permissions as the default permissions, the events directory can get out of sync when a remount is performed setting the group and user permissions. Remove the special case for the events directory on setting the attributes. This allows the updates caused by remount to work properly as well as simplifies the code. Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 16 +--- 1 file changed, 1 insertion(+), 15 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 2c33262107c6..624a0e4a8e29 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -187,21 +187,7 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, * determined by the parent directory. */ if (dentry->d_inode->i_mode & S_IFDIR) { - /* -* The events directory dentry is never freed, unless its -* part of an instance that is deleted. It's attr is the -* default for its child files and directories. -* Do not update it. It's not used for its own mode or ownership. -*/ - if (ei->is_events) { - /* But it still needs to know if it was modified */ - if (iattr->ia_valid & ATTR_UID) - ei->attr.mode |= EVENTFS_SAVE_UID; - if (iattr->ia_valid & ATTR_GID) - ei->attr.mode |= EVENTFS_SAVE_GID; - } else { - update_attr(>attr, iattr); - } + update_attr(>attr, iattr); } else { name = dentry->d_name.name; -- 2.43.0
[PATCH 3/5] eventfs: Do not differentiate the toplevel events directory
From: "Steven Rostedt (Google)" The toplevel events directory is really no different than the events directory of instances. Having the two be different caused inconsistencies and made it harder to fix the permissions bugs. Make all events directories act the same. Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 29 - fs/tracefs/internal.h| 7 +++ 2 files changed, 11 insertions(+), 25 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index db71fed7057d..2c33262107c6 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -68,7 +68,6 @@ enum { EVENTFS_SAVE_MODE = BIT(16), EVENTFS_SAVE_UID= BIT(17), EVENTFS_SAVE_GID= BIT(18), - EVENTFS_TOPLEVEL= BIT(19), }; #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) @@ -220,14 +219,10 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } -static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb) +static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb) { struct inode *root; - /* Only update if the "events" was on the top level */ - if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) - return; - /* Get the tracefs root inode. 
*/ root = d_inode(sb->s_root); ei->attr.uid = root->i_uid; @@ -240,10 +235,10 @@ static void set_top_events_ownership(struct inode *inode) struct eventfs_inode *ei = ti->private; /* The top events directory doesn't get automatically updated */ - if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + if (!ei || !ei->is_events) return; - update_top_events_attr(ei, inode->i_sb); + update_events_attr(ei, inode->i_sb); if (!(ei->attr.mode & EVENTFS_SAVE_UID)) inode->i_uid = ei->attr.uid; @@ -272,7 +267,7 @@ static int eventfs_permission(struct mnt_idmap *idmap, return generic_permission(idmap, inode, mask); } -static const struct inode_operations eventfs_root_dir_inode_operations = { +static const struct inode_operations eventfs_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, @@ -340,7 +335,7 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) // Walk upwards until you find the events inode } while (!ei->is_events); - update_top_events_attr(ei, dentry->d_sb); + update_events_attr(ei, dentry->d_sb); return ei; } @@ -446,7 +441,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, update_inode_attr(dentry, inode, >attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; /* All directories will have the same inode number */ @@ -826,14 +821,6 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry uid = d_inode(dentry->d_parent)->i_uid; gid = d_inode(dentry->d_parent)->i_gid; - /* -* If the events directory is of the top instance, then parent -* is NULL. Set the attr.mode to reflect this and its permissions will -* default to the tracefs root dentry. 
-*/ - if (!parent) - ei->attr.mode = EVENTFS_TOPLEVEL; - /* This is used as the default ownership of the files and directories */ ei->attr.uid = uid; ei->attr.gid = gid; @@ -842,13 +829,13 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry INIT_LIST_HEAD(>list); ti = get_tracefs(inode); - ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; + ti->flags |= TRACEFS_EVENT_INODE; ti->private = ei; inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO; inode->i_uid = uid; inode->i_gid = gid; - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; dentry->d_fsdata = get_ei(ei); diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 661ac13e2984..d83c2a25f288 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -4,10 +4,9 @@ enum { TRACEFS_EVENT_INODE = BIT(1), - TRACEFS_EVENT_TOP_INODE = BIT(2), - TRACEFS_GID_PERM_SET=
[PATCH 2/5] tracefs: Still use mount point as default permissions for instances
From: "Steven Rostedt (Google)" If the instances directory's permissions were never change, then have it and its children use the mount point permissions as the default. Currently, the permissions of instance directories are determined by the instance directory's permissions itself. But if the tracefs file system is remounted and changes the permissions, the instance directory and its children should use the new permission. But because both the instance directory and its children use the instance directory's inode for permissions, it misses the update. To demonstrate this: # cd /sys/kernel/tracing/ # mkdir instances/foo # ls -ld instances/foo drwxr-x--- 5 root root 0 May 1 19:07 instances/foo # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld current_tracer -rw-r- 1 root root 0 May 1 18:57 current_tracer # mount -o remount,gid=1002 . # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:07 instances/foo/ # ls -ld current_tracer -rw-r- 1 root lkp 0 May 1 18:57 current_tracer Notice that changing the group id to that of "lkp" did not affect the instances directory nor its children. It should have been: # ls -ld current_tracer -rw-r- 1 root root 0 May 1 19:19 current_tracer # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:25 instances/foo/ # ls -ld instances drwxr-x--- 3 root root 0 May 1 19:19 instances # mount -o remount,gid=1002 . # ls -ld current_tracer -rw-r- 1 root lkp 0 May 1 19:19 current_tracer # ls -ld instances drwxr-x--- 3 root lkp 0 May 1 19:19 instances # ls -ld instances/foo/ drwxr-x--- 5 root lkp 0 May 1 19:25 instances/foo/ Where all files were updated by the remount gid update. 
Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/inode.c | 27 +-- 1 file changed, 25 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 2a1dc2b442d1..1130c0fe2426 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -170,16 +170,39 @@ static void set_tracefs_inode_owner(struct inode *inode) { struct tracefs_inode *ti = get_tracefs(inode); struct inode *root_inode = ti->private; + kuid_t uid; + kgid_t gid; + + uid = root_inode->i_uid; + gid = root_inode->i_gid; + + /* +* If the root is not the mount point, then check the root's +* permissions. If it was never set, then default to the +* mount point. +*/ + if (root_inode != d_inode(root_inode->i_sb->s_root)) { + struct tracefs_inode *rti; + + rti = get_tracefs(root_inode); + root_inode = d_inode(root_inode->i_sb->s_root); + + if (!(rti->flags & TRACEFS_UID_PERM_SET)) + uid = root_inode->i_uid; + + if (!(rti->flags & TRACEFS_GID_PERM_SET)) + gid = root_inode->i_gid; + } /* * If this inode has never been referenced, then update * the permissions to the superblock. */ if (!(ti->flags & TRACEFS_UID_PERM_SET)) - inode->i_uid = root_inode->i_uid; + inode->i_uid = uid; if (!(ti->flags & TRACEFS_GID_PERM_SET)) - inode->i_gid = root_inode->i_gid; + inode->i_gid = gid; } static int tracefs_permission(struct mnt_idmap *idmap, -- 2.43.0
[PATCH 0/5] tracefs/eventfs: Fix inconsistent permissions
The tracefs and eventfs permissions are created dynamically, based on what the mount point inode or the instances directory inode has. But the way this worked had some inconsistencies that could lead to security issues, as the file system did not behave like admins would expect. The files and directories could ignore the remount option that changes the gid or uid ownerships, leaving files susceptible to unexpected access. This happens if a file had its value changed previously and then a remount changed all the files' permissions. The one that was changed previously would not be affected.

This change set resolves these inconsistencies.

This also fixes the test_ownership.tc test, as it would pass the first time it is run but fail the second time because of the inconsistent state of the permissions. Now you can run that test multiple times and it will always pass.

Steven Rostedt (Google) (5):
      tracefs: Reset permissions on remount if permissions are options
      tracefs: Still use mount point as default permissions for instances
      eventfs: Do not differentiate the toplevel events directory
      eventfs: Do not treat events directory different than other directories
      eventfs: Have "events" directory get permissions from its parent

 fs/tracefs/event_inode.c | 102 ++++++++++++++++++++++++++++-------------
 fs/tracefs/inode.c       |  67 ++++++++++++++++++++-------
 fs/tracefs/internal.h    |   9 +++-
 3 files changed, 130 insertions(+), 48 deletions(-)
[PATCH 1/5] tracefs: Reset permissions on remount if permissions are options
From: "Steven Rostedt (Google)"

There's an inconsistency with the way permissions are handled in tracefs. Because the permissions are generated when accessed, they default to the root inode's permission if they were never set by the user. If the user sets the permissions, then a flag is set and the permissions are saved via the inode (for tracefs files) or an internal attribute field (for eventfs).

But if a remount happens that specifies the permissions, all the files that were not changed by the user get updated, but the ones that were changed are not. If the user remounts the file system with a given permission, then all files and directories within that file system should be updated.

This can cause security issues if a file's permission was updated but the admin forgot about it. They could incorrectly think that remounting with permissions set would update all files, but miss some. For example:

 # cd /sys/kernel/tracing
 # chgrp 1002 current_tracer
 # ls -l
 [..]
 -rw-r----- 1 root root 0 May  1 21:25 buffer_size_kb
 -rw-r----- 1 root root 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r----- 1 root root 0 May  1 21:25 buffer_total_size_kb
 -rw-r----- 1 root lkp  0 May  1 21:25 current_tracer
 -rw-r----- 1 root root 0 May  1 21:25 dynamic_events
 -r--r----- 1 root root 0 May  1 21:25 dyn_ftrace_total_info
 -r--r----- 1 root root 0 May  1 21:25 enabled_functions

Where current_tracer now has group "lkp".

 # mount -o remount,gid=1001 .
 # ls -l
 -rw-r----- 1 root tracing 0 May  1 21:25 buffer_size_kb
 -rw-r----- 1 root tracing 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r----- 1 root tracing 0 May  1 21:25 buffer_total_size_kb
 -rw-r----- 1 root lkp     0 May  1 21:25 current_tracer
 -rw-r----- 1 root tracing 0 May  1 21:25 dynamic_events
 -r--r----- 1 root tracing 0 May  1 21:25 dyn_ftrace_total_info
 -r--r----- 1 root tracing 0 May  1 21:25 enabled_functions

Everything changed but the "current_tracer".
Add a new linked list that keeps track of all the tracefs_inodes, which have the permission flags that tell if the file/dir should use the root inode's permission or not. Then on remount, clear all the flags so that the default behavior of using the root inode's permission is done for all files and directories.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 29 +++++++++++++++++++++++++++++
 fs/tracefs/inode.c       | 40 +++++++++++++++++++++++++++++++++++++++-
 fs/tracefs/internal.h    |  2 ++
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index dc97c19f9e0a..db71fed7057d 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -289,6 +289,35 @@ static const struct file_operations eventfs_file_operations = {
 	.llseek		= generic_file_llseek,
 };

+/*
+ * On a remount of tracefs, if UID or GID options are set, then
+ * the mount point inode permissions should be used.
+ * Reset the saved permission flags appropriately.
+ */
+void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool update_gid)
+{
+	struct eventfs_inode *ei = ti->private;
+
+	if (!ei)
+		return;
+
+	if (update_uid)
+		ei->attr.mode &= ~EVENTFS_SAVE_UID;
+
+	if (update_gid)
+		ei->attr.mode &= ~EVENTFS_SAVE_GID;
+
+	if (!ei->entry_attrs)
+		return;
+
+	for (int i = 0; i < ei->nr_entries; i++) {
+		if (update_uid)
+			ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID;
+		if (update_gid)
+			ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID;
+	}
+}
+
 /* Return the eventfs_inode of the "events" directory */
 static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 5545e6bf7d26..2a1dc2b442d1 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -30,6 +30,13 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;

+/*
+ * Keep track of all tracefs_inodes in order to update their
+ * flags if necessary on a remount.
+ */
+static DEFINE_MUTEX(tracefs_inode_mutex);
+static LIST_HEAD(tracefs_inodes);
+
 static struct inode *tracefs_alloc_inode(struct super_block *sb)
 {
 	struct tracefs_inode *ti;
@@ -38,12 +45,22 @@ static struct inode *tracefs_alloc_inode(struct super_block *sb)
 	if (!ti)
 		return NULL;

+	mutex_lock(&tracefs_inode_mutex);
+	list_add(&ti->list, &tracefs_inodes);
+	mutex_unlock(&tracefs_inode_mutex);
+
 	return &ti->vfs_inode;
 }

 static void tracefs_free_inode(struct inode *inode)
 {
-	kmem_cache_free(tracefs_inode_cachep, get_tracefs(inode));
+	struct tr
[PATCH] eventfs/tracing: Add callback for release of an eventfs_inode
From: "Steven Rostedt (Google)"

Synthetic events create and destroy tracefs files when they are created and removed. The tracing subsystem has its own file descriptor representing the state of the events attached to the tracefs files. There's a race between the eventfs files and this file descriptor of the tracing system where the following can cause an issue:

With two scripts 'A' and 'B' doing:

 Script 'A':
   echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
   while :
   do
     echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
   done

 Script 'B':
   echo > /sys/kernel/tracing/synthetic_events

Script 'A' creates a synthetic event "hello" and then just writes zero into its enable file. Script 'B' removes all synthetic events (including the newly created "hello" event).

What happens is that the opening of the "enable" file has:

 {
	struct trace_event_file *file = inode->i_private;
	int ret;

	ret = tracing_check_open_get_tr(file->tr);
 [..]

But deleting the events frees the "file" descriptor, and a "use after free" happens with the dereference at "file->tr".

The file descriptor does have a reference counter, but there needs to be a way to decrement it from the eventfs when the eventfs_inode that represents this file descriptor is removed.

Add an optional "release" callback to the eventfs_entry array structure that gets called when the eventfs file is about to be removed. This allows the creator of the eventfs file to increment the tracing file descriptor's ref counter. When the eventfs file is deleted, it can call the release function, which will call the put function for the tracing file descriptor. This protects the tracing file from being freed while an eventfs file that references it is being opened.
Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-tze-nan...@mediatek.com/
Cc: sta...@vger.kernel.org
Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: Tze-nan wu
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c    |  7 +++++++
 include/linux/tracefs.h     |  3 +++
 kernel/trace/trace_events.c | 12 ++++++++++++
 3 files changed, 22 insertions(+)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 894c6ca1e500..dc97c19f9e0a 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -84,10 +84,17 @@ enum {
 static void release_ei(struct kref *ref)
 {
 	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
+	const struct eventfs_entry *entry;
 	struct eventfs_root_inode *rei;

 	WARN_ON_ONCE(!ei->is_freed);

+	for (int i = 0; i < ei->nr_entries; i++) {
+		entry = &ei->entries[i];
+		if (entry->release)
+			entry->release(entry->name, ei->data);
+	}
+
 	kfree(ei->entry_attrs);
 	kfree_const(ei->name);
 	if (ei->is_events) {
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 7a5fe17b6bf9..d03f74658716 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -62,6 +62,8 @@ struct eventfs_file;
 typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
 				const struct file_operations **fops);

+typedef void (*eventfs_release)(const char *name, void *data);
+
 /**
  * struct eventfs_entry - dynamically created eventfs file call back handler
  * @name:	The name of the dynamic file in an eventfs directory
@@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
 struct eventfs_entry {
 	const char		*name;
 	eventfs_callback	callback;
+	eventfs_release		release;
 };

 struct eventfs_inode;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 52f75c36bbca..6ef29eba90ce 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2552,6 +2552,14 @@ static int event_callback(const char *name, umode_t *mode, void **data,
 	return 0;
 }

+/* The file is incremented on creation and freeing the enable file decrements it */
+static void event_release(const char *name, void *data)
+{
+	struct trace_event_file *file = data;
+
+	event_file_put(file);
+}
+
 static int
 event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 {
@@ -2566,6 +2574,7 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 		{
 			.name		= "enable",
 			.callback	= event_callback,
+			.release	= event_release,
 		},
 		{
Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr
On Sun, 28 Apr 2024 20:28:37 -0400
Steven Rostedt wrote:

> > Looking for any suggestion or solution, appreciate.
>
> Yeah, I do not think eventfs should be involved in this. It needs to be
> protected at a higher level (in the synthetic/dynamic event code).
>
> I'm just coming back from Japan, and I'll need to take a deeper look at
> this after I recover from my jetlag.

OK, so I guess the eventfs nodes need an optional release callback.
Here's the right way to do that.

I added a "release" function to the passed in entry array that allows for
calling a release function when the eventfs_inode is freed. Then in the
code for creating events, I call event_file_get() on the file being
assigned and have the freeing of the "enable" file have the release
function that will call event_file_put() on that file structure.

Does this fix it for you?

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 894c6ca1e500..dc97c19f9e0a 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -84,10 +84,17 @@ enum {
 static void release_ei(struct kref *ref)
 {
 	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
+	const struct eventfs_entry *entry;
 	struct eventfs_root_inode *rei;

 	WARN_ON_ONCE(!ei->is_freed);

+	for (int i = 0; i < ei->nr_entries; i++) {
+		entry = &ei->entries[i];
+		if (entry->release)
+			entry->release(entry->name, ei->data);
+	}
+
 	kfree(ei->entry_attrs);
 	kfree_const(ei->name);
 	if (ei->is_events) {
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 7a5fe17b6bf9..d03f74658716 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -62,6 +62,8 @@ struct eventfs_file;
 typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
 				const struct file_operations **fops);

+typedef void (*eventfs_release)(const char *name, void *data);
+
 /**
  * struct eventfs_entry - dynamically created eventfs file call back handler
  * @name:	The name of the dynamic file in an eventfs directory
@@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
 struct eventfs_entry {
 	const char		*name;
 	eventfs_callback	callback;
+	eventfs_release		release;
 };

 struct eventfs_inode;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 52f75c36bbca..d14c84281f2b 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2552,6 +2552,14 @@ static int event_callback(const char *name, umode_t *mode, void **data,
 	return 0;
 }

+/* The file is incremented on creation and freeing the enable file decrements it */
+static void event_release(const char *name, void *data)
+{
+	struct trace_event_file *file = data;
+
+	event_file_put(file);
+}
+
 static int
 event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 {
@@ -2566,6 +2574,7 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 		{
 			.name		= "enable",
 			.callback	= event_callback,
+			.release	= event_release,
 		},
 		{
 			.name		= "filter",
@@ -2634,6 +2643,9 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 		return ret;
 	}

+	/* Gets decremented on freeing of the "enable" file */
+	event_file_get(file);
+
 	return 0;
 }
Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr
On Fri, 26 Apr 2024 15:34:08 +0800
Tze-nan wu wrote:

> "tracing_event_file" is at the risk of use-after-free due to the race of
> two functions "tracing_open_file_tr" and "synth_event_release".
> Specifically, it could be freed by synth_event_release before
> tracing_open_file_tr has the opportunity to access its members.
>
> It's easy to reproduce by first executing script 'A' and then script 'B'
> in my environment.
>
> Script 'A':
>   echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
>   while :
>   do
>     echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
>   done
>
> Script 'B':
>   echo > /sys/kernel/tracing/synthetic/events

I think you meant:

 echo > /sys/kernel/tracing/synthetic_events

> My environment:
> arm64 + generic and swtag based KASAN both enabled + kernel-6.6.18
>
> KASAN report
> ==
> BUG: KASAN: slab-use-after-free in tracing_open_file_tr
> Read of size 8 at addr 9*** by task sh/3485
> Pointer tag: [9*], memory tag: [fe]
>
> CPU: * PID: 3485 Comm: sh Tainted:
> Call trace:
>  __hwasan_tag_mismatch
>  tracing_open_file_tr
>  do_dentry_open
>  vfs_open
>  path_openat
>
> Freed by task 3490:
>  slab_free_freelist_hook
>  kmem_cache_free
>  event_file_put
>  remove_event_file_dir
>  __trace_remove_event_call
>  trace_remove_event_call
>  synth_event_release
>  dyn_events_release_all
>  synth_events_open
>
> page last allocated via order 0, migratetype Unmovable:
>  __slab_alloc
>  kmem_cache_alloc
>  trace_create_new_event
>  trace_add_event_call
>  register_synth_event
>  __create_synth_event
>  create_or_delete_synth_event
>  trace_parse_run_command
>  synth_events_write
>  vfs_write

Thanks for reporting this.

> Based on the assumption that eventfs_inode will persist throughout the
> execution of the tracing_open_file_tr function,
> we can fix this issue by incrementing the reference count of
> trace_event_file once it is assigned to eventfs_inode->data.
> The reference count will then be decremented during the release phase of
> the eventfs_inode.
> > This ensures that trace_event_file is not prematurely freed while the > tracing_open_file_tr function is being called. > > Fixes: bb32500fb9b7 ("tracing: Have trace_event_file have ref counters") > Co-developed-by: Cheng-Jui Wang > Signed-off-by: Cheng-Jui Wang > Signed-off-by: Tze-nan wu > --- > BTW, I've also attempted to reproduce the same issue in another > environment (described below). > It's also reproducible but in a lower rate. > > another environment: > x86 + ubuntu + generic kasan enabled + kernel-6.9-rc4 > > After applying the patch, KASAN no longer reports any issues when > following the same reproduction steps in my arm64 environment. > However, I am concerned about potential side effects that the patch might > introduce. > Additionally, the newly added function "is_entry_event_callback" may not > be reliable in my opinion, > as the callback function used in the comparison could change in future. > Nonetheless, this is the best solution I can come up with now. > > Looking for any suggestion or solution, appreciate. Yeah, I do not think eventfs should be involved in this. It needs to be protected at a higher level (in the synthetic/dynamic event code). I'm just coming back from Japan, and I'll need to take a deeper look at this after I recover from my jetlag. Thanks, -- Steve
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Thu, 25 Apr 2024 13:31:53 -0700
Andrii Nakryiko wrote:

I'm just coming back from Japan (work and then a vacation), and catching
up on my email during the 6 hour layover in Detroit.

> Hey Masami,
>
> I can't really review most of that code as I'm completely unfamiliar
> with all those inner workings of fprobe/ftrace/function_graph. I left
> a few comments where there were somewhat more obvious BPF-related
> pieces.
>
> But I also did run our BPF benchmarks on probes/for-next as a baseline
> and then with your series applied on top. Just to see if there are any
> regressions. I think it will be a useful data point for you.
>
> You should be already familiar with the bench tool we have in BPF
> selftests (I used it on some other patches for your tree).

I should get familiar with your tools too.

> BASELINE
> ========
> kprobe         : 24.634 ± 0.205M/s
> kprobe-multi   : 28.898 ± 0.531M/s
> kretprobe      : 10.478 ± 0.015M/s
> kretprobe-multi: 11.012 ± 0.063M/s
>
> THIS PATCH SET ON TOP
> =====================
> kprobe         : 25.144 ± 0.027M/s (+2%)
> kprobe-multi   : 28.909 ± 0.074M/s
> kretprobe      :  9.482 ± 0.008M/s (-9.5%)
> kretprobe-multi: 13.688 ± 0.027M/s (+24%)
>
> These numbers are pretty stable and look to be more or less representative.

Thanks for running this.

> As you can see, kprobes got a bit faster, kprobe-multi seems to be
> about the same, though.
>
> Then (I suppose they are "legacy") kretprobes got quite noticeably
> slower, almost by 10%. Not sure why, but looks real after re-running
> benchmarks a bunch of times and getting stable results.
>
> On the other hand, multi-kretprobes got significantly faster (+24%!).
> Again, I don't know if it is expected or not, but it's a nice
> improvement.
>
> If you have any idea why kretprobes would get so much slower, it would
> be nice to look into that and see if you can mitigate the regression
> somehow. Thanks!
My guess is that this patch set helps generic use cases for tracing the return of functions, but will likely add more overhead for single use cases. That is, kretprobe is made to be specific for a single function, but kretprobe-multi is more generic. Hence the generic version will improve at the sacrifice of the specific function. I did expect as much. That said, I think there's probably a lot of low hanging fruit that can be done to this series to help improve the kretprobe performance. I'm not sure we can get back to the baseline, but I'm hoping we can at least make it much better than that 10% slowdown. I'll be reviewing this patch set this week as I recover from jetlag. -- Steve
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Tue, 23 Apr 2024 12:04:15 -0400
"Liam R. Howlett" wrote:

> > Nit: For all labels, please add a space before them. Otherwise, diffs will
> > show "unlock" as the function and not "ring_buffer_map", making it harder
> > to find where the change is.
>
> Isn't the inclusion of a space before labels outdated since 'git diff'
> supersedes 'diff' and fixed this issue?

I still use patch and quilt ;-) and that still suffers from that on
reject files (used for backporting and such).

-- Steve
Re: [PATCH] ftrace: Replace ftrace_disabled variable with ftrace_is_dead function
On Sat, 20 Apr 2024 11:50:29 +0800
"Bang Li" wrote:

> Thank you for your explanation, let me understand this. How about
> replacing ftrace_disabled with unlikely(ftrace_disabled)?

Why? They are slow paths. No need to optimize them.

-- Steve
Re: [PATCH v9 07/36] function_graph: Allow multiple users to attach to function graph
On Mon, 15 Apr 2024 21:50:20 +0900
"Masami Hiramatsu (Google)" wrote:

> @@ -27,23 +28,157 @@
>
>  #define FGRAPH_RET_SIZE sizeof(struct ftrace_ret_stack)
>  #define FGRAPH_RET_INDEX DIV_ROUND_UP(FGRAPH_RET_SIZE, sizeof(long))
> +
> +/*
> + * On entry to a function (via function_graph_enter()), a new ftrace_ret_stack
> + * is allocated on the task's ret_stack with indexes entry, then each
> + * fgraph_ops on the fgraph_array[]'s entryfunc is called and if that returns
> + * non-zero, the index into the fgraph_array[] for that fgraph_ops is recorded
> + * on the indexes entry as a bit flag.
> + * As the associated ftrace_ret_stack saved for those fgraph_ops needs to
> + * be found, the index to it is also added to the ret_stack along with the
> + * index of the fgraph_array[] to each fgraph_ops that needs their retfunc
> + * called.
> + *
> + * The top of the ret_stack (when not empty) will always have a reference
> + * to the last ftrace_ret_stack saved. All references to the
> + * ftrace_ret_stack has the format of:
> + *
> + * bits:  0 -  9	offset in words from the previous ftrace_ret_stack
> + *			(bitmap type should have FGRAPH_RET_INDEX always)
> + * bits: 10 - 11	Type of storage
> + *			  0 - reserved
> + *			  1 - bitmap of fgraph_array index
> + *
> + * For bitmap of fgraph_array index
> + *  bits: 12 - 27	The bitmap of fgraph_ops fgraph_array index

I really hate the terminology I came up with here, and would love to get
better terminology for describing what is going on. I looked it over but
I'm constantly getting confused. And I wrote this code! Perhaps we should
use:

 @frame : The data that represents a single function call. When a function
          is traced, all the data used for all the callbacks attached to it
          is in a single frame. This would replace the FGRAPH_RET_SIZE as
          FGRAPH_FRAME_SIZE.

 @offset : This is the word size position on the stack. It would replace
           INDEX, as I think "index" is being used for more than one thing.
           Perhaps it should be "offset" when dealing with where it is on
           the shadow stack, and "pos" when dealing with which callback ops
           is being referenced.

> + *
> + * That is, at the end of function_graph_enter, if the first and fourth
> + * fgraph_ops on the fgraph_array[] (index 0 and 3) needs their retfunc called
> + * on the return of the function being traced, this is what will be on the
> + * task's shadow ret_stack: (the stack grows upward)
> + *
> + * |                                    | <- task->curr_ret_stack
> + * +------------------------------------+
> + * | bitmap_type(bitmap:(BIT(3)|BIT(0)),|
> + * |            offset:FGRAPH_RET_INDEX)| <- the offset is from here
> + * +------------------------------------+
> + * | struct ftrace_ret_stack            |
> + * |   (stores the saved ret pointer)   | <- the offset points here
> + * +------------------------------------+
> + * |             (X) | (N)              | ( N words away from
> + * |                                    |   previous ret_stack)
> + *
> + * If a backtrace is required, and the real return pointer needs to be
> + * fetched, then it looks at the task's curr_ret_stack index, if it
> + * is greater than zero (reserved, or right before popped), it would mask
> + * the value by FGRAPH_RET_INDEX_MASK to get the offset index of the
> + * ftrace_ret_stack structure stored on the shadow stack.
> + */
> +
> +#define FGRAPH_RET_INDEX_SIZE	10

Replace SIZE with BITS.

> +#define FGRAPH_RET_INDEX_MASK	GENMASK(FGRAPH_RET_INDEX_SIZE - 1, 0)

 #define FGRAPH_FRAME_SIZE_BITS	10
 #define FGRAPH_FRAME_SIZE_MASK	GENMASK(FGRAPH_FRAME_SIZE_BITS - 1, 0)

> +
> +#define FGRAPH_TYPE_SIZE	2
> +#define FGRAPH_TYPE_MASK	GENMASK(FGRAPH_TYPE_SIZE - 1, 0)

 #define FGRAPH_TYPE_BITS	2
 #define FGRAPH_TYPE_MASK	GENMASK(FGRAPH_TYPE_BITS - 1, 0)

> +#define FGRAPH_TYPE_SHIFT	FGRAPH_RET_INDEX_SIZE
> +
> +enum {
> +	FGRAPH_TYPE_RESERVED	= 0,
> +	FGRAPH_TYPE_BITMAP	= 1,
> +};
> +
> +#define FGRAPH_INDEX_SIZE	16

replace "INDEX" with "OPS" as it will be the indexes of ops in the array.

 #define FGRAPH_OPS_BITS	16
 #define FGRAPH_OPS_MASK	GENMASK(FGRAPH_OPS_BITS - 1, 0)

> +#define FGRAPH_INDEX_MASK	GENMASK(FGRAPH_INDEX_SIZE - 1, 0)
> +#define FGRAPH_INDEX_SHIFT	(FGRAPH_TYPE_SHIFT + FGRAPH_TYPE_SIZE)
> +
> +/* Currently the max stack index can't be more than register callers */
> +#define FGRAPH_MAX_INDEX	(FGRAPH_INDEX_SIZE + FGRAPH_RET_INDEX)

FGRAPH_MAX_INDEX isn't even used. Let's delete it.

> +
> +#define FGRAPH_ARRAY_SIZE	FGRAPH_INDEX_SIZE

 #define FGRAPH_ARRAY_SIZE	FGRAPH_INDEX_BITS

> +
>  #define SHADOW_STACK_SIZE (PAGE_SIZE)
>  #define SHADOW_STACK_INDEX (SHADOW_STACK_SIZE / sizeof(long))
>  /* Leave
Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect
On Fri, 19 Apr 2024 16:00:20 +0800
Jason Xing wrote:

> If other experts see this thread, please help me. I would appreciate
> it. I have strong interests and feel strong responsibility to
> implement something like this patch series. It can be very useful!!

I'm not a networking expert, but as I'm Cc'd and this is about tracing,
I'll jump in to see if I can help. Honestly, reading the thread, it
appears that you and Eric are talking past each other.

I believe Eric is concerned about losing the value of the enum. Enums are
types, and if you typecast them to another type, they lose the previous
type, and all the safety that goes with it.

Now, I do not really understand the problem trying to be solved here. I
understand how TCP works but I never looked into the implementation of
MPTCP. You added this:

 +static inline enum sk_rst_reason convert_mptcp_reason(u32 reason)
 +{
 +	return reason += RST_REASON_START;
 +}

And used it for places like this:

 @@ -309,8 +309,13 @@ static struct dst_entry *subflow_v4_route_req(const struct sock *sk,
  		return dst;
  	dst_release(dst);

 -	if (!req->syncookie)
 -		tcp_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
 +	if (!req->syncookie) {
 +		struct mptcp_ext *mpext = mptcp_get_ext(skb);
 +		enum sk_rst_reason reason;
 +
 +		reason = convert_mptcp_reason(mpext->reset_reason);
 +		tcp_request_sock_ops.send_reset(sk, skb, reason);
 +	}

  	return NULL;
  }

As I don't know this code or how MPTCP works, I do not understand the
above. It used to pass SK_RST_REASON_NOT_SPECIFIED to send_reset(). But
now it takes a "reset_reason" and calls "convert_mptcp_reason()" to get
back an enum value.

If you are mapping the reset_reason to enum sk_rst_reason, why not do it
via a real conversion instead of this fragile arithmetic between the two
values?
 static inline enum sk_rst_reason convert_mptcp_reason(u32 reason)
 {
	switch (reason) {
	case 0: return SK_RST_REASON_MPTCP_RST_EUNSPEC;
	case 1: return SK_RST_REASON_MPTCP_RST_EMPTCP;
	case 2: return SK_RST_REASON_MPTCP_RST_ERESOURCE;
	[..]
	default: return SK_RST_REASON_MAX; // or some other error value
	}
 }

I'm not sure if this is any better, but it's not doing any casting and
it's easier to understand. It's a simple mapping between the reason and
the enum and there's no inherent dependency between the values.

Could possibly create enums for the reason numbers and replace the hard
coded values with them. That way that helper function is at least doing a
real conversion of one type to another.

But like I said from the beginning. I don't understand the details here
and have not spent the time to dig deeper. I just read the thread and I
agree with Eric that the arithmetic conversion of reason to an enum looks
fragile at best and buggy at worst.

-- Steve
Re: [PATCH] ftrace: Replace ftrace_disabled variable with ftrace_is_dead function
On Fri, 19 Apr 2024 22:38:44 +0800
"Bang Li" wrote:

> Use the existing function ftrace_is_dead to replace the variable to make
> the code clearer.
>
> Signed-off-by: Bang Li
> ---
>  kernel/trace/ftrace.c | 46 +++++++++++++++++++++-----------------------
>  1 file changed, 23 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 50ca4d4f8840..4a08c79db677 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -2693,7 +2693,7 @@ void __weak ftrace_replace_code(int mod_flags)
>  	int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL;
>  	int failed;
>
> -	if (unlikely(ftrace_disabled))
> +	if (unlikely(ftrace_is_dead()))
>  		return;
>

NACK!

ftrace_is_dead() is only there to make the static variable
"ftrace_disabled" available for code outside of ftrace.c. In ftrace.c, it
is perfectly fine to use ftrace_disabled.

-- Steve
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Fri, 19 Apr 2024 14:36:18 +0900 Masami Hiramatsu (Google) wrote: > Hi Steve, > > Can you review this series? Especially, [07/36] and [12/36] has been changed > a lot from your original patch. I haven't forgotten (just been a bit hectic). Worse comes to worse, I'll review it tomorrow. -- Steve > > Thank you, > > On Mon, 15 Apr 2024 21:48:59 +0900 > "Masami Hiramatsu (Google)" wrote: > > > Hi, > > > > Here is the 9th version of the series to re-implement the fprobe on > > function-graph tracer. The previous version is; > > > > https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/ > > > > This version is ported on the latest kernel (v6.9-rc3 + probes/for-next) > > and fixed some bugs + performance optimization patch[36/36]. > > - [12/36] Fix to clear fgraph_array entry in registration failure, also > >return -ENOSPC when fgraph_array is full. > > - [28/36] Add new store_fprobe_entry_data() for fprobe. > > - [31/36] Remove DIV_ROUND_UP() and fix entry data address calculation. > > - [36/36] Add new flag to skip timestamp recording. > > > > Overview > > > > This series does major 2 changes, enable multiple function-graphs on > > the ftrace (e.g. allow function-graph on sub instances) and rewrite the > > fprobe on this function-graph. > > > > The former changes had been sent from Steven Rostedt 4 years ago (*), > > which allows users to set different setting function-graph tracer (and > > other tracers based on function-graph) in each trace-instances at the > > same time. > > > > (*) https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/ > > > > The purpose of latter change are; > > > > 1) Remove dependency of the rethook from fprobe so that we can reduce > >the return hook code and shadow stack. > > > > 2) Make 'ftrace_regs' the common trace interface for the function > >boundary. > > > > 1) Currently we have 2(or 3) different function return hook codes, > > the function-graph tracer and rethook (and legacy kretprobe). 
> > But since this is redundant and needs double maintenance cost, > > I would like to unify those. From the user's viewpoint, function- > > graph tracer is very useful to grasp the execution path. For this > > purpose, it is hard to use the rethook in the function-graph > > tracer, but the opposite is possible. (Strictly speaking, kretprobe > > can not use it because it requires 'pt_regs' for historical reasons.) > > > > 2) Now the fprobe provides the 'pt_regs' for its handler, but that is > > wrong for the function entry and exit. Moreover, depending on the > > architecture, there is no way to accurately reproduce 'pt_regs' > > outside of interrupt or exception handlers. This means fprobe should > > not use 'pt_regs' because it does not use such exceptions. > > (Conversely, kprobe should use 'pt_regs' because it is an abstract > > interface of the software breakpoint exception.) > > > > This series changes fprobe to use function-graph tracer for tracing > > function entry and exit, instead of mixture of ftrace and rethook. > > Unlike the rethook which is a per-task list of system-wide allocated > > nodes, the function graph's ret_stack is a per-task shadow stack. > > Thus it does not need to set 'nr_maxactive' (which is the number of > > pre-allocated nodes). > > Also the handlers will get the 'ftrace_regs' instead of 'pt_regs'. > > Since eBPF mulit_kprobe/multi_kretprobe events still use 'pt_regs' as > > their register interface, this changes it to convert 'ftrace_regs' to > > 'pt_regs'. Of course this conversion makes an incomplete 'pt_regs', > > so users must access only registers for function parameters or > > return value. > > > > Design > > -- > > Instead of using ftrace's function entry hook directly, the new fprobe > > is built on top of the function-graph's entry and return callbacks > > with 'ftrace_regs'. 
> > > > Since the fprobe requires access to 'ftrace_regs', the architecture > > must support CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS and > > CONFIG_HAVE_FTRACE_GRAPH_FUNC, which enables to call function-graph > > entry callback with 'ftrace_regs', and also > > CONFIG_HAVE_FUNCTION_GRAPH_FREGS, which passes the ftrace_regs to > > return_to_handler. > > > > All fprobes share a single function-graph ops (means shares a common > > ftrace filter) similar to the kprobe-on-ftrace. This needs another > > layer to find corresponding fprobe in the
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Thu, 18 Apr 2024 09:55:55 +0300 Mike Rapoport wrote: Hi Mike, Thanks for doing this review! > > +/** > > + * struct trace_buffer_meta - Ring-buffer Meta-page description > > + * @meta_page_size:Size of this meta-page. > > + * @meta_struct_len: Size of this structure. > > + * @subbuf_size: Size of each sub-buffer. > > + * @nr_subbufs:Number of subbfs in the ring-buffer, including > > the reader. > > + * @reader.lost_events:Number of events lost at the time of the reader > > swap. > > + * @reader.id: subbuf ID of the current reader. ID range [0 : > > @nr_subbufs - 1] > > + * @reader.read: Number of bytes read on the reader subbuf. > > + * @flags: Placeholder for now, 0 until new features are supported. > > + * @entries: Number of entries in the ring-buffer. > > + * @overrun: Number of entries lost in the ring-buffer. > > + * @read: Number of entries that have been read. > > + * @Reserved1: Reserved for future use. > > + * @Reserved2: Reserved for future use. > > + */ > > +struct trace_buffer_meta { > > + __u32 meta_page_size; > > + __u32 meta_struct_len; > > + > > + __u32 subbuf_size; > > + __u32 nr_subbufs; > > + > > + struct { > > + __u64 lost_events; > > + __u32 id; > > + __u32 read; > > + } reader; > > + > > + __u64 flags; > > + > > + __u64 entries; > > + __u64 overrun; > > + __u64 read; > > + > > + __u64 Reserved1; > > + __u64 Reserved2; > > Why do you need reserved fields? This structure always resides in the > beginning of a page and the rest of the page is essentially "reserved". So this code is also going to be used in arm's pkvm hypervisor code, where it will be using these fields, but since we are looking at keeping the same interface between the two, we don't want these used by this interface. We probably should add a comment about that. 
> > > +}; > > + > > +#endif /* _TRACE_MMAP_H_ */ > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > > index cc9ebe593571..793ecc454039 100644 > > --- a/kernel/trace/ring_buffer.c > > +++ b/kernel/trace/ring_buffer.c > > ... > > > +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer, > > + unsigned long *subbuf_ids) > > +{ > > + struct trace_buffer_meta *meta = cpu_buffer->meta_page; > > + unsigned int nr_subbufs = cpu_buffer->nr_pages + 1; > > + struct buffer_page *first_subbuf, *subbuf; > > + int id = 0; > > + > > + subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page; > > + cpu_buffer->reader_page->id = id++; > > + > > + first_subbuf = subbuf = rb_set_head_page(cpu_buffer); > > + do { > > + if (WARN_ON(id >= nr_subbufs)) > > + break; > > + > > + subbuf_ids[id] = (unsigned long)subbuf->page; > > + subbuf->id = id; > > + > > + rb_inc_page(&subbuf); > > + id++; > > + } while (subbuf != first_subbuf); > > + > > + /* install subbuf ID to kern VA translation */ > > + cpu_buffer->subbuf_ids = subbuf_ids; > > + > > + /* __rb_map_vma() pads the meta-page to align it with the sub-buffers */ > > + meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order; > > Isn't this a single page? One thing we are doing is to make sure that the subbuffers are aligned by their size. If a subbuffer is 3 pages, it should be aligned on 3 page boundaries. This was something that Linus suggested. > > > + meta->meta_struct_len = sizeof(*meta); > > + meta->nr_subbufs = nr_subbufs; > > + meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE; > > + > > + rb_update_meta_page(cpu_buffer); > > +} > > ... > > > +#define subbuf_page(off, start) \ > > + virt_to_page((void *)((start) + ((off) << PAGE_SHIFT))) > > + > > +#define foreach_subbuf_page(sub_order, start, page)\ > > Nit: usually iterators in kernel use for_each Ah, good catch. Yeah, that should be changed. But then ... 
> > > + page = subbuf_page(0, (start)); \ > > + for (int __off = 0; __off < (1 << (sub_order)); \ > > +__off++, page = subbuf_page(__off, (start))) > > The pages are allocated with alloc_pages_node(.. subbuf_order) are > physically contiguous and struct pages for them are also contiguous, so > inside a subbuf_order allocation you can just do page++. > I'm wondering if we should just nuke the macro. It was there because of the previous implementation did things twice. But now it's just done once here: + while (s < nr_subbufs && p < nr_pages) { + struct page *page; + + foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) { + if (p >= nr_pages) + break; + + pages[p++] = page; +
Re: [PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads
On Mon, 15 Apr 2024 18:40:23 +0900 "Masami Hiramatsu (Google)" wrote: > Check the number of probe target symbols in the target module when > the module is loaded. If the probe is not on the unique name symbols > in the module, it will be rejected at that point. > > Note that a symbol which has a unique name in the target module > will be accepted even if there are same-name symbols in the > kernel or other modules. This says what it does, but doesn't explain why it is doing it. What's the purpose of this patch? -- Steve
[PATCH] ASoC: tracing: Export SND_SOC_DAPM_DIR_OUT to its value
The string SND_SOC_DAPM_DIR_OUT is printed in the snd_soc_dapm_path trace event instead of its value: (((REC->path_dir) == SND_SOC_DAPM_DIR_OUT) ? "->" : "<-") User space cannot parse this, as it has no idea what SND_SOC_DAPM_DIR_OUT is. Use TRACE_DEFINE_ENUM() to convert it to its value: (((REC->path_dir) == 1) ? "->" : "<-") So that user space tools, such as perf and trace-cmd, can parse it correctly. Reported-by: Luca Ceresoli Fixes: 6e588a0d839b5 ("ASoC: dapm: Consolidate path trace events") Signed-off-by: Steven Rostedt (Google) --- include/trace/events/asoc.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/trace/events/asoc.h b/include/trace/events/asoc.h index 4eed9028bb11..517015ef36a8 100644 --- a/include/trace/events/asoc.h +++ b/include/trace/events/asoc.h @@ -12,6 +12,8 @@ #define DAPM_DIRECT "(direct)" #define DAPM_ARROW(dir) (((dir) == SND_SOC_DAPM_DIR_OUT) ? "->" : "<-") +TRACE_DEFINE_ENUM(SND_SOC_DAPM_DIR_OUT); + struct snd_soc_jack; struct snd_soc_card; struct snd_soc_dapm_widget; -- 2.43.0
Re: TP_printk() bug with %c, and more?
On Tue, 16 Apr 2024 04:08:46 +0200 Luca Ceresoli wrote: > Thanks for the insight. I'm definitely trying to fix this based on your > hint as soon as I get my hand on a board. I have a patch I forgot to send out. Let me do that now. -- Steve
Re: TP_printk() bug with %c, and more?
On Mon, 18 Mar 2024 16:43:07 +0100 Luca Ceresoli wrote: > However the arrows are still reversed. This requires a kernel change. The problem is that the print fmt has: print fmt: "%c%s %s %s %s %s", (int) REC->path_node && (int) REC->path_connect ? '*' : ' ', __get_str(wname), (((REC->path_dir) == SND_SOC_DAPM_DIR_OUT) ? "->" : "<-"), __get_str(pname), (((REC->path_dir) == SND_SOC_DAPM_DIR_OUT) ? "->" : "<-"), __get_str(pnname) User space (trace-cmd and perf) have no idea what SND_SOC_DAPM_DIR_OUT is. The kernel needs to convert that, otherwise the parsing will fail, or it will default it to zero. -- Steve
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On Sat, 13 Apr 2024 12:53:38 +0200 Peter Zijlstra wrote: > On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote: > > > > Anyway, since we typically run stuff from NMI context, accessing user > > > data is 'interesting'. As such I would really like to make this work > > > depend on the call-graph rework that pushes all the user access bits > > > into return-to-user. > > > > Cool, I assume that's the SFRAME work? Are there pointers to work I > > could look at and think about what a rebase looks like? Or do you have > > someone in mind I should work with for this? > > I've been offline for a little while and still need to catch up with > things myself. > > Josh was working on that when I dropped off IIRC, I'm not entirely sure > where things are at currently (and there is no way I can ever hope to > process the backlog). > > Anybody know where we are with that? It's still very much on my RADAR, but with layoffs and such, my priorities have unfortunately changed. I'm hoping to start helping out in the near future though (in a month or two). Josh was working on it, but I think he got pulled off onto other priorities too :-p -- Steve
Re: [PATCH v2] tracing: Add sched_prepare_exec tracepoint
On Thu, 11 Apr 2024 08:15:05 -0700 Kees Cook wrote: > This looks good to me. If tracing wants to take it: > > Acked-by: Kees Cook > > If not, I can take it in my tree if I get a tracing Ack. :) You can take it. Acked-by: Steven Rostedt (Google) -- Steve
[PATCH v2 11/11] tracing: Update function tracing output for previous boot buffer
From: "Steven Rostedt (Google)" For a persistent ring buffer that is saved across boots, if function tracing was performed in the previous boot, it only saves the address of the functions and uses "%pS" to print their names. But the current boot, those functions may be in different locations. The persistent meta-data saves the text delta between the two boots and can be used to find the address of the saved function of where it is located in the current boot. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_output.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index d8b302d01083..b9d2c64c0648 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator *iter, int flags, } static void print_fn_trace(struct trace_seq *s, unsigned long ip, - unsigned long parent_ip, int flags) + unsigned long parent_ip, long delta, int flags) { + ip += delta; + parent_ip += delta; + seq_print_ip_sym(s, ip, flags); if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) { @@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_putc(s, '\n'); return trace_handle_return(s); @@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_printf(s, " (repeats: %u, last_ts:", field->count); trace_print_time(s, iter, iter->ts - FUNC_REPEATS_GET_DELTA_TS(field)); -- 2.43.0
[PATCH v2 10/11] tracing: Handle old buffer mappings for event strings and functions
From: "Steven Rostedt (Google)" Use the saved text_delta and data_delta of a persistent memory mapped ring buffer that was saved from a previous boot, and use the delta in the trace event print output so that strings and functions show up normally. That is, for an event like trace_kmalloc() that prints the callsite via "%pS", if it used the address saved in the ring buffer it will not match the function that was saved in the previous boot if the kernel remaps itself between boots. For RCU events that point to saved static strings where only the address of the string is saved in the ring buffer, it too will be adjusted to point to where the string is on the current boot. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 42 +++--- 1 file changed, 39 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 33ef5311fa39..bac5db67cf15 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -3671,8 +3671,11 @@ static void test_can_verify(void) void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, va_list ap) { + long text_delta = iter->tr->text_delta; + long data_delta = iter->tr->data_delta; const char *p = fmt; const char *str; + bool good; int i, j; if (WARN_ON_ONCE(!fmt)) @@ -3691,7 +3694,10 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, j = 0; - /* We only care about %s and variants */ + /* +* We only care about %s and variants +* as well as %p[sS] if delta is non-zero +*/ for (i = 0; p[i]; i++) { if (i + 1 >= iter->fmt_size) { /* @@ -3720,6 +3726,11 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, } if (p[i+j] == 's') break; + + if (text_delta && p[i+1] == 'p' && + ((p[i+2] == 's' || p[i+2] == 'S'))) + break; + star = false; } j = 0; @@ -3733,6 +3744,24 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, iter->fmt[i] = '\0'; trace_seq_vprintf(&iter->seq, iter->fmt, ap); + /* Add delta to %pS pointers */ + if (p[i+1] == 
'p') { + unsigned long addr; + char fmt[4]; + + fmt[0] = '%'; + fmt[1] = 'p'; + fmt[2] = p[i+2]; /* Either %ps or %pS */ + fmt[3] = '\0'; + + addr = va_arg(ap, unsigned long); + addr += text_delta; + trace_seq_printf(&iter->seq, fmt, (void *)addr); + + p += i + 3; + continue; + } + /* * If iter->seq is full, the above call no longer guarantees * that ap is in sync with fmt processing, and further calls @@ -3751,6 +3780,14 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, /* The ap now points to the string data of the %s */ str = va_arg(ap, const char *); + good = trace_safe_str(iter, str, star, len); + + /* Could be from the last boot */ + if (data_delta && !good) { + str += data_delta; + good = trace_safe_str(iter, str, star, len); + } + /* * If you hit this warning, it is likely that the * trace event in question used %s on a string that @@ -3760,8 +3797,7 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, * instead. See samples/trace_events/trace-events-sample.h * for reference. */ - if (WARN_ONCE(!trace_safe_str(iter, str, star, len), - "fmt: '%s' current_buffer: '%s'", + if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'", fmt, seq_buf_str(&iter->seq.seq))) { int ret; -- 2.43.0
[PATCH v2 09/11] tracing/ring-buffer: Add last_boot_info file to boot instance
From: "Steven Rostedt (Google)" If an instance is mapped to memory on boot up, create a new file called "last_boot_info" that will hold information that can be used to properly parse the raw data in the ring buffer. It will export the delta of the addresses for text and data from what it was from the last boot. It does not expose actually addresses (unless you knew what the actual address was from the last boot). The output will look like: # cat last_boot_info text delta:-268435456 data delta:-268435456 The text and data are kept separate in case they are ever made different. Signed-off-by: Steven Rostedt (Google) --- include/linux/ring_buffer.h | 3 +++ kernel/trace/ring_buffer.c | 23 ++ kernel/trace/trace.c| 47 - kernel/trace/trace.h| 2 ++ 4 files changed, 74 insertions(+), 1 deletion(-) diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index a50b0223b1d3..55de3798a9b9 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -94,6 +94,9 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flag unsigned long range_size, struct lock_class_key *key); +bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text, +long *data); + /* * Because the ring buffer is generic, if other users of the ring buffer get * traced by ftrace, it can produce lockdep warnings. 
We need to keep each diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index c68695ae2749..e189f500ac32 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -2390,6 +2390,29 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flag return alloc_buffer(size, flags, order, start, start + range_size, key); } +/** + * ring_buffer_last_boot_delta - return the delta offset from last boot + * @buffer: The buffer to return the delta from + * @text: Return text delta + * @data: Return data delta + * + * Returns: The true if the delta is non zero + */ +bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text, +long *data) +{ + if (!buffer) + return false; + + if (!buffer->last_text_delta) + return false; + + *text = buffer->last_text_delta; + *data = buffer->last_data_delta; + + return true; +} + /** * ring_buffer_free - free a ring buffer. * @buffer: the buffer to free. diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index b2b16be6e4ec..33ef5311fa39 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -6041,6 +6041,18 @@ ssize_t tracing_resize_ring_buffer(struct trace_array *tr, return ret; } +static void update_last_data(struct trace_array *tr) +{ + if (!tr->text_delta && !tr->data_delta) + return; + + /* Clear old data */ + tracing_reset_online_cpus(&tr->array_buffer); + + /* Using current data now */ + tr->text_delta = 0; + tr->data_delta = 0; +} /** * tracing_update_buffers - used by tracing facility to expand ring buffers @@ -6058,6 +6070,9 @@ int tracing_update_buffers(struct trace_array *tr) int ret = 0; mutex_lock(&trace_types_lock); + + update_last_data(tr); + if (!tr->ring_buffer_expanded) ret = __tracing_resize_ring_buffer(tr, trace_buf_size, RING_BUFFER_ALL_CPUS); @@ -6113,6 +6128,8 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf) mutex_lock(&trace_types_lock); + update_last_data(tr); + if (!tr->ring_buffer_expanded) { ret = __tracing_resize_ring_buffer(tr, 
trace_buf_size, RING_BUFFER_ALL_CPUS); @@ -6860,6 +6877,21 @@ tracing_total_entries_read(struct file *filp, char __user *ubuf, return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); } +static ssize_t +tracing_last_boot_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos) +{ + struct trace_array *tr = filp->private_data; + struct seq_buf seq; + char buf[64]; + + seq_buf_init(&seq, buf, 64); + + seq_buf_printf(&seq, "text delta:\t%ld\n", tr->text_delta); + seq_buf_printf(&seq, "data delta:\t%ld\n", tr->data_delta); + + return simple_read_from_buffer(ubuf, cnt, ppos, buf, seq_buf_used(&seq)); +} + static int tracing_buffer_meta_open(struct inode *inode, struct file *filp) { struct trace_array *tr = inode->i_private; @@ -7499,6 +7531,13 @@ static const struct file_operations trace_time_stamp_mode_fops = { .release= tracing_single_release_tr, }; +static const s
[PATCH v2 08/11] ring-buffer: Save text and data locations in mapped meta data
From: "Steven Rostedt (Google)" When a ring buffer is mapped to a specific address, save the address of a text function and some data. This will be used to determine the delta between the last boot and the current boot for pointers to functions as well as to data. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 16 1 file changed, 16 insertions(+) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 6f19e7612472..c68695ae2749 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -45,6 +45,8 @@ static void update_pages_handler(struct work_struct *work); struct ring_buffer_meta { + unsigned long text_addr; + unsigned long data_addr; unsigned long first_buffer; unsigned long head_buffer; unsigned long commit_buffer; @@ -539,6 +541,9 @@ struct trace_buffer { unsigned long range_addr_start; unsigned long range_addr_end; + longlast_text_delta; + longlast_data_delta; + unsigned intsubbuf_size; unsigned intsubbuf_order; unsigned intmax_data_size; @@ -1827,10 +1832,15 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer) } } +/* Used to calculate data delta */ +static char rb_data_ptr[] __initdata = ""; + static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages) { struct ring_buffer_meta *meta; unsigned long delta; + unsigned long this_text = (unsigned long)rb_range_meta_init; + unsigned long this_data = (unsigned long)rb_data_ptr; void *subbuf; int cpu; int i; @@ -1847,6 +1857,10 @@ static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages) meta->first_buffer += delta; meta->head_buffer += delta; meta->commit_buffer += delta; + buffer->last_text_delta = this_text - meta->text_addr; + buffer->last_data_delta = this_data - meta->data_addr; + meta->text_addr = this_text; + meta->data_addr = this_data; continue; } @@ -1863,6 +1877,8 @@ static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages) subbuf = rb_subbufs_from_meta(meta); 
meta->first_buffer = (unsigned long)subbuf; + meta->text_addr = this_text; + meta->data_addr = this_data; /* * The buffers[] array holds the order of the sub-buffers -- 2.43.0
[PATCH v2 07/11] ring-buffer: Validate boot range memory events
From: "Steven Rostedt (Google)" Make sure all the events in each of the sub-buffers that were mapped in a memory region are valid. This moves the code that walks the buffers for time-stamp validation out of the CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS ifdef block and is used to validate the content. Only the ring buffer event meta data is checked and not the data load. This also has a second purpose. The buffer_page structure that points to the data sub-buffers has accounting that keeps track of the number of events that are on the sub-buffer. This updates that counter as well. That counter is used in reading the buffer and knowing if the ring buffer is empty or not. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 190 - 1 file changed, 166 insertions(+), 24 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index d05b60fd5695..6f19e7612472 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1663,10 +1663,170 @@ static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu, subbuf = (void *)subbuf + subbuf_size; } - pr_info("Ring buffer meta is from previous boot!\n"); return true; } +static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf); + +#ifdef CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS +static DEFINE_PER_CPU(atomic_t, checking); +static atomic_t ts_dump; + +#define buffer_warn_return(fmt, ...) \ + do {\ + /* If another report is happening, ignore this one */ \ + if (atomic_inc_return(&ts_dump) != 1) { \ + atomic_dec(&ts_dump); \ + goto out; \ + } \ + atomic_inc(&cpu_buffer->record_disabled); \ + pr_warn(fmt, ##__VA_ARGS__);\ + dump_buffer_page(bpage, info, tail);\ + atomic_dec(&ts_dump); \ + /* There's some cases in boot up that this can happen */ \ + if (WARN_ON_ONCE(system_state != SYSTEM_BOOTING)) \ + /* Do not re-enable checking */ \ + return; \ + } while (0) +#else +#define buffer_warn_return(fmt, ...) 
do { } while (0) +#endif + +static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu, + unsigned long long *timestamp, bool warn) +{ + struct ring_buffer_event *event; + u64 ts, delta; + int events = 0; + int e; + + ts = dpage->time_stamp; + + for (e = 0; e < tail; e += rb_event_length(event)) { + + event = (struct ring_buffer_event *)(dpage->data + e); + + switch (event->type_len) { + + case RINGBUF_TYPE_TIME_EXTEND: + delta = rb_event_time_stamp(event); + ts += delta; + break; + + case RINGBUF_TYPE_TIME_STAMP: + delta = rb_event_time_stamp(event); + if (warn && delta < ts) { + buffer_warn_return("[CPU: %d]ABSOLUTE TIME WENT BACKWARDS: last ts: %lld absolute ts: %lld\n", + cpu, ts, delta); + } + ts = delta; + break; + + case RINGBUF_TYPE_PADDING: + if (event->time_delta == 1) + break; + fallthrough; + case RINGBUF_TYPE_DATA: + events++; + ts += event->time_delta; + break; + + default: + return -1; + } + } + *timestamp = ts; + return events; + } + +static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu) +{ + unsigned long long ts; + int tail; + + tail = local_read(&dpage->commit); + return rb_read_data_buffer(dpage, tail, cpu, &ts, false); +} + +/* If the meta data has been validated, now validate the events */ +static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer) +{ + struct ring_buffer_meta *meta = cpu_buffer->ring_meta; + struct buffer_page *head_page; + unsigned long entry_bytes = 0; + unsigned long entries = 0; + int ret; + int i; + + if (!meta || !meta->head_buffer) + return; + + /* Do the reader page
[PATCH v2 06/11] ring-buffer: Add test if range of boot buffer is valid
From: "Steven Rostedt (Google)" Add a test against the ring buffer memory range to see if it has valid data. The ring_buffer_meta structure is given a new field called "first_buffer" which holds the address of the first sub-buffer. This is used to both determine if the other fields are valid as well as finding the offset between the old addresses of the sub-buffer from the previous boot to the new addresses of the current boot. Since the values for nr_subbufs and subbuf_size is to be the same, check if the values in the meta page match the values calculated. Take the range of the first_buffer and the total size of all the buffers and make sure the saved head_buffer and commit_buffer fall in the range. Iterate through all the sub-buffers to make sure that the values in the sub-buffer "commit" field (the field that holds the amount of data on the sub-buffer) is within the end of the sub-buffer. Also check the index array to make sure that all the indexes are within nr_subbufs. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 143 ++--- 1 file changed, 135 insertions(+), 8 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index ee3a5c6966e2..d05b60fd5695 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -45,6 +45,7 @@ static void update_pages_handler(struct work_struct *work); struct ring_buffer_meta { + unsigned long first_buffer; unsigned long head_buffer; unsigned long commit_buffer; __u32 subbuf_size; @@ -1606,21 +1607,103 @@ static void *rb_range_buffer(struct ring_buffer_per_cpu *cpu_buffer, int idx) return (void *)ptr; } +/* + * See if the existing memory contains valid ring buffer data. + * As the previous kernel must be the same as this kernel, all + * the calculations (size of buffers and number of buffers) + * must be the same. 
+ */ +static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu, + struct trace_buffer *buffer, int nr_pages) +{ + int subbuf_size = PAGE_SIZE; + struct buffer_data_page *subbuf; + unsigned long buffers_start; + unsigned long buffers_end; + int i; + + /* The subbuffer's size and number of subbuffers must match */ + if (meta->subbuf_size != subbuf_size || + meta->nr_subbufs != nr_pages + 1) { + pr_info("Ring buffer boot meta [%d] mismatch of subbuf_size/nr_pages\n", cpu); + return false; + } + + buffers_start = meta->first_buffer; + buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs); + + /* Is the head and commit buffers within the range of buffers? */ + if (meta->head_buffer < buffers_start || + meta->head_buffer >= buffers_end) { + pr_info("Ring buffer boot meta [%d] head buffer out of range\n", cpu); + return false; + } + + if (meta->commit_buffer < buffers_start || + meta->commit_buffer >= buffers_end) { + pr_info("Ring buffer boot meta [%d] commit buffer out of range\n", cpu); + return false; + } + + subbuf = rb_subbufs_from_meta(meta); + + /* Is the meta buffers and the subbufs themselves have correct data? 
*/ + for (i = 0; i < meta->nr_subbufs; i++) { + if (meta->buffers[i] < 0 || + meta->buffers[i] >= meta->nr_subbufs) { + pr_info("Ring buffer boot meta [%d] array out of range\n", cpu); + return false; + } + + if ((unsigned)local_read(&subbuf->commit) > subbuf_size) { + pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu); + return false; + } + + subbuf = (void *)subbuf + subbuf_size; + } + + pr_info("Ring buffer meta is from previous boot!\n"); + return true; +} + static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages) { struct ring_buffer_meta *meta; + unsigned long delta; void *subbuf; int cpu; int i; for (cpu = 0; cpu < nr_cpu_ids; cpu++) { + void *next_meta; + meta = rb_range_meta(buffer, nr_pages, cpu); + if (rb_meta_valid(meta, cpu, buffer, nr_pages)) { + /* Make the mappings match the current address */ + subbuf = rb_subbufs_from_meta(meta); + delta = (unsigned long)subbuf - meta->first_buffer; + meta->first_buffer += delta; + meta->head_buffer += delta; + meta->commit_buffer += delta;
[PATCH v2 05/11] ring-buffer: Add output of ring buffer meta page
From: "Steven Rostedt (Google)" Add a buffer_meta per-cpu file for the trace instance that is mapped to boot memory. This shows the current meta-data and can be used by user space tools to record off the current mappings to help reconstruct the ring buffer after a reboot. It does not expose any virtual addresses, just indexes into the sub-buffer pages. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 77 ++ kernel/trace/trace.c | 30 ++- kernel/trace/trace.h | 2 + 3 files changed, 107 insertions(+), 2 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index db8762faef77..ee3a5c6966e2 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -32,6 +32,8 @@ #include #include +#include "trace.h" + /* * The "absolute" timestamp in the buffer is only 59 bits. * If a clock has the 5 MSBs set, it needs to be saved and @@ -1635,6 +1637,81 @@ static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages) } } +static void *rbm_start(struct seq_file *m, loff_t *pos) +{ + struct ring_buffer_per_cpu *cpu_buffer = m->private; + struct ring_buffer_meta *meta = cpu_buffer->ring_meta; + unsigned long val; + + if (!meta) + return NULL; + + if (*pos > meta->nr_subbufs) + return NULL; + + val = *pos; + val++; + + return (void *)val; +} + +static void *rbm_next(struct seq_file *m, void *v, loff_t *pos) +{ + (*pos)++; + + return rbm_start(m, pos); +} + +static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf); + +static int rbm_show(struct seq_file *m, void *v) +{ + struct ring_buffer_per_cpu *cpu_buffer = m->private; + struct ring_buffer_meta *meta = cpu_buffer->ring_meta; + unsigned long val = (unsigned long)v; + + if (val == 1) { + seq_printf(m, "head_buffer: %d\n", + rb_meta_subbuf_idx(meta, (void *)meta->head_buffer)); + seq_printf(m, "commit_buffer: %d\n", + rb_meta_subbuf_idx(meta, (void *)meta->commit_buffer)); + seq_printf(m, "subbuf_size: %d\n", meta->subbuf_size); + 
seq_printf(m, "nr_subbufs:%d\n", meta->nr_subbufs); + return 0; + } + + val -= 2; + seq_printf(m, "buffer[%ld]:%d\n", val, meta->buffers[val]); + + return 0; +} + +static void rbm_stop(struct seq_file *m, void *p) +{ +} + +static const struct seq_operations rb_meta_seq_ops = { + .start = rbm_start, + .next = rbm_next, + .show = rbm_show, + .stop = rbm_stop, +}; + +int ring_buffer_meta_seq_init(struct file *file, struct trace_buffer *buffer, int cpu) +{ + struct seq_file *m; + int ret; + + ret = seq_open(file, &rb_meta_seq_ops); + if (ret) + return ret; + + m = file->private_data; + m->private = buffer->buffers[cpu]; + + return 0; +} + static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, long nr_pages, struct list_head *pages) { diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 31067de977fc..b2b16be6e4ec 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5018,7 +5018,7 @@ static int show_traces_open(struct inode *inode, struct file *file) return 0; } -static int show_traces_release(struct inode *inode, struct file *file) +static int tracing_seq_release(struct inode *inode, struct file *file) { struct trace_array *tr = inode->i_private; @@ -5059,7 +5059,7 @@ static const struct file_operations show_traces_fops = { .open = show_traces_open, .read = seq_read, .llseek = seq_lseek, - .release= show_traces_release, + .release= tracing_seq_release, }; static ssize_t @@ -6860,6 +6860,22 @@ tracing_total_entries_read(struct file *filp, char __user *ubuf, return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); } +static int tracing_buffer_meta_open(struct inode *inode, struct file *filp) +{ + struct trace_array *tr = inode->i_private; + int cpu = tracing_get_cpu(inode); + int ret; + + ret = tracing_check_open_get_tr(tr); + if (ret) + return ret; + + ret = ring_buffer_meta_seq_init(filp, tr->array_buffer.buffer, cpu); + if (ret < 0) + __trace_array_put(tr); + return ret; +} + static ssize_t tracing_free_buffer_write(struct file *filp, 
const char __user *ubuf, size_t cnt, loff_t *ppos) @@ -7436,6 +7452,13 @@ static const struct file_operations tracing_entries_fops = {
[PATCH v2 04/11] tracing: Implement creating an instance based on a given memory region
From: "Steven Rostedt (Google)" Allow for creating a new instance by passing in an address and size to map the ring buffer for the instance to. This will allow features like a pstore memory mapped region to be used for a tracing instance ring buffer that can be retrieved from one boot to the next. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 50 +++- kernel/trace/trace.h | 3 +++ 2 files changed, 43 insertions(+), 10 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 83194bf7b1df..31067de977fc 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4921,6 +4921,11 @@ static int tracing_open(struct inode *inode, struct file *file) static bool trace_ok_for_array(struct tracer *t, struct trace_array *tr) { +#ifdef CONFIG_TRACER_SNAPSHOT + /* arrays with mapped buffer range do not have snapshots */ + if (tr->range_addr_start && t->use_max_tr) + return false; +#endif return (tr->flags & TRACE_ARRAY_FL_GLOBAL) || t->allow_instances; } @@ -8673,11 +8678,13 @@ tracing_init_tracefs_percpu(struct trace_array *tr, long cpu) tr, cpu, _entries_fops); #ifdef CONFIG_TRACER_SNAPSHOT - trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu, - tr, cpu, _fops); + if (!tr->range_addr_start) { + trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu, + tr, cpu, _fops); - trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu, - tr, cpu, _raw_fops); + trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu, + tr, cpu, _raw_fops); + } #endif } @@ -9214,7 +9221,18 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size buf->tr = tr; - buf->buffer = ring_buffer_alloc(size, rb_flags); + if (tr->range_addr_start && tr->range_addr_size) { + buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0, + tr->range_addr_start, + tr->range_addr_size); + /* +* This is basically the same as a mapped buffer, +* with the same restrictions. 
+*/ + tr->mapped++; + } else { + buf->buffer = ring_buffer_alloc(size, rb_flags); + } if (!buf->buffer) return -ENOMEM; @@ -9251,6 +9269,10 @@ static int allocate_trace_buffers(struct trace_array *tr, int size) return ret; #ifdef CONFIG_TRACER_MAX_TRACE + /* Fix mapped buffer trace arrays do not have snapshot buffers */ + if (tr->range_addr_start) + return 0; + ret = allocate_trace_buffer(tr, >max_buffer, allocate_snapshot ? size : 1); if (MEM_FAIL(ret, "Failed to allocate trace buffer\n")) { @@ -9351,7 +9373,9 @@ static int trace_array_create_dir(struct trace_array *tr) } static struct trace_array * -trace_array_create_systems(const char *name, const char *systems) +trace_array_create_systems(const char *name, const char *systems, + unsigned long range_addr_start, + unsigned long range_addr_size) { struct trace_array *tr; int ret; @@ -9377,6 +9401,10 @@ trace_array_create_systems(const char *name, const char *systems) goto out_free_tr; } + /* Only for boot up memory mapped ring buffers */ + tr->range_addr_start = range_addr_start; + tr->range_addr_size = range_addr_size; + tr->trace_flags = global_trace.trace_flags & ~ZEROED_TRACE_FLAGS; cpumask_copy(tr->tracing_cpumask, cpu_all_mask); @@ -9434,7 +9462,7 @@ trace_array_create_systems(const char *name, const char *systems) static struct trace_array *trace_array_create(const char *name) { - return trace_array_create_systems(name, NULL); + return trace_array_create_systems(name, NULL, 0, 0); } static int instance_mkdir(const char *name) @@ -9488,7 +9516,7 @@ struct trace_array *trace_array_get_by_name(const char *name, const char *system goto out_unlock; } - tr = trace_array_create_systems(name, systems); + tr = trace_array_create_systems(name, systems, 0, 0); if (IS_ERR(tr)) tr = NULL; @@ -9681,8 +9709,10 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer) MEM_FAIL(1, "Could not allocate function filter files"); #ifdef CONFIG_TRACER_SNAPSHOT - trace_cre
[PATCH v2 03/11] ring-buffer: Add ring_buffer_meta data
From: "Steven Rostedt (Google)" Populate the ring_buffer_meta array. It holds the pointer to the head_buffer (next to read), the commit_buffer (next to write) the size of the sub-buffers, number of sub-buffers and an array that keeps track of the order of the sub-buffers. This information will be stored in the persistent memory to help on reboot to reconstruct the ring buffer. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 208 - 1 file changed, 183 insertions(+), 25 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index a6d22a77f128..db8762faef77 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -43,6 +43,11 @@ static void update_pages_handler(struct work_struct *work); struct ring_buffer_meta { + unsigned long head_buffer; + unsigned long commit_buffer; + __u32 subbuf_size; + __u32 nr_subbufs; + int buffers[]; }; /* @@ -498,6 +503,7 @@ struct ring_buffer_per_cpu { struct mutexmapping_lock; unsigned long *subbuf_ids;/* ID to subbuf VA */ struct trace_buffer_meta*meta_page; + struct ring_buffer_meta *ring_meta; /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; @@ -1258,6 +1264,11 @@ static void rb_head_page_activate(struct ring_buffer_per_cpu *cpu_buffer) * Set the previous list pointer to have the HEAD flag. */ rb_set_list_to_head(head->list.prev); + + if (cpu_buffer->ring_meta) { + struct ring_buffer_meta *meta = cpu_buffer->ring_meta; + meta->head_buffer = (unsigned long)head->page; + } } static void rb_list_head_clear(struct list_head *list) @@ -1505,50 +1516,125 @@ rb_range_align_subbuf(unsigned long addr, int subbuf_size, int nr_subbufs) } /* - * Return a specific sub-buffer for a given @cpu defined by @idx. + * Return the ring_buffer_meta for a given @cpu. 
*/ -static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int nr_pages, int idx) +static void *rb_range_meta(struct trace_buffer *buffer, int nr_pages, int cpu) { - unsigned long ptr; int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE; + unsigned long ptr = buffer->range_addr_start; + struct ring_buffer_meta *meta; int nr_subbufs; - /* Include the reader page */ - nr_subbufs = nr_pages + 1; + if (!ptr) + return NULL; + + /* When nr_pages passed in is zero, the first meta has already been initialized */ + if (!nr_pages) { + meta = (struct ring_buffer_meta *)ptr; + nr_subbufs = meta->nr_subbufs; + } else { + meta = NULL; + /* Include the reader page */ + nr_subbufs = nr_pages + 1; + } /* * The first chunk may not be subbuffer aligned, where as * the rest of the chunks are. */ - ptr = buffer->range_addr_start; - ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs); if (cpu) { - unsigned long p; - - ptr += subbuf_size * nr_subbufs; - - /* Save the beginning of this CPU chunk */ - p = ptr; - ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs); + ptr += subbuf_size * nr_subbufs; if (cpu > 1) { unsigned long size; + unsigned long p; + /* Save the beginning of this CPU chunk */ + p = ptr; + ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs); ptr += subbuf_size * nr_subbufs; /* Now all chunks after this are the same size */ size = ptr - p; ptr += size * (cpu - 2); - - ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs); } } - if (ptr + subbuf_size * nr_subbufs > buffer->range_addr_end) + return (void *)ptr; +} + +static void *rb_subbufs_from_meta(struct ring_buffer_meta *meta) +{ + int subbuf_size = meta->subbuf_size; + unsigned long ptr; + + ptr = (unsigned long)meta; + ptr = rb_range_align_subbuf(ptr, subbuf_size, meta->nr_subbufs); + + return (void *)ptr; +} + +/* + * Return a specific sub-buffer for a given @cpu defined by @idx. 
+ */ +static void *rb_range_buffer(struct ring_buffer_per_cpu *cpu_buffer, int idx) +{ + struct ring_buffer_meta *meta; + unsigned long ptr; + int subbuf_size; + + meta = rb_range_meta(cpu_buffer->buffer, 0, cpu_buffer->cpu); + if (!m
[PATCH v2 00/11] tracing: Persistent traces across a reboot or crash
This is a way to map a ring buffer instance across reboots. The requirement is that you have a memory region that is not erased. I tested this on a Debian VM running on qemu on a Debian server, and even tested it on a baremetal box running Fedora. I was surprised that it worked on the baremetal box, but it does so consistently.

I changed things up since the POC version (this is no longer a proof of concept, but now a serious patch set): https://lore.kernel.org/linux-trace-kernel/20240306015910.766510...@goodmis.org/

I got rid of the variables that needed to be set: trace_buffer_start and trace_buffer_size. The allocation of the mapping requires a new interface (not in this patch set yet, as it will be added for pstore or the wildcard memmap that is under RFC). That is, either a new patch will be added to connect pstore, or if the new wildcard memmap option is accepted, it can be created with:

  memmap=12M*4096:trace trace_instance=boot_mapped@trace

But that is to come later. If you want to test this code, I have the command line version available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
  branch: trace/ring-buffer-prealloc

The name of the instance is no longer defined by the kernel but by the calling code. The memory reserved is used by the ring buffer of this instance. It acts like a memory mapped instance, so it has some limitations: it does not allow snapshots, nor does it allow tracers which use a snapshot buffer (like the irqsoff and wakeup tracers).

On boot up, when setting up the ring buffer, it looks at the current content and does a vigorous test to see if the content is valid. It even walks the events in all the sub-buffers to make sure the ring buffer meta data is correct. If it determines that the content is valid, it will reconstruct the ring buffer to use the content it has found. If the buffer is valid, on the next boot, the boot_mapped instance will contain the data from the previous boot. 
You can cat the trace or trace_pipe file, or even run trace-cmd extract on it to make a trace.dat file that holds the data. This is much better than dealing with an ftrace_dump_on_oops (I wish I had this a decade ago!)

There are still some limitations of this buffer. One is that it assumes that the kernel you are booting back into is the same one that crashed, or at least one where the trace_events (like sched_switch and friends) all have the same ids. This would be true with the same kernel, as the ids are determined at link time. Module events could possibly be a problem, as the ids may not match.

This version of the patch series saves a text function and a data string address in the persistent memory, and this is used to calculate the delta between the text and data addresses of the new boot up. Now function tracing and "%pS" still work across boots. Even the RCU trace events that point to static strings work as well! The delta is exported by a new file in the instance called "last_boot_info" that has something like this:

  # cat last_boot_info
  text delta:-268435456
  data delta:-268435456

This can be used by trace-cmd, which reads the trace_pipe_raw data and can now figure out how to map the print_formats and kallsyms to the raw data in the buffers.

This is no longer a POC as it seems to be working as expected. This is based on top of Vincent Donnefort's ring buffer user space mapping code: https://lore.kernel.org/linux-trace-kernel/20240406173649.3210836-1-vdonnef...@google.com/

Enjoy... 
Steven Rostedt (Google) (11): ring-buffer: Allow mapped field to be set without mapping ring-buffer: Add ring_buffer_alloc_range() ring-buffer: Add ring_buffer_meta data tracing: Implement creating an instance based on a given memory region ring-buffer: Add output of ring buffer meta page ring-buffer: Add test if range of boot buffer is valid ring-buffer: Validate boot range memory events ring-buffer: Save text and data locations in mapped meta data tracing/ring-buffer: Add last_boot_info file to boot instance tracing: Handle old buffer mappings for event strings and functions tracing: Update function tracing output for previous boot buffer include/linux/ring_buffer.h | 20 ++ kernel/trace/ring_buffer.c | 837 kernel/trace/trace.c| 167 - kernel/trace/trace.h| 7 + kernel/trace/trace_output.c | 9 +- 5 files changed, 953 insertions(+), 87 deletions(-)
[PATCH v2 01/11] ring-buffer: Allow mapped field to be set without mapping
From: "Steven Rostedt (Google)" In preparation for having the ring buffer mapped to a dedicated location, which will have the same restrictions as user space memory mapped buffers, allow it to use the "mapped" field of the ring_buffer_per_cpu structure without having the user space meta page mapping. When this starts using the mapped field, it will need to handle adding a user space mapping (and removing it) from a ring buffer that is using a dedicated memory range. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 793ecc454039..44b1d5f1a99a 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -5223,6 +5223,9 @@ static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer) { struct trace_buffer_meta *meta = cpu_buffer->meta_page; + if (!meta) + return; + meta->reader.read = cpu_buffer->reader_page->read; meta->reader.id = cpu_buffer->reader_page->id; meta->reader.lost_events = cpu_buffer->lost_events; @@ -6167,7 +6170,7 @@ rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu) mutex_lock(_buffer->mapping_lock); - if (!cpu_buffer->mapped) { + if (!cpu_buffer->mapped || !cpu_buffer->meta_page) { mutex_unlock(_buffer->mapping_lock); return ERR_PTR(-ENODEV); } @@ -6345,12 +6348,13 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu, */ raw_spin_lock_irqsave(_buffer->reader_lock, flags); rb_setup_ids_meta_page(cpu_buffer, subbuf_ids); + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); err = __rb_map_vma(cpu_buffer, vma); if (!err) { raw_spin_lock_irqsave(_buffer->reader_lock, flags); - cpu_buffer->mapped = 1; + cpu_buffer->mapped++; raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); } else { kfree(cpu_buffer->subbuf_ids); @@ -6388,7 +6392,7 @@ int ring_buffer_unmap(struct trace_buffer *buffer, int cpu) mutex_lock(>mutex); 
raw_spin_lock_irqsave(_buffer->reader_lock, flags); - cpu_buffer->mapped = 0; + cpu_buffer->mapped--; raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); -- 2.43.0
[PATCH v2 02/11] ring-buffer: Add ring_buffer_alloc_range()
From: "Steven Rostedt (Google)" In preparation for allowing the trace ring buffer to be allocated in a range of memory that is persistent across reboots, add ring_buffer_alloc_range(). It takes a contiguous range of memory and will split it up evenly for the per CPU ring buffers. If there's not enough memory to handle all CPUs with the minimum size, it will fail to allocate the ring buffer. Signed-off-by: Steven Rostedt (Google) --- include/linux/ring_buffer.h | 17 +++ kernel/trace/ring_buffer.c | 222 ++-- 2 files changed, 204 insertions(+), 35 deletions(-) diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index 96d2140b471e..a50b0223b1d3 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -89,6 +89,11 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer, struct trace_buffer * __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key); +struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flags, + int order, unsigned long start, + unsigned long range_size, + struct lock_class_key *key); + /* * Because the ring buffer is generic, if other users of the ring buffer get * traced by ftrace, it can produce lockdep warnings. We need to keep each @@ -100,6 +105,18 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k __ring_buffer_alloc((size), (flags), &__key); \ }) +/* + * Because the ring buffer is generic, if other users of the ring buffer get + * traced by ftrace, it can produce lockdep warnings. We need to keep each + * ring buffer's lock class separate. 
+ */ +#define ring_buffer_alloc_range(size, flags, order, start, range_size) \ +({ \ + static struct lock_class_key __key; \ + __ring_buffer_alloc_range((size), (flags), (order), (start),\ + (range_size), &__key);\ +}) + typedef bool (*ring_buffer_cond_fn)(void *data); int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full, ring_buffer_cond_fn cond, void *data); diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 44b1d5f1a99a..a6d22a77f128 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -42,6 +42,9 @@ static void update_pages_handler(struct work_struct *work); +struct ring_buffer_meta { +}; + /* * The ring buffer header is special. We must manually up keep it. */ @@ -340,7 +343,8 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ - u32 id;/* ID for external mapping */ + u32 id:30; /* ID for external mapping */ + u32 range:1; /* Mapped via a range */ struct buffer_data_page *page; /* Actual data page */ }; @@ -371,7 +375,9 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage) static void free_buffer_page(struct buffer_page *bpage) { - free_pages((unsigned long)bpage->page, bpage->order); + /* Range pages are not to be freed */ + if (!bpage->range) + free_pages((unsigned long)bpage->page, bpage->order); kfree(bpage); } @@ -521,6 +527,9 @@ struct trace_buffer { struct rb_irq_work irq_work; booltime_stamp_abs; + unsigned long range_addr_start; + unsigned long range_addr_end; + unsigned intsubbuf_size; unsigned intsubbuf_order; unsigned intmax_data_size; @@ -1483,9 +1492,67 @@ static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer) } } +/* + * Take an address, add the meta data size as well as the array of + * array subbuffer indexes, then align it to a subbuffer size. 
+ */ +static unsigned long +rb_range_align_subbuf(unsigned long addr, int subbuf_size, int nr_subbufs) +{ + addr += sizeof(struct ring_buffer_meta) + + sizeof(int) * nr_subbufs; + return ALIGN(addr, subbuf_size); +} + +/* + * Return a specific sub-buffer for a given @cpu defined by @idx. + */ +static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int nr_pages, int idx) +{ + unsigned long ptr; + int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE; + int nr_subbufs; + + /* Include the reader page */ + nr_subbufs = nr_pages + 1; + + /* +* The firs
Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field
On Wed, 10 Apr 2024 13:07:20 -0400 Chuck Lever wrote: > Well I guess I could test it for you. I'll take it for nfsd v6.9-rc. Thanks! -- Steve
Re: [PATCH v20 0/5] Introducing trace buffer mapping by user-space
Hi Andrew, et.al. Linus said it's a hard requirement that this code gets an Acked-by (or Reviewed-by) from the memory sub-maintainers before he will accept it. He was upset that we faulted in pages one at a time instead of mapping it in one go: https://lore.kernel.org/all/CAHk-=wh5wWeib7+kVHpBVtUn7kx7GGadWqb5mW5FYTdewEfL=w...@mail.gmail.com/ Could you take a look at patches 1-3 to make sure they look sane from a memory management point of view? I really want this applied in the next merge window. Thanks! -- Steve On Sat, 6 Apr 2024 18:36:44 +0100 Vincent Donnefort wrote: > The tracing ring-buffers can be stored on disk or sent to network > without any copy via splice. However the later doesn't allow real time > processing of the traces. A solution is to give userspace direct access > to the ring-buffer pages via a mapping. An application can now become a > consumer of the ring-buffer, in a similar fashion to what trace_pipe > offers. > > Support for this new feature can already be found in libtracefs from > version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE. > > Vincent > > v19 -> v20: > * Fix typos in documentation. > * Remove useless mmap open and fault callbacks. > * add mm.h include for vm_insert_pages > > v18 -> v19: > * Use VM_PFNMAP and vm_insert_pages > * Allocate ring-buffer subbufs with __GFP_COMP > * Pad the meta-page with the zero-page to align on the subbuf_order > * Extend the ring-buffer test with mmap() dedicated suite > > v17 -> v18: > * Fix lockdep_assert_held > * Fix spin_lock_init typo > * Fix CONFIG_TRACER_MAX_TRACE typo > > v16 -> v17: > * Documentation and comments improvements. > * Create get/put_snapshot_map() for clearer code. > * Replace kzalloc with kcalloc. > * Fix -ENOMEM handling in rb_alloc_meta_page(). > * Move flush(cpu_buffer->reader_page) behind the reader lock. > * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer > mutex. (removes READ_ONCE/WRITE_ONCE accesses). 
> > v15 -> v16: > * Add comment for the dcache flush. > * Remove now unnecessary WRITE_ONCE for the meta-page. > > v14 -> v15: > * Add meta-page and reader-page flush. Intends to fix the mapping > for VIVT and aliasing-VIPT data caches. > * -EPERM on VM_EXEC. > * Fix build warning !CONFIG_TRACER_MAX_TRACE. > > v13 -> v14: > * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu) > * on unmap, sync meta-page teardown with the reader_lock instead of > the synchronize_rcu. > * Add a dedicated spinlock for trace_array ->snapshot and ->mapped. > (intends to fix a lockdep issue) > * Add kerneldoc for flags and Reserved fields. > * Add kselftest for snapshot/map mutual exclusion. > > v12 -> v13: > * Swap subbufs_{touched,lost} for Reserved fields. > * Add a flag field in the meta-page. > * Fix CONFIG_TRACER_MAX_TRACE. > * Rebase on top of trace/urgent. > * Add a comment for try_unregister_trigger() > > v11 -> v12: > * Fix code sample mmap bug. > * Add logging in sample code. > * Reset tracer in selftest. > * Add a refcount for the snapshot users. > * Prevent mapping when there are snapshot users and vice versa. > * Refine the meta-page. > * Fix types in the meta-page. > * Collect Reviewed-by. > > v10 -> v11: > * Add Documentation and code sample. > * Add a selftest. > * Move all the update to the meta-page into a single > rb_update_meta_page(). > * rb_update_meta_page() is now called from > ring_buffer_map_get_reader() to fix NOBLOCK callers. > * kerneldoc for struct trace_meta_page. > * Add a patch to zero all the ring-buffer allocations. 
> > v9 -> v10: > * Refactor rb_update_meta_page() > * In-loop declaration for foreach_subbuf_page() > * Check for cpu_buffer->mapped overflow > > v8 -> v9: > * Fix the unlock path in ring_buffer_map() > * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer > * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a) > > v7 -> v8: > * Drop the subbufs renaming into bpages > * Use subbuf as a name when relevant > > v6 -> v7: > * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/ > * Support for subbufs > * Rename subbufs into bpages > > v5 -> v6: > * Rebase on next-20230802. > * (unsigned long) -> (void *) cast for virt_to_page(). > * Add a wait for the GET_READER_PAGE ioctl. > * Move writer fields update (overrun/pages_lost/entries/pages_touched) > in the irq_work. > * Rearrange id in struct buffer_page. > * Rearrange the meta-page. > * ring_buffer_meta_page -> trace_buffer_meta_page. > * Add meta_struct_len into the meta-page. > > v4 -> v5: > * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3) > > v3 -> v4: > * Add to the meta-page: >- pages_lost / pages_read (allow to compute how full is the >ring-buffer) >- read (allow to compute how many entries can be read) >- A reader_page
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Sat, 6 Apr 2024 18:36:46 +0100 Vincent Donnefort wrote: > +int ring_buffer_map(struct trace_buffer *buffer, int cpu, > + struct vm_area_struct *vma) > +{ > + struct ring_buffer_per_cpu *cpu_buffer; > + unsigned long flags, *subbuf_ids; > + int err = 0; > + > + if (!cpumask_test_cpu(cpu, buffer->cpumask)) > + return -EINVAL; > + > + cpu_buffer = buffer->buffers[cpu]; > + > + mutex_lock(_buffer->mapping_lock); > + > + if (cpu_buffer->mapped) { > + err = __rb_map_vma(cpu_buffer, vma); > + if (!err) > + err = __rb_inc_dec_mapped(cpu_buffer, true); > + mutex_unlock(_buffer->mapping_lock); > + return err; > + } > + > + /* prevent another thread from changing buffer/sub-buffer sizes */ > + mutex_lock(>mutex); > + > + err = rb_alloc_meta_page(cpu_buffer); > + if (err) > + goto unlock; > + > + /* subbuf_ids include the reader while nr_pages does not */ > + subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1, sizeof(*subbuf_ids), > GFP_KERNEL); > + if (!subbuf_ids) { > + rb_free_meta_page(cpu_buffer); > + err = -ENOMEM; > + goto unlock; > + } > + > + atomic_inc(_buffer->resize_disabled); > + > + /* > + * Lock all readers to block any subbuf swap until the subbuf IDs are > + * assigned. > + */ > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > + rb_setup_ids_meta_page(cpu_buffer, subbuf_ids); > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > + > + err = __rb_map_vma(cpu_buffer, vma); > + if (!err) { > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > + cpu_buffer->mapped = 1; > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > + } else { > + kfree(cpu_buffer->subbuf_ids); > + cpu_buffer->subbuf_ids = NULL; > + rb_free_meta_page(cpu_buffer); > + } > +unlock: Nit: For all labels, please add a space before them. Otherwise, diffs will show "unlock" as the function and not "ring_buffer_map", making it harder to find where the change is. Same for the labels below. 
-- Steve > + mutex_unlock(>mutex); > + mutex_unlock(_buffer->mapping_lock); > + > + return err; > +} > + > +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu) > +{ > + struct ring_buffer_per_cpu *cpu_buffer; > + unsigned long flags; > + int err = 0; > + > + if (!cpumask_test_cpu(cpu, buffer->cpumask)) > + return -EINVAL; > + > + cpu_buffer = buffer->buffers[cpu]; > + > + mutex_lock(_buffer->mapping_lock); > + > + if (!cpu_buffer->mapped) { > + err = -ENODEV; > + goto out; > + } else if (cpu_buffer->mapped > 1) { > + __rb_inc_dec_mapped(cpu_buffer, false); > + goto out; > + } > + > + mutex_lock(>mutex); > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > + > + cpu_buffer->mapped = 0; > + > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > + > + kfree(cpu_buffer->subbuf_ids); > + cpu_buffer->subbuf_ids = NULL; > + rb_free_meta_page(cpu_buffer); > + atomic_dec(_buffer->resize_disabled); > + > + mutex_unlock(>mutex); > +out: > + mutex_unlock(_buffer->mapping_lock); > + > + return err; > +} > + > +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu) > +{ > + struct ring_buffer_per_cpu *cpu_buffer; > + unsigned long reader_size; > + unsigned long flags; > + > + cpu_buffer = rb_get_mapped_buffer(buffer, cpu); > + if (IS_ERR(cpu_buffer)) > + return (int)PTR_ERR(cpu_buffer); > + > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > +consume: > + if (rb_per_cpu_empty(cpu_buffer)) > + goto out; > + > + reader_size = rb_page_size(cpu_buffer->reader_page); > + > + /* > + * There are data to be read on the current reader page, we can > + * return to the caller. But before that, we assume the latter will read > + * everything. Let's update the kernel reader accordingly. 
> + */ > + if (cpu_buffer->reader_page->read < reader_size) { > + while (cpu_buffer->reader_page->read < reader_size) > + rb_advance_reader(cpu_buffer); > + goto out; > + } > + > + if (WARN_ON(!rb_get_reader_page(cpu_buffer))) > + goto out; > + > + goto consume; > +out: > + /* Some archs do not have data cache coherency between kernel and > user-space */ > + flush_dcache_folio(virt_to_folio(cpu_buffer->reader_page->page)); > + > + rb_update_meta_page(cpu_buffer); > + > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > + rb_put_mapped_buffer(cpu_buffer); > + > + return 0; > +} > + > /* > * We only allocate new buffers, never free them if the CPU goes down. > *
Re: [PATCH v20 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP
Hi Vincent, Thanks for sending this. Nit: Subject should start with a capital: ring-buffer: Allocate sub-buffers with __GFP_COMP -- Steve On Sat, 6 Apr 2024 18:36:45 +0100 Vincent Donnefort wrote: > In preparation for the ring-buffer memory mapping, allocate compound > pages for the ring-buffer sub-buffers to enable us to map them to > user-space with vm_insert_pages(). > > Signed-off-by: Vincent Donnefort > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 25476ead681b..cc9ebe593571 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct > ring_buffer_per_cpu *cpu_buffer, > list_add(>list, pages); > > page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), > - mflags | __GFP_ZERO, > + mflags | __GFP_COMP | __GFP_ZERO, > cpu_buffer->buffer->subbuf_order); > if (!page) > goto free_pages; > @@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, > long nr_pages, int cpu) > > cpu_buffer->reader_page = bpage; > > - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO, > + page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | > __GFP_ZERO, > cpu_buffer->buffer->subbuf_order); > if (!page) > goto fail_free_reader; > @@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer > *buffer, int cpu) > goto out; > > page = alloc_pages_node(cpu_to_node(cpu), > - GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO, > + GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | > __GFP_ZERO, > cpu_buffer->buffer->subbuf_order); > if (!page) { > kfree(bpage);
Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field
On Wed, 10 Apr 2024 12:38:53 -0400 Chuck Lever wrote: > On Wed, Apr 10, 2024 at 12:38:13PM -0400, Steven Rostedt wrote: > > From: "Steven Rostedt (Google)" > > > > The rpcgss_context trace event acceptor field is a dynamically sized > > string that records the "data" parameter. But this parameter is also > > dependent on the "len" field to determine the size of the data. > > > > It needs to use __string_len() helper macro where the length can be passed > > in. It also incorrectly uses strncpy() to save it instead of > > __assign_str(). As these macros can change, it is not wise to open code > > them in trace events. > > > > As of commit c759e609030c ("tracing: Remove __assign_str_len()"), > > __assign_str() can be used for both __string() and __string_len() fields. > > Before that commit, __assign_str_len() is required to be used. This needs > > to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing: > > Rework __assign_str() and __string() to not duplicate getting the string") > > is the commit that makes __string_str_len() obsolete). > > > > Cc: sta...@vger.kernel.org > > Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko") > > Signed-off-by: Steven Rostedt (Google) > > Acked-by: Chuck Lever > Thanks, but feel free to take it if you want. Unless you rather have me take it through my tree? -- Steve
[PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field
From: "Steven Rostedt (Google)" The rpcgss_context trace event acceptor field is a dynamically sized string that records the "data" parameter. But this parameter is also dependent on the "len" field to determine the size of the data. It needs to use __string_len() helper macro where the length can be passed in. It also incorrectly uses strncpy() to save it instead of __assign_str(). As these macros can change, it is not wise to open code them in trace events. As of commit c759e609030c ("tracing: Remove __assign_str_len()"), __assign_str() can be used for both __string() and __string_len() fields. Before that commit, __assign_str_len() is required to be used. This needs to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string") is the commit that makes __string_str_len() obsolete). Cc: sta...@vger.kernel.org Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko") Signed-off-by: Steven Rostedt (Google) --- include/trace/events/rpcgss.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/trace/events/rpcgss.h b/include/trace/events/rpcgss.h index ba2d96a1bc2f..f50fcafc69de 100644 --- a/include/trace/events/rpcgss.h +++ b/include/trace/events/rpcgss.h @@ -609,7 +609,7 @@ TRACE_EVENT(rpcgss_context, __field(unsigned int, timeout) __field(u32, window_size) __field(int, len) - __string(acceptor, data) + __string_len(acceptor, data, len) ), TP_fast_assign( @@ -618,7 +618,7 @@ TRACE_EVENT(rpcgss_context, __entry->timeout = timeout; __entry->window_size = window_size; __entry->len = len; - strncpy(__get_str(acceptor), data, len); + __assign_str(acceptor, data); ), TP_printk("win_size=%u expiry=%lu now=%lu timeout=%u acceptor=%.*s", -- 2.43.0
Re: [PATCH] ftrace: Fix use-after-free issue in ftrace_location()
On Mon, 1 Apr 2024 20:55:43 +0800 Zheng Yejian wrote: > KASAN reports a bug: > > BUG: KASAN: use-after-free in ftrace_location+0x90/0x120 > Read of size 8 at addr 888141d40010 by task insmod/424 > CPU: 8 PID: 424 Comm: insmod Tainted: GW 6.9.0-rc2+ #213 > [...] > Call Trace: > > dump_stack_lvl+0x68/0xa0 > print_report+0xcf/0x610 > kasan_report+0xb5/0xe0 > ftrace_location+0x90/0x120 > register_kprobe+0x14b/0xa40 > kprobe_init+0x2d/0xff0 [kprobe_example] > do_one_initcall+0x8f/0x2d0 > do_init_module+0x13a/0x3c0 > load_module+0x3082/0x33d0 > init_module_from_file+0xd2/0x130 > __x64_sys_finit_module+0x306/0x440 > do_syscall_64+0x68/0x140 > entry_SYSCALL_64_after_hwframe+0x71/0x79 > > The root cause is that when lookup_rec() is looking up the ftrace record of > an address in some module, at the same time in ftrace_release_mod(), > the memory saving the ftrace records has been freed as that module is > being deleted. > > register_kprobes() { > check_kprobe_address_safe() { > arch_check_ftrace_location() { > ftrace_location() { > lookup_rec() // access memory that has been freed by > // ftrace_release_mod() !!! > > It seems that the ftrace_lock is required when looking up records in > ftrace_location(), as is ftrace_location_range(). > > Fixes: ae6aa16fdc16 ("kprobes: introduce ftrace based optimization") > Signed-off-by: Zheng Yejian > --- > kernel/trace/ftrace.c | 28 ++-- > 1 file changed, 18 insertions(+), 10 deletions(-) > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c > index da1710499698..838d175709c1 100644 > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -1581,7 +1581,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long > start, unsigned long end) > } > > /** > - * ftrace_location_range - return the first address of a traced location > + * ftrace_location_range_locked - return the first address of a traced > location > * if it touches the given ip range > * @start: start of range to search. > * @end: end of range to search (inclusive). 
@end points to the last byte > @@ -1592,7 +1592,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long > start, unsigned long end) > * that is either a NOP or call to the function tracer. It checks the ftrace > * internal tables to determine if the address belongs or not. > */ > -unsigned long ftrace_location_range(unsigned long start, unsigned long end) > +static unsigned long ftrace_location_range_locked(unsigned long start, > unsigned long end) > { > struct dyn_ftrace *rec; > > @@ -1603,6 +1603,17 @@ unsigned long ftrace_location_range(unsigned long > start, unsigned long end) > return 0; > } > > +unsigned long ftrace_location_range(unsigned long start, unsigned long end) > +{ > + unsigned long loc; > + > + mutex_lock(&ftrace_lock); > + loc = ftrace_location_range_locked(start, end); > + mutex_unlock(&ftrace_lock); I'm not so sure we can take a mutex in all places that call this function. What about using RCU? rcu_read_lock(); loc = ftrace_location_range_rcu(start, end); rcu_read_unlock(); Then in ftrace_release_mod() we can have: out_unlock: mutex_unlock(&ftrace_lock); /* Need to synchronize with ftrace_location() */ if (tmp_pages) synchronize_rcu(); -- Steve > + > + return loc; > +} > + > /** > * ftrace_location - return the ftrace location > * @ip: the instruction pointer to check > @@ -1614,25 +1625,22 @@ unsigned long ftrace_location_range(unsigned long > start, unsigned long end) > */ > unsigned long ftrace_location(unsigned long ip) > { > - struct dyn_ftrace *rec; > + unsigned long loc; > unsigned long offset; > unsigned long size; > > - rec = lookup_rec(ip, ip); > - if (!rec) { > + loc = ftrace_location_range(ip, ip); > + if (!loc) { > if (!kallsyms_lookup_size_offset(ip, &size, &offset)) > goto out; > > /* map sym+0 to __fentry__ */ > if (!offset) > - rec = lookup_rec(ip, ip + size - 1); > + loc = ftrace_location_range(ip, ip + size - 1); > } > > - if (rec) > - return rec->ip; > - > out: > - return 0; > + return loc; > } > > /**
Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched
On Wed, 10 Apr 2024 08:44:00 +0900 Masami Hiramatsu (Google) wrote: > Looks good to me. > > Acked-by: Masami Hiramatsu (Google) Thanks. > > BTW, isn't this a real bugfix, because the page_touched can be > bigger than nr_pages without this fix? Yes, I simply forgot to add the Cc stable. -- Steve
[PATCH] ring-buffer: Only update pages_touched when a new page is touched
From: "Steven Rostedt (Google)" The "buffer_percent" logic that is used by the ring buffer splice code to only wake up the tasks when there's no data after the buffer is filled to the percentage of the "buffer_percent" file is dependent on three variables that determine the amount of data that is in the ring buffer: 1) pages_read - incremented whenever a new sub-buffer is consumed 2) pages_lost - incremented every time a writer overwrites a sub-buffer 3) pages_touched - incremented when a write goes to a new sub-buffer The percentage is the calculation of: (pages_touched - (pages_lost + pages_read)) / nr_pages Basically, the amount of data is the total number of sub-bufs that have been touched, minus the number of sub-bufs lost and sub-bufs consumed. This is divided by the total count to give the buffer percentage. When the percentage is greater than the value in the "buffer_percent" file, it wakes up splice readers waiting for that amount. It was observed that over time, the amount read from the splice was constantly decreasing the longer the trace was running. That is, if one asked for 60%, it would read over 60% when it first starts tracing, but then it would be woken up at under 60% and would slowly decrease the amount of data read after being woken up, where the amount becomes much less than the buffer percent. This was due to an accounting of the pages_touched incrementation. This value is incremented whenever a writer transfers to a new sub-buffer. But the place where it was incremented was incorrect. If a writer overflowed the current sub-buffer it would go to the next one. If it gets preempted by an interrupt at that time, and the interrupt performs a trace, it too will end up going to the next sub-buffer. But only one should increment the counter. Unfortunately, that was not the case. Change the cmpxchg() that does the real switch of the tail-page into a try_cmpxchg(), and on success, perform the increment of pages_touched. 
This will only increment the counter once for when the writer moves to a new sub-buffer, and not when there's a race and is incremented for when a writer and its preempting writer both move to the same new sub-buffer. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 25476ead681b..6511dc3a00da 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct ring_buffer_per_cpu *cpu_buffer, old_write = local_add_return(RB_WRITE_INTCNT, &next_page->write); old_entries = local_add_return(RB_WRITE_INTCNT, &next_page->entries); - local_inc(&cpu_buffer->pages_touched); /* * Just make sure we have seen our old_write and synchronize * with any interrupts that come in. @@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct ring_buffer_per_cpu *cpu_buffer, */ local_set(&next_page->page->commit, 0); - /* Again, either we update tail_page or an interrupt does */ - (void)cmpxchg(&cpu_buffer->tail_page, tail_page, next_page); + /* Either we update tail_page or an interrupt does */ + if (try_cmpxchg(&cpu_buffer->tail_page, &tail_page, next_page)) + local_inc(&cpu_buffer->pages_touched); } } -- 2.43.0
Re: [PATCH net-next v3 6/6] rstreason: make it work in trace world
On Tue, 9 Apr 2024 18:09:34 +0800 Jason Xing wrote: > /* > * tcp event with arguments sk and skb > @@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb, > TP_ARGS(sk, skb) > ); > > +#undef FN1 > +#define FN1(reason) TRACE_DEFINE_ENUM(SK_RST_REASON_##reason); > +#undef FN2 > +#define FN2(reason) TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason); > +DEFINE_RST_REASON(FN1, FN1) Interesting. I've never seen the passing of the internal macros to the main macro before. I see that you are using it for handling both the SK_RST_REASON and the SK_DROP_REASON. > + > +#undef FN1 > +#undef FNe1 > +#define FN1(reason) { SK_RST_REASON_##reason, #reason }, > +#define FNe1(reason) { SK_RST_REASON_##reason, #reason } > + > +#undef FN2 > +#undef FNe2 > +#define FN2(reason) { SKB_DROP_REASON_##reason, #reason }, > +#define FNe2(reason) { SKB_DROP_REASON_##reason, #reason } Anyway, from a tracing point of view, it looks like it would work (I haven't tested it). Reviewed-by: Steven Rostedt (Google) -- Steve
Re: [PATCH 1/3] tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT
On Mon, 8 Apr 2024 16:58:17 +0200 Michal Koutný wrote: > @@ -294,7 +295,7 @@ static void __trace_find_cmdline(int pid, char comm[]) > return; > } > > - tpid = pid & (PID_MAX_DEFAULT - 1); > + tpid = pid % PID_MAP_SIZE; Does that compile to the same? This is a fast path. -- Steve > map = savedcmd->map_pid_to_cmdline[tpid]; > if (map != NO_CMDLINE_MAP) { > tpid = savedcmd->map_cmdline_to_pid[map];
Re: [PATCH] tracing: Add new_exec tracepoint
On Mon, 8 Apr 2024 11:01:54 +0200 Marco Elver wrote: > Add "new_exec" tracepoint, which is run right after the point of no > return but before the current task assumes its new exec identity. > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > runs before flushing the old exec, i.e. while the task still has the > original state (such as original MM), but when the new exec either > succeeds or crashes (but never returns to the original exec). > > Being able to trace this event can be helpful in a number of use cases: > > * allowing tracing eBPF programs access to the original MM on exec, > before current->mm is replaced; > * counting exec in the original task (via perf event); > * profiling flush time ("new_exec" to "sched_process_exec"). > > Example of tracing output ("new_exec" and "sched_process_exec"): How common is this? And can't you just do the same with adding a kprobe? -- Steve
Re: [PATCH RFC ftrace] Asynchronous grace period for register_ftrace_direct()
On Wed, 3 Apr 2024 11:53:14 -0700 "Paul E. McKenney" wrote: > @@ -5366,6 +5366,13 @@ static void remove_direct_functions_hash(struct > ftrace_hash *hash, unsigned long > } > } > > +static void register_ftrace_direct_cb(struct rcu_head *rhp) > +{ > + struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu); > + > + free_ftrace_hash(fhp); > +} > + > /** > * register_ftrace_direct - Call a custom trampoline directly > * for multiple functions registered in @ops > @@ -5464,10 +5471,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, > unsigned long addr) > out_unlock: > mutex_unlock(&direct_mutex); > > - if (free_hash && free_hash != EMPTY_HASH) { > - synchronize_rcu_tasks(); > - free_ftrace_hash(free_hash); > - } > + if (free_hash && free_hash != EMPTY_HASH) > + call_rcu_tasks(&free_hash->rcu, register_ftrace_direct_cb); > > if (new_hash) > free_ftrace_hash(new_hash); I'm getting ready to go on PTO, but a quick glance doesn't look like this would cause any harm. -- Steve
Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer
On Wed, 3 Apr 2024 15:39:44 +0100 Vincent Donnefort wrote: > > Do you plan on sending out a v20 series? > > Of course, let me spin that this week! Got also few typos to fix in the doc > and > I believe an include missing for riscv. No rush, I'll be on PTO until next Tuesday, and will not get to it before then. -- Steve
Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer
On Fri, 29 Mar 2024 14:40:55 -0400 Steven Rostedt wrote: > > +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf) > > +{ > > + return VM_FAULT_SIGBUS; > > +} > > If this is all it does, I don't believe it's needed. > > > + > > +#ifdef CONFIG_TRACER_MAX_TRACE > > +static int get_snapshot_map(struct trace_array *tr) > > +{ > > + int err = 0; > > + > > + /* > > +* Called with mmap_lock held. lockdep would be unhappy if we would now > > +* take trace_types_lock. Instead use the specific > > +* snapshot_trigger_lock. > > +*/ > > + spin_lock(&tr->snapshot_trigger_lock); > > + > > + if (tr->snapshot || tr->mapped == UINT_MAX) > > + err = -EBUSY; > > + else > > + tr->mapped++; > > + > > + spin_unlock(&tr->snapshot_trigger_lock); > > + > > + /* Wait for update_max_tr() to observe iter->tr->mapped */ > > + if (tr->mapped == 1) > > + synchronize_rcu(); > > + > > + return err; > > + > > +} > > +static void put_snapshot_map(struct trace_array *tr) > > +{ > > + spin_lock(&tr->snapshot_trigger_lock); > > + if (!WARN_ON(!tr->mapped)) > > + tr->mapped--; > > + spin_unlock(&tr->snapshot_trigger_lock); > > +} > > +#else > > +static inline int get_snapshot_map(struct trace_array *tr) { return 0; } > > +static inline void put_snapshot_map(struct trace_array *tr) { } > > +#endif > > + > > +static void tracing_buffers_mmap_close(struct vm_area_struct *vma) > > +{ > > + struct ftrace_buffer_info *info = vma->vm_file->private_data; > > + struct trace_iterator *iter = &info->iter; > > + > > + WARN_ON(ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file)); > > + put_snapshot_map(iter->tr); > > +} > > + > > +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) { } > > Same for the open. 
> > > > + > > +static const struct vm_operations_struct tracing_buffers_vmops = { > > + .open = tracing_buffers_mmap_open, > > + .close = tracing_buffers_mmap_close, > > + .fault = tracing_buffers_mmap_fault, > > +}; > > I replaced this with: > > static const struct vm_operations_struct tracing_buffers_vmops = { > .close = tracing_buffers_mmap_close, > }; > > And it appears to work just fine. The mm code handles the NULL cases for > .open and .fault. > > Is there any reason to do something different than the mm defaults? Hi Vincent, Do you plan on sending out a v20 series? -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, 2 Apr 2024 22:21:00 -0700 Andrii Nakryiko wrote: > > I just checked our fleet-wide production data for the last 24 hours. > > Within the kprobe/kretprobe code path (ftrace_trampoline and > > everything called from it), rcu_is_watching (both calls, see below) > > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd > > prefer to be able to avoid that in production use cases. > > > > I just ran synthetic microbenchmark testing multi-kretprobe > throughput. We get (in millions of BPF kretprobe-multi program > invocations per second): > - 5.568M/s as baseline; > - 5.679M/s with changes in this patch (+2% throughput improvement); > - 5.808M/s with disabling rcu_is_watching in rethook_try_get() > (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline). > > It's definitely noticeable. Ah, thanks for verifying (I should have read the thread before replying to the previous email). -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, 2 Apr 2024 21:00:19 -0700 Andrii Nakryiko wrote: > I just noticed another rcu_is_watching() call, in rethook_try_get(), > which seems to be a similar and complementary validation check to the > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option > in this patch. It feels like both of them should be controlled by the > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > guard around rcu_is_watching() check in rethook_try_get() as well? That is totally up to Masami. It may have even less overhead as I'm not sure how many times that gets called, and there may be more work to do than with function tracing. -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Wed, 3 Apr 2024 09:40:48 +0900 Masami Hiramatsu (Google) wrote: > OK, for me, this last sentence is preferred for the help message. That > explains > what this is for. > > All callbacks that attach to the function tracing have some sort > of protection against recursion. This option is only to verify that > ftrace (and other users of ftrace_test_recursion_trylock()) are not > called outside of RCU, as if they are, it can cause a race. > But it also has a noticeable overhead when enabled. > > BTW, how much overhead does this introduce? and the race case a kernel crash? > or just messed up the ftrace buffer? There's a hypothetical race where it can cause a use after free. That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(), which requires RCU to be watching. There's a theoretical case where that task calls the trampoline and misses the synchronization. Note, these locations are with preemption disabled, as rcu is always watching when preemption is enabled. Thus it would be extremely fast where as the synchronize_rcu_tasks() is rather slow. We also have synchronize_rcu_tasks_rude() which would actually keep the trace from happening, as it would schedule on each CPU forcing all CPUs to have RCU watching. I have never heard of this race being hit. I guess it could happen on a VM where a vCPU gets preempted at the right moment for a long time and the other CPUs synchronize. But like lockdep, where deadlocks can crash the kernel, we don't enable it for production. The overhead is another function call within the function tracer. I had numbers before, but I guess I could run tests again and get new numbers. Thanks, -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 1 Apr 2024 19:29:46 -0700 Andrii Nakryiko wrote: > On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu wrote: > > > > On Mon, 1 Apr 2024 12:09:18 -0400 > > Steven Rostedt wrote: > > > > > On Mon, 1 Apr 2024 20:25:52 +0900 > > > Masami Hiramatsu (Google) wrote: > > > > > > > > Masami, > > > > > > > > > > Are you OK with just keeping it set to N. > > > > > > > > OK, if it is only for the debugging, I'm OK to set N this. > > > > > > > > > > > > > > We could have other options like PROVE_LOCKING enable it. > > > > > > > > Agreed (but it should say this is a debug option) > > > > > > It does say "Validate" which to me is a debug option. What would you > > > suggest? > > > > I think the help message should have "This is for debugging ftrace." > > > > Sent v2 with adjusted wording, thanks! You may want to wait till Masami and I agree ;-) Masami, But it isn't really for "debugging", it's for validating. That is, it doesn't give us any information to debug ftrace. It only validates if it is executed properly. In other words, I never want to be asked "How can I use this option to debug ftrace?" For example, we also have: "Verify ring buffer time stamp deltas" that makes sure the time stamps of the ring buffer are not buggy. And there's: "Verify compile time sorting of ftrace functions" Which is also used to make sure things are working properly. Neither of the above says they are for "debugging", even though they are more useful for debugging than this option. I'm not sure saying this is "debugging ftrace" is accurate. It may help debug ftrace if it is called outside of an RCU location, which has a 1 in 100,000,000,000 chance of causing an actual bug, as the race window is extremely small. Now if it is also called outside of instrumentation, that will likely trigger other warnings even without this code, and this will not be needed to debug that. ftrace has all sorts of "verifiers" that is used to make sure things are working properly. 
And yes, you can consider it as "debugging". But I would not consider this an option to enable if ftrace was broken, and you are looking into why it is broken. To me, this option is only to verify that ftrace (and other users of ftrace_test_recursion_trylock()) are not called outside of RCU, as if they are, it can cause a race. But it also has a noticeable overhead when enabled. -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 1 Apr 2024 20:25:52 +0900 Masami Hiramatsu (Google) wrote: > > Masami, > > > > Are you OK with just keeping it set to N. > > OK, if it is only for the debugging, I'm OK to set N this. > > > > > We could have other options like PROVE_LOCKING enable it. > > Agreed (but it should say this is a debug option) It does say "Validate" which to me is a debug option. What would you suggest? -- Steve
Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer
On Tue, 26 Mar 2024 10:08:28 + Vincent Donnefort wrote: > Currently, user-space extracts data from the ring-buffer via splice, > which is handy for storage or network sharing. However, due to splice > limitations, it is impossible to do real-time analysis without a copy. > > A solution for that problem is to let the user-space map the ring-buffer > directly. > > The mapping is exposed via the per-CPU file trace_pipe_raw. The first > element of the mapping is the meta-page. It is followed by each > subbuffer constituting the ring-buffer, ordered by their unique page ID: > > * Meta-page -- include/uapi/linux/trace_mmap.h for a description > * Subbuf ID 0 > * Subbuf ID 1 > ... > > It is therefore easy to translate a subbuf ID into an offset in the > mapping: > > reader_id = meta->reader->id; > reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; > > When new data is available, the mapper must call a newly introduced ioctl: > TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to > point to the next reader containing unread data. > Thanks for the update Vincent! > Mapping will prevent snapshot and buffer size modifications. 
> > CC: > Signed-off-by: Vincent Donnefort > > diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h > index ffcd8dfcaa4f..d25b9d504a7c 100644 > --- a/include/uapi/linux/trace_mmap.h > +++ b/include/uapi/linux/trace_mmap.h > @@ -43,4 +43,6 @@ struct trace_buffer_meta { > __u64 Reserved2; > }; > > +#define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1) > + > #endif /* _TRACE_MMAP_H_ */ > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > index 233d1af39fff..0f37aa9860fd 100644 > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -1191,6 +1191,12 @@ static void tracing_snapshot_instance_cond(struct > trace_array *tr, > return; > } > > + if (tr->mapped) { > + trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n"); > + trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n"); > + return; > + } > + > local_irq_save(flags); > update_max_tr(tr, current, smp_processor_id(), cond_data); > local_irq_restore(flags); > @@ -1323,7 +1329,7 @@ static int tracing_arm_snapshot_locked(struct > trace_array *tr) > lockdep_assert_held(&trace_types_lock); > > spin_lock(&tr->snapshot_trigger_lock); > - if (tr->snapshot == UINT_MAX) { > + if (tr->snapshot == UINT_MAX || tr->mapped) { > spin_unlock(&tr->snapshot_trigger_lock); > return -EBUSY; > } > @@ -6068,7 +6074,7 @@ static void tracing_set_nop(struct trace_array *tr) > { > if (tr->current_trace == &nop_trace) > return; > - > + > tr->current_trace->enabled--; > > if (tr->current_trace->reset) > @@ -8194,15 +8200,32 @@ tracing_buffers_splice_read(struct file *file, loff_t > *ppos, > return ret; > } > > -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters > */ > static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, > unsigned long arg) > { > struct ftrace_buffer_info *info = file->private_data; > struct trace_iterator *iter = &info->iter; > + int err; > + > + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { > + if (!(file->f_flags & O_NONBLOCK)) { > + err = 
ring_buffer_wait(iter->array_buffer->buffer, > +iter->cpu_file, > +iter->tr->buffer_percent, > +NULL, NULL); > + if (err) > + return err; > + } > > - if (cmd) > - return -ENOIOCTLCMD; > + return ring_buffer_map_get_reader(iter->array_buffer->buffer, > + iter->cpu_file); > + } else if (cmd) { > + return -ENOTTY; > + } > > + /* > + * An ioctl call with cmd 0 to the ring buffer file will wake up all > + * waiters > + */ > mutex_lock(&trace_types_lock); > > /* Make sure the waiters see the new wait_index */ > @@ -8214,6 +8237,94 @@ static long tracing_buffers_ioctl(struct file *file, > unsigned int cmd, unsigned > return 0; > } > > +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf) > +{ > + return VM_FAULT_SIGBUS; > +} If this is all it does, I don't believe it's needed. > + > +#ifdef CONFIG_TRACER_MAX_TRACE > +static int get_snapshot_map(struct trace_array *tr) > +{ > + int err = 0; > + > + /* > + * Called with mmap_lock held. lockdep would be unhappy if we would now > + * take trace_types_lock. Instead use the specific > + * snapshot_trigger_lock. > + */ > + spin_lock(&tr->snapshot_trigger_lock); > + > + if (tr->snapshot || tr->mapped == UINT_MAX) > +
Re: [PATCH] trace/sched: add tgid for sched_wakeup_template
On Wed, 27 Mar 2024 16:50:57 +0800 Tio Zhang wrote: > By doing this, we are able to filter tasks by tgid while we are > tracing wakeup events by ebpf or other methods. > > For example, when we care about tracing a user space process (which has > uncertain number of LWPs, i.e, pids) to monitor its wakeup latency, > without tgid available in sched_wakeup tracepoints, we would struggle > finding out all pids to trace, or we could use kprobe to achieve tgid > tracing, which is less accurate and much less efficient than using > tracepoint. This is a very common trace event, and I really do not want to add more data than necessary to it, as it increases the size of the event which means less events can be recorded on a fixed size trace ring buffer. Note, you are not modifying the "tracepoint", but you are actually modifying a "trace event". "tracepoint" is the hook in the kernel code: trace_sched_wakeup() "trace event" is defined by TRACE_EVENT() macro (and friends) that defines what is exposed in the tracefs file system. I thought ebpf could hook directly to the tracepoint which is: trace_sched_wakeup(p); I believe you can have direct access to the 'p' before it is processed from ebpf. There's also "trace probes" (I think we are lacking documentation on this, as well as event probes :-p): $ gdb vmlinux (gdb) p &((struct task_struct *)0)->tgid $1 = (pid_t *) 0x56c (gdb) p &((struct task_struct *)0)->pid $2 = (pid_t *) 0x568 # echo 't:wakeup sched_waking pid=+0x568($arg1):u32 tgid=+0x56c($arg1):u32' > /sys/kernel/tracing/dynamic_events # trace-cmd start -e wakeup # trace-cmd show trace-cmd-7307[003] d..6. 599486.485762: wakeup: (__probestub_sched_waking+0x4/0x10) pid=845 tgid=845 bash-845 [001] d.s4. 599486.486136: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17 bash-845 [001] d..4. 599486.486336: wakeup: (__probestub_sched_waking+0x4/0x10) pid=5516 tgid=5516 kworker/u18:2-5516[001] d..4. 
599486.486445: wakeup: (__probestub_sched_waking+0x4/0x10) pid=818 tgid=818 -0 [001] d.s4. 599486.491206: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17 -0 [001] d.s5. 599486.493218: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17 -0 [001] d.s4. 599486.497200: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17 -0 [003] d.s4. 599486.829209: wakeup: (__probestub_sched_waking+0x4/0x10) pid=70 tgid=70 The above attaches to the tracepoint and $arg1 is the 'struct task_struct *p'. -- Steve > > Signed-off-by: Tio Zhang > Signed-off-by: Dylane Chen > --- > include/trace/events/sched.h | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > index dbb01b4b7451..ea7e525649e5 100644 > --- a/include/trace/events/sched.h > +++ b/include/trace/events/sched.h > @@ -149,6 +149,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template, > __field(pid_t, pid ) > __field(int,prio) > __field(int,target_cpu ) > + __field(pid_t, tgid) > ), > > TP_fast_assign( > @@ -156,11 +157,12 @@ DECLARE_EVENT_CLASS(sched_wakeup_template, > __entry->pid= p->pid; > __entry->prio = p->prio; /* XXX SCHED_DEADLINE */ > __entry->target_cpu = task_cpu(p); > + __entry->tgid = p->tgid; > ), > > - TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d", > + TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d tgid=%d", > __entry->comm, __entry->pid, __entry->prio, > - __entry->target_cpu) > + __entry->target_cpu, __entry->tgid) > ); > > /*
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, 26 Mar 2024 09:16:33 -0700 Andrii Nakryiko wrote: > > It's no different than lockdep. Test boxes should have it enabled, but > > there's no reason to have this enabled in a production system. > > > > I tend to agree with Steven here (which is why I sent this patch as it > is), but I'm happy to do it as an opt-out, if Masami insists. Please > do let me know if I need to send v2 or this one is actually the one > we'll end up using. Thanks! Masami, Are you OK with just keeping it set to N. We could have other options like PROVE_LOCKING enable it. -- Steve
Re: [PATCH 11/12] [v4] kallsyms: rework symbol lookup return codes
On Tue, 26 Mar 2024 15:53:38 +0100 Arnd Bergmann wrote: > -const char * > +int > ftrace_mod_address_lookup(unsigned long addr, unsigned long *size, > unsigned long *off, char **modname, char *sym) > { > struct ftrace_mod_map *mod_map; > - const char *ret = NULL; > + int ret; This needs to be ret = 0; > > /* mod_map is freed via call_rcu() */ > preempt_disable(); As here we have: list_for_each_entry_rcu(mod_map, &ftrace_mod_maps, list) { ret = ftrace_func_address_lookup(mod_map, addr, size, off, sym); if (ret) { if (modname) *modname = mod_map->mod->name; break; } } preempt_enable(); return ret; } Where it is possible for the loop never to be executed. -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 25 Mar 2024 11:38:48 +0900 Masami Hiramatsu (Google) wrote: > On Fri, 22 Mar 2024 09:03:23 -0700 > Andrii Nakryiko wrote: > > > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to > > control whether ftrace low-level code performs additional > > rcu_is_watching()-based validation logic in an attempt to catch noinstr > > violations. > > > > This check is expected to never be true in practice and would be best > > controlled with extra config to let users decide if they are willing to > > pay the price. > > Hmm, for me, it sounds like "WARN_ON(something) never be true in practice > so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > is OK, but tht should be set to Y by default. If you have already verified > that your system never make it true and you want to optimize your ftrace > path, you can manually set it to N at your own risk. > Really, it's for debugging. I would argue that it should *not* be default y. Peter added this to find all the locations that could be called where RCU is not watching. But the issue I have is that this is that it *does cause overhead* with function tracing. I believe we found pretty much all locations that were an issue, and we should now just make it an option for developers. It's no different than lockdep. Test boxes should have it enabled, but there's no reason to have this enabled in a production system. -- Steve > > > > Cc: Steven Rostedt > > Cc: Masami Hiramatsu > > Cc: Paul E. 
McKenney > > Signed-off-by: Andrii Nakryiko > > --- > > include/linux/trace_recursion.h | 2 +- > > kernel/trace/Kconfig| 13 + > > 2 files changed, 14 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/trace_recursion.h > > b/include/linux/trace_recursion.h > > index d48cd92d2364..24ea8ac049b4 100644 > > --- a/include/linux/trace_recursion.h > > +++ b/include/linux/trace_recursion.h > > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, > > unsigned long parent_ip); > > # define do_ftrace_record_recursion(ip, pip) do { } while (0) > > #endif > > > > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > # define trace_warn_on_no_rcu(ip) \ > > ({ \ > > bool __ret = !rcu_is_watching();\ > > BTW, maybe we can add "unlikely" in the next "if" line? > > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > > index 61c541c36596..19bce4e217d6 100644 > > --- a/kernel/trace/Kconfig > > +++ b/kernel/trace/Kconfig > > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE > > This file can be reset, but the limit can not change in > > size at runtime. > > > > +config FTRACE_VALIDATE_RCU_IS_WATCHING > > + bool "Validate RCU is on during ftrace recursion check" > > + depends on FUNCTION_TRACER > > + depends on ARCH_WANTS_NO_INSTR > > default y > > > + help > > + All callbacks that attach to the function tracing have some sort > > + of protection against recursion. This option performs additional > > + checks to make sure RCU is on when ftrace callbacks recurse. > > + > > + This will add more overhead to all ftrace-based invocations. > > ... invocations, but keep it safe. > > > + > > + If unsure, say N > > If unsure, say Y > > Thank you, > > > + > > config RING_BUFFER_RECORD_RECURSION > > bool "Record functions that recurse in the ring buffer" > > depends on FTRACE_RECORD_RECURSION > > -- > > 2.43.0 > > > >
Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
On Tue, 12 Mar 2024 13:42:28 + Mark Rutland wrote: > There are ways around that, but they're complicated and/or expensive, e.g. > > * Use a sequence of multiple patches, starting with replacing the JALR with an > exception-generating instruction with a fixup handler, which is sort-of what > x86 does with UD2. This may require multiple passes with > synchronize_rcu_tasks() to make sure all threads have seen the latest > instructions, and that cannot be done under stop_machine(), so if you need > stop_machine() for CMODx reasons, you may need to use that several times > with > intervening calls to synchronize_rcu_tasks(). Just for clarification. x86 doesn't use UD2 for updating the call sites. It uses the breakpoint (0xcc) operation. This is because x86 instructions are not a fixed size and can cross cache boundaries, making updates to text sections dangerous as another CPU may get half the old instruction and half the new one! Thus, when we have: 0f 1f 44 00 00 nop and want to convert it to: e8 e7 57 07 00 call ftrace_caller We have to first add a breakpoint: cc 1f 44 00 00 Send an IPI to all CPUs so that they see the breakpoint (IPI is equivalent to a memory barrier). We have a ftrace breakpoint handler that will look at the function that the breakpoint happened on. If it was a nop, it will just skip over the rest of the instruction, and return from the break point handler just after the "cc 1f 44 00 00". If it's supposed to be a function, it will push the return to after the instruction onto the stack, and return from the break point handler to ftrace_caller. (Although things changed a little since this update is now handled by text_poke_bp_batch()). Then it changes the rest of the instruction: cc e7 57 07 00 Sends out another IPI to all CPUs and removes the break point with the new instruction op. e8 e7 57 07 00 and now all the callers of this function will call the ftrace_caller. -- Steve