[PATCH v2 0/4] tracefs/eventfs: Fix failed second run of test_ownership
The test_ownership test of the kselftests was failing again. That's because
the original fix was incorrect, and a fix to a race condition showed how the
original fix was broken. Instead of using tracefs_inodes to find the
eventfs_inode that needs to be reset on remount, use the "events" directory
descriptor to descend into its files and directories to catch all changes.

Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240522124504.28982...@gandalf.local.home

- Added other fixes underneath and rebased it on:
  https://lore.kernel.org/lkml/20240522164320.469785...@goodmis.org/

- The real fix is to not use the tracefs_inodes to find the eventfs_inodes
  that need to be cleared on remount. Instead, the events descriptor needs
  to be used to descend its directories and files to update their attributes.

- The d_iput callback logic was misplaced. It should be done in the
  drop_inode callback.

git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git
eventfs/urgent

Head SHA1: 41b7db11bcac4638fa489c58d35e7d2146b665ab

Steven Rostedt (Google) (4):
      eventfs: Keep the directories from having the same inode number as files
      tracefs: Update inode permissions on remount
      eventfs: Update all the eventfs_inodes from the events descriptor
      tracefs: Clear EVENT_INODE flag in tracefs_drop_inode()

 fs/tracefs/event_inode.c | 57 +---
 fs/tracefs/inode.c       | 48
 2 files changed, 73 insertions(+), 32 deletions(-)
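The core idea of the series can be illustrated with a toy userspace sketch (this is not the kernel code; the structures and names here are simplified stand-ins): instead of scanning every tracefs_inode, start at the "events" eventfs_inode and recursively descend into its children, resetting each one's ownership on remount.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of "descend from the events descriptor": a simplified
 * stand-in for an eventfs_inode, holding ownership plus children.
 */
struct toy_ei {
	int uid, gid;
	struct toy_ei *children[4];
	int nr_children;
};

/* On remount, update this node, then every directory/file under it. */
static void remount_update(struct toy_ei *ei, int uid, int gid)
{
	ei->uid = uid;
	ei->gid = gid;

	for (int i = 0; i < ei->nr_children; i++)
		remount_update(ei->children[i], uid, gid);
}
```

This captures why the descent catches all changes: every reachable child is visited exactly once from the single "events" root, with no dependence on which inodes happen to be instantiated.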
Re: [PATCH] tracefs: Remove unneeded buggy tracefs iput callback
On Wed, 22 May 2024 12:45:04 -0400
Steven Rostedt wrote:

> From: "Steven Rostedt (Google)"
>
> The iput callback was added because the remount could call into the
> eventfs code and touch the ei->entry_attrs array, which could have been
> freed when an eventfs directory is freed (via a synthetic event). But the
> entry_attrs was freed incorrectly and has since been fixed to be freed
> after the last reference of the ei is done.
>
> The iput clears the TRACEFS_EVENT_INODE flag of the tracefs_inode,
> preventing it from calling the eventfs_remount() function. But the iput
> can be called after the last reference to the inode is done but the
> eventfs_inode still exists, causing eventfs_remount() not to be called
> on a tracefs_inode when it should be.

Testing this more, I found that the iput is still needed, as the deletion
of the eventfs inodes can happen before the inode is released.

Will produce a v2 that handles this properly.

-- Steve
[PATCH] tracefs: Remove unneeded buggy tracefs iput callback
From: "Steven Rostedt (Google)"

The iput callback was added because the remount could call into the
eventfs code and touch the ei->entry_attrs array, which could have been
freed when an eventfs directory is freed (via a synthetic event). But the
entry_attrs was freed incorrectly and has since been fixed to be freed
after the last reference of the ei is done.

The iput clears the TRACEFS_EVENT_INODE flag of the tracefs_inode,
preventing it from calling the eventfs_remount() function. But the iput
can be called after the last reference to the inode is done but the
eventfs_inode still exists, causing eventfs_remount() not to be called
on a tracefs_inode when it should be.

Link: https://lore.kernel.org/all/cak7lnarxgaww3kh9jgrnh4vk6fr8ldknkf3wq8nhmwjrvwj...@mail.gmail.com/

Cc: sta...@vger.kernel.org
Reported-by: Masahiro Yamada
Fixes: ee4e0379475e4 ("eventfs: Free all of the eventfs_inode after RCU")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/inode.c | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 9252e0d78ea2..62ca9c23b93c 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -455,22 +455,7 @@ static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags)
 	return !(ei && ei->is_freed);
 }
 
-static void tracefs_d_iput(struct dentry *dentry, struct inode *inode)
-{
-	struct tracefs_inode *ti = get_tracefs(inode);
-
-	/*
-	 * This inode is being freed and cannot be used for
-	 * eventfs. Clear the flag so that it doesn't call into
-	 * eventfs during the remount flag updates. The eventfs_inode
-	 * gets freed after an RCU cycle, so the content will still
-	 * be safe if the iteration is going on now.
-	 */
-	ti->flags &= ~TRACEFS_EVENT_INODE;
-}
-
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_d_iput,
 	.d_revalidate = tracefs_d_revalidate,
 	.d_release = tracefs_d_release,
 };
-- 
2.43.0
Re: [PATCH] tools/latency-collector: fix -Wformat-security compile warns
On Tue, 21 May 2024 09:11:08 -0600
Shuah Khan wrote:

> Any thoughts on this patch?

Sorry, this one fell through the cracks. Daniel Bristot has been
maintaining his tools and I thought this was one of his changes.

I'll take a look at it.

-- Steve
Re: UBSAN: shift-out-of-bounds in validate_sb_layout
On Mon, 20 May 2024 15:02:26 +0800
"Ubisectech Sirius" wrote:

> Hello.
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec.
> Recently, our team has discovered an issue in Linux kernel 6.7. Attached
> to the email was a PoC file for the issue.
>
> Stack dump:
> UBSAN: shift-out-of-bounds in fs/bcachefs/super-io.c:310:18
> shift exponent 127 is too large for 32-bit type 'int'
> CPU: 0 PID: 14408 Comm: syz-executor.3 Not tainted 6.7.0 #2
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> Call Trace:
>  __dump_stack lib/dump_stack.c:88 [inline]
>  dump_stack_lvl+0x136/0x150 lib/dump_stack.c:106
>  ubsan_epilogue lib/ubsan.c:217 [inline]
>  __ubsan_handle_shift_out_of_bounds+0x24b/0x430 lib/ubsan.c:387
>  validate_sb_layout.cold+0x1a/0x51 fs/bcachefs/super-io.c:310
>  bch2_read_super+0x980/0x1000 fs/bcachefs/super-io.c:786
>  bch2_fs_open+0x471/0x3890 fs/bcachefs/super.c:1922
>  bch2_mount+0x538/0x13c0 fs/bcachefs/fs.c:1863
>  legacy_get_tree+0x109/0x220 fs/fs_context.c:662
>  vfs_get_tree+0x93/0x380 fs/super.c:1771
>  do_new_mount fs/namespace.c:3337 [inline]
>  path_mount+0x679/0x1e40 fs/namespace.c:3664
>  do_mount fs/namespace.c:3677 [inline]
>  __do_sys_mount fs/namespace.c:3886 [inline]
>  __se_sys_mount fs/namespace.c:3863 [inline]
>  __x64_sys_mount+0x287/0x310 fs/namespace.c:3863
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x6f/0x77
> RIP: 0033:0x7f41e1091b3e
> Code: 48 c7 c0 ff ff ff ff eb aa e8 be 0d 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7f41e1d22e38 EFLAGS: 0202 ORIG_RAX: 00a5
> RAX: ffda RBX: 5d82 RCX: 7f41e1091b3e
> RDX: 20005d80 RSI: 2100 RDI: 7f41e1d22e90
> RBP: 7f41e1d22ed0 R08: 7f41e1d22ed0 R09: 0080
> R10: 0080 R11: 0202 R12: 20005d80
> R13: 2100 R14: 7f41e1d22e90 R15: 20005e00
> Thank you for taking the time to read this email and we look forward to
> working with you further.

I'm not sure why this is getting Cc'd to linux-trace-kernel. That's for
anything to do with the tracing code (trace events, tracepoints, kprobes,
uprobes, function tracer, etc.). What part of tracing is this for?

-- Steve
Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks
On Fri, 17 May 2024 15:40:08 +0200
Petr Pavlu wrote:

> The reader code in rb_get_reader_page() swaps a new reader page into the
> ring buffer by doing cmpxchg on old->list.prev->next to point it to the
> new page. Following that, if the operation is successful,
> old->list.next->prev gets updated too. This means the underlying
> doubly-linked list is temporarily inconsistent, page->prev->next or
> page->next->prev might not be equal back to page for some page in the
> ring buffer.
>
> The resize operation in ring_buffer_resize() can be invoked in parallel.
> It calls rb_check_pages() which can detect the described inconsistency
> and stop further tracing:
>
> [ 190.271762] [ cut here ]
> [ 190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 rb_check_pages.isra.0+0x6a/0xa0
> [ 190.271789] Modules linked in: [...]
> [ 190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 skx_edac(E):1
> [ 190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G E 6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f
> [ 190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014
> [ 190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0
> [ 190.272023] Code: [...]
> [ 190.272028] RSP: 0018:9c37463abb70 EFLAGS: 00010206
> [ 190.272034] RAX: 8eba04b6cb80 RBX: 0007 RCX: 8eba01f13d80
> [ 190.272038] RDX: 8eba01f130c0 RSI: 8eba04b6cd00 RDI: 8eba0004c700
> [ 190.272042] RBP: 8eba0004c700 R08: 00010002 R09:
> [ 190.272045] R10: 7f52 R11: 8eba7f60 R12: 8eba0004c720
> [ 190.272049] R13: 8eba00223a00 R14: 0008 R15: 8eba067a8000
> [ 190.272053] FS: 7f1bd64752c0() GS:8eba7f68() knlGS:
> [ 190.272057] CS: 0010 DS: ES: CR0: 80050033
> [ 190.272061] CR2: 7f1bd6662590 CR3: 00010291e001 CR4: 00370ef0
> [ 190.272070] DR0: DR1: DR2:
> [ 190.272073] DR3: DR6: fffe0ff0 DR7: 0400
> [ 190.272077] Call Trace:
> [ 190.272098] <TASK>
> [ 190.272189] ring_buffer_resize+0x2ab/0x460
> [ 190.272199] __tracing_resize_ring_buffer.part.0+0x23/0xa0
> [ 190.272206] tracing_resize_ring_buffer+0x65/0x90
> [ 190.272216] tracing_entries_write+0x74/0xc0
> [ 190.272225] vfs_write+0xf5/0x420
> [ 190.272248] ksys_write+0x67/0xe0
> [ 190.272256] do_syscall_64+0x82/0x170
> [ 190.272363] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 190.272373] RIP: 0033:0x7f1bd657d263
> [ 190.272381] Code: [...]
> [ 190.272385] RSP: 002b:7ffe72b643f8 EFLAGS: 0246 ORIG_RAX: 0001
> [ 190.272391] RAX: ffda RBX: 0002 RCX: 7f1bd657d263
> [ 190.272395] RDX: 0002 RSI: 555a6eb538e0 RDI: 0001
> [ 190.272398] RBP: 555a6eb538e0 R08: 000a R09:
> [ 190.272401] R10: 555a6eb55190 R11: 0246 R12: 7f1bd6662500
> [ 190.272404] R13: 0002 R14: 7f1bd6667c00 R15: 0002
> [ 190.272412] </TASK>
> [ 190.272414] ---[ end trace ]---
>
> Note that ring_buffer_resize() calls rb_check_pages() only if the parent
> trace_buffer has recording disabled. Recent commit d78ab792705c
> ("tracing: Stop current tracer when resizing buffer") causes that it is
> now always the case, which makes it more likely to experience this issue.
>
> The window to hit this race is nonetheless very small. To help
> reproduce it, one can add a delay loop in rb_get_reader_page():
>
>  ret = rb_head_page_replace(reader, cpu_buffer->reader_page);
>  if (!ret)
>  	goto spin;
>  for (unsigned i = 0; i < 1U << 26; i++)  /* inserted delay loop */
>  	__asm__ __volatile__ ("" : : : "memory");
>  rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list;
>
> ... and then run the following commands on the target system:
>
>  echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
>  while true; do
>  	echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
>  	echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
>  done &
>  while true; do
>  	for i in /sys/kernel/tracing/per_cpu/*; do
>  		timeout 0.1 cat $i/trace_pipe; sleep 0.2
>  	done
>  done
>
> To fix the problem, make sure ring_buffer_resize() doesn't invoke
> rb_check_pages() concurrently with a reader operating on the same
> ring_buffer_per_cpu by taking its cpu_buffer->reader_lock.

Definitely a bug. Thanks for catching it. But...

> Fixes: 659f451ff213 ("ring-buffer: Add integrity check at end of iter read")
> Signed-off-by: Petr Pavlu
> ---
>  kernel/trace/ring_buffer.c | 10 ++
>  1 file changed, 10 insertions(+)
>
> diff --git
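The shape of the fix described in this thread can be modeled in a few lines of userspace code (a toy, not the kernel implementation; the names mirror the kernel ones only loosely): a writer temporarily breaks the doubly-linked-list invariant while holding a lock, so the consistency check must take the same lock before validating, just as ring_buffer_resize() must take cpu_buffer->reader_lock before calling rb_check_pages().

```c
#include <pthread.h>

/* Toy two-node ring with next/prev links, like the sub-buffer list. */
struct node { struct node *next, *prev; };

static pthread_mutex_t reader_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node a, b;

static void link_ring(void)
{
	a.next = &b; b.prev = &a;
	b.next = &a; a.prev = &b;
}

/* Swap-like update: the list is briefly inconsistent, but only under the lock. */
static void update_under_lock(struct node *repl)
{
	pthread_mutex_lock(&reader_lock);
	a.next = repl;			/* invariant broken here... */
	repl->prev = &a;
	repl->next = b.next;
	b.next->prev = repl;		/* ...and restored before unlock */
	pthread_mutex_unlock(&reader_lock);
}

/* The check must serialize against writers, mirroring the proposed fix. */
static int check_pages(void)
{
	int ok;

	pthread_mutex_lock(&reader_lock);
	ok = (a.next->prev == &a) && (a.prev->next == &a);
	pthread_mutex_unlock(&reader_lock);
	return ok;
}
```

Without the lock in `check_pages()`, a checker running between the two link updates would see `a.next->prev != &a` — the same transient state that trips the WARN in rb_check_pages().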
Re: [PATCH] tracing/treewide: Remove second parameter of __assign_str()
On Fri, 17 May 2024 10:36:37 -0700
Guenter Roeck wrote:

> Building csky:allmodconfig (and others) ... failed
> --
> Error log:
> In file included from include/trace/trace_events.h:419,
>     from include/trace/define_trace.h:102,
>     from drivers/cxl/core/trace.h:737,
>     from drivers/cxl/core/trace.c:8:
> drivers/cxl/core/./trace.h:383:1: error: macro "__assign_str" passed 2 arguments, but takes just 1
>
> This is with the patch applied on top of v6.9-8410-gff2632d7d08e.
> So far that seems to be the only build failure.
> Introduced with commit 6aec00139d3a8 ("cxl/core: Add region info to
> cxl_general_media and cxl_dram events"). Guess we'll see more of those
> towards the end of the commit window.

Looks like I made this patch just before this commit was pulled into
Linus's tree, which is why I'll apply it and rerun the above again,
probably on Tuesday of next week against Linus's latest.

This patch made it through both an allyesconfig and an allmodconfig, but
on the commit I had applied it to, which was:

  1b294a1f3561 ("Merge tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next")

I'll be compiling those two builds after I update it then.

-- Steve
Re: [PATCH] tracing/treewide: Remove second parameter of __assign_str()
On Fri, 17 May 2024 10:08:51 +0300
Jani Nikula wrote:

> On Thu, 16 May 2024, Steven Rostedt wrote:
> > There's over 700 users of __assign_str() and because coccinelle does not
> > handle the TRACE_EVENT() macro I ended up using the following sed script:
> >
> >   git grep -l __assign_str | while read a ; do
> >     sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
> >     mv /tmp/test-file $a;
> >   done
>
> Try 'sed -i' ;)

I've always been nervous about trusting -i ;-)

> >  .../drm/i915/display/intel_display_trace.h | 56 -
>
> On i915,
>
> Acked-by: Jani Nikula

Thanks,

-- Steve
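As a quick sanity check of the sed expression from the thread above, here it is applied to a sample two-argument call (the sample line is hypothetical, not taken from a real TRACE_EVENT): it rewrites `__assign_str(name, src)` to the one-argument form while leaving the trailing semicolon intact.

```shell
# Run the script's sed expression over one sample assignment line.
line='__assign_str(name, __get_str(name));'
out=$(printf '%s\n' "$line" | sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/')
echo "$out"
```

The capture group keeps everything up to (and including) the last non-space character before the comma, and `[^;]*` swallows the rest of the second argument, which is why nested parentheses in the dropped argument are handled correctly.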
[PATCH] ring-buffer: Add cast to unsigned long addr passed to virt_to_page()
From: "Steven Rostedt (Google)"

The sub-buffer pages are held in an unsigned long array, and when it is
passed to virt_to_page() a cast is needed.

Link: https://lore.kernel.org/all/20240515124808.06279...@canb.auug.org.au/

Fixes: 117c39200d9d ("ring-buffer: Introducing ring-buffer mapping functions")
Reported-by: Stephen Rothwell
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/ring_buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a02c7a52a0f5..7345a8b625fb 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6283,7 +6283,7 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
 	}
 
 	while (p < nr_pages) {
-		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+		struct page *page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
 		int off = 0;
 
 		if (WARN_ON_ONCE(s >= nr_subbufs)) {
-- 
2.43.0
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Fri, 10 May 2024 12:03:12 +0100
Vincent Donnefort wrote:

> > I'm not particularly happy about us calling vm_insert_pages with NULL
> > pointers stored in pages.
> >
> > Should we instead do
> >
> >   if (WARN_ON_ONCE(s >= nr_subbufs)) {
> >           err = -EINVAL;
> >           goto out;
> >   }
> >
> > ?
>
> I could also set nr_pages = p in the event of s >= nr_subbufs... but
> really that shouldn't happen, so let's return an error.

I'm good with this. It should never happen anyway.

-- Steve
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Tue, 30 Apr 2024 12:13:51 +0100
Vincent Donnefort wrote:

> +#ifdef CONFIG_MMU
> +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
> +			struct vm_area_struct *vma)
> +{
> +	unsigned long nr_subbufs, nr_pages, vma_pages, pgoff = vma->vm_pgoff;
> +	unsigned int subbuf_pages, subbuf_order;
> +	struct page **pages;
> +	int p = 0, s = 0;
> +	int err;
> +
> +	/* Refuse MP_PRIVATE or writable mappings */
> +	if (vma->vm_flags & VM_WRITE || vma->vm_flags & VM_EXEC ||
> +	    !(vma->vm_flags & VM_MAYSHARE))
> +		return -EPERM;
> +
> +	/*
> +	 * Make sure the mapping cannot become writable later. Also tell the VM
> +	 * to not touch these pages (VM_DONTCOPY | VM_DONTEXPAND). Finally,
> +	 * prevent migration, GUP and dump (VM_IO).
> +	 */
> +	vm_flags_mod(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_IO, VM_MAYWRITE);

Do we really need the VM_IO?

When testing this in gdb, I would get:

  (gdb) p tmap->map->subbuf_size
  Cannot access memory at address 0x77fc2008

It appears that you can't ptrace IO memory. When I removed that flag,
gdb has no problem reading that memory.

I think we should drop that flag.

Can you send a v23 with that removed, Shuah's update, and also the
change below:

> +
> +	lockdep_assert_held(&cpu_buffer->mapping_lock);
> +
> +	subbuf_order = cpu_buffer->buffer->subbuf_order;
> +	subbuf_pages = 1 << subbuf_order;
> +
> +	nr_subbufs = cpu_buffer->nr_pages + 1; /* + reader-subbuf */
> +	nr_pages = ((nr_subbufs) << subbuf_order) - pgoff + 1; /* + meta-page */
> +
> +	vma_pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +	if (!vma_pages || vma_pages > nr_pages)
> +		return -EINVAL;
> +
> +	nr_pages = vma_pages;
> +
> +	pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
> +	if (!pages)
> +		return -ENOMEM;
> +
> +	if (!pgoff) {
> +		pages[p++] = virt_to_page(cpu_buffer->meta_page);
> +
> +		/*
> +		 * TODO: Align sub-buffers on their size, once
> +		 * vm_insert_pages() supports the zero-page.
> +		 */
> +	} else {
> +		/* Skip the meta-page */
> +		pgoff--;
> +
> +		if (pgoff % subbuf_pages) {
> +			err = -EINVAL;
> +			goto out;
> +		}
> +
> +		s += pgoff / subbuf_pages;
> +	}
> +
> +	while (s < nr_subbufs && p < nr_pages) {
> +		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
> +		int off = 0;
> +
> +		for (; off < (1 << (subbuf_order)); off++, page++) {
> +			if (p >= nr_pages)
> +				break;
> +
> +			pages[p++] = page;
> +		}
> +		s++;
> +	}

The above can be made to:

	while (p < nr_pages) {
		struct page *page;
		int off = 0;

		if (WARN_ON_ONCE(s >= nr_subbufs))
			break;

		page = virt_to_page(cpu_buffer->subbuf_ids[s]);

		for (; off < (1 << (subbuf_order)); off++, page++) {
			if (p >= nr_pages)
				break;

			pages[p++] = page;
		}
		s++;
	}

Thanks.

-- Steve

> +
> +	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
> +
> +out:
> +	kfree(pages);
> +
> +	return err;
> +}
> +#else
> +static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
> +			struct vm_area_struct *vma)
> +{
> +	return -EOPNOTSUPP;
> +}
> +#endif
Re: [PATCH net-next v9] net/ipv4: add tracepoint for icmp_send
On Tue, 7 May 2024 15:41:03 +0800 (CST) wrote:

> From: Peilin He
>
> Introduce a tracepoint for icmp_send, which can help users to get more
> detailed information conveniently when icmp abnormal events happen.
>
> 1. Giving a usecase example:
> ============================
> When an application experiences packet loss due to an unreachable UDP
> destination port, the kernel will send an exception message through the
> icmp_send function. By adding a trace point for icmp_send, developers or
> system administrators can obtain detailed information about the UDP
> packet loss, including the type, code, source address, destination address,
> source port, and destination port. This facilitates the trouble-shooting
> of UDP packet loss issues, especially for those network-service
> applications.
>
> 2. Operation Instructions:
> ==========================
> Switch to the tracing directory.
>   cd /sys/kernel/tracing
> Filter for destination port unreachable.
>   echo "type==3 && code==3" > events/icmp/icmp_send/filter
> Enable trace event.
>   echo 1 > events/icmp/icmp_send/enable
>
> 3. Result View:
> ===============
>  udp_client_erro-11370 [002] ...s.12 124.728002:
>  icmp_send: icmp_send: type=3, code=3.
>  From 127.0.0.1:41895 to 127.0.0.1: ulen=23
>  skbaddr=589b167a
>
> Signed-off-by: Peilin He
> Signed-off-by: xu xin
> Reviewed-by: Yunkai Zhang
> Cc: Yang Yang
> Cc: Liu Chun
> Cc: Xuexin Jiang
> ---

From just a tracing point of view:

Reviewed-by: Steven Rostedt (Google)

-- Steve
Re: [PATCH v8 13/17] x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
On Sun, 5 May 2024 17:25:56 +0300
Mike Rapoport wrote:

> From: "Mike Rapoport (IBM)"
>
> Dynamic ftrace must allocate memory for code and this was impossible
> without CONFIG_MODULES.
>
> With execmem separated from the modules code, execmem_text_alloc() is
> available regardless of CONFIG_MODULES.
>
> Remove dependency of dynamic ftrace on CONFIG_MODULES and make
> CONFIG_DYNAMIC_FTRACE select CONFIG_EXECMEM in Kconfig.
>
> Signed-off-by: Mike Rapoport (IBM)
> ---
>  arch/x86/Kconfig         |  1 +
>  arch/x86/kernel/ftrace.c | 10 --
>  2 files changed, 1 insertion(+), 10 deletions(-)

Reviewed-by: Steven Rostedt (Google)

-- Steve
Re: [PATCH v8 06/17] mm: introduce execmem_alloc() and execmem_free()
On Sun, 5 May 2024 17:25:49 +0300
Mike Rapoport wrote:

> diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
> index 70139d9d2e01..c8ddb7abda7c 100644
> --- a/arch/x86/kernel/ftrace.c
> +++ b/arch/x86/kernel/ftrace.c
> @@ -25,6 +25,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include
>
> @@ -261,15 +262,14 @@ void arch_ftrace_update_code(int command)
>  #ifdef CONFIG_X86_64
>
>  #ifdef CONFIG_MODULES
> -#include
>  /* Module allocation simplifies allocating memory for code */
>  static inline void *alloc_tramp(unsigned long size)
>  {
> -	return module_alloc(size);
> +	return execmem_alloc(EXECMEM_FTRACE, size);
>  }
>  static inline void tramp_free(void *tramp)
>  {
> -	module_memfree(tramp);
> +	execmem_free(tramp);
>  }
>  #else
>  /* Trampolines can only be created if modules are supported */
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c

Acked-by: Steven Rostedt (Google)

-- Steve
Re: ftrace_direct_func_count ?
On Sat, 4 May 2024 13:35:26 +
"Dr. David Alan Gilbert" wrote:

> Hi,
>   I've just posted a patch 'ftrace: Remove unused list 'ftrace_direct_funcs''
> that clears out some old code, but while at it I noticed the global
> 'ftrace_direct_func_count'.
>
> As far as I can tell, it's never assigned (or initialised) but it is tested:
>
> kernel/trace/fgraph.c:
>   #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
>   	/*
>   	 * Skip graph tracing if the return location is served by direct trampoline,
>   	 * since call sequence and return addresses are unpredictable anyway.
>   	 * Ex: BPF trampoline may call original function and may skip frame
>   	 * depending on type of BPF programs attached.
>   	 */
>   	if (ftrace_direct_func_count &&
>   	    ftrace_find_rec_direct(ret - MCOUNT_INSN_SIZE))
>   		return -EBUSY;
>   #endif
>
> So I wasn't sure whether it was just safe to nuke that section
> or whether it really needed fixing?

Yes, after commit 8788ca164eb4bad ("ftrace: Remove the legacy
_ftrace_direct API") that variable is no longer used.

-- Steve
Re: [PATCH resend ftrace] Asynchronous grace period for register_ftrace_direct()
On Thu, 2 May 2024 16:13:59 -0700
"Paul E. McKenney" wrote:

> Very good, and thank you!
>
> I will drop it from RCU as soon as it shows up in either -next or in
> mainline.

Sounds good. I'm currently working on updates to get into -rc7 and plan
to add my next work on top of that (I know, I know, it's probably the
latest release I had for next, but things are still being worked on).

-- Steve
Re: [PATCH v2 1/2] tracing/user_events: Fix non-spaced field matching
On Thu, 2 May 2024 15:58:53 -0700
Beau Belgrave wrote:

> It's not an issue on the matching/logic. However, you do get an extra
> byte alloc (which doesn't bother me in this edge case).

Figured as much, but since there was no mention of it, I decided to
bring it up.

I'll take this as-is then.

-- Steve
Re: [PATCH resend ftrace] Asynchronous grace period for register_ftrace_direct()
On Wed, 1 May 2024 20:31:06 -0700
"Paul E. McKenney" wrote:

> On Thu, May 02, 2024 at 11:05:01AM +0900, Masami Hiramatsu wrote:
> > On Wed, 1 May 2024 16:12:37 -0700
> > "Paul E. McKenney" wrote:
> >
> > > Note that the immediate pressure for this patch should be relieved by the
> > > NAPI patch series [1], but this sort of problem could easily arise again.
> > >
> > > When running heavy test workloads with KASAN enabled, RCU Tasks grace
> > > periods can extend for many tens of seconds, significantly slowing
> > > trace registration. Therefore, make the registration-side RCU Tasks
> > > grace period be asynchronous via call_rcu_tasks().
> >
> > Good catch! AFAICS, there is no reason to wait for synchronization
> > when adding a new direct trampoline.
> > This looks good to me.
> >
> > Reviewed-by: Masami Hiramatsu (Google)
>
> Thank you very much! I will apply this on my next rebase.

I can take it. It's not a bug fix but just a performance improvement, so
it can go into the next merge window.

-- Steve

> > Thank you,
> >
> > > [1] https://lore.kernel.org/all/cover.1710877680.git@cloudflare.com/
> > >
> > > Reported-by: Jakub Kicinski
> > > Reported-by: Alexei Starovoitov
> > > Reported-by: Chris Mason
> > > Signed-off-by: Paul E. McKenney
> > > Cc: Steven Rostedt
> > > Cc: Masami Hiramatsu
> > > Cc: Mark Rutland
> > > Cc: Mathieu Desnoyers
> > > Cc:
> > >
> > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > > index 6c96b30f3d63b..32ea92934268c 100644
> > > --- a/kernel/trace/ftrace.c
> > > +++ b/kernel/trace/ftrace.c
> > > @@ -5365,6 +5365,13 @@ static void remove_direct_functions_hash(struct ftrace_hash *hash, unsigned long
> > >  	}
> > >  }
> > >
> > > +static void register_ftrace_direct_cb(struct rcu_head *rhp)
> > > +{
> > > +	struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu);
> > > +
> > > +	free_ftrace_hash(fhp);
> > > +}
> > > +
> > >  /**
> > >   * register_ftrace_direct - Call a custom trampoline directly
> > >   * for multiple functions registered in @ops
> > > @@ -5463,10 +5470,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, unsigned long addr)
> > >  out_unlock:
> > >  	mutex_unlock(&direct_mutex);
> > >
> > > -	if (free_hash && free_hash != EMPTY_HASH) {
> > > -		synchronize_rcu_tasks();
> > > -		free_ftrace_hash(free_hash);
> > > -	}
> > > +	if (free_hash && free_hash != EMPTY_HASH)
> > > +		call_rcu_tasks(&free_hash->rcu, register_ftrace_direct_cb);
> > >
> > >  	if (new_hash)
> > >  		free_ftrace_hash(new_hash);
> >
> > --
> > Masami Hiramatsu (Google)
Re: [PATCH v2 1/2] tracing/user_events: Fix non-spaced field matching
On Tue, 23 Apr 2024 16:23:37 +0000
Beau Belgrave wrote:

> When the ABI was updated to prevent same name w/different args, it
> missed an important corner case when fields don't end with a space.
> Typically, space is used for fields to help separate them, like
> "u8 field1; u8 field2". If no spaces are used, like
> "u8 field1;u8 field2", then the parsing works for the first time.
> However, the match check fails on a subsequent register, leading to
> confusion.
>
> This is because the match check uses argv_split() and assumes that all
> fields will be split upon the space. When spaces are used, we get back
> { "u8", "field1;" }, without spaces we get back { "u8", "field1;u8" }.
> This causes a mismatch, and the user program gets back -EADDRINUSE.
>
> Add a method to detect this case before calling argv_split(). If found,
> force a space after the field separator character ';'. This ensures all
> cases work properly for matching.
>
> With this fix, the following are all treated as matching:
>   u8 field1;u8 field2
>   u8 field1; u8 field2
>   u8 field1;\tu8 field2
>   u8 field1;\nu8 field2

I'm curious, what happens if you have: "u8 field1; u8 field2;" ?

Do you care? As you will then create "u8 field1; u8 field2; " but I'm
guessing the extra whitespace at the end doesn't affect anything.

> Fixes: ba470eebc2f6 ("tracing/user_events: Prevent same name but different args event")
> Signed-off-by: Beau Belgrave
> ---
>  kernel/trace/trace_events_user.c | 76 +++-
>  1 file changed, 75 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/trace/trace_events_user.c b/kernel/trace/trace_events_user.c
> index 70d428c394b6..82b191f33a28 100644
> --- a/kernel/trace/trace_events_user.c
> +++ b/kernel/trace/trace_events_user.c
> @@ -1989,6 +1989,80 @@ static int user_event_set_tp_name(struct user_event *user)
>  	return 0;
>  }
>
> +/*
> + * Counts how many ';' without a trailing space are in the args.
> + */
> +static int count_semis_no_space(char *args)
> +{
> +	int count = 0;
> +
> +	while ((args = strchr(args, ';'))) {
> +		args++;
> +
> +		if (!isspace(*args))
> +			count++;

This will also count a ';' at the end of the string (as in "..;").

This is most likely not an issue, but since I didn't see this case
anywhere, I figured I'd bring it up just to confirm that it's not an
issue.

-- Steve

> +	}
> +
> +	return count;
> +}
> +
> +/*
> + * Copies the arguments while ensuring all ';' have a trailing space.
> + */
> +static char *insert_space_after_semis(char *args, int count)
> +{
> +	char *fixed, *pos;
> +	int len;
> +
> +	len = strlen(args) + count;
> +	fixed = kmalloc(len + 1, GFP_KERNEL);
> +
> +	if (!fixed)
> +		return NULL;
> +
> +	pos = fixed;
> +
> +	/* Insert a space after ';' if there is no trailing space. */
> +	while (*args) {
> +		*pos = *args++;
> +
> +		if (*pos++ == ';' && !isspace(*args))
> +			*pos++ = ' ';
> +	}
> +
> +	*pos = '\0';
> +
> +	return fixed;
> +}
> +
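The two helpers quoted in the patch above can be exercised in a small userspace harness (a sketch using `malloc` in place of `kmalloc`; otherwise the logic follows the patch):

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Counts how many ';' without a trailing space are in the args. */
static int count_semis_no_space(const char *args)
{
	int count = 0;

	while ((args = strchr(args, ';'))) {
		args++;

		if (!isspace((unsigned char)*args))
			count++;
	}

	return count;
}

/* Copies the arguments while ensuring all ';' have a trailing space. */
static char *insert_space_after_semis(const char *args, int count)
{
	char *fixed, *pos;
	size_t len = strlen(args) + count;

	fixed = malloc(len + 1);	/* kmalloc() in the kernel patch */
	if (!fixed)
		return NULL;

	pos = fixed;

	while (*args) {
		*pos = *args++;

		if (*pos++ == ';' && !isspace((unsigned char)*args))
			*pos++ = ' ';
	}

	*pos = '\0';

	return fixed;
}
```

With this, `"u8 field1;u8 field2"` is normalized to `"u8 field1; u8 field2"`, so both spellings match the same registered event; an input that already has spaces after every ';' passes through unchanged.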
Re: [PATCH v3] ftrace: Fix possible use-after-free issue in ftrace_location()
On Wed, 17 Apr 2024 11:28:30 +0800
Zheng Yejian wrote:

> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index da1710499698..e05d3e3dc06a 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -1581,7 +1581,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, unsigned long end)
>  }
>
>  /**
> - * ftrace_location_range - return the first address of a traced location
> + * ftrace_location_range_rcu - return the first address of a traced location

kerneldoc comments are for external functions. You need to move this down
to ftrace_location_range() as here you are commenting a local static
function.

But I have to ask, why did you create this static function anyway?
There's only one user of it (the ftrace_location_range()). Why didn't
you just simply add the rcu locking there?

	unsigned long ftrace_location_range(unsigned long start, unsigned long end)
	{
		struct dyn_ftrace *rec;
		unsigned long ip = 0;

		rcu_read_lock();
		rec = lookup_rec(start, end);
		if (rec)
			ip = rec->ip;
		rcu_read_unlock();

		return ip;
	}

-- Steve

>  *	if it touches the given ip range
>  * @start: start of range to search.
>  * @end: end of range to search (inclusive). @end points to the last byte
> @@ -1592,7 +1592,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long start, unsigned long end)
>  * that is either a NOP or call to the function tracer. It checks the ftrace
>  * internal tables to determine if the address belongs or not.
>  */
> -unsigned long ftrace_location_range(unsigned long start, unsigned long end)
> +static unsigned long ftrace_location_range_rcu(unsigned long start, unsigned long end)
>  {
>  	struct dyn_ftrace *rec;
>
> @@ -1603,6 +1603,16 @@ unsigned long ftrace_location_range(unsigned long start, unsigned long end)
>  	return 0;
>  }
>
> +unsigned long ftrace_location_range(unsigned long start, unsigned long end)
> +{
> +	unsigned long loc;
> +
> +	rcu_read_lock();
> +	loc = ftrace_location_range_rcu(start, end);
> +	rcu_read_unlock();
> +	return loc;
> +}
[PATCH v3 6/6] eventfs: Have "events" directory get permissions from its parent
From: "Steven Rostedt (Google)"

The events directory gets its permissions from the root inode. But this
can cause an inconsistency if the instances directory changes its
permissions, as the permissions of the created directories under it
should inherit the permissions of the instances directory when
directories under it are created.

Currently the behavior is:

 # cd /sys/kernel/tracing
 # chgrp 1002 instances
 # mkdir instances/foo
 # ls -l instances/foo
 [..]
 -r--r-----  1 root lkp  0 May  1 18:55 buffer_total_size_kb
 -rw-r-----  1 root lkp  0 May  1 18:55 current_tracer
 -rw-r-----  1 root lkp  0 May  1 18:55 error_log
 drwxr-xr-x  1 root root 0 May  1 18:55 events
 --w-------  1 root lkp  0 May  1 18:55 free_buffer
 drwxr-x---  2 root lkp  0 May  1 18:55 options
 drwxr-x--- 10 root lkp  0 May  1 18:55 per_cpu
 -rw-r-----  1 root lkp  0 May  1 18:55 set_event

All the files and directories under "foo" have the "lkp" group except
the "events" directory. That's because it's getting its default value
from the mount point instead of its parent.

Have the "events" directory make its default value based on its parent's
permissions. That now gives:

 # ls -l instances/foo
 [..]
 -rw-r-----  1 root lkp  0 May  1 21:16 buffer_subbuf_size_kb
 -r--r-----  1 root lkp  0 May  1 21:16 buffer_total_size_kb
 -rw-r-----  1 root lkp  0 May  1 21:16 current_tracer
 -rw-r-----  1 root lkp  0 May  1 21:16 error_log
 drwxr-xr-x  1 root lkp  0 May  1 21:16 events
 --w-------  1 root lkp  0 May  1 21:16 free_buffer
 drwxr-x---  2 root lkp  0 May  1 21:16 options
 drwxr-x--- 10 root lkp  0 May  1 21:16 per_cpu
 -rw-r-----  1 root lkp  0 May  1 21:16 set_event

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 6e08405892ae..a878cea70f4c 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -37,6 +37,7 @@ static DEFINE_MUTEX(eventfs_mutex);
 
 struct eventfs_root_inode {
 	struct eventfs_inode		ei;
+	struct inode			*parent_inode;
 	struct dentry			*events_dir;
 };
 
@@ -226,12 +227,23 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
 
 static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb)
 {
-	struct inode *root;
+	struct eventfs_root_inode *rei;
+	struct inode *parent;
+
+	rei = get_root_inode(ei);
+
+	/* Use the parent inode permissions unless root set its permissions */
+	parent = rei->parent_inode;
 
-	/* Get the tracefs root inode. */
-	root = d_inode(sb->s_root);
-	ei->attr.uid = root->i_uid;
-	ei->attr.gid = root->i_gid;
+	if (rei->ei.attr.mode & EVENTFS_SAVE_UID)
+		ei->attr.uid = rei->ei.attr.uid;
+	else
+		ei->attr.uid = parent->i_uid;
+
+	if (rei->ei.attr.mode & EVENTFS_SAVE_GID)
+		ei->attr.gid = rei->ei.attr.gid;
+	else
+		ei->attr.gid = parent->i_gid;
 }
 
 static void set_top_events_ownership(struct inode *inode)
@@ -817,6 +829,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	// Note: we have a ref to the dentry from tracefs_start_creating()
 	rei = get_root_inode(ei);
 	rei->events_dir = dentry;
+	rei->parent_inode = d_inode(dentry->d_sb->s_root);
 
 	ei->entries = entries;
 	ei->nr_entries = size;
@@ -826,10 +839,15 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	uid = d_inode(dentry->d_parent)->i_uid;
 	gid = d_inode(dentry->d_parent)->i_gid;
 
-	/* This is used as the default ownership of the files and directories */
 	ei->attr.uid = uid;
 	ei->attr.gid = gid;
 
+	/*
+	 * When the "events" directory is created, it takes on the
+	 * permissions of its parent. But can be reset on remount.
+	 */
+	ei->attr.mode |= EVENTFS_SAVE_UID | EVENTFS_SAVE_GID;
+
 	INIT_LIST_HEAD(&ei->children);
 	INIT_LIST_HEAD(&ei->list);
-- 
2.43.0
[PATCH v3 5/6] eventfs: Do not treat events directory different than other directories
From: "Steven Rostedt (Google)"

Treat the events directory the same as other directories when it comes to permissions. The events directory was considered different because its dentry is persistent, whereas the other directory dentries are created when accessed. But with the way tracefs now derives default ownership from the root dentry's permissions, the events directory can get out of sync when a remount sets the group and user permissions.

Remove the special case for the events directory on setting the attributes. This allows the updates caused by remount to work properly as well as simplifying the code.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 9dacf65c0b6e..6e08405892ae 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -206,21 +206,7 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
	 * determined by the parent directory.
	 */
	if (dentry->d_inode->i_mode & S_IFDIR) {
-		/*
-		 * The events directory dentry is never freed, unless its
-		 * part of an instance that is deleted. It's attr is the
-		 * default for its child files and directories.
-		 * Do not update it. It's not used for its own mode or ownership.
-		 */
-		if (ei->is_events) {
-			/* But it still needs to know if it was modified */
-			if (iattr->ia_valid & ATTR_UID)
-				ei->attr.mode |= EVENTFS_SAVE_UID;
-			if (iattr->ia_valid & ATTR_GID)
-				ei->attr.mode |= EVENTFS_SAVE_GID;
-		} else {
-			update_attr(&ei->attr, iattr);
-		}
+		update_attr(&ei->attr, iattr);
	} else {
		name = dentry->d_name.name;
-- 
2.43.0
[PATCH v3 2/6] tracefs: Reset permissions on remount if permissions are options
From: "Steven Rostedt (Google)"

There's an inconsistency with the way permissions are handled in tracefs. Because the permissions are generated when accessed, they default to the root inode's permission if they were never set by the user. If the user sets the permissions, then a flag is set and the permissions are saved via the inode (for tracefs files) or an internal attribute field (for eventfs). But if a remount happens that specifies the permissions, all the files that were not changed by the user get updated, but the ones that were are not. If the user were to remount the file system with a given permission, then all files and directories within that file system should be updated.

This can cause security issues if a file's permission was updated but the admin forgot about it. They could incorrectly think that remounting with permissions set would update all files, but miss some. For example:

# cd /sys/kernel/tracing
# chgrp 1002 current_tracer
# ls -l
[..]
-rw-r- 1 root root 0 May 1 21:25 buffer_size_kb
-rw-r- 1 root root 0 May 1 21:25 buffer_subbuf_size_kb
-r--r- 1 root root 0 May 1 21:25 buffer_total_size_kb
-rw-r- 1 root lkp 0 May 1 21:25 current_tracer
-rw-r- 1 root root 0 May 1 21:25 dynamic_events
-r--r- 1 root root 0 May 1 21:25 dyn_ftrace_total_info
-r--r- 1 root root 0 May 1 21:25 enabled_functions

Where current_tracer now has group "lkp".

# mount -o remount,gid=1001 .
# ls -l
-rw-r- 1 root tracing 0 May 1 21:25 buffer_size_kb
-rw-r- 1 root tracing 0 May 1 21:25 buffer_subbuf_size_kb
-r--r- 1 root tracing 0 May 1 21:25 buffer_total_size_kb
-rw-r- 1 root lkp 0 May 1 21:25 current_tracer
-rw-r- 1 root tracing 0 May 1 21:25 dynamic_events
-r--r- 1 root tracing 0 May 1 21:25 dyn_ftrace_total_info
-r--r- 1 root tracing 0 May 1 21:25 enabled_functions

Everything changed but the "current_tracer".
Add a new linked list that keeps track of all the tracefs_inodes, which have the permission flags that tell if the file/dir should use the root inode's permission or not. Then on remount, clear all the flags so that the default behavior of using the root inode's permission is done for all files and directories.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 29 ++
 fs/tracefs/inode.c       | 65 +++-
 fs/tracefs/internal.h    | 7 -
 3 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cc8b838bbe62..15a2a9c3c62b 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -308,6 +308,35 @@ static const struct file_operations eventfs_file_operations = {
	.llseek		= generic_file_llseek,
 };

+/*
+ * On a remount of tracefs, if UID or GID options are set, then
+ * the mount point inode permissions should be used.
+ * Reset the saved permission flags appropriately.
+ */
+void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool update_gid)
+{
+	struct eventfs_inode *ei = ti->private;
+
+	if (!ei)
+		return;
+
+	if (update_uid)
+		ei->attr.mode &= ~EVENTFS_SAVE_UID;
+
+	if (update_gid)
+		ei->attr.mode &= ~EVENTFS_SAVE_GID;
+
+	if (!ei->entry_attrs)
+		return;
+
+	for (int i = 0; i < ei->nr_entries; i++) {
+		if (update_uid)
+			ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID;
+		if (update_gid)
+			ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID;
+	}
+}
+
 /* Return the evenfs_inode of the "events" directory */
 static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 5545e6bf7d26..52aa14bd2994 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -30,20 +30,47 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;

+/*
+ * Keep track of all tracefs_inodes in order to update their
+ * flags if necessary on a remount.
+ */
+static DEFINE_SPINLOCK(tracefs_inode_lock);
+static LIST_HEAD(tracefs_inodes);
+
 static struct inode *tracefs_alloc_inode(struct super_block *sb)
 {
	struct tracefs_inode *ti;
+	unsigned long flags;

	ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL);
	if (!ti)
		return NULL;

+	spin_lock_irqsave(&tracefs_inode_lock, flags);
+	list_add_rcu(&ti->list, &tracefs_inodes);
+	spin_unlock_irqrestore(&tracefs_inode_lock, flags);
+
	return &ti->vfs_inode;
 }

+static void tracefs_free_inode_rcu(struct rcu_head *rcu)
+{
+	struct tracefs_inode *ti;
+
+
[PATCH v3 4/6] eventfs: Do not differentiate the toplevel events directory
From: "Steven Rostedt (Google)" The toplevel events directory is really no different than the events directory of instances. Having the two be different caused inconsistencies and made it harder to fix the permissions bugs. Make all events directories act the same. Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 29 - fs/tracefs/internal.h| 7 +++ 2 files changed, 11 insertions(+), 25 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 15a2a9c3c62b..9dacf65c0b6e 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -68,7 +68,6 @@ enum { EVENTFS_SAVE_MODE = BIT(16), EVENTFS_SAVE_UID= BIT(17), EVENTFS_SAVE_GID= BIT(18), - EVENTFS_TOPLEVEL= BIT(19), }; #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) @@ -239,14 +238,10 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } -static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb) +static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb) { struct inode *root; - /* Only update if the "events" was on the top level */ - if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) - return; - /* Get the tracefs root inode. 
*/ root = d_inode(sb->s_root); ei->attr.uid = root->i_uid; @@ -259,10 +254,10 @@ static void set_top_events_ownership(struct inode *inode) struct eventfs_inode *ei = ti->private; /* The top events directory doesn't get automatically updated */ - if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + if (!ei || !ei->is_events) return; - update_top_events_attr(ei, inode->i_sb); + update_events_attr(ei, inode->i_sb); if (!(ei->attr.mode & EVENTFS_SAVE_UID)) inode->i_uid = ei->attr.uid; @@ -291,7 +286,7 @@ static int eventfs_permission(struct mnt_idmap *idmap, return generic_permission(idmap, inode, mask); } -static const struct inode_operations eventfs_root_dir_inode_operations = { +static const struct inode_operations eventfs_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, @@ -359,7 +354,7 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) // Walk upwards until you find the events inode } while (!ei->is_events); - update_top_events_attr(ei, dentry->d_sb); + update_events_attr(ei, dentry->d_sb); return ei; } @@ -465,7 +460,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, update_inode_attr(dentry, inode, >attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; /* All directories will have the same inode number */ @@ -845,14 +840,6 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry uid = d_inode(dentry->d_parent)->i_uid; gid = d_inode(dentry->d_parent)->i_gid; - /* -* If the events directory is of the top instance, then parent -* is NULL. Set the attr.mode to reflect this and its permissions will -* default to the tracefs root dentry. 
-*/ - if (!parent) - ei->attr.mode = EVENTFS_TOPLEVEL; - /* This is used as the default ownership of the files and directories */ ei->attr.uid = uid; ei->attr.gid = gid; @@ -861,13 +848,13 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry INIT_LIST_HEAD(>list); ti = get_tracefs(inode); - ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; + ti->flags |= TRACEFS_EVENT_INODE; ti->private = ei; inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO; inode->i_uid = uid; inode->i_gid = gid; - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; dentry->d_fsdata = get_ei(ei); diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 29f0c75b..f704d8348357 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -4,10 +4,9 @@ enum { TRACEFS_EVENT_INODE = BIT(1), - TRACEFS_EVENT_TOP_INODE = BIT(2), - TRACEFS_GID_PERM_SET=
[PATCH v3 1/6] eventfs: Free all of the eventfs_inode after RCU
From: "Steven Rostedt (Google)"

The eventfs_inode is freed via a kfree_rcu() callback, but the content of the eventfs_inode was being freed immediately after the last kref was released. This is dangerous, as changes are being made that can access the content of an eventfs_inode from an RCU loop.

Instead of using kfree_rcu(), use call_rcu() with a function that does all the freeing of the eventfs_inode after an RCU grace period has expired.

Cc: sta...@vger.kernel.org
Fixes: 43aa6f97c2d03 ("eventfs: Get rid of dentry pointers without refcounts")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index f5510e26f0f6..cc8b838bbe62 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -73,6 +73,21 @@ enum {

 #define EVENTFS_MODE_MASK	(EVENTFS_SAVE_MODE - 1)

+static void free_ei_rcu(struct rcu_head *rcu)
+{
+	struct eventfs_inode *ei = container_of(rcu, struct eventfs_inode, rcu);
+	struct eventfs_root_inode *rei;
+
+	kfree(ei->entry_attrs);
+	kfree_const(ei->name);
+	if (ei->is_events) {
+		rei = get_root_inode(ei);
+		kfree(rei);
+	} else {
+		kfree(ei);
+	}
+}
+
 /*
  * eventfs_inode reference count management.
  *
@@ -85,7 +100,6 @@ static void release_ei(struct kref *ref)
 {
	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
	const struct eventfs_entry *entry;
-	struct eventfs_root_inode *rei;

	WARN_ON_ONCE(!ei->is_freed);

@@ -95,14 +109,7 @@ static void release_ei(struct kref *ref)
		entry->release(entry->name, ei->data);
	}

-	kfree(ei->entry_attrs);
-	kfree_const(ei->name);
-	if (ei->is_events) {
-		rei = get_root_inode(ei);
-		kfree_rcu(rei, ei.rcu);
-	} else {
-		kfree_rcu(ei, rcu);
-	}
+	call_rcu(&ei->rcu, free_ei_rcu);
 }

 static inline void put_ei(struct eventfs_inode *ei)
-- 
2.43.0
[PATCH v3 3/6] tracefs: Still use mount point as default permissions for instances
From: "Steven Rostedt (Google)"

If the instances directory's permissions were never changed, then have it and its children use the mount point permissions as the default.

Currently, the permissions of instance directories are determined by the instance directory's permissions itself. But if the tracefs file system is remounted with new permissions, the instance directory and its children should use the new permission. Because both the instance directory and its children use the instance directory's inode for permissions, it misses the update.

To demonstrate this:

# cd /sys/kernel/tracing/
# mkdir instances/foo
# ls -ld instances/foo
drwxr-x--- 5 root root 0 May 1 19:07 instances/foo
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 18:57 instances
# ls -ld current_tracer
-rw-r- 1 root root 0 May 1 18:57 current_tracer

# mount -o remount,gid=1002 .
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 18:57 instances
# ls -ld instances/foo/
drwxr-x--- 5 root root 0 May 1 19:07 instances/foo/
# ls -ld current_tracer
-rw-r- 1 root lkp 0 May 1 18:57 current_tracer

Notice that changing the group id to that of "lkp" did not affect the instances directory nor its children. It should have been:

# ls -ld current_tracer
-rw-r- 1 root root 0 May 1 19:19 current_tracer
# ls -ld instances/foo/
drwxr-x--- 5 root root 0 May 1 19:25 instances/foo/
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 19:19 instances

# mount -o remount,gid=1002 .
# ls -ld current_tracer
-rw-r- 1 root lkp 0 May 1 19:19 current_tracer
# ls -ld instances
drwxr-x--- 3 root lkp 0 May 1 19:19 instances
# ls -ld instances/foo/
drwxr-x--- 5 root lkp 0 May 1 19:25 instances/foo/

Where all files were updated by the remount gid update.
Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/inode.c | 27 +-- 1 file changed, 25 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 52aa14bd2994..417c840e6403 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -180,16 +180,39 @@ static void set_tracefs_inode_owner(struct inode *inode) { struct tracefs_inode *ti = get_tracefs(inode); struct inode *root_inode = ti->private; + kuid_t uid; + kgid_t gid; + + uid = root_inode->i_uid; + gid = root_inode->i_gid; + + /* +* If the root is not the mount point, then check the root's +* permissions. If it was never set, then default to the +* mount point. +*/ + if (root_inode != d_inode(root_inode->i_sb->s_root)) { + struct tracefs_inode *rti; + + rti = get_tracefs(root_inode); + root_inode = d_inode(root_inode->i_sb->s_root); + + if (!(rti->flags & TRACEFS_UID_PERM_SET)) + uid = root_inode->i_uid; + + if (!(rti->flags & TRACEFS_GID_PERM_SET)) + gid = root_inode->i_gid; + } /* * If this inode has never been referenced, then update * the permissions to the superblock. */ if (!(ti->flags & TRACEFS_UID_PERM_SET)) - inode->i_uid = root_inode->i_uid; + inode->i_uid = uid; if (!(ti->flags & TRACEFS_GID_PERM_SET)) - inode->i_gid = root_inode->i_gid; + inode->i_gid = gid; } static int tracefs_permission(struct mnt_idmap *idmap, -- 2.43.0
[PATCH v3 0/6] tracefs/eventfs: Fix inconsistent permissions
The tracefs and eventfs permissions are created dynamically based on what the mount point inode or the instances directory inode has. But the way it worked had some inconsistencies that could lead to security issues, as the file system was not behaving like admins would expect. The files and directories could ignore the remount option that changes the gid or uid ownerships, leaving files susceptible to access that is not expected. This happens if a file had its value changed previously and then a remount changed all the files' permissions: the one that was changed previously would not be affected. This change set resolves these inconsistencies.

This also fixes the test_ownership.tc test, as it would pass the first time it is run but fail the second time because of the inconsistent state of the permissions. Now you can run that test multiple times and it will always pass.

Changes since v2: https://lore.kernel.org/linux-trace-kernel/20240502151547.973653...@goodmis.org/

- The eventfs_inode freeing was incorrect. The kref_put() would call release_ei(), which freed the contents of the eventfs_inode and then called kfree_rcu() on the eventfs_inode itself. The contents of the eventfs_inode need to be freed after the RCU synchronization as well. The patches here add even more cases where that's a requirement.

- Add an iput callback for the tracefs_inode to clear the TRACEFS_EVENT_INODE flag. This prevents the clearing of flags in remount from going into the eventfs_remount() function. An RCU grace cycle happens between the clearing of this flag and where the eventfs_inode is freed, so it is OK if the iteration is happening at the same time, as it is done under rcu_read_lock().

Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240502030024.062275...@goodmis.org/

- Testing showed that taking a mutex when freeing the tracefs_inode caused a lockdep splat, as it can happen in the RCU softirq context.
Convert the mutex to a spinlock for adding and removing the node from the linked list, and free the node via call_rcu() so that the iteration of the list only needs to be protected by rcu_read_lock().

Steven Rostedt (Google) (6):
      eventfs: Free all of the eventfs_inode after RCU
      tracefs: Reset permissions on remount if permissions are options
      tracefs: Still use mount point as default permissions for instances
      eventfs: Do not differentiate the toplevel events directory
      eventfs: Do not treat events directory different than other directories
      eventfs: Have "events" directory get permissions from its parent

 fs/tracefs/event_inode.c | 127 ---
 fs/tracefs/inode.c       | 92 --
 fs/tracefs/internal.h    | 14 --
 3 files changed, 175 insertions(+), 58 deletions(-)
Re: [PATCH v2 1/5] tracefs: Reset permissions on remount if permissions are options
On Thu, 02 May 2024 11:15:48 -0400 Steven Rostedt wrote: > +/* > + * On a remount of tracefs, if UID or GID options are set, then > + * the mount point inode permissions should be used. > + * Reset the saved permission flags appropriately. > + */ > +void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool > update_gid) > +{ > + struct eventfs_inode *ei = ti->private; And I just realized there's a race here too :-p I need to set ti->private = NULL before freeing the ei, and probably free the ei via RCU. -- Steve > + > + if (!ei) > + return; > + > + if (update_uid) > + ei->attr.mode &= ~EVENTFS_SAVE_UID; > + > + if (update_gid) > + ei->attr.mode &= ~EVENTFS_SAVE_GID; > + > + if (!ei->entry_attrs) > + return; > + > + for (int i = 0; i < ei->nr_entries; i++) { > + if (update_uid) > + ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID; > + if (update_gid) > + ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID; > + } > +} > +
[PATCH v2 5/5] eventfs: Have "events" directory get permissions from its parent
From: "Steven Rostedt (Google)"

The events directory gets its permissions from the root inode. But this can cause an inconsistency if the instances directory changes its permissions, as directories created under it should inherit the permissions of the instances directory. Currently the behavior is:

# cd /sys/kernel/tracing
# chgrp 1002 instances
# mkdir instances/foo
# ls -l instances/foo
[..]
-r--r- 1 root lkp 0 May 1 18:55 buffer_total_size_kb
-rw-r- 1 root lkp 0 May 1 18:55 current_tracer
-rw-r- 1 root lkp 0 May 1 18:55 error_log
drwxr-xr-x 1 root root 0 May 1 18:55 events
--w--- 1 root lkp 0 May 1 18:55 free_buffer
drwxr-x--- 2 root lkp 0 May 1 18:55 options
drwxr-x--- 10 root lkp 0 May 1 18:55 per_cpu
-rw-r- 1 root lkp 0 May 1 18:55 set_event

All the files and directories under "foo" have the "lkp" group except the "events" directory. That's because it's getting its default value from the mount point instead of its parent. Have the "events" directory make its default value based on its parent's permissions. That now gives:

# ls -l instances/foo
[..]
-rw-r- 1 root lkp 0 May 1 21:16 buffer_subbuf_size_kb -r--r- 1 root lkp 0 May 1 21:16 buffer_total_size_kb -rw-r- 1 root lkp 0 May 1 21:16 current_tracer -rw-r- 1 root lkp 0 May 1 21:16 error_log drwxr-xr-x 1 root lkp 0 May 1 21:16 events --w--- 1 root lkp 0 May 1 21:16 free_buffer drwxr-x--- 2 root lkp 0 May 1 21:16 options drwxr-x--- 10 root lkp 0 May 1 21:16 per_cpu -rw-r- 1 root lkp 0 May 1 21:16 set_event Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 30 -- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index ff4ee3dad56a..3ea32159d556 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -37,6 +37,7 @@ static DEFINE_MUTEX(eventfs_mutex); struct eventfs_root_inode { struct eventfs_inodeei; + struct inode*parent_inode; struct dentry *events_dir; }; @@ -219,12 +220,23 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb) { - struct inode *root; + struct eventfs_root_inode *rei; + struct inode *parent; + + rei = get_root_inode(ei); + + /* Use the parent inode permissions unless root set its permissions */ + parent = rei->parent_inode; - /* Get the tracefs root inode. 
*/ - root = d_inode(sb->s_root); - ei->attr.uid = root->i_uid; - ei->attr.gid = root->i_gid; + if (rei->ei.attr.mode & EVENTFS_SAVE_UID) + ei->attr.uid = rei->ei.attr.uid; + else + ei->attr.uid = parent->i_uid; + + if (rei->ei.attr.mode & EVENTFS_SAVE_GID) + ei->attr.gid = rei->ei.attr.gid; + else + ei->attr.gid = parent->i_gid; } static void set_top_events_ownership(struct inode *inode) @@ -810,6 +822,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry // Note: we have a ref to the dentry from tracefs_start_creating() rei = get_root_inode(ei); rei->events_dir = dentry; + rei->parent_inode = d_inode(dentry->d_sb->s_root); ei->entries = entries; ei->nr_entries = size; @@ -819,10 +832,15 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry uid = d_inode(dentry->d_parent)->i_uid; gid = d_inode(dentry->d_parent)->i_gid; - /* This is used as the default ownership of the files and directories */ ei->attr.uid = uid; ei->attr.gid = gid; + /* +* When the "events" directory is created, it takes on the +* permissions of its parent. But can be reset on remount. +*/ + ei->attr.mode |= EVENTFS_SAVE_UID | EVENTFS_SAVE_GID; + INIT_LIST_HEAD(>children); INIT_LIST_HEAD(>list); -- 2.43.0
[PATCH v2 4/5] eventfs: Do not treat events directory different than other directories
From: "Steven Rostedt (Google)"

Treat the events directory the same as other directories when it comes to permissions. The events directory was considered different because its dentry is persistent, whereas the other directory dentries are created when accessed. But with the way tracefs now derives default ownership from the root dentry's permissions, the events directory can get out of sync when a remount sets the group and user permissions.

Remove the special case for the events directory on setting the attributes. This allows the updates caused by remount to work properly as well as simplifying the code.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2971609946e5..ff4ee3dad56a 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -199,21 +199,7 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
	 * determined by the parent directory.
	 */
	if (dentry->d_inode->i_mode & S_IFDIR) {
-		/*
-		 * The events directory dentry is never freed, unless its
-		 * part of an instance that is deleted. It's attr is the
-		 * default for its child files and directories.
-		 * Do not update it. It's not used for its own mode or ownership.
-		 */
-		if (ei->is_events) {
-			/* But it still needs to know if it was modified */
-			if (iattr->ia_valid & ATTR_UID)
-				ei->attr.mode |= EVENTFS_SAVE_UID;
-			if (iattr->ia_valid & ATTR_GID)
-				ei->attr.mode |= EVENTFS_SAVE_GID;
-		} else {
-			update_attr(&ei->attr, iattr);
-		}
+		update_attr(&ei->attr, iattr);
	} else {
		name = dentry->d_name.name;
-- 
2.43.0
[PATCH v2 0/5] tracefs/eventfs: Fix inconsistent permissions
The tracefs and eventfs permissions are created dynamically based on what the mount point inode or the instances directory inode has. But the way it worked had some inconsistencies that could lead to security issues, as the file system was not behaving like admins would expect. The files and directories could ignore the remount option that changes the gid or uid ownerships, leaving files susceptible to access that is not expected. This happens if a file had its value changed previously and then a remount changed all the files' permissions: the one that was changed previously would not be affected. This change set resolves these inconsistencies.

This also fixes the test_ownership.tc test, as it would pass the first time it is run but fail the second time because of the inconsistent state of the permissions. Now you can run that test multiple times and it will always pass.

Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240502030024.062275...@goodmis.org/

- Testing showed that taking a mutex when freeing the tracefs_inode caused a lockdep splat, as it can happen in the RCU softirq context. Convert the mutex to a spinlock for adding and removing the node from the linked list, and free the node via call_rcu() so that the iteration of the list only needs to be protected by rcu_read_lock().

Steven Rostedt (Google) (5):
      tracefs: Reset permissions on remount if permissions are options
      tracefs: Still use mount point as default permissions for instances
      eventfs: Do not differentiate the toplevel events directory
      eventfs: Do not treat events directory different than other directories
      eventfs: Have "events" directory get permissions from its parent

 fs/tracefs/event_inode.c | 102 ---
 fs/tracefs/inode.c       | 77 +--
 fs/tracefs/internal.h    | 14 ---
 3 files changed, 144 insertions(+), 49 deletions(-)
[PATCH v2 2/5] tracefs: Still use mount point as default permissions for instances
From: "Steven Rostedt (Google)"

If the instances directory's permissions were never changed, then have it and its children use the mount point permissions as the default.

Currently, the permissions of instance directories are determined by the instance directory's permissions itself. But if the tracefs file system is remounted with new permissions, the instance directory and its children should use the new permission. Because both the instance directory and its children use the instance directory's inode for permissions, it misses the update.

To demonstrate this:

# cd /sys/kernel/tracing/
# mkdir instances/foo
# ls -ld instances/foo
drwxr-x--- 5 root root 0 May 1 19:07 instances/foo
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 18:57 instances
# ls -ld current_tracer
-rw-r- 1 root root 0 May 1 18:57 current_tracer

# mount -o remount,gid=1002 .
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 18:57 instances
# ls -ld instances/foo/
drwxr-x--- 5 root root 0 May 1 19:07 instances/foo/
# ls -ld current_tracer
-rw-r- 1 root lkp 0 May 1 18:57 current_tracer

Notice that changing the group id to that of "lkp" did not affect the instances directory nor its children. It should have been:

# ls -ld current_tracer
-rw-r- 1 root root 0 May 1 19:19 current_tracer
# ls -ld instances/foo/
drwxr-x--- 5 root root 0 May 1 19:25 instances/foo/
# ls -ld instances
drwxr-x--- 3 root root 0 May 1 19:19 instances

# mount -o remount,gid=1002 .
# ls -ld current_tracer
-rw-r- 1 root lkp 0 May 1 19:19 current_tracer
# ls -ld instances
drwxr-x--- 3 root lkp 0 May 1 19:19 instances
# ls -ld instances/foo/
drwxr-x--- 5 root lkp 0 May 1 19:25 instances/foo/

Where all files were updated by the remount gid update.
Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/inode.c | 27 +-- 1 file changed, 25 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 5f2417df1c43..e1bf65efdbb3 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -180,16 +180,39 @@ static void set_tracefs_inode_owner(struct inode *inode) { struct tracefs_inode *ti = get_tracefs(inode); struct inode *root_inode = ti->private; + kuid_t uid; + kgid_t gid; + + uid = root_inode->i_uid; + gid = root_inode->i_gid; + + /* +* If the root is not the mount point, then check the root's +* permissions. If it was never set, then default to the +* mount point. +*/ + if (root_inode != d_inode(root_inode->i_sb->s_root)) { + struct tracefs_inode *rti; + + rti = get_tracefs(root_inode); + root_inode = d_inode(root_inode->i_sb->s_root); + + if (!(rti->flags & TRACEFS_UID_PERM_SET)) + uid = root_inode->i_uid; + + if (!(rti->flags & TRACEFS_GID_PERM_SET)) + gid = root_inode->i_gid; + } /* * If this inode has never been referenced, then update * the permissions to the superblock. */ if (!(ti->flags & TRACEFS_UID_PERM_SET)) - inode->i_uid = root_inode->i_uid; + inode->i_uid = uid; if (!(ti->flags & TRACEFS_GID_PERM_SET)) - inode->i_gid = root_inode->i_gid; + inode->i_gid = gid; } static int tracefs_permission(struct mnt_idmap *idmap, -- 2.43.0
[PATCH v2 3/5] eventfs: Do not differentiate the toplevel events directory
From: "Steven Rostedt (Google)" The toplevel events directory is really no different than the events directory of instances. Having the two be different caused inconsistencies and made it harder to fix the permissions bugs. Make all events directories act the same. Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 29 - fs/tracefs/internal.h| 7 +++ 2 files changed, 11 insertions(+), 25 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index a754c6725a52..2971609946e5 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -68,7 +68,6 @@ enum { EVENTFS_SAVE_MODE = BIT(16), EVENTFS_SAVE_UID= BIT(17), EVENTFS_SAVE_GID= BIT(18), - EVENTFS_TOPLEVEL= BIT(19), }; #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) @@ -232,14 +231,10 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } -static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb) +static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb) { struct inode *root; - /* Only update if the "events" was on the top level */ - if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) - return; - /* Get the tracefs root inode. 
*/ root = d_inode(sb->s_root); ei->attr.uid = root->i_uid; @@ -252,10 +247,10 @@ static void set_top_events_ownership(struct inode *inode) struct eventfs_inode *ei = ti->private; /* The top events directory doesn't get automatically updated */ - if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + if (!ei || !ei->is_events) return; - update_top_events_attr(ei, inode->i_sb); + update_events_attr(ei, inode->i_sb); if (!(ei->attr.mode & EVENTFS_SAVE_UID)) inode->i_uid = ei->attr.uid; @@ -284,7 +279,7 @@ static int eventfs_permission(struct mnt_idmap *idmap, return generic_permission(idmap, inode, mask); } -static const struct inode_operations eventfs_root_dir_inode_operations = { +static const struct inode_operations eventfs_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, @@ -352,7 +347,7 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) // Walk upwards until you find the events inode } while (!ei->is_events); - update_top_events_attr(ei, dentry->d_sb); + update_events_attr(ei, dentry->d_sb); return ei; } @@ -458,7 +453,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, update_inode_attr(dentry, inode, >attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; /* All directories will have the same inode number */ @@ -838,14 +833,6 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry uid = d_inode(dentry->d_parent)->i_uid; gid = d_inode(dentry->d_parent)->i_gid; - /* -* If the events directory is of the top instance, then parent -* is NULL. Set the attr.mode to reflect this and its permissions will -* default to the tracefs root dentry. 
-*/ - if (!parent) - ei->attr.mode = EVENTFS_TOPLEVEL; - /* This is used as the default ownership of the files and directories */ ei->attr.uid = uid; ei->attr.gid = gid; @@ -854,13 +841,13 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry INIT_LIST_HEAD(>list); ti = get_tracefs(inode); - ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; + ti->flags |= TRACEFS_EVENT_INODE; ti->private = ei; inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO; inode->i_uid = uid; inode->i_gid = gid; - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; dentry->d_fsdata = get_ei(ei); diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 29f0c75b..f704d8348357 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -4,10 +4,9 @@ enum { TRACEFS_EVENT_INODE = BIT(1), - TRACEFS_EVENT_TOP_INODE = BIT(2), - TRACEFS_GID_PERM_SET=
[PATCH v2 1/5] tracefs: Reset permissions on remount if permissions are options
From: "Steven Rostedt (Google)" There's an inconsistency with the way permissions are handled in tracefs. Because the permissions are generated when accessed, they default to the root inode's permission if they were never set by the user. If the user sets the permissions, then a flag is set and the permissions are saved via the inode (for tracefs files) or an internal attribute field (for eventfs). But if a remount happens that specify the permissions, all the files that were not changed by the user gets updated, but the ones that were are not. If the user were to remount the file system with a given permission, then all files and directories within that file system should be updated. This can cause security issues if a file's permission was updated but the admin forgot about it. They could incorrectly think that remounting with permissions set would update all files, but miss some. For example: # cd /sys/kernel/tracing # chgrp 1002 current_tracer # ls -l [..] -rw-r- 1 root root 0 May 1 21:25 buffer_size_kb -rw-r- 1 root root 0 May 1 21:25 buffer_subbuf_size_kb -r--r- 1 root root 0 May 1 21:25 buffer_total_size_kb -rw-r- 1 root lkp 0 May 1 21:25 current_tracer -rw-r- 1 root root 0 May 1 21:25 dynamic_events -r--r- 1 root root 0 May 1 21:25 dyn_ftrace_total_info -r--r- 1 root root 0 May 1 21:25 enabled_functions Where current_tracer now has group "lkp". # mount -o remount,gid=1001 . # ls -l -rw-r- 1 root tracing 0 May 1 21:25 buffer_size_kb -rw-r- 1 root tracing 0 May 1 21:25 buffer_subbuf_size_kb -r--r- 1 root tracing 0 May 1 21:25 buffer_total_size_kb -rw-r- 1 root lkp 0 May 1 21:25 current_tracer -rw-r- 1 root tracing 0 May 1 21:25 dynamic_events -r--r- 1 root tracing 0 May 1 21:25 dyn_ftrace_total_info -r--r- 1 root tracing 0 May 1 21:25 enabled_functions Everything changed but the "current_tracer". 
Add a new link list that keeps track of all the tracefs_inodes which has the permission flags that tell if the file/dir should use the root inode's permission or not. Then on remount, clear all the flags so that the default behavior of using the root inode's permission is done for all files and directories. Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 29 +++ fs/tracefs/inode.c | 50 +++- fs/tracefs/internal.h| 7 +- 3 files changed, 84 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index f5510e26f0f6..a754c6725a52 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -301,6 +301,35 @@ static const struct file_operations eventfs_file_operations = { .llseek = generic_file_llseek, }; +/* + * On a remount of tracefs, if UID or GID options are set, then + * the mount point inode permissions should be used. + * Reset the saved permission flags appropriately. 
+ */ +void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool update_gid) +{ + struct eventfs_inode *ei = ti->private; + + if (!ei) + return; + + if (update_uid) + ei->attr.mode &= ~EVENTFS_SAVE_UID; + + if (update_gid) + ei->attr.mode &= ~EVENTFS_SAVE_GID; + + if (!ei->entry_attrs) + return; + + for (int i = 0; i < ei->nr_entries; i++) { + if (update_uid) + ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID; + if (update_gid) + ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID; + } +} + /* Return the evenfs_inode of the "events" directory */ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) { diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 5545e6bf7d26..5f2417df1c43 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -30,20 +30,47 @@ static struct vfsmount *tracefs_mount; static int tracefs_mount_count; static bool tracefs_registered; +/* + * Keep track of all tracefs_inodes in order to update their + * flags if necessary on a remount. + */ +static DEFINE_SPINLOCK(tracefs_inode_lock); +static LIST_HEAD(tracefs_inodes); + static struct inode *tracefs_alloc_inode(struct super_block *sb) { struct tracefs_inode *ti; + unsigned long flags; ti = kmem_cache_alloc(tracefs_inode_cachep, GFP_KERNEL); if (!ti) return NULL; + spin_lock_irqsave(_inode_lock, flags); + list_add_rcu(>list, _inodes); + spin_unlock_irqrestore(_inode_lock, flags); + return >vfs_inode; } +static void tracefs_free_inode_rcu(struct rcu_head *rcu) +{ + struct tracefs
Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr
On Thu, 2 May 2024 06:49:18 + Tze-nan Wu (吳澤南) wrote: > Good news, this patch works, the test has passed, no more Kasan report > in my environment. Great to hear! > > my environment: > arm64 + kasan + swtag based kasan + kernel-6.6.18 > > Really appreciate, and learn a lot from the patch. Can I add: Tested-by: Tze-nan Wu (吳澤南) ? -- Steve
Re: [PATCH v22 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Thu, 2 May 2024 14:38:32 +0100 Vincent Donnefort wrote:

> > > +	while (s < nr_subbufs && p < nr_pages) {
> > > +		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
> > > +		int off = 0;
> > > +
> > > +		for (; off < (1 << (subbuf_order)); off++, page++) {
> > > +			if (p >= nr_pages)
> > > +				break;
> > > +
> > > +			pages[p++] = page;
> > > +		}
> > > +		s++;
> > > +	}
> > > +
> > > +	err = vm_insert_pages(vma, vma->vm_start, pages, &nr_pages);
> >
> > Nit: I did not immediately understand if we could end here with p < nr_pages
> > (IOW, pages[] not completely filled).
> >
> > One source of confusion is the "s < nr_subbufs" check in the while loop: why
> > is "p < nr_pages" insufficient?
>
> Hum, indeed, the "s < nr_subbufs" check is superfluous, nr_pages is already
> capped by the number of subbufs, there's no way we can overflow subbuf_ids[].

We can keep it as is, or perhaps change it to:

	while (p < nr_pages) {
		struct page *page;
		int off = 0;

		if (WARN_ON_ONCE(s >= nr_subbufs))
			break;

		page = virt_to_page(cpu_buffer->subbuf_ids[s]);

		for (; off < (1 << (subbuf_order)); off++, page++) {
			if (p >= nr_pages)
				break;

			pages[p++] = page;
		}
		s++;
	}

I don't like having an unchecked dependency between s and p.

-- Steve
Re: [PATCH] eventfs/tracing: Add callback for release of an eventfs_inode
On Wed, 1 May 2024 23:56:26 +0900 Masami Hiramatsu (Google) wrote:

> Looks good to me.
>
> Reviewed-by: Masami Hiramatsu (Google)

Thanks Masami. Tze-nan pointed out an issue with this patch, though, so I just published v2. Can you review that one too?

Thanks,

-- Steve
[PATCH v2] eventfs/tracing: Add callback for release of an eventfs_inode
From: "Steven Rostedt (Google)" Synthetic events create and destroy tracefs files when they are created and removed. The tracing subsystem has its own file descriptor representing the state of the events attached to the tracefs files. There's a race between the eventfs files and this file descriptor of the tracing system where the following can cause an issue: With two scripts 'A' and 'B' doing: Script 'A': echo "hello int aaa" > /sys/kernel/tracing/synthetic_events while : do echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable done Script 'B': echo > /sys/kernel/tracing/synthetic_events Script 'A' creates a synthetic event "hello" and then just writes zero into its enable file. Script 'B' removes all synthetic events (including the newly created "hello" event). What happens is that the opening of the "enable" file has: { struct trace_event_file *file = inode->i_private; int ret; ret = tracing_check_open_get_tr(file->tr); [..] But deleting the events frees the "file" descriptor, and a "use after free" happens with the dereference at "file->tr". The file descriptor does have a reference counter, but there needs to be a way to decrement it from the eventfs when the eventfs_inode is removed that represents this file descriptor. Add an optional "release" callback to the eventfs_entry array structure, that gets called when the eventfs file is about to be removed. This allows for the creating on the eventfs file to increment the tracing file descriptor ref counter. When the eventfs file is deleted, it can call the release function that will call the put function for the tracing file descriptor. This will protect the tracing file from being freed while a eventfs file that references it is being opened. 
Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-tze-nan...@mediatek.com/ Cc: sta...@vger.kernel.org Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode") Reported-by: Tze-nan wu Signed-off-by: Steven Rostedt (Google) --- Changes since v1: https://lore.kernel.org/linux-trace-kernel/20240430142327.7bff8...@gandalf.local.home/ - Do not call release function on error creating the eventfs_inode fs/tracefs/event_inode.c| 23 +-- include/linux/tracefs.h | 3 +++ kernel/trace/trace_events.c | 12 3 files changed, 36 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 894c6ca1e500..f5510e26f0f6 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -84,10 +84,17 @@ enum { static void release_ei(struct kref *ref) { struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref); + const struct eventfs_entry *entry; struct eventfs_root_inode *rei; WARN_ON_ONCE(!ei->is_freed); + for (int i = 0; i < ei->nr_entries; i++) { + entry = >entries[i]; + if (entry->release) + entry->release(entry->name, ei->data); + } + kfree(ei->entry_attrs); kfree_const(ei->name); if (ei->is_events) { @@ -112,6 +119,18 @@ static inline void free_ei(struct eventfs_inode *ei) } } +/* + * Called when creation of an ei fails, do not call release() functions. + */ +static inline void cleanup_ei(struct eventfs_inode *ei) +{ + if (ei) { + /* Set nr_entries to 0 to prevent release() function being called */ + ei->nr_entries = 0; + free_ei(ei); + } +} + static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei) { if (ei) @@ -734,7 +753,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode /* Was the parent freed? 
*/ if (list_empty(>list)) { - free_ei(ei); + cleanup_ei(ei); ei = NULL; } return ei; @@ -835,7 +854,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry return ei; fail: - free_ei(ei); + cleanup_ei(ei); tracefs_failed_creating(dentry); return ERR_PTR(-ENOMEM); } diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h index 7a5fe17b6bf9..d03f74658716 100644 --- a/include/linux/tracefs.h +++ b/include/linux/tracefs.h @@ -62,6 +62,8 @@ struct eventfs_file; typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, const struct file_operations **fops); +typedef void (*eventfs_release)(const char *name, void *data); + /** * struct eventfs_entry - dynamically created eventfs file call back handler * @name: Then name of the dynamic file in an eventfs directory @@ -72,6 +74,7 @@ typedef int (*eve
Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr
On Thu, 2 May 2024 03:10:24 + Tze-nan Wu (吳澤南) wrote: > > > Sorry for my late reply, I'm testing the patch on my machine now. > Test will be done in four hours. > > There's something I'm worrying about in the patch, > what I'm worrying about is commented in the code below. > > /kernel/trace/trace_events.c: > static int > event_create_dir(struct eventfs_inode *parent, > struct trace_event_file *file) > { > ... > ... > ... > nr_entries = ARRAY_SIZE(event_entries); > > name = trace_event_name(call); > > +event_file_get(file);// Line A > ^ > // Should we move the "event_file_get" to here, instead > // of calling it at line C? > // Due to Line B could eventually invoke "event_file_put". > // eventfs_create_dir -> free_ei ->put_ei -> kref_put > // -> release_ei -> event_release -> event_file_put > // Not sure if this is a potential risk? If Line B do call > // event_file_put,"event_file_put" will be called prior to > // "event_file_get", could corrupt the reference of the file. No, but you do bring up a good point. The release should not be called on error, but it looks like it possibly can be. > > ei = eventfs_create_dir(name, e_events,// Line B > event_entries, nr_entries, file); > if (IS_ERR(ei)) { > pr_warn("Could not create tracefs '%s' directory\n", > name); > return -1; > } > file->ei = ei; > > ret = event_define_fields(call); > if (ret < 0) { > pr_warn("Could not initialize trace point events/%s\n", > name); > return ret; >^ >// Maybe we chould have similar concern if we return here. >// Due to the event_inode had been created, but we did not call >// event_file_get. >// Could it lead to some issues in the future while freeing >// event_indoe? > } > > > -event_file_get(file); //Line C > return 0; > } This prevents the release() function from being called on failure of creating the ei. Can you try this patch instead? 
-- Steve diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 894c6ca1e500..f5510e26f0f6 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -84,10 +84,17 @@ enum { static void release_ei(struct kref *ref) { struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref); + const struct eventfs_entry *entry; struct eventfs_root_inode *rei; WARN_ON_ONCE(!ei->is_freed); + for (int i = 0; i < ei->nr_entries; i++) { + entry = >entries[i]; + if (entry->release) + entry->release(entry->name, ei->data); + } + kfree(ei->entry_attrs); kfree_const(ei->name); if (ei->is_events) { @@ -112,6 +119,18 @@ static inline void free_ei(struct eventfs_inode *ei) } } +/* + * Called when creation of an ei fails, do not call release() functions. + */ +static inline void cleanup_ei(struct eventfs_inode *ei) +{ + if (ei) { + /* Set nr_entries to 0 to prevent release() function being called */ + ei->nr_entries = 0; + free_ei(ei); + } +} + static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei) { if (ei) @@ -734,7 +753,7 @@ struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode /* Was the parent freed? 
*/ if (list_empty(>list)) { - free_ei(ei); + cleanup_ei(ei); ei = NULL; } return ei; @@ -835,7 +854,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry return ei; fail: - free_ei(ei); + cleanup_ei(ei); tracefs_failed_creating(dentry); return ERR_PTR(-ENOMEM); } diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h index 7a5fe17b6bf9..d03f74658716 100644 --- a/include/linux/tracefs.h +++ b/include/linux/tracefs.h @@ -62,6 +62,8 @@ struct eventfs_file; typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, const struct file_operations **fops); +typedef void (*eventfs_release)(const char *name, void *data); + /** * struct eventfs_entry - dynamically created eventfs file call back handler * @name: Then name of the dynamic file in an eventfs directory @@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, struct eventfs_entry { const char *name; eventfs_callbackcallback; + eventfs_release release; }; struct eventfs_inode; diff --git a/kernel/trace/trace_events.c
[PATCH 5/5] eventfs: Have "events" directory get permissions from its parent
From: "Steven Rostedt (Google)" The events directory gets its permissions from the root inode. But this can cause an inconsistency if the instances directory changes its permissions, as the permissions of the created directories under it should inherit the permissions of the instances directory when directories under it are created. Currently the behavior is: # cd /sys/kernel/tracing # chgrp 1002 instances # mkdir instances/foo # ls -l instances/foo [..] -r--r- 1 root lkp 0 May 1 18:55 buffer_total_size_kb -rw-r- 1 root lkp 0 May 1 18:55 current_tracer -rw-r- 1 root lkp 0 May 1 18:55 error_log drwxr-xr-x 1 root root 0 May 1 18:55 events --w--- 1 root lkp 0 May 1 18:55 free_buffer drwxr-x--- 2 root lkp 0 May 1 18:55 options drwxr-x--- 10 root lkp 0 May 1 18:55 per_cpu -rw-r- 1 root lkp 0 May 1 18:55 set_event All the files and directories under "foo" has the "lkp" group except the "events" directory. That's because its getting its default value from the mount point instead of its parent. Have the "events" directory make its default value based on its parent's permissions. That now gives: # ls -l instances/foo [..] 
-rw-r- 1 root lkp 0 May 1 21:16 buffer_subbuf_size_kb -r--r- 1 root lkp 0 May 1 21:16 buffer_total_size_kb -rw-r- 1 root lkp 0 May 1 21:16 current_tracer -rw-r- 1 root lkp 0 May 1 21:16 error_log drwxr-xr-x 1 root lkp 0 May 1 21:16 events --w--- 1 root lkp 0 May 1 21:16 free_buffer drwxr-x--- 2 root lkp 0 May 1 21:16 options drwxr-x--- 10 root lkp 0 May 1 21:16 per_cpu -rw-r- 1 root lkp 0 May 1 21:16 set_event Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 30 -- 1 file changed, 24 insertions(+), 6 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 624a0e4a8e29..f77800398a54 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -37,6 +37,7 @@ static DEFINE_MUTEX(eventfs_mutex); struct eventfs_root_inode { struct eventfs_inodeei; + struct inode*parent_inode; struct dentry *events_dir; }; @@ -207,12 +208,23 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb) { - struct inode *root; + struct eventfs_root_inode *rei; + struct inode *parent; + + rei = get_root_inode(ei); + + /* Use the parent inode permissions unless root set its permissions */ + parent = rei->parent_inode; - /* Get the tracefs root inode. 
*/ - root = d_inode(sb->s_root); - ei->attr.uid = root->i_uid; - ei->attr.gid = root->i_gid; + if (rei->ei.attr.mode & EVENTFS_SAVE_UID) + ei->attr.uid = rei->ei.attr.uid; + else + ei->attr.uid = parent->i_uid; + + if (rei->ei.attr.mode & EVENTFS_SAVE_GID) + ei->attr.gid = rei->ei.attr.gid; + else + ei->attr.gid = parent->i_gid; } static void set_top_events_ownership(struct inode *inode) @@ -798,6 +810,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry // Note: we have a ref to the dentry from tracefs_start_creating() rei = get_root_inode(ei); rei->events_dir = dentry; + rei->parent_inode = d_inode(dentry->d_sb->s_root); ei->entries = entries; ei->nr_entries = size; @@ -807,10 +820,15 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry uid = d_inode(dentry->d_parent)->i_uid; gid = d_inode(dentry->d_parent)->i_gid; - /* This is used as the default ownership of the files and directories */ ei->attr.uid = uid; ei->attr.gid = gid; + /* +* When the "events" directory is created, it takes on the +* permissions of its parent. But can be reset on remount. +*/ + ei->attr.mode |= EVENTFS_SAVE_UID | EVENTFS_SAVE_GID; + INIT_LIST_HEAD(>children); INIT_LIST_HEAD(>list); -- 2.43.0
[PATCH 4/5] eventfs: Do not treat events directory different than other directories
From: "Steven Rostedt (Google)" Treat the events directory the same as other directories when it comes to permissions. The events directory was considered different because it's dentry is persistent, whereas the other directory dentries are created when accessed. But the way tracefs now does its ownership by using the root dentry's permissions as the default permissions, the events directory can get out of sync when a remount is performed setting the group and user permissions. Remove the special case for the events directory on setting the attributes. This allows the updates caused by remount to work properly as well as simplifies the code. Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 16 +--- 1 file changed, 1 insertion(+), 15 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index 2c33262107c6..624a0e4a8e29 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -187,21 +187,7 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, * determined by the parent directory. */ if (dentry->d_inode->i_mode & S_IFDIR) { - /* -* The events directory dentry is never freed, unless its -* part of an instance that is deleted. It's attr is the -* default for its child files and directories. -* Do not update it. It's not used for its own mode or ownership. -*/ - if (ei->is_events) { - /* But it still needs to know if it was modified */ - if (iattr->ia_valid & ATTR_UID) - ei->attr.mode |= EVENTFS_SAVE_UID; - if (iattr->ia_valid & ATTR_GID) - ei->attr.mode |= EVENTFS_SAVE_GID; - } else { - update_attr(>attr, iattr); - } + update_attr(>attr, iattr); } else { name = dentry->d_name.name; -- 2.43.0
[PATCH 3/5] eventfs: Do not differentiate the toplevel events directory
From: "Steven Rostedt (Google)" The toplevel events directory is really no different than the events directory of instances. Having the two be different caused inconsistencies and made it harder to fix the permissions bugs. Make all events directories act the same. Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/event_inode.c | 29 - fs/tracefs/internal.h| 7 +++ 2 files changed, 11 insertions(+), 25 deletions(-) diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c index db71fed7057d..2c33262107c6 100644 --- a/fs/tracefs/event_inode.c +++ b/fs/tracefs/event_inode.c @@ -68,7 +68,6 @@ enum { EVENTFS_SAVE_MODE = BIT(16), EVENTFS_SAVE_UID= BIT(17), EVENTFS_SAVE_GID= BIT(18), - EVENTFS_TOPLEVEL= BIT(19), }; #define EVENTFS_MODE_MASK (EVENTFS_SAVE_MODE - 1) @@ -220,14 +219,10 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry, return ret; } -static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb) +static void update_events_attr(struct eventfs_inode *ei, struct super_block *sb) { struct inode *root; - /* Only update if the "events" was on the top level */ - if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL)) - return; - /* Get the tracefs root inode. 
*/ root = d_inode(sb->s_root); ei->attr.uid = root->i_uid; @@ -240,10 +235,10 @@ static void set_top_events_ownership(struct inode *inode) struct eventfs_inode *ei = ti->private; /* The top events directory doesn't get automatically updated */ - if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL)) + if (!ei || !ei->is_events) return; - update_top_events_attr(ei, inode->i_sb); + update_events_attr(ei, inode->i_sb); if (!(ei->attr.mode & EVENTFS_SAVE_UID)) inode->i_uid = ei->attr.uid; @@ -272,7 +267,7 @@ static int eventfs_permission(struct mnt_idmap *idmap, return generic_permission(idmap, inode, mask); } -static const struct inode_operations eventfs_root_dir_inode_operations = { +static const struct inode_operations eventfs_dir_inode_operations = { .lookup = eventfs_root_lookup, .setattr= eventfs_set_attr, .getattr= eventfs_get_attr, @@ -340,7 +335,7 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry) // Walk upwards until you find the events inode } while (!ei->is_events); - update_top_events_attr(ei, dentry->d_sb); + update_events_attr(ei, dentry->d_sb); return ei; } @@ -446,7 +441,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry, update_inode_attr(dentry, inode, >attr, S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO); - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; /* All directories will have the same inode number */ @@ -826,14 +821,6 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry uid = d_inode(dentry->d_parent)->i_uid; gid = d_inode(dentry->d_parent)->i_gid; - /* -* If the events directory is of the top instance, then parent -* is NULL. Set the attr.mode to reflect this and its permissions will -* default to the tracefs root dentry. 
-*/ - if (!parent) - ei->attr.mode = EVENTFS_TOPLEVEL; - /* This is used as the default ownership of the files and directories */ ei->attr.uid = uid; ei->attr.gid = gid; @@ -842,13 +829,13 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry INIT_LIST_HEAD(>list); ti = get_tracefs(inode); - ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE; + ti->flags |= TRACEFS_EVENT_INODE; ti->private = ei; inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO; inode->i_uid = uid; inode->i_gid = gid; - inode->i_op = _root_dir_inode_operations; + inode->i_op = _dir_inode_operations; inode->i_fop = _file_operations; dentry->d_fsdata = get_ei(ei); diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h index 661ac13e2984..d83c2a25f288 100644 --- a/fs/tracefs/internal.h +++ b/fs/tracefs/internal.h @@ -4,10 +4,9 @@ enum { TRACEFS_EVENT_INODE = BIT(1), - TRACEFS_EVENT_TOP_INODE = BIT(2), - TRACEFS_GID_PERM_SET=
[PATCH 2/5] tracefs: Still use mount point as default permissions for instances
From: "Steven Rostedt (Google)" If the instances directory's permissions were never change, then have it and its children use the mount point permissions as the default. Currently, the permissions of instance directories are determined by the instance directory's permissions itself. But if the tracefs file system is remounted and changes the permissions, the instance directory and its children should use the new permission. But because both the instance directory and its children use the instance directory's inode for permissions, it misses the update. To demonstrate this: # cd /sys/kernel/tracing/ # mkdir instances/foo # ls -ld instances/foo drwxr-x--- 5 root root 0 May 1 19:07 instances/foo # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld current_tracer -rw-r- 1 root root 0 May 1 18:57 current_tracer # mount -o remount,gid=1002 . # ls -ld instances drwxr-x--- 3 root root 0 May 1 18:57 instances # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:07 instances/foo/ # ls -ld current_tracer -rw-r- 1 root lkp 0 May 1 18:57 current_tracer Notice that changing the group id to that of "lkp" did not affect the instances directory nor its children. It should have been: # ls -ld current_tracer -rw-r- 1 root root 0 May 1 19:19 current_tracer # ls -ld instances/foo/ drwxr-x--- 5 root root 0 May 1 19:25 instances/foo/ # ls -ld instances drwxr-x--- 3 root root 0 May 1 19:19 instances # mount -o remount,gid=1002 . # ls -ld current_tracer -rw-r- 1 root lkp 0 May 1 19:19 current_tracer # ls -ld instances drwxr-x--- 3 root lkp 0 May 1 19:19 instances # ls -ld instances/foo/ drwxr-x--- 5 root lkp 0 May 1 19:25 instances/foo/ Where all files were updated by the remount gid update. 
Cc: sta...@vger.kernel.org Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership") Signed-off-by: Steven Rostedt (Google) --- fs/tracefs/inode.c | 27 +-- 1 file changed, 25 insertions(+), 2 deletions(-) diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 2a1dc2b442d1..1130c0fe2426 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -170,16 +170,39 @@ static void set_tracefs_inode_owner(struct inode *inode) { struct tracefs_inode *ti = get_tracefs(inode); struct inode *root_inode = ti->private; + kuid_t uid; + kgid_t gid; + + uid = root_inode->i_uid; + gid = root_inode->i_gid; + + /* +* If the root is not the mount point, then check the root's +* permissions. If it was never set, then default to the +* mount point. +*/ + if (root_inode != d_inode(root_inode->i_sb->s_root)) { + struct tracefs_inode *rti; + + rti = get_tracefs(root_inode); + root_inode = d_inode(root_inode->i_sb->s_root); + + if (!(rti->flags & TRACEFS_UID_PERM_SET)) + uid = root_inode->i_uid; + + if (!(rti->flags & TRACEFS_GID_PERM_SET)) + gid = root_inode->i_gid; + } /* * If this inode has never been referenced, then update * the permissions to the superblock. */ if (!(ti->flags & TRACEFS_UID_PERM_SET)) - inode->i_uid = root_inode->i_uid; + inode->i_uid = uid; if (!(ti->flags & TRACEFS_GID_PERM_SET)) - inode->i_gid = root_inode->i_gid; + inode->i_gid = gid; } static int tracefs_permission(struct mnt_idmap *idmap, -- 2.43.0
[PATCH 0/5] tracefs/eventfs: Fix inconsistent permissions
The tracefs and eventfs permissions are created dynamically, based on what the mount point inode or the instances directory inode has. But the way this worked had some inconsistencies that could lead to security issues, as the file system did not behave like admins would expect. The files and directories could ignore the remount option that changes the gid or uid ownerships, leaving files susceptible to unexpected access. This happens if a file had its value changed previously and then a remount changed all the files' permissions. The one that was changed previously would not be affected.

This change set resolves these inconsistencies.

This also fixes the test_ownership.tc test, as it would pass the first time it is run but fail the second time because of the inconsistent state of the permissions. Now you can run that test multiple times and it will always pass.

Steven Rostedt (Google) (5):
      tracefs: Reset permissions on remount if permissions are options
      tracefs: Still use mount point as default permissions for instances
      eventfs: Do not differentiate the toplevel events directory
      eventfs: Do not treat events directory different than other directories
      eventfs: Have "events" directory get permissions from its parent

 fs/tracefs/event_inode.c | 102 ++++++++++++++++++++++++++++-------------
 fs/tracefs/inode.c       |  67 ++++++++++++++++++++-------
 fs/tracefs/internal.h    |   9 +++-
 3 files changed, 130 insertions(+), 48 deletions(-)
[PATCH 1/5] tracefs: Reset permissions on remount if permissions are options
From: "Steven Rostedt (Google)"

There's an inconsistency with the way permissions are handled in tracefs. Because the permissions are generated when accessed, they default to the root inode's permission if they were never set by the user. If the user sets the permissions, then a flag is set and the permissions are saved via the inode (for tracefs files) or an internal attribute field (for eventfs).

But if a remount happens that specifies the permissions, all the files that were not changed by the user get updated, but the ones that were changed are not. If the user remounts the file system with a given permission, then all files and directories within that file system should be updated.

This can cause security issues if a file's permission was updated but the admin forgot about it. They could incorrectly think that remounting with permissions set would update all files, but miss some. For example:

 # cd /sys/kernel/tracing
 # chgrp 1002 current_tracer
 # ls -l
 [..]
 -rw-r----- 1 root root 0 May  1 21:25 buffer_size_kb
 -rw-r----- 1 root root 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r----- 1 root root 0 May  1 21:25 buffer_total_size_kb
 -rw-r----- 1 root lkp  0 May  1 21:25 current_tracer
 -rw-r----- 1 root root 0 May  1 21:25 dynamic_events
 -r--r----- 1 root root 0 May  1 21:25 dyn_ftrace_total_info
 -r--r----- 1 root root 0 May  1 21:25 enabled_functions

Where current_tracer now has group "lkp".

 # mount -o remount,gid=1001 .
 # ls -l
 -rw-r----- 1 root tracing 0 May  1 21:25 buffer_size_kb
 -rw-r----- 1 root tracing 0 May  1 21:25 buffer_subbuf_size_kb
 -r--r----- 1 root tracing 0 May  1 21:25 buffer_total_size_kb
 -rw-r----- 1 root lkp     0 May  1 21:25 current_tracer
 -rw-r----- 1 root tracing 0 May  1 21:25 dynamic_events
 -r--r----- 1 root tracing 0 May  1 21:25 dyn_ftrace_total_info
 -r--r----- 1 root tracing 0 May  1 21:25 enabled_functions

Everything changed but the "current_tracer".
Add a new linked list that keeps track of all the tracefs_inodes, which have the permission flags that tell if the file/dir should use the root inode's permission or not. Then on remount, clear all the flags so that the default behavior of using the root inode's permission is done for all files and directories.

Cc: sta...@vger.kernel.org
Fixes: 8186fff7ab649 ("tracefs/eventfs: Use root and instance inodes as default ownership")
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c | 29 +++++++++++++++++++++++++++++
 fs/tracefs/inode.c       | 40 +++++++++++++++++++++++++++++++++++++++-
 fs/tracefs/internal.h    |  2 ++
 3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index dc97c19f9e0a..db71fed7057d 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -289,6 +289,35 @@ static const struct file_operations eventfs_file_operations = {
 	.llseek		= generic_file_llseek,
 };

+/*
+ * On a remount of tracefs, if UID or GID options are set, then
+ * the mount point inode permissions should be used.
+ * Reset the saved permission flags appropriately.
+ */
+void eventfs_remount(struct tracefs_inode *ti, bool update_uid, bool update_gid)
+{
+	struct eventfs_inode *ei = ti->private;
+
+	if (!ei)
+		return;
+
+	if (update_uid)
+		ei->attr.mode &= ~EVENTFS_SAVE_UID;
+
+	if (update_gid)
+		ei->attr.mode &= ~EVENTFS_SAVE_GID;
+
+	if (!ei->entry_attrs)
+		return;
+
+	for (int i = 0; i < ei->nr_entries; i++) {
+		if (update_uid)
+			ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_UID;
+		if (update_gid)
+			ei->entry_attrs[i].mode &= ~EVENTFS_SAVE_GID;
+	}
+}
+
 /* Return the eventfs_inode of the "events" directory */
 static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 5545e6bf7d26..2a1dc2b442d1 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -30,6 +30,13 @@ static struct vfsmount *tracefs_mount;
 static int tracefs_mount_count;
 static bool tracefs_registered;

+/*
+ * Keep track of all tracefs_inodes in order to update their
+ * flags if necessary on a remount.
+ */
+static DEFINE_MUTEX(tracefs_inode_mutex);
+static LIST_HEAD(tracefs_inodes);
+
 static struct inode *tracefs_alloc_inode(struct super_block *sb)
 {
 	struct tracefs_inode *ti;
@@ -38,12 +45,22 @@ static struct inode *tracefs_alloc_inode(struct super_block *sb)
 	if (!ti)
 		return NULL;

+	mutex_lock(&tracefs_inode_mutex);
+	list_add(&ti->list, &tracefs_inodes);
+	mutex_unlock(&tracefs_inode_mutex);
+
 	return &ti->vfs_inode;
 }

 static void tracefs_free_inode(struct inode *inode)
 {
-	kmem_cache_free(tracefs_inode_cachep, get_tracefs(inode));
+	struct tr
[PATCH] eventfs/tracing: Add callback for release of an eventfs_inode
From: "Steven Rostedt (Google)"

Synthetic events create and destroy tracefs files when they are created and removed. The tracing subsystem has its own file descriptor representing the state of the events attached to the tracefs files. There's a race between the eventfs files and this file descriptor of the tracing system where the following can cause an issue:

With two scripts 'A' and 'B' doing:

 Script 'A':
   echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
   while :
   do
     echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
   done

 Script 'B':
   echo > /sys/kernel/tracing/synthetic_events

Script 'A' creates a synthetic event "hello" and then just writes zero into its enable file. Script 'B' removes all synthetic events (including the newly created "hello" event).

What happens is that the opening of the "enable" file has:

 {
	struct trace_event_file *file = inode->i_private;
	int ret;

	ret = tracing_check_open_get_tr(file->tr);
 [..]

But deleting the events frees the "file" descriptor, and a "use after free" happens with the dereference at "file->tr".

The file descriptor does have a reference counter, but there needs to be a way to decrement it from the eventfs when the eventfs_inode that represents this file descriptor is removed.

Add an optional "release" callback to the eventfs_entry array structure that gets called when the eventfs file is about to be removed. This allows the creator of the eventfs file to increment the tracing file descriptor's ref counter. When the eventfs file is deleted, it can call the release function, which will call the put function for the tracing file descriptor. This protects the tracing file from being freed while an eventfs file that references it is being opened.
Link: https://lore.kernel.org/linux-trace-kernel/20240426073410.17154-1-tze-nan...@mediatek.com/
Cc: sta...@vger.kernel.org
Fixes: 5790b1fb3d672 ("eventfs: Remove eventfs_file and just use eventfs_inode")
Reported-by: Tze-nan wu
Signed-off-by: Steven Rostedt (Google)
---
 fs/tracefs/event_inode.c    |  7 +++++++
 include/linux/tracefs.h     |  3 +++
 kernel/trace/trace_events.c | 12 ++++++++++++
 3 files changed, 22 insertions(+)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 894c6ca1e500..dc97c19f9e0a 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -84,10 +84,17 @@ enum {
 static void release_ei(struct kref *ref)
 {
 	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
+	const struct eventfs_entry *entry;
 	struct eventfs_root_inode *rei;

 	WARN_ON_ONCE(!ei->is_freed);

+	for (int i = 0; i < ei->nr_entries; i++) {
+		entry = &ei->entries[i];
+		if (entry->release)
+			entry->release(entry->name, ei->data);
+	}
+
 	kfree(ei->entry_attrs);
 	kfree_const(ei->name);
 	if (ei->is_events) {
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 7a5fe17b6bf9..d03f74658716 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -62,6 +62,8 @@ struct eventfs_file;
 typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
 				const struct file_operations **fops);

+typedef void (*eventfs_release)(const char *name, void *data);
+
 /**
  * struct eventfs_entry - dynamically created eventfs file call back handler
  * @name:	The name of the dynamic file in an eventfs directory
@@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
 struct eventfs_entry {
 	const char		*name;
 	eventfs_callback	callback;
+	eventfs_release		release;
 };

 struct eventfs_inode;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 52f75c36bbca..6ef29eba90ce 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2552,6 +2552,14 @@ static int event_callback(const char *name, umode_t *mode, void **data,
 	return 0;
 }

+/* The file is incremented on creation and freeing the enable file decrements it */
+static void event_release(const char *name, void *data)
+{
+	struct trace_event_file *file = data;
+
+	event_file_put(file);
+}
+
 static int
 event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 {
@@ -2566,6 +2574,7 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 		{
 			.name		= "enable",
 			.callback	= event_callback,
+			.release	= event_release,
 		},
 		{
Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr
On Sun, 28 Apr 2024 20:28:37 -0400
Steven Rostedt wrote:

> > Looking for any suggestion or solution, appreciate.
>
> Yeah, I do not think eventfs should be involved in this. It needs to be
> protected at a higher level (in the synthetic/dynamic event code).
>
> I'm just coming back from Japan, and I'll need to take a deeper look at
> this after I recover from my jetlag.

OK, so I guess the eventfs nodes need an optional release callback.
Here's the right way to do that.

I added a "release" function to the passed in entry array that allows for
calling a release function when the eventfs_inode is freed. Then in the
code for creating events, I call event_file_get() on the file being
assigned and have the freeing of the "enable" file have the release
function that will call event_file_put() on that file structure.

Does this fix it for you?

-- Steve

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 894c6ca1e500..dc97c19f9e0a 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -84,10 +84,17 @@ enum {
 static void release_ei(struct kref *ref)
 {
 	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
+	const struct eventfs_entry *entry;
 	struct eventfs_root_inode *rei;

 	WARN_ON_ONCE(!ei->is_freed);

+	for (int i = 0; i < ei->nr_entries; i++) {
+		entry = &ei->entries[i];
+		if (entry->release)
+			entry->release(entry->name, ei->data);
+	}
+
 	kfree(ei->entry_attrs);
 	kfree_const(ei->name);
 	if (ei->is_events) {
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 7a5fe17b6bf9..d03f74658716 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -62,6 +62,8 @@ struct eventfs_file;
 typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
 				const struct file_operations **fops);

+typedef void (*eventfs_release)(const char *name, void *data);
+
 /**
  * struct eventfs_entry - dynamically created eventfs file call back handler
  * @name:	The name of the dynamic file in an eventfs directory
@@ -72,6 +74,7 @@ typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
 struct eventfs_entry {
 	const char		*name;
 	eventfs_callback	callback;
+	eventfs_release		release;
 };

 struct eventfs_inode;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 52f75c36bbca..d14c84281f2b 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -2552,6 +2552,14 @@ static int event_callback(const char *name, umode_t *mode, void **data,
 	return 0;
 }

+/* The file is incremented on creation and freeing the enable file decrements it */
+static void event_release(const char *name, void *data)
+{
+	struct trace_event_file *file = data;
+
+	event_file_put(file);
+}
+
 static int
 event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 {
@@ -2566,6 +2574,7 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 		{
 			.name		= "enable",
 			.callback	= event_callback,
+			.release	= event_release,
 		},
 		{
 			.name		= "filter",
@@ -2634,6 +2643,9 @@ event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
 		return ret;
 	}

+	/* Gets decremented on freeing of the "enable" file */
+	event_file_get(file);
+
 	return 0;
 }
Re: [PATCH] tracing: Fix uaf issue in tracing_open_file_tr
On Fri, 26 Apr 2024 15:34:08 +0800
Tze-nan wu wrote:

> "tracing_event_file" is at the risk of use-after-free due to the race of
> two functions "tracing_open_file_tr" and "synth_event_release".
> Specifically, it could be freed by synth_event_release before
> tracing_open_file_tr has the opportunity to access its members.
>
> It's easy to reproduce by first executing script 'A' and then script 'B'
> in my environment.
>
> Script 'A':
>   echo "hello int aaa" > /sys/kernel/tracing/synthetic_events
>   while :
>   do
>     echo 0 > /sys/kernel/tracing/events/synthetic/hello/enable
>   done
>
> Script 'B':
>   echo > /sys/kernel/tracing/synthetic/events

I think you meant:

 echo > /sys/kernel/tracing/synthetic_events

> My environment:
> arm64 + generic and swtag based KASAN both enabled + kernel-6.6.18
>
> KASAN report
> ==
> BUG: KASAN: slab-use-after-free in tracing_open_file_tr
> Read of size 8 at addr 9*** by task sh/3485
> Pointer tag: [9*], memory tag: [fe]
>
> CPU: * PID: 3485 Comm: sh Tainted:
> Call trace:
>  __hwasan_tag_mismatch
>  tracing_open_file_tr
>  do_dentry_open
>  vfs_open
>  path_openat
>
> Freed by task 3490:
>  slab_free_freelist_hook
>  kmem_cache_free
>  event_file_put
>  remove_event_file_dir
>  __trace_remove_event_call
>  trace_remove_event_call
>  synth_event_release
>  dyn_events_release_all
>  synth_events_open
>
> page last allocated via order 0, migratetype Unmovable:
>  __slab_alloc
>  kmem_cache_alloc
>  trace_create_new_event
>  trace_add_event_call
>  register_synth_event
>  __create_synth_event
>  create_or_delete_synth_event
>  trace_parse_run_command
>  synth_events_write
>  vfs_write

Thanks for reporting this.

> Based on the assumption that eventfs_inode will persist throughout the
> execution of the tracing_open_file_tr function,
> we can fix this issue by incrementing the reference count of
> trace_event_file once it is assigned to eventfs_inode->data.
> The reference count will then be decremented during the release phase of
> the eventfs_inode.
> > This ensures that trace_event_file is not prematurely freed while the > tracing_open_file_tr function is being called. > > Fixes: bb32500fb9b7 ("tracing: Have trace_event_file have ref counters") > Co-developed-by: Cheng-Jui Wang > Signed-off-by: Cheng-Jui Wang > Signed-off-by: Tze-nan wu > --- > BTW, I've also attempted to reproduce the same issue in another > environment (described below). > It's also reproducible but in a lower rate. > > another environment: > x86 + ubuntu + generic kasan enabled + kernel-6.9-rc4 > > After applying the patch, KASAN no longer reports any issues when > following the same reproduction steps in my arm64 environment. > However, I am concerned about potential side effects that the patch might > introduce. > Additionally, the newly added function "is_entry_event_callback" may not > be reliable in my opinion, > as the callback function used in the comparison could change in future. > Nonetheless, this is the best solution I can come up with now. > > Looking for any suggestion or solution, appreciate. Yeah, I do not think eventfs should be involved in this. It needs to be protected at a higher level (in the synthetic/dynamic event code). I'm just coming back from Japan, and I'll need to take a deeper look at this after I recover from my jetlag. Thanks, -- Steve
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Thu, 25 Apr 2024 13:31:53 -0700
Andrii Nakryiko wrote:

I'm just coming back from Japan (work and then a vacation), and catching
up on my email during the 6 hour layover in Detroit.

> Hey Masami,
>
> I can't really review most of that code as I'm completely unfamiliar
> with all those inner workings of fprobe/ftrace/function_graph. I left
> a few comments where there were somewhat more obvious BPF-related
> pieces.
>
> But I also did run our BPF benchmarks on probes/for-next as a baseline
> and then with your series applied on top. Just to see if there are any
> regressions. I think it will be a useful data point for you.
>
> You should be already familiar with the bench tool we have in BPF
> selftests (I used it on some other patches for your tree).

I should get familiar with your tools too.

> BASELINE
> ========
> kprobe         : 24.634 ± 0.205M/s
> kprobe-multi   : 28.898 ± 0.531M/s
> kretprobe      : 10.478 ± 0.015M/s
> kretprobe-multi: 11.012 ± 0.063M/s
>
> THIS PATCH SET ON TOP
> =====================
> kprobe         : 25.144 ± 0.027M/s (+2%)
> kprobe-multi   : 28.909 ± 0.074M/s
> kretprobe      :  9.482 ± 0.008M/s (-9.5%)
> kretprobe-multi: 13.688 ± 0.027M/s (+24%)
>
> These numbers are pretty stable and look to be more or less representative.

Thanks for running this.

> As you can see, kprobes got a bit faster, kprobe-multi seems to be
> about the same, though.
>
> Then (I suppose they are "legacy") kretprobes got quite noticeably
> slower, almost by 10%. Not sure why, but looks real after re-running
> benchmarks a bunch of times and getting stable results.
>
> On the other hand, multi-kretprobes got significantly faster (+24%!).
> Again, I don't know if it is expected or not, but it's a nice
> improvement.
>
> If you have any idea why kretprobes would get so much slower, it would
> be nice to look into that and see if you can mitigate the regression
> somehow. Thanks!
My guess is that this patch set helps generic use cases for tracing the return of functions, but will likely add more overhead for single use cases. That is, kretprobe is made to be specific for a single function, but kretprobe-multi is more generic. Hence the generic version will improve at the sacrifice of the specific function. I did expect as much. That said, I think there's probably a lot of low hanging fruit that can be done to this series to help improve the kretprobe performance. I'm not sure we can get back to the baseline, but I'm hoping we can at least make it much better than that 10% slowdown. I'll be reviewing this patch set this week as I recover from jetlag. -- Steve
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Tue, 23 Apr 2024 12:04:15 -0400
"Liam R. Howlett" wrote:

> > Nit: For all labels, please add a space before them. Otherwise, diffs will
> > show "unlock" as the function and not "ring_buffer_map", making it harder
> > to find where the change is.
>
> Isn't the inclusion of a space before labels outdated since 'git diff'
> supersedes 'diff' and fixed this issue?

I still use patch and quilt ;-) and that still suffers from that on
reject files (used for backporting and such).

-- Steve
Re: [PATCH] ftrace: Replace ftrace_disabled variable with ftrace_is_dead function
On Sat, 20 Apr 2024 11:50:29 +0800
"Bang Li" wrote:

> Thank you for your explanation, let me understand this. How about
> replacing ftrace_disabled with unlikely(ftrace_disabled)?

Why? They are slow paths. No need to optimize them.

-- Steve
Re: [PATCH v9 07/36] function_graph: Allow multiple users to attach to function graph
On Mon, 15 Apr 2024 21:50:20 +0900
"Masami Hiramatsu (Google)" wrote:

> @@ -27,23 +28,157 @@
>
>  #define FGRAPH_RET_SIZE sizeof(struct ftrace_ret_stack)
>  #define FGRAPH_RET_INDEX DIV_ROUND_UP(FGRAPH_RET_SIZE, sizeof(long))
> +
> +/*
> + * On entry to a function (via function_graph_enter()), a new ftrace_ret_stack
> + * is allocated on the task's ret_stack with indexes entry, then each
> + * fgraph_ops on the fgraph_array[]'s entryfunc is called and if that returns
> + * non-zero, the index into the fgraph_array[] for that fgraph_ops is recorded
> + * on the indexes entry as a bit flag.
> + * As the associated ftrace_ret_stack saved for those fgraph_ops needs to
> + * be found, the index to it is also added to the ret_stack along with the
> + * index of the fgraph_array[] to each fgraph_ops that needs their retfunc
> + * called.
> + *
> + * The top of the ret_stack (when not empty) will always have a reference
> + * to the last ftrace_ret_stack saved. All references to the
> + * ftrace_ret_stack has the format of:
> + *
> + * bits:  0 -  9	offset in words from the previous ftrace_ret_stack
> + *			(bitmap type should have FGRAPH_RET_INDEX always)
> + * bits: 10 - 11	Type of storage
> + *			  0 - reserved
> + *			  1 - bitmap of fgraph_array index
> + *
> + * For bitmap of fgraph_array index
> + *  bits: 12 - 27	The bitmap of fgraph_ops fgraph_array index

I really hate the terminology I came up with here, and would love to get
better terminology for describing what is going on. I looked it over but
I'm constantly getting confused. And I wrote this code! Perhaps we should
use:

 @frame : The data that represents a single function call. When a function
          is traced, all the data used for all the callbacks attached to it
          is in a single frame. This would replace the FGRAPH_RET_SIZE as
          FGRAPH_FRAME_SIZE.

 @offset : This is the word size position on the stack. It would replace
           INDEX, as I think "index" is being used for more than one thing.
           Perhaps it should be "offset" when dealing with where it is on
           the shadow stack, and "pos" when dealing with which callback ops
           is being referenced.

> + *
> + * That is, at the end of function_graph_enter, if the first and fourth
> + * fgraph_ops on the fgraph_array[] (index 0 and 3) needs their retfunc called
> + * on the return of the function being traced, this is what will be on the
> + * task's shadow ret_stack: (the stack grows upward)
> + *
> + * |                                    | <- task->curr_ret_stack
> + * +------------------------------------+
> + * | bitmap_type(bitmap:(BIT(3)|BIT(0)),|
> + * |            offset:FGRAPH_RET_INDEX)| <- the offset is from here
> + * +------------------------------------+
> + * | struct ftrace_ret_stack            |
> + * |   (stores the saved ret pointer)   | <- the offset points here
> + * +------------------------------------+
> + * |             (X) | (N)              | ( N words away from
> + * |                                    |   previous ret_stack)
> + *
> + * If a backtrace is required, and the real return pointer needs to be
> + * fetched, then it looks at the task's curr_ret_stack index, if it
> + * is greater than zero (reserved, or right before popped), it would mask
> + * the value by FGRAPH_RET_INDEX_MASK to get the offset index of the
> + * ftrace_ret_stack structure stored on the shadow stack.
> + */
> +
> +#define FGRAPH_RET_INDEX_SIZE	10

Replace SIZE with BITS.

> +#define FGRAPH_RET_INDEX_MASK	GENMASK(FGRAPH_RET_INDEX_SIZE - 1, 0)

 #define FGRAPH_FRAME_SIZE_BITS	10
 #define FGRAPH_FRAME_SIZE_MASK	GENMASK(FGRAPH_FRAME_SIZE_BITS - 1, 0)

> +
> +#define FGRAPH_TYPE_SIZE	2
> +#define FGRAPH_TYPE_MASK	GENMASK(FGRAPH_TYPE_SIZE - 1, 0)

 #define FGRAPH_TYPE_BITS	2
 #define FGRAPH_TYPE_MASK	GENMASK(FGRAPH_TYPE_BITS - 1, 0)

> +#define FGRAPH_TYPE_SHIFT	FGRAPH_RET_INDEX_SIZE
> +
> +enum {
> +	FGRAPH_TYPE_RESERVED	= 0,
> +	FGRAPH_TYPE_BITMAP	= 1,
> +};
> +
> +#define FGRAPH_INDEX_SIZE	16

replace "INDEX" with "OPS" as it will be the indexes of ops in the array.

 #define FGRAPH_OPS_BITS	16
 #define FGRAPH_OPS_MASK	GENMASK(FGRAPH_OPS_BITS - 1, 0)

> +#define FGRAPH_INDEX_MASK	GENMASK(FGRAPH_INDEX_SIZE - 1, 0)
> +#define FGRAPH_INDEX_SHIFT	(FGRAPH_TYPE_SHIFT + FGRAPH_TYPE_SIZE)
> +
> +/* Currently the max stack index can't be more than register callers */
> +#define FGRAPH_MAX_INDEX	(FGRAPH_INDEX_SIZE + FGRAPH_RET_INDEX)

FGRAPH_MAX_INDEX isn't even used. Let's delete it.

> +
> +#define FGRAPH_ARRAY_SIZE	FGRAPH_INDEX_SIZE

 #define FGRAPH_ARRAY_SIZE	FGRAPH_INDEX_BITS

> +
>  #define SHADOW_STACK_SIZE (PAGE_SIZE)
>  #define SHADOW_STACK_INDEX (SHADOW_STACK_SIZE / sizeof(long))
>  /* Leave
Re: [PATCH net-next v6 0/7] Implement reset reason mechanism to detect
On Fri, 19 Apr 2024 16:00:20 +0800
Jason Xing wrote:

> If other experts see this thread, please help me. I would appreciate
> it. I have strong interests and feel strong responsibility to
> implement something like this patch series. It can be very useful!!

I'm not a networking expert, but as I'm Cc'd and this is about tracing,
I'll jump in to see if I can help. Honestly, reading the thread, it
appears that you and Eric are talking past each other.

I believe Eric is concerned about losing the value of the enum. Enums are
types, and if you typecast them to another type, they lose the previous
type, and all the safety that goes with it.

Now, I do not really understand the problem trying to be solved here. I
understand how TCP works but I never looked into the implementation of
MPTCP. You added this:

 +static inline enum sk_rst_reason convert_mptcp_reason(u32 reason)
 +{
 +	return reason += RST_REASON_START;
 +}

And used it for places like this:

 @@ -309,8 +309,13 @@ static struct dst_entry *subflow_v4_route_req(const struct sock *sk,
  		return dst;
  	dst_release(dst);

 -	if (!req->syncookie)
 -		tcp_request_sock_ops.send_reset(sk, skb, SK_RST_REASON_NOT_SPECIFIED);
 +	if (!req->syncookie) {
 +		struct mptcp_ext *mpext = mptcp_get_ext(skb);
 +		enum sk_rst_reason reason;
 +
 +		reason = convert_mptcp_reason(mpext->reset_reason);
 +		tcp_request_sock_ops.send_reset(sk, skb, reason);
 +	}

  	return NULL;
  }

As I don't know this code or how MPTCP works, I do not understand the
above. It used to pass SK_RST_REASON_NOT_SPECIFIED to send_reset(). But
now it takes a "reset_reason" and calls "convert_mptcp_reason()" to get
back an enum value.

If you are mapping the reset_reason to enum sk_rst_reason, why not do it
via a real conversion instead of this fragile arithmetic between the two
values?
 static inline enum sk_rst_reason convert_mptcp_reason(u32 reason)
 {
	switch (reason) {
	case 0: return SK_RST_REASON_MPTCP_RST_EUNSPEC;
	case 1: return SK_RST_REASON_MPTCP_RST_EMPTCP;
	case 2: return SK_RST_REASON_MPTCP_RST_ERESOURCE;
	[..]
	default: return SK_RST_REASON_MAX; // or some other error value
	}
 }

I'm not sure if this is any better, but it's not doing any casting and
it's easier to understand. It's a simple mapping between the reason and
the enum and there's no inherent dependency between the values.

Could possibly create enums for the reason numbers and replace the hard
coded values with them. That way that helper function is at least doing a
real conversion of one type to another.

But like I said from the beginning. I don't understand the details here
and have not spent the time to dig deeper. I just read the thread and I
agree with Eric that the arithmetic conversion of reason to an enum looks
fragile at best and buggy at worst.

-- Steve
Re: [PATCH] ftrace: Replace ftrace_disabled variable with ftrace_is_dead function
On Fri, 19 Apr 2024 22:38:44 +0800
"Bang Li" wrote:

> Use the existing function ftrace_is_dead to replace the variable to make
> the code clearer.
>
> Signed-off-by: Bang Li
> ---
>  kernel/trace/ftrace.c | 46 +++++++++++++++++++++-----------------------
>  1 file changed, 23 insertions(+), 23 deletions(-)
>
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index 50ca4d4f8840..4a08c79db677 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -2693,7 +2693,7 @@ void __weak ftrace_replace_code(int mod_flags)
>  	int schedulable = mod_flags & FTRACE_MODIFY_MAY_SLEEP_FL;
>  	int failed;
>
> -	if (unlikely(ftrace_disabled))
> +	if (unlikely(ftrace_is_dead()))
>  		return;
>

NACK!

ftrace_is_dead() is only there to make the static variable
"ftrace_disabled" available for code outside of ftrace.c. In ftrace.c, it
is perfectly fine to use ftrace_disabled.

-- Steve
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Fri, 19 Apr 2024 14:36:18 +0900 Masami Hiramatsu (Google) wrote: > Hi Steve, > > Can you review this series? Especially, [07/36] and [12/36] has been changed > a lot from your original patch. I haven't forgotten (just been a bit hectic). Worse comes to worse, I'll review it tomorrow. -- Steve > > Thank you, > > On Mon, 15 Apr 2024 21:48:59 +0900 > "Masami Hiramatsu (Google)" wrote: > > > Hi, > > > > Here is the 9th version of the series to re-implement the fprobe on > > function-graph tracer. The previous version is; > > > > https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/ > > > > This version is ported on the latest kernel (v6.9-rc3 + probes/for-next) > > and fixed some bugs + performance optimization patch[36/36]. > > - [12/36] Fix to clear fgraph_array entry in registration failure, also > >return -ENOSPC when fgraph_array is full. > > - [28/36] Add new store_fprobe_entry_data() for fprobe. > > - [31/36] Remove DIV_ROUND_UP() and fix entry data address calculation. > > - [36/36] Add new flag to skip timestamp recording. > > > > Overview > > > > This series does major 2 changes, enable multiple function-graphs on > > the ftrace (e.g. allow function-graph on sub instances) and rewrite the > > fprobe on this function-graph. > > > > The former changes had been sent from Steven Rostedt 4 years ago (*), > > which allows users to set different setting function-graph tracer (and > > other tracers based on function-graph) in each trace-instances at the > > same time. > > > > (*) https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/ > > > > The purpose of latter change are; > > > > 1) Remove dependency of the rethook from fprobe so that we can reduce > >the return hook code and shadow stack. > > > > 2) Make 'ftrace_regs' the common trace interface for the function > >boundary. > > > > 1) Currently we have 2(or 3) different function return hook codes, > > the function-graph tracer and rethook (and legacy kretprobe). 
> > But since this is redundant and needs double maintenance cost, > > I would like to unify those. From the user's viewpoint, function- > > graph tracer is very useful to grasp the execution path. For this > > purpose, it is hard to use the rethook in the function-graph > > tracer, but the opposite is possible. (Strictly speaking, kretprobe > > can not use it because it requires 'pt_regs' for historical reasons.) > > > > 2) Now the fprobe provides the 'pt_regs' for its handler, but that is > > wrong for the function entry and exit. Moreover, depending on the > > architecture, there is no way to accurately reproduce 'pt_regs' > > outside of interrupt or exception handlers. This means fprobe should > > not use 'pt_regs' because it does not use such exceptions. > > (Conversely, kprobe should use 'pt_regs' because it is an abstract > > interface of the software breakpoint exception.) > > > > This series changes fprobe to use function-graph tracer for tracing > > function entry and exit, instead of mixture of ftrace and rethook. > > Unlike the rethook which is a per-task list of system-wide allocated > > nodes, the function graph's ret_stack is a per-task shadow stack. > > Thus it does not need to set 'nr_maxactive' (which is the number of > > pre-allocated nodes). > > Also the handlers will get the 'ftrace_regs' instead of 'pt_regs'. > > Since eBPF mulit_kprobe/multi_kretprobe events still use 'pt_regs' as > > their register interface, this changes it to convert 'ftrace_regs' to > > 'pt_regs'. Of course this conversion makes an incomplete 'pt_regs', > > so users must access only registers for function parameters or > > return value. > > > > Design > > -- > > Instead of using ftrace's function entry hook directly, the new fprobe > > is built on top of the function-graph's entry and return callbacks > > with 'ftrace_regs'. 
> > > > Since the fprobe requires access to 'ftrace_regs', the architecture > > must support CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS and > > CONFIG_HAVE_FTRACE_GRAPH_FUNC, which enables to call function-graph > > entry callback with 'ftrace_regs', and also > > CONFIG_HAVE_FUNCTION_GRAPH_FREGS, which passes the ftrace_regs to > > return_to_handler. > > > > All fprobes share a single function-graph ops (means shares a common > > ftrace filter) similar to the kprobe-on-ftrace. This needs another > > layer to find corresponding fprobe in the
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Thu, 18 Apr 2024 09:55:55 +0300 Mike Rapoport wrote: Hi Mike, Thanks for doing this review! > > +/** > > + * struct trace_buffer_meta - Ring-buffer Meta-page description > > + * @meta_page_size:Size of this meta-page. > > + * @meta_struct_len: Size of this structure. > > + * @subbuf_size: Size of each sub-buffer. > > + * @nr_subbufs:Number of subbfs in the ring-buffer, including > > the reader. > > + * @reader.lost_events:Number of events lost at the time of the reader > > swap. > > + * @reader.id: subbuf ID of the current reader. ID range [0 : > > @nr_subbufs - 1] > > + * @reader.read: Number of bytes read on the reader subbuf. > > + * @flags: Placeholder for now, 0 until new features are supported. > > + * @entries: Number of entries in the ring-buffer. > > + * @overrun: Number of entries lost in the ring-buffer. > > + * @read: Number of entries that have been read. > > + * @Reserved1: Reserved for future use. > > + * @Reserved2: Reserved for future use. > > + */ > > +struct trace_buffer_meta { > > + __u32 meta_page_size; > > + __u32 meta_struct_len; > > + > > + __u32 subbuf_size; > > + __u32 nr_subbufs; > > + > > + struct { > > + __u64 lost_events; > > + __u32 id; > > + __u32 read; > > + } reader; > > + > > + __u64 flags; > > + > > + __u64 entries; > > + __u64 overrun; > > + __u64 read; > > + > > + __u64 Reserved1; > > + __u64 Reserved2; > > Why do you need reserved fields? This structure always resides in the > beginning of a page and the rest of the page is essentially "reserved". So this code is also going to be used in arm's pkvm hypervisor code, where it will be using these fields, but since we are looking at keeping the same interface between the two, we don't want these used by this interface. We probably should add a comment about that. 
> > > +}; > > + > > +#endif /* _TRACE_MMAP_H_ */ > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > > index cc9ebe593571..793ecc454039 100644 > > --- a/kernel/trace/ring_buffer.c > > +++ b/kernel/trace/ring_buffer.c > > ... > > > +static void rb_setup_ids_meta_page(struct ring_buffer_per_cpu *cpu_buffer, > > + unsigned long *subbuf_ids) > > +{ > > + struct trace_buffer_meta *meta = cpu_buffer->meta_page; > > + unsigned int nr_subbufs = cpu_buffer->nr_pages + 1; > > + struct buffer_page *first_subbuf, *subbuf; > > + int id = 0; > > + > > + subbuf_ids[id] = (unsigned long)cpu_buffer->reader_page->page; > > + cpu_buffer->reader_page->id = id++; > > + > > + first_subbuf = subbuf = rb_set_head_page(cpu_buffer); > > + do { > > + if (WARN_ON(id >= nr_subbufs)) > > + break; > > + > > + subbuf_ids[id] = (unsigned long)subbuf->page; > > + subbuf->id = id; > > + > > + rb_inc_page(&subbuf); > > + id++; > > + } while (subbuf != first_subbuf); > > + > > + /* install subbuf ID to kern VA translation */ > > + cpu_buffer->subbuf_ids = subbuf_ids; > > + > > + /* __rb_map_vma() pads the meta-page to align it with the sub-buffers */ > > + meta->meta_page_size = PAGE_SIZE << cpu_buffer->buffer->subbuf_order; > > Isn't this a single page? One thing we are doing is to make sure that the subbuffers are aligned by their size. If a subbuffer is 3 pages, it should be aligned on 3 page boundaries. This was something that Linus suggested. > > > + meta->meta_struct_len = sizeof(*meta); > > + meta->nr_subbufs = nr_subbufs; > > + meta->subbuf_size = cpu_buffer->buffer->subbuf_size + BUF_PAGE_HDR_SIZE; > > + > > + rb_update_meta_page(cpu_buffer); > > +} > > ... > > > +#define subbuf_page(off, start) \ > > + virt_to_page((void *)((start) + ((off) << PAGE_SHIFT))) > > + > > +#define foreach_subbuf_page(sub_order, start, page)\ > > Nit: usually iterators in kernel use for_each Ah, good catch. Yeah, that should be changed. But then ... 
> > > + page = subbuf_page(0, (start)); \ > > + for (int __off = 0; __off < (1 << (sub_order)); \ > > +__off++, page = subbuf_page(__off, (start))) > > The pages are allocated with alloc_pages_node(.. subbuf_order) are > physically contiguous and struct pages for them are also contiguous, so > inside a subbuf_order allocation you can just do page++. > I'm wondering if we should just nuke the macro. It was there because of the previous implementation did things twice. But now it's just done once here: + while (s < nr_subbufs && p < nr_pages) { + struct page *page; + + foreach_subbuf_page(subbuf_order, cpu_buffer->subbuf_ids[s], page) { + if (p >= nr_pages) + break; + + pages[p++] = page; +
Re: [PATCH for-next v2] tracing/kprobes: Add symbol counting check when module loads
On Mon, 15 Apr 2024 18:40:23 +0900 "Masami Hiramatsu (Google)" wrote: > Check the number of probe target symbols in the target module when > the module is loaded. If the probe is not on the unique name symbols > in the module, it will be rejected at that point. > > Note that a symbol which has a unique name in the target module > will be accepted even if there are same-name symbols in the > kernel or other modules. This says what it does, but doesn't explain why it is doing it. What's the purpose of this patch? -- Steve
[PATCH] ASoC: tracing: Export SND_SOC_DAPM_DIR_OUT to its value
The string SND_SOC_DAPM_DIR_OUT is printed in the snd_soc_dapm_path trace event instead of its value: (((REC->path_dir) == SND_SOC_DAPM_DIR_OUT) ? "->" : "<-") User space cannot parse this, as it has no idea what SND_SOC_DAPM_DIR_OUT is. Use TRACE_DEFINE_ENUM() to convert it to its value: (((REC->path_dir) == 1) ? "->" : "<-") So that user space tools, such as perf and trace-cmd, can parse it correctly. Reported-by: Luca Ceresoli Fixes: 6e588a0d839b5 ("ASoC: dapm: Consolidate path trace events") Signed-off-by: Steven Rostedt (Google) --- include/trace/events/asoc.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/trace/events/asoc.h b/include/trace/events/asoc.h index 4eed9028bb11..517015ef36a8 100644 --- a/include/trace/events/asoc.h +++ b/include/trace/events/asoc.h @@ -12,6 +12,8 @@ #define DAPM_DIRECT "(direct)" #define DAPM_ARROW(dir) (((dir) == SND_SOC_DAPM_DIR_OUT) ? "->" : "<-") +TRACE_DEFINE_ENUM(SND_SOC_DAPM_DIR_OUT); + struct snd_soc_jack; struct snd_soc_card; struct snd_soc_dapm_widget; -- 2.43.0
Re: TP_printk() bug with %c, and more?
On Tue, 16 Apr 2024 04:08:46 +0200 Luca Ceresoli wrote: > Thanks for the insight. I'm definitely trying to fix this based on your > hint as soon as I get my hand on a board. I have a patch I forgot to send out. Let me do that now. -- Steve
Re: TP_printk() bug with %c, and more?
On Mon, 18 Mar 2024 16:43:07 +0100 Luca Ceresoli wrote: > However the arrows are still reversed. This requires a kernel change. The problem is that the print fmt has: print fmt: "%c%s %s %s %s %s", (int) REC->path_node && (int) REC->path_connect ? '*' : ' ', __get_str(wname), (((REC->path_dir) == SND_SOC_DAPM_DIR_OUT) ? "->" : "<-"), __get_str(pname), (((REC->path_dir) == SND_SOC_DAPM_DIR_OUT) ? "->" : "<-"), __get_str(pnname) User space (trace-cmd and perf) have no idea what SND_SOC_DAPM_DIR_OUT is. The kernel needs to convert that, otherwise the parsing will fail, or it will default it to zero. -- Steve
Re: [RFC PATCH 0/4] perf: Correlating user process data to samples
On Sat, 13 Apr 2024 12:53:38 +0200 Peter Zijlstra wrote: > On Fri, Apr 12, 2024 at 09:37:24AM -0700, Beau Belgrave wrote: > > > > Anyway, since we typically run stuff from NMI context, accessing user > > > data is 'interesting'. As such I would really like to make this work > > > depend on the call-graph rework that pushes all the user access bits > > > into return-to-user. > > > > Cool, I assume that's the SFRAME work? Are there pointers to work I > > could look at and think about what a rebase looks like? Or do you have > > someone in mind I should work with for this? > > I've been offline for a little while and still need to catch up with > things myself. > > Josh was working on that when I dropped off IIRC, I'm not entirely sure > where things are at currently (and there is no way I can ever hope to > process the backlog). > > Anybody know where we are with that? It's still very much on my RADAR, but with layoffs and such, my priorities have unfortunately changed. I'm hoping to start helping out in the near future though (in a month or two). Josh was working on it, but I think he got pulled off onto other priorities too :-p -- Steve
Re: [PATCH v2] tracing: Add sched_prepare_exec tracepoint
On Thu, 11 Apr 2024 08:15:05 -0700 Kees Cook wrote: > This looks good to me. If tracing wants to take it: > > Acked-by: Kees Cook > > If not, I can take it in my tree if I get a tracing Ack. :) You can take it. Acked-by: Steven Rostedt (Google) -- Steve
[PATCH v2 11/11] tracing: Update function tracing output for previous boot buffer
From: "Steven Rostedt (Google)" For a persistent ring buffer that is saved across boots, if function tracing was performed in the previous boot, it only saves the address of the functions and uses "%pS" to print their names. But the current boot, those functions may be in different locations. The persistent meta-data saves the text delta between the two boots and can be used to find the address of the saved function of where it is located in the current boot. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace_output.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c index d8b302d01083..b9d2c64c0648 100644 --- a/kernel/trace/trace_output.c +++ b/kernel/trace/trace_output.c @@ -990,8 +990,11 @@ enum print_line_t trace_nop_print(struct trace_iterator *iter, int flags, } static void print_fn_trace(struct trace_seq *s, unsigned long ip, - unsigned long parent_ip, int flags) + unsigned long parent_ip, long delta, int flags) { + ip += delta; + parent_ip += delta; + seq_print_ip_sym(s, ip, flags); if ((flags & TRACE_ITER_PRINT_PARENT) && parent_ip) { @@ -1009,7 +1012,7 @@ static enum print_line_t trace_fn_trace(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_putc(s, '\n'); return trace_handle_return(s); @@ -1674,7 +1677,7 @@ trace_func_repeats_print(struct trace_iterator *iter, int flags, trace_assign_type(field, iter->ent); - print_fn_trace(s, field->ip, field->parent_ip, flags); + print_fn_trace(s, field->ip, field->parent_ip, iter->tr->text_delta, flags); trace_seq_printf(s, " (repeats: %u, last_ts:", field->count); trace_print_time(s, iter, iter->ts - FUNC_REPEATS_GET_DELTA_TS(field)); -- 2.43.0
[PATCH v2 10/11] tracing: Handle old buffer mappings for event strings and functions
From: "Steven Rostedt (Google)" Use the saved text_delta and data_delta of a persistent memory mapped ring buffer that was saved from a previous boot, and use the delta in the trace event print output so that strings and functions show up normally. That is, for an event like trace_kmalloc() that prints the callsite via "%pS", if it used the address saved in the ring buffer it will not match the function that was saved in the previous boot if the kernel remaps itself between boots. For RCU events that point to saved static strings where only the address of the string is saved in the ring buffer, it too will be adjusted to point to where the string is on the current boot. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 42 +++--- 1 file changed, 39 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 33ef5311fa39..bac5db67cf15 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -3671,8 +3671,11 @@ static void test_can_verify(void) void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, va_list ap) { + long text_delta = iter->tr->text_delta; + long data_delta = iter->tr->data_delta; const char *p = fmt; const char *str; + bool good; int i, j; if (WARN_ON_ONCE(!fmt)) @@ -3691,7 +3694,10 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, j = 0; - /* We only care about %s and variants */ + /* +* We only care about %s and variants +* as well as %p[sS] if delta is non-zero +*/ for (i = 0; p[i]; i++) { if (i + 1 >= iter->fmt_size) { /* @@ -3720,6 +3726,11 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, } if (p[i+j] == 's') break; + + if (text_delta && p[i+1] == 'p' && + ((p[i+2] == 's' || p[i+2] == 'S'))) + break; + star = false; } j = 0; @@ -3733,6 +3744,24 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, iter->fmt[i] = '\0'; trace_seq_vprintf(&iter->seq, iter->fmt, ap); + /* Add delta to %pS pointers */ + if (p[i+1] == 
'p') { + unsigned long addr; + char fmt[4]; + + fmt[0] = '%'; + fmt[1] = 'p'; + fmt[2] = p[i+2]; /* Either %ps or %pS */ + fmt[3] = '\0'; + + addr = va_arg(ap, unsigned long); + addr += text_delta; + trace_seq_printf(&iter->seq, fmt, (void *)addr); + + p += i + 3; + continue; + } + /* * If iter->seq is full, the above call no longer guarantees * that ap is in sync with fmt processing, and further calls @@ -3751,6 +3780,14 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, /* The ap now points to the string data of the %s */ str = va_arg(ap, const char *); + good = trace_safe_str(iter, str, star, len); + + /* Could be from the last boot */ + if (data_delta && !good) { + str += data_delta; + good = trace_safe_str(iter, str, star, len); + } + /* * If you hit this warning, it is likely that the * trace event in question used %s on a string that @@ -3760,8 +3797,7 @@ void trace_check_vprintf(struct trace_iterator *iter, const char *fmt, * instead. See samples/trace_events/trace-events-sample.h * for reference. */ - if (WARN_ONCE(!trace_safe_str(iter, str, star, len), - "fmt: '%s' current_buffer: '%s'", + if (WARN_ONCE(!good, "fmt: '%s' current_buffer: '%s'", fmt, seq_buf_str(&iter->seq.seq))) { int ret; -- 2.43.0
[PATCH v2 09/11] tracing/ring-buffer: Add last_boot_info file to boot instance
From: "Steven Rostedt (Google)" If an instance is mapped to memory on boot up, create a new file called "last_boot_info" that will hold information that can be used to properly parse the raw data in the ring buffer. It will export the delta of the addresses for text and data from what it was from the last boot. It does not expose actually addresses (unless you knew what the actual address was from the last boot). The output will look like: # cat last_boot_info text delta:-268435456 data delta:-268435456 The text and data are kept separate in case they are ever made different. Signed-off-by: Steven Rostedt (Google) --- include/linux/ring_buffer.h | 3 +++ kernel/trace/ring_buffer.c | 23 ++ kernel/trace/trace.c| 47 - kernel/trace/trace.h| 2 ++ 4 files changed, 74 insertions(+), 1 deletion(-) diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index a50b0223b1d3..55de3798a9b9 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -94,6 +94,9 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flag unsigned long range_size, struct lock_class_key *key); +bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text, +long *data); + /* * Because the ring buffer is generic, if other users of the ring buffer get * traced by ftrace, it can produce lockdep warnings. 
We need to keep each diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index c68695ae2749..e189f500ac32 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -2390,6 +2390,29 @@ struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flag return alloc_buffer(size, flags, order, start, start + range_size, key); } +/** + * ring_buffer_last_boot_delta - return the delta offset from last boot + * @buffer: The buffer to return the delta from + * @text: Return text delta + * @data: Return data delta + * + * Returns: The true if the delta is non zero + */ +bool ring_buffer_last_boot_delta(struct trace_buffer *buffer, long *text, +long *data) +{ + if (!buffer) + return false; + + if (!buffer->last_text_delta) + return false; + + *text = buffer->last_text_delta; + *data = buffer->last_data_delta; + + return true; +} + /** * ring_buffer_free - free a ring buffer. * @buffer: the buffer to free. diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index b2b16be6e4ec..33ef5311fa39 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -6041,6 +6041,18 @@ ssize_t tracing_resize_ring_buffer(struct trace_array *tr, return ret; } +static void update_last_data(struct trace_array *tr) +{ + if (!tr->text_delta && !tr->data_delta) + return; + + /* Clear old data */ + tracing_reset_online_cpus(&tr->array_buffer); + + /* Using current data now */ + tr->text_delta = 0; + tr->data_delta = 0; +} /** * tracing_update_buffers - used by tracing facility to expand ring buffers @@ -6058,6 +6070,9 @@ int tracing_update_buffers(struct trace_array *tr) int ret = 0; mutex_lock(&trace_types_lock); + + update_last_data(tr); + if (!tr->ring_buffer_expanded) ret = __tracing_resize_ring_buffer(tr, trace_buf_size, RING_BUFFER_ALL_CPUS); @@ -6113,6 +6128,8 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf) mutex_lock(&trace_types_lock); + update_last_data(tr); + if (!tr->ring_buffer_expanded) { ret = __tracing_resize_ring_buffer(tr, 
trace_buf_size, RING_BUFFER_ALL_CPUS); @@ -6860,6 +6877,21 @@ tracing_total_entries_read(struct file *filp, char __user *ubuf, return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); } +static ssize_t +tracing_last_boot_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t *ppos) +{ + struct trace_array *tr = filp->private_data; + struct seq_buf seq; + char buf[64]; + + seq_buf_init(&seq, buf, 64); + + seq_buf_printf(&seq, "text delta:\t%ld\n", tr->text_delta); + seq_buf_printf(&seq, "data delta:\t%ld\n", tr->data_delta); + + return simple_read_from_buffer(ubuf, cnt, ppos, buf, seq_buf_used(&seq)); +} + static int tracing_buffer_meta_open(struct inode *inode, struct file *filp) { struct trace_array *tr = inode->i_private; @@ -7499,6 +7531,13 @@ static const struct file_operations trace_time_stamp_mode_fops = { .release= tracing_single_release_tr, }; +static const s
[PATCH v2 08/11] ring-buffer: Save text and data locations in mapped meta data
From: "Steven Rostedt (Google)" When a ring buffer is mapped to a specific address, save the address of a text function and some data. This will be used to determine the delta between the last boot and the current boot for pointers to functions as well as to data. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 16 1 file changed, 16 insertions(+) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 6f19e7612472..c68695ae2749 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -45,6 +45,8 @@ static void update_pages_handler(struct work_struct *work); struct ring_buffer_meta { + unsigned long text_addr; + unsigned long data_addr; unsigned long first_buffer; unsigned long head_buffer; unsigned long commit_buffer; @@ -539,6 +541,9 @@ struct trace_buffer { unsigned long range_addr_start; unsigned long range_addr_end; + longlast_text_delta; + longlast_data_delta; + unsigned intsubbuf_size; unsigned intsubbuf_order; unsigned intmax_data_size; @@ -1827,10 +1832,15 @@ static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer) } } +/* Used to calculate data delta */ +static char rb_data_ptr[] __initdata = ""; + static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages) { struct ring_buffer_meta *meta; unsigned long delta; + unsigned long this_text = (unsigned long)rb_range_meta_init; + unsigned long this_data = (unsigned long)rb_data_ptr; void *subbuf; int cpu; int i; @@ -1847,6 +1857,10 @@ static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages) meta->first_buffer += delta; meta->head_buffer += delta; meta->commit_buffer += delta; + buffer->last_text_delta = this_text - meta->text_addr; + buffer->last_data_delta = this_data - meta->data_addr; + meta->text_addr = this_text; + meta->data_addr = this_data; continue; } @@ -1863,6 +1877,8 @@ static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages) subbuf = rb_subbufs_from_meta(meta); 
meta->first_buffer = (unsigned long)subbuf; + meta->text_addr = this_text; + meta->data_addr = this_data; /* * The buffers[] array holds the order of the sub-buffers -- 2.43.0
[PATCH v2 07/11] ring-buffer: Validate boot range memory events
From: "Steven Rostedt (Google)" Make sure all the events in each of the sub-buffers that were mapped in a memory region are valid. This moves the code that walks the buffers for time-stamp validation out of the CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS ifdef block and is used to validate the content. Only the ring buffer event meta data is checked and not the data load. This also has a second purpose. The buffer_page structure that points to the data sub-buffers has accounting that keeps track of the number of events that are on the sub-buffer. This updates that counter as well. That counter is used in reading the buffer and knowing if the ring buffer is empty or not. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 190 - 1 file changed, 166 insertions(+), 24 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index d05b60fd5695..6f19e7612472 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1663,10 +1663,170 @@ static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu, subbuf = (void *)subbuf + subbuf_size; } - pr_info("Ring buffer meta is from previous boot!\n"); return true; } +static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf); + +#ifdef CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS +static DEFINE_PER_CPU(atomic_t, checking); +static atomic_t ts_dump; + +#define buffer_warn_return(fmt, ...) \ + do {\ + /* If another report is happening, ignore this one */ \ + if (atomic_inc_return(&ts_dump) != 1) { \ + atomic_dec(&ts_dump); \ + goto out; \ + } \ + atomic_inc(&cpu_buffer->record_disabled); \ + pr_warn(fmt, ##__VA_ARGS__);\ + dump_buffer_page(bpage, info, tail);\ + atomic_dec(&ts_dump); \ + /* There's some cases in boot up that this can happen */ \ + if (WARN_ON_ONCE(system_state != SYSTEM_BOOTING)) \ + /* Do not re-enable checking */ \ + return; \ + } while (0) +#else +#define buffer_warn_return(fmt, ...) 
do { } while (0) +#endif + +static int rb_read_data_buffer(struct buffer_data_page *dpage, int tail, int cpu, + unsigned long long *timestamp, bool warn) +{ + struct ring_buffer_event *event; + u64 ts, delta; + int events = 0; + int e; + + ts = dpage->time_stamp; + + for (e = 0; e < tail; e += rb_event_length(event)) { + + event = (struct ring_buffer_event *)(dpage->data + e); + + switch (event->type_len) { + + case RINGBUF_TYPE_TIME_EXTEND: + delta = rb_event_time_stamp(event); + ts += delta; + break; + + case RINGBUF_TYPE_TIME_STAMP: + delta = rb_event_time_stamp(event); + if (warn && delta < ts) { + buffer_warn_return("[CPU: %d]ABSOLUTE TIME WENT BACKWARDS: last ts: %lld absolute ts: %lld\n", + cpu, ts, delta); + } + ts = delta; + break; + + case RINGBUF_TYPE_PADDING: + if (event->time_delta == 1) + break; + fallthrough; + case RINGBUF_TYPE_DATA: + events++; + ts += event->time_delta; + break; + + default: + return -1; + } + } + *timestamp = ts; + return events; + } + +static int rb_validate_buffer(struct buffer_data_page *dpage, int cpu) +{ + unsigned long long ts; + int tail; + + tail = local_read(&dpage->commit); + return rb_read_data_buffer(dpage, tail, cpu, &ts, false); +} + +/* If the meta data has been validated, now validate the events */ +static void rb_meta_validate_events(struct ring_buffer_per_cpu *cpu_buffer) +{ + struct ring_buffer_meta *meta = cpu_buffer->ring_meta; + struct buffer_page *head_page; + unsigned long entry_bytes = 0; + unsigned long entries = 0; + int ret; + int i; + + if (!meta || !meta->head_buffer) + return; + + /* Do the reader page
[PATCH v2 06/11] ring-buffer: Add test if range of boot buffer is valid
From: "Steven Rostedt (Google)" Add a test against the ring buffer memory range to see if it has valid data. The ring_buffer_meta structure is given a new field called "first_buffer" which holds the address of the first sub-buffer. This is used to both determine if the other fields are valid as well as finding the offset between the old addresses of the sub-buffer from the previous boot to the new addresses of the current boot. Since the values for nr_subbufs and subbuf_size is to be the same, check if the values in the meta page match the values calculated. Take the range of the first_buffer and the total size of all the buffers and make sure the saved head_buffer and commit_buffer fall in the range. Iterate through all the sub-buffers to make sure that the values in the sub-buffer "commit" field (the field that holds the amount of data on the sub-buffer) is within the end of the sub-buffer. Also check the index array to make sure that all the indexes are within nr_subbufs. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 143 ++--- 1 file changed, 135 insertions(+), 8 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index ee3a5c6966e2..d05b60fd5695 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -45,6 +45,7 @@ static void update_pages_handler(struct work_struct *work); struct ring_buffer_meta { + unsigned long first_buffer; unsigned long head_buffer; unsigned long commit_buffer; __u32 subbuf_size; @@ -1606,21 +1607,103 @@ static void *rb_range_buffer(struct ring_buffer_per_cpu *cpu_buffer, int idx) return (void *)ptr; } +/* + * See if the existing memory contains valid ring buffer data. + * As the previous kernel must be the same as this kernel, all + * the calculations (size of buffers and number of buffers) + * must be the same. 
+ */ +static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu, + struct trace_buffer *buffer, int nr_pages) +{ + int subbuf_size = PAGE_SIZE; + struct buffer_data_page *subbuf; + unsigned long buffers_start; + unsigned long buffers_end; + int i; + + /* The subbuffer's size and number of subbuffers must match */ + if (meta->subbuf_size != subbuf_size || + meta->nr_subbufs != nr_pages + 1) { + pr_info("Ring buffer boot meta [%d] mismatch of subbuf_size/nr_pages\n", cpu); + return false; + } + + buffers_start = meta->first_buffer; + buffers_end = meta->first_buffer + (subbuf_size * meta->nr_subbufs); + + /* Is the head and commit buffers within the range of buffers? */ + if (meta->head_buffer < buffers_start || + meta->head_buffer >= buffers_end) { + pr_info("Ring buffer boot meta [%d] head buffer out of range\n", cpu); + return false; + } + + if (meta->commit_buffer < buffers_start || + meta->commit_buffer >= buffers_end) { + pr_info("Ring buffer boot meta [%d] commit buffer out of range\n", cpu); + return false; + } + + subbuf = rb_subbufs_from_meta(meta); + + /* Is the meta buffers and the subbufs themselves have correct data? 
*/ + for (i = 0; i < meta->nr_subbufs; i++) { + if (meta->buffers[i] < 0 || + meta->buffers[i] >= meta->nr_subbufs) { + pr_info("Ring buffer boot meta [%d] array out of range\n", cpu); + return false; + } + + if ((unsigned)local_read(&subbuf->commit) > subbuf_size) { + pr_info("Ring buffer boot meta [%d] buffer invalid commit\n", cpu); + return false; + } + + subbuf = (void *)subbuf + subbuf_size; + } + + pr_info("Ring buffer meta is from previous boot!\n"); + return true; +} + static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages) { struct ring_buffer_meta *meta; + unsigned long delta; void *subbuf; int cpu; int i; for (cpu = 0; cpu < nr_cpu_ids; cpu++) { + void *next_meta; + meta = rb_range_meta(buffer, nr_pages, cpu); + if (rb_meta_valid(meta, cpu, buffer, nr_pages)) { + /* Make the mappings match the current address */ + subbuf = rb_subbufs_from_meta(meta); + delta = (unsigned long)subbuf - meta->first_buffer; + meta->first_buffer += delta; + meta->head_buffer += delta; + meta->commit_buffer += delta;
[PATCH v2 05/11] ring-buffer: Add output of ring buffer meta page
From: "Steven Rostedt (Google)" Add a buffer_meta per-cpu file for the trace instance that is mapped to boot memory. This shows the current meta-data and can be used by user space tools to record off the current mappings to help reconstruct the ring buffer after a reboot. It does not expose any virtual addresses, just indexes into the sub-buffer pages. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 77 ++ kernel/trace/trace.c | 30 ++- kernel/trace/trace.h | 2 + 3 files changed, 107 insertions(+), 2 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index db8762faef77..ee3a5c6966e2 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -32,6 +32,8 @@ #include #include +#include "trace.h" + /* * The "absolute" timestamp in the buffer is only 59 bits. * If a clock has the 5 MSBs set, it needs to be saved and @@ -1635,6 +1637,81 @@ static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages) } } +static void *rbm_start(struct seq_file *m, loff_t *pos) +{ + struct ring_buffer_per_cpu *cpu_buffer = m->private; + struct ring_buffer_meta *meta = cpu_buffer->ring_meta; + unsigned long val; + + if (!meta) + return NULL; + + if (*pos > meta->nr_subbufs) + return NULL; + + val = *pos; + val++; + + return (void *)val; +} + +static void *rbm_next(struct seq_file *m, void *v, loff_t *pos) +{ + (*pos)++; + + return rbm_start(m, pos); +} + +static int rb_meta_subbuf_idx(struct ring_buffer_meta *meta, void *subbuf); + +static int rbm_show(struct seq_file *m, void *v) +{ + struct ring_buffer_per_cpu *cpu_buffer = m->private; + struct ring_buffer_meta *meta = cpu_buffer->ring_meta; + unsigned long val = (unsigned long)v; + + if (val == 1) { + seq_printf(m, "head_buffer: %d\n", + rb_meta_subbuf_idx(meta, (void *)meta->head_buffer)); + seq_printf(m, "commit_buffer: %d\n", + rb_meta_subbuf_idx(meta, (void *)meta->commit_buffer)); + seq_printf(m, "subbuf_size: %d\n", meta->subbuf_size); + 
seq_printf(m, "nr_subbufs:%d\n", meta->nr_subbufs); + return 0; + } + + val -= 2; + seq_printf(m, "buffer[%ld]:%d\n", val, meta->buffers[val]); + + return 0; +} + +static void rbm_stop(struct seq_file *m, void *p) +{ +} + +static const struct seq_operations rb_meta_seq_ops = { + .start = rbm_start, + .next = rbm_next, + .show = rbm_show, + .stop = rbm_stop, +}; + +int ring_buffer_meta_seq_init(struct file *file, struct trace_buffer *buffer, int cpu) +{ + struct seq_file *m; + int ret; + + ret = seq_open(file, &rb_meta_seq_ops); + if (ret) + return ret; + + m = file->private_data; + m->private = buffer->buffers[cpu]; + + return 0; +} + static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer, long nr_pages, struct list_head *pages) { diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 31067de977fc..b2b16be6e4ec 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -5018,7 +5018,7 @@ static int show_traces_open(struct inode *inode, struct file *file) return 0; } -static int show_traces_release(struct inode *inode, struct file *file) +static int tracing_seq_release(struct inode *inode, struct file *file) { struct trace_array *tr = inode->i_private; @@ -5059,7 +5059,7 @@ static const struct file_operations show_traces_fops = { .open = show_traces_open, .read = seq_read, .llseek = seq_lseek, - .release= show_traces_release, + .release= tracing_seq_release, }; static ssize_t @@ -6860,6 +6860,22 @@ tracing_total_entries_read(struct file *filp, char __user *ubuf, return simple_read_from_buffer(ubuf, cnt, ppos, buf, r); } +static int tracing_buffer_meta_open(struct inode *inode, struct file *filp) +{ + struct trace_array *tr = inode->i_private; + int cpu = tracing_get_cpu(inode); + int ret; + + ret = tracing_check_open_get_tr(tr); + if (ret) + return ret; + + ret = ring_buffer_meta_seq_init(filp, tr->array_buffer.buffer, cpu); + if (ret < 0) + __trace_array_put(tr); + return ret; +} + static ssize_t tracing_free_buffer_write(struct file *filp, 
const char __user *ubuf, size_t cnt, loff_t *ppos) @@ -7436,6 +7452,13 @@ static const struct file_operations tracing_entries_fops = {
[PATCH v2 04/11] tracing: Implement creating an instance based on a given memory region
From: "Steven Rostedt (Google)" Allow for creating a new instance by passing in an address and size to map the ring buffer for the instance to. This will allow features like a pstore memory mapped region to be used for a tracing instance ring buffer that can be retrieved from one boot to the next. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/trace.c | 50 +++- kernel/trace/trace.h | 3 +++ 2 files changed, 43 insertions(+), 10 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 83194bf7b1df..31067de977fc 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4921,6 +4921,11 @@ static int tracing_open(struct inode *inode, struct file *file) static bool trace_ok_for_array(struct tracer *t, struct trace_array *tr) { +#ifdef CONFIG_TRACER_SNAPSHOT + /* arrays with mapped buffer range do not have snapshots */ + if (tr->range_addr_start && t->use_max_tr) + return false; +#endif return (tr->flags & TRACE_ARRAY_FL_GLOBAL) || t->allow_instances; } @@ -8673,11 +8678,13 @@ tracing_init_tracefs_percpu(struct trace_array *tr, long cpu) tr, cpu, _entries_fops); #ifdef CONFIG_TRACER_SNAPSHOT - trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu, - tr, cpu, _fops); + if (!tr->range_addr_start) { + trace_create_cpu_file("snapshot", TRACE_MODE_WRITE, d_cpu, + tr, cpu, _fops); - trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu, - tr, cpu, _raw_fops); + trace_create_cpu_file("snapshot_raw", TRACE_MODE_READ, d_cpu, + tr, cpu, _raw_fops); + } #endif } @@ -9214,7 +9221,18 @@ allocate_trace_buffer(struct trace_array *tr, struct array_buffer *buf, int size buf->tr = tr; - buf->buffer = ring_buffer_alloc(size, rb_flags); + if (tr->range_addr_start && tr->range_addr_size) { + buf->buffer = ring_buffer_alloc_range(size, rb_flags, 0, + tr->range_addr_start, + tr->range_addr_size); + /* +* This is basically the same as a mapped buffer, +* with the same restrictions. 
+*/ + tr->mapped++; + } else { + buf->buffer = ring_buffer_alloc(size, rb_flags); + } if (!buf->buffer) return -ENOMEM; @@ -9251,6 +9269,10 @@ static int allocate_trace_buffers(struct trace_array *tr, int size) return ret; #ifdef CONFIG_TRACER_MAX_TRACE + /* Fix mapped buffer trace arrays do not have snapshot buffers */ + if (tr->range_addr_start) + return 0; + ret = allocate_trace_buffer(tr, >max_buffer, allocate_snapshot ? size : 1); if (MEM_FAIL(ret, "Failed to allocate trace buffer\n")) { @@ -9351,7 +9373,9 @@ static int trace_array_create_dir(struct trace_array *tr) } static struct trace_array * -trace_array_create_systems(const char *name, const char *systems) +trace_array_create_systems(const char *name, const char *systems, + unsigned long range_addr_start, + unsigned long range_addr_size) { struct trace_array *tr; int ret; @@ -9377,6 +9401,10 @@ trace_array_create_systems(const char *name, const char *systems) goto out_free_tr; } + /* Only for boot up memory mapped ring buffers */ + tr->range_addr_start = range_addr_start; + tr->range_addr_size = range_addr_size; + tr->trace_flags = global_trace.trace_flags & ~ZEROED_TRACE_FLAGS; cpumask_copy(tr->tracing_cpumask, cpu_all_mask); @@ -9434,7 +9462,7 @@ trace_array_create_systems(const char *name, const char *systems) static struct trace_array *trace_array_create(const char *name) { - return trace_array_create_systems(name, NULL); + return trace_array_create_systems(name, NULL, 0, 0); } static int instance_mkdir(const char *name) @@ -9488,7 +9516,7 @@ struct trace_array *trace_array_get_by_name(const char *name, const char *system goto out_unlock; } - tr = trace_array_create_systems(name, systems); + tr = trace_array_create_systems(name, systems, 0, 0); if (IS_ERR(tr)) tr = NULL; @@ -9681,8 +9709,10 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer) MEM_FAIL(1, "Could not allocate function filter files"); #ifdef CONFIG_TRACER_SNAPSHOT - trace_cre
[PATCH v2 03/11] ring-buffer: Add ring_buffer_meta data
From: "Steven Rostedt (Google)" Populate the ring_buffer_meta array. It holds the pointer to the head_buffer (next to read), the commit_buffer (next to write) the size of the sub-buffers, number of sub-buffers and an array that keeps track of the order of the sub-buffers. This information will be stored in the persistent memory to help on reboot to reconstruct the ring buffer. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 208 - 1 file changed, 183 insertions(+), 25 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index a6d22a77f128..db8762faef77 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -43,6 +43,11 @@ static void update_pages_handler(struct work_struct *work); struct ring_buffer_meta { + unsigned long head_buffer; + unsigned long commit_buffer; + __u32 subbuf_size; + __u32 nr_subbufs; + int buffers[]; }; /* @@ -498,6 +503,7 @@ struct ring_buffer_per_cpu { struct mutexmapping_lock; unsigned long *subbuf_ids;/* ID to subbuf VA */ struct trace_buffer_meta*meta_page; + struct ring_buffer_meta *ring_meta; /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; @@ -1258,6 +1264,11 @@ static void rb_head_page_activate(struct ring_buffer_per_cpu *cpu_buffer) * Set the previous list pointer to have the HEAD flag. */ rb_set_list_to_head(head->list.prev); + + if (cpu_buffer->ring_meta) { + struct ring_buffer_meta *meta = cpu_buffer->ring_meta; + meta->head_buffer = (unsigned long)head->page; + } } static void rb_list_head_clear(struct list_head *list) @@ -1505,50 +1516,125 @@ rb_range_align_subbuf(unsigned long addr, int subbuf_size, int nr_subbufs) } /* - * Return a specific sub-buffer for a given @cpu defined by @idx. + * Return the ring_buffer_meta for a given @cpu. 
*/ -static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int nr_pages, int idx) +static void *rb_range_meta(struct trace_buffer *buffer, int nr_pages, int cpu) { - unsigned long ptr; int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE; + unsigned long ptr = buffer->range_addr_start; + struct ring_buffer_meta *meta; int nr_subbufs; - /* Include the reader page */ - nr_subbufs = nr_pages + 1; + if (!ptr) + return NULL; + + /* When nr_pages passed in is zero, the first meta has already been initialized */ + if (!nr_pages) { + meta = (struct ring_buffer_meta *)ptr; + nr_subbufs = meta->nr_subbufs; + } else { + meta = NULL; + /* Include the reader page */ + nr_subbufs = nr_pages + 1; + } /* * The first chunk may not be subbuffer aligned, where as * the rest of the chunks are. */ - ptr = buffer->range_addr_start; - ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs); if (cpu) { - unsigned long p; - - ptr += subbuf_size * nr_subbufs; - - /* Save the beginning of this CPU chunk */ - p = ptr; - ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs); + ptr += subbuf_size * nr_subbufs; if (cpu > 1) { unsigned long size; + unsigned long p; + /* Save the beginning of this CPU chunk */ + p = ptr; + ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs); ptr += subbuf_size * nr_subbufs; /* Now all chunks after this are the same size */ size = ptr - p; ptr += size * (cpu - 2); - - ptr = rb_range_align_subbuf(ptr, subbuf_size, nr_subbufs); } } - if (ptr + subbuf_size * nr_subbufs > buffer->range_addr_end) + return (void *)ptr; +} + +static void *rb_subbufs_from_meta(struct ring_buffer_meta *meta) +{ + int subbuf_size = meta->subbuf_size; + unsigned long ptr; + + ptr = (unsigned long)meta; + ptr = rb_range_align_subbuf(ptr, subbuf_size, meta->nr_subbufs); + + return (void *)ptr; +} + +/* + * Return a specific sub-buffer for a given @cpu defined by @idx. 
+ */ +static void *rb_range_buffer(struct ring_buffer_per_cpu *cpu_buffer, int idx) +{ + struct ring_buffer_meta *meta; + unsigned long ptr; + int subbuf_size; + + meta = rb_range_meta(cpu_buffer->buffer, 0, cpu_buffer->cpu); + if (!m
[PATCH v2 00/11] tracing: Persistent traces across a reboot or crash
This is a way to map a ring buffer instance across reboots. The requirement is that you have a memory region that is not erased. I tested this on a Debian VM running on qemu on a Debian server, and even tested it on a baremetal box running Fedora. I was surprised that it worked on the baremetal box, but it does so consistently.

I changed things up since the POC version (this is no longer a proof of concept, but now a serious patch set): https://lore.kernel.org/linux-trace-kernel/20240306015910.766510...@goodmis.org/

I got rid of the variables that needed to be set: trace_buffer_start and trace_buffer_size. The allocation of the mapping requires a new interface (not in this patch set yet, as it will be added for pstore or the wildcard memmap that is under RFC). That is, either a new patch will be added to connect pstore, or if the new wildcard memmap option is accepted, it can be created with:

  memmap=12M*4096:trace trace_instance=boot_mapped@trace

But that is to come later. If you want to test this code, I have the command line version available at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace.git
  branch: trace/ring-buffer-prealloc

The name of the instance is no longer defined by the kernel but by the calling code. The memory reserved is used by the ring buffer of this instance. It acts like a memory mapped instance, so it has some limitations: it does not allow snapshots, nor does it allow tracers which use a snapshot buffer (like the irqsoff and wakeup tracers).

On boot up, when setting up the ring buffer, it looks at the current content and does a vigorous test to see if the content is valid. It even walks the events in all the sub-buffers to make sure the ring buffer meta data is correct. If it determines that the content is valid, it will reconstruct the ring buffer to use the content it has found. If the buffer is valid, on the next boot, the boot_mapped instance will contain the data from the previous boot. 
You can cat the trace or trace_pipe file, or even run trace-cmd extract on it to make a trace.dat file that holds the data. This is much better than dealing with an ftrace_dump_on_oops (I wish I had this a decade ago!)

There are still some limitations of this buffer. One is that it assumes that the kernel you are booting back into is the same one that crashed, or at least one where the trace_events (like sched_switch and friends) all have the same ids. This would be true with the same kernel, as the ids are determined at link time. Module events could possibly be a problem, as the ids may not match.

This version of the patch series saves a text function and a data string address in the persistent memory, and this is used to calculate the delta between the text and data addresses of the new boot up. Now function tracing and "%pS" still work across boots. Even the RCU trace events that point to static strings work as well! The delta is exported by a new file in the instance called "last_boot_info" that has something like this:

  # cat last_boot_info
  text delta:-268435456
  data delta:-268435456

This can be used by trace-cmd, which reads the trace_pipe_raw data and can now figure out how to map the print_formats and kallsyms to the raw data in the buffers.

This is no longer a POC as it seems to be working as expected. This is based on top of Vincent Donnefort's ring buffer user space mapping code: https://lore.kernel.org/linux-trace-kernel/20240406173649.3210836-1-vdonnef...@google.com/

Enjoy... 
Steven Rostedt (Google) (11): ring-buffer: Allow mapped field to be set without mapping ring-buffer: Add ring_buffer_alloc_range() ring-buffer: Add ring_buffer_meta data tracing: Implement creating an instance based on a given memory region ring-buffer: Add output of ring buffer meta page ring-buffer: Add test if range of boot buffer is valid ring-buffer: Validate boot range memory events ring-buffer: Save text and data locations in mapped meta data tracing/ring-buffer: Add last_boot_info file to boot instance tracing: Handle old buffer mappings for event strings and functions tracing: Update function tracing output for previous boot buffer include/linux/ring_buffer.h | 20 ++ kernel/trace/ring_buffer.c | 837 kernel/trace/trace.c| 167 - kernel/trace/trace.h| 7 + kernel/trace/trace_output.c | 9 +- 5 files changed, 953 insertions(+), 87 deletions(-)
[PATCH v2 01/11] ring-buffer: Allow mapped field to be set without mapping
From: "Steven Rostedt (Google)" In preparation for having the ring buffer mapped to a dedicated location, which will have the same restrictions as user space memory mapped buffers, allow it to use the "mapped" field of the ring_buffer_per_cpu structure without having the user space meta page mapping. When this starts using the mapped field, it will need to handle adding a user space mapping (and removing it) from a ring buffer that is using a dedicated memory range. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 793ecc454039..44b1d5f1a99a 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -5223,6 +5223,9 @@ static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer) { struct trace_buffer_meta *meta = cpu_buffer->meta_page; + if (!meta) + return; + meta->reader.read = cpu_buffer->reader_page->read; meta->reader.id = cpu_buffer->reader_page->id; meta->reader.lost_events = cpu_buffer->lost_events; @@ -6167,7 +6170,7 @@ rb_get_mapped_buffer(struct trace_buffer *buffer, int cpu) mutex_lock(_buffer->mapping_lock); - if (!cpu_buffer->mapped) { + if (!cpu_buffer->mapped || !cpu_buffer->meta_page) { mutex_unlock(_buffer->mapping_lock); return ERR_PTR(-ENODEV); } @@ -6345,12 +6348,13 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu, */ raw_spin_lock_irqsave(_buffer->reader_lock, flags); rb_setup_ids_meta_page(cpu_buffer, subbuf_ids); + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); err = __rb_map_vma(cpu_buffer, vma); if (!err) { raw_spin_lock_irqsave(_buffer->reader_lock, flags); - cpu_buffer->mapped = 1; + cpu_buffer->mapped++; raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); } else { kfree(cpu_buffer->subbuf_ids); @@ -6388,7 +6392,7 @@ int ring_buffer_unmap(struct trace_buffer *buffer, int cpu) mutex_lock(>mutex); 
raw_spin_lock_irqsave(_buffer->reader_lock, flags); - cpu_buffer->mapped = 0; + cpu_buffer->mapped--; raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); -- 2.43.0
[PATCH v2 02/11] ring-buffer: Add ring_buffer_alloc_range()
From: "Steven Rostedt (Google)" In preparation for allowing the trace ring buffer to be allocated in a range of memory that is persistent across reboots, add ring_buffer_alloc_range(). It takes a contiguous range of memory and will split it up evenly for the per CPU ring buffers. If there's not enough memory to handle all CPUs with the minimum size, it will fail to allocate the ring buffer. Signed-off-by: Steven Rostedt (Google) --- include/linux/ring_buffer.h | 17 +++ kernel/trace/ring_buffer.c | 222 ++-- 2 files changed, 204 insertions(+), 35 deletions(-) diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index 96d2140b471e..a50b0223b1d3 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -89,6 +89,11 @@ void ring_buffer_discard_commit(struct trace_buffer *buffer, struct trace_buffer * __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *key); +struct trace_buffer *__ring_buffer_alloc_range(unsigned long size, unsigned flags, + int order, unsigned long start, + unsigned long range_size, + struct lock_class_key *key); + /* * Because the ring buffer is generic, if other users of the ring buffer get * traced by ftrace, it can produce lockdep warnings. We need to keep each @@ -100,6 +105,18 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k __ring_buffer_alloc((size), (flags), &__key); \ }) +/* + * Because the ring buffer is generic, if other users of the ring buffer get + * traced by ftrace, it can produce lockdep warnings. We need to keep each + * ring buffer's lock class separate. 
+ */ +#define ring_buffer_alloc_range(size, flags, order, start, range_size) \ +({ \ + static struct lock_class_key __key; \ + __ring_buffer_alloc_range((size), (flags), (order), (start),\ + (range_size), &__key);\ +}) + typedef bool (*ring_buffer_cond_fn)(void *data); int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full, ring_buffer_cond_fn cond, void *data); diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 44b1d5f1a99a..a6d22a77f128 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -42,6 +42,9 @@ static void update_pages_handler(struct work_struct *work); +struct ring_buffer_meta { +}; + /* * The ring buffer header is special. We must manually up keep it. */ @@ -340,7 +343,8 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ - u32 id;/* ID for external mapping */ + u32 id:30; /* ID for external mapping */ + u32 range:1; /* Mapped via a range */ struct buffer_data_page *page; /* Actual data page */ }; @@ -371,7 +375,9 @@ static __always_inline unsigned int rb_page_commit(struct buffer_page *bpage) static void free_buffer_page(struct buffer_page *bpage) { - free_pages((unsigned long)bpage->page, bpage->order); + /* Range pages are not to be freed */ + if (!bpage->range) + free_pages((unsigned long)bpage->page, bpage->order); kfree(bpage); } @@ -521,6 +527,9 @@ struct trace_buffer { struct rb_irq_work irq_work; booltime_stamp_abs; + unsigned long range_addr_start; + unsigned long range_addr_end; + unsigned intsubbuf_size; unsigned intsubbuf_order; unsigned intmax_data_size; @@ -1483,9 +1492,67 @@ static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer) } } +/* + * Take an address, add the meta data size as well as the array of + * array subbuffer indexes, then align it to a subbuffer size. 
+ */ +static unsigned long +rb_range_align_subbuf(unsigned long addr, int subbuf_size, int nr_subbufs) +{ + addr += sizeof(struct ring_buffer_meta) + + sizeof(int) * nr_subbufs; + return ALIGN(addr, subbuf_size); +} + +/* + * Return a specific sub-buffer for a given @cpu defined by @idx. + */ +static void *rb_range_buffer(struct trace_buffer *buffer, int cpu, int nr_pages, int idx) +{ + unsigned long ptr; + int subbuf_size = buffer->subbuf_size + BUF_PAGE_HDR_SIZE; + int nr_subbufs; + + /* Include the reader page */ + nr_subbufs = nr_pages + 1; + + /* +* The firs
Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field
On Wed, 10 Apr 2024 13:07:20 -0400 Chuck Lever wrote: > Well I guess I could test it for you. I'll take it for nfsd v6.9-rc. Thanks! -- Steve
Re: [PATCH v20 0/5] Introducing trace buffer mapping by user-space
Hi Andrew, et.al. Linus said it's a hard requirement that this code gets an Acked-by (or Reviewed-by) from the memory sub-maintainers before he will accept it. He was upset that we faulted in pages one at a time instead of mapping it in one go: https://lore.kernel.org/all/CAHk-=wh5wWeib7+kVHpBVtUn7kx7GGadWqb5mW5FYTdewEfL=w...@mail.gmail.com/ Could you take a look at patches 1-3 to make sure they look sane from a memory management point of view? I really want this applied in the next merge window. Thanks! -- Steve On Sat, 6 Apr 2024 18:36:44 +0100 Vincent Donnefort wrote: > The tracing ring-buffers can be stored on disk or sent to network > without any copy via splice. However the later doesn't allow real time > processing of the traces. A solution is to give userspace direct access > to the ring-buffer pages via a mapping. An application can now become a > consumer of the ring-buffer, in a similar fashion to what trace_pipe > offers. > > Support for this new feature can already be found in libtracefs from > version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE. > > Vincent > > v19 -> v20: > * Fix typos in documentation. > * Remove useless mmap open and fault callbacks. > * add mm.h include for vm_insert_pages > > v18 -> v19: > * Use VM_PFNMAP and vm_insert_pages > * Allocate ring-buffer subbufs with __GFP_COMP > * Pad the meta-page with the zero-page to align on the subbuf_order > * Extend the ring-buffer test with mmap() dedicated suite > > v17 -> v18: > * Fix lockdep_assert_held > * Fix spin_lock_init typo > * Fix CONFIG_TRACER_MAX_TRACE typo > > v16 -> v17: > * Documentation and comments improvements. > * Create get/put_snapshot_map() for clearer code. > * Replace kzalloc with kcalloc. > * Fix -ENOMEM handling in rb_alloc_meta_page(). > * Move flush(cpu_buffer->reader_page) behind the reader lock. > * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer > mutex. (removes READ_ONCE/WRITE_ONCE accesses). 
> > v15 -> v16: > * Add comment for the dcache flush. > * Remove now unnecessary WRITE_ONCE for the meta-page. > > v14 -> v15: > * Add meta-page and reader-page flush. Intends to fix the mapping > for VIVT and aliasing-VIPT data caches. > * -EPERM on VM_EXEC. > * Fix build warning !CONFIG_TRACER_MAX_TRACE. > > v13 -> v14: > * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu) > * on unmap, sync meta-page teardown with the reader_lock instead of > the synchronize_rcu. > * Add a dedicated spinlock for trace_array ->snapshot and ->mapped. > (intends to fix a lockdep issue) > * Add kerneldoc for flags and Reserved fields. > * Add kselftest for snapshot/map mutual exclusion. > > v12 -> v13: > * Swap subbufs_{touched,lost} for Reserved fields. > * Add a flag field in the meta-page. > * Fix CONFIG_TRACER_MAX_TRACE. > * Rebase on top of trace/urgent. > * Add a comment for try_unregister_trigger() > > v11 -> v12: > * Fix code sample mmap bug. > * Add logging in sample code. > * Reset tracer in selftest. > * Add a refcount for the snapshot users. > * Prevent mapping when there are snapshot users and vice versa. > * Refine the meta-page. > * Fix types in the meta-page. > * Collect Reviewed-by. > > v10 -> v11: > * Add Documentation and code sample. > * Add a selftest. > * Move all the update to the meta-page into a single > rb_update_meta_page(). > * rb_update_meta_page() is now called from > ring_buffer_map_get_reader() to fix NOBLOCK callers. > * kerneldoc for struct trace_meta_page. > * Add a patch to zero all the ring-buffer allocations. 
> > v9 -> v10: > * Refactor rb_update_meta_page() > * In-loop declaration for foreach_subbuf_page() > * Check for cpu_buffer->mapped overflow > > v8 -> v9: > * Fix the unlock path in ring_buffer_map() > * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer > * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a) > > v7 -> v8: > * Drop the subbufs renaming into bpages > * Use subbuf as a name when relevant > > v6 -> v7: > * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/ > * Support for subbufs > * Rename subbufs into bpages > > v5 -> v6: > * Rebase on next-20230802. > * (unsigned long) -> (void *) cast for virt_to_page(). > * Add a wait for the GET_READER_PAGE ioctl. > * Move writer fields update (overrun/pages_lost/entries/pages_touched) > in the irq_work. > * Rearrange id in struct buffer_page. > * Rearrange the meta-page. > * ring_buffer_meta_page -> trace_buffer_meta_page. > * Add meta_struct_len into the meta-page. > > v4 -> v5: > * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3) > > v3 -> v4: > * Add to the meta-page: >- pages_lost / pages_read (allow to compute how full is the >ring-buffer) >- read (allow to compute how many entries can be read) >- A reader_page
Re: [PATCH v20 2/5] ring-buffer: Introducing ring-buffer mapping functions
On Sat, 6 Apr 2024 18:36:46 +0100 Vincent Donnefort wrote: > +int ring_buffer_map(struct trace_buffer *buffer, int cpu, > + struct vm_area_struct *vma) > +{ > + struct ring_buffer_per_cpu *cpu_buffer; > + unsigned long flags, *subbuf_ids; > + int err = 0; > + > + if (!cpumask_test_cpu(cpu, buffer->cpumask)) > + return -EINVAL; > + > + cpu_buffer = buffer->buffers[cpu]; > + > + mutex_lock(_buffer->mapping_lock); > + > + if (cpu_buffer->mapped) { > + err = __rb_map_vma(cpu_buffer, vma); > + if (!err) > + err = __rb_inc_dec_mapped(cpu_buffer, true); > + mutex_unlock(_buffer->mapping_lock); > + return err; > + } > + > + /* prevent another thread from changing buffer/sub-buffer sizes */ > + mutex_lock(>mutex); > + > + err = rb_alloc_meta_page(cpu_buffer); > + if (err) > + goto unlock; > + > + /* subbuf_ids include the reader while nr_pages does not */ > + subbuf_ids = kcalloc(cpu_buffer->nr_pages + 1, sizeof(*subbuf_ids), > GFP_KERNEL); > + if (!subbuf_ids) { > + rb_free_meta_page(cpu_buffer); > + err = -ENOMEM; > + goto unlock; > + } > + > + atomic_inc(_buffer->resize_disabled); > + > + /* > + * Lock all readers to block any subbuf swap until the subbuf IDs are > + * assigned. > + */ > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > + rb_setup_ids_meta_page(cpu_buffer, subbuf_ids); > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > + > + err = __rb_map_vma(cpu_buffer, vma); > + if (!err) { > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > + cpu_buffer->mapped = 1; > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > + } else { > + kfree(cpu_buffer->subbuf_ids); > + cpu_buffer->subbuf_ids = NULL; > + rb_free_meta_page(cpu_buffer); > + } > +unlock: Nit: For all labels, please add a space before them. Otherwise, diffs will show "unlock" as the function and not "ring_buffer_map", making it harder to find where the change is. Same for the labels below. 
-- Steve > + mutex_unlock(>mutex); > + mutex_unlock(_buffer->mapping_lock); > + > + return err; > +} > + > +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu) > +{ > + struct ring_buffer_per_cpu *cpu_buffer; > + unsigned long flags; > + int err = 0; > + > + if (!cpumask_test_cpu(cpu, buffer->cpumask)) > + return -EINVAL; > + > + cpu_buffer = buffer->buffers[cpu]; > + > + mutex_lock(_buffer->mapping_lock); > + > + if (!cpu_buffer->mapped) { > + err = -ENODEV; > + goto out; > + } else if (cpu_buffer->mapped > 1) { > + __rb_inc_dec_mapped(cpu_buffer, false); > + goto out; > + } > + > + mutex_lock(>mutex); > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > + > + cpu_buffer->mapped = 0; > + > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > + > + kfree(cpu_buffer->subbuf_ids); > + cpu_buffer->subbuf_ids = NULL; > + rb_free_meta_page(cpu_buffer); > + atomic_dec(_buffer->resize_disabled); > + > + mutex_unlock(>mutex); > +out: > + mutex_unlock(_buffer->mapping_lock); > + > + return err; > +} > + > +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu) > +{ > + struct ring_buffer_per_cpu *cpu_buffer; > + unsigned long reader_size; > + unsigned long flags; > + > + cpu_buffer = rb_get_mapped_buffer(buffer, cpu); > + if (IS_ERR(cpu_buffer)) > + return (int)PTR_ERR(cpu_buffer); > + > + raw_spin_lock_irqsave(_buffer->reader_lock, flags); > +consume: > + if (rb_per_cpu_empty(cpu_buffer)) > + goto out; > + > + reader_size = rb_page_size(cpu_buffer->reader_page); > + > + /* > + * There are data to be read on the current reader page, we can > + * return to the caller. But before that, we assume the latter will read > + * everything. Let's update the kernel reader accordingly. 
> + */ > + if (cpu_buffer->reader_page->read < reader_size) { > + while (cpu_buffer->reader_page->read < reader_size) > + rb_advance_reader(cpu_buffer); > + goto out; > + } > + > + if (WARN_ON(!rb_get_reader_page(cpu_buffer))) > + goto out; > + > + goto consume; > +out: > + /* Some archs do not have data cache coherency between kernel and > user-space */ > + flush_dcache_folio(virt_to_folio(cpu_buffer->reader_page->page)); > + > + rb_update_meta_page(cpu_buffer); > + > + raw_spin_unlock_irqrestore(_buffer->reader_lock, flags); > + rb_put_mapped_buffer(cpu_buffer); > + > + return 0; > +} > + > /* > * We only allocate new buffers, never free them if the CPU goes down. > *
Re: [PATCH v20 1/5] ring-buffer: allocate sub-buffers with __GFP_COMP
Hi Vincent, Thanks for sending this. Nit: Subject should start with a capital: ring-buffer: Allocate sub-buffers with __GFP_COMP -- Steve On Sat, 6 Apr 2024 18:36:45 +0100 Vincent Donnefort wrote: > In preparation for the ring-buffer memory mapping, allocate compound > pages for the ring-buffer sub-buffers to enable us to map them to > user-space with vm_insert_pages(). > > Signed-off-by: Vincent Donnefort > > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index 25476ead681b..cc9ebe593571 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -1524,7 +1524,7 @@ static int __rb_allocate_pages(struct > ring_buffer_per_cpu *cpu_buffer, > list_add(>list, pages); > > page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), > - mflags | __GFP_ZERO, > + mflags | __GFP_COMP | __GFP_ZERO, > cpu_buffer->buffer->subbuf_order); > if (!page) > goto free_pages; > @@ -1609,7 +1609,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, > long nr_pages, int cpu) > > cpu_buffer->reader_page = bpage; > > - page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO, > + page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_COMP | > __GFP_ZERO, > cpu_buffer->buffer->subbuf_order); > if (!page) > goto fail_free_reader; > @@ -5579,7 +5579,7 @@ ring_buffer_alloc_read_page(struct trace_buffer > *buffer, int cpu) > goto out; > > page = alloc_pages_node(cpu_to_node(cpu), > - GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO, > + GFP_KERNEL | __GFP_NORETRY | __GFP_COMP | > __GFP_ZERO, > cpu_buffer->buffer->subbuf_order); > if (!page) { > kfree(bpage);
Re: [PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field
On Wed, 10 Apr 2024 12:38:53 -0400 Chuck Lever wrote: > On Wed, Apr 10, 2024 at 12:38:13PM -0400, Steven Rostedt wrote: > > From: "Steven Rostedt (Google)" > > > > The rpcgss_context trace event acceptor field is a dynamically sized > > string that records the "data" parameter. But this parameter is also > > dependent on the "len" field to determine the size of the data. > > > > It needs to use __string_len() helper macro where the length can be passed > > in. It also incorrectly uses strncpy() to save it instead of > > __assign_str(). As these macros can change, it is not wise to open code > > them in trace events. > > > > As of commit c759e609030c ("tracing: Remove __assign_str_len()"), > > __assign_str() can be used for both __string() and __string_len() fields. > > Before that commit, __assign_str_len() is required to be used. This needs > > to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing: > > Rework __assign_str() and __string() to not duplicate getting the string") > > is the commit that makes __string_str_len() obsolete). > > > > Cc: sta...@vger.kernel.org > > Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko") > > Signed-off-by: Steven Rostedt (Google) > > Acked-by: Chuck Lever > Thanks, but feel free to take it if you want. Unless you rather have me take it through my tree? -- Steve
[PATCH] SUNRPC: Fix rpcgss_context trace event acceptor field
From: "Steven Rostedt (Google)" The rpcgss_context trace event acceptor field is a dynamically sized string that records the "data" parameter. But this parameter is also dependent on the "len" field to determine the size of the data. It needs to use __string_len() helper macro where the length can be passed in. It also incorrectly uses strncpy() to save it instead of __assign_str(). As these macros can change, it is not wise to open code them in trace events. As of commit c759e609030c ("tracing: Remove __assign_str_len()"), __assign_str() can be used for both __string() and __string_len() fields. Before that commit, __assign_str_len() is required to be used. This needs to be noted for backporting. (In actuality, commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string") is the commit that makes __string_str_len() obsolete). Cc: sta...@vger.kernel.org Fixes: 0c77668ddb4e7 ("SUNRPC: Introduce trace points in rpc_auth_gss.ko") Signed-off-by: Steven Rostedt (Google) --- include/trace/events/rpcgss.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/trace/events/rpcgss.h b/include/trace/events/rpcgss.h index ba2d96a1bc2f..f50fcafc69de 100644 --- a/include/trace/events/rpcgss.h +++ b/include/trace/events/rpcgss.h @@ -609,7 +609,7 @@ TRACE_EVENT(rpcgss_context, __field(unsigned int, timeout) __field(u32, window_size) __field(int, len) - __string(acceptor, data) + __string_len(acceptor, data, len) ), TP_fast_assign( @@ -618,7 +618,7 @@ TRACE_EVENT(rpcgss_context, __entry->timeout = timeout; __entry->window_size = window_size; __entry->len = len; - strncpy(__get_str(acceptor), data, len); + __assign_str(acceptor, data); ), TP_printk("win_size=%u expiry=%lu now=%lu timeout=%u acceptor=%.*s", -- 2.43.0
Re: [PATCH] ftrace: Fix use-after-free issue in ftrace_location()
On Mon, 1 Apr 2024 20:55:43 +0800 Zheng Yejian wrote: > KASAN reports a bug: > > BUG: KASAN: use-after-free in ftrace_location+0x90/0x120 > Read of size 8 at addr 888141d40010 by task insmod/424 > CPU: 8 PID: 424 Comm: insmod Tainted: GW 6.9.0-rc2+ #213 > [...] > Call Trace: > > dump_stack_lvl+0x68/0xa0 > print_report+0xcf/0x610 > kasan_report+0xb5/0xe0 > ftrace_location+0x90/0x120 > register_kprobe+0x14b/0xa40 > kprobe_init+0x2d/0xff0 [kprobe_example] > do_one_initcall+0x8f/0x2d0 > do_init_module+0x13a/0x3c0 > load_module+0x3082/0x33d0 > init_module_from_file+0xd2/0x130 > __x64_sys_finit_module+0x306/0x440 > do_syscall_64+0x68/0x140 > entry_SYSCALL_64_after_hwframe+0x71/0x79 > > The root cause is that when lookup_rec() is looking up the ftrace record of > an address in some module, at the same time in ftrace_release_mod(), > the memory saving the ftrace records has been freed as that module is > being deleted. > > register_kprobes() { > check_kprobe_address_safe() { > arch_check_ftrace_location() { > ftrace_location() { > lookup_rec() // access memory that has been freed by > // ftrace_release_mod() !!! > > It seems that the ftrace_lock is required when looking up records in > ftrace_location(), as is ftrace_location_range(). > > Fixes: ae6aa16fdc16 ("kprobes: introduce ftrace based optimization") > Signed-off-by: Zheng Yejian > --- > kernel/trace/ftrace.c | 28 ++-- > 1 file changed, 18 insertions(+), 10 deletions(-) > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c > index da1710499698..838d175709c1 100644 > --- a/kernel/trace/ftrace.c > +++ b/kernel/trace/ftrace.c > @@ -1581,7 +1581,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long > start, unsigned long end) > } > > /** > - * ftrace_location_range - return the first address of a traced location > + * ftrace_location_range_locked - return the first address of a traced > location > * if it touches the given ip range > * @start: start of range to search. > * @end: end of range to search (inclusive). 
@end points to the last byte > @@ -1592,7 +1592,7 @@ static struct dyn_ftrace *lookup_rec(unsigned long > start, unsigned long end) > * that is either a NOP or call to the function tracer. It checks the ftrace > * internal tables to determine if the address belongs or not. > */ > -unsigned long ftrace_location_range(unsigned long start, unsigned long end) > +static unsigned long ftrace_location_range_locked(unsigned long start, > unsigned long end) > { > struct dyn_ftrace *rec; > > @@ -1603,6 +1603,17 @@ unsigned long ftrace_location_range(unsigned long > start, unsigned long end) > return 0; > } > > +unsigned long ftrace_location_range(unsigned long start, unsigned long end) > +{ > + unsigned long loc; > + > + mutex_lock(&ftrace_lock); > + loc = ftrace_location_range_locked(start, end); > + mutex_unlock(&ftrace_lock); I'm not so sure we can take a mutex in all places that call this function. What about using RCU? rcu_read_lock(); loc = ftrace_location_range_rcu(start, end); rcu_read_unlock(); Then in ftrace_release_mod() we can have: out_unlock: mutex_unlock(&ftrace_lock); /* Need to synchronize with ftrace_location() */ if (tmp_pages) synchronize_rcu(); -- Steve > + > + return loc; > +} > + > /** > * ftrace_location - return the ftrace location > * @ip: the instruction pointer to check > @@ -1614,25 +1625,22 @@ unsigned long ftrace_location_range(unsigned long > start, unsigned long end) > */ > unsigned long ftrace_location(unsigned long ip) > { > - struct dyn_ftrace *rec; > + unsigned long loc; > unsigned long offset; > unsigned long size; > > - rec = lookup_rec(ip, ip); > - if (!rec) { > + loc = ftrace_location_range(ip, ip); > + if (!loc) { > if (!kallsyms_lookup_size_offset(ip, &size, &offset)) > goto out; > > /* map sym+0 to __fentry__ */ > if (!offset) > - rec = lookup_rec(ip, ip + size - 1); > + loc = ftrace_location_range(ip, ip + size - 1); > } > > - if (rec) > - return rec->ip; > - > out: > - return 0; > + return loc; > } > > /**
Re: [PATCH] ring-buffer: Only update pages_touched when a new page is touched
On Wed, 10 Apr 2024 08:44:00 +0900 Masami Hiramatsu (Google) wrote: > Looks good to me. > > Acked-by: Masami Hiramatsu (Google) Thanks. > > BTW, isn't this a real bugfix, because the page_touched can be > bigger than nr_pages without this fix? Yes, I simply forgot to add the Cc stable. -- Steve
[PATCH] ring-buffer: Only update pages_touched when a new page is touched
From: "Steven Rostedt (Google)" The "buffer_percent" logic that is used by the ring buffer splice code to only wake up the tasks when there's no data after the buffer is filled to the percentage of the "buffer_percent" file is dependent on three variables that determine the amount of data that is in the ring buffer: 1) pages_read - incremented whenever a new sub-buffer is consumed 2) pages_lost - incremented every time a writer overwrites a sub-buffer 3) pages_touched - incremented when a write goes to a new sub-buffer The percentage is the calculation of: (pages_touched - (pages_lost + pages_read)) / nr_pages Basically, the amount of data is the total number of sub-bufs that have been touched, minus the number of sub-bufs lost and sub-bufs consumed. This is divided by the total count to give the buffer percentage. When the percentage is greater than the value in the "buffer_percent" file, it wakes up splice readers waiting for that amount. It was observed that over time, the amount read from the splice was constantly decreasing the longer the trace was running. That is, if one asked for 60%, it would read over 60% when it first starts tracing, but then it would be woken up at under 60% and would slowly decrease the amount of data read after being woken up, where the amount becomes much less than the buffer percent. This was due to an accounting of the pages_touched incrementation. This value is incremented whenever a writer transfers to a new sub-buffer. But the place where it was incremented was incorrect. If a writer overflowed the current sub-buffer it would go to the next one. If it gets preempted by an interrupt at that time, and the interrupt performs a trace, it too will end up going to the next sub-buffer. But only one should increment the counter. Unfortunately, that was not the case. Change the cmpxchg() that does the real switch of the tail-page into a try_cmpxchg(), and on success, perform the increment of pages_touched. 
This will only increment the counter once for when the writer moves to a new sub-buffer, and not when there's a race and is incremented for when a writer and its preempting writer both move to the same new sub-buffer. Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index 25476ead681b..6511dc3a00da 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -1393,7 +1393,6 @@ static void rb_tail_page_update(struct ring_buffer_per_cpu *cpu_buffer, old_write = local_add_return(RB_WRITE_INTCNT, &next_page->write); old_entries = local_add_return(RB_WRITE_INTCNT, &next_page->entries); - local_inc(&cpu_buffer->pages_touched); /* * Just make sure we have seen our old_write and synchronize * with any interrupts that come in. @@ -1430,8 +1429,9 @@ static void rb_tail_page_update(struct ring_buffer_per_cpu *cpu_buffer, */ local_set(&next_page->page->commit, 0); - /* Again, either we update tail_page or an interrupt does */ - (void)cmpxchg(&cpu_buffer->tail_page, tail_page, next_page); + /* Either we update tail_page or an interrupt does */ + if (try_cmpxchg(&cpu_buffer->tail_page, &tail_page, next_page)) + local_inc(&cpu_buffer->pages_touched); } } -- 2.43.0
Re: [PATCH net-next v3 6/6] rstreason: make it work in trace world
On Tue, 9 Apr 2024 18:09:34 +0800 Jason Xing wrote: > /* > * tcp event with arguments sk and skb > @@ -74,20 +75,38 @@ DEFINE_EVENT(tcp_event_sk_skb, tcp_retransmit_skb, > TP_ARGS(sk, skb) > ); > > +#undef FN1 > +#define FN1(reason) TRACE_DEFINE_ENUM(SK_RST_REASON_##reason); > +#undef FN2 > +#define FN2(reason) TRACE_DEFINE_ENUM(SKB_DROP_REASON_##reason); > +DEFINE_RST_REASON(FN1, FN1) Interesting. I've never seen the passing of the internal macros to the main macro before. I see that you are using it for handling both the SK_RST_REASON and the SK_DROP_REASON. > + > +#undef FN1 > +#undef FNe1 > +#define FN1(reason) { SK_RST_REASON_##reason, #reason }, > +#define FNe1(reason) { SK_RST_REASON_##reason, #reason } > + > +#undef FN2 > +#undef FNe2 > +#define FN2(reason) { SKB_DROP_REASON_##reason, #reason }, > +#define FNe2(reason) { SKB_DROP_REASON_##reason, #reason } Anyway, from a tracing point of view, it looks like it would work (I haven't tested it). Reviewed-by: Steven Rostedt (Google) -- Steve
Re: [PATCH 1/3] tracing: Remove dependency of saved_cmdlines_buffer on PID_MAX_DEFAULT
On Mon, 8 Apr 2024 16:58:17 +0200 Michal Koutný wrote: > @@ -294,7 +295,7 @@ static void __trace_find_cmdline(int pid, char comm[]) > return; > } > > - tpid = pid & (PID_MAX_DEFAULT - 1); > + tpid = pid % PID_MAP_SIZE; Does that compile to the same? This is a fast path. -- Steve > map = savedcmd->map_pid_to_cmdline[tpid]; > if (map != NO_CMDLINE_MAP) { > tpid = savedcmd->map_cmdline_to_pid[map];
Re: [PATCH] tracing: Add new_exec tracepoint
On Mon, 8 Apr 2024 11:01:54 +0200 Marco Elver wrote: > Add "new_exec" tracepoint, which is run right after the point of no > return but before the current task assumes its new exec identity. > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > runs before flushing the old exec, i.e. while the task still has the > original state (such as original MM), but when the new exec either > succeeds or crashes (but never returns to the original exec). > > Being able to trace this event can be helpful in a number of use cases: > > * allowing tracing eBPF programs access to the original MM on exec, > before current->mm is replaced; > * counting exec in the original task (via perf event); > * profiling flush time ("new_exec" to "sched_process_exec"). > > Example of tracing output ("new_exec" and "sched_process_exec"): How common is this? And can't you just do the same with adding a kprobe? -- Steve
Re: [PATCH RFC ftrace] Asynchronous grace period for register_ftrace_direct()
On Wed, 3 Apr 2024 11:53:14 -0700 "Paul E. McKenney" wrote: > @@ -5366,6 +5366,13 @@ static void remove_direct_functions_hash(struct > ftrace_hash *hash, unsigned long > } > } > > +static void register_ftrace_direct_cb(struct rcu_head *rhp) > +{ > + struct ftrace_hash *fhp = container_of(rhp, struct ftrace_hash, rcu); > + > + free_ftrace_hash(fhp); > +} > + > /** > * register_ftrace_direct - Call a custom trampoline directly > * for multiple functions registered in @ops > @@ -5464,10 +5471,8 @@ int register_ftrace_direct(struct ftrace_ops *ops, > unsigned long addr) > out_unlock: > mutex_unlock(&direct_mutex); > > - if (free_hash && free_hash != EMPTY_HASH) { > - synchronize_rcu_tasks(); > - free_ftrace_hash(free_hash); > - } > + if (free_hash && free_hash != EMPTY_HASH) > + call_rcu_tasks(&free_hash->rcu, register_ftrace_direct_cb); > > if (new_hash) > free_ftrace_hash(new_hash); I'm getting ready to go on PTO, but a quick glance doesn't look like this would cause any harm. -- Steve
Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer
On Wed, 3 Apr 2024 15:39:44 +0100 Vincent Donnefort wrote: > > Do you plan on sending out a v20 series? > > Of course, let me spin that this week! Got also few typos to fix in the doc > and > I believe an include missing for riscv. No rush, I'll be on PTO until next Tuesday, and will not get to it before then. -- Steve
Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer
On Fri, 29 Mar 2024 14:40:55 -0400 Steven Rostedt wrote: > > +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf) > > +{ > > + return VM_FAULT_SIGBUS; > > +} > > If this is all it does, I don't believe it's needed. > > > + > > +#ifdef CONFIG_TRACER_MAX_TRACE > > +static int get_snapshot_map(struct trace_array *tr) > > +{ > > + int err = 0; > > + > > + /* > > +* Called with mmap_lock held. lockdep would be unhappy if we would now > > +* take trace_types_lock. Instead use the specific > > +* snapshot_trigger_lock. > > +*/ > > + spin_lock(&tr->snapshot_trigger_lock); > > + > > + if (tr->snapshot || tr->mapped == UINT_MAX) > > + err = -EBUSY; > > + else > > + tr->mapped++; > > + > > + spin_unlock(&tr->snapshot_trigger_lock); > > + > > + /* Wait for update_max_tr() to observe iter->tr->mapped */ > > + if (tr->mapped == 1) > > + synchronize_rcu(); > > + > > + return err; > > + > > +} > > +static void put_snapshot_map(struct trace_array *tr) > > +{ > > + spin_lock(&tr->snapshot_trigger_lock); > > + if (!WARN_ON(!tr->mapped)) > > + tr->mapped--; > > + spin_unlock(&tr->snapshot_trigger_lock); > > +} > > +#else > > +static inline int get_snapshot_map(struct trace_array *tr) { return 0; } > > +static inline void put_snapshot_map(struct trace_array *tr) { } > > +#endif > > + > > +static void tracing_buffers_mmap_close(struct vm_area_struct *vma) > > +{ > > + struct ftrace_buffer_info *info = vma->vm_file->private_data; > > + struct trace_iterator *iter = &info->iter; > > + > > + WARN_ON(ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file)); > > + put_snapshot_map(iter->tr); > > +} > > + > > +static void tracing_buffers_mmap_open(struct vm_area_struct *vma) { } > > Same for the open. 
> > > > + > > +static const struct vm_operations_struct tracing_buffers_vmops = { > > + .open = tracing_buffers_mmap_open, > > + .close = tracing_buffers_mmap_close, > > + .fault = tracing_buffers_mmap_fault, > > +}; > > I replaced this with: > > static const struct vm_operations_struct tracing_buffers_vmops = { > .close = tracing_buffers_mmap_close, > }; > > And it appears to work just fine. The mm code handles the NULL cases for > .open and .fault. > > Is there any reason to do something different than the mm defaults? Hi Vincent, Do you plan on sending out a v20 series? -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, 2 Apr 2024 22:21:00 -0700 Andrii Nakryiko wrote: > > I just checked our fleet-wide production data for the last 24 hours. > > Within the kprobe/kretprobe code path (ftrace_trampoline and > > everything called from it), rcu_is_watching (both calls, see below) > > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd > > prefer to be able to avoid that in production use cases. > > > > I just ran synthetic microbenchmark testing multi-kretprobe > throughput. We get (in millions of BPF kretprobe-multi program > invocations per second): > - 5.568M/s as baseline; > - 5.679M/s with changes in this patch (+2% throughput improvement); > - 5.808M/s with disabling rcu_is_watching in rethook_try_get() > (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline). > > It's definitely noticeable. Ah, thanks for verifying (I should have read the thread before replying to the previous email). -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, 2 Apr 2024 21:00:19 -0700 Andrii Nakryiko wrote: > I just noticed another rcu_is_watching() call, in rethook_try_get(), > which seems to be a similar and complementary validation check to the > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option > in this patch. It feels like both of them should be controlled by the > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > guard around rcu_is_watching() check in rethook_try_get() as well? That is totally up to Masami. It may have even less overhead as I'm not sure how many times that gets called, and there may be more work to do than with function tracing. -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Wed, 3 Apr 2024 09:40:48 +0900 Masami Hiramatsu (Google) wrote: > OK, for me, this last sentence is preferred for the help message. That > explains > what this is for. > > All callbacks that attach to the function tracing have some sort > of protection against recursion. This option is only to verify that > ftrace (and other users of ftrace_test_recursion_trylock()) are not > called outside of RCU, as if they are, it can cause a race. > But it also has a noticeable overhead when enabled. > > BTW, how much overhead does this introduce? and the race case a kernel crash? > or just messed up the ftrace buffer? There's a hypothetical race where it can cause a use after free. That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(), which requires RCU to be watching. There's a theoretical case where that task calls the trampoline and misses the synchronization. Note, these locations are with preemption disabled, as rcu is always watching when preemption is enabled. Thus it would be extremely fast where as the synchronize_rcu_tasks() is rather slow. We also have synchronize_rcu_tasks_rude() which would actually keep the trace from happening, as it would schedule on each CPU forcing all CPUs to have RCU watching. I have never heard of this race being hit. I guess it could happen on a VM where a vCPU gets preempted at the right moment for a long time and the other CPUs synchronize. But like lockdep, where deadlocks can crash the kernel, we don't enable it for production. The overhead is another function call within the function tracer. I had numbers before, but I guess I could run tests again and get new numbers. Thanks, -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 1 Apr 2024 19:29:46 -0700 Andrii Nakryiko wrote: > On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu wrote: > > > > On Mon, 1 Apr 2024 12:09:18 -0400 > > Steven Rostedt wrote: > > > > > On Mon, 1 Apr 2024 20:25:52 +0900 > > > Masami Hiramatsu (Google) wrote: > > > > > > > > Masami, > > > > > > > > > > Are you OK with just keeping it set to N. > > > > > > > > OK, if it is only for the debugging, I'm OK to set N this. > > > > > > > > > > > > > > We could have other options like PROVE_LOCKING enable it. > > > > > > > > Agreed (but it should say this is a debug option) > > > > > > It does say "Validate" which to me is a debug option. What would you > > > suggest? > > > > I think the help message should have "This is for debugging ftrace." > > > > Sent v2 with adjusted wording, thanks! You may want to wait till Masami and I agree ;-) Masami, But it isn't really for "debugging", it's for validating. That is, it doesn't give us any information to debug ftrace. It only validates if it is executed properly. In other words, I never want to be asked "How can I use this option to debug ftrace?" For example, we also have: "Verify ring buffer time stamp deltas" that makes sure the time stamps of the ring buffer are not buggy. And there's: "Verify compile time sorting of ftrace functions" Which is also used to make sure things are working properly. Neither of the above says they are for "debugging", even though they are more useful for debugging than this option. I'm not sure saying this is "debugging ftrace" is accurate. It may help debug ftrace if it is called outside of an RCU location, which has a 1 in 100,000,000,000 chance of causing an actual bug, as the race window is extremely small. Now if it is also called outside of instrumentation, that will likely trigger other warnings even without this code, and this will not be needed to debug that. ftrace has all sorts of "verifiers" that is used to make sure things are working properly. 
And yes, you can consider it as "debugging". But I would not consider this an option to enable if ftrace was broken, and you are looking into why it is broken. To me, this option is only to verify that ftrace (and other users of ftrace_test_recursion_trylock()) are not called outside of RCU, as if they are, it can cause a race. But it also has a noticeable overhead when enabled. -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 1 Apr 2024 20:25:52 +0900 Masami Hiramatsu (Google) wrote: > > Masami, > > > > Are you OK with just keeping it set to N. > > OK, if it is only for the debugging, I'm OK to set N this. > > > > > We could have other options like PROVE_LOCKING enable it. > > Agreed (but it should say this is a debug option) It does say "Validate" which to me is a debug option. What would you suggest? -- Steve
Re: [PATCH v19 RESEND 3/5] tracing: Allow user-space mapping of the ring-buffer
On Tue, 26 Mar 2024 10:08:28 + Vincent Donnefort wrote: > Currently, user-space extracts data from the ring-buffer via splice, > which is handy for storage or network sharing. However, due to splice > limitations, it is impossible to do real-time analysis without a copy. > > A solution for that problem is to let the user-space map the ring-buffer > directly. > > The mapping is exposed via the per-CPU file trace_pipe_raw. The first > element of the mapping is the meta-page. It is followed by each > subbuffer constituting the ring-buffer, ordered by their unique page ID: > > * Meta-page -- include/uapi/linux/trace_mmap.h for a description > * Subbuf ID 0 > * Subbuf ID 1 > ... > > It is therefore easy to translate a subbuf ID into an offset in the > mapping: > > reader_id = meta->reader->id; > reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size; > > When new data is available, the mapper must call a newly introduced ioctl: > TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to > point to the next reader containing unread data. > Thanks for the update Vincent! > Mapping will prevent snapshot and buffer size modifications. 
> > CC: > Signed-off-by: Vincent Donnefort > > diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h > index ffcd8dfcaa4f..d25b9d504a7c 100644 > --- a/include/uapi/linux/trace_mmap.h > +++ b/include/uapi/linux/trace_mmap.h > @@ -43,4 +43,6 @@ struct trace_buffer_meta { > __u64 Reserved2; > }; > > +#define TRACE_MMAP_IOCTL_GET_READER _IO('T', 0x1) > + > #endif /* _TRACE_MMAP_H_ */ > diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c > index 233d1af39fff..0f37aa9860fd 100644 > --- a/kernel/trace/trace.c > +++ b/kernel/trace/trace.c > @@ -1191,6 +1191,12 @@ static void tracing_snapshot_instance_cond(struct > trace_array *tr, > return; > } > > + if (tr->mapped) { > + trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n"); > + trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n"); > + return; > + } > + > local_irq_save(flags); > update_max_tr(tr, current, smp_processor_id(), cond_data); > local_irq_restore(flags); > @@ -1323,7 +1329,7 @@ static int tracing_arm_snapshot_locked(struct > trace_array *tr) > lockdep_assert_held(&trace_types_lock); > > spin_lock(&tr->snapshot_trigger_lock); > - if (tr->snapshot == UINT_MAX) { > + if (tr->snapshot == UINT_MAX || tr->mapped) { > spin_unlock(&tr->snapshot_trigger_lock); > return -EBUSY; > } > @@ -6068,7 +6074,7 @@ static void tracing_set_nop(struct trace_array *tr) > { > if (tr->current_trace == &nop_trace) > return; > - > + > tr->current_trace->enabled--; > > if (tr->current_trace->reset) > @@ -8194,15 +8200,32 @@ tracing_buffers_splice_read(struct file *file, loff_t > *ppos, > return ret; > } > > -/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters > */ > static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, > unsigned long arg) > { > struct ftrace_buffer_info *info = file->private_data; > struct trace_iterator *iter = &info->iter; > + int err; > + > + if (cmd == TRACE_MMAP_IOCTL_GET_READER) { > + if (!(file->f_flags & O_NONBLOCK)) { > + err = 
ring_buffer_wait(iter->array_buffer->buffer, > +iter->cpu_file, > +iter->tr->buffer_percent, > +NULL, NULL); > + if (err) > + return err; > + } > > - if (cmd) > - return -ENOIOCTLCMD; > + return ring_buffer_map_get_reader(iter->array_buffer->buffer, > + iter->cpu_file); > + } else if (cmd) { > + return -ENOTTY; > + } > > + /* > + * An ioctl call with cmd 0 to the ring buffer file will wake up all > + * waiters > + */ > mutex_lock(&trace_types_lock); > > /* Make sure the waiters see the new wait_index */ > @@ -8214,6 +8237,94 @@ static long tracing_buffers_ioctl(struct file *file, > unsigned int cmd, unsigned > return 0; > } > > +static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf) > +{ > + return VM_FAULT_SIGBUS; > +} If this is all it does, I don't believe it's needed. > + > +#ifdef CONFIG_TRACER_MAX_TRACE > +static int get_snapshot_map(struct trace_array *tr) > +{ > + int err = 0; > + > + /* > + * Called with mmap_lock held. lockdep would be unhappy if we would now > + * take trace_types_lock. Instead use the specific > + * snapshot_trigger_lock. > + */ > + spin_lock(&tr->snapshot_trigger_lock); > + > + if (tr->snapshot || tr->mapped == UINT_MAX) > +
Re: [PATCH] trace/sched: add tgid for sched_wakeup_template
On Wed, 27 Mar 2024 16:50:57 +0800 Tio Zhang wrote: > By doing this, we are able to filter tasks by tgid while we are > tracing wakeup events by ebpf or other methods. > > For example, when we care about tracing a user space process (which has > uncertain number of LWPs, i.e, pids) to monitor its wakeup latency, > without tgid available in sched_wakeup tracepoints, we would struggle > finding out all pids to trace, or we could use kprobe to achieve tgid > tracing, which is less accurate and much less efficient than using > tracepoint. This is a very common trace event, and I really do not want to add more data than necessary to it, as it increases the size of the event which means less events can be recorded on a fixed size trace ring buffer. Note, you are not modifying the "tracepoint", but you are actually modifying a "trace event". "tracepoint" is the hook in the kernel code: trace_sched_wakeup() "trace event" is defined by TRACE_EVENT() macro (and friends) that defines what is exposed in the tracefs file system. I thought ebpf could hook directly to the tracepoint which is: trace_sched_wakeup(p); I believe you can have direct access to the 'p' before it is processed from ebpf. There's also "trace probes" (I think we are lacking documentation on this, as well as event probes :-p): $ gdb vmlinux (gdb) p &((struct task_struct *)0)->tgid $1 = (pid_t *) 0x56c (gdb) p &((struct task_struct *)0)->pid $2 = (pid_t *) 0x568 # echo 't:wakeup sched_waking pid=+0x568($arg1):u32 tgid=+0x56c($arg1):u32' > /sys/kernel/tracing/dynamic_events # trace-cmd start -e wakeup # trace-cmd show trace-cmd-7307[003] d..6. 599486.485762: wakeup: (__probestub_sched_waking+0x4/0x10) pid=845 tgid=845 bash-845 [001] d.s4. 599486.486136: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17 bash-845 [001] d..4. 599486.486336: wakeup: (__probestub_sched_waking+0x4/0x10) pid=5516 tgid=5516 kworker/u18:2-5516[001] d..4. 
599486.486445: wakeup: (__probestub_sched_waking+0x4/0x10) pid=818 tgid=818 -0 [001] d.s4. 599486.491206: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17 -0 [001] d.s5. 599486.493218: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17 -0 [001] d.s4. 599486.497200: wakeup: (__probestub_sched_waking+0x4/0x10) pid=17 tgid=17 -0 [003] d.s4. 599486.829209: wakeup: (__probestub_sched_waking+0x4/0x10) pid=70 tgid=70 The above attaches to the tracepoint and $arg1 is the 'struct task_struct *p'. -- Steve > > Signed-off-by: Tio Zhang > Signed-off-by: Dylane Chen > --- > include/trace/events/sched.h | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h > index dbb01b4b7451..ea7e525649e5 100644 > --- a/include/trace/events/sched.h > +++ b/include/trace/events/sched.h > @@ -149,6 +149,7 @@ DECLARE_EVENT_CLASS(sched_wakeup_template, > __field(pid_t, pid ) > __field(int,prio) > __field(int,target_cpu ) > + __field(pid_t, tgid) > ), > > TP_fast_assign( > @@ -156,11 +157,12 @@ DECLARE_EVENT_CLASS(sched_wakeup_template, > __entry->pid= p->pid; > __entry->prio = p->prio; /* XXX SCHED_DEADLINE */ > __entry->target_cpu = task_cpu(p); > + __entry->tgid = p->tgid; > ), > > - TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d", > + TP_printk("comm=%s pid=%d prio=%d target_cpu=%03d tgid=%d", > __entry->comm, __entry->pid, __entry->prio, > - __entry->target_cpu) > + __entry->target_cpu, __entry->tgid) > ); > > /*
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, 26 Mar 2024 09:16:33 -0700 Andrii Nakryiko wrote: > > It's no different than lockdep. Test boxes should have it enabled, but > > there's no reason to have this enabled in a production system. > > > > I tend to agree with Steven here (which is why I sent this patch as it > is), but I'm happy to do it as an opt-out, if Masami insists. Please > do let me know if I need to send v2 or this one is actually the one > we'll end up using. Thanks! Masami, Are you OK with just keeping it set to N. We could have other options like PROVE_LOCKING enable it. -- Steve
Re: [PATCH 11/12] [v4] kallsyms: rework symbol lookup return codes
On Tue, 26 Mar 2024 15:53:38 +0100 Arnd Bergmann wrote: > -const char * > +int > ftrace_mod_address_lookup(unsigned long addr, unsigned long *size, > unsigned long *off, char **modname, char *sym) > { > struct ftrace_mod_map *mod_map; > - const char *ret = NULL; > + int ret; This needs to be ret = 0; > > /* mod_map is freed via call_rcu() */ > preempt_disable(); As here we have: list_for_each_entry_rcu(mod_map, &ftrace_mod_maps, list) { ret = ftrace_func_address_lookup(mod_map, addr, size, off, sym); if (ret) { if (modname) *modname = mod_map->mod->name; break; } } preempt_enable(); return ret; } Where it is possible for the loop never to be executed. -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, 25 Mar 2024 11:38:48 +0900 Masami Hiramatsu (Google) wrote: > On Fri, 22 Mar 2024 09:03:23 -0700 > Andrii Nakryiko wrote: > > > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to > > control whether ftrace low-level code performs additional > > rcu_is_watching()-based validation logic in an attempt to catch noinstr > > violations. > > > > This check is expected to never be true in practice and would be best > > controlled with extra config to let users decide if they are willing to > > pay the price. > > Hmm, for me, it sounds like "WARN_ON(something) never be true in practice > so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > is OK, but tht should be set to Y by default. If you have already verified > that your system never make it true and you want to optimize your ftrace > path, you can manually set it to N at your own risk. > Really, it's for debugging. I would argue that it should *not* be default y. Peter added this to find all the locations that could be called where RCU is not watching. But the issue I have is that this is that it *does cause overhead* with function tracing. I believe we found pretty much all locations that were an issue, and we should now just make it an option for developers. It's no different than lockdep. Test boxes should have it enabled, but there's no reason to have this enabled in a production system. -- Steve > > > > Cc: Steven Rostedt > > Cc: Masami Hiramatsu > > Cc: Paul E. 
McKenney > > Signed-off-by: Andrii Nakryiko > > --- > > include/linux/trace_recursion.h | 2 +- > > kernel/trace/Kconfig| 13 + > > 2 files changed, 14 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/trace_recursion.h > > b/include/linux/trace_recursion.h > > index d48cd92d2364..24ea8ac049b4 100644 > > --- a/include/linux/trace_recursion.h > > +++ b/include/linux/trace_recursion.h > > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, > > unsigned long parent_ip); > > # define do_ftrace_record_recursion(ip, pip) do { } while (0) > > #endif > > > > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > # define trace_warn_on_no_rcu(ip) \ > > ({ \ > > bool __ret = !rcu_is_watching();\ > > BTW, maybe we can add "unlikely" in the next "if" line? > > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > > index 61c541c36596..19bce4e217d6 100644 > > --- a/kernel/trace/Kconfig > > +++ b/kernel/trace/Kconfig > > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE > > This file can be reset, but the limit can not change in > > size at runtime. > > > > +config FTRACE_VALIDATE_RCU_IS_WATCHING > > + bool "Validate RCU is on during ftrace recursion check" > > + depends on FUNCTION_TRACER > > + depends on ARCH_WANTS_NO_INSTR > > default y > > > + help > > + All callbacks that attach to the function tracing have some sort > > + of protection against recursion. This option performs additional > > + checks to make sure RCU is on when ftrace callbacks recurse. > > + > > + This will add more overhead to all ftrace-based invocations. > > ... invocations, but keep it safe. > > > + > > + If unsure, say N > > If unsure, say Y > > Thank you, > > > + > > config RING_BUFFER_RECORD_RECURSION > > bool "Record functions that recurse in the ring buffer" > > depends on FTRACE_RECORD_RECURSION > > -- > > 2.43.0 > > > >
Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
On Tue, 12 Mar 2024 13:42:28 + Mark Rutland wrote: > There are ways around that, but they're complicated and/or expensive, e.g. > > * Use a sequence of multiple patches, starting with replacing the JALR with an > exception-generating instruction with a fixup handler, which is sort-of what > x86 does with UD2. This may require multiple passes with > synchronize_rcu_tasks() to make sure all threads have seen the latest > instructions, and that cannot be done under stop_machine(), so if you need > stop_machine() for CMODx reasons, you may need to use that several times > with > intervening calls to synchronize_rcu_tasks(). Just for clarification. x86 doesn't use UD2 for updating the call sites. It uses the breakpoint (0xcc) operation. This is because x86 instructions are not a fixed size and can cross cache boundaries, making updates to text sections dangerous as another CPU may get half the old instruction and half the new one! Thus, when we have: 0f 1f 44 00 00 nop and want to convert it to: e8 e7 57 07 00 call ftrace_caller We have to first add a breakpoint: cc 1f 44 00 00 Send an IPI to all CPUs so that they see the breakpoint (IPI is equivalent to a memory barrier). We have a ftrace breakpoint handler that will look at the function that the breakpoint happened on. If it was a nop, it will just skip over the rest of the instruction, and return from the break point handler just after the "cc 1f 44 00 00". If it's supposed to be a function, it will push the return to after the instruction onto the stack, and return from the break point handler to ftrace_caller. (Although things changed a little since this update is now handled by text_poke_bp_batch()). Then it changes the rest of the instruction: cc e7 57 07 00 Sends out another IPI to all CPUs and removes the break point with the new instruction op. e8 e7 57 07 00 and now all the callers of this function will call the ftrace_caller. -- Steve