On 11/27/23 18:41, Steven Rostedt wrote:
> On Mon, 27 Nov 2023 16:12:47 +0100
> Petr Pavlu <petr.pa...@suse.com> wrote:
>
>> The following warning appears when using buffered events:
>> [ 203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420
>
> Hmm, I don't have a warning on line 3912, do you have extra code (debugging)
> in your version?
The series is based on 2cc14f52aeb7 (tag: v6.7-rc3). It is the following
code and RB_WARN_ON():

3895 void ring_buffer_discard_commit(struct trace_buffer *buffer,
3896                                 struct ring_buffer_event *event)
3897 {
     [...]
3908          * This must only be called if the event has not been
3909          * committed yet. Thus we can assume that preemption
3910          * is still disabled.
3911          */
3912         RB_WARN_ON(buffer, !local_read(&cpu_buffer->committing));
3913
3914         rb_decrement_entry(cpu_buffer, event);
3915         if (rb_try_to_discard(cpu_buffer, event))
3916                 goto out;

https://github.com/torvalds/linux/blob/2cc14f52aeb78ce3f29677c2de1f06c0e91471ab/kernel/trace/ring_buffer.c#L3912

>> [...]
>> [ 203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G E 6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
>> [ 203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
>> [ 203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
>> [ 203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff
>> [ 203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202
>> [ 203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000
>> [ 203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400
>> [ 203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000
>> [ 203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> [ 203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008
>> [ 203.781846] FS: 00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000
>> [ 203.781851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0
>> [ 203.781862] Call Trace:
>> [ 203.781870]  <TASK>
>> [ 203.851949]  trace_event_buffer_commit+0x1ea/0x250
>> [ 203.851967]  trace_event_raw_event_sys_enter+0x83/0xe0
>> [ 203.851983]  syscall_trace_enter.isra.0+0x182/0x1a0
>> [ 203.851990]  do_syscall_64+0x3a/0xe0
>> [ 203.852075]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
>> [ 203.852090] RIP: 0033:0x7f4cd870fa77
>> [ 203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48
>> [ 203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
>> [ 203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77
>> [ 203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0
>> [ 203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0
>> [ 203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40
>> [ 204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0
>> [ 204.049256]  </TASK>
>>
>> For instance, it can be triggered by running these two commands in
>> parallel:
>>   $ while true; do
>>       echo hist:key=id.syscall:val=hitcount > \
>>         /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger;
>>     done
>>   $ stress-ng --sysinfo $(nproc)
>>
>> The warning indicates that the current ring_buffer_per_cpu is not in the
>> committing state. It happens because the active ring_buffer_event
>> doesn't actually come from the ring_buffer_per_cpu but is allocated from
>> trace_buffered_event.
>>
>> The bug is in function trace_buffered_event_disable() where the
>> following normally happens:
>> * The code invokes disable_trace_buffered_event() via
>>   smp_call_function_many() and follows it by synchronize_rcu(). This
>>   increments the per-CPU variable trace_buffered_event_cnt on each
>>   target CPU and grants trace_buffered_event_disable() the exclusive
>>   access to the per-CPU variable trace_buffered_event.
>> * Maintenance is performed on trace_buffered_event, all per-CPU event
>>   buffers get freed.
>> * The code invokes enable_trace_buffered_event() via
>>   smp_call_function_many(). This decrements trace_buffered_event_cnt and
>>   releases the access to trace_buffered_event.
>>
>> A problem is that smp_call_function_many() runs a given function on all
>> target CPUs except on the current one. The following can then occur:
>> * Task X executing trace_buffered_event_disable() runs on CPU A.
>> * The control reaches synchronize_rcu() and the task gets rescheduled on
>>   another CPU B.
>> * The RCU synchronization finishes. At this point,
>>   trace_buffered_event_disable() has the exclusive access to all
>>   trace_buffered_event variables except trace_buffered_event[A] because
>>   trace_buffered_event_cnt[A] is never incremented and, if the buffer is
>>   currently unused, remains set to 0.
>> * A different task Y is scheduled on CPU A and hits a trace event. The
>>   code in trace_event_buffer_lock_reserve() sees that
>>   trace_buffered_event_cnt[A] is set to 0 and decides to use the buffer
>>   provided by trace_buffered_event[A].
>> * Task X continues its execution in trace_buffered_event_disable(). The
>>   code incorrectly frees the event buffer pointed to by
>>   trace_buffered_event[A] and resets the variable to NULL.
>> * Task Y writes event data to the now freed buffer and later detects the
>>   created inconsistency.
>>
>> The issue is observable since commit dea499781a11 ("tracing: Fix warning
>> in trace_buffered_event_disable()") which moved the call of
>> trace_buffered_event_disable() in __ftrace_event_enable_disable()
>> earlier, prior to invoking call->class->reg(.. TRACE_REG_UNREGISTER ..).
>> The underlying problem in trace_buffered_event_disable() is however
>> present since the original implementation in commit 0fc1b09ff1ff
>> ("tracing: Use temp buffer when filtering events").
>>
>> The bug is simply fixable by replacing smp_call_function_many() with
>> on_each_cpu_mask(), but the code has other issues as well:
>
> Or by simply calling the update for the local CPU as well as the many:
>
> 	preempt_disable();
> 	/* For each CPU, set the buffer as used. */
> 	disable_trace_buffered_event(NULL);
> 	smp_call_function_many(tracing_buffer_mask,
> 			       disable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
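
Right, that would work as well. For reference, the on_each_cpu_mask()
variant I had in mind is equivalent and folds the preemption handling into
the helper, which also runs the callback on the local CPU when it is in the
mask (only a sketch):

	/* For each CPU, including the current one, set the buffer as used. */
	on_each_cpu_mask(tracing_buffer_mask,
			 disable_trace_buffered_event, NULL, true);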
>
>> * Function trace_event_buffer_lock_reserve() reads trace_buffered_event
>>   and trace_buffered_event_cnt in the reverse order from how they are
>>   written in trace_buffered_event_disable(), and without any memory
>>   barrier. It could happen that trace_event_buffer_lock_reserve() still
>>   finds a valid pointer in trace_buffered_event which is being freed by
>>   trace_buffered_event_disable() but later already sees the value of
>>   trace_buffered_event_cnt decremented back to 0 and incorrectly decides
>>   to use the provided buffer.
>
> Not an issue (see below)
>
>> * Function trace_buffered_event_enable() initializes memory of each
>>   allocated trace_buffered_event to zero but no memory barrier is
>>   present to ensure this operation is completed before publishing its
>>   pointer for use in trace_event_buffer_lock_reserve().
>
> Yes, a wmb() may be helpful, but as the reader will see either the
> allocated page or NULL, and can handle both cases. I decided not to add it
> because it really doesn't matter if we do.

My concern is about the earlier call "memset(event, 0, sizeof(*event));" in
trace_buffered_event_enable() which writes the obtained event page prior to
publishing it via "per_cpu(trace_buffered_event, cpu) = event;". I think
another CPU running trace_event_buffer_lock_reserve() could already see the
valid event pointer and write its data to the event page, and only
afterwards could the memset() write from trace_buffered_event_enable()
become visible, overwriting that data. The event data would then become
corrupted.
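
Concretely, what I have in mind is a write barrier in the allocation loop
of trace_buffered_event_enable(), before the pointer is published. A sketch
against v6.7-rc3, where only the smp_wmb() and its comment are new (the
reader side needs no explicit barrier since it dereferences the pointer it
read, giving an address dependency):

	for_each_tracing_cpu(cpu) {
		page = alloc_pages_node(cpu_to_node(cpu),
					GFP_KERNEL | __GFP_NORETRY, 0);
		if (!page)
			goto failed;

		event = page_address(page);
		memset(event, 0, sizeof(*event));

		/*
		 * Make sure the zeroing of the page is visible before
		 * the pointer is published and another CPU can start
		 * writing event data through it.
		 */
		smp_wmb();
		per_cpu(trace_buffered_event, cpu) = event;
	}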
>
>> * Calling function trace_buffered_event_enable() normally requires
>>   pairing it with trace_buffered_event_disable(). However, the function
>>   has no error return code and in case of a failure decrements back
>>   trace_buffered_event_ref. This results in underflow of the counter
>>   when trace_buffered_event_disable() gets called later.
>
> Not an issue.
>
> Even on failure the ref count will be greater than zero, where
> trace_buffered_event_disable() handles it properly, and the freeing can
> handle no pages being allocated.
>
> That is:
>
> 	free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
>
> would be the same as:
>
> 	free_page((unsigned long)NULL);
>
> Which is perfectly fine to do.

The situation that I have in mind is the following:
* The counter trace_buffered_event_ref is at 0.
* The soft mode gets enabled for some event and
  trace_buffered_event_enable() is called. The function increments
  trace_buffered_event_ref to 1 and starts allocating event pages.
* The allocation fails for some page and trace_buffered_event_disable()
  is called for cleanup.
* Function trace_buffered_event_disable() decrements
  trace_buffered_event_ref back to 0, recognizes that it was the last
  use of buffered events and frees all allocated pages.
* The control goes back to trace_buffered_event_enable() which returns.
  The caller of trace_buffered_event_enable() has no information that
  the function actually failed.
* Some time later, the soft mode is disabled for the same event.
  Function trace_buffered_event_disable() is called. It warns on
  "WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.

The end result is only an ugly warning, but I think it is still good to
fix.

>> Instead of using the per-CPU variable trace_buffered_event_cnt for the
>> exclusive access during the disable operation, introduce a new variable
>> trace_buffered_event_enabled to reflect the current state and
>> appropriately adjust the code. The variable indicates whether buffered
>> events are currently enabled and trace_buffered_event together with
>> trace_buffered_event_cnt are ok to use.
>
> The current logic is perfectly fine. The only bug here (which you found)
> was the missing update to the counter of the current CPU.
>
>> The updated synchronization mechanism fixes the mentioned problems and
>> also avoids sending IPIs across the system.
>>
>> Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events")
>> Fixes: dea499781a11 ("tracing: Fix warning in trace_buffered_event_disable()")
>> Signed-off-by: Petr Pavlu <petr.pa...@suse.com>
>> ---
>
> Let me explain why it works.
>
> In trace_event_buffer_lock_reserve():
>
> 	preempt_disable_notrace();
> 	if ((entry = __this_cpu_read(trace_buffered_event))) {
> 		int max_len = PAGE_SIZE - struct_size(entry, array, 1);
>
> 		val = this_cpu_inc_return(trace_buffered_event_cnt);
>
> 		if (val == 1 && likely(len <= max_len)) {
>
> If val == 1 and len <= max_len, then this is going to use the "buffered
> event".
>
> Notice that preemption is disabled.
>
> 			trace_event_setup(entry, type, trace_ctx);
> 			entry->array[0] = len;
> 			/* Return with preemption disabled */
> 			return entry;
>
> And we return with preemption disabled!
>
> Everything after this does not use the "buffered event" and is not part of
> this code.
>
> 		}
> 		this_cpu_dec(trace_buffered_event_cnt);
> 	}
> 	/* __trace_buffer_lock_reserve() disables preemption */
> 	preempt_enable_notrace();
>
> Now, on commit (when we are done with the "buffered event"):
>
> __buffer_unlock_commit(struct trace_buffer *buffer, struct ring_buffer_event *event)
> {
> 	__this_cpu_write(trace_taskinfo_save, true);
>
> 	/* If this is the temp buffer, we need to commit fully */
> 	if (this_cpu_read(trace_buffered_event) == event) {
> 		/* Length is in event->array[0] */
> 		ring_buffer_write(buffer, event->array[0], &event->array[1]);
> 		/* Release the temp buffer */
> 		this_cpu_dec(trace_buffered_event_cnt);
> 		/* ring_buffer_unlock_commit() enables preemption */
> 		preempt_enable_notrace();
>
> Preemption is finally enabled here. That is, this could not preempt from
> the time we found the "buffered event" to the time we released it.
>
> 	} else
> 		ring_buffer_unlock_commit(buffer);
> }
>
> Now let's look at trace_buffered_event_disable():
>
> void trace_buffered_event_disable(void)
> {
> 	int cpu;
>
> 	WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
>
> 	if (WARN_ON_ONCE(!trace_buffered_event_ref))
> 		return;
>
> 	if (--trace_buffered_event_ref)
> 		return;
>
> 	preempt_disable();
> 	/* For each CPU, set the buffer as used. */
> 	smp_call_function_many(tracing_buffer_mask,
> 			       disable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
>
> The above will up the counter of all the buffers (after the bug you found
> is fixed ;-)
>
> That means, we just need to wait till all the currently running users of
> the buffer are done with it.
>
> 	/* Wait for all current users to finish */
> 	synchronize_rcu();
>
> The synchronize_rcu() also synchronizes preempt disabled sections. That is,
> after the synchronize_rcu() completes, all users of the "buffered event"
> are done with it, and because we upped the count, there will be no more
> users.
>
> That is *there is no race here*!
>
> At this point, there are no users of the "buffered event" and we can do
> whatever we want without locks or memory barriers. This is RCU 101.
>
> 	for_each_tracing_cpu(cpu) {
> 		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> 		per_cpu(trace_buffered_event, cpu) = NULL;
> 	}
> 	/*
> 	 * Make sure trace_buffered_event is NULL before clearing
> 	 * trace_buffered_event_cnt.
> 	 */
> 	smp_wmb();
>
> 	preempt_disable();
> 	/* Do the work on each cpu */
> 	smp_call_function_many(tracing_buffer_mask,
> 			       enable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
> }
>
> Now, the reason you found the crash was from the first issue you mentioned.
> That is, we didn't disable the current CPU and if we migrated, things would
> get really messed up. The only fix here is to make sure all CPUs have their
> trace_buffered_event_cnt incremented.

The potential race that I see is the following:
* Function trace_buffered_event_disable() is called on CPU 0. It
  increments trace_buffered_event_cnt on each CPU and waits via
  synchronize_rcu() for each user of trace_buffered_event to complete.
* After synchronize_rcu() is already finished, the situation is that all
  counters trace_buffered_event_cnt are at 1 and all pointers
  trace_buffered_event are still valid.
* At this point, on a different CPU 1, the execution reaches
  trace_event_buffer_lock_reserve(). The function calls
  preempt_disable_notrace() and enters an RCU read-side critical section,
  but that is not important because all RCU work is already performed and
  completed in trace_buffered_event_disable(). The function proceeds and
  reads a still valid pointer from trace_buffered_event[CPU1] into the
  local variable "entry". However, it doesn't yet read
  trace_buffered_event_cnt[CPU1]; that happens later.
* Function trace_buffered_event_disable() continues. It frees
  trace_buffered_event[CPU1] and decrements trace_buffered_event_cnt[CPU1]
  back to 0.
* Function trace_event_buffer_lock_reserve() continues. It reads and
  increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
  believe that it can use the "entry" that it already obtained, but the
  pointer is now invalid and any access results in a use-after-free.
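
One way to close this window, without adding barriers to the reader's fast
path, would be to keep the counters elevated until every such reader has
finished, that is, to add a second synchronize_rcu() between clearing the
pointers and re-enabling the counters. The following is only a sketch of
the idea (using the on_each_cpu_mask() variant mentioned above), not the
series as posted:

	/* For each CPU, set the buffer as used. */
	on_each_cpu_mask(tracing_buffer_mask,
			 disable_trace_buffered_event, NULL, true);

	/* Wait for all current users to finish. */
	synchronize_rcu();

	for_each_tracing_cpu(cpu) {
		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
		per_cpu(trace_buffered_event, cpu) = NULL;
	}

	/*
	 * Wait for all CPUs that could have read a stale pointer from
	 * trace_buffered_event before the counters return to 0. Such
	 * a reader still runs with preemption disabled, so after this
	 * synchronize_rcu() it has either finished, or it will observe
	 * trace_buffered_event_cnt != 1 and fall back to the ring buffer.
	 */
	synchronize_rcu();

	/* For each CPU, relinquish the buffer. */
	on_each_cpu_mask(tracing_buffer_mask,
			 enable_trace_buffered_event, NULL, true);

Thanks,
Petr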