The following warning appears when using buffered events: [ 203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420 [...] [ 203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G E 6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a [ 203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017 [ 203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420 [ 203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff [ 203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202 [ 203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000 [ 203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400 [ 203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000 [ 203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008 [ 203.781846] FS: 00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000 [ 203.781851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0 [ 203.781862] Call Trace: [ 203.781870] <TASK> [ 203.851949] trace_event_buffer_commit+0x1ea/0x250 [ 203.851967] trace_event_raw_event_sys_enter+0x83/0xe0 [ 203.851983] syscall_trace_enter.isra.0+0x182/0x1a0 [ 203.851990] do_syscall_64+0x3a/0xe0 [ 203.852075] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 203.852090] RIP: 0033:0x7f4cd870fa77 [ 203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48 [ 203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089 [ 203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77 [ 203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0 [ 203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0 [ 203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40 [ 204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0 [ 204.049256] </TASK>
For instance, it can be triggered by running these two commands in parallel: $ while true; do echo hist:key=id.syscall:val=hitcount > \ /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger; done $ stress-ng --sysinfo $(nproc) The warning indicates that the current ring_buffer_per_cpu is not in the committing state. It happens because the active ring_buffer_event doesn't actually come from the ring_buffer_per_cpu but is allocated from trace_buffered_event. The bug is in function trace_buffered_event_disable() where the following normally happens: * The code invokes disable_trace_buffered_event() via smp_call_function_many() and follows it by synchronize_rcu(). This increments the per-CPU variable trace_buffered_event_cnt on each target CPU and grants trace_buffered_event_disable() the exclusive access to the per-CPU variable trace_buffered_event. * Maintenance is performed on trace_buffered_event, all per-CPU event buffers get freed. * The code invokes enable_trace_buffered_event() via smp_call_function_many(). This decrements trace_buffered_event_cnt and releases the access to trace_buffered_event. A problem is that smp_call_function_many() runs a given function on all target CPUs except on the current one. The following can then occur: * Task X executing trace_buffered_event_disable() runs on CPU A. * The control reaches synchronize_rcu() and the task gets rescheduled on another CPU B. * The RCU synchronization finishes. At this point, trace_buffered_event_disable() has the exclusive access to all trace_buffered_event variables except trace_buffered_event[A] because trace_buffered_event_cnt[A] is never incremented and if the buffer is currently unused, remains set to 0. * A different task Y is scheduled on CPU A and hits a trace event. The code in trace_event_buffer_lock_reserve() sees that trace_buffered_event_cnt[A] is set to 0 and decides the use the buffer provided by trace_buffered_event[A]. * Task X continues its execution in trace_buffered_event_disable(). The code incorrectly frees the event buffer pointed by trace_buffered_event[A] and resets the variable to NULL. * Task Y writes event data to the now freed buffer and later detects the created inconsistency. The issue is observable since commit dea499781a11 ("tracing: Fix warning in trace_buffered_event_disable()") which moved the call of trace_buffered_event_disable() in __ftrace_event_enable_disable() earlier, prior to invoking call->class->reg(.. TRACE_REG_UNREGISTER ..). The underlying problem in trace_buffered_event_disable() is however present since the original implementation in commit 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events"). The bug is simply fixable by replacing smp_call_function_many() by on_each_cpu_mask(), but the code has other issues as well: * Function trace_event_buffer_lock_reserve() reads trace_buffered_event and trace_buffered_event_cnt in reverse order than they are written in trace_buffered_event_disable() and without any memory barrier. It could happen that trace_event_buffer_lock_reserve() still finds a valid pointer in trace_buffered_event which is being freed by trace_buffered_event_disable() but later already sees the decremented value of trace_buffered_event_cnt back to 0 and incorrectly decides to use the provided buffer. * Function trace_buffered_event_enable() initializes memory of each allocated trace_buffered_event to zero but no memory barrier is present to ensure this operation is completed before publishing its pointer for use in trace_event_buffer_lock_reserve(). * Calling function trace_buffered_event_enable() normally requires pairing it with trace_buffered_event_disable(). However, the function has no error return code and in case of a failure decrements back trace_buffered_event_ref. This results in underflow of the counter when trace_buffered_event_disable() gets called later. These problems can be addressed individually but it looks better to rework the synchronization mechanism and simplify it in the first place. Instead of using the per-CPU variable trace_buffered_event_cnt for the exclusive access during the disable operation, introduce a new variable trace_buffered_event_enabled to reflect the current state and appropriately adjust the code. The variable indicates whether buffered events are currently enabled and trace_buffered_event together with trace_buffered_event_cnt are ok to use. The updated synchronization mechanism fixes the mentioned problems and avoids also sending IPIs across the system. Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events") Fixes: dea499781a11 ("tracing: Fix warning in trace_buffered_event_disable()") Signed-off-by: Petr Pavlu <petr.pa...@suse.com> --- kernel/trace/trace.c | 66 ++++++++++++++++++-------------------------- 1 file changed, 27 insertions(+), 39 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 9aebf904ff97..123acb9105dd 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2741,6 +2741,7 @@ trace_buffer_lock_reserve(struct trace_buffer *buffer, DEFINE_PER_CPU(struct ring_buffer_event *, trace_buffered_event); DEFINE_PER_CPU(int, trace_buffered_event_cnt); static int trace_buffered_event_ref; +static bool trace_buffered_event_enabled; /** * trace_buffered_event_enable - enable buffering events @@ -2764,7 +2765,9 @@ void trace_buffered_event_enable(void) WARN_ON_ONCE(!mutex_is_locked(&event_mutex)); - if (trace_buffered_event_ref++) + trace_buffered_event_ref++; + + if (trace_buffered_event_enabled) return; for_each_tracing_cpu(cpu) { @@ -2777,30 +2780,21 @@ void trace_buffered_event_enable(void) memset(event, 0, sizeof(*event)); per_cpu(trace_buffered_event, cpu) = event; - - preempt_disable(); - if (cpu == smp_processor_id() && - __this_cpu_read(trace_buffered_event) != - per_cpu(trace_buffered_event, cpu)) - WARN_ON_ONCE(1); - preempt_enable(); } + /* + * Ensure all initialization changes are visible and publish + * availability of trace_buffered_event. + */ + smp_wmb(); + WRITE_ONCE(trace_buffered_event_enabled, true); return; - failed: - trace_buffered_event_disable(); -} -static void enable_trace_buffered_event(void *data) -{ - /* Probably not needed, but do it anyway */ - smp_rmb(); - this_cpu_dec(trace_buffered_event_cnt); -} - -static void disable_trace_buffered_event(void *data) -{ - this_cpu_inc(trace_buffered_event_cnt); + failed: + for_each_tracing_cpu(cpu) { + free_page((unsigned long)per_cpu(trace_buffered_event, cpu)); + per_cpu(trace_buffered_event, cpu) = NULL; + } } /** @@ -2820,33 +2814,19 @@ void trace_buffered_event_disable(void) if (WARN_ON_ONCE(!trace_buffered_event_ref)) return; - if (--trace_buffered_event_ref) + if (--trace_buffered_event_ref || !trace_buffered_event_enabled) return; - preempt_disable(); - /* For each CPU, set the buffer as used. */ - smp_call_function_many(tracing_buffer_mask, - disable_trace_buffered_event, NULL, 1); - preempt_enable(); + WRITE_ONCE(trace_buffered_event_enabled, false); /* Wait for all current users to finish */ synchronize_rcu(); for_each_tracing_cpu(cpu) { + WARN_ON_ONCE(per_cpu(trace_buffered_event_cnt, cpu) != 0); free_page((unsigned long)per_cpu(trace_buffered_event, cpu)); per_cpu(trace_buffered_event, cpu) = NULL; } - /* - * Make sure trace_buffered_event is NULL before clearing - * trace_buffered_event_cnt. - */ - smp_wmb(); - - preempt_disable(); - /* Do the work on each cpu */ - smp_call_function_many(tracing_buffer_mask, - enable_trace_buffered_event, NULL, 1); - preempt_enable(); } static struct trace_buffer *temp_buffer; @@ -2883,9 +2863,16 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb, * is still quicker than no copy on match, but having * to discard out of the ring buffer on a failed match. */ - if ((entry = __this_cpu_read(trace_buffered_event))) { + if (READ_ONCE(trace_buffered_event_enabled)) { int max_len = PAGE_SIZE - struct_size(entry, array, 1); + /* + * Ensure per-CPU variables trace_buffered_event(_cnt) + * are read after trace_buffered_event_enabled, pairs + * with smp_wmb() in trace_buffered_event_enable(). + */ + smp_rmb(); + val = this_cpu_inc_return(trace_buffered_event_cnt); /* @@ -2906,6 +2893,7 @@ trace_event_buffer_lock_reserve(struct trace_buffer **current_rb, * handle the failure in that case. */ if (val == 1 && likely(len <= max_len)) { + entry = __this_cpu_read(trace_buffered_event); trace_event_setup(entry, type, trace_ctx); entry->array[0] = len; /* Return with preemption disabled */ -- 2.35.3