On 11/27/23 18:41, Steven Rostedt wrote:
> On Mon, 27 Nov 2023 16:12:47 +0100
> Petr Pavlu <petr.pa...@suse.com> wrote:
>
>> The following warning appears when using buffered events:
>> [ 203.556451] WARNING: CPU: 53 PID: 10220 at kernel/trace/ring_buffer.c:3912 ring_buffer_discard_commit+0x2eb/0x420
>
> Hmm, I don't have a warning on line 3912, do you have extra code (debugging)
> in your version?
The series is based on 2cc14f52aeb7 (tag: v6.7-rc3). It is the following
code and RB_WARN_ON():

3895 void ring_buffer_discard_commit(struct trace_buffer *buffer,
3896                                 struct ring_buffer_event *event)
3897 {
     [...]
3908          * This must only be called if the event has not been
3909          * committed yet. Thus we can assume that preemption
3910          * is still disabled.
3911          */
3912         RB_WARN_ON(buffer, !local_read(&cpu_buffer->committing));
3913
3914         rb_decrement_entry(cpu_buffer, event);
3915         if (rb_try_to_discard(cpu_buffer, event))
3916                 goto out;

https://github.com/torvalds/linux/blob/2cc14f52aeb78ce3f29677c2de1f06c0e91471ab/kernel/trace/ring_buffer.c#L3912

>> [...]
>> [ 203.670690] CPU: 53 PID: 10220 Comm: stress-ng-sysin Tainted: G E 6.7.0-rc2-default #4 56e6d0fcf5581e6e51eaaecbdaec2a2338c80f3a
>> [ 203.670704] Hardware name: Intel Corp. GROVEPORT/GROVEPORT, BIOS GVPRCRB1.86B.0016.D04.1705030402 05/03/2017
>> [ 203.670709] RIP: 0010:ring_buffer_discard_commit+0x2eb/0x420
>> [ 203.735721] Code: 4c 8b 4a 50 48 8b 42 48 49 39 c1 0f 84 b3 00 00 00 49 83 e8 01 75 b1 48 8b 42 10 f0 ff 40 08 0f 0b e9 fc fe ff ff f0 ff 47 08 <0f> 0b e9 77 fd ff ff 48 8b 42 10 f0 ff 40 08 0f 0b e9 f5 fe ff ff
>> [ 203.735734] RSP: 0018:ffffb4ae4f7b7d80 EFLAGS: 00010202
>> [ 203.735745] RAX: 0000000000000000 RBX: ffffb4ae4f7b7de0 RCX: ffff8ac10662c000
>> [ 203.735754] RDX: ffff8ac0c750be00 RSI: ffff8ac10662c000 RDI: ffff8ac0c004d400
>> [ 203.781832] RBP: ffff8ac0c039cea0 R08: 0000000000000000 R09: 0000000000000000
>> [ 203.781839] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
>> [ 203.781842] R13: ffff8ac10662c000 R14: ffff8ac0c004d400 R15: ffff8ac10662c008
>> [ 203.781846] FS: 00007f4cd8a67740(0000) GS:ffff8ad798880000(0000) knlGS:0000000000000000
>> [ 203.781851] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 203.781855] CR2: 0000559766a74028 CR3: 00000001804c4000 CR4: 00000000001506f0
>> [ 203.781862] Call Trace:
>> [ 203.781870]  <TASK>
>> [ 203.851949]  trace_event_buffer_commit+0x1ea/0x250
>> [ 203.851967]  trace_event_raw_event_sys_enter+0x83/0xe0
>> [ 203.851983]  syscall_trace_enter.isra.0+0x182/0x1a0
>> [ 203.851990]  do_syscall_64+0x3a/0xe0
>> [ 203.852075]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
>> [ 203.852090] RIP: 0033:0x7f4cd870fa77
>> [ 203.982920] Code: 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 90 b8 89 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e9 43 0e 00 f7 d8 64 89 01 48
>> [ 203.982932] RSP: 002b:00007fff99717dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
>> [ 203.982942] RAX: ffffffffffffffda RBX: 0000558ea1d7b6f0 RCX: 00007f4cd870fa77
>> [ 203.982948] RDX: 0000000000000000 RSI: 00007fff99717de0 RDI: 0000558ea1d7b6f0
>> [ 203.982957] RBP: 00007fff99717de0 R08: 00007fff997180e0 R09: 00007fff997180e0
>> [ 203.982962] R10: 00007fff997180e0 R11: 0000000000000246 R12: 00007fff99717f40
>> [ 204.049239] R13: 00007fff99718590 R14: 0000558e9f2127a8 R15: 00007fff997180b0
>> [ 204.049256]  </TASK>
>>
>> For instance, it can be triggered by running these two commands in
>> parallel:
>>   $ while true; do
>>       echo hist:key=id.syscall:val=hitcount > \
>>         /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger;
>>     done
>>   $ stress-ng --sysinfo $(nproc)
>>
>> The warning indicates that the current ring_buffer_per_cpu is not in the
>> committing state. It happens because the active ring_buffer_event
>> doesn't actually come from the ring_buffer_per_cpu but is allocated from
>> trace_buffered_event.
>>
>> The bug is in function trace_buffered_event_disable() where the
>> following normally happens:
>> * The code invokes disable_trace_buffered_event() via
>>   smp_call_function_many() and follows it by synchronize_rcu(). This
>>   increments the per-CPU variable trace_buffered_event_cnt on each
>>   target CPU and grants trace_buffered_event_disable() the exclusive
>>   access to the per-CPU variable trace_buffered_event.
>> * Maintenance is performed on trace_buffered_event, all per-CPU event
>>   buffers get freed.
>> * The code invokes enable_trace_buffered_event() via
>>   smp_call_function_many(). This decrements trace_buffered_event_cnt and
>>   releases the access to trace_buffered_event.
>>
>> A problem is that smp_call_function_many() runs a given function on all
>> target CPUs except on the current one. The following can then occur:
>> * Task X executing trace_buffered_event_disable() runs on CPU A.
>> * The control reaches synchronize_rcu() and the task gets rescheduled on
>>   another CPU B.
>> * The RCU synchronization finishes. At this point,
>>   trace_buffered_event_disable() has the exclusive access to all
>>   trace_buffered_event variables except trace_buffered_event[A] because
>>   trace_buffered_event_cnt[A] is never incremented and, if the buffer is
>>   currently unused, remains set to 0.
>> * A different task Y is scheduled on CPU A and hits a trace event. The
>>   code in trace_event_buffer_lock_reserve() sees that
>>   trace_buffered_event_cnt[A] is set to 0 and decides to use the buffer
>>   provided by trace_buffered_event[A].
>> * Task X continues its execution in trace_buffered_event_disable(). The
>>   code incorrectly frees the event buffer pointed to by
>>   trace_buffered_event[A] and resets the variable to NULL.
>> * Task Y writes event data to the now freed buffer and later detects the
>>   created inconsistency.
>>
>> The issue is observable since commit dea499781a11 ("tracing: Fix warning
>> in trace_buffered_event_disable()") which moved the call of
>> trace_buffered_event_disable() in __ftrace_event_enable_disable()
>> earlier, prior to invoking call->class->reg(.. TRACE_REG_UNREGISTER ..).
>> The underlying problem in trace_buffered_event_disable() is however
>> present since the original implementation in commit 0fc1b09ff1ff
>> ("tracing: Use temp buffer when filtering events").
>>
>> The bug is simply fixable by replacing smp_call_function_many() with
>> on_each_cpu_mask(), but the code has other issues as well:
>
> Or by simply calling the update for the local CPU as well as the many:
>
> 	preempt_disable();
> 	/* For each CPU, set the buffer as used. */
> 	disable_trace_buffered_event(NULL);
> 	smp_call_function_many(tracing_buffer_mask,
> 			       disable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
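
Right, that would work as well. For reference, the on_each_cpu_mask()
variant I had in mind is equivalent and folds the preemption handling into
the helper, which also runs the callback on the local CPU when it is in the
mask (only a sketch):

	/* For each CPU, including the current one, set the buffer as used. */
	on_each_cpu_mask(tracing_buffer_mask,
			 disable_trace_buffered_event, NULL, true);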
>
>> * Function trace_event_buffer_lock_reserve() reads trace_buffered_event
>>   and trace_buffered_event_cnt in the reverse order from how they are
>>   written in trace_buffered_event_disable(), and without any memory
>>   barrier. It could happen that trace_event_buffer_lock_reserve() still
>>   finds a valid pointer in trace_buffered_event which is being freed by
>>   trace_buffered_event_disable() but later already sees the value of
>>   trace_buffered_event_cnt decremented back to 0 and incorrectly decides
>>   to use the provided buffer.
>
> Not an issue (see below)
>
>> * Function trace_buffered_event_enable() initializes memory of each
>>   allocated trace_buffered_event to zero but no memory barrier is
>>   present to ensure this operation is completed before publishing its
>>   pointer for use in trace_event_buffer_lock_reserve().
>
> Yes, a wmb() may be helpful, but as the reader will see either the
> allocated page or NULL, and can handle both cases. I decided not to add it
> because it really doesn't matter if we do.

My concern is about the earlier call "memset(event, 0, sizeof(*event));" in
trace_buffered_event_enable() which writes the obtained event page prior to
publishing it via "per_cpu(trace_buffered_event, cpu) = event;". I think
another CPU running trace_event_buffer_lock_reserve() could already see the
valid event pointer and write its data to the event page, and only
afterwards could the memset() write from trace_buffered_event_enable()
become visible, overwriting that data. The event data would then become
corrupted.
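
Concretely, what I have in mind is a write barrier in the allocation loop
of trace_buffered_event_enable(), before the pointer is published. A sketch
against v6.7-rc3, where only the smp_wmb() and its comment are new (the
reader side needs no explicit barrier since it dereferences the pointer it
read, giving an address dependency):

	for_each_tracing_cpu(cpu) {
		page = alloc_pages_node(cpu_to_node(cpu),
					GFP_KERNEL | __GFP_NORETRY, 0);
		if (!page)
			goto failed;

		event = page_address(page);
		memset(event, 0, sizeof(*event));

		/*
		 * Make sure the zeroing of the page is visible before
		 * the pointer is published and another CPU can start
		 * writing event data through it.
		 */
		smp_wmb();
		per_cpu(trace_buffered_event, cpu) = event;
	}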
>
>> * Calling function trace_buffered_event_enable() normally requires
>>   pairing it with trace_buffered_event_disable(). However, the function
>>   has no error return code and in case of a failure decrements back
>>   trace_buffered_event_ref. This results in underflow of the counter
>>   when trace_buffered_event_disable() gets called later.
>
> Not an issue.
>
> Even on failure the ref count will be greater than zero, where
> trace_buffered_event_disable() handles it properly, and the freeing can
> handle no pages being allocated.
>
> That is:
>
> 	free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
>
> would be the same as:
>
> 	free_page((unsigned long)NULL);
>
> Which is perfectly fine to do.

The situation that I have in mind is the following:
* The counter trace_buffered_event_ref is at 0.
* The soft mode gets enabled for some event and
  trace_buffered_event_enable() is called. The function increments
  trace_buffered_event_ref to 1 and starts allocating event pages.
* The allocation fails for some page and trace_buffered_event_disable()
  is called for cleanup.
* Function trace_buffered_event_disable() decrements
  trace_buffered_event_ref back to 0, recognizes that it was the last
  use of buffered events and frees all allocated pages.
* The control goes back to trace_buffered_event_enable() which returns.
  The caller of trace_buffered_event_enable() has no information that
  the function actually failed.
* Some time later, the soft mode is disabled for the same event.
  Function trace_buffered_event_disable() is called. It warns on
  "WARN_ON_ONCE(!trace_buffered_event_ref)" and returns.

The end result is only an ugly warning, but I think it is still good to
fix.

>> Instead of using the per-CPU variable trace_buffered_event_cnt for the
>> exclusive access during the disable operation, introduce a new variable
>> trace_buffered_event_enabled to reflect the current state and
>> appropriately adjust the code. The variable indicates whether buffered
>> events are currently enabled and trace_buffered_event together with
>> trace_buffered_event_cnt are ok to use.
>
> The current logic is perfectly fine. The only bug here (which you found)
> was the missing update to the counter of the current CPU.
>
>> The updated synchronization mechanism fixes the mentioned problems and
>> also avoids sending IPIs across the system.
>>
>> Fixes: 0fc1b09ff1ff ("tracing: Use temp buffer when filtering events")
>> Fixes: dea499781a11 ("tracing: Fix warning in trace_buffered_event_disable()")
>> Signed-off-by: Petr Pavlu <petr.pa...@suse.com>
>> ---
>
> Let me explain why it works.
>
> In trace_event_buffer_lock_reserve():
>
> 	preempt_disable_notrace();
> 	if ((entry = __this_cpu_read(trace_buffered_event))) {
> 		int max_len = PAGE_SIZE - struct_size(entry, array, 1);
>
> 		val = this_cpu_inc_return(trace_buffered_event_cnt);
>
> 		if (val == 1 && likely(len <= max_len)) {
>
> If val == 1 and len <= max_len, then this is going to use the "buffered
> event".
>
> Notice that preemption is disabled.
>
> 			trace_event_setup(entry, type, trace_ctx);
> 			entry->array[0] = len;
> 			/* Return with preemption disabled */
> 			return entry;
>
> And we return with preemption disabled!
>
> Everything after this does not use the "buffered event" and is not part of
> this code.
>
> 		}
> 		this_cpu_dec(trace_buffered_event_cnt);
> 	}
> 	/* __trace_buffer_lock_reserve() disables preemption */
> 	preempt_enable_notrace();
>
> Now, on commit (when we are done with the "buffered event"):
>
> __buffer_unlock_commit(struct trace_buffer *buffer, struct ring_buffer_event *event)
> {
> 	__this_cpu_write(trace_taskinfo_save, true);
>
> 	/* If this is the temp buffer, we need to commit fully */
> 	if (this_cpu_read(trace_buffered_event) == event) {
> 		/* Length is in event->array[0] */
> 		ring_buffer_write(buffer, event->array[0], &event->array[1]);
> 		/* Release the temp buffer */
> 		this_cpu_dec(trace_buffered_event_cnt);
> 		/* ring_buffer_unlock_commit() enables preemption */
> 		preempt_enable_notrace();
>
> Preemption is finally enabled here. That is, this could not preempt from
> the time we found the "buffered event" to the time we released it.
>
> 	} else
> 		ring_buffer_unlock_commit(buffer);
> }
>
> Now let's look at trace_buffered_event_disable():
>
> void trace_buffered_event_disable(void)
> {
> 	int cpu;
>
> 	WARN_ON_ONCE(!mutex_is_locked(&event_mutex));
>
> 	if (WARN_ON_ONCE(!trace_buffered_event_ref))
> 		return;
>
> 	if (--trace_buffered_event_ref)
> 		return;
>
> 	preempt_disable();
> 	/* For each CPU, set the buffer as used. */
> 	smp_call_function_many(tracing_buffer_mask,
> 			       disable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
>
> The above will up the counter of all the buffers (after the bug you found
> is fixed ;-)
>
> That means, we just need to wait till all the currently running users of
> the buffer are done with it.
>
> 	/* Wait for all current users to finish */
> 	synchronize_rcu();
>
> The synchronize_rcu() also synchronizes preempt disabled sections. That is,
> after the synchronize_rcu() completes, all users of the "buffered event"
> are done with it, and because we upped the count, there will be no more
> users.
>
> That is *there is no race here*!
>
> At this point, there are no users of the "buffered event" and we can do
> whatever we want without locks or memory barriers. This is RCU 101.
>
> 	for_each_tracing_cpu(cpu) {
> 		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
> 		per_cpu(trace_buffered_event, cpu) = NULL;
> 	}
> 	/*
> 	 * Make sure trace_buffered_event is NULL before clearing
> 	 * trace_buffered_event_cnt.
> 	 */
> 	smp_wmb();
>
> 	preempt_disable();
> 	/* Do the work on each cpu */
> 	smp_call_function_many(tracing_buffer_mask,
> 			       enable_trace_buffered_event, NULL, 1);
> 	preempt_enable();
> }
>
> Now, the reason you found the crash was from the first issue you mentioned.
> That is, we didn't disable the current CPU and if we migrated, things would
> get really messed up. The only fix here is to make sure all CPUs have their
> trace_buffered_event_cnt incremented.

The potential race that I see is the following:
* Function trace_buffered_event_disable() is called on CPU 0. It
  increments trace_buffered_event_cnt on each CPU and waits via
  synchronize_rcu() for each user of trace_buffered_event to complete.
* After synchronize_rcu() is already finished, the situation is that all
  counters trace_buffered_event_cnt are at 1 and all pointers
  trace_buffered_event are still valid.
* At this point, on a different CPU 1, the execution reaches
  trace_event_buffer_lock_reserve(). The function calls
  preempt_disable_notrace() and enters an RCU read-side critical section,
  but that is not important because all RCU work is already performed and
  completed in trace_buffered_event_disable(). The function proceeds and
  reads a still valid pointer from trace_buffered_event[CPU1] into the
  local variable "entry". However, it doesn't yet read
  trace_buffered_event_cnt[CPU1]; that happens later.
* Function trace_buffered_event_disable() continues. It frees
  trace_buffered_event[CPU1] and decrements trace_buffered_event_cnt[CPU1]
  back to 0.
* Function trace_event_buffer_lock_reserve() continues. It reads and
  increments trace_buffered_event_cnt[CPU1] from 0 to 1. This makes it
  believe that it can use the "entry" that it already obtained, but the
  pointer is now invalid and any access results in a use-after-free.
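
One way to close this window, without adding barriers to the reader's fast
path, would be to keep the counters elevated until every such reader has
finished, that is, to add a second synchronize_rcu() between clearing the
pointers and re-enabling the counters. The following is only a sketch of
the idea (using the on_each_cpu_mask() variant mentioned above), not the
series as posted:

	/* For each CPU, set the buffer as used. */
	on_each_cpu_mask(tracing_buffer_mask,
			 disable_trace_buffered_event, NULL, true);

	/* Wait for all current users to finish. */
	synchronize_rcu();

	for_each_tracing_cpu(cpu) {
		free_page((unsigned long)per_cpu(trace_buffered_event, cpu));
		per_cpu(trace_buffered_event, cpu) = NULL;
	}

	/*
	 * Wait for all CPUs that could have read a stale pointer from
	 * trace_buffered_event before the counters return to 0. Such
	 * a reader still runs with preemption disabled, so after this
	 * synchronize_rcu() it has either finished, or it will observe
	 * trace_buffered_event_cnt != 1 and fall back to the ring buffer.
	 */
	synchronize_rcu();

	/* For each CPU, relinquish the buffer. */
	on_each_cpu_mask(tracing_buffer_mask,
			 enable_trace_buffered_event, NULL, true);

Thanks,
Petr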