Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks

2024-06-10 Thread Steven Rostedt
On Fri, 7 Jun 2024 10:29:03 +0200
Petr Pavlu  wrote:

> Another option could be to try traversing the whole list in smaller
> parts and give up the reader_lock in between them. This would need some
> care to make sure that the operation completes, e.g. the code would need
> to bail out if it detects a change on cpu_buffer->pages_read.

I think I like this approach the most. Perhaps even have a counter that
gets incremented every time a new reader page is taken. And if it
detects that, it restarts the check?

To prevent a DOS, we restart 3 times at most, and then just say "the
list is OK" and exit.

So basically, we release the lock within the loop for each sub-buffer,
and then check whether a reader touched it when reacquiring the lock.

-- Steve



Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks

2024-06-07 Thread Petr Pavlu
On 5/28/24 01:43, Steven Rostedt wrote:
> On Mon, 27 May 2024 11:36:55 +0200
> Petr Pavlu  wrote:
> 
  static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
  {
 @@ -2200,8 +2205,13 @@ int ring_buffer_resize(struct trace_buffer *buffer, 
 unsigned long size,
 */
synchronize_rcu();
for_each_buffer_cpu(buffer, cpu) {
 +  unsigned long flags;
 +
cpu_buffer = buffer->buffers[cpu];
 +  raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
rb_check_pages(cpu_buffer);
 +  raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock,
 + flags);  
>>>
>>> Putting my RT hat on, I really don't like the above fix. The
>>> rb_check_pages() iterates all subbuffers which makes the time interrupts
>>> are disabled non-deterministic.  
>>
>> I see, this applies also to the same rb_check_pages() validation invoked
>> from ring_buffer_read_finish().
>>
>>>
>>> Instead, I would rather have something where we disable readers while we do
>>> the check, and re-enable them.
>>>
>>> raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
>>> cpu_buffer->read_disabled++;
>>> raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
>>>
>>> // Also, don't put flags on a new line. We are allowed to go 100 characters now.  
>>
>> Noted.
>>
>>>
>>>
>>> rb_check_pages(cpu_buffer);
>>> raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
>>> cpu_buffer->read_disabled--;
>>> raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
>>>
>>> Or something like that. Yes, that also requires creating a new
>>> "read_disabled" field in the ring_buffer_per_cpu code.  
>>
>> I think this would work but I'm personally not immediately sold on this
>> approach. If I understand the idea correctly, readers should then check
>> whether cpu_buffer->read_disabled is set and bail out with some error if
>> that is the case. The rb_check_pages() function is only a self-check
>> code and as such, I feel it doesn't justify disrupting readers with
>> a new error condition and adding more complex locking.
> 
> Honestly, this code was never made for more than one reader per
> cpu_buffer. I'm perfectly fine if a call to check_pages() causes other
> readers of the same per_cpu buffer to get -EBUSY.
> 
> Do you really see this being a problem? What use case is there for
> hitting the check_pages() and reading the same cpu buffer at the same
> time?

My main issue is the added complexity of checking the new read_disabled
flag. The rb_check_pages() part is simple and you showed what to do; the
readers' part is what I struggle with.

I think the read_disabled flag either needs to be checked once in
rb_get_reader_page(), where the actual problem of making the list
temporarily inconsistent exists, or alternatively it can be checked by
direct or indirect users of rb_get_reader_page() just after they take
the reader_lock.

Looking at the final rb_get_reader_page() function, it currently always
returns a valid reader page unless the buffer doesn't contain any
additional entry or a serious problem is detected by RB_WARN_ON()
checks. This is simple to handle for callers, either they get a reader
page and can continue, or they stop.

Returning -EBUSY means that callers have a new case that they need to
decide what to do about. They need to propagate the error up the chain
or attempt to handle it. This involves the ring-buffer functions
rb_advance_reader(), rb_buffer_peek(), ring_buffer_peek(),
ring_buffer_consume(), ring_buffer_read_page(),
ring_buffer_map_get_reader() and their callers in other source files.

It is possible to handle this new case in these functions but I'm not
sure adding this logic is justified. I feel I must have misunderstood
something about how this should work?

I was further thinking about alternatives that would provide a less
thorough check but keep their complexity limited to rb_check_pages().
The already mentioned idea is to have the function look only at the
nodes surrounding a place where some change in the list occurred.

Another option could be to try traversing the whole list in smaller
parts and give up the reader_lock in between them. This would need some
care to make sure that the operation completes, e.g. the code would need
to bail out if it detects a change on cpu_buffer->pages_read.

Thanks,
Petr




Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks

2024-05-27 Thread Steven Rostedt
On Mon, 27 May 2024 11:36:55 +0200
Petr Pavlu  wrote:

> >>  static void rb_check_pages(struct ring_buffer_per_cpu *cpu_buffer)
> >>  {
> >> @@ -2200,8 +2205,13 @@ int ring_buffer_resize(struct trace_buffer *buffer, 
> >> unsigned long size,
> >> */
> >>synchronize_rcu();
> >>for_each_buffer_cpu(buffer, cpu) {
> >> +  unsigned long flags;
> >> +
> >>cpu_buffer = buffer->buffers[cpu];
> >> +  raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> >>rb_check_pages(cpu_buffer);
> >> +  raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock,
> >> + flags);  
> > 
> > Putting my RT hat on, I really don't like the above fix. The
> > rb_check_pages() iterates all subbuffers which makes the time interrupts
> > are disabled non-deterministic.  
> 
> I see, this applies also to the same rb_check_pages() validation invoked
> from ring_buffer_read_finish().
> 
> > 
> > Instead, I would rather have something where we disable readers while we do
> > the check, and re-enable them.
> > 
> > raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> > cpu_buffer->read_disabled++;
> > raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> > 
> > // Also, don't put flags on a new line. We are allowed to go 100 characters now.  
> 
> Noted.
> 
> > 
> > 
> > rb_check_pages(cpu_buffer);
> > raw_spin_lock_irqsave(&cpu_buffer->reader_lock, flags);
> > cpu_buffer->read_disabled--;
> > raw_spin_unlock_irqrestore(&cpu_buffer->reader_lock, flags);
> > 
> > Or something like that. Yes, that also requires creating a new
> > "read_disabled" field in the ring_buffer_per_cpu code.  
> 
> I think this would work but I'm personally not immediately sold on this
> approach. If I understand the idea correctly, readers should then check
> whether cpu_buffer->read_disabled is set and bail out with some error if
> that is the case. The rb_check_pages() function is only a self-check
> code and as such, I feel it doesn't justify disrupting readers with
> a new error condition and adding more complex locking.

Honestly, this code was never made for more than one reader per
cpu_buffer. I'm perfectly fine if a call to check_pages() causes other
readers of the same per_cpu buffer to get -EBUSY.

Do you really see this being a problem? What use case is there for
hitting the check_pages() and reading the same cpu buffer at the same
time?

-- Steve



Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks

2024-05-27 Thread Petr Pavlu
On 5/20/24 15:50, Steven Rostedt wrote:
> On Fri, 17 May 2024 15:40:08 +0200
> Petr Pavlu  wrote:
> 
>> The reader code in rb_get_reader_page() swaps a new reader page into the
>> ring buffer by doing cmpxchg on old->list.prev->next to point it to the
>> new page. Following that, if the operation is successful,
>> old->list.next->prev gets updated too. This means the underlying
>> doubly-linked list is temporarily inconsistent: page->prev->next or
>> page->next->prev might not point back to page for some page in the
>> ring buffer.
>>
>> The resize operation in ring_buffer_resize() can be invoked in parallel.
>> It calls rb_check_pages() which can detect the described inconsistency
>> and stop further tracing:
>>
>> [  190.271762] [ cut here ]
>> [  190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 
>> rb_check_pages.isra.0+0x6a/0xa0
>> [  190.271789] Modules linked in: [...]
>> [  190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 
>> skx_edac(E):1
>> [  190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G
>> E  6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f
>> [  190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
>> rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014
>> [  190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0
>> [  190.272023] Code: [...]
>> [  190.272028] RSP: 0018:9c37463abb70 EFLAGS: 00010206
>> [  190.272034] RAX: 8eba04b6cb80 RBX: 0007 RCX: 
>> 8eba01f13d80
>> [  190.272038] RDX: 8eba01f130c0 RSI: 8eba04b6cd00 RDI: 
>> 8eba0004c700
>> [  190.272042] RBP: 8eba0004c700 R08: 00010002 R09: 
>> 
>> [  190.272045] R10: 7f52 R11: 8eba7f60 R12: 
>> 8eba0004c720
>> [  190.272049] R13: 8eba00223a00 R14: 0008 R15: 
>> 8eba067a8000
>> [  190.272053] FS:  7f1bd64752c0() GS:8eba7f68() 
>> knlGS:
>> [  190.272057] CS:  0010 DS:  ES:  CR0: 80050033
>> [  190.272061] CR2: 7f1bd6662590 CR3: 00010291e001 CR4: 
>> 00370ef0
>> [  190.272070] DR0:  DR1:  DR2: 
>> 
>> [  190.272073] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [  190.272077] Call Trace:
>> [  190.272098]  
>> [  190.272189]  ring_buffer_resize+0x2ab/0x460
>> [  190.272199]  __tracing_resize_ring_buffer.part.0+0x23/0xa0
>> [  190.272206]  tracing_resize_ring_buffer+0x65/0x90
>> [  190.272216]  tracing_entries_write+0x74/0xc0
>> [  190.272225]  vfs_write+0xf5/0x420
>> [  190.272248]  ksys_write+0x67/0xe0
>> [  190.272256]  do_syscall_64+0x82/0x170
>> [  190.272363]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> [  190.272373] RIP: 0033:0x7f1bd657d263
>> [  190.272381] Code: [...]
>> [  190.272385] RSP: 002b:7ffe72b643f8 EFLAGS: 0246 ORIG_RAX: 
>> 0001
>> [  190.272391] RAX: ffda RBX: 0002 RCX: 
>> 7f1bd657d263
>> [  190.272395] RDX: 0002 RSI: 555a6eb538e0 RDI: 
>> 0001
>> [  190.272398] RBP: 555a6eb538e0 R08: 000a R09: 
>> 
>> [  190.272401] R10: 555a6eb55190 R11: 0246 R12: 
>> 7f1bd6662500
>> [  190.272404] R13: 0002 R14: 7f1bd6667c00 R15: 
>> 0002
>> [  190.272412]  
>> [  190.272414] ---[ end trace  ]---
>>
>> Note that ring_buffer_resize() calls rb_check_pages() only if the parent
>> trace_buffer has recording disabled. Recent commit d78ab792705c
>> ("tracing: Stop current tracer when resizing buffer") means this is
>> now always the case, which makes it more likely to experience this issue.
>>
>> The window to hit this race is nonetheless very small. To help
>> reproducing it, one can add a delay loop in rb_get_reader_page():
>>
>>  ret = rb_head_page_replace(reader, cpu_buffer->reader_page);
>>  if (!ret)
>>  goto spin;
>>  for (unsigned i = 0; i < 1U << 26; i++)  /* inserted delay loop */
>>  __asm__ __volatile__ ("" : : : "memory");
>>  rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list;
>>
>> .. and then run the following commands on the target system:
>>
>>  echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
>>  while true; do
>>  echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
>>  echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
>>  done &
>>  while true; do
>>  for i in /sys/kernel/tracing/per_cpu/*; do
>>  timeout 0.1 cat $i/trace_pipe; sleep 0.2
>>  done
>>  done
>>
>> To fix the problem, make sure ring_buffer_resize() doesn't invoke
>> rb_check_pages() concurrently with a reader operating on the same
>> ring_buffer_per_cpu by taking its cpu_buffer->reader_lock.
> 
> Definitely a bug. Thanks for catching it. But...
> 
>>
>> Fixes: 659f451ff213 ("ring-buffer: Add integrity check at end of iter read")
>> 

Re: [PATCH 2/2] ring-buffer: Fix a race between readers and resize checks

2024-05-20 Thread Steven Rostedt
On Fri, 17 May 2024 15:40:08 +0200
Petr Pavlu  wrote:

> The reader code in rb_get_reader_page() swaps a new reader page into the
> ring buffer by doing cmpxchg on old->list.prev->next to point it to the
> new page. Following that, if the operation is successful,
> old->list.next->prev gets updated too. This means the underlying
> doubly-linked list is temporarily inconsistent: page->prev->next or
> page->next->prev might not point back to page for some page in the
> ring buffer.
> 
> The resize operation in ring_buffer_resize() can be invoked in parallel.
> It calls rb_check_pages() which can detect the described inconsistency
> and stop further tracing:
> 
> [  190.271762] [ cut here ]
> [  190.271771] WARNING: CPU: 1 PID: 6186 at kernel/trace/ring_buffer.c:1467 
> rb_check_pages.isra.0+0x6a/0xa0
> [  190.271789] Modules linked in: [...]
> [  190.271991] Unloaded tainted modules: intel_uncore_frequency(E):1 
> skx_edac(E):1
> [  190.272002] CPU: 1 PID: 6186 Comm: cmd.sh Kdump: loaded Tainted: G 
>E  6.9.0-rc6-default #5 158d3e1e6d0b091c34c3b96bfd99a1c58306d79f
> [  190.272011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014
> [  190.272015] RIP: 0010:rb_check_pages.isra.0+0x6a/0xa0
> [  190.272023] Code: [...]
> [  190.272028] RSP: 0018:9c37463abb70 EFLAGS: 00010206
> [  190.272034] RAX: 8eba04b6cb80 RBX: 0007 RCX: 
> 8eba01f13d80
> [  190.272038] RDX: 8eba01f130c0 RSI: 8eba04b6cd00 RDI: 
> 8eba0004c700
> [  190.272042] RBP: 8eba0004c700 R08: 00010002 R09: 
> 
> [  190.272045] R10: 7f52 R11: 8eba7f60 R12: 
> 8eba0004c720
> [  190.272049] R13: 8eba00223a00 R14: 0008 R15: 
> 8eba067a8000
> [  190.272053] FS:  7f1bd64752c0() GS:8eba7f68() 
> knlGS:
> [  190.272057] CS:  0010 DS:  ES:  CR0: 80050033
> [  190.272061] CR2: 7f1bd6662590 CR3: 00010291e001 CR4: 
> 00370ef0
> [  190.272070] DR0:  DR1:  DR2: 
> 
> [  190.272073] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  190.272077] Call Trace:
> [  190.272098]  
> [  190.272189]  ring_buffer_resize+0x2ab/0x460
> [  190.272199]  __tracing_resize_ring_buffer.part.0+0x23/0xa0
> [  190.272206]  tracing_resize_ring_buffer+0x65/0x90
> [  190.272216]  tracing_entries_write+0x74/0xc0
> [  190.272225]  vfs_write+0xf5/0x420
> [  190.272248]  ksys_write+0x67/0xe0
> [  190.272256]  do_syscall_64+0x82/0x170
> [  190.272363]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  190.272373] RIP: 0033:0x7f1bd657d263
> [  190.272381] Code: [...]
> [  190.272385] RSP: 002b:7ffe72b643f8 EFLAGS: 0246 ORIG_RAX: 
> 0001
> [  190.272391] RAX: ffda RBX: 0002 RCX: 
> 7f1bd657d263
> [  190.272395] RDX: 0002 RSI: 555a6eb538e0 RDI: 
> 0001
> [  190.272398] RBP: 555a6eb538e0 R08: 000a R09: 
> 
> [  190.272401] R10: 555a6eb55190 R11: 0246 R12: 
> 7f1bd6662500
> [  190.272404] R13: 0002 R14: 7f1bd6667c00 R15: 
> 0002
> [  190.272412]  
> [  190.272414] ---[ end trace  ]---
> 
> Note that ring_buffer_resize() calls rb_check_pages() only if the parent
> trace_buffer has recording disabled. Recent commit d78ab792705c
> ("tracing: Stop current tracer when resizing buffer") means this is
> now always the case, which makes it more likely to experience this issue.
> 
> The window to hit this race is nonetheless very small. To help
> reproducing it, one can add a delay loop in rb_get_reader_page():
> 
>  ret = rb_head_page_replace(reader, cpu_buffer->reader_page);
>  if (!ret)
>   goto spin;
>  for (unsigned i = 0; i < 1U << 26; i++)  /* inserted delay loop */
>   __asm__ __volatile__ ("" : : : "memory");
>  rb_list_head(reader->list.next)->prev = &cpu_buffer->reader_page->list;
> 
> .. and then run the following commands on the target system:
> 
>  echo 1 > /sys/kernel/tracing/events/sched/sched_switch/enable
>  while true; do
>   echo 16 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
>   echo 8 > /sys/kernel/tracing/buffer_size_kb; sleep 0.1
>  done &
>  while true; do
>   for i in /sys/kernel/tracing/per_cpu/*; do
>   timeout 0.1 cat $i/trace_pipe; sleep 0.2
>   done
>  done
> 
> To fix the problem, make sure ring_buffer_resize() doesn't invoke
> rb_check_pages() concurrently with a reader operating on the same
> ring_buffer_per_cpu by taking its cpu_buffer->reader_lock.

Definitely a bug. Thanks for catching it. But...

> 
> Fixes: 659f451ff213 ("ring-buffer: Add integrity check at end of iter read")
> Signed-off-by: Petr Pavlu 
> ---
>  kernel/trace/ring_buffer.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git