On 2025-11-13 10:05:24 [-0500], Steven Rostedt wrote:
> This means that the chunks are not being freed and we can't be doing
> synchronize_rcu() in every exit.

You don't have to, you can do call_rcu().

> > Additionally it would guarantee that the buffer is not released in
> > trace_pid_list_free(). I don't see how the seqcount ensures that the
> > buffer is not gone. I mean you could have a reader and the retry would
> > force you to do another loop but before that happens you dereference the
> > upper_chunk pointer which could be reused.
>
> This protection has nothing to do with trace_pid_list_free(). In fact,
> you'll notice that function doesn't even have any locking. That's because
> the pid_list itself is removed from view and RCU synchronization happens
> before that function is called.
>
> The protection in trace_pid_list_is_set() is only to synchronize with the
> adding and removing of the bits in the updates in exit and fork as well as
> with the user manually writing into the set_*_pid files.

So if the kfree() is not an issue, it is just the use of the block from the
freelist which must not point to a wrong item? And therefore the seqcount?

> > So I *think* the RCU approach should be doable and cover this.
>
> Where would you put the synchronize_rcu()? In do_exit()?

simply call_rcu() and let it move to the freelist.

> Also understanding what this is used for helps in understanding the scope
> of protection needed.
>
> The pid_list is created when you add anything into one of the pid files in
> tracefs. Let's use /sys/kernel/tracing/set_ftrace_pid:
>
>  # cd /sys/kernel/tracing
>  # echo $$ > set_ftrace_pid
>  # echo 1 > options/function-fork
>  # cat set_ftrace_pid
>  2716
>  2936
>  # cat set_ftrace_pid
>  2716
>  2945
>
> What the above did was to create a pid_list for the function tracer. I
> added the bash process pid using $$ (2716). Then when I cat the file, it
> showed the pid for the bash process as well as the pid for the cat process,
> as the cat process is a child of the bash process. The function-fork option
> means to add any child process to the set_ftrace_pid if the parent is
> already in the list.
> It also means to remove the pid if a process in the
> list exits.

This adding/ add-on-fork, removing and remove-on-exit is the only write
side?

> When I enable function tracing, it will only trace the bash process and any
> of its children:
>
>  # echo 0 > tracing_on
>  # echo function > current_tracer
>  # cat set_ftrace_pid ; echo 0 > tracing_on
>  2716
>  2989
>  # cat trace
> [..]
>             bash-2716    [003] ..... 36854.662833: rcu_read_lock_held <-mtree_range_walk
>             bash-2716    [003] ..... 36854.662834: rcu_lockdep_current_cpu_online <-rcu_read_lock_held
>             bash-2716    [003] ..... 36854.662834: rcu_read_lock_held <-vma_start_read
> ##### CPU 6 buffer started ####
>              cat-2989    [006] d..2. 36854.662834: ret_from_fork <-ret_from_fork_asm
>             bash-2716    [003] ..... 36854.662835: rcu_lockdep_current_cpu_online <-rcu_read_lock_held
>              cat-2989    [006] d..2. 36854.662836: schedule_tail <-ret_from_fork
>             bash-2716    [003] ..... 36854.662836: __rcu_read_unlock <-lock_vma_under_rcu
>              cat-2989    [006] d..2. 36854.662836: finish_task_switch.isra.0 <-schedule_tail
>             bash-2716    [003] ..... 36854.662836: handle_mm_fault <-do_user_addr_fault
> [..]
>
> It would be way too expensive to check the pid_list at *every* function
> call. But luckily we don't have to. Instead, we set a per-cpu flag in the
> instance trace_array on sched_switch if the next pid is in the pid_list and
> clear it if it is not. (See ftrace_filter_pid_sched_switch_probe()).
>
> This means, the bit being checked in the pid_list is always for a task that
> is about to run.
>
> The bit being cleared, is always for that task that is exiting (except for
> the case of manual updates).
>
> What we are protecting against is when one chunk is freed, but then
> allocated again for a different set of PIDs. Where the reader has the chunk,
> it was freed and re-allocated and the bit that is about to be checked
> doesn't represent the bit it is checking for.

This I assumed.
And the kfree() at the end can not happen while there is still a reader?

…
> And if the "lower" bit matches the set_bit from CPU2, we have a false
> positive. Although, this race is highly unlikely, we should still protect
> against it (it could happen on a VM vCPU that was preempted in
> trace_pid_list_is_set()).
>
> -- Steve

Sebastian
