On Thu, 11 Sep 2025 15:33:15 +0200
Vladimir Riabchun <[email protected]> wrote:
> A soft lockup was observed when loading amdgpu module,
I'd like to see more about that soft lockup.
> this is the same issue that was fixed in
> commit d0b24b4e91fc ("ftrace: Prevent RCU stall on PREEMPT_VOLUNTARY
> kernels") and commit 42ea22e754ba ("ftrace: Add cond_resched() to
> ftrace_graph_set_hash()").
The above cond_resched() is in the loop of all records that actually look
at all records! And that can be pretty big. On my server, it shows on boot:
[ 1.934175] ftrace: allocating 45706 entries in 180 pages
[ 1.934177] ftrace: allocated 180 pages with 4 groups
That means the loop will go through 45,706 entries. That's quite a lot and
a cond_resched() makes perfect sense.
>
> Fix it the same way by adding cond_resched() in ftrace_module_enable.
>
> Signed-off-by: Vladimir Riabchun <[email protected]>
> ---
> kernel/trace/ftrace.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index a69067367c29..23c4d37c7bcd 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -7526,6 +7526,9 @@ void ftrace_module_enable(struct module *mod)
>
> do_for_each_ftrace_rec(pg, rec) {
> int cnt;
> +
> + cond_resched();
> +
> /*
> * do_for_each_ftrace_rec() is a double loop.
> * module text shares the pg. If a record is
This loop is different. Let me show a bit more context:
do_for_each_ftrace_rec(pg, rec) {
int cnt;
/*
* do_for_each_ftrace_rec() is a double loop.
* module text shares the pg. If a record is
* not part of this module, then skip this pg,
* which the "break" will do.
*/
if (!within_module(rec->ip, mod))
break;
See this "if (!within_module(rec->ip, mod))" break?
Look at the dmesg output again, and you'll see "groups" mentioned.
[ 1.934177] ftrace: allocated 180 pages with 4 groups
That "4 groups" means there are 4 "page groups". That's the "pg" in the
do_for_each_ftrace_recr() function.
This means in my scenario, it loops 4 times. And then it will loop through
each module.
How big is the amdgpu driver? How many functions does it have?
# grep amdgpu /sys/kernel/tracing/available_filter_functions | wc -l
And I'm guessing that this is only an issue when ftrace is enabled:
if (ftrace_start_up && cnt) {
int failed = __ftrace_replace_code(rec, 1);
if (failed) {
ftrace_bug(failed, rec);
goto out_loop;
}
}
As that could slow things down.
If this is all the case, then the cond_resched() should be with the
ftrace_start_up code and not in the open like you have it.
if (ftrace_start_up && cnt) {
int failed = __ftrace_replace_code(rec, 1);
if (failed) {
ftrace_bug(failed, rec);
goto out_loop;
}
+ cond_resched();
}
-- Steve