On 2025/10/14 19:18, Li,Rongqing wrote:
Currently, when 'hung_task_panic' is enabled, the kernel panics
immediately upon detecting the first hung task. However, some hung
tasks are transient and the system can recover, while others are
persistent and may accumulate progressively.
My understanding is that this patch is meant to:
+ report even temporary stalls
+ panic only when the stall was much longer and likely persistent
Which might make some sense. But the code does something else.
Cool. Sounds good to me!
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
*/
sysctl_hung_task_detect_count++;
+ total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
trace_sched_process_hang(t);
- if (sysctl_hung_task_panic) {
+ if (sysctl_hung_task_panic &&
+ (total_hung_task >= sysctl_hung_task_panic)) {
console_verbose();
hung_task_show_lock = true;
hung_task_call_panic = true;
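To make the behaviour of the hunk above easier to discuss, here is a minimal user-space sketch of what it appears to compute. It assumes that prev_detect_count is a snapshot of the detect counter taken at the start of each scan; the hunk does not show where it is set, so that is a guess. The helpers scan_begin() and report_hung_task() are hypothetical and are not kernel functions.

```c
#include <stdbool.h>

/* Models sysctl_hung_task_detect_count: total hung tasks seen so far. */
static unsigned long detect_count;

/*
 * Hypothetical helper: take the snapshot that the patch calls
 * prev_detect_count. ASSUMPTION: this happens once per scan.
 */
static unsigned long scan_begin(void)
{
    return detect_count;
}

/*
 * Hypothetical helper mirroring the patched check_hung_task():
 * called once per hung task found during a scan. Returns true when
 * the panic path would be taken. A panic_threshold of 0 disables it.
 */
static bool report_hung_task(unsigned long prev_detect_count,
                             unsigned int panic_threshold)
{
    unsigned long total_hung_task;

    detect_count++;
    /* Number of hung tasks reported since the snapshot was taken. */
    total_hung_task = detect_count - prev_detect_count;

    return panic_threshold && total_hung_task >= panic_threshold;
}
```

Under that assumption the threshold counts hung tasks found within one scan, not how long any of them has been stuck. A single task that stays hung across many scans would never reach a threshold above 1.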
I would expect that this patch added another counter, similar to
sysctl_hung_task_detect_count. It would be incremented only once per
check when a hung task was detected. And it would be cleared (reset)
when no hung task was found.
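A rough user-space sketch of the counter described above, for illustration only. The struct, its fields, and hung_check_finish() are hypothetical names and do not exist in kernel/hung_task.c; the real change would have to hook into the periodic scan.

```c
#include <stdbool.h>

/* Hypothetical state; not part of the kernel sources. */
struct hung_check_state {
    /* Number of scans in a row that found at least one hung task. */
    unsigned int consecutive_hung_checks;
    /* Panic after this many consecutive scans; 0 disables the panic. */
    unsigned int panic_threshold;
};

/*
 * Call once at the end of every scan. hung_found says whether that
 * scan detected at least one hung task. Returns true when the panic
 * path would be taken.
 */
static bool hung_check_finish(struct hung_check_state *s, bool hung_found)
{
    if (!hung_found) {
        /* A clean scan means the stall was transient: start over. */
        s->consecutive_hung_checks = 0;
        return false;
    }

    /* Incremented once per scan, however many tasks were hung. */
    s->consecutive_hung_checks++;

    return s->panic_threshold &&
           s->consecutive_hung_checks >= s->panic_threshold;
}
```

With panic_threshold set to 3, two scans with hung tasks followed by a clean scan reset the counter, and the panic only fires after three scans in a row report a hang.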
Much cleaner. We could add an internal counter for that, yeah. No need to
expose it to userspace ;)
Petr's suggestion seems to align better with the goal of panicking on
persistent hangs, IMHO. Panic after N consecutive checks with hung tasks.
@RongQing does that work for you?
In my opinion, a single task hang is not a critical issue. Fatal hangs, such as
those caused by I/O hangs, network card failures, or hangs while holding
locks, will inevitably lead to multiple tasks being hung. In such scenarios,
users cannot even log in to the machine, making it extremely difficult to
investigate the root cause. Therefore, I believe the current approach is sound.
What's your opinion?
Thanks! I'm fine with either approach. Let's hear what the other folks
think ;)