CCed Stephen to discuss a linux-next related question at the bottom of the mail.

On Thu 2019-05-16 17:19:12, Tetsuo Handa wrote:
> On 2019/05/15 19:55, Petr Mladek wrote:
> >> +	if (!stamp) {
> >> +		stamp = jiffies;
> >> +		if (!stamp)
> >> +			stamp++;
> >> +		t->killed_time = stamp;
> >> +		return;
> >> +	}
> >
> > I might be too dumb but the above code looks pretty tricky to me.
> > It would deserve a comment. Or better, I would remove
> > trick to handle overflow. If it happens, we would just
> > lose one check period.
>
> We can use
>
> static inline unsigned long jiffies_nonzero(void)
> {
> 	const unsigned long stamp = jiffies;
>
> 	return stamp ? stamp : -1;
> }
>
> or even shortcut "jiffies | 1" because difference by one jiffie
> is an measurement error for multiple HZ of timeout.

I would just ignore the overflow. We would just start measuring
the timeout in the next check_hung_task() call. It is not a big
deal and removes a few lines of tricky code.

> >> +	if (time_is_after_jiffies(stamp + timeout * HZ))
> >> +		return;
> >> +	trace_sched_process_hang(t);
> >> +	if (sysctl_hung_task_panic) {
> >> +		console_verbose();
> >> +		hung_task_call_panic = true;
> >
> > IMHO, the delayed task exit is much less fatal than sleeping
> > in an uninterruptible state.
> >
> > Anyway, the check is much less reliable. In case of hung_task,
> > it is enough when the task gets scheduled. In the new check,
> > the task has to do some amount of work until the signal
> > gets handled and do_exit() is called.
> >
> > The panic should either get enabled separately or we should
> > never panic in this case.
>
> OK, we should not share existing sysctl settings.
>
> But in the context of syzbot's testing where there are only 2 CPUs
> in the target VM (which means that only small number of threads and
> not so much memory) and threads get SIGKILL after 5 seconds from fork(),
> being unable to reach do_exit() within 10 seconds is likely a sign of
> something went wrong.
> For example, 6 out of 7 trials of a reproducer for
> https://syzkaller.appspot.com/bug?id=835a0b9e75b14b55112661cbc61ca8b8f0edf767
> resulted in "no output from test machine" rather than "task hung".
> This patch is revealing that such killed threads are failing to reach
> do_exit() because they are trapped at unkillable retry loop due to a
> race bug.
>
> Therefore, I would like to try this patch in linux-next.git for feasibility
> testing whether this patch helps finding more bugs and reproducers for such
> bugs, by bringing "unable to terminate threads" reports out of "no output from
> test machine" reports. We can add sysctl settings before sending to linux.git.

In this case, the watchdog should get enabled only with
CONFIG_DEBUG_AID_FOR_SYZBOT.

Also we should ask/inform Stephen about this. I am not sure if he
is willing to resolve eventual conflicts for these syzbot-specific
patches that are not upstream candidates.

A solution might be to create a syzbot-specific for-next branch
that Stephen might simply ignore when there are conflicts.
And you would be responsible for updating it.

Best Regards,
Petr
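
To make the timestamp discussion above easier to follow, here is a minimal userspace sketch comparing the three options for a "zero means unset" stamp. It is not kernel code: fake_jiffies, stamp_nonzero(), stamp_or_one(), stamp_raw(), deadline_passed() and self_check() are hypothetical names used only for illustration, and deadline_passed() merely imitates the signed-difference idea behind the kernel's time_after() family.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical userspace stand-in for the kernel's jiffies counter. */
static unsigned long fake_jiffies;

/* Option 1: map a zero reading to ULONG_MAX (the "-1" in the quoted helper). */
static unsigned long stamp_nonzero(void)
{
	const unsigned long stamp = fake_jiffies;

	return stamp ? stamp : -1UL;
}

/* Option 2: force the lowest bit, accepting an error of at most one tick. */
static unsigned long stamp_or_one(void)
{
	return fake_jiffies | 1;
}

/*
 * Option 3: ignore the overflow. A raw zero is stored and still reads as
 * "unset", so measuring simply starts at the next check.
 */
static unsigned long stamp_raw(void)
{
	return fake_jiffies;
}

/*
 * Wrap-safe "now is past the deadline" test using a signed difference.
 * Assumes the usual two's complement conversion to long.
 */
static int deadline_passed(unsigned long now, unsigned long deadline)
{
	return (long)(deadline - now) < 0;
}

/* Small self-check covering the zero reading and a wrapped deadline. */
static void self_check(void)
{
	const unsigned long stamp = ULONG_MAX - 5;
	const unsigned long deadline = stamp + 10;	/* wraps around to 4 */

	fake_jiffies = 0;
	assert(stamp_nonzero() == ULONG_MAX);
	assert(stamp_or_one() == 1UL);
	assert(stamp_raw() == 0UL);	/* still looks unset */

	assert(!deadline_passed(ULONG_MAX - 1, deadline));
	assert(deadline_passed(5, deadline));
}
```

Calling self_check() from a small main() exercises all three variants. Only the third one can leave the stamp looking unset, which costs at most one check period.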