Re: [PATCH] kernel/hung_task.c: Monitor killed tasks.

Petr Mladek Thu, 16 May 2019 04:59:01 -0700

CCed Stephen to discuss linux-next related question at the bottom
of the mail.


On Thu 2019-05-16 17:19:12, Tetsuo Handa wrote:
> On 2019/05/15 19:55, Petr Mladek wrote:
> >> +  if (!stamp) {
> >> +          stamp = jiffies;
> >> +          if (!stamp)
> >> +                  stamp++;
> >> +          t->killed_time = stamp;
> >> +          return;
> >> +  }
> > 
> > I might be too dumb but the above code looks pretty tricky to me.
> > It would deserve a comment. Or better, I would remove
> > trick to handle overflow. If it happens, we would just
> > lose one check period.
> 
> We can use
> 
>   static inline unsigned long jiffies_nonzero(void)
>   {
>       const unsigned long stamp = jiffies;
> 
>       return stamp ? stamp : -1;
>   }
> 
> or even shortcut "jiffies | 1" because difference by one jiffie
> is an measurement error for multiple HZ of timeout.

I would just ignore the overflow. We would just start measuring
the timeout in the next check_hung_task() call. It is not
a big deal and removes few lines of a tricky code.

> >> +  if (time_is_after_jiffies(stamp + timeout * HZ))
> >> +          return;
> >> +  trace_sched_process_hang(t);
> >> +  if (sysctl_hung_task_panic) {
> >> +          console_verbose();
> >> +          hung_task_call_panic = true;
> > 
> > IMHO, the delayed task exit is much less fatal than sleeping
> > in an uninterruptible state.
> > 
> > Anyway, the check is much less reliable. In case of hung_task,
> > it is enough when the task gets scheduled. In the new check,
> > the task has to do some amount of work until the signal
> > gets handled and do_exit() is called.
> > 
> > The panic should either get enabled separately or we should
> > never panic in this case.
> 
> OK, we should not share existing sysctl settings.
> 
> But in the context of syzbot's testing where there are only 2 CPUs
> in the target VM (which means that only small number of threads and
> not so much memory) and threads get SIGKILL after 5 seconds from fork(),
> being unable to reach do_exit() within 10 seconds is likely a sign of
> something went wrong. For example, 6 out of 7 trials of a reproducer for
> https://syzkaller.appspot.com/bug?id=835a0b9e75b14b55112661cbc61ca8b8f0edf767
> resulted in "no output from test machine" rather than "task hung".
> This patch is revealing that such killed threads are failing to reach
> do_exit() because they are trapped at unkillable retry loop due to a
> race bug.
> 
> Therefore, I would like to try this patch in linux-next.git for feasibility
> testing whether this patch helps finding more bugs and reproducers for such
> bugs, by bringing "unable to terminate threads" reports out of "no output from
> test machine" reports. We can add sysctl settings before sending to linux.git.

In this case, the watchdog should get enabled on with
CONFIG_DEBUG_AID_FOR_SYZBOT

Also we should ask/inform Stephen about this. I am not sure
if he is willing to resolve eventual conflicts for these
syzboot-specific patches that are not upstream candidates.

A solution might be to create sysbot-specific for-next branch
that Stephen might simply ignore when there are conflicts.
And you would be responsible for updating it.

Best Regards,
Petr

Re: [PATCH] kernel/hung_task.c: Monitor killed tasks.

Reply via email to