On Sat, Jun 9, 2018 at 9:00 AM, Tetsuo Handa
<penguin-ker...@i-love.sakura.ne.jp> wrote:
> On 2018/06/09 6:58, Andrew Morton wrote:
>> On Fri, 8 Jun 2018 15:30:43 +0200 Dmitry Vyukov <dvyu...@google.com> wrote:
>>
>>> Currently task hung checking period is equal to timeout,
>>> as the result hung is detected anywhere between timeout and 2*timeout.
>>> This is fine for most interactive environments, but this hurts automated
>>> testing setups (syzbot). In an automated setup we need to strictly order
>>> CPU lockup < RCU stall < workqueue lockup < task hung < silent loss,
>>> so that RCU stall is not detected as task hung and task hung is not
>>> detected as silent machine loss. The large variance in task hung
>>> detection timeout requires setting silent machine loss timeout to
>>> a very large value (e.g. if task hung is 3 mins, then silent loss
>>> need to be set to ~7 mins). The additional 3 minutes significantly
>>> reduce testing efficiency because usually we crash kernel within
>>> a minute, and this can add hours to bug localization process as it
>>> needs to do dozens of tests.
>>>
>>> Allow setting checking period separately from timeout.
>>> This allows to set timeout to, say, 3 minutes, but period to 10 secs.
>>>
>>> The period is controlled via a new hung_task_period_secs sysctl,
>>> similar to the existing hung_task_timeout_secs sysctl.
>>> The default value of 0 results in the current behavior.
>>
>> I'm rather struggling to understand the difference between "period" and
>> "timeout". We would benefit from a clear description of what these two
>> things do. An appropriate place for this description is
>> Documentation/sysctl/kernel.txt, which this patch forgot to update.
>
> My understanding is that "period" is "how frequently we should check"
> and "timeout" is "how long a thread remained uninterruptible". Maybe
> hung_task_check_interval_secs would be better than hung_task_period_secs.
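
To make the numbers in the quoted description concrete, here is a minimal
Python sketch of how the check interval bounds the detection latency. It is
a model, not the kernel code: the names detection_latency and worst_case are
hypothetical, and it assumes checks fire at exact multiples of the interval
and that a task is reported at the first check where it has been blocked for
at least the timeout.

```python
def detection_latency(block_start, timeout, interval):
    """Seconds from the moment a task blocks until it is reported.

    Assumption: checks run at t = interval, 2*interval, ... and a task is
    reported at the first check where it has been blocked >= timeout.
    """
    # Smallest k with k*interval - block_start >= timeout (integer ceiling).
    k = -(-(block_start + timeout) // interval)
    return k * interval - block_start


def worst_case(timeout, interval):
    """Largest latency over all whole-second offsets within one interval."""
    return max(detection_latency(s, timeout, interval) for s in range(interval))


if __name__ == "__main__":
    # Interval equal to timeout: latency can approach 2*timeout.
    print(worst_case(180, 180))
    # Timeout of 3 minutes checked every 10 seconds: at most one interval late.
    print(worst_case(180, 10))
```

Under these assumptions the worst case is just under timeout + interval, which
is why a short interval lets the silent-loss timeout sit close to the task
hung timeout.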

Hi Tetsuo, Andrew,

I've just mailed v2:

Changes since v1:
 - add entry to Documentation/sysctl/kernel.txt
 - rename hung_task_period_secs sysctl to hung_task_check_interval_secs

Hopefully it is now clearer what the difference is and what each
setting does.

> timeout = 60 and period = 1 would allow hung task to be reported as soon
> as it remained uninterruptible for 60 seconds. That makes me easier to
> narrow down relevant kernel messages and syzbot program.
>
> Well, showing exact slept time, along with all threads which slept more
> than some threshold (e.g. timeout / 2), might be helpful.

You mean that if we report any task, we then scan all tasks a second time
and additionally report tasks that have been blocked for between
timeout/2 and timeout?
Should we do this when hung_task_show_lock is set? Or only when
sysctl_hung_task_panic is set? Or when?
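
One possible reading of the second-scan idea, as a small Python sketch. The
function name scan and the dict-of-blocked-times input are hypothetical; this
only illustrates the selection logic and says nothing about when the kernel
should trigger it.

```python
def scan(blocked_secs, timeout):
    """Return (hung, near_hung) given task name -> seconds uninterruptible.

    hung:      tasks blocked for at least `timeout` seconds.
    near_hung: tasks blocked for more than timeout/2 but less than timeout;
               only collected when at least one task is actually hung.
    """
    hung = {t: s for t, s in blocked_secs.items() if s >= timeout}
    if not hung:
        # Nothing to report, so skip the second pass entirely.
        return {}, {}
    near_hung = {
        t: s for t, s in blocked_secs.items() if timeout / 2 < s < timeout
    }
    return hung, near_hung


if __name__ == "__main__":
    tasks = {"a": 200, "b": 100, "c": 10}
    hung, near_hung = scan(tasks, timeout=180)
    print("hung:", hung)
    print("also blocked for a while:", near_hung)
```

Printing the blocked time next to each task would also cover the "exact slept
time" part of the suggestion.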