[Kernel-packages] [Bug 1836971] Re: sched: Prevent CPU lockups when task groups take longer than the period

2019-12-27 Thread Po-Hsu Lin
** Changed in: linux (Ubuntu)
   Status: Incomplete => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1836971

Title:
  sched: Prevent CPU lockups when task groups take longer than the
  period

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  Fix Released

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/1836971

  [Impact]

  On machines with extremely high CPU usage, parent task groups which
  have a large number of children can make the for loop in
  sched_cfs_period_timer() run until the watchdog fires when the
  cfs_period_us setting is too short.

  In this particular case, the call to hrtimer_forward_now() is
  unlikely to ever return 0, so the for loop never exits and tasks are
  never rescheduled.

  The large number of children makes do_sched_cfs_period_timer() take
  longer than the period, which impacts calls to hrtimer_forward_now().
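
  For context, here is a simplified sketch of the pre-patch timer
  callback, paraphrased from kernel/sched/fair.c of the 4.15 era (not
  the verbatim source):

  static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
  {
      struct cfs_bandwidth *cfs_b =
          container_of(timer, struct cfs_bandwidth, period_timer);
      int overrun;
      int idle = 0;

      raw_spin_lock(&cfs_b->lock);
      for (;;) {
          /* returns 0 once the timer expiry lies in the future again */
          overrun = hrtimer_forward_now(timer, cfs_b->period);
          if (!overrun)
              break;

          /*
           * If walking the task group tree here takes longer than
           * cfs_b->period, the timer has already overrun again by the
           * time we loop around, hrtimer_forward_now() keeps returning
           * nonzero, and we spin in interrupt context until the
           * watchdog fires.
           */
          idle = do_sched_cfs_period_timer(cfs_b, overrun);
      }
      raw_spin_unlock(&cfs_b->lock);

      return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
  }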

  The kernel will produce this call trace:

  CPU: 51 PID: 0 Comm: swapper/51 Tainted: P OELK 4.15.0-50-generic #54-Ubuntu

  Call Trace:
  <IRQ>
  ? sched_clock+0x9/0x10
  walk_tg_tree_from+0x61/0xd0
  ? task_rq_unlock+0x30/0x30
  unthrottle_cfs_rq+0xcb/0x1a0
  distribute_cfs_runtime+0xd7/0x100
  sched_cfs_period_timer+0xd9/0x1a0
  ? sched_cfs_slack_timer+0xe0/0xe0
  __hrtimer_run_queues+0xdf/0x230
  hrtimer_interrupt+0xa0/0x1d0
  smp_apic_timer_interrupt+0x6f/0x130
  apic_timer_interrupt+0x84/0x90
  </IRQ>

  This has been hit in production on a particularly heavily utilised
  Hadoop cluster powering an analytics platform. About 30% of the
  cluster experiences this issue every week, and the affected machines
  need a manual reboot to get back online.

  [Fix]

  This was fixed in 5.1 upstream with the below commit:

  commit 2e8e19226398db8265a8e675fcc0118b9e80c9e8
  Author: Phil Auld
  Date: Tue Mar 19 09:00:05 2019 -0400
  Subject: sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup

  This commit adds a check for the loop having run too many times and,
  if it has, scales up both the period and the quota so the timer can
  complete before the next period expires, allowing tasks to be
  rescheduled normally. A paraphrased sketch of the core change
  follows.
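
  The heart of that change, paraphrased rather than the verbatim diff
  (note that the warning it emits matches the log lines shown in the
  Testcase section below):

  /*
   * Inside the for (;;) loop of sched_cfs_period_timer(), after
   * hrtimer_forward_now() has reported another overrun:
   */
  if (++count > 3) {
      u64 new, old = ktime_to_ns(cfs_b->period);

      /* grow the period by roughly 15% and cap it at the maximum */
      new = (old * 147) / 128;
      new = min(new, max_cfs_quota_period);
      cfs_b->period = ns_to_ktime(new);

      /* scale the quota by the same factor so the CPU share is kept */
      cfs_b->quota *= new;
      cfs_b->quota = div64_u64(cfs_b->quota, old);

      pr_warn_ratelimited("cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
          smp_processor_id(),
          div_u64(new, NSEC_PER_USEC),
          div_u64(cfs_b->quota, NSEC_PER_USEC));

      /* reset so one long period does not keep scaling forever */
      count = 0;
  }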

  Note, 2e8e19226398db8265a8e675fcc0118b9e80c9e8 was included in
  upstream stable versions 4.4.179, 4.9.171, 4.14.114, 4.19.37, 5.0.10.

  This patch requires minor backporting for 4.15, so please cherry pick
  d069fe4844f8d799d771659a745fe91870c93fda from upstream stable 4.14.y,
  where the backport has been done by the original author, to all
  bionic kernels.

  [Testcase]

  This is difficult to reproduce on demand, so it was tested on a
  production Hadoop cluster under extremely high CPU load.

  I built a test kernel, which is available here:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf232784-test

  For unpatched kernels, expect the machine to lock up and print the
  call trace shown in the Impact section.

  For patched kernels, if the machine hits the condition, it will print
  a warning to the kernel log with the new period and quota that have
  been applied:

  Example from the same hadoop cluster with a machine running the test
  kernel:

  % uname -a
  4.15.0-50-generic #54+hf232784v20190626b1-Ubuntu
  % sudo grep cfs /var/log/kern.log.*
  cfs_period_timer[cpu40]: period too short, scaling up (new cfs_period_us 67872, cfs_quota_us = 3475091)
  cfs_period_timer[cpu48]: period too short, scaling up (new cfs_period_us 22430, cfs_quota_us = 1148437)
  cfs_period_timer[cpu48]: period too short, scaling up (new cfs_period_us 25759, cfs_quota_us = 1318908)
  cfs_period_timer[cpu68]: period too short, scaling up (new cfs_period_us 29583, cfs_quota_us = 1514684)
  cfs_period_timer[cpu49]: period too short, scaling up (new cfs_period_us 33974, cfs_quota_us = 1739519)
  cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 39017, cfs_quota_us = 1997729)
  cfs_period_timer[cpu10]: period too short, scaling up (new cfs_period_us 44809, cfs_quota_us = 2294267)
  cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 51460, cfs_quota_us = 2634823)
  cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 59099, cfs_quota_us = 3025929)
  cfs_period_timer[cpu3]: period too short, scaling up (new cfs_period_us 67872, cfs_quota_us = 3475091)
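
  Note that successive entries for the same CPU grow by the 147/128
  scale factor from the commit sketched in the Fix section (roughly
  15%): for example, cpu48 goes from cfs_period_us 22430 to 25759
  (22430 * 147 / 128 ~= 25759), and its quota grows by the same ratio
  (1148437 * 147 / 128 ~= 1318908), so each task group's share of CPU
  time is preserved.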

  [Regression Potential]
  This patch was accepted into upstream stable versions 4.4.179, 4.9.171,
  4.14.114, 4.19.37, 5.0.10, and is thus treated as stable and trusted by the
  community.

  Xenial received this patch in 4.4.0-150.176, as per LP #1828420
  Disco will receive this patch in the next version, as per LP #1830922
  Eoan already has the patch, being based on 5.2.

  While this does affect a core part of the kernel, the scheduler, the
  patch has been extensively tested and proven in production
  environments, so the overall risk is low.


[Kernel-packages] [Bug 1836971] Re: sched: Prevent CPU lockups when task groups take longer than the period

2019-08-07 Thread Matthew Ruffell
I installed linux-hwe 4.15.0-57 #63~16.04.1-Ubuntu on Xenial and placed
a heavy load on the system.

Everything remained stable and no lockups occurred. Given the
successful tests in the Hadoop environment under Bionic, and the fact
that this patch comes from upstream stable, I am happy to mark this as
verified for Xenial.

** Tags removed: verification-needed-xenial
** Tags added: verification-done-xenial


[Kernel-packages] [Bug 1836971] Re: sched: Prevent CPU lockups when task groups take longer than the period

2019-08-07 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag 'verification-needed-
xenial' to 'verification-done-xenial'. If the problem still exists,
change the tag 'verification-needed-xenial' to 'verification-failed-
xenial'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: verification-needed-xenial


[Kernel-packages] [Bug 1836971] Re: sched: Prevent CPU lockups when task groups take longer than the period

2019-07-30 Thread Matthew Ruffell
I installed 4.15.0-56 from -proposed, ran some CPU stress tests, and
everything was stable with no problems detected.

The kernel was also installed onto a few machines from the production
Hadoop cluster which experienced extremely high CPU load, and the
kernel is stable, with no CPU lockups in sight.

Since this is also an upstream -stable patch, and we have good test
results from the Hadoop cluster which was suffering the issue, I am
happy to mark this as verified.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic


[Kernel-packages] [Bug 1836971] Re: sched: Prevent CPU lockups when task groups take longer than the period

2019-07-25 Thread Ubuntu Kernel Bot
This bug is awaiting verification that the kernel in -proposed solves
the problem. Please test the kernel and update this bug with the
results. If the problem is solved, change the tag 'verification-needed-
bionic' to 'verification-done-bionic'. If the problem still exists,
change the tag 'verification-needed-bionic' to 'verification-failed-
bionic'.

If verification is not done by 5 working days from today, this fix will
be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on
how to enable and use -proposed. Thank you!


** Tags added: verification-needed-bionic


[Kernel-packages] [Bug 1836971] Re: sched: Prevent CPU lockups when task groups take longer than the period

2019-07-22 Thread Khaled El Mously
** Changed in: linux (Ubuntu Bionic)
   Status: In Progress => Fix Committed


[Kernel-packages] [Bug 1836971] Re: sched: Prevent CPU lockups when task groups take longer than the period

2019-07-17 Thread Matthew Ruffell
** Description changed:

- Please cherry pick 2e8e19226398db8265a8e675fcc0118b9e80c9e8 to all bionic
- kernels.
+ This patch requires minor backporting for 4.15, so please cherry pick
+ d069fe4844f8d799d771659a745fe91870c93fda from upstream stable 4.14.y, where
+ the backport has been done by the original author, to all bionic kernels.
