[tip:sched/core] sched/fair: Use rq_lock/unlock in online_fair_sched_group

2019-08-12 Thread tip-bot for Phil Auld
Commit-ID:  a46d14eca7b75fffe35603aa8b81df654353d80f
Gitweb: https://git.kernel.org/tip/a46d14eca7b75fffe35603aa8b81df654353d80f
Author: Phil Auld 
AuthorDate: Thu, 1 Aug 2019 09:37:49 -0400
Committer:  Thomas Gleixner 
CommitDate: Mon, 12 Aug 2019 14:45:34 +0200

sched/fair: Use rq_lock/unlock in online_fair_sched_group

Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes a
warning to fire in update_rq_clock(). This seems to be caused by onlining
a new fair sched group without using the rq lock wrappers.

  [] rq->clock_update_flags & RQCF_UPDATED
  [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150

  [] Call Trace:
  []  online_fair_sched_group+0x53/0x100
  []  cpu_cgroup_css_online+0x16/0x20
  []  online_css+0x1c/0x60
  []  cgroup_apply_control_enable+0x231/0x3b0
  []  cgroup_mkdir+0x41b/0x530
  []  kernfs_iop_mkdir+0x61/0xa0
  []  vfs_mkdir+0x108/0x1a0
  []  do_mkdirat+0x77/0xe0
  []  do_syscall_64+0x55/0x1d0
  []  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using the wrappers in online_fair_sched_group instead of the raw locking
removes this warning.

[ tglx: Use rq_*lock_irq() ]
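
For reference, a minimal sketch of why the wrapper matters (simplified from
the rq_flags helpers in kernel/sched/sched.h; exact code may differ between
kernel versions): rq_lock_irq() pins the lock and, under CONFIG_SCHED_DEBUG,
clears the RQCF_UPDATED bit, so the WARN_DOUBLE_CLOCK assertion in
update_rq_clock() does not see a stale "already updated" flag left over from
an earlier lock section. A bare raw_spin_lock_irq() leaves
rq->clock_update_flags untouched.

  /* Simplified sketch, not a verbatim copy of kernel/sched/sched.h. */
  static inline void rq_pin_lock(struct rq *rq, struct rq_flags *rf)
  {
          rf->cookie = lockdep_pin_lock(&rq->lock);
  #ifdef CONFIG_SCHED_DEBUG
          /* Drop RQCF_UPDATED so the first update_rq_clock() in this
           * locked section does not trip WARN_DOUBLE_CLOCK. */
          rq->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
          rf->clock_update_flags = 0;
  #endif
  }

  static inline void rq_lock_irq(struct rq *rq, struct rq_flags *rf)
  {
          raw_spin_lock_irq(&rq->lock);
          rq_pin_lock(rq, rf);    /* this is the step raw locking skips */
  }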

Signed-off-by: Phil Auld 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Vincent Guittot 
Cc: Ingo Molnar 
Link: https://lkml.kernel.org/r/20190801133749.11033-1-pa...@redhat.com
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19c58599e967..1054d2cf6aaa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10281,18 +10281,18 @@ err:
 void online_fair_sched_group(struct task_group *tg)
 {
struct sched_entity *se;
+   struct rq_flags rf;
struct rq *rq;
int i;
 
for_each_possible_cpu(i) {
rq = cpu_rq(i);
se = tg->se[i];
-
-   raw_spin_lock_irq(&rq->lock);
+   rq_lock_irq(rq, &rf);
update_rq_clock(rq);
attach_entity_cfs_rq(se);
sync_throttle(tg, i);
-   raw_spin_unlock_irq(&rq->lock);
+   rq_unlock_irq(rq, &rf);
}
 }
 


[tip:sched/core] sched/fair: Use rq_lock/unlock in online_fair_sched_group

2019-08-08 Thread tip-bot for Phil Auld
Commit-ID:  6b8fd01b21f5f2701b407a7118f236ba4c41226d
Gitweb: https://git.kernel.org/tip/6b8fd01b21f5f2701b407a7118f236ba4c41226d
Author: Phil Auld 
AuthorDate: Thu, 1 Aug 2019 09:37:49 -0400
Committer:  Peter Zijlstra 
CommitDate: Thu, 8 Aug 2019 09:09:31 +0200

sched/fair: Use rq_lock/unlock in online_fair_sched_group

Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes a
warning to fire in update_rq_clock(). This seems to be caused by onlining
a new fair sched group without using the rq lock wrappers.

  [] rq->clock_update_flags & RQCF_UPDATED
  [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150

  [] Call Trace:
  []  online_fair_sched_group+0x53/0x100
  []  cpu_cgroup_css_online+0x16/0x20
  []  online_css+0x1c/0x60
  []  cgroup_apply_control_enable+0x231/0x3b0
  []  cgroup_mkdir+0x41b/0x530
  []  kernfs_iop_mkdir+0x61/0xa0
  []  vfs_mkdir+0x108/0x1a0
  []  do_mkdirat+0x77/0xe0
  []  do_syscall_64+0x55/0x1d0
  []  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using the wrappers in online_fair_sched_group instead of the raw locking
removes this warning.

Signed-off-by: Phil Auld 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Ingo Molnar 
Cc: Vincent Guittot 
Cc: Ingo Molnar 
Link: https://lkml.kernel.org/r/20190801133749.11033-1-pa...@redhat.com
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19c58599e967..d9407517dae9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10281,18 +10281,18 @@ err:
 void online_fair_sched_group(struct task_group *tg)
 {
struct sched_entity *se;
+   struct rq_flags rf;
struct rq *rq;
int i;
 
for_each_possible_cpu(i) {
rq = cpu_rq(i);
se = tg->se[i];
-
-   raw_spin_lock_irq(&rq->lock);
+   rq_lock(rq, &rf);
update_rq_clock(rq);
attach_entity_cfs_rq(se);
sync_throttle(tg, i);
-   raw_spin_unlock_irq(&rq->lock);
+   rq_unlock(rq, &rf);
}
 }
 


[tip:sched/urgent] sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup

2019-04-16 Thread tip-bot for Phil Auld
Commit-ID:  2e8e19226398db8265a8e675fcc0118b9e80c9e8
Gitweb: https://git.kernel.org/tip/2e8e19226398db8265a8e675fcc0118b9e80c9e8
Author: Phil Auld 
AuthorDate: Tue, 19 Mar 2019 09:00:05 -0400
Committer:  Ingo Molnar 
CommitDate: Tue, 16 Apr 2019 16:50:05 +0200

sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup

With an extremely short cfs_period_us setting on a parent task group with a
large number of children, the for loop in sched_cfs_period_timer() can run
until the watchdog fires. There is no guarantee that the call to
hrtimer_forward_now() will ever return 0. The large number of children can
make do_sched_cfs_period_timer() take longer than the period.

 NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
 RIP: 0010:tg_nop+0x0/0x10
  
  walk_tg_tree_from+0x29/0xb0
  unthrottle_cfs_rq+0xe0/0x1a0
  distribute_cfs_runtime+0xd3/0xf0
  sched_cfs_period_timer+0xcb/0x160
  ? sched_cfs_slack_timer+0xd0/0xd0
  __hrtimer_run_queues+0xfb/0x270
  hrtimer_interrupt+0x122/0x270
  smp_apic_timer_interrupt+0x6a/0x140
  apic_timer_interrupt+0xf/0x20
  

To prevent this, we add protection to the loop that detects when the loop has
run too many times and scales the period and quota up, proportionally, so
that the timer can complete before the next period expires. This preserves
the relative runtime quota while preventing the hard lockup.

A warning is issued reporting this state and the new values.
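
As a hypothetical worked example of the scaling (values chosen purely for
illustration): with cfs_period_us = 100 and cfs_quota_us = 50, one scaling
step gives a new period of 100 * 147 / 128 ~= 114us and a quota scaled by
the same factor to ~57us, so the 50% quota/period ratio is preserved while
the period grows by roughly 15% per step, capped at max_cfs_quota_period.

  /* Standalone illustration of the scaling math above; not kernel code. */
  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
          uint64_t period_ns = 100 * 1000ULL;     /* cfs_period_us = 100 */
          uint64_t quota_ns  =  50 * 1000ULL;     /* cfs_quota_us  =  50 */

          uint64_t new_period = (period_ns * 147) / 128;          /* ~115% */
          uint64_t new_quota  = (quota_ns * new_period) / period_ns;

          /* Prints: period 100us -> 114us, quota 50us -> 57us (~50% kept). */
          printf("period %lluus -> %lluus, quota %lluus -> %lluus\n",
                 (unsigned long long)(period_ns / 1000),
                 (unsigned long long)(new_period / 1000),
                 (unsigned long long)(quota_ns / 1000),
                 (unsigned long long)(new_quota / 1000));
          return 0;
  }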

Signed-off-by: Phil Auld 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: 
Cc: Anton Blanchard 
Cc: Ben Segall 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20190319130005.25492-1-pa...@redhat.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40bd1e27b1b7..a4d9e14bf138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
 }
 
+extern const u64 max_cfs_quota_period;
+
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
struct cfs_bandwidth *cfs_b =
@@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
unsigned long flags;
int overrun;
int idle = 0;
+   int count = 0;
 
raw_spin_lock_irqsave(&cfs_b->lock, flags);
for (;;) {
@@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
if (!overrun)
break;
 
+   if (++count > 3) {
+   u64 new, old = ktime_to_ns(cfs_b->period);
+
+   new = (old * 147) / 128; /* ~115% */
+   new = min(new, max_cfs_quota_period);
+
+   cfs_b->period = ns_to_ktime(new);
+
+   /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
+   cfs_b->quota *= new;
+   cfs_b->quota = div64_u64(cfs_b->quota, old);
+
+   pr_warn_ratelimited(
+   "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
+   smp_processor_id(),
+   div_u64(new, NSEC_PER_USEC),
+   div_u64(cfs_b->quota, NSEC_PER_USEC));
+
+   /* reset count so we don't come right back in here */
+   count = 0;
+   }
+
idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
}
if (idle)


[tip:sched/core] sched/fair: Limit sched_cfs_period_timer loop to avoid hard lockup

2019-04-03 Thread tip-bot for Phil Auld
Commit-ID:  06ec5d30e8d57b820d44df6340dcb25010d6d0fa
Gitweb: https://git.kernel.org/tip/06ec5d30e8d57b820d44df6340dcb25010d6d0fa
Author: Phil Auld 
AuthorDate: Tue, 19 Mar 2019 09:00:05 -0400
Committer:  Ingo Molnar 
CommitDate: Wed, 3 Apr 2019 09:50:23 +0200

sched/fair: Limit sched_cfs_period_timer loop to avoid hard lockup

With an extremely short cfs_period_us setting on a parent task group with a
large number of children, the for loop in sched_cfs_period_timer() can run
until the watchdog fires. There is no guarantee that the call to
hrtimer_forward_now() will ever return 0. The large number of children can
make do_sched_cfs_period_timer() take longer than the period.

 NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
 RIP: 0010:tg_nop+0x0/0x10
  
  walk_tg_tree_from+0x29/0xb0
  unthrottle_cfs_rq+0xe0/0x1a0
  distribute_cfs_runtime+0xd3/0xf0
  sched_cfs_period_timer+0xcb/0x160
  ? sched_cfs_slack_timer+0xd0/0xd0
  __hrtimer_run_queues+0xfb/0x270
  hrtimer_interrupt+0x122/0x270
  smp_apic_timer_interrupt+0x6a/0x140
  apic_timer_interrupt+0xf/0x20
  

To prevent this, we add protection to the loop that detects when the loop has
run too many times and scales the period and quota up, proportionally, so
that the timer can complete before the next period expires. This preserves
the relative runtime quota while preventing the hard lockup.

A warning is issued reporting this state and the new values.

Signed-off-by: Phil Auld 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Anton Blanchard 
Cc: Ben Segall 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: 
Link: https://lkml.kernel.org/r/20190319130005.25492-1-pa...@redhat.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40bd1e27b1b7..d4cce633eac8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
 }
 
+extern const u64 max_cfs_quota_period;
+
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
struct cfs_bandwidth *cfs_b =
@@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
unsigned long flags;
int overrun;
int idle = 0;
+   int count = 0;
 
raw_spin_lock_irqsave(&cfs_b->lock, flags);
for (;;) {
@@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
if (!overrun)
break;
 
+   if (++count > 3) {
+   u64 new, old = ktime_to_ns(cfs_b->period);
+
+   new = (old * 147) / 128; /* ~115% */
+   new = min(new, max_cfs_quota_period);
+
+   cfs_b->period = ns_to_ktime(new);
+
+   /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
+   cfs_b->quota *= new;
+   cfs_b->quota /= old;
+
+   pr_warn_ratelimited(
+   "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n",
+   smp_processor_id(),
+   new/NSEC_PER_USEC,
+   cfs_b->quota/NSEC_PER_USEC);
+
+   /* reset count so we don't come right back in here */
+   count = 0;
+   }
+
idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
}
if (idle)


[tip:sched/urgent] sched/fair: Fix throttle_list starvation with low CFS quota

2018-10-11 Thread tip-bot for Phil Auld
Commit-ID:  baa9be4ffb55876923dc9716abc0a448e510ba30
Gitweb: https://git.kernel.org/tip/baa9be4ffb55876923dc9716abc0a448e510ba30
Author: Phil Auld 
AuthorDate: Mon, 8 Oct 2018 10:36:40 -0400
Committer:  Ingo Molnar 
CommitDate: Thu, 11 Oct 2018 13:10:18 +0200

sched/fair: Fix throttle_list starvation with low CFS quota

With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
distribute_cfs_runtime may not empty the throttled_list before it runs
out of runtime to distribute. In that case, due to the change in commit
c06f04c70489 to put throttled entries at the head of the list, later entries
on the list will starve. Essentially, the same X processes get pulled off
the list, given CPU time, and then, when expired, put back on the head of
the list, where distribute_cfs_runtime will give runtime to the same set of
processes, leaving the rest starved.

Fix the issue by setting a bit in struct cfs_bandwidth when
distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
decide to put the throttled entry on the tail or the head of the list.  The
bit is set/cleared by the callers of distribute_cfs_runtime while they hold
cfs_bandwidth->lock.
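
A minimal sketch of the caller-side pattern (field and function names follow
the patch below; the distribution step itself is elided, since the quoted
hunk is truncated): the flag is flipped under cfs_b->lock around the
unlocked distribution, and throttle_cfs_rq() reads it to pick head vs. tail
insertion, as shown in the first hunk further down.

  /* Sketch of do_sched_cfs_period_timer()'s distribution loop with the
   * distribute_running flag; not a drop-in copy of the patched function. */
  while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
          runtime = cfs_b->runtime;
          cfs_b->distribute_running = 1;
          raw_spin_unlock(&cfs_b->lock);

          /* ... hand out runtime to throttled cfs_rqs and charge it
           * against cfs_b->runtime (elided here) ... */

          raw_spin_lock(&cfs_b->lock);
          cfs_b->distribute_running = 0;  /* subsequent throttles go back to the tail */
          throttled = !list_empty(&cfs_b->throttled_cfs_rq);
  }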

This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
the live system. In some cases you can simply look at the throttled list and
see the later entries are not changing:

  crash> list cfs_rq.throttled_list -H 0x90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
1 90b56cb2d200  -976050
2 90b56cb2cc00  -484925
3 90b56cb2bc00  -658814
4 90b56cb2ba00  -275365
5 90b166a45600  -135138
6 90b56cb2da00  -282505
7 90b56cb2e000  -148065
8 90b56cb2fa00  -872591
9 90b56cb2c000  -84687
   10 90b56cb2f000  -87237
   11 90b166a40a00  -164582

  crash> list cfs_rq.throttled_list -H 0x90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
1 90b56cb2d200  -994147
2 90b56cb2cc00  -306051
3 90b56cb2bc00  -961321
4 90b56cb2ba00  -24490
5 90b166a45600  -135138
6 90b56cb2da00  -282505
7 90b56cb2e000  -148065
8 90b56cb2fa00  -872591
9 90b56cb2c000  -84687
   10 90b56cb2f000  -87237
   11 90b166a40a00  -164582

Sometimes it is easier to see by finding a process getting starved and looking
at the sched_info:

  crash> task 8eb765994500 sched_info
  PID: 7800   TASK: 8eb765994500  CPU: 16  COMMAND: "cputest"
sched_info = {
  pcount = 8,
  run_delay = 697094208,
  last_arrival = 240260125039,
  last_queued = 240260327513
},
  crash> task 8eb765994500 sched_info
  PID: 7800   TASK: 8eb765994500  CPU: 16  COMMAND: "cputest"
sched_info = {
  pcount = 8,
  run_delay = 697094208,
  last_arrival = 240260125039,
  last_queued = 240260327513
},

Signed-off-by: Phil Auld 
Reviewed-by: Ben Segall 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Link: http://lkml.kernel.org/r/20181008143639.ga4...@pauld.bos.csb
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c  | 22 +++---
 kernel/sched/sched.h |  2 ++
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7fc4a371bdd2..f88e00705b55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4476,9 +4476,13 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
/*
 * Add to the _head_ of the list, so that an already-started
-* distribute_cfs_runtime will not see us
+* distribute_cfs_runtime will not see us. If disribute_cfs_runtime is
+* not running add to the tail so that later runqueues don't get starved.
 */
-   list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+   if (cfs_b->distribute_running)
+   list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+   else
+   list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 
/*
 * If we're the first throttled task, make sure the bandwidth
@@ -4622,14 +4626,16 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 * in us over-using our runtime if it is all used during this loop, but
 * only by limited amounts in that extreme case.
 */
-   while (throttled && cfs_b->runtime > 0) {
+   while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
runtime = cfs_b->runtime;
+   cfs_b->distribute_running = 1;
raw_spin_unlock(&cfs_b->lock);
/* we can't nest cfs_b->lock while distributing bandwidth */
  

[tip:sched/urgent] sched/fair: Fix throttle_list starvation with low CFS quota

2018-10-11 Thread tip-bot for Phil Auld
Commit-ID:  8b48300108248e950cde0bdc5708039fc3836623
Gitweb: https://git.kernel.org/tip/8b48300108248e950cde0bdc5708039fc3836623
Author: Phil Auld 
AuthorDate: Mon, 8 Oct 2018 10:36:40 -0400
Committer:  Ingo Molnar 
CommitDate: Thu, 11 Oct 2018 11:18:32 +0200

sched/fair: Fix throttle_list starvation with low CFS quota

With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
distribute_cfs_runtime may not empty the throttled_list before it runs
out of runtime to distribute. In that case, due to the change in commit
c06f04c70489 to put throttled entries at the head of the list, later entries
on the list will starve. Essentially, the same X processes get pulled off
the list, given CPU time, and then, when expired, put back on the head of
the list, where distribute_cfs_runtime will give runtime to the same set of
processes, leaving the rest starved.

Fix the issue by setting a bit in struct cfs_bandwidth when
distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
decide to put the throttled entry on the tail or the head of the list.  The
bit is set/cleared by the callers of distribute_cfs_runtime while they hold
cfs_bandwidth->lock.

This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
the live system. In some cases you can simply look at the throttled list and
see the later entries are not changing:

  crash> list cfs_rq.throttled_list -H 0x90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
1 90b56cb2d200  -976050
2 90b56cb2cc00  -484925
3 90b56cb2bc00  -658814
4 90b56cb2ba00  -275365
5 90b166a45600  -135138
6 90b56cb2da00  -282505
7 90b56cb2e000  -148065
8 90b56cb2fa00  -872591
9 90b56cb2c000  -84687
   10 90b56cb2f000  -87237
   11 90b166a40a00  -164582

  crash> list cfs_rq.throttled_list -H 0x90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
1 90b56cb2d200  -994147
2 90b56cb2cc00  -306051
3 90b56cb2bc00  -961321
4 90b56cb2ba00  -24490
5 90b166a45600  -135138
6 90b56cb2da00  -282505
7 90b56cb2e000  -148065
8 90b56cb2fa00  -872591
9 90b56cb2c000  -84687
   10 90b56cb2f000  -87237
   11 90b166a40a00  -164582

Sometimes it is easier to see by finding a process getting starved and looking
at the sched_info:

  crash> task 8eb765994500 sched_info
  PID: 7800   TASK: 8eb765994500  CPU: 16  COMMAND: "cputest"
sched_info = {
  pcount = 8,
  run_delay = 697094208,
  last_arrival = 240260125039,
  last_queued = 240260327513
},
  crash> task 8eb765994500 sched_info
  PID: 7800   TASK: 8eb765994500  CPU: 16  COMMAND: "cputest"
sched_info = {
  pcount = 8,
  run_delay = 697094208,
  last_arrival = 240260125039,
  last_queued = 240260327513
},

Signed-off-by: Phil Auld 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: sta...@vger.kernel.org
Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Link: http://lkml.kernel.org/r/20181008143639.ga4...@pauld.bos.csb
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c  | 22 +++---
 kernel/sched/sched.h |  2 ++
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7fc4a371bdd2..f88e00705b55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4476,9 +4476,13 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 
/*
 * Add to the _head_ of the list, so that an already-started
-* distribute_cfs_runtime will not see us
+* distribute_cfs_runtime will not see us. If disribute_cfs_runtime is
+* not running add to the tail so that later runqueues don't get starved.
 */
-   list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+   if (cfs_b->distribute_running)
+   list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
+   else
+   list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
 
/*
 * If we're the first throttled task, make sure the bandwidth
@@ -4622,14 +4626,16 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
 * in us over-using our runtime if it is all used during this loop, but
 * only by limited amounts in that extreme case.
 */
-   while (throttled && cfs_b->runtime > 0) {
+   while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
runtime = cfs_b->runtime;
+   cfs_b->distribute_running = 1;
raw_spin_unlock(&cfs_b->lock);
/* we can't nest cfs_b->lock while distributing bandwidth */
runtime = d