Re: [PATCH] timers: Fix timer inaccuracy

2016-11-09 Thread Joonwoo Park



On 11/09/2016 01:56 AM, Thomas Gleixner wrote:

On Wed, 9 Nov 2016, Joonwoo Park wrote:


When a new timer list is enqueued into the timer wheel, the array index
for the given expiry time is computed as:

  expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
  idx = LVL_OFFS(lvl) + (expires & LVL_MASK);

The granularity of the expiry level is added to the index so that the
timer fires after its expiry time whenever the timer cannot fire at the
exact time because of the level's granularity.  However, the current
index calculation also bumps the index even when the timer can fire at
the exact time.  Consequently, timers which can fire at the exact time,
including all timers in the first-level buckets, currently fire one
jiffy late.

Fix this inaccuracy by adding the level's granularity only when a given
timer cannot fire at the exact time.


That's simply wrong. We guarantee that the timer sleeps for at least a
jiffy. So in case of the first wheel we _must_ increment by one simply
because the next jiffy might be imminent and not doing so would expire the
timer early.

The wheel is not meant to be accurate at all and I really don't want an
extra conditional in that path just to make it accurate for some expiry
values. That's a completely pointless exercise.


I understand the wheel isn't meant to provide much accuracy, and I also 
don't really care about the accuracy of sporadic one-shot timers.
What I'm worried about is the case that relies on a periodic timer with 
a relatively short interval.
If you look at the bus scaling driver function devfreq_monitor() in 
drivers/devfreq/devfreq.c, it polls hardware at every configured interval 
by using deferrable delayed work.  I'm not quite familiar with bus 
scaling, so I'm cc'ing linux-pm.


My guess is that ever since the timer wheel refactoring, the above driver 
has been polling every configured interval + 1 jiffy most of the time.
With CONFIG_HZ_100=y and an interval of 1 jiffy, polling happens every 
20ms rather than the configured 10ms, which is 100% later than ideal.
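
A minimal userspace sketch of that arithmetic, assuming the level-0
constants of the v4.8 wheel layout (LVL_SHIFT(0) = 0, LVL_GRAN(0) = 1,
64 buckets per level) and a made-up "now"; it is not kernel code, it only
models the index math:

#include <stdio.h>

#define LVL_CLK_SHIFT	3
#define LVL_SHIFT(n)	((n) * LVL_CLK_SHIFT)
#define LVL_GRAN(n)	(1u << LVL_SHIFT(n))
#define LVL_MASK	63u

/* v4.8 mainline behaviour: always round up by one granule of the level */
static unsigned int calc_index_mainline(unsigned int expires, unsigned int lvl)
{
	expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
	return expires & LVL_MASK;
}

int main(void)
{
	unsigned int clk = 1000;		/* "now" in jiffies (made up) */
	unsigned int expires = clk + 1;		/* e.g. queue_delayed_work(wq, dw, 1) */
	unsigned int idx = calc_index_mainline(expires, 0);
	unsigned int fires = clk + ((idx - (clk & LVL_MASK)) & LVL_MASK);

	printf("requested expiry: jiffy %u (now + 1)\n", expires);
	printf("bucket serviced at: jiffy %u (now + 2 ticks -> a 20ms period at HZ=100)\n",
	       fires);
	return 0;
}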


If that kind of driver wants to run periodic polling at a level of 
accuracy similar to pre v4.8, each driver has to switch to hrtimer, but 
there are problems apart from the fact that there is no nicely written 
deferred processing mechanism like workqueue built on top of hrtimer 
(see the sketch after this list) -

1) there is no deferrable hrtimer.
2) hrtimer has more overhead than the low resolution timer; in particular 
hrtimer fires an interrupt for each individual timer, which will cause a 
power impact.
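
For illustration, a rough sketch of the kind of boilerplate such a driver
would have to carry if it moved its polling to an hrtimer; the my_poll_*
names, the 10ms interval and the workqueue hand-off are made up, and since
there is no deferrable variant this wakes an idle CPU every period:

#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/workqueue.h>

static struct hrtimer my_poll_timer;
static struct work_struct my_poll_work;
static ktime_t my_poll_interval;

static void my_do_poll(struct work_struct *work)
{
	/* read hardware counters, re-evaluate frequency, ... */
}

static enum hrtimer_restart my_poll_fn(struct hrtimer *t)
{
	schedule_work(&my_poll_work);		/* defer the real work */
	hrtimer_forward_now(t, my_poll_interval);
	return HRTIMER_RESTART;			/* periodic: rearm on every expiry */
}

static void my_poll_start(void)
{
	my_poll_interval = ms_to_ktime(10);
	INIT_WORK(&my_poll_work, my_do_poll);
	hrtimer_init(&my_poll_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	my_poll_timer.function = my_poll_fn;
	hrtimer_start(&my_poll_timer, my_poll_interval, HRTIMER_MODE_REL);
}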


It also makes sense to me that queued timers, especially those with a long 
delay, can tolerate inaccuracy, especially when most of them get canceled 
prior to their expiry time.
But for drivers which use a timer as a polling mechanism and never cancel 
it, IMHO this behaviour change could be a regression.


Thanks,
Joonwoo



Thanks,

tglx



[PATCH] timers: Fix timer inaccuracy

2016-11-09 Thread Joonwoo Park
When a new timer list is enqueued into the timer wheel, the array index
for the given expiry time is computed as:

  expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
  idx = LVL_OFFS(lvl) + (expires & LVL_MASK);

The granularity of the expiry level is added to the index so that the
timer fires after its expiry time whenever the timer cannot fire at the
exact time because of the level's granularity.  However, the current
index calculation also bumps the index even when the timer can fire at
the exact time.  Consequently, timers which can fire at the exact time,
including all timers in the first-level buckets, currently fire one
jiffy late.

Fix this inaccuracy by adding the level's granularity only when a given
timer cannot fire at the exact time.
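
For illustration, the check added below rounds up only when the expiry has
low-order bits under the level's shift; a small userspace sketch, assuming
the level-1 constants LVL_SHIFT(1) = 3 and LVL_GRAN(1) = 8 of the wheel
layout (at level 0 the low-bit mask is 0, so level-0 timers are never
rounded up):

#include <stdio.h>
#include <limits.h>

#define LVL_SHIFT(n)	((n) * 3)
#define LVL_GRAN(n)	(1u << LVL_SHIFT(n))

static unsigned int granule(unsigned int expires, unsigned int lvl)
{
	if (expires & ~(UINT_MAX << LVL_SHIFT(lvl)))	/* cannot fire exactly */
		expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
	else						/* exact multiple: no round-up */
		expires = expires >> LVL_SHIFT(lvl);
	return expires;
}

int main(void)
{
	printf("expires=512 -> granule %u (exact multiple, fires at 512)\n",
	       granule(512, 1));
	printf("expires=513 -> granule %u (rounded up, fires at 520)\n",
	       granule(513, 1));
	return 0;
}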

With CONFIG_HZ_100=y

Before:
  225.768008: timer_start: timer=a00042c0 function=timer_func [timer] expires=4294959868 [timeout=2] flags=0x0e80
  225.797961: timer_expire_entry: timer=a00042c0 function=timer_func [timer] now=4294959869

After:
   54.424805: timer_start: timer=a00042c0 function=timer_func [timer] expires=4294942730 [timeout=2] flags=0x1040
   54.444764: timer_expire_entry: timer=a00042c0 function=timer_func [timer] now=4294942730

Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel")
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: John Stultz <john.stu...@linaro.org>
Cc: Eric Dumazet <eduma...@google.com>
Cc: Frederic Weisbecker <fweis...@gmail.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---
 kernel/time/timer.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index c611c47..f6ad4e9 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -467,17 +467,29 @@ static inline void timer_set_idx(struct timer_list *timer, unsigned int idx)
  */
 static inline unsigned calc_index(unsigned expires, unsigned lvl)
 {
-   expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
+   if (expires & ~(UINT_MAX << LVL_SHIFT(lvl)))
+   expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
+   else
+   expires = expires >> LVL_SHIFT(lvl);
return LVL_OFFS(lvl) + (expires & LVL_MASK);
 }
 
+static inline unsigned calc_index_min_granularity(unsigned expires)
+{
+   return LVL_OFFS(0) + ((expires >> LVL_SHIFT(0)) & LVL_MASK);
+}
+
 static int calc_wheel_index(unsigned long expires, unsigned long clk)
 {
unsigned long delta = expires - clk;
unsigned int idx;
 
if (delta < LVL_START(1)) {
-   idx = calc_index(expires, 0);
+   /*
+* calc_index(expires, 0) should still work but we can
+* optimize as LVL_SHIFT(0) is always 0.
+*/
+   idx = calc_index_min_granularity(expires);
} else if (delta < LVL_START(2)) {
idx = calc_index(expires, 1);
} else if (delta < LVL_START(3)) {
-- 
2.9.3



Re: [GIT PULL] Re: [PATCH 08/15] perf tools: Introduce timestamp_in_usec()

2016-10-28 Thread Joonwoo Park



On 10/28/2016 10:39 AM, Ingo Molnar wrote:


* Arnaldo Carvalho de Melo <a...@kernel.org> wrote:


On Fri, Oct 28, 2016 at 11:30:41AM -0200, Arnaldo Carvalho de Melo wrote:

On Fri, Oct 28, 2016 at 10:53:38AM -0200, Arnaldo Carvalho de Melo wrote:

On Thu, Oct 27, 2016 at 04:14:55PM -0700, Joonwoo Park wrote:

Hmm.. I didn't ACK this patch because of the bug I commented on at
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1256724.html



s/work_list->max_lat/work_list->max_lat_at/



Sorry about that, I took the "thanks for taking care of this" as an ack,
now that I re-read that message I saw your points later on in that
e-mail.


No worries.  I could have been more specific in the first part of my 
comment by mentioning the bug.





Since Ingo hasn't pulled this, I'll try fixing it, will check that other
naming issue,



So, here is how it ended up, it fixes the problem you pointed out and
renames the function to follow the scnprintf() convention, as used
elsewhere in tools/perf (tools/perf/util/annotate.h has several
examples).


Ingo, I've just signed a perf-core-for-mingo-20161028 with the only
change being the patch below, re-run my tests, I think this doesn't
introduce any bugs and addresses Joonwoo's concerns, please consider
pulling.


Pulled, thanks a lot Arnaldo!


Thanks!

Joonwoo



Ingo



Re: [PATCH 08/15] perf tools: Introduce timestamp_in_usec()

2016-10-27 Thread Joonwoo Park



On 10/27/2016 01:40 PM, Arnaldo Carvalho de Melo wrote:

From: Namhyung Kim <namhy...@kernel.org>

Joonwoo reported that there's a mismatch between timestamps in the script
and sched commands.  This was because of a difference in printing the
timestamp.  Factor out the code and share it so that they can be in
sync.  Also I found that sched map has a similar problem, fix it too.

Reported-and-Acked-by: Joonwoo Park <joonw...@codeaurora.org>


Hmm.. I didn't ACK this patch because of the bug I commented on at 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1256724.html


s/work_list->max_lat/work_list->max_lat_at/

Do we have a fix for this?

Thanks,
Joonwoo


Signed-off-by: Namhyung Kim <namhy...@kernel.org>
Cc: David Ahern <dsah...@gmail.com>
Cc: Jiri Olsa <jo...@kernel.org>
Cc: Peter Zijlstra <a.p.zijls...@chello.nl>
Link: http://lkml.kernel.org/r/20161024020246.14928-3-namhy...@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
---
 tools/perf/builtin-sched.c  | 9 ++---
 tools/perf/builtin-script.c | 9 ++---
 tools/perf/util/util.c  | 9 +
 tools/perf/util/util.h  | 3 +++
 4 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 1f33d15314a5..f0ab715b4923 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1191,6 +1191,7 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_
int i;
int ret;
u64 avg;
+   char buf[32];

if (!work_list->nb_atoms)
return;
@@ -1213,11 +1214,11 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_

avg = work_list->total_lat / work_list->nb_atoms;

-   printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: 
%13.6f s\n",
+   printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: 
%13s s\n",
  (double)work_list->total_runtime / NSEC_PER_MSEC,
 work_list->nb_atoms, (double)avg / NSEC_PER_MSEC,
 (double)work_list->max_lat / NSEC_PER_MSEC,
-(double)work_list->max_lat_at / NSEC_PER_SEC);
+timestamp_in_usec(buf, sizeof(buf), work_list->max_lat));
 }

 static int pid_cmp(struct work_atoms *l, struct work_atoms *r)
@@ -1402,6 +1403,7 @@ static int map_switch_event(struct perf_sched *sched, 
struct perf_evsel *evsel,
int cpus_nr;
bool new_cpu = false;
const char *color = PERF_COLOR_NORMAL;
+   char buf[32];

BUG_ON(this_cpu >= MAX_CPUS || this_cpu < 0);

@@ -1492,7 +1494,8 @@ static int map_switch_event(struct perf_sched *sched, 
struct perf_evsel *evsel,
if (sched->map.cpus && !cpu_map__has(sched->map.cpus, this_cpu))
goto out;

-   color_fprintf(stdout, color, "  %12.6f secs ", (double)timestamp / 
NSEC_PER_SEC);
+   color_fprintf(stdout, color, "  %12s secs ",
+ timestamp_in_usec(buf, sizeof(buf), timestamp));
if (new_shortname || (verbose && sched_in->tid)) {
const char *pid_color = color;

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 412fb6e65ac0..dae4d1013c33 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -441,7 +441,6 @@ static void print_sample_start(struct perf_sample *sample,
 {
struct perf_event_attr *attr = &evsel->attr;
unsigned long secs;
-   unsigned long usecs;
unsigned long long nsecs;

if (PRINT_FIELD(COMM)) {
@@ -468,14 +467,18 @@ static void print_sample_start(struct perf_sample *sample,
}

if (PRINT_FIELD(TIME)) {
+   char buf[32];
+   size_t sz = sizeof(buf);
+
nsecs = sample->time;
secs = nsecs / NSEC_PER_SEC;
nsecs -= secs * NSEC_PER_SEC;
-   usecs = nsecs / NSEC_PER_USEC;
+
if (nanosecs)
printf("%5lu.%09llu: ", secs, nsecs);
else
-   printf("%5lu.%06lu: ", secs, usecs);
+   printf("%12s: ", timestamp_in_usec(buf, sz,
+  sample->time));
}
 }

diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index 85c56800f17a..aa3e778989ce 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -433,6 +433,15 @@ int parse_nsec_time(const char *str, u64 *ptime)
return 0;
 }

+char *timestamp_in_usec(char *buf, size_t sz, u64 timestamp)
+{
+   u64  sec = timestamp / NSEC_PER_SEC;
+   u64 usec = (timestamp % NSEC_PER_SEC) / NSEC_PER_USEC;
+
+   scnprintf(buf, sz, "%"PRIu64".%06"PRIu64, sec, usec);
+   return buf;
+}
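
The mismatch described in the changelog above comes down to rounding
versus truncation; a minimal userspace sketch with a made-up sample time
(the old "%13.6f" double path rounds, the integer path used by
timestamp_in_usec() truncates):

#include <stdio.h>
#include <inttypes.h>

#define NSEC_PER_SEC	1000000000ULL
#define NSEC_PER_USEC	1000ULL

int main(void)
{
	uint64_t ts = 1999999900ULL;	/* 1.9999999 s, made-up sample->time */
	uint64_t sec = ts / NSEC_PER_SEC;
	uint64_t usec = (ts % NSEC_PER_SEC) / NSEC_PER_USEC;

	printf("double path : %13.6f\n", (double)ts / NSEC_PER_SEC);	/* 2.000000 */
	printf("integer path: %" PRIu64 ".%06" PRIu64 "\n", sec, usec);	/* 1.999999 */
	return 0;
}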

Re: [PATCH 3/3] perf tools: Introduce timestamp_in_usec()

2016-10-24 Thread Joonwoo Park



On 10/23/2016 07:02 PM, Namhyung Kim wrote:

Joonwoo reported that there's a mismatch between timestamps in the script
and sched commands.  This was because of a difference in printing the
timestamp.  Factor out the code and share it so that they can be in
sync.  Also I found that sched map has a similar problem, fix it too.

Reported-by: Joonwoo Park <joonw...@codeaurora.org>


Sorry, I was busy with something else so I didn't have a chance to follow up 
on my initial proposal; thanks for taking care of this.



Signed-off-by: Namhyung Kim <namhy...@kernel.org>
---
 tools/perf/builtin-sched.c  | 9 ++---
 tools/perf/builtin-script.c | 9 ++---
 tools/perf/util/util.c  | 9 +
 tools/perf/util/util.h  | 3 +++
 4 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 97d6cbf486bb..c88d64ae997e 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1191,6 +1191,7 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_
int i;
int ret;
u64 avg;
+   char buf[32];

if (!work_list->nb_atoms)
return;
@@ -1213,11 +1214,11 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_

avg = work_list->total_lat / work_list->nb_atoms;

-   printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: 
%13.6f s\n",
+   printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: 
%13s s\n",
  (double)work_list->total_runtime / NSEC_PER_MSEC,
 work_list->nb_atoms, (double)avg / NSEC_PER_MSEC,
 (double)work_list->max_lat / NSEC_PER_MSEC,
-(double)work_list->max_lat_at / NSEC_PER_SEC);
+timestamp_in_usec(buf, sizeof(buf), work_list->max_lat));


This should be :
s/work_list->max_lat/work_list->max_lat_at/


 }

 static int pid_cmp(struct work_atoms *l, struct work_atoms *r)
@@ -1402,6 +1403,7 @@ static int map_switch_event(struct perf_sched *sched, 
struct perf_evsel *evsel,
int cpus_nr;
bool new_cpu = false;
const char *color = PERF_COLOR_NORMAL;
+   char buf[32];

BUG_ON(this_cpu >= MAX_CPUS || this_cpu < 0);

@@ -1492,7 +1494,8 @@ static int map_switch_event(struct perf_sched *sched, 
struct perf_evsel *evsel,
if (sched->map.cpus && !cpu_map__has(sched->map.cpus, this_cpu))
goto out;

-   color_fprintf(stdout, color, "  %12.6f secs ", (double)timestamp / 
NSEC_PER_SEC);
+   color_fprintf(stdout, color, "  %12s secs ",
+ timestamp_in_usec(buf, sizeof(buf), timestamp));
if (new_shortname || (verbose && sched_in->tid)) {
const char *pid_color = color;

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 7228d141a789..c848c74bdc90 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -437,7 +437,6 @@ static void print_sample_start(struct perf_sample *sample,
 {
struct perf_event_attr *attr = &evsel->attr;
unsigned long secs;
-   unsigned long usecs;
unsigned long long nsecs;

if (PRINT_FIELD(COMM)) {
@@ -464,14 +463,18 @@ static void print_sample_start(struct perf_sample *sample,
}

if (PRINT_FIELD(TIME)) {
+   char buf[32];
+   size_t sz = sizeof(buf);
+
nsecs = sample->time;
secs = nsecs / NSEC_PER_SEC;
nsecs -= secs * NSEC_PER_SEC;
-   usecs = nsecs / NSEC_PER_USEC;
+
if (nanosecs)
printf("%5lu.%09llu: ", secs, nsecs);
else
-   printf("%5lu.%06lu: ", secs, usecs);
+   printf("%12s: ", timestamp_in_usec(buf, sz,
+  sample->time));
}
 }

diff --git a/tools/perf/util/util.c b/tools/perf/util/util.c
index 85c56800f17a..aa3e778989ce 100644
--- a/tools/perf/util/util.c
+++ b/tools/perf/util/util.c
@@ -433,6 +433,15 @@ int parse_nsec_time(const char *str, u64 *ptime)
return 0;
 }

+char *timestamp_in_usec(char *buf, size_t sz, u64 timestamp)


I agree with Jirka.  timestamp_usec__scnprintf looks better.

Thanks,
Joonwoo


+{
+   u64  sec = timestamp / NSEC_PER_SEC;
+   u64 usec = (timestamp % NSEC_PER_SEC) / NSEC_PER_USEC;
+
+   scnprintf(buf, sz, "%"PRIu64".%06"PRIu64, sec, usec);
+   return buf;
+}
+
 unsigned long parse_tag_value(const char *str, struct parse_tag *tags)
 {
struct parse_tag *i = tags;
diff --git a/tools/perf/util/util.h b/tools/perf/util/util.h
index 71b6992f1d98..ece974f1c538 100644
--- a/tools/perf/util/util.h
+++ b/tools/perf/util/util.h
@@ -362,4 +362,7 @@ extern int sched_getcpu(void);
 #endif
 
 int is_printable_array(char *p, unsigned int len);
+
+char *timestamp_in_usec(char *buf, size_t sz, u64 timestamp);
+
 #endif /* GIT_COMPAT_UTIL_H */

Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-19 Thread Joonwoo Park
On Wed, Oct 19, 2016 at 04:33:03PM +0100, Dietmar Eggemann wrote:
> On 19/10/16 12:25, Vincent Guittot wrote:
> > On 19 October 2016 at 11:46, Dietmar Eggemann <dietmar.eggem...@arm.com> 
> > wrote:
> >> On 18/10/16 12:56, Vincent Guittot wrote:
> >>> On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:
> >>>> On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:
> >>>>> On 18 October 2016 at 11:07, Peter Zijlstra <pet...@infradead.org> 
> >>>>> wrote:
> 
> [...]
> 
> >> But this test only makes sure that we don't see any ghost contribution
> >> (from non-existing cpus) any more.
> >>
> >> We should study the tg->se[i]->avg.load_avg for the hierarchy of tg's
> >> (with the highest tg having a task enqueued) a little bit more, with and
> >> without your v5 'sched: reflect sched_entity move into task_group's load'.
> > 
> > Can you elaborate ?
> 
> I try :-)
> 
> I thought I will see some different behaviour because of the fact that
> the tg se's are initialized differently [1024 versus 0].

This is the exact thing I was also worried about, and that's the reason I
tried to fix this in a different way.
However, I didn't find any behaviour difference once a task is attached to
the child cfs_rq, which is the point we really care about.

I found this bug while making the patch at https://lkml.org/lkml/2016/10/18/841,
which fails with a wrong task_group load_avg.
I tested Vincent's patch and the above together and confirmed it's still good.

I know Ingo has already sent out the pull request, though.  Anyway:

Tested-by: Joonwoo Park <joonw...@codeaurora.org>

Thanks,
Joonwoo

> 
> But I can't spot any difference. The test case is running a sysbench
> thread affine to cpu1 in tg_root/tg_1/tg_11/tg_111 on tip/sched/core on
> an ARM64 Juno (6 logical cpus).
> The moment the sysbench task is put into tg_111
> tg_111->se[1]->avg.load_avg gets updated to 0 any way because of the
> huge time difference between creating this tg and attaching a task to
> it. So the tg->se[2]->avg.load_avg signals for tg_111, tg_11 and tg_1
> look exactly the same w/o and w/ your patch.
> 
> But your patch helps in this (very synthetic) test case as well. W/o
> your patch I see remaining tg->load_avg for tg_1 and tg_11 after the
> test case has finished because the tg's were exclusively used on cpu1.
> 
> # cat /proc/sched_debug
> 
>  cfs_rq[1]:/tg_1
>.tg_load_avg_contrib   : 0
>.tg_load_avg   : 5120 (5 (unused cpus) * 1024 * 1)
>  cfs_rq[1]:/tg_1/tg_11/tg_111
>.tg_load_avg_contrib   : 0
>.tg_load_avg   : 0
>  cfs_rq[1]:/tg_1/tg_11
>.tg_load_avg_contrib   : 0
>.tg_load_avg   : 5120
> 
> With your patch applied all the .tg_load_avg are 0.


[PATCH 2/2] sched/fair: avoid unnecessary hrtick rearm when it's possible

2016-10-18 Thread Joonwoo Park
At present, the scheduler always rearms the hrtick when nr_running is low
enough to matter.  We can also skip rearming the hrtick when the newly
enqueued or dequeued task is in a different cgroup than the current
task's, because as long as the enqueue/dequeue didn't change nr_running
of the current task's cfs_rq, the current task's slice won't change.

Find the topmost ancestor cfs_rq which has changed since the
enqueue/dequeue and rearm the hrtick only when the current task is under
that cfs_rq.

A modified hackbench which creates sender and receiver groups in separate
parent and child cgroups showed a 13% reduction in hrtick rearms with
this optimization, whereas there was no measurable throughput degradation
when both sender and receiver groups are in the same cgroup.

Cc: Ingo Molnar <mi...@redhat.com> 
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---
 kernel/sched/fair.c | 67 +++--
 1 file changed, 60 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f465448..2eb091b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4495,19 +4495,72 @@ static void hrtick_start_fair(struct rq *rq, struct 
task_struct *p)
}
 }
 
+#ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * Find whether slice of 'p' has changed since 'pnew' enqueued or dequeued.
+ */
+static bool task_needs_new_slice(struct task_struct *p,
+struct task_struct *pnew)
+{
+   struct cfs_rq *cfs_rq;
+   struct sched_entity *se;
+
+   SCHED_WARN_ON(task_cpu(p) != task_cpu(pnew));
+
+   se = &pnew->se;
+   for_each_sched_entity(se) {
+   cfs_rq = cfs_rq_of(se);
+   /*
+* We can stop walking up hierarchy at the ancestor level
+* which has more than 1 nr_running because enqueue/dequeue
+* of new task won't affect cfs_rq's task_group load_avg from
+* that level through the root cfs_rq.
+*/
+   if (cfs_rq->nr_running > 1)
+   break;
+   }
+
+   /*
+* The new task enqueue/dequeue ended up adding or removing of se in
+* the root cfs_rq.  All the ses now have new slice.
+*/
+   if (!se)
+   return true;
+
+   se = &p->se;
+   for_each_sched_entity(se) {
+   /*
+* All the ses under 'cfs_rq' now have new slice.  Find if
+* 'cfs_rq' is ancestor of 'p'.
+*/
+   if (cfs_rq == cfs_rq_of(se))
+   return true;
+   }
+
+   return false;
+}
+#else /* !CONFIG_FAIR_GROUP_SCHED */
+static inline bool
+task_needs_new_slice(struct task_struct *p, struct task_struct *pnew)
+{
+   return true;
+}
+#endif
+
 /*
  * called from enqueue/dequeue and updates the hrtick when the
- * current task is from our class and nr_running is low enough
- * to matter.
+ * current task is from our class, nr_running is low enough to matter and
+ * current task's slice can be changed by enqueue or dequeue of 'p'.
  */
-static void hrtick_update(struct rq *rq)
+static void hrtick_update(struct rq *rq, struct task_struct *p)
 {
struct task_struct *curr = rq->curr;
 
if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
return;
 
-   if (cfs_rq_of(&curr->se)->nr_running <= sched_nr_latency)
+   if (task_cfs_rq(curr)->nr_running <= sched_nr_latency &&
+   task_needs_new_slice(curr, p))
hrtick_start_fair(rq, curr);
 }
 #else /* !CONFIG_SCHED_HRTICK */
@@ -4516,7 +4569,7 @@ hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
 }
 
-static inline void hrtick_update(struct rq *rq)
+static inline void hrtick_update(struct rq *rq, struct task_struct *p)
 {
 }
 #endif
@@ -4573,7 +4626,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
int flags)
if (!se)
add_nr_running(rq, 1);
 
-   hrtick_update(rq);
+   hrtick_update(rq, p);
 }
 
 static void set_next_buddy(struct sched_entity *se);
@@ -4632,7 +4685,7 @@ static void dequeue_task_fair(struct rq *rq, struct 
task_struct *p, int flags)
if (!se)
sub_nr_running(rq, 1);
 
-   hrtick_update(rq);
+   hrtick_update(rq, p);
 }
 
 #ifdef CONFIG_SMP
-- 
2.9.3



[PATCH 1/2] sched/fair: rearm hrtick timer when it's senseful

2016-10-18 Thread Joonwoo Park
When a new cfs task is enqueued, the fair scheduler rearms the current
cfs task's hrtick expiration time with a decreased slice.  But the slice
eventually stops decreasing once it reaches sched_min_granularity.
Consequently the cfs scheduler also stops rearming the hrtick timer
because the hrtick expiration time for the current cfs task doesn't
change.  This is a legitimate optimization, but there is a subtle error
in the 'if' condition, so at present the cfs scheduler stops rearming the
hrtick timer earlier than ideal, which causes a subtle unfairness.
 When sched_latency = 6ms, sched_min_granularity = 0.75ms,
 sched_nr_latency = 8 :

   nr_run  slice    period
   1       6.00 ms  6 ms
   2       3.00 ms  6 ms
   3       2.00 ms  6 ms
   4       1.50 ms  6 ms
   5       1.20 ms  6 ms
   6       1.00 ms  6 ms
   7       0.85 ms  6 ms
   8       0.75 ms  6 ms
   9+      0.75 ms  6.75 ms

The first time sched_slice becomes equal to sched_min_granularity is when
the cfs_rq's nr_running becomes equal to sched_nr_latency.  Fix the
condition to rearm the hrtick for nr_running up to and including
sched_nr_latency.
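
A small userspace sketch that reproduces the table above, assuming
equal-weight tasks so that sched_slice() reduces to period / nr_running,
and using __sched_period()'s rule that the period stretches to
nr_running * sched_min_granularity once nr_running exceeds
sched_nr_latency (printed values are rounded; the table truncates):

#include <stdio.h>

#define SCHED_LATENCY_MS	6.0
#define MIN_GRANULARITY_MS	0.75
#define SCHED_NR_LATENCY	8

static double period_ms(unsigned int nr_running)	/* __sched_period() */
{
	return nr_running > SCHED_NR_LATENCY ?
	       nr_running * MIN_GRANULARITY_MS : SCHED_LATENCY_MS;
}

int main(void)
{
	for (unsigned int nr = 1; nr <= 9; nr++)
		printf("nr_run %u: slice %.2f ms, period %.2f ms\n",
		       nr, period_ms(nr) / nr, period_ms(nr));
	return 0;
}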

Cc: Ingo Molnar <mi...@redhat.com> 
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 71c08a8..f465448 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4507,7 +4507,7 @@ static void hrtick_update(struct rq *rq)
if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
return;
 
-   if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
+   if (cfs_rq_of(&curr->se)->nr_running <= sched_nr_latency)
hrtick_start_fair(rq, curr);
 }
 #else /* !CONFIG_SCHED_HRTICK */
-- 
2.9.3



Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes

2016-10-18 Thread Joonwoo Park



On 10/18/2016 04:56 AM, Vincent Guittot wrote:

On Tuesday 18 Oct 2016 at 12:34:12 (+0200), Peter Zijlstra wrote:

On Tue, Oct 18, 2016 at 11:45:48AM +0200, Vincent Guittot wrote:

On 18 October 2016 at 11:07, Peter Zijlstra  wrote:

So aside from funny BIOSes, this should also show up when creating
cgroups when you have offlined a few CPUs, which is far more common I'd
think.


The problem is also that the load of the tg->se[cpu] that represents
the tg->cfs_rq[cpu] is initialized to 1024 in:
alloc_fair_sched_group
 for_each_possible_cpu(i) {
 init_entity_runnable_average(se);
sa->load_avg = scale_load_down(se->load.weight);

Initializing  sa->load_avg to 1024 for a newly created task makes
sense as we don't know yet what will be its real load but i'm not sure
that we have to do the same for se that represents a task group. This
load should be initialized to 0 and it will increase when task will be
moved/attached into task group


Yes, I think that makes sense, not sure how horrible that is with the


That should not be that bad because this initial value is only useful for
the few dozens of ms that follow the creation of the task group



current state of things, but after your propagate patch, that
reinstates the interactivity hack that should work for sure.


The patch below fixes the issue on my platform:

Dietmar, Omer can you confirm that this fix the problem of your platform too ?


I just noticed this thread after posting 
https://lkml.org/lkml/2016/10/18/719...
I noticed this bug a while ago and have had the patch above for at least 
a week, but unfortunately didn't have time to post it...
I think Omer had the same problem I was trying to fix, and I believe the 
patch I posted should address it.


Vincent, your version fixes my test case as well.
Below is the sched_stat output from the same test case I had in my changelog.
Note that dd-2030, which is in the root cgroup, had the same runtime as 
dd-2033, which is in the child cgroup.


dd (2030, #threads: 1)
---
se.exec_start:275700.024137
se.vruntime  : 10589.114654
se.sum_exec_runtime  :  1576.837993
se.nr_migrations :0
nr_switches  :  159
nr_voluntary_switches:0
nr_involuntary_switches  :  159
se.load.weight   :  1048576
se.avg.load_sum  : 48840575
se.avg.util_sum  : 19741820
se.avg.load_avg  : 1022
se.avg.util_avg  :  413
se.avg.last_update_time  : 275700024137
policy   :0
prio :  120
clock-delta  :   34
dd (2033, #threads: 1)
---
se.exec_start:275710.037178
se.vruntime  :  2383.802868
se.sum_exec_runtime  :  1576.547591
se.nr_migrations :0
nr_switches  :  162
nr_voluntary_switches:0
nr_involuntary_switches  :  162
se.load.weight   :  1048576
se.avg.load_sum  : 48316646
se.avg.util_sum  : 21235249
se.avg.load_avg  : 1011
se.avg.util_avg  :  444
se.avg.last_update_time  : 275710037178
policy   :0
prio :  120
clock-delta  :   36

Thanks,
Joonwoo



---
 kernel/sched/fair.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b03fb5..89776ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -690,7 +690,14 @@ void init_entity_runnable_average(struct sched_entity *se)
 * will definitely be update (after enqueue).
 */
sa->period_contrib = 1023;
-   sa->load_avg = scale_load_down(se->load.weight);
+   /*
+* Tasks are intialized with full load to be seen as heavy task until
+* they get a chance to stabilize to their real load level.
+* group entity are 

Re: [PATCH] sched/fair: fix fairness problems among the tasks in different cgroups

2016-10-18 Thread Joonwoo Park



On 10/18/2016 02:37 PM, Peter Zijlstra wrote:


Have you read this thread:

 lkml.kernel.org/r/20161018115651.ga20...@linaro.org


Yeah... I noticed the thread... I'm replying to the thread too.

Thanks,
Joonwoo





[PATCH] sched/fair: fix fairness problems among the tasks in different cgroups

2016-10-18 Thread Joonwoo Park
When a new cgroup is created, the scheduler attaches the child cgroup
to its parent and also increases the parent's task_group load_avg to
account for the increased load, via the following path:

  sched_create_group()
    alloc_fair_sched_group()
  sched_online_group()
    online_fair_sched_group()
      for_each_possible_cpu()
        post_init_entity_util_avg()
          update_tg_load_avg()

However, the parent's load_avg is shared by all CPUs, hence it gets
increased once per possible CPU.  For example, when there are 8 possible
CPUs (even if only 1 CPU is left online after hotplugging the others out),
making empty cgroups /grp1 and /grp1/grp11 leads each task_group's
load_avg to be 8092 and 1024 respectively, whereas the desired load_avg
for both cgroups is 1024, which is what happens when booting with 1 CPU at present.

Such an incorrect load_avg accounting causes quite steep unfairness
to the tasks when they are in different cgroups.
With a scenario when online cpus = 1, possible cpus = 4 and 2 cpu
bound tasks are running but each runs on the parent and the child
cgroup :

  # echo 0 > /sys/devices/system/cpu/cpu1/online
  # echo 0 > /sys/devices/system/cpu/cpu2/online
  # echo 0 > /sys/devices/system/cpu/cpu3/online
  # cat /sys/devices/system/cpu/online
  0
  # mkdir /sys/fs/cgroup/grp1
  # dd if=/dev/zero of=/dev/null &
  # echo $! > /sys/fs/cgroup/tasks
  # dd if=/dev/zero of=/dev/null &
  # echo $! > /sys/fs/cgroup/grp1/tasks

After 3 seconds, the task in the root cgroup got 4 times the execution
time of the task in the child cgroup, because the number of possible CPUs
is 4, so the scheduler thinks the root cgroup has 4 times more load than
the child cgroup.

  dd (2029, #threads: 1)
  se.exec_start:562900.460656
  se.sum_exec_runtime  :  2573.175002
  dd (2032, #threads: 1)
  se.exec_start:562900.037152
  se.sum_exec_runtime  :   655.439360

Whereas booting the same system with maxcpus=1 makes both tasks run
evenly.

  dd (1952, #threads: 1)
  se.exec_start: 75660.457449
  se.sum_exec_runtime  :  1754.045078
  dd (1955, #threads: 1)
  se.exec_start: 75680.029689
  se.sum_exec_runtime  :  1768.195390

Fix such fairness problems by updating parent's task group load_avg
only once when a new child cgroup is being created.
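
On kernels with CONFIG_SCHED_DEBUG, the inflated per-group figure can
presumably be observed directly in /proc/sched_debug (field names may vary
by kernel version), e.g.:

  # after 'mkdir /sys/fs/cgroup/grp1' on a system with several possible CPUs
  grep -A 25 'cfs_rq\[0\]:/grp1' /proc/sched_debug | grep tg_load_avg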

Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 9 ++---
 kernel/sched/sched.h | 3 ++-
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1..2cf46aa 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2577,7 +2577,7 @@ void wake_up_new_task(struct task_struct *p)
__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
rq = __task_rq_lock(p, );
-   post_init_entity_util_avg(>se);
+   post_init_entity_util_avg(>se, true);
 
activate_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 502e95a..71c08a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -730,7 +730,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, 
struct sched_entity *s
  * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)
  * if util_avg > util_avg_cap.
  */
-void post_init_entity_util_avg(struct sched_entity *se)
+void post_init_entity_util_avg(struct sched_entity *se, bool update_tg_load)
 {
struct cfs_rq *cfs_rq = cfs_rq_of(se);
struct sched_avg *sa = >avg;
@@ -770,7 +770,8 @@ void post_init_entity_util_avg(struct sched_entity *se)
 
update_cfs_rq_load_avg(now, cfs_rq, false);
attach_entity_load_avg(cfs_rq, se);
-   update_tg_load_avg(cfs_rq, false);
+   if (update_tg_load)
+   update_tg_load_avg(cfs_rq, false);
 }
 
 #else /* !CONFIG_SMP */
@@ -8872,15 +8873,17 @@ void online_fair_sched_group(struct task_group *tg)
struct sched_entity *se;
struct rq *rq;
int i;
+   bool update_tg_load = true;
 
for_each_possible_cpu(i) {
rq = cpu_rq(i);
se = tg->se[i];
 
raw_spin_lock_irq(>lock);
-   post_init_entity_util_avg(se);
+   post_init_entity_util_avg(se, update_tg_load);
sync_throttle(tg, i);
raw_spin_unlock_irq(>lock);
+   update_tg_load = false;
}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 055f935..6ab89af 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1356,7 +1356,8 @@ extern void init_dl_task_timer(struct sched_dl_entity *dl_se);


Re: [PATCH] perf sched: kill time stamp discrepancy between script and latency

2016-10-04 Thread Joonwoo Park



On 10/04/2016 05:04 AM, Namhyung Kim wrote:

On Mon, Oct 03, 2016 at 03:04:48PM -0700, Joonwoo Park wrote:



On 09/30/2016 10:15 PM, Namhyung Kim wrote:

Hi Joonwoo,

On Wed, Sep 28, 2016 at 07:25:26PM -0700, Joonwoo Park wrote:

Perf sched latency is handy to find out the maximum sched latency and
the time stamp of the event.  After running sched latency, if a found
latency looks suspicious it's quite reasonable to run perf script
subsequently and search with the time stamp given by perf sched latency
to continue further debugging.  However, at present, it's possible the
time stamp given by perf sched latency cannot be found in the trace
output by perf script because perf sched latency converts the event
time from ns to ms as a double and prints it with printf, which
does banker's rounding, whereas perf script does not.

  For example:

   0x750ff0 [0x80]: event: 9
   
   2 1858303049520 0x750ff0 [0x80]: PERF_RECORD_SAMPLE(IP, 0x1): 15281/15281: 
0x8162a63a period: 1 addr: 0
... thread: hackbench:15281
   

$ perf sched -i perf.data latency | grep hackbench
  hackbench:(401)   +   3539.283 ms |23347 | avg:7.286 ms | 
max:  829.998 ms | max at:   1858.303050 s

$ perf script -i perf.data | grep "1858\.303050"

$ perf script -i perf.data | grep "1858\.303049"
  hackbench 15281 [002]  1858.303049:   sched:sched_switch: 
prev_comm=hackbench prev_pid=15281 prev_prio=120 prev_state=D ==> 
next_comm=hackbench next_pid=15603 next_prio=120

Fix perf latency to print out time stamp without rounding to avoid such
discrepancy.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Arnaldo Carvalho de Melo <a...@kernel.org>
Cc: Alexander Shishkin <alexander.shish...@linux.intel.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Namhyung Kim <namhy...@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---

I was tempted to get rid of all u64 to double casting in the function
output_lat_thread but didn't because there is no data loss as of
today.  Double float gives at least 15 significant decimal digits
precision while the function requires only 14 significant digits precision.

$ python -c "print(len(str(int(0xffffffffffffffff / 1e6))))"
14

 tools/lib/traceevent/event-parse.h |  1 +
 tools/perf/builtin-sched.c | 12 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/tools/lib/traceevent/event-parse.h 
b/tools/lib/traceevent/event-parse.h
index 9ffde37..f42703c 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -174,6 +174,7 @@ struct pevent_plugin_option {

 #define NSECS_PER_SEC  1000000000ULL
 #define NSECS_PER_USEC 1000ULL
+#define MSECS_PER_SEC  1000ULL

 enum format_flags {
FIELD_IS_ARRAY  = 1,
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index afa0576..e5cf51a 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1190,6 +1190,7 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_
int i;
int ret;
u64 avg;
+   u64 max_lat_at_sec, max_lat_at_msec;


Isn't it usec rathen than msec? :)


It's to contain three decimal digits which are msecs when 'max_lat_at' is
expressed in sec.
For example when max_lat_at = 1858303049520 which is 1858.303049518 sec,
max_lat_at_msec is meant to be 303.


???  But didn't you want to print it as 303049?

Looking at the code, the 'max_lat' is a latency which was printed in
msec (%9.3f ms) but the 'max_lat_at' is a timestamp which was printed
in sec (%13.6f s).  This is confusing but they use effectively same
unit (usec) by using different number of digit after the period.



I must admit variable's name is bit misleading.  Maybe just secs, msecs are
better?
Also just noticed u64 isn't needed for msecs.  Will size down.


IIUC you wanted usec for timestamp rather than msec, aren't you?


Ugh.  I was a dummy.  My bad.  Will fix this.









if (!work_list->nb_atoms)
return;
@@ -1212,11 +1213,18 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_

avg = work_list->total_lat / work_list->nb_atoms;

-   printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: 
%13.6f s\n",
+   /*
+* Avoid round up with printf to prevent event time discrepency
+* between sched script and latency.
+*/
+   max_lat_at_sec = work_list->max_lat_at / NSECS_PER_SEC;
+   max_lat_at_msec = (work_list->max_lat_at -
+  max_lat_at_sec * NSECS_PER_SEC) / MSECS_PER_SEC;
+   printf("+%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: 
%6lu.%06lu s\n",


Maybe you'd better to be in sync with the script code:

if (PRINT_FIELD(TIME)) {
nsecs = sample->time;
secs = nsecs / NSECS_PER_SEC;
nsecs -= secs * NSECS_PER_SEC;
usecs = nsecs / NSECS_PER_USEC;
if (nanosecs)
printf("%5lu.%09llu: ", secs, nsecs);
else
printf("%5lu.%06lu: ", secs, usecs);
}


Re: [PATCH] perf sched: kill time stamp discrepancy between script and latency

2016-10-03 Thread Joonwoo Park



On 09/30/2016 10:15 PM, Namhyung Kim wrote:

Hi Joonwoo,

On Wed, Sep 28, 2016 at 07:25:26PM -0700, Joonwoo Park wrote:

Perf sched latency is handy to find out the maximum sched latency and
the time stamp of the event.  After running sched latency, if a found
latency looks suspicious it's quite reasonable to run perf script
subsequently and search with the time stamp given by perf sched latency
to continue further debugging.  However, at present, it's possible the
time stamp given by perf sched latency cannot be found in the trace
output by perf script because perf sched latency converts the event
time from ns to ms as a double and prints it with printf, which
does banker's rounding, whereas perf script does not.

  For example:

   0x750ff0 [0x80]: event: 9
   
   2 1858303049520 0x750ff0 [0x80]: PERF_RECORD_SAMPLE(IP, 0x1): 15281/15281: 
0x8162a63a period: 1 addr: 0
... thread: hackbench:15281
   

$ perf sched -i perf.data latency | grep hackbench
  hackbench:(401)   +   3539.283 ms |23347 | avg:7.286 ms | 
max:  829.998 ms | max at:   1858.303050 s

$ perf script -i perf.data | grep "1858\.303050"

$ perf script -i perf.data | grep "1858\.303049"
  hackbench 15281 [002]  1858.303049:   sched:sched_switch: 
prev_comm=hackbench prev_pid=15281 prev_prio=120 prev_state=D ==> 
next_comm=hackbench next_pid=15603 next_prio=120

Fix perf latency to print out time stamp without rounding to avoid such
discrepancy.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Arnaldo Carvalho de Melo <a...@kernel.org>
Cc: Alexander Shishkin <alexander.shish...@linux.intel.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Namhyung Kim <namhy...@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---

I was tempted to get rid of all u64 to double casting in the function
output_lat_thread but didn't because there is no data loss as of
today.  Double float gives at least 15 significant decimal digits
precision while the function requires only 14 significant digits precision.

$ python -c "print(len(str(int(0xffffffffffffffff / 1e6))))"
14

 tools/lib/traceevent/event-parse.h |  1 +
 tools/perf/builtin-sched.c | 12 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/tools/lib/traceevent/event-parse.h 
b/tools/lib/traceevent/event-parse.h
index 9ffde37..f42703c 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -174,6 +174,7 @@ struct pevent_plugin_option {

 #define NSECS_PER_SEC  1000000000ULL
 #define NSECS_PER_USEC 1000ULL
+#define MSECS_PER_SEC  1000ULL

 enum format_flags {
FIELD_IS_ARRAY  = 1,
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index afa0576..e5cf51a 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1190,6 +1190,7 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_
int i;
int ret;
u64 avg;
+   u64 max_lat_at_sec, max_lat_at_msec;


Isn't it usec rathen than msec? :)


It's to contain three decimal digits which are msecs when 'max_lat_at' 
is expressed in sec.
For example when max_lat_at = 1858303049520 which is 1858.303049518 
sec, max_lat_at_msec is meant to be 303.


I must admit variable's name is bit misleading.  Maybe just secs, msecs 
are better?

Also just noticed u64 isn't needed for msecs.  Will size down.





if (!work_list->nb_atoms)
return;
@@ -1212,11 +1213,18 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_

avg = work_list->total_lat / work_list->nb_atoms;

-   printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: 
%13.6f s\n",
+   /*
+* Avoid round up with printf to prevent event time discrepency
+* between sched script and latency.
+*/
+   max_lat_at_sec = work_list->max_lat_at / NSECS_PER_SEC;
+   max_lat_at_msec = (work_list->max_lat_at -
+  max_lat_at_sec * NSECS_PER_SEC) / MSECS_PER_SEC;
+   printf("+%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: 
%6lu.%06lu s\n",


Maybe you'd better to be in sync with the script code:

if (PRINT_FIELD(TIME)) {
nsecs = sample->time;
secs = nsecs / NSECS_PER_SEC;
nsecs -= secs * NSECS_PER_SEC;
usecs = nsecs / NSECS_PER_USEC;
if (nanosecs)
printf("%5lu.%09llu: ", secs, nsecs);
else
printf("%5lu.%06lu: ", secs, usecs);
}


Apart from variable name, I'm not quite sure what to sync because sched 
doesn't print in nsecs.

Maybe you just wanted the variable names in sync rather than the logic?

Thanks!
Joonwoo


Re: [PATCH] perf sched: kill time stamp discrepancy between script and latency

2016-09-28 Thread Joonwoo Park



On 09/28/2016 07:25 PM, Joonwoo Park wrote:

Perf sched latency is handy to find out the maximum sched latency and
the time stamp of the event.  After running sched latency, if a found
latency looks suspicious it's quite reasonable to run perf script
subsequently and search with the time stamp given by perf sched latency
to continue further debugging.  However, at present, it's possible the
time stamp given by perf sched latency cannot be found in the trace
output by perf script because perf sched latency converts the event
time from ns to ms as a double and prints it with printf, which
does banker's rounding, whereas perf script does not.

  For example:

   0x750ff0 [0x80]: event: 9
   
   2 1858303049520 0x750ff0 [0x80]: PERF_RECORD_SAMPLE(IP, 0x1): 15281/15281: 
0x8162a63a period: 1 addr: 0
... thread: hackbench:15281
   

$ perf sched -i perf.data latency | grep hackbench
  hackbench:(401)   +   3539.283 ms |23347 | avg:7.286 ms | 
max:  829.998 ms | max at:   1858.303050 s

$ perf script -i perf.data | grep "1858\.303050"

$ perf script -i perf.data | grep "1858\.303049"
  hackbench 15281 [002]  1858.303049:   sched:sched_switch: 
prev_comm=hackbench prev_pid=15281 prev_prio=120 prev_state=D ==> 
next_comm=hackbench next_pid=15603 next_prio=120

Fix perf latency to print out time stamp without rounding to avoid such
discrepancy.

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Arnaldo Carvalho de Melo <a...@kernel.org>
Cc: Alexander Shishkin <alexander.shish...@linux.intel.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Namhyung Kim <namhy...@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---

I was tempted to get rid of all u64 to double casting in the function
output_lat_thread but didn't because there is no data loss as of
today.  Double float gives at least 15 significant decimal digits
precision while the function requires only 14 significant digits precision.

$ python -c "print(len(str(int(0xffffffffffffffff / 1e6))))"
14

 tools/lib/traceevent/event-parse.h |  1 +
 tools/perf/builtin-sched.c | 12 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/tools/lib/traceevent/event-parse.h 
b/tools/lib/traceevent/event-parse.h
index 9ffde37..f42703c 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -174,6 +174,7 @@ struct pevent_plugin_option {

 #define NSECS_PER_SEC  1000000000ULL
 #define NSECS_PER_USEC 1000ULL
+#define MSECS_PER_SEC  1000ULL

 enum format_flags {
FIELD_IS_ARRAY  = 1,
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index afa0576..e5cf51a 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1190,6 +1190,7 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_
int i;
int ret;
u64 avg;
+   u64 max_lat_at_sec, max_lat_at_msec;

if (!work_list->nb_atoms)
return;
@@ -1212,11 +1213,18 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_

avg = work_list->total_lat / work_list->nb_atoms;

-   printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: 
%13.6f s\n",
+   /*
+* Avoid round up with printf to prevent event time discrepency


s/discrepency/discrepancy/
I will fix this typo after gathering review comments.

Thanks,
Joonwoo


+* between sched script and latency.
+*/
+   max_lat_at_sec = work_list->max_lat_at / NSECS_PER_SEC;
+   max_lat_at_msec = (work_list->max_lat_at -
+  max_lat_at_sec * NSECS_PER_SEC) / MSECS_PER_SEC;
+   printf("+%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: 
%6lu.%06lu s\n",
  (double)work_list->total_runtime / 1e6,
 work_list->nb_atoms, (double)avg / 1e6,
 (double)work_list->max_lat / 1e6,
-(double)work_list->max_lat_at / 1e9);
+max_lat_at_sec, max_lat_at_msec);
 }

 static int pid_cmp(struct work_atoms *l, struct work_atoms *r)



[PATCH] perf sched: kill time stamp discrepancy between script and latency

2016-09-28 Thread Joonwoo Park
Perf sched latency is handy to find out the maximum sched latency and
the time stamp of the event.  After running sched latency, if a found
latency looks suspicious it's quite reasonable to run perf script
subsequently and search with the time stamp given by perf sched latency
to continue further debugging.  However, at present, it's possible the
time stamp given by perf sched latency cannot be found in the trace
output by perf script because perf sched latency converts the event
time from ns to ms as a double and prints it with printf, which
does banker's rounding, whereas perf script does not.

  For example:

   0x750ff0 [0x80]: event: 9
   
   2 1858303049520 0x750ff0 [0x80]: PERF_RECORD_SAMPLE(IP, 0x1): 15281/15281: 
0x8162a63a period: 1 addr: 0
... thread: hackbench:15281
   

$ perf sched -i perf.data latency | grep hackbench
  hackbench:(401)   +   3539.283 ms |23347 | avg:7.286 ms | 
max:  829.998 ms | max at:   1858.303050 s

$ perf script -i perf.data | grep "1858\.303050"

$ perf script -i perf.data | grep "1858\.303049"
  hackbench 15281 [002]  1858.303049:   sched:sched_switch: 
prev_comm=hackbench prev_pid=15281 prev_prio=120 prev_state=D ==> 
next_comm=hackbench next_pid=15603 next_prio=120

Fix perf latency to print out time stamp without rounding to avoid such
discrepancy.
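
A standalone illustration (not perf code) of the mismatch, using the timestamp
from the example above: the double path rounds the sixth decimal up, while the
integer split truncates, matching what perf script prints.

  #include <stdio.h>
  #include <inttypes.h>

  int main(void)
  {
          uint64_t ns = 1858303049520ULL;               /* timestamp from the example */
          uint64_t secs = ns / 1000000000ULL;
          uint64_t usecs = (ns % 1000000000ULL) / 1000ULL;

          printf("double : %13.6f s\n", (double)ns / 1e9);                 /* 1858.303050 */
          printf("integer: %6" PRIu64 ".%06" PRIu64 " s\n", secs, usecs);  /* 1858.303049 */
          return 0;
  }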

Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Arnaldo Carvalho de Melo <a...@kernel.org>
Cc: Alexander Shishkin <alexander.shish...@linux.intel.com>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Namhyung Kim <namhy...@kernel.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---

I was tempted to get rid of all u64 to double casting in the function
output_lat_thread but didn't because there is no data loss as of
today.  Double float gives at least 15 significant decimal digits
precision while the function requires only 14 significant digits precision.

$ python -c "print(len(str(int(0xffffffffffffffff / 1e6))))"
14

 tools/lib/traceevent/event-parse.h |  1 +
 tools/perf/builtin-sched.c | 12 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/tools/lib/traceevent/event-parse.h 
b/tools/lib/traceevent/event-parse.h
index 9ffde37..f42703c 100644
--- a/tools/lib/traceevent/event-parse.h
+++ b/tools/lib/traceevent/event-parse.h
@@ -174,6 +174,7 @@ struct pevent_plugin_option {
 
 #define NSECS_PER_SEC  1000000000ULL
 #define NSECS_PER_USEC 1000ULL
+#define MSECS_PER_SEC  1000ULL
 
 enum format_flags {
FIELD_IS_ARRAY  = 1,
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index afa0576..e5cf51a 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -1190,6 +1190,7 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_
int i;
int ret;
u64 avg;
+   u64 max_lat_at_sec, max_lat_at_msec;
 
if (!work_list->nb_atoms)
return;
@@ -1212,11 +1213,18 @@ static void output_lat_thread(struct perf_sched *sched, 
struct work_atoms *work_
 
avg = work_list->total_lat / work_list->nb_atoms;
 
-   printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max 
at: %13.6f s\n",
+   /*
+* Avoid round up with printf to prevent event time discrepency
+* between sched script and latency.
+*/
+   max_lat_at_sec = work_list->max_lat_at / NSECS_PER_SEC;
+   max_lat_at_msec = (work_list->max_lat_at -
+  max_lat_at_sec * NSECS_PER_SEC) / MSECS_PER_SEC;
+   printf("+%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max 
at: %6lu.%06lu s\n",
  (double)work_list->total_runtime / 1e6,
 work_list->nb_atoms, (double)avg / 1e6,
 (double)work_list->max_lat / 1e6,
-(double)work_list->max_lat_at / 1e9);
+max_lat_at_sec, max_lat_at_msec);
 }
 
 static int pid_cmp(struct work_atoms *l, struct work_atoms *r)
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation



[PATCH] regulator: core: don't return error with inadequate reason

2016-09-19 Thread Joonwoo Park
drms_uA_update() always returns failure when it cannot find the regulator's
input voltage.  But if the hardware supports load configuration with
ops->set_load() and the input regulator isn't specified for a valid reason,
such as the input supply being a battery, not finding the input voltage is
normal, so such a case should not return an error.

Avoid such inadequate error return by checking input/output voltages
only when drms_uA_update() is about to configure load with enum based
ops->set_mode().
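
For context, drms_uA_update() runs on the consumer regulator_set_load() path.
A hedged consumer-side sketch (supply name and load value are made up for
illustration) of the call that should no longer fail just because the input
supply is a battery:

  #include <linux/device.h>
  #include <linux/err.h>
  #include <linux/regulator/consumer.h>

  static int example_enable_vdd(struct device *dev)
  {
          struct regulator *vdd;
          int ret;

          vdd = devm_regulator_get(dev, "vdd");     /* "vdd" is an illustrative supply name */
          if (IS_ERR(vdd))
                  return PTR_ERR(vdd);

          /* Load hint; with ops->set_load() this should work even when the
           * input voltage is unknown (e.g. battery-fed). */
          ret = regulator_set_load(vdd, 150000);    /* 150 mA, illustrative value */
          if (ret)
                  return ret;

          return regulator_enable(vdd);
  }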

Cc: Liam Girdwood <lgirdw...@gmail.com>
Cc: Mark Brown <broo...@kernel.org>
Cc: Bjorn Andersson <bjorn.anders...@linaro.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---
I think we can get rid of ops->get_optimum_mode since *most* of regulator 
drivers' get_optimum_mode ops are dead code.
The only driver I found, at least, which has a valid valid_modes_mask with a proper 
ops->get_optimum_mode is wm8350-regulator.c, used by mach-mx31ads.c.
But I believe the issue I'm fixing here is slightly off from the matter of 
removing ops->get_optimum_mode.

 drivers/regulator/core.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/regulator/core.c b/drivers/regulator/core.c
index 1eef2ec..50cb7b9 100644
--- a/drivers/regulator/core.c
+++ b/drivers/regulator/core.c
@@ -680,24 +680,6 @@ static int drms_uA_update(struct regulator_dev *rdev)
!rdev->desc->ops->set_load)
return -EINVAL;
 
-   /* get output voltage */
-   output_uV = _regulator_get_voltage(rdev);
-   if (output_uV <= 0) {
-   rdev_err(rdev, "invalid output voltage found\n");
-   return -EINVAL;
-   }
-
-   /* get input voltage */
-   input_uV = 0;
-   if (rdev->supply)
-   input_uV = regulator_get_voltage(rdev->supply);
-   if (input_uV <= 0)
-   input_uV = rdev->constraints->input_uV;
-   if (input_uV <= 0) {
-   rdev_err(rdev, "invalid input voltage found\n");
-   return -EINVAL;
-   }
-
/* calc total requested load */
list_for_each_entry(sibling, >consumer_list, list)
current_uA += sibling->uA_load;
@@ -710,6 +692,24 @@ static int drms_uA_update(struct regulator_dev *rdev)
if (err < 0)
rdev_err(rdev, "failed to set load %d\n", current_uA);
} else {
+   /* get output voltage */
+   output_uV = _regulator_get_voltage(rdev);
+   if (output_uV <= 0) {
+   rdev_err(rdev, "invalid output voltage found\n");
+   return -EINVAL;
+   }
+
+   /* get input voltage */
+   input_uV = 0;
+   if (rdev->supply)
+   input_uV = regulator_get_voltage(rdev->supply);
+   if (input_uV <= 0)
+   input_uV = rdev->constraints->input_uV;
+   if (input_uV <= 0) {
+   rdev_err(rdev, "invalid input voltage found\n");
+   return -EINVAL;
+   }
+
/* now get the optimum mode for our new total regulator load */
mode = rdev->desc->ops->get_optimum_mode(rdev, input_uV,
 output_uV, current_uA);
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation



Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-19 Thread Joonwoo Park
On Mon, Sep 19, 2016 at 11:04:49AM -0700, Joonwoo Park wrote:
> On Mon, Sep 19, 2016 at 10:21:58AM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 16, 2016 at 06:28:51PM -0700, Joonwoo Park wrote:
> > > From: Srivatsa Vaddagiri <va...@codeaurora.org>
> > > 
> > > SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
> > 
> > Right, but I always found the overhead of the thing too high to be
> > really useful.
> > 
> > How come you're using this?
> 
> This patch was in our internal tree for decades so I unfortunately cannot
> find actual usecase or history.
> But I guess it was about excessive latency when there are number of CPU
> bound tasks running on a CPU but on different cfs_rqs and CONFIG_HZ = 100.
> 
> See how I recreated :
> 
> * run 4 cpu hogs on the same cgroup [1] :
>  dd-960   [000] d..3   110.651060: sched_switch: prev_comm=dd prev_pid=960 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=959 next_prio=120
>  dd-959   [000] d..3   110.652566: sched_switch: prev_comm=dd prev_pid=959 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=961 next_prio=120
>  dd-961   [000] d..3   110.654072: sched_switch: prev_comm=dd prev_pid=961 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=962 next_prio=120
>  dd-962   [000] d..3   110.655578: sched_switch: prev_comm=dd prev_pid=962 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=960 next_prio=120
>   preempt every 1.5ms slice by hrtick.
> 
> * run 4 CPU hogs on 4 different cgroups [2] :
>  dd-964   [000] d..324.169873: sched_switch: prev_comm=dd prev_pid=964 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=966 next_prio=120
>  dd-966   [000] d..324.179873: sched_switch: prev_comm=dd prev_pid=966 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=965 next_prio=120
>  dd-965   [000] d..324.189873: sched_switch: prev_comm=dd prev_pid=965 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=967 next_prio=120
>  dd-967   [000] d..324.199873: sched_switch: prev_comm=dd prev_pid=967 
> prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=964 next_prio=120
>   preempt every 10ms by scheduler tick so that all tasks suffer from 40ms 
> preemption latency.
> 
> [1] : 
>  dd if=/dev/zero of=/dev/zero &
Ugh..  of=/dev/null instead.

>  dd if=/dev/zero of=/dev/zero &
>  dd if=/dev/zero of=/dev/zero &
>  dd if=/dev/zero of=/dev/zero &
> 
> [2] :
>  mount -t cgroup -o cpu cpu /sys/fs/cgroup
>  mkdir /sys/fs/cgroup/grp1
>  mkdir /sys/fs/cgroup/grp2
>  mkdir /sys/fs/cgroup/grp3
>  mkdir /sys/fs/cgroup/grp4
>  dd if=/dev/zero of=/dev/zero &
>  echo $! > /sys/fs/cgroup/grp1/tasks 
>  dd if=/dev/zero of=/dev/zero &
>  echo $! > /sys/fs/cgroup/grp2/tasks 
>  dd if=/dev/zero of=/dev/zero &
>  echo $! > /sys/fs/cgroup/grp3/tasks 
>  dd if=/dev/zero of=/dev/zero &
>  echo $! > /sys/fs/cgroup/grp4/tasks 
> 
> I could confirm this patch makes the latter behaves as same as the former in 
> terms of preemption latency.
> 
> > 
> > 
> > >  joonwoop: Do we also need to update or remove if-statement inside
> > >  hrtick_update()?
> > 
> > >  I guess not because hrtick_update() doesn't want to start hrtick when 
> > > cfs_rq
> > >  has large number of nr_running where slice is longer than sched_latency.
> > 
> > Right, you want that to match with whatever sched_slice() does.
> 
> Cool.  Thank you!
> 
> Thanks,
> Joonwoo
> 
> > 
> > > +++ b/kernel/sched/fair.c
> > > @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct 
> > > task_struct *p)
> > >  
> > >   WARN_ON(task_rq(p) != rq);
> > >  
> > > - if (cfs_rq->nr_running > 1) {
> > > + if (rq->cfs.h_nr_running > 1) {
> > >   u64 slice = sched_slice(cfs_rq, se);
> > >   u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > >   s64 delta = slice - ran;
> > 
> > Yeah, that looks right. I don't think I've ever tried hrtick with
> > cgroups enabled...




Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-19 Thread Joonwoo Park
On Sun, Sep 18, 2016 at 07:28:32AM +0800, Wanpeng Li wrote:
> 2016-09-17 9:28 GMT+08:00 Joonwoo Park <joonw...@codeaurora.org>:
> > From: Srivatsa Vaddagiri <va...@codeaurora.org>
> >
> > SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
> > (just when they would have exceeded their ideal_runtime). It makes use
> > of a per-cpu hrtimer resource and hence alarming that hrtimer should
> > be based on total SCHED_FAIR tasks a cpu has across its various cfs_rqs,
> > rather than being based on number of tasks in a particular cfs_rq (as
> > implemented currently). As a result, with current code, its possible for
> > a running task (which is the sole task in its cfs_rq) to be preempted
> 
> not be preempted much, right?

I don't think so.
By saying 'to be preempted much after its ideal_runtime has elapsed' I
wanted to describe the current suboptimal behaviour.

Thanks,
Joonwoo

> 
> > much after its ideal_runtime has elapsed, resulting in increased latency
> > for tasks in other cfs_rq on same cpu.
> >
> > Fix this by alarming sched hrtimer based on total number of SCHED_FAIR
> > tasks a CPU has across its various cfs_rqs.
> >
> > Cc: Ingo Molnar <mi...@redhat.com>
> > Cc: Peter Zijlstra <pet...@infradead.org>
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Srivatsa Vaddagiri <va...@codeaurora.org>
> > Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
> > ---
> >
> >  joonwoop: Do we also need to update or remove if-statement inside
> >  hrtick_update()?
> >  I guess not because hrtick_update() doesn't want to start hrtick when 
> > cfs_rq
> >  has large number of nr_running where slice is longer than sched_latency.
> >
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4088eed..c55c566 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct 
> > task_struct *p)
> >
> > WARN_ON(task_rq(p) != rq);
> >
> > -   if (cfs_rq->nr_running > 1) {
> > +   if (rq->cfs.h_nr_running > 1) {
> > u64 slice = sched_slice(cfs_rq, se);
> > u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > s64 delta = slice - ran;
> > --
> > The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> > hosted by The Linux Foundation
> >



Re: [PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-19 Thread Joonwoo Park
On Mon, Sep 19, 2016 at 10:21:58AM +0200, Peter Zijlstra wrote:
> On Fri, Sep 16, 2016 at 06:28:51PM -0700, Joonwoo Park wrote:
> > From: Srivatsa Vaddagiri <va...@codeaurora.org>
> > 
> > SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
> 
> Right, but I always found the overhead of the thing too high to be
> really useful.
> 
> How come you're using this?

This patch was in our internal tree for decades, so I unfortunately cannot
find the actual use case or history.
But I guess it was about excessive latency when there are a number of CPU
bound tasks running on a CPU but on different cfs_rqs and CONFIG_HZ = 100.

See how I recreated it:

* run 4 cpu hogs on the same cgroup [1] :
 dd-960   [000] d..3   110.651060: sched_switch: prev_comm=dd prev_pid=960 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=959 next_prio=120
 dd-959   [000] d..3   110.652566: sched_switch: prev_comm=dd prev_pid=959 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=961 next_prio=120
 dd-961   [000] d..3   110.654072: sched_switch: prev_comm=dd prev_pid=961 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=962 next_prio=120
 dd-962   [000] d..3   110.655578: sched_switch: prev_comm=dd prev_pid=962 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=960 next_prio=120
  preempted every 1.5ms slice by hrtick.

* run 4 CPU hogs on 4 different cgroups [2] :
 dd-964   [000] d..324.169873: sched_switch: prev_comm=dd prev_pid=964 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=966 next_prio=120
 dd-966   [000] d..324.179873: sched_switch: prev_comm=dd prev_pid=966 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=965 next_prio=120
 dd-965   [000] d..324.189873: sched_switch: prev_comm=dd prev_pid=965 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=967 next_prio=120
 dd-967   [000] d..324.199873: sched_switch: prev_comm=dd prev_pid=967 
prev_prio=120 prev_state=R+ ==> next_comm=dd next_pid=964 next_prio=120
  preempted every 10ms by the scheduler tick, so all tasks suffer from 40ms 
preemption latency.

[1] : 
 dd if=/dev/zero of=/dev/zero &
 dd if=/dev/zero of=/dev/zero &
 dd if=/dev/zero of=/dev/zero &
 dd if=/dev/zero of=/dev/zero &

[2] :
 mount -t cgroup -o cpu cpu /sys/fs/cgroup
 mkdir /sys/fs/cgroup/grp1
 mkdir /sys/fs/cgroup/grp2
 mkdir /sys/fs/cgroup/grp3
 mkdir /sys/fs/cgroup/grp4
 dd if=/dev/zero of=/dev/zero &
 echo $! > /sys/fs/cgroup/grp1/tasks 
 dd if=/dev/zero of=/dev/zero &
 echo $! > /sys/fs/cgroup/grp2/tasks 
 dd if=/dev/zero of=/dev/zero &
 echo $! > /sys/fs/cgroup/grp3/tasks 
 dd if=/dev/zero of=/dev/zero &
 echo $! > /sys/fs/cgroup/grp4/tasks 

I could confirm this patch makes the latter behave the same as the former in 
terms of preemption latency.

> 
> 
> >  joonwoop: Do we also need to update or remove if-statement inside
> >  hrtick_update()?
> 
> >  I guess not because hrtick_update() doesn't want to start hrtick when 
> > cfs_rq
> >  has large number of nr_running where slice is longer than sched_latency.
> 
> Right, you want that to match with whatever sched_slice() does.

Cool.  Thank you!

Thanks,
Joonwoo

> 
> > +++ b/kernel/sched/fair.c
> > @@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct 
> > task_struct *p)
> >  
> > WARN_ON(task_rq(p) != rq);
> >  
> > -   if (cfs_rq->nr_running > 1) {
> > +   if (rq->cfs.h_nr_running > 1) {
> > u64 slice = sched_slice(cfs_rq, se);
> > u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
> > s64 delta = slice - ran;
> 
> Yeah, that looks right. I don't think I've ever tried hrtick with
> cgroups enabled...

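For readers following the hrtick_update() question above, that helper around v4.8 reads roughly as the sketch below (reconstructed from kernel/sched/fair.c of that era; the exact code may differ by kernel version).  It only re-arms the hrtick while cfs_rq->nr_running is below sched_nr_latency, i.e. while sched_slice() still derives slices from sysctl_sched_latency, which is what "match with whatever sched_slice() does" refers to.

/*
 * Sketch of hrtick_update() circa v4.8 -- not a verbatim copy.
 */
static void hrtick_update(struct rq *rq)
{
	struct task_struct *curr = rq->curr;

	if (!hrtick_enabled(rq) || curr->sched_class != &fair_sched_class)
		return;

	/* Only bother with the hrtick while slices still track sched_latency. */
	if (cfs_rq_of(&curr->se)->nr_running < sched_nr_latency)
		hrtick_start_fair(rq, curr);
}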


[PATCH] sched: Fix SCHED_HRTICK bug leading to late preemption of tasks

2016-09-16 Thread Joonwoo Park
From: Srivatsa Vaddagiri <va...@codeaurora.org>

SCHED_HRTICK feature is useful to preempt SCHED_FAIR tasks on-the-dot
(just when they would have exceeded their ideal_runtime). It makes use
of a per-cpu hrtimer resource, and hence the arming of that hrtimer should
be based on the total number of SCHED_FAIR tasks a cpu has across its various
cfs_rqs, rather than on the number of tasks in a particular cfs_rq (as
implemented currently). As a result, with the current code, it's possible for
a running task (which is the sole task in its cfs_rq) to be preempted
much after its ideal_runtime has elapsed, resulting in increased latency
for tasks in other cfs_rqs on the same cpu.

Fix this by arming the sched hrtimer based on the total number of SCHED_FAIR
tasks a CPU has across its various cfs_rqs.

Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Srivatsa Vaddagiri <va...@codeaurora.org>
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---

 joonwoop: Do we also need to update or remove if-statement inside
 hrtick_update()?
 I guess not, because hrtick_update() doesn't want to start the hrtick when the
 cfs_rq has a large nr_running, where the slice is longer than sched_latency.

 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4088eed..c55c566 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4458,7 +4458,7 @@ static void hrtick_start_fair(struct rq *rq, struct 
task_struct *p)
 
WARN_ON(task_rq(p) != rq);
 
-   if (cfs_rq->nr_running > 1) {
+   if (rq->cfs.h_nr_running > 1) {
u64 slice = sched_slice(cfs_rq, se);
u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
s64 delta = slice - ran;
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

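For readers following the two-line hunk above, hrtick_start_fair() with the change applied reads roughly as below (a sketch based on kernel/sched/fair.c of that era, not the exact committed text).  cfs_rq->nr_running counts only the entities queued on that single cfs_rq, while rq->cfs.h_nr_running counts every SCHED_FAIR task on the CPU across all group cfs_rqs, so the hrtimer is now armed whenever more than one fair task competes for the CPU.  If sysctl_sched_latency is 6ms on the test system (an assumption), sched_slice() hands out 6ms / 4 = 1.5ms per task, matching the 1.5ms switches in the trace earlier in the thread.

static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
{
	struct sched_entity *se = &p->se;
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	WARN_ON(task_rq(p) != rq);

	/* Arm the hrtick whenever more than one fair task runs on this CPU. */
	if (rq->cfs.h_nr_running > 1) {
		u64 slice = sched_slice(cfs_rq, se);
		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
		s64 delta = slice - ran;

		/* Slice already used up: resched now rather than programming a timer. */
		if (delta < 0) {
			if (rq->curr == p)
				resched_curr(rq);
			return;
		}
		hrtick_start(rq, delta);
	}
}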




[PATCH v2] cpuset: handle race between CPU hotplug and cpuset_hotplug_work

2016-09-11 Thread Joonwoo Park
A discrepancy between cpu_online_mask and cpuset's effective_cpus
mask is inevitable during hotplug since cpuset defers updating of
effective_cpus mask using a workqueue, during which time nothing
prevents the system from performing more hotplug operations.  For that reason
guarantee_online_cpus() walks up the cpuset hierarchy until it finds
an intersection under the assumption that top cpuset's effective_cpus
mask intersects with cpu_online_mask even with such a race occurring.

However a sequence of CPU hotplugs can open a time window, during which
none of the effective CPUs in the top cpuset intersect with
cpu_online_mask.

For example when there are 4 possible CPUs 0-3 and only CPU0 is online:

    ===
   cpu_online_mask   top_cpuset.effective_cpus
    ===
   echo 1 > cpu2/online.
   CPU hotplug notifier woke up hotplug work but not yet scheduled.
  [0,2] [0]

   echo 0 > cpu0/online.
   The workqueue is still runnable.
  [2]   [0]
    ===

  Now there is no intersection between cpu_online_mask and
  top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
  this moment can cause following:

   Unable to handle kernel NULL pointer dereference at virtual address 00d0
   [ cut here ]
   Kernel BUG at ffc0001389b0 [verbose debug info unavailable]
   Internal error: Oops - BUG: 9605 [#1] PREEMPT SMP
   Modules linked in:
   CPU: 2 PID: 1420 Comm: taskset Tainted: GW   4.4.8+ #98
   task: ffc06a5c4880 ti: ffc06e124000 task.ti: ffc06e124000
   PC is at guarantee_online_cpus+0x2c/0x58
   LR is at cpuset_cpus_allowed+0x4c/0x6c
   
   Process taskset (pid: 1420, stack limit = 0xffc06e124020)
   Call trace:
   [] guarantee_online_cpus+0x2c/0x58
   [] cpuset_cpus_allowed+0x4c/0x6c
   [] sched_setaffinity+0xc0/0x1ac
   [] SyS_sched_setaffinity+0x98/0xac
   [] el0_svc_naked+0x24/0x28

The top cpuset's effective_cpus are guaranteed to be identical to
cpu_online_mask eventually.  Hence fall back to cpu_online_mask when
there is no intersection between top cpuset's effective_cpus and
cpu_online_mask.

Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
Cc: Li Zefan <lize...@huawei.com>
Cc: Tejun Heo <t...@kernel.org>
Cc: cgro...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: <sta...@vger.kernel.org> # 3.17+
---
 v2: fixed changelog and comment.

 kernel/cpuset.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 73e93e5..27c6d78 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -325,8 +325,7 @@ static struct file_system_type cpuset_fs_type = {
 /*
  * Return in pmask the portion of a cpusets's cpus_allowed that
  * are online.  If none are online, walk up the cpuset hierarchy
- * until we find one that does have some online cpus.  The top
- * cpuset always has some cpus online.
+ * until we find one that does have some online cpus.
  *
  * One way or another, we guarantee to return some non-empty subset
  * of cpu_online_mask.
@@ -335,8 +334,20 @@ static struct file_system_type cpuset_fs_type = {
  */
 static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
 {
-   while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask))
+   while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
cs = parent_cs(cs);
+   if (unlikely(!cs)) {
+   /*
+* The top cpuset doesn't have any online cpu as a
+* consequence of a race between cpuset_hotplug_work
+* and cpu hotplug notifier.  But we know the top
+* cpuset's effective_cpus is on its way to be
+* identical to cpu_online_mask.
+*/
+   cpumask_copy(pmask, cpu_online_mask);
+   return;
+   }
+   }
cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
 }
 
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

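Since the archive flattens the diff indentation, here is guarantee_online_cpus() with the fix above applied, reconstructed from the hunk (a sketch, not a verbatim copy of the file; the comment wording follows the patch):

static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
{
	while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
		cs = parent_cs(cs);
		if (unlikely(!cs)) {
			/*
			 * The top cpuset doesn't have any online cpu as a
			 * consequence of a race between cpuset_hotplug_work
			 * and the cpu hotplug notifier.  But we know the top
			 * cpuset's effective_cpus is on its way to be
			 * identical to cpu_online_mask.
			 */
			cpumask_copy(pmask, cpu_online_mask);
			return;
		}
	}
	cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
}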




Re: [PATCH] cpuset: handle race between CPU hotplug and cpuset_hotplug_work

2016-09-11 Thread Joonwoo Park
On Mon, Sep 12, 2016 at 10:48:31AM +0800, Zefan Li wrote:
> Cc: Tejun
> 
> On 2016/9/9 8:41, Joonwoo Park wrote:
> > Discrepancy between cpu_online_mask and cpuset's effective CPU masks on
> > cpuset hierarchy is inevitable since cpuset defers updating of
> > effective CPU masks with workqueue while nothing prevents system from
> > doing CPU hotplug.  For that reason guarantee_online_cpus() walks up
> > the cpuset hierarchy until it finds intersection under the assumption
> > that top cpuset's effective CPU mask intersects with cpu_online_mask
> > even under such race.
> > 
> > However a sequence of CPU hotplugs can open a time window which is none
> > of effective CPUs in the top cpuset intersects with cpu_online_mask.
> > 
> > For example when there are 4 possible CPUs 0-3 where only CPU0 is online:
> > 
> >     ===
> >cpu_online_mask   top_cpuset.effective_cpus
> >     ===
> >echo 1 > cpu2/online.
> >CPU hotplug notifier woke up hotplug work but not yet scheduled.
> >   [0,2] [0]
> > 
> >echo 0 > cpu0/online.
> >The workqueue is still runnable.
> >   [2]   [0]
> >     ===
> > 
> >   Now there is no intersection between cpu_online_mask and
> >   top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
> >   this moment can cause following:
> > 
> >Unable to handle kernel NULL pointer dereference at virtual address 
> > 00d0
> >[ cut here ]
> >Kernel BUG at ffc0001389b0 [verbose debug info unavailable]
> >Internal error: Oops - BUG: 9605 [#1] PREEMPT SMP
> >Modules linked in:
> >CPU: 2 PID: 1420 Comm: taskset Tainted: GW   4.4.8+ #98
> >task: ffc06a5c4880 ti: ffc06e124000 task.ti: ffc06e124000
> >PC is at guarantee_online_cpus+0x2c/0x58
> >LR is at cpuset_cpus_allowed+0x4c/0x6c
> >
> >Process taskset (pid: 1420, stack limit = 0xffc06e124020)
> >Call trace:
> >[] guarantee_online_cpus+0x2c/0x58
> >[] cpuset_cpus_allowed+0x4c/0x6c
> >[] sched_setaffinity+0xc0/0x1ac
> >[] SyS_sched_setaffinity+0x98/0xac
> >[] el0_svc_naked+0x24/0x28
> > 
> > The top cpuset's effective_cpus are guaranteed to be identical to online
> > CPUs eventually.  Hence fall back to online CPU mask when there is no
> > intersection between top cpuset's effective_cpus and online CPU mask.
> > 
> > Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
> > Cc: Li Zefan <lize...@huawei.com>
> > Cc: cgro...@vger.kernel.org
> > Cc: linux-kernel@vger.kernel.org
> 
> Thanks for fixing this!
> 
> Acked-by: Zefan Li <lize...@huawei.com>
> Cc: <sta...@vger.kernel.org> # 3.17+
> 

Thanks for reviewing.

Shortly I will send v2, which has a few grammar fixes in the changelog.
No code change has been made.

Joonwoo




[PATCH] cpuset: handle race between CPU hotplug and cpuset_hotplug_work

2016-09-08 Thread Joonwoo Park
Discrepancy between cpu_online_mask and cpuset's effective CPU masks on
cpuset hierarchy is inevitable since cpuset defers updating of
effective CPU masks with workqueue while nothing prevents system from
doing CPU hotplug.  For that reason guarantee_online_cpus() walks up
the cpuset hierarchy until it finds intersection under the assumption
that top cpuset's effective CPU mask intersects with cpu_online_mask
even under such race.

However a sequence of CPU hotplugs can open a time window which is none
of effective CPUs in the top cpuset intersects with cpu_online_mask.

For example when there are 4 possible CPUs 0-3 where only CPU0 is online:

    ===
   cpu_online_mask   top_cpuset.effective_cpus
    ===
   echo 1 > cpu2/online.
   CPU hotplug notifier woke up hotplug work but not yet scheduled.
  [0,2] [0]

   echo 0 > cpu0/online.
   The workqueue is still runnable.
  [2]   [0]
    ===

  Now there is no intersection between cpu_online_mask and
  top_cpuset.effective_cpus.  Thus invoking sys_sched_setaffinity() at
  this moment can cause following:

   Unable to handle kernel NULL pointer dereference at virtual address 00d0
   [ cut here ]
   Kernel BUG at ffc0001389b0 [verbose debug info unavailable]
   Internal error: Oops - BUG: 9605 [#1] PREEMPT SMP
   Modules linked in:
   CPU: 2 PID: 1420 Comm: taskset Tainted: GW   4.4.8+ #98
   task: ffc06a5c4880 ti: ffc06e124000 task.ti: ffc06e124000
   PC is at guarantee_online_cpus+0x2c/0x58
   LR is at cpuset_cpus_allowed+0x4c/0x6c
   
   Process taskset (pid: 1420, stack limit = 0xffc06e124020)
   Call trace:
   [] guarantee_online_cpus+0x2c/0x58
   [] cpuset_cpus_allowed+0x4c/0x6c
   [] sched_setaffinity+0xc0/0x1ac
   [] SyS_sched_setaffinity+0x98/0xac
   [] el0_svc_naked+0x24/0x28

The top cpuset's effective_cpus are guaranteed to be identical to online
CPUs eventually.  Hence fall back to online CPU mask when there is no
intersection between top cpuset's effective_cpus and online CPU mask.

Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
Cc: Li Zefan <lize...@huawei.com>
Cc: cgro...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 kernel/cpuset.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index c7fd277..b5d2b73 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -325,8 +325,7 @@ static struct file_system_type cpuset_fs_type = {
 /*
  * Return in pmask the portion of a cpusets's cpus_allowed that
  * are online.  If none are online, walk up the cpuset hierarchy
- * until we find one that does have some online cpus.  The top
- * cpuset always has some cpus online.
+ * until we find one that does have some online cpus.
  *
  * One way or another, we guarantee to return some non-empty subset
  * of cpu_online_mask.
@@ -335,8 +334,20 @@ static struct file_system_type cpuset_fs_type = {
  */
 static void guarantee_online_cpus(struct cpuset *cs, struct cpumask *pmask)
 {
-   while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask))
+   while (!cpumask_intersects(cs->effective_cpus, cpu_online_mask)) {
cs = parent_cs(cs);
+   if (unlikely(!cs)) {
+   /*
+* The top cpuset doesn't have any online cpu in
+* consequence of race between cpuset_hotplug_work
+* and cpu hotplug notifier.  But we know the top
+* cpuset's effective_cpus is on its way to be same
+* with online cpus mask.
+*/
+   cpumask_copy(pmask, cpu_online_mask);
+   return;
+   }
+   }
cpumask_and(pmask, cs->effective_cpus, cpu_online_mask);
 }
 
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation





[tip:sched/core] sched/core: Fix incorrect wait time and wait count statistics

2015-11-23 Thread tip-bot for Joonwoo Park
Commit-ID:  3ea94de15ce9f3a217f6d0a7e9e0f48388902bb7
Gitweb: http://git.kernel.org/tip/3ea94de15ce9f3a217f6d0a7e9e0f48388902bb7
Author: Joonwoo Park 
AuthorDate: Thu, 12 Nov 2015 19:38:54 -0800
Committer:  Ingo Molnar 
CommitDate: Mon, 23 Nov 2015 09:48:17 +0100

sched/core: Fix incorrect wait time and wait count statistics

At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.

Signed-off-by: Joonwoo Park 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: ohau...@codeaurora.org
Link: http://lkml.kernel.org/r/20151113033854.ga4...@codeaurora.org
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 15 ++--
 kernel/sched/fair.c | 67 +
 2 files changed, 60 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..1b7cb5e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1071,8 +1071,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 {
lockdep_assert_held(&rq->lock);
 
-   dequeue_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_MIGRATING;
+   dequeue_task(rq, p, 0);
set_task_cpu(p, new_cpu);
raw_spin_unlock(&rq->lock);
 
@@ -1080,8 +1080,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 
raw_spin_lock(&rq->lock);
BUG_ON(task_cpu(p) != new_cpu);
-   p->on_rq = TASK_ON_RQ_QUEUED;
enqueue_task(rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(rq, p, 0);
 
return rq;
@@ -1274,6 +1274,15 @@ void set_task_cpu(struct task_struct *p, unsigned int 
new_cpu)
WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
!p->on_rq);
 
+   /*
+* Migrating fair class task must have p->on_rq = TASK_ON_RQ_MIGRATING,
+* because schedstat_wait_{start,end} rebase migrating task's wait_start
+* time relying on p->on_rq.
+*/
+   WARN_ON_ONCE(p->state == TASK_RUNNING &&
+p->sched_class == &fair_sched_class &&
+(p->on_rq && !task_on_rq_migrating(p)));
+
 #ifdef CONFIG_LOCKDEP
/*
 * The caller should hold either p->pi_lock or rq->lock, when changing
@@ -1310,9 +1319,11 @@ static void __migrate_swap_task(struct task_struct *p, 
int cpu)
src_rq = task_rq(p);
dst_rq = cpu_rq(cpu);
 
+   p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src_rq, p, 0);
set_task_cpu(p, cpu);
activate_task(dst_rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(dst_rq, p, 0);
} else {
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 95b944e..f7017ad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -738,12 +738,56 @@ static void update_curr_fair(struct rq *rq)
update_curr(cfs_rq_of(&rq->curr->se));
 }
 
+#ifdef CONFIG_SCHEDSTATS
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
+   u64 wait_start = rq_clock(rq_of(cfs_rq));
+
+   if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
+   likely(wait_start > se->statistics.wait_start))
+   wait_start -= se->statistics.wait_start;
+
+   se->statistics.wait_start = wait_start;
 }
 
+static void
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+   struct task_struct *p;
+   u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
+
+   if (entity_is_task(se)) {
+   p = task_of(se);
+   if (task_on_rq_migrating(p)) {
+   /*
+* Preserve migrating task's wait time so wait_start
+* time stamp can be adjusted to accumulate wait time
+* prior to migration.
+*/
+   se->statistics.wait_start = delta;
+   return;
+   }
+   trace_sch

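The archive clips the hunk above in the middle of the new update_stats_wait_end(); for reference, the complete helper in the committed patch reads roughly as follows (reconstructed from the v5 posting further down and the upstream commit, so treat it as a sketch rather than a verbatim copy):

static void
update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	struct task_struct *p;
	u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;

	if (entity_is_task(se)) {
		p = task_of(se);
		if (task_on_rq_migrating(p)) {
			/*
			 * Preserve migrating task's wait time so wait_start
			 * time stamp can be adjusted to accumulate wait time
			 * prior to migration.
			 */
			se->statistics.wait_start = delta;
			return;
		}
		trace_sched_stat_wait(p, delta);
	}

	se->statistics.wait_max = max(se->statistics.wait_max, delta);
	se->statistics.wait_count++;
	se->statistics.wait_sum += delta;
	se->statistics.wait_start = 0;
}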

Re: [PATCH v5] sched: fix incorrect wait time and wait count statistics

2015-11-12 Thread Joonwoo Park
At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.

To: Ingo Molnar 
To: Peter Zijlstra 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park 
---
Changes in v2: 
 * Set p->on_rq = TASK_ON_RQ_MIGRATING while doing migration dequeue/enqueue
   and check whether task's migrating with task_on_rq_migrating().
Changes in v3: 
 * Fixed "WARNING: CPU: 0 PID: 3 at kernel/sched/fair.c:260 
update_stats_wait_end+0x23/0x30()" caught by Intel kernel test robot.
Changes in v4: 
 * Made __migrate_swap_task() to set p->on_rq = TASK_ON_RQ_MIGRATING.
 * Added WARN_ON_ONCE() inside CONFIG_SCHED_DEBUG.
 * Added comments.
 * Cleanup with ifdefy.
Changes in v5:
 * Cleanup update_stats_wait_end().

 kernel/sched/core.c | 15 ++--
 kernel/sched/fair.c | 67 +
 2 files changed, 60 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcd214e..1ddbabc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1069,8 +1069,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 {
lockdep_assert_held(&rq->lock);
 
-   dequeue_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_MIGRATING;
+   dequeue_task(rq, p, 0);
set_task_cpu(p, new_cpu);
raw_spin_unlock(&rq->lock);
 
@@ -1078,8 +1078,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 
raw_spin_lock(&rq->lock);
BUG_ON(task_cpu(p) != new_cpu);
-   p->on_rq = TASK_ON_RQ_QUEUED;
enqueue_task(rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(rq, p, 0);
 
return rq;
@@ -1272,6 +1272,15 @@ void set_task_cpu(struct task_struct *p, unsigned int 
new_cpu)
WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
!p->on_rq);
 
+   /*
+* Migrating fair class task must have p->on_rq = TASK_ON_RQ_MIGRATING,
+* because schedstat_wait_{start,end} rebase migrating task's wait_start
+* time relying on p->on_rq.
+*/
+   WARN_ON_ONCE(p->state == TASK_RUNNING &&
+p->sched_class == &fair_sched_class &&
+(p->on_rq && !task_on_rq_migrating(p)));
+
 #ifdef CONFIG_LOCKDEP
/*
 * The caller should hold either p->pi_lock or rq->lock, when changing
@@ -1308,9 +1317,11 @@ static void __migrate_swap_task(struct task_struct *p, 
int cpu)
src_rq = task_rq(p);
dst_rq = cpu_rq(cpu);
 
+   p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src_rq, p, 0);
set_task_cpu(p, cpu);
activate_task(dst_rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(dst_rq, p, 0);
} else {
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..fc54ecb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -737,12 +737,56 @@ static void update_curr_fair(struct rq *rq)
update_curr(cfs_rq_of(&rq->curr->se));
 }
 
+#ifdef CONFIG_SCHEDSTATS
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
+   u64 wait_start = rq_clock(rq_of(cfs_rq));
+
+   if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
+   likely(wait_start > se->statistics.wait_start))
+   wait_start -= se->statistics.wait_start;
+
+   se->statistics.wait_start = wait_start;
 }
 
+static void
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+   struct task_struct *p;
+   u64 delta = rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start;
+
+   if (entity_is_task(se)) {
+   p = task_of(se);
+   if (task_on_rq_migrating(p)) {
+   /*
+* Preserve migrating task's wait time so wait_start
+* time stamp can be adjusted to accumulate wait time
+* prior to migration.
+*/
+   se->statistics.wait_start = delta;
+   return;
+   


Re: [PATCH v4] sched: fix incorrect wait time and wait count statistics

2015-11-06 Thread Joonwoo Park
On Fri, Nov 06, 2015 at 02:57:49PM +0100, Peter Zijlstra wrote:
> On Tue, Oct 27, 2015 at 09:46:53PM -0700, Joonwoo Park wrote:
> > @@ -1272,6 +1272,15 @@ void set_task_cpu(struct task_struct *p, unsigned 
> > int new_cpu)
> > WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
> > !p->on_rq);
> >  
> > +   /*
> > +* Migrating fair class task must have p->on_rq = TASK_ON_RQ_MIGRATING,
> > +* because schedstat_wait_{start,end} rebase migrating task's wait_start
> > +* time relying on p->on_rq.
> > +*/
> > +   WARN_ON_ONCE(p->state == TASK_RUNNING &&
> > +p->sched_class == &fair_sched_class &&
> > +(p->on_rq && !task_on_rq_migrating(p)));
> > +
> 
> Why do we have to test p->on_rq? Would not ->state == RUNNING imply
> that?
> 

sched_fork() sets p->state = RUNNING before changing the task's cpu.
Please let me know if you have a better idea.

> > +++ b/kernel/sched/fair.c
> > @@ -737,41 +737,69 @@ static void update_curr_fair(struct rq *rq)
> > update_curr(cfs_rq_of(&rq->curr->se));
> >  }
> >  
> > +#ifdef CONFIG_SCHEDSTATS
> >  static inline void
> >  update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > +   u64 wait_start = rq_clock(rq_of(cfs_rq));
> >  
> > +   if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
> > +   likely(wait_start > se->statistics.wait_start))
> > +   wait_start -= se->statistics.wait_start;
> > +
> > +   schedstat_set(se->statistics.wait_start, wait_start);
> >  }
> >  
> >  static void
> >  update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> 
> Since this is now all under CONFIG_SCHEDSTAT, would it not make sense
> to do something like:
> 
>   u64 now = rq_clock(rq_of(cfs_rq));
> 
> to avoid the endless calling of that function?
> 
> Also, for that very same reason; would it not make sense to drop the
> schedstat_set() usage below, that would greatly enhance readability.
> 

Agreed.

> > +   if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) {
> > +   /*
> > +* Preserve migrating task's wait time so wait_start time stamp
> > +* can be adjusted to accumulate wait time prior to migration.
> > +*/
> > +   schedstat_set(se->statistics.wait_start,
> > + rq_clock(rq_of(cfs_rq)) -
> > + se->statistics.wait_start);
> > +   return;
> > +   }
> > +
> > schedstat_set(se->statistics.wait_max, max(se->statistics.wait_max,
> > rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start));
> > schedstat_set(se->statistics.wait_count, se->statistics.wait_count + 1);
> > schedstat_set(se->statistics.wait_sum, se->statistics.wait_sum +
> > rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start);
> > +
> > if (entity_is_task(se)) {
> > trace_sched_stat_wait(task_of(se),
> > rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start);
> > }
> 
> Is there no means of collapsing the two 'entity_is_task()' branches?
> 

Agreed.  Will spin v5 with these cleanups.

Thanks,
Joonwoo

> > schedstat_set(se->statistics.wait_start, 0);
> >  }
> > +#else
> > +static inline void
> > +update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > +{
> > +}
> > +
> > +static inline void
> > +update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > +{
> > +}
> > +#endif

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v4] sched: fix incorrect wait time and wait count statistics

2015-10-27 Thread Joonwoo Park
At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether fair task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.
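
A worked example of the carry-forward, with invented numbers and rq clocks treated
as comparable across cpus purely for illustration:

/*
 *   t=100  task starts waiting on cpu0      wait_start = 100
 *   t=103  dequeued for migration           wait_start = 103 - 100 = 3    (wait so far)
 *   t=105  enqueued on cpu1                 wait_start = 105 - 3   = 102  (rebased stamp)
 *   t=107  task finally runs on cpu1        delta      = 107 - 102 = 5
 *
 * With the fix a single wait of 5 is reported and wait_count is bumped once.
 * Previously the migration ended the wait early: two separate waits of 3 and
 * 2 were reported, understating the longest wait and inflating wait_count.
 */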

To: Ingo Molnar 
To: Peter Zijlstra 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park 
---
Changes in v2: 
 * Set p->on_rq = TASK_ON_RQ_MIGRATING while doing migration dequeue/enqueue
   and check whether task's migrating with task_on_rq_migrating().
Changes in v3: 
 * Fixed "WARNING: CPU: 0 PID: 3 at kernel/sched/fair.c:260 
update_stats_wait_end+0x23/0x30()" caught by Intel kernel test robot.
Changes in v4: 
 * Made __migrate_swap_task() to set p->on_rq = TASK_ON_RQ_MIGRATING.
 * Added WARN_ON_ONCE() inside CONFIG_SCHED_DEBUG.
 * Added comments.
 * Cleanup with ifdefy.

 kernel/sched/core.c | 15 +++--
 kernel/sched/fair.c | 62 ++---
 2 files changed, 58 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcd214e..1ddbabc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1069,8 +1069,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 {
lockdep_assert_held(&rq->lock);
 
-   dequeue_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_MIGRATING;
+   dequeue_task(rq, p, 0);
set_task_cpu(p, new_cpu);
raw_spin_unlock(&rq->lock);
 
@@ -1078,8 +1078,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 
raw_spin_lock(&rq->lock);
BUG_ON(task_cpu(p) != new_cpu);
-   p->on_rq = TASK_ON_RQ_QUEUED;
enqueue_task(rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(rq, p, 0);
 
return rq;
@@ -1272,6 +1272,15 @@ void set_task_cpu(struct task_struct *p, unsigned int 
new_cpu)
WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
!p->on_rq);
 
+   /*
+* Migrating fair class task must have p->on_rq = TASK_ON_RQ_MIGRATING,
+* because schedstat_wait_{start,end} rebase migrating task's wait_start
+* time relying on p->on_rq.
+*/
+   WARN_ON_ONCE(p->state == TASK_RUNNING &&
+p->sched_class == &fair_sched_class &&
+(p->on_rq && !task_on_rq_migrating(p)));
+
 #ifdef CONFIG_LOCKDEP
/*
 * The caller should hold either p->pi_lock or rq->lock, when changing
@@ -1308,9 +1317,11 @@ static void __migrate_swap_task(struct task_struct *p, 
int cpu)
src_rq = task_rq(p);
dst_rq = cpu_rq(cpu);
 
+   p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src_rq, p, 0);
set_task_cpu(p, cpu);
activate_task(dst_rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(dst_rq, p, 0);
} else {
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..ce7e869 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -737,41 +737,69 @@ static void update_curr_fair(struct rq *rq)
update_curr(cfs_rq_of(&rq->curr->se));
 }
 
+#ifdef CONFIG_SCHEDSTATS
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
-}
+   u64 wait_start = rq_clock(rq_of(cfs_rq));
 
-/*
- * Task is being enqueued - update stats:
- */
-static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity 
*se)
-{
-   /*
-* Are we enqueueing a waiting task? (for current tasks
-* a dequeue/enqueue event is a NOP)
-*/
-   if (se != cfs_rq->curr)
-   update_stats_wait_start(cfs_rq, se);
+   if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
+   likely(wait_start > se->statistics.wait_start))
+   wait_start -= se->statistics.wait_start;
+
+   schedstat_set(se->statistics.wait_start, wait_start);
 }
 
 static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+   if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) {
+   /*
+* Preserve migrating task's wait time so wait_start time stamp
+* can be adjusted to accumula

Re: [PATCH] sched: fix incorrect wait time and wait count statistics

2015-10-27 Thread Joonwoo Park
On Tue, Oct 27, 2015 at 01:57:28PM +0100, Peter Zijlstra wrote:
> 
> (Excessive quoting for Olav)
> 
> On Mon, Oct 26, 2015 at 06:44:48PM -0700, Joonwoo Park wrote:
> > On 10/25/2015 03:26 AM, Peter Zijlstra wrote:
> 
> > > Also note that on both sites we also set TASK_ON_RQ_MIGRATING -- albeit
> > > late. Can't you simply set that earlier (and back to QUEUED later) and
> > > test for task_on_rq_migrating() instead of blowing up the fastpath like
> > > you did?
> > > 
> > 
> > Yes it's doable.  I also find it's much simpler.
> 
> > From 98d615d46211a90482a0f9b7204265c54bba8520 Mon Sep 17 00:00:00 2001
> > From: Joonwoo Park 
> > Date: Mon, 26 Oct 2015 16:37:47 -0700
> > Subject: [PATCH v2] sched: fix incorrect wait time and wait count statistics
> > 
> > At present scheduler resets task's wait start timestamp when the task
> > migrates to another rq.  This misleads scheduler itself into reporting
> > less wait time than actual by omitting time spent for waiting prior to
> > migration and also more wait count than actual by counting migration as
> > wait end event which can be seen by trace or /proc/<pid>/sched with
> > CONFIG_SCHEDSTATS=y.
> > 
> > Carry forward migrating task's wait time prior to migration and
> > don't count migration as a wait end event to fix such statistics error.
> > 
> > In order to determine whether task is migrating mark task->on_rq with
> > TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.
> > 
> > To: Ingo Molnar 
> > To: Peter Zijlstra 
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Joonwoo Park 
> > ---
> 
> So now that you rely on TASK_ON_RQ_MIGRATING; I think you missed one
> place that can migrate sched_fair tasks and doesn't set it.
> 
> Olav recently did a patch adding TASK_ON_RQ_MIGRATING to _every_
> migration path, but that is (still) somewhat overkill. With your changes
> we need it for sched_fair though.
> 
> So I think you need to change __migrate_swap_task(), which is used by
> the NUMA scheduling to swap two running tasks.

Oh yes... __migrate_swap_task() can migrate a fair class task, so I should mark it as
TASK_ON_RQ_MIGRATING there as well.
I will fix this in a subsequent patch.
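
A minimal sketch of that fix (the surrounding function body is reconstructed from the
kernel of that era and abridged to the queued branch; the marking itself matches what
the v4 patch elsewhere in this archive adds):

static void __migrate_swap_task(struct task_struct *p, int cpu)
{
	if (task_on_rq_queued(p)) {
		struct rq *src_rq = task_rq(p);
		struct rq *dst_rq = cpu_rq(cpu);

		/* mark the task migrating around the deactivate/activate pair */
		p->on_rq = TASK_ON_RQ_MIGRATING;
		deactivate_task(src_rq, p, 0);
		set_task_cpu(p, cpu);
		activate_task(dst_rq, p, 0);
		p->on_rq = TASK_ON_RQ_QUEUED;
		check_preempt_curr(dst_rq, p, 0);
	}
	/* the !queued branch (task not on a runqueue) is unchanged and omitted */
}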

> 
> Also, it might be prudent to extend the CONFIG_SCHED_DEBUG ifdef in
> set_task_cpu() to test for this new requirement:
> 
>   WARN_ON_ONCE(p->state == TASK_RUNNING &&
>p->class == &fair_sched_class &&
>p->on_rq != TASK_ON_RQ_MIGRATING);
> 

I was going to argue that if we had Olav's change (with the revised marking order),
someone like me wouldn't need to worry about stale on_rq state and probably wouldn't
have made the mistake I made with __migrate_swap_task().
But I don't think I can argue that anymore, since my patch sets TASK_ON_RQ_MIGRATING
on all migration paths at least for fair class tasks, and the macro you suggested
above would enforce the new requirement anyway.

I will add that macro. 

I recall Olav had some other reason for his patch though.

> >  kernel/sched/core.c |  4 ++--
> >  kernel/sched/fair.c | 17 ++---
> >  2 files changed, 16 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bcd214e..d9e4ad5 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1069,8 +1069,8 @@ static struct rq *move_queued_task(struct rq *rq, 
> > struct task_struct *p, int new
> >  {
> > lockdep_assert_held(&rq->lock);
> >  
> > -   dequeue_task(rq, p, 0);
> > p->on_rq = TASK_ON_RQ_MIGRATING;
> > +   dequeue_task(rq, p, 0);
> > set_task_cpu(p, new_cpu);
> > raw_spin_unlock(&rq->lock);
> >  
> > @@ -1078,8 +1078,8 @@ static struct rq *move_queued_task(struct rq *rq, 
> > struct task_struct *p, int new
> >  
> > raw_spin_lock(&rq->lock);
> > BUG_ON(task_cpu(p) != new_cpu);
> > -   p->on_rq = TASK_ON_RQ_QUEUED;
> > enqueue_task(rq, p, 0);
> > +   p->on_rq = TASK_ON_RQ_QUEUED;
> > check_preempt_curr(rq, p, 0);
> >  
> > return rq;
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9a5e60f..7609576 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -740,7 +740,11 @@ static void update_curr_fair(struct rq *rq)
> >  static inline void
> >  update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > -   schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
> > +   schedstat_set(se->statistics.wait_start,

[PATCH v3] sched: fix incorrect wait time and wait count statistics

2015-10-27 Thread Joonwoo Park
At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.

To: Ingo Molnar 
To: Peter Zijlstra 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park 
---
Changes in v2: 
 * Set p->on_rq = TASK_ON_RQ_MIGRATING while doing migration dequeue/enqueue
   and check whether task's migrating with task_on_rq_migrating().
Changes in v3: 
 * Fixed "WARNING: CPU: 0 PID: 3 at kernel/sched/fair.c:260 
update_stats_wait_end+0x23/0x30()" caught by Intel kernel test robot.

 kernel/sched/core.c |  4 ++--
 kernel/sched/fair.c | 27 ++-
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcd214e..d9e4ad5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1069,8 +1069,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 {
lockdep_assert_held(&rq->lock);
 
-   dequeue_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_MIGRATING;
+   dequeue_task(rq, p, 0);
set_task_cpu(p, new_cpu);
raw_spin_unlock(&rq->lock);
 
@@ -1078,8 +1078,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 
raw_spin_lock(&rq->lock);
BUG_ON(task_cpu(p) != new_cpu);
-   p->on_rq = TASK_ON_RQ_QUEUED;
enqueue_task(rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(rq, p, 0);
 
return rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..4c174a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -740,7 +740,11 @@ static void update_curr_fair(struct rq *rq)
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
+   schedstat_set(se->statistics.wait_start,
+   entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
+   likely(rq_clock(rq_of(cfs_rq)) > se->statistics.wait_start) ?
+   rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start :
+   rq_clock(rq_of(cfs_rq)));
 }
 
 /*
@@ -756,22 +760,35 @@ static void update_stats_enqueue(struct cfs_rq *cfs_rq, 
struct sched_entity *se)
update_stats_wait_start(cfs_rq, se);
 }
 
+#ifdef CONFIG_SCHEDSTATS
 static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+   if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) {
+   schedstat_set(se->statistics.wait_start,
+ rq_clock(rq_of(cfs_rq)) -
+ se->statistics.wait_start);
+   return;
+   }
+
schedstat_set(se->statistics.wait_max, max(se->statistics.wait_max,
rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start));
schedstat_set(se->statistics.wait_count, se->statistics.wait_count + 1);
schedstat_set(se->statistics.wait_sum, se->statistics.wait_sum +
rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start);
-#ifdef CONFIG_SCHEDSTATS
+
if (entity_is_task(se)) {
trace_sched_stat_wait(task_of(se),
rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start);
}
-#endif
schedstat_set(se->statistics.wait_start, 0);
 }
+#else
+static inline void
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+#endif
 
 static inline void
 update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -5656,8 +5673,8 @@ static void detach_task(struct task_struct *p, struct 
lb_env *env)
 {
lockdep_assert_held(&env->src_rq->lock);
 
-   deactivate_task(env->src_rq, p, 0);
p->on_rq = TASK_ON_RQ_MIGRATING;
+   deactivate_task(env->src_rq, p, 0);
set_task_cpu(p, env->dst_cpu);
 }
 
@@ -5790,8 +5807,8 @@ static void attach_task(struct rq *rq, struct task_struct 
*p)
lockdep_assert_held(&rq->lock);
 
BUG_ON(task_rq(p) != rq);
-   p->on_rq = TASK_ON_RQ_QUEUED;
activate_task(rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(rq, p, 0);
 }
 
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation

--
To unsubscribe fro

[PATCH v3] sched: fix incorrect wait time and wait count statistics

2015-10-27 Thread Joonwoo Park
At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.

To: Ingo Molnar <mi...@kernel.org>
To: Peter Zijlstra <pet...@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---
Changes in v2: 
 * Set p->on_rq = TASK_ON_RQ_MIGRATING while doing migration dequeue/enqueue
   and check whether task's migrating with task_on_rq_migrating().
Changes in v3: 
 * Fixed "WARNING: CPU: 0 PID: 3 at kernel/sched/fair.c:260 
update_stats_wait_end+0x23/0x30()" caught by Intel kernel test robot.

 kernel/sched/core.c |  4 ++--
 kernel/sched/fair.c | 27 ++-
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcd214e..d9e4ad5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1069,8 +1069,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 {
lockdep_assert_held(&rq->lock);
 
-   dequeue_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_MIGRATING;
+   dequeue_task(rq, p, 0);
set_task_cpu(p, new_cpu);
raw_spin_unlock(&rq->lock);
 
@@ -1078,8 +1078,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 
raw_spin_lock(&rq->lock);
BUG_ON(task_cpu(p) != new_cpu);
-   p->on_rq = TASK_ON_RQ_QUEUED;
enqueue_task(rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(rq, p, 0);
 
return rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..4c174a1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -740,7 +740,11 @@ static void update_curr_fair(struct rq *rq)
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
+   schedstat_set(se->statistics.wait_start,
+   entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
+   likely(rq_clock(rq_of(cfs_rq)) > se->statistics.wait_start) ?
+   rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start :
+   rq_clock(rq_of(cfs_rq)));
 }
 
 /*
@@ -756,22 +760,35 @@ static void update_stats_enqueue(struct cfs_rq *cfs_rq, 
struct sched_entity *se)
update_stats_wait_start(cfs_rq, se);
 }
 
+#ifdef CONFIG_SCHEDSTATS
 static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+   if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) {
+   schedstat_set(se->statistics.wait_start,
+ rq_clock(rq_of(cfs_rq)) -
+ se->statistics.wait_start);
+   return;
+   }
+
schedstat_set(se->statistics.wait_max, max(se->statistics.wait_max,
rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start));
schedstat_set(se->statistics.wait_count, se->statistics.wait_count + 1);
schedstat_set(se->statistics.wait_sum, se->statistics.wait_sum +
rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start);
-#ifdef CONFIG_SCHEDSTATS
+
if (entity_is_task(se)) {
trace_sched_stat_wait(task_of(se),
rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start);
}
-#endif
schedstat_set(se->statistics.wait_start, 0);
 }
+#else
+static inline void
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+{
+}
+#endif
 
 static inline void
 update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -5656,8 +5673,8 @@ static void detach_task(struct task_struct *p, struct 
lb_env *env)
 {
lockdep_assert_held(&env->src_rq->lock);
 
-   deactivate_task(env->src_rq, p, 0);
p->on_rq = TASK_ON_RQ_MIGRATING;
+   deactivate_task(env->src_rq, p, 0);
set_task_cpu(p, env->dst_cpu);
 }
 
@@ -5790,8 +5807,8 @@ static void attach_task(struct rq *rq, struct task_struct 
*p)
lockdep_assert_held(&rq->lock);
 
BUG_ON(task_rq(p) != rq);
-   p->on_rq = TASK_ON_RQ_QUEUED;
activate_task(rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(rq, p, 0);
 }
 
-- 
The Qualcomm Innovation Center, Inc. is a member of 

[PATCH v4] sched: fix incorrect wait time and wait count statistics

2015-10-27 Thread Joonwoo Park
At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether fair task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.

To: Ingo Molnar <mi...@kernel.org>
To: Peter Zijlstra <pet...@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---
Changes in v2: 
 * Set p->on_rq = TASK_ON_RQ_MIGRATING while doing migration dequeue/enqueue
   and check whether task's migrating with task_on_rq_migrating().
Changes in v3: 
 * Fixed "WARNING: CPU: 0 PID: 3 at kernel/sched/fair.c:260 
update_stats_wait_end+0x23/0x30()" caught by Intel kernel test robot.
Changes in v4: 
 * Made __migrate_swap_task() to set p->on_rq = TASK_ON_RQ_MIGRATING.
 * Added WARN_ON_ONCE() inside CONFIG_SCHED_DEBUG.
 * Added comments.
 * Cleanup with ifdefy.

 kernel/sched/core.c | 15 +++--
 kernel/sched/fair.c | 62 ++---
 2 files changed, 58 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcd214e..1ddbabc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1069,8 +1069,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 {
lockdep_assert_held(&rq->lock);
 
-   dequeue_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_MIGRATING;
+   dequeue_task(rq, p, 0);
set_task_cpu(p, new_cpu);
raw_spin_unlock(&rq->lock);
 
@@ -1078,8 +1078,8 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 
raw_spin_lock(&rq->lock);
BUG_ON(task_cpu(p) != new_cpu);
-   p->on_rq = TASK_ON_RQ_QUEUED;
enqueue_task(rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(rq, p, 0);
 
return rq;
@@ -1272,6 +1272,15 @@ void set_task_cpu(struct task_struct *p, unsigned int 
new_cpu)
WARN_ON_ONCE(p->state != TASK_RUNNING && p->state != TASK_WAKING &&
!p->on_rq);
 
+   /*
+* Migrating fair class task must have p->on_rq = TASK_ON_RQ_MIGRATING,
+* because schedstat_wait_{start,end} rebase migrating task's wait_start
+* time relying on p->on_rq.
+*/
+   WARN_ON_ONCE(p->state == TASK_RUNNING &&
+p->sched_class == &fair_sched_class &&
+(p->on_rq && !task_on_rq_migrating(p)));
+
 #ifdef CONFIG_LOCKDEP
/*
 * The caller should hold either p->pi_lock or rq->lock, when changing
@@ -1308,9 +1317,11 @@ static void __migrate_swap_task(struct task_struct *p, 
int cpu)
src_rq = task_rq(p);
dst_rq = cpu_rq(cpu);
 
+   p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src_rq, p, 0);
set_task_cpu(p, cpu);
activate_task(dst_rq, p, 0);
+   p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(dst_rq, p, 0);
} else {
/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..ce7e869 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -737,41 +737,69 @@ static void update_curr_fair(struct rq *rq)
update_curr(cfs_rq_of(&rq->curr->se));
 }
 
+#ifdef CONFIG_SCHEDSTATS
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
-}
+   u64 wait_start = rq_clock(rq_of(cfs_rq));
 
-/*
- * Task is being enqueued - update stats:
- */
-static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity 
*se)
-{
-   /*
-* Are we enqueueing a waiting task? (for current tasks
-* a dequeue/enqueue event is a NOP)
-*/
-   if (se != cfs_rq->curr)
-   update_stats_wait_start(cfs_rq, se);
+   if (entity_is_task(se) && task_on_rq_migrating(task_of(se)) &&
+   likely(wait_start > se->statistics.wait_start))
+   wait_start -= se->statistics.wait_start;
+
+   schedstat_set(se->statistics.wait_start, wait_start);
 }
 
 static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+   if (entity_is_task(se) && task_on_rq_migrating(task_of(se))) {
+   /*
+* Preserve migrating task's wait time 

Re: [PATCH] sched: fix incorrect wait time and wait count statistics

2015-10-27 Thread Joonwoo Park
On Tue, Oct 27, 2015 at 01:57:28PM +0100, Peter Zijlstra wrote:
> 
> (Excessive quoting for Olav)
> 
> On Mon, Oct 26, 2015 at 06:44:48PM -0700, Joonwoo Park wrote:
> > On 10/25/2015 03:26 AM, Peter Zijlstra wrote:
> 
> > > Also note that on both sites we also set TASK_ON_RQ_MIGRATING -- albeit
> > > late. Can't you simply set that earlier (and back to QUEUED later) and
> > > test for task_on_rq_migrating() instead of blowing up the fastpath like
> > > you did?
> > > 
> > 
> > Yes it's doable.  I also find it's much simpler.
> 
> > From 98d615d46211a90482a0f9b7204265c54bba8520 Mon Sep 17 00:00:00 2001
> > From: Joonwoo Park <joonw...@codeaurora.org>
> > Date: Mon, 26 Oct 2015 16:37:47 -0700
> > Subject: [PATCH v2] sched: fix incorrect wait time and wait count statistics
> > 
> > At present scheduler resets task's wait start timestamp when the task
> > migrates to another rq.  This misleads scheduler itself into reporting
> > less wait time than actual by omitting time spent for waiting prior to
> > migration and also more wait count than actual by counting migration as
> > wait end event which can be seen by trace or /proc/<pid>/sched with
> > CONFIG_SCHEDSTATS=y.
> > 
> > Carry forward migrating task's wait time prior to migration and
> > don't count migration as a wait end event to fix such statistics error.
> > 
> > In order to determine whether task is migrating mark task->on_rq with
> > TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.
> > 
> > To: Ingo Molnar <mi...@kernel.org>
> > To: Peter Zijlstra <pet...@infradead.org>
> > Cc: linux-kernel@vger.kernel.org
> > Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
> > ---
> 
> So now that you rely on TASK_ON_RQ_MIGRATING; I think you missed one
> place that can migrate sched_fair tasks and doesn't set it.
> 
> Olav recently did a patch adding TASK_ON_RQ_MIGRATING to _every_
> migration path, but that is (still) somewhat overkill. With your changes
> we need it for sched_fair though.
> 
> So I think you need to change __migrate_swap_task(), which is used by
> the NUMA scheduling to swap two running tasks.

Oh yes... __migrate_swap_task() can migrate a fair class task, so I should mark it as
TASK_ON_RQ_MIGRATING there as well.
I will fix this in a subsequent patch.

> 
> Also, it might be prudent to extend the CONFIG_SCHED_DEBUG ifdef in
> set_task_cpu() to test for this new requirement:
> 
>   WARN_ON_ONCE(p->state == TASK_RUNNING &&
>p->class == &fair_sched_class &&
>p->on_rq != TASK_ON_RQ_MIGRATING);
> 

I was going to argue that if we had Olav's change (with the revised marking order),
someone like me wouldn't need to worry about stale on_rq state and probably wouldn't
have made the mistake I made with __migrate_swap_task().
But I don't think I can argue that anymore, since my patch sets TASK_ON_RQ_MIGRATING
on all migration paths at least for fair class tasks, and the macro you suggested
above would enforce the new requirement anyway.

I will add that macro. 

I recall Olav had some other reason for his patch though.

> >  kernel/sched/core.c |  4 ++--
> >  kernel/sched/fair.c | 17 ++---
> >  2 files changed, 16 insertions(+), 5 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index bcd214e..d9e4ad5 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1069,8 +1069,8 @@ static struct rq *move_queued_task(struct rq *rq, 
> > struct task_struct *p, int new
> >  {
> > lockdep_assert_held(&rq->lock);
> >  
> > -   dequeue_task(rq, p, 0);
> > p->on_rq = TASK_ON_RQ_MIGRATING;
> > +   dequeue_task(rq, p, 0);
> > set_task_cpu(p, new_cpu);
> > raw_spin_unlock(&rq->lock);
> >  
> > @@ -1078,8 +1078,8 @@ static struct rq *move_queued_task(struct rq *rq, 
> > struct task_struct *p, int new
> >  
> > raw_spin_lock(&rq->lock);
> > BUG_ON(task_cpu(p) != new_cpu);
> > -   p->on_rq = TASK_ON_RQ_QUEUED;
> > enqueue_task(rq, p, 0);
> > +   p->on_rq = TASK_ON_RQ_QUEUED;
> > check_preempt_curr(rq, p, 0);
> >  
> > return rq;
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9a5e60f..7609576 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -740,7 +740,11 @@ static void update_curr_fair(struct rq *rq)
> >  static inline void
> >  update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > -   schedstat_se

Re: [PATCH] sched: fix incorrect wait time and wait count statistics

2015-10-26 Thread Joonwoo Park
On 10/25/2015 03:26 AM, Peter Zijlstra wrote:
> On Sat, Oct 24, 2015 at 10:23:14PM -0700, Joonwoo Park wrote:
>> @@ -1069,7 +1069,7 @@ static struct rq *move_queued_task(struct rq *rq, 
>> struct task_struct *p, int new
>>  {
>>  lockdep_assert_held(&rq->lock);
>>  
>> -dequeue_task(rq, p, 0);
>> +dequeue_task(rq, p, DEQUEUE_MIGRATING);
>>  p->on_rq = TASK_ON_RQ_MIGRATING;
>>  set_task_cpu(p, new_cpu);
>>  raw_spin_unlock(&rq->lock);
> 
>> @@ -5656,7 +5671,7 @@ static void detach_task(struct task_struct *p, struct 
>> lb_env *env)
>>  {
>>  lockdep_assert_held(&env->src_rq->lock);
>>  
>> -deactivate_task(env->src_rq, p, 0);
>> +deactivate_task(env->src_rq, p, DEQUEUE_MIGRATING);
>>  p->on_rq = TASK_ON_RQ_MIGRATING;
>>  set_task_cpu(p, env->dst_cpu);
>>  }
> 
> Also note that on both sites we also set TASK_ON_RQ_MIGRATING -- albeit
> late. Can't you simply set that earlier (and back to QUEUED later) and
> test for task_on_rq_migrating() instead of blowing up the fastpath like
> you did?
> 

Yes it's doable.  I also find it's much simpler.
Please find patch v2.  I verified that v2 does the same job as v1 by comparing
sched_stat_wait times against sched_switch - sched_wakeup timestamps.

Thanks,
Joonwoo
>From 98d615d46211a90482a0f9b7204265c54bba8520 Mon Sep 17 00:00:00 2001
From: Joonwoo Park 
Date: Mon, 26 Oct 2015 16:37:47 -0700
Subject: [PATCH v2] sched: fix incorrect wait time and wait count statistics

At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.

To: Ingo Molnar 
To: Peter Zijlstra 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park 
---
Changes in v2: 
 * Set p->on_rq = TASK_ON_RQ_MIGRATING while doing migration dequeue/enqueue
   and check whether task's migrating with task_on_rq_migrating().

 kernel/sched/core.c |  4 ++--
 kernel/sched/fair.c | 17 ++---
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcd214e..d9e4ad5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1069,8 +1069,8 @@ static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new
 {
	lockdep_assert_held(&rq->lock);
 
-	dequeue_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_MIGRATING;
+	dequeue_task(rq, p, 0);
 	set_task_cpu(p, new_cpu);
	raw_spin_unlock(&rq->lock);
 
@@ -1078,8 +1078,8 @@ static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new
 
	raw_spin_lock(&rq->lock);
 	BUG_ON(task_cpu(p) != new_cpu);
-	p->on_rq = TASK_ON_RQ_QUEUED;
 	enqueue_task(rq, p, 0);
+	p->on_rq = TASK_ON_RQ_QUEUED;
 	check_preempt_curr(rq, p, 0);
 
 	return rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..7609576 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -740,7 +740,11 @@ static void update_curr_fair(struct rq *rq)
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
+	schedstat_set(se->statistics.wait_start,
+		task_on_rq_migrating(task_of(se)) &&
+		likely(rq_clock(rq_of(cfs_rq)) > se->statistics.wait_start) ?
+		rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start :
+		rq_clock(rq_of(cfs_rq)));
 }
 
 /*
@@ -759,6 +763,13 @@ static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	if (task_on_rq_migrating(task_of(se))) {
+		schedstat_set(se->statistics.wait_start,
+			  rq_clock(rq_of(cfs_rq)) -
+			  se->statistics.wait_start);
+		return;
+	}
+
 	schedstat_set(se->statistics.wait_max, max(se->statistics.wait_max,
 			rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start));
 	schedstat_set(se->statistics.wait_count, se->statistics.wait_count + 1);
@@ -5656,8 +5667,8 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 {
	lockdep_assert_held(&env->src_rq->lock);
 
-	deactivate_task(env->src_rq, p, 0);
 	p->on_rq = TASK_ON_RQ_MIGRATING;
+	deactivate_task(env->src_rq, p, 0);
 	set_task_cpu(p, env->dst_cpu);
 }
 
@@ -5790,8 +5801,8 @@ static void attach_task(struct rq *rq, str

Re: [PATCH] sched: fix incorrect wait time and wait count statistics

2015-10-26 Thread Joonwoo Park
On 10/25/2015 03:26 AM, Peter Zijlstra wrote:
> On Sat, Oct 24, 2015 at 10:23:14PM -0700, Joonwoo Park wrote:
>> @@ -1069,7 +1069,7 @@ static struct rq *move_queued_task(struct rq *rq, 
>> struct task_struct *p, int new
>>  {
>>  lockdep_assert_held(&rq->lock);
>>  
>> -dequeue_task(rq, p, 0);
>> +dequeue_task(rq, p, DEQUEUE_MIGRATING);
>>  p->on_rq = TASK_ON_RQ_MIGRATING;
>>  set_task_cpu(p, new_cpu);
>>  raw_spin_unlock(&rq->lock);
> 
>> @@ -5656,7 +5671,7 @@ static void detach_task(struct task_struct *p, struct 
>> lb_env *env)
>>  {
>>  lockdep_assert_held(&env->src_rq->lock);
>>  
>> -deactivate_task(env->src_rq, p, 0);
>> +deactivate_task(env->src_rq, p, DEQUEUE_MIGRATING);
>>  p->on_rq = TASK_ON_RQ_MIGRATING;
>>  set_task_cpu(p, env->dst_cpu);
>>  }
> 
> Also note that on both sites we also set TASK_ON_RQ_MIGRATING -- albeit
> late. Can't you simply set that earlier (and back to QUEUED later) and
> test for task_on_rq_migrating() instead of blowing up the fastpath like
> you did?
> 

Yes it's doable.  I also find it's much simpler.
Please find patch v2.  I verified that v2 does the same job as v1 by comparing
sched_stat_wait times against sched_switch - sched_wakeup timestamps.

Thanks,
Joonwoo
>From 98d615d46211a90482a0f9b7204265c54bba8520 Mon Sep 17 00:00:00 2001
From: Joonwoo Park <joonw...@codeaurora.org>
Date: Mon, 26 Oct 2015 16:37:47 -0700
Subject: [PATCH v2] sched: fix incorrect wait time and wait count statistics

At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.

To: Ingo Molnar <mi...@kernel.org>
To: Peter Zijlstra <pet...@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---
Changes in v2: 
 * Set p->on_rq = TASK_ON_RQ_MIGRATING while doing migration dequeue/enqueue
   and check whether task's migrating with task_on_rq_migrating().

 kernel/sched/core.c |  4 ++--
 kernel/sched/fair.c | 17 ++---
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcd214e..d9e4ad5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1069,8 +1069,8 @@ static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new
 {
	lockdep_assert_held(&rq->lock);
 
-	dequeue_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_MIGRATING;
+	dequeue_task(rq, p, 0);
 	set_task_cpu(p, new_cpu);
	raw_spin_unlock(&rq->lock);
 
@@ -1078,8 +1078,8 @@ static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new
 
	raw_spin_lock(&rq->lock);
 	BUG_ON(task_cpu(p) != new_cpu);
-	p->on_rq = TASK_ON_RQ_QUEUED;
 	enqueue_task(rq, p, 0);
+	p->on_rq = TASK_ON_RQ_QUEUED;
 	check_preempt_curr(rq, p, 0);
 
 	return rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..7609576 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -740,7 +740,11 @@ static void update_curr_fair(struct rq *rq)
 static inline void
 update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
+	schedstat_set(se->statistics.wait_start,
+		task_on_rq_migrating(task_of(se)) &&
+		likely(rq_clock(rq_of(cfs_rq)) > se->statistics.wait_start) ?
+		rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start :
+		rq_clock(rq_of(cfs_rq)));
 }
 
 /*
@@ -759,6 +763,13 @@ static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static void
 update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	if (task_on_rq_migrating(task_of(se))) {
+		schedstat_set(se->statistics.wait_start,
+			  rq_clock(rq_of(cfs_rq)) -
+			  se->statistics.wait_start);
+		return;
+	}
+
 	schedstat_set(se->statistics.wait_max, max(se->statistics.wait_max,
 			rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start));
 	schedstat_set(se->statistics.wait_count, se->statistics.wait_count + 1);
@@ -5656,8 +5667,8 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 {
	lockdep_assert_held(&env->src_rq->lock);
 
-	deactivate_task(env->src_rq, p, 0);
 	p->on_rq = TASK_ON_RQ_MIGRATING;
+	deactivate_task(env->src_r

[PATCH] sched: fix incorrect wait time and wait count statistics

2015-10-24 Thread Joonwoo Park
At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and don't
count migration as a wait end event to fix such statistics error.

Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park 
---
 kernel/sched/core.c  |  4 ++--
 kernel/sched/fair.c  | 41 -
 kernel/sched/sched.h |  2 ++
 3 files changed, 32 insertions(+), 15 deletions(-)
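
The kernel/sched/sched.h hunk does not survive in this archive (the diff below is
truncated), but judging from the DEQUEUE_MIGRATING/ENQUEUE_MIGRATING uses in the diff
it presumably adds the two flag definitions, roughly like this (the exact values are
an assumption, not taken from the posting):

/* kernel/sched/sched.h, hypothetical reconstruction of the 2-line hunk */
#define DEQUEUE_MIGRATING	0x02	/* dequeue is part of a cpu migration */
#define ENQUEUE_MIGRATING	0x10	/* enqueue is part of a cpu migration */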

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcd214e..4f20895 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1069,7 +1069,7 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 {
lockdep_assert_held(&rq->lock);
 
-   dequeue_task(rq, p, 0);
+   dequeue_task(rq, p, DEQUEUE_MIGRATING);
p->on_rq = TASK_ON_RQ_MIGRATING;
set_task_cpu(p, new_cpu);
raw_spin_unlock(&rq->lock);
@@ -1079,7 +1079,7 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
raw_spin_lock(&rq->lock);
BUG_ON(task_cpu(p) != new_cpu);
p->on_rq = TASK_ON_RQ_QUEUED;
-   enqueue_task(rq, p, 0);
+   enqueue_task(rq, p, ENQUEUE_MIGRATING);
check_preempt_curr(rq, p, 0);
 
return rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..53ec4d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -738,27 +738,41 @@ static void update_curr_fair(struct rq *rq)
 }
 
 static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se,
+   bool migrating)
 {
-   schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
+   schedstat_set(se->statistics.wait_start,
+   migrating &&
+   likely(rq_clock(rq_of(cfs_rq)) > se->statistics.wait_start) ?
+   rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start :
+   rq_clock(rq_of(cfs_rq)));
 }
 
 /*
  * Task is being enqueued - update stats:
  */
-static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity 
*se)
+static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity 
*se,
+bool migrating)
 {
/*
 * Are we enqueueing a waiting task? (for current tasks
 * a dequeue/enqueue event is a NOP)
 */
if (se != cfs_rq->curr)
-   update_stats_wait_start(cfs_rq, se);
+   update_stats_wait_start(cfs_rq, se, migrating);
 }
 
 static void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ bool migrating)
 {
+   if (migrating) {
+   schedstat_set(se->statistics.wait_start,
+ rq_clock(rq_of(cfs_rq)) -
+ se->statistics.wait_start);
+   return;
+   }
+
schedstat_set(se->statistics.wait_max, max(se->statistics.wait_max,
rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start));
schedstat_set(se->statistics.wait_count, se->statistics.wait_count + 1);
@@ -774,14 +788,15 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
 }
 
 static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se,
+bool migrating)
 {
/*
 * Mark the end of the wait period if dequeueing a
 * waiting task:
 */
if (se != cfs_rq->curr)
-   update_stats_wait_end(cfs_rq, se);
+   update_stats_wait_end(cfs_rq, se, migrating);
 }
 
 /*
@@ -2960,7 +2975,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
enqueue_sleeper(cfs_rq, se);
}
 
-   update_stats_enqueue(cfs_rq, se);
+   update_stats_enqueue(cfs_rq, se, !!(flags & ENQUEUE_MIGRATING));
check_spread(cfs_rq, se);
if (se != cfs_rq->curr)
__enqueue_entity(cfs_rq, se);
@@ -3028,7 +3043,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
update_curr(cfs_rq);
dequeue_entity_load_avg(cfs_rq, se);
 
-   update_stats_dequeue(cfs_rq, se);
+   update_stats_dequeue(cfs_rq, se, !!(flags & DEQUEUE_MIGRATING));
if (flags & DEQUEUE_SLEEP) {
 #ifdef CONFIG_SCHEDSTATS
if (entity_is

[PATCH] sched: fix incorrect wait time and wait count statistics

2015-10-24 Thread Joonwoo Park
At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc/<pid>/sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and don't
count migration as a wait end event to fix such statistics error.

Cc: Ingo Molnar <mi...@kernel.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Joonwoo Park <joonw...@codeaurora.org>
---
 kernel/sched/core.c  |  4 ++--
 kernel/sched/fair.c  | 41 -
 kernel/sched/sched.h |  2 ++
 3 files changed, 32 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bcd214e..4f20895 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1069,7 +1069,7 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
 {
lockdep_assert_held(&rq->lock);
 
-   dequeue_task(rq, p, 0);
+   dequeue_task(rq, p, DEQUEUE_MIGRATING);
p->on_rq = TASK_ON_RQ_MIGRATING;
set_task_cpu(p, new_cpu);
raw_spin_unlock(&rq->lock);
@@ -1079,7 +1079,7 @@ static struct rq *move_queued_task(struct rq *rq, struct 
task_struct *p, int new
raw_spin_lock(&rq->lock);
BUG_ON(task_cpu(p) != new_cpu);
p->on_rq = TASK_ON_RQ_QUEUED;
-   enqueue_task(rq, p, 0);
+   enqueue_task(rq, p, ENQUEUE_MIGRATING);
check_preempt_curr(rq, p, 0);
 
return rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9a5e60f..53ec4d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -738,27 +738,41 @@ static void update_curr_fair(struct rq *rq)
 }
 
 static inline void
-update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_wait_start(struct cfs_rq *cfs_rq, struct sched_entity *se,
+   bool migrating)
 {
-   schedstat_set(se->statistics.wait_start, rq_clock(rq_of(cfs_rq)));
+   schedstat_set(se->statistics.wait_start,
+   migrating &&
+   likely(rq_clock(rq_of(cfs_rq)) > se->statistics.wait_start) ?
+   rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start :
+   rq_clock(rq_of(cfs_rq)));
 }
 
 /*
  * Task is being enqueued - update stats:
  */
-static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity 
*se)
+static void update_stats_enqueue(struct cfs_rq *cfs_rq, struct sched_entity 
*se,
+bool migrating)
 {
/*
 * Are we enqueueing a waiting task? (for current tasks
 * a dequeue/enqueue event is a NOP)
 */
if (se != cfs_rq->curr)
-   update_stats_wait_start(cfs_rq, se);
+   update_stats_wait_start(cfs_rq, se, migrating);
 }
 
 static void
-update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_wait_end(struct cfs_rq *cfs_rq, struct sched_entity *se,
+ bool migrating)
 {
+   if (migrating) {
+   schedstat_set(se->statistics.wait_start,
+ rq_clock(rq_of(cfs_rq)) -
+ se->statistics.wait_start);
+   return;
+   }
+
schedstat_set(se->statistics.wait_max, max(se->statistics.wait_max,
rq_clock(rq_of(cfs_rq)) - se->statistics.wait_start));
schedstat_set(se->statistics.wait_count, se->statistics.wait_count + 1);
@@ -774,14 +788,15 @@ update_stats_wait_end(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
 }
 
 static inline void
-update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_stats_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se,
+bool migrating)
 {
/*
 * Mark the end of the wait period if dequeueing a
 * waiting task:
 */
if (se != cfs_rq->curr)
-   update_stats_wait_end(cfs_rq, se);
+   update_stats_wait_end(cfs_rq, se, migrating);
 }
 
 /*
@@ -2960,7 +2975,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
enqueue_sleeper(cfs_rq, se);
}
 
-   update_stats_enqueue(cfs_rq, se);
+   update_stats_enqueue(cfs_rq, se, !!(flags & ENQUEUE_MIGRATING));
check_spread(cfs_rq, se);
if (se != cfs_rq->curr)
__enqueue_entity(cfs_rq, se);
@@ -3028,7 +3043,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
update_curr(cfs_rq);
dequeue_entity_load_avg(cfs_rq, se);
 
-   update_stats_dequeue(cfs_rq, se);
+   update_stats_dequeue(cfs_rq, se, !!(flags & DEQUEUE_MIGRATING));
  

[tip:timers/core] timer: Use timer->base for flag checks

2015-05-05 Thread tip-bot for Joonwoo Park
Commit-ID:  781978e6e156101209f62b9ebc8783b70ef248de
Gitweb: http://git.kernel.org/tip/781978e6e156101209f62b9ebc8783b70ef248de
Author: Joonwoo Park 
AuthorDate: Mon, 27 Apr 2015 19:21:49 -0700
Committer:  Thomas Gleixner 
CommitDate: Tue, 5 May 2015 10:40:43 +0200

timer: Use timer->base for flag checks

At present, internal_add_timer() examines flags with 'base' which doesn't
contain flags.  Examine with 'timer->base' to avoid unnecessary waking up
of nohz CPU when timer base has TIMER_DEFERRABLE set.

Signed-off-by: Joonwoo Park 
Cc: sb...@codeaurora.org
Cc: skan...@codeaurora.org
Cc: John Stultz 
Link: 
http://lkml.kernel.org/r/1430187709-21087-1-git-send-email-joonw...@codeaurora.org
Signed-off-by: Thomas Gleixner 
---
 kernel/time/timer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 03f926c..d4af7c5 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -436,7 +436,7 @@ static void internal_add_timer(struct tvec_base *base, 
struct timer_list *timer)
 * require special care against races with idle_cpu(), lets deal
 * with that later.
 */
-   if (!tbase_get_deferrable(base) || tick_nohz_full_cpu(base->cpu))
+   if (!tbase_get_deferrable(timer->base) || tick_nohz_full_cpu(base->cpu))
wake_up_nohz_cpu(base->cpu);
 }
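
For context, a sketch of the helper involved, based on the timer code of that era
rather than on this commit: the deferrable flag lives in the low bits of the timer's
base pointer, so it has to be read from timer->base and not from the per-cpu base
that internal_add_timer() receives.

/* sketch of the helper as it looked around this time (context only) */
static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
{
	return ((unsigned int)(unsigned long)base & TIMER_DEFERRABLE);
}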
 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] timer: make deferrable cpu unbound timers really not bound to a cpu

2015-04-27 Thread Joonwoo Park
Thomas, I made some cleanups.  I would much appreciate it if you could give me some
feedback on this.

Thanks,
Joonwoo

On 04/27/2015 07:39 PM, Joonwoo Park wrote:
> When a deferrable work (INIT_DEFERRABLE_WORK, etc.) is queued via
> queue_delayed_work() it's probably intended to run the work item on any
> CPU that isn't idle. However, we queue the work to run at a later time
> by starting a deferrable timer that binds to whatever CPU the work is
> queued on which is same with queue_delayed_work_on(smp_processor_id())
> effectively.
> 
> As a result WORK_CPU_UNBOUND work items aren't really cpu unbound now.
> In fact this is perfectly fine with UP kernel and also won't affect much a
> system without dyntick with SMP kernel too as every cpus run timers
> periodically.  But on SMP systems with dyntick current implementation leads
> deferrable timers not very scalable because the timer's base which has
> queued the deferrable timer won't wake up till next non-deferrable timer
> expires even though there are possible other non idle cpus are running
> which are able to run expired deferrable timers.
> 
> The deferrable work is a good example of the current implementation's
> victim like below.
> 
> INIT_DEFERRABLE_WORK(, fn);
> CPU 0                                 CPU 1
> queue_delayed_work(wq, , HZ);
>   queue_delayed_work_on(WORK_CPU_UNBOUND);
>   ...
>   __mod_timer() -> queues timer to the
>                    current cpu's timer
>                    base.
>   ...
> tick_nohz_idle_enter() -> cpu enters idle.
>                A second later
> cpu 0 is now in idle.                 cpu 1 exits idle or wasn't in idle so
>                                       now it's in active but won't
> cpu 0 won't wake up till next         handle cpu unbound deferrable timer
> non-deferrable timer expires.         as it's in cpu 0's timer base.
> 
> To make all cpu unbound deferrable timers are scalable, introduce a common
> timer base which is only for cpu unbound deferrable timers to make those
> are indeed cpu unbound so that can be scheduled by tick_do_timer_cpu.
> This common timer fixes scalability issue of delayed work and all other cpu
> unbound deferrable timer using implementations.
> 
> CC: Thomas Gleixner 
> CC: John Stultz 
> CC: Tejun Heo 
> Signed-off-by: Joonwoo Park 
> ---
> Changes in v3:
>  * Make only tick_do_timer_cpu to run deferral timer wheel to reduce cache 
> bouncing.
> 
> Changes in v4:
>  * Kill CONFIG_SMP ifdefry.
>  * Allocate and initialize tvec_base_deferrable at compile time.
>  * Pin pinned deferrable timer. 
>  * s/deferral/deferrable/
> 
>  include/linux/timer.h |  14 ++-
>  kernel/time/timer.c   | 103 
> --
>  2 files changed, 97 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/timer.h b/include/linux/timer.h
> index 8c5a197..45847ca 100644
> --- a/include/linux/timer.h
> +++ b/include/linux/timer.h
> @@ -34,6 +34,9 @@ struct timer_list {
>  };
>  
>  extern struct tvec_base boot_tvec_bases;
> +#ifdef CONFIG_SMP
> +extern struct tvec_base tvec_base_deferrable;
> +#endif
>  
>  #ifdef CONFIG_LOCKDEP
>  /*
> @@ -70,12 +73,21 @@ extern struct tvec_base boot_tvec_bases;
>  
>  #define TIMER_FLAG_MASK  0x3LU
>  
> +#ifdef CONFIG_SMP
> +#define __TIMER_BASE(_flags) \
> + ((_flags) & TIMER_DEFERRABLE ? \
> > +  (unsigned long)&tvec_base_deferrable + (_flags) : \
> > +  (unsigned long)&boot_tvec_bases + (_flags))
> +#else
> > +#define __TIMER_BASE(_flags) ((unsigned long)&boot_tvec_bases + (_flags))
> +#endif
> +
>  #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
>   .entry = { .prev = TIMER_ENTRY_STATIC },\
>   .function = (_function),\
>   .expires = (_expires),  \
>   .data = (_data),\
> > - .base = (void *)((unsigned long)&boot_tvec_bases + (_flags)), \
> + .base = (void *)(__TIMER_BASE(_flags)), \
>   .slack = -1,\
>   __TIMER_LOCKDEP_MAP_INITIALIZER(\
>   __FILE__ ":" __stringify(__LINE__)) \
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index e5d5733c..133e94a 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -49,6 +49,8 @@
>  #include 
>  #include 
>  
> +#include "tick-internal.h"
> +
>  #define CREATE_TRACE_POINTS
>  #include 
>  
> @@ -103,6 +105,9 @@ struct tvec_base boot_tvec_bases;
>  EXPORT_SYMBOL(boot_tvec_bases);
> 

[PATCH 2/2] timer: make deferrable cpu unbound timers really not bound to a cpu

2015-04-27 Thread Joonwoo Park
When a deferrable work (INIT_DEFERRABLE_WORK, etc.) is queued via
queue_delayed_work() it's probably intended to run the work item on any
CPU that isn't idle. However, we queue the work to run at a later time
by starting a deferrable timer that binds to whatever CPU the work is
queued on, which is effectively the same as calling
queue_delayed_work_on(smp_processor_id()).
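
To make the scenario concrete, a minimal usage sketch (hypothetical driver code;
poll_work, poll_fn and the one second period are invented for illustration):

#include <linux/init.h>
#include <linux/workqueue.h>

static struct delayed_work poll_work;

static void poll_fn(struct work_struct *work)
{
	/* ... poll some hardware ... */
	queue_delayed_work(system_wq, &poll_work, HZ);	/* re-arm */
}

static int __init poll_init(void)
{
	INIT_DEFERRABLE_WORK(&poll_work, poll_fn);
	queue_delayed_work(system_wq, &poll_work, HZ);
	return 0;
}

The deferrable timer backing poll_work is exactly the kind of cpu unbound timer the
text below is about.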

As a result WORK_CPU_UNBOUND work items aren't really cpu unbound now.
This is perfectly fine on a UP kernel, and it doesn't matter much on an SMP kernel
without dyntick either, since every cpu runs its timers periodically.  But on SMP
systems with dyntick the current implementation makes deferrable timers scale
poorly, because the timer base that queued the deferrable timer won't wake up until
its next non-deferrable timer expires, even though other non-idle cpus may be
running and able to handle the expired deferrable timers.

The deferrable work below is a good example of a victim of the current
implementation.

INIT_DEFERRABLE_WORK(&dwork, fn);
CPU 0                                   CPU 1
queue_delayed_work(wq, &dwork, HZ);
  queue_delayed_work_on(WORK_CPU_UNBOUND);
  ...
    __mod_timer() -> queues timer to the
                     current cpu's timer
                     base.
  ...
tick_nohz_idle_enter() -> cpu enters idle.
A second later
cpu 0 is now in idle.                   cpu 1 exits idle or wasn't in idle so
                                        now it's in active but won't
cpu 0 won't wake up till next           handle cpu unbound deferrable timer
non-deferrable timer expires.           as it's in cpu 0's timer base.

To make cpu unbound deferrable timers scalable, introduce a common timer
base dedicated to cpu unbound deferrable timers, so that they are genuinely
cpu unbound and can be serviced by tick_do_timer_cpu.  This common timer
base fixes the scalability issue for delayed work and for every other user
of cpu unbound deferrable timers.
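
For context, here is a minimal, self-contained sketch of the deferrable
delayed-work pattern the scenario above describes.  The work item name,
polling function and interval are made up for illustration and are not taken
from any driver in this thread; only DECLARE_DEFERRABLE_WORK/
INIT_DEFERRABLE_WORK, queue_delayed_work() and system_wq are the real
workqueue API.

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

static void poll_fn(struct work_struct *work);
/* Deferrable: the backing timer must not wake an idle CPU by itself. */
static DECLARE_DEFERRABLE_WORK(poll_work, poll_fn);

static void poll_fn(struct work_struct *work)
{
	/* ... poll hardware, update statistics, ... then re-arm. */
	queue_delayed_work(system_wq, &poll_work, HZ);
}

static int __init poll_example_init(void)
{
	/*
	 * WORK_CPU_UNBOUND is implied here, but the backing deferrable
	 * timer is still queued on the CPU executing this call - exactly
	 * the behaviour the changelog above complains about.
	 */
	queue_delayed_work(system_wq, &poll_work, HZ);
	return 0;
}

static void __exit poll_example_exit(void)
{
	cancel_delayed_work_sync(&poll_work);
}

module_init(poll_example_init);
module_exit(poll_example_exit);
MODULE_LICENSE("GPL");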

CC: Thomas Gleixner 
CC: John Stultz 
CC: Tejun Heo 
Signed-off-by: Joonwoo Park 
---
Changes in v3:
 * Make only tick_do_timer_cpu to run deferral timer wheel to reduce cache 
bouncing.

Changes in v4:
 * Kill CONFIG_SMP ifdefry.
 * Allocate and initialize tvec_base_deferrable at compile time.
 * Pin pinned deferrable timer. 
 * s/deferral/deferrable/

 include/linux/timer.h |  14 ++-
 kernel/time/timer.c   | 103 --
 2 files changed, 97 insertions(+), 20 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index 8c5a197..45847ca 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -34,6 +34,9 @@ struct timer_list {
 };
 
 extern struct tvec_base boot_tvec_bases;
+#ifdef CONFIG_SMP
+extern struct tvec_base tvec_base_deferrable;
+#endif
 
 #ifdef CONFIG_LOCKDEP
 /*
@@ -70,12 +73,21 @@ extern struct tvec_base boot_tvec_bases;
 
 #define TIMER_FLAG_MASK0x3LU
 
+#ifdef CONFIG_SMP
+#define __TIMER_BASE(_flags) \
+   ((_flags) & TIMER_DEFERRABLE ? \
+(unsigned long)&tvec_base_deferrable + (_flags) : \
+(unsigned long)&boot_tvec_bases + (_flags))
+#else
+#define __TIMER_BASE(_flags) ((unsigned long)&boot_tvec_bases + (_flags))
+#endif
+
 #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
.entry = { .prev = TIMER_ENTRY_STATIC },\
.function = (_function),\
.expires = (_expires),  \
.data = (_data),\
-   .base = (void *)((unsigned long)&boot_tvec_bases + (_flags)), \
+   .base = (void *)(__TIMER_BASE(_flags)), \
.slack = -1,\
__TIMER_LOCKDEP_MAP_INITIALIZER(\
__FILE__ ":" __stringify(__LINE__)) \
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index e5d5733c..133e94a 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -49,6 +49,8 @@
 #include <asm/timex.h>
 #include <asm/io.h>
 
+#include "tick-internal.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/timer.h>
 
@@ -103,6 +105,9 @@ struct tvec_base boot_tvec_bases;
 EXPORT_SYMBOL(boot_tvec_bases);
 
 static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
+#ifdef CONFIG_SMP
+struct tvec_base tvec_base_deferrable;
+#endif
 
 /* Functions below help us manage 'deferrable' flag */
 static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
@@ -662,10 +667,63 @@ static inline void debug_assert_init(struct timer_list 
*timer)
debug_timer_assert_init(timer);
 }
 
+#ifdef CONFIG_SMP
+static inline struct tvec_base *__get_timer_base(unsigned int flags)
+{
+   if (flags & TIMER_DEFERRABLE)
+   return &tvec_base_deferrable;
+   else
+   return raw_cpu_read(tvec_bases);
+}
+
+static inline bool is_deferrable_timer_base(struct tvec_

[PATCH 1/2] timer: avoid unnecessary waking up of nohz CPU

2015-04-27 Thread Joonwoo Park
At present, internal_add_timer() examines flags with 'base' which doesn't
contain flags.  Examine with 'timer->base' to avoid unnecessary waking up
of nohz CPU when timer base has TIMER_DEFERRABLE.

CC: Thomas Gleixner 
CC: John Stultz 
Signed-off-by: Joonwoo Park 
---
 kernel/time/timer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2ece3aa..e5d5733c 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -434,7 +434,7 @@ static void internal_add_timer(struct tvec_base *base, 
struct timer_list *timer)
 * require special care against races with idle_cpu(), lets deal
 * with that later.
 */
-   if (!tbase_get_deferrable(base) || tick_nohz_full_cpu(base->cpu))
+   if (!tbase_get_deferrable(timer->base) || tick_nohz_full_cpu(base->cpu))
wake_up_nohz_cpu(base->cpu);
 }
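
For context on why 'base' cannot simply be tested here: internal_add_timer()
receives a plain struct tvec_base pointer, while the flag bits live in the
low bits of the pointer stored in timer->base.  A simplified sketch of the
helpers involved (modelled on kernel/time/timer.c of that era; treat it as
illustrative rather than a verbatim copy):

#define TIMER_DEFERRABLE	0x1LU

/* Flags are OR'ed into the (aligned) tvec_base pointer kept in the timer. */
static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
{
	return ((unsigned int)(unsigned long)base & TIMER_DEFERRABLE);
}

static inline struct tvec_base *tbase_get_base(struct tvec_base *base)
{
	return ((struct tvec_base *)((unsigned long)base & ~TIMER_FLAG_MASK));
}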
 
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
hosted by The Linux Foundation



Re: [PATCH 2/2] timer: make deferrable cpu unbound timers really not bound to a cpu

2015-04-27 Thread Joonwoo Park
Thomas, I made some cleanups.  I would much appreciate it if you could give
me some feedback on this.

Thanks,
Joonwoo

On 04/27/2015 07:39 PM, Joonwoo Park wrote:
 When a deferrable work (INIT_DEFERRABLE_WORK, etc.) is queued via
 queue_delayed_work() it's probably intended to run the work item on any
 CPU that isn't idle. However, we queue the work to run at a later time
 by starting a deferrable timer that binds to whatever CPU the work is
 queued on which is same with queue_delayed_work_on(smp_processor_id())
 effectively.
 
 As a result WORK_CPU_UNBOUND work items aren't really cpu unbound now.
 In fact this is perfectly fine with UP kernel and also won't affect much a
 system without dyntick with SMP kernel too as every cpus run timers
 periodically.  But on SMP systems with dyntick current implementation leads
 deferrable timers not very scalable because the timer's base which has
 queued the deferrable timer won't wake up till next non-deferrable timer
 expires even though there are possible other non idle cpus are running
 which are able to run expired deferrable timers.
 
 The deferrable work is a good example of the current implementation's
 victim like below.
 
 INIT_DEFERRABLE_WORK(&dwork, fn);
 CPU 0 CPU 1
 queue_delayed_work(wq, &dwork, HZ);
 queue_delayed_work_on(WORK_CPU_UNBOUND);
 ...
   __mod_timer() -> queues timer to the
current cpu's timer
base.
   ...
 tick_nohz_idle_enter() -> cpu enters idle.
 A second later
 cpu 0 is now in idle. cpu 1 exits idle or wasn't in idle so
   now it's in active but won't
 cpu 0 won't wake up till next handle cpu unbound deferrable timer
 non-deferrable timer expires. as it's in cpu 0's timer base.
 
 To make all cpu unbound deferrable timers are scalable, introduce a common
 timer base which is only for cpu unbound deferrable timers to make those
 are indeed cpu unbound so that can be scheduled by tick_do_timer_cpu.
 This common timer fixes scalability issue of delayed work and all other cpu
 unbound deferrable timer using implementations.
 
 CC: Thomas Gleixner t...@linutronix.de
 CC: John Stultz john.stu...@linaro.org
 CC: Tejun Heo t...@kernel.org
 Signed-off-by: Joonwoo Park joonw...@codeaurora.org
 ---
 Changes in v3:
  * Make only tick_do_timer_cpu to run deferral timer wheel to reduce cache 
 bouncing.
 
 Changes in v4:
  * Kill CONFIG_SMP ifdefry.
  * Allocate and initialize tvec_base_deferrable at compile time.
  * Pin pinned deferrable timer. 
  * s/deferral/deferrable/
 
  include/linux/timer.h |  14 ++-
  kernel/time/timer.c   | 103 
 --
  2 files changed, 97 insertions(+), 20 deletions(-)
 
 diff --git a/include/linux/timer.h b/include/linux/timer.h
 index 8c5a197..45847ca 100644
 --- a/include/linux/timer.h
 +++ b/include/linux/timer.h
 @@ -34,6 +34,9 @@ struct timer_list {
  };
  
  extern struct tvec_base boot_tvec_bases;
 +#ifdef CONFIG_SMP
 +extern struct tvec_base tvec_base_deferrable;
 +#endif
  
  #ifdef CONFIG_LOCKDEP
  /*
 @@ -70,12 +73,21 @@ extern struct tvec_base boot_tvec_bases;
  
  #define TIMER_FLAG_MASK  0x3LU
  
 +#ifdef CONFIG_SMP
 +#define __TIMER_BASE(_flags) \
 + ((_flags) & TIMER_DEFERRABLE ? \
 +  (unsigned long)&tvec_base_deferrable + (_flags) : \
 +  (unsigned long)&boot_tvec_bases + (_flags))
 +#else
 +#define __TIMER_BASE(_flags) ((unsigned long)&boot_tvec_bases + (_flags))
 +#endif
 +
  #define __TIMER_INITIALIZER(_function, _expires, _data, _flags) { \
   .entry = { .prev = TIMER_ENTRY_STATIC },\
   .function = (_function),\
   .expires = (_expires),  \
   .data = (_data),\
 - .base = (void *)((unsigned long)&boot_tvec_bases + (_flags)), \
 + .base = (void *)(__TIMER_BASE(_flags)), \
   .slack = -1,\
   __TIMER_LOCKDEP_MAP_INITIALIZER(\
   __FILE__ ":" __stringify(__LINE__)) \
 diff --git a/kernel/time/timer.c b/kernel/time/timer.c
 index e5d5733c..133e94a 100644
 --- a/kernel/time/timer.c
 +++ b/kernel/time/timer.c
 @@ -49,6 +49,8 @@
  #include <asm/timex.h>
  #include <asm/io.h>
  
 +#include "tick-internal.h"
 +
  #define CREATE_TRACE_POINTS
  #include <trace/events/timer.h>
  
 @@ -103,6 +105,9 @@ struct tvec_base boot_tvec_bases;
  EXPORT_SYMBOL(boot_tvec_bases);
  
  static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
 +#ifdef CONFIG_SMP
 +struct tvec_base tvec_base_deferrable;
 +#endif
  
  /* Functions below help us manage 'deferrable' flag */
  static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
 @@ -662,10 +667,63 @@ static inline void debug_assert_init(struct timer_list

Re: [PATCH v2 RESEND/RFC] timer: make deferrable cpu unbound timers really not bound to a cpu

2015-03-30 Thread Joonwoo Park
Hi Thomas,

Please find patch v3, which makes only tick_do_timer_cpu run the deferrable
timer wheel in order to reduce cache bouncing, and let me know what you think.

Thanks,
Joonwoo
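
For archive readers, a rough sketch of what "only tick_do_timer_cpu runs the
deferrable wheel" means in practice.  This is illustrative only and not code
from the patch below (which is truncated in this archive); tick_do_timer_cpu
is the jiffies-keeping CPU exported by tick-internal.h, and 'deferrable_base'
stands in for whatever the common wheel ends up being called.

extern int tick_do_timer_cpu;			/* from tick-internal.h */
static struct tvec_base *deferrable_base;	/* hypothetical shared wheel */

static void run_local_and_deferrable_timers(struct tvec_base *cpu_base)
{
	/* Every CPU still runs its own per-cpu wheel as before. */
	if (time_after_eq(jiffies, cpu_base->timer_jiffies))
		__run_timers(cpu_base);

	/*
	 * Only the jiffies-keeping CPU drains the shared deferrable wheel,
	 * so no idle CPU is woken for it and its cachelines don't bounce
	 * across the whole system.
	 */
	if (tick_do_timer_cpu == smp_processor_id() &&
	    time_after_eq(jiffies, deferrable_base->timer_jiffies))
		__run_timers(deferrable_base);
}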

From 0c91f82a0b43b247f1ed310212ef3aada7ccc9f7 Mon Sep 17 00:00:00 2001
From: Joonwoo Park 
Date: Thu, 11 Sep 2014 15:34:25 -0700
Subject: [PATCH] timer: make deferrable cpu unbound timers really not bound to
 a cpu

When a deferrable work (INIT_DEFERRABLE_WORK, etc.) is queued via
queue_delayed_work() it's probably intended to run the work item on any
CPU that isn't idle. However, we queue the work to run at a later time
by starting a deferrable timer that binds to whatever CPU the work is
queued on which is same with queue_delayed_work_on(smp_processor_id())
effectively.

As a result WORK_CPU_UNBOUND work items aren't really cpu unbound now.
In fact this is perfectly fine with UP kernel and also won't affect much a
system without dyntick with SMP kernel too as every cpus run timers
periodically.  But on SMP systems with dyntick current implementation leads
deferrable timers not very scalable because the timer's base which has
queued the deferrable timer won't wake up till next non-deferrable timer
expires even though there are possible other non idle cpus are running
which are able to run expired deferrable timers.

The deferrable work is a good example of the current implementation's
victim like below.

INIT_DEFERRABLE_WORK(&dwork, fn);
CPU 0 CPU 1
queue_delayed_work(wq, &dwork, HZ);
queue_delayed_work_on(WORK_CPU_UNBOUND);
...
__mod_timer() -> queues timer to the
 current cpu's timer
 base.
...
tick_nohz_idle_enter() -> cpu enters idle.
A second later
cpu 0 is now in idle. cpu 1 exits idle or wasn't in idle so
  now it's in active but won't
cpu 0 won't wake up till next handle cpu unbound deferrable timer
non-deferrable timer expires. as it's in cpu 0's timer base.

To make all cpu unbound deferrable timers are scalable, introduce a common
timer base which is only for cpu unbound deferrable timers to make those
are indeed cpu unbound so that can be scheduled by tick_do_timer_cpu.
This common timer fixes scalability issue of delayed work and all other cpu
unbound deferrable timer using implementations.

CC: Thomas Gleixner 
CC: John Stultz 
CC: Tejun Heo 
Signed-off-by: Joonwoo Park 
---
Changes in v3:
 * Make only tick_do_timer_cpu to run deferral timer wheel to reduce cache 
bouncing.

 kernel/time/timer.c | 94 -
 1 file changed, 71 insertions(+), 23 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2d3f5c5..59306fe 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -49,6 +49,8 @@
 #include <asm/timex.h>
 #include <asm/io.h>
 
+#include "tick-internal.h"
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/timer.h>
 
@@ -93,6 +95,9 @@ struct tvec_base {
 struct tvec_base boot_tvec_bases;
 EXPORT_SYMBOL(boot_tvec_bases);
 static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
+#ifdef CONFIG_SMP
+static struct tvec_base *tvec_base_deferral = &boot_tvec_bases;
+#endif
 
 /* Functions below help us manage 'deferrable' flag */
 static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
@@ -655,7 +660,14 @@ static inline void debug_assert_init(struct timer_list 
*timer)
 static void do_init_timer(struct timer_list *timer, unsigned int flags,
  const char *name, struct lock_class_key *key)
 {
-   struct tvec_base *base = raw_cpu_read(tvec_bases);
+   struct tvec_base *base;
+
+#ifdef CONFIG_SMP
+   if (flags & TIMER_DEFERRABLE)
+   base = tvec_base_deferral;
+   else
+#endif
+   base = raw_cpu_read(tvec_bases);
 
timer->entry.next = NULL;
timer->base = (void *)((unsigned long)base | flags);
@@ -777,26 +789,32 @@ __mod_timer(struct timer_list *timer, unsigned long 
expires,
 
debug_activate(timer, expires);
 
-   cpu = get_nohz_timer_target(pinned);
-   new_base = per_cpu(tvec_bases, cpu);
+#ifdef CONFIG_SMP
+   if (base != tvec_base_deferral) {
+#endif
+   cpu = get_nohz_timer_target(pinned);
+   new_base = per_cpu(tvec_bases, cpu);
 
-   if (base != new_base) {
-   /*
-* We are trying to schedule the timer on the local CPU.
-* However we can't change timer's base while it is running,
-* otherwise del_timer_sync() can't detect that the timer's
-* handler yet has not finished. This also guarantees that
-* the timer is serialized wrt itself.
-*/
-   if (likely(base->running_timer != timer)) {
-   /* See the comment in lock_timer_base() */
-   timer_set_base(timer, NULL);
-

Re: [PATCH v2 RESEND/RFC] timer: make deferrable cpu unbound timers really not bound to a cpu

2014-09-26 Thread Joonwoo Park
On Tue, Sep 23, 2014 at 08:33:34PM +0200, Thomas Gleixner wrote:
> On Mon, 15 Sep 2014, Joonwoo Park wrote:
> > +#ifdef CONFIG_SMP
> > +static struct tvec_base *tvec_base_deferral = &boot_tvec_bases;
> > +#endif
> 
> In principle I like the idea of a deferrable wheel, but this
> implementation is going to go nowhere.
> 
> First of all making it SMP only is silly. The deferrable stuff is a
> pain in other places as well.
> 
> But whats way worse is:
> 
> > +static inline void __run_timers(struct tvec_base *base, bool try)
> >  {
> > struct timer_list *timer;
> >  
> > -   spin_lock_irq(&base->lock);
> > +   if (!try)
> > +   spin_lock_irq(&base->lock);
> > +   else if (!spin_trylock_irq(&base->lock))
> > +   return;
> 
> Yuck. All cpus fighting about a single spinlock roughly at the same
> time? You just created a proper thundering herd cacheline bouncing
> issue. 

Since __run_timers() always uses spin_trylock_irq() for the deferrable wheel,
no cpu would ever spin on it; a cpu simply returns if the spinlock is already
held.

I agree there is a cacheline bouncing issue with the timer base (maybe you're
worried about coherency of the spinlock too?).
The other approach I thought about is to have an active cpu wake up the cpus
that have expired deferrable timers, rather than keeping a global deferrable
wheel.
I didn't go that way because it seems to conflict with the idea of
'deferrable' and consumes more power than the patch I proposed, although
cache prefetching isn't free of power consumption either.
What do you think about this approach?

Alternatively, we could migrate expired deferrable timers from idle cpus to
active cpus, but I doubt that is a good idea since the migration seems
expensive.

> 
> No way. We have already mechanisms in place to deal with such
> problems, you just have to use them.

The problem I'm trying to tackle is the case where a driver needs a deferrable
delayed_work so that its timer never wakes an idle cpu, while still making
sure the deferrable timer gets scheduled on time whenever any cpu is active.

Would you mind shedding some light on the mechanisms you're referring to?
I thought about queueing a cpu bound deferrable timer on every cpu and
cancelling the others once any one of them gets scheduled, but that is
overkill.
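
To picture the "overkill" alternative dismissed above, a rough sketch of
queueing one cpu bound deferrable timer per cpu and cancelling the rest when
one fires.  All names here are hypothetical and the obvious race (two cpus
expiring in the same jiffy) is ignored; init_timer_deferrable(),
add_timer_on() and del_timer() are the real timer APIs of that era.

static DEFINE_PER_CPU(struct timer_list, percpu_poll_timer);

static void percpu_poll_timeout(unsigned long data)
{
	int cpu;

	/* Whichever cpu fires first does the work and cancels the others. */
	for_each_online_cpu(cpu)
		if (cpu != smp_processor_id())
			del_timer(&per_cpu(percpu_poll_timer, cpu));

	/* ... do the periodic work here ... */
}

static void arm_percpu_poll(unsigned long delay)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct timer_list *t = &per_cpu(percpu_poll_timer, cpu);

		init_timer_deferrable(t);
		t->function = percpu_poll_timeout;
		t->data = 0;
		t->expires = jiffies + delay;
		add_timer_on(t, cpu);
	}
}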

Thanks,
Joonwoo

> 
> Thanks,
> 
>   tglx

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation



[PATCH v2 RESEND/RFC] timer: make deferrable cpu unbound timers really not bound to a cpu

2014-09-15 Thread Joonwoo Park
When a deferrable work (INIT_DEFERRABLE_WORK, etc.) is queued via
queue_delayed_work() it's probably intended to run the work item on any
CPU that isn't idle. However, we queue the work to run at a later time
by starting a deferrable timer that binds to whatever CPU the work is
queued on which is same with queue_delayed_work_on(smp_processor_id())
effectively.

As a result WORK_CPU_UNBOUND work items aren't really cpu unbound now.
In fact this is perfectly fine with UP kernel and also won't affect much a
system without dyntick with SMP kernel too as every cpus run timers
periodically.  But on SMP systems with dyntick current implementation leads
deferrable timers not very scalable because the timer's base which has
queued the deferrable timer won't wake up till next non-deferrable timer
expires even though there are possible other non idle cpus are running
which are able to run expired deferrable timers.

The deferrable work is a good example of the current implementation's
victim like below.

INIT_DEFERRABLE_WORK(&dwork, fn);
CPU 0 CPU 1
queue_delayed_work(wq, &dwork, HZ);
queue_delayed_work_on(WORK_CPU_UNBOUND);
...
__mod_timer() -> queues timer to the
 current cpu's timer
 base.
...
tick_nohz_idle_enter() -> cpu enters idle.
A second later
cpu 0 is now in idle. cpu 1 exits idle or wasn't in idle so
  now it's in active but won't
cpu 0 won't wake up till next handle cpu unbound deferrable timer
non-deferrable timer expires. as it's in cpu 0's timer base.

To make all cpu unbound deferrable timers are scalable, introduce a common
timer base which is only for cpu unbound deferrable timers to make those
are indeed cpu unbound so that can be scheduled by any of non idle cpus.
This common timer fixes scalability issue of delayed work and all other cpu
unbound deferrable timer using implementations.
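
To clarify what "scheduled by any of non idle cpus" means in this version:
each cpu's timer softirq also takes a non-blocking shot at the shared
deferrable wheel, roughly as sketched below.  This is an editorial summary of
the __run_timers() change quoted later in the thread, with simplified names,
not a verbatim extract.

/* Called by every cpu in addition to running its own per-cpu wheel. */
static void try_run_deferrable_wheel(struct tvec_base *base)
{
	/* If another cpu is already draining the shared wheel, back off. */
	if (!spin_trylock_irq(&base->lock))
		return;

	while (time_after_eq(jiffies, base->timer_jiffies)) {
		/* ... cascade and run expired deferrable timers ... */
		base->timer_jiffies++;
	}

	spin_unlock_irq(&base->lock);
}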

CC: Thomas Gleixner 
CC: John Stultz 
CC: Tejun Heo 
Signed-off-by: Joonwoo Park 
---
 Changes in v2:
 * Use kzalloc_node()/kzalloc()

 kernel/time/timer.c | 106 +++-
 1 file changed, 80 insertions(+), 26 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index aca5dfe..5313cb0 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -93,6 +93,9 @@ struct tvec_base {
 struct tvec_base boot_tvec_bases;
 EXPORT_SYMBOL(boot_tvec_bases);
 static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
+#ifdef CONFIG_SMP
+static struct tvec_base *tvec_base_deferral = &boot_tvec_bases;
+#endif
 
 /* Functions below help us manage 'deferrable' flag */
 static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
@@ -655,7 +658,14 @@ static inline void debug_assert_init(struct timer_list 
*timer)
 static void do_init_timer(struct timer_list *timer, unsigned int flags,
  const char *name, struct lock_class_key *key)
 {
-   struct tvec_base *base = __raw_get_cpu_var(tvec_bases);
+   struct tvec_base *base;
+
+#ifdef CONFIG_SMP
+   if (flags & TIMER_DEFERRABLE)
+   base = tvec_base_deferral;
+   else
+#endif
+   base = __raw_get_cpu_var(tvec_bases);
 
timer->entry.next = NULL;
timer->base = (void *)((unsigned long)base | flags);
@@ -777,26 +787,32 @@ __mod_timer(struct timer_list *timer, unsigned long 
expires,
 
debug_activate(timer, expires);
 
-   cpu = get_nohz_timer_target(pinned);
-   new_base = per_cpu(tvec_bases, cpu);
+#ifdef CONFIG_SMP
+   if (base != tvec_base_deferral) {
+#endif
+   cpu = get_nohz_timer_target(pinned);
+   new_base = per_cpu(tvec_bases, cpu);
 
-   if (base != new_base) {
-   /*
-* We are trying to schedule the timer on the local CPU.
-* However we can't change timer's base while it is running,
-* otherwise del_timer_sync() can't detect that the timer's
-* handler yet has not finished. This also guarantees that
-* the timer is serialized wrt itself.
-*/
-   if (likely(base->running_timer != timer)) {
-   /* See the comment in lock_timer_base() */
-   timer_set_base(timer, NULL);
-   spin_unlock(&base->lock);
-   base = new_base;
-   spin_lock(&base->lock);
-   timer_set_base(timer, base);
+   if (base != new_base) {
+   /*
+* We are trying to schedule the timer on the local CPU.
+* However we can't change timer's base while it is
+* running, otherwise del_timer_sync() can't detect that
+* the timer's handler yet has not finished. This also
+   

Re: [PATCH/RFC] timer: make deferrable cpu unbound timers really not bound to a cpu

2014-09-11 Thread Joonwoo Park
On Thu, Sep 11, 2014 at 04:56:52PM -0700, Joonwoo Park wrote:
> When a deferrable work (INIT_DEFERRABLE_WORK, etc.) is queued via
> queue_delayed_work() it's probably intended to run the work item on any
> CPU that isn't idle. However, we queue the work to run at a later time
> by starting a deferrable timer that binds to whatever CPU the work is
> queued on which is same with queue_delayed_work_on(smp_processor_id())
> effectively.
> 
> As a result WORK_CPU_UNBOUND work items aren't really cpu unbound now.
> In fact this is perfectly fine with UP kernel and also won't affect much a
> system without dyntick with SMP kernel too as every cpus run timers
> periodically.  But on SMP systems with dyntick current implementation leads
> deferrable timers not very scalable because the timer's base which has
> queued the deferrable timer won't wake up till next non-deferrable timer
> expires even though there are possible other non idle cpus are running
> which are able to run expired deferrable timers.
> 
> The deferrable work is a good example of the current implementation's
> victim like below.
> 
> INIT_DEFERRABLE_WORK(&dwork, fn);
> CPU 0 CPU 1
> queue_delayed_work(wq, &dwork, HZ);
> queue_delayed_work_on(WORK_CPU_UNBOUND);
> ...
>   __mod_timer() -> queues timer to the
>current cpu's timer
>base.
>   ...
> tick_nohz_idle_enter() -> cpu enters idle.
> A second later
> cpu 0 is now in idle. cpu 1 exits idle or wasn't in idle so
>   now it's in active but won't
> cpu 0 won't wake up till next handle cpu unbound deferrable timer
> non-deferrable timer expires. as it's in cpu 0's timer base.
> 
> To make all cpu unbound deferrable timers are scalable, introduce a common
> timer base which is only for cpu unbound deferrable timers to make those
> are indeed cpu unbound so that can be scheduled by any of non idle cpus.
> This common timer fixes scalability issue of delayed work and all other cpu
> unbound deferrable timer using implementations.
> 
> cc: Thomas Gleixner 
> CC: John Stultz 
> CC: Tejun Heo 
> Signed-off-by: Joonwoo Park 
> ---
>  kernel/time/timer.c | 108 
> +++-
>  1 file changed, 82 insertions(+), 26 deletions(-)
> 
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index aca5dfe..655076b 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -93,6 +93,9 @@ struct tvec_base {
>  struct tvec_base boot_tvec_bases;
>  EXPORT_SYMBOL(boot_tvec_bases);
>  static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
> +#ifdef CONFIG_SMP
> +static struct tvec_base *tvec_base_deferral = &boot_tvec_bases;
> +#endif
>  
>  /* Functions below help us manage 'deferrable' flag */
>  static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
> @@ -655,7 +658,14 @@ static inline void debug_assert_init(struct timer_list 
> *timer)
>  static void do_init_timer(struct timer_list *timer, unsigned int flags,
> const char *name, struct lock_class_key *key)
>  {
> - struct tvec_base *base = __raw_get_cpu_var(tvec_bases);
> + struct tvec_base *base;
> +
> +#ifdef CONFIG_SMP
> + if (flags & TIMER_DEFERRABLE)
> + base = tvec_base_deferral;
> + else
> +#endif
> + base = __raw_get_cpu_var(tvec_bases);
>  
>   timer->entry.next = NULL;
>   timer->base = (void *)((unsigned long)base | flags);
> @@ -777,26 +787,32 @@ __mod_timer(struct timer_list *timer, unsigned long 
> expires,
>  
>   debug_activate(timer, expires);
>  
> - cpu = get_nohz_timer_target(pinned);
> - new_base = per_cpu(tvec_bases, cpu);
> +#ifdef CONFIG_SMP
> + if (base != tvec_base_deferral) {
> +#endif
> + cpu = get_nohz_timer_target(pinned);
> + new_base = per_cpu(tvec_bases, cpu);
>  
> - if (base != new_base) {
> - /*
> -  * We are trying to schedule the timer on the local CPU.
> -  * However we can't change timer's base while it is running,
> -  * otherwise del_timer_sync() can't detect that the timer's
> -  * handler yet has not finished. This also guarantees that
> -  * the timer is serialized wrt itself.
> -  */
> - if (likely(base->running_timer != timer)) {
> - /* See the comment in lock_timer_base() */
> - timer_set_base(timer, NULL);
> > - spin_unlock(&base->lock);
> - base = new_base;
> -

[PATCH/RFC] timer: make deferrable cpu unbound timers really not bound to a cpu

2014-09-11 Thread Joonwoo Park
When a deferrable work (INIT_DEFERRABLE_WORK, etc.) is queued via
queue_delayed_work() it's probably intended to run the work item on any
CPU that isn't idle. However, we queue the work to run at a later time
by starting a deferrable timer that binds to whatever CPU the work is
queued on which is same with queue_delayed_work_on(smp_processor_id())
effectively.

As a result WORK_CPU_UNBOUND work items aren't really cpu unbound now.
In fact this is perfectly fine with UP kernel and also won't affect much a
system without dyntick with SMP kernel too as every cpus run timers
periodically.  But on SMP systems with dyntick current implementation leads
deferrable timers not very scalable because the timer's base which has
queued the deferrable timer won't wake up till next non-deferrable timer
expires even though there are possible other non idle cpus are running
which are able to run expired deferrable timers.

The deferrable work is a good example of the current implementation's
victim like below.

INIT_DEFERRABLE_WORK(&dwork, fn);
CPU 0 CPU 1
queue_delayed_work(wq, &dwork, HZ);
queue_delayed_work_on(WORK_CPU_UNBOUND);
...
__mod_timer() -> queues timer to the
 current cpu's timer
 base.
...
tick_nohz_idle_enter() -> cpu enters idle.
A second later
cpu 0 is now in idle. cpu 1 exits idle or wasn't in idle so
  now it's in active but won't
cpu 0 won't wake up till next handle cpu unbound deferrable timer
non-deferrable timer expires. as it's in cpu 0's timer base.

To make all cpu unbound deferrable timers are scalable, introduce a common
timer base which is only for cpu unbound deferrable timers to make those
are indeed cpu unbound so that can be scheduled by any of non idle cpus.
This common timer fixes scalability issue of delayed work and all other cpu
unbound deferrable timer using implementations.

cc: Thomas Gleixner 
CC: John Stultz 
CC: Tejun Heo 
Signed-off-by: Joonwoo Park 
---
 kernel/time/timer.c | 108 +++-
 1 file changed, 82 insertions(+), 26 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index aca5dfe..655076b 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -93,6 +93,9 @@ struct tvec_base {
 struct tvec_base boot_tvec_bases;
 EXPORT_SYMBOL(boot_tvec_bases);
 static DEFINE_PER_CPU(struct tvec_base *, tvec_bases) = &boot_tvec_bases;
+#ifdef CONFIG_SMP
+static struct tvec_base *tvec_base_deferral = &boot_tvec_bases;
+#endif
 
 /* Functions below help us manage 'deferrable' flag */
 static inline unsigned int tbase_get_deferrable(struct tvec_base *base)
@@ -655,7 +658,14 @@ static inline void debug_assert_init(struct timer_list 
*timer)
 static void do_init_timer(struct timer_list *timer, unsigned int flags,
  const char *name, struct lock_class_key *key)
 {
-   struct tvec_base *base = __raw_get_cpu_var(tvec_bases);
+   struct tvec_base *base;
+
+#ifdef CONFIG_SMP
+   if (flags & TIMER_DEFERRABLE)
+   base = tvec_base_deferral;
+   else
+#endif
+   base = __raw_get_cpu_var(tvec_bases);
 
timer->entry.next = NULL;
timer->base = (void *)((unsigned long)base | flags);
@@ -777,26 +787,32 @@ __mod_timer(struct timer_list *timer, unsigned long 
expires,
 
debug_activate(timer, expires);
 
-   cpu = get_nohz_timer_target(pinned);
-   new_base = per_cpu(tvec_bases, cpu);
+#ifdef CONFIG_SMP
+   if (base != tvec_base_deferral) {
+#endif
+   cpu = get_nohz_timer_target(pinned);
+   new_base = per_cpu(tvec_bases, cpu);
 
-   if (base != new_base) {
-   /*
-* We are trying to schedule the timer on the local CPU.
-* However we can't change timer's base while it is running,
-* otherwise del_timer_sync() can't detect that the timer's
-* handler yet has not finished. This also guarantees that
-* the timer is serialized wrt itself.
-*/
-   if (likely(base->running_timer != timer)) {
-   /* See the comment in lock_timer_base() */
-   timer_set_base(timer, NULL);
-   spin_unlock(&base->lock);
-   base = new_base;
-   spin_lock(&base->lock);
-   timer_set_base(timer, base);
+   if (base != new_base) {
+   /*
+* We are trying to schedule the timer on the local CPU.
+* However we can't change timer's base while it is
+* running, otherwise del_timer_sync() can't detect that
+* the timer's handler yet has not finished. This also
+* guarantees that the timer

[PATCH] [RESENDING] netconsole: register cmdline netconsole configs to configfs

2008-02-11 Thread Joonwoo Park
This patch introduces registering cmdline netconsole configs into configfs
alongside dynamic netconsole. Satyam Sharma, who designed the shiny dynamic
reconfiguration for netconsole, already mentioned this issue.
(http://lkml.org/lkml/2007/7/29/360)
But I think it is possible without separately managing two kinds of
netconsole target objects, by using a config_group instead of a
config_item in netconsole_target together with the default_groups feature
of configfs.

The patch was tested with configuration creation/destruction both built
into the kernel and as a module.
It makes it possible to enable/disable, modify and review netconsole
target configs created from the cmdline.
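
As a rough, hedged userspace mock of that idea (group_model, target_model
and collect_defaults() are invented for illustration; this is not the
configfs API): every boot-time target carries its own group object, and
before the subsystem is registered the pre-existing targets are gathered
into a kcalloc-style array that becomes default_groups.

#include <stdio.h>
#include <stdlib.h>

struct group_model { char name[16]; };

struct target_model {
	struct group_model group;
	struct target_model *next;
};

/* boot/module-param targets created before configfs registration */
static struct target_model t0, t1, *target_list;

/* collect every pre-existing target's group into an array sized exactly
 * by the number of targets (mirrors the kcalloc in the patch) */
static struct group_model **collect_defaults(int defaults)
{
	struct group_model **groups = calloc(defaults, sizeof(*groups));
	struct target_model *nt;
	int i = 0;

	if (!groups)
		return NULL;
	for (nt = target_list; nt; nt = nt->next) {
		snprintf(nt->group.name, sizeof(nt->group.name), "netcon%d", i);
		groups[i++] = &nt->group;
	}
	return groups;
}

int main(void)
{
	struct group_model **defaults;
	int i;

	t0.next = &t1;
	target_list = &t0;

	defaults = collect_defaults(2);
	for (i = 0; defaults && i < 2; i++)
		printf("default group: %s\n", defaults[i]->name);
	free(defaults);
	return 0;
}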

Signed-off-by: Joonwoo Park <[EMAIL PROTECTED]>
---
 drivers/net/netconsole.c |   91 --
 1 files changed, 72 insertions(+), 19 deletions(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index 31e047d..63aabbb 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -93,7 +93,7 @@ static DEFINE_SPINLOCK(target_list_lock);
 struct netconsole_target {
struct list_headlist;
 #ifdef CONFIG_NETCONSOLE_DYNAMIC
-   struct config_item  item;
+   struct config_group group;
 #endif
int enabled;
struct netpoll  np;
@@ -103,16 +103,49 @@ struct netconsole_target {
 
 static struct configfs_subsystem netconsole_subsys;
 
-static int __init dynamic_netconsole_init(void)
+static void netconsole_target_put(struct netconsole_target *nt);
+static struct config_item_type netconsole_target_type;
+
+static int __init dynamic_netconsole_init(int defaults)
 {
+   int err;
+   unsigned long flags;
config_group_init(&netconsole_subsys.su_group);
+
+   if (defaults > 0) {
+   struct list_head *pos;
+   struct config_group **groups;
+   int i = 0;
+
+   groups = kcalloc(defaults, sizeof(struct config_group *),
+   GFP_KERNEL);
+   if (!groups)
+   return -ENOMEM;
+
+   spin_lock_irqsave(&target_list_lock, flags);
+   list_for_each(pos, &target_list) {
+   struct netconsole_target *nt;
+   nt = list_entry(pos, struct netconsole_target, list);
+   groups[i] = &nt->group;
+   i++;
+   }
+   spin_unlock_irqrestore(&target_list_lock, flags);
+   netconsole_subsys.su_group.default_groups = groups;
+   }
+
mutex_init(&netconsole_subsys.su_mutex);
-   return configfs_register_subsystem(&netconsole_subsys);
+
+   err = configfs_register_subsystem(&netconsole_subsys);
+   if (err)
+   kfree(netconsole_subsys.su_group.default_groups);
+
+   return err;
 }
 
 static void __exit dynamic_netconsole_exit(void)
 {
configfs_unregister_subsystem(&netconsole_subsys);
+   kfree(netconsole_subsys.su_group.default_groups);
 }
 
 /*
@@ -122,14 +155,23 @@ static void __exit dynamic_netconsole_exit(void)
  */
 static void netconsole_target_get(struct netconsole_target *nt)
 {
-   if (config_item_name(&nt->item))
-   config_item_get(&nt->item);
+   if (config_item_name(&nt->group.cg_item))
+   config_item_get(&nt->group.cg_item);
 }
 
 static void netconsole_target_put(struct netconsole_target *nt)
 {
-   if (config_item_name(&nt->item))
-   config_item_put(&nt->item);
+   if (config_item_name(&nt->group.cg_item))
+   config_item_put(&nt->group.cg_item);
+}
+
+static void dynamic_netconsole_init_type_name(struct netconsole_target *nt,
+   int index)
+{
+   char name[16];
+   snprintf(name, sizeof(name), "netcon%d", index);
+   config_item_init_type_name(&nt->group.cg_item, name,
+   &netconsole_target_type);
 }
 
 #else  /* !CONFIG_NETCONSOLE_DYNAMIC */
@@ -155,6 +197,11 @@ static void netconsole_target_put(struct netconsole_target *nt)
 {
 }
 
+static void dynamic_netconsole_init_type_name(struct netconsole_target *nt,
+   int index)
+{
+}
+
 #endif /* CONFIG_NETCONSOLE_DYNAMIC */
 
 /* Allocate new target (from boot/module param) and setup netpoll for it */
@@ -236,8 +283,8 @@ struct netconsole_target_attr {
 static struct netconsole_target *to_target(struct config_item *item)
 {
return item ?
-   container_of(item, struct netconsole_target, item) :
-   NULL;
+   container_of(to_config_group(item), struct netconsole_target,
+   group) : NULL;
 }
 
 /*
@@ -370,7 +417,7 @@ static ssize_t store_dev_name(struct netconsole_target *nt,
if (nt->enabled) {
printk(KERN_ERR "netconsole: target (%s) is enabled, "
"disable to update parameters\n",
-   config_item_name(&nt->item));
+ 

[PATCH 2/2] [RESENDING] fs/ocfs2: get rid of unnecessary initialization

2008-02-11 Thread Joonwoo Park
default_groups was allocated with kcalloc, so explicitly initializing the
terminating entry to NULL is unnecessary.

Signed-off-by: Joonwoo Park <[EMAIL PROTECTED]>
---
 fs/ocfs2/cluster/nodemanager.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/fs/ocfs2/cluster/nodemanager.c b/fs/ocfs2/cluster/nodemanager.c
index 709fba2..08609d7 100644
--- a/fs/ocfs2/cluster/nodemanager.c
+++ b/fs/ocfs2/cluster/nodemanager.c
@@ -839,7 +839,6 @@ static struct config_group *o2nm_cluster_group_make_group(struct config_group *g
cluster->cl_group.default_groups = defs;
cluster->cl_group.default_groups[0] = &ns->ns_group;
cluster->cl_group.default_groups[1] = o2hb_group;
-   cluster->cl_group.default_groups[2] = NULL;
rwlock_init(&cluster->cl_nodes_lock);
cluster->cl_node_ip_tree = RB_ROOT;
cluster->cl_reconnect_delay_ms = O2NET_RECONNECT_DELAY_MS_DEFAULT;
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] [RESENDING] fs/dlm: get rid of unnecessary initialization

2008-02-11 Thread Joonwoo Park
default_groups was allocated with kcalloc, so explicitly initializing the
terminating entry to NULL is unnecessary.

Signed-off-by: Joonwoo Park <[EMAIL PROTECTED]>
---
 fs/dlm/config.c |2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/fs/dlm/config.c b/fs/dlm/config.c
index c3ad1df..2b96428 100644
--- a/fs/dlm/config.c
+++ b/fs/dlm/config.c
@@ -414,7 +414,6 @@ static struct config_group *make_cluster(struct config_group *g,
cl->group.default_groups = gps;
cl->group.default_groups[0] = &sps->ss_group;
cl->group.default_groups[1] = &cms->cs_group;
-   cl->group.default_groups[2] = NULL;
 
cl->cl_tcp_port = dlm_config.ci_tcp_port;
cl->cl_buffer_size = dlm_config.ci_buffer_size;
@@ -483,7 +482,6 @@ static struct config_group *make_space(struct config_group *g, const char *name)
 
sp->group.default_groups = gps;
sp->group.default_groups[0] = &nds->ns_group;
-   sp->group.default_groups[1] = NULL;
 
INIT_LIST_HEAD(&sp->members);
mutex_init(&sp->members_lock);
-- 
1.5.3.rc5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [LINUX-KERNEL] C++ in linux kernel

2008-02-08 Thread Joonwoo Park
2008/2/9, Jan Engelhardt <[EMAIL PROTECTED]>:
>
> On Feb 9 2008 00:14, Joonwoo Park wrote:
> >2008/2/8, rohit h <[EMAIL PROTECTED]>:
> >> Hi,
> >>  I am a kernel newbie.
> >>  I tried to insmod a C++ module containing classes, inheritance.
> >>  I am getting 'unresolved symbol' error when I use the 'new' keyword.
> >>  What could the problem be?
> >>
> >>  What kind of runtime support is needed ( arm linux kernel)? Is a
> >> patch available for it?
> >>
> >Please take a look at click modular router which is using c++ as a
> >linux kernel module.
> >http://www.read.cs.ucla.edu/click/
> >The lib/glue.cc provides custom operator new.
>
> Uh, let's not make the world worse :)
> Just call malloc from C++, and carefully select what C++ features
> you are going to use. The VMware source for example does it right.
>

Yep, and I also think that if kmalloc is usable it would be better than
operator new.

Rohit,
Compiling a kernel module with g++ is not simple work; you may need a
big patch for the kernel itself.
C++ has more reserved keywords than C,
e.g. new, delete, and ::
Identifiers named new and delete would have to be renamed.
'::' is used in inline assembly, whereas in C++ it means the global
namespace.

In conclusion, I don't recommend C++ for the Linux kernel without a very
special goal like the click project :)
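
For context, the '::' in question is the separator between the operand
lists of GCC extended inline assembly.  A minimal standalone example
(just a compiler barrier, nothing kernel-specific):

int main(void)
{
	int x = 0;

	/* outputs : inputs : clobbers -- with empty lists this leaves
	 * "::" sequences in the source, which is the usage referred to
	 * above */
	__asm__ __volatile__("" : : : "memory");

	x = 1;
	return x - 1;
}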

Joonwoo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [LINUX-KERNEL] C++ in linux kernel

2008-02-08 Thread Joonwoo Park
2008/2/8, rohit h <[EMAIL PROTECTED]>:
> Hi,
>  I am a kernel newbie.
>  I tried to insmod a C++ module containing classes, inheritance.
>  I am getting 'unresolved symbol' error when I use the 'new' keyword.
>  What could the problem be?
>
>  What kind of runtime support is needed ( arm linux kernel)? Is a
> patch available for it?
>
>  Thanks,
> Rohit
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

Rohit,
Please take a look at click modular router which is using c++ as a
linux kernel module.
http://www.read.cs.ucla.edu/click/
The lib/glue.cc provides custom operator new.

Thanks,
Joonwoo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ipw3945-devel] [PATCH 2/5] iwlwifi: iwl3945 synchronize interrupt and tasklet for down iwlwifi

2008-01-14 Thread Joonwoo Park

I'm so sorry for the mangled patch.
Resending the patch as preformatted HTML from Thunderbird.


After disabling interrupts, an irq or the tasklet may still be pending or
running.  This patch eliminates those races when bringing iwlwifi down.

Since the irq handler that synchronize_irq() waits for can still schedule
iwl_irq_tasklet, tasklet_kill() should be called after synchronize_irq().

To avoid races between iwl_synchronize_irq() and iwl_irq_tasklet(), the
STATUS_INT_ENABLED flag is needed.
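
As a userspace analogy of the ordering above (pthreads and C11 atomics;
int_enabled, work_pending and irq_handler() are invented stand-ins, not
the driver's symbols): clear the enable flag, wait out any in-flight
handler the way synchronize_irq() does, and only then cancel the deferred
work, since the in-flight handler may still schedule it.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int int_enabled = 1;	/* stands in for STATUS_INT_ENABLED */
static atomic_int work_pending = 0;	/* stands in for the irq tasklet    */

static void *irq_handler(void *arg)	/* stands in for the hard irq path  */
{
	(void)arg;
	if (atomic_load(&int_enabled))
		atomic_store(&work_pending, 1);	/* handler schedules the "tasklet" */
	return NULL;
}

int main(void)
{
	pthread_t irq;

	pthread_create(&irq, NULL, irq_handler, NULL);

	atomic_store(&int_enabled, 0);	/* __iwl_disable_interrupts()          */
	pthread_join(irq, NULL);	/* synchronize_irq(): wait for handler */
	atomic_store(&work_pending, 0);	/* tasklet_kill(): nothing can         */
					/* re-schedule the work any more       */

	printf("work pending after teardown: %d\n",
	       atomic_load(&work_pending));
	return 0;
}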

Signed-off-by: Joonwoo Park <[EMAIL PROTECTED]>
---
drivers/net/wireless/iwlwifi/iwl3945-base.c |   33 ++-
1 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c 
b/drivers/net/wireless/iwlwifi/iwl3945-base.c
index 6e20725..f98cd4f 100644
--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
@@ -4405,6 +4405,14 @@ static void iwl_print_rx_config_cmd(struct iwl_rxon_cmd *rxon)
}
#endif

+static void iwl_synchronize_interrupts(struct iwl_priv *priv)
+{
+   synchronize_irq(priv->pci_dev->irq);
+   /* synchornize_irq introduces irq_tasklet,
+* tasklet_kill should be called after doing synchronize_irq */
+   tasklet_kill(&priv->irq_tasklet);
+}
+
static void iwl_enable_interrupts(struct iwl_priv *priv)
{
IWL_DEBUG_ISR("Enabling interrupts\n");
@@ -4413,7 +4421,7 @@ static void iwl_enable_interrupts(struct iwl_priv *priv)
iwl_flush32(priv, CSR_INT_MASK);
}

-static inline void iwl_disable_interrupts(struct iwl_priv *priv)
+static inline void __iwl_disable_interrupts(struct iwl_priv *priv)
{
clear_bit(STATUS_INT_ENABLED, &priv->status);

@@ -4427,6 +4435,13 @@ static inline void iwl_disable_interrupts(struct iwl_priv *priv)
iwl_flush32(priv, CSR_INT);
iwl_write32(priv, CSR_FH_INT_STATUS, 0x);
iwl_flush32(priv, CSR_FH_INT_STATUS);
+}
+
+static inline void iwl_disable_interrupts(struct iwl_priv *priv)
+{
+   __iwl_disable_interrupts(priv);
+
+   iwl_synchronize_interrupts(priv);

IWL_DEBUG_ISR("Disabled interrupts\n");
}
@@ -4708,7 +4723,8 @@ static void iwl_irq_tasklet(struct iwl_priv *priv)
IWL_ERROR("Microcode HW error detected.  Restarting.\n");

/* Tell the device to stop sending interrupts */
-   iwl_disable_interrupts(priv);
+   __iwl_disable_interrupts(priv);
+   IWL_DEBUG_ISR("Disabled interrupts\n");

iwl_irq_handle_error(priv);

@@ -4814,8 +4830,11 @@ static void iwl_irq_tasklet(struct iwl_priv *priv)
IWL_WARNING("   with FH_INT = 0x%08x\n", inta_fh);
}

-   /* Re-enable all interrupts */
-   iwl_enable_interrupts(priv);
+   /* To avoid race when device goes down,
+* it should be discarded to enable interrupts */
+   if (test_bit(STATUS_INT_ENABLED, &priv->status))
+   /* Re-enable all interrupts */
+   iwl_enable_interrupts(priv);

#ifdef CONFIG_IWLWIFI_DEBUG
if (iwl_debug_level & (IWL_DL_ISR)) {
@@ -4876,8 +4895,10 @@ unplugged:
return IRQ_HANDLED;

 none:
-   /* re-enable interrupts here since we don't have anything to service. */
-   iwl_enable_interrupts(priv);
+   if (test_bit(STATUS_INT_ENABLED, &priv->status))
+   /* re-enable interrupts here since we don't have anything
+* to service. */
+   iwl_enable_interrupts(priv);
spin_unlock(&priv->lock);
return IRQ_NONE;
}
---

Thanks,
Joonwoo

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ipw3945-devel] [PATCH 2/5] iwlwifi: iwl3945 synchronize interrupt and tasklet for down iwlwifi

2008-01-14 Thread Joonwoo Park

Joonwoo Park wrote:

2008/1/11, Chatre, Reinette <[EMAIL PROTECTED]>:

On Thursday, January 10, 2008 5:25 PM, Joonwoo Park  wrote:



What modification are you considering?


Roughly, I'm considering moving synchronize_irq() into
iwl_disable_interrupts() and fixing iwl_irq_tasklet not to call
iwl_disable_interrupts() with irqs disabled.
For now iwl_irq_tasklet calls iwl_disable_interrupts() with local irqs disabled,
like this:

static void iwl_irq_tasklet(struct iwl_priv *priv)
{
...
spin_lock_irqsave(&priv->lock, flags);

...
/* Now service all interrupt bits discovered above. */
if (inta & CSR_INT_BIT_HW_ERR) {
IWL_ERROR("Microcode HW error detected.  Restarting.\n");

/* Tell the device to stop sending interrupts */
iwl_disable_interrupts(priv);
...
spin_unlock_irqrestore(&priv->lock, flags);
return;
}




After disabling interrupts, an irq or the tasklet may still be pending or
running.  This patch eliminates those races when bringing iwlwifi down.

Since the irq handler that synchronize_irq() waits for can still schedule
iwl_irq_tasklet, tasklet_kill() should be called after synchronize_irq().

To avoid races between iwl_synchronize_irq() and iwl_irq_tasklet(), the
STATUS_INT_ENABLED flag is needed.

Signed-off-by: Joonwoo Park <[EMAIL PROTECTED]>
---
  drivers/net/wireless/iwlwifi/iwl3945-base.c |   33 ++-
  1 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/iwlwifi/iwl3945-base.c 
b/drivers/net/wireless/iwlwifi/iwl3945-base.c
index 6e20725..f98cd4f 100644
--- a/drivers/net/wireless/iwlwifi/iwl3945-base.c
+++ b/drivers/net/wireless/iwlwifi/iwl3945-base.c
@@ -4405,6 +4405,14 @@ static void iwl_print_rx_config_cmd(struct iwl_rxon_cmd *rxon)
  }
  #endif

+static void iwl_synchronize_interrupts(struct iwl_priv *priv)
+{
+   synchronize_irq(priv->pci_dev->irq);
+   /* synchornize_irq introduces irq_tasklet,
+* tasklet_kill should be called after doing synchronize_irq */
+   tasklet_kill(&priv->irq_tasklet);
+}
+
  static void iwl_enable_interrupts(struct iwl_priv *priv)
  {
IWL_DEBUG_ISR("Enabling interrupts\n");
@@ -4413,7 +4421,7 @@ static void iwl_enable_interrupts(struct iwl_priv *priv)
iwl_flush32(priv, CSR_INT_MASK);
  }

-static inline void iwl_disable_interrupts(struct iwl_priv *priv)
+static inline void __iwl_disable_interrupts(struct iwl_priv *priv)
  {
clear_bit(STATUS_INT_ENABLED, &priv->status);

@@ -4427,6 +4435,13 @@ static inline void iwl_disable_interrupts(struct iwl_priv *priv)
iwl_flush32(priv, CSR_INT);
iwl_write32(priv, CSR_FH_INT_STATUS, 0x);
iwl_flush32(priv, CSR_FH_INT_STATUS);
+}
+
+static inline void iwl_disable_interrupts(struct iwl_priv *priv)
+{
+   __iwl_disable_interrupts(priv);
+
+   iwl_synchronize_interrupts(priv);

IWL_DEBUG_ISR("Disabled interrupts\n");
  }
@@ -4708,7 +4723,8 @@ static void iwl_irq_tasklet(struct iwl_priv *priv)
IWL_ERROR("Microcode HW error detected.  Restarting.\n");

/* Tell the device to stop sending interrupts */
-   iwl_disable_interrupts(priv);
+   __iwl_disable_interrupts(priv);
+   IWL_DEBUG_ISR("Disabled interrupts\n");

iwl_irq_handle_error(priv);

@@ -4814,8 +4830,11 @@ static void iwl_irq_tasklet(struct iwl_priv *priv)
IWL_WARNING("   with FH_INT = 0x%08x\n", inta_fh);
}

-   /* Re-enable all interrupts */
-   iwl_enable_interrupts(priv);
+   /* To avoid race when device goes down,
+* it should be discarded to enable interrupts */
+   if (test_bit(STATUS_INT_ENABLED, &priv->status))
+   /* Re-enable all interrupts */
+   iwl_enable_interrupts(priv);

  #ifdef CONFIG_IWLWIFI_DEBUG
if (iwl_debug_level & (IWL_DL_ISR)) {
@@ -4876,8 +4895,10 @@ unplugged:
return IRQ_HANDLED;

   none:
-   /* re-enable interrupts here since we don't have anything to service. */
-   iwl_enable_interrupts(priv);
+   if (test_bit(STATUS_INT_ENABLED, &priv->status))
+   /* re-enable interrupts here since we don't have anything
+* to service. */
+   iwl_enable_interrupts(priv);
spin_unlock(&priv->lock);
return IRQ_NONE;
  }
---

Thanks,
Joonwoo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ipw3945-devel] [PATCH 4/5] iwlwifi: iwl3945 eliminate sleepable task queue from context

2008-01-10 Thread Joonwoo Park
2008/1/11, Zhu Yi <[EMAIL PROTECTED]>:
>
> The version doesn't work on a .24-rc kernel. Can you compile it
> with .23?
>

Thank you for your help, I built it.

Thanks,
Joonwoo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ipw3945-devel] [PATCH 4/5] iwlwifi: iwl3945 eliminate sleepable task queue from context

2008-01-10 Thread Joonwoo Park
2008/1/11, Zhu Yi <[EMAIL PROTECTED]>:
> Hi Joonwoo,
>
> We already did something similiar in our code base. Could you please
> take a look at this patch?
>
> http://intellinuxwireless.org/repos/?p=iwlwifi.git;a=commitdiff;h=57aa02255e9d7be5e2494683fc2793bd1d0707e2
>
> Thanks,
> -yi

Ooops :-)
I should have checked the code base, it's my fault.
It seems very similar, but unfortunately I could not get it to build.
Could you tell me how to build it, please?

my build error is here:
[EMAIL PROTECTED] ~/SRC/DRIVERS/iwlwifi $ KSRC=/home/jason/SRC/LINUX/linux-2.6 
make
make -C /home/jason/SRC/LINUX/linux-2.6 O=
M=/home/jason/SRC/DRIVERS/iwlwifi/compatible/
EXTRA_CFLAGS="-DCONFIG_IWL3945_DEBUG=y -DCONFIG_IWL4965_DEBUG=y
-DCONFIG_IWL3945_SPECTRUM_MEASUREMENT=y
-DCONFIG_IWL4965_SPECTRUM_MEASUREMENT=y -DCONFIG_IWL4965_SENSITIVITY=y
-DCONFIG_IWL3945_QOS=y -DCONFIG_IWL4965_QOS=y" modules
make[1]: Entering directory `/home/jason/SRC/LINUX/linux-2.6'
  CC [M]  /home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.o
In file included from
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:51:
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl-3945.h:443: error:
expected specifier-qualifier-list before 'ieee80211_key_alg'
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c: In function
'iwl3945_add_station':
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:539: error:
implicit declaration of function 'MAC_ARG'
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:539:
warning: too few arguments for format
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c: In function
'iwl3945_commit_rxon':
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:1161:
warning: too few arguments for format
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c: In function
'iwl3945_update_sta_key_info':
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:1400: error:
'struct iwl3945_hw_key' has no member named 'alg'
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:1401: error:
'struct iwl3945_hw_key' has no member named 'keylen'
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:1402: error:
'struct iwl3945_hw_key' has no member named 'key'
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:1402: error:
'struct iwl3945_hw_key' has no member named 'key'
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c: In function
'iwl3945_build_tx_cmd_hwcrypto':
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:2594: error:
'struct iwl3945_hw_key' has no member named 'alg'
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:2597: error:
'struct iwl3945_hw_key' has no member named 'keylen'
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:2597: error:
'struct iwl3945_hw_key' has no member named 'key'
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:2597: error:
'struct iwl3945_hw_key' has no member named 'keylen'
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:2597: error:
'struct iwl3945_hw_key' has no member named 'key'
/home/jason/SRC/DRIVERS/iwlwifi/compatible/iwl3945-base.c:2597: error:
'struct iwl3945_hw_key' has no member named 'keylen'


Thanks,
Joonwoo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ipw3945-devel] [PATCH 2/5] iwlwifi: iwl3945 synchronize interrupt and tasklet for down iwlwifi

2008-01-10 Thread Joonwoo Park
Hello Reinette,

2008/1/11, Chatre, Reinette <[EMAIL PROTECTED]>:
>
> Could synchronize_irq() be moved into iwl_disable_interrupts() ? I am

At this time, iwl_disable_interrupts() can be called with irqs
disabled, so to do that I think an additional modification would be
needed.

> also wondering if we cannot call tasklet_kill() before
> iwl_disable_interrupts() ... thus preventing it from being scheduled
> when we are going down.

Thanks for your catch; it seems the tasklet can re-enable interrupts.
I'll handle that and make another patch for it this weekend :)

Thanks,
Joonwoo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

