[PATCH] selftests: x86: add version check in test_syscall_vdso

2019-02-28 Thread Steve Muckle
Since 4.17, registers r8-r11 are no longer clobbered/zeroed by a 64-bit
kernel handling a 32-bit syscall, and this behavior is enforced by the
test_syscall_vdso testcase. See commit 8bb2610bc496
("x86/entry/64/compat: Preserve r8-r11 in int $0x80").

Permit the old behavior in the testcase for kernels prior to 4.17.
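
For example, on a 4.9 kernel uname(2) might report a release string such
as "4.9.112-foo" (a hypothetical value); sscanf() then yields version=4
and patchlevel=9, clobber_ok is set, and zeroed r8-r11 after the
int $0x80 syscall are tolerated rather than reported as failures.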

Signed-off-by: Steve Muckle 
---
 .../testing/selftests/x86/test_syscall_vdso.c | 30 +++
 1 file changed, 30 insertions(+)

diff --git a/tools/testing/selftests/x86/test_syscall_vdso.c b/tools/testing/selftests/x86/test_syscall_vdso.c
index c9c3281077bc..f7284dc4c32b 100644
--- a/tools/testing/selftests/x86/test_syscall_vdso.c
+++ b/tools/testing/selftests/x86/test_syscall_vdso.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include <sys/utsname.h>
 #include 
 
 #if !defined(__i386__)
@@ -71,6 +72,7 @@ struct regs64 {
 };
 struct regs64 regs64;
 int kernel_is_64bit;
+int clobber_ok;
 
 asm (
"   .pushsection .text\n"
@@ -130,6 +132,28 @@ void print_regs64(void)
	printf("12:%016llx 13:%016llx 14:%016llx 15:%016llx\n", regs64.r12,  regs64.r13,  regs64.r14,  regs64.r15);
 }
 
+static void get_kernel_version(int *version, int *patchlevel)
+{
+   int ret, sublevel;
+   struct utsname utsname;
+
+   ret = uname(&utsname);
+   if (ret) {
+   perror("uname");
+   exit(1);
+   }
+
+   ret = sscanf(utsname.release, "%d.%d.%d", version, patchlevel,
+&sublevel);
+   if (ret < 0) {
+   perror("sscanf");
+   exit(1);
+   } else if (ret != 3) {
+   printf("Malformed kernel version %s\n", utsname.release);
+   exit(1);
+   }
+}
+
 int check_regs64(void)
 {
int err = 0;
@@ -166,6 +190,8 @@ int check_regs64(void)
 * Historically (and probably unintentionally), they
 * were clobbered or zeroed.
 */
+   if (clobber_ok && *r64 == 0 && num <= 11)
+   continue;
}
printf("[FAIL]\tR%d has changed:%016llx\n", num, *r64);
err++;
@@ -385,6 +411,7 @@ int main(int argc, char **argv, char **envp)
 {
int exitcode = 0;
int cs;
+   int version, patchlevel;
 
asm("\n"
"   movl%%cs, %%eax\n"
@@ -394,6 +421,9 @@ int main(int argc, char **argv, char **envp)
if (!kernel_is_64bit)
	printf("[NOTE]\tNot a 64-bit kernel, won't test R8..R15 leaks\n");
 
+   get_kernel_version(&version, &patchlevel);
+   clobber_ok = version < 4 || (version == 4 && patchlevel < 17);
+
/* This only works for non-static builds:
	 * syscall_addr = dlsym(dlopen("linux-gate.so.1", RTLD_NOW), "__kernel_vsyscall");
 */
-- 
2.21.0.352.gf09ad66450-goog



Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-09-28 Thread Steve Muckle

On 09/27/2018 05:43 PM, Wanpeng Li wrote:

On your CPU4:
scheduler_ipi()
 -> sched_ttwu_pending()
  -> ttwu_do_activate()  => p->sched_remote_wakeup should be
                            false, so ENQUEUE_WAKEUP is set,
                            ENQUEUE_MIGRATED is not
   -> ttwu_activate()
    -> activate_task()
     -> enqueue_task()
      -> enqueue_task_fair()
       -> enqueue_entity()
        bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED)
so renorm is false in enqueue_entity(); why did you mention that the
cfs_rq->min_vruntime is still added to the se->vruntime in
enqueue_task_fair()?


Maybe this is a misunderstanding on my side, but didn't you ask me to
'... Could you point out when the fair rq's min_vruntime is added to the
task's vruntime in your *later* scenario? ...'


Yeah, if the calltrace above and my analysis are correct, then the fair
rq's min_vruntime will not be added to the task's vruntime in your
*later* scenario, which means that your patch is not necessary.


In the scenario I observed, the task is not waking - it is running and 
being deboosted from priority inheritance, transitioning from RT to CFS.


Dietmar and I both were able to reproduce the issue with the testcase I 
posted earlier in this thread.


thanks,
Steve


[tip:sched/core] sched/fair: Fix vruntime_normalized() for remote non-migration wakeup

2018-09-10 Thread tip-bot for Steve Muckle
Commit-ID:  d0cdb3ce8834332d918fc9c8ff74f8a169ec9abe
Gitweb: https://git.kernel.org/tip/d0cdb3ce8834332d918fc9c8ff74f8a169ec9abe
Author: Steve Muckle 
AuthorDate: Fri, 31 Aug 2018 15:42:17 -0700
Committer:  Ingo Molnar 
CommitDate: Mon, 10 Sep 2018 10:13:47 +0200

sched/fair: Fix vruntime_normalized() for remote non-migration wakeup

When a task which previously ran on a given CPU is remotely queued to
wake up on that same CPU, there is a period where the task's state is
TASK_WAKING and its vruntime is not normalized. This is not accounted
for in vruntime_normalized() which will cause an error in the task's
vruntime if it is switched from the fair class during this time.

For example if it is boosted to RT priority via rt_mutex_setprio(),
rq->min_vruntime will not be subtracted from the task's vruntime but
it will be added again when the task returns to the fair class. The
task's vruntime will have been erroneously doubled and the effective
priority of the task will be reduced.

Note this will also lead to inflation of all vruntimes since the doubled
vruntime value will become the rq's min_vruntime when other tasks leave
the rq. This leads to repeated doubling of the vruntime and priority
penalty.

Fix this by recognizing a WAKING task's vruntime as normalized only if
sched_remote_wakeup is true. This indicates a migration, in which case
the vruntime would have been normalized in migrate_task_rq_fair().

Based on a similar patch from John Dias .

Suggested-by: Peter Zijlstra 
Tested-by: Dietmar Eggemann 
Signed-off-by: Steve Muckle 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Chris Redpath 
Cc: John Dias 
Cc: Linus Torvalds 
Cc: Miguel de Dios 
Cc: Morten Rasmussen 
Cc: Patrick Bellasi 
Cc: Paul Turner 
Cc: Quentin Perret 
Cc: Thomas Gleixner 
Cc: Todd Kjos 
Cc: kernel-t...@android.com
Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuc...@google.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8cff8d55ee95..c6b7d6daab20 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9644,7 +9644,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 * - A task which has been woken up by try_to_wake_up() and
 *   waiting for actually being woken up by sched_ttwu_pending().
 */
-   if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+   if (!se->sum_exec_runtime ||
+   (p->state == TASK_WAKING && p->sched_remote_wakeup))
return true;
 
return false;


[PATCH] sched/fair: fix vruntime_normalized for remote non-migration wakeup

2018-08-31 Thread Steve Muckle
When a task which previously ran on a given CPU is remotely queued to
wake up on that same CPU, there is a period where the task's state is
TASK_WAKING and its vruntime is not normalized. This is not accounted
for in vruntime_normalized() which will cause an
error in the task's vruntime if it is switched from the fair class
during this time, for example if it is boosted to RT priority via
rt_mutex_setprio. The rq's min_vruntime will not be subtracted from the
task's vruntime but it will be added again when the task returns to the
fair class. The task's vruntime will have been erroneously doubled and
the effective priority of the task will be reduced.

Note this will also lead to inflation of all vruntimes since the doubled
vruntime value will become the rq's min_vruntime when other tasks leave
the rq. This leads to repeated doubling of the vruntime and priority
penalty.

Fix this by recognizing a WAKING task's vruntime as normalized only if
sched_remote_wakeup is true. This indicates a migration, in which case
the vruntime would have been normalized in migrate_task_rq_fair().
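
For context, the bookkeeping at issue lives in the class-switch paths of
kernel/sched/fair.c. A simplified paraphrase (not the verbatim kernel
code) of how min_vruntime is subtracted on detach and re-added on attach:

static void detach_task_cfs_rq(struct task_struct *p)
{
	struct sched_entity *se = &p->se;
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	if (!vruntime_normalized(p)) {
		/*
		 * Make vruntime absolute. In the buggy case the task is
		 * TASK_WAKING, vruntime_normalized() wrongly returns true,
		 * and this subtraction is skipped.
		 */
		place_entity(cfs_rq, se, 0);
		se->vruntime -= cfs_rq->min_vruntime;
	}
	...
}

static void attach_task_cfs_rq(struct task_struct *p)
{
	struct sched_entity *se = &p->se;
	struct cfs_rq *cfs_rq = cfs_rq_of(se);
	...
	/*
	 * By deboost time the task is no longer TASK_WAKING, so this
	 * check passes and min_vruntime is added on top of a vruntime
	 * that still contains it - the doubling described above.
	 */
	if (!vruntime_normalized(p))
		se->vruntime += cfs_rq->min_vruntime;
}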

Based on a similar patch from joaod...@google.com.

Suggested-by: Peter Zijlstra 
Signed-off-by: Steve Muckle 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..b3b62cf37fb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct *p)
 * - A task which has been woken up by try_to_wake_up() and
 *   waiting for actually being woken up by sched_ttwu_pending().
 */
-   if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+   if (!se->sum_exec_runtime ||
+   (p->state == TASK_WAKING && p->sched_remote_wakeup))
return true;
 
return false;
-- 
2.19.0.rc1.350.ge57e33dbd1-goog



Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-31 Thread Steve Muckle

On 08/29/2018 08:33 AM, Dietmar Eggemann wrote:

Yes, this solves the issue for the case I described. Using
'p->sched_remote_wakeup' (WF_MIGRATED) looks more elegant than using
'p->sched_class == &fair_sched_class'.


It's confirmed that this patch solves the original issue we saw (and my 
test case passes for me as well). I will send this version out shortly.


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-24 Thread Steve Muckle

On 08/24/2018 02:47 AM, Peter Zijlstra wrote:

On 08/17/2018 11:27 AM, Steve Muckle wrote:



When rt_mutex_setprio changes a task's scheduling class to RT,
we're seeing cases where the task's vruntime is not updated
correctly upon return to the fair class.



Specifically, the following is being observed:
- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
the vruntime_normalized check sees that the task's state is TASK_WAKING,
which results in skipping the subtraction of the rq's min_vruntime
from the task's vruntime
- later, when the prio is deboosted and the task is moved back
to the fair class, the fair rq's min_vruntime is added to
the task's vruntime, even though it wasn't subtracted earlier.


I'm thinking that is an incomplete scenario; where do we get to
TASK_WAKING.


Yes, there's a missing bit of context here at the beginning: the task
to be boosted had already been put into TASK_WAKING.


Re: [PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-24 Thread Steve Muckle

On 08/23/2018 11:54 PM, Juri Lelli wrote:

I tried to catch this issue on my Arm64 Juno board using pi_test (and a
slightly adapted pip_test (usleep_val = 1500 and keep low as cfs)) from
rt-tests but wasn't able to do so.

# pi_stress --inversions=1 --duration=1 --groups=1 --sched id=low,policy=cfs

Starting PI Stress Test
Number of thread groups: 1
Duration of test run: 1 seconds
Number of inversions per group: 1
  Admin thread SCHED_FIFO priority 4
1 groups of 3 threads will be created
   High thread SCHED_FIFO priority 3
Med thread SCHED_FIFO priority 2
Low thread SCHED_OTHER nice 0

# ./pip_stress

In both cases, the cfs task entering rt_mutex_setprio() is queued, so
dequeue_task_fair()->dequeue_entity(), which subtracts cfs_rq->min_vruntime
from se->vruntime, is called on it before it gets the rt prio.

Maybe it requires a very specific use of the pthread library to provoke this
issue by making sure that the cfs task really blocks/sleeps?


Maybe one could play with rt-app to recreate such specific use case?

https://github.com/scheduler-tools/rt-app/blob/master/doc/tutorial.txt#L459


This was reproduced for me on tip of mainline by using the program at 
the end of this mail. It was run in a 2 CPU virtualbox VM. Relevant 
annotated bits of the trace:


low-prio thread vruntime is 752ms
  pi-vruntime-tes-598   [001] d...   520.572459: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=29953 [ns] vruntime=752888705 [ns]


low-prio thread waits on a_sem
  pi-vruntime-tes-598   [001] d...   520.572465: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=598 prev_prio=120 prev_state=D ==> 
next_comm=swapper/1 next_pid=0 next_prio=120


high prio thread finishes wakeup, then sleeps for 1ms
   -0 [000] dNh.   520.572483: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
   -0 [000] d...   520.572486: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
  pi-vruntime-tes-597   [000] d...   520.572498: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120


high prio thread wakes up after 1ms sleep, posts a_sem which starts to 
wake low-prio thread, then tries to grab pi_mutex, which low-prio thread has
   -0 [000] d.h.   520.573876: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
   -0 [000] dNh.   520.573879: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
   -0 [000] d...   520.573887: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
  pi-vruntime-tes-597   [000] d...   520.573895: sched_waking: 
comm=pi-vruntime-tes pid=598 prio=120 target_cpu=001


low-prio thread pid 598 gets pi_mutex priority inheritance, this happens 
while low-prio thread is in waking state
  pi-vruntime-tes-597   [000] d...   520.573911: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=120 newprio=19


high-prio thread sleeps on pi_mutex
  pi-vruntime-tes-597   [000] d...   520.573919: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120


low-prio thread finishes wakeup
   -0 [001] dNh.   520.573932: sched_wakeup: 
comm=pi-vruntime-tes pid=598 prio=19 target_cpu=001
   -0 [001] d...   520.573936: sched_switch: 
prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=598 next_prio=19


low-prio thread releases pi-mutex, loses pi boost, high-prio thread 
wakes for pi-mutex
  pi-vruntime-tes-598   [001] d...   520.573946: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=19 newprio=120
  pi-vruntime-tes-598   [001] dN..   520.573954: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000


low-prio thread vruntime is 1505ms
  pi-vruntime-tes-598   [001] dN..   520.573966: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=20150 [ns] vruntime=1505797560 [ns]


The timing is quite sensitive since the task being boosted has to be 
caught in the TASK_WAKING state. The program:


/*
  * Test case for vruntime management during rtmutex priority inheritance
  * promotion and demotion.
  *
  * build with -lpthread
  */

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define ERROR_CHECK(x) \
	do { \
		if (x) \
			fprintf(stderr, "Error at line %d\n", __LINE__); \
	} while (0)

pthread_mutex_t pi_mutex;
sem_t a_sem;
sem_t b_sem;

void *rt_thread_func(void *arg) {
 int policy;
 int i = 0;
 cpu_set_t cpuset;

 CPU_ZERO(&cpuset);
 CPU_SET(0, &cpuset);
 ERROR_CHECK(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t),
                                    &cpuset));

 while (i++ < 100) {
 sem_wait(&b_sem);
 usleep(1000);
 sem_post(&a_sem);
 pthread_mutex_lock(&pi_mutex);
 pthread_mutex_unlock(&pi_mutex);
 }
}

[PATCH] sched/fair: vruntime should normalize when switching from fair

2018-08-17 Thread Steve Muckle
From: John Dias 

When rt_mutex_setprio changes a task's scheduling class to RT,
we're seeing cases where the task's vruntime is not updated
correctly upon return to the fair class.
Specifically, the following is being observed:
- task is deactivated while still in the fair class
- task is boosted to RT via rt_mutex_setprio, which changes
  the task to RT and calls check_class_changed.
- check_class_changed leads to detach_task_cfs_rq, at which point
  the vruntime_normalized check sees that the task's state is TASK_WAKING,
  which results in skipping the subtraction of the rq's min_vruntime
  from the task's vruntime
- later, when the prio is deboosted and the task is moved back
  to the fair class, the fair rq's min_vruntime is added to
  the task's vruntime, even though it wasn't subtracted earlier.
The immediate result is inflation of the task's vruntime, giving
it lower priority (starving it if there's enough available work).
The longer-term effect is inflation of all vruntimes because the
task's vruntime becomes the rq's min_vruntime when the higher
priority tasks go idle. That leads to a vicious cycle, where
the vruntime inflation repeatedly doubled.

The change here is to detect when vruntime_normalized is being
called when the task is waking but is waking in another class,
and to conclude that this is a case where vruntime has not
been normalized.

Signed-off-by: John Dias 
Signed-off-by: Steve Muckle 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b39fb596f6c1..14011d7929d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9638,7 +9638,8 @@ static inline bool vruntime_normalized(struct task_struct 
*p)
 * - A task which has been woken up by try_to_wake_up() and
 *   waiting for actually being woken up by sched_ttwu_pending().
 */
-   if (!se->sum_exec_runtime || p->state == TASK_WAKING)
+   if (!se->sum_exec_runtime ||
+   (p->state == TASK_WAKING && p->sched_class == &fair_sched_class))
return true;
 
return false;
-- 
2.18.0.865.gffc8e1a3cd6-goog



Re: [RFC] vruntime updated incorrectly when rt_mutex boots prio?

2018-08-13 Thread Steve Muckle

On 08/07/2018 10:40 AM, 'Todd Kjos' via kernel-team wrote:

This issue was discovered on a 4.9-based android device, but the
relevant mainline code appears to be the same. The symptom is that
over time the some workloads become sluggish resulting in missed
frames or sluggishness. It appears to be the same issue described in
http://lists.infradead.org/pipermail/linux-arm-kernel/2018-March/567836.html.

Here is the scenario: A task is deactivated while still in the fair
class. The task is then boosted to RT, so rt_mutex_setprio() is
called. This changes the task to RT and calls check_class_changed(),
which eventually calls detach_task_cfs_rq(), which is where
vruntime_normalized() sees that the task's state is TASK_WAKING, which
results in skipping the subtraction of the rq's min_vruntime from the
task's vruntime. Later, when the prio is deboosted and the task is
moved back to the fair class, the fair rq's min_vruntime is added to
the task's vruntime, resulting in vruntime inflation.


This was reproduced for me on tip of mainline by using the program at 
the end of this mail. It was run in a 2 CPU virtualbox VM. Relevant 
annotated bits of the trace:


low-prio thread vruntime is 752ms
 pi-vruntime-tes-598   [001] d...   520.572459: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=29953 [ns] vruntime=752888705 [ns]


low-prio thread waits on a_sem
 pi-vruntime-tes-598   [001] d...   520.572465: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=598 prev_prio=120 prev_state=D ==> 
next_comm=swapper/1 next_pid=0 next_prio=120


high prio thread finishes wakeup, then sleeps for 1ms
  -0 [000] dNh.   520.572483: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
  -0 [000] d...   520.572486: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
 pi-vruntime-tes-597   [000] d...   520.572498: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120


high prio thread wakes up after 1ms sleep, posts a_sem which starts to 
wake low-prio thread, then tries to grab pi_mutex, which low-prio thread has
  -0 [000] d.h.   520.573876: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
  -0 [000] dNh.   520.573879: sched_wakeup: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000
  -0 [000] d...   520.573887: sched_switch: 
prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=597 next_prio=19
 pi-vruntime-tes-597   [000] d...   520.573895: sched_waking: 
comm=pi-vruntime-tes pid=598 prio=120 target_cpu=001


low-prio thread pid 598 gets pi_mutex priority inheritance, this happens 
while low-prio thread is in waking state
 pi-vruntime-tes-597   [000] d...   520.573911: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=120 newprio=19


high-prio thread sleeps on pi_mutex
 pi-vruntime-tes-597   [000] d...   520.573919: sched_switch: 
prev_comm=pi-vruntime-tes prev_pid=597 prev_prio=19 prev_state=D ==> 
next_comm=swapper/0 next_pid=0 next_prio=120


low-prio thread finishes wakeup
  -0 [001] dNh.   520.573932: sched_wakeup: 
comm=pi-vruntime-tes pid=598 prio=19 target_cpu=001
  -0 [001] d...   520.573936: sched_switch: 
prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=S ==> 
next_comm=pi-vruntime-tes next_pid=598 next_prio=19


low-prio thread releases pi-mutex, loses pi boost, high-prio thread 
wakes for pi-mutex
 pi-vruntime-tes-598   [001] d...   520.573946: sched_pi_setprio: 
comm=pi-vruntime-tes pid=598 oldprio=19 newprio=120
 pi-vruntime-tes-598   [001] dN..   520.573954: sched_waking: 
comm=pi-vruntime-tes pid=597 prio=19 target_cpu=000


low-prio thread vruntime is 1505ms
 pi-vruntime-tes-598   [001] dN..   520.573966: sched_stat_runtime: 
comm=pi-vruntime-tes pid=598 runtime=20150 [ns] vruntime=1505797560 [ns]


The program:

/*
 * Test case for vruntime management during rtmutex priority inheritance
 * promotion and demotion.
 *
 * build with -lpthread
 */

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

#define ERROR_CHECK(x) \
	do { \
		if (x) \
			fprintf(stderr, "Error at line %d\n", __LINE__); \
	} while (0)

pthread_mutex_t pi_mutex;
sem_t a_sem;
sem_t b_sem;

void *rt_thread_func(void *arg) {
int policy;
int i = 0;
cpu_set_t cpuset;

CPU_ZERO(&cpuset);
CPU_SET(0, &cpuset);
ERROR_CHECK(pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t),
                                   &cpuset));

while(i++ < 100) {
sem_wait(&b_sem);
usleep(1000);
sem_post(&a_sem);
pthread_mutex_lock(&pi_mutex);
pthread_mutex_unlock(&pi_mutex);
}
}

void *low_prio_thread_func(void *arg) {
int i = 0;
cpu_set_t cpuset;

   

Re: [PATCH 06/24] selftests: exec: return Kselftest Skip code for skipped tests

2018-05-07 Thread Steve Muckle

On 05/04/2018 06:13 PM, Shuah Khan (Samsung OSG) wrote:

When execveat test is skipped because of unmet dependencies and/or
unsupported configuration, it exits with error which is treated as
a fail by the Kselftest framework. This leads to false negative
result even when the test could not be run.

Change it to return kselftest skip code when a test gets skipped to
clearly report that the test could not be run.

Change it to use ksft_exit_skip() when kernel doesn't support execveat.

Signed-off-by: Shuah Khan (Samsung OSG) 
---
  tools/testing/selftests/exec/execveat.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
index 67cd4597db2b..47cbf54d0801 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -20,6 +20,8 @@
  #include 
  #include 
  
+#include "../kselftest.h"
+
  static char longpath[2 * PATH_MAX] = "";
  static char *envp[] = { "IN_TEST=yes", NULL, NULL };
  static char *argv[] = { "execveat", "99", NULL };
@@ -249,8 +251,8 @@ static int run_tests(void)
errno = 0;
execveat_(-1, NULL, NULL, NULL, 0);
if (errno == ENOSYS) {
-   printf("[FAIL] ENOSYS calling execveat - no kernel support?\n");
-   return 1;
+   ksft_exit_skip(
+   "ENOSYS calling execveat - no kernel support?\n");
}
  
  	/* Change file position to confirm it doesn't affect anything */



LGTM.

thanks,
Steve


Re: [PATCH RFC 2/5] sched/fair: Skip frequency update if CPU about to idle

2017-11-01 Thread Steve Muckle

On 10/30/2017 12:02 PM, Joel Fernandes wrote:

Also, this more looks like a policy decision. Will it be better to
put that directly into schedutil? Like this:

 if (cpu_idle())
 "Don't change the freq";

Will something like that work?


I thought about this and I think it wont work very well. In the
dequeue path we're still running the task being dequeued so the CPU is
not yet idle. What is needed here IMO is a notion that the CPU is
possibly about to idle, and we can predict that from the dequeue
path of the CFS class.

Also just looking at whether the CPU is currently idle or not in the
governor doesn't help to differentiate between say the dequeue path /
tick path. Both of which can occur when the CPU is not idle.

Any thoughts about this?


Also if it really is the case that this bit of policy is universally 
desirable, I'd think it is better to do this in the scheduler and avoid 
a needless trip through a fn pointer out to schedutil for performance 
reasons.


[PATCH v2] selftests/exec: include cwd in long path calculation

2017-10-12 Thread Steve Muckle
When creating a pathname close to PATH_MAX to test execveat, factor in
the current working directory path otherwise we end up with an absolute
path that is longer than PATH_MAX. While execveat() may succeed, subsequent
calls to the kernel from the runtime environment which are required to
successfully execute the test binary/script may fail because of this.

To keep the semantics of the test the same, rework the relative pathname
part of the test to be relative to the root directory so it isn't
decreased by the length of the current working directory path.
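
As a concrete example (assuming PATH_MAX is 4096 and a 32-character cwd):
the test now builds count = (4096 - 3 - 32) / 200 = 20 directories of 200
characters each, plus a final component of 4061 - 4000 = 61 characters,
so the absolute path is 32 + 1 + 4000 + 61 = 4094 characters and stays
under PATH_MAX. Previously the cwd length was added on top of a path
already sized to PATH_MAX - 3, pushing the absolute path past the limit.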

Signed-off-by: Steve Muckle 
---
Differences from v1:
 - free allocation from getcwd()
 - update comment in check_execveat_pathmax()

 tools/testing/selftests/exec/execveat.c | 27 +++
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
index 8d5d1d2ee7c1..67cd4597db2b 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -147,7 +147,7 @@ static void exe_cp(const char *src, const char *dest)
 }
 
 #define XX_DIR_LEN 200
-static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
+static int check_execveat_pathmax(int root_dfd, const char *src, int is_script)
 {
int fail = 0;
int ii, count, len;
@@ -156,20 +156,30 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
 
if (*longpath == '\0') {
/* Create a filename close to PATH_MAX in length */
+   char *cwd = getcwd(NULL, 0);
+
+   if (!cwd) {
+   printf("Failed to getcwd(), errno=%d (%s)\n",
+  errno, strerror(errno));
+   return 2;
+   }
+   strcpy(longpath, cwd);
+   strcat(longpath, "/");
memset(longname, 'x', XX_DIR_LEN - 1);
longname[XX_DIR_LEN - 1] = '/';
longname[XX_DIR_LEN] = '\0';
-   count = (PATH_MAX - 3) / XX_DIR_LEN;
+   count = (PATH_MAX - 3 - strlen(cwd)) / XX_DIR_LEN;
for (ii = 0; ii < count; ii++) {
strcat(longpath, longname);
mkdir(longpath, 0755);
}
-   len = (PATH_MAX - 3) - (count * XX_DIR_LEN);
+   len = (PATH_MAX - 3 - strlen(cwd)) - (count * XX_DIR_LEN);
if (len <= 0)
len = 1;
memset(longname, 'y', len);
longname[len] = '\0';
strcat(longpath, longname);
+   free(cwd);
}
exe_cp(src, longpath);
 
@@ -190,7 +200,7 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
}
 
/*
-* Execute as a long pathname relative to ".".  If this is a script,
+* Execute as a long pathname relative to "/".  If this is a script,
 * the interpreter will launch but fail to open the script because its
 * name ("/dev/fd/5/xxx") is bigger than PATH_MAX.
 *
@@ -200,10 +210,10 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
 * the exit status shall be 126."), so allow either.
 */
if (is_script)
-   fail += check_execveat_invoked_rc(dot_dfd, longpath, 0,
+   fail += check_execveat_invoked_rc(root_dfd, longpath + 1, 0,
  127, 126);
else
-   fail += check_execveat(dot_dfd, longpath, 0);
+   fail += check_execveat(root_dfd, longpath + 1, 0);
 
return fail;
 }
@@ -218,6 +228,7 @@ static int run_tests(void)
int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
   O_DIRECTORY|O_RDONLY);
int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+   int root_dfd = open_or_die("/", O_DIRECTORY|O_RDONLY);
int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
int dot_dfd_cloexec = open_or_die(".", O_DIRECTORY|O_RDONLY|O_CLOEXEC);
int fd = open_or_die("execveat", O_RDONLY);
@@ -353,8 +364,8 @@ static int run_tests(void)
/* Attempt to execute relative to non-directory => ENOTDIR */
fail += check_execveat_fail(fd, "execveat", 0, ENOTDIR);
 
-   fail += check_execveat_pathmax(dot_dfd, "execveat", 0);
-   fail += check_execveat_pathmax(dot_dfd, "script", 1);
+   fail += check_execveat_pathmax(root_dfd, "execveat", 0);
+   fail += check_execveat_pathmax(root_dfd, "script", 1);
return fail;
 }
 
-- 
2.11.0



Re: [PATCH] selftests/exec: include cwd in long path calculation

2017-10-12 Thread Steve Muckle

Thanks David for the review. Replies inline.

On 10/12/2017 05:22 AM, David Drysdale wrote:

Modulo the minor comment below:

Reviewed-by: David Drysdale 

On Thu, Oct 12, 2017 at 1:40 AM, Steve Muckle  wrote:

When creating a pathname close to PATH_MAX to test execveat, factor in
the current working directory path otherwise we end up with an absolute
path that is longer than PATH_MAX. While execveat() may succeed, subsequent
calls to the kernel from the runtime environment which are required to
successfully execute the test binary/script may fail because of this.

To keep the semantics of the test the same, rework the relative pathname
part of the test to be relative to the root directory so it isn't
decreased by the length of the current working directory path.

Signed-off-by: Steve Muckle 
---
  tools/testing/selftests/exec/execveat.c | 24 +---
  1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
index 8d5d1d2ee7c1..5edc609c778b 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -147,7 +147,7 @@ static void exe_cp(const char *src, const char *dest)
  }

  #define XX_DIR_LEN 200
-static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
+static int check_execveat_pathmax(int root_dfd, const char *src, int is_script)
  {
 int fail = 0;
 int ii, count, len;
@@ -156,15 +156,24 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)

 if (*longpath == '\0') {
 /* Create a filename close to PATH_MAX in length */
+   char *cwd = getcwd(NULL, 0);


cwd never gets freed, but that's presumably not a problem unless these
tests ever get run under some kind of leak checker.


Yeah I found some other instances of this in the test (specifically the 
calls to realpath() at the top of run_tests() which alloc their own 
buffer) and figured for a small unit test like this it had been decided 
it wasn't worth freeing stuff. I'll tidy this up anyway though.

+
+   if (!cwd) {
+   printf("Failed to getcwd(), errno=%d (%s)\n",
+  errno, strerror(errno));
+   return 2;
+   }
+   strcpy(longpath, cwd);
+   strcat(longpath, "/");
 memset(longname, 'x', XX_DIR_LEN - 1);
 longname[XX_DIR_LEN - 1] = '/';
 longname[XX_DIR_LEN] = '\0';
-   count = (PATH_MAX - 3) / XX_DIR_LEN;
+   count = (PATH_MAX - 3 - strlen(cwd)) / XX_DIR_LEN;
 for (ii = 0; ii < count; ii++) {
 strcat(longpath, longname);
 mkdir(longpath, 0755);
 }
-   len = (PATH_MAX - 3) - (count * XX_DIR_LEN);
+   len = (PATH_MAX - 3 - strlen(cwd)) - (count * XX_DIR_LEN);
 if (len <= 0)
 len = 1;
 memset(longname, 'y', len);


There's a missing comment update around here-ish:
   * Execute as a long pathname relative to ".".
(should now be 'relative to "/"' I think).


Will fix.

thanks,
Steve


[PATCH] selftests/exec: include cwd in long path calculation

2017-10-11 Thread Steve Muckle
When creating a pathname close to PATH_MAX to test execveat, factor in
the current working directory path otherwise we end up with an absolute
path that is longer than PATH_MAX. While execveat() may succeed, subsequent
calls to the kernel from the runtime environment which are required to
successfully execute the test binary/script may fail because of this.

To keep the semantics of the test the same, rework the relative pathname
part of the test to be relative to the root directory so it isn't
decreased by the length of the current working directory path.

Signed-off-by: Steve Muckle 
---
 tools/testing/selftests/exec/execveat.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
index 8d5d1d2ee7c1..5edc609c778b 100644
--- a/tools/testing/selftests/exec/execveat.c
+++ b/tools/testing/selftests/exec/execveat.c
@@ -147,7 +147,7 @@ static void exe_cp(const char *src, const char *dest)
 }
 
 #define XX_DIR_LEN 200
-static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
+static int check_execveat_pathmax(int root_dfd, const char *src, int is_script)
 {
int fail = 0;
int ii, count, len;
@@ -156,15 +156,24 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
 
if (*longpath == '\0') {
/* Create a filename close to PATH_MAX in length */
+   char *cwd = getcwd(NULL, 0);
+
+   if (!cwd) {
+   printf("Failed to getcwd(), errno=%d (%s)\n",
+  errno, strerror(errno));
+   return 2;
+   }
+   strcpy(longpath, cwd);
+   strcat(longpath, "/");
memset(longname, 'x', XX_DIR_LEN - 1);
longname[XX_DIR_LEN - 1] = '/';
longname[XX_DIR_LEN] = '\0';
-   count = (PATH_MAX - 3) / XX_DIR_LEN;
+   count = (PATH_MAX - 3 - strlen(cwd)) / XX_DIR_LEN;
for (ii = 0; ii < count; ii++) {
strcat(longpath, longname);
mkdir(longpath, 0755);
}
-   len = (PATH_MAX - 3) - (count * XX_DIR_LEN);
+   len = (PATH_MAX - 3 - strlen(cwd)) - (count * XX_DIR_LEN);
if (len <= 0)
len = 1;
memset(longname, 'y', len);
@@ -200,10 +209,10 @@ static int check_execveat_pathmax(int dot_dfd, const char *src, int is_script)
 * the exit status shall be 126."), so allow either.
 */
if (is_script)
-   fail += check_execveat_invoked_rc(dot_dfd, longpath, 0,
+   fail += check_execveat_invoked_rc(root_dfd, longpath + 1, 0,
  127, 126);
else
-   fail += check_execveat(dot_dfd, longpath, 0);
+   fail += check_execveat(root_dfd, longpath + 1, 0);
 
return fail;
 }
@@ -218,6 +227,7 @@ static int run_tests(void)
int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
   O_DIRECTORY|O_RDONLY);
int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+   int root_dfd = open_or_die("/", O_DIRECTORY|O_RDONLY);
int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
int dot_dfd_cloexec = open_or_die(".", O_DIRECTORY|O_RDONLY|O_CLOEXEC);
int fd = open_or_die("execveat", O_RDONLY);
@@ -353,8 +363,8 @@ static int run_tests(void)
/* Attempt to execute relative to non-directory => ENOTDIR */
fail += check_execveat_fail(fd, "execveat", 0, ENOTDIR);
 
-   fail += check_execveat_pathmax(dot_dfd, "execveat", 0);
-   fail += check_execveat_pathmax(dot_dfd, "script", 1);
+   fail += check_execveat_pathmax(root_dfd, "execveat", 0);
+   fail += check_execveat_pathmax(root_dfd, "script", 1);
return fail;
 }
 
-- 
2.15.0.rc0.271.g36b669edcc-goog



Android kernel configs (was Re: configs)

2017-09-20 Thread Steve Muckle

On 09/19/2017 10:08 AM, John Stultz wrote:

So what I was thinking of to improve the developer usability might be
the following:

1) Leave the upstream configs in place.  We can try to keep
maintaining them, but its not something the Android team is likely to
care greatly about, and hopefully can be ignored. Meanwhile for the
folks tinkering with upstream, there's something that helps them
figure things out. When the next android-x.y tree comes out, you can
simply remove it.

2) To the android-x.y branches, add a script to the kernel/configs/
dir to fetch the current config directly from the config repo. This
way its simple for folks to get the right thing w/o looking anything
up. This won't fly upstream, but should be fine for common.git

Thoughts?


LGTM. If anyone has concerns let me know, otherwise I'll proceed with #2.

thanks,
Steve


Re: [PATCH RFC RESEND v2 0/2] Prevent cpufreq update for only task on rq that sleeps

2017-09-07 Thread Steve Muckle

On 09/07/2017 09:14 AM, Joel Fernandes wrote:

I'm planning to rebase this series on Linus's master and post it
again, but just checking any thoughts about it?

Just to add more context, the reason for not updating the frequency:

- When a last dequeue of a sleeping task happens, it is sufficient to
update utilization without updating the frequency because if other
CPUs are busy then their updates will consider the utilization of the
idle CPU in the shared policy unless sufficient time has passed.

- If the last dequeue of a sleeping task happens while all other CPUs
in the cluster are idle, then the cluster will likely enter
cluster-idle soon.


To clarify - when you say "last dequeue of a sleeping task happens" 
above, you're referring to the dequeue of the last task running on the 
CPU, correct? I.e. the CPU is about to go idle?


It's been a while since I've looked at this area so would like to hold 
off for a rebased version to review in further detail. But I think the 
concept is valid.


thanks,
steve


Re: [PATCH V2 0/4] cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task

2016-11-23 Thread Steve Muckle
Hi Viresh,

On Tue, Nov 15, 2016 at 01:53:19PM +0530, Viresh Kumar wrote:
> This work was started by Steve Muckle, where he used a simple kthread
> instead of kthread-worker and that wasn't sufficient as some guarantees
> weren't met.

I know this has already gone in, but can you expand on the unmet
guarantees mentioned here just for my own (and perhaps others')
understanding?

thanks,
Steve


Re: [PATCH 2/3] cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task

2016-11-13 Thread Steve Muckle
On Fri, Nov 11, 2016 at 11:16:59PM +0100, Rafael J. Wysocki wrote:
> > +   struct sched_param param = { .sched_priority = 50 };
> 
> I'd define a symbol for the 50.  It's just one extra line of code ...

A minor point for sure, but in general what's the motivation for
defining symbols for things which are only used once? It makes it harder
to untangle and learn a piece of code IMO, having to jump to the
definitions of them to see what they are.

thanks,
Steve


Re: [PATCH 2/3] cpufreq: schedutil: move slow path from workqueue to SCHED_FIFO task

2016-11-13 Thread Steve Muckle
On Sun, Nov 13, 2016 at 03:37:18PM +0100, Rafael J. Wysocki wrote:
> > Hold on a sec. I thought during LPC someone (Peter?) made a point that when
> > an RT thread runs, we should bump the frequency to max? So, schedutil is going
> > to trigger schedutil to bump up the frequency to max, right?
> 
> No, it isn't, or at least that is unlikely.
> 
> sugov_update_commit() sets sg_policy->work_in_progress before queuing
> the IRQ work and it is not cleared until the frequency changes in
> sugov_work().
> 
> OTOH, sugov_should_update_freq() checks sg_policy->work_in_progress
> upfront and returns false when it is set, so the governor won't see
> its own worker threads run, unless I'm overlooking something highly
> non-obvious.

FWIW my intention with the original version of this patch (which I
neglected to communicate to Viresh) was that it would depend on changing
the frequency policy for RT. I had been using rt_avg. It sounds like
during LPC there were talks of using another metric.

It does appear things would work okay without that but it also seems
a bit fragile. There's the window between when the work_in_progress
gets cleared and the RT kthread yields. I have not thought through the
various scenarios there to see what is possible, nor tested whether it is
significant enough to impact power-sensitive platforms.
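
For reference, the mechanism being discussed looks roughly like this (a
simplified paraphrase of kernel/sched/cpufreq_schedutil.c at the time,
not the exact code):

static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
{
	/*
	 * A slow-path frequency change is still pending; skip updates,
	 * including any the governor's own kthread would trigger.
	 */
	if (sg_policy->work_in_progress)
		return false;
	...
}

static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
				unsigned int next_freq)
{
	...
	/* Slow path: defer the actual frequency write to the kthread. */
	sg_policy->work_in_progress = true;
	irq_work_queue(&sg_policy->irq_work);
}

sugov_work() clears work_in_progress once the frequency has changed; the
window mentioned above is between that store and the SCHED_FIFO kthread
actually yielding the CPU.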

thanks,
Steve


Re: [RFC/RFT][PATCH 0/4] cpufreq / sched: iowait boost in intel_pstate and schedutil

2016-09-08 Thread Steve Muckle
On Wed, Sep 07, 2016 at 05:35:50PM -0700, Srinivas Pandruvada wrote:
> Did you see any performance regression on Android workloads?

I did a few AnTuTU runs and did not observe a regression.

thanks,
Steve


Re: [RFC/RFT][PATCH 0/4] cpufreq / sched: iowait boost in intel_pstate and schedutil

2016-09-07 Thread Steve Muckle
On Sat, Sep 03, 2016 at 02:56:48AM +0200, Rafael J. Wysocki wrote:
> Please let me know what you think and if you can run some benchmarks you
> care about and see if the changes make any difference (this way or another),
> please do that and let me know what you've found.

LGTM (I just reviewed the first and last patch, skipping the
intel_pstate ones).

I was unable to see a conclusive power regression in Android audio, video or
idle usecases on my hikey 96board.

thanks,
Steve


Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-09-01 Thread Steve Muckle
On Wed, Aug 31, 2016 at 06:00:02PM +0100, Juri Lelli wrote:
> > Another problem is that we have many semi related knobs; we have the
> > global RT runtime limit knob, but that doesn't affect cpufreq (maybe it
> > should)
> 
> Maybe we could create this sort of link when using the cgroup RT
> throttling interface as well? It should still then fit well once we
> replace the underlying mechanism with DL reservations. And, AFAIK, the
> interface is used by Android folks already.

I'm not sure how the upper bounds can be used to infer CPU frequency...
On my Nexus 6p (an Android device), the global RT runtime limit
seems to be set at 950ms/1sec, the root cgroup is set to 800ms/1sec, and
bg_non_interactive is set at 700ms/1sec.
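
(For reference, those knobs: the global limit is the
kernel.sched_rt_runtime_us / kernel.sched_rt_period_us sysctls, and the
per-group values come from the cpu cgroup controller's cpu.rt_runtime_us
and cpu.rt_period_us files when CONFIG_RT_GROUP_SCHED is enabled.)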



Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-31 Thread Steve Muckle
On Wed, Aug 31, 2016 at 04:39:07PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 26, 2016 at 11:40:48AM -0700, Steve Muckle wrote:
> > A policy of going to fmax on any RT activity will be detrimental
> > for power on many platforms. Often RT accounts for only a small amount
> > of CPU activity so sending the CPU frequency to fmax is overkill. Worse
> > still, some platforms may not be able to even complete the CPU frequency
> > change before the RT activity has already completed.
> > 
> > Cpufreq governors have not treated RT activity this way in the past so
> > it is not part of the expected semantics of the RT scheduling class. The
> > DL class offers guarantees about task completion and could be used for
> > this purpose.
> 
> Not entirely true. People have simply disabled cpufreq because of this.
>
> Yes, RR/FIFO are a pain, but they should still be deterministic, and
> variable cpufreq destroys that.

That is the way it's been with cpufreq and many systems (including all
mobile devices) rely on that to not destroy power. RT + variable cpufreq
is not deterministic.

Given we don't have good constraints on RT tasks I don't think we should
try to strengthen the semantics there. Folks should either move to DL if
they want determinism *and* not-sucky power, or continue disabling
cpufreq if they are able to do so.

> I realize that the fmax thing is annoying, but I'm not seeing how rt_avg
> is much better.

Rt_avg is much closer to the current behavior offered by the most
commonly used cpufreq governors since it tracks actual CPU utilization.
Power is not impacted by minimal RT activity and the frequency is raised
if RT activity is high.



Re: [PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-31 Thread Steve Muckle
On Wed, Aug 31, 2016 at 03:31:07AM +0200, Rafael J. Wysocki wrote:
> On Friday, August 26, 2016 11:40:48 AM Steve Muckle wrote:
> > A policy of going to fmax on any RT activity will be detrimental
> > for power on many platforms. Often RT accounts for only a small amount
> > of CPU activity so sending the CPU frequency to fmax is overkill. Worse
> > still, some platforms may not be able to even complete the CPU frequency
> > change before the RT activity has already completed.
> > 
> > Cpufreq governors have not treated RT activity this way in the past so
> > it is not part of the expected semantics of the RT scheduling class. The
> > DL class offers guarantees about task completion and could be used for
> > this purpose.
> > 
> > Modify the schedutil algorithm to instead use rt_avg as an estimate of
> > RT utilization of the CPU.
> > 
> > Based on previous work by Vincent Guittot .
> 
> If we do it for RT, why not to do a similar thing for DL?  As in the
> original patch from Peter, for example?

Agreed DL should have a similar change. I think that could be done in a
separate patch. I also would need to discuss it with the deadline sched
devs to fully understand the metric used there.

> 
> > Signed-off-by: Steve Muckle 
> > ---
> >  kernel/sched/cpufreq_schedutil.c | 26 +-
> >  1 file changed, 17 insertions(+), 9 deletions(-)
> > 
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index cb8a77b1ef1b..89094a466250 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -146,13 +146,21 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
> >  
> >  static void sugov_get_util(unsigned long *util, unsigned long *max)
> >  {
> > -   struct rq *rq = this_rq();
> > -   unsigned long cfs_max;
> > +   int cpu = smp_processor_id();
> > +   struct rq *rq = cpu_rq(cpu);
> > +   unsigned long max_cap, rt;
> > +   s64 delta;
> >  
> > -   cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> > +   max_cap = arch_scale_cpu_capacity(NULL, cpu);
> >  
> > -   *util = min(rq->cfs.avg.util_avg, cfs_max);
> > -   *max = cfs_max;
> > +   delta = rq_clock(rq) - rq->age_stamp;
> > +   if (unlikely(delta < 0))
> > +   delta = 0;
> > +   rt = div64_u64(rq->rt_avg, sched_avg_period() + delta);
> > +   rt = (rt * max_cap) >> SCHED_CAPACITY_SHIFT;
> 
> These computations are rather heavy, so I wonder if they are avoidable based
> on the flags, for example?

Yeah the div is bad. I don't know that we can avoid it based on the
flags because rt_avg will decay during CFS activity and you'd want to
take note of that.

One way to make this a little better is to assume that the divisor,
sched_avg_period() + delta, fits into 32 bits so that div_u64 can be
used, which I believe is less bad. Doing that means placing a
restriction on how large sysctl_sched_time_avg (which determines
sched_avg_period()) can be, a max of 4.2 seconds I think. I don't know
that anyone uses a value that large anyway but there's currently no
limit on it.

Another option would be just adding another separate metric to track rt
activity that is more mathematically favorable to deal with.

Both these seemed potentially heavy handed so I figured I'd just start
with the obvious, if suboptimal, solution...
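
A sketch of that first option (assuming the sysctl is clamped so the
divisor always fits in 32 bits - untested):

	u64 divisor = sched_avg_period() + delta;

	/*
	 * div_u64() takes a 32-bit divisor and is cheaper than
	 * div64_u64() on 32-bit architectures. Only valid if
	 * sched_avg_period() + delta stays below 2^32 ns (~4.2 s).
	 */
	rt = div_u64(rq->rt_avg, (u32)divisor);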

> Plus is SCHED_CAPACITY_SHIFT actually defined for all architectures?

Yes.

> One more ugly thing is about using rq_clock(rq) directly from here whereas we
> pass it around as the 'time' argument elsewhere.

Sure I'll clean this up.

> 
> > +
> > +   *util = min(rq->cfs.avg.util_avg + rt, max_cap);
> > +   *max = max_cap;
> >  }


[PATCH 2/2] sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

2016-08-26 Thread Steve Muckle
A policy of going to fmax on any RT activity will be detrimental
for power on many platforms. Often RT accounts for only a small amount
of CPU activity so sending the CPU frequency to fmax is overkill. Worse
still, some platforms may not be able to even complete the CPU frequency
change before the RT activity has already completed.

Cpufreq governors have not treated RT activity this way in the past so
it is not part of the expected semantics of the RT scheduling class. The
DL class offers guarantees about task completion and could be used for
this purpose.

Modify the schedutil algorithm to instead use rt_avg as an estimate of
RT utilization of the CPU.

Based on previous work by Vincent Guittot .

Signed-off-by: Steve Muckle 
---
 kernel/sched/cpufreq_schedutil.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index cb8a77b1ef1b..89094a466250 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -146,13 +146,21 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
 
 static void sugov_get_util(unsigned long *util, unsigned long *max)
 {
-   struct rq *rq = this_rq();
-   unsigned long cfs_max;
+   int cpu = smp_processor_id();
+   struct rq *rq = cpu_rq(cpu);
+   unsigned long max_cap, rt;
+   s64 delta;
 
-   cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
+   max_cap = arch_scale_cpu_capacity(NULL, cpu);
 
-   *util = min(rq->cfs.avg.util_avg, cfs_max);
-   *max = cfs_max;
+   delta = rq_clock(rq) - rq->age_stamp;
+   if (unlikely(delta < 0))
+   delta = 0;
+   rt = div64_u64(rq->rt_avg, sched_avg_period() + delta);
+   rt = (rt * max_cap) >> SCHED_CAPACITY_SHIFT;
+
+   *util = min(rq->cfs.avg.util_avg + rt, max_cap);
+   *max = max_cap;
 }
 
 static void sugov_update_single(struct update_util_data *hook, u64 time,
@@ -167,7 +175,7 @@ static void sugov_update_single(struct update_util_data *hook, u64 time,
if (!sugov_should_update_freq(sg_policy, time))
return;
 
-   if (flags & SCHED_CPUFREQ_RT_DL) {
+   if (flags & SCHED_CPUFREQ_DL) {
next_f = policy->cpuinfo.max_freq;
} else {
sugov_get_util(&util, &max);
@@ -186,7 +194,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
u64 last_freq_update_time = sg_policy->last_freq_update_time;
unsigned int j;
 
-   if (flags & SCHED_CPUFREQ_RT_DL)
+   if (flags & SCHED_CPUFREQ_DL)
return max_f;
 
for_each_cpu(j, policy->cpus) {
@@ -209,7 +217,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
if (delta_ns > TICK_NSEC)
continue;
 
-   if (j_sg_cpu->flags & SCHED_CPUFREQ_RT_DL)
+   if (j_sg_cpu->flags & SCHED_CPUFREQ_DL)
return max_f;
 
j_util = j_sg_cpu->util;
@@ -467,7 +475,7 @@ static int sugov_start(struct cpufreq_policy *policy)
if (policy_is_shared(policy)) {
sg_cpu->util = 0;
sg_cpu->max = 0;
-   sg_cpu->flags = SCHED_CPUFREQ_RT;
+   sg_cpu->flags = SCHED_CPUFREQ_DL;
sg_cpu->last_update = 0;
sg_cpu->cached_raw_freq = 0;
cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
-- 
2.7.3



[PATCH 0/2] utilization changes for schedutil

2016-08-26 Thread Steve Muckle
The first patch (which is a re-send) corrects an issue with utilization on SMT
platforms. Currently CFS utilization is being roughly doubled on these
platforms due to differences in the way PELT and overall max CPU capacity are
calculated.

The second patch moves away from the policy of going to fmax during RT task
activity, instead using rt_avg as an estimate of RT utilization.

Steve Muckle (2):
  sched: cpufreq: ignore SMT when determining max cpu capacity
  sched: cpufreq: use rt_avg as estimate of required RT CPU capacity

 kernel/sched/cpufreq_schedutil.c | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

-- 
2.7.3



[PATCH 1/2] sched: cpufreq: ignore SMT when determining max cpu capacity

2016-08-26 Thread Steve Muckle
PELT does not consider SMT when scaling its utilization values via
arch_scale_cpu_capacity(). The value in rq->cpu_capacity_orig does
take SMT into consideration though and therefore may be smaller than
the utilization reported by PELT.

On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
util_avg scales up to 1024. This means that a 50% utilized CPU will show
up in schedutil as ~86% busy.
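
The arithmetic behind those numbers: a 50% duty cycle gives util_avg of
about 512, and 512 / 589 is roughly 0.87, i.e. the ~86% figure. The 589
itself comes from the SMT capacity model: assuming the default smt_gain
of 1178 split across two hardware threads, each thread is assigned
1178 / 2 = 589.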

Fix this by using the same CPU scaling value in schedutil as that which
is used by PELT.

Signed-off-by: Steve Muckle 
---
 kernel/sched/cpufreq_schedutil.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 60d985f4dc47..cb8a77b1ef1b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -147,7 +147,9 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
 static void sugov_get_util(unsigned long *util, unsigned long *max)
 {
struct rq *rq = this_rq();
-   unsigned long cfs_max = rq->cpu_capacity_orig;
+   unsigned long cfs_max;
+
+   cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
 
*util = min(rq->cfs.avg.util_avg, cfs_max);
*max = cfs_max;
-- 
2.7.3



[PATCH] sched: cpufreq: ignore SMT when determining max cpu capacity

2016-08-19 Thread Steve Muckle
PELT does not consider SMT when scaling its utilization values via
arch_scale_cpu_capacity(). The value in rq->cpu_capacity_orig does
take SMT into consideration though and therefore may be smaller than
the utilization reported by PELT.

On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
util_avg scales up to 1024. This means that a 50% utilized CPU will show
up in schedutil as ~86% busy.

Fix this by using the same CPU scaling value in schedutil as that which
is used by PELT.

Signed-off-by: Steve Muckle 
---
 kernel/sched/cpufreq_schedutil.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 60d985f4dc47..cb8a77b1ef1b 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -147,7 +147,9 @@ static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
 static void sugov_get_util(unsigned long *util, unsigned long *max)
 {
struct rq *rq = this_rq();
-   unsigned long cfs_max = rq->cpu_capacity_orig;
+   unsigned long cfs_max;
+
+   cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
 
*util = min(rq->cfs.avg.util_avg, cfs_max);
*max = cfs_max;
-- 
2.7.3



Re: [PATCH] sched: fix incorrect PELT values on SMT

2016-08-19 Thread Steve Muckle
On Fri, Aug 19, 2016 at 04:00:57PM +0100, Dietmar Eggemann wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 61d485421bed..95d34b337152 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2731,7 +2731,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > sa->last_update_time = now;
> >  
> > scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > -   scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +   scale_cpu = arch_scale_cpu_capacity(cpu_rq(cpu)->sd, cpu);
> 
> Wouldn't you have to subscribe to this rcu pointer rq->sd w/ something
> like 'rcu_dereference(cpu_rq(cpu)->sd)'?
> 
> IMHO, __update_load_avg() is called outside existing RCU read-side
> critical sections as well so there would be a pair of
> rcu_read_lock()/rcu_read_unlock() required in this case.
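
What Dietmar describes would look something like this (a hypothetical
sketch, not a proposed patch):

	struct sched_domain *sd;

	rcu_read_lock();
	sd = rcu_dereference(cpu_rq(cpu)->sd);	/* rq->sd is RCU-protected */
	scale_cpu = arch_scale_cpu_capacity(sd, cpu);
	rcu_read_unlock();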

Thanks Dietmar for the review.

Yeah I didn't consider that this was protected with rcu. It looks like
I'm abandoning this approach anyway though and doing something limited
just to schedutil.

thanks,
Steve


Re: [PATCH] sched: fix incorrect PELT values on SMT

2016-08-19 Thread Steve Muckle
On Fri, Aug 19, 2016 at 04:30:39PM +0100, Morten Rasmussen wrote:
> Hi Steve,
> 
> On Thu, Aug 18, 2016 at 06:55:41PM -0700, Steve Muckle wrote:
> > PELT scales its util_sum and util_avg values via
> > arch_scale_cpu_capacity(). If that function is passed the CPU's sched
> > domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
> > is set. PELT does not pass in the sd however. The other caller of
> > arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
> > util_sum and util_avg scale beyond the CPU capacity on SMT.
> > 
> > On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
> > util_avg scales up to 1024.
> 
> I can't convince myself whether this is the right thing to do. SMT is a
> bit 'special' and it depends on how you model SMT capacity.
> 
> I'm no SMT expert, but the way I understand the current SMT capacity
> model is that capacity_orig represents the capacity of the SMT-thread
> when all its thread-siblings are busy. The true capacity of an
> SMT-thread where all thread-siblings are idle is actually 1024, but we
> don't model this (it would be nightmare to track when the capacity
> should change). The capacity of a core with two or more SMT-threads is
> chosen to be 1024 + smt_gain, where smt_gain is supposed to represent the
> additional throughput we gain for the additional SMT-threads. The reason
> why we don't have 1024 per thread is that we would prefer to have only
> one task per core if possible.
> 
> With util_avg scaling to 1024 a core (capacity = 2*589) would be nearly
> 'full' with just one always-running task. If we change util_avg to max
> out at 589, it would take two always-running tasks for the combined
> utilization to match the core capacity. So we may lose some bias
> towards spreading for SMT systems.
> 
> AFAICT, group_is_overloaded() and group_has_capacity() would both be
> affected by this patch.
> 
> Interestingly, Vincent recently proposed to set the SMT-thread capacity
> to 1024 which would effectively make all the current SMT code redundant.
> It would make things a lot simpler, but I'm not sure if we can get away
> with it. It would need discussion at least.
> 
> Opinions?

Thanks for having a look.

The reason I pushed this patch was to address an issue with the
schedutil governor - demand is effectively doubled on SMT systems due to
the above scheme. But this can just be fixed for schedutil by using a
max value there consistent with what __update_load_avg() is using. I'll send
another patch. It looks like there's a good reason for the current PELT
scaling w.r.t. SMT in the scheduler/load balancer.
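
For reference, the default arch_scale_cpu_capacity() of this era reduces
per-thread capacity roughly as below (a sketch from memory, not a
verbatim copy). With the default smt_gain of 1178 and two siblings this
yields the 589 quoted above:

	static inline unsigned long
	arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
	{
		/* SMT: split smt_gain across the hardware siblings,
		 * e.g. 1178 / 2 = 589 per thread. */
		if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) &&
		    (sd->span_weight > 1))
			return sd->smt_gain / sd->span_weight;

		return SCHED_CAPACITY_SCALE;	/* 1024 */
	}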

thanks,
Steve



Re: [PATCH] sched: fix incorrect PELT values on SMT

2016-08-18 Thread Steve Muckle
On Fri, Aug 19, 2016 at 10:30:36AM +0800, Wanpeng Li wrote:
> 2016-08-19 9:55 GMT+08:00 Steve Muckle :
> > PELT scales its util_sum and util_avg values via
> > arch_scale_cpu_capacity(). If that function is passed the CPU's sched
> > domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
> > is set. PELT does not pass in the sd however. The other caller of
> > arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
> > util_sum and util_avg scale beyond the CPU capacity on SMT.
> >
> > On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
> > util_avg scales up to 1024.
> >
> > Fix this by passing in the sd in __update_load_avg() as well.
> 
> I believe we notice this at least several months ago.
> https://lkml.org/lkml/2016/5/25/228

Glad to see I'm not alone in thinking this is an issue.

It causes an issue with schedutil, effectively doubling the apparent
demand on SMT. I don't know the load balance code well enough offhand to
say whether it's an issue there.

cheers,
Steve


[PATCH] sched: fix incorrect PELT values on SMT

2016-08-18 Thread Steve Muckle
PELT scales its util_sum and util_avg values via
arch_scale_cpu_capacity(). If that function is passed the CPU's sched
domain then it will reduce the scaling capacity if SD_SHARE_CPUCAPACITY
is set. PELT does not pass in the sd however. The other caller of
arch_scale_cpu_capacity, update_cpu_capacity(), does. This means
util_sum and util_avg scale beyond the CPU capacity on SMT.

On an Intel i7-3630QM for example rq->cpu_capacity_orig is 589 but
util_avg scales up to 1024.

Fix this by passing in the sd in __update_load_avg() as well.

Signed-off-by: Steve Muckle 
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61d485421bed..95d34b337152 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2731,7 +2731,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
sa->last_update_time = now;
 
scale_freq = arch_scale_freq_capacity(NULL, cpu);
-   scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+   scale_cpu = arch_scale_cpu_capacity(cpu_rq(cpu)->sd, cpu);
 
/* delta_w is the amount already accumulated against our next period */
delta_w = sa->period_contrib;
-- 
2.7.3



Re: [PATCH v2 2/2] cpufreq / sched: Pass runqueue pointer to cpufreq_update_util()

2016-08-15 Thread Steve Muckle
LGTM

On Fri, Aug 12, 2016 at 02:06:44AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> All of the callers of cpufreq_update_util() pass rq_clock(rq) to it
> as the time argument and some of them check whether or not cpu_of(rq)
> is equal to smp_processor_id() before calling it, so rework it to
> take a runqueue pointer as the argument and move the rq_clock(rq)
> evaluation into it.
> 
> Additionally, provide a wrapper checking cpu_of(rq) against
> smp_processor_id() for the cpufreq_update_util() callers that
> need it.
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> This is a new patch based on the [2/2] from v1.
> 
> ---
>  kernel/sched/deadline.c |3 +--
>  kernel/sched/fair.c |4 +---
>  kernel/sched/rt.c   |3 +--
>  kernel/sched/sched.h|   15 +++
>  4 files changed, 14 insertions(+), 11 deletions(-)
> 
> Index: linux-pm/kernel/sched/deadline.c
> ===
> --- linux-pm.orig/kernel/sched/deadline.c
> +++ linux-pm/kernel/sched/deadline.c
> @@ -733,8 +733,7 @@ static void update_curr_dl(struct rq *rq
>   }
>  
>   /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> - if (cpu_of(rq) == smp_processor_id())
> - cpufreq_update_util(rq_clock(rq), SCHED_CPUFREQ_DL);
> + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_DL);
>  
>   schedstat_set(curr->se.statistics.exec_max,
> max(curr->se.statistics.exec_max, delta_exec));
> Index: linux-pm/kernel/sched/fair.c
> ===
> --- linux-pm.orig/kernel/sched/fair.c
> +++ linux-pm/kernel/sched/fair.c
> @@ -2876,8 +2876,6 @@ static inline void update_tg_load_avg(st
>  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>  {
>   if (&this_rq()->cfs == cfs_rq) {
> - struct rq *rq = rq_of(cfs_rq);
> -
>   /*
>* There are a few boundary cases this might miss but it should
>* get called often enough that that should (hopefully) not be
> @@ -2894,7 +2892,7 @@ static inline void cfs_rq_util_change(st
>*
>* See cpu_util().
>*/
> - cpufreq_update_util(rq_clock(rq), 0);
> + cpufreq_update_util(rq_of(cfs_rq), 0);
>   }
>  }
>  
> Index: linux-pm/kernel/sched/rt.c
> ===
> --- linux-pm.orig/kernel/sched/rt.c
> +++ linux-pm/kernel/sched/rt.c
> @@ -958,8 +958,7 @@ static void update_curr_rt(struct rq *rq
>   return;
>  
>   /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> - if (cpu_of(rq) == smp_processor_id())
> - cpufreq_update_util(rq_clock(rq), SCHED_CPUFREQ_RT);
> + cpufreq_update_this_cpu(rq, SCHED_CPUFREQ_RT);
>  
>   schedstat_set(curr->se.statistics.exec_max,
> max(curr->se.statistics.exec_max, delta_exec));
> Index: linux-pm/kernel/sched/sched.h
> ===
> --- linux-pm.orig/kernel/sched/sched.h
> +++ linux-pm/kernel/sched/sched.h
> @@ -1763,7 +1763,7 @@ DECLARE_PER_CPU(struct update_util_data
>  
>  /**
>   * cpufreq_update_util - Take a note about CPU utilization changes.
> - * @time: Current time.
> + * @rq: Runqueue to carry out the update for.
>   * @flags: Update reason flags.
>   *
>   * This function is called by the scheduler on the CPU whose utilization is
> @@ -1783,16 +1783,23 @@ DECLARE_PER_CPU(struct update_util_data
>   * but that really is a band-aid.  Going forward it should be replaced with
>   * solutions targeted more specifically at RT and DL tasks.
>   */
> -static inline void cpufreq_update_util(u64 time, unsigned int flags)
> +static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
>  {
>   struct update_util_data *data;
>  
>   data = rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data));
>   if (data)
> - data->func(data, time, flags);
> + data->func(data, rq_clock(rq), flags);
> +}
> +
> +static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int flags)
> +{
> + if (cpu_of(rq) == smp_processor_id())
> + cpufreq_update_util(rq, flags);
>  }
>  #else
> -static inline void cpufreq_update_util(u64 time, unsigned int flags) {}
> +static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> +static inline void cpufreq_update_this_cpu(struct rq *rq, unsigned int 
> flags) {}
>  #endif /* CONFIG_CPU_FREQ */
>  
>  #ifdef arch_scale_freq_capacity
> 


Re: [PATCH v2 1/2] cpufreq / sched: Pass flags to cpufreq_update_util()

2016-08-15 Thread Steve Muckle
LGTM

On Fri, Aug 12, 2016 at 02:04:42AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> It is useful to know the reason why cpufreq_update_util() has just
> been called and that can be passed as flags to cpufreq_update_util()
> and to the ->func() callback in struct update_util_data.  However,
> doing that in addition to passing the util and max arguments they
> already take would be clumsy, so avoid it.
> 
> Instead, use the observation that the schedutil governor is part
> of the scheduler proper, so it can access scheduler data directly.
> This allows the util and max arguments of cpufreq_update_util()
> and the ->func() callback in struct update_util_data to be replaced
> with a flags one, but schedutil has to be modified to follow.
> 
> Thus make the schedutil governor obtain the CFS utilization
> information from the scheduler and use the "RT" and "DL" flags
> instead of the special utilization value of ULONG_MAX to track
> updates from the RT and DL sched classes.  Make it non-modular
> too to avoid having to export scheduler variables to modules at
> large.
> 
> Next, update all of the other users of cpufreq_update_util()
> and the ->func() callback in struct update_util_data accordingly.
> 
> Suggested-by: Peter Zijlstra 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> v1 -> v2: Do not check cpu_of(rq) against smp_processor_id() in
>   cfs_rq_util_change().
> 
> ---
>  drivers/cpufreq/Kconfig|5 --
>  drivers/cpufreq/cpufreq_governor.c |2 -
>  drivers/cpufreq/intel_pstate.c |2 -
>  include/linux/sched.h  |   12 --
>  kernel/sched/cpufreq.c |2 -
>  kernel/sched/cpufreq_schedutil.c   |   67 
> -
>  kernel/sched/deadline.c|4 +-
>  kernel/sched/fair.c|   10 +
>  kernel/sched/rt.c  |4 +-
>  kernel/sched/sched.h   |   31 +
>  10 files changed, 66 insertions(+), 73 deletions(-)
> 
> Index: linux-pm/drivers/cpufreq/cpufreq_governor.c
> ===
> --- linux-pm.orig/drivers/cpufreq/cpufreq_governor.c
> +++ linux-pm/drivers/cpufreq/cpufreq_governor.c
> @@ -260,7 +260,7 @@ static void dbs_irq_work(struct irq_work
>  }
>  
>  static void dbs_update_util_handler(struct update_util_data *data, u64 time,
> - unsigned long util, unsigned long max)
> + unsigned int flags)
>  {
>   struct cpu_dbs_info *cdbs = container_of(data, struct cpu_dbs_info, 
> update_util);
>   struct policy_dbs_info *policy_dbs = cdbs->policy_dbs;
> Index: linux-pm/drivers/cpufreq/intel_pstate.c
> ===
> --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
> +++ linux-pm/drivers/cpufreq/intel_pstate.c
> @@ -1329,7 +1329,7 @@ static inline void intel_pstate_adjust_b
>  }
>  
>  static void intel_pstate_update_util(struct update_util_data *data, u64 time,
> -  unsigned long util, unsigned long max)
> +  unsigned int flags)
>  {
>   struct cpudata *cpu = container_of(data, struct cpudata, update_util);
>   u64 delta_ns = time - cpu->sample.time;
> Index: linux-pm/include/linux/sched.h
> ===
> --- linux-pm.orig/include/linux/sched.h
> +++ linux-pm/include/linux/sched.h
> @@ -3469,15 +3469,19 @@ static inline unsigned long rlimit_max(u
>   return task_rlimit_max(current, limit);
>  }
>  
> +#define SCHED_CPUFREQ_RT (1U << 0)
> +#define SCHED_CPUFREQ_DL (1U << 1)
> +
> +#define SCHED_CPUFREQ_RT_DL  (SCHED_CPUFREQ_RT | SCHED_CPUFREQ_DL)
> +
>  #ifdef CONFIG_CPU_FREQ
>  struct update_util_data {
> - void (*func)(struct update_util_data *data,
> -  u64 time, unsigned long util, unsigned long max);
> +   void (*func)(struct update_util_data *data, u64 time, unsigned int 
> flags);
>  };
>  
>  void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
> - void (*func)(struct update_util_data *data, u64 time,
> -  unsigned long util, unsigned long max));
> +   void (*func)(struct update_util_data *data, u64 time,
> + unsigned int flags));
>  void cpufreq_remove_update_util_hook(int cpu);
>  #endif /* CONFIG_CPU_FREQ */
>  
> Index: linux-pm/kernel/sched/cpufreq.c
> ===
> --- linux-pm.orig/kernel/sched/cpufreq.c
> +++ linux-pm/kernel/sched/cpufreq.c
> @@ -33,7 +33,7 @@ DEFINE_PER_CPU(struct update_util_data *
>   */
>  void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
>   void (*func)(struct update_util_data *data, u64 time,
> -  unsigned long 

Re: [PATCH 2/2] cpufreq / sched: Check cpu_of(rq) in cpufreq_update_util()

2016-08-11 Thread Steve Muckle
On Wed, Aug 10, 2016 at 03:11:17AM +0200, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/fair.c
> ===
> --- linux-pm.orig/kernel/sched/fair.c
> +++ linux-pm/kernel/sched/fair.c
> @@ -2876,8 +2876,6 @@ static inline void update_tg_load_avg(st
>  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>  {
>   if (&this_rq()->cfs == cfs_rq) {
> - struct rq *rq = rq_of(cfs_rq);
> -
>   /*
>* There are a few boundary cases this might miss but it should
>* get called often enough that that should (hopefully) not be
> @@ -2894,8 +2892,7 @@ static inline void cfs_rq_util_change(st
>*
>* See cpu_util().
>*/
> - if (cpu_of(rq) == smp_processor_id())
> - cpufreq_update_util(rq_clock(rq), 0);
> + cpufreq_update_util(rq_of(cfs_rq), 0);
>   }
...
> +static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
>  {
>   struct update_util_data *data;
>  
> + if (cpu_of(rq) != smp_processor_id())
> + return;

This test is unecessary in the CFS (most common) path due to the check
on this_rq in cfs_rq_util_change().

I think instead of bringing the test into cpufreq_update_util it should
be left at the call sites for RT and DL, and removed from CFS as part of
the first patch.

thanks,
Steve


Re: [Update][PATCH 1/2] cpufreq / sched: Pass flags to cpufreq_update_util()

2016-08-11 Thread Steve Muckle
On Thu, Aug 11, 2016 at 11:03:47AM -0700, Steve Muckle wrote:
> On Wed, Aug 10, 2016 at 03:49:07AM +0200, Rafael J. Wysocki wrote:
> > Index: linux-pm/kernel/sched/fair.c
> > ===
> > --- linux-pm.orig/kernel/sched/fair.c
> > +++ linux-pm/kernel/sched/fair.c
> > @@ -2875,11 +2875,8 @@ static inline void update_tg_load_avg(st
> >  
> >  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
> >  {
> > -   struct rq *rq = rq_of(cfs_rq);
> > -   int cpu = cpu_of(rq);
> > -
> > -   if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> > -   unsigned long max = rq->cpu_capacity_orig;
> > +   if (&this_rq()->cfs == cfs_rq) {
> > +   struct rq *rq = rq_of(cfs_rq);
> >  
> > /*
> >  * There are a few boundary cases this might miss but it should
> > @@ -2897,8 +2894,8 @@ static inline void cfs_rq_util_change(st
> >  *
> >  * See cpu_util().
> >  */
> > -   cpufreq_update_util(rq_clock(rq),
> > -   min(cfs_rq->avg.util_avg, max), max);
> > +   if (cpu_of(rq) == smp_processor_id())
> 
> Isn't this test against smp_processor_id() redundant since
> this_rq()->cfs == cfs_rq?

Sorry, I see this is modified in the next patch.


Re: [Update][PATCH 1/2] cpufreq / sched: Pass flags to cpufreq_update_util()

2016-08-11 Thread Steve Muckle
On Wed, Aug 10, 2016 at 03:49:07AM +0200, Rafael J. Wysocki wrote:
> Index: linux-pm/kernel/sched/fair.c
> ===
> --- linux-pm.orig/kernel/sched/fair.c
> +++ linux-pm/kernel/sched/fair.c
> @@ -2875,11 +2875,8 @@ static inline void update_tg_load_avg(st
>  
>  static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
>  {
> - struct rq *rq = rq_of(cfs_rq);
> - int cpu = cpu_of(rq);
> -
> - if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> - unsigned long max = rq->cpu_capacity_orig;
> + if (&this_rq()->cfs == cfs_rq) {
> + struct rq *rq = rq_of(cfs_rq);
>  
>   /*
>* There are a few boundary cases this might miss but it should
> @@ -2897,8 +2894,8 @@ static inline void cfs_rq_util_change(st
>*
>* See cpu_util().
>*/
> - cpufreq_update_util(rq_clock(rq),
> - min(cfs_rq->avg.util_avg, max), max);
> + if (cpu_of(rq) == smp_processor_id())

Isn't this test against smp_processor_id() redundant since
this_rq()->cfs == cfs_rq?

> + cpufreq_update_util(rq_clock(rq), 0);

All else looked good to me.

thanks,
Steve


Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition

2016-08-04 Thread Steve Muckle
On Thu, Aug 04, 2016 at 11:19:00PM +0200, Rafael J. Wysocki wrote:
> On Wednesday, August 03, 2016 07:24:18 PM Steve Muckle wrote:
> > On Wed, Aug 03, 2016 at 12:38:20AM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Aug 3, 2016 at 12:02 AM, Steve Muckle  
> > > wrote:
> > > > On Tue, Aug 02, 2016 at 03:37:02AM +0200, Rafael J. Wysocki wrote:
> > > >> On Tue, Aug 2, 2016 at 3:22 AM, Steve Muckle  
> > > >> wrote:
> > > >> > On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
> > > >> > ...
> > > >> >> For this purpose, define a new cpufreq_update_util() flag
> > > >> >> UUF_IO and modify enqueue_task_fair() to pass that flag to
> > > >> >> cpufreq_update_util() in the in_iowait case.  That generally
> > > >> >> requires cpufreq_update_util() to be called directly from there,
> > > >> >> because update_load_avg() is not likely to be invoked in that
> > > >> >> case.
> > > >> >
> > > >> > I didn't follow why the cpufreq hook won't likely be called if
> > > >> > in_iowait is set? AFAICS update_load_avg() gets called in the second 
> > > >> > loop
> > > >> > and calls update_cfs_rq_load_avg (triggers the hook).
> > > >>
> > > >> In practice it turns out that in the majority of cases when in_iowait
> > > >> is set the second loop will not run.
> > > >
> > > > My understanding of enqueue_task_fair() is that the first loop walks up
> > > > the portion of the sched_entity hierarchy that needs to be enqueued, and
> > > > the second loop updates the rest of the hierarchy that was already
> > > > enqueued.
> > > >
> > > > Even if the se corresponding to the root cfs_rq needs to be enqueued
> > > > (meaning the whole hierarchy is traversed in the first loop and the
> > > > second loop does nothing), enqueue_entity() on the root cfs_rq should
> > > > result in the cpufreq hook being called, via enqueue_entity() ->
> > > > enqueue_entity_load_avg() -> update_cfs_rq_load_avg().
> > > 
> > > But then it's rather difficult to pass the IO flag to this one, isn't it?
> > > 
> > > Essentially, the problem is to pass "IO" to cpufreq_update_util() when
> > > p->in_iowait is set.
> > > 
> > > If you can find a clever way to do it without adding an extra call
> > > site, that's fine by me, but in any case the extra
> > > cpufreq_update_util() invocation should not be too expensive.
> > 
> > I was under the impression that function pointer calls were more
> > expensive, and in the shared policy case there is a nontrivial amount of
> > code that is run in schedutil (including taking a spinlock) before we'd
> > see sugov_should_update_freq() return false and bail.
> 
> That's correct in principle, but we only do that if p->in_iowait is set,
> which is somewhat special anyway and doesn't happen every time for sure.
> 
> So while there is overhead theoretically, I'm not even sure if it is 
> measurable.

Ok, my worry was that there might be IO-heavy workloads that would
hammer this path, but I don't know of any specifically, or how often
this path can be taken.

> 
> > Agreed that getting knowledge of p->in_iowait down to the existing hook
> > is not easy. I spent some time fiddling with that. It seemed doable but
> > somewhat gross due to the required flag passing and modifications
> > to enqueue_entity, update_load_avg, etc. If it is decided that it is worth
> > pursuing I can keep working on it and post a draft.
> 
> Well, that's a Peter's call. :-)
> 
> > But I also wonder if the hooks are in the best location.  They are
> > currently deep in the PELT code. This may make sense from a theoretical
> > standpoint, calling them whenever a root cfs_rq utilization changes, but
> > it also makes the hooks difficult to correlate (for policy purposes such
> > as this iowait change) with higher level logical events like a task
> > wakeup. Or load balance where we probably want to call the hook just
> > once after a load balance is complete.
> 
> I generally agree.  We still need to ensure that the hools will be invoked
> frequently enough, though, even if HZ is 100.
> 
> > This is also an issue for the remote wakeup case where I currently have
> > another invocation of the hook in check_preempt_curr(), so I can know if
> > preemption was triggered and skip a remote schedutil update in that case
> > to avoid a duplicate IPI.
> > 
> > It seems to me worth evaluating if a higher level set of hook locations
> > could be used. One possibility is higher up in CFS:
> > - enqueue_task_fair, dequeue_task_fair
> > - scheduler_tick
> > - active_load_balance_cpu_stop, load_balance
> 
> Agreed, that's worth checking.
> 
> > Though this wouldn't solve my issue with check_preempt_curr. That would
> > probably require going further up the stack to try_to_wake_up() etc. Not
> > yet sure what the other hook locations would be at that level.
> 
> That's probably too far away from the root cfs_rq utilization changes IMO.

Is your concern that the rate of hook calls would be decreased?

thanks,
Steve


Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition

2016-08-03 Thread Steve Muckle
On Wed, Aug 03, 2016 at 12:38:20AM +0200, Rafael J. Wysocki wrote:
> On Wed, Aug 3, 2016 at 12:02 AM, Steve Muckle  wrote:
> > On Tue, Aug 02, 2016 at 03:37:02AM +0200, Rafael J. Wysocki wrote:
> >> On Tue, Aug 2, 2016 at 3:22 AM, Steve Muckle  
> >> wrote:
> >> > On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
> >> > ...
> >> >> For this purpose, define a new cpufreq_update_util() flag
> >> >> UUF_IO and modify enqueue_task_fair() to pass that flag to
> >> >> cpufreq_update_util() in the in_iowait case.  That generally
> >> >> requires cpufreq_update_util() to be called directly from there,
> >> >> because update_load_avg() is not likely to be invoked in that
> >> >> case.
> >> >
> >> > I didn't follow why the cpufreq hook won't likely be called if
> >> > in_iowait is set? AFAICS update_load_avg() gets called in the second loop
> >> > and calls update_cfs_rq_load_avg (triggers the hook).
> >>
> >> In practice it turns out that in the majority of cases when in_iowait
> >> is set the second loop will not run.
> >
> > My understanding of enqueue_task_fair() is that the first loop walks up
> > the portion of the sched_entity hierarchy that needs to be enqueued, and
> > the second loop updates the rest of the hierarchy that was already
> > enqueued.
> >
> > Even if the se corresponding to the root cfs_rq needs to be enqueued
> > (meaning the whole hierarchy is traversed in the first loop and the
> > second loop does nothing), enqueue_entity() on the root cfs_rq should
> > result in the cpufreq hook being called, via enqueue_entity() ->
> > enqueue_entity_load_avg() -> update_cfs_rq_load_avg().
> 
> But then it's rather difficult to pass the IO flag to this one, isn't it?
> 
> Essentially, the problem is to pass "IO" to cpufreq_update_util() when
> p->in_iowait is set.
> 
> If you can find a clever way to do it without adding an extra call
> site, that's fine by me, but in any case the extra
> cpufreq_update_util() invocation should not be too expensive.

I was under the impression that function pointer calls were more
expensive, and in the shared policy case there is a nontrivial amount of
code that is run in schedutil (including taking a spinlock) before we'd
see sugov_should_update_freq() return false and bail.

Agreed that getting knowledge of p->in_iowait down to the existing hook
is not easy. I spent some time fiddling with that. It seemed doable but
somewhat gross due to the required flag passing and modifications
to enqueue_entity, update_load_avg, etc. If it is decided that it is worth
pursuing I can keep working on it and post a draft.

But I also wonder if the hooks are in the best location.  They are
currently deep in the PELT code. This may make sense from a theoretical
standpoint, calling them whenever a root cfs_rq utilization changes, but
it also makes the hooks difficult to correlate (for policy purposes such
as this iowait change) with higher level logical events like a task
wakeup. Or load balance where we probably want to call the hook just
once after a load balance is complete.

This is also an issue for the remote wakeup case where I currently have
another invocation of the hook in check_preempt_curr(), so I can know if
preemption was triggered and skip a remote schedutil update in that case
to avoid a duplicate IPI.

It seems to me worth evaluating if a higher level set of hook locations
could be used. One possibility is higher up in CFS:
- enqueue_task_fair, dequeue_task_fair
- scheduler_tick
- active_load_balance_cpu_stop, load_balance

Though this wouldn't solve my issue with check_preempt_curr. That would
probably require going further up the stack to try_to_wake_up() etc. Not
yet sure what the other hook locations would be at that level.

thanks,
Steve


Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition

2016-08-02 Thread Steve Muckle
On Tue, Aug 02, 2016 at 03:37:02AM +0200, Rafael J. Wysocki wrote:
> On Tue, Aug 2, 2016 at 3:22 AM, Steve Muckle  wrote:
> > On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
> > ...
> >> For this purpose, define a new cpufreq_update_util() flag
> >> UUF_IO and modify enqueue_task_fair() to pass that flag to
> >> cpufreq_update_util() in the in_iowait case.  That generally
> >> requires cpufreq_update_util() to be called directly from there,
> >> because update_load_avg() is not likely to be invoked in that
> >> case.
> >
> > I didn't follow why the cpufreq hook won't likely be called if
> > in_iowait is set? AFAICS update_load_avg() gets called in the second loop
> > and calls update_cfs_rq_load_avg (triggers the hook).
> 
> In practice it turns out that in the majority of cases when in_iowait
> is set the second loop will not run.

My understanding of enqueue_task_fair() is that the first loop walks up
the portion of the sched_entity hierarchy that needs to be enqueued, and
the second loop updates the rest of the hierarchy that was already
enqueued.

Even if the se corresponding to the root cfs_rq needs to be enqueued
(meaning the whole hierarchy is traversed in the first loop and the
second loop does nothing), enqueue_entity() on the root cfs_rq should
result in the cpufreq hook being called, via enqueue_entity() ->
enqueue_entity_load_avg() -> update_cfs_rq_load_avg().
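
Heavily abridged, the structure I'm describing is:

	for_each_sched_entity(se) {
		if (se->on_rq)
			break;	/* rest of hierarchy already enqueued */
		enqueue_entity(cfs_rq_of(se), se, flags);
		/* -> enqueue_entity_load_avg() -> update_cfs_rq_load_avg() */
	}

	for_each_sched_entity(se) {
		/* second loop: levels that were already enqueued */
		update_load_avg(se, 1);
		/* -> update_cfs_rq_load_avg() -> cpufreq hook */
	}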

I'll keep looking to see if I've misunderstood something in here.

thanks,
Steve


Re: [RFC][PATCH 1/7] cpufreq / sched: Make schedutil access utilization data directly

2016-08-02 Thread Steve Muckle
On Tue, Aug 02, 2016 at 11:38:17AM +0100, Juri Lelli wrote:
> > > Anyway one way that my patch differed was that I had used the flags
> > > field to keep the behavior the same for both RT and DL.
> 
> Do you mean a "go to max" policy for both, until proper policies are
> implemented in the future?

Yep.

> > That happens
> > > later on in this series for RT but the DL policy is modified as above.
> > > Can the DL policy be left as-is and discussed/modified in a separate
> > > series?
> 
> Not that we want to start discussing this point now, if we postpone the
> change for later, but I just wanted to point out a difference w.r.t.
> what the schedfreq thing was doing: it used to sum contributions from
> the different classes, instead of taking the max. We probably never
> really discussed on the list what is the right thing to do, though.

Yeah I figured that was worth deferring into its own patchset/thread.

cheers,
Steve


Re: [RFC][PATCH 4/7] cpufreq / sched: Add flags argument to cpufreq_update_util()

2016-08-01 Thread Steve Muckle
On Tue, Aug 02, 2016 at 01:44:41AM +0200, Rafael J. Wysocki wrote:
> On Monday, August 01, 2016 12:59:30 PM Steve Muckle wrote:
> > On Mon, Aug 01, 2016 at 04:57:18PM +0200, Rafael J. Wysocki wrote:
> > > On Monday, August 01, 2016 09:33:12 AM Dominik Brodowski wrote:
> > > > On Mon, Aug 01, 2016 at 01:36:46AM +0200, Rafael J. Wysocki wrote:
> > > > > +#define UUF_RT   0x01
> > > > 
> > > > What does UUF stand for?
> > > 
> > > "Utilization upadte flag".
> > 
> > I had wondered the same - in my patchset I used CPUFREQ_SCHED_UPDATE_* for 
> > the
> > prefixes, though I guess some may object to the length.
> 
> Well, OK.
> 
> I guess something like SCHED_CPUFREQ_RT etc would be sufficient?

Yeah I think that would work.

thanks,
Steve


Re: [RFC][PATCH 6/7] cpufreq: schedutil: Add iowait boosting

2016-08-01 Thread Steve Muckle
On Mon, Aug 01, 2016 at 01:37:59AM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> Modify the schedutil cpufreq governor to boost the CPU frequency
> if the UUF_IO flag is passed to it via cpufreq_update_util().
> 
> If that happens, the frequency is set to the maximum during
> the first update after receiving the UUF_IO flag and then the
> boost is reduced by half during each following update.

Were these changes to schedutil part of the positive test results
mentioned in patch 5? Or are those just from intel pstate?

I was nervous about the effect of this on power and tested a couple low
power usecases. The platform is the Hikey 96board (8 core ARM A53,
single CPUfreq domain) running AOSP Android and schedutil backported to
kernel 4.4. These tests run mp3 and mpeg4 playback for a little while,
recording total energy consumption during the test along with frequency
residency.

As the results below show I did not measure an appreciable effect - if
anything things may be slightly better with the patches.

The hardcoding of a non-tunable boosting scheme makes me nervous but
perhaps it could be revisited if some platform or configuration shows
a noticeable regression?

Testcase        Energy  /------- CPU frequency residency -------\
                (J)     208000  432000  729000  960000  1200000
mp3-before-1    26.822  47.27%  24.79%  16.23%  5.20%   6.52%
mp3-before-2    26.817  41.70%  28.75%  17.62%  5.17%   6.75%
mp3-before-3    26.65   42.48%  28.65%  17.25%  5.07%   6.55%
mp3-after-1     26.667  42.51%  27.38%  18.00%  5.40%   6.71%
mp3-after-2     26.777  48.37%  24.15%  15.68%  4.55%   7.25%
mp3-after-3     26.806  41.93%  27.71%  18.35%  4.78%   7.35%

mpeg4-before-1  26.024  18.41%  60.09%  13.16%  0.49%   7.85%
mpeg4-before-2  25.147  20.47%  64.80%  8.44%   1.37%   4.91%
mpeg4-before-3  25.007  19.18%  66.08%  10.01%  0.59%   4.22%
mpeg4-after-1   25.598  19.77%  61.33%  11.63%  0.79%   6.48%
mpeg4-after-2   25.18   22.31%  62.78%  8.83%   1.18%   4.90%
mpeg4-after-3   25.162  21.59%  64.88%  8.29%   0.49%   4.71%
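
For anyone wanting to eyeball the decay behavior being tested, here is a
toy userspace model of the boost policy (names are made up, plain C):

	#include <stdio.h>

	static unsigned int boost;	/* carried between updates */

	static unsigned int boosted_util(unsigned int util,
					 unsigned int max, int io)
	{
		if (io)
			boost = max;	/* jump to max on an IO-wait wakeup */
		else
			boost >>= 1;	/* halve on each following update */
		return util > boost ? util : boost;
	}

	int main(void)
	{
		/* util 300 of max 1024: one IO wakeup, then plain updates */
		printf("%u\n", boosted_util(300, 1024, 1));	/* 1024 */
		printf("%u\n", boosted_util(300, 1024, 0));	/* 512 */
		printf("%u\n", boosted_util(300, 1024, 0));	/* 300 */
		return 0;
	}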

thanks,
Steve


Re: [RFC][PATCH 5/7] cpufreq / sched: UUF_IO flag to indicate iowait condition

2016-08-01 Thread Steve Muckle
On Mon, Aug 01, 2016 at 01:37:23AM +0200, Rafael J. Wysocki wrote:
...
> For this purpose, define a new cpufreq_update_util() flag
> UUF_IO and modify enqueue_task_fair() to pass that flag to
> cpufreq_update_util() in the in_iowait case.  That generally
> requires cpufreq_update_util() to be called directly from there,
> because update_load_avg() is not likely to be invoked in that
> case.

I didn't follow why the cpufreq hook won't likely be called if
in_iowait is set? AFAICS update_load_avg() gets called in the second loop
and calls update_cfs_rq_load_avg (triggers the hook).

> 
> Signed-off-by: Rafael J. Wysocki 
> ---
>  include/linux/sched.h |1 +
>  kernel/sched/fair.c   |8 
>  2 files changed, 9 insertions(+)
> 
> Index: linux-pm/kernel/sched/fair.c
> ===
> --- linux-pm.orig/kernel/sched/fair.c
> +++ linux-pm/kernel/sched/fair.c
> @@ -4459,6 +4459,14 @@ enqueue_task_fair(struct rq *rq, struct
>   struct cfs_rq *cfs_rq;
>   struct sched_entity *se = &p->se;
>  
> + /*
> +  * If in_iowait is set, it is likely that the loops below will not
> +  * trigger any cpufreq utilization updates, so do it here explicitly
> +  * with the IO flag passed.
> +  */
> + if (p->in_iowait)
> + cpufreq_update_util(rq, UUF_IO);
> +
>   for_each_sched_entity(se) {
>   if (se->on_rq)
>   break;
> Index: linux-pm/include/linux/sched.h
> ===
> --- linux-pm.orig/include/linux/sched.h
> +++ linux-pm/include/linux/sched.h
> @@ -3376,6 +3376,7 @@ static inline unsigned long rlimit_max(u
>  }
>  
>  #define UUF_RT   0x01
> +#define UUF_IO   0x02
>  
>  #ifdef CONFIG_CPU_FREQ
>  struct update_util_data {

thanks,
Steve


Re: [RFC][PATCH 4/7] cpufreq / sched: Add flags argument to cpufreq_update_util()

2016-08-01 Thread Steve Muckle
On Mon, Aug 01, 2016 at 04:57:18PM +0200, Rafael J. Wysocki wrote:
> On Monday, August 01, 2016 09:33:12 AM Dominik Brodowski wrote:
> > On Mon, Aug 01, 2016 at 01:36:46AM +0200, Rafael J. Wysocki wrote:
> > > +#define UUF_RT   0x01
> > 
> > What does UUF stand for?
> 
> "Utilization upadte flag".

I had wondered the same - in my patchset I used CPUFREQ_SCHED_UPDATE_* for the
prefixes, though I guess some may object to the length.

thanks,
Steve


Re: [RFC][PATCH 3/7] cpufreq / sched: Check cpu_of(rq) in cpufreq_update_util()

2016-08-01 Thread Steve Muckle
On Mon, Aug 01, 2016 at 09:29:57AM +0200, Dominik Brodowski wrote:
> A small nitpick:
> 
> On Mon, Aug 01, 2016 at 01:36:01AM +0200, Rafael J. Wysocki wrote:
> > --- linux-pm.orig/kernel/sched/sched.h
> > +++ linux-pm/kernel/sched/sched.h
> > @@ -1760,7 +1760,7 @@ DECLARE_PER_CPU(struct update_util_data
> >  
> >  /**
> >   * cpufreq_update_util - Take a note about CPU utilization changes.
> > - * @time: Current time.
> > + * @rq: Runqueue to carry out the update for.
> >   *
> >   * This function is called by the scheduler on every invocation of
> >   * update_load_avg() on the CPU whose utilization is being updated.
> 
> This comment seems to need an update due to the smp_processor_id() check
> being moved into this function.

The callers of this have also changed - it is no longer called directly
by update_load_avg(), rather via cfs_rq_util_change() from several other
locations (I believe it was my patch that failed to update this
comment).

Could this be replaced with a more generic statement such as "called by
CFS in various paths?"

thanks,
Steve


Re: [RFC][PATCH 1/7] cpufreq / sched: Make schedutil access utilization data directly

2016-08-01 Thread Steve Muckle
On Mon, Aug 01, 2016 at 01:34:36AM +0200, Rafael J. Wysocki wrote:
...
> Index: linux-pm/kernel/sched/cpufreq_schedutil.c
> ===
> --- linux-pm.orig/kernel/sched/cpufreq_schedutil.c
> +++ linux-pm/kernel/sched/cpufreq_schedutil.c
> @@ -144,17 +144,47 @@ static unsigned int get_next_freq(struct
>   return cpufreq_driver_resolve_freq(policy, freq);
>  }
>  
> -static void sugov_update_single(struct update_util_data *hook, u64 time,
> - unsigned long util, unsigned long max)
> +static void sugov_get_util(unsigned long *util, unsigned long *max)
> +{
> + unsigned long dl_util, dl_max;
> + unsigned long cfs_util, cfs_max;
> + int cpu = smp_processor_id();
> + struct dl_bw *dl_bw = dl_bw_of(cpu);
> + struct rq *rq = this_rq();
> +
> + if (rt_prio(current->prio)) {
> + *util = ULONG_MAX;
> + return;
> + }
> +
> + dl_max = dl_bw_cpus(cpu) << 20;
> + dl_util = dl_bw->total_bw;
> +
> + cfs_max = rq->cpu_capacity_orig;
> + cfs_util = min(rq->cfs.avg.util_avg, cfs_max);
> +
> + if (cfs_util * dl_max > dl_util * cfs_max) {
> + *util = cfs_util;
> + *max  = cfs_max;
> + } else {
> + *util = dl_util;
> + *max  = dl_max;
> + }
> +}

Last Friday I had put together a similar patch based on Peter's. I need
the flags field for the remote wakeup support. My previous plan,
installing a late callback in check_preempt_curr that gets requested
from the earlier existing CFS callback, was not working out since those
two events don't always match up 1:1.

Anyway one way that my patch differed was that I had used the flags
field to keep the behavior the same for both RT and DL. That happens
later on in this series for RT but the DL policy is modified as above.
Can the DL policy be left as-is and discussed/modified in a separate
series?

thanks,
Steve


Re: [PATCH v3 1/3] cpufreq: add cpufreq_driver_resolve_freq()

2016-07-22 Thread Steve Muckle
On Fri, Jul 22, 2016 at 08:16:42AM -0700, Viresh Kumar wrote:
> > Long term as I was mentioning in the other thread I think it'd be good
> > if the current target() drivers were modified to supply resolve_freq(),
> > and that cpufreq_register_driver() were again changed to require it for
> > those drivers.
> 
> There is no need for us to force this, it's really optional for such
> platforms. Worst case, schedutil wouldn't work at its best, so what?
> It's a platform driver's choice, isn't it ?

This would be in the context of then being able to remove the additional if
statement from the hot path.

To reply to the suggestion of using likely() here, I'm partial to
solving it without that because I'm guessing likely() has to be an
architecture-dependent thing? It seems cleaner to me if the existing
few target() drivers were augmented to provide the resolve_freq() callback
and its presence required. But it's certainly not a big deal and won't
affect any platforms I'm involved with, at least for now. Maybe I could
work on those target() drivers if things change.


Re: linux-next: build failure after merge of the pm tree

2016-07-21 Thread Steve Muckle
On Fri, Jul 22, 2016 at 11:56:20AM +1000, Stephen Rothwell wrote:
> Hi Rafael,
> 
> After merging the pm tree, today's linux-next build (arm
> multi_v7_defconfig) failed like this:
> 
> ERROR: "cpufreq_driver_resolve_freq" [kernel/sched/cpufreq_schedutil.ko] 
> undefined!
> 
> Caused by commit
> 
>   5cbea46984d6 ("cpufreq: schedutil: map raw required frequency to driver 
> frequency")
> 
> I used the pm tree from next-20160721 for today.

Sorry - I have just sent a patch to address this.

thanks,
Steve


[PATCH] cpufreq: export cpufreq_driver_resolve_freq()

2016-07-21 Thread Steve Muckle
Export cpufreq_driver_resolve_freq() since governors may be compiled as
modules.

Reported-by: Stephen Rothwell 
Signed-off-by: Steve Muckle 
---
 drivers/cpufreq/cpufreq.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b696baeb249d..fa0a92d6121e 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -514,6 +514,7 @@ unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
   CPUFREQ_RELATION_L);
return policy->freq_table[policy->cached_resolved_idx].frequency;
 }
+EXPORT_SYMBOL_GPL(cpufreq_driver_resolve_freq);
 
 /*
  *  SYSFS INTERFACE  *
-- 
2.7.3



Re: [PATCH v3 1/3] cpufreq: add cpufreq_driver_resolve_freq()

2016-07-21 Thread Steve Muckle
On Thu, Jul 21, 2016 at 04:36:48PM -0700, Steve Muckle wrote:
> As another alternative, this could be caught in cpufreq driver
> initialization? I believe you suggested that originally, but I avoided
> it as I didn't want to have to implement resolve_freq() for every
> target() style driver. It sounds like there aren't many though.

Going back and checking I see I was thinking of your suggestion that
cpufreq_register_driver() check that only target() drivers offer a
resolve_freq() callback. I put a comment for this in cpufreq.h but not a
check - I could add a check in another patch if you like.

Long term as I was mentioning in the other thread I think it'd be good
if the current target() drivers were modified to supply resolve_freq(),
and that cpufreq_register_driver() were again changed to require it for
those drivers.



Re: [PATCH V2] cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()

2016-07-21 Thread Steve Muckle
On Fri, Jul 22, 2016 at 02:18:54AM +0200, Rafael J. Wysocki wrote:
> > My thinking was that one of these two would be preferable:
> >
> > - Forcing ->target() drivers to install a ->resolve_freq callback,
> >   enforcing this at cpufreq driver init time.
> 
> That would have been possible, but your series didn't do that.
> 
> >   My understanding is
> >   ->target() drivers are deprecated anyway
> 
> No, they aren't.

Ok. I didn't follow Documentation/cpu-freq/cpu-drivers.txt section 1.5
then - it suggests ->target() is deprecated, so perhaps it's out of
date.

> There simply are cases in which frequency tables are not workable
> (like the ACPI CPPC one).

Sure that makes sense.

> >  and there aren't many of
> >   them, though I don't know offhand exactly how many or how hard it
> >   would be to do for each one.
> >
> > - Forcing callers (schedutil in this case) to check that either
> >   ->target() or ->resolve_freq() is implemented. It means
> >   catching and scrutinizing future callers of resolve_freq.
> 
> But that doesn't reduce the number of checks in cpufreq_driver_resolve_freq().
> 
> There still are three choices in there: return a frequency from the
> table (if present), or call ->resolve_freq (if implemented), or return
> target_freq (as the last resort).

Sorry, that should've been "check that either ->target_index() or
->resolve_freq() is implemented." 

Implementing resolve_freq for the target() drivers and requiring it at
driver init time is probably the better way to go though. Perhaps I can
work on this at some point.



Re: [PATCH V2] cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()

2016-07-21 Thread Steve Muckle
On Fri, Jul 22, 2016 at 01:53:13AM +0200, Rafael J. Wysocki wrote:
> On Fri, Jul 22, 2016 at 1:45 AM, Steve Muckle  wrote:
> > On Fri, Jul 22, 2016 at 01:32:00AM +0200, Rafael J. Wysocki wrote:
> >> On Fri, Jul 22, 2016 at 1:22 AM, Steve Muckle  
> >> wrote:
> >> > On Fri, Jul 22, 2016 at 01:22:22AM +0200, Rafael J. Wysocki wrote:
> >> >> OK, applied.
> >> >
> >> > FWIW I do have a concern on this patch, I think it adds unnecessary
> >> > overhead.
> >>
> >> It isn't unnecessary.  It prevents an otherwise possible kernel crash
> >> from happening.
> >
> > The logic may not be unecessary, but the overhead is. The crash could be
> > prevented in a way that doesn't require repeatedly checking a pointer
> > that doesn't change.
> 
> Well, you had the ->resolve_freq check in your patch, didn't you?
> 
> Viresh simply added a ->target_index check to it.
> 
> Now, you can argue that this is one check too many, but as long as
> drivers are allowed to implement ->target without implementing
> ->resolve_freq, the *number* of checks in this routine cannot be
> reduced.
> 
> There are three possible cases and two checks are required to
> determine which case really takes place.

My thinking was that one of these two would be preferable:

- Forcing ->target() drivers to install a ->resolve_freq callback,
  enforcing this at cpufreq driver init time. My understanding is
  ->target() drivers are deprecated anyway and there aren't many of
  them, though I don't know offhand exactly how many or how hard it
  would be to do for each one.

- Forcing callers (schedutil in this case) to check that either
  ->target() or ->resolve_freq() is implemented. It means
  catching and scrutinizing future callers of resolve_freq.

But even if one of these is better than it could always be done on top
of this patch I suppose. I'm also not familiar with the platforms that use
->target() style drivers. So strictly speaking for my purposes it won't
matter since the number of tests is the same for them. 
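
The first alternative might look something like this in
cpufreq_register_driver() (hypothetical, not from any posted patch):

	/* ->target() drivers have no freq_table for the core to search,
	 * so insist that they can resolve frequencies themselves. */
	if (driver_data->target && !driver_data->resolve_freq)
		return -EINVAL;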



Re: [PATCH V2] cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()

2016-07-21 Thread Steve Muckle
On Fri, Jul 22, 2016 at 01:32:00AM +0200, Rafael J. Wysocki wrote:
> On Fri, Jul 22, 2016 at 1:22 AM, Steve Muckle  wrote:
> > On Fri, Jul 22, 2016 at 01:22:22AM +0200, Rafael J. Wysocki wrote:
> >> OK, applied.
> >
> > FWIW I do have a concern on this patch, I think it adds unnecessary
> > overhead.
> 
> It isn't unnecessary.  It prevents an otherwise possible kernel crash
> from happening.

The logic may not be unnecessary, but the overhead is. The crash could be
prevented in a way that doesn't require repeatedly checking a pointer
that doesn't change.



Re: [PATCH v3 1/3] cpufreq: add cpufreq_driver_resolve_freq()

2016-07-21 Thread Steve Muckle
On Thu, Jul 21, 2016 at 04:36:48PM -0700, Steve Muckle wrote:
> On Thu, Jul 21, 2016 at 04:30:03PM -0700, Viresh Kumar wrote:
> > On 21-07-16, 16:21, Steve Muckle wrote:
> > > On Thu, Jul 21, 2016 at 01:30:41PM -0700, Viresh Kumar wrote:
> > > > Okay, but in that case shouldn't we do something like this:
> > > > 
> > > > unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
> > > > unsigned int target_freq)
> > > > {
> > > >target_freq = clamp_val(target_freq, policy->min, policy->max);
> > > >policy->cached_target_freq = target_freq;
> > > > 
> > > >if (cpufreq_driver->target_index) {
> > > > policy->cached_resolved_idx =
> > > > cpufreq_frequency_table_target(policy, 
> > > > target_freq,
> > > >
> > > > CPUFREQ_RELATION_L);
> > > > return 
> > > > policy->freq_table[policy->cached_resolved_idx].frequency;
> > > >}
> > > > 
> > > >if (cpufreq_driver->resolve_freq)
> > > >return cpufreq_driver->resolve_freq(policy, target_freq);
> > > > }
> > > 
> > > Thanks for the review.
> > > 
> > > My thinking (noted in the commit text) was that the caller of
> > > cpufreq_driver_resolve_freq() would verify that the driver supported the
> > > proper calls before using this API.
> > 
> > Okay, but the caller isn't doing that today. Right?
> 
> There is no caller yet.

Sorry, of course this is not true.

I'm still of the opinion that modifying the governor (I could fix up
schedutil now) or adding a check in driver init would be better than any
unnecessary logic in the fast path.


Re: [PATCH v3 1/3] cpufreq: add cpufreq_driver_resolve_freq()

2016-07-21 Thread Steve Muckle
On Thu, Jul 21, 2016 at 04:30:03PM -0700, Viresh Kumar wrote:
> On 21-07-16, 16:21, Steve Muckle wrote:
> > On Thu, Jul 21, 2016 at 01:30:41PM -0700, Viresh Kumar wrote:
> > > Okay, but in that case shouldn't we do something like this:
> > > 
> > > unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
> > > unsigned int target_freq)
> > > {
> > >target_freq = clamp_val(target_freq, policy->min, policy->max);
> > >policy->cached_target_freq = target_freq;
> > > 
> > >if (cpufreq_driver->target_index) {
> > >   policy->cached_resolved_idx =
> > >   cpufreq_frequency_table_target(policy, 
> > > target_freq,
> > >  
> > > CPUFREQ_RELATION_L);
> > >   return 
> > > policy->freq_table[policy->cached_resolved_idx].frequency;
> > >}
> > > 
> > >if (cpufreq_driver->resolve_freq)
> > >return cpufreq_driver->resolve_freq(policy, target_freq);
> > > }
> > 
> > Thanks for the review.
> > 
> > My thinking (noted in the commit text) was that the caller of
> > cpufreq_driver_resolve_freq() would verify that the driver supported the
> > proper calls before using this API.
> 
> Okay, but the caller isn't doing that today. Right?

There is no caller yet.

> > This way it can be checked once,
> > presumably in a governor's init routine. Checking the pointer over and
> > over again in a fast path is wasteful.
> 
> But we just can not assume the callers to always check that the driver
> has a ->target() and no ->resolve_freq(), and in that case not to call
> this routine. We would be forced to add a WARN_ON() in that case here
> to make sure we aren't trying to access a NULL ->resolve_freq.

Why not? Can we not catch that in code review?

If somehow this slips past and someone tries to use a driver with
schedutil that doesn't provide either target_index or resolve_freq, it's
not like it'll be a rare crash. It'll die immediately and in a very
obvious way.

> Over that, it will be used for a very small number of drivers which
> still use the ->target() callback and anyway we are going to do a
> function call for them. We can add a likely() here if that helps, but
> some sort of checking is surely required IMO.
> 
> And, this is a core API, which can be used for other governor's
> tomorrow :)

As another alternative, this could be caught in cpufreq driver
initialization? I believe you suggested that originally, but I avoided
it as I didn't want to have to implement resolve_freq() for every
target() style driver. It sounds like there aren't many though.




Re: [PATCH v3 1/3] cpufreq: add cpufreq_driver_resolve_freq()

2016-07-21 Thread Steve Muckle
On Thu, Jul 21, 2016 at 04:21:31PM -0700, Steve Muckle wrote:
> On Thu, Jul 21, 2016 at 01:30:41PM -0700, Viresh Kumar wrote:
> > Okay, but in that case shouldn't we do something like this:
> > 
> > unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
> > unsigned int target_freq)
> > {
> >target_freq = clamp_val(target_freq, policy->min, policy->max);
> >policy->cached_target_freq = target_freq;
> > 
> >if (cpufreq_driver->target_index) {
> > policy->cached_resolved_idx =
> > cpufreq_frequency_table_target(policy, 
> > target_freq,
> >
> > CPUFREQ_RELATION_L);
> > return 
> > policy->freq_table[policy->cached_resolved_idx].frequency;
> >}
> > 
> >if (cpufreq_driver->resolve_freq)
> >return cpufreq_driver->resolve_freq(policy, target_freq);
> > }
> 
> Thanks for the review.
> 
> My thinking (noted in the commit text) was that the caller of
> cpufreq_driver_resolve_freq() would verify that the driver supported the
> proper calls before using this API. This way it can be checked once,
> presumably in a governor's init routine. Checking the pointer over and
> over again in a fast path is wasteful.

I guess this isn't immediately possible as the governor can't see
cpufreq_driver. I was hoping to change that however to allow
cpufreq_driver_resolve_freq() to be inlined in schedutil to get rid of
another function call...


Re: [PATCH V2] cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()

2016-07-21 Thread Steve Muckle
On Fri, Jul 22, 2016 at 01:22:22AM +0200, Rafael J. Wysocki wrote:
> OK, applied.

FWIW I do have a concern on this patch, I think it adds unnecessary
overhead.



Re: [PATCH v3 1/3] cpufreq: add cpufreq_driver_resolve_freq()

2016-07-21 Thread Steve Muckle
On Thu, Jul 21, 2016 at 01:30:41PM -0700, Viresh Kumar wrote:
> Okay, but in that case shouldn't we do something like this:
> 
> unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
> unsigned int target_freq)
> {
>target_freq = clamp_val(target_freq, policy->min, policy->max);
>policy->cached_target_freq = target_freq;
> 
>if (cpufreq_driver->target_index) {
>   policy->cached_resolved_idx =
>   cpufreq_frequency_table_target(policy, 
> target_freq,
>  
> CPUFREQ_RELATION_L);
>   return 
> policy->freq_table[policy->cached_resolved_idx].frequency;
>}
> 
>if (cpufreq_driver->resolve_freq)
>return cpufreq_driver->resolve_freq(policy, target_freq);
> }

Thanks for the review.

My thinking (noted in the commit text) was that the caller of
cpufreq_driver_resolve_freq() would verify that the driver supported the
proper calls before using this API. This way it can be checked once,
presumably in a governor's init routine. Checking the pointer over and
over again in a fast path is wasteful.




Re: [PATCH v3 0/3] cpufreq: avoid redundant driver calls in schedutil

2016-07-14 Thread Steve Muckle
On Thu, Jul 14, 2016 at 06:02:31PM +0800, Pingbo Wen wrote:
> > Steve Muckle (3):
> >   cpufreq: add cpufreq_driver_resolve_freq()
> >   cpufreq: schedutil: map raw required frequency to driver frequency
> 
> Tested the first two patches on db410c: irq_work was woken only 53
> times, versus 257 times previously (a 79% decrease), over 5 minutes of
> Android home screen idle.

Thanks Pingbo. My experience measuring the number of calls into the acpi
cpufreq fast switch path was similar. An arbitrary system workload I
chose showed a ~75% reduction in calls.

thanks,
Steve


[PATCH v3 0/3] cpufreq: avoid redundant driver calls in schedutil

2016-07-13 Thread Steve Muckle
Invoking the cpufreq driver to set a frequency can be expensive. On platforms
with a cpufreq driver that does not support fast switching, a thread must be
woken to complete the operation. IPIs will also occur if and when support
for processing remote task wakeups is added to schedutil.

Currently schedutil calculates a raw frequency request from scheduler
utilization data. This raw frequency request does not correlate to supported
cpufreq driver frequencies aside from being capped by the CPU's maximum
frequency. Consequently, there may be many consecutive requests for different
raw frequency values which all translate to the same driver-supported
frequency. For example on a platform with 3 supported frequencies 200MHz,
400MHz, and 600MHz, raw requests for 257MHz, 389MHz, and 307MHz all map to a
driver-supported frequency of 400MHz in schedutil. Assuming these requests were
consecutive and there were no changes in policy limits (min/max), there is no
need to issue the second or third request.

In order to resolve a raw frequency request to a driver-supported one a new
cpufreq API is added, cpufreq_driver_resolve_freq(). This API relies on a new
cpufreq driver callback in the case of ->target() style drivers. Otherwise it
uses the existing frequency table operations.

Lookups are cached both in cpufreq_driver_resolve_freq() (for the benefit of the
driver) and in schedutil.
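
Sketch of the intended governor-side usage (hypothetical helper; the
real change is in patch 2/3):

	static bool sugov_freq_changed(struct sugov_policy *sg_policy,
				       unsigned int raw_freq)
	{
		unsigned int next_freq;

		next_freq = cpufreq_driver_resolve_freq(sg_policy->policy,
							raw_freq);
		if (next_freq == sg_policy->next_freq)
			return false;	/* same driver freq, skip request */

		sg_policy->next_freq = next_freq;
		return true;
	}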

Changes since v2:
- incorporated feedback from Viresh to use resolve_freq driver callbacks
  only for ->target() style drivers, to use cpufreq's freq table operations,
  and move freq mapping caching into cpufreq policy
Changes since v1:
- incorporated feedback from Rafael to avoid referencing freq_table from
  schedutil by introducing a new cpufreq API

Steve Muckle (3):
  cpufreq: add cpufreq_driver_resolve_freq()
  cpufreq: schedutil: map raw required frequency to driver frequency
  cpufreq: acpi-cpufreq: use cached frequency mapping when possible

 drivers/cpufreq/acpi-cpufreq.c   |  5 ++++-
 drivers/cpufreq/cpufreq.c        | 25 +++++++++++++++++++++++++
 include/linux/cpufreq.h          | 16 ++++++++++++++++
 kernel/sched/cpufreq_schedutil.c | 31 +++++++++++++++++++++++--------
 4 files changed, 68 insertions(+), 9 deletions(-)

-- 
2.7.3



[PATCH v3 1/3] cpufreq: add cpufreq_driver_resolve_freq()

2016-07-13 Thread Steve Muckle
Cpufreq governors may need to know what a particular target frequency
maps to in the driver without necessarily wanting to set the frequency.
Support this operation via a new cpufreq API,
cpufreq_driver_resolve_freq(). This API returns the lowest driver
frequency equal or greater than the target frequency
(CPUFREQ_RELATION_L), subject to any policy (min/max) or driver
limitations. The mapping is also cached in the policy so that a
subsequent fast_switch operation can avoid repeating the same lookup.

The API will call a new cpufreq driver callback, resolve_freq(), if it
has been registered by the driver. Otherwise the frequency is resolved
via cpufreq_frequency_table_target(). Rather than require ->target()
style drivers to provide a resolve_freq() callback it is left to the
caller to ensure that the driver implements this callback if necessary
to use cpufreq_driver_resolve_freq().

Suggested-by: Rafael J. Wysocki 
Signed-off-by: Steve Muckle 
---
 drivers/cpufreq/cpufreq.c | 25 +++++++++++++++++++++++++
 include/linux/cpufreq.h   | 16 ++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 118b4f30a406..b696baeb249d 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -492,6 +492,29 @@ void cpufreq_disable_fast_switch(struct cpufreq_policy *policy)
 }
 EXPORT_SYMBOL_GPL(cpufreq_disable_fast_switch);
 
+/**
+ * cpufreq_driver_resolve_freq - Map a target frequency to a driver-supported
+ * one.
+ * @target_freq: target frequency to resolve.
+ *
+ * The target to driver frequency mapping is cached in the policy.
+ *
+ * Return: Lowest driver-supported frequency greater than or equal to the
+ * given target_freq, subject to policy (min/max) and driver limitations.
+ */
+unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
+unsigned int target_freq)
+{
+   target_freq = clamp_val(target_freq, policy->min, policy->max);
+   policy->cached_target_freq = target_freq;
+   if (cpufreq_driver->resolve_freq)
+   return cpufreq_driver->resolve_freq(policy, target_freq);
+   policy->cached_resolved_idx =
+   cpufreq_frequency_table_target(policy, target_freq,
+  CPUFREQ_RELATION_L);
+   return policy->freq_table[policy->cached_resolved_idx].frequency;
+}
+
 /*
  *  SYSFS INTERFACE  *
  */
@@ -2199,6 +2222,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
policy->min = new_policy->min;
policy->max = new_policy->max;
 
+   policy->cached_target_freq = UINT_MAX;
+
pr_debug("new min and max freqs are %u - %u kHz\n",
 policy->min, policy->max);
 
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index c6410b1b2490..631ba33bbe9f 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -120,6 +120,10 @@ struct cpufreq_policy {
	bool			fast_switch_possible;
	bool			fast_switch_enabled;
 
+/* Cached frequency lookup from cpufreq_driver_resolve_freq. */
+   unsigned int cached_target_freq;
+   int cached_resolved_idx;
+
/* Synchronization for frequency transitions */
	bool			transition_ongoing; /* Tracks transition status */
	spinlock_t		transition_lock;
@@ -270,6 +274,16 @@ struct cpufreq_driver {
unsigned int index);
unsigned int(*fast_switch)(struct cpufreq_policy *policy,
   unsigned int target_freq);
+
+   /*
+* Caches and returns the lowest driver-supported frequency greater than
+* or equal to the target frequency, subject to any driver limitations.
+* Does not set the frequency. Only to be implemented for drivers with
+* target().
+*/
+   unsigned int(*resolve_freq)(struct cpufreq_policy *policy,
+   unsigned int target_freq);
+
/*
 * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
 * unset.
@@ -501,6 +515,8 @@ int cpufreq_driver_target(struct cpufreq_policy *policy,
 int __cpufreq_driver_target(struct cpufreq_policy *policy,
   unsigned int target_freq,
   unsigned int relation);
+unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
+unsigned int target_freq);
 int cpufreq_register_governor(struct cpufreq_governor *governor);
 void cpufreq_unregister_governor(struct cpufreq_governor *governor);
 
-- 
2.7.3



[PATCH v3 3/3] cpufreq: acpi-cpufreq: use cached frequency mapping when possible

2016-07-13 Thread Steve Muckle
A call to cpufreq_driver_resolve_freq will cache the mapping from
the desired target frequency to the frequency table index. If there
is a cached mapping for the desired target frequency then use it
instead of performing the lookup again.

Signed-off-by: Steve Muckle 
---
 drivers/cpufreq/acpi-cpufreq.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index 11c9a078e0fd..297e9128fe9f 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -473,7 +473,10 @@ unsigned int acpi_cpufreq_fast_switch(struct 
cpufreq_policy *policy,
/*
 * Find the closest frequency above target_freq.
 */
-   index = cpufreq_table_find_index_dl(policy, target_freq);
+   if (policy->cached_target_freq == target_freq)
+   index = policy->cached_resolved_idx;
+   else
+   index = cpufreq_table_find_index_dl(policy, target_freq);
 
entry = &policy->freq_table[index];
next_freq = entry->frequency;
-- 
2.7.3



[PATCH v3 2/3] cpufreq: schedutil: map raw required frequency to driver frequency

2016-07-13 Thread Steve Muckle
The slow-path frequency transition path is relatively expensive as it
requires waking up a thread to do work. Should support be added for
remote CPU cpufreq updates, those would also be expensive since they
require an IPI. These activities should be avoided if they are not
necessary.

To that end, calculate the actual driver-supported frequency required by
the new utilization value in schedutil by using the recently added
cpufreq_driver_resolve_freq API. If it is the same as the previously
requested driver frequency then there is no need to continue with the
update, assuming the CPU frequency limits have not changed. This will
have additional benefits should the semantics of the rate limit be
changed to apply solely to frequency transitions rather than to
frequency calculations in schedutil.

The last raw required frequency is cached. This allows the driver
frequency lookup to be skipped in the event that the new raw required
frequency matches the last one, assuming a frequency update has not been
forced due to limits changing (indicated by a next_freq value of
UINT_MAX, see sugov_should_update_freq).
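
As a worked example (hypothetical numbers): with policy->cpuinfo.max_freq
= 2000000 kHz, util = 800 and max = 1024, the raw frequency computed by
get_next_freq() is (2000000 + 500000) * 800 / 1024 = 1953125 kHz. A later
update that computes the same raw value hits cached_raw_freq and simply
returns the previously resolved next_freq, skipping the driver lookup
entirely.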

Signed-off-by: Steve Muckle 
---
 kernel/sched/cpufreq_schedutil.c | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 758efd7f3abe..a84641b222c1 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -47,6 +47,8 @@ struct sugov_cpu {
struct update_util_data update_util;
struct sugov_policy *sg_policy;
 
+   unsigned int cached_raw_freq;
+
/* The fields below are only needed when sharing a policy. */
unsigned long util;
unsigned long max;
@@ -106,7 +108,7 @@ static void sugov_update_commit(struct sugov_policy 
*sg_policy, u64 time,
 
 /**
  * get_next_freq - Compute a new frequency for a given cpufreq policy.
- * @policy: cpufreq policy object to compute the new frequency for.
+ * @sg_cpu: schedutil cpu object to compute the new frequency for.
  * @util: Current CPU utilization.
  * @max: CPU capacity.
  *
@@ -121,14 +123,25 @@ static void sugov_update_commit(struct sugov_policy 
*sg_policy, u64 time,
  * next_freq = C * curr_freq * util_raw / max
  *
  * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
+ *
+ * The lowest driver-supported frequency which is equal or greater than the raw
+ * next_freq (as calculated above) is returned, subject to policy min/max and
+ * cpufreq driver limitations.
  */
-static unsigned int get_next_freq(struct cpufreq_policy *policy,
- unsigned long util, unsigned long max)
+static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
+ unsigned long max)
 {
+   struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+   struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
 
-   return (freq + (freq >> 2)) * util / max;
+   freq = (freq + (freq >> 2)) * util / max;
+
+   if (freq == sg_cpu->cached_raw_freq && sg_policy->next_freq != UINT_MAX)
+   return sg_policy->next_freq;
+   sg_cpu->cached_raw_freq = freq;
+   return cpufreq_driver_resolve_freq(policy, freq);
 }
 
 static void sugov_update_single(struct update_util_data *hook, u64 time,
@@ -143,13 +156,14 @@ static void sugov_update_single(struct update_util_data 
*hook, u64 time,
return;
 
next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
-   get_next_freq(policy, util, max);
+   get_next_freq(sg_cpu, util, max);
sugov_update_commit(sg_policy, time, next_f);
 }
 
-static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
+static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
   unsigned long util, unsigned long 
max)
 {
+   struct sugov_policy *sg_policy = sg_cpu->sg_policy;
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int max_f = policy->cpuinfo.max_freq;
u64 last_freq_update_time = sg_policy->last_freq_update_time;
@@ -189,7 +203,7 @@ static unsigned int sugov_next_freq_shared(struct 
sugov_policy *sg_policy,
}
}
 
-   return get_next_freq(policy, util, max);
+   return get_next_freq(sg_cpu, util, max);
 }
 
 static void sugov_update_shared(struct update_util_data *hook, u64 time,
@@ -206,7 +220,7 @@ static void sugov_update_shared(struct update_util_data 
*hook, u64 time,
sg_cpu->last_update = time;
 
if (sugov_should_update_freq(sg_policy, time)) {
-   next_f = sugov_next_freq_shared(sg_policy, util, max);
+   next_f = sugov_next_freq_shared(sg_

Re: [PATCH V3 8/9] cpufreq: Keep policy->freq_table sorted in ascending order

2016-06-03 Thread Steve Muckle
On Fri, Jun 03, 2016 at 07:05:14PM +0530, Viresh Kumar wrote:
...
> @@ -468,20 +469,15 @@ unsigned int acpi_cpufreq_fast_switch(struct 
> cpufreq_policy *policy,
>   struct acpi_cpufreq_data *data = policy->driver_data;
>   struct acpi_processor_performance *perf;
>   struct cpufreq_frequency_table *entry;
> - unsigned int next_perf_state, next_freq, freq;
> + unsigned int next_perf_state, next_freq, index;
>  
>   /*
>* Find the closest frequency above target_freq.
> -  *
> -  * The table is sorted in the reverse order with respect to the
> -  * frequency and all of the entries are valid (see the initialization).
>*/
> - entry = policy->freq_table;
> - do {
> - entry++;
> - freq = entry->frequency;
> - } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
> - entry--;
> + index = cpufreq_frequency_table_target(policy, target_freq,
> +CPUFREQ_RELATION_L);

Can we call cpufreq_find_index_l directly here? Seems like we could
phase out cpufreq_frequency_table_target() for the most part and call
the helpers directly. It would avoid some code bloat, an unnecessary
switch statement, and an error check for an invalid frequency table that
seems excessive to repeat on every lookup.

thanks,
Steve


Re: [PATCH V2 2/2] cpufreq: Optimize cpufreq_frequency_table_target()

2016-06-02 Thread Steve Muckle
On Thu, Jun 02, 2016 at 06:59:04AM +0530, Viresh Kumar wrote:
> On 01-06-16, 12:46, Steve Muckle wrote:
> > >   /*
> > >* Find the closest frequency above target_freq.
> > > -  *
> > > -  * The table is sorted in the reverse order with respect to the
> > > -  * frequency and all of the entries are valid (see the initialization).
> > >*/
> > > - entry = policy->freq_table;
> > > - do {
> > > - entry++;
> > > - freq = entry->frequency;
> > > - } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
> > > - entry--;
> > > + index = cpufreq_frequency_table_target(policy, target_freq,
> > > +CPUFREQ_RELATION_L);
> > 
> > This adds a function call to the fast path...
> 
> I understand that, but I am not sure how far we should go to avoid
> that. Open coding routines to save on that isn't a good idea surely.
> 
> I have at least kept this routine in cpufreq.h to avoid a call, but
> eventually we will have at least a call somewhere within it. :(

Shouldn't we be able to avoid extra function calls through the use of
macros/inlines? Otherwise this makes things slower for schedutil
than it is today. 

Actually cpufreq_driver_resolve_freq() shouldn't require any calls from
schedutil when a freq_table is available - the whole thing could be run
inline.
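
For example, an inline variant in cpufreq.h might look something like
this (a sketch with a hypothetical name, assuming the sorted-table
helper from this series returns an index):

    static inline unsigned int
    cpufreq_resolve_freq_inline(struct cpufreq_policy *policy,
                                unsigned int target_freq)
    {
            int index;

            /* Clamp to policy limits, then find the lowest freq >= target. */
            target_freq = clamp_val(target_freq, policy->min, policy->max);
            index = cpufreq_frequency_table_target(policy, target_freq,
                                                   CPUFREQ_RELATION_L);
            return policy->freq_table[index].frequency;
    }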



Re: [PATCH 3/5] sched: cpufreq: call cpufreq hook from remote CPUs

2016-06-01 Thread Steve Muckle
On Sat, May 21, 2016 at 12:46:06PM -0700, Steve Muckle wrote:
> Hi Peter, Ingo,

Hi Peter/Ingo would appreciate any thoughts you may have on the issue
below.

thanks,
Steve

> 
> On Thu, May 19, 2016 at 04:04:19PM -0700, Steve Muckle wrote:
> > On Thu, May 19, 2016 at 11:06:14PM +0200, Rafael J. Wysocki wrote:
> > > > In the case of a remote update the hook has to run (or not) after it is
> > > > known whether preemption will occur so we don't do needless work or
> > > > IPIs. If the policy CPUs aren't known in the scheduler then the early
> > > > hook will always need to be called along with an indication that it is
> > > > the early hook being called. If it turns out to be a remote update it
> > > > could then be deferred to the later hook, which would only be called
> > > > when a remote update has been deferred and preemption has not occurred.
> > > >
> > This means two hook invocations for a remote non-preempting wakeup
> > > > though instead of one.  Perhaps this is a good middle ground on code
> > > > churn vs. optimization though.
> > > 
> > > I would think so.
> > 
> > Ok, I will pursue this approach.
> 
> I'd like to get your opinion here before proceeding further...
> 
> To catch you up and summarize, I'm trying to implement support for
> calling the scheduler cpufreq callback during remote wakeups.  Currently
> the scheduler cpufreq callback is only called when the target CPU is the
> current CPU. If a remote wakeup does not result in preemption, the CPU
> frequency may currently not be adjusted appropriately for up to a tick,
> when we are guaranteed to call the hook again.
> 
> Invoking schedutil promptly for the target CPU in this situation means
> sending an IPI if the current CPU is not in the target CPU's frequency
> domain. This is because often a cpufreq driver must run on a CPU within
> the frequency domain it is bound to.  But the catch is that we should
> not do this and incur the overhead of an IPI if preemption will occur,
> as in that case the scheduler (and schedutil) will run soon on the
> target CPU anyway, potentially as a result of the scheduler sending its
> own IPI.
> 
> I figured this unnecessary overhead would be unacceptable and so have
> been working on an approach to avoid it. Unfortunately the current hooks
> happen before the preemption decision is made. My current implementation
> sets a flag if schedutil sees a remote wakeup and then bails. There's a
> test to call the hook again at the end of check_preempt_curr() if the flag
> is set.  The flag is cleared by resched_curr() as that means preemption
> will happen on the target CPU. The flag currently lives at the end of
> the rq struct. I could move it into the update_util_data hook structure
> or elsewhere, but that would mean accessing another per-cpu item in
> hot scheduler paths.
> 
> Thoughts? Note the current implementation described above differs a bit
> from the last posting in this thread, per discussion with Rafael.
> 
> thanks,
> Steve


Re: [PATCH V2 2/2] cpufreq: Optimize cpufreq_frequency_table_target()

2016-06-01 Thread Steve Muckle
On Wed, Jun 01, 2016 at 04:09:55PM +0530, Viresh Kumar wrote:
> cpufreq core keeps another table of sorted frequencies now and that can
> be used to find a match quickly, instead of traversing the unsorted list
> in an inefficient way.
> 
> Create helper routines for separate relation types to optimize it
> further.
> 
> Ideally the new routine cpufreq_find_target_index() is required to be
> exported, but s3c24xx was abusing the earlier API and has to be
> supported for now. Added a FIXME for it.
> 
> Tested on Exynos board with both ondemand and schedutil governor and
> confirmed with help of various print messages that we are eventually
> switching to the desired frequency based on a target frequency.
> 
> Signed-off-by: Viresh Kumar 
> ---
>  drivers/cpufreq/acpi-cpufreq.c|  15 ++--
>  drivers/cpufreq/freq_table.c  | 180 
> ++
>  drivers/cpufreq/s3c24xx-cpufreq.c |  13 ++-
>  include/linux/cpufreq.h   |  20 -
>  4 files changed, 134 insertions(+), 94 deletions(-)
> 
> diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
> index 32a15052f363..4f9f7504a17c 100644
> --- a/drivers/cpufreq/acpi-cpufreq.c
> +++ b/drivers/cpufreq/acpi-cpufreq.c
> @@ -468,20 +468,15 @@ unsigned int acpi_cpufreq_fast_switch(struct 
> cpufreq_policy *policy,
>   struct acpi_cpufreq_data *data = policy->driver_data;
>   struct acpi_processor_performance *perf;
>   struct cpufreq_frequency_table *entry;
> - unsigned int next_perf_state, next_freq, freq;
> + unsigned int next_perf_state, next_freq, index;
>  
>   /*
>* Find the closest frequency above target_freq.
> -  *
> -  * The table is sorted in the reverse order with respect to the
> -  * frequency and all of the entries are valid (see the initialization).
>*/
> - entry = policy->freq_table;
> - do {
> - entry++;
> - freq = entry->frequency;
> - } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
> - entry--;
> + index = cpufreq_frequency_table_target(policy, target_freq,
> +CPUFREQ_RELATION_L);

This adds a function call to the fast path...


Re: [PATCH v2 1/3] cpufreq: add resolve_freq driver callback

2016-05-31 Thread Steve Muckle
On Tue, May 31, 2016 at 11:00:11AM +0530, Viresh Kumar wrote:
> On 30-05-16, 08:31, Steve Muckle wrote:
> > My goal here was to have the system operate in this case in a manner
> > that is obviously not optimized (running at fmax), so the platform owner
> > realizes that the cpufreq driver doesn't fully support the schedutil
> > governor.
> > 
> > I was originally going to just return an error code but that also means
> > having to check for it which would be nice to avoid if possible on this
> > fast path.
> 
> Okay, I get what you are saying.
> 
> But all we are doing here is to make things fast by not sending IPIs,
> etc. That should *not* lead to a behavior where the frequency stays at
> MAX all the time even if the driver doesn't provide this callback or
> the freq-table.
> 
> If we just return the target_freq in this case instead of UINT_MAX,
> the platform may eventually have some unnecessary IPIs, wakeups, etc,
> but its frequency will still be switched properly.
> 
> Wouldn't that be a better choice ?

I'm still concerned that a platform owner may use this and accept
suboptimal perf/power because they aren't aware their cpufreq driver is
not fully compliant. But I agree it'd be better to have it work as well
as it can. I will make the change.

Maybe a warning message can be added when schedutil initializes if
resolve_freq is not supported.
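
Something along these lines at governor init could do it (a sketch;
cpufreq_driver_has_resolve_freq() is a hypothetical helper, since
schedutil cannot see the driver structure directly):

    /* Sketch only: warn if redundant updates cannot be filtered. */
    if (!policy->freq_table && !cpufreq_driver_has_resolve_freq(policy))
            pr_warn("schedutil: CPU%u: driver cannot resolve frequencies, "
                    "redundant frequency requests will not be filtered\n",
                    policy->cpu);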

thanks,
Steve


Re: [PATCH v2 1/3] cpufreq: add resolve_freq driver callback

2016-05-31 Thread Steve Muckle
On Tue, May 31, 2016 at 04:44:51PM +0530, Viresh Kumar wrote:
> On 25-05-16, 19:52, Steve Muckle wrote:
> > +unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
> > +unsigned int target_freq)
> > +{
> > +   struct cpufreq_frequency_table *freq_table;
> > +   int index, retval;
> > +
> > +   clamp_val(target_freq, policy->min, policy->max);
> 
> Rafael will kill me for this, as I have fallen into the same trap and
> copied your *incorrect* code :(
> 
> This is a useless statement unless you do:
> 
> target_freq = clamp_val(target_freq, policy->min, policy->max);

Well, I wouldn't worry too much; I copied it from Rafael's code in
cpufreq_driver_fast_switch(), which also has this problem and needs to be
fixed.

Perhaps clamp_val could be changed to force the compiler to check that
its return value is being used.
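
One way to do that would be a __must_check wrapper, since the attribute
only applies to functions and not macros (a sketch; the wrapper name is
hypothetical):

    /* Force callers to consume the clamped value. */
    static inline unsigned int __must_check
    cpufreq_clamp_freq(unsigned int freq, unsigned int min, unsigned int max)
    {
            return clamp_val(freq, min, max);
    }

    /* The silent-discard bug then becomes a compiler warning: */
    target_freq = cpufreq_clamp_freq(target_freq, policy->min, policy->max);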

thanks,
Steve


Re: [PATCH v2 3/3] cpufreq: schedutil: map raw required frequency to driver frequency

2016-05-30 Thread Steve Muckle
On Fri, May 27, 2016 at 01:41:02PM +0800, Wanpeng Li wrote:
> 2016-05-26 10:53 GMT+08:00 Steve Muckle :
> > The slow-path frequency transition path is relatively expensive as it
> > requires waking up a thread to do work. Should support be added for
> > remote CPU cpufreq updates that is also expensive since it requires an
> > IPI. These activities should be avoided if they are not necessary.
> >
> > To that end, calculate the actual driver-supported frequency required by
> > the new utilization value in schedutil by using the recently added
> > cpufreq_driver_resolve_freq callback. If it is the same as the
> > previously requested driver frequency then there is no need to continue
> > with the update assuming the cpu frequency limits have not changed. This
> > will have additional benefits should the semantics of the rate limit be
> > changed to apply solely to frequency transitions rather than to
> > frequency calculations in schedutil.
> 
> sugov_should_update_freq() is still called before get_next_freq() after
> the patch is applied, so the rate limit still applies to both frequency
> transitions and frequency calculations, right?

Yes, this series does not change the semantics of the rate limit. It
only includes the changes required for resolving raw target frequencies
to driver-supported frequencies so redundant operations can be avoided.

FWIW the approach I took to change the rate_limit semantics [0] just
moves where last_freq_update_time is set. I was going to return to that
once these changes are complete.

[0]: https://lkml.org/lkml/2016/5/9/857

thanks,
Steve


Re: [PATCH v2 3/3] cpufreq: schedutil: map raw required frequency to driver frequency

2016-05-30 Thread Steve Muckle
On Thu, May 26, 2016 at 12:46:29PM +0530, Viresh Kumar wrote:
> On 25-05-16, 19:53, Steve Muckle wrote:
> > The slow-path frequency transition path is relatively expensive as it
> > requires waking up a thread to do work. Should support be added for
> > remote CPU cpufreq updates that is also expensive since it requires an
> > IPI. These activities should be avoided if they are not necessary.
> > 
> > To that end, calculate the actual driver-supported frequency required by
> > the new utilization value in schedutil by using the recently added
> > cpufreq_driver_resolve_freq callback. If it is the same as the
> > previously requested driver frequency then there is no need to continue
> > with the update assuming the cpu frequency limits have not changed. This
> > will have additional benefits should the semantics of the rate limit be
> > changed to apply solely to frequency transitions rather than to
> > frequency calculations in schedutil.
> 
> Maybe mention here that this patch isn't avoiding those IPIs yet; I
> got the impression earlier that they are avoided with it.

Perhaps the problem is that my cover letter should have more clearly
specified that the earlier referenced series was an unaccepted draft?
I'll be more careful to note that in the future.

The text above (the second sentence) seems okay to me in that it
mentions remote updates are not currently supported. Let me know if
there is specific text you would like included.

thanks,
Steve


Re: [PATCH v2 2/3] cpufreq: acpi-cpufreq: add resolve_freq callback

2016-05-30 Thread Steve Muckle
On Thu, May 26, 2016 at 12:13:41PM +0530, Viresh Kumar wrote:
> On 25-05-16, 19:53, Steve Muckle wrote:
> > Support the new resolve_freq cpufreq callback which resolves a target
> > frequency to a driver-supported frequency without actually setting it.
> 
> And here is the first abuser of this API as I was talking about in the
> earlier patch :)
> 
> But, I know why you are doing it and I think we can do it differently.
> 
> So, let's assume that the ->resolve_freq() callback will ONLY be
> provided by the drivers which also provide a ->target() callback.
> 
> i.e. not by acpi-cpufreq at least.
> 
> > The target frequency and resolved frequency table entry are cached so
> > that a subsequent fast_switch operation may avoid the frequency table
> > walk assuming the requested target frequency is the same.
> > 
> > Suggested-by: Rafael J. Wysocki 
> > Signed-off-by: Steve Muckle 
> > ---
> >  drivers/cpufreq/acpi-cpufreq.c | 56 
> > --
> >  1 file changed, 43 insertions(+), 13 deletions(-)
> > 
> > diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
> > index 7f38fb55f223..d87962eda1ed 100644
> > --- a/drivers/cpufreq/acpi-cpufreq.c
> > +++ b/drivers/cpufreq/acpi-cpufreq.c
> > @@ -66,6 +66,8 @@ enum {
> >  
> >  struct acpi_cpufreq_data {
> > struct cpufreq_frequency_table *freq_table;
> > +   unsigned int cached_lookup_freq;
> > +   struct cpufreq_frequency_table *cached_lookup_entry;
> 
> This could have been an integer value 'Index', which could have been
> used together with the freq-table to get the freq we wanted.
> 
> And, then we can move these two fields into the cpufreq_policy
> structure and make them part of the first patch itself.
> 
> > unsigned int resume;
> > unsigned int cpu_feature;
> > unsigned int acpi_perf_cpu;
> > @@ -458,26 +460,53 @@ static int acpi_cpufreq_target(struct cpufreq_policy 
> > *policy,
> > return result;
> >  }
> >  
> > -unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > - unsigned int target_freq)
> > +/*
> > + * Find the closest frequency above target_freq.
> > + *
> > + * The table is sorted in the reverse order with respect to the
> > + * frequency and all of the entries are valid (see the initialization).
> > + */
> > +static inline struct cpufreq_frequency_table
> > +*lookup_freq(struct cpufreq_frequency_table *table, unsigned int 
> > target_freq)
> >  {
> > -   struct acpi_cpufreq_data *data = policy->driver_data;
> > -   struct acpi_processor_performance *perf;
> > -   struct cpufreq_frequency_table *entry;
> > -   unsigned int next_perf_state, next_freq, freq;
> > +   struct cpufreq_frequency_table *entry = table;
> > +   unsigned int freq;
> >  
> > -   /*
> > -* Find the closest frequency above target_freq.
> > -*
> > -* The table is sorted in the reverse order with respect to the
> > -* frequency and all of the entries are valid (see the initialization).
> > -*/
> > -   entry = data->freq_table;
> > do {
> > entry++;
> > freq = entry->frequency;
> > } while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
> > entry--;
> > +
> > +   return entry;
> > +}
> > +
> > +unsigned int acpi_cpufreq_resolve_freq(struct cpufreq_policy *policy,
> > +  unsigned int target_freq)
> > +{
> > +   struct acpi_cpufreq_data *data = policy->driver_data;
> > +   struct cpufreq_frequency_table *entry;
> > +
> > +   data->cached_lookup_freq = target_freq;
> > +   entry = lookup_freq(data->freq_table, target_freq);
> > +   data->cached_lookup_entry = entry;
> > +
> > +   return entry->frequency;
> > +}
> > +
> 
> And then we could remove this callback from acpi-cpufreq.
> 
> > +unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > + unsigned int target_freq)
> > +{
> > +   struct acpi_cpufreq_data *data = policy->driver_data;
> > +   struct acpi_processor_performance *perf;
> > +   struct cpufreq_frequency_table *entry;
> > +   unsigned int next_perf_state, next_freq;
> > +
> > +   if (data->cached_lookup_entry &&
> > +   data->cached_lookup_freq == target_freq)
> > +   entry = data->cached_lookup_entry;
> > +   else
> 

Re: [PATCH v2 1/3] cpufreq: add resolve_freq driver callback

2016-05-30 Thread Steve Muckle
On Thu, May 26, 2016 at 11:55:14AM +0530, Viresh Kumar wrote:
> On 25-05-16, 19:52, Steve Muckle wrote:
> > Cpufreq governors may need to know what a particular target frequency
> > maps to in the driver without necessarily wanting to set the frequency.
> > Support this operation via a new cpufreq API,
> > cpufreq_driver_resolve_freq().
> > 
> > The above API will call a new cpufreq driver callback, resolve_freq(),
> > if it has been registered by the driver. If that callback has not been
> > registered and a frequency table is available then the frequency table
> > is walked using cpufreq_frequency_table_target().
> > 
> > UINT_MAX is returned if no driver callback or frequency table is
> > available.
> 
> Why should we return UINT_MAX here? We should return target_freq, no?

My goal here was to have the system operate in this case in a manner
that is obviously not optimized (running at fmax), so the platform owner
realizes that the cpufreq driver doesn't fully support the schedutil
governor.

I was originally going to just return an error code but that also means
having to check for it which would be nice to avoid if possible on this
fast path.

> 
> > Suggested-by: Rafael J. Wysocki 
> > Signed-off-by: Steve Muckle 
> > ---
> >  drivers/cpufreq/cpufreq.c | 25 +
> >  include/linux/cpufreq.h   | 11 +++
> >  2 files changed, 36 insertions(+)
> > 
> > diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> > index 77d77a4e3b74..3b44f4bdc071 100644
> > --- a/drivers/cpufreq/cpufreq.c
> > +++ b/drivers/cpufreq/cpufreq.c
> > @@ -1849,6 +1849,31 @@ unsigned int cpufreq_driver_fast_switch(struct 
> > cpufreq_policy *policy,
> >  }
> >  EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
> >  
> > +unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
> > +unsigned int target_freq)
> > +{
> > +   struct cpufreq_frequency_table *freq_table;
> > +   int index, retval;
> > +
> > +   clamp_val(target_freq, policy->min, policy->max);
> > +
> > +   if (cpufreq_driver->resolve_freq)
> > +   return cpufreq_driver->resolve_freq(policy, target_freq);
> > +
> > +   freq_table = cpufreq_frequency_get_table(policy->cpu);
> 
> I have sent a separate patch to provide a light weight alternative to
> this. If that gets accepted, we can switch over to using it.

Sure.

> 
> > +   if (!freq_table)
> > +   return UINT_MAX;
> > +
> > +   retval = cpufreq_frequency_table_target(policy, freq_table,
> > +   target_freq, CPUFREQ_RELATION_L,
> > +   &index);
> > +   if (retval)
> > +   return UINT_MAX;
> > +
> > +   return freq_table[index].frequency;
> > +}
> > +EXPORT_SYMBOL_GPL(cpufreq_driver_resolve_freq);
> > +
> >  /* Must set freqs->new to intermediate frequency */
> >  static int __target_intermediate(struct cpufreq_policy *policy,
> >  struct cpufreq_freqs *freqs, int index)
> > diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> > index 4e81e08db752..675f17f98e75 100644
> > --- a/include/linux/cpufreq.h
> > +++ b/include/linux/cpufreq.h
> > @@ -271,6 +271,13 @@ struct cpufreq_driver {
> > int (*target_intermediate)(struct cpufreq_policy *policy,
> >unsigned int index);
> >  
> > +   /*
> > +* Return the driver-supported frequency that a particular target
> > +* frequency maps to (does not set the new frequency).
> > +*/
> > +   unsigned int    (*resolve_freq)(struct cpufreq_policy *policy,
> > +   unsigned int target_freq);
> 
> We have 3 categories of cpufreq-drivers today:
> 1. setpolicy drivers: They don't use the cpufreq governors we are
>working on.
> 2. non-setpolicy drivers:
>   A. with ->target_index() callback, these will always provide a
>  freq-table.
>   B. with ->target() callback, ONLY these should be allowed to provide
>  the ->resolve_freq() callback and no one else.
> 
> And so I would suggest adding an additional check in
> cpufreq_register_driver() to catch incorrect usage of this callback.

I'll reply to this in the next email on patch 2...

thanks,
Steve


[PATCH v2 1/3] cpufreq: add resolve_freq driver callback

2016-05-25 Thread Steve Muckle
Cpufreq governors may need to know what a particular target frequency
maps to in the driver without necessarily wanting to set the frequency.
Support this operation via a new cpufreq API,
cpufreq_driver_resolve_freq().

The above API will call a new cpufreq driver callback, resolve_freq(),
if it has been registered by the driver. If that callback has not been
registered and a frequency table is available then the frequency table
is walked using cpufreq_frequency_table_target().

UINT_MAX is returned if no driver callback or frequency table is
available.

Suggested-by: Rafael J. Wysocki 
Signed-off-by: Steve Muckle 
---
 drivers/cpufreq/cpufreq.c | 25 +
 include/linux/cpufreq.h   | 11 +++
 2 files changed, 36 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 77d77a4e3b74..3b44f4bdc071 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1849,6 +1849,31 @@ unsigned int cpufreq_driver_fast_switch(struct 
cpufreq_policy *policy,
 }
 EXPORT_SYMBOL_GPL(cpufreq_driver_fast_switch);
 
+unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
+unsigned int target_freq)
+{
+   struct cpufreq_frequency_table *freq_table;
+   int index, retval;
+
+   clamp_val(target_freq, policy->min, policy->max);
+
+   if (cpufreq_driver->resolve_freq)
+   return cpufreq_driver->resolve_freq(policy, target_freq);
+
+   freq_table = cpufreq_frequency_get_table(policy->cpu);
+   if (!freq_table)
+   return UINT_MAX;
+
+   retval = cpufreq_frequency_table_target(policy, freq_table,
+   target_freq, CPUFREQ_RELATION_L,
+   &index);
+   if (retval)
+   return UINT_MAX;
+
+   return freq_table[index].frequency;
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_resolve_freq);
+
 /* Must set freqs->new to intermediate frequency */
 static int __target_intermediate(struct cpufreq_policy *policy,
 struct cpufreq_freqs *freqs, int index)
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 4e81e08db752..675f17f98e75 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -271,6 +271,13 @@ struct cpufreq_driver {
int (*target_intermediate)(struct cpufreq_policy *policy,
   unsigned int index);
 
+   /*
+* Return the driver-supported frequency that a particular target
+* frequency maps to (does not set the new frequency).
+*/
+   unsigned int    (*resolve_freq)(struct cpufreq_policy *policy,
+   unsigned int target_freq);
+
/* should be defined, if possible */
	unsigned int	(*get)(unsigned int cpu);
 
@@ -487,6 +494,10 @@ int cpufreq_driver_target(struct cpufreq_policy *policy,
 int __cpufreq_driver_target(struct cpufreq_policy *policy,
   unsigned int target_freq,
   unsigned int relation);
+
+unsigned int cpufreq_driver_resolve_freq(struct cpufreq_policy *policy,
+unsigned int target_freq);
+
 int cpufreq_register_governor(struct cpufreq_governor *governor);
 void cpufreq_unregister_governor(struct cpufreq_governor *governor);
 
-- 
2.4.10



[PATCH v2 0/3] cpufreq: avoid redundant driver calls in schedutil

2016-05-25 Thread Steve Muckle
In the series [0] I included a patch which attempted to avoid redundant driver
calls in the schedutil governor by mapping the raw required CPU frequencies to
driver frequencies. This vastly increases the likelihood of detecting a
redundant cpufreq driver call, i.e. one which will end up attempting to set the
CPU frequency to the frequency it is already running at. The redundant call can
be avoided. This is especially valuable in the case of drivers which do not
support fast path updates or if remote CPU cpufreq callbacks are implemented.
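
To make the idea concrete (hypothetical numbers): if the driver supports
1113600 kHz and 1267200 kHz, raw required frequencies of 1150000 kHz and
then 1200000 kHz both resolve (via CPUFREQ_RELATION_L) to 1267200 kHz, so
the second request can be dropped before it ever reaches the driver.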

Unfortunately the implementation of this in [0] walked the frequency table
directly in schedutil. Rafael pointed out that not all drivers may have a
frequency table and that a driver callback might be implemented to return the
driver frequency associated with a particular target frequency. The driver
could then also cache this lookup and possibly use it on an ensuing
fast_switch. This series implements that approach.

Given that this change is beneficial on its own I've split it out into its own
series from the remote callback support.

[0] https://lkml.org/lkml/2016/5/9/853

Steve Muckle (3):
  cpufreq: add resolve_freq driver callback
  cpufreq: acpi-cpufreq: add resolve_freq callback
  cpufreq: schedutil: map raw required frequency to driver frequency

 drivers/cpufreq/acpi-cpufreq.c   | 56 ++--
 drivers/cpufreq/cpufreq.c| 25 ++
 include/linux/cpufreq.h  | 11 
 kernel/sched/cpufreq_schedutil.c | 30 +++--
 4 files changed, 101 insertions(+), 21 deletions(-)

-- 
2.4.10



[PATCH v2 2/3] cpufreq: acpi-cpufreq: add resolve_freq callback

2016-05-25 Thread Steve Muckle
Support the new resolve_freq cpufreq callback which resolves a target
frequency to a driver-supported frequency without actually setting it.

The target frequency and resolved frequency table entry are cached so
that a subsequent fast_switch operation may avoid the frequency table
walk assuming the requested target frequency is the same.

Suggested-by: Rafael J. Wysocki 
Signed-off-by: Steve Muckle 
---
 drivers/cpufreq/acpi-cpufreq.c | 56 --
 1 file changed, 43 insertions(+), 13 deletions(-)

diff --git a/drivers/cpufreq/acpi-cpufreq.c b/drivers/cpufreq/acpi-cpufreq.c
index 7f38fb55f223..d87962eda1ed 100644
--- a/drivers/cpufreq/acpi-cpufreq.c
+++ b/drivers/cpufreq/acpi-cpufreq.c
@@ -66,6 +66,8 @@ enum {
 
 struct acpi_cpufreq_data {
struct cpufreq_frequency_table *freq_table;
+   unsigned int cached_lookup_freq;
+   struct cpufreq_frequency_table *cached_lookup_entry;
unsigned int resume;
unsigned int cpu_feature;
unsigned int acpi_perf_cpu;
@@ -458,26 +460,53 @@ static int acpi_cpufreq_target(struct cpufreq_policy 
*policy,
return result;
 }
 
-unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
- unsigned int target_freq)
+/*
+ * Find the closest frequency above target_freq.
+ *
+ * The table is sorted in the reverse order with respect to the
+ * frequency and all of the entries are valid (see the initialization).
+ */
+static inline struct cpufreq_frequency_table
+*lookup_freq(struct cpufreq_frequency_table *table, unsigned int target_freq)
 {
-   struct acpi_cpufreq_data *data = policy->driver_data;
-   struct acpi_processor_performance *perf;
-   struct cpufreq_frequency_table *entry;
-   unsigned int next_perf_state, next_freq, freq;
+   struct cpufreq_frequency_table *entry = table;
+   unsigned int freq;
 
-   /*
-* Find the closest frequency above target_freq.
-*
-* The table is sorted in the reverse order with respect to the
-* frequency and all of the entries are valid (see the initialization).
-*/
-   entry = data->freq_table;
do {
entry++;
freq = entry->frequency;
} while (freq >= target_freq && freq != CPUFREQ_TABLE_END);
entry--;
+
+   return entry;
+}
+
+unsigned int acpi_cpufreq_resolve_freq(struct cpufreq_policy *policy,
+  unsigned int target_freq)
+{
+   struct acpi_cpufreq_data *data = policy->driver_data;
+   struct cpufreq_frequency_table *entry;
+
+   data->cached_lookup_freq = target_freq;
+   entry = lookup_freq(data->freq_table, target_freq);
+   data->cached_lookup_entry = entry;
+
+   return entry->frequency;
+}
+
+unsigned int acpi_cpufreq_fast_switch(struct cpufreq_policy *policy,
+ unsigned int target_freq)
+{
+   struct acpi_cpufreq_data *data = policy->driver_data;
+   struct acpi_processor_performance *perf;
+   struct cpufreq_frequency_table *entry;
+   unsigned int next_perf_state, next_freq;
+
+   if (data->cached_lookup_entry &&
+   data->cached_lookup_freq == target_freq)
+   entry = data->cached_lookup_entry;
+   else
+   entry = lookup_freq(data->freq_table, target_freq);
next_freq = entry->frequency;
next_perf_state = entry->driver_data;
 
@@ -918,6 +947,7 @@ static struct cpufreq_driver acpi_cpufreq_driver = {
.verify = cpufreq_generic_frequency_table_verify,
.target_index   = acpi_cpufreq_target,
.fast_switch= acpi_cpufreq_fast_switch,
+   .resolve_freq   = acpi_cpufreq_resolve_freq,
.bios_limit = acpi_processor_get_bios_limit,
.init   = acpi_cpufreq_cpu_init,
.exit   = acpi_cpufreq_cpu_exit,
-- 
2.4.10



[PATCH v2 3/3] cpufreq: schedutil: map raw required frequency to driver frequency

2016-05-25 Thread Steve Muckle
The slow-path frequency transition path is relatively expensive as it
requires waking up a thread to do work. Should support be added for
remote CPU cpufreq updates, those would also be expensive since they
require an IPI. These activities should be avoided if they are not
necessary.

To that end, calculate the actual driver-supported frequency required by
the new utilization value in schedutil by using the recently added
cpufreq_driver_resolve_freq callback. If it is the same as the
previously requested driver frequency then there is no need to continue
with the update, assuming the CPU frequency limits have not changed. This
will have additional benefits should the semantics of the rate limit be
changed to apply solely to frequency transitions rather than to
frequency calculations in schedutil.

The last raw required frequency is cached. This allows the driver
frequency lookup to be skipped in the event that the new raw required
frequency matches the last one, assuming a frequency update has not been
forced due to limits changing (indicated by a next_freq value of
UINT_MAX, see sugov_should_update_freq).

Signed-off-by: Steve Muckle 
---
 kernel/sched/cpufreq_schedutil.c | 30 ++
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 154ae3a51e86..ef73ca4d8d78 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -45,6 +45,8 @@ struct sugov_cpu {
struct update_util_data update_util;
struct sugov_policy *sg_policy;
 
+   unsigned int cached_raw_freq;
+
/* The fields below are only needed when sharing a policy. */
unsigned long util;
unsigned long max;
@@ -104,7 +106,7 @@ static void sugov_update_commit(struct sugov_policy 
*sg_policy, u64 time,
 
 /**
  * get_next_freq - Compute a new frequency for a given cpufreq policy.
- * @policy: cpufreq policy object to compute the new frequency for.
+ * @sg_cpu: schedutil cpu object to compute the new frequency for.
  * @util: Current CPU utilization.
  * @max: CPU capacity.
  *
@@ -119,14 +121,24 @@ static void sugov_update_commit(struct sugov_policy 
*sg_policy, u64 time,
  * next_freq = C * curr_freq * util_raw / max
  *
  * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
+ *
+ * The lowest driver-supported frequency which is equal or greater than the raw
+ * next_freq (as calculated above) is returned.
  */
-static unsigned int get_next_freq(struct cpufreq_policy *policy,
- unsigned long util, unsigned long max)
+static unsigned int get_next_freq(struct sugov_cpu *sg_cpu, unsigned long util,
+ unsigned long max)
 {
+   struct sugov_policy *sg_policy = sg_cpu->sg_policy;
+   struct cpufreq_policy *policy = sg_policy->policy;
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
 
-   return (freq + (freq >> 2)) * util / max;
+   freq = (freq + (freq >> 2)) * util / max;
+
+   if (freq == sg_cpu->cached_raw_freq && sg_policy->next_freq != UINT_MAX)
+   return sg_policy->next_freq;
+   sg_cpu->cached_raw_freq = freq;
+   return cpufreq_driver_resolve_freq(policy, freq);
 }
 
 static void sugov_update_single(struct update_util_data *hook, u64 time,
@@ -141,13 +153,14 @@ static void sugov_update_single(struct update_util_data 
*hook, u64 time,
return;
 
next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
-   get_next_freq(policy, util, max);
+   get_next_freq(sg_cpu, util, max);
sugov_update_commit(sg_policy, time, next_f);
 }
 
-static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
+static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
   unsigned long util, unsigned long 
max)
 {
+   struct sugov_policy *sg_policy = sg_cpu->sg_policy;
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int max_f = policy->cpuinfo.max_freq;
u64 last_freq_update_time = sg_policy->last_freq_update_time;
@@ -187,7 +200,7 @@ static unsigned int sugov_next_freq_shared(struct 
sugov_policy *sg_policy,
}
}
 
-   return get_next_freq(policy, util, max);
+   return get_next_freq(sg_cpu, util, max);
 }
 
 static void sugov_update_shared(struct update_util_data *hook, u64 time,
@@ -204,7 +217,7 @@ static void sugov_update_shared(struct update_util_data 
*hook, u64 time,
sg_cpu->last_update = time;
 
if (sugov_should_update_freq(sg_policy, time)) {
-   next_f = sugov_next_freq_shared(sg_policy, util, max);
+   next_f = sugov_next_freq_shared(sg_cpu, util, max);
sugov_update_commit(sg_pol

Re: [RFC PATCH] Increase in idle power with schedutil

2016-05-22 Thread Steve Muckle
On Sun, May 22, 2016 at 12:39:12PM +0200, Peter Zijlstra wrote:
> On Fri, May 20, 2016 at 05:53:41PM +0530, Shilpasri G Bhat wrote:
> > 
> > Below are the comparisons by disabling watchdog.
> > Both schedutil and ondemand have a similar ramp-down trend. And in both
> > cases I can see that the frequency of the CPU is not reduced in a
> > deterministic fashion. In an observation window of 30 seconds after
> > running a workload I can see that the frequency is not ramped down on
> > some CPUs in the system, which are idling at max frequency.
> 
> So does it actually matter what the frequency is when you idle? Isn't
> the whole thing clock gated anyway?
> 
> Because this seems to generate contradictory requirements, on the one
> hand we want to stay idle as long as possible while on the other hand
> you seem to want to clock down while idle, which requires not being
> idle.
> 
> If it matters; should not your idle state muck explicitly set/restore
> frequency?

AFAIK this is very platform dependent. Some will waste more power than
others when a CPU idles above fmin due to things like resource (bus
bandwidth, shared cache freq etc) voting.

It is also true that there is power spent going to fmin (and then
perhaps restoring the frequency when idle ends) which will be in part a
function of how slow the frequency change operation is on that platform.

I think Daniel Lezcano (added) was exploring the idea of having cpuidle
drivers take the expected idle duration and potentially communicate to
cpufreq to reduce the frequency depending on a platform-specific
cost/benefit analysis.




Re: [PATCH 3/5] sched: cpufreq: call cpufreq hook from remote CPUs

2016-05-21 Thread Steve Muckle
Hi Peter, Ingo,

On Thu, May 19, 2016 at 04:04:19PM -0700, Steve Muckle wrote:
> On Thu, May 19, 2016 at 11:06:14PM +0200, Rafael J. Wysocki wrote:
> > > In the case of a remote update the hook has to run (or not) after it is
> > > known whether preemption will occur so we don't do needless work or
> > > IPIs. If the policy CPUs aren't known in the scheduler then the early
> > > hook will always need to be called along with an indication that it is
> > > the early hook being called. If it turns out to be a remote update it
> > > could then be deferred to the later hook, which would only be called
> > > when a remote update has been deferred and preemption has not occurred.
> > >
> > > This means two hook invocations for a remote non-preempting wakeup
> > > though instead of one.  Perhaps this is a good middle ground on code
> > > churn vs. optimization though.
> > 
> > I would think so.
> 
> Ok, I will pursue this approach.

I'd like to get your opinion here before proceeding further...

To catch you up and summarize, I'm trying to implement support for
calling the scheduler cpufreq callback during remote wakeups.  Currently
the scheduler cpufreq callback is only called when the target CPU is the
current CPU. If a remote wakeup does not result in preemption, the CPU
frequency may currently not be adjusted appropriately for up to a tick,
when we are guaranteed to call the hook again.

Invoking schedutil promptly for the target CPU in this situation means
sending an IPI if the current CPU is not in the target CPU's frequency
domain. This is because often a cpufreq driver must run on a CPU within
the frequency domain it is bound to.  But the catch is that we should
not do this and incur the overhead of an IPI if preemption will occur,
as in that case the scheduler (and schedutil) will run soon on the
target CPU anyway, potentially as a result of the scheduler sending its
own IPI.

I figured this unnecessary overhead would be unacceptable and so have
been working on an approach to avoid it. Unfortunately the current hooks
happen before the preemption decision is made. My current implementation
sets a flag if schedutil sees a remote wakeup and then bails. There's a
test to call the hook again at the end of check_preempt_curr() if the flag
is set.  The flag is cleared by resched_curr() as that means preemption
will happen on the target CPU. The flag currently lives at the end of
the rq struct. I could move it into the update_util_data hook structure
or elsewhere, but that would mean accessing another per-cpu item in
hot scheduler paths.
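
In rough outline the flow is (a sketch of the current draft; the field
and helper names are approximate, not final):

    /* schedutil hook, which runs before the preemption decision: */
    if (cpu != smp_processor_id()) {
            cpu_rq(cpu)->remote_update_pending = true;  /* defer for now */
            return;
    }

    /* end of check_preempt_curr(), i.e. preemption was not triggered: */
    if (rq->remote_update_pending) {
            rq->remote_update_pending = false;
            sugov_remote_update(rq);  /* hypothetical: IPI the target CPU */
    }

    /* resched_curr(), i.e. the target CPU will reschedule anyway: */
    rq->remote_update_pending = false;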

Thoughts? Note the current implementation described above differs a bit
from the last posting in this thread, per discussion with Rafael.

thanks,
Steve


Re: [PATCH 5/5] cpufreq: schedutil: do not update rate limit ts when freq is unchanged

2016-05-19 Thread Steve Muckle
On Fri, May 20, 2016 at 02:37:17AM +0200, Rafael J. Wysocki wrote:
> Also I think that it would be good to avoid walking the frequency
> table twice in case we end up wanting to update the frequency after
> all.  With the [4/5] we'd do it once in get_next_freq() and then once
> more in cpufreq_driver_fast_switch(), for example, and walking the
> frequency table may be more expensive than doing the switch in the
> first place.

If a driver API is added to return the platform frequency associated
with a target frequency, what do you think about requiring the
fast_switch API to take a target-supported frequency?


Re: [PATCH 5/5] cpufreq: schedutil: do not update rate limit ts when freq is unchanged

2016-05-19 Thread Steve Muckle
On Fri, May 20, 2016 at 02:24:19AM +0200, Rafael J. Wysocki wrote:
> On Fri, May 20, 2016 at 1:34 AM, Steve Muckle  wrote:
> > On Thu, May 19, 2016 at 11:15:52PM +0200, Rafael J. Wysocki wrote:
> >> But anyway this change again seems to be an optimization that might be
> >> done later to me.
> >>
> >> I guess there are many things that might be optimized in schedutil,
> >> but I'd prefer to address one item at a time, maybe going after the
> >> ones that appear most relevant first?
> >
> > Calling the last two patches in this series optimizations is a stretch
> > IMO. Issuing frequency change requests that result in the same
> > target-supported frequency is clearly unnecessary and ends up delaying
> > more urgent frequency changes, which I think is more like a bug.
> 
> The [4/5] is pulling stuff where it doesn't belong.  Simple as that.
> Frequency tables don't belong in schedutil, so don't pull them in
> there.
> 
> If you want to do that cleanly, add a call to the driver that will
> tell you what frequency would be selected by it if it were given a
> particular target.
> 
> I actually do agree with the direction of it and the [5/5], but I
> don't like cutting corners. :-)

Okay sure, makes sense and I'll work on a change to do that.

> 
> > These patches are also needed in conjunction with the first three to address
> > the remote wakeup delay.
> 
> Well, does this mean that without the [4-5/5] the rest of the series
> doesn't provide as much benefit as initially expected?

At least in my testing the last two patches were required to show the
improvement with the unit test I had developed. This is because all the
minor system activity would keep pushing the rate limit out, so when the
remote callback came in it had no effect.

So the last two patches aren't a hard requirement to theoretically solve
the problem but in practice I don't think a benefit will be seen without
them.

> > Are there specific items you want to see addressed before these patches 
> > could go in?
> 
> Do you mean in addition to what I already said in my comments?

I was referring to the other possible optimizations/changes you
mentioned in schedutil that might serve as cause to defer these patches.

> 
> > I'm aware of the RT/DL support that needs improving, though
> > that should be orthogonal.
> >
> > Also if it helps, I can provide a test case and/or traces to show the
> > need for the last two patches.
> 
> Yes, that will help.

Okay I'll include a pointer to my unit test program and some trace data
in my next post.

thanks,
Steve


Re: [PATCH 5/5] cpufreq: schedutil: do not update rate limit ts when freq is unchanged

2016-05-19 Thread Steve Muckle
On Thu, May 19, 2016 at 11:15:52PM +0200, Rafael J. Wysocki wrote:
> But anyway this change again seems to be an optimization that might be
> done later to me.
> 
> I guess there are many things that might be optimized in schedutil,
> but I'd prefer to address one item at a time, maybe going after the
> ones that appear most relevant first?

Calling the last two patches in this series optimizations is a stretch
IMO. Issuing frequency change requests that result in the same
target-supported frequency is clearly unnecessary and ends up delaying
more urgent frequency changes, which I think is more like a bug. These
patches are also needed in conjunction with the first three to address
the remote wakeup delay.

Are there specific items you want to see addressed before these patches
could go in? I'm aware of the RT/DL support that needs improving, though
that should be orthogonal.

Also if it helps, I can provide a test case and/or traces to show the
need for the last two patches.

thanks,
Steve


Re: [PATCH 3/5] sched: cpufreq: call cpufreq hook from remote CPUs

2016-05-19 Thread Steve Muckle
On Thu, May 19, 2016 at 11:06:14PM +0200, Rafael J. Wysocki wrote:
> > In the case of a remote update the hook has to run (or not) after it is
> > known whether preemption will occur so we don't do needless work or
> > IPIs. If the policy CPUs aren't known in the scheduler then the early
> > hook will always need to be called along with an indication that it is
> > the early hook being called. If it turns out to be a remote update it
> > could then be deferred to the later hook, which would only be called
> > when a remote update has been deferred and preemption has not occurred.
> >
> > This means two hook invocations for a remote non-preempting wakeup
> > though instead of one.  Perhaps this is a good middle ground on code
> > churn vs. optimization though.
> 
> I would think so.

Ok, I will pursue this approach.

thanks,
Steve


Re: [PATCH 2/5] cpufreq: schedutil: support scheduler cpufreq callbacks on remote CPUs

2016-05-19 Thread Steve Muckle
On Thu, May 19, 2016 at 10:55:23PM +0200, Rafael J. Wysocki wrote:
> >> > +static inline bool sugov_queue_remote_callback(struct sugov_policy 
> >> > *sg_policy,
> >> > +int cpu)
> >> > +{
> >> > +   struct cpufreq_policy *policy = sg_policy->policy;
> >> > +
> >> > +   if (!cpumask_test_cpu(smp_processor_id(), policy->cpus)) {
> >>
> >> This check is overkill for policies that aren't shared (and we have a
> >> special case for them already).
> >
> > I don't see why it is overkill -
> 
> Because it requires more computation, memory accesses etc than simply
> comparing smp_processor_id() with cpu.

Do you have a preference on how to restructure this? Otherwise I'll
create a second version of sugov_update_commit, factoring out as much of
it as I can into two inline sub-functions. 

...
> 
> > but it seems like an odd inconsistency for the governor to trace unchanged
> > frequencies when fast switches are enabled but not otherwise. It'd be
> > useful I think for profiling and tuning if the tracing was consistent.
> 
> Well, fair enough.
> 
> > This behavioral change is admittedly not part of the purpose of the
> > patch and could be split out if needbe.
> 
> No need to split IMO, but it might be prudent to mention that change
> in behavior in the changelog.

Will do.

thanks,
Steve


Re: [PATCH 5/5] cpufreq: schedutil: do not update rate limit ts when freq is unchanged

2016-05-19 Thread Steve Muckle
On Thu, May 19, 2016 at 01:44:36AM +0200, Rafael J. Wysocki wrote:
> On Mon, May 9, 2016 at 11:20 PM, Steve Muckle  wrote:
> > The rate limit timestamp (last_freq_update_time) is currently advanced
> > anytime schedutil re-evaluates the policy regardless of whether the CPU
> > frequency is changed or not. This means that utilization updates which
> > have no effect can cause much more significant utilization updates
> > (which require a large increase or decrease in CPU frequency) to be
> > delayed due to rate limiting.
> >
> > Instead only update the rate limiting timestamp when the requested
> > target-supported frequency changes. The rate limit will now apply to
> > the rate of CPU frequency changes rather than the rate of
> > re-evaluations of the policy frequency.
> >
> > Signed-off-by: Steve Muckle 
> 
> I'm sort of divided here to be honest.

It is true that this means we'll do more frequency re-evaluations; they
will occur until an actual frequency change is requested.

But the way it stands now, with a system's typical background activity
there are so many minor events that it is very common for throttling to
be in effect, causing major events to be ignored. 
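
As a concrete (hypothetical) timeline with a 10 ms rate limit: a no-op
utilization update at t = 0 advances last_freq_update_time, so a large
utilization spike at t = 5 ms is throttled. If background activity keeps
re-arming the window every few milliseconds, the spike can be deferred
far longer than a single rate-limit window.
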
> 
> > ---
> >  kernel/sched/cpufreq_schedutil.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c 
> > b/kernel/sched/cpufreq_schedutil.c
> > index e185075fcb5c..4d2907c8a142 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -117,12 +117,11 @@ static void sugov_update_commit(struct sugov_cpu 
> > *sg_cpu, int cpu, u64 time,
> > struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> > struct cpufreq_policy *policy = sg_policy->policy;
> >
> > -   sg_policy->last_freq_update_time = time;
> > -
> > if (sg_policy->next_freq == next_freq) {
> > trace_cpu_frequency(policy->cur, cpu);
> 
> You should at least rate limit the trace_cpu_frequency() thing here if
> you don't want to advance the last update time I think, or you may
> easily end up with the trace buffer flooded by irrelevant stuff.

Going back to the reason this tracepoint exists, is it known why
powertop thinks the CPU is idle when this tracepoint is removed? Maybe
it's possible to get rid of this tracepoint altogether.

Thanks for reviewing the series.

thanks,
Steve


Re: [PATCH 4/5] cpufreq: schedutil: map raw required frequency to CPU-supported frequency

2016-05-19 Thread Steve Muckle
On Thu, May 19, 2016 at 01:37:40AM +0200, Rafael J. Wysocki wrote:
> On Mon, May 9, 2016 at 11:20 PM, Steve Muckle  wrote:
> > The mechanisms for remote CPU updates and slow-path frequency
> > transitions are relatively expensive - the former is an IPI while the
> > latter requires waking up a thread to do work. These activities should
> > be avoided if they are not necessary. To that end, calculate the
> > actual target-supported frequency required by the new utilization
> > value in schedutil. If it is the same as the previously requested
> > frequency then there is no need to continue with the update.
> 
> Unless the max/min limits changed in the meantime, right?

Right, I'll amend the commit text. The functionality is correct AFAICS.

> >
> > Signed-off-by: Steve Muckle 
> > ---
> >  kernel/sched/cpufreq_schedutil.c | 14 +-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c 
> > b/kernel/sched/cpufreq_schedutil.c
> > index 6cb2ecc204ec..e185075fcb5c 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -153,14 +153,26 @@ static void sugov_update_commit(struct sugov_cpu 
> > *sg_cpu, int cpu, u64 time,
> >   * next_freq = C * curr_freq * util_raw / max
> >   *
> >   * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
> > + *
> > + * The lowest target-supported frequency which is equal or greater than 
> > the raw
> > + * next_freq (as calculated above) is returned, or the CPU's max_freq if 
> > such
> > + * a target-supported frequency does not exist.
> >   */
> >  static unsigned int get_next_freq(struct cpufreq_policy *policy,
> >   unsigned long util, unsigned long max)
> >  {
> > +   struct cpufreq_frequency_table *entry;
> > unsigned int freq = arch_scale_freq_invariant() ?
> > policy->cpuinfo.max_freq : policy->cur;
> > +   unsigned int target_freq = UINT_MAX;
> > +
> > +   freq = (freq + (freq >> 2)) * util / max;
> > +
> > +   cpufreq_for_each_valid_entry(entry, policy->freq_table)
> > +   if (entry->frequency >= freq && entry->frequency < 
> > target_freq)
> > +   target_freq = entry->frequency;
> 
> Please don't assume that every driver will have a frequency table.
> That may not be the case in the future (and I'm not even sure about
> the existing CPPC driver for that matter).

For platforms without a frequency table I guess we can just continue
with the current behavior, passing in the raw calculated frequency. I'll
make this change.
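
Roughly this, as a sketch only -- assuming drivers without a table
leave policy->freq_table NULL:

static unsigned int get_next_freq(struct cpufreq_policy *policy,
				  unsigned long util, unsigned long max)
{
	struct cpufreq_frequency_table *entry;
	unsigned int freq = arch_scale_freq_invariant() ?
			    policy->cpuinfo.max_freq : policy->cur;
	unsigned int target_freq = UINT_MAX;

	freq = (freq + (freq >> 2)) * util / max;

	/* no table: keep the current behavior, use the raw frequency */
	if (!policy->freq_table)
		return freq;

	cpufreq_for_each_valid_entry(entry, policy->freq_table)
		if (entry->frequency >= freq && entry->frequency < target_freq)
			target_freq = entry->frequency;

	return target_freq != UINT_MAX ? target_freq : policy->cpuinfo.max_freq;
}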

At some point I imagine those platforms will want to achieve similar
behavior, to avoid very small transitions that provide no real benefit.
Maybe some sort of percentage threshold in schedutil down the road.

thanks,
Steve


Re: [PATCH 3/5] sched: cpufreq: call cpufreq hook from remote CPUs

2016-05-19 Thread Steve Muckle
On Thu, May 19, 2016 at 02:00:54PM +0200, Rafael J. Wysocki wrote:
> On Thu, May 19, 2016 at 1:33 AM, Rafael J. Wysocki  wrote:
> > On Mon, May 9, 2016 at 11:20 PM, Steve Muckle  
> > wrote:
> >> Without calling the cpufreq hook for a remote wakeup it is possible
> >> for such a wakeup to go unnoticed by cpufreq on the target CPU for up
> >> to a full tick. This can occur if the target CPU is running a
> >> CPU-bound task and preemption does not occur. If preemption does occur
> >> then the scheduler is expected to run soon on the target CPU anyway so
> >> invoking the cpufreq hook on the remote wakeup is not required.
> >>
> >> In order to avoid unnecessary interprocessor communication in the
> >> governor for the preemption case, the existing hook (which happens
> >> before preemption is decided) is only called when the target CPU
> >> is within the current CPU's cpufreq policy. A new hook is added in
> >> check_preempt_curr() to handle events outside the current CPU's
> >> cpufreq policy where preemption does not happen.
> >>
> >> Some governors may opt to not receive remote CPU callbacks. This
> >> behavior is possible by providing NULL as the new policy_cpus
> >> parameter to cpufreq_add_update_util_hook(). Callbacks will only be
> >> issued in this case when the target CPU and the current CPU are the
> >> same. Otherwise policy_cpus is used to determine what is a local
> >> vs. remote callback.
> >
> > I don't really like the extra complexity added by this patch.
> >
> > It makes the code look fragile at some places 

Perhaps I can improve these, can you point them out?

> > and it only really matters for schedutil 

True. As we are trying to create an integrated scheduler+CPU frequency
control solution I think some of this is to be expected, and should be
worthwhile since this is hopefully the future and will eventually
replace the need for the other governors.

> > and for the fast switch case in there.

Once there is a high priority context for the slow path I expect it to
benefit from this as well.

> >
> > Overall, it looks like a premature optimization to me.

Are you referring to this new approach of avoiding duplicate IPIs, or
supporting updates on remote wakeups overall?

The duplicate IPI thing is admittedly not required for the problem I'm
looking to solve but I figured at least some people would be concerned
about it.  If folks can live with it for now then I can go back to the
simpler approach I had in my first posting.
 
> In particular, the dance with checking the policy CPUs from the
> scheduler is questionable (the scheduler might be interested in this
> information for other purposes too and hooking it up in an ad-hoc way
> just for cpufreq doesn't seem to be appropriate from that perspective)
> and also doesn't seem to be necessary.
> 
> You can check if the current CPU is a policy one from the governor and
> if that is the case, simply run the frequency update on it directly
> without any IPI (because if both the target CPU and the current CPU
> belong to the same policy, it doesn't matter which one of them will
> run the update).

The checking of policy CPUs from the scheduler was done so as to
minimize the number of calls to the hook, given their expense.

In the case of a remote update the hook has to run (or not) after it is
known whether preemption will occur so we don't do needless work or
IPIs. If the policy CPUs aren't known in the scheduler then the early
hook will always need to be called along with an indication that it is
the early hook being called. If it turns out to be a remote update it
could then be deferred to the later hook, which would only be called
when a remote update has been deferred and preemption has not occurred.

This means two hook invocations for a remote non-preempting wakeup
instead of one. Perhaps this is a good middle ground on code churn
vs. optimization though; a rough sketch follows.
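
To make the tradeoff concrete, the two-hook flow would be roughly the
following (all names here are hypothetical, not from the posted
series):

	/* early hook, called before the preemption decision */
	static inline void cpufreq_update_util_early(struct rq *rq, u64 time,
						     unsigned long util,
						     unsigned long max)
	{
		if (cpu_of(rq) == smp_processor_id())
			cpufreq_update_util(time, util, max); /* local: run now */
		else
			rq->cfs.remote_update_pending = 1;    /* remote: defer */
	}

	/* late hook, from check_preempt_curr() after the decision */
	static inline void cpufreq_update_util_late(struct rq *rq, u64 time,
						    unsigned long util,
						    unsigned long max)
	{
		if (rq->cfs.remote_update_pending &&
		    !test_tsk_need_resched(rq->curr)) {
			rq->cfs.remote_update_pending = 0;
			cpufreq_update_util(time, util, max); /* IPI here */
		}
	}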

thanks,
Steve


Re: [PATCH 2/5] cpufreq: schedutil: support scheduler cpufreq callbacks on remote CPUs

2016-05-19 Thread Steve Muckle
On Thu, May 19, 2016 at 01:24:41AM +0200, Rafael J. Wysocki wrote:
> On Mon, May 9, 2016 at 11:20 PM, Steve Muckle  wrote:
> > In preparation for the scheduler cpufreq callback happening on remote
> > CPUs, add support for this in schedutil.
> >
> > Schedutil currently requires the callback occur on the CPU being
> > updated in order to support fast frequency switches. Remove this
> > limitation by checking for the current CPU being outside the target
> > CPU's cpufreq policy and if this is the case, enqueuing an irq_work on
> > the target CPU. The irq_work for schedutil is modified to carry out a
> > fast frequency switch if that is enabled for the policy.
> >
> > If the callback occurs on a CPU within the target CPU's policy, the
> > transition is carried out on the local CPU.
> >
> > Signed-off-by: Steve Muckle 
> > ---
> >  kernel/sched/cpufreq_schedutil.c | 86 
> > ++--
> >  1 file changed, 65 insertions(+), 21 deletions(-)
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c 
> > b/kernel/sched/cpufreq_schedutil.c
> > index 154ae3a51e86..c81f9432f520 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -76,27 +76,61 @@ static bool sugov_should_update_freq(struct 
> > sugov_policy *sg_policy, u64 time)
> > return delta_ns >= sg_policy->freq_update_delay_ns;
> >  }
> >
> > -static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
> > +static void sugov_fast_switch(struct sugov_policy *sg_policy, int cpu,
> > + unsigned int next_freq)
> > +{
> > +   struct cpufreq_policy *policy = sg_policy->policy;
> > +
> > +   next_freq = cpufreq_driver_fast_switch(policy, next_freq);
> > +   if (next_freq == CPUFREQ_ENTRY_INVALID)
> > +   return;
> > +
> > +   policy->cur = next_freq;
> > +   trace_cpu_frequency(next_freq, cpu);
> > +}
> > +
> > +#ifdef CONFIG_SMP
> 
> schedutil depends on CONFIG_SMP now, so that's not necessary at least
> for the time being.

Will remove.

> 
> > +static inline bool sugov_queue_remote_callback(struct sugov_policy 
> > *sg_policy,
> > +int cpu)
> > +{
> > +   struct cpufreq_policy *policy = sg_policy->policy;
> > +
> > +   if (!cpumask_test_cpu(smp_processor_id(), policy->cpus)) {
> 
> This check is overkill for policies that aren't shared (and we have a
> special case for them already).

I don't see why it is overkill - regardless of whether the policy is
shared, we need to determine whether or not we are running on one of the
CPUs (or in the case of a non-shared policy, the single CPU) within that
policy to know whether we can immediately change the frequency in this
context or a remote call is required.

> > +   sg_policy->work_in_progress = true;
> > +   irq_work_queue_on(&sg_policy->irq_work, cpu);
> > +   return true;
> > +   }
> > +
> > +   return false;
> > +}
> > +#else
> > +static inline bool sugov_queue_remote_callback(struct sugov_policy 
> > *sg_policy,
> > +int cpu)
> > +{
> > +   return false;
> > +}
> > +#endif
> > +
> > +static void sugov_update_commit(struct sugov_cpu *sg_cpu, int cpu, u64 
> > time,
> 
> It looks like you might pass hook here instead of the sg_cpu, cpu pair.

I can do that but it means having to do the container_of operation
again. Strictly speaking this seems slightly less efficient than
passing the above values, which are already available in the callers.
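
For reference, the hook-based variant would be something like this
(assuming sugov_cpu embeds the update_util_data as "update_util"):

	static void sugov_update_commit(struct update_util_data *hook,
					u64 time, unsigned int next_freq)
	{
		struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu,
							update_util);
		int cpu = hook->cpu;

		/* ... rest as in the patch, using sg_cpu and cpu ... */
	}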

> 
> > unsigned int next_freq)
> >  {
> > +   struct sugov_policy *sg_policy = sg_cpu->sg_policy;
> > struct cpufreq_policy *policy = sg_policy->policy;
> >
> > sg_policy->last_freq_update_time = time;
> >
> > +   if (sg_policy->next_freq == next_freq) {
> > +   trace_cpu_frequency(policy->cur, cpu);
> > +   return;
> > +   }
> 
> There was a reason why I put the above under the fast_switch_enabled
> branch and it was because this check/trace is not necessary otherwise.

I remember asking about this tracepoint earlier. You had said it was
required because powertop would not work without it (reporting the CPU
as idle in certain situations).

I'm not sure why that is only true for the fast switch enabled case but
it seems like an odd inconsistency for the governor to trace unchanged
frequencies when fast switches are enabled but not otherwise. It'd be
useful I think for profiling and tuning if the tracing was consistent.

This behavioral change is admittedly not part of the purpose of the
patch and could be split out if need be.

thanks,
Steve


Re: [PATCH] sched/fair: Invoke cpufreq hooks for CONFIG_SMP unset

2016-05-09 Thread Steve Muckle
On Sat, May 07, 2016 at 08:30:51AM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra  wrote:
> 
> > On Fri, May 06, 2016 at 02:58:43PM +0200, Rafael J. Wysocki wrote:
> > > From: Rafael J. Wysocki 
> > > 
> > > Commit 34e2c555f3e1 (cpufreq: Add mechanism for registering utilization
> > > update callbacks) overlooked the fact that update_load_avg(), where
> > > CFS invokes cpufreq utilization update hooks, becomes an empty stub for
> > > CONFIG_SMP unset.  In consequence, if CONFIG_SMP is not set, cpufreq
> > > governors are never invoked from CFS and they do not have a chance to
> > > evaluate CPU performace levels and update them often enough.  Needless
> > > to say, things don't work as expected then.
> > > 
> > > Fix the problem by making the !CONFIG_SMP stub of update_load_avg()
> > > invoke cpufreq update hooks too.
> > > 
> > > Fixes: 34e2c555f3e1 (cpufreq: Add mechanism for registering utilization 
> > > update callbacks)
> > > Reported-by: Steve Muckle 
> > > Signed-off-by: Rafael J. Wysocki 
> > 
> > Rafael; feel free to push this through the pm tree.
> > 
> > Acked-by: Peter Zijlstra (Intel) 
> 
> I already have it in sched/urgent.

Looks like this just missed 4.6-rc7, and the release email for
that doesn't suggest there will be an rc8, as things stand...
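
For the record, the fix as described makes the !CONFIG_SMP stub of
update_load_avg() poke cpufreq; reconstructed from the changelog it is
roughly:

	static inline void update_load_avg(struct sched_entity *se, int not_used)
	{
		struct cfs_rq *cfs_rq = cfs_rq_of(se);
		struct rq *rq = rq_of(cfs_rq);

		cpufreq_trigger_update(rq_clock(rq));
	}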


[PATCH 0/5] cpufreq: schedutil: improve latency of response

2016-05-09 Thread Steve Muckle
The cpufreq update hook is presently not called when the target CPU is not the
current CPU. This may cause a delay in adjusting the frequency of the target
CPU to accommodate new task load since the target CPU may not run the scheduler
hook until the next tick.

In order to test this series I wrote a small sample program that reproduces a
scenario susceptible to the problem above. I ran this on x86 since ARM
currently lacks both fast switch support and a high priority (RT/DL) slow path
switch context. With HZ set to 100 I was easily able to see an unnecessary
latency of 7ms in the target CPU frequency being re-evaluated and adjusted.
This latency is eliminated with this series.
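
The scenario can be reproduced with something like the program below (a
sketch, not the exact program used for the numbers in this posting):
pin a CPU-bound hog to CPU 1, then wake a niced task on CPU 1 from
CPU 0 so that the wakeup should not preempt the hog. The gap between
each wakeup and the next cpu_frequency trace event on CPU 1 shows when
cpufreq reacted.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void pin_to(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		exit(1);
	}
}

int main(void)
{
	int pipefd[2];
	char buf = 0;
	pid_t hog, sleeper;
	int i;

	if (pipe(pipefd)) {
		perror("pipe");
		return 1;
	}

	hog = fork();
	if (hog == 0) {			/* CPU-bound task on CPU 1 */
		pin_to(1);
		for (;;)
			;
	}

	sleeper = fork();
	if (sleeper == 0) {		/* niced so its wakeup shouldn't preempt */
		pin_to(1);
		setpriority(PRIO_PROCESS, 0, 19);
		while (read(pipefd[0], &buf, 1) == 1)
			;
		exit(0);
	}

	pin_to(0);			/* remote wakeups come from CPU 0 */
	for (i = 0; i < 500; i++) {
		if (write(pipefd[1], &buf, 1) != 1)
			break;
		usleep(20000);
	}

	kill(hog, SIGKILL);
	kill(sleeper, SIGKILL);
	wait(NULL);
	wait(NULL);
	return 0;
}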

This is the second version of a patch series initially posted here:
http://thread.gmane.org/gmane.linux.power-management.general/74977
The approach is quite different though. The dbs and intel_pstate governors are
not modified to take remote callbacks and the scheduler modifications are
rather different so as to avoid redundant IPIs.

Having the governor invoke the switch on a remote CPU means sending an IPI.
Unfortunately if the scheduler event also results in preemption, the scheduler
(and consequently the cpufreq hook) will run soon on the target CPU anyway. A
resched IPI may also be sent. In short it is desirable to not run the cpufreq
hook remotely if preemption will occur as a result of the scheduler event. To
achieve this, remote cpufreq callbacks are processed after preemption is
decided in check_preempt_curr().

Another optimization is to process callbacks from remote CPUs within a target
CPU's cpufreq policy on the remote CPU, since any CPU within a frequency domain
should be able to update the frequency there.

Legacy behavior of the cpufreq hook, where the callback only occurs on the
target CPU, is maintained via special usage of a new parameter to
cpufreq_add_update_util_hook(). That function now takes a fourth parameter,
cpumask_var_t *policy_cpus. If NULL then the hook is not called remotely.
Otherwise it is expected to point to the cpumask for the frequency domain so
the scheduler may differentiate between local and remote callbacks as above.
Clients of the hook other than schedutil are configured to only receive local
callbacks so their behavior does not change.
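
Registration then looks like this for the two cases (the legacy call is
verbatim from patch 3; the schedutil one is a sketch, as the actual
callsite is in the governor start path which is not shown here):

	/* legacy governors: local callbacks only */
	cpufreq_add_update_util_hook(cpu, &cdbs->update_util,
				     dbs_update_util_handler, NULL);

	/* schedutil: accept remote callbacks within the policy */
	cpufreq_add_update_util_hook(cpu, &sg_cpu->update_util,
				     sugov_update_shared, &policy->cpus);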

The above logic alone did not fix the original issue because currently,
schedutil does not map the raw required frequency to a target-supported
frequency. This means that even small variations in the raw required
frequency will result in an attempted frequency switch and a
corresponding advance of the rate limit timestamp,
last_freq_update_time. If the actual target-supported frequency is
unchanged then this is not necessary; furthermore, schedutil will then
be rate limited, causing any imminent and more substantial updates to
wait.

To address this the required target-supported frequency is now
calculated in schedutil. Also, last_freq_update_time is not advanced if
there is no change in the requested target-supported frequency. I
expect the mapping to a target-supported frequency to require
discussion since acpi_cpufreq_fast_switch also performs this mapping.

I attempted to look for a general performance regression but the
results seemed inconclusive (tried a couple of passes each).

model name  : Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Performance counter stats for 'perf bench sched messaging -g 50 -l 5000' (30 runs):
baseline w/perf gov:
 20.847382649 seconds time elapsed ( +-  0.14% )
 20.701305631 seconds time elapsed ( +-  0.17% )
patched w/perf gov:
 20.665433861 seconds time elapsed ( +-  0.19% )
 20.606157659 seconds time elapsed ( +-  0.19% )
baseline w/sched gov:
 20.893686693 seconds time elapsed ( +-  0.12% )
 20.639752039 seconds time elapsed ( +-  0.12% )
patched w/sched gov:
 20.847845042 seconds time elapsed ( +-  0.14% )
 20.644003401 seconds time elapsed ( +-  0.14% )

Steve Muckle (5):
  sched: cpufreq: add cpu to update_util_data
  cpufreq: schedutil: support scheduler cpufreq callbacks on remote CPUs
  sched: cpufreq: call cpufreq hook from remote CPUs
  cpufreq: schedutil: map raw required frequency to CPU-supported
frequency
  cpufreq: schedutil: do not update rate limit ts when freq is unchanged

 drivers/cpufreq/cpufreq_governor.c |   2 +-
 drivers/cpufreq/intel_pstate.c |   2 +-
 include/linux/sched.h  |   5 +-
 kernel/sched/core.c|   4 ++
 kernel/sched/cpufreq.c |  14 -
 kernel/sched/cpufreq_schedutil.c   | 105 +++-
 kernel/sched/fair.c|  40 +++---
 kernel/sched/sched.h   | 106 +
 8 files changed, 206 insertions(+), 72 deletions(-)

-- 
2.4.10



[PATCH 1/5] sched: cpufreq: add cpu to update_util_data

2016-05-09 Thread Steve Muckle
Upcoming support for scheduler cpufreq callbacks on remote wakeups
will require the client to know what the target CPU is that the
callback is being invoked for. Add this information into the callback
data structure.

Signed-off-by: Steve Muckle 
---
 include/linux/sched.h  | 1 +
 kernel/sched/cpufreq.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8344e1947eec..81aba7dc5966 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3238,6 +3238,7 @@ static inline unsigned long rlimit_max(unsigned int limit)
 struct update_util_data {
void (*func)(struct update_util_data *data,
 u64 time, unsigned long util, unsigned long max);
+   int cpu;
 };
 
 void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index 1141954e73b4..d88a78ea805d 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -42,6 +42,7 @@ void cpufreq_add_update_util_hook(int cpu, struct 
update_util_data *data,
return;
 
data->func = func;
+   data->cpu = cpu;
rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
 }
 EXPORT_SYMBOL_GPL(cpufreq_add_update_util_hook);
-- 
2.4.10



[PATCH 5/5] cpufreq: schedutil: do not update rate limit ts when freq is unchanged

2016-05-09 Thread Steve Muckle
The rate limit timestamp (last_freq_update_time) is currently advanced
anytime schedutil re-evaluates the policy regardless of whether the CPU
frequency is changed or not. This means that utilization updates which
have no effect can cause much more significant utilization updates
(which require a large increase or decrease in CPU frequency) to be
delayed due to rate limiting.

Instead only update the rate limiting timestamp when the requested
target-supported frequency changes. The rate limit will now apply to
the rate of CPU frequency changes rather than the rate of
re-evaluations of the policy frequency.

Signed-off-by: Steve Muckle 
---
 kernel/sched/cpufreq_schedutil.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e185075fcb5c..4d2907c8a142 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -117,12 +117,11 @@ static void sugov_update_commit(struct sugov_cpu *sg_cpu, 
int cpu, u64 time,
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
struct cpufreq_policy *policy = sg_policy->policy;
 
-   sg_policy->last_freq_update_time = time;
-
if (sg_policy->next_freq == next_freq) {
trace_cpu_frequency(policy->cur, cpu);
return;
}
+   sg_policy->last_freq_update_time = time;
sg_policy->next_freq = next_freq;
 
if (sugov_queue_remote_callback(sg_policy, cpu))
-- 
2.4.10



[PATCH 4/5] cpufreq: schedutil: map raw required frequency to CPU-supported frequency

2016-05-09 Thread Steve Muckle
The mechanisms for remote CPU updates and slow-path frequency
transitions are relatively expensive - the former is an IPI while the
latter requires waking up a thread to do work. These activities should
be avoided if they are not necessary. To that end, calculate the
actual target-supported frequency required by the new utilization
value in schedutil. If it is the same as the previously requested
frequency then there is no need to continue with the update.

Signed-off-by: Steve Muckle 
---
 kernel/sched/cpufreq_schedutil.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 6cb2ecc204ec..e185075fcb5c 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -153,14 +153,26 @@ static void sugov_update_commit(struct sugov_cpu *sg_cpu, 
int cpu, u64 time,
  * next_freq = C * curr_freq * util_raw / max
  *
  * Take C = 1.25 for the frequency tipping point at (util / max) = 0.8.
+ *
+ * The lowest target-supported frequency which is equal or greater than the raw
+ * next_freq (as calculated above) is returned, or the CPU's max_freq if such
+ * a target-supported frequency does not exist.
  */
 static unsigned int get_next_freq(struct cpufreq_policy *policy,
  unsigned long util, unsigned long max)
 {
+   struct cpufreq_frequency_table *entry;
unsigned int freq = arch_scale_freq_invariant() ?
policy->cpuinfo.max_freq : policy->cur;
+   unsigned int target_freq = UINT_MAX;
+
+   freq = (freq + (freq >> 2)) * util / max;
+
+   cpufreq_for_each_valid_entry(entry, policy->freq_table)
+   if (entry->frequency >= freq && entry->frequency < target_freq)
+   target_freq = entry->frequency;
 
-   return (freq + (freq >> 2)) * util / max;
+   return target_freq != UINT_MAX ? target_freq : policy->cpuinfo.max_freq;
 }
 
 static void sugov_update_single(struct update_util_data *hook, u64 time,
-- 
2.4.10



[PATCH 3/5] sched: cpufreq: call cpufreq hook from remote CPUs

2016-05-09 Thread Steve Muckle
Without calling the cpufreq hook for a remote wakeup it is possible
for such a wakeup to go unnoticed by cpufreq on the target CPU for up
to a full tick. This can occur if the target CPU is running a
CPU-bound task and preemption does not occur. If preemption does occur
then the scheduler is expected to run soon on the target CPU anyway so
invoking the cpufreq hook on the remote wakeup is not required.

In order to avoid unnecessary interprocessor communication in the
governor for the preemption case, the existing hook (which happens
before preemption is decided) is only called when the target CPU
is within the current CPU's cpufreq policy. A new hook is added in
check_preempt_curr() to handle events outside the current CPU's
cpufreq policy where preemption does not happen.

Some governors may opt to not receive remote CPU callbacks. This
behavior is possible by providing NULL as the new policy_cpus
parameter to cpufreq_add_update_util_hook(). Callbacks will only be
issued in this case when the target CPU and the current CPU are the
same. Otherwise policy_cpus is used to determine what is a local
vs. remote callback.

Signed-off-by: Steve Muckle 
---
 drivers/cpufreq/cpufreq_governor.c |   2 +-
 drivers/cpufreq/intel_pstate.c |   2 +-
 include/linux/sched.h  |   4 +-
 kernel/sched/core.c|   4 ++
 kernel/sched/cpufreq.c |  13 -
 kernel/sched/cpufreq_schedutil.c   |   6 ++-
 kernel/sched/fair.c|  40 +++---
 kernel/sched/sched.h   | 106 +
 8 files changed, 127 insertions(+), 50 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_governor.c 
b/drivers/cpufreq/cpufreq_governor.c
index 20f0a4e114d1..e127a7a22177 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -311,7 +311,7 @@ static void gov_set_update_util(struct policy_dbs_info 
*policy_dbs,
struct cpu_dbs_info *cdbs = &per_cpu(cpu_dbs, cpu);
 
cpufreq_add_update_util_hook(cpu, &cdbs->update_util,
-dbs_update_util_handler);
+dbs_update_util_handler, NULL);
}
 }
 
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 6c7cff13f0ed..9cf262ef23af 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1266,7 +1266,7 @@ static void intel_pstate_set_update_util_hook(unsigned 
int cpu_num)
/* Prevent intel_pstate_update_util() from using stale data. */
cpu->sample.time = 0;
cpufreq_add_update_util_hook(cpu_num, &cpu->update_util,
-intel_pstate_update_util);
+intel_pstate_update_util, NULL);
 }
 
 static void intel_pstate_clear_update_util_hook(unsigned int cpu)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 81aba7dc5966..ce154518119a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3239,11 +3239,13 @@ struct update_util_data {
void (*func)(struct update_util_data *data,
 u64 time, unsigned long util, unsigned long max);
int cpu;
+   cpumask_var_t *policy_cpus;
 };
 
 void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
void (*func)(struct update_util_data *data, u64 time,
-unsigned long util, unsigned long max));
+unsigned long util, unsigned long max),
+ cpumask_var_t *policy_cpus);
 void cpufreq_remove_update_util_hook(int cpu);
 #endif /* CONFIG_CPU_FREQ */
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b489fcac37b..fce6c0b43231 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -450,6 +450,8 @@ void resched_curr(struct rq *rq)
 
lockdep_assert_held(&rq->lock);
 
+   cpufreq_set_skip_cb(rq);
+
if (test_tsk_need_resched(curr))
return;
 
@@ -922,6 +924,8 @@ void check_preempt_curr(struct rq *rq, struct task_struct 
*p, int flags)
 */
if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
rq_clock_skip_update(rq, true);
+
+   cpufreq_update_remote(rq);
 }
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index d88a78ea805d..2946d2096bf2 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -18,6 +18,7 @@ DEFINE_PER_CPU(struct update_util_data *, 
cpufreq_update_util_data);
  * @cpu: The CPU to set the pointer for.
  * @data: New pointer value.
  * @func: Callback function to set for the CPU.
+ * @policy_cpus: Pointer to cpumask for CPUs which share the same policy.
  *
  * Set and publish the update_util_data pointer for the given CPU.
  *
@@ -28,12 +29,21 @@ DEFINE_PER_CPU(struct update_util_data *,

[PATCH 2/5] cpufreq: schedutil: support scheduler cpufreq callbacks on remote CPUs

2016-05-09 Thread Steve Muckle
In preparation for the scheduler cpufreq callback happening on remote
CPUs, add support for this in schedutil.

Schedutil currently requires the callback occur on the CPU being
updated in order to support fast frequency switches. Remove this
limitation by checking for the current CPU being outside the target
CPU's cpufreq policy and if this is the case, enqueuing an irq_work on
the target CPU. The irq_work for schedutil is modified to carry out a
fast frequency switch if that is enabled for the policy.

If the callback occurs on a CPU within the target CPU's policy, the
transition is carried out on the local CPU.

Signed-off-by: Steve Muckle 
---
 kernel/sched/cpufreq_schedutil.c | 86 ++--
 1 file changed, 65 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 154ae3a51e86..c81f9432f520 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -76,27 +76,61 @@ static bool sugov_should_update_freq(struct sugov_policy 
*sg_policy, u64 time)
return delta_ns >= sg_policy->freq_update_delay_ns;
 }
 
-static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
+static void sugov_fast_switch(struct sugov_policy *sg_policy, int cpu,
+ unsigned int next_freq)
+{
+   struct cpufreq_policy *policy = sg_policy->policy;
+
+   next_freq = cpufreq_driver_fast_switch(policy, next_freq);
+   if (next_freq == CPUFREQ_ENTRY_INVALID)
+   return;
+
+   policy->cur = next_freq;
+   trace_cpu_frequency(next_freq, cpu);
+}
+
+#ifdef CONFIG_SMP
+static inline bool sugov_queue_remote_callback(struct sugov_policy *sg_policy,
+int cpu)
+{
+   struct cpufreq_policy *policy = sg_policy->policy;
+
+   if (!cpumask_test_cpu(smp_processor_id(), policy->cpus)) {
+   sg_policy->work_in_progress = true;
+   irq_work_queue_on(&sg_policy->irq_work, cpu);
+   return true;
+   }
+
+   return false;
+}
+#else
+static inline bool sugov_queue_remote_callback(struct sugov_policy *sg_policy,
+int cpu)
+{
+   return false;
+}
+#endif
+
+static void sugov_update_commit(struct sugov_cpu *sg_cpu, int cpu, u64 time,
unsigned int next_freq)
 {
+   struct sugov_policy *sg_policy = sg_cpu->sg_policy;
struct cpufreq_policy *policy = sg_policy->policy;
 
sg_policy->last_freq_update_time = time;
 
+   if (sg_policy->next_freq == next_freq) {
+   trace_cpu_frequency(policy->cur, cpu);
+   return;
+   }
+   sg_policy->next_freq = next_freq;
+
+   if (sugov_queue_remote_callback(sg_policy, cpu))
+   return;
+
if (policy->fast_switch_enabled) {
-   if (sg_policy->next_freq == next_freq) {
-   trace_cpu_frequency(policy->cur, smp_processor_id());
-   return;
-   }
-   sg_policy->next_freq = next_freq;
-   next_freq = cpufreq_driver_fast_switch(policy, next_freq);
-   if (next_freq == CPUFREQ_ENTRY_INVALID)
-   return;
-
-   policy->cur = next_freq;
-   trace_cpu_frequency(next_freq, smp_processor_id());
-   } else if (sg_policy->next_freq != next_freq) {
-   sg_policy->next_freq = next_freq;
+   sugov_fast_switch(sg_policy, cpu, next_freq);
+   } else {
sg_policy->work_in_progress = true;
irq_work_queue(&sg_policy->irq_work);
}
@@ -142,12 +176,13 @@ static void sugov_update_single(struct update_util_data 
*hook, u64 time,
 
next_f = util == ULONG_MAX ? policy->cpuinfo.max_freq :
get_next_freq(policy, util, max);
-   sugov_update_commit(sg_policy, time, next_f);
+   sugov_update_commit(sg_cpu, hook->cpu, time, next_f);
 }
 
-static unsigned int sugov_next_freq_shared(struct sugov_policy *sg_policy,
+static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu,
   unsigned long util, unsigned long 
max)
 {
+   struct sugov_policy *sg_policy = sg_cpu->sg_policy;
struct cpufreq_policy *policy = sg_policy->policy;
unsigned int max_f = policy->cpuinfo.max_freq;
u64 last_freq_update_time = sg_policy->last_freq_update_time;
@@ -161,10 +196,10 @@ static unsigned int sugov_next_freq_shared(struct 
sugov_policy *sg_policy,
unsigned long j_util, j_max;
s64 delta_ns;
 
-   if (j == smp_processor_id())
+   j_sg_cpu = &per_cpu(sugov_cpu, j);
+   if (j_sg_cpu == sg_cpu)
continue;
 
-   j_s

Re: [RFC PATCH 1/4] cpufreq: governor: support scheduler cpufreq callbacks on remote CPUs

2016-05-06 Thread Steve Muckle
On Fri, Apr 29, 2016 at 01:21:24PM +0200, Rafael J. Wysocki wrote:
> On Friday, April 29, 2016 04:08:16 PM Viresh Kumar wrote:
...
> > Any clue, why we don't have a non-SMP version of irq_work_queue_on(), Which 
> > can
> > simply call irq_work_queue() ?
> 
> Because nobody else needs it?
> 
> But I agree that it would be nicer to add the stub to irq_work.h.

I had wondered the same myself and figured there had to be a good reason why it 
didn't exist.

But if not, and it's preferred to add it, I'll go ahead and do so.
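
The stub would presumably be just this, in include/linux/irq_work.h:

	#ifndef CONFIG_SMP
	static inline bool irq_work_queue_on(struct irq_work *work, int cpu)
	{
		return irq_work_queue(work);
	}
	#endif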


Re: [PATCH] sched/fair: Invoke cpufreq hooks for CONFIG_SMP unset

2016-05-06 Thread Steve Muckle
Looks good to me.

Also re-tested with intel_pstate on the i7-3630QM with !SMP and
confirmed the issue is resolved. I didn't retest with ondemand because,
for some reason, it wasn't showing the problem before.


Re: cpufreq governors broken with !CONFIG_SMP?

2016-05-05 Thread Steve Muckle
On Fri, May 06, 2016 at 02:09:07AM +0200, Rafael J. Wysocki wrote:
> In turn, schedutil should probably depend on CONFIG_SMP.

In the long term I wonder if it's worth putting PELT under its own
config option or just removing the #ifdef CONFIG_SMP.

Aside from CPU frequency updates on task migration, there are also task
creation and deletion, which would apply on UP. The tunable
infrastructure being created for scheduler-guided frequency may be of
interest on UP also.




cpufreq governors broken with !CONFIG_SMP?

2016-05-05 Thread Steve Muckle
While working on a few patches for schedutil I noticed that the CFS
cpufreq hooks depend on PELT, which depends on CONFIG_SMP.

I compiled and ran a UP kernel with intel_pstate. Running a CPU-bound
task did not result in the frequency increasing beyond fmin. For some
reason ondemand is working for me with the same test; not sure why yet.

It appears dbs/intel-pstate/schedutil have a dependency on CONFIG_SMP
now. Or am I missing something?

