[tip: sched/core] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-09 Thread tip-bot2 for Lingutla Chandrasekhar
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 9bcb959d05eeb564dfc9cac13a59843a4fb2edf2
Gitweb:
https://git.kernel.org/tip/9bcb959d05eeb564dfc9cac13a59843a4fb2edf2
Author:Lingutla Chandrasekhar 
AuthorDate:Wed, 07 Apr 2021 23:06:26 +01:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 09 Apr 2021 18:02:20 +02:00

sched/fair: Ignore percpu threads for imbalance pulls

During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.

In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).

It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.

Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.

Signed-off-by: Lingutla Chandrasekhar 
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Dietmar Eggemann 
Reviewed-by: Vincent Guittot 
Link: 
https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schnei...@arm.com
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc34e35..1ad929b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7598,6 +7598,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
 
+   /* Disregard pcpu kthreads; they are where they need to be. */
+   if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+   return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
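
For context beyond this hunk: the flag being avoided here is consumed later in
load_balance(). A simplified sketch of that consumer (paraphrased from
kernel/sched/fair.c; exact field names can differ between kernel versions)
shows why not setting LBF_SOME_PINNED for per-CPU kthreads also keeps the
parent group from being marked imbalanced:

	/* Simplified excerpt, not verbatim kernel source. */
	if (sd_parent) {
		int *group_imbalance = &sd_parent->groups->sgc->imbalance;

		/*
		 * Only affinity-pinned candidates set LBF_SOME_PINNED; with
		 * this patch a lone per-CPU kthread no longer does, so it no
		 * longer forces group_imbalanced at the parent level.
		 */
		if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
			*group_imbalance = 1;
	}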
 


[tip: sched/core] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-09 Thread tip-bot2 for Lingutla Chandrasekhar
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 8d25d10a4f5a5d87c062838358ab5b3ed7eaa131
Gitweb:
https://git.kernel.org/tip/8d25d10a4f5a5d87c062838358ab5b3ed7eaa131
Author:Lingutla Chandrasekhar 
AuthorDate:Wed, 07 Apr 2021 23:06:26 +01:00
Committer: Peter Zijlstra 
CommitterDate: Fri, 09 Apr 2021 13:52:10 +02:00

sched/fair: Ignore percpu threads for imbalance pulls

During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.

In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).

It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.

Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.

Signed-off-by: Lingutla Chandrasekhar 
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Dietmar Eggemann 
Reviewed-by: Vincent Guittot 
Link: 
https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schnei...@arm.com
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0bd861..d10e33d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7598,6 +7598,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
 
+   /* Disregard pcpu kthreads; they are where they need to be. */
+   if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+   return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
 


[tip: sched/core] sched/fair: Ignore percpu threads for imbalance pulls

2021-04-09 Thread tip-bot2 for Lingutla Chandrasekhar
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 29b628b521119c0dfe151da302e11018cb32db4f
Gitweb:
https://git.kernel.org/tip/29b628b521119c0dfe151da302e11018cb32db4f
Author:Lingutla Chandrasekhar 
AuthorDate:Wed, 07 Apr 2021 23:06:26 +01:00
Committer: Peter Zijlstra 
CommitterDate: Thu, 08 Apr 2021 23:09:44 +02:00

sched/fair: Ignore percpu threads for imbalance pulls

During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.

In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).

It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.

Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:

  DIE [          ]
  MC  [    ][    ]
       0  1  2  3
       L  L  B  B

  arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)

Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.

[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider 
Signed-off-by: Lingutla Chandrasekhar 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Dietmar Eggemann 
Reviewed-by: Vincent Guittot 
Link: 
https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schnei...@arm.com
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0bd861..d10e33d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7598,6 +7598,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
 
+   /* Disregard pcpu kthreads; they are where they need to be. */
+   if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
+   return 0;
+
if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
int cpu;
 


[PATCH] sched/fair: Ignore percpu threads for imbalance pulls

2021-02-17 Thread Lingutla Chandrasekhar
During load balancing, when the balancing group is unable to pull a task
from the busy group due to its ->cpus_ptr constraints, it sets
LBF_SOME_PINNED in the lb env flags. As a consequence, sgc->imbalance
is set for its parent domain level, which makes the group be
classified as imbalanced so it can get help from another balancing CPU.

Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and
CPUs 2-3 as Bigs with below scenario:
- CPU0 doing newly_idle balancing
- CPU1 running percpu kworker and RT task (small tasks)
- CPU2 running 2 big tasks
- CPU3 running 1 medium task

While CPU0 is doing newly_idle load balance at MC level, it fails to
pull the percpu kworker from CPU1, sets LBF_SOME_PINNED in the lb env flags
and sets sgc->imbalance for the DIE level domain. As LBF_ALL_PINNED is not
cleared, it tries to redo the balancing by clearing CPU1 from the env cpus,
but it doesn't find another busiest_group, so CPU0 stops balancing at MC
level without clearing 'sgc->imbalance' and restarts load balancing at DIE level.

Then CPU0 (the balancing CPU) finds the LITTLEs' group as busiest_group with
group type imbalanced, while the Bigs' group, classified with a lower group
type than imbalanced, is ignored when picking the busiest, and the balancing
is aborted without pulling any tasks (by that time, CPU1 might not have any
running tasks).

It is a suboptimal decision to classify the group as imbalanced due to
percpu threads, so don't set LBF_SOME_PINNED for per-CPU threads.

Signed-off-by: Lingutla Chandrasekhar 

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..44a05ad8c96b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7560,7 +7560,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
schedstat_inc(p->se.statistics.nr_failed_migrations_affine);
 
-   env->flags |= LBF_SOME_PINNED;
+   /* Ignore percpu threads for imbalance pulls. */
+   if (p->nr_cpus_allowed > 1)
+   env->flags |= LBF_SOME_PINNED;
 
/*
 * Remember if this task can be migrated to any other CPU in
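
For comparison with the version that eventually landed (see the tip-bot
entries above), the two approaches differ only in where the per-CPU thread is
filtered out; a condensed sketch using only lines that appear in the
respective diffs:

	/* This patch: gate the flag on the task's affinity count. */
	if (p->nr_cpus_allowed > 1)
		env->flags |= LBF_SOME_PINNED;

	/* Merged version: bail out of can_migrate_task() for per-CPU kthreads. */
	if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
		return 0;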



[PATCH v3] arch_topology: Make cpu_capacity sysfs node as read-only

2019-03-31 Thread Lingutla Chandrasekhar
If the user updates any cpu's cpu_capacity, then the new value is going to
be applied to all of its online sibling cpus. But this need not always be
correct, as sibling cpus (in ARM, cpus of the same microarchitecture) can
have different cpu_capacity values with different performance
characteristics. So applying the user supplied cpu_capacity to all cpu
siblings is not correct.

Another problem is that the current code assumes 'all cpus in a cluster,
or with the same package_id (core_siblings), have the same cpu_capacity'.
But with commit '5bdd2b3f0f8 ("arm64: topology: add support to remove
cpu topology sibling masks")', when a cpu is hotplugged out, the cpu
information gets cleared from its sibling cpus. So the user supplied
cpu_capacity would be applied only to the sibling cpus online at that time.
After that, if any cpu is hotplugged in, it would have a different
cpu_capacity than its siblings, which breaks the above assumption.

So instead of mucking around with the core sibling mask for the user
supplied value, use the device tree to set cpu capacity, and make the
cpu_capacity node read-only so it only exposes the asymmetry between
cpus in the system. While at it, remove the cpu_scale_mutex usage, which
was used for sysfs write protection.

Tested-by: Dietmar Eggemann 
Tested-by: Quentin Perret 
Reviewed-by: Quentin Perret 
Acked-by: Sudeep Holla 
Signed-off-by: Lingutla Chandrasekhar 

---

Changes from v2:
   - Corrected spelling mistakes in commit text.
Changes from v1:
   - Removed cpu_scale_mutex usage, suggested by Dietmar Eggemann.
Changes from v0:
   - Instead of iterating all possible cpus to update cpu capacity,
 removed write capability for the sysfs node.

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index edfcf8d982e4..1739d7e1952a 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -7,7 +7,6 @@
  */
 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -31,7 +30,6 @@ void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
per_cpu(freq_scale, i) = scale;
 }
 
-static DEFINE_MUTEX(cpu_scale_mutex);
 DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
 
 void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
@@ -51,37 +49,7 @@ static ssize_t cpu_capacity_show(struct device *dev,
 static void update_topology_flags_workfn(struct work_struct *work);
 static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
 
-static ssize_t cpu_capacity_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf,
- size_t count)
-{
-   struct cpu *cpu = container_of(dev, struct cpu, dev);
-   int this_cpu = cpu->dev.id;
-   int i;
-   unsigned long new_capacity;
-   ssize_t ret;
-
-   if (!count)
-   return 0;
-
-   ret = kstrtoul(buf, 0, &new_capacity);
-   if (ret)
-   return ret;
-   if (new_capacity > SCHED_CAPACITY_SCALE)
-   return -EINVAL;
-
-   mutex_lock(&cpu_scale_mutex);
-   for_each_cpu(i, &cpu_topology[this_cpu].core_sibling)
-   topology_set_cpu_scale(i, new_capacity);
-   mutex_unlock(&cpu_scale_mutex);
-
-   schedule_work(&update_topology_flags_work);
-
-   return count;
-}
-
-static DEVICE_ATTR_RW(cpu_capacity);
+static DEVICE_ATTR_RO(cpu_capacity);
 
 static int register_cpu_capacity_sysctl(void)
 {
@@ -141,7 +109,6 @@ void topology_normalize_cpu_scale(void)
return;
 
pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale);
-   mutex_lock(&cpu_scale_mutex);
for_each_possible_cpu(cpu) {
pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n",
 cpu, raw_capacity[cpu]);
@@ -151,7 +118,6 @@ void topology_normalize_cpu_scale(void)
pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
cpu, topology_get_cpu_scale(NULL, cpu));
}
-   mutex_unlock(&cpu_scale_mutex);
 }
 
 bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
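
With the store callback gone the node is still readable from userspace; a
minimal sketch (assuming the usual sysfs path
/sys/devices/system/cpu/cpuN/cpu_capacity) that dumps each CPU's capacity:

/* Minimal sketch: read the now read-only cpu_capacity node for each CPU. */
#include <stdio.h>

int main(void)
{
	char path[64];
	unsigned long cap;

	for (int cpu = 0; ; cpu++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cpu_capacity", cpu);
		FILE *f = fopen(path, "r");
		if (!f)
			break;	/* no such CPU, or capacity not exposed */
		if (fscanf(f, "%lu", &cap) == 1)
			printf("cpu%d capacity=%lu\n", cpu, cap);
		fclose(f);
	}
	return 0;
}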



[PATCH v2] arch_topology: Make cpu_capacity sysfs node as read-only

2019-03-27 Thread Lingutla Chandrasekhar
If the user updates any cpu's cpu_capacity, then the new value is going to
be applied to all of its online sibling cpus. But this need not always be
correct, as sibling cpus (in ARM, cpus of the same microarchitecture) can
have different cpu_capacity values with different performance
characteristics. So applying the user supplied cpu_capacity to all cpu
siblings is not correct.

Another problem is that the current code assumes 'all cpus in a cluster,
or with the same package_id (core_siblings), have the same cpu_capacity'.
But with commit '5bdd2b3f0f8 ("arm64: topology: add support to remove
cpu topology sibling masks")', when a cpu is hotplugged out, the cpu
information gets cleared from its sibling cpus. So the user supplied
cpu_capacity would be applied only to the sibling cpus online at that time.
After that, if any cpu is hotplugged in, it would have a different
cpu_capacity than its siblings, which breaks the above assumption.

So instead of mucking around with the core sibling mask for the user
supplied value, use the device tree to set cpu capacity, and make the
cpu_capacity node read-only so it only exposes the asymmetry between
cpus in the system. While at it, remove the cpu_scale_mutex usage, which
was used for sysfs write protection.

Tested-by: Dietmar Eggemann 
Tested-by: Quentin Perret 
Reviewed-by: Quentin Perret 
Acked-by: Sudeep Holla 
Signed-off-by: Lingutla Chandrasekhar 

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index edfcf8d982e4..1739d7e1952a 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -7,7 +7,6 @@
  */
 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -31,7 +30,6 @@ void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
per_cpu(freq_scale, i) = scale;
 }
 
-static DEFINE_MUTEX(cpu_scale_mutex);
 DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
 
 void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
@@ -51,37 +49,7 @@ static ssize_t cpu_capacity_show(struct device *dev,
 static void update_topology_flags_workfn(struct work_struct *work);
 static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
 
-static ssize_t cpu_capacity_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf,
- size_t count)
-{
-   struct cpu *cpu = container_of(dev, struct cpu, dev);
-   int this_cpu = cpu->dev.id;
-   int i;
-   unsigned long new_capacity;
-   ssize_t ret;
-
-   if (!count)
-   return 0;
-
-   ret = kstrtoul(buf, 0, &new_capacity);
-   if (ret)
-   return ret;
-   if (new_capacity > SCHED_CAPACITY_SCALE)
-   return -EINVAL;
-
-   mutex_lock(&cpu_scale_mutex);
-   for_each_cpu(i, &cpu_topology[this_cpu].core_sibling)
-   topology_set_cpu_scale(i, new_capacity);
-   mutex_unlock(&cpu_scale_mutex);
-
-   schedule_work(&update_topology_flags_work);
-
-   return count;
-}
-
-static DEVICE_ATTR_RW(cpu_capacity);
+static DEVICE_ATTR_RO(cpu_capacity);
 
 static int register_cpu_capacity_sysctl(void)
 {
@@ -141,7 +109,6 @@ void topology_normalize_cpu_scale(void)
return;
 
pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale);
-   mutex_lock(&cpu_scale_mutex);
for_each_possible_cpu(cpu) {
pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n",
 cpu, raw_capacity[cpu]);
@@ -151,7 +118,6 @@ void topology_normalize_cpu_scale(void)
pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
cpu, topology_get_cpu_scale(NULL, cpu));
}
-   mutex_unlock(&cpu_scale_mutex);
 }
 
 bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)



[PATCH v2] arch_topology: Make cpu_capacity sysfs node as read-only

2019-03-27 Thread Lingutla Chandrasekhar
If the user updates any cpu's cpu_capacity, then the new value is going to
be applied to all of its online sibling cpus. But this need not always be
correct, as sibling cpus (in ARM, cpus of the same microarchitecture) can
have different cpu_capacity values with different performance
characteristics. So applying the user supplied cpu_capacity to all cpu
siblings is not correct.

Another problem is that the current code assumes 'all cpus in a cluster,
or with the same package_id (core_siblings), have the same cpu_capacity'.
But with commit '5bdd2b3f0f8 ("arm64: topology: add support to remove
cpu topology sibling masks")', when a cpu is hotplugged out, the cpu
information gets cleared from its sibling cpus. So the user supplied
cpu_capacity would be applied only to the sibling cpus online at that time.
After that, if any cpu is hotplugged in, it would have a different
cpu_capacity than its siblings, which breaks the above assumption.

So instead of mucking around with the core sibling mask for the user
supplied value, use the device tree to set cpu capacity, and make the
cpu_capacity node read-only so it only exposes the asymmetry between
cpus in the system. While at it, remove the cpu_scale_mutex usage, which
was used for sysfs write protection.

Tested-by: Dietmar Eggemann 
Tested-by: Quentin Perret 
Acked-by: Sudeep Holla 
Reviewed-by: Quentin Perret 
Signed-off-by: Lingutla Chandrasekhar 

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index edfcf8d982e4..1739d7e1952a 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -7,7 +7,6 @@
  */
 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -31,7 +30,6 @@ void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
per_cpu(freq_scale, i) = scale;
 }
 
-static DEFINE_MUTEX(cpu_scale_mutex);
 DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
 
 void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
@@ -51,37 +49,7 @@ static ssize_t cpu_capacity_show(struct device *dev,
 static void update_topology_flags_workfn(struct work_struct *work);
 static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
 
-static ssize_t cpu_capacity_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf,
- size_t count)
-{
-   struct cpu *cpu = container_of(dev, struct cpu, dev);
-   int this_cpu = cpu->dev.id;
-   int i;
-   unsigned long new_capacity;
-   ssize_t ret;
-
-   if (!count)
-   return 0;
-
-   ret = kstrtoul(buf, 0, &new_capacity);
-   if (ret)
-   return ret;
-   if (new_capacity > SCHED_CAPACITY_SCALE)
-   return -EINVAL;
-
-   mutex_lock(&cpu_scale_mutex);
-   for_each_cpu(i, &cpu_topology[this_cpu].core_sibling)
-   topology_set_cpu_scale(i, new_capacity);
-   mutex_unlock(&cpu_scale_mutex);
-
-   schedule_work(&update_topology_flags_work);
-
-   return count;
-}
-
-static DEVICE_ATTR_RW(cpu_capacity);
+static DEVICE_ATTR_RO(cpu_capacity);
 
 static int register_cpu_capacity_sysctl(void)
 {
@@ -141,7 +109,6 @@ void topology_normalize_cpu_scale(void)
return;
 
pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale);
-   mutex_lock(&cpu_scale_mutex);
for_each_possible_cpu(cpu) {
pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n",
 cpu, raw_capacity[cpu]);
@@ -151,7 +118,6 @@ void topology_normalize_cpu_scale(void)
pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
cpu, topology_get_cpu_scale(NULL, cpu));
}
-   mutex_unlock(&cpu_scale_mutex);
 }
 
 bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)



[PATCH v2] arch_topology: Make cpu_capacity sysfs node as read-only

2019-03-08 Thread Lingutla Chandrasekhar
If the user updates any cpu's cpu_capacity, then the new value is going to
be applied to all of its online sibling cpus. But this need not always be
correct, as sibling cpus (in ARM, cpus of the same microarchitecture) can
have different cpu_capacity values with different performance
characteristics. So applying the user supplied cpu_capacity to all cpu
siblings is not correct.

Another problem is that the current code assumes 'all cpus in a cluster,
or with the same package_id (core_siblings), have the same cpu_capacity'.
But with commit '5bdd2b3f0f8 ("arm64: topology: add support to remove
cpu topology sibling masks")', when a cpu is hotplugged out, the cpu
information gets cleared from its sibling cpus. So the user supplied
cpu_capacity would be applied only to the sibling cpus online at that time.
After that, if any cpu is hotplugged in, it would have a different
cpu_capacity than its siblings, which breaks the above assumption.

So instead of mucking around with the core sibling mask for the user
supplied value, use the device tree to set cpu capacity, and make the
cpu_capacity node read-only so it only exposes the asymmetry between
cpus in the system. While at it, remove the cpu_scale_mutex usage, which
was used for sysfs write protection.

Tested-by: Dietmar Eggemann 
Acked-by: Sudeep Holla 
Signed-off-by: Lingutla Chandrasekhar 
---

Changes from v1:
   - Removed cpu_scale_mutex usage, suggested by Dietmar Eggemann.
---
 drivers/base/arch_topology.c | 36 +---
 1 file changed, 1 insertion(+), 35 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index edfcf8d..1739d7e 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -7,7 +7,6 @@
  */
 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -31,7 +30,6 @@ void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq,
per_cpu(freq_scale, i) = scale;
 }
 
-static DEFINE_MUTEX(cpu_scale_mutex);
 DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
 
 void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity)
@@ -51,37 +49,7 @@ static ssize_t cpu_capacity_show(struct device *dev,
 static void update_topology_flags_workfn(struct work_struct *work);
 static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
 
-static ssize_t cpu_capacity_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf,
- size_t count)
-{
-   struct cpu *cpu = container_of(dev, struct cpu, dev);
-   int this_cpu = cpu->dev.id;
-   int i;
-   unsigned long new_capacity;
-   ssize_t ret;
-
-   if (!count)
-   return 0;
-
-   ret = kstrtoul(buf, 0, &new_capacity);
-   if (ret)
-   return ret;
-   if (new_capacity > SCHED_CAPACITY_SCALE)
-   return -EINVAL;
-
-   mutex_lock(&cpu_scale_mutex);
-   for_each_cpu(i, &cpu_topology[this_cpu].core_sibling)
-   topology_set_cpu_scale(i, new_capacity);
-   mutex_unlock(&cpu_scale_mutex);
-
-   schedule_work(&update_topology_flags_work);
-
-   return count;
-}
-
-static DEVICE_ATTR_RW(cpu_capacity);
+static DEVICE_ATTR_RO(cpu_capacity);
 
 static int register_cpu_capacity_sysctl(void)
 {
@@ -141,7 +109,6 @@ void topology_normalize_cpu_scale(void)
return;
 
pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale);
-   mutex_lock(&cpu_scale_mutex);
for_each_possible_cpu(cpu) {
pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n",
 cpu, raw_capacity[cpu]);
@@ -151,7 +118,6 @@ void topology_normalize_cpu_scale(void)
pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n",
cpu, topology_get_cpu_scale(NULL, cpu));
}
-   mutex_unlock(&cpu_scale_mutex);
 }
 
 bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)



[PATCH v1] arch_topology: Make cpu_capacity sysfs node as read-only

2019-03-06 Thread Lingutla Chandrasekhar
If the user updates any cpu's cpu_capacity, then the new value is going to
be applied to all of its online sibling cpus. But this need not always be
correct, as sibling cpus (in ARM, cpus of the same microarchitecture) can
have different cpu_capacity values with different performance
characteristics. So applying the user supplied cpu_capacity to all cpu
siblings is not correct.

Another problem is that the current code assumes 'all cpus in a cluster,
or with the same package_id (core_siblings), have the same cpu_capacity'.
But with commit '5bdd2b3f0f8 ("arm64: topology: add support to remove
cpu topology sibling masks")', when a cpu is hotplugged out, the cpu
information gets cleared from its sibling cpus. So the user supplied
cpu_capacity would be applied only to the sibling cpus online at that time.
After that, if any cpu is hotplugged in, it would have a different
cpu_capacity than its siblings, which breaks the above assumption.

So instead of mucking around with the core sibling mask for the user
supplied value, use the device tree to set cpu capacity, and make the
cpu_capacity node read-only so it only exposes the asymmetry between
cpus in the system.

Signed-off-by: Lingutla Chandrasekhar 
---
 drivers/base/arch_topology.c | 33 +
 1 file changed, 1 insertion(+), 32 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index edfcf8d..d455897 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -7,7 +7,6 @@
  */
 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -51,37 +50,7 @@ static ssize_t cpu_capacity_show(struct device *dev,
 static void update_topology_flags_workfn(struct work_struct *work);
 static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn);
 
-static ssize_t cpu_capacity_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf,
- size_t count)
-{
-   struct cpu *cpu = container_of(dev, struct cpu, dev);
-   int this_cpu = cpu->dev.id;
-   int i;
-   unsigned long new_capacity;
-   ssize_t ret;
-
-   if (!count)
-   return 0;
-
-   ret = kstrtoul(buf, 0, &new_capacity);
-   if (ret)
-   return ret;
-   if (new_capacity > SCHED_CAPACITY_SCALE)
-   return -EINVAL;
-
-   mutex_lock(&cpu_scale_mutex);
-   for_each_cpu(i, &cpu_topology[this_cpu].core_sibling)
-   topology_set_cpu_scale(i, new_capacity);
-   mutex_unlock(&cpu_scale_mutex);
-
-   schedule_work(&update_topology_flags_work);
-
-   return count;
-}
-
-static DEVICE_ATTR_RW(cpu_capacity);
+static DEVICE_ATTR_RO(cpu_capacity);
 
 static int register_cpu_capacity_sysctl(void)
 {



[tip:timers/urgent] timers: Forward timer base before migrating timers

2018-02-28 Thread tip-bot for Lingutla Chandrasekhar
Commit-ID:  c52232a49e203a65a6e1a670cd5262f59e9364a0
Gitweb: https://git.kernel.org/tip/c52232a49e203a65a6e1a670cd5262f59e9364a0
Author: Lingutla Chandrasekhar <clingu...@codeaurora.org>
AuthorDate: Thu, 18 Jan 2018 17:20:22 +0530
Committer:  Thomas Gleixner <t...@linutronix.de>
CommitDate: Wed, 28 Feb 2018 23:34:33 +0100

timers: Forward timer base before migrating timers

On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a
live CPU. This happens from the control thread which initiated the unplug.

If the CPU on which the control thread runs came out from a longer idle
period then the base clock of that CPU might be stale because the control
thread runs prior to any event which forwards the clock.

In such a case the timers from the unplugged CPU are queued on the live CPU
based on the stale clock which can cause large delays due to increased
granularity of the outer timer wheels which are far away from the base clock.

But there is a worse problem than that. The following sequence of events
illustrates it:

 - CPU0 timer1 is queued expires = 59969 and base->clk = 59131.

   The timer is queued at wheel level 2, with resulting expiry time = 60032
   (due to level granularity).

 - CPU1 enters idle @60007, with next timer expiry @60020.

 - CPU0 is hotplugged at @60009

 - CPU1 exits idle and runs the control thread which migrates the
   timers from CPU0

   timer1 is now queued in level 0 for immediate handling in the next
   softirq because the requested expiry time 59969 is before CPU1 base->clk
   60007

 - CPU1 runs code which forwards the base clock which succeeds because the
   next expiring timer, which was collected at idle entry time, is still set
   to 60020.

   So it forwards beyond 60007 and therefore misses to expire the migrated
   timer1. That timer gets expired when the wheel wraps around again, which
   takes between 63 and 630ms depending on the HZ setting.

Address both problems by invoking forward_timer_base() for the control CPUs
timer base. All other places, which might run into a similar problem
(mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid
that.

[ tglx: Massaged comment and changelog ]

Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
Co-developed-by: Neeraj Upadhyay <neer...@codeaurora.org>
Signed-off-by: Neeraj Upadhyay <neer...@codeaurora.org>
Signed-off-by: Lingutla Chandrasekhar <clingu...@codeaurora.org>
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
Cc: Anna-Maria Gleixner <anna-ma...@linutronix.de>
Cc: linux-arm-...@vger.kernel.org
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingu...@codeaurora.org
---
 kernel/time/timer.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 48150ab42de9..4a4fd567fb26 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1894,6 +1894,12 @@ int timers_dead_cpu(unsigned int cpu)
raw_spin_lock_irq(&new_base->lock);
raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
 
+   /*
+* The current CPUs base clock might be stale. Update it
+* before moving the timers over.
+*/
+   forward_timer_base(new_base);
+
BUG_ON(old_base->running_timer);
 
for (i = 0; i < WHEEL_SIZE; i++)
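
As a rough illustration of the granularity effect described in the changelog
(a simplification of the wheel-index math in kernel/time/timer.c, assuming
the usual 3 bits of clock shift per level, so level 2 rounds expiries up to a
64-jiffy granularity), the numbers above can be reproduced like this:

/* Rough sketch only; not the kernel's actual index calculation. */
#include <stdio.h>

#define LVL_CLK_SHIFT	3
#define LVL_SHIFT(n)	((n) * LVL_CLK_SHIFT)
#define LVL_GRAN(n)	(1UL << LVL_SHIFT(n))

int main(void)
{
	unsigned long clk = 59131, expires = 59969;
	int lvl = 2;	/* a delta of 838 jiffies lands in wheel level 2 */

	/* Round the expiry up to the level's granularity, as the wheel does. */
	unsigned long eff = ((expires + LVL_GRAN(lvl) - 1) >> LVL_SHIFT(lvl))
				<< LVL_SHIFT(lvl);

	printf("base->clk=%lu expires=%lu -> effective expiry=%lu\n",
	       clk, expires, eff);	/* prints 60032, as in the changelog */
	return 0;
}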



[PATCH v2] timer: Forward timer base before migrating timers

2018-01-18 Thread Lingutla Chandrasekhar
In the case when timers are migrated to a CPU after it exits
idle, but before its timer base is forwarded (either from
run_timer_softirq(), mod_timer() or add_timer_on()), it's
possible that the migrated timers are queued based on an older
clock value. This can cause delays in handling those timers.

For example, consider below sequence of events:

- CPU0 timer1 expires = 59969 and base->clk = 59131. So,
  timer is queued at level 2, with next expiry for this timer
  = 60032 (due to granularity addition).
- CPU1 enters idle @60007, with next timer expiry @60020.
- CPU1 exits idle.
- CPU0 is hotplugged at 60009, and timers are migrated to
  CPU1, with new base->clk = 60007. timer1 is queued,
  based on 60007 at level 0, for immediate handling (in
  next timer softirq handling).
- CPU1's base->clk is forwarded to 60009, so, in next sched
  timer interrupt, timer1 is not handled.

The issue happens because the timer wheel collects expired timers
starting from the current clk's index onwards, but migrated
timers, if enqueued based on an older clk value, can end up with
an index lower than clk's current index.
This can only happen if new base->clk is ahead of
timer->expires, resulting in timer being queued at
new base->clk's current index.

Co-developed-by: Neeraj Upadhyay <neer...@codeaurora.org>
Signed-off-by: Neeraj Upadhyay <neer...@codeaurora.org>
Signed-off-by: Lingutla Chandrasekhar <clingu...@codeaurora.org>

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 89a9e1b4264a..f66c7ad55d7a 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1886,6 +1886,12 @@ int timers_dead_cpu(unsigned int cpu)
raw_spin_lock_irq(&new_base->lock);
raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
 
+   /*
+* Before migrating timers, update new base clk to avoid
+* queueing timers based on older clock value.
+*/
+   forward_timer_base(new_base);
+
BUG_ON(old_base->running_timer);
 
for (i = 0; i < WHEEL_SIZE; i++)




[PATCH v1] timer: Forward timer base before migrating timers

2018-01-17 Thread Lingutla Chandrasekhar
In the case when timers are migrated to a CPU after it exits
idle, but before its timer base is forwarded (either from
run_timer_softirq(), mod_timer() or add_timer_on()), it's
possible that the migrated timers are queued based on an older
clock value. This can cause delays in handling those timers.

For example, consider below sequence of events:

- CPU0 timer1 expires = 59969 and base->clk = 59131. So,
  timer is queued at level 2, with next expiry for this timer
  = 60032 (due to granularity addition).
- CPU1 enters idle @60007, with next timer expiry @60020.
- CPU1 exits idle.
- CPU0 is hotplugged at 60009, and timers are migrated to
  CPU1, with new base->clk = 60007. timer1 is queued,
  based on 60007 at level 0, for immediate handling (in
  next timer softirq handling).
- CPU1's base->clk is forwarded to 60009, so, in next sched
  timer interrupt, timer1 is not handled.

The issue happens because the timer wheel collects expired timers
starting from the current clk's index onwards, but migrated
timers, if enqueued based on an older clk value, can end up with
an index lower than clk's current index.
This can only happen if new base->clk is ahead of
timer->expires, resulting in timer being queued at
new base->clk's current index.

Signed-off-by: Lingutla Chandrasekhar <clingu...@codeaurora.org>
Signed-off-by: Neeraj Upadhyay <neer...@codeaurora.org>

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 89a9e1b4264a..f66c7ad55d7a 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1886,6 +1886,12 @@ int timers_dead_cpu(unsigned int cpu)
raw_spin_lock_irq(&new_base->lock);
raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
 
+   /*
+* Before migrating timers, update new base clk to avoid
+* queueing timers based on older clock value.
+*/
+   forward_timer_base(new_base);
+
BUG_ON(old_base->running_timer);
 
for (i = 0; i < WHEEL_SIZE; i++)




[PATCH] kernel: time: forward timer base before migrating timers

2018-01-17 Thread Lingutla Chandrasekhar
In the case when timers are migrated to a CPU after it exits
idle, but before its timer base is forwarded (either from
run_timer_softirq(), mod_timer() or add_timer_on()), it's
possible that the migrated timers are queued based on an older
clock value. This can cause delays in handling those timers.

For example, consider below sequence of events:

- CPU0 timer1 expires = 59969 and base->clk = 59131. So,
  timer is queued at level 2, with next expiry for this timer
  = 60032 (due to granularity addition).
- CPU1 enters idle @60007, with next timer expiry @60020.
- CPU1 exits idle.
- CPU0 is hotplugged at 60009, and timers are migrated to
  CPU1, with new base->clk = 60007. timer1 is queued,
  based on 60007 at level 0, for immediate handling (in
  next timer softirq handling).
- CPU1's base->clk is forwarded to 60009, so, in next sched
  timer interrupt, timer1 is not handled.

The issue happens because the timer wheel collects expired timers
starting from the current clk's index onwards, but migrated
timers, if enqueued based on an older clk value, can end up with
an index lower than clk's current index.
This can only happen if new base->clk is ahead of
timer->expires, resulting in timer being queued at
new base->clk's current index.

Change-Id: Idbe737b346f00e6e7241b93181bbbd80871f1400
Signed-off-by: Neeraj Upadhyay <neer...@codeaurora.org>
Signed-off-by: Lingutla Chandrasekhar <clingu...@codeaurora.org>

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 89a9e1b4264a..ae94aa97b5a9 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1886,6 +1886,11 @@ int timers_dead_cpu(unsigned int cpu)
raw_spin_lock_irq(&new_base->lock);
raw_spin_lock_nested(&old_base->lock, SINGLE_DEPTH_NESTING);
 
+   /* Before migrating timers, update new base clk to avoid
+* queueing timers based on older clock value.
+*/
+   forward_timer_base(new_base);
+
BUG_ON(old_base->running_timer);
 
for (i = 0; i < WHEEL_SIZE; i++)




[RFC] hung task: check specific tasks for long uninterruptible sleep state

2017-11-07 Thread Lingutla Chandrasekhar
khungtask by default monitors all tasks for long uninterruptible sleep.
This change introduces a sysctl option,
/proc/sys/kernel/hung_task_selective_monitoring, to enable monitoring of
selected tasks only. If this sysctl option is enabled then only the tasks
with /proc/$PID/hang_detection_enabled set are monitored, otherwise all
tasks are monitored, as in the default case.

Some tasks may intentionally move to an uninterruptible sleep state,
which shouldn't lead to khungtask panics, as those are recoverable
hangs. So, to avoid false hang reports, add an option to select the tasks
to be monitored and report/panic on them only.

Signed-off-by: Imran Khan 
Signed-off-by: Lingutla Chandrasekhar 

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9d357b2ea6cb..810f9a5e209e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2733,6 +2733,52 @@ static int proc_tgid_io_accounting(struct seq_file *m, struct pid_namespace *ns,
 }
 #endif /* CONFIG_TASK_IO_ACCOUNTING */
 
+#ifdef CONFIG_DETECT_HUNG_TASK
+static ssize_t proc_hung_task_detection_enabled_read(struct file *file,
+   char __user *buf, size_t count, loff_t *ppos)
+{
+   struct task_struct *task = get_proc_task(file_inode(file));
+   char buffer[PROC_NUMBUF];
+   size_t len;
+   bool hang_detection_enabled;
+
+   if (!task)
+   return -ESRCH;
+   hang_detection_enabled = task->hang_detection_enabled;
+   put_task_struct(task);
+
+   len = snprintf(buffer, sizeof(buffer), "%d\n", hang_detection_enabled);
+
+   return simple_read_from_buffer(buf, sizeof(buffer), ppos, buffer, len);
+}
+
+static ssize_t proc_hung_task_detection_enabled_write(struct file *file,
+   const char __user *buf, size_t count, loff_t *ppos)
+{
+   struct task_struct *task;
+   bool hang_detection_enabled;
+   int rv;
+
+   rv = kstrtobool_from_user(buf, count, &hang_detection_enabled);
+   if (rv < 0)
+   return rv;
+
+   task = get_proc_task(file_inode(file));
+   if (!task)
+   return -ESRCH;
+   task->hang_detection_enabled = hang_detection_enabled;
+   put_task_struct(task);
+
+   return count;
+}
+
+static const struct file_operations proc_hung_task_detection_enabled_operations = {
+   .read   = proc_hung_task_detection_enabled_read,
+   .write  = proc_hung_task_detection_enabled_write,
+   .llseek = generic_file_llseek,
+};
+#endif
+
 #ifdef CONFIG_USER_NS
 static int proc_id_map_open(struct inode *inode, struct file *file,
const struct seq_operations *seq_ops)
@@ -2976,6 +3022,10 @@ static const struct pid_entry tgid_base_stuff[] = {
 #ifdef CONFIG_HARDWALL
ONE("hardwall",   S_IRUGO, proc_pid_hardwall),
 #endif
+#ifdef CONFIG_DETECT_HUNG_TASK
+   REG("hang_detection_enabled", 0666,
+   proc_hung_task_detection_enabled_operations),
+#endif
 #ifdef CONFIG_USER_NS
REG("uid_map",S_IRUGO|S_IWUSR, proc_uid_map_operations),
REG("gid_map",S_IRUGO|S_IWUSR, proc_gid_map_operations),
@@ -3367,6 +3417,10 @@ static const struct pid_entry tid_base_stuff[] = {
 #ifdef CONFIG_HARDWALL
ONE("hardwall",   S_IRUGO, proc_pid_hardwall),
 #endif
+#ifdef CONFIG_DETECT_HUNG_TASK
+   REG("hang_detection_enabled", 0666,
+   proc_hung_task_detection_enabled_operations),
+#endif
 #ifdef CONFIG_USER_NS
REG("uid_map",S_IRUGO|S_IWUSR, proc_uid_map_operations),
REG("gid_map",S_IRUGO|S_IWUSR, proc_gid_map_operations),
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fdf74f27acf1..1280df0a6708 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -767,6 +767,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_DETECT_HUNG_TASK
unsigned long   last_switch_count;
+   boolhang_detection_enabled;
 #endif
/* Filesystem information: */
struct fs_struct*fs;
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d6a18a3839cc..09d9f1c65fd2 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -11,6 +11,7 @@ extern int sysctl_hung_task_check_count;
 extern unsigned int  sysctl_hung_task_panic;
 extern unsigned long sysctl_hung_task_timeout_secs;
 extern int sysctl_hung_task_warnings;
+extern int sysctl_hung_task_selective_monitoring;
 extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
 void __user *buffer,
 size_t *lenp, loff_t *ppos);
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 751593ed7c0b..321bcfa2e34a 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -20,12 +20,21 @@
 #include 
 
 #include 
+#include 
 
 /*
  * The number of tasks checked:
  */
 int
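
Assuming the interface proposed in this RFC, a task could opt itself in from
userspace roughly as follows (a sketch only; both proc paths are taken from
the patch, error handling trimmed):

/* Sketch: enable selective monitoring and mark the current task. */
#include <stdio.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fputs(val, f);
		fclose(f);
	}
}

int main(void)
{
	/* Once the sysctl is set, only opted-in tasks are checked by khungtask. */
	write_str("/proc/sys/kernel/hung_task_selective_monitoring", "1");
	write_str("/proc/self/hang_detection_enabled", "1");
	return 0;
}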

[PATCH] coresight: of_get_coresight_platform_data needs both OF and CORESIGHT

2017-10-26 Thread Lingutla Chandrasekhar
The file that defines of_get_coresight_platform_data() is indeed
dependent on CONFIG_OF but the entire coresight directory depends
on CONFIG_CORESIGHT so both need to be enabled to make the symbol
resolve.

Signed-off-by: Jordan Crouse <jcro...@codeaurora.org>
Signed-off-by: Lingutla Chandrasekhar <clingu...@codeaurora.org>

diff --git a/include/linux/coresight.h b/include/linux/coresight.h
index d950dad5056a..d298eea7e6a0 100644
--- a/include/linux/coresight.h
+++ b/include/linux/coresight.h
@@ -262,7 +262,7 @@ static inline int coresight_timeout(void __iomem *addr, u32 offset,
 int position, int value) { return 1; }
 #endif
 
-#ifdef CONFIG_OF
+#if defined(CONFIG_OF) && defined(CONFIG_CORESIGHT)
 extern int of_coresight_get_cpu(const struct device_node *node);
 extern struct coresight_platform_data *
 of_get_coresight_platform_data(struct device *dev,


