[tip: sched/core] sched/fair: Ignore percpu threads for imbalance pulls
The following commit has been merged into the sched/core branch of tip: Commit-ID: 9bcb959d05eeb564dfc9cac13a59843a4fb2edf2 Gitweb: https://git.kernel.org/tip/9bcb959d05eeb564dfc9cac13a59843a4fb2edf2 Author:Lingutla Chandrasekhar AuthorDate:Wed, 07 Apr 2021 23:06:26 +01:00 Committer: Peter Zijlstra CommitterDate: Fri, 09 Apr 2021 18:02:20 +02:00 sched/fair: Ignore percpu threads for imbalance pulls During load balance, LBF_SOME_PINNED will be set if any candidate task cannot be detached due to CPU affinity constraints. This can result in setting env->sd->parent->sgc->group_imbalance, which can lead to a group being classified as group_imbalanced (rather than any of the other, lower group_type) when balancing at a higher level. In workloads involving a single task per CPU, LBF_SOME_PINNED can often be set due to per-CPU kthreads being the only other runnable tasks on any given rq. This results in changing the group classification during load-balance at higher levels when in reality there is nothing that can be done for this affinity constraint: per-CPU kthreads, as the name implies, don't get to move around (modulo hotplug shenanigans). It's not as clear for userspace tasks - a task could be in an N-CPU cpuset with N-1 offline CPUs, making it an "accidental" per-CPU task rather than an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which we can leverage here to not set LBF_SOME_PINNED. Note that the aforementioned classification to group_imbalance (when nothing can be done) is especially problematic on big.LITTLE systems, which have a topology the likes of: DIE [ ] MC [][] 0 1 2 3 L L B B arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B) Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads can significantly delay the upgmigration of said misfit tasks. Systems relying on ASYM_PACKING are likely to face similar issues. Signed-off-by: Lingutla Chandrasekhar [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed] [Reword changelog] Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dietmar Eggemann Reviewed-by: Vincent Guittot Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schnei...@arm.com --- kernel/sched/fair.c | 4 1 file changed, 4 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index bc34e35..1ad929b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7598,6 +7598,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) return 0; + /* Disregard pcpu kthreads; they are where they need to be. */ + if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p)) + return 0; + if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) { int cpu;
[tip: sched/core] sched/fair: Ignore percpu threads for imbalance pulls
The following commit has been merged into the sched/core branch of tip: Commit-ID: 8d25d10a4f5a5d87c062838358ab5b3ed7eaa131 Gitweb: https://git.kernel.org/tip/8d25d10a4f5a5d87c062838358ab5b3ed7eaa131 Author:Lingutla Chandrasekhar AuthorDate:Wed, 07 Apr 2021 23:06:26 +01:00 Committer: Peter Zijlstra CommitterDate: Fri, 09 Apr 2021 13:52:10 +02:00 sched/fair: Ignore percpu threads for imbalance pulls During load balance, LBF_SOME_PINNED will be set if any candidate task cannot be detached due to CPU affinity constraints. This can result in setting env->sd->parent->sgc->group_imbalance, which can lead to a group being classified as group_imbalanced (rather than any of the other, lower group_type) when balancing at a higher level. In workloads involving a single task per CPU, LBF_SOME_PINNED can often be set due to per-CPU kthreads being the only other runnable tasks on any given rq. This results in changing the group classification during load-balance at higher levels when in reality there is nothing that can be done for this affinity constraint: per-CPU kthreads, as the name implies, don't get to move around (modulo hotplug shenanigans). It's not as clear for userspace tasks - a task could be in an N-CPU cpuset with N-1 offline CPUs, making it an "accidental" per-CPU task rather than an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which we can leverage here to not set LBF_SOME_PINNED. Note that the aforementioned classification to group_imbalance (when nothing can be done) is especially problematic on big.LITTLE systems, which have a topology the likes of: DIE [ ] MC [][] 0 1 2 3 L L B B arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B) Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads can significantly delay the upgmigration of said misfit tasks. Systems relying on ASYM_PACKING are likely to face similar issues. Signed-off-by: Lingutla Chandrasekhar [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed] [Reword changelog] Signed-off-by: Valentin Schneider Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dietmar Eggemann Reviewed-by: Vincent Guittot Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schnei...@arm.com --- kernel/sched/fair.c | 4 1 file changed, 4 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d0bd861..d10e33d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7598,6 +7598,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) return 0; + /* Disregard pcpu kthreads; they are where they need to be. */ + if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p)) + return 0; + if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) { int cpu;
[tip: sched/core] sched/fair: Ignore percpu threads for imbalance pulls
The following commit has been merged into the sched/core branch of tip: Commit-ID: 29b628b521119c0dfe151da302e11018cb32db4f Gitweb: https://git.kernel.org/tip/29b628b521119c0dfe151da302e11018cb32db4f Author:Lingutla Chandrasekhar AuthorDate:Wed, 07 Apr 2021 23:06:26 +01:00 Committer: Peter Zijlstra CommitterDate: Thu, 08 Apr 2021 23:09:44 +02:00 sched/fair: Ignore percpu threads for imbalance pulls During load balance, LBF_SOME_PINNED will be set if any candidate task cannot be detached due to CPU affinity constraints. This can result in setting env->sd->parent->sgc->group_imbalance, which can lead to a group being classified as group_imbalanced (rather than any of the other, lower group_type) when balancing at a higher level. In workloads involving a single task per CPU, LBF_SOME_PINNED can often be set due to per-CPU kthreads being the only other runnable tasks on any given rq. This results in changing the group classification during load-balance at higher levels when in reality there is nothing that can be done for this affinity constraint: per-CPU kthreads, as the name implies, don't get to move around (modulo hotplug shenanigans). It's not as clear for userspace tasks - a task could be in an N-CPU cpuset with N-1 offline CPUs, making it an "accidental" per-CPU task rather than an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which we can leverage here to not set LBF_SOME_PINNED. Note that the aforementioned classification to group_imbalance (when nothing can be done) is especially problematic on big.LITTLE systems, which have a topology the likes of: DIE [ ] MC [][] 0 1 2 3 L L B B arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B) Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads can significantly delay the upgmigration of said misfit tasks. Systems relying on ASYM_PACKING are likely to face similar issues. [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed] [Reword changelog] Signed-off-by: Valentin Schneider Signed-off-by: Lingutla Chandrasekhar Signed-off-by: Peter Zijlstra (Intel) Reviewed-by: Dietmar Eggemann Reviewed-by: Vincent Guittot Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schnei...@arm.com --- kernel/sched/fair.c | 4 1 file changed, 4 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index d0bd861..d10e33d 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7598,6 +7598,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) return 0; + /* Disregard pcpu kthreads; they are where they need to be. */ + if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p)) + return 0; + if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) { int cpu;
[PATCH] sched/fair: Ignore percpu threads for imbalance pulls
In load balancing, when balancing group is unable to pull task due to ->cpus_ptr constraints from busy group, then it sets LBF_SOME_PINNED to lb env flags, as a consequence, sgc->imbalance is set for its parent domain level. which makes the group classified as imbalance to get help from another balancing cpu. Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and CPUs 2-3 as Bigs with below scenario: - CPU0 doing newly_idle balancing - CPU1 running percpu kworker and RT task (small tasks) - CPU2 running 2 big tasks - CPU3 running 1 medium task While CPU0 is doing newly_idle load balance at MC level, it fails to pull percpu kworker from CPU1 and sets LBF_SOME_PINNED to lb env flag and set sgc->imbalance at DIE level domain. As LBF_ALL_PINNED not cleared, it tries to redo the balancing by clearing CPU1 in env cpus, but it don't find other busiest_group, so CPU0 stops balacing at MC level without clearing 'sgc->imbalance' and restart the load balacing at DIE level. And CPU0 (balancing cpu) finds LITTLE's group as busiest_group with group type as imbalance, and Bigs that classified the level below imbalance type would be ignored to pick as busiest, and the balancing would be aborted without pulling any tasks (by the time, CPU1 might not have running tasks). It is suboptimal decision to classify the group as imbalance due to percpu threads. So don't use LBF_SOME_PINNED for per cpu threads. Signed-off-by: Lingutla Chandrasekhar diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 04a3ce20da67..44a05ad8c96b 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7560,7 +7560,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env) schedstat_inc(p->se.statistics.nr_failed_migrations_affine); - env->flags |= LBF_SOME_PINNED; + /* Ignore percpu threads for imbalance pulls. */ + if (p->nr_cpus_allowed > 1) + env->flags |= LBF_SOME_PINNED; /* * Remember if this task can be migrated to any other CPU in -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH v3] arch_topology: Make cpu_capacity sysfs node as ready-only
If user updates any cpu's cpu_capacity, then the new value is going to be applied to all its online sibling cpus. But this need not to be correct always, as sibling cpus (in ARM, same micro architecture cpus) would have different cpu_capacity with different performance characteristics. So, updating the user supplied cpu_capacity to all cpu siblings is not correct. And another problem is, current code assumes that 'all cpus in a cluster or with same package_id (core_siblings), would have same cpu_capacity'. But with commit '5bdd2b3f0f8 ("arm64: topology: add support to remove cpu topology sibling masks")', when a cpu hotplugged out, the cpu information gets cleared in its sibling cpus. So, user supplied cpu_capacity would be applied to only online sibling cpus at the time. After that, if any cpu hotplugged in, it would have different cpu_capacity than its siblings, which breaks the above assumption. So, instead of mucking around the core sibling mask for user supplied value, use device-tree to set cpu capacity. And make the cpu_capacity node as read-only to know the asymmetry between cpus in the system. While at it, remove cpu_scale_mutex usage, which used for sysfs write protection. Tested-by: Dietmar Eggemann Tested-by: Quentin Perret Reviewed-by: Quentin Perret Acked-by: Sudeep Holla Signed-off-by: Lingutla Chandrasekhar --- Changes from v2: - Corrected spelling mistakes in commit text. Changes from v1: - Removed cpu_scale_mutex usage, suggested by Dietmar Eggemann. Changes from v0: - Instead of iterating all possible cpus to update cpu capacity, removed write capability for the sysfs node. diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index edfcf8d982e4..1739d7e1952a 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -7,7 +7,6 @@ */ #include -#include #include #include #include @@ -31,7 +30,6 @@ void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq, per_cpu(freq_scale, i) = scale; } -static DEFINE_MUTEX(cpu_scale_mutex); DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE; void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity) @@ -51,37 +49,7 @@ static ssize_t cpu_capacity_show(struct device *dev, static void update_topology_flags_workfn(struct work_struct *work); static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn); -static ssize_t cpu_capacity_store(struct device *dev, - struct device_attribute *attr, - const char *buf, - size_t count) -{ - struct cpu *cpu = container_of(dev, struct cpu, dev); - int this_cpu = cpu->dev.id; - int i; - unsigned long new_capacity; - ssize_t ret; - - if (!count) - return 0; - - ret = kstrtoul(buf, 0, _capacity); - if (ret) - return ret; - if (new_capacity > SCHED_CAPACITY_SCALE) - return -EINVAL; - - mutex_lock(_scale_mutex); - for_each_cpu(i, _topology[this_cpu].core_sibling) - topology_set_cpu_scale(i, new_capacity); - mutex_unlock(_scale_mutex); - - schedule_work(_topology_flags_work); - - return count; -} - -static DEVICE_ATTR_RW(cpu_capacity); +static DEVICE_ATTR_RO(cpu_capacity); static int register_cpu_capacity_sysctl(void) { @@ -141,7 +109,6 @@ void topology_normalize_cpu_scale(void) return; pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale); - mutex_lock(_scale_mutex); for_each_possible_cpu(cpu) { pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n", cpu, raw_capacity[cpu]); @@ -151,7 +118,6 @@ void topology_normalize_cpu_scale(void) pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n", cpu, topology_get_cpu_scale(NULL, cpu)); } - mutex_unlock(_scale_mutex); } bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH v2] arch_topology: Make cpu_capacity sysfs node as ready-only
If user updates any cpu's cpu_capacity, then the new value is going to be applied to all its online sibling cpus. But this need not to be correct always, as sibling cpus (in ARM, same micro architecture cpus) would have different cpu_capacity with different performance characteristics. So, updating the user supplied cpu_capacity to all cpu siblings is not correct. And another problem is, current code assumes that 'all cpus in a cluster or with same package_id (core_siblings), would have same cpu_capacity'. But with commit '5bdd2b3f0f8 ("arm64: topology: add support to remove cpu topology sibling masks")', when a cpu hotplugged out, the cpu information gets cleared in its sibling cpus. So, user supplied cpu_capacity would be applied to only online sibling cpus at the time. After that, if any cpu hotplugged in, it would have different cpu_capacity than its siblings, which breaks the above assumption. So, instead of mucking around the core sibling mask for user supplied value, use device-tree to set cpu capacity. And make the cpu_capacity node as read-only to know the asymmetry between cpus in the system. While at it, remove cpu_scale_mutex usage, which used for sysfs write protection. Tested-by: Dietmar Eggemann Tested-by: Quentin Perret Reviewed-by: Quentin Perret Acked-by: Sudeep Holla Signed-off-by: Lingutla Chandrasekhar diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index edfcf8d982e4..1739d7e1952a 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -7,7 +7,6 @@ */ #include -#include #include #include #include @@ -31,7 +30,6 @@ void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq, per_cpu(freq_scale, i) = scale; } -static DEFINE_MUTEX(cpu_scale_mutex); DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE; void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity) @@ -51,37 +49,7 @@ static ssize_t cpu_capacity_show(struct device *dev, static void update_topology_flags_workfn(struct work_struct *work); static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn); -static ssize_t cpu_capacity_store(struct device *dev, - struct device_attribute *attr, - const char *buf, - size_t count) -{ - struct cpu *cpu = container_of(dev, struct cpu, dev); - int this_cpu = cpu->dev.id; - int i; - unsigned long new_capacity; - ssize_t ret; - - if (!count) - return 0; - - ret = kstrtoul(buf, 0, _capacity); - if (ret) - return ret; - if (new_capacity > SCHED_CAPACITY_SCALE) - return -EINVAL; - - mutex_lock(_scale_mutex); - for_each_cpu(i, _topology[this_cpu].core_sibling) - topology_set_cpu_scale(i, new_capacity); - mutex_unlock(_scale_mutex); - - schedule_work(_topology_flags_work); - - return count; -} - -static DEVICE_ATTR_RW(cpu_capacity); +static DEVICE_ATTR_RO(cpu_capacity); static int register_cpu_capacity_sysctl(void) { @@ -141,7 +109,6 @@ void topology_normalize_cpu_scale(void) return; pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale); - mutex_lock(_scale_mutex); for_each_possible_cpu(cpu) { pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n", cpu, raw_capacity[cpu]); @@ -151,7 +118,6 @@ void topology_normalize_cpu_scale(void) pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n", cpu, topology_get_cpu_scale(NULL, cpu)); } - mutex_unlock(_scale_mutex); } bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH v2] arch_topology: Make cpu_capacity sysfs node as ready-only
If user updates any cpu's cpu_capacity, then the new value is going to be applied to all its online sibling cpus. But this need not to be correct always, as sibling cpus (in ARM, same micro architecture cpus) would have different cpu_capacity with different performance characteristics. So updating the user supplied cpu_capacity to all cpu siblings is not correct. And another problem is, current code assumes that 'all cpus in a cluster or with same package_id (core_siblings), would have same cpu_capacity'. But with commit '5bdd2b3f0f8 ("arm64: topology: add support to remove cpu topology sibling masks")', when a cpu hotplugged out, the cpu information gets cleared in its sibling cpus. So user supplied cpu_capacity would be applied to only online sibling cpus at the time. After that, if any cpu hot plugged in, it would have different cpu_capacity than its siblings, which breaks the above assumption. So instead of mucking around the core sibling mask for user supplied value, use device-tree to set cpu capacity. And make the cpu_capacity node as read-only to know the assymetry between cpus in the system. While at it, remove cpu_scale_mutex usage, which used for sysfs write protection. Tested-by: Dietmar Eggemann Tested-by: Quentin Perret Acked-by: Sudeep Holla Reviewed-by: Quentin Perret Signed-off-by: Lingutla Chandrasekhar diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index edfcf8d982e4..1739d7e1952a 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -7,7 +7,6 @@ */ #include -#include #include #include #include @@ -31,7 +30,6 @@ void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq, per_cpu(freq_scale, i) = scale; } -static DEFINE_MUTEX(cpu_scale_mutex); DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE; void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity) @@ -51,37 +49,7 @@ static ssize_t cpu_capacity_show(struct device *dev, static void update_topology_flags_workfn(struct work_struct *work); static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn); -static ssize_t cpu_capacity_store(struct device *dev, - struct device_attribute *attr, - const char *buf, - size_t count) -{ - struct cpu *cpu = container_of(dev, struct cpu, dev); - int this_cpu = cpu->dev.id; - int i; - unsigned long new_capacity; - ssize_t ret; - - if (!count) - return 0; - - ret = kstrtoul(buf, 0, _capacity); - if (ret) - return ret; - if (new_capacity > SCHED_CAPACITY_SCALE) - return -EINVAL; - - mutex_lock(_scale_mutex); - for_each_cpu(i, _topology[this_cpu].core_sibling) - topology_set_cpu_scale(i, new_capacity); - mutex_unlock(_scale_mutex); - - schedule_work(_topology_flags_work); - - return count; -} - -static DEVICE_ATTR_RW(cpu_capacity); +static DEVICE_ATTR_RO(cpu_capacity); static int register_cpu_capacity_sysctl(void) { @@ -141,7 +109,6 @@ void topology_normalize_cpu_scale(void) return; pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale); - mutex_lock(_scale_mutex); for_each_possible_cpu(cpu) { pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n", cpu, raw_capacity[cpu]); @@ -151,7 +118,6 @@ void topology_normalize_cpu_scale(void) pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n", cpu, topology_get_cpu_scale(NULL, cpu)); } - mutex_unlock(_scale_mutex); } bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH v2] arch_topology: Make cpu_capacity sysfs node as ready-only
If user updates any cpu's cpu_capacity, then the new value is going to be applied to all its online sibling cpus. But this need not to be correct always, as sibling cpus (in ARM, same micro architecture cpus) would have different cpu_capacity with different performance characteristics. So updating the user supplied cpu_capacity to all cpu siblings is not correct. And another problem is, current code assumes that 'all cpus in a cluster or with same package_id (core_siblings), would have same cpu_capacity'. But with commit '5bdd2b3f0f8 ("arm64: topology: add support to remove cpu topology sibling masks")', when a cpu hotplugged out, the cpu information gets cleared in its sibling cpus. So user supplied cpu_capacity would be applied to only online sibling cpus at the time. After that, if any cpu hot plugged in, it would have different cpu_capacity than its siblings, which breaks the above assumption. So instead of mucking around the core sibling mask for user supplied value, use device-tree to set cpu capacity. And make the cpu_capacity node as read-only to know the assymetry between cpus in the system. While at it, remove cpu_scale_mutex usage, which used for sysfs write protection. Tested-by: Dietmar Eggemann Acked-by: Sudeep Holla Signed-off-by: Lingutla Chandrasekhar --- Changes from v1: - Removed cpu_scale_mutex usage, suggested by Dietmar Eggemann. --- drivers/base/arch_topology.c | 36 +--- 1 file changed, 1 insertion(+), 35 deletions(-) diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index edfcf8d..1739d7e 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -7,7 +7,6 @@ */ #include -#include #include #include #include @@ -31,7 +30,6 @@ void arch_set_freq_scale(struct cpumask *cpus, unsigned long cur_freq, per_cpu(freq_scale, i) = scale; } -static DEFINE_MUTEX(cpu_scale_mutex); DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE; void topology_set_cpu_scale(unsigned int cpu, unsigned long capacity) @@ -51,37 +49,7 @@ static ssize_t cpu_capacity_show(struct device *dev, static void update_topology_flags_workfn(struct work_struct *work); static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn); -static ssize_t cpu_capacity_store(struct device *dev, - struct device_attribute *attr, - const char *buf, - size_t count) -{ - struct cpu *cpu = container_of(dev, struct cpu, dev); - int this_cpu = cpu->dev.id; - int i; - unsigned long new_capacity; - ssize_t ret; - - if (!count) - return 0; - - ret = kstrtoul(buf, 0, _capacity); - if (ret) - return ret; - if (new_capacity > SCHED_CAPACITY_SCALE) - return -EINVAL; - - mutex_lock(_scale_mutex); - for_each_cpu(i, _topology[this_cpu].core_sibling) - topology_set_cpu_scale(i, new_capacity); - mutex_unlock(_scale_mutex); - - schedule_work(_topology_flags_work); - - return count; -} - -static DEVICE_ATTR_RW(cpu_capacity); +static DEVICE_ATTR_RO(cpu_capacity); static int register_cpu_capacity_sysctl(void) { @@ -141,7 +109,6 @@ void topology_normalize_cpu_scale(void) return; pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale); - mutex_lock(_scale_mutex); for_each_possible_cpu(cpu) { pr_debug("cpu_capacity: cpu=%d raw_capacity=%u\n", cpu, raw_capacity[cpu]); @@ -151,7 +118,6 @@ void topology_normalize_cpu_scale(void) pr_debug("cpu_capacity: CPU%d cpu_capacity=%lu\n", cpu, topology_get_cpu_scale(NULL, cpu)); } - mutex_unlock(_scale_mutex); } bool __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH v1] arch_topology: Make cpu_capacity sysfs node as ready-only
If user updates any cpu's cpu_capacity, then the new value is going to be applied to all its online sibling cpus. But this need not to be correct always, as sibling cpus (in ARM, same micro architecture cpus) would have different cpu_capacity with different performance characteristics. So updating the user supplied cpu_capacity to all cpu siblings is not correct. And another problem is, current code assumes that 'all cpus in a cluster or with same package_id (core_siblings), would have same cpu_capacity'. But with commit '5bdd2b3f0f8 ("arm64: topology: add support to remove cpu topology sibling masks")', when a cpu hotplugged out, the cpu information gets cleared in its sibling cpus. So user supplied cpu_capacity would be applied to only online sibling cpus at the time. After that, if any cpu hot plugged in, it would have different cpu_capacity than its siblings, which breaks the above assumption. So instead of mucking around the core sibling mask for user supplied value, use device-tree to set cpu capacity. And make the cpu_capacity node as read-only to know the assymetry between cpus in the system. Signed-off-by: Lingutla Chandrasekhar --- drivers/base/arch_topology.c | 33 + 1 file changed, 1 insertion(+), 32 deletions(-) diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index edfcf8d..d455897 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -7,7 +7,6 @@ */ #include -#include #include #include #include @@ -51,37 +50,7 @@ static ssize_t cpu_capacity_show(struct device *dev, static void update_topology_flags_workfn(struct work_struct *work); static DECLARE_WORK(update_topology_flags_work, update_topology_flags_workfn); -static ssize_t cpu_capacity_store(struct device *dev, - struct device_attribute *attr, - const char *buf, - size_t count) -{ - struct cpu *cpu = container_of(dev, struct cpu, dev); - int this_cpu = cpu->dev.id; - int i; - unsigned long new_capacity; - ssize_t ret; - - if (!count) - return 0; - - ret = kstrtoul(buf, 0, _capacity); - if (ret) - return ret; - if (new_capacity > SCHED_CAPACITY_SCALE) - return -EINVAL; - - mutex_lock(_scale_mutex); - for_each_cpu(i, _topology[this_cpu].core_sibling) - topology_set_cpu_scale(i, new_capacity); - mutex_unlock(_scale_mutex); - - schedule_work(_topology_flags_work); - - return count; -} - -static DEVICE_ATTR_RW(cpu_capacity); +static DEVICE_ATTR_RO(cpu_capacity); static int register_cpu_capacity_sysctl(void) { -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[tip:timers/urgent] timers: Forward timer base before migrating timers
Commit-ID: c52232a49e203a65a6e1a670cd5262f59e9364a0 Gitweb: https://git.kernel.org/tip/c52232a49e203a65a6e1a670cd5262f59e9364a0 Author: Lingutla Chandrasekhar <clingu...@codeaurora.org> AuthorDate: Thu, 18 Jan 2018 17:20:22 +0530 Committer: Thomas Gleixner <t...@linutronix.de> CommitDate: Wed, 28 Feb 2018 23:34:33 +0100 timers: Forward timer base before migrating timers On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a live CPU. This happens from the control thread which initiated the unplug. If the CPU on which the control thread runs came out from a longer idle period then the base clock of that CPU might be stale because the control thread runs prior to any event which forwards the clock. In such a case the timers from the unplugged CPU are queued on the live CPU based on the stale clock which can cause large delays due to increased granularity of the outer timer wheels which are far away from base:;clock. But there is a worse problem than that. The following sequence of events illustrates it: - CPU0 timer1 is queued expires = 59969 and base->clk = 59131. The timer is queued at wheel level 2, with resulting expiry time = 60032 (due to level granularity). - CPU1 enters idle @60007, with next timer expiry @60020. - CPU0 is hotplugged at @60009 - CPU1 exits idle and runs the control thread which migrates the timers from CPU0 timer1 is now queued in level 0 for immediate handling in the next softirq because the requested expiry time 59969 is before CPU1 base->clk 60007 - CPU1 runs code which forwards the base clock which succeeds because the next expiring timer. which was collected at idle entry time is still set to 60020. So it forwards beyond 60007 and therefore misses to expire the migrated timer1. That timer gets expired when the wheel wraps around again, which takes between 63 and 630ms depending on the HZ setting. Address both problems by invoking forward_timer_base() for the control CPUs timer base. All other places, which might run into a similar problem (mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid that. [ tglx: Massaged comment and changelog ] Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible") Co-developed-by: Neeraj Upadhyay <neer...@codeaurora.org> Signed-off-by: Neeraj Upadhyay <neer...@codeaurora.org> Signed-off-by: Lingutla Chandrasekhar <clingu...@codeaurora.org> Signed-off-by: Thomas Gleixner <t...@linutronix.de> Cc: Anna-Maria Gleixner <anna-ma...@linutronix.de> Cc: linux-arm-...@vger.kernel.org Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingu...@codeaurora.org --- kernel/time/timer.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 48150ab42de9..4a4fd567fb26 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1894,6 +1894,12 @@ int timers_dead_cpu(unsigned int cpu) raw_spin_lock_irq(_base->lock); raw_spin_lock_nested(_base->lock, SINGLE_DEPTH_NESTING); + /* +* The current CPUs base clock might be stale. Update it +* before moving the timers over. +*/ + forward_timer_base(new_base); + BUG_ON(old_base->running_timer); for (i = 0; i < WHEEL_SIZE; i++)
[tip:timers/urgent] timers: Forward timer base before migrating timers
Commit-ID: c52232a49e203a65a6e1a670cd5262f59e9364a0 Gitweb: https://git.kernel.org/tip/c52232a49e203a65a6e1a670cd5262f59e9364a0 Author: Lingutla Chandrasekhar AuthorDate: Thu, 18 Jan 2018 17:20:22 +0530 Committer: Thomas Gleixner CommitDate: Wed, 28 Feb 2018 23:34:33 +0100 timers: Forward timer base before migrating timers On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a live CPU. This happens from the control thread which initiated the unplug. If the CPU on which the control thread runs came out from a longer idle period then the base clock of that CPU might be stale because the control thread runs prior to any event which forwards the clock. In such a case the timers from the unplugged CPU are queued on the live CPU based on the stale clock which can cause large delays due to increased granularity of the outer timer wheels which are far away from base:;clock. But there is a worse problem than that. The following sequence of events illustrates it: - CPU0 timer1 is queued expires = 59969 and base->clk = 59131. The timer is queued at wheel level 2, with resulting expiry time = 60032 (due to level granularity). - CPU1 enters idle @60007, with next timer expiry @60020. - CPU0 is hotplugged at @60009 - CPU1 exits idle and runs the control thread which migrates the timers from CPU0 timer1 is now queued in level 0 for immediate handling in the next softirq because the requested expiry time 59969 is before CPU1 base->clk 60007 - CPU1 runs code which forwards the base clock which succeeds because the next expiring timer. which was collected at idle entry time is still set to 60020. So it forwards beyond 60007 and therefore misses to expire the migrated timer1. That timer gets expired when the wheel wraps around again, which takes between 63 and 630ms depending on the HZ setting. Address both problems by invoking forward_timer_base() for the control CPUs timer base. All other places, which might run into a similar problem (mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid that. [ tglx: Massaged comment and changelog ] Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible") Co-developed-by: Neeraj Upadhyay Signed-off-by: Neeraj Upadhyay Signed-off-by: Lingutla Chandrasekhar Signed-off-by: Thomas Gleixner Cc: Anna-Maria Gleixner Cc: linux-arm-...@vger.kernel.org Cc: sta...@vger.kernel.org Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingu...@codeaurora.org --- kernel/time/timer.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 48150ab42de9..4a4fd567fb26 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1894,6 +1894,12 @@ int timers_dead_cpu(unsigned int cpu) raw_spin_lock_irq(_base->lock); raw_spin_lock_nested(_base->lock, SINGLE_DEPTH_NESTING); + /* +* The current CPUs base clock might be stale. Update it +* before moving the timers over. +*/ + forward_timer_base(new_base); + BUG_ON(old_base->running_timer); for (i = 0; i < WHEEL_SIZE; i++)
[PATCH v2] timer: Forward timer base before migrating timers
In case when timers are migrated to a CPU, after it exits idle, but before timer base is forwarded, either from run_timer_softirq()/mod_timer()/add_timer_on(), it's possible that migrated timers are queued, based on older clock value. This can cause delays in handling those timers. For example, consider below sequence of events: - CPU0 timer1 expires = 59969 and base->clk = 59131. So, timer is queued at level 2, with next expiry for this timer = 60032 (due to granularity addition). - CPU1 enters idle @60007, with next timer expiry @60020. - CPU1 exits idle. - CPU0 is hotplugged at 60009, and timers are migrated to CPU1, with new base->clk = 60007. timer1 is queued, based on 60007 at level 0, for immediate handling (in next timer softirq handling). - CPU1's base->clk is forwarded to 60009, so, in next sched timer interrupt, timer1 is not handled. The issue happens as timer wheel collects expired timers starting from the current clk's index onwards, but migrated timers, if enqueued, based on older clk value can result in their index being less than clk's current index. This can only happen if new base->clk is ahead of timer->expires, resulting in timer being queued at new base->clk's current index. Co-developed-by: Neeraj Upadhyay <neer...@codeaurora.org> Signed-off-by: Neeraj Upadhyay <neer...@codeaurora.org> Signed-off-by: Lingutla Chandrasekhar <clingu...@codeaurora.org> diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 89a9e1b4264a..f66c7ad55d7a 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1886,6 +1886,12 @@ int timers_dead_cpu(unsigned int cpu) raw_spin_lock_irq(_base->lock); raw_spin_lock_nested(_base->lock, SINGLE_DEPTH_NESTING); + /* +* Before migrating timers, update new base clk to avoid +* queueing timers based on older clock value. +*/ + forward_timer_base(new_base); + BUG_ON(old_base->running_timer); for (i = 0; i < WHEEL_SIZE; i++) -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH v2] timer: Forward timer base before migrating timers
In case when timers are migrated to a CPU, after it exits idle, but before timer base is forwarded, either from run_timer_softirq()/mod_timer()/add_timer_on(), it's possible that migrated timers are queued, based on older clock value. This can cause delays in handling those timers. For example, consider below sequence of events: - CPU0 timer1 expires = 59969 and base->clk = 59131. So, timer is queued at level 2, with next expiry for this timer = 60032 (due to granularity addition). - CPU1 enters idle @60007, with next timer expiry @60020. - CPU1 exits idle. - CPU0 is hotplugged at 60009, and timers are migrated to CPU1, with new base->clk = 60007. timer1 is queued, based on 60007 at level 0, for immediate handling (in next timer softirq handling). - CPU1's base->clk is forwarded to 60009, so, in next sched timer interrupt, timer1 is not handled. The issue happens as timer wheel collects expired timers starting from the current clk's index onwards, but migrated timers, if enqueued, based on older clk value can result in their index being less than clk's current index. This can only happen if new base->clk is ahead of timer->expires, resulting in timer being queued at new base->clk's current index. Co-developed-by: Neeraj Upadhyay Signed-off-by: Neeraj Upadhyay Signed-off-by: Lingutla Chandrasekhar diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 89a9e1b4264a..f66c7ad55d7a 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1886,6 +1886,12 @@ int timers_dead_cpu(unsigned int cpu) raw_spin_lock_irq(_base->lock); raw_spin_lock_nested(_base->lock, SINGLE_DEPTH_NESTING); + /* +* Before migrating timers, update new base clk to avoid +* queueing timers based on older clock value. +*/ + forward_timer_base(new_base); + BUG_ON(old_base->running_timer); for (i = 0; i < WHEEL_SIZE; i++) -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH v1] timer: Forward timer base before migrating timers
In case when timers are migrated to a CPU, after it exits idle, but before timer base is forwarded, either from run_timer_softirq()/mod_timer()/add_timer_on(), it's possible that migrated timers are queued, based on older clock value. This can cause delays in handling those timers. For example, consider below sequence of events: - CPU0 timer1 expires = 59969 and base->clk = 59131. So, timer is queued at level 2, with next expiry for this timer = 60032 (due to granularity addition). - CPU1 enters idle @60007, with next timer expiry @60020. - CPU1 exits idle. - CPU0 is hotplugged at 60009, and timers are migrated to CPU1, with new base->clk = 60007. timer1 is queued, based on 60007 at level 0, for immediate handling (in next timer softirq handling). - CPU1's base->clk is forwarded to 60009, so, in next sched timer interrupt, timer1 is not handled. The issue happens as timer wheel collects expired timers starting from the current clk's index onwards, but migrated timers, if enqueued, based on older clk value can result in their index being less than clk's current index. This can only happen if new base->clk is ahead of timer->expires, resulting in timer being queued at new base->clk's current index. Signed-off-by: Lingutla Chandrasekhar <clingu...@codeaurora.org> Signed-off-by: Neeraj Upadhyay <neer...@codeaurora.org> diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 89a9e1b4264a..f66c7ad55d7a 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1886,6 +1886,12 @@ int timers_dead_cpu(unsigned int cpu) raw_spin_lock_irq(_base->lock); raw_spin_lock_nested(_base->lock, SINGLE_DEPTH_NESTING); + /* +* Before migrating timers, update new base clk to avoid +* queueing timers based on older clock value. +*/ + forward_timer_base(new_base); + BUG_ON(old_base->running_timer); for (i = 0; i < WHEEL_SIZE; i++) -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH v1] timer: Forward timer base before migrating timers
In case when timers are migrated to a CPU, after it exits idle, but before timer base is forwarded, either from run_timer_softirq()/mod_timer()/add_timer_on(), it's possible that migrated timers are queued, based on older clock value. This can cause delays in handling those timers. For example, consider below sequence of events: - CPU0 timer1 expires = 59969 and base->clk = 59131. So, timer is queued at level 2, with next expiry for this timer = 60032 (due to granularity addition). - CPU1 enters idle @60007, with next timer expiry @60020. - CPU1 exits idle. - CPU0 is hotplugged at 60009, and timers are migrated to CPU1, with new base->clk = 60007. timer1 is queued, based on 60007 at level 0, for immediate handling (in next timer softirq handling). - CPU1's base->clk is forwarded to 60009, so, in next sched timer interrupt, timer1 is not handled. The issue happens as timer wheel collects expired timers starting from the current clk's index onwards, but migrated timers, if enqueued, based on older clk value can result in their index being less than clk's current index. This can only happen if new base->clk is ahead of timer->expires, resulting in timer being queued at new base->clk's current index. Signed-off-by: Lingutla Chandrasekhar Signed-off-by: Neeraj Upadhyay diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 89a9e1b4264a..f66c7ad55d7a 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1886,6 +1886,12 @@ int timers_dead_cpu(unsigned int cpu) raw_spin_lock_irq(_base->lock); raw_spin_lock_nested(_base->lock, SINGLE_DEPTH_NESTING); + /* +* Before migrating timers, update new base clk to avoid +* queueing timers based on older clock value. +*/ + forward_timer_base(new_base); + BUG_ON(old_base->running_timer); for (i = 0; i < WHEEL_SIZE; i++) -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH] kernel: time: forward timer base before migrating timers
In case when timers are migrated to a CPU, after it exits idle, but before timer base is forwarded, either from run_timer_softirq()/mod_timer()/add_timer_on(), it's possible that migrated timers are queued, based on older clock value. This can cause delays in handling those timers. For example, consider below sequence of events: - CPU0 timer1 expires = 59969 and base->clk = 59131. So, timer is queued at level 2, with next expiry for this timer = 60032 (due to granularity addition). - CPU1 enters idle @60007, with next timer expiry @60020. - CPU1 exits idle. - CPU0 is hotplugged at 60009, and timers are migrated to CPU1, with new base->clk = 60007. timer1 is queued, based on 60007 at level 0, for immediate handling (in next timer softirq handling). - CPU1's base->clk is forwarded to 60009, so, in next sched timer interrupt, timer1 is not handled. The issue happens as timer wheel collects expired timers starting from the current clk's index onwards, but migrated timers, if enqueued, based on older clk value can result in their index being less than clk's current index. This can only happen if new base->clk is ahead of timer->expires, resulting in timer being queued at new base->clk's current index. Change-Id: Idbe737b346f00e6e7241b93181bbbd80871f1400 Signed-off-by: Neeraj Upadhyay <neer...@codeaurora.org> Signed-off-by: Lingutla Chandrasekhar <clingu...@codeaurora.org> diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 89a9e1b4264a..ae94aa97b5a9 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1886,6 +1886,11 @@ int timers_dead_cpu(unsigned int cpu) raw_spin_lock_irq(_base->lock); raw_spin_lock_nested(_base->lock, SINGLE_DEPTH_NESTING); + /* Before migrating timers, update new base clk to avoid +* queueing timers based on older clock value. +*/ + forward_timer_base(new_base); + BUG_ON(old_base->running_timer); for (i = 0; i < WHEEL_SIZE; i++) -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH] kernel: time: forward timer base before migrating timers
In case when timers are migrated to a CPU, after it exits idle, but before timer base is forwarded, either from run_timer_softirq()/mod_timer()/add_timer_on(), it's possible that migrated timers are queued, based on older clock value. This can cause delays in handling those timers. For example, consider below sequence of events: - CPU0 timer1 expires = 59969 and base->clk = 59131. So, timer is queued at level 2, with next expiry for this timer = 60032 (due to granularity addition). - CPU1 enters idle @60007, with next timer expiry @60020. - CPU1 exits idle. - CPU0 is hotplugged at 60009, and timers are migrated to CPU1, with new base->clk = 60007. timer1 is queued, based on 60007 at level 0, for immediate handling (in next timer softirq handling). - CPU1's base->clk is forwarded to 60009, so, in next sched timer interrupt, timer1 is not handled. The issue happens as timer wheel collects expired timers starting from the current clk's index onwards, but migrated timers, if enqueued, based on older clk value can result in their index being less than clk's current index. This can only happen if new base->clk is ahead of timer->expires, resulting in timer being queued at new base->clk's current index. Change-Id: Idbe737b346f00e6e7241b93181bbbd80871f1400 Signed-off-by: Neeraj Upadhyay Signed-off-by: Lingutla Chandrasekhar diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 89a9e1b4264a..ae94aa97b5a9 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1886,6 +1886,11 @@ int timers_dead_cpu(unsigned int cpu) raw_spin_lock_irq(_base->lock); raw_spin_lock_nested(_base->lock, SINGLE_DEPTH_NESTING); + /* Before migrating timers, update new base clk to avoid +* queueing timers based on older clock value. +*/ + forward_timer_base(new_base); + BUG_ON(old_base->running_timer); for (i = 0; i < WHEEL_SIZE; i++) -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[RFC] hung task: check specific tasks for long uninterruptible sleep state
khungtask by default monitors all tasks for long unterruptible sleep. This change introduces a sysctl option, /proc/sys/kernel/ hung_task_selective_monitoring, to enable monitoring selected tasks. If this sysctl option is enabled then only the tasks with /proc/$PID/hang_detection_enabled set are to be monitored, otherwise all tasks are monitored as default case. Some tasks may intentionally moves to uninterruptable sleep state, which shouldn't leads to khungtask panics, as those are recoverable hungs. So to avoid false hung reports, add an option to select tasks to be monitored and report/panic them only. Signed-off-by: Imran Khan <kim...@codeaurora.org> Signed-off-by: Lingutla Chandrasekhar <clingu...@codeaurora.org> diff --git a/fs/proc/base.c b/fs/proc/base.c index 9d357b2ea6cb..810f9a5e209e 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2733,6 +2733,52 @@ static int proc_tgid_io_accounting(struct seq_file *m, struct pid_namespace *ns, } #endif /* CONFIG_TASK_IO_ACCOUNTING */ +#ifdef CONFIG_DETECT_HUNG_TASK +static ssize_t proc_hung_task_detection_enabled_read(struct file *file, + char __user *buf, size_t count, loff_t *ppos) +{ + struct task_struct *task = get_proc_task(file_inode(file)); + char buffer[PROC_NUMBUF]; + size_t len; + bool hang_detection_enabled; + + if (!task) + return -ESRCH; + hang_detection_enabled = task->hang_detection_enabled; + put_task_struct(task); + + len = snprintf(buffer, sizeof(buffer), "%d\n", hang_detection_enabled); + + return simple_read_from_buffer(buf, sizeof(buffer), ppos, buffer, len); +} + +static ssize_t proc_hung_task_detection_enabled_write(struct file *file, + const char __user *buf, size_t count, loff_t *ppos) +{ + struct task_struct *task; + bool hang_detection_enabled; + int rv; + + rv = kstrtobool_from_user(buf, count, _detection_enabled); + if (rv < 0) + return rv; + + task = get_proc_task(file_inode(file)); + if (!task) + return -ESRCH; + task->hang_detection_enabled = hang_detection_enabled; + put_task_struct(task); + + return count; +} + +static const struct file_operations proc_hung_task_detection_enabled_operations = { + .read = proc_hung_task_detection_enabled_read, + .write = proc_hung_task_detection_enabled_write, + .llseek = generic_file_llseek, +}; +#endif + #ifdef CONFIG_USER_NS static int proc_id_map_open(struct inode *inode, struct file *file, const struct seq_operations *seq_ops) @@ -2976,6 +3022,10 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_HARDWALL ONE("hardwall", S_IRUGO, proc_pid_hardwall), #endif +#ifdef CONFIG_DETECT_HUNG_TASK + REG("hang_detection_enabled", 0666, + proc_hung_task_detection_enabled_operations), +#endif #ifdef CONFIG_USER_NS REG("uid_map",S_IRUGO|S_IWUSR, proc_uid_map_operations), REG("gid_map",S_IRUGO|S_IWUSR, proc_gid_map_operations), @@ -3367,6 +3417,10 @@ static const struct pid_entry tid_base_stuff[] = { #ifdef CONFIG_HARDWALL ONE("hardwall", S_IRUGO, proc_pid_hardwall), #endif +#ifdef CONFIG_DETECT_HUNG_TASK + REG("hang_detection_enabled", 0666, + proc_hung_task_detection_enabled_operations), +#endif #ifdef CONFIG_USER_NS REG("uid_map",S_IRUGO|S_IWUSR, proc_uid_map_operations), REG("gid_map",S_IRUGO|S_IWUSR, proc_gid_map_operations), diff --git a/include/linux/sched.h b/include/linux/sched.h index fdf74f27acf1..1280df0a6708 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -767,6 +767,7 @@ struct task_struct { #endif #ifdef CONFIG_DETECT_HUNG_TASK unsigned long last_switch_count; + boolhang_detection_enabled; #endif /* Filesystem information: */ struct fs_struct*fs; diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index d6a18a3839cc..09d9f1c65fd2 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -11,6 +11,7 @@ extern int sysctl_hung_task_check_count; extern unsigned int sysctl_hung_task_panic; extern unsigned long sysctl_hung_task_timeout_secs; extern int sysctl_hung_task_warnings; +extern int sysctl_hung_task_selective_monitoring; extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); diff --git a/kernel/hung_task.c b/kernel/hung_task.c index 751593ed7c0b..321bcfa2e34a 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -20,12 +20,21 @@ #include #
[RFC] hung task: check specific tasks for long uninterruptible sleep state
khungtask by default monitors all tasks for long unterruptible sleep. This change introduces a sysctl option, /proc/sys/kernel/ hung_task_selective_monitoring, to enable monitoring selected tasks. If this sysctl option is enabled then only the tasks with /proc/$PID/hang_detection_enabled set are to be monitored, otherwise all tasks are monitored as default case. Some tasks may intentionally moves to uninterruptable sleep state, which shouldn't leads to khungtask panics, as those are recoverable hungs. So to avoid false hung reports, add an option to select tasks to be monitored and report/panic them only. Signed-off-by: Imran Khan Signed-off-by: Lingutla Chandrasekhar diff --git a/fs/proc/base.c b/fs/proc/base.c index 9d357b2ea6cb..810f9a5e209e 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2733,6 +2733,52 @@ static int proc_tgid_io_accounting(struct seq_file *m, struct pid_namespace *ns, } #endif /* CONFIG_TASK_IO_ACCOUNTING */ +#ifdef CONFIG_DETECT_HUNG_TASK +static ssize_t proc_hung_task_detection_enabled_read(struct file *file, + char __user *buf, size_t count, loff_t *ppos) +{ + struct task_struct *task = get_proc_task(file_inode(file)); + char buffer[PROC_NUMBUF]; + size_t len; + bool hang_detection_enabled; + + if (!task) + return -ESRCH; + hang_detection_enabled = task->hang_detection_enabled; + put_task_struct(task); + + len = snprintf(buffer, sizeof(buffer), "%d\n", hang_detection_enabled); + + return simple_read_from_buffer(buf, sizeof(buffer), ppos, buffer, len); +} + +static ssize_t proc_hung_task_detection_enabled_write(struct file *file, + const char __user *buf, size_t count, loff_t *ppos) +{ + struct task_struct *task; + bool hang_detection_enabled; + int rv; + + rv = kstrtobool_from_user(buf, count, _detection_enabled); + if (rv < 0) + return rv; + + task = get_proc_task(file_inode(file)); + if (!task) + return -ESRCH; + task->hang_detection_enabled = hang_detection_enabled; + put_task_struct(task); + + return count; +} + +static const struct file_operations proc_hung_task_detection_enabled_operations = { + .read = proc_hung_task_detection_enabled_read, + .write = proc_hung_task_detection_enabled_write, + .llseek = generic_file_llseek, +}; +#endif + #ifdef CONFIG_USER_NS static int proc_id_map_open(struct inode *inode, struct file *file, const struct seq_operations *seq_ops) @@ -2976,6 +3022,10 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_HARDWALL ONE("hardwall", S_IRUGO, proc_pid_hardwall), #endif +#ifdef CONFIG_DETECT_HUNG_TASK + REG("hang_detection_enabled", 0666, + proc_hung_task_detection_enabled_operations), +#endif #ifdef CONFIG_USER_NS REG("uid_map",S_IRUGO|S_IWUSR, proc_uid_map_operations), REG("gid_map",S_IRUGO|S_IWUSR, proc_gid_map_operations), @@ -3367,6 +3417,10 @@ static const struct pid_entry tid_base_stuff[] = { #ifdef CONFIG_HARDWALL ONE("hardwall", S_IRUGO, proc_pid_hardwall), #endif +#ifdef CONFIG_DETECT_HUNG_TASK + REG("hang_detection_enabled", 0666, + proc_hung_task_detection_enabled_operations), +#endif #ifdef CONFIG_USER_NS REG("uid_map",S_IRUGO|S_IWUSR, proc_uid_map_operations), REG("gid_map",S_IRUGO|S_IWUSR, proc_gid_map_operations), diff --git a/include/linux/sched.h b/include/linux/sched.h index fdf74f27acf1..1280df0a6708 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -767,6 +767,7 @@ struct task_struct { #endif #ifdef CONFIG_DETECT_HUNG_TASK unsigned long last_switch_count; + boolhang_detection_enabled; #endif /* Filesystem information: */ struct fs_struct*fs; diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index d6a18a3839cc..09d9f1c65fd2 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -11,6 +11,7 @@ extern int sysctl_hung_task_check_count; extern unsigned int sysctl_hung_task_panic; extern unsigned long sysctl_hung_task_timeout_secs; extern int sysctl_hung_task_warnings; +extern int sysctl_hung_task_selective_monitoring; extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); diff --git a/kernel/hung_task.c b/kernel/hung_task.c index 751593ed7c0b..321bcfa2e34a 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -20,12 +20,21 @@ #include #include +#include /* * The number of tasks checked: */ int
[PATCH] coresight: of_get_coresight_platform_data needs both OF and CORESIGHT
The file that defines of_get_coresight_platform_data() is indeed dependent on CONFIG_OF but the entire coresight directory depends on CONFIG_CORESIGHT so both need to be enabled to make the symbol resolve. Signed-off-by: Jordan Crouse <jcro...@codeaurora.org> Signed-off-by: Lingutla Chandrasekhar <clingu...@codeaurora.org> diff --git a/include/linux/coresight.h b/include/linux/coresight.h index d950dad5056a..d298eea7e6a0 100644 --- a/include/linux/coresight.h +++ b/include/linux/coresight.h @@ -262,7 +262,7 @@ static inline int coresight_timeout(void __iomem *addr, u32 offset, int position, int value) { return 1; } #endif -#ifdef CONFIG_OF +#if defined(CONFIG_OF) && defined(CONFIG_CORESIGHT) extern int of_coresight_get_cpu(const struct device_node *node); extern struct coresight_platform_data * of_get_coresight_platform_data(struct device *dev, -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
[PATCH] coresight: of_get_coresight_platform_data needs both OF and CORESIGHT
The file that defines of_get_coresight_platform_data() is indeed dependent on CONFIG_OF but the entire coresight directory depends on CONFIG_CORESIGHT so both need to be enabled to make the symbol resolve. Signed-off-by: Jordan Crouse Signed-off-by: Lingutla Chandrasekhar diff --git a/include/linux/coresight.h b/include/linux/coresight.h index d950dad5056a..d298eea7e6a0 100644 --- a/include/linux/coresight.h +++ b/include/linux/coresight.h @@ -262,7 +262,7 @@ static inline int coresight_timeout(void __iomem *addr, u32 offset, int position, int value) { return 1; } #endif -#ifdef CONFIG_OF +#if defined(CONFIG_OF) && defined(CONFIG_CORESIGHT) extern int of_coresight_get_cpu(const struct device_node *node); extern struct coresight_platform_data * of_get_coresight_platform_data(struct device *dev, -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.