Re: [PATCH v3 1/1] scripts: Add add-maintainer.py

2023-09-26 Thread Pavan Kondeti
On Tue, Sep 26, 2023 at 05:32:10PM +0530, Pavan Kondeti wrote:
> On Sat, Aug 26, 2023 at 01:07:42AM -0700, Guru Das Srinagesh wrote:
> > +def gather_maintainers_of_file(patch_file):
> > +    all_entities_of_patch = dict()
> > +
> > +    # Run get_maintainer.pl on patch file
> > +    logging.info("GET: Patch: {}".format(os.path.basename(patch_file)))
> > +    cmd = ['scripts/get_maintainer.pl']
> > +    cmd.extend([patch_file])
> > +
> > +    try:
> > +        p = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
> > +    except:
> > +        sys.exit(1)
> > +
> > +    logging.debug("\n{}".format(p.stdout.decode()))
> > +
> > +    entries = p.stdout.decode().splitlines()
> > +
> > +    maintainers = []
> > +    lists = []
> > +    others = []
> > +
> > +    for entry in entries:
> > +        entity = entry.split('(')[0].strip()
> > +        if any(role in entry for role in ["maintainer", "reviewer"]):
> > +            maintainers.append(entity)
> > +        elif "list" in entry:
> > +            lists.append(entity)
> > +        else:
> > +            others.append(entity)
> > +
> > +    all_entities_of_patch["maintainers"] = set(maintainers)
> > +    all_entities_of_patch["lists"] = set(lists)
> > +    all_entities_of_patch["others"] = set(others)
> > +
> > +    return all_entities_of_patch
> > +
> 
> FYI, there are a couple of issues I found while playing around.
> 
> - Some entries in MAINTAINERS could be "supporter"
> - When names contain a parenthesized part, e.g. ("company"), the script fails to extract the name.
> 
> Thanks,
> Pavan
> 
> diff --git a/scripts/add-maintainer.py b/scripts/add-maintainer.py
> index 5a5cc9482b06..6aa5e7941172 100755
> --- a/scripts/add-maintainer.py
> +++ b/scripts/add-maintainer.py
> @@ -29,8 +29,8 @@ def gather_maintainers_of_file(patch_file):
>      others = []
> 
>      for entry in entries:
> -        entity = entry.split('(')[0].strip()
> -        if any(role in entry for role in ["maintainer", "reviewer"]):
> +        entity = entry.rsplit('(', 1)[0].strip()
> +        if any(role in entry for role in ["maintainer", "reviewer", "supporter"]):
>              maintainers.append(entity)
>          elif "list" in entry:
>              lists.append(entity)
> 
> 

The %s/split/rsplit change trades one bug for another :-( so please ignore the
diff; when entries contain multiple parentheses "()", the script still does not
work as expected.
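
One way around both issues (a rough, untested sketch on my side, not something
from the posted patch) is to anchor on the last parenthesised group, i.e. the
role string that get_maintainer.pl appends, and classify based on the role
alone instead of the whole entry:

import re

# Each get_maintainer.pl entry ends with the role in parentheses, e.g.
# 'Foo Bar (Some Company) <foo@bar.com> (supporter:SOME SUBSYSTEM)'.
# Matching only the trailing group keeps parentheses inside names intact.
ENTRY_RE = re.compile(r'^(?P<entity>.+?)\s*\((?P<role>[^()]*)\)\s*$')

def classify_entry(entry):
    m = ENTRY_RE.match(entry)
    if not m:
        return entry.strip(), "others"
    entity, role = m.group('entity'), m.group('role')
    if any(r in role for r in ("maintainer", "reviewer", "supporter")):
        return entity, "maintainers"
    if "list" in role:
        return entity, "lists"
    return entity, "others"

This still assumes the role is always the final "(...)" group on the line, so
treat it as a starting point only.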

Thanks,
Pavan


Re: [PATCH v3 1/1] scripts: Add add-maintainer.py

2023-09-26 Thread Pavan Kondeti
On Mon, Aug 28, 2023 at 10:21:32AM +0200, Krzysztof Kozlowski wrote:
> On 26/08/2023 10:07, Guru Das Srinagesh wrote:
> > This script runs get_maintainer.py on a given patch file (or multiple
> > patch files) and adds its output to the patch file in place with the
> > appropriate email headers "To: " or "Cc: " as the case may be. These new
> > headers are added after the "From: " line in the patch.
> > 
> > Currently, for a single patch, maintainers and reviewers are added as
> > "To: ", mailing lists and all other roles are added as "Cc: ".
> > 
> > For a series of patches, however, a set-union scheme is employed in
> > order to solve the all-too-common problem of ending up sending only
> > subsets of a patch series to some lists, which results in important
> > pieces of context such as the cover letter (or other patches in the
> > series) being dropped from those lists. This scheme is as follows:
> > 
> > - Create set-union of all maintainers and reviewers from all patches and
> >   use this to do the following per patch:
> >   - add only that specific patch's maintainers and reviewers as "To: "
> >   - add the other maintainers and reviewers from the other patches as "Cc: "
> > 
> > - Create set-union of all mailing lists corresponding to all patches and
> >   add this to all patches as "Cc: "
> > 
> > - Create set-union of all other roles corresponding to all patches and
> >   add this to all patches as "Cc: "
> > 
> > Please note that patch files that don't have any "Maintainer"s or
> > "Reviewers" explicitly listed in their `get_maintainer.pl` output will
> 
> So then you will be ignoring the reviewers, right? One more reason to not
> get it right...
> 
> > not have any "To: " entries added to them; developers are expected to
> > manually make edits to the added entries in such cases to convert some
> > "Cc: " entries to "To: " as desired.
> > 
> > The script is quiet by default (only prints errors) and its verbosity
> > can be adjusted via an optional parameter.
> > 
> > Signed-off-by: Guru Das Srinagesh 
> > ---
> >  MAINTAINERS   |   5 ++
> >  scripts/add-maintainer.py | 164 ++
> >  2 files changed, 169 insertions(+)
> >  create mode 100755 scripts/add-maintainer.py
> > 
> 
> I do not see the benefits of this script. For me it's unnecessarily
> more complicated compared to my simple bash function, which does
> everything in one command less than here.
> One more thing to maintain.

Thanks for your bash script. I slightly modified it to keep maintainers
in To and the rest in Cc.

git send-email --dry-run \
    --to="$(scripts/get_maintainer.pl --no-multiline --separator=, --no-r --no-l --no-git --no-roles --no-rolestats --no-git-fallback *.patch)" \
    --cc="$(scripts/get_maintainer.pl --no-multiline --separator=, --no-m --no-git --no-roles --no-rolestats --no-git-fallback *.patch)" \
    *.patch

> 
> I don't see the benefits of this for newcomers, either. They should use
> b4. b4 can do much, much more, so anyone creating his workflow should
> start from b4, not from this script.

The ROI from b4 is huge. Totally agree that one should definitely consider b4
for kernel patch submissions. I really liked b4 appending a unique "change-id"
across patch versions. This is the single feature I miss most from Gerrit,
where all revisions of a patch are glued together by a common change-id. If
everyone uses a common change-id for all versions of a series, it is super
easy to dig into the archives.

Thanks,
Pavan


Re: [PATCH v3 1/1] scripts: Add add-maintainer.py

2023-09-26 Thread Pavan Kondeti
On Sat, Aug 26, 2023 at 01:07:42AM -0700, Guru Das Srinagesh wrote:
> +def gather_maintainers_of_file(patch_file):
> +    all_entities_of_patch = dict()
> +
> +    # Run get_maintainer.pl on patch file
> +    logging.info("GET: Patch: {}".format(os.path.basename(patch_file)))
> +    cmd = ['scripts/get_maintainer.pl']
> +    cmd.extend([patch_file])
> +
> +    try:
> +        p = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
> +    except:
> +        sys.exit(1)
> +
> +    logging.debug("\n{}".format(p.stdout.decode()))
> +
> +    entries = p.stdout.decode().splitlines()
> +
> +    maintainers = []
> +    lists = []
> +    others = []
> +
> +    for entry in entries:
> +        entity = entry.split('(')[0].strip()
> +        if any(role in entry for role in ["maintainer", "reviewer"]):
> +            maintainers.append(entity)
> +        elif "list" in entry:
> +            lists.append(entity)
> +        else:
> +            others.append(entity)
> +
> +    all_entities_of_patch["maintainers"] = set(maintainers)
> +    all_entities_of_patch["lists"] = set(lists)
> +    all_entities_of_patch["others"] = set(others)
> +
> +    return all_entities_of_patch
> +

FYI, there are a couple of issues I found while playing around.

- Some entries in MAINTAINERS could be "supporter"
- When names contain a parenthesized part, e.g. ("company"), the script fails to extract the name.

Thanks,
Pavan

diff --git a/scripts/add-maintainer.py b/scripts/add-maintainer.py
index 5a5cc9482b06..6aa5e7941172 100755
--- a/scripts/add-maintainer.py
+++ b/scripts/add-maintainer.py
@@ -29,8 +29,8 @@ def gather_maintainers_of_file(patch_file):
     others = []
 
     for entry in entries:
-        entity = entry.split('(')[0].strip()
-        if any(role in entry for role in ["maintainer", "reviewer"]):
+        entity = entry.rsplit('(', 1)[0].strip()
+        if any(role in entry for role in ["maintainer", "reviewer", "supporter"]):
             maintainers.append(entity)
         elif "list" in entry:
             lists.append(entity)


Thanks,
Pavan


Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Pavan Kondeti
On Tue, Apr 06, 2021 at 12:15:24PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 06, 2021 at 08:57:15PM +0530, Pavan Kondeti wrote:
> > Yeah. The workqueue attrs come in handy to reduce the nice/prio of a
> > background workqueue if we identify that it is cpu intensive. However, this
> > needs case-by-case analysis, tweaking, etc. If there is no other alternative,
> > we might end up chasing the background workers and reducing their nice values.
> 
> There shouldn't be that many workqueues that consume a lot of cpu cycles.
> The usual culprit is kswapd, IO related stuff (writeback, encryption), so it
> shouldn't be a long list and we want them identified anyway.
> 
Sure. I have not done a complete study on which workers in our system can
compete with important tasks in other cgroups. We will have to do that to
adjust the workqueue priority so that the impact can be minimized.

> > The problem at hand (which you might already know) is this: let's say
> > we have 2 cgroups in our setup and we want to prioritize UX over background,
> > IOW, reduce the cpu.shares of the background cgroup. This helps in prioritizing
> > Task-A and Task-B over Task-X and Task-Y. However, each individual kworker
> > can potentially have a CPU share equal to the entire UX cgroup.
> > 
> > - kworker/0:0
> > - kworker/1:0
> > - UX
> >     - Task-A
> >     - Task-B
> > - background
> >     - Task-X
> >     - Task-Y
> > 
> > Hence we want to move all kernel threads to another cgroup so that this
> > cgroup will have a CPU share equal to the UX cgroup.
> > 
> > The patch presented here allows us to create the above setup. Are there any
> > other alternative approaches to achieve the same without impacting any future
> > designs/requirements?
> 
> Not quite the same but we already have
> /sys/devices/virtual/workqueue/cpumask which affects all unbound workqueues,
> so maybe a global default priority knob would help here?
> 

Yeah, not exactly what we are looking for. It gives us the ability to restrict
the priority of all unbound workqueues at the expense of not being able to
prioritize one workqueue over another.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Pavan Kondeti
Hi Tejun,

On Tue, Apr 06, 2021 at 09:36:00AM -0400, Tejun Heo wrote:
> Hello,
> 
> On Tue, Apr 06, 2021 at 06:34:21PM +0530, Pavankumar Kondeti wrote:
> > In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
> > important work. Given that CPU shares of root cgroup can't be changed,
> > leaving the tasks inside root cgroup will give them higher share
> > compared to the other tasks inside important cgroups. This is mitigated
> > by moving all tasks inside root cgroup to a different cgroup after
> > Android is booted. However, there are many kernel tasks stuck in the
> > root cgroup after the boot.
> > 
> > We see all kworker threads are in the root cpu cgroup. This is because,
> > tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
> > This restriction is in place to avoid kworkers getting moved to a cpuset
> > which conflicts with kworker affinity. Relax this restriction by explicitly
> > checking if the task is moving out of a cpuset cgroup. This allows kworkers
> > to be moved out root cpu cgroup when cpu and cpuset cgroup controllers
> > are mounted on different hierarchies.
> > 
> > We also see kthreadd_task and any kernel thread created after the Android 
> > boot
> > also stuck in the root cgroup. The current code prevents kthreadd_task 
> > moving
> > out root cgroup to avoid the possibility of creating new RT kernel threads
> > inside a cgroup with no RT runtime allocated. Apply this restriction when 
> > tasks
> > are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
> > kernel threads to be moved out of root cpu cgroup if the kernel does not
> > enable RT group scheduling.
> 
> The fundamental reason why those kthreads are in the root cgroup is because
> they're doing work on behalf of the entire system and their resource usages
> can't be attributed to any specific cgroup. What we want to do is accounting
> actual usages to the originating cgroups so that cpu cycles spent by kswapd
> is charged to the originating cgroups, however well we can define them, and
> then throttle the origin if the consumption is going over budget for that
> cgroup's allocation. This is how we already handle shared IOs.

Thanks for your reply. I understand the reasoning for why we don't allow
moving kworkers to a custom cgroup.

> 
> The problem with the proposed patch is that it breaks the logical
> organization of resource hierarchy in a way which hinders proper future
> solutions.
> 
> If all you want is deprioritizing certain kworkers, please use workqueue
> attrs instead.
> 
Yeah. The workqueue attrs come in handy to reduce the nice/prio of a
background workqueue if we identify that it is cpu intensive. However, this
needs case-by-case analysis, tweaking, etc. If there is no other alternative,
we might end up chasing the background workers and reducing their nice values.

The problem at hand (which you might already know) is this: let's say
we have 2 cgroups in our setup and we want to prioritize UX over background,
IOW, reduce the cpu.shares of the background cgroup. This helps in prioritizing
Task-A and Task-B over Task-X and Task-Y. However, each individual kworker
can potentially have a CPU share equal to the entire UX cgroup.

- kworker/0:0
- kworker/1:0
- UX
    - Task-A
    - Task-B
- background
    - Task-X
    - Task-Y

Hence we want to move all kernel threads to another cgroup so that this cgroup
will have a CPU share equal to the UX cgroup.

The patch presented here allows us to create the above setup. Are there any
other alternative approaches to achieve the same without impacting any future
designs/requirements?

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Pavan Kondeti
Hi Quentin,

On Tue, Apr 06, 2021 at 12:10:41PM +, Quentin Perret wrote:
> Hi Pavan,
> 
> On Tuesday 06 Apr 2021 at 16:29:13 (+0530), Pavankumar Kondeti wrote:
> > In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
> > important work. Given that CPU shares of root cgroup can't be changed,
> > leaving the tasks inside root cgroup will give them higher share
> > compared to the other tasks inside important cgroups. This is mitigated
> > by moving all tasks inside root cgroup to a different cgroup after
> > Android is booted. However, there are many kernel tasks stuck in the
> > root cgroup after the boot.
> > 
> > We see all kworker threads are in the root cpu cgroup. This is because,
> > tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
> > This restriction is in place to avoid kworkers getting moved to a cpuset
> > which conflicts with kworker affinity. Relax this restriction by explicitly
> > checking if the task is moving out of a cpuset cgroup. This allows kworkers
> > to be moved out root cpu cgroup.
> > 
> > We also see kthreadd_task and any kernel thread created after the Android 
> > boot
> > also stuck in the root cgroup. The current code prevents kthreadd_task 
> > moving
> > out root cgroup to avoid the possibility of creating new RT kernel threads
> > inside a cgroup with no RT runtime allocated. Apply this restriction when 
> > tasks
> > are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
> > kernel threads to be moved out of root cpu cgroup if the kernel does not
> > enable RT group scheduling.
> 
> OK, so IIUC this only works with cgroup v1 -- the unified hierarchy in
> v2 forces you to keep cpu and cpuset in 'sync'. But that should be fine,
> so this looks like a nice improvement to me.
> 
Yes. I will mention this in the commit description.

> >  
> >  struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
> > -bool *locked)
> > +bool *locked,
> > +struct cgroup *dst_cgrp)
> > __acquires(&cgroup_threadgroup_rwsem)
> >  {
> > struct task_struct *tsk;
> > @@ -2784,15 +2785,28 @@ struct task_struct *cgroup_procs_write_start(char 
> > *buf, bool threadgroup,
> > tsk = tsk->group_leader;
> >  
> > /*
> > +* RT kthreads may be born in a cgroup with no rt_runtime allocated.
> > +* Just say no.
> > +*/
> > +#ifdef CONFIG_RT_GROUP_SCHED
> > +   if (tsk->no_cgroup_migration && (dst_cgrp->root->subsys_mask & (1U << 
> > cpu_cgrp_id))) {
> > +   tsk = ERR_PTR(-EINVAL);
> > +   goto out_unlock_threadgroup;
> > +   }
> > +#endif
> > +
> > +   /*
> >  * kthreads may acquire PF_NO_SETAFFINITY during initialization.
> >  * If userland migrates such a kthread to a non-root cgroup, it can
> > -* become trapped in a cpuset, or RT kthread may be born in a
> > -* cgroup with no rt_runtime allocated.  Just say no.
> > +* become trapped in a cpuset. Just say no.
> >  */
> > -   if (tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) {
> > +#ifdef CONFIG_CPUSETS
> > +   if ((tsk->no_cgroup_migration || (tsk->flags & PF_NO_SETAFFINITY)) &&
> > +   (dst_cgrp->root->subsys_mask & (1U << cpuset_cgrp_id))) 
> > {
> > tsk = ERR_PTR(-EINVAL);
> > goto out_unlock_threadgroup;
> > }
> > +#endif
> 
> Nit: maybe move this #ifdefery out to a header?
> 
Agreed.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH v2 1/7] sched/fair: Ignore percpu threads for imbalance pulls

2021-02-21 Thread Pavan Kondeti
On Fri, Feb 19, 2021 at 12:59:57PM +, Valentin Schneider wrote:
> From: Lingutla Chandrasekhar 
> 
> In load balancing, when balancing group is unable to pull task
> due to ->cpus_ptr constraints from busy group, then it sets
> LBF_SOME_PINNED to lb env flags, as a consequence, sgc->imbalance
> is set for its parent domain level. which makes the group
> classified as imbalance to get help from another balancing cpu.
> 
> Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and
> CPUs 2-3 as Bigs with below scenario:
> - CPU0 doing newly_idle balancing
> - CPU1 running percpu kworker and RT task (small tasks)
> - CPU2 running 2 big tasks
> - CPU3 running 1 medium task
> 
> While CPU0 is doing newly_idle load balance at MC level, it fails to
> pull percpu kworker from CPU1 and sets LBF_SOME_PINNED to lb env flag
> and set sgc->imbalance at DIE level domain. As LBF_ALL_PINNED is not cleared,
> it tries to redo the balancing by clearing CPU1 in env cpus, but it doesn't
> find another busiest_group, so CPU0 stops balancing at MC level without
> clearing 'sgc->imbalance' and restarts the load balancing at DIE level.
> 
> And CPU0 (balancing cpu) finds LITTLE's group as busiest_group with group
> type as imbalance, and Bigs that classified the level below imbalance type
> would be ignored to pick as busiest, and the balancing would be aborted
> without pulling any tasks (by the time, CPU1 might not have running tasks).
> 
> It is suboptimal decision to classify the group as imbalance due to
> percpu threads. So don't use LBF_SOME_PINNED for per cpu threads.
> 
> Signed-off-by: Lingutla Chandrasekhar 
> [Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/sched/fair.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..2d4dcf1a3372 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7539,6 +7539,10 @@ int can_migrate_task(struct task_struct *p, struct 
> lb_env *env)
>   if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>   return 0;
>  
> + /* Disregard pcpu kthreads; they are where they need to be. */
> + if ((p->flags & PF_KTHREAD) && kthread_is_per_cpu(p))
> + return 0;
> +
>   if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>   int cpu;
>  

Looks good to me. Thanks Valentin for the help.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] sched/fair: Ignore percpu threads for imbalance pulls

2021-02-17 Thread Pavan Kondeti
On Wed, Feb 17, 2021 at 02:50:23PM +, Valentin Schneider wrote:
> On 17/02/21 17:38, Lingutla Chandrasekhar wrote:
> > In load balancing, when balancing group is unable to pull task
> > due to ->cpus_ptr constraints from busy group, then it sets
> > LBF_SOME_PINNED to lb env flags, as a consequence, sgc->imbalance
> > is set for its parent domain level. which makes the group
> > classified as imbalance to get help from another balancing cpu.
> >
> > Consider a 4-CPU big.LITTLE system with CPUs 0-1 as LITTLEs and
> > CPUs 2-3 as Bigs with below scenario:
> > - CPU0 doing newly_idle balancing
> > - CPU1 running percpu kworker and RT task (small tasks)
> > - CPU2 running 2 big tasks
> > - CPU3 running 1 medium task
> >
> > While CPU0 is doing newly_idle load balance at MC level, it fails to
> > pull percpu kworker from CPU1 and sets LBF_SOME_PINNED to lb env flag
> > and set sgc->imbalance at DIE level domain. As LBF_ALL_PINNED is not cleared,
> > it tries to redo the balancing by clearing CPU1 in env cpus, but it doesn't
> > find another busiest_group, so CPU0 stops balancing at MC level without
> > clearing 'sgc->imbalance' and restarts the load balancing at DIE level.
> >
> > And CPU0 (balancing cpu) finds LITTLE's group as busiest_group with group
> > type as imbalance, and Bigs that classified the level below imbalance type
> > would be ignored to pick as busiest, and the balancing would be aborted
> > without pulling any tasks (by the time, CPU1 might not have running tasks).
> >
> > It is suboptimal decision to classify the group as imbalance due to
> > percpu threads. So don't use LBF_SOME_PINNED for per cpu threads.
> >
> 
> Sounds like you've stumbled on the same thing I'm trying to fix in
> 
>   http://lore.kernel.org/r/20210128183141.28097-8-valentin.schnei...@arm.com
> 
> (I'm currently working on a v2)
> 
> Now, I'd tend to agree that if we could prevent pcpu kworkers from
> interfering with load-balance altogether, that would indeed be much
> better than trying to deal with the group_imbalanced faff further down the
> line (which is what I've been doing).
> 
> > Signed-off-by: Lingutla Chandrasekhar 
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce20da67..44a05ad8c96b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7560,7 +7560,9 @@ int can_migrate_task(struct task_struct *p, struct 
> > lb_env *env)
> >  
> > schedstat_inc(p->se.statistics.nr_failed_migrations_affine);
> >  
> > -   env->flags |= LBF_SOME_PINNED;
> > +   /* Ignore percpu threads for imbalance pulls. */
> > +   if (p->nr_cpus_allowed > 1)
> > +   env->flags |= LBF_SOME_PINNED;
> >  
> > /*
> >  * Remember if this task can be migrated to any other CPU in
> 
> Unlike user tasks, pcpu kworkers have a stable affinity (with some hotplug
> quirks), so perhaps we could do this instead:
> 
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..84fca350b9ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7539,6 +7539,9 @@ int can_migrate_task(struct task_struct *p, struct 
> lb_env *env)
>   if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>   return 0;
>  
> + if (kthread_is_per_cpu(p))
> + return 0;
> +
>   if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
>   int cpu;
>  

Looks good to me. In our testing as well, the false imbalance manifests due
to pinned kworkers.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] PM / EM: Micro optimization in em_pd_energy

2020-11-23 Thread Pavan Kondeti
On Mon, Nov 23, 2020 at 10:28:39AM +, Quentin Perret wrote:
> Hi Pavan,
> 
> On Monday 23 Nov 2020 at 15:47:57 (+0530), Pavankumar Kondeti wrote:
> > When the sum of the utilization of CPUs in a power domain is zero,
> 
> s/power/performance
> 
> > return the energy as 0 without doing any computations.
> > 
> > Signed-off-by: Pavankumar Kondeti 
> > ---
> >  include/linux/energy_model.h | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
> > index b67a51c..8810f1f 100644
> > --- a/include/linux/energy_model.h
> > +++ b/include/linux/energy_model.h
> > @@ -103,6 +103,9 @@ static inline unsigned long em_cpu_energy(struct 
> > em_perf_domain *pd,
> > struct em_perf_state *ps;
> > int i, cpu;
> >  
> > +   if (!sum_util)
> > +   return 0;
> > +
> > /*
> >  * In order to predict the performance state, map the utilization of
> >  * the most utilized CPU of the performance domain to a requested
> 
> Makes sense to me, so with nit above:
> 
> Acked-by: Quentin Perret 
> 
Thanks Quentin. I have updated the patch as per your correction. 

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH 1/1] sched/uclamp: release per-task uclamp control if user set to default value

2020-10-05 Thread Pavan Kondeti
On Fri, Oct 02, 2020 at 01:38:12PM +0800, Yun Hsiang wrote:
> On Wed, Sep 30, 2020 at 03:12:51PM +0200, Dietmar Eggemann wrote:
> Hi Dietmar,
> 
> > Hi Yun,
> > 
> > On 28/09/2020 10:26, Yun Hsiang wrote:
> > > If the user wants to release the util clamp and let cgroup to control it,
> > > we need a method to reset.
> > > 
> > > So if the user set the task uclamp to the default value (0 for UCLAMP_MIN
> > > and 1024 for UCLAMP_MAX), reset the user_defined flag to release control.
> > > 
> > > Signed-off-by: Yun Hsiang 
> > 
> > could you explain with a little bit more detail why you would need this
> > feature?
> > 
> > Currently we assume that once the per-task uclamp (user-defined) values
> > are set, you could only change the effective uclamp values of this task
> > by (1) moving it into another taskgroup or (2) changing the system
> > default uclamp values.
> > 
> 
> Assume a module that controls group (such as top-app in android) uclamp and
> task A in the group.
> Once task A set uclamp, it will not be affected by the group setting.
> If task A doesn't want to control itself anymore,
> it can not go back to the initial state to let the module(group) control.
> But the other tasks in the group will be affected by the group.
> 
> The policy might be
> 1) if the task wants to control it's uclamp, use task uclamp value
> (but under group uclamp constraint)
> 2) if the task doesn't want to control it's uclamp, use group uclamp value.
> 
> If the policy is proper, we need a reset method for per-task uclamp.

Right. It would be nice to have the capability to reset per-task uclamp
settings. In Android, I have seen tasks in top-app being affined to the Big
cluster. When these tasks move to background, the cpuset automatically
overrides the affinity of those tasks. If the same use case is extended to use
uclamp to set a high value for some tasks in top-app, there should be a way to
reset the uclamp settings when they are moved to background. I don't know if
we will ever see this implemented in Android.

> 
> > > ---
> > >  kernel/sched/core.c | 7 +--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 9a2fbf98fd6f..fa63d70d783a 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1187,6 +1187,7 @@ static void __setscheduler_uclamp(struct 
> > > task_struct *p,
> > > const struct sched_attr *attr)
> > >  {
> > >   enum uclamp_id clamp_id;
> > > + bool user_defined;
> > >  
> > >   /*
> > >* On scheduling class change, reset to default clamps for tasks
> > > @@ -1210,14 +1211,16 @@ static void __setscheduler_uclamp(struct 
> > > task_struct *p,
> > >   if (likely(!(attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)))
> > >   return;
> > >  
> > > + user_defined = attr->sched_util_min == 0 ? false : true;
> > >   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MIN) {
> > >   uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
> > > -   attr->sched_util_min, true);
> > > +   attr->sched_util_min, user_defined);
> > >   }
> > >  
> > > + user_defined = attr->sched_util_max == 1024 ? false : true;

This does not work for all cases. Let's say a task is in a cgroup which
restricts uclamp.max. The task wants to lift this restriction by setting
uclamp.max = 1024. With your approach, we don't honor this. Correct?

> > >   if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
> > >   uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
> > > -   attr->sched_util_max, true);
> > > +   attr->sched_util_max, user_defined);
> > >   }
> > >  }

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [RFC PATCH v7 12/23] sched: Trivial forced-newidle balancer

2020-09-02 Thread Pavan Kondeti
On Fri, Aug 28, 2020 at 03:51:13PM -0400, Julien Desfossez wrote:
>  /*
>   * The static-key + stop-machine variable are needed such that:
>   *
> @@ -4641,7 +4656,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
> struct rq_flags *rf)
>   struct task_struct *next, *max = NULL;
>   const struct sched_class *class;
>   const struct cpumask *smt_mask;
> - int i, j, cpu;
> + int i, j, cpu, occ = 0;
>   int smt_weight;
>   bool need_sync;
>  
> @@ -4750,6 +4765,9 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
> struct rq_flags *rf)
>   goto done;
>   }
>  
> + if (!is_idle_task(p))
> + occ++;
> +
>   rq_i->core_pick = p;
>  
>   /*
> @@ -4775,6 +4793,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, 
> struct rq_flags *rf)
>  
>   cpu_rq(j)->core_pick = NULL;
>   }
> + occ = 1;
>   goto again;
>   } else {
>   /*
> @@ -4820,6 +4839,8 @@ next_class:;
>   if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
>   rq_i->core_forceidle = true;
>  
> + rq_i->core_pick->core_occupation = occ;
> +
>   if (i == cpu)
>   continue;
>  
> @@ -4837,6 +4858,113 @@ next_class:;
>   return next;
>  }
>  
> +static bool try_steal_cookie(int this, int that)
> +{
> + struct rq *dst = cpu_rq(this), *src = cpu_rq(that);
> + struct task_struct *p;
> + unsigned long cookie;
> + bool success = false;
> +
> + local_irq_disable();
> + double_rq_lock(dst, src);
> +
> + cookie = dst->core->core_cookie;
> + if (!cookie)
> + goto unlock;
> +
> + if (dst->curr != dst->idle)
> + goto unlock;
> +
> + p = sched_core_find(src, cookie);
> + if (p == src->idle)
> + goto unlock;
> +
> + do {
> + if (p == src->core_pick || p == src->curr)
> + goto next;
> +
> + if (!cpumask_test_cpu(this, &p->cpus_mask))
> + goto next;
> +
> + if (p->core_occupation > dst->idle->core_occupation)
> + goto next;
> +
Can you please explain the rationale behind this check? If I understand
correctly, p->core_occupation is set in pick_next_task() to indicate
the number of matching cookie (except idle) tasks picked on this core.
It is not reset anywhere.

> + p->on_rq = TASK_ON_RQ_MIGRATING;
> + deactivate_task(src, p, 0);
> + set_task_cpu(p, this);
> + activate_task(dst, p, 0);
> + p->on_rq = TASK_ON_RQ_QUEUED;
> +
> + resched_curr(dst);
> +
> + success = true;
> + break;
> +
> +next:
> + p = sched_core_next(p, cookie);
> + } while (p);
> +
> +unlock:
> + double_rq_unlock(dst, src);
> + local_irq_enable();
> +
> + return success;
> +}

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: Looping more in detach_tasks() when RT and CFS tasks are present

2020-06-24 Thread Pavan Kondeti
Hi Vincent,

On Wed, Jun 24, 2020 at 02:39:25PM +0200, Vincent Guittot wrote:
> Hi Pavan,
> 
> On Wed, 24 Jun 2020 at 13:42, Pavan Kondeti  wrote:
> >
> > Hi Vincent/Peter,
> >
> > in load_balance(), we derive env->loop_max based on rq->nr_running.
> > When the busiest rq has both RT and CFS tasks, we do more loops in
> > detach_tasks(). Is there any reason for not using
> > rq->cfs.h_nr_running?
> 
> Using cfs.h_nr_running seems fine for loop_max
> 

Thanks for taking a look.

> >
> > Lei Wen attempted to fix this before.
> > https://lore.kernel.org/lkml/1376814322-7320-2-git-send-email-lei...@marvell.com/
> 
> The 1st part of the patch is wrong because even if h_nr_running == 1
> but nr_running > 1, we can pull the cfs thread without using active
> balance
> 

Right. When RT and CFS tasks are packed, I have seen a CFS task getting pulled
via the load balancer without waking migration/X.

I was using the below patch along with some prints in the detach_tasks() loop.
I will update Lei Wen's patch and resend it.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b..f042016 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9509,7 +9509,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	 * correctly treated as an imbalance.
 	 */
 	env.flags |= LBF_ALL_PINNED;
-	env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
+	env.loop_max  = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_running);
 
 more_balance:
 	rq_lock_irqsave(busiest, &rf);

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Looping more in detach_tasks() when RT and CFS tasks are present

2020-06-24 Thread Pavan Kondeti
Hi Vincent/Peter,

In load_balance(), we derive env->loop_max based on rq->nr_running.
When the busiest rq has both RT and CFS tasks, we do more loops in
detach_tasks(). Is there any reason for not using
rq->cfs.h_nr_running?

Lei Wen attempted to fix this before.
https://lore.kernel.org/lkml/1376814322-7320-2-git-send-email-lei...@marvell.com/

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


Re: [PATCH 2/2] sched: Offload wakee task activation if it the wakee is descheduling

2020-05-27 Thread Pavan Kondeti
On Sun, May 24, 2020 at 09:29:56PM +0100, Mel Gorman wrote:
> The patch "sched: Optimize ttwu() spinning on p->on_cpu" avoids spinning
> on p->on_rq when the task is descheduling but only if the wakee is on
> a CPU that does not share cache with the waker. This patch offloads the
> activation of the wakee to the CPU that is about to go idle if the task
> is the only one on the runqueue. This potentially allows the waker task
> to continue making progress when the wakeup is not strictly synchronous.
> 
> This is very obvious with netperf UDP_STREAM running on localhost. The
> waker is sending packets as quickly as possible without waiting for any
> reply. It frequently wakes the server for the processing of packets and
> when netserver is using local memory, it quickly completes the processing
> and goes back to idle. The waker often observes that netserver is on_rq
> and spins excessively leading to a drop in throughput.
> 
> This is a comparison of 5.7-rc6 against "sched: Optimize ttwu() spinning
> on p->on_cpu" and against this patch labeled vanilla, optttwu-v1r1 and
> localwakelist-v1r2 respectively.
> 
>                        5.7.0-rc6              5.7.0-rc6              5.7.0-rc6
>                          vanilla           optttwu-v1r1     localwakelist-v1r2
> Hmean send-64       251.49 (   0.00%)    258.05 *   2.61%*    305.59 *  21.51%*
> Hmean send-128      497.86 (   0.00%)    519.89 *   4.43%*    600.25 *  20.57%*
> Hmean send-256      944.90 (   0.00%)    997.45 *   5.56%*   1140.19 *  20.67%*
> Hmean send-1024    3779.03 (   0.00%)   3859.18 *   2.12%*   4518.19 *  19.56%*
> Hmean send-2048    7030.81 (   0.00%)   7315.99 *   4.06%*   8683.01 *  23.50%*
> Hmean send-3312   10847.44 (   0.00%)  11149.43 *   2.78%*  12896.71 *  18.89%*
> Hmean send-4096   13436.19 (   0.00%)  13614.09 (   1.32%)  15041.09 *  11.94%*
> Hmean send-8192   22624.49 (   0.00%)  23265.32 *   2.83%*  24534.96 *   8.44%*
> Hmean send-16384  34441.87 (   0.00%)  36457.15 *   5.85%*  35986.21 *   4.48%*
> 
> Note that this benefit is not universal to all wakeups, it only applies
> to the case where the waker often spins on p->on_rq.
> 
Thanks for the detailed description. I think you meant p->on_cpu here, not
p->on_rq. I know this patch is already included; if possible, please make this
edit here and below.

>   /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index db3a57675ccf..06297d1142a0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1688,7 +1688,8 @@ static inline int task_on_rq_migrating(struct 
> task_struct *p)
>   */
>  #define WF_SYNC			0x01		/* Waker goes to sleep after wakeup */
>  #define WF_FORK			0x02		/* Child wakeup after fork */
> -#define WF_MIGRATED		0x4		/* Internal use, task got migrated */
> +#define WF_MIGRATED		0x04		/* Internal use, task got migrated */
> +#define WF_ON_RQ		0x08		/* Wakee is on_rq */
>  
Should this be WF_ON_CPU? There is a separate check for p->on_rq in
try_to_wake_up().

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] kthread: Use TASK_IDLE state for newly created kernel threads

2020-05-21 Thread Pavan Kondeti
On Thu, May 21, 2020 at 07:56:39AM +0200, Greg Kroah-Hartman wrote:
> On Thu, May 21, 2020 at 07:05:44AM +0530, Pavan Kondeti wrote:
> > On Wed, May 20, 2020 at 08:18:58PM +0200, Greg Kroah-Hartman wrote:
> > > On Wed, May 20, 2020 at 05:25:09PM +0530, Pavankumar Kondeti wrote:
> > > > When kernel threads are created for later use, they will be in
> > > > TASK_UNINTERRUPTIBLE state until they are woken up. This results
> > > > in increased loadavg and false hung task reports. To fix this,
> > > > use TASK_IDLE state instead of TASK_UNINTERRUPTIBLE when
> > > > a kernel thread schedules out for the first time.
> > > > 
> > > > Signed-off-by: Pavankumar Kondeti 
> > > > ---
> > > >  kernel/kthread.c | 6 +++---
> > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > > 
> > > > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > > > index bfbfa48..b74ed8e 100644
> > > > --- a/kernel/kthread.c
> > > > +++ b/kernel/kthread.c
> > > > @@ -250,7 +250,7 @@ static int kthread(void *_create)
> > > > current->vfork_done = &self->exited;
> > > >  
> > > > /* OK, tell user we're spawned, wait for stop or wakeup */
> > > > -   __set_current_state(TASK_UNINTERRUPTIBLE);
> > > > +   __set_current_state(TASK_IDLE);
> > > > create->result = current;
> > > > /*
> > > >  * Thread is going to call schedule(), do not preempt it,
> > > > @@ -428,7 +428,7 @@ static void __kthread_bind(struct task_struct *p, 
> > > > unsigned int cpu, long state)
> > > >  
> > > >  void kthread_bind_mask(struct task_struct *p, const struct cpumask 
> > > > *mask)
> > > >  {
> > > > -   __kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE);
> > > > +   __kthread_bind_mask(p, mask, TASK_IDLE);
> > > >  }
> > > >  
> > > >  /**
> > > > @@ -442,7 +442,7 @@ void kthread_bind_mask(struct task_struct *p, const 
> > > > struct cpumask *mask)
> > > >   */
> > > >  void kthread_bind(struct task_struct *p, unsigned int cpu)
> > > >  {
> > > > -   __kthread_bind(p, cpu, TASK_UNINTERRUPTIBLE);
> > > > +   __kthread_bind(p, cpu, TASK_IDLE);
> > > >  }
> > > >  EXPORT_SYMBOL(kthread_bind);
> > > 
> > > It's as if people never read mailing lists:
> > >   
> > > https://lore.kernel.org/r/dm6pr11mb3531d3b164357b2dc476102ddf...@dm6pr11mb3531.namprd11.prod.outlook.com
> > > 
> > > Given that this is an identical resend of the previous patch, why are
> > > you doing so, and what has changed since that original rejection?
> > > 
> > I did not know that this had been attempted before. Thanks for pointing to
> > the previous discussion.
> > 
> > We have seen hung task reports from customers, and they are due to a downstream
> > change which creates a bunch of kernel threads for later use.
> 
> Do you have a pointer to that specific change?
> 
Here it is
https://source.codeaurora.org/quic/la/kernel/msm-4.19/commit/drivers/staging/android/ion/ion_system_heap.c?h=LA.UM.8.12.r1-11300-sm8250.0=0734b477f1e77cb9f91f5e5c0d7742d3113f2cd3

> > From Peter's reply, I understood that one must wake up the kthread
> > after creation and put it in INTERRUPTIBLE sleep. I will pass on the
> > message.
> 
> Just go fix that code, it sounds like it's in your tree already :)
> 
Yes, we will fix it.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH] kthread: Use TASK_IDLE state for newly created kernel threads

2020-05-20 Thread Pavan Kondeti
On Wed, May 20, 2020 at 08:18:58PM +0200, Greg Kroah-Hartman wrote:
> On Wed, May 20, 2020 at 05:25:09PM +0530, Pavankumar Kondeti wrote:
> > When kernel threads are created for later use, they will be in
> > TASK_UNINTERRUPTIBLE state until they are woken up. This results
> > in increased loadavg and false hung task reports. To fix this,
> > use TASK_IDLE state instead of TASK_UNINTERRUPTIBLE when
> > a kernel thread schedules out for the first time.
> > 
> > Signed-off-by: Pavankumar Kondeti 
> > ---
> >  kernel/kthread.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index bfbfa48..b74ed8e 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -250,7 +250,7 @@ static int kthread(void *_create)
> > current->vfork_done = &self->exited;
> >  
> > /* OK, tell user we're spawned, wait for stop or wakeup */
> > -   __set_current_state(TASK_UNINTERRUPTIBLE);
> > +   __set_current_state(TASK_IDLE);
> > create->result = current;
> > /*
> >  * Thread is going to call schedule(), do not preempt it,
> > @@ -428,7 +428,7 @@ static void __kthread_bind(struct task_struct *p, 
> > unsigned int cpu, long state)
> >  
> >  void kthread_bind_mask(struct task_struct *p, const struct cpumask *mask)
> >  {
> > -   __kthread_bind_mask(p, mask, TASK_UNINTERRUPTIBLE);
> > +   __kthread_bind_mask(p, mask, TASK_IDLE);
> >  }
> >  
> >  /**
> > @@ -442,7 +442,7 @@ void kthread_bind_mask(struct task_struct *p, const 
> > struct cpumask *mask)
> >   */
> >  void kthread_bind(struct task_struct *p, unsigned int cpu)
> >  {
> > -   __kthread_bind(p, cpu, TASK_UNINTERRUPTIBLE);
> > +   __kthread_bind(p, cpu, TASK_IDLE);
> >  }
> >  EXPORT_SYMBOL(kthread_bind);
> 
> It's as if people never read mailing lists:
>   
> https://lore.kernel.org/r/dm6pr11mb3531d3b164357b2dc476102ddf...@dm6pr11mb3531.namprd11.prod.outlook.com
> 
> Given that this is an identical resend of the previous patch, why are
> you doing so, and what has changed since that original rejection?
> 
I did not know that this had been attempted before. Thanks for pointing to the
previous discussion.

We have seen hung task reports from customers, and they are due to a downstream
change which creates a bunch of kernel threads for later use. From Peter's
reply, I understood that one must wake up the kthread right after creation and
put it into INTERRUPTIBLE sleep. I will pass on the message.
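
For reference, the pattern I will suggest internally looks roughly like the
sketch below (the names pool_thread_fn/ion_pool_worker are made up for
illustration; this is not the actual downstream code): create the thread with
kthread_run() so it is woken immediately, then have it park itself in an
interruptible sleep until there is real work, so it neither inflates loadavg
nor trips the hung-task detector.

#include <linux/err.h>
#include <linux/init.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(pool_wq);
static bool pool_has_work;	/* real code would protect this with a lock */
static struct task_struct *pool_task;

static int pool_thread_fn(void *data)
{
	while (!kthread_should_stop()) {
		/* Interruptible sleep: not counted in loadavg, no hung-task splat */
		wait_event_interruptible(pool_wq,
					 pool_has_work || kthread_should_stop());
		if (pool_has_work) {
			pool_has_work = false;
			/* ... do the deferred pool refill work here ... */
		}
	}
	return 0;
}

static int __init pool_worker_init(void)
{
	/* kthread_run() creates *and* wakes the thread right away */
	pool_task = kthread_run(pool_thread_fn, NULL, "ion_pool_worker");
	return PTR_ERR_OR_ZERO(pool_task);
}
late_initcall(pool_worker_init);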

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH] sched/fair: enqueue_task_fair optimization

2020-05-12 Thread Pavan Kondeti
On Mon, May 11, 2020 at 09:23:01PM +0200, Vincent Guittot wrote:
> enqueue_task_fair() jumps to enqueue_throttle when cfs_rq_of(se) is
> throttled, which means that se can't be NULL and we can skip the test.
> 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4b73518aa25c..910bbbe50365 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5512,7 +5512,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
> int flags)
> list_add_leaf_cfs_rq(cfs_rq);
>   }
>  
> -enqueue_throttle:
>   if (!se) {
>   add_nr_running(rq, 1);
>   /*
> @@ -5534,6 +5533,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
> int flags)
>  
>   }
>  
> +enqueue_throttle:
>   if (cfs_bandwidth_used()) {
>   /*
>* When bandwidth control is enabled; the cfs_rq_throttled()

Right, se can't be NULL when we break upon throttling. Looks good to me.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-05-11 Thread Pavan Kondeti
On Mon, May 11, 2020 at 04:40:52PM +0100, Qais Yousef wrote:
> RT tasks by default run at the highest capacity/performance level. When
> uclamp is selected this default behavior is retained by enforcing the
> requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
> uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
> value.
> 
> This is also referred to as 'the default boost value of RT tasks'.
> 
> See commit 1a00d71c ("sched/uclamp: Set default clamps for RT tasks").
> 
> On battery powered devices, it is desired to control this default
> (currently hardcoded) behavior at runtime to reduce energy consumed by
> RT tasks.
> 
> For example, a mobile device manufacturer where big.LITTLE architecture
> is dominant, the performance of the little cores varies across SoCs, and
> on high end ones the big cores could be too power hungry.
> 
> Given the diversity of SoCs, the new knob allows manufactures to tune
> the best performance/power for RT tasks for the particular hardware they
> run on.
> 
> They could opt to further tune the value when the user selects
> a different power saving mode or when the device is actively charging.
> 
> The runtime aspect of it further helps in creating a single kernel image
> that can be run on multiple devices that require different tuning.
> 
> Keep in mind that a lot of RT tasks in the system are created by the
> kernel. On Android for instance I can see over 50 RT tasks, only
> a handful of which created by the Android framework.
> 
> To control the default behavior globally by system admins and device
> integrators, introduce the new sysctl_sched_uclamp_util_min_rt_default
> to change the default boost value of the RT tasks.
> 
> I anticipate this to be mostly in the form of modifying the init script
> of a particular device.
> 
> Whenever the new default changes, it'd be applied lazily on the next
> opportunity the scheduler needs to calculate the effective uclamp.min
> value for the task, assuming that it still uses the system default value
> and not a user applied one.
> 
> Tested on Juno-r2 in combination with the RT capacity awareness [1].
> By default an RT task will go to the highest capacity CPU and run at the
> maximum frequency, which is particularly energy inefficient on high end
> mobile devices because the biggest core[s] are 'huge' and power hungry.
> 
> With this patch the RT task can be controlled to run anywhere by
> default, and doesn't cause the frequency to be maximum all the time.
> Yet any task that really needs to be boosted can easily escape this
> default behavior by modifying its requested uclamp.min value
> (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.
> 
> [1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
> 

I have tested this patch on SDM845 running V5.7-rc4 and it works as expected.

Default, i.e. /proc/sys/kernel/sched_util_clamp_min_rt_default = 1024:

The RT task runs on the BIG cluster every time at max frequency. Both the
effective and requested uclamp.min are set to 1024.

With /proc/sys/kernel/sched_util_clamp_min_rt_default = 128:

The RT task runs on the Little cluster (max capacity is 404) and frequency
scaling happens as per the change in utilization. Both the effective and
requested uclamp.min are set to 128.

Feel free to add

Tested-by: Pavankumar Kondeti 

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH] sched/debug: Fix requested task uclamp values shown in procfs

2020-05-10 Thread Pavan Kondeti
On Sun, May 10, 2020 at 05:16:28PM +0100, Valentin Schneider wrote:
> 
> On 10/05/20 13:56, Pavankumar Kondeti wrote:
> > The intention of commit 96e74ebf8d59 ("sched/debug: Add task uclamp
> > values to SCHED_DEBUG procfs") was to print requested and effective
> > task uclamp values. The requested values printed are read from p->uclamp,
> > which holds the last effective values. Fix this by printing the values
> > from p->uclamp_req.
> >
> > Fixes: 96e74ebf8d59 ("sched/debug: Add task uclamp values to SCHED_DEBUG 
> > procfs")
> > Signed-off-by: Pavankumar Kondeti 
> 
> Argh, Qais pointed this out to me ~ a week ago, and I left this in my todo
> stack. I goofed up, sorry!
> 
> As Pavan points out, p->uclamp[foo] is just a cache of uclamp_eff_value(p,
> foo) from the last time p was enqueued and runnable - what we are
> interested in is indeed comparing this with the *requested* value.
> 
> I wanted to send an example along with a patch, I guess that's the kick I
> needed!
> 
> 
> My setup is a busy loop, its per-task clamps are set to (256, 768) via
> sched_setattr(), and it's shoved in a cpu cgroup with uclamp settings of
> (50%, 50%)
> 
> On the current master (e99332e7b4cd ("gcc-10: mark more functions __init to
> avoid section mismatch warnings")), this gives me:
> 
>   $ uclamp-get $PID # via sched_getattr()
>   uclamp.min=256 uclamp.max=768
> 
>   $ cat /proc/$PID/sched | grep uclamp
>   uclamp.min   :  256
>   uclamp.max   :  512
>   effective uclamp.min :  256
>   effective uclamp.max :  512
> 
> With Pavan's patch, I get:
> 
>   $ uclamp-get $PID # via sched_getattr()
>   uclamp.min=256 uclamp.max=768
> 
>   $ cat /proc/$PID/sched | grep uclamp
>   uclamp.min   :  256
>   uclamp.max   :  768
>   effective uclamp.min :  256
>   effective uclamp.max :  512
> 
> 
> Minor print nit below, otherwise:
> Tested-and-reviewed-by: Valentin Schneider 
> 
> Peter/Ingo, any chance this can go to sched/urgent? I know it's a debug
> interface, but I'd rather have it land in a shape that makes sense. Again,
> apologies for the goof.
> 
> > ---
> >  kernel/sched/debug.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > index a562df5..239970b 100644
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -948,8 +948,8 @@ void proc_sched_show_task(struct task_struct *p, struct 
> > pid_namespace *ns,
> >   P(se.avg.util_est.enqueued);
> >  #endif
> >  #ifdef CONFIG_UCLAMP_TASK
> > -   __PS("uclamp.min", p->uclamp[UCLAMP_MIN].value);
> > -   __PS("uclamp.max", p->uclamp[UCLAMP_MAX].value);
> > +   __PS("uclamp.min", p->uclamp_req[UCLAMP_MIN].value);
> > +   __PS("uclamp.max", p->uclamp_req[UCLAMP_MAX].value);
> 
> While we're at it, I'd prepend this with "requested".
> 
> >   __PS("effective uclamp.min", uclamp_eff_value(p, UCLAMP_MIN));
> >   __PS("effective uclamp.max", uclamp_eff_value(p, UCLAMP_MAX));
> >  #endif

Thanks Valentin for taking a look. I have added the "requested" prefix and
sent the patch.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH 13/14] sched: cpufreq: Use IS_ENABLED() for schedutil

2020-05-08 Thread Pavan Kondeti
On Fri, May 08, 2020 at 02:21:29PM +0100, Quentin Perret wrote:
> On Friday 08 May 2020 at 11:00:53 (+0530), Pavan Kondeti wrote:
> > > -#if defined(CONFIG_ENERGY_MODEL) && 
> > > defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> > > +#if defined(CONFIG_ENERGY_MODEL) && 
> > > IS_ENABLED(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> > >   /* Build perf. domains: */
> > >   for (i = 0; i < ndoms_new; i++) {
> > >   for (j = 0; j < n && !sched_energy_update; j++) {
> > 
> > Now that scheduler does not have any references to schedutil_gov and cpufreq
> > has want_eas flag, do we need this CONFIG_CPU_FREQ_GOV_SCHEDUTIL checks 
> > here?
> 
> Right, they're not absolutely required, but given that sugov is the only
> one to have 'want_eas' set I guess there is no need to compile that in
> without it, no?
> 
Right.

Since you removed all compile-time dependencies on schedutil, I thought the
#ifdef check around schedutil could be removed too.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points

2020-05-08 Thread Pavan Kondeti
On Fri, May 08, 2020 at 04:45:16PM +0530, Parth Shah wrote:
> Hi Pavan,
> 
> Thanks for going through this patch-set.
> 
> On 5/8/20 2:03 PM, Pavan Kondeti wrote:
> > Hi Parth,
> > 
> > On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
> >> Monitor tasks at:
> >> 1. wake_up_new_task() - forked tasks
> >>
> >> 2. set_task_cpu() - task migrations, Load balancer
> >>
> >> 3. __sched_setscheduler() - set/unset latency_nice value
> >> Increment the nr_lat_sensitive count on the CPU with task marked with
> >> latency_nice == -20.
> >> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
> >> with >-20 latency_nice task.
> >>
> >> 4. finish_task_switch() - dying task
> >>
> > 
> > 
> >> Signed-off-by: Parth Shah 
> >> ---
> >>  kernel/sched/core.c  | 30 --
> >>  kernel/sched/sched.h |  5 +
> >>  2 files changed, 33 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 2d8b76f41d61..ad396c36eba6 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned 
> >> int new_cpu)
> >>trace_sched_migrate_task(p, new_cpu);
> >>  
> >>if (task_cpu(p) != new_cpu) {
> >> +  if (task_is_lat_sensitive(p)) {
> >> +  per_cpu(nr_lat_sensitive, task_cpu(p))--;
> >> +  per_cpu(nr_lat_sensitive, new_cpu)++;
> >> +  }
> >> +
> > 
> > Since we can come here without rq locks, there is a possibility
> > of a race and incorrect updates can happen. Since the counters
> > are used to prevent C-states, we don't want that to happen.
> 
> I did tried using task_lock(p) wherever we do change refcount and when
> latency_nice value is set. There I was using nr_lat_sensitive with atomic_t
> type.
> 
> After lots of thinking to optimize it and thinking that we anyways hold rq
> lock, I thought of not using any lock here and see if scheduler community
> has well known solution for this :-)
> 
> But in brief, using atomic_t nr_lat_sensitive and task_lock(p) when changin
> refcount should solve problem, right?
> 
> If you or anyone have solution for this kind of pattern, then that surely
> will be helpful.
> 
I am not sure task_lock() can help here, because we are operating on the
counter on a per-CPU basis. Maybe a cmpxchg-based loop works here to make
sure that the increment/decrement operation happens atomically.
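
Something along these lines is what I had in mind, purely as a sketch and
assuming nr_lat_sensitive stays a plain per-CPU int as in this series:

/*
 * Atomically add 'delta' to a (possibly remote) CPU's nr_lat_sensitive
 * counter with a cmpxchg() retry loop, so concurrent updates cannot be
 * lost even when the caller does not hold that CPU's rq lock.
 */
static void nr_lat_sensitive_add(int cpu, int delta)
{
	int *ptr = &per_cpu(nr_lat_sensitive, cpu);
	int old, new;

	do {
		old = READ_ONCE(*ptr);
		new = old + delta;
	} while (cmpxchg(ptr, old, new) != old);
}

Switching the counter to atomic_t and using atomic_inc()/atomic_dec() would
achieve the same with less code; the cmpxchg() loop just avoids changing the
variable's type.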

> > 
> >>if (p->sched_class->migrate_task_rq)
> >>p->sched_class->migrate_task_rq(p, new_cpu);
> >>p->se.nr_migrations++;

[...]

> >> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct 
> >> task_struct *p,
> >>p->normal_prio = normal_prio(p);
> >>set_load_weight(p, true);
> >>  
> >> -  if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
> >> +  if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
> >> +  if (p->state != TASK_DEAD &&
> >> +  attr->sched_latency_nice != p->latency_nice) {
> >> +  if (attr->sched_latency_nice == MIN_LATENCY_NICE)
> >> +  per_cpu(nr_lat_sensitive, task_cpu(p))++;
> >> +  else if (task_is_lat_sensitive(p))
> >> +  per_cpu(nr_lat_sensitive, task_cpu(p))--;
> >> +  }
> >> +
> >>p->latency_nice = attr->sched_latency_nice;
> >> +  }
> >>  }
> > 
> > There is a potential race here due to which we can mess up the refcount.
> > 
> > - A latency sensitive task is marked TASK_DEAD
> > 
> > - sched_setattr() called on the task to clear the latency nice. Since
> > we check the task state here, we skip the decrement.
> > - The task is finally context switched out and we skip the decrement again
> > since it is not a latency sensitive task.
> 
> If the task is already marked TASK_DEAD then we should have already decremented
> its refcount in finish_task_switch().
> Am I missing something?

There is a window (the context switch and dropping of the rq lock) between
marking a task DEAD (in do_task_dead()) and dropping the ref counter
(in finish_task_switch()) during which we can run into this path and skip
the decrement because the task is marked as DEAD.
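
Roughly, the interleaving I described is (illustration only):

	CPU0 (dying task)               CPU1 (sched_setattr() clearing latency_nice)
	do_task_dead():
	  p->state = TASK_DEAD
	  __schedule()                  __setscheduler_params():
	    <rq lock dropped>             p->state == TASK_DEAD -> decrement skipped
	                                  p->latency_nice updated
	finish_task_switch():
	  task_is_lat_sensitive(prev) is
	  now false -> decrement skipped
	  => nr_lat_sensitive leaks a count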

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [RFC 3/4] sched/idle: Disable idle call on least latency requirements

2020-05-08 Thread Pavan Kondeti
On Fri, May 08, 2020 at 04:49:04PM +0530, Parth Shah wrote:
> Hi Pavan,
> 
> On 5/8/20 2:06 PM, Pavan Kondeti wrote:
> > On Thu, May 07, 2020 at 07:07:22PM +0530, Parth Shah wrote:
> >> Restrict the call to deeper idle states when the given CPU has been set for
> >> the least latency requirements
> >>
> >> Signed-off-by: Parth Shah 
> >> ---
> >>  kernel/sched/idle.c | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> >> index b743bf38f08f..85d72a6e2521 100644
> >> --- a/kernel/sched/idle.c
> >> +++ b/kernel/sched/idle.c
> >> @@ -262,7 +262,8 @@ static void do_idle(void)
> >> * broadcast device expired for us, we don't want to go deep
> >> * idle as we know that the IPI is going to arrive right away.
> >> */
> >> -  if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> >> +  if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
> >> +  per_cpu(nr_lat_sensitive, cpu)) {
> >>tick_nohz_idle_restart_tick();
> >>cpu_idle_poll();
> >>} else {
> >> -- 
> >> 2.17.2
> >>
> > 
> > Since nr_lat_sensitive updates can happen remotely (when a latency sensitive
> > task becomes a non-latency-sensitive task), we may need to add this condition
> > in cpu_idle_poll() as well.
> > 
> 
> Right, but if the CPU is running the idle task then it will again come back to
> do_idle() and read the refcount. It's a power-saving penalty for one
> do_idle() loop, but it is difficult to put up checks for every change in
> latency_nice value.
> 

Yes, when the CPU exits to serve an interrupt, we note the change in the
latency task count and enter a C-state.

I asked this because all checks that are present here are also present in
cpu_idle_poll().

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks

2020-05-08 Thread Pavan Kondeti
On Fri, May 08, 2020 at 05:00:44PM +0530, Parth Shah wrote:
> 
> 
> On 5/8/20 2:10 PM, Pavan Kondeti wrote:
> > On Thu, May 07, 2020 at 07:07:20PM +0530, Parth Shah wrote:
> >> The "nr_lat_sensitive" per_cpu variable provides hints on the possible
> >> number of latency-sensitive tasks occupying the CPU. This hints further
> >> helps in inhibiting the CPUIDLE governor from calling deeper IDLE states
> >> (next patches includes this).
> >>
> > 
> > Can you please explain the intended use case here? Once a latency sensitive
> > task is created, it prevents C-states on a CPU whether or not the task runs
> > again in the near future.
> > 
> > I assume either these latency sensitive tasks won't be around for a long time
> > or applications set/reset the latency_nice value dynamically.
> > 
> 
> The intended use-case is to get rid of IDLE-state exit_latency for
> wakeup-sleep-wakeup pattern workloads. These types of tasks (like GPU
> workloads and a few DB benchmarks) make the CPU go IDLE due to their low
> runtime on the rq, resulting in higher wakeup latency due to IDLE-state
> exit_latency.
> 
> And this kind of workload may last for a long time as well.
> 
> In the current scenario, sysadmins disable all IDLE states or use PM_QoS so
> that the workload does not pay a latency penalty. This model was good when
> core counts were low. But now higher core counts and Turbo frequencies make
> it necessary to save power in order to get higher performance, and hence
> this patch-set tries to do a PM_QoS-like thing but at per-task granularity.
> 
Thanks for the details. Instead of disabling C-states for all CPUs, we disable
them only for the CPUs that host latency sensitive tasks. Since this is hooked
into the scheduler, task migrations are accounted for. Got it.

Thanks,
Pavan

> If the idea seems good to go, then this can potentially be extended to do IDLE
> gating up to a certain level, where the latency_nice value hints at which IDLE
> states can't be chosen, just like PM_QoS has cpu_dma_latency constraints.
> 
> 
> Thanks,
> Parth
> 
> 
> > Thanks,
> > Pavan
> > 

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks

2020-05-08 Thread Pavan Kondeti
On Thu, May 07, 2020 at 07:07:20PM +0530, Parth Shah wrote:
> The "nr_lat_sensitive" per_cpu variable provides hints on the possible
> number of latency-sensitive tasks occupying the CPU. This hints further
> helps in inhibiting the CPUIDLE governor from calling deeper IDLE states
> (next patches includes this).
> 

Can you please explain the intended use case here? Once a latency sensitive
task is created, it prevents C-states on a CPU whether or not the task runs
again in the near future.

I assume either these latency sensitive tasks won't be around for a long time
or applications set/reset the latency_nice value dynamically.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [RFC 3/4] sched/idle: Disable idle call on least latency requirements

2020-05-08 Thread Pavan Kondeti
On Thu, May 07, 2020 at 07:07:22PM +0530, Parth Shah wrote:
> Restrict the call to deeper idle states when the given CPU has been set for
> the least latency requirements
> 
> Signed-off-by: Parth Shah 
> ---
>  kernel/sched/idle.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index b743bf38f08f..85d72a6e2521 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -262,7 +262,8 @@ static void do_idle(void)
>* broadcast device expired for us, we don't want to go deep
>* idle as we know that the IPI is going to arrive right away.
>*/
> - if (cpu_idle_force_poll || tick_check_broadcast_expired()) {
> + if (cpu_idle_force_poll || tick_check_broadcast_expired() ||
> + per_cpu(nr_lat_sensitive, cpu)) {
>   tick_nohz_idle_restart_tick();
>   cpu_idle_poll();
>   } else {
> -- 
> 2.17.2
> 

Since nr_lat_sensitive updates can happen remotely (when a latency sensitive
task becomes a non-latency-sensitive task), we may need to add this condition
in cpu_idle_poll() as well.
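
For example, the polling loop in cpu_idle_poll() could take the counter into
account too (sketch only, based on the existing loop condition in
kernel/sched/idle.c):

	while (!tif_need_resched() &&
	       (cpu_idle_force_poll || tick_check_broadcast_expired() ||
		per_cpu(nr_lat_sensitive, smp_processor_id())))
		cpu_relax();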

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points

2020-05-08 Thread Pavan Kondeti
Hi Parth,

On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
> Monitor tasks at:
> 1. wake_up_new_task() - forked tasks
> 
> 2. set_task_cpu() - task migrations, Load balancer
> 
> 3. __sched_setscheduler() - set/unset latency_nice value
> Increment the nr_lat_sensitive count on the CPU with task marked with
> latency_nice == -20.
> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
> with >-20 latency_nice task.
> 
> 4. finish_task_switch() - dying task
> 


> Signed-off-by: Parth Shah 
> ---
>  kernel/sched/core.c  | 30 --
>  kernel/sched/sched.h |  5 +
>  2 files changed, 33 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2d8b76f41d61..ad396c36eba6 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int 
> new_cpu)
>   trace_sched_migrate_task(p, new_cpu);
>  
>   if (task_cpu(p) != new_cpu) {
> + if (task_is_lat_sensitive(p)) {
> + per_cpu(nr_lat_sensitive, task_cpu(p))--;
> + per_cpu(nr_lat_sensitive, new_cpu)++;
> + }
> +

Since we can come here without rq locks, there is a possibility
of a race and incorrect updates can happen. Since the counters
are used to prevent C-states, we don't want that to happen.

>   if (p->sched_class->migrate_task_rq)
>   p->sched_class->migrate_task_rq(p, new_cpu);
>   p->se.nr_migrations++;
> @@ -2947,6 +2952,7 @@ void wake_up_new_task(struct task_struct *p)
>  {
>   struct rq_flags rf;
>   struct rq *rq;
> + int target_cpu = 0;
>  
>   raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>   p->state = TASK_RUNNING;
> @@ -2960,9 +2966,17 @@ void wake_up_new_task(struct task_struct *p)
>* as we're not fully set-up yet.
>*/
>   p->recent_used_cpu = task_cpu(p);
> - __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
> + target_cpu = select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0);
> + __set_task_cpu(p, target_cpu);
> +

The target_cpu variable can be eliminated by using task_cpu(p) directly
in the below update.
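
i.e. something like (sketch):

#ifdef CONFIG_SMP
	if (task_is_lat_sensitive(p))
		per_cpu(nr_lat_sensitive, task_cpu(p))++;
#endif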

>  #endif
>   rq = __task_rq_lock(p, &rf);
> +
> +#ifdef CONFIG_SMP
> + if (task_is_lat_sensitive(p))
> + per_cpu(nr_lat_sensitive, target_cpu)++;
> +#endif
> +

Is the SMP check intentional? In some parts of this patch, updates to
nr_lat_sensitive are done without SMP checks. For example,
finish_task_switch() below.

>   update_rq_clock(rq);
>   post_init_entity_util_avg(p);
>  
> @@ -3248,6 +3262,9 @@ static struct rq *finish_task_switch(struct task_struct 
> *prev)
>   if (prev->sched_class->task_dead)
>   prev->sched_class->task_dead(prev);
>  
> + if (task_is_lat_sensitive(prev))
> + per_cpu(nr_lat_sensitive, prev->cpu)--;
> +
>   /*
>* Remove function-return probe instances associated with this
>* task and put them back on the free list.
> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct 
> *p,
>   p->normal_prio = normal_prio(p);
>   set_load_weight(p, true);
>  
> - if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
> + if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
> + if (p->state != TASK_DEAD &&
> + attr->sched_latency_nice != p->latency_nice) {
> + if (attr->sched_latency_nice == MIN_LATENCY_NICE)
> + per_cpu(nr_lat_sensitive, task_cpu(p))++;
> + else if (task_is_lat_sensitive(p))
> + per_cpu(nr_lat_sensitive, task_cpu(p))--;
> + }
> +
>   p->latency_nice = attr->sched_latency_nice;
> + }
>  }

There is a potential race here due to which we can mess up the refcount.

- A latency sensitive task is marked TASK_DEAD

- sched_setattr() called on the task to clear the latency nice. Since
we check the task state here, we skip the decrement.
- The task is finally context switched out and we skip the decrement again
since it is not a latency sensitive task.

>  
>  /* Actually do priority change: must hold pi & rq lock. */
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5c41020c530e..56f885e37451 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -211,6 +211,11 @@ static inline int task_has_dl_policy(struct task_struct 
> *p)
>   return dl_policy(p->policy);
>  }
>  
> +static inline int task_is_lat_sensitive(struct task_struct *p)
> +{
> + return p->latency_nice == MIN_LATENCY_NICE;
> +}
> +
>  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>  
>  /*
> -- 
> 2.17.2
> 

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.

Re: [PATCH 04/14] sched: cpufreq: Move sched_cpufreq_governor_change()

2020-05-07 Thread Pavan Kondeti
On Thu, May 07, 2020 at 07:10:02PM +0100, Quentin Perret wrote:
> CPUFreq calls into sched_cpufreq_governor_change() when switching
> governors, which triggers a sched domain rebuild when entering or
> exiting schedutil.
> 
> Move the function to sched/cpufreq.c to prepare the ground for the
> modularization of schedutil.
> 
> Signed-off-by: Quentin Perret 
> ---
>  kernel/sched/cpufreq.c   | 33 
>  kernel/sched/cpufreq_schedutil.c | 33 
>  2 files changed, 33 insertions(+), 33 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
> index 7c2fe50fd76d..82f2dda61a55 100644
> --- a/kernel/sched/cpufreq.c
> +++ b/kernel/sched/cpufreq.c
> @@ -75,3 +75,36 @@ bool cpufreq_this_cpu_can_update(struct cpufreq_policy 
> *policy)
>   (policy->dvfs_possible_from_any_cpu &&
>
> rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)));
>  }
> +
> +#ifdef CONFIG_ENERGY_MODEL
> +extern bool sched_energy_update;
> +extern struct mutex sched_energy_mutex;
> +
> +static void rebuild_sd_workfn(struct work_struct *work)
> +{
> + mutex_lock(&sched_energy_mutex);
> + sched_energy_update = true;
> + rebuild_sched_domains();
> + sched_energy_update = false;
> + mutex_unlock(&sched_energy_mutex);
> +}
> +static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
> +
> +/*
> + * EAS shouldn't be attempted without sugov, so rebuild the sched_domains
> + * on governor changes to make sure the scheduler knows about it.
> + */

In the previous patch, you removed the reference to schedutil and replaced it with
"an EAS-compatible CPUfreq governor (schedutil)". Maybe you could do the
same here.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH 13/14] sched: cpufreq: Use IS_ENABLED() for schedutil

2020-05-07 Thread Pavan Kondeti
Hi Quentin

On Thu, May 07, 2020 at 07:10:11PM +0100, Quentin Perret wrote:
> The IS_ENABLED() macro evaluates to true when an option is set to =y or
> =m. As such, it is a good fit for tristate options.
> 
> In preparation for modularizing schedutil, change all the related ifdefs
> to use IS_ENABLED().
> 
> Signed-off-by: Quentin Perret 
> ---
>  include/linux/cpufreq.h  | 2 +-
>  include/linux/sched/sysctl.h | 2 +-
>  kernel/sched/sched.h | 4 ++--
>  kernel/sched/topology.c  | 4 ++--
>  kernel/sysctl.c  | 2 +-
>  5 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 267cc3b624da..c1176b8a0f61 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -983,7 +983,7 @@ static inline bool policy_has_boost_freq(struct 
> cpufreq_policy *policy)
>  }
>  #endif
>  
> -#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> +#if defined(CONFIG_ENERGY_MODEL) && IS_ENABLED(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
>  void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
>   struct cpufreq_governor *old_gov);
>  #else
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index d4f6215ee03f..704d971f204f 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -94,7 +94,7 @@ extern int sysctl_schedstats(struct ctl_table *table, int 
> write,
>void __user *buffer, size_t *lenp,
>loff_t *ppos);
>  
> -#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> +#if defined(CONFIG_ENERGY_MODEL) && IS_ENABLED(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
>  extern unsigned int sysctl_sched_energy_aware;
>  extern int sched_energy_aware_handler(struct ctl_table *table, int write,
>void __user *buffer, size_t *lenp,
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 60592cde80e8..087508723e58 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -217,7 +217,7 @@ static inline void update_avg(u64 *avg, u64 sample)
>  
>  static inline bool dl_entity_is_special(struct sched_dl_entity *dl_se)
>  {
> -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> +#if IS_ENABLED(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
>   return unlikely(dl_se->flags & SCHED_FLAG_SUGOV);
>  #else
>   return false;
> @@ -2459,7 +2459,7 @@ unsigned long scale_irq_capacity(unsigned long util, 
> unsigned long irq, unsigned
>  }
>  #endif
>  
> -#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> +#if defined(CONFIG_ENERGY_MODEL) && IS_ENABLED(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
>  
>  #define perf_domain_span(pd) (to_cpumask(((pd)->em_pd->cpus)))
>  
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index b905f2e8d9b2..5f49d25730bd 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -201,7 +201,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct 
> sched_domain *parent)
>   return 1;
>  }
>  
> -#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> +#if defined(CONFIG_ENERGY_MODEL) && IS_ENABLED(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
>  DEFINE_STATIC_KEY_FALSE(sched_energy_present);
>  unsigned int sysctl_sched_energy_aware = 1;
>  DEFINE_MUTEX(sched_energy_mutex);
> @@ -2287,7 +2287,7 @@ void partition_sched_domains_locked(int ndoms_new, 
> cpumask_var_t doms_new[],
>   ;
>   }
>  
> -#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> +#if defined(CONFIG_ENERGY_MODEL) && IS_ENABLED(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
>   /* Build perf. domains: */
>   for (i = 0; i < ndoms_new; i++) {
>   for (j = 0; j < n && !sched_energy_update; j++) {

Now that the scheduler does not have any references to schedutil_gov and cpufreq
has the want_eas flag, do we need these CONFIG_CPU_FREQ_GOV_SCHEDUTIL checks here?

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v2 5/6] sched/deadline: Make DL capacity-aware

2020-05-03 Thread Pavan Kondeti
On Fri, May 01, 2020 at 06:12:07PM +0200, Dietmar Eggemann wrote:
> On 30/04/2020 15:10, Pavan Kondeti wrote:
> > On Mon, Apr 27, 2020 at 10:37:08AM +0200, Dietmar Eggemann wrote:
> >> From: Luca Abeni 
> 
> [...]
> 
> >> @@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, 
> >> int sd_flag, int flags)
> >> * other hand, if it has a shorter deadline, we
> >> * try to make it stay here, it might be important.
> >> */
> >> -  if (unlikely(dl_task(curr)) &&
> >> -  (curr->nr_cpus_allowed < 2 ||
> >> -   !dl_entity_preempt(&p->dl, &curr->dl)) &&
> >> -  (p->nr_cpus_allowed > 1)) {
> >> +  select_rq = unlikely(dl_task(curr)) &&
> >> +  (curr->nr_cpus_allowed < 2 ||
> >> +   !dl_entity_preempt(&p->dl, &curr->dl)) &&
> >> +  p->nr_cpus_allowed > 1;
> >> +
> >> +  /*
> >> +   * Take the capacity of the CPU into account to
> >> +   * ensure it fits the requirement of the task.
> >> +   */
> >> +  if (static_branch_unlikely(&sched_asym_cpucapacity))
> >> +  select_rq |= !dl_task_fits_capacity(p, cpu);
> >> +
> >> +  if (select_rq) {
> >>int target = find_later_rq(p);
> > 
> > I see that find_later_rq() checks if the previous CPU is part of
> > later_mask and returns it immediately. So we don't migrate the
> > task in the case where the previous CPU can't fit the task and
> > there are no idle CPUs on which the task can fit. LGTM.
> 
> Hope I understand you here. I don't think that [patch 6/6] provides this
> already.
> 
> In case 'later_mask' has no fitting CPUs, 'max_cpu' is set in the
> otherwise empty 'later_mask'. But 'max_cpu' is not necessarily task_cpu(p).
> 
> Example on Juno [L b b L L L] with thread0-0 (big task)
> 
>  cpudl_find [thread0-0 2117] orig later_mask=0,3-4 later_mask=0
>   find_later_rq [thread0-0 2117] task_cpu=2 later_mask=0
> 
> A tweak could be added to favor task_cpu(p) in case it is amongst the CPUs
> with the maximum capacity in cpudl_find() for the !fit case.
> 

You are right. max_cpu can be other than task_cpu(p), in which case we
migrate the task even though it won't fit on the new CPU. While introducing
capacity awareness in RT, Qais made the below change to avoid the
migration. We can do something similar here also.

commit b28bc1e002c2 (sched/rt: Re-instate old behavior in select_task_rq_rt())
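
One way to do that in cpudl_find() could be to prefer task_cpu(p) on capacity
ties while tracking the !fit fallback CPU (sketch only, not the actual commit;
max_cap/max_cpu declarations elided):

	/* Ensure the capacity of the CPUs fits the task. */
	for_each_cpu(cpu, later_mask) {
		if (!dl_task_fits_capacity(p, cpu)) {
			unsigned long cap = arch_scale_cpu_capacity(cpu);

			cpumask_clear_cpu(cpu, later_mask);

			/* Favor the task's current CPU on capacity ties. */
			if (cap > max_cap ||
			    (cap == max_cap && cpu == task_cpu(p))) {
				max_cap = cap;
				max_cpu = cpu;
			}
		}
	}

	if (cpumask_empty(later_mask))
		cpumask_set_cpu(max_cpu, later_mask);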

> [...]
> 
> >> +/*
> >> + * Verify the fitness of task @p to run on @cpu taking into account the
> >> + * CPU original capacity and the runtime/deadline ratio of the task.
> >> + *
> >> + * The function will return true if the CPU original capacity of the
> >> + * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
> >> + * task and false otherwise.
> >> + */
> >> +static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
> >> +{
> >> +  unsigned long cap = arch_scale_cpu_capacity(cpu);
> >> +
> >> +  return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
> >> +}
> >> +
> > 
> > This is the same as
> > 
> > return p->dl.dl_bw >> (BW_SHIFT - SCHED_CAPACITY_SHIFT) <= cap
> > 
> > Correct?  If yes, would it be better to use this?
> 
> We could use sched_dl_entity::dl_density (dl_runtime / dl_deadline) but
> then I would have to export BW_SHIFT.

Yeah, I meant dl_density, not dl_bw. Thanks for correcting me. I see that
BW_SHIFT is defined in kernel/sched/sched.h.
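
For reference, the check written in terms of dl_density would look roughly
like this (sketch only; dl_density is dl_runtime << BW_SHIFT / dl_deadline,
so shifting by BW_SHIFT - SCHED_CAPACITY_SHIFT brings it to the capacity
scale):

	static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
	{
		unsigned long cap = arch_scale_cpu_capacity(cpu);

		return (p->dl.dl_density >> (BW_SHIFT - SCHED_CAPACITY_SHIFT)) <= cap;
	}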

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v2 5/6] sched/deadline: Make DL capacity-aware

2020-04-30 Thread Pavan Kondeti
On Mon, Apr 27, 2020 at 10:37:08AM +0200, Dietmar Eggemann wrote:
> From: Luca Abeni 
> 
> The current SCHED_DEADLINE (DL) scheduler uses a global EDF scheduling
> algorithm w/o considering CPU capacity or task utilization.
> This works well on homogeneous systems where DL tasks are guaranteed
> to have a bounded tardiness but presents issues on heterogeneous
> systems.
> 
> A DL task can migrate to a CPU which does not have enough CPU capacity
> to correctly serve the task (e.g. a task w/ 70ms runtime and 100ms
> period on a CPU w/ 512 capacity).
> 
> Add the DL fitness function dl_task_fits_capacity() for DL admission
> control on heterogeneous systems. A task fits onto a CPU if:
> 
> CPU original capacity / 1024 >= task runtime / task deadline
> 
> Use this function on heterogeneous systems to try to find a CPU which
> meets this criterion during task wakeup, push and offline migration.
> 
> On homogeneous systems the original behavior of the DL admission
> control should be retained.
> 
> Signed-off-by: Luca Abeni 
> Signed-off-by: Dietmar Eggemann 
> ---
>  kernel/sched/cpudeadline.c | 14 +-
>  kernel/sched/deadline.c| 18 ++
>  kernel/sched/sched.h   | 15 +++
>  3 files changed, 42 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/cpudeadline.c b/kernel/sched/cpudeadline.c
> index 5cc4012572ec..8630f2a40a3f 100644
> --- a/kernel/sched/cpudeadline.c
> +++ b/kernel/sched/cpudeadline.c
> @@ -121,7 +121,19 @@ int cpudl_find(struct cpudl *cp, struct task_struct *p,
>  
>   if (later_mask &&
>   cpumask_and(later_mask, cp->free_cpus, p->cpus_ptr)) {
> - return 1;
> + int cpu;
> +
> + if (!static_branch_unlikely(&sched_asym_cpucapacity))
> + return 1;
> +
> + /* Ensure the capacity of the CPUs fits the task. */
> + for_each_cpu(cpu, later_mask) {
> + if (!dl_task_fits_capacity(p, cpu))
> + cpumask_clear_cpu(cpu, later_mask);
> + }
> +
> + if (!cpumask_empty(later_mask))
> + return 1;
>   } else {
>   int best_cpu = cpudl_maximum(cp);
>  
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 08ab28e1cefc..575b7d88d839 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1634,6 +1634,7 @@ static int
>  select_task_rq_dl(struct task_struct *p, int cpu, int sd_flag, int flags)
>  {
>   struct task_struct *curr;
> + bool select_rq;
>   struct rq *rq;
>  
>   if (sd_flag != SD_BALANCE_WAKE)
> @@ -1653,10 +1654,19 @@ select_task_rq_dl(struct task_struct *p, int cpu, int 
> sd_flag, int flags)
>* other hand, if it has a shorter deadline, we
>* try to make it stay here, it might be important.
>*/
> - if (unlikely(dl_task(curr)) &&
> - (curr->nr_cpus_allowed < 2 ||
> -  !dl_entity_preempt(&p->dl, &curr->dl)) &&
> - (p->nr_cpus_allowed > 1)) {
> + select_rq = unlikely(dl_task(curr)) &&
> + (curr->nr_cpus_allowed < 2 ||
> +  !dl_entity_preempt(&p->dl, &curr->dl)) &&
> + p->nr_cpus_allowed > 1;
> +
> + /*
> +  * Take the capacity of the CPU into account to
> +  * ensure it fits the requirement of the task.
> +  */
> + if (static_branch_unlikely(&sched_asym_cpucapacity))
> + select_rq |= !dl_task_fits_capacity(p, cpu);
> +
> + if (select_rq) {
>   int target = find_later_rq(p);

I see that find_later_rq() checks if the previous CPU is part of
later_mask and returns it immediately. So we don't migrate the
task in the case where the previous CPU can't fit the task and
there are no idle CPUs on which the task can fit. LGTM.

>  
>   if (target != -1 &&
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 511edacc2282..ec0efd99497b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -317,6 +317,21 @@ static inline bool __dl_overflow(struct dl_bw *dl_b, 
> unsigned long cap,
>  cap_scale(dl_b->bw, cap) < dl_b->total_bw - old_bw + new_bw;
>  }
>  
> +/*
> + * Verify the fitness of task @p to run on @cpu taking into account the
> + * CPU original capacity and the runtime/deadline ratio of the task.
> + *
> + * The function will return true if the CPU original capacity of the
> + * @cpu scaled by SCHED_CAPACITY_SCALE >= runtime/deadline ratio of the
> + * task and false otherwise.
> + */
> +static inline bool dl_task_fits_capacity(struct task_struct *p, int cpu)
> +{
> + unsigned long cap = arch_scale_cpu_capacity(cpu);
> +
> + return cap_scale(p->dl.dl_deadline, cap) >= p->dl.dl_runtime;
> +}
> +

This is the same as

return p->dl.dl_bw >> (BW_SHIFT - SCHED_CAPACITY_SHIFT) <= cap

Correct?  If yes, would it be better to use this?

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.

Re: [PATCH v2 6/6] sched/deadline: Implement fallback mechanism for !fit case

2020-04-30 Thread Pavan Kondeti
On Wed, Apr 29, 2020 at 07:39:50PM +0200, Dietmar Eggemann wrote:
> On 27/04/2020 16:17, luca abeni wrote:
> > Hi Juri,
> > 
> > On Mon, 27 Apr 2020 15:34:38 +0200
> > Juri Lelli  wrote:
> > 
> >> Hi,
> >>
> >> On 27/04/20 10:37, Dietmar Eggemann wrote:
> >>> From: Luca Abeni 
> >>>
> >>> When a task has a runtime that cannot be served within the
> >>> scheduling deadline by any of the idle CPU (later_mask) the task is
> >>> doomed to miss its deadline.
> >>>
> >>> This can happen since the SCHED_DEADLINE admission control
> >>> guarantees only bounded tardiness and not the hard respect of all
> >>> deadlines. In this case try to select the idle CPU with the largest
> >>> CPU capacity to minimize tardiness.
> >>>
> >>> Signed-off-by: Luca Abeni 
> >>> Signed-off-by: Dietmar Eggemann 
> > [...]
> >>> - if (!cpumask_empty(later_mask))
> >>> - return 1;
> >>> + if (cpumask_empty(later_mask))
> >>> + cpumask_set_cpu(max_cpu, later_mask);  
> >>
> >> Think we touched upon this during v1 review, but I'm (still?)
> >> wondering if we can do a little better, still considering only free
> >> cpus.
> >>
> >> Can't we get into a situation that some of the (once free) big cpus
> >> have been occupied by small tasks and now a big task enters the
> >> system and it only finds small cpus available, where it could have fit
> >> into bigs if small tasks were put onto small cpus?
> >>
> >> I.e., shouldn't we always try to best fit among free cpus?
> > 
> > Yes; there was an additional patch that tried schedule each task on the
> > slowest core where it can fit, to address this issue.
> > But I think it will go in a second round of patches.
> 
> Yes, we can run into this situation in DL, but also in CFS or RT.
> 
In the CFS case, the misfit task handling in the load balancer should help pull
the BIG task running on the little CPUs. I get your point that we can run
into the same scenario with tasks of other scheduling classes.

> IMHO, this patch is aligned with the Capacity Awareness implementation
> in CFS and RT.
> 
> Capacity Awareness so far is 'find a CPU which fits the requirement of
> the task (Req)'. It's not (yet) find the best CPU.
> 
> CFS - select_idle_capacity() -> task_fits_capacity()
> 
>   Req: util(p) * 1.25 < capacity_of(cpu)
> 
> RT  - select_task_rq_rt(), cpupri_find_fitness() ->
>   rt_task_fits_capacity()
> 
>   Req: uclamp_eff_value(p) <= capacity_orig_of(cpu)
> 
> DL  - select_task_rq_dl(), cpudl_find() -> dl_task_fits_capacity()
> 
>   Req: dl_runtime(p)/dl_deadline(p) * 1024  <= capacity_orig_of(cpu)
> 
> 
> There has to be an "idle" (from the viewpoint of the task) CPU available
> with a fitting capacity. Otherwise a fallback mechanism applies.
> 
> CFS - best capacity handling in select_idle_capacity().
> 
> RT  - Non-fitting lowest mask
> 
> DL  - This patch
> 
> You did spot the rt-app 'delay' for the small tasks in the test case ;-)

Thanks for the hint. It was not clear to me why a 1 msec delay is given for
the small tasks in the rt-app json description in the cover letter.
I get it now :-)

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v2 2/6] sched/deadline: Optimize dl_bw_cpus()

2020-04-30 Thread Pavan Kondeti
On Mon, Apr 27, 2020 at 10:37:05AM +0200, Dietmar Eggemann wrote:
> Return the weight of the rd (root domain) span in case it is a subset
> of the cpu_active_mask.
> 
> Continue to compute the number of CPUs over rd span and cpu_active_mask
> when in hotplug.
> 
> Signed-off-by: Dietmar Eggemann 
> ---
>  kernel/sched/deadline.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 504d2f51b0d6..4ae22bfc37ae 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -54,10 +54,16 @@ static inline struct dl_bw *dl_bw_of(int i)
>  static inline int dl_bw_cpus(int i)
>  {
>   struct root_domain *rd = cpu_rq(i)->rd;
> - int cpus = 0;
> + int cpus;
>  
>   RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
>"sched RCU must be held");
> +
> + if (cpumask_subset(rd->span, cpu_active_mask))
> + return cpumask_weight(rd->span);
> +

Looks good to me. This is a nice optimization.

> + cpus = 0;
> +
>   for_each_cpu_and(i, rd->span, cpu_active_mask)
>   cpus++;
>  
Do you know why this check is in place? Is it only to cover
the case of cpuset_cpu_inactive()->dl_cpu_busy()?

Thanks,
Pavan
> -- 
> 2.17.1
> 

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v3 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-04-29 Thread Pavan Kondeti
Hi Qais,

On Wed, Apr 29, 2020 at 01:30:57PM +0100, Qais Yousef wrote:
> Hi Pavan
> 
> On 04/29/20 17:02, Pavan Kondeti wrote:
> > Hi Qais,
> > 
> > On Tue, Apr 28, 2020 at 05:41:33PM +0100, Qais Yousef wrote:
> > 
> > [...]
> > 
> > >  
> > > +static void uclamp_sync_util_min_rt_default(struct task_struct *p)
> > > +{
> > > + struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> > > +
> > > + if (unlikely(rt_task(p)) && !uc_se->user_defined)
> > > + uclamp_se_set(uc_se, sysctl_sched_uclamp_util_min_rt_default, 
> > > false);
> > > +}
> > 
> > Unlike system default clamp values, RT default value is written to
> > p->uclamp_req[UCLAMP_MIN]. A user may not be able to set the uclamp.max to a
> > lower value than sysctl_sched_uclamp_util_min_rt_default. This is not a
> > big deal. Just sharing my observation. Is this how you expected it to work?
> 
> This is how I expect it to work yes.
> 
> Note that the sysctl_sched_uclamp_util_{min,max}, or 'system default clamp
> values' as you called them, are constraints. They define the range any task is
> _allowed to_ *request*.
> 
> The new sysctl_sched_uclamp_util_min_rt_default is setting the default
> *request* for RT tasks. And this was done since historically RT tasks has
> always run at the highest performance point in Linux.
> 
> Can you think of a scenario where setting the default *request* of uclamp.max
> for all RT tasks helps?
> 

It was a hypothetical question :-) Thanks for clearly explaining the
difference between system default clamp values (constraints) vs the newly
introduced sysctl_sched_uclamp_util_min_rt_default (request for RT tasks).

> I'm thinking, the global constrain of sysctl_sched_uclamp_util_max should
> suffice. Or one can use cgroup to selectively cap some tasks.
> 
> For example if you don't want any task in the system to be boosted to anything
> higher than 800, just do
> 
>   sysctl_sched_uclamp_util_max = 800
> 
> Or if you want to be selective, the same thing can be achieved by creating
> a cgroup cpu controller that has uclamp.max = 0.8 and attaching tasks to it.
> 
> > 
> > > +
> > >  static inline struct uclamp_se
> > >  uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
> > >  {
> > > @@ -907,8 +935,15 @@ uclamp_tg_restrict(struct task_struct *p, enum 
> > > uclamp_id clamp_id)
> > >  static inline struct uclamp_se
> > >  uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
> > >  {
> > > - struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
> > > - struct uclamp_se uc_max = uclamp_default[clamp_id];
> > > + struct uclamp_se uc_req, uc_max;
> > > +
> > > + /*
> > > +  * Sync up any change to sysctl_sched_uclamp_util_min_rt_default value.
> > > +  */
> > > + uclamp_sync_util_min_rt_default(p);
> > > +
> > > + uc_req = uclamp_tg_restrict(p, clamp_id);
> > > + uc_max = uclamp_default[clamp_id];
> > 
> > We are calling uclamp_sync_util_min_rt_default() unnecessarily for
> > clamp_id == UCLAMP_MAX case. Would it be better to have a separate
> 
> It was actually intentional to make sure we update the value ASAP. I didn't
> think it's a lot of overhead. I can further protect with a check to verify
> whether the value has changed if it seems heavy handed.
> 
> > uclamp_default for RT like uclamp_default_rt and select uc_max based
> > on task policy? Since all tunables are handled in 
> > sysctl_sched_uclamp_handler
> > we can cover the case of uclamp_util_min < uclamp_util_min_rt.
> 
> Hmm I'm not sure I got you.
> 
> uclamp_default[] is setting the constraints. I.e. what's the range it's 
> allowed
> to take.
> 
> What's added here is simply setting the requested uclamp.min value, which if
> the constraints allow it will be applied.
> 
> For example. If
> 
>   sysctl_sched_uclamp_util_min = 0
>   sysctl_sched_uclamp_util_min_rt_default = 400
> 
> uclamp_eff_get() will return 0 as effective uclamp.min value for the task. So
> while the task still requests to be boosted to 1024, but the system constraint
> prevents it from getting it. But as soon as the system constraint has changed,
> the task might be able to get what it wants.
> 
> For example, if at runtime sysctl_sched_uclamp_util_min was modified to 1024
> 
>   sysctl_sched_uclamp_util_min = 1024
>   sysctl_sched_uclamp_util_min_rt_default = 400
> 
> uclamp_eff_get() return 400 for the task now, as requested.
> 

Yes, I did understand the relation between the two.

Re: [PATCH v3 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

2020-04-29 Thread Pavan Kondeti
Hi Qais,

On Tue, Apr 28, 2020 at 05:41:33PM +0100, Qais Yousef wrote:

[...]

>  
> +static void uclamp_sync_util_min_rt_default(struct task_struct *p)
> +{
> + struct uclamp_se *uc_se = &p->uclamp_req[UCLAMP_MIN];
> +
> + if (unlikely(rt_task(p)) && !uc_se->user_defined)
> + uclamp_se_set(uc_se, sysctl_sched_uclamp_util_min_rt_default, 
> false);
> +}

Unlike system default clamp values, RT default value is written to
p->uclamp_req[UCLAMP_MIN]. A user may not be able to set the uclamp.max to a
lower value than sysctl_sched_uclamp_util_min_rt_default. This is not a
big deal. Just sharing my observation. Is this how you expected it to work?

> +
>  static inline struct uclamp_se
>  uclamp_tg_restrict(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> @@ -907,8 +935,15 @@ uclamp_tg_restrict(struct task_struct *p, enum uclamp_id 
> clamp_id)
>  static inline struct uclamp_se
>  uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
>  {
> - struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
> - struct uclamp_se uc_max = uclamp_default[clamp_id];
> + struct uclamp_se uc_req, uc_max;
> +
> + /*
> +  * Sync up any change to sysctl_sched_uclamp_util_min_rt_default value.
> +  */
> + uclamp_sync_util_min_rt_default(p);
> +
> + uc_req = uclamp_tg_restrict(p, clamp_id);
> + uc_max = uclamp_default[clamp_id];

We are calling uclamp_sync_util_min_rt_default() unnecessarily for
clamp_id == UCLAMP_MAX case. Would it be better to have a separate
uclamp_default for RT like uclamp_default_rt and select uc_max based
on task policy? Since all tunables are handled in sysctl_sched_uclamp_handler
we can cover the case of uclamp_util_min < uclamp_util_min_rt.
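
Something like the below is what I was thinking of (sketch only; the
uclamp_default_rt[] name is made up):

	/* Separate system-wide defaults for RT tasks. */
	static struct uclamp_se uclamp_default_rt[UCLAMP_CNT];

	static inline struct uclamp_se
	uclamp_eff_get(struct task_struct *p, enum uclamp_id clamp_id)
	{
		struct uclamp_se uc_req = uclamp_tg_restrict(p, clamp_id);
		struct uclamp_se uc_max = unlikely(rt_task(p)) ?
				uclamp_default_rt[clamp_id] :
				uclamp_default[clamp_id];

		/* System default restrictions always apply */
		if (unlikely(uc_req.value > uc_max.value))
			return uc_max;

		return uc_req;
	}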

>  
>   /* System default restrictions always apply */
>   if (unlikely(uc_req.value > uc_max.value))
> @@ -1114,12 +1149,13 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
> *table, int write,
>   loff_t *ppos)
>  {
>   bool update_root_tg = false;
> - int old_min, old_max;
> + int old_min, old_max, old_min_rt;
>   int result;
>  
>   mutex_lock(&uclamp_mutex);
>   old_min = sysctl_sched_uclamp_util_min;
>   old_max = sysctl_sched_uclamp_util_max;
> + old_min_rt = sysctl_sched_uclamp_util_min_rt_default;
>  
>   result = proc_dointvec(table, write, buffer, lenp, ppos);
>   if (result)
> @@ -1133,6 +1169,18 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
> *table, int write,
>   goto undo;
>   }
>  
> + /*
> +  * The new value will be applied to RT tasks the next time the
> +  * scheduler needs to calculate the effective uclamp.min for that task,
> +  * assuming the task is using the system default and not a user
> +  * specified value. In the latter we shall leave the value as the user
> +  * requested.
> +  */
> + if (sysctl_sched_uclamp_util_min_rt_default > SCHED_CAPACITY_SCALE) {
> + result = -EINVAL;
> + goto undo;
> + }
> +
>   if (old_min != sysctl_sched_uclamp_util_min) {
>   uclamp_se_set(&uclamp_default[UCLAMP_MIN],
> sysctl_sched_uclamp_util_min, false);
> @@ -1158,6 +1206,7 @@ int sysctl_sched_uclamp_handler(struct ctl_table 
> *table, int write,
>  undo:
>   sysctl_sched_uclamp_util_min = old_min;
>   sysctl_sched_uclamp_util_max = old_max;
> + sysctl_sched_uclamp_util_min_rt_default = old_min_rt;
>  done:
>   mutex_unlock(&uclamp_mutex);
>  
> @@ -1200,9 +1249,13 @@ static void __setscheduler_uclamp(struct task_struct 
> *p,
>   if (uc_se->user_defined)
>   continue;
>  
> - /* By default, RT tasks always get 100% boost */
> + /*
> +  * By default, RT tasks always get 100% boost, which the admins
> +  * are allowed to change via
> +  * sysctl_sched_uclamp_util_min_rt_default knob.
> +  */
>   if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> - clamp_value = uclamp_none(UCLAMP_MAX);
> + clamp_value = sysctl_sched_uclamp_util_min_rt_default;
>  
>   uclamp_se_set(uc_se, clamp_value, false);
>   }
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 8a176d8727a3..64117363c502 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -453,6 +453,13 @@ static struct ctl_table kern_table[] = {
>   .mode   = 0644,
>   .proc_handler   = sysctl_sched_uclamp_handler,
>   },
> + {
> + .procname   = "sched_util_clamp_min_rt_default",
> + .data   = &sysctl_sched_uclamp_util_min_rt_default,
> + .maxlen = sizeof(unsigned int),
> + .mode   = 0644,
> + .proc_handler   = sysctl_sched_uclamp_handler,
> + },
>  #endif
>  #ifdef CONFIG_SCHED_AUTOGROUP
>   {
> -- 
> 2.17.1
> 

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.

Re: [PATCH] sched/fair: Speed-up energy-aware wake-ups

2019-09-20 Thread Pavan Kondeti
Hi Quentin,

On Fri, Sep 20, 2019 at 11:41:15AM +0200, Quentin Perret wrote:
> Hi Pavan,
> 
> On Friday 20 Sep 2019 at 08:32:15 (+0530), Pavan Kondeti wrote:
> > Earlier, we were not checking the spare capacity for the prev_cpu. Now that
> > the continue statement is removed, prev_cpu could also be the max_spare_cap_cpu.
> > Actually that makes sense, because there is no reason why we would want to select
> > another CPU which has less spare capacity than the previous CPU.
> > 
> > Is this behavior intentional?
> 
> The intent was indeed to not compute the energy for another CPU in
> prev_cpu's perf domain if prev_cpu is the one with max spare cap -- it
> is useless to do so since this other CPU cannot 'beat' prev_cpu and
> will never be chosen in the end.

Yes. Selecting the prev_cpu is the correct decision.

> 
> But I did miss that we'd end up computing the energy for prev_cpu
> twice ... Harmless but useless. So yeah, let's optimize that case too :)
> 
> > When prev_cpu == max_spare_cap_cpu, we are evaluating the energy again for 
> > the
> > same CPU below. That could have been skipped by returning prev_cpu when
> > prev_cpu == max_spare_cap_cpu.
> 
> Right, something like the patch below ? My test results are still
> looking good with it applied.
> 
> Thanks for the careful review,
> Quentin
> ---
> From 7b8258287f180a2c383ebe397e8129f5f898ffbe Mon Sep 17 00:00:00 2001
> From: Quentin Perret 
> Date: Fri, 20 Sep 2019 09:07:20 +0100
> Subject: [PATCH] sched/fair: Avoid redundant EAS calculation
> 
> The EAS wake-up path computes the system energy for several CPU
> candidates: the CPU with maximum spare capacity in each performance
> domain, and the prev_cpu. However, if prev_cpu also happens to be the
> CPU with maximum spare capacity in its performance domain, the energy
> calculation is still done twice, unnecessarily.
> 
> Add a condition to filter out this corner case before doing the energy
> calculation.
> 
> Reported-by: Pavan Kondeti 
> Signed-off-by: Quentin Perret 
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d4bbf68c3161..7399382bc291 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6412,7 +6412,7 @@ static int find_energy_efficient_cpu(struct task_struct 
> *p, int prev_cpu)
>   }
>  
>   /* Evaluate the energy impact of using this CPU. */
> - if (max_spare_cap_cpu >= 0) {
> + if (max_spare_cap_cpu >= 0 && max_spare_cap_cpu != prev_cpu) {
>   cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
>   cur_delta -= base_energy_pd;
>   if (cur_delta < best_delta) {
> -- 
> 2.22.1
> 

+1. Looks good to me.

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] sched/fair: Speed-up energy-aware wake-ups

2019-09-19 Thread Pavan Kondeti
Hi Quentin,

On Thu, Sep 12, 2019 at 11:44:04AM +0200, Quentin Perret wrote:
> From: Quentin Perret 
> 
> EAS computes the energy impact of migrating a waking task when deciding
> on which CPU it should run. However, the current approach is known to
> have a high algorithmic complexity, which can result in prohibitively
> high wake-up latencies on systems with complex energy models, such as
> systems with per-CPU DVFS. On such systems, the algorithm complexity is
> in O(n^2) (ignoring the cost of searching for performance states in the
> EM) with 'n' the number of CPUs.
> 
> To address this, re-factor the EAS wake-up path to compute the energy
> 'delta' (with and without the task) on a per-performance domain basis,
> rather than system-wide, which brings the complexity down to O(n).
> 
> No functional changes intended.
> 
> Signed-off-by: Quentin Perret 
>  

[snip]

>  /*
> @@ -6381,21 +6367,19 @@ compute_energy(struct task_struct *p, int dst_cpu, 
> struct perf_domain *pd)
>   * other use-cases too. So, until someone finds a better way to solve this,
>   * let's keep things simple by re-using the existing slow path.
>   */
> -
>  static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
>  {
> - unsigned long prev_energy = ULONG_MAX, best_energy = ULONG_MAX;
> + unsigned long prev_delta = ULONG_MAX, best_delta = ULONG_MAX;
>   struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
> + unsigned long cpu_cap, util, base_energy = 0;
>   int cpu, best_energy_cpu = prev_cpu;
> - struct perf_domain *head, *pd;
> - unsigned long cpu_cap, util;
>   struct sched_domain *sd;
> + struct perf_domain *pd;
>  
>   rcu_read_lock();
>   pd = rcu_dereference(rd->pd);
>   if (!pd || READ_ONCE(rd->overutilized))
>   goto fail;
> - head = pd;
>  
>   /*
>* Energy-aware wake-up happens on the lowest sched_domain starting
> @@ -6412,9 +6396,14 @@ static int find_energy_efficient_cpu(struct 
> task_struct *p, int prev_cpu)
>   goto unlock;
>  
>   for (; pd; pd = pd->next) {
> - unsigned long cur_energy, spare_cap, max_spare_cap = 0;
> + unsigned long cur_delta, spare_cap, max_spare_cap = 0;
> + unsigned long base_energy_pd;
>   int max_spare_cap_cpu = -1;
>  
> + /* Compute the 'base' energy of the pd, without @p */
> + base_energy_pd = compute_energy(p, -1, pd);
> + base_energy += base_energy_pd;
> +
>   for_each_cpu_and(cpu, perf_domain_span(pd), 
> sched_domain_span(sd)) {
>   if (!cpumask_test_cpu(cpu, p->cpus_ptr))
>   continue;
> @@ -6427,9 +6416,9 @@ static int find_energy_efficient_cpu(struct task_struct 
> *p, int prev_cpu)
>  
>   /* Always use prev_cpu as a candidate. */
>   if (cpu == prev_cpu) {
> - prev_energy = compute_energy(p, prev_cpu, head);
> - best_energy = min(best_energy, prev_energy);
> - continue;
> + prev_delta = compute_energy(p, prev_cpu, pd);
> + prev_delta -= base_energy_pd;
> + best_delta = min(best_delta, prev_delta);
>   }

Earlier, we were not checking the spare capacity for the prev_cpu. Now that the
continue statement is removed, prev_cpu could also be the max_spare_cap_cpu.
Actually that makes sense, because there is no reason why we would want to select
another CPU which has less spare capacity than the previous CPU.

Is this behavior intentional?

When prev_cpu == max_spare_cap_cpu, we are evaluating the energy again for the
same CPU below. That could have been skipped by returning prev_cpu when
prev_cpu == max_spare_cap_cpu.

>  
>   /*
> @@ -6445,9 +6434,10 @@ static int find_energy_efficient_cpu(struct 
> task_struct *p, int prev_cpu)
>  
>   /* Evaluate the energy impact of using this CPU. */
>   if (max_spare_cap_cpu >= 0) {
> - cur_energy = compute_energy(p, max_spare_cap_cpu, head);
> - if (cur_energy < best_energy) {
> - best_energy = cur_energy;
> + cur_delta = compute_energy(p, max_spare_cap_cpu, pd);
> + cur_delta -= base_energy_pd;
> + if (cur_delta < best_delta) {
> + best_delta = cur_delta;
>   best_energy_cpu = max_spare_cap_cpu;
>   }
>   }
> @@ -6459,10 +6449,10 @@ static int find_energy_efficient_cpu(struct 
> task_struct *p, int prev_cpu)
>* Pick the best CPU if prev_cpu cannot be used, or if it saves at
>* least 6% of the energy used by prev_cpu.
>*/
> - if (prev_energy == ULONG_MAX)
> + if (prev_delta == ULONG_MAX)
>  

Re: [PATCH v2] sched: fix migration to invalid cpu in __set_cpus_allowed_ptr

2019-09-16 Thread Pavan Kondeti
On Mon, Sep 16, 2019 at 06:53:28AM +, KeMeng Shi wrote:
> Oops occur when running qemu on arm64:
>  Unable to handle kernel paging request at virtual address 08effe40
>  Internal error: Oops: 9607 [#1] SMP
>  Process migration/0 (pid: 12, stack limit = 0x084e3736)
>  pstate: 2085 (nzCv daIf -PAN -UAO)
>  pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
>  lr : move_queued_task.isra.21+0x124/0x298
>  ...
>  Call trace:
>   __ll_sc___cmpxchg_case_acq_4+0x4/0x20
>   __migrate_task+0xc8/0xe0
>   migration_cpu_stop+0x170/0x180
>   cpu_stopper_thread+0xec/0x178
>   smpboot_thread_fn+0x1ac/0x1e8
>   kthread+0x134/0x138
>   ret_from_fork+0x10/0x18
> 
> __set_cpus_allowed_ptr will choose an active dest_cpu in the affinity mask to
> migrate the process if the process is not currently running on any one of the
> CPUs specified in the affinity mask. __set_cpus_allowed_ptr will choose an
> invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if
> CPUs in the affinity mask are deactivated by cpu_down after the cpumask_intersects
> check. The cpumask_test_cpu of dest_cpu afterwards overflows and may pass if the
> corresponding bit is coincidentally set. As a consequence, the kernel will
> access an invalid rq address associated with the invalid cpu in
> migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.
> 
> Process as follows may trigger the Oops:
> 1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
> sched_setaffinity.
> 2) A shell script repeatedly "echo 0 > /sys/devices/system/cpu/cpu1/online"
> and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
> 3) Oops appears if the invalid cpu is set in memory after tested cpumask.
> 
> Signed-off-by: KeMeng Shi 
> Reviewed-by: Valentin Schneider 
> ---
> Changes in v2:
> -solve format problems in log
> 
>  kernel/sched/core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3c7b90bcbe4e..087f4ac30b60 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1656,7 +1656,8 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
>   if (cpumask_equal(p->cpus_ptr, new_mask))
>   goto out;
>  
> - if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
> + dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
> + if (dest_cpu >= nr_cpu_ids) {
>   ret = -EINVAL;
>   goto out;
>   }
> @@ -1677,7 +1678,6 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
>   if (cpumask_test_cpu(task_cpu(p), new_mask))
>   goto out;
>  
> - dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
>   if (task_running(rq, p) || p->state == TASK_WAKING) {
>   struct migration_arg arg = { p, dest_cpu };
>   /* Need help from migration thread: drop lock and wait. */
> -- 
> 2.19.1
> 
> 

The cpu_active_mask might have changed in between. Your fix looks good to me.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending

2019-06-10 Thread Pavan Kondeti
Hi Rafael/Thomas,

On Mon, Jun 3, 2019 at 10:03 AM Pavankumar Kondeti
 wrote:
>
> When "deep" suspend is enabled, all CPUs except the primary CPU
> are hotplugged out. Since CPU hotplug is a costly operation,
> check if we have to abort the suspend in between each CPU
> hotplug. This would improve the system suspend abort latency
> upon detecting a wakeup condition.
>

Please let me know if you have any comments on this patch.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


Re: [PATCH] sched, trace: Fix prev_state output in sched_switch tracepoint

2018-11-27 Thread Pavan Kondeti
Hi Peter/Thomas,

On Tue, Oct 30, 2018 at 12:25 PM Pavankumar Kondeti
 wrote:
>
> commit 3f5fe9fef5b2 ("sched/debug: Fix task state recording/printout")
> tried to fix the problem introduced by a previous commit efb40f588b43
> ("sched/tracing: Fix trace_sched_switch task-state printing"). However
> the prev_state output in sched_switch is still broken.
>
> task_state_index() uses fls() which considers the LSB as 1. Left
> shifting 1 by this value gives an incorrect mapping to the task state.
> Fix this by decrementing the value returned by __get_task_state()
> before shifting.
>
> Fixes: 3f5fe9fef5b2 ("sched/debug: Fix task state recording/printout")
> Signed-off-by: Pavankumar Kondeti 
> ---
>  include/trace/events/sched.h | 12 +++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
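
To illustrate the off-by-one with example values: for prev_state ==
TASK_UNINTERRUPTIBLE (0x2), the fls()-based __get_task_state() returns 2, and
"1 << 2" maps it to 0x4 (__TASK_STOPPED) in the trace output, whereas
"1 << (2 - 1)" gives back the intended 0x2.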

Can you please review this patch?

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


Re: [PATCH v5 2/2] sched/fair: update scale invariance of PELT

2018-10-30 Thread Pavan Kondeti
Hi Vincent,

On Fri, Oct 26, 2018 at 06:11:43PM +0200, Vincent Guittot wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6806c27..7a69673 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -674,9 +674,8 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct 
> sched_entity *se)
>   return calc_delta_fair(sched_slice(cfs_rq, se), se);
>  }
>  
> -#ifdef CONFIG_SMP
>  #include "pelt.h"
> -#include "sched-pelt.h"
> +#ifdef CONFIG_SMP
>  
>  static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
>  static unsigned long task_h_load(struct task_struct *p);
> @@ -764,7 +763,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
>* such that the next switched_to_fair() has the
>* expected state.
>*/
> - se->avg.last_update_time = cfs_rq_clock_task(cfs_rq);
> + se->avg.last_update_time = cfs_rq_clock_pelt(cfs_rq);
>   return;
>   }
>   }
> @@ -3466,7 +3465,7 @@ static void detach_entity_load_avg(struct cfs_rq 
> *cfs_rq, struct sched_entity *s
>  /* Update task and its cfs_rq load average */
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct 
> sched_entity *se, int flags)
>  {
> - u64 now = cfs_rq_clock_task(cfs_rq);
> + u64 now = cfs_rq_clock_pelt(cfs_rq);
>   struct rq *rq = rq_of(cfs_rq);
>   int cpu = cpu_of(rq);
>   int decayed;
> @@ -6694,6 +6693,12 @@ done: __maybe_unused;
>   if (new_tasks > 0)
>   goto again;
>  
> + /*
> +  * rq is about to be idle, check if we need to update the
> +  * lost_idle_time of clock_pelt
> +  */
> + update_idle_rq_clock_pelt(rq);
> +
>   return NULL;
>  }

Do you think it is better to call this from pick_next_task_idle()? I don't see
any functional difference, but it may be easier to follow.
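
i.e. something like (sketch only, the rest of pick_next_task_idle() unchanged):

	static struct task_struct *
	pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
	{
		put_prev_task(rq, prev);
		update_idle_core(rq);
		schedstat_inc(rq->sched_goidle);

		/* rq is about to be idle, sync clock_pelt with clock_task */
		update_idle_rq_clock_pelt(rq);

		return rq->idle;
	}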

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH v4 2/2] sched/fair: update scale invariance of PELT

2018-10-23 Thread Pavan Kondeti
Hi Vincent,

Thanks for the detailed explanation.

On Tue, Oct 23, 2018 at 02:15:08PM +0200, Vincent Guittot wrote:
> Hi Pavan,
> 
> On Tue, 23 Oct 2018 at 07:59, Pavan Kondeti  wrote:
> >
> > Hi Vincent,
> >
> > On Fri, Oct 19, 2018 at 06:17:51PM +0200, Vincent Guittot wrote:
> > >
> > >  /*
> > > + * The clock_pelt scales the time to reflect the effective amount of
> > > + * computation done during the running delta time but then sync back to
> > > + * clock_task when rq is idle.
> > > + *
> > > + *
> > > + * absolute time   | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
> > > + * @ max capacity  --**---**---
> > > + * @ half capacity ----
> > > + * clock pelt  | 1| 2|3|4| 7| 8| 9|   10|   11|14|15|16
> > > + *
> > > + */
> > > +void update_rq_clock_pelt(struct rq *rq, s64 delta)
> > > +{
> > > +
> > > + if (is_idle_task(rq->curr)) {
> > > + u32 divider = (LOAD_AVG_MAX - 1024 + 
> > > rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
> > > + u32 overload = rq->cfs.avg.util_sum + LOAD_AVG_MAX;
> > > + overload += rq->avg_rt.util_sum;
> > > + overload += rq->avg_dl.util_sum;
> > > +
> > > + /*
> > > +  * Reflecting some stolen time makes sense only if the idle
> > > +  * phase would be present at max capacity. As soon as the
> > > +  * utilization of a rq has reached the maximum value, it is
> > > +  * considered as an always runnnig rq without idle time to
> > > +  * steal. This potential idle time is considered as lost in
> > > +  * this case. We keep track of this lost idle time compare 
> > > to
> > > +  * rq's clock_task.
> > > +  */
> > > + if (overload >= divider)
> > > + rq->lost_idle_time += rq_clock_task(rq) - 
> > > rq->clock_pelt;
> > > +
> >
> > I am trying to understand this better. I believe we run into this scenario, 
> > when
> > the frequency is limited due to thermal/userspace constraints. Lets say
> 
> Yes these are the most common UCs but this can also happen after tasks
> migration or with a cpufreq governor that doesn't increase OPP fast
> enough for current utilization.
> 
> > frequency is limited to Fmax/2. A 50% task at Fmax, becomes 100% running at
> > Fmax/2. The utilization is built up to 100% after several periods.
> > The clock_pelt runs at 1/2 speed of the clock_task. We are loosing the idle 
> > time
> > all along. What happens when the CPU enters idle for a short duration and 
> > comes
> > back to run this 100% utilization task?
> 
> If you are at 100%, we only apply the short idle duration
> 
> >
> > If the above block is not present i.e lost_idle_time is not tracked, we
> > stretch the idle time (since clock_pelt is synced to clock_task) and the
> > utilization is dropped. Right?
> 
> yes that 's what would happen. I gives more details below
> 
> >
> > With the above block, we don't stretch the idle time. In fact we don't
> > consider the idle time at all. Because,
> >
> > idle_time = now - last_time;
> >
> > idle_time = (rq->clock_pelt - rq->lost_idle_time) - last_time
> > idle_time = (rq->clock_task - rq_clock_task + rq->clock_pelt_old) - 
> > last_time
> > idle_time = rq->clock_pelt_old - last_time
> >
> > The last time is nothing but the last snapshot of the rq->clock_pelt when 
> > the
> > task entered sleep due to which CPU entered idle.
> 
> The condition for dropping this idle time is quite important. This
> only happens when the utilization reaches max compute capacity of the
> CPU. Otherwise, the idle time will be fully applied

Right.

rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt

This not only tracks the idle time lost due to running slow but also the
absolute/real sleep time. For example, when the slow running 100% task
sleeps for 100 msec, aren't we ignoring that 100 msec of sleep?

For example, a task runs for 323 msec at full capacity and sleeps for (1000-323)
msec; when it wakes up, the utilization is dropped. If the same task runs
for 626 msec at half capacity and sleeps for (1000-626) msec, should we not
drop the utilization by taking the (1000-626) msec sleep time into account? I
understand why we don't stretch the idle time to (1000-323) but 

Re: [PATCH v4 2/2] sched/fair: update scale invariance of PELT

2018-10-22 Thread Pavan Kondeti
Hi Vincent,

On Fri, Oct 19, 2018 at 06:17:51PM +0200, Vincent Guittot wrote:
>  
>  /*
> + * The clock_pelt scales the time to reflect the effective amount of
> + * computation done during the running delta time but then sync back to
> + * clock_task when rq is idle.
> + *
> + *
> + * absolute time   | 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|16
> + * @ max capacity  --**---**---
> + * @ half capacity ----
> + * clock pelt  | 1| 2|3|4| 7| 8| 9|   10|   11|14|15|16
> + *
> + */
> +void update_rq_clock_pelt(struct rq *rq, s64 delta)
> +{
> +
> + if (is_idle_task(rq->curr)) {
> + u32 divider = (LOAD_AVG_MAX - 1024 + 
> rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
> + u32 overload = rq->cfs.avg.util_sum + LOAD_AVG_MAX;
> + overload += rq->avg_rt.util_sum;
> + overload += rq->avg_dl.util_sum;
> +
> + /*
> +  * Reflecting some stolen time makes sense only if the idle
> +  * phase would be present at max capacity. As soon as the
> +  * utilization of a rq has reached the maximum value, it is
> +  * considered as an always runnnig rq without idle time to
> +  * steal. This potential idle time is considered as lost in
> +  * this case. We keep track of this lost idle time compare to
> +  * rq's clock_task.
> +  */
> + if (overload >= divider)
> + rq->lost_idle_time += rq_clock_task(rq) - 
> rq->clock_pelt;
> +

I am trying to understand this better. I believe we run into this scenario when
the frequency is limited due to thermal/userspace constraints. Let's say the
frequency is limited to Fmax/2. A 50% task at Fmax becomes 100% running at
Fmax/2. The utilization is built up to 100% after several periods.
The clock_pelt runs at 1/2 the speed of clock_task, so we are losing the idle
time all along. What happens when the CPU enters idle for a short duration and
comes back to run this 100% utilization task?

If the above block is not present, i.e., lost_idle_time is not tracked, we
stretch the idle time (since clock_pelt is synced to clock_task) and the
utilization is dropped. Right?

With the above block, we don't stretch the idle time. In fact, we don't
consider the idle time at all, because:

idle_time = now - last_time;

idle_time = (rq->clock_pelt - rq->lost_idle_time) - last_time
idle_time = (rq->clock_task - rq_clock_task + rq->clock_pelt_old) - last_time
idle_time = rq->clock_pelt_old - last_time

Here last_time is simply the last snapshot of rq->clock_pelt, taken when the
task went to sleep and the CPU consequently entered idle.

Can you please explain the significance of the above block with an example?

> +
> + /* The rq is idle, we can sync to clock_task */
> + rq->clock_pelt  = rq_clock_task(rq);
> +
> +
> + } else {
> + /*
> +  * When a rq runs at a lower compute capacity, it will need
> +  * more time to do the same amount of work than at max
> +  * capacity: either because it takes more time to compute the
> +  * same amount of work or because taking more time means
> +  * sharing more often the CPU between entities.
> +  * In order to be invariant, we scale the delta to reflect how
> +  * much work has been really done.
> +  * Running at lower capacity also means running longer to do
> +  * the same amount of work and this results in stealing some
> +  * idle time that will disturb the load signal compared to
> +  * max capacity; This stolen idle time will be automaticcally
> +  * reflected when the rq will be idle and the clock will be
> +  * synced with rq_clock_task.
> +  */
> +
> + /*
> +  * scale the elapsed time to reflect the real amount of
> +  * computation
> +  */
> + delta = cap_scale(delta, arch_scale_freq_capacity(cpu_of(rq)));
> + delta = cap_scale(delta, arch_scale_cpu_capacity(NULL, 
> cpu_of(rq)));
> +
> + rq->clock_pelt += delta;

AFAICT, the rq->clock_pelt is used for both utilization and load. So the load
also becomes a function of CPU uarch now. Is this intentional?
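
As an aside, a small standalone sketch of the double scaling in the quoted hunk
(userspace C, not kernel code; SCHED_CAPACITY_SHIFT and the cap_scale()
arithmetic mirror the kernel's definitions, while the capacity values are
made-up examples):

#include <stdio.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10

/* same arithmetic as the kernel's cap_scale() macro in fair.c */
#define cap_scale(v, cap)	(((v) * (cap)) >> SCHED_CAPACITY_SHIFT)

int main(void)
{
	uint64_t delta = 1000;		/* usec of wall-clock runtime */
	unsigned long freq_cap = 512;	/* running at Fmax/2 */
	unsigned long cpu_cap = 1024;	/* biggest CPU in the system */

	delta = cap_scale(delta, freq_cap);	/* 1000 * 512 >> 10 = 500 */
	delta = cap_scale(delta, cpu_cap);	/* 500 * 1024 >> 10 = 500 */

	/* clock_pelt advances only 500 usec for 1000 usec of wall time */
	printf("scaled delta = %llu usec\n", (unsigned long long)delta);
	return 0;
}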

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [RFC PATCH 29/30] softirq: Make softirq processing softinterruptible

2018-10-22 Thread Pavan Kondeti
Hi Frederic,

On Wed, Oct 17, 2018 at 02:26:02AM +0200, Frederic Weisbecker wrote:
> Hi Pavan,
> 
> On Tue, Oct 16, 2018 at 09:45:52AM +0530, Pavan Kondeti wrote:
> > Hi Frederic,
> > 
> > On Thu, Oct 11, 2018 at 01:12:16AM +0200, Frederic Weisbecker wrote:
> > > From: Frederic Weisbecker 
> > > 
> > > Make do_softirq() re-entrant and allow a vector, being either processed
> > > or disabled, to be interrupted by another vector. This way a vector
> > > won't be able to monopolize the CPU for a long while at the expense of
> > > the others that may rely on some predictable latency, especially on
> > > softirq disabled sections that used to disable all vectors.
> > > 
> > I understand that a long running softirq can be preempted/interrupted by
> > other softirqs which is not possible today. I have few questions on your
> > patches.
> > 
> > (1) When softirq processing is pushed to ksoftirqd, then the long running
> > softirq can still block other softirqs (not in SOFTIRQ_NOW_MASK) for a 
> > while.
> > correct?
> 
> No, Ksoftirqd is treated the same as IRQ tail processing here: a vector can
> interrupt another. So for example, a NET_RX softirq running in Ksoftirqd can
> be interrupted by a TIMER softirq running in hardirq tail.
> 
When ksoftirqd is running, we only allow the softirqs in SOFTIRQ_NOW_MASK
to run after serving an interrupt. So I don't see how TIMER, which is not
in SOFTIRQ_NOW_MASK, can interrupt a NET_RX softirq running in ksoftirqd
context.

> > 
> > (2) When softirqs processing happens asynchronously, a particular softirq
> > like TASKLET can keep interrupting an already running softirq like 
> > TIMER/NET_RX,
> > correct? In worse case scenario, a long running softirq like NET_RX 
> > interrupt
> > a TIMER softirq. But I guess this is something expected with this. i.e
> > each softirq is independent and whichever comes recent gets to interrupt the
> > previously running softirqs.
> 
> Exactly, and that's inherent with interrupts in general. The only way to work
> around that is to thread each vector independantly but that's a whole 
> different
> dimension :-)
> 
Right.

Assigning a thread for each vector also may not solve this problem because
preemption would be disabled while a softirq vector is running in its own
thread.

I guess there are no hard priorities among softirq vectors. Earlier it was
first come, first served; now it is not. If we had priorities defined
(I don't know how :-)), we could disable the lower-priority vectors while a
higher-priority vector is being handled. This way we could guarantee that the
TIMER softirq or HI-TASKLET won't be starved while a long running softirq like
NET_RX/NET_TX/RCU is running.
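
Purely to illustrate that idea, a hypothetical sketch (nothing like this exists
in the kernel; the priority ordering is invented):

#include <linux/interrupt.h>	/* NR_SOFTIRQS and the vector numbers */
#include <linux/bits.h>

/* lower value = higher priority; remaining vectors omitted for brevity */
static const u8 softirq_prio[NR_SOFTIRQS] = {
	[HI_SOFTIRQ]	 = 0,
	[TIMER_SOFTIRQ]	 = 1,
	[NET_TX_SOFTIRQ] = 4,
	[NET_RX_SOFTIRQ] = 4,
	[RCU_SOFTIRQ]	 = 5,
};

/* mask of vectors allowed to interrupt the one currently being handled */
static inline u32 interruptible_by(unsigned int running_vec)
{
	u32 mask = 0;
	unsigned int i;

	for (i = 0; i < NR_SOFTIRQS; i++)
		if (softirq_prio[i] < softirq_prio[running_vec])
			mask |= BIT(i);
	return mask;
}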

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [RFC PATCH 29/30] softirq: Make softirq processing softinterruptible

2018-10-15 Thread Pavan Kondeti
Hi Frederic,

On Thu, Oct 11, 2018 at 01:12:16AM +0200, Frederic Weisbecker wrote:
> From: Frederic Weisbecker 
> 
> Make do_softirq() re-entrant and allow a vector, being either processed
> or disabled, to be interrupted by another vector. This way a vector
> won't be able to monopolize the CPU for a long while at the expense of
> the others that may rely on some predictable latency, especially on
> softirq disabled sections that used to disable all vectors.
> 
I understand that a long running softirq can be preempted/interrupted by
other softirqs, which is not possible today. I have a few questions on your
patches.

(1) When softirq processing is pushed to ksoftirqd, the long running
softirq can still block other softirqs (those not in SOFTIRQ_NOW_MASK) for a
while, correct?

(2) When softirq processing happens asynchronously, a particular softirq
like TASKLET can keep interrupting an already running softirq like TIMER/NET_RX,
correct? In the worst case, a long running softirq like NET_RX interrupts
a TIMER softirq. But I guess this is expected with this approach, i.e.,
each softirq is independent and whichever arrives most recently gets to
interrupt the previously running softirqs.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH v3 12/14] sched/core: uclamp: add system default clamps

2018-08-16 Thread Pavan Kondeti
On Mon, Aug 06, 2018 at 05:39:44PM +0100, Patrick Bellasi wrote:
> Clamp values cannot be tuned at the root cgroup level. Moreover, because
> of the delegation model requirements and how the parent clamps
> propagation works, if we want to enable subgroups to set a non null
> util.min, we need to be able to configure the root group util.min to the
> allow the maximum utilization (SCHED_CAPACITY_SCALE = 1024).
> 
> Unfortunately this setup will also mean that all tasks running in the
> root group, will always get a maximum util.min clamp, unless they have a
> lower task specific clamp which is definitively not a desirable default
> configuration.
> 
> Let's fix this by explicitly adding a system default configuration
> (sysctl_sched_uclamp_util_{min,max}) which works as a restrictive clamp
> for all tasks running on the root group.
> 
> This interface is available independently from cgroups, thus providing a
> complete solution for system wide utilization clamping configuration.
> 
> Signed-off-by: Patrick Bellasi 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Tejun Heo 
> Cc: Paul Turner 
> Cc: Suren Baghdasaryan 
> Cc: Todd Kjos 
> Cc: Joel Fernandes 
> Cc: Steve Muckle 
> Cc: Juri Lelli 
> Cc: Dietmar Eggemann 
> Cc: Morten Rasmussen 
> Cc: linux-kernel@vger.kernel.org
> Cc: linux...@vger.kernel.org



> +/*
> + * Minimum utilization for tasks in the root cgroup
> + * default: 0
> + */
> +unsigned int sysctl_sched_uclamp_util_min;
> +
> +/*
> + * Maximum utilization for tasks in the root cgroup
> + * default: 1024
> + */
> +unsigned int sysctl_sched_uclamp_util_max = 1024;
> +
> +static struct uclamp_se uclamp_default[UCLAMP_CNT];
> +

The default group id for un-clamped root tasks is 0 because of
this declaration, correct?
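
For what it's worth, a minimal standalone illustration of why 0 is the expected
answer (plain C, with a stand-in struct rather than the patch's struct
uclamp_se): objects with static storage duration are zero-initialized, so every
group_id in a file-scope array starts out as 0 unless it is set elsewhere.

#include <stdio.h>

struct clamp_se {
	unsigned int value;
	unsigned int group_id;
};

/* stand-in for the patch's uclamp_default[UCLAMP_CNT] */
static struct clamp_se clamp_default[2];

int main(void)
{
	/* prints 0: static objects are zero-initialized in C */
	printf("default group_id = %u\n", clamp_default[0].group_id);
	return 0;
}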

>  /**
>   * uclamp_map: reference counts a utilization "clamp value"
>   * @value:the utilization "clamp value" required
> @@ -957,12 +971,25 @@ static inline int uclamp_task_group_id(struct 
> task_struct *p, int clamp_id)
>   group_id = uc_se->group_id;
>  
>  #ifdef CONFIG_UCLAMP_TASK_GROUP
> + /*
> +  * Tasks in the root group, which do not have a task specific clamp
> +  * value, get the system default calmp value.
> +  */
> + if (group_id == UCLAMP_NOT_VALID &&
> + task_group(p) == &root_task_group) {
> + return uclamp_default[clamp_id].group_id;
> + }
> +
>   /* Use TG's clamp value to limit task specific values */
>   uc_se = &task_group(p)->uclamp[clamp_id];
>   if (group_id == UCLAMP_NOT_VALID ||
>   clamp_value > uc_se->effective.value) {
>   group_id = uc_se->effective.group_id;
>   }
> +#else
> + /* By default, all tasks get the system default clamp value */
> + if (group_id == UCLAMP_NOT_VALID)
> + return uclamp_default[clamp_id].group_id;
>  #endif
>  
>   return group_id;
> @@ -1269,6 +1296,75 @@ static inline void uclamp_group_get(struct task_struct 
> *p,
>   uclamp_group_put(clamp_id, prev_group_id);
>  }
>  
> +int sched_uclamp_handler(struct ctl_table *table, int write,
> +  void __user *buffer, size_t *lenp,
> +  loff_t *ppos)
> +{
> + int group_id[UCLAMP_CNT] = { UCLAMP_NOT_VALID };
> + struct uclamp_se *uc_se;
> + int old_min, old_max;
> + int result;
> +
> + mutex_lock(&uclamp_mutex);
> +
> + old_min = sysctl_sched_uclamp_util_min;
> + old_max = sysctl_sched_uclamp_util_max;
> +
> + result = proc_dointvec(table, write, buffer, lenp, ppos);
> + if (result)
> + goto undo;
> + if (!write)
> + goto done;
> +
> + if (sysctl_sched_uclamp_util_min > sysctl_sched_uclamp_util_max)
> + goto undo;
> + if (sysctl_sched_uclamp_util_max > 1024)
> + goto undo;
> +
> + /* Find a valid group_id for each required clamp value */
> + if (old_min != sysctl_sched_uclamp_util_min) {
> + result = uclamp_group_find(UCLAMP_MIN, 
> sysctl_sched_uclamp_util_min);
> + if (result == -ENOSPC) {
> + pr_err("Cannot allocate more than %d UTIL_MIN clamp 
> groups\n",
> +CONFIG_UCLAMP_GROUPS_COUNT);
> + goto undo;
> + }
> + group_id[UCLAMP_MIN] = result;
> + }
> + if (old_max != sysctl_sched_uclamp_util_max) {
> + result = uclamp_group_find(UCLAMP_MAX, 
> sysctl_sched_uclamp_util_max);
> + if (result == -ENOSPC) {
> + pr_err("Cannot allocate more than %d UTIL_MAX clamp 
> groups\n",
> +CONFIG_UCLAMP_GROUPS_COUNT);
> + goto undo;
> + }
> + group_id[UCLAMP_MAX] = result;
> + }
> +
> + /* Update each required clamp group */
> + if (old_min != sysctl_sched_uclamp_util_min) {
> + uc_se = &uclamp_default[UCLAMP_MIN];
> + uclamp_group_get(NULL, UCLAMP_MIN, group_id[UCLAMP_MIN],
> +

Re: [PATCH v3 09/14] sched/core: uclamp: propagate parent clamps

2018-08-16 Thread Pavan Kondeti
On Mon, Aug 06, 2018 at 05:39:41PM +0100, Patrick Bellasi wrote:
> In order to properly support hierarchical resources control, the cgroup
> delegation model requires that attribute writes from a child group never
> fail but still are (potentially) constrained based on parent's assigned
> resources. This requires to properly propagate and aggregate parent
> attributes down to its descendants.
> 
> Let's implement this mechanism by adding a new "effective" clamp value
> for each task group. The effective clamp value is defined as the smaller
> value between the clamp value of a group and the effective clamp value
> of its parent. This represent also the clamp value which is actually
> used to clamp tasks in each task group.
> 
> Since it can be interesting for tasks in a cgroup to know exactly what
> is the currently propagated/enforced configuration, the effective clamp
> values are exposed to user-space by means of a new pair of read-only
> attributes: cpu.util.{min,max}.effective.
> 
> Signed-off-by: Patrick Bellasi 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Tejun Heo 
> Cc: Rafael J. Wysocki 
> Cc: Viresh Kumar 
> Cc: Suren Baghdasaryan 
> Cc: Todd Kjos 
> Cc: Joel Fernandes 
> Cc: Juri Lelli 
> Cc: linux-kernel@vger.kernel.org
> Cc: linux...@vger.kernel.org
>  



> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8f48e64fb8a6..3fac2d098084 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -589,6 +589,11 @@ struct uclamp_se {
>   unsigned int value;
>   /* Utilization clamp group for this constraint */
>   unsigned int group_id;
> + /* Effective clamp  for tasks in this group */
> + struct {
> + unsigned int value;
> + unsigned int group_id;
> + } effective;
>  };

Are these needed when CONFIG_UCLAMP_TASK_GROUP is disabled?
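
For illustration, the kind of change the question is hinting at (a sketch, not
the posted patch):

struct uclamp_se {
	unsigned int value;
	unsigned int group_id;
#ifdef CONFIG_UCLAMP_TASK_GROUP
	/* Effective clamp for tasks in this group (cgroup aggregation only) */
	struct {
		unsigned int value;
		unsigned int group_id;
	} effective;
#endif
};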

>  
>  union rcu_special {
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2ba55a4afffb..f692df3787bd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1237,6 +1237,8 @@ static inline void init_uclamp_sched_group(void)
>   uc_se = &root_task_group.uclamp[clamp_id];
>   uc_se->value = uclamp_none(clamp_id);
>   uc_se->group_id = group_id;
> + uc_se->effective.value = uclamp_none(clamp_id);
> + uc_se->effective.group_id = group_id;
>  
>   /* Attach root TG's clamp group */
>   uc_map[group_id].se_count = 1;
>  



> @@ -7622,11 +7687,19 @@ static struct cftype cpu_legacy_files[] = {
>   .read_u64 = cpu_util_min_read_u64,
>   .write_u64 = cpu_util_min_write_u64,
>   },
> + {
> + .name = "util.min.effective",
> + .read_u64 = cpu_util_min_effective_read_u64,
> + },
>   {
>   .name = "util.max",
>   .read_u64 = cpu_util_max_read_u64,
>   .write_u64 = cpu_util_max_write_u64,
>   },
> + {
> + .name = "util.max.effective",
> + .read_u64 = cpu_util_max_effective_read_u64,
> + },
>  #endif
>   { } /* Terminate */
>  };

Is there any reason why these are not added for the default hierarchy?

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH v3 02/14] sched/core: uclamp: map TASK's clamp values into CPU's clamp groups

2018-08-14 Thread Pavan Kondeti
On Mon, Aug 06, 2018 at 05:39:34PM +0100, Patrick Bellasi wrote:
> Utilization clamping requires each CPU to know which clamp values are
> assigned to tasks that are currently RUNNABLE on that CPU.
> Multiple tasks can be assigned the same clamp value and tasks with
> different clamp values can be concurrently active on the same CPU.
> Thus, a proper data structure is required to support a fast and
> efficient aggregation of the clamp values required by the currently
> RUNNABLE tasks.
> 
> For this purpose we use a per-CPU array of reference counters,
> where each slot is used to account how many tasks require a certain
> clamp value are currently RUNNABLE on each CPU.
> Each clamp value corresponds to a "clamp index" which identifies the
> position within the array of reference couters.
> 
>  :
>(user-space changes)  :  (kernel space / scheduler)
>  :
>  SLOW PATH   : FAST PATH
>  :
> task_struct::uclamp::value   : sched/core::enqueue/dequeue
>  : cpufreq_schedutil
>  :
>   ++++ +---+
>   |  TASK  || CLAMP GROUP| |CPU CLAMPS |
>   ++++ +---+
>   |||   clamp_{min,max}  | |  clamp_{min,max}  |
>   | util_{min,max} ||  se_count  | |tasks count|
>   ++++ +---+
>  :
>+-->  :  +--->
> group_id = map(clamp_value)  :  ref_count(group_id)
>  :
>  :
> 
> Let's introduce the support to map tasks to "clamp groups".
> Specifically we introduce the required functions to translate a
> "clamp value" into a clamp's "group index" (group_id).
> 
> Only a limited number of (different) clamp values are supported since:
> 1. there are usually only few classes of workloads for which it makes
>sense to boost/limit to different frequencies,
>e.g. background vs foreground, interactive vs low-priority
> 2. it allows a simpler and more memory/time efficient tracking of
>the per-CPU clamp values in the fast path.
> 
> The number of possible different clamp values is currently defined at
> compile time. Thus, setting a new clamp value for a task can result into
> a -ENOSPC error in case this will exceed the number of maximum different
> clamp values supported.
> 

I see that we drop the reference on the previous clamp group when a task changes
its clamp limits. What about exiting tasks which claimed clamp groups? Shouldn't
we drop the reference in that case as well?
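
To illustrate the life cycle being asked about, a hypothetical sketch (all
names are made up; this is not the patch's code):

struct clamp_group {
	unsigned int value;	/* clamp value this slot represents */
	unsigned int se_count;	/* number of tasks referencing this slot */
};

/* drop one reference; a slot with no users can be recycled */
static void clamp_group_put(struct clamp_group *cg)
{
	if (cg && --cg->se_count == 0)
		cg->value = 0;
}

/*
 * Hook in the task-exit path: without something like this, a task that
 * exits while still mapped to a clamp group would leak its reference.
 */
static void clamp_task_exit(struct clamp_group *cg)
{
	clamp_group_put(cg);
}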

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] stop_machine: Disable preemption after queueing stopper threads

2018-08-06 Thread Pavan Kondeti
Hi Prasad,

On Wed, Aug 01, 2018 at 01:07:03AM -0700, Sodagudi Prasad wrote:
> On 2018-07-30 14:07, Peter Zijlstra wrote:
> >On Mon, Jul 30, 2018 at 10:12:43AM -0700, Sodagudi Prasad wrote:
> >>How about including below change as well?  Currently, there is
> >>no way to
> >>identify thread migrations completed or not.  When we observe
> >>this issue,
> >>the symptom was work queue lock up. It is better to have some
> >>timeout here
> >>and induce the bug_on.
> >
> >You'd trigger the soft-lockup or hung-task detector I think. And
> >if not,
> >we ought to look at making it trigger at least one of those.
> >
> >>There is no way to identify the migration threads stuck or not.
> >
> >Should be pretty obvious from the splat generated by the above, no?
> Hi Peter and Thomas,
> 
> Thanks for your support.
> I have another question on this flow and retry mechanism used in
> this cpu_stop_queue_two_works() function using the global variable
> stop_cpus_in_progress.
> 
> This variable is getting used in various paths, such as task
> migration, set task affinity, and CPU hotplug.
> 
> For example cpu hotplug path, stop_cpus_in_progress variable getting
> set with true with out checking.
> takedown_cpu()
> --stop_machine_cpuslocked()
> ---stop_cpus()
> ---__stop_cpus()
> queue_stop_cpus_work()
> setting stop_cpus_in_progress to true directly.
> 
> But in the task migration path only, the stop_cpus_in_progress
> variable is used for retry.
> 
> I am thinking that stop_cpus_in_progress variable lead race
> conditions, where CPU hotplug and task migration happening
> simultaneously. Please correct me If my understanding wrong.
> 

The stop_cpus_in_progress variable is there to guard against out-of-order
queueing. The stopper locks do not protect this when cpu_stop_queue_two_works()
and stop_cpus() are executing in parallel.

The stop_one_cpu_{nowait}() functions are called to handle affinity changes and
load balancing. Since they queue the work on only one CPU, the
stop_cpus_in_progress protection is not needed.
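
A simplified sketch of that ordering guard (not the exact kernel code; the
locking and queueing details are elided):

static int cpu_stop_queue_two_works_sketch(int cpu1, struct cpu_stop_work *work1,
					   int cpu2, struct cpu_stop_work *work2)
{
retry:
	/* ... take both stoppers' locks ... */
	if (unlikely(stop_cpus_in_progress)) {
		/* stop_cpus() is queueing work on every CPU: back off */
		/* ... drop the locks ... */
		while (stop_cpus_in_progress)
			cpu_relax();
		goto retry;
	}
	/* ... queue work1 on cpu1 and work2 on cpu2, wake both stoppers ... */
	return 0;
}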

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH v2] stop_machine: Disable preemption when waking two stopper threads

2018-07-01 Thread Pavan Kondeti
Hi Isaac,

On Fri, Jun 29, 2018 at 01:55:12PM -0700, Isaac J. Manjarres wrote:
> When cpu_stop_queue_two_works() begins to wake the stopper
> threads, it does so without preemption disabled, which leads
> to the following race condition:
> 
> The source CPU calls cpu_stop_queue_two_works(), with cpu1
> as the source CPU, and cpu2 as the destination CPU. When
> adding the stopper threads to the wake queue used in this
> function, the source CPU stopper thread is added first,
> and the destination CPU stopper thread is added last.
> 
> When wake_up_q() is invoked to wake the stopper threads, the
> threads are woken up in the order that they are queued in,
> so the source CPU's stopper thread is woken up first, and
> it preempts the thread running on the source CPU.
> 
> The stopper thread will then execute on the source CPU,
> disable preemption, and begin executing multi_cpu_stop(),
> and wait for an ack from the destination CPU's stopper thread,
> with preemption still disabled. Since the worker thread that
> woke up the stopper thread on the source CPU is affine to the
> source CPU, and preemption is disabled on the source CPU, that
> thread will never run to dequeue the destination CPU's stopper
> thread from the wake queue, and thus, the destination CPU's
> stopper thread will never run, causing the source CPU's stopper
> thread to wait forever, and stall.
> 
> Disable preemption when waking the stopper threads in
> cpu_stop_queue_two_works() to ensure that the worker thread
> that is waking up the stopper threads isn't preempted
> by the source CPU's stopper thread, and permanently
> scheduled out, leaving the remaining stopper thread asleep
> in the wake queue.
> 
> Co-developed-by: Pavankumar Kondeti 
> Signed-off-by: Prasad Sodagudi 
> Signed-off-by: Pavankumar Kondeti 
> Signed-off-by: Isaac J. Manjarres 
> ---

You might want to add the below Fixes tag and CC stable.

Fixes: 0b26351b910f ("stop_machine, sched: Fix migrate_swap() vs. 
active_balance() deadlock")

>  kernel/stop_machine.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index f89014a..1ff523d 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -270,7 +270,11 @@ static int cpu_stop_queue_two_works(int cpu1, struct 
> cpu_stop_work *work1,
>   goto retry;
>   }
>  
> - wake_up_q(&wakeq);
> + if (!err) {
> + preempt_disable();
> + wake_up_q(&wakeq);
> + preempt_enable();
> + }
>  
>   return err;
>  }

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH] stop_machine: Remove cpu swap from stop_two_cpus

2018-06-28 Thread Pavan Kondeti
On Tue, Jun 26, 2018 at 02:28:26PM -0700, Isaac J. Manjarres wrote:
> When invoking migrate_swap(), stop_two_cpus() swaps the
> source and destination CPU IDs if the destination CPU
> ID is greater than the source CPU ID. This leads to the
> following race condition:
> 
> The source CPU invokes migrate_swap and sets itself as
> the source CPU, and sets the destination CPU to another
> CPU, such that the CPU ID of the destination CPU is
> greater than that of the source CPU ID, and invokes
> stop_two_cpus(cpu1=destination CPU, cpu2=source CPU,...)
> Now, stop_two_cpus sees that the destination CPU ID is
> greater than the source CPU ID, and performs the swap, so
> that cpu1=source CPU, and cpu2=destination CPU.
> 
> The source CPU calls cpu_stop_queue_two_works(), with cpu1
> as the source CPU, and cpu2 as the destination CPU. When
> adding the stopper threads to the wake queue used in this
> function, the source CPU stopper thread is added first,
> and the destination CPU stopper thread is added last.
> 
> When wake_up_q() is invoked to wake the stopper threads, the
> threads are woken up in the order that they are queued in,
> so the source CPU's stopper thread is woken up first, and
> it preempts the thread running on the source CPU.
> 
> The stopper thread will then execute on the source CPU,
> disable preemption, and begin executing multi_cpu_stop()
> and wait for an ack from the destination CPU's stopper thread,
> with preemption still disabled. Since the worker thread that
> woke up the stopper thread on the source CPU is affine to the
> source CPU, and preemption is disabled on the source CPU, that
> thread will never run to dequeue the destination CPU's stopper
> thread from the wake queue, and thus, the destination CPU's
> stopper thread will never run, causing the source CPU's stopper
> thread to wait forever, and stall.
> 
> Remove CPU ID swapping in stop_two_cpus() so that the
> source CPU's stopper thread is added to the wake queue last,
> so that the source CPU's stopper thread is woken up last,
> ensuring that all other threads that it depends on are woken
> up before it runs.
> 
> Co-developed-by: Prasad Sodagudi 
> Signed-off-by: Prasad Sodagudi 
> Signed-off-by: Isaac J. Manjarres 
> ---
>  kernel/stop_machine.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index f89014a..d10d633 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -307,8 +307,6 @@ int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, 
> cpu_stop_fn_t fn, void *arg)
>   cpu_stop_init_done(&done, 2);
>   set_state(&msdata, MULTI_STOP_PREPARE);
>  
> - if (cpu1 > cpu2)
> - swap(cpu1, cpu2);
>   if (cpu_stop_queue_two_works(cpu1, &work1, cpu2, &work2))
>   return -ENOENT;
>  

Nested spinlocks must be taken in the same order everywhere. If you don't,
you can create a circular dependency which leads to a deadlock. Sebastian
already pointed this out.

For example,

CPU2: stop_two_cpus(CPU0, CPU1)
CPU3: stop_two_cpus(CPU1, CPU0)

CPU2 may acquire CPU0's lock and then wait for CPU1's lock. At the same time,
CPU3, which acquired CPU1's lock, could be waiting for CPU0's lock. They are
stuck forever.
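
To make the ordering problem concrete, here is a minimal userspace sketch of
the same ABBA pattern (illustrative only; the two pthread mutexes stand in
for the two per-CPU stopper locks, and the thread names mirror the CPU2/CPU3
example above):

/* abba.c: two threads take the same pair of locks in opposite order,
 * like stop_two_cpus(CPU0, CPU1) racing with stop_two_cpus(CPU1, CPU0)
 * would do without the cpu1 > cpu2 swap.  Build: gcc -pthread abba.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t cpu0_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cpu1_lock = PTHREAD_MUTEX_INITIALIZER;

static void *cpu2_path(void *arg)	/* locks CPU0 then CPU1 */
{
	pthread_mutex_lock(&cpu0_lock);
	usleep(10000);			/* widen the race window */
	pthread_mutex_lock(&cpu1_lock);	/* blocks forever once cpu3_path holds it */
	pthread_mutex_unlock(&cpu1_lock);
	pthread_mutex_unlock(&cpu0_lock);
	return NULL;
}

static void *cpu3_path(void *arg)	/* locks CPU1 then CPU0 */
{
	pthread_mutex_lock(&cpu1_lock);
	usleep(10000);
	pthread_mutex_lock(&cpu0_lock);	/* blocks forever once cpu2_path holds it */
	pthread_mutex_unlock(&cpu0_lock);
	pthread_mutex_unlock(&cpu1_lock);
	return NULL;
}

int main(void)
{
	pthread_t t2, t3;

	pthread_create(&t2, NULL, cpu2_path, NULL);
	pthread_create(&t3, NULL, cpu3_path, NULL);
	pthread_join(t2, NULL);		/* typically never returns: deadlock */
	pthread_join(t3, NULL);
	printf("no deadlock this time\n");
	return 0;
}

Always taking the lower-numbered CPU's lock first, which is what the swap()
guarantees, removes the circular wait.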

Coming to the original problem described in the changelog, it happens because
the stopper threads are not woken atomically in cpu_stop_queue_two_works().

Can you check if the below patch (not tested :-)) helps?

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index f89014a..1ff523d 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -270,7 +270,11 @@ static int cpu_stop_queue_two_works(int cpu1, struct 
cpu_stop_work *work1,
goto retry;
}
 
-   wake_up_q(&wakeq);
+   if (!err) {
+   preempt_disable();
+   wake_up_q(&wakeq);
+   preempt_enable();
+   }
 
return err;
 }
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Question about wakeup granularity calculation in wakeup_preempt_entity()

2018-06-26 Thread Pavan Kondeti
Hi Peter,

I have a question about wakeup granularity calculation while checking
if the waking fair task can preempt the current running fair task.

static unsigned long
wakeup_gran(struct sched_entity *curr, struct sched_entity *se)
{
unsigned long gran = sysctl_sched_wakeup_granularity;

/*
 * Since its curr running now, convert the gran from real-time
 * to virtual-time in his units.
 *
 * By using 'se' instead of 'curr' we penalize light tasks, so
 * they get preempted easier. That is, if 'se' < 'curr' then
 * the resulting gran will be larger, therefore penalizing the
 * lighter, if otoh 'se' > 'curr' then the resulting gran will
 * be smaller, again penalizing the lighter task.
 *
 * This is especially important for buddies when the leftmost
 * task is higher priority than the buddy.
 */
return calc_delta_fair(gran, se);
}

The above comment says that the wakeup granularity will be larger when
'se' < 'curr'. But we completely ignore the weight of curr and scale the
wakeup granularity solely based on the weight of the waking task, i.e. se.

For example, when curr is nice:0 and se is nice:0, the wakeup granularity
is unchanged. However, when curr is nice:-15 and se is nice:-15, the
granularity becomes about 3% of its original value and favours the waking task.
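
A quick check of the numbers (standalone userspace program, assuming a 1 ms
base granularity for illustration and ignoring the fixed-point scaling
details; the weights 1024 and 29154 are the nice 0 and nice -15 entries of
the kernel's sched_prio_to_weight[] table):

#include <stdio.h>

int main(void)
{
	const unsigned long nice_0_load = 1024;    /* weight of a nice:0 task */
	const unsigned long nice_m15_load = 29154; /* weight of a nice:-15 task */
	const unsigned long gran_ns = 1000000;     /* 1 ms, for illustration */

	/* calc_delta_fair(gran, se) scales gran by NICE_0_LOAD / se->load.weight */
	printf("curr=0,   se=0:   gran = %lu ns\n",
	       gran_ns * nice_0_load / nice_0_load);
	printf("curr=-15, se=-15: gran = %lu ns (~3.5%% of the original)\n",
	       gran_ns * nice_0_load / nice_m15_load);
	return 0;
}

So with two equally heavy nice:-15 tasks the effective granularity shrinks to
roughly 35 us, which is why the freshly running curr gets preempted so easily.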

The git history around these functions revealed that there was once a
sched feature called ASYM_GRAN with which we could control whether the
granularity is scaled wrt curr or se. However, if we scale the granularity
wrt curr, then preemption becomes difficult for a higher-priority waking task.

Is there any reason why we don't want to take both curr and se into account?

scaled_granularity = granularity * (W(curr)/W(se))

I don't have any particular use case that actually shows a regression. I
spotted this behavior in traces where a lower or default prio task that is
riding on the wakeup bonus preempts a higher prio task which has just started
running. Setting sched_wakeup_granularity to a value higher than the wakeup
bonus may prevent this, but that would make things difficult for normal prio
tasks preempting a lower prio running task.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [RFC PATCH v3 07/10] sched/fair: Introduce an energy estimation helper function

2018-06-19 Thread Pavan Kondeti
On Mon, May 21, 2018 at 03:25:02PM +0100, Quentin Perret wrote:



>  
> +/*
> + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> + */
> +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int 
> dst_cpu)
> +{
> + unsigned long util, util_est;
> + struct cfs_rq *cfs_rq;
> +
> + /* Task is where it should be, or has no impact on cpu */
> + if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> + return cpu_util(cpu);
> +
> + cfs_rq = &cpu_rq(cpu)->cfs;
> + util = READ_ONCE(cfs_rq->avg.util_avg);
> +
> + if (dst_cpu == cpu)
> + util += task_util(p);
> + else
> + util = max_t(long, util - task_util(p), 0);
> +
> + if (sched_feat(UTIL_EST)) {
> + util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> + if (dst_cpu == cpu)
> + util_est += _task_util_est(p);
> + else
> + util_est = max_t(long, util_est - _task_util_est(p), 0);

For the UTIL_EST case, the waking task will not have any contribution to the
previous CPU's util_est, so we can just use the previous CPU's util_est as is
(see the sketch after the quoted code).

> + util = max(util, util_est);
> + }
> +
> + return min_t(unsigned long, util, capacity_orig_of(cpu));
> +}
> +
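
Something along these lines is what I mean (untested, relative to the hunk
above; it only drops the subtraction on the task's previous CPU):

	if (sched_feat(UTIL_EST)) {
		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
		/*
		 * A sleeping task has already been removed from
		 * util_est.enqueued at dequeue time, so there is nothing
		 * to discount on its previous CPU; only add its
		 * contribution when it is being placed on dst_cpu.
		 */
		if (dst_cpu == cpu)
			util_est += _task_util_est(p);
		util = max(util, util_est);
	}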

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [RFC PATCH v3 10/10] arch_topology: Start Energy Aware Scheduling

2018-06-19 Thread Pavan Kondeti
On Mon, May 21, 2018 at 03:25:05PM +0100, Quentin Perret wrote:



> +static void start_eas_workfn(struct work_struct *work);
> +static DECLARE_WORK(start_eas_work, start_eas_workfn);
> +
>  static int
>  init_cpu_capacity_callback(struct notifier_block *nb,
>  unsigned long val,
> @@ -204,6 +209,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
>   free_raw_capacity();
>   pr_debug("cpu_capacity: parsing done\n");
>   schedule_work(&parsing_done_work);
> + schedule_work(&start_eas_work);
>   }
>  
>   return 0;
> @@ -249,6 +255,19 @@ static void parsing_done_workfn(struct work_struct *work)
>   free_cpumask_var(cpus_to_visit);
>  }
>  
> +static void start_eas_workfn(struct work_struct *work)
> +{
> + /* Make sure the EM knows about the updated CPU capacities. */
> + rcu_read_lock();
> + em_rescale_cpu_capacity();
> + rcu_read_unlock();
> +
> + /* Inform the scheduler about the EM availability. */
> + cpus_read_lock();
> + rebuild_sched_domains();
> + cpus_read_unlock();
> +}

Rebuilding the sched domains is unnecessary for platforms that don't have an
energy model. In fact, we can completely avoid scheduling this work on such
platforms.
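
For example (untested; em_configured() is a made-up name for whatever query
the energy model framework could expose to say a model has been registered):

	/* Only kick the EAS start work when an energy model is registered */
	if (em_configured())
		schedule_work(&start_eas_work);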

There seems to be a sysfs interface exposed by this driver to change cpu_scale.
Should we worry about it? I don't know what the use case is for changing
cpu_scale from user space.

Are we expecting that the energy model is already populated by this time? If
it is not, should we not build the sched domains again after the energy model
is populated?

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [RFC PATCH v3 09/10] sched/fair: Select an energy-efficient CPU on task wake-up

2018-06-19 Thread Pavan Kondeti
On Tue, Jun 19, 2018 at 08:57:23AM +0100, Quentin Perret wrote:
> Hi Pavan,
> 
> On Tuesday 19 Jun 2018 at 10:36:01 (+0530), Pavan Kondeti wrote:
> > On Mon, May 21, 2018 at 03:25:04PM +0100, Quentin Perret wrote:
> > 
> > 
> > 
> > > + if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> > > + prev_energy = best_energy = compute_energy(p, prev_cpu);
> > > + else
> > > + prev_energy = best_energy = ULONG_MAX;
> > > +
> > > + for_each_freq_domain(sfd) {
> > > + unsigned long spare_cap, max_spare_cap = 0;
> > > + int max_spare_cap_cpu = -1;
> > > + unsigned long util;
> > > +
> > > + /* Find the CPU with the max spare cap in the freq. dom. */
> > > + for_each_cpu_and(cpu, freq_domain_span(sfd), 
> > > sched_domain_span(sd)) {
> > > + if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> > > + continue;
> > > +
> > > + if (cpu == prev_cpu)
> > > + continue;
> > > +
> > > + /* Skip CPUs that will be overutilized */
> > > + util = cpu_util_wake(cpu, p) + task_util;
> > > + cpu_cap = capacity_of(cpu);
> > > + if (cpu_cap * 1024 < util * capacity_margin)
> > > + continue;
> > > +
> > > + spare_cap = cpu_cap - util;
> > > + if (spare_cap > max_spare_cap) {
> > > + max_spare_cap = spare_cap;
> > > + max_spare_cap_cpu = cpu;
> > > + }
> > > + }
> > > +
> > > + /* Evaluate the energy impact of using this CPU. */
> > > + if (max_spare_cap_cpu >= 0) {
> > > + cur_energy = compute_energy(p, max_spare_cap_cpu);
> > > + if (cur_energy < best_energy) {
> > > + best_energy = cur_energy;
> > > + best_energy_cpu = max_spare_cap_cpu;
> > > + }
> > > + }
> > > + }
> > > +
> > > + /*
> > > +  * We pick the best CPU only if it saves at least 1.5% of the
> > > +  * energy used by prev_cpu.
> > > +  */
> > > + if ((prev_energy - best_energy) > (prev_energy >> 6))
> > > + return best_energy_cpu;
> > > +
> > > + return prev_cpu;
> > > +}
> > 
> > We are comparing the best_energy_cpu against prev_cpu even when prev_cpu
> > can't accommodate the waking task. Is this intentional? Should not we
> > discard the prev_cpu if it can't accommodate the task.
> > 
> > This can potentially make a BIG task run on a lower capacity CPU until
> > load balancer kicks in and corrects the situation.
> 
> We shouldn't enter find_energy_efficient_cpu() in the first place if the
> system is overutilized, so that shouldn't too much of an issue in
> general.
> 

With UTIL_EST enabled, the overutilization condition may get turned off when
that one BIG task goes to sleep. When it wakes up again, we may place it on
the previous CPU due to the above-mentioned issue.

It is not just about an existing overutilization condition: by placing this
task on prev_cpu, we may enter the overutilized state.

> But yeah, there is one small corner case where prev_cpu is overutilized
> and the system has not been flagged as such yet (when the tasks wakes-up
> shortly before the tick for ex). I think it's possible to cover this case
> by extending the "if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))"
> condition at the top of the function with a check on capacity_margin.
> 

LGTM.
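
For the record, I was thinking of something like this at the top of
find_energy_efficient_cpu() (untested; it mirrors the capacity_margin test
used in the loop of the quoted hunk):

	/* Treat a prev_cpu that cannot fit the task like a disallowed one */
	util = cpu_util_wake(prev_cpu, p) + task_util;
	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed) &&
	    capacity_of(prev_cpu) * 1024 >= util * capacity_margin)
		prev_energy = best_energy = compute_energy(p, prev_cpu);
	else
		prev_energy = best_energy = ULONG_MAX;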

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [RFC PATCH v3 06/10] sched: Add over-utilization/tipping point indicator

2018-06-19 Thread Pavan Kondeti
On Mon, May 21, 2018 at 03:25:01PM +0100, Quentin Perret wrote:



>   util_est_enqueue(&rq->cfs, p);
>   hrtick_update(rq);
> @@ -8121,11 +8144,12 @@ static bool update_nohz_stats(struct rq *rq, bool 
> force)
>   * @local_group: Does group contain this_cpu.
>   * @sgs: variable to hold the statistics for this group.
>   * @overload: Indicate more than one runnable task for any CPU.
> + * @overutilized: Indicate overutilization for any CPU.
>   */
>  static inline void update_sg_lb_stats(struct lb_env *env,
>   struct sched_group *group, int load_idx,
>   int local_group, struct sg_lb_stats *sgs,
> - bool *overload)
> + bool *overload, int *overutilized)
>  {
>   unsigned long load;
>   int i, nr_running;
> @@ -8152,6 +8176,9 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>   if (nr_running > 1)
>   *overload = true;
>  
> + if (cpu_overutilized(i))
> + *overutilized = 1;
> +

There is no need to check whether every CPU is overutilized once
*overutilized is marked as true, right?
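
i.e. the check could short-circuit, something like (untested):

	if (!*overutilized && cpu_overutilized(i))
		*overutilized = 1;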



>  
> @@ -8586,6 +8621,10 @@ static struct sched_group *find_busiest_group(struct 
> lb_env *env)
>* this level.
>*/
>   update_sd_lb_stats(env, &sds);
> +
> + if (sched_energy_enabled() && !READ_ONCE(env->dst_rq->rd->overutilized))
> + goto out_balanced;
> +

Is there any reason for sending no-hz idle kicks but then bailing out here
when the system is not overutilized?

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [RFC PATCH v3 09/10] sched/fair: Select an energy-efficient CPU on task wake-up

2018-06-18 Thread Pavan Kondeti
On Mon, May 21, 2018 at 03:25:04PM +0100, Quentin Perret wrote:



> + if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> + prev_energy = best_energy = compute_energy(p, prev_cpu);
> + else
> + prev_energy = best_energy = ULONG_MAX;
> +
> + for_each_freq_domain(sfd) {
> + unsigned long spare_cap, max_spare_cap = 0;
> + int max_spare_cap_cpu = -1;
> + unsigned long util;
> +
> + /* Find the CPU with the max spare cap in the freq. dom. */
> + for_each_cpu_and(cpu, freq_domain_span(sfd), 
> sched_domain_span(sd)) {
> + if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> + continue;
> +
> + if (cpu == prev_cpu)
> + continue;
> +
> + /* Skip CPUs that will be overutilized */
> + util = cpu_util_wake(cpu, p) + task_util;
> + cpu_cap = capacity_of(cpu);
> + if (cpu_cap * 1024 < util * capacity_margin)
> + continue;
> +
> + spare_cap = cpu_cap - util;
> + if (spare_cap > max_spare_cap) {
> + max_spare_cap = spare_cap;
> + max_spare_cap_cpu = cpu;
> + }
> + }
> +
> + /* Evaluate the energy impact of using this CPU. */
> + if (max_spare_cap_cpu >= 0) {
> + cur_energy = compute_energy(p, max_spare_cap_cpu);
> + if (cur_energy < best_energy) {
> + best_energy = cur_energy;
> + best_energy_cpu = max_spare_cap_cpu;
> + }
> + }
> + }
> +
> + /*
> +  * We pick the best CPU only if it saves at least 1.5% of the
> +  * energy used by prev_cpu.
> +  */
> + if ((prev_energy - best_energy) > (prev_energy >> 6))
> + return best_energy_cpu;
> +
> + return prev_cpu;
> +}

We are comparing the best_energy_cpu against prev_cpu even when prev_cpu
can't accommodate the waking task. Is this intentional? Should we not
discard prev_cpu if it can't accommodate the task?

This can potentially make a BIG task run on a lower-capacity CPU until the
load balancer kicks in and corrects the situation.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [RFC PATCH] kernel/sched/core: busy wait before going idle

2018-04-23 Thread Pavan Kondeti
Hi Nick,

On Sun, Apr 15, 2018 at 11:31:49PM +1000, Nicholas Piggin wrote:
> This is a quick hack for comments, but I've always wondered --
> if we have a short term polling idle states in cpuidle for performance
> -- why not skip the context switch and entry into all the idle states,
> and just wait for a bit to see if something wakes up again.
> 
> It's not uncommon to see various going-to-idle work in kernel profiles.
> This might be a way to reduce that (and just the cost of switching
> registers and kernel stack to idle thread). This can be an important
> path for single thread request-response throughput.
> 
> tbench bandwidth seems to be improved (the numbers aren't too stable
> but they pretty consistently show some gain). 10-20% would be a pretty
> nice gain for such workloads
> 
> clients     1     2     4     8    16   128
> vanilla   232   467   823  1819  3218  9065
> patched   310   503   962  2465  3743  9820
> 



> +idle_spin_end:
>   /* Promote REQ to ACT */
>   rq->clock_update_flags <<= 1;
>   update_rq_clock(rq);
> @@ -3437,6 +3439,32 @@ static void __sched notrace __schedule(bool preempt)
>   if (unlikely(signal_pending_state(prev->state, prev))) {
>   prev->state = TASK_RUNNING;
>   } else {
> + /*
> +  * Busy wait before switching to idle thread. This
> +  * is marked unlikely because we're idle so jumping
> +  * out of line doesn't matter too much.
> +  */
> + if (unlikely(do_idle_spin && rq->nr_running == 1)) {
> + u64 start;
> +
> + do_idle_spin = false;
> +
> + rq->clock_update_flags &= 
> ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
> + rq_unlock_irq(rq, &rf);
> +
> + spin_begin();
> + start = local_clock();
> + while (!need_resched() && prev->state &&
> + !signal_pending_state(prev->state, 
> prev)) {
> + spin_cpu_relax();
> + if (local_clock() - start > 100)
> + break;
> + }

A couple of comments/questions.

When an RT task is doing this busy loop:

(1) need_resched() may not be set even if a fair/normal task is enqueued on
this CPU, so the spin does not terminate (see the untested tweak below).

(2) Any lower prio RT task waking up on this CPU may migrate to another CPU,
thinking this CPU is busy with a higher prio RT task.
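
For (1), an untested variant of the loop in the hunk above could also watch
the runqueue, since a fair task waking here does not set need_resched()
against a spinning RT task (the lockless peek is racy but only used to bound
the spin):

			while (!need_resched() && prev->state &&
			       !signal_pending_state(prev->state, prev)) {
				/* racy peek, just to bound the spin */
				if (READ_ONCE(rq->nr_running) > 1)
					break;
				spin_cpu_relax();
				if (local_clock() - start > 100)
					break;
			}

This still does not address (2), though.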

> + spin_end();
> +
> + rq_lock_irq(rq, &rf);
> + goto idle_spin_end;
> + }
>   deactivate_task(rq, prev, DEQUEUE_SLEEP | 
> DEQUEUE_NOCLOCK);
>   prev->on_rq = 0;
>  
> -- 
> 2.17.0
> 

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH v3 2/3] sched/fair: use util_est in LB and WU paths

2018-01-25 Thread Pavan Kondeti
On Wed, Jan 24, 2018 at 07:31:38PM +, Patrick Bellasi wrote:
> 
> > > +   /*
> > > +* These are the main cases covered:
> > > +* - if *p is the only task sleeping on this CPU, then:
> > > +*  cpu_util (== task_util) > util_est (== 0)
> > > +*   and thus we return:
> > > +*  cpu_util_wake = (cpu_util - task_util) = 0
> > > +*
> > > +* - if other tasks are SLEEPING on the same CPU, which is just 
> > > waking
> > > +*   up, then:
> > > +*  cpu_util >= task_util
> > > +*  cpu_util > util_est (== 0)
> > > +*   and thus we discount *p's blocked utilization to return:
> > > +*  cpu_util_wake = (cpu_util - task_util) >= 0
> > > +*
> > > +* - if other tasks are RUNNABLE on that CPU and
> > > +*  util_est > cpu_util
> > > +*   then we use util_est since it returns a more restrictive
> > > +*   estimation of the spare capacity on that CPU, by just 
> > > considering
> > > +*   the expected utilization of tasks already runnable on that 
> > > CPU.
> > > +*/
> > > +   util_est = cpu_rq(cpu)->cfs.util_est_runnable;
> > > +   util = max(util, util_est);
> > > +
> > > +   return util;
> 
> I should instead clamp util before returning it! ;-)
> 
> > May be a separate patch to remove  the clamping part?
> 
> No, I think we should keep cpu_util_wake clamped to not affect the existing
> call sites. I just need to remove it where not needed (done) and add it where
> needed (will do on the next iteration).

cpu_util_wake() is called only from capacity_spare_wake(); there are no other
callsites. capacity_spare_wake() already clamps the return value of
cpu_util_wake() to the CPU capacity, so the clamping here is not needed, I think.
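
For reference, the only caller (after commit f453ae2200b0) looks roughly like
this, and it already floors the spare capacity at zero:

	static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
	{
		return max_t(long, capacity_of(cpu) - cpu_util_wake(cpu, p), 0);
	}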

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [PATCH v3 2/3] sched/fair: use util_est in LB and WU paths

2018-01-24 Thread Pavan Kondeti
Hi Patrick,

On Tue, Jan 23, 2018 at 06:08:46PM +, Patrick Bellasi wrote:
>  static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
>  {
> - unsigned long util, capacity;
> + long util, util_est;
>  
>   /* Task has no contribution or is new */
>   if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
> - return cpu_util(cpu);
> + return cpu_util_est(cpu);
>  
> - capacity = capacity_orig_of(cpu);
> - util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
> + /* Discount task's blocked util from CPU's util */
> + util = cpu_util(cpu) - task_util(p);
> + util = max(util, 0L);
>  
> - return (util >= capacity) ? capacity : util;
> + if (!sched_feat(UTIL_EST))
> + return util;

At first, it is not clear to me why you are not clamping the utilization to
the CPU's original capacity. It looks like it is not needed any more after
the inclusion of commit f453ae2200b0 ("sched/fair: Consider RT/IRQ pressure
in capacity_spare_wake()"). Maybe a separate patch to remove the clamping
part?

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

2018-01-21 Thread Pavan Kondeti
On Sun, Jan 21, 2018 at 05:11:17PM +0100, Frederic Weisbecker wrote:
> On Sat, Jan 20, 2018 at 02:11:39PM +0530, Pavan Kondeti wrote:
> 
> Hi Pavan,
> 
> 
> > I have couple questions/comments.
> > 
> > (1) Since the work is queued on a bounded per-cpu worker, we may run
> > into a deadlock if a TASKLET is killed from another work running on
> > the same bounded per-cpu worker.
> > 
> > For example,
> > 
> > (1) Schedule a TASKLET on CPU#0 from IRQ.
> > (2) Another IRQ comes on the same CPU and we queue a work to kill
> > the TASKLET. 
> > (3) The TASKLET vector is deferred to workqueue.
> > (4) We run the TASKLET kill work and wait for the TASKLET to finish,
> > which won't happen.
> > 
> > We can fix this by queueing the TASKLET kill work on an unbounded
> > workqueue so that this runs in parallel with TASKLET vector work.
> > 
> > Just wanted to know if we have to be aware of this *condition*.
> 
> But IIRC the workqueues have several workers per CPU so the tasklet to
> be killed can run while the tasklet killer yields.
> 
AFAIK, the work items queued via schedule_work_on() go to the system_wq,
which is bound to a CPU with concurrency restrictions. If any work item
(in this case the tasklet kill) is being executed on this bound worker,
the next items have to wait. Forward progress happens only when the
current work finishes or sleeps.

This also makes me wonder what happens if a CPU-hogging work item gets
executed on the system_wq while softirq work is pending. The softirq work
gets starved, which won't happen with the current ksoftirqd design. Ideally
the CPU-hogging work should not be queued on the system_wq and instead
should be queued on a CPU-intensive workqueue (WQ_CPU_INTENSIVE) to be
exempt from concurrency management. Maybe we need some special workqueue
which is per-CPU bound but not subject to concurrency management.
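
Something like this is what I have in mind (untested; softirq_wq and
softirq_wq_init() are made-up names, only the workqueue flags and APIs are
real):

	static struct workqueue_struct *softirq_wq;

	static int __init softirq_wq_init(void)
	{
		/* per-cpu, but exempt from concurrency management */
		softirq_wq = alloc_workqueue("softirq", WQ_CPU_INTENSIVE, 0);
		return softirq_wq ? 0 : -ENOMEM;
	}
	early_initcall(softirq_wq_init);

	/* and in do_softirq_workqueue(), instead of schedule_work_on(): */
	queue_work_on(smp_processor_id(), softirq_wq, &vector->work);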

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

2018-01-20 Thread Pavan Kondeti
Hi Frederic,

On Fri, Jan 19, 2018 at 04:46:12PM +0100, Frederic Weisbecker wrote:
> Some softirq vectors can be more CPU hungry than others. Especially
> networking may sometimes deal with packet storm and need more CPU than
> IRQ tail can offer without inducing scheduler latencies. In this case
> the current code defers to ksoftirqd that behaves nicer. Now this nice
> behaviour can be bad for other IRQ vectors that usually need quick
> processing.
> 
> To solve this we only defer to threading the vectors that got
> re-enqueued on IRQ tail processing and leave the others inline on IRQ
> tail service. This is achieved using workqueues with
> per-CPU/per-vector worklets.
> 
> Note ksoftirqd is not yet removed as it is still needed for threaded IRQs
> mode.
> 



> +static void do_softirq_workqueue(u32 pending)
> +{
> + struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
> + struct softirq_action *h = softirq_vec;
> + int softirq_bit;
> +
> + pending &= ~softirq->pending_work_mask;
> +
> + while ((softirq_bit = ffs(pending))) {
> + struct vector *vector;
> + unsigned int vec_nr;
> +
> + h += softirq_bit - 1;
> + vec_nr = h - softirq_vec;
> + softirq->pending_work_mask |= BIT(vec_nr);
> + vector = &softirq->vector[vec_nr];
> + schedule_work_on(smp_processor_id(), &vector->work);
> + h++;
> + pending >>= softirq_bit;
> + }
> +}
> +

I have a couple of questions/comments.

(1) Since the work is queued on a bounded per-cpu worker, we may run
into a deadlock if a TASKLET is killed from another work running on
the same bounded per-cpu worker.

For example,

(1) Schedule a TASKLET on CPU#0 from IRQ.
(2) Another IRQ comes on the same CPU and we queue a work to kill
the TASKLET. 
(3) The TASKLET vector is deferred to workqueue.
(4) We run the TASKLET kill work and wait for the TASKLET to finish,
which won't happen.

We can fix this by queueing the TASKLET kill work on an unbound
workqueue so that it runs in parallel with the TASKLET vector work
(illustrated below).

Just wanted to know if we have to be aware of this *condition*.
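
Illustrative only (kill_work and tasklet_kill_work_fn() are made-up names for
the driver side; system_unbound_wq is the existing unbound workqueue):

	/* Queue the tasklet-kill step on an unbound workqueue so it cannot
	 * get stuck behind the deferred softirq vector work running on the
	 * same bound per-cpu worker.
	 */
	INIT_WORK(&kill_work, tasklet_kill_work_fn);
	queue_work(system_unbound_wq, &kill_work);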

(2) The ksoftirqd thread gets parked when a CPU is hotplugged out, so
there is a guarantee that the softirq handling never happens on
another CPU. Whereas a bound worker gets detached and the queued
work can run on another CPU. I guess some special handling is
needed for hotplug.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
Hi Steve,

On Fri, Jan 19, 2018 at 02:51:15PM -0500, Steven Rostedt wrote:
> On Sat, 20 Jan 2018 00:27:56 +0530
> Pavan Kondeti <pkond...@codeaurora.org> wrote:
> 
> > Hi Steve,
> > 
> > Thanks for the patch.
> > 
> > On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> > > On Fri, 19 Jan 2018 13:11:21 -0500
> > > Steven Rostedt <rost...@goodmis.org> wrote:
> > >   
> > > >  void rto_push_irq_work_func(struct irq_work *work)
> > > >  {
> > > > +   struct root_domain *rd =
> > > > +   container_of(work, struct root_domain, rto_push_work);
> > > > struct rq *rq;  
> > > 
> > > Notice that I also remove the dependency on rq from getting the rd.
> > >   
> > 
> > Nice. This snippet itself solves the original problem I reported.
> > I will test your patch and let you know the results.
> > 
> >
> 
> I'll break the patch up into two then. One with this snippet, and the
> other with the rd ref counting.
> 

Yeah, this snippet fixed the original problem.

I have not seen a "use after free" of rd in my testing, but I can see
we were operating on an rd whose refcount was 0. After applying your
refcount patch, it never happened. I also verified that the rd gets
freed via the IRQ work dropping the last reference.

Thanks for your help with the patches. Please copy linux-stable for
the 1st patch.

Feel free to use
Tested-by: Pavankumar Kondeti <pkond...@codeaurora.org>

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
Hi Steve,

Thanks for the patch.

On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 13:11:21 -0500
> Steven Rostedt  wrote:
> 
> >  void rto_push_irq_work_func(struct irq_work *work)
> >  {
> > +   struct root_domain *rd =
> > +   container_of(work, struct root_domain, rto_push_work);
> > struct rq *rq;
> 
> Notice that I also remove the dependency on rq from getting the rd.
> 

Nice. This snippet itself solves the original problem I reported.
I will test your patch and let you know the results.

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
On Fri, Jan 19, 2018 at 01:11:21PM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 23:16:17 +0530
> Pavan Kondeti <pkond...@codeaurora.org> wrote:
> 
> > I am thinking of another problem because of the race between
> > rto_push_irq_work_func() and rq_attach_root() where rq->rd is modified.
> > 
> > Lets say, we cache the rq->rd here and queued the IRQ work on a remote
> > CPU. In the mean time, the rq_attach_root() might drop all the references
> > to this cached (old) rd and wants to free it. The rq->rd is freed in
> > RCU-sched callback. If that remote CPU is in RCU quiescent state, the rq->rd
> > can get freed before the IRQ work is executed. This results in the 
> > corruption
> > of the remote  CPU's IRQ work list. Right?
> > 
> > Taking rq->lock in rto_push_irq_work_func() also does not help here. 
> > Probably
> > we have to wait for the IRQ work to finish before freeing the older root 
> > domain
> > in RCU-sched callback.
> 
> I was wondering about this too. Yeah, it would require an RCU like
> update. Once the rd was unreferenced, it would need to wait for the
> irq works to to finish before freeing it.
> 
> The easy way to do this is to simply up the refcount when sending the
> domain. Something like this:
> 
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 862a513adca3..89a086ed2b16 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
>   * the rt_loop_next will cause the iterator to perform another scan.
>   *
>   */
> -static int rto_next_cpu(struct rq *rq)
> +static int rto_next_cpu(struct root_domain *rd)
>  {
> - struct root_domain *rd = rq->rd;
>   int next;
>   int cpu;
>  
> @@ -1985,19 +1984,24 @@ static void tell_cpu_to_push(struct rq *rq)
>* Otherwise it is finishing up and an ipi needs to be sent.
>*/
>   if (rq->rd->rto_cpu < 0)
> - cpu = rto_next_cpu(rq);
> + cpu = rto_next_cpu(rq->rd);
>  
>   raw_spin_unlock(&rq->rd->rto_lock);
>  
>   rto_start_unlock(&rq->rd->rto_loop_start);
>  
> - if (cpu >= 0)
> + if (cpu >= 0) {
> + /* Make sure the rd does not get freed while pushing */
> + sched_get_rd(rq->rd);
> + irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> + }
>  }

Since this is covered by rq->lock, it is guaranteed that we increment the
refcount on the older rd before the RCU-sched callback is queued in
rq_attach_root(). Either we keep the older rd alive or we use the updated rd.

We are good here, I think.
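
For completeness, the release side would then sit in
rto_push_irq_work_func() itself, roughly as below (a sketch only; it
assumes sched_get_rd()/sched_put_rd() are the refcount helpers implied
by your snippet and that rto_next_cpu() takes the root_domain):

void rto_push_irq_work_func(struct irq_work *work)
{
        /* Pinned by the sched_get_rd() done before queueing this work. */
        struct root_domain *rd =
                container_of(work, struct root_domain, rto_push_work);
        struct rq *rq = this_rq();
        int cpu;

        if (has_pushable_tasks(rq)) {
                raw_spin_lock(&rq->lock);
                push_rt_tasks(rq);
                raw_spin_unlock(&rq->lock);
        }

        raw_spin_lock(&rd->rto_lock);
        /* Pass the IPI to the next rt overloaded queue */
        cpu = rto_next_cpu(rd);
        raw_spin_unlock(&rd->rto_lock);

        if (cpu < 0) {
                /* Loop is done: drop the reference taken at queueing time. */
                sched_put_rd(rd);
                return;
        }

        /* The reference is handed over to the next CPU in the chain. */
        irq_work_queue_on(&rd->rto_push_work, cpu);
}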

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
On Fri, Jan 19, 2018 at 10:03:53AM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 14:53:05 +0530
> Pavan Kondeti <pkond...@codeaurora.org> wrote:
> 
> > I am seeing "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
> > stable kernel based system. This issue is observed only after
> > inclusion of this patch. It appears to me that rq->rd can change
> > between spinlock is acquired and released in rto_push_irq_work_func()
> > IRQ work if hotplug is in progress. It was only reported couple of
> > times during long stress testing. The issue can be easily reproduced
> > if an artificial delay is introduced between lock and unlock of
> > rto_lock. The rq->rd is changed under rq->lock, so we can protect this
> > race with rq->lock. The below patch solved the problem. we are taking
> > rq->lock in pull_rt_task()->tell_cpu_to_push(), so I extended the same
> > here. Please let me know your thoughts on this.
> 
> As so rq->rd can change. Interesting.
> 
> > 
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index d863d39..478192b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> > raw_spin_unlock(&rq->lock);
> > }
> > 
> > +   raw_spin_lock(&rq->lock);
> 
> 
> What about just saving the rd then?
> 
>   struct root_domain *rd;
> 
>   rd = READ_ONCE(rq->rd);
> 
> then use that. Then we don't need to worry about it changing.
> 

I am thinking of another problem, caused by the race between
rto_push_irq_work_func() and rq_attach_root(), where rq->rd is modified.

Let's say we cache rq->rd here and queue the IRQ work on a remote CPU. In
the meantime, rq_attach_root() might drop all the references to this cached
(old) rd and want to free it. The rd is freed in an RCU-sched callback. If
that remote CPU is in an RCU quiescent state, the rd can get freed before
the IRQ work is executed. This results in corruption of the remote CPU's
IRQ work list. Right?

Taking rq->lock in rto_push_irq_work_func() also does not help here. We
probably have to wait for the IRQ work to finish before freeing the older
root domain in the RCU-sched callback.
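
For context, the free path being raced against looks roughly like this
(a paraphrased sketch of rq_attach_root()/free_rootdomain(), not the
exact 4.9 code):

static void free_rootdomain(struct rcu_head *rcu)
{
        struct root_domain *rd = container_of(rcu, struct root_domain, rcu);

        /*
         * If a remote CPU still has rd->rto_push_work queued at this
         * point, freeing rd corrupts that CPU's irq_work list.
         */
        free_cpumask_var(rd->rto_mask);
        free_cpumask_var(rd->online);
        free_cpumask_var(rd->span);
        kfree(rd);
}

void rq_attach_root(struct rq *rq, struct root_domain *rd)
{
        struct root_domain *old_rd = NULL;
        unsigned long flags;

        raw_spin_lock_irqsave(&rq->lock, flags);

        if (rq->rd) {
                old_rd = rq->rd;
                /* ... detach rq from old_rd ... */
                if (!atomic_dec_and_test(&old_rd->refcount))
                        old_rd = NULL;
        }

        atomic_inc(&rd->refcount);
        rq->rd = rd;
        /* ... attach rq to the new rd ... */

        raw_spin_unlock_irqrestore(&rq->lock, flags);

        if (old_rd)
                call_rcu_sched(&old_rd->rcu, free_rootdomain);
}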

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
Hi Steven,

On Fri, Jan 19, 2018 at 10:03:53AM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 14:53:05 +0530
> Pavan Kondeti <pkond...@codeaurora.org> wrote:
> 
> > I am seeing "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
> > stable kernel based system. This issue is observed only after
> > inclusion of this patch. It appears to me that rq->rd can change
> > between spinlock is acquired and released in rto_push_irq_work_func()
> > IRQ work if hotplug is in progress. It was only reported couple of
> > times during long stress testing. The issue can be easily reproduced
> > if an artificial delay is introduced between lock and unlock of
> > rto_lock. The rq->rd is changed under rq->lock, so we can protect this
> > race with rq->lock. The below patch solved the problem. we are taking
> > rq->lock in pull_rt_task()->tell_cpu_to_push(), so I extended the same
> > here. Please let me know your thoughts on this.
> 
> As so rq->rd can change. Interesting.
> 
> > 
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index d863d39..478192b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> > raw_spin_unlock(&rq->lock);
> > }
> > 
> > +   raw_spin_lock(&rq->lock);
> 
> 
> What about just saving the rd then?
> 
>   struct root_domain *rd;
> 
>   rd = READ_ONCE(rq->rd);
> 
> then use that. Then we don't need to worry about it changing.
> 

Yeah, it should work. I will give it a try and send the patch
for review.

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.



Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

2018-01-19 Thread Pavan Kondeti
Hi Steven,

>  /* Called from hardirq context */
> -static void try_to_push_tasks(void *arg)
> +void rto_push_irq_work_func(struct irq_work *work)
>  {
> -   struct rt_rq *rt_rq = arg;
> -   struct rq *rq, *src_rq;
> -   int this_cpu;
> +   struct rq *rq;
> int cpu;
>
> -   this_cpu = rt_rq->push_cpu;
> +   rq = this_rq();
>
> -   /* Paranoid check */
> -   BUG_ON(this_cpu != smp_processor_id());
> -
> -   rq = cpu_rq(this_cpu);
> -   src_rq = rq_of_rt_rq(rt_rq);
> -
> -again:
> +   /*
> +* We do not need to grab the lock to check for has_pushable_tasks.
> +* When it gets updated, a check is made if a push is possible.
> +*/
> if (has_pushable_tasks(rq)) {
> raw_spin_lock(&rq->lock);
> -   push_rt_task(rq);
> +   push_rt_tasks(rq);
> raw_spin_unlock(&rq->lock);
> }
>
> -   /* Pass the IPI to the next rt overloaded queue */
> -   raw_spin_lock(&rt_rq->push_lock);
> -   /*
> -* If the source queue changed since the IPI went out,
> -* we need to restart the search from that CPU again.
> -*/
> -   if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
> -   rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
> -   rt_rq->push_cpu = src_rq->cpu;
> -   }
> +   raw_spin_lock(&rq->rd->rto_lock);
>
> -   cpu = find_next_push_cpu(src_rq);
> +   /* Pass the IPI to the next rt overloaded queue */
> +   cpu = rto_next_cpu(rq);
>
> -   if (cpu >= nr_cpu_ids)
> -   rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
> -   raw_spin_unlock(&rt_rq->push_lock);
> +   raw_spin_unlock(&rq->rd->rto_lock);
>
> -   if (cpu >= nr_cpu_ids)
> +   if (cpu < 0)
> return;

I am seeing a "spinlock already unlocked" BUG on rd->rto_lock on a 4.9
stable kernel based system. This issue is observed only after the
inclusion of this patch. It appears to me that rq->rd can change between
the points where the spinlock is acquired and released in the
rto_push_irq_work_func() IRQ work if hotplug is in progress. It was
reported only a couple of times during long stress testing. The issue can
be easily reproduced if an artificial delay is introduced between the lock
and unlock of rto_lock. rq->rd is changed under rq->lock, so we can
protect this race with rq->lock. The patch below solved the problem. We
are already taking rq->lock in pull_rt_task()->tell_cpu_to_push(), so I
extended the same approach here. Please let me know your thoughts on this.

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d863d39..478192b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
raw_spin_unlock(&rq->lock);
}

+   raw_spin_lock(&rq->lock);
raw_spin_lock(&rq->rd->rto_lock);

/* Pass the IPI to the next rt overloaded queue */
@@ -2291,11 +2292,10 @@ void rto_push_irq_work_func(struct irq_work *work)

raw_spin_unlock(&rq->rd->rto_lock);

-   if (cpu < 0)
-   return;
-
/* Try the next RT overloaded CPU */
-   irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+   if (cpu >= 0)
+   irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+   raw_spin_unlock(&rq->lock);
 }
 #endif /* HAVE_RT_PUSH_IPI */


Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


Re: [RFC 2/3] sched/fair: use util_est in LB

2017-09-04 Thread Pavan Kondeti
On Mon, Sep 4, 2017 at 7:48 PM, Patrick Bellasi <patrick.bell...@arm.com> wrote:
> On 29-Aug 10:15, Pavan Kondeti wrote:
>> On Fri, Aug 25, 2017 at 3:50 PM, Patrick Bellasi
>> <patrick.bell...@arm.com> wrote:
>> > When the scheduler looks at the CPU utlization, the current PELT value
>> > for a CPU is returned straight away. In certain scenarios this can have
>> > undesired side effects on task placement.
>> >
>>
>> 
>>
>> > +/**
>> > + * cpu_util_est: estimated utilization for the specified CPU
>> > + * @cpu: the CPU to get the estimated utilization for
>> > + *
>> > + * The estimated utilization of a CPU is defined to be the maximum 
>> > between its
>> > + * PELT's utilization and the sum of the estimated utilization of the 
>> > tasks
>> > + * currently RUNNABLE on that CPU.
>> > + *
>> > + * This allows to properly represent the expected utilization of a CPU 
>> > which
>> > + * has just got a big task running since a long sleep period. At the same 
>> > time
>> > + * however it preserves the benefits of the "blocked load" in describing 
>> > the
>> > + * potential for other tasks waking up on the same CPU.
>> > + *
>> > + * Return: the estimated utlization for the specified CPU
>> > + */
>> > +static inline unsigned long cpu_util_est(int cpu)
>> > +{
> >> > +   struct sched_avg *sa = &cpu_rq(cpu)->cfs.avg;
>> > +   unsigned long util = cpu_util(cpu);
>> > +
>> > +   if (!sched_feat(UTIL_EST))
>> > +   return util;
>> > +
>> > +   return max(util, util_est(sa, UTIL_EST_LAST));
>> > +}
>> > +
>> >  static inline int task_util(struct task_struct *p)
>> >  {
>> > return p->se.avg.util_avg;
>> > @@ -6007,11 +6033,19 @@ static int cpu_util_wake(int cpu, struct 
>> > task_struct *p)
>> >
>> > /* Task has no contribution or is new */
>> > if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
>> > -   return cpu_util(cpu);
>> > +   return cpu_util_est(cpu);
>> >
>> > capacity = capacity_orig_of(cpu);
>> > util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 
>> > 0);
>> >
>> > +   /*
>> > +* Estimated utilization tracks only tasks already enqueued, but 
>> > still
>> > +* sometimes can return a bigger value than PELT, for example when 
>> > the
>> > +* blocked load is negligible wrt the estimated utilization of the
>> > +* already enqueued tasks.
>> > +*/
>> > +   util = max_t(long, util, cpu_util_est(cpu));
>> > +
>>
>> We are supposed to discount the task's util from its CPU. But the
>> cpu_util_est() can potentially return cpu_util() which includes the
>> task's utilization.
>
> You right, this instead should cover all the cases:
>
> ---8<---
>  static int cpu_util_wake(int cpu, struct task_struct *p)
>  {
> -   unsigned long util, capacity;
> +   unsigned long util_est = cpu_util_est(cpu);
> +   unsigned long capacity;
>
> /* Task has no contribution or is new */
> if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
> -   return cpu_util(cpu);
> +   return util_est;
>
> capacity = capacity_orig_of(cpu);
> -   util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
> +   if (cpu_util(cpu) > util_est)
> +   util = max_t(long, cpu_util(cpu) - task_util(p), 0);
> +   else
> +   util = util_est;
>
> return (util >= capacity) ? capacity : util;
>  }
> ---8<---
>
> Indeed:
>
> - if *p is the only task sleeping on that CPU, then:
>   (cpu_util == task_util) > (cpu_util_est == 0)
>   and thus we return:
>   (cpu_util - task_util) == 0
>
> - if other tasks are SLEEPING on the same CPU, which however is IDLE, then:
>   cpu_util > (cpu_util_est == 0)
>   and thus we discount *p's blocked load by returning:
>   (cpu_util - task_util) >= 0
>
> - if other tasks are RUNNABLE on that CPU and
>   (cpu_util_est > cpu_util)
>   then we wanna use cpu_util_est since it returns a more restrictive
>   estimation of the spare capacity on that CPU, by just considering
>   the expected utilization of tasks already runnable on that CPU.
>
> What do you think?
>

This looks good to me.
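
As a quick sanity check of those three cases, here is a standalone toy
model (plain C, made-up utilization numbers; cpu_util_wake_model() just
mirrors the revised logic above with scalar inputs):

#include <stdio.h>

/* Mirrors the revised cpu_util_wake() decision with scalar inputs. */
static unsigned long cpu_util_wake_model(unsigned long cpu_util,
                                         unsigned long task_util,
                                         unsigned long util_est,
                                         unsigned long capacity)
{
        unsigned long util;

        if (cpu_util > util_est)
                util = cpu_util > task_util ? cpu_util - task_util : 0;
        else
                util = util_est;

        return util >= capacity ? capacity : util;
}

int main(void)
{
        unsigned long cap = 1024;

        /* p is the only (sleeping) task: all of cpu_util is p's own. */
        printf("only p sleeping:    %lu\n",
               cpu_util_wake_model(300, 300, 0, cap));   /* -> 0   */

        /* Other tasks sleeping too, CPU idle: discount only p's share. */
        printf("others sleeping:    %lu\n",
               cpu_util_wake_model(500, 300, 0, cap));   /* -> 200 */

        /* Runnable tasks present and util_est dominates. */
        printf("util_est dominates: %lu\n",
               cpu_util_wake_model(400, 300, 600, cap)); /* -> 600 */

        return 0;
}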

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


Re: [RFC 1/3] sched/fair: add util_est on top of PELT

2017-08-29 Thread Pavan Kondeti
On Fri, Aug 25, 2017 at 3:50 PM, Patrick Bellasi
 wrote:
> The util_avg signal computed by PELT is too variable for some use-cases.
> For example, a big task waking up after a long sleep period will have its
> utilization almost completely decayed. This introduces some latency before
> schedutil will be able to pick the best frequency to run a task.
>



> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index c28b182c9833..8d7bc55f68d5 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -26,6 +26,7 @@
>  #include 
>  #include 
>  #include 
> > +#include <linux/average.h>
>
>  /* task_struct member predeclarations (sorted alphabetically): */
>  struct audit_context;
> @@ -277,6 +278,16 @@ struct load_weight {
> u32 inv_weight;
>  };
>
> +/**
> + * Utilizaton's Exponential Weighted Moving Average (EWMA)
> + *
> + * Support functions to track an EWMA for the utilization of SEs and RQs. New
> + * samples will be added to the moving average each time a task completes an
> + * activation. Thus the weight is chosen so that the EWMA wil be relatively
> + * insensitive to transient changes to the task's workload.
> + */
> +DECLARE_EWMA(util, 0, 4);
> +
>  /*

Should the factor be 1 instead of 0? I.e., 25% contribution from the
most recent sample.
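
For concreteness, the weighting being discussed (new sample contributing
1/4, i.e. 25%) reduces to the update below, independent of how
DECLARE_EWMA() maps its precision/weight arguments (standalone sketch,
not the kernel implementation):

#include <stdio.h>

/* EWMA with weight 1/4: avg <- 3/4 * avg + 1/4 * sample. */
static unsigned long ewma_add(unsigned long avg, unsigned long sample)
{
        return (avg * 3 + sample) / 4;
}

int main(void)
{
        /* A task that was mostly idle (util ~100) suddenly runs big (~800). */
        unsigned long avg = 100;

        for (int i = 0; i < 4; i++) {
                avg = ewma_add(avg, 800);
                printf("after sample %d: %lu\n", i + 1, avg);
        }
        /* Converges toward 800: 275, 406, 504, 578, ... */
        return 0;
}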

Thanks,
Pavan


-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project

