Hi Pavan,

Thanks for going through this patch-set.
On 5/8/20 2:03 PM, Pavan Kondeti wrote:
> Hi Parth,
>
> On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
>> Monitor tasks at:
>> 1. wake_up_new_task() - forked tasks
>>
>> 2. set_task_cpu() - task migrations, Load balancer
>>
>> 3. __sched_setscheduler() - set/unset latency_nice value
>> Increment the nr_lat_sensitive count on the CPU with task marked with
>> latency_nice == -20.
>> Similarly, decrement the nr_lat_sensitive counter upon re-marking the
>> task with >-20 latency_nice task.
>>
>> 4. finish_task_switch() - dying task
>>
>> Signed-off-by: Parth Shah <pa...@linux.ibm.com>
>> ---
>>  kernel/sched/core.c  | 30 ++++++++++++++++++++++++++++--
>>  kernel/sched/sched.h |  5 +++++
>>  2 files changed, 33 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2d8b76f41d61..ad396c36eba6 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>>  	trace_sched_migrate_task(p, new_cpu);
>>
>>  	if (task_cpu(p) != new_cpu) {
>> +		if (task_is_lat_sensitive(p)) {
>> +			per_cpu(nr_lat_sensitive, task_cpu(p))--;
>> +			per_cpu(nr_lat_sensitive, new_cpu)++;
>> +		}
>> +
>
> Since we can come here without rq locks, there is a possibility
> of a race and incorrect updates can happen. Since the counters
> are used to prevent C-states, we don't want that to happen.

I did try using task_lock(p) wherever we change the refcount and
wherever the latency_nice value is set, with nr_lat_sensitive declared
as atomic_t. After thinking about how to optimize it, and since we
already hold the rq lock on most of these paths, I ended up not taking
any lock here to see if the scheduler community has a well-known
solution for this :-)

But in brief, using an atomic_t nr_lat_sensitive and taking
task_lock(p) when changing the refcount should solve the problem,
right? (A rough sketch of what I mean is further down in this mail.)
If you or anyone else has a solution for this kind of pattern, that
would surely be helpful.

>
>>  		if (p->sched_class->migrate_task_rq)
>>  			p->sched_class->migrate_task_rq(p, new_cpu);
>>  		p->se.nr_migrations++;
>> @@ -2947,6 +2952,7 @@ void wake_up_new_task(struct task_struct *p)
>>  {
>>  	struct rq_flags rf;
>>  	struct rq *rq;
>> +	int target_cpu = 0;
>>
>>  	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
>>  	p->state = TASK_RUNNING;
>> @@ -2960,9 +2966,17 @@ void wake_up_new_task(struct task_struct *p)
>>  	 * as we're not fully set-up yet.
>>  	 */
>>  	p->recent_used_cpu = task_cpu(p);
>> -	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
>> +	target_cpu = select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0);
>> +	__set_task_cpu(p, target_cpu);
>> +
>
> The target_cpu variable can be eliminated by using task_cpu(p) directly
> in the below update.

Right. Will change it, saving one diff line.

>
>>  #endif
>>  	rq = __task_rq_lock(p, &rf);
>> +
>> +#ifdef CONFIG_SMP
>> +	if (task_is_lat_sensitive(p))
>> +		per_cpu(nr_lat_sensitive, target_cpu)++;
>> +#endif
>> +
>
> Is the SMP check intentional? In some parts of this patch, updates to
> nr_lat_sensitive are done without SMP checks. For example,
> finish_task_switch() below.

No, I just forgot to remove it. I will remove the SMP checks in the
next revision.
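For reference, here is roughly the atomic_t variant I described above.
This is a minimal, untested sketch only; the inc/dec helper names are
made up for illustration and are not part of the posted patch:

/*
 * Sketch: per-CPU atomic counter, so that paths which update it
 * without holding the rq lock (e.g. set_task_cpu()) cannot lose
 * concurrent increments/decrements.
 */
DEFINE_PER_CPU(atomic_t, nr_lat_sensitive);

static inline void inc_nr_lat_sensitive(int cpu)
{
	atomic_inc(&per_cpu(nr_lat_sensitive, cpu));
}

static inline void dec_nr_lat_sensitive(int cpu)
{
	atomic_dec(&per_cpu(nr_lat_sensitive, cpu));
}

/* e.g. in set_task_cpu(), which may run without the rq lock held: */
if (task_is_lat_sensitive(p)) {
	dec_nr_lat_sensitive(task_cpu(p));
	inc_nr_lat_sensitive(new_cpu);
}

task_lock(p) would then only be needed in __sched_setscheduler() to
serialize a latency_nice change against a concurrent migration reading
p->latency_nice, and the idle path would read the counter with
atomic_read().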
>
>>  	update_rq_clock(rq);
>>  	post_init_entity_util_avg(p);
>>
>> @@ -3248,6 +3262,9 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>>  		if (prev->sched_class->task_dead)
>>  			prev->sched_class->task_dead(prev);
>>
>> +		if (task_is_lat_sensitive(prev))
>> +			per_cpu(nr_lat_sensitive, prev->cpu)--;
>> +
>>  		/*
>>  		 * Remove function-return probe instances associated with this
>>  		 * task and put them back on the free list.
>> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
>>  	p->normal_prio = normal_prio(p);
>>  	set_load_weight(p, true);
>>
>> -	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
>> +	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
>> +		if (p->state != TASK_DEAD &&
>> +		    attr->sched_latency_nice != p->latency_nice) {
>> +			if (attr->sched_latency_nice == MIN_LATENCY_NICE)
>> +				per_cpu(nr_lat_sensitive, task_cpu(p))++;
>> +			else if (task_is_lat_sensitive(p))
>> +				per_cpu(nr_lat_sensitive, task_cpu(p))--;
>> +		}
>> +
>>  		p->latency_nice = attr->sched_latency_nice;
>> +	}
>>  }
>
> There is a potential race here due to which we can mess up the refcount.
>
> - A latency sensitive task is marked TASK_DEAD
> <snip>
> - sched_setattr() called on the task to clear the latency nice. Since
>   we check the task state here, we skip the decrement.
> - The task is finally context switched out and we skip the decrement
>   again since it is not a latency sensitive task.

If the task is already marked TASK_DEAD, then we should have already
decremented its refcount in finish_task_switch(). Am I missing
something?

>
>>
>>  	/* Actually do priority change: must hold pi & rq lock. */
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 5c41020c530e..56f885e37451 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -211,6 +211,11 @@ static inline int task_has_dl_policy(struct task_struct *p)
>>  	return dl_policy(p->policy);
>>  }
>>
>> +static inline int task_is_lat_sensitive(struct task_struct *p)
>> +{
>> +	return p->latency_nice == MIN_LATENCY_NICE;
>> +}
>> +
>>  #define cap_scale(v, s)	((v)*(s) >> SCHED_CAPACITY_SHIFT)
>>
>>  /*
>> --
>> 2.17.2
>>
>
> Thanks,
> Pavan
>

Thanks,
Parth