Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()

2015-08-27 Thread T. Zhou
Hi,

On Tue, Aug 25, 2015 at 03:59:54PM +0800, Wanpeng Li wrote:
> [   15.273708] [ cut here ]
> [   15.274097] WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 
> do_set_cpus_allowed+0x7e/0x80()
> [   15.274857] Modules linked in:
> [   15.275101] CPU: 0 PID: 13 Comm: migration/0 Not tainted 
> 4.2.0-rc1-00049-g25834c7 #2
> [   15.275674]    d21f1d24 c19228b2  d21f1d58 
> c1056a3b c1ba00e4
> [   15.276084]   000d c1ba17d8 0484 c10838be 0484 
> c10838be d21e5000
> [   15.276084]  d2121900 d21e5158 d21f1d68 c1056b12 0009  
> d21f1d7c c10838be
> [   15.276084] Call Trace:
> [   15.276084]  [] dump_stack+0x4b/0x75
> [   15.276084]  [] warn_slowpath_common+0x8b/0xc0
> [   15.276084]  [] ? do_set_cpus_allowed+0x7e/0x80
> [   15.276084]  [] ? do_set_cpus_allowed+0x7e/0x80
> [   15.276084]  [] warn_slowpath_null+0x22/0x30
> [   15.276084]  [] do_set_cpus_allowed+0x7e/0x80
> [   15.276084]  [] cpuset_cpus_allowed_fallback+0x7c/0x170
> [   15.276084]  [] ? cpuset_cpus_allowed+0x180/0x180
> [   15.276084]  [] select_fallback_rq+0x221/0x280
> [   15.276084]  [] migration_call+0xe3/0x250
> [   15.276084]  [] notifier_call_chain+0x53/0x70
> [   15.276084]  [] __raw_notifier_call_chain+0x1e/0x30
> [   15.276084]  [] cpu_notify+0x28/0x50
> [   15.276084]  [] take_cpu_down+0x22/0x40
> [   15.276084]  [] multi_cpu_stop+0xd5/0x140
> [   15.276084]  [] ? __stop_cpus+0x80/0x80
> [   15.276084]  [] cpu_stopper_thread+0xbc/0x170
> [   15.276084]  [] ? preempt_count_sub+0x9/0x50
> [   15.276084]  [] ? _raw_spin_unlock_irq+0x37/0x50
> [   15.276084]  [] ? _raw_spin_unlock_irqrestore+0x55/0x70
> [   15.276084]  [] ? trace_hardirqs_on_caller+0x144/0x1e0
> [   15.276084]  [] ? cpu_stop_should_run+0x35/0x40
> [   15.276084]  [] ? preempt_count_sub+0x9/0x50
> [   15.276084]  [] ? _raw_spin_unlock_irqrestore+0x41/0x70
> [   15.276084]  [] smpboot_thread_fn+0x174/0x2f0
> [   15.276084]  [] ? sort_range+0x30/0x30
> [   15.276084]  [] kthread+0xc4/0xe0
> [   15.276084]  [] ret_from_kernel_thread+0x21/0x30
> [   15.276084]  [] ? kthread_create_on_node+0x180/0x180
> [   15.276084] ---[ end trace 15f4c86d404693b0 ]---
> 

No experiments from me, sorry; just guessing at the path (reading the
trace above bottom-up):

take_cpu_down()
  cpu_notify(CPU_DYING)
    migration_call()
      select_fallback_rq()
        cpuset_cpus_allowed_fallback()
          do_set_cpus_allowed()

In the CPU_DYING case of migration_call(), add something like:

raw_spin_lock_irqsave(&p->pi_lock, flags);
raw_spin_lock(&rq->lock);
...
raw_spin_unlock(&rq->lock);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);

This keeps p->pi_lock outside rq->lock, so the lock order is not inverted
(from Peter's review), and p->pi_lock is not taken twice (from Peter's
review).

And in do_set_cpus_allowed(), replace the existing assertion with
something like this (from Peter's suggestion):

WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
			      (p->on_rq &&
			       lockdep_is_held(&task_rq(p)->lock))));

similar to what set_task_cpu() already does.
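For reference, the check in set_task_cpu() in the 4.2-era tree looks
roughly like the below (quoted from memory, so treat it as a sketch rather
than the exact source):

#ifdef CONFIG_LOCKDEP
	/*
	 * The caller should hold either p->pi_lock or rq->lock when changing
	 * a task's CPU: ->pi_lock for waking tasks, rq->lock for runnable ones.
	 */
	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
				      lockdep_is_held(&task_rq(p)->lock)));
#endif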

There may be a better or more correct solution out there :)

thanks,
-- 
Tao


Re: [PATCH v2 1/3] sched: sync a se with its cfs_rq when attaching and dettaching

2015-08-18 Thread T. Zhou
Hi,

On Mon, Aug 17, 2015 at 04:45:50PM +0900, byungchul.p...@lge.com wrote:
> From: Byungchul Park 
> 
> current code is wrong with cfs_rq's avg loads when changing a task's
> cfs_rq to another. i tested with "echo pid > cgroup" and found that
> e.g. cfs_rq->avg.load_avg became larger and larger whenever i changed
> a cgroup to another again and again. we have to sync se's avg loads
> with both *prev* cfs_rq and next cfs_rq when changing its group.
> 

A simple thought on the above; it may be nothing, or wrong, so feel free
to ignore it.

If a load-balance migration happened just before the cgroup change, the
prev cfs_rq and the next cfs_rq will be on different CPUs.
migrate_task_rq_fair() and update_cfs_rq_load_avg() will sync the se and
remove its load avg from the prev cfs_rq, whether or not it was queued, so
that side is handled. dequeue_task() decays the se and the prev cfs_rq
before calling task_move_group_fair(). After the cfs_rq is switched in
task_move_group_fair(): if queued, the se's load avg is not added to the
next cfs_rq (setting last_update_time to 0, as migration does, would make
the enqueue path add it); if !queued, we also need to add the se's load
avg to the next cfs_rq explicitly.

If no load-balance migration happened around the cgroup change, the prev
cfs_rq and the next cfs_rq may be on the same CPU (not sure). In that case
we need to remove the se's load avg from the prev cfs_rq ourselves, and
also add the se's load avg to the next cfs_rq.
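A rough sketch of the "add to the next cfs_rq" half of that, reusing the
attach_entity_load_avg() helper from the quoted patch (the wrapper name
below is made up for illustration; this is not a tested fix):

/* after set_task_rq() has switched the task to its next cfs_rq */
static void sync_se_with_next_cfs_rq(struct sched_entity *se, int queued)
{
	struct cfs_rq *cfs_rq = cfs_rq_of(se);

	if (queued)
		/* let the following enqueue take the "migrated" attach path */
		se->avg.last_update_time = 0;
	else
		/* no enqueue will run, so attach the load avg here */
		attach_entity_load_avg(cfs_rq, se);
}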

thanks,
-- 
Tao


Re: [PATCH] sched: sync with the cfs_rq when changing sched class

2015-08-15 Thread T. Zhou
On Sat, Aug 15, 2015 at 01:24:12PM +0900, Byungchul Park wrote:
> On Fri, Aug 14, 2015 at 08:59:02PM +0800, T. Zhou wrote:
> > Hi,
> > 
> > On Thu, Aug 13, 2015 at 02:55:55PM +0900, byungchul.p...@lge.com wrote:
> > > +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct 
> > > sched_entity *se)
> > > +{
> > > + se->avg.last_update_time = cfs_rq->avg.last_update_time;
> > > + cfs_rq->avg.load_avg += se->avg.load_avg;
> > > + cfs_rq->avg.load_sum += se->avg.load_sum;
> > > + cfs_rq->avg.util_avg += se->avg.util_avg;
> > > + cfs_rq->avg.util_sum += se->avg.util_sum;
> > > +}
> > 
> > I see this function is also used in enqueue_entity_load_avg().
> > In the tip tree code, enqueue_entity_load_avg() uses cfs_rq_clock_task()
> > as se->last_update_time in the migration case.
> > Here, you use cfs_rq->avg.last_update_time as se->last_update_time.
> > If se->last_update_time differs, the next decay may differ too.
> > This is just from inspecting the code; maybe there is a reason for it?
> 
> hello zhou,
> 
> update_cfs_rq_load_avg() would update cfs_rq->avg.last_update_time to now,
> which is returned from cfs_rq_clock_task(). :)
> 
> thanks,
> byungchul
> 

Hi Byungchul,

Yes, you are right. The se and the cfs_rq update their load avg at the
same time. :)
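That is, in the enqueue path of the patch (paraphrasing the quoted code as
a sketch):

	u64 now = cfs_rq_clock_task(cfs_rq);
	...
	decayed = update_cfs_rq_load_avg(now, cfs_rq);
	/*
	 * update_cfs_rq_load_avg() brings cfs_rq->avg.last_update_time up
	 * to now, so attach_entity_load_avg() copying it into
	 * se->avg.last_update_time ends up using the same timestamp.
	 */
	if (migrated)
		attach_entity_load_avg(cfs_rq, se);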

thanks
-- 
Tao


Re: [PATCH 2/2] sched: make task_move_group_fair simple by using switched_to(from)_fair

2015-08-14 Thread T. Zhou
Hi,

On Thu, Aug 13, 2015 at 07:49:19PM +0900, byungchul.p...@lge.com wrote:
> +static inline int need_vruntime_adjust(struct task_struct *p, int queued)
> +{
> + struct sched_entity *se = &p->se;
> +
> + /*
> +  * When !queued, vruntime of the task has usually NOT been normalized.
> +  * But there are some cases where it has already been normalized:
> +  *
> +  * - Moving a forked child which is waiting for being woken up by
> +  *   wake_up_new_task().
> +  * - Moving a task which has been woken up by try_to_wake_up() and
> +  *   waiting for actually being woken up by sched_ttwu_pending().
> +  */
> + if (queued)
> + return 0;
> +
> + if (!se->sum_exec_runtime || p->state == TASK_WAKING)
> + return 0;
> +
> + if (p->state != TASK_RUNNING)
> + return 1;
> +
> + return 0;
> +}

The original switched_from/to_fair() does not include the following check:

	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
		return 0;

I guess you are sure this check can never trigger in the
switched_from/to_fair() case?

Thanks
>  static void switched_from_fair(struct rq *rq, struct task_struct *p)
>  {
>   struct sched_entity *se = &p->se;
> @@ -7918,7 +7943,7 @@ static void switched_from_fair(struct rq *rq, struct 
> task_struct *p)
>* have normalized the vruntime, if it's !queued, then only when
>* the task is sleeping will it still have non-normalized vruntime.
>*/
> - if (!task_on_rq_queued(p) && p->state != TASK_RUNNING) {
> + if (need_vruntime_adjust(p, task_on_rq_queued(p))) {
>   /*
>* Fix up our vruntime so that the current sleep doesn't
>* cause 'unlimited' sleep bonus.
> @@ -7939,7 +7964,6 @@ static void switched_from_fair(struct rq *rq, struct 
> task_struct *p)
>  static void switched_to_fair(struct rq *rq, struct task_struct *p)
>  {
>   struct sched_entity *se = &p->se;
> -
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>   /*
>* Since the real-depth could have been changed (only FAIR
> @@ -7964,7 +7988,7 @@ static void switched_to_fair(struct rq *rq, struct 
> task_struct *p)
>* has non-normalized vruntime, if it's !queued, then it still 
> has
>* normalized vruntime.
>*/
> - if (p->state != TASK_RUNNING)
> + if (need_vruntime_adjust(p, task_on_rq_queued(p)))
>   se->vruntime += cfs_rq_of(se)->min_vruntime;
>   return;
>   }
> @@ -8014,55 +8038,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  static void task_move_group_fair(struct task_struct *p, int queued)
>  {
> - struct sched_entity *se = &p->se;
> - struct cfs_rq *cfs_rq;
> -
> - /*
> -  * If the task was not on the rq at the time of this cgroup movement
> -  * it must have been asleep, sleeping tasks keep their ->vruntime
> -  * absolute on their old rq until wakeup (needed for the fair sleeper
> -  * bonus in place_entity()).
> -  *
> -  * If it was on the rq, we've just 'preempted' it, which does convert
> -  * ->vruntime to a relative base.
> -  *
> -  * Make sure both cases convert their relative position when migrating
> -  * to another cgroup's rq. This does somewhat interfere with the
> -  * fair sleeper stuff for the first placement, but who cares.
> -  */
> - /*
> -  * When !queued, vruntime of the task has usually NOT been normalized.
> -  * But there are some cases where it has already been normalized:
> -  *
> -  * - Moving a forked child which is waiting for being woken up by
> -  *   wake_up_new_task().
> -  * - Moving a task which has been woken up by try_to_wake_up() and
> -  *   waiting for actually being woken up by sched_ttwu_pending().
> -  *
> -  * To prevent boost or penalty in the new cfs_rq caused by delta
> -  * min_vruntime between the two cfs_rqs, we skip vruntime adjustment.
> -  */
> - if (!queued && (!se->sum_exec_runtime || p->state == TASK_WAKING))
> - queued = 1;
> -
> - cfs_rq = cfs_rq_of(se);
> - if (!queued)
> - se->vruntime -= cfs_rq->min_vruntime;
> -
> -#ifdef CONFIG_SMP
> - /* synchronize task with its prev cfs_rq */
> - detach_entity_load_avg(cfs_rq, se);
> -#endif
> + switched_from_fair(task_rq(p), p);
>   set_task_rq(p, task_cpu(p));
> - se->depth = se->parent ? se->parent->depth + 1 : 0;
> - cfs_rq = cfs_rq_of(se);
> - if (!queued)
> - se->vruntime += cfs_rq->min_vruntime;
> -
> -#ifdef CONFIG_SMP
> - /* Virtually synchronize task with its new cfs_rq */
> - attach_entity_load_avg(cfs_rq, se);
> -#endif
> + switched_to_fair(task_rq(p), p);
>  }
>  
>  void free_fair_sched_group(struct task_group *tg)

-- 
Tao

Re: [PATCH] sched: sync with the cfs_rq when changing sched class

2015-08-14 Thread T. Zhou
Hi,

On Thu, Aug 13, 2015 at 02:55:55PM +0900, byungchul.p...@lge.com wrote:
> +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct 
> sched_entity *se)
> +{
> + se->avg.last_update_time = cfs_rq->avg.last_update_time;
> + cfs_rq->avg.load_avg += se->avg.load_avg;
> + cfs_rq->avg.load_sum += se->avg.load_sum;
> + cfs_rq->avg.util_avg += se->avg.util_avg;
> + cfs_rq->avg.util_sum += se->avg.util_sum;
> +}

I see this function is also used in enqueue_entity_load_avg().
In the tip tree code, enqueue_entity_load_avg() uses cfs_rq_clock_task()
as se->last_update_time in the migration case.
Here, you use cfs_rq->avg.last_update_time as se->last_update_time.
If se->last_update_time differs, the next decay may differ too.
This is just from inspecting the code; maybe there is a reason for it?
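In other words, the two choices being compared are (sketch, taken from the
lines the patch removes and adds):

	/* tip tree, enqueue_entity_load_avg() migration case */
	se->avg.last_update_time = cfs_rq_clock_task(cfs_rq);

	/* this patch, attach_entity_load_avg() */
	se->avg.last_update_time = cfs_rq->avg.last_update_time;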

Thanks
> +static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct 
> sched_entity *se)
> +{
> + __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
> + &se->avg, se->on_rq * scale_load_down(se->load.weight),
> + cfs_rq->curr == se, NULL);
> +
> + cfs_rq->avg.load_avg =
> + max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
> + cfs_rq->avg.load_sum =
> + max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
> + cfs_rq->avg.util_avg =
> + max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
> + cfs_rq->avg.util_sum =
> + max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
> +}
> +
>  /* Add the load generated by se into cfs_rq's load average */
>  static inline void
>  enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -2717,27 +2742,20 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct 
> sched_entity *se)
>   u64 now = cfs_rq_clock_task(cfs_rq);
>   int migrated = 0, decayed;
>  
> - if (sa->last_update_time == 0) {
> - sa->last_update_time = now;
> + if (sa->last_update_time == 0)
>   migrated = 1;
> - }
> - else {
> + else
>   __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
> - se->on_rq * scale_load_down(se->load.weight),
> - cfs_rq->curr == se, NULL);
> - }
> + se->on_rq * scale_load_down(se->load.weight),
> + cfs_rq->curr == se, NULL);
>  
>   decayed = update_cfs_rq_load_avg(now, cfs_rq);
>  
>   cfs_rq->runnable_load_avg += sa->load_avg;
>   cfs_rq->runnable_load_sum += sa->load_sum;
>  
> - if (migrated) {
> - cfs_rq->avg.load_avg += sa->load_avg;
> - cfs_rq->avg.load_sum += sa->load_sum;
> - cfs_rq->avg.util_avg += sa->util_avg;
> - cfs_rq->avg.util_sum += sa->util_sum;
> - }
> + if (migrated)
> + attach_entity_load_avg(cfs_rq, se);
>  
>   if (decayed || migrated)
>   update_tg_load_avg(cfs_rq, 0);
> @@ -7911,17 +7929,7 @@ static void switched_from_fair(struct rq *rq, struct 
> task_struct *p)
>  
>  #ifdef CONFIG_SMP
>   /* Catch up with the cfs_rq and remove our load when we leave */
> - __update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq), &se->avg,
> - se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == 
> se, NULL);
> -
> - cfs_rq->avg.load_avg =
> - max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
> - cfs_rq->avg.load_sum =
> - max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
> - cfs_rq->avg.util_avg =
> - max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
> - cfs_rq->avg.util_sum =
> - max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
> + detach_entity_load_avg(cfs_rq, se);
>  #endif
>  }
>  
> @@ -7938,6 +7946,11 @@ static void switched_to_fair(struct rq *rq, struct 
> task_struct *p)
>*/
>   se->depth = se->parent ? se->parent->depth + 1 : 0;
>  #endif
> +
> +#ifdef CONFIG_SMP
> + /* synchronize task with its cfs_rq */
> + attach_entity_load_avg(cfs_rq_of(&p->se), &p->se);
> +#endif
>   if (!task_on_rq_queued(p))
>   return;
>  
> @@ -8023,16 +8036,7 @@ static void task_move_group_fair(struct task_struct 
> *p, int queued)
>  
>  #ifdef CONFIG_SMP
>   /* synchronize task with its prev cfs_rq */
> - if (!queued)
> - __update_load_avg(cfs_rq->avg.last_update_time, 
> cpu_of(rq_of(cfs_rq)),
> - &se->avg, se->on_rq * 
> scale_load_down(se->load.weight),
> - cfs_rq->curr == se, NULL);
> -
> - /* remove our load when we leave */
> - cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - 
> se->avg.load_avg, 0);
> - cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - 
> se->avg.load_sum, 0);
> - cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - 
> se->avg.util_avg, 0);
> - cfs_