Re: [PATCH] sched: fix tsk->pi_lock isn't held when do_set_cpus_allowed()
Hi,

On Tue, Aug 25, 2015 at 03:59:54PM +0800, Wanpeng Li wrote:
> [ 15.273708] ------------[ cut here ]------------
> [ 15.274097] WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80()
> [ 15.274857] Modules linked in:
> [ 15.275101] CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c7 #2
> [ 15.275674]  d21f1d24 c19228b2 d21f1d58 c1056a3b c1ba00e4
> [ 15.276084]  000d c1ba17d8 0484 c10838be 0484 c10838be d21e5000
> [ 15.276084]  d2121900 d21e5158 d21f1d68 c1056b12 0009 d21f1d7c c10838be
> [ 15.276084] Call Trace:
> [ 15.276084]  [] dump_stack+0x4b/0x75
> [ 15.276084]  [] warn_slowpath_common+0x8b/0xc0
> [ 15.276084]  [] ? do_set_cpus_allowed+0x7e/0x80
> [ 15.276084]  [] ? do_set_cpus_allowed+0x7e/0x80
> [ 15.276084]  [] warn_slowpath_null+0x22/0x30
> [ 15.276084]  [] do_set_cpus_allowed+0x7e/0x80
> [ 15.276084]  [] cpuset_cpus_allowed_fallback+0x7c/0x170
> [ 15.276084]  [] ? cpuset_cpus_allowed+0x180/0x180
> [ 15.276084]  [] select_fallback_rq+0x221/0x280
> [ 15.276084]  [] migration_call+0xe3/0x250
> [ 15.276084]  [] notifier_call_chain+0x53/0x70
> [ 15.276084]  [] __raw_notifier_call_chain+0x1e/0x30
> [ 15.276084]  [] cpu_notify+0x28/0x50
> [ 15.276084]  [] take_cpu_down+0x22/0x40
> [ 15.276084]  [] multi_cpu_stop+0xd5/0x140
> [ 15.276084]  [] ? __stop_cpus+0x80/0x80
> [ 15.276084]  [] cpu_stopper_thread+0xbc/0x170
> [ 15.276084]  [] ? preempt_count_sub+0x9/0x50
> [ 15.276084]  [] ? _raw_spin_unlock_irq+0x37/0x50
> [ 15.276084]  [] ? _raw_spin_unlock_irqrestore+0x55/0x70
> [ 15.276084]  [] ? trace_hardirqs_on_caller+0x144/0x1e0
> [ 15.276084]  [] ? cpu_stop_should_run+0x35/0x40
> [ 15.276084]  [] ? preempt_count_sub+0x9/0x50
> [ 15.276084]  [] ? _raw_spin_unlock_irqrestore+0x41/0x70
> [ 15.276084]  [] smpboot_thread_fn+0x174/0x2f0
> [ 15.276084]  [] ? sort_range+0x30/0x30
> [ 15.276084]  [] kthread+0xc4/0xe0
> [ 15.276084]  [] ret_from_kernel_thread+0x21/0x30
> [ 15.276084]  [] ? kthread_create_on_node+0x180/0x180
> [ 15.276084] ---[ end trace 15f4c86d404693b0 ]---

No experiment from me, I'm afraid; this is just a guess at the path:

  take_cpu_down()
    cpu_notify(CPU_DYING)
      migration_call()

migration_call() has a CPU_DYING case. There, something like the
following could be added:

	raw_spin_lock_irqsave(&p->pi_lock, flags);
	raw_spin_lock(&rq->lock);
	...
	raw_spin_unlock(&rq->lock);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

That way p->pi_lock and rq->lock are not taken in inverted order (from
Peter's review), and p->pi_lock is not taken twice (from Peter's
review).

And in do_set_cpus_allowed(), add the following check (and delete some
of the existing code), like what is used in set_task_cpu() (from
Peter's suggestion):

	WARN_ON_ONCE(debug_locks && !(lockdep_is_held(&p->pi_lock) ||
				      (p->on_rq && lockdep_is_held(&task_rq(p)->lock))));

Maybe there is a better or more correct solution :)

thanks,
-- 
Tao
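To make the locking proposal above concrete, here is a rough sketch in C
(untested, purely illustrative; the function name and the use of
cpu_possible_mask as the fallback mask are assumptions, not something
stated in the mail):

/*
 * Rough sketch only, not a tested patch: take p->pi_lock (outer) and
 * then rq->lock (inner), in that order, around the affinity fallback in
 * the CPU_DYING path, so the lockdep check suggested above for
 * do_set_cpus_allowed() sees one of the two locks held.
 */
static void cpu_dying_fallback_sketch(struct task_struct *p, struct rq *rq)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&p->pi_lock, flags);	/* pi_lock first ... */
	raw_spin_lock(&rq->lock);			/* ... then rq->lock */

	/* the fallback affinity change now runs with pi_lock held */
	do_set_cpus_allowed(p, cpu_possible_mask);

	raw_spin_unlock(&rq->lock);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
}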
Re: [PATCH v2 1/3] sched: sync a se with its cfs_rq when attaching and dettaching
Hi,

On Mon, Aug 17, 2015 at 04:45:50PM +0900, byungchul.p...@lge.com wrote:
> From: Byungchul Park
>
> current code is wrong with cfs_rq's avg loads when changing a task's
> cfs_rq to another. i tested with "echo pid > cgroup" and found that
> e.g. cfs_rq->avg.load_avg became larger and larger whenever i changed
> a cgroup to another again and again. we have to sync se's avg loads
> with both *prev* cfs_rq and next cfs_rq when changing its group.

Here are my thoughts on the above; they may be nothing, or wrong, so
feel free to ignore them.

If a load-balance migration happened just before the cgroup change, the
prev cfs_rq and the next cfs_rq will be on different CPUs.
migrate_task_rq_fair() and update_cfs_rq_load_avg() will sync the se and
remove its load avg from the prev cfs_rq, whether or not the task is
queued, so that side is handled. dequeue_task() decays the se and the
prev cfs_rq before task_move_group_fair() is called. After the cfs_rq is
set in task_move_group_fair(), if the task is queued, the se's load avg
is not added to the next cfs_rq (one way to get it added is to set
last_update_time to 0, like migration does); if it is !queued, the se's
load avg also needs to be added to the next cfs_rq.

If no load-balance migration happened around the cgroup change, the prev
cfs_rq and the next cfs_rq may be on the same cpu (I am not sure). In
that case we need to remove the se's load avg from the prev cfs_rq
ourselves, and also add it to the next cfs_rq.

thanks,
-- 
Tao
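A minimal sketch of the detach-from-prev/attach-to-next idea discussed
above, written against the attach/detach helpers quoted elsewhere in
this thread (the wrapper name is made up for illustration; this is not a
tested patch):

/*
 * Illustrative sketch only: keep a se's load average consistent across
 * a cgroup move by removing it from the cfs_rq it leaves and adding it
 * to the cfs_rq it joins.  detach_entity_load_avg() and
 * attach_entity_load_avg() are the helpers quoted in this thread; the
 * wrapper itself is a hypothetical name.
 */
static void sync_se_across_group_move(struct sched_entity *se,
				      struct cfs_rq *prev_cfs_rq,
				      struct cfs_rq *next_cfs_rq)
{
	/* remove the se's contribution from the cfs_rq it is leaving */
	detach_entity_load_avg(prev_cfs_rq, se);

	/* and account it on the cfs_rq it is joining */
	attach_entity_load_avg(next_cfs_rq, se);
}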
Re: [PATCH] sched: sync with the cfs_rq when changing sched class
On Sat, Aug 15, 2015 at 01:24:12PM +0900, Byungchul Park wrote:
> On Fri, Aug 14, 2015 at 08:59:02PM +0800, T. Zhou wrote:
> > Hi,
> >
> > On Thu, Aug 13, 2015 at 02:55:55PM +0900, byungchul.p...@lge.com wrote:
> > > +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > > +{
> > > +	se->avg.last_update_time = cfs_rq->avg.last_update_time;
> > > +	cfs_rq->avg.load_avg += se->avg.load_avg;
> > > +	cfs_rq->avg.load_sum += se->avg.load_sum;
> > > +	cfs_rq->avg.util_avg += se->avg.util_avg;
> > > +	cfs_rq->avg.util_sum += se->avg.util_sum;
> > > +}
> >
> > I see this function is used in enqueue_entity_load_avg() also.
> > In the tip tree code, enqueue_entity_load_avg() uses cfs_rq_clock_task()
> > as se->last_update_time in the migration case.
> > Here, you use cfs_rq->avg.last_update_time as se->last_update_time.
> > If se->last_update_time is different, the next decay may be different too.
> > This is just from inspecting the code; maybe there is a reason for it?
>
> hello zhou,
>
> update_cfs_rq_load_avg() would update cfs_rq->avg.last_update_time to now,
> which is returned from cfs_rq_clock_task(). :)
>
> thanks,
> byungchul

Hi Byungchul,

Yes, you are right. The se and the cfs_rq update their load averages at
the same time. :)

thanks,
-- 
Tao
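To make the timing argument concrete, here is a short sketch of the
order of operations in the patched enqueue path, reconstructed from the
diff quoted elsewhere in this thread (illustration only, not new code;
the wrapper name is made up):

/*
 * Sketch of the ordering discussed above: by the time
 * attach_entity_load_avg() runs, update_cfs_rq_load_avg(now, cfs_rq)
 * has already advanced cfs_rq->avg.last_update_time to 'now' (the value
 * returned by cfs_rq_clock_task()), so copying
 * cfs_rq->avg.last_update_time into se->avg.last_update_time stamps the
 * se with the same time as using 'now' directly would.
 */
static void enqueue_order_sketch(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 now = cfs_rq_clock_task(cfs_rq);

	/* decays cfs_rq->avg up to 'now' and sets avg.last_update_time = now */
	update_cfs_rq_load_avg(now, cfs_rq);

	/* se->avg.last_update_time = cfs_rq->avg.last_update_time == now */
	attach_entity_load_avg(cfs_rq, se);
}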
Re: [PATCH 2/2] sched: make task_move_group_fair simple by using switched_to(from)_fair
Hi,

On Thu, Aug 13, 2015 at 07:49:19PM +0900, byungchul.p...@lge.com wrote:
> +static inline int need_vruntime_adjust(struct task_struct *p, int queued)
> +{
> +	struct sched_entity *se = &p->se;
> +
> +	/*
> +	 * When !queued, vruntime of the task has usually NOT been normalized.
> +	 * But there are some cases where it has already been normalized:
> +	 *
> +	 * - Moving a forked child which is waiting for being woken up by
> +	 *   wake_up_new_task().
> +	 * - Moving a task which has been woken up by try_to_wake_up() and
> +	 *   waiting for actually being woken up by sched_ttwu_pending().
> +	 */
> +	if (queued)
> +		return 0;
> +
> +	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
> +		return 0;
> +
> +	if (p->state != TASK_RUNNING)
> +		return 1;
> +
> +	return 0;
> +}

The original switched_from_fair()/switched_to_fair() do not include the
following check:

	if (!se->sum_exec_runtime || p->state == TASK_WAKING)
		return 0;

I guess you are sure this check can never trigger in the
switched_from/to_fair() case?

Thanks

> static void switched_from_fair(struct rq *rq, struct task_struct *p)
> {
> 	struct sched_entity *se = &p->se;
> @@ -7918,7 +7943,7 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
>  	 * have normalized the vruntime, if it's !queued, then only when
>  	 * the task is sleeping will it still have non-normalized vruntime.
>  	 */
> -	if (!task_on_rq_queued(p) && p->state != TASK_RUNNING) {
> +	if (need_vruntime_adjust(p, task_on_rq_queued(p))) {
>  		/*
>  		 * Fix up our vruntime so that the current sleep doesn't
>  		 * cause 'unlimited' sleep bonus.
> @@ -7939,7 +7964,6 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
>  static void switched_to_fair(struct rq *rq, struct task_struct *p)
>  {
>  	struct sched_entity *se = &p->se;
> -
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  	/*
>  	 * Since the real-depth could have been changed (only FAIR
> @@ -7964,7 +7988,7 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
>  	 * has non-normalized vruntime, if it's !queued, then it still has
>  	 * normalized vruntime.
>  	 */
> -	if (p->state != TASK_RUNNING)
> +	if (need_vruntime_adjust(p, task_on_rq_queued(p)))
>  		se->vruntime += cfs_rq_of(se)->min_vruntime;
>  	return;
>  }
> @@ -8014,55 +8038,9 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  static void task_move_group_fair(struct task_struct *p, int queued)
>  {
> -	struct sched_entity *se = &p->se;
> -	struct cfs_rq *cfs_rq;
> -
> -	/*
> -	 * If the task was not on the rq at the time of this cgroup movement
> -	 * it must have been asleep, sleeping tasks keep their ->vruntime
> -	 * absolute on their old rq until wakeup (needed for the fair sleeper
> -	 * bonus in place_entity()).
> -	 *
> -	 * If it was on the rq, we've just 'preempted' it, which does convert
> -	 * ->vruntime to a relative base.
> -	 *
> -	 * Make sure both cases convert their relative position when migrating
> -	 * to another cgroup's rq. This does somewhat interfere with the
> -	 * fair sleeper stuff for the first placement, but who cares.
> -	 */
> -	/*
> -	 * When !queued, vruntime of the task has usually NOT been normalized.
> -	 * But there are some cases where it has already been normalized:
> -	 *
> -	 * - Moving a forked child which is waiting for being woken up by
> -	 *   wake_up_new_task().
> -	 * - Moving a task which has been woken up by try_to_wake_up() and
> -	 *   waiting for actually being woken up by sched_ttwu_pending().
> -	 *
> -	 * To prevent boost or penalty in the new cfs_rq caused by delta
> -	 * min_vruntime between the two cfs_rqs, we skip vruntime adjustment.
> -	 */
> -	if (!queued && (!se->sum_exec_runtime || p->state == TASK_WAKING))
> -		queued = 1;
> -
> -	cfs_rq = cfs_rq_of(se);
> -	if (!queued)
> -		se->vruntime -= cfs_rq->min_vruntime;
> -
> -#ifdef CONFIG_SMP
> -	/* synchronize task with its prev cfs_rq */
> -	detach_entity_load_avg(cfs_rq, se);
> -#endif
> +	switched_from_fair(task_rq(p), p);
>  	set_task_rq(p, task_cpu(p));
> -	se->depth = se->parent ? se->parent->depth + 1 : 0;
> -	cfs_rq = cfs_rq_of(se);
> -	if (!queued)
> -		se->vruntime += cfs_rq->min_vruntime;
> -
> -#ifdef CONFIG_SMP
> -	/* Virtually synchronize task with its new cfs_rq */
> -	attach_entity_load_avg(cfs_rq, se);
> -#endif
> +	switched_to_fair(task_rq(p), p);
>  }
>
>  void free_fair_sched_group(struct task_group *tg)

-- 
Tao
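For the vruntime part of the discussion above, here is a minimal sketch
(not a patch; the function name is made up for illustration, and the
field accesses simply mirror the quoted diff) of the renormalization the
code performs for a !queued task whose cfs_rq changes:

/*
 * Minimal sketch, not a patch: a sleeping (!queued) task keeps an
 * absolute vruntime, so when its cfs_rq changes the code first makes
 * the vruntime relative to the old cfs_rq and then absolute again on
 * the new one.
 */
static void renormalize_vruntime_sketch(struct sched_entity *se,
					struct cfs_rq *prev_cfs_rq,
					struct cfs_rq *next_cfs_rq)
{
	se->vruntime -= prev_cfs_rq->min_vruntime;	/* relative on the old cfs_rq */
	se->vruntime += next_cfs_rq->min_vruntime;	/* absolute on the new cfs_rq */
}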
Re: [PATCH] sched: sync with the cfs_rq when changing sched class
Hi,

On Thu, Aug 13, 2015 at 02:55:55PM +0900, byungchul.p...@lge.com wrote:
> +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +	se->avg.last_update_time = cfs_rq->avg.last_update_time;
> +	cfs_rq->avg.load_avg += se->avg.load_avg;
> +	cfs_rq->avg.load_sum += se->avg.load_sum;
> +	cfs_rq->avg.util_avg += se->avg.util_avg;
> +	cfs_rq->avg.util_sum += se->avg.util_sum;
> +}

I see this function is used in enqueue_entity_load_avg() also.
In the tip tree code, enqueue_entity_load_avg() uses cfs_rq_clock_task()
as se->last_update_time in the migration case.
Here, you use cfs_rq->avg.last_update_time as se->last_update_time.
If se->last_update_time is different, the next decay may be different too.
This is just from inspecting the code; maybe there is a reason for it?

Thanks

> +static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +{
> +	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
> +			  &se->avg, se->on_rq * scale_load_down(se->load.weight),
> +			  cfs_rq->curr == se, NULL);
> +
> +	cfs_rq->avg.load_avg =
> +		max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
> +	cfs_rq->avg.load_sum =
> +		max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
> +	cfs_rq->avg.util_avg =
> +		max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
> +	cfs_rq->avg.util_sum =
> +		max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
> +}
> +
>  /* Add the load generated by se into cfs_rq's load average */
>  static inline void
>  enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -2717,27 +2742,20 @@ enqueue_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  	u64 now = cfs_rq_clock_task(cfs_rq);
>  	int migrated = 0, decayed;
>
> -	if (sa->last_update_time == 0) {
> -		sa->last_update_time = now;
> +	if (sa->last_update_time == 0)
>  		migrated = 1;
> -	}
> -	else {
> +	else
>  		__update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
> -			se->on_rq * scale_load_down(se->load.weight),
> -			cfs_rq->curr == se, NULL);
> -	}
> +			  se->on_rq * scale_load_down(se->load.weight),
> +			  cfs_rq->curr == se, NULL);
>
>  	decayed = update_cfs_rq_load_avg(now, cfs_rq);
>
>  	cfs_rq->runnable_load_avg += sa->load_avg;
>  	cfs_rq->runnable_load_sum += sa->load_sum;
>
> -	if (migrated) {
> -		cfs_rq->avg.load_avg += sa->load_avg;
> -		cfs_rq->avg.load_sum += sa->load_sum;
> -		cfs_rq->avg.util_avg += sa->util_avg;
> -		cfs_rq->avg.util_sum += sa->util_sum;
> -	}
> +	if (migrated)
> +		attach_entity_load_avg(cfs_rq, se);
>
>  	if (decayed || migrated)
>  		update_tg_load_avg(cfs_rq, 0);
> @@ -7911,17 +7929,7 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
>
>  #ifdef CONFIG_SMP
>  	/* Catch up with the cfs_rq and remove our load when we leave */
> -	__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq), &se->avg,
> -		se->on_rq * scale_load_down(se->load.weight), cfs_rq->curr == se, NULL);
> -
> -	cfs_rq->avg.load_avg =
> -		max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
> -	cfs_rq->avg.load_sum =
> -		max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
> -	cfs_rq->avg.util_avg =
> -		max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
> -	cfs_rq->avg.util_sum =
> -		max_t(s32, cfs_rq->avg.util_sum - se->avg.util_sum, 0);
> +	detach_entity_load_avg(cfs_rq, se);
>  #endif
>  }
>
> @@ -7938,6 +7946,11 @@ static void switched_to_fair(struct rq *rq, struct task_struct *p)
>  	 */
>  	se->depth = se->parent ? se->parent->depth + 1 : 0;
>  #endif
> +
> +#ifdef CONFIG_SMP
> +	/* synchronize task with its cfs_rq */
> +	attach_entity_load_avg(cfs_rq_of(&p->se), &p->se);
> +#endif
>  	if (!task_on_rq_queued(p))
>  		return;
>
> @@ -8023,16 +8036,7 @@ static void task_move_group_fair(struct task_struct *p, int queued)
>
>  #ifdef CONFIG_SMP
>  	/* synchronize task with its prev cfs_rq */
> -	if (!queued)
> -		__update_load_avg(cfs_rq->avg.last_update_time, cpu_of(rq_of(cfs_rq)),
> -			&se->avg, se->on_rq * scale_load_down(se->load.weight),
> -			cfs_rq->curr == se, NULL);
> -
> -	/* remove our load when we leave */
> -	cfs_rq->avg.load_avg = max_t(long, cfs_rq->avg.load_avg - se->avg.load_avg, 0);
> -	cfs_rq->avg.load_sum = max_t(s64, cfs_rq->avg.load_sum - se->avg.load_sum, 0);
> -	cfs_rq->avg.util_avg = max_t(long, cfs_rq->avg.util_avg - se->avg.util_avg, 0);
> -	cfs_