On Wed, Nov 21, 2018 at 09:18:19AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 20, 2018 at 11:40:22PM +0100, Frederic Weisbecker wrote:
> > On Tue, Nov 20, 2018 at 03:23:06PM +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 14, 2018 at 03:46:04AM +0100, Frederic Weisbecker wrote:
> > > >
> > > > +void kcpustat_cputime(struct kernel_cpustat *kcpustat, int cpu,
> > > > +		      u64 *user, u64 *nice, u64 *system,
> > > > +		      u64 *guest, u64 *guest_nice)
> > > > +{
> > > > +	struct task_struct *curr;
> > > > +	struct vtime *vtime;
> > > > +	int err;
> > > > +
> > > > +	if (!vtime_accounting_enabled()) {
> > > > +		kcpustat_cputime_raw(kcpustat, user, nice,
> > > > +				     system, guest, guest_nice);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	rcu_read_lock();
> > > > +
> > > > +	do {
> > > > +		curr = rcu_dereference(kcpustat->curr);
> > >
> > > Like I explained earlier; I don't think the above is correct.
> > > task_struct is itself not RCU protected.
> >
> > But there is at least one put_task_struct() that is enqueued as an RCU callback
> > on release_task(). That patchset (try to) make sure that kcpustat->curr can't
> > be assigned beyond that point.
> >
> > Or did I misunderstand something?
>
> Yeah; release_task() is not the normal exit path. Oleg can probably
> remember how all that works, because I always get lost there :-/
>
> In any case, have a look at task_rcu_dereference(), but that still does
> not explain the rcu_assign_pointer() stuff you use to set
> kcpustat->curr.
So there are two reasons here.

The first is ordering. Here is the write side:

    schedule() {
        rq->curr = next;
        context_switch() {
            vtime_task_switch() {
                vtime_seqcount_write_lock(prev);
                //flush pending vtime
                vtime_seqcount_write_unlock(prev);

                vtime_seqcount_write_lock(next);
                //Init vtime
                vtime_seqcount_write_unlock(next);

                rcu_assign_pointer(this_kcpustat->curr, next);
            }
        }
    }

So now from the read side, if I fetch the task from rq->curr, there is a
risk that I see the next task but don't yet see the vtime flushed by the
prev task. If the prev task has run one second without being interrupted,
I'm going to miss that whole second. Assigning the next task only after
the prev task has flushed its vtime prevents that kind of race, thanks to
the barriers implied by the seqcount and even by rcu_assign_pointer().

The second reason is that I looked at task_rcu_dereference() and its guts
are really not appealing. The ordering guarantee is hard to parse there.
At least with kcpustat->curr I can make sure that the pointer is NULL
before the earliest opportunity for call_rcu(delayed_put_task_struct) to
be queued (which is exit_notify() in the autoreap case).