On Fri, Dec 09, 2016 at 02:59:00PM +0100, Peter Zijlstra wrote:
> On Wed, Dec 07, 2016 at 07:34:55PM +0100, Peter Zijlstra wrote:
> 
> > @@ -2352,6 +2357,28 @@ perf_install_in_context(struct perf_event_context *ctx,
> >             return;
> >     }
> >     raw_spin_unlock_irq(&ctx->lock);
> > +
> > +   raw_spin_lock_irq(&task->pi_lock);
> > +   if (!(task->state == TASK_RUNNING || task->state == TASK_WAKING)) {
> > +           /*
> > +            * XXX horrific hack...
> > +            */
> > +           raw_spin_lock(&ctx->lock);
> > +           if (task != ctx->task) {
> > +                   raw_spin_unlock(&ctx->lock);
> > +                   raw_spin_unlock_irq(&task->pi_lock);
> > +                   goto again;
> > +           }
> > +
> > +           add_event_to_ctx(event, ctx);
> > +           raw_spin_unlock(&ctx->lock);
> > +           raw_spin_unlock_irq(&task->pi_lock);
> > +           return;
> > +   }
> > +   raw_spin_unlock_irq(&task->pi_lock);
> > +
> > +   cond_resched();
> > +
> >     /*
> >      * Since !ctx->is_active doesn't mean anything, we must IPI
> >      * unconditionally.
> 
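(A quick aside, mostly to check my own understanding of the hack above: it
seems to rely on try_to_wake_up() taking p->pi_lock before it looks at
p->state, so with pi_lock held and ->state neither TASK_RUNNING nor
TASK_WAKING the task cannot become runnable underneath us, and installing
under ctx->lock is then safe. Roughly, and heavily simplified:

	/* waker side: try_to_wake_up(), simplified */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	if (!(p->state & state))
		goto out;			/* nothing to wake */
	/* ... mark the task runnable, pick an rq ... */
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

	/* installer side, the hack: with p->pi_lock held and ->state a
	 * sleeping state, the block above cannot make progress, so the
	 * task stays off the CPU while we add_event_to_ctx() under
	 * ctx->lock. */

TASK_WAKING is excluded because for such a task the wakeup is already in
flight and no longer needs pi_lock to complete.)
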
> So while I went back and forth trying to make that less ugly, I figured
> there was another problem.
> 
> Imagine the cpu_function_call() hitting the 'right' cpu, but not finding
> the task current. It will then continue to install the event in the
> context. However, that doesn't stop another CPU from pulling the task in
> question from our rq and scheduling it elsewhere.
> 
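Just to check that I'm reading the race the same way you are, I think the
interleaving you mean is roughly the one below (CPU numbers and the task
't' are made up for illustration; cpu_function_call() has hit CPU1, where
t sits on the runqueue but is not current):

	CPU1 (t on its rq, not current)		CPU2
	__perf_install_in_context()
	  ctx->task != current
	  /* don't activate; "a future
	   * context switch will do it" */
						pull t from CPU1's rq
						__schedule() switches t in
	  add_event_to_ctx()

i.e. the decision not to activate is taken on CPU1 while the "future
context switch" it relies on may already be in flight on CPU2.
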
> This all led me to the patch below. Now it has a rather large comment,
> and while it represents my current thinking on the matter, I'm not at
> all sure it's entirely correct. I got my brain in a fair twist while
> writing it.
> 
> Please think about it carefully.
> 
> ---
>  kernel/events/core.c | 70 +++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 48 insertions(+), 22 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 6ee1febdf6ff..7d9ae461c535 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2252,7 +2252,7 @@ static int  __perf_install_in_context(void *info)
>       struct perf_event_context *ctx = event->ctx;
>       struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>       struct perf_event_context *task_ctx = cpuctx->task_ctx;
> -     bool activate = true;
> +     bool reprogram = true;
>       int ret = 0;
>  
>       raw_spin_lock(&cpuctx->ctx.lock);
> @@ -2260,27 +2260,26 @@ static int  __perf_install_in_context(void *info)
>               raw_spin_lock(&ctx->lock);
>               task_ctx = ctx;
>  
> -             /* If we're on the wrong CPU, try again */
> -             if (task_cpu(ctx->task) != smp_processor_id()) {
> -                     ret = -ESRCH;
> -                     goto unlock;
> -             }
> +             reprogram = (ctx->task == current);
>  
>               /*
> -              * If we're on the right CPU, see if the task we target is
> -              * current, if not we don't have to activate the ctx, a future
> -              * context switch will do that for us.
> +              * If the task is running, it must be running on this CPU,
> +              * otherwise we cannot reprogram things.
> +              *
> +              * If it's not running, we don't care, ctx->lock will
> +              * serialize against it becoming runnable.
>                */
> -             if (ctx->task != current)
> -                     activate = false;
> -             else
> -                     WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
> +             if (task_curr(ctx->task) && !reprogram) {
> +                     ret = -ESRCH;
> +                     goto unlock;
> +             }
>  
> +             WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
>       } else if (task_ctx) {
>               raw_spin_lock(&task_ctx->lock);
>       }
>  
> -     if (activate) {
> +     if (reprogram) {
>               ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>               add_event_to_ctx(event, ctx);
>               ctx_resched(cpuctx, task_ctx);
> @@ -2331,13 +2330,36 @@ perf_install_in_context(struct perf_event_context *ctx,
>       /*
>        * Installing events is tricky because we cannot rely on ctx->is_active
>        * to be set in case this is the nr_events 0 -> 1 transition.
> +      *
> +      * Instead we use task_curr(), which tells us if the task is running.
> +      * However, since we use task_curr() outside of rq::lock, we can race
> +      * against the actual state. This means the result can be wrong.
> +      *
> +      * If we get a false positive, we retry; this is harmless.
> +      *
> +      * If we get a false negative, things are complicated. If we are after
> +      * perf_event_context_sched_in(), ctx::lock will serialize us, and the
> +      * value must be correct. If we're before, it doesn't matter since
> +      * perf_event_context_sched_in() will program the counter.
> +      *
> +      * However, this hinges on the remote context switch having observed
> +      * our task->perf_event_ctxp[] store, such that it will in fact take
> +      * ctx::lock in perf_event_context_sched_in().
> +      *
> +      * We do this by task_function_call(); if the IPI fails to hit the task
> +      * we know any future context switch of the task must see the
> +      * perf_event_ctxp[] store.
>        */
> -again:
> +
>       /*
> -      * Cannot use task_function_call() because we need to run on the task's
> -      * CPU regardless of whether its current or not.
> +      * This smp_mb() orders the task->perf_event_ctxp[] store with the
> +      * task_cpu() load, such that if the IPI then does not find the task
> +      * running, a future context switch of that task must observe the
> +      * store.
>        */
> -     if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
> +     smp_mb();
> +again:
> +     if (!task_function_call(task, __perf_install_in_context, event))
>               return;

I'm trying to figure out whether the barriers implied by the IPI are
sufficient here, or whether we really need the explicit smp_mb().
Certainly, arch_send_call_function_single_ipi() has to order the publishing
of the remote work before the signalling of the interrupt, but the comment
above refers to "the task_cpu() load" and I can't see that load anywhere
after your diff.

What are you trying to order here?
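
To make my confusion concrete: the only pairing I can reconstruct from the
comment is the usual message-passing shape below, where the "load" would be
the task_cpu()/task_curr() reads buried inside task_function_call() rather
than anything visible in the diff (this is my guess at the intent, not
something the patch spells out):

	perf_install_in_context()		remote __schedule() of task
	---------------------------		---------------------------
	store task->perf_event_ctxp[]		update rq->curr / task_cpu()
	  (presumably the rcu_assign_pointer()
	   in find_get_context())
	smp_mb()				barrier provided by ... ?
	load task_cpu(task) / task_curr(task)	load task->perf_event_ctxp[]
	  inside task_function_call()		  in __perf_event_task_sched_in()

If that is the intent, spelling out which barrier is assumed on the
scheduler side would help.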

Will

>  
>       raw_spin_lock_irq(&ctx->lock);
> @@ -2351,12 +2373,16 @@ perf_install_in_context(struct perf_event_context *ctx,
>               raw_spin_unlock_irq(&ctx->lock);
>               return;
>       }
> -     raw_spin_unlock_irq(&ctx->lock);
>       /*
> -      * Since !ctx->is_active doesn't mean anything, we must IPI
> -      * unconditionally.
> +      * If the task is not running, ctx->lock will avoid it becoming so,
> +      * thus we can safely install the event.
>        */
> -     goto again;
> +     if (task_curr(task)) {
> +             raw_spin_unlock_irq(&ctx->lock);
> +             goto again;
> +     }
> +     add_event_to_ctx(event, ctx);
> +     raw_spin_unlock_irq(&ctx->lock);
>  }
>  
>  /*
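
FWIW, this is how I read the resulting perf_install_in_context() tail with
the patch applied, condensed into pseudo-code (the checks under ctx->lock
that handle an exiting/dead context are omitted, so don't take it
literally):

	/*
	 * Publish order: the perf_event_ctxp[] store must be visible
	 * before we sample the task below.
	 */
	smp_mb();
again:
	if (!task_function_call(task, __perf_install_in_context, event))
		return;			/* installed from the IPI side */

	raw_spin_lock_irq(&ctx->lock);
	if (task_curr(task)) {
		/* became (or stayed) runnable behind our back: retry */
		raw_spin_unlock_irq(&ctx->lock);
		goto again;
	}
	/* not running; ctx->lock stops it becoming so, install directly */
	add_event_to_ctx(event, ctx);
	raw_spin_unlock_irq(&ctx->lock);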
