On Tue, Jun 25, 2019 at 12:43:20PM +0200, Peter Zijlstra wrote:
> 
> Syzcaller reported the following Use-after-Free issue:
> 
>   close()                                   clone()
> 
>                                                copy_process()
>                                                  perf_event_init_task()
>                                                    perf_event_init_context()
>                                                      mutex_lock(parent_ctx->mutex)
>                                                      inherit_task_group()
>                                                        inherit_group()
>                                                          inherit_event()
>                                                            mutex_lock(event->child_mutex)
>                                                            // expose event on child list
>                                                            list_add_tail()
>                                                            mutex_unlock(event->child_mutex)
>                                                      mutex_unlock(parent_ctx->mutex)
> 
>                                                  ...
>                                                  goto bad_fork_*
> 
>                                                bad_fork_cleanup_perf:
>                                                  perf_event_free_task()
> 
>   perf_release()
>     perf_event_release_kernel()
>       list_for_each_entry()
>         mutex_lock(ctx->mutex)
>         mutex_lock(event->child_mutex)
>         // event is from the failing inherit
>         // on the other CPU
>         perf_remove_from_context()
>         list_move()
>         mutex_unlock(event->child_mutex)
>         mutex_unlock(ctx->mutex)
> 
>                                                    mutex_lock(ctx->mutex)
>                                                    list_for_each_entry_safe()
>                                                      // event already stolen
>                                                    mutex_unlock(ctx->mutex)
> 
>                                                  delayed_free_task()
>                                                    free_task()
> 
>       list_for_each_entry_safe()
>         list_del()
>         free_event()
>           _free_event()
>             // and so event->hw.target
>             // is the already freed failed clone()
>             if (event->hw.target)
>               put_task_struct(event->hw.target)
>                 // WHOOPSIE, already quite dead
> 
> 
> Which puts the lie to the comment on perf_event_free_task():
> 'unexposed, unused context' not so much.
> 
> Which is a 'fun' confluence of fail; copy_process() doing an
> unconditional free_task() and not respecting refcounts, and perf having
> creative locking. In particular:
> 
>   82d94856fa22 ("perf/core: Fix lock inversion between perf,trace,cpuhp")
> 
> seems to have overlooked this 'fun' parade.
> 
> Solve it by using the fact that detached events still have a reference
> count on their (previous) context. With this perf_event_free_task()
> can detect when events have escaped and wait for their destruction.
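
For anyone reading the diff below: the two halves of that wait/wake handshake
land in separate hunks, so here is the new pairing condensed side by side
(lifted from the patch, with the surrounding function bodies trimmed):

	/* perf_event_release_kernel(): free the stolen children, then kick the waiter */
	list_for_each_entry_safe(child, tmp, &free_list, child_list) {
		void *var = &child->ctx->refcount;

		list_del(&child->child_list);
		free_event(child);	/* drops the child's reference on ctx */

		smp_mb();		/* pairs with wait_var_event() */
		wake_up_var(var);
	}

	/* perf_event_free_task(): wait until only our own ctx reference remains */
	wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
	put_ctx(ctx);			/* must be last */

The fork-failure path can only get past the wait once every stolen child has
dropped its ctx reference, and because _free_event() now does put_ctx() last,
that also guarantees the children's task references are gone before
copy_process() frees the half-built task.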
> 
> Cc: Mark Rutland <mark.rutl...@arm.com>
> Fixes: 82d94856fa22 ("perf/core: Fix lock inversion between perf,trace,cpuhp")
> Reported-by: syzbot+a24c397a29ad22d86...@syzkaller.appspotmail.com
> Debugged-by: Alexander Shishkin <alexander.shish...@linux.intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
> ---
>  kernel/events/core.c | 49 +++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 41 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 10c1dba9068c..5302c19e9892 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -4463,12 +4463,20 @@ static void _free_event(struct perf_event *event)
>       if (event->destroy)
>               event->destroy(event);
>  
> -     if (event->ctx)
> -             put_ctx(event->ctx);
> -
> +     /*
> +      * Must be after ->destroy(), due to uprobe_perf_close() using
> +      * hw.target.
> +      */
>       if (event->hw.target)
>               put_task_struct(event->hw.target);
>  
> +     /*
> +      * perf_event_free_task() relies on put_ctx() being 'last', in particular
> +      * all task references must be cleaned up.
> +      */
> +     if (event->ctx)
> +             put_ctx(event->ctx);
> +
>       exclusive_event_destroy(event);
>       module_put(event->pmu->module);
>  
> @@ -4648,8 +4656,17 @@ int perf_event_release_kernel(struct perf_event *event)
>       mutex_unlock(&event->child_mutex);
>  
>       list_for_each_entry_safe(child, tmp, &free_list, child_list) {
> +             void *var = &child->ctx->refcount;
> +
>               list_del(&child->child_list);
>               free_event(child);
> +
> +             /*
> +              * Wake any perf_event_free_task() waiting for this event to be
> +              * freed.
> +              */
> +             smp_mb(); /* pairs with wait_var_event() */
> +             wake_up_var(var);

Huh, so wake_up_var() doesn't imply a RELEASE?

As an aside, doesn't that mean all callers of wake_up_var() have to do
likewise to ensure it isn't re-ordered with whatever prior stuff they're
trying to notify waiters about? Several do an smp_store_release() then a
wake_up_var(), but IIUC the wake_up_var() could get pulled before that
release...

I'm likely missing a subtlety there (I guess in practice the
implementation of wake_up_var prevents that), or maybe I have more
vocabulary than I have a clue. ;)
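
To make that concrete, the pattern I have in mind looks roughly like this
(hypothetical 'done' flag for illustration, not the perf code itself):

	#include <linux/wait_bit.h>	/* wait_var_event(), wake_up_var() */

	static int done;

	static void waker(void)
	{
		WRITE_ONCE(done, 1);	/* publish the condition ...          */
		smp_mb();		/* ... and order it before the wakeup */
		wake_up_var(&done);
	}

	static void waiter(void)
	{
		/* sleeps on &done, re-checking the condition around the sleep */
		wait_var_event(&done, READ_ONCE(done));
	}

The lost-wakeup worry is the usual one: if the wakeup (or its internal "any
waiters?" check) could be observed before the condition store, a waiter that
still sees the old value could go back to sleep and never be woken again.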

Other than that tangent, this looks sane to me, so FWIW:

Acked-by: Mark Rutland <mark.rutl...@arm.com>

Ideally, we'd throw some CPU hours at testing this. I can drop this in
my fuzzing queue next week, but I don't have a reproducer that I can
soak-test this with. :(

Thanks,
Mark.

>       }
>  
>  no_ctx:
> @@ -11512,11 +11529,11 @@ static void perf_free_event(struct perf_event *event,
>  }
>  
>  /*
> - * Free an unexposed, unused context as created by inheritance by
> - * perf_event_init_task below, used by fork() in case of fail.
> + * Free a context as created by inheritance by perf_event_init_task() below,
> + * used by fork() in case of fail.
>   *
> - * Not all locks are strictly required, but take them anyway to be nice and
> - * help out with the lockdep assertions.
> + * Even though the task has never lived, the context and events have been
> + * exposed through the child_list, so we must take care tearing it all down.
>   */
>  void perf_event_free_task(struct task_struct *task)
>  {
> @@ -11546,7 +11563,23 @@ void perf_event_free_task(struct task_struct *task)
>                       perf_free_event(event, ctx);
>  
>               mutex_unlock(&ctx->mutex);
> -             put_ctx(ctx);
> +
> +             /*
> +              * perf_event_release_kernel() could've stolen some of our
> +              * child events and still have them on its free_list. In that
> +              * case we must wait for these events to have been freed (in
> +              * particular all their references to this task must've been
> +              * dropped).
> +              *
> +              * Without this copy_process() will unconditionally free this
> +              * task (irrespective of its reference count) and
> +              * _free_event()'s put_task_struct(event->hw.target) will be a
> +              * use-after-free.
> +              *
> +              * Wait for all events to drop their context reference.
> +              */
> +             wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
> +             put_ctx(ctx); /* must be last */
>       }
>  }
>  
> 
