On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote:
> Users can request that general purpose registers, instruction pointer,
> etc, are sampled when a perf event counter overflows. To try to avoid
> this resulting in kernel state being leaked, unprivileged users are
> usually forbidden from opening events which count while the kernel is
> running.
> 
> Unfortunately, this is not sufficient to avoid leading kernel state.

'leaking' surely.

> 
> For various reasons, there can be a delay between the overflow occurring
> and the resulting overflow exception (e.g. an NMI) being taken. During
> this window, other instructions may be executed, resulting in skid.
> 
> This skid means that a userspace-only event overflowing may result in an
> exception being taken *after* entry to the kernel, allowing kernel
> registers to be sampled. Depending on the amount of skid, this may only
> leak the PC (breaking KASLR), or it may leak secrets which are currently
> live in GPRs.
> 
> Let's avoid this by only sampling from the user registers when an event
> is supposed to exclude the kernel, providing the illusion that the
> overflow exception is taken from userspace.
> 
> We also have similar cases when sampling a guest, where we get the host
> regs in some cases. It's not entirely clear to me how we should handle
> these.

Would not a similar:

        if ((event->attr.exclude_hv || event->attr.exclude_host) /* WTF both !? 
*/ &&
            perf_guest_cbs && !perf_guest_cbs->is_in_guest())
                return perf_guest_cbs->guest_pt_regs();

work there? Of course, perf_guest_info_callbacks is currently lacking
that guest_pt_regs() thingy..

> 
> Signed-off-by: Mark Rutland <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jin Yao <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
>  kernel/events/core.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8f0434a9951a..2ab2548b2e66 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6361,6 +6361,32 @@ perf_callchain(struct perf_event *event, struct 
> pt_regs *regs)
>       return callchain ?: &__empty_callchain;
>  }
>  
> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event,
> +                                         struct pt_regs *regs)
> +{
> +     /*
> +      * Due to interrupt latency (AKA "skid"), we may enter the kernel
> +      * before taking an overflow, even if the PMU is only counting user
> +      * events.
> +      *
> +      * If we're not counting kernel events, always use the user regs when
> +      * sampling.
> +      *
> +      * TODO: what do we do about sampling a guest's registers? The IP is
> +      * special-cased, but for the rest of the regs they'll get the
> +      * user/kernel regs depending on whether exclude_kernel is set, which
> +      * is nonsensical.
> +      *
> +      * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU
> +      * can't provide the GPRs), so should we just zero the GPRs when in a
> +      * guest? Or skip outputting the regs in perf_output_sample?

Seems daft Xen cannot provide registers; why is that? Boris?

> +      */
> +     if (event->attr.exclude_kernel && !user_mode(regs))
> +             return task_pt_regs(current);
> +
> +     return regs;
> +}
> +
>  void perf_prepare_sample(struct perf_event_header *header,
>                        struct perf_sample_data *data,
>                        struct perf_event *event,
> @@ -6368,6 +6394,8 @@ void perf_prepare_sample(struct perf_event_header 
> *header,
>  {
>       u64 sample_type = event->attr.sample_type;
>  
> +     regs = perf_get_sample_regs(event, regs);
> +
>       header->type = PERF_RECORD_SAMPLE;
>       header->size = sizeof(*header) + event->header_size;

In any case ACK for this thing.

Reply via email to