On 07/02/2018 12:02 PM, Mark Rutland wrote: > On Mon, Jul 02, 2018 at 05:46:55PM +0200, Peter Zijlstra wrote: >> On Mon, Jul 02, 2018 at 04:12:50PM +0100, Mark Rutland wrote: >>> Users can request that general purpose registers, instruction pointer, >>> etc, are sampled when a perf event counter overflows. To try to avoid >>> this resulting in kernel state being leaked, unprivileged users are >>> usually forbidden from opening events which count while the kernel is >>> running. >>> >>> Unfortunately, this is not sufficient to avoid leading kernel state. >> 'leaking' surely. >> >>> For various reasons, there can be a delay between the overflow occurring >>> and the resulting overflow exception (e.g. an NMI) being taken. During >>> this window, other instructions may be executed, resulting in skid. >>> >>> This skid means that a userspace-only event overflowing may result in an >>> exception being taken *after* entry to the kernel, allowing kernel >>> registers to be sampled. Depending on the amount of skid, this may only >>> leak the PC (breaking KASLR), or it may leak secrets which are currently >>> live in GPRs. >>> >>> Let's avoid this by only sampling from the user registers when an event >>> is supposed to exclude the kernel, providing the illusion that the >>> overflow exception is taken from userspace. >>> >>> We also have similar cases when sampling a guest, where we get the host >>> regs in some cases. It's not entirely clear to me how we should handle >>> these. >> Would not a similar: >> >> if ((event->attr.exclude_hv || event->attr.exclude_host) /* WTF both !? >> */ && >> perf_guest_cbs && !perf_guest_cbs->is_in_guest()) >> return perf_guest_cbs->guest_pt_regs(); >> >> work there? > Mostly. It's fun if the user also passed exclude_guest -- I have no idea > what should be sampled there, if anything. > > It's easy to get stuck in a rabbit hole looking at this. > >> Of course, perf_guest_info_callbacks is currently lacking that >> guest_pt_regs() thingy.. > Yeah; I started looking at implementing it, but ran away since it wasn't > clear to me how to build that on most architectures. > >>> Signed-off-by: Mark Rutland <mark.rutl...@arm.com> >>> Cc: Ingo Molnar <mi...@redhat.com> >>> Cc: Jin Yao <yao....@linux.intel.com> >>> Cc: Peter Zijlstra <pet...@infradead.org> >>> --- >>> kernel/events/core.c | 28 ++++++++++++++++++++++++++++ >>> 1 file changed, 28 insertions(+) >>> >>> diff --git a/kernel/events/core.c b/kernel/events/core.c >>> index 8f0434a9951a..2ab2548b2e66 100644 >>> --- a/kernel/events/core.c >>> +++ b/kernel/events/core.c >>> @@ -6361,6 +6361,32 @@ perf_callchain(struct perf_event *event, struct >>> pt_regs *regs) >>> return callchain ?: &__empty_callchain; >>> } >>> >>> +static struct pt_regs *perf_get_sample_regs(struct perf_event *event, >>> + struct pt_regs *regs) >>> +{ >>> + /* >>> + * Due to interrupt latency (AKA "skid"), we may enter the kernel >>> + * before taking an overflow, even if the PMU is only counting user >>> + * events. >>> + * >>> + * If we're not counting kernel events, always use the user regs when >>> + * sampling. >>> + * >>> + * TODO: what do we do about sampling a guest's registers? The IP is >>> + * special-cased, but for the rest of the regs they'll get the >>> + * user/kernel regs depending on whether exclude_kernel is set, which >>> + * is nonsensical. >>> + * >>> + * We can't get at the full set of regs in all cases (e.g. Xen's PV PMU >>> + * can't provide the GPRs), so should we just zero the GPRs when in a >>> + * guest? Or skip outputting the regs in perf_output_sample? >> Seems daft Xen cannot provide registers; why is that? Boris? > The xen_pmu_regs structure simply doesn't have them, so I assume there's > no API to get them. > > Given we don't currently sample the guest regs, I'd be tempted to just > zero them for now, or skip the sample at output time (if that doesn't > break some other case).
(Was out on vacation, couldn't respond earlier) Yes, PV guests only get a limited set of registers passed to the handler by the hypervisor. GPRs are not part of this set. I think we need do diff --git a/arch/x86/xen/pmu.c b/arch/x86/xen/pmu.c index 7d00d4a..95997e6 100644 --- a/arch/x86/xen/pmu.c +++ b/arch/x86/xen/pmu.c @@ -478,7 +478,7 @@ static void xen_convert_regs(const struct xen_pmu_regs *xen_regs, irqreturn_t xen_pmu_irq_handler(int irq, void *dev_id) { int err, ret = IRQ_NONE; - struct pt_regs regs; + struct pt_regs regs = {0}; const struct xen_pmu_data *xenpmu_data = get_xenpmu_data(); uint8_t xenpmu_flags = get_xenpmu_flags(); Do you want me to submit a separate patch or can you make this part of yours? Thanks. -boris