Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks
On Wed, Sep 15, 2021, Zhu, Lingshan wrote:
> On 8/27/2021 3:59 AM, Sean Christopherson wrote:
> > TL;DR: Please don't merge this patch, it's broken and is also built on a
> > shoddy foundation that I would like to fix.
>
> Hi Sean, Peter, Paolo
>
> I will send out a V11 which drops this patch since it's buggy, and Sean is
> working on fixing this.
> Does this sound good?

Works for me, thanks!
Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks
On 8/27/2021 3:59 AM, Sean Christopherson wrote:
> TL;DR: Please don't merge this patch, it's broken and is also built on a
> shoddy foundation that I would like to fix.

Hi Sean, Peter, Paolo

I will send out a V11 which drops this patch since it's buggy, and Sean is
working on fixing this.
Does this sound good?

Thanks,
Zhu Lingshan
Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks
On Fri, Aug 06, 2021, Zhu Lingshan wrote:
> @@ -2944,18 +2966,21 @@ static unsigned long code_segment_base(struct pt_regs *regs)
>
>  unsigned long perf_instruction_pointer(struct pt_regs *regs)
>  {
> -	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
> -		return perf_guest_cbs->get_guest_ip();
> +	unsigned long ip = static_call(x86_guest_get_ip)();
> +
> +	if (likely(!ip))

Pivoting on ip==0 isn't correct, it's perfectly legal for a guest to execute
from %rip=0.  Unless there's some static_call() magic that supports this with a
default function:

	if (unlikely(!static_call(x86_guest_get_ip)(&ip)))
		ip = regs->ip + code_segment_base(regs);

	return ip;

The easiest thing is to keep the existing flow:

	if (unlikely(static_call(x86_guest_state)()))
		return static_call(x86_guest_get_ip)();

	return regs->ip + code_segment_base(regs);

It's an extra call for PMIs in guest, but I don't think any of the KVM folks
care _that_ much about the performance in this case.

> +		ip = regs->ip + code_segment_base(regs);
>
> -	return regs->ip + code_segment_base(regs);
> +	return ip;
>  }
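To make the ip==0 ambiguity concrete, here is a minimal user-space sketch in C. It is not kernel code: struct guest_ctx, get_ip_by_value(), get_ip_checked() and pick_ip() are hypothetical names invented for illustration, and no static_call machinery is modeled. It only shows why a value-only getter cannot tell "not in guest" from "guest running at address 0", while a getter that reports validity separately can.

```c
#include <stdbool.h>

/* Hypothetical stand-in for guest state; not a kernel structure. */
struct guest_ctx {
	bool in_guest;
	unsigned long rip;
};

/*
 * Value-only variant: returns 0 both when not in a guest and when the
 * guest really is executing at address 0, so the caller cannot tell
 * the two cases apart.
 */
static unsigned long get_ip_by_value(const struct guest_ctx *g)
{
	return g->in_guest ? g->rip : 0;
}

/*
 * Out-parameter variant: the return value says whether *ip was written,
 * so a guest IP of 0 stays distinguishable from "no guest".
 */
static bool get_ip_checked(const struct guest_ctx *g, unsigned long *ip)
{
	if (!g->in_guest)
		return false;
	*ip = g->rip;
	return true;
}

/* Same shape as the suggestion above: fall back to the host IP only
 * when the getter reports that no guest IP is available. */
static unsigned long pick_ip(const struct guest_ctx *g, unsigned long host_ip)
{
	unsigned long ip;

	if (!get_ip_checked(g, &ip))
		ip = host_ip;
	return ip;
}
```

With a guest at rip 0, pick_ip() returns 0 rather than the host IP, whereas a caller testing the result of get_ip_by_value() against zero would wrongly fall back to the host value.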
Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks
On 27/8/2021 3:59 am, Sean Christopherson wrote:
> TL;DR: Please don't merge this patch, it's broken and is also built on a
> shoddy foundation that I would like to fix.

Obviously, this patch is not closely related to the guest PEBS feature enabling,
and we can certainly put this issue in another discussion thread [1].

[1] https://lore.kernel.org/kvm/20210827005718.585190-1-sea...@google.com/
Re: [PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks
TL;DR: Please don't merge this patch, it's broken and is also built on a shoddy
foundation that I would like to fix.

On Fri, Aug 06, 2021, Zhu Lingshan wrote:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 464917096e73..e466fc8176e1 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -6489,9 +6489,18 @@ static void perf_pending_event(struct irq_work *entry)
>   */
>  struct perf_guest_info_callbacks *perf_guest_cbs;
>
> +/* explicitly use __weak to fix duplicate symbol error */
> +void __weak arch_perf_update_guest_cbs(void)
> +{
> +}
> +
>  int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
>  {
> +	if (WARN_ON_ONCE(perf_guest_cbs))
> +		return -EBUSY;
> +
>  	perf_guest_cbs = cbs;
> +	arch_perf_update_guest_cbs();

This is horribly broken, it fails to clean up the static calls when KVM
unregisters the callbacks, which happens when the vendor module, e.g. kvm_intel,
is unloaded.  The explosion doesn't happen until 'kvm' is unloaded because the
functions are implemented in 'kvm', i.e. the use-after-free is deferred a bit.

  BUG: unable to handle page fault for address: a011bb90
  #PF: supervisor instruction fetch in kernel mode
  #PF: error_code(0x0010) - not-present page
  PGD 6211067 P4D 6211067 PUD 6212063 PMD 102b99067 PTE 0
  Oops: 0010 [#1] PREEMPT SMP
  CPU: 0 PID: 1047 Comm: rmmod Not tainted 5.14.0-rc2+ #460
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:0xa011bb90
  Code: Unable to access opcode bytes at RIP 0xa011bb66.
  Call Trace:
   ? perf_misc_flags+0xe/0x50
   ? perf_prepare_sample+0x53/0x6b0
   ? perf_event_output_forward+0x67/0x160
   ? kvm_clock_read+0x14/0x30
   ? kvm_sched_clock_read+0x5/0x10
   ? sched_clock_cpu+0xd/0xd0
   ? __perf_event_overflow+0x52/0xf0
   ? handle_pmi_common+0x1f2/0x2d0
   ? __flush_tlb_all+0x30/0x30
   ? intel_pmu_handle_irq+0xcf/0x410
   ? nmi_handle+0x5/0x260
   ? perf_event_nmi_handler+0x28/0x50
   ? nmi_handle+0xc7/0x260
   ? lock_release+0x2b0/0x2b0
   ? default_do_nmi+0x6b/0x170
   ? exc_nmi+0x103/0x130
   ? end_repeat_nmi+0x16/0x1f
   ? lock_release+0x2b0/0x2b0
   ? lock_release+0x2b0/0x2b0
   ? lock_release+0x2b0/0x2b0
  Modules linked in: irqbypass [last unloaded: kvm]

Even more fun, the existing perf_guest_cbs framework is also broken, though it's
much harder to get it to fail, and probably impossible to get it to fail without
some help.  The issue is that perf_guest_cbs is global, which means that it can
be nullified by KVM (during module unload) while the callbacks are being
accessed by a PMI handler on a different CPU.

The bug has escaped notice because all dereferences of perf_guest_cbs follow the
same "perf_guest_cbs && perf_guest_cbs->is_in_guest()" pattern, and AFAICT the
compiler never reloads perf_guest_cbs in this sequence.  The compiler does
reload perf_guest_cbs for any future dereferences, but the ->is_in_guest() guard
all but guarantees the PMI handler will win the race, e.g. to nullify
perf_guest_cbs, KVM has to completely exit the guest and tear down all VMs
before it can be unloaded.

But with some help, e.g. READ_ONCE(perf_guest_cbs), unloading kvm_intel can
trigger a NULL pointer dereference, e.g. this tweak

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1eb45139fcc6..202e5ad97f82 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2954,7 +2954,7 @@ unsigned long perf_misc_flags(struct pt_regs *regs)
 {
 	int misc = 0;

-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+	if (READ_ONCE(perf_guest_cbs) && READ_ONCE(perf_guest_cbs)->is_in_guest()) {
 		if (perf_guest_cbs->is_user_mode())
 			misc |= PERF_RECORD_MISC_GUEST_USER;
 		else

while spamming module load/unload leads to:

  BUG: kernel NULL pointer dereference, address:
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x) - not-present page
  PGD 0 P4D 0
  Oops: [#1] PREEMPT SMP
  CPU: 6 PID: 1825 Comm: stress Not tainted 5.14.0-rc2+ #459
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:perf_misc_flags+0x1c/0x70
  Call Trace:
   perf_prepare_sample+0x53/0x6b0
   perf_event_output_forward+0x67/0x160
   __perf_event_overflow+0x52/0xf0
   handle_pmi_common+0x207/0x300
   intel_pmu_handle_irq+0xcf/0x410
   perf_event_nmi_handler+0x28/0x50
   nmi_handle+0xc7/0x260
   default_do_nmi+0x6b/0x170
   exc_nmi+0x103/0x130
   asm_exc_nmi+0x76/0xbf

The good news is that I have a series that should fix both the existing NULL
pointer bug and mostly obviate the need for static calls.  The bad news is that
my approach, making perf_guest_cbs per-CPU, likely complicates turning these
into static calls, though I'm guessing it's still a solvable problem.

Tangentially related, IMO we should make architectures opt-in to getting perf_guest
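The reload race described in this message can be illustrated with a small user-space sketch in C. Everything here is an assumption made for illustration: struct guest_cbs, register_cbs(), unregister_cbs(), sample_ip() and the fake_* callbacks are hypothetical names, C11 atomics stand in for the kernel's READ_ONCE()/WRITE_ONCE(), and this is not the fix from the series mentioned above. The point it shows is only that a reader which loads the global pointer once and then uses the local copy cannot observe NULL between the check and the call.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical callback table, loosely modeled on the callbacks discussed. */
struct guest_cbs {
	unsigned int (*state)(void);
	unsigned long (*get_ip)(void);
};

/* Global pointer set on "module load" and cleared on "module unload". */
static _Atomic(struct guest_cbs *) guest_cbs_ptr;

static void register_cbs(struct guest_cbs *cbs)
{
	atomic_store(&guest_cbs_ptr, cbs);
}

static void unregister_cbs(void)
{
	atomic_store(&guest_cbs_ptr, (struct guest_cbs *)NULL);
}

/*
 * Reader: load the pointer exactly once and dereference only the local
 * snapshot. Testing the global and then dereferencing it again would
 * allow a concurrent unregister_cbs() to slip in between the two loads.
 */
static unsigned long sample_ip(unsigned long host_ip)
{
	struct guest_cbs *cbs = atomic_load(&guest_cbs_ptr);

	if (cbs && cbs->state())
		return cbs->get_ip();
	return host_ip;
}

/* Hypothetical callbacks used only to exercise the sketch. */
static unsigned int fake_state(void)
{
	return 1;
}

static unsigned long fake_get_ip(void)
{
	return 0x1000;
}
```

The snapshot only addresses the NULL dereference. Keeping the callback table and the functions behind it alive until in-flight readers have finished (the use-after-free half of the problem) needs a separate mechanism that this sketch does not attempt to model.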
[PATCH V10 01/18] perf/core: Use static_call to optimize perf_guest_info_callbacks
From: Like Xu

For "struct perf_guest_info_callbacks", the two fields "is_in_guest"
and "is_user_mode" are replaced with a new multiplexed member named "state",
and the "get_guest_ip" field will be renamed to "get_ip".

For arm64, xen and kvm/x86, the application of DEFINE_STATIC_CALL_RET0
could make all that perf_guest_cbs stuff suck less. For arm, csky, nds32,
and riscv, just applied some renamed refactoring.

Cc: Will Deacon
Cc: Marc Zyngier
Cc: Guo Ren
Cc: Nick Hu
Cc: Paul Walmsley
Cc: Boris Ostrovsky
Cc: linux-arm-ker...@lists.infradead.org
Cc: kvm...@lists.cs.columbia.edu
Cc: linux-c...@vger.kernel.org
Cc: linux-ri...@lists.infradead.org
Cc: xen-devel@lists.xenproject.org
Suggested-by: Peter Zijlstra (Intel)
Original-by: Peter Zijlstra (Intel)
Signed-off-by: Like Xu
Signed-off-by: Zhu Lingshan
Reviewed-by: Boris Ostrovsky
Acked-by: Peter Zijlstra (Intel)
---
 arch/arm/kernel/perf_callchain.c   | 16 +++-
 arch/arm64/kernel/perf_callchain.c | 29 +-
 arch/arm64/kvm/perf.c              | 22 -
 arch/csky/kernel/perf_callchain.c  |  4 +--
 arch/nds32/kernel/perf_event_cpu.c | 16 +++-
 arch/riscv/kernel/perf_callchain.c |  4 +--
 arch/x86/events/core.c             | 39 --
 arch/x86/events/intel/core.c       |  7 +++---
 arch/x86/include/asm/kvm_host.h    |  2 +-
 arch/x86/kvm/pmu.c                 |  2 +-
 arch/x86/kvm/x86.c                 | 37 +++-
 arch/x86/xen/pmu.c                 | 33 ++---
 include/linux/perf_event.h         | 12 ++---
 kernel/events/core.c               |  9 +++
 14 files changed, 144 insertions(+), 88 deletions(-)

diff --git a/arch/arm/kernel/perf_callchain.c b/arch/arm/kernel/perf_callchain.c
index 3b69a76d341e..1ce30f86d6c7 100644
--- a/arch/arm/kernel/perf_callchain.c
+++ b/arch/arm/kernel/perf_callchain.c
@@ -64,7 +64,7 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs
 {
 	struct frame_tail __user *tail;

-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+	if (perf_guest_cbs && perf_guest_cbs->state()) {
 		/* We don't support guest os callchain now */
 		return;
 	}
@@ -100,7 +100,7 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re
 {
 	struct stackframe fr;

-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+	if (perf_guest_cbs && perf_guest_cbs->state()) {
 		/* We don't support guest os callchain now */
 		return;
 	}
@@ -111,8 +111,8 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs *re

 unsigned long perf_instruction_pointer(struct pt_regs *regs)
 {
-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
-		return perf_guest_cbs->get_guest_ip();
+	if (perf_guest_cbs && perf_guest_cbs->state())
+		return perf_guest_cbs->get_ip();

 	return instruction_pointer(regs);
 }
@@ -120,9 +120,13 @@ unsigned long perf_instruction_pointer(struct pt_regs *regs)
 unsigned long perf_misc_flags(struct pt_regs *regs)
 {
 	int misc = 0;
+	unsigned int state = 0;

-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
-		if (perf_guest_cbs->is_user_mode())
+	if (perf_guest_cbs)
+		state = perf_guest_cbs->state();
+
+	if (perf_guest_cbs && state) {
+		if (state & PERF_GUEST_USER)
 			misc |= PERF_RECORD_MISC_GUEST_USER;
 		else
 			misc |= PERF_RECORD_MISC_GUEST_KERNEL;
diff --git a/arch/arm64/kernel/perf_callchain.c b/arch/arm64/kernel/perf_callchain.c
index 4a72c2727309..1b344e23fd2f 100644
--- a/arch/arm64/kernel/perf_callchain.c
+++ b/arch/arm64/kernel/perf_callchain.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2015 ARM Limited
  */
 #include
+#include
 #include

 #include
@@ -99,10 +100,25 @@ compat_user_backtrace(struct compat_frame_tail __user *tail,
 }
 #endif /* CONFIG_COMPAT */

+DEFINE_STATIC_CALL_RET0(arm64_guest_state, *(perf_guest_cbs->state));
+DEFINE_STATIC_CALL_RET0(arm64_guest_get_ip, *(perf_guest_cbs->get_ip));
+
+void arch_perf_update_guest_cbs(void)
+{
+	static_call_update(arm64_guest_state, (void *)&__static_call_return0);
+	static_call_update(arm64_guest_get_ip, (void *)&__static_call_return0);
+
+	if (perf_guest_cbs && perf_guest_cbs->state)
+		static_call_update(arm64_guest_state, perf_guest_cbs->state);
+
+	if (perf_guest_cbs && perf_guest_cbs->get_ip)
+		static_call_update(arm64_guest_get_ip, perf_guest_cbs->get_ip);
+}
+
 void perf_callchain_user(struct perf_callchain_entry_ctx *entry,
 			 struct pt_regs *regs)
 {
-	if (perf_guest_cbs && perf_guest_cbs->is_in_guest()) {
+	if (static_call(arm64_guest_state)()) {
 		/* We don't support guest os ca