On Tue, Aug 25, 2020 at 4:18 PM Alexander Graf <g...@amazon.com> wrote: > > Hi Thomas, > > On 25.08.20 12:28, Thomas Gleixner wrote: > >void irq_complete_move(struct irq_cfg *cfg) { __irq_complete_move(cfg, ~get_irq_regs()->orig_ax); }
> > Alex, > > > > On Mon, Aug 24 2020 at 19:29, Alexander Graf wrote: > >> I'm currently trying to understand a performance regression with > >> ScyllaDB on i3en.3xlarge (KVM based VM on Skylake) which we reliably > >> bisected down to this commit: > >> > >> https://github.com/scylladb/scylla/issues/7036 > >> > >> What we're seeing is that syscalls such as membarrier() take forever > >> (0-10 µs would be normal): > > ... > >> That again seems to stem from a vastly slowed down > >> smp_call_function_many_cond(): > >> > >> Samples: 218K of event 'cpu-clock', 4000 Hz > >> Overhead Shared Object Symbol > >> 94.51% [kernel] [k] smp_call_function_many_cond > >> 0.76% [kernel] [k] __do_softirq > >> 0.32% [kernel] [k] native_queued_spin_lock_slowpath > >> [...] > >> > >> which is stuck in > >> > >> │ csd_lock_wait(): > >> │ smp_cond_load_acquire(&csd->flags, !(VAL & > >> 0.00 │ mov 0x8(%rcx),%edx > >> 0.00 │ and $0x1,%edx > >> │ ↓ je 2b9 > >> │ rep_nop(): > >> 0.70 │2af: pause > >> │ csd_lock_wait(): > >> 92.82 │ mov 0x8(%rcx),%edx > >> 6.48 │ and $0x1,%edx > >> 0.00 │ ↑ jne 2af > >> 0.00 │2b9: ↑ jmp 282 > >> > >> > >> Given the patch at hand I was expecting lost IPIs, but I can't quite see > >> anything getting lost. > > > > I have no idea how that patch should be related to IPI and smp function > > calls. It's changing the way how regular device interrupts and their > > spurious counterpart are handled and not the way how IPIs are > > handled. They are handled by direct vectors and do not go through > > do_IRQ() at all. > > > > Aside of that the commit just changes the way how the interrupt vector > > of a regular device interrupt is stored and conveyed. The extra read and > > write on the cache hot stack is hardly related to anything IPI. > > I am as puzzled as you are, but the bisect is very clear: 79b9c183021e > works fast and 633260fa1 (as well as mainline) shows the weird behavior > above. > > It gets even better. This small (demonstrative only, mangled) patch on > top of 633260fa1 also resolves the performance issue: > > diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c > index c766936..7e91e9a 100644 > --- a/arch/x86/kernel/irq.c > +++ b/arch/x86/kernel/irq.c > @@ -239,6 +239,7 @@ __visible void __irq_entry do_IRQ(struct pt_regs > *regs, unsigned long vector) > * lower 8 bits. > */ > vector &= 0xFF; > + regs->orig_ax = ~vector; > > /* entering_irq() tells RCU that we're not quiescent. Check it. */ > RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU"); > > > To me that sounds like some irq exit code somewhere must still be > looking at orig_ax to decide on something - and that something is wrong > now that we removed the negation of the vector. It also seems to have an > impact on remote function calls. > > I'll have a deeper look tomorrow again if I can find any such place, but > I wouldn't mind if anyone could point me into the right direction > earlier :). How about this: void irq_complete_move(struct irq_cfg *cfg) { __irq_complete_move(cfg, ~get_irq_regs()->orig_ax); } in arch/x86/kernel/apic/vector.c.