I have shared v2 with updated commit msg to everyone in form of a new PATCH.
Please check it and let me know if anything is still missing or confusing.


On Wed, 28 May 2025 at 11:36 PM, Stefan Hajnoczi <stefa...@gmail.com> wrote:

> On Mon, May 26, 2025 at 10:51 PM Tanish Desai <tanishdesa...@gmail.com>
> wrote:
> >
> > > Can you explain 3 more?
> > Pablo and I have looked into object files for one of default trace point
> "trace_vhost_commit"(first tracepoint of hw/virtio) before and after making
> this change and we found that by moves the hot-path check for
> _simple_trace_vhost_commit into the caller (in the .h file) and leaves the
> full stack-frame prologue and tracepoint body in the cold path. By inlining
> only the if (trace_events_enabled()) test at the call site, we eliminate
> the 11-instruction prologue overhead (stack allocation, canary load/store,
> register spills, and argument masking) on every vhost commit(trace call)
> when tracing is disabled.
> > The old disassembled code looks like this:-
> > 0x10stp x29, x30, [sp, #-64]!Prologue: allocates 64-byte frame and saves
> old FP (x29) & LR (x30)
> > 0x14adrp x3, trace_events_enabled_countPrologue: computes page-base of
> the trace-enable counter
> > 0x18adrp x2, __stack_chk_guardImportant (maybe prolog don't
> know?)(stack-protector): starts up the stack-canary load
> > 0x1cmov x29, spPrologue: sets new frame pointer
> > 0x20ldr x3, [x3]Prologue: loads the actual trace-enabled count
> > 0x24stp x19, x20, [sp, #16]Prologue: spills callee-saved regs used by
> this function (x19, x20)
> > 0x28and w20, w0, #0xffTracepoint setup: extracts the low-8 bits of arg0
> as the “event boolean”
> > 0x2cldr x2, [x2]Prologue (cont’d): completes loading of the stack-canary
> value
> > 0x30and w19, w1, #0xffTracepoint setup: extracts low-8 bits of arg1
> > 0x34ldr w0, [x3]Important: loads the current trace-enabled flag from
> memory
> > 0x38ldr x1, [x2]Prologue (cont’d): reads the canary
> > 0x3cstr x1, [sp, #56]Prologue (cont’d): writes the canary into the new
> frame
> > 0x40mov x1, #0Prologue (cont’d): zeroes out x1 for the upcoming branch
> test
> > 0x44cbnz w0, 0x88Important: if tracing is disabled (w0==0) skip the
> heavy path entirely
> > Saving 11/14 instructions!Also We have not considered epilog
> instructional saves would be definitely more than 11.
> > Old flow: Every call does ~11 insns of prologue + tracepoint check, even
> if tracing is disabled.
> > New flow: We inline a tiny if (trace_enabled) at the caller; only when
> it’s true do you call into a full function with its prologue.(Extra
> instructions)
>
> Good, inlining worked as intended.
>
> Please send a v2 with an updated commit description explaining the
> purpose and with a disassembly snippet of the trace_vhost_commit()
> call site showing what happens.
>
> Thanks!
>
> Stefan
>
> >
> > On Wed, May 21, 2025 at 11:46 PM Stefan Hajnoczi <stefa...@gmail.com>
> wrote:
> >>
> >> On Tue, May 20, 2025 at 4:52 PM Paolo Bonzini <pbonz...@redhat.com>
> wrote:
> >> > Il mar 20 mag 2025, 21:01 Stefan Hajnoczi <stefa...@gmail.com> ha
> scritto:
> >> >>
> >> >> On Mon, May 19, 2025 at 2:52 PM Tanish Desai <
> tanishdesa...@gmail.com> wrote:
> >> >> >
> >> >> > Remove hot paths from .c file and added it in .h file to keep it
> inline.
> >> >>
> >> >> Please include performance results in the commit description so it's
> >> >> clear what impact this change has.
> >> >
> >> >
> >> > Hi Stefan,
> >> >
> >> > I am replying because I take the blame for this :) and as an example
> of how he could interact with the maintainer.
> >> >
> >> > For now we mostly looked at differences between the code that
> tracetool generates for the various backends, and the observation that some
> of them put code in the .h and some in the .c file. I explained to Tanish
> the concept of separating hot vs cold code in theory, showed him some
> examples in QEMU where performance measurements were done in the past, and
> suggested applying this to various backends (starting with the one with the
> weirdest choice!). However we didn't do any measurement yet.
> >> >
> >> > Some possibilities that come to mind:
> >> >
> >> > 1) maybe the coroutine benchmarks are enough to show a difference,
> with some luck
> >> >
> >> > 2) a new microbenchmark (or a set of microbenchmarks that try various
> level of register pressure around the tracepoint), which would be nice to
> have anyway
> >>
> >> 2 is going above and beyond :). I'm not trying to make life hard.
> >>
> >> Another option if 1 or 2 are difficult to quantify: I'd be happy with
> >> just a "before" and "after" disassembly snippet from objdump showing
> >> that the instructions at a call site changed as intended.
> >>
> >> What I'm looking for is a commit description that explains the purpose
> >> of the commit followed by, since this is a performance improvement,
> >> some form of evidence that the change achieved its goal. At that point
> >> it's easy for me to merge because it has a justification and the
> >> nature of the commit will be clear to anyone looking at the git log in
> >> the future.
> >>
> >> > 3) perhaps we could try to check the code size for some object files
> in block/ (for example libblock.a.p/*.c.o), as a proxy for how much
> instruction cache space is saved when all tracepoints are disabled
> >> >
> >> > We can start from 3, but also try 1+3 and 2+3 if it fails if you
> think that would be more persuasive.
> >>
> >> Can you explain 3 more? Total code size on disk should stay about the
> >> same because most trace events only have one call site. Moving code
> >> between the caller and callee doesn't make a big difference in either
> >> direction. At runtime there is a benefit from inlining the condition
> >> check since the call site no longer needs to execute the callee's
> >> function prologue/epilogue when the trace event is runtime-disabled,
> >> but that CPU cycle and instruction cache effect won't be visible from
> >> the on-disk code size.
> >>
> >> Thanks,
> >> Stefan
>

Reply via email to