On Mon, May 26, 2025 at 10:51 PM Tanish Desai <tanishdesa...@gmail.com> wrote:
>
> > Can you explain 3 more?
> Pablo and I looked at the object files for one of the default trace points,
> "trace_vhost_commit" (the first tracepoint of hw/virtio), before and after
> making this change. The change moves the hot-path check for
> _simple_trace_vhost_commit into the caller (in the .h file) and leaves the
> full stack-frame prologue and tracepoint body in the cold path. By inlining
> only the if (trace_events_enabled()) test at the call site, we eliminate the
> 11-instruction prologue overhead (stack allocation, canary load/store,
> register spills, and argument masking) on every vhost commit (trace call)
> when tracing is disabled.
> The old disassembled code looks like this:
>
>   0x10  stp  x29, x30, [sp, #-64]!           Prologue: allocates the 64-byte frame and saves the old FP (x29) & LR (x30)
>   0x14  adrp x3, trace_events_enabled_count  Prologue: computes the page base of the trace-enable counter
>   0x18  adrp x2, __stack_chk_guard           Important (or maybe prologue, not sure); stack protector: starts the stack-canary load
>   0x1c  mov  x29, sp                         Prologue: sets the new frame pointer
>   0x20  ldr  x3, [x3]                        Prologue: loads the actual trace-enabled count
>   0x24  stp  x19, x20, [sp, #16]             Prologue: spills the callee-saved regs used by this function (x19, x20)
>   0x28  and  w20, w0, #0xff                  Tracepoint setup: extracts the low 8 bits of arg0 as the "event boolean"
>   0x2c  ldr  x2, [x2]                        Prologue (cont'd): completes loading the stack-canary value
>   0x30  and  w19, w1, #0xff                  Tracepoint setup: extracts the low 8 bits of arg1
>   0x34  ldr  w0, [x3]                        Important: loads the current trace-enabled flag from memory
>   0x38  ldr  x1, [x2]                        Prologue (cont'd): reads the canary
>   0x3c  str  x1, [sp, #56]                   Prologue (cont'd): writes the canary into the new frame
>   0x40  mov  x1, #0                          Prologue (cont'd): zeroes x1 for the upcoming branch test
>   0x44  cbnz w0, 0x88                        Important: if tracing is disabled (w0 == 0), the branch is not taken and the heavy path is skipped entirely
>
> That saves 11 of the 14 instructions. And we have not counted the epilogue;
> including it, the savings would definitely be more than 11 instructions.
> Old flow: every call executes ~11 instructions of prologue plus the tracepoint
> check, even when tracing is disabled.
> New flow: a tiny if (trace_enabled) check is inlined at the caller; only when
> it is true does the call go into the full function with its prologue (the
> extra instructions).
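> For illustration, the new hot/cold split looks roughly like this (a simplified
> sketch in C using the names from this thread; the real tracetool output has
> more detail):
>
>     /* trace.h - hot path, inlined at every call site */
>     #include <stdbool.h>
>
>     extern int trace_events_enabled_count;      /* toggled at runtime */
>     void _simple_trace_vhost_commit(bool started, bool changed);
>
>     static inline void trace_vhost_commit(bool started, bool changed)
>     {
>         if (trace_events_enabled_count) {   /* cheap load/test/branch, no frame */
>             _simple_trace_vhost_commit(started, changed);
>         }
>     }
>
>     /* trace.c - cold path, only entered when tracing is enabled */
>     void _simple_trace_vhost_commit(bool started, bool changed)
>     {
>         /* full stack frame, canary, argument masking, event record write */
>     }
>
> With the event disabled, the caller runs only the inlined check and never pays
> for the callee's prologue/epilogue.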

Good, inlining worked as intended.

Please send a v2 with an updated commit description explaining the
purpose and with a disassembly snippet of the trace_vhost_commit()
call site showing what happens.

Thanks!

Stefan

>
> On Wed, May 21, 2025 at 11:46 PM Stefan Hajnoczi <stefa...@gmail.com> wrote:
>>
>> On Tue, May 20, 2025 at 4:52 PM Paolo Bonzini <pbonz...@redhat.com> wrote:
>> > Il mar 20 mag 2025, 21:01 Stefan Hajnoczi <stefa...@gmail.com> ha scritto:
>> >>
>> >> On Mon, May 19, 2025 at 2:52 PM Tanish Desai <tanishdesa...@gmail.com> 
>> >> wrote:
>> >> >
>> >> > Remove the hot path from the .c file and add it to the .h file to keep it inline.
>> >>
>> >> Please include performance results in the commit description so it's
>> >> clear what impact this change has.
>> >
>> >
>> > Hi Stefan,
>> >
>> > I am replying because I take the blame for this :) and as an example of 
>> > how he could interact with the maintainer.
>> >
>> > For now we mostly looked at differences between the code that tracetool 
>> > generates for the various backends, and the observation that some of them 
>> > put code in the .h and some in the .c file. I explained to Tanish the 
>> > concept of separating hot vs cold code in theory, showed him some examples 
>> > in QEMU where performance measurements were done in the past, and 
>> > suggested applying this to various backends (starting with the one with 
>> > the weirdest choice!). However we didn't do any measurement yet.
>> >
>> > Some possibilities that come to mind:
>> >
>> > 1) maybe the coroutine benchmarks are enough to show a difference, with 
>> > some luck
>> >
>> > 2) a new microbenchmark (or a set of microbenchmarks that try various 
>> > levels of register pressure around the tracepoint), which would be nice to 
>> > have anyway
>>
>> 2 is going above and beyond :). I'm not trying to make life hard.
>>
>> Another option if 1 or 2 are difficult to quantify: I'd be happy with
>> just a "before" and "after" disassembly snippet from objdump showing
>> that the instructions at a call site changed as intended.
>>
>> What I'm looking for is a commit description that explains the purpose
>> of the commit followed by, since this is a performance improvement,
>> some form of evidence that the change achieved its goal. At that point
>> it's easy for me to merge because it has a justification and the
>> nature of the commit will be clear to anyone looking at the git log in
>> the future.
>>
>> > 3) perhaps we could try to check the code size for some object files in 
>> > block/ (for example libblock.a.p/*.c.o), as a proxy for how much 
>> > instruction cache space is saved when all tracepoints are disabled
>> >
>> > We can start from 3, but also try 1+3 and 2+3 if that fails, or if you 
>> > think that would be more persuasive.
>>
>> Can you explain 3 more? Total code size on disk should stay about the
>> same because most trace events only have one call site. Moving code
>> between the caller and callee doesn't make a big difference in either
>> direction. At runtime there is a benefit from inlining the condition
>> check since the call site no longer needs to execute the callee's
>> function prologue/epilogue when the trace event is runtime-disabled,
>> but that CPU cycle and instruction cache effect won't be visible from
>> the on-disk code size.
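>> For reference, the "before" shape being described is roughly the following
>> (again a simplified C sketch, not the exact generated code): the header does
>> an unconditional out-of-line call, and the enabled check only happens after
>> the callee has already set up its stack frame.
>>
>>     /* trace.h - before: unconditional call into the .c file */
>>     static inline void trace_vhost_commit(bool started, bool changed)
>>     {
>>         _simple_trace_vhost_commit(started, changed);
>>     }
>>
>>     /* trace.c - before: the check sits behind the full prologue */
>>     extern int trace_events_enabled_count;
>>
>>     void _simple_trace_vhost_commit(bool started, bool changed)
>>     {
>>         if (!trace_events_enabled_count) {
>>             return;    /* prologue and epilogue have already been executed */
>>         }
>>         /* write the event record */
>>     }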
>>
>> Thanks,
>> Stefan
