On Mon, Sep 19, 2022 at 11:25:13AM +0200, Florian Weimer wrote:
> * Jakub Jelinek:
> 
> > The disadvantage of the patch is that touching reg[x].loc and how[x]
> > now means 2 cachelines rather than one as before, and I admit beyond
> > bootstrap/regtest I haven't benchmarked it in any way.  Florian, could
> > you retry whatever you measured to get at the 40% of time spent on the
> > stack clearing to see how the numbers change?
> 
> A benchmark that unwinds through 100 frames containing a std::string
> variable goes from (0b5b8ac5cb7fe92dd17ae8bd7de84640daa59e84):
> 
> min:     24418 ns
> 25%:     24740 ns
> 50%:     24790 ns
> 75%:     24840 ns
> 95%:     24937 ns
> 99%:     26174 ns
> max:     42530 ns
> avg:   24826.1 ns
> 
> to (0b5b8ac5cb7fe92dd17ae8bd7de84640daa59e84 with this patch):
> 
> min:     22307 ns
> 25%:     22640 ns
> 50%:     22713 ns
> 75%:     22787 ns
> 95%:     22948 ns
> 99%:     24839 ns
> max:     52658 ns
> avg:   22863.4 ns
> 
> So 227 ns per frame instead of 248 ns per frame, or ~9% less.

Thanks for doing that.

> Moving cfa_how after how in struct frame_state_reg_info as an 8-bit
> bitfield should avoid zeroing another 8 bytes.  This shaves off another
> 3 ns per frame in my testing (on a Core i9-10900T, so with ERMS).

Good idea.  It won't always help, since on some targets the size of the how
array could be a multiple of the pointer alignment.  But when cfa_how is at
the end it always increases the structure size by the alignment of a
pointer, while right after the how array it only does so if the size of how
is a multiple of the pointer alignment.
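
To illustrate the padding argument, here is a minimal sketch using
hypothetical stand-in structures, not the real frame_state_reg_info.  It
assumes one pointer-sized loc slot and one byte of how per register, plus
a trailing pointer member; the struct names, the helper functions and the
register counts 18 and 16 are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real structure.  at_end_N keeps the
   one-byte cfa_how as the last member, so it needs fresh tail padding.
   after_how_N puts it right behind how[], where it can reuse the padding
   that precedes the next pointer, unless how[] already ends on a
   pointer-aligned boundary.  */
#define DEFINE_LAYOUTS(N)      \
  struct at_end_##N {          \
    void *loc[N];              \
    unsigned char how[N];      \
    void *prev;                \
    unsigned char cfa_how;     \
  };                           \
  struct after_how_##N {       \
    void *loc[N];              \
    unsigned char how[N];      \
    unsigned char cfa_how;     \
    void *prev;                \
  }

DEFINE_LAYOUTS(18);  /* how[] size is not a multiple of pointer alignment */
DEFINE_LAYOUTS(16);  /* how[] size is a multiple of pointer alignment */

/* Moving the byte forward should never make the structure larger.  */
static_assert(sizeof(struct after_how_18) <= sizeof(struct at_end_18),
              "placing cfa_how after how[] must not grow the struct");

/* Bytes saved by moving cfa_how from the end to just after how[].  */
size_t saved_bytes_18(void)
{
  return sizeof(struct at_end_18) - sizeof(struct after_how_18);
}

size_t saved_bytes_16(void)
{
  return sizeof(struct at_end_16) - sizeof(struct after_how_16);
}
```

On a typical LP64 target I would expect saved_bytes_18() to be 8 and
saved_bytes_16() to be 0, which is the "won't help always" case.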

> The REP STOS still dominates uw_frame_state_for execution time, but this
> seems to be a profiling artifact.  Replacing it with PXOR and seven
> MOVUPS instructions makes the hotspot go away, but performance does not
> improve.  Odd.
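
For reference, the two clearing strategies could be compared with
something like the sketch below.  It is only an assumption-laden
illustration: the 112-byte block size is inferred from seven 16-byte
stores, the function names are made up, and whether the compiler really
emits REP STOS for the memset variant depends on the target and tuning
flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical block size: seven 16-byte stores.  */
#define BLOCK_BYTES 112

/* Variant 1: plain memset.  Depending on the compiler and tuning, this
   may be expanded inline, possibly as REP STOS on x86.  */
void clear_memset(unsigned char *p)
{
  assert(p != NULL);
  memset(p, 0, BLOCK_BYTES);
}

/* Variant 2: explicit unaligned 16-byte stores of a zeroed vector
   register.  Falls back to memset when SSE2 is not available.  */
void clear_sse(unsigned char *p)
{
  assert(p != NULL);
#if defined(__SSE2__)
  const __m128i zero = _mm_setzero_si128();
  for (size_t i = 0; i < BLOCK_BYTES; i += 16)
    _mm_storeu_si128((__m128i *)(p + i), zero);
#else
  memset(p, 0, BLOCK_BYTES);
#endif
}

/* Returns 1 if all BLOCK_BYTES bytes at p are zero, 0 otherwise.  */
int is_all_zero(const unsigned char *p)
{
  for (size_t i = 0; i < BLOCK_BYTES; i++)
    if (p[i] != 0)
      return 0;
  return 1;
}
```

Both variants leave the same bytes behind, so any difference would show
up only in timing, not in behavior.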

        Jakub
