On Fri, 2024-09-06 at 14:51 +0200, Niels Möller wrote:
> Eric Richter <[email protected]> writes:
> 
> > I suspect with four li instructions, those are issued 4x in parallel,
> > and then the subsequent (slower) lxvw4x instructions are queued 2x.
> > By removing the other three li instructions, that li is queued with
> > the first lxvw4x, but not the second -- causing a stall as the second
> > lxv has to wait for the parallel queue of the li + lxv before, as it
> > depends on the li completing first.
> 
> I don't know any details on how powerpc instruction issue and
> pipelining work. But some dependence on alignment seems likely. So
> great that you found that; it would seem rather odd to get a
> performance regression for this fix.
> 
> Since .align 4 means 16 byte alignment, and instructions are 4 bytes,
> that's enough to group instructions 4-by-4, is that what you want or
> is it overkill?
> 

I don't think I tested with .align 1, but .align 2 did hurt
performance. For the sake of minimizing the amount of trial and error,
I just stuck with .align 4. I'll indicate that in the comment, unless
I find a better value, location, etc.
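
A minimal sketch of the arithmetic behind those values, assuming (as
stated above) that .align n pads to a 2^n-byte boundary and that every
instruction is 4 bytes. The helper names are made up for illustration,
and the padding is assumed to be emitted as whole nop instructions:

```python
INSN_BYTES = 4  # fixed-width instructions, as noted in the thread


def padding_bytes(offset, align_exp):
    """Bytes of padding needed to reach the next 2**align_exp boundary."""
    boundary = 1 << align_exp
    return (-offset) % boundary


def padding_nops(offset, align_exp):
    """Same padding, counted in 4-byte instructions (assumed to be nops)."""
    return padding_bytes(offset, align_exp) // INSN_BYTES


if __name__ == "__main__":
    # Code offsets are always multiples of 4, so only those are checked.
    offsets = range(0, 64, INSN_BYTES)
    for exp in (1, 2, 4):
        worst = max(padding_nops(off, exp) for off in offsets)
        print(f".align {exp}: {1 << exp}-byte boundary, "
              f"worst-case padding {worst} instruction(s)")
```

Under these assumptions, .align 1 and .align 2 never insert any
padding into 4-byte-aligned code, so only .align 4 (or larger) can
actually move the instructions that follow it.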

> I'm also a bit surprised that an align at this point, outside the
> loop, makes a significant difference. Maybe it's the alignment of the
> code in the loop that matters, which is changed indirectly by this
> .align? Maybe it would make more sense to add the align directive
> just before the loop: entry, and/or before the blocks of instructions
> in the loop that should be aligned? Nettle uses aligned loop entry
> points at many places for several architectures, although I'm not
> sure how much of that makes a measurable difference in performance,
> and how much was just done out of habit.

I suspect the same -- I doubt that aligning that load by itself makes
much of a measurable difference, compared to perhaps aligning the
ROUNDs. I will experiment with placing alignments elsewhere to see if
there's a better/more sensible spot.
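
To illustrate how an .align outside the loop could still move the loop
entry, here is a rough sketch. The instruction counts are hypothetical
placeholders, not taken from the actual code, and the 16-byte group
size is simply what .align 4 aligns to:

```python
INSN_BYTES = 4    # fixed-width instructions
GROUP_BYTES = 16  # boundary produced by .align 4


def align_up(offset, boundary):
    """Round offset up to the next multiple of boundary."""
    return (offset + boundary - 1) // boundary * boundary


def loop_slot(start, insns_before, insns_after, use_align):
    """Position (0..3) of the loop entry within its 16-byte group.

    insns_before: instructions ahead of the .align directive
    insns_after:  instructions between the .align and the loop label
    """
    offset = start + insns_before * INSN_BYTES
    if use_align:
        offset = align_up(offset, GROUP_BYTES)
    offset += insns_after * INSN_BYTES
    return (offset % GROUP_BYTES) // INSN_BYTES


if __name__ == "__main__":
    # Hypothetical layout: 5 instructions, optional .align 4, 6 more, loop.
    print("without .align:", loop_slot(0, 5, 6, use_align=False))
    print("with .align 4: ", loop_slot(0, 5, 6, use_align=True))
```

With the .align in place, the loop entry's position depends only on
the number of instructions after the directive, which would be one
way the directive changes the loop's alignment indirectly.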

> 
> > Additional note: I did also try rearranging the LOAD macros with the
> > shifts, as well as moving around the requisite byte-swap vperms, but
> > did not see any performance benefit. Doing the load, vperm, shift,
> > addi in that order appears to be the fastest.
> 
> To what degree do the powerpc processors do out of order execution?

I'm not entirely sure -- that will mostly be the subject of the
deep-dive I'm planning to do. I suspect there might be some hidden
dependency bubbles that are interfering with optimal execution.

> If you have the time to experiment more, I'd be curious to see what
> the results would be, e.g., if either doing all the loads back to
> back,
> 
>   lxvd2x A
>   lxvd2x B
>   lxvd2x C
>   lxvd2x D
>   vperm A
>   vperm B
>   vperm C
>   vperm D
>   ...shifts...

This was one of my experiments, and it either did not help performance
or hurt it further. In my haste, though, I did not take notes. I will
play around further with these and record the results for posterity, as
I suspect they will be useful to capture for future work.

> 
> or alternatively, trying to schedule each load a few instructions
> before the value is used.
> 
> 
> 
_______________________________________________
nettle-bugs mailing list -- [email protected]
To unsubscribe send an email to [email protected]
