On Fri, 2024-09-06 at 14:51 +0200, Niels Möller wrote:
> Eric Richter <[email protected]> writes:
>
> > I suspect with four li instructions, those are issued 4x in parallel, and
> > then the subsequent (slower) lxvw4x instructions are queued 2x. By removing
> > the other three li instructions, that li is queued with the first lxvw4x,
> > but not the second -- causing a stall as the second lxv has to wait for the
> > parallel queue of the li + lxv before, as it depends on the li completing
> > first.
>
> I don't know any details on powerpc instruction issue and pipelining
> works. But some dependence on alignment seems likely. So great that you
> found that; it would seem rather odd to get a performance regression for
> this fix.
>
> Since .align 4 means 16 byte alignment, and instructions are 4 bytes,
> that's enough to group instructions 4-by-4, is that what you want or is
> it overkill?
>
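
For reference, the arithmetic behind that question can be sketched as
below. This is only an illustration, assuming the .align argument is a
power-of-two exponent (as stated above for powerpc) and that the address
being aligned is already a multiple of 4. The helper names are made up
for this example and do not come from Nettle.

```python
INSN_BYTES = 4  # PowerPC instructions are a fixed 4 bytes wide


def align_bytes(n: int) -> int:
    """Boundary in bytes requested by `.align n`, assuming n is a power-of-two exponent."""
    return 1 << n


def padding_insns(addr: int, n: int) -> int:
    """Count the 4-byte slots skipped when advancing addr to the next 2**n boundary.

    Assumes addr is already a multiple of INSN_BYTES.
    """
    pad = (-addr) % align_bytes(n)
    return pad // INSN_BYTES


if __name__ == "__main__":
    for n in range(1, 5):
        print(f".align {n}: {align_bytes(n):2d}-byte boundary, "
              f"up to {max(align_bytes(n) // INSN_BYTES - 1, 0)} padding slots")
```

Under these assumptions, .align 2 never inserts padding between
instructions, while .align 4 can insert up to three 4-byte slots.
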
I don't think I tested with .align 1, but .align 2 did hurt performance.
For the sake of minimizing the amount of trial and error, I just stuck
with .align 4. I'll indicate that in the comment, unless I find a better
value, location, etc.

> I'm also a bit surprised that an align at this point, outside the loop,
> makes a significant difference. Maybe it's the alignment of the code in
> the loop that matters, which is changed indirectly by this .align? Maybe
> it would make more sense to add the align directive just before the
> loop: entry, and/or before the blocks of instructions in the loop that
> should be aligned? Nettle uses aligned loop entry points at many places
> for several architectures, although I'm not sure how much of that makes
> a measurable difference in performance, and how much was just done out
> of habit.

I suspect something similar -- I wouldn't expect aligning that load to
make as much of a measurable difference as, say, aligning the ROUNDs. I
will be experimenting with placing alignments elsewhere to see if there's
a better/more sensible spot.

> > Additional note: I did also try rearranging the LOAD macros with the
> > shifts, as well as moving around the requisite byte-swap vperms, but did
> > not receive any performance benefits. It appears doing the load, vperm,
> > shift, addi in that order appears to be the fastest order.
>
> To what degree does the powerpc processors do out of order execution?

I'm not entirely sure -- that will mostly be the subject of the deep dive
I'm planning to do. I suspect there might be some hidden dependency
bubbles that are interfering with optimal execution.

> If you have the time to experiment more, I'd be curious to see what the
> results would be, e.g., if either doing all the loads back to back,
>
>   lxvd2x A
>   lxvd2x B
>   lxvd2x C
>   lxvd2x D
>   vperm A
>   vperm B
>   vperm C
>   vperm D
>   ...shifts...

This was one of my experiments, and it either did not help performance or
hurt it further. In my haste, though, I did not take notes. I will play
around further with these orderings and record the results for posterity;
I suspect they will be useful to capture for future work.

> or alternatively, trying to schedule each load a few instructions
> before value is used.

_______________________________________________
nettle-bugs mailing list -- [email protected]
To unsubscribe send an email to [email protected]
