On 6/8/2021 8:08 AM, Michael Matz wrote:
Hello,

On Mon, 7 Jun 2021, Jeff Law wrote:

So, as many of you know I left Red Hat a while ago and joined Tachyum.  We're
building a new processor and we've come across an issue where I think we need
upstream discussion.

I can't divulge many of the details right now, but one of the quirks of our
architecture is that reg+d addressing modes for our vector loads/stores
require the displacement to be aligned.  This is an artifact of how these
instructions are encoded.

Obviously we can emit a load of the address into a register when the
displacement isn't aligned.  From a correctness standpoint that works perfectly.
Unfortunately, it's a significant performance hit on some standard benchmarks
(spec) where we have a great number of spills of vector objects into the stack
at unaligned offsets in the hot parts of the code.


We've considered 3 possible approaches to solve this problem.

1. When the displacement isn't properly aligned, allocate more space in
assign_stack_local so that we can make the offset aligned.  The downside is
this potentially burns a lot of stack space, but in practice the cost was
minimal (16 bytes in a 9k frame).  From a performance standpoint this works
perfectly.

2. Abuse the register elimination code to create a second pointer into the
stack.  Spills would start as <virtual> + offset, then either get eliminated
to sp+offset' when the offset is aligned or gpr+offset'' when the offset
isn't properly aligned.  We started a bit down this path, but with #1 working
so well, we didn't get this approach to proof-of-concept.

3. Hack up the post-reload optimizers to fix things up as best as we can.
This may still be advantageous, but again with #1 working so well, we didn't
explore this in any significant way.  We may still look at this at some point
in other contexts.
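
To make approach 1 concrete, here is a minimal sketch of the arithmetic involved.
It is not the patch discussed in this thread and not GCC code: the helper names
align_offset_up and padding_for_slot are hypothetical, the offsets are assumed
to be non-negative, and the 32-byte alignment used in the note below is an
arbitrary illustrative value, since the real requirement is not stated here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a GCC function): round a non-negative frame
   offset up to the next multiple of ALIGN.  ALIGN must be a power of two.  */
static int64_t
align_offset_up (int64_t offset, int64_t align)
{
  assert (align > 0 && (align & (align - 1)) == 0);
  assert (offset >= 0);
  /* -align has all bits set above the low log2(align) bits, so the mask
     clears the low bits after we have bumped past the next boundary.  */
  return (offset + align - 1) & -align;
}

/* Hypothetical helper: number of extra bytes the frame would have to
   grow so that a slot starting at OFFSET lands on an aligned offset.  */
static int64_t
padding_for_slot (int64_t offset, int64_t align)
{
  return align_offset_up (offset, align) - offset;
}
```

With these assumptions, a slot that would have landed at offset 40 moves to
offset 64 at a cost of 24 bytes of padding, while a slot already at offset 64
costs nothing.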

Here's what we're playing with.  Obviously we'd need a target hook to
drive this behavior.  I was thinking that we'd pass in any slot offset
alignment requirements (from the target hook) to assign_stack_local and
that would bubble down to this point in try_fit_stack_local:

Why is the machinery involving STACK_SLOT_ALIGNMENT and
spill_slot_alignment() (for spilling) or get_stack_local_alignment() (for
backing stack slots) not working for you?  If everything is set up
correctly the input alignment to try_fit_stack_local ought to be correct
already.
We don't need the MEM as a whole aligned, just the offset in the address
calculation due to how we encode those instructions.  If I've read that code
correctly, it would arrange for a dynamic realignment of the stack so that it
could then align the slot.  None of that is necessary for us and we'd like to
avoid forcing the dynamic stack realignment.  Or did I misread the code?
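
The distinction between an aligned displacement and an aligned memory address
can be sketched as below.  This is only an illustration under assumptions: the
function names are hypothetical, and the 32-byte alignment and the base address
0x1008 in the note that follows are made-up values, not details of the actual
architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if VALUE is a multiple of ALIGN, where ALIGN
   must be a power of two.  */
static bool
is_multiple_of (uint64_t value, uint64_t align)
{
  assert (align != 0 && (align & (align - 1)) == 0);
  return (value & (align - 1)) == 0;
}

/* The encoding constraint described above: only the displacement in a
   reg+d address has to be aligned; the base register is irrelevant.  */
static bool
displacement_encodable (int64_t disp, uint64_t align)
{
  return is_multiple_of ((uint64_t) disp, align);
}

/* The stronger property that aligning the MEM as a whole would give:
   the effective address BASE + DISP is itself aligned.  */
static bool
address_aligned (uint64_t base, int64_t disp, uint64_t align)
{
  return is_multiple_of (base + (uint64_t) disp, align);
}
```

For example, with a base of 0x1008 a displacement of 32 is encodable even
though the resulting address is not 32-byte aligned, and a displacement of 24
yields a 32-byte-aligned address but is not encodable.  The two properties are
independent, which is why the offset constraint alone should not require
realigning the stack.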

jeff
