On 09/07/2016 02:19 AM, Richard Biener wrote:
On Tue, 6 Sep 2016, Jakub Jelinek wrote:

On Tue, Sep 06, 2016 at 04:59:23PM +0100, Kyrill Tkachov wrote:
On 06/09/16 16:32, Jakub Jelinek wrote:
On Tue, Sep 06, 2016 at 04:14:47PM +0100, Kyrill Tkachov wrote:
The v3 of this patch addresses feedback I received on the version posted at [1].
The merged store buffer is now represented as a char array onto which we splat
values with native_encode_expr and native_interpret_expr. This allows us to merge
anything that native_encode_expr accepts, including floating-point values and
short vectors. So this version extends the functionality of the previous one in
that it handles floating-point values as well.
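
(As an illustration only, a made-up example not taken from the patch: because the
merged buffer is a plain char array filled via native_encode_expr, a group of
adjacent constant stores that includes a float should now be a merging candidate
alongside the integer stores.)

    /* Made-up example: four adjacent constant stores into an aligned struct.
       The float constant can be encoded into the char-array buffer with
       native_encode_expr and merged together with the integer stores.  */
    struct s { unsigned char a; unsigned char b; unsigned short c; float f; };

    void
    init (struct s *p)
    {
      p->a = 1;
      p->b = 2;
      p->c = 3;
      p->f = 4.0f;  /* floating-point store, newly handled in v3 */
    }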

The first phase of the algorithm that detects the contiguous stores is also
slightly refactored according to feedback to read more fluently.

Richi, I experimented with merging up to MOVE_MAX bytes rather than word size,
but I got worse results on aarch64.  MOVE_MAX there is 16 (because it has
load/store register pair instructions), but the 128-bit immediates that we ended
up synthesising were too complex.  Perhaps the TImode immediate store RTL
expansions could be improved, but for now I've left the maximum merge size at
BITS_PER_WORD.
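
(A made-up example of the situation described above, not from the patch: a
16-byte run of arbitrary constant stores would, with a MOVE_MAX-sized merge on
aarch64, require materialising a single 128-bit constant, whereas capping at
BITS_PER_WORD keeps it to two 64-bit stores with simpler immediates.)

    /* Hypothetical 16-byte run of constant stores.  Capping the merge width
       at BITS_PER_WORD gives two 64-bit stores; merging up to MOVE_MAX
       (16 on aarch64) would instead need a single 128-bit constant.  */
    void
    fill (unsigned int *p)
    {
      p[0] = 0x12345678;
      p[1] = 0x9abcdef0;
      p[2] = 0x0fedcba9;
      p[3] = 0x87654321;
    }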
At least from playing with this kind of thing in the RTL PR22141 patch, I
remember that storing a 64-bit constant on x86_64, compared to storing two
32-bit constants, usually isn't a win (not just for speed-optimized blocks but
also for -Os).  For a 64-bit store, if the constant isn't signed 32-bit or
unsigned 32-bit, you need movabsq into some temporary register, which has
something like 3 times worse latency than a normal store if I remember well,
and then store it.

We could restrict the maximum width of the stores generated to 32 bits on
x86_64.  I think this would need another parameter or target macro for the
target to set.  Alternatively, is it a possibility for x86 to be a bit smarter
in its DImode mov-immediate expansion?  For example, break up the 64-bit
movabsq immediate into two SImode immediates?

If you want a 64-bit store, you'd need to merge the two, and that would be
even more expensive.  It is a matter of say:
        movl $0x12345678, (%rsp)
        movl $0x09abcdef, 4(%rsp)
vs.
        movabsq $0x09abcdef12345678, %rax
        movq %rax, (%rsp)
vs.
        movl $0x09abcdef, %eax
        salq $32, %rax
        orq $0x12345678, %rax
        movq %rax, (%rsp)

vs.

        movq $LC0, (%rsp)

?

etc.  Guess it needs to be benchmarked on contemporary CPUs; I'm pretty sure
the last sequence is the worst one.

I think the important part to notice is that it should be straightforward
for a target / the expander to split a large store from an immediate into any
of the above, but very hard to do the opposite.  Thus from the GIMPLE side,
"canonicalizing" to large stores (that are eventually supported and
well-aligned) seems best to me.
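
(A made-up sketch of that asymmetry, with constants chosen to match the
sequences above: the pass can canonicalize two adjacent 32-bit constant stores
into one 64-bit store at the GIMPLE level, and the expander can still split
that back into two movl immediate stores if movabsq is too expensive, whereas
reconstructing the wide store from two narrow RTL stores later would be much
harder.)

    /* Two adjacent 32-bit constant stores; assume q is known to be
       8-byte aligned.  */
    void
    set (unsigned int *q)
    {
      q[0] = 0x12345678;
      q[1] = 0x09abcdef;
    }

    /* Conceptually canonicalized by the pass to a single 64-bit store of
       0x09abcdef12345678 (little endian).  The x86_64 expander is then
       free to emit either movabsq + movq or two movl immediate stores,
       whichever it considers cheaper.  */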
Agreed.



I'm aware of that. The patch already has logic to avoid emitting unaligned
accesses for SLOW_UNALIGNED_ACCESS targets. Beyond that, the patch introduces
the parameter PARAM_STORE_MERGING_ALLOW_UNALIGNED that can be used by the user
or target to forbid generation of unaligned stores by the pass altogether.
Beyond that I'm not sure how to behave more intelligently here. Any ideas?
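
(For reference, and assuming the parameter is exposed under the usual --param
naming, which I would guess is spelled store-merging-allow-unaligned, a user or
target could then disable unaligned merged stores with something like:)

    gcc -O2 --param store-merging-allow-unaligned=0 foo.c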

Dunno, the heuristics were the main problem with my patch.  Generally, I'd say
there is a difference between cold and hot blocks: in cold ones unaligned
stores are perhaps more appropriate (if supported at all and not way too slow),
while in hot ones they are less desirable.

Note that I repeatedly argue that if we can canonicalize something to "larger"
then, even if unaligned, the expander should be able to produce optimal
code again (it might not, of course).
And agreed. Furthermore, it's in line with our guiding principles WRT separation of the tree/SSA optimizers from target dependencies.

So let's push those decisions into the expanders/backend/target and canonicalize to the larger stores.

jeff

