Hi!
On Mon, Jan 14, 2019 at 09:53:18PM +0000, Steve Ellcey wrote:
> I have a question about PR87763, these are aarch64 specific tests
> that are failing after r265398 (combine: Do not combine moves from hard
> registers).
>
> These tests are all failing when the assembler scan looks for
> specific instructions and these instructions are no longer being
> generated. In some cases the new code is no worse than the old code
> (just different) but in most cases the new code is a performance
> regression from the old code.
>
> Note that these tests are generally *very* small functions where the
> body of the function consists of only 1 to 4 instructions so if we
> do not combine instructions involving hard registers there isn't much,
> if any, combining that can be done.
That is why all such hard regs are copied to a new pseudo first now. The
pseudo can usually be combined in the same way as the hard reg could be
before. The extra copy is optimised away by register allocation, in those
cases where that is a good choice (and, alas, register allocation sometimes
makes bad decisions).
> In larger functions this probably
> would not be an issue and I think those cases are where the incentive
> for this patch came from. So my question is, what do we want to
> do about these failures?
Fix them :-)
Some are caused by deficiencies in the target code (or, things that were
not required before, but that now are needed).
Some are shortcomings in register allocation (or elsewhere; but in generic
code).
> Find a GCC patch to generate the better code? If it isn't done by
> combine, how would we do it? Peephole optimizations?
Sometimes that is needed, sure. If your ISA is more orthogonal you need
fewer peepholes, which is great. But most targets can use a few for good
profit.
> Modify the tests to pass with the current output? Which, in my
> opinion would make the tests of not much value.
Sometimes the tests _are_ not much value, aren't really testing what they
intended to test.
> Remove the tests? Tests that search for specific assembly language
> output are rather brittle to begin with and if they are no longer
> serving a purpose after the combine patch, maybe we don't need them.
>
> The tests in question are:
Ah, not too many, I'll look at them all. Please correct me where I make
mistakes, I'm no expert on aarch64.
> gcc.target/aarch64/combine_bfi_1.c
f1:
Trying 9, 8 -> 10:
9: r99:SI=r100:SI&0xffffffffff0000ff
REG_DEAD r100:SI
8: r98:SI=r101:SI<<0x8&0xffff00
REG_DEAD r101:SI
10: r96:SI=r98:SI|r99:SI
REG_DEAD r99:SI
REG_DEAD r98:SI
Failed to match this instruction:
(set (reg:SI 96)
(ior:SI (and:SI (reg:SI 100)
(const_int -16776961 [0xffffffffff0000ff]))
(and:SI (ashift:SI (reg:SI 101)
(const_int 8 [0x8]))
(const_int 16776960 [0xffff00]))))
Either you need a pattern to match things like this, or combine (or
simplify-rtx) should write is as an lhs zero_extract.
f2 is similar; f3,f4,f5 are similar and/or the test should allow bfxil as
well as bfi.
> gcc.target/aarch64/insv_1.c
This test tests that various combinations with constant integers generate
good code. For the first test, bfi1, we get
Trying 2, 7 -> 13:
2: r92:DI=r95:DI
REG_DEAD r95:DI
7: zero_extract(r92:DI,0x8,0)=r93:DI
REG_DEAD r93:DI
13: x0:DI=r92:DI
REG_DEAD r92:DI
Failed to match this instruction:
(set (reg/i:DI 0 x0)
(ior:DI (and:DI (reg:DI 95)
(const_int -256 [0xffffffffffffff00]))
(reg:DI 93)))
Successfully matched this instruction:
(set (reg/v:DI 92 [ aD.3347 ])
(and:DI (reg:DI 95)
(const_int -256 [0xffffffffffffff00])))
Successfully matched this instruction:
(set (reg/i:DI 0 x0)
(ior:DI (reg/v:DI 92 [ aD.3347 ])
(reg:DI 93)))
allowing combination of insns 2, 7 and 13
which is not what we want. A peephole2, or a define_split, or a
define_insn_and_split would fix this.
bfi2 is similar. movk I think the same as well. set0 and set1 are best
code already IU think.
> gcc.target/aarch64/lsl_asr_sbfiz.c
sbfiz32 (sbfiz64 is fine):
Trying 6 -> 7:
6: r94:SI=r95:SI<<0x1d
REG_DEAD r95:SI
7: r93:SI=r94:SI>>0xa
REG_DEAD r94:SI
Failed to match this instruction:
(set (reg:SI 93)
(ashift:SI (subreg:SI (sign_extract:DI (subreg:DI (reg:SI 95) 0)
(const_int 3 [0x3])
(const_int 0 [0])) 0)
(const_int 19 [0x13])))
Say what? Everything was SI, where does that sign_extract:DI come from?
And why isn't it optimised back to SI?
(Please open a PR just for this one, if it isn't obviously a target thing
that causes it).
> gcc.target/aarch64/sve/tls_preserve_1.c
I get
foo:
.LFB0:
.cfi_startproc
stp x29, x30, [sp, -64]!
.cfi_def_cfa_offset 64
.cfi_offset 29, -64
.cfi_offset 30, -56
mrs x1, tpidr_el0
mov x29, sp
stp q0, q1, [sp, 16]
str q2, [sp, 48]
adrp x0, :tlsdesc:tx
ldr x2, [x0, #:tlsdesc_lo12:tx]
add x0, x0, :tlsdesc_lo12:tx
.tlsdesccall tx
blr x2
ldr q4, [x1, x0]
fmov v3.4s, 7.0e+0
ldp x29, x30, [sp], 64
.cfi_restore 30
.cfi_restore 29
.cfi_def_cfa_offset 0
fadd v0.4s, v0.4s, v4.4s
fadd v0.4s, v0.4s, v1.4s
fadd v0.4s, v0.4s, v2.4s
fadd v0.4s, v0.4s, v3.4s
ret
.cfi_endproc
which indeed has two stores of q registers, but I have no idea if that is
good or bad or what. Help?
> gcc.target/aarch64/tst_5.c
Here we get
f255:
.LFB0:
.cfi_startproc
mov w1, w0
ands w1, w1, 255
csinc w0, w0, wzr, eq
ret
.cfi_endproc
Can that be done better?
> gcc.target/aarch64/tst_6.c
foo:
.LFB0:
.cfi_startproc
sxth w1, w0
cmp w1, 0
csinc x0, x0, xzr, ne
ret
.cfi_endproc
> gcc.dg/vect/vect-nop-move.c # Scanning combine dump file, not asm file
This tests just
/* { dg-final { scan-rtl-dump "deleting noop move" "combine" { target
aarch64*-*-* } } } */
but there is a lot of source code, I don't know where this should be. More
importantly, I have no idea if the code that is generated is any good.
Segher