Hi! On Mon, Jan 14, 2019 at 09:53:18PM +0000, Steve Ellcey wrote: > I have a question about PR87763, these are aarch64 specific tests > that are failing after r265398 (combine: Do not combine moves from hard > registers). > > These tests are all failing when the assembler scan looks for > specific instructions and these instructions are no longer being > generated. In some cases the new code is no worse than the old code > (just different) but in most cases the new code is a performance > regression from the old code. > > Note that these tests are generally *very* small functions where the > body of the function consists of only 1 to 4 instructions so if we > do not combine instructions involving hard registers there isn't much, > if any, combining that can be done.
That is why all such hard regs are copied to a new pseudo first now. The pseudo can usually be combined in the same way as the hard reg could be before. The extra copy is optimised away by register allocation, in those cases where that is a good choice (and, alas, register allocation sometimes makes bad decisions). > In larger functions this probably > would not be an issue and I think those cases are where the incentive > for this patch came from. So my question is, what do we want to > do about these failures? Fix them :-) Some are caused by deficiencies in the target code (or, things that were not required before, but that now are needed). Some are shortcomings in register allocation (or elsewhere; but in generic code). > Find a GCC patch to generate the better code? If it isn't done by > combine, how would we do it? Peephole optimizations? Sometimes that is needed, sure. If your ISA is more orthogonal you need fewer peepholes, which is great. But most targets can use a few for good profit. > Modify the tests to pass with the current output? Which, in my > opinion would make the tests of not much value. Sometimes the tests _are_ not much value, aren't really testing what they intended to test. > Remove the tests? Tests that search for specific assembly language > output are rather brittle to begin with and if they are no longer > serving a purpose after the combine patch, maybe we don't need them. > > The tests in question are: Ah, not too many, I'll look at them all. Please correct me where I make mistakes, I'm no expert on aarch64. > gcc.target/aarch64/combine_bfi_1.c f1: Trying 9, 8 -> 10: 9: r99:SI=r100:SI&0xffffffffff0000ff REG_DEAD r100:SI 8: r98:SI=r101:SI<<0x8&0xffff00 REG_DEAD r101:SI 10: r96:SI=r98:SI|r99:SI REG_DEAD r99:SI REG_DEAD r98:SI Failed to match this instruction: (set (reg:SI 96) (ior:SI (and:SI (reg:SI 100) (const_int -16776961 [0xffffffffff0000ff])) (and:SI (ashift:SI (reg:SI 101) (const_int 8 [0x8])) (const_int 16776960 [0xffff00])))) Either you need a pattern to match things like this, or combine (or simplify-rtx) should write is as an lhs zero_extract. f2 is similar; f3,f4,f5 are similar and/or the test should allow bfxil as well as bfi. > gcc.target/aarch64/insv_1.c This test tests that various combinations with constant integers generate good code. For the first test, bfi1, we get Trying 2, 7 -> 13: 2: r92:DI=r95:DI REG_DEAD r95:DI 7: zero_extract(r92:DI,0x8,0)=r93:DI REG_DEAD r93:DI 13: x0:DI=r92:DI REG_DEAD r92:DI Failed to match this instruction: (set (reg/i:DI 0 x0) (ior:DI (and:DI (reg:DI 95) (const_int -256 [0xffffffffffffff00])) (reg:DI 93))) Successfully matched this instruction: (set (reg/v:DI 92 [ aD.3347 ]) (and:DI (reg:DI 95) (const_int -256 [0xffffffffffffff00]))) Successfully matched this instruction: (set (reg/i:DI 0 x0) (ior:DI (reg/v:DI 92 [ aD.3347 ]) (reg:DI 93))) allowing combination of insns 2, 7 and 13 which is not what we want. A peephole2, or a define_split, or a define_insn_and_split would fix this. bfi2 is similar. movk I think the same as well. set0 and set1 are best code already IU think. > gcc.target/aarch64/lsl_asr_sbfiz.c sbfiz32 (sbfiz64 is fine): Trying 6 -> 7: 6: r94:SI=r95:SI<<0x1d REG_DEAD r95:SI 7: r93:SI=r94:SI>>0xa REG_DEAD r94:SI Failed to match this instruction: (set (reg:SI 93) (ashift:SI (subreg:SI (sign_extract:DI (subreg:DI (reg:SI 95) 0) (const_int 3 [0x3]) (const_int 0 [0])) 0) (const_int 19 [0x13]))) Say what? Everything was SI, where does that sign_extract:DI come from? And why isn't it optimised back to SI? (Please open a PR just for this one, if it isn't obviously a target thing that causes it). > gcc.target/aarch64/sve/tls_preserve_1.c I get foo: .LFB0: .cfi_startproc stp x29, x30, [sp, -64]! .cfi_def_cfa_offset 64 .cfi_offset 29, -64 .cfi_offset 30, -56 mrs x1, tpidr_el0 mov x29, sp stp q0, q1, [sp, 16] str q2, [sp, 48] adrp x0, :tlsdesc:tx ldr x2, [x0, #:tlsdesc_lo12:tx] add x0, x0, :tlsdesc_lo12:tx .tlsdesccall tx blr x2 ldr q4, [x1, x0] fmov v3.4s, 7.0e+0 ldp x29, x30, [sp], 64 .cfi_restore 30 .cfi_restore 29 .cfi_def_cfa_offset 0 fadd v0.4s, v0.4s, v4.4s fadd v0.4s, v0.4s, v1.4s fadd v0.4s, v0.4s, v2.4s fadd v0.4s, v0.4s, v3.4s ret .cfi_endproc which indeed has two stores of q registers, but I have no idea if that is good or bad or what. Help? > gcc.target/aarch64/tst_5.c Here we get f255: .LFB0: .cfi_startproc mov w1, w0 ands w1, w1, 255 csinc w0, w0, wzr, eq ret .cfi_endproc Can that be done better? > gcc.target/aarch64/tst_6.c foo: .LFB0: .cfi_startproc sxth w1, w0 cmp w1, 0 csinc x0, x0, xzr, ne ret .cfi_endproc > gcc.dg/vect/vect-nop-move.c # Scanning combine dump file, not asm file This tests just /* { dg-final { scan-rtl-dump "deleting noop move" "combine" { target aarch64*-*-* } } } */ but there is a lot of source code, I don't know where this should be. More importantly, I have no idea if the code that is generated is any good. Segher