[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 --- Comment #13 from Vladimir Makarov --- (In reply to Uroš Bizjak from comment #11) > (In reply to Jakub Jelinek from comment #5) > > I guess it depends on what exactly a normal subreg on lhs means. > > The documentation says: > > When used as an lvalue, 'subreg' is a word-based accessor. > > Storing to a 'subreg' modifies all the words of REG that > > overlap the 'subreg', but it leaves the other words of REG > > alone. > > > But this wording applies only to multi-word registers. We can't use the > above wording for 512bit single-word register, since we don't know how the > move will affect the bits outside the subreg. We can say that the move > "modifies all the words of REG that overlap the 'subreg', since we have only > one 512-bit word of a 512-bit register. > OK. > So, I think that the transformation in the Comment 10 is invalid for > registers that can't be decomposed to independent word-sized registers (to > use "word-based accessor"), e.g. V32HImode xmm20. Perhaps the mentioned > alter_subreg should choose correct approach based on TARGET_HARD_REGNO_NREGS? Actually I do the same things as the old reload does. It has practically the same alter_subreg code. May be the reload and LRA code is not up to date to treat correctly this situation. I don't know. What I can do is to generate (strict_low_part (subreg:DI (reg:V32HI ))) to reflect the new semantics. Something like Index: lra.c === --- lra.c (revision 258691) +++ lra.c (working copy) @@ -487,14 +487,26 @@ int lra_curr_reload_num; void lra_emit_move (rtx x, rtx y) { - int old; - + int old, regno; + machine_mode mode; + rtx reg; + if (GET_CODE (y) != PLUS) { if (rtx_equal_p (x, y)) return; old = max_reg_num (); - emit_move_insn (x, y); + if (GET_CODE (x) == SUBREG + && REG_P (reg = SUBREG_REG (x)) + && GET_MODE_SIZE (mode = GET_MODE (reg)).to_constant () > UNITS_PER_WORD + && (regno = REGNO (reg)) >= FIRST_PSEUDO_REGISTER + && ira_reg_class_max_nregs[lra_get_allocno_class (regno)][mode] == 1) + { + x = gen_rtx_STRICT_LOW_PART (VOIDmode, x); + emit_insn (gen_rtx_SET (x, y)); + } + else + emit_move_insn (x, y); if (REG_P (x)) lra_reg_info[ORIGINAL_REGNO (x)].last_reload = ++lra_curr_reload_num; /* Function emit_move can create pseudos -- so expand the pseudo But we need insn patterns for such cases which are absent in i386 md files. Without adding them, compiler will crash in LRA.
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 --- Comment #12 from Uroš Bizjak --- (In reply to Uroš Bizjak from comment #11) > gcc -O2 -m27332: "gcc -O2 -m32", the test shows dumps for 32bit x86 target.
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 --- Comment #11 from Uroš Bizjak --- (In reply to Jakub Jelinek from comment #5) > I guess it depends on what exactly a normal subreg on lhs means. > The documentation says: > When used as an lvalue, 'subreg' is a word-based accessor. > Storing to a 'subreg' modifies all the words of REG that > overlap the 'subreg', but it leaves the other words of REG > alone. I believe that the above wording is intended to describe the following test: --cut here-- union u { long long ll; int i[2]; }; union u test (union u a) { a.ll >>= 3; a.i[0] = 5; a.ll <<= 3; return a; } --cut here-- gcc -O2 -m27332: _.273r.ira shows: (insn 8 7 9 2 (parallel [ (set (reg/v:DI 92 [ a ]) (ashiftrt:DI (reg/v:DI 92 [ a ]) (const_int 3 [0x3]))) (clobber (reg:CC 17 flags)) ]) "subr.c":9 532 {*ashrdi3_doubleword} (expr_list:REG_UNUSED (reg:CC 17 flags) (nil))) (insn 9 8 10 2 (set (subreg:SI (reg/v:DI 92 [ a ]) 0) (const_int 5 [0x5])) "subr.c":10 82 {*movsi_internal} (nil)) (insn 10 9 12 2 (parallel [ (set (reg:DI 94) (ashift:DI (reg/v:DI 92 [ a ]) (const_int 3 [0x3]))) (clobber (reg:CC 17 flags)) ]) "subr.c":11 503 {*ashldi3_doubleword} (expr_list:REG_DEAD (reg/v:DI 92 [ a ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (expr_list:REG_EQUIV (mem/c:DI (reg/f:SI 91 [ .result_ptr ]) [1 +0 S8 A32]) (nil) please note (insn 9), which LRA converts to: (insn 8 7 9 2 (parallel [ (set (reg/v:DI 4 si [orig:92 a ] [92]) (ashiftrt:DI (reg/v:DI 4 si [orig:92 a ] [92]) (const_int 3 [0x3]))) (clobber (reg:CC 17 flags)) ]) "subr.c":9 532 {*ashrdi3_doubleword} (nil)) (insn 9 8 20 2 (set (reg:SI 4 si [orig:92 a ] [92]) (const_int 5 [0x5])) "subr.c":10 82 {*movsi_internal} (nil)) (insn 20 9 10 2 (set (reg:DI 2 cx [94]) (reg/v:DI 4 si [orig:92 a ] [92])) "subr.c":11 81 {*movdi_internal} (nil)) (insn 10 20 12 2 (parallel [ (set (reg:DI 2 cx [94]) (ashift:DI (reg:DI 2 cx [94]) (const_int 3 [0x3]))) (clobber (reg:CC 17 flags)) ]) "subr.c":11 503 {*ashldi3_doubleword} (expr_list:REG_EQUIV (mem/c:DI (reg/f:SI 0 ax [orig:91 .result_ptr ] [91]) [1 +0 S8 A32]) (nil))) In the above pass, using (insn 9), the compiler inserted a 32-bit word into the lowpart (%esi) of the multi-word register pair (%esi/%edi). In effect, "the other words of REG" are left alone, as the documentation says. But this wording applies only to multi-word registers. We can't use the above wording for 512bit single-word register, since we don't know how the move will affect the bits outside the subreg. We can say that the move "modifies all the words of REG that overlap the 'subreg', since we have only one 512-bit word of a 512-bit register. So, I think that the transformation in the Comment 10 is invalid for registers that can't be decomposed to independent word-sized registers (to use "word-based accessor"), e.g. V32HImode xmm20. Perhaps the mentioned alter_subreg should choose correct approach based on TARGET_HARD_REGNO_NREGS?
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 --- Comment #10 from Vladimir Makarov --- (In reply to Jakub Jelinek from comment #5) > I guess it depends on what exactly a normal subreg on lhs means. > The documentation says: > When used as an lvalue, 'subreg' is a word-based accessor. > Storing to a 'subreg' modifies all the words of REG that > overlap the 'subreg', but it leaves the other words of REG > alone. > but in this case, we really don't have a GPR, but rather a single much > larger (512-bit) register. Does it still imply that for -m32 (subreg:SI > (reg:V32HI) 0) > sets just the low 32 bits of the large register and doesn't modify anything > else, > and for -m64 the same means set low 32 bits, have the 32 bits above it > undefined and the rest of bits unmodified? > Yes, I guess so. For x86-64 "the 32 bits above" will be zero which might be treated as undefined. > Seems store_bit_field treats it that way, but perhaps IRA does not? I believe IRA treats it in the same way. LRA does not just emit (insn 5658 4214 5812 2 (set (mem/c:DI (plus:DI (reg/f:DI 7 sp) (const_int 264 [0x108])) [4 %sfp+-64 S8 A512]) (reg:DI 0 ax [4040])) "pr85090.c":13 85 {*movdi_internal} (nil)) (insn 5812 5658 4219 2 (set (reg:DI 57 xmm20 [orig:422 i ] [422]) (mem/c:DI (plus:DI (reg/f:DI 7 sp) (const_int 264 [0x108])) [4 %sfp+-64 S8 A512])) "pr85090.c":13 85 {*movdi_internal} (nil)) It emits through emit_move_insn (instead of xmm20, a pseudo is used on some LRA sub-passes): (insn 5812 5658 4219 2 (set (subreg:DI (reg/v:V32HI 57 xmm20 [orig:422 i ] [422]) 0) (mem/c:DI (plus:DI (reg/f:DI 7 sp) (const_int 264 [0x108])) [4 %sfp+-64 S8 A512])) "v.i":13 85 {*movdi_internal} (expr_list:REG_DEAD (reg:TI 20 frame) (nil))) and this insn exists till the very LRA end where it is changed (through alter_subreg): (insn 5812 5658 4219 2 (set (reg:DI 57 xmm20 [orig:422 i ] [422]) (mem/c:DI (plus:DI (reg/f:DI 7 sp) (const_int 264 [0x108])) [4 %sfp+-64 S8 A512])) "pr85090.c":13 85 {*movdi_internal} (nil)) So LRA keeps subreg semantics. It cannot change movdi_internal into the right vector insn sequence. I believe it is not LRA responsibility.
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 Jakub Jelinek changed: What|Removed |Added CC|jakub at redhat dot com| --- Comment #9 from Jakub Jelinek --- Vlad, do you think you could look at the IRA issue (with the above patch reverted)?
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 --- Comment #8 from Jakub Jelinek --- Author: jakub Date: Sun Apr 1 06:05:01 2018 New Revision: 258994 URL: https://gcc.gnu.org/viewcvs?rev=258994=gcc=rev Log: PR middle-end/85090 * config/i386/sse.md (V): Add V64QI and V32HI for TARGET_AVX512F. (V_128_256): New mode iterator. (*avx512dq_vextract64x2_1 splitter): New define_split. (*avx512f_vextract32x4_1 splitter): Likewise. (xop_pcmov_): Use V_128_256 mode iterator instead of V. * config/i386/i386.c (ix86_expand_vector_set): Improve V32HImode and V64QImode expansion for !TARGET_AVX512BW && TARGET_AVX512F. * gcc.target/i386/avx512f-pr85090-1.c: New test. * gcc.target/i386/avx512f-pr85090-2.c: New test. * gcc.target/i386/avx512f-pr85090-3.c: New test. * gcc.target/i386/avx512bw-pr85090-2.c: New test. * gcc.target/i386/avx512bw-pr85090-3.c: New test. Added: trunk/gcc/testsuite/gcc.target/i386/avx512bw-pr85090-2.c trunk/gcc/testsuite/gcc.target/i386/avx512bw-pr85090-3.c trunk/gcc/testsuite/gcc.target/i386/avx512f-pr85090-1.c trunk/gcc/testsuite/gcc.target/i386/avx512f-pr85090-2.c trunk/gcc/testsuite/gcc.target/i386/avx512f-pr85090-3.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/i386.c trunk/gcc/config/i386/sse.md trunk/gcc/testsuite/ChangeLog
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 --- Comment #7 from Jakub Jelinek --- Created attachment 43802 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43802=edit gcc8-pr85090-wip.patch Untested WIP; some more work is needed to get rid of the useless vextracti32x4 $0, ... insns, similarly e.g. how vec_extract_lo_v32hi deals with it. Dunno if we want in some way suggest to RA that it is less costly to use a dup operand in that case.
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 Jakub Jelinek changed: What|Removed |Added CC||itsimbal at gcc dot gnu.org, ||kyukhin at gcc dot gnu.org --- Comment #6 from Jakub Jelinek --- The reason why we do something so weird rather than ix86_expand_vector_set is that sse.md lacks vec_setv32hi and vec_setv64qi patterns, IMHO it should have them. E.g. on: typedef short V __attribute__((vector_size (64))); V foo (V x, int y) { x[0] = y; return x; } V bar (V x, int y) { x[7] = y; return x; } V baz (V x, int y) { x[11] = y; return x; } we generate completely terrible code with -O2 -mavx512f -mtune=intel or -O2 -mavx512bw -mtune=intel. Moving the word out of the vector, performing masking etc. on the GRPs and then inserting it again. clang emits: vpinsrw $0, %edi, %xmm0, %xmm2 vpblendd $15, %ymm2, %ymm0, %ymm0 (and s/$0/$7/) for foo/bar with -O2 -mavx512f, which makes me wonder if the VEX 256-bit vpblendd with 4 arguments really doesn't clear the upper 256 bits, 128-bit vpblendd is documented to clear them, and for baz: vextracti128 $1, %ymm0, %xmm2 vpinsrw $3, %edi, %xmm2, %xmm2 vinserti128 $1, %xmm2, %ymm0, %ymm0 With -O2 -mavx512bw they emit: vpinsrw $0, %edi, %xmm0, %xmm1 vinserti32x4 $0, %xmm1, %zmm0, %zmm0 (and s/$0/$7/) for foo/bar, though none of those instructions actually require AVX512BW. And for baz: vextracti128 $1, %ymm0, %xmm1 vpinsrw $3, %edi, %xmm1, %xmm1 vinserti32x4 $1, %xmm1, %zmm0, %zmm0
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org, ||vmakarov at gcc dot gnu.org --- Comment #5 from Jakub Jelinek --- I guess it depends on what exactly a normal subreg on lhs means. The documentation says: When used as an lvalue, 'subreg' is a word-based accessor. Storing to a 'subreg' modifies all the words of REG that overlap the 'subreg', but it leaves the other words of REG alone. but in this case, we really don't have a GPR, but rather a single much larger (512-bit) register. Does it still imply that for -m32 (subreg:SI (reg:V32HI) 0) sets just the low 32 bits of the large register and doesn't modify anything else, and for -m64 the same means set low 32 bits, have the 32 bits above it undefined and the rest of bits unmodified? Seems store_bit_field treats it that way, but perhaps IRA does not?
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 --- Comment #4 from Uroš Bizjak --- The problem lies in: (insn 4214 4213 4219 2 (parallel [ (set (subreg:DI (reg:V32HI 4037 [ i ]) 0) (ior:DI (reg:DI 4040) (reg:DI 4038 [ _7 ]))) (clobber (reg:CC 17 flags)) ]) "pr85090.c":13 442 {*iordi_1} (expr_list:REG_DEAD (reg:DI 4040) (expr_list:REG_DEAD (reg:DI 4038 [ _7 ]) (expr_list:REG_UNUSED (reg:CC 17 flags) (nil) (insn 4219 4214 4220 2 (set (reg/i:V32HI 21 xmm0) (reg:V32HI 4037 [ i ])) "pr85090.c":16 1254 {movv32hi_internal} (expr_list:REG_DEAD (reg:V32HI 4037 [ i ]) where (insn 4214) reloads with: (insn 4214 4213 5658 2 (parallel [ (set (reg:DI 0 ax [4040]) (ior:DI (reg:DI 0 ax [4040]) (reg:DI 1 dx [orig:4038 _7 ] [4038]))) (clobber (reg:CC 17 flags)) ]) "pr85090.c":13 442 {*iordi_1} (nil)) (insn 5658 4214 5812 2 (set (mem/c:DI (plus:DI (reg/f:DI 7 sp) (const_int 264 [0x108])) [4 %sfp+-64 S8 A512]) (reg:DI 0 ax [4040])) "pr85090.c":13 85 {*movdi_internal} (nil)) (insn 5812 5658 4219 2 (set (reg:DI 57 xmm20 [orig:422 i ] [422]) (mem/c:DI (plus:DI (reg/f:DI 7 sp) (const_int 264 [0x108])) [4 %sfp+-64 S8 A512])) "pr85090.c":13 85 {*movdi_internal} (nil)) (insn 4219 5812 4220 2 (set (reg/i:V32HI 21 xmm0) (reg/v:V32HI 57 xmm20 [orig:422 i ] [422])) "pr85090.c":16 1254 {movv32hi_internal} (nil)) Using a subreg is not enough to change only part of the register.
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 H.J. Lu changed: What|Removed |Added CC||jakub at redhat dot com, ||ubizjak at gmail dot com Target Milestone|--- |8.0
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 --- Comment #3 from ktkachov at gcc dot gnu.org --- Hmm, I don't have access to AVX512F hardware so I can't reproduce the runtime failure. The vector simplifications that my patch introduces look correct to me from looking at the dumps. I'm not very familiar with i386.md but the *movdi_internal pattern that produces the vmovq that zeroes out the top of the z register doesn't seem to represent that in RTL. So GCC ends up loading a DImode register xmm20 but then stores it as a V32HImode register, the zero-extending effects of the DImode load are not represented at RTL. Any ideas?
[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090 --- Comment #2 from ktkachov at gcc dot gnu.org --- Thanks for bisecting, I'll have a look