[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-04-06 Thread vmakarov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

--- Comment #13 from Vladimir Makarov  ---
(In reply to Uroš Bizjak from comment #11)
> (In reply to Jakub Jelinek from comment #5)
> > I guess it depends on what exactly a normal subreg on lhs means.
> > The documentation says:
> >   When used as an lvalue, 'subreg' is a word-based accessor.
> >   Storing to a 'subreg' modifies all the words of REG that
> >   overlap the 'subreg', but it leaves the other words of REG
> >   alone.
> 
> 
> But this wording applies only to multi-word registers. We can't use the
> above wording for 512bit single-word register, since we don't know how the
> move will affect the bits outside the subreg. We can say that the move
> "modifies all the words of REG that overlap the 'subreg', since we have only
> one 512-bit word of a 512-bit register.
> 

OK.

> So, I think that the transformation in the Comment 10 is invalid for
> registers that can't be decomposed to independent word-sized registers (to
> use "word-based accessor"), e.g. V32HImode xmm20. Perhaps the mentioned
> alter_subreg should choose correct approach based on TARGET_HARD_REGNO_NREGS?

Actually I do the same things as the old reload does.  It has practically the
same alter_subreg code.  May be the reload and LRA code is not up to date to
treat correctly this situation. I don't know.

  What I can do is to generate (strict_low_part (subreg:DI (reg:V32HI ))) to reflect the new semantics.  Something like

Index: lra.c
===
--- lra.c   (revision 258691)
+++ lra.c   (working copy)
@@ -487,14 +487,26 @@ int lra_curr_reload_num;
 void
 lra_emit_move (rtx x, rtx y)
 {
-  int old;
-
+  int old, regno;
+  machine_mode mode;
+  rtx reg;
+
   if (GET_CODE (y) != PLUS)
 {
   if (rtx_equal_p (x, y))
return;
   old = max_reg_num ();
-  emit_move_insn (x, y);
+  if (GET_CODE (x) == SUBREG
+ && REG_P (reg = SUBREG_REG (x))
+ && GET_MODE_SIZE (mode = GET_MODE (reg)).to_constant () >
UNITS_PER_WORD
+ && (regno = REGNO (reg)) >= FIRST_PSEUDO_REGISTER
+ && ira_reg_class_max_nregs[lra_get_allocno_class (regno)][mode] == 1)
+   {
+ x = gen_rtx_STRICT_LOW_PART (VOIDmode, x);
+ emit_insn (gen_rtx_SET (x, y));
+   }
+  else
+   emit_move_insn (x, y);
   if (REG_P (x))
lra_reg_info[ORIGINAL_REGNO (x)].last_reload = ++lra_curr_reload_num;
   /* Function emit_move can create pseudos -- so expand the pseudo


  But we need insn patterns for such cases which are absent in i386 md files. 
Without adding them, compiler will crash in LRA.

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-04-05 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

--- Comment #12 from Uroš Bizjak  ---
(In reply to Uroš Bizjak from comment #11)

> gcc -O2 -m27332:

"gcc -O2 -m32", the test shows dumps for 32bit x86 target.

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-04-05 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

--- Comment #11 from Uroš Bizjak  ---
(In reply to Jakub Jelinek from comment #5)
> I guess it depends on what exactly a normal subreg on lhs means.
> The documentation says:
>   When used as an lvalue, 'subreg' is a word-based accessor.
>   Storing to a 'subreg' modifies all the words of REG that
>   overlap the 'subreg', but it leaves the other words of REG
>   alone.

I believe that the above wording is intended to describe the following test:

--cut here--
union u
{
  long long ll;
  int i[2];
};

union u test (union u a)
{
  a.ll >>= 3;
  a.i[0] = 5;
  a.ll <<= 3;

  return a;
}
--cut here--

gcc -O2 -m27332:

_.273r.ira shows:

(insn 8 7 9 2 (parallel [
(set (reg/v:DI 92 [ a ])
(ashiftrt:DI (reg/v:DI 92 [ a ])
(const_int 3 [0x3])))
(clobber (reg:CC 17 flags))
]) "subr.c":9 532 {*ashrdi3_doubleword}
 (expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)))
(insn 9 8 10 2 (set (subreg:SI (reg/v:DI 92 [ a ]) 0)
(const_int 5 [0x5])) "subr.c":10 82 {*movsi_internal}
 (nil))
(insn 10 9 12 2 (parallel [
(set (reg:DI 94)
(ashift:DI (reg/v:DI 92 [ a ])
(const_int 3 [0x3])))
(clobber (reg:CC 17 flags))
]) "subr.c":11 503 {*ashldi3_doubleword}
 (expr_list:REG_DEAD (reg/v:DI 92 [ a ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(expr_list:REG_EQUIV (mem/c:DI (reg/f:SI 91 [ .result_ptr ]) [1
+0 S8 A32])
(nil)

please note (insn 9), which LRA converts to:

(insn 8 7 9 2 (parallel [
(set (reg/v:DI 4 si [orig:92 a ] [92])
(ashiftrt:DI (reg/v:DI 4 si [orig:92 a ] [92])
(const_int 3 [0x3])))
(clobber (reg:CC 17 flags))
]) "subr.c":9 532 {*ashrdi3_doubleword}
 (nil))
(insn 9 8 20 2 (set (reg:SI 4 si [orig:92 a ] [92])
(const_int 5 [0x5])) "subr.c":10 82 {*movsi_internal}
 (nil))
(insn 20 9 10 2 (set (reg:DI 2 cx [94])
(reg/v:DI 4 si [orig:92 a ] [92])) "subr.c":11 81 {*movdi_internal}
 (nil))
(insn 10 20 12 2 (parallel [
(set (reg:DI 2 cx [94])
(ashift:DI (reg:DI 2 cx [94])
(const_int 3 [0x3])))
(clobber (reg:CC 17 flags))
]) "subr.c":11 503 {*ashldi3_doubleword}
 (expr_list:REG_EQUIV (mem/c:DI (reg/f:SI 0 ax [orig:91 .result_ptr ] [91])
[1 +0 S8 A32])
(nil)))

In the above pass, using (insn 9), the compiler inserted a 32-bit word into the
lowpart (%esi) of the multi-word register pair (%esi/%edi). In effect, "the
other words of REG" are left alone, as the documentation says.

But this wording applies only to multi-word registers. We can't use the above
wording for 512bit single-word register, since we don't know how the move will
affect the bits outside the subreg. We can say that the move "modifies all the
words of REG that overlap the 'subreg', since we have only one 512-bit word of
a 512-bit register.

So, I think that the transformation in the Comment 10 is invalid for registers
that can't be decomposed to independent word-sized registers (to use
"word-based accessor"), e.g. V32HImode xmm20. Perhaps the mentioned
alter_subreg should choose correct approach based on TARGET_HARD_REGNO_NREGS?

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-04-05 Thread vmakarov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

--- Comment #10 from Vladimir Makarov  ---
(In reply to Jakub Jelinek from comment #5)
> I guess it depends on what exactly a normal subreg on lhs means.
> The documentation says:
>   When used as an lvalue, 'subreg' is a word-based accessor.
>   Storing to a 'subreg' modifies all the words of REG that
>   overlap the 'subreg', but it leaves the other words of REG
>   alone.
> but in this case, we really don't have a GPR, but rather a single much
> larger (512-bit) register.  Does it still imply that for -m32 (subreg:SI
> (reg:V32HI) 0)
> sets just the low 32 bits of the large register and doesn't modify anything
> else,
> and for -m64 the same means set low 32 bits, have the 32 bits above it
> undefined and the rest of bits unmodified?
> 

Yes, I guess so. For x86-64 "the 32 bits above" will be zero which might be
treated as undefined.

> Seems store_bit_field treats it that way, but perhaps IRA does not?

I believe IRA treats it in the same way.

LRA does not just emit

(insn 5658 4214 5812 2 (set (mem/c:DI (plus:DI (reg/f:DI 7 sp)
(const_int 264 [0x108])) [4 %sfp+-64 S8 A512])
(reg:DI 0 ax [4040])) "pr85090.c":13 85 {*movdi_internal}
 (nil))
(insn 5812 5658 4219 2 (set (reg:DI 57 xmm20 [orig:422 i ] [422])
(mem/c:DI (plus:DI (reg/f:DI 7 sp)
(const_int 264 [0x108])) [4 %sfp+-64 S8 A512])) "pr85090.c":13
85 {*movdi_internal}
 (nil))

It emits through emit_move_insn (instead of xmm20, a pseudo is used on some LRA
sub-passes):

(insn 5812 5658 4219 2 (set (subreg:DI (reg/v:V32HI 57 xmm20 [orig:422 i ]
[422]) 0)   
(mem/c:DI (plus:DI (reg/f:DI 7 sp)  
(const_int 264 [0x108])) [4 %sfp+-64 S8 A512])) "v.i":13 85
{*movdi_internal}   
 (expr_list:REG_DEAD (reg:TI 20 frame)  
(nil))) 

and this insn exists till the very LRA end where it is changed (through
alter_subreg):

(insn 5812 5658 4219 2 (set (reg:DI 57 xmm20 [orig:422 i ] [422])
(mem/c:DI (plus:DI (reg/f:DI 7 sp)
(const_int 264 [0x108])) [4 %sfp+-64 S8 A512])) "pr85090.c":13
85 {*movdi_internal}
 (nil))

So LRA keeps subreg semantics.  It cannot change movdi_internal into the right
vector insn sequence.  I believe it is not LRA responsibility.

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-04-03 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

Jakub Jelinek  changed:

   What|Removed |Added

 CC|jakub at redhat dot com|

--- Comment #9 from Jakub Jelinek  ---
Vlad, do you think you could look at the IRA issue (with the above patch
reverted)?

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-04-01 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

--- Comment #8 from Jakub Jelinek  ---
Author: jakub   
Date: Sun Apr  1 06:05:01 2018  
New Revision: 258994

URL: https://gcc.gnu.org/viewcvs?rev=258994=gcc=rev   
Log:
PR middle-end/85090 
* config/i386/sse.md (V): Add V64QI and V32HI for TARGET_AVX512F.   
(V_128_256): New mode iterator. 
(*avx512dq_vextract64x2_1 splitter): New define_split. 
(*avx512f_vextract32x4_1 splitter): Likewise.  
(xop_pcmov_): Use V_128_256 mode iterator instead  
of V.   
* config/i386/i386.c (ix86_expand_vector_set): Improve V32HImode and
V64QImode expansion for !TARGET_AVX512BW && TARGET_AVX512F. 

* gcc.target/i386/avx512f-pr85090-1.c: New test.
* gcc.target/i386/avx512f-pr85090-2.c: New test.
* gcc.target/i386/avx512f-pr85090-3.c: New test.
* gcc.target/i386/avx512bw-pr85090-2.c: New test.   
* gcc.target/i386/avx512bw-pr85090-3.c: New test.   

Added:  
trunk/gcc/testsuite/gcc.target/i386/avx512bw-pr85090-2.c
trunk/gcc/testsuite/gcc.target/i386/avx512bw-pr85090-3.c
trunk/gcc/testsuite/gcc.target/i386/avx512f-pr85090-1.c 
trunk/gcc/testsuite/gcc.target/i386/avx512f-pr85090-2.c 
trunk/gcc/testsuite/gcc.target/i386/avx512f-pr85090-3.c 
Modified:   
trunk/gcc/ChangeLog 
trunk/gcc/config/i386/i386.c
trunk/gcc/config/i386/sse.md
trunk/gcc/testsuite/ChangeLog

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-03-29 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

--- Comment #7 from Jakub Jelinek  ---
Created attachment 43802
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43802=edit
gcc8-pr85090-wip.patch

Untested WIP; some more work is needed to get rid of the useless vextracti32x4
$0, ... insns, similarly e.g. how vec_extract_lo_v32hi deals with it.  Dunno if
we want in some way suggest to RA that it is less costly to use a dup operand
in that case.

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-03-28 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

Jakub Jelinek  changed:

   What|Removed |Added

 CC||itsimbal at gcc dot gnu.org,
   ||kyukhin at gcc dot gnu.org

--- Comment #6 from Jakub Jelinek  ---
The reason why we do something so weird rather than ix86_expand_vector_set is
that sse.md lacks vec_setv32hi and vec_setv64qi patterns, IMHO it should have
them.

E.g. on:
typedef short V __attribute__((vector_size (64)));

V
foo (V x, int y)
{
  x[0] = y;
  return x;
}

V
bar (V x, int y)
{
  x[7] = y;
  return x;
}

V
baz (V x, int y)
{
  x[11] = y;
  return x;
}

we generate completely terrible code with -O2 -mavx512f -mtune=intel or
-O2 -mavx512bw -mtune=intel.
Moving the word out of the vector, performing masking etc. on the GRPs and then
inserting it again.
clang emits:
vpinsrw $0, %edi, %xmm0, %xmm2
vpblendd $15, %ymm2, %ymm0, %ymm0
(and s/$0/$7/) for foo/bar with -O2 -mavx512f, which makes me wonder if the VEX
256-bit vpblendd with 4 arguments really doesn't clear the upper 256 bits,
128-bit vpblendd is documented to clear them, and for baz:
vextracti128 $1, %ymm0, %xmm2
vpinsrw $3, %edi, %xmm2, %xmm2
vinserti128 $1, %xmm2, %ymm0, %ymm0
With -O2 -mavx512bw they emit:
vpinsrw $0, %edi, %xmm0, %xmm1
vinserti32x4 $0, %xmm1, %zmm0, %zmm0
(and s/$0/$7/) for foo/bar, though none of those instructions actually require
AVX512BW.  And for baz:
vextracti128 $1, %ymm0, %xmm1
vpinsrw $3, %edi, %xmm1, %xmm1
vinserti32x4 $1, %xmm1, %zmm0, %zmm0

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-03-28 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org,
   ||vmakarov at gcc dot gnu.org

--- Comment #5 from Jakub Jelinek  ---
I guess it depends on what exactly a normal subreg on lhs means.
The documentation says:
  When used as an lvalue, 'subreg' is a word-based accessor.
  Storing to a 'subreg' modifies all the words of REG that
  overlap the 'subreg', but it leaves the other words of REG
  alone.
but in this case, we really don't have a GPR, but rather a single much larger
(512-bit) register.  Does it still imply that for -m32 (subreg:SI (reg:V32HI)
0)
sets just the low 32 bits of the large register and doesn't modify anything
else,
and for -m64 the same means set low 32 bits, have the 32 bits above it
undefined and the rest of bits unmodified?

Seems store_bit_field treats it that way, but perhaps IRA does not?

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-03-27 Thread ubizjak at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

--- Comment #4 from Uroš Bizjak  ---
The problem lies in:

(insn 4214 4213 4219 2 (parallel [
(set (subreg:DI (reg:V32HI 4037 [ i ]) 0)
(ior:DI (reg:DI 4040)
(reg:DI 4038 [ _7 ])))
(clobber (reg:CC 17 flags))
]) "pr85090.c":13 442 {*iordi_1}
 (expr_list:REG_DEAD (reg:DI 4040)
(expr_list:REG_DEAD (reg:DI 4038 [ _7 ])
(expr_list:REG_UNUSED (reg:CC 17 flags)
(nil)

(insn 4219 4214 4220 2 (set (reg/i:V32HI 21 xmm0)
(reg:V32HI 4037 [ i ])) "pr85090.c":16 1254 {movv32hi_internal}
 (expr_list:REG_DEAD (reg:V32HI 4037 [ i ])

where (insn 4214) reloads with:

(insn 4214 4213 5658 2 (parallel [
(set (reg:DI 0 ax [4040])
(ior:DI (reg:DI 0 ax [4040])
(reg:DI 1 dx [orig:4038 _7 ] [4038])))
(clobber (reg:CC 17 flags))
]) "pr85090.c":13 442 {*iordi_1}
 (nil))
(insn 5658 4214 5812 2 (set (mem/c:DI (plus:DI (reg/f:DI 7 sp)
(const_int 264 [0x108])) [4 %sfp+-64 S8 A512])
(reg:DI 0 ax [4040])) "pr85090.c":13 85 {*movdi_internal}
 (nil))
(insn 5812 5658 4219 2 (set (reg:DI 57 xmm20 [orig:422 i ] [422])
(mem/c:DI (plus:DI (reg/f:DI 7 sp)
(const_int 264 [0x108])) [4 %sfp+-64 S8 A512])) "pr85090.c":13
85 {*movdi_internal}
 (nil))
(insn 4219 5812 4220 2 (set (reg/i:V32HI 21 xmm0)
(reg/v:V32HI 57 xmm20 [orig:422 i ] [422])) "pr85090.c":16 1254
{movv32hi_internal}
 (nil))

Using a subreg is not enough to change only part of the register.

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-03-27 Thread hjl.tools at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

H.J. Lu  changed:

   What|Removed |Added

 CC||jakub at redhat dot com,
   ||ubizjak at gmail dot com
   Target Milestone|--- |8.0

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-03-27 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

--- Comment #3 from ktkachov at gcc dot gnu.org ---
Hmm, I don't have access to AVX512F hardware so I can't reproduce the runtime
failure.
The vector simplifications that my patch introduces look correct to me from
looking at the dumps.

I'm not very familiar with i386.md but the *movdi_internal pattern that
produces the vmovq that zeroes out the top of the z register doesn't seem to
represent that in RTL.

So GCC ends up loading a DImode register xmm20 but then stores it as a
V32HImode register, the zero-extending effects of the DImode load are not
represented at RTL.

Any ideas?

[Bug middle-end/85090] [8 Regression] wrong code with -O2 -fno-tree-dominator-opts -mavx512f -fira-algorithm=priority

2018-03-27 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85090

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Thanks for bisecting, I'll have a look