Re: [PATCH] Fix non-AVX512VL handling of lo extraction from AVX512F xmm16+ (PR target/85328)

2018-04-12 Thread Jakub Jelinek
On Thu, Apr 12, 2018 at 01:46:40PM +0300, Kirill Yukhin wrote:
> 
> Hello Jakub!
> 
> > On 11 Apr 2018, at 16:27, Jakub Jelinek  wrote:
> > In lots of patterns we assume that we never see xmm16+ hard registers
> > with 128-bit and 256-bit vector modes when not -mavx512vl, because
> > HARD_REGNO_MODE_OK refuses those.
> > Unfortunately, as this testcase and patch shows, the vec_extract_lo*
> > splitters work as a loophole around this, we happily create instructions
> > like (set (reg:V32QI xmm5) (reg:V32QI xmm16)) and then hard register
> > propagation can propagate the V32QI xmm16 into other insns like vpand.
> > 
> > The following patch fixes it by making sure we never create such registers,
> > just emit (set (reg:V64QI xmm5) (reg:V64QI xmm16)) instead, which by copying
> > all the 512 bits also copies the low bits, and as the destination is
> > originally V32QI which is not HARD_REGNO_MODE_OK in xmm16+, this should be
> > fine.
> > 
> > Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
> Patch is OK for trunk.

I've posted an updated version of this patch later on in
https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00563.html
Is that one ok for trunk instead?

And sorry for not getting it right the first time.

Jakub


Re: [PATCH] Fix non-AVX512VL handling of lo extraction from AVX512F xmm16+ (PR target/85328)

2018-04-12 Thread Kirill Yukhin


> On 12 Apr 2018, at 13:53, Jakub Jelinek  wrote:
> 
> On Thu, Apr 12, 2018 at 01:46:40PM +0300, Kirill Yukhin wrote:
>> 
>> Hello Jakub!
>> 
>>> On 11 Apr 2018, at 16:27, Jakub Jelinek  wrote:
>>> In lots of patterns we assume that we never see xmm16+ hard registers
>>> with 128-bit and 256-bit vector modes when not -mavx512vl, because
>>> HARD_REGNO_MODE_OK refuses those.
>>> Unfortunately, as this testcase and patch shows, the vec_extract_lo*
>>> splitters work as a loophole around this, we happily create instructions
>>> like (set (reg:V32QI xmm5) (reg:V32QI xmm16)) and then hard register
>>> propagation can propagate the V32QI xmm16 into other insns like vpand.
>>> 
>>> The following patch fixes it by making sure we never create such registers,
>>> just emit (set (reg:V64QI xmm5) (reg:V64QI xmm16)) instead, which by copying
>>> all the 512 bits also copies the low bits, and as the destination is
>>> originally V32QI which is not HARD_REGNO_MODE_OK in xmm16+, this should be
>>> fine.
>>> 
>>> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
>> Patch is OK for trunk.
> 
> I've posted an updated version of this patch later on in
> https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00563.html
> Is that one ok for trunk instead?
Yes.

—
Thanks, K
> 
> And sorry for not getting it right the first time.
> 
>   Jakub



Re: [PATCH] Fix non-AVX512VL handling of lo extraction from AVX512F xmm16+ (PR target/85328)

2018-04-12 Thread Kirill Yukhin

Hello Jakub!

> On 11 Apr 2018, at 16:27, Jakub Jelinek  wrote:
> 
> Hi!
> 
> In lots of patterns we assume that we never see xmm16+ hard registers
> with 128-bit and 256-bit vector modes when not -mavx512vl, because
> HARD_REGNO_MODE_OK refuses those.
> Unfortunately, as this testcase and patch shows, the vec_extract_lo*
> splitters work as a loophole around this, we happily create instructions
> like (set (reg:V32QI xmm5) (reg:V32QI xmm16)) and then hard register
> propagation can propagate the V32QI xmm16 into other insns like vpand.
> 
> The following patch fixes it by making sure we never create such registers,
> just emit (set (reg:V64QI xmm5) (reg:V64QI xmm16)) instead, which by copying
> all the 512 bits also copies the low bits, and as the destination is
> originally V32QI which is not HARD_REGNO_MODE_OK in xmm16+, this should be
> fine.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
Patch is OK for trunk.

—
Thanks, K



[PATCH] Fix non-AVX512VL handling of lo extraction from AVX512F xmm16+ (PR target/85328, take 2)

2018-04-11 Thread Jakub Jelinek
On Wed, Apr 11, 2018 at 03:27:28PM +0200, Jakub Jelinek wrote:
> In lots of patterns we assume that we never see xmm16+ hard registers
> with 128-bit and 256-bit vector modes when not -mavx512vl, because
> HARD_REGNO_MODE_OK refuses those.
> Unfortunately, as this testcase and patch shows, the vec_extract_lo*
> splitters work as a loophole around this, we happily create instructions
> like (set (reg:V32QI xmm5) (reg:V32QI xmm16)) and then hard register
> propagation can propagate the V32QI xmm16 into other insns like vpand.
> 
> The following patch fixes it by making sure we never create such registers,
> just emit (set (reg:V64QI xmm5) (reg:V64QI xmm16)) instead, which by copying
> all the 512 bits also copies the low bits, and as the destination is
> originally V32QI which is not HARD_REGNO_MODE_OK in xmm16+, this should be
> fine.
> 
> Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Actually, thinking about it more (not that I have managed to come up with a
testcase), if output is a MEM and input is xmm16+, then we really need to
give up in the splitters and instead emit the v*extract* instructions,
because simple vmovdqa and vmovap[sd] require AVX512VL for the EVEX
encodings.

So, here is an updated patch, bootstrapped/regtested on x86_64-linux and
i686-linux, is this one ok for trunk instead?

Tried e.g.
#include 

__m256d f1 (__m512d x) { register __m512d a __asm ("zmm16"); __asm ("" : "=v" 
(a) : "0" (x)); return _mm512_extractf64x4_pd (a, 0); }
void f2 (__m256d *p, __m512d x) { register __m512d a __asm ("zmm16"); __asm ("" 
: "=v" (a) : "0" (x)); *p = _mm512_extractf64x4_pd (a, 0); }
__m256d f3 (__m512d x, __m256d y) { register __m512d a __asm ("zmm16"); __asm 
("" : "=v" (a) : "0" (x)); return y + _mm512_extractf64x4_pd (a, 0); }
__m128 f4 (__m512 x) { register __m512 a __asm ("zmm16"); __asm ("" : "=v" (a) 
: "0" (x)); return _mm512_extractf32x4_ps (a, 0); }
void f5 (__m128 *p, __m512 x) { register __m512 a __asm ("zmm16"); __asm ("" : 
"=v" (a) : "0" (x)); *p = _mm512_extractf32x4_ps (a, 0); }
__m128 f6 (__m512 x, __m128 y) { register __m512 a __asm ("zmm16"); __asm ("" : 
"=v" (a) : "0" (x)); return y + _mm512_extractf32x4_ps (a, 0); }
__m256i f7 (__m512i x) { register __m512i a __asm ("zmm16"); __asm ("" : "=v" 
(a) : "0" (x)); return _mm512_extracti64x4_epi64 (a, 0); }
void f8 (__m256i *p, __m512i x) { register __m512i a __asm ("zmm16"); __asm ("" 
: "=v" (a) : "0" (x)); *p = _mm512_extracti64x4_epi64 (a, 0); }
__m256i f9 (__m512i x, __m256i y) { register __m512i a __asm ("zmm16"); __asm 
("" : "=v" (a) : "0" (x)); return y + _mm512_extracti64x4_epi64 (a, 0); }
__m128i f10 (__m512i x) { register __m512i a __asm ("zmm16"); __asm ("" : "=v" 
(a) : "0" (x)); return _mm512_extracti32x4_epi32 (a, 0); }
void f11 (__m128i *p, __m512i x) { register __m512i a __asm ("zmm16"); __asm 
("" : "=v" (a) : "0" (x)); *p = _mm512_extracti32x4_epi32 (a, 0); }
__m128i f12 (__m512i x, __m128i y) { register __m512i a __asm ("zmm16"); __asm 
("" : "=v" (a) : "0" (x)); return y + _mm512_extracti32x4_epi32 (a, 0); }
but couldn't reproduce though.

2018-04-11  Jakub Jelinek  

PR target/85328
* config/i386/sse.md
(avx512dq_vextract64x2_1 split,
avx512f_vextract32x4_1 split,
vec_extract_lo_ split, vec_extract_lo_v32hi,
vec_extract_lo_v64qi): For non-AVX512VL if input is xmm16+ reg
and output is a reg, avoid creating invalid lowpart subreg, but
instead split into a 512-bit move.  Don't split if not AVX512VL,
input is xmm16+ reg and output is a mem.
(vec_extract_lo_, vec_extract_lo_v32hi,
vec_extract_lo_v64qi): Don't require split if not AVX512VL, input is
xmm16+ reg and output is a mem.

* gcc.target/i386/pr85328.c: New test.

--- gcc/config/i386/sse.md.jj   2018-04-11 13:36:29.368015262 +0200
+++ gcc/config/i386/sse.md  2018-04-11 17:15:56.175746606 +0200
@@ -7361,9 +7361,21 @@ (define_split
(vec_select:
  (match_operand:V8FI 1 "register_operand")
  (parallel [(const_int 0) (const_int 1)])))]
-  "TARGET_AVX512DQ && reload_completed"
+  "TARGET_AVX512DQ
+   && reload_completed
+   && (TARGET_AVX512VL
+   || REG_P (operands[0])
+   || !EXT_REX_SSE_REG_P (operands[1]))"
   [(set (match_dup 0) (match_dup 1))]
-  "operands[1] = gen_lowpart (mode, operands[1]);")
+{
+  if (!TARGET_AVX512VL
+  && REG_P (operands[0])
+  && EXT_REX_SSE_REG_P (operands[1]))
+operands[0]
+  = lowpart_subreg (mode, operands[0], mode);
+  else
+operands[1] = gen_lowpart (mode, operands[1]);
+})
 
 (define_insn "avx512f_vextract32x4_1"
   [(set (match_operand: 0 "" 
"=")
@@ -7394,9 +7406,21 @@ (define_split
  (match_operand:V16FI 1 "register_operand")
  (parallel [(const_int 0) (const_int 1)
 (const_int 2) (const_int 3)])))]
-  "TARGET_AVX512F && reload_completed"
+  "TARGET_AVX512F
+   && reload_completed
+   && (TARGET_AVX512VL
+

[PATCH] Fix non-AVX512VL handling of lo extraction from AVX512F xmm16+ (PR target/85328)

2018-04-11 Thread Jakub Jelinek
Hi!

In lots of patterns we assume that we never see xmm16+ hard registers
with 128-bit and 256-bit vector modes when not -mavx512vl, because
HARD_REGNO_MODE_OK refuses those.
Unfortunately, as this testcase and patch shows, the vec_extract_lo*
splitters work as a loophole around this, we happily create instructions
like (set (reg:V32QI xmm5) (reg:V32QI xmm16)) and then hard register
propagation can propagate the V32QI xmm16 into other insns like vpand.

The following patch fixes it by making sure we never create such registers,
just emit (set (reg:V64QI xmm5) (reg:V64QI xmm16)) instead, which by copying
all the 512 bits also copies the low bits, and as the destination is
originally V32QI which is not HARD_REGNO_MODE_OK in xmm16+, this should be
fine.

Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

2018-04-11  Jakub Jelinek  

PR target/85328
* config/i386/sse.md
(avx512dq_vextract64x2_1 split,
avx512f_vextract32x4_1 split,
vec_extract_lo_ split, vec_extract_lo_v32hi,
vec_extract_lo_v64qi): For non-AVX512VL if input is xmm16+ reg
and output is a reg, avoid creating invalid lowpart subreg, but
instead split into a 512-bit move.

* gcc.target/i386/pr85328.c: New test.

--- gcc/config/i386/sse.md.jj   2018-04-10 14:37:02.092801344 +0200
+++ gcc/config/i386/sse.md  2018-04-11 12:00:44.296840287 +0200
@@ -7362,7 +7362,15 @@ (define_split
  (parallel [(const_int 0) (const_int 1)])))]
   "TARGET_AVX512DQ && reload_completed"
   [(set (match_dup 0) (match_dup 1))]
-  "operands[1] = gen_lowpart (mode, operands[1]);")
+{
+  if (!TARGET_AVX512VL
+  && REG_P (operands[0])
+  && EXT_REX_SSE_REG_P (operands[1]))
+operands[0]
+  = lowpart_subreg (mode, operands[0], mode);
+  else
+operands[1] = gen_lowpart (mode, operands[1]);
+})
 
 (define_insn "avx512f_vextract32x4_1"
   [(set (match_operand: 0 "" 
"=")
@@ -7395,7 +7403,15 @@ (define_split
 (const_int 2) (const_int 3)])))]
   "TARGET_AVX512F && reload_completed"
   [(set (match_dup 0) (match_dup 1))]
-  "operands[1] = gen_lowpart (mode, operands[1]);")
+{
+  if (!TARGET_AVX512VL
+  && REG_P (operands[0])
+  && EXT_REX_SSE_REG_P (operands[1]))
+operands[0]
+  = lowpart_subreg (mode, operands[0], mode);
+  else
+operands[1] = gen_lowpart (mode, operands[1]);
+})
 
 (define_mode_attr extract_type_2
   [(V16SF "avx512dq") (V16SI "avx512dq") (V8DF "avx512f") (V8DI "avx512f")])
@@ -7655,7 +7671,15 @@ (define_split
   "TARGET_AVX512F && !(MEM_P (operands[0]) && MEM_P (operands[1]))
&& reload_completed"
   [(set (match_dup 0) (match_dup 1))]
-  "operands[1] = gen_lowpart (mode, operands[1]);")
+{
+  if (!TARGET_AVX512VL
+  && REG_P (operands[0])
+  && EXT_REX_SSE_REG_P (operands[1]))
+operands[0]
+  = lowpart_subreg (mode, operands[0], mode);
+  else
+operands[1] = gen_lowpart (mode, operands[1]);
+})
 
 (define_insn "vec_extract_lo_"
   [(set (match_operand: 0 "" "=v,m")
@@ -7830,7 +7854,14 @@ (define_insn_and_split "vec_extract_lo_v
   "#"
   "&& reload_completed"
   [(set (match_dup 0) (match_dup 1))]
-  "operands[1] = gen_lowpart (V16HImode, operands[1]);")
+{
+  if (!TARGET_AVX512VL
+  && REG_P (operands[0])
+  && EXT_REX_SSE_REG_P (operands[1]))
+operands[0] = lowpart_subreg (V32HImode, operands[0], V16HImode);
+  else
+operands[1] = gen_lowpart (V16HImode, operands[1]);
+})
 
 (define_insn "vec_extract_hi_v32hi"
   [(set (match_operand:V16HI 0 "nonimmediate_operand" "=v,m")
@@ -7915,7 +7946,14 @@ (define_insn_and_split "vec_extract_lo_v
   "#"
   "&& reload_completed"
   [(set (match_dup 0) (match_dup 1))]
-  "operands[1] = gen_lowpart (V32QImode, operands[1]);")
+{
+  if (!TARGET_AVX512VL
+  && REG_P (operands[0])
+  && EXT_REX_SSE_REG_P (operands[1]))
+operands[0] = lowpart_subreg (V64QImode, operands[0], V32QImode);
+  else
+operands[1] = gen_lowpart (V32QImode, operands[1]);
+})
 
 (define_insn "vec_extract_hi_v64qi"
   [(set (match_operand:V32QI 0 "nonimmediate_operand" "=v,m")
--- gcc/testsuite/gcc.target/i386/pr85328.c.jj  2018-04-11 12:07:15.044933408 
+0200
+++ gcc/testsuite/gcc.target/i386/pr85328.c 2018-04-11 10:45:17.269733600 
+0200
@@ -0,0 +1,18 @@
+/* PR target/85328 */
+/* { dg-do assemble { target avx512f } } */
+/* { dg-options "-O3 -fno-caller-saves -mavx512f" } */
+
+typedef char U __attribute__((vector_size (64)));
+typedef int V __attribute__((vector_size (64)));
+U a, b;
+
+extern void bar (void);
+
+V
+foo (V f)
+{
+  b <<= (U){(V){}[63]} & 7;
+  bar ();
+  a = (U)f & 7;
+  return (V)b;
+}

Jakub