[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2021-01-03 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #14 from Hongtao.liu  ---
*** Bug 96891 has been marked as a duplicate of this bug. ***

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-22 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #13 from Hongtao.liu  ---
Created attachment 49832
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49832&action=edit
gcc11-pr98348_v3.patch

1. Use REAL_VALUE_TO_TARGET_SINGLE/DOUBLE in the
"float_vector_all_ones_operand" predicate; it should be cheaper than
lowpart_subreg? (See the sketch below.)

2. Combine failed to match (vec_merge (vector_all_ones_operand)
(const0_operand) (knot mask)) even though the corresponding pattern exists;
peephole2 could handle this, but it is obviously not the best solution.
IMHO the knot changes should be functionally OK, and to avoid some potential
performance issues I additionally added some new define_insn_and_splits that
negate the comparison code when there is a NOT in the result (digging the hole
deeper and deeper :(). The patch survives regtest and bootstrap on both trunk
and the releases/gcc-10 branch.
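
For point 1, a rough sketch of the REAL_VALUE_TO_TARGET_SINGLE/DOUBLE check
(illustration only, not the attached patch; the helper name
float_elt_all_ones_p is made up):

/* GCC-internals context (rtl.h / real.h).  Returns true if the SFmode or
   DFmode CONST_DOUBLE ELT has an all-ones bit pattern, checked via
   REAL_VALUE_TO_TARGET_* instead of going through lowpart_subreg.  */
static bool
float_elt_all_ones_p (rtx elt)
{
  if (!CONST_DOUBLE_AS_FLOAT_P (elt))
    return false;
  if (GET_MODE (elt) == SFmode)
    {
      long l;
      REAL_VALUE_TO_TARGET_SINGLE (*CONST_DOUBLE_REAL_VALUE (elt), l);
      return (l & 0xffffffff) == 0xffffffff;
    }
  if (GET_MODE (elt) == DFmode)
    {
      long l[2];
      REAL_VALUE_TO_TARGET_DOUBLE (*CONST_DOUBLE_REAL_VALUE (elt), l);
      return ((l[0] & 0xffffffff) == 0xffffffff
              && (l[1] & 0xffffffff) == 0xffffffff);
    }
  return false;
}

A predicate along these lines would just loop over the CONST_VECTOR elements
and require every element to pass this check.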

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-21 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #12 from Hongtao.liu  ---
(In reply to Jakub Jelinek from comment #11)
> I'm not sure about the knot changes, isn't that too risky at least at this
> point?
> I mean, can't we instead just match what knot emits?
> 
As indicated in
https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552205.html, we support
bitwise operators for AVX512 masks, so I believe the change is fine for GCC11,
but not suitable for GCC10.

For backport convenience, I'll make a version that matches the knot.
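
For reference, a minimal C-level illustration of a bitwise NOT on an AVX512
mask (not from the referenced patch; the function name is made up): since
__mmask16 is a plain integer type, the ~ below is equivalent to
_mm512_knot () / the knotw instruction, which is the kind of pattern these
changes are about.

#include <immintrin.h>

/* Select X where A and B are *not* equal; compile with -mavx512f.  */
__m512i
select_where_not_equal (__m512i a, __m512i b, __m512i x)
{
  __mmask16 eq = _mm512_cmpeq_epi32_mask (a, b);
  return _mm512_maskz_mov_epi32 (~eq, x);  /* ~eq instead of _mm512_knot (eq) */
}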

> As for the new predicate, I think we should check CONST_DOUBLE_AS_FLOAT_P
> (op)
> before trying to do the lowpart_subreg, perhaps even check if the
> CONST_DOUBLE is NaN if it can be done cheaply before doing the more
> expensive lowpart_subreg.

Yes.

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-21 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #11 from Jakub Jelinek  ---
I'm not sure about the knot changes, isn't that too risky at least at this
point?
I mean, can't we instead just match what knot emits?

As for the new predicate, I think we should check CONST_DOUBLE_AS_FLOAT_P (op)
before trying to do the lowpart_subreg, perhaps even check if the CONST_DOUBLE
is NaN if it can be done cheaply before doing the more expensive
lowpart_subreg.
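
A rough sketch of that check order (illustration only, not the committed
predicate; the surrounding context and the variable name op are assumed):

/* Reject the cheap cases first, pay for lowpart_subreg last.  */
if (!CONST_DOUBLE_AS_FLOAT_P (op))
  return false;
/* An all-ones bit pattern is a NaN, so a non-NaN can never match.  */
if (!REAL_VALUE_ISNAN (*CONST_DOUBLE_REAL_VALUE (op)))
  return false;
/* Only now do the more expensive bit-level check.  */
scalar_int_mode imode = int_mode_for_mode (GET_MODE (op)).require ();
rtx bits = lowpart_subreg (imode, op, GET_MODE (op));
return bits == constm1_rtx;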

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-21 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #10 from Hongtao.liu  ---
Created attachment 49819
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49819&action=edit
Incremental to gcc11-pr98348.patch

Update patch.

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-21 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #9 from Hongtao.liu  ---
(In reply to Jakub Jelinek from comment #8)
> Created attachment 49806 [details]
> gcc11-pr98348.patch
> 
> So, if we go for GCC11 the way of pre-reload define_insn_and_split, this is
> some incremental untested progress on your patch (just the sse.md part of
> it).
> Changes:
> 1) it is undesirable to put SUBREGs on the SET_DEST side, as it prevents
> other optimizations later on
> 2) I don't see the point of the TARGET_AVX512BW ||, the insn in the end is a
> plain AVX or AVX2 or SSE4* etc. one
> 3) handles also the const0 vector_all_ones order
> 4) for commutative cases allows any operand order, for others ensures the
> right

The pattern below should be equivalent to the const0, vector_all_ones order,
but the generic part didn't simplify it, so I made a small adjustment to those
patterns, plus some changes in ix86_expand_sse_movcc to generate a plain NOT
operation instead of gen_knot (which has UNSPEC_MASKOP inside), so that
combine can do the right thing.

Successfully matched this instruction:
(set (reg:V16QI 82 [  ])
(vec_merge:V16QI (const_vector:V16QI [
(const_int -1 [0xffffffffffffffff]) repeated x16
])
(const_vector:V16QI [
(const_int 0 [0]) repeated x16
])
(not:HI (unspec:HI [
(reg:V16QI 89)
(reg:V16QI 90)
(const_int 4 [0x4])
] UNSPEC_PCMP))))

> one of the operands is a register
> 5) handles also the LE case by swapping the comparison operands
> 
> The patch doesn't handle the cases where based on the comparison one sets up
> floating vectors, as can be seen e.g. in:
> typedef float V128 __attribute__ ((vector_size(16)));
> typedef float V256 __attribute__ ((vector_size(32)));
> typedef float V512 __attribute__ ((vector_size(64)));
> 
> V128
> foo (V128 x)
> {
>   const union U { unsigned u; float f; } u = { -1U };
>   return x > 0.0f ? u.f : 0.0f;
> }
> 
> V256
> bar (V256 x)
> {
>   const union U { unsigned u; float f; } u = { -1U };
>   return x > 0.0f ? u.f : 0.0f;
> }

I'm adding a new predicate named float_vector_all_ones_operand and a
corresponding define_insn_and_split to handle it.

Successfully matched this instruction:
(set (reg:V4SF 82 [  ])
(vec_merge:V4SF (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S16 A128])
(const_vector:V4SF [
(const_double:SF 0.0 [0x0.0p+0]) repeated x4
])
(unspec:QI [
(reg:V4SF 84)
(reg:V4SF 88)
(const_int 1 [0x1])
] UNSPEC_PCMP)))

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-19 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #8 from Jakub Jelinek  ---
Created attachment 49806
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49806&action=edit
gcc11-pr98348.patch

So, if we go for GCC11 the way of pre-reload define_insn_and_split, this is
some incremental untested progress on your patch (just the sse.md part of it).
Changes:
1) it is undesirable to put SUBREGs on the SET_DEST side, as it prevents other
optimizations later on
2) I don't see the point of the TARGET_AVX512BW ||, the insn in the end is a
plain AVX or AVX2 or SSE4* etc. one
3) handles also the const0 vector_all_ones order
4) for commutative cases allows any operand order, for others ensures the right
one of the operands is a register
5) handles also the LE case by swapping the comparison operands

The patch doesn't handle the cases where based on the comparison one sets up
floating vectors, as can be seen e.g. in:
typedef float V128 __attribute__ ((vector_size(16)));
typedef float V256 __attribute__ ((vector_size(32)));
typedef float V512 __attribute__ ((vector_size(64)));

V128
foo (V128 x)
{
  const union U { unsigned u; float f; } u = { -1U };
  return x > 0.0f ? u.f : 0.0f;
}

V256
bar (V256 x)
{
  const union U { unsigned u; float f; } u = { -1U };
  return x > 0.0f ? u.f : 0.0f;
}

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-19 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #7 from Jakub Jelinek  ---
Seems the adjustment of the last UNSPEC_PCMP operand between the all_ones,
const0 and const0, all_ones orders is GEN_INT (INTVAL (operands[3]) ^ 4).
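
For readers unfamiliar with the encoding: the vpcmp/vcmpps predicate
immediates are laid out so that flipping bit 2 negates the comparison, which
is why XOR with 4 swaps the two vec_merge orders.  A tiny illustrative helper
(hypothetical, not part of any patch here):

/* 0 (EQ) <-> 4 (NEQ), 1 (LT) <-> 5 (NLT/GE), 2 (LE) <-> 6 (NLE/GT),
   3 (FALSE/UNORD) <-> 7 (TRUE/ORD).  */
static inline int
negate_pcmp_imm (int imm)
{
  return imm ^ 4;
}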

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-19 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #6 from Jakub Jelinek  ---
In the light of the recent discussions I've been wondering about doing it as
combine splitters only, like roughly:
--- sse.md.jj   2020-12-03 10:04:35.862093285 +0100
+++ sse.md  2020-12-19 11:00:14.272140859 +0100
@@ -2965,6 +2965,40 @@
(set_attr "prefix" "vex")
(set_attr "mode" "")])

+(define_split
+  [(set (match_operand 0 "register_operand")
+   (vec_merge
+ (match_operand 1 "vector_all_ones_operand")
+ (match_operand 2 "const0_operand")
+ (unspec
+   [(match_operand 3 "register_operand")
+(match_operand 4 "nonimmediate_operand")
+(match_operand:SI 5 "const_0_to_31_operand")]
+UNSPEC_PCMP)))]
+  "TARGET_AVX512VL
+   && GET_MODE_CLASS (GET_MODE (operands[0])) == MODE_VECTOR_INT
+   && (GET_MODE_SIZE (GET_MODE (operands[1])) == 16
+   || GET_MODE_SIZE (GET_MODE (operands[1])) == 32)
+   && GET_MODE (operands[1]) == GET_MODE (operands[0])
+   && GET_MODE (operands[2]) == GET_MODE (operands[0])
+   && GET_MODE_CLASS (GET_MODE (operands[3])) == MODE_VECTOR_FLOAT
+   && (GET_MODE_SIZE (GET_MODE (operands[3]))
+   == GET_MODE_SIZE (GET_MODE (operands[0])))
+   && (GET_MODE_UNIT_SIZE (GET_MODE (operands[3]))
+   == GET_MODE_UNIT_SIZE (GET_MODE (operands[0])))
+   && GET_MODE (operands[4]) == GET_MODE (operands[3])"
+  [(set (match_dup 6) (match_dup 7))
+   (set (match_dup 0) (match_dup 8))]
+{
+  operands[6] = gen_reg_rtx (GET_MODE (operands[3]));
+  operands[7]
+= gen_rtx_UNSPEC (GET_MODE (operands[3]),
+ gen_rtvec (3, operands[3], operands[4], operands[5]),
+ UNSPEC_PCMP);
+  operands[8] = lowpart_subreg (GET_MODE (operands[0]), operands[6],
+   GET_MODE (operands[3]));
+})
+
 (define_insn "avx_vmcmp<mode>3"
   [(set (match_operand:VF_128 0 "register_operand" "=x")
(vec_merge:VF_128

The advantage is that one pattern can then in theory handle all (or half) of
the floating point comparison cases.
One problem is that the combiner still doesn't even try the splitting if it is
only combining two insns.
Also (though I think that is the case in your patch too), vector_all_ones_operand
will match only integral all-ones vectors; I think we want another predicate that
will match even MEMs holding the floating point version thereof (a kind of NaN
with all bits set).  And we should have splitters not just for the -1, 0 order
in VEC_MERGE, but also for the 0, -1 order, by inverting the comparison
carefully.
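
As a side note on why such a predicate has to work at the bit level: the
all-ones single-precision pattern is a NaN, so it cannot be recognized by
value comparison.  A standalone illustration (plain C, unrelated to the GCC
sources):

#include <stdio.h>
#include <string.h>

int
main (void)
{
  unsigned u = 0xffffffffu;
  float f;
  memcpy (&f, &u, sizeof f);  /* reinterpret the all-ones bits as a float */
  printf ("%f\n", f);         /* prints nan/-nan; note that f != f holds */
  return 0;
}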

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-18 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #5 from Hongtao.liu  ---
Created attachment 49804
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49804&action=edit
This patch can fix this issue

I'm testing the patch.

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-17 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #4 from Hongtao.liu  ---
(In reply to Jakub Jelinek from comment #3)
> Started with r10-5250-g8b905e9b0c09530c0f660563540257f3d181c2ac
> Perhaps peephole2s or something similar to catch that?

A patch (with peephole2) was posted at
https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559301.html.

Maybe I'll add a combine splitter to handle this case.

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-17 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

Jakub Jelinek  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek  ---
Started with r10-5250-g8b905e9b0c09530c0f660563540257f3d181c2ac
Perhaps peephole2s or something similar to catch that?

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-17 Thread hjl.tools at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

H.J. Lu  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com,
                   |                            |hjl.tools at gmail dot com

--- Comment #2 from H.J. Lu  ---
Hongtao, can you take a look?

[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9

2020-12-17 Thread danielhanchen at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348

--- Comment #1 from Daniel Han-Chen  ---
I also just noticed that in GCC 10 an extra movdqa is emitted, which is also
unnecessary.