[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #14 from Hongtao.liu --- *** Bug 96891 has been marked as a duplicate of this bug. ***
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #13 from Hongtao.liu --- Created attachment 49832 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49832=edit gcc11-pr98348_v3.patch 1. Use REAL_VALUE_TO_TARGET_SINGLE/DOUBLE in the "float_vector_all_ones_operands" predicate, it should be cheaper than lower_subreg? 2. Combine failed to match (vec_merge (vector_all_ones_operand) (const0_operand) (knot mask)) even the corresponding pattern exists, peephole2 could be ok for this but abviously not the best solution. IMHO the changes about knot should be functionality ok, and to avoid some potential performance issue, I extraly add some new define_insn_and_splits to negate the comparison code when there's not in the result(Dig the hole deeper and deeper:(). The patch survive regtest and bootstrap on both trunk and rlease/gcc-10 branch.
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #12 from Hongtao.liu --- (In reply to Jakub Jelinek from comment #11) > I'm not sure about the knot changes, isn't that too risky at least at this > point? > I mean, can't we instead just match what knot emits? > As indicated in https://gcc.gnu.org/pipermail/gcc-patches/2020-August/552205.html, we support bitwise operator for avx512 masks, so i believe the change is fine for GCC11, but not suitable for GCC10. For backport convenience, I'll make a version that matches the knot. > As for the new predicate, I think we should check CONST_DOUBLE_AS_FLOAT_P > (op) > before trying to do the lowpart_subreg, perhaps even check if the > CONST_DOUBLE is NaN if it can be done cheaply before doing the more > expensive lowpart_subreg. Yes.
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #11 from Jakub Jelinek --- I'm not sure about the knot changes, isn't that too risky at least at this point? I mean, can't we instead just match what knot emits? As for the new predicate, I think we should check CONST_DOUBLE_AS_FLOAT_P (op) before trying to do the lowpart_subreg, perhaps even check if the CONST_DOUBLE is NaN if it can be done cheaply before doing the more expensive lowpart_subreg.
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #10 from Hongtao.liu --- Created attachment 49819 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49819=edit Incremental to gcc11-pr99348.patch Update patch.
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #9 from Hongtao.liu --- (In reply to Jakub Jelinek from comment #8) > Created attachment 49806 [details] > gcc11-pr98348.patch > > So, if we go for GCC11 the way of pre-reload define_insn_and_split, this is > some incremental untested progress on your patch (just the sse.md part of > it). > Changes: > 1) it is undesirable to put SUBREGs on the SET_DEST side, as it prevents > other optimizations later on > 2) I don't see the point on the TARGET_AVX512BW ||, the insn in the end is > plain AVX or AVX2 or SSE4* etc. one > 3) handles also the const0 vector_all_ones order > 4) for commutative cases allows any operand order, for others ensures the > right The bellow pattern should be equivilent to const0 vector_all_ones order, but the generic part didn't simplify it, so i made a bit adjustment to those patterns, also some changes in ix86_expand_sse_movcc to generate common NOT operator instead of gen_knot which has UNSPEC_MASKOP inside, so the combine can do the right thing. Successfully matched this instruction: (set (reg:V16QI 82 [ ]) (vec_merge:V16QI (const_vector:V16QI [ (const_int -1 [0x]) repeated x16 ]) (const_vector:V16QI [ (const_int 0 [0]) repeated x16 ]) (not:HI (unspec:HI [ (reg:V16QI 89) (reg:V16QI 90) (const_int 4 [0x4]) ] UNSPEC_PCMP > one of the operands is register > 5) handles also the LE case by swapping the comparison operands > > The patch doesn't handle the cases where based on the comparison one sets up > floating vectors, as can be seen e.g. in: > typedef float V128 __attribute__ ((vector_size(16))); > typedef float V256 __attribute__ ((vector_size(32))); > typedef float V512 __attribute__ ((vector_size(64))); > > V128 > foo (V128 x) > { > const union U { unsigned u; float f; } u = { -1U }; > return x > 0.0f ? u.f : 0.0f; > } > > V256 > bar (V256 x) > { > const union U { unsigned u; float f; } u = { -1U }; > return x > 0.0f ? u.f : 0.0f; > } I'm adding a new predicate named float_vector_all_ones_operand and corresponding define_insn_and_split to handle it. Successfully matched this instruction: (set (reg:V4SF 82 [ ]) (vec_merge:V4SF (mem/u/c:V4SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S16 A128]) (const_vector:V4SF [ (const_double:SF 0.0 [0x0.0p+0]) repeated x4 ]) (unspec:QI [ (reg:V4SF 84) (reg:V4SF 88) (const_int 1 [0x1]) ] UNSPEC_PCMP)))
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #8 from Jakub Jelinek --- Created attachment 49806 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49806=edit gcc11-pr98348.patch So, if we go for GCC11 the way of pre-reload define_insn_and_split, this is some incremental untested progress on your patch (just the sse.md part of it). Changes: 1) it is undesirable to put SUBREGs on the SET_DEST side, as it prevents other optimizations later on 2) I don't see the point on the TARGET_AVX512BW ||, the insn in the end is plain AVX or AVX2 or SSE4* etc. one 3) handles also the const0 vector_all_ones order 4) for commutative cases allows any operand order, for others ensures the right one of the operands is register 5) handles also the LE case by swapping the comparison operands The patch doesn't handle the cases where based on the comparison one sets up floating vectors, as can be seen e.g. in: typedef float V128 __attribute__ ((vector_size(16))); typedef float V256 __attribute__ ((vector_size(32))); typedef float V512 __attribute__ ((vector_size(64))); V128 foo (V128 x) { const union U { unsigned u; float f; } u = { -1U }; return x > 0.0f ? u.f : 0.0f; } V256 bar (V256 x) { const union U { unsigned u; float f; } u = { -1U }; return x > 0.0f ? u.f : 0.0f; }
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #7 from Jakub Jelinek --- Seems the adjustment for last UNSPEC_PCMP operand between all_ones, const0 and const0, all_ones is GEN_INT (INTVAL (operands[3]) ^ 4).
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #6 from Jakub Jelinek --- In the light of the recent discussions I've been wondering about doing it as combine splitters only, like roughly: --- sse.md.jj 2020-12-03 10:04:35.862093285 +0100 +++ sse.md 2020-12-19 11:00:14.272140859 +0100 @@ -2965,6 +2965,40 @@ (set_attr "prefix" "vex") (set_attr "mode" "")]) +(define_split + [(set (match_operand 0 "register_operand") + (vec_merge + (match_operand 1 "vector_all_ones_operand") + (match_operand 2 "const0_operand") + (unspec + [(match_operand 3 "register_operand") +(match_operand 4 "nonimmediate_operand") +(match_operand:SI 5 "const_0_to_31_operand")] +UNSPEC_PCMP)))] + "TARGET_AVX512VL + && GET_MODE_CLASS (GET_MODE (operands[0])) == MODE_VECTOR_INT + && (GET_MODE_SIZE (GET_MODE (operands[1])) == 16 + || GET_MODE_SIZE (GET_MODE (operands[1])) == 32) + && GET_MODE (operands[1]) == GET_MODE (operands[0]) + && GET_MODE (operands[2]) == GET_MODE (operands[0]) + && GET_MODE_CLASS (GET_MODE (operands[3])) == MODE_VECTOR_FLOAT + && (GET_MODE_SIZE (GET_MODE (operands[3])) + == GET_MODE_SIZE (GET_MODE (operands[0]))) + && (GET_MODE_UNIT_SIZE (GET_MODE (operands[3])) + == GET_MODE_UNIT_SIZE (GET_MODE (operands[0]))) + && GET_MODE (operands[4]) == GET_MODE (operands[3])" + [(set (match_dup 6) (match_dup 7)) + (set (match_dup 0) (match_dup 8))] +{ + operands[6] = gen_reg_rtx (GET_MODE (operands[3])); + operands[7] += gen_rtx_UNSPEC (GET_MODE (operands[3]), + gen_rtvec (3, operands[3], operands[4], operands[5]), + UNSPEC_PCMP); + operands[8] = lowpart_subreg (GET_MODE (operands[0]), operands[6], + GET_MODE (operands[3])); +}) + (define_insn "avx_vmcmp3" [(set (match_operand:VF_128 0 "register_operand" "=x") (vec_merge:VF_128 The advantage is that one pattern can then handle in theory all (or half) of the floating point comparison cases. One problem is that combiner still doesn't even try the splitting if only combining two insns. Also, but I think that is in your patch too, vector_all_ones_operand will match only integral all ones vectors, I think we want another predicate that will match even MEMs with the floating point version thereof (a NaN kind with all bits set). And, we should have splitters for not just the -1 0 order in VEC_MERGE, but also the 0 -1 order by inverting the comparison carefully.
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #5 from Hongtao.liu --- Created attachment 49804 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49804=edit This patch can fix this issue I'm testing the patch.
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #4 from Hongtao.liu --- (In reply to Jakub Jelinek from comment #3) > Started with r10-5250-g8b905e9b0c09530c0f660563540257f3d181c2ac > Perhaps peephole2s or something similar to catch that? A patch(with peephole2) is posted at https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559301.html. Maybe i'll add a combine spliter to handle this case.
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #3 from Jakub Jelinek --- Started with r10-5250-g8b905e9b0c09530c0f660563540257f3d181c2ac Perhaps peephole2s or something similar to catch that?
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 H.J. Lu changed: What|Removed |Added CC||crazylht at gmail dot com, ||hjl.tools at gmail dot com --- Comment #2 from H.J. Lu --- Hongtao, can you take a look?
[Bug c++/98348] GCC 10.2 AVX512 Mask regression from GCC 9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98348 --- Comment #1 from Daniel Han-Chen --- I also just noticed that in GCC 10, an extra movdqa is done, which is also not necessary.