[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #13 from Hongtao.liu ---
(In reply to Richard Biener from comment #10)
> Hmm, but
>
> DEF_INTERNAL_INT_FN (POPCOUNT, ECF_CONST | ECF_NOTHROW, popcount, unary)
>
> so there's clearly a mismatch between either the vectorizers interpretation
> or the optab.  But as far as I can see this is not a direct internal fn so
> vectorizable_internal_function shouldn't apply and I do not see the x86
> backend handle POPCOUNT in the vectorizable function target hook.
>
> So w/o a compiler capable I can't trace how the vectorizer vectorizes this
> and thus I have no idea where it goes wrong ...

A capable compiler is ready.
[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #12 from CVS Commits ---
The master branch has been updated by hongtao Liu:

https://gcc.gnu.org/g:81d590760c31e11e3a09135f4e182aea232035f2

commit r11-5693-g81d590760c31e11e3a09135f4e182aea232035f2
Author: Hongyu Wang
Date:   Wed Nov 11 09:41:13 2020 +0800

    Add popcount expander to enable popcount auto vectorization
    under AVX512BITALG/AVX512POPCNTDQ target.

    gcc/ChangeLog

            PR target/97770
            * config/i386/sse.md (popcount<mode>2): New expander
            for SI/DI vector modes.
            (popcount<mode>2): Likewise for QI/HI vector modes.

    gcc/testsuite/ChangeLog

            PR target/97770
            * gcc.target/i386/avx512bitalg-pr97770-1.c: New test.
            * gcc.target/i386/avx512vpopcntdq-pr97770-1.c: Likewise.
            * gcc.target/i386/avx512vpopcntdq-pr97770-2.c: Likewise.
            * gcc.target/i386/avx512vpopcntdqvl-pr97770-1.c: Likewise.
[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770 --- Comment #11 from Hongtao.liu --- A patch is posted at https://gcc.gnu.org/pipermail/gcc-patches/2020-November/558777.html
[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #10 from Richard Biener ---
Hmm, but

DEF_INTERNAL_INT_FN (POPCOUNT, ECF_CONST | ECF_NOTHROW, popcount, unary)

so there's clearly a mismatch between either the vectorizers interpretation
or the optab.  But as far as I can see this is not a direct internal fn so
vectorizable_internal_function shouldn't apply and I do not see the x86
backend handle POPCOUNT in the vectorizable function target hook.

So w/o a compiler capable I can't trace how the vectorizer vectorizes this
and thus I have no idea where it goes wrong ...
[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #9 from Hongtao.liu ---
> I guess that the vectorized popcount IFN is defined to be VnDI -> VnDI
> but we want to have VnSImode results.  This means the instruction is
> wrongly modeled in vectorized form?
>

Yes, because __builtin_popcount{l,ll} are defined as
{BT_FN_INT_ULONG, BT_FN_INT_ULONGLONG}, but for the vectorized form GCC
requires the modes of source and destination to be the same. The
documentation of the standard pattern name says:

popcountm2:
Store into operand 0 the number of 1-bits in operand 1.

m is either a scalar or vector integer mode. When it is a scalar, operand 1
has mode m but operand 0 can have whatever scalar integer mode is suitable
for the target. The compiler will insert conversion instructions as
necessary (typically to convert the result to the same width as int). When
m is a vector, both operands must have mode m.

This pattern is not allowed to FAIL.
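The mode rule quoted above can be illustrated in plain C. This is only a sketch, assuming a GCC-compatible compiler for `__builtin_popcountll`; the helper names `popcount_scalar`, `popcount_lanes` and `check_popcount_lanes` are hypothetical and exist only for this illustration. The scalar builtin returns `int` whatever the operand width is, whereas a lane-wise form has to produce results as wide as its inputs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar form: 64-bit operand, int result (this is the builtin's signature). */
int popcount_scalar(uint64_t x)
{
    return __builtin_popcountll(x);
}

/* Lane-wise form: every 64-bit input lane yields a 64-bit output lane,
   mirroring the "both operands must have mode m" rule for vector modes. */
void popcount_lanes(uint64_t *dest, const uint64_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dest[i] = (uint64_t)popcount_scalar(src[i]);
}

/* Hypothetical self-check with hand-countable inputs. */
void check_popcount_lanes(void)
{
    const uint64_t src[4] = { 0, 1, 0xFFFFFFFFFFFFFFFFull,
                              0x8000000000000001ull };
    uint64_t dest[4];

    popcount_lanes(dest, src, 4);
    assert(dest[0] == 0);
    assert(dest[1] == 1);
    assert(dest[2] == 64);
    assert(dest[3] == 2);
}
```

The explicit cast in `popcount_lanes` is the int-to-64-bit conversion that the scalar builtin forces on the source loop; in the lane-wise view no width change is left.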
[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #8 from Uroš Bizjak ---
(In reply to Richard Biener from comment #4)
> What's missing is middle-end folding support to narrow popcount to the
> appropriate internal function call with byte/half-word width when target
> support
> is available.  But I'm quite sure there's no scalar popcount instruction
> operating on half-word or byte pieces of a GPR?

x86 has a popcnt form that operates on a 16-bit register.

https://www.felixcloutier.com/x86/popcnt
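For unsigned half-words, narrowing does not change the result: counting the bits of a zero-extended 16-bit value gives the same answer as counting its 16 bits directly, and the answer never exceeds 16. A minimal sketch, assuming a GCC-compatible compiler for `__builtin_popcount`; `popcount16_ref`, `popcount16_via_int` and `check_popcount16` are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Portable reference: look at the 16 bits one at a time. */
unsigned popcount16_ref(uint16_t x)
{
    unsigned n = 0;
    for (int bit = 0; bit < 16; bit++)
        n += (x >> bit) & 1u;
    return n;
}

/* Zero-extend to unsigned int, then use the int-width builtin. */
unsigned popcount16_via_int(uint16_t x)
{
    return (unsigned)__builtin_popcount((unsigned)x);
}

/* Exhaustive check over all 65536 half-word values. */
void check_popcount16(void)
{
    for (uint32_t v = 0; v <= 0xFFFFu; v++) {
        assert(popcount16_via_int((uint16_t)v) == popcount16_ref((uint16_t)v));
        assert(popcount16_ref((uint16_t)v) <= 16u);
    }
}
```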
[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

Thomas Koenig changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tkoenig at gcc dot gnu.org

--- Comment #7 from Thomas Koenig ---
Some literature: https://arxiv.org/pdf/1611.07612
[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #6 from Richard Biener ---
(In reply to Hongtao.liu from comment #5)
> (In reply to Richard Biener from comment #4)
> > What's missing is middle-end folding support to narrow popcount to the
> > appropriate internal function call with byte/half-word width when target
> > support
> > is available.  But I'm quite sure there's no scalar popcount instruction
> > operating on half-word or byte pieces of a GPR?
> >
> > Alternatively the vectorizer can use patterns to do this.
>
> Yes, but for 64bit width, vectorizer generate suboptimal code.
>
> sse #c3
>
>   vector(2) long long unsigned int vect__4.6;
>   vector(2) long long unsigned int vect__4.5;
>   vector(2) long long unsigned int _8;
>   vector(2) long long unsigned int _26;
>
> ...
> ...
>
>   _8 = .POPCOUNT (vect__4.5_16);
>   _26 = .POPCOUNT (vect__4.6_9);
>   vect__5.7_22 = VEC_PACK_TRUNC_EXPR <_8, _26>;  --- Why do we do this?
>   vector(4) int vect__5.7;
>
>
> It could generate directly
>
>  v4di = .POPCOUNT (v4di);

I guess that the vectorized popcount IFN is defined to be VnDI -> VnDI
but we want to have VnSImode results.  This means the instruction is
wrongly modeled in vectorized form?

Note the vectorizer isn't very good in handling narrowing operations here.
If you can push the missing patterns I can have a look.  Bonus points
for a correctness testcase (from the above I think we're generating
wrong code).
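A correctness check of the kind asked for here could look roughly like the following. It is a sketch only, not one of the testcases that were later committed: `fooq` is the loop from comment #3, while `popcount_ref` and `check_fooq` are hypothetical helpers. It assumes a GCC-compatible compiler; to exercise the vectorized path one would presumably build with something like `-O3 -mavx512vpopcntdq -mavx512vl` and run on hardware that supports those extensions.

```c
#include <assert.h>
#include <stddef.h>

/* Loop from comment #3, kept out of line so its own (possibly
   vectorized) body is what gets executed. */
__attribute__((noinline))
void fooq(unsigned long long *__restrict dest, unsigned long long *src)
{
    for (int i = 0; i != 4; i++)
        dest[i] = __builtin_popcountl(src[i]);
}

/* Bit-at-a-time reference.  It takes unsigned long because that is the
   parameter type of __builtin_popcountl used above. */
__attribute__((noinline))
unsigned long long popcount_ref(unsigned long x)
{
    unsigned long long n = 0;
    while (x) {
        n += x & 1ul;
        x >>= 1;
    }
    return n;
}

/* Compare every stored lane against the scalar reference. */
void check_fooq(void)
{
    unsigned long long src[4] = { 0ull, 1ull, 0xFFFFFFFFFFFFFFFFull,
                                  0x0123456789ABCDEFull };
    unsigned long long dest[4] = { ~0ull, ~0ull, ~0ull, ~0ull };

    fooq(dest, src);
    for (size_t i = 0; i < 4; i++)
        assert(dest[i] == popcount_ref((unsigned long)src[i]));
}
```

Pre-filling `dest` with all-ones means a lane that is never written, or written with stale upper bits, fails the comparison instead of passing by accident.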
[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #5 from Hongtao.liu ---
(In reply to Richard Biener from comment #4)
> What's missing is middle-end folding support to narrow popcount to the
> appropriate internal function call with byte/half-word width when target
> support
> is available.  But I'm quite sure there's no scalar popcount instruction
> operating on half-word or byte pieces of a GPR?
>
> Alternatively the vectorizer can use patterns to do this.

Yes, but for 64-bit width the vectorizer generates suboptimal code.

See #c3:

  vector(2) long long unsigned int vect__4.6;
  vector(2) long long unsigned int vect__4.5;
  vector(2) long long unsigned int _8;
  vector(2) long long unsigned int _26;

...
...

  _8 = .POPCOUNT (vect__4.5_16);
  _26 = .POPCOUNT (vect__4.6_9);
  vect__5.7_22 = VEC_PACK_TRUNC_EXPR <_8, _26>;  --- Why do we do this?
  vector(4) int vect__5.7;


It could directly generate

 v4di = .POPCOUNT (v4di);
[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2020-11-10
     Ever confirmed|0                           |1
             Blocks|                            |53947
             Status|UNCONFIRMED                 |NEW

--- Comment #4 from Richard Biener ---
What's missing is middle-end folding support to narrow popcount to the
appropriate internal function call with byte/half-word width when target
support
is available.  But I'm quite sure there's no scalar popcount instruction
operating on half-word or byte pieces of a GPR?

Alternatively the vectorizer can use patterns to do this.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #3 from Hongtao.liu ---
> But for vector byte/word/quadword, vectorizer still use vpopcntd, but not
> vpopcnt{b,w,q}, missing corresponding ifn?

We don't have __builtin_popcount{w,b}, but we do have __builtin_popcountl.

For the testcase
---
void
fooq(unsigned long long* __restrict dest, unsigned long long* src)
{
  for (int i = 0; i != 4; i++)
    dest[i] = __builtin_popcountl (src[i]);
}

icc/clang generate
---
_Z4fooqPxS_:                            # @_Z4fooqPxS_
        vpopcntq        ymm0, ymmword ptr [rsi]
        vmovdqu ymmword ptr [rdi], ymm0
        vzeroupper
        ret
---

But gcc generates
---
fooq:
.LFB0:
        .cfi_startproc
        vpopcntq        16(%rsi), %xmm1
        vpopcntq        (%rsi), %xmm0
        vshufps $136, %xmm1, %xmm0, %xmm0
        vpmovsxdq       %xmm0, %xmm1
        vpsrldq $8, %xmm0, %xmm0
        vpmovsxdq       %xmm0, %xmm0
        vmovdqu %xmm1, (%rdi)
        vmovdqu %xmm0, 16(%rdi)
        ret
        .cfi_endproc
---

dump for 164.vect
---
;; Function fooq (fooq, funcdef_no=0, decl_uid=4228, cgraph_uid=1,
symbol_order=0)

Merging blocks 2 and 6
fooq (long long unsigned int * restrict dest, long long unsigned int * src)
{
  vector(2) long long unsigned int * vectp_dest.10;
  vector(2) long long unsigned int * vectp_dest.9;
  vector(2) long long unsigned int vect__7.8;
  vector(4) int vect__5.7;
  vector(2) long long unsigned int vect__4.6;
  vector(2) long long unsigned int vect__4.5;
  vector(2) long long unsigned int * vectp_src.4;
  vector(2) long long unsigned int * vectp_src.3;
  int i;
  long unsigned int _1;
  long unsigned int _2;
  long long unsigned int * _3;
  long long unsigned int _4;
  int _5;
  long long unsigned int * _6;
  long long unsigned int _7;
  vector(2) long long unsigned int _8;
  vector(2) long long unsigned int _26;
  unsigned int ivtmp_30;
  unsigned int ivtmp_31;
  unsigned int ivtmp_36;
  unsigned int ivtmp_37;

  [local count: 214748368]:

  [local count: 214748371]:
  # i_18 = PHI
  # ivtmp_31 = PHI
  # vectp_src.3_20 = PHI
  # vectp_dest.9_24 = PHI
  # ivtmp_36 = PHI
  _1 = (long unsigned int) i_18;
  _2 = _1 * 8;
  _3 = src_11(D) + _2;
  vect__4.5_16 = MEM [(long long unsigned int *)vectp_src.3_20];
  vectp_src.3_15 = vectp_src.3_20 + 16;
  vect__4.6_9 = MEM [(long long unsigned int *)vectp_src.3_15];
  _4 = *_3;
  _8 = .POPCOUNT (vect__4.5_16);
  _26 = .POPCOUNT (vect__4.6_9);
  vect__5.7_22 = VEC_PACK_TRUNC_EXPR <_8, _26>;  --- Why do we do this?
  _5 = 0;
  _6 = dest_12(D) + _2;
  vect__7.8_23 = [vec_unpack_lo_expr] vect__5.7_22;
  vect__7.8_25 = [vec_unpack_hi_expr] vect__5.7_22;
  _7 = (long long unsigned int) _5;
  MEM [(long long unsigned int *)vectp_dest.9_24] = vect__7.8_23;
  vectp_dest.9_34 = vectp_dest.9_24 + 16;
  MEM [(long long unsigned int *)vectp_dest.9_34] = vect__7.8_25;
  i_14 = i_18 + 1;
  ivtmp_30 = ivtmp_31 - 1;
  vectp_src.3_17 = vectp_src.3_15 + 16;
  vectp_dest.9_32 = vectp_dest.9_34 + 16;
  ivtmp_37 = ivtmp_36 + 1;
  if (ivtmp_37 < 1)
    goto ; [0.00%]
  else
    goto ; [100.00%]

  [local count: 0]:
  goto ; [100.00%]

  [local count: 214748368]:
  return;

}
---
[Bug target/97770] [ICELAKE]Missing vectorization for vpopcnt
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

--- Comment #2 from Hongtao.liu ---
After adding the expander, the loop is vectorized successfully.
---
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index b153a87fb98..e8159997c40 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -22678,6 +22678,12 @@ (define_insn "avx5124vnniw_vp4dpwssds_maskz"
    (set_attr ("prefix") ("evex"))
    (set_attr ("mode") ("TI"))])

+(define_expand "popcount<mode>2"
+  [(set (match_operand:VI48_AVX512VL 0 "register_operand")
+       (popcount:VI48_AVX512VL
+         (match_operand:VI48_AVX512VL 1 "nonimmediate_operand")))]
+  "TARGET_AVX512VPOPCNTDQ")
+
 (define_insn "vpopcount"
   [(set (match_operand:VI48_AVX512VL 0 "register_operand" "=v")
        (popcount:VI48_AVX512VL
@@ -22722,6 +22728,12 @@ (define_insn "*restore_multiple_leave_return"
   "TARGET_SSE && TARGET_64BIT"
   "jmp\t%P1")

+(define_insn "popcount<mode>2"
+  [(set (match_operand:VI12_AVX512VL 0 "register_operand" "=v")
+       (popcount:VI12_AVX512VL
+         (match_operand:VI12_AVX512VL 1 "nonimmediate_operand" "vm")))]
+  "TARGET_AVX512BITALG")
+
 (define_insn "vpopcount"
   [(set (match_operand:VI12_AVX512VL 0 "register_operand" "=v")
        (popcount:VI12_AVX512VL
---

But for vector byte/word/quadword, the vectorizer still uses vpopcntd rather
than vpopcnt{b,w,q}. Is a corresponding ifn missing?

void
fooq(long long* __restrict dest, long long* src)
{
  for (int i = 0; i != 4; i++)
    dest[i] = __builtin_popcount (src[i]);
}

void
foow(short* __restrict dest, short* src)
{
  for (int i = 0; i != 16; i++)
    dest[i] = __builtin_popcount (src[i]);
}

void
foob(char* __restrict dest, char* src)
{
  for (int i = 0; i != 32; i++)
    dest[i] = __builtin_popcount (src[i]);
}

dump of test.c.164.vect

;; Function foow (foow, funcdef_no=0, decl_uid=4228, cgraph_uid=1,
symbol_order=0)

Merging blocks 2 and 6
foow (short int * restrict dest, short int * src)
{
  vector(8) short int * vectp_dest.10;
  vector(8) short int * vectp_dest.9;
  vector(8) short int vect__8.8;
  vector(4) int vect__6.7;
  vector(4) unsigned int vect__5.6;
  vector(8) short int vect__4.5;
  vector(8) short int * vectp_src.4;
  vector(8) short int * vectp_src.3;
  int i;
  long unsigned int _1;
  long unsigned int _2;
  short int * _3;
  short int _4;
  unsigned int _5;
  int _6;
  short int * _7;
  short int _8;
  unsigned int ivtmp_26;
  unsigned int ivtmp_28;
  unsigned int ivtmp_34;
  unsigned int ivtmp_35;

  [local count: 119292720]:

  [local count: 119292719]:
  # i_19 = PHI
  # ivtmp_35 = PHI
  # vectp_src.3_24 = PHI
  # vectp_dest.9_9 = PHI
  # ivtmp_26 = PHI
  _1 = (long unsigned int) i_19;
  _2 = _1 * 2;
  _3 = src_12(D) + _2;
  vect__4.5_22 = MEM [(short int *)vectp_src.3_24];
  _4 = *_3;
  vect__5.6_21 = [vec_unpack_lo_expr] vect__4.5_22;
  vect__5.6_18 = [vec_unpack_hi_expr] vect__4.5_22;
  _5 = (unsigned int) _4;
  vect__6.7_17 = .POPCOUNT (vect__5.6_21);
  vect__6.7_16 = .POPCOUNT (vect__5.6_18);
  _6 = 0;
  _7 = dest_13(D) + _2;
  vect__8.8_10 = VEC_PACK_TRUNC_EXPR ;
  _8 = (short int) _6;
  MEM [(short int *)vectp_dest.9_9] = vect__8.8_10;
  i_15 = i_19 + 1;
  ivtmp_34 = ivtmp_35 - 1;
  vectp_src.3_23 = vectp_src.3_24 + 16;
  vectp_dest.9_29 = vectp_dest.9_9 + 16;
  ivtmp_28 = ivtmp_26 + 1;
  if (ivtmp_28 < 1)
    goto ; [0.00%]
  else
    goto ; [100.00%]

  [local count: 0]:
  goto ; [100.00%]

  [local count: 119292720]:
  return;

}
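One C-level detail in the foow/foob testcases is worth spelling out, since any narrowing to vpopcnt{b,w} would have to respect it. `__builtin_popcount` takes an `unsigned int`, so a signed `short` or `char` element is sign-extended before its bits are counted (plain `char` is signed by default with GCC on x86). The sketch below shows only the language semantics and says nothing about what the vectorizer actually emits; the helper names are hypothetical, and the value 32 assumes a 32-bit `int`.

```c
#include <assert.h>

/* A negative byte is sign-extended to int, then converted to unsigned
   int, so all the copies of the sign bit are counted as well. */
int popcount_signed_byte(signed char c)
{
    return __builtin_popcount((unsigned int)c);
}

/* An unsigned byte is zero-extended, so only its own 8 bits count. */
int popcount_unsigned_byte(unsigned char c)
{
    return __builtin_popcount((unsigned int)c);
}

void check_byte_popcount(void)
{
    assert(popcount_unsigned_byte(0xFF) == 8);
    assert(popcount_signed_byte(-1) == 32);   /* assumes 32-bit int */
    assert(popcount_signed_byte(0x7F) == 7);  /* non-negative: same as unsigned */
}
```

A byte-wide popcount of the lane holding -1 would give 8 rather than 32, so for negative signed elements the int-width and element-width counts differ.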