[Bug rtl-optimization/110202] _mm512_ternarylogic_epi64 generates unnecessary operations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #10 from CVS Commits ---
The master branch has been updated by hongtao Liu:

https://gcc.gnu.org/g:13c556d6ae84be3ee2bc245a56eafa58221de86a

commit r14-2447-g13c556d6ae84be3ee2bc245a56eafa58221de86a
Author: liuhongt
Date:   Thu Jun 29 14:25:28 2023 +0800

    Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'

    False dependency happens when destination is only updated by
    pternlog. There is no false dependency when destination is also used
    in source. So either a pxor should be inserted, or input operand
    should be set with constraint '0'.

    gcc/ChangeLog:

            PR target/110438
            PR target/110202
            * config/i386/predicates.md
            (int_float_vector_all_ones_operand): New predicate.
            * config/i386/sse.md (*vmov_constm1_pternlog_false_dep): New
            define_insn.
            (*_cvtmask2_pternlog_false_dep): Ditto.
            (*_cvtmask2_pternlog_false_dep): Ditto.
            (*_cvtmask2): Adjust to define_insn_and_split to avoid
            false dependence.
            (*_cvtmask2): Ditto.
            (one_cmpl2): Adjust constraint of operands 1 to '0' to
            avoid false dependence.
            (*andnot3): Ditto.
            (iornot3): Ditto.
            (*3): Ditto.

    gcc/testsuite/ChangeLog:

            * gcc.target/i386/pr110438.c: New test.
            * gcc.target/i386/pr100711-6.c: Adjust testcase.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #9 from Alexander Monakov ---
(In reply to Hongtao.liu from comment #8)
>
> For this one, we can load *a into %zmm0 to avoid false_dependence.
>
> vmovdqau ZMMWORD PTR [rdi], zmm0
> vpternlogq zmm0, zmm0, zmm0, 85

Yes, since ternlog with memory operand needs two fused-domain uops on Intel
CPUs, breaking out the load would be more efficient for both negate1 and
negate2.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #8 from Hongtao.liu ---
(In reply to Alexander Monakov from comment #7)
> Note that vpxor serves as a dependency-breaking instruction (see PR 110438).
> So in negate1 we do the right thing for the wrong reasons, and in negate2 we
> can cause a substantial stall if the previous computation of xmm0 has a
> non-trivial dependency chain.

For this one, we can load *a into %zmm0 to avoid false_dependence.

vmovdqau ZMMWORD PTR [rdi], zmm0
vpternlogq zmm0, zmm0, zmm0, 85
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #7 from Alexander Monakov ---
Note that vpxor serves as a dependency-breaking instruction (see PR 110438).
So in negate1 we do the right thing for the wrong reasons, and in negate2 we
can cause a substantial stall if the previous computation of xmm0 has a
non-trivial dependency chain.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

Alexander Monakov changed:

           What    |Removed                     |Added
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #6 from Alexander Monakov ---
(In reply to Jakub Jelinek from comment #3)
> And I must say I don't immediately see easy rules how to find out from the
> immediate value which set is which, so unless we find some easy rule for
> that, we'd need to hardcode the mapping between the 256 values to a bitmask
> which inputs are actually used.

Well, that's really easy. The immediate is just an eight-entry look-up table
from any possible input bit triple to the output bit. The leftmost operand
corresponds to the most significant bit in the triple, so to check if the
operation vpternlog(A, B, C, I) is invariant w.r.t. A you check if the two
nibbles of I are equal. Here we have 0x55, equal nibbles, and the operation is
invariant w.r.t. A.

Similarly, to check if it's invariant w.r.t. B we check if two-bit groups in I
come in pairs, or in code: (I & 0x33) == ((I >> 2) & 0x33). For 0x55 both
sides evaluate to 0x11, so again, invariant w.r.t. B.

Finally, checking invariance w.r.t. C is (I & 0x55) == ((I >> 1) & 0x55).
For 0x55 the left side is 0x55 and the right side is 0x00, so the operation
does depend on C.
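The three checks described in comment #6 can be collected into one helper that
maps an immediate to the set of operands it really reads. This is only a
sketch: the function name ternlog_used_operands and the TERNLOG_USES_* flags
are hypothetical names chosen for illustration, not identifiers from GCC.

```c
#include <assert.h>

/* Hypothetical flag names for illustration; not GCC identifiers. */
enum { TERNLOG_USES_A = 4, TERNLOG_USES_B = 2, TERNLOG_USES_C = 1 };

/* Return a bitmask of the operands that vpternlog(A, B, C, imm)
   actually depends on, treating imm as an eight-entry look-up table
   indexed by the bit triple (A, B, C) with A as the most significant bit. */
static unsigned ternlog_used_operands(unsigned imm)
{
    unsigned used = 0;

    assert(imm <= 0xFFu);

    /* A selects between the high and the low nibble of imm. */
    if ((imm & 0x0Fu) != (imm >> 4))
        used |= TERNLOG_USES_A;

    /* B selects between adjacent two-bit groups. */
    if ((imm & 0x33u) != ((imm >> 2) & 0x33u))
        used |= TERNLOG_USES_B;

    /* C selects between adjacent single bits. */
    if ((imm & 0x55u) != ((imm >> 1) & 0x55u))
        used |= TERNLOG_USES_C;

    return used;
}
```

With this sketch, ternlog_used_operands(0x55) yields TERNLOG_USES_C only, while
an immediate such as 0x96 (three-way xor) reports all three operands as used.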
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

--- Comment #5 from Fabio Cannizzo ---
> Well, there is nothing magic on exactly 0x55 immediate, there are 256
> possible immediates, most of them use all of A, B, C, some of them use just
> A, B, others just B, C, others just A, C, others just A, others just B,
> others just C, others none of them.

Indeed I meant 0x55 just as an example.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

Andrew Pinski changed:

           What    |Removed                     |Added
   Last reconfirmed|                            |2023-06-10
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #4 from Andrew Pinski ---
(In reply to Jakub Jelinek from comment #3)
> Well, there is nothing magic on exactly 0x55 immediate, there are 256
> possible immediates, most of them use all of A, B, C, some of them use just
> A, B, others just B, C, others just A, C, others just A, others just B,
> others just C, others none of them.
> And I must say I don't immediately see easy rules how to find out from the
> immediate value which set is which, so unless we find some easy rule for
> that, we'd need to hardcode the mapping between the 256 values to a bitmask
> which inputs are actually used.
> And then the question is how to represent that in RTL to make it clear that
> some operands are mentioned but their value isn't really used.

In the case of 0x55, an idea might be to split (or expand) it into how ~ is
represented. That is:

(insn:TI 6 3 12 2 (set (reg:V8DI 20 xmm0 [85])
        (xor:V8DI (mem:V8DI (reg/v/f:DI 5 di [orig:84 a ] [84]) [0 *a_3(D)+0 S64 A512])
            (const_vector:V8DI [
                    (const_int -1 [0x]) repeated x8
                ]))) "/app/example.cpp":21:14 6764 {*one_cmplv8di2}
     (expr_list:REG_DEAD (reg/v/f:DI 5 di [orig:84 a ] [84])
        (nil)))
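To see why immediate 0x55 can be rewritten as a plain one's complement (an xor
with all-ones, as in the RTL above), a scalar model of a single 64-bit lane may
help. This is a sketch under stated assumptions: ternlog64 and one_cmpl64 are
hypothetical helper names, and the loop models the documented per-bit look-up
semantics of vpternlogq rather than anything GCC does internally.

```c
#include <stdint.h>

/* Scalar model of one 64-bit lane of vpternlogq: for every bit position,
   the bits of a, b and c form a 3-bit index (a is the most significant
   bit) into the 8-bit immediate, and the selected bit is the result. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, unsigned imm)
{
    uint64_t r = 0;

    for (int i = 0; i < 64; i++) {
        unsigned idx = (unsigned)((((a >> i) & 1) << 2)
                                  | (((b >> i) & 1) << 1)
                                  | ((c >> i) & 1));
        r |= (uint64_t)((imm >> idx) & 1u) << i;
    }
    return r;
}

/* One's complement written the way the one_cmpl pattern expresses it:
   xor with an all-ones constant. */
static uint64_t one_cmpl64(uint64_t c)
{
    return c ^ UINT64_MAX;
}
```

For any a and b, ternlog64(a, b, c, 0x55) equals one_cmpl64(c), which is the
property that would let the first two operands be dropped from the
representation.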
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

Jakub Jelinek changed:

           What    |Removed                     |Added
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek ---
Well, there is nothing magic on exactly 0x55 immediate, there are 256 possible
immediates, most of them use all of A, B, C, some of them use just A, B,
others just B, C, others just A, C, others just A, others just B, others just
C, others none of them.

And I must say I don't immediately see easy rules how to find out from the
immediate value which set is which, so unless we find some easy rule for that,
we'd need to hardcode the mapping between the 256 values to a bitmask which
inputs are actually used.

And then the question is how to represent that in RTL to make it clear that
some operands are mentioned but their value isn't really used.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110202

Andrew Pinski changed:

           What    |Removed                     |Added
          Component|target                      |rtl-optimization
           Severity|normal                      |enhancement

--- Comment #2 from Andrew Pinski ---
Note you get a warning in your negate1 case:

In function '__m512i negate1(const __m512i*)':
:7:36: warning: 'res' is used uninitialized [-Wuninitialized]
    7 |     res = _mm512_ternarylogic_epi64(res, res, *a, 0x55);
      |           ~^~~~
:6:13: note: 'res' was declared here
    6 |     __m512i res;
      |             ^~~

But even doing this:

__m512i negate1(const __m512i *a)
{
    __m512i res = _mm512_undefined_si512 ();
    res = _mm512_ternarylogic_epi64(res, res, *a, 0x55);
    return res;
}

will cause an extra zeroing.