[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 Richard Biener changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #26 from Richard Biener --- Let's declare this generic bug fixed. Specific unhandled cases on x86 should get a new bug with a specific testcase.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #25 from CVS Commits --- The master branch has been updated by Uros Bizjak : https://gcc.gnu.org/g:f65878178ab05180a5937f11f8fdb755678a82ce commit r12-2085-gf65878178ab05180a5937f11f8fdb755678a82ce Author: Uros Bizjak Date: Tue Jul 6 19:27:34 2021 +0200 i386: Add variable vec_set for 32bit vectors [PR97194] To generate sane code a SSE4.1 variable PBLENDV instruction is needed. Also enable variable vec_set through vec_setm_operand predicate for TARGET_SSE4_1 instead of TARGET_AVX2. ix86_expand_vector_init_duplicate is able to emulate vpbroadcast{b,w} with pxor/pshufb. 2021-07-06 Uroš Bizjak gcc/ PR target/97194 * config/i386/predicates.md (vec_setm_operand): Enable register_operand for TARGET_SSE4_1. * config/i386/mmx.md (vec_setv2hi): Use vec_setm_operand as operand 2 predicate. Call ix86_expand_vector_set_var for non-constant index operand. (vec_setv4qi): Use vec_setm_mmx_operand as operand 2 predicate. Call ix86_expand_vector_set_var for non-constant index operand. gcc/testsuite/ PR target/97194 * gcc.target/i386/sse4_1-vec-set-1a.c: New test. * gcc.target/i386/sse4_1-vec-set-2a.c: Ditto.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #24 from CVS Commits --- The master branch has been updated by Uros Bizjak : https://gcc.gnu.org/g:20a2c8ace0ab56c147fd995432abd5e7cf89b0e3 commit r12-1563-g20a2c8ace0ab56c147fd995432abd5e7cf89b0e3 Author: Uros Bizjak Date: Thu Jun 17 15:19:12 2021 +0200 i386: Add variable vec_set for 64bit vectors [PR97194] To generate sane code a SSE4.1 variable PBLENDV instruction is needed. 2021-06-17 Uroš Bizjak gcc/ PR target/97194 * config/i386/i386-expand.c (expand_vector_set_var): Handle V2FS mode remapping. Pass TARGET_MMX_WITH_SSE to ix86_expand_vector_init_duplicate. (ix86_expand_vector_init_duplicate): Emit insv_1 for QImode for !TARGET_PARTIAL_REG_STALL. * config/i386/predicates.md (vec_setm_mmx_operand): New predicate. * config/i386/mmx.md (vec_setv2sf): Use vec_setm_mmx_operand as operand 2 predicate. Call ix86_expand_vector_set_var for non-constant index operand. (vec_setv2si): Ditto. (vec_setv4hi): Ditto. (vec_setv8qi): ditto. gcc/testsuite/ PR target/97194 * gcc.target/i386/sse4_1-vec-set-1.c: New test. * gcc.target/i386/sse4_1-vec-set-2.c: ditto.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #23 from Hongtao.liu --- Fixed in GCC11, may need a bit adjustment for the modeless operand(the variable index) as dicussed in https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559213.html
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #22 from CVS Commits --- The master branch has been updated by hongtao Liu : https://gcc.gnu.org/g:287cc750b0887e86cb309d976b17c7ee95f7ad48 commit r11-5074-g287cc750b0887e86cb309d976b17c7ee95f7ad48 Author: liuhongt Date: Mon Oct 19 16:04:39 2020 +0800 Support variable index vec_set. gcc/ChangeLog: PR target/97194 * config/i386/i386-expand.c (ix86_expand_vector_set_var): New function. * config/i386/i386-protos.h (ix86_expand_vector_set_var): New Decl. * config/i386/predicates.md (vec_setm_operand): New predicate, true for const_int_operand or register_operand under TARGET_AVX2. * config/i386/sse.md (vec_set): Support both constant and variable index vec_set. gcc/testsuite/ChangeLog: * gcc.target/i386/avx2-vec-set-1.c: New test. * gcc.target/i386/avx2-vec-set-2.c: New test. * gcc.target/i386/avx512bw-vec-set-1.c: New test. * gcc.target/i386/avx512bw-vec-set-2.c: New test. * gcc.target/i386/avx512f-vec-set-2.c: New test. * gcc.target/i386/avx512vl-vec-set-2.c: New test.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #21 from rguenther at suse dot de --- On Fri, 16 Oct 2020, crazylht at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 > > --- Comment #20 from Hongtao.liu --- > (In reply to Richard Biener from comment #18) > > (In reply to Hongtao.liu from comment #16) > > > (In reply to Hongtao.liu from comment #15) > > > > I'm working on add the expander, i encounter a problem. > > > > > > > > for V32HI vec_set with constant index, the expander existed under > > > > TARGET_AVX512F, but for variable index, the expander should be existed > > > > under > > > > TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we > > > > need > > > > to restricted the expander under TARGET_AVX512BW, unfortunately > > > > operands is > > > > unvisible in condition scope, any solution to handle such issue? > > > > > > Or break V32HI into V16HI and V8HI when TARGET_AVX512BW is not existed. > > > > That sounds like a good implementation strathegy, I suppose AVX512F implies > > AVX2 even when AVX512VL is not available. > > Yes, but not sure performance impact, maybe better generate code "like" > expander not existed. I'm not so sure - the STLF penalty is quite large and the AVX2 code should only require an extra vextract + vpcmp + vinsert (and the extra vpcmp uses another constant pool entry). The broadcasts and blend only require AVX512F.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #20 from Hongtao.liu --- (In reply to Richard Biener from comment #18) > (In reply to Hongtao.liu from comment #16) > > (In reply to Hongtao.liu from comment #15) > > > I'm working on add the expander, i encounter a problem. > > > > > > for V32HI vec_set with constant index, the expander existed under > > > TARGET_AVX512F, but for variable index, the expander should be existed > > > under > > > TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need > > > to restricted the expander under TARGET_AVX512BW, unfortunately operands > > > is > > > unvisible in condition scope, any solution to handle such issue? > > > > Or break V32HI into V16HI and V8HI when TARGET_AVX512BW is not existed. > > That sounds like a good implementation strathegy, I suppose AVX512F implies > AVX2 even when AVX512VL is not available. Yes, but not sure performance impact, maybe better generate code "like" expander not existed.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #19 from Hongtao.liu --- (In reply to Richard Biener from comment #17) > (In reply to Hongtao.liu from comment #15) > > I'm working on add the expander, i encounter a problem. > > > > for V32HI vec_set with constant index, the expander existed under > > TARGET_AVX512F, but for variable index, the expander should be existed under > > TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need > > to restricted the expander under TARGET_AVX512BW, unfortunately operands is > > unvisible in condition scope, any solution to handle such issue? > > can_vec_set_var_idx_p checks insn_operand_matches thus the operand predicate > is what you should adjust I think, sth like > > "const_int_or_reg_for_vec_set_operand" > > or do you mean you do not see the mode of the vector inside the predicate? > Need to sse vector mode inside the predicate. > In that case I think you need to split the pattern? Or think of a way > to implement byte/word insert with just AVVX512F ...
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #18 from Richard Biener --- (In reply to Hongtao.liu from comment #16) > (In reply to Hongtao.liu from comment #15) > > I'm working on add the expander, i encounter a problem. > > > > for V32HI vec_set with constant index, the expander existed under > > TARGET_AVX512F, but for variable index, the expander should be existed under > > TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need > > to restricted the expander under TARGET_AVX512BW, unfortunately operands is > > unvisible in condition scope, any solution to handle such issue? > > Or break V32HI into V16HI and V8HI when TARGET_AVX512BW is not existed. That sounds like a good implementation strathegy, I suppose AVX512F implies AVX2 even when AVX512VL is not available.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #17 from Richard Biener --- (In reply to Hongtao.liu from comment #15) > I'm working on add the expander, i encounter a problem. > > for V32HI vec_set with constant index, the expander existed under > TARGET_AVX512F, but for variable index, the expander should be existed under > TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need > to restricted the expander under TARGET_AVX512BW, unfortunately operands is > unvisible in condition scope, any solution to handle such issue? can_vec_set_var_idx_p checks insn_operand_matches thus the operand predicate is what you should adjust I think, sth like "const_int_or_reg_for_vec_set_operand" or do you mean you do not see the mode of the vector inside the predicate? In that case I think you need to split the pattern? Or think of a way to implement byte/word insert with just AVVX512F ...
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #16 from Hongtao.liu --- (In reply to Hongtao.liu from comment #15) > I'm working on add the expander, i encounter a problem. > > for V32HI vec_set with constant index, the expander existed under > TARGET_AVX512F, but for variable index, the expander should be existed under > TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need > to restricted the expander under TARGET_AVX512BW, unfortunately operands is > unvisible in condition scope, any solution to handle such issue? Or break V32HI into V16HI and V8HI when TARGET_AVX512BW is not existed.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #15 from Hongtao.liu --- I'm working on add the expander, i encounter a problem. for V32HI vec_set with constant index, the expander existed under TARGET_AVX512F, but for variable index, the expander should be existed under TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need to restricted the expander under TARGET_AVX512BW, unfortunately operands is unvisible in condition scope, any solution to handle such issue?
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #14 from Alexander Monakov --- I see, there are more weaknesses than I thought. For CSE (or rather fwprop?) I was thinking about a simpler case where the extracted-from value is loaded from memory, but even in trivial cases RTL optimizers cannot clean it up today (so it wouldn't get any better with separate temporaries): #define N 16 typedef int T; typedef T V __attribute__((vector_size(N))); T f(V *px, long i) { V x = *px; return x[i]; } f: movdqa (%rdi), %xmm0 movaps %xmm0, -24(%rsp) movl-24(%rsp,%rsi,4), %eax ret
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #13 from Richard Biener --- (In reply to Richard Biener from comment #12) > (In reply to Alexander Monakov from comment #11) > > Yeah, for inserts such tactic would be inappropriate due to bad store > > forwarding stalls anyway. As you've shown in earlier comments, inserts have > > a very nice generic way to expand them (that does not touch stack). > > Unfortunately it doesn't work (the CSE). Patch: > > diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c > index 1eaa1da11b9..f7b1a92dd95 100644 > --- a/gcc/cfgexpand.c > +++ b/gcc/cfgexpand.c > @@ -6102,7 +6102,11 @@ discover_nonconstant_array_refs_r (tree * tp, int > *walk_subtrees, > || CONVERT_EXPR_P (t)) > t = TREE_OPERAND (t, 0); > > - if (TREE_CODE (t) == ARRAY_REF || TREE_CODE (t) == ARRAY_RANGE_REF) > + if ((TREE_CODE (t) == ARRAY_REF > + && !(TREE_CODE (TREE_OPERAND (t, 0)) == VIEW_CONVERT_EXPR > + && DECL_P (TREE_OPERAND (TREE_OPERAND (t, 0), 0))) > + && VECTOR_TYPE_P (TREE_TYPE (TREE_OPERAND (TREE_OPERAND (t, > 0), 0 > + || TREE_CODE (t) == ARRAY_RANGE_REF) > { > t = get_base_address (t); > if (t && DECL_P (t) > > > and for > > typedef int v4si __attribute__((vector_size(16))); > > int foo (v4si v, int i) > { > v = v + v; > return v[i] + v[2*i]; > } > > at -O2 we get > > foo: > .LFB0: > .cfi_startproc > leal(%rdi,%rdi), %edx > paddd %xmm0, %xmm0 > movslq %edi, %rdi > movslq %edx, %rdx > movaps %xmm0, -24(%rsp) > movaps %xmm0, -40(%rsp) > movl-40(%rsp,%rdi,4), %eax > addl-24(%rsp,%rdx,4), %eax > ret and unpatched foo: .LFB0: .cfi_startproc leal(%rdi,%rdi), %edx paddd %xmm0, %xmm0 movslq %edi, %rdi movslq %edx, %rdx movaps %xmm0, -24(%rsp) movl-24(%rsp,%rdi,4), %eax addl-24(%rsp,%rdx,4), %eax ret so we're able to elide the stack slot usage for the add and retain a single slot.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #12 from Richard Biener --- (In reply to Alexander Monakov from comment #11) > Yeah, for inserts such tactic would be inappropriate due to bad store > forwarding stalls anyway. As you've shown in earlier comments, inserts have > a very nice generic way to expand them (that does not touch stack). Unfortunately it doesn't work (the CSE). Patch: diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c index 1eaa1da11b9..f7b1a92dd95 100644 --- a/gcc/cfgexpand.c +++ b/gcc/cfgexpand.c @@ -6102,7 +6102,11 @@ discover_nonconstant_array_refs_r (tree * tp, int *walk_subtrees, || CONVERT_EXPR_P (t)) t = TREE_OPERAND (t, 0); - if (TREE_CODE (t) == ARRAY_REF || TREE_CODE (t) == ARRAY_RANGE_REF) + if ((TREE_CODE (t) == ARRAY_REF + && !(TREE_CODE (TREE_OPERAND (t, 0)) == VIEW_CONVERT_EXPR + && DECL_P (TREE_OPERAND (TREE_OPERAND (t, 0), 0))) + && VECTOR_TYPE_P (TREE_TYPE (TREE_OPERAND (TREE_OPERAND (t, 0), 0 + || TREE_CODE (t) == ARRAY_RANGE_REF) { t = get_base_address (t); if (t && DECL_P (t) and for typedef int v4si __attribute__((vector_size(16))); int foo (v4si v, int i) { v = v + v; return v[i] + v[2*i]; } at -O2 we get foo: .LFB0: .cfi_startproc leal(%rdi,%rdi), %edx paddd %xmm0, %xmm0 movslq %edi, %rdi movslq %edx, %rdx movaps %xmm0, -24(%rsp) movaps %xmm0, -40(%rsp) movl-40(%rsp,%rdi,4), %eax addl-24(%rsp,%rdx,4), %eax ret we likely also not get rid of the stack allocation. Maybe it's due to the way expand does the temporary spill, not ending its lifetime, not sure. We're definitely not "remembering" the spill slot used for 'v' and do not re-use it, there's no mechanism for that IIRC. At least we don't ICE for the specific case of vectors. We're running into /* If we have either an offset, a BLKmode result, or a reference outside the underlying object, we must force it to memory. Such a case can occur in Ada if we have unchecked conversion of an expression from a scalar type to an aggregate type or for an ARRAY_RANGE_REF whose type is BLKmode, or if we were passed a partially uninitialized object or a view-conversion to a larger size. */ must_force_mem = (offset || mode1 == BLKmode || (mode == BLKmode && !int_mode_for_size (bitsize, 1).exists ()) || maybe_gt (bitpos + bitsize, GET_MODE_BITSIZE (mode2))); where 'offset' is MULT_EXPR and we've sofar expanded 'v' to op0 = (reg/v:V4SI 88 [ v ]) and then /* Otherwise, if this is a constant or the object is not in memory and need be, put it there. */ else if (CONSTANT_P (op0) || (!MEM_P (op0) && must_force_mem)) { memloc = assign_temp (TREE_TYPE (tem), 1, 1); emit_move_insn (memloc, op0); op0 = memloc; clear_mem_expr = true; }
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #11 from Alexander Monakov --- Yeah, for inserts such tactic would be inappropriate due to bad store forwarding stalls anyway. As you've shown in earlier comments, inserts have a very nice generic way to expand them (that does not touch stack).
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #10 from rguenther at suse dot de --- On Mon, 28 Sep 2020, amonakov at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 > > --- Comment #9 from Alexander Monakov --- > (In reply to Richard Biener from comment #8) > > Note that currently RTL expansion forces a local vector typed variable > > to the stack (instead of allocating a pseudo) when there are > > variable-index accesses to it. That might be a reason to also handle > > slightly "expensive" extract cases. But I guess later falling back > > to a stack slot via a splitter or LRA will lead to worse code. > > Indeed, but I struggle to see a good reason to bind the entire lifetime of a > variable to memory just because one operation requires that. Cannot GCC > instead > create a fresh temporary early at RTL-expand (not split) time for each extract > operation, letting the original variable live in a pseudo, and binding only > that short-lived temporary to memory? > > It can result in extra copies if the temporary needs to be loaded from memory > anyway, but I think passes like RTL CSE should be able to propagate them. Sure, that would be possible. We do this kind of things for CONCAT (complex numbers) with handling some select cases and falling back to spilling. But we've backed off for more general handling because it ICEd too many cases. For inserts we don't want to do this since I'm quite positive that RTL wouldn't be able to merge two spill slots when doing two consecutive inserts.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #9 from Alexander Monakov --- (In reply to Richard Biener from comment #8) > Note that currently RTL expansion forces a local vector typed variable > to the stack (instead of allocating a pseudo) when there are > variable-index accesses to it. That might be a reason to also handle > slightly "expensive" extract cases. But I guess later falling back > to a stack slot via a splitter or LRA will lead to worse code. Indeed, but I struggle to see a good reason to bind the entire lifetime of a variable to memory just because one operation requires that. Cannot GCC instead create a fresh temporary early at RTL-expand (not split) time for each extract operation, letting the original variable live in a pseudo, and binding only that short-lived temporary to memory? It can result in extra copies if the temporary needs to be loaded from memory anyway, but I think passes like RTL CSE should be able to propagate them.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #8 from Richard Biener --- (In reply to Alexander Monakov from comment #7) > FWIW, Peter Cordes provides an overview of available approaches for > extraction depending on vector length and ISA extensions (up to AVX2, not > including AVX-512) in this StackOverflow answer: > https://stackoverflow.com/a/51414330/4755075 > > TL;DR: generally through store+load; possible alternatives: > 128b: > SSSE3: pshufb (1-byte elements) > SSSE3: imul+add+pshufb (any element size) > AVX: vpermilp[sd] (4 or 8-byte elements) > 256b: > AVX2: vpermps (4-byte elements) > > In all cases a (v)movd is needed to move the index to a vector register, and > potentially another (v)movd if the result is needed in a general register. > > The basic store+load tactic may look worse latency-wise, but can be better > throughput-wise (especially with multiple extractions from the same vector, > as then the store needs to be done just once, as Peter mentioned). > > Why in RTL it is important to do this without referencing the stack? For extraction it isn't absolutely required to do this w/o the stack since the spill would cover the whole vector and the reads can be usually handled with store-forwarding in the CPUs. So here this can be fully based on cost. The insert case is instead very bad here with a whole-vector store followed by an element store and then a whole-vector load. This sequence will usually cause at least additional latency or worse recovering from a bad store-forwarding. Note that currently RTL expansion forces a local vector typed variable to the stack (instead of allocating a pseudo) when there are variable-index accesses to it. That might be a reason to also handle slightly "expensive" extract cases. But I guess later falling back to a stack slot via a splitter or LRA will lead to worse code.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 Alexander Monakov changed: What|Removed |Added CC||amonakov at gcc dot gnu.org --- Comment #7 from Alexander Monakov --- FWIW, Peter Cordes provides an overview of available approaches for extraction depending on vector length and ISA extensions (up to AVX2, not including AVX-512) in this StackOverflow answer: https://stackoverflow.com/a/51414330/4755075 TL;DR: generally through store+load; possible alternatives: 128b: SSSE3: pshufb (1-byte elements) SSSE3: imul+add+pshufb (any element size) AVX: vpermilp[sd] (4 or 8-byte elements) 256b: AVX2: vpermps (4-byte elements) In all cases a (v)movd is needed to move the index to a vector register, and potentially another (v)movd if the result is needed in a general register. The basic store+load tactic may look worse latency-wise, but can be better throughput-wise (especially with multiple extractions from the same vector, as then the store needs to be done just once, as Peter mentioned). Why in RTL it is important to do this without referencing the stack?
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 Richard Biener changed: What|Removed |Added Last reconfirmed||2020-09-28 Ever confirmed|0 |1 Status|UNCONFIRMED |NEW --- Comment #6 from Richard Biener --- (In reply to Hongtao.liu from comment #5) > > > There's ongoing patch iteration on the ml adding variable index vec_set > > > expanders for powerpc (and the related middle-end changes). The question > > > is whether optabs can try many things or the target should have the choice > > > (probably better). > > > > > The patch has been on trunk, will you add expander? I hope x86 maintainers will beat me to it.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #5 from Hongtao.liu --- > > There's ongoing patch iteration on the ml adding variable index vec_set > > expanders for powerpc (and the related middle-end changes). The question > > is whether optabs can try many things or the target should have the choice > > (probably better). > > The patch has been on trunk, will you add expander?
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #4 from Richard Biener --- (In reply to Richard Biener from comment #2) > So for set with T == int and N == 32 we could generate > > vmovd %edi, %xmm1 > vpbroadcastd%xmm1, %ymm1 > vpcmpeqd.LC0(%rip), %ymm1, %ymm2 > vpblendvb %ymm2, %ymm1, %ymm0, %ymm0 > ret > > .LC0: > .long 0 > .long 1 > .long 2 > .long 3 > .long 4 > .long 5 > .long 6 > .long 7 > > aka, with GCC generic vectors > > V setg (V v, int idx, T val) > { > V valv = (V){idx, idx, idx, idx, idx, idx, idx, idx}; > V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == valv); > v = (v & ~mask) | (valv & mask); > return v; > } Botched this up, corrected is V setg (V v, int idx, T val) { V idxv = (V){idx, idx, idx, idx, idx, idx, idx, idx}; V valv = (V){val, val, val, val, val, val, val, val}; V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == idxv); v = (v & ~mask) | (valv & mask); return v; } which produces vmovd %edi, %xmm1 vmovd %esi, %xmm2 vpbroadcastd%xmm1, %ymm1 vpbroadcastd%xmm2, %ymm2 vpcmpeqd.LC0(%rip), %ymm1, %ymm1 vpblendvb %ymm1, %ymm2, %ymm0, %ymm0 with AVX2, so one more vmovd/vpbroadcastd (as expected). With -mavx512vl this even becomes vpbroadcastd%edi, %ymm1 vpcmpd $0, .LC0(%rip), %ymm1, %k1 vpbroadcastd%esi, %ymm0{%k1} for the extract case we really need to compute a variable permute mask which looks harder and possibly more expensive than the spill/load, so the set case looks more important to tackle (tackling it will still eventually improve initial RTL generation by avoiding stack assignments for locals) > There's ongoing patch iteration on the ml adding variable index vec_set > expanders for powerpc (and the related middle-end changes). The question > is whether optabs can try many things or the target should have the choice > (probably better). > > Eventually there's a more efficient way to generate {0, 1, 2, 3...}.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #3 from Richard Biener --- (In reply to Richard Biener from comment #2) > Eventually there's a more efficient way to generate {0, 1, 2, 3...}. vpmovzx* could be at least used to only have a single byte vector {0, ... 255 } in memory for all cases including V256QI. Maybe also a separately useful constant pool optimization ...
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 --- Comment #2 from Richard Biener --- So for set with T == int and N == 32 we could generate vmovd %edi, %xmm1 vpbroadcastd%xmm1, %ymm1 vpcmpeqd.LC0(%rip), %ymm1, %ymm2 vpblendvb %ymm2, %ymm1, %ymm0, %ymm0 ret .LC0: .long 0 .long 1 .long 2 .long 3 .long 4 .long 5 .long 6 .long 7 aka, with GCC generic vectors V setg (V v, int idx, T val) { V valv = (V){idx, idx, idx, idx, idx, idx, idx, idx}; V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == valv); v = (v & ~mask) | (valv & mask); return v; } There's ongoing patch iteration on the ml adding variable index vec_set expanders for powerpc (and the related middle-end changes). The question is whether optabs can try many things or the target should have the choice (probably better). Eventually there's a more efficient way to generate {0, 1, 2, 3...}.
[Bug target/97194] optimize vector element set/extract at variable position
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194 Richard Biener changed: What|Removed |Added Target||x86_64-*-* i?86-*-* CC||hjl.tools at gmail dot com Keywords||missed-optimization --- Comment #1 from Richard Biener --- Note I googled quite a bit but didn't find sth for the 'set' case.