[Bug target/97194] optimize vector element set/extract at variable position

2022-01-11 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #26 from Richard Biener  ---
Let's declare this generic bug fixed.  Specific unhandled cases on x86 should
get a new bug with a specific testcase.

[Bug target/97194] optimize vector element set/extract at variable position

2021-07-06 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #25 from CVS Commits  ---
The master branch has been updated by Uros Bizjak :

https://gcc.gnu.org/g:f65878178ab05180a5937f11f8fdb755678a82ce

commit r12-2085-gf65878178ab05180a5937f11f8fdb755678a82ce
Author: Uros Bizjak 
Date:   Tue Jul 6 19:27:34 2021 +0200

i386: Add variable vec_set for 32bit vectors [PR97194]

To generate sane code a SSE4.1 variable PBLENDV instruction is needed.

Also enable variable vec_set through vec_setm_operand predicate
for TARGET_SSE4_1 instead of TARGET_AVX2. 
ix86_expand_vector_init_duplicate
is able to emulate vpbroadcast{b,w} with pxor/pshufb.

2021-07-06  Uroš Bizjak  

gcc/
PR target/97194
* config/i386/predicates.md (vec_setm_operand): Enable
register_operand for TARGET_SSE4_1.
* config/i386/mmx.md (vec_setv2hi): Use vec_setm_operand
as operand 2 predicate.  Call ix86_expand_vector_set_var
for non-constant index operand.
(vec_setv4qi): Use vec_setm_mmx_operand as operand 2 predicate.
Call ix86_expand_vector_set_var for non-constant index operand.

gcc/testsuite/

PR target/97194
* gcc.target/i386/sse4_1-vec-set-1a.c: New test.
* gcc.target/i386/sse4_1-vec-set-2a.c: Ditto.

[Bug target/97194] optimize vector element set/extract at variable position

2021-06-17 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #24 from CVS Commits  ---
The master branch has been updated by Uros Bizjak :

https://gcc.gnu.org/g:20a2c8ace0ab56c147fd995432abd5e7cf89b0e3

commit r12-1563-g20a2c8ace0ab56c147fd995432abd5e7cf89b0e3
Author: Uros Bizjak 
Date:   Thu Jun 17 15:19:12 2021 +0200

i386: Add variable vec_set for 64bit vectors [PR97194]

To generate sane code a SSE4.1 variable PBLENDV instruction is needed.

2021-06-17  Uroš Bizjak  

gcc/
PR target/97194
* config/i386/i386-expand.c (expand_vector_set_var):
Handle V2FS mode remapping.  Pass TARGET_MMX_WITH_SSE to
ix86_expand_vector_init_duplicate.
(ix86_expand_vector_init_duplicate): Emit insv_1 for
QImode for !TARGET_PARTIAL_REG_STALL.
* config/i386/predicates.md (vec_setm_mmx_operand): New predicate.
* config/i386/mmx.md (vec_setv2sf): Use vec_setm_mmx_operand
as operand 2 predicate.  Call ix86_expand_vector_set_var
for non-constant index operand.
(vec_setv2si): Ditto.
(vec_setv4hi): Ditto.
(vec_setv8qi): ditto.

gcc/testsuite/

PR target/97194
* gcc.target/i386/sse4_1-vec-set-1.c: New test.
* gcc.target/i386/sse4_1-vec-set-2.c: ditto.

[Bug target/97194] optimize vector element set/extract at variable position

2020-11-16 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #23 from Hongtao.liu  ---
Fixed in GCC11, may need a bit adjustment for the modeless operand(the variable
index) as dicussed in
https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559213.html

[Bug target/97194] optimize vector element set/extract at variable position

2020-11-16 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #22 from CVS Commits  ---
The master branch has been updated by hongtao Liu :

https://gcc.gnu.org/g:287cc750b0887e86cb309d976b17c7ee95f7ad48

commit r11-5074-g287cc750b0887e86cb309d976b17c7ee95f7ad48
Author: liuhongt 
Date:   Mon Oct 19 16:04:39 2020 +0800

Support variable index vec_set.

gcc/ChangeLog:

PR target/97194
* config/i386/i386-expand.c (ix86_expand_vector_set_var): New
function.
* config/i386/i386-protos.h (ix86_expand_vector_set_var): New Decl.
* config/i386/predicates.md (vec_setm_operand): New predicate,
true for const_int_operand or register_operand under TARGET_AVX2.
* config/i386/sse.md (vec_set): Support both constant
and variable index vec_set.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx2-vec-set-1.c: New test.
* gcc.target/i386/avx2-vec-set-2.c: New test.
* gcc.target/i386/avx512bw-vec-set-1.c: New test.
* gcc.target/i386/avx512bw-vec-set-2.c: New test.
* gcc.target/i386/avx512f-vec-set-2.c: New test.
* gcc.target/i386/avx512vl-vec-set-2.c: New test.

[Bug target/97194] optimize vector element set/extract at variable position

2020-10-16 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #21 from rguenther at suse dot de  ---
On Fri, 16 Oct 2020, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194
> 
> --- Comment #20 from Hongtao.liu  ---
> (In reply to Richard Biener from comment #18)
> > (In reply to Hongtao.liu from comment #16)
> > > (In reply to Hongtao.liu from comment #15)
> > > > I'm working on add the expander, i encounter a problem.
> > > > 
> > > > for V32HI vec_set with constant index, the expander existed under
> > > > TARGET_AVX512F, but for variable index, the expander should be existed 
> > > > under
> > > > TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we 
> > > > need
> > > > to restricted the expander under TARGET_AVX512BW, unfortunately 
> > > > operands is
> > > > unvisible in condition scope, any solution to handle such issue?
> > > 
> > > Or break V32HI into V16HI and V8HI when TARGET_AVX512BW is not existed.
> > 
> > That sounds like a good implementation strathegy, I suppose AVX512F implies
> > AVX2 even when AVX512VL is not available.
> 
> Yes, but not sure performance impact, maybe better generate code "like"
> expander not existed.

I'm not so sure - the STLF penalty is quite large and the AVX2 code
should only require an extra vextract + vpcmp + vinsert (and the
extra vpcmp uses another constant pool entry).  The broadcasts
and blend only require AVX512F.

[Bug target/97194] optimize vector element set/extract at variable position

2020-10-16 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #20 from Hongtao.liu  ---
(In reply to Richard Biener from comment #18)
> (In reply to Hongtao.liu from comment #16)
> > (In reply to Hongtao.liu from comment #15)
> > > I'm working on add the expander, i encounter a problem.
> > > 
> > > for V32HI vec_set with constant index, the expander existed under
> > > TARGET_AVX512F, but for variable index, the expander should be existed 
> > > under
> > > TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need
> > > to restricted the expander under TARGET_AVX512BW, unfortunately operands 
> > > is
> > > unvisible in condition scope, any solution to handle such issue?
> > 
> > Or break V32HI into V16HI and V8HI when TARGET_AVX512BW is not existed.
> 
> That sounds like a good implementation strathegy, I suppose AVX512F implies
> AVX2 even when AVX512VL is not available.

Yes, but not sure performance impact, maybe better generate code "like"
expander not existed.

[Bug target/97194] optimize vector element set/extract at variable position

2020-10-16 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #19 from Hongtao.liu  ---
(In reply to Richard Biener from comment #17)
> (In reply to Hongtao.liu from comment #15)
> > I'm working on add the expander, i encounter a problem.
> > 
> > for V32HI vec_set with constant index, the expander existed under
> > TARGET_AVX512F, but for variable index, the expander should be existed under
> > TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need
> > to restricted the expander under TARGET_AVX512BW, unfortunately operands is
> > unvisible in condition scope, any solution to handle such issue?
> 
> can_vec_set_var_idx_p checks insn_operand_matches thus the operand predicate
> is what you should adjust I think, sth like
> 
>  "const_int_or_reg_for_vec_set_operand"
> 
> or do you mean you do not see the mode of the vector inside the predicate?
> 

Need to sse vector mode inside the predicate.

> In that case I think you need to split the pattern?  Or think of a way
> to implement byte/word insert with just AVVX512F ...

[Bug target/97194] optimize vector element set/extract at variable position

2020-10-16 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #18 from Richard Biener  ---
(In reply to Hongtao.liu from comment #16)
> (In reply to Hongtao.liu from comment #15)
> > I'm working on add the expander, i encounter a problem.
> > 
> > for V32HI vec_set with constant index, the expander existed under
> > TARGET_AVX512F, but for variable index, the expander should be existed under
> > TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need
> > to restricted the expander under TARGET_AVX512BW, unfortunately operands is
> > unvisible in condition scope, any solution to handle such issue?
> 
> Or break V32HI into V16HI and V8HI when TARGET_AVX512BW is not existed.

That sounds like a good implementation strathegy, I suppose AVX512F implies
AVX2 even when AVX512VL is not available.

[Bug target/97194] optimize vector element set/extract at variable position

2020-10-16 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #17 from Richard Biener  ---
(In reply to Hongtao.liu from comment #15)
> I'm working on add the expander, i encounter a problem.
> 
> for V32HI vec_set with constant index, the expander existed under
> TARGET_AVX512F, but for variable index, the expander should be existed under
> TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need
> to restricted the expander under TARGET_AVX512BW, unfortunately operands is
> unvisible in condition scope, any solution to handle such issue?

can_vec_set_var_idx_p checks insn_operand_matches thus the operand predicate
is what you should adjust I think, sth like

 "const_int_or_reg_for_vec_set_operand"

or do you mean you do not see the mode of the vector inside the predicate?

In that case I think you need to split the pattern?  Or think of a way
to implement byte/word insert with just AVVX512F ...

[Bug target/97194] optimize vector element set/extract at variable position

2020-10-16 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #16 from Hongtao.liu  ---
(In reply to Hongtao.liu from comment #15)
> I'm working on add the expander, i encounter a problem.
> 
> for V32HI vec_set with constant index, the expander existed under
> TARGET_AVX512F, but for variable index, the expander should be existed under
> TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need
> to restricted the expander under TARGET_AVX512BW, unfortunately operands is
> unvisible in condition scope, any solution to handle such issue?

Or break V32HI into V16HI and V8HI when TARGET_AVX512BW is not existed.

[Bug target/97194] optimize vector element set/extract at variable position

2020-10-16 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #15 from Hongtao.liu  ---
I'm working on add the expander, i encounter a problem.

for V32HI vec_set with constant index, the expander existed under
TARGET_AVX512F, but for variable index, the expander should be existed under
TARGET_AVX512BW, since vpcmpw zmm only existed in TARGET_AVX512BW, we need to
restricted the expander under TARGET_AVX512BW, unfortunately operands is
unvisible in condition scope, any solution to handle such issue?

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #14 from Alexander Monakov  ---
I see, there are more weaknesses than I thought. For CSE (or rather fwprop?) I
was thinking about a simpler case where the extracted-from value is loaded from
memory, but even in trivial cases RTL optimizers cannot clean it up today (so
it wouldn't get any better with separate temporaries):

#define N 16
typedef int T;
typedef T V __attribute__((vector_size(N)));
T f(V *px, long i)
{
V x = *px;
return x[i];
}

f:
movdqa  (%rdi), %xmm0
movaps  %xmm0, -24(%rsp)
movl-24(%rsp,%rsi,4), %eax
ret

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #13 from Richard Biener  ---
(In reply to Richard Biener from comment #12)
> (In reply to Alexander Monakov from comment #11)
> > Yeah, for inserts such tactic would be inappropriate due to bad store
> > forwarding stalls anyway. As you've shown in earlier comments, inserts have
> > a very nice generic way to expand them (that does not touch stack).
> 
> Unfortunately it doesn't work (the CSE).  Patch:
> 
> diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
> index 1eaa1da11b9..f7b1a92dd95 100644
> --- a/gcc/cfgexpand.c
> +++ b/gcc/cfgexpand.c
> @@ -6102,7 +6102,11 @@ discover_nonconstant_array_refs_r (tree * tp, int
> *walk_subtrees,
>  || CONVERT_EXPR_P (t))
> t = TREE_OPERAND (t, 0);
>  
> -  if (TREE_CODE (t) == ARRAY_REF || TREE_CODE (t) == ARRAY_RANGE_REF)
> +  if ((TREE_CODE (t) == ARRAY_REF
> +  && !(TREE_CODE (TREE_OPERAND (t, 0)) == VIEW_CONVERT_EXPR
> +   && DECL_P (TREE_OPERAND (TREE_OPERAND (t, 0), 0)))
> +   && VECTOR_TYPE_P (TREE_TYPE (TREE_OPERAND (TREE_OPERAND (t,
> 0), 0
> +  || TREE_CODE (t) == ARRAY_RANGE_REF)
> {
>   t = get_base_address (t);
>   if (t && DECL_P (t)
> 
> 
> and for
> 
> typedef int v4si __attribute__((vector_size(16)));
> 
> int foo (v4si v, int i)
> {
>   v = v + v;
>   return v[i] + v[2*i];
> }
> 
> at -O2 we get
> 
> foo:
> .LFB0:
> .cfi_startproc
> leal(%rdi,%rdi), %edx
> paddd   %xmm0, %xmm0
> movslq  %edi, %rdi
> movslq  %edx, %rdx
> movaps  %xmm0, -24(%rsp)
> movaps  %xmm0, -40(%rsp)
> movl-40(%rsp,%rdi,4), %eax
> addl-24(%rsp,%rdx,4), %eax
> ret

and unpatched

foo:
.LFB0:
.cfi_startproc
leal(%rdi,%rdi), %edx
paddd   %xmm0, %xmm0
movslq  %edi, %rdi
movslq  %edx, %rdx
movaps  %xmm0, -24(%rsp)
movl-24(%rsp,%rdi,4), %eax
addl-24(%rsp,%rdx,4), %eax
ret

so we're able to elide the stack slot usage for the add and retain a single
slot.

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #12 from Richard Biener  ---
(In reply to Alexander Monakov from comment #11)
> Yeah, for inserts such tactic would be inappropriate due to bad store
> forwarding stalls anyway. As you've shown in earlier comments, inserts have
> a very nice generic way to expand them (that does not touch stack).

Unfortunately it doesn't work (the CSE).  Patch:

diff --git a/gcc/cfgexpand.c b/gcc/cfgexpand.c
index 1eaa1da11b9..f7b1a92dd95 100644
--- a/gcc/cfgexpand.c
+++ b/gcc/cfgexpand.c
@@ -6102,7 +6102,11 @@ discover_nonconstant_array_refs_r (tree * tp, int
*walk_subtrees,
 || CONVERT_EXPR_P (t))
t = TREE_OPERAND (t, 0);

-  if (TREE_CODE (t) == ARRAY_REF || TREE_CODE (t) == ARRAY_RANGE_REF)
+  if ((TREE_CODE (t) == ARRAY_REF
+  && !(TREE_CODE (TREE_OPERAND (t, 0)) == VIEW_CONVERT_EXPR
+   && DECL_P (TREE_OPERAND (TREE_OPERAND (t, 0), 0)))
+   && VECTOR_TYPE_P (TREE_TYPE (TREE_OPERAND (TREE_OPERAND (t, 0),
0
+  || TREE_CODE (t) == ARRAY_RANGE_REF)
{
  t = get_base_address (t);
  if (t && DECL_P (t)


and for

typedef int v4si __attribute__((vector_size(16)));

int foo (v4si v, int i)
{
  v = v + v;
  return v[i] + v[2*i];
}

at -O2 we get

foo:
.LFB0:
.cfi_startproc
leal(%rdi,%rdi), %edx
paddd   %xmm0, %xmm0
movslq  %edi, %rdi
movslq  %edx, %rdx
movaps  %xmm0, -24(%rsp)
movaps  %xmm0, -40(%rsp)
movl-40(%rsp,%rdi,4), %eax
addl-24(%rsp,%rdx,4), %eax
ret

we likely also not get rid of the stack allocation.  Maybe it's due to the
way expand does the temporary spill, not ending its lifetime, not sure.
We're definitely not "remembering" the spill slot used for 'v' and do
not re-use it, there's no mechanism for that IIRC.

At least we don't ICE for the specific case of vectors.  We're running into

/* If we have either an offset, a BLKmode result, or a reference
   outside the underlying object, we must force it to memory.
   Such a case can occur in Ada if we have unchecked conversion
   of an expression from a scalar type to an aggregate type or
   for an ARRAY_RANGE_REF whose type is BLKmode, or if we were
   passed a partially uninitialized object or a view-conversion
   to a larger size.  */
must_force_mem = (offset
  || mode1 == BLKmode
  || (mode == BLKmode
  && !int_mode_for_size (bitsize, 1).exists ())
  || maybe_gt (bitpos + bitsize,
   GET_MODE_BITSIZE (mode2)));

where 'offset' is MULT_EXPR and we've sofar expanded 'v' to op0 = (reg/v:V4SI
88 [ v ])
and then

/* Otherwise, if this is a constant or the object is not in memory
   and need be, put it there.  */
else if (CONSTANT_P (op0) || (!MEM_P (op0) && must_force_mem))
  {
memloc = assign_temp (TREE_TYPE (tem), 1, 1);
emit_move_insn (memloc, op0);
op0 = memloc;
clear_mem_expr = true;
  }

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #11 from Alexander Monakov  ---
Yeah, for inserts such tactic would be inappropriate due to bad store
forwarding stalls anyway. As you've shown in earlier comments, inserts have a
very nice generic way to expand them (that does not touch stack).

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #10 from rguenther at suse dot de  ---
On Mon, 28 Sep 2020, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194
> 
> --- Comment #9 from Alexander Monakov  ---
> (In reply to Richard Biener from comment #8)
> > Note that currently RTL expansion forces a local vector typed variable
> > to the stack (instead of allocating a pseudo) when there are
> > variable-index accesses to it.  That might be a reason to also handle
> > slightly "expensive" extract cases.  But I guess later falling back
> > to a stack slot via a splitter or LRA will lead to worse code.
> 
> Indeed, but I struggle to see a good reason to bind the entire lifetime of a
> variable to memory just because one operation requires that. Cannot GCC 
> instead
> create a fresh temporary early at RTL-expand (not split) time for each extract
> operation, letting the original variable live in a pseudo, and binding only
> that short-lived temporary to memory?
> 
> It can result in extra copies if the temporary needs to be loaded from memory
> anyway, but I think passes like RTL CSE should be able to propagate them.

Sure, that would be possible.  We do this kind of things for
CONCAT (complex numbers) with handling some select cases and falling
back to spilling.  But we've backed off for more general handling
because it ICEd too many cases.

For inserts we don't want to do this since I'm quite positive that
RTL wouldn't be able to merge two spill slots when doing two
consecutive inserts.

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #9 from Alexander Monakov  ---
(In reply to Richard Biener from comment #8)
> Note that currently RTL expansion forces a local vector typed variable
> to the stack (instead of allocating a pseudo) when there are
> variable-index accesses to it.  That might be a reason to also handle
> slightly "expensive" extract cases.  But I guess later falling back
> to a stack slot via a splitter or LRA will lead to worse code.

Indeed, but I struggle to see a good reason to bind the entire lifetime of a
variable to memory just because one operation requires that. Cannot GCC instead
create a fresh temporary early at RTL-expand (not split) time for each extract
operation, letting the original variable live in a pseudo, and binding only
that short-lived temporary to memory?

It can result in extra copies if the temporary needs to be loaded from memory
anyway, but I think passes like RTL CSE should be able to propagate them.

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #8 from Richard Biener  ---
(In reply to Alexander Monakov from comment #7)
> FWIW, Peter Cordes provides an overview of available approaches for
> extraction depending on vector length and ISA extensions (up to AVX2, not
> including AVX-512) in this StackOverflow answer:
> https://stackoverflow.com/a/51414330/4755075
> 
> TL;DR: generally through store+load; possible alternatives:
>  128b:
>   SSSE3: pshufb  (1-byte elements)
>   SSSE3: imul+add+pshufb (any element size)
>   AVX: vpermilp[sd] (4 or 8-byte elements)
>  256b:
>   AVX2: vpermps (4-byte elements)
> 
> In all cases a (v)movd is needed to move the index to a vector register, and
> potentially another (v)movd if the result is needed in a general register.
> 
> The basic store+load tactic may look worse latency-wise, but can be better
> throughput-wise (especially with multiple extractions from the same vector,
> as then the store needs to be done just once, as Peter mentioned).
> 
> Why in RTL it is important to do this without referencing the stack?

For extraction it isn't absolutely required to do this w/o the stack
since the spill would cover the whole vector and the reads can be
usually handled with store-forwarding in the CPUs.  So here this
can be fully based on cost.

The insert case is instead very bad here with a whole-vector store
followed by an element store and then a whole-vector load.  This
sequence will usually cause at least additional latency or worse
recovering from a bad store-forwarding.

Note that currently RTL expansion forces a local vector typed variable
to the stack (instead of allocating a pseudo) when there are
variable-index accesses to it.  That might be a reason to also handle
slightly "expensive" extract cases.  But I guess later falling back
to a stack slot via a splitter or LRA will lead to worse code.

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #7 from Alexander Monakov  ---
FWIW, Peter Cordes provides an overview of available approaches for extraction
depending on vector length and ISA extensions (up to AVX2, not including
AVX-512) in this StackOverflow answer:
https://stackoverflow.com/a/51414330/4755075

TL;DR: generally through store+load; possible alternatives:
 128b:
  SSSE3: pshufb  (1-byte elements)
  SSSE3: imul+add+pshufb (any element size)
  AVX: vpermilp[sd] (4 or 8-byte elements)
 256b:
  AVX2: vpermps (4-byte elements)

In all cases a (v)movd is needed to move the index to a vector register, and
potentially another (v)movd if the result is needed in a general register.

The basic store+load tactic may look worse latency-wise, but can be better
throughput-wise (especially with multiple extractions from the same vector, as
then the store needs to be done just once, as Peter mentioned).

Why in RTL it is important to do this without referencing the stack?

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed||2020-09-28
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW

--- Comment #6 from Richard Biener  ---
(In reply to Hongtao.liu from comment #5)
> > > There's ongoing patch iteration on the ml adding variable index vec_set
> > > expanders for powerpc (and the related middle-end changes).  The question
> > > is whether optabs can try many things or the target should have the choice
> > > (probably better).
> > > 
> 
> The patch has been on trunk, will you add expander?

I hope x86 maintainers will beat me to it.

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-27 Thread crazylht at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #5 from Hongtao.liu  ---

> > There's ongoing patch iteration on the ml adding variable index vec_set
> > expanders for powerpc (and the related middle-end changes).  The question
> > is whether optabs can try many things or the target should have the choice
> > (probably better).
> > 

The patch has been on trunk, will you add expander?

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #4 from Richard Biener  ---
(In reply to Richard Biener from comment #2)
> So for set with T == int and N == 32 we could generate
> 
> vmovd   %edi, %xmm1
> vpbroadcastd%xmm1, %ymm1
> vpcmpeqd.LC0(%rip), %ymm1, %ymm2
> vpblendvb   %ymm2, %ymm1, %ymm0, %ymm0
> ret
> 
> .LC0:
> .long   0
> .long   1
> .long   2
> .long   3
> .long   4
> .long   5
> .long   6
> .long   7
> 
> aka, with GCC generic vectors
> 
> V setg (V v, int idx, T val)
> {
>   V valv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
>   V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == valv);
>   v = (v & ~mask) | (valv & mask);
>   return v;
> }

Botched this up, corrected is

V setg (V v, int idx, T val)
{
  V idxv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
  V valv = (V){val, val, val, val, val, val, val, val};
  V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == idxv);
  v = (v & ~mask) | (valv & mask);
  return v;
}

which produces

vmovd   %edi, %xmm1
vmovd   %esi, %xmm2
vpbroadcastd%xmm1, %ymm1
vpbroadcastd%xmm2, %ymm2
vpcmpeqd.LC0(%rip), %ymm1, %ymm1
vpblendvb   %ymm1, %ymm2, %ymm0, %ymm0

with AVX2, so one more vmovd/vpbroadcastd (as expected).  With -mavx512vl
this even becomes

vpbroadcastd%edi, %ymm1
vpcmpd  $0, .LC0(%rip), %ymm1, %k1
vpbroadcastd%esi, %ymm0{%k1}

for the extract case we really need to compute a variable permute mask
which looks harder and possibly more expensive than the spill/load,
so the set case looks more important to tackle (tackling it will still
eventually improve initial RTL generation by avoiding stack assignments
for locals)

> There's ongoing patch iteration on the ml adding variable index vec_set
> expanders for powerpc (and the related middle-end changes).  The question
> is whether optabs can try many things or the target should have the choice
> (probably better).
> 
> Eventually there's a more efficient way to generate {0, 1, 2, 3...}.

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #3 from Richard Biener  ---
(In reply to Richard Biener from comment #2)
> Eventually there's a more efficient way to generate {0, 1, 2, 3...}.

vpmovzx* could be at least used to only have a single
byte vector {0, ... 255 } in memory for all cases including V256QI.
Maybe also a separately useful constant pool optimization ...

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

--- Comment #2 from Richard Biener  ---
So for set with T == int and N == 32 we could generate

vmovd   %edi, %xmm1
vpbroadcastd%xmm1, %ymm1
vpcmpeqd.LC0(%rip), %ymm1, %ymm2
vpblendvb   %ymm2, %ymm1, %ymm0, %ymm0
ret

.LC0:
.long   0
.long   1
.long   2
.long   3
.long   4
.long   5
.long   6
.long   7

aka, with GCC generic vectors

V setg (V v, int idx, T val)
{
  V valv = (V){idx, idx, idx, idx, idx, idx, idx, idx};
  V mask = ((V){0, 1, 2, 3, 4, 5, 6, 7} == valv);
  v = (v & ~mask) | (valv & mask);
  return v;
}


There's ongoing patch iteration on the ml adding variable index vec_set
expanders for powerpc (and the related middle-end changes).  The question
is whether optabs can try many things or the target should have the choice
(probably better).

Eventually there's a more efficient way to generate {0, 1, 2, 3...}.

[Bug target/97194] optimize vector element set/extract at variable position

2020-09-24 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97194

Richard Biener  changed:

   What|Removed |Added

 Target||x86_64-*-* i?86-*-*
 CC||hjl.tools at gmail dot com
   Keywords||missed-optimization

--- Comment #1 from Richard Biener  ---
Note I googled quite a bit but didn't find sth for the 'set' case.