[Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat

2023-01-10 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

--- Comment #6 from rguenther at suse dot de  ---
On Tue, 10 Jan 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322
> 
> --- Comment #5 from Alexander Monakov  ---
> (In reply to Richard Biener from comment #4)
> > 
> > For the case at hand loading two vectors from the destination and then
> > punpck{h,l}bw and storing them again might be the most efficient thing
> > to do here.
> 
> I think such read-modify-write on the destination introduces a data race for
> bytes that are not accessed in the original program, so that would be okay only
> under -fallow-store-data-races?

Yes, obviously.

[Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat

2023-01-10 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

--- Comment #5 from Alexander Monakov  ---
(In reply to Richard Biener from comment #4)
> 
> For the case at hand loading two vectors from the destination and then
> punpck{h,l}bw and storing them again might be the most efficient thing
> to do here.

I think such read-modify-write on the destination introduces a data race for
bytes that are not accessed in the original program, so that would be okay only
under -fallow-store-data-races?

[Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat

2023-01-09 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |53947
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2023-01-10
             Status|UNCONFIRMED                 |NEW
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #4 from Richard Biener  ---
The vectorizer vectorizes this with a strided store, costing

*pSrc_16 1 times unaligned_load (misalign -1) costs 12 in body
_1 16 times scalar_store costs 192 in body
_1 16 times vec_to_scalar costs 64 in body
t.c:8:44: note:  operating only on full vectors.
t.c:8:44: note:  Cost model analysis: 
  Vector inside of loop cost: 268
  Vector prologue cost: 0
  Vector epilogue cost: 0
  Scalar iteration cost: 24
  Scalar outside cost: 0
  Vector outside cost: 0
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 0
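
To make the comparison in the dump explicit, here is a small sketch of
the arithmetic.  The numbers are copied from the dump; the identifier
names are made up for illustration, and VF = 16 is an assumption based
on the "16 times" entries.  The real profitability formula in the
vectorizer has more terms than this.

```c
/* Figures copied from the cost dump; names are illustrative only. */
enum {
    VF             = 16,   /* assumed: 16 byte elements per vector */
    UNALIGNED_LOAD = 12,   /* 1 x unaligned_load */
    SCALAR_STORES  = 192,  /* 16 x scalar_store, 12 each */
    VEC_TO_SCALAR  = 64,   /* 16 x vec_to_scalar, 4 each */
    SCALAR_ITER    = 24    /* one scalar iteration */
};

/* Cost of one vector iteration as summed in the dump. */
int vector_body_cost(void)
{
    return UNALIGNED_LOAD + SCALAR_STORES + VEC_TO_SCALAR;
}

/* Cost of the VF scalar iterations that one vector iteration replaces. */
int scalar_cost_per_vector_iter(void)
{
    return SCALAR_ITER * VF;
}
```

With these figures the vector body (268) looks cheaper than the 16
scalar iterations it replaces (384), which is presumably why the
minimum iteration count for profitability comes out as 0.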

Later, forwprop figures out that it can replace the element extracts
from the vector load with scalar loads, which effectively unrolls the
loop by a factor of 16.

The vectorizer misses the fact that w/o SSE 4.1 it cannot do efficient
lane extracts.  With SSE 4.1 and disabling the forwprop you'd get

.L3:
movdqu  (%rsi), %xmm0
addq    $16, %rsi
addq    $32, %rax
pextrb  $0, %xmm0, -32(%rax)
pextrb  $1, %xmm0, -30(%rax)
pextrb  $2, %xmm0, -28(%rax)
pextrb  $3, %xmm0, -26(%rax)
pextrb  $4, %xmm0, -24(%rax)
pextrb  $5, %xmm0, -22(%rax)
pextrb  $6, %xmm0, -20(%rax)
pextrb  $7, %xmm0, -18(%rax)
pextrb  $8, %xmm0, -16(%rax)
pextrb  $9, %xmm0, -14(%rax)
pextrb  $10, %xmm0, -12(%rax)
pextrb  $11, %xmm0, -10(%rax)
pextrb  $12, %xmm0, -8(%rax)
pextrb  $13, %xmm0, -6(%rax)
pextrb  $14, %xmm0, -4(%rax)
pextrb  $15, %xmm0, -2(%rax)
cmpq    %rdx, %rsi
jne     .L3

which is what the vectorizer thinks is going to be generated.  But with
just SSE2 we are spilling to memory for the lane extract.
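
The reproducer itself is not quoted in this thread.  Judging from the
listing above (one 16-byte load, sixteen byte stores at a stride of
two), the loop is presumably of roughly the following shape.  The
function and parameter names are hypothetical, not taken from the bug
report.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of the loop under discussion: every source byte is
   stored to every second destination byte; the odd destination bytes
   are never touched.  Names are hypothetical. */
void spread_bytes(uint8_t *__restrict dst, const uint8_t *__restrict src,
                  size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[2 * i] = src[i];
}
```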

For the case at hand loading two vectors from the destination and then
punpck{h,l}bw and storing them again might be the most efficient thing
to do here.
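
A minimal sketch of that idea with SSE2 intrinsics, assuming the loop
stores src[i] to dst[2*i].  The function name is hypothetical, and the
blend via a mask is just one way to use the unpack instructions; it is
not what GCC emits.  Note that it reads and rewrites the odd bytes of
dst, including dst[2*n-1], which the scalar loop never touches, so it
assumes dst has at least 2*n accessible bytes and that nothing else
writes them concurrently.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: dst[2*i] = src[i] via read-modify-write of dst. */
void spread_bytes_rmw(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    /* Per 16-bit lane: keep the high (odd-address) byte, clear the low. */
    const __m128i keep_odd = _mm_set1_epi16((short)0xFF00);
    for (; i + 16 <= n; i += 16) {
        __m128i s  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i d0 = _mm_loadu_si128((const __m128i *)(dst + 2 * i));
        __m128i d1 = _mm_loadu_si128((const __m128i *)(dst + 2 * i + 16));
        __m128i lo = _mm_unpacklo_epi8(s, zero);   /* punpcklbw: s0 0 s1 0 ... */
        __m128i hi = _mm_unpackhi_epi8(s, zero);   /* punpckhbw: s8 0 s9 0 ... */
        d0 = _mm_or_si128(_mm_and_si128(d0, keep_odd), lo);
        d1 = _mm_or_si128(_mm_and_si128(d1, keep_odd), hi);
        _mm_storeu_si128((__m128i *)(dst + 2 * i), d0);
        _mm_storeu_si128((__m128i *)(dst + 2 * i + 16), d1);
    }
#endif
    /* Scalar tail, and the whole loop when SSE2 is unavailable. */
    for (; i < n; i++)
        dst[2 * i] = src[i];
}
```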

On the cost model side 'vec_to_scalar' is ambiguous; the x86 backend
tries to compensate with

  /* If we do elementwise loads into a vector then we are bound by
     latency and execution resources for the many scalar loads
     (AGU and load ports).  Try to account for this by scaling the
     construction cost by the number of elements involved.  */
  if ((kind == vec_construct || kind == vec_to_scalar)
      && stmt_info
      && (STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
          || STMT_VINFO_TYPE (stmt_info) == store_vec_info_type)
      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
      && TREE_CODE (DR_STEP (STMT_VINFO_DATA_REF (stmt_info))) != INTEGER_CST)
    {
      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
      stmt_cost *= (TYPE_VECTOR_SUBPARTS (vectype) + 1);
    }

but that doesn't trigger here because the step is constant two.
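
As a rough what-if, the sketch below applies that scaling to the
figures from the cost dump.  It assumes 16 elements per vector, that
only the vec_to_scalar entries would be scaled, and that the per-entry
cost of 4 (64 / 16) is still multiplied by the count of 16 afterwards.
The identifier names are made up for illustration.

```c
/* Figures from the cost dump; names are illustrative only. */
enum {
    NELTS              = 16,  /* assumed TYPE_VECTOR_SUBPARTS */
    VEC_TO_SCALAR_EACH = 4,   /* 64 / 16 */
    UNALIGNED_LOAD     = 12,
    SCALAR_STORES      = 192,
    SCALAR_ITER        = 24
};

/* Per-statement cost after stmt_cost *= (nelts + 1). */
int scaled_vec_to_scalar_each(void)
{
    return VEC_TO_SCALAR_EACH * (NELTS + 1);
}

/* Hypothetical vector body cost if the scaling had applied. */
int scaled_vector_body_cost(void)
{
    return UNALIGNED_LOAD + SCALAR_STORES
           + NELTS * scaled_vec_to_scalar_each();
}
```

Under those assumptions the vector body would cost 1292 against
16 x 24 = 384 for the scalar iterations, so the loop would likely not
have been considered profitable to vectorize.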

RTL expansion will eventually use the vec_extract optab and that succeeds
even for SSE2 by spilling, so it isn't useful to query support:

void
ix86_expand_vector_extract (bool mmx_ok, rtx target, rtx vec, int elt)
{
...
  if (use_vec_extr)
    {
...
    }
  else
    {
      rtx mem = assign_stack_temp (mode, GET_MODE_SIZE (mode));

      emit_move_insn (mem, vec);

      tmp = adjust_address (mem, inner_mode, elt*GET_MODE_SIZE (inner_mode));
      emit_move_insn (target, tmp);
    }
}

The fallback is eventually done by RTL expansion anyway.

Note that the vectorizer does not query vec_extract support; it relies
on expand's fallback here.  Querying it would allow better costing, and
the vectorizer could also generate a single spill slot rather than one
for each extract.
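
The single-spill-slot idea can be modelled in plain C as below.  This is
only a model of the intended data flow (one store of the whole vector,
then scalar reloads from that one slot), not of the RTL that GCC would
generate, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: spill 16 source bytes once, then scatter them. */
void spread_bytes_one_slot(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        uint8_t slot[16];              /* the single spill slot */
        memcpy(slot, src + i, 16);     /* one vector-sized store */
        for (size_t k = 0; k < 16; k++)
            dst[2 * (i + k)] = slot[k];  /* scalar reload + store */
    }
    for (; i < n; i++)                 /* scalar tail */
        dst[2 * i] = src[i];
}
```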


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations

[Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat

2023-01-06 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

Alexander Monakov  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
With '-fdisable-tree-forwprop4 -msse4.1' you see what the vectorizer perhaps
wanted to achieve.

[Bug target/108322] Using __restrict parameter with -ftree-vectorize (default with -O2) results in massive code bloat

2023-01-06 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108322

Andrew Pinski  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-linux-gnu
          Component|tree-optimization           |target
           Keywords|                            |missed-optimization

--- Comment #2 from Andrew Pinski  ---
This is a cost model issue with x86_64.

On aarch64, this is not vectorized unless you use -fno-vect-cost-model.