[Bug tree-optimization/117395] missed SRA opportunity with extracting subpart of type

2024-11-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117395

--- Comment #4 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #1)
> It is a slight regression from GCC 14 though.
> 
> Which produced:
> ```
> foo:
> ldr q31, [x0, 32]
> sub sp, sp, #128
> add sp, sp, 128
> dup d0, v31.d[1]
> add v0.4h, v0.4h, v31.4h
> ret
> ```
> 
> But that is only because vget_low_s16/vget_high_s16 didn't expand to using
> BIT_FIELD_REF before.
> 

That's a good point; lowering it in RTL as we did before prevented the subreg
inlining, so reload didn't have to spill.

I wonder if, instead of using BIT_FIELD_REF, we should use VEC_PERM_EXPR
+ VIEW_CONVERT.  This would get us the right rotate again and recover the
regression.
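
For illustration, a rough C sketch of that idea using generic vector
extensions (the helper name and permute indices are mine, for a little-endian
target, and not the proposed GIMPLE form):

typedef short v8hi __attribute__((vector_size (16)));
typedef short v4hi __attribute__((vector_size (8)));

/* Extract the high half of a V8HI by rotating it into the low position
   (the VEC_PERM_EXPR) and then reinterpreting the low 64 bits (the
   VIEW_CONVERT).  */
v4hi high_half (v8hi x)
{
  v8hi rot = __builtin_shufflevector (x, x, 4, 5, 6, 7, 0, 1, 2, 3);
  union { v8hi w; v4hi h[2]; } u = { .w = rot };
  return u.h[0];
}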

We'd still need SRA for optimal codegen though without the stack allocations.


(In reply to Richard Biener from comment #3)
> Having memcpy in the IL is preventing SRA.  There's probably no type
> suitable for the single load/store memcpy inlining done by
> gimple_fold_builtin_memory_op.
> 

Yeah, the original loop doesn't have a memcpy, but it's being idiom-recognized.
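
For reference, the kind of copy loop that loop distribution
(-ftree-loop-distribute-patterns, enabled at -O3) replaces with a memcpy call;
a minimal sketch, not the exact loop from the original code:

void copy (int16x8_t *dst, const int16x8_t *src)
{
  /* Recognized as a block copy and replaced by
     memcpy (dst, src, 8 * sizeof (int16x8_t)).  */
  for (int i = 0; i < 8; i++)
    dst[i] = src[i];
}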

> We could try to fold all memcpy to aggregate char[] array assignments,
> at least when a decl is involved on either side with the idea to
> eventually elide TREE_ADDRESSABLE.  But we need to make sure this
> doesn't pessimize RTL expansion or other code dealing with memcpy but
> not aggregate array copy.
> 
> SRA could handle memcpy and friends transparently iff it were to locally
> compute its own idea of TREE_ADDRESSABLE.

I suppose the second option is better in general? Does SRA have the same issue
with memset? Would it be possible to get a rough sketch of what this would
entail?

[Bug tree-optimization/117395] New: missed SRA opportunity with extracting subpart of type

2024-11-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117395

Bug ID: 117395
   Summary: missed SRA opportunity with extracting subpart of type
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following example:

#include <arm_neon.h>
#include <string.h>

int16x4_t foo(const int16_t *src, int16x8_t c) {
int16x8_t s[8];
memcpy (&s[0], src, sizeof(s));

return vget_low_s16(s[2]) + vget_high_s16(s[2]);
}

compiled with -march=armv9-a -O3

produces:

foo(short const*, __Int16x8_t):
ldr q31, [x0, 32]
sub sp, sp, #128
add sp, sp, 128
umov    x0, v31.d[0]
umov    x1, v31.d[1]
fmov    d30, x0
fmov    d31, x1
add v0.4h, v31.4h, v30.4h
ret

which is a bit silly, but happens because reload has to spill the subreg that's
taking half the register of s[2] and changing modes.

We cleaned up the stack usage, but not the stack allocation itself.

in GIMPLE we have:

  memcpy (&s[0], src_4(D), 128);
  _1 = BIT_FIELD_REF <s[2], 64, 0>;
  _2 = BIT_FIELD_REF <s[2], 64, 64>;
  _6 = _1 + _2;

but SRA has rejected it doing:

Candidate (26131): s
Allowed ADDR_EXPR of s because of memcpy (&s[0], src_4(D), 128);

! Disqualifying s - No scalar replacements to be created.

The manually scalarized version:

int16x4_t bar(const int16_t *src, int16x8_t c) {
int16x8_t s;
memcpy (&s, src, sizeof(s));

return vget_low_s16(s) + vget_high_s16(s);
}

does however generate the right code.

Does SRA support BIT_FIELD_REFs today? A comment in analyze_access_subtree
seems to indicate it may punt on them.

I also seem to remember vaguely that SRA has a limit on the size of the object
it scalarizes?

[Bug tree-optimization/117176] [15 regression] ICE when building netpbm-11.8.0

2024-10-31 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117176

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #12 from Tamar Christina  ---
Fixed, thanks for the report, sorry for the delay, have been away.

[Bug tree-optimization/117176] [15 regression] ICE when building netpbm-11.8.0

2024-10-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117176

--- Comment #5 from Tamar Christina  ---
Have added more lowering to gcond handling and now we vectorize this as well:

ptrue   p7.b, all
fmov    z30.d, #1.0e+0
ld1d    z31.d, p7/z, [sp, #1, mul vl]
fcmlt   p15.d, p7/z, z31.d, z30.d
not p7.b, p7/z, p15.b
str p7, [x0, #7, mul vl]

which we didn't in GCC 14.

Running regressions and will send patch.

[Bug tree-optimization/117140] [15 regression] RISC-V: ICE in initialize_flags_in_bb for rv32gcv

2024-10-18 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117140

Tamar Christina  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #11 from Tamar Christina  ---
Fixed, thanks for the report.

[Bug tree-optimization/117176] [15 regression] ICE when building netpbm-11.8.0

2024-10-17 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117176

Tamar Christina  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> Confirmed.  SLP early-exit vect related.
> 
> (gdb) p operand
> $1 = 1
> (gdb) p debug (slp_node)
> t.c:9:5: note: node 0x4e46ac0 (max_nunits=4, refcnt=1) vector(4)
> 
> t.c:9:5: note: op template: _3 = ~_2;
> t.c:9:5: note:  [l] stmt 0 _3 = ~_2;
> t.c:9:5: note:  children 0x4e46b50
> 
> for some reason vectorizable_comparison_1 is called on this stmt.  ISTR
> we expect a comparison here, not a BIT_NOT - SLP discovery should probably
> reject this case (for now).

I think this is a failure of boolean lowering.  The vectorizer doesn't handle
mask inversions through bitwise NOT; it expects vect_recog_bool_pattern
to lower this to a BIT_XOR_EXPR instead.

The problem here is that the comparison that uses the BIT_NOT_EXPR is hidden
inside the gcond so vect_recog_gcond_pattern doesn't find it.

I think the proper solution is to have vect_recog_gcond_pattern do the
lowering, since patterns can't be seeds of other patterns atm.  That's why the
condition split off from vect_recog_gcond_pattern doesn't trigger
vect_recog_bool_pattern.
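
In scalar terms, the lowering vect_recog_bool_pattern performs amounts to (a
minimal C sketch):

/* For a boolean-valued mask m, the vectorizer wants the inversion
   written as an XOR with 1 rather than a bitwise NOT.  */
_Bool invert (_Bool m)
{
  return m ^ 1;   /* instead of ~m, which isn't handled on masks */
}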

So Mine.

[Bug tree-optimization/117140] [15 regression] RISC-V: ICE in initialize_flags_in_bb for rv32gcv

2024-10-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117140

--- Comment #9 from Tamar Christina  ---
(In reply to rguent...@suse.de from comment #8)
> On Wed, 16 Oct 2024, tnfchris at gcc dot gnu.org wrote:
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index 8727246c27a..c028594e18b 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -11128,7 +11128,8 @@ vectorize_slp_instance_root_stmt (vec_info *vinfo,
> > slp_tree node, slp_instance i
> >  can't support lane > 1 at this time.  */
> >gcc_assert (instance->root_stmts.length () == 1);
> >auto root_stmt_info = instance->root_stmts[0];
> > -  auto last_stmt = STMT_VINFO_STMT (root_stmt_info);
> 
> if this is a pattern you'd want STMT_VINFO_STMT (vect_orig_stmt 
> (root_stmt_info))

Ah of course.. doh..

running regression tests..

[Bug tree-optimization/117140] [15 regression] RISC-V: ICE in initialize_flags_in_bb for rv32gcv

2024-10-16 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117140

--- Comment #7 from Tamar Christina  ---
For this statement somehow the location of the gsi ends up having
first == last, so gsi_insert_before just silently ignores the insert.

The ICE happens because for this one BB, no vector statement is ever emitted to
the BB.

Reverting to the original code I had

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 8727246c27a..c028594e18b 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -11128,7 +11128,8 @@ vectorize_slp_instance_root_stmt (vec_info *vinfo,
slp_tree node, slp_instance i
 can't support lane > 1 at this time.  */
   gcc_assert (instance->root_stmts.length () == 1);
   auto root_stmt_info = instance->root_stmts[0];
-  auto last_stmt = STMT_VINFO_STMT (root_stmt_info);
+  auto last_stmt = vect_find_first_scalar_stmt_in_slp (node)->stmt;
   gimple_stmt_iterator rgsi = gsi_for_stmt (last_stmt);
   gimple *vec_stmt = NULL;
   gcc_assert (!SLP_TREE_VEC_DEFS (node).is_empty ());

works correctly, but I'd like to figure out why.  So looking into that first.

[Bug c++/116064] [15 Regression] SPEC 2017 523.xalancbmk_r failed to build

2024-10-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116064

--- Comment #13 from Tamar Christina  ---
(In reply to Li Pan from comment #12)
> Looks RISC-V backend still has this issue, can you reproduce this?
> 
> /opt/gcc-master//bin/g++ --version
> g++ (GCC) 15.0.0 20241014 (experimental)
> 
> /opt/gcc-master//bin/g++ -std=c++03 -c -o XalanXMLSerializerFactory.o -DSPEC
> -DNDEBUG  -DAPP_NO_THREADS -DXALAN_INMEM_MSG_LOADER -I. -Ixercesc
> -Ixercesc/dom -Ixercesc/dom/impl -Ixercesc/sax
> -Ixercesc/util/MsgLoaders/InMemory -Ixercesc/util/Transcoders/Iconv
> -Ixalanc/i>
> In file included from XalanXMLSerializerFactory.cpp:30:
> xalanc/XMLSupport/XalanOtherEncodingWriter.hpp: In member function 'void
> xalanc_1_10::XalanOtherEncodingWriter ConstantsType>::writeSafe(const xalanc_1_10::XalanDOMChar*,
> xalanc_1_10::XalanFormatterWriter::size_type)':
> xalanc/XMLSupport/XalanOtherEncodingWriter.hpp:319:30: error: 'class
> xalanc_1_10::XalanOtherEncodingWriter' has no
> member named 'm_isPresentable' [-Wtemplate-body]
>   319 | if(this->m_isPresentable(value))
>   |  ^~~
> xalanc/XMLSupport/XalanOtherEncodingWriter.hpp:325:31: error: 'class
> xalanc_1_10::XalanOtherEncodingWriter' has no
> member named 'writeNumberedEntityReference' [-Wtemplate-body]
>   325 | this->writeNumberedEntityReference(value);
>   |   ^~~~
> specmake: ***
> [/root/panli/benchmarks/spec2017/benchspec/Makefile.defaults:366:
> XalanXMLSerializerFactory.o] Error 1
> specmake: *** Waiting for unfinished jobs

Your command invocation doesn't look like you passed either of the two required
flags though..

> One can now use -fpermissive or -Wno-error=template-body to compile TUs
> containing errors inside uninstantiated templates.

So you have to use one of those two.

[Bug tree-optimization/117140] [15 regression] RISC-V: ICE in initialize_flags_in_bb for rv32gcv

2024-10-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117140

Tamar Christina  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2024-10-15
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|UNCONFIRMED |ASSIGNED

--- Comment #5 from Tamar Christina  ---
Ok, mine then.

It looks like it's this loop:

int n_3_TYPE1_int64_t = 32;
int32_t x_3_int32_t = 233;
int32_t x2_3_int32_t = 78;
int64_t y_3_int64_t = 1234;
int32_t f_3_int32_t[33 * 2 + 1] = { 0 };
int64_t d_3_int64_t[33] = { 0 };
test_1_TYPE1_int64_t (f_3_int32_t, d_3_int64_t, x_3_int32_t, x2_3_int32_t,
  y_3_int64_t, n_3_TYPE1_int64_t);
for (int i = 0; i < n_3_TYPE1_int64_t; ++i)
{
if (f_3_int32_t[i * 2 + 0] != x_3_int32_t)
__builtin_abort ();
if (f_3_int32_t[i * 2 + 1] != x2_3_int32_t)
__builtin_abort ();
if (d_3_int64_t[i] != y_3_int64_t)
__builtin_abort ();
}

the first two exits vectorize fine, the third one is vectorized but the
statement never gets replaced:

;;   basic block 7, loop depth 1, count 32534376 (estimated locally, freq
1.), maybe hot
;;prev block 19, next block 32, flags: (NEW, REACHABLE, VISITED)
;;pred:   5 [always]  count:32534376 (estimated locally, freq 1.)
(FALSE_VALUE,EXECUTABLE)
  _73 = loop_len_64 / 2;
  _75 = loop_len_65 / 2;
  if (_21 != 0)
goto ; [0.00%]
  else
goto ; [100.00%]

Looking into why the last update didn't happen.  I think I just need to get cc1
working so will try to build an rv32 cross.

[Bug tree-optimization/116956] [15 Regression] ICE when building PALM with -O3 -march=armv9-a in vect_analyze_loop_1 after r15-2192-g0c5c0c959c2e59

2024-10-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116956

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #4 from Tamar Christina  ---
Fixed, thanks for the report.

[Bug rtl-optimization/117012] [15 Regression] incorrect RTL simplification around vector AND and shifts

2024-10-14 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117012

Tamar Christina  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #9 from Tamar Christina  ---
Fixed.

[Bug rtl-optimization/117012] [15 Regression] incorrect RTL simplification around vector AND and shifts

2024-10-13 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117012

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #7 from Tamar Christina  ---
Patch is ready for posting tomorrow.

[Bug tree-optimization/117031] Inefficient codegen of multistep zero-extends and LOAD_LANES

2024-10-09 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031

--- Comment #7 from Tamar Christina  ---
Indeed, checking this in get_group_load_store_type gets the result I am after.

I'll clean this up if that's OK with you?

[Bug tree-optimization/117031] Inefficient codegen of multistep zero-extends and LOAD_LANES

2024-10-09 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031

Tamar Christina  changed:

   What|Removed |Added

Summary|increasing VF during SLP|Inefficient codegen of
   |vectorization permutes  |multistep zero-extends and
   |unnecessarily   |LOAD_LANES

--- Comment #6 from Tamar Christina  ---
> ends up
> 
> .L4:
> ld4 {v28.8h - v31.8h}, [x4], 64
> add x3, x3, 64
> uaddl2  v26.4s, v28.8h, v29.8h
> uaddl   v28.4s, v28.4h, v29.4h
> uaddw2  v0.4s, v26.4s, v30.8h
> uaddw   v28.4s, v28.4s, v30.4h
> uaddw2  v0.4s, v0.4s, v31.8h
> uaddw   v28.4s, v28.4s, v31.4h
> sxtl    v1.2d, v0.2s
> sxtl    v27.2d, v28.2s
> sxtl2   v0.2d, v0.4s
> sxtl2   v28.2d, v28.4s
> scvtf   v1.2d, v1.2d
> scvtf   v27.2d, v27.2d
> scvtf   v0.2d, v0.2d
> scvtf   v28.2d, v28.2d
> stp q1, q0, [x3, -32]
> stp q27, q28, [x3, -64]
> cmp x5, x4
> bne .L4
> 
> we can now use widening plus and avoid the HI -> DF conversion penalty.

The conversion penalty is not the issue. This case is still bad since you're
sign extending a zero extend, which is just a zero extend anyway.

But again, I'm not complaining about that here though..
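
For reference, a scalar sketch of why the sign-extend chain is redundant:

#include <stdint.h>

int64_t widen (uint16_t x)
{
  int32_t t = x;        /* zero extend HI -> SI: t is always >= 0 */
  return (int64_t) t;   /* sign extend SI -> DI: equal to a zero extend,
                           since the sign bit of t is never set */
}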

> It uses interleaving because there's no ld8 and when
> vect_lower_load_permutations decides heuristically to use load-lanes it
> tries to do so vector-size agnostic so it doesn't consider using two times
> ld4.
> 
> There _are_ permutes because of the use of 4 lanes to compute the single
> lane store in the reduction operation.  The vectorization for the unrolled
> loop not using load-lanes show them:
> 
>   vect_a1_53.10_234 = MEM  [(short unsigned
> int *)vectp_x.8_232];
>   vectp_x.8_235 = vectp_x.8_232 + 16;
>   vect_a1_53.11_236 = MEM  [(short unsigned
> int *)vectp_x.8_235];
>   vectp_x.8_237 = vectp_x.8_232 + 32;
>   vect_a1_53.12_238 = MEM  [(short unsigned
> int *)vectp_x.8_237];
>   vectp_x.8_239 = vectp_x.8_232 + 48;
>   vect_a1_53.13_240 = MEM  [(short unsigned
> int *)vectp_x.8_239];
>   _254 = VEC_PERM_EXPR  9, 11, 13, 15 }>;
>   _255 = VEC_PERM_EXPR  9, 11, 13, 15 }>;
>   _286 = VEC_PERM_EXPR <_254, _255, { 1, 3, 5, 7, 9, 11, 13, 15 }>;
> ...
> 
> that's simply load-lanes open-coded.  If open-coding ld4 is better than using
> ld4 just make it not available to the vectorizer?  Similar to ld2 I suppose.

So I now realize that I missed a step in what LLVM is doing here.
You're right in that there is a permute here, but open-coding the permute is
better for this case since the zero extension from HI to DI can be done using
a permute as well.  And this zero-extension permute can fold the load-lanes
permute into itself.  Which seems to be the trick LLVM is doing.

In my patch I added a new target hook
targetm.vectorize.use_permute_for_promotion
that decides when to use a permute vs an unpack.  I think the right solution
here is to tweak the LOAD_LANES heuristics to not use it when the results feed
into another permute, since that permute can be optimized reducing the total
number of permutes needed.
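
For illustration, the permute form of a zero extension in C vector-extension
terms (the function name and little-endian lane indices are mine):

typedef unsigned short v8hi __attribute__((vector_size (16)));

/* Interleaving the data with a zero vector widens each 16-bit lane to
   32 bits with a zero high half; viewed as V4SI this is the zero
   extension of the low four lanes.  A load-lanes permute feeding this
   can be folded into the indices here.  */
v8hi zext_lo_as_permute (v8hi x)
{
  v8hi zero = { 0 };
  return __builtin_shufflevector (x, zero, 0, 8, 1, 9, 2, 10, 3, 11);
}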

Does that sound good to you?

[Bug tree-optimization/117031] increasing VF during SLP vectorization permutes unnecessarily

2024-10-09 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031

--- Comment #3 from Tamar Christina  ---
(In reply to Richard Biener from comment #2)
> (In reply to Tamar Christina from comment #0)
> > GCC seems to miss that there is no gap between the group accesses and that
> > stride == 1.
> > test3 is vectorized linearly by GCC, so it seems this is missed optimization
> > in data ref analysis?
> 
> The load-lanes look fine, so it must be the code generation for the
> HI to DI via SI conversions using unpacks you are complaining about?
> 

No, that one I have a patch for.

> Using load-lanes is natural here.
> 
> This isn't about permutes due to VF or so, isn't it?

It is: the load-lanes is unnecessary because there is no permute during the
loop; the group size is equal to the stride and the offsets are linear.

LOAD_LANES are really expensive, especially the 4-register ones.

My complaint is that this loop does not have a permute.  While it may look
like the entries are permuted, they are not.

Essentially test1 and test3 are the same: the vectorizer picks VF=8, so it
unrolls test1 into test3, but fails to see that the unrolled code is linear;
when manually unrolled it does:

e.g.

void
test3 (unsigned short *x, double *y, int n)
{
for (int i = 0; i < n; i+=2)
{
unsigned short a1 = x[i * 4 + 0];
unsigned short b1 = x[i * 4 + 1];
unsigned short c1 = x[i * 4 + 2];
unsigned short d1 = x[i * 4 + 3];
y[i+0] = (double)a1 + (double)b1 + (double)c1 + (double)d1;
unsigned short a2 = x[(i + 1) * 4 + 0];
unsigned short b2 = x[(i + 1) * 4 + 1];
unsigned short c2 = x[(i + 1) * 4 + 2];
unsigned short d2 = x[(i + 1) * 4 + 3];
y[i+1] = (double)a2 + (double)b2 + (double)c2 + (double)d2;
}
}

does not use LOAD_LANES.

[Bug tree-optimization/117032] missing vectorization of V4HI to V4DF

2024-10-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117032

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1
   Last reconfirmed||2024-10-08

--- Comment #1 from Tamar Christina  ---
Confirmed.

Note that we miss the general case (e.g. without the FLOAT_EXPR) for both
direct and indirect.

static inline 
void g(unsigned short *a, int *b)
{
b[0] = a[0];
b[1] = a[1];
b[2] = a[2];
b[3] = a[3];
}
static inline 
void g2(int *a, long long *b)
{
b[0] = a[0];
b[1] = a[1];
b[2] = a[2];
b[3] = a[3];
}

void indirect(unsigned short *a, long long *b)
{
  int t[4];
  g(a, t);
  g2(t, b);
}

void direct(unsigned short *a, long long *b)
{
b[0] = a[0];
b[1] = a[1];
b[2] = a[2];
b[3] = a[3];
}

This is due to the way vec_unpack is implemented in the backend.
We rewrite the mode to VHALF to model the intermediate vec_select; however,
that means V4HI is unmodelable, as it would require a V2HI.

I have a patch that adds direct conversion support, but this fails atm due
to the aarch64 backend not supporting 64-bit to 128-bit vector permutes yet
until another series of mine gets in.
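
For reference, the "direct conversion" in C vector-extension terms (a sketch):

typedef unsigned short v4hi __attribute__((vector_size (8)));
typedef int v4si __attribute__((vector_size (16)));

/* Convert all four lanes in one step instead of via lo/hi unpacks,
   which would need the unmodelable V2HI intermediate.  */
v4si direct_conv (v4hi x)
{
  return __builtin_convertvector (x, v4si);
}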

[Bug tree-optimization/117031] New: increasing VF during SLP vectorization permutes unnecessarily

2024-10-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031

Bug ID: 117031
   Summary: increasing VF during SLP vectorization permutes
unnecessarily
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following testcase:

---
void
test1 (unsigned short *x, double *y, int n)
{
for (int i = 0; i < n; i++)
{
unsigned short a = x[i * 4 + 0];
unsigned short b = x[i * 4 + 1];
unsigned short c = x[i * 4 + 2];
unsigned short d = x[i * 4 + 3];
y[i] = (double)a + (double)b + (double)c + (double)d;
}
}
---

at -O3 vectorizes using LOAD_LANES on aarch64:

  vect_array.11 = .LOAD_LANES (MEM  [(short unsigned
int *)vectp_x.9_123]);
  vect_a_29.12_125 = vect_array.11[0];
  vect__14.17_129 = [vec_unpack_lo_expr] vect_a_29.12_125;
  vect__14.17_130 = [vec_unpack_hi_expr] vect_a_29.12_125;
  vect__14.16_131 = [vec_unpack_lo_expr] vect__14.17_129;
  vect__14.16_132 = [vec_unpack_hi_expr] vect__14.17_129;
  vect__14.16_133 = [vec_unpack_lo_expr] vect__14.17_130;
  vect__14.16_134 = [vec_unpack_hi_expr] vect__14.17_130;
  vect__14.18_135 = (vector(2) double) vect__14.16_131;
  vect__14.18_136 = (vector(2) double) vect__14.16_132;
  vect__14.18_137 = (vector(2) double) vect__14.16_133;
  vect__14.18_138 = (vector(2) double) vect__14.16_134;
...


because the input type is 4 shorts, so V4HI is the natural size. V4HI fails to
vectorize because
we don't support direct conversion from V4HI to V4SI.

We then pick a higher VF (V8HI) and the loads are detected as interleaving.
LLVM however avoids the permute here by detecting that the unrolling doesn't
result in a permuted access, as it's equivalent to:

void
test3 (unsigned short *x, double *y, int n)
{
for (int i = 0; i < n; i+=2)
{
unsigned short a1 = x[i * 4 + 0];
unsigned short b1 = x[i * 4 + 1];
unsigned short c1 = x[i * 4 + 2];
unsigned short d1 = x[i * 4 + 3];
y[i+0] = (double)a1 + (double)b1 + (double)c1 + (double)d1;
unsigned short a2 = x[i * 4 + 4];
unsigned short b2 = x[i * 4 + 5];
unsigned short c2 = x[i * 4 + 6];
unsigned short d2 = x[i * 4 + 7];
y[i+1] = (double)a2 + (double)b2 + (double)c2 + (double)d2;
}
}

GCC seems to miss that there is no gap between the group accesses and that
stride == 1.
test3 is vectorized linearly by GCC, so it seems this is missed optimization in
data ref analysis?

[Bug rtl-optimization/117012] [15 Regression] incorrect RTL simplification around vector AND and shifts

2024-10-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117012

--- Comment #6 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #2)
> But it is still only a GCC 15 regression.
> 
> GCC 14 produced:
> ```
> cmltv0.16b, v0.16b, #0
> adrpx0, .LC0
> ldr q31, [x0, #:lo12:.LC0]
> and v0.16b, v0.16b, v31.16b
> ```
> For the above one.

which shows the other regression, the match.md pattern de-optimizes the aarch64
codegen because it changes the shift into an unsigned shift, which means the
cmlt optimization doesn't trigger anymore.

In this particular case since you need the AND anyway the pattern shouldn't
match.

[Bug rtl-optimization/117012] [15 Regression] incorrect RTL simplification around vector AND and shifts

2024-10-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117012

--- Comment #1 from Tamar Christina  ---
bisect landed on

g:aac00d09859cc5934bd0f7493d537b8430337773 is the first bad commit
commit aac00d09859cc5934bd0f7493d537b8430337773
Author: liuhongt 
Date:   Thu Jun 20 12:41:13 2024 +0800

Optimize a < 0 ? -1 : 0 to (signed)a >> 31.

Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31
and x < 0 ? 1 : 0 into (unsigned) x >> 31.

Move the optimization did in ix86_expand_int_vcond to match.pd

gcc/ChangeLog:

PR target/114189
* match.pd: Simplify a < 0 ? -1 : 0 to (signed) >> 31 and a <
0 ? 1 : 0 to (unsigned) a >> 31 for vector integer type.

gcc/testsuite/ChangeLog:

* gcc.target/i386/avx2-pr115517.c: New test.
* gcc.target/i386/avx512-pr115517.c: New test.
* g++.target/i386/avx2-pr115517.C: New test.
* g++.target/i386/avx512-pr115517.C: New test.
* g++.dg/tree-ssa/pr88152-1.C: Adjust testcase.

 gcc/match.pd| 30 +++
 gcc/testsuite/g++.dg/tree-ssa/pr88152-1.C   |  2 +-
 gcc/testsuite/g++.target/i386/avx2-pr115517.C   | 60 +
 gcc/testsuite/g++.target/i386/avx512-pr115517.C | 70 +
 gcc/testsuite/gcc.target/i386/avx2-pr115517.c   | 33 
 gcc/testsuite/gcc.target/i386/avx512-pr115517.c | 70 +
 6 files changed, 264 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/g++.target/i386/avx2-pr115517.C
 create mode 100644 gcc/testsuite/g++.target/i386/avx512-pr115517.C
 create mode 100644 gcc/testsuite/gcc.target/i386/avx2-pr115517.c
 create mode 100644 gcc/testsuite/gcc.target/i386/avx512-pr115517.c

but I guess that's just exposing the issue.
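
For reference, the scalar identity the commit implements (a sketch; valid for
a 32-bit signed value with arithmetic right shift):

#include <stdint.h>

int32_t sel (int32_t a)
{
  /* a < 0 ? -1 : 0 is the sign bit broadcast to all 32 bits,
     i.e. an arithmetic shift right by 31.  */
  return a >> 31;
}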

[Bug rtl-optimization/117012] New: [15 Regression] incorrect RTL simplification around vector AND and shifts

2024-10-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117012

Bug ID: 117012
   Summary: [15 Regression] incorrect RTL simplification around
vector AND and shifts
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: wrong-code
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

The following example:

#include <arm_neon.h>
#include <stdint.h>

uint8x16_t f (uint8x16_t x)
{
  uint8x16_t mask = vreinterpretq_u8_u64(vdupq_n_u64 (0x101));
  return vandq_u8(vcltq_s8(vreinterpretq_s8_u8(x), vdupq_n_s8(0)), mask);
}

compiled at -O3 gives the following:

f:
ushr    v0.16b, v0.16b, 7
ret

This is incorrect as it assumes that the value in every lane for the AND was
0x1, where in fact only the bottom lane is.
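
A scalar reference for what f must compute per byte lane (a sketch written for
illustration, not from the bug report):

#include <stdint.h>

void ref (const uint8_t x[16], uint8_t out[16])
{
  /* vdupq_n_u64 (0x101) viewed as bytes (little-endian).  */
  static const uint8_t mask[16] = { 1, 1, 0, 0, 0, 0, 0, 0,
                                    1, 1, 0, 0, 0, 0, 0, 0 };
  for (int i = 0; i < 16; i++)
    out[i] = ((x[i] & 0x80) ? 0xff : 0x00) & mask[i];
  /* ushr #7 computes x[i] >> 7 in every lane, which is only equivalent
     when mask[i] == 1 for all i.  */
}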

combine is matching this incorrect pattern:

Trying 7, 6 -> 8:
7: r108:V16QI=const_vector
6: r107:V16QI=r109:V16QI>>const_vector
  REG_DEAD r109:V16QI
8: r106:V16QI=r107:V16QI&r108:V16QI
  REG_DEAD r108:V16QI
  REG_DEAD r107:V16QI
  REG_EQUAL r107:V16QI&const_vector
Successfully matched this instruction:
(set (reg:V16QI 106 [ _5 ])
(lshiftrt:V16QI (reg:V16QI 109 [ xD.22802 ])
(const_vector:V16QI [
(const_int 7 [0x7]) repeated x16
])))

The optimization seems to only look at the bottom lane of the vector:

#include <arm_neon.h>
#include <stdint.h>

uint8x16_t f (uint8x16_t x)
{
  uint8x16_t mask = vreinterpretq_u8_u64(vdupq_n_u64 (0x301));
  return vandq_u8(vcltq_s8(vreinterpretq_s8_u8(x), vdupq_n_s8(0)), mask);
}

also generates incorrect code but changing the bottom lane

#include <arm_neon.h>
#include <stdint.h>

uint8x16_t f (uint8x16_t x)
{
  uint8x16_t mask = vreinterpretq_u8_u64(vdupq_n_u64 (0x102));
  return vandq_u8(vcltq_s8(vreinterpretq_s8_u8(x), vdupq_n_s8(0)), mask);
}

gives the right result.

[Bug tree-optimization/116855] [14/15 Regression] Unsafe early-break vectorization

2024-10-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116855

--- Comment #6 from Tamar Christina  ---
> Actually, what I wish is that we could allow vectorization on early break
> case for arbitrary address pointer (knowing nothing about alignment and
> bound) based on some sort of assumption specified via command option under
> -Ofast, as the mentioned example:
> 
> char * find(char *string, size_t n, char c)
> {
> for (size_t i = 0; i < n; i++) {
> if (string[i] == c)
> return &string[i];
> }
> return 0;
> }
> 
> and example for which there is no way to do peeling to align more than one
> address pointers:
> 
> int compare(char *string1, char *string2, size_t n)
> {
> for (size_t i = 0; i < n; i++) {
> if (string1[i] != string2[i])
> return string1[i] - string2[i];
> }
> return 0;
> }

In Alex's patch for alignment peeling we'll allow them by inserting a mutual
alignment check between all the pointers when not fully masked.  This should be
upstream soon.
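
A sketch of what such a mutual-alignment guard amounts to (the helper and
VECSIZE are hypothetical names, not the actual implementation):

#include <stdint.h>

#define VECSIZE 16   /* assumed vector access size */

/* If both pointers have the same misalignment, peeling the loop until
   one is aligned aligns the other too, making the vector loads safe.  */
static int mutually_aligned (const char *a, const char *b)
{
  return ((uintptr_t) a % VECSIZE) == ((uintptr_t) b % VECSIZE);
}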

For fully masked architectures we can do better by using First Faulting Loads
here, but that won't be ready for GCC 15.

[Bug tree-optimization/116956] [15 Regression] ICE when building PALM with -O3 -march=armv9-a in vect_analyze_loop_1 after r15-2192-g0c5c0c959c2e59

2024-10-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116956

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #2 from Tamar Christina  ---
The code in vect_analyze_loop_1 seems to indicate that epilogue loops can't
fail analysis.

Presumably because the original code assumes that the main and epilogue loops
are being analyzed for the same ISA?

In this case the main loop is Adv. SIMD and the epilogue is SVE.

I'm not sure why it's fatal though.. why must epilogue analysis always succeed
even for a mode which didn't succeed for the main loop? e.g. in this case
VNx16QI fails for the main loop as well:

note:   vect_is_simple_use: vectype vector([4,4]) 
missed:   not vectorized: relevant stmt not supported: patt_113 = _60 ^ 1;

It's weird that the analysis thinks it can't invert the mask.. that's a
separate issue that I'll look into as well.

But the main reason for the ICE is that the access in the DR has a
non-constant step:

Creating dr for *_8
analyze_innermost: Applying pattern match.pd:5184, generic-match-9.cc:4041
Applying pattern match.pd:236, generic-match-3.cc:3751
success.
base_address: u.0_27
offset from base address: 0
constant offset from base address: 0
step: (ssizetype) (stride.5_19 * 4)
base alignment: 4
base misalignment: 0
offset alignment: 128
step alignment: 4
base_object: MEM[(real(kind=4) *)u.0_27]
Access function 0: {0B, +, (sizetype) (stride.5_19 * 4)}_1

but we get to vect_analyze_data_ref_access with STMT_VINFO_STRIDED_P not set.
It looks like the strided access handling code is missing handling for masked
loads.

Mine.

[Bug tree-optimization/116855] [14/15 Regression] Unsafe early-break vectorization

2024-10-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116855

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> I would suggest to add a STMT_VINFO_ENSURE_NOTRAP or so and delay actual
> verification to vectorizable_load when both vector type and VF are fixed.
> We'd then likely need a LOOP_VINFO_MUST_USE_PARTIAL_VECTORS_P set
> conservatively the other way around from
> LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P.
> Alignment peeling could then peel if STMT_VINFO_ENSURE_NOTRAP and the target
> cannot do full loop masking.


Yeah, the originally reported testcase is fine, as the alignment makes it safe.
For the manually misaligned case that Andrew posted, it makes sense to delay
and re-evaluate later on.

I don't think we should bother peeling though; I don't think they're that
common, and alignment peeling breaks some dominators and exposes some existing
vectorizer bugs, which are being fixed in Alex's patch.

So at least alignment peeling I'll defer to a later point and instead just
reject loops that are loading from structures the user misaligned wrt the
vector mode.


So mine..

[Bug tree-optimization/116950] New: IVopts missed unification of duplicate IVs

2024-10-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116950

Bug ID: 116950
   Summary: IVopts missed unification of duplicate IVs
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

consider the following  testcase extracted from a library:

#include <arm_sve.h>
#include <stdint.h>

void func1(uint32_t n, const float32_t *a, float32_t *c) {
  uint32_t num_lanes = svcntw();
  svbool_t pg = svptrue_b32();
  uint32_t i = 0;
  for (; (i + 4) * num_lanes <= n; i += 4) {
svfloat32_t vec_a = svld1_vnum_f32(pg, a, i);
svfloat32_t vec_c = svmul_f32_x(pg, vec_a, vec_a);
svst1_vnum_f32(pg, c, i, vec_c);
  }
  return;
}

void func2(uint32_t n, const float32_t *a, float32_t *c) {
  uint32_t num_lanes = svcntw();
  svbool_t pg = svptrue_b32();
  int32_t i = 0;
  for (; (i + 4) * num_lanes <= n; i += 4) {
svfloat32_t vec_a = svld1_vnum_f32(pg, a, i);
svfloat32_t vec_c = svmul_f32_x(pg, vec_a, vec_a);
svst1_vnum_f32(pg, c, i, vec_c);
  }
  return;
}

and compiled with -O3 -march=armv9-a.

They differ only in the sign of i, the loop IV.  The first one is more
canonical as it avoids comparing signed to unsigned values.

However the first loop produces quite bad code:

.L3:
uxtw    x3, w6
cmp     w0, w5
add     w6, w6, 4
incb    x5
mul     x3, x3, x8
add     x7, x1, x3
add     x3, x2, x3
ld1w    z31.s, p7/z, [x7]
fmul    z31.s, z31.s, z31.s
st1w    z31.s, p7, [x3]
bcs     .L3

while the second one:

.L3:
ld1w    z31.s, p7/z, [x1, x3, lsl 2]
fmul    z31.s, z31.s, z31.s
st1w    z31.s, p7, [x2, x3, lsl 2]
incb    x3
addvl   x4, x3, #1
cmp w0, w4
bcs .L3

This leads to a 40% performance difference between the two loops.

It seems that in the second case IVopts doesn't merge the two IVs, e.g. the
first one has as input to IVopts:

   [local count: 955630224]:
  # _21 = PHI <_1(6), 4(5)>
  # i_22 = PHI <_21(6), 0(5)>
...
  _1 = _21 + 4;
  _2 = _1 * POLY_INT_CST [4, 4];
  if (_2 <= n_6(D))
goto ; [89.00%]

and second one:

   [local count: 955630224]:
  # _23 = PHI <_1(6), 4(5)>
  # i_24 = PHI <_23(6), 0(5)>
...
  _1 = _23 + 4;
  _2 = (unsigned int) _1;
  _3 = _2 * POLY_INT_CST [4, 4];
  if (_3 <= n_7(D))
goto ; [89.00%]

I'm not sure if this is the exact cause, but for the first one niters seems to
fail:

;;
;; Loop 1
;;  header 3, latch 6
;;  depth 1, outer 0
;;  niter scev_not_known
;;  iterations by profile: 8.090909 (unreliable, maybe flat) entry
count:105119324 (estimated locally, freq 0.8900)
;;  nodes: 3 6
Processing loop 1 at /app/example.c:9
  single exit 3 -> 4, exit condition if (_2 <= n_6(D))

and thinks the IVs can overflow wrt niters:

IV struct:
  SSA_NAME: i_22
  Type: uint32_t
  Base: 0
  Step: 4
  Biv:  N
  Overflowness wrto loop niter: Overflow

etc.

I guess this is because, for the unsigned case, if N is large enough the loop
may not terminate at all due to overflow wrapping, but -ffinite-loops doesn't
seem to help.  The signed case does exhibit the same behavior when -fwrapv is
used to indicate the overflow behavior.
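
A concrete instance of the wrap (a sketch, with num_lanes == 4 for a 128-bit
vector):

#include <stdint.h>

/* With n == UINT32_MAX and i == 0x3ffffffd:
   (i + 4) * 4 == 0x100000004, which wraps to 4 in uint32_t, so the exit
   test (i + 4) * num_lanes <= n holds again and the loop need not
   terminate.  */
uint32_t exit_test_value (uint32_t i, uint32_t num_lanes)
{
  return (i + 4) * num_lanes;
}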

but the signed case:

;;
;; Loop 1
;;  header 3, latch 6
;;  depth 1, outer 0, finite_p
;;  niter scev_not_known
;;  upper_bound 536870909
;;  likely_upper_bound 536870909
;;  iterations by profile: 8.090909 (unreliable, maybe flat) entry
count:105119324 (estimated locally, freq 0.8900)
;;  nodes: 3 6
Processing loop 1 at /app/example.c:23
  single exit 3 -> 4, exit condition if (_3 <= n_7(D))

IV struct:
  SSA_NAME: i_24
  Type: int32_t
  Base: 0
  Step: 4
  Biv:  N
  Overflowness wrto loop niter: No-overflow

So the reason for this bug report is to see if we can't do anything about it.
When -ffinite-loops is in effect, could we not assume such loops terminate and
are bounded by UINT_MAX?

p.s. even though -ffinite-loops seems to be default at -O2, adding
-ffinite-loops explicitly does seem to toggle something in niters.

[Bug tree-optimization/116817] [15 Regression] ICE on libajantv2-16.2: in compute_live_loop_exits, at tree-ssa-loop-manip.cc:250 since r15-3768-g4150bcd205ebb6

2024-09-30 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116817

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Tamar Christina  ---
Fixed. thanks for the report.

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2024-09-30 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #33 from Tamar Christina  ---
Many Thanks Mikael!

I see the functions being inlined now!

[Bug tree-optimization/116817] [15 Regression] ICE on libajantv2-16.2: in compute_live_loop_exits, at tree-ssa-loop-manip.cc:250 since r15-3768-g4150bcd205ebb6

2024-09-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116817

--- Comment #5 from Tamar Christina  ---
Waiting for regression testing to finish and will submit.

The condition used before to check for loop invariant is !internal_def.

This of course fails when it's a reduction, which is what happened here.
I now just check for the external or constant def.

For reductions I leave it as it's always been, since we don't support
codegen for these kinds of reductions anyway.

[Bug tree-optimization/116817] [15 Regression] ICE on libajantv2-16.2: in compute_live_loop_exits, at tree-ssa-loop-manip.cc:250 since r15-3768-g4150bcd205ebb6

2024-09-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116817

--- Comment #4 from Tamar Christina  ---
I see,

I haven't checked that the value being compared is actually loop invariant.

In this case it's a reduction value and it can't be lifted out of the loop.

Basically in GCC 14 and earlier this was a non-vectorizable loop as we don't
support this reduction.  So the mask handling code didn't matter there.

but yeah, in the invariant handling code I need to check that the value is
actually an external/const before adding it to the invariant seq, otherwise add
it to the normal pattern seq.

Testing a fix

[Bug tree-optimization/116817] [15 Regression] ICE on libajantv2-16.2: in compute_live_loop_exits, at tree-ssa-loop-manip.cc:250 since r15-3768-g4150bcd205ebb6

2024-09-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116817

Tamar Christina  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #3 from Tamar Christina  ---
mine.

[Bug tree-optimization/116812] [15 Regression] ICE on valid code at -O2 with "-fno-tree-dce -fno-tree-dse" on x86_64-linux-gnu: verify_flow_info failed

2024-09-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116812

Tamar Christina  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #5 from Tamar Christina  ---
Fixed

[Bug tree-optimization/116817] [15 Regression] ICE on libajantv2-16.2: in compute_live_loop_exits, at tree-ssa-loop-manip.cc:250

2024-09-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116817

Tamar Christina  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
   Last reconfirmed||2024-09-23

--- Comment #2 from Tamar Christina  ---
Ah wait, didn't read.

[Bug tree-optimization/116812] [15 Regression] ICE on valid code at -O2 with "-fno-tree-dce -fno-tree-dse" on x86_64-linux-gnu: verify_flow_info failed

2024-09-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116812

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|UNCONFIRMED |ASSIGNED
 Ever confirmed|0   |1
   Last reconfirmed||2024-09-22

--- Comment #2 from Tamar Christina  ---
thanks, testing patch

[Bug target/116371] The SME2 svpext intrinsics are missing a _lane suffix

2024-09-21 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116371

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina  ---
svpsel_lane is also misnamed.

[Bug tree-optimization/116583] vectorizable_slp_permutation cannot handle even/odd extract from VLA vector

2024-09-20 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116583

--- Comment #3 from Tamar Christina  ---
(In reply to Richard Biener from comment #2)
> Another example this shows is for gcc.dg/vect/slp-42.c - we definitely can
> do the interleaving scheme as non-SLP vectorization shows.
> 
> gcc.dg/vect/slp-42.c also shows we're not yet "lowering" all SLP load
> permutes.
> The original SLP attempt still has
> 
>node 0x45d5050 (max_nunits=4, refcnt=2) vector([4,4]) int
>op template: _2 = q[_1];
> stmt 0 _2 = q[_1];
> stmt 1 _8 = q[_7];
> stmt 2 _14 = q[_13];
> stmt 3 _20 = q[_19];
> load permutation { 0 1 2 3 }
>node 0x45d50e8 (max_nunits=4, refcnt=2) vector([4,4]) int
>op template: _4 = q[_3];
> stmt 0 _4 = q[_3];
> stmt 1 _10 = q[_9];
> stmt 2 _16 = q[_15];
> stmt 3 _22 = q[_21];
> load permutation { 4 5 6 7 }
> 
> instead of a single contiguous load and two VEC_PERM_EXPR nodes to extract
> the lo/hi parts (which is also extract even/odd, but with a larger mode
> encompassing 4 elements).
> 
> I'd say for VLA operation this is one of the major blockers for all-SLP.

I'll take a look if Richard hasn't yet, once I finish the early break
transition :)

[Bug tree-optimization/116684] [vectorization][x86-64] dot_16x1x16_uint8_int8_int32 could be better optimized

2024-09-11 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116684

Tamar Christina  changed:

   What|Removed |Added

 CC||victorldn at gcc dot gnu.org

--- Comment #2 from Tamar Christina  ---
(In reply to ktkachov from comment #1)
> Indeed. Curiously, for aarch64 at -O2 GCC is smart enough to recognise a
> USDOT instruction but at -O3 (-mcpu=neoverse-v2) it all gets synthesised

Looks like SLP discovery fails to notice it's a reduction.  We do have code to
find + reduction with SLP, but it seems that the issue here is that the store
is used to start the discovery.

The same happens with a normal dotprod

#include <stdint.h>

void
dot_16x1x16_uint8_int8_int32(
   uint8_t data[restrict 4],
   uint8_t kernel[restrict 16][4],
   int32_t output[restrict 16])
{
  for (int i = 0; i < 16; i++)
for (int k = 0; k < 4; k++)
  output[i] += data[k] * kernel[i][k];
}

> The O3 version does fully unroll the loop so it's probably better but maybe
> it could do a better job of using USDOT?

Yeah, we could get the same effect by implementing the
vect_recog_widen_sum_pattern using dotprod accumulating into a zero register,
and then combine should be able to do the right things.

Victor had a patch at some point I think...

But the real fix is teaching SLP discovery that there's a reduction here.
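
For reference, the widen-sum-as-dotprod rewrite in scalar terms (a minimal
sketch):

#include <stdint.h>

/* A widening sum is a dot product against an all-ones operand, so it can
   use the dot-product instruction accumulating into the running sum.  */
int32_t widen_sum (const uint8_t *a, int n)
{
  int32_t sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i] * 1;   /* == sum += (int32_t) a[i] */
  return sum;
}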

[Bug target/116667] New: missing superfluous zero-extends of SVE values

2024-09-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116667

Bug ID: 116667
   Summary: missing superfluous zero-extends of SVE values
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

We've recently started vectorizing functions such as:

void
decode (unsigned char * restrict h, unsigned char * restrict p4,
unsigned char * restrict p6, int f, int b, char * restrict e,
char * restrict a, char * restrict i)
{
int j = b % 8;
for (int k = 0; k < 2; ++k)
{
p4[k] = i[a[k]] | e[k] << j;
h[k] = p6[k] = a[k];
}
}

due to the vectorizer now correctly eliding one of the loads making it
profitable. Using -O3 -march=armv9-a now vectorizes and generates:

decode:
ptrue   p7.s, vl2
ptrue   p6.b, all
ld1b    z31.s, p7/z, [x6]
ld1b    z28.s, p7/z, [x5]
and     w4, w4, 7
movprfx z0, z31
uxtb    z0.s, p6/m, z31.s
mov     z30.s, w4
ld1b    z29.s, p7/z, [x7, z0.s, uxtw]
lslr    z30.s, p6/m, z30.s, z28.s
orr     z30.d, z30.d, z29.d
st1b    z30.s, p7, [x1]
st1b    z31.s, p7, [x2]
st1b    z31.s, p7, [x0]
ret

where as we used to generate:

decode:
ptrue   p7.s, vl2
and     w4, w4, 7
ld1b    z0.s, p7/z, [x6]
ld1b    z28.s, p7/z, [x5]
ld1b    z29.s, p7/z, [x7, z0.s, uxtw]
ld1b    z31.s, p7/z, [x6]
mov     z30.s, w4
ptrue   p6.b, all
lslr    z30.s, p6/m, z30.s, z28.s
orr     z30.d, z30.d, z29.d
st1b    z30.s, p7, [x1]
st1b    z31.s, p7, [x2]
st1b    z31.s, p7, [x0]
ret

This is great; however, we're let down by RTL optimization.

There's a couple of weird things here.  Cleaning up the sequence a bit, the
problematic parts are:

ptrue   p7.s, vl2
ptrue   p6.b, all
ld1b    z31.s, p7/z, [x6]
movprfx z0, z31
uxtb    z0.s, p6/m, z31.s
ld1b    z29.s, p7/z, [x7, z0.s, uxtw]

It zero extends the same value in z31 three times.  In the old code we actually
loaded the same value twice, both zero extended and not zero extended.

The RTL for the z31 + extend is

(insn 15 13 16 2 (set (reg:VNx4QI 110 [ vect__3.6 ])
(unspec:VNx4QI [
(subreg:VNx4BI (reg:VNx16BI 120) 0)
(mem:VNx4QI (reg/v/f:DI 117 [ a ]) [0  S[4, 4] A8])
] UNSPEC_LD1_SVE)) "/app/example.c":9:24 5683
{maskloadvnx4qivnx4bi}
 (expr_list:REG_DEAD (reg/v/f:DI 117 [ a ])
(expr_list:REG_EQUAL (unspec:VNx4QI [
(const_vector:VNx4BI [
(const_int 1 [0x1]) repeated x2
repeat [
(const_int 0 [0])
(const_int 0 [0])
]
])
(mem:VNx4QI (reg/v/f:DI 117 [ a ]) [0  S[4, 4] A8])
] UNSPEC_LD1_SVE)
(nil
(insn 16 15 17 2 (set (reg:VNx16BI 122)
(const_vector:VNx16BI repeat [
(const_int 1 [0x1])
])) 5658 {*aarch64_sve_movvnx16bi}
 (nil))
(insn 17 16 20 2 (set (reg:VNx4SI 121 [ vect_patt_59.7_52 ])
(unspec:VNx4SI [
(subreg:VNx4BI (reg:VNx16BI 122) 0)
(zero_extend:VNx4SI (reg:VNx4QI 110 [ vect__3.6 ]))
] UNSPEC_PRED_X)) 6943 {*zero_extendvnx4qivnx4si2}
 (expr_list:REG_EQUAL (zero_extend:VNx4SI (reg:VNx4QI 110 [ vect__3.6 ]))
(nil)))

But combine refuses the merge of the zero extend into the load,

deferring rescan insn with uid = 15.
allowing combination of insns 15 and 17
original costs 4 + 4 = 8
replacement costs 4 + 4 = 8
i2 didn't change, not doing this

and instead copies it into the gather load, but leaves insn 17 alone,
presumably because of the predicate.  So it looks like a bug in our backend
costing.  The widening load is definitely cheaper than load + extend.

However I'm not sure as the line "i2 didn't change, not doing this" seems to
indicate that it wasn't rejected because of cost?

In the codegen there's a peculiarity in that while the two loads

ld1b    z31.s, p7/z, [x6]
ld1b    z28.s, p7/z, [x5]

are both widening loads, yet they aren't modelled the same:

ld1b    z31.s, p7/z, [x6]   // 15 [c=4 l=4]  maskloadvnx4qivnx4bi
ld1b    z28.s, p7/z, [x5]   // 50 [c=4 l=4]  aarch64_load_zero_extendvnx4sivnx4qi

This is because the RTL pattern seems to want to keep the same number of
elements as the input vector size. So it ends up with a gather an

[Bug tree-optimization/115130] [meta-bug] early break vectorization

2024-09-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130
Bug 115130 depends on bug 115866, which changed state.

Bug 115866 Summary: missed optimization vectorizing switch statements.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115866

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

[Bug tree-optimization/115866] missed optimization vectorizing switch statements.

2024-09-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115866

Tamar Christina  changed:

   What|Removed |Added

 Resolution|FIXED   |---
 Status|RESOLVED|REOPENED

--- Comment #7 from Tamar Christina  ---
The testcase I posted above is still not lowered by ifcvt

https://godbolt.org/z/44vd76eKx

short a[100];

int foo(int n, int counter)
{
   for (int i = 0; i < n; i++)
 {
if (a[i] == 1 || a[i] == 2 || a[i] == 7 || a[i] == 4)
  return 1;
 }
return 0;
}

still produces:

   [local count: 114863530]:
  if (n_8(D) > 0)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 108546036]:

   [local count: 1014686025]:
  # i_2 = PHI 
  _10 = a[i_2];
  switch (_10)  [94.50%], case 1 ... 2:  [5.50%], case 4:
 [5.50%], case 7:  [5.50%]>

   [local count: 55807731]:
:
  goto ; [100.00%]

with last night's compiler.
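
For reference, the scalar form ifcvt would need to lower the switch into (a
sketch; the helper is mine):

/* The case set {1, 2, 4, 7} as a boolean combination the vectorizer can
   handle; the 1..2 range becomes a single unsigned compare.  */
int is_match (short v)
{
  return (unsigned short) (v - 1) <= 1 || v == 4 || v == 7;
}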

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2024-09-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 115866, which changed state.

Bug 115866 Summary: missed optimization vectorizing switch statements.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115866

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

[Bug tree-optimization/116628] [15 Regression] ICE in vect_analyze_loop_1 on aarch64 with -Ofast in TSVC

2024-09-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116628

Tamar Christina  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #8 from Tamar Christina  ---
Fixed, thanks for the report!

[Bug tree-optimization/116628] [15 Regression] ICE in vect_analyze_loop_1 on aarch64 with -Ofast in TSVC

2024-09-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116628

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #6 from Tamar Christina  ---
I'll take this one then.

[Bug tree-optimization/116628] [15 Regression] ICE in vect_analyze_loop_1 on aarch64 with -Ofast in TSVC

2024-09-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116628

--- Comment #5 from Tamar Christina  ---
(In reply to Richard Biener from comment #4)
> Confirmed.  The ICE means we've "fatally" failed to analyze an epilogue
> which we do not expect.
> 
> t.c:4:21: note:   worklist: examine stmt: .MASK_STORE (&a[e_10], 8B, _9 !=
> 0, _1);
> t.c:4:21: note:   vect_is_simple_use: operand _9 != 0, type of def: unknown
> t.c:4:21: missed:   Unsupported pattern.
> 
> possibly the embedded _9 != 0 is the problem?
> 
> t.c:4:21: note:   vect_recog_bool_pattern: detected: _ifc__24 = _9 ? _1 :
> _ifc__22;
> t.c:4:21: note:   bool pattern recognized: patt_8 = _9 != 0 ? _1 : _ifc__22;
> t.c:4:21: note:   vect_recog_cond_store_pattern: detected: a[e_10] =
> _ifc__24;
> t.c:4:21: note:   cond_store pattern recognized: .MASK_STORE (&a[e_10], 8B,
> _9 != 0, _1);

Hmmm if so https://gcc.gnu.org/pipermail/gcc-patches/2024-September/662146.html
should fix it?

[Bug tree-optimization/116628] [15 Regression] ICE in vect_analyze_loop_1 on aarch64 with -Ofast in TSVC

2024-09-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116628

--- Comment #3 from Tamar Christina  ---
Still seems to ICE after that commit on last night's trunk

https://godbolt.org/z/GnYT7Kx46

[Bug tree-optimization/116577] [15 Regression] tonto in SPECCPU 2006 ICEs in vect_lower_load_permutations

2024-09-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116577

--- Comment #3 from Tamar Christina  ---
Reproducer should be saved with extension .f90.

[Bug tree-optimization/116577] [15 Regression] tonto in SPECCPU 2006 ICEs in vect_lower_load_permutations

2024-09-03 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116577

--- Comment #2 from Tamar Christina  ---
---
module type
   type a
   complex(kind(1.0d0)) j
   real(kind(1.0d0)) k
   real(kind(1.0d0)) l
   end type
   contains
   subroutine b(c,g)
type(a), dimension(:) :: c
 target c
 type(a), dimension(:), target :: g
 type(a), pointer :: d,h
 do i=1,e
   h => c(i)
   d  => g(i)
   h%j  = d%j
   h%l  = d%l
   h%k = f
 end do
end
   end

compiled with -mcpu=neoverse-v1 -Ofast reproduces the ICE

[Bug tree-optimization/116575] [15 Regression] blender in SPEC2017 ICE in vect_analyze_slp

2024-09-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116575

--- Comment #1 from Tamar Christina  ---
---
int a;
float *b, *c;
void d() {
  char *e;
  for (; a; a++, b += 4, c += 4)
if (*e++) {
  float *f = c;
  f[0] = b[0];
  f[1] = b[1];
  f[2] = b[2];
  f[3] = b[3];
}
}

compiled with -mcpu=neoverse-v1 -Ofast reproduces the ICE

[Bug tree-optimization/116577] New: [15 Regression] tonto in SPECCPU 2006 ICEs in vect_lower_load_permutations

2024-09-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116577

Bug ID: 116577
   Summary: [15 Regression] tonto in SPECCPU 2006 ICEs in
vect_lower_load_permutations
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

with -mcpu=neoverse-v1 -Ofast -flto tonto ICEs with

crystal.fppized.f90:1795:3: internal compiler error: Segmentation fault
 1795 |function d_chi2(p) result(res)
  |   ^
0x1c0a0f7 internal_error(char const*, ...)
   
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/diagnostic-global-context.cc:492
0xcff233 crash_signal
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/toplev.cc:321
0xfa9db8 vect_lower_load_permutations
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:4354
0xfae8c3 vect_lower_load_permutations
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:4509
0xfae8c3 vect_analyze_slp(vec_info*, unsigned int)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:4777
0xf83a6b vect_analyze_loop_2
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:2862
0xf85123 vect_analyze_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3409
0xf85857 vect_analyze_loop(loop*, vec_info_shared*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3567
0xfc3cef try_vectorize_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1068
0xfc3cef try_vectorize_loop
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1184
0xfc4223 execute
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1300

Running the reducer and bisecting.

[Bug middle-end/116575] New: [15 Regression] blender in SPEC2017 ICE in vect_analyze_slp

2024-09-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116575

Bug ID: 116575
   Summary: [15 Regression] blender in SPEC2017 ICE in
vect_analyze_slp
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: ice-on-valid-code
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
Target: aarch64*

Blender from spec2017 ICEs when compiled with -Ofast -flto -mcpu=neoverse-v1
with

during GIMPLE pass: vect
blender/source/blender/editors/object/object_bake_api.c: In function
'write_internal_bake_pixels':
blender/source/blender/editors/object/object_bake_api.c:173:13: internal
compiler error: in vect_analyze_slp, at tree-vect-slp.cc:4765
  173 | static bool write_internal_bake_pixels(
  | ^
0x1c0a0f7 internal_error(char const*, ...)
   
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/diagnostic-global-context.cc:492
0x7bb0c7 fancy_abort(char const*, int, char const*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/diagnostic.cc:1658
0xfaf1bb vect_analyze_slp(vec_info*, unsigned int)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-slp.cc:4765
0xf83a6b vect_analyze_loop_2
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:2862
0xf85123 vect_analyze_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3409
0xf85857 vect_analyze_loop(loop*, vec_info_shared*)
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vect-loop.cc:3567
0xfc3cef try_vectorize_loop_1
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1068
0xfc3cef try_vectorize_loop
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1184
0xfc4223 execute
/opt/buildAgent/work/5c94c4ced6ebfcd0/gcc/tree-vectorizer.cc:1300

Creating a reducer and bisecting

[Bug tree-optimization/36010] Loop interchange not performed

2024-09-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36010

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #7 from Tamar Christina  ---
It looks like today at -Ofast this is due to the full unrolling

https://godbolt.org/z/3K4hPbWfG

i.e. at -Ofast we fail due to the inner loop being fully unrolled.

Would it make sense to perform loop distribution before cunrolli?

In principle it should make any potential vect and SLP simpler, no?

[Bug rtl-optimization/116541] [14/15 Regression] Inefficient missing use of reg+reg addressing modes

2024-09-02 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116541

Tamar Christina  changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org
 Ever confirmed|0   |1
   Last reconfirmed||2024-09-02
 Status|UNCONFIRMED |NEW

--- Comment #1 from Tamar Christina  ---
(In reply to ktkachov from comment #0)
> I don't know if Tamar's pending IVOPTs fix this but filing it here just in
> case

No, this isn't an IVopts problem. The loop has only one IV expression, however
the way we expand the address breaks the addressing mode.

in GCC 13 we had:

(insn 19 18 20 4 (set (reg:DI 114)
(high:DI (const:DI (plus:DI (symbol_ref:DI ("c") [flags 0x82] 
)
(const_int 4 [0x4]) "/app/example.c":15:49 -1
 (nil))
(insn 20 19 21 4 (set (reg:DI 113)
(lo_sum:DI (reg:DI 114)
(const:DI (plus:DI (symbol_ref:DI ("c") [flags 0x82]  )
(const_int 4 [0x4]) "/app/example.c":15:49 -1
 (expr_list:REG_EQUAL (const:DI (plus:DI (symbol_ref:DI ("c") [flags 0x82] 
)
(const_int 4 [0x4])))
(nil)))
(insn 21 20 22 4 (set (reg/f:DI 112)
(plus:DI (reg:DI 100 [ ivtmp.21 ])
(reg:DI 113))) "/app/example.c":15:49 -1
 (nil))
(insn 22 21 23 4 (set (reg:SF 116)
(mem:SF (reg/f:DI 112) [1 MEM[(float *)&c + 4B + ivtmp.21_32 * 1]+0 S4
A32])) "/app/example.c":15:46 -1
 (nil))

and in GCC 14:

(insn 19 18 20 4 (set (reg:DI 123)
(high:DI (symbol_ref:DI ("c") [flags 0x82]  ))) "/app/example.c":15:49 -1
 (nil))
(insn 20 19 21 4 (set (reg:DI 122)
(lo_sum:DI (reg:DI 123)
(symbol_ref:DI ("c") [flags 0x82]  )))
"/app/example.c":15:49 -1
 (expr_list:REG_EQUAL (symbol_ref:DI ("c") [flags 0x82]  )
(nil)))
(insn 21 20 22 4 (set (reg/f:DI 121)
(plus:DI (reg:DI 101 [ ivtmp.22 ])
(reg:DI 122))) "/app/example.c":15:49 -1
 (nil))
(insn 22 21 23 4 (set (reg:SF 125)
(mem:SF (plus:DI (reg/f:DI 121)
(const_int 4 [0x4])) [1 MEM[(float *)&c + 4B + ivtmp.22_1 *
1]+0 S4 A32])) "/app/example.c":15:46 -1
 (nil))

i.e. the offset is now in the memory access rather than in the address
calculation.  This means that the base is updated rather than the offset, as
the offset is now a constant.

This is due to:

commit db4e496aadf1d7ab1c5af24410394d1551ddd3f0
Author: Wilco Dijkstra 
Date:   Tue Jan 16 16:27:02 2024 +

AArch64: Reassociate CONST in address expressions

GCC tends to optimistically create CONST of globals with an immediate
offset.
However it is almost always better to CSE addresses of globals and add
immediate
offsets separately (the offset could be merged later in single-use cases).
Splitting CONST expressions with an index in aarch64_legitimize_address
fixes
part of PR112573.

gcc/ChangeLog:
PR target/112573
* config/aarch64/aarch64.cc (aarch64_legitimize_address):
Reassociate
badly formed CONST expressions.

gcc/testsuite/ChangeLog:
PR target/112573
* gcc.target/aarch64/pr112573.c: Add new test.

Confirmed.

[Bug tree-optimization/116520] Multiple condition lead to missing vectorization due to missing early break

2024-08-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116520

--- Comment #4 from Tamar Christina  ---
(In reply to Tamar Christina from comment #3)
> (In reply to Richard Biener from comment #2)
> > The issue seems to be that if-conversion isn't done:
>
> I wonder if this transformation is really beneficial on modern cpus though.
> Seems like the compares are independent so the entire thing executes quite
> parallel?

And with this I mean the vector result from vectorizing the unreassociated
code; the scalar is obviously still a long dependency chain.

[Bug tree-optimization/116520] Multiple condition lead to missing vectorization due to missing early break

2024-08-29 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116520

--- Comment #3 from Tamar Christina  ---
(In reply to Richard Biener from comment #2)
> The issue seems to be that if-conversion isn't done:
> 
> Can not ifcvt due to multiple exits
> 
> maybe my patched dev tree arrives with a different CFG here (no switches
> into ifcvt).  I don't think if-conversion was adjusted when the vectorizer
> gained early exit vectorization 

it was adjusted only to deal with bitfield lowering which was one of the
scenarios we had on the roadmap.

> - if-conversion shouldn't for example
> predicate with the exit condition and it should leave those conditions
> and exits around.

How would if-cvt recover what reassoc did here though?

   [local count: 1044213920]:
  # s_15 = PHI 
  _1 = *s_15;
  if (_1 > 63)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 522106960]:
  goto ; [100.00%]

   [local count: 522106960]:
  _14 = (int) _1;
  _17 = 9223372036854785024 >> _14;
  _18 = _17 & 1;
  _19 = _18 == 0;
  _12 = ~_19;

   [local count: 1044213920]:
  # prephitmp_4 = PHI <_12(4), 0(11)>
  _10 = _1 == 92;
  _13 = prephitmp_4 | _10;
  if (_13 != 0)
goto ; [8.03%]
  else
goto ; [91.97%]

is the relevant block, wouldn't BB4 need to be fully predicated to be able to
vectorize this?  That also pushes this loop to only be vectorizable when fully
masked where the original input doesn't require masking.

As a side note, it looks like it's reassoc that's transforming and merging the
conditions, i.e. https://godbolt.org/z/9Kfafvava

I wonder if this transformation is really beneficial on modern cpus though.
Seems like the compares are independent so the entire thing executes quite
parallel?

[Bug tree-optimization/116463] [15 Regression] fast-math-complex-mls-{double,float}.c fail after r15-3087-gb07f8a301158e5

2024-08-28 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #11 from Tamar Christina  ---
(In reply to Richard Biener from comment #6)
> I think
> 
>   a - ((b * -c) + (d * -e))  ->  a + (b * c) + (d * e)
> 
> is a good simplification to be made, but it's difficult to do this with
> canonicalization only.  Like a * -b -> -(a * b) as the negate might
> combine with both other negates down and upstream.  But for
> a*-b + c * -d it might be more obvious to turn that into
> -a*b - c*d.

Yeah, my expectation was that this would be an easier transform to avoid
the sharing problem we discussed before and that indeed the transform
looks at the entire chain not just transforming a * -b.

a*-b + c * -d -> -a*b - c*d

has the property of still maintaining the FMS and FMNS chains and can
get further simplified in the above case.

> 
> Maybe reassoc can be of help here - IIRC it turns b * -c into
> b * c * -1, undistribute_ops_list might get that.

Hmm, I see, but don't we have a higher chance that folding will just
fold it back into the multiply?

For this to work we'd have to do

  (b * -c) + (d * -e) -> -(b * c + d * e)

in one transformation no? since I'd imagine

  (b * c * -1) + (d * e * -1)

would just be undone by match.pd?

> 
> Note one issue is that complex lowering leaves around dead stmts,
> confusing reassoc and forwprop, in particular
> 
> -  _10 = COMPLEX_EXPR <_18, _6>;
> 
> stay around until reassoc.  scheduling dce for testing shows reassoc
> does something.
> 
> It's update_complex_assignment who replaces existing complex
> stmts with COMPLEX_EXPRs, we should possibly resort do
> simple_dce_from_worklist
> to clean those.  Let me try to do that.

Thanks!

[Bug tree-optimization/116463] [15 Regression] fast-math-complex-mls-{double,float}.c fail after r15-3087-gb07f8a301158e5

2024-08-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116463

--- Comment #5 from Tamar Christina  ---
Yeah, this is because they generate different gimple sequences and thus
different SLP trees.
The core of the problem is there's no canonical form here, and a missing gimple
simplification rule:

  _33 = IMAGPART_EXPR <*_3> + ((REALPART_EXPR <*_5> * IMAGPART_EXPR <*_7>) +
(IMAGPART_EXPR <*_5> * REALPART_EXPR <*_7>));
vs
  _37 = IMAGPART_EXPR <*_3> - ((REALPART_EXPR <*_5> * -IMAGPART_EXPR <*_7>) +
(IMAGPART_EXPR <*_5> * -REALPART_EXPR <*_7>));

i.e. a - ((b * -c) + (d * -e)) == a + (b * c) + (d * e)

So probably in match.pd we should fold _37 into _33 which is a simpler form of
the same thing and it's better on scalar as well.
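
For reference, a minimal scalar form of that identity (my own reduction, not
the testcase from the report; it needs -ffast-math since it ignores signed
zeros) would be:

float
use_pos (float a, float b, float c, float d, float e)
{
  return a + (b * c) + (d * e);
}

float
use_neg (float a, float b, float c, float d, float e)
{
  return a - ((b * -c) + (d * -e));
}

Both compute a + b*c + d*e, so use_neg should fold into the use_pos form.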

It would be better to finally introduce a vectorizer canonical form, for
instance the real part generates:

  _36 = (_31 - _30) + REALPART_EXPR <*_3>;
vs
  _32 = REALPART_EXPR <*_3> + (_26 - _27);

and this already is an additional thing to check, so it would be better if slp
build always puts complex parts consistently on one side of commutative
operations so we don't have to swap operands to check.

In any case, I have some patches in this area and can take a look when I'm
back, but think the new expression should be simplified back into the old one.

[Bug tree-optimization/116409] [15 Regression] Recent phiopt change causing ICE with sqrt and -frounding-math

2024-08-19 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116409

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org
 Target|riscv64-linux-gnu   |riscv64-linux-gnu,
   ||aarch64-linux-gnu

--- Comment #7 from Tamar Christina  ---
I see the same error in povray on SPECCPU 2006 and 2017.

[Bug target/116229] [15 Regression] wrong code at -Ofast aarch64 due to missing fneg to generate 0x8000000000000000

2024-08-08 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116229

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #5 from Tamar Christina  ---
Fixed, thanks for the report!

[Bug target/116229] [15 Regression] wrong code at -Ofast aarch64 due to missing fneg to generate 0x8000000000000000

2024-08-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116229

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #3 from Tamar Christina  ---
(In reply to Andrew Pinski from comment #2)
>   /* For Advanced SIMD we can create an integer with only the top bit set
>  using fneg (0.0f).  */
> 
> is wrong in aarch64_maybe_generate_simd_constant.
> 
> it should use either an unspec here or an XOR instead of fneg here I think
> especially for -ffast-math reasons.

XOR would defeat the point of the optimization. The original expression is fine
but relied on nothing in the late pipeline being able to fold the zero constant
back in.

It was for this reason that we explicitly forced it to a separate register.
Late combine is just doing something not possible before. I'll fix it.
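
For context, a rough sketch of the trick itself (my own illustration, not the
code from the PR): negating a floating-point zero flips only the sign bit, so
an Advanced SIMD fneg of a zeroed register yields lanes of 0x8000000000000000
without a GP-to-SIMD transfer:

#include <arm_neon.h>

uint64x2_t
sign_bit_mask (void)
{
  /* movi v0.2d, #0 */
  float64x2_t zero = vdupq_n_f64 (0.0);
  /* fneg v0.2d, v0.2d: -0.0 has only the sign bit set.  */
  return vreinterpretq_u64_f64 (vnegq_f64 (zero));
}

The fragility is exactly as described above: under -ffast-math a later pass
that doesn't care about signed zeros may fold the fneg of 0.0 back to 0.0.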

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-08-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #4 from Tamar Christina  ---
It looks like it's because the old unrolled code for the pointer version did a
subtract and used the difference to optimize the IV check away to every 4
elements.  This explains the increase in instruction count.

I hadn't noticed it during benchmarking because on aarch64 the non-pointer
version got recovered with cbz.

This should be fixable while still being vectorizable with

#pragma GCC unroll 4

on the loop.  The generated code looks good, but it looks like the pragma is
being dropped when used in the template.

I'm away for a few days so Alex is looking into it.
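
A hypothetical reduction of the symptom (names are mine, not the libstdc++
code): the pragma sits on a loop inside a function template and appears to be
lost once the template is instantiated:

template <typename It, typename T>
It
find_unrolled (It first, It last, const T &val)
{
#pragma GCC unroll 4
  for (; first != last; ++first)
    if (*first == val)
      break;
  return first;
}

template int *find_unrolled<int *, int> (int *, int *, const int &);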

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-08-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #3 from Tamar Christina  ---
(In reply to Jan Hubicka from comment #2)
> Looking at the change, I do not see how that could disable inlining. It
> should only reduce size of the function size estimates in the heuristics.
> 
> I think it is more likely loop optimization doing something crazy.  But we
> need to figure out what really changed in the codegen.

Yes, looking at the change, since the loop is now smaller it gets inlined into
the callers. So I guess something goes wrong after that. Trying something now...

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2024-08-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #24 from Tamar Christina  ---
(In reply to Mikael Morin from comment #23)
> (In reply to Mikael Morin from comment #21)
> > 
> > (...) and should be able to submit the first
> > series (inline minloc without dim argument) this week.
> > 
> I missed the "this week" mark (again), but I've finally submitted the
> patches:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658909.html

Thank you! and thanks for the clear patch! it gives a starting point if
I have to inline simpler intrinsics in the future :)

much appreciated!

[Bug target/115974] sat_add, etc. vector patterns not done for aarch64 (non-sve)

2024-07-31 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115974

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org
 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #5 from Tamar Christina  ---
I'll assign to myself for now.

[Bug target/116145] Suboptimal SVE immediate synthesis

2024-07-31 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145

--- Comment #5 from Tamar Christina  ---
(In reply to ktkachov from comment #4)
> Intersting, thanks for the background. The bigger issue I was seeing was
> with a string-matching loop like https://godbolt.org/z/E7b13915E where the
> constant pool load is a reasonable codegen decision, but unfortunately every
> iteration of the loop reloads the constant which would hurt in a tight inner
> loop.
> So perhaps my problem is that the constant-pool loads are not being
> considered loop invariant, or something is sinking them erroneously

Ah, Yeah, that's definitely a bug. It looks like fwprop is pushing the constant
vector initialization into an UNSPEC, which LIM doesn't know is invariant so
can't pull it out.

We also don't do so in GIMPLE because the operation isn't lowered.

  pc_15 = svmatch_u8 (pg_12, arr1_14, { 63, 92, 13, 10, ... });

so the constant is never removed from the instruction in GIMPLE.

Should probably look at whether we really need an UNSPEC there.
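
A sketch of the kind of loop involved (my own reduction, assuming SVE2 for
svmatch, e.g. -march=armv9-a): the needle vector fed to svmatch is
loop-invariant and should be hoistable by LIM:

#include <arm_sve.h>

int
has_special (const uint8_t *s, uint64_t n)
{
  /* Loop-invariant needle set { 63, 92, 13, 10 }.  */
  svuint8_t set = svdupq_n_u8 (63, 92, 13, 10, 0, 0, 0, 0,
                               0, 0, 0, 0, 0, 0, 0, 0);
  for (uint64_t i = 0; i < n; i += svcntb ())
    {
      svbool_t pg = svwhilelt_b8 (i, n);
      svuint8_t v = svld1_u8 (pg, s + i);
      if (svptest_any (pg, svmatch_u8 (pg, v, set)))
        return 1;
    }
  return 0;
}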

[Bug target/116145] Suboptimal SVE immediate synthesis

2024-07-31 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116145

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #3 from Tamar Christina  ---
We've looked at this type of constant initialization in the past and even
though the LLVM version looks faster it's not in practice.

If you look at the Software Optimization Guides, SVE cores don't handle
MOV/MOVK pairs specially anymore.
So here the sequence is a 3 instruction dependency chain with a very low
throughput:

mov w8, #23615
movk    w8, #2573, lsl #16
mov z0.s, w8
ret

vs

ptrue   p3.b, all
adrp    x0, .LC0
add x0, x0, :lo12:.LC0
ld1rw   z0.s, p3/z, [x0]
ret

which is also a 3 instruction dependency chain, but loads have a higher
throughput than register transfers and the latency difference is hidden.
In most real code you'd also have shared the anchor and ptrue, or if in a loop,
the ptrue and the adrp would have been floated out.

Benchmarking has shown that there's no real performance difference between
these two when it's 1 constant.  When there are more than one constant the load
variant wins by a large margin as the SVE mov serializes the construction of
all constants.

The concern here is that, because of this serialization, constant
rematerialization inside loops would become slower.
So I don't believe the LLVM sequence is beneficial to implement.

That said, when we looked at this we did come to the conclusion that we can use
SVE's ORR and other immediate instructions to construct more immediate
sequences on the SIMD side itself.  That way we avoid the transfer.

[Bug libstdc++/116140] [15 Regression] 5-35% slowdown of 483.xalancbmk and 523.xalancbmk_r since r15-2356-ge69456ff9a54ba

2024-07-30 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #1 from Tamar Christina  ---
Yeah, we've noticed it as well.

The weird thing is that the dynamic instruction count went up by a lot.

So it looks like some inlining or something did not happen.

[Bug target/116074] [15 regression] ICE when building harfbuzz-9.0.0 on arm64 (related_int_vector_mode, at stor-layout.cc:581)

2024-07-26 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116074

Tamar Christina  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #11 from Tamar Christina  ---
Fixed, thanks for the report.

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2024-07-26 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 116074, which changed state.

Bug 116074 Summary: [15 regression] ICE when building harfbuzz-9.0.0 on arm64 
(related_int_vector_mode, at stor-layout.cc:581)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116074

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/116074] [15 regression] ICE when building harfbuzz-9.0.0 on arm64 (related_int_vector_mode, at stor-layout.cc:581)

2024-07-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116074

--- Comment #8 from Tamar Christina  ---
Going with a backend fix instead.

[Bug tree-optimization/116074] [15 regression] ICE when building harfbuzz-9.0.0 on arm64 (related_int_vector_mode, at stor-layout.cc:581)

2024-07-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116074

--- Comment #7 from Tamar Christina  ---
The backend is returning TImode for get_vectype_for_scalar_type for historical
reasons where large integer modes were considered struct types and thus vector
modes.

However they're not modes the vectorizer can use, but the backend hook
!targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode) ICEs because
it's not a valid vector mode.

I don't think the target hook should ICE, and I don't see how other usages of
the hook do any additional checking.

diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 53af5e38b53..b68aea925a4 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -6638,6 +6638,7 @@ vect_recog_cond_store_pattern (vec_info *vinfo,
   machine_mode mask_mode;
   machine_mode vecmode = TYPE_MODE (vectype);
   if (targetm.vectorize.conditional_operation_is_expensive (IFN_MASK_STORE)
+  || !VECTOR_MODE_P (vecmode)
   || !targetm.vectorize.get_mask_mode (vecmode).exists (&mask_mode)
   || !can_vec_mask_load_store_p (vecmode, mask_mode, false))
 return NULL;

This fixes the issue, but I would have expected get_mask_mode to just return
False here.

[Bug tree-optimization/116074] [15 regression] ICE when building harfbuzz-9.0.0 on arm64 (related_int_vector_mode, at stor-layout.cc:581)

2024-07-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116074

Tamar Christina  changed:

   What|Removed |Added

   Last reconfirmed||2024-07-25
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Ever confirmed|0   |1
 Status|UNCONFIRMED |ASSIGNED

--- Comment #6 from Tamar Christina  ---
mine.

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2024-07-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #22 from Tamar Christina  ---
(In reply to Mikael Morin from comment #21)
> (In reply to Tamar Christina from comment #20)
> > Hi Mikael,
> > 
> > I did regression testing on x86_64 and AArch64 and only found one test-ism.
> > 
> > I think I understand most of the patch to be able to deal with any fallout,
> > would it be ok if I fix the test-ism and submit the patch on your behalf?
> > 
> > It would be a shame to let it bitrot.
> > 
> 
> Sorry. In the last days, I submitted a few small minloc-related patches
> found while working on this PR, and should be able to submit the first
> series (inline minloc without dim argument) this week.

Ah ok, I'll wait for you then, thanks!

> 
> You can submit on my behalf if you prefer; it would definitely accelerate
> progress on this topic.
> 
> What do you mean by test-ism?


I think this was just me, I had tested the minloc patch on top of some
additional changes to IVopts that mostly help fortran code.

At the time gfortran.dg/maxloc_bounds_[4-7].f90 started failing and I had
assumed that it had to do with the inlining.

But it looks like that was a bug in my IVopts patch, as they're no longer
failing with the new patches.

[Bug ipa/106783] [12/13/14/15 Regression] ICE in ipa-modref.cc:analyze_function since r12-5247-ga34edf9a3e907de2

2024-07-23 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106783

--- Comment #8 from Tamar Christina  ---
(In reply to Jan Hubicka from comment #6)
> The problem is that n/=0 is undefined behavior (so we can optimize out call
> to function doing divide by zero), while __builtin_trap is observable and we
> do not optimize out code paths that may trip to it.
> 

Hmm, I hit this today with:

void foo1 (char *restrict a, int *restrict c, int n, int stride)
{
  if (stride <= 1)
return;
  for (int i = 0; i < 9; i++)
{
  int res = c[i];
  c[i] = a[i] ? res : 9;
}
}

compiled with -Ofast -march=armv9-a -fdump-tree-modref.

At least this variant has no builtin traps (nor UB).

See https://godbolt.org/z/1h83rasns

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2024-07-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #20 from Tamar Christina  ---
Hi Mikael,

I did regression testing on x86_64 and AArch64 and only found one test-ism.

I think I understand most of the patch to be able to deal with any fallout,
would it be ok if I fix the test-ism and submit the patch on your behalf?

It would be a shame to let it bitrot.

Thanks,
Tamar

[Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops

2024-07-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Tamar Christina  ---
Fixed on trunk

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2024-07-22 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 115531, which changed state.

Bug 115531 Summary: vectorizer generates inefficient code for masked 
conditional update loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/115936] [15 Regression] GCN vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-17 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115936

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #8 from Tamar Christina  ---
Fixed, thanks for the report.  Bug is latent on branches so won't backport for
now.

[Bug target/115936] [15 Regression] GCN vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115936

--- Comment #6 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> iv->step should never be a pointer type

This is created by SCEV.

simple_iv_with_niters in the case where no CHREC is found creates an IV with
base == ev, offset == 0;

However in this case EV is a POINTER_PLUS_EXPR and so the type is a pointer;
it ends up creating an unusable expression.

following the remaining code, it looks like this should be

diff --git a/gcc/tree-scalar-evolution.cc b/gcc/tree-scalar-evolution.cc
index 5aa95a2497a..abb2bad7773 100644
--- a/gcc/tree-scalar-evolution.cc
+++ b/gcc/tree-scalar-evolution.cc
@@ -3243,7 +3243,11 @@ simple_iv_with_niters (class loop *wrto_loop, class loop
*use_loop,
   if (tree_does_not_contain_chrecs (ev))
 {
   iv->base = ev;
-  iv->step = build_int_cst (TREE_TYPE (ev), 0);
+  tree ev_type = TREE_TYPE (ev);
+  if (POINTER_TYPE_P (ev_type))
+   ev_type = sizetype;
+
+  iv->step = build_int_cst (ev_type, 0);
   iv->no_overflow = true;
   return true;
 }

So I think there are two bugs here.  I also think the IVopts one is a bug, as
it's clearly changing the type and introducing a mismatch there too.

I'm regression testing both changes.

[Bug target/115934] [15 Regression] nvptx vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115934

--- Comment #7 from Tamar Christina  ---
(In reply to Thomas Schwinge from comment #6)
> Tamar, Richard, thanks for having a look.
> 
> (In reply to Tamar Christina from comment #4)
> > This one looks a bit like costing, [...]
> 
> I see.  So we (I) shall later re-visit this PR in context of
>  001101dad3b2$ef215730$cd640590$@nextmovesoftware.com> "[nvptx PATCH]
> Implement rtx_costs target hook for nvptx backend", and, if necessary,
> follow-up work:
> 

I believe so, it looks like try_improve_iv_set does nothing for nvptx because
it tries to look for TARGET_ADDRESS_COST and in its absence tries to use
TARGET_RTX_COSTS, both of which are missing.

Because of this it can't compare the different IVs and the costs all end up
being the same.

So basically it just ends up picking the first one from the list, which in this
case just happens to be worse off. 

> > I don't however see an implementation of TARGET_ADDRESS_COST for the target.

[Bug target/115936] [15 Regression] GCN vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115936

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> iv->step should never be a pointer type

That's what I initially thought too.  My suspicion is that there is some code
that tries to create the 0 offset.

I'll try to track down where the IV is created.

0 + 0B is a weird candidate either way.

[Bug target/115934] [15 Regression] nvptx vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115934

--- Comment #4 from Tamar Christina  ---
This one looks a bit like costing,

before the patch IVopts had:

:
inv_expr 1: -element_7(D)
inv_expr 2: (signed int) rite_5(D) - (signed int) element_7(D)

and after the patch it generates a few more alternatives:


:
inv_expr 1: -element_7(D)
inv_expr 2: ((signed int) left_4(D) + (signed int) rite_5(D)) - (signed
int) element_7(D)
inv_expr 3: (signed int) left_4(D) + (signed int) rite_5(D)
inv_expr 4: (signed int) rite_5(D) - (signed int) element_7(D)
inv_expr 5: ((signed int) rite_5(D) - (signed int) element_7(D)) + (signed
int) left_4(D)
inv_expr 6: ((signed int) rite_5(D) + (signed int) element_7(D)) + (signed
int) left_4(D)
inv_expr 7: ((signed int) left_4(D) - (signed int) element_7(D)) + (signed
int) rite_5(D)

Before it decided it needed two separate IVs to satisfy these invariants:

Initial set of candidates:
  cost: 122 (complexity 0)
  reg_cost: 114
  cand_cost: 8
  cand_group_cost: 0 (complexity 0)
  candidates: 7, 9
   group:0 --> iv_cand:7, cost=(0,0)
   group:1 --> iv_cand:9, cost=(0,0)
   group:2 --> iv_cand:9, cost=(0,0)
   group:3 --> iv_cand:7, cost=(0,0)
  invariant variables: 1
  invariant expressions: 1

Original cost 122 (complexity 0)

Final cost 122 (complexity 0)

Selected IV set for loop 1 at
../gcc-dsg/gcc/testsuite/gcc.dg/tree-ssa/pr43378.c:7, 10 avg niters, 2 IVs:
Candidate 7:
  Var befor: left_14
  Var after: left_10
  Incr POS: orig biv
  IV struct:
Type:   int
Base:   left_4(D)
Step:   element_7(D)
Biv:N
Overflowness wrto loop niter:   Overflow
Candidate 9:
  Depend on inv.exprs: 1
  Var befor: rite_15
  Var after: rite_8
  Incr POS: orig biv
  IV struct:
Type:   int
Base:   rite_5(D)
Step:   -element_7(D)
Biv:N
Overflowness wrto loop niter:   Overflow


with the patch it decided it only needed the one IV:

Initial set of candidates:
  cost: 109 (complexity 0)
  reg_cost: 97
  cand_cost: 4
  cand_group_cost: 8 (complexity 0)
  candidates: 9
   group:0 --> iv_cand:9, cost=(4,0)
   group:1 --> iv_cand:9, cost=(0,0)
   group:2 --> iv_cand:9, cost=(0,0)
   group:3 --> iv_cand:9, cost=(4,0)
  invariant variables: 1
  invariant expressions: 1, 3

Original cost 109 (complexity 0)

Final cost 109 (complexity 0)

Selected IV set for loop 1 at
../gcc-dsg/gcc/testsuite/gcc.dg/tree-ssa/pr43378.c:7, 10 avg niters, 1 IVs:
Candidate 9:
  Depend on inv.exprs: 1
  Var befor: rite_15
  Var after: rite_8
  Incr POS: orig biv
  IV struct:
Type:   int
Base:   rite_5(D)
Step:   -element_7(D)
Biv:N
Overflowness wrto loop niter:   Overflow

It realizes it can satisfy both IVs using 1 candidate and picks it because it
thinks the costs are much lower.
I don't however see an implementation of TARGET_ADDRESS_COST for the target.

On AArch64 for instance this is rejected by costing because the combined IV
requires more registers:

Initial set of candidates:
  cost: 17 (complexity 0)
  reg_cost: 5
  cand_cost: 4
  cand_group_cost: 8 (complexity 0)
  candidates: 9
   group:0 --> iv_cand:9, cost=(4,0)
   group:1 --> iv_cand:9, cost=(0,0)
   group:2 --> iv_cand:9, cost=(0,0)
   group:3 --> iv_cand:9, cost=(4,0)
  invariant variables: 1
  invariant expressions: 1, 3

Improved to:
  cost: 14 (complexity 0)
  reg_cost: 6
  cand_cost: 8
  cand_group_cost: 0 (complexity 0)
  candidates: 7, 9
   group:0 --> iv_cand:7, cost=(0,0)
   group:1 --> iv_cand:9, cost=(0,0)
   group:2 --> iv_cand:9, cost=(0,0)
   group:3 --> iv_cand:7, cost=(0,0)
  invariant variables: 1
  invariant expressions: 1

Original cost 14 (complexity 0)

Final cost 14 (complexity 0)

Selected IV set for loop 1 at /app/example.c:4, 10 avg niters, 2 IVs:

I'll have to take a look at what happens when a target has no cost model for
IVopts.
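
For illustration, a minimal, untested sketch of the missing hook (names follow
the usual target-hook pattern; the constant is a placeholder, not a tuned
value) that would at least give IVopts a defined baseline to compare against;
a real implementation would distinguish the different addressing forms:

static int
nvptx_address_cost (rtx address ATTRIBUTE_UNUSED,
                    machine_mode mode ATTRIBUTE_UNUSED,
                    addr_space_t as ATTRIBUTE_UNUSED,
                    bool speed ATTRIBUTE_UNUSED)
{
  /* Placeholder: treat every legitimate address as equally cheap.  */
  return COSTS_N_INSNS (1);
}

#undef TARGET_ADDRESS_COST
#define TARGET_ADDRESS_COST nvptx_address_cost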

[Bug target/115936] [15 Regression] GCN vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115936

Tamar Christina  changed:

   What|Removed |Added

   Target Milestone|--- |15.0

--- Comment #2 from Tamar Christina  ---
Looks like IVopts has generated an invalid gimple:

ivtmp.39_65 = ivtmp.39_59 + 0B;

where the IVs are DI mode and the offset is a pointer.
This comes from this weird candidate:

Candidate 8:
  Var befor: ivtmp.39_59
  Var after: ivtmp.39_65
  Incr POS: before exit test
  IV struct:
Type:   sizetype
Base:   0
Step:   0B
Biv:N
Overflowness wrto loop niter:   No-overflow

Looks like this invalid candidate was always generated, but was not selected
before as the old constant_multiple_of bailed out due to the operand_equal_p
constraining the type of the arguments.

Question is why this invalid candidate was generated at all, and that's due to:

  /* Record common candidate with initial value zero.  */
  basetype = TREE_TYPE (iv->base);
  if (POINTER_TYPE_P (basetype))
basetype = sizetype;
  record_common_cand (data, build_int_cst (basetype, 0), iv->step, use);

which, for the case where the type is a pointer, changes the base but not the
step.  This makes base + step no longer valid gimple.

So I believe fix is:

diff --git a/gcc/tree-ssa-loop-ivopts.cc b/gcc/tree-ssa-loop-ivopts.cc
index 5fc188ae3f8..d590d6a9b78 100644
--- a/gcc/tree-ssa-loop-ivopts.cc
+++ b/gcc/tree-ssa-loop-ivopts.cc
@@ -3547,7 +3547,8 @@ add_iv_candidate_for_use (struct ivopts_data *data,
struct iv_use *use)
   basetype = TREE_TYPE (iv->base);
   if (POINTER_TYPE_P (basetype))
 basetype = sizetype;
-  record_common_cand (data, build_int_cst (basetype, 0), iv->step, use);
+  record_common_cand (data, build_int_cst (basetype, 0),
+ fold_convert (basetype, iv->step), use);


which fixes the ICE. Will regtest and submit.

[Bug target/115936] [15 Regression] GCN vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115936

Tamar Christina  changed:

   What|Removed |Added

 Ever confirmed|0   |1
 CC||tnfchris at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-07-15
   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #1 from Tamar Christina  ---
Odd thing to ICE on, mine.

[Bug target/115934] [15 Regression] nvptx vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115934

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|UNCONFIRMED |ASSIGNED
   Last reconfirmed||2024-07-15
 Ever confirmed|0   |1

--- Comment #3 from Tamar Christina  ---
mine.

[Bug target/115934] [15 Regression] nvptx vs. ivopts: replace constant_multiple_of with aff_combination_constant_multiple_p [PR114932]

2024-07-15 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115934

--- Comment #1 from Tamar Christina  ---
Hi, thanks for the report, could you tell me a target triple I can use for
nvptx?

[Bug tree-optimization/115866] missed optimization vectorizing switch statements.

2024-07-11 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115866

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org

--- Comment #2 from Tamar Christina  ---
mine.

[Bug tree-optimization/115866] New: missed optimization vectorizing switch statements.

2024-07-10 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115866

Bug ID: 115866
   Summary: missed optimization vectorizing switch statements.
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
Blocks: 53947, 115130
  Target Milestone: ---

The following example:

short a[100];

int foo(int n, int counter)
{
   for (int i = 0; i < n; i++)
 {
if (a[i] == 1 || a[i] == 2 || a[i] == 7 || a[i] == 4)
  return 1;
 }
return 0;
}

fails to vectorize at -O3 -march=armv9-a because in GIMPLE the if is rewritten
into a switch statement:

   [local count: 114863530]:
  if (n_6(D) > 0)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 108546036]:

   [local count: 1014686025]:
  # i_3 = PHI 
  _1 = a[i_3];
  switch (_1)  [94.50%], case 1 ... 2:  [5.50%], case 4:
 [5.50%], case 7:  [5.50%]>

   [local count: 55807731]:
:
  goto ; [100.00%]

   [local count: 958878295]:
:
  i_8 = i_3 + 1;
  if (n_6(D) > i_8)
goto ; [94.50%]
  else
goto ; [5.50%]

   [local count: 52738306]:
  goto ; [100.00%]

   [local count: 906139989]:
  goto ; [100.00%]

   [local count: 6317494]:

   [local count: 59055800]:

   [local count: 114863531]:
  # _5 = PHI <1(9), 0(8)>
:
  return _5;

However, such switch statements, where all the entries lead to the same
destination, are easy to vectorize.  In SVE we have the MATCH instruction that
can be used here, and for others we can duplicate the constants and lower the
switch to a series of compares and ORRs, as sketched below.
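
A sketch of that lowering (my illustration, not the planned patch): a switch
whose cases all reach a common destination is equivalent to a chain of
compares combined with ORs, which the vectorizer already handles:

extern short a[100];

int
foo_lowered (int n)
{
  for (int i = 0; i < n; i++)
    {
      int v = a[i];
      /* All cases of the switch target the same block, so fold them
         into one mask.  */
      int hit = (v == 1) | (v == 2) | (v == 7) | (v == 4);
      if (hit)
        return 1;
    }
  return 0;
}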

This is similar to what's done when the values don't fit inside a switch:

short a[100];

int foo(int n, int counter, short x, short b, short c)
{
   for (int i = 0; i < n; i++)
 {
if (a[i] == x || a[i] == b || a[i] == 7 || a[i] == c)
  return 1;
 }
return 0;
}

is vectorized as:

.L4:
incb    x5
incw    z30.s, all, mul #2
cmp w6, w1
bcc .L15
.L6:
ld1h    z31.h, p7/z, [x5]
cmpeq   p15.h, p7/z, z31.h, z27.h
cmpeq   p11.h, p7/z, z31.h, z28.h
cmpeq   p14.h, p7/z, z31.h, #7
cmpeq   p12.h, p7/z, z31.h, z29.h
orr p15.b, p7/z, p15.b, p11.b
orr p14.b, p7/z, p14.b, p12.b
inch    x1
orr p15.b, p7/z, p15.b, p14.b
ptest   p13, p15.b
b.none  .L4
umov    w1, v30.s[0]
.L3:
sxtw    x1, w1
b   .L7

which is great, but it should be an SVE MATCH instruction as well.

This kind of code shows up through the use of std::find_if as well:

#include <algorithm>
#include <vector>
using namespace std;

bool pred(int i) { return i == 1 || i == 2 || i == 7 || i == 4; }

int foo(vector<int> vec)
{
    vector<int>::iterator it;

it = find_if(vec.begin(), vec.end(), pred);

return *it;
}

and once the unrolled loop is gone we should be able to vectorize it.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130
[Bug 115130] [meta-bug] early break vectorization

[Bug libstdc++/115799] ranges::find's optimized branching for memchr is not quite right

2024-07-06 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115799

Tamar Christina  changed:

   What|Removed |Added

 CC||tnfchris at gcc dot gnu.org

--- Comment #2 from Tamar Christina  ---
I also get a related ICE:

/opt/buildAgent/temp/buildTmp/toolchain/include/c++/15.0.0/bits/stl_algo.h:3873:38:
error: no match for 'operator+' (operand types are
'std::_Rb_tree_const_iterator' and 'long int')
 3873 |   return __first + ((const char*)__p1 - (const
char*)__p0);
  | 
^

[Bug tree-optimization/104265] Missed vectorization in 526.blender_r

2024-07-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104265

--- Comment #5 from Tamar Christina  ---
Also, for fully masked architectures we can, instead of recreating the
vectors, just mask out the irrelevant values.

But we should still order the exits based on complexity.

[Bug tree-optimization/104265] Missed vectorization in 526.blender_r

2024-07-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104265

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> Note the SLP discovery opportunity is from the "reduction" PHI to the
> return which merges control flow to a zero/one flag.

Right, so I get what you mean here, so in

   [local count: 308696474]:
  _52 = t2x_61 < 0.0;
  _53 = t2y_63 < 0.0;
  _54 = _52 | _53;
  _66 = t2z_65 < 0.0;
  _67 = _54 | _66;
  if (_67 != 0)
goto ; [51.40%]
  else
goto ; [48.60%]

   [local count: 158662579]:
  goto ; [100.00%]

   [local count: 150033894]:
  _55 = isec_58(D)->dist;
  _68 = _55 < t1y_62;
  _69 = _55 < t1x_60;
  _70 = _68 | _69;
  _71 = _55 < t1z_64;
  _72 = _70 | _71;
  _73 = ~_72;
  _74 = (int) _73;

   [local count: 1073741824]:
  # _56 = PHI <0(8), _74(6)>
  return _56;

we start at _56 and follow the preds up.  The interesting bit here though is
that the values being compared aren't sequential in memory.

So:

  if (t1x > isec->dist || t1y > isec->dist || t1z > isec->dist) return 0;

  float t1x = (bb[isec->bv_index[0]] - isec->start[0]) * isec->idot_axis[0];
  float t1y = (bb[isec->bv_index[2]] - isec->start[1]) * isec->idot_axis[1];
  float t1z = (bb[isec->bv_index[4]] - isec->start[2]) * isec->idot_axis[2];

but then in:

  if (t1x > t2y  || t2x < t1y  || t1x > t2z || t2x < t1z || t1y > t2z || t2y <
t1z) return 0;

we need a replicated t1x and {t2x, t2x, t2y}.

It looks like the ICX code does indeed rebuild/shuffle the vector at every
exit.  ICX does a better job than OACC here; it does a nice trick, and the key
is that it also re-ordered the exits based on the complexity of the shuffle.

movsxd  rax, dword ptr [rdi + 56]
vmovsd  xmm1, qword ptr [rdi]   # xmm1 = mem[0],zero
vmovsd  xmm2, qword ptr [rdi + 76]  # xmm2 = mem[0],zero
movsxd  rcx, dword ptr [rdi + 64]
vmovss  xmm0, dword ptr [rsi + 4*rax]   # xmm0 = mem[0],zero,zero,zero
vinsertps   xmm0, xmm0, dword ptr [rsi + 4*rcx], 16 # xmm0 =
xmm0[0],mem[0],xmm0[2,3]
vsubps  xmm0, xmm0, xmm1
vmulps  xmm0, xmm0, xmm2
vxorps  xmm3, xmm3, xmm3
vcmpltps xmm3, xmm0, xmm3

i.e. the exit:

  if (t2x < 0.0f || t2y < 0.0f || t2z < 0.0f) return 0;

was made the first exit so it doesn't perform the complicated shuffles if it
doesn't need to.

So it looks like SLP scheduling should take complexity into account?  This
will become interesting with costing as well.

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2024-07-05 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

--- Comment #19 from Tamar Christina  ---
Hi Mikael,

It looks like the last version of your patch already gets inlined in the call
sites we cared about.

Would it be possible for you to upstream it?

[Bug c++/115623] ICE: Segmentation fault in finish_for_cond with novector and almost infinite loop

2024-07-04 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115623

Tamar Christina  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #7 from Tamar Christina  ---
Fixed in master and GCC-14, thanks for the report!

[Bug tree-optimization/115629] Inefficient if-convert of masked conditionals

2024-07-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115629

--- Comment #6 from Tamar Christina  ---
(In reply to rguent...@suse.de from comment #5)
> > In this case, the second load is conditional on the first load mask,  which
> > means it's already done an AND.
> > And crucially inverting it means you also inverted both conditions.
> > 
> > So there are some superflous masking operations happening.  But I guess 
> > that's
> > a separate bug.  Shall I just add some tests here and close it and open a 
> > new
> > PR?
> 
> Not sure if that helps - do we fully understand this is a separate issue and
> not related to how we if-convert?
> 

if-convert looks ok to me:

   [local count: 955630226]:
  # i_28 = PHI 
  _1 = (long unsigned int) i_28;
  _2 = _1 * 4;
  _3 = a_16(D) + _2;
  _4 = *_3;
  _31 = _4 != 0;
  _55 = _54 + _2;
  _6 = (int *) _55;
  _56 = ~_31;
  _7 = .MASK_LOAD (_6, 32B, _56);
  _22 = _7 == 0;
  _34 = _22 & _56;
  _58 = _57 + _2;
  _9 = (int *) _58;
  iftmp.0_19 = .MASK_LOAD (_9, 32B, _34);
  _61 = _4 | _7;
  _35 = _61 != 0;
  _60 = _59 + _2;
  _8 = (int *) _60;
  iftmp.0_21 = .MASK_LOAD (_8, 32B, _35);
  iftmp.0_12 = _34 ? iftmp.0_19 : iftmp.0_21;
  _10 = res_23(D) + _2;
  *_10 = iftmp.0_12;
  i_25 = i_28 + 1;
  if (n_15(D) > i_25)
goto ; [89.00%]
  else
goto ; [11.00%]

I think what's missing here is that

  _7 = .MASK_LOAD (_6, 32B, _56);
  _22 = _7 == 0;
  _34 = _22 & _56;
  iftmp.0_19 = .MASK_LOAD (_9, 32B, _34);

in itself produces an AND, namely (_7 && _34) && _56, where _56 is the loop
mask.

My (probably poor) understanding is that the mask tracking in the vectorizer
attempts to avoid keeping masks and their inverse live at the same time.

But it seems this code doesn't track masks introduced by nested MASK_LOADs;
at least, that's my naive interpretation.

[Bug libstdc++/88545] std::find compile to memchr in trivial random access cases (patch)

2024-07-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88545

--- Comment #12 from Tamar Christina  ---
I had a bug in the benchmark: I forgot to use taskset.

These are the correct ones:

+--------+-----------+---------+---------+
| NEEDLE | scalar 1x | vect    | memchr  |
+--------+-----------+---------+---------+
| 1      | -0.14%    | 174.95% | 373.69% |
| 0      | 0.00%     | -90.60% | -95.21% |
| 100    | 0.03%     | -80.28% | -80.39% |
| 1000   | 0.00%     | -89.46% | -94.06% |
| 10000  | 0.00%     | -90.33% | -95.19% |
| -1     | 0.00%     | -90.60% | -95.21% |
+--------+-----------+---------+---------+

So this shows that on modern cores the unrolled scalar has no influence, so we
should just remove it.

It also shows that memchr is universally faster and that for the rest the
vectorizer does a pretty good job.  We'll get some additional speedups there
soon as well but memchr should still win as it's hand tuned.

So I think for 1-byte elements we should use memchr, and for the rest remove
the unrolled code and let the vectorizer handle it.
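
A rough sketch of that dispatch (my naming, not the actual libstdc++ patch;
assumes C++17 and raw pointers for simplicity):

#include <cstring>
#include <type_traits>

template <typename T>
T *
find_sketch (T *first, T *last, const T &val)
{
  if constexpr (sizeof (T) == 1 && std::is_integral_v<T>)
    {
      /* Byte-sized integral elements: defer to the hand-tuned memchr.  */
      void *p = std::memchr (first, static_cast<unsigned char> (val),
                             last - first);
      return p ? static_cast<T *> (p) : last;
    }
  else
    {
      /* Everything else: a plain loop the vectorizer can handle.  */
      for (; first != last; ++first)
        if (*first == val)
          return first;
      return last;
    }
}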

[Bug tree-optimization/115629] Inefficient if-convert of masked conditionals

2024-07-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115629

--- Comment #4 from Tamar Christina  ---
(In reply to Richard Biener from comment #3)
> So we now tail-merge the two b[i] loading blocks.  Can you check SVE
> code-gen with this?  If that fixes the PR consider adding a SVE testcase.

Thanks, the codegen is much better now, but shows some other missing mask
tracking in the vectorizer.

Atm we generate:

.L3:
ld1wz31.s, p6/z, [x0, x6, lsl 2] <-- load a
cmpeq   p7.s, p6/z, z31.s, #0<-- a == 0, !a
ld1wz0.s, p7/z, [x2, x6, lsl 2]  <-- load c conditionally on !a
cmpeq   p7.s, p7/z, z0.s, #0 <-- !a && !c
orr z0.d, z31.d, z0.d<-- a || c
ld1wz29.s, p7/z, [x3, x6, lsl 2] <--- load d where !a && !c
cmpne   p5.s, p6/z, z0.s, #0 <--- (a || c) & loop_mask
and p7.b, p6/z, p7.b, p7.b   <--- ((!a && !c) && (!a && !c)) &
loop_mask 
ld1wz30.s, p5/z, [x1, x6, lsl 2] <-- load b conditionally on (a ||
c)
sel z30.s, p7, z29.s, z30.s  <-- select (!a && !c, d, b)
st1wz30.s, p6, [x4, x6, lsl 2]
add x6, x6, x7
whilelo p6.s, w6, w5
b.any   .L3

which corresponds to:

  # loop_mask_63 = PHI 
  vect__4.10_64 = .MASK_LOAD (vectp_a.8_53, 32B, loop_mask_63);
  mask__31.11_66 = vect__4.10_64 != { 0, ... };
  mask__56.12_67 = ~mask__31.11_66;
  vec_mask_and_70 = mask__56.12_67 & loop_mask_63;
  vect__7.15_71 = .MASK_LOAD (vectp_c.13_68, 32B, vec_mask_and_70);
  mask__22.16_73 = vect__7.15_71 == { 0, ... };
  mask__34.17_75 = vec_mask_and_70 & mask__22.16_73;
  vect_iftmp.20_78 = .MASK_LOAD (vectp_d.18_76, 32B, mask__34.17_75);
  vect__61.21_79 = vect__4.10_64 | vect__7.15_71;
  mask__35.22_81 = vect__61.21_79 != { 0, ... };
  vec_mask_and_84 = mask__35.22_81 & loop_mask_63;
  vect_iftmp.25_85 = .MASK_LOAD (vectp_b.23_82, 32B, vec_mask_and_84);
  _86 = mask__34.17_75 & loop_mask_63;
  vect_iftmp.26_87 = VEC_COND_EXPR <_86, vect_iftmp.20_78, vect_iftmp.25_85>;
  .MASK_STORE (vectp_res.27_88, 32B, loop_mask_63, vect_iftmp.26_87);

it looks like what's missing is that the mask tracking doesn't know that other
masked operations naturally perform an AND when combined.  We do some of this
in the backend but I feel like it may be better to do it in the vectorizer.

In this case, the second load is conditional on the first load mask,  which
means it's already done an AND.
And crucially inverting it means you also inverted both conditions.

So there are some superflous masking operations happening.  But I guess that's
a separate bug.  Shall I just add some tests here and close it and open a new
PR?

[Bug libstdc++/88545] std::find compile to memchr in trivial random access cases (patch)

2024-07-01 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88545

--- Comment #11 from Tamar Christina  ---
(In reply to Jonathan Wakely from comment #9)
> Patch posted: https://gcc.gnu.org/pipermail/gcc-patches/2024-June/653731.html
> 
> Rerunning benchmarks with this patch would be very welcome.

OK, I have tested the patches on AArch64 and also compared against our
vectorized cases.

I tested 5 needle positions (where the element it's searching for is found):

1
0
100
1000
10000
-1

where 0 means never found, and -1 means find last:

Results are:

+--------+-----------+---------+---------+
| NEEDLE | scalar 1x | vect    | memchr  |
+--------+-----------+---------+---------+
| 1      | -11.67%   | -14.95% | -12.92% |
| 0      | 137.48%   | -82.31% | -83.36% |
| 100    | 3.75%     | 17.06%  | 8.02%   |
| 1000   | -10.34%   | -10.83% | 0.29%   |
| 10000  | -3.25%    | -4.97%  | -5.19%  |
| -1     | 10.28%    | -31.17% | -33.93% |
+--------+-----------+---------+---------+

So it looks like, staying with scalar, the unrolling still has a positive
effect, but calling memchr mainly has an effect on longer searches, not
shorter ones.

The vector loop as currently vectorized has about a 10% unneeded overhead,
which we will be working on this year.  But otherwise it's also a significant
win for longer searches.

So perhaps an idea is to use memchr for bytes, for everything else remove the
unrolled code and let the vectorizer take care of it, and if that fails let the
RTL or tree unroller do it?

For completeness, my benchmark was:

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

#ifndef NEEDLE
#define NEEDLE 50
#endif

#ifndef ITERS
#define ITERS 1000
#endif

__attribute__((noipa))
bool
doIt (const char* s, char v, size_t len)
{
  const char* l = s + len;
  const char* r = std::find (s, l, v);
  return (r != l);
}

int main ()
{
  std::ifstream t("find.data");
  std::stringstream buffer;
  buffer << t.rdbuf();
  std::string content = buffer.str();
  if (NEEDLE > 0)
content[NEEDLE-1] = '|';
  else if (NEEDLE < 0)
content[content.length()-1] = '|';

  bool found = false;
  for (int i = 0; i < ITERS; i++)
 found = found | doIt (content.c_str (), '|', content.length ());

  return found;
}

[Bug tree-optimization/115120] Bad interaction between ivcanon and early break vectorization

2024-06-25 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

--- Comment #5 from Tamar Christina  ---
Considering ivopts bails out on doloop prediction for multiple exits anyway,
what do you think about:

diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
index 5ef24a91917..d1b25ad7dea 100644
--- a/gcc/tree-ssa-loop-ivcanon.cc
+++ b/gcc/tree-ssa-loop-ivcanon.cc
@@ -1319,7 +1319,8 @@ canonicalize_loop_induction_variables (class loop *loop,

   if (create_iv
   && niter && !chrec_contains_undetermined (niter)
-  && exit && just_once_each_iteration_p (loop, exit->src))
+  && exit && just_once_each_iteration_p (loop, exit->src)
+  && (single_dom_exit (loop) || targetm.predict_doloop_p (loop)))
 {
   tree iv_niter = niter;
   if (may_be_zero)

richi?

[Bug tree-optimization/115629] New: Inefficient if-convert of masked conditionals

2024-06-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115629

Bug ID: 115629
   Summary: Inefficient if-convert of masked conditionals
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Keywords: missed-optimization
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

With cases such as:

void foo1 (int *restrict a, int *restrict b, int *restrict c,
  int *restrict d, int *restrict res, int n)
{
for (int i = 0; i < n; i++)
  res[i] = a[i] ? b[i] : (c[i] ? b[i] : d[i]);
}

we generate:

foo1:
cmp w5, 0
ble .L1
mov x6, 0
whilelo p7.s, wzr, w5
ptrue   p3.b, all
.L3:
ld1w    z31.s, p7/z, [x0, x6, lsl 2]
cmpeq   p6.s, p7/z, z31.s, #0
cmpne   p5.s, p7/z, z31.s, #0
ld1w    z0.s, p6/z, [x2, x6, lsl 2]
ld1w    z30.s, p5/z, [x1, x6, lsl 2]
cmpne   p15.s, p3/z, z0.s, #0
orr z0.d, z31.d, z0.d
and p6.b, p15/z, p6.b, p6.b
cmpeq   p4.s, p7/z, z0.s, #0
ld1w    z28.s, p6/z, [x1, x6, lsl 2]
ld1w    z29.s, p4/z, [x3, x6, lsl 2]
sel z29.s, p15, z28.s, z29.s
sel z29.s, p5, z30.s, z29.s
st1w    z29.s, p7, [x4, x6, lsl 2]
incw    x6
whilelo p7.s, w6, w5
b.any   .L3

where b is loaded twice: once with the mask a[i] != 0, and once with
a[i] == 0 && c[i] != 0.  Clearly we don't need the second compare nor the
second load.  This loop can be handled optimally as:

foo1:
cmp w5, 0
ble .L1
mov x6, 0
cntw    x7
whilelo p7.s, wzr, w5
.p2align 5,,15
.L3:
ld1w    z1.s, p7/z, [x0, x6, lsl 2]
ld1w    z0.s, p7/z, [x2, x6, lsl 2]
orr z0.d, z1.d, z0.d
cmpne   p6.s, p7/z, z0.s, #0
cmpeq   p5.s, p7/z, z0.s, #0
ld1w    z31.s, p6/z, [x1, x6, lsl 2]
ld1w    z30.s, p5/z, [x3, x6, lsl 2]
sel z30.s, p6, z31.s, z30.s
st1w    z30.s, p7, [x4, x6, lsl 2]
add x6, x6, x7
whilelo p7.s, w6, w5
b.any   .L3
.L1:
ret

i.e. transform a ? b : (c ? b : d) into (a || c) ? b : d.

This transform is actually also beneficial for scalar, but that's not the case
when one of the conditions has to be inverted, i.e. cases 2 to 4 are beneficial
for masked vector operations but not for scalar:

/* Convert a ? b : (c ? b : d) into (a || c) ? b : d.  */
void foo1 (int *restrict a, int *restrict b, int *restrict c,
  int *restrict d, int *restrict res, int n)
{
for (int i = 0; i < n; i++)
  res[i] = a[i] ? b[i] : (c[i] ? b[i] : d[i]);
}

/* Convert a ? (c ? b : d) : b into (-a || c) ? b : d.  */
void foo2 (int *restrict a, int *restrict b, int *restrict c,
  int *restrict d, int *restrict res, int n)
{
for (int i = 0; i < n; i++)
  res[i] = a[i] ? (c[i] ? b[i] : d[i]) : b[i];
}

/* Convert a ? (c ? d : b) : b into (-a || -c) ? b : d.  */
void foo3 (int *restrict a, int *restrict b, int *restrict c,
  int *restrict d, int *restrict res, int n)
{
for (int i = 0; i < n; i++)
  res[i] = a[i] ? (c[i] ? d[i] : b[i]) : b[i];
}

/* Convert a ? b : (c ? d : b) into (a || -c) ? b : d.  */
void foo4 (int *restrict a, int *restrict b, int *restrict c,
  int *restrict d, int *restrict res, int n)
{
for (int i = 0; i < n; i++)
  res[i] = a[i] ? b[i] : (c[i] ? d[i] : b[i]);
}

For instance, case 3 is currently vectorized as:

foo3:
cmp w5, 0
ble .L10
mov x6, 0
whilelo p7.s, wzr, w5
ptrue   p3.b, all
.L12:
ld1w    z1.s, p7/z, [x0, x6, lsl 2]
cmpeq   p5.s, p7/z, z1.s, #0
cmpne   p6.s, p7/z, z1.s, #0
ld1w    z29.s, p5/z, [x1, x6, lsl 2]
ld1w    z0.s, p6/z, [x2, x6, lsl 2]
cmpne   p15.s, p3/z, z0.s, #0
cmpeq   p4.s, p6/z, z0.s, #0
and p5.b, p15/z, p6.b, p6.b
ld1w    z30.s, p4/z, [x1, x6, lsl 2]
ld1w    z31.s, p5/z, [x3, x6, lsl 2]
sel z30.s, p15, z31.s, z30.s
sel z30.s, p6, z30.s, z29.s
st1w    z30.s, p7, [x4, x6, lsl 2]
incw    x6
whilelo p7.s, w6, w5
b.any   .L12

but can be

foo3:
cmp w5, 0
ble .L10
mov x6, 0
cntw    x7
whilelo p6.s, wzr, w5
ptrue   p5.b, all
.p2align 5,,15
.L12:
ld1w    z29.s, p6/z, [x0, x6, lsl 2]
ld1w    z28.s, p6/z, [x2, x6, lsl 2]
cmpeq   p15.s, p5/z, z29.s, #0
cmpeq   p14.s, p5/z, z28.s, #0
orr p15.b, p5/z, p15.b, p14.b
and p4.b, p6/z, p15.b, p15.b
bic p7.b, p5/z, p6.b, p15.b
ld1w    z31.s, p4/z, [x1, x6, lsl 2]
ld1w    z30.s, p7/z, [x3, x6, lsl 2]
  

[Bug tree-optimization/115531] vectorizer generates inefficient code for masked conditional update loops

2024-06-24 Thread tnfchris at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

Tamar Christina  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |tnfchris at gcc dot 
gnu.org
 Status|NEW |ASSIGNED
