[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-13 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #1 from Andrew Pinski  ---
Actually why didn't we copy the loop header in the first place?

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-13 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #2 from Andrew Pinski  ---
(In reply to Jan Hubicka from comment #0)
> saving an instruction. Why we do not move stack+8 updating out of the loop?

Maybe because of a clobber:
  cur$second_5 = MEM[(const struct pairD.26349 &)_7 +
18446744073709551608].secondD.27577;
  # PT = nonlocal escaped 
  _4 = _7 + 18446744073709551608;
  # .MEM_9 = VDEF <.MEM_1>
  stackD.26352.D.27437._M_implD.26667.D.26744._M_finishD.26670 = _4;
  # .MEM_10 = VDEF <.MEM_9>
  MEM[(struct pairD.26349 *)_7 + -8B] ={v} {CLOBBER};

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-13 Thread amonakov at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

Alexander Monakov  changed:

   What|Removed |Added

 CC||amonakov at gcc dot gnu.org

--- Comment #3 from Alexander Monakov  ---
Rather, because store-motion out of a loop that might iterate zero times would
create a data race.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-14 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #4 from Jan Hubicka  ---
> Rather, because store-motion out of a loop that might iterate zero times would
> create a data race.
Good point.  If we did copy loop headers all the way to the store the
problem will go away.  Also I assume we can still add a flag which is
set to true if loops iterates and then make store conditional...

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-14 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #5 from Jan Hubicka  ---
> Actually why didn't we copy the loop header in the first place?
Because it is considered to be do-while loop already (thanks to
the in-loop conitional, do_while_loop_p is happy).

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-14 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

Richard Biener  changed:

   What|Removed |Added

   Last reconfirmed||2023-05-15
   Keywords||missed-optimization
 Status|UNCONFIRMED |NEW
 Ever confirmed|0   |1

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-17 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

Jan Hubicka  changed:

   What|Removed |Added

 Blocks||109811
 CC||mjambor at suse dot cz

--- Comment #6 from Jan Hubicka  ---
Here is slightly improved testcase which actually pushes into stack and
measures something. It test loops 1000 times and returns.  It also makes stack
to be local variable so race conditions are not a problem.

#include 
typedef unsigned int uint32_t;
std::pair pair;
void
test()
{
std::vector> stack;
stack.push_back (pair);
while (!stack.empty()) {
std::pair cur = stack.back();
stack.pop_back();
if (!cur.first)
{
cur.second++;
stack.push_back (cur);
}
if (cur.second > 1)
break;
}
}
int
main()
{
for (int i = 0; i < 1; i++)
  test();
}

Clang code is about twice as fast

jan@localhost:/tmp> clang++ -O2 tt.C  -fno-exceptions
jan@localhost:/tmp> g++ -O2 tt.C  -fno-exceptions -o a.out-gcc
jan@localhost:/tmp> perf stat ./a.out

 Performance counter stats for './a.out':

434.24 msec task-clock:u #0.997 CPUs
utilized 
 0  context-switches:u   #0.000 /sec
 0  cpu-migrations:u #0.000 /sec
   129  page-faults:u#  297.073 /sec
 1,003,191,657  cycles:u #2.310 GHz 
68,927  stalled-cycles-frontend:u#0.01% frontend
cycles idle  
   800,792,619  stalled-cycles-backend:u #   79.82% backend
cycles idle   
 1,904,682,933  instructions:u   #1.90  insn per
cycle
  #0.42  stalled cycles per
insn   
   500,912,196  branches:u   #1.154 G/sec   
23,144  branch-misses:u  #0.00% of all
branches   

   0.435340389 seconds time elapsed

   0.431409000 seconds user
   0.003994000 seconds sys


jan@localhost:/tmp> perf stat ./a.out-gcc

 Performance counter stats for './a.out-gcc':

  1,197.28 msec task-clock:u #0.999 CPUs
utilized 
 0  context-switches:u   #0.000 /sec
 0  cpu-migrations:u #0.000 /sec
   131  page-faults:u#  109.415 /sec
 2,903,995,656  cycles:u #2.425 GHz 
86,204  stalled-cycles-frontend:u#0.00% frontend
cycles idle  
 2,690,907,052  stalled-cycles-backend:u #   92.66% backend
cycles idle   
 2,005,212,311  instructions:u   #0.69  insn per
cycle
  #1.34  stalled cycles per
insn   
   401,007,320  branches:u   #  334.932 M/sec   
23,290  branch-misses:u  #0.01% of all
branches   

   1.198388186 seconds time elapsed

   1.19845 seconds user
   0.0 seconds sys


The problem seems to be, like in first example, that we keep updating in-memory
stack in the main loop.

.L39:
movl12(%rsp), %ebx
.L30:
movq16(%rsp), %rax
cmpl$1, %ebx
ja  .L33
.L40:
movq24(%rsp), %rdi
cmpq%rdi, %rax
je  .L28
.L34:
movq-8(%rdi), %rax
leaq-8(%rdi), %rsi
movq%rsi, 24(%rsp)
movq%rax, 8(%rsp)
testl   %eax, %eax
jne .L39

While clang does:

.LBB0_1:#   in Loop: Header=BB0_4 Depth=1
movq%rax, %r14
.LBB0_2:#   in Loop: Header=BB0_4 Depth=1
movq%rbx, %r12
movq%r12, %rbx
cmpl$10001, %r13d   # imm = 0x2711
jae .LBB0_27
.LBB0_4:# =>This Loop Header: Depth=1
# Child Loop BB0_16 Depth 2
# Child Loop BB0_21 Depth 2
cmpq%r14, %rbx
je  .LBB0_26
# %bb.5:#   in Loop: Header=BB0_4 Depth=1
leaq-8(%r14), %rax
movq-8(%r14), %rcx
movq%rcx, %r13
shrq$32, %r13
testl   %ecx, %ecx
jne .LBB0_1


Referenced Bugs:

https://gcc.gnu.org/bugzilla/sho

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-17 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #7 from Richard Biener  ---
There is nothing to sink really, loop header copying introduces a PHI and
there's not partial redundancies but only partial-partial and those are not
obvious to CSE because of the introduced PHI.

I believe we have to teach SRA to decompose 'cur' and maybe also 'stack',
there's no scalar optimization going to do it also because we have aggregate
copies involved.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-18 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #8 from Jan Hubicka  ---
We can only SRA if the address is non-escaping.  Clang does not seem to need it
to optimize better:

jan@localhost:~> cat t.c
extern void q(int *);
__attribute__ ((noinline))
void
test()
{
for (int a = 0; a < 1000;a++)
if (!(a%100))
q(&a);
}
int
main()
{
for (int a = 0; a < 100;a++)
test ();
}
jan@localhost:~> cat t2.c
void q(int *a)
{
}
jan@localhost:~> gcc -O2 t.c t2.c ; perf stat ./a.out

 Performance counter stats for './a.out':

  2,916.73 msec task-clock:u #0.999 CPUs
utilized 
 0  context-switches:u   #0.000 /sec
 0  cpu-migrations:u #0.000 /sec
52  page-faults:u#   17.828 /sec
 8,344,719,833  cycles:u #2.861 GHz 
13,561,375  stalled-cycles-frontend:u#0.16% frontend
cycles idle  
 5,128,112,757  stalled-cycles-backend:u #   61.45% backend
cycles idle   
10,050,172,242  instructions:u   #1.20  insn per
cycle
  #0.51  stalled cycles per
insn   
 2,034,043,082  branches:u   #  697.370 M/sec   
11,186,312  branch-misses:u  #0.55% of all
branches   

   2.918344737 seconds time elapsed

   2.917844000 seconds user
   0.0 seconds sys


jan@localhost:~> clang -O2 t.c t2.c ; perf stat ./a.out

 Performance counter stats for './a.out':

664.40 msec task-clock:u #0.999 CPUs
utilized 
 0  context-switches:u   #0.000 /sec
 0  cpu-migrations:u #0.000 /sec
54  page-faults:u#   81.276 /sec
 2,318,095,848  cycles:u #3.489 GHz 
10,417,694  stalled-cycles-frontend:u#0.45% frontend
cycles idle  
 1,057,731,301  stalled-cycles-backend:u #   45.63% backend
cycles idle   
10,062,172,840  instructions:u   #4.34  insn per
cycle
  #0.11  stalled cycles per
insn   
 2,034,042,724  branches:u   #3.061 G/sec   
10,003,620  branch-misses:u  #0.49% of all
branches   

   0.665267996 seconds time elapsed

   0.665247000 seconds user
   0.0 seconds sys


We do:

jmp .L3
.p2align 4,,10
.p2align 3
.L2:
movl12(%rsp), %eax
addl$1, %eax
movl%eax, 12(%rsp)
cmpl$999, %eax
jg  .L7
.L3:
imull   $-1030792151, %eax, %eax
addl$85899344, %eax
rorl$2, %eax
cmpl$42949672, %eax
ja  .L2
leaq12(%rsp), %rdi
callq
jmp .L2

Which has stupid store-to-load dpendency in the internal loop. Clang keeps the
store but optimizes away the load:

jmp .LBB0_1
.p2align4, 0x90
.LBB0_3:#   in Loop: Header=BB0_1 Depth=1
leal1(%rax), %ecx
movl%ecx, 12(%rsp)
cmpl$999, %eax  # imm = 0x3E7
movl%ecx, %eax
jge .LBB0_4
.LBB0_1:# =>This Inner Loop Header: Depth=1
imull   $-1030792151, %eax, %ecx# imm = 0xC28F5C29
addl$85899344, %ecx # imm = 0x51EB850
rorl$2, %ecx
cmpl$42949672, %ecx # imm = 0x28F5C28
ja  .LBB0_3
# %bb.2:#   in Loop: Header=BB0_1 Depth=1
movq%rbx, %rdi
callq   q@PLT
movl12(%rsp), %eax
jmp .LBB0_3

Wonder what makes clang to think it needs @PLT though.
Why we do not consider the load as partially redundant with itself?

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-18 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #9 from Richard Biener  ---
Created attachment 55110
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55110&action=edit
patch for the missed hoisting

For the testcase in comment#6 there is a missing code hoisting from PRE
which is caused by do_hoist_insertion doing

  /* A hoistable value must be in ANTIC_IN(block)
 but not in AVAIL_OUT(BLOCK).  */
  bitmap_initialize (&hoistable_set.values, &grand_bitmap_obstack);
  bitmap_and_compl (&hoistable_set.values,
&ANTIC_IN (block)->values, &AVAIL_OUT (block)->values);

but in reality we want to check ANTIC_OUT(block), not ANTIC_IN(block).
cur.second is killed by the aggregate assignment to cur at the beginning
of the block we should hoist to and that's reflected in ANTIC_IN.

The attached patch properly re-computes ANTIC_OUT and uses that.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-18 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #10 from Jan Hubicka  ---
Thanks. I tested the patch on jpegxl and it does not help there (I guess
becuase the redundancy there is partial). But it is cool we compile at least
the simplified testcase well.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-23 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #11 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:9e2017ae6ac788d3e36999bb0f0d20ea0f62c20e

commit r14-1127-g9e2017ae6ac788d3e36999bb0f0d20ea0f62c20e
Author: Richard Biener 
Date:   Thu May 18 13:52:29 2023 +0200

tree-optimization/109849 - missed code hoisting

The following fixes code hoisting to properly consider ANTIC_OUT instead
of ANTIC_IN.  That's a bit expensive to re-compute but since we no
longer iterate we're doing this only once per BB which should be
acceptable.  This avoids missing hoistings to the end of blocks where
something in the block clobbers the hoisted value.

PR tree-optimization/109849
* tree-ssa-pre.cc (do_hoist_insertion): Compute ANTIC_OUT
and use that to determine what to hoist.

* gcc.dg/tree-ssa/ssa-hoist-8.c: New testcase.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-23 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #12 from Richard Biener  ---
So this fixed the missing code hoisting - partial PRE is done with -O3 only.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-05-24 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #13 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:5476de2618ffb77f3a52e59e2c9f10b018329689

commit r14-1161-g5476de2618ffb77f3a52e59e2c9f10b018329689
Author: Richard Biener 
Date:   Wed May 24 12:36:28 2023 +0200

tree-optimization/109849 - fix fallout of PRE hoisting change

The PR109849 fix made us no longer hoist some memory loads because
of the expression set intersection.  We can still avoid to compute
the union by simply taking the first sets expressions and leave
the pruning of expressions with values not suitable for hoisting
to sorted_array_from_bitmap_set.

PR tree-optimization/109849
* tree-ssa-pre.cc (do_hoist_insertion): Do not intersect
expressions but take the first sets.

* gcc.dg/tree-ssa/ssa-hoist-9.c: New testcase.

[Bug middle-end/109849] suboptimal code for vector walking loop

2024-05-30 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #35 from Jonathan Wakely  ---
(In reply to Jonathan Wakely from comment #31)
> Bisection points to r14-5831-gaae723d360ca26cd9fd0b039fb0a616bd0eae363 for
> that remaining FAIL as well (and it isn't fixed by the new patch).
> 
> It introduced a new warning which wasn't present before:
> 
> /tmp/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/stl_algobase.h:437:
> warning: 'void* __builtin_memcpy(void*, const void*, long unsigned int)'
> writing between 2 and 9223372036854775806 bytes into a region of size 0
> overflows the destination [-Wstringop-overflow=]

I don't know why we only get a warning for C++98 and not other modes, but this
seems to fix it:

--- a/libstdc++-v3/include/bits/vector.tcc
+++ b/libstdc++-v3/include/bits/vector.tcc
@@ -933,6 +933,9 @@ _GLIBCXX_BEGIN_NAMESPACE_CONTAINER

const size_type __len =
  _M_check_len(__n, "vector::_M_range_insert");
+   if (__len < (__n + (__old_finish - __old_start)))
+ __builtin_unreachable();
+
pointer __new_start(this->_M_allocate(__len));
pointer __new_finish(__new_start);
__try

So it looks like the compiler can't tell that _M_check_len(n) doesn't undergo
unsigned wraparound.

[Bug middle-end/109849] suboptimal code for vector walking loop

2024-06-03 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #36 from GCC Commits  ---
The master branch has been updated by Francois Dumont :

https://gcc.gnu.org/g:0426be454448f8cfb9db21f4f669426afb7b57c8

commit r15-996-g0426be454448f8cfb9db21f4f669426afb7b57c8
Author: François Dumont 
Date:   Sat Jun 1 22:17:19 2024 +0200

libstdc++: Fix -Wstringop-overflow warning coming from std::vector
[PR109849]

libstdc++-v3/ChangeLog:

PR libstdc++/109849
* include/bits/vector.tcc
(std::vector<>::_M_range_insert(iterator, _FwdIt, _FwdIt,
forward_iterator_tag))[__cplusplus < 201103L]: Add
__builtin_unreachable
expression to tell the compiler that the allocated buffer is large
enough to
receive current elements plus the elements of the range to insert.

[Bug middle-end/109849] suboptimal code for vector walking loop

2024-06-03 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #37 from GCC Commits  ---
The releases/gcc-14 branch has been updated by Francois Dumont
:

https://gcc.gnu.org/g:955202eb2cdbe2bc74c626bce90ee1eac410ad4f

commit r14-10272-g955202eb2cdbe2bc74c626bce90ee1eac410ad4f
Author: François Dumont 
Date:   Sat Jun 1 22:17:19 2024 +0200

libstdc++: Fix -Wstringop-overflow warning coming from std::vector
[PR109849]

libstdc++-v3/ChangeLog:

PR libstdc++/109849
* include/bits/vector.tcc
(std::vector<>::_M_range_insert(iterator, _FwdIt, _FwdIt,
forward_iterator_tag))[__cplusplus < 201103L]: Add
__builtin_unreachable
expression to tell the compiler that the allocated buffer is large
enough to
receive current elements plus the elements of the range to insert.

(cherry picked from commit 0426be454448f8cfb9db21f4f669426afb7b57c8)

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-06-16 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #14 from Jan Hubicka  ---
One interesting situation is:
void std::vector >::push_back (struct
vector * const this, const struct value_type & __x)
{
  struct __normal_iterator D.27894;
  struct pair * _1;
  struct pair * _2;
  struct pair * _3;

   [local count: 1073741824]:
  _1 = this_6(D)->D.26707._M_impl.D.26014._M_finish;
  _2 = this_6(D)->D.26707._M_impl.D.26014._M_end_of_storage;
  if (_1 != _2)
goto ; [82.57%]
  else
goto ; [17.43%]

   [local count: 886588625]:
  *_1 = MEM[(const struct pair &)__x_7(D)];
  _3 = _1 + 8;
  this_6(D)->D.26707._M_impl.D.26014._M_finish = _3;
  goto ; [100.00%]

   [local count: 187153200]:
  D.27894._M_current = _1;
  std::vector >::_M_realloc_insert&> (this_6(D), D.27894, __x_7(D));

   [local count: 1073741824]:
  return;

}
here we could do partial inlining and offline the call to _M_realloc_insert but
we fail to cut since _1 is already load:

Split point at BB 4
  header time: 9.302800 header size: 9
  split time: 2.440200 split size: 5
  bbs: 4
  SSA names to pass: 1, 9, 11
  Refused: need to pass non-param values 

It should be easy to insert code loading the parameter again in the split part.
We still hit the SRA limitation since this would be still escaping, but it is
another missed optimization on this simple testcase.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-06-18 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #15 from CVS Commits  ---
The master branch has been updated by Jan Hubicka :

https://gcc.gnu.org/g:5a1ef1cfac005370d0a5a0f85798724cb2c9cf5e

commit r14-1909-g5a1ef1cfac005370d0a5a0f85798724cb2c9cf5e
Author: Honza 
Date:   Sun Jun 18 18:58:26 2023 +0200

Analyze SRA candidates in ipa-fnsummary

this patch extends ipa-fnsummary to anticipate statements that will be
removed
by SRA.  This is done by looking for calls passing addresses of automatic
variables.  In function body we look for dereferences from pointers of such
variables and mark them with new not_sra_candidate condition.

This is just first step which is overly optimistic.  We do not try to prove
that
given automatic variable will not be SRAed even after inlining.  We now
also
optimistically assume that the transformation will always happen.  I will
restrict
this in a followup patch, but I think it is useful to gether some data on
how
much code is affected by this.

This is motivated by PR109849 where we fail to fully inline push_back.
The patch alone does not solve the problem even for -O3, but improves
analysis in this case.

gcc/ChangeLog:

PR tree-optimization/109849
* ipa-fnsummary.cc (evaluate_conditions_for_known_args): Add new
parameter
ES; handle ipa_predicate::not_sra_candidate.
(evaluate_properties_for_edge): Pass es to
evaluate_conditions_for_known_args.
(ipa_fn_summary_t::duplicate): Handle sra candidates.
(dump_ipa_call_summary): Dump points_to_possible_sra_candidate.
(load_or_store_of_ptr_parameter): New function.
(points_to_possible_sra_candidate_p): New function.
(analyze_function_body): Initialize
points_to_possible_sra_candidate;
determine sra predicates.
(estimate_ipcp_clone_size_and_time): Update call of
evaluate_conditions_for_known_args.
(remap_edge_params): Update points_to_possible_sra_candidate.
(read_ipa_call_summary): Stream points_to_possible_sra_candidate
(write_ipa_call_summary): Likewise.
* ipa-predicate.cc (ipa_predicate::add_clause): Handle
not_sra_candidate.
(dump_condition): Dump it.
* ipa-predicate.h (struct inline_param_summary): Add
points_to_possible_sra_candidate.

gcc/testsuite/ChangeLog:

PR tree-optimization/109849
* g++.dg/ipa/devirt-45.C: Update template.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-06-19 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #16 from CVS Commits  ---
The master branch has been updated by Jan Hubicka :

https://gcc.gnu.org/g:7b34cacc5735385e7e2855d7c0a6fad60ef4a99b

commit r14-1951-g7b34cacc5735385e7e2855d7c0a6fad60ef4a99b
Author: Jan Hubicka 
Date:   Mon Jun 19 18:28:17 2023 +0200

optimize std::max early

we currently produce very bad code on loops using std::vector as a stack,
since
we fail to inline push_back which in turn prevents SRA and we fail to
optimize
out some store-to-load pairs.

I looked into why this function is not inlined and it is inlined by clang. 
We
currently estimate it to 66 instructions and inline limits are 15 at -O2
and 30
at -O3.  Clang has similar estimate, but still decides to inline at -O2.

I looked into reason why the body is so large and one problem I spotted is
the
way std::max is implemented by taking and returning reference to the
values.

  const T& max( const T& a, const T& b );

This makes it necessary to store the values to memory and load them later
and max is used by code computing new size of vector on resize.

We optimize this to MAX_EXPR, but only during late optimizations.  I think
this
is a common enough coding pattern and we ought to make this transparent to
early opts and IPA.  The following is easist fix that simply adds phiprop
pass
that turns the PHI of address values into PHI of values so later FRE can
propagate values across memory, phiopt discover the MAX_EXPR pattern and
DSE
remove the memory stores.

gcc/ChangeLog:

PR tree-optimization/109811
PR tree-optimization/109849
* passes.def: Add phiprop to early optimization passes.
* tree-ssa-phiprop.cc: Allow clonning.

gcc/testsuite/ChangeLog:

PR tree-optimization/109811
PR tree-optimization/109849
* gcc.dg/tree-ssa/phiprop-1.c: New test.
* gcc.dg/tree-ssa/pr21463.c: Adjust template.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-06-26 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #17 from CVS Commits  ---
The master branch has been updated by Jan Hubicka :

https://gcc.gnu.org/g:c2ebccc97190a978a44e341516b488f02a78c598

commit r14-2101-gc2ebccc97190a978a44e341516b488f02a78c598
Author: Jan Hubicka 
Date:   Mon Jun 26 18:29:39 2023 +0200

Fix profile of forwarders produced by cd-dce

compiling the testcase from PR109849 (which uses std:vector based stack to
drive a loop) with profile feedbakc leads to profile mismatches introduced
by
tree-ssa-dce.  This is the new code to produce unified forwarder blocks for
PHIs.

I am not including the testcase itself since
checking it for Invalid sum is probably going to be too fragile and this
should
show in our LNT testers. The patch however fixes the mismatch.

Bootstrapped/regtested x86_64-linux and plan to commit it shortly.

gcc/ChangeLog:

PR tree-optimization/109849
* tree-ssa-dce.cc (make_forwarders_with_degenerate_phis): Fix
profile
count of newly constructed forwarder block.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-06-28 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #18 from CVS Commits  ---
The master branch has been updated by Jan Hubicka :

https://gcc.gnu.org/g:45c53768b6fa3d737ae818e31d3c50da62e0ad2b

commit r14-2157-g45c53768b6fa3d737ae818e31d3c50da62e0ad2b
Author: Jan Hubicka 
Date:   Wed Jun 28 11:45:15 2023 +0200

Add cold attribute to throw wrappers and terminate

PR middle-end/109849
* include/bits/c++config (std::__terminate): Mark cold.
* include/bits/functexcept.h: Mark everything as cold.
* libsupc++/exception: Mark terminate and unexpected as cold.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-06-29 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #19 from CVS Commits  ---
The master branch has been updated by Jan Hubicka :

https://gcc.gnu.org/g:9dc18fca431626404b0692c689a2e103666e7adb

commit r14-2202-g9dc18fca431626404b0692c689a2e103666e7adb
Author: Jan Hubicka 
Date:   Thu Jun 29 22:45:37 2023 +0200

Compute ipa-predicates for conditionals involving __builtin_expect_p

std::vector allocator looks as follows:

__attribute__((nodiscard))
struct pair * std::__new_allocator
>::allocate (struct __new_allocator * const this, size_type __n, const void *
D.27753)
{
  bool _1;
  long int _2;
  long int _3;
  long unsigned int _5;
  struct pair * _9;

   [local count: 1073741824]:
  _1 = __n_7(D) > 1152921504606846975;
  _2 = (long int) _1;
  _3 = __builtin_expect (_2, 0);
  if (_3 != 0)
goto ; [10.00%]
  else
goto ; [90.00%]

   [local count: 107374184]:
  if (__n_7(D) > 2305843009213693951)
goto ; [50.00%]
  else
goto ; [50.00%]

   [local count: 53687092]:
  std::__throw_bad_array_new_length ();

   [local count: 53687092]:
  std::__throw_bad_alloc ();

   [local count: 966367641]:
  _5 = __n_7(D) * 8;
  _9 = operator new (_5);
  return _9;
}

So there is check for allocated block size being greater than max_size
which is
wrapper in __builtin_expect.  This makes ipa-fnsummary to give up analyzing
predicates and it will miss the fact that the two different calls to
__throw
will be optimized out if __n is larady smaller than 1152921504606846975
which
it is after _M_check_len.

This patch extends ipa-fnsummary to understand functions that return their
parameter.

gcc/ChangeLog:

PR tree-optimization/109849
* ipa-fnsummary.cc (decompose_param_expr): Skip
functions returning its parameter.
(set_cond_stmt_execution_predicate): Return early
if predicate was constructed.

gcc/testsuite/ChangeLog:

PR tree-optimization/109849
* gcc.dg/ipa/pr109849.c: New test.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-06-30 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #20 from CVS Commits  ---
The master branch has been updated by Jan Hubicka :

https://gcc.gnu.org/g:eab57b825bcc350e9ff44eb2fa739a80199d9bb1

commit r14-2219-geab57b825bcc350e9ff44eb2fa739a80199d9bb1
Author: Jan Hubicka 
Date:   Fri Jun 30 16:27:27 2023 +0200

Fix handling of __builtin_expect_with_probability and improve first-match
heuristics

While looking into the std::vector _M_realloc_insert codegen I noticed that
call of __throw_bad_alloc is predicted with 10% probability. This is
because
the conditional guarding it has __builtin_expect (cond, 0) on it.  This
incorrectly takes precedence over more reliable heuristics predicting that
call
to cold noreturn is likely not going to happen.

So I reordered the predictors so __builtin_expect_with_probability comes
first
after predictors that never makes a mistake (so user can use it to always
specify the outcome by hand).  I also downgraded malloc predictor since I
do
think user-defined malloc functions & new operators may behave funny ways
and
moved usual __builtin_expect after the noreturn cold predictor.

This triggered latent bug in expr_expected_value_1 where

  if (*predictor < predictor2)
*predictor = predictor2;

should be:

  if (predictor2 < *predictor)
*predictor = predictor2;

which eventually triggered an ICE on combining heuristics.  This made me
notice
that we can do slightly better while combining expected values in case only
one of the parameters (such as in a*b when we expect a==0) can determine
overall result.

Note that the new code may pick weaker heuristics in case that both values
are
predicted.  Not sure if this scenario is worth the extra CPU time: there is
not correct way to combine the probabilities anyway since we do not know if
the predictions are independent, so I think users should not rely on it.

Fixing this issue uncovered another problem.  In 2018 Martin Liska added
code predicting that MALLOC returns non-NULL but instead of that he
predicts
that it returns true (boolean 1).  This sort of works for testcase testing
 malloc (10) != NULL
but, for example, we will predict
 malloc (10) == malloc (10)
as true, which is not right and such comparsion may happen in real code

I think proper way is to update expr_expected_value_1 to work with value
ranges, but that needs greater surgery so I decided to postpone this and
only add FIXME and fill PR110499.

gcc/ChangeLog:

PR middle-end/109849
* predict.cc (estimate_bb_frequencies): Turn to static function.
(expr_expected_value_1): Fix handling of binary expressions with
predicted values.
* predict.def (PRED_MALLOC_NONNULL): Move later in the priority
queue.
(PRED_BUILTIN_EXPECT_WITH_PROBABILITY): Move to almost top of the
priority
queue.
* predict.h (estimate_bb_frequencies): No longer declare it.

gcc/testsuite/ChangeLog:

PR middle-end/109849
* gcc.dg/predict-18.c: Improve testcase.

[Bug middle-end/109849] suboptimal code for vector walking loop

2024-01-03 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #34 from Martin Jambor  ---
(In reply to Jan Hubicka from comment #32)
> > /tmp/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/stl_algobase.h:437:
> > warning: 'void* __builtin_memcpy(void*, const void*, long unsigned int)'
> > writing between 2 and 9223372036854775806 bytes into a region of size 0
> > overflows the destination [-Wstringop-overflow=]
> 
> It warns on:
> 
>   template
> struct __copy_move<_IsMove, true, random_access_iterator_tag>
> {
>   template
> _GLIBCXX20_CONSTEXPR
> static _Up*
> __copy_m(_Tp* __first, _Tp* __last, _Up* __result)
> {
>   const ptrdiff_t _Num = __last - __first;
>   if (__builtin_expect(_Num > 1, true))
> __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
>   else if (_Num == 1)
> std::__copy_move<_IsMove, false, random_access_iterator_tag>::
>   __assign_one(__result, __first);
>   return __result + _Num;
> }
> };
> 
> It is likely false positive on a code path that never happens in real
> code, but we now optimize it better.
> 

We end up with:
   [local count: 64736968]:
  __builtin_memcpy (1B, v$_M_impl$D10203$_M_start_448, _354);

IIRC the statement variant is created by jump threading (specifically
thread2).

Moreover, if I understand the comment in compute_objsize_r about the
INTEGER_CST case correctly, small integers are considered potential
"result of erroneous null pointer addition/subtraction."  So not
warning on a constant 1 destination does not seem to be desirable.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-19 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #21 from Jan Hubicka  ---
Patch
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637265.html
gets us closer to inlining _M_realloc_insert at -O3 (3 insns away)

Patch
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636935.html
reduces the expense when _M_realloc_insert is not inlined at -O2 (where I think
we should not inline it, unlike for clang)

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-21 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #22 from CVS Commits  ---
The master branch has been updated by Jan Hubicka :

https://gcc.gnu.org/g:1d82fc2e6824bf83159389729c31a942f7b91b04

commit r14-5679-g1d82fc2e6824bf83159389729c31a942f7b91b04
Author: Jan Hubicka 
Date:   Tue Nov 21 15:17:16 2023 +0100

optimize std::vector::push_back

this patch speeds up the push_back at -O3 significantly by making the
reallocation to be inlined by default.  _M_realloc_insert is general
insertion that takes iterator pointing to location where the value
should be inserted.  As such it contains code to move other entries around
that is quite large.

Since appending to the end of array is common operation, I think we should
have specialized code for that.  Sadly it is really hard to work out this
from IPA passes, since we basically care whether the iterator points to
the same place as the end pointer, which are both passed by reference.
This is inter-procedural value numbering that is quite out of reach.

I also added extra check making it clear that the new length of the vector
is non-zero.  This saves extra conditionals.  Again it is quite hard case
since _M_check_len seem to be able to return 0 if its parameter is 0.
This never happens here, but we are not able to propagate this early nor
at IPA stage.

libstdc++-v3/ChangeLog:

PR libstdc++/110287
PR middle-end/109811
PR middle-end/109849
* include/bits/stl_vector.h (_M_realloc_append): New member
function.
(push_back): Use it.
* include/bits/vector.tcc: (emplace_back): Use it.
(_M_realloc_insert): Let compiler know that new vector size is
non-zero.
(_M_realloc_append): New member function.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-21 Thread hubicka at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849
Bug 109849 depends on bug 110377, which changed state.

Bug 110377 Summary: Early VRP and IPA-PROP should work out value ranges from 
__builtin_unreachable
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110377

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-24 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #23 from CVS Commits  ---
The master branch has been updated by Martin Jambor :

https://gcc.gnu.org/g:aae723d360ca26cd9fd0b039fb0a616bd0eae363

commit r14-5831-gaae723d360ca26cd9fd0b039fb0a616bd0eae363
Author: Martin Jambor 
Date:   Fri Nov 24 17:32:35 2023 +0100

sra: SRA of non-escaped aggregates passed by reference to calls

PR109849 shows that a loop that heavily pushes and pops from a stack
implemented by a C++ std::vec results in slow code, mainly because the
vector structure is not split by SRA and so we end up in many loads
and stores into it.  This is because it is passed by reference
to (re)allocation methods and so needs to live in memory, even though
it does not escape from them and so we could SRA it if we
re-constructed it before the call and then separated it to distinct
replacements afterwards.

This patch does exactly that, first relaxing the selection of
candidates to also include those which are addressable but do not
escape and then adding code to deal with the calls.  The
micro-benchmark that is also the (scan-dump) testcase in this patch
runs twice as fast with it than with current trunk.  Honza measured
its effect on the libjxl benchmark and it almost closes the
performance gap between Clang and GCC while not requiring excessive
inlining and thus code growth.

The patch disallows creation of replacements for such aggregates which
are also accessed with a precision smaller than their size because I
have observed that this led to excessive zero-extending of data
leading to slow-downs of perlbench (on some CPUs).  Apart from this
case I have not noticed any regressions, at least not so far.

Gimple call argument flags can tell if an argument is unused (and then
we do not need to generate any statements for it) or if it is not
written to and then we do not need to generate statements loading
replacements from the original aggregate after the call statement.
Unfortunately, we cannot symmetrically use flags that an aggregate is
not read because to avoid re-constructing the aggregate before the
call because flags don't tell which what parts of aggregates were not
written to, so we load all replacements, and so all need to have the
correct value before the call.

This version of the patch also takes care to avoid attempts to modify
abnormal edges, something which was missing in the previosu version.

gcc/ChangeLog:

2023-11-23  Martin Jambor  

PR middle-end/109849
* tree-sra.cc (passed_by_ref_in_call): New.
(sra_initialize): Allocate passed_by_ref_in_call.
(sra_deinitialize): Free passed_by_ref_in_call.
(create_access): Add decl pool candidates only if they are not
already candidates.
(build_access_from_expr_1): Bail out on ADDR_EXPRs.
(build_access_from_call_arg): New function.
(asm_visit_addr): Rename to scan_visit_addr, change the
disqualification dump message.
(scan_function): Check taken addresses for all non-call statements,
including phi nodes.  Process all call arguments, including the
static
chain, build_access_from_call_arg.
(maybe_add_sra_candidate): Relax need_to_live_in_memory check to
allow
non-escaped local variables.
(sort_and_splice_var_accesses): Disallow smaller-than-precision
replacements for aggregates passed by reference to functions.
(sra_modify_expr): Use a separate stmt iterator for adding
satements
before the processed statement and after it.
(enum out_edge_check): New type.
(abnormal_edge_after_stmt_p): New function.
(sra_modify_call_arg): New function.
(sra_modify_assign): Adjust calls to sra_modify_expr.
(sra_modify_function_body): Likewise, use sra_modify_call_arg to
process call arguments, including the static chain.

gcc/testsuite/ChangeLog:

2023-11-23  Martin Jambor  

PR middle-end/109849
* g++.dg/tree-ssa/pr109849.C: New test.
* g++.dg/tree-ssa/sra-eh-1.C: Likewise.
* gcc.dg/tree-ssa/pr109849.c: Likewise.
* gcc.dg/tree-ssa/sra-longjmp-1.c: Likewise.
* gfortran.dg/pr43984.f90: Added -fno-tree-sra to dg-options.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-24 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #24 from CVS Commits  ---
The master branch has been updated by Jan Hubicka :

https://gcc.gnu.org/g:c2dcfb6ba6e9a84a16e63ae73a822ae2a843170c

commit r14-5832-gc2dcfb6ba6e9a84a16e63ae73a822ae2a843170c
Author: Jan Hubicka 
Date:   Fri Nov 24 17:59:44 2023 +0100

Use memcpy instead of memmove in __relocate_a_1

__relocate_a_1 is used to copy data after vector reizing.  This can be done
by memcpy
rather than memmove.

libstdc++-v3/ChangeLog:

PR middle-end/109849
* include/bits/stl_uninitialized.h (__relocate_a_1): Use memcpy
instead
of memmove.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-24 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #25 from Martin Jambor  ---
(In reply to Richard Biener from comment #7)
> There is nothing to sink really, loop header copying introduces a PHI and
> there's not partial redundancies but only partial-partial and those are not
> obvious to CSE because of the introduced PHI.
> 

SRA now decomposes stack.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-27 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849
Bug 109849 depends on bug 112653, which changed state.

Bug 112653 Summary: PTA should handle correctly escape information of values 
returned by a function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112653

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-28 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #26 from Jonathan Wakely  ---
(In reply to GCC Commits from comment #23)
> https://gcc.gnu.org/g:aae723d360ca26cd9fd0b039fb0a616bd0eae363
> 
> commit r14-5831-gaae723d360ca26cd9fd0b039fb0a616bd0eae363
> Author: Martin Jambor 
> Date:   Fri Nov 24 17:32:35 2023 +0100
> 
> sra: SRA of non-escaped aggregates passed by reference to calls
> 

I'm seeing a large number of libstdc++ testsuite failures, bisected to this
patch.

For example:

make check -C x86_64-pc-linux-gnu/libstdc++-v3
RUNTESTFLAGS="conformance.exp=21_strings/basic_string/operators/char/1.cc
--target_board=unix/-D_GLIBCXX_USE_CXX11_ABI=0"

The full list of FAILs is:

FAIL: 23_containers/vector/types/1.cc  -std=gnu++98 (test for excess errors)
FAIL: 23_containers/vector/types/1.cc  -std=gnu++98 (test for excess errors)
FAIL: 19_diagnostics/stacktrace/output.cc  -std=gnu++23 execution test
FAIL: 19_diagnostics/stacktrace/output.cc  -std=gnu++26 execution test
FAIL: 19_diagnostics/system_error/cons-1.cc  -std=gnu++11 execution test
FAIL: 19_diagnostics/system_error/cons-1.cc  -std=gnu++14 execution test
FAIL: 19_diagnostics/system_error/cons-1.cc  -std=gnu++17 execution test
FAIL: 19_diagnostics/system_error/cons-1.cc  -std=gnu++20 execution test
FAIL: 19_diagnostics/system_error/cons-1.cc  -std=gnu++23 execution test
FAIL: 19_diagnostics/system_error/cons-1.cc  -std=gnu++26 execution test
FAIL: 19_diagnostics/system_error/what-1.cc  -std=gnu++11 execution test
FAIL: 19_diagnostics/system_error/what-1.cc  -std=gnu++14 execution test
FAIL: 19_diagnostics/system_error/what-1.cc  -std=gnu++17 execution test
FAIL: 19_diagnostics/system_error/what-1.cc  -std=gnu++20 execution test
FAIL: 19_diagnostics/system_error/what-1.cc  -std=gnu++23 execution test
FAIL: 19_diagnostics/system_error/what-1.cc  -std=gnu++26 execution test
FAIL: 19_diagnostics/system_error/what-2.cc  -std=gnu++11 execution test
FAIL: 19_diagnostics/system_error/what-2.cc  -std=gnu++14 execution test
FAIL: 19_diagnostics/system_error/what-2.cc  -std=gnu++17 execution test
FAIL: 19_diagnostics/system_error/what-2.cc  -std=gnu++20 execution test
FAIL: 19_diagnostics/system_error/what-2.cc  -std=gnu++23 execution test
FAIL: 19_diagnostics/system_error/what-2.cc  -std=gnu++26 execution test
FAIL: 19_diagnostics/system_error/what-3.cc  -std=gnu++11 execution test
FAIL: 19_diagnostics/system_error/what-3.cc  -std=gnu++14 execution test
FAIL: 19_diagnostics/system_error/what-3.cc  -std=gnu++17 execution test
FAIL: 19_diagnostics/system_error/what-3.cc  -std=gnu++20 execution test
FAIL: 19_diagnostics/system_error/what-3.cc  -std=gnu++23 execution test
FAIL: 19_diagnostics/system_error/what-3.cc  -std=gnu++26 execution test
FAIL: 19_diagnostics/system_error/what-4.cc  -std=gnu++11 execution test
FAIL: 19_diagnostics/system_error/what-4.cc  -std=gnu++14 execution test
FAIL: 19_diagnostics/system_error/what-4.cc  -std=gnu++17 execution test
FAIL: 19_diagnostics/system_error/what-4.cc  -std=gnu++20 execution test
FAIL: 19_diagnostics/system_error/what-4.cc  -std=gnu++23 execution test
FAIL: 19_diagnostics/system_error/what-4.cc  -std=gnu++26 execution test
FAIL: 19_diagnostics/system_error/what-big.cc  -std=gnu++14 execution test
FAIL: 19_diagnostics/system_error/what-big.cc  -std=gnu++17 execution test
FAIL: 19_diagnostics/system_error/what-big.cc  -std=gnu++20 execution test
FAIL: 19_diagnostics/system_error/what-big.cc  -std=gnu++23 execution test
FAIL: 19_diagnostics/system_error/what-big.cc  -std=gnu++26 execution test
FAIL: 21_strings/basic_string/cons/char/moveable2.cc  -std=gnu++11 execution
test
FAIL: 21_strings/basic_string/cons/char/moveable2.cc  -std=gnu++14 execution
test
FAIL: 21_strings/basic_string/cons/char/moveable2.cc  -std=gnu++17 execution
test
FAIL: 21_strings/basic_string/cons/char/moveable2.cc  -std=gnu++20 execution
test
FAIL: 21_strings/basic_string/cons/char/moveable2.cc  -std=gnu++23 execution
test
FAIL: 21_strings/basic_string/cons/char/moveable2.cc  -std=gnu++26 execution
test
FAIL: 21_strings/basic_string/cons/wchar_t/moveable2.cc  -std=gnu++11 execution
test
FAIL: 21_strings/basic_string/cons/wchar_t/moveable2.cc  -std=gnu++14 execution
test
FAIL: 21_strings/basic_string/cons/wchar_t/moveable2.cc  -std=gnu++17 execution
test
FAIL: 21_strings/basic_string/cons/wchar_t/moveable2.cc  -std=gnu++20 execution
test
FAIL: 21_strings/basic_string/cons/wchar_t/moveable2.cc  -std=gnu++23 execution
test
FAIL: 21_strings/basic_string/cons/wchar_t/moveable2.cc  -std=gnu++26 execution
test
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++11 execution test
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++14 execution test
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++17 execution test
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++20 execution test
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++23 execution test
FAIL: 21_strings/basic_string/op

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-28 Thread jamborm at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #27 from Martin Jambor  ---
(In reply to Jonathan Wakely from comment #26)
> (In reply to GCC Commits from comment #23)
> > https://gcc.gnu.org/g:aae723d360ca26cd9fd0b039fb0a616bd0eae363
> > 
> > commit r14-5831-gaae723d360ca26cd9fd0b039fb0a616bd0eae363
> > Author: Martin Jambor 
> > Date:   Fri Nov 24 17:32:35 2023 +0100
> > 
> > sra: SRA of non-escaped aggregates passed by reference to calls
> > 
> 
> I'm seeing a large number of libstdc++ testsuite failures, bisected to this
> patch.
> 
> For example:
> 
> make check -C x86_64-pc-linux-gnu/libstdc++-v3
> RUNTESTFLAGS="conformance.exp=21_strings/basic_string/operators/char/1.cc
> --target_board=unix/-D_GLIBCXX_USE_CXX11_ABI=0"
> 

Unfortunately I cannot reproduce this, the above (on pristine master
commit 006e90e1344 on an x86_64-linux) results in:

Running target unix/-D_GLIBCXX_USE_CXX11_ABI=0
Running
/home/mjambor/gcc/small/src/libstdc++-v3/testsuite/libstdc++-dg/conformance.exp
...
PASS: 21_strings/basic_string/operators/char/1.cc  -std=gnu++17 (test for
excess errors)
PASS: 21_strings/basic_string/operators/char/1.cc  -std=gnu++17 execution test

Can you please try if
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638318.html
fixes this?

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-28 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #28 from Jonathan Wakely  ---
(In reply to Martin Jambor from comment #27)
> Unfortunately I cannot reproduce this, the above (on pristine master
> commit 006e90e1344 on an x86_64-linux) results in:
> 
> Running target unix/-D_GLIBCXX_USE_CXX11_ABI=0
> Running
> /home/mjambor/gcc/small/src/libstdc++-v3/testsuite/libstdc++-dg/conformance.
> exp ...
> PASS: 21_strings/basic_string/operators/char/1.cc  -std=gnu++17 (test for
> excess errors)
> PASS: 21_strings/basic_string/operators/char/1.cc  -std=gnu++17 execution
> test

Oops, sorry, that particular FAIL needs either
--target_board=unix/-D_GLIBCXX_USE_CXX11_ABI=0/-D_GLIBCXX_DEBUG which then
makes it fail for all -std modes:


Schedule of variations:
unix/-D_GLIBCXX_USE_CXX11_ABI=0/-D_GLIBCXX_DEBUG

Running target unix/-D_GLIBCXX_USE_CXX11_ABI=0/-D_GLIBCXX_DEBUG
Using /usr/share/dejagnu/baseboards/unix.exp as board description file for
target.
Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
Using /home/test/src/gcc/libstdc++-v3/testsuite/config/default.exp as
tool-and-target-specific interface file.
Running /home/test/src/gcc/libstdc++-v3/testsuite/libstdc++-dg/conformance.exp
...
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++11 execution test
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++14 execution test
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++17 execution test
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++20 execution test
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++23 execution test
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++26 execution test

Or just set GLIBCXX_TESTSUITE_STDS="17,20" in the env before running the test:

Schedule of variations:
unix/-D_GLIBCXX_USE_CXX11_ABI=0

Running target unix/-D_GLIBCXX_USE_CXX11_ABI=0
Using /usr/share/dejagnu/baseboards/unix.exp as board description file for
target.
Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
Using /home/test/src/gcc/libstdc++-v3/testsuite/config/default.exp as
tool-and-target-specific interface file.
Running /home/test/src/gcc/libstdc++-v3/testsuite/libstdc++-dg/conformance.exp
...
FAIL: 21_strings/basic_string/operators/char/1.cc  -std=gnu++20 execution test

It seems I picked a bad example to give, which requires additional options to
FAIL.

Many of the other FAILs do not require _GLIBCXX_DEBUG or -std=gnu++20 to FAIL,
but the -D_GLIBCXX_USE_CXX11_ABI=0 option is necessary, at least for all the
ones I inspected. That option isn't used by default, but I run the full
testsuite with that several times a day, and with
GLIBCXX_TESTSUITE_STDS=98,11,14,17,20,23,26.


> Can you please try if
> https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638318.html
> fixes this?

Testing now ...

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-28 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #29 from Jonathan Wakely  ---
(In reply to Jonathan Wakely from comment #28)
> (In reply to Martin Jambor from comment #27)
> > Can you please try if
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638318.html
> > fixes this?
> 
> Testing now ...

Yes, this fixes 21_strings/basic_string/operators/char/* and
28_regex/basic_regex/106607.cc

Running the full testsuite now...

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-28 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #30 from Jonathan Wakely  ---
So far the only FAIL is still see is:

FAIL: 23_containers/vector/types/1.cc  -std=gnu++98 (test for excess errors)

I'm not sure if this is caused by your patch or one of Honza's. The test only
fails with GLIBCXX_TESTSUITE_STDS=98 defined.

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-28 Thread redi at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #31 from Jonathan Wakely  ---
Bisection points to r14-5831-gaae723d360ca26cd9fd0b039fb0a616bd0eae363 for that
remaining FAIL as well (and it isn't fixed by the new patch).

It introduced a new warning which wasn't present before:

/tmp/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/stl_algobase.h:437:
warning: 'void* __builtin_memcpy(void*, const void*, long unsigned int)'
writing between 2 and 9223372036854775806 bytes into a region of size 0
overflows the destination [-Wstringop-overflow=]

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-29 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #32 from Jan Hubicka  ---
> /tmp/build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/stl_algobase.h:437:
> warning: 'void* __builtin_memcpy(void*, const void*, long unsigned int)'
> writing between 2 and 9223372036854775806 bytes into a region of size 0
> overflows the destination [-Wstringop-overflow=]

It warns on:

  template
struct __copy_move<_IsMove, true, random_access_iterator_tag>
{
  template
_GLIBCXX20_CONSTEXPR
static _Up*
__copy_m(_Tp* __first, _Tp* __last, _Up* __result)
{
  const ptrdiff_t _Num = __last - __first;
  if (__builtin_expect(_Num > 1, true))
__builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
  else if (_Num == 1)
std::__copy_move<_IsMove, false, random_access_iterator_tag>::
  __assign_one(__result, __first);
  return __result + _Num;
}
};

It is likely false positive on a code path that never happens in real
code, but we now optimize it better.

Does it show an inline path?

[Bug middle-end/109849] suboptimal code for vector walking loop

2023-11-29 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109849

--- Comment #33 from GCC Commits  ---
The master branch has been updated by Martin Jambor :

https://gcc.gnu.org/g:302461ad9a04d82fee904bddac69811d13d5bb6a

commit r14-5971-g302461ad9a04d82fee904bddac69811d13d5bb6a
Author: Martin Jambor 
Date:   Wed Nov 29 16:24:33 2023 +0100

tree-sra: Avoid returns of references to SRA candidates

The enhancement to address PR 109849 contained an importsnt thinko,
and that any reference that is passed to a function and does not
escape, must also not happen to be aliased by the return value of the
function.  This has quickly transpired as bugs PR 112711 and PR
112721.

Just as IPA-modref does a good enough job to allow us to rely on the
escaped set of variables, it sems to be doing well also on updating
EAF_NOT_RETURNED_DIRECTLY call argument flag which happens to address
exactly the situation we need to avoid.  Of course, if a call
statement ignores any returned value, we also do not need to check the
flag.

Hopefully this does not pessimize things too much, I have verified
that the PR 109849 testcae remains quick and so should also the
benchmark it is derived from.

gcc/ChangeLog:

2023-11-27  Martin Jambor  

PR tree-optimization/112711
PR tree-optimization/112721
* tree-sra.cc (build_access_from_call_arg): New parameter
CAN_BE_RETURNED, disqualify any candidate passed by reference if it
is
true.  Adjust leading comment.
(scan_function): Pass appropriate value to CAN_BE_RETURNED of
build_access_from_call_arg.

gcc/testsuite/ChangeLog:

2023-11-29  Martin Jambor  

PR tree-optimization/112711
PR tree-optimization/112721
* g++.dg/tree-ssa/pr112711.C: New test.
* gcc.dg/tree-ssa/pr112721.c: Likewise.