[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type

2024-05-15 Thread hubicka at ucw dot cz via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

--- Comment #7 from Jan Hubicka  ---
> and then we inline them back, introducing the extra copy.  Why do we use
> tail-calls here instead of aliases?  Why do we lack cost modeling here?
Because the function is exported and we must keep addresses different.
Cost modeling is somewhat hard here, since it is not clear what will
help inliner and whether inliner will inline back.

For example if we icf two function calling third function. the third one
may become called once and we get better code by inlining it and eliding
offline copy rather than keeping the caller duplicated.

So the idea is to get code into as canonical form as possible (with no
obvious duplicates) and then let the inliner heuristics to decide what
should and what should not be duplicated in the final code.

[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type

2024-05-15 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

--- Comment #6 from Richard Biener  ---
(In reply to Richard Biener from comment #5)
[...]
>   /* If aggregate_value_p is true, then we can return the bare RESULT_DECL.
>  Recall that aggregate_value_p is FALSE for any aggregate type that is
>  returned in registers.  If we're returning values in registers, then
>  we don't want to extend the lifetime of the RESULT_DECL, particularly
>  across another call.  In addition, for those aggregates for which
>  hard_function_value generates a PARALLEL, we'll die during normal
>  expansion of structure assignments; there's special code in
> expand_return
>  to handle this case that does not exist in expand_expr.  */
> 
> but the code does something else, using a new temporary register
> (here !aggregate_value_p).  The following fixes that.
> 
> diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc
> index b0ed58ed0f9..e4cf5438e39 100644
> --- a/gcc/gimplify.cc
> +++ b/gcc/gimplify.cc
> @@ -1873,7 +1873,8 @@ gimplify_return_expr (tree stmt, gimple_seq *pre_p)
>   to handle this case that does not exist in expand_expr.  */
>if (!result_decl)
>  result = NULL_TREE;
> -  else if (aggregate_value_p (result_decl, TREE_TYPE
> (current_function_decl)))
> +  else if (aggregate_value_p (result_decl, TREE_TYPE
> (current_function_decl))
> +  || !is_gimple_reg_type (TREE_TYPE (result_decl)))
>  {
>if (!poly_int_tree_p (DECL_SIZE (result_decl)))
> {

That said, this likely breaks exactly what the comment says but this should
be then worked around during RTL expansion and not gimplification (but
of course historically that was the same time).

[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type

2024-05-15 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

--- Comment #5 from Richard Biener  ---
Testcase for the inliner issue:

struct A { int a; short b; };
A test(A& a) { return a; }
A test1(A& a) { return test(a); }

where the issue is that test() doesn't use DECL_RESULT for its return and thus
declare_return_variable setting up DECL_RESULT mapped to the call LHS doesn't
help.

And the reason why we don't use DECL_RESULT is that we have

 >
side-effects
arg:0 
unit-size 
align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type
0x76bbdbd0 fields  context

full-name "struct A"
X() X(constX&) this=(X&) n_parents=0 use_template=0
interface-unknown
pointer_to_this  reference_to_this
 chain >
side-effects
arg:0 
ignored DI t2.ii:2:1 size  unit-size

align:32 warn_if_not_align:0 context >
arg:1 
side-effects tree_3 arg:0 
...

and gimplification changes the RESULT_DECL return because

  /* If aggregate_value_p is true, then we can return the bare RESULT_DECL.
 Recall that aggregate_value_p is FALSE for any aggregate type that is
 returned in registers.  If we're returning values in registers, then
 we don't want to extend the lifetime of the RESULT_DECL, particularly
 across another call.  In addition, for those aggregates for which
 hard_function_value generates a PARALLEL, we'll die during normal
 expansion of structure assignments; there's special code in expand_return
 to handle this case that does not exist in expand_expr.  */

but the code does something else, using a new temporary register
(here !aggregate_value_p).  The following fixes that.

diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc
index b0ed58ed0f9..e4cf5438e39 100644
--- a/gcc/gimplify.cc
+++ b/gcc/gimplify.cc
@@ -1873,7 +1873,8 @@ gimplify_return_expr (tree stmt, gimple_seq *pre_p)
  to handle this case that does not exist in expand_expr.  */
   if (!result_decl)
 result = NULL_TREE;
-  else if (aggregate_value_p (result_decl, TREE_TYPE (current_function_decl)))
+  else if (aggregate_value_p (result_decl, TREE_TYPE (current_function_decl))
+  || !is_gimple_reg_type (TREE_TYPE (result_decl)))
 {
   if (!poly_int_tree_p (DECL_SIZE (result_decl)))
{

[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type

2024-05-15 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

Andrew Pinski  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #4 from Andrew Pinski  ---
(In reply to Richard Biener from comment #2)
> So actually it seems that the reason is ICF plus inlining:

>, introducing the extra copy.  Why do we use
> tail-calls here instead of aliases?  Why do we lack cost modeling here?
> Why do we inline back?  It looks like a pointless exercise to me ...
> With -fdisable-ipa-inline we get
> 
> _Z5test2O1A:
> .LFB5:
> .cfi_startproc
> jmp _Z5test1R1A
> 
> so that's at least reasonable and what's expected I suppose.  So one
> could argue the bug is in the inliner and with introducing the extra
> copy (IIRC there's a bug about this), but still.

Which means this is dup of bug 96252

*** This bug has been marked as a duplicate of bug 96252 ***

[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type

2024-05-15 Thread jakub at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

Jakub Jelinek  changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek  ---
(In reply to Richard Biener from comment #2)
> Why do we use
> tail-calls here instead of aliases?

The functions are exported, so we can't prove if some other TU won't do
return  == 
or similar.

> Why do we inline back?

But IMHO if we ICF optimize something, we shouldn't undo that, so should
disable the inlining across that edge.  Or decide not to ICF optimize it.

Plus we have the similar problem with fnsplit, where if we split some function
and inline it back it causes undesirable debug info differences, ideally the
inlining of fnsplit into self should just restore the old body.

[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type

2024-05-15 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097

Richard Biener  changed:

   What|Removed |Added

  Component|tree-optimization   |ipa
 CC||hubicka at gcc dot gnu.org,
   ||rguenth at gcc dot gnu.org

--- Comment #2 from Richard Biener  ---
So actually it seems that the reason is ICF plus inlining:

t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test2(A&&)/1
t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test2O1A/1
t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test3(const A&)/2
t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test3RK1A/2
t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test4(const A&&)/3
t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test4OK1A/3
optimized:  Inlined A test1(A&)/4 into A test2(A&&)/1 which now has time
4.00 and size 5, net change of -1.
optimized:  Inlined A test1(A&)/5 into A test3(const A&)/2 which now has time
4.00 and size 5, net change of -1.
optimized:  Inlined A test1(A&)/6 into A test4(const A&&)/3 which now has time
4.00 and size 5, net change of -1.

for some reason we "optimize" the functions to the following in IPA ICF:

struct A test4 (const struct A & a)
{
  struct A retval.6;

   [local count: 1073741824]:
  retval.6 = test1 (a_2(D)); [tail call]
  return retval.6;

}


struct A test3 (const struct A & a)
{
  struct A retval.5;

   [local count: 1073741824]:
  retval.5 = test1 (a_2(D)); [tail call]
  return retval.5;

}


struct A test2 (struct A & a)
{
  struct A retval.4;

   [local count: 1073741824]:
  retval.4 = test1 (a_2(D)); [tail call]
  return retval.4;

}

and then we inline them back, introducing the extra copy.  Why do we use
tail-calls here instead of aliases?  Why do we lack cost modeling here?
Why do we inline back?  It looks like a pointless exercise to me ...
With -fdisable-ipa-inline we get

_Z5test2O1A:
.LFB5:
.cfi_startproc
jmp _Z5test1R1A

so that's at least reasonable and what's expected I suppose.  So one
could argue the bug is in the inliner and with introducing the extra
copy (IIRC there's a bug about this), but still.