[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097 --- Comment #7 from Jan Hubicka --- > and then we inline them back, introducing the extra copy. Why do we use > tail-calls here instead of aliases? Why do we lack cost modeling here? Because the function is exported and we must keep addresses different. Cost modeling is somewhat hard here, since it is not clear what will help inliner and whether inliner will inline back. For example if we icf two function calling third function. the third one may become called once and we get better code by inlining it and eliding offline copy rather than keeping the caller duplicated. So the idea is to get code into as canonical form as possible (with no obvious duplicates) and then let the inliner heuristics to decide what should and what should not be duplicated in the final code.
[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097 --- Comment #6 from Richard Biener --- (In reply to Richard Biener from comment #5) [...] > /* If aggregate_value_p is true, then we can return the bare RESULT_DECL. > Recall that aggregate_value_p is FALSE for any aggregate type that is > returned in registers. If we're returning values in registers, then > we don't want to extend the lifetime of the RESULT_DECL, particularly > across another call. In addition, for those aggregates for which > hard_function_value generates a PARALLEL, we'll die during normal > expansion of structure assignments; there's special code in > expand_return > to handle this case that does not exist in expand_expr. */ > > but the code does something else, using a new temporary register > (here !aggregate_value_p). The following fixes that. > > diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc > index b0ed58ed0f9..e4cf5438e39 100644 > --- a/gcc/gimplify.cc > +++ b/gcc/gimplify.cc > @@ -1873,7 +1873,8 @@ gimplify_return_expr (tree stmt, gimple_seq *pre_p) > to handle this case that does not exist in expand_expr. */ >if (!result_decl) > result = NULL_TREE; > - else if (aggregate_value_p (result_decl, TREE_TYPE > (current_function_decl))) > + else if (aggregate_value_p (result_decl, TREE_TYPE > (current_function_decl)) > + || !is_gimple_reg_type (TREE_TYPE (result_decl))) > { >if (!poly_int_tree_p (DECL_SIZE (result_decl))) > { That said, this likely breaks exactly what the comment says but this should be then worked around during RTL expansion and not gimplification (but of course historically that was the same time).
[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097 --- Comment #5 from Richard Biener --- Testcase for the inliner issue: struct A { int a; short b; }; A test(A& a) { return a; } A test1(A& a) { return test(a); } where the issue is that test() doesn't use DECL_RESULT for its return and thus declare_return_variable setting up DECL_RESULT mapped to the call LHS doesn't help. And the reason why we don't use DECL_RESULT is that we have > side-effects arg:0 unit-size align:32 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x76bbdbd0 fields context full-name "struct A" X() X(constX&) this=(X&) n_parents=0 use_template=0 interface-unknown pointer_to_this reference_to_this chain > side-effects arg:0 ignored DI t2.ii:2:1 size unit-size align:32 warn_if_not_align:0 context > arg:1 side-effects tree_3 arg:0 ... and gimplification changes the RESULT_DECL return because /* If aggregate_value_p is true, then we can return the bare RESULT_DECL. Recall that aggregate_value_p is FALSE for any aggregate type that is returned in registers. If we're returning values in registers, then we don't want to extend the lifetime of the RESULT_DECL, particularly across another call. In addition, for those aggregates for which hard_function_value generates a PARALLEL, we'll die during normal expansion of structure assignments; there's special code in expand_return to handle this case that does not exist in expand_expr. */ but the code does something else, using a new temporary register (here !aggregate_value_p). The following fixes that. diff --git a/gcc/gimplify.cc b/gcc/gimplify.cc index b0ed58ed0f9..e4cf5438e39 100644 --- a/gcc/gimplify.cc +++ b/gcc/gimplify.cc @@ -1873,7 +1873,8 @@ gimplify_return_expr (tree stmt, gimple_seq *pre_p) to handle this case that does not exist in expand_expr. */ if (!result_decl) result = NULL_TREE; - else if (aggregate_value_p (result_decl, TREE_TYPE (current_function_decl))) + else if (aggregate_value_p (result_decl, TREE_TYPE (current_function_decl)) + || !is_gimple_reg_type (TREE_TYPE (result_decl))) { if (!poly_int_tree_p (DECL_SIZE (result_decl))) {
[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097 Andrew Pinski changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |DUPLICATE --- Comment #4 from Andrew Pinski --- (In reply to Richard Biener from comment #2) > So actually it seems that the reason is ICF plus inlining: >, introducing the extra copy. Why do we use > tail-calls here instead of aliases? Why do we lack cost modeling here? > Why do we inline back? It looks like a pointless exercise to me ... > With -fdisable-ipa-inline we get > > _Z5test2O1A: > .LFB5: > .cfi_startproc > jmp _Z5test1R1A > > so that's at least reasonable and what's expected I suppose. So one > could argue the bug is in the inliner and with introducing the extra > copy (IIRC there's a bug about this), but still. Which means this is dup of bug 96252 *** This bug has been marked as a duplicate of bug 96252 ***
[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097 Jakub Jelinek changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #3 from Jakub Jelinek --- (In reply to Richard Biener from comment #2) > Why do we use > tail-calls here instead of aliases? The functions are exported, so we can't prove if some other TU won't do return == or similar. > Why do we inline back? But IMHO if we ICF optimize something, we shouldn't undo that, so should disable the inlining across that edge. Or decide not to ICF optimize it. Plus we have the similar problem with fnsplit, where if we split some function and inline it back it causes undesirable debug info differences, ideally the inlining of fnsplit into self should just restore the old body.
[Bug ipa/115097] Strange suboptimal codegen specifically at -O2 when copying struct type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115097 Richard Biener changed: What|Removed |Added Component|tree-optimization |ipa CC||hubicka at gcc dot gnu.org, ||rguenth at gcc dot gnu.org --- Comment #2 from Richard Biener --- So actually it seems that the reason is ICF plus inlining: t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test2(A&&)/1 t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test2O1A/1 t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test3(const A&)/2 t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test3RK1A/2 t.ii:2:3: optimized: Semantic equality hit:A test1(A&)/0->A test4(const A&&)/3 t.ii:2:3: optimized: Assembler symbol names:_Z5test1R1A/0->_Z5test4OK1A/3 optimized: Inlined A test1(A&)/4 into A test2(A&&)/1 which now has time 4.00 and size 5, net change of -1. optimized: Inlined A test1(A&)/5 into A test3(const A&)/2 which now has time 4.00 and size 5, net change of -1. optimized: Inlined A test1(A&)/6 into A test4(const A&&)/3 which now has time 4.00 and size 5, net change of -1. for some reason we "optimize" the functions to the following in IPA ICF: struct A test4 (const struct A & a) { struct A retval.6; [local count: 1073741824]: retval.6 = test1 (a_2(D)); [tail call] return retval.6; } struct A test3 (const struct A & a) { struct A retval.5; [local count: 1073741824]: retval.5 = test1 (a_2(D)); [tail call] return retval.5; } struct A test2 (struct A & a) { struct A retval.4; [local count: 1073741824]: retval.4 = test1 (a_2(D)); [tail call] return retval.4; } and then we inline them back, introducing the extra copy. Why do we use tail-calls here instead of aliases? Why do we lack cost modeling here? Why do we inline back? It looks like a pointless exercise to me ... With -fdisable-ipa-inline we get _Z5test2O1A: .LFB5: .cfi_startproc jmp _Z5test1R1A so that's at least reasonable and what's expected I suppose. So one could argue the bug is in the inliner and with introducing the extra copy (IIRC there's a bug about this), but still.