Re: Massive performance regression from switching to gcc 4.5
On 06/30/2010 02:26 PM, Basile Starynkevitch wrote:

On Wed, 2010-06-30 at 14:23 -0700, Taras Glek wrote:

I tried 4.5 -O2 and it's actually faster than 4.3 -Os. I am happy that -O2 performance is actually pretty good, but -Os regression is going to hurt on mobile.

Did you try gcc-4.5 -flto -Os or gcc-4.5 -flto -O2? It would be interesting to hear that GCC is able to LTO a program as big as Mozilla! And figures (notably RAM, CPU time, wallclock time for build) would be interesting.

Both whopr and flto cause gcc to segfault while building Mozilla.

4.5 WHOPR is completely broken. LTO is in better shape but I am not sure if we can reasonably expect it to build Mozilla. However I would be very happy to help getting WHOPR working for 4.6.

Hi, I now got the 4.6 WHOPR build up to libxul.so, which seems to be one of the bigger files. WHOPR linking consists of a serial stage (WPA) that merges the whole program and does interprocedural optimization, followed by a parallel build. The serial stage needs 3.7GB of RAM and 10 minutes; most of that time is spent writing out the files for the parallel builds, which are around 5GB overall. The size of the files can be cut down significantly by a sane partitioning algorithm, since we produce over 1000 partitions where 40 would do the job. (This is with an enable-checking compiler.)

Later builds still die for me, but it seems that libxul is not too large for WHOPR. (I hope all parameters will reduce significantly before 4.6 is out.) What are the other big components I should be afraid of?
Oprofile of the WPA stage is as follows:

    samples  %       symbol
    382507   8.4240  lto_output_1_stream
    379158   8.3503  htab_find_slot_with_hash
    207330   4.5661  bp_pack_value
    155793   3.4311  iterative_hash_hashval_t
    135132   2.9760  lto_output_uleb128_stream
    101110   2.2268  gimple_types_compatible_p
    92828    2.0444  cgraph_node_in_set_p
    83205    1.8324  lto_promote_cross_file_statics
    76243    1.6791  htab_expand
    75993    1.6736  htab_hash_string
    75790    1.6691  eq_string_slot_node
    75020    1.6522  bp_unpack_value
    73403    1.6166  linemap_lookup
    65353    1.4393  lto_output_sleb128_stream
    64864    1.4285  inflate_fast
    64508    1.4207  verify_cgraph_node
    60076    1.3231  lto_output_tree
    57120    1.2580  referenced_from_this_partition_p
    56225    1.2383  lto_input_uleb128
    53620    1.1809  lto_streamer_cache_insert_1
    52973    1.1666  htab_find_slot
    45728    1.0071  lto_output_tree_or_ref
    43428    0.9564  lto_input_1_unsigned
    41556    0.9152  tree_map_base_eq
    39232    0.8640  hash_cgraph_node_set_element
    35695    0.7861  ggc_set_mark

So not much of a surprise: streaming is inefficient and we need a lot of time for type merging too. I am compiling to get a time report.

Honza
Re: Massive performance regression from switching to gcc 4.5
... and time report

Execution times (seconds)
 garbage collection    :  12.48 ( 2%) usr   0.00 ( 0%) sys  12.50 ( 2%) wall       0 kB ( 0%) ggc
 callgraph optimization:   0.21 ( 0%) usr   0.00 ( 0%) sys   0.21 ( 0%) wall    2743 kB ( 0%) ggc
 varpool construction  :   0.97 ( 0%) usr   0.02 ( 0%) sys   0.97 ( 0%) wall   44476 kB ( 1%) ggc
 ipa cp                :   2.82 ( 0%) usr   0.05 ( 0%) sys   2.90 ( 0%) wall   65189 kB ( 2%) ggc
 ipa lto gimple in     :   4.96 ( 1%) usr   0.30 ( 2%) sys   5.28 ( 1%) wall       8 kB ( 0%) ggc
 ipa lto gimple out    :  26.51 ( 4%) usr   0.73 ( 4%) sys  27.54 ( 4%) wall       0 kB ( 0%) ggc
 ipa lto decl in       : 100.69 (14%) usr   2.76 (15%) sys 103.69 (14%) wall 3069230 kB (87%) ggc
 ipa lto decl out      : 405.60 (55%) usr   0.05 ( 0%) sys 406.86 (54%) wall       0 kB ( 0%) ggc
 ipa lto decl init I/O :   2.21 ( 0%) usr   0.03 ( 0%) sys   2.25 ( 0%) wall   82822 kB ( 2%) ggc
 ipa lto cgraph I/O    :   1.85 ( 0%) usr   0.26 ( 1%) sys   2.11 ( 0%) wall  247249 kB ( 7%) ggc
 ipa lto decl merge    :  54.30 ( 7%) usr   1.20 ( 6%) sys  55.63 ( 7%) wall     283 kB ( 0%) ggc
 ipa lto cgraph merge  :   1.35 ( 0%) usr   0.00 ( 0%) sys   1.35 ( 0%) wall    5259 kB ( 0%) ggc
 whopr wpa             :  79.05 (11%) usr   0.14 ( 1%) sys  79.35 (11%) wall   22146 kB ( 1%) ggc
 whopr wpa I/O         :  10.07 ( 1%) usr  12.89 (69%) sys  22.82 ( 3%) wall       0 kB ( 0%) ggc
 ipa reference         :   1.83 ( 0%) usr   0.00 ( 0%) sys   1.79 ( 0%) wall       0 kB ( 0%) ggc
 ipa profile           :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.20 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const        :   1.44 ( 0%) usr   0.01 ( 0%) sys   1.46 ( 0%) wall       0 kB ( 0%) ggc
 inline heuristics     :  14.37 ( 2%) usr   0.00 ( 0%) sys  14.41 ( 2%) wall    2825 kB ( 0%) ggc
 callgraph verifier    :   9.22 ( 1%) usr   0.19 ( 1%) sys   9.42 ( 1%) wall       0 kB ( 0%) ggc
 varconst              :   0.02 ( 0%) usr   0.03 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 : 732.42             18.73            753.13          3543343 kB

decl out time will definitely reduce with sane partitioning. Also it seems that some of my simple-minded partitioning code should be optimized to avoid quadratic behaviour.

Honza
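To make the partitioning remark concrete, here is a minimal sketch of grouping units into a fixed number of roughly equal-sized partitions instead of emitting more than a thousand small ones. It is not GCC's WHOPR code: the `Unit` type, the `partition_balanced` function, and the size estimates are hypothetical names chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one unit handled by the WPA stage.
struct Unit {
    int id;            // index of the unit (assumption for this sketch)
    std::size_t size;  // estimated size of its streamed body
};

// Walk the units in order and close the current partition as soon as it
// reaches the target size (total size divided by the wanted count, rounded
// up). Illustrative only; this is not the algorithm used by GCC.
std::vector<std::vector<Unit>> partition_balanced(const std::vector<Unit>& units,
                                                  std::size_t wanted) {
    assert(wanted > 0);
    std::vector<std::vector<Unit>> parts;

    std::size_t total = 0;
    for (const Unit& u : units)
        total += u.size;
    // Rounding up (and never going below 1) keeps the count at or below `wanted`.
    const std::size_t target =
        std::max<std::size_t>(1, (total + wanted - 1) / wanted);

    std::vector<Unit> current;
    std::size_t current_size = 0;
    for (const Unit& u : units) {
        current.push_back(u);
        current_size += u.size;
        if (current_size >= target) {
            parts.push_back(current);
            current.clear();
            current_size = 0;
        }
    }
    if (!current.empty())
        parts.push_back(current);  // leftover units form the last partition
    return parts;
}
```

A single linear pass like this also avoids the quadratic behaviour mentioned above, at the price of ignoring call-graph locality, which a real partitioner would presumably want to take into account.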
Re: Massive performance regression from switching to gcc 4.5
When you compile with -Os, the inlining happens only when code size reduces. Thus we pretty much care about the code size metrics only. I suspect the problem here might be that normal C++ code needs some inlining to make abstraction penalty go away. GCC -Os implementation is generally tuned for CSiBE and it is somewhat C centric (that makes sense for embedded world). As a result we might get quite noticeable slowdowns on C++ apps compiled with -Os (and code size growth too since abstraction is never eliminated). It can be seen also at tramp3d (Pooma testcase) where -Os produces a lot bigger and a lot slower code.

Looks like -Os needs tweaking for C++. It would be awesome if Mozilla could serve as a testcase for that.

I would be very interested to know the most obvious cases where we miss inlining and should not. It would be most helpful to directly know -fdump-tree-inline_param-details for those or have a self-contained testcase.

I posted such a testcase under the subject "Crucial C++ inlining broken under Os".

Thanks, I will check it out tomorrow. Testcases are really important here.

Would be nice to ensure that code-size-reducing inlining has testsuite coverage and then move on to more ambitious targets.

We have the CSiBE benchmark to test -Os, but it does not contain that many inlining-heavy testcases as far as I am aware.

For example, it would be nice to mix -O2/-Os via PGO. Mozilla carries support for a lot of special cases that are rarely used; it would be nice to aggressively minimize binary size for those features. I'm hoping we can reduce binary size further via LTO.

With PGO, everything executed on average fewer than once per execution in the train run is optimized for size.

I would be happy to see a streamlined feedback loop between GCC and Mozilla. I think setting up regular Mozilla benchmarking is a great idea. I volunteer to be the Mozilla contact for that.
Mozilla build instructions are here: https://developer.mozilla.org/En/Simple_Firefox_build
PGO instructions are at https://developer.mozilla.org/en/Building_with_Profile-Guided_Optimization
Our testsuite is pretty easy to set up now, see https://wiki.mozilla.org/StandaloneTalos

It takes over an hour to run on slower hardware. I'm not aware of any good documentation on what the tests do and what they are measured in. https://wiki.mozilla.org/Performance:Tinderbox_Tests is an older reference on the matter. Ping me on irc if you aren't sure what a test is doing.

I will check it out tomorrow and see how far I get. An automated nightly tester would be very good for this.

Honza

Cheers,
Taras
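As a rough, hand-written approximation of mixing size and speed optimization (PGO would make this decision automatically from the train run), GCC's `cold` function attribute marks a function as unlikely to be executed, and the GCC documentation states that such functions are optimized for size rather than speed. The function names below are hypothetical and are not taken from Mozilla.

```cpp
#include <cassert>
#include <string>

// Rarely used feature: the cold attribute tells GCC this function is
// unlikely to run, so it is optimized for size instead of speed.
__attribute__((cold)) std::string describe_legacy_encoding(int code) {
    assert(code >= 0);  // this sketch assumes non-negative codes
    switch (code) {
        case 1:  return "legacy-a";
        case 2:  return "legacy-b";
        default: return "unknown";
    }
}

// Frequently executed path: compiled with the normal command-line
// optimization level (for example -O2).
int checksum(const std::string& text) {
    int sum = 0;
    for (unsigned char c : text)
        sum = (sum * 31 + c) % 65521;
    return sum;
}
```

The attribute is GCC-specific, so portable code would normally hide it behind a macro.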
Re: Massive performance regression from switching to gcc 4.5
On 06/24/2010 12:06 PM, Andrew Pinski wrote:

On Jun 24, 2010, at 11:50 AM, Taras Glek tg...@mozilla.com wrote:

Hi, Just wanted to give a heads up on what might be the biggest compiler-upgrade-related performance difference we've seen at Mozilla. We switched gcc4.3 for gcc4.5 and our automated benchmarking infrastructure reported 4-19% slowdown on most of our performance metrics on 32 and 64bit Linux. A lone 8% speedup was measured on the Sunspider javascript benchmark on 64bit linux.

Here are some of the slowdowns reported:
http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/77951ccb76b5e630#
http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/624246d7d900ed41#

Most of the code is compiled with -fPIC -fno-rtti -fno-exceptions -Os

Stop right there. You are compiling at -Os, which is tuned for size and not speed. So the question is whether the size went down, not whether the speed decreased. Try at -O2 and report back. I doubt we are going to do a tradeoff for speed at -Os at all.

Thanks,
Andrew Pinski

Good point. Looks like the actual problem is that at -Os there is less inlining happening in 4.5 vs 4.3, which results in a bigger binary and slower code. I tried 4.5 -O2 and it's actually faster than 4.3 -Os. I am happy that -O2 performance is actually pretty good, but -Os regression is going to hurt on mobile.

Taras
Re: Massive performance regression from switching to gcc 4.5
On Wed, 2010-06-30 at 14:23 -0700, Taras Glek wrote: I tried 4.5 -O2 and it's actually faster than 4.3 -Os. I am happy that -O2 performance is actually pretty good, but -Os regression is going to hurt on mobile. Did you try gcc-4.5 -flto -Os or gcc-4.5 -flto -O2? It would be interesting to hear that GCC is able to LTO a program as big as Mozilla! And figures (notably RAM, CPU time, wallclock time for build) would be interesting. Cheers. -- Basile STARYNKEVITCH http://starynkevitch.net/Basile/ email: basileatstarynkevitchdotnet mobile: +33 6 8501 2359 8, rue de la Faiencerie, 92340 Bourg La Reine, France *** opinions {are only mines, sont seulement les miennes} ***
Re: Massive performance regression from switching to gcc 4.5
On 06/30/2010 02:26 PM, Basile Starynkevitch wrote: On Wed, 2010-06-30 at 14:23 -0700, Taras Glek wrote: I tried 4.5 -O2 and it's actually faster than 4.3 -Os. I am happy that -O2 performance is actually pretty good, but -Os regression is going to hurt on mobile. Did you try gcc-4.5 -flto -Os or gcc-4.5 -flto -O2? It would be interesting to hear that GCC is able to LTO a program as big as Mozilla! And figures (notably RAM, CPU time, wallclock time for build) would be interesting. Both whopr and flto cause gcc to segfault while building Mozilla. Taras
Re: Massive performance regression from switching to gcc 4.5
On 06/30/2010 02:26 PM, Basile Starynkevitch wrote:

On Wed, 2010-06-30 at 14:23 -0700, Taras Glek wrote:

I tried 4.5 -O2 and it's actually faster than 4.3 -Os. I am happy that -O2 performance is actually pretty good, but -Os regression is going to hurt on mobile.

Did you try gcc-4.5 -flto -Os or gcc-4.5 -flto -O2? It would be interesting to hear that GCC is able to LTO a program as big as Mozilla! And figures (notably RAM, CPU time, wallclock time for build) would be interesting.

Both whopr and flto cause gcc to segfault while building Mozilla.

4.5 WHOPR is completely broken. LTO is in better shape but I am not sure if we can reasonably expect it to build Mozilla. However I would be very happy to help getting WHOPR working for 4.6. If you can find actual simple examples where -Os is losing size and speed we can try to do something about them.

Honza

Taras
Re: Massive performance regression from switching to gcc 4.5
Jan Hubicka wrote:

GiNaC indeed shows interesting behaviour. Just the first test on 4.3 is:

    timing commutative expansion and substitution
    size:    100    200   400  800
    time/s:  0.064  0.30  1.4  6.2

for 4.5

    timing commutative expansion and substitution
    size:    100    200   400  800
    time/s:  0.080  0.36  1.6  7.4

and for 4.6

    timing commutative expansion and substitution
    size:    100    200   400  800
    time/s:  0.076  0.37  1.7  8.2

I assume that the numbers are times in seconds, so it is indeed not good. Can I easily run the individual tests by hand? Do you have any idea what is going wrong here?

Yes, one can run the individual tests by hand. And it should suffice to focus on the first one, as you did. All the follow-up regressions are most likely due to the same problem. And no, I have no idea what's up yet.

-richy.
--
Richard B. Kreckel http://www.ginac.de/~kreckel/
Re: Massive performance regression from switching to gcc 4.5
On Fri, Jun 25, 2010 at 06:10:56AM -0700, Jan Hubicka wrote:

When you compile with -Os, the inlining happens only when code size reduces. Thus we pretty much care about the code size metrics only. I suspect the problem here might be that normal C++ code needs some inlining to make abstraction penalty go away. GCC -Os implementation is generally tuned for CSiBE and it is somewhat C centric (that makes sense for embedded world). As a result we might get quite noticeable slowdowns on C++ apps compiled with -Os (and code size growth too since abstraction is never eliminated). It can be seen also at tramp3d (Pooma testcase) where -Os produces a lot bigger and a lot slower code.

One would think that in most of the abstraction-penalty cases, the inlined code (often the direct reading or setting of a class data member) should be both smaller and faster than the call, so -Os should inline.

Yes, at -Os the inliner should inline to produce the smallest binary. The problem is to figure out at inline time what the results of the inlining decisions will be. The inliner has no idea how the code will optimize if it does the inlining or if it does not, and often the code quality depends on multiple inline decisions (i.e. one needs to inline all methods of the object to make it scalar replaced and optimized away). The current heuristics make some simple guesses. Some of them are turned off for -Os to avoid regressions on C benchmarks in CSiBE. So I would really need to see individual examples to work out what can be done about them.

Honza

Perhaps there are cases where the inlined version is, say, one or two instructions larger than the version with a call, and this causes the degradation? If so, maybe some heuristic could be produced that would inline anyway for a small function?
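A minimal sketch of the kind of abstraction penalty under discussion. The `Vec2` class and `sum_x` function are hypothetical and are not taken from Mozilla, GiNaC, or tramp3d. Each accessor body is a single member read, so one would expect inlining it to be both smaller and faster than emitting a call.

```cpp
#include <cassert>  // used by the checks in the usage note below
#include <cstddef>
#include <vector>

class Vec2 {
public:
    Vec2(int x, int y) : x_(x), y_(y) {}
    // Trivial accessors: once inlined, each call collapses to a member read.
    int x() const { return x_; }
    int y() const { return y_; }
private:
    int x_;
    int y_;
};

// If Vec2::x() is not inlined, every loop iteration pays for a real call
// (argument setup, call, return) just to read one integer.
int sum_x(const std::vector<Vec2>& points) {
    int total = 0;
    for (std::size_t i = 0; i < points.size(); ++i)
        total += points[i].x();
    return total;
}
```

For two points (1, 2) and (3, 4), `assert(sum_x(points) == 4)` holds. Whether a given compiler and option set actually inlines the accessor can be checked by inspecting the generated assembly or the inliner dump mentioned elsewhere in this thread.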
Re: Massive performance regression from switching to gcc 4.5
On Fri, 25 Jun 2010, it was written:

On Thu, Jun 24, 2010 at 11:50:52AM -0700, Taras Glek wrote:

We switched gcc4.3 for gcc4.5 and our automated benchmarking infrastructure reported 4-19% slowdown on most of our performance metrics on 32 and 64bit Linux.

Could you please also try gcc4.4, so that it is clear if the slowdowns are between 4.3 and 4.4 or 4.4 and 4.5? Would be nice to narrow the changes a little bit.

There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC (a computer algebra library) benchmark suite after switching from 4.4 to 4.5 on x86_64 when compiling with -O2. And there hasn't been a measurable performance difference between 4.3 and 4.4.

FP intensive code could be also affected by:

On x86 targets, code containing floating-point calculations may run significantly slower when compiled with GCC 4.5 in strict C99 conformance mode than they did with earlier GCC versions. This is due to stricter standard conformance of the compiler and can be avoided by using the option -fexcess-precision=fast; also see below.

(see http://gcc.gnu.org/gcc-4.5/changes.html)

If not, I would be interested to take a look. C++ code in general tends to challenge the inliner heuristics. We assembled a benchmark suite that we use for tuning the interprocedural optimizers pretty much every release. So perhaps GiNaC can be used as one of the tests.

Honza

-richy.
--
Richard B. Kreckel http://www.ginac.de/~kreckel/
Re: Massive performance regression from switching to gcc 4.5
Jan Hubicka wrote:

On Fri, 25 Jun 2010, it was written:

There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC (a computer algebra library) benchmark suite after switching from 4.4 to 4.5 on x86_64 when compiling with -O2. And there hasn't been a measurable performance difference between 4.3 and 4.4.

FP intensive code could be also affected by:

This code isn't using floating-point.

-richy.
--
Richard B. Kreckel http://www.ginac.de/~kreckel/
Re: Massive performance regression from switching to gcc 4.5
Jan Hubicka wrote:

On Fri, 25 Jun 2010, it was written:

There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC (a computer algebra library) benchmark suite after switching from 4.4 to 4.5 on x86_64 when compiling with -O2. And there hasn't been a measurable performance difference between 4.3 and 4.4.

FP intensive code could be also affected by:

This code isn't using floating-point.

Hmm, building ginac with current 4.5 branch I get:

    /bin/sh ../libtool --tag=CXX --mode=compile /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. -I../../ginac -I../config -I/usr/local/include -g -O2 -MT function.lo -MD -MP -MF .deps/function.Tpo -c -o function.lo ../../ginac/function.cpp
    libtool: compile: /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. -I../../ginac -I../config -I/usr/local/include -g -O2 -MT function.lo -MD -MP -MF .deps/function.Tpo -c ../../ginac/function.cpp -fPIC -DPIC -o .libs/function.o
    ../../ginac/function.cpp: In member function 'GiNaC::ex GiNaC::function::power(const GiNaC::ex&) const':
    ../../ginac/function.cpp:1800:15: error: expected type-specifier
    ../../ginac/function.cpp:1800:15: error: expected ')'
    ../../ginac/function.cpp:1801:72: error: conversion from 'int*' to 'GiNaC::ex' is ambiguous
    ../../ginac/ex.h:279:1: note: candidates are: GiNaC::ex::ex(long unsigned int) <near match>
    ../../ginac/ex.h:273:1: note: GiNaC::ex::ex(long int) <near match>
    ../../ginac/ex.h:267:1: note: GiNaC::ex::ex(unsigned int) <near match>
    ../../ginac/ex.h:261:1: note: GiNaC::ex::ex(int) <near match>

Honza

-richy.
--
Richard B. Kreckel http://www.ginac.de/~kreckel/
Re: Massive performance regression from switching to gcc 4.5
Jan Hubicka wrote:

On Fri, 25 Jun 2010, it was written:

There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC (a computer algebra library) benchmark suite after switching from 4.4 to 4.5 on x86_64 when compiling with -O2. And there hasn't been a measurable performance difference between 4.3 and 4.4.

FP intensive code could be also affected by:

This code isn't using floating-point.

Hmm, building ginac with current 4.5 branch I get:

    /bin/sh ../libtool --tag=CXX --mode=compile /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. -I../../ginac -I../config -I/usr/local/include -g -O2 -MT function.lo -MD -MP -MF .deps/function.Tpo -c -o function.lo ../../ginac/function.cpp
    libtool: compile: /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. -I../../ginac -I../config -I/usr/local/include -g -O2 -MT function.lo -MD -MP -MF .deps/function.Tpo -c ../../ginac/function.cpp -fPIC -DPIC -o .libs/function.o
    ../../ginac/function.cpp: In member function 'GiNaC::ex GiNaC::function::power(const GiNaC::ex&) const':
    ../../ginac/function.cpp:1800:15: error: expected type-specifier
    ../../ginac/function.cpp:1800:15: error: expected ')'
    ../../ginac/function.cpp:1801:72: error: conversion from 'int*' to 'GiNaC::ex' is ambiguous
    ../../ginac/ex.h:279:1: note: candidates are: GiNaC::ex::ex(long unsigned int) <near match>
    ../../ginac/ex.h:273:1: note: GiNaC::ex::ex(long int) <near match>
    ../../ginac/ex.h:267:1: note: GiNaC::ex::ex(unsigned int) <near match>
    ../../ginac/ex.h:261:1: note: GiNaC::ex::ex(int) <near match>

Hi, I got around it by just replacing the offending line with abort(), which seems to get me far enough to build the testsuite. One thing I noticed is that compile time increased excessively, to about 10 minutes.
It seems to be var tracking:

    samples  %        image name             symbol name
    39609    11.3093  cc1plus                canonicalize_values_star
    33469    9.5562   cc1plus                loc_cmp
    22206    6.3403   cc1plus                set_slot_part
    15798    4.5107   cc1plus                rtx_equal_p
    14479    4.1341   cc1plus                htab_find_with_hash
    10461    2.9869   cc1plus                htab_find_slot_with_hash
    7268     2.0752   cc1plus                find_loc_in_1pdv
    6718     1.9181   cc1plus                cselib_expand_value_rtx_1
    6619     1.8899   libc-2.11.1.so         strcmp
    6193     1.7682   cc1plus                htab_expand
    6020     1.7188   cc1plus                check_changed_vars_0
    5797     1.6552   cc1plus                htab_traverse_noresize
    5615     1.6032   cc1plus                htab_find_with_hash
    4757     1.3582   cc1plus                vt_expand_loc_callback
    4529     1.2931   libginac-1.5.so.0.1.2  GiNaC::basic::compare(GiNaC::basic const&) const
    4497     1.2840   libc-2.11.1.so         _int_malloc
    3879     1.1075   as                     /usr/bin/as
    3703     1.0573   cc1plus                variable_htab_eq
    3698     1.0559   cc1plus                insert_into_intersection
    3287     0.9385   cc1plus                emit_note_insn_var_location

Honza
Re: Massive performance regression from switching to gcc 4.5
Jan Hubicka wrote:

On Fri, 25 Jun 2010, it was written:

There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC (a computer algebra library) benchmark suite after switching from 4.4 to 4.5 on x86_64 when compiling with -O2. And there hasn't been a measurable performance difference between 4.3 and 4.4.

FP intensive code could be also affected by:

This code isn't using floating-point.

Hmm, building ginac with current 4.5 branch I get:

    /bin/sh ../libtool --tag=CXX --mode=compile /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. -I../../ginac -I../config -I/usr/local/include -g -O2 -MT function.lo -MD -MP -MF .deps/function.Tpo -c -o function.lo ../../ginac/function.cpp
    libtool: compile: /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. -I../../ginac -I../config -I/usr/local/include -g -O2 -MT function.lo -MD -MP -MF .deps/function.Tpo -c ../../ginac/function.cpp -fPIC -DPIC -o .libs/function.o
    ../../ginac/function.cpp: In member function 'GiNaC::ex GiNaC::function::power(const GiNaC::ex&) const':
    ../../ginac/function.cpp:1800:15: error: expected type-specifier
    ../../ginac/function.cpp:1800:15: error: expected ')'
    ../../ginac/function.cpp:1801:72: error: conversion from 'int*' to 'GiNaC::ex' is ambiguous
    ../../ginac/ex.h:279:1: note: candidates are: GiNaC::ex::ex(long unsigned int) <near match>
    ../../ginac/ex.h:273:1: note: GiNaC::ex::ex(long int) <near match>
    ../../ginac/ex.h:267:1: note: GiNaC::ex::ex(unsigned int) <near match>
    ../../ginac/ex.h:261:1: note: GiNaC::ex::ex(int) <near match>

Hi, I got around it by just replacing the offending line with abort(), which seems to get me far enough to build the testsuite. One thing I noticed is that compile time increased excessively, to about 10 minutes.
It seems to be var tracking:

    samples  %        image name             symbol name
    39609    11.3093  cc1plus                canonicalize_values_star
    33469    9.5562   cc1plus                loc_cmp
    22206    6.3403   cc1plus                set_slot_part
    15798    4.5107   cc1plus                rtx_equal_p
    14479    4.1341   cc1plus                htab_find_with_hash
    10461    2.9869   cc1plus                htab_find_slot_with_hash
    7268     2.0752   cc1plus                find_loc_in_1pdv
    6718     1.9181   cc1plus                cselib_expand_value_rtx_1
    6619     1.8899   libc-2.11.1.so         strcmp
    6193     1.7682   cc1plus                htab_expand
    6020     1.7188   cc1plus                check_changed_vars_0
    5797     1.6552   cc1plus                htab_traverse_noresize
    5615     1.6032   cc1plus                htab_find_with_hash
    4757     1.3582   cc1plus                vt_expand_loc_callback
    4529     1.2931   libginac-1.5.so.0.1.2  GiNaC::basic::compare(GiNaC::basic const&) const
    4497     1.2840   libc-2.11.1.so         _int_malloc
    3879     1.1075   as                     /usr/bin/as
    3703     1.0573   cc1plus                variable_htab_eq
    3698     1.0559   cc1plus                insert_into_intersection
    3287     0.9385   cc1plus                emit_note_insn_var_location

(it is a regression on the 4.5 branch, forgot to mention)

Honza

Honza
Re: Massive performance regression from switching to gcc 4.5
(it is a regression on the 4.5 branch, forgot to mention)

PR44694

GiNaC indeed shows interesting behaviour. Just the first test on 4.3 is:

    timing commutative expansion and substitution
    size:    100    200   400  800
    time/s:  0.064  0.30  1.4  6.2

for 4.5

    timing commutative expansion and substitution
    size:    100    200   400  800
    time/s:  0.080  0.36  1.6  7.4

and for 4.6

    timing commutative expansion and substitution
    size:    100    200   400  800
    time/s:  0.076  0.37  1.7  8.2

I assume that the numbers are times in seconds, so it is indeed not good. Can I easily run the individual tests by hand? Do you have any idea what is going wrong here?

Honza
Re: Massive performance regression from switching to gcc 4.5
On Fri, 25 Jun 2010, it was written:

On Thu, Jun 24, 2010 at 11:50:52AM -0700, Taras Glek wrote:

We switched gcc4.3 for gcc4.5 and our automated benchmarking infrastructure reported 4-19% slowdown on most of our performance metrics on 32 and 64bit Linux.

Could you please also try gcc4.4, so that it is clear if the slowdowns are between 4.3 and 4.4 or 4.4 and 4.5? Would be nice to narrow the changes a little bit.

There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC (a computer algebra library) benchmark suite after switching from 4.4 to 4.5 on x86_64 when compiling with -O2. And there hasn't been a measurable performance difference between 4.3 and 4.4.

-richy.
--
Richard B. Kreckel http://www.ginac.de/~kreckel/
Re: Massive performance regression from switching to gcc 4.5
On 25/06/10 06:39, Richard Guenther wrote: There are btw. some bugs wrt accounting of functions called once being inlined in 4.5 which were fixed on trunk which allow extra inlining. Are these changes likely to make it onto the 4.5 branch and into (say) 4.5.1? j.
Re: Massive performance regression from switching to gcc 4.5
Minus whitespace changes it seems to be

    ! if (lhs_free && (is_gimple_reg (rhs) || is_gimple_min_invariant (rhs)))
        rhs_free = true;

vs.

    ! if (lhs_free
    !     && (is_gimple_reg (rhs)
    !         || !is_gimple_reg_type (TREE_TYPE (rhs))
    !         || is_gimple_min_invariant (rhs)))
        rhs_free = true;

so the stmt is likely being eliminated if either the LHS or the RHS is based on a parameter and the other side is a register or an invariant. You change that to also discount aggregate stores/loads to/from parameters to be free.

There is also the counterpart for the RHS:

    ! if (rhs_free && is_gimple_reg (lhs))
        lhs_free = true;

vs

    ! if (rhs_free
    !     && (is_gimple_reg (lhs)
    !         || !is_gimple_reg_type (TREE_TYPE (lhs))))
        lhs_free = true;

Which you could have simplified to just say

    if (lhs_free || rhs_free)
      return true;

and drop the code you are changing.

I don't think so, compare your version and mine for scalar stores/loads from/to parameters or return values.

--
Eric Botcazou
Re: Massive performance regression from switching to gcc 4.5
On Fri, Jun 25, 2010 at 8:34 AM, Eric Botcazou ebotca...@adacore.com wrote:

Minus whitespace changes it seems to be

    ! if (lhs_free && (is_gimple_reg (rhs) || is_gimple_min_invariant (rhs)))
        rhs_free = true;

vs.

    ! if (lhs_free
    !     && (is_gimple_reg (rhs)
    !         || !is_gimple_reg_type (TREE_TYPE (rhs))
    !         || is_gimple_min_invariant (rhs)))
        rhs_free = true;

so the stmt is likely being eliminated if either the LHS or the RHS is based on a parameter and the other side is a register or an invariant. You change that to also discount aggregate stores/loads to/from parameters to be free.

There is also the counterpart for the RHS:

    ! if (rhs_free && is_gimple_reg (lhs))
        lhs_free = true;

vs

    ! if (rhs_free
    !     && (is_gimple_reg (lhs)
    !         || !is_gimple_reg_type (TREE_TYPE (lhs))))
        lhs_free = true;

Sure, but it's equivalent.

Which you could have simplified to just say

    if (lhs_free || rhs_free)
      return true;

and drop the code you are changing.

I don't think so, compare your version and mine for scalar stores/loads from/to parameters or return values.

I do think so. Quoting the complete patched code:

      if (TREE_CODE (inner_rhs) == PARM_DECL
          || (TREE_CODE (inner_rhs) == SSA_NAME
              && SSA_NAME_IS_DEFAULT_DEF (inner_rhs)
              && TREE_CODE (SSA_NAME_VAR (inner_rhs)) == PARM_DECL))
        rhs_free = true;
    ! if (rhs_free
    !     && (is_gimple_reg (lhs)
    !         || !is_gimple_reg_type (TREE_TYPE (lhs))))
        lhs_free = true;
      if (((TREE_CODE (inner_lhs) == PARM_DECL
            || (TREE_CODE (inner_lhs) == SSA_NAME
                && SSA_NAME_IS_DEFAULT_DEF (inner_lhs)
                && TREE_CODE (SSA_NAME_VAR (inner_lhs)) == PARM_DECL))
           && inner_lhs != lhs)
          || TREE_CODE (inner_lhs) == RESULT_DECL
          || (TREE_CODE (inner_lhs) == SSA_NAME
              && TREE_CODE (SSA_NAME_VAR (inner_lhs)) == RESULT_DECL))
        lhs_free = true;
    ! if (lhs_free
    !     && (is_gimple_reg (rhs)
    !         || !is_gimple_reg_type (TREE_TYPE (rhs))
    !         || is_gimple_min_invariant (rhs)))
        rhs_free = true;
      if (lhs_free && rhs_free)
        return true;

now, with is_gimple_reg () || !is_gimple_reg_type () || is_gimple_min_invariant () you allow registers, constants or any aggregates but _not_ register typed variables that have their address taken. Which in the following example makes i = *p not likely eliminated but makes j = *q likely eliminated.

    void foo (int *p, struct X *q)
    {
      int i;
      struct X j;
      i = *p;
      j = *q;
      bar (&i, q);
    }

That doesn't make sense. What makes sense is that all scalar (thus gimple_reg_typed) loads/stores to/from parameters or the result are free. Which isn't what the current code does but is also not what you are changing it to. Thus in the above example i = *p should be likely eliminated but not j = *q (maybe we can make aggregate loads/stores from/to non-address-taken vars as free, too).

Richard.

--
Eric Botcazou
Re: Massive performance regression from switching to gcc 4.5
On Fri, Jun 25, 2010 at 8:15 AM, Jonathan Adamczewski jadam...@utas.edu.au wrote: On 25/06/10 06:39, Richard Guenther wrote: There are btw. some bugs wrt accounting of functions called once being inlined in 4.5 which were fixed on trunk which allow extra inlining. Are these changes likely to make it onto the 4.5 branch and into (say) 4.5.1? Well, I'm always a bit nervous when backporting inline heuristic changes as that may trigger latent problems on code where they weren't seen before. We are talking about revs 158278 and 159931. And at this point I'd leave it to Honza to consider their safety and do and test a backport. Richard. j.
Re: Massive performance regression from switching to gcc 4.5
I do think so.

Huh? What do your version and mine return for the following assignment?

    void foo (int i)
    {
      struct S s;
      s.a = i;
    }

Which in the following example makes i = *p not likely eliminated but makes j = *q likely eliminated.

    void foo (int *p, struct X *q)
    {
      int i;
      struct X j;
      i = *p;
      j = *q;
      bar (&i, q);
    }

That doesn't make sense.

Yet that's what's supposed to be implemented, see the comment: loads from parameters passed by reference.

What makes sense is that all scalar (thus gimple_reg_typed) loads/stores to/from parameters or the result are free.

Precisely not, they aren't free, otherwise they wouldn't exist in the first place. Scalar loads/stores are never free, aggregate loads/stores may be free if they are created only to pass the object around.

--
Eric Botcazou
Re: Massive performance regression from switching to gcc 4.5
On Fri, Jun 25, 2010 at 12:45 PM, Eric Botcazou ebotca...@adacore.com wrote:

I do think so.

Huh? What do your version and mine return for the following assignment?

    void foo (int i)
    {
      struct S s;
      s.a = i;
    }

Which in the following example makes i = *p not likely eliminated but makes j = *q likely eliminated.

    void foo (int *p, struct X *q)
    {
      int i;
      struct X j;
      i = *p;
      j = *q;
      bar (&i, q);
    }

That doesn't make sense.

Yet that's what's supposed to be implemented, see the comment: loads from parameters passed by reference.

What makes sense is that all scalar (thus gimple_reg_typed) loads/stores to/from parameters or the result are free.

Precisely not, they aren't free, otherwise they wouldn't exist in the first place. Scalar loads/stores are never free, aggregate loads/stores may be free if they are created only to pass the object around.

Err. Aggregate loads/stores do not appear because aggregate uses can appear in calls. Scalar uses cannot appear in calls and thus you see them as separate statements. Thus,

    struct X;
    void bar (struct X);
    void foo (struct X &x)
    {
      bar (x);
    }

will appear as a single call stmt while

    void bar (int);
    void foo (int &x)
    {
      bar (x);
    }

will have a load that is not supposed to be free?

Richard.

--
Eric Botcazou
Re: Massive performance regression from switching to gcc 4.5
On Fri, Jun 25, 2010 at 8:15 AM, Jonathan Adamczewski jadam...@utas.edu.au wrote:

On 25/06/10 06:39, Richard Guenther wrote:

There are btw. some bugs wrt accounting of functions called once being inlined in 4.5 which were fixed on trunk which allow extra inlining.

Are these changes likely to make it onto the 4.5 branch and into (say) 4.5.1?

Well, I'm always a bit nervous when backporting inline heuristic changes as that may trigger latent problems on code where they weren't seen before. We are talking about revs 158278 and 159931. And at this point I'd leave it to Honza to consider their safety and do and test a backport.

The main change in the GCC 4.5 heuristic is that it is no longer driven by somewhat fuzzy estimates of costs that are a mixture of size, speed and some legacy (such as a bug that completely ignored the existence of loads and stores). It now uses a code size estimate and speedup to drive inlining (that is basically a greedy algorithm trying to maximize speedup under the code size growth constraints).

When you compile with -Os, the inlining happens only when code size reduces. Thus we pretty much care about the code size metrics only. I suspect the problem here might be that normal C++ code needs some inlining to make abstraction penalty go away. GCC -Os implementation is generally tuned for CSiBE and it is somewhat C centric (that makes sense for embedded world). As a result we might get quite noticeable slowdowns on C++ apps compiled with -Os (and code size growth too since abstraction is never eliminated). It can be seen also at tramp3d (Pooma testcase) where -Os produces a lot bigger and a lot slower code.

I would be very interested to know the most obvious cases where we miss inlining and should not. It would be most helpful to directly know -fdump-tree-inline_param-details for those or have a self-contained testcase.

It might be for the benefit of both projects if we managed to set up regular Mozilla benchmarking.
(Similar to what we do for C++ benchmarks at http://gcc.opensuse.org/c++bench-frescobaldi/ ) I have been thinking about this for a while, but was somewhat discouraged by the overall complexity of Mozilla, and we also currently lack hardware for all the testing we would like to do. Mozilla is a wonderful example of a complex real-world C++ app with a benchmark suite, which makes it a really good target for tuning IPA. I would also be very interested to know how profile feedback works in this case (and why it does not work in previous releases). I am maintaining both areas of the compiler and would definitely be happy to do some work to help make it useful for you. GCC 4.6 has several changes in inlining heuristics that might be considered for backporting if they are found to be _really_ important. Most noticeable are probably:
1) It fixes miscounting of variadic functions (this had quite a large effect on GCC itself since it prevents inlining parts of fatal_error)
2) It fixes accounting of static functions (previously the overall unit change was decreased twice for every offline copy eliminated, which accidentally improved codegen for some C++ testcases but caused code size growth elsewhere)
3) The priority queue was fixed, so it now accounts correctly for cost changes after inlining (this caused the best improvements in C)
4) There were speedups in the inlining heuristics when dealing with functions having really many (say over 5) callers.
2) and 3) need to go together or we get slowdowns on our current C++ suite. I am however concerned that the problem might be a clash between -Os and the fact that C++ code generally needs speculative, code-growing inlining to get rid of abstraction. Whether we can easily get around this problem depends on what your abstraction is. GCC can detect certain forms of constructs that will go away after inlining, and I was also thinking about adding a small code growth buffer for -Os inlining too, if it helps on average. Honza Richard. j.
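As an illustration of the policy described in this message (greedy selection for speedup within a size budget, and at -Os inlining only when the code shrinks), here is a minimal toy model in C. It is a sketch under stated assumptions and not GCC's code: the `candidate` structure, the ordering by estimated speedup, the `growth_budget` parameter, and the reading of "reduces" as strictly negative growth are all made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical call-site candidate; the fields are illustrative and do not
   mirror GCC's internal data structures.  */
struct candidate {
    int size_growth;  /* estimated change in unit size if inlined (may be negative) */
    int speedup;      /* estimated time saved, in arbitrary units */
    bool inlined;     /* decision, filled in by greedy_inline () */
};

/* Sort by larger speedup first; break ties by smaller growth.  */
static int by_speedup_desc(const void *a, const void *b)
{
    const struct candidate *x = a, *y = b;
    if (x->speedup != y->speedup)
        return (x->speedup < y->speedup) - (x->speedup > y->speedup);
    return (x->size_growth > y->size_growth) - (x->size_growth < y->size_growth);
}

/* Greedy selection.  Without optimize_size, accept a candidate while the
   accumulated growth stays within growth_budget.  With optimize_size
   (modelling -Os), accept only candidates that shrink the code.
   The array is reordered in place; the return value is the total growth.  */
int greedy_inline(struct candidate *c, size_t n, int growth_budget,
                  bool optimize_size)
{
    int total_growth = 0;

    qsort(c, n, sizeof *c, by_speedup_desc);
    for (size_t i = 0; i < n; i++) {
        bool ok;
        if (optimize_size)
            ok = c[i].size_growth < 0;
        else
            ok = total_growth + c[i].size_growth <= growth_budget;
        c[i].inlined = ok;
        if (ok)
            total_growth += c[i].size_growth;
    }
    return total_growth;
}
```

In this model a candidate with a large speedup but even a small positive growth is never taken when `optimize_size` is set, which is the situation the message worries about for C++ abstraction layers.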
Re: Massive performance regression from switching to gcc 4.5
On Thu, Jun 24, 2010 at 11:50:52AM -0700, Taras Glek wrote: Just wanted to give a heads up on what might be the biggest compiler-upgrade-related performance difference we've seen at Mozilla. We switched gcc4.3 for gcc4.5 and our automated benchmarking infrastructure reported 4-19% slowdown on most of our performance metrics on 32 and 64bit Linux. Could you please also try gcc4.4, so that it is clear if the slowdowns are between 4.3 and 4.4 or 4.4 and 4.5? Would be nice to narrow the changes a little bit. Jakub
Re: Massive performance regression from switching to gcc 4.5
On Fri, Jun 25, 2010 at 1:02 PM, Richard Guenther richard.guent...@gmail.com wrote: On Fri, Jun 25, 2010 at 12:45 PM, Eric Botcazou ebotca...@adacore.com wrote: I do think so. Huh? What do your version and mine return for the following assignment? void foo (int i) { struct S s; s.a = i; } Which in the following example makes i = *p not likely eliminated but makes j = *q likely eliminated. void foo (int *p, struct X *q) { int i; struct X j; i = *p; j = *q; bar (i, q); } That doesn't make sense. Yet that's what's supposed to be implemented, see the comment: loads from parameters passed by reference. What makes sense is that all scalar (thus gimple_reg_typed) loads/stores to/from parameters or the result are free. Precisely not, they aren't free, otherwise they wouldn't exist in the first place. Scalar loads/stores are never free, aggregate loads/stores may be free if they are created only to pass the object around. Err. aggregate loads/stores do not appear because aggregate uses can appear in calls. Scalar uses cannot appear in calls and thus you see them as separate statements. Thus, struct X; void bar(struct X); void foo(struct X x) { bar (x); } will appear as a single call stmt while void bar (int); void foo(int x) { bar (x); } will have a load that is not supposed to be free? Thus, do you have a testcase where your patch helps? Richard. Richard. -- Eric Botcazou
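The fragments quoted in this exchange do not compile on their own, because struct X is never defined and foo is reused. A self-contained variant might look like the sketch below. The fields of `struct X`, the `sink` variable, and all function names are assumptions added for the example, and `foo_loads` is adapted so that the aggregate copy `j` is actually used. Compiling it with the -fdump-tree-inline_param-details option mentioned earlier in the thread should show how each body is accounted; no particular dump output is claimed here.

```c
/* Hypothetical aggregate; the thread never shows a definition of struct X.  */
struct X {
    int a;
    int b;
};

static int sink;  /* gives the callees an observable effect */

void bar_aggregate(struct X x) { sink += x.a + x.b; }
void bar_scalar(int i)         { sink += i; }

/* Aggregate parameter forwarded by value.  */
void foo_aggregate(struct X x) { bar_aggregate(x); }

/* Scalar parameter forwarded by value.  */
void foo_scalar(int x) { bar_scalar(x); }

/* Loads through pointer parameters: one scalar, one aggregate.  */
void foo_loads(int *p, struct X *q)
{
    int i;
    struct X j;

    i = *p;   /* scalar load from a parameter passed by reference */
    j = *q;   /* aggregate load from a parameter passed by reference */
    bar_scalar(i);
    bar_aggregate(j);
}

int get_sink(void) { return sink; }
```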
Re: Massive performance regression from switching to gcc 4.5
Hi, On Fri, 25 Jun 2010, Jan Hubicka wrote: I would be also very interested to know how profile feedback works in this case (and why it does not work in previous releases). Profiling multi-threading programs needs -fprofile-correction that appeared only in 4.4 (but I have no idea whether 4.4 works for Mozilla or not -- the initial message only speaks about 4.3 and 4.5). Mozilla code also triggered a bug in libgcov ( http://gcc.gnu.org/PR43825 ), and they have probably modified their code to never leave non-default alignment at the end of the TU (I have posted a patch for the libgcov bug [1], but it was not reviewed and does not apply anymore due to build_constructor changes). [1] http://gcc.gnu.org/ml/gcc-patches/2010-05/msg00292.html Cheers, Alexander
Re: Massive performance regression from switching to gcc 4.5
On Fri, Jun 25, 2010 at 06:10:56AM -0700, Jan Hubicka wrote: When you compile with -Os, the inlining happens only when code size reduces. Thus we pretty much care about the code size metrics only. I suspect the problem here might be that normal C++ code needs some inlining to make abstraction penalty go away. GCC -Os implementation is generally tuned for CSiBE and it is somewhat C centric (that makes sense for embedded world). As a result we might get quite noticeable slowdowns on C++ apps compiled with -Os (and code size growth too since abstraction is never eliminated). It can be seen also at tramp3d (Pooma testcase) where -Os produces a lot bigger and a lot slower code. One would think that in most of the abstraction-penalty cases, the inlined code (often the direct reading or setting of a class data member) should be both smaller and faster than the call, so -Os should inline. Perhaps there are cases where the inlined version is, say, one or two instructions larger than the version with a call, and this causes the degradation? If so, maybe some heuristic could be produced that would inline anyway for a small function?
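The kind of accessor this message has in mind can be sketched in C as follows. This is a C analogue of a C++ class with trivial getters and setters, not code from Mozilla: the `counter` type and every function name are hypothetical. The out-of-line variant uses GCC's `noinline` attribute purely to keep a real call around for comparison.

```c
/* Hypothetical type standing in for a C++ class with one data member.  */
struct counter {
    int value;
};

/* Out-of-line accessor: each use needs argument setup, a call and a return.  */
__attribute__((noinline))
int counter_get_outofline(const struct counter *c)
{
    return c->value;
}

/* Inlinable accessors: after inlining, each one is a single load or store.  */
static inline int counter_get(const struct counter *c)
{
    return c->value;
}

static inline void counter_set(struct counter *c, int v)
{
    c->value = v;
}

/* A caller that goes through the abstraction several times.  */
int bump_twice(struct counter *c)
{
    counter_set(c, counter_get(c) + 1);
    counter_set(c, counter_get(c) + 1);
    return counter_get(c);
}
```

Comparing the assembly from `gcc -Os -S` for callers of the two getter variants is one way to check whether the inlined form really is smaller on a given target.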
Re: Massive performance regression from switching to gcc 4.5
Hi, On Fri, 25 Jun 2010, Jan Hubicka wrote: I would be also very interested to know how profile feedback works in this case (and why it does not work in previous releases). Profiling multi-threading programs needs -fprofile-correction that appeared only in 4.4 (but I have no idea whether 4.4 works for Mozilla or not -- the initial message only speaks about 4.3 and 4.5). Mozilla code also triggered a bug in libgcov ( http://gcc.gnu.org/PR43825 ), and they have probably modified their code to never leave non-default alignment at the end of the TU (I have posted a patch for the libgcov bug [1], but it was not reviewed and does not apply anymore due to build_constructor changes). [1] http://gcc.gnu.org/ml/gcc-patches/2010-05/msg00292.html Ah, sorry. I thought the consensus was to disable the effect of pragma pack at the end of parsing to avoid the libgcov incompatibility? Honza
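For readers unfamiliar with the issue, "never leave non-default alignment at the end of the TU" can be achieved by pairing every packing change with a restore. The sketch below uses `#pragma pack(push, 1)` and `#pragma pack(pop)`; the struct names and helper functions are hypothetical, and the thread does not say that Mozilla fixed its code in exactly this way.

```c
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)   /* switch to 1-byte packing, remembering the old setting */
struct wire_header {    /* hypothetical packed layout */
    uint8_t  tag;
    uint32_t length;
};
#pragma pack(pop)       /* restore the default before the translation unit ends */

/* Declared after the pop, so it gets the default alignment again.  */
struct normal_header {
    uint8_t  tag;
    uint32_t length;
};

size_t wire_length_offset(void) { return offsetof(struct wire_header, length); }
size_t wire_size(void)          { return sizeof(struct wire_header); }
size_t normal_size(void)        { return sizeof(struct normal_header); }
```

Without the `pop`, the 1-byte packing would still be in effect at the end of the file, which is the condition the messages above associate with the libgcov problem.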
Massive performance regression from switching to gcc 4.5
Hi, Just wanted to give a heads up on what might be the biggest compiler-upgrade-related performance difference we've seen at Mozilla. We switched from gcc 4.3 to gcc 4.5 and our automated benchmarking infrastructure reported 4-19% slowdown on most of our performance metrics on 32 and 64bit Linux. A lone 8% speedup was measured on the Sunspider javascript benchmark on 64bit linux. Here are some of the slowdowns reported: http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/77951ccb76b5e630# http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/624246d7d900ed41# Most of the code is compiled with -fPIC -fno-rtti -fno-exceptions -Os -freorder-blocks -fomit-frame-pointer. The only difference in 4.5 is that we link with -static-libstdc++ and compile libstdc++ with -fPIC. However we barely make use of libstdc++, so I doubt that's the problem. We needed to link statically because 4.5 uses a handful of newer libstdc++ symbols. We were upgrading to gcc 4.5.0 because of plugins and the fact that it can compile Firefox with PGO on (above builds were not built with PGO). Now we have to reconsider a complete switchover to 4.5. I'm not sure how to proceed from here, Taras
Re: Massive performance regression from switching to gcc 4.5
On Jun 24, 2010, at 11:50 AM, Taras Glek tg...@mozilla.com wrote: Hi, Just wanted to give a heads up on what might be the biggest compiler-upgrade-related performance difference we've seen at Mozilla. We switched gcc4.3 for gcc4.5 and our automated benchmarking infrastructure reported 4-19% slowdown on most of our performance metrics on 32 and 64bit Linux. A lone 8% speedup was measured on the Sunspider javascript benchmark on 64bit linux. Here are some of the slowdowns reported: http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/77951ccb76b5e630# http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/624246d7d900ed41# Most of the code is compiled with -fPIC -fno-rtti -fno-exceptions -Os

Stop right there. You are compiling at -Os, that is tuned for size and not speed. So the question is did the size go down? Not the speed decreased. Try at -O2 and report back. I doubt we are going to do a tradeoff for speed at -Os at all. Thanks, Andrew Pinski

-freorder-blocks -fomit-frame-pointer. The only difference in 4.5 is that we link with -static-libstdc++ and compile libstdc++ with -fPIC. However we barely make use of libstdc++, so I doubt that's the problem. We needed to link statically because of 4.5 uses a handful of newer libstdc++ symbols. We were upgrading to gcc 4.5.0 because of plugins and the fact that it can compile Firefox with PGO on(above builds were not built with PGO). Now we have to reconsider a complete switchover to 4.5. I'm not sure how to proceed from here, Taras
Re: Massive performance regression from switching to gcc 4.5
On 6/24/10 3:06 PM, Andrew Pinski wrote: Most of the code is compiled with -fPIC -fno-rtti -fno-exceptions -Os Stop right there. You are compiling at -Os, that is tuned for size and not speed. So the question is did the size go down? Not the speed decreased. Try at -O2 and report back. I doubt we are going to do a tradeoff for speed at -Os at all. For what it's worth, Mozilla-compiled-with-GCC has historically been faster compiled -Os instead of -O2. This is because the vast majority of our code is cold, and -O2 has produced substantially larger code, which causes our hot code to be evicted from processor caches more often. We will definitely try -O2 to see if the previous measurements are no longer valid with GCC 4.5. Looking through our codesize comparison logs, some of our methods are hundreds or even thousands of bytes longer with GCC 4.5 than 4.3 (same -Os compiler flags): +796 nsHTMLEditRules::nsHTMLEditRules() +1088 nsCrypto::GenerateCRMFRequest(nsIDOMCRMFObject**) In addition, it appears at first glance that GCC is either no longer inlining at -Os, even when it would be a size advantage to do so, or is making some very poor inlining choices. e.g. +72 nsTArrayObserverRef::nsTArray(nsTArrayObserverRef const) We can turn some of these observations into bug reports if that would be helpful, but if it would make more sense we could perhaps just tune the inlining parameters directly to get the real -Os that we usually want. --BDS
Re: Massive performance regression from switching to gcc 4.5
In addition, it appears at first glance that GCC is either no longer inlining at -Os, even when it would be a size advantage to do so, or is making some very poor inlining choices. e.g. +72 nsTArrayObserverRef::nsTArray(nsTArrayObserverRef const) We can turn some of these observations into bug reports if that would be helpful, but if it would make more sense we could perhaps just tune the inlining parameters directly to get the real -Os that we usually want.

We ran into similar inlining regressions in Ada, the heuristics have indeed changed significantly. The attached patchlet alone saves 3% in code size at -Os on a 50 MB executable and yields a 5% speedup at -O2 on another code.

	* ipa-inline.c (likely_eliminated_by_inlining_p): Really consider that loads from parameters passed by reference are free after inlining.

-- Eric Botcazou

*** gcc/ipa-inline.c.0 2010-06-12 17:01:09.0 +0200
--- gcc/ipa-inline.c 2010-06-12 18:26:32.0 +0200
*************** likely_eliminated_by_inlining_p (gimple
*** 1736,1754 ****
            bool rhs_free = false;
            bool lhs_free = false;

!           while (handled_component_p (inner_lhs) || TREE_CODE (inner_lhs) == INDIRECT_REF)
              inner_lhs = TREE_OPERAND (inner_lhs, 0);
!           while (handled_component_p (inner_rhs)
!                  || TREE_CODE (inner_rhs) == ADDR_EXPR || TREE_CODE (inner_rhs) == INDIRECT_REF)
              inner_rhs = TREE_OPERAND (inner_rhs, 0);
-

            if (TREE_CODE (inner_rhs) == PARM_DECL
                || (TREE_CODE (inner_rhs) == SSA_NAME
                    && SSA_NAME_IS_DEFAULT_DEF (inner_rhs)
                    && TREE_CODE (SSA_NAME_VAR (inner_rhs)) == PARM_DECL))
              rhs_free = true;
!           if (rhs_free && is_gimple_reg (lhs))
              lhs_free = true;
            if (((TREE_CODE (inner_lhs) == PARM_DECL
                  || (TREE_CODE (inner_lhs) == SSA_NAME
--- 1736,1757 ----
            bool rhs_free = false;
            bool lhs_free = false;

!           while (handled_component_p (inner_lhs)
!                  || TREE_CODE (inner_lhs) == INDIRECT_REF)
              inner_lhs = TREE_OPERAND (inner_lhs, 0);
!           while (handled_component_p (inner_rhs)
!                  || TREE_CODE (inner_rhs) == ADDR_EXPR
!                  || TREE_CODE (inner_rhs) == INDIRECT_REF)
              inner_rhs = TREE_OPERAND (inner_rhs, 0);

            if (TREE_CODE (inner_rhs) == PARM_DECL
                || (TREE_CODE (inner_rhs) == SSA_NAME
                    && SSA_NAME_IS_DEFAULT_DEF (inner_rhs)
                    && TREE_CODE (SSA_NAME_VAR (inner_rhs)) == PARM_DECL))
              rhs_free = true;
!           if (rhs_free
!               && (is_gimple_reg (lhs)
!                   || !is_gimple_reg_type (TREE_TYPE (lhs))))
              lhs_free = true;
            if (((TREE_CODE (inner_lhs) == PARM_DECL
                  || (TREE_CODE (inner_lhs) == SSA_NAME
*************** likely_eliminated_by_inlining_p (gimple
*** 1759,1765 ****
                || (TREE_CODE (inner_lhs) == SSA_NAME
                    && TREE_CODE (SSA_NAME_VAR (inner_lhs)) == RESULT_DECL))
              lhs_free = true;
!           if (lhs_free && (is_gimple_reg (rhs) || is_gimple_min_invariant (rhs)))
              rhs_free = true;
            if (lhs_free && rhs_free)
              return true;
--- 1762,1771 ----
                || (TREE_CODE (inner_lhs) == SSA_NAME
                    && TREE_CODE (SSA_NAME_VAR (inner_lhs)) == RESULT_DECL))
              lhs_free = true;
!           if (lhs_free
!               && (is_gimple_reg (rhs)
!                   || !is_gimple_reg_type (TREE_TYPE (rhs))
!                   || is_gimple_min_invariant (rhs)))
              rhs_free = true;
            if (lhs_free && rhs_free)
              return true;
Re: Massive performance regression from switching to gcc 4.5
On Thu, Jun 24, 2010 at 10:24 PM, Eric Botcazou ebotca...@adacore.com wrote: In addition, it appears at first glance that GCC is either no longer inlining at -Os, even when it would be a size advantage to do so, or is making some very poor inlining choices. e.g. +72 nsTArrayObserverRef::nsTArray(nsTArrayObserverRef const) We can turn some of these observations into bug reports if that would be helpful, but if it would make more sense we could perhaps just tune the inlining parameters directly to get the real -Os that we usually want. We ran into similar inlining regressions in Ada, the heuristics have indeed changed significantly. The attached patchlet alone saves 3% in code size at -Os on a 50 MB executable and yields a 5% speedup at -O2 on another code. * ipa-inline.c (likely_eliminated_by_inlining_p): Really consider that loads from parameters passed by reference are free after inlining. I don't understand this change. Minus whitespace changes it seems to be

! if (lhs_free && (is_gimple_reg (rhs) || is_gimple_min_invariant (rhs)))
    rhs_free = true;

vs.

! if (lhs_free
!     && (is_gimple_reg (rhs)
!         || !is_gimple_reg_type (TREE_TYPE (rhs))
!         || is_gimple_min_invariant (rhs)))
    rhs_free = true;

so the stmt is likely being eliminated if either the LHS or the RHS is based on a parameter and the other side is a register or an invariant. You change that to also discount aggregate stores/loads to/from parameters to be free. Which you could have simplified to just say if (lhs_free || rhs_free) return true; and drop the code you are changing. I never considered the heuristic making loads/stores from parameters free a very good one. It makes *p free but not *(p+1) for example. I would rather have seen the call stmt's actual argument list to be considered. There are btw. some bugs wrt accounting of functions called once being inlined in 4.5 which were fixed on trunk which allow extra inlining. 
See 2010-04-13 Jan Hubicka j...@suse.cz * ipa-inline.c (cgraph_mark_inline_edge): Avoid double accounting of optimized out static functions. ... Richard.
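To make the difference between the three variants under discussion easier to follow, here is a small boolean model in C. It is only a sketch: the `operand` fields stand in for the GCC predicates named in the comments, the context lines elided between the two hunks of the patch are approximated by "the left-hand side is based on a parameter or the result", and the classification of the two example statements is an assumption of this example rather than compiler output.

```c
#include <stdbool.h>

/* Simplified stand-in for one side of an assignment.  */
struct operand {
    bool param_based;   /* stripping refs reaches a PARM_DECL */
    bool result_based;  /* stripping refs reaches the RESULT_DECL */
    bool is_reg;        /* models is_gimple_reg (op) */
    bool is_reg_type;   /* models is_gimple_reg_type (TREE_TYPE (op)) */
    bool is_invariant;  /* models is_gimple_min_invariant (op) */
};

struct assign {
    struct operand lhs, rhs;
};

/* Approximation of the lines elided between the hunks of the patch.  */
static bool lhs_base_free(const struct operand *lhs)
{
    return lhs->param_based || lhs->result_based;
}

/* Rule before the patch.  */
bool eliminated_before(const struct assign *s)
{
    bool rhs_free = s->rhs.param_based;
    bool lhs_free = rhs_free && s->lhs.is_reg;

    if (lhs_base_free(&s->lhs))
        lhs_free = true;
    if (lhs_free && (s->rhs.is_reg || s->rhs.is_invariant))
        rhs_free = true;
    return lhs_free && rhs_free;
}

/* Rule with the patch: aggregate-typed operands also count.  */
bool eliminated_patched(const struct assign *s)
{
    bool rhs_free = s->rhs.param_based;
    bool lhs_free = rhs_free && (s->lhs.is_reg || !s->lhs.is_reg_type);

    if (lhs_base_free(&s->lhs))
        lhs_free = true;
    if (lhs_free
        && (s->rhs.is_reg || !s->rhs.is_reg_type || s->rhs.is_invariant))
        rhs_free = true;
    return lhs_free && rhs_free;
}

/* The simplification suggested in the reply: either side suffices.  */
bool eliminated_simplified(const struct assign *s)
{
    return lhs_base_free(&s->lhs) || s->rhs.param_based;
}

/* s.a = i;  int parameter i stored into a field of a local struct.  */
struct assign store_param_to_field(void)
{
    struct assign s = {
        .lhs = { .is_reg_type = true },
        .rhs = { .param_based = true, .is_reg = true, .is_reg_type = true },
    };
    return s;
}

/* j = *q;  aggregate copied from a parameter passed by reference.  */
struct assign aggregate_load_from_param(void)
{
    struct assign s = {
        .lhs = { .is_reg = false },      /* local aggregate: every flag false */
        .rhs = { .param_based = true },  /* aggregate-typed load through q */
    };
    return s;
}
```

In this model the patched rule differs from the original only for the aggregate copy, while the simplified rule also treats the scalar field store as free, which matches the disagreement elsewhere in the thread about `s.a = i`.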