Re: Massive performance regression from switching to gcc 4.5

2010-07-06 Thread Jan Hubicka
  On 06/30/2010 02:26 PM, Basile Starynkevitch wrote:
  On Wed, 2010-06-30 at 14:23 -0700, Taras Glek wrote:
 
  I tried 4.5 -O2 and it's actually faster than 4.3 -Os.
 
  I am happy that -O2 performance is actually pretty good, but -Os
  regression is going to hurt on mobile.
   
  Did you try gcc-4.5 -flto -Os or gcc-4.5 -flto -O2?
 
  It would be interesting to hear that GCC is able to LTO a program as big
  as Mozilla! And figures (notably RAM, CPU time, wallclock time for
  build) would be interesting.
 
 
  Both whopr and flto cause gcc to segfault while building Mozilla.
 
 4.5 WHOPR is completely broken.  LTO is in better shape but I am not sure if we
 can reasonably expect it to build mozilla.  However I would be very happy to help
 getting WHOPR working for 4.6.
Hi,
I have now got the 4.6 WHOPR build as far as libxul.so, which seems to be one of
the bigger files.

WHOPR linking consists of a serial stage (WPA), which merges the whole program
and does interprocedural optimization, followed by a parallel build.  The serial
stage needs 3.7GB of RAM and 10 minutes, most of which is spent writing out the
files for the parallel builds; these are around 5GB overall.  The size of the
files can be cut down significantly by a sane partitioning algorithm, since we
produce over 1000 partitions where 40 would do the job.  (This is with an
enable-checking compiler.)

The later build still dies for me, but it seems that libxul is not too large for
WHOPR.  (I hope all these figures will drop significantly before 4.6 is out.)

What are the other big components I should be afraid of?

Oprofile of WPA stage is as follows:

382507 8.4240  lto_output_1_stream
379158 8.3503  htab_find_slot_with_hash
207330 4.5661  bp_pack_value
155793 3.4311  iterative_hash_hashval_t
135132 2.9760  lto_output_uleb128_stream
101110 2.2268  gimple_types_compatible_p
92828 2.0444  cgraph_node_in_set_p
83205 1.8324  lto_promote_cross_file_statics
76243 1.6791  htab_expand
75993 1.6736  htab_hash_string
75790 1.6691  eq_string_slot_node
75020 1.6522  bp_unpack_value
73403 1.6166  linemap_lookup
65353 1.4393  lto_output_sleb128_stream
64864 1.4285  inflate_fast
64508 1.4207  verify_cgraph_node
60076 1.3231  lto_output_tree
57120 1.2580  referenced_from_this_partition_p
56225 1.2383  lto_input_uleb128
53620 1.1809  lto_streamer_cache_insert_1
52973 1.1666  htab_find_slot
45728 1.0071  lto_output_tree_or_ref
43428 0.9564  lto_input_1_unsigned
41556 0.9152  tree_map_base_eq
39232 0.8640  hash_cgraph_node_set_element
35695 0.7861  ggc_set_mark

So not much of a surprise: streaming is inefficient and we need a lot of time
for type merging too.  I am compiling to get a time report.
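
The symbol names lto_output_uleb128_stream and lto_output_1_stream in the
profile suggest that integers are streamed in unsigned LEB128 form, one byte
per call, which would explain why so many samples land in tiny output
routines.  The following is a minimal sketch of standard unsigned LEB128
encoding (the format DWARF uses), not GCC's streamer code; the function name
uleb128_encode is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode VALUE as unsigned LEB128 into OUT and return the number of
   bytes written.  OUT must have room for 10 bytes (enough for 64 bits).
   Illustrative helper, not a GCC function.  */
size_t uleb128_encode(uint64_t value, unsigned char *out)
{
    size_t n = 0;

    assert(out != NULL);
    do {
        unsigned char byte = value & 0x7f;  /* low 7 payload bits */
        value >>= 7;
        if (value != 0)
            byte |= 0x80;                   /* continuation bit: more bytes follow */
        out[n++] = byte;                    /* one output step per byte */
    } while (value != 0);
    return n;
}
```

Every integer costs at least one output step and larger values cost several,
so a stream made mostly of small integers turns into a very large number of
one-byte writes.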

Honza


Re: Massive performance regression from switching to gcc 4.5

2010-07-06 Thread Jan Hubicka
... and time report
Execution times (seconds)
 garbage collection    :  12.48 ( 2%) usr   0.00 ( 0%) sys  12.50 ( 2%) wall       0 kB ( 0%) ggc
 callgraph optimization:   0.21 ( 0%) usr   0.00 ( 0%) sys   0.21 ( 0%) wall    2743 kB ( 0%) ggc
 varpool construction  :   0.97 ( 0%) usr   0.02 ( 0%) sys   0.97 ( 0%) wall   44476 kB ( 1%) ggc
 ipa cp                :   2.82 ( 0%) usr   0.05 ( 0%) sys   2.90 ( 0%) wall   65189 kB ( 2%) ggc
 ipa lto gimple in     :   4.96 ( 1%) usr   0.30 ( 2%) sys   5.28 ( 1%) wall       8 kB ( 0%) ggc
 ipa lto gimple out    :  26.51 ( 4%) usr   0.73 ( 4%) sys  27.54 ( 4%) wall       0 kB ( 0%) ggc
 ipa lto decl in       : 100.69 (14%) usr   2.76 (15%) sys 103.69 (14%) wall 3069230 kB (87%) ggc
 ipa lto decl out      : 405.60 (55%) usr   0.05 ( 0%) sys 406.86 (54%) wall       0 kB ( 0%) ggc
 ipa lto decl init I/O :   2.21 ( 0%) usr   0.03 ( 0%) sys   2.25 ( 0%) wall   82822 kB ( 2%) ggc
 ipa lto cgraph I/O    :   1.85 ( 0%) usr   0.26 ( 1%) sys   2.11 ( 0%) wall  247249 kB ( 7%) ggc
 ipa lto decl merge    :  54.30 ( 7%) usr   1.20 ( 6%) sys  55.63 ( 7%) wall     283 kB ( 0%) ggc
 ipa lto cgraph merge  :   1.35 ( 0%) usr   0.00 ( 0%) sys   1.35 ( 0%) wall    5259 kB ( 0%) ggc
 whopr wpa             :  79.05 (11%) usr   0.14 ( 1%) sys  79.35 (11%) wall   22146 kB ( 1%) ggc
 whopr wpa I/O         :  10.07 ( 1%) usr  12.89 (69%) sys  22.82 ( 3%) wall       0 kB ( 0%) ggc
 ipa reference         :   1.83 ( 0%) usr   0.00 ( 0%) sys   1.79 ( 0%) wall       0 kB ( 0%) ggc
 ipa profile           :   0.19 ( 0%) usr   0.00 ( 0%) sys   0.20 ( 0%) wall       0 kB ( 0%) ggc
 ipa pure const        :   1.44 ( 0%) usr   0.01 ( 0%) sys   1.46 ( 0%) wall       0 kB ( 0%) ggc
 inline heuristics     :  14.37 ( 2%) usr   0.00 ( 0%) sys  14.41 ( 2%) wall    2825 kB ( 0%) ggc
 callgraph verifier    :   9.22 ( 1%) usr   0.19 ( 1%) sys   9.42 ( 1%) wall       0 kB ( 0%) ggc
 varconst              :   0.02 ( 0%) usr   0.03 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                 : 732.42            18.73           753.13            3543343 kB

decl out time definitely will reduce with sane partitioning.  Also it seems that
some of my simple-minded partitioning code should be optimized to avoid quadratic
behaviour.

Honza


Re: Massive performance regression from switching to gcc 4.5

2010-07-01 Thread Jan Hubicka
 When you compile with -Os, the inlining happens only when code size reduces.
 Thus we pretty much care about the code size metrics only.  I suspect the
 problem here might be that normal C++ code needs some inlining to make
 abstraction penalty go away. GCC -Os implementation is generally tuned for
 CSiBE and it is somewhat C centric (that makes sense for embedded world). As a
 result we might get quite noticeable slowdowns on C++ apps compiled with -Os
 (and code size growth too since abstraction is never eliminated). It can be
 seen also at tramp3d (Pooma testcase) where -Os produces a lot bigger and a lot
 slower code.
 Looks like -Os needs tweaking for C++. It would be awesome if Mozilla  
 could serve a testcase for that.
 I would be very interested to know the most obvious cases where we miss
 inlining and should not.  It would be most helpful to directly know
 -fdump-tree-inline_param-details for those or have self contained testcase.
 I posted such a testcase under Crucial C++ inlining broken under Os  
 subject.

Thanks, I will check it out tomorrow.  Testcases are really important here.

 Would be nice to ensure that code-size-reducing inlining has testsuite  
 coverage and then move on to more ambitious targets.

We have the CSiBE benchmark to test -Os, but it does not contain many
inlining-heavy testcases as far as I am aware.

 For example, it would be nice to mix -O2/-Os via pgo. Mozilla carries  
 support for a lot of special-cases that are rarely used, would be nice  
 to aggressively minimize binary size for those features. I'm hoping we  
 can reduce binary size further via LTO.

With PGO, everything executed on average fewer than once per execution in the
train run is optimized for size.
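
Read literally, that PGO rule is a single comparison between a profile count
and the number of training executions.  A minimal sketch under that reading;
the helper name and its parameters are hypothetical and do not correspond to
any GCC function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a GCC function: true when a piece of code ran
   on average fewer than once per training execution and so would be
   optimized for size under the rule described above.  */
bool hypothetical_optimize_for_size_p(uint64_t exec_count, uint64_t train_runs)
{
    assert(train_runs > 0);
    /* exec_count / train_runs < 1, written without a division.  */
    return exec_count < train_runs;
}
```

With 10 training runs, code executed 3 times in total would be treated as
cold, while code executed 10 or more times would not.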

 I would be happy to see a streamlined feedback loop between GCC and  
 Mozilla. I think setting up regular Mozilla benchmarking is a great  
 idea. I volunteer to be the Mozilla contact for that.

 Mozilla build instructions are here:  
 https://developer.mozilla.org/En/Simple_Firefox_build

 PGO instructions are at  
 https://developer.mozilla.org/en/Building_with_Profile-Guided_Optimization

 Our testsuite is pretty easy to setup now, see
 https://wiki.mozilla.org/StandaloneTalos
 It takes over an hour to run on slower hardware. I'm not aware of any  
 good documentation on what the tests do and what they are measured in.  
 https://wiki.mozilla.org/Performance:Tinderbox_Tests is an older  
 reference on the matter. Ping me on irc if you aren't sure what a test  
 is doing.

I will check it out tomorrow and see how far I get.  Automated nightly
tester would be very good for this.

Honza

 Cheers,
 Taras


Re: Massive performance regression from switching to gcc 4.5

2010-06-30 Thread Taras Glek

On 06/24/2010 12:06 PM, Andrew Pinski wrote:



On Jun 24, 2010, at 11:50 AM, Taras Glek tg...@mozilla.com wrote:


Hi,
Just wanted to give a heads up on what might be the biggest 
compiler-upgrade-related performance difference we've seen at Mozilla.


We switched gcc4.3 for gcc4.5 and our automated benchmarking 
infrastructure reported  4-19% slowdown on most of our performance 
metrics on 32 and 64bit Linux.


A lone 8% speedup was measured on the Sunspider javascript benchmark 
on 64bit linux.


Here are some of the slowdowns reported:
http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/77951ccb76b5e630# 

http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/624246d7d900ed41# 




Most of the code is compiled with   -fPIC  -fno-rtti -fno-exceptions -Os


Stop right there. You are compiling at -Os, that is tuned for size and 
not speed. So the question is did the size go down? Not the speed 
decreased. Try at -O2 and report back. I doubt we are going to do a 
tradeoff for speed at -Os at all.

Thanks,
Andrew Pinski


Good point.

Looks like the actual problem is that at -Os there is less inlining 
happening in 4.5 vs 4.3, which results in a bigger binary and slower code.


I tried 4.5 -O2 and it's actually faster than 4.3 -Os.

I am happy that -O2 performance is actually pretty good, but -Os 
regression is going to hurt on mobile.


Taras


Re: Massive performance regression from switching to gcc 4.5

2010-06-30 Thread Basile Starynkevitch
On Wed, 2010-06-30 at 14:23 -0700, Taras Glek wrote:
 
 I tried 4.5 -O2 and it's actually faster than 4.3 -Os.
 
 I am happy that -O2 performance is actually pretty good, but -Os 
 regression is going to hurt on mobile.

Did you try gcc-4.5 -flto -Os or gcc-4.5 -flto -O2?

It would be interesting to hear that GCC is able to LTO a program as big
as Mozilla! And figures (notably RAM, CPU time, wallclock time for
build) would be interesting.

Cheers.

-- 
Basile STARYNKEVITCH http://starynkevitch.net/Basile/
email: basileatstarynkevitchdotnet mobile: +33 6 8501 2359
8, rue de la Faiencerie, 92340 Bourg La Reine, France
*** opinions {are only mines, sont seulement les miennes} ***




Re: Massive performance regression from switching to gcc 4.5

2010-06-30 Thread Taras Glek

On 06/30/2010 02:26 PM, Basile Starynkevitch wrote:

On Wed, 2010-06-30 at 14:23 -0700, Taras Glek wrote:
   

I tried 4.5 -O2 and it's actually faster than 4.3 -Os.

I am happy that -O2 performance is actually pretty good, but -Os
regression is going to hurt on mobile.
 

Did you try gcc-4.5 -flto -Os or gcc-4.5 -flto -O2?

It would be interesting to hear that GCC is able to LTO a program as big
as Mozilla! And figures (notably RAM, CPU time, wallclock time for
build) would be interesting.
   


Both whopr and flto cause gcc to segfault while building Mozilla.


Taras


Re: Massive performance regression from switching to gcc 4.5

2010-06-30 Thread Jan Hubicka
 On 06/30/2010 02:26 PM, Basile Starynkevitch wrote:
 On Wed, 2010-06-30 at 14:23 -0700, Taras Glek wrote:

 I tried 4.5 -O2 and it's actually faster than 4.3 -Os.

 I am happy that -O2 performance is actually pretty good, but -Os
 regression is going to hurt on mobile.
  
 Did you try gcc-4.5 -flto -Os or gcc-4.5 -flto -O2?

 It would be interesting to hear that GCC is able to LTO a program as big
 as Mozilla! And figures (notably RAM, CPU time, wallclock time for
 build) would be interesting.


 Both whopr and flto cause gcc to segfault while building Mozilla.

4.5 WHOPR is completely broken.  LTO is in better shape but I am not sure if we
can reasonably expect it to build mozilla.  However I would be very happy to help
getting WHOPR working for 4.6.

If you can find actual simple examples where -Os is losing size and speed we can
try to do something about them.

Honza


 Taras


Re: Massive performance regression from switching to gcc 4.5

2010-06-28 Thread Richard B. Kreckel

Jan Hubicka wrote:

GiNaC indeed shows interesting behaviour.  Just the first test on 4.3 is:
timing commutative expansion and substitution
size:   100     200     400     800
time/s: 0.064   0.30    1.4     6.2

for 4.5
timing commutative expansion and substitution
size:   100     200     400     800
time/s: 0.080   0.36    1.6     7.4
and for 4.6
timing commutative expansion and substitution
size:   100     200     400     800
time/s: 0.076   0.37    1.7     8.2

I assume that the numbers are times in seconds, so it is indeed not good.  Can I
easily run the individual tests by hand?  Do you have any idea what is going
wrong here?


Yes, one can run the individual tests by hand. And it should suffice to 
focus on the first one, as you did. All the follow-up regressions are 
most likely due to the same problem. And no, I have no idea what's up yet.


  -richy.
--
Richard B. Kreckel
http://www.ginac.de/~kreckel/


Re: Massive performance regression from switching to gcc 4.5

2010-06-27 Thread Jan Hubicka
 On Fri, Jun 25, 2010 at 06:10:56AM -0700, Jan Hubicka wrote:
  When you compile with -Os, the inlining happens only when code size reduces.
  Thus we pretty much care about the code size metrics only.  I suspect the
  problem here might be that normal C++ code needs some inlining to make
  abstraction penalty go away. GCC -Os implementation is generally tuned for
  CSiBE and it is somewhat C centric (that makes sense for embedded world). As a
  result we might get quite noticeable slowdowns on C++ apps compiled with -Os
  (and code size growth too since abstraction is never eliminated). It can be
  seen also at tramp3d (Pooma testcase) where -Os produces a lot bigger and a lot
  slower code.
 
 One would think that in most of the abstraction-penalty cases, the inlined
 code (often the direct reading or setting of a class data member) should
 be both smaller and faster than the call, so -Os should inline.  Perhaps

Yes, at -Os the inliner should inline to produce the smallest binary.  The problem
is to figure out at inline time what the results of the inlining decisions will be.
The inliner has no idea how the code will optimize if it does inlining or if it
does not, and often the code quality depends on multiple inline decisions (i.e.
one needs to inline all methods of the object for it to be scalar replaced
and optimized away).
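
A small C sketch of that dependency between inline decisions.  All names here
(point, point_make, point_x, point_y, sum_coords) are made up for the example;
it only illustrates the shape of the problem, not any particular Mozilla or
GiNaC code.

```c
/* A small "object" with accessor functions, in the style of C++ getters.  */
struct point { int x, y; };

static inline struct point point_make(int x, int y)
{
    struct point p = { x, y };
    return p;
}

static inline int point_x(const struct point *p) { return p->x; }
static inline int point_y(const struct point *p) { return p->y; }

/* When point_make, point_x and point_y are all inlined here, the compiler
   can see that P never escapes, split it into two scalars and reduce the
   body to "return a + b".  If an accessor taking &p stays a call, the
   address of P escapes and P has to live in memory.  */
int sum_coords(int a, int b)
{
    struct point p = point_make(a, b);
    return point_x(&p) + point_y(&p);
}
```

Judged one call at a time, each accessor looks like a break-even case for code
size; the saving only shows up once the whole group has been inlined.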

The current heuristics make some simple guesses.  Some of them are turned off for
-Os to avoid regressions on C benchmarks in CSiBE.  So I would really need to see
individual examples to work out what can be done about them.

Honza
 there are cases where the inlined version is, say, one or two instructions larger
 than the version with a call, and this causes the degradation?  If so,
 maybe some heuristic could be produced that would inline anyway for
 a small function?


Re: Massive performance regression from switching to gcc 4.5

2010-06-27 Thread Jan Hubicka
 On Fri, 25 Jun 2010, it was written:
  On Thu, Jun 24, 2010 at 11:50:52AM -0700, Taras Glek wrote:
   We switched gcc4.3 for gcc4.5 and our automated benchmarking
   infrastructure reported  4-19% slowdown on most of our performance
   metrics on 32 and 64bit Linux.
 
  Could you please also try gcc4.4, so that it is clear if the slowdowns
  are between 4.3 and 4.4 or 4.4 and 4.5?  Would be nice to narrow the changes
  a little bit.
 
 There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC
 (a computer algebra library) benchmark suite after switching from 4.4 to
 4.5 on x86_64 when compiling with -O2. And there hasn't been a measurable
 performance differences between 4.3 and 4.4.

FP intensive code could be also affected by:

On x86 targets, code containing floating-point calculations may run 
significantly slower when compiled with GCC 4.5 in strict C99 conformance mode 
than they did with earlier GCC versions. This is due to stricter standard 
conformance of the compiler and can be avoided by using the option 
-fexcess-precision=fast; also see below.

(see http://gcc.gnu.org/gcc-4.5/changes.html)
If not, I would be interested to take a look.  C++ code in general tends to
challenge the inliner heuristics.  We assembled a benchmark suite that we use for
tuning the interprocedural optimizers pretty much every release. So perhaps
GiNaC can be used as one of the tests.
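
Whether the floating-point note quoted above can apply to a given build
depends on how the target evaluates floating-point expressions.  A minimal
sketch, assuming a C99 <float.h>: the standard macro FLT_EVAL_METHOD is 2 when
everything is evaluated as long double (typical for x87 code on 32-bit x86)
and 0 when operations use their own type (typical for SSE code on x86_64).
The two helper names are made up for this example.

```c
#include <float.h>
#include <stdbool.h>
#include <stdio.h>

/* C99 FLT_EVAL_METHOD values 1 and 2 mean that some operations are
   evaluated in a wider format than their type, i.e. excess precision
   exists.  Negative values are indeterminable or implementation-defined
   and are reported as "no" here.  */
bool evaluates_wider_than_type(int flt_eval_method)
{
    return flt_eval_method == 1 || flt_eval_method == 2;
}

/* Print what the current compiler, target and flags selected.  */
void report_eval_method(void)
{
    int method = (int) FLT_EVAL_METHOD;

    printf("FLT_EVAL_METHOD = %d, excess precision: %s\n", method,
           evaluates_wider_than_type(method) ? "yes" : "no");
}
```

If report_eval_method() prints 0 for the flags in use, the excess-precision
change is unlikely to be the cause of a slowdown, which is consistent with
code that does not use floating point at all.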

Honza

 
   -richy.
 -- 
 Richard B. Kreckel
 http://www.ginac.de/~kreckel/


Re: Massive performance regression from switching to gcc 4.5

2010-06-27 Thread Richard B. Kreckel

Jan Hubicka wrote:

On Fri, 25 Jun 2010, it was written:
There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC
(a computer algebra library) benchmark suite after switching from 4.4 to
4.5 on x86_64 when compiling with -O2. And there hasn't been a measurable
performance differences between 4.3 and 4.4.


FP intensive code could be also affected by:


This code isn't using floating-point.

  -richy.
--
Richard B. Kreckel
http://www.ginac.de/~kreckel/


Re: Massive performance regression from switching to gcc 4.5

2010-06-27 Thread Jan Hubicka
 Jan Hubicka wrote:
 On Fri, 25 Jun 2010, it was written:
 There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC
 (a computer algebra library) benchmark suite after switching from 4.4 to
 4.5 on x86_64 when compiling with -O2. And there hasn't been a measurable
 performance differences between 4.3 and 4.4.

 FP intensive code could be also affected by:

 This code isn't using floating-point.

Hmm, building ginac with current 4.5 branch I get:
/bin/sh ../libtool --tag=CXX   --mode=compile /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. -I../../ginac -I../config   -I/usr/local/include -g -O2 -MT function.lo -MD -MP -MF .deps/function.Tpo -c -o function.lo ../../ginac/function.cpp
libtool: compile:  /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. -I../../ginac -I../config -I/usr/local/include -g -O2 -MT function.lo -MD -MP -MF .deps/function.Tpo -c ../../ginac/function.cpp  -fPIC -DPIC -o .libs/function.o
../../ginac/function.cpp: In member function ‘GiNaC::ex GiNaC::function::power(const GiNaC::ex) const’:
../../ginac/function.cpp:1800:15: error: expected type-specifier
../../ginac/function.cpp:1800:15: error: expected ‘)’
../../ginac/function.cpp:1801:72: error: conversion from ‘int*’ to ‘GiNaC::ex’ is ambiguous
../../ginac/ex.h:279:1: note: candidates are: GiNaC::ex::ex(long unsigned int) near match
../../ginac/ex.h:273:1: note: GiNaC::ex::ex(long int) near match
../../ginac/ex.h:267:1: note: GiNaC::ex::ex(unsigned int) near match
../../ginac/ex.h:261:1: note: GiNaC::ex::ex(int) near match

Honza

   -richy.
 -- 
 Richard B. Kreckel
 http://www.ginac.de/~kreckel/


Re: Massive performance regression from switching to gcc 4.5

2010-06-27 Thread Jan Hubicka
  Jan Hubicka wrote:
  On Fri, 25 Jun 2010, it was written:
  There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC
  (a computer algebra library) benchmark suite after switching from 4.4 to
  4.5 on x86_64 when compiling with -O2. And there hasn't been a measurable
  performance differences between 4.3 and 4.4.
 
  FP intensive code could be also affected by:
 
  This code isn't using floating-point.
 
 Hmm, building ginac with current 4.5 branch I get:
 /bin/sh ../libtool --tag=CXX   --mode=compile 
 /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. -I../../ginac 
 -I../config   -I/usr/local/include-g -O2 -MT function.lo -MD -MP -MF 
 .deps/function.Tpo -c -o function.lo ../../ginac/function.cpp
 libtool: compile:  /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. 
 -I../../ginac -I../config -I/usr/local/include -g -O2 -MT function.lo -MD -MP 
 -MF .deps/function.Tpo -c ../../ginac/function.cpp  -fPIC -DPIC -o 
 .libs/function.o
 ../../ginac/function.cpp: In member function ‘GiNaC::ex 
 GiNaC::function::power(const GiNaC::ex) const’:
 ../../ginac/function.cpp:1800:15: error: expected type-specifier
 ../../ginac/function.cpp:1800:15: error: expected ‘)’
 ../../ginac/function.cpp:1801:72: error: conversion from ‘int*’ to 
 ‘GiNaC::ex’ is ambiguous
 ../../ginac/ex.h:279:1: note: candidates are: GiNaC::ex::ex(long unsigned 
 int) near match
 ../../ginac/ex.h:273:1: note: GiNaC::ex::ex(long int) near 
 match
 ../../ginac/ex.h:267:1: note: GiNaC::ex::ex(unsigned int) 
 near match
 ../../ginac/ex.h:261:1: note: GiNaC::ex::ex(int) near match

Hi,
I got around it by just replacing the offending line by abort(), which seems to get
me far enough to build the testsuite.  One thing I noticed is that compile time
grew excessively, to about 10 minutes.  It seems to be var tracking:

39609 11.3093  cc1plus                canonicalize_values_star
33469  9.5562  cc1plus                loc_cmp
22206  6.3403  cc1plus                set_slot_part
15798  4.5107  cc1plus                rtx_equal_p
14479  4.1341  cc1plus                htab_find_with_hash
10461  2.9869  cc1plus                htab_find_slot_with_hash
 7268  2.0752  cc1plus                find_loc_in_1pdv
 6718  1.9181  cc1plus                cselib_expand_value_rtx_1
 6619  1.8899  libc-2.11.1.so         strcmp
 6193  1.7682  cc1plus                htab_expand
 6020  1.7188  cc1plus                check_changed_vars_0
 5797  1.6552  cc1plus                htab_traverse_noresize
 5615  1.6032  cc1plus                htab_find_with_hash
 4757  1.3582  cc1plus                vt_expand_loc_callback
 4529  1.2931  libginac-1.5.so.0.1.2  GiNaC::basic::compare(GiNaC::basic const) const
 4497  1.2840  libc-2.11.1.so         _int_malloc
 3879  1.1075  as                     /usr/bin/as
 3703  1.0573  cc1plus                variable_htab_eq
 3698  1.0559  cc1plus                insert_into_intersection
 3287  0.9385  cc1plus                emit_note_insn_var_location

Honza


Re: Massive performance regression from switching to gcc 4.5

2010-06-27 Thread Jan Hubicka
   Jan Hubicka wrote:
   On Fri, 25 Jun 2010, it was written:
   There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC
   (a computer algebra library) benchmark suite after switching from 4.4 to
   4.5 on x86_64 when compiling with -O2. And there hasn't been a 
   measurable
   performance differences between 4.3 and 4.4.
  
   FP intensive code could be also affected by:
  
   This code isn't using floating-point.
  
  Hmm, building ginac with current 4.5 branch I get:
  /bin/sh ../libtool --tag=CXX   --mode=compile 
  /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. -I../../ginac 
  -I../config   -I/usr/local/include-g -O2 -MT function.lo -MD -MP -MF 
  .deps/function.Tpo -c -o function.lo ../../ginac/function.cpp
  libtool: compile:  /abuild/jh/gcc-4.5-nopatch/bin/g++ -DHAVE_CONFIG_H -I. 
  -I../../ginac -I../config -I/usr/local/include -g -O2 -MT function.lo -MD 
  -MP -MF .deps/function.Tpo -c ../../ginac/function.cpp  -fPIC -DPIC -o 
  .libs/function.o
  ../../ginac/function.cpp: In member function ‘GiNaC::ex 
  GiNaC::function::power(const GiNaC::ex) const’:
  ../../ginac/function.cpp:1800:15: error: expected type-specifier
  ../../ginac/function.cpp:1800:15: error: expected ‘)’
  ../../ginac/function.cpp:1801:72: error: conversion from ‘int*’ to 
  ‘GiNaC::ex’ is ambiguous
  ../../ginac/ex.h:279:1: note: candidates are: GiNaC::ex::ex(long unsigned 
  int) near match
  ../../ginac/ex.h:273:1: note: GiNaC::ex::ex(long int) near 
  match
  ../../ginac/ex.h:267:1: note: GiNaC::ex::ex(unsigned int) 
  near match
  ../../ginac/ex.h:261:1: note: GiNaC::ex::ex(int) near 
  match
 
 Hi,
 I got around it by just replacing the offending line by abort(), which seems to
 get me far enough to build the testsuite.  One thing I noticed is that compile
 time grew excessively, to about 10 minutes.  It seems to be var tracking:
 
 39609 11.3093  cc1plus  canonicalize_values_star
 33469 9.5562  cc1plus  loc_cmp
 22206 6.3403  cc1plus  set_slot_part
 15798 4.5107  cc1plus  rtx_equal_p
 14479 4.1341  cc1plus  htab_find_with_hash
 10461 2.9869  cc1plus  htab_find_slot_with_hash
 7268  2.0752  cc1plus  find_loc_in_1pdv
 6718  1.9181  cc1plus  cselib_expand_value_rtx_1
 6619  1.8899  libc-2.11.1.so   strcmp
 6193  1.7682  cc1plus  htab_expand
 6020  1.7188  cc1plus  check_changed_vars_0
 5797  1.6552  cc1plus  htab_traverse_noresize
 5615  1.6032  cc1plus  htab_find_with_hash
 4757  1.3582  cc1plus  vt_expand_loc_callback
 4529  1.2931  libginac-1.5.so.0.1.2  GiNaC::basic::compare(GiNaC::basic const) const
 4497  1.2840  libc-2.11.1.so   _int_malloc
 3879  1.1075  as   /usr/bin/as
 3703  1.0573  cc1plus  variable_htab_eq
 3698  1.0559  cc1plus  insert_into_intersection
 3287  0.9385  cc1plus  emit_note_insn_var_location

(it is a regression on the 4.5 branch, I forgot to mention)

Honza
 
 Honza


Re: Massive performance regression from switching to gcc 4.5

2010-06-27 Thread Jan Hubicka
 
 (it is regression at 4.5 branch, forgot to mention)
PR44694

GiNaC indeed shows interesting behaviour.  Just the first test on 4.3 is:
timing commutative expansion and substitution
size:   100     200     400     800
time/s: 0.064   0.30    1.4     6.2

for 4.5
timing commutative expansion and substitution
size:   100     200     400     800
time/s: 0.080   0.36    1.6     7.4
and for 4.6
timing commutative expansion and substitution
size:   100     200     400     800
time/s: 0.076   0.37    1.7     8.2

I assume that the numbers are times in seconds, so it is indeed not good.  Can I
easily run the individual tests by hand?  Do you have any idea what is going
wrong here?
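
To put those timings on one scale, the relative slowdown against 4.3 can be
computed as (new / old - 1) * 100.  A minimal sketch; the function name
slowdown_percent is an assumption made for this example.

```c
#include <assert.h>

/* Relative slowdown of NEW_TIME over OLD_TIME, in percent.
   Positive means slower, negative means faster.  */
double slowdown_percent(double old_time, double new_time)
{
    assert(old_time > 0.0);
    return (new_time / old_time - 1.0) * 100.0;
}
```

For size 100, going from 0.064 s on 4.3 to 0.080 s on 4.5 is a 25% slowdown;
for size 800, 6.2 s to 7.4 s is about 19% on 4.5, and 6.2 s to 8.2 s is about
32% on 4.6.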

Honza


Re: Massive performance regression from switching to gcc 4.5

2010-06-26 Thread Richard B. Kreckel
On Fri, 25 Jun 2010, it was written:
 On Thu, Jun 24, 2010 at 11:50:52AM -0700, Taras Glek wrote:
  We switched gcc4.3 for gcc4.5 and our automated benchmarking
  infrastructure reported  4-19% slowdown on most of our performance
  metrics on 32 and 64bit Linux.

 Could you please also try gcc4.4, so that it is clear if the slowdowns
 are between 4.3 and 4.4 or 4.4 and 4.5?  Would be nice to narrow the changes
 a little bit.

There sure is something in 4.5. I've seen a 1-10% slowdown at the GiNaC
(a computer algebra library) benchmark suite after switching from 4.4 to
4.5 on x86_64 when compiling with -O2. And there hasn't been a measurable
performance differences between 4.3 and 4.4.

  -richy.
-- 
Richard B. Kreckel
http://www.ginac.de/~kreckel/



Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Jonathan Adamczewski
On 25/06/10 06:39, Richard Guenther wrote:
 There are btw. some bugs wrt accounting of functions called once
 being inlined in 4.5 which were fixed on trunk which allow extra
 inlining.
   

Are these changes likely to make it onto the 4.5 branch and into (say)
4.5.1?

j.


Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Eric Botcazou
 Minus whitespace changes it seems to be

 !   if (lhs_free && (is_gimple_reg (rhs) ||
 is_gimple_min_invariant (rhs)))
   rhs_free = true;

 vs.

 !   if (lhs_free
 !       && (is_gimple_reg (rhs)
 !           || !is_gimple_reg_type (TREE_TYPE (rhs))
 !           || is_gimple_min_invariant (rhs)))
   rhs_free = true;

 so the stmt is likely being eliminated if either the LHS or the RHS is
 based on a parameter and the other side is a register or an invariant.  You
 change that to also discount aggregate stores/loads to/from parameters to
 be free.

There is also the counterpart for the RHS:

!   if (rhs_free && is_gimple_reg (lhs))
  lhs_free = true;
vs

!   if (rhs_free
!       && (is_gimple_reg (lhs)
!           || !is_gimple_reg_type (TREE_TYPE (lhs))))
  lhs_free = true;

 Which you could have simplified to just say

   if (lhs_free || rhs_free)
 return true;

 and drop the code you are changing.

I don't think so, compare your version and mine for scalar stores/loads 
from/to parameters or return values.

-- 
Eric Botcazou


Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Richard Guenther
On Fri, Jun 25, 2010 at 8:34 AM, Eric Botcazou ebotca...@adacore.com wrote:
 Minus whitespace changes it seems to be

 !           if (lhs_free && (is_gimple_reg (rhs) ||
 is_gimple_min_invariant (rhs)))
               rhs_free = true;

 vs.

 !           if (lhs_free
 !               && (is_gimple_reg (rhs)
 !                   || !is_gimple_reg_type (TREE_TYPE (rhs))
 !                   || is_gimple_min_invariant (rhs)))
               rhs_free = true;

 so the stmt is likely being eliminated if either the LHS or the RHS is
 based on a parameter and the other side is a register or an invariant.  You
 change that to also discount aggregate stores/loads to/from parameters to
 be free.

 There is also the counterpart for the RHS:

 !           if (rhs_free && is_gimple_reg (lhs))
              lhs_free = true;
 vs

 !           if (rhs_free
 !               && (is_gimple_reg (lhs)
 !                   || !is_gimple_reg_type (TREE_TYPE (lhs))))
              lhs_free = true;

Sure, but it's equivalent.

 Which you could have simplified to just say

   if (lhs_free || rhs_free)
     return true;

 and drop the code you are changing.

 I don't think so, compare your version and mine for scalar stores/loads
 from/to parameters or return values.

I do think so.  Quoting the complete patched code:

if (TREE_CODE (inner_rhs) == PARM_DECL
    || (TREE_CODE (inner_rhs) == SSA_NAME
        && SSA_NAME_IS_DEFAULT_DEF (inner_rhs)
        && TREE_CODE (SSA_NAME_VAR (inner_rhs)) == PARM_DECL))
  rhs_free = true;
!   if (rhs_free
!       && (is_gimple_reg (lhs)
!           || !is_gimple_reg_type (TREE_TYPE (lhs))))
  lhs_free = true;
if (((TREE_CODE (inner_lhs) == PARM_DECL
      || (TREE_CODE (inner_lhs) == SSA_NAME
          && SSA_NAME_IS_DEFAULT_DEF (inner_lhs)
          && TREE_CODE (SSA_NAME_VAR (inner_lhs)) == PARM_DECL))
     && inner_lhs != lhs)
    || TREE_CODE (inner_lhs) == RESULT_DECL
    || (TREE_CODE (inner_lhs) == SSA_NAME
        && TREE_CODE (SSA_NAME_VAR (inner_lhs)) == RESULT_DECL))
  lhs_free = true;
!   if (lhs_free
!       && (is_gimple_reg (rhs)
!           || !is_gimple_reg_type (TREE_TYPE (rhs))
!           || is_gimple_min_invariant (rhs)))
  rhs_free = true;
if (lhs_free && rhs_free)
  return true;

now, with is_gimple_reg () || !is_gimple_reg_type () ||
is_gimple_min_invariant () you allow registers, constants or
any aggregates but _not_ register typed variables that have
their address taken.

Which in the following example makes i = *p not likely eliminated
but makes j = *q likely eliminated.

void foo (int *p, struct X *q)
{
  int i;
  struct X j;
  i = *p;
  j = *q;
  bar (i, q);
}

That doesn't make sense.

What makes sense is that all scalar (thus gimple_reg_typed)
loads/stores to/from parameters or the result are free.  Which
isn't what the current code does but is also not what you
are changing it to.

Thus in the above example i = *p should be likely eliminated
but not j = *q (maybe we can make aggregate loads/stores
from/to non-address-taken vars as free, too).
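
The difference between the two conditions can be checked with a toy model.
This is a sketch only: the struct and function names are hypothetical, and the
three flags merely stand in for what is_gimple_reg, is_gimple_reg_type
(TREE_TYPE (op)) and is_gimple_min_invariant would answer for an operand.

```c
#include <stdbool.h>

/* Toy stand-in for a GIMPLE operand.  Hypothetical, not GCC code.  */
struct toy_operand {
    bool is_reg;        /* scalar whose address is not taken */
    bool is_reg_type;   /* type is scalar (register typed) */
    bool is_invariant;  /* constant */
};

/* Test applied to the other side of the assignment before the patch.  */
bool other_side_free_old(struct toy_operand op)
{
    return op.is_reg || op.is_invariant;
}

/* Patched test: additionally accepts anything that is not register typed,
   which means any aggregate.  */
bool other_side_free_patched(struct toy_operand op)
{
    return op.is_reg || !op.is_reg_type || op.is_invariant;
}
```

An int whose address is taken is { false, true, false } and is rejected by
both versions, while a struct is { false, false, false } and is accepted only
by the patched one, which is the asymmetry between i = *p and j = *q
described above.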

Richard.

 --
 Eric Botcazou



Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Richard Guenther
On Fri, Jun 25, 2010 at 8:15 AM, Jonathan Adamczewski
jadam...@utas.edu.au wrote:
 On 25/06/10 06:39, Richard Guenther wrote:
 There are btw. some bugs wrt accounting of functions called once
 being inlined in 4.5 which were fixed on trunk which allow extra
 inlining.


 Are these changes likely to make it onto the 4.5 branch and into (say)
 4.5.1?

Well, I'm always a bit nervous when backporting inline heuristic
changes as that may trigger latent problems on code where they
weren't seen before.

We are talking about revs 158278 and 159931.  And at this point
I'd leave it to Honza to consider their safety and do and test a
backport.

Richard.

 j.



Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Eric Botcazou
 I do think so.

Huh?  What do your version and mine return for the following assignment?

void foo (int i)
{
  struct S s;
  s.a = i;
}

 Which in the following example makes i = *p not likely eliminated
 but makes j = *q likely eliminated.

 void foo (int *p, struct X *q)
 {
   int i;
   struct X j;
   i = *p;
   j = *q;
   bar (i, q);
 }

 That doesn't make sense.

Yet that's what's supposed to be implemented, see the comment: loads from 
parameters passed by reference.

 What makes sense is that all scalar (thus gimple_reg_typed)
 loads/stores to/from parameters or the result are free.

Precisely not, they aren't free, otherwise they wouldn't exist in the first 
place.  Scalar loads/stores are never free, aggregate loads/stores may be 
free if they are created only to pass the object around.

-- 
Eric Botcazou


Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Richard Guenther
On Fri, Jun 25, 2010 at 12:45 PM, Eric Botcazou ebotca...@adacore.com wrote:
 I do think so.

 Huh?  What do your version and mine return for the following assignment?

 void foo (int i)
 {
  struct S s;
  s.a = i;
 }

 Which in the following example makes i = *p not likely eliminated
 but makes j = *q likely eliminated.

 void foo (int *p, struct X *q)
 {
   int i;
   struct X j;
   i = *p;
   j = *q;
   bar (i, q);
 }

 That doesn't make sense.

 Yet that's what's supposed to be implemented, see the comment: loads from
 parameters passed by reference.

 What makes sense is that all scalar (thus gimple_reg_typed)
 loads/stores to/from parameters or the result are free.

 Precisely not, they aren't free, otherwise they wouldn't exist in the first
 place.  Scalar loads/stores are never free, aggregate loads/stores may be
 free if they are created only to pass the object around.

Err.  aggregate loads/stores do not appear because aggregate
uses can appear in calls.

Scalar uses cannot appear in calls and thus you see them as
separate statements.

Thus,

struct X;
void bar(struct X);
void foo(struct X x)
{
  bar (x);
}

will appear as a single call stmt while

void bar (int);
void foo(int x)
{
  bar (x);
}

will have a load that is not supposed to be free?

Richard.

 --
 Eric Botcazou



Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Jan Hubicka
 On Fri, Jun 25, 2010 at 8:15 AM, Jonathan Adamczewski
 jadam...@utas.edu.au wrote:
  On 25/06/10 06:39, Richard Guenther wrote:
  There are btw. some bugs wrt accounting of functions called once
  being inlined in 4.5 which were fixed on trunk which allow extra
  inlining.
 
 
  Are these changes likely to make it onto the 4.5 branch and into (say)
  4.5.1?
 
 Well, I'm always a bit nervous when backporting inline heuristic
 changes as that may trigger latent problems on code where they
 weren't seen before.
 
 We are talking about revs 158278 and 159931.  And at this point
 I'd leave it to Honza to consider their safety and do and test a
 backport.

The main change in the GCC 4.5 heuristic is that it is no longer driven by
somewhat fuzzy cost estimates that mixed size, speed and some legacy (such as a
bug that completely ignored the existence of loads and stores).  It now uses a
code size estimate and a speedup estimate to drive inlining (basically a greedy
algorithm trying to maximize speedup under the code size growth constraints).
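
A toy sketch of such a greedy scheme, not GCC's actual implementation: each candidate call edge carries an assumed size growth and an assumed speedup, and the best remaining edge is inlined until a growth budget is exhausted. The `edge` struct, the `better` ordering and the `greedy_inline` function are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one inlining candidate (a call edge).  */
struct edge
{
  int growth;   /* estimated code size change if inlined; may be negative */
  int speedup;  /* estimated time saved if inlined */
  int inlined;  /* set to 1 once the edge has been chosen */
};

/* Return nonzero if candidate A should be preferred over candidate B.  */
static int
better (const struct edge *a, const struct edge *b)
{
  int a_shrinks = a->growth <= 0;
  int b_shrinks = b->growth <= 0;

  if (a_shrinks != b_shrinks)
    return a_shrinks;                 /* size-reducing edges come first */
  if (a_shrinks)
    return a->growth < b->growth;     /* both shrink: larger saving wins */
  /* Both grow: compare speedup per unit of growth, cross-multiplied.  */
  return (long) a->speedup * b->growth > (long) b->speedup * a->growth;
}

/* Repeatedly inline the best remaining edge that still fits into BUDGET
   units of total growth.  Returns the net growth actually spent.  */
int
greedy_inline (struct edge *e, size_t n, int budget)
{
  int spent = 0;

  assert (budget >= 0);
  for (;;)
    {
      struct edge *best = NULL;

      for (size_t i = 0; i < n; i++)
        {
          if (e[i].inlined || spent + e[i].growth > budget)
            continue;
          if (best == NULL || better (&e[i], best))
            best = &e[i];
        }
      if (best == NULL)
        return spent;
      best->inlined = 1;
      spent += best->growth;
    }
}
```

In this model a budget of 0 only admits edges that keep the running total from growing, which loosely resembles the size-only behaviour of -Os; a larger budget lets the edges with the best speedup per unit of growth through first.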

When you compile with -Os, inlining happens only when code size is reduced.
Thus we pretty much care about the code size metrics only.  I suspect the
problem here might be that normal C++ code needs some inlining to make the
abstraction penalty go away. The GCC -Os implementation is generally tuned for
CSiBE and it is somewhat C-centric (which makes sense for the embedded world).
As a result we might get quite noticeable slowdowns on C++ apps compiled with
-Os (and code size growth too, since abstraction is never eliminated). This can
also be seen on tramp3d (Pooma testcase), where -Os produces a lot bigger and a
lot slower code.

I would be very interested to know the most obvious cases where we miss
inlining and should not.  It would be most helpful to have the
-fdump-tree-inline_param-details output for those, or a self-contained testcase.

It might benefit both projects if we managed to set up regular
Mozilla benchmarking (similar to what we do for C++ benchmarks at
http://gcc.opensuse.org/c++bench-frescobaldi/ ).  I have been thinking about
this for a while but was somewhat discouraged by the overall complexity of
Mozilla, and we also currently lack hardware for all the testing we would like
to do.  Mozilla is a wonderful example of a complex real-world C++ application
with a benchmark suite, which makes it a really good target for tuning IPA.

I would also be very interested to know how profile feedback works in this case
(and why it does not work in previous releases).  I maintain both areas
of the compiler and would definitely be happy to do some work to help make it
useful for you.

GCC 4.6 has several changes in the inlining heuristics that might be considered
for backporting if they are found to be _really_ important. The most noticeable
are probably:

  1) It fixes miscounting of variadic functions (this had quite a large effect
 on GCC itself since it prevents inlining parts of fatal_error)
  2) It fixes accounting of static functions (previously the overall unit
 change was decreased twice for every offline copy eliminated, which
 accidentally improved codegen for some C++ testcases but caused code
 size growth elsewhere)
  3) The priority queue was fixed, so it now correctly accounts for cost changes
 after inlining (this caused the best improvements in C)
  4) There were speedups in the inlining heuristics when dealing with functions
 having really many (say over 5) callers.

2) and 3) need to go together or we get slowdowns on our current C++ suite.

I am however concerned that the problem might be a clash between -Os
and the fact that C++ code generally needs speculative, code-growing inlining
to get rid of abstraction.  It depends on what your abstraction is whether
we can somehow easily get around this problem. GCC can detect certain
forms of constructs that will go away after inlining, and I was also thinking
about adding a small code growth buffer for -Os inlining too, if it helps
on average.

Honza
 
 Richard.
 
  j.
 


Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Jakub Jelinek
On Thu, Jun 24, 2010 at 11:50:52AM -0700, Taras Glek wrote:
 Just wanted to give a heads up on what might be the biggest  
 compiler-upgrade-related performance difference we've seen at Mozilla.

 We switched gcc4.3 for gcc4.5 and our automated benchmarking  
 infrastructure reported  4-19% slowdown on most of our performance  
 metrics on 32 and 64bit Linux.

Could you please also try gcc4.4, so that it is clear if the slowdowns
are between 4.3 and 4.4 or 4.4 and 4.5?  Would be nice to narrow the changes
a little bit.

Jakub


Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Richard Guenther
On Fri, Jun 25, 2010 at 1:02 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Fri, Jun 25, 2010 at 12:45 PM, Eric Botcazou ebotca...@adacore.com wrote:
 I do think so.

 Huh?  What do your version and mine return for the following assignment?

 void foo (int i)
 {
  struct S s;
  s.a = i;
 }

 Which in the following example makes i = *p not likely eliminated
 but makes j = *q likely eliminated.

 void foo (int *p, struct X *q)
 {
   int i;
   struct X j;
   i = *p;
   j = *q;
   bar (i, q);
 }

 That doesn't make sense.

 Yet that's what's supposed to be implemented, see the comment: loads from
 parameters passed by reference.

 What makes sense is that all scalar (thus gimple_reg_typed)
 loads/stores to/from parameters or the result are free.

 Precisely not, they aren't free, otherwise they wouldn't exist in the first
 place.  Scalar loads/stores are never free, aggregate loads/stores may be
 free if they are created only to pass the object around.

 Err.  aggregate loads/stores do not appear because aggregate
 uses can appear in calls.

 Scalar uses cannot appear in calls and thus you see them as
 separate statements.

 Thus,

 struct X;
 void bar(struct X);
 void foo(struct X x)
 {
  bar (x);
 }

 will appear as a single call stmt while

 void bar (int);
 void foo(int x)
 {
  bar (x);
 }

 will have a load that is not supposed to be free?

Thus, do you have a testcase where your patch helps?

Richard.

 Richard.

 --
 Eric Botcazou




Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Alexander Monakov
Hi,

On Fri, 25 Jun 2010, Jan Hubicka wrote:

 I would be also very interested to know how profile feedback works in this case
 (and why it does not work in previous releases).

Profiling multi-threading programs needs -fprofile-correction that appeared
only in 4.4 (but I have no idea whether 4.4 works for Mozilla or not -- the
initial message only speaks about 4.3 and 4.5).  Mozilla code also triggered a
bug in libgcov ( http://gcc.gnu.org/PR43825 ), and they have probably modified
their code to never leave non-default alignment at the end of the TU (I have
posted a patch for the libgcov bug [1], but it was not reviewed and does not
apply anymore due to build_constructor changes).

[1] http://gcc.gnu.org/ml/gcc-patches/2010-05/msg00292.html
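
For illustration, one way to avoid leaving a non-default packing in effect at the end of a translation unit is to pair every `#pragma pack(push, ...)` with a `#pragma pack(pop)`. This is only a sketch of that pattern, not Mozilla's actual change; the struct names and the `check_layout` helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Pack only the struct that needs it ...  */
#pragma pack(push, 1)
struct wire_header { char tag; int length; };
#pragma pack(pop)   /* ... and restore the default before the TU ends.  */

/* Declared after the pop, so it gets the default layout again.  */
struct plain_header { char tag; int length; };

/* Sanity checks on the resulting layouts.  */
void
check_layout (void)
{
  assert (offsetof (struct wire_header, length) == 1);
  assert (sizeof (struct wire_header) == 1 + sizeof (int));
  assert (offsetof (struct plain_header, length) >= 1);
}
```

With the push/pop pair in place, nothing declared or emitted later in the same translation unit is affected by the packing.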

Cheers,
Alexander


Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Joe Buck
On Fri, Jun 25, 2010 at 06:10:56AM -0700, Jan Hubicka wrote:
 When you compile with -Os, the inlining happens only when code size reduces.
 Thus we pretty much care about the code size metrics only.  I suspect the
 problem here might be that normal C++ code needs some inlining to make
 abstraction penalty go away. GCC -Os implementation is generally tuned for
 CSiBE and it is somewhat C centric (that makes sense for embedded world). As a
 result we might get quite noticeable slowdowns on C++ apps compiled with -Os
 (and code size growth too since abstraction is never eliminated). It can be
 seen also at tramp3d (Pooma testcase) where -Os produces a lot bigger and a lot
 slower code.

One would think that in most of the abstraction-penalty cases, the inlined
code (often the direct reading or setting of a class data member) should
be both smaller and faster than the call, so -Os should inline.  Perhaps
there are cases where the inlined version is, say, one or two instructions larger
than the version with a call, and this causes the degradation?  If so,
maybe some heuristic could be produced that would inline anyway for
a small function?
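
A small sketch of the abstraction-penalty case described above, with hypothetical names (`widget`, `widget_width`, `area_via_calls`, `area_inlined`). The accessor bodies are a single load each, so the hand-inlined variant would normally be expected to be no larger than the variant that sets up and performs two calls; actual sizes depend on target and compiler version and are not measured here.

```c
#include <assert.h>
#include <stddef.h>

struct widget { int width; int height; };

/* Trivial accessors of the kind C++ code is full of.  */
int
widget_width (const struct widget *w)
{
  assert (w != NULL);
  return w->width;
}

int
widget_height (const struct widget *w)
{
  assert (w != NULL);
  return w->height;
}

/* Caller as written: two calls, each with argument setup and a return.  */
int
area_via_calls (const struct widget *w)
{
  return widget_width (w) * widget_height (w);
}

/* What the caller amounts to once both accessors are inlined.  */
int
area_inlined (const struct widget *w)
{
  return w->width * w->height;
}
```

Both variants compute the same result; only the amount of call overhead differs.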



Re: Massive performance regression from switching to gcc 4.5

2010-06-25 Thread Jan Hubicka
 Hi,
 
 On Fri, 25 Jun 2010, Jan Hubicka wrote:
 
  I would be also very interested to know how profile feedback works in this case
  (and why it does not work in previous releases).
 
 Profiling multi-threading programs needs -fprofile-correction that appeared
 only in 4.4 (but I have no idea whether 4.4 works for Mozilla or not -- the
 initial message only speaks about 4.3 and 4.5).  Mozilla code also triggered a
 bug in libgcov ( http://gcc.gnu.org/PR43825 ), and they have probably modified
 their code to never leave non-default alignment at the end of the TU (I have
 posted a patch for the libgcov bug [1], but it was not reviewed and does not
 apply anymore due to build_constructor changes).
 
 [1] http://gcc.gnu.org/ml/gcc-patches/2010-05/msg00292.html

Ah, sorry.  I thought the consensus was to disable the effect of pragma pack at
the end of parsing to avoid the libgcov incompatibility?

Honza


Massive performance regression from switching to gcc 4.5

2010-06-24 Thread Taras Glek

Hi,
Just wanted to give a heads up on what might be the biggest 
compiler-upgrade-related performance difference we've seen at Mozilla.


We switched gcc4.3 for gcc4.5 and our automated benchmarking 
infrastructure reported  4-19% slowdown on most of our performance 
metrics on 32 and 64bit Linux.


A lone 8% speedup was measured on the Sunspider javascript benchmark on 
64bit linux.


Here are some of the slowdowns reported:
http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/77951ccb76b5e630#
http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/624246d7d900ed41#


Most of the code is compiled with -fPIC -fno-rtti -fno-exceptions -Os 
-freorder-blocks -fomit-frame-pointer. The only difference in 4.5 is 
that we link with -static-libstdc++ and compile libstdc++ with -fPIC. 
However we barely make use of libstdc++, so I doubt that's the problem. 
We needed to link statically because 4.5 uses a handful of newer 
libstdc++ symbols.


We were upgrading to gcc 4.5.0 because of plugins and the fact that it 
can compile Firefox with PGO on (the above builds were not built with PGO). 
Now we have to reconsider a complete switchover to 4.5.


I'm not sure how to proceed from here,
Taras


Re: Massive performance regression from switching to gcc 4.5

2010-06-24 Thread Andrew Pinski



On Jun 24, 2010, at 11:50 AM, Taras Glek tg...@mozilla.com wrote:


Hi,
Just wanted to give a heads up on what might be the biggest
compiler-upgrade-related performance difference we've seen at Mozilla.


We switched gcc4.3 for gcc4.5 and our automated benchmarking  
infrastructure reported  4-19% slowdown on most of our performance  
metrics on 32 and 64bit Linux.


A lone 8% speedup was measured on the Sunspider javascript benchmark  
on 64bit linux.


Here are some of the slowdowns reported:
http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/77951ccb76b5e630#
http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/624246d7d900ed41#


Most of the code is compiled with   -fPIC  -fno-rtti -fno-exceptions  
-Os


Stop right there. You are compiling at -Os, which is tuned for size and
not speed. So the question is whether the size went down, not whether the
speed decreased. Try -O2 and report back. I doubt we are going to make a
tradeoff in favor of speed at -Os at all.

Thanks,
Andrew Pinski


-freorder-blocks -fomit-frame-pointer. The only difference in 4.5 is
that we link with -static-libstdc++ and compile libstdc++ with -fPIC.
However we barely make use of libstdc++, so I doubt that's the
problem. We needed to link statically because of 4.5 uses a handful
of newer libstdc++ symbols.


We were upgrading to gcc 4.5.0 because of plugins and the fact that  
it can compile Firefox with PGO on(above builds were not built with  
PGO). Now we have to reconsider a complete switchover to 4.5.


I'm not sure how to proceed from here,
Taras


Re: Massive performance regression from switching to gcc 4.5

2010-06-24 Thread Benjamin Smedberg

On 6/24/10 3:06 PM, Andrew Pinski wrote:


Most of the code is compiled with -fPIC -fno-rtti -fno-exceptions -Os


Stop right there. You are compiling at -Os, that is tuned for size and
not speed. So the question is did the size go down? Not the speed
decreased. Try at -O2 and report back. I doubt we are going to do a
tradeoff for speed at -Os at all.


For what it's worth, Mozilla-compiled-with-GCC has historically been faster 
compiled -Os instead of -O2. This is because the vast majority of our code 
is cold, and -O2 has produced substantially larger code, which causes our 
hot code to be evicted from processor caches more often.
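
As an aside, GCC also lets the hot/cold distinction be expressed per function rather than per build, through its documented `hot` and `cold` function attributes. The thread does not say that Mozilla used them; the sketch below only shows the syntax, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Rarely executed path.  GCC documents that cold functions are optimized
   for size rather than speed and may be grouped together in the binary.  */
__attribute__ ((cold)) int
report_failure (int code)
{
  return code < 0 ? code : -code;
}

/* Frequently executed path, marked hot so that it is optimized more
   aggressively.  */
__attribute__ ((hot)) long
sum (const int *v, size_t n)
{
  long total = 0;

  assert (v != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    total += v[i];
  return total;
}
```

Whether such annotations help in practice would have to be measured on the real code base.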


We will definitely try -O2 to see if the previous measurements are no longer 
valid with GCC 4.5.


Looking through our codesize comparison logs, some of our methods are 
thousands of bytes longer with GCC 4.5 than 4.3 (same -Os compiler flags):


+796    nsHTMLEditRules::nsHTMLEditRules()
+1088   nsCrypto::GenerateCRMFRequest(nsIDOMCRMFObject**)

In addition, it appears at first glance that GCC is either no longer 
inlining at -Os, even when it would be a size advantage to do so, or is 
making some very poor inlining choices.


e.g. +72     nsTArrayObserverRef::nsTArray(nsTArrayObserverRef const)

We can turn some of these observations into bug reports if that would be 
helpful, but if it would make more sense we could perhaps just tune the 
inlining parameters directly to get the real -Os that we usually want.


--BDS



Re: Massive performance regression from switching to gcc 4.5

2010-06-24 Thread Eric Botcazou
 In addition, it appears at first glance that GCC is either no longer
 inlining at -Os, even when it would be a size advantage to do so, or is
 making some very poor inlining choices.

 e.g. +72  nsTArrayObserverRef::nsTArray(nsTArrayObserverRef const)

 We can turn some of these observations into bug reports if that would be
 helpful, but if it would make more sense we could perhaps just tune the
 inlining parameters directly to get the real -Os that we usually want.

We ran into similar inlining regressions in Ada, the heuristics have indeed 
changed significantly.  The attached patchlet alone saves 3% in code size 
at -Os on a 50 MB executable and yields a 5% speedup at -O2 on another code.


* ipa-inline.c (likely_eliminated_by_inlining_p): Really consider that
loads from parameters passed by reference are free after inlining.


-- 
Eric Botcazou
*** gcc/ipa-inline.c.0	2010-06-12 17:01:09.0 +0200
--- gcc/ipa-inline.c	2010-06-12 18:26:32.0 +0200
*** likely_eliminated_by_inlining_p (gimple
*** 1736,1754 
  	bool rhs_free = false;
  	bool lhs_free = false;
  
!  	while (handled_component_p (inner_lhs) || TREE_CODE (inner_lhs) == INDIRECT_REF)
  	  inner_lhs = TREE_OPERAND (inner_lhs, 0);
!  	while (handled_component_p (inner_rhs)
! 	   || TREE_CODE (inner_rhs) == ADDR_EXPR || TREE_CODE (inner_rhs) == INDIRECT_REF)
  	  inner_rhs = TREE_OPERAND (inner_rhs, 0);
  
- 
  	if (TREE_CODE (inner_rhs) == PARM_DECL
  	|| (TREE_CODE (inner_rhs) == SSA_NAME
  		&& SSA_NAME_IS_DEFAULT_DEF (inner_rhs)
  		&& TREE_CODE (SSA_NAME_VAR (inner_rhs)) == PARM_DECL))
  	  rhs_free = true;
! 	if (rhs_free && is_gimple_reg (lhs))
  	  lhs_free = true;
  	if (((TREE_CODE (inner_lhs) == PARM_DECL
  	  || (TREE_CODE (inner_lhs) == SSA_NAME
--- 1736,1757 
  	bool rhs_free = false;
  	bool lhs_free = false;
  
! 	while (handled_component_p (inner_lhs)
! 		   || TREE_CODE (inner_lhs) == INDIRECT_REF)
  	  inner_lhs = TREE_OPERAND (inner_lhs, 0);
! 	while (handled_component_p (inner_rhs)
! 	   || TREE_CODE (inner_rhs) == ADDR_EXPR
! 		   || TREE_CODE (inner_rhs) == INDIRECT_REF)
  	  inner_rhs = TREE_OPERAND (inner_rhs, 0);
  
  	if (TREE_CODE (inner_rhs) == PARM_DECL
  	|| (TREE_CODE (inner_rhs) == SSA_NAME
  		&& SSA_NAME_IS_DEFAULT_DEF (inner_rhs)
  		&& TREE_CODE (SSA_NAME_VAR (inner_rhs)) == PARM_DECL))
  	  rhs_free = true;
! 	if (rhs_free
! 		&& (is_gimple_reg (lhs)
! 		|| !is_gimple_reg_type (TREE_TYPE (lhs))))
  	  lhs_free = true;
  	if (((TREE_CODE (inner_lhs) == PARM_DECL
  	  || (TREE_CODE (inner_lhs) == SSA_NAME
*** likely_eliminated_by_inlining_p (gimple
*** 1759,1765 
  	|| (TREE_CODE (inner_lhs) == SSA_NAME
  		&& TREE_CODE (SSA_NAME_VAR (inner_lhs)) == RESULT_DECL))
  	  lhs_free = true;
! 	if (lhs_free && (is_gimple_reg (rhs) || is_gimple_min_invariant (rhs)))
  	  rhs_free = true;
  	if (lhs_free && rhs_free)
  	  return true;
--- 1762,1771 
  	|| (TREE_CODE (inner_lhs) == SSA_NAME
  		&& TREE_CODE (SSA_NAME_VAR (inner_lhs)) == RESULT_DECL))
  	  lhs_free = true;
! 	if (lhs_free
! 		&& (is_gimple_reg (rhs)
! 		|| !is_gimple_reg_type (TREE_TYPE (rhs))
! 		|| is_gimple_min_invariant (rhs)))
  	  rhs_free = true;
  	if (lhs_free && rhs_free)
  	  return true;


Re: Massive performance regression from switching to gcc 4.5

2010-06-24 Thread Richard Guenther
On Thu, Jun 24, 2010 at 10:24 PM, Eric Botcazou ebotca...@adacore.com wrote:
 In addition, it appears at first glance that GCC is either no longer
 inlining at -Os, even when it would be a size advantage to do so, or is
 making some very poor inlining choices.

 e.g. +72      nsTArrayObserverRef::nsTArray(nsTArrayObserverRef const)

 We can turn some of these observations into bug reports if that would be
 helpful, but if it would make more sense we could perhaps just tune the
 inlining parameters directly to get the real -Os that we usually want.

 We ran into similar inlining regressions in Ada, the heuristics have indeed
 changed significantly.  The attached patchlet alone saves 3% in code size
 at -Os on a 50 MB executable and yields a 5% speedup at -O2 on another code.


        * ipa-inline.c (likely_eliminated_by_inlining_p): Really consider that
        loads from parameters passed by reference are free after inlining.

I don't understand this change.  Minus whitespace changes it seems to be

!   if (lhs_free && (is_gimple_reg (rhs) || is_gimple_min_invariant (rhs)))
      rhs_free = true;

vs.

!   if (lhs_free
!       && (is_gimple_reg (rhs)
!           || !is_gimple_reg_type (TREE_TYPE (rhs))
!           || is_gimple_min_invariant (rhs)))
      rhs_free = true;

so the stmt is likely being eliminated if either the LHS or the RHS is based
on a parameter and the other side is a register or an invariant.  You change
that to also discount aggregate stores/loads to/from parameters to be
free.

Which you could have simplified to just say

  if (lhs_free || rhs_free)
    return true;

and drop the code you are changing.

I never considered the heuristic that makes loads/stores from parameters
free a very good one.  It makes *p free but not *(p+1), for example.
I would rather have seen the call stmt's actual argument list
being considered.
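
To make the *p versus *(p+1) remark concrete, here are two hypothetical callees that differ only in the address they load from. As described above, the heuristic would treat the load in `first` as likely eliminated by inlining but not the one in `second`, although both are equally cheap for a caller that passes the address of a local array. The function names and the `demo` helper are made up for illustration.

```c
#include <assert.h>

/* Loads directly through the parameter: *p.  */
int
first (const int *p)
{
  return *p;
}

/* Loads through an address computed from the parameter: *(p + 1).  */
int
second (const int *p)
{
  return *(p + 1);
}

/* A caller for which both loads could fold away after inlining.  */
void
demo (void)
{
  int a[2] = { 7, 9 };

  assert (first (a) == 7);
  assert (second (a) == 9);
}
```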

There are btw. some bugs wrt accounting of functions called once
being inlined in 4.5 which were fixed on trunk which allow extra
inlining.  See

2010-04-13  Jan Hubicka  j...@suse.cz

* ipa-inline.c (cgraph_mark_inline_edge): Avoid double accounting
of optimized out static functions.
...

Richard.