On 10/11/2014 10:19 AM, Jan Hubicka wrote:

After a few days of measurement and tuning, I was able to get the numbers
into the following shape:
Execution times (seconds)
  phase setup             :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall    1412 kB ( 0%) ggc
  phase opt and generate  :  27.83 (59%) usr   0.66 (19%) sys  28.52 (37%) wall 1028813 kB (24%) ggc
  phase stream in         :  16.90 (36%) usr   0.63 (18%) sys  17.60 (23%) wall 3246453 kB (76%) ggc
  phase stream out        :   2.76 ( 6%) usr   2.19 (63%) sys  31.34 (40%) wall       2 kB ( 0%) ggc
  callgraph optimization  :   0.36 ( 1%) usr   0.00 ( 0%) sys   0.35 ( 0%) wall      40 kB ( 0%) ggc
  ipa dead code removal   :   3.31 ( 7%) usr   0.01 ( 0%) sys   3.25 ( 4%) wall       0 kB ( 0%) ggc
  ipa virtual call target :   3.69 ( 8%) usr   0.03 ( 1%) sys   3.80 ( 5%) wall      21 kB ( 0%) ggc
  ipa devirtualization    :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 0%) wall   13704 kB ( 0%) ggc
  ipa cp                  :   1.11 ( 2%) usr   0.07 ( 2%) sys   1.17 ( 2%) wall  188558 kB ( 4%) ggc
  ipa inlining heuristics :   8.17 (17%) usr   0.14 ( 4%) sys   8.27 (11%) wall  494738 kB (12%) ggc
  ipa comdats             :   0.12 ( 0%) usr   0.00 ( 0%) sys   0.12 ( 0%) wall       0 kB ( 0%) ggc
  ipa lto gimple in       :   1.86 ( 4%) usr   0.40 (11%) sys   2.20 ( 3%) wall  537970 kB (13%) ggc
  ipa lto gimple out      :   0.19 ( 0%) usr   0.08 ( 2%) sys   0.27 ( 0%) wall       2 kB ( 0%) ggc
  ipa lto decl in         :  12.20 (26%) usr   0.37 (11%) sys  12.64 (16%) wall 2441687 kB (57%) ggc
  ipa lto decl out        :   2.51 ( 5%) usr   0.21 ( 6%) sys   2.71 ( 3%) wall       0 kB ( 0%) ggc
  ipa lto constructors in :   0.13 ( 0%) usr   0.02 ( 1%) sys   0.17 ( 0%) wall   15692 kB ( 0%) ggc
  ipa lto constructors out:   0.03 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       0 kB ( 0%) ggc
  ipa lto cgraph I/O      :   0.54 ( 1%) usr   0.09 ( 3%) sys   0.63 ( 1%) wall  407182 kB (10%) ggc
  ipa lto decl merge      :   1.34 ( 3%) usr   0.00 ( 0%) sys   1.34 ( 2%) wall    8220 kB ( 0%) ggc
  ipa lto cgraph merge    :   1.00 ( 2%) usr   0.00 ( 0%) sys   1.00 ( 1%) wall   14605 kB ( 0%) ggc
  whopr wpa               :   0.92 ( 2%) usr   0.00 ( 0%) sys   0.89 ( 1%) wall       1 kB ( 0%) ggc
  whopr wpa I/O           :   0.01 ( 0%) usr   1.90 (55%) sys  28.31 (37%) wall       0 kB ( 0%) ggc
  whopr partitioning      :   2.81 ( 6%) usr   0.01 ( 0%) sys   2.83 ( 4%) wall    4943 kB ( 0%) ggc
  ipa reference           :   1.34 ( 3%) usr   0.00 ( 0%) sys   1.35 ( 2%) wall       0 kB ( 0%) ggc
  ipa profile             :   0.20 ( 0%) usr   0.01 ( 0%) sys   0.21 ( 0%) wall       0 kB ( 0%) ggc
  ipa pure const          :   1.62 ( 3%) usr   0.00 ( 0%) sys   1.63 ( 2%) wall       0 kB ( 0%) ggc
  ipa icf                 :   2.65 ( 6%) usr   0.02 ( 1%) sys   2.68 ( 3%) wall    1352 kB ( 0%) ggc
  inline parameters       :   0.00 ( 0%) usr   0.01 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
  tree SSA rewrite        :   0.11 ( 0%) usr   0.01 ( 0%) sys   0.08 ( 0%) wall   18919 kB ( 0%) ggc
  tree SSA other          :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
  tree SSA incremental    :   0.24 ( 1%) usr   0.01 ( 0%) sys   0.32 ( 0%) wall   11325 kB ( 0%) ggc
  tree operand scan       :   0.15 ( 0%) usr   0.02 ( 1%) sys   0.18 ( 0%) wall  116283 kB ( 3%) ggc
  dominance frontiers     :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
  dominance computation   :   0.13 ( 0%) usr   0.01 ( 0%) sys   0.16 ( 0%) wall       0 kB ( 0%) ggc
  varconst                :   0.01 ( 0%) usr   0.02 ( 1%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
  loop fini               :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.04 ( 0%) wall       0 kB ( 0%) ggc
  unaccounted todo        :   0.55 ( 1%) usr   0.00 ( 0%) sys   0.56 ( 1%) wall       0 kB ( 0%) ggc
  TOTAL                 :  47.49             3.48            77.46            4276682 kB

and I was able to reduce the function bodies loaded in WPA to 35% (from the
previous 55%). The main problem

Does 35% mean that 35% of all function bodies are compared with something
else? That feels pretty high, but the overall numbers are not so terrible.

Currently, the pass is able to merge 32K functions. As you know, we group
functions into so-called classes. According to the stats, the average
non-singular class contains 7.39 candidates at the end of comparison, and we
have 5K such classes. Because we load the body of each candidate in such
groups, this gives us a minimum number of loaded bodies: 37K. As we load 70K
functions, we still have room to improve. But I guess the WPA body-less
comparison is quite efficient.
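
To make the arithmetic behind the 37K lower bound explicit, here is a small Python sketch. It assumes that "5K" refers to the number of non-singular classes (which is what the 37K figure implies) and treats the rounded "5K" and "70K" figures from the mail as exact, so the results are only approximate. The variable names are illustrative and do not come from the GCC sources.

```python
# Figures quoted in the mail; "5K" and "70K" are rounded values.
nonsingular_classes = 5_000   # assumed: classes with more than one member
avg_class_size = 7.39         # average candidates per non-singular class
bodies_loaded = 70_000        # function bodies currently loaded in WPA

# Every candidate in a non-singular class needs its body for the final
# comparison, so this product is the minimum number of bodies to load.
min_bodies = round(nonsingular_classes * avg_class_size)

# Bodies loaded beyond that minimum are the room left for improvement.
excess = bodies_loaded - min_bodies

print(f"lower bound on loaded bodies: {min_bodies}")
print(f"loaded beyond the lower bound: {excess} ({excess / bodies_loaded:.0%})")
```

Under these assumptions, a bit under half of the bodies currently loaded go beyond the minimum.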


with speed was hidden in the work list for congruence classes, where hash_set
was used. I chose that data structure to support the delete operation, but it
was really slow. Thus, hash_set was replaced with a linked list, and a flag is
used to identify whether a set has been removed or not.

Interesting, I would not expect a bottleneck in congruence solving :)

The problem was just hash_set, which turned out to be a slow data structure
for the set of operations needed in congruence solving.
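
For readers unfamiliar with the pattern, the following is a minimal Python sketch of a work list that supports removal through a per-element flag instead of a real delete, as described above. It is only an illustration of the general technique: the actual IPA-ICF code is C++ and is not shown in this thread, and the names `CongruenceClass`, `Worklist`, and `in_worklist` are hypothetical.

```python
from collections import deque


class CongruenceClass:
    """Hypothetical stand-in for a congruence class (not the GCC type)."""

    def __init__(self, name):
        self.name = name
        self.in_worklist = False  # flag: is this class logically queued?


class Worklist:
    """FIFO work list with O(1) logical removal via a flag on each element."""

    def __init__(self):
        self._items = deque()

    def push(self, cls):
        # Queue the class only if it is not already logically present.
        if not cls.in_worklist:
            cls.in_worklist = True
            self._items.append(cls)

    def remove(self, cls):
        # No search and no unlinking: just clear the flag. The stale entry
        # stays in the list and is skipped when it reaches the front.
        cls.in_worklist = False

    def pop(self):
        # Return the next live class, discarding stale entries on the way.
        while self._items:
            cls = self._items.popleft()
            if cls.in_worklist:
                cls.in_worklist = False
                return cls
        return None
```

Pushing classes a, b, and c, removing b, and then popping yields a and c; the entry for b is dropped silently once it reaches the front. The trade-off is that removed entries keep occupying the list until they are popped.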


I have no clue how complicated it would be to turn the release_body function
into an operation that really releases the memory.

I suppose one can keep the caches from the streamer and free the trees read.
Freeing gimple statements and the cfg should be relatively easy.

Let's, however, first try to tune the implementation rather than try to get
this hack implemented. Explicit ggc_free calls have traditionally tended to
cause some negative reactions due to memory fragmentation concerns.

I agree with the suggested approach.



Markus' problem with -fprofile-use has been removed; IPA-ICF now precedes the
devirtualization pass. I hope that is fine?

Yes, I think devirtualization should actually work better with identical
virtual methods merged. We just need to be sure it sees through the newly
introduced aliases (there should be no thunks for virtual methods).

Thanks,
Martin


Honza

