On 10/11/2014 10:19 AM, Jan Hubicka wrote:
After a few days of measurement and tuning, I was able to get the numbers
into the following shape:
Execution times (seconds)
phase setup             :  0.00 ( 0%) usr  0.00 ( 0%) sys  0.00 ( 0%) wall    1412 kB ( 0%) ggc
phase opt and generate  : 27.83 (59%) usr  0.66 (19%) sys 28.52 (37%) wall 1028813 kB (24%) ggc
phase stream in         : 16.90 (36%) usr  0.63 (18%) sys 17.60 (23%) wall 3246453 kB (76%) ggc
phase stream out        :  2.76 ( 6%) usr  2.19 (63%) sys 31.34 (40%) wall       2 kB ( 0%) ggc
callgraph optimization  :  0.36 ( 1%) usr  0.00 ( 0%) sys  0.35 ( 0%) wall      40 kB ( 0%) ggc
ipa dead code removal   :  3.31 ( 7%) usr  0.01 ( 0%) sys  3.25 ( 4%) wall       0 kB ( 0%) ggc
ipa virtual call target :  3.69 ( 8%) usr  0.03 ( 1%) sys  3.80 ( 5%) wall      21 kB ( 0%) ggc
ipa devirtualization    :  0.12 ( 0%) usr  0.00 ( 0%) sys  0.15 ( 0%) wall   13704 kB ( 0%) ggc
ipa cp                  :  1.11 ( 2%) usr  0.07 ( 2%) sys  1.17 ( 2%) wall  188558 kB ( 4%) ggc
ipa inlining heuristics :  8.17 (17%) usr  0.14 ( 4%) sys  8.27 (11%) wall  494738 kB (12%) ggc
ipa comdats             :  0.12 ( 0%) usr  0.00 ( 0%) sys  0.12 ( 0%) wall       0 kB ( 0%) ggc
ipa lto gimple in       :  1.86 ( 4%) usr  0.40 (11%) sys  2.20 ( 3%) wall  537970 kB (13%) ggc
ipa lto gimple out      :  0.19 ( 0%) usr  0.08 ( 2%) sys  0.27 ( 0%) wall       2 kB ( 0%) ggc
ipa lto decl in         : 12.20 (26%) usr  0.37 (11%) sys 12.64 (16%) wall 2441687 kB (57%) ggc
ipa lto decl out        :  2.51 ( 5%) usr  0.21 ( 6%) sys  2.71 ( 3%) wall       0 kB ( 0%) ggc
ipa lto constructors in :  0.13 ( 0%) usr  0.02 ( 1%) sys  0.17 ( 0%) wall   15692 kB ( 0%) ggc
ipa lto constructors out:  0.03 ( 0%) usr  0.00 ( 0%) sys  0.03 ( 0%) wall       0 kB ( 0%) ggc
ipa lto cgraph I/O      :  0.54 ( 1%) usr  0.09 ( 3%) sys  0.63 ( 1%) wall  407182 kB (10%) ggc
ipa lto decl merge      :  1.34 ( 3%) usr  0.00 ( 0%) sys  1.34 ( 2%) wall    8220 kB ( 0%) ggc
ipa lto cgraph merge    :  1.00 ( 2%) usr  0.00 ( 0%) sys  1.00 ( 1%) wall   14605 kB ( 0%) ggc
whopr wpa               :  0.92 ( 2%) usr  0.00 ( 0%) sys  0.89 ( 1%) wall       1 kB ( 0%) ggc
whopr wpa I/O           :  0.01 ( 0%) usr  1.90 (55%) sys 28.31 (37%) wall       0 kB ( 0%) ggc
whopr partitioning      :  2.81 ( 6%) usr  0.01 ( 0%) sys  2.83 ( 4%) wall    4943 kB ( 0%) ggc
ipa reference           :  1.34 ( 3%) usr  0.00 ( 0%) sys  1.35 ( 2%) wall       0 kB ( 0%) ggc
ipa profile             :  0.20 ( 0%) usr  0.01 ( 0%) sys  0.21 ( 0%) wall       0 kB ( 0%) ggc
ipa pure const          :  1.62 ( 3%) usr  0.00 ( 0%) sys  1.63 ( 2%) wall       0 kB ( 0%) ggc
ipa icf                 :  2.65 ( 6%) usr  0.02 ( 1%) sys  2.68 ( 3%) wall    1352 kB ( 0%) ggc
inline parameters       :  0.00 ( 0%) usr  0.01 ( 0%) sys  0.00 ( 0%) wall       0 kB ( 0%) ggc
tree SSA rewrite        :  0.11 ( 0%) usr  0.01 ( 0%) sys  0.08 ( 0%) wall   18919 kB ( 0%) ggc
tree SSA other          :  0.01 ( 0%) usr  0.00 ( 0%) sys  0.01 ( 0%) wall       0 kB ( 0%) ggc
tree SSA incremental    :  0.24 ( 1%) usr  0.01 ( 0%) sys  0.32 ( 0%) wall   11325 kB ( 0%) ggc
tree operand scan       :  0.15 ( 0%) usr  0.02 ( 1%) sys  0.18 ( 0%) wall  116283 kB ( 3%) ggc
dominance frontiers     :  0.01 ( 0%) usr  0.00 ( 0%) sys  0.02 ( 0%) wall       0 kB ( 0%) ggc
dominance computation   :  0.13 ( 0%) usr  0.01 ( 0%) sys  0.16 ( 0%) wall       0 kB ( 0%) ggc
varconst                :  0.01 ( 0%) usr  0.02 ( 1%) sys  0.01 ( 0%) wall       0 kB ( 0%) ggc
loop fini               :  0.02 ( 0%) usr  0.00 ( 0%) sys  0.04 ( 0%) wall       0 kB ( 0%) ggc
unaccounted todo        :  0.55 ( 1%) usr  0.00 ( 0%) sys  0.56 ( 1%) wall       0 kB ( 0%) ggc
TOTAL                   : 47.49            3.48           77.46            4276682 kB
and I was able to reduce the function bodies loaded in WPA to 35% (from the
previous 55%). The main problem
35% means that 35% of all function bodies are compared with something else?
That feels pretty high.
but overall numbers are not so terrible.
Currently, the pass is able to merge 32K functions. As you know, we group
functions into so-called classes. According to the stats, the average
non-singular class contains 7.39 candidates at the end of comparison, and we
have 5K such classes. Because we load the body of each candidate in such
groups, this gives a minimum number of loaded bodies of about 37K
(5K x 7.39). As we load 70K functions, we still have room to improve. But I
guess the WPA body-less comparison is quite efficient.
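To make the estimate above concrete, here is a minimal sketch of the
arithmetic, written in Python for brevity. The function name and the toy
class layout are hypothetical and not taken from the IPA-ICF sources. The
sketch only assumes that every member of a non-singular class needs its body
loaded at least once.

```python
def min_bodies_to_load(classes):
    """Lower bound on loaded bodies: every member of a class with more
    than one candidate must have its body loaded at least once."""
    return sum(len(members) for members in classes if len(members) > 1)


# Toy input (hypothetical): two non-singular classes and two singletons.
toy_classes = [["f1", "f2", "f3"], ["g1", "g2"], ["h"], ["k"]]
toy_bound = min_bodies_to_load(toy_classes)  # 3 + 2 members

# Figures quoted in this mail: about 5K non-singular classes averaging
# 7.39 candidates each, versus about 70K bodies actually loaded.
estimated_bound = 5_000 * 7.39             # roughly 37K bodies
loaded = 70_000
overhead_ratio = loaded / estimated_bound  # a bit under 2x the lower bound
```

So the current implementation loads a bit less than twice the theoretical
minimum, which is the room for improvement mentioned above.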
with speed was hidden in the work list for congruence classes, where hash_set
was used. I chose that data structure to support the delete operation, but it
was really slow. Thus, hash_set was replaced with a linked list, and a flag is
used to identify whether a set is removed or not.
Interesting, I would not expect a bottleneck in congruence solving :)
The problem was just the hash_set, which turned out to be a slow data
structure for the set of operations needed in congruence solving.
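The idea of a list plus a "removed" flag can be illustrated with a small
sketch. This is not the code from the patch: it is written in Python rather
than GCC's C++, a deque stands in for the linked list, and the names
CongruenceClass, Worklist and in_worklist are hypothetical.

```python
from collections import deque


class CongruenceClass:
    """Hypothetical stand-in for a congruence class kept in the work list."""

    def __init__(self, name):
        self.name = name
        self.in_worklist = False  # flag consulted instead of deleting


class Worklist:
    """FIFO work list with lazy removal via a per-class flag."""

    def __init__(self):
        self._queue = deque()

    def push(self, cls):
        # Do not queue a class twice while it is still pending.
        if not cls.in_worklist:
            cls.in_worklist = True
            self._queue.append(cls)

    def remove(self, cls):
        # Constant time: only clear the flag; the stale entry stays in
        # the queue and is skipped later by pop().
        cls.in_worklist = False

    def pop(self):
        # Skip entries whose flag was cleared after they were queued.
        while self._queue:
            cls = self._queue.popleft()
            if cls.in_worklist:
                cls.in_worklist = False
                return cls
        return None
```

The trade-off in this sketch is that removed entries linger in the queue
until they are popped, in exchange for a removal that needs no hashing and
no search.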
I have no clue how complicated it would be to implement the release_body
function as an operation that really releases the memory.
I suppose one can keep the caches from the streamer and free the trees read.
Freeing gimple statements and the cfg should be relatively easy.
Let's however first try to tune the implementation rather than try to get this
hack implemented. Explicit ggc_free calls traditionally tended to cause some
negative reactions wrt memory fragmentation concerns.
Agree with suggested approach.
Markus' problem with -fprofile-use has been resolved; IPA-ICF now precedes the
devirtualization pass. I hope that is fine?
Yes, I think devirtualization should actually work better with identical
virtual methods merged. We just need to be sure it sees through the newly
introduced aliases (there should be no thunks for virtual methods).
Thanks,
Martin
Honza