Diego Novillo wrote:
On Tue, Jun 3, 2008 at 22:26, Chris Lattner <[EMAIL PROTECTED]> wrote:
and whopr here. Is LTO the mode "normal people" will use, and WHOPR the
mode that "people with huge clusters" will use? Will LTO/WHOPR support
useful optimization on common multicore machines?
As Ollie said, WHOPR is just an extension of the LTO framework to
cater for scalability when building large applications. As such, when
building large applications we expect not to be able to apply IPA
passes that rely on having the whole-program call graph and all
function bodies loaded in memory.
However, WHOPR does not limit IPA passes to summary-only analysis.
That's why you see the distinction between IPA_PASS and
SIMPLE_IPA_PASS in the pass manager.
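To make the distinction above concrete, here is a small sketch. The names and the toy "IR" (a function is just its list of callees) are invented for illustration; this is not GCC's actual pass-manager API, in which IPA_PASS and SIMPLE_IPA_PASS are C structures with summary generation/read/write hooks.

```python
# Illustrative sketch only: function names and data shapes are invented,
# not taken from GCC. The point is the contrast between a pass that
# needs every body resident and one that works from small summaries.

def simple_ipa_pass(bodies):
    """A 'simple' IPA pass: walks actual bodies, so the whole program
    must be loaded. bodies: dict fn name -> list of callee names."""
    reachable, worklist = set(), ["main"]
    while worklist:
        fn = worklist.pop()
        if fn in reachable:
            continue
        reachable.add(fn)
        worklist.extend(bodies.get(fn, []))
    return reachable

def summarize(bodies):
    """Per-function summaries recorded at compile time: only the call
    edges survive, not the bodies themselves."""
    return {fn: {"calls": list(callees)} for fn, callees in bodies.items()}

def summary_ipa_pass(summaries):
    """The same reachability analysis, but driven entirely by summaries.
    No body ever needs to be in memory during the IPA phase, which is
    what lets a WHOPR-style build scale."""
    reachable, worklist = set(), ["main"]
    while worklist:
        fn = worklist.pop()
        if fn in reachable:
            continue
        reachable.add(fn)
        worklist.extend(summaries.get(fn, {"calls": []})["calls"])
    return reachable
```

For a pass like this, the two variants compute the same answer; the difference is only in what has to be resident while they run.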
Are you focusing on inlining here as a specific example, or is this the only
planned IPA optimization that can use summaries? It seems unfortunate to
No. Inlining is just the first pass that we are going to concentrate
on for the initial implementation.
I think that one thing that the GCC community should understand is
that, to a great extent, WHOPR is a Google thing. All of the documents
are drafted by Google people, in meetings that are only open to Google
people, and it is only after these documents have been drafted that the
people outside of Google who are working on LTO, like Honza and
myself, see the documents and get to comment. The GCC community never
sees the constraints, deadlines, needs, or benchmarks that are
motivating the decisions made in the WHOPR documents.
Honza and I plan, and are implementing, a system where most, but
probably not all, of the IPA passes will be able to work in an
environment where the entire call graph and all of the decls and types
are available, i.e. only the function bodies are missing. In this
environment, we plan to do all of the interprocedural analysis and
generate work orders that will be applied to each function.
In a distributed environment, these "work orders" can then be streamed
out to the machines that are actually going to read the function bodies
and compile them.
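A minimal sketch of that work-order model, with everything hypothetical: the record format, the size threshold, and the use of JSON as a stand-in wire format are all invented for illustration, not taken from the actual LTO/WHOPR implementation.

```python
import json

# Hypothetical sketch of the "work order" model: the IPA phase decides
# from summaries alone (call graph + per-function size estimates) which
# call sites to inline, and emits one order per function. The orders
# could then be streamed to whichever machine holds that body.

def plan_inlining(call_graph, sizes, max_inline_size=20):
    """call_graph: dict caller -> list of callees.
    sizes: dict function -> estimated body size (a summary datum).
    Returns one work order per caller; no function body is consulted."""
    orders = []
    for caller, callees in call_graph.items():
        inline = [c for c in callees if sizes.get(c, 0) <= max_inline_size]
        orders.append({"function": caller, "inline": inline})
    return orders

def serialize(orders):
    """JSON stands in for the real on-the-wire format; each record
    would be shipped to the worker that compiles that function."""
    return [json.dumps(o) for o in orders]
```

The design point this illustrates is that the expensive global decision-making happens once, centrally, over compact summaries, while the per-function transformation and code generation can be farmed out.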
It is certainly not going to be possible to do this for all IPA
passes; in particular, any pass that requires the function body to be
reanalyzed as part of the analysis phase will not be done, or will be
degraded so that it does not use this mechanism. But for a large
number of passes this will work.
How this scales to Google-sized applications remains to be seen. The
point is that there is a rich space with a complex set of tradeoffs to
be explored with LTO. The decision to farm off the function bodies to
other processors because we "cannot" have all of the function bodies
in memory will have a dramatic effect on what GCC/LTO/WHOPR
compilation will be able to achieve. We did not make this decision
just because GCC is fat; we made it because we wanted to be able to
compile larger programs that could not fit into memory even if we did
go on a real diet.
However, in other LTO systems, like IBM's and (I believe) LLVM's,
where the link-time compilation is done with everything in memory, you
can do a lot more transformation, because you can iterate and propagate
information discovered from improvements in one function to another.
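The iteration being described is a fixed-point computation across function boundaries. Here is a toy sketch of it using interprocedural constant propagation; the miniature "IR" (a function either returns a constant or forwards another function's result) is invented for illustration.

```python
# Sketch of why having every body in memory helps: a fact proven in one
# function can improve the analysis of another, so the whole analysis
# is re-run until nothing changes (a fixed point). The tiny IR here is
# hypothetical: fn -> ("const", value) or ("call", other_fn).

def propagate_constants(defs):
    """Iterate until no function's known return value changes."""
    known = {}
    changed = True
    while changed:
        changed = False
        for fn, (kind, arg) in defs.items():
            if fn in known:
                continue
            if kind == "const":
                known[fn] = arg
                changed = True
            elif kind == "call" and arg in known:
                # Cross-function propagation: this step is exactly the
                # iteration that an in-memory link-time compiler can
                # afford and a summary-only phase cannot easily repeat.
                known[fn] = known[arg]
                changed = True
    return known
```

In a one-shot, summary-driven model, a chain like this would only resolve as deeply as the summaries happened to capture; with everything resident, the loop simply runs until it stabilizes.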
IBM seems to sell 64-processor machines with up to 28tb of memory. I
do not know whether they can compile all of DB2 at one time on such a
box; the last time I talked to them, a year ago, they could not (or at
least did not) compile all of DB2 at one time. But they are able to do
several rounds that consist of global analysis and local
analysis/transformation. This is certainly the way to squeeze out
everything that static compilation has to offer. However, it is
unlikely that many in the GCC community are going to have this kind of
horsepower available (balrog is a toy compared to one of these
monsters).
The bet (guess) that we are making in GCC is that doing weaker
analysis over a larger context is going to win. In the initial WHOPR
proposal/implementation, this is taken to the extreme: inlining is the
only IPA transformation, but it is going to be applied to the entire
code base of some monster app. The rest of the GCC community may not
see the need to go here, and in fact I would guess (an uninformed
guess from an outsider) that even Google will not need this for all of
their apps either. In particular, as consumer machines get larger
memories and more processors, the assumption that we cannot see all of
the function bodies gets more questionable, especially for the
modest-sized apps that are the staple of the GCC community.
In particular, Google may be willing to compile the "entire" app, even
sucking in the code from shared libraries if it provides any benefit.
Users in the GCC community will rarely go there, since it makes the
process of doing updates almost impossible.
There is also a rich set of choices that need to be made to support
distributed compilation. I think that it is dangerous to depend so
heavily on the experience of distcc (on the other hand, I realize that
NFS is a real problem). For one, the size of the merged type system
and global decls is going to be quite large. Cutting this down to
produce a custom compilable file for each of the processors is not
going to be without cost either. I fear that for very large programs,
the cost of doing this and the cherry picking is going to severely
limit the scalability of LTO/WHOPR.
Furthermore, as nodes move to having more cores, the cherry-picking
model begins to look bad, because each of the machines could have
copies of all of the hot inlinable functions in its file cache. The
bottom line is that there is no single best solution here, because the
ground is shifting. On the other hand, NFS is still bad and is likely
to remain that way.
kenny