Diego Novillo wrote:
On Tue, Jun 3, 2008 at 22:26, Chris Lattner <[EMAIL PROTECTED]> wrote:

and whopr here.  Is LTO the mode "normal people" will use, and is whopr
the mode that "people with huge clusters" will use?  Will LTO/whopr
support useful optimization on common multicore machines?

As Ollie said, WHOPR is just an extension of the LTO framework to
cater for scalability when building large applications.  In that
setting we expect not to be able to apply IPA passes that rely on
having the whole-program callgraph and all function bodies loaded
in memory.

However, WHOPR does not limit IPA passes to summary-only.  That's why
you see the distinction between IPA_PASS and SIMPLE_IPA_PASS in the
pass manager.
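
(To make that distinction concrete, here is a rough C sketch of the two
pass shapes.  The struct and hook names are illustrative only, not the
actual GCC pass-manager declarations.)

struct cgraph_node;   /* GCC's callgraph node type, used opaquely here */

/* A "simple" IPA pass has a single hook and needs every function body
   resident in memory when it runs.  */
struct simple_ipa_pass_sketch
{
  const char *name;
  void (*execute) (void);                    /* all bodies in memory */
};

/* A summary-based IPA pass splits its work: per-function summary
   generation (bodies read one at a time), a whole-program analysis
   that sees only the summaries and the callgraph, and a per-function
   transform that can run much later, possibly on another machine.  */
struct ipa_pass_sketch
{
  const char *name;
  void (*generate_summary) (struct cgraph_node *);
  void (*execute) (void);
  void (*transform) (struct cgraph_node *);
};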

Are you focusing on inlining here as a specific example, or is this the only
planned IPA optimization that can use summaries?  It seems unfortunate to

No.  It is just the first pass that we are going to concentrate on
for the initial implementation.


I think that one thing the GCC community should understand is that, to a great extent, WHOPR is a Google thing. All of the documents are drafted by Google people, in meetings that are only open to Google people, and it is only after these documents have been drafted that the people outside of Google who are working on LTO, like Honza and myself, see them and get to comment. The GCC community never sees the constraints, deadlines, needs, or benchmarks that motivate the decisions made in the WHOPR documents.

Honza and I are planning, and are implementing, a system where most, but probably not all, of the IPA passes will be able to work in an environment where the entire call graph and all of the decls and types are available, i.e. only the function bodies are missing. In this environment, we plan to do all of the interprocedural analysis and generate work orders that will be applied to each function. In a distributed environment, these "work orders" can then be streamed out to the machines that are actually going to read the function bodies and compile them. It is certainly not going to be possible to do this for all IPA passes; in particular, any pass that requires the function body to be reanalyzed as part of the analysis phase will not be done, or will be degraded so that it does not use this mechanism. But for a large number of passes this will work.
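
As a purely hypothetical illustration of the shape of such a system (none of these type or function names are the real GCC/LTO implementation), the link-time step would look roughly like this:

#include <stddef.h>

/* Hypothetical "work order": the decisions the whole-program IPA made
   about one function, to be shipped to whichever machine compiles it.  */
struct work_order
{
  const char *function_name;     /* which body the worker must read    */
  int inline_callee_count;       /* callees the whole-program inliner  */
  const char **inline_callees;   /* decided should be inlined here     */
};

/* Placeholders for the actual IPA analyses (inlining, cloning, ...).  */
static void decide_inlining (struct work_order *wo) { (void) wo; }
static void decide_cloning  (struct work_order *wo) { (void) wo; }

/* Link-time driver: runs over the callgraph, decls and types only;
   no function body is ever loaded at this stage.  */
static void
ipa_plan (struct work_order *orders, size_t n_functions)
{
  for (size_t i = 0; i < n_functions; i++)
    {
      decide_inlining (&orders[i]);
      decide_cloning (&orders[i]);
      /* The filled-in order is then streamed to the machine that will
         read this function's body and compile it.  */
    }
}

The point is that the analysis side never touches a body; only the worker that receives the order does.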

How this scales to Google-sized applications remains to be seen. The point is that there is a rich space with a complex set of tradeoffs to be explored with LTO. The decision to farm off the function bodies to other processors because we "cannot" have all of the function bodies in memory will have a dramatic effect on what GCC/LTO/WHOPR compilation will be able to achieve. We did not make this decision just because GCC is fat; we made it because we wanted to be able to compile programs larger than could fit into memory even if we did go on a real diet.

However, in other LTO systems, like IBM's and (I believe) LLVM's, where the link-time compilation is done with everything in memory, you can do a lot more transformation, because you can iterate and propagate information discovered from improvements in one function to another. IBM seems to sell 64-processor machines with up to 28TB of memory. I do not know whether they can compile all of DB2 at one time on such a box; the last time I talked to them, a year ago, they could not (or at least did not) compile all of DB2 at one time. But they are able to do several rounds consisting of global analysis followed by local analysis/transformation. This is certainly the way to squeeze out everything that static compilation has to offer. However, it is unlikely that many in the GCC community are going to have this kind of horsepower available (balrog is a toy compared to one of these monsters).
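
For contrast, here is a rough sketch of that round-based, everything-in-memory model (again, illustrative names only, not IBM's or LLVM's actual code):

/* Everything-in-memory model: all bodies are resident, so global
   analysis and per-function transformation can alternate until a
   fixed point is reached.  */
struct whole_program { int num_functions; };   /* placeholder */

/* Whole-program analysis; returns nonzero if it learned new facts.  */
static int analyze_whole_program (struct whole_program *wp)
{ (void) wp; return 0; }

/* Transform each body in place; returns nonzero if any code changed.  */
static int transform_each_function (struct whole_program *wp)
{ (void) wp; return 0; }

static void
iterate_to_fixpoint (struct whole_program *wp)
{
  int changed;
  do
    {
      /* A fact discovered while improving one function is visible to
         the next round of global analysis, so the rounds feed each
         other -- exactly what the streamed work-order model gives up.  */
      changed = analyze_whole_program (wp);
      changed |= transform_each_function (wp);
    }
  while (changed);
}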

The bet (guess) that we are making in GCC is that doing weaker analysis over a larger context is going to win. In the initial WHOPR proposal/implementation this is taken to the extreme: inlining is the only IPA transformation, but it is going to be applied to the entire code base of some monster app. The rest of the GCC community may not see the need to go here, and in fact I would guess (an uninformed guess from an outsider) that even Google will not need this for all of their apps either. In particular, as consumer machines get larger memories and more processors, the assumption that we cannot see all of the function bodies gets more questionable, especially for the modest-sized apps that are the staple of the GCC community.

In particular, Google may be willing to compile the "entire" app, even sucking in the code from shared libraries, if it provides any benefit. Users in the GCC community will most likely rarely go there, since it makes the process of doing updates almost impossible. There is also a rich set of choices that needs to be made to support distributed compilation. I think that it is dangerous to depend so heavily on the experience of distcc (on the other hand, I realize that NFS is a real problem). For one, the size of the merged type system and global decls is going to be quite large. Cutting this down to produce a custom compilable file for each of the processors is not going to be without cost either. I fear that for very large programs, the cost of doing this, and of the cherry-picking, is going to severely limit the scalability of LTO/WHOPR.

Furthermore, as nodes move to having more cores, the cherry-picking model begins to look bad, because each of the machines could have copies of all of the hot inlinable functions in its file cache. The bottom line is that there is no single best solution here, because the ground is shifting. On the other hand, NFS is still bad and is likely to remain that way.

kenny
