Diego Novillo wrote:
On Tue, Jun 3, 2008 at 22:26, Chris Lattner <[EMAIL PROTECTED]> wrote:
and whopr here. Is LTO the mode "normal people" will use, and WHOPR the
mode that "people with huge clusters" will use? Will LTO/WHOPR support
useful optimization on common multicore machines?
As Ollie said, WHOPR is just an extension of the LTO framework to
cater for scalability when building large applications. As such, when
building large applications we expect not to be able to apply IPA
passes that rely on having the whole-program call graph and all
function bodies loaded in memory.
However, WHOPR does not limit IPA passes to summary-only analysis.
That's why you see the distinction between IPA_PASS and
SIMPLE_IPA_PASS in the pass manager.
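To make the distinction above concrete, here is a small sketch. The names and the toy "IR" (a function is just its list of callees) are invented for illustration; this is not GCC's actual pass-manager API, in which IPA_PASS and SIMPLE_IPA_PASS are C structures with summary generation/read/write hooks.

```python
# Illustrative sketch only: function names and data shapes are invented,
# not taken from GCC. The point is the contrast between a pass that
# needs every body resident and one that works from small summaries.

def simple_ipa_pass(bodies):
    """A 'simple' IPA pass: walks actual bodies, so the whole program
    must be loaded. bodies: dict fn name -> list of callee names."""
    reachable, worklist = set(), ["main"]
    while worklist:
        fn = worklist.pop()
        if fn in reachable:
            continue
        reachable.add(fn)
        worklist.extend(bodies.get(fn, []))
    return reachable

def summarize(bodies):
    """Per-function summaries recorded at compile time: only the call
    edges survive, not the bodies themselves."""
    return {fn: {"calls": list(callees)} for fn, callees in bodies.items()}

def summary_ipa_pass(summaries):
    """The same reachability analysis, but driven entirely by summaries.
    No body ever needs to be in memory during the IPA phase, which is
    what lets a WHOPR-style build scale."""
    reachable, worklist = set(), ["main"]
    while worklist:
        fn = worklist.pop()
        if fn in reachable:
            continue
        reachable.add(fn)
        worklist.extend(summaries.get(fn, {"calls": []})["calls"])
    return reachable
```

For a pass like this, the two variants compute the same answer; the difference is only in what has to be resident while they run.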
Are you focusing on inlining here as a specific example, or is this the only
planned IPA optimization that can use summaries? It seems unfortunate to
No. Inlining is just the first pass that we are going to concentrate
on for the initial implementation.
I think that one thing that the GCC community should understand is
that, to a great extent, WHOPR is a Google thing. All of the documents
are drafted by Google people, in meetings that are only open to Google
people, and it is only after these documents have been drafted that the
people outside of Google who are working on LTO, like Honza and
myself, see the documents and get to comment. The GCC community never
sees the constraints, deadlines, needs, or benchmarks that are
motivating the decisions made in the WHOPR documents.
Honza and I plan, and are implementing, a system where most, but
probably not all, of the IPA passes will be able to work in an
environment where the entire call graph and all of the decls and types
are available, i.e. only the function bodies are missing. In this
environment, we plan to do all of the interprocedural analysis and
generate work orders that will be applied to each function.
In a distributed environment, these "work orders" can then be streamed
out to the machines that are actually going to read the function bodies
and compile them.
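A minimal sketch of that work-order model, with everything hypothetical: the record format, the size threshold, and the use of JSON as a stand-in wire format are all invented for illustration, not taken from the actual LTO/WHOPR implementation.

```python
import json

# Hypothetical sketch of the "work order" model: the IPA phase decides
# from summaries alone (call graph + per-function size estimates) which
# call sites to inline, and emits one order per function. The orders
# could then be streamed to whichever machine holds that body.

def plan_inlining(call_graph, sizes, max_inline_size=20):
    """call_graph: dict caller -> list of callees.
    sizes: dict function -> estimated body size (a summary datum).
    Returns one work order per caller; no function body is consulted."""
    orders = []
    for caller, callees in call_graph.items():
        inline = [c for c in callees if sizes.get(c, 0) <= max_inline_size]
        orders.append({"function": caller, "inline": inline})
    return orders

def serialize(orders):
    """JSON stands in for the real on-the-wire format; each record
    would be shipped to the worker that compiles that function."""
    return [json.dumps(o) for o in orders]
```

The design point this illustrates is that the expensive global decision-making happens once, centrally, over compact summaries, while the per-function transformation and code generation can be farmed out.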
It is certainly not going to be possible to do this for all IPA
passes; in particular, any pass that requires the function body to be
reanalyzed as part of the analysis phase will not be done, or will be
degraded so that it does not use this mechanism. But for a large
number of passes this will work.
How this scales to Google-sized applications remains to be seen. The
point is that there is a rich space with a complex set of tradeoffs to
be explored with LTO. The decision to farm off the function bodies to
other processors because we "cannot" have all of the function bodies
in memory will have a dramatic effect on what GCC/LTO/WHOPR
compilation will be able to achieve. We did not make this decision
just because GCC is fat; we made it because we wanted to be able to
compile larger programs that could not fit into memory even if we did
go on a real diet.
However, in other LTO systems, like IBM's and (I believe) LLVM's,
where the link-time compilation is done with everything in memory, you
can do a lot more transformation, because you can iterate and propagate
information discovered from improvements in one function to another.
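The iteration being described is a fixed-point computation across function boundaries. Here is a toy sketch of it using interprocedural constant propagation; the miniature "IR" (a function either returns a constant or forwards another function's result) is invented for illustration.

```python
# Sketch of why having every body in memory helps: a fact proven in one
# function can improve the analysis of another, so the whole analysis
# is re-run until nothing changes (a fixed point). The tiny IR here is
# hypothetical: fn -> ("const", value) or ("call", other_fn).

def propagate_constants(defs):
    """Iterate until no function's known return value changes."""
    known = {}
    changed = True
    while changed:
        changed = False
        for fn, (kind, arg) in defs.items():
            if fn in known:
                continue
            if kind == "const":
                known[fn] = arg
                changed = True
            elif kind == "call" and arg in known:
                # Cross-function propagation: this step is exactly the
                # iteration that an in-memory link-time compiler can
                # afford and a summary-only phase cannot easily repeat.
                known[fn] = known[arg]
                changed = True
    return known
```

In a one-shot, summary-driven model, a chain like this would only resolve as deeply as the summaries happened to capture; with everything resident, the loop simply runs until it stabilizes.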
IBM seems to sell 64-processor machines with up to 28tb of memory. I
do not know whether they can compile all of DB2 at one time on such a
box; the last time I talked to them, a year ago, they could not (or at
least did not) compile all of DB2 at one time. But they are able to do
several rounds that consist of global analysis and local
analysis/transformation. This is certainly the way to squeeze out
everything that static compilation has to offer. However, it is
unlikely that many in the GCC community are going to have this kind of
horsepower available (balrog is a toy compared to one of these
monsters).
The bet (guess) that we are making in GCC is that doing weaker
analysis over a larger context is going to win. In the initial WHOPR
proposal/implementation, this is taken to the extreme: inlining is the
only IPA transformation, but it is going to be applied to the entire
code base of some monster app. The rest of the GCC community may not
see the need to go here, and in fact I would guess (an uninformed
guess from an outsider) that even Google will not need this for all of
their apps either. In particular, as consumer machines get larger
memories and more processors, the assumption that we cannot see all of
the function bodies gets more questionable, especially for the
modest-sized apps that are the staple of the GCC community.
In particular, Google may be willing to compile the "entire" app, even
sucking in the code from shared libraries if it provides any benefit.
Users in the GCC community will rarely go there, since it makes the
process of doing updates almost impossible.
There is also a rich set of choices that need to be made to support
distributed compilation. I think that it is dangerous to depend so
heavily on the experience of distcc (on the other hand, I realize that
NFS is a real problem). For one, the size of the merged type system
and global decls is going to be quite large. Cutting this down to
produce a custom compilable file for each of the processors is not
going to be without cost either. I fear that for very large programs,
the cost of doing this and the cherry picking is going to severely
limit the scalability of LTO/WHOPR.
Furthermore, as nodes move to having more cores, the cherry-picking
model begins to look bad, because each of the machines could have
copies of all of the hot inlinable functions in its file cache. The
bottom line is that there is no single best solution here, because the
ground is shifting. On the other hand, NFS is still bad and is likely
to remain that way.
kenny