I thought I'd begin a discussion about how to improve the performance of
the rustc compiler.
First off, I do not see any way we can improve *general* compiler
performance by an order of magnitude. The language is simply designed to
favor safety and expressivity over compile-time performance. Other than
code generation, typechecking is by far the longest pass when compiling
large Rust programs. But there is an upper bound on how fast we can make
typechecking, because the language requires subtyping, generics, and a
variant of Hindley-Milner type inference. This means that these common
tricks cannot be used:
1. Fast C-like typechecking won't work, because we need to solve for
type variables. For instance, the type of `let x = [];` or `let y =
None;` is determined from later use, unlike in, say, C++, Java, C#, or
Go. (See the sketch after this list.)
2. Fast ML-like "type equality can be determined with a pointer
comparison" tricks will not work, because we have subtyping and must
recurse on the type structure to unify.
3. Nominal types in general cannot be represented as a simple integer
"class ID", as in early Java. They require a heap-allocated vector to
represent the type parameter substitutions.
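To make point 1 concrete, here is a minimal sketch of inference being
driven by later uses. (Purely for illustration, I've written it with
library types spelled `Vec` and `Option`; the literal forms above behave
the same way.)

```rust
fn main() {
    // At this `let`, the element type of `v` is an unsolved type
    // variable; a single-pass, C-like checker has nothing to assign here.
    let mut v = Vec::new();
    v.push(1u32); // this later use solves it: `v` is Vec<u32>

    // Likewise, `None` alone does not determine the payload type; the
    // `Some` in the branch below does.
    let mut y = None;
    if v.len() == 1 {
        y = Some("one"); // `y` is Option<&str>
    }
    println!("{:?} {:?}", v, y);
}
```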
In general, the low-hanging fruit for general compiler performance is
mostly picked at this point. I would put an upper bound of 20% or so on
the performance improvements available across all stages of a
self-hosted build of the Rust compiler. The reasons for this are:
1. Typechecking and LLVM code generation are mostly optimal. When
compiling `rustc`, the time spent in these two passes dwarfs all the
others. Typechecking cannot be algorithmically improved, and LLVM code
generation is about as straightforward as it can possibly be. The
remaining performance issues in these two passes are generally due to
allocating too much, but allocation and freeing account for no more than
15% of the compile time. Thus, even if we spent all our time on the
allocator and got its cost down to a theoretical zero, we would only
improve performance by 15% or so.
2. LLVM optimizations end up dominating compile time when they're turned
on (about 75% of the total). However, the Rust compiler, like most Rust
(or C++) code, depends on LLVM optimizations for good performance. So if
you turn off optimization, you have a slow compiler; but if you turn it
on, the vast majority of your self-hosting time is spent in LLVM
optimizations. The obvious way around this catch-22 is to spend a lot of
time reimplementing, in our own compiler, the optimizations that LLVM
would have performed, in order to improve performance at -O0. But I
don't think that's a particularly good use of our time, and it would
hurt the compiler's maintainability.
There are, however, some more situational things we can do.
# General code generation performance
* We can make `~"string"` allocations (and some others, like `~[1, 2, 3,
4, 5]`) turn into calls to the moral equivalent of `strdup`. This
improves some workloads, such as the PGP key in cargo (which should just
be a constant in the first place). `rustc` still allocates a lot of
strings like this, so this might improve the LLVM portion of `rustc`'s
compilation speed. (See the sketch after this list.)
* Visitor glue should be optional; you should have to opt into its
generation, like Haskell's `Data.Typeable`. This would potentially
remove 15% of our code size and improve our code generation performance
by a similar amount, but, as Graydon points out, it is needed for
precise-on-the-heap GC. Perhaps we could use conservative GC at -O0, and
thus reduce the amount of visitor glue we need to generate for
unoptimized builds.
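As a sketch of the `strdup` point above (the helper name is
hypothetical, not an actual runtime entry point, and I'm using current
library types for illustration): the literal's bytes live in static
memory, so construction becomes one allocation plus one bulk copy,
rather than element-by-element stores or a runtime loop.

```rust
// Hypothetical lowering helper for boxed-literal allocations.
fn dup_bytes(literal: &'static [u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(literal.len()); // one allocation
    buf.extend_from_slice(literal);                  // one memcpy-like copy
    buf
}

fn main() {
    // Morally what `~[1, 2, 3, 4, 5]` would compile down to:
    let v = dup_bytes(&[1u8, 2, 3, 4, 5]);
    assert_eq!(v.len(), 5);
}
```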
# -O0 performance
At -O0 (which is the default), we get kicked off LLVM's fast instruction
selector (FastISel) far too often. We need to stop generating the
instructions that cause LLVM to bail out to the slow SelectionDAG path
at -O0.
This only affects -O0, but since that's the most common case that
matters for compilation speed, that's fine. Note that these
optimizations are severely limited in what they can do for self-hosting
performance, for the reasons stated above.
* Invoke instructions cause FastISel bailouts. This means that we can't
use the C++ exception infrastructure for task failure if we want fast
builds. Graydon has work on an optional return-value-based unwinding
mode that is nearing completion. I have a patch in review for a
"disable_unwinding" flag, which disables unwinding on failure; this
should be safe to turn on for libsyntax and librustc, since they have no
need to handle failure gracefully, and doing so improves -O0 LLVM
compile time by 1.9x.
* Visitor glue used to cause FastISel bailouts, but this is fixed on
`incoming`.
* Switch instructions cause FastISel bailouts. Pattern matching on enums
(and sometimes on integers) generates these, as does drop and take glue
for enums. This shouldn't be too hard to fix. (See the sketch after this
list.)
* Integer division instructions result in FastISel bailouts on x86. We
generate a lot of these because our vector lengths are stored in bytes.
We could change that, we could try to hack LLVM, or we could turn
integer divisions into function calls into libcore at -O0. (Note that
integer division turns into a function call *anyway* on ARM, since ARM
has no integer divide instruction, so I'm inclined to try the last
option. The sketch after this list shows the idea.)
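For concreteness, here is the kind of source that produces those switch
instructions, along with one possible shape for the division fallback.
This is a sketch; the helper name is hypothetical, not the actual
runtime interface.

```rust
// Pattern matching on an enum lowers to an LLVM `switch` on the
// discriminant, which is what knocks us off FastISel at -O0.
#[allow(dead_code)]
enum Shape { Point, Line, Triangle }

fn sides(s: Shape) -> u32 {
    match s {
        Shape::Point => 0,
        Shape::Line => 1,
        Shape::Triangle => 3,
    }
}

// One possible -O0 fallback for integer division: call a library helper
// (name hypothetical) instead of emitting a `div` instruction. The
// helper itself is compiled separately, with optimization, so only the
// caller needs to stay on the FastISel fast path; ARM makes a library
// call for division anyway.
#[inline(never)]
fn rt_div_uint(a: u64, b: u64) -> u64 {
    a / b
}

fn main() {
    assert_eq!(sides(Shape::Triangle), 3);
    // e.g. converting a vector length in bytes to a length in elements:
    assert_eq!(rt_div_uint(40, 8), 5);
}
```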
# Memory allocation performance
Our memory allocation is suboptimal in several ways. I do not think that
improving it will improve compiler performance as long as you aren't
already swapping, but I'll list them anyway.
* We do not have our own allocator; we just use the system malloc.
However, we need to trace all allocations in order to clean up @ cycles
on task death, so we thread every allocation onto a doubly-linked list.
The next and previous pointers are a huge waste of memory. We could fix
this by using an allocator that lets us trace allocations. (See the
sketch after this list.)
I would be surprised if fixing this had a huge impact on performance,
but maybe it would bump some allocations that were previously in higher
storage classes down into the TINY class, which generally has a fast
path in the allocator. And, of course, it would reduce swapping when
self-hosting if you don't have enough memory.
* We don't clean up @ cycles until task death. Fixing this will, in all
likelihood, worsen the compiler's performance, but its memory usage will
improve.
* ~ allocations don't really need to be linked into any list or be
traceable, *unless* they contain @ pointers, in which case they do need
to be traceable. Fixing this will improve memory usage, and performance
by a negligible amount.
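A sketch of the allocation-list overhead from the first bullet; the
field names are illustrative, not the runtime's actual layout.

```rust
// Every @-box carries a header so the task can walk all of its live
// boxes at death; `prev` and `next` are the wasted words described above.
#[allow(dead_code)]
#[repr(C)]
struct BoxHeader {
    type_desc: *const (), // type descriptor, used for tracing and drop glue
    prev: *mut BoxHeader, // doubly-linked list threading...
    next: *mut BoxHeader, // ...so every allocation can be enumerated
}

fn main() {
    // Two extra machine words per allocation, just for the list links.
    println!("link overhead per box: {} bytes",
             2 * std::mem::size_of::<*mut BoxHeader>());
}
```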
# External metadata
We currently read the metadata of external crates in its entirety during
a few phases of the compiler. This dominates the compile time of small
programs only; for larger programs such as rustc, the cost quickly
shrinks to nothing compared to the rest of the compilation. However,
since newcomers to Rust generally compile small programs, this is most
of the cost they see. It also constitutes the majority of the time our
test suite takes. Finally, it is the performance bottleneck for the REPL.
Improving this will not improve the compilation speed of self-hosting by
more than 1%. The biggest benefit of fixing it is that small programs
will appear to compile instantly, which greatly improves Rust's first
impression on those used to fast builds in other languages.
* External metadata reading takes a long time (0.3 s). I'm not sure
whether all of this is necessary, as I'm not too familiar with this pass.
* Language item collection reads all the items in external crates to
look for language items (another 0.3 s). This is silly and easy to fix:
we just add a new table to the metadata that specifies the def IDs of
the language items. (A sketch follows this list.)
* Name resolution has to read all the items in external crates (another
0.3 s). This was the easiest way to approximate the Rust 0.5 name
resolution semantics. (The actual semantics were basically
unimplementable, but this algorithm got close enough to work in
practice, usually.) With the new semantics in Rust 0.6 we should be able
to do better here and avoid reading modules until they're actually
referenced. Unfortunately, fixing this will require rewriting resolve,
which is a month's worth of work.
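A sketch of the language-item table I have in mind; the encoding shown
here is hypothetical, and the real metadata format would differ in
detail.

```rust
// One small entry per language item the crate defines, so the collector
// reads a handful of records instead of walking every item in the crate.
struct LangItemEntry {
    lang_item_index: u32, // index into the compiler's fixed language-item list
    crate_num: u32,       // def ID, part 1: the defining crate
    node_id: u32,         // def ID, part 2: the item's node in that crate
}

// Collection becomes a linear scan over the table.
fn collect(table: &[LangItemEntry], lang_items: &mut [Option<(u32, u32)>]) {
    for entry in table {
        lang_items[entry.lang_item_index as usize] =
            Some((entry.crate_num, entry.node_id));
    }
}

fn main() {
    let table = [LangItemEntry { lang_item_index: 0, crate_num: 1, node_id: 42 }];
    let mut lang_items = vec![None; 8];
    collect(&table, &mut lang_items);
    assert_eq!(lang_items[0], Some((1, 42)));
}
```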
# Stack switching
* We could run rustc with a large stack and avoid stack switching. This
is functionality we need for Servo anyway. This might improve compiler
performance by 1% or so.
None of these optimizations will improve the `rustc` self-hosting time
by anything approaching an order of magnitude. However, I think they
could have a positive impact on the experience for newcomers to Rust.
Patrick