I thought I'd begin a discussion about how to improve the performance of
the rustc compiler.
First off, I do not see any way we can improve *general* compiler
performance by an order of magnitude. The language is simply designed to
favor safety and expressivity over compile-time performance. Other than
code generation, typechecking is by far the longest pass when compiling
large Rust programs. But there is an upper bound on how fast we can make
typechecking, because the language requires subtyping, generics, and a
variant of Hindley-Milner type inference. This means that these common
tricks cannot be used:
1. Fast C-like typechecking won't work, because we need to solve for
type variables. For instance, the type of `let x = [];` or `let y =
None;` is determined from later use, unlike in, say, C++, Java, C#, or
Go. (See the sketch after this list.)
2. Fast ML-like "type equality can be determined with a pointer
comparison" tricks will not work, because we have subtyping and must
recurse on the type structure to unify.
3. Nominal types in general cannot be represented as a simple integer
"class ID", as in early Java. They require a heap-allocated vector to
represent the type parameter substitutions.
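To make point 1 concrete, here is a minimal sketch of inference being
driven by later uses. (Purely for illustration, I've written it with
library types spelled `Vec` and `Option`; the literal forms above behave
the same way.)

```rust
fn main() {
    // At this `let`, the element type of `v` is an unsolved type
    // variable; a single-pass, C-like checker has nothing to assign here.
    let mut v = Vec::new();
    v.push(1u32); // this later use solves it: `v` is Vec<u32>

    // Likewise, `None` alone does not determine the payload type; the
    // `Some` in the branch below does.
    let mut y = None;
    if v.len() == 1 {
        y = Some("one"); // `y` is Option<&str>
    }
    println!("{:?} {:?}", v, y);
}
```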
In general, the low-hanging fruit for general compiler performance is
mostly picked at this point. I would put an upper bound of 20% or so on
the performance improvements available across all stages of a
self-hosted build of the Rust compiler. The reasons for this are:
1. Typechecking and LLVM code generation are mostly optimal. When
compiling `rustc`, the time spent in these two passes dwarfs all the
others. Typechecking cannot be algorithmically improved, and LLVM code
generation is about as straightforward as it can possibly be. The
remaining performance issues in these two passes are generally due to
allocating too much, but allocation and freeing account for no more than
15% of the compile time. Thus, even if we spent all our time on the
allocator and got its cost down to a theoretical zero, we would only
improve performance by 15% or so.
2. LLVM optimizations end up dominating compile time when they're turned
on (about 75% of the total). However, the Rust compiler, like most Rust
(or C++) code, depends on LLVM optimizations for good performance. So if
you turn off optimization, you have a slow compiler; but if you turn it
on, the vast majority of your self-hosting time is spent in LLVM
optimizations. The obvious way around this catch-22 is to spend a lot of
time reimplementing, in our own compiler, the optimizations that LLVM
would have performed, in order to improve performance at -O0. But I
don't think that's a particularly good use of our time, and it would
hurt the compiler's maintainability.
There are, however, some more situational things we can do.
# General code generation performance
* We can make `~"string"` allocations (and some others, like `~[1, 2, 3,
4, 5]`) turn into calls to the moral equivalent of `strdup`. This
improves some workloads, such as the PGP key in cargo (which should just
be a constant in the first place). `rustc` still allocates a lot of
strings like this, so this might improve the LLVM portion of `rustc`'s
compilation speed. (See the sketch after this list.)
* Visitor glue should be optional; you should have to opt into its
generation, like Haskell's `Data.Typeable`. This would potentially
remove 15% of our code size and improve our code generation performance
by a similar amount, but, as Graydon points out, it is needed for
precise-on-the-heap GC. Perhaps we could use conservative GC at -O0, and
thus reduce the amount of visitor glue we need to generate for
unoptimized builds.
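As a sketch of the `strdup` point above (the helper name is
hypothetical, not an actual runtime entry point, and I'm using current
library types for illustration): the literal's bytes live in static
memory, so construction becomes one allocation plus one bulk copy,
rather than element-by-element stores or a runtime loop.

```rust
// Hypothetical lowering helper for boxed-literal allocations.
fn dup_bytes(literal: &'static [u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(literal.len()); // one allocation
    buf.extend_from_slice(literal);                  // one memcpy-like copy
    buf
}

fn main() {
    // Morally what `~[1, 2, 3, 4, 5]` would compile down to:
    let v = dup_bytes(&[1u8, 2, 3, 4, 5]);
    assert_eq!(v.len(), 5);
}
```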
# -O0 performance
At -O0 (which is the default), we get kicked off LLVM's fast instruction
selector (FastISel) far too often. We need to stop generating the
instructions that cause LLVM to bail out to the slow SelectionDAG path
at -O0.
This only affects -O0, but since that's the most common case that
matters for compilation speed, that's fine. Note that these
optimizations are severely limited in what they can do for self-hosting
performance, for the reasons stated above.
* Invoke instructions cause FastISel bailouts. This means that we can't
use the C++ exception infrastructure for task failure if we want fast
builds. Graydon has work on an optional return-value-based unwinding
mode that is nearing completion. I have a patch in review for a
"disable_unwinding" flag, which disables unwinding on failure; this
should be safe to turn on for libsyntax and librustc, since they have no
need to handle failure gracefully, and doing so improves -O0 LLVM
compile time by 1.9x.
* Visitor glue used to cause FastISel bailouts, but this is fixed on
`incoming`.
* Switch instructions cause FastISel bailouts. Pattern matching on enums
(and sometimes on integers) generates these, as does drop and take glue
for enums. This shouldn't be too hard to fix. (See the sketch after this
list.)
* Integer division instructions result in FastISel bailouts on x86. We
generate a lot of these because our vector lengths are stored in bytes.
We could change that, we could try to hack LLVM, or we could turn
integer divisions into function calls into libcore at -O0. (Note that
integer division turns into a function call *anyway* on ARM, since ARM
has no integer divide instruction, so I'm inclined to try the last
option. The sketch after this list shows the idea.)
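For concreteness, here is the kind of source that produces those switch
instructions, along with one possible shape for the division fallback.
This is a sketch; the helper name is hypothetical, not the actual
runtime interface.

```rust
// Pattern matching on an enum lowers to an LLVM `switch` on the
// discriminant, which is what knocks us off FastISel at -O0.
#[allow(dead_code)]
enum Shape { Point, Line, Triangle }

fn sides(s: Shape) -> u32 {
    match s {
        Shape::Point => 0,
        Shape::Line => 1,
        Shape::Triangle => 3,
    }
}

// One possible -O0 fallback for integer division: call a library helper
// (name hypothetical) instead of emitting a `div` instruction. The
// helper itself is compiled separately, with optimization, so only the
// caller needs to stay on the FastISel fast path; ARM makes a library
// call for division anyway.
#[inline(never)]
fn rt_div_uint(a: u64, b: u64) -> u64 {
    a / b
}

fn main() {
    assert_eq!(sides(Shape::Triangle), 3);
    // e.g. converting a vector length in bytes to a length in elements:
    assert_eq!(rt_div_uint(40, 8), 5);
}
```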
# Memory allocation performance
Our memory allocation is suboptimal in several ways. I do not think that
improving it will improve compiler performance as long as you aren't
already swapping, but I'll list them anyway.
* We do not have our own allocator; we just use the system malloc.
However, we need to trace all allocations in order to clean up @ cycles
on task death, so we thread every allocation onto a doubly-linked list.
The next and previous pointers are a huge waste of memory. We could fix
this by using an allocator that lets us trace allocations. (See the
sketch after this list.)
I would be surprised if fixing this had a huge impact on performance,
but maybe it would bump some allocations that were previously in higher
storage classes down into the TINY class, which generally has a fast
path in the allocator. And, of course, it would reduce swapping when
self-hosting if you don't have enough memory.
* We don't clean up @ cycles until task death. Fixing this will, in all
likelihood, worsen the compiler's performance, but its memory usage will
improve.
* ~ allocations don't really need to be linked into any list or be
traceable, *unless* they contain @ pointers, in which case they do need
to be traceable. Fixing this will improve memory usage, and performance
by a negligible amount.
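A sketch of the allocation-list overhead from the first bullet; the
field names are illustrative, not the runtime's actual layout.

```rust
// Every @-box carries a header so the task can walk all of its live
// boxes at death; `prev` and `next` are the wasted words described above.
#[allow(dead_code)]
#[repr(C)]
struct BoxHeader {
    type_desc: *const (), // type descriptor, used for tracing and drop glue
    prev: *mut BoxHeader, // doubly-linked list threading...
    next: *mut BoxHeader, // ...so every allocation can be enumerated
}

fn main() {
    // Two extra machine words per allocation, just for the list links.
    println!("link overhead per box: {} bytes",
             2 * std::mem::size_of::<*mut BoxHeader>());
}
```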
# External metadata
We currently read the metadata of external crates in its entirety during
a few phases of the compiler. This dominates the compile time of small
programs only; for larger programs such as rustc, the cost quickly
shrinks to nothing compared to the rest of the compilation. However,
since newcomers to Rust generally compile small programs, this is most
of the cost they see. It also constitutes the majority of the time our
test suite takes. Finally, it is the performance bottleneck for the REPL.
Improving this will not improve the compilation speed of self-hosting by
more than 1%. The biggest benefit of fixing it is that small programs
will appear to compile instantly, which greatly improves Rust's first
impression on those used to fast builds in other languages.
* External metadata reading takes a long time (0.3 s). I'm not sure
whether all of this is necessary, as I'm not too familiar with this pass.
* Language item collection reads all the items in external crates to
look for language items (another 0.3 s). This is silly and easy to fix:
we just add a new table to the metadata that specifies the def IDs of
the language items. (A sketch follows this list.)
* Name resolution has to read all the items in external crates (another
0.3 s). This was the easiest way to approximate the Rust 0.5 name
resolution semantics. (The actual semantics were basically
unimplementable, but this algorithm got close enough to work in
practice, usually.) With the new semantics in Rust 0.6 we should be able
to do better here and avoid reading modules until they're actually
referenced. Unfortunately, fixing this will require rewriting resolve,
which is a month's worth of work.
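A sketch of the language-item table I have in mind; the encoding shown
here is hypothetical, and the real metadata format would differ in
detail.

```rust
// One small entry per language item the crate defines, so the collector
// reads a handful of records instead of walking every item in the crate.
struct LangItemEntry {
    lang_item_index: u32, // index into the compiler's fixed language-item list
    crate_num: u32,       // def ID, part 1: the defining crate
    node_id: u32,         // def ID, part 2: the item's node in that crate
}

// Collection becomes a linear scan over the table.
fn collect(table: &[LangItemEntry], lang_items: &mut [Option<(u32, u32)>]) {
    for entry in table {
        lang_items[entry.lang_item_index as usize] =
            Some((entry.crate_num, entry.node_id));
    }
}

fn main() {
    let table = [LangItemEntry { lang_item_index: 0, crate_num: 1, node_id: 42 }];
    let mut lang_items = vec![None; 8];
    collect(&table, &mut lang_items);
    assert_eq!(lang_items[0], Some((1, 42)));
}
```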
# Stack switching
* We could run rustc with a large stack and avoid stack switching. This
is functionality we need for Servo anyway. This might improve compiler
performance by 1% or so.
None of these optimizations will improve the `rustc` self-hosting time
by anything approaching an order of magnitude. However, I think they
could have a positive impact on the experience for newcomers to Rust.
Patrick