On 22 Dec 2013, at 12:49 am, Luke Daley <luke.da...@gradleware.com> wrote:
> On 20 Dec 2013, at 3:37, Adam Murdoch wrote:
>
>> Hi,
>>
>> Just some thoughts on how we might spike a solution for incremental Java compilation, to see if it’s worthwhile and what the effort might be:
>>
>> The goal is to improve the Java compile tasks, so that they do less work for certain kinds of changes. Here, ‘less work’ means compiling fewer source files, and also touching fewer output files so that consumers of the task output can also do less work. It doesn’t mean compiling the *fewest* possible number of source files - just fewer than we do now.
>>
>> The basic approach comes down to keeping track of dependencies between source files and the other compilation inputs - where the inputs are source files, the compile classpath, the compile settings, and so on. Then, when an input changes, we would recompile the source files that depend on that input. Currently, we assume that every source file depends on every input, so that when an input changes we recompile everything.
>>
>> Note that we don’t necessarily need to track dependencies at a fine-grained level. For example, we may track dependencies between packages rather than classes, or we may continue to assume that every source file depends on every class in the compile classpath.
>>
>> A basic solution would look something like:
>>
>> 1. Determine which inputs have changed.
>> 2. If the compile settings have changed, or if we don’t have any history, then schedule every source file for compilation, and skip to #5.
>> 3. If a class in the compile classpath has changed, then schedule for compilation every source file that depends on this class.
>> 4. If a source file has changed, then schedule for compilation every source file that depends on the classes of that source file.
>> 5. For each source file scheduled for compilation, remove the previous output for that source file.
>> 6. Invoke the compiler.
>> 7. For each successfully compiled source file, extract the dependency information for the classes in the source file and persist this for next time.
>>
>> For the above, “depends on” includes indirect dependencies.
>>
>> Steps #1 and #2 are already covered by the incremental task API, at least enough to spike this.
>>
>> Step #3 isn’t quite as simple as it is described above:
>> - Firstly, we can ignore changes for a class with a given name if a class with the same name appears before it in the classpath (this includes the source files).
>> - If a class is removed, this counts as a ‘change’, so that we recompile any source files that used to depend on this class.
>> - If a class is added before some other class with the same name in the classpath, then we recompile any source files that used to depend on the old class.
>> - Dependencies can travel through other classes in the classpath, or source files, or a combination of both (e.g. a source class depends on a classpath class, which depends on a source class, which depends on a classpath class).
>>
>> Step #4 is similar to step #3.
>>
>> For a spike, it might be worth simply invalidating everything when the compile classpath changes, and just dealing with changes in the source files.
>>
>> For step #7 we have three basic approaches for extracting the dependencies:
>>
>> The first approach is to use ASM to extract the dependencies from the byte code after compilation. The upside is that this is very simple to implement and very fast. We have an implementation already that we use in the tooling API (ClasspathInferer - but it’s mixed in with some other stuff). It also works for things that we only have the byte code for.
>>
>> The downside is that it’s lossy: the compiler inlines constants into the byte code and discards source-only annotations. We also don’t easily know what type of dependency it is (is it an implementation detail, or is it visible in the API of the class?).
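[Editor's note: the first approach doesn’t strictly need ASM for class-level dependencies - the CONSTANT_Class entries can be read straight out of the class file’s constant pool, as the reply further down describes. A minimal sketch using only the JDK; the class and method names here are made up for illustration, this is not Gradle code:]

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;
import java.util.TreeSet;

public class ConstantPoolDeps {
    /** Returns the internal names of all classes referenced by a class file,
     *  by scanning its constant pool for CONSTANT_Class entries. */
    static Set<String> referencedClasses(InputStream classFile) throws IOException {
        DataInputStream in = new DataInputStream(classFile);
        if (in.readInt() != 0xCAFEBABE) throw new IOException("not a class file");
        in.readUnsignedShort(); // minor version
        in.readUnsignedShort(); // major version
        int count = in.readUnsignedShort();
        String[] utf8 = new String[count];     // CONSTANT_Utf8 entries by index
        int[] nameIndex = new int[count];      // CONSTANT_Class -> name index
        int[] classSlots = new int[count];
        int nClasses = 0;
        for (int i = 1; i < count; i++) {      // pool indices start at 1
            int tag = in.readUnsignedByte();
            switch (tag) {
                case 1: utf8[i] = in.readUTF(); break;           // Utf8: u2 length + bytes
                case 7: classSlots[nClasses++] = i;
                        nameIndex[i] = in.readUnsignedShort(); break; // Class
                case 8: case 16: case 19: case 20:
                        in.skipBytes(2); break;                  // String, MethodType, Module, Package
                case 15: in.skipBytes(3); break;                 // MethodHandle
                case 3: case 4: case 9: case 10: case 11: case 12: case 17: case 18:
                        in.skipBytes(4); break;                  // Int, Float, refs, NameAndType, (Invoke)Dynamic
                case 5: case 6: in.skipBytes(8); i++; break;     // Long/Double take two slots
                default: throw new IOException("unknown constant pool tag " + tag);
            }
        }
        Set<String> names = new TreeSet<>();
        for (int k = 0; k < nClasses; k++) {
            names.add(utf8[nameIndex[classSlots[k]]]); // Utf8 may appear after the Class entry
        }
        return names;
    }
}
```

[This is fast, but it inherits exactly the lossiness described above - inlined constants and source-only annotations never show up in the pool.]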
>> Both these downsides can be addressed. For example, we might treat a class with a constant field, or a class for a source-only annotation, as a dependency of every source file, so that when one of these things changes, we would recompile everything. And to determine the type of dependency, we just need to dig deeper into the byte code.
>>
>> The second approach is to use the compiler API that we are already using to invoke the compiler, to query the dependencies during compilation. The upside is that we get the full source dependency information. The downsides are that we have to use a Sun-specific extension of the compiler API to do this, and that it’s a very complicated API, which makes it fiddly to get right.
>>
>> The third approach is to parse and analyse the source separately from compilation.
>>
>> I’d probably try out the first option, as it’s the simplest to implement and probably the fastest at execution time.
>>
>> There are some issues around making this efficient.
>>
>> First, we need to make the persistence mechanism fast. For the spike, let’s assume we can do this. I would just keep the state in some static field somewhere and not bother with persistence.
>>
>> Second, we need to make the calculation of affected source files fast. One option is to calculate this when something changes, rather than each time we run the compilation task, so that we keep, basically, a map from input file to the closure of all source files affected by that input file.
>
> This is a direction we are no doubt going to go into anyway.
>
>> Third, we need to keep the dependency graph as small as we can. So, we might play around with tracking dependencies between packages rather than classes.
>
> Will be interesting to see how this works in the real world on nasty code bases where packages are monolithic and have lots of dependencies.

It’s all about trade-offs.
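[Editor's note: the ‘map from input file to the closure of all source files affected’ mentioned above is just reachability over the reversed dependency edges. A rough sketch, with made-up names, using class names as graph nodes - a real implementation would map classes back to their source files:]

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AffectedSources {

    /** Inverts the per-source info persisted in step #7 ("A depends on B")
     *  into a dependents map ("B is depended on by A"). This can be updated
     *  when something changes, rather than on every task execution. */
    static Map<String, Set<String>> invert(Map<String, Set<String>> dependsOn) {
        Map<String, Set<String>> dependents = new HashMap<>();
        dependsOn.forEach((source, deps) -> {
            for (String dep : deps) {
                dependents.computeIfAbsent(dep, k -> new HashSet<>()).add(source);
            }
        });
        return dependents;
    }

    /** Transitive closure of dependents of the changed inputs - i.e. everything
     *  to schedule for recompilation. "Depends on" includes indirect
     *  dependencies, hence the breadth-first walk rather than one map lookup. */
    static Set<String> affectedBy(Set<String> changed, Map<String, Set<String>> dependents) {
        Set<String> affected = new HashSet<>(changed);
        Deque<String> queue = new ArrayDeque<>(changed);
        while (!queue.isEmpty()) {
            for (String dependent : dependents.getOrDefault(queue.remove(), Set.of())) {
                if (affected.add(dependent)) {
                    queue.add(dependent); // follow indirect dependents too
                }
            }
        }
        return affected;
    }
}
```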
The idea of tracking package dependencies is to make the graphs smaller, with fewer nodes and edges, meaning less work to figure out the things that definitely are up-to-date, at the cost of compiling some source files that might not be out-of-date. We’d have to measure and see. The point was really that we don’t necessarily need to build a graph of individual source files to make compilation better, and given how fast the Java compiler is, it might be a better trade-off to chunk the source files to some degree.

>> We should also ignore dependencies that are not visible to the consumer, so that we don’t traverse the dependencies of method bodies, or private elements.
>
> What do you mean here?

There are lots of dependencies of a class that aren’t visible at compile time to any consumer of that class. For example:

- A class that is only referenced in a method body.
- A class that is only referenced in a signature of a private element.
- Annotations (except @Deprecated).

So, if we have something like:

    A extends B { }
    B { someMethod() { new C() } }

then C is not visible to A via B, and when C changes we shouldn’t recompile A, but we should recompile B.

However, when a dependency may be visible to a consumer, then we should traverse that dependency:

    A extends B { }
    B extends C { }

Then C is visible to A via B, and when C changes we should recompile both A and B.

>> Finally, we should ignore changes that are not visible to the consumer, so that we ignore changes to method bodies, private elements of a class, the annotations of classes, debug info and so on. This is relatively easy for changes to the compile classpath. For changes to source files, it’s a bit trickier, as we don’t know what’s changed until we compile the source file. We could, potentially, compile in two passes - first, source files that have changed, and second, source files that have not changed but depend on those that have.
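[Editor's note: the A/B/C example above can be made concrete. Reflection happens to see only API-visible references - supertypes and member signatures, never method bodies - which is exactly the distinction being drawn. A toy sketch, not how the spike would extract this (it would read byte code or sources); names are illustrative:]

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ApiVisibleDeps {
    // The example from the discussion: C is referenced only inside a method
    // body of B, so it is not part of B's API and is invisible to consumers
    // such as A. When C changes, recompile B but not A.
    static class C { }
    static class B { void someMethod() { new C(); } }
    static class A extends B { }

    /** Types visible in a class's API: supertypes plus the signatures of its
     *  non-private declared methods. (Fields, constructors, generics and
     *  thrown types are omitted to keep the sketch short.) */
    static Set<Class<?>> apiVisible(Class<?> cls) {
        Set<Class<?>> deps = new HashSet<>();
        if (cls.getSuperclass() != null) deps.add(cls.getSuperclass());
        deps.addAll(Arrays.asList(cls.getInterfaces()));
        for (Method m : cls.getDeclaredMethods()) {
            if (Modifier.isPrivate(m.getModifiers())) continue; // not consumer-visible
            deps.add(m.getReturnType());
            deps.addAll(Arrays.asList(m.getParameterTypes()));
        }
        return deps;
    }
}
```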
Something, potentially, to play with as part of a spike.

> I’m pretty dubious about all of this. Looks to me like a difficult thing to pull off outside of the compiler. I’m sure we can get something working, but whether it’s reliable enough and fast enough is another question (hopefully answered by the spike).

We can make it reliable. The byte code format makes it quite easy and very fast to extract the compile-time dependencies of a type - you slurp up the CONSTANT_Class entries in the constant pool. You have to invalidate more source files than might have actually changed when a kind of usage that the compiler inlines changes, such as a constant. But this is just part of the trade-off.

We can also make it reliable using the compiler API during compilation. It’s just more fiddly, and doesn’t work when we aren’t responsible for compilation.

Fast is another story, and that’s something for the spike to answer. I’m pretty confident.

Something to remember is that the goal here is not just about making compilation faster - it’s also about reducing the impact on things that use the compiled classes. So, it would be totally fine if we actually made compilation slightly slower, provided we saw a nice reduction in the average total build time.

> I also wonder whether investing in more fine-grained parallelism and coarser avoidance (e.g. ignoring non-visible classpath changes) wouldn’t be more fruitful and more generally applicable.

The coarser avoidance is part of this work. It has to be, to make it efficient.

--
Adam Murdoch
Gradle Co-founder
http://www.gradle.org
VP of Engineering, Gradleware Inc. - Gradle Training, Support, Consulting
http://www.gradleware.com