On 20 Dec 2013, at 3:37, Adam Murdoch wrote:
Hi,
Just some thoughts on how we might spike a solution for incremental
Java compilation, to see if it’s worthwhile and what the effort
might be:
The goal is to improve the Java compile tasks, so that they do less
work for certain kinds of changes. Here, ‘less work’ means
compiling fewer source files, and also touching fewer output files so
that consumers of the task output can also do less work. It doesn’t
mean compiling the *fewest* possible number of source files - just
fewer than we do now.
The basic approach comes down to keeping track of dependencies between
source files and the other compilation inputs - where inputs are
source files, the compile classpath, the compile settings, and so on.
Then, when an input changes, we would recompile the source files that
depend on that input. Currently, we assume that every source file
depends on every input, so that when an input changes we recompile
everything.
Note that we don’t necessarily need to track dependencies at a
fine-grained level. For example, we may track dependencies between
packages rather than classes, or we may continue to assume that every
source file depends on every class in the compile classpath.
A basic solution would look something like:
1. Determine which inputs have changed.
2. If the compile settings have changed, or if we don’t have any
history, then schedule every source file for compilation, and skip to
#5.
3. If a class in the compile classpath has changed, then schedule for
compilation every source file that depends on this class.
4. If a source file has changed, then schedule for compilation every
source file that depends on the classes of the source file.
5. For each source file scheduled for compilation, remove the previous
output for that source file.
6. Invoke the compiler.
7. For each successfully compiled source file, extract the dependency
information for the classes in the source file and persist this for
next time.
For the above, “depends on” includes indirect dependencies.
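As a rough illustration of steps #3 and #4, here is a minimal sketch of the "depends on, including indirect dependencies" lookup. It assumes the persisted history is available as a map from a class name to the classes that directly depend on it; the class name `AffectedSources` and the map shape are hypothetical, not an existing Gradle API.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class AffectedSources {
    /**
     * Hypothetical helper: 'dependents' maps a class name to the classes that
     * directly depend on it. Returns every class that depends, directly or
     * indirectly, on one of the changed classes. The changed classes themselves
     * are not added unless something in the graph leads back to them.
     */
    static Set<String> affected(Map<String, Set<String>> dependents, Collection<String> changed) {
        Set<String> result = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>(changed);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String dependent : dependents.getOrDefault(current, Collections.<String>emptySet())) {
                // Only enqueue a class the first time we see it, so cycles terminate.
                if (result.add(dependent)) {
                    queue.add(dependent);
                }
            }
        }
        return result;
    }
}
```

The caller would schedule the changed source files plus the returned set for compilation.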
Steps #1 and #2 are already covered by the incremental task API, at
least enough to spike this.
Step #3 isn’t quite as simple as it is described above:
- Firstly, we can ignore changes for a class with a given name, if a
class with the same name appears before it in the classpath (this
includes the source files).
- If a class is removed, this counts as a ‘change’, so that we
recompile any source files that used to depend on this class.
- If a class is added before some other class with the same name in
the classpath, then we recompile any source files that used to depend
on the old class.
- Dependencies can travel through other classes in the classpath, or
source files, or a combination of both (e.g. a source class depends on
a classpath class depends on a source class depends on a classpath
class).
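The first three points above reduce to one check: a change to a class in some classpath entry only matters if no earlier entry provides a class with the same name. A minimal sketch, assuming the classpath is modelled as an ordered list of class-name sets (with the source files as the first entry) and reflects the state after the change; `ClasspathShadowing` is a hypothetical name.

```java
import java.util.List;
import java.util.Set;

public class ClasspathShadowing {
    /**
     * Returns true if a class called 'className' appears in an entry that comes
     * before 'changedIndex' in the classpath. In that case an add, change, or
     * removal of the same class name at 'changedIndex' can be ignored.
     */
    static boolean isShadowed(List<Set<String>> classpath, String className, int changedIndex) {
        for (int i = 0; i < changedIndex; i++) {
            if (classpath.get(i).contains(className)) {
                return true;
            }
        }
        return false;
    }
}
```

A removal or an add that is not shadowed would then invalidate every source file that used to depend on that class name.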
Step #4 is similar to step #3.
For a spike, it might be worth simply invalidating everything when the
compile classpath changes, and just deal with changes in the source
files.
For step #7 we have three basic approaches for extracting the
dependencies:
The first approach is to use asm to extract the dependencies from the
byte code after compilation. The upside is that this is very simple to
implement and very fast. We have an implementation already that we use
in the tooling API (ClasspathInferer - but it’s mixed in with some
other stuff). It also works for things that we only have the byte code
for.
The downside is that it’s lossy: the compiler inlines constants into
the byte code and discards source-only annotations. We also don’t
easily know what type of dependency it is (is it an implementation
detail or is it visible in the API of the class?)
Both these downsides can be addressed: For example we might treat a
class with a constant field or a class for a source-only annotation as
a dependency of every source file, so that when one of these things
changes, we would recompile everything. And to determine the type of
dependency, we just need to dig deeper into the byte code.
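To give a feel for how little is needed to pull dependencies out of byte code, here is a rough sketch that does not use asm at all: it reads the class file's constant pool with plain `java.io` and collects the `CONSTANT_Class` entries. This is a swapped-in simplification, not the ClasspathInferer implementation. It misses types that appear only in field or method descriptors, and it reports array types in their raw descriptor form, so a real spike would more likely use asm's visitors. The name `ConstantPoolDependencies` is hypothetical.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class ConstantPoolDependencies {
    /** Returns the names of all CONSTANT_Class entries, including the class itself. */
    static Set<String> referencedClasses(InputStream classFile) throws IOException {
        DataInputStream in = new DataInputStream(classFile);
        if (in.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file");
        }
        in.readUnsignedShort(); // minor version
        in.readUnsignedShort(); // major version
        int count = in.readUnsignedShort();
        String[] utf8 = new String[count];
        List<Integer> classNameIndexes = new ArrayList<>();
        for (int i = 1; i < count; i++) {
            int tag = in.readUnsignedByte();
            switch (tag) {
                case 1: utf8[i] = in.readUTF(); break;                        // Utf8
                case 7: classNameIndexes.add(in.readUnsignedShort()); break;  // Class
                case 8: case 16: case 19: case 20:                            // String, MethodType, Module, Package
                    in.readUnsignedShort(); break;
                case 15:                                                      // MethodHandle
                    in.readUnsignedByte(); in.readUnsignedShort(); break;
                case 3: case 4: case 9: case 10: case 11: case 12: case 17: case 18:
                    in.readInt(); break;                                      // 4-byte entries
                case 5: case 6:                                               // Long, Double use two slots
                    in.readLong(); i++; break;
                default:
                    throw new IOException("unknown constant pool tag " + tag);
            }
        }
        Set<String> names = new TreeSet<>();
        for (int index : classNameIndexes) {
            names.add(utf8[index].replace('/', '.'));
        }
        return names;
    }
}
```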
The second approach is to use the compiler API that we are already
using to invoke the compiler to query the dependencies during
compilation. The upside is that we get the full source dependency
information. The downsides are that we have to use a Sun-specific
extension of the compiler API to do this and it’s a very complicated
API, which makes it fiddly to get right.
The third approach is to parse and analyse the source separately from
compilation.
I’d probably try out the first option, as it’s the simplest to
implement and probably the fastest at execution time.
There are some issues around making this efficient.
First, we need to make the persistence mechanism fast. For the spike,
let’s assume we can do this. I would just keep the state in some
static field somewhere and not bother with persistence.
Second, we need to make the calculation of affected source files fast.
One option is to calculate this when something changes rather than
each time we run the compilation task, so that we keep, basically, a
map from input file to the closure of all source files affected by
that input file.
This is a direction we are no doubt going to go into anyway.
Third, we need to keep the dependency graph as small as we can. So, we
might play around with tracking dependencies between packages rather
than classes.
Will be interesting to see how this works in the real world on nasty
code bases where packages are monolithic and have lots of dependencies.
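For experimenting with the package-level idea, a class-level graph can be collapsed into a package-level one. This is only a sketch under the assumption that dependencies are held as a map from class name to the class names it depends on; `PackageGraph` is a hypothetical name.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class PackageGraph {
    /** Package part of a fully qualified class name, or "" for the default package. */
    static String packageOf(String className) {
        int lastDot = className.lastIndexOf('.');
        return lastDot < 0 ? "" : className.substring(0, lastDot);
    }

    /**
     * Collapses class-to-class dependencies into package-to-package dependencies.
     * Edges inside a single package are dropped: with package-level tracking a
     * change anywhere in a package already invalidates the whole package.
     */
    static Map<String, Set<String>> collapse(Map<String, Set<String>> classDependencies) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : classDependencies.entrySet()) {
            String from = packageOf(entry.getKey());
            for (String target : entry.getValue()) {
                String to = packageOf(target);
                if (!from.equals(to)) {
                    result.computeIfAbsent(from, k -> new TreeSet<>()).add(to);
                }
            }
        }
        return result;
    }
}
```

Comparing the size of the collapsed graph with the original on a real code base would show how much is saved, and how much precision is lost when packages are large.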
We should also ignore dependencies that are not visible to the
consumer, so that we don’t traverse the dependencies of method
bodies, or private elements.
What do you mean here?
Finally, we should ignore changes that are not visible to the
consumer, so that we ignore changes to method bodies, private elements
of a class, the annotations of classes, debug info and so on. This is
relatively easy for changes to the compile classpath. For changes to
source files, it’s a bit trickier, as we don’t know what’s
changed until we compile the source file. We could, potentially,
compile in two passes - first source files that have changed and then
second source files that have not changed but depend on those that
have. Something, potentially, to play with as part of a spike.
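One way to picture "changes that are not visible to the consumer" is a fingerprint built only from the non-private declarations of a class. The sketch below uses reflection purely for illustration; it is an assumption that a build tool would do the equivalent on the byte code rather than by loading classes. It ignores method bodies, private members, and annotations, but it also ignores constant values, which would need separate handling since the compiler inlines them. `ApiFingerprint` is a hypothetical name.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.lang.reflect.Type;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ApiFingerprint {
    /** Sorted signatures of everything a consumer of 'type' could compile against. */
    static List<String> visibleSignatures(Class<?> type) {
        List<String> signatures = new ArrayList<>();
        signatures.add("extends " + String.valueOf(type.getGenericSuperclass()));
        for (Type iface : type.getGenericInterfaces()) {
            signatures.add("implements " + iface.getTypeName());
        }
        for (Field field : type.getDeclaredFields()) {
            if (isVisible(field.getModifiers(), field.isSynthetic())) {
                signatures.add(field.toGenericString());
            }
        }
        for (Constructor<?> constructor : type.getDeclaredConstructors()) {
            if (isVisible(constructor.getModifiers(), constructor.isSynthetic())) {
                signatures.add(constructor.toGenericString());
            }
        }
        for (Method method : type.getDeclaredMethods()) {
            if (isVisible(method.getModifiers(), method.isSynthetic())) {
                signatures.add(method.toGenericString());
            }
        }
        // Reflection does not guarantee member order, so sort for a stable result.
        Collections.sort(signatures);
        return signatures;
    }

    private static boolean isVisible(int modifiers, boolean synthetic) {
        return !synthetic && !Modifier.isPrivate(modifiers);
    }

    /** Two versions of a class with the same fingerprint look the same to consumers. */
    static int fingerprint(Class<?> type) {
        return visibleSignatures(type).hashCode();
    }
}
```

If the fingerprint of a recompiled class is unchanged, the second pass over its dependents could be skipped.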
I'm pretty dubious about all of this. Looks to me like a difficult thing
to pull off outside of the compiler. I'm sure we can get something
working, but whether it's reliable enough and fast enough is another
question (hopefully answered by the spike). I also wonder whether
investing in more fine-grained parallelism and coarser avoidance (e.g.
ignoring non-visible classpath changes) wouldn't be more fruitful and
more generally applicable.