Hi all,

Currently, our new code generator for operator fusion uses the
programmatic javax.tools.JavaCompiler, which is Java's standard API for
compilation. Despite a plan cache that mitigates unnecessary compilation
and recompilation overheads, we still see significant end-to-end overhead,
especially for small input data.

Moving forward, I'd like to switch to Janino
(org.codehaus.janino.SimpleCompiler), which is a fast in-memory Java
compiler with restricted language support. The advantages are:

(1) Reduced compilation overhead: In end-to-end scenarios for L2SVM, GLM,
and MLogreg, switching from javac to Janino reduced total compilation time
from 2.039 to 0.195 (14 operators), from 8.134 to 0.411 (82 operators), and
from 4.854 to 0.283 (46 operators), respectively, which is roughly a 10x to
20x reduction. At the same time, there was no measurable impact on runtime
efficiency; if anything, JIT compilation overhead was slightly reduced.

(2) Removed JDK requirement: The standard javax.tools.JavaCompiler
requires a JDK to be present, while Janino only requires a JRE, which
makes it easier to enable code generation by default.
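
To illustrate point (2), here is a minimal sketch of how the JDK
requirement shows up in practice: ToolProvider.getSystemJavaCompiler()
returns null when no platform compiler is available, which is typically the
case on a plain JRE. The class and method names (CompilerCheck,
hasSystemCompiler) are made up for this example and are not part of our
code base.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

// Hypothetical helper, for illustration only.
class CompilerCheck {
    // Returns true if the platform Java compiler is available.
    // getSystemJavaCompiler() returns null if no compiler is provided,
    // which is what usually happens when running on a JRE without a JDK.
    static boolean hasSystemCompiler() {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        return compiler != null;
    }

    public static void main(String[] args) {
        if (hasSystemCompiler()) {
            System.out.println("javax.tools compiler available");
        } else {
            System.out.println("no system compiler found; "
                + "javax.tools-based code generation would not work here");
        }
    }
}
```

Any code path that relies on javax.tools.JavaCompiler would need a check
like this (or a hard JDK requirement), whereas an embedded compiler
shipped as a library avoids the question entirely.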

However, I'm raising this here because Janino would add another explicit
dependency (under a BSD license). Fortunately, Spark also uses Janino for
whole-stage code generation, so we should be able to mark Janino as a
provided library. The only issue is a pure Hadoop environment, where we
still want to use code generation for CP operations. To simplify the build,
I could imagine using javax.tools.JavaCompiler for Hadoop execution types
and Janino by default.
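
As a rough sketch of the selection rule described above (javac for Hadoop
execution types, Janino otherwise), something along the following lines
could work. All names here (CompilerSelection, ExecType, CompilerType,
select) and the set of execution types are assumptions for illustration
and do not reflect the actual classes in our code base.

```java
// Hypothetical sketch of the proposed default; not the actual implementation.
class CompilerSelection {
    // Assumed execution types, for illustration only.
    enum ExecType { SINGLE_NODE, HADOOP, SPARK }

    // The two compiler backends discussed in this mail.
    enum CompilerType { JAVAC, JANINO }

    // Use javax.tools.JavaCompiler for Hadoop, where Janino is not
    // provided by the environment; use Janino everywhere else.
    static CompilerType select(ExecType exec) {
        return (exec == ExecType.HADOOP)
            ? CompilerType.JAVAC
            : CompilerType.JANINO;
    }

    public static void main(String[] args) {
        for (ExecType exec : ExecType.values()) {
            System.out.println(exec + " -> " + select(exec));
        }
    }
}
```

The point of keeping the choice in one place is that the rest of the code
generator would not need to know which compiler backend is in use.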

If you have any concerns, please let me know by Monday; otherwise I'd like
to push this change into our upcoming 0.14 release.


Regards,
Matthias
