About a year ago we decided to drop Java 6 support in Spark 1.5. I am
wondering if we should also just drop Java 7 support in Spark 2.0 (i.e.
Spark 2.0 would require Java 8 to run).

Oracle ended public updates for JDK 7 one year ago (Apr 2015), and
removed public downloads for JDK 7 in July 2015. In the past I've actually
been against dropping Java 7, but today I ran into an issue with the new
Dataset API not working well with Java 8 lambdas, and that changed my
opinion on this.

I've been thinking more about this issue today and also talked with a lot
of people offline to gather feedback, and I actually think the pros outweigh
the cons, for the following reasons (in rough order of importance):

1. It is complicated to test how well Spark APIs work for Java lambdas if
we support Java 7. Jenkins machines need to have both Java 7 and Java 8
installed and we must run through a set of test suites in 7, and then the
lambda tests in Java 8. This complicates build environments/scripts, and
makes them less robust. Without good testing infrastructure, I have no
confidence in building good APIs for Java 8.

2. Dataset/DataFrame performance will be between 1x and 10x slower in Java
7. The primary APIs we want users to use in Spark 2.x are
Dataset/DataFrame, and this impacts pretty much everything from machine
learning to structured streaming. We have made great progress in their
performance through extensive use of code generation. (In many dimensions
Spark 2.0 with DataFrames/Datasets looks more like a compiler than a
MapReduce or query engine.) These optimizations don't work well in Java 7
due to broken code cache flushing. This problem has been fixed by Oracle in
Java 8. In addition, Java 8 comes with better support for Unsafe and SIMD.

3. Scala 2.12 will come out soon, and we will want to add support for that.
Scala 2.12 only works on Java 8. If we do support Java 7, we'd have a
fairly complicated compatibility matrix and testing infrastructure.

4. There are libraries that I've looked into in the past that support only
Java 8. This is more common in high performance libraries such as Aeron (a
messaging library). Having to support Java 7 means we are not able to use
these. It is not that big of a deal right now, but will become increasingly
difficult as we optimize performance.
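
To make point 1 a bit more concrete, here is a minimal sketch of the two
call styles we would have to keep testing side by side. MapFn and mapAll
are hypothetical stand-ins I made up for illustration, not Spark's actual
API. The anonymous-class form compiles on Java 7, while the lambda form
needs a Java 8 compiler, which is why the lambda tests can only run in a
Java 8 build.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LambdaStyles {
    // Hypothetical single-method interface standing in for a
    // Spark-style function type (an assumption, not Spark's API).
    interface MapFn<T, R> {
        R call(T value);
    }

    // Hypothetical helper: applies fn to every element of input.
    static <T, R> List<R> mapAll(List<T> input, MapFn<T, R> fn) {
        List<R> out = new ArrayList<R>();
        for (T item : input) {
            out.add(fn.call(item));
        }
        return out;
    }

    // Java 7 style: the function is passed as an anonymous inner class.
    static List<Integer> lengthsJava7(List<String> words) {
        return mapAll(words, new MapFn<String, Integer>() {
            @Override
            public Integer call(String s) {
                return s.length();
            }
        });
    }

    // Java 8 style: the same function written as a lambda.
    // This method does not compile with a Java 7 compiler.
    static List<Integer> lengthsJava8(List<String> words) {
        return mapAll(words, s -> s.length());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "java");
        System.out.println(lengthsJava7(words));
        System.out.println(lengthsJava8(words));
    }
}
```

Both methods return the same result; the only difference is which
compiler and runtime can handle the source.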


The downside of not supporting Java 7 is also obvious. Some organizations
are stuck with Java 7, and they wouldn't be able to use Spark 2.0 without
upgrading Java.
