Re: [discuss] ending support for Java 7 in Spark 2.0

Koert Kuipers Thu, 24 Mar 2016 18:40:51 -0700

i think marcelo also pointed this out before. it's very interesting to hear,
i was not aware of that until today. it would mean we would only have to
convince a group/client with a cluster to install jdk8 on the nodes,
without actually transitioning to it, if i understand it correctly. that
would definitely lower the hurdle by a lot.
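
a minimal sketch of how such a launch might look, assuming jdk8 is installed at the same path on every node. the jdk path and the jar name are hypothetical. `spark.yarn.appMasterEnv.*` and `spark.executorEnv.*` are the documented spark properties for setting environment variables in the application master and executor containers; whether pointing JAVA_HOME at jdk8 is sufficient on a given cluster is something to verify there.

```python
# Sketch only: the JDK path and application jar below are hypothetical.
JDK8_HOME = "/usr/lib/jvm/java-8-openjdk"  # assumed install location on every node


def spark_submit_args(app_jar, java_home=JDK8_HOME):
    """Build a spark-submit command line that points YARN containers at java_home."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # JAVA_HOME for the YARN application master (the driver in cluster mode)
        "--conf", "spark.yarn.appMasterEnv.JAVA_HOME=" + java_home,
        # JAVA_HOME for every executor container
        "--conf", "spark.executorEnv.JAVA_HOME=" + java_home,
        app_jar,
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before running it by hand.
    print(" ".join(spark_submit_args("our-app.jar")))
```

the cluster's own yarn daemons keep running on whatever java they were started with; only the containers launched for this application would pick up the different JAVA_HOME.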


On Thu, Mar 24, 2016 at 9:36 PM, Mridul Muralidharan <mri...@gmail.com>
wrote:

>
> Container Java version can be different from yarn Java version: we run
> jobs with jdk8 on a jdk7 cluster without issues.
>
> Regards
> Mridul
>
>
> On Thursday, March 24, 2016, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i guess what i am saying is that in a yarn world the only hard
>> restrictions left are the containers you run in, which means the hadoop
>> version, java version and python version (if you use python).
>>
>>
>> On Thu, Mar 24, 2016 at 12:39 PM, Koert Kuipers <ko...@tresata.com>
>> wrote:
>>
>>> The group will not upgrade to spark 2.0 themselves, but they are mostly
>>> fine with vendors like us deploying our application via yarn with whatever
>>> spark version we choose (and bundle, so they do not install it separately,
>>> they might not even be aware of what spark version we use). This all works
>>> because spark does not need to be on the cluster nodes, just on the one
>>> machine where our application gets launched. Having yarn is pretty awesome
>>> in this respect.
>>>
>>> On Thu, Mar 24, 2016 at 12:25 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> (PS CDH5 runs fine with Java 8, but I understand your more general
>>>> point.)
>>>>
>>>> This is a familiar context indeed, but in that context, would a group
>>>> not wanting to update to Java 8 want to manually put Spark 2.0 into
>>>> the mix? That is, if this is a context where the cluster is
>>>> purposefully some stable mix of components, would you be updating just
>>>> one?
>>>>
>>>> You make a good point about Scala being more library than
>>>> infrastructure component. So it can be updated on a per-app basis. On
>>>> the one hand it's harder to handle different Scala versions from the
>>>> framework side, but on the other it's less hard on the deployment side.
>>>>
>>>> On Thu, Mar 24, 2016 at 4:27 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>> > i think the arguments are convincing, but it also makes me wonder if
>>>> > i live in some kind of alternate universe... we deploy on customers'
>>>> > clusters, where the OS, python version, java version and hadoop distro
>>>> > are not chosen by us. so think centos 6, cdh5 or hdp 2.3, java 7 and
>>>> > python 2.6. we simply have access to a single proxy machine and launch
>>>> > through yarn. asking them to upgrade java is pretty much out of the
>>>> > question or a 6+ month ordeal. of the 10 client clusters i can think
>>>> > of off the top of my head all of them are on java 7, none are on
>>>> > java 8. so by doing this you would make spark 2 basically unusable for
>>>> > us (unless most of them have plans to upgrade to java 8 in the near
>>>> > term, i will ask around and report back...).
>>>> >
>>>> > on a side note, it's particularly interesting to me that spark 2 chose
>>>> > to continue support for scala 2.10, because even for us in our very
>>>> > constricted client environments the scala version is something we can
>>>> > easily upgrade (we just deploy a custom build of spark for the relevant
>>>> > scala version and hadoop distro). and because scala is not a dependency
>>>> > of any hadoop distro (so not on classpath, which i am very happy about)
>>>> > we can use whatever scala version we like. also i found the upgrade
>>>> > path from scala 2.10 to 2.11 to be very easy, so i have a hard time
>>>> > understanding why anyone would stay on scala 2.10. and finally with
>>>> > scala 2.12 around the corner you really don't want to be supporting 3
>>>> > versions. so clearly i am missing something here.
>>>> >
>>>> >
>>>> >
>>>> > On Thu, Mar 24, 2016 at 8:52 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> > wrote:
>>>> >>
>>>> >> +1 to support Java 8 (and future) *only* in Spark 2.0, and end
>>>> >> support of Java 7. It makes sense.
>>>> >>
>>>> >> Regards
>>>> >> JB
>>>> >>
>>>> >>
>>>> >> On 03/24/2016 08:27 AM, Reynold Xin wrote:
>>>> >>>
>>>> >>> About a year ago we decided to drop Java 6 support in Spark 1.5. I am
>>>> >>> wondering if we should also just drop Java 7 support in Spark 2.0
>>>> >>> (i.e. Spark 2.0 would require Java 8 to run).
>>>> >>>
>>>> >>> Oracle ended public updates for JDK 7 one year ago (Apr 2015), and
>>>> >>> removed public downloads for JDK 7 in July 2015. In the past I've
>>>> >>> actually been against dropping Java 7, but today I ran into an issue
>>>> >>> with the new Dataset API not working well with Java 8 lambdas, and
>>>> >>> that changed my opinion on this.
>>>> >>>
>>>> >>> I've been thinking more about this issue today and also talked with a
>>>> >>> lot of people offline to gather feedback, and I actually think the
>>>> >>> pros outweigh the cons, for the following reasons (in some rough
>>>> >>> order of importance):
>>>> >>>
>>>> >>> 1. It is complicated to test how well Spark APIs work for Java lambdas
>>>> >>> if we support Java 7. Jenkins machines need to have both Java 7 and
>>>> >>> Java 8 installed and we must run through a set of test suites in 7,
>>>> >>> and then the lambda tests in Java 8. This complicates build
>>>> >>> environments/scripts, and makes them less robust. Without good testing
>>>> >>> infrastructure, I have no confidence in building good APIs for Java 8.
>>>> >>>
>>>> >>> 2. Dataset/DataFrame performance will be between 1x and 10x slower in
>>>> >>> Java 7. The primary APIs we want users to use in Spark 2.x are
>>>> >>> Dataset/DataFrame, and this impacts pretty much everything from
>>>> >>> machine learning to structured streaming. We have made great progress
>>>> >>> in their performance through extensive use of code generation. (In
>>>> >>> many dimensions Spark 2.0 with DataFrames/Datasets looks more like a
>>>> >>> compiler than a MapReduce or query engine.) These optimizations don't
>>>> >>> work well in Java 7 due to broken code cache flushing. This problem
>>>> >>> has been fixed by Oracle in Java 8. In addition, Java 8 comes with
>>>> >>> better support for Unsafe and SIMD.
>>>> >>>
>>>> >>> 3. Scala 2.12 will come out soon, and we will want to add support for
>>>> >>> that. Scala 2.12 only works on Java 8. If we do support Java 7, we'd
>>>> >>> have a fairly complicated compatibility matrix and testing
>>>> >>> infrastructure.
>>>> >>>
>>>> >>> 4. There are libraries that I've looked into in the past that support
>>>> >>> only Java 8. This is more common in high performance libraries such
>>>> >>> as Aeron (a messaging library). Having to support Java 7 means we are
>>>> >>> not able to use these. It is not that big of a deal right now, but
>>>> >>> will become increasingly difficult as we optimize performance.
>>>> >>>
>>>> >>>
>>>> >>> The downside of not supporting Java 7 is also obvious. Some
>>>> >>> organizations are stuck with Java 7, and they wouldn't be able to use
>>>> >>> Spark 2.0 without upgrading Java.
>>>> >>>
>>>> >>>
>>>> >>
>>>> >> --
>>>> >> Jean-Baptiste Onofré
>>>> >> jbono...@apache.org
>>>> >> http://blog.nanthrax.net
>>>> >> Talend - http://www.talend.com
>>>> >>
>>>> >>
>>>> >>
>>>> >
>>>>
>>>
>>>
>>
