Oh, forgot to add: splitting the source tree by Scala version also creates a big
maintenance burden for third-party libraries built on Spark. As Josh said on the
JIRA:

"I think this is primarily going to be an issue for end users who want to use 
an existing source tree to cross-compile for Scala 2.10, 2.11, and 2.12. Thus 
the pain of the source incompatibility would mostly be felt by library/package 
maintainers but it can be worked around as long as there's at least some common 
subset which is source compatible across all of those versions."

This means that all the data sources, ML algorithms, etc. developed outside our 
source tree would have to do the same thing we do internally.
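
For a library maintainer, that cross-compile setup looks roughly like the sketch
below; the project name, versions, and directory layout are hypothetical, not
taken from any real package:

    // build.sbt -- cross-building one source tree against several Scala versions.
    name := "my-spark-datasource"

    // Build against every Scala version the library wants to support; this only
    // works if Spark itself publishes artifacts for each of them.
    crossScalaVersions := Seq("2.10.7", "2.11.12", "2.12.4")
    scalaVersion := crossScalaVersions.value.head

    // Code that is source compatible across versions stays in src/main/scala.
    // Anything that is not has to be duplicated: sbt picks up per-version
    // directories such as src/main/scala-2.11 and src/main/scala-2.12, which is
    // the same split Spark would have to carry internally.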

> On Apr 5, 2018, at 10:30 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> 
> Sorry, but just to be clear here, this is the 2.12 API issue: 
> https://issues.apache.org/jira/browse/SPARK-14643, with more details in this 
> doc: 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
> 
> Basically, if we are allowed to change Spark’s API a little to have only one 
> version of methods that are currently overloaded between Java and Scala, we 
> can get away with a single source tree for all Scala versions and Java ABI 
> compatibility against any type of Spark (whether using Scala 2.11 or 2.12). 
> On the other hand, if we want to keep the API and ABI of the Spark 2.x 
> branch, we’ll need a different source tree for Scala 2.12 with different 
> copies of pretty large classes such as RDD, DataFrame and DStream, and Java 
> users may have to change their code when linking against different versions 
> of Spark.
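
As a hedged illustration of the overload problem described above (the types
below are made-up stand-ins, not Spark's actual signatures):

    // JFunction stands in for a Java functional interface; MiniDataset for
    // classes like RDD/Dataset that overload methods for Java and Scala callers.
    trait JFunction[T, R] { def call(t: T): R }

    class MiniDataset[T] {
      def map[U](f: T => U): MiniDataset[U] = new MiniDataset[U]          // Scala-friendly overload
      def map[U](f: JFunction[T, U]): MiniDataset[U] = new MiniDataset[U] // Java-friendly overload
    }

    object OverloadDemo {
      val ds = new MiniDataset[Int]
      // Under 2.11 a Scala lambda can only match the Function1 overload.  Under
      // 2.12, SAM conversion lets a lambda target JFunction as well, and with
      // Spark's real signatures (which also involve implicits such as Encoders)
      // that leads to the problems described in the doc above, hence the idea of
      // keeping only one flavor of each method.
      val doubled = ds.map((x: Int) => x * 2)
    }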
> 
> This is of course only one of the possible ABI changes, but it is a 
> considerable engineering effort, so we’d have to sign up for maintaining all 
> these different source files. It seems kind of silly given that Scala 2.12 
> was released in 2016, so we’re doing all this work to keep ABI compatibility 
> for Scala 2.11, which isn’t even that widely used any more for new projects. 
> Also keep in mind that the next Spark release will probably take at least 3-4 
> months, so we’re talking about what people will be using in fall 2018.
> 
> Matei
> 
>> On Apr 5, 2018, at 10:13 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
>> 
>> I remember seeing somewhere that Scala still has some issues with Java
>> 9/10 so that might be hard...
>> 
>> But on that topic, it might be better to shoot for Java 11
>> compatibility. 9 and 10, following the new release model, aren't
>> really meant to be long-term releases.
>> 
>> In general, agree with Sean here. Doesn't look like 2.12 support
>> requires unexpected API breakages. So unless there's a really good
>> reason to break / remove a bunch of existing APIs...
>> 
>> On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido <marcogaid...@gmail.com> wrote:
>>> Hi all,
>>> 
>>> I also agree with Mark that we should add Java 9/10 support to an eventual
>>> Spark 3.0 release. Supporting Java 9 is not a trivial task, since we are
>>> using some internal APIs for memory management which have changed: either we
>>> find a solution which works on both (but I am not sure that is feasible), or
>>> we have to switch between two implementations according to the Java version.
>>> So I'd rather avoid doing this in a non-major release.
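
A minimal sketch of that second option, switching between two implementations
based on the running Java version; the object and method names here are
hypothetical, not Spark's actual memory-manager code:

    // Hypothetical helper: pick an implementation based on the Java version.
    object JavaVersionSwitch {
      // "java.specification.version" is "1.8" on Java 8 and "9", "10", "11", ... later.
      val javaMajorVersion: Int = {
        val v = System.getProperty("java.specification.version")
        if (v.startsWith("1.")) v.stripPrefix("1.").toInt else v.takeWhile(_.isDigit).toInt
      }

      def freeDirectBuffer(buffer: java.nio.ByteBuffer): Unit =
        if (javaMajorVersion >= 9) cleanWithInvokeCleaner(buffer)
        else cleanWithLegacyCleaner(buffer)

      // Placeholders: on Java 9+ one would reflectively call
      // sun.misc.Unsafe#invokeCleaner(ByteBuffer); on Java 8 the old
      // sun.nio.ch.DirectBuffer#cleaner()#clean() path.
      private def cleanWithInvokeCleaner(buffer: java.nio.ByteBuffer): Unit = ()
      private def cleanWithLegacyCleaner(buffer: java.nio.ByteBuffer): Unit = ()
    }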
>>> 
>>> Thanks,
>>> Marco
>>> 
>>> 
>>> 2018-04-05 17:35 GMT+02:00 Mark Hamstra <m...@clearstorydata.com>:
>>>> 
>>>> As with Sean, I'm not sure that this will require a new major version, but
>>>> we should also be looking at Java 9 & 10 support -- particularly with regard
>>>> to their better functionality in a containerized environment (memory limits
>>>> from cgroups, not sysconf; support for cpusets). In that regard, we should
>>>> also be looking at using the latest Scala 2.11.x maintenance release in
>>>> current Spark branches.
>>>> 
>>>> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen <sro...@gmail.com> wrote:
>>>>> 
>>>>> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin <r...@databricks.com> wrote:
>>>>>> 
>>>>>> The primary motivating factor IMO for a major version bump is to support
>>>>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>>>>>> Similar to Spark 2.0, I think there are also opportunities for other changes
>>>>>> that we know have been biting us for a long time but can’t be changed in
>>>>>> feature releases (to be clear, I’m actually not sure they are all good
>>>>>> ideas, but I’m writing them down as candidates for consideration):
>>>>> 
>>>>> 
>>>>> IIRC from looking at this, it is possible to support 2.11 and 2.12
>>>>> simultaneously. The cross-build already works now in 2.3.0. Barring some big
>>>>> change needed to get 2.12 fully working -- and that may be the case -- it
>>>>> nearly works that way now.
>>>>> 
>>>>> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
>>>>> in byte code. However Scala itself isn't mutually compatible between 2.11
>>>>> and 2.12 anyway; that's never been promised as compatible.
>>>>> 
>>>>> (Interesting question about what *Java* users should expect; they would
>>>>> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>>>>> 
>>>>> I don't disagree with shooting for Spark 3.0, just saying I don't know if
>>>>> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
>>>>> 2.11 support if needed to make supporting 2.12 less painful.
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Marcelo
>> 
> 


