Re: time for Apache Spark 3.0?

2018-11-13 Thread Matt Cheah
I just added the label to https://issues.apache.org/jira/browse/SPARK-25908. 
Unsure if there are any others. I’ll look through the tickets and see if there 
are any that are missing the label.

 

-Matt Cheah

 

From: Sean Owen 
Date: Tuesday, November 13, 2018 at 12:09 PM
To: Matt Cheah 
Cc: Sean Owen , Vinoo Ganesh , dev 

Subject: Re: time for Apache Spark 3.0?

 

As far as I know any JIRA that has implications for users is tagged this way 
but I haven't examined all of them. All that are going in for 3.0 should have 
it as Fix Version. Most changes won't have a user-visible impact. Do you see 
any that seem to need the tag? Call them out or even fix them by adding the tag 
and proposed release notes. 

 

On Tue, Nov 13, 2018, 11:49 AM Matt Cheah  wrote:

My non-definitive takes --

I would personally like to remove all deprecated methods for Spark 3.
I started by removing 'old' deprecated methods in that commit. Whether
things deprecated in 2.4 should also be removed is less clear.

Everything's fair game for removal or change in a major release. So
far some items in discussion seem to be Scala 2.11 support, Python 2
support, and support for R versions before 3.4. I don't know about other APIs.

Generally, take a look at JIRA for items targeted at version 3.0. Not
everything targeted for 3.0 is going in, but ones from committers are
more likely than others. Breaking changes ought to be tagged
'release-notes' with a description of the change. The release itself
has a migration guide that's being updated as we go.


On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah  wrote:
>
> I wanted to clarify what categories of APIs are eligible to be broken in 
Spark 3.0. Specifically:
>
>
>
> Are we removing all deprecated methods? If we’re only removing some 
subset of deprecated methods, what is that subset? I see a bunch were removed 
in 
https://github.com/apache/spark/pull/22921
 for example. Are we only committed to removing methods that were deprecated in 
some Spark version and earlier?
> Aside from removing support for Scala 2.11, what other kinds of 
(non-experimental and non-evolving) APIs are eligible to be broken?
> Is there going to be a way to track the current list of all proposed 
breaking changes / JIRA tickets? Perhaps we can include it in the JIRA ticket 
that can be filtered down to somehow?
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org






Re: time for Apache Spark 3.0?

2018-11-13 Thread Sean Owen
As far as I know any JIRA that has implications for users is tagged this
way but I haven't examined all of them. All that are going in for 3.0
should have it as Fix Version. Most changes won't have a user-visible
impact. Do you see any that seem to need the tag? Call them out or even fix
them by adding the tag and proposed release notes.

On Tue, Nov 13, 2018, 11:49 AM Matt Cheah wrote:
> The release-notes label on JIRA sounds good. Can we make it a point to
> have that done retroactively now, and then moving forward?
>
> On 11/12/18, 4:01 PM, "Sean Owen"  wrote:
>
> My non-definitive takes --
>
> I would personally like to remove all deprecated methods for Spark 3.
> I started by removing 'old' deprecated methods in that commit. Whether
> things deprecated in 2.4 should also be removed is less clear.
>
> Everything's fair game for removal or change in a major release. So
> far some items in discussion seem to be Scala 2.11 support, Python 2
> support, and support for R versions before 3.4. I don't know about other APIs.
>
> Generally, take a look at JIRA for items targeted at version 3.0. Not
> everything targeted for 3.0 is going in, but ones from committers are
> more likely than others. Breaking changes ought to be tagged
> 'release-notes' with a description of the change. The release itself
> has a migration guide that's being updated as we go.
>
>
> On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah 
> wrote:
> >
> > I wanted to clarify what categories of APIs are eligible to be
> broken in Spark 3.0. Specifically:
> >
> >
> >
> > Are we removing all deprecated methods? If we’re only removing some
> subset of deprecated methods, what is that subset? I see a bunch were
> removed in
> https://github.com/apache/spark/pull/22921
> for example. Are we only committed to removing methods that were deprecated
> in some Spark version and earlier?
> > Aside from removing support for Scala 2.11, what other kinds of
> (non-experimental and non-evolving) APIs are eligible to be broken?
> > Is there going to be a way to track the current list of all proposed
> breaking changes / JIRA tickets? Perhaps we can include it in the JIRA
> ticket that can be filtered down to somehow?
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>


Re: time for Apache Spark 3.0?

2018-11-13 Thread Matt Cheah
The release-notes label on JIRA sounds good. Can we make it a point to have 
that done retroactively now, and then moving forward?

On 11/12/18, 4:01 PM, "Sean Owen"  wrote:

My non-definitive takes --

I would personally like to remove all deprecated methods for Spark 3.
I started by removing 'old' deprecated methods in that commit. Whether
things deprecated in 2.4 should also be removed is less clear.

Everything's fair game for removal or change in a major release. So
far some items in discussion seem to be Scala 2.11 support, Python 2
support, and support for R versions before 3.4. I don't know about other APIs.

Generally, take a look at JIRA for items targeted at version 3.0. Not
everything targeted for 3.0 is going in, but ones from committers are
more likely than others. Breaking changes ought to be tagged
'release-notes' with a description of the change. The release itself
has a migration guide that's being updated as we go.


On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah  wrote:
>
> I wanted to clarify what categories of APIs are eligible to be broken in 
Spark 3.0. Specifically:
>
>
>
> Are we removing all deprecated methods? If we’re only removing some 
subset of deprecated methods, what is that subset? I see a bunch were removed 
in 
https://github.com/apache/spark/pull/22921
 for example. Are we only committed to removing methods that were deprecated in 
some Spark version and earlier?
> Aside from removing support for Scala 2.11, what other kinds of 
(non-experimental and non-evolving) APIs are eligible to be broken?
> Is there going to be a way to track the current list of all proposed 
breaking changes / JIRA tickets? Perhaps we can include it in the JIRA ticket 
that can be filtered down to somehow?
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org






Re: time for Apache Spark 3.0?

2018-11-12 Thread Sean Owen
My non-definitive takes --

I would personally like to remove all deprecated methods for Spark 3.
I started by removing 'old' deprecated methods in that commit. Whether
things deprecated in 2.4 should also be removed is less clear.

Everything's fair game for removal or change in a major release. So
far some items in discussion seem to be Scala 2.11 support, Python 2
support, and support for R versions before 3.4. I don't know about other APIs.

Generally, take a look at JIRA for items targeted at version 3.0. Not
everything targeted for 3.0 is going in, but ones from committers are
more likely than others. Breaking changes ought to be tagged
'release-notes' with a description of the change. The release itself
has a migration guide that's being updated as we go.
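
To make the mechanics concrete, here is a rough sketch in Scala of the
deprecation lifecycle being discussed; the names below are hypothetical, not
actual Spark APIs, and the version strings in the annotations are only
illustrative.

    // Hypothetical sketch of the deprecation lifecycle; not actual Spark APIs.
    object TableCatalog {
      // Kept through the 2.x line for compatibility; callers get a compiler
      // warning pointing them at the replacement.
      @deprecated("Use createOrReplaceView instead", "2.0.0")
      def registerView(name: String): Unit = createOrReplaceView(name)

      // The supported entry point. In a 3.0-style cleanup, the deprecated
      // method above is simply deleted and this one remains.
      def createOrReplaceView(name: String): Unit =
        println(s"registered view: $name")
    }

    object DeprecationDemo extends App {
      TableCatalog.registerView("events")        // deprecation warning on 2.x
      TableCatalog.createOrReplaceView("events") // the migration target
    }

Callers that still compile with warnings on 2.x are exactly the ones that break
once the deprecated entry point is removed in a major release, which is why the
'release-notes' tag and the migration guide matter.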


On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah  wrote:
>
> I wanted to clarify what categories of APIs are eligible to be broken in 
> Spark 3.0. Specifically:
>
>
>
> Are we removing all deprecated methods? If we’re only removing some subset of 
> deprecated methods, what is that subset? I see a bunch were removed in 
> https://github.com/apache/spark/pull/22921 for example. Are we only committed 
> to removing methods that were deprecated in some Spark version and earlier?
> Aside from removing support for Scala 2.11, what other kinds of 
> (non-experimental and non-evolving) APIs are eligible to be broken?
> Is there going to be a way to track the current list of all proposed breaking 
> changes / JIRA tickets? Perhaps we can include it in the JIRA ticket that can 
> be filtered down to somehow?
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: time for Apache Spark 3.0?

2018-11-12 Thread Reynold Xin
All API removal and deprecation JIRAs should be tagged "releasenotes", so
we can reference them when we build release notes. I don't know if
everybody is still following that practice, but it'd be great to do that.
Since we don't have that many PRs, we should still be able to retroactively
tag.

We can also add a new tag for API changes, but I feel at this stage it
might be easier to just use "releasenotes".
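
For tracking the set Matt asked about, a saved JIRA filter along these lines
should work. This is just a sketch: it assumes the label ends up spelled either
"releasenotes" or "release-notes" (both spellings appear in this thread) and
that the breaking changes carry a 3.0.0 fix version.

    project = SPARK AND labels IN ("releasenotes", "release-notes")
      AND fixVersion = "3.0.0" ORDER BY key ASC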


On Mon, Nov 12, 2018 at 3:49 PM Matt Cheah  wrote:

> I wanted to clarify what categories of APIs are eligible to be broken in
> Spark 3.0. Specifically:
>
>
>
>- Are we removing all deprecated methods? If we’re only removing some
>subset of deprecated methods, what is that subset? I see a bunch were
>removed in https://github.com/apache/spark/pull/22921 for example. Are
>we only committed to removing methods that were deprecated in some Spark
>version and earlier?
>- Aside from removing support for Scala 2.11, what other kinds of
>(non-experimental and non-evolving) APIs are eligible to be broken?
>- Is there going to be a way to track the current list of all proposed
>breaking changes / JIRA tickets? Perhaps we can include it in the JIRA
>ticket that can be filtered down to somehow?
>
>
>
> Thanks,
>
>
>
> -Matt Cheah
>
> *From: *Vinoo Ganesh 
> *Date: *Monday, November 12, 2018 at 2:48 PM
> *To: *Reynold Xin 
> *Cc: *Xiao Li , Matei Zaharia <
> matei.zaha...@gmail.com>, Ryan Blue , Mark Hamstra <
> m...@clearstorydata.com>, dev 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Makes sense, thanks Reynold.
>
>
>
> *From: *Reynold Xin 
> *Date: *Monday, November 12, 2018 at 16:57
> *To: *Vinoo Ganesh 
> *Cc: *Xiao Li , Matei Zaharia <
> matei.zaha...@gmail.com>, Ryan Blue , Mark Hamstra <
> m...@clearstorydata.com>, dev 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Master branch now tracks the 3.0.0-SNAPSHOT version, so the next one will be
> 3.0. In terms of timeline, unless we change anything specifically, Spark
> feature releases are on a 6-month cadence. Spark 2.4 was just released last
> week, so 3.0 will be roughly 6 months from now.
>
>
>
> On Mon, Nov 12, 2018 at 1:54 PM Vinoo Ganesh  wrote:
>
> Quickly following up on this – is there a target date for when Spark 3.0
> may be released and/or a list of the likely api breaks that are
> anticipated?
>
>
>
> *From: *Xiao Li 
> *Date: *Saturday, September 29, 2018 at 02:09
> *To: *Reynold Xin 
> *Cc: *Matei Zaharia , Ryan Blue <
> rb...@netflix.com>, Mark Hamstra , "
> u...@spark.apache.org" 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Yes. We should create a SPIP for each major breaking change.
>
>
>
> Reynold Xin wrote on Friday, September 28, 2018 at 11:05 PM:
>
> i think we should create spips for some of them, since they are pretty
> large ... i can create some tickets to start with
>
>
> --
>
> excuse the brevity and lower case due to wrist injury
>
>
>
>
>
> On Fri, Sep 28, 2018 at 11:01 PM Xiao Li  wrote:
>
> Based on the above discussions, we have a "rough consensus" that the next
> release will be 3.0. Now, we can start working on the API breaking changes
> (e.g., the ones mentioned in the original email from Reynold).
>
>
>
> Cheers,
>
>
>
> Xiao
>
>
>
> Matei Zaharia wrote on Thursday, September 6, 2018 at 2:21 PM:
>
> Yes, you can start with Unstable and move to Evolving and Stable when
> needed. We’ve definitely had experimental features that changed across
> maintenance releases when they were well-isolated. If your change risks
> breaking stuff in stable components of Spark though, then it probably won’t
> be suitable for that.
>
> > On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> >
> > I meant flexibility beyond the point releases. I think what Reynold was
> suggesting was getting v2 code out more often than the point releases every
> 6 months. An Evolving API can change in point releases, but maybe we should
> move v2 to Unstable so it can change more often? I don't really see another
> way to get changes out more often.
> >
> > On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
> wrote:
> > Yes, that is why we have these annotations in the code and the
> corresponding labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java

Re: time for Apache Spark 3.0?

2018-11-12 Thread Matt Cheah
I wanted to clarify what categories of APIs are eligible to be broken in Spark 
3.0. Specifically:

 
Are we removing all deprecated methods? If we’re only removing some subset of 
deprecated methods, what is that subset? I see a bunch were removed in 
https://github.com/apache/spark/pull/22921 for example. Are we only committed 
to removing methods that were deprecated in some Spark version and earlier?
Aside from removing support for Scala 2.11, what other kinds of 
(non-experimental and non-evolving) APIs are eligible to be broken?
Is there going to be a way to track the current list of all proposed breaking 
changes / JIRA tickets? Perhaps we can include it in the JIRA ticket that can 
be filtered down to somehow?
 

Thanks,

 

-Matt Cheah

From: Vinoo Ganesh 
Date: Monday, November 12, 2018 at 2:48 PM
To: Reynold Xin 
Cc: Xiao Li , Matei Zaharia , 
Ryan Blue , Mark Hamstra , dev 

Subject: Re: time for Apache Spark 3.0?

 

Makes sense, thanks Reynold. 

 

From: Reynold Xin 
Date: Monday, November 12, 2018 at 16:57
To: Vinoo Ganesh 
Cc: Xiao Li , Matei Zaharia , 
Ryan Blue , Mark Hamstra , dev 

Subject: Re: time for Apache Spark 3.0?

 

Master branch now tracks the 3.0.0-SNAPSHOT version, so the next one will be 3.0. 
In terms of timeline, unless we change anything specifically, Spark feature 
releases are on a 6-month cadence. Spark 2.4 was just released last week, so 3.0 
will be roughly 6 months from now.

 

On Mon, Nov 12, 2018 at 1:54 PM Vinoo Ganesh  wrote:

Quickly following up on this – is there a target date for when Spark 3.0 may be 
released and/or a list of the likely api breaks that are anticipated? 

 

From: Xiao Li 
Date: Saturday, September 29, 2018 at 02:09
To: Reynold Xin 
Cc: Matei Zaharia , Ryan Blue , 
Mark Hamstra , "u...@spark.apache.org" 

Subject: Re: time for Apache Spark 3.0?

 

Yes. We should create a SPIP for each major breaking change. 

 

Reynold Xin wrote on Friday, September 28, 2018 at 11:05 PM:

i think we should create spips for some of them, since they are pretty large 
... i can create some tickets to start with 


--

excuse the brevity and lower case due to wrist injury

 

 

On Fri, Sep 28, 2018 at 11:01 PM Xiao Li  wrote:

Based on the above discussions, we have a "rough consensus" that the next 
release will be 3.0. Now, we can start working on the API breaking changes 
(e.g., the ones mentioned in the original email from Reynold). 

 

Cheers,

 

Xiao 

 

Matei Zaharia wrote on Thursday, September 6, 2018 at 2:21 PM:

Yes, you can start with Unstable and move to Evolving and Stable when needed. 
We’ve definitely had experimental features that changed across maintenance 
releases when they were well-isolated. If your change risks breaking stuff in 
stable components of Spark though, then it probably won’t be suitable for that.

> On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> 
> I meant flexibility beyond the point releases. I think what Reynold was 
> suggesting was getting v2 code out more often than the point releases every 6 
> months. An Evolving API can change in point releases, but maybe we should 
> move v2 to Unstable so it can change more often? I don't really see another 
> way to get changes out more often.
> 
> On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra  wrote:
> Yes, that is why we have these annotations in the code and the corresponding 
> labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
> 
> As long as it is properly annotated, we can change or even eliminate an API 
> method before the next major release. And frankly, we shouldn't be 
> contemplating bringing in the DS v2 API (and, I'd argue, any new API) without 
> such an annotation. There is just too much risk of not getting everything 
> right before we see the results of the new API being more widely used, and 
> too much cost in maintaining until the next major release something that we 
> come to regret for us to create new API in a fully frozen state.
>  
> 
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:
> It would be great to get more features out incrementally. For experimental 
> features, do we have more relaxed constraints?
> 
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
> +1 on 3.0
> 
> Dsv2 stable can still evolve across major releases. DataFrame, Dataset, 
> dsv1 and a lot of other major features all were developed throughout the 1.x 
> and 2.x lines.
> 
> I do want to explore ways for us to get dsv2 incremental changes out there 
> more frequently, to get feedback. Maybe that means we apply additive changes 
> to 2.4.x; maybe that means making another 2.5 release sooner. I will start a 
> separate thread about it.
> 
> 
> 
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
> I think this doesn't necessarily mean 3.0 i

Re: time for Apache Spark 3.0?

2018-11-12 Thread Vinoo Ganesh
Makes sense, thanks Reynold.

From: Reynold Xin 
Date: Monday, November 12, 2018 at 16:57
To: Vinoo Ganesh 
Cc: Xiao Li , Matei Zaharia , 
Ryan Blue , Mark Hamstra , dev 

Subject: Re: time for Apache Spark 3.0?

Master branch now tracks the 3.0.0-SNAPSHOT version, so the next one will be 3.0. 
In terms of timeline, unless we change anything specifically, Spark feature 
releases are on a 6-month cadence. Spark 2.4 was just released last week, so 3.0 
will be roughly 6 months from now.

On Mon, Nov 12, 2018 at 1:54 PM Vinoo Ganesh <vgan...@palantir.com> wrote:
Quickly following up on this – is there a target date for when Spark 3.0 may be 
released and/or a list of the likely api breaks that are anticipated?

From: Xiao Li <gatorsm...@gmail.com>
Date: Saturday, September 29, 2018 at 02:09
To: Reynold Xin <r...@databricks.com>
Cc: Matei Zaharia <matei.zaha...@gmail.com>, 
Ryan Blue <rb...@netflix.com>, Mark Hamstra 
<m...@clearstorydata.com>, 
"u...@spark.apache.org" <dev@spark.apache.org>
Subject: Re: time for Apache Spark 3.0?

Yes. We should create a SPIP for each major breaking change.

Reynold Xin <r...@databricks.com> wrote on Friday, September 28, 2018 at 11:05 PM:
i think we should create spips for some of them, since they are pretty large 
... i can create some tickets to start with

--
excuse the brevity and lower case due to wrist injury


On Fri, Sep 28, 2018 at 11:01 PM Xiao Li <gatorsm...@gmail.com> wrote:
Based on the above discussions, we have a "rough consensus" that the next 
release will be 3.0. Now, we can start working on the API breaking changes 
(e.g., the ones mentioned in the original email from Reynold).

Cheers,

Xiao

Matei Zaharia <matei.zaha...@gmail.com> wrote on Thursday, September 6, 2018 at 2:21 PM:
Yes, you can start with Unstable and move to Evolving and Stable when needed. 
We’ve definitely had experimental features that changed across maintenance 
releases when they were well-isolated. If your change risks breaking stuff in 
stable components of Spark though, then it probably won’t be suitable for that.

> On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
>
> I meant flexibility beyond the point releases. I think what Reynold was 
> suggesting was getting v2 code out more often than the point releases every 6 
> months. An Evolving API can change in point releases, but maybe we should 
> move v2 to Unstable so it can change more often? I don't really see another 
> way to get changes out more often.
>
> On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
> <m...@clearstorydata.com> wrote:
> Yes, that is why we have these annotations in the code and the corresponding 
> labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
>
> As long as it is properly annotated, we can change or even eliminate an API 
> method before the next major release. And frankly, we shouldn't be 
> contemplating bringing in the DS v2 API (and, I'd argue, any new API) without 
> such an annotation. There is just too much risk of not getting everything 
> right before we see the results of the new API being more widely used, and 
> too much cost in maintaining until the next major release something that we 
> come to regret for us to create new API in a fully frozen state.
>
>
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:
> It would be great to get more features out incrementally. For experimental 
> features, do we have more relaxed constraints?
>
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin 
> <r...@databricks.com> wrote:
> +1 on 3.0
>
> Dsv2 stable can still evolve across major releases. DataFrame, Dataset, 
> dsv1 and a lot of other major features all were developed throughout the 1.x 
> and 2.x lines.
>
> I do want to explore ways for us to get dsv2 incremental changes out there 
> more frequently, to get feedback. Maybe that means we apply additive changes 
> to 2.4.x; maybe that means making another 2.5 release sooner. I will start a 
> separate thread about it.
>
>
>
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen 
> <sro...@gmail.com> wrote:
> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on timing? 
> 6 months?) but simply next. Do you mean you'd prefer that change to happen 

Re: time for Apache Spark 3.0?

2018-11-12 Thread Reynold Xin
Master branch now tracks the 3.0.0-SNAPSHOT version, so the next one will be
3.0. In terms of timeline, unless we change anything specifically, Spark
feature releases are on a 6-month cadence. Spark 2.4 was just released last
week, so 3.0 will be roughly 6 months from now.

On Mon, Nov 12, 2018 at 1:54 PM Vinoo Ganesh  wrote:

> Quickly following up on this – is there a target date for when Spark 3.0
> may be released and/or a list of the likely api breaks that are
> anticipated?
>
>
>
> *From: *Xiao Li 
> *Date: *Saturday, September 29, 2018 at 02:09
> *To: *Reynold Xin 
> *Cc: *Matei Zaharia , Ryan Blue <
> rb...@netflix.com>, Mark Hamstra , "
> u...@spark.apache.org" 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Yes. We should create a SPIP for each major breaking change.
>
>
>
> Reynold Xin wrote on Friday, September 28, 2018 at 11:05 PM:
>
> i think we should create spips for some of them, since they are pretty
> large ... i can create some tickets to start with
>
>
> --
>
> excuse the brevity and lower case due to wrist injury
>
>
>
>
>
> On Fri, Sep 28, 2018 at 11:01 PM Xiao Li  wrote:
>
> Based on the above discussions, we have a "rough consensus" that the next
> release will be 3.0. Now, we can start working on the API breaking changes
> (e.g., the ones mentioned in the original email from Reynold).
>
>
>
> Cheers,
>
>
>
> Xiao
>
>
>
> Matei Zaharia wrote on Thursday, September 6, 2018 at 2:21 PM:
>
> Yes, you can start with Unstable and move to Evolving and Stable when
> needed. We’ve definitely had experimental features that changed across
> maintenance releases when they were well-isolated. If your change risks
> breaking stuff in stable components of Spark though, then it probably won’t
> be suitable for that.
>
> > On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> >
> > I meant flexibility beyond the point releases. I think what Reynold was
> suggesting was getting v2 code out more often than the point releases every
> 6 months. An Evolving API can change in point releases, but maybe we should
> move v2 to Unstable so it can change more often? I don't really see another
> way to get changes out more often.
> >
> > On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
> wrote:
> > Yes, that is why we have these annotations in the code and the
> corresponding labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
> >
> > As long as it is properly annotated, we can change or even eliminate an
> API method before the next major release. And frankly, we shouldn't be
> contemplating bringing in the DS v2 API (and, I'd argue, any new API)
> without such an annotation. There is just too much risk of not getting
> everything right before we see the results of the new API being more widely
> used, and too much cost in maintaining until the next major release
> something that we come to regret for us to create new API in a fully frozen
> state.
> >
> >
> > On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue 
> wrote:
> > It would be great to get more features out incrementally. For
> experimental features, do we have more relaxed constraints?
> >
> > On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
> > +1 on 3.0
> >
> > Dsv2 stable can still evolve across major releases. DataFrame,
> Dataset, dsv1 and a lot of other major features all were developed
> throughout the 1.x and 2.x lines.
> >
> > I do want to explore ways for us to get dsv2 incremental changes out
> there more frequently, to get feedback. Maybe that means we apply additive
> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
> will start a separate thread about it.
> >
> >
> >
> > On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
> > I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
> happen before 3.x? if it's a significant change, seems reasonable for a
> major version bump rather than minor. Is the concern that tying it to 3.0
> means you have to take a major version update to get it?
> >
> > I generally support moving on to 3.x so we can also jettison a lot of

Re: time for Apache Spark 3.0?

2018-11-12 Thread Vinoo Ganesh
Quickly following up on this – is there a target date for when Spark 3.0 may be 
released and/or a list of the likely api breaks that are anticipated?

From: Xiao Li 
Date: Saturday, September 29, 2018 at 02:09
To: Reynold Xin 
Cc: Matei Zaharia , Ryan Blue , 
Mark Hamstra , "u...@spark.apache.org" 

Subject: Re: time for Apache Spark 3.0?

Yes. We should create a SPIP for each major breaking change.

Reynold Xin <r...@databricks.com> wrote on Friday, September 28, 2018 at 11:05 PM:
i think we should create spips for some of them, since they are pretty large 
... i can create some tickets to start with

--
excuse the brevity and lower case due to wrist injury


On Fri, Sep 28, 2018 at 11:01 PM Xiao Li <gatorsm...@gmail.com> wrote:
Based on the above discussions, we have a "rough consensus" that the next 
release will be 3.0. Now, we can start working on the API breaking changes 
(e.g., the ones mentioned in the original email from Reynold).

Cheers,

Xiao

Matei Zaharia <matei.zaha...@gmail.com> wrote on Thursday, September 6, 2018 at 2:21 PM:
Yes, you can start with Unstable and move to Evolving and Stable when needed. 
We’ve definitely had experimental features that changed across maintenance 
releases when they were well-isolated. If your change risks breaking stuff in 
stable components of Spark though, then it probably won’t be suitable for that.

> On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
>
> I meant flexibility beyond the point releases. I think what Reynold was 
> suggesting was getting v2 code out more often than the point releases every 6 
> months. An Evolving API can change in point releases, but maybe we should 
> move v2 to Unstable so it can change more often? I don't really see another 
> way to get changes out more often.
>
> On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
> <m...@clearstorydata.com> wrote:
> Yes, that is why we have these annotations in the code and the corresponding 
> labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
>
> As long as it is properly annotated, we can change or even eliminate an API 
> method before the next major release. And frankly, we shouldn't be 
> contemplating bringing in the DS v2 API (and, I'd argue, any new API) without 
> such an annotation. There is just too much risk of not getting everything 
> right before we see the results of the new API being more widely used, and 
> too much cost in maintaining until the next major release something that we 
> come to regret for us to create new API in a fully frozen state.
>
>
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:
> It would be great to get more features out incrementally. For experimental 
> features, do we have more relaxed constraints?
>
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin 
> <r...@databricks.com> wrote:
> +1 on 3.0
>
> Dsv2 stable can still evolve across major releases. DataFrame, Dataset, 
> dsv1 and a lot of other major features all were developed throughout the 1.x 
> and 2.x lines.
>
> I do want to explore ways for us to get dsv2 incremental changes out there 
> more frequently, to get feedback. Maybe that means we apply additive changes 
> to 2.4.x; maybe that means making another 2.5 release sooner. I will start a 
> separate thread about it.
>
>
>
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen 
> <sro...@gmail.com> wrote:
> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on timing? 
> 6 months?) but simply next. Do you mean you'd prefer that change to happen 
> before 3.x? if it's a significant change, seems reasonable for a major 
> version bump rather than minor. Is the concern that tying it to 3.0 means you 
> have to take a major version update to get it?
>
> I generally support moving on to 3.x so we can also jettison a lot of older 
> dependencies, code, fix some long standing issues, etc.
>
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>
> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue  wrote:
> My concern is that the v2 data source API is still evolving and not very 
> close to stable. I had hoped to have stabilized the API and behaviors for a 
> 3.0 release. But we could also wait on that for a 4.0 release, depending on 
> when we think that will be.
>
> Unless there is a pressing need to move to

Re: time for Apache Spark 3.0?

2018-09-29 Thread Xiao Li
Yes. We should create a SPIP for each major breaking change.

Reynold Xin wrote on Friday, September 28, 2018 at 11:05 PM:

> i think we should create spips for some of them, since they are pretty
> large ... i can create some tickets to start with
>
> --
> excuse the brevity and lower case due to wrist injury
>
>
> On Fri, Sep 28, 2018 at 11:01 PM Xiao Li  wrote:
>
>> Based on the above discussions, we have a "rough consensus" that the next
>> release will be 3.0. Now, we can start working on the API breaking changes
>> (e.g., the ones mentioned in the original email from Reynold).
>>
>> Cheers,
>>
>> Xiao
>>
>> Matei Zaharia wrote on Thursday, September 6, 2018 at 2:21 PM:
>>
>>> Yes, you can start with Unstable and move to Evolving and Stable when
>>> needed. We’ve definitely had experimental features that changed across
>>> maintenance releases when they were well-isolated. If your change risks
>>> breaking stuff in stable components of Spark though, then it probably won’t
>>> be suitable for that.
>>>
>>> > On Sep 6, 2018, at 1:49 PM, Ryan Blue 
>>> wrote:
>>> >
>>> > I meant flexibility beyond the point releases. I think what Reynold
>>> was suggesting was getting v2 code out more often than the point releases
>>> every 6 months. An Evolving API can change in point releases, but maybe we
>>> should move v2 to Unstable so it can change more often? I don't really see
>>> another way to get changes out more often.
>>> >
>>> > On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
>>> wrote:
>>> > Yes, that is why we have these annotations in the code and the
>>> corresponding labels appearing in the API documentation:
>>> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
>>> >
>>> > As long as it is properly annotated, we can change or even eliminate
>>> an API method before the next major release. And frankly, we shouldn't be
>>> contemplating bringing in the DS v2 API (and, I'd argue, any new API)
>>> without such an annotation. There is just too much risk of not getting
>>> everything right before we see the results of the new API being more widely
>>> used, and too much cost in maintaining until the next major release
>>> something that we come to regret for us to create new API in a fully frozen
>>> state.
>>> >
>>> >
>>> > On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue 
>>> wrote:
>>> > It would be great to get more features out incrementally. For
>>> experimental features, do we have more relaxed constraints?
>>> >
>>> > On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin 
>>> wrote:
>>> > +1 on 3.0
>>> >
>>> > Dsv2 stable can still evolve across major releases. DataFrame,
>>> Dataset, dsv1 and a lot of other major features all were developed
>>> throughout the 1.x and 2.x lines.
>>> >
>>> > I do want to explore ways for us to get dsv2 incremental changes out
>>> there more frequently, to get feedback. Maybe that means we apply additive
>>> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
>>> will start a separate thread about it.
>>> >
>>> >
>>> >
>>> > On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
>>> > I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
>>> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
>>> happen before 3.x? if it's a significant change, seems reasonable for a
>>> major version bump rather than minor. Is the concern that tying it to 3.0
>>> means you have to take a major version update to get it?
>>> >
>>> > I generally support moving on to 3.x so we can also jettison a lot of
>>> older dependencies, code, fix some long standing issues, etc.
>>> >
>>> > (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>>> >
>>> > On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
>>> wrote:
>>> > My concern is that the v2 data source API is still evolving and not
>>> very close to stable. I had hoped to have stabilized the API and behaviors
>>> for a 3.0 release. But we could also wait on that for a 4.0 release,
>>> depending on when we think that will be.
>>> >
>>> > Unless there is a pressing need to move to 3.0 for some other area, I
>>> think it would be better for the v2 sources to have a 2.5 release.
>>> >
>>> > On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>>> > Yesterday, the 2.4 branch was created. Based on the above discussion,
>>> I think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>>> >
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Re: time for Apache Spark 3.0?

2018-09-29 Thread Reynold Xin
i think we should create spips for some of them, since they are pretty
large ... i can create some tickets to start with

--
excuse the brevity and lower case due to wrist injury


On Fri, Sep 28, 2018 at 11:01 PM Xiao Li  wrote:

> Based on the above discussions, we have a "rough consensus" that the next
> release will be 3.0. Now, we can start working on the API breaking changes
> (e.g., the ones mentioned in the original email from Reynold).
>
> Cheers,
>
> Xiao
>
> Matei Zaharia wrote on Thursday, September 6, 2018 at 2:21 PM:
>
>> Yes, you can start with Unstable and move to Evolving and Stable when
>> needed. We’ve definitely had experimental features that changed across
>> maintenance releases when they were well-isolated. If your change risks
>> breaking stuff in stable components of Spark though, then it probably won’t
>> be suitable for that.
>>
>> > On Sep 6, 2018, at 1:49 PM, Ryan Blue 
>> wrote:
>> >
>> > I meant flexibility beyond the point releases. I think what Reynold was
>> suggesting was getting v2 code out more often than the point releases every
>> 6 months. An Evolving API can change in point releases, but maybe we should
>> move v2 to Unstable so it can change more often? I don't really see another
>> way to get changes out more often.
>> >
>> > On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
>> wrote:
>> > Yes, that is why we have these annotations in the code and the
>> corresponding labels appearing in the API documentation:
>> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
>> >
>> > As long as it is properly annotated, we can change or even eliminate an
>> API method before the next major release. And frankly, we shouldn't be
>> contemplating bringing in the DS v2 API (and, I'd argue, any new API)
>> without such an annotation. There is just too much risk of not getting
>> everything right before we see the results of the new API being more widely
>> used, and too much cost in maintaining until the next major release
>> something that we come to regret for us to create new API in a fully frozen
>> state.
>> >
>> >
>> > On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue 
>> wrote:
>> > It would be great to get more features out incrementally. For
>> experimental features, do we have more relaxed constraints?
>> >
>> > On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
>> > +1 on 3.0
>> >
>> > Dsv2 stable can still evolve across major releases. DataFrame,
>> Dataset, dsv1 and a lot of other major features all were developed
>> throughout the 1.x and 2.x lines.
>> >
>> > I do want to explore ways for us to get dsv2 incremental changes out
>> there more frequently, to get feedback. Maybe that means we apply additive
>> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
>> will start a separate thread about it.
>> >
>> >
>> >
>> > On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
>> > I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
>> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
>> happen before 3.x? if it's a significant change, seems reasonable for a
>> major version bump rather than minor. Is the concern that tying it to 3.0
>> means you have to take a major version update to get it?
>> >
>> > I generally support moving on to 3.x so we can also jettison a lot of
>> older dependencies, code, fix some long standing issues, etc.
>> >
>> > (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>> >
>> > On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
>> wrote:
>> > My concern is that the v2 data source API is still evolving and not
>> very close to stable. I had hoped to have stabilized the API and behaviors
>> for a 3.0 release. But we could also wait on that for a 4.0 release,
>> depending on when we think that will be.
>> >
>> > Unless there is a pressing need to move to 3.0 for some other area, I
>> think it would be better for the v2 sources to have a 2.5 release.
>> >
>> > On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>> > Yesterday, the 2.4 branch was created. Based on the above discussion, I
>> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>> >
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: time for Apache Spark 3.0?

2018-09-29 Thread Xiao Li
Based on the above discussions, we have a "rough consensus" that the next
release will be 3.0. Now, we can start working on the API breaking changes
(e.g., the ones mentioned in the original email from Reynold).

Cheers,

Xiao

Matei Zaharia wrote on Thursday, September 6, 2018 at 2:21 PM:

> Yes, you can start with Unstable and move to Evolving and Stable when
> needed. We’ve definitely had experimental features that changed across
> maintenance releases when they were well-isolated. If your change risks
> breaking stuff in stable components of Spark though, then it probably won’t
> be suitable for that.
>
> > On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> >
> > I meant flexibility beyond the point releases. I think what Reynold was
> suggesting was getting v2 code out more often than the point releases every
> 6 months. An Evolving API can change in point releases, but maybe we should
> move v2 to Unstable so it can change more often? I don't really see another
> way to get changes out more often.
> >
> > On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
> wrote:
> > Yes, that is why we have these annotations in the code and the
> corresponding labels appearing in the API documentation:
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
> >
> > As long as it is properly annotated, we can change or even eliminate an
> API method before the next major release. And frankly, we shouldn't be
> contemplating bringing in the DS v2 API (and, I'd argue, any new API)
> without such an annotation. There is just too much risk of not getting
> everything right before we see the results of the new API being more widely
> used, and too much cost in maintaining until the next major release
> something that we come to regret for us to create new API in a fully frozen
> state.
> >
> >
> > On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue 
> wrote:
> > It would be great to get more features out incrementally. For
> experimental features, do we have more relaxed constraints?
> >
> > On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
> > +1 on 3.0
> >
> > Dsv2 stable can still evolve across major releases. DataFrame,
> Dataset, dsv1 and a lot of other major features all were developed
> throughout the 1.x and 2.x lines.
> >
> > I do want to explore ways for us to get dsv2 incremental changes out
> there more frequently, to get feedback. Maybe that means we apply additive
> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
> will start a separate thread about it.
> >
> >
> >
> > On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
> > I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
> happen before 3.x? if it's a significant change, seems reasonable for a
> major version bump rather than minor. Is the concern that tying it to 3.0
> means you have to take a major version update to get it?
> >
> > I generally support moving on to 3.x so we can also jettison a lot of
> older dependencies, code, fix some long standing issues, etc.
> >
> > (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
> >
> > On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
> wrote:
> > My concern is that the v2 data source API is still evolving and not very
> close to stable. I had hoped to have stabilized the API and behaviors for a
> 3.0 release. But we could also wait on that for a 4.0 release, depending on
> when we think that will be.
> >
> > Unless there is a pressing need to move to 3.0 for some other area, I
> think it would be better for the v2 sources to have a 2.5 release.
> >
> > On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
> > Yesterday, the 2.4 branch was created. Based on the above discussion, I
> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: time for Apache Spark 3.0?

2018-09-06 Thread Matei Zaharia
Yes, you can start with Unstable and move to Evolving and Stable when needed. 
We’ve definitely had experimental features that changed across maintenance 
releases when they were well-isolated. If your change risks breaking stuff in 
stable components of Spark though, then it probably won’t be suitable for that.

> On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> 
> I meant flexibility beyond the point releases. I think what Reynold was 
> suggesting was getting v2 code out more often than the point releases every 6 
> months. An Evolving API can change in point releases, but maybe we should 
> move v2 to Unstable so it can change more often? I don't really see another 
> way to get changes out more often.
> 
> On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra  wrote:
> Yes, that is why we have these annotations in the code and the corresponding 
> labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
> 
> As long as it is properly annotated, we can change or even eliminate an API 
> method before the next major release. And frankly, we shouldn't be 
> contemplating bringing in the DS v2 API (and, I'd argue, any new API) without 
> such an annotation. There is just too much risk of not getting everything 
> right before we see the results of the new API being more widely used, and 
> too much cost in maintaining until the next major release something that we 
> come to regret for us to create new API in a fully frozen state.
>  
> 
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:
> It would be great to get more features out incrementally. For experimental 
> features, do we have more relaxed constraints?
> 
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
> +1 on 3.0
> 
> Dsv2 stable can still evolve across major releases. DataFrame, Dataset, 
> dsv1 and a lot of other major features all were developed throughout the 1.x 
> and 2.x lines.
> 
> I do want to explore ways for us to get dsv2 incremental changes out there 
> more frequently, to get feedback. Maybe that means we apply additive changes 
> to 2.4.x; maybe that means making another 2.5 release sooner. I will start a 
> separate thread about it.
> 
> 
> 
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on timing? 
> 6 months?) but simply next. Do you mean you'd prefer that change to happen 
> before 3.x? if it's a significant change, seems reasonable for a major 
> version bump rather than minor. Is the concern that tying it to 3.0 means you 
> have to take a major version update to get it?
> 
> I generally support moving on to 3.x so we can also jettison a lot of older 
> dependencies, code, fix some long standing issues, etc.
> 
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
> 
> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue  wrote:
> My concern is that the v2 data source API is still evolving and not very 
> close to stable. I had hoped to have stabilized the API and behaviors for a 
> 3.0 release. But we could also wait on that for a 4.0 release, depending on 
> when we think that will be.
> 
> Unless there is a pressing need to move to 3.0 for some other area, I think 
> it would be better for the v2 sources to have a 2.5 release.
> 
> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
> Yesterday, the 2.4 branch was created. Based on the above discussion, I think 
> we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
I meant flexibility beyond the point releases. I think what Reynold was
suggesting was getting v2 code out more often than the point releases every
6 months. An Evolving API can change in point releases, but maybe we should
move v2 to Unstable so it can change more often? I don't really see another
way to get changes out more often.

On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
wrote:

> Yes, that is why we have these annotations in the code and the
> corresponding labels appearing in the API documentation:
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
>
> As long as it is properly annotated, we can change or even eliminate an
> API method before the next major release. And frankly, we shouldn't be
> contemplating bringing in the DS v2 API (and, I'd argue, *any* new API)
> without such an annotation. There is just too much risk of not getting
> everything right before we see the results of the new API being more widely
> used, and too much cost in maintaining until the next major release
> something that we come to regret for us to create new API in a fully frozen
> state.
>
>
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue 
> wrote:
>
>> It would be great to get more features out incrementally. For
>> experimental features, do we have more relaxed constraints?
>>
>> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
>>
>>> +1 on 3.0
>>>
>>> Dsv2 stable can still evolve across major releases. DataFrame,
>>> Dataset, dsv1 and a lot of other major features all were developed
>>> throughout the 1.x and 2.x lines.
>>>
>>> I do want to explore ways for us to get dsv2 incremental changes out
>>> there more frequently, to get feedback. Maybe that means we apply additive
>>> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
>>> will start a separate thread about it.
>>>
>>>
>>>
>>> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
>>>
 I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
 timing? 6 months?) but simply next. Do you mean you'd prefer that change to
 happen before 3.x? if it's a significant change, seems reasonable for a
 major version bump rather than minor. Is the concern that tying it to 3.0
 means you have to take a major version update to get it?

 I generally support moving on to 3.x so we can also jettison a lot of
 older dependencies, code, fix some long standing issues, etc.

 (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)

 On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
 wrote:

> My concern is that the v2 data source API is still evolving and not
> very close to stable. I had hoped to have stabilized the API and behaviors
> for a 3.0 release. But we could also wait on that for a 4.0 release,
> depending on when we think that will be.
>
> Unless there is a pressing need to move to 3.0 for some other area, I
> think it would be better for the v2 sources to have a 2.5 release.
>
> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>
>> Yesterday, the 2.4 branch was created. Based on the above discussion,
>> I think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: time for Apache Spark 3.0?

2018-09-06 Thread Mark Hamstra
Yes, that is why we have these annotations in the code and the
corresponding labels appearing in the API documentation:
https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java

As long as it is properly annotated, we can change or even eliminate an API
method before the next major release. And frankly, we shouldn't be
contemplating bringing in the DS v2 API (and, I'd argue, *any* new API)
without such an annotation. There is just too much risk of not getting
everything right before we see the results of the new API being more widely
used, and too much cost in maintaining until the next major release
something that we come to regret for us to create new API in a fully frozen
state.
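
A minimal, self-contained sketch of the pattern, using stand-in annotations
modeled on Spark's InterfaceStability ones (the real annotations are Java
annotations in org.apache.spark.annotation; the trait below is hypothetical):

    import scala.annotation.StaticAnnotation

    // Stand-ins for Spark's InterfaceStability.{Stable, Evolving, Unstable}.
    class Stable   extends StaticAnnotation // held compatible within a major release
    class Evolving extends StaticAnnotation // may still change in minor releases
    class Unstable extends StaticAnnotation // no stability guarantee at all

    // A new API surface starts life as Unstable or Evolving so it can still be
    // reshaped before anyone treats it as frozen.
    @Evolving
    trait BatchReader {
      def next(): Boolean
      def get(): AnyRef
    }

As long as the annotation (and the matching label in the generated API docs) is
there, changing or removing the method before the next major release does not
break any stated promise.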


On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:

> It would be great to get more features out incrementally. For experimental
> features, do we have more relaxed constraints?
>
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
>
>> +1 on 3.0
>>
>> Dsv2 stable can still evolve across major releases. DataFrame,
>> Dataset, dsv1 and a lot of other major features all were developed
>> throughout the 1.x and 2.x lines.
>>
>> I do want to explore ways for us to get dsv2 incremental changes out
>> there more frequently, to get feedback. Maybe that means we apply additive
>> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
>> will start a separate thread about it.
>>
>>
>>
>> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
>>
>>> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
>>> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
>>> happen before 3.x? if it's a significant change, seems reasonable for a
>>> major version bump rather than minor. Is the concern that tying it to 3.0
>>> means you have to take a major version update to get it?
>>>
>>> I generally support moving on to 3.x so we can also jettison a lot of
>>> older dependencies, code, fix some long standing issues, etc.
>>>
>>> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>>>
>>> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
>>> wrote:
>>>
 My concern is that the v2 data source API is still evolving and not
 very close to stable. I had hoped to have stabilized the API and behaviors
 for a 3.0 release. But we could also wait on that for a 4.0 release,
 depending on when we think that will be.

 Unless there is a pressing need to move to 3.0 for some other area, I
 think it would be better for the v2 sources to have a 2.5 release.

 On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:

> Yesterday, the 2.4 branch was created. Based on the above discussion,
> I think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: time for Apache Spark 3.0?

2018-09-06 Thread sadhen
I’d like to see an independent Spark Catalyst, without Spark Core and Hadoop 
dependencies, in Spark 3.0.


I created Enzyme (a Spark SQL-compatible SQL engine that depends on Spark 
Catalyst) at Wacai for performance reasons in a non-distributed scenario.


Enzyme is a simplified version of Spark SQL, similar to liancheng’s toy 
project https://github.com/liancheng/spear, but it aims to keep compatibility with 
Spark SQL and DataFrame, with Hive UDF support.


The implementation of Enzyme is a shameless mimic of existing code from Spark SQL. 
Beyond that, I tuned it for better performance and lower memory and CPU usage.


We mainly use Enzyme to offer SQL as a DSL in our internal product for data 
analysts. People from other companies in China are interested in using Enzyme 
for ML serving. My colleagues are trying to use Enzyme in Flink Streaming 
because we can reuse our existing Hive UDFs with Enzyme.


This is my reason for making Spark Catalyst independent. We will open source
Enzyme several months from now.


Spark Catalyst is awesome. Personally, I hope it goes beyond Spark and
eventually becomes a great alternative to Calcite.
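
To make the "independent Catalyst" idea concrete, here is a minimal sketch of
the kind of usage it would enable, built on Catalyst's existing parser entry
points (it assumes only the spark-catalyst artifact on the classpath; today
that still pulls in more transitive dependencies than an embedded engine like
Enzyme would want, which is the point of the proposal):

    import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

    object CatalystOnlyDemo {
      def main(args: Array[String]): Unit = {
        // Parse a SQL statement into an unresolved Catalyst logical plan.
        // No SparkSession, cluster, or Hadoop configuration is created here.
        val plan = CatalystSqlParser.parsePlan(
          "SELECT name, age FROM people WHERE age > 21")
        println(plan.treeString)

        // Expressions can be parsed on their own too, which is handy for
        // the SQL-as-a-DSL use case described above.
        val expr = CatalystSqlParser.parseExpression("age > 21 AND name IS NOT NULL")
        println(expr)
      }
    }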




Best Regards,
Darcy Shen


Original Message
From: Xiao Li gatorsm...@gmail.com
To: vaquar khan vaquar.k...@gmail.com
Cc: Reynold Xin r...@databricks.com; Mridul Muralidharan mri...@gmail.com; Mark
Hamstra m...@clearstorydata.com; 银狐 andyye...@gmail.com;
user@spark.apache.org...@spark.apache.org
Sent: Thursday, September 6, 2018, 23:59
Subject: Re: time for Apache Spark 3.0?


Yesterday, the 2.4 branch was created. Based on the above discussion, I think 
we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?


Thanks,


Xiao


vaquar khan vaquar.k...@gmail.com wrote on Saturday, June 16, 2018 at 10:21 AM:

+1 for 2.4 next, followed by 3.0.  

Where we can get Apache Spark road map for 2.4 and 2.5  3.0 ?
is it possible we can share future release proposed specification same like 
releases (https://spark.apache.org/releases/spark-release-2-3-0.html)


Regards,
Viquar khan


On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan vaquar.k...@gmail.com wrote:

Plz ignore last email link (you tube )not sure how it added .
Apologies not sure how to delete it.




On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan vaquar.k...@gmail.com wrote:

+1


https://www.youtube.com/watch?v=-ik7aJ5U6kg



Regards,
Vaquar khan


On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin r...@databricks.com wrote:

Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.




On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan mri...@gmail.com wrote:

I agree, I dont see pressing need for major version bump as well.
 
 
 Regards,
 Mridul
 On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra m...@clearstorydata.com wrote:
 
  Changing major version numbers is not about new features or a vague notion 
that it is time to do something that will be seen to be a significant release. 
It is about breaking stable public APIs.
 
  I still remain unconvinced that the next version can't be 2.4.0.
 
  On Fri, Jun 15, 2018 at 1:34 AM Andy andyye...@gmail.com wrote:
 
  Dear all:
 
  It have been 2 months since this topic being proposed. Any progress now? 2018 
has been passed about 1/2.
 
  I agree with that the new version should be some exciting new feature. How 
about this one:
 
  6. ML/DL framework to be integrated as core component and feature. (Such as 
Angel / BigDL / ……)
 
  3.0 is a very important version for an good open source project. It should be 
better to drift away the historical burden and focus in new area. Spark has 
been widely used all over the world as a successful big data framework. And it 
can be better than that.
 
  Andy
 
 
  On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin r...@databricks.com wrote:
 
  There was a discussion thread on scala-contributors about Apache Spark not 
yet supporting Scala 2.12, and that got me to think perhaps it is about time 
for Spark to work towards the 3.0 release. By the time it comes out, it will be 
more than 2 years since Spark 2.0.
 
  For contributors less familiar with Spark’s history, I want to give more 
context on Spark releases:
 
  1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If we 
were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0 in 2018.
 
  2. Spark’s versioning policy promises that Spark does not break stable APIs 
in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a 
necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 3.0).
 
  3. That said, a major version isn’t necessarily the playground for disruptive 
API changes to make it painful for users to update. The main purpose of a major 
release is an opportunity to fix things that are broken in the current API and 
remove certain deprecated APIs.
 
  4. Spark as a project has a culture of evolving architecture and developing 
major new features incrementally, so major releases are not the only time for 
exciting new features. For example, the bulk of the work in the move towards

Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
It would be great to get more features out incrementally. For experimental
features, do we have more relaxed constraints?

On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:

> +1 on 3.0
>
> Dsv2 stable can still evolve across major releases. DataFrame, Dataset,
> dsv1 and a lot of other major features all were developed throughout the
> 1.x and 2.x lines.
>
> I do want to explore ways for us to get dsv2 incremental changes out there
> more frequently, to get feedback. Maybe that means we apply additive
> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
> will start a separate thread about it.
>
>
>
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
>
>> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
>> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
>> happen before 3.x? if it's a significant change, seems reasonable for a
>> major version bump rather than minor. Is the concern that tying it to 3.0
>> means you have to take a major version update to get it?
>>
>> I generally support moving on to 3.x so we can also jettison a lot of
>> older dependencies, code, fix some long standing issues, etc.
>>
>> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>>
>> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
>> wrote:
>>
>>> My concern is that the v2 data source API is still evolving and not very
>>> close to stable. I had hoped to have stabilized the API and behaviors for a
>>> 3.0 release. But we could also wait on that for a 4.0 release, depending on
>>> when we think that will be.
>>>
>>> Unless there is a pressing need to move to 3.0 for some other area, I
>>> think it would be better for the v2 sources to have a 2.5 release.
>>>
>>> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>>>
 Yesterday, the 2.4 branch was created. Based on the above discussion, I
 think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?



-- 
Ryan Blue
Software Engineer
Netflix


Re: time for Apache Spark 3.0?

2018-09-06 Thread Reynold Xin
I definitely agree we shouldn't make dsv2 stable in the next release.

On Thu, Sep 6, 2018 at 9:48 AM Ryan Blue  wrote:

> I definitely support moving to 3.0 to remove deprecations and update
> dependencies.
>
> For the v2 work, we know that there will be major API changes and
> standardization of behavior from the new logical plans going into the next
> release. I think it is a safe bet that this isn’t going to be completely
> done for the next release, so it will still be experimental or unstable for
> 3.0. I also expect that there will be some things that we want to
> deprecate. Ideally, that deprecation could happen before a major release so
> we can remove it.
>
> I don’t have a problem releasing 3.0 with an unstable v2 API or targeting
> 4.0 to remove behavior and APIs replaced by v2. But, I want to make sure we
> consider it when deciding what the next release should be.
>
> It is probably better to release 3.0 now because it isn’t clear when the
> v2 API will become stable. And if we choose to release 3.0 next, we should
> *not* aim to stabilize v2 for that release. Not that we shouldn’t try to
> make it stable as soon as possible, I just think that it is unlikely to
> happen in time and we should not rush to claim it is stable.
>
> rb
>
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
>
>> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
>> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
>> happen before 3.x? if it's a significant change, seems reasonable for a
>> major version bump rather than minor. Is the concern that tying it to 3.0
>> means you have to take a major version update to get it?
>>
>> I generally support moving on to 3.x so we can also jettison a lot of
>> older dependencies, code, fix some long standing issues, etc.
>>
>> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>>
>> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
>> wrote:
>>
>>> My concern is that the v2 data source API is still evolving and not very
>>> close to stable. I had hoped to have stabilized the API and behaviors for a
>>> 3.0 release. But we could also wait on that for a 4.0 release, depending on
>>> when we think that will be.
>>>
>>> Unless there is a pressing need to move to 3.0 for some other area, I
>>> think it would be better for the v2 sources to have a 2.5 release.
>>>
>>> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>>>
 Yesterday, the 2.4 branch was created. Based on the above discussion, I
 think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?


>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
I definitely support moving to 3.0 to remove deprecations and update
dependencies.

For the v2 work, we know that there will be major API changes and
standardization of behavior from the new logical plans going into the next
release. I think it is a safe bet that this isn’t going to be completely
done for the next release, so it will still be experimental or unstable for
3.0. I also expect that there will be some things that we want to
deprecate. Ideally, that deprecation could happen before a major release so
we can remove it.
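
As a small sketch of that deprecate-then-remove cycle (the object and method
names are invented for the example), the idea is to ship the deprecation in a
feature release so the removal can land in the following major release:

    object ExampleApi {
      // Deprecated in a 2.x feature release; the deprecated overload can then
      // be removed in the next major release after the deprecation.
      @deprecated("Use run(config) instead", "2.4.0")
      def run(): Unit = run(Map.empty)

      def run(config: Map[String, String]): Unit = {
        println(s"running with ${config.size} config entries")
      }
    }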

I don’t have a problem releasing 3.0 with an unstable v2 API or targeting
4.0 to remove behavior and APIs replaced by v2. But, I want to make sure we
consider it when deciding what the next release should be.

It is probably better to release 3.0 now because it isn’t clear when the v2
API will become stable. And if we choose to release 3.0 next, we should
*not* aim to stabilize v2 for that release. Not that we shouldn’t try to
make it stable as soon as possible, I just think that it is unlikely to
happen in time and we should not rush to claim it is stable.

rb

On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:

> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
> happen before 3.x? if it's a significant change, seems reasonable for a
> major version bump rather than minor. Is the concern that tying it to 3.0
> means you have to take a major version update to get it?
>
> I generally support moving on to 3.x so we can also jettison a lot of
> older dependencies, code, fix some long standing issues, etc.
>
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>
> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
> wrote:
>
>> My concern is that the v2 data source API is still evolving and not very
>> close to stable. I had hoped to have stabilized the API and behaviors for a
>> 3.0 release. But we could also wait on that for a 4.0 release, depending on
>> when we think that will be.
>>
>> Unless there is a pressing need to move to 3.0 for some other area, I
>> think it would be better for the v2 sources to have a 2.5 release.
>>
>> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>>
>>> Yesterday, the 2.4 branch was created. Based on the above discussion, I
>>> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>>>
>>>

-- 
Ryan Blue
Software Engineer
Netflix


Re: time for Apache Spark 3.0?

2018-09-06 Thread Reynold Xin
+1 on 3.0

Dsv2 stable can still evolve across major releases. DataFrame, Dataset,
dsv1 and a lot of other major features all were developed throughout the
1.x and 2.x lines.

I do want to explore ways for us to get dsv2 incremental changes out there
more frequently, to get feedback. Maybe that means we apply additive
changes to 2.4.x; maybe that means making another 2.5 release sooner. I
will start a separate thread about it.



On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:

> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
> happen before 3.x? if it's a significant change, seems reasonable for a
> major version bump rather than minor. Is the concern that tying it to 3.0
> means you have to take a major version update to get it?
>
> I generally support moving on to 3.x so we can also jettison a lot of
> older dependencies, code, fix some long standing issues, etc.
>
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>
> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
> wrote:
>
>> My concern is that the v2 data source API is still evolving and not very
>> close to stable. I had hoped to have stabilized the API and behaviors for a
>> 3.0 release. But we could also wait on that for a 4.0 release, depending on
>> when we think that will be.
>>
>> Unless there is a pressing need to move to 3.0 for some other area, I
>> think it would be better for the v2 sources to have a 2.5 release.
>>
>> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>>
>>> Yesterday, the 2.4 branch was created. Based on the above discussion, I
>>> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>>>
>>>


Re: time for Apache Spark 3.0?

2018-09-06 Thread Sean Owen
I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
timing? 6 months?) but simply next. Do you mean you'd prefer that change to
happen before 3.x? If it's a significant change, it seems reasonable for a
major version bump rather than a minor one. Is the concern that tying it to 3.0
means you have to take a major version update to get it?

I generally support moving on to 3.x so we can also jettison a lot of older
dependencies, code, fix some long standing issues, etc.

(BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)

On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue  wrote:

> My concern is that the v2 data source API is still evolving and not very
> close to stable. I had hoped to have stabilized the API and behaviors for a
> 3.0 release. But we could also wait on that for a 4.0 release, depending on
> when we think that will be.
>
> Unless there is a pressing need to move to 3.0 for some other area, I
> think it would be better for the v2 sources to have a 2.5 release.
>
> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:
>
>> Yesterday, the 2.4 branch was created. Based on the above discussion, I
>> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>>
>>


Re: time for Apache Spark 3.0?

2018-09-06 Thread Ryan Blue
My concern is that the v2 data source API is still evolving and not very
close to stable. I had hoped to have stabilized the API and behaviors for a
3.0 release. But we could also wait on that for a 4.0 release, depending on
when we think that will be.

Unless there is a pressing need to move to 3.0 for some other area, I think
it would be better for the v2 sources to have a 2.5 release.

On Thu, Sep 6, 2018 at 8:59 AM Xiao Li  wrote:

> Yesterday, the 2.4 branch was created. Based on the above discussion, I
> think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?
>
> Thanks,
>
> Xiao
>
vaquar khan  wrote on Saturday, June 16, 2018 at 10:21 AM:
>
>> +1  for 2.4 next, followed by 3.0.
>>
>> Where we can get Apache Spark road map for 2.4 and 2.5  3.0 ?
>> is it possible we can share future release proposed specification same
>> like  releases (
>> https://spark.apache.org/releases/spark-release-2-3-0.html)
>> Regards,
>> Viquar khan
>>
>> On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan 
>> wrote:
>>
>>> Plz ignore last email link (you tube )not sure how it added .
>>> Apologies not sure how to delete it.
>>>
>>>
>>> On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan 
>>> wrote:
>>>
 +1

 https://www.youtube.com/watch?v=-ik7aJ5U6kg

 Regards,
 Vaquar khan

 On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin 
 wrote:

> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>
>
> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
> wrote:
>
>> I agree, I dont see pressing need for major version bump as well.
>>
>>
>> Regards,
>> Mridul
>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra <
>> m...@clearstorydata.com> wrote:
>> >
>> > Changing major version numbers is not about new features or a vague
>> notion that it is time to do something that will be seen to be a
>> significant release. It is about breaking stable public APIs.
>> >
>> > I still remain unconvinced that the next version can't be 2.4.0.
>> >
>> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
>> >>
>> >> Dear all:
>> >>
>> >> It have been 2 months since this topic being proposed. Any
>> progress now? 2018 has been passed about 1/2.
>> >>
>> >> I agree with that the new version should be some exciting new
>> feature. How about this one:
>> >>
>> >> 6. ML/DL framework to be integrated as core component and feature.
>> (Such as Angel / BigDL / ……)
>> >>
>> >> 3.0 is a very important version for an good open source project.
>> It should be better to drift away the historical burden and focus in new
>> area. Spark has been widely used all over the world as a successful big
>> data framework. And it can be better than that.
>> >>
>> >> Andy
>> >>
>> >>
>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin 
>> wrote:
>> >>>
>> >>> There was a discussion thread on scala-contributors about Apache
>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it 
>> is
>> about time for Spark to work towards the 3.0 release. By the time it 
>> comes
>> out, it will be more than 2 years since Spark 2.0.
>> >>>
>> >>> For contributors less familiar with Spark’s history, I want to
>> give more context on Spark releases:
>> >>>
>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>> Spark 3.0 in 2018.
>> >>>
>> >>> 2. Spark’s versioning policy promises that Spark does not break
>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 
>> to
>> 2.0, 2.x to 3.0).
>> >>>
>> >>> 3. That said, a major version isn’t necessarily the playground
>> for disruptive API changes to make it painful for users to update. The 
>> main
>> purpose of a major release is an opportunity to fix things that are 
>> broken
>> in the current API and remove certain deprecated APIs.
>> >>>
>> >>> 4. Spark as a project has a culture of evolving architecture and
>> developing major new features incrementally, so major releases are not 
>> the
>> only time for exciting new features. For example, the bulk of the work in
>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>> Processing was introduced in Spark 2.3. Both were feature releases rather
>> than major releases.
>> >>>
>> >>>
>> >>> You can find more background in the thread discussing Spark 2.0:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>> >>>
>> >>>
>> >>> The primary motivating factor IMO for a major version bump is to
>> support Scala 2.12, which requires minor API breaking 

Re: time for Apache Spark 3.0?

2018-09-06 Thread Xiao Li
Yesterday, the 2.4 branch was created. Based on the above discussion, I
think we can bump the master branch to 3.0.0-SNAPSHOT. Any concern?

Thanks,

Xiao

vaquar khan  wrote on Saturday, June 16, 2018 at 10:21 AM:

> +1  for 2.4 next, followed by 3.0.
>
> Where we can get Apache Spark road map for 2.4 and 2.5  3.0 ?
> is it possible we can share future release proposed specification same
> like  releases (https://spark.apache.org/releases/spark-release-2-3-0.html
> )
> Regards,
> Viquar khan
>
> On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan 
> wrote:
>
>> Plz ignore last email link (you tube )not sure how it added .
>> Apologies not sure how to delete it.
>>
>>
>> On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan 
>> wrote:
>>
>>> +1
>>>
>>> https://www.youtube.com/watch?v=-ik7aJ5U6kg
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin 
>>> wrote:
>>>
 Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.


 On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
 wrote:

> I agree, I dont see pressing need for major version bump as well.
>
>
> Regards,
> Mridul
> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
> wrote:
> >
> > Changing major version numbers is not about new features or a vague
> notion that it is time to do something that will be seen to be a
> significant release. It is about breaking stable public APIs.
> >
> > I still remain unconvinced that the next version can't be 2.4.0.
> >
> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
> >>
> >> Dear all:
> >>
> >> It have been 2 months since this topic being proposed. Any progress
> now? 2018 has been passed about 1/2.
> >>
> >> I agree with that the new version should be some exciting new
> feature. How about this one:
> >>
> >> 6. ML/DL framework to be integrated as core component and feature.
> (Such as Angel / BigDL / ……)
> >>
> >> 3.0 is a very important version for an good open source project. It
> should be better to drift away the historical burden and focus in new 
> area.
> Spark has been widely used all over the world as a successful big data
> framework. And it can be better than that.
> >>
> >> Andy
> >>
> >>
> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin 
> wrote:
> >>>
> >>> There was a discussion thread on scala-contributors about Apache
> Spark not yet supporting Scala 2.12, and that got me to think perhaps it 
> is
> about time for Spark to work towards the 3.0 release. By the time it comes
> out, it will be more than 2 years since Spark 2.0.
> >>>
> >>> For contributors less familiar with Spark’s history, I want to
> give more context on Spark releases:
> >>>
> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
> Spark 3.0 in 2018.
> >>>
> >>> 2. Spark’s versioning policy promises that Spark does not break
> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
> 2.0, 2.x to 3.0).
> >>>
> >>> 3. That said, a major version isn’t necessarily the playground for
> disruptive API changes to make it painful for users to update. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs.
> >>>
> >>> 4. Spark as a project has a culture of evolving architecture and
> developing major new features incrementally, so major releases are not the
> only time for exciting new features. For example, the bulk of the work in
> the move towards the DataFrame API was done in Spark 1.3, and Continuous
> Processing was introduced in Spark 2.3. Both were feature releases rather
> than major releases.
> >>>
> >>>
> >>> You can find more background in the thread discussing Spark 2.0:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
> >>>
> >>>
> >>> The primary motivating factor IMO for a major version bump is to
> support Scala 2.12, which requires minor API breaking changes to Spark’s
> APIs. Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for 
> consideration):
> >>>
> >>> 1. Support Scala 2.12.
> >>>
> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
> in Spark 2.x.
> >>>
> >>> 3. Shade all dependencies.
> >>>
> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL

Re: time for Apache Spark 3.0?

2018-06-16 Thread vaquar khan
+1  for 2.4 next, followed by 3.0.

Where can we get the Apache Spark road map for 2.4, 2.5, and 3.0?
Is it possible to share a proposed specification for future releases, the same
as for past releases (https://spark.apache.org/releases/spark-release-2-3-0.html)?
Regards,
Viquar khan

On Sat, Jun 16, 2018 at 12:02 PM, vaquar khan  wrote:

> Plz ignore last email link (you tube )not sure how it added .
> Apologies not sure how to delete it.
>
>
> On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan 
> wrote:
>
>> +1
>>
>> https://www.youtube.com/watch?v=-ik7aJ5U6kg
>>
>> Regards,
>> Vaquar khan
>>
>> On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin  wrote:
>>
>>> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>>>
>>>
>>> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
>>> wrote:
>>>
 I agree, I dont see pressing need for major version bump as well.


 Regards,
 Mridul
 On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
 wrote:
 >
 > Changing major version numbers is not about new features or a vague
 notion that it is time to do something that will be seen to be a
 significant release. It is about breaking stable public APIs.
 >
 > I still remain unconvinced that the next version can't be 2.4.0.
 >
 > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
 >>
 >> Dear all:
 >>
 >> It have been 2 months since this topic being proposed. Any progress
 now? 2018 has been passed about 1/2.
 >>
 >> I agree with that the new version should be some exciting new
 feature. How about this one:
 >>
 >> 6. ML/DL framework to be integrated as core component and feature.
 (Such as Angel / BigDL / ……)
 >>
 >> 3.0 is a very important version for an good open source project. It
 should be better to drift away the historical burden and focus in new area.
 Spark has been widely used all over the world as a successful big data
 framework. And it can be better than that.
 >>
 >> Andy
 >>
 >>
 >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin 
 wrote:
 >>>
 >>> There was a discussion thread on scala-contributors about Apache
 Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
 about time for Spark to work towards the 3.0 release. By the time it comes
 out, it will be more than 2 years since Spark 2.0.
 >>>
 >>> For contributors less familiar with Spark’s history, I want to give
 more context on Spark releases:
 >>>
 >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
 Spark 3.0 in 2018.
 >>>
 >>> 2. Spark’s versioning policy promises that Spark does not break
 stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
 sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
 2.0, 2.x to 3.0).
 >>>
 >>> 3. That said, a major version isn’t necessarily the playground for
 disruptive API changes to make it painful for users to update. The main
 purpose of a major release is an opportunity to fix things that are broken
 in the current API and remove certain deprecated APIs.
 >>>
 >>> 4. Spark as a project has a culture of evolving architecture and
 developing major new features incrementally, so major releases are not the
 only time for exciting new features. For example, the bulk of the work in
 the move towards the DataFrame API was done in Spark 1.3, and Continuous
 Processing was introduced in Spark 2.3. Both were feature releases rather
 than major releases.
 >>>
 >>>
 >>> You can find more background in the thread discussing Spark 2.0:
 http://apache-spark-developers-list.1001551.n3.nabble.com/A-
 proposal-for-Spark-2-0-td15122.html
 >>>
 >>>
 >>> The primary motivating factor IMO for a major version bump is to
 support Scala 2.12, which requires minor API breaking changes to Spark’s
 APIs. Similar to Spark 2.0, I think there are also opportunities for other
 changes that we know have been biting us for a long time but can’t be
 changed in feature releases (to be clear, I’m actually not sure they are
 all good ideas, but I’m writing them down as candidates for consideration):
 >>>
 >>> 1. Support Scala 2.12.
 >>>
 >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
 in Spark 2.x.
 >>>
 >>> 3. Shade all dependencies.
 >>>
 >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
 compliant, to prevent users from shooting themselves in the foot, e.g.
 “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
 less painful for users to upgrade here, I’d suggest creating a flag for
 backward compatibility mode.
 >>>
 >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL 

Re: time for Apache Spark 3.0?

2018-06-16 Thread vaquar khan
Please ignore the link (YouTube) in my last email; I'm not sure how it got added.
Apologies, I'm not sure how to delete it.


On Sat, Jun 16, 2018 at 11:58 AM, vaquar khan  wrote:

> +1
>
> https://www.youtube.com/watch?v=-ik7aJ5U6kg
>
> Regards,
> Vaquar khan
>
> On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin  wrote:
>
>> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>>
>>
>> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
>> wrote:
>>
>>> I agree, I dont see pressing need for major version bump as well.
>>>
>>>
>>> Regards,
>>> Mridul
>>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
>>> wrote:
>>> >
>>> > Changing major version numbers is not about new features or a vague
>>> notion that it is time to do something that will be seen to be a
>>> significant release. It is about breaking stable public APIs.
>>> >
>>> > I still remain unconvinced that the next version can't be 2.4.0.
>>> >
>>> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
>>> >>
>>> >> Dear all:
>>> >>
>>> >> It have been 2 months since this topic being proposed. Any progress
>>> now? 2018 has been passed about 1/2.
>>> >>
>>> >> I agree with that the new version should be some exciting new
>>> feature. How about this one:
>>> >>
>>> >> 6. ML/DL framework to be integrated as core component and feature.
>>> (Such as Angel / BigDL / ……)
>>> >>
>>> >> 3.0 is a very important version for an good open source project. It
>>> should be better to drift away the historical burden and focus in new area.
>>> Spark has been widely used all over the world as a successful big data
>>> framework. And it can be better than that.
>>> >>
>>> >> Andy
>>> >>
>>> >>
>>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin 
>>> wrote:
>>> >>>
>>> >>> There was a discussion thread on scala-contributors about Apache
>>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>>> about time for Spark to work towards the 3.0 release. By the time it comes
>>> out, it will be more than 2 years since Spark 2.0.
>>> >>>
>>> >>> For contributors less familiar with Spark’s history, I want to give
>>> more context on Spark releases:
>>> >>>
>>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>>> Spark 3.0 in 2018.
>>> >>>
>>> >>> 2. Spark’s versioning policy promises that Spark does not break
>>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>>> 2.0, 2.x to 3.0).
>>> >>>
>>> >>> 3. That said, a major version isn’t necessarily the playground for
>>> disruptive API changes to make it painful for users to update. The main
>>> purpose of a major release is an opportunity to fix things that are broken
>>> in the current API and remove certain deprecated APIs.
>>> >>>
>>> >>> 4. Spark as a project has a culture of evolving architecture and
>>> developing major new features incrementally, so major releases are not the
>>> only time for exciting new features. For example, the bulk of the work in
>>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>>> Processing was introduced in Spark 2.3. Both were feature releases rather
>>> than major releases.
>>> >>>
>>> >>>
>>> >>> You can find more background in the thread discussing Spark 2.0:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-
>>> proposal-for-Spark-2-0-td15122.html
>>> >>>
>>> >>>
>>> >>> The primary motivating factor IMO for a major version bump is to
>>> support Scala 2.12, which requires minor API breaking changes to Spark’s
>>> APIs. Similar to Spark 2.0, I think there are also opportunities for other
>>> changes that we know have been biting us for a long time but can’t be
>>> changed in feature releases (to be clear, I’m actually not sure they are
>>> all good ideas, but I’m writing them down as candidates for consideration):
>>> >>>
>>> >>> 1. Support Scala 2.12.
>>> >>>
>>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated
>>> in Spark 2.x.
>>> >>>
>>> >>> 3. Shade all dependencies.
>>> >>>
>>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>>> compliant, to prevent users from shooting themselves in the foot, e.g.
>>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>>> less painful for users to upgrade here, I’d suggest creating a flag for
>>> backward compatibility mode.
>>> >>>
>>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>>> standard compliant, and have a flag for backward compatibility.
>>> >>>
>>> >>> 6. Miscellaneous other small changes documented in JIRA already
>>> (e.g. “JavaPairRDD flatMapValues requires function returning Iterable, not
>>> Iterator”, “Prevent column name duplication in temporary view”).
>>> >>>
>>> >>>
>>> >>> Now the reality of a major version bump is that the world often
>>> thinks in terms of what exciting features are coming. 

Re: time for Apache Spark 3.0?

2018-06-16 Thread vaquar khan
+1

https://www.youtube.com/watch?v=-ik7aJ5U6kg

Regards,
Vaquar khan

On Fri, Jun 15, 2018 at 4:55 PM, Reynold Xin  wrote:

> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>
>
> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
> wrote:
>
>> I agree, I dont see pressing need for major version bump as well.
>>
>>
>> Regards,
>> Mridul
>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
>> wrote:
>> >
>> > Changing major version numbers is not about new features or a vague
>> notion that it is time to do something that will be seen to be a
>> significant release. It is about breaking stable public APIs.
>> >
>> > I still remain unconvinced that the next version can't be 2.4.0.
>> >
>> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
>> >>
>> >> Dear all:
>> >>
>> >> It have been 2 months since this topic being proposed. Any progress
>> now? 2018 has been passed about 1/2.
>> >>
>> >> I agree with that the new version should be some exciting new feature.
>> How about this one:
>> >>
>> >> 6. ML/DL framework to be integrated as core component and feature.
>> (Such as Angel / BigDL / ……)
>> >>
>> >> 3.0 is a very important version for an good open source project. It
>> should be better to drift away the historical burden and focus in new area.
>> Spark has been widely used all over the world as a successful big data
>> framework. And it can be better than that.
>> >>
>> >> Andy
>> >>
>> >>
>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin 
>> wrote:
>> >>>
>> >>> There was a discussion thread on scala-contributors about Apache
>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>> about time for Spark to work towards the 3.0 release. By the time it comes
>> out, it will be more than 2 years since Spark 2.0.
>> >>>
>> >>> For contributors less familiar with Spark’s history, I want to give
>> more context on Spark releases:
>> >>>
>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>> Spark 3.0 in 2018.
>> >>>
>> >>> 2. Spark’s versioning policy promises that Spark does not break
>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>> 2.0, 2.x to 3.0).
>> >>>
>> >>> 3. That said, a major version isn’t necessarily the playground for
>> disruptive API changes to make it painful for users to update. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs.
>> >>>
>> >>> 4. Spark as a project has a culture of evolving architecture and
>> developing major new features incrementally, so major releases are not the
>> only time for exciting new features. For example, the bulk of the work in
>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>> Processing was introduced in Spark 2.3. Both were feature releases rather
>> than major releases.
>> >>>
>> >>>
>> >>> You can find more background in the thread discussing Spark 2.0:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-
>> Spark-2-0-td15122.html
>> >>>
>> >>>
>> >>> The primary motivating factor IMO for a major version bump is to
>> support Scala 2.12, which requires minor API breaking changes to Spark’s
>> APIs. Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>> >>>
>> >>> 1. Support Scala 2.12.
>> >>>
>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 2.x.
>> >>>
>> >>> 3. Shade all dependencies.
>> >>>
>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>> compliant, to prevent users from shooting themselves in the foot, e.g.
>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>> less painful for users to upgrade here, I’d suggest creating a flag for
>> backward compatibility mode.
>> >>>
>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>> standard compliant, and have a flag for backward compatibility.
>> >>>
>> >>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
>> “JavaPairRDD flatMapValues requires function returning Iterable, not
>> Iterator”, “Prevent column name duplication in temporary view”).
>> >>>
>> >>>
>> >>> Now the reality of a major version bump is that the world often
>> thinks in terms of what exciting features are coming. I do think there are
>> a number of major changes happening already that can be part of the 3.0
>> release, if they make it in:
>> >>>
>> >>> 1. Scala 2.12 support (listing it twice)
>> >>> 2. Continuous Processing non-experimental
>> >>> 3. Kubernetes support non-experimental

Re: time for Apache Spark 3.0?

2018-06-16 Thread Xiao Li
+1

2018-06-15 14:55 GMT-07:00 Reynold Xin :

> Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.
>
>
> On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
> wrote:
>
>> I agree, I dont see pressing need for major version bump as well.
>>
>>
>> Regards,
>> Mridul
>> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
>> wrote:
>> >
>> > Changing major version numbers is not about new features or a vague
>> notion that it is time to do something that will be seen to be a
>> significant release. It is about breaking stable public APIs.
>> >
>> > I still remain unconvinced that the next version can't be 2.4.0.
>> >
>> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
>> >>
>> >> Dear all:
>> >>
>> >> It have been 2 months since this topic being proposed. Any progress
>> now? 2018 has been passed about 1/2.
>> >>
>> >> I agree with that the new version should be some exciting new feature.
>> How about this one:
>> >>
>> >> 6. ML/DL framework to be integrated as core component and feature.
>> (Such as Angel / BigDL / ……)
>> >>
>> >> 3.0 is a very important version for an good open source project. It
>> should be better to drift away the historical burden and focus in new area.
>> Spark has been widely used all over the world as a successful big data
>> framework. And it can be better than that.
>> >>
>> >> Andy
>> >>
>> >>
>> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin 
>> wrote:
>> >>>
>> >>> There was a discussion thread on scala-contributors about Apache
>> Spark not yet supporting Scala 2.12, and that got me to think perhaps it is
>> about time for Spark to work towards the 3.0 release. By the time it comes
>> out, it will be more than 2 years since Spark 2.0.
>> >>>
>> >>> For contributors less familiar with Spark’s history, I want to give
>> more context on Spark releases:
>> >>>
>> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July
>> 2016. If we were to maintain the ~ 2 year cadence, it is time to work on
>> Spark 3.0 in 2018.
>> >>>
>> >>> 2. Spark’s versioning policy promises that Spark does not break
>> stable APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>> 2.0, 2.x to 3.0).
>> >>>
>> >>> 3. That said, a major version isn’t necessarily the playground for
>> disruptive API changes to make it painful for users to update. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs.
>> >>>
>> >>> 4. Spark as a project has a culture of evolving architecture and
>> developing major new features incrementally, so major releases are not the
>> only time for exciting new features. For example, the bulk of the work in
>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>> Processing was introduced in Spark 2.3. Both were feature releases rather
>> than major releases.
>> >>>
>> >>>
>> >>> You can find more background in the thread discussing Spark 2.0:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-
>> Spark-2-0-td15122.html
>> >>>
>> >>>
>> >>> The primary motivating factor IMO for a major version bump is to
>> support Scala 2.12, which requires minor API breaking changes to Spark’s
>> APIs. Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>> >>>
>> >>> 1. Support Scala 2.12.
>> >>>
>> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 2.x.
>> >>>
>> >>> 3. Shade all dependencies.
>> >>>
>> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>> compliant, to prevent users from shooting themselves in the foot, e.g.
>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>> less painful for users to upgrade here, I’d suggest creating a flag for
>> backward compatibility mode.
>> >>>
>> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>> standard compliant, and have a flag for backward compatibility.
>> >>>
>> >>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
>> “JavaPairRDD flatMapValues requires function returning Iterable, not
>> Iterator”, “Prevent column name duplication in temporary view”).
>> >>>
>> >>>
>> >>> Now the reality of a major version bump is that the world often
>> thinks in terms of what exciting features are coming. I do think there are
>> a number of major changes happening already that can be part of the 3.0
>> release, if they make it in:
>> >>>
>> >>> 1. Scala 2.12 support (listing it twice)
>> >>> 2. Continuous Processing non-experimental
>> >>> 3. Kubernetes support non-experimental
>> >>> 4. A more flushed out version of data source API v2 (I don’t think it
>> 

Re: time for Apache Spark 3.0?

2018-06-15 Thread Reynold Xin
Yes. At this rate I think it's better to do 2.4 next, followed by 3.0.


On Fri, Jun 15, 2018 at 10:52 AM Mridul Muralidharan 
wrote:

> I agree, I dont see pressing need for major version bump as well.
>
>
> Regards,
> Mridul
> On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra 
> wrote:
> >
> > Changing major version numbers is not about new features or a vague
> notion that it is time to do something that will be seen to be a
> significant release. It is about breaking stable public APIs.
> >
> > I still remain unconvinced that the next version can't be 2.4.0.
> >
> > On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
> >>
> >> Dear all:
> >>
> >> It have been 2 months since this topic being proposed. Any progress
> now? 2018 has been passed about 1/2.
> >>
> >> I agree with that the new version should be some exciting new feature.
> How about this one:
> >>
> >> 6. ML/DL framework to be integrated as core component and feature.
> (Such as Angel / BigDL / ……)
> >>
> >> 3.0 is a very important version for an good open source project. It
> should be better to drift away the historical burden and focus in new area.
> Spark has been widely used all over the world as a successful big data
> framework. And it can be better than that.
> >>
> >> Andy
> >>
> >>
> >> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:
> >>>
> >>> There was a discussion thread on scala-contributors about Apache Spark
> not yet supporting Scala 2.12, and that got me to think perhaps it is about
> time for Spark to work towards the 3.0 release. By the time it comes out,
> it will be more than 2 years since Spark 2.0.
> >>>
> >>> For contributors less familiar with Spark’s history, I want to give
> more context on Spark releases:
> >>>
> >>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016.
> If we were to maintain the ~ 2 year cadence, it is time to work on Spark
> 3.0 in 2018.
> >>>
> >>> 2. Spark’s versioning policy promises that Spark does not break stable
> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
> 2.0, 2.x to 3.0).
> >>>
> >>> 3. That said, a major version isn’t necessarily the playground for
> disruptive API changes to make it painful for users to update. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs.
> >>>
> >>> 4. Spark as a project has a culture of evolving architecture and
> developing major new features incrementally, so major releases are not the
> only time for exciting new features. For example, the bulk of the work in
> the move towards the DataFrame API was done in Spark 1.3, and Continuous
> Processing was introduced in Spark 2.3. Both were feature releases rather
> than major releases.
> >>>
> >>>
> >>> You can find more background in the thread discussing Spark 2.0:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
> >>>
> >>>
> >>> The primary motivating factor IMO for a major version bump is to
> support Scala 2.12, which requires minor API breaking changes to Spark’s
> APIs. Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for consideration):
> >>>
> >>> 1. Support Scala 2.12.
> >>>
> >>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 2.x.
> >>>
> >>> 3. Shade all dependencies.
> >>>
> >>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
> compliant, to prevent users from shooting themselves in the foot, e.g.
> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
> less painful for users to upgrade here, I’d suggest creating a flag for
> backward compatibility mode.
> >>>
> >>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
> standard compliant, and have a flag for backward compatibility.
> >>>
> >>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
> “JavaPairRDD flatMapValues requires function returning Iterable, not
> Iterator”, “Prevent column name duplication in temporary view”).
> >>>
> >>>
> >>> Now the reality of a major version bump is that the world often thinks
> in terms of what exciting features are coming. I do think there are a
> number of major changes happening already that can be part of the 3.0
> release, if they make it in:
> >>>
> >>> 1. Scala 2.12 support (listing it twice)
> >>> 2. Continuous Processing non-experimental
> >>> 3. Kubernetes support non-experimental
> >>> 4. A more flushed out version of data source API v2 (I don’t think it
> is realistic to stabilize that in one release)
> >>> 5. Hadoop 3.0 support
> >>> 6. ...
> >>>
> >>>
> >>>
> >>> Similar to the 2.0 discussion, this thread should focus on 

Re: time for Apache Spark 3.0?

2018-06-15 Thread Mridul Muralidharan
I agree, I don't see a pressing need for a major version bump either.


Regards,
Mridul
On Fri, Jun 15, 2018 at 10:25 AM Mark Hamstra  wrote:
>
> Changing major version numbers is not about new features or a vague notion 
> that it is time to do something that will be seen to be a significant 
> release. It is about breaking stable public APIs.
>
> I still remain unconvinced that the next version can't be 2.4.0.
>
> On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:
>>
>> Dear all:
>>
>> It have been 2 months since this topic being proposed. Any progress now? 
>> 2018 has been passed about 1/2.
>>
>> I agree with that the new version should be some exciting new feature. How 
>> about this one:
>>
>> 6. ML/DL framework to be integrated as core component and feature. (Such as 
>> Angel / BigDL / ……)
>>
>> 3.0 is a very important version for an good open source project. It should 
>> be better to drift away the historical burden and focus in new area. Spark 
>> has been widely used all over the world as a successful big data framework. 
>> And it can be better than that.
>>
>> Andy
>>
>>
>> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:
>>>
>>> There was a discussion thread on scala-contributors about Apache Spark not 
>>> yet supporting Scala 2.12, and that got me to think perhaps it is about 
>>> time for Spark to work towards the 3.0 release. By the time it comes out, 
>>> it will be more than 2 years since Spark 2.0.
>>>
>>> For contributors less familiar with Spark’s history, I want to give more 
>>> context on Spark releases:
>>>
>>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If 
>>> we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0 
>>> in 2018.
>>>
>>> 2. Spark’s versioning policy promises that Spark does not break stable APIs 
>>> in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a 
>>> necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to 
>>> 3.0).
>>>
>>> 3. That said, a major version isn’t necessarily the playground for 
>>> disruptive API changes to make it painful for users to update. The main 
>>> purpose of a major release is an opportunity to fix things that are broken 
>>> in the current API and remove certain deprecated APIs.
>>>
>>> 4. Spark as a project has a culture of evolving architecture and developing 
>>> major new features incrementally, so major releases are not the only time 
>>> for exciting new features. For example, the bulk of the work in the move 
>>> towards the DataFrame API was done in Spark 1.3, and Continuous Processing 
>>> was introduced in Spark 2.3. Both were feature releases rather than major 
>>> releases.
>>>
>>>
>>> You can find more background in the thread discussing Spark 2.0: 
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>>
>>>
>>> The primary motivating factor IMO for a major version bump is to support 
>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs. 
>>> Similar to Spark 2.0, I think there are also opportunities for other 
>>> changes that we know have been biting us for a long time but can’t be 
>>> changed in feature releases (to be clear, I’m actually not sure they are 
>>> all good ideas, but I’m writing them down as candidates for consideration):
>>>
>>> 1. Support Scala 2.12.
>>>
>>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark 
>>> 2.x.
>>>
>>> 3. Shade all dependencies.
>>>
>>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant, 
>>> to prevent users from shooting themselves in the foot, e.g. “SELECT 2 
>>> SECOND” -- is “SECOND” an interval unit or an alias? To make it less 
>>> painful for users to upgrade here, I’d suggest creating a flag for backward 
>>> compatibility mode.
>>>
>>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more standard 
>>> compliant, and have a flag for backward compatibility.
>>>
>>> 6. Miscellaneous other small changes documented in JIRA already (e.g. 
>>> “JavaPairRDD flatMapValues requires function returning Iterable, not 
>>> Iterator”, “Prevent column name duplication in temporary view”).
>>>
>>>
>>> Now the reality of a major version bump is that the world often thinks in 
>>> terms of what exciting features are coming. I do think there are a number 
>>> of major changes happening already that can be part of the 3.0 release, if 
>>> they make it in:
>>>
>>> 1. Scala 2.12 support (listing it twice)
>>> 2. Continuous Processing non-experimental
>>> 3. Kubernetes support non-experimental
>>> 4. A more flushed out version of data source API v2 (I don’t think it is 
>>> realistic to stabilize that in one release)
>>> 5. Hadoop 3.0 support
>>> 6. ...
>>>
>>>
>>>
>>> Similar to the 2.0 discussion, this thread should focus on the framework 
>>> and whether it’d make sense to create Spark 3.0 as the next release, rather 
>>> than the individual feature requests. Those are important 

Re: time for Apache Spark 3.0?

2018-06-15 Thread Mark Hamstra
Changing major version numbers is not about new features or a vague notion
that it is time to do something that will be seen to be a significant
release. It is about breaking stable public APIs.

I still remain unconvinced that the next version can't be 2.4.0.

On Fri, Jun 15, 2018 at 1:34 AM Andy  wrote:

> *Dear all:*
>
> It have been 2 months since this topic being proposed. Any progress now?
> 2018 has been passed about 1/2.
>
> I agree with that the new version should be some exciting new feature. How
> about this one:
>
> *6. ML/DL framework to be integrated as core component and feature. (Such
> as Angel / BigDL / ……)*
>
> 3.0 is a very important version for an good open source project. It should
> be better to drift away the historical burden and *focus in new area*.
> Spark has been widely used all over the world as a successful big data
> framework. And it can be better than that.
>
>
> *Andy*
>
>
> On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:
>
>> There was a discussion thread on scala-contributors
>> 
>> about Apache Spark not yet supporting Scala 2.12, and that got me to think
>> perhaps it is about time for Spark to work towards the 3.0 release. By the
>> time it comes out, it will be more than 2 years since Spark 2.0.
>>
>> For contributors less familiar with Spark’s history, I want to give more
>> context on Spark releases:
>>
>> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If
>> we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0
>> in 2018.
>>
>> 2. Spark’s versioning policy promises that Spark does not break stable
>> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
>> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
>> 2.0, 2.x to 3.0).
>>
>> 3. That said, a major version isn’t necessarily the playground for
>> disruptive API changes to make it painful for users to update. The main
>> purpose of a major release is an opportunity to fix things that are broken
>> in the current API and remove certain deprecated APIs.
>>
>> 4. Spark as a project has a culture of evolving architecture and
>> developing major new features incrementally, so major releases are not the
>> only time for exciting new features. For example, the bulk of the work in
>> the move towards the DataFrame API was done in Spark 1.3, and Continuous
>> Processing was introduced in Spark 2.3. Both were feature releases rather
>> than major releases.
>>
>>
>> You can find more background in the thread discussing Spark 2.0:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>>
>>
>> The primary motivating factor IMO for a major version bump is to support
>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>> Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>>
>> 1. Support Scala 2.12.
>>
>> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
>> Spark 2.x.
>>
>> 3. Shade all dependencies.
>>
>> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
>> compliant, to prevent users from shooting themselves in the foot, e.g.
>> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
>> less painful for users to upgrade here, I’d suggest creating a flag for
>> backward compatibility mode.
>>
>> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
>> standard compliant, and have a flag for backward compatibility.
>>
>> 6. Miscellaneous other small changes documented in JIRA already (e.g.
>> “JavaPairRDD flatMapValues requires function returning Iterable, not
>> Iterator”, “Prevent column name duplication in temporary view”).
>>
>>
>> Now the reality of a major version bump is that the world often thinks in
>> terms of what exciting features are coming. I do think there are a number
>> of major changes happening already that can be part of the 3.0 release, if
>> they make it in:
>>
>> 1. Scala 2.12 support (listing it twice)
>> 2. Continuous Processing non-experimental
>> 3. Kubernetes support non-experimental
>> 4. A more fleshed-out version of data source API v2 (I don’t think it is
>> realistic to stabilize that in one release)
>> 5. Hadoop 3.0 support
>> 6. ...
>>
>>
>>
>> Similar to the 2.0 discussion, this thread should focus on the framework
>> and whether it’d make sense to create Spark 3.0 as the next release, rather
>> than the individual feature requests. Those are important but are best done
>> in their own separate threads.
>>
>>
>>
>>
>>


Re: time for Apache Spark 3.0?

2018-06-15 Thread Andy
*Dear all:*

It has been 2 months since this topic was proposed. Any progress so far?
2018 is already about half over.

I agree that the new version should bring some exciting new features. How
about this one:

*6. An ML/DL framework integrated as a core component and feature (such
as Angel / BigDL / ...)*

3.0 is a very important version for a good open source project. It would
be better to shed the historical burden and *focus on new areas*.
Spark is widely used all over the world as a successful big data
framework, and it can be better than that.


*Andy*


On Thu, Apr 5, 2018 at 7:20 AM Reynold Xin  wrote:

> There was a discussion thread on scala-contributors
> 
> about Apache Spark not yet supporting Scala 2.12, and that got me to think
> perhaps it is about time for Spark to work towards the 3.0 release. By the
> time it comes out, it will be more than 2 years since Spark 2.0.
>
> For contributors less familiar with Spark’s history, I want to give more
> context on Spark releases:
>
> 1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If
> we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0
> in 2018.
>
> 2. Spark’s versioning policy promises that Spark does not break stable
> APIs in feature releases (e.g. 2.1, 2.2). API breaking changes are
> sometimes a necessary evil, and can be done in major releases (e.g. 1.6 to
> 2.0, 2.x to 3.0).
>
> 3. That said, a major version isn’t necessarily the playground for
> disruptive API changes to make it painful for users to update. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs.
>
> 4. Spark as a project has a culture of evolving architecture and
> developing major new features incrementally, so major releases are not the
> only time for exciting new features. For example, the bulk of the work in
> the move towards the DataFrame API was done in Spark 1.3, and Continuous
> Processing was introduced in Spark 2.3. Both were feature releases rather
> than major releases.
>
>
> You can find more background in the thread discussing Spark 2.0:
> http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html
>
>
> The primary motivating factor IMO for a major version bump is to support
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
> Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for consideration):
>
> 1. Support Scala 2.12.
>
> 2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 2.x.
>
> 3. Shade all dependencies.
>
> 4. Change the reserved keywords in Spark SQL to be more ANSI-SQL
> compliant, to prevent users from shooting themselves in the foot, e.g.
> “SELECT 2 SECOND” -- is “SECOND” an interval unit or an alias? To make it
> less painful for users to upgrade here, I’d suggest creating a flag for
> backward compatibility mode.
>
> 5. Similar to 4, make our type coercion rule in DataFrame/SQL more
> standard compliant, and have a flag for backward compatibility.
>
> 6. Miscellaneous other small changes documented in JIRA already (e.g.
> “JavaPairRDD flatMapValues requires function returning Iterable, not
> Iterator”, “Prevent column name duplication in temporary view”).
>
>
> Now the reality of a major version bump is that the world often thinks in
> terms of what exciting features are coming. I do think there are a number
> of major changes happening already that can be part of the 3.0 release, if
> they make it in:
>
> 1. Scala 2.12 support (listing it twice)
> 2. Continuous Processing non-experimental
> 3. Kubernetes support non-experimental
> 4. A more fleshed-out version of data source API v2 (I don’t think it is
> realistic to stabilize that in one release)
> 5. Hadoop 3.0 support
> 6. ...
>
>
>
> Similar to the 2.0 discussion, this thread should focus on the framework
> and whether it’d make sense to create Spark 3.0 as the next release, rather
> than the individual feature requests. Those are important but are best done
> in their own separate threads.
>
>
>
>
>


Re: time for Apache Spark 3.0?

2018-04-19 Thread Sean Owen
That certainly sounds beneficial, to maybe several other projects. If
there's no downside and it takes away API issues, seems like a win.

On Thu, Apr 19, 2018 at 5:28 AM Dean Wampler  wrote:

> I spoke with Martin Odersky and Lightbend's Scala Team about the known API
> issue with method disambiguation. They offered to implement a small patch
> in a new release of Scala 2.12 to handle the issue without requiring a
> Spark API change. They would cut a 2.12.6 release for it. I'm told that
> Scala 2.13 should already handle the issue without modification (it's not
> yet released, to be clear). They can also offer feedback on updating the
> closure cleaner.
>
> So, this approach would support Scala 2.12 in Spark, but limited to
> 2.12.6+, without the API change requirement, but the closure cleaner would
> still need updating. Hence, it could be done for Spark 2.X.
>
> Let me know if you want to pursue this approach.
>
> dean
>


Re: time for Apache Spark 3.0?

2018-04-19 Thread Dean Wampler
I spoke with Martin Odersky and Lightbend's Scala Team about the known API
issue with method disambiguation. They offered to implement a small patch
in a new release of Scala 2.12 to handle the issue without requiring a
Spark API change. They would cut a 2.12.6 release for it. I'm told that
Scala 2.13 should already handle the issue without modification (it's not
yet released, to be clear). They can also offer feedback on updating the
closure cleaner.

So, this approach would support Scala 2.12 in Spark, but limited to
2.12.6+, without the API change requirement, but the closure cleaner would
still need updating. Hence, it could be done for Spark 2.X.

Let me know if you want to pursue this approach.

dean




*Dean Wampler, Ph.D.*

*VP, Fast Data Engineering at Lightbend*
Author: Programming Scala, 2nd Edition, Fast Data Architectures
for Streaming Applications, and other content from O'Reilly
@deanwampler 
http://polyglotprogramming.com
https://github.com/deanwampler

On Thu, Apr 5, 2018 at 8:13 PM, Marcelo Vanzin  wrote:

> On Thu, Apr 5, 2018 at 10:30 AM, Matei Zaharia 
> wrote:
> > Sorry, but just to be clear here, this is the 2.12 API issue:
> https://issues.apache.org/jira/browse/SPARK-14643, with more details in
> this doc: https://docs.google.com/document/d/1P_
> wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
> >
> > Basically, if we are allowed to change Spark’s API a little to have only
> one version of methods that are currently overloaded between Java and
> Scala, we can get away with a single source tree for all Scala versions
> and Java ABI compatibility against any type of Spark (whether using Scala
> 2.11 or 2.12).
>
> Fair enough. To play devil's advocate, most of those methods seem to
> be marked "Experimental / Evolving", which could be used as a reason
> to change them for this purpose in a minor release.
>
> Not all of them are, though (e.g. foreach / foreachPartition are not
> experimental).
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: time for Apache Spark 3.0?

2018-04-05 Thread Marcelo Vanzin
On Thu, Apr 5, 2018 at 10:30 AM, Matei Zaharia  wrote:
> Sorry, but just to be clear here, this is the 2.12 API issue: 
> https://issues.apache.org/jira/browse/SPARK-14643, with more details in this 
> doc: 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
>
> Basically, if we are allowed to change Spark’s API a little to have only one 
> version of methods that are currently overloaded between Java and Scala, we 
> can get away with a single source tree for all Scala versions and Java ABI 
> compatibility against any type of Spark (whether using Scala 2.11 or 2.12).

Fair enough. To play devil's advocate, most of those methods seem to
be marked "Experimental / Evolving", which could be used as a reason
to change them for this purpose in a minor release.

Not all of them are, though (e.g. foreach / foreachPartition are not
experimental).

-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: time for Apache Spark 3.0?

2018-04-05 Thread Steve Loughran


On 5 Apr 2018, at 18:04, Matei Zaharia 
> wrote:

Java 9/10 support would be great to add as well.

Be aware that moving Hadoop core to Java 9+ is still a big piece of work 
being undertaken by Akira Ajisaka & colleagues at NTT.

https://issues.apache.org/jira/browse/HADOOP-11123

Big dependency updates, and handling Oracle hiding the sun.misc stuff that 
low-level code depends on, are the trouble spots, with a move to Log4j 2 going to be 
observably traumatic for all apps that rely on a log4j.properties file to set 
themselves up. As usual, any testing that can be done early will be welcomed 
by all; the earlier the better.

That stuff is all about getting things working under the Java 9 packaging 
model, which is a really compelling reason to go for it.


Regarding Scala 2.12, I thought that supporting it would become easier if we 
change the Spark API and ABI slightly. Basically, it is of course possible to 
create an alternate source tree today, but it might be possible to share the 
same source files if we tweak some small things in the methods that are 
overloaded across Scala and Java. I don’t remember the exact details, but the 
idea was to reduce the total maintenance work needed at the cost of requiring 
users to recompile their apps.

I’m personally for moving to 3.0 because of the other things we can clean up as 
well, e.g. the default SQL dialect, Iterable stuff, and possibly dependency 
shading (a major pain point for lots of users)

Hadoop 3 does have a shaded client, though not enough for Spark; if work 
identifying & fixing the outstanding dependencies is started now, Hadoop 3.2 
should be able to offer the set of shaded libraries needed by Spark.

There's always a price to that: redistributable size and its impact on start 
times, duplicate classes loaded (memory, reduced chance of JIT recompilation, ...), 
and the whole transitive-shading problem. Java 9 should be the real target for a 
clean solution to all of this.
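
As a rough illustration of the relocation-based shading being discussed here, an
application could shade its own copy of a library with an sbt-assembly rule along
these lines; the relocated package name is just an example, and this is not how
Spark's own Maven build does it:

    // build.sbt fragment; requires the sbt-assembly plugin in project/plugins.sbt.
    assembly / assemblyShadeRules := Seq(
      // Rewrite Guava class references so they cannot clash with the copies that
      // Spark and Hadoop already ship on the classpath.
      ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1").inAll
    )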


Re: time for Apache Spark 3.0?

2018-04-05 Thread Matei Zaharia
Oh, forgot to add, but splitting the source tree in Scala also creates the 
issue of a big maintenance burden for third-party libraries built on Spark. As 
Josh said on the JIRA:

"I think this is primarily going to be an issue for end users who want to use 
an existing source tree to cross-compile for Scala 2.10, 2.11, and 2.12. Thus 
the pain of the source incompatibility would mostly be felt by library/package 
maintainers but it can be worked around as long as there's at least some common 
subset which is source compatible across all of those versions.”

This means that all the data sources, ML algorithms, etc developed outside our 
source tree would have to do the same thing we do internally.

> On Apr 5, 2018, at 10:30 AM, Matei Zaharia  wrote:
> 
> Sorry, but just to be clear here, this is the 2.12 API issue: 
> https://issues.apache.org/jira/browse/SPARK-14643, with more details in this 
> doc: 
> https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.
> 
> Basically, if we are allowed to change Spark’s API a little to have only one 
> version of methods that are currently overloaded between Java and Scala, we 
> can get away with a single source tree for all Scala versions and Java ABI 
> compatibility against any type of Spark (whether using Scala 2.11 or 2.12). 
> On the other hand, if we want to keep the API and ABI of the Spark 2.x 
> branch, we’ll need a different source tree for Scala 2.12 with different 
> copies of pretty large classes such as RDD, DataFrame and DStream, and Java 
> users may have to change their code when linking against different versions 
> of Spark.
> 
> This is of course only one of the possible ABI changes, but it is a 
> considerable engineering effort, so we’d have to sign up for maintaining all 
> these different source files. It seems kind of silly given that Scala 2.12 
> was released in 2016, so we’re doing all this work to keep ABI compatibility 
> for Scala 2.11, which isn’t even that widely used any more for new projects. 
> Also keep in mind that the next Spark release will probably take at least 3-4 
> months, so we’re talking about what people will be using in fall 2018.
> 
> Matei
> 
>> On Apr 5, 2018, at 10:13 AM, Marcelo Vanzin  wrote:
>> 
>> I remember seeing somewhere that Scala still has some issues with Java
>> 9/10 so that might be hard...
>> 
>> But on that topic, it might be better to shoot for Java 11
>> compatibility. 9 and 10, following the new release model, aren't
>> really meant to be long-term releases.
>> 
>> In general, agree with Sean here. Doesn't look like 2.12 support
>> requires unexpected API breakages. So unless there's a really good
>> reason to break / remove a bunch of existing APIs...
>> 
>> On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido  wrote:
>>> Hi all,
>>> 
>>> I also agree with Mark that we should add Java 9/10 support to an eventual
>>> Spark 3.0 release, because supporting Java 9 is not a trivial task since we
>>> are using some internal APIs for the memory management which changed: either
>>> we find a solution which works on both (but I am not sure it is feasible) or
>>> we have to switch between 2 implementations according to the Java version.
>>> So I'd rather avoid doing this in a non-major release.
>>> 
>>> Thanks,
>>> Marco
>>> 
>>> 
>>> 2018-04-05 17:35 GMT+02:00 Mark Hamstra :
 
 As with Sean, I'm not sure that this will require a new major version, but
 we should also be looking at Java 9 & 10 support -- particularly with 
 regard
 to their better functionality in a containerized environment (memory limits
 from cgroups, not sysconf; support for cpusets). In that regard, we should
 also be looking at using the latest Scala 2.11.x maintenance release in
 current Spark branches.
 
 On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
> 
> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
>> 
>> The primary motivating factor IMO for a major version bump is to support
>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>> Similar to Spark 2.0, I think there are also opportunities for other 
>> changes
>> that we know have been biting us for a long time but can’t be changed in
>> feature releases (to be clear, I’m actually not sure they are all good
>> ideas, but I’m writing them down as candidates for consideration):
> 
> 
> IIRC from looking at this, it is possible to support 2.11 and 2.12
> simultaneously. The cross-build already works now in 2.3.0. Barring some 
> big
> change needed to get 2.12 fully working -- and that may be the case -- it
> nearly works that way now.
> 
> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
> in byte code. However Scala itself isn't mutually compatible between 2.11
> and 2.12 anyway; that's never been promised as compatible.

Re: time for Apache Spark 3.0?

2018-04-05 Thread Matei Zaharia
Sorry, but just to be clear here, this is the 2.12 API issue: 
https://issues.apache.org/jira/browse/SPARK-14643, with more details in this 
doc: 
https://docs.google.com/document/d/1P_wmH3U356f079AYgSsN53HKixuNdxSEvo8nw_tgLgM/edit.

Basically, if we are allowed to change Spark’s API a little to have only one 
version of methods that are currently overloaded between Java and Scala, we can 
get away with a single source tree for all Scala versions and Java ABI 
compatibility against any type of Spark (whether using Scala 2.11 or 2.12). On 
the other hand, if we want to keep the API and ABI of the Spark 2.x branch, 
we’ll need a different source tree for Scala 2.12 with different copies of 
pretty large classes such as RDD, DataFrame and DStream, and Java users may 
have to change their code when linking against different versions of Spark.
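
To make the overload issue concrete, here is a minimal sketch (assuming a local 
SparkSession); it shows the shape of the problem discussed in SPARK-14643, not 
Spark's eventual fix:

    import org.apache.spark.sql.{Dataset, SparkSession}

    val spark = SparkSession.builder().master("local[*]").appName("sam-demo").getOrCreate()
    import spark.implicits._

    val ds: Dataset[Int] = Seq(1, 2, 3).toDS()

    // Dataset.foreach is overloaded for Scala (T => Unit) and Java (ForeachFunction[T])
    // callers. With Scala 2.11 a function literal only matches the Scala overload;
    // under 2.12, SAM conversion makes it match the Java overload as well, so the
    // same line is rejected as ambiguous:
    // ds.foreach(x => println(x))

    // Passing a value with an explicit function type sidesteps SAM conversion and
    // compiles against both Scala builds:
    val printOne: Int => Unit = x => println(x)
    ds.foreach(printOne)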

This is of course only one of the possible ABI changes, but it is a 
considerable engineering effort, so we’d have to sign up for maintaining all 
these different source files. It seems kind of silly given that Scala 2.12 was 
released in 2016, so we’re doing all this work to keep ABI compatibility for 
Scala 2.11, which isn’t even that widely used any more for new projects. Also 
keep in mind that the next Spark release will probably take at least 3-4 
months, so we’re talking about what people will be using in fall 2018.

Matei

> On Apr 5, 2018, at 10:13 AM, Marcelo Vanzin  wrote:
> 
> I remember seeing somewhere that Scala still has some issues with Java
> 9/10 so that might be hard...
> 
> But on that topic, it might be better to shoot for Java 11
> compatibility. 9 and 10, following the new release model, aren't
> really meant to be long-term releases.
> 
> In general, agree with Sean here. Doesn't look like 2.12 support
> requires unexpected API breakages. So unless there's a really good
> reason to break / remove a bunch of existing APIs...
> 
> On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido  wrote:
>> Hi all,
>> 
>> I also agree with Mark that we should add Java 9/10 support to an eventual
>> Spark 3.0 release, because supporting Java 9 is not a trivial task since we
>> are using some internal APIs for the memory management which changed: either
>> we find a solution which works on both (but I am not sure it is feasible) or
>> we have to switch between 2 implementations according to the Java version.
>> So I'd rather avoid doing this in a non-major release.
>> 
>> Thanks,
>> Marco
>> 
>> 
>> 2018-04-05 17:35 GMT+02:00 Mark Hamstra :
>>> 
>>> As with Sean, I'm not sure that this will require a new major version, but
>>> we should also be looking at Java 9 & 10 support -- particularly with regard
>>> to their better functionality in a containerized environment (memory limits
>>> from cgroups, not sysconf; support for cpusets). In that regard, we should
>>> also be looking at using the latest Scala 2.11.x maintenance release in
>>> current Spark branches.
>>> 
>>> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
 
 On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
> 
> The primary motivating factor IMO for a major version bump is to support
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
> Similar to Spark 2.0, I think there are also opportunities for other 
> changes
> that we know have been biting us for a long time but can’t be changed in
> feature releases (to be clear, I’m actually not sure they are all good
> ideas, but I’m writing them down as candidates for consideration):
 
 
 IIRC from looking at this, it is possible to support 2.11 and 2.12
 simultaneously. The cross-build already works now in 2.3.0. Barring some 
 big
 change needed to get 2.12 fully working -- and that may be the case -- it
 nearly works that way now.
 
 Compiling vs 2.11 and 2.12 does however result in some APIs that differ
 in byte code. However Scala itself isn't mutually compatible between 2.11
 and 2.12 anyway; that's never been promised as compatible.
 
 (Interesting question about what *Java* users should expect; they would
 see a difference in 2.11 vs 2.12 Spark APIs, but that has always been 
 true.)
 
 I don't disagree with shooting for Spark 3.0, just saying I don't know if
 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
 2.11 support if needed to make supporting 2.12 less painful.
>>> 
>>> 
>> 
> 
> 
> 
> -- 
> Marcelo
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: time for Apache Spark 3.0?

2018-04-05 Thread Marcelo Vanzin
I remember seeing somewhere that Scala still has some issues with Java
9/10 so that might be hard...

But on that topic, it might be better to shoot for Java 11
compatibility. 9 and 10, following the new release model, aren't
really meant to be long-term releases.

In general, agree with Sean here. Doesn't look like 2.12 support
requires unexpected API breakages. So unless there's a really good
reason to break / remove a bunch of existing APIs...

On Thu, Apr 5, 2018 at 9:04 AM, Marco Gaido  wrote:
> Hi all,
>
> I also agree with Mark that we should add Java 9/10 support to an eventual
> Spark 3.0 release, because supporting Java 9 is not a trivial task since we
> are using some internal APIs for the memory management which changed: either
> we find a solution which works on both (but I am not sure it is feasible) or
> we have to switch between 2 implementations according to the Java version.
> So I'd rather avoid doing this in a non-major release.
>
> Thanks,
> Marco
>
>
> 2018-04-05 17:35 GMT+02:00 Mark Hamstra :
>>
>> As with Sean, I'm not sure that this will require a new major version, but
>> we should also be looking at Java 9 & 10 support -- particularly with regard
>> to their better functionality in a containerized environment (memory limits
>> from cgroups, not sysconf; support for cpusets). In that regard, we should
>> also be looking at using the latest Scala 2.11.x maintenance release in
>> current Spark branches.
>>
>> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
>>>
>>> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:

 The primary motivating factor IMO for a major version bump is to support
 Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
 Similar to Spark 2.0, I think there are also opportunities for other 
 changes
 that we know have been biting us for a long time but can’t be changed in
 feature releases (to be clear, I’m actually not sure they are all good
 ideas, but I’m writing them down as candidates for consideration):
>>>
>>>
>>> IIRC from looking at this, it is possible to support 2.11 and 2.12
>>> simultaneously. The cross-build already works now in 2.3.0. Barring some big
>>> change needed to get 2.12 fully working -- and that may be the case -- it
>>> nearly works that way now.
>>>
>>> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
>>> in byte code. However Scala itself isn't mutually compatible between 2.11
>>> and 2.12 anyway; that's never been promised as compatible.
>>>
>>> (Interesting question about what *Java* users should expect; they would
>>> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>>>
>>> I don't disagree with shooting for Spark 3.0, just saying I don't know if
>>> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
>>> 2.11 support if needed to make supporting 2.12 less painful.
>>
>>
>



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: time for Apache Spark 3.0?

2018-04-05 Thread Matei Zaharia
Java 9/10 support would be great to add as well.

Regarding Scala 2.12, I thought that supporting it would become easier if we 
change the Spark API and ABI slightly. Basically, it is of course possible to 
create an alternate source tree today, but it might be possible to share the 
same source files if we tweak some small things in the methods that are 
overloaded across Scala and Java. I don’t remember the exact details, but the 
idea was to reduce the total maintenance work needed at the cost of requiring 
users to recompile their apps.

I’m personally for moving to 3.0 because of the other things we can clean up as 
well, e.g. the default SQL dialect, Iterable stuff, and possibly dependency 
shading (a major pain point for lots of users). It’s also a chance to highlight 
Kubernetes, continuous processing and other features more if they become “GA”.

Matei

> On Apr 5, 2018, at 9:04 AM, Marco Gaido  wrote:
> 
> Hi all,
> 
> I also agree with Mark that we should add Java 9/10 support to an eventual 
> Spark 3.0 release, because supporting Java 9 is not a trivial task since we 
> are using some internal APIs for the memory management which changed: either 
> we find a solution which works on both (but I am not sure it is feasible) or 
> we have to switch between 2 implementations according to the Java version.
> So I'd rather avoid doing this in a non-major release.
> 
> Thanks,
> Marco
> 
> 
> 2018-04-05 17:35 GMT+02:00 Mark Hamstra :
> As with Sean, I'm not sure that this will require a new major version, but we 
> should also be looking at Java 9 & 10 support -- particularly with regard to 
> their better functionality in a containerized environment (memory limits from 
> cgroups, not sysconf; support for cpusets). In that regard, we should also be 
> looking at using the latest Scala 2.11.x maintenance release in current Spark 
> branches.
> 
> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
> The primary motivating factor IMO for a major version bump is to support 
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs. 
> Similar to Spark 2.0, I think there are also opportunities for other changes 
> that we know have been biting us for a long time but can’t be changed in 
> feature releases (to be clear, I’m actually not sure they are all good ideas, 
> but I’m writing them down as candidates for consideration):
> 
> IIRC from looking at this, it is possible to support 2.11 and 2.12 
> simultaneously. The cross-build already works now in 2.3.0. Barring some big 
> change needed to get 2.12 fully working -- and that may be the case -- it 
> nearly works that way now.
> 
> Compiling vs 2.11 and 2.12 does however result in some APIs that differ in 
> byte code. However Scala itself isn't mutually compatible between 2.11 and 
> 2.12 anyway; that's never been promised as compatible.
> 
> (Interesting question about what *Java* users should expect; they would see a 
> difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
> 
> I don't disagree with shooting for Spark 3.0, just saying I don't know if 
> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping 
> 2.11 support if needed to make supporting 2.12 less painful.
> 
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: time for Apache Spark 3.0?

2018-04-05 Thread Marco Gaido
Hi all,

I also agree with Mark that we should add Java 9/10 support to an eventual
Spark 3.0 release, because supporting Java 9 is not a trivial task since we
are using some internal APIs for the memory management which changed:
either we find a solution which works on both (but I am not sure it is
feasible) or we have to switch between 2 implementations according to the
Java version.
So I'd rather avoid doing this in a non-major release.
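
To give a hedged sense of the kind of internal API in question: Spark's off-heap
memory code obtains sun.misc.Unsafe reflectively, roughly along these lines. The
sketch is illustrative rather than Spark's actual code, and it is access of this
style (plus related sun.misc/jdk.internal classes) that shifted around in Java 9:

    import sun.misc.Unsafe

    // Obtain the private static `theUnsafe` instance reflectively, the way much
    // JVM infrastructure code does; this is the style of access that Java 9's
    // module changes make fragile.
    val unsafe: Unsafe = {
      val field = classOf[Unsafe].getDeclaredField("theUnsafe")
      field.setAccessible(true)
      field.get(null).asInstanceOf[Unsafe]
    }

    // Off-heap allocation of the sort Tungsten-style memory management relies on.
    val address = unsafe.allocateMemory(16L)
    unsafe.putLong(address, 42L)
    println(unsafe.getLong(address))
    unsafe.freeMemory(address)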

Thanks,
Marco


2018-04-05 17:35 GMT+02:00 Mark Hamstra :

> As with Sean, I'm not sure that this will require a new major version, but
> we should also be looking at Java 9 & 10 support -- particularly with
> regard to their better functionality in a containerized environment (memory
> limits from cgroups, not sysconf; support for cpusets). In that regard, we
> should also be looking at using the latest Scala 2.11.x maintenance release
> in current Spark branches.
>
> On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:
>
>> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
>>
>>> The primary motivating factor IMO for a major version bump is to support
>>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>>> Similar to Spark 2.0, I think there are also opportunities for other
>>> changes that we know have been biting us for a long time but can’t be
>>> changed in feature releases (to be clear, I’m actually not sure they are
>>> all good ideas, but I’m writing them down as candidates for consideration):
>>>
>>
>> IIRC from looking at this, it is possible to support 2.11 and 2.12
>> simultaneously. The cross-build already works now in 2.3.0. Barring some
>> big change needed to get 2.12 fully working -- and that may be the case --
>> it nearly works that way now.
>>
>> Compiling vs 2.11 and 2.12 does however result in some APIs that differ
>> in byte code. However Scala itself isn't mutually compatible between 2.11
>> and 2.12 anyway; that's never been promised as compatible.
>>
>> (Interesting question about what *Java* users should expect; they would
>> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>>
>> I don't disagree with shooting for Spark 3.0, just saying I don't know if
>> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
>> 2.11 support if needed to make supporting 2.12 less painful.
>>
>
>


Re: time for Apache Spark 3.0?

2018-04-05 Thread Mark Hamstra
As with Sean, I'm not sure that this will require a new major version, but
we should also be looking at Java 9 & 10 support -- particularly with
regard to their better functionality in a containerized environment (memory
limits from cgroups, not sysconf; support for cpusets). In that regard, we
should also be looking at using the latest Scala 2.11.x maintenance release
in current Spark branches.
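
For context, the container awareness mentioned here surfaces as JVM flags, which
an application can already pass through to executors today. The flags below are
JDK-specific assumptions (UseCGroupMemoryLimitForHeap is experimental in Java
8u131+/9 and superseded by UseContainerSupport from Java 10), so treat this as an
illustration rather than a recommendation:

    import org.apache.spark.sql.SparkSession

    // Forward cgroup-aware heap sizing flags to executor JVMs; the flag names are
    // JDK-specific and shown only as an example.
    val spark = SparkSession.builder()
      .appName("container-aware-executors")
      .config("spark.executor.extraJavaOptions",
        "-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap")
      .getOrCreate()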

On Thu, Apr 5, 2018 at 5:45 AM, Sean Owen  wrote:

> On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:
>
>> The primary motivating factor IMO for a major version bump is to support
>> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
>> Similar to Spark 2.0, I think there are also opportunities for other
>> changes that we know have been biting us for a long time but can’t be
>> changed in feature releases (to be clear, I’m actually not sure they are
>> all good ideas, but I’m writing them down as candidates for consideration):
>>
>
> IIRC from looking at this, it is possible to support 2.11 and 2.12
> simultaneously. The cross-build already works now in 2.3.0. Barring some
> big change needed to get 2.12 fully working -- and that may be the case --
> it nearly works that way now.
>
> Compiling vs 2.11 and 2.12 does however result in some APIs that differ in
> byte code. However Scala itself isn't mutually compatible between 2.11 and
> 2.12 anyway; that's never been promised as compatible.
>
> (Interesting question about what *Java* users should expect; they would
> see a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)
>
> I don't disagree with shooting for Spark 3.0, just saying I don't know if
> 2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
> 2.11 support if needed to make supporting 2.12 less painful.
>


Re: time for Apache Spark 3.0?

2018-04-05 Thread Sean Owen
On Wed, Apr 4, 2018 at 6:20 PM Reynold Xin  wrote:

> The primary motivating factor IMO for a major version bump is to support
> Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
> Similar to Spark 2.0, I think there are also opportunities for other
> changes that we know have been biting us for a long time but can’t be
> changed in feature releases (to be clear, I’m actually not sure they are
> all good ideas, but I’m writing them down as candidates for consideration):
>

IIRC from looking at this, it is possible to support 2.11 and 2.12
simultaneously. The cross-build already works now in 2.3.0. Barring some
big change needed to get 2.12 fully working -- and that may be the case --
it nearly works that way now.

Compiling vs 2.11 and 2.12 does however result in some APIs that differ in
byte code. However Scala itself isn't mutually compatible between 2.11 and
2.12 anyway; that's never been promised as compatible.

(Interesting question about what *Java* users should expect; they would see
a difference in 2.11 vs 2.12 Spark APIs, but that has always been true.)

I don't disagree with shooting for Spark 3.0, just saying I don't know if
2.12 support requires moving to 3.0. But, Spark 3.0 could consider dropping
2.11 support if needed to make supporting 2.12 less painful.


time for Apache Spark 3.0?

2018-04-04 Thread Reynold Xin
There was a discussion thread on scala-contributors

about Apache Spark not yet supporting Scala 2.12, and that got me to think
perhaps it is about time for Spark to work towards the 3.0 release. By the
time it comes out, it will be more than 2 years since Spark 2.0.

For contributors less familiar with Spark’s history, I want to give more
context on Spark releases:

1. Timeline: Spark 1.0 was released May 2014. Spark 2.0 was July 2016. If
we were to maintain the ~ 2 year cadence, it is time to work on Spark 3.0
in 2018.

2. Spark’s versioning policy promises that Spark does not break stable APIs
in feature releases (e.g. 2.1, 2.2). API breaking changes are sometimes a
necessary evil, and can be done in major releases (e.g. 1.6 to 2.0, 2.x to
3.0).

3. That said, a major version isn’t necessarily the playground for
disruptive API changes to make it painful for users to update. The main
purpose of a major release is an opportunity to fix things that are broken
in the current API and remove certain deprecated APIs.

4. Spark as a project has a culture of evolving architecture and developing
major new features incrementally, so major releases are not the only time
for exciting new features. For example, the bulk of the work in the move
towards the DataFrame API was done in Spark 1.3, and Continuous Processing
was introduced in Spark 2.3. Both were feature releases rather than major
releases.


You can find more background in the thread discussing Spark 2.0:
http://apache-spark-developers-list.1001551.n3.nabble.com/A-proposal-for-Spark-2-0-td15122.html


The primary motivating factor IMO for a major version bump is to support
Scala 2.12, which requires minor API breaking changes to Spark’s APIs.
Similar to Spark 2.0, I think there are also opportunities for other
changes that we know have been biting us for a long time but can’t be
changed in feature releases (to be clear, I’m actually not sure they are
all good ideas, but I’m writing them down as candidates for consideration):

1. Support Scala 2.12.

2. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in Spark
2.x.

3. Shade all dependencies.

4. Change the reserved keywords in Spark SQL to be more ANSI-SQL compliant,
to prevent users from shooting themselves in the foot, e.g. “SELECT 2
SECOND” -- is “SECOND” an interval unit or an alias? To make it less
painful for users to upgrade here, I’d suggest creating a flag for backward
compatibility mode (see the sketch after this list).

5. Similar to 4, make our type coercion rule in DataFrame/SQL more standard
compliant, and have a flag for backward compatibility.

6. Miscellaneous other small changes documented in JIRA already (e.g.
“JavaPairRDD flatMapValues requires function returning Iterable, not
Iterator”, “Prevent column name duplication in temporary view”).
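
As a hedged illustration of the keyword point in item 4: today the parser treats
SECOND as a column alias, and a compatibility switch of the sort suggested above
would have to gate the stricter ANSI behaviour. The config name below is made up
for the example and is not an existing Spark setting:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("keyword-demo").getOrCreate()

    // Parsed as the literal 2 with a column alias named SECOND, not as an interval.
    spark.sql("SELECT 2 SECOND").show()

    // A backward-compatibility switch for stricter ANSI keyword handling might then
    // look something like this (hypothetical flag name):
    // spark.conf.set("spark.sql.parser.ansiReservedKeywords", "false")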


Now the reality of a major version bump is that the world often thinks in
terms of what exciting features are coming. I do think there are a number
of major changes happening already that can be part of the 3.0 release, if
they make it in:

1. Scala 2.12 support (listing it twice)
2. Continuous Processing non-experimental
3. Kubernetes support non-experimental
4. A more fleshed-out version of data source API v2 (I don’t think it is
realistic to stabilize that in one release)
5. Hadoop 3.0 support
6. ...



Similar to the 2.0 discussion, this thread should focus on the framework
and whether it’d make sense to create Spark 3.0 as the next release, rather
than the individual feature requests. Those are important but are best done
in their own separate threads.