Re: DataSourceV2 capability API

2018-11-12 Thread JackyLee
I'm not sure it is right to shape the table API as
ContinuousScanBuilder -> ContinuousScan -> ContinuousBatch; it makes the
batch, micro-batch, and continuous modes too different from each other.
In my opinion, they are basically similar at the table level. So would it be
possible to design an API like this instead?
ScanBuilder -> Scan -> ContinuousBatch/MicroBatch/SingleBatch
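
For example, a rough sketch of the shape I mean (the names and signatures here
are only illustrative, not a concrete proposal):

interface SingleBatch {}      // placeholder execution interfaces
interface MicroBatch {}
interface ContinuousBatch {}

interface Scan {
  // one Scan, and each mode is just a different way to run it
  SingleBatch toSingleBatch();
  MicroBatch toMicroBatch();
  ContinuousBatch toContinuousBatch();
}

interface ScanBuilder {
  Scan build();
}

interface Table {
  ScanBuilder newScanBuilder();
}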






Re: time for Apache Spark 3.0?

2018-11-12 Thread Sean Owen
My non-definitive takes --

I would personally like to remove all deprecated methods for Spark 3.
I started by removing the 'old' deprecated methods in that commit. Whether
things deprecated as recently as 2.4 should also be removed is less clear.

Everything's fair game for removal or change in a major release. So far,
the items under discussion seem to be Scala 2.11 support, Python 2 support,
and support for R versions before 3.4. I don't know about other APIs.

Generally, take a look at JIRA for items targeted at version 3.0. Not
everything targeted for 3.0 will go in, but items from committers are more
likely to make it than others. Breaking changes ought to be tagged
'release-notes' with a description of the change. The release itself has a
migration guide that's being updated as we go.


On Mon, Nov 12, 2018 at 5:49 PM Matt Cheah  wrote:
>
> I wanted to clarify what categories of APIs are eligible to be broken in 
> Spark 3.0. Specifically:
>
>
>
> - Are we removing all deprecated methods? If we’re only removing some subset of
>   deprecated methods, what is that subset? I see a bunch were removed in
>   https://github.com/apache/spark/pull/22921 for example. Are we only committed
>   to removing methods that were deprecated in some Spark version and earlier?
> - Aside from removing support for Scala 2.11, what other kinds of
>   (non-experimental and non-evolving) APIs are eligible to be broken?
> - Is there going to be a way to track the current list of all proposed breaking
>   changes / JIRA tickets? Perhaps we can include it in the JIRA ticket that can
>   be filtered down to somehow?
>




Re: time for Apache Spark 3.0?

2018-11-12 Thread Reynold Xin
All API removal and deprecation JIRAs should be tagged "releasenotes", so
we can reference them when we build release notes. I don't know if
everybody is still following that practice, but it'd be great to do that.
Since we don't have that many PRs, we should still be able to retroactively
tag.

We can also add a new tag for API changes, but I feel at this stage it
might be easier to just use "releasenotes".


On Mon, Nov 12, 2018 at 3:49 PM Matt Cheah  wrote:

> I wanted to clarify what categories of APIs are eligible to be broken in
> Spark 3.0. Specifically:
>
>
>
>- Are we removing all deprecated methods? If we’re only removing some
>subset of deprecated methods, what is that subset? I see a bunch were
>removed in https://github.com/apache/spark/pull/22921 for example. Are
>we only committed to removing methods that were deprecated in some Spark
>version and earlier?
>- Aside from removing support for Scala 2.11, what other kinds of
>(non-experimental and non-evolving) APIs are eligible to be broken?
>- Is there going to be a way to track the current list of all proposed
>breaking changes / JIRA tickets? Perhaps we can include it in the JIRA
>ticket that can be filtered down to somehow?
>
>
>
> Thanks,
>
>
>
> -Matt Cheah
>
> *From: *Vinoo Ganesh 
> *Date: *Monday, November 12, 2018 at 2:48 PM
> *To: *Reynold Xin 
> *Cc: *Xiao Li , Matei Zaharia <
> matei.zaha...@gmail.com>, Ryan Blue , Mark Hamstra <
> m...@clearstorydata.com>, dev 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Makes sense, thanks Reynold.
>
>
>
> *From: *Reynold Xin 
> *Date: *Monday, November 12, 2018 at 16:57
> *To: *Vinoo Ganesh 
> *Cc: *Xiao Li , Matei Zaharia <
> matei.zaha...@gmail.com>, Ryan Blue , Mark Hamstra <
> m...@clearstorydata.com>, dev 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Master branch now tracks the 3.0.0-SNAPSHOT version, so the next release will be
> 3.0. In terms of timing, unless we change anything specifically, Spark
> feature releases are on a 6-month cadence. Spark 2.4 was just released last
> week, so 3.0 will be roughly 6 months from now.
>
>
>
> On Mon, Nov 12, 2018 at 1:54 PM Vinoo Ganesh  wrote:
>
> Quickly following up on this – is there a target date for when Spark 3.0
> may be released and/or a list of the likely api breaks that are
> anticipated?
>
>
>
> *From: *Xiao Li 
> *Date: *Saturday, September 29, 2018 at 02:09
> *To: *Reynold Xin 
> *Cc: *Matei Zaharia , Ryan Blue <
> rb...@netflix.com>, Mark Hamstra , "
> u...@spark.apache.org" 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Yes. We should create a SPIP for each major breaking change.
>
>
>
> On Fri, Sep 28, 2018 at 11:05 PM, Reynold Xin wrote:
>
> i think we should create spips for some of them, since they are pretty
> large ... i can create some tickets to start with
>
>
> --
>
> excuse the brevity and lower case due to wrist injury
>
>
>
>
>
> On Fri, Sep 28, 2018 at 11:01 PM Xiao Li  wrote:
>
> Based on the above discussions, we have a "rough consensus" that the next
> release will be 3.0. Now, we can start working on the API breaking changes
> (e.g., the ones mentioned in the original email from Reynold).
>
>
>
> Cheers,
>
>
>
> Xiao
>
>
>
> On Thu, Sep 6, 2018 at 2:21 PM, Matei Zaharia wrote:
>
> Yes, you can start with Unstable and move to Evolving and Stable when
> needed. We’ve definitely had experimental features that changed across
> maintenance releases when they were well-isolated. If your change risks
> breaking stuff in stable components of Spark though, then it probably won’t
> be suitable for that.
>
> > On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> >
> > I meant flexibility beyond the point releases. I think what Reynold was
> suggesting was getting v2 code out more often than the point releases every
> 6 months. An Evolving API can change in point releases, but maybe we should
> move v2 to Unstable so it can change more often? I don't really see another
> way to get changes out more often.
> >
> > On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
> wrote:
> > Yes, that is why we have these annotations in the code and the
> corresponding labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
> 
> >
> > As long as it is properly annotated, we can change or even eliminate an
> API method before the next major release. And frankly, we shouldn't be
> contemplating bringing in the DS v2 API (and, I'd argue, any new API)
> without such an annotation. There is just too much risk of not getting
> everything right before we see the results of the new API being more widely
> used, and too much cost in maintaining until the next major release something
> that we come to regret for us to create new API in a fully frozen state.

Re: time for Apache Spark 3.0?

2018-11-12 Thread Matt Cheah
I wanted to clarify what categories of APIs are eligible to be broken in Spark 
3.0. Specifically:

 
- Are we removing all deprecated methods? If we’re only removing some subset of
  deprecated methods, what is that subset? I see a bunch were removed in
  https://github.com/apache/spark/pull/22921 for example. Are we only committed
  to removing methods that were deprecated in some Spark version and earlier?
- Aside from removing support for Scala 2.11, what other kinds of
  (non-experimental and non-evolving) APIs are eligible to be broken?
- Is there going to be a way to track the current list of all proposed breaking
  changes / JIRA tickets? Perhaps we can include it in the JIRA ticket that can
  be filtered down to somehow?
 

Thanks,

 

-Matt Cheah

From: Vinoo Ganesh 
Date: Monday, November 12, 2018 at 2:48 PM
To: Reynold Xin 
Cc: Xiao Li , Matei Zaharia , 
Ryan Blue , Mark Hamstra , dev 

Subject: Re: time for Apache Spark 3.0?

 

Makes sense, thanks Reynold. 

 

From: Reynold Xin 
Date: Monday, November 12, 2018 at 16:57
To: Vinoo Ganesh 
Cc: Xiao Li , Matei Zaharia , 
Ryan Blue , Mark Hamstra , dev 

Subject: Re: time for Apache Spark 3.0?

 

Master branch now tracks the 3.0.0-SNAPSHOT version, so the next release will be 3.0.
In terms of timing, unless we change anything specifically, Spark feature
releases are on a 6-month cadence. Spark 2.4 was just released last week, so 3.0
will be roughly 6 months from now.

 

On Mon, Nov 12, 2018 at 1:54 PM Vinoo Ganesh  wrote:

Quickly following up on this – is there a target date for when Spark 3.0 may be 
released and/or a list of the likely api breaks that are anticipated? 

 

From: Xiao Li 
Date: Saturday, September 29, 2018 at 02:09
To: Reynold Xin 
Cc: Matei Zaharia , Ryan Blue , 
Mark Hamstra , "u...@spark.apache.org" 

Subject: Re: time for Apache Spark 3.0?

 

Yes. We should create a SPIP for each major breaking change. 

 

On Fri, Sep 28, 2018 at 11:05 PM, Reynold Xin wrote:

i think we should create spips for some of them, since they are pretty large 
... i can create some tickets to start with 


--

excuse the brevity and lower case due to wrist injury

 

 

On Fri, Sep 28, 2018 at 11:01 PM Xiao Li  wrote:

Based on the above discussions, we have a "rough consensus" that the next 
release will be 3.0. Now, we can start working on the API breaking changes 
(e.g., the ones mentioned in the original email from Reynold). 

 

Cheers,

 

Xiao 

 

On Thu, Sep 6, 2018 at 2:21 PM, Matei Zaharia wrote:

Yes, you can start with Unstable and move to Evolving and Stable when needed. 
We’ve definitely had experimental features that changed across maintenance 
releases when they were well-isolated. If your change risks breaking stuff in 
stable components of Spark though, then it probably won’t be suitable for that.

> On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> 
> I meant flexibility beyond the point releases. I think what Reynold was 
> suggesting was getting v2 code out more often than the point releases every 6 
> months. An Evolving API can change in point releases, but maybe we should 
> move v2 to Unstable so it can change more often? I don't really see another 
> way to get changes out more often.
> 
> On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra  wrote:
> Yes, that is why we have these annotations in the code and the corresponding 
> labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
> 
> As long as it is properly annotated, we can change or even eliminate an API 
> method before the next major release. And frankly, we shouldn't be 
> contemplating bringing in the DS v2 API (and, I'd argue, any new API) without 
> such an annotation. There is just too much risk of not getting everything 
> right before we see the results of the new API being more widely used, and 
> too much cost in maintaining until the next major release something that we 
> come to regret for us to create new API in a fully frozen state.
>  
> 
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:
> It would be great to get more features out incrementally. For experimental 
> features, do we have more relaxed constraints?
> 
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
> +1 on 3.0
> 
> Dsv2, even once stable, can still evolve across major releases. DataFrame, Dataset,
> dsv1 and a lot of other major features all were developed throughout the 1.x 
> and 2.x lines.
> 
> I do want to explore ways for us to get dsv2 incremental changes out there 
> more frequently, to get feedback. Maybe that means we apply additive changes 
> to 2.4.x; maybe that means making another 2.5 release sooner. I will start a 
> separate thread about it.
> 
> 
> 
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on timing? 
> 6 months?) but simply next. Do you mean you'd prefer that change to happen 
> before 3.x? if it's a significant change, seems reasonable for a major
> version bump rather than minor. Is the concern that tying it to 3.0 means you
> have to take a major version update to get it?

Re: time for Apache Spark 3.0?

2018-11-12 Thread Vinoo Ganesh
Makes sense, thanks Reynold.

From: Reynold Xin 
Date: Monday, November 12, 2018 at 16:57
To: Vinoo Ganesh 
Cc: Xiao Li , Matei Zaharia , 
Ryan Blue , Mark Hamstra , dev 

Subject: Re: time for Apache Spark 3.0?

Master branch now tracks the 3.0.0-SNAPSHOT version, so the next release will be 3.0.
In terms of timing, unless we change anything specifically, Spark feature
releases are on a 6-month cadence. Spark 2.4 was just released last week, so 3.0
will be roughly 6 months from now.

On Mon, Nov 12, 2018 at 1:54 PM Vinoo Ganesh <vgan...@palantir.com> wrote:
Quickly following up on this – is there a target date for when Spark 3.0 may be 
released and/or a list of the likely api breaks that are anticipated?

From: Xiao Li <gatorsm...@gmail.com>
Date: Saturday, September 29, 2018 at 02:09
To: Reynold Xin <r...@databricks.com>
Cc: Matei Zaharia <matei.zaha...@gmail.com>, Ryan Blue <rb...@netflix.com>,
Mark Hamstra <m...@clearstorydata.com>, "u...@spark.apache.org" <dev@spark.apache.org>
Subject: Re: time for Apache Spark 3.0?

Yes. We should create a SPIP for each major breaking change.

On Fri, Sep 28, 2018 at 11:05 PM, Reynold Xin <r...@databricks.com> wrote:
i think we should create spips for some of them, since they are pretty large 
... i can create some tickets to start with

--
excuse the brevity and lower case due to wrist injury


On Fri, Sep 28, 2018 at 11:01 PM Xiao Li <gatorsm...@gmail.com> wrote:
Based on the above discussions, we have a "rough consensus" that the next 
release will be 3.0. Now, we can start working on the API breaking changes 
(e.g., the ones mentioned in the original email from Reynold).

Cheers,

Xiao

On Thu, Sep 6, 2018 at 2:21 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
Yes, you can start with Unstable and move to Evolving and Stable when needed. 
We’ve definitely had experimental features that changed across maintenance 
releases when they were well-isolated. If your change risks breaking stuff in 
stable components of Spark though, then it probably won’t be suitable for that.

> On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
>
> I meant flexibility beyond the point releases. I think what Reynold was 
> suggesting was getting v2 code out more often than the point releases every 6 
> months. An Evolving API can change in point releases, but maybe we should 
> move v2 to Unstable so it can change more often? I don't really see another 
> way to get changes out more often.
>
> On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra <m...@clearstorydata.com> wrote:
> Yes, that is why we have these annotations in the code and the corresponding 
> labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
>  
>
> As long as it is properly annotated, we can change or even eliminate an API 
> method before the next major release. And frankly, we shouldn't be 
> contemplating bringing in the DS v2 API (and, I'd argue, any new API) without 
> such an annotation. There is just too much risk of not getting everything 
> right before we see the results of the new API being more widely used, and 
> too much cost in maintaining until the next major release something that we 
> come to regret for us to create new API in a fully frozen state.
>
>
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:
> It would be great to get more features out incrementally. For experimental 
> features, do we have more relaxed constraints?
>
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin <r...@databricks.com> wrote:
> +1 on 3.0
>
> Dsv2, even once stable, can still evolve across major releases. DataFrame, Dataset,
> dsv1 and a lot of other major features all were developed throughout the 1.x 
> and 2.x lines.
>
> I do want to explore ways for us to get dsv2 incremental changes out there 
> more frequently, to get feedback. Maybe that means we apply additive changes 
> to 2.4.x; maybe that means making another 2.5 release sooner. I will start a 
> separate thread about it.
>
>
>
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen <sro...@gmail.com> wrote:
> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on timing? 
> 6 months?) but simply next. Do you mean you'd prefer that change to happen 
> before 3.x? if it's a significant change, seems reasonable for a major 
> version bump rather than minor. Is the concern that tying it to 3.0 means you 
> have to take a major version update to get it?
>
> I generally support moving on to 3.x so we can also jettison a lot of older
> dependencies, code, fix some long standing issues, etc.

Re: time for Apache Spark 3.0?

2018-11-12 Thread Reynold Xin
Master branch now tracks the 3.0.0-SNAPSHOT version, so the next release will be
3.0. In terms of timing, unless we change anything specifically, Spark
feature releases are on a 6-month cadence. Spark 2.4 was just released last
week, so 3.0 will be roughly 6 months from now.

On Mon, Nov 12, 2018 at 1:54 PM Vinoo Ganesh  wrote:

> Quickly following up on this – is there a target date for when Spark 3.0
> may be released and/or a list of the likely api breaks that are
> anticipated?
>
>
>
> *From: *Xiao Li 
> *Date: *Saturday, September 29, 2018 at 02:09
> *To: *Reynold Xin 
> *Cc: *Matei Zaharia , Ryan Blue <
> rb...@netflix.com>, Mark Hamstra , "
> u...@spark.apache.org" 
> *Subject: *Re: time for Apache Spark 3.0?
>
>
>
> Yes. We should create a SPIP for each major breaking change.
>
>
>
> On Fri, Sep 28, 2018 at 11:05 PM, Reynold Xin wrote:
>
> i think we should create spips for some of them, since they are pretty
> large ... i can create some tickets to start with
>
>
> --
>
> excuse the brevity and lower case due to wrist injury
>
>
>
>
>
> On Fri, Sep 28, 2018 at 11:01 PM Xiao Li  wrote:
>
> Based on the above discussions, we have a "rough consensus" that the next
> release will be 3.0. Now, we can start working on the API breaking changes
> (e.g., the ones mentioned in the original email from Reynold).
>
>
>
> Cheers,
>
>
>
> Xiao
>
>
>
> On Thu, Sep 6, 2018 at 2:21 PM, Matei Zaharia wrote:
>
> Yes, you can start with Unstable and move to Evolving and Stable when
> needed. We’ve definitely had experimental features that changed across
> maintenance releases when they were well-isolated. If your change risks
> breaking stuff in stable components of Spark though, then it probably won’t
> be suitable for that.
>
> > On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
> >
> > I meant flexibility beyond the point releases. I think what Reynold was
> suggesting was getting v2 code out more often than the point releases every
> 6 months. An Evolving API can change in point releases, but maybe we should
> move v2 to Unstable so it can change more often? I don't really see another
> way to get changes out more often.
> >
> > On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra 
> wrote:
> > Yes, that is why we have these annotations in the code and the
> corresponding labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
> 
> >
> > As long as it is properly annotated, we can change or even eliminate an
> API method before the next major release. And frankly, we shouldn't be
> contemplating bringing in the DS v2 API (and, I'd argue, any new API)
> without such an annotation. There is just too much risk of not getting
> everything right before we see the results of the new API being more widely
> used, and too much cost in maintaining until the next major release
> something that we come to regret for us to create new API in a fully frozen
> state.
> >
> >
> > On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue 
> wrote:
> > It would be great to get more features out incrementally. For
> experimental features, do we have more relaxed constraints?
> >
> > On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin  wrote:
> > +1 on 3.0
> >
> > Dsv2, even once stable, can still evolve across major releases. DataFrame,
> Dataset, dsv1 and a lot of other major features all were developed
> throughout the 1.x and 2.x lines.
> >
> > I do want to explore ways for us to get dsv2 incremental changes out
> there more frequently, to get feedback. Maybe that means we apply additive
> changes to 2.4.x; maybe that means making another 2.5 release sooner. I
> will start a separate thread about it.
> >
> >
> >
> > On Thu, Sep 6, 2018 at 9:31 AM Sean Owen  wrote:
> > I think this doesn't necessarily mean 3.0 is coming soon (thoughts on
> timing? 6 months?) but simply next. Do you mean you'd prefer that change to
> happen before 3.x? if it's a significant change, seems reasonable for a
> major version bump rather than minor. Is the concern that tying it to 3.0
> means you have to take a major version update to get it?
> >
> > I generally support moving on to 3.x so we can also jettison a lot of
> older dependencies, code, fix some long standing issues, etc.
> >
> > (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
> >
> > On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue 
> wrote:
> > My concern is that the v2 data source API is still evolving and not very
> close to stable. I had hoped to have stabilized the API and behaviors for a
> 3.0 release. But we could also wait on that for a 4.0 release, depending on
> when we think that will be.

Re: time for Apache Spark 3.0?

2018-11-12 Thread Vinoo Ganesh
Quickly following up on this – is there a target date for when Spark 3.0 may be 
released and/or a list of the likely api breaks that are anticipated?

From: Xiao Li 
Date: Saturday, September 29, 2018 at 02:09
To: Reynold Xin 
Cc: Matei Zaharia , Ryan Blue , 
Mark Hamstra , "u...@spark.apache.org" 

Subject: Re: time for Apache Spark 3.0?

Yes. We should create a SPIP for each major breaking change.

On Fri, Sep 28, 2018 at 11:05 PM, Reynold Xin <r...@databricks.com> wrote:
i think we should create spips for some of them, since they are pretty large 
... i can create some tickets to start with

--
excuse the brevity and lower case due to wrist injury


On Fri, Sep 28, 2018 at 11:01 PM Xiao Li <gatorsm...@gmail.com> wrote:
Based on the above discussions, we have a "rough consensus" that the next 
release will be 3.0. Now, we can start working on the API breaking changes 
(e.g., the ones mentioned in the original email from Reynold).

Cheers,

Xiao

On Thu, Sep 6, 2018 at 2:21 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
Yes, you can start with Unstable and move to Evolving and Stable when needed. 
We’ve definitely had experimental features that changed across maintenance 
releases when they were well-isolated. If your change risks breaking stuff in 
stable components of Spark though, then it probably won’t be suitable for that.

> On Sep 6, 2018, at 1:49 PM, Ryan Blue  wrote:
>
> I meant flexibility beyond the point releases. I think what Reynold was 
> suggesting was getting v2 code out more often than the point releases every 6 
> months. An Evolving API can change in point releases, but maybe we should 
> move v2 to Unstable so it can change more often? I don't really see another 
> way to get changes out more often.
>
> On Thu, Sep 6, 2018 at 11:07 AM Mark Hamstra <m...@clearstorydata.com> wrote:
> Yes, that is why we have these annotations in the code and the corresponding 
> labels appearing in the API documentation: 
> https://github.com/apache/spark/blob/master/common/tags/src/main/java/org/apache/spark/annotation/InterfaceStability.java
>  
>
> As long as it is properly annotated, we can change or even eliminate an API 
> method before the next major release. And frankly, we shouldn't be 
> contemplating bringing in the DS v2 API (and, I'd argue, any new API) without 
> such an annotation. There is just too much risk of not getting everything 
> right before we see the results of the new API being more widely used, and 
> too much cost in maintaining until the next major release something that we 
> come to regret for us to create new API in a fully frozen state.
>
>
> On Thu, Sep 6, 2018 at 9:49 AM Ryan Blue  wrote:
> It would be great to get more features out incrementally. For experimental 
> features, do we have more relaxed constraints?
>
> On Thu, Sep 6, 2018 at 9:47 AM Reynold Xin <r...@databricks.com> wrote:
> +1 on 3.0
>
> Dsv2, even once stable, can still evolve across major releases. DataFrame, Dataset,
> dsv1 and a lot of other major features all were developed throughout the 1.x 
> and 2.x lines.
>
> I do want to explore ways for us to get dsv2 incremental changes out there 
> more frequently, to get feedback. Maybe that means we apply additive changes 
> to 2.4.x; maybe that means making another 2.5 release sooner. I will start a 
> separate thread about it.
>
>
>
> On Thu, Sep 6, 2018 at 9:31 AM Sean Owen <sro...@gmail.com> wrote:
> I think this doesn't necessarily mean 3.0 is coming soon (thoughts on timing? 
> 6 months?) but simply next. Do you mean you'd prefer that change to happen 
> before 3.x? if it's a significant change, seems reasonable for a major 
> version bump rather than minor. Is the concern that tying it to 3.0 means you 
> have to take a major version update to get it?
>
> I generally support moving on to 3.x so we can also jettison a lot of older 
> dependencies, code, fix some long standing issues, etc.
>
> (BTW Scala 2.12 support, mentioned in the OP, will go in for 2.4)
>
> On Thu, Sep 6, 2018 at 9:10 AM Ryan Blue  wrote:
> My concern is that the v2 data source API is still evolving and not very 
> close to stable. I had hoped to have stabilized the API and behaviors for a 
> 3.0 release. But we could also wait on that for a 4.0 release, depending on 
> when we think that will be.
>
> Unless there is a pressing need to move to 3.0 for some other area, I think 
> it would be better for the v2 sources to have a 2.5 release.
>
> On Thu, Sep 6, 2018 at 8:59 AM Xiao Li <gatorsm...@gmail.com> wrote:
> Yesterday, the 2.4 branch was created. Based on the above discussion, I 

Re: Spark Utf 8 encoding

2018-11-12 Thread lsn24
My terminal can display UTF-8 encoded characters; I already verified that,
but I will double-check again.
Thanks!






Re: DataSourceV2 capability API

2018-11-12 Thread Wenchen Fan
I think this works, but there are also other solutions, e.g., mixin traits
and runtime exceptions.

Assuming the general abstraction is: table -> scan builder -> scan ->
batch/batches (see alternative #2 in the doc).

For example, if we want to tell whether a table supports continuous streaming,
we can define three traits (interfaces):
interface SupportsBatchScan extends Table {
  ScanBuilder newScanBuilder();
}
interface SupportsMicroBatchScan extends Table {
  ScanBuilder newScanBuilder();
}
interface SupportsContinuousScan extends Table {
  ScanBuilder newScanBuilder();
}

Spark can then check whether the given table implements SupportsContinuousScan.
Note that Java allows a class to implement different interfaces that declare
the same method(s).
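
For example, a possible (purely illustrative) way for Spark to branch on the
trait, where ScanModeDispatch is just a placeholder helper:

class ScanModeDispatch {
  static ScanBuilder continuousScanBuilderFor(Table table) {
    // the mixin trait is the capability signal: implementing it means the
    // table supports continuous scans
    if (table instanceof SupportsContinuousScan) {
      return ((SupportsContinuousScan) table).newScanBuilder();
    }
    throw new UnsupportedOperationException("table does not support continuous scan");
  }
}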

Or put everything in the Table interface:
interface Table {
  // each default implementation signals "not supported" at runtime
  default ScanBuilder newSingleBatchScanBuilder() { throw new UnsupportedOperationException(); }
  default ScanBuilder newMicroBatchScanBuilder() { throw new UnsupportedOperationException(); }
  default ScanBuilder newContinuousScanBuilder() { throw new UnsupportedOperationException(); }
}

Spark then simply calls the method corresponding to the chosen scan mode.
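
With this variant the check becomes a runtime probe, e.g. (again, just a
sketch with a made-up helper name):

class ScanModeProbe {
  static boolean supportsContinuous(Table table) {
    try {
      table.newContinuousScanBuilder();
      return true;
    } catch (UnsupportedOperationException e) {
      // the default implementation throws, so this table does not support it
      return false;
    }
  }
}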


Another problem is how much type safety we want. Better type safety usually
means more complicated interfaces.

For example, if we want strong type safety, we need to do the branching
between batch, micro-batch and continuous modes at the table level. Then
the table interface becomes (assuming we pick the mixin trait solution)
interface SupportsBatchScan extends Table {
  SingleBatchScanBuilder newScanBuilder();
}
interface SupportsMicroBatchScan extends Table {
  MicroBatchScanBuilder newScanBuilder();
}
interface SupportsContinuousScan extends Table {
  ContinuousScanBuilder newScanBuilder();
}

The drawback is that we end up with a lot of interfaces, i.e. ContinuousScanBuilder ->
ContinuousScan -> ContinuousBatch, plus their corresponding versions for the
other two scan modes.


If we don't care much about type safety, we can do the branching at the
scan level:
interface Scan {
  // batch and micro-batch modes share the same Batch interface
  default Batch newSingleBatch() { throw new UnsupportedOperationException(); }
  default Batch newMicroBatch() { throw new UnsupportedOperationException(); }
  default ContinuousBatch newContinuousBatch() { throw new UnsupportedOperationException(); }
}

Now we have ScanBuilder and Scan interfaces shared for all the scan modes,
and only have different batch interfaces. We can delay the branching
further, but that needs some refactoring of the continuous streaming data
source APIs.


I think the capability API is not a must-have at the current stage, but it's
worth investigating further to see which use cases it can help with.
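
For reference, the capability check Ryan mentions below could look roughly like
this (the interface name and capability strings are only illustrative):

interface TableWithCapabilities extends Table {
  // true if the source declares the named capability, e.g. "supports-decimal"
  // or "read-missing-columns-as-null"
  boolean isSupported(String capability);
}

Spark would consult such a method before relying on any optional behavior.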

Thanks,
Wenchen

On Sat, Nov 10, 2018 at 5:35 AM Ryan Blue  wrote:

> Another solution to the decimal case is using the capability API: use a
> capability to signal that the table knows about `supports-decimal`. So
> before the decimal support check, it would check
> `table.isSupported("type-capabilities")`.
>
> On Fri, Nov 9, 2018 at 12:45 PM Ryan Blue  wrote:
>
>> For that case, I think we would have a property that defines whether
>> supports-decimal is assumed or checked with the capability.
>>
>> Wouldn't we have this problem no matter what the capability API is? If we
>> used a trait to signal decimal support, then we would have to deal with
>> sources that were written before the trait was introduced. That doesn't
>> change the need for some way to signal support for specific capabilities
>> like the ones I've suggested.
>>
>> On Fri, Nov 9, 2018 at 12:38 PM Reynold Xin  wrote:
>>
>>> "If there is no way to report a feature (e.g., able to read missing as
>>> null) then there is no way for Spark to take advantage of it in the first
>>> place"
>>>
>>> Consider this (just a hypothetical scenario): We added
>>> "supports-decimal" in the future, because we see a lot of data sources
>>> don't support decimal and we want a more graceful error handling. That'd
>>> break all existing data sources.
>>>
>>> You can say we would never add any "existing" features to the feature
>>> list in the future, as a requirement for the feature list. But then I'm
>>> wondering how much does it really give you, beyond telling data sources to
>>> throw exceptions when they don't support a specific operation.
>>>
>>>
>>> On Fri, Nov 9, 2018 at 11:54 AM Ryan Blue  wrote:
>>>
 Do you have an example in mind where we might add a capability and
 break old versions of data sources?

 These are really for being able to tell what features a data source
 has. If there is no way to report a feature (e.g., able to read missing as
 null) then there is no way for Spark to take advantage of it in the first
 place. For the uses I've proposed, forward compatibility isn't a concern.
 When we add a capability, we add handling for it that old versions wouldn't
 be able to use anyway. The advantage is that we don't have to treat all
 sources the same.

 On Fri, Nov 9, 2018 at 11:32 AM Reynold 

Re: On Java 9+ support, Cleaners, modules and the death of reflection

2018-11-12 Thread Sean Owen
For those following, I have a PR up at
https://github.com/apache/spark/pull/22993

The implication is that ignoring MaxDirectMemorySize doesn't work out
of the box in Java 9+ now. However, you can make it work by setting
JVM flags to allow access to the new Cleaner class. Or set
MaxDirectMemorySize if it's an issue. Or do nothing if you don't
actually run up against the MaxDirectMemorySize limit, which seems to
default to equal the size of the JVM heap.
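
To illustrate (a generic sketch, not the code in the PR), the kind of
reflective access involved looks like the following, and on Java 9+ it only
succeeds if the JVM is started with flags along the lines of
--add-opens=java.base/java.nio=ALL-UNNAMED and
--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED:

import java.lang.reflect.Method;
import java.nio.ByteBuffer;

class DirectBufferCleaner {  // illustrative only
  static void clean(ByteBuffer buffer) {
    try {
      // direct buffers expose a cleaner() accessor; both setAccessible calls
      // are rejected on Java 9+ unless the packages above are opened
      Method cleanerMethod = buffer.getClass().getMethod("cleaner");
      cleanerMethod.setAccessible(true);
      Object cleaner = cleanerMethod.invoke(buffer);
      Method cleanMethod = cleaner.getClass().getMethod("clean");
      cleanMethod.setAccessible(true);
      cleanMethod.invoke(cleaner);
    } catch (ReflectiveOperationException | RuntimeException e) {
      // fall back to letting GC reclaim the buffer eventually
    }
  }
}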

On Thu, Nov 8, 2018 at 12:46 PM Sean Owen  wrote:
>
> I think this is a key thread, perhaps one of the only big problems,
> for Java 9+ support:
>
> https://issues.apache.org/jira/browse/SPARK-24421?focusedCommentId=16680169=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16680169
>
> We basically can't access a certain method (Cleaner.clean()) anymore
> without the user adding JVM flags to allow it. As far as I can tell.
> I'm working out what the alternatives or implications are.
>
> Thoughts welcome.
