Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-11 Thread Wenchen Fan
>>>> *From:* wangzhenhua (G) <wangzhen...@huawei.com> >>>> *Sent:* Friday, September 8, 2017 2:20:07 AM >>>> *To:* Dongjoon Hyun; 蒋星博 >>>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot >>>> Westerflier; Ryan B

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-11 Thread Wenchen Fan
om:* wangzhenhua (G) <wangzhen...@huawei.com> >>> *Sent:* Friday, September 8, 2017 2:20:07 AM >>> *To:* Dongjoon Hyun; 蒋星博 >>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot >>> Westerflier; Ryan Blue; Spark dev list; Suresh Thala

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-10 Thread Hemant Bhanawat
gjoon Hyun; 蒋星博 >> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot >> Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan >> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path >> >> >> +1 (non-binding)

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-10 Thread vaquar khan
> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot > Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan > *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path > > > +1 (non-binding) Great to see data source API is going to

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-10 Thread Noman Khan
ubject: Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path +1 (non-binding) Great to see data source API is going to be improved! best regards, -Zhenhua(Xander) From: Dongjoon Hyun [mailto:dongjoon.h...@gmail.com] Sent: September 8, 2017 4:07 To: 蒋星博 Cc: Michael Armbrust; Reynold Xin; Andr

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread wangzhenhua (G)
; Suresh Thalamati; Wenchen Fan Subject: Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path +1 (non-binding). On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 <jiangxb1...@gmail.com> wrote: +1 Reynold Xin <r...@databricks.com>

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Dongjoon Hyun
+1 (non-binding). On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 wrote: > +1 > > > On Thu, Sep 7, 2017 at 12:04 PM, Reynold Xin wrote: > >> +1 as well >> >> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust >> wrote: >> >>> +1 >>> >>> On Thu, Sep 7,

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread 蒋星博
+1 On Thu, Sep 7, 2017 at 12:04 PM, Reynold Xin wrote: > +1 as well > > On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust > wrote: > >> +1 >> >> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue >> wrote: >> >>> +1 (non-binding) >>> >>> Thanks

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Reynold Xin
+1 as well On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust wrote: > +1 > > On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue > wrote: > >> +1 (non-binding) >> >> Thanks for making the updates reflected in the current PR. It would be >> great to see

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Michael Armbrust
+1 On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue wrote: > +1 (non-binding) > > Thanks for making the updates reflected in the current PR. It would be > great to see the doc updated before it is finally published though. > > Right now it feels like this SPIP is focused

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Ryan Blue
+1 (non-binding) Thanks for making the updates reflected in the current PR. It would be great to see the doc updated before it is finally published though. Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Herman van Hövell tot Westerflier
+1 (binding) I personally believe that there is quite a big difference between having a generic data source interface with a low surface area and pushing down a significant part of query processing into a datasource. The latter has a much wider surface area and will require us to stabilize

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-07 Thread Andrew Ash
+0 (non-binding) I think there are benefits to unifying all the Spark-internal datasources into a common public API for sure. It will serve as a forcing function to ensure that those internal datasources aren't advantaged vs datasources developed externally as plugins to Spark, and that all

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-06 Thread Suresh Thalamati
+1 (non-binding) > On Sep 6, 2017, at 7:29 PM, Wenchen Fan wrote: > > Hi all, > > In the previous discussion, we decided to split the read and write path of > data source v2 into 2 SPIPs, and I'm sending this email to call a vote for > Data Source V2 read path only. >

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-06 Thread Sameer Agarwal
+1 On Wed, Sep 6, 2017 at 8:53 PM, Xiao Li wrote: > +1 > > Xiao > > 2017-09-06 19:37 GMT-07:00 Wenchen Fan : > >> adding my own +1 (binding) >> >> On Thu, Sep 7, 2017 at 10:29 AM, Wenchen Fan wrote: >> >>> Hi all, >>> >>> In the

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-06 Thread Xiao Li
+1 Xiao 2017-09-06 19:37 GMT-07:00 Wenchen Fan : > adding my own +1 (binding) > > On Thu, Sep 7, 2017 at 10:29 AM, Wenchen Fan wrote: > >> Hi all, >> >> In the previous discussion, we decided to split the read and write path >> of data source v2 into 2

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

2017-09-06 Thread Wenchen Fan
adding my own +1 (binding) On Thu, Sep 7, 2017 at 10:29 AM, Wenchen Fan wrote: > Hi all, > > In the previous discussion, we decided to split the read and write path of > data source v2 into 2 SPIPs, and I'm sending this email to call a vote for > Data Source V2 read path

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-06 Thread Wenchen Fan
Hi Ryan, Yea I agree with you that we should discuss some substantial details during the vote, and I addressed your comments about schema inference API in my new PR, please take a look. I've also called a new vote for the read path, please vote there, thanks! On Thu, Sep 7, 2017 at 7:55 AM,

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-06 Thread Ryan Blue
I'm all for keeping this moving and not getting too far into the details (like naming), but I think the substantial details should be clarified first since they are in the proposal that's being voted on. I would prefer moving the write side to a separate SPIP, too, since there isn't much detail

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-06 Thread Wenchen Fan
Hi all, I've submitted a PR for a basic data source v2, i.e., one that only contains features we already have in data source v1. We can discuss API details like naming in that PR: https://github.com/apache/spark/pull/19136 In the meantime, let's keep this vote open and collect more feedback. Thanks

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-09-01 Thread Reynold Xin
Why does ordering matter here for sort vs filter? The source should be able to handle it in whatever way it wants (which is almost always filter beneath sort I'd imagine). The only ordering that'd matter in the current set of pushdowns is limit - it should always mean the root of the pushded

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-31 Thread Wenchen Fan
> Ideally also getting sort orders _after_ getting filters. Yea we should have a deterministic order when applying various push downs, and I think filter should definitely go before sort. This is one of the details we can discuss during PR review :) On Fri, Sep 1, 2017 at 9:19 AM, James Baker

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-31 Thread James Baker
I think that makes sense. I didn't understand backcompat was the primary driver. I actually don't care right now about aggregations on the datasource I'm integrating with - I just care about receiving all the filters (and ideally also the desired sort order) at the same time. I am mostly fine

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-31 Thread Wenchen Fan
Hi Ryan, I think for a SPIP, we should not worry too much about details, as we can discuss them during PR review after the vote passes. I think we should focus more on the overall design, like James did. The interface mix-in vs plan push down discussion was great, hope we can get a consensus on

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Ryan Blue
Maybe I'm missing something, but the high-level proposal consists of: Goals, Non-Goals, and Proposed API. What is there to discuss other than the details of the API that's being proposed? I think the goals make sense, but goals alone aren't enough to approve a SPIP. On Wed, Aug 30, 2017 at 2:46

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread James Baker
I guess I was more suggesting that by coding up the powerful mode as the API, it becomes easy for someone to layer an easy mode beneath it to enable simpler datasources to be integrated (and that simple mode should be the out of scope thing). Taking a small step back here, one of the places

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
Sure that's good to do (and as discussed earlier a good compromise might be to expose an interface for the source to decide which part of the logical plan they want to accept). To me everything is about cost vs benefit. In my mind, the biggest issue with the existing data source API is backward

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
So we seem to be getting into a cycle of discussing more about the details of APIs than the high level proposal. The details of APIs are important to debate, but those belong more in code reviews. One other important thing is that we should avoid API design by committee. While it is extremely

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Ryan Blue
-1 (non-binding) Sometimes it takes a VOTE thread to get people to actually read and comment, so thanks for starting this one… but there’s still discussion happening on the prototype API, which hasn’t been updated. I’d like to see the proposal shaped by the ongoing discussion so that we have a

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Reynold Xin
That might be good to do, but it seems orthogonal to this effort itself. It would be a completely different interface. On Wed, Aug 30, 2017 at 1:10 PM Wenchen Fan wrote: > OK I agree with it, how about we add a new interface to push down the > query plan, based on the

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-30 Thread Wenchen Fan
OK I agree with it, how about we add a new interface to push down the query plan, based on the current framework? We can mark the query-plan-push-down interface as unstable, to save the effort of designing a stable representation of query plan and maintaining forward compatibility. On Wed, Aug

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-29 Thread James Baker
I'll just focus on the one-by-one thing for now - it's the thing that blocks me the most. I think the place where we're most confused here is on the cost of determining whether I can push down a filter. For me, in order to work out whether I can push down a filter or satisfy a sort, I might

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-29 Thread Wenchen Fan
Hi James, Thanks for your feedback! I think your concerns are all valid, but we need to make a tradeoff here. > Explicitly here, what I'm looking for is a convenient mechanism to accept a fully specified set of arguments The problem with this approach is: 1) if we wanna add more arguments in

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-29 Thread James Baker
Yeah, for sure. With the stable representation - agree that in the general case this is pretty intractable, it restricts the modifications that you can do in the future too much. That said, it shouldn't be as hard if you restrict yourself to the parts of the plan which are supported by the

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-29 Thread Reynold Xin
James, Thanks for the comment. I think you just pointed out a trade-off between expressiveness and API simplicity, compatibility and evolvability. For the max expressiveness, we'd want the ability to expose full query plans, and let the data source decide which part of the query plan can be

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-28 Thread James Baker
Copying from the code review comments I just submitted on the draft API (https://github.com/cloud-fan/spark/pull/10#pullrequestreview-59088745): Context here is that I've spent some time implementing a Spark datasource and have had some issues with the current API which are made worse in V2.

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-28 Thread 蒋星博
+1 (Non-binding) On Mon, Aug 28, 2017 at 5:38 PM, Xiao Li wrote: > +1 > > 2017-08-28 12:45 GMT-07:00 Cody Koeninger : > >> Just wanted to point out that because the jira isn't labeled SPIP, it >> won't have shown up linked from >> >>

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-28 Thread Xiao Li
+1 2017-08-28 12:45 GMT-07:00 Cody Koeninger : > Just wanted to point out that because the jira isn't labeled SPIP, it > won't have shown up linked from > > http://spark.apache.org/improvement-proposals.html > > On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-28 Thread Cody Koeninger
Just wanted to point out that because the jira isn't labeled SPIP, it won't have shown up linked from http://spark.apache.org/improvement-proposals.html On Mon, Aug 28, 2017 at 2:20 PM, Wenchen Fan wrote: > Hi all, > > It has been almost 2 weeks since I proposed the data

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-28 Thread Russell Spitzer
+1 (Non-binding) The clustering approach covers most of my requirements on saving some shuffles. We kind of left the "should the user be allowed to provide a full partitioner" discussion on the table. I understand that would require exposing a lot of internals so this is perhaps a good

[VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-28 Thread Wenchen Fan
Hi all, It has been almost 2 weeks since I proposed the data source V2 for discussion, and we have already gotten some feedback on the JIRA ticket and the prototype PR, so I'd like to call for a vote. The full document of the Data Source API V2 is:

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Mark Hamstra
Points 2, 3 and 4 of the Project Plan in that document (i.e. "port existing data sources using internal APIs to use the proposed public Data Source V2 API") have my full support. Really, I'd like to see that dog-fooding effort completed and the lessons learned from it fully digested before we remove

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Reynold Xin
Yea I don't think it's a good idea to upload a doc and then call for a vote immediately. People need time to digest ... On Thu, Aug 17, 2017 at 6:22 AM, Wenchen Fan wrote: > Sorry let's remove the VOTE tag as I just wanna bring this up for > discussion. > > I'll restart

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
Sorry let's remove the VOTE tag as I just wanna bring this up for discussion. I'll restart the voting process after we have enough discussion on the JIRA ticket or here in this email thread. On Thu, Aug 17, 2017 at 9:12 PM, Russell Spitzer wrote: > -1, I don't think

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Russell Spitzer
-1, I don't think there has really been any discussion of this api change yet or at least it hasn't occurred on the jira ticket On Thu, Aug 17, 2017 at 8:05 AM Wenchen Fan wrote: > adding my own +1 (binding) > > On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread 蒋星博
+1 (non-binding) On Thu, Aug 17, 2017 at 9:05 PM, Wenchen Fan wrote: > adding my own +1 (binding) > > On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan wrote: > >> Hi all, >> >> Following the SPIP process, I'm putting this SPIP up for a vote. >> >> The current data source

Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
adding my own +1 (binding) On Thu, Aug 17, 2017 at 9:02 PM, Wenchen Fan wrote: > Hi all, > > Following the SPIP process, I'm putting this SPIP up for a vote. > > The current data source API doesn't work well because of some limitations > like: no partitioning/bucketing

[VOTE] [SPIP] SPARK-15689: Data Source API V2

2017-08-17 Thread Wenchen Fan
Hi all, Following the SPIP process, I'm putting this SPIP up for a vote. The current data source API doesn't work well because of some limitations like: no partitioning/bucketing support, no columnar read, hard to support more operator push down, etc. I'm proposing a Data Source API V2 to