+1 (non-binding)

I found the suggestion from Andrew Ash and James about plan push-down quite interesting. However, I am not clear about join push-down support at the data source level. Shouldn't it be the responsibility of the join node to carry out a data-source-specific join? I mean that the join node and the data source scans on its two sides can, in theory, be coalesced into a single node. This can be done by providing a Strategy that replaces the join node with a data-source-specific join node. We do it that way for our data sources, and I find it more intuitive.
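To make the idea concrete, here is a toy sketch of such a strategy. It is a simplified model and not Spark's Catalyst API: the classes `Scan`, `Join`, and `SourceJoin` and the functions `source_join_strategy` and `plan_query` are hypothetical names invented for illustration. The sketch assumes a join can be handed to a source only when both sides are plain scans of that same source.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Scan:
    """Logical scan of one table in a named data source (hypothetical)."""
    source: str
    table: str


@dataclass(frozen=True)
class Join:
    """Generic join node that the engine would execute itself."""
    left: "Plan"
    right: "Plan"
    condition: str


@dataclass(frozen=True)
class SourceJoin:
    """One node replacing the join and both scans; the source runs the join."""
    source: str
    left_table: str
    right_table: str
    condition: str


Plan = Union[Scan, Join, SourceJoin]


def source_join_strategy(plan: Plan) -> Optional[Plan]:
    """Return a source-specific join when both sides scan the same source."""
    if (
        isinstance(plan, Join)
        and isinstance(plan.left, Scan)
        and isinstance(plan.right, Scan)
        and plan.left.source == plan.right.source
    ):
        return SourceJoin(
            plan.left.source, plan.left.table, plan.right.table, plan.condition
        )
    return None  # strategy does not apply; the caller keeps the generic node


def plan_query(plan: Plan) -> Plan:
    """Rewrite children first, then try the strategy on the current node."""
    if isinstance(plan, Join):
        plan = Join(plan_query(plan.left), plan_query(plan.right), plan.condition)
    return source_join_strategy(plan) or plan
```

In this model a join of two tables from the same source collapses into a single `SourceJoin` node, while a join across different sources is left untouched as a generic `Join`.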
BTW, aggregate push-down support is desirable and should be considered as an enhancement going forward.

Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io

On Sun, Sep 10, 2017 at 8:45 PM, vaquar khan <vaquar.k...@gmail.com> wrote:

> +1
>
> Regards,
> Vaquar khan
>
> On Sep 10, 2017 5:18 AM, "Noman Khan" <nomanbp...@live.com> wrote:
>
>> +1
>> ------------------------------
>> *From:* wangzhenhua (G) <wangzhen...@huawei.com>
>> *Sent:* Friday, September 8, 2017 2:20:07 AM
>> *To:* Dongjoon Hyun; 蒋星博
>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
>> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>>
>> +1 (non-binding) Great to see data source API is going to be improved!
>>
>> best regards,
>> -Zhenhua(Xander)
>>
>> *From:* Dongjoon Hyun [mailto:dongjoon.h...@gmail.com]
>> *Sent:* September 8, 2017 4:07
>> *To:* 蒋星博
>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
>> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>>
>> +1 (non-binding).
>>
>> On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 <jiangxb1...@gmail.com> wrote:
>>
>> +1
>>
>> On Thu, Sep 7, 2017 at 12:04 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> +1 as well
>>
>> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>> +1
>>
>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>> +1 (non-binding)
>>
>> Thanks for making the updates reflected in the current PR. It would be great to see the doc updated before it is finally published though.
>>
>> Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward state of the art for performance.
>>
>> I think that’s the right approach for this SPIP. We can add the support you’re talking about later with a more specific plan that doesn’t block fixing the problems that this addresses.
>>
>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <hvanhov...@databricks.com> wrote:
>>
>> +1 (binding)
>>
>> I personally believe that there is quite a big difference between having a generic data source interface with a low surface area and pushing down a significant part of query processing into a datasource. The latter has a much wider surface area and will require us to stabilize most of the internal Catalyst APIs, which will be a significant burden on the community to maintain and has the potential to slow development velocity significantly. If you want to write such integrations then you should be prepared to work with Catalyst internals and own up to the fact that things might change across minor versions (and in some cases even maintenance releases). If you are willing to go down that road, then your best bet is to use the already existing Spark session extensions, which will allow you to write such integrations and can be used as an `escape hatch`.
>>
>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com> wrote:
>>
>> +0 (non-binding)
>>
>> I think there are benefits to unifying all the Spark-internal datasources into a common public API for sure. It will serve as a forcing function to ensure that those internal datasources aren't advantaged vs datasources developed externally as plugins to Spark, and that all Spark features are available to all datasources.
>>
>> But I also think this read-path proposal avoids the more difficult questions around how to continue pushing datasource performance forwards. James Baker (my colleague) had a number of questions about advanced pushdowns (combined sorting and filtering), and Reynold also noted that pushdown of aggregates and joins is desirable on longer timeframes as well. The Spark community saw similar requests, for aggregate pushdown in SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown in SPARK-12449. Clearly a number of people are interested in this kind of performance work for datasources.
>>
>> To leave enough space for datasource developers to continue experimenting with advanced interactions between Spark and their datasources, I'd propose we leave some sort of escape valve that enables these datasources to keep pushing the boundaries without forking Spark. Possibly that looks like an additional unsupported/unstable interface that pushes down an entire (unstable API) logical plan, which is expected to break API on every release. (Spark attempts this full-plan pushdown, and if that fails Spark ignores it and continues on with the rest of the V2 API for compatibility.) Or maybe it looks like something else that we don't know of yet. Possibly this falls outside of the desired goals for the V2 API and instead should be a separate SPIP.
>>
>> If we had a plan for this kind of escape valve for advanced datasource developers I'd be an unequivocal +1. Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 combined with other private APIs, vs pushing forward state of the art for performance.
>>
>> Andrew
>>
>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <suresh.thalam...@gmail.com> wrote:
>>
>> +1 (non-binding)
>>
>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>> Hi all,
>>
>> In the previous discussion, we decided to split the read and write path of data source v2 into 2 SPIPs, and I'm sending this email to call a vote for Data Source V2 read path only.
>>
>> The full document of the Data Source API V2 is:
>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>
>> The ready-for-review PR that implements the basic infrastructure for the read path is:
>> https://github.com/apache/spark/pull/19136
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical reasons.
>>
>> Thanks!
>>
>> --
>> Herman van Hövell
>> Software Engineer
>> Databricks Inc.
>> hvanhov...@databricks.com
>> +31 6 420 590 27
>> databricks.com
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix