+1

On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
> +1 (non-binding)
>
> Thanks for making the updates reflected in the current PR. It would be
> great to see the doc updated before it is finally published, though.
>
> "Right now it feels like this SPIP is focused more on getting the basics
> right for what many datasources are already doing in API V1 combined with
> other private APIs, vs pushing forward the state of the art for
> performance."
>
> I think that's the right approach for this SPIP. We can add the support
> you're talking about later, with a more specific plan that doesn't block
> fixing the problems this one addresses.
>
> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier
> <hvanhov...@databricks.com> wrote:
>
>> +1 (binding)
>>
>> I personally believe that there is quite a big difference between having
>> a generic data source interface with a low surface area and pushing down
>> a significant part of query processing into a datasource. The latter has
>> a much wider surface area and would require us to stabilize most of the
>> internal Catalyst APIs, which would be a significant maintenance burden
>> on the community and has the potential to slow development velocity
>> significantly. If you want to write such integrations, you should be
>> prepared to work with Catalyst internals and own up to the fact that
>> things might change across minor versions (and in some cases even
>> maintenance releases). If you are willing to go down that road, then
>> your best bet is to use the already existing Spark session extensions,
>> which allow you to write such integrations and can be used as an
>> "escape hatch".
>>
>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com> wrote:
>>
>>> +0 (non-binding)
>>>
>>> I think there are benefits to unifying all the Spark-internal
>>> datasources into a common public API for sure. It will serve as a
>>> forcing function to ensure that those internal datasources aren't
>>> advantaged vs datasources developed externally as plugins to Spark, and
>>> that all Spark features are available to all datasources.
>>>
>>> But I also think this read-path proposal avoids the more difficult
>>> questions around how to continue pushing datasource performance
>>> forward. James Baker (my colleague) had a number of questions about
>>> advanced pushdowns (combined sorting and filtering; see the sketch
>>> further down), and Reynold also noted that pushdown of aggregates and
>>> joins is desirable on longer timeframes as well. The Spark community
>>> has seen similar requests: aggregate pushdown in SPARK-12686, join
>>> pushdown in SPARK-20259, and arbitrary plan pushdown in SPARK-12449.
>>> Clearly a number of people are interested in this kind of performance
>>> work for datasources.
>>>
>>> To leave enough space for datasource developers to continue
>>> experimenting with advanced interactions between Spark and their
>>> datasources, I'd propose we leave some sort of escape valve that
>>> enables these datasources to keep pushing the boundaries without
>>> forking Spark. Possibly that looks like an additional
>>> unsupported/unstable interface that pushes down an entire
>>> (unstable-API) logical plan, and that is expected to break API on every
>>> release: Spark attempts this full-plan pushdown, and if it fails Spark
>>> ignores it and continues on with the rest of the V2 API for
>>> compatibility. Or maybe it looks like something else that we don't know
>>> of yet. Possibly this falls outside of the desired goals for the V2 API
>>> and instead should be a separate SPIP.
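>>>
>>> To make the combined-pushdown example concrete, here is a rough sketch
>>> of the kind of mixin I mean. All of these names are purely illustrative
>>> and not part of the current proposal; the point is only that the source
>>> must see the filter and the requested ordering together, because
>>> handling them one operator at a time can't express "I can do both, but
>>> only jointly":
>>>
>>>   import org.apache.spark.sql.sources.Filter
>>>
>>>   trait SupportsPushDownSortedFilters {
>>>     // Offer the filters and the desired sort order in a single call.
>>>     // Returns the filters the source could NOT handle; an empty result
>>>     // plus sortPushedDown == true lets Spark drop both operators.
>>>     def pushSortedFilters(filters: Array[Filter],
>>>                           sortColumns: Array[String],
>>>                           ascending: Array[Boolean]): Array[Filter]
>>>
>>>     def sortPushedDown: Boolean
>>>   }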
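>>>
>>> And the escape valve itself could be as small as a single optional
>>> hook. Again, these names are hypothetical; LogicalPlan is the
>>> Catalyst-internal, unstable class:
>>>
>>>   import org.apache.spark.rdd.RDD
>>>   import org.apache.spark.sql.catalyst.InternalRow
>>>   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
>>>
>>>   trait ExperimentalFullPlanPushdown {
>>>     // Spark offers the source the full logical plan fragment rooted at
>>>     // its relation. Returning None means "declined", in which case
>>>     // Spark silently falls back to the stable V2 interfaces. This hook
>>>     // would be expected to break on every release.
>>>     def tryPushPlan(plan: LogicalPlan): Option[RDD[InternalRow]]
>>>   }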
>>>
>>> If we had a plan for this kind of escape valve for advanced datasource
>>> developers, I'd be an unequivocal +1. Right now it feels like this SPIP
>>> is focused more on getting the basics right for what many datasources
>>> are already doing in API V1 combined with other private APIs, vs
>>> pushing forward the state of the art for performance.
>>>
>>> Andrew
>>>
>>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati
>>> <suresh.thalam...@gmail.com> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> In the previous discussion, we decided to split the read and write
>>>> paths of data source v2 into two SPIPs, and I'm sending this email to
>>>> call a vote for the Data Source V2 read path only.
>>>>
>>>> The full document of the Data Source API V2 is:
>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>
>>>> The ready-for-review PR that implements the basic infrastructure for
>>>> the read path is:
>>>> https://github.com/apache/spark/pull/19136
>>>>
>>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>>
>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>> +0: Don't really care.
>>>> -1: I don't think this is a good idea because of the following
>>>> technical reasons.
>>>>
>>>> Thanks!
>>
>> --
>>
>> Herman van Hövell
>>
>> Software Engineer
>>
>> Databricks Inc.
>>
>> hvanhov...@databricks.com
>>
>> +31 6 420 590 27
>>
>> databricks.com
>
> --
> Ryan Blue
> Software Engineer
> Netflix