+1 (non-binding).

On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 <jiangxb1...@gmail.com> wrote:
> +1
>
> On Thu, Sep 7, 2017 at 12:04 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> +1 as well
>>
>> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> +1
>>>
>>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Thanks for making the updates reflected in the current PR. It would be great to see the doc updated before it is finally published, though.
>>>>
>>>> Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 (combined with other private APIs) than on pushing forward the state of the art for performance.
>>>>
>>>> I think that's the right approach for this SPIP. We can add the support you're talking about later, with a more specific plan that doesn't block fixing the problems this SPIP addresses.
>>>>
>>>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <hvanhov...@databricks.com> wrote:
>>>>
>>>>> +1 (binding)
>>>>>
>>>>> I personally believe there is quite a big difference between having a generic data source interface with a low surface area and pushing down a significant part of query processing into a datasource. The latter has a much wider surface area and would require us to stabilize most of the internal Catalyst APIs, which would be a significant maintenance burden on the community and could slow development velocity considerably. If you want to write such integrations, you should be prepared to work with Catalyst internals and accept that things might change across minor versions (and in some cases even maintenance releases). If you are willing to go down that road, your best bet is the already existing Spark session extensions, which allow you to write such integrations and can serve as an "escape hatch".
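[For reference: the "escape hatch" above is the extension-injection API (`SparkSession.builder().withExtensions(_.injectOptimizerRule(...))`) available since Spark 2.2. The sketch below only illustrates the *shape* of an injected pushdown rule: Catalyst's `LogicalPlan` and `Rule` types are replaced with toy stand-ins so the sketch is self-contained, and `PushFilterIntoScan` is an invented example, not a real Spark rule.]

```scala
// Sketch only. A real integration would import
// org.apache.spark.sql.catalyst.plans.logical.LogicalPlan and
// org.apache.spark.sql.catalyst.rules.Rule, and register the rule via
// SparkSession.builder().withExtensions(_.injectOptimizerRule(...)).

// Toy stand-in for a Catalyst logical plan: just a stack of node names.
final case class LogicalPlan(nodes: List[String])

// Toy stand-in for Catalyst's Rule[LogicalPlan].
trait Rule { def apply(plan: LogicalPlan): LogicalPlan }

// An injected optimizer rule of the kind a datasource author might write:
// collapse Filter-over-Scan into a single scan node (i.e. filter pushdown).
object PushFilterIntoScan extends Rule {
  def apply(plan: LogicalPlan): LogicalPlan = plan.nodes match {
    case "Filter" :: "Scan" :: rest => LogicalPlan("FilteredScan" :: rest)
    case _                          => plan // leave other plans untouched
  }
}
```

[Because such a rule manipulates Catalyst internals directly, it carries exactly the stability caveat described above: the real plan and rule types may change across minor releases.]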
>>>>>
>>>>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>
>>>>>> +0 (non-binding)
>>>>>>
>>>>>> I think there are certainly benefits to unifying all the Spark-internal datasources into a common public API. It will serve as a forcing function to ensure that those internal datasources aren't advantaged over datasources developed externally as plugins to Spark, and that all Spark features are available to all datasources.
>>>>>>
>>>>>> But I also think this read-path proposal avoids the more difficult questions around how to continue pushing datasource performance forward. James Baker (my colleague) had a number of questions about advanced pushdowns (combined sorting and filtering), and Reynold also noted that pushdown of aggregates and joins is desirable on longer timeframes as well. The Spark community has seen similar requests: aggregate pushdown in SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown in SPARK-12449. Clearly a number of people are interested in this kind of performance work for datasources.
>>>>>>
>>>>>> To leave enough space for datasource developers to continue experimenting with advanced interactions between Spark and their datasources, I'd propose we leave some sort of escape valve that enables those datasources to keep pushing the boundaries without forking Spark. Possibly that looks like an additional unsupported/unstable interface that pushes down an entire (unstable-API) logical plan and is expected to break API on every release: Spark attempts this full-plan pushdown, and if that fails, Spark ignores it and continues with the rest of the V2 API for compatibility. Or maybe it looks like something else we don't know of yet.
>>>>>> Possibly this falls outside the desired goals for the V2 API and should instead be a separate SPIP.
>>>>>>
>>>>>> If we had a plan for this kind of escape valve for advanced datasource developers, I'd be an unequivocal +1. Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 (combined with other private APIs) than on pushing forward the state of the art for performance.
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <suresh.thalam...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> In the previous discussion, we decided to split the read and write paths of Data Source V2 into two SPIPs, and I'm sending this email to call a vote for the Data Source V2 read path only.
>>>>>>>
>>>>>>> The full document of the Data Source API V2 is:
>>>>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>>>>
>>>>>>> The ready-for-review PR that implements the basic infrastructure for the read path is:
>>>>>>> https://github.com/apache/spark/pull/19136
>>>>>>>
>>>>>>> The vote will be open for the next 72 hours. Please reply with your vote:
>>>>>>>
>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>> +0: Don't really care.
>>>>>>> -1: I don't think this is a good idea because of the following technical reasons.
>>>>>>>
>>>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Herman van Hövell
>>>>> Software Engineer
>>>>> Databricks Inc.
>>>>> hvanhov...@databricks.com
>>>>> +31 6 420 590 27
>>>>> databricks.com
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
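[To make the escape valve Andrew describes concrete, here is a purely hypothetical sketch. None of these names exist in Spark or in the SPIP; they are invented for illustration, and the plan type is a toy stand-in for Catalyst's internal (and unstable) `LogicalPlan`. A source attempts to consume the entire logical plan and returns nothing when it cannot, in which case Spark would fall back to the ordinary, stable V2 read path.]

```scala
// Hypothetical, deliberately unstable interface -- invented for
// illustration; no part of this exists in Spark or in the proposal.

// Toy stand-in for Catalyst's internal logical plan type.
final case class Plan(description: String)

// Result of a successful full-plan pushdown.
final case class PushedScan(pushed: String)

trait ExperimentalFullPlanPushDown {
  /** Try to push down the entire plan; None means "fall back to V2". */
  def pushDownFullPlan(plan: Plan): Option[PushedScan]
}

// Example source that can only consume simple Filter-over-Scan plans.
object FilterOnlySource extends ExperimentalFullPlanPushDown {
  def pushDownFullPlan(plan: Plan): Option[PushedScan] =
    if (plan.description.startsWith("Filter(")) Some(PushedScan(plan.description))
    else None
}

// Spark-side planning sketch: attempt full-plan pushdown; if the source
// declines, silently continue with the ordinary V2 read path.
object Planner {
  def planScan(source: ExperimentalFullPlanPushDown, plan: Plan): String =
    source.pushDownFullPlan(plan) match {
      case Some(s) => s"pushed:${s.pushed}"
      case None    => "v2-read-path"
    }
}
```

[The key design property is the one Andrew asks for: the unstable interface is strictly optional, so a source that declines (or breaks across releases) degrades gracefully to the stable V2 behavior instead of failing the query.]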