+1 (non-binding)

Thanks for making the updates reflected in the current PR. It would be
great to see the doc updated before it is finally published though.

Right now it feels like this SPIP is focused more on getting the basics
right for what many datasources are already doing in API V1 combined with
other private APIs, rather than pushing forward the state of the art for
performance.

I think that’s the right approach for this SPIP. We can add the support
you’re talking about later with a more specific plan that doesn’t block
fixing the problems that this addresses.

On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
hvanhov...@databricks.com> wrote:

> +1 (binding)
>
> I personally believe that there is quite a big difference between having a
> generic data source interface with a low surface area and pushing down a
> significant part of query processing into a datasource. The latter has a
> much wider surface area and will require us to stabilize most of the
> internal Catalyst APIs, which would be a significant maintenance burden on
> the community and could slow development velocity significantly. If you
> want to write such integrations, you should be prepared to work with
> Catalyst internals and accept that things might change across minor
> versions (and in some cases even maintenance releases). If you are willing
> to go down that road, then your best bet is to use the already existing
> Spark session extensions, which allow you to write such integrations and
> can be used as an `escape hatch`.
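>
> To illustrate, here is a minimal sketch of wiring an integration in
> through the extensions API (the rule and class names are placeholders,
> and a real integration would do actual plan rewriting):
>
>     import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
>     import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
>     import org.apache.spark.sql.catalyst.rules.Rule
>
>     // Placeholder optimizer rule: a real integration would match the
>     // plan shapes its datasource can execute natively and rewrite them.
>     case class MyPushdownRule(session: SparkSession)
>         extends Rule[LogicalPlan] {
>       override def apply(plan: LogicalPlan): LogicalPlan = plan
>     }
>
>     class MyExtensions extends (SparkSessionExtensions => Unit) {
>       override def apply(extensions: SparkSessionExtensions): Unit = {
>         extensions.injectOptimizerRule(session => MyPushdownRule(session))
>       }
>     }
>
>     // Wired up when building the session; the same class can also be
>     // registered through the spark.sql.extensions configuration.
>     val spark = SparkSession.builder()
>       .withExtensions(new MyExtensions)
>       .getOrCreate()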
>
>
> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com> wrote:
>
>> +0 (non-binding)
>>
>> I think there are real benefits to unifying all the Spark-internal
>> datasources into a common public API.  It will serve as a forcing
>> function to ensure that those internal datasources aren't advantaged over
>> datasources developed externally as plugins to Spark, and that all Spark
>> features are available to all datasources.
>>
>> But I also think this read-path proposal avoids the more difficult
>> questions around how to continue pushing datasource performance forward.
>> James Baker (my colleague) had a number of questions about advanced
>> pushdowns (combined sorting and filtering), and Reynold also noted that
>> pushdown of aggregates and joins is desirable on longer timeframes as
>> well.  The Spark community has seen similar requests: aggregate pushdown
>> in SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
>> in SPARK-12449.  Clearly a number of people are interested in this kind
>> of performance work for datasources.
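>>
>> To make that concrete, the kind of combined pushdown James was asking
>> about might look something like this hypothetical mix-in (the names are
>> invented for illustration and are not part of the proposal):
>>
>>     import org.apache.spark.sql.sources.Filter
>>
>>     // Hypothetical: the source sees the filters and the requested sort
>>     // together, so it can choose an access path that satisfies both
>>     // instead of handling each in isolation.
>>     trait SupportsPushDownFilteredSort {
>>       // Returns the filters Spark must still evaluate itself.
>>       def pushFiltersAndSort(
>>           filters: Array[Filter],
>>           sortColumns: Array[String]): Array[Filter]
>>
>>       // Whether the rows produced will actually be sorted by the
>>       // requested columns, so Spark can skip its own sort.
>>       def outputIsSorted: Boolean
>>     }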
>>
>> To leave enough space for datasource developers to continue experimenting
>> with advanced interactions between Spark and their datasources, I'd
>> propose we leave some sort of escape valve that enables these datasources
>> to keep pushing the boundaries without forking Spark.  Possibly that
>> looks like an additional unsupported/unstable interface that pushes down
>> an entire logical plan (itself an unstable API) and is expected to break
>> on every release.  (Spark would attempt this full-plan pushdown, and if
>> that failed it would ignore it and continue with the rest of the V2 API
>> for compatibility.)  Or maybe it looks like something else that we don't
>> know of yet.  Possibly this falls outside of the desired goals for the V2
>> API and instead should be a separate SPIP.
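>>
>> As a rough illustration only (every name below is invented, and the
>> point is the shape of the contract rather than the details), such an
>> escape valve might look like:
>>
>>     import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
>>
>>     // Explicitly unstable: consumes a raw Catalyst plan and is expected
>>     // to break across releases.
>>     trait ExperimentalFullPlanPushdown {
>>       // Attempt to claim `plan` for native execution.  Some(rewritten)
>>       // returns the plan with the claimed subtree replaced by a scan of
>>       // this source; None tells Spark to ignore the attempt and fall
>>       // back to the stable V2 read path.
>>       def pushDown(plan: LogicalPlan): Option[LogicalPlan]
>>     }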
>>
>> If we had a plan for this kind of escape valve for advanced datasource
>> developers I'd be an unequivocal +1.  Right now it feels like this SPIP
>> is focused more on getting the basics right for what many datasources are
>> already doing in API V1 combined with other private APIs, rather than
>> pushing forward the state of the art for performance.
>>
>> Andrew
>>
>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
>> suresh.thalam...@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>>
>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> In the previous discussion, we decided to split the read and write paths
>>> of Data Source V2 into two SPIPs, and I'm sending this email to call a
>>> vote for the Data Source V2 read path only.
>>>
>>> The full document of the Data Source API V2 is:
>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>
>>> The ready-for-review PR that implements the basic infrastructure for the
>>> read path is:
>>> https://github.com/apache/spark/pull/19136
>>>
>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Thanks!
>>>
>>>
>>>
>>
>
>
> --
>
> Herman van Hövell
>
> Software Engineer
>
> Databricks Inc.
>
> hvanhov...@databricks.com
>
> +31 6 420 590 27
>
> databricks.com
>



-- 
Ryan Blue
Software Engineer
Netflix
