+1 (non-binding).

On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 <jiangxb1...@gmail.com> wrote:
> +1
>
> On Thu, Sep 7, 2017 at 12:04 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> +1 as well
>>
>> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> +1
>>>
>>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> +1 (non-binding)
>>>>
>>>> Thanks for making the updates reflected in the current PR. It would be great to see the doc updated before it is finally published, though.
>>>>
>>>> Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 (combined with other private APIs) than on pushing forward the state of the art for performance.
>>>>
>>>> I think that's the right approach for this SPIP. We can add the support you're talking about later, with a more specific plan that doesn't block fixing the problems this SPIP addresses.
>>>>
>>>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <hvanhov...@databricks.com> wrote:
>>>>
>>>>> +1 (binding)
>>>>>
>>>>> I personally believe there is quite a big difference between having a generic data source interface with a low surface area and pushing down a significant part of query processing into a datasource. The latter has a much wider surface area and would require us to stabilize most of the internal Catalyst APIs, which would be a significant maintenance burden on the community and could slow development velocity considerably. If you want to write such integrations, you should be prepared to work with Catalyst internals and accept that things might change across minor versions (and in some cases even maintenance releases). If you are willing to go down that road, your best bet is the already existing Spark session extensions, which allow you to write such integrations and can serve as an "escape hatch".
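[For reference: the "escape hatch" above is the extension-injection API (`SparkSession.builder().withExtensions(_.injectOptimizerRule(...))`) available since Spark 2.2. The sketch below only illustrates the *shape* of an injected pushdown rule: Catalyst's `LogicalPlan` and `Rule` types are replaced with toy stand-ins so the sketch is self-contained, and `PushFilterIntoScan` is an invented example, not a real Spark rule.]

```scala
// Sketch only. A real integration would import
// org.apache.spark.sql.catalyst.plans.logical.LogicalPlan and
// org.apache.spark.sql.catalyst.rules.Rule, and register the rule via
// SparkSession.builder().withExtensions(_.injectOptimizerRule(...)).

// Toy stand-in for a Catalyst logical plan: just a stack of node names.
final case class LogicalPlan(nodes: List[String])

// Toy stand-in for Catalyst's Rule[LogicalPlan].
trait Rule { def apply(plan: LogicalPlan): LogicalPlan }

// An injected optimizer rule of the kind a datasource author might write:
// collapse Filter-over-Scan into a single scan node (i.e. filter pushdown).
object PushFilterIntoScan extends Rule {
  def apply(plan: LogicalPlan): LogicalPlan = plan.nodes match {
    case "Filter" :: "Scan" :: rest => LogicalPlan("FilteredScan" :: rest)
    case _                          => plan // leave other plans untouched
  }
}
```

[Because such a rule manipulates Catalyst internals directly, it carries exactly the stability caveat described above: the real plan and rule types may change across minor releases.]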
>>>>>
>>>>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com> wrote:
>>>>>
>>>>>> +0 (non-binding)
>>>>>>
>>>>>> I think there are certainly benefits to unifying all the Spark-internal datasources into a common public API. It will serve as a forcing function to ensure that those internal datasources aren't advantaged over datasources developed externally as plugins to Spark, and that all Spark features are available to all datasources.
>>>>>>
>>>>>> But I also think this read-path proposal avoids the more difficult questions around how to continue pushing datasource performance forward. James Baker (my colleague) had a number of questions about advanced pushdowns (combined sorting and filtering), and Reynold also noted that pushdown of aggregates and joins is desirable on longer timeframes as well. The Spark community has seen similar requests: aggregate pushdown in SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown in SPARK-12449. Clearly a number of people are interested in this kind of performance work for datasources.
>>>>>>
>>>>>> To leave enough space for datasource developers to continue experimenting with advanced interactions between Spark and their datasources, I'd propose we leave some sort of escape valve that enables those datasources to keep pushing the boundaries without forking Spark. Possibly that looks like an additional unsupported/unstable interface that pushes down an entire (unstable-API) logical plan and is expected to break API on every release: Spark attempts this full-plan pushdown, and if that fails, Spark ignores it and continues with the rest of the V2 API for compatibility. Or maybe it looks like something else we don't know of yet.
>>>>>> Possibly this falls outside the desired goals for the V2 API and should instead be a separate SPIP.
>>>>>>
>>>>>> If we had a plan for this kind of escape valve for advanced datasource developers, I'd be an unequivocal +1. Right now it feels like this SPIP is focused more on getting the basics right for what many datasources are already doing in API V1 (combined with other private APIs) than on pushing forward the state of the art for performance.
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <suresh.thalam...@gmail.com> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> In the previous discussion, we decided to split the read and write paths of Data Source V2 into two SPIPs, and I'm sending this email to call a vote for the Data Source V2 read path only.
>>>>>>>
>>>>>>> The full document of the Data Source API V2 is:
>>>>>>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>>>>>>
>>>>>>> The ready-for-review PR that implements the basic infrastructure for the read path is:
>>>>>>> https://github.com/apache/spark/pull/19136
>>>>>>>
>>>>>>> The vote will be open for the next 72 hours. Please reply with your vote:
>>>>>>>
>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>> +0: Don't really care.
>>>>>>> -1: I don't think this is a good idea because of the following technical reasons.
>>>>>>>
>>>>>>> Thanks!
>>>>>
>>>>> --
>>>>> Herman van Hövell
>>>>> Software Engineer
>>>>> Databricks Inc.
>>>>> hvanhov...@databricks.com
>>>>> +31 6 420 590 27
>>>>> databricks.com
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
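[To make the escape valve Andrew describes concrete, here is a purely hypothetical sketch. None of these names exist in Spark or in the SPIP; they are invented for illustration, and the plan type is a toy stand-in for Catalyst's internal (and unstable) `LogicalPlan`. A source attempts to consume the entire logical plan and returns nothing when it cannot, in which case Spark would fall back to the ordinary, stable V2 read path.]

```scala
// Hypothetical, deliberately unstable interface -- invented for
// illustration; no part of this exists in Spark or in the proposal.

// Toy stand-in for Catalyst's internal logical plan type.
final case class Plan(description: String)

// Result of a successful full-plan pushdown.
final case class PushedScan(pushed: String)

trait ExperimentalFullPlanPushDown {
  /** Try to push down the entire plan; None means "fall back to V2". */
  def pushDownFullPlan(plan: Plan): Option[PushedScan]
}

// Example source that can only consume simple Filter-over-Scan plans.
object FilterOnlySource extends ExperimentalFullPlanPushDown {
  def pushDownFullPlan(plan: Plan): Option[PushedScan] =
    if (plan.description.startsWith("Filter(")) Some(PushedScan(plan.description))
    else None
}

// Spark-side planning sketch: attempt full-plan pushdown; if the source
// declines, silently continue with the ordinary V2 read path.
object Planner {
  def planScan(source: ExperimentalFullPlanPushDown, plan: Plan): String =
    source.pushDownFullPlan(plan) match {
      case Some(s) => s"pushed:${s.pushed}"
      case None    => "v2-read-path"
    }
}
```

[The key design property is the one Andrew asks for: the unstable interface is strictly optional, so a source that declines (or breaks across releases) degrades gracefully to the stable V2 behavior instead of failing the query.]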