Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path

Hemant Bhanawat Sun, 10 Sep 2017 22:54:30 -0700

+1 (non-binding)

I have found the suggestion from Andrew Ash and James about plan push down
quite interesting. However, I am not clear about the join push-down support
at the data source level. Shouldn't it be the responsibility of the join
node to carry out a data source specific join? I mean join node and the
data source scan of the two sides can be coalesced into a single node
(theoretically). This can be done by providing a Strategy that replaces the
join node with a data source specific join node. We are doing it that way
for our data sources. I find this more intuitive.
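
To make the idea concrete, here is a minimal sketch of the kind of rule I mean. It is a toy model in plain Python, not Spark's planner API: the Scan, Join and SourceJoin classes, the source_join_strategy function and the "snappy" source name are all hypothetical names made up for illustration. The rule fires only when both sides of the join are scans of the same data source; otherwise the generic plan is left untouched.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Scan:
    """Logical scan of one table in a named data source (hypothetical)."""
    source: str
    table: str


@dataclass(frozen=True)
class Join:
    """Logical equi-join of two child plans on a single column name."""
    left: "Plan"
    right: "Plan"
    on: str


@dataclass(frozen=True)
class SourceJoin:
    """Single node that asks the data source itself to run the join."""
    source: str
    query: str


Plan = Union[Scan, Join, SourceJoin]


def source_join_strategy(plan: Plan) -> Optional[Plan]:
    """Return a coalesced node if the rule applies, else None."""
    if (isinstance(plan, Join)
            and isinstance(plan.left, Scan)
            and isinstance(plan.right, Scan)
            and plan.left.source == plan.right.source):
        # Both children live in the same source, so the join node and the
        # two scans can be replaced by one source-specific node.
        query = (f"SELECT * FROM {plan.left.table} JOIN {plan.right.table} "
                 f"USING ({plan.on})")
        return SourceJoin(plan.left.source, query)
    return None


def plan_query(plan: Plan) -> Plan:
    """Try the source-specific strategy first; otherwise keep the generic plan."""
    return source_join_strategy(plan) or plan


# Assumed example inputs: one join within a single source, one across sources.
same = Join(Scan("snappy", "orders"), Scan("snappy", "customers"), on="cust_id")
mixed = Join(Scan("snappy", "orders"), Scan("parquet", "customers"), on="cust_id")
```

With these inputs, plan_query(same) yields a single SourceJoin node, while plan_query(mixed) returns the original Join unchanged because the two scans belong to different sources.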


BTW, aggregate push-down support is desirable and should be considered as
an enhancement going forward.

Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io

On Sun, Sep 10, 2017 at 8:45 PM, vaquar khan <vaquar.k...@gmail.com> wrote:

> +1
>
> Regards,
> Vaquar khan
>
> On Sep 10, 2017 5:18 AM, "Noman Khan" <nomanbp...@live.com> wrote:
>
>> +1
>> ------------------------------
>> *From:* wangzhenhua (G) <wangzhen...@huawei.com>
>> *Sent:* Friday, September 8, 2017 2:20:07 AM
>> *To:* Dongjoon Hyun; 蒋星博
>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
>> Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
>> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>>
>>
>> +1 (non-binding)  Great to see data source API is going to be improved!
>>
>>
>>
>> best regards,
>>
>> -Zhenhua(Xander)
>>
>>
>>
>> *From:* Dongjoon Hyun [mailto:dongjoon.h...@gmail.com]
>> *Sent:* September 8, 2017 4:07
>> *To:* 蒋星博
>> *Cc:* Michael Armbrust; Reynold Xin; Andrew Ash; Herman van Hövell tot
>> Westerflier; Ryan Blue; Spark dev list; Suresh Thalamati; Wenchen Fan
>> *Subject:* Re: [VOTE] [SPIP] SPARK-15689: Data Source API V2 read path
>>
>>
>>
>> +1 (non-binding).
>>
>>
>>
>> On Thu, Sep 7, 2017 at 12:46 PM, 蒋星博 <jiangxb1...@gmail.com> wrote:
>>
>> +1
>>
>>
>>
>>
>>
>> On Thu, Sep 7, 2017 at 12:04 PM, Reynold Xin <r...@databricks.com> wrote:
>>
>> +1 as well
>>
>>
>>
>> On Thu, Sep 7, 2017 at 9:12 PM, Michael Armbrust <mich...@databricks.com>
>> wrote:
>>
>> +1
>>
>>
>>
>> On Thu, Sep 7, 2017 at 9:32 AM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>> +1 (non-binding)
>>
>> Thanks for making the updates reflected in the current PR. It would be
>> great to see the doc updated before it is finally published though.
>>
>> Right now it feels like this SPIP is focused more on getting the basics
>> right for what many datasources are already doing in API V1 combined with
>> other private APIs, vs pushing forward state of the art for performance.
>>
>> I think that’s the right approach for this SPIP. We can add the support
>> you’re talking about later with a more specific plan that doesn’t block
>> fixing the problems that this addresses.
>>
>> 
>>
>>
>>
>> On Thu, Sep 7, 2017 at 2:00 AM, Herman van Hövell tot Westerflier <
>> hvanhov...@databricks.com> wrote:
>>
>> +1 (binding)
>>
>>
>>
>> I personally believe that there is quite a big difference between having
>> a generic data source interface with a low surface area and pushing down a
>> significant part of query processing into a datasource. The latter has a much
>> wider surface area and will require us to stabilize most of the
>> internal catalyst APIs, which will be a significant burden on the community
>> to maintain and has the potential to slow development velocity
>> significantly. If you want to write such integrations then you should be
>> prepared to work with catalyst internals and own up to the fact that things
>> might change across minor versions (and in some cases even maintenance
>> releases). If you are willing to go down that road, then your best bet is
>> to use the already existing spark session extensions which will allow you
>> to write such integrations and can be used as an `escape hatch`.
>>
>>
>>
>>
>>
>> On Thu, Sep 7, 2017 at 10:23 AM, Andrew Ash <and...@andrewash.com> wrote:
>>
>> +0 (non-binding)
>>
>>
>>
>> I think there are benefits to unifying all the Spark-internal datasources
>> into a common public API for sure.  It will serve as a forcing function to
>> ensure that those internal datasources aren't advantaged vs datasources
>> developed externally as plugins to Spark, and that all Spark features are
>> available to all datasources.
>>
>>
>>
>> But I also think this read-path proposal avoids the more difficult
>> questions around how to continue pushing datasource performance forwards.
>> James Baker (my colleague) had a number of questions about advanced
>> pushdowns (combined sorting and filtering), and Reynold also noted that
>> pushdown of aggregates and joins is desirable on longer timeframes as
>> well.  The Spark community saw similar requests, for aggregate pushdown in
>> SPARK-12686, join pushdown in SPARK-20259, and arbitrary plan pushdown
>> in SPARK-12449.  Clearly a number of people are interested in this kind of
>> performance work for datasources.
>>
>>
>>
>> To leave enough space for datasource developers to continue experimenting
>> with advanced interactions between Spark and their datasources, I'd propose
>> we leave some sort of escape valve that enables these datasources to keep
>> pushing the boundaries without forking Spark.  Possibly that looks like an
>> additional unsupported/unstable interface that pushes down an entire
>> (unstable API) logical plan, which is expected to break API on every
>> release.   (Spark attempts this full-plan pushdown, and if that fails Spark
>> ignores it and continues on with the rest of the V2 API for
>> compatibility).  Or maybe it looks like something else that we don't know
>> of yet.  Possibly this falls outside of the desired goals for the V2 API
>> and instead should be a separate SPIP.
>>
>>
>>
>> If we had a plan for this kind of escape valve for advanced datasource
>> developers I'd be an unequivocal +1.  Right now it feels like this SPIP is
>> focused more on getting the basics right for what many datasources are
>> already doing in API V1 combined with other private APIs, vs pushing
>> forward state of the art for performance.
>>
>>
>>
>> Andrew
>>
>>
>>
>> On Wed, Sep 6, 2017 at 10:56 PM, Suresh Thalamati <
>> suresh.thalam...@gmail.com> wrote:
>>
>> +1 (non-binding)
>>
>>
>>
>>
>>
>> On Sep 6, 2017, at 7:29 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>
>>
>> Hi all,
>>
>>
>>
>> In the previous discussion, we decided to split the read and write path
>> of data source v2 into 2 SPIPs, and I'm sending this email to call a vote
>> for Data Source V2 read path only.
>>
>>
>>
>> The full document of the Data Source API V2 is:
>>
>> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>>
>>
>>
>> The ready-for-review PR that implements the basic infrastructure for the
>> read path is:
>>
>> https://github.com/apache/spark/pull/19136
>>
>>
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>>
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>>
>> +0: Don't really care.
>>
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Herman van Hövell
>>
>> Software Engineer
>>
>> Databricks Inc.
>>
>> hvanhov...@databricks.com
>>
>> +31 6 420 590 27
>>
>> databricks.com
>>
>>
>>
>>
>>
>>
>> --
>>
>> Ryan Blue
>>
>> Software Engineer
>>
>> Netflix
>>
>>
>>
>>
>>
>>
>>
>
