I am on Pacific Time (currently PDT); Beijing is 15 hours ahead of me. Next
Monday night works for me. Shall we sort out the details offline? Moving the
Spark dev list and the others to bcc.

On Thu, Apr 8, 2021 at 11:48 PM Chang Chen <baibaic...@gmail.com> wrote:

> Hi Huaxin,
>
> I looked into your PR; there may be a way to consolidate the file source
> and the SQL source.
>
> What's the time difference between Beijing and your timezone? I'd prefer
> next Monday night or Tuesday morning.
>
> I can share a Zoom link.
>
> huaxin gao <huaxin.ga...@gmail.com> wrote on Thu, Apr 8, 2021 at 7:10 AM:
>
>> Hi Chang,
>>
>> Thanks for working on this.
>>
>> Could you please explain how your proposal can be extended to file-based
>> data sources? Since at least half of the Spark community uses file-based
>> data sources, any design should take them into account as well. I work on
>> both sql-based and file-based data sources, and I understand that they are
>> very different. It’s challenging to come up with a design that works for
>> both, but the current filter push down and column pruning were designed
>> nicely to fit both sides. I think we should follow the same approach to
>> make Aggregate push down work for both too.
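>>
>> For reference, the existing DS V2 mix-ins look roughly like this (a Scala
>> rendering of the Java interfaces in org.apache.spark.sql.connector.read),
>> and an aggregate counterpart could follow the same pattern. The last trait
>> below is a hypothetical sketch of that counterpart, not a committed API:
>>
>> import org.apache.spark.sql.sources.Filter
>> import org.apache.spark.sql.types.StructType
>>
>> trait SupportsPushDownFilters {
>>   // Push filters to the source; return the ones Spark must still evaluate.
>>   def pushFilters(filters: Array[Filter]): Array[Filter]
>>   def pushedFilters(): Array[Filter]
>> }
>>
>> trait SupportsPushDownRequiredColumns {
>>   // Prune the scan down to the columns the query actually needs.
>>   def pruneColumns(requiredSchema: StructType): Unit
>> }
>>
>> // Placeholder for the pushed-down aggregation descriptor (see the
>> // discussion further down the thread for what it might contain).
>> case class Aggregation(aggregates: Seq[String], groupBy: Seq[String])
>>
>> // Hypothetical aggregate counterpart in the same mix-in style.
>> trait SupportsPushDownAggregates {
>>   // Return true if the source can fully answer the pushed aggregation.
>>   def pushAggregation(aggregation: Aggregation): Boolean
>> }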
>>
>> I am currently collaborating with the Apple Spark team and the Facebook
>> Spark team to push down Aggregate to file-based data sources. We are
>> working on pushing Max/Min/Count down to Parquet, and later to ORC, to
>> take advantage of the statistics stored there (
>> https://github.com/apache/spark/pull/32049). Please correct me if I am
>> wrong: it seems to me that your proposal doesn't consider file-based data
>> sources at all and would block us from continuing this work.
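>>
>> To make the statistics idea concrete, here is a minimal sketch (not the
>> PR itself) of answering COUNT(*)/MIN/MAX for a single Parquet file from
>> the footer alone, without reading any data pages. The file path and the
>> column name "price" are made up for illustration:
>>
>> import org.apache.hadoop.conf.Configuration
>> import org.apache.hadoop.fs.Path
>> import org.apache.parquet.hadoop.ParquetFileReader
>> import scala.collection.JavaConverters._
>>
>> // readFooter is deprecated in newer parquet-mr, but fine for a sketch.
>> val footer = ParquetFileReader.readFooter(new Configuration(), new Path("/tmp/t.parquet"))
>> val blocks = footer.getBlocks.asScala          // one entry per row group
>> val count  = blocks.map(_.getRowCount).sum     // COUNT(*) from metadata alone
>> for (block <- blocks; col <- block.getColumns.asScala) {
>>   val s = col.getStatistics
>>   if (col.getPath.toDotString == "price" && s.hasNonNullValue) {
>>     // File-level MIN/MAX is the fold of these per-row-group values.
>>     println(s"rows=${block.getRowCount} min=${s.genericGetMin} " +
>>       s"max=${s.genericGetMax} nulls=${s.getNumNulls}")
>>   }
>> }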
>>
>> Shall we schedule a meeting to discuss this?
>>
>> Thanks,
>>
>> Huaxin
>>
>>
>>
>> On Wed, Apr 7, 2021 at 1:32 AM Chang Chen <baibaic...@gmail.com> wrote:
>>
>>> Hi Huaxin,
>>>
>>> Please review https://github.com/apache/spark/pull/32061
>>>
>>> As for adding a *trait PrunedFilteredAggregateScan* for V1 JDBC: I
>>> deleted the trait, since the V1 DataSource needn't support aggregation
>>> push down.
>>>
>>> Chang Chen <baibaic...@gmail.com> wrote on Mon, Apr 5, 2021 at 10:02 PM:
>>>
>>>> Hi Huaxin,
>>>>
>>>> What I am concerned about is the abstraction:
>>>>
>>>>    1. How to extend sources.Aggregation. Because Catalyst Expressions
>>>>    are recursive, it is a bad idea to define a new parallel hierarchy; I
>>>>    think each ScanBuilder must convert the pushed expressions into its
>>>>    own format (see the sketch below).
>>>>    2. The optimization rule is also an extension point, and I didn't see
>>>>    any consideration of join push down. I also think
>>>>    SupportsPushDownRequiredColumns and SupportsPushDownFilters are
>>>>    problematic.
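>>>>
>>>> To illustrate point 1, the pushed-down descriptor is necessarily flat,
>>>> something like the following (names approximate the PR, not a final
>>>> API), and a flat shape cannot carry an arbitrary Catalyst expression
>>>> tree:
>>>>
>>>> sealed trait AggregateFunc
>>>> case class Min(column: String) extends AggregateFunc
>>>> case class Max(column: String) extends AggregateFunc
>>>> case class Count(column: String) extends AggregateFunc
>>>>
>>>> case class Aggregation(
>>>>     aggregateExpressions: Seq[AggregateFunc],
>>>>     groupByColumns: Seq[String])
>>>>
>>>> // The flat shape is the issue: SELECT max(a + b) FROM t carries a
>>>> // recursive Catalyst Expression under Max that this cannot represent,
>>>> // so every ScanBuilder has to translate (or reject) pushed expressions.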
>>>>
>>>> Obviously, file-based sources and SQL-based sources have quite different
>>>> push-down capabilities. I am not sure they can be consolidated into one
>>>> API.
>>>>
>>>> I will push my PR tomorrow, and after that, could we schedule a meeting
>>>> to discuss the API?
>>>>
>>>> huaxin gao <huaxin.ga...@gmail.com> wrote on Mon, Apr 5, 2021 at 2:24 AM:
>>>>
>>>>> Hello Chang,
>>>>>
>>>>> Thanks for proposing the SPIP and initiating the discussion.
>>>>> However, I think the problem with your proposal is that it doesn’t take
>>>>> file-based data sources such as Parquet and ORC into consideration. As
>>>>> far as I know, most Spark users rely on file-based data sources. As a
>>>>> matter of fact, I have customers waiting for Aggregate push down for
>>>>> Parquet. That’s the reason my current implementation takes a unified
>>>>> Aggregate push down approach for both the file-based data sources and
>>>>> JDBC.
>>>>>
>>>>> I discussed with several members of the Spark community recently, and
>>>>> we have agreed to break down the Aggregate push down work into the
>>>>> following steps:
>>>>>
>>>>>    1. Implement Max, Min and Count push down in Parquet.
>>>>>    2. Add a new physical plan rewrite rule to remove the partial
>>>>>    aggregate (sketched after this list). We can optimize one step
>>>>>    further and remove the ShuffleExchange if the group-by column and
>>>>>    the partition column are the same.
>>>>>    3. Implement Max, Min and Count push down in JDBC.
>>>>>    4. Implement Sum and Avg push down in JDBC.
>>>>>
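>>>>> For step 2, the rewrite would have roughly the following shape. This
>>>>> is a sketch only: scanReturnsFinalValues is a hypothetical predicate,
>>>>> and a real rule would also have to fix up the output attributes:
>>>>>
>>>>> import org.apache.spark.sql.catalyst.expressions.aggregate.Partial
>>>>> import org.apache.spark.sql.catalyst.rules.Rule
>>>>> import org.apache.spark.sql.execution.SparkPlan
>>>>> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
>>>>>
>>>>> object RemoveRedundantPartialAggregate extends Rule[SparkPlan] {
>>>>>   // Hypothetical check: did the scan already return final aggregate values?
>>>>>   private def scanReturnsFinalValues(plan: SparkPlan): Boolean = ???
>>>>>
>>>>>   override def apply(plan: SparkPlan): SparkPlan = plan transformUp {
>>>>>     // Once the source has answered Max/Min/Count itself, the partial
>>>>>     // aggregate above the scan recomputes nothing and can be elided.
>>>>>     case agg: HashAggregateExec
>>>>>         if agg.aggregateExpressions.forall(_.mode == Partial) &&
>>>>>            scanReturnsFinalValues(agg.child) =>
>>>>>       agg.child
>>>>>   }
>>>>> }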
>>>>>
>>>>> I plan to implement Aggregate push down for Parquet first. The reasons
>>>>> are:
>>>>>
>>>>>    1. It’s relatively easier to implement Parquet Aggregate push down
>>>>>    than JDBC:
>>>>>       - We only need to implement Max, Min and Count.
>>>>>       - There is no need to deal with the differences between Spark and
>>>>>       other databases. For example, aggregating decimal values behaves
>>>>>       differently across database implementations, as the snippet below
>>>>>       illustrates.
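>>>>>
>>>>> For instance, this is the Spark-side behaviour a JDBC push down would
>>>>> have to reproduce exactly (a self-contained snippet; the data is made
>>>>> up):
>>>>>
>>>>> import org.apache.spark.sql.SparkSession
>>>>>
>>>>> val spark = SparkSession.builder.master("local").getOrCreate()
>>>>> import spark.implicits._
>>>>>
>>>>> val df = Seq(BigDecimal("1.10"), BigDecimal("2.20")).toDF("d")
>>>>>   .selectExpr("cast(d as decimal(10,2)) as d")
>>>>> // Spark widens avg() over decimal(10,2) to decimal(14,6)
>>>>> // (precision+4, scale+4); a database may keep a different
>>>>> // precision/scale, changing the result type and rounding.
>>>>> df.selectExpr("avg(d)").printSchema()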
>>>>>
>>>>> The main point is that we want to keep the PR minimal and deliver the
>>>>> basic infrastructure for Aggregate push down first. The PR implementing
>>>>> Parquet Aggregate push down is already very big, and we don’t want one
>>>>> huge PR that solves all the problems; it would be too hard to review.
>>>>>
>>>>>
>>>>>    2. I think it’s too early to implement JDBC Aggregate push down.
>>>>>    Underneath, DS V2 JDBC still calls the DS V1 JDBC path. If we
>>>>>    implemented JDBC Aggregate push down now, we would still need to add
>>>>>    a *trait PrunedFilteredAggregateScan* for V1 JDBC. One of the major
>>>>>    motivations for DS V2 is to make new operator push down more flexible
>>>>>    to implement without adding a new push-down trait each time. If we
>>>>>    still add a new push-down trait in DS V1 JDBC, I feel we are defeating
>>>>>    the purpose of having DS V2 (a sketch of what such a trait would look
>>>>>    like follows below). So I want to wait until we fully migrate to DS V2
>>>>>    JDBC, and then implement Aggregate push down for JDBC.
>>>>>
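>>>>> For context, the V1 trait we would be forced to add would mirror the
>>>>> existing sources.PrunedFilteredScan, roughly like this (hypothetical;
>>>>> Aggregation is the flat descriptor discussed earlier in the thread):
>>>>>
>>>>> import org.apache.spark.rdd.RDD
>>>>> import org.apache.spark.sql.Row
>>>>> import org.apache.spark.sql.sources.Filter
>>>>>
>>>>> // Placeholder for the pushed-down aggregation descriptor.
>>>>> case class Aggregation(aggregates: Seq[String], groupBy: Seq[String])
>>>>>
>>>>> // Every new push-down capability forces another V1 trait like this,
>>>>> // which is exactly what DS V2 was meant to avoid.
>>>>> trait PrunedFilteredAggregateScan {
>>>>>   def buildScan(
>>>>>       requiredColumns: Array[String],
>>>>>       filters: Array[Filter],
>>>>>       aggregation: Aggregation): RDD[Row]
>>>>> }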
>>>>>
>>>>> I have submitted the Parquet Aggregate push down PR. Here is the link:
>>>>>
>>>>> https://github.com/apache/spark/pull/32049
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Huaxin
>>>>>
>>>>>
>>>>> On Fri, Apr 2, 2021 at 1:04 AM Chang Chen <baibaic...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The link is broken, so I have posted a PDF version.
>>>>>>
>>>>>> Chang Chen <baibaic...@gmail.com> wrote on Fri, Apr 2, 2021 at 3:57 PM:
>>>>>>
>>>>>>> Hi All
>>>>>>>
>>>>>>> We would like to post a SPIP for DataSource V2 SQL Push Down in
>>>>>>> Spark. Here is the document link:
>>>>>>>
>>>>>>>
>>>>>>> https://olapio.atlassian.net/wiki/spaces/TeamCX/pages/2667315361/Discuss+SQL+Data+Source+V2+SQL+Push+Down?atlOrigin=eyJpIjoiOTI5NGYzYWMzMWYwNDliOWIwM2ZkODllODk4Njk2NzEiLCJwIjoiYyJ9
>>>>>>>
>>>>>>> This SPIP aims to make push down more extensible.
>>>>>>>
>>>>>>> I would like to thank Huaxin Gao; my prototype is based on her PR. I
>>>>>>> will submit a PR ASAP.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Chang.
>>>>>>>
>>>>>>
