Hello Chang,

Thanks for proposing the SPIP and initiating the discussion. However, I
think the problem with your proposal is that it doesn't take file-based
data sources such as Parquet, ORC, etc. into consideration. As far as I
know, most Spark users use file-based data sources. As a matter of fact,
I have customers waiting for Aggregate push down for Parquet. That's why
my current implementation takes a unified Aggregate push down approach
for both file-based data sources and JDBC.

I recently discussed this with several members of the Spark community,
and we agreed to break the Aggregate push down work into the following
steps:

   1. Implement Max, Min and Count push down in Parquet (a quick sketch of
      the kind of query this targets follows this list).
   2. Add a new physical plan rewrite rule to remove the partial aggregate.
      As a further optimization, we can also remove the ShuffleExchange if
      the group-by column and the partition column are the same.
   3. Implement Max, Min and Count push down in JDBC.
   4. Implement Sum and Avg push down in JDBC.
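
To make step 1 concrete, here is a minimal, hypothetical sketch (the path
and column names are made up, and spark is the usual SparkSession as in
spark-shell). Max, Min and Count over a Parquet table with no row filter
can be answered from the statistics already kept in the Parquet file
footers, without scanning the rows:

    // Hypothetical illustration of step 1; the path and column names
    // are made up. MAX/MIN/COUNT with no row filter can be served from
    // the min/max/row-count statistics in the Parquet footers instead
    // of a full scan.
    val df = spark.read.parquet("/tmp/events")
    df.createOrReplaceTempView("events")
    spark.sql("SELECT MAX(ts), MIN(ts), COUNT(*) FROM events").show()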


I plan to implement Aggregate push down for Parquet first. The reasons
are:

   1. It's relatively easier to implement Parquet Aggregate push down
      than JDBC:

      1. We only need to implement Max, Min and Count.
      2. There is no need to deal with the differences between Spark and
         other databases. For example, aggregating decimal values behaves
         differently across database implementations (a short sketch of
         this follows below).
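
To illustrate that decimal point with a minimal sketch (not taken from
the PR): Spark widens the result type of Sum over decimals to lower the
overflow risk, while other databases widen differently or simply
overflow, so pushing the aggregate down could change the result type or
behaviour:

    // Minimal sketch of the decimal behaviour mentioned above. Spark
    // widens SUM over DECIMAL(p, s) to DECIMAL(p + 10, s) (capped at
    // precision 38); other databases make different choices, so a
    // pushed-down SUM could disagree on type or overflow behaviour.
    spark.sql("SELECT CAST(1.23 AS DECIMAL(10, 2)) AS d")
      .createOrReplaceTempView("t")
    spark.sql("SELECT SUM(d) FROM t").printSchema()
    // root
    //  |-- sum(d): decimal(20,2) (nullable = true)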

The main point is that we want to keep the PR minimal and build the
basic infrastructure for Aggregate push down first. The PR implementing
Parquet Aggregate push down is already quite big; we don't want one huge
PR that tries to solve all the problems, because it would be too hard to
review.


   2. I think it's too early to implement JDBC Aggregate push down for
      now. Underneath, DS V2 JDBC still calls the DS V1 JDBC path. If we
      implemented JDBC Aggregate push down now, we would still need to
      add a *trait PrunedFilteredAggregateScan* for V1 JDBC (a sketch of
      what that trait might look like follows below). One of the major
      motivations for DS V2 is to make new operator push down easier to
      implement by avoiding the addition of a new push down trait each
      time. If we still add a new push down trait in DS V1 JDBC, I feel
      we are defeating the purpose of having DS V2. So I want to wait
      until we fully migrate to DS V2 JDBC, and then implement Aggregate
      push down for JDBC.
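
To show what that extra V1 trait would look like, here is a hypothetical
sketch modelled on the existing PrunedFilteredScan (the aggregate
representation below is made up for illustration; this is not actual
Spark code):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.Filter

    // Hypothetical V1 trait, by analogy with PrunedFilteredScan. Every
    // new push down capability would force another trait like this,
    // which is exactly what DS V2 is meant to avoid.
    trait PrunedFilteredAggregateScan {
      def buildScan(
          requiredColumns: Array[String],
          filters: Array[Filter],
          aggregates: Array[String]): RDD[Row]
    }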


I have submitted the Parquet Aggregate push down PR. Here is the link:

https://github.com/apache/spark/pull/32049


Thanks,

Huaxin


On Fri, Apr 2, 2021 at 1:04 AM Chang Chen <baibaic...@gmail.com> wrote:

> The link is broken. I have posted a PDF version.
>
> On Fri, Apr 2, 2021 at 3:57 PM Chang Chen <baibaic...@gmail.com> wrote:
>
>> Hi All
>>
>> We would like to post a SPIP of Datasource V2 SQL PushDown in Spark.
>> Here is the document link:
>>
>>
>> https://olapio.atlassian.net/wiki/spaces/TeamCX/pages/2667315361/Discuss+SQL+Data+Source+V2+SQL+Push+Down?atlOrigin=eyJpIjoiOTI5NGYzYWMzMWYwNDliOWIwM2ZkODllODk4Njk2NzEiLCJwIjoiYyJ9
>>
>> This SPIP aims to make push down more extensible.
>>
>> I would like to thank Huaxin Gao; my prototype is based on her PR. I
>> will submit a PR ASAP.
>>
>> Thanks
>>
>> Chang.
>>
>
