[jira] [Commented] (CALCITE-468) Introduce semi join reduction optimization in Calcite

Mostafa Mokhtar (JIRA) Tue, 18 Nov 2014 11:46:13 -0800

    [ 
https://issues.apache.org/jira/browse/CALCITE-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216667#comment-14216667
 ]


Mostafa Mokhtar commented on CALCITE-468:
-----------------------------------------

[~vladimirsitnikov]

Implementing this in Calcite will reduce the query runtime from 1,377 to 343 
seconds, based on bloom filters experiments I have done 343 will go down to 200 
seconds at best as the remaining time will be dominated by table scans and data 
movement.

Bloom filters are definitely the better option as the bloom filter is a couple 
of orders of magnitude smaller than the dimension table itself, but this 
implementation has to rely on Hive & execution engine (Tez, Spark or MR) to 
create a bloom filter than broadcast it to all containers, this will work will 
span Calcite + Hive + (MR, Tez, Spark)





> Introduce semi join reduction optimization in Calcite 
> ------------------------------------------------------
>
>                 Key: CALCITE-468
>                 URL: https://issues.apache.org/jira/browse/CALCITE-468
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Mostafa Mokhtar
>            Assignee: Laljo John Pullokkaran
>              Labels: hive
>         Attachments: BaselineTree.png, SemiJoinReductionGains.png, 
> SemiJoinReductionTreee.png
>
>
> The basic idea is to apply join predicates early in a plan in order to reduce 
> the size of intermediate query results and, thus, reduce the cost of other 
> operations. In other words, the idea is to apply the same join predicates 
> twice or more often in a query plan
> In order to reduce the communication costs of a distributed system. 
> Obviously, semi-join reducers are only effective if the (redundant) 
> semi-joins are cheap and result in a significant reduction of the size of 
> intermediate
> query results.
> I propose to extend a query optimizer and integrate semi-join reducer and
> join-ordering, etc. into a single query optimization step
> Several TPC-DS queries like 24, 64 & 80 run very slow do to the lake of semi 
> join reduction optimization in Calcite.
> Doing a rewrite of Q64 to simulate semi join reduction produced 4x gains.
> {code}
> Query                 Total time           CPU        Intermediate rows 
> (Million)
> Baseline                      1,377    356,900                            
> 23,940
> Semi Join Reduction           343       47,253                                
> 23
> {code}
> Q64 subset 
> {code}
> select 
>     count(*)
> FROM
>     store_sales
>         JOIN
>     item ON store_sales.ss_item_sk = item.i_item_sk
>         JOIN
>     store_returns ON store_sales.ss_item_sk = store_returns.sr_item_sk
>         JOIN
>     (select 
>         cs_item_sk
>     from
>         catalog_sales
>     JOIN catalog_returns ON catalog_sales.cs_item_sk = 
> catalog_returns.cr_item_sk
>     group by cs_item_sk
>     having 
> sum(cs_ext_list_price) > 2 * sum(cr_refunded_cash + cr_reversed_charge + 
> cr_store_credit)) cs_ui 
> ON store_sales.ss_item_sk = cs_ui.cs_item_sk
> WHERE
>     i_color in ('maroon' , 'burnished',
>         'dim',
>         'steel',
>         'navajo',
>         'chocolate')
>         and i_current_price between 35 and 35 + 10
>         and i_current_price between 35 + 1 and 35 + 15
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CALCITE-468) Introduce semi join reduction optimization in Calcite

Reply via email to