[
https://issues.apache.org/jira/browse/CALCITE-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216667#comment-14216667
]
Mostafa Mokhtar commented on CALCITE-468:
-----------------------------------------
[~vladimirsitnikov]
Implementing this in Calcite will reduce the query runtime from 1,377 to 343
seconds, based on bloom filters experiments I have done 343 will go down to 200
seconds at best as the remaining time will be dominated by table scans and data
movement.
Bloom filters are definitely the better option as the bloom filter is a couple
of orders of magnitude smaller than the dimension table itself, but this
implementation has to rely on Hive & execution engine (Tez, Spark or MR) to
create a bloom filter than broadcast it to all containers, this will work will
span Calcite + Hive + (MR, Tez, Spark)
> Introduce semi join reduction optimization in Calcite
> ------------------------------------------------------
>
> Key: CALCITE-468
> URL: https://issues.apache.org/jira/browse/CALCITE-468
> Project: Calcite
> Issue Type: Bug
> Reporter: Mostafa Mokhtar
> Assignee: Laljo John Pullokkaran
> Labels: hive
> Attachments: BaselineTree.png, SemiJoinReductionGains.png,
> SemiJoinReductionTreee.png
>
>
> The basic idea is to apply join predicates early in a plan in order to reduce
> the size of intermediate query results and, thus, reduce the cost of other
> operations. In other words, the idea is to apply the same join predicates
> twice or more often in a query plan
> In order to reduce the communication costs of a distributed system.
> Obviously, semi-join reducers are only effective if the (redundant)
> semi-joins are cheap and result in a significant reduction of the size of
> intermediate
> query results.
> I propose to extend a query optimizer and integrate semi-join reducer and
> join-ordering, etc. into a single query optimization step
> Several TPC-DS queries like 24, 64 & 80 run very slow do to the lake of semi
> join reduction optimization in Calcite.
> Doing a rewrite of Q64 to simulate semi join reduction produced 4x gains.
> {code}
> Query Total time CPU Intermediate rows
> (Million)
> Baseline 1,377 356,900
> 23,940
> Semi Join Reduction 343 47,253
> 23
> {code}
> Q64 subset
> {code}
> select
> count(*)
> FROM
> store_sales
> JOIN
> item ON store_sales.ss_item_sk = item.i_item_sk
> JOIN
> store_returns ON store_sales.ss_item_sk = store_returns.sr_item_sk
> JOIN
> (select
> cs_item_sk
> from
> catalog_sales
> JOIN catalog_returns ON catalog_sales.cs_item_sk =
> catalog_returns.cr_item_sk
> group by cs_item_sk
> having
> sum(cs_ext_list_price) > 2 * sum(cr_refunded_cash + cr_reversed_charge +
> cr_store_credit)) cs_ui
> ON store_sales.ss_item_sk = cs_ui.cs_item_sk
> WHERE
> i_color in ('maroon' , 'burnished',
> 'dim',
> 'steel',
> 'navajo',
> 'chocolate')
> and i_current_price between 35 and 35 + 10
> and i_current_price between 35 + 1 and 35 + 15
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)