[jira] [Updated] (CALCITE-5559) Improve RepeatUnion by discarding duplicates at TableSpool level

Ruben Q L (Jira) Tue, 07 Mar 2023 00:03:04 -0800


     [ https://issues.apache.org/jira/browse/CALCITE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ruben Q L updated CALCITE-5559:
-------------------------------
    Description: 
Currently, the RepeatUnion operator with all=false keeps track of the elements 
that it has returned in order to discard duplicates. However, the TableSpool 
operators directly below it have no such control. In certain scenarios, 
duplicates returned by the TableSpool in the current iteration are discarded by 
the RepeatUnion, but have already been "fed back" by the TableSpool into the 
next iteration, causing unnecessary processing.
We can optimize this scenario by also keeping track of duplicates inside/before 
the TableSpool (note: we still need to keep track of duplicates at the 
RepeatUnion level, because that is the only place where we can detect a 
potential "global duplicate" of an element: one returned by the LHS and then 
also by the RHS, or by two different iterations of the RHS).

A PoC testing this improvement on a downstream project showed that certain 
queries can go from ~40s down to ~1s.

  was:
Currently, the RepeatUnion operator with all=false keeps track of the elements 
that it has returned in order to discard duplicates. However, the TableSpool 
operators directly below it have no such control. In certain scenarios, 
duplicates returned by the TableSpool in the current iteration are discarded by 
the RepeatUnion, but have already been "fed back" by the TableSpool into the 
next iteration, causing unnecessary processing.
We can optimize this scenario by also keeping track of duplicates inside the 
TableSpool (note: we still need to keep track of duplicates at the RepeatUnion 
level, because that is the only place where we can detect a potential "global 
duplicate" of an element: one returned by the LHS and then also by the RHS, or 
by two different iterations of the RHS).

A PoC testing this improvement on a downstream project showed that certain 
queries can go from ~40s down to ~1s.


> Improve RepeatUnion by discarding duplicates at TableSpool level
> ----------------------------------------------------------------
>
>                 Key: CALCITE-5559
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5559
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Ruben Q L
>            Assignee: Ruben Q L
>            Priority: Major
>
> Currently, the RepeatUnion operator with all=false keeps track of the 
> elements that it has returned in order to discard duplicates. However, the 
> TableSpool operators directly below it have no such control. In certain 
> scenarios, duplicates returned by the TableSpool in the current iteration are 
> discarded by the RepeatUnion, but have already been "fed back" by the 
> TableSpool into the next iteration, causing unnecessary processing.
> We can optimize this scenario by also keeping track of duplicates 
> inside/before the TableSpool (note: we still need to keep track of 
> duplicates at the RepeatUnion level, because that is the only place where we 
> can detect a potential "global duplicate" of an element: one returned by the 
> LHS and then also by the RHS, or by two different iterations of the RHS).
> A PoC testing this improvement on a downstream project showed that certain 
> queries can go from ~40s down to ~1s.
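
To illustrate the effect described above, here is a minimal toy model in plain Java. It is not Calcite code and does not use any Calcite API: the class SpoolDedupSketch, the methods spool, repeatUnion, parallelEdges and run, and the example graph are all hypothetical names introduced for this sketch. It assumes that "discarding duplicates at TableSpool level" means dropping duplicates within the rows of a single iteration before they are fed back, while the RepeatUnion-level set still catches duplicates across iterations. The counter only measures how many rows are fed back into the iterative side; it makes no claim about the timings reported above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Toy model of a RepeatUnion(all=false) fixpoint loop; NOT Calcite code. */
public class SpoolDedupSketch {

  /** Distinct result rows, plus how many rows were fed back into the iterative side. */
  public static final class Stats {
    public final Set<Integer> result;
    public final long rowsFedBack;

    Stats(Set<Integer> result, long rowsFedBack) {
      this.result = result;
      this.rowsFedBack = rowsFedBack;
    }
  }

  /** Hypothetical spool: copies the rows, optionally dropping duplicates within this batch. */
  private static List<Integer> spool(List<Integer> rows, boolean dedupAtSpool) {
    return dedupAtSpool ? new ArrayList<>(new LinkedHashSet<>(rows)) : new ArrayList<>(rows);
  }

  /** Runs seed, then applies step to the spooled rows until no new row is returned. */
  public static Stats repeatUnion(
      List<Integer> seed, Function<Integer, List<Integer>> step, boolean dedupAtSpool) {
    // RepeatUnion-level tracking: still required to catch "global" duplicates.
    Set<Integer> returned = new LinkedHashSet<>();
    long rowsFedBack = 0;
    List<Integer> delta = spool(seed, dedupAtSpool);
    boolean emittedNew = returned.addAll(delta);
    while (emittedNew) {
      rowsFedBack += delta.size();
      List<Integer> produced = new ArrayList<>();
      for (Integer row : delta) {
        produced.addAll(step.apply(row));
      }
      // Without spool-level dedup, duplicates in produced are fed back as they are.
      delta = spool(produced, dedupAtSpool);
      emittedNew = returned.addAll(delta);
    }
    return new Stats(returned, rowsFedBack);
  }

  /** Illustrative graph: every node below 10 has two parallel edges to its successor. */
  static List<Integer> parallelEdges(Integer node) {
    return node < 10 ? List.of(node + 1, node + 1) : List.of();
  }

  /** Runs the example graph starting from node 0. */
  public static Stats run(boolean dedupAtSpool) {
    return repeatUnion(List.of(0), SpoolDedupSketch::parallelEdges, dedupAtSpool);
  }

  public static void main(String[] args) {
    System.out.println(run(false).rowsFedBack); // 2047
    System.out.println(run(true).rowsFedBack);  // 11
  }
}
```

Both runs return the same 11 distinct rows, because the RepeatUnion-level set filters duplicates in either case. The difference is in the work done by the iterative side: without spool-level deduplication the number of rows fed back doubles at every iteration in this example, whereas with it each iteration feeds back a single row.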



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
