[ 
https://issues.apache.org/jira/browse/CALCITE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruben Q L reassigned CALCITE-5559:
----------------------------------

    Assignee: Ruben Q L

> Improve RepeatUnion by discarding duplicates at TableSpool level
> ----------------------------------------------------------------
>
>                 Key: CALCITE-5559
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5559
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Ruben Q L
>            Assignee: Ruben Q L
>            Priority: Major
>
> Currently, RepeatUnion operator with all=false keeps track of the elements 
> that it has returned in order to discard duplicates. However, the TableSpool 
> operators that are right below it do not have such control. In certain 
> scenarios, duplicates are returned by the TableSpool current iteration, 
> discarded by the RepeatUnion, but have been already "fed back" by the 
> TableSpool into the next iteration, causing unnecessary processing.
> We can optimize this scenario by keeping track of the duplicates inside the 
> TableSpool too (note: we still need to keep track of duplicates at 
> RepeatUnion level, because that is the only place where we can detect a 
> potential "global duplicate" of an element: returned by the LHS and then also 
> by the RHS, or by two different iterations of the RHS).
> A PoC testing this improvement on a downstream project showed that certain 
> queries can go from ~40s down to ~1s.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to