[ https://issues.apache.org/jira/browse/CALCITE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruben Q L reassigned CALCITE-5559:
----------------------------------

    Assignee: Ruben Q L

> Improve RepeatUnion by discarding duplicates at TableSpool level
> ----------------------------------------------------------------
>
>                 Key: CALCITE-5559
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5559
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Ruben Q L
>            Assignee: Ruben Q L
>            Priority: Major
>
> Currently, the RepeatUnion operator with all=false keeps track of the elements
> that it has returned in order to discard duplicates. However, the TableSpool
> operators right below it have no such control. In certain scenarios,
> duplicates are returned by the TableSpool in the current iteration and
> discarded by the RepeatUnion, but they have already been "fed back" by the
> TableSpool into the next iteration, causing unnecessary processing.
> We can optimize this scenario by keeping track of the duplicates inside the
> TableSpool too (note: we still need to keep track of duplicates at
> RepeatUnion level, because that is the only place where we can detect a
> potential "global duplicate" of an element: one returned by the LHS and then
> also by the RHS, or by two different iterations of the RHS).
> A PoC testing this improvement on a downstream project showed that certain
> queries can go from ~40s down to ~1s.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
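
To make the effect described in the issue concrete, here is a minimal, self-contained Java sketch. It does not use Calcite's classes or reproduce its implementation: `SpoolDedupSketch`, `reachable`, `sampleGraph` and `Run` are hypothetical names invented for illustration. The sketch assumes a recursive "reachable nodes" query whose iterative step joins the rows of the previous iteration with an edge list, and it models the spool as a plain collection that is either a list (no duplicate control) or a set (duplicates discarded within the current iteration). The RepeatUnion-level set is kept in both modes, since only it can detect duplicates across iterations.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation; none of these names come from Calcite.
final class SpoolDedupSketch {

    /** Outcome of one run: the distinct rows returned, and the rows fed back into later iterations. */
    static final class Run {
        public final Set<Integer> returned;
        public final long fedBack;

        Run(Set<Integer> returned, long fedBack) {
            this.returned = returned;
            this.fedBack = fedBack;
        }
    }

    /** Small layered graph in which several paths lead to the same node. */
    public static Map<Integer, List<Integer>> sampleGraph() {
        return Map.of(
            0, List.of(1, 2, 3),
            1, List.of(4, 5, 6),
            2, List.of(4, 5, 6),
            3, List.of(4, 5, 6),
            4, List.of(7),
            5, List.of(7),
            6, List.of(7));
    }

    public static Run reachable(Map<Integer, List<Integer>> edges, int start, boolean dedupInSpool) {
        // RepeatUnion level (all=false): every row returned so far, across seed and iterations.
        Set<Integer> returned = new LinkedHashSet<>();
        returned.add(start);

        // Rows written by the spool for the next iteration; the seed provides the first batch.
        List<Integer> delta = new ArrayList<>();
        delta.add(start);

        long fedBack = 0;
        while (true) {
            // Spool level: a set drops duplicates within this iteration, a list keeps them.
            Collection<Integer> spool;
            if (dedupInSpool) {
                spool = new LinkedHashSet<Integer>();
            } else {
                spool = new ArrayList<Integer>();
            }

            boolean anyNew = false;
            for (int node : delta) {
                for (int next : edges.getOrDefault(node, List.of())) {
                    spool.add(next);            // written for the next iteration
                    if (returned.add(next)) {   // global duplicate check
                        anyNew = true;
                    }
                }
            }
            if (!anyNew) {
                break;                          // no new rows: the recursion is done
            }
            fedBack += spool.size();
            delta = new ArrayList<>(spool);
        }
        return new Run(returned, fedBack);
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> graph = sampleGraph();
        Run plain = reachable(graph, 0, false);
        Run deduped = reachable(graph, 0, true);
        System.out.println("rows fed back without spool dedup: " + plain.fedBack);
        System.out.println("rows fed back with spool dedup:    " + deduped.fedBack);
        System.out.println("same result set: " + plain.returned.equals(deduped.returned));
    }
}
```

On this sample graph both modes return the same eight nodes, but the list-based spool feeds 21 rows into later iterations while the set-based one feeds 7. The gap widens with the number of converging paths, which is the kind of redundant work the issue proposes to avoid.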