[ https://issues.apache.org/jira/browse/CALCITE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated CALCITE-5559:
------------------------------------
    Labels: pull-request-available  (was: )

> Improve RepeatUnion by discarding duplicates at TableSpool level
> ----------------------------------------------------------------
>
>                 Key: CALCITE-5559
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5559
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Ruben Q L
>            Assignee: Ruben Q L
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, RepeatUnion operator with all=false keeps track of the elements
> that it has returned in order to discard duplicates. However, the TableSpool
> operators that are right below it do not have such control. In certain
> scenarios, duplicates are returned by the TableSpool current iteration,
> discarded by the RepeatUnion, but have been already "fed back" by the
> TableSpool into the next iteration, causing unnecessary processing.
> We can optimize this scenario by keeping track of the duplicates
> inside/before the TableSpool too (note: we still need to keep track of
> duplicates at RepeatUnion level, because that is the only place where we can
> detect a potential "global duplicate" of an element: returned by the LHS and
> then also by the RHS, or by two different iterations of the RHS).
> A PoC testing this improvement on a downstream project showed that certain
> queries can go from ~40s down to ~1s.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
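
The effect described in the issue can be illustrated with a small toy model. The sketch below is not Calcite code and does not use any Calcite API: the class SpoolDedupSketch, its methods, and the layered test graph are all hypothetical names invented for illustration. It assumes that the spool-level check is a per-iteration "seen" set, which is one possible reading of "inside/before the TableSpool"; the actual patch may work differently. The "emitted" set plays the role of the RepeatUnion-level tracking (all=false), and the work table stands in for the rows that the spool feeds back into the next iteration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SpoolDedupSketch {
    /** Outcome of one simulated run (hypothetical helper type). */
    public static final class Run {
        public final Set<Integer> result;   // distinct rows returned by the union
        public final long rowsProcessed;    // rows read back from the work table
        Run(Set<Integer> result, long rowsProcessed) {
            this.result = result;
            this.rowsProcessed = rowsProcessed;
        }
    }

    /** Simulates a recursive "reachable nodes" query starting from seed. */
    public static Run run(Map<Integer, List<Integer>> edges, int seed, boolean dedupAtSpool) {
        Set<Integer> emitted = new HashSet<>();      // union-level duplicate tracking
        List<Integer> workTable = new ArrayList<>(); // rows fed back to the next iteration
        emitted.add(seed);
        workTable.add(seed);
        long rowsProcessed = 0;
        boolean producedNewRow = true;
        // Stop once an iteration returns no row that the union has not seen before.
        while (producedNewRow) {
            producedNewRow = false;
            List<Integer> nextWorkTable = new ArrayList<>();
            Set<Integer> seenBySpool = new HashSet<>(); // assumed per-iteration spool check
            for (int node : workTable) {
                rowsProcessed++;
                for (int successor : edges.getOrDefault(node, Collections.<Integer>emptyList())) {
                    if (dedupAtSpool && !seenBySpool.add(successor)) {
                        continue; // duplicate dropped before it reaches the work table
                    }
                    nextWorkTable.add(successor);
                    if (emitted.add(successor)) {
                        producedNewRow = true;
                    }
                }
            }
            workTable = nextWorkTable;
        }
        return new Run(emitted, rowsProcessed);
    }

    /** Node 0 points to every node of layer 1; each layer points to all of the next. */
    public static Map<Integer, List<Integer>> layeredGraph(int depth, int width) {
        Map<Integer, List<Integer>> edges = new HashMap<>();
        List<Integer> previous = Collections.singletonList(0);
        for (int layer = 1; layer <= depth; layer++) {
            List<Integer> current = new ArrayList<>();
            for (int i = 0; i < width; i++) {
                current.add((layer - 1) * width + 1 + i);
            }
            for (int node : previous) {
                edges.put(node, current);
            }
            previous = current;
        }
        return edges;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> edges = layeredGraph(6, 4);
        Run baseline = run(edges, 0, false);
        Run improved = run(edges, 0, true);
        System.out.println("same result: " + baseline.result.equals(improved.result));
        System.out.println("rows processed without spool dedup: " + baseline.rowsProcessed);
        System.out.println("rows processed with spool dedup: " + improved.rowsProcessed);
    }
}
```

Both runs return the same set of rows, because the union-level check still removes every duplicate. The difference is only in how many rows are fed back: when many paths lead to the same row within one iteration, the work table without a spool-level check grows with the number of paths rather than the number of distinct rows. The numbers from this toy model say nothing about the timings quoted in the issue.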