[ https://issues.apache.org/jira/browse/CALCITE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698388#comment-17698388 ]
Ruben Q L commented on CALCITE-5559:
------------------------------------

I have created [PR#3101|https://github.com/apache/calcite/pull/3101], which shows what approach A might look like.

> Improve RepeatUnion by discarding duplicates at TableSpool level
> ----------------------------------------------------------------
>
>                 Key: CALCITE-5559
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5559
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Ruben Q L
>            Assignee: Ruben Q L
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, the RepeatUnion operator with all=false keeps track of the elements
> that it has returned in order to discard duplicates. However, the TableSpool
> operators that are right below it do not have such control. In certain
> scenarios, duplicates are returned by the current iteration of the TableSpool,
> discarded by the RepeatUnion, but have already been "fed back" by the
> TableSpool into the next iteration, causing unnecessary processing.
>
> We can optimize this scenario by keeping track of the duplicates
> inside/before the TableSpool too (note: we still need to keep track of
> duplicates at RepeatUnion level, because that is the only place where we can
> detect a potential "global duplicate" of an element: returned by the LHS and
> then also by the RHS, or by two different iterations of the RHS).
>
> A PoC testing this improvement on a downstream project showed that certain
> queries can go from ~40s down to ~1s.
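
To illustrate the scenario described above, here is a minimal, self-contained Java sketch. It is not Calcite code and does not reflect the contents of the PR: the class SpoolDedupSketch, the method repeatUnion, the ladder step function, the dedupAtSpool flag and the maxIterations safety limit are all hypothetical names chosen for this example. It simulates a seed (LHS) plus an iterative step (RHS) and counts how many rows are fed back into the iterative side, with and without duplicate tracking at the spool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical simulation, NOT Calcite's implementation.
class SpoolDedupSketch {

  /** Distinct rows returned, plus the number of rows fed back to the iterative side. */
  static final class Result {
    final Set<Integer> rows;
    final long fedBack;

    Result(Set<Integer> rows, long fedBack) {
      this.rows = rows;
      this.fedBack = fedBack;
    }
  }

  /** Example step: every n below 10 produces n + 1 twice (two paths to the same row). */
  static List<Integer> ladder(Integer n) {
    return n < 10 ? Arrays.asList(n + 1, n + 1) : Collections.<Integer>emptyList();
  }

  static Result repeatUnion(List<Integer> seed, Function<Integer, List<Integer>> step,
      boolean dedupAtSpool, int maxIterations) {
    // Tracking at the "RepeatUnion" level (all=false): always needed for the final result.
    Set<Integer> returned = new LinkedHashSet<>();
    // Tracking at the "spool" level: only consulted when dedupAtSpool is true.
    Set<Integer> spooled = new HashSet<>();
    List<Integer> delta = new ArrayList<>();
    long fedBack = 0;

    for (Integer row : seed) {
      returned.add(row);
      if (!dedupAtSpool || spooled.add(row)) {
        delta.add(row);
      }
    }
    // maxIterations is a safety limit assumed for this sketch only.
    for (int i = 0; i < maxIterations && !delta.isEmpty(); i++) {
      List<Integer> next = new ArrayList<>();
      for (Integer row : delta) {
        fedBack++; // this row is processed by the iterative side
        for (Integer out : step.apply(row)) {
          returned.add(out); // duplicates are discarded here in both modes
          if (!dedupAtSpool || spooled.add(out)) {
            next.add(out); // without spool-level tracking, duplicates are fed back
          }
        }
      }
      delta = next;
    }
    return new Result(returned, fedBack);
  }

  public static void main(String[] args) {
    List<Integer> seed = Collections.singletonList(0);
    Result plain = repeatUnion(seed, SpoolDedupSketch::ladder, false, 100);
    Result dedup = repeatUnion(seed, SpoolDedupSketch::ladder, true, 100);
    System.out.println("rows equal: " + plain.rows.equals(dedup.rows));
    System.out.println("fed back without spool dedup: " + plain.fedBack);
    System.out.println("fed back with spool dedup: " + dedup.fedBack);
  }
}
```

In this toy input both modes return the same 11 distinct rows, but the number of rows fed back into the iterative side drops from 2047 to 11 when the spool also tracks duplicates. The figures come from this artificial example only and say nothing about the speed-up measured in the PoC.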