[jira] [Commented] (CALCITE-5559) Improve RepeatUnion by discarding duplicates at TableSpool level

Julian Hyde (Jira) Tue, 07 Mar 2023 11:04:14 -0800


    [ 
https://issues.apache.org/jira/browse/CALCITE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697593#comment-17697593
 ]


Julian Hyde commented on CALCITE-5559:
--------------------------------------

Sorry, coming late to this, and haven't read everything, but let me just check 
some things at a high level. RepeatUnion implements seminaive evaluation, 
right? And that means detecting whether this iteration found some values that 
previous iterations did not. That is, we compute the difference. If the 
semantics of the particular query allow this to be done using set-difference 
(as opposed to multiset-difference) then this would often (maybe always) seem 
to be a win.
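
To make the set-difference versus multiset-difference distinction concrete, 
here is a minimal sketch in plain Python. It is not Calcite code: the function 
names and the sample rows are made up for illustration, and it only shows how 
the two kinds of delta differ in size for the same inputs.

```python
from collections import Counter


def multiset_difference(current, previous):
    """Bag semantics: each previous row cancels at most one matching current row."""
    remaining = Counter(current) - Counter(previous)
    return sorted(remaining.elements())


def set_difference(current, previous):
    """Set semantics: keep each row that was never seen before, exactly once."""
    return sorted(set(current) - set(previous))


# Hypothetical rows produced by this iteration and by earlier iterations.
current = ["a", "b", "b", "c", "c", "c"]
previous = ["a", "c"]

bag_delta = multiset_difference(current, previous)
set_delta = set_difference(current, previous)
```

With these inputs the multiset delta still carries four rows (two copies each 
of "b" and "c"), while the set delta carries only "b". When the query allows 
set semantics, the smaller delta is what the next iteration has to process.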

My hunch is that there should be an 'eliminate duplicates' flag that applies to 
the deltas. Maybe it is the same flag that applies to the collections of 
records at each iteration, or maybe it is a different flag.

> Improve RepeatUnion by discarding duplicates at TableSpool level
> ----------------------------------------------------------------
>
>                 Key: CALCITE-5559
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5559
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Ruben Q L
>            Assignee: Ruben Q L
>            Priority: Major
>
> Currently, the RepeatUnion operator with all=false keeps track of the elements 
> that it has returned in order to discard duplicates. However, the TableSpool 
> operators that are right below it do not have such control. In certain 
> scenarios, duplicates are returned by the TableSpool in the current iteration 
> and discarded by the RepeatUnion, but have already been "fed back" by the 
> TableSpool into the next iteration, causing unnecessary processing.
> We can optimize this scenario by keeping track of the duplicates 
> inside/before the TableSpool too (note: we still need to keep track of 
> duplicates at RepeatUnion level, because that is the only place where we can 
> detect a potential "global duplicate" of an element: returned by the LHS and 
> then also by the RHS, or by two different iterations of the RHS).
> A PoC testing this improvement on a downstream project showed that certain 
> queries can go from ~40s down to ~1s.
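
The effect described in the issue can be illustrated with a toy model in plain 
Python. This is a sketch under stated assumptions, not Calcite's 
implementation: `repeat_union`, `step`, the `dedupe_spool` flag and the sample 
edge table are all hypothetical, and the input is assumed to be acyclic so 
that both modes terminate. The model counts how many rows the iterative step 
reads from the spool, with and without filtering duplicates before they are 
fed back.

```python
def repeat_union(seed, step, dedupe_spool):
    """Toy model of RepeatUnion(all=false) fed by a spool.

    Returns (distinct rows returned, rows read from the spool).
    Assumes acyclic data; without spool-level dedup a cycle would never end.
    """
    seen = set()    # RepeatUnion-level duplicate tracking
    output = []     # distinct rows returned to the consumer
    processed = 0   # rows the iterative step had to read
    spool = list(seed)
    while spool:
        if dedupe_spool:
            # Assumed spool-level filter: keep the first occurrence of each
            # row in this batch, and only if it was never returned before.
            spool = [r for r in dict.fromkeys(spool) if r not in seen]
        for row in spool:
            if row not in seen:
                seen.add(row)
                output.append(row)
        processed += len(spool)
        # Feed every spooled row into the next iteration.
        spool = [out for row in spool for out in step(row)]
    return output, processed


# Hypothetical edge table: each node i has two parallel edges to i + 1,
# so every iteration produces each new row more than once.
DEPTH = 10
edges = [(i, i + 1) for i in range(DEPTH) for _ in range(2)]


def step(row):
    return [dst for src, dst in edges if src == row]


naive = repeat_union([0], step, dedupe_spool=False)
deduped = repeat_union([0], step, dedupe_spool=True)
```

Both runs return the same 11 distinct rows, but in this model the step reads 
2047 rows without spool-level filtering and 11 rows with it. The numbers are 
specific to this artificial edge table and say nothing about the timings 
quoted above; they only show why duplicates that are fed back can multiply 
from one iteration to the next.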



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
