[ 
https://issues.apache.org/jira/browse/CALCITE-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17115376#comment-17115376
 ] 

Haisheng Yuan commented on CALCITE-3221:
----------------------------------------

Ah, apparently, I was wrong. I was think of it using Greenplum or MaxCompute's 
mindset, which doesn't apply to Enumerable operators.

In both products, UNION is rewritten to Aggregate (distinct) on top of UNOIN 
ALL.
There is only PhysicalUnionAll operator, but no PhysicalUnion(where all is 
false) operator. If the parent operator requests a collation, it can generate 
an alternative with the required collation, at the same time requires its 
children to satisfy the same collation. It doesn't need to have additional 
field to indicate order preserving at all, if the PhysicalUnionAll's collation 
is not empty, that means it needs to do sorted merge operation.

Separating distinct operation has some benefits. The distinct can be hash based 
or sort based, UNION ALL doesn't need to worry about it, just merge the data 
(with or without order). Because aggregate and sort are among the most complex 
operators (window is the most complex one), we don't want to duplicate the 
complex logic in UNION operator. In addition, it can utilize the existing 
multi-stage StreamAgg / HashAgg operators, apparently we don't want to create a 
2-stage UNION operator.

And this is SQL Server's execution plan:
 !screenshot-1.png! 

But since we don't need to worry all that stuff for EnumerableConvention, which 
is non-MPP and in-memory, and the existing one is already hash based, so it 
totally makes sense to add another physical operator to do sort based union.

> Add a sort-merge union algorithm
> --------------------------------
>
>                 Key: CALCITE-3221
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3221
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.19.0
>            Reporter: Stamatis Zampetakis
>            Priority: Minor
>         Attachments: screenshot-1.png
>
>
> Currently, the union operation offered by Calcite is based on a {{HashSet}} 
> (see 
> [EnumerableDefaults.union|https://github.com/apache/calcite/blob/d98856bf1a5f5c151d004b769e14bdd368a67234/linq4j/src/main/java/org/apache/calcite/linq4j/EnumerableDefaults.java#L2747])
>  and necessitates reading in memory all rows before returning a single 
> result.   
> Apart from increased memory consumption the operator is blocking and also 
> destroys the order of its inputs.  
> The goal of this issue is to add a new union algorithm (EnumerableMergeUnion 
> ?) exploiting the fact that the inputs are sorted which consumes less memory 
> and retains the order of its inputs.   
> Most likely the implementation of the merge join can be useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to