[
https://issues.apache.org/jira/browse/CALCITE-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734813#comment-16734813
]
Vladimir Sitnikov commented on CALCITE-2648:
--------------------------------------------
{quote}Don't set the distribution trait. It only relates to distributed
execution frameworks (e.g. Hadoop and Spark) where there are multiple instances
of each operator, each processing one slice of the input.
{quote}
[~julianhyde], could you please clarify which traits
LogicalWindow(rn=ROW_NUMBER() over (partition by x order by y)) should have?
I claim convention=NONE.collation=[y]. Note that the order of produced rows is
impacted by x, so the rows are not just ordered by y. The rows are sorted by y
WITHIN each group.
Ah, I've suddenly found
org.apache.calcite.rel.RelFieldCollation.Direction#CLUSTERED
Do you suggest LogicalWindow(rn=ROW_NUMBER() over (partition by x order by y))
should produce just convention=NONE.collation=[0:CLUSTERED, 1:ASC] ? In other
words, it should return just a simple collation that says x is clustered and y
is ordered?
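To make the difference between collation=[y] and collation=[0:CLUSTERED, 1:ASC] concrete, here is a minimal sketch in plain Java. It does not use Calcite classes; the class and method names are hypothetical, and rows are modelled as {x, y} pairs. It only illustrates that output "sorted by y within each x group" is not sorted by y overall.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Calcite. Rows are {x, y}. */
final class ClusteredOrderSketch {
  private ClusteredOrderSketch() {}

  /** True if y never decreases from one row to the next (a plain collation on y). */
  static boolean isSortedByY(List<int[]> rows) {
    for (int i = 1; i < rows.size(); i++) {
      if (rows.get(i - 1)[1] > rows.get(i)[1]) {
        return false;
      }
    }
    return true;
  }

  /** True if equal x values are adjacent and y never decreases inside a run of equal x. */
  static boolean isClusteredByXThenSortedByY(List<int[]> rows) {
    Set<Integer> finished = new HashSet<>();
    for (int i = 1; i < rows.size(); i++) {
      int[] prev = rows.get(i - 1);
      int[] cur = rows.get(i);
      if (prev[0] != cur[0]) {
        finished.add(prev[0]); // the run of prev's x value has ended
        if (finished.contains(cur[0])) {
          return false; // x reappears after its run ended: not clustered
        }
      } else if (prev[1] > cur[1]) {
        return false; // y decreases inside one x group
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Possible output of a window partitioned by x and ordered by y.
    List<int[]> rows = Arrays.asList(
        new int[] {2, 1}, new int[] {2, 5}, new int[] {1, 3}, new int[] {1, 4});
    System.out.println("sorted by y: " + isSortedByY(rows));
    System.out.println("clustered by x, sorted by y: " + isClusteredByXThenSortedByY(rows));
  }
}
```

For the sample rows the first check fails and the second one holds, which is the property the CLUSTERED direction is meant to describe.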
{quote}How about making a sub-class of EnumerableWindow that exploits the order
of the input?
{quote}
Sub-class vs a field+if is basically the same thing. The important bit is a
decision on the LogicalWindow "API". In other words, we need to decide whether
LogicalWindow requires sorted input or whether LogicalWindow can sort on its
own. If we decide that LogicalWindow requires pre-sorted input, then
CalcRelSplitter should indeed add the relevant Sort nodes in between
LogicalWindow relations, and CalcRelSplitter should ensure that all
aggregations in each LogicalWindow use the same sort order of the rows.
It is not clear if it is allowed to have different "partition by" in the same
LogicalWindow though.
By the way, the current implementation of CalcRelSplitter assumes LogicalWindow
can sort on its own, since the LogicalWindows it currently creates can easily
include multiple aggregates with vastly different orderings.
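If LogicalWindow were to require pre-sorted input, the splitting step would have to put only aggregates with the same partitioning and ordering into one window. A rough sketch of that grouping follows. It is plain Java with hypothetical names (OverCall, groupByOrdering) and is not how CalcRelSplitter is actually written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; OverCall is a stand-in for one windowed aggregate call. */
final class WindowGroupingSketch {
  private WindowGroupingSketch() {}

  static final class OverCall {
    final String agg;
    final String partitionBy;
    final String orderBy;

    OverCall(String agg, String partitionBy, String orderBy) {
      this.agg = agg;
      this.partitionBy = partitionBy;
      this.orderBy = orderBy;
    }
  }

  /**
   * Groups calls by their (partition by, order by) pair. Each resulting group
   * needs a single input ordering, so it could sit above a single Sort.
   */
  static Map<String, List<String>> groupByOrdering(List<OverCall> calls) {
    Map<String, List<String>> groups = new LinkedHashMap<>();
    for (OverCall call : calls) {
      String key = "partition by " + call.partitionBy + " order by " + call.orderBy;
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(call.agg);
    }
    return groups;
  }

  public static void main(String[] args) {
    List<OverCall> calls = Arrays.asList(
        new OverCall("ROW_NUMBER()", "x", "y"),
        new OverCall("SUM(z)", "x", "y"),
        new OverCall("COUNT(*)", "y", "z"));
    // Two groups: the first two calls share an ordering, the third does not.
    groupByOrdering(calls).forEach((k, v) -> System.out.println(k + " -> " + v));
  }
}
```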
{quote}The code generated would be significantly different, because there is no
need to buffer rows.
{quote}
Note that non-trivial windows/aggregates would HAVE to buffer rows.
For instance: {{count(\*) over ()}}, {{count(\*) over (order by x range
between current row and unbounded following)}}, {{lead(x, 1) over (order by
y)}} and so on. All those aggregates would have to buffer rows one way or
another.
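The contrast can be sketched in plain Java (hypothetical names, rows modelled as {x, y}; this is not the code EnumerableWindow generates). ROW_NUMBER over input that is already clustered by x and sorted by y can emit each row immediately, while count(*) over () cannot emit anything before it has seen the last input row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of streaming vs buffering window evaluation. */
final class WindowBufferingSketch {
  private WindowBufferingSketch() {}

  /**
   * ROW_NUMBER() OVER (PARTITION BY x ORDER BY y), assuming the input is
   * already clustered by x and sorted by y. Keeps only a counter, no buffer.
   */
  static Iterator<int[]> rowNumberStreaming(final Iterator<int[]> input) {
    return new Iterator<int[]>() {
      private boolean first = true;
      private int currentX;
      private int rowNumber;

      @Override public boolean hasNext() {
        return input.hasNext();
      }

      @Override public int[] next() {
        int[] row = input.next();
        if (first || row[0] != currentX) {
          first = false;
          currentX = row[0];
          rowNumber = 0; // a new partition starts
        }
        rowNumber++;
        return new int[] {row[0], row[1], rowNumber};
      }
    };
  }

  /** COUNT(*) OVER (): the total is known only after the whole input is read. */
  static List<int[]> countOverAll(Iterator<int[]> input) {
    List<int[]> buffer = new ArrayList<>();
    while (input.hasNext()) {
      buffer.add(input.next());
    }
    int total = buffer.size();
    List<int[]> out = new ArrayList<>();
    for (int[] row : buffer) {
      out.add(new int[] {row[0], row[1], total});
    }
    return out;
  }

  public static void main(String[] args) {
    List<int[]> rows = Arrays.asList(
        new int[] {1, 10}, new int[] {1, 20}, new int[] {2, 5});
    Iterator<int[]> it = rowNumberStreaming(rows.iterator());
    while (it.hasNext()) {
      System.out.println(Arrays.toString(it.next()));
    }
    for (int[] row : countOverAll(rows.iterator())) {
      System.out.println(Arrays.toString(row));
    }
  }
}
```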
Of course we can pinpoint the aggregations that do not require buffering;
however, I would treat that as an optimization, and I think it is NOT related to
the current ticket.
The current **defect** is "Calcite is unable to plan a window aggregate when a
relation has multiple collations". "No need to buffer" is just an optimization
that can wait.
{quote}In Volcano it amounts to the same thing: you ask for a RelSubset with
the desired sort order, and it may or may not have higher cost than the current
best
{quote}
The current RelSubset does not support multi-collated relations. In other
words, you can't ask for a subset that has multiple collations. Of course,
regular relations rarely have those crazy properties; however, allowing
RelSubset to have multiple collations (==disable RelTraitSet#simplify) would
make it possible to propagate the multi-collated property during planning.
I truly don't get why you say that Volcano assumes simple collations only. Can
you please pinpoint or describe it somehow?
I'm sure composite.satisfy(composite) is well-defined, at least for the
collation trait.
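One plausible definition, sketched in plain Java: a set of actual collations satisfies a set of required collations if every required collation is satisfied by at least one actual collation. This is an assumption for illustration, not a description of what Calcite's composite trait does today. Collations are modelled as lists of ascending field indexes, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; a collation is a list of field indexes, all ascending. */
final class CompositeCollationSketch {
  private CompositeCollationSketch() {}

  /** A simple collation satisfies another if the required one is a prefix of it. */
  static boolean satisfies(List<Integer> actual, List<Integer> required) {
    return required.size() <= actual.size()
        && actual.subList(0, required.size()).equals(required);
  }

  /** A multi-collated relation satisfies a simple request if any of its collations does. */
  static boolean anySatisfies(List<List<Integer>> actuals, List<Integer> required) {
    for (List<Integer> actual : actuals) {
      if (satisfies(actual, required)) {
        return true;
      }
    }
    return false;
  }

  /** Assumed definition of composite-satisfies-composite: every request is met. */
  static boolean compositeSatisfies(
      List<List<Integer>> actuals, List<List<Integer>> requireds) {
    for (List<Integer> required : requireds) {
      if (!anySatisfies(actuals, required)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // A relation sorted both by field 0 and by field 1.
    List<List<Integer>> actuals = Arrays.asList(Arrays.asList(0), Arrays.asList(1));
    System.out.println(anySatisfies(actuals, Arrays.asList(1)));
    System.out.println(compositeSatisfies(actuals, actuals));
  }
}
```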
Just a side note: collation=\[0:ASC, 1:ASC] should satisfy a
collation=\[0:CLUSTERED, 1:ASC] request. In other words, the current
implementation of org.apache.calcite.rel.RelCollationImpl#satisfies might be
improved to account for more valid cases.
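A minimal sketch of such a relaxed check, in plain Java with a simplified, hypothetical Direction enum (Calcite's real RelFieldCollation.Direction has more values). The idea is that sorting on a field, in either direction, keeps equal values adjacent, so a sorted field can stand in for a clustered one, but not the other way round.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not the RelCollationImpl#satisfies implementation. */
final class CollationSatisfiesSketch {
  private CollationSatisfiesSketch() {}

  enum Direction { ASC, DESC, CLUSTERED }

  static final class FieldCollation {
    final int field;
    final Direction direction;

    FieldCollation(int field, Direction direction) {
      this.field = field;
      this.direction = direction;
    }
  }

  /** Same field, and either the same direction or a request for CLUSTERED. */
  static boolean fieldSatisfies(FieldCollation actual, FieldCollation required) {
    if (actual.field != required.field) {
      return false;
    }
    if (actual.direction == required.direction) {
      return true;
    }
    // ASC or DESC implies CLUSTERED; CLUSTERED implies neither ASC nor DESC.
    return required.direction == Direction.CLUSTERED;
  }

  /** The required collation must be matched field by field as a prefix of the actual one. */
  static boolean satisfies(List<FieldCollation> actual, List<FieldCollation> required) {
    if (required.size() > actual.size()) {
      return false;
    }
    for (int i = 0; i < required.size(); i++) {
      if (!fieldSatisfies(actual.get(i), required.get(i))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<FieldCollation> sorted = Arrays.asList(
        new FieldCollation(0, Direction.ASC), new FieldCollation(1, Direction.ASC));
    List<FieldCollation> clustered = Arrays.asList(
        new FieldCollation(0, Direction.CLUSTERED), new FieldCollation(1, Direction.ASC));
    System.out.println(satisfies(sorted, clustered));
    System.out.println(satisfies(clustered, sorted));
  }
}
```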
> Output collation of EnumerableWindow is not consistent with its implementation
> ------------------------------------------------------------------------------
>
> Key: CALCITE-2648
> URL: https://issues.apache.org/jira/browse/CALCITE-2648
> Project: Calcite
> Issue Type: Bug
> Affects Versions: 1.17.0
> Reporter: Hongze Zhang
> Assignee: Julian Hyde
> Priority: Major
>
> Here is a case:
> {code:sql}
> select x, COUNT(*) OVER (PARTITION BY x) from (values (20), (35)) as t(x)
> ORDER BY x
> {code}
> Final plan:
> {code:java}
> EnumerableWindow(window#0=[window(partition {0} order by [] range between
> UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [COUNT()])])
> EnumerableValues(tuples=[[{ 20 }, { 35 }]])
> {code}
> Output rows:
> {code:java}
> X |EXPR$1 |
> ---|-------|
> 35 |1 |
> 20 |1 |
> {code}
> EnumerableWindow is supposed to preserve input collations, as a result
> EnumerableSort is ignored. However the implementation of EnumerableWindow
> generates non-ordered output (when PARTITION BY clause is used).