[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387518#comment-17387518 ] Haisheng Yuan commented on CALCITE-3181: Yeah, with "row()" the transformation should work. Thanks for the example. > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378573#comment-17378573 ] Julian Hyde commented on CALCITE-3181: -- You can use use {{ROW(a, b, y)}}. For example, if a is {{state}}, b is {{year}}, x is {{color}} and {{y}} is {{units}}, you can write {code:java} SELECT color, ARRAY_AGG(r) WITHIN GROUP (ORDER BY r.units DESC LIMIT 3) FROM (SELECT color, ROW (state, year, units) AS r FROM Sales) GROUP BY color{code} instead of {code:java} SELECT state, year, color FROM (SELECT state, year, color, ROW_NUMBER() OVER (PARTITION BY color ORDER BY units DESC) AS rn FROM Sales) WHERE rn <= 3{code} > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377857#comment-17377857 ] Haisheng Yuan commented on CALCITE-3181: Yeah, that makes more sense. How does it work for the following case? SELECT a, b, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y)... here the value of "a" and "b" may vary row by row. > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377711#comment-17377711 ] Julian Hyde commented on CALCITE-3181: -- Only one row per group, but that row contains an array of row objects which, when unnested, become multiple rows. > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377683#comment-17377683 ] Haisheng Yuan commented on CALCITE-3181: Yes, they are quite similar with each other. There are still some slight difference, the window keep the row count to the limit, but the aggregate in CALCITE-4687 only generate 1 row per group, no matter with or without the limit. So maybe not exchangable? > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377599#comment-17377599 ] Julian Hyde commented on CALCITE-3181: -- [~hyuan], I just logged CALCITE-4687, which is another way of posing queries that have "per-key limits". Please take a look, and let me know what you think. The important insight, I think, is that computing the rank is expensive, but filtering on the rank (say {{rn <= 3}}) may be much cheaper, as long as you don't return the rank column in your result. Since the two formulations are equivalent, it would be possible to define rules that transform one into the other. > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16885545#comment-16885545 ] Julian Hyde commented on CALCITE-3181: -- The {{enum RelFieldCollation}} has values ASCENDING, STRICTLY_ASCENDING, DESCENDING, STRICTLY_DESCENDING and CLUSTERED. CLUSTERED exists for precisely this purpose. We have no way of saying in a physical property (trait) "for a given deptno key there are no more than 10 rows". My instinct says we probably shouldn't add a way; it would make the trait more complicated for everyone else. So we'd do this by transformation (forward chaining) not by traits (backward chaining). > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881722#comment-16881722 ] Haisheng Yuan commented on CALCITE-3181: {quote} It would make sense to have an implementation of Window (let's call it StreamingWindow) that exploits the fact that the input is sorted. {quote} This is exactly how Greenplum database and Orca optimizer implement Window operator. In Orca, the physical window operator requests physical properties of [partition keys] hash distribution and [partition keys, sort keys] sort order, which gives optimizer more optimization opportunities, like window reordering to avoid redundant sort or shuffle. {quote} if Sort supported partitioned top-n, then we could create a StreamingWindow on top of a top-n Sort, and the window would not even have to worry that the Sort is removing rows. {quote} Agree. So that we don't need to change Window semantics. What physical order property would the top-n Sort deliver? [partition keys + sort keys]? But if the sort is hash table based implementation, the [sort keys] order trait might be wrong. > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881530#comment-16881530 ] Julian Hyde commented on CALCITE-3181: -- Now switching to consider algebra rather than SQL syntax. It would make sense to have an implementation of {{Window}} (let's call it {{StreamingWindow}}) that exploits the fact that the input is sorted. If the Window only needed the top 3 rows in each partition, and if {{Sort}} supported partitioned top-n, then we could create a {{StreamingWindow}} on top of a top-n {{Sort}}, and the window would not even have to worry that the {{Sort}} is removing rows. > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881528#comment-16881528 ] Julian Hyde commented on CALCITE-3181: -- I wonder if there is any way to express what we want using the {{rows}} sub-clause. Your query is *almost* the same as {code} select category, product, pv, sum(pv) over(partition by category order by sale_count desc rows 10 preceding) from prod; {code} except that "rows 10 preceding" is relative to the current row whereas we want the top 10 rows, i.e. the top row and its 10 preceding rows. > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881121#comment-16881121 ] Haisheng Yuan commented on CALCITE-3181: e.g. compute the total page views for the top 10 hot-sale products in each category: {code:java} select category, product, pv, sum(pv) over(partition by category order by sale_count desc limit 10) from prod; {code} Instead of sort-based window implementation, we can use hash table based implementation for window with limit. Use the partition key as the hash key, each distinct key has an entry that contains a heap if there is order by clause, with the heap size be the limit value. In distributed system, we can have 2-phase window (global and local), execute local window before shuffling tuples by the window partition key, which might reduce the shuffle cost dramatically. But the side effect is also obvious: this will break the assumption that window operator keeps the cardinality unchanged. > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CALCITE-3181) Support limit per group in Window
[ https://issues.apache.org/jira/browse/CALCITE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880945#comment-16880945 ] Julian Hyde commented on CALCITE-3181: -- It seems to make sense. Can you devise an example on a "real" data set, and involving not just ROW_NUMBER but an aggregate function such as AVG or SUM. I know it is not standard SQL, but can you sketch out how the query would look if we allowed "OVER (PARTITION BY ... ORDER BY ... FETCH ...)"? A while ago I was thinking about "partitioned sort-limit". We support "Sort(Scan(EMP), key=[salary DESC], fetch=10)" (the 10 top-earning employees) but we don't support "Sort(Scan(EMP), partition=[deptno], key=[salary DESC], fetch=10)" (the top 10 top-earning employees in each department). This came up in CALCITE-1317, the "WinMagic" optimization. Regular Sort with limit can handle non-correlated sub-queries, but for correlated sub-queries we needed partitioned Sort with limit. Thus it seems a natural and useful extension to Sort. I haven't thought about whether the extension of Window that you propose is as natural. > Support limit per group in Window > - > > Key: CALCITE-3181 > URL: https://issues.apache.org/jira/browse/CALCITE-3181 > Project: Calcite > Issue Type: Improvement > Components: core >Reporter: Haisheng Yuan >Priority: Major > > We have a lot of queries like the following to retrieve top N tuples per > group: > {code:java} > SELECT x, y FROM > (SELECT x, y, ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) > AS rn FROM t1) t2 WHERE rn <= 3; > {code} > The performance is not good if each group has a lot more tuples than wanted, > because we will retrieve and sort all the tuples, instead of just doing a > top-N heap sort. > In order to do optimization for this kind of query, we need to extend window > to support limit, if and only if there is only 1 window function, and it is > {{row_number()}}. We also need a substitute rule to push the limit into > window. Of course, we also need to modify executor to support this > optimization (can be later). > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y ROW_NUMBER()}) > {code} > to > {code:java} > Filter (rn <= 3) > +- Window (window#0={Partition by x order by y limit 3 ROW_NUMBER()}) > {code} > Thoughts? Objections? -- This message was sent by Atlassian JIRA (v7.6.3#76005)