[ https://issues.apache.org/jira/browse/CALCITE-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289718#comment-15289718 ]
Julian Hyde edited comment on CALCITE-1237 at 5/18/16 8:17 PM: --------------------------------------------------------------- bq. Q3: I think supporting session in the GROUP BY clause is a good starting point. I agree, but I'll give some thought to whether it would work in {{OVER}}. For example, does it make sense to simultaneously compute sessions partitioning by (sourceIp) and (sourceIp, port)? bq. Can we add functions similar to TUMBLE_START and TUMBLE_END? It would be consistent with the other window functions and a shortcut compared to accessing the corresponding values with first_value and last_value (given that we allow order-dependent aggregates). Yes. bq. You said "Unlike the tumble function, each row belongs to precisely one window.". Tumbling windows are non-overlapping, so tumble should be hop, right? My mistake. I'll fix the original text. bq. is it possible to move the additional grouping columns out of the session function to the GROUP BY clause? This would be more similar to the definition of TUMBLE and HOP in http://calcite.apache.org/docs/stream.html? I thought about that. I added the extra arguments to make its semantics to be as close as possible to a mathematical function. Some examples: * {{FLOOR}} is a true function; its value depends only on its arguments. * {{TUMBLE}} is a true function, although the value is a tuple consisting of the lower and upper bound of the window. * {{HOP}} is a little less pure. In fact, it is a mathematical function that returns a set. The set depends only on the arguments. The GROUP BY operator produces one sub-total row for each value in the set. If we adopt your semantics, the {{SESSION}} "function" depends not only on its arguments, but the set of rows that have the same {{GROUP BY}} key values. When you describe it this way, you see that it behaves exactly like a SQL aggregate function. The only strange thing is that we allow it in the {{GROUP BY}} clause. By the way, under your semantics, the example looks like this: {code} select stream session(rowtime, interval '5' second), productId, count(*) as c from Orders group by session(rowtime, interval '5' second), productId {code} If we are to adopt your semantics, we need to think what are the implications of allowing an aggregate function in the GROUP BY clause. I think we could make it work. I can think other aggregate functions, such as [NTILE|http://stackoverflow.com/questions/14355324/want-to-learn-more-on-ntile], that you might want to include in the GROUP BY clause. was (Author: julianhyde): bq. Q3: I think supporting session in the GROUP BY clause is a good starting point. I agree, but I'll give some thought to whether it would work in {{OVER}}. For example, does it make sense to simultaneously compute sessions partitioning by (sourceIp) and (sourceIp, port)? bq. Can we add functions similar to TUMBLE_START and TUMBLE_END? It would be consistent with the other window functions and a shortcut compared to accessing the corresponding values with first_value and last_value (given that we allow order-dependent aggregates). Yes. bq. You said "Unlike the tumble function, each row belongs to precisely one window.". Tumbling windows are non-overlapping, so tumble should be hop, right? My mistake. I'll fix the original text. bq. is it possible to move the additional grouping columns out of the session function to the GROUP BY clause? This would be more similar to the definition of TUMBLE and HOP in http://calcite.apache.org/docs/stream.html? I thought about that. I added the extra arguments to make its semantics to be as close as possible to a mathematical function. Some examples: * FLOOR is a true function; its value depends only on its arguments. * TUMBLE is a true function, although the value is a tuple consisting of the lower and upper bound of the window. * HOP is a little less pure, because it produces a set of values, but the set depends only on the arguments. If we adopt your semantics, the {{SESSION}} "function" depends not only on its arguments, but the set of rows that have the same {{GROUP BY}} key values. When you describe it this way, you see that it behaves exactly like a SQL aggregate function. The only strange thing is that we allow it in the {{GROUP BY}} clause. By the way, under your semantics, the example looks like this: {code} select stream session(rowtime, interval '5' second), productId, count(*) as c from Orders group by session(rowtime, interval '5' second), productId {code} If we are to adopt your semantics, we need to think what are the implications of allowing an aggregate function in the GROUP BY clause. I think we could make it work. I can think other aggregate functions, such as [NTILE|http://stackoverflow.com/questions/14355324/want-to-learn-more-on-ntile], that you might want to include in the GROUP BY clause. > Session windows for streaming SQL > --------------------------------- > > Key: CALCITE-1237 > URL: https://issues.apache.org/jira/browse/CALCITE-1237 > Project: Calcite > Issue Type: Bug > Components: stream > Reporter: Julian Hyde > Assignee: Julian Hyde > > A session window is a collection of rows whose key values, when sorted, have > a gap of at most N. > Q1. Should "at most" be "less than"? > The key type can be any type that has a minus operator, that is, numeric and > date-time. > I propose the following syntax: {{session(key [, ...]*, interval)}}. For > example: > {code} > select stream session(rowtime, productId, interval '5' second), > productId, count(*) as c > from Orders > group by session(rowtime, productId, interval '5' second), > productId > {code} > to find bursts of orders for the same product where consecutive orders are no > more than 5 seconds apart. > The first key column {{rowtime}} defines the session and must be of > numeric/date-time type, and must have monotonicity or similar in order for > the query to make progress; the other key columns (in this case > {{productId}}) can be of any type; the last column is the interval, and must > be constant. > The {{session}} function returns the key value at the start of the window. > Unlike the {{hop}} function, each row belongs to precisely one window. But > {{session}} is not a true function, because its value depends on the records > flowing in the stream. > Q2. If {{session}} is used, should we allow order-dependent aggregate > functions such as {{first_value}}? > Q3. Should we allow {{session}} as a windowed aggregate function? -- This message was sent by Atlassian JIRA (v6.3.4#6332)