[jira] [Comment Edited] (CALCITE-1237) Session windows for streaming SQL

Julian Hyde (JIRA) Wed, 18 May 2016 13:18:32 -0700

    [ 
https://issues.apache.org/jira/browse/CALCITE-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289718#comment-15289718
 ]


Julian Hyde edited comment on CALCITE-1237 at 5/18/16 8:17 PM:
---------------------------------------------------------------

bq. Q3: I think supporting session in the GROUP BY clause is a good starting 
point.

I agree, but I'll give some thought to whether it would work in {{OVER}}. For 
example, does it make sense to simultaneously compute sessions partitioning by 
(sourceIp) and (sourceIp, port)?

bq. Can we add functions similar to TUMBLE_START and TUMBLE_END? It would be 
consistent with the other window functions and a shortcut compared to accessing 
the corresponding values with first_value and last_value (given that we allow 
order-dependent aggregates).

Yes.

bq. You said "Unlike the tumble function, each row belongs to precisely one 
window.". Tumbling windows are non-overlapping, so tumble should be hop, right?

My mistake. I'll fix the original text.

bq. is it possible to move the additional grouping columns out of the session 
function to the GROUP BY clause? This would be more similar to the definition 
of TUMBLE and HOP in http://calcite.apache.org/docs/stream.html?

I thought about that. I added the extra arguments to make its semantics to be 
as close as possible to a mathematical function. Some examples:
* {{FLOOR}} is a true function; its value depends only on its arguments.
* {{TUMBLE}} is a true function, although the value is a tuple consisting of 
the lower and upper bound of the window.
* {{HOP}} is a little less pure. In fact, it is a mathematical function that 
returns a set. The set depends only on the arguments. The GROUP BY operator 
produces one sub-total row for each value in the set.

If we adopt your semantics, the {{SESSION}} "function" depends not only on its 
arguments, but the set of rows that have the same {{GROUP BY}} key values. When 
you describe it this way, you see that it behaves exactly like a SQL aggregate 
function. The only strange thing is that we allow it in the {{GROUP BY}} clause.

By the way, under your semantics, the example looks like this:

{code}
select stream session(rowtime, interval '5' second), productId, count(*) as c
from Orders
group by session(rowtime, interval '5' second), productId
{code}

If we are to adopt your semantics, we need to think what are the implications 
of allowing an aggregate function in the GROUP BY clause. I think we could make 
it work. I can think other aggregate functions, such as 
[NTILE|http://stackoverflow.com/questions/14355324/want-to-learn-more-on-ntile],
 that you might want to include in the GROUP BY clause.


was (Author: julianhyde):
bq. Q3: I think supporting session in the GROUP BY clause is a good starting 
point.

I agree, but I'll give some thought to whether it would work in {{OVER}}. For 
example, does it make sense to simultaneously compute sessions partitioning by 
(sourceIp) and (sourceIp, port)?

bq. Can we add functions similar to TUMBLE_START and TUMBLE_END? It would be 
consistent with the other window functions and a shortcut compared to accessing 
the corresponding values with first_value and last_value (given that we allow 
order-dependent aggregates).

Yes.

bq. You said "Unlike the tumble function, each row belongs to precisely one 
window.". Tumbling windows are non-overlapping, so tumble should be hop, right?

My mistake. I'll fix the original text.

bq. is it possible to move the additional grouping columns out of the session 
function to the GROUP BY clause? This would be more similar to the definition 
of TUMBLE and HOP in http://calcite.apache.org/docs/stream.html?

I thought about that. I added the extra arguments to make its semantics to be 
as close as possible to a mathematical function. Some examples:
* FLOOR is a true function; its value depends only on its arguments.
* TUMBLE is a true function, although the value is a tuple consisting of the 
lower and upper bound of the window.
* HOP is a little less pure, because it produces a set of values, but the set 
depends only on the arguments.

If we adopt your semantics, the {{SESSION}} "function" depends not only on its 
arguments, but the set of rows that have the same {{GROUP BY}} key values. When 
you describe it this way, you see that it behaves exactly like a SQL aggregate 
function. The only strange thing is that we allow it in the {{GROUP BY}} clause.

By the way, under your semantics, the example looks like this:

{code}
select stream session(rowtime, interval '5' second), productId, count(*) as c
from Orders
group by session(rowtime, interval '5' second), productId
{code}

If we are to adopt your semantics, we need to think what are the implications 
of allowing an aggregate function in the GROUP BY clause. I think we could make 
it work. I can think other aggregate functions, such as 
[NTILE|http://stackoverflow.com/questions/14355324/want-to-learn-more-on-ntile],
 that you might want to include in the GROUP BY clause.

> Session windows for streaming SQL
> ---------------------------------
>
>                 Key: CALCITE-1237
>                 URL: https://issues.apache.org/jira/browse/CALCITE-1237
>             Project: Calcite
>          Issue Type: Bug
>          Components: stream
>            Reporter: Julian Hyde
>            Assignee: Julian Hyde
>
> A session window is a collection of rows whose key values, when sorted, have 
> a gap of at most N.
> Q1. Should "at most" be "less than"?
> The key type can be any type that has a minus operator, that is, numeric and 
> date-time.
> I propose the following syntax: {{session(key [, ...]*, interval)}}. For 
> example:
> {code}
> select stream session(rowtime, productId, interval '5' second),
>   productId, count(*) as c
> from Orders
> group by session(rowtime, productId, interval '5' second),
>   productId
> {code}
> to find bursts of orders for the same product where consecutive orders are no 
> more than 5 seconds apart.
> The first key column {{rowtime}} defines the session and must be of 
> numeric/date-time type, and must have monotonicity or similar in order for 
> the query to make progress; the other key columns (in this case 
> {{productId}}) can be of any type; the last column is the interval, and must 
> be constant.
> The {{session}} function returns the key value at the start of the window. 
> Unlike the {{hop}} function, each row belongs to precisely one window. But 
> {{session}} is not a true function, because its value depends on the records 
> flowing in the stream.
> Q2. If {{session}} is used, should we allow order-dependent aggregate 
> functions such as {{first_value}}?
> Q3. Should we allow {{session}} as a windowed aggregate function?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (CALCITE-1237) Session windows for streaming SQL

Reply via email to