[
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228898#comment-14228898
]
Milinda Lakmal Pathirage commented on SAMZA-390:
------------------------------------------------
Another interesting paper I found is "Query Languages and Data Models for
Database Sequences and Data Streams" [1], which proposes a different way of
handling window queries over streams using user-defined aggregates (UDAs). The
authors first introduce the notions of nonblocking (NB) queries and
NB-completeness. They also show that relational algebra is not NB-complete (it
is well known that we can't support blocking operations such as ALL, EXCEPT,
and NOT IN over a stream without a window operator). Instead of using a window
operator like 'S [Rows 5]', they propose using a UDA like the following to do
window computations.
AGGREGATE tumble_avg(Next Int) : Real
{
    TABLE state(tsum Int, cnt Int);
    INITIALIZE : {
        INSERT INTO state VALUES (Next, 1)
    }
    ITERATE : {
        UPDATE state
            SET tsum=tsum+Next, cnt=cnt+1;
        INSERT INTO RETURN
            SELECT tsum/cnt FROM state
            WHERE cnt % 200 = 0;
        UPDATE state SET tsum=0, cnt=0
            WHERE cnt % 200 = 0
    }
    TERMINATE : { }
}
Emitting tuples downstream is done by 'INSERT INTO RETURN'. If you have
'INSERT INTO RETURN' in the TERMINATE block, your aggregate is blocking and
cannot be executed over a stream. There are some interesting samples, like
finding patterns over a stream, in Section 5 of the paper [1]. They even show
an implementation of a Turing machine using UDAs. They also use 'union' and
UDAs to implement stream joins instead of a blocking join operator. A sample
can be found in [2].
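To make the blocking/nonblocking distinction concrete, here is a rough Python sketch of the tumbling-average aggregate above. It is only an analogy, not the paper's implementation: the function names, the `window` parameter, and the use of generators are my own assumptions, and I use floating-point division where the UDA writes tsum/cnt. A `yield` inside the loop plays the role of 'INSERT INTO RETURN' in ITERATE; a `yield` after the loop plays the role of 'INSERT INTO RETURN' in TERMINATE.

```python
from itertools import count, islice
from typing import Iterable, Iterator


def tumble_avg(stream: Iterable[int], window: int = 200) -> Iterator[float]:
    """Nonblocking sketch: emits an average every `window` tuples."""
    tsum = 0
    cnt = 0
    first = True
    for nxt in stream:
        if first:
            # INITIALIZE: the first tuple seeds the state table.
            tsum, cnt = nxt, 1
            first = False
            continue
        # ITERATE: update state, emit when the window is full, then reset.
        tsum += nxt
        cnt += 1
        if cnt % window == 0:
            yield tsum / cnt  # corresponds to INSERT INTO RETURN in ITERATE
            tsum, cnt = 0, 0
    # TERMINATE: empty, so nothing depends on reaching the end of the stream.


def blocking_avg(stream: Iterable[int]) -> Iterator[float]:
    """Blocking sketch: the only emit happens after the input is exhausted."""
    tsum = 0
    cnt = 0
    for nxt in stream:
        tsum += nxt
        cnt += 1
    if cnt:
        yield tsum / cnt  # corresponds to INSERT INTO RETURN in TERMINATE


# count(1) is an unbounded stream 1, 2, 3, ...; tumble_avg still produces output.
print(list(islice(tumble_avg(count(1), window=4), 2)))  # [2.5, 6.5]
# blocking_avg only works on a finite input.
print(list(blocking_avg([1, 2, 3, 4])))  # [2.5]
```

Calling blocking_avg on the unbounded stream would never return, which is the sense in which such an aggregate cannot be executed over a stream.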
I was interested in this paper mainly because:
- It looks like we can even do pattern-matching queries over streams
using UDAs. I am not sure how complicated this would be using general SQL.
- It looks like we can use this as the intermediate model that other
languages, DSLs, and APIs are transformed into. I have yet to understand how
well this will work, but the concept of UDAs seems pretty interesting to me,
given that we can even model a Turing machine.
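As a rough illustration of the pattern-matching point, a UDA can be thought of as a small state machine that is advanced once per tuple and may emit a result at any step. The Python sketch below is hypothetical and is not taken from the paper: the pattern (a run of strictly increasing values), the function name, and the `length` parameter are all my own choices for illustration.

```python
from typing import Iterable, Iterator, Optional


def rising_runs(stream: Iterable[int], length: int = 3) -> Iterator[int]:
    """Emit a value whenever it completes a run of `length` or more
    strictly increasing values (hypothetical pattern, for illustration)."""
    prev: Optional[int] = None  # state: previous value seen
    run = 0                     # state: length of the current rising run
    for nxt in stream:
        if prev is not None and nxt > prev:
            run += 1            # the run continues
        else:
            run = 1             # the run restarts at this tuple
        prev = nxt
        if run >= length:
            yield nxt           # nonblocking emit, like INSERT INTO RETURN


print(list(rising_runs([5, 1, 2, 3, 4, 2, 3])))  # [3, 4]
```

The state kept between tuples is constant-size here; more involved patterns would need a larger state table, which is where I am unsure how this compares to plain SQL.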
I found several other references in this paper which explain or motivate some
of the concepts here. I'll let you know if I find anything interesting in those.
[1] http://www.cs.ucla.edu/~zaniolo/papers/vldb04cr.pdf
[2] http://wis.cs.ucla.edu/wis/stream-mill/examples/nexmark.html
> High-Level Language for Samza
> -----------------------------
>
> Key: SAMZA-390
> URL: https://issues.apache.org/jira/browse/SAMZA-390
> Project: Samza
> Issue Type: New Feature
> Reporter: Raul Castro Fernandez
> Priority: Minor
> Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are
> defined in this language and transformed to a dataflow graph where the nodes
> are Samza jobs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)