[ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234299#comment-14234299
 ] 

Milinda Lakmal Pathirage commented on SAMZA-390:
------------------------------------------------

Thanks [~criccomini] for the great summary. 

I really like the TIMESTAMP BY clause in the Azure query language. It lets us 
control how the timestamp is extracted at query time. I was thinking of adding 
a timestamp column to the stream definition in Freshet, but TIMESTAMP BY is the 
better approach. We can do something like the following for the default 
timestamp:

- We can add a system timestamp every time a tuple is introduced to Samza, so 
the system timestamp is always present in a tuple. If TIMESTAMP BY is not 
specified, we can use this default timestamp. The system timestamp may also 
come in handy when handling out-of-order events, etc.
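
A minimal sketch of that fallback, in Python for brevity. The field name 
`__system_ts` and the helpers `ingest` and `event_time` are hypothetical names 
used only for illustration; nothing like them exists in Samza or Freshet today.

```python
import time
from typing import Any, Dict, Optional

# Hypothetical name for the field that carries the system timestamp.
SYSTEM_TS = "__system_ts"


def ingest(tuple_: Dict[str, Any], now_ms: Optional[int] = None) -> Dict[str, Any]:
    """Return a copy of the tuple stamped with a system timestamp (milliseconds)."""
    stamped = dict(tuple_)
    stamped[SYSTEM_TS] = int(time.time() * 1000) if now_ms is None else now_ms
    return stamped


def event_time(tuple_: Dict[str, Any], timestamp_by: Optional[str] = None) -> int:
    """Use the TIMESTAMP BY field when the query names one, else the system timestamp."""
    if timestamp_by is not None:
        return tuple_[timestamp_by]
    return tuple_[SYSTEM_TS]


order = ingest({"item": "a", "created": 5}, now_ms=100)
default_ts = event_time(order)              # no TIMESTAMP BY: system timestamp
explicit_ts = event_time(order, "created")  # TIMESTAMP BY created
```

Because the system timestamp is always present, a query without TIMESTAMP BY 
still has a well-defined time for every tuple.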

I am not sure whether partitioning on the JOIN column (Item [9] from 
[~criccomini]'s discussion on Azure Streams) will always work for JOIN scenarios. 
As I remember, one user described a scenario on the Samza user list where this 
will not work. But I think this is okay for the first iteration. 
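
To make the idea concrete, here is a rough sketch of partitioning on the join 
key, assuming a simple hash partitioner. `partition_for` is a hypothetical 
helper, not a Samza API; CRC32 is used only because it is deterministic across 
processes.

```python
import zlib


def partition_for(join_key: str, num_partitions: int) -> int:
    """Map a join key to a partition with a deterministic hash."""
    return zlib.crc32(join_key.encode("utf-8")) % num_partitions


# Tuples from both input streams that share a join key land in the same
# partition, so a single task sees both sides of the join for that key.
orders_partition = partition_for("item-42", 8)
shipments_partition = partition_for("item-42", 8)
```

This only co-locates tuples when both streams use the same key and the same 
partition count, and a stream can be partitioned on only one key at a time, so 
it does not by itself cover every JOIN scenario.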

Another thing is that the window language described in the CQL paper is very 
limited (for example, [Row 30] or [Range 30 seconds] always means a sliding 
window that drops the oldest elements, and there is no way to specify different 
sliding parameters or tumbling windows), so we need to extend it to suit our 
needs, as discussed earlier.
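
As a sketch of the kind of extension meant here, the following shows how a 
window size plus a separate slide parameter could cover both sliding and 
tumbling windows. The function names and the half-open [start, end) convention 
are assumptions for illustration, not CQL or Samza semantics.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end)


def tumbling_window(ts: int, size: int) -> Window:
    """The single non-overlapping window of length `size` that contains `ts`."""
    start = (ts // size) * size
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> List[Window]:
    """All windows of length `size`, starting at multiples of `slide`, containing `ts`."""
    windows = []
    start = (ts // slide) * slide
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


# A 30-second window that slides every 10 seconds: the event at t=35
# belongs to three overlapping windows.
overlapping = sliding_windows(35, size=30, slide=10)

# With slide == size the sliding window degenerates to a tumbling window.
same_as_tumbling = sliding_windows(35, size=30, slide=30)
```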

One important thing about CQL is that the concept of converting a stream to a 
relation and then operating over the relation allows us to use most of the 
available SQL constructs without conflicting semantics. For example, I assume 
blocking SQL constructs such as NOT IN and ALL can be used in the context of 
CQL because we are theoretically operating over a (time-varying) relation. 
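
A small sketch of why this works: once a window has turned each stream into a 
finite relation at a given instant, NOT IN is an ordinary relational operation 
on that snapshot. The helpers `range_window` and `not_in`, and the choice of 
the interval (now - range, now], are assumptions made for illustration.

```python
from typing import Any, List, Tuple

Event = Tuple[int, Any]  # (timestamp, value)


def range_window(stream: List[Event], now: int, range_: int) -> List[Any]:
    """Stream-to-relation: values with a timestamp t where now - range_ < t <= now."""
    return [value for (t, value) in stream if now - range_ < t <= now]


def not_in(left: List[Any], right: List[Any]) -> List[Any]:
    """Relation-to-relation: rows of `left` whose value does not appear in `right`."""
    blocked = set(right)
    return [value for value in left if value not in blocked]


orders = [(1, "a"), (2, "b"), (12, "c")]
cancels = [(3, "b")]

# The result is itself time-varying: it is re-evaluated as `now` advances.
at_5 = not_in(range_window(orders, 5, 10), range_window(cancels, 5, 10))
at_13 = not_in(range_window(orders, 13, 10), range_window(cancels, 13, 10))
```

At each instant the right-hand relation is finite and fully known, so the 
blocking nature of NOT IN is not a problem.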

I like the PARTITION BY concept and the extensions [~nickpan47] proposed. This 
will give us more control, and in the case of round-robin-like partitioning it 
will allow us to control how partitioning is done at query time.

I think we need two different constructs for data definitions, one for 
STREAM and the other for TABLE, because we are going to support both streams 
and tables. Another option is to extend/re-use CREATE TABLE (like in Azure) to 
support streams.

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
