[
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234299#comment-14234299
]
Milinda Lakmal Pathirage commented on SAMZA-390:
------------------------------------------------
Thanks [~criccomini] for the great summary.
I really like the TIMESTAMP BY clause in the Azure query language. It lets us
control how the timestamp is extracted at query time. I was thinking of adding
this to the stream definition in Freshet, but this approach is better than
adding a timestamp column to the stream definition. We can do something like
the following for the default timestamp:
- We can add a system timestamp every time a tuple is introduced to Samza. This
system timestamp will always be present in a tuple. If TIMESTAMP BY is not
specified, we can use this default timestamp. The system timestamp may also
come in handy when handling out-of-order events, etc.
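
To make the fallback concrete, here is a rough sketch of what I have in mind.
It is only an illustration: the field name `__system_ts` and the functions
`ingest` and `event_time` are hypothetical and are not part of Samza or
Freshet.

```python
import time
from typing import Any, Callable, Dict, Optional

# Hypothetical name for the system timestamp field (an assumption).
SYSTEM_TS_FIELD = "__system_ts"


def ingest(row: Dict[str, Any], now_ms: Optional[int] = None) -> Dict[str, Any]:
    """Attach a system timestamp when a tuple enters the system."""
    stamped = dict(row)  # copy so the caller's tuple is not modified
    stamped[SYSTEM_TS_FIELD] = int(time.time() * 1000) if now_ms is None else now_ms
    return stamped


def event_time(
    row: Dict[str, Any],
    timestamp_by: Optional[Callable[[Dict[str, Any]], int]] = None,
) -> int:
    """Use the TIMESTAMP BY extractor if the query gave one, else the default."""
    if timestamp_by is not None:
        return timestamp_by(row)
    return row[SYSTEM_TS_FIELD]


row = ingest({"item": "a", "created": 5}, now_ms=100)
print(event_time(row))                        # default system timestamp
print(event_time(row, lambda r: r["created"]))  # query-supplied TIMESTAMP BY
```

Because the system timestamp is always kept, a query that uses TIMESTAMP BY
can still compare the two values, for example to detect late arrivals.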
I am not sure whether partitioning based on the JOIN column (Item [9] from
[~criccomini]'s discussion of Azure Streams) will always work for JOIN
scenarios. As I remember, one user described a scenario on the Samza user list
where this will not work. But I think this is okay for the first iteration.
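
For reference, this is roughly what partitioning on the JOIN column means. It
is a minimal sketch under my own assumptions: `partition_for` and `route` are
made-up helpers, and CRC32 stands in for whatever hash the system would
really use.

```python
import zlib
from collections import defaultdict
from typing import Any, Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a join key to a partition with a deterministic hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(rows: List[Dict[str, Any]], key_field: str, num_partitions: int):
    """Group rows by the partition that their join key hashes to."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_for(str(row[key_field]), num_partitions)].append(row)
    return partitions


orders = [{"item": "i1", "qty": 2}, {"item": "i2", "qty": 1}]
prices = [{"item": "i1", "price": 10}, {"item": "i2", "price": 7}]

# Both streams use the same key and partition count, so rows with the same
# item end up in the same partition and can be joined locally.
order_parts = route(orders, "item", 4)
price_parts = route(prices, "item", 4)
```

This only helps when both inputs can be keyed on the same column with the same
partition count, which is why I am not sure it covers every JOIN.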
The other thing is that the window language described in the CQL paper is very
limited (for example, [Row 30] or [Range 30 seconds] always means a sliding
window that drops the oldest elements, with no way to specify different
sliding parameters or tumbling windows), so we need to extend it to suit our
needs, as discussed earlier.
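
To illustrate the difference, here is a small sketch of a CQL-style range
window next to a window with an explicit slide parameter. The function names
are hypothetical, and the second function is just one possible reading of the
extension, assuming non-negative integer timestamps in seconds.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, payload)


def sliding_range(events: List[Event], now: int, range_s: int) -> List[Event]:
    """CQL-style [Range N seconds]: keep events with now - N < ts <= now."""
    return [e for e in events if now - range_s < e[0] <= now]


def hopping_windows(events: List[Event], size_s: int, slide_s: int) -> Dict[int, List[str]]:
    """Windows [start, start + size_s) that advance by slide_s.

    slide_s == size_s gives tumbling windows; slide_s < size_s gives
    overlapping windows. Empty windows are omitted.
    """
    windows: Dict[int, List[str]] = {}
    if not events:
        return windows
    last = max(ts for ts, _ in events)
    start = 0
    while start <= last:
        members = [p for ts, p in events if start <= ts < start + size_s]
        if members:
            windows[start] = members
        start += slide_s
    return windows


events = [(1, "a"), (12, "b"), (31, "c"), (45, "d")]
print(sliding_range(events, now=45, range_s=30))
print(hopping_windows(events, size_s=30, slide_s=30))  # tumbling
print(hopping_windows(events, size_s=30, slide_s=15))  # overlapping
```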
One important thing about CQL is that the concept of converting a stream to a
relation and then operating over that relation allows us to use most of the
available SQL constructs without conflicting semantics. For example, I assume
blocking SQL constructs such as NOT IN and ALL can be used in the context of
CQL because we are theoretically operating over a (time-varying) relation.
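
A small sketch of why this works: once the window turns each stream into a
finite relation at a given instant, NOT IN is an ordinary set operation over
that snapshot. The names `relation_at` and `not_in` and the sample data are
only assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple

Stream = List[Tuple[int, Dict[str, Any]]]  # (timestamp in seconds, row)


def relation_at(stream: Stream, now: int, range_s: int) -> List[Dict[str, Any]]:
    """Stream-to-relation: the rows inside the window at time `now`."""
    return [row for ts, row in stream if now - range_s < ts <= now]


def not_in(left: List[Dict[str, Any]], right: List[Dict[str, Any]], key: str):
    """Relation-to-relation: rows of `left` whose key is absent from `right`."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] not in right_keys]


orders: Stream = [(1, {"id": 1}), (5, {"id": 2}), (9, {"id": 3})]
cancels: Stream = [(6, {"id": 2})]

# The result is itself time-varying: it is recomputed per instant.
for now in (5, 10):
    live = not_in(relation_at(orders, now, 10), relation_at(cancels, now, 10), "id")
    print(now, live)
```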
I like the PARTITION BY concept and the extensions [~nickpan47] proposed. This
will give us more control, and in cases such as round-robin partitioning it
will allow us to control how partitioning is done at query time.
I think we need two different constructs for data definitions: one for STREAM
and another for TABLE, because we are going to support both streams and
tables. Another option is to extend or re-use CREATE TABLE (as in Azure) to
support streams.
> High-Level Language for Samza
> -----------------------------
>
> Key: SAMZA-390
> URL: https://issues.apache.org/jira/browse/SAMZA-390
> Project: Samza
> Issue Type: New Feature
> Reporter: Raul Castro Fernandez
> Priority: Minor
> Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are
> defined in this language and transformed to a dataflow graph where the nodes
> are Samza jobs.