[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Yi Pan (Data Infrastructure) (JIRA) Fri, 05 Dec 2014 16:30:53 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236370#comment-14236370
 ]


Yi Pan (Data Infrastructure) commented on SAMZA-390:
----------------------------------------------------

My notes on 
[StreamSQL|http://docs.streambase.com/sb74/index.jsp?topic=/com.streambase.sb.ide.help/data/html/streamsql/index.html]:

# Defined a rich DDL to 
## define schemas and allow hierarchical inheritance of schemas
## define input/output stream and can associate a schema to the input stream 
(why not output stream?). I don’t quite see how the WITH PARAMETERS is used. 
Also, it seems a bit redundant with another CREATE STREAM as a general 
definition of input and output stream. Partitions and parallel sub-streams?
## it has a very interesting concept of “error streams” that is sort of an 
escape channel for all unhandled exceptions in the system. It could be very 
useful for debugging in a distributed system, i.e. if we have a Kafka topic as 
a distributed system error log, it can be used for this purpose.
## Note that I am not a big fan of a strong-typed schema here, since field 
names may be added or deleted over time. In a distributed system, updating the 
software throughout a distributed system needs time and the data stayed in the 
states in various different component also takes time to be updated (i.e. 
messages with old schema may co-exist with messages with new schema in Kafka 
topic). A DDL to specify a data model should still be useful. But the schema 
itself should only contains the minimum *absolutely* required fields and leave 
all other fields as optional (may or may not be there)
## DDL in StreamSQL also defines window specification, which is a convenient 
tool when the same window specification is used again and again. However, it 
may not be necessary as first-class citizen in our implementation, since the 
window specification should be simple and short in our first version. What I 
like about is that the window specification seems very comprehensive:
  SIZE size ADVANCE increment
  {TIME | TUPLES | ON field_identifier_w
   PREDICATE OPEN ON open_expr CLOSE ON close_expr EMIT ON emit_expr}
  [PARTIAL {TRUE | FALSE}]
  [PARTITION BY field_identifier_p[,...]]
  [VALID ALWAYS]
  [OFFSET offset]
  [TIMEOUT timeout]
## I don’t see a big use case for the materialized window definition, and 
StreamBase also plan to deprecate it in the future.
# StreamBase also introduce lock concept to control the flow of streams: when 
locked, buffer the tuples; release the buffered tuples on unlock. I don’t see a 
big use case on that, except for dealing the case of arrival of tuples with 
out-of-order timestamps.
# Merge seems to be a commonly used operator to combine two streams w/ the same 
schema together. We can use the same to mark up the merge of two or more 
partitions of the same stream, s.t. FROM <stream> [MERGE [ALL | EVERY 
<m_partitions>]]
# StreamSQL has defined a HEARTBEAT mechanism to add timer tuples on the same 
stream with the data tuples. This would be useful to perform synchronization in 
a distributed system like SAMZA. I would think that SAMZA as a system should 
provide a HEARTBEAT token by default, and this SQL extension can be optional to 
provide some control on specific query output streams to the user.
# APPLY has a PARALLEL <num_instances> BY <int_field> to specify parallelism of 
modules running to work on the same input stream. This is necessary if the 
default model is sequential execution of tasks on a stream. My take on this is: 
SAMZA should default to a distributed parallel model s.t. by default, there 
should be as many tasks running on all partitions in any input stream. Hence, 
no need to introduce this statement. We do need some way of specifying the 
number of tasks and distribution of partitions in all input streams to the 
tasks. i.e. if input streams to a SQL query has N1, N2 partitions, how many 
tasks do we need and how do we distribute and match the partitions to the 
tasks? Besides, if joining two streams, what if the join key is not the key 
used for partition? How do we get the result through merge?

Overall StreamSQL seems to be a candidate that is very close to what we want, 
in terms of SQL syntax (maybe more than needed). I like the following points:
# It has a DDL to define table(i.e. relation) and stream separately
# The grammar seems to be very matured and follow SQL standard
# It has a rich window specification
# It has a heart beat build in the SQL syntax. It might be useful if we allow 
user to inject timer tokens.

The biggest gap IMO is:
# Lack of a convenient way of specify the parallelism in splitting the streams. 
The APPLY clause seems too convoluted. We need a simple and clear way to 
specify the parallel tasks that can take M streams each with a (potentially) 
different number of partitions following the FROM clause.

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Reply via email to