[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Yi Pan (Data Infrastructure) (JIRA) Thu, 04 Dec 2014 18:08:32 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234926#comment-14234926
 ]


Yi Pan (Data Infrastructure) commented on SAMZA-390:
----------------------------------------------------

My notes on 
[Esper|http://esper.codehaus.org/esper-5.1.0/doc/reference/en-US/html/epl_clauses.html#epl-select-syntax]
# SELECT has istream|rstream|irstream operators to identify the type of output 
stream
## istream as default? We may want it to be explicit, since there is a chance 
that our output is to a relation (i.e. a remote DB). We could potentially use 
INSERT INTO for that purpose
# the window operator is attached to the stream source, e.g. 
streamA.win:time(30), and does not seem to use the application timestamp from 
the streams. All timestamps seem to be the local system time
# complicated stream operators
## introduces a pattern / filter on the stream source, which applies a filter 
to the stream (an explicit optimization on the stream). I don't think that we 
need it in the initial draft.
## introduces a combined window view operator. I am not a fan of that, since 
it makes the windowing operation more complicated than necessary. IMO, the 
windowing operation should only do the stream-to-relation conversion, not 
other functions.
# tightly coupled with the programming language (i.e. some Java flavor in the 
syntax). I prefer a higher-level language that is agnostic to the 
implementation.
# hierarchical aggregation keywords in Group By. Not sure how useful they are 
for the 80% use cases in SQL.
# defines output and order by: allows specifying output intervals and 
re-ordering the output stream. This may be useful when we process streams that 
contain out-of-order tuples.
# defines complex multi-row and multi-column selections: 
http://esper.codehaus.org/esper-5.1.0/doc/reference/en-US/html/epl_clauses.html#epl-subqueries-multicolumn
## however, the two examples of sub-queries are both some kind of JOIN between 
two streams and could be re-written with an assemble() aggregation function 
and Group By to create nested properties in the final query result. That seems 
more intuitive to me. [~criccomini] mentioned an AGG/JOIN/EXPLODE method, 
which also seems more intuitive than the examples here.
# defines an abstraction that uses sql:myDB(“select…from…”) as a source of 
relation, which makes it easy to plug in external DBs as needed
# defines UDFs to access non-relational data as a relation: e.g. select * from 
StreamA, udf.getData(“my-NRD”). For each UDF, Esper defines one function that 
returns the metadata of the returned relation and another function that 
actually evaluates the values of the rows, e.g. udf.getMeta(“my-NRD”) and 
udf.getData(“my-NRD”)
# Create Schema is used to define the event data schema: I am thinking of a 
semi-schema model that only specifies the fields that cannot be null. All 
other fields would be optional.
# defines a method to split the output and define the input partitions
## e.g. on event_type insert into <insert_into_def> select … where … insert 
into <insert_into_def> is used to split the output stream
## Esper introduces a concept of Context on a stream, and Partition is used to 
create a context. All queries then run within a context, e.g. create context 
SegmentedByCustomer partition by custId from BankTxn.
## To me, the definition of context is too complex. It would be easier to split 
the output using PARTITION BY keyA TO n_parts in the SELECT result. Tasks that 
need to combine multiple partitions could read them with FROM <stream> 
(ALL|<n>) PARTITIONS.
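
To make the last point concrete, here is a minimal Python sketch of what 
PARTITION BY keyA TO n_parts and FROM <stream> (ALL|<n>) PARTITIONS could mean 
operationally. It is not Esper or Samza code. The function names, the CRC32 
hash, and the reading of <n> as "the first n partitions" are all assumptions 
made for illustration only.

```python
import zlib
from collections import defaultdict


def partition_of(key, n_parts):
    """Map a key to a partition index in [0, n_parts).

    CRC32 is used (an assumption of this sketch) because it is stable across
    processes, unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(str(key).encode("utf-8")) % n_parts


def partition_by(tuples, key_field, n_parts):
    """Sketch of PARTITION BY <key_field> TO <n_parts> on a SELECT result."""
    parts = defaultdict(list)
    for t in tuples:
        parts[partition_of(t[key_field], n_parts)].append(t)
    return parts


def read_partitions(parts, n="ALL"):
    """Sketch of FROM <stream> (ALL|<n>) PARTITIONS.

    Assumption: an integer n means "partitions 0 .. n-1".
    """
    ids = sorted(parts) if n == "ALL" else range(n)
    return [t for pid in ids for t in parts.get(pid, [])]


# Hypothetical input, loosely modeled on the BankTxn / custId example.
txns = [{"custId": "c%d" % (i % 5), "amount": i} for i in range(20)]
by_customer = partition_by(txns, "custId", 4)
merged = read_partitions(by_customer, "ALL")
```

All tuples with the same custId land in the same partition, which gives the 
per-key grouping that an Esper context provides, without a separate context 
definition.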
   

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
