[
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234926#comment-14234926
]
Yi Pan (Data Infrastructure) commented on SAMZA-390:
----------------------------------------------------
My notes on
[Esper|http://esper.codehaus.org/esper-5.1.0/doc/reference/en-US/html/epl_clauses.html#epl-select-syntax]
# SELECT has istream|rstream|irstream operators to identify the type of output
stream
## istream as the default? We may want it to be explicit, since our output
may go to a relation (e.g. a remote DB). We could potentially use INSERT INTO
for that purpose
# the window operator is attached to the stream source, e.g.
streamA.win:time(30), and does not seem to use the application timestamp from
the streams. All timestamps seem to be the local system time
# complicated stream operators
## introduces pattern / filter expressions that are applied directly to a
stream (an explicit optimization on the stream). I don't think that we need
them in the initial draft.
## introduces a combined window view operator. I am not a fan of that, since
it makes the windowing operation more complicated than necessary. IMO, the
windowing operation should only do the stream-to-relation conversion and
nothing else.
# tightly coupled with the programming language (i.e. some Java flavor in the
syntax). I prefer a higher-level language that is agnostic to the
implementation.
# hierarchical aggregation keywords in Group By. Not sure how useful they are
for the 80% of use cases in SQL.
# defines output and order by: these allow specifying output intervals and
re-ordering the output stream. This may be useful when we process streams that
contain out-of-order tuples.
# defines complex multi-row and multi-column selections:
http://esper.codehaus.org/esper-5.1.0/doc/reference/en-US/html/epl_clauses.html#epl-subqueries-multicolumn
## however, the two sub-query examples are both some kind of JOIN between two
streams and could be rewritten with an assemble() aggregation function and
Group By to create nested properties in the final query result. That rewrite
seems more intuitive to me. [~criccomini] mentioned an AGG/JOIN/EXPLODE method,
which also seems more intuitive than the examples here.
# defines an abstraction that uses sql:myDB(“select…from…”) as the source of a
relation, which makes it easy to plug in external DBs as needed
# defines UDFs to access non-relational data as a relation: e.g. select * from
StreamA, udf.getData(“my-NRD”). For each UDF, Esper defines one function that
returns the metadata of the returned relation and another function that
actually evaluates the values of the rows, e.g. udf.getMeta(“my-NRD”) and
udf.getData(“my-NRD”)
# Create Schema is used to define the event data schema: I am thinking of a
semi-schema model that only specifies the fields that cannot be null. All
other fields would be optional.
# defines a method to split the output and define the input partitions
## e.g. the statement on event_type insert into <insert_into_def> select …
where … insert into <insert_into_def> is used to split the output stream
## Esper introduces the concept of a Context on a stream, and Partition By is
used to create a context. All queries then run within a context, e.g. create
context SegmentedByCustomer partition by custId from BankTxn.
## To me, the definition of context is too complex. It would be easier to split
the output using PARTITION BY keyA TO n_parts in the SELECT result. Tasks that
need to combine multiple partitions could use FROM <stream> (ALL|<n>)
PARTITIONS.
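
As a rough illustration of the PARTITION BY keyA TO n_parts idea above, here
is a minimal Python sketch of what such a clause might do to the SELECT
result. Everything in it is an assumption made for illustration: the function
names partition_for, partition_by and merge_partitions are hypothetical, CRC32
is just one possible stable hash, and merge_partitions only covers the ALL
case of FROM <stream> (ALL|<n>) PARTITIONS. It is not Esper or Samza code.

```python
import zlib
from collections import defaultdict


def partition_for(key, n_parts):
    """Map a key to a partition index in [0, n_parts).

    CRC32 is an assumption made for this sketch; it is stable across
    processes, unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(str(key).encode("utf-8")) % n_parts


def partition_by(rows, key_field, n_parts):
    """Hypothetical stand-in for: SELECT ... PARTITION BY key_field TO n_parts."""
    parts = defaultdict(list)
    for row in rows:
        # Rows sharing a key always land in the same partition, in arrival order.
        parts[partition_for(row[key_field], n_parts)].append(row)
    return dict(parts)


def merge_partitions(parts):
    """Hypothetical stand-in for: FROM <stream> ALL PARTITIONS."""
    merged = []
    for index in sorted(parts):
        merged.extend(parts[index])
    return merged


if __name__ == "__main__":
    # Sample rows reuse the custId key from the BankTxn context example.
    txns = [
        {"custId": "c1", "amount": 10},
        {"custId": "c2", "amount": 25},
        {"custId": "c1", "amount": 5},
    ]
    parts = partition_by(txns, "custId", 4)
    for index, rows in sorted(parts.items()):
        print(index, rows)
    print(len(merge_partitions(parts)))
```

The point of the sketch is that the split is a property of the query output
(a key and a partition count), so no separate context definition is needed to
express it.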
> High-Level Language for Samza
> -----------------------------
>
> Key: SAMZA-390
> URL: https://issues.apache.org/jira/browse/SAMZA-390
> Project: Samza
> Issue Type: New Feature
> Reporter: Raul Castro Fernandez
> Priority: Minor
> Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are
> defined in this language and transformed to a dataflow graph where the nodes
> are Samza jobs.