[ 
https://issues.apache.org/jira/browse/SAMZA-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275300#comment-14275300
 ] 

Milinda Lakmal Pathirage commented on SAMZA-483:
------------------------------------------------

I started to look into defining a common representation (object model) for a 
streaming algebra. Below are my thoughts and the questions that came to mind.

If we look at the usual flow from a query to an execution plan, the process 
involves at least the following steps (in the context of a DSL, steps 1 and 2 
may or may not be present):

1. Tokenization
2. Parsing (generates an abstract syntax tree)
3. Semantic analysis
4. Optimization
5. Query plan generation (code generation in the case of compilers)
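
As a rough illustration of how these five steps hand off to one another, here 
is a toy sketch in Python. The grammar (SELECT ... FROM ... WHERE ...), the 
catalog, and every function name are assumptions made up for this example; 
none of this is Samza code.

```python
import re


def tokenize(query):
    # Step 1: split into identifiers, integers, and single-character symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|[<>=*,]", query)


def parse(tokens):
    # Step 2: toy grammar: SELECT f1, f2 FROM stream [WHERE field op value].
    if not tokens or tokens[0].upper() != "SELECT":
        raise SyntaxError("query must start with SELECT")
    from_at = [t.upper() for t in tokens].index("FROM")
    fields = [t for t in tokens[1:from_at] if t != ","]
    ast = {"fields": fields, "stream": tokens[from_at + 1], "where": None}
    rest = tokens[from_at + 2:]
    if rest and rest[0].upper() == "WHERE":
        ast["where"] = tuple(rest[1:4])
    return ast


def analyze(ast, catalog):
    # Step 3: semantic analysis: every referenced name must resolve.
    if ast["stream"] not in catalog:
        raise ValueError("unknown stream: " + ast["stream"])
    schema = catalog[ast["stream"]]
    used = list(ast["fields"])
    if ast["where"]:
        used.append(ast["where"][0])
    unknown = [f for f in used if f not in schema]
    if unknown:
        raise ValueError("unknown fields: " + ", ".join(unknown))
    return ast


def optimize(ast):
    # Step 4: placeholder; a real optimizer would rewrite the tree here.
    return ast


def plan(ast):
    # Step 5: emit a linear operator list, innermost operator first.
    steps = [("scan", ast["stream"])]
    if ast["where"]:
        steps.append(("filter", ast["where"]))
    steps.append(("project", tuple(ast["fields"])))
    return steps


catalog = {"S1": {"a", "b", "c"}}  # hypothetical stream schema
tokens = tokenize("SELECT a, b FROM S1 WHERE a > 10")
result = plan(optimize(analyze(parse(tokens), catalog)))
```

A DSL front end would skip tokenize and parse and enter at analyze (or later) 
with a tree it built itself.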

If we take a compiler infrastructure like LLVM, it starts somewhere between 
steps 3 and 4 (AFAIK, some kinds of semantic analysis can happen at the 
intermediate representation (IR) layer). LLVM defines LLVM IR, and front ends 
such as Clang generate LLVM IR from C/C++ code. In addition to generating 
LLVM IR, Clang takes care of parsing and semantic analysis.

Say we map the LLVM scenario to our problem:

- Do we need something like LLVM IR (with semantic analysis handled by an 
upper layer)?
- Or do we need to include semantic analysis in this layer as well?

I prefer the LLVM IR-like model, where an upper layer handles semantic 
analysis. Even in this case there are several open questions:

- Is this model going to be an object model for streaming SQL?
- Or will a relational-algebra-like model be enough?

A relational-algebra-like model would represent an extended relational 
algebra expression ('extended' because there are streaming-specific 
modifications), which looks like the following (I made this expression format 
up for this example):

σ (expressions) π (field_list) ρ (rename_list) ((ω (window_spec) S1) ⋈
(ω (window_spec) S2))

σ - Selection
π - Projection
ρ - Renaming
ω - Window operator
⋈ - Natural join
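
To make this concrete, below is a minimal sketch of what an object model for 
such an expression could look like. It is in Python only for brevity. All 
class names, field names, and the window spec strings are hypothetical and are 
not taken from Samza or the SAMZA-482 operator layer. The sketch assumes the 
expression is read as nested application, with σ outermost.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Stream:            # a named input stream, e.g. S1
    name: str


@dataclass(frozen=True)
class Window:            # ω: window operator over one input
    spec: str
    input: "Node"


@dataclass(frozen=True)
class Join:              # ⋈: natural join of two inputs
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Rename:            # ρ: (old, new) field name pairs
    renames: Tuple[Tuple[str, str], ...]
    input: "Node"


@dataclass(frozen=True)
class Project:           # π: fields to keep
    fields: Tuple[str, ...]
    input: "Node"


@dataclass(frozen=True)
class Select:            # σ: predicate kept as an opaque string here
    predicate: str
    input: "Node"


Node = Union[Stream, Window, Join, Rename, Project, Select]


def render(node: Node) -> str:
    """Print a tree in the made-up expression format used above."""
    if isinstance(node, Stream):
        return node.name
    if isinstance(node, Window):
        return f"ω ({node.spec}) {render(node.input)}"
    if isinstance(node, Join):
        return f"(({render(node.left)}) ⋈ ({render(node.right)}))"
    if isinstance(node, Rename):
        pairs = ", ".join(f"{old} -> {new}" for old, new in node.renames)
        return f"ρ ({pairs}) {render(node.input)}"
    if isinstance(node, Project):
        return f"π ({', '.join(node.fields)}) {render(node.input)}"
    if isinstance(node, Select):
        return f"σ ({node.predicate}) {render(node.input)}"
    raise TypeError(f"unknown node type: {type(node).__name__}")


# Example tree: σ over π over ρ over a join of two windowed streams.
windowed_join = Join(
    Window("RANGE 1 MINUTE", Stream("S1")),
    Window("RANGE 1 MINUTE", Stream("S2")),
)
plan = Select(
    "price > 100",
    Project(("sym", "price"), Rename((("symbol", "sym"),), windowed_join)),
)
text = render(plan)
```

Keeping the nodes immutable means an optimizer can rewrite a plan by building 
new trees without touching the original one.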

There are pros and cons to both the SQL-like model and the 
relational-algebra-like model. For example, DSL developers need to generate 
the relational algebra model from their internal representations. Depending 
on the DSL and its internals, generating a SQL-like model may be easier than 
generating a relational algebra model. On the other hand, a relational 
algebra model may be easier to generate if the developer of the DSL (or any 
other high-level API) knows how to map their language/API constructs to 
relational algebra.
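
To illustrate the second case, here is a small sketch of a fluent API whose 
constructs map one-to-one onto algebra operators. The StreamQuery class, its 
method names, the stream names, and the tuple-based tree encoding are all 
invented for this example and do not correspond to any existing Samza API.

```python
class StreamQuery:
    """Hypothetical fluent DSL; each call wraps the current algebra tree."""

    def __init__(self, tree):
        self.tree = tree

    @classmethod
    def from_stream(cls, name, window_spec):
        # A windowed stream: ω (window_spec) name
        return cls(("window", window_spec, ("stream", name)))

    def join(self, other):
        # Natural join of two queries: left ⋈ right
        return StreamQuery(("join", self.tree, other.tree))

    def select(self, *fields):
        # The DSL's "select" is a projection (π) in the algebra.
        return StreamQuery(("project", fields, self.tree))

    def where(self, predicate):
        # The DSL's "where" is a selection (σ) in the algebra.
        return StreamQuery(("select", predicate, self.tree))


orders = StreamQuery.from_stream("Orders", "RANGE 5 MINUTES")
shipments = StreamQuery.from_stream("Shipments", "RANGE 5 MINUTES")
query = (
    orders.join(shipments)
    .select("orderId", "status")
    .where("status = 'LATE'")
)
```

Because every method corresponds to exactly one operator, the DSL never needs 
an intermediate SQL text form. A DSL whose constructs do not line up this 
cleanly might find a SQL-like target easier to emit.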

Please let me know what you think about this.

> A common representation of relational algebra for streaming SQL 
> ----------------------------------------------------------------
>
>                 Key: SAMZA-483
>                 URL: https://issues.apache.org/jira/browse/SAMZA-483
>             Project: Samza
>          Issue Type: Sub-task
>            Reporter: Yi Pan (Data Infrastructure)
>            Priority: Minor
>              Labels: project
>
> Per discussion with [~criccomini] and [~milinda], we agreed that it seems to 
> be a good idea to define a common representation of relational algebra on top 
> of the operators defined in the operator layer (see SAMZA-482), which can be 
> the common base that we can use to generate the description/configuration of 
> a Samza job.
> This common layer can also be used by DSL-like language parser as a result of 
> parsing a DSL program.
> Some additional requirements needed in addition to pure relational algebra:
> 1) the common representation should include window operators and stream 
> operators (i.e. IStream/DStream/RStream)
> 2) the common representation should include description on parallelism of the 
> jobs (i.e. how many partitions the resultant Samza job will use)
> Some references:
> http://web.cs.wpi.edu/~mukherab/i/DCAPE.pdf
> https://cs.uwaterloo.ca/~david/cs848/stream-cql.pdf
> http://davis.wpi.edu/dsrg/PROJECTS/CAPE/publications.htm
> http://davis.wpi.edu/dsrg/PROJECTS/CAPE/slides.htm



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
