Thanks, Zhijiang. 

Regarding optimizations such as cost-based estimation, we would still prefer to keep 
them in the DataSet layer, 
but your suggestion is certainly worth considering as well.


As far as I know, features for these batch scenarios already exist in DataSet, such 
as
the sort-merge join algorithm. So I think the unification should take features
like input selection at reading into account.
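
To make "input selection at reading" concrete, below is a minimal sketch of the
kind of hook I have in mind; the names (InputSelection, InputSelectable) are only
illustrative and not a settled API:

    // Sketch only: a two-input operator tells the runtime which input it wants
    // to consume next, e.g. a sort-merge join draining whichever (sorted) side
    // is behind on the join key.
    public final class InputSelectionSketch {

        enum InputSelection { FIRST, SECOND, ANY }

        interface InputSelectable {
            InputSelection nextSelection();
        }

        static final class SortMergeJoinLikeOperator implements InputSelectable {
            private long firstKey = Long.MIN_VALUE;
            private long secondKey = Long.MIN_VALUE;

            void onElementFromFirst(long sortKey)  { firstKey = sortKey; }
            void onElementFromSecond(long sortKey) { secondKey = sortKey; }

            @Override
            public InputSelection nextSelection() {
                // Advance whichever input is behind so the sort keys stay aligned.
                if (firstKey < secondKey) {
                    return InputSelection.FIRST;
                } else if (secondKey < firstKey) {
                    return InputSelection.SECOND;
                } else {
                    return InputSelection.ANY;
                }
            }
        }
    }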


Best,
Haibo


At 2018-12-03 16:38:13, "zhijiang" <wangzhijiang...@aliyun.com.INVALID> wrote:
>Hi Haibo,
>
>Thanks for bringing this discussion!
>
>I reviewed the Google doc and really like the idea of unifying stream and
>batch across all stacks. Currently only the network runtime stack is unified
>for both stream and batch jobs, while the compilation, operator, and runtime
>task stacks are all separate. The stream stack has been developed actively and
>has dominated in recent years, whereas the batch stack has been touched far
>less. If they are unified into one stack, batch jobs can also benefit from all
>of those improvements. I think this is a very big piece of work but worth
>doing. I have a couple of concerns:
>
>1. The current job graph generation for batch covers complex optimizations
>such as cost-based estimation and plan selection. Will this part be retained
>when it is integrated with stream graph generation?
>
>2. I saw some other improvements specific to batch scenarios in the doc, such
>as input selection while reading. I acknowledge their value for those
>scenarios, but they do not seem to block the unification itself, because
>current batch jobs can also work without them. So these further improvements
>could be split into separate topics once the unification of stream and batch
>is achieved first.
>
>Best,
>Zhijiang
>
>
>------------------------------------------------------------------
>From: 孙海波 <sunhaib...@163.com>
>Sent: Monday, December 3, 2018 10:52
>To: dev <dev@flink.apache.org>
>Subject: [DISCUSS] Unified Core API for Streaming and Batch
>
>Hi all,
>This post proposes a unified core API for Streaming and Batch. 
>Currently DataStream and DataSet use separate compilation processes, 
>execution tasks,
>and basic programming models in the runtime layer, which complicates the 
>system implementation. 
>We think that batch jobs can be processed in the same way as streaming jobs, 
>thus we can unify
>the execution stack of DataSet into that of DataStream.  After the unification 
>the DataSet API will
>also be built on top of StreamTransformation, and its basic programming model 
>will be changed
>from "UDF on Driver" to "UDF on StreamOperator". Although the DataSet 
>operators will need to
>implement the interface StreamOperator instead after the unification, user 
>jobs do not need to change
>since DataSet uses the same UDF interfaces as DataStream.
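>
>To make the programming-model change concrete, here is a minimal sketch (the
>class name BatchMapOperatorSketch is only illustrative, not part of the
>proposal) of a DataSet map UDF wrapped in a stream operator, assuming the
>existing OneInputStreamOperator interface:
>
>    import org.apache.flink.api.common.functions.MapFunction;
>    import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
>    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
>    import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;
>
>    // Sketch only: a DataSet-style map expressed as a stream operator, so one
>    // operator stack can serve both batch and streaming jobs.
>    public class BatchMapOperatorSketch<IN, OUT>
>            extends AbstractStreamOperator<OUT>
>            implements OneInputStreamOperator<IN, OUT> {
>
>        private final MapFunction<IN, OUT> udf;  // the unchanged DataSet UDF
>
>        public BatchMapOperatorSketch(MapFunction<IN, OUT> udf) {
>            this.udf = udf;
>        }
>
>        @Override
>        public void processElement(StreamRecord<IN> element) throws Exception {
>            // "UDF on StreamOperator": the operator invokes the user function
>            // per record and emits downstream, instead of a batch Driver
>            // pulling records from an iterator.
>            output.collect(new StreamRecord<>(udf.map(element.getValue())));
>        }
>    }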
>
>The unification has at least three benefits:
>1. The system will be greatly simplified with the same execution stack for 
>both streaming and batch jobs.
>2. It is no longer necessary to implement two sets of Driver(s) (operator 
>strategies) for batch, namely chained and non-chained.
>3. The unified programming model enables streaming and batch jobs to share the 
>same operator implementation.
>

>The following is the design draft. Any feedback is highly appreciated.
>https://docs.google.com/document/d/1G0NUIaaNJvT6CMrNCP6dRXGv88xNhDQqZFrQEuJ0rVU/edit?usp=sharing
>
>Best, 
>Haibo
