Thanks, zhijiang.
For optimizations such as cost-based estimation, we still want to keep them in the DataSet layer, but your suggestion is also worth considering.

As far as I know, DataSet already covers these batch scenarios, for example the sort-merge join algorithm. So I think the unification should take features such as input selection while reading into account.

Best,
Haibo

At 2018-12-03 16:38:13, "zhijiang" <wangzhijiang...@aliyun.com.INVALID> wrote:
>Hi haibo,
>
>Thanks for bringing this discussion!
>
>I reviewed the Google doc and really like the idea of unifying stream and
>batch in all stacks. Currently only the network runtime stack is unified for
>both stream and batch jobs, while the compilation, operator and runtime task
>stacks are all separate. The stream stack has been developed actively and has
>dominated in recent years, while the batch stack has been touched less. If
>they are unified into one stack, batch jobs can also benefit from all the
>improvements. I think it is a very big piece of work but worth doing. I have
>a few concerns:
>
>1. The current job graph generation for batch covers complicated
>optimizations such as cost-based estimation, planning, etc. Would this part
>be retained when integrating with stream graph generation?
>
>2. I saw some other special improvements for batch scenarios in the doc, such
>as input selection while reading. I acknowledge their value for special batch
>scenarios, but they do not seem to be blockers for the unification, because
>current batch jobs can also work without these improvements. So the further
>improvements can be split into individual topics after we first reach the
>unification of stream and batch.
>
>Best,
>Zhijiang
>
>
>------------------------------------------------------------------
>From: 孙海波 <sunhaib...@163.com>
>Sent: Monday, December 3, 2018, 10:52
>To: dev <dev@flink.apache.org>
>Subject: [DISCUSS] Unified Core API for Streaming and Batch
>
>Hi all,
>This post proposes a unified core API for Streaming and Batch.
>Currently DataStream and DataSet adopt separate compilation processes,
>execution tasks and basic programming models in the runtime layer, which
>complicates the system implementation.
>We think that batch jobs can be processed in the same way as streaming jobs,
>so we can unify the execution stack of DataSet into that of DataStream.
>After the unification, the DataSet API will also be built on top of
>StreamTransformation, and its basic programming model will change from
>"UDF on Driver" to "UDF on StreamOperator". Although the DataSet operators
>will need to implement the StreamOperator interface after the unification,
>user jobs do not need to change, since DataSet uses the same UDF interfaces
>as DataStream.
>
>The unification has at least three benefits:
>1. The system will be greatly simplified by using the same execution stack
>for both streaming and batch jobs.
>2. It will no longer be necessary to implement two sets of Drivers (operator
>strategies) for batch, namely chained and non-chained.
>3. The unified programming model enables streaming and batch jobs to share
>the same operator implementations.
>
>The following is the design draft. Any feedback is highly appreciated.
>https://docs.google.com/document/d/1G0NUIaaNJvT6CMrNCP6dRXGv88xNhDQqZFrQEuJ0rVU/edit?usp=sharing
>
>Best,
>Haibo