Re: [DISCUSS] Unified Core API for Streaming and Batch

zhijiang Mon, 03 Dec 2018 00:39:01 -0800

Hi haibo,

Thanks for bringing up this discussion!

I reviewed the Google doc and really like the idea of unifying stream and
batch across all stacks. Currently only the network runtime stack is shared by
stream and batch jobs; the compilation, operator and runtime task stacks
are all separate. The stream stack has been developed actively and has
dominated in recent years, while the batch stack has been touched far less. If
they are unified into one stack, batch jobs will also benefit from all those
improvements. I think it is a lot of work but worth doing. I have a few
concerns:

1. The current job graph generation for batch includes sophisticated
optimization, such as cost-based estimation, planning, etc. Would this part be
retained when it is integrated with stream graph generation?

2. I saw some other batch-specific improvements in the doc, such as input
selection while reading. I acknowledge their value for particular batch
scenarios, but they do not seem to block the unification itself, because
current batch jobs already work without them. So these further improvements
could be split into separate topics once we have achieved the unification of
stream and batch.

Best,
Zhijiang

------------------------------------------------------------------
From: 孙海波 <sunhaib...@163.com>
Sent: Monday, December 3, 2018 10:52
To: dev <dev@flink.apache.org>
Subject: [DISCUSS] Unified Core API for Streaming and Batch

Hi all,
This post proposes a unified core API for Streaming and Batch.
Currently DataStream and DataSet use separate compilation processes, execution
tasks and basic programming models in the runtime layer, which complicates the
system implementation.
We think that batch jobs can be processed in the same way as streaming jobs,
so we can unify the execution stack of DataSet into that of DataStream. After
the unification the DataSet API will also be built on top of
StreamTransformation, and its basic programming model will change from
"UDF on Driver" to "UDF on StreamOperator". Although the DataSet operators
will need to implement the StreamOperator interface instead, user jobs do not
need to change, since DataSet uses the same UDF interfaces as DataStream.
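
As a rough illustration of the move from "UDF on Driver" to "UDF on
StreamOperator", the toy sketch below models both styles in Python. It is not
Flink code and does not reflect the design in the doc: MapDriver, MapOperator,
process_element and run_task are hypothetical names chosen for this example.
The only point it tries to show is that the user function stays untouched
while the component wrapping it changes from one that pulls its whole input to
one that is handed records one at a time.

```python
from typing import Callable, Iterable, List


def to_upper(word: str) -> str:
    """The user function (UDF). It is identical in both models."""
    return word.upper()


class MapDriver:
    """Hypothetical 'UDF on Driver': the driver owns the read loop."""

    def __init__(self, udf: Callable[[str], str]) -> None:
        self.udf = udf

    def run(self, records: Iterable[str]) -> List[str]:
        # The driver pulls every record of its bounded input itself.
        return [self.udf(record) for record in records]


class MapOperator:
    """Hypothetical 'UDF on StreamOperator': the runtime pushes records in."""

    def __init__(self, udf: Callable[[str], str],
                 emit: Callable[[str], None]) -> None:
        self.udf = udf
        self.emit = emit

    def process_element(self, record: str) -> None:
        # One record in, one result forwarded downstream.
        self.emit(self.udf(record))


def run_task(make_operator: Callable[[Callable[[str], None]], MapOperator],
             records: Iterable[str]) -> List[str]:
    """Toy task loop that works the same for bounded and unbounded input."""
    collected: List[str] = []
    operator = make_operator(collected.append)
    for record in records:
        operator.process_element(record)
    return collected


driver_result = MapDriver(to_upper).run(["a", "b"])
operator_result = run_task(lambda emit: MapOperator(to_upper, emit),
                           iter(["a", "b"]))
```

Both variants wrap the same to_upper function, which mirrors the claim that
user jobs need no changes; only the wrapper around the UDF differs.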

The unification has at least three benefits:
1. The system will be greatly simplified with the same execution stack for both
streaming and batch jobs.
2. It is no longer necessary to implement two sets of Drivers (operator
strategies) for batch, namely chained and non-chained.
3. The unified programming model enables streaming and batch jobs to share the
same operator implementation.

The following is the design draft. Any feedback is highly appreciated.
https://docs.google.com/document/d/1G0NUIaaNJvT6CMrNCP6dRXGv88xNhDQqZFrQEuJ0rVU/edit?usp=sharing

Best,
Haibo
