Hi all,
This post proposes a unified core API for streaming and batch.
Currently DataStream and DataSet use separate compilation processes,
execution tasks and basic programming models in the runtime layer, which
complicates the system implementation. We believe that batch jobs can be
processed in the same way as streaming jobs, so the execution stack of
DataSet can be unified into that of DataStream. After the unification the
DataSet API will also be built on top of StreamTransformation, and its
basic programming model will change from "UDF on Driver" to "UDF on
StreamOperator". Although the DataSet operators will then need to
implement the StreamOperator interface, user jobs do not need to change,
since DataSet uses the same UDF interfaces as DataStream.
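To make the programming-model change more concrete, below is a minimal
sketch (not code from the design doc) of how a DataSet-style map UDF could
run inside a StreamOperator after the unification. AbstractStreamOperator,
OneInputStreamOperator, StreamRecord and MapFunction are existing Flink
classes; the BatchMapOperator class itself is hypothetical.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

// Hypothetical batch map operator built on the DataStream runtime.
public class BatchMapOperator<IN, OUT>
        extends AbstractStreamOperator<OUT>
        implements OneInputStreamOperator<IN, OUT> {

    // The same UDF interface that DataSet programs already use.
    private final MapFunction<IN, OUT> udf;

    public BatchMapOperator(MapFunction<IN, OUT> udf) {
        this.udf = udf;
    }

    @Override
    public void processElement(StreamRecord<IN> element) throws Exception {
        // The user's MapFunction is invoked per record inside the operator,
        // i.e. "UDF on StreamOperator" instead of "UDF on Driver".
        output.collect(element.replace(udf.map(element.getValue())));
    }
}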

The unification has at least three benefits:
1. The system will be greatly simplified with the same execution stack for both 
streaming and batch jobs.
2. It is no longer necessary to implement two sets of Drivers (operator
strategies) for batch, namely chained and non-chained.
3. The unified programming model enables streaming and batch jobs to share
the same operator implementation (see the small illustration below).
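As an illustration of why user jobs are unaffected, the same MapFunction
already works unchanged against both APIs today; only the classes below
are existing Flink APIs, and the example itself is just for demonstration.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SharedUdfExample {
    public static void main(String[] args) throws Exception {
        // One UDF, usable by both DataSet and DataStream programs.
        MapFunction<Integer, Integer> doubler = new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer value) {
                return value * 2;
            }
        };

        // DataSet (batch) program using the UDF.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        batchEnv.fromElements(1, 2, 3).map(doubler).print();

        // DataStream (streaming) program using the same UDF.
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv.fromElements(1, 2, 3).map(doubler).print();
        streamEnv.execute("shared-udf");
    }
}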

The following is the design draft. Any feedback is highly appreciated. 
https://docs.google.com/document/d/1G0NUIaaNJvT6CMrNCP6dRXGv88xNhDQqZFrQEuJ0rVU/edit?usp=sharing

Best, 
Haibo
