Hi Weihua,

    Thanks for the exciting proposal! 

    I have quickly read through it, and I really appreciate the idea of 
providing an ML Pipeline API similar to that of the commonly used 
scikit-learn library, since it greatly reduces the learning cost for AI 
engineers moving to the Flink platform. 

    We are currently working on a related issue, namely enhancing the 
stream iteration of Flink to support both SGD and online learning; it also 
supports batch training as a special case. We have a rough design and will 
start a new discussion in the next few days. I think the enhanced stream 
iteration will help to implement Estimators directly in Flink, and it may help 
to simplify the online learning pipeline by eliminating the requirement to load 
the models from external file systems.

    I will read the design doc more carefully. Thanks again for sharing the 
design doc!

Yours sincerely
    Yun Gao 


------------------------------------------------------------------
From: Weihua Jiang <weihua.ji...@gmail.com>
Sent: Tuesday, November 20, 2018, 20:53
To: dev <dev@flink.apache.org>
Subject: [DISCUSS] Embracing Table API in Flink ML

ML Pipeline is an idea introduced by scikit-learn
<https://arxiv.org/abs/1309.0238>. Both Spark and Flink have borrowed this
idea and made their own implementations [Spark ML Pipeline
<https://spark.apache.org/docs/latest/ml-pipeline.html>, Flink ML Pipeline
<https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/libs/ml/pipelines.html>].



NOTE: although I am using the term "ML", ML Pipeline should apply to both ML
and DL pipelines.


ML Pipeline is quite helpful for model composition (i.e. using model(s) for
feature engineering). It also enables logic reuse across the training and
inference phases (via pipeline persistence and loading), which is essential
for AI engineering. ML Pipeline can also be a good base for a Flink-based AI
engineering platform if we can give it good tooling support (i.e.
human-readable metadata).


As the Table API will be the unified high-level API for both stream and
batch processing, I want to initiate the design discussion of a new
Table-based Flink ML Pipeline.


I drafted a design document [1] for this discussion. This design tries to
create a new ML Pipeline implementation so that concrete ML/DL algorithms
can fit into this new API to achieve interoperability.


Any feedback is highly appreciated.


Thanks

Weihua


[1]
https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing
