Hi Shaoxuan,

You are perfectly right. What I want to achieve is a combination of all three of your points. Let me rephrase here:

1. Define a Table based ML Pipeline interface to have the same functionality as the current DataSet based implementations.
2. Support new features like online learning and streaming inference.
3. Provide a base for Flink AI tooling (i.e. AI platform) and ML/DL SQL support.

This will definitely be a step-by-step effort and will need a lot of help from Table enhancements. I am currently working on #1.

Thanks
Weihua

Shaoxuan Wang <wshaox...@gmail.com> wrote on Tue, Nov 20, 2018 at 11:11 PM:

> Hi Weihua,
>
> Thanks for the proposal. I have quickly read through it. It looks great.
> A quick question. Do you consider changing the ML Lib (implementation
> of Estimator/Predictor/Transformer) also on top of the tableAPI? I
> will be very happy if this is also included in the scope. It is not
> easy and needs lots of new tableAPI functionalities, which is exactly
> one of the reasons that motivate us to "enhance the tableAPI"
> discussed in other threads.
>
> The entire scope of your proposal is so big that I would suggest we
> should complete it step by step. I think you have mainly proposed 3
> things:
> 1. Redesign the ML pipeline based on tableAPI
> 2. Take streaming ML pipeline into account
> 3. Enhance ML pipeline with some new features for a better user experience
> Maybe we should first replace the ml pipeline interface with tableAPI,
> then move into #2 and #3. In the meanwhile, we can also explore the
> possibility of changing the ML lib also on top of tableAPI. What do
> you think?
>
> BTW, we should not break the current ML pipeline interface (which is
> based on dataset) when we introduce the new ones. Let us leave it for
> a while before the new interface is completed and well adopted. Then
> we can deprecate the old ones.
>
> I will take a more thorough look at your proposal and leave comments
> directly on the doc.
>
> Regards,
> Shaoxuan
>
>
> On 11/20/18, Weihua Jiang <weihua.ji...@gmail.com> wrote:
> > ML Pipeline is the idea brought by Scikit-learn
> > <https://arxiv.org/abs/1309.0238>. Both Spark and Flink have borrowed this
> > idea and made their own implementations [Spark ML Pipeline
> > <https://spark.apache.org/docs/latest/ml-pipeline.html>, Flink ML Pipeline
> > <https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/libs/ml/pipelines.html>].
> >
> > NOTE: though I am using the term "ML", ML Pipeline shall apply to both ML
> > and DL pipelines.
> >
> > ML Pipeline is quite helpful for model composition (i.e. using model(s) for
> > feature engineering). And it enables logic reuse in train and inference
> > phases (via pipeline persistence and load), which is essential for AI
> > engineering. ML Pipeline can also be a good base for Flink based AI
> > engineering platform if we can make ML Pipeline have good tooling support
> > (i.e. human-readable metadata).
> >
> > As the Table API will be the unified high level API for both stream and
> > batch processing, I want to initiate the design discussion of new Table
> > based Flink ML Pipeline.
> >
> > I drafted a design document [1] for this discussion. This design tries to
> > create a new ML Pipeline implementation so that concrete ML/DL algorithms
> > can fit to this new API to achieve interoperability.
> >
> > Any feedback is highly appreciated.
> >
> > Thanks
> >
> > Weihua
> >
> > [1]
> > https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing
> >
> > --
> > -----------------------------------------------------------------------------------
> > *Rome was not built in one day*
> > -----------------------------------------------------------------------------------