Shaoxuan Wang created FLINK-12470:
-------------------------------------
Summary: FLIP39: Flink ML pipeline and ML libs
Key: FLINK-12470
URL: https://issues.apache.org/jira/browse/FLINK-12470
Project: Flink
Issue Type: New Feature
Components: Library / Machine Learning
Affects Versions: 1.9.0
Reporter: Shaoxuan Wang
Assignee: Shaoxuan Wang
Fix For: 1.9.0
This is the umbrella Jira for FLIP39, which intends to enhance the
scalability and the ease of use of Flink ML.
ML Discussion thread:
[http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html]
Google Doc (to be converted to an official Confluence page very soon):
[https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo/edit]
In machine learning, there are mainly two types of people. The first type is
MLlib developers. They need a set of standard, well-abstracted core ML APIs to
implement algorithms. Every ML algorithm is a concrete implementation on top
of these APIs. The second type is MLlib users, who use an existing, packaged
MLlib to train or serve a model. It is very common for an entire training or
inference job to be composed of a sequence of transformations or algorithms.
It is therefore essential to provide a workflow/pipeline API for MLlib users
so that they can easily combine multiple algorithms to describe the ML
workflow/pipeline.
Flink currently has a set of ML core interfaces, but they are built on top of
the DataSet API. This does not align well with the latest Flink
[roadmap|https://flink.apache.org/roadmap.html] (the Table API will become the
first-class citizen and primary API for analytics use cases, while the DataSet
API will be gradually deprecated). Moreover, Flink at present has no interface
that allows MLlib users to describe an ML workflow/pipeline, nor does it
provide any way to persist a pipeline or model and reuse it in the future. To
address these issues, in this FLIP we propose to:
* Provide a new set of ML core interfaces (on top of the Flink Table API)
* Provide an ML pipeline interface (on top of the Flink Table API)
* Provide interfaces for parameter management and pipeline persistence
* Ensure that all of the above interfaces can accommodate any new ML
algorithm. We will gradually add various standard ML algorithms on top of
these newly proposed interfaces to verify their feasibility and scalability.
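The pipeline idea in the list above can be illustrated with a minimal sketch. This is not the API proposed in the FLIP: every name here (Stage, Transformer, Estimator, Pipeline, MeanCenterer, Scaler, appendStage) is hypothetical, and a plain List<Double> stands in for a Flink Table so that the example runs without any Flink dependency. It only shows how a pipeline might chain stages, fitting each trainable stage on the output of the stages before it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch only: List<Double> stands in for a Flink Table.
public class PipelineSketch {

    /** Marker for anything that can be placed in a pipeline. */
    interface Stage {}

    /** A stage that maps an input dataset to an output dataset. */
    interface Transformer extends Stage {
        List<Double> transform(List<Double> input);
    }

    /** A stage that is trained on data and yields a fitted Transformer (the model). */
    interface Estimator extends Stage {
        Transformer fit(List<Double> training);
    }

    /** Example estimator: learns the training mean; its model subtracts that mean. */
    static class MeanCenterer implements Estimator {
        @Override
        public Transformer fit(List<Double> training) {
            double sum = 0.0;
            for (double v : training) {
                sum += v;
            }
            final double mean = training.isEmpty() ? 0.0 : sum / training.size();
            return input -> {
                List<Double> out = new ArrayList<>();
                for (double v : input) {
                    out.add(v - mean);
                }
                return out;
            };
        }
    }

    /** Example stateless transformer: multiplies every value by a fixed factor. */
    static class Scaler implements Transformer {
        private final double factor;

        Scaler(double factor) {
            this.factor = factor;
        }

        @Override
        public List<Double> transform(List<Double> input) {
            List<Double> out = new ArrayList<>();
            for (double v : input) {
                out.add(v * factor);
            }
            return out;
        }
    }

    /** A pipeline is itself an estimator: fitting it fits each stage in order. */
    static class Pipeline implements Estimator {
        private final List<Stage> stages = new ArrayList<>();

        Pipeline appendStage(Stage stage) {
            stages.add(stage);
            return this;
        }

        @Override
        public Transformer fit(List<Double> training) {
            final List<Transformer> fitted = new ArrayList<>();
            List<Double> current = training;
            for (Stage stage : stages) {
                // Estimators are trained on the data produced so far;
                // transformers are used as they are.
                Transformer t = (stage instanceof Estimator)
                        ? ((Estimator) stage).fit(current)
                        : (Transformer) stage;
                fitted.add(t);
                current = t.transform(current);
            }
            // The fitted pipeline applies every fitted stage in sequence.
            return input -> {
                List<Double> data = input;
                for (Transformer t : fitted) {
                    data = t.transform(data);
                }
                return data;
            };
        }
    }

    /** Trains on {1, 2, 3} (mean 2), then applies the model to {2, 4}. */
    static List<Double> demo() {
        Pipeline pipeline = new Pipeline()
                .appendStage(new MeanCenterer())
                .appendStage(new Scaler(10.0));
        Transformer model = pipeline.fit(Arrays.asList(1.0, 2.0, 3.0));
        return model.transform(Arrays.asList(2.0, 4.0));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [0.0, 20.0]
    }
}
```

In this sketch the fitted pipeline is an ordinary Transformer, so the same object could be reused for inference. Parameter management and persistence are left out; they would need their own interfaces, as the list above proposes.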