Hi everyone,

I am opening this thread to discuss the idea of moving the Flink ML pipeline
API and library code to a separate repository under Apache Flink (similar to what we
did for flink-statefun <https://github.com/apache/flink-statefun>).

The Flink ML pipeline API was proposed by FLIP-39: Flink ML pipeline and ML
libs
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs>.
It allows MLlib developers and users to develop ML pipelines on top of
Flink.

According to the discussion in this
<http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Deprecation-and-removal-of-the-legacy-SQL-planner-td48988.html>
thread, we plan to remove the legacy SQL planner in Flink 1.14. However,
there are ML libraries that currently use Flink's DataSet API together
with the Table API. Those libraries will either stop working or suffer a
considerable performance regression if they bump their dependency to Flink
1.14. As a result, if we keep the ML pipeline API in Flink, those ML
libraries cannot use the latest ML pipeline API/lib in Flink until Flink
compensates for the missing functionality with new DataStream APIs, which is
expected to happen about 1 year from now, e.g. in Flink 1.15.

To allow us to remove the legacy SQL planner in Flink 1.14 while still
allowing ML pipeline API/lib development in the coming year, we propose to
move the Flink ML pipeline API and library code to a separate repository. More
specifically, the new repo will have the following setup:
- The repo will be created at https://github.com/apache/flink-ml. This repo
will depend on the core Flink repo.
- The flink-ml documentation will be linked from the existing main Flink
docs similar to
https://ci.apache.org/projects/flink/flink-statefun-docs-master.
- The new repo will be under namespace org.apache.flink.
- We can revisit whether to move it back into the core Flink repo after
the above issue is resolved, if there is a good reason to make the change.

Here is the proposed plan if we agree to make this change:
- We will create the flink-ml repo and move the Flink ML pipeline-related code
to this repo before the Flink 1.13 code release (3/31/2021).
- Then we update flink-ml repo to depend on Flink 1.13 after Flink 1.13 is
released.
- Then we update core Flink with new DataStream API (e.g. DataStream
iteration) such that core Flink can support the same (or better) ML lib
performance as it does now with the SQL planner. This is supposed to happen
in about 1 year.
- Then we update flink-ml repo to depend on the latest Flink version once
Flink has the new DataStream API.

Besides the main motivation described above, this change also shares pros
and cons similar to those of creating a separate repo for flink-statefun
<https://github.com/apache/flink-statefun> (see this
<http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Stateful-Functions-in-which-form-to-contribute-same-or-different-repository-td34034.html>
and this
<http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/PROPOSAL-Contribute-Stateful-Functions-to-Apache-Flink-td33913.html>
for prior discussion).

Pros:
- A separate repo allows faster development for an early-stage project
like the Flink ML pipeline (both API and libs).
- The Flink repo is already very large, and it is good not to bloat its size
(and the number of tests) further.
- Fewer tests to run when we make code changes in each repo.

Cons:
- Code changes in core Flink might break tests or cause performance
regressions in flink-ml, since the two are in different repos. So more
effort is needed when we bump flink-ml's Flink dependency.

Overall it seems that the pros outweigh the cons. Looking forward to
hearing what you think!


Regards,
Dong
