Hi,

Why is it necessary to acquire a TableEnvironment from a Table?

I think you even said yourself what we should do: "I believe it's better to 
make the api clean and hide the detail of implementation as much as 
possible." In my opinion this means we can only depend on the generic Table 
API module and not let any planner/runner specifics or DataSet/DataStream 
API leak out. This would be setting us up for future problems once we want 
to deprecate/remove/rework those APIs.
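To make the trade-off concrete, here is a hypothetical sketch (the names Transformer and Estimator and the stub Table/TableEnvironment types are illustrative only, not the actual FLIP-39 API) of pipeline interfaces that depend only on the generic Table API by taking the TableEnvironment as an explicit argument rather than deriving it from the Table:

```java
// Hypothetical sketch only -- these names are illustrative, not the actual
// Flink ML API. Minimal stand-ins for the Table API types keep it self-contained.
interface Table {}
interface TableEnvironment {}

// Interfaces shaped like this depend only on the generic Table API: the
// TableEnvironment is passed in explicitly rather than being acquired from
// the Table, so no planner, DataSet, or DataStream types can leak out.
interface Transformer {
    Table transform(TableEnvironment tEnv, Table input);
}

interface Estimator<M extends Transformer> {
    M fit(TableEnvironment tEnv, Table training);
}

public class PipelineSketch {
    public static void main(String[] args) {
        // A trivial transformer that passes its input table through unchanged.
        Transformer identity = (tEnv, input) -> input;
        Table table = new Table() {};
        System.out.println(identity.transform(null, table) == table); // prints "true"
    }
}
```

The explicit argument is exactly the "extra tableEnv" mentioned below; the cost is a slightly heavier signature, the benefit is that flink-ml stays independent of any planner module.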

Best,
Aljoscha

> On 17. May 2019, at 09:06, Gen Luo <luogen...@gmail.com> wrote:
> 
> It's better not to depend on flink-table-planner indeed. It's currently
> needed for three things: registering UDAGGs, determining whether the
> tableEnv is batch or streaming, and converting a table to a DataSet to
> collect data. Most of these requirements can be fulfilled by
> flink-table-api-java-bridge and flink-table-api-scala-bridge.
> 
> But there is a gap: without the current flink-table-planner, it's
> impossible to acquire the tableEnv from a table. If so, all interfaces
> would have to require an extra tableEnv argument.
> 
> This does make sense, but personally I don't like it because it has nothing
> to do with machine learning concepts. Flink ML is mainly targeted at
> algorithm engineers and scientists; I believe it's better to make the api
> clean and hide the detail of implementation as much as possible. Hopefully
> there is another way to acquire the tableEnv so the api can stay clean.
> 
> Aljoscha Krettek <aljos...@apache.org> 于2019年5月16日周四 下午8:16写道:
> 
>> Hi,
>> 
>> I had a look at the document mostly from a module structure/dependency
>> structure perspective.
>> 
>> We should make the expected dependency structure explicit in the document.
>> 
>> From the discussion in the doc it seems that the intention is that
>> flink-ml-lib should depend on flink-table-planner (the current, pre-blink
>> Table API planner that has a dependency on the DataSet API and DataStream
>> API). I think we should not have this because it ties the Flink ML
>> implementation to a module that is going to be deprecated. As far as I
>> understood, the intention for this new Flink ML module is to be the next
>> generation approach, based on the Table API. If this is true, we should
>> make sure that this only depends on the Table API and is independent of the
>> underlying planner implementation. Especially if we want this to work with
>> the new Blink-based planner that is currently being added to Flink.
>> 
>> What do you think?
>> 
>> Best,
>> Aljoscha
>> 
>>> On 10. May 2019, at 11:22, Shaoxuan Wang <wshaox...@gmail.com> wrote:
>>> 
>>> Hi everyone,
>>> 
>>> I created umbrella Jira FLINK-12470
>>> <https://issues.apache.org/jira/browse/FLINK-12470> for FLIP39 and
>> added an
>>> "implementation plan" section in the google doc (
>>> https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo/edit#heading=h.pggjwvwg8mrx
>>> ).
>>> Your special attention is needed on the organization of modules/packages
>>> of flink-ml. @Aljoscha, @Till, @Rong, @Jincheng, @Becket, and all.
>>> 
>>> We anticipate quick development growth of Flink ML in the next several
>>> releases. Several components (for instance: pipeline, mllib, model
>>> serving, ml integration test) need to be separated into different
>>> submodules. Therefore, we propose to create a new flink-ml module at the
>>> root and add sub-modules for the ml-pipeline and ml-lib of FLIP39;
>>> potentially we can also design FLIP23 as another sub-module under this
>>> new flink-ml
>>> module (I will raise a discussion in the FLIP23 ML thread about this). The
>>> legacy flink-ml module (under flink-libraries) can remain as it is and
>>> be deprecated in the future, or alternatively we can move it under this
>>> new flink-ml module and rename it flink-dataset-ml. What do you think?
>>> 
>>> Looking forward to your feedback.
>>> 
>>> Regards,
>>> Shaoxuan
>>> 
>>> 
>>> On Tue, May 7, 2019 at 8:42 AM Rong Rong <walter...@gmail.com> wrote:
>>> 
>>>> Thanks for following up promptly and sharing the feedback @shaoxuan.
>>>> 
>>>> Yes, I share your view on the eventual convergence of these two FLIPs.
>>>> I also have some questions regarding the API as well as possible
>>>> convergence challenges (especially the current co-processor approach
>>>> vs. FLIP-39's Table API approach); I will follow up on the discussion
>>>> thread and the PR on FLIP-23 with you and Boris :-)
>>>> 
>>>> --
>>>> Rong
>>>> 
>>>> On Mon, May 6, 2019 at 3:30 AM Shaoxuan Wang <wshaox...@gmail.com>
>> wrote:
>>>> 
>>>>> 
>>>>> Thanks for the feedback, Rong and Flavio.
>>>>> 
>>>>> @Rong Rong
>>>>>> There's another thread regarding a close-to-merge FLIP-23
>>>>>> implementation [1]. I agree it might still be too early to talk about
>>>>>> productionizing and model serving. But it would be nice to keep in
>>>>>> mind in the design/implementation that ease of use for productionizing
>>>>>> an ML pipeline is also very important. And if we can leverage the
>>>>>> FLIP-23 implementation in the future (some adjustment might be
>>>>>> needed), that would be super helpful.
>>>>> You raised a very good point. Actually, I have been reviewing FLIP23
>>>>> for a while (mostly offline, to help Boris polish the PR). FMPOV, FLIP23
>>>>> and FLIP39 can be well unified at some point. Model serving in FLIP23 is
>>>>> actually a special case of the "transformer/model" proposed in FLIP39.
>>>>> Boris's implementation of model serving can be designed as an abstract
>>>>> class on top of the transformer/model interface, and then be used by ML
>>>>> users as a certain ML lib. I have some other comments WRT FLIP23 x
>>>>> FLIP39; I will reply to the FLIP23 ML thread later with more details.
>>>>> 
>>>>> @Flavio
>>>>>> I have read many discussions about Flink ML and none of them take into
>>>>>> account the ongoing efforts carried out by the Streamline H2020
>>>>>> project [1] on this topic.
>>>>>> Have you tried to ping them? I think that both projects could benefit
>>>>>> from a joint effort on this side.
>>>>>> [1] https://h2020-streamline-project.eu/objectives/
>>>>> Thanks for the info. I was not aware of the Streamline H2020 project
>>>>> before. I just took a quick look at its website and GitHub. IMO these
>>>>> projects could be very good Flink ecosystem projects and could be built
>>>>> on top of the ML pipeline & ML lib interfaces introduced in FLIP39. I
>>>>> will try to contact the owners of these projects to understand their
>>>>> plans and any blockers to using Flink. In the meantime, if you have
>>>>> direct contacts who might be interested in the ML pipeline & ML lib,
>>>>> please share them with me.
>>>>> 
>>>>> Regards,
>>>>> Shaoxuan
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, May 2, 2019 at 3:59 PM Flavio Pompermaier <
>> pomperma...@okkam.it>
>>>>> wrote:
>>>>> 
>>>>>> Hi to all,
>>>>>> I have read many discussions about Flink ML and none of them take into
>>>>>> account the ongoing efforts carried out by the Streamline H2020
>>>>>> project [1] on this topic.
>>>>>> Have you tried to ping them? I think that both projects could benefit
>>>>>> from a joint effort on this side.
>>>>>> [1] https://h2020-streamline-project.eu/objectives/
>>>>>> 
>>>>>> Best,
>>>>>> Flavio
>>>>>> 
>>>>>> On Thu, May 2, 2019 at 12:18 AM Rong Rong <walter...@gmail.com>
>> wrote:
>>>>>> 
>>>>>>> Hi Shaoxuan/Weihua,
>>>>>>> 
>>>>>>> Thanks for the proposal and driving the effort.
>>>>>>> I also replied to the original discussion thread, and am still +1 on
>>>>>>> moving towards the scikit-learn model.
>>>>>>> I just left a few comments on the API details and some general
>>>>>>> questions. Please kindly take a look.
>>>>>>> 
>>>>>>> There's another thread regarding a close-to-merge FLIP-23
>>>>>>> implementation [1]. I agree it might still be too early to talk about
>>>>>>> productionizing and model serving. But it would be nice to keep in
>>>>>>> mind in the design/implementation that ease of use for productionizing
>>>>>>> an ML pipeline is also very important. And if we can leverage the
>>>>>>> FLIP-23 implementation in the future (some adjustment might be
>>>>>>> needed), that would be super helpful.
>>>>>>> 
>>>>>>> Best,
>>>>>>> Rong
>>>>>>> 
>>>>>>> 
>>>>>>> [1]
>>>>>>> 
>>>>>>> 
>>>>>> 
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-23-Model-Serving-td20260.html
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Apr 30, 2019 at 1:47 AM Shaoxuan Wang <wshaox...@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Thanks for all the feedback.
>>>>>>>> 
>>>>>>>> @Jincheng Sun
>>>>>>>>> I recommend adding a detailed implementation plan to the FLIP and
>>>>>>>>> the google doc.
>>>>>>>> Yes, I will add a subsection for the implementation plan.
>>>>>>>> 
>>>>>>>> @Chen Qin
>>>>>>>>> Just sharing some insights from operating SparkML at scale:
>>>>>>>>> - map-reduce may not be the best way to iteratively sync partitioned
>>>>>>>>> workers.
>>>>>>>>> - native hardware acceleration is key to adopting rapid changes in
>>>>>>>>> ML improvements in the foreseeable future.
>>>>>>>> Thanks for sharing your experience on SparkML. The purpose of this
>>>>>>>> FLIP is mainly to provide the interfaces for the ML pipeline and ML
>>>>>>>> lib, and the implementations of most standard algorithms. Besides
>>>>>>>> this FLIP, for AI computing on Flink, we will continue to contribute
>>>>>>>> efforts like the enhancement of iteration and the integration of deep
>>>>>>>> learning engines (such as TensorFlow/PyTorch). I have presented part
>>>>>>>> of this work in
>>>>>>>> https://www.ververica.com/resources/flink-forward-san-francisco-2019/when-table-meets-ai-build-flink-ai-ecosystem-on-table-api
>>>>>>>> I am not sure if I have fully understood your comments. Can you
>>>>>>>> please elaborate on them with more details and, if possible, provide
>>>>>>>> some suggestions about what we should work on to address the
>>>>>>>> challenges you have mentioned?
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Shaoxuan
>>>>>>>> 
>>>>>>>> On Mon, Apr 29, 2019 at 11:28 AM Chen Qin <qinnc...@gmail.com>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Just sharing some insights from operating SparkML at scale:
>>>>>>>>> - map-reduce may not be the best way to iteratively sync partitioned
>>>>>>>>> workers.
>>>>>>>>> - native hardware acceleration is key to adopting rapid changes in
>>>>>>>>> ML improvements in the foreseeable future.
>>>>>>>>> 
>>>>>>>>> Chen
>>>>>>>>> 
>>>>>>>>> On Apr 29, 2019, at 11:02, jincheng sun <sunjincheng...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Shaoxuan,
>>>>>>>>>> 
>>>>>>>>>> Thanks for your efforts on enhancing the scalability and ease of
>>>>>>>>>> use of Flink ML and taking it one step further. Thank you for
>>>>>>>>>> sharing a lot of context information.
>>>>>>>>>> 
>>>>>>>>>> big +1 for this proposal!
>>>>>>>>>> 
>>>>>>>>>> Here is just one suggestion: there is only a short time until the
>>>>>>>>>> release of Flink 1.9, so I recommend adding a detailed
>>>>>>>>>> implementation plan to the FLIP and the google doc.
>>>>>>>>>> 
>>>>>>>>>> What do you think?
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Jincheng
>>>>>>>>>> 
>>>>>>>>>> Shaoxuan Wang <wshaox...@gmail.com> 于2019年4月29日周一 上午10:34写道:
>>>>>>>>>> 
>>>>>>>>>>> Hi everyone,
>>>>>>>>>>> 
>>>>>>>>>>> Weihua proposed rebuilding the Flink ML pipeline on top of the
>>>>>>>>>>> Table API several months ago in this mail thread:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Embracing-Table-API-in-Flink-ML-td25368.html
>>>>>>>>>>> 
>>>>>>>>>>> Luogen, Becket, Xu, Weihua and I have been working on this
>>>>>>>>>>> proposal offline over the past few months. Now we want to share
>>>>>>>>>>> the first phase of the entire proposal as a FLIP. In FLIP-39, we
>>>>>>>>>>> want to achieve several things (and hope those can be accomplished
>>>>>>>>>>> and released in Flink 1.9):
>>>>>>>>>>> 
>>>>>>>>>>> - Provide a new set of ML core interfaces (on top of the Flink
>>>>>>>>>>> TableAPI)
>>>>>>>>>>> - Provide an ML pipeline interface (on top of the Flink TableAPI)
>>>>>>>>>>> - Provide the interfaces for parameter management and
>>>>>>>>>>> pipeline/model persistence
>>>>>>>>>>> - All the above interfaces should facilitate any new ML algorithm.
>>>>>>>>>>> We will gradually add various standard ML algorithms on top of
>>>>>>>>>>> these newly proposed interfaces to ensure their feasibility and
>>>>>>>>>>> scalability.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Part of this FLIP was presented at Flink Forward 2019 in San
>>>>>>>>>>> Francisco by Xu and me:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>> https://sf-2019.flink-forward.org/conference-program#when-table-meets-ai--build-flink-ai-ecosystem-on-table-api
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>> https://sf-2019.flink-forward.org/conference-program#high-performance-ml-library-based-on-flink
>>>>>>>>>>> 
>>>>>>>>>>> You can find the videos & slides at
>>>>>>>>>>> https://www.ververica.com/flink-forward-san-francisco-2019
>>>>>>>>>>> 
>>>>>>>>>>> The design document for FLIP-39 can be found here:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>> https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I am looking forward to your feedback.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> 
>>>>>>>>>>> Shaoxuan
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>> 
>> 
