Hi

Meanwhile, until FlinkML matures, it might be worth having Flink as the engine
powering H2O, much as Spark does with Sparkling Water.
Any thoughts?

Thanks

Slim Baltagi

On Feb 12, 2016, at 7:25 AM, Theodore Vasiloudis 
<theodoros.vasilou...@gmail.com> wrote:

> I think Simone raises some good points here.
> 
> The truth is that FlinkML is still in its infancy and it will be hard to
> compete with mllib, H2O and Graphlab in terms of features
> and algorithm "coverage".
> 
> My hope has always been that the library will be focused on what Flink does
> well and implement algorithms that are
> built around the inherent advantages Flink provides over other platforms.
> 
> This is an open source project, so of course it's not up to one person to
> decide what makes it into the library and what doesn't,
> and for me it's been really hard to gauge what the community "wants" from
> the library in terms of algorithms.
> 
> The "basics" (sklearn-like predictors, evaluators, CV and pipelines) I
> think are necessary and are largely in place already.
> Making sure that they provide a good user experience is paramount of course
> before we settle on the design.
> 
> But this is less a discussion of where we take FlinkML than of *how* we do
> it.
> I do believe there is a need for an integrated ML library for Flink, the
> question for me is how can we ensure its continued development.
> 
> 
> 
> On Fri, Feb 12, 2016 at 12:59 PM, Chiwan Park <chiwanp...@apache.org> wrote:
> 
>> Hi,
>> 
>> I agree with what Theo said. Currently, only a few committers spend time
>> reviewing PRs for FlinkML. But I also agree with Fabian’s opinion. I would
>> like to keep FlinkML under the main Flink repository. I hope new committers
>> will spend time on FlinkML.
>> 
>> About Simone’s opinion: yes, FlinkML is still an immature ML library. Many
>> useful features are missing, and some are pending in pull requests.
>> 
>> Integration with other libraries such as Mahout, H2O, or Weka would also be
>> good. There are already some attempts to use Flink or another distributed
>> data processing framework as a backend for other libraries [1] [2] [3]. But,
>> as you can see from the links, I think we would have to re-implement many
>> algorithms even if we integrated another library with Flink. I doubt that
>> integration offers a big development advantage.
>> 
>> [1]: https://issues.apache.org/jira/browse/MAHOUT-1570
>> [2]: http://mahout.apache.org/users/basics/algorithms.html
>> [3]: https://github.com/ariskk/distributedWekaSpark
>> 
>> Regards,
>> Chiwan Park
>> 
>>> On Feb 12, 2016, at 7:04 PM, Fabian Hueske <fhue...@gmail.com> wrote:
>>> 
>>> Hi Theo,
>>> 
>>> thanks for starting this discussion. You are certainly right that the
>>> development of FlinkML is stalling. On the other hand, we regularly see
>>> people on the mailing list asking for features.
>>> 
>>> Regarding your proposed ways to proceed:
>>> 
>>> 1) I am not sure how much it would help to move FlinkML to a separate
>>> repository.
>>> We have discussed moving connectors (and libraries) to separate
>>> repositories before, but the thread fell asleep [1].
>>> We would still need committers to spend time reviewing, merging, and
>>> contributing.
>>> So IMO, this is orthogonal to having more committer involvement.
>>> 
>>> 2) Having committers (current or new ones) spend time on FlinkML is the
>>> requirement for keeping it alive within the Flink project.
>>> Adding new committers is kind of a bootstrap problem here because it is
>>> hard for contributors to get involved with FlinkML if very little committer
>>> time is spent on code reviews and merging. Nonetheless, I see this as the
>>> best option.
>>> 
>>> 3) Forking a project on GitHub is certainly possible (even without the
>>> endorsement of the Flink community). However, merging changes back into
>>> Flink would again require a committer to review and merge (probably a much
>>> larger chunk of code) and also require the permission of all contributors.
>>> 
>>> Best,
>>> Fabian
>>> 
>>> [1] https://mail-archives.apache.org/mod_mbox/flink-dev/201512.mbox/%3CCAGco--aZhZhrrSzzPROwXwmtYmD5CkoGKe7xNCWG1Vw7V-D%2BaA%40mail.gmail.com%3E
>>> 
>>> 2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
>>> theodoros.vasilou...@gmail.com>:
>>> 
>>>> Hello all,
>>>> 
>>>> I would like to get a conversation started on how we plan to move forward
>>>> with FlinkML.
>>>> 
>>>> Development on the library has been mostly dormant for the past 6 months,
>>>> mainly, I believe, because of the lack of available committers to review
>>>> PRs.
>>>> 
>>>> Last month we got together with Till and Marton and talked about how we
>>>> could try to solve this and ensure continued development of the library.
>>>> 
>>>> We see 3 possible paths we could take:
>>>> 
>>>>  1. Externalize the library, creating a new repository under the Apache
>>>>     Flink project. This decouples the development of FlinkML from the
>>>>     Flink release cycle, allowing us to move faster and incorporate new
>>>>     features as they become available. As FlinkML is a library under
>>>>     development, tying it to specific versions does not make much sense
>>>>     anyway. The library would depend on the latest snapshot version of
>>>>     Flink. It would then be possible for the Flink distribution to
>>>>     cherry-pick parts of the library to be included with the core
>>>>     distribution.
>>>>  2. Keep the development under the main Flink project but bring in new
>>>>     committers. This would mean that the development remains as is and is
>>>>     tied to core Flink releases, but new work should get merged at much
>>>>     more regular intervals through the help of committers other than Till.
>>>>     Marton Balassi has volunteered for that role and I hope that more might
>>>>     take up that role.
>>>>  3. A third option is to fork FlinkML on a repository on which we are
>>>>     able to commit freely (again through PRs and reviews of course) and
>>>>     merge good parts back into the main repo once in a while. This allows
>>>>     for faster progress and more experimental work but obviously creates
>>>>     fragmentation.
>>>> 
>>>> 
>>>> I would like to hear your thoughts on these three options, as well as
>>>> discuss other alternatives that could help move FlinkML forward.
>>>> 
>>>> Cheers,
>>>> Theodore
>>>> 
>> 
>> 
