My 2 cents as someone who's done ML over the years, having worked on Oryx
2.0 and Mahout and having used Spark MLlib (read as "had no choice due to
strict workplace enforcement"), and who understands their limitations well.

1. FlinkML in its present form seems like "do it like how Spark did it".

2. The recent discussion about PMML support in Flink is, to my mind, a clear
example of putting the cart before the horse. Why are we even talking about
PMML when there aren't many ML algos in FlinkML yet?

For a really good implementation of PMML and how it's being used (with
JPMML), I suggest looking at the Oryx 2.0 project. The PMML implementation
in Oryx 2.0 predates Spark and is a clean example of separating PMML from
the underlying framework (Spark or Flink).

We have had PMML discussions on the Mahout project in the past, but the
idea never gained any traction in large part due to PMML spec limitations
(mostly for clustering and classification algorithms) and the lack of
adoption within the community.

See the discussion here and specifically Ted Dunning's comment on PMML -
http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E

Most of the ML deployed in production today is Recommenders and Deep
Learning, neither of which is supported by the PMML spec.

3. Leveraging a probabilistic programming language like Figaro might be a
good way to go (just my thought) - that way most of the ML groundwork would
be external to Flink.

4. Within the Mahout community, we have been talking about (and are working
on) redoing the Samsara distributed linear algebra framework to support
Flink (in large part because we realized that Flink is a better platform
than the more popular one out there that Slim wouldn't wanna talk about :) ).

We should have a release out in the next few weeks (depending on
committers' availability). It would be great if FlinkML had something like
it.

There was a good audience for Sebastian's talk on this subject at #FF15 in
October.

5. It's a good idea to add Flink support to H2O, as Slim had suggested
elsewhere in this thread.


Thoughts?



On Fri, Feb 12, 2016 at 5:00 AM, Simone Robutti <
simone.robu...@radicalbit.io> wrote:

> I'll give my opinion as a person who has worked with SparkML and will soon
> be involved in the development of ML solutions on Flink.
>
> Recently I have tried to track the evolution and development of FlinkML,
> and I see one critical problem: FlinkML looks a lot like a placeholder for
> commercial purposes, but there's not enough investment and commitment to
> achieve a usable product. I did a few things with FlinkML coming from
> SparkML and I can say that it's unsuitable for most of the common use cases
> covered by SparkML (which is itself not a good ML library at all in terms
> of usability).
>
> So my question is: do we really need FlinkML? The roadmap looks a lot like
> "Spark has SparkML so we MUST have an ML library too". This could be
> reasonable if you aim at a fine-tuned library tailored to the specifics of
> Flink that differ from Spark. It could be even better if you developed an
> implementation of SGD that exploits the computational model of Flink,
> which I think could achieve a lot more than the current implementation.
> This is a subject that I want to study further before saying more, but I'm
> looking at better parallelization strategies for data and models.
>
> Going back to FlinkML, do we really need to reimplement the same workhorse
> algorithms already implemented in SparkML, H2O, Mahout, SystemML, Weka,
> Oryx and other distributed learning libraries? Is it really useful at this
> stage? Given the current resources of the project, wouldn't it be more
> reasonable to invest time and energy in integrating more mature libraries
> (and possibly rich tooling that would give a big advantage over the other
> libraries)?
>
> I would like to comment on your proposals, but my experience in
> collaborative open source development is way too limited to form an
> interesting opinion. Also, I have no historical visibility into the
> motivations and discussions behind the development of FlinkML, and I would
> like pointers to something I can read on the shared vision for this part
> of the project, so that I can join the discussion from now on.
>
> Thanks,
>
> Simone
>
>
>
> 2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com>:
>
> > Hello all,
> >
> > I would like to get a conversation started on how we plan to move forward
> > with FlinkML.
> >
> > Development on the library has been mostly dormant for the past 6
> > months, mainly, I believe, because of the lack of available committers
> > to review PRs.
> >
> > Last month we got together with Till and Marton and talked about how we
> > could try to solve this and ensure continued development of the library.
> >
> > We see 3 possible paths we could take:
> >
> >    1. Externalize the library, creating a new repository under the Apache
> >    Flink project. This decouples the development of FlinkML from the
> >    Flink release cycle, allowing us to move faster and incorporate new
> >    features as they become available. As FlinkML is a library under
> >    development, tying it to specific versions does not make much sense
> >    anyway. The library would depend on the latest snapshot version of
> >    Flink. It would then be possible for the Flink distribution to
> >    cherry-pick parts of the library to be included with the core
> >    distribution.
> >    2. Keep the development under the main Flink project but bring in new
> >    committers. This would mean that the development remains as is and is
> >    tied to core Flink releases, but new work should get merged at much
> >    more regular intervals through the help of committers other than Till.
> >    Marton Balassi has volunteered for that role and I hope that more
> >    might take it up.
> >    3. A third option is to fork FlinkML to a repository to which we are
> >    able to commit freely (again through PRs and reviews of course) and
> >    merge good parts back into the main repo once in a while. This allows
> >    for faster progress and more experimental work but obviously creates
> >    fragmentation.
> >
> >
> > I would like to hear your thoughts on these three options, as well as
> > discuss other alternatives that could help move FlinkML forward.
> >
> > Cheers,
> > Theodore
> >
>
