Re: Opening a discussion on FlinkML

2016-02-14 Thread Martin Neumann
I think the focus of this discussion should be how we proceed, not what to
do. The what comes from the committers anyway.

There are several people who would like to contribute, including people from
the Streamline project. Having pull requests that are older than 6 months is
not good for any project.
The main question is how we can develop the library further with high
standards but without creating a bottleneck that holds things back too much.

In my opinion it would be best if we find enough resources to keep things
inside Flink. However, if we have to depend on people who are
already stretched for time, splitting it out might be the better option
(path 1 from Theo's original mail).

cheers Martin




On Fri, Feb 12, 2016 at 3:54 PM, Suneel Marthi  wrote:

> On Fri, Feb 12, 2016 at 9:40 AM, Simone Robutti <
> simone.robu...@radicalbit.io> wrote:
>
> > @Suneel
> >
> > 1) Totally agree, as I wrote before.
> >
> > 2)I agree that support for PMML is premature but we shouldn't
> underestimate
> > the variety and complexity of the uses of ML models in the industry. The
> > adoption of Flink, hopefully, will grow and reach less innovative
> realities
> > where Random Forests and SVMs are still the main algorithms in use. In
> > these same realities there are legacies that justify the use of PMML to
> > port models. Still, FlinkML is still in an early stage so as you said, it
> > doesn't make sense to spend time right now on such a feature.
> >
>
> +1, as I mentioned earlier the PMML spec only supports classification and
> clustering (I last checked this in Aug 2015, pretty sure it would not have
> changed since then); hence 'Yes' it has some limited uses; 'No' - its too
> premature to even talk about it given the present state of FlinkML.
>
> >
> > 3)This would be really interesting. How do you imagine that the
> integration
> > with a distributed processing engine would work?
> >
>
> I am not sure yet, we r still exploring this on Mahout project to add to
> Mahout-Samsara - most of the statistics and probabilistic modeling would
> then be supported by Figaro (Bayesian, MCMC etc) and hence can be external
> to FlinkML.
>
> Figaro is Scala based. See https://github.com/p2t2/figaro
>
> I believe there are few other similar DSLs out there, need to dig up my old
> emails.
>
> (Not sure if its ASLv2 License, need verification here)
>
>
> >
> > 5) Agree on this one too. To my knowledge it would be the best option
> > together with SAMOA (for the streaming part).
> >
>
> There's already Flink - Samoa integration in place IIRC.
>
>
> >
> > 2016-02-12 15:25 GMT+01:00 Suneel Marthi :
> >
> > > My 2 cents as someone who's done ML over the years - having worked on
> > Oryx
> > > 2.0 and Mahout and having used Spark MlLib (read as "had no choice due
> to
> > > strict workplace enforcement") and understands well their limitations.
> > >
> > > 1. FlinkML in its present form seems like "do it like how Spark did
> it".
> > >
> > > 2. The recent discussion about PMML support in Flink to my mind is a
> > clear
> > > example of putting the cart before the horse.  Why are we even talking
> > PMML
> > > when there ain't much ML algos in FlinkML?
> > >
> > > For a real good implementation of PMML and how its being used (with
> > jPMML),
> > > suggest look at the Oryx 2.0 project. The PMML implementation in Oryx
> 2.0
> > > predates Spark and is a clean example of separating PMML from the
> > > underlying framework (Spark or Flink).
> > >
> > > We have had PMML discussions on the Mahout project in the past, but the
> > > idea never gained any traction in large part due to PMML spec
> limitations
> > > (mostly for clustering and classification algorithms) and the lack of
> > > adoption within the community.
> > >
> > > See the discussion here and specifically Ted Dunning's comment on PMML
> -
> > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E
> > >
> > > Most of the ML in practice (deployed in production) today are
> > Recommenders
> > > and Deep Learning - both of which are not supported by the PMML spec.
> > >
> > > 3. Leveraging a probabilistic programming language like Figaro might
> be a
> > > good way to go (just my thought) - that way most of the ML groundwork
> > would
> > > be external to Flink.
> > >
> > > 4. Within the Mahout community, we had been talking (and are working)
> on
> > > redoing the Samsara Distributed linear algebra framework to support
> Flink
> > > (in large part we realized that Flink is a better platform than the
> more
> > > popular one out there that Slim wouldn't wanna talk about :) ).
> > >
> > >  We should be having a release out in the next few weeks (depending on
> > > committers' availability). It would be great if FlinkML had something
> > like
> > > it.
> > >
> > > There was a good audience to Sebastian's talk on this subject at #FF15
> in
> > > October.
> > >

Re: Opening a discussion on FlinkML

2016-02-12 Thread Simone Robutti
I will give my opinion as a person who has worked with SparkML and will soon
be involved in the development of ML solutions on Flink.

Recently I have tried to track the evolution and development of FlinkML and
I see one big critical point: FlinkML looks a lot like a placeholder for
commercial purposes, but there is not enough investment and commitment to
achieve a usable product. I did a few things with FlinkML coming from
SparkML and I can say that it is unsuitable for most of the common use cases
covered by SparkML (which is itself not a good ML library in terms of
usability).

So my question is: do we really need FlinkML? The roadmap looks a lot like
"Spark has SparkML so we MUST have an ML library too". This could be
reasonable if you aim at a fine-tuned library tailored to the specifics of
Flink that are different from Spark. This could be even better if you
developed an implementation of SGD that exploits the computational model of
Flink, which, I think, could achieve a lot more than the current
implementation. This is a subject that I want to study further before saying
more, but I'm looking at better parallelization strategies for data and
models.

Going back to FlinkML, do we really need to reimplement the same workhorse
algorithms already implemented in SparkML, H2O, Mahout, SystemML, Weka,
Oryx and other distributed learning libraries? Is it really useful at this
stage? Given the current resources of the project, wouldn't it be more
reasonable to invest time and energy in integrating more mature libraries
(and eventually rich tooling that would give a big advantage over the other
libraries)?

I would like to comment on your proposals but my experience in
collaborative open source development is way too limited to form an
interesting opinion. Also, I have no historical visibility into the motivations
and discussions behind the development of FlinkML, and I would appreciate
pointers to anything describing the shared vision for this part of the project
so that I can join the discussion from now on.

Thanks,

Simone



2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
theodoros.vasilou...@gmail.com>:

> Hello all,
>
> I would like to get a conversation started on how we plan to move forward
> with FlinkML.
>
> Development on the library currently has been mostly dormant for the past 6
> months,
>
> mainly I believe because of the lack of available committers to review PRs.
>
> Last month we got together with Till and Marton and talked about how we
> could try to
>
> solve this and ensure continued development of the library.
>
> We see 3 possible paths we could take:
>
>1.
>
>Externalize the library, creating a new repository under the Apache
>Flink project. This decouples the development of FlinkML from the Flink
>release cycle, allowing us to move faster and incorporate new features
> as
>they become available. As FlinkML is a library under development tying
> it
>to specific versions does not make much sense anyway. The library would
>depend on the latest snapshot version of Flink. It would then be
> possible
>for the Flink distribution to cherry-pick parts of the library to be
>included with the core distribution.
>2.
>
>Keep the development under the main Flink project but bring in new
>committers. This would mean that the development remains as is and is
> tied
>to core Flink releases, but new worked should get merged at much more
>regular intervals through the help of committers other than Till. Marton
>Balassi has volunteered for that role and I hope that more might take up
>that role.
>3. A third option is to fork FlinkML on a repository on which we are
>able to commit freely (again through PRs and reviews of course) and
> merge
>good parts back into the main repo once in a while. This allows for
> faster
>progress and more experimental work but obviously creates fragmentation.
>
>
> I would like to hear your thoughts on these three options, as well as
> discuss other
>
> alternatives that could help move FlinkML forward.
>
> Cheers,
> Theodore
>


Re: Opening a discussion on FlinkML

2016-02-12 Thread Fabian Hueske
Hi Theo,

thanks for starting this discussion. You are certainly right that the
development of FlinkML is stalling. On the other hand, we regularly see
people on the mailing list asking for features.

Regarding your proposed ways to proceed:

1) I am not sure how much it would help to move FlinkML to a separate
repository.
We have discussed moving connectors (and libraries) to separate
repositories before but the thread fell asleep [1].
We would still need committers to spend time reviewing, merging, and
contributing.
So IMO, this is orthogonal to having more committer involvement.

2) Having committers (current / new ones) spend time on FlinkML is the
requirement for keeping it alive within the Flink project.
Adding new committers is kind of a bootstrap problem here because it is
hard for contributors to get involved with FlinkML if very little committer
time is spent on code reviews and merging. Nonetheless, I see this as the
best option.

3) Forking of a project on Github is certainly possible (even without the
endorsement of the Flink community). However, merging changes back into
Flink would again require a committer to review and merge (probably a much
larger chunk of code) and also require the permission of all contributors.

Best,
Fabian

[1]
https://mail-archives.apache.org/mod_mbox/flink-dev/201512.mbox/%3CCAGco--aZhZhrrSzzPROwXwmtYmD5CkoGKe7xNCWG1Vw7V-D%2BaA%40mail.gmail.com%3E

2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
theodoros.vasilou...@gmail.com>:

> Hello all,
>
> I would like to get a conversation started on how we plan to move forward
> with FlinkML.
>
> Development on the library currently has been mostly dormant for the past 6
> months,
>
> mainly I believe because of the lack of available committers to review PRs.
>
> Last month we got together with Till and Marton and talked about how we
> could try to
>
> solve this and ensure continued development of the library.
>
> We see 3 possible paths we could take:
>
>1.
>
>Externalize the library, creating a new repository under the Apache
>Flink project. This decouples the development of FlinkML from the Flink
>release cycle, allowing us to move faster and incorporate new features
> as
>they become available. As FlinkML is a library under development tying
> it
>to specific versions does not make much sense anyway. The library would
>depend on the latest snapshot version of Flink. It would then be
> possible
>for the Flink distribution to cherry-pick parts of the library to be
>included with the core distribution.
>2.
>
>Keep the development under the main Flink project but bring in new
>committers. This would mean that the development remains as is and is
> tied
>to core Flink releases, but new worked should get merged at much more
>regular intervals through the help of committers other than Till. Marton
>Balassi has volunteered for that role and I hope that more might take up
>that role.
>3. A third option is to fork FlinkML on a repository on which we are
>able to commit freely (again through PRs and reviews of course) and
> merge
>good parts back into the main repo once in a while. This allows for
> faster
>progress and more experimental work but obviously creates fragmentation.
>
>
> I would like to hear your thoughts on these three options, as well as
> discuss other
>
> alternatives that could help move FlinkML forward.
>
> Cheers,
> Theodore
>


Opening a discussion on FlinkML

2016-02-12 Thread Theodore Vasiloudis
Hello all,

I would like to get a conversation started on how we plan to move forward
with FlinkML.

Development on the library has been mostly dormant for the past 6 months,
mainly, I believe, because of the lack of available committers to review PRs.

Last month we got together with Till and Marton and talked about how we
could try to solve this and ensure continued development of the library.

We see 3 possible paths we could take:

   1. Externalize the library, creating a new repository under the Apache
      Flink project. This decouples the development of FlinkML from the Flink
      release cycle, allowing us to move faster and incorporate new features as
      they become available. As FlinkML is a library under development, tying
      it to specific versions does not make much sense anyway. The library
      would depend on the latest snapshot version of Flink. It would then be
      possible for the Flink distribution to cherry-pick parts of the library
      to be included with the core distribution.
   2. Keep the development under the main Flink project but bring in new
      committers. This would mean that the development remains as is and is
      tied to core Flink releases, but new work should get merged at much more
      regular intervals through the help of committers other than Till. Marton
      Balassi has volunteered for that role and I hope that more might take up
      that role.
   3. A third option is to fork FlinkML on a repository on which we are
      able to commit freely (again through PRs and reviews of course) and merge
      good parts back into the main repo once in a while. This allows for
      faster progress and more experimental work but obviously creates
      fragmentation.


I would like to hear your thoughts on these three options, as well as
discuss other alternatives that could help move FlinkML forward.

Cheers,
Theodore


Re: Opening a discussion on FlinkML

2016-02-12 Thread Chiwan Park
Hi,

I agree with what Theo said. Currently, only a few committers spend time
reviewing PRs about FlinkML. But I also agree with Fabian’s opinion. I would
like to keep FlinkML under the main repository of Flink. I hope new committers
will spend time on FlinkML.

About Simone’s opinion: yes, FlinkML is still an immature ML library. Many
useful features are missing and some of them are pending in pull requests.

Integration with other libraries such as Mahout, H2O, and Weka would also be
good. There are already some attempts to use Flink or another distributed data
processing framework as a backend for other libraries [1] [2] [3]. But I think,
as you can see from the links, we would have to re-implement many algorithms
even if we integrate another library with Flink. I doubt that integration
brings a big development advantage.

[1]: https://issues.apache.org/jira/browse/MAHOUT-1570
[2]: http://mahout.apache.org/users/basics/algorithms.html
[3]: https://github.com/ariskk/distributedWekaSpark

Regards,
Chiwan Park

> On Feb 12, 2016, at 7:04 PM, Fabian Hueske  wrote:
> 
> Hi Theo,
> 
> thanks for starting this discussion. You are certainly right that the
> development of FlinkML is stalling. On the other hand, we regularly see
> people on the mailing list asking for feature.
> 
> Regarding your proposed ways to proceed:
> 
> 1) I am not sure how much it would help to move FlinkML to a separate
> repository.
> We have discussed to move connectors (and libraries) to separate
> repositories before but the thread fall asleep [1].
> We would still need committers to spend time with reviewing, merging, and
> contributing.
> So IMO, this is orthogonal to having more committer involvement.
> 
> 2) Having committers (current /  new ones) spending time on FlinkML is the
> requirement for keep it alive within the Flink project.
> Adding new committers is kind of a bootstrap problem here because it is
> hard for contributors to get involved with FlinkML if very little committer
> time is spend on code reviews and merging. Nonetheless, I see this as the
> best option.
> 
> 3) Forking of a project on Github is certainly possible (even without the
> endorsement of the Flink community). However, merging changes back into
> Flink would again require a committer to review and merge (probably a much
> larger chunk of code) and also require the permission of all contributors.
> 
> Best,
> Fabian
> 
> [1]
> https://mail-archives.apache.org/mod_mbox/flink-dev/201512.mbox/%3CCAGco--aZhZhrrSzzPROwXwmtYmD5CkoGKe7xNCWG1Vw7V-D%2BaA%40mail.gmail.com%3E
> 
> 2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com>:
> 
>> Hello all,
>> 
>> I would like to get a conversation started on how we plan to move forward
>> with FlinkML.
>> 
>> Development on the library currently has been mostly dormant for the past 6
>> months,
>> 
>> mainly I believe because of the lack of available committers to review PRs.
>> 
>> Last month we got together with Till and Marton and talked about how we
>> could try to
>> 
>> solve this and ensure continued development of the library.
>> 
>> We see 3 possible paths we could take:
>> 
>>   1.
>> 
>>   Externalize the library, creating a new repository under the Apache
>>   Flink project. This decouples the development of FlinkML from the Flink
>>   release cycle, allowing us to move faster and incorporate new features
>> as
>>   they become available. As FlinkML is a library under development tying
>> it
>>   to specific versions does not make much sense anyway. The library would
>>   depend on the latest snapshot version of Flink. It would then be
>> possible
>>   for the Flink distribution to cherry-pick parts of the library to be
>>   included with the core distribution.
>>   2.
>> 
>>   Keep the development under the main Flink project but bring in new
>>   committers. This would mean that the development remains as is and is
>> tied
>>   to core Flink releases, but new worked should get merged at much more
>>   regular intervals through the help of committers other than Till. Marton
>>   Balassi has volunteered for that role and I hope that more might take up
>>   that role.
>>   3. A third option is to fork FlinkML on a repository on which we are
>>   able to commit freely (again through PRs and reviews of course) and
>> merge
>>   good parts back into the main repo once in a while. This allows for
>> faster
>>   progress and more experimental work but obviously creates fragmentation.
>> 
>> 
>> I would like to hear your thoughts on these three options, as well as
>> discuss other
>> 
>> alternatives that could help move FlinkML forward.
>> 
>> Cheers,
>> Theodore
>> 



Re: Opening a discussion on FlinkML

2016-02-12 Thread Simone Robutti
@Suneel

1) Totally agree, as I wrote before.

2) I agree that support for PMML is premature but we shouldn't underestimate
the variety and complexity of the uses of ML models in the industry. The
adoption of Flink, hopefully, will grow and reach less innovative environments
where Random Forests and SVMs are still the main algorithms in use. In
these same environments there are legacy systems that justify the use of PMML
to port models. Still, FlinkML is at an early stage so, as you said, it
doesn't make sense to spend time right now on such a feature.

3) This would be really interesting. How do you imagine that the integration
with a distributed processing engine would work?

5) Agree on this one too. To my knowledge it would be the best option
together with SAMOA (for the streaming part).

2016-02-12 15:25 GMT+01:00 Suneel Marthi :

> My 2 cents as someone who's done ML over the years - having worked on Oryx
> 2.0 and Mahout and having used Spark MlLib (read as "had no choice due to
> strict workplace enforcement") and understands well their limitations.
>
> 1. FlinkML in its present form seems like "do it like how Spark did it".
>
> 2. The recent discussion about PMML support in Flink to my mind is a clear
> example of putting the cart before the horse.  Why are we even talking PMML
> when there ain't much ML algos in FlinkML?
>
> For a real good implementation of PMML and how its being used (with jPMML),
> suggest look at the Oryx 2.0 project. The PMML implementation in Oryx 2.0
> predates Spark and is a clean example of separating PMML from the
> underlying framework (Spark or Flink).
>
> We have had PMML discussions on the Mahout project in the past, but the
> idea never gained any traction in large part due to PMML spec limitations
> (mostly for clustering and classification algorithms) and the lack of
> adoption within the community.
>
> See the discussion here and specifically Ted Dunning's comment on PMML -
>
> http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E
>
> Most of the ML in practice (deployed in production) today are Recommenders
> and Deep Learning - both of which are not supported by the PMML spec.
>
> 3. Leveraging a probabilistic programming language like Figaro might be a
> good way to go (just my thought) - that way most of the ML groundwork would
> be external to Flink.
>
> 4. Within the Mahout community, we had been talking (and are working) on
> redoing the Samsara Distributed linear algebra framework to support Flink
> (in large part we realized that Flink is a better platform than the more
> popular one out there that Slim wouldn't wanna talk about :) ).
>
>  We should be having a release out in the next few weeks (depending on
> committers' availability). It would be great if FlinkML had something like
> it.
>
> There was a good audience to Sebastian's talk on this subject at #FF15 in
> October.
>
> 5. Its a good idea to add Flink support to H2O as Slim had suggested
> elsewhere in this thread.
>
>
> Thoughts?
>
>
>
> On Fri, Feb 12, 2016 at 5:00 AM, Simone Robutti <
> simone.robu...@radicalbit.io> wrote:
>
> > I will say my opinion as a person that have worked with SparkML and will
> be
> > involved soon in the development of ML solutions on Flink.
> >
> > In these days I tried to track the evolution and development of FlinkML
> and
> > I see a big critical point: FlinkML looks a lot like a placeholder for
> > commercial purposes but there's not enough investment and commitment to
> > achieve an usable product. I did a few things with FlinkML coming from
> > SparkML and I can say that it's unsuitable for most of the common use
> cases
> > covered by SparkML (that is not a good ML library at all in terms of
> > usability).
> >
> > So my question is: do we really need FlinkML? The roadmap looks a lot
> like
> > "Spark has SparkML so we MUST have a ML library too". This could be
> > reasonable if you aim at a fine-tuned library tailored on the specifics
> of
> > Flink that are different from Spark. This could be even better if you
> > developed an implementation of SGD that exploit the computational model
> of
> > Flink that, I think, could achieve a lot more compared to the actual
> > implementation. This is a subject that I want to study better before
> saying
> > more but I'm looking at better parallelization strategies for data and
> > models.
> >
> > Going back to FlinkML, do we really need to reimplement the same
> workhorse
> > algorithms already implemented in SparkML, H2O, Mahout, SystemML, Weka,
> > Oryx and other distributed learning libraries? Is it really useful at
> this
> > stage? Given the current resources of the project, wouldn't it be more
> > reasonable to invest time and energy in integrating more mature libraries
> > (and eventually rich tooling that would give a big advantage over the
> other
> > libraries)?
> >
> > I would like to comment on your 

Re: Opening a discussion on FlinkML

2016-02-12 Thread Suneel Marthi
On Fri, Feb 12, 2016 at 9:40 AM, Simone Robutti <
simone.robu...@radicalbit.io> wrote:

> @Suneel
>
> 1) Totally agree, as I wrote before.
>
> 2)I agree that support for PMML is premature but we shouldn't underestimate
> the variety and complexity of the uses of ML models in the industry. The
> adoption of Flink, hopefully, will grow and reach less innovative realities
> where Random Forests and SVMs are still the main algorithms in use. In
> these same realities there are legacies that justify the use of PMML to
> port models. Still, FlinkML is still in an early stage so as you said, it
> doesn't make sense to spend time right now on such a feature.
>

+1, as I mentioned earlier the PMML spec only supports classification and
clustering (I last checked this in Aug 2015, pretty sure it would not have
changed since then); hence 'Yes', it has some limited uses; 'No', it's too
premature to even talk about it given the present state of FlinkML.

>
> 3)This would be really interesting. How do you imagine that the integration
> with a distributed processing engine would work?
>

I am not sure yet, we are still exploring this on the Mahout project to add to
Mahout-Samsara - most of the statistics and probabilistic modeling would
then be supported by Figaro (Bayesian, MCMC etc) and hence can be external
to FlinkML.

Figaro is Scala based. See https://github.com/p2t2/figaro

I believe there are a few other similar DSLs out there, need to dig up my old
emails.

(Not sure if it's under the ASLv2 license, need verification here)


>
> 5) Agree on this one too. To my knowledge it would be the best option
> together with SAMOA (for the streaming part).
>

There's already Flink - Samoa integration in place IIRC.


>
> 2016-02-12 15:25 GMT+01:00 Suneel Marthi :
>
> > My 2 cents as someone who's done ML over the years - having worked on
> Oryx
> > 2.0 and Mahout and having used Spark MlLib (read as "had no choice due to
> > strict workplace enforcement") and understands well their limitations.
> >
> > 1. FlinkML in its present form seems like "do it like how Spark did it".
> >
> > 2. The recent discussion about PMML support in Flink to my mind is a
> clear
> > example of putting the cart before the horse.  Why are we even talking
> PMML
> > when there ain't much ML algos in FlinkML?
> >
> > For a real good implementation of PMML and how its being used (with
> jPMML),
> > suggest look at the Oryx 2.0 project. The PMML implementation in Oryx 2.0
> > predates Spark and is a clean example of separating PMML from the
> > underlying framework (Spark or Flink).
> >
> > We have had PMML discussions on the Mahout project in the past, but the
> > idea never gained any traction in large part due to PMML spec limitations
> > (mostly for clustering and classification algorithms) and the lack of
> > adoption within the community.
> >
> > See the discussion here and specifically Ted Dunning's comment on PMML -
> >
> >
> http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E
> >
> > Most of the ML in practice (deployed in production) today are
> Recommenders
> > and Deep Learning - both of which are not supported by the PMML spec.
> >
> > 3. Leveraging a probabilistic programming language like Figaro might be a
> > good way to go (just my thought) - that way most of the ML groundwork
> would
> > be external to Flink.
> >
> > 4. Within the Mahout community, we had been talking (and are working) on
> > redoing the Samsara Distributed linear algebra framework to support Flink
> > (in large part we realized that Flink is a better platform than the more
> > popular one out there that Slim wouldn't wanna talk about :) ).
> >
> >  We should be having a release out in the next few weeks (depending on
> > committers' availability). It would be great if FlinkML had something
> like
> > it.
> >
> > There was a good audience to Sebastian's talk on this subject at #FF15 in
> > October.
> >
> > 5. Its a good idea to add Flink support to H2O as Slim had suggested
> > elsewhere in this thread.
> >
> >
> > Thoughts?
> >
> >
> >
> > On Fri, Feb 12, 2016 at 5:00 AM, Simone Robutti <
> > simone.robu...@radicalbit.io> wrote:
> >
> > > I will say my opinion as a person that have worked with SparkML and
> will
> > be
> > > involved soon in the development of ML solutions on Flink.
> > >
> > > In these days I tried to track the evolution and development of FlinkML
> > and
> > > I see a big critical point: FlinkML looks a lot like a placeholder for
> > > commercial purposes but there's not enough investment and commitment to
> > > achieve an usable product. I did a few things with FlinkML coming from
> > > SparkML and I can say that it's unsuitable for most of the common use
> > cases
> > > covered by SparkML (that is not a good ML library at all in terms of
> > > usability).
> > >
> > > So my question is: do we really need FlinkML? The roadmap looks a lot
> > like
> > > 

Re: Opening a discussion on FlinkML

2016-02-12 Thread Slim Baltagi
Hi

Meanwhile, until FlinkML matures, it might be worth having Flink as the engine
powering H2O, similar to what is done for Spark with Sparkling Water.
Any thoughts?

Thanks

Slim Baltagi

On Feb 12, 2016, at 7:25 AM, Theodore Vasiloudis wrote:

> I think Simone raises some good points here.
> 
> The truth is that FlinkML is still in its infancy and it will be hard to
> compete with mllib, H2O and Graphlab in terms of features
> and algorithm "coverage".
> 
> My hope has always been that the library will be focused on what Flink does
> well and implement algorithms that are
> built around the inherent advantages Flink provides over other platforms.
> 
> This is an open source project, so of course it's not up to one person to
> decide what makes it into the library and what doesn't,
> and for me it's been really hard to gauge what the community "wants" from
> the library in terms of algorithms.
> 
> The "basics" (sklearn-like predictors, evaluators, CV and pipelines) I
> think are necessary and are largely in place already.
> Making sure that they provide a good user experience is paramount of course
> before we settle on the design.
> 
> But this is less of a discussion on where we take FlinkML, but *how* we do
> it.
> I do believe there is a need for an integrated ML library for Flink, the
> question for me is how can we ensure its continued development.
> 
> 
> 
> On Fri, Feb 12, 2016 at 12:59 PM, Chiwan Park  wrote:
> 
>> Hi,
>> 
>> I agree what Theo said. Currently, only few committers spend time to
>> review PRs about FlinkML. But I also agree Fabian’s opinion. I would like
>> to keep FlinkML under main repository of Flink. I hope new committers
>> spending time for FlinkML.
>> 
>> About Simone’s opinion, yes, FlinkML is still immature ML library. There
>> is a lack of many useful features and some of the features are pending in
>> pull requests.
>> 
>> Integration with some other libraries such as Mahout, H2O, Weka would be
>> also good. Already there are some attempts using Flink or other distributed
>> data processing framework as a backend of other library [1] [2] [3]. But I
>> think, as you can see the link, we have to re-implement many algorithms
>> even though we integrate other library with Flink. I doubt if there is a
>> big development advantage of integration.
>> 
>> [1]: https://issues.apache.org/jira/browse/MAHOUT-1570
>> [2]: http://mahout.apache.org/users/basics/algorithms.html
>> [3]: https://github.com/ariskk/distributedWekaSpark
>> 
>> Regards,
>> Chiwan Park
>> 
>>> On Feb 12, 2016, at 7:04 PM, Fabian Hueske  wrote:
>>> 
>>> Hi Theo,
>>> 
>>> thanks for starting this discussion. You are certainly right that the
>>> development of FlinkML is stalling. On the other hand, we regularly see
>>> people on the mailing list asking for features.
>>> 
>>> Regarding your proposed ways to proceed:
>>> 
>>> 1) I am not sure how much it would help to move FlinkML to a separate
>>> repository.
>>> We have discussed moving connectors (and libraries) to separate
>>> repositories before, but the thread fell asleep [1].
>>> We would still need committers to spend time with reviewing, merging, and
>>> contributing.
>>> So IMO, this is orthogonal to having more committer involvement.
>>> 
>>> 2) Having committers (current / new ones) spending time on FlinkML is the
>>> requirement for keeping it alive within the Flink project.
>>> Adding new committers is kind of a bootstrap problem here because it is
>>> hard for contributors to get involved with FlinkML if very little committer
>>> time is spent on code reviews and merging. Nonetheless, I see this as the
>>> best option.
>>> 
>>> 3) Forking a project on GitHub is certainly possible (even without the
>>> endorsement of the Flink community). However, merging changes back into
>>> Flink would again require a committer to review and merge (probably a much
>>> larger chunk of code) and also require the permission of all contributors.
>>> 
>>> Best,
>>> Fabian
>>> 
>>> [1]
>>> https://mail-archives.apache.org/mod_mbox/flink-dev/201512.mbox/%3CCAGco--aZhZhrrSzzPROwXwmtYmD5CkoGKe7xNCWG1Vw7V-D%2BaA%40mail.gmail.com%3E
>>> 
>>> 2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
>>> theodoros.vasilou...@gmail.com>:
>>> 
>>>> Hello all,
>>>>
>>>> I would like to get a conversation started on how we plan to move forward
>>>> with FlinkML.
>>>>
>>>> Development on the library currently has been mostly dormant for the past 6
>>>> months, mainly I believe because of the lack of available committers to
>>>> review PRs.
>>>>
>>>> Last month we got together with Till and Marton and talked about how we
>>>> could try to solve this and ensure continued development of the library.
>>>>
>>>> We see 3 possible paths we could take:
>>>>
>>>>  1. Externalize the library, creating a new repository under the Apache
>>>>     Flink project. This 

Re: Opening a discussion on FlinkML

2016-02-12 Thread Suneel Marthi
My 2 cents as someone who's done ML over the years, having worked on Oryx
2.0 and Mahout and having used Spark MLlib (read as "had no choice due to
strict workplace enforcement"), and who understands their limitations well.

1. FlinkML in its present form seems like "do it like how Spark did it".

2. The recent discussion about PMML support in Flink to my mind is a clear
example of putting the cart before the horse.  Why are we even talking PMML
when there ain't much ML algos in FlinkML?

For a really good implementation of PMML and how it's being used (with
JPMML), I suggest looking at the Oryx 2.0 project. The PMML implementation
in Oryx 2.0 predates Spark and is a clean example of separating PMML from
the underlying framework (Spark or Flink).

We have had PMML discussions on the Mahout project in the past, but the
idea never gained any traction in large part due to PMML spec limitations
(mostly for clustering and classification algorithms) and the lack of
adoption within the community.

See the discussion here and specifically Ted Dunning's comment on PMML -
http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E

Most of the ML in practice (deployed in production) today is Recommenders
and Deep Learning, neither of which is supported by the PMML spec.

3. Leveraging a probabilistic programming language like Figaro might be a
good way to go (just my thought) - that way most of the ML groundwork would
be external to Flink.

4. Within the Mahout community, we have been talking about (and are working
on) redoing the Samsara distributed linear algebra framework to support Flink
(in large part because we realized that Flink is a better platform than the
more popular one out there that Slim wouldn't wanna talk about :) ).

We should have a release out in the next few weeks (depending on
committers' availability). It would be great if FlinkML had something like
it.

There was a good audience for Sebastian's talk on this subject at #FF15 in
October.

5. It's a good idea to add Flink support to H2O, as Slim had suggested
elsewhere in this thread.


Thoughts?



On Fri, Feb 12, 2016 at 5:00 AM, Simone Robutti <
simone.robu...@radicalbit.io> wrote:

> I will give my opinion as someone who has worked with SparkML and will soon
> be involved in the development of ML solutions on Flink.
>
> Lately I have tried to track the evolution and development of FlinkML, and
> I see one big critical point: FlinkML looks a lot like a placeholder for
> commercial purposes, but there's not enough investment and commitment to
> achieve a usable product. I did a few things with FlinkML coming from
> SparkML, and I can say that it's unsuitable for most of the common use cases
> covered by SparkML (which is itself not a good ML library at all in terms of
> usability).
>
> So my question is: do we really need FlinkML? The roadmap looks a lot like
> "Spark has SparkML so we MUST have a ML library too". This could be
> reasonable if you aim at a fine-tuned library tailored to the specifics of
> Flink that are different from Spark. This could be even better if you
> developed an implementation of SGD that exploits the computational model of
> Flink, which, I think, could achieve a lot more than the current
> implementation. This is a subject that I want to study better before saying
> more, but I'm looking at better parallelization strategies for data and
> models.
>
> Going back to FlinkML, do we really need to reimplement the same workhorse
> algorithms already implemented in SparkML, H2O, Mahout, SystemML, Weka,
> Oryx and other distributed learning libraries? Is it really useful at this
> stage? Given the current resources of the project, wouldn't it be more
> reasonable to invest time and energy in integrating more mature libraries
> (and eventually rich tooling that would give a big advantage over the other
> libraries)?
>
> I would like to comment on your proposals, but my experience in
> collaborative open source development is way too limited to form an
> interesting opinion. I also have no historical visibility into the
> motivations and discussions behind the development of FlinkML, and I would
> like pointers to something to read on the shared vision for this part of the
> project so that I can join the discussion from now on.
>
> Thanks,
>
> Simone
>
>
>
> 2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com>:
>
> > Hello all,
> >
> > I would like to get a conversation started on how we plan to move forward
> > with FlinkML.
> >
> > Development on the library currently has been mostly dormant for the past 6
> > months, mainly I believe because of the lack of available committers to
> > review PRs.
> >
> > Last month we got together with Till and Marton and talked about how we
> > could try to solve this and ensure continued development of the library.
> >
> > We see 3 possible paths we could take:
> >
> >1.
> >
>