On Fri, Feb 12, 2016 at 9:40 AM, Simone Robutti <simone.robu...@radicalbit.io> wrote:

> @Suneel
>
> 1) Totally agree, as I wrote before.
>
> 2) I agree that support for PMML is premature, but we shouldn't
> underestimate the variety and complexity of the uses of ML models in
> industry. The adoption of Flink will, hopefully, grow and reach less
> innovative realities where Random Forests and SVMs are still the main
> algorithms in use. In these same realities there are legacy systems that
> justify the use of PMML to port models. Still, FlinkML is at an early
> stage so, as you said, it doesn't make sense to spend time on such a
> feature right now.

+1. As I mentioned earlier, the PMML spec only supports classification and
clustering (I last checked this in Aug 2015, and I am pretty sure it has not
changed since then). Hence 'Yes', it has some limited uses; 'No', it's too
premature to even talk about it given the present state of FlinkML.

> 3) This would be really interesting. How do you imagine that the
> integration with a distributed processing engine would work?

I am not sure yet. We are still exploring this on the Mahout project as an
addition to Mahout-Samsara. Most of the statistics and probabilistic modeling
would then be supported by Figaro (Bayesian, MCMC, etc.) and hence can be
external to FlinkML. Figaro is Scala based; see https://github.com/p2t2/figaro

I believe there are a few other similar DSLs out there; I need to dig up my
old emails. (Not sure if it's ASLv2 licensed; this needs verification.)

> 5) Agree on this one too. To my knowledge it would be the best option
> together with SAMOA (for the streaming part).

There's already a Flink-SAMOA integration in place, IIRC.

> 2016-02-12 15:25 GMT+01:00 Suneel Marthi <smar...@apache.org>:
>
> > My 2 cents as someone who's done ML over the years, having worked on
> > Oryx 2.0 and Mahout, having used Spark MLlib (read as "had no choice due
> > to strict workplace enforcement"), and who understands their limitations
> > well.
> >
> > 1. FlinkML in its present form seems like "do it like how Spark did it".
> >
> > 2. The recent discussion about PMML support in Flink is, to my mind, a
> > clear example of putting the cart before the horse. Why are we even
> > talking about PMML when there aren't many ML algorithms in FlinkML?
> >
> > For a really good implementation of PMML and how it's being used (with
> > jPMML), I suggest looking at the Oryx 2.0 project. The PMML
> > implementation in Oryx 2.0 predates Spark and is a clean example of
> > separating PMML from the underlying framework (Spark or Flink).
> >
> > We have had PMML discussions on the Mahout project in the past, but the
> > idea never gained any traction, in large part due to PMML spec
> > limitations (it mostly covers clustering and classification algorithms)
> > and the lack of adoption within the community.
> >
> > See the discussion here, and specifically Ted Dunning's comment on PMML:
> >
> > http://mail-archives.apache.org/mod_mbox/mahout-dev/201503.mbox/%3CCAJwFCa1%3DAw%2B3G54FgkYdTH%3DoNQBRqfeU-SS19iCFKMWbAfWzOQ%40mail.gmail.com%3E
> >
> > Most of the ML deployed in production today is Recommenders and Deep
> > Learning, neither of which is supported by the PMML spec.
> >
> > 3. Leveraging a probabilistic programming language like Figaro might be
> > a good way to go (just my thought); that way most of the ML groundwork
> > would be external to Flink.
> >
> > 4. Within the Mahout community, we have been talking about (and are
> > working on) redoing the Samsara distributed linear algebra framework to
> > support Flink (in large part because we realized that Flink is a better
> > platform than the more popular one out there that Slim wouldn't wanna
> > talk about :) ).
> >
> > We should have a release out in the next few weeks (depending on
> > committers' availability). It would be great if FlinkML had something
> > like it.
> >
> > There was a good audience for Sebastian's talk on this subject at #FF15
> > in October.
> >
> > 5. It's a good idea to add Flink support to H2O, as Slim had suggested
> > elsewhere in this thread.
> >
> > Thoughts?
> >
> > On Fri, Feb 12, 2016 at 5:00 AM, Simone Robutti <simone.robu...@radicalbit.io> wrote:
> >
> > > I will give my opinion as a person who has worked with SparkML and
> > > will soon be involved in the development of ML solutions on Flink.
> > >
> > > Over the past few days I have tried to track the evolution and
> > > development of FlinkML, and I see one big critical point: FlinkML
> > > looks a lot like a placeholder for commercial purposes, but there's
> > > not enough investment and commitment to achieve a usable product. I
> > > did a few things with FlinkML, coming from SparkML, and I can say that
> > > it's unsuitable for most of the common use cases covered by SparkML
> > > (which is itself not a good ML library at all in terms of usability).
> > >
> > > So my question is: do we really need FlinkML? The roadmap looks a lot
> > > like "Spark has SparkML so we MUST have an ML library too". This could
> > > be reasonable if you aim at a fine-tuned library tailored to the
> > > specifics of Flink that differ from Spark. It could be even better if
> > > you developed an implementation of SGD that exploits the computational
> > > model of Flink, which, I think, could achieve a lot more than the
> > > current implementation. This is a subject that I want to study further
> > > before saying more, but I'm looking at better parallelization
> > > strategies for data and models.
> > >
> > > Going back to FlinkML, do we really need to reimplement the same
> > > workhorse algorithms already implemented in SparkML, H2O, Mahout,
> > > SystemML, Weka, Oryx and other distributed learning libraries? Is it
> > > really useful at this stage?
> > > Given the current resources of the project, wouldn't it be more
> > > reasonable to invest time and energy in integrating more mature
> > > libraries (and eventually rich tooling that would give a big advantage
> > > over the other libraries)?
> > >
> > > I would like to comment on your proposals, but my experience in
> > > collaborative open source development is way too limited to form an
> > > interesting opinion. Also, I have had no visibility into the
> > > motivations and discussions behind the development of FlinkML, and I
> > > would like pointers to reading on the shared vision for this part of
> > > the project so that I can join the discussion from now on.
> > >
> > > Thanks,
> > >
> > > Simone
> > >
> > > 2016-02-12 10:23 GMT+01:00 Theodore Vasiloudis <theodoros.vasilou...@gmail.com>:
> > >
> > > > Hello all,
> > > >
> > > > I would like to get a conversation started on how we plan to move
> > > > forward with FlinkML.
> > > >
> > > > Development on the library has been mostly dormant for the past 6
> > > > months, mainly, I believe, because of the lack of available
> > > > committers to review PRs.
> > > >
> > > > Last month we got together with Till and Marton and talked about how
> > > > we could try to solve this and ensure continued development of the
> > > > library.
> > > >
> > > > We see 3 possible paths we could take:
> > > >
> > > > 1. Externalize the library, creating a new repository under the
> > > >    Apache Flink project. This decouples the development of FlinkML
> > > >    from the Flink release cycle, allowing us to move faster and
> > > >    incorporate new features as they become available. As FlinkML is
> > > >    a library under development, tying it to specific versions does
> > > >    not make much sense anyway. The library would depend on the
> > > >    latest snapshot version of Flink.
> > > >    It would then be possible for the Flink distribution to
> > > >    cherry-pick parts of the library to be included with the core
> > > >    distribution.
> > > >
> > > > 2. Keep the development under the main Flink project but bring in
> > > >    new committers. This would mean that development remains as is
> > > >    and is tied to core Flink releases, but new work should get
> > > >    merged at much more regular intervals with the help of committers
> > > >    other than Till. Marton Balassi has volunteered for that role and
> > > >    I hope that more might take it up.
> > > >
> > > > 3. A third option is to fork FlinkML into a repository to which we
> > > >    are able to commit freely (again through PRs and reviews, of
> > > >    course) and merge good parts back into the main repo once in a
> > > >    while. This allows for faster progress and more experimental work
> > > >    but obviously creates fragmentation.
> > > >
> > > > I would like to hear your thoughts on these three options, as well
> > > > as discuss other alternatives that could help move FlinkML forward.
> > > >
> > > > Cheers,
> > > > Theodore