Thanks Henry! Do you know of a good source that gives pointers or examples of how to interact with H2O?
Stephan

On Sun, Jan 4, 2015 at 7:14 PM, Till Rohrmann <trohrm...@apache.org> wrote:

> The idea to work with H2O sounds really interesting.
>
> In terms of the Mahout DSL this would mean that we have to translate a
> Flink dataset into H2O's basic abstraction of distributed data and vice
> versa. Everything other than writing to disk with one system and reading
> from there with the other is probably non-trivial and hard to realize.
>
> On Jan 4, 2015 9:18 AM, "Henry Saputra" <henry.sapu...@gmail.com> wrote:
>
> > Happy new year all!
> >
> > Like the idea to add ML module with Flink.
> >
> > As I have mentioned to Kostas, Stephan, and Robert before, I would
> > love to see if we could work with H2O project [1], and it seemed like
> > the community has added support for it for Apache Mahout backend
> > binding [2].
> >
> > So we might get some additional scale ML algos like deep learning.
> >
> > Definitely would love to help with this initiative =)
> >
> > - Henry
> >
> > [1] https://github.com/h2oai/h2o-dev
> > [2] https://issues.apache.org/jira/browse/MAHOUT-1500
> >
> > On Fri, Jan 2, 2015 at 6:46 AM, Stephan Ewen <se...@apache.org> wrote:
> > > Hi everyone!
> > >
> > > Happy new year, first of all and I hope you had a nice end-of-the-year
> > > season.
> > >
> > > I thought that it is a good time now to officially kick off the
> > > creation of a library of machine learning algorithms. There are a lot
> > > of individual artifacts and algorithms floating around which we should
> > > consolidate.
> > >
> > > The machine-learning library in Flink would stand on two legs:
> > >
> > > - A collection of efficient implementations for common problems and
> > > algorithms, e.g., Regression (logistic), clustering (k-Means, Canopy),
> > > Matrix Factorization (ALS), ...
> > >
> > > - An adapter to the linear algebra DSL in Apache Mahout.
> > >
> > > In the long run, it would be the goal to be able to mix and match code
> > > from both parts.
> > > The linear algebra DSL is very convenient when it comes to quickly
> > > composing an algorithm, or some custom pre- and post-processing steps.
> > > For some complex algorithms, however, a low level system specific
> > > implementation is necessary to make the algorithm efficient.
> > > Being able to call the tailored algorithms from the DSL would combine
> > > the benefits.
> > >
> > >
> > > As a concrete initial step, I suggest to do the following:
> > >
> > > 1) We create a dedicated maven sub-project for that ML library
> > > (flink-lib-ml). The project gets two sub-projects, one for the
> > > collection of specialized algorithms, one for the Mahout DSL
> > >
> > > 2) We add the code for the existing specialized algorithms. As followup
> > > work, we need to consolidate data types between those algorithms, to
> > > ensure that they can easily be combined/chained.
> > >
> > > 3) The code for the Flink bindings to the Mahout DSL will actually
> > > reside in the Mahout project, which we need to add as a dependency to
> > > flink-lib-ml.
> > >
> > > 4) We add some examples of Mahout DSL algorithms, and a template how to
> > > use them within Flink programs.
> > >
> > > 5) Create a good introductory readme.md, outlining this structure. The
> > > readme can also track the implemented algorithms and the ones we put on
> > > the roadmap.
> > >
> > >
> > > Comments welcome :-)
> > >
> > >
> > > Greetings,
> > > Stephan