--- sent from my phone

On Jul 25, 2013 6:44 AM, "Nick Pentreath" <[email protected]> wrote:
Hi,

Ok, that all makes sense. I can definitely see the benefit of good standard libraries, and I guess the pieces that felt "missing" to me were what you are describing as MLI and the ML Optimizer.

It seems like the aims of MLI are very much in line with what I have/had in mind for an ML library/framework; the goals overlap quite a lot.

I guess one "frustration" I have had is that there are all these great BDAS projects, but we never really know when they will be released or what they will look like until they are. In this particular case I couldn't wait for MLlib, so I ended up porting Mahout's ALS myself and have of course duplicated effort (which is not a problem, as it was necessary at the time and has been a great learning experience).

Similarly for GraphX: I would like to develop a Spark-based version of Faunus (https://github.com/thinkaurelius/faunus) for batch processing of data in our Titan graph DB. For now I am working with Bagel-based primitives and Spark RDDs directly. I would love to use GraphX instead, but I have no idea when it will be released and can have little involvement until it is.

(I use "frustration" in the nicest way here - I love the BDAS concepts and all the projects coming out, I just want them all to be released NOW!! :)

So yes, I would love to be involved in the MLlib and MLI work, to the extent that I can assist and that it is aligned with what I currently need in my own projects (that is just from a time-allocation viewpoint - I'm sure much of it will be complementary).

Anyway, it seems to me the best course of action is to get involved in MLlib and see how I can contribute there. Some things that jump out:

- Implicit preference capability for the ALS model, since as far as I can see it currently handles explicit prefs only. Implicit prefs as per http://68.180.206.246/files/HuKorenVolinsky-ICDM08.pdf, which is typically better when we don't have actual rating data but rather "view", "click" or "play" counts. (See the first sketch after this list.)
- RMSE and other evaluation metrics for ALS, plus test/train split and cross-validation utilities.
- Linear model additions: new loss functions for SGD such as hinge loss and least squares, learning-rate schedules (http://arxiv.org/pdf/1305.6646) and regularisers (L1/L2/Elastic Net) - i.e. bring the SGD stuff in line with Vowpal Wabbit / sklearn (if that's desirable; my view is yes). The second sketch after this list shows the kind of API shape I mean.
- Sparse weight and feature vectors for linear models/SGD. Together with hashing this allows very large models while staying efficient, and it is particularly useful with L1 regularisation.
- Finally, what about online models? SGD models are currently "static": once trained they can only predict, whereas SGD can of course keep learning. Or does one simply re-train with the previous weight vector as the initial value? (I guess that can work just as well.) Also on this topic: training and predicting on Streams as well as RDDs.
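On the implicit-preference point, the heart of the Hu/Koren/Volinsky formulation is a confidence-weighted version of the ALS objective. A minimal sketch of that weighting, with illustrative names rather than MLlib's actual API, might look like:

```scala
// Sketch of the confidence weighting from the implicit-feedback ALS paper
// (Hu, Koren & Volinsky, ICDM '08). Rating, alpha and toImplicit are
// illustrative names, not part of MLlib.
object ImplicitPrefsSketch {
  case class Rating(user: Int, product: Int, rating: Double)

  // Confidence scaling for raw interaction counts ("views", "clicks",
  // "plays"); a tunable hyperparameter (the paper suggests alpha = 40).
  val alpha = 40.0

  // Map a raw count r_ui to a binary preference p_ui and a confidence
  // c_ui = 1 + alpha * r_ui. ALS then minimises the weighted error
  // sum over (u, i) of c_ui * (p_ui - x_u . y_i)^2, plus regularisation.
  def toImplicit(r: Rating): (Double, Double) = {
    val preference = if (r.rating > 0) 1.0 else 0.0
    val confidence = 1.0 + alpha * r.rating
    (preference, confidence)
  }
}
```

The per-user and per-item least-squares solves keep the same shape as in explicit ALS; the confidences simply enter the normal equations as weights, which is why this slots into an existing ALS implementation fairly cleanly.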
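And for the linear-model bullet, the kind of pluggable loss/regulariser design I have in mind is roughly the following. All names are hypothetical, and it is a sketch of the API shape rather than a proposed implementation:

```scala
// Sketch of a pluggable loss interface for SGD, in the spirit of
// Vowpal Wabbit / sklearn. Hypothetical names, not MLlib's API.
object SgdSketch {
  trait Loss {
    // Gradient of the loss with respect to the margin w . x.
    def gradient(margin: Double, label: Double): Double
  }

  object SquaredLoss extends Loss {
    def gradient(margin: Double, label: Double): Double = margin - label
  }

  object HingeLoss extends Loss {
    // Labels assumed to be in {-1, +1}.
    def gradient(margin: Double, label: Double): Double =
      if (label * margin < 1.0) -label else 0.0
  }

  // One SGD step with L2 regularisation. Taking the current weights as
  // input is what makes warm-starting (the "online" point above) trivial:
  // keep feeding new examples through the same update.
  def step(w: Array[Double], x: Array[Double], label: Double,
           loss: Loss, eta: Double, lambda: Double): Array[Double] = {
    val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
    val g = loss.gradient(margin, label)
    w.zip(x).map { case (wi, xi) => wi - eta * (g * xi + lambda * wi) }
  }
}
```

An L1 or Elastic Net penalty would swap a different regularisation term into the same update, and sparse weight/feature vectors would only change the representation of w and x, not the interface.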
Separately, I can put up what I have done so far on a BitBucket account and grant access to whichever devs would like to take a look. The only reason I don't just throw it up on GitHub is that frankly it is not really ready and is not a fully-fledged project yet (I think, anyway). Possibly some of it can be useful - not that there's all that much there apart from the ALS (though it does solve for both explicit and implicit preference data, as per Mahout's implementation), a KMeans (simpler than the one in MLlib, as I didn't yet get around to doing the KMeans++ initialisation), and the arg-parsing / job-runner code (which may or may not be interesting, both for ML and for Spark jobs in general).

Let me know your thoughts,
Nick


On Wed, Jul 24, 2013 at 10:09 PM, Ameet Talwalkar <[email protected]> wrote:

Hi Nick,

Thanks for your email - it's great to see such excitement around this work! Matei and Reynold have already addressed the motivation behind MLlib and our reasons for not using Breeze, so I'd like to give you some background about MLbase and discuss how it may fit with your interests.

There are three components of MLbase:

1) MLlib: As Matei mentioned, this is an ML library in Spark with core ML kernels and solid implementations of common algorithms that can be used easily from Java/Python and also called into by higher-level systems (e.g. MLI, Shark, PySpark).

2) MLI: This is an ML API that provides a common interface for ML algorithms (the same interface used in MLlib) and introduces high-level abstractions to simplify feature extraction/exploration and ML algorithm development. These abstractions leverage the kernels in MLlib where possible, and also introduce additional kernels. The work includes a library written against the MLI. The MLI is currently written against Spark, but it is designed to be platform-independent, so that code written against it could run on different engines (e.g. Hadoop, GraphX, etc.).

3) ML Optimizer: This piece automates the task of model selection. The optimizer can be viewed as a search over the feature extractors and algorithms in the MLI library, based in part on efficient cross-validation. It is under active development, but at an earlier stage than MLlib and MLI.

(Note: MLlib will be included in the Spark codebase, while the MLI and ML Optimizer will live in separate repositories.)

As far as I can tell (though please correct me if I've misunderstood), your main goals include:

i) "consistency in the API"
ii) "some level of abstraction but to keep things as simple as possible"
iii) "execute models on Spark ... while providing workflows for pipelining transformations, feature extraction, testing and cross-validation, and data viz."

The MLI (and to some extent the ML Optimizer) is very much in line with these goals, and it would be great if you were interested in contributing to it. MLI is a private repository right now, but we will make it public soon, and Evan Sparks or I will let you know when we do.

Thanks again for getting in touch with us!

-Ameet
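(Since MLI is still private, one can only guess at the API, but a platform-independent "common interface for ML algorithms" along the lines Ameet describes in point 2 might look something like the sketch below; every name in it is an assumption, not the real MLI.)

```scala
// Speculative sketch of a platform-independent common interface for ML
// algorithms. None of these names come from the actual (private) MLI code.
object MliSketch {
  // A trained model: something that can score a feature vector.
  trait Model {
    def predict(features: Array[Double]): Double
  }

  // An algorithm is parameterised by its dataset type D, so the same
  // interface could sit on top of Spark RDDs, Hadoop inputs, GraphX
  // structures, and so on.
  trait Algorithm[D, M <: Model] {
    def train(data: D): M
  }

  // For example, a hypothetical Spark-backed implementation would be
  // declared as:
  //   class LogisticRegression extends Algorithm[RDD[LabeledPoint], LrModel]
  // while another engine could reuse the same traits with its own D.
}
```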
On Wed, Jul 24, 2013 at 11:47 AM, Reynold Xin <[email protected]> wrote:

On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath <[email protected]> wrote:

> I also found Breeze to be very nice to work with and like the DSL - hence
> my question about why not use that? (Especially now that Breeze is
> actually just breeze-math and breeze-viz.)

Matei addressed this from a higher level; I want to provide a little bit more context. A common property of a lot of high-level Scala DSL libraries is that simple operators tend to carry high virtual-function overheads and create a lot of temporary objects. And because the level of abstraction is so high, it is fairly hard to debug and optimize performance.

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org
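(To make the temporaries point concrete, here is a deliberately naive toy example of the pattern being described - illustration code only, not Breeze's actual implementation:)

```scala
// Toy operator-overloading DSL showing the overhead described above: each
// operator allocates a fresh vector and goes through virtual dispatch, so
// nothing fuses into a single loop.
class Vec(val data: Array[Double]) {
  def +(other: Vec): Vec = // allocates one temporary Vec
    new Vec(data.zip(other.data).map { case (a, b) => a + b })
  def *(other: Vec): Vec = // allocates another temporary Vec
    new Vec(data.zip(other.data).map { case (a, b) => a * b })
}

object DslOverheadSketch {
  def main(args: Array[String]): Unit = {
    val a = new Vec(Array(1.0, 2.0))
    val b = new Vec(Array(3.0, 4.0))
    val c = new Vec(Array(5.0, 6.0))
    // a + b * c first materialises the temporary (b * c), then a second
    // array for the sum; a hand-written loop would do both operations in
    // one pass with no intermediate allocations.
    val result = a + b * c
    println(result.data.mkString(", ")) // 16.0, 26.0
  }
}
```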
