On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath <[email protected]> wrote:
> Hi dev team
>
> (Apologies for a long email!)
>
> Firstly, great news about the inclusion of MLlib into the Spark project!
>
> I've been working on a concept and some code for a machine learning
> library on Spark, so of course there is a lot of overlap between MLlib
> and what I've been doing.
>
> I wanted to throw this out there and (a) ask a couple of design and
> roadmap questions about MLlib, and (b) talk about how to work together /
> integrate my ideas (if at all :)
>
> *Some questions*
>
> 1. What is the general design idea behind MLlib - is it aimed at being a
> collection of algorithms, i.e. a library? Or is it aimed at being a
> "Mahout for Spark", i.e. something that can be used as a library as well
> as a set of tools for things like running jobs, feature extraction, text
> processing etc.?
>
> 2. How married are we to keeping it within the Spark project? While I
> understand the reasoning behind it, I am not convinced it's best. But I
> guess we can wait and see how it develops.
>
> 3. Some of the original test code I saw around the Block ALS did use
> Breeze (https://github.com/dlwh/breeze) for some of the linear algebra.
> Now I see everything is using JBLAS directly and Array[Double]. Is there
> a specific reason for this? Is it aimed at creating a separation whereby
> the linear algebra backend could be switched out? Scala 2.10 issues?
>
> 4. Since Spark is meant to be nicely compatible with Hadoop, do we care
> about compatibility/integration with Mahout? This may also encourage
> Mahout developers to switch over and contribute their expertise (see for
> example Dmitry's work at
> https://github.com/dlyubimov/mahout-commits/commits/dev-0.8.x-scala/math-scala

It's the 0.9.x-scala branch now. We've released 0.8.

> where he is doing a Scala/Spark DSL around mahout-math matrices and
> distributed operations). Potentially even using mahout-math for linear
> algebra routines?
>
> 5. Is there a roadmap? (I've checked the JIRA, which does have a few
> intended models etc.) Who are the devs most involved in this project?
>
> 6. What are the thoughts around API design for models?
>
> *Some thoughts*
>
> So, over the past couple of months I have been working on a machine
> learning library. Initially it was for my own use, but I've added a few
> things and was starting to think about releasing it (though it's not
> nearly ready). The model I really needed first was ALS for doing
> recommendations, so I ported the ALS code from Mahout to Spark. Well,
> "ported" in some sense - mostly I copied the algorithm and data
> distribution design, using Spark's primitives and Breeze for all the
> linear algebra.
>
> I found it pretty straightforward to port over. So far I have done local
> testing only, on the MovieLens datasets, and my RMSE results match
> Mahout's. Interestingly, overall wall-clock performance is closer to
> Mahout's than I would have expected, but I would now like to run some
> larger-scale tests on a cluster to do a proper comparison.
>
> Obviously, with Spark's Block ALS model my version is now somewhat
> superfluous, since I expect (and have so far seen in my simple local
> experiments) that the block model will significantly outperform it. I
> will probably port my use case over to it in due time, once I've done
> further testing.
>
> I also found Breeze very nice to work with and I like the DSL - hence my
> question about why not use that? (Especially now that Breeze is actually
> just breeze-math and breeze-viz.)
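
For reference, this is the sort of thing the Breeze DSL buys over raw
JBLAS and Array[Double]: a single ALS least-squares update becomes
roughly a one-line solve. A minimal sketch with illustrative names and
shapes, not code from MLlib or Nick's library:

  import breeze.linalg.{DenseMatrix, DenseVector}

  // One ALS user-factor update: solve (Y^t Y + lambda * I) x = Y^t r,
  // where Y holds the factors of the items this user has rated and
  // r holds the user's ratings of those items.
  def updateFactor(Y: DenseMatrix[Double],
                   r: DenseVector[Double],
                   lambda: Double): DenseVector[Double] = {
    val k = Y.cols
    val A = Y.t * Y + DenseMatrix.eye[Double](k) * lambda
    val b = Y.t * r
    A \ b // Breeze's dense linear solve
  }
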
> Anyway, I then added KMeans (basically just the Spark example with some
> Breeze tweaks), and started working on a Linear Model framework. I've
> also added a simple framework for arg parsing and config (using Twitter
> Algebird's Args and Typesafe Config), and have started on feature
> extraction stuff - of particular interest will be text feature
> extraction and feature hashing.
>
> This is roughly the idea I have for a machine learning library on Spark
> - call it a design or manifesto or whatever:
>
> - Library available and consistent across Scala, Java and Python (as
> much as possible, in any event)
> - A core library and also a set of stuff for easily running models based
> on standard input formats etc.
> - Standardised model API (even across languages) to the extent possible.
> I've based mine so far on Python's scikit-learn (.fit(), .predict()
> etc.). Why? I believe it's a major strength of scikit-learn that its API
> is so clean, simple and consistent. Plus, for the Python version of the
> lib, scikit-learn will no doubt be used wherever possible to avoid
> re-creating code
> - Models to be included initially:
>   - ALS
>   - Possibly co-occurrence recommendation stuff similar to Mahout's
>     Taste
>   - Clustering (K-Means and potentially others)
>   - Linear Models - the idea here is to have something very close to
>     Vowpal Wabbit, i.e. a generic SGD engine with various loss
>     functions, learning rate paradigms etc. This would also allow other
>     models similar to VW, such as online versions of matrix
>     factorisation, neural nets and learning reductions
>   - Possibly Decision Trees / Random Forests
> - Some utilities for feature extraction (hashing in particular), and to
> make running jobs easy (integration with Spark's ./run etc.?)
> - Stuff for making pipelining easy (like scikit-learn) and for doing
> things like cross-validation in a principled (and parallel) way
> - Clean and easy integration with Spark Streaming for online models
> (e.g. a linear SGD model can be called with fit() on batch data, and
> then fit() and/or predict() on streaming data to learn further online)
> - Interactivity provided by shells (IPython, Spark shell) and also
> plotting capability (Matplotlib, and Breeze Viz)
> - For Scala, integration with Shark via sql2rdd etc.
> - I'd like to create something similar to Scalding's Matrix API based on
> RDDs for representing distributed matrices, as well as integrate the
> ideas of Dmitry and Mahout's DistributedRowMatrix etc.
>
> Here is a rough outline of the model API I am using at the moment:
> https://gist.github.com/MLnick/6068841. This works nicely for ALS,
> clustering, linear models etc.
>
> So, as you can see, this mostly overlaps with what MLlib already has or
> has planned in some way. But my main aim is frankly consistency in the
> API and some level of abstraction, while keeping things as simple as
> possible (i.e. letting Spark handle the complex stuff) - and thus
> hopefully avoiding a somewhat haphazard collection of models that is
> hard to figure out how to use, which is unfortunately what I believe has
> happened to Mahout.
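
The gist above spells out the actual traits; the scikit-learn-style
contract it describes boils down to roughly the following. This is a
sketch with illustrative names and signatures, not the gist's exact code
(spark.RDD is the pre-0.8 package name):

  import spark.RDD

  // A fitted model can score a single example or a whole RDD ...
  trait Model[T] extends Serializable {
    def predict(example: T): Double
    def predict(examples: RDD[T]): RDD[Double]
  }

  // ... and an estimator is anything whose fit() produces such a model.
  trait Estimator[T, M <: Model[T]] {
    def fit(data: RDD[T]): M
  }

On the same pattern, the VW-style generic SGD engine in the list above
would presumably parameterise a single update loop over a pluggable loss
function, along these lines (again purely illustrative):

  // d(loss)/d(prediction); swapping implementations gives squared,
  // logistic, hinge loss etc. without touching the SGD loop itself.
  trait Loss {
    def loss(prediction: Double, label: Double): Double
    def gradient(prediction: Double, label: Double): Double
  }
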
> So the question then is, how to work together or integrate? I see 3
> options:
>
> 1. I go my own way (not very appealing, obviously)
> 2. Contribute what I have (or as much as makes sense) to MLlib
> 3. Create my project as a "front-end" or "wrapper" around MLlib as the
> core, effectively providing the API and workflow interface but using
> MLlib as the model engine
>
> #2 is appealing, but then a lot depends on the API and framework design,
> and on how compatible what I have in mind is with what the rest of the
> devs envision.
>
> #3, now that I have written it down, starts to sound pretty interesting
> - potentially we're looking at a "front-end" that could in fact execute
> models on Spark (or other engines like Hadoop/Mahout, GraphX etc.),
> while providing workflows for pipelining transformations, feature
> extraction, testing and cross-validation, and data viz.
>
> But of course #3 starts sounding somewhat like what MLBase is aiming to
> be (I think)!
>
> At this point I'm willing to show what I have done so far on a selective
> basis, if that makes sense - be warned, though, that it is rough,
> unfinished and perhaps somewhat clunky, as it's my first attempt at a
> library/framework. Especially since the main thing I did was the ALS
> port, and with MLlib's version of ALS that may be less useful now in any
> case.
>
> It may be that none of this is that useful to others anyway, which is
> fine - I'll keep developing the tools I need, and potentially they will
> be useful at some point.
>
> Thoughts, feedback, comments, discussion? I really want to jump into
> MLlib and get involved in contributing to standardised machine learning
> on Spark!
>
> Nick
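
On the pipelining and cross-validation point: a scikit-learn-style
pipeline on top of the Model/Estimator sketch above might look something
like this (again illustrative only):

  import spark.RDD

  // A transformer maps one RDD to another (feature hashing, scaling
  // and other feature-extraction steps would implement this).
  trait Transformer[A, B] {
    def transform(data: RDD[A]): RDD[B]
  }

  // A one-stage pipeline: apply the transform, then fit the estimator
  // on the transformed data. Longer pipelines compose by nesting.
  class Pipeline[A, B, M <: Model[B]](stage: Transformer[A, B],
                                      estimator: Estimator[B, M]) {
    def fit(data: RDD[A]): M = estimator.fit(stage.transform(data))
  }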
