On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath <[email protected]> wrote:
> Hi dev team
>
> (Apologies for a long email!)
>
> Firstly, great news about the inclusion of MLlib into the Spark project!
>
> I've been working on a concept and some code for a machine learning
> library on Spark, so of course there is a lot of overlap between MLlib
> and what I've been doing.
>
> I wanted to throw this out there and (a) ask a couple of design and
> roadmap questions about MLlib, and (b) talk about how to work together /
> integrate my ideas (if at all :)
>
> *Some questions*
>
> 1. What is the general design idea behind MLlib - is it aimed at being a
> collection of algorithms, i.e. a library? Or is it aimed at being a
> "Mahout for Spark", i.e. something that can be used as a library as well
> as a set of tools for things like running jobs, feature extraction, text
> processing etc.?
>
> 2. How married are we to keeping it within the Spark project? While I
> understand the reasoning behind it, I am not convinced it's best. But I
> guess we can wait and see how it develops.
>
> 3. Some of the original test code I saw around the Block ALS did use
> Breeze (https://github.com/dlwh/breeze) for some of the linear algebra.
> Now I see everything is using JBLAS directly and Array[Double]. Is there
> a specific reason for this? Is it aimed at creating a separation whereby
> the linear algebra backend could be switched out? Scala 2.10 issues?
>
> 4. Since Spark is meant to be nicely compatible with Hadoop, do we care
> about compatibility/integration with Mahout? This may also encourage
> Mahout developers to switch over and contribute their expertise (see for
> example Dmitry's work at
> https://github.com/dlyubimov/mahout-commits/commits/dev-0.8.x-scala/math-scala

It's the 0.9.x-scala branch now. We've released 0.8.

> where he is doing a Scala/Spark DSL around mahout-math matrices and
> distributed operations). Potentially even using mahout-math for linear
> algebra routines?
>
> 5. Is there a roadmap? (I've checked the JIRA, which does have a few
> intended models etc.) Who are the devs most involved in this project?
>
> 6. What are the thoughts around API design for models?
>
> *Some thoughts*
>
> So, over the past couple of months I have been working on a machine
> learning library. Initially it was for my own use, but I've added a few
> things and was starting to think about releasing it (though it's not
> nearly ready). The model I really needed first was ALS for doing
> recommendations, so I ported the ALS code from Mahout to Spark. Well,
> "ported" in some sense - mostly I copied the algorithm and data
> distribution design, using Spark's primitives and Breeze for all the
> linear algebra.
>
> I found it pretty straightforward to port over. So far I have done local
> testing only, on the MovieLens datasets, and my RMSE results match
> Mahout's. Interestingly, overall wall-clock performance is closer to
> Mahout's than I would have expected, but I would now like to run some
> larger-scale tests on a cluster to do a proper comparison.
>
> Obviously, with Spark's Block ALS model my version is now somewhat
> superfluous, since I expect (and have so far seen in my simple local
> experiments) that the block model will significantly outperform it. I
> will probably port my use case over to it in due time, once I've done
> further testing.
>
> I also found Breeze very nice to work with and I like the DSL - hence my
> question about why not use that? (Especially now that Breeze is actually
> just breeze-math and breeze-viz.)
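
For reference, this is the sort of thing the Breeze DSL buys over raw
JBLAS and Array[Double]: a single ALS least-squares update becomes
roughly a one-line solve. A minimal sketch with illustrative names and
shapes, not code from MLlib or Nick's library:

  import breeze.linalg.{DenseMatrix, DenseVector}

  // One ALS user-factor update: solve (Y^t Y + lambda * I) x = Y^t r,
  // where Y holds the factors of the items this user has rated and
  // r holds the user's ratings of those items.
  def updateFactor(Y: DenseMatrix[Double],
                   r: DenseVector[Double],
                   lambda: Double): DenseVector[Double] = {
    val k = Y.cols
    val A = Y.t * Y + DenseMatrix.eye[Double](k) * lambda
    val b = Y.t * r
    A \ b // Breeze's dense linear solve
  }
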
> Anyway, I then added KMeans (basically just the Spark example with some
> Breeze tweaks), and started working on a Linear Model framework. I've
> also added a simple framework for arg parsing and config (using Twitter
> Algebird's Args and Typesafe Config), and have started on feature
> extraction stuff - of particular interest will be text feature
> extraction and feature hashing.
>
> This is roughly the idea I have for a machine learning library on Spark
> - call it a design or manifesto or whatever:
>
> - Library available and consistent across Scala, Java and Python (as
> much as possible, in any event)
> - A core library and also a set of stuff for easily running models based
> on standard input formats etc.
> - Standardised model API (even across languages) to the extent possible.
> I've based mine so far on Python's scikit-learn (.fit(), .predict()
> etc.). Why? I believe it's a major strength of scikit-learn that its API
> is so clean, simple and consistent. Plus, for the Python version of the
> lib, scikit-learn will no doubt be used wherever possible to avoid
> re-creating code
> - Models to be included initially:
>   - ALS
>   - Possibly co-occurrence recommendation stuff similar to Mahout's
>     Taste
>   - Clustering (K-Means and potentially others)
>   - Linear Models - the idea here is to have something very close to
>     Vowpal Wabbit, i.e. a generic SGD engine with various loss
>     functions, learning rate paradigms etc. This would also allow other
>     models similar to VW, such as online versions of matrix
>     factorisation, neural nets and learning reductions
>   - Possibly Decision Trees / Random Forests
> - Some utilities for feature extraction (hashing in particular), and to
> make running jobs easy (integration with Spark's ./run etc.?)
> - Stuff for making pipelining easy (like scikit-learn) and for doing
> things like cross-validation in a principled (and parallel) way
> - Clean and easy integration with Spark Streaming for online models
> (e.g. a linear SGD model can be called with fit() on batch data, and
> then fit() and/or predict() on streaming data to learn further online)
> - Interactivity provided by shells (IPython, Spark shell) and also
> plotting capability (Matplotlib, and Breeze Viz)
> - For Scala, integration with Shark via sql2rdd etc.
> - I'd like to create something similar to Scalding's Matrix API based on
> RDDs for representing distributed matrices, as well as integrate the
> ideas of Dmitry and Mahout's DistributedRowMatrix etc.
>
> Here is a rough outline of the model API I am using at the moment:
> https://gist.github.com/MLnick/6068841. This works nicely for ALS,
> clustering, linear models etc.
>
> So, as you can see, this mostly overlaps with what MLlib already has or
> has planned in some way. But my main aim is frankly consistency in the
> API and some level of abstraction, while keeping things as simple as
> possible (i.e. letting Spark handle the complex stuff) - and thus
> hopefully avoiding a somewhat haphazard collection of models that is
> hard to figure out how to use, which is unfortunately what I believe has
> happened to Mahout.
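
The gist above spells out the actual traits; the scikit-learn-style
contract it describes boils down to roughly the following. This is a
sketch with illustrative names and signatures, not the gist's exact code
(spark.RDD is the pre-0.8 package name):

  import spark.RDD

  // A fitted model can score a single example or a whole RDD ...
  trait Model[T] extends Serializable {
    def predict(example: T): Double
    def predict(examples: RDD[T]): RDD[Double]
  }

  // ... and an estimator is anything whose fit() produces such a model.
  trait Estimator[T, M <: Model[T]] {
    def fit(data: RDD[T]): M
  }

On the same pattern, the VW-style generic SGD engine in the list above
would presumably parameterise a single update loop over a pluggable loss
function, along these lines (again purely illustrative):

  // d(loss)/d(prediction); swapping implementations gives squared,
  // logistic, hinge loss etc. without touching the SGD loop itself.
  trait Loss {
    def loss(prediction: Double, label: Double): Double
    def gradient(prediction: Double, label: Double): Double
  }
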
> So the question then is, how to work together or integrate? I see 3
> options:
>
> 1. I go my own way (not very appealing, obviously)
> 2. Contribute what I have (or as much as makes sense) to MLlib
> 3. Create my project as a "front-end" or "wrapper" around MLlib as the
> core, effectively providing the API and workflow interface but using
> MLlib as the model engine
>
> #2 is appealing, but then a lot depends on the API and framework design,
> and on how compatible what I have in mind is with what the rest of the
> devs envision.
>
> #3, now that I have written it down, starts to sound pretty interesting
> - potentially we're looking at a "front-end" that could in fact execute
> models on Spark (or other engines like Hadoop/Mahout, GraphX etc.),
> while providing workflows for pipelining transformations, feature
> extraction, testing and cross-validation, and data viz.
>
> But of course #3 starts sounding somewhat like what MLBase is aiming to
> be (I think)!
>
> At this point I'm willing to show what I have done so far on a selective
> basis, if that makes sense - be warned, though, that it is rough,
> unfinished and perhaps somewhat clunky, as it's my first attempt at a
> library/framework. Especially since the main thing I did was the ALS
> port, and with MLlib's version of ALS that may be less useful now in any
> case.
>
> It may be that none of this is that useful to others anyway, which is
> fine - I'll keep developing the tools I need, and potentially they will
> be useful at some point.
>
> Thoughts, feedback, comments, discussion? I really want to jump into
> MLlib and get involved in contributing to standardised machine learning
> on Spark!
>
> Nick
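
On the pipelining and cross-validation point: a scikit-learn-style
pipeline on top of the Model/Estimator sketch above might look something
like this (again illustrative only):

  import spark.RDD

  // A transformer maps one RDD to another (feature hashing, scaling
  // and other feature-extraction steps would implement this).
  trait Transformer[A, B] {
    def transform(data: RDD[A]): RDD[B]
  }

  // A one-stage pipeline: apply the transform, then fit the estimator
  // on the transformed data. Longer pipelines compose by nesting.
  class Pipeline[A, B, M <: Model[B]](stage: Transformer[A, B],
                                      estimator: Estimator[B, M]) {
    def fit(data: RDD[A]): M = estimator.fit(stage.transform(data))
  }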
