--- sent from my phone

On Jul 25, 2013 6:44 AM, "Nick Pentreath" <[email protected]> wrote:
Hi,

Ok, that all makes sense. I can definitely see the benefit of good standard libraries, and I guess the pieces that felt "missing" to me were what you are describing as MLI and the ML Optimizer.

It seems like the aims of MLI are very much in line with what I have/had in mind for an ML library/framework; the goals overlap quite a lot.

I guess one "frustration" I have had is that there are all these great BDAS projects, but we never really know when they will be released or what they will look like until they are. In this particular case I couldn't wait for MLlib, so I ended up porting Mahout's ALS myself and have of course duplicated effort (which is not a problem, as it was necessary at the time and has been a great learning experience).

Similarly for GraphX: I would like to develop a Spark-based version of Faunus (https://github.com/thinkaurelius/faunus) for batch processing of data in our Titan graph DB. For now I am working with Bagel-based primitives and Spark RDDs directly. I would love to use GraphX instead, but I have no idea when it will be released and can have little involvement until it is.

(I use "frustration" in the nicest way here - I love the BDAS concepts and all the projects coming out, I just want them all to be released NOW!! :)

So yes, I would love to be involved in the MLlib and MLI work, to the extent that I can assist and that it is aligned with what I currently need in my own projects (that is just from a time-allocation viewpoint - I'm sure much of it will be complementary).

Anyway, it seems to me the best course of action is to get involved in MLlib and see how I can contribute there. Some things that jump out:

- Implicit preference capability for the ALS model, since as far as I can see it currently handles explicit prefs only. Implicit prefs as per http://68.180.206.246/files/HuKorenVolinsky-ICDM08.pdf, which is typically better when we don't have actual rating data but rather "view", "click" or "play" counts. (See the first sketch after this list.)
- RMSE and other evaluation metrics for ALS, plus test/train split and cross-validation utilities.
- Linear model additions: new loss functions for SGD such as hinge loss and least squares, learning-rate schedules (http://arxiv.org/pdf/1305.6646) and regularisers (L1/L2/Elastic Net) - i.e. bring the SGD stuff in line with Vowpal Wabbit / sklearn (if that's desirable; my view is yes). The second sketch after this list shows the kind of API shape I mean.
- Sparse weight and feature vectors for linear models/SGD. Together with hashing this allows very large models while staying efficient, and it is particularly useful with L1 regularisation.
- Finally, what about online models? SGD models are currently "static": once trained they can only predict, whereas SGD can of course keep learning. Or does one simply re-train with the previous weight vector as the initial value? (I guess that can work just as well.) Also on this topic: training and predicting on Streams as well as RDDs.
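On the implicit-preference point, the heart of the Hu/Koren/Volinsky formulation is a confidence-weighted version of the ALS objective. A minimal sketch of that weighting, with illustrative names rather than MLlib's actual API, might look like:

```scala
// Sketch of the confidence weighting from the implicit-feedback ALS paper
// (Hu, Koren & Volinsky, ICDM '08). Rating, alpha and toImplicit are
// illustrative names, not part of MLlib.
object ImplicitPrefsSketch {
  case class Rating(user: Int, product: Int, rating: Double)

  // Confidence scaling for raw interaction counts ("views", "clicks",
  // "plays"); a tunable hyperparameter (the paper suggests alpha = 40).
  val alpha = 40.0

  // Map a raw count r_ui to a binary preference p_ui and a confidence
  // c_ui = 1 + alpha * r_ui. ALS then minimises the weighted error
  // sum over (u, i) of c_ui * (p_ui - x_u . y_i)^2, plus regularisation.
  def toImplicit(r: Rating): (Double, Double) = {
    val preference = if (r.rating > 0) 1.0 else 0.0
    val confidence = 1.0 + alpha * r.rating
    (preference, confidence)
  }
}
```

The per-user and per-item least-squares solves keep the same shape as in explicit ALS; the confidences simply enter the normal equations as weights, which is why this slots into an existing ALS implementation fairly cleanly.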
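And for the linear-model bullet, the kind of pluggable loss/regulariser design I have in mind is roughly the following. All names are hypothetical, and it is a sketch of the API shape rather than a proposed implementation:

```scala
// Sketch of a pluggable loss interface for SGD, in the spirit of
// Vowpal Wabbit / sklearn. Hypothetical names, not MLlib's API.
object SgdSketch {
  trait Loss {
    // Gradient of the loss with respect to the margin w . x.
    def gradient(margin: Double, label: Double): Double
  }

  object SquaredLoss extends Loss {
    def gradient(margin: Double, label: Double): Double = margin - label
  }

  object HingeLoss extends Loss {
    // Labels assumed to be in {-1, +1}.
    def gradient(margin: Double, label: Double): Double =
      if (label * margin < 1.0) -label else 0.0
  }

  // One SGD step with L2 regularisation. Taking the current weights as
  // input is what makes warm-starting (the "online" point above) trivial:
  // keep feeding new examples through the same update.
  def step(w: Array[Double], x: Array[Double], label: Double,
           loss: Loss, eta: Double, lambda: Double): Array[Double] = {
    val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
    val g = loss.gradient(margin, label)
    w.zip(x).map { case (wi, xi) => wi - eta * (g * xi + lambda * wi) }
  }
}
```

An L1 or Elastic Net penalty would swap a different regularisation term into the same update, and sparse weight/feature vectors would only change the representation of w and x, not the interface.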
Separately, I can put up what I have done so far on a BitBucket account and grant access to whichever devs would like to take a look. The only reason I don't just throw it up on GitHub is that frankly it is not really ready and is not a fully-fledged project yet (I think, anyway). Possibly some of it can be useful - not that there's all that much there apart from the ALS (though it does solve for both explicit and implicit preference data, as per Mahout's implementation), a KMeans (simpler than the one in MLlib, as I didn't yet get around to doing the KMeans++ initialisation), and the arg-parsing / job-runner code (which may or may not be interesting, both for ML and for Spark jobs in general).

Let me know your thoughts,
Nick


On Wed, Jul 24, 2013 at 10:09 PM, Ameet Talwalkar <[email protected]> wrote:

Hi Nick,

Thanks for your email - it's great to see such excitement around this work! Matei and Reynold have already addressed the motivation behind MLlib and our reasons for not using Breeze, so I'd like to give you some background about MLbase and discuss how it may fit with your interests.

There are three components of MLbase:

1) MLlib: As Matei mentioned, this is an ML library in Spark with core ML kernels and solid implementations of common algorithms that can be used easily from Java/Python and also called into by higher-level systems (e.g. MLI, Shark, PySpark).

2) MLI: This is an ML API that provides a common interface for ML algorithms (the same interface used in MLlib) and introduces high-level abstractions to simplify feature extraction/exploration and ML algorithm development. These abstractions leverage the kernels in MLlib where possible, and also introduce additional kernels. The work includes a library written against the MLI. The MLI is currently written against Spark, but it is designed to be platform-independent, so that code written against it could run on different engines (e.g. Hadoop, GraphX, etc.).

3) ML Optimizer: This piece automates the task of model selection. The optimizer can be viewed as a search over the feature extractors and algorithms in the MLI library, based in part on efficient cross-validation. It is under active development, but at an earlier stage than MLlib and MLI.

(Note: MLlib will be included in the Spark codebase, while the MLI and ML Optimizer will live in separate repositories.)

As far as I can tell (though please correct me if I've misunderstood), your main goals include:

i) "consistency in the API"
ii) "some level of abstraction but to keep things as simple as possible"
iii) "execute models on Spark ... while providing workflows for pipelining transformations, feature extraction, testing and cross-validation, and data viz."

The MLI (and to some extent the ML Optimizer) is very much in line with these goals, and it would be great if you were interested in contributing to it. MLI is a private repository right now, but we will make it public soon, and Evan Sparks or I will let you know when we do.

Thanks again for getting in touch with us!

-Ameet
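(Since MLI is still private, one can only guess at the API, but a platform-independent "common interface for ML algorithms" along the lines Ameet describes in point 2 might look something like the sketch below; every name in it is an assumption, not the real MLI.)

```scala
// Speculative sketch of a platform-independent common interface for ML
// algorithms. None of these names come from the actual (private) MLI code.
object MliSketch {
  // A trained model: something that can score a feature vector.
  trait Model {
    def predict(features: Array[Double]): Double
  }

  // An algorithm is parameterised by its dataset type D, so the same
  // interface could sit on top of Spark RDDs, Hadoop inputs, GraphX
  // structures, and so on.
  trait Algorithm[D, M <: Model] {
    def train(data: D): M
  }

  // For example, a hypothetical Spark-backed implementation would be
  // declared as:
  //   class LogisticRegression extends Algorithm[RDD[LabeledPoint], LrModel]
  // while another engine could reuse the same traits with its own D.
}
```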
On Wed, Jul 24, 2013 at 11:47 AM, Reynold Xin <[email protected]> wrote:

On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath <[email protected]> wrote:

> I also found Breeze to be very nice to work with and like the DSL - hence
> my question about why not use that? (Especially now that Breeze is
> actually just breeze-math and breeze-viz.)

Matei addressed this from a higher level; I want to provide a little bit more context. A common property of a lot of high-level Scala DSL libraries is that simple operators tend to carry high virtual-function overheads and create a lot of temporary objects. And because the level of abstraction is so high, it is fairly hard to debug and optimize performance.

--
Reynold Xin, AMPLab, UC Berkeley
http://rxin.org
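(To make the temporaries point concrete, here is a deliberately naive toy example of the pattern being described - illustration code only, not Breeze's actual implementation:)

```scala
// Toy operator-overloading DSL showing the overhead described above: each
// operator allocates a fresh vector and goes through virtual dispatch, so
// nothing fuses into a single loop.
class Vec(val data: Array[Double]) {
  def +(other: Vec): Vec = // allocates one temporary Vec
    new Vec(data.zip(other.data).map { case (a, b) => a + b })
  def *(other: Vec): Vec = // allocates another temporary Vec
    new Vec(data.zip(other.data).map { case (a, b) => a * b })
}

object DslOverheadSketch {
  def main(args: Array[String]): Unit = {
    val a = new Vec(Array(1.0, 2.0))
    val b = new Vec(Array(3.0, 4.0))
    val c = new Vec(Array(5.0, 6.0))
    // a + b * c first materialises the temporary (b * c), then a second
    // array for the sum; a hand-written loop would do both operations in
    // one pass with no intermediate allocations.
    val result = a + b * c
    println(result.data.mkString(", ")) // 16.0, 26.0
  }
}
```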
