Hi Ameet,

Ok, that all makes sense. I can definitely see the benefit of good standard libraries, and I guess the pieces that felt "missing" to me were what you are describing as MLI and the ML Optimizer.
It seems like the aims of MLI are very much in line with what I have/had in mind for an ML library/framework - the goals overlap quite a lot. One "frustration" I have had is that there are all these great BDAS projects, but we never really know when they will be released or what they will look like until they are. In this particular case I couldn't wait for MLlib, so I ended up doing some work myself to port Mahout's ALS, and of course have ended up duplicating effort (which is not a problem - it was necessary at the time and has been a great learning experience). Similarly for GraphX: I would like to develop a Spark-based version of Faunus (https://github.com/thinkaurelius/faunus) for batch processing of data in our Titan graph DB. For now I am working with Bagel-based primitives and Spark RDDs directly, and would love to use GraphX, but I have no idea when it will be released and can have little involvement until it is. (I use "frustration" in the nicest way here - I love the BDAS concepts and all the projects coming out, I just want them all to be released NOW!! :)

So yes, I would love to be involved in the MLlib and MLI work, to the extent that I can assist and that the work is aligned with what I currently need in my projects (this is just from a time-allocation viewpoint - I'm sure much of it will be complementary).

Anyway, it seems to me the best course of action is as follows:

- I'll get involved in MLlib and see how I can contribute there. Some things that jump out (I've sketched roughly what I mean for a few of these at the end of this mail - very hand-wavy, just to be concrete):
  - implicit preference capability for the ALS model, since as far as I can see it currently handles explicit prefs only? (Implicit prefs here: http://68.180.206.246/files/HuKorenVolinsky-ICDM08.pdf - typically better when we don't have actual rating data but instead "views", "clicks", "plays" or whatever.)
  - RMSE and other evaluation metrics for ALS, as well as test/train split / cross-validation stuff?
  - linear model additions, like new loss functions for SGD (hinge loss, least squares etc.), as well as learning rate schedules (http://arxiv.org/pdf/1305.6646) and regularisers (L1/L2/Elastic Net) - i.e. bring the SGD stuff in line with Vowpal Wabbit / sklearn (if that's desirable - my view is yes).
  - what about sparse weight and feature vectors for linear models/SGD? Together with hashing this allows very large models while still being efficient, and with L1 reg it is particularly useful.
  - finally, what about online models? SGD models are currently "static" - once trained they can only predict - whereas SGD can of course keep learning. Or does one simply re-train with the previous weight vector as the initial one? (I guess that can work just as well...) Also on this topic: training/predicting on Streams as well as RDDs.

- I can put up what I have done on a BitBucket account and grant access to whichever devs would like to take a look. The only reason I don't just throw it up on GitHub is that frankly it is not really ready and not a fully-fledged project yet (I think, anyway). Possibly some of it can be useful - not that there's all that much there apart from the ALS (which does solve for both explicit and implicit preference data, as per Mahout's implementation), KMeans (simpler than the one in MLlib, as I didn't yet get around to doing the KMeans++ init) and the arg-parsing / job-runner (which may or may not be interesting, both for ML and for Spark jobs in general).
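To make the implicit-prefs point concrete, here is a rough sketch of the confidence weighting from the Hu/Koren/Volinsky paper - all names made up by me, and this is only the data transformation, not the modified ALS solve:

    // Implicit feedback (Hu, Koren & Volinsky 2008): raw counts r_ui
    // ("views", "clicks", "plays") become a binary preference plus a
    // confidence level, rather than being treated as ratings directly.
    case class Interaction(user: Int, item: Int, count: Double)

    // p_ui = 1 if r_ui > 0 else 0;  c_ui = 1 + alpha * r_ui
    def toPrefAndConfidence(data: Seq[Interaction], alpha: Double = 40.0)
        : Seq[(Int, Int, Double, Double)] =
      data.map { i =>
        val preference = if (i.count > 0) 1.0 else 0.0
        val confidence = 1.0 + alpha * i.count
        (i.user, i.item, preference, confidence)
      }

The ALS normal equations then weight each observation by c_ui, which is the part that needs changes inside the factorisation itself (as Mahout does).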
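For RMSE I mean something as simple as this, assuming an RDD of (actual, predicted) pairs (a sketch against plain map/reduce so it needs no implicit conversions; package name as in current Spark, adjust to taste):

    import spark.RDD

    // Root mean squared error over (actual, predicted) rating pairs.
    def rmse(ratingsAndPreds: RDD[(Double, Double)]): Double = {
      val (sumSqErr, count) = ratingsAndPreds
        .map { case (actual, predicted) =>
          val err = actual - predicted
          (err * err, 1L)
        }
        .reduce { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      math.sqrt(sumSqErr / count)
    }

Test/train split and cross-val would then just be utilities that produce the RDD pairs fed into metrics like this.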
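On the loss-function side, the kind of addition I have in mind is tiny - e.g. the hinge-loss subgradient for a linear SVM (pure Scala, no Spark dependency, purely illustrative):

    // Hinge loss for labels y in {-1, +1}: max(0, 1 - y * (w . x)).
    // Returns the (sub)gradient contribution of a single example.
    def hingeGradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
      val margin = y * w.zip(x).map { case (wi, xi) => wi * xi }.sum
      if (margin >= 1.0) Array.fill(w.length)(0.0) // outside the margin: no update
      else x.map(xi => -y * xi)                    // inside: subgradient is -y * x
    }

Each loss would plug into the same SGD driver, with the learning-rate schedule and regulariser as separate, composable pieces - which is essentially the VW / sklearn structure.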
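And for the hashing idea, a minimal version of the hashing trick (again just a sketch - bucket count, hash function and collision handling all up for debate):

    import scala.collection.mutable

    // Hash arbitrary (featureName, value) pairs into a fixed-size sparse
    // vector, so the model stays bounded no matter how many raw features
    // there are. Collisions simply add, as in Vowpal Wabbit.
    def hashFeatures(features: Seq[(String, Double)],
                     numBuckets: Int = 1 << 18): Map[Int, Double] = {
      val vec = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
      for ((name, value) <- features) {
        val h = name.hashCode % numBuckets
        val idx = if (h < 0) h + numBuckets else h // non-negative bucket id
        vec(idx) += value
      }
      vec.toMap
    }

With L1 regularisation pushing most hashed weights to exactly zero, the sparse representation pays off doubly.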
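On the online-learning question, the "re-train from the previous weights" option is really just a fold over batches - sketched generically here, since I'm not assuming anything about what the trainer's actual signature will be:

    // Warm-started re-training: each round of SGD starts from the weights
    // learned in the previous round instead of from zero.
    def retrain[D](train: (D, Array[Double]) => Array[Double],
                   batches: Seq[D],
                   initialWeights: Array[Double]): Array[Double] =
      batches.foldLeft(initialWeights) { (w, batch) => train(batch, w) }

The same shape would apply per-batch over a stream, which is why training/predicting on Streams feels like a natural extension.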
Let me know your thoughts

Nick

On Wed, Jul 24, 2013 at 10:09 PM, Ameet Talwalkar <[email protected]> wrote:

> Hi Nick,
>
> Thanks for your email, and it's great to see such excitement around this
> work! Matei and Reynold already addressed the motivation behind MLlib as
> well as our reasons for not using Breeze, and I'd like to give you some
> background about MLbase, and discuss how it may fit with your interests.
>
> There are three components of MLbase:
>
> 1) MLlib: As Matei mentioned, this is an ML library in Spark with core ML
> kernels and solid implementations of common algorithms that can be used
> easily by Java/Python and also called into by higher-level systems (e.g.
> MLI, Shark, PySpark).
>
> 2) MLI: this is an ML API that provides a common interface for ML
> algorithms (the same interface used in MLlib), and introduces high-level
> abstractions to simplify feature extraction / exploration and ML algorithm
> development. These abstractions leverage the kernels in MLlib when
> possible, and also introduce additional kernels. This work also includes a
> library written against the MLI. The MLI is currently written against
> Spark, but is designed to be platform independent, so that code written
> against MLI could be run on different engines (e.g., Hadoop, GraphX, etc.).
>
> 3) ML Optimizer: This piece automates the task of model selection. The
> optimizer can be viewed as a search problem over feature extraction /
> algorithms included in the MLI library, and is in part based on efficient
> cross validation. This work is under active development but is in an
> earlier stage of development than MLlib and MLI.
>
> (Note: MLlib will be included with the Spark codebase, while the MLI and
> ML Optimizer will live in separate repositories.)
>
> As far as I can tell (though please correct me if I've misunderstood) your
> main goals include:
>
> i) "consistency in the API"
> ii) "some level of abstraction but to keep things as simple as possible"
> iii) "execute models on Spark ... while providing workflows for pipelining
> transformations, feature extraction, testing and cross-validation, and
> data viz."
>
> The MLI (and to some extent the ML Optimizer) is very much in line with
> these goals, and it would be great if you were interested in contributing
> to it. MLI is a private repository right now, but we'll make it public
> soon, and Evan Sparks or I will let you know when we do so.
>
> Thanks again for getting in touch with us!
>
> -Ameet
>
> On Wed, Jul 24, 2013 at 11:47 AM, Reynold Xin <[email protected]> wrote:
>
> > On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath
> > <[email protected]> wrote:
> >
> > > I also found Breeze to be very nice to work with and like the DSL -
> > > hence my question about why not use that? (Especially now that Breeze
> > > is actually just breeze-math and breeze-viz).
> >
> > Matei addressed this from a higher level. I want to provide a little bit
> > more context. A common property of a lot of high-level Scala DSL
> > libraries is that simple operators tend to have high virtual function
> > overheads and also create a lot of temporary objects. And because the
> > level of abstraction is so high, it is fairly hard to debug / optimize
> > performance.
> >
> > --
> > Reynold Xin, AMPLab, UC Berkeley
> > http://rxin.org
