Cool, I totally understand the constraints you're under, and it's not really a criticism at all - the AMPLab projects are all awesome!
If I can find ways to help then all the better.

— Sent from Mailbox for iPhone

On Thu, Jul 25, 2013 at 10:04 PM, Matei Zaharia <[email protected]> wrote:

> I fully agree that we need to be clearer with the timelines in AMP Lab. One thing is that many of these are still research projects, so it's hard to predict when they will be ready for prime time. Usually with the things we officially announce (e.g. MLlib, GraphX), and especially the things we put in the Spark codebase, the team behind them really wants to make them widely available and has committed to spending the engineering effort to make them usable in real applications (as opposed to prototyping and moving on). But even then it can take some time to get the first release out. Hopefully we'll improve our communication about this through more careful tracking in JIRA.
>
> Matei
>
> On Jul 25, 2013, at 11:41 AM, Ameet Talwalkar <[email protected]> wrote:
>
>> Hi Nick,
>>
>> I can understand your 'frustration' -- my hope is that having discussions (like the one we're having now) via this mailing list will help mitigate duplicate work moving forward.
>>
>> Regarding your detailed comments, we are aiming to include various components that you mentioned in our release (basic evaluation for collaborative filtering, linear model additions, and basic support for sparse vectors/features). One particularly interesting avenue that is not on our immediate roadmap is adding implicit feedback for matrix factorization. Algorithms like SVD++ are often used in practice, and it would be great to add them to the MLI library (and perhaps also MLlib).
>>
>> -Ameet
>>
>> On Thu, Jul 25, 2013 at 6:44 AM, Nick Pentreath <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Ok, that all makes sense. I can definitely see the benefit of good standard libraries, and I guess the pieces that felt "missing" to me were what you are describing as MLI and ML Optimizer.
>>>
>>> It seems like the aims of MLI are very much in line with what I have/had in mind for an ML library/framework; the goals overlap quite a lot.
>>>
>>> I guess one "frustration" I have had is that there are all these great BDAS projects, but we never really know when they will be released or what they will look like until they are. In this particular case I couldn't wait for MLlib, so I ended up doing some work myself to port Mahout's ALS, and of course have ended up duplicating effort (which is not a problem, as it was necessary at the time and has been a great learning experience).
>>>
>>> Similarly for GraphX: I would like to develop a project for a Spark-based version of Faunus (https://github.com/thinkaurelius/faunus) for batch processing of data in our Titan graph DB. For now I am working with Bagel-based primitives and Spark RDDs directly. I would love to use GraphX, but I have no idea when it will be released and can have little involvement until it is.
>>>
>>> (I use "frustration" in the nicest way here - I love the BDAS concepts and all the projects coming out, I just want them all to be released NOW!! :)
>>>
>>> So yes, I would love to be involved in the MLlib and MLI work to the extent that I can assist and the work is aligned with what I currently need in my projects (this is just from a time-allocation viewpoint - I'm sure much of it will be complementary).
>>>
>>> Anyway, it seems to me the best course of action is as follows:
>>>
>>> - I'll get involved in MLlib and see how I can contribute there. Some things that jump out:
>>>   - Implicit preference capability for the ALS model, since as far as I can see it currently handles explicit prefs only. (Implicit prefs here: http://68.180.206.246/files/HuKorenVolinsky-ICDM08.pdf - typically better when we don't have actual rating data but instead "view", "click", "play" or whatever; see the sketch just after this message.)
>>>   - RMSE and other evaluation metrics for ALS, as well as test/train split and cross-validation utilities.
>>>   - Linear model additions: new loss functions for SGD such as hinge loss and least squares, learning-rate schedules (http://arxiv.org/pdf/1305.6646), and regularisers (L1/L2/Elastic Net) - i.e. bring the SGD stuff in line with Vowpal Wabbit / sklearn (if that's desirable; my view is yes).
>>>   - Sparse weight and feature vectors for linear models/SGD. Together with hashing this allows very large models while staying efficient, and it is particularly useful with L1 regularisation.
>>>   - Finally, online models: SGD models are currently "static", i.e. once trained they can only predict, whereas SGD can of course keep learning. Or does one simply re-train with the previous weight vector as the initial value? (I guess that can work just as well.) Also on this topic: training/predicting on Streams as well as RDDs. (A sketch of pluggable losses and warm starts appears at the end of this thread.)
>>> - I can put up what I have done on a BitBucket account and grant access to whichever devs would like to take a look. The only reason I don't just throw it up on GitHub is that frankly it is not really ready and is not a fully-fledged project yet (I think, anyway). Possibly some of it can be useful, though there's not all that much there apart from the ALS (which does solve for both explicit and implicit preference data, as per Mahout's implementation), KMeans (simpler than the one in MLlib, as I didn't yet get around to doing KMeans++ init), and the arg-parsing / job-runner (which may or may not be interesting, both for ML and for Spark jobs in general).
>>>
>>> Let me know your thoughts
>>> Nick
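For concreteness, here is a minimal sketch of the implicit-feedback formulation from the Hu/Koren/Volinsky paper Nick cites above. It is illustrative only - all names are hypothetical, and this is neither MLlib nor Mahout code. Raw interaction counts are mapped to a binary preference and a confidence weight, and ALS then minimises the confidence-weighted squared error over all (user, item) cells:

    object ImplicitPrefs {
      // confidence scaling; 40 is a typical value from the paper
      val alpha = 40.0

      // binary preference p_ui: did the user interact with the item at all?
      def preference(r: Double): Double = if (r > 0) 1.0 else 0.0

      // confidence c_ui = 1 + alpha * r_ui grows with the observed count
      def confidence(r: Double): Double = 1.0 + alpha * r

      // confidence-weighted squared error for a single (user, item) cell:
      // c_ui * (p_ui - x_u . y_i)^2, the term ALS minimises (plus L2 reg)
      def cellLoss(r: Double, xu: Array[Double], yi: Array[Double]): Double = {
        val pred = xu.zip(yi).map { case (x, y) => x * y }.sum
        val err = preference(r) - pred
        confidence(r) * err * err
      }
    }

The key difference from explicit-ratings ALS is that every (user, item) cell contributes to the loss, with low confidence when unobserved, which is why the implicit solver needs the rank-one update tricks from the paper rather than the plain normal equations over observed ratings only.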
>>> On Wed, Jul 24, 2013 at 10:09 PM, Ameet Talwalkar <[email protected]> wrote:
>>>
>>>> Hi Nick,
>>>>
>>>> Thanks for your email, and it's great to see such excitement around this work! Matei and Reynold already addressed the motivation behind MLlib as well as our reasons for not using Breeze, and I'd like to give you some background about MLbase and discuss how it may fit with your interests.
>>>>
>>>> There are three components of MLbase:
>>>>
>>>> 1) MLlib: As Matei mentioned, this is an ML library in Spark with core ML kernels and solid implementations of common algorithms that can be used easily from Java/Python and also called into by higher-level systems (e.g. MLI, Shark, PySpark).
>>>>
>>>> 2) MLI: This is an ML API that provides a common interface for ML algorithms (the same interface used in MLlib) and introduces high-level abstractions to simplify feature extraction/exploration and ML algorithm development. These abstractions leverage the kernels in MLlib when possible, and also introduce additional kernels. This work also includes a library written against the MLI. The MLI is currently written against Spark, but it is designed to be platform-independent, so that code written against the MLI could be run on different engines (e.g. Hadoop, GraphX, etc.).
>>>>
>>>> 3) ML Optimizer: This piece automates the task of model selection. The optimizer can be viewed as a search problem over the feature-extraction methods and algorithms included in the MLI library, and it is based in part on efficient cross-validation. This work is under active development but is at an earlier stage than MLlib and MLI.
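As a rough illustration of the platform-independence Ameet describes, the sketch below shows what such a common interface could look like. MLI was a private repository when this thread was written, so every name here is hypothetical and this is not the actual MLI API:

    // Hypothetical sketch of a platform-independent algorithm interface
    // in the spirit of MLI as described above. NOT the actual MLI API
    // (which was private at the time of this thread); names are made up.
    trait MLTable {
      // minimal table abstraction an engine (Spark, Hadoop, GraphX, ...)
      // would implement over its own distributed collection type
      def map(f: Array[Double] => Array[Double]): MLTable
      def numRows: Long
    }

    trait Model[P] {
      def predict(features: Array[Double]): P
    }

    trait Algorithm[P] {
      // algorithm code written against MLTable stays unchanged when the
      // underlying engine changes; only the MLTable implementation differs
      def train(data: MLTable): Model[P]
    }

The payoff is exactly the portability Ameet mentions: re-targeting a library written against such an interface to a new engine means supplying a new MLTable implementation, not rewriting the algorithms.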
>>>> (Note: MLlib will be included in the Spark codebase, while the MLI and ML Optimizer will live in separate repositories.)
>>>>
>>>> As far as I can tell (though please correct me if I've misunderstood), your main goals include:
>>>>
>>>> i) "consistency in the API"
>>>> ii) "some level of abstraction but to keep things as simple as possible"
>>>> iii) "execute models on Spark ... while providing workflows for pipelining transformations, feature extraction, testing and cross-validation, and data viz."
>>>>
>>>> The MLI (and to some extent the ML Optimizer) is very much in line with these goals, and it would be great if you were interested in contributing to it. MLI is a private repository right now, but we'll make it public soon, and Evan Sparks or I will let you know when we do.
>>>>
>>>> Thanks again for getting in touch with us!
>>>>
>>>> -Ameet
>>>>
>>>> On Wed, Jul 24, 2013 at 11:47 AM, Reynold Xin <[email protected]> wrote:
>>>>
>>>>> On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath <[email protected]> wrote:
>>>>>
>>>>>> I also found Breeze to be very nice to work with and like the DSL - hence my question about why not use that? (Especially now that Breeze is actually just breeze-math and breeze-viz.)
>>>>>
>>>>> Matei addressed this from a higher level; I want to provide a little more context. A common property of many high-level Scala DSL libraries is that simple operators tend to carry high virtual-function overheads and also create a lot of temporary objects. And because the level of abstraction is so high, it is fairly hard to debug and optimize performance.
>>>>>
>>>>> --
>>>>> Reynold Xin, AMPLab, UC Berkeley
>>>>> http://rxin.org
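To make Reynold's point concrete, here is a small illustrative example (hypothetical code, not taken from Breeze) of how operator-based vector DSLs can allocate temporaries that a hand-fused loop avoids:

    // Illustrative only; not Breeze code. A naive operator-based vector
    // type allocates a fresh array for every intermediate result, so
    // a + (b * 2.0) builds one temporary for (b * 2.0) and another for
    // the sum, with a method dispatch per operator on top.
    class Vec(val data: Array[Double]) {
      def +(other: Vec): Vec =
        new Vec(data.zip(other.data).map { case (x, y) => x + y })
      def *(s: Double): Vec =
        new Vec(data.map(_ * s))
    }

    object Fused {
      // the same computation fused into a single pass with no temporaries
      def addScaled(a: Array[Double], b: Array[Double], s: Double): Array[Double] = {
        val out = new Array[Double](a.length)
        var i = 0
        while (i < a.length) {
          out(i) = a(i) + b(i) * s
          i += 1
        }
        out
      }
    }

In a tight inner loop, e.g. a gradient computation executed millions of times, the allocation and GC pressure from those temporaries often dominates the arithmetic itself, which is also what makes such code hard to profile and optimise, as Reynold notes.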

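Finally, returning to Nick's linear-model wish list above, here is a minimal sketch, under the same caveats (hypothetical names, not MLlib code), of hinge-loss SGD with an L2 regulariser and a warm-start parameter. As Nick suggests, "online" updating can simply mean re-training with the previous weight vector as the initial value:

    // Hypothetical sketch (not MLlib code) of SGD with a hinge loss and
    // an L2 regulariser, in the style of Vowpal Wabbit / sklearn.
    object SgdSketch {
      // subgradient of the hinge loss max(0, 1 - y * (w . x)), y in {-1, +1}
      def hingeGradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
        val margin = y * w.zip(x).map { case (wi, xi) => wi * xi }.sum
        if (margin >= 1.0) Array.fill(w.length)(0.0) else x.map(xi => -y * xi)
      }

      // one epoch of SGD; initWeights enables the warm start Nick describes
      def train(data: Seq[(Array[Double], Double)],
                initWeights: Array[Double],
                stepSize: Double,
                l2: Double): Array[Double] = {
        val w = initWeights.clone()
        for ((x, y) <- data) {
          val g = hingeGradient(w, x, y)
          var i = 0
          while (i < w.length) {
            // gradient step plus L2 shrinkage; an L1 updater would
            // instead soft-threshold the weights toward zero
            w(i) -= stepSize * (g(i) + l2 * w(i))
            i += 1
          }
        }
        w
      }
    }

Swapping in least-squares or logistic losses, or an L1 updater, only changes the two plug points above, which is the Vowpal Wabbit / sklearn-style modularity Nick is asking for.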