Cool, I totally understand the constraints you're under, and it's not really a criticism at all - the AMPLab projects are all awesome!
If I can find ways to help then all the better.

— Sent from Mailbox for iPhone

On Thu, Jul 25, 2013 at 10:04 PM, Matei Zaharia <[email protected]> wrote:

> I fully agree that we need to be clearer with the timelines in AMP Lab. One thing is that many of these are still research projects, so it's hard to predict when they will be ready for prime time. Usually with the things we officially announce (e.g. MLlib, GraphX), and especially the things we put in the Spark codebase, the team behind them really wants to make them widely available and has committed to spending the engineering effort to make them usable in real applications (as opposed to prototyping and moving on). But even then it can take some time to get the first release out. Hopefully we'll improve our communication about this through more careful tracking in JIRA.
>
> Matei
>
> On Jul 25, 2013, at 11:41 AM, Ameet Talwalkar <[email protected]> wrote:
>
>> Hi Nick,
>>
>> I can understand your 'frustration' -- my hope is that having discussions (like the one we're having now) via this mailing list will help mitigate duplicate work moving forward.
>>
>> Regarding your detailed comments, we are aiming to include various components that you mentioned in our release (basic evaluation for collaborative filtering, linear model additions, and basic support for sparse vectors/features). One particularly interesting avenue that is not on our immediate roadmap is adding implicit feedback for matrix factorization. Algorithms like SVD++ are often used in practice, and it would be great to add them to the MLI library (and perhaps also MLlib).
>>
>> -Ameet
>>
>> On Thu, Jul 25, 2013 at 6:44 AM, Nick Pentreath <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Ok, that all makes sense. I can definitely see the benefit of good standard libraries, and I guess the pieces that felt "missing" to me were what you are describing as MLI and ML Optimizer.
>>>
>>> It seems like the aims of MLI are very much in line with what I have/had in mind for an ML library/framework; the goals overlap quite a lot.
>>>
>>> I guess one "frustration" I have had is that there are all these great BDAS projects, but we never really know when they will be released or what they will look like until they are. In this particular case I couldn't wait for MLlib, so I ended up doing some work myself to port Mahout's ALS, and of course have ended up duplicating effort (which is not a problem, as it was necessary at the time and has been a great learning experience).
>>>
>>> Similarly for GraphX: I would like to develop a project for a Spark-based version of Faunus (https://github.com/thinkaurelius/faunus) for batch processing of data in our Titan graph DB. For now I am working with Bagel-based primitives and Spark RDDs directly. I would love to use GraphX, but I have no idea when it will be released and can have little involvement until it is.
>>>
>>> (I use "frustration" in the nicest way here - I love the BDAS concepts and all the projects coming out, I just want them all to be released NOW!! :)
>>>
>>> So yes, I would love to be involved in the MLlib and MLI work to the extent that I can assist and the work is aligned with what I currently need in my projects (this is just from a time-allocation viewpoint - I'm sure much of it will be complementary).
>>>
>>> Anyway, it seems to me the best course of action is as follows:
>>>
>>> - I'll get involved in MLlib and see how I can contribute there. Some things that jump out:
>>>   - Implicit preference capability for the ALS model, since as far as I can see it currently handles explicit prefs only. (Implicit prefs here: http://68.180.206.246/files/HuKorenVolinsky-ICDM08.pdf - typically better when we don't have actual rating data but instead "view", "click", "play" or whatever; see the sketch just after this message.)
>>>   - RMSE and other evaluation metrics for ALS, as well as test/train split and cross-validation utilities.
>>>   - Linear model additions: new loss functions for SGD such as hinge loss and least squares, learning-rate schedules (http://arxiv.org/pdf/1305.6646), and regularisers (L1/L2/Elastic Net) - i.e. bring the SGD stuff in line with Vowpal Wabbit / sklearn (if that's desirable; my view is yes).
>>>   - Sparse weight and feature vectors for linear models/SGD. Together with hashing this allows very large models while staying efficient, and it is particularly useful with L1 regularisation.
>>>   - Finally, online models: SGD models are currently "static", i.e. once trained they can only predict, whereas SGD can of course keep learning. Or does one simply re-train with the previous weight vector as the initial value? (I guess that can work just as well.) Also on this topic: training/predicting on Streams as well as RDDs. (A sketch of pluggable losses and warm starts appears at the end of this thread.)
>>> - I can put up what I have done on a BitBucket account and grant access to whichever devs would like to take a look. The only reason I don't just throw it up on GitHub is that frankly it is not really ready and is not a fully-fledged project yet (I think, anyway). Possibly some of it can be useful, though there's not all that much there apart from the ALS (which does solve for both explicit and implicit preference data, as per Mahout's implementation), KMeans (simpler than the one in MLlib, as I didn't yet get around to doing KMeans++ init), and the arg-parsing / job-runner (which may or may not be interesting, both for ML and for Spark jobs in general).
>>>
>>> Let me know your thoughts
>>> Nick
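For concreteness, here is a minimal sketch of the implicit-feedback formulation from the Hu/Koren/Volinsky paper Nick cites above. It is illustrative only - all names are hypothetical, and this is neither MLlib nor Mahout code. Raw interaction counts are mapped to a binary preference and a confidence weight, and ALS then minimises the confidence-weighted squared error over all (user, item) cells:

    object ImplicitPrefs {
      // confidence scaling; 40 is a typical value from the paper
      val alpha = 40.0

      // binary preference p_ui: did the user interact with the item at all?
      def preference(r: Double): Double = if (r > 0) 1.0 else 0.0

      // confidence c_ui = 1 + alpha * r_ui grows with the observed count
      def confidence(r: Double): Double = 1.0 + alpha * r

      // confidence-weighted squared error for a single (user, item) cell:
      // c_ui * (p_ui - x_u . y_i)^2, the term ALS minimises (plus L2 reg)
      def cellLoss(r: Double, xu: Array[Double], yi: Array[Double]): Double = {
        val pred = xu.zip(yi).map { case (x, y) => x * y }.sum
        val err = preference(r) - pred
        confidence(r) * err * err
      }
    }

The key difference from explicit-ratings ALS is that every (user, item) cell contributes to the loss, with low confidence when unobserved, which is why the implicit solver needs the rank-one update tricks from the paper rather than the plain normal equations over observed ratings only.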
>>> On Wed, Jul 24, 2013 at 10:09 PM, Ameet Talwalkar <[email protected]> wrote:
>>>
>>>> Hi Nick,
>>>>
>>>> Thanks for your email, and it's great to see such excitement around this work! Matei and Reynold already addressed the motivation behind MLlib as well as our reasons for not using Breeze, and I'd like to give you some background about MLbase and discuss how it may fit with your interests.
>>>>
>>>> There are three components of MLbase:
>>>>
>>>> 1) MLlib: As Matei mentioned, this is an ML library in Spark with core ML kernels and solid implementations of common algorithms that can be used easily from Java/Python and also called into by higher-level systems (e.g. MLI, Shark, PySpark).
>>>>
>>>> 2) MLI: This is an ML API that provides a common interface for ML algorithms (the same interface used in MLlib) and introduces high-level abstractions to simplify feature extraction/exploration and ML algorithm development. These abstractions leverage the kernels in MLlib when possible, and also introduce additional kernels. This work also includes a library written against the MLI. The MLI is currently written against Spark, but it is designed to be platform-independent, so that code written against the MLI could be run on different engines (e.g. Hadoop, GraphX, etc.).
>>>>
>>>> 3) ML Optimizer: This piece automates the task of model selection. The optimizer can be viewed as a search problem over the feature-extraction methods and algorithms included in the MLI library, and it is based in part on efficient cross-validation. This work is under active development but is at an earlier stage than MLlib and MLI.
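As a rough illustration of the platform-independence Ameet describes, the sketch below shows what such a common interface could look like. MLI was a private repository when this thread was written, so every name here is hypothetical and this is not the actual MLI API:

    // Hypothetical sketch of a platform-independent algorithm interface
    // in the spirit of MLI as described above. NOT the actual MLI API
    // (which was private at the time of this thread); names are made up.
    trait MLTable {
      // minimal table abstraction an engine (Spark, Hadoop, GraphX, ...)
      // would implement over its own distributed collection type
      def map(f: Array[Double] => Array[Double]): MLTable
      def numRows: Long
    }

    trait Model[P] {
      def predict(features: Array[Double]): P
    }

    trait Algorithm[P] {
      // algorithm code written against MLTable stays unchanged when the
      // underlying engine changes; only the MLTable implementation differs
      def train(data: MLTable): Model[P]
    }

The payoff is exactly the portability Ameet mentions: re-targeting a library written against such an interface to a new engine means supplying a new MLTable implementation, not rewriting the algorithms.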
>>>> (Note: MLlib will be included in the Spark codebase, while the MLI and ML Optimizer will live in separate repositories.)
>>>>
>>>> As far as I can tell (though please correct me if I've misunderstood), your main goals include:
>>>>
>>>> i) "consistency in the API"
>>>> ii) "some level of abstraction but to keep things as simple as possible"
>>>> iii) "execute models on Spark ... while providing workflows for pipelining transformations, feature extraction, testing and cross-validation, and data viz."
>>>>
>>>> The MLI (and to some extent the ML Optimizer) is very much in line with these goals, and it would be great if you were interested in contributing to it. MLI is a private repository right now, but we'll make it public soon, and Evan Sparks or I will let you know when we do.
>>>>
>>>> Thanks again for getting in touch with us!
>>>>
>>>> -Ameet
>>>>
>>>> On Wed, Jul 24, 2013 at 11:47 AM, Reynold Xin <[email protected]> wrote:
>>>>
>>>>> On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath <[email protected]> wrote:
>>>>>
>>>>>> I also found Breeze to be very nice to work with and like the DSL - hence my question about why not use that? (Especially now that Breeze is actually just breeze-math and breeze-viz.)
>>>>>
>>>>> Matei addressed this from a higher level; I want to provide a little more context. A common property of many high-level Scala DSL libraries is that simple operators tend to carry high virtual-function overheads and also create a lot of temporary objects. And because the level of abstraction is so high, it is fairly hard to debug and optimize performance.
>>>>>
>>>>> --
>>>>> Reynold Xin, AMPLab, UC Berkeley
>>>>> http://rxin.org
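To make Reynold's point concrete, here is a small illustrative example (hypothetical code, not taken from Breeze) of how operator-based vector DSLs can allocate temporaries that a hand-fused loop avoids:

    // Illustrative only; not Breeze code. A naive operator-based vector
    // type allocates a fresh array for every intermediate result, so
    // a + (b * 2.0) builds one temporary for (b * 2.0) and another for
    // the sum, with a method dispatch per operator on top.
    class Vec(val data: Array[Double]) {
      def +(other: Vec): Vec =
        new Vec(data.zip(other.data).map { case (x, y) => x + y })
      def *(s: Double): Vec =
        new Vec(data.map(_ * s))
    }

    object Fused {
      // the same computation fused into a single pass with no temporaries
      def addScaled(a: Array[Double], b: Array[Double], s: Double): Array[Double] = {
        val out = new Array[Double](a.length)
        var i = 0
        while (i < a.length) {
          out(i) = a(i) + b(i) * s
          i += 1
        }
        out
      }
    }

In a tight inner loop, e.g. a gradient computation executed millions of times, the allocation and GC pressure from those temporaries often dominates the arithmetic itself, which is also what makes such code hard to profile and optimise, as Reynold notes.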

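Finally, returning to Nick's linear-model wish list above, here is a minimal sketch, under the same caveats (hypothetical names, not MLlib code), of hinge-loss SGD with an L2 regulariser and a warm-start parameter. As Nick suggests, "online" updating can simply mean re-training with the previous weight vector as the initial value:

    // Hypothetical sketch (not MLlib code) of SGD with a hinge loss and
    // an L2 regulariser, in the style of Vowpal Wabbit / sklearn.
    object SgdSketch {
      // subgradient of the hinge loss max(0, 1 - y * (w . x)), y in {-1, +1}
      def hingeGradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
        val margin = y * w.zip(x).map { case (wi, xi) => wi * xi }.sum
        if (margin >= 1.0) Array.fill(w.length)(0.0) else x.map(xi => -y * xi)
      }

      // one epoch of SGD; initWeights enables the warm start Nick describes
      def train(data: Seq[(Array[Double], Double)],
                initWeights: Array[Double],
                stepSize: Double,
                l2: Double): Array[Double] = {
        val w = initWeights.clone()
        for ((x, y) <- data) {
          val g = hingeGradient(w, x, y)
          var i = 0
          while (i < w.length) {
            // gradient step plus L2 shrinkage; an L1 updater would
            // instead soft-threshold the weights toward zero
            w(i) -= stepSize * (g(i) + l2 * w(i))
            i += 1
          }
        }
        w
      }
    }

Swapping in least-squares or logistic losses, or an L1 updater, only changes the two plug points above, which is the Vowpal Wabbit / sklearn-style modularity Nick is asking for.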