Hi dev team (apologies for the long email!)
Firstly, great news about the inclusion of MLlib into the Spark project! I've been working on a concept and some code for a machine learning library on Spark, so of course there is a lot of overlap between MLlib and what I've been doing. I wanted to throw this out there and (a) ask a couple of design and roadmap questions about MLlib, and (b) talk about how to work together / integrate my ideas (if at all :)

*Some questions*

1. What is the general design idea behind MLlib - is it aimed at being a collection of algorithms, i.e. a library? Or is it aimed at being a "Mahout for Spark", i.e. something that can be used as a library as well as a set of tools for things like running jobs, feature extraction, text processing etc.?
2. How married are we to keeping it within the Spark project? While I understand the reasoning behind it, I'm not convinced it's best. But I guess we can wait and see how it develops.
3. Some of the original test code I saw around the Block ALS used Breeze (https://github.com/dlwh/breeze) for some of the linear algebra. Now I see everything uses jblas directly with Array[Double]. Is there a specific reason for this? Is it aimed at creating a separation so that the linear algebra backend could be swapped out? Scala 2.10 issues?
4. Since Spark is meant to be nicely compatible with Hadoop, do we care about compatibility/integration with Mahout? This might also encourage Mahout developers to switch over and contribute their expertise (see for example Dmitry's work at https://github.com/dlyubimov/mahout-commits/commits/dev-0.8.x-scala/math-scala, where he is building a Scala/Spark DSL around mahout-math matrices and distributed operations). Could we potentially even use mahout-math for the linear algebra routines?
5. Is there a roadmap? (I've checked the JIRA, which does have a few intended models etc.) Who are the devs most involved in this project?
6. What are the thoughts around API design for models?

*Some thoughts*

So, over the past couple of months I have been working on a machine learning library. Initially it was for my own use, but I've added a few things and was starting to think about releasing it (though it's nowhere near ready).

The model I really needed first was ALS for doing recommendations, so I ported the ALS code from Mahout to Spark. Well, "ported" in some sense - mostly I copied the algorithm and data distribution design, using Spark's primitives and Breeze for all the linear algebra, and I found it pretty straightforward to port over. So far I have done local testing only, on the MovieLens datasets, and my RMSE results match Mahout's. Interestingly, the wall-clock performance is not as dissimilar as I would have expected, but I'd now like to run some larger-scale tests on a cluster to do a proper comparison. Obviously, with Spark's Block ALS model my version is now somewhat superfluous, since I expect (and have so far seen in my simple local experiments) that the block model will significantly outperform it. I will probably port my use case over to it in due course once I've done further testing.

I also found Breeze very nice to work with and I like the DSL - hence my question about why not use it (especially now that Breeze is effectively just breeze-math and breeze-viz)? Anyway, I then added KMeans (basically just the Spark example with some Breeze tweaks), and started working on a linear model framework.
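To make the Breeze point a bit more concrete, here is roughly the kind of thing I mean - an off-the-cuff sketch of the "find closest centre" step from the KMeans tweak, not the actual code (names are illustrative; it assumes only breeze-math on the classpath, and in the full version this would run inside an RDD map with the centres broadcast to the workers):

    import breeze.linalg.{DenseVector, squaredDistance}

    object ClosestCentre {
      // Index of the nearest centre to the given point. squaredDistance comes
      // straight from Breeze's DSL, so there is no hand-rolled element-wise
      // loop over Array[Double] as there would be with raw jblas.
      def closest(point: DenseVector[Double],
                  centres: Array[DenseVector[Double]]): Int =
        centres.map(c => squaredDistance(point, c)).zipWithIndex.minBy(_._1)._2

      def main(args: Array[String]): Unit = {
        val centres = Array(DenseVector(0.0, 0.0), DenseVector(5.0, 5.0))
        println(closest(DenseVector(4.0, 4.5), centres)) // prints 1
      }
    }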
I've also added a simple framework for arg parsing and config (using Twitter Algebird's Args and Typesafe Config), and have started on feature extraction - of particular interest will be text feature extraction and feature hashing.

This is roughly the idea I have for a machine learning library on Spark - call it a design or manifesto or whatever:

- Library available and consistent across Scala, Java and Python (as much as possible, in any event)
- A core library, plus a set of tools for easily running models based on standard input formats etc.
- A standardised model API (even across languages) to the extent possible. I've based mine so far on Python's scikit-learn (.fit(), .predict() etc). Why? I believe a major strength of scikit-learn is that its API is so clean, simple and consistent. Plus, for the Python version of the lib, scikit-learn will no doubt be used wherever possible to avoid re-creating code
- Models to be included initially:
  - ALS
  - Possibly co-occurrence recommendation stuff similar to Mahout's Taste
  - Clustering (K-Means and potentially others)
  - Linear models - the idea here is to have something very close to Vowpal Wabbit, i.e. a generic SGD engine with various loss functions, learning rate schemes etc. This would also allow other VW-style models such as online versions of matrix factorisation, neural nets and learning reductions
  - Possibly decision trees / random forests
- Some utilities for feature extraction (hashing in particular) and for making jobs easy to run (integration with Spark's ./run etc.?)
- Stuff for making pipelining easy (like scikit-learn) and for doing things like cross-validation in a principled (and parallel) way
- Clean and easy integration with Spark Streaming for online models (e.g. a linear SGD can be called with fit() on batch data, and then fit() and/or fit/predict() on streaming data to learn further online)
- Interactivity provided by shells (IPython, Spark shell) and plotting capability (matplotlib, Breeze Viz)
- For Scala, integration with Shark via sql2rdd etc.
- Something similar to Scalding's Matrix API, based on RDDs, for representing distributed matrices, integrating Dmitry's ideas and Mahout's DistributedRowMatrix etc.

Here is a rough outline of the model API I have used so far: https://gist.github.com/MLnick/6068841 (a simplified sketch also follows the list of options below). It works nicely for ALS, clustering, linear models etc.

So as you can see, this mostly overlaps with what MLlib already has or has planned in some way, but my main aim is frankly to have consistency in the API and some level of abstraction while keeping things as simple as possible (i.e. let Spark handle the complex stuff) - and thus hopefully avoid ending up with a somewhat haphazard collection of models that is not simple to figure out how to use, which unfortunately is what I believe has happened to Mahout.

So the question then is: how do we work together or integrate? I see 3 options:

1. I go my own way (not very appealing, obviously)
2. Contribute what I have (or as much as makes sense) to MLlib
3. Create my project as a "front-end" or "wrapper" around MLlib as the core, effectively providing the API and workflow interface but using MLlib as the model engine.
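Before weighing up those options, here is roughly what that fit()/predict() API shape looks like - a simplified, off-the-cuff sketch only (the trait and method names are illustrative, the toy MeanEstimator is just there to show the pattern, and the actual gist differs in the details and works over RDDs rather than plain Seqs):

    // A scikit-learn-style contract: an estimator's fit() produces a model,
    // and the model's predict() maps inputs to predictions. D is the dataset
    // type (an RDD of examples in practice) and P the prediction output type.
    trait Model[D, P] extends Serializable {
      def predict(data: D): P
    }

    trait Estimator[D, P, M <: Model[D, P]] {
      def fit(data: D): M
    }

    // Toy example of the pattern: "fit" the mean of some doubles, predict it.
    class MeanModel(mean: Double) extends Model[Seq[Double], Seq[Double]] {
      def predict(data: Seq[Double]): Seq[Double] = data.map(_ => mean)
    }

    class MeanEstimator extends Estimator[Seq[Double], Seq[Double], MeanModel] {
      def fit(data: Seq[Double]): MeanModel = new MeanModel(data.sum / data.size)
    }

    // Hypothetical usage, mirroring scikit-learn's estimator pattern:
    //   val model = new ALS(rank = 20, iterations = 10).fit(ratings)
    //   val recs  = model.predict(userProducts)

The point is simply that every model exposes the same two calls, which is what makes pipelining and cross-validation straightforward to build on top.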
#2 is appealing, but then a lot depends on the API and framework design, and on how compatible what I have in mind is with what the rest of the devs want.

#3, now that I have written it down, starts to sound pretty interesting - potentially we're looking at a "front-end" that could execute models on Spark (or other engines like Hadoop/Mahout, GraphX etc.), while providing workflows for pipelining transformations, feature extraction, testing and cross-validation, and data viz. But of course #3 starts sounding somewhat like what MLBase is aiming to be (I think)!

At this point I'm willing to share what I have done so far on a selective basis, if that makes sense - be warned, though, that it's rough, unfinished and perhaps somewhat clunky, as it's my first attempt at a library/framework. Especially since the main thing I did was the ALS port, and with MLlib's version of ALS that may be less useful now in any case. It may be that none of this is that useful to others anyway, which is fine - I'll keep developing the tools I need and potentially they will be useful at some point.

Thoughts, feedback, comments, discussion? I really want to jump into MLlib and get involved in contributing to standardised machine learning on Spark!

Nick
