Hi Nick,

Thanks for your email, and it's great to see such excitement around this work! Matei and Reynold already addressed the motivation behind MLlib as well as our reasons for not using Breeze, so I'd like to give you some background about MLbase and discuss how it may fit with your interests.

There are three components of MLbase:

1) MLlib: As Matei mentioned, this is an ML library in Spark with core ML kernels and solid implementations of common algorithms that can be used easily from Java/Python and also called into by higher-level systems (e.g. MLI, Shark, PySpark).

2) MLI: This is an ML API that provides a common interface for ML algorithms (the same interface used in MLlib), and introduces high-level abstractions to simplify feature extraction / exploration and ML algorithm development. These abstractions leverage the kernels in MLlib when possible, and also introduce additional kernels. This work also includes a library written against the MLI. The MLI is currently written against Spark, but is designed to be platform independent, so that code written against MLI could be run on different engines (e.g., Hadoop, GraphX, etc.).

3) ML Optimizer: This piece automates the task of model selection. The optimizer's task can be viewed as a search problem over the feature extraction methods and algorithms included in the MLI library, and is in part based on efficient cross validation. This work is under active development but is at an earlier stage than MLlib and MLI.

(Note: MLlib will be included with the Spark codebase, while the MLI and ML Optimizer will live in separate repositories.)

As far as I can tell (though please correct me if I've misunderstood) your main goals include:

i) "consistency in the API"
ii) "some level of abstraction but to keep things as simple as possible"
iii) "execute models on Spark ... while providing workflows for pipelining transformations, feature extraction, testing and cross-validation, and data viz."

The MLI (and to some extent the ML Optimizer) is very much in line with these goals, and it would be great if you were interested in contributing to it. MLI is a private repository right now, but we'll make it public soon, and Evan Sparks or I will let you know when we do so.

Thanks again for getting in touch with us!

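P.S. To make the model-selection idea in 3) a bit more concrete, here is a minimal sketch in plain Python. It is not the ML Optimizer's code or API: every name in it (EXTRACTORS, ALGORITHMS, cv_error, select_model) is hypothetical, and it uses an exhaustive search with ordinary k-fold cross validation rather than anything more efficient. It only illustrates the framing of enumerating (feature extractor, algorithm) pairs and keeping the pair with the lowest cross-validated error.

```python
from itertools import product
from statistics import mean

# Hypothetical feature extractors: raw scalar in, scalar feature out.
EXTRACTORS = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
}

def fit_mean(fs, ys):
    """Baseline: always predict the mean training label."""
    m = mean(ys)
    return lambda f: m

def fit_linear(fs, ys):
    """Closed-form least squares for a single feature."""
    fbar, ybar = mean(fs), mean(ys)
    var = sum((f - fbar) ** 2 for f in fs)
    cov = sum((f - fbar) * (y - ybar) for f, y in zip(fs, ys))
    slope = cov / var if var else 0.0
    return lambda f: ybar + slope * (f - fbar)

# Hypothetical algorithms: each takes (features, labels) and returns a predictor.
ALGORITHMS = {"mean": fit_mean, "linear": fit_linear}

def cv_error(extract, fit, xs, ys, k=5):
    """Mean squared error over k contiguous cross-validation folds."""
    n = len(xs)
    errors = []
    for i in range(k):
        test = set(range(i * n // k, (i + 1) * n // k))
        train_f = [extract(xs[j]) for j in range(n) if j not in test]
        train_y = [ys[j] for j in range(n) if j not in test]
        model = fit(train_f, train_y)
        errors.extend((model(extract(xs[j])) - ys[j]) ** 2 for j in test)
    return mean(errors)

def select_model(xs, ys, k=5):
    """Score every (extractor, algorithm) pair and return the best one."""
    scores = {
        (e, a): cv_error(EXTRACTORS[e], ALGORITHMS[a], xs, ys, k)
        for e, a in product(EXTRACTORS, ALGORITHMS)
    }
    best = min(scores, key=scores.get)
    return best, scores

# Toy data generated from y = 2 * x^2 + 1, so squared features fit best.
xs = [x / 2 for x in range(-10, 11)]
ys = [2 * x * x + 1 for x in xs]
best, scores = select_model(xs, ys)
print(best)
```

On this toy data the search should pick the "square" extractor with the "linear" algorithm, since the labels are an exact linear function of x squared. A real optimizer would need to prune this search rather than score every pair.
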
-Ameet

On Wed, Jul 24, 2013 at 11:47 AM, Reynold Xin <[email protected]> wrote:

> On Wed, Jul 24, 2013 at 1:46 AM, Nick Pentreath <[email protected]> wrote:
>
> > I also found Breeze to be very nice to work with and like the DSL - hence
> > my question about why not use that? (Especially now that Breeze is actually
> > just breeze-math and breeze-viz).
>
> Matei addressed this from a higher level. I want to provide a little bit
> more context. A common property of a lot of high level Scala DSL
> libraries is that simple operators tend to have high virtual function
> overheads and also create a lot of temporary objects. And because the level
> of abstraction is so high, it is fairly hard to debug / optimize
> performance.
>
> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
