To me, the big thing we continue to be missing is the ability for those of us working on the project to reliably test the algorithms at scale. For instance, I've seen hints of several places where our clustering algorithms don't appear to scale very well (which are all M/R -- K-Means does scale) and it isn't clear to me whether it is our implementation, Hadoop, simply that the data set isn't big enough, or some combination of all three. To see this in action, try out the ASF email archive up on Amazon with 10, 15 or 30 EC2 double x-large nodes and try out fuzzy k-means, dirichlet, etc. Now, I realize EC2 isn't ideal for this kind of testing, but it is all many of us have access to. Perhaps it's also b/c 7M+ emails isn't big enough (~100GB), but in some regards that's silly since the whole point is supposed to be that it scales. Or perhaps my tests were flawed. Either way, it seems like it is an area we need to focus on more.
Of course, the hard part with all of this is debugging where the bottlenecks are. In the end, we need to figure out how to reliably get compute time available for testing along with real data sets that we can use to validate scalability.

On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:

> On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <t...@cloudera.com> wrote:
>
>> ...
>>
>> They discover Mahout, which does specifically bill itself as scalable
>> (from http://mahout.apache.org, in some of the largest letters: "What
>> is Apache Mahout? The Apache Mahout™ machine learning library's goal
>> is to build scalable machine learning libraries."). They sniff check
>> it by massaging some moderately-sized data set into the same format as
>> an example from the wiki and they fail to get a result - often because
>> their problem has some very different properties (more classes, much
>> larger feature space, etc.) and the implementation has some limitation
>> that they trip over.
>>
>
> I have worked with users of Mahout who had 10^9 possible features and
> others who are classifying into 60,000 categories.
>
> Neither of these implementations uses Naive Bayes. Both work very well.
>
>> They will usually try one of the simplest methods available under the
>> assumption "well, if this doesn't scale well, the more complex methods
>> are surely no better".
>
>
> Silly assumption.
>
>
>> This may not be entirely fair, but since the
>> docs they're encountering on the main website and wiki don't warn them
>> that certain implementations don't necessarily scale in different
>> ways, it's certainly not unreasonable.
>
>
> Well, it is actually silly.
>
> Clearly the docs can be better. Clearly the code quality can be better
> especially in terms of nuking capabilities that have not found an audience.
> But clearly also just trying one technique without asking anybody what the
> limitations are isn't going to work as an evaluation technique. This is
> exactly analogous to somebody finding that a matrix in R doesn't do what a
> data frame is supposed to do. It doesn't and you aren't going to find out
> why or how from the documentation very quickly.
>
> In both cases of investigating Mahout or investigating R you will find out
> plenty if you ask somebody who knows what they are talking about.
>
>> They're at best going to
>> conclude the scalability will be hit-and-miss when a simple method
>> doesn't work. Perhaps they'll check in again in 6-12 months.
>>
>
> Maybe so. Maybe not. I have little sympathy with people who make
> scatter-shot decisions like this.
>
>
>> ...
>> I see your analogy to R or sciPy - and I don't disagree. But those
>> projects do not put scaling front and center; if Mahout is going to
>> keep scalability as a "headline feature" (which I would like to see!),
>> I think prominently acknowledging how different methods fail to scale
>> would really help its credibility. For what it's worth, of the people
>> I know who've tried Mahout 100% of them were using R and/or sciPy
>> already, but were curious about Mahout specifically for better
>> scalability.
>>
>
> Did they ask on the mailing list?
>
>
>> I'm not sure where this information is best placed - it would be great
>> to see it on the Wiki along with the examples, at least.
>
>
> Sounds OK. Maybe we should put it in the book.
>
> (oh... wait, we already did that)
>
>
>> It would be
>> awesome to see warnings at runtime ("Warning: You just trained a model
>> that you cannot load without at least 20GB of RAM"), but I'm not sure
>> how realistic that is.
>
>
> I think it is fine that loading the model fails with a fine error message
> but putting yellow warning tape all over the user's keyboard isn't going to
> help anything.
>
>
>> I would like it to be easier to determine, at some very high level, why
>> something didn't work when an experiment fails. Ideally, without having to
>> dive into the code at all.
>>
>
> How about you ask an expert?
>
> That really is easier. It helps the community to hear about what other
> people need and it helps the new user to hear what other people have done.

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com