On Dec 28, 2011, at 7:28 PM, Jeff Eastman wrote:

> This is something that I'm enthusiastic about investigating right now. I'm
> heartened that K-Means seems to scale well in your tests and I think I've
> just improved Dirichlet a lot.
I suspect we found out why before, at least for Dirichlet, due to the choice
of some parameters.

> I'd like to test it again with your data. FuzzyK is problematic as its
> clusters always end up with dense vectors for center and radius. I think it
> will always be a hog. 100GB is not a huge data set and it should sing on a
> 10-node cluster. Even without MapR <grin>.
>
> I think improving our predictability at scale is a great goal for 1.0.
> Getting started would be a great goal for 0.7.

+1

> Jeff
>
> On 12/28/11 11:35 AM, Grant Ingersoll wrote:
>> To me, the big thing we continue to be missing is the ability for those of
>> us working on the project to reliably test the algorithms at scale. For
>> instance, I've seen hints of several places where our clustering algorithms
>> don't appear to scale very well (which are all M/R -- K-Means does scale)
>> and it isn't clear to me whether it is our implementation, Hadoop, simply
>> that the data set isn't big enough, or the combination of all three. To see
>> this in action, try out the ASF email archive up on Amazon with 10, 15 or 30
>> EC2 double x-large nodes and try out fuzzy k-means, dirichlet, etc. Now, I
>> realize EC2 isn't ideal for this kind of testing, but it's all many of us
>> have access to. Perhaps it's also b/c 7M+ emails isn't big enough (~100GB),
>> but in some regards that's silly since the whole point is supposed to be
>> that it scales. Or perhaps my tests were flawed. Either way, it seems like
>> it is an area we need to focus on more.
>>
>> Of course, the hard part with all of this is debugging where the bottlenecks
>> are. In the end, we need to figure out how to reliably get compute time
>> available for testing along with real data sets that we can use to
>> validate scalability.
>>
>>
>> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>>
>>> On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce<t...@cloudera.com> wrote:
>>>
>>>> ...
>>>>
>>>> They discover Mahout, which does specifically bill itself as scalable
>>>> (from http://mahout.apache.org, in some of the largest letters: "What
>>>> is Apache Mahout? The Apache Mahout™ machine learning library's goal
>>>> is to build scalable machine learning libraries."). They sniff-check
>>>> it by massaging some moderately-sized data set into the same format as
>>>> an example from the wiki and they fail to get a result - often because
>>>> their problem has some very different properties (more classes, much
>>>> larger feature space, etc.) and the implementation has some limitation
>>>> that they trip over.
>>>>
>>> I have worked with users of Mahout who had 10^9 possible features and
>>> others who are classifying into 60,000 categories.
>>>
>>> Neither of these implementations uses Naive Bayes. Both work very well.
>>>
>>>> They will usually try one of the simplest methods available under the
>>>> assumption "well, if this doesn't scale well, the more complex methods
>>>> are surely no better".
>>>
>>> Silly assumption.
>>>
>>>> This may not be entirely fair, but since the
>>>> docs they're encountering on the main website and wiki don't warn them
>>>> that certain implementations don't necessarily scale in different
>>>> ways, it's certainly not unreasonable.
>>>
>>> Well, it is actually silly.
>>>
>>> Clearly the docs can be better. Clearly the code quality can be better,
>>> especially in terms of nuking capabilities that have not found an audience.
>>> But clearly also just trying one technique without asking anybody what the
>>> limitations are isn't going to work as an evaluation technique. This is
>>> exactly analogous to somebody finding that a matrix in R doesn't do what a
>>> data frame is supposed to do. It doesn't, and you aren't going to find out
>>> why or how from the documentation very quickly.
>>>
>>> In both cases, investigating Mahout or investigating R, you will find out
>>> plenty if you ask somebody who knows what they are talking about.
>>>
>>>> They're at best going to
>>>> conclude the scalability will be hit-and-miss when a simple method
>>>> doesn't work. Perhaps they'll check in again in 6-12 months.
>>>>
>>> Maybe so. Maybe not. I have little sympathy with people who make
>>> scatter-shot decisions like this.
>>>
>>>> ...
>>>> I see your analogy to R or sciPy - and I don't disagree. But those
>>>> projects do not put scaling front and center; if Mahout is going to
>>>> keep scalability as a "headline feature" (which I would like to see!),
>>>> I think prominently acknowledging how different methods fail to scale
>>>> would really help its credibility. For what it's worth, of the people
>>>> I know who've tried Mahout, 100% of them were using R and/or sciPy
>>>> already, but were curious about Mahout specifically for better
>>>> scalability.
>>>>
>>> Did they ask on the mailing list?
>>>
>>>> I'm not sure where this information is best placed - it would be great
>>>> to see it on the Wiki along with the examples, at least.
>>>
>>> Sounds OK. Maybe we should put it in the book.
>>>
>>> (oh... wait, we already did that)
>>>
>>>> It would be
>>>> awesome to see warnings at runtime ("Warning: You just trained a model
>>>> that you cannot load without at least 20GB of RAM"), but I'm not sure
>>>> how realistic that is.
>>>
>>> I think it is fine that loading the model fails with a fine error message,
>>> but putting yellow warning tape all over the user's keyboard isn't going to
>>> help anything.
>>>
>>>> I would like it to be easier to determine, at some very high level, why
>>>> something didn't work when an experiment fails. Ideally, without having to
>>>> dive into the code at all.
>>>>
>>> How about you ask an expert?
>>>
>>> That really is easier. It helps the community to hear about what other
>>> people need and it helps the new user to hear what other people have done.
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>
--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
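
P.S. On Jeff's point above about FuzzyK centers always going dense: here is a
minimal, self-contained sketch of why that happens. This is plain Java, not
Mahout's actual FuzzyKMeansClusterer, and the class name, sizes, and constants
are made up for illustration. In fuzzy k-means the membership of every point
in every cluster is strictly positive, so each updated center is a weighted
sum over all points and ends up carrying the union of their nonzero
dimensions, even when the inputs are very sparse; the same argument applies to
the per-cluster radius vector.

import java.util.*;

// Toy sketch (not Mahout code; all sizes invented) of one fuzzy k-means
// center update over sparse points.
public class FuzzyCenterSketch {

  // Squared Euclidean distance between two sparse vectors (index -> weight).
  static double distSq(Map<Integer, Double> a, Map<Integer, Double> b) {
    Set<Integer> dims = new HashSet<>(a.keySet());
    dims.addAll(b.keySet());
    double d = 0;
    for (int i : dims) {
      double diff = a.getOrDefault(i, 0.0) - b.getOrDefault(i, 0.0);
      d += diff * diff;
    }
    return d;
  }

  public static void main(String[] args) {
    int numDims = 10_000, numPoints = 1_000, k = 20;
    double m = 2.0;                               // fuzziness exponent
    Random rnd = new Random(42);

    // Sparse points: roughly 10 nonzero dimensions each, out of 10,000.
    List<Map<Integer, Double>> points = new ArrayList<>();
    for (int p = 0; p < numPoints; p++) {
      Map<Integer, Double> v = new HashMap<>();
      for (int j = 0; j < 10; j++) {
        v.put(rnd.nextInt(numDims), rnd.nextDouble());
      }
      points.add(v);
    }

    // Initial centers: copies of the first k points, so still sparse.
    List<Map<Integer, Double>> centers = new ArrayList<>();
    List<Map<Integer, Double>> newCenters = new ArrayList<>();
    double[] weightSums = new double[k];
    for (int c = 0; c < k; c++) {
      centers.add(new HashMap<>(points.get(c)));
      newCenters.add(new HashMap<>());
    }

    // One iteration. The membership u is strictly positive for EVERY cluster,
    // so every point contributes its nonzero dimensions to every new center.
    for (Map<Integer, Double> p : points) {
      double[] d = new double[k];
      for (int c = 0; c < k; c++) {
        d[c] = Math.max(distSq(p, centers.get(c)), 1e-12);
      }
      for (int c = 0; c < k; c++) {
        double sum = 0;
        for (int c2 = 0; c2 < k; c2++) {
          sum += Math.pow(d[c] / d[c2], 1.0 / (m - 1));  // squared distances, hence 1/(m-1)
        }
        double w = Math.pow(1.0 / sum, m);               // u^m, never exactly zero
        weightSums[c] += w;
        for (Map.Entry<Integer, Double> e : p.entrySet()) {
          newCenters.get(c).merge(e.getKey(), w * e.getValue(), Double::sum);
        }
      }
    }
    for (int c = 0; c < k; c++) {
      final double ws = weightSums[c];
      newCenters.get(c).replaceAll((dim, val) -> val / ws);
    }

    // Each center now carries roughly the union of all points' nonzero
    // dimensions -- it has gone dense after a single pass.
    System.out.println("center 0 nonzeros after one iteration: "
        + newCenters.get(0).size() + " of " + numDims + " dimensions");
  }
}

With 7M+ emails and a realistic vocabulary, that union is essentially the
whole term space, which is why each center and radius becomes a dense vector
of vocabulary size no matter how sparse the documents are.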
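
P.P.S. On Tom's "warn at runtime" idea: a rough sketch of what such a check
could look like, assuming a back-of-the-envelope estimate of the model's
in-memory size. The class name, method names, and the 16-bytes-per-nonzero
figure below are invented for illustration; nothing like this exists in Mahout
today.

import java.util.logging.Logger;

// Sketch only: a post-training check that warns when the model is unlikely
// to fit in the heap it will later be loaded into. Not an existing Mahout
// API; the size estimate is a made-up rule of thumb.
public class ModelFootprintWarning {

  private static final Logger LOG =
      Logger.getLogger(ModelFootprintWarning.class.getName());

  // Very rough in-memory size: 8 bytes for the value plus ~8 bytes of
  // index/object overhead per nonzero parameter.
  static long estimateModelBytes(long nonZeroParameters) {
    return nonZeroParameters * 16L;
  }

  // Call once training finishes; targetHeapBytes is the heap of the JVM that
  // will eventually load the model (configured, since it usually is not the
  // JVM that did the training).
  static void warnIfModelWontFit(long nonZeroParameters, long targetHeapBytes) {
    long needed = estimateModelBytes(nonZeroParameters);
    if (needed > targetHeapBytes) {
      LOG.warning(String.format(
          "Trained model needs roughly %,d MB to load, but the target heap is only %,d MB;"
              + " increase -Xmx there or reduce the feature/cluster count.",
          needed >> 20, targetHeapBytes >> 20));
    }
  }

  public static void main(String[] args) {
    // e.g. 20 dense centers over a 10^9-dimensional feature space,
    // to be loaded into a 4 GB heap.
    warnIfModelWontFit(20L * 1_000_000_000L, 4L << 30);
  }
}

The obvious caveat, and part of why Ted is skeptical, is that the JVM doing
the training usually isn't the one that will load the model, so the threshold
has to be configured rather than read from the local Runtime, and the warning
can only ever be an estimate.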