This is something that I'm enthusiastic about investigating right now. I'm heartened that K-Means seems to scale well in your tests, and I think I've just improved Dirichlet a lot; I'd like to test it again with your data. FuzzyK is problematic because its clusters always end up with dense vectors for their centers and radii; I think it will always be a hog. 100GB is not a huge data set and it should sing on a 10-node cluster. Even without MapR <grin>.
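
To make that FuzzyK point concrete, here is a toy sketch (not Mahout's actual FuzzyKMeansDriver code; the data and the fuzziness exponent m = 2 are made up for illustration). Because every point gets a strictly positive membership in every cluster, each center's weighted sum picks up every nonzero dimension that appears anywhere in the input, so sparse points produce dense centers:

public class FuzzyCenterDensity {
  public static void main(String[] args) {
    double m = 2.0;  // fuzziness exponent (made up for the example)

    // Three sparse points over a 6-dimensional space, one nonzero entry each.
    double[][] points = {
        {1, 0, 0, 0, 0, 0},
        {0, 0, 2, 0, 0, 0},
        {0, 0, 0, 0, 0, 3}
    };
    // Three starting centers, also sparse.
    double[][] centers = {
        {1, 0, 0, 0, 0, 0},
        {0, 0, 1, 0, 0, 0},
        {0, 0, 0, 0, 0, 1}
    };

    int k = centers.length;
    int dims = points[0].length;
    double[][] newCenters = new double[k][dims];
    double[] weightSums = new double[k];

    for (double[] p : points) {
      // Fuzzy membership u_c is proportional to dist(p, center_c)^(-2/(m-1));
      // it is strictly positive for every cluster, never exactly zero.
      double[] u = new double[k];
      double norm = 0;
      for (int c = 0; c < k; c++) {
        double d2 = 1e-9;  // small floor so an exact point/center match doesn't divide by zero
        for (int j = 0; j < dims; j++) {
          double diff = p[j] - centers[c][j];
          d2 += diff * diff;
        }
        u[c] = 1.0 / Math.pow(d2, 1.0 / (m - 1));
        norm += u[c];
      }
      // Centroid update: every cluster accumulates a (tiny but nonzero) share of every point.
      for (int c = 0; c < k; c++) {
        double w = Math.pow(u[c] / norm, m);
        weightSums[c] += w;
        for (int j = 0; j < dims; j++) {
          newCenters[c][j] += w * p[j];
        }
      }
    }

    // Each updated center now has a nonzero value in every dimension any point touched,
    // even though each input point had a single nonzero entry.
    for (int c = 0; c < k; c++) {
      int nonZero = 0;
      for (int j = 0; j < dims; j++) {
        if (newCenters[c][j] / weightSums[c] != 0) {
          nonZero++;
        }
      }
      System.out.println("center " + c + ": " + nonZero + " nonzero of " + dims + " dimensions");
    }
  }
}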

I think improving our predictability at scale is a great goal for 1.0. Getting started would be a great goal for 0.7.
Jeff

On 12/28/11 11:35 AM, Grant Ingersoll wrote:
To me, the big thing we continue to be missing is the ability for those of us 
working on the project to reliably test the algorithms at scale.  For instance, 
I've seen hints of several places where our clustering algorithms don't appear 
to scale very well (they are all M/R; K-Means does scale), and it isn't clear
to me whether the cause is our implementation, Hadoop, a data set that simply
isn't big enough, or a combination of all three.  To see this in action, try
out the ASF email archive up on Amazon with 10, 15 or 30 EC2 double x-large 
nodes and try out fuzzy k-means, Dirichlet, etc.  Now, I realize EC2 isn't 
ideal for this kind of testing, but it is all many of us have access to.  Perhaps 
it's also b/c 7M+ emails (~100GB) isn't big enough, but in some regards that's 
silly since the whole point is supposed to be that it scales.  Or perhaps my tests 
were flawed.  Either way, this seems like an area we need to focus on more.

Of course, the hard part with all of this is debugging where the bottlenecks 
are.  In the end, we need to figure out how to reliably get compute time 
available for testing, along with real data sets that we can use to validate 
scalability.
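
One rough way to make "validate scalability" operational - sketched here generically, not as existing Mahout test code, and with a placeholder workload - is to time the same job at a few increasing input sizes and fit the growth exponent of log(time) against log(size); on a fixed cluster, a genuinely scalable implementation should stay close to 1.0:

import java.util.function.LongConsumer;

public class ScalingCheck {

  /** Runs the supplied job at each input size and reports the fitted growth exponent. */
  public static double growthExponent(long[] inputSizes, LongConsumer job) {
    double[] logN = new double[inputSizes.length];
    double[] logT = new double[inputSizes.length];
    for (int i = 0; i < inputSizes.length; i++) {
      long start = System.nanoTime();
      job.accept(inputSizes[i]);  // e.g. submit a clustering run over a sample of this size
      long elapsed = System.nanoTime() - start;
      logN[i] = Math.log(inputSizes[i]);
      logT[i] = Math.log(Math.max(elapsed, 1));
    }
    // Least-squares slope of log(time) vs. log(size) approximates the growth exponent.
    double meanN = 0, meanT = 0;
    for (int i = 0; i < logN.length; i++) { meanN += logN[i]; meanT += logT[i]; }
    meanN /= logN.length;
    meanT /= logT.length;
    double num = 0, den = 0;
    for (int i = 0; i < logN.length; i++) {
      num += (logN[i] - meanN) * (logT[i] - meanT);
      den += (logN[i] - meanN) * (logN[i] - meanN);
    }
    return num / den;
  }

  public static void main(String[] args) {
    // Placeholder workload: replace with a real driver invocation over a sampled input.
    double exponent = growthExponent(new long[] {1_000_000, 2_000_000, 4_000_000},
        n -> { long acc = 0; for (long i = 0; i < n; i++) { acc += i; } });
    System.out.printf("empirical growth exponent: %.2f%n", exponent);
  }
}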


On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:

On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <t...@cloudera.com> wrote:

...

They discover Mahout, which does specifically bill itself as scalable
(from http://mahout.apache.org, in some of the largest letters: "What
is Apache Mahout?  The Apache Mahout™ machine learning library's goal
is to build scalable machine learning libraries.").  They sniff-check
it by massaging some moderately-sized data set into the same format as
an example from the wiki and they fail to get a result - often because
their problem has some very different properties (more classes, much
larger feature space, etc.) and the implementation has some limitation
that they trip over.

I have worked with users of Mahout who had 10^9 possible features and
others who are classifying
into 60,000 categories.

Neither of these implementations uses Naive Bayes.  Both work very well.
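
For context on how a feature space that large can stay tractable at all: one common approach is the hashing trick, which maps an unbounded feature set into a fixed-size weight vector. The sketch below is only an illustration of that idea, not the actual setup of those users (the thread doesn't say which encoding they relied on):

public class HashedFeatures {

  private final double[] weights;

  public HashedFeatures(int numBuckets) {
    this.weights = new double[numBuckets];
  }

  private int bucket(String featureName) {
    // Spread feature names over a fixed number of buckets, regardless of how many exist.
    return Math.floorMod(featureName.hashCode(), weights.length);
  }

  /** Dot product of the model with a sparse example given as feature names and values. */
  public double score(String[] featureNames, double[] values) {
    double sum = 0;
    for (int i = 0; i < featureNames.length; i++) {
      sum += weights[bucket(featureNames[i])] * values[i];
    }
    return sum;
  }

  /** One unregularized SGD step for a squared-error target, just to show the update shape. */
  public void update(String[] featureNames, double[] values, double target, double learningRate) {
    double err = target - score(featureNames, values);
    for (int i = 0; i < featureNames.length; i++) {
      weights[bucket(featureNames[i])] += learningRate * err * values[i];
    }
  }

  public static void main(String[] args) {
    // ~1M buckets is ~8MB of weights, no matter how many distinct features show up.
    HashedFeatures model = new HashedFeatures(1 << 20);
    model.update(new String[] {"token:mahout", "token:scalable"}, new double[] {1, 1}, 1.0, 0.1);
    System.out.println(model.score(new String[] {"token:mahout"}, new double[] {1}));
  }
}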

They will usually try one of the simplest methods available under the
assumption "well, if this doesn't scale well, the more complex methods
are surely no better".

Silly assumption.


This may not be entirely fair, but since the
docs they're encountering on the main website and wiki don't warn them
that certain implementations don't necessarily scale, or scale in quite
different ways, it's certainly not unreasonable.

Well, it is actually silly.

Clearly the docs can be better.  Clearly the code quality can be better,
especially in terms of nuking capabilities that have not found an audience.
But clearly also just trying one technique without asking anybody what the
limitations are isn't going to work as an evaluation technique.  This is
exactly analogous to somebody finding that a matrix in R doesn't do what a
data frame is supposed to do.  It doesn't and you aren't going to find out
why or how from the documentation very quickly.

In both cases of investigating Mahout or investigating R you will find out
plenty if you ask somebody who knows what they are talking about.

They're at best going to
conclude the scalability will be hit-and-miss when a simple method
doesn't work.  Perhaps they'll check in again in 6-12 months.

Maybe so.  Maybe not.  I have little sympathy with people who make
scatter-shot decisions like this.


...
I see your analogy to R or SciPy - and I don't disagree.  But those
projects do not put scaling front and center; if Mahout is going to
keep scalability as a "headline feature" (which I would like to see!),
I think prominently acknowledging how different methods fail to scale
would really help its credibility.  For what it's worth, of the people
I know who've tried Mahout, 100% of them were using R and/or SciPy
already, but were curious about Mahout specifically for better
scalability.

Did they ask on the mailing list?


I'm not sure where this information is best placed - it would be great
to see it on the Wiki along with the examples, at least.

Sounds OK.  Maybe we should put it in the book.

(oh... wait, we already did that)


It would be
awesome to see warnings at runtime ("Warning: You just trained a model
that you cannot load without at least 20GB of RAM"), but I'm not sure
how realistic that is.
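
For concreteness, such a check could look roughly like the hypothetical sketch below (the class, numbers, and message are made up for illustration; this is not an existing Mahout API):

public class ModelSizeWarning {

  /** Rough dense footprint in bytes: one double per (feature, label) weight. */
  public static long estimatedBytes(long numFeatures, long numLabels) {
    return numFeatures * numLabels * Double.BYTES;
  }

  public static void warnIfTooBig(long numFeatures, long numLabels) {
    long needed = estimatedBytes(numFeatures, numLabels);
    long maxHeap = Runtime.getRuntime().maxMemory();
    if (needed > maxHeap) {
      System.err.printf(
          "Warning: this model needs roughly %,d MB to load, but the current JVM max heap is %,d MB.%n",
          needed >> 20, maxHeap >> 20);
    }
  }

  public static void main(String[] args) {
    // e.g. 1M features x 60,000 labels stored densely is ~4.8e11 bytes, far beyond any sane heap.
    warnIfTooBig(1_000_000L, 60_000L);
  }
}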

I think it is fine for loading the model to fail with a clear error message,
but putting yellow warning tape all over the user's keyboard isn't going to
help anything.


I would like it to be easier to determine, at some very high level, why
something didn't work when an experiment fails.  Ideally, without having to
dive into the code at all.

How about you ask an expert?

That really is easier.  It helps the community to hear about what other
people need and it helps the new user to hear what other people have done.
--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com