On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <t...@cloudera.com> wrote:
...
> They discover Mahout, which does specifically bill itself as scalable
> (from http://mahout.apache.org, in some of the largest letters: "What
> is Apache Mahout? The Apache Mahout™ machine learning library's goal
> is to build scalable machine learning libraries."). They sniff-test it
> by massaging some moderately sized data set into the same format as
> an example from the wiki, and they fail to get a result - often because
> their problem has some very different properties (more classes, a much
> larger feature space, etc.) and the implementation has some limitation
> that they trip over.
I have worked with users of Mahout who had 10^9 possible features and
others who are classifying into 60,000 categories. Neither of these
implementations uses Naive Bayes. Both work very well.
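As an aside on how a feature space on the order of 10^9 stays tractable: the standard trick is feature hashing, which is also how Mahout's SGD classifiers encode features (via hashed feature encoders). A minimal sketch in Python - the function name, bucket count, and token strings here are illustrative, not Mahout's API:

```python
import zlib

# Bucket count is illustrative; real deployments trade collision rate
# against memory by tuning this.
NUM_BUCKETS = 2 ** 20

def hash_features(tokens, num_buckets=NUM_BUCKETS):
    """Fold arbitrarily many raw feature names into a fixed-size sparse vector."""
    vec = {}
    for tok in tokens:
        # Hash the feature name straight to a bucket index; no dictionary
        # mapping 10^9 feature names to columns ever has to live in memory.
        idx = zlib.crc32(tok.encode("utf-8")) % num_buckets
        vec[idx] = vec.get(idx, 0) + 1
    return vec
```

The point is that memory is bounded by the bucket count, not by the raw feature vocabulary, which is why a billion-feature problem can still fit on ordinary hardware.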
> They will usually try one of the simplest methods available under the
> assumption "well, if this doesn't scale well, the more complex methods
> are surely no better".
Silly assumption.
> This may not be entirely fair, but since the docs they're encountering
> on the main website and wiki don't warn them that different
> implementations scale in different ways, it's certainly not an
> unreasonable assumption.
Well, it is actually silly.
Clearly the docs can be better. Clearly the code quality can be better,
especially in terms of nuking capabilities that have not found an audience.
But clearly also just trying one technique without asking anybody what the
limitations are isn't going to work as an evaluation technique. This is
exactly analogous to somebody finding that a matrix in R doesn't do what a
data frame is supposed to do. It doesn't and you aren't going to find out
why or how from the documentation very quickly.
In both cases, whether investigating Mahout or investigating R, you will
find out plenty if you ask somebody who knows what they are talking about.
> They're at best going to conclude that scalability will be hit-and-miss
> when a simple method doesn't work. Perhaps they'll check in again in
> 6-12 months.
Maybe so. Maybe not. I have little sympathy with people who make
scatter-shot decisions like this.
...
> I see your analogy to R or SciPy - and I don't disagree. But those
> projects do not put scaling front and center; if Mahout is going to
> keep scalability as a "headline feature" (which I would like to see!),
> I think prominently acknowledging how different methods fail to scale
> would really help its credibility. For what it's worth, of the people
> I know who've tried Mahout, 100% were already using R and/or SciPy,
> but were curious about Mahout specifically for its better scalability.
Did they ask on the mailing list?
> I'm not sure where this information is best placed - it would be great
> to see it on the Wiki along with the examples, at least.
Sounds OK. Maybe we should put it in the book.
(oh... wait, we already did that)
> It would be awesome to see warnings at runtime ("Warning: You just
> trained a model that you cannot load without at least 20GB of RAM"),
> but I'm not sure how realistic that is.
I think it is fine that loading the model fails with a clear error message,
but putting yellow warning tape all over the user's keyboard isn't going to
help anything.
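To make the "fail with a clear error message" idea concrete: a loader can check the obvious resource constraint up front and raise one specific error instead of dying later with an opaque out-of-memory failure. A Python sketch - the function name and the 3x on-disk-to-in-memory expansion factor are assumptions for illustration, not Mahout code:

```python
import os

def load_model(path, available_bytes):
    """Load a serialized model, failing fast with a specific message
    if it clearly will not fit in memory."""
    # Assumed rough expansion factor from serialized size to in-memory size.
    needed = os.path.getsize(path) * 3
    if needed > available_bytes:
        raise MemoryError(
            f"model at {path} needs roughly {needed:,} bytes in memory "
            f"but only {available_bytes:,} bytes are available; "
            "rerun with more heap or use a smaller model"
        )
    with open(path, "rb") as f:
        return f.read()
```

One targeted check like this at the point of failure gives the user the "why it didn't work" answer without papering the happy path with warnings.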
I would like it to be easier to determine, at some very high level, why
something didn't work when an experiment fails. Ideally, without having to
dive into the code at all.
How about you ask an expert?
That really is easier. It helps the community to hear about what other
people need and it helps the new user to hear what other people have done.