On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <t...@cloudera.com> wrote:

> ...
>
> They discover Mahout, which does specifically bill itself as scalable
> (from http://mahout.apache.org, in some of the largest letters: "What
> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
> is to build scalable machine learning libraries.").  They sniff-test
> it by massaging some moderately-sized data set into the same format as
> an example from the wiki and they fail to get a result - often because
> their problem has some very different properties (more classes, much
> larger feature space, etc.) and the implementation has some limitation
> that they trip over.
>

I have worked with users of Mahout who had 10^9 possible features and others
who are classifying into 60,000 categories.

Neither of these implementations uses Naive Bayes.  Both work very well.
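
For what it is worth, the usual trick in the huge-feature-space case is to
hash the raw features into a fixed-size vector so the model never has to
grow with the nominal vocabulary.  Here is a minimal sketch using Mahout's
SGD classifier and hashed encoders roughly as described in Mahout in Action;
the category count, vector size, tokens and parameters below are made-up
illustration, not a recipe:

  import org.apache.mahout.classifier.sgd.L1;
  import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
  import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

  public class HashedFeatureSketch {
    public static void main(String[] args) {
      int numCategories = 20;        // hypothetical; could just as well be 60,000
      int hashedVectorSize = 100000; // fixed size, regardless of raw feature count

      OnlineLogisticRegression learner =
          new OnlineLogisticRegression(numCategories, hashedVectorSize, new L1())
              .lambda(1.0e-4)
              .learningRate(10);

      // hashes each token into the fixed-size vector instead of assigning ids
      FeatureVectorEncoder encoder = new StaticWordValueEncoder("token");

      // hypothetical training example: a bag of tokens with a known label
      String[] tokens = {"mahout", "scalable", "classifier"};
      int label = 3;

      Vector v = new RandomAccessSparseVector(hashedVectorSize);
      for (String token : tokens) {
        encoder.addToVector(token, v);
      }
      learner.train(label, v);

      // classifyFull returns a score for every category
      Vector scores = learner.classifyFull(v);
      System.out.println("best category = " + scores.maxValueIndex());
    }
  }

The point of the hashed encoding is that the vector size is chosen up front,
so 10^9 nominal features do not turn into 10^9 model parameters.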

They will usually try one of the simplest methods available under the
> assumption "well, if this doesn't scale well, the more complex methods
> are surely no better".


Silly assumption.


> This may not be entirely fair, but since the
> docs they're encountering on the main website and wiki don't warn them
> that different implementations scale (or fail to scale) in different
> ways, it's certainly not unreasonable.


Well, it is actually silly.

Clearly the docs can be better.  Clearly the code quality can be better,
especially in terms of nuking capabilities that have not found an audience.
But clearly also, just trying one technique without asking anybody what the
limitations are isn't going to work as an evaluation strategy.  This is
exactly analogous to somebody finding that a matrix in R doesn't do what a
data frame is supposed to do.  It doesn't, and you aren't going to find out
why or how from the documentation very quickly.

In both cases, whether you are investigating Mahout or R, you will find out
plenty if you ask somebody who knows what they are talking about.

They're at best going to
> conclude the scalability will be hit-and-miss when a simple method
> doesn't work.  Perhaps they'll check in again in 6-12 months.
>

Maybe so.  Maybe not.  I have little sympathy with people who make
scatter-shot decisions like this.


> ...
> I see your analogy to R or SciPy - and I don't disagree.  But those
> projects do not put scaling front and center; if Mahout is going to
> keep scalability as a "headline feature" (which I would like to see!),
> I think prominently acknowledging how different methods fail to scale
> would really help its credibility.  For what it's worth, of the people
> I know who've tried Mahout, 100% of them were using R and/or SciPy
> already, but were curious about Mahout specifically for better
> scalability.
>

Did they ask on the mailing list?


> I'm not sure where this information is best placed - it would be great
> to see it on the Wiki along with the examples, at least.


Sounds OK.  Maybe we should put it in the book.

(oh... wait, we already did that)


> It would be
> awesome to see warnings at runtime ("Warning: You just trained a model
> that you cannot load without at least 20GB of RAM"), but I'm not sure
> how realistic that is.


I think it is fine that loading the model fails with a clear error message,
but putting yellow warning tape all over the user's keyboard isn't going to
help anything.
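
To make that concrete, a good-enough failure only needs a small guard in
front of the deserialization step.  This is a hypothetical sketch, not
anything that exists in Mahout today; the class name, the size check and
the wording of the message are all assumptions:

  import java.io.File;

  public class ModelLoadGuard {
    // Hypothetical guard: refuse to load a model that clearly cannot fit
    // in the heap, and say so in terms the user can act on.
    public static void checkLoadable(File modelFile) {
      long modelBytes = modelFile.length();
      long maxHeapBytes = Runtime.getRuntime().maxMemory();
      if (modelBytes > maxHeapBytes) {
        throw new IllegalStateException(
            "Model " + modelFile + " is " + (modelBytes >> 20)
                + " MB on disk but the JVM heap is limited to "
                + (maxHeapBytes >> 20) + " MB; rerun with a larger -Xmx.");
      }
    }
  }

A check like that fires exactly once, at the moment the user actually hits
the limit, which is all the warning anybody needs.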


> I would like it to be easier to determine, at some very high level, why
> something didn't work when an experiment fails.  Ideally, without having to
> dive into the code at all.
>

How about you ask an expert?

That really is easier.  It helps the community to hear about what other
people need and it helps the new user to hear what other people have done.
