The users I'm talking about are often quite advanced in many ways -
familiar with R, SAS, etc., capable of coding up their own
implementations based on papers, etc.  They don't know Mahout, they
aren't eager to study a new API out of curiosity, but they would like
to find a suite of super-scalable (in terms of parallelized effort and
data size) ML tools.

They discover Mahout, which does specifically bill itself as scalable
(from http://mahout.apache.org, in some of the largest letters: "What
is Apache Mahout?  The Apache Mahout™ machine learning library's goal
is to build scalable machine learning libraries.").  They sniff check
it by massaging some moderately-sized data set into the same format as
an example from the wiki and they fail to get a result - often because
their problem has some very different properties (more classes, much
larger feature space, etc.) and the implementation has some limitation
that they trip over.
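
To make the "massaging" step concrete, here is roughly the kind of
conversion I mean - turning an existing sparse feature set into the
SequenceFile<Text, VectorWritable> layout the wiki examples expect.
The Mahout and Hadoop classes are real, but the paths, the dimensions,
and the readMyFeatures() loader are invented for illustration:

    // Sketch only: paths, sizes, and readMyFeatures() are hypothetical.
    import java.io.IOException;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class ConvertToVectors {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("my-dataset/vectors");  // hypothetical output path
        int numFeatures = 1000000;                  // much larger than the wiki example

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
        try {
          // readMyFeatures() stands in for however the user already stores
          // documents: here, docId -> (featureIndex -> value).
          for (Map.Entry<String, Map<Integer, Double>> doc : readMyFeatures().entrySet()) {
            Vector v = new RandomAccessSparseVector(numFeatures);
            for (Map.Entry<Integer, Double> f : doc.getValue().entrySet()) {
              v.setQuick(f.getKey(), f.getValue());
            }
            writer.append(new Text(doc.getKey()), new VectorWritable(v));
          }
        } finally {
          writer.close();
        }
      }

      private static Map<String, Map<Integer, Double>> readMyFeatures() {
        throw new UnsupportedOperationException("placeholder for the user's own loader");
      }
    }

It's exactly at this point - large numFeatures, many labels - that the
hidden limits tend to show up, with no warning from the docs.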

They will usually try one of the simplest methods available under the
assumption "well, if this doesn't scale well, the more complex methods
are surely no better".  This may not be entirely fair, but since the
docs they're encountering on the main website and wiki don't warn them
that individual implementations may fail to scale along particular
dimensions, it's certainly not unreasonable.  They're at best going to
conclude the scalability will be hit-and-miss when a simple method
doesn't work.  Perhaps they'll check in again in 6-12 months.

In truth, most of these users would probably never use Mahout's NB
trainer "for real" - most would write their own that required no
interim data transformation from their existing feature space, since
that is often easier than productionalizing the conversion.  However,
they will use it as the tryout method - and they think they're really
giving the project the best possible chance to "shine" - because they
haven't even begun to consider quality/stability of models yet.
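
As a rough illustration of what "write their own" means here: a few
dozen lines of plain Java that consume whatever sparse representation
the user already has, with no format conversion at all.  This is just
a sketch, not Mahout's implementation, and every name in it is made up:

    // Hypothetical "roll your own" multinomial NB trainer that accumulates
    // counts directly from the user's native sparse features.
    import java.util.HashMap;
    import java.util.Map;

    public class TinyNaiveBayes {
      private final Map<String, Map<Integer, Double>> featureCounts =
          new HashMap<String, Map<Integer, Double>>();
      private final Map<String, Double> totalCounts = new HashMap<String, Double>();
      private final Map<String, Long> docCounts = new HashMap<String, Long>();
      private long numDocs = 0;

      /** Accumulate counts for one labeled document in its native sparse form. */
      public void add(String label, Map<Integer, Double> features) {
        Map<Integer, Double> fc = featureCounts.get(label);
        if (fc == null) {
          fc = new HashMap<Integer, Double>();
          featureCounts.put(label, fc);
        }
        for (Map.Entry<Integer, Double> e : features.entrySet()) {
          Double old = fc.get(e.getKey());
          fc.put(e.getKey(), (old == null ? 0.0 : old) + e.getValue());
          Double t = totalCounts.get(label);
          totalCounts.put(label, (t == null ? 0.0 : t) + e.getValue());
        }
        Long d = docCounts.get(label);
        docCounts.put(label, (d == null ? 0L : d) + 1);
        numDocs++;
      }

      /** Laplace-smoothed log-likelihood of a feature given a label. */
      public double logLikelihood(String label, int feature, int numFeatures) {
        Map<Integer, Double> fc = featureCounts.get(label);
        double count = (fc != null && fc.containsKey(feature)) ? fc.get(feature) : 0.0;
        double total = totalCounts.containsKey(label) ? totalCounts.get(label) : 0.0;
        return Math.log((count + 1.0) / (total + numFeatures));
      }

      /** Log-prior of a label. */
      public double logPrior(String label) {
        return Math.log(docCounts.get(label) / (double) numDocs);
      }
    }

Something like this is why the conversion step, not the algorithm, is
usually the part people refuse to productionize.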

I see your analogy to R or SciPy - and I don't disagree.  But those
projects do not put scaling front and center; if Mahout is going to
keep scalability as a "headline feature" (which I would like to see!),
I think prominently acknowledging how different methods fail to scale
would really help its credibility.  For what it's worth, of the people
I know who've tried Mahout, 100% of them were already using R and/or
SciPy, but were curious about Mahout specifically for better
scalability.

I'm not sure where this information is best placed - it would be great
to see it on the Wiki along with the examples, at least.  It would be
awesome to see warnings at runtime ("Warning: You just trained a model
that you cannot load without at least 20GB of RAM"), but I'm not sure
how realistic that is.  I would like it to be easier to determine, at
some very high level, why something didn't work when an experiment
fails.  Ideally, without having to dive into the code at all.
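
A sketch of what I imagine such a warning could look like: a
post-training check that estimates the dense in-memory footprint of the
model and logs a warning.  The 8-bytes-per-weight estimate and the
threshold are my assumptions, not anything Mahout does today:

    // Hypothetical post-training size check; constants are arbitrary.
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public final class ModelSizeCheck {
      private static final Logger log = LoggerFactory.getLogger(ModelSizeCheck.class);
      private static final long MAX_COMFORTABLE_BYTES = 4L * 1024 * 1024 * 1024; // 4 GB, arbitrary

      private ModelSizeCheck() {}

      public static void warnIfHuge(long numLabels, long numFeatures) {
        // Dense label-by-feature weight matrix of doubles, ignoring object overhead.
        long estimatedBytes = numLabels * numFeatures * 8L;
        if (estimatedBytes > MAX_COMFORTABLE_BYTES) {
          log.warn("You just trained a model whose weight matrix alone needs roughly {} GB of RAM to load",
                   String.format("%.1f", estimatedBytes / (1024.0 * 1024.0 * 1024.0)));
        }
      }
    }

Even a crude estimate like that, surfaced at the end of a training job,
would save people a lot of head-scratching.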

-tom

On Tue, Dec 27, 2011 at 5:14 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> On Tue, Dec 27, 2011 at 2:13 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>> Yes, i think this one is in terms of documentation.
>
> I meant, this patch is going in, in terms of its effects on the API
> and its docs.
>
>>
>> The wiki technically doesn't require annotations to be useful for
>> describing method use, though.
>>
>> No plans for the command line at the moment, as far as I know. What
>> would you suggest people should see there that they cannot see on
>> the wiki?
>>
>>>
>>> When you're just trying out a package - especially one where a prime
>>> benefit you're hoping for is scalability - and you hit an unadvertised
>>> limit in scaling, there's a strong tendency to write off the entire
>>> project as "not quite ready". Especially when you don't have a lot of
>>> time to dig into the code to understand problems.
>>>
>>
>> I am not sure about this. Mahout is very much like R or SciPy, i.e. a
>> data representation framework that glues together a collection of
>> methods ranging widely in their performance (and, in this case, yes,
>> maturity - that's why it is not a 1.0 project yet). I see what you are
>> saying, but at the same time I also cannot figure out why anybody
>> would be tempted to write off R as a whole just because some of its
>> numerous packages provide an implementation that scales less well or
>> is less accurate than other implementations in R.
>>
>> Also, as far as I understand, advice against Naive Bayes is generally
>> not due to the quality of its implementation in Mahout but is rather
>> based on the characteristics of the method itself, as opposed to SGD
>> and the stated problem. NB is easy to implement, and that's why it's
>> popular - not because it is a Swiss Army knife. Therefore, such advice
>> would generally hold true whether Mahout is involved or not.
>>
>> -D
