The users I'm talking about are often quite advanced in many ways - familiar with R, SAS, and similar tools, and capable of coding up their own implementations from papers. They don't know Mahout, and they aren't eager to study a new API out of curiosity, but they would like to find a suite of super-scalable (in terms of parallelized effort and data size) ML tools.
They discover Mahout, which specifically bills itself as scalable (from http://mahout.apache.org, in some of the largest letters: "What is Apache Mahout? The Apache Mahout™ machine learning library's goal is to build scalable machine learning libraries."). They sniff-check it by massaging a moderately sized data set into the same format as an example from the wiki, and they fail to get a result - often because their problem has some very different properties (more classes, a much larger feature space, etc.) and the implementation has some limitation that they trip over.

They will usually try one of the simplest methods available, on the assumption that "if this doesn't scale well, the more complex methods are surely no better." That may not be entirely fair, but since the docs they encounter on the main website and wiki don't warn them that particular implementations have their own scaling limits, it's certainly not unreasonable. When a simple method doesn't work, they're at best going to conclude that the scalability is hit-and-miss. Perhaps they'll check in again in 6-12 months.

In truth, most of these users would probably never use Mahout's NB trainer "for real" - most would write their own that required no interim data transformation from their existing feature space, since that is often easier than productionizing the conversion. However, they will use it as the tryout method - and they think they're really giving the project the best possible chance to "shine" - because they haven't even begun to consider the quality/stability of the models yet.

I see your analogy to R or SciPy - and I don't disagree. But those projects do not put scaling front and center; if Mahout is going to keep scalability as a "headline feature" (which I would like to see!), I think prominently acknowledging how different methods fail to scale would really help its credibility.
For what it's worth, of the people I know who've tried Mahout, 100% of them were already using R and/or SciPy, but were curious about Mahout specifically for better scalability.

I'm not sure where this information is best placed - it would be great to see it on the wiki along with the examples, at least. It would be awesome to see warnings at runtime ("Warning: you just trained a model that you cannot load without at least 20GB of RAM"), but I'm not sure how realistic that is. I would like it to be easier to determine, at some very high level, why something didn't work when an experiment fails - ideally, without having to dive into the code at all.

-tom

On Tue, Dec 27, 2011 at 5:14 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> On Tue, Dec 27, 2011 at 2:13 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>> Yes, i think this one is in terms of documentation.
>
> I meant, this patch is going in, in terms of its effects on the API
> and its docs.
>
>>
>> The wiki technically doesn't require annotations to be useful in
>> describing method use, though.
>>
>> No plans for the command line at the moment, as far as i know. What
>> would you suggest people see there in addition to what they cannot
>> see on the wiki?
>>
>>>
>>> When you're just trying out a package - especially one where a prime
>>> benefit you're hoping for is scalability - and you hit an unadvertised
>>> limit in scaling, there's a strong tendency to write off the entire
>>> project as "not quite ready". Especially when you don't have a lot of
>>> time to dig into the code to understand problems.
>>>
>>
>> I am not sure about this. Mahout is very much like R or SciPy, i.e. a
>> data representation framework that glues together a collection of
>> methods ranging widely in their performance (and, in this case, yes,
>> maturity - that's why it is not a 1.0 project yet).
>> I see what you are saying, but at the same time I also cannot figure
>> out why anybody would be tempted to write off R as a whole just
>> because some of its numerous packages provide an implementation that
>> scales worse, or is less accurate, than other implementations in R.
>>
>> Also, as far as i understand, the advice against Naive Bayes is
>> generally not due to the quality of its implementation in Mahout, but
>> rather based on the characteristics of the method itself as opposed
>> to SGD and the stated problem. NB is easy to implement, and that's
>> why it's popular - not because it is a swiss army knife. Therefore,
>> that advice would generally hold true, Mahout or not.
>>
>> -D
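Purely to illustrate the runtime-warning idea raised above: a minimal, hypothetical sketch of what such a check could look like. Mahout has no such API - the class and method names here are made up, and the size estimate assumes a dense labels-by-features weight matrix of doubles, which is only a rough lower bound.

```java
// Hypothetical sketch, NOT a Mahout API: warn at train time if the model
// would not fit in a heap the size of the current JVM's max heap.
public class ModelSizeWarning {

    /** Rough bytes to hold a dense double[numLabels][numFeatures] weight
     *  matrix (8 bytes per double, ignoring object/array overhead). */
    static long estimateDenseModelBytes(long numLabels, long numFeatures) {
        return numLabels * numFeatures * 8L;
    }

    /** Print a warning if the estimated model size exceeds max heap. */
    static void warnIfTooLarge(long numLabels, long numFeatures) {
        long needed = estimateDenseModelBytes(numLabels, numFeatures);
        long maxHeap = Runtime.getRuntime().maxMemory();
        if (needed > maxHeap) {
            System.err.printf(
                "Warning: you just trained a model that needs ~%d MB to load, "
                + "but this JVM's max heap is only %d MB%n",
                needed >> 20, maxHeap >> 20);
        }
    }

    public static void main(String[] args) {
        // e.g. 10,000 labels x 1,000,000 features -> 8e10 bytes (~80 GB) dense
        warnIfTooLarge(10_000, 1_000_000);
    }
}
```

Even a crude estimate like this, emitted at the end of training, would tell the user *before* they try to load the model that their feature space has outgrown the implementation.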