On Apr 15, 2011, at 8:59 AM, Sean Owen wrote:

> I had a chance to get feedback last night from a few Old Street
> startups using Mahout. The overall comments were of course positive --
> it provides a solution that's at least 80% ready-to-go and saves a
> great deal of trial and error in getting towards something working.

Very cool.  Thanks for bringing this up.

> 
> The problems I heard were similar to last time. The jobs are uneven
> and not standard, so each has its own peculiar learning curve. There
> are evidently still a number of invisible assumptions baked into the
> code about the file structure and environment too -- I heard again
> that repeated use of "new Configuration()" around the code breaks
> things. The experience of Mahout seemed to be one of weeks of trial
> and error, some of which has to do with understanding the machinery of
> Hadoop of course.

And machine learning, I suspect.  There seems to be a fair amount of T & E in 
ML no matter what, given the need to find good parameters and to do feature 
selection.
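
On the "new Configuration()" point: the usual failure mode is that a fresh 
Configuration silently drops whatever the caller has already set up (cluster 
settings, job-specific keys) and falls back to local defaults.  A minimal 
sketch of the pitfall and the fix -- the helper here is hypothetical, not 
actual Mahout code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConfigurationPitfall {

  // Anti-pattern: new Configuration() ignores the caller's settings (e.g.
  // fs.default.name pointing at the cluster), so this can silently resolve
  // paths against the local filesystem instead.
  static Path resolveBroken(String dir) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    return fs.makeQualified(new Path(dir));
  }

  // Better: thread the caller's Configuration through everywhere.
  static Path resolve(Configuration conf, String dir) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    return fs.makeQualified(new Path(dir));
  }
}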

> Finally, there was a group that had been using the LDA
> implementation but abandoned it over scalability concerns --
> didn't get more detail on that.

Yeah, we've heard the LDA concerns before.  I actually think all of our 
clustering other than K-Means needs a good hard look in terms of performance.  
From the tests that Tim and Szymon and I did for MAHOUT-588, Dirichlet, Fuzzy 
K-Means, Mean Shift and Canopy don't look good.  In fact, K-Means is the only 
one that scaled.  We've got a repeatable framework set up for them, so the 
tests should be runnable by others.  

Now, having said that, I believe our approach with them is correct (in other 
words, they should be able to scale), so it points to either the way we were 
running them or the implementation.  I hope it's the former, but am pretty sure 
it's the latter.

Hopefully, the convergence work that Jeff is doing will make performance 
improvements easier to get, since there will be less code to debug and the 
pathways will get exercised more often.

> 
> I do reiterate that there is, at heart, a significant and eager
> developer audience who are finding all this really useful but are
> burning up a lot of energy just getting started. That's just the
> nature of this beast at version 0.x, but I think it once again
> underscores that the need is not for new algorithms, but for
> cleaning up, fixing, documenting, and streamlining what's already there.

Yeah, I agree.  I had really hoped that someone would put in for a benchmarking 
GSOC project, but I didn't see one (if you did submit one, please point me at 
it, as I missed the title!).

I also think it points to the fact that we aren't going to go from 0.5 to 1.0 
as we had thought, but instead will have at least a 0.6.

I think in order to do this scrubbing, we should focus on some real-world data 
that we can run our primary algorithms on.  I would propose the 6.5M ASF mail 
archives that I have up on S3 (see utils/bin/prep_asf_mail_archives.sh on 
trunk).  From this, I think we could test/demo our 3 C's (clustering, 
classifiers, collab filtering) along w/ frequent pattern mining.  Doing this 
will give us a consistent, easily repeated set of examples across real content 
and should help us flush out the performance issues as well as exercise many of 
the dark areas of Mahout.  This would also let us put together some recipes 
for how to do things in Mahout, especially feature selection.  The bonus is 
that all the data is freely redistributable.
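
To make "easily repeated" concrete, here is roughly what I have in mind for 
the clustering leg, driving K-Means over the vectorized mail.  Sketch only: 
the exact KMeansDriver.run and RandomSeedGenerator.buildRandom signatures are 
from memory of trunk and may not match, and the paths and k value are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;

public class AsfMailKMeans {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path vectors = new Path("asf-mail/tfidf-vectors");  // e.g. seq2sparse output
    Path seeds = new Path("asf-mail/initial-clusters");
    Path output = new Path("asf-mail/kmeans-output");
    DistanceMeasure measure = new CosineDistanceMeasure();

    // Pick k random documents as the initial centroids.
    Path clustersIn = RandomSeedGenerator.buildRandom(conf, vectors, seeds, 20, measure);

    // 10 iterations max, 0.01 convergence delta, then cluster the points.
    KMeansDriver.run(conf, vectors, clustersIn, output, measure, 0.01, 10, true, false);
  }
}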

I think we would also benefit from something similar to Lucene's randomized 
testing framework.  I'm not sure how to incorporate it just yet, but it would 
massively expand our test capabilities.
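
For illustration (this is not Lucene's actual framework, just the core idea): 
print a seed, randomize the inputs, and make failures replayable by pinning 
the seed.

import java.util.Random;

import org.junit.Assert;
import org.junit.Test;

public class RandomizedVectorTest {

  @Test
  public void addThenSubtractIsIdentity() {
    // Replay a failure with: mvn test -Dtest.seed=<printed value>
    long seed = Long.getLong("test.seed", System.nanoTime());
    System.out.println("test.seed=" + seed);
    Random random = new Random(seed);

    // Randomize the dimension and contents instead of hard-coding them.
    double[] v = new double[1 + random.nextInt(100)];
    for (int i = 0; i < v.length; i++) {
      v[i] = random.nextGaussian();
    }

    int i = random.nextInt(v.length);
    double before = v[i];
    v[i] += 1.0;
    v[i] -= 1.0;
    Assert.assertEquals(before, v[i], 1e-9);
  }
}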

I also will go back to my REST service layer.  I think if we had a service 
layer (a la Solr, etc.) where you could start jobs, get status, add content, 
get results, etc., all in a scalable way, it would really help people get up 
and running.  This is probably longer term, however.
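
To sketch what I mean -- a hypothetical JAX-RS resource, none of which exists 
in Mahout today; JobResource, the in-memory status map, and the URL paths are 
all made up for illustration:

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Response;

@Path("/jobs")
public class JobResource {

  // In-memory stand-in for real job tracking; a real version would submit
  // the Hadoop job asynchronously and poll the JobTracker for status.
  private static final Map<String, String> STATUS =
      new ConcurrentHashMap<String, String>();

  @POST
  @Path("/kmeans")
  public Response startKMeans(String jsonParams) {
    String jobId = UUID.randomUUID().toString();
    STATUS.put(jobId, "RUNNING");
    return Response.status(Response.Status.ACCEPTED)
        .entity("{\"jobId\":\"" + jobId + "\"}")
        .build();
  }

  @GET
  @Path("/{jobId}/status")
  public Response status(@PathParam("jobId") String jobId) {
    String s = STATUS.get(jobId);
    return s == null
        ? Response.status(Response.Status.NOT_FOUND).build()
        : Response.ok("{\"status\":\"" + s + "\"}").build();
  }
}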

-Grant
