On Dec 28, 2011, at 7:28 PM, Jeff Eastman wrote:

> This is something that I'm enthusiastic about investigating right now. I'm 
> heartened that K-Means seems to scale well in your tests and I think I've 
> just improved Dirichlet a lot.

I suspect we already found out why, at least for Dirichlet: it came down to 
the choice of some parameters.

> I'd like to test it again with your data. FuzzyK is problematic as its 
> clusters always end up with dense vectors for center and radius. I think it 
> will always be a hog. 100GB is not a huge data set and it should sing on a 
> 10-node cluster. Even without MapR <grin>.
> 
> I think improving our predictability at scale is a great goal for 1.0. 
> Getting started would be a great goal for 0.7.

+1
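
On the FuzzyK point above, a quick back-of-the-envelope sketch (hypothetical 
numbers, nothing measured) of why dense centers and radii hurt so badly:

    // Rough, hypothetical estimate of FuzzyK cluster-state memory.
    // The numbers below are made up for illustration only.
    public class FuzzyKMemoryEstimate {
      public static void main(String[] args) {
        long numClusters = 1000L;          // k
        long dimensionality = 5000000L;    // e.g. a text vocabulary
        long bytesPerDouble = 8L;
        // Each cluster keeps a dense center and a dense radius vector.
        long bytes = numClusters * dimensionality * bytesPerDouble * 2;
        System.out.printf("~%.1f GB just for cluster state%n",
            bytes / (1024.0 * 1024 * 1024));
      }
    }

Even when the input vectors are sparse, the per-cluster state is dense, so 
memory grows as k times the full dimensionality rather than with the data.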


> Jeff
> 
> On 12/28/11 11:35 AM, Grant Ingersoll wrote:
>> To me, the big thing we continue to be missing is the ability for those of 
>> us working on the project to reliably test the algorithms at scale.  For 
>> instance, I've seen hints of several places where our clustering algorithms 
>> don't appear to scale very well (which are all M/R -- K-Means does scale) 
>> and it isn't clear to me whether it is our implementation, Hadoop, simply 
>> that the data set isn't big enough, or a combination of all three.  To see 
>> this in action, try out the ASF email archive up on Amazon with 10, 15 or 30 
>> EC2 double x-large nodes and try out fuzzy k-means, dirichlet, etc.  Now, I 
>> realize EC2 isn't ideal for this kind of testing, but it's all many of us have 
>> access to.  Perhaps it's also because 7M+ emails isn't big enough (~100GB), but 
>> in some regards that's silly since the whole point is supposed to be it 
>> scales.  Or perhaps my tests were flawed.  Either way, it seems like it is 
>> an area we need to focus on more.
>> 
>> Of course, the hard part with all of this is debugging where the bottlenecks 
>> are.  In the end, we need to figure out how to reliably get compute time 
>> available for testing along with real data sets that we can use to 
>> validate scalability.
>> 
>> 
>> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>> 
>>> On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <t...@cloudera.com> wrote:
>>> 
>>>> ...
>>>> 
>>>> They discover Mahout, which does specifically bill itself as scalable
>>>> (from http://mahout.apache.org, in some of the largest letters: "What
>>>> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
>>>> is to build scalable machine learning libraries.").  They sniff check
>>>> it by massaging some moderately-sized data set into the same format as
>>>> an example from the wiki and they fail to get a result - often because
>>>> their problem has some very different properties (more classes, much
>>>> larger feature space, etc.) and the implementation has some limitation
>>>> that they trip over.
>>>> 
>>> I have worked with users of Mahout who had 10^9 possible features, and
>>> others who classify into 60,000 categories.
>>> 
>>> Neither of these implementations uses Naive Bayes.  Both work very well.
>>> 
>>>> They will usually try one of the simplest methods available under the
>>>> assumption "well, if this doesn't scale well, the more complex methods
>>>> are surely no better".
>>> 
>>> Silly assumption.
>>> 
>>> 
>>>> This may not be entirely fair, but since the
>>>> docs they're encountering on the main website and wiki don't warn them
>>>> that certain implementations fail to scale in various ways, it's
>>>> certainly not unreasonable.
>>> 
>>> Well, it is actually silly.
>>> 
>>> Clearly the docs can be better.  Clearly the code quality can be better,
>>> especially in terms of nuking capabilities that have not found an audience.
>>> But clearly also just trying one technique without asking anybody what the
>>> limitations are isn't going to work as an evaluation technique.  This is
>>> exactly analogous to somebody finding that a matrix in R doesn't do what a
>>> data frame is supposed to do.  It doesn't and you aren't going to find out
>>> why or how from the documentation very quickly.
>>> 
>>> In both cases of investigating Mahout or investigating R you will find out
>>> plenty if you ask somebody who knows what they are talking about.
>>> 
>>>> They're at best going to
>>>> conclude the scalability will be hit-and-miss when a simple method
>>>> doesn't work.  Perhaps they'll check in again in 6-12 months.
>>>> 
>>> Maybe so.  Maybe not.  I have little sympathy with people who make
>>> scatter-shot decisions like this.
>>> 
>>> 
>>>> ...
>>>> I see your analogy to R or SciPy, and I don't disagree.  But those
>>>> projects do not put scaling front and center; if Mahout is going to
>>>> keep scalability as a "headline feature" (which I would like to see!),
>>>> I think prominently acknowledging how different methods fail to scale
>>>> would really help its credibility.  For what it's worth, of the people
>>>> I know who've tried Mahout, 100% of them were using R and/or SciPy
>>>> already, but were curious about Mahout specifically for better
>>>> scalability.
>>>> 
>>> Did they ask on the mailing list?
>>> 
>>> 
>>>> I'm not sure where this information is best placed - it would be great
>>>> to see it on the Wiki along with the examples, at least.
>>> 
>>> Sounds OK.  Maybe we should put it in the book.
>>> 
>>> (oh... wait, we already did that)
>>> 
>>> 
>>>> It would be
>>>> awesome to see warnings at runtime ("Warning: You just trained a model
>>>> that you cannot load without at least 20GB of RAM"), but I'm not sure
>>>> how realistic that is.
>>> 
>>> I think it is fine if loading the model fails with a clear error message,
>>> but putting yellow warning tape all over the user's keyboard isn't going
>>> to help anything.
>>> 
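
For what it's worth, a fail-fast check along those lines could be pretty 
small. This is only a sketch with made-up names, not anything that exists in 
Mahout today:

    // Hypothetical sketch, not actual Mahout API: fail fast with an
    // informative message instead of an opaque OutOfMemoryError.
    public class ModelSizeCheck {
      static void checkFits(long numCategories, long numFeatures) {
        long estimatedBytes = numCategories * numFeatures * 8L; // dense doubles
        long maxHeap = Runtime.getRuntime().maxMemory();
        if (estimatedBytes > maxHeap) {
          throw new IllegalStateException(String.format(
              "Model needs roughly %d MB but max heap is %d MB; "
                  + "increase -Xmx or use a sparser encoding.",
              estimatedBytes >> 20, maxHeap >> 20));
        }
      }
    }

The failure message then tells the user what to change, which is about all a 
warning banner could do anyway.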
>>> 
>>>> I would like it to be easier to determine, at some very high level, why
>>>> something didn't work when an experiment fails.  Ideally, without having to
>>>> dive into the code at all.
>>>> 
>>> How about you ask an expert?
>>> 
>>> That really is easier.  It helps the community to hear about what other
>>> people need and it helps the new user to hear what other people have done.
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


