On Dec 28, 2011, at 1:47 PM, Ted Dunning wrote:

> I have nearly given up on getting publicly available large data sets and
> have started to specify synthetic datasets for development projects.   The
> key is to build reasonably realistic generation algorithms and for that
> there are always some serious difficulties.

Yeah, I agree. 

Still, 7M+ real emails seems like an interesting size for us while not being 
overwhelming.  Of course, that only solves half of the problem.  We still need 
access to a cluster so we can run experiments regularly.

> 
> For simple scaling tests, however, synthetic data is often just the ticket.
> You still need some sophistication about the data, but it doesn't take
> much.  For k-means clustering of text documents, for instance, you can
> re-sample from real text to generate new text with desired properties or
> you can define an LDA-like generator to generate data with known clustering
> properties.  Similarly, to test scaling of classification algorithms, it is
> easy to generate text-like data with known properties.

I still like our idea of a "good fake data" project.  Or at least a Util in 
Mahout.
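
To make that concrete, here's a minimal sketch of the kind of generator Ted 
describes: documents drawn from a mixture of multinomials, so every document's 
true cluster is known.  Everything below (class name, vocabulary size, cluster 
counts) is made up for illustration; it isn't anything Mahout ships today.

    import java.util.Random;

    /**
     * Rough sketch of a "good fake data" generator: documents are drawn from a
     * mixture of multinomials so the true cluster of every document is known.
     * Vocabulary size, cluster count and document length are illustration
     * values only.
     */
    public class FakeTextGenerator {
      public static void main(String[] args) {
        int numClusters = 5;
        int vocabSize = 10000;
        int docsPerCluster = 1000;
        int docLength = 200;
        Random rnd = new Random(42);

        // One word distribution per cluster: a Zipf-like background plus a
        // cluster-specific bump so clusters are separable but not trivial.
        double[][] wordProbs = new double[numClusters][vocabSize];
        for (int c = 0; c < numClusters; c++) {
          double sum = 0;
          for (int w = 0; w < vocabSize; w++) {
            double p = 1.0 / (w + 1);          // Zipf-like background
            if (w % numClusters == c) {
              p *= 5;                          // cluster-specific boost
            }
            wordProbs[c][w] = p;
            sum += p;
          }
          for (int w = 0; w < vocabSize; w++) {
            wordProbs[c][w] /= sum;            // normalize to a distribution
          }
        }

        // Emit documents as "clusterId<TAB>w17 w4093 ..." so the true labels
        // travel with the data and can be checked after clustering.
        for (int c = 0; c < numClusters; c++) {
          for (int d = 0; d < docsPerCluster; d++) {
            StringBuilder doc = new StringBuilder(c + "\t");
            for (int i = 0; i < docLength; i++) {
              doc.append("w").append(sample(wordProbs[c], rnd)).append(' ');
            }
            System.out.println(doc);
          }
        }
      }

      // Draw one word index from a discrete distribution.
      private static int sample(double[] probs, Random rnd) {
        double u = rnd.nextDouble();
        double cum = 0;
        for (int w = 0; w < probs.length; w++) {
          cum += probs[w];
          if (u <= cum) {
            return w;
          }
        }
        return probs.length - 1;
      }
    }

Because the generating distributions are explicit, the same code scales the 
corpus to any size just by turning up docsPerCluster, which is exactly the 
property we want for scaling runs.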

> 
> The primary virtues of synthetic data are that a synthetic data set is easy
> to carry around and it can be any size at all.
> 
> As an example of a potential pitfall, I wrote tests for the sequential
> version of the SSVD codes by building low rank matrices and testing the
> reconstruction error.  This is a fine test for correctness and some scaling
> attributes, but it ignores the truncation error that Radim was fulminating
> about recently.  It would be good to additionally explore large matrices
> that are more realistic because they are generated as count data from a
> model that has a realistic spectrum.
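
Along those lines, here is a rough sketch of a test-matrix generator with a 
controlled, power-law spectrum.  It uses plain Java arrays and made-up class 
and method names; nothing here is taken from the actual SSVD tests.

    import java.util.Random;

    /**
     * Sketch of a test-matrix generator: build A = U * diag(s) * V' where the
     * values in s decay like a power law, so the "true" spectrum is known and
     * the tail is not exactly zero.  A real test would feed this into the SSVD
     * code and compare the recovered spectrum and reconstruction error.
     */
    public class LowRankTestMatrix {

      public static double[][] generate(int rows, int cols, int rank, long seed) {
        Random rnd = new Random(seed);
        double[][] u = gaussian(rows, rank, rnd);
        double[][] v = gaussian(cols, rank, rnd);
        double[] s = new double[rank];
        for (int k = 0; k < rank; k++) {
          s[k] = 1.0 / Math.pow(k + 1, 1.5);   // power-law decay, no hard cutoff
        }
        double[][] a = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
          for (int j = 0; j < cols; j++) {
            double sum = 0;
            for (int k = 0; k < rank; k++) {
              sum += u[i][k] * s[k] * v[j][k];
            }
            a[i][j] = sum;
          }
        }
        return a;
      }

      private static double[][] gaussian(int rows, int cols, Random rnd) {
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
          for (int j = 0; j < cols; j++) {
            m[i][j] = rnd.nextGaussian();
          }
        }
        return m;
      }
    }

Since U and V are random Gaussian rather than orthonormal, the s values only 
approximate the singular values, but the tail is non-zero, so truncation error 
actually shows up instead of being hidden by an exactly low-rank input.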
> 
> On Wed, Dec 28, 2011 at 10:35 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> 
>> To me, the big thing we continue to be missing is the ability for those of
>> us working on the project to reliably test the algorithms at scale.  For
>> instance, I've seen hints of several places where our clustering algorithms
>> don't appear to scale very well (they are all M/R; K-Means does scale),
>> and it isn't clear to me whether it is our implementation, Hadoop, the
>> data set simply not being big enough, or a combination of all three.
>> To see this in action, try out the ASF email archive up on Amazon with 10,
>> 15 or 30 EC2 double x-large nodes and try out fuzzy k-means, dirichlet,
>> etc.  Now, I realize EC2 isn't ideal for this kind of testing, but it's all
>> many of us have access to.  Perhaps it's also b/c 7M+ emails isn't big
>> enough (~100GB), but in some regards that's silly since the whole point is
>> supposed to be that it scales.  Or perhaps my tests were flawed.  Either way, it
>> seems like it is an area we need to focus on more.
>> 
>> Of course, the hard part with all of this is debugging where the
>> bottlenecks are.  In the end, we need to figure out how to reliably get
>> compute time available for testing along with real data sets that we can
>> use to validate scalability.
>> 
>> 
>> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>> 
>>> On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <t...@cloudera.com> wrote:
>>> 
>>>> ...
>>>> 
>>>> They discover Mahout, which does specifically bill itself as scalable
>>>> (from http://mahout.apache.org, in some of the largest letters: "What
>>>> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
>>>> is to build scalable machine learning libraries.").  They sniff check
>>>> it by massaging some moderately-sized data set into the same format as
>>>> an example from the wiki and they fail to get a result - often because
>>>> their problem has some very different properties (more classes, much
>>>> larger feature space, etc.) and the implementation has some limitation
>>>> that they trip over.
>>>> 
>>> 
>>> I have worked with users of Mahout who had 10^9 possible features and
>>> others who were classifying into 60,000 categories.
>>> 
>>> Neither of these implementations uses Naive Bayes.  Both work very well.
>>> 
>>>> They will usually try one of the simplest methods available under the
>>>> assumption "well, if this doesn't scale well, the more complex methods
>>>> are surely no better".
>>> 
>>> 
>>> Silly assumption.
>>> 
>>> 
>>>> This may not be entirely fair, but since the
>>>> docs they're encountering on the main website and wiki don't warn them
>>>> that certain implementations don't necessarily scale in different
>>>> ways, it's certainly not unreasonable.
>>> 
>>> 
>>> Well, it is actually silly.
>>> 
>>> Clearly the docs can be better.  Clearly the code quality can be better,
>>> especially in terms of nuking capabilities that have not found an audience.
>>> But clearly also, just trying one technique without asking anybody what the
>>> limitations are isn't going to work as an evaluation technique.  This is
>>> exactly analogous to somebody finding that a matrix in R doesn't do what a
>>> data frame is supposed to do.  It doesn't, and you aren't going to find out
>>> why or how from the documentation very quickly.
>>> 
>>> Whether you are investigating Mahout or investigating R, you will find out
>>> plenty if you ask somebody who knows what they are talking about.
>>> 
>>>> They're at best going to
>>>> conclude the scalability will be hit-and-miss when a simple method
>>>> doesn't work.  Perhaps they'll check in again in 6-12 months.
>>>> 
>>> 
>>> Maybe so.  Maybe not.  I have little sympathy with people who make
>>> scatter-shot decisions like this.
>>> 
>>> 
>>>> ...
>>>> I see your analogy to R or SciPy - and I don't disagree.  But those
>>>> projects do not put scaling front and center; if Mahout is going to
>>>> keep scalability as a "headline feature" (which I would like to see!),
>>>> I think prominently acknowledging how different methods fail to scale
>>>> would really help its credibility.  For what it's worth, of the people
>>>> I know who've tried Mahout, 100% were using R and/or SciPy
>>>> already, but were curious about Mahout specifically for better
>>>> scalability.
>>>> 
>>> 
>>> Did they ask on the mailing list?
>>> 
>>> 
>>>> I'm not sure where this information is best placed - it would be great
>>>> to see it on the Wiki along with the examples, at least.
>>> 
>>> 
>>> Sounds OK.  Maybe we should put it in the book.
>>> 
>>> (oh... wait, we already did that)
>>> 
>>> 
>>>> It would be
>>>> awesome to see warnings at runtime ("Warning: You just trained a model
>>>> that you cannot load without at least 20GB of RAM"), but I'm not sure
>>>> how realistic that is.
>>> 
>>> 
>>> I think it is fine that loading the model fails with a clear error message,
>>> but putting yellow warning tape all over the user's keyboard isn't going to
>>> help anything.
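
For what it's worth, a failure at load time could still tell the user how far 
off they are.  Purely as a sketch (the class below is hypothetical, not an 
existing Mahout API), assuming a dense weight matrix of doubles:

    /**
     * Hypothetical sanity check before loading a model: estimate the dense
     * footprint of the weight matrix and fail with a message that says how
     * much heap would be needed, instead of dying with an opaque
     * OutOfMemoryError.
     */
    public final class ModelSizeCheck {

      private ModelSizeCheck() {}

      public static void checkFits(long numFeatures, long numCategories) {
        long bytesNeeded = numFeatures * numCategories * 8L;   // doubles, dense
        long heapBytes = Runtime.getRuntime().maxMemory();
        if (bytesNeeded > heapBytes) {
          throw new IllegalStateException(
              String.format("Model needs roughly %,d MB but the JVM heap is only %,d MB;"
                  + " rerun with a larger -Xmx or a sparser model.",
                  bytesNeeded >> 20, heapBytes >> 20));
        }
      }
    }

Sparse or compressed models would need a different estimate, but even a crude 
number in the message beats a bare OutOfMemoryError.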
>>> 
>>> 
>>>> I would like it to be easier to determine, at some very high level, why
>>>> something didn't work when an experiment fails.  Ideally, without having
>>>> to dive into the code at all.
>>>> dive into the code at all.
>>>> 
>>> 
>>> How about you ask an expert?
>>> 
>>> That really is easier.  It helps the community to hear about what other
>>> people need and it helps the new user to hear what other people have done.
>> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


