Or you can take a small set of good data and generate variations to
get a big set with the same distribution curves.
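Something like this rough numpy sketch is what I have in mind (the sizes
and the 5% jitter are placeholders, not a recommendation): bootstrap-resample
the good rows and perturb them slightly so the big set keeps roughly the
same distribution curves.

    import numpy as np

    rng = np.random.default_rng(42)

    # Stand-in for a small set of real, "good" data: 1k rows, 10 features.
    small = rng.normal(loc=5.0, scale=2.0, size=(1_000, 10))
    target_rows = 1_000_000

    # Bootstrap-resample rows, then add a little per-feature jitter so the
    # big set is not just exact duplicates but keeps the same distribution.
    idx = rng.integers(0, small.shape[0], size=target_rows)
    jitter = rng.normal(0.0, 0.05 * small.std(axis=0),
                        size=(target_rows, small.shape[1]))
    big = small[idx] + jitter

    # Sanity check: means and standard deviations should roughly match.
    print(small.mean(axis=0)[:3], big.mean(axis=0)[:3])
    print(small.std(axis=0)[:3], big.std(axis=0)[:3])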

On Wed, Dec 28, 2011 at 10:47 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> I have nearly given up on getting publicly available large data sets and
> have started to specify synthetic datasets for development projects.   The
> key is to build reasonably realistic generation algorithms and for that
> there are always some serious difficulties.
>
> For simple scaling tests, however, synthetic data is often just the ticket.
>  You still need some sophistication about the data, but it doesn't take
> much.  For k-means clustering of text documents, for instance, you can
> re-sample from real text to generate new text with desired properties or
> you can define an LDA-like generator to generate data with known clustering
> properties.  Similarly, to test scaling of classification algorithms, it is
> easy to generate text-like data with known properties.
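A rough numpy sketch of the LDA-like generator idea above (plain numpy, not
Mahout code; the topic count, vocabulary size, and Dirichlet parameters are
only placeholders): each generated document carries a known topic mixture, so
a clustering run on the output can be scored against ground truth.

    import numpy as np

    rng = np.random.default_rng(0)

    # Dial these up for a real scale test; kept small here.
    n_topics, vocab_size, n_docs, doc_len = 20, 50_000, 10_000, 300

    # Sparse topic-word distributions: each topic concentrates on few words.
    topic_word = rng.dirichlet(np.full(vocab_size, 0.01), size=n_topics)

    for doc_id in range(n_docs):
        # Per-document topic mixture; its argmax serves as the ground-truth label.
        theta = rng.dirichlet(np.full(n_topics, 0.1))
        word_dist = theta @ topic_word               # document's word distribution
        words = rng.choice(vocab_size, size=doc_len, p=word_dist)
        print(doc_id, theta.argmax(), *words[:10])   # label + first few term ids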
>
> The primary virtues of synthetic data are that a synthetic data set is easy
> to carry around and it can be any size at all.
>
> As an example of a potential pitfall, I wrote tests for the sequential
> version of the SSVD codes by building low rank matrices and testing the
> reconstruction error.  This is a fine test for correctness and some scaling
> attributes, but it ignores the truncation error that Radim was fulminating
> about recently.  It would be good to additionally explore large matrices
> that are more realistic because they are generated as count data from a
> model that has a realistic spectrum.
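In numpy, that kind of low-rank correctness test looks roughly like this (not
the actual Mahout SSVD test; the dimensions and rank are arbitrary).  Because
the matrix is exactly rank k, the rank-k reconstruction error comes out near
machine precision, which is also why such a test says nothing about truncation
error on a matrix with a realistic spectrum.

    import numpy as np

    rng = np.random.default_rng(1)

    # Build a matrix with exactly rank k by multiplying two thin random factors.
    m, n, k = 2_000, 1_000, 25
    A = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))

    # Rank-k reconstruction via a full SVD (stand-in for the decomposition under test).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]

    # Relative Frobenius-norm error; ~1e-15 because A is exactly rank k.
    err = np.linalg.norm(A - A_k, "fro") / np.linalg.norm(A, "fro")
    print(f"relative reconstruction error: {err:.2e}")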
>
> On Wed, Dec 28, 2011 at 10:35 AM, Grant Ingersoll <gsing...@apache.org>wrote:
>
>> To me, the big thing we continue to be missing is the ability for those of
>> us working on the project to reliably test the algorithms at scale.  For
>> instance, I've seen hints of several places where our clustering algorithms
>> don't appear to scale very well (which are all M/R -- K-Means does scale)
>> and it isn't clear to me whether it is our implementation, Hadoop, simply
>> that the data set isn't big enough, or some combination of all three.
>>  To see this in action, try out the ASF email archive up on Amazon with 10,
>> 15 or 30 EC2 double x-large nodes and try out fuzzy k-means, dirichlet,
>> etc.  Now, I realize EC2 isn't ideal for this kind of testing, but it's all
>> many of us have access to.  Perhaps it's also b/c 7M+ emails isn't big
>> enough (~100GB), but in some regards that's silly since the whole point is
>> supposed to be that it scales.  Or perhaps my tests were flawed.  Either way, it
>> seems like it is an area we need to focus on more.
>>
>> Of course, the hard part with all of this is debugging where the
>> bottlenecks are.  In the end, we need to figure out how to reliably get
>> compute time available for testing along with real data sets that we can
>> use to validate scalability.
>>
>>
>> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>>
>> > On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <t...@cloudera.com> wrote:
>> >
>> >> ...
>> >>
>> >> They discover Mahout, which does specifically bill itself as scalable
>> >> (from http://mahout.apache.org, in some of the largest letters: "What
>> >> is Apache Mahout?  The Apache Mahout™ machine learning library's goal
>> >> is to build scalable machine learning libraries.").  They sniff check
>> >> it by massaging some moderately-sized data set into the same format as
>> >> an example from the wiki and they fail to get a result - often because
>> >> their problem has some very different properties (more classes, much
>> >> larger feature space, etc.) and the implementation has some limitation
>> >> that they trip over.
>> >>
>> >
>> > I have worked with users of Mahout who had 10^9 possible features and
>> > others who are classifying into 60,000 categories.
>> >
>> > Neither of these implementations uses Naive Bayes.  Both work very well.
>> >
>> > They will usually try one of the simplest methods available under the
>> >> assumption "well, if this doesn't scale well, the more complex methods
>> >> are surely no better".
>> >
>> >
>> > Silly assumption.
>> >
>> >
>> >> This may not be entirely fair, but since the
>> >> docs they're encountering on the main website and wiki don't warn them
>> >> that certain implementations don't necessarily scale in different
>> >> ways, it's certainly not unreasonable.
>> >
>> >
>> > Well, it is actually silly.
>> >
>> > Clearly the docs can be better.  Clearly the code quality can be better
>> > especially in terms of nuking capabilities that have not found an audience.
>> > But clearly also just trying one technique without asking anybody what the
>> > limitations are isn't going to work as an evaluation technique.  This is
>> > exactly analogous to somebody finding that a matrix in R doesn't do what a
>> > data frame is supposed to do.  It doesn't and you aren't going to find out
>> > why or how from the documentation very quickly.
>> >
>> > In both cases of investigating Mahout or investigating R you will find out
>> > plenty if you ask somebody who knows what they are talking about.
>> >
>> >> They're at best going to conclude the scalability will be hit-and-miss
>> >> when a simple method doesn't work.  Perhaps they'll check in again in
>> >> 6-12 months.
>> >>
>> >
>> > Maybe so.  Maybe not.  I have little sympathy with people who make
>> > scatter-shot decisions like this.
>> >
>> >
>> >> ...
>> >> I see your analogy to R or sciPy - and I don't disagree.  But those
>> >> projects do not put scaling front and center; if Mahout is going to
>> >> keep scalability as a "headline feature" (which I would like to see!),
>> >> I think prominently acknowledging how different methods fail to scale
>> >> would really help its credibility.  For what it's worth, of the people
>> >> I know who've tried Mahout 100% of them were using R and/or sciPy
>> >> already, but were curious about Mahout specifically for better
>> >> scalability.
>> >>
>> >
>> > Did they ask on the mailing list?
>> >
>> >
>> >> I'm not sure where this information is best placed - it would be great
>> >> to see it on the Wiki along with the examples, at least.
>> >
>> >
>> > Sounds OK.  Maybe we should put it in the book.
>> >
>> > (oh... wait, we already did that)
>> >
>> >
>> >> It would be
>> >> awesome to see warnings at runtime ("Warning: You just trained a model
>> >> that you cannot load without at least 20GB of RAM"), but I'm not sure
>> >> how realistic that is.
>> >
>> >
>> > I think it is fine that loading the model fails with a fine error message
>> > but putting yellow warning tape all over the user's keyboard isn't going to
>> > help anything.
>> >
>> >
>> >> I would like it to be easier to determine, at some very high level, why
>> >> something didn't work when an experiment fails.  Ideally, without having
>> >> to dive into the code at all.
>> >>
>> >
>> > How about you ask an expert?
>> >
>> > That really is easier.  It helps the community to hear about what other
>> > people need and it helps the new user to hear what other people have done.
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>



-- 
Lance Norskog
goks...@gmail.com
