Or you can take a small set of good data and generate variations to get a big set with the same distribution curves.
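One way to sketch this "variations with the same distribution" idea is a smoothed bootstrap: resample the small set with replacement and add a little jitter, so the big set tracks the original distribution. This is a minimal illustration, not anything from Mahout -- the function name, seed data, and noise scale are all hypothetical:

```python
import random
import statistics

def smoothed_bootstrap(sample, n, noise_scale=0.05, seed=42):
    """Generate n synthetic points by resampling `sample` with
    replacement and adding small Gaussian jitter, so the large
    set roughly preserves the small set's distribution."""
    rng = random.Random(seed)
    spread = statistics.stdev(sample)
    return [rng.choice(sample) + rng.gauss(0, noise_scale * spread)
            for _ in range(n)]

seed_data = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]   # small "good" set
big = smoothed_bootstrap(seed_data, 10_000)   # big set, same shape of data
```

For real text or count data you would resample tokens or draw from a fitted generative model instead of jittering numbers, but the principle -- a small trusted sample drives an arbitrarily large synthetic one -- is the same.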
On Wed, Dec 28, 2011 at 10:47 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> I have nearly given up on getting publicly available large data sets and
> have started to specify synthetic datasets for development projects. The
> key is to build reasonably realistic generation algorithms, and for that
> there are always some serious difficulties.
>
> For simple scaling tests, however, synthetic data is often just the
> ticket. You still need some sophistication about the data, but it doesn't
> take much. For k-means clustering of text documents, for instance, you can
> re-sample from real text to generate new text with desired properties, or
> you can define an LDA-like generator to produce data with known clustering
> properties. Similarly, to test scaling of classification algorithms, it is
> easy to generate text-like data with known properties.
>
> The primary virtues of synthetic data are that a synthetic data set is
> easy to carry around and it can be any size at all.
>
> As an example of a potential pitfall, I wrote tests for the sequential
> version of the SSVD code by building low-rank matrices and testing the
> reconstruction error. This is a fine test for correctness and some scaling
> attributes, but it ignores the truncation error that Radim was fulminating
> about recently. It would be good to additionally explore large matrices
> that are more realistic because they are generated as count data from a
> model that has a realistic spectrum.
>
> On Wed, Dec 28, 2011 at 10:35 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
>> To me, the big thing we continue to be missing is the ability for those
>> of us working on the project to reliably test the algorithms at scale.
>> For instance, I've seen hints of several places where our clustering
>> algorithms don't appear to scale very well (they are all M/R -- k-means
>> does scale), and it isn't clear to me whether the cause is our
>> implementation, Hadoop, a data set that simply isn't big enough, or some
>> combination of all three. To see this in action, try out the ASF email
>> archive up on Amazon with 10, 15 or 30 EC2 double x-large nodes and try
>> fuzzy k-means, Dirichlet, etc. Now, I realize EC2 isn't ideal for this
>> kind of testing, but it is all many of us have access to. Perhaps it's
>> also because 7M+ emails (~100GB) isn't big enough, but in some regards
>> that's silly, since the whole point is supposed to be that it scales. Or
>> perhaps my tests were flawed. Either way, it seems like an area we need
>> to focus on more.
>>
>> Of course, the hard part with all of this is debugging where the
>> bottlenecks are. In the end, we need to figure out how to reliably get
>> compute time for testing, along with real data sets that we can use to
>> validate scalability.
>>
>> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>>
>> > On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <t...@cloudera.com> wrote:
>> >
>> >> ...
>> >>
>> >> They discover Mahout, which does specifically bill itself as scalable
>> >> (from http://mahout.apache.org, in some of the largest letters: "What
>> >> is Apache Mahout? The Apache Mahout™ machine learning library's goal
>> >> is to build scalable machine learning libraries."). They sniff-check
>> >> it by massaging some moderately sized data set into the same format
>> >> as an example from the wiki, and they fail to get a result -- often
>> >> because their problem has some very different properties (more
>> >> classes, a much larger feature space, etc.) and the implementation
>> >> has some limitation that they trip over.
>> >
>> > I have worked with users of Mahout who had 10^9 possible features and
>> > others who are classifying into 60,000 categories.
>> >
>> > Neither of these implementations uses Naive Bayes. Both work very well.
>> >
>> >> They will usually try one of the simplest methods available under the
>> >> assumption "well, if this doesn't scale well, the more complex
>> >> methods are surely no better".
>> >
>> > Silly assumption.
>> >
>> >> This may not be entirely fair, but since the docs they're
>> >> encountering on the main website and wiki don't warn them that
>> >> certain implementations don't necessarily scale in different ways,
>> >> it's certainly not unreasonable.
>> >
>> > Well, it is actually silly.
>> >
>> > Clearly the docs can be better. Clearly the code quality can be
>> > better, especially in terms of nuking capabilities that have not found
>> > an audience. But clearly, also, just trying one technique without
>> > asking anybody what the limitations are isn't going to work as an
>> > evaluation technique. This is exactly analogous to somebody finding
>> > that a matrix in R doesn't do what a data frame is supposed to do. It
>> > doesn't, and you aren't going to find out why or how from the
>> > documentation very quickly.
>> >
>> > In both cases, investigating Mahout or investigating R, you will find
>> > out plenty if you ask somebody who knows what they are talking about.
>> >
>> >> They're at best going to conclude the scalability will be
>> >> hit-and-miss when a simple method doesn't work. Perhaps they'll check
>> >> in again in 6-12 months.
>> >
>> > Maybe so. Maybe not. I have little sympathy with people who make
>> > scatter-shot decisions like this.
>> >
>> >> ...
>> >>
>> >> I see your analogy to R or sciPy - and I don't disagree.
>> >> But those projects do not put scaling front and center; if Mahout is
>> >> going to keep scalability as a "headline feature" (which I would like
>> >> to see!), I think prominently acknowledging how different methods
>> >> fail to scale would really help its credibility. For what it's worth,
>> >> of the people I know who've tried Mahout, 100% of them were using R
>> >> and/or sciPy already, but were curious about Mahout specifically for
>> >> better scalability.
>> >
>> > Did they ask on the mailing list?
>> >
>> >> I'm not sure where this information is best placed - it would be
>> >> great to see it on the Wiki along with the examples, at least.
>> >
>> > Sounds OK. Maybe we should put it in the book.
>> >
>> > (oh... wait, we already did that)
>> >
>> >> It would be awesome to see warnings at runtime ("Warning: You just
>> >> trained a model that you cannot load without at least 20GB of RAM"),
>> >> but I'm not sure how realistic that is.
>> >
>> > I think it is fine that loading the model fails with a good error
>> > message, but putting yellow warning tape all over the user's keyboard
>> > isn't going to help anything.
>> >
>> >> I would like it to be easier to determine, at some very high level,
>> >> why something didn't work when an experiment fails. Ideally, without
>> >> having to dive into the code at all.
>> >
>> > How about you ask an expert?
>> >
>> > That really is easier. It helps the community to hear about what other
>> > people need, and it helps the new user to hear what other people have
>> > done.
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com

--
Lance Norskog
goks...@gmail.com