I have nearly given up on getting publicly available large data sets and have started to specify synthetic data sets for development projects. The key is to build reasonably realistic generation algorithms, and that always involves some serious difficulties.

For simple scaling tests, however, synthetic data is often just the ticket. You still need some sophistication about the data, but it doesn't take much. For k-means clustering of text documents, for instance, you can re-sample from real text to generate new text with desired properties, or you can define an LDA-like generator to produce data with known clustering properties. Similarly, to test the scaling of classification algorithms, it is easy to generate text-like data with known properties.
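To make the LDA-like generator idea concrete, here is a rough sketch in plain NumPy. The name generate_corpus and all of the sizes are made up for illustration; this is not code from Mahout or from any actual test, just the shape of the thing:

import numpy as np

def generate_corpus(n_docs=1000, vocab_size=5000, n_topics=20,
                    doc_len=200, concentration=0.1, seed=42):
    """LDA-like generator: documents are bags of words drawn from a
    mixture of topics, so the true cluster structure is known."""
    rng = np.random.default_rng(seed)
    # Topic-word distributions: each row is a multinomial over the vocabulary.
    topics = rng.dirichlet(np.full(vocab_size, 0.01), size=n_topics)
    docs, labels = [], []
    for _ in range(n_docs):
        theta = rng.dirichlet(np.full(n_topics, concentration))
        word_dist = theta @ topics            # this document's mixture over words
        word_dist /= word_dist.sum()          # guard against floating-point drift
        docs.append(rng.multinomial(doc_len, word_dist))
        labels.append(int(np.argmax(theta)))  # dominant topic = "true" cluster
    return np.array(docs), np.array(labels)

docs, labels = generate_corpus()
print(docs.shape, np.bincount(labels, minlength=20))

Because the dominant topic of each document is known by construction, you can crank n_docs and vocab_size up arbitrarily and still check whether a clustering run recovers roughly the right structure.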
The primary virtues of synthetic data are that a synthetic data set is easy to carry around and it can be any size at all.

As an example of a potential pitfall, I wrote tests for the sequential version of the SSVD code by building low-rank matrices and testing the reconstruction error. This is a fine test for correctness and for some scaling attributes, but it ignores the truncation error that Radim was fulminating about recently. It would also be good to test against large matrices that are more realistic because they are generated as count data from a model with a realistic spectrum.
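Roughly, the low-rank test looks like the following (again just an illustrative NumPy sketch, not the actual SSVD test code; the realistic-spectrum, count-data variant at the end is the kind of thing I'm suggesting we add):

import numpy as np

def low_rank_matrix(m=2000, n=500, rank=20, seed=0):
    """Random matrix of exactly the given rank (no spectral tail at all)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))

def reconstruction_error(a, k):
    """Relative Frobenius-norm error of the best rank-k approximation of a."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    a_k = (u[:, :k] * s[:k]) @ vt[:k, :]
    return np.linalg.norm(a - a_k) / np.linalg.norm(a)

# Exact low-rank input: error is ~machine precision, so this checks
# correctness but says nothing about truncation error.
print(reconstruction_error(low_rank_matrix(rank=20), 20))

# More realistic variant: a decaying (power-law) spectrum plus count-like
# noise, so a rank-20 reconstruction has genuine truncation error.
rng = np.random.default_rng(1)
m, n, r = 2000, 500, 200
spectrum = 1.0 / np.arange(1, r + 1)
b = (rng.standard_normal((m, r)) * spectrum) @ rng.standard_normal((r, n))
b = rng.poisson(np.exp(b / b.std()))   # crude stand-in for count data
print(reconstruction_error(b, 20))     # noticeably larger than machine precision

The exact-rank case comes back at essentially machine precision, which is exactly why it says nothing about truncation; the decaying-spectrum, count-data version does.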
On Wed, Dec 28, 2011 at 10:35 AM, Grant Ingersoll <gsing...@apache.org> wrote:

> To me, the big thing we continue to be missing is the ability for those of us working on the project to reliably test the algorithms at scale. For instance, I've seen hints of several places where our clustering algorithms don't appear to scale very well (which are all M/R -- K-Means does scale) and it isn't clear to me whether it is our implementation, Hadoop, or simply that the data set isn't big enough, or the combination of all three. To see this in action, try out the ASF email archive up on Amazon with 10, 15 or 30 EC2 double x-large nodes and try out fuzzy k-means, dirichlet, etc. Now, I realize EC2 isn't ideal for this kind of testing, but it's all many of us have access to. Perhaps it's also b/c 7M+ emails isn't big enough (~100GB), but in some regards that's silly since the whole point is supposed to be it scales. Or perhaps my tests were flawed. Either way, it seems like it is an area we need to focus on more.
>
> Of course, the hard part with all of this is debugging where the bottlenecks are. In the end, we need to figure out how to reliably get compute time available for testing along with real data sets that we can use to validate scalability.
>
>
> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>
> > On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <t...@cloudera.com> wrote:
> >
> >> ...
> >>
> >> They discover Mahout, which does specifically bill itself as scalable (from http://mahout.apache.org, in some of the largest letters: "What is Apache Mahout? The Apache Mahout™ machine learning library's goal is to build scalable machine learning libraries."). They sniff check it by massaging some moderately-sized data set into the same format as an example from the wiki and they fail to get a result - often because their problem has some very different properties (more classes, much larger feature space, etc.) and the implementation has some limitation that they trip over.
> >
> > I have worked with users of Mahout who had 10^9 possible features and others who are classifying into 60,000 categories.
> >
> > Neither of these implementations uses Naive Bayes. Both work very well.
> >
> >> They will usually try one of the simplest methods available under the assumption "well, if this doesn't scale well, the more complex methods are surely no better".
> >
> > Silly assumption.
> >
> >> This may not be entirely fair, but since the docs they're encountering on the main website and wiki don't warn them that certain implementations don't necessarily scale in different ways, it's certainly not unreasonable.
> >
> > Well, it is actually silly.
> >
> > Clearly the docs can be better. Clearly the code quality can be better, especially in terms of nuking capabilities that have not found an audience. But clearly also, just trying one technique without asking anybody what the limitations are isn't going to work as an evaluation technique. This is exactly analogous to somebody finding that a matrix in R doesn't do what a data frame is supposed to do. It doesn't, and you aren't going to find out why or how from the documentation very quickly.
> >
> > In both cases of investigating Mahout or investigating R, you will find out plenty if you ask somebody who knows what they are talking about.
> >
> >> They're at best going to conclude the scalability will be hit-and-miss when a simple method doesn't work. Perhaps they'll check in again in 6-12 months.
> >
> > Maybe so. Maybe not. I have little sympathy with people who make scatter-shot decisions like this.
> >
> >> ...
> >> I see your analogy to R or sciPy - and I don't disagree. But those projects do not put scaling front and center; if Mahout is going to keep scalability as a "headline feature" (which I would like to see!), I think prominently acknowledging how different methods fail to scale would really help its credibility. For what it's worth, of the people I know who've tried Mahout, 100% of them were using R and/or sciPy already, but were curious about Mahout specifically for better scalability.
> >
> > Did they ask on the mailing list?
> >
> >> I'm not sure where this information is best placed - it would be great to see it on the Wiki along with the examples, at least.
> >
> > Sounds OK. Maybe we should put it in the book.
> >
> > (oh... wait, we already did that)
> >
> >> It would be awesome to see warnings at runtime ("Warning: You just trained a model that you cannot load without at least 20GB of RAM"), but I'm not sure how realistic that is.
> >
> > I think it is fine that loading the model fails with a fine error message, but putting yellow warning tape all over the user's keyboard isn't going to help anything.
> >
> >> I would like it to be easier to determine, at some very high level, why something didn't work when an experiment fails. Ideally, without having to dive into the code at all.
> >
> > How about you ask an expert?
> >
> > That really is easier. It helps the community to hear about what other people need, and it helps the new user to hear what other people have done.
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com