On Dec 28, 2011, at 1:47 PM, Ted Dunning wrote:

> I have nearly given up on getting publicly available large data sets and
> have started to specify synthetic datasets for development projects. The
> key is to build reasonably realistic generation algorithms and for that
> there are always some serious difficulties.
Yeah, I agree. Still, 7M+ real emails seems like it should be interesting in
size for us while not being overwhelming. Of course, that only solves half of
the problem: we still need access to a cluster so we can regularly run
experiments.

>
> For simple scaling tests, however, synthetic data is often just the ticket.
> You still need some sophistication about the data, but it doesn't take
> much. For k-means clustering of text documents, for instance, you can
> re-sample from real text to generate new text with desired properties, or
> you can define an LDA-like generator to generate data with known clustering
> properties. Similarly, to test scaling of classification algorithms, it is
> easy to generate text-like data with known properties.

I still like our idea of a "good fake data" project, or at least a util in
Mahout.

>
> The primary virtues of synthetic data are that a synthetic data set is easy
> to carry around and it can be any size at all.
>
> As an example of a potential pitfall, I wrote tests for the sequential
> version of the SSVD code by building low-rank matrices and testing the
> reconstruction error. This is a fine test for correctness and some scaling
> attributes, but it ignores the truncation error that Radim was fulminating
> about recently. It would be good to additionally explore large matrices
> that are more realistic because they are generated as count data from a
> model that has a realistic spectrum.
>
> On Wed, Dec 28, 2011 at 10:35 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
>> To me, the big thing we continue to be missing is the ability for those of
>> us working on the project to reliably test the algorithms at scale. For
>> instance, I've seen hints of several places where our clustering algorithms
>> don't appear to scale very well (they are all M/R -- k-means does scale),
>> and it isn't clear to me whether it is our implementation, Hadoop, simply
>> that the data set isn't big enough, or a combination of all three.
>> To see this in action, try out the ASF email archive up on Amazon with 10,
>> 15 or 30 EC2 double x-large nodes and try out fuzzy k-means, Dirichlet,
>> etc. Now, I realize EC2 isn't ideal for this kind of testing, but it's all
>> many of us have access to. Perhaps it's also because 7M+ emails isn't big
>> enough (~100GB), but in some regards that's silly, since the whole point is
>> supposed to be that it scales. Or perhaps my tests were flawed. Either way,
>> it seems like an area we need to focus on more.
>>
>> Of course, the hard part with all of this is debugging where the
>> bottlenecks are. In the end, we need to figure out how to reliably get
>> compute time available for testing, along with real data sets that we can
>> use to validate scalability.
>>
>>
>> On Dec 27, 2011, at 10:22 PM, Ted Dunning wrote:
>>
>>> On Tue, Dec 27, 2011 at 3:24 PM, Tom Pierce <t...@cloudera.com> wrote:
>>>
>>>> ...
>>>>
>>>> They discover Mahout, which does specifically bill itself as scalable
>>>> (from http://mahout.apache.org, in some of the largest letters: "What
>>>> is Apache Mahout? The Apache Mahout™ machine learning library's goal
>>>> is to build scalable machine learning libraries."). They sniff-test
>>>> it by massaging some moderately sized data set into the same format as
>>>> an example from the wiki and they fail to get a result -- often because
>>>> their problem has some very different properties (more classes, much
>>>> larger feature space, etc.) and the implementation has some limitation
>>>> that they trip over.
>>>>
>>>
>>> I have worked with users of Mahout who had 10^9 possible features and
>>> others who are classifying into 60,000 categories.
>>>
>>> Neither of these implementations uses Naive Bayes. Both work very well.
>>>
>>>> They will usually try one of the simplest methods available under the
>>>> assumption "well, if this doesn't scale well, the more complex methods
>>>> are surely no better".
>>>
>>> Silly assumption.
>>>
>>>> This may not be entirely fair, but since the
>>>> docs they're encountering on the main website and wiki don't warn them
>>>> that certain implementations scale in very different ways, it's
>>>> certainly not unreasonable.
>>>
>>> Well, it is actually silly.
>>>
>>> Clearly the docs can be better. Clearly the code quality can be better,
>>> especially in terms of nuking capabilities that have not found an
>>> audience. But clearly also, just trying one technique without asking
>>> anybody what the limitations are isn't going to work as an evaluation
>>> technique. This is exactly analogous to somebody finding that a matrix
>>> in R doesn't do what a data frame is supposed to do. It doesn't, and you
>>> aren't going to find out why or how from the documentation very quickly.
>>>
>>> In both cases, investigating Mahout or investigating R, you will find out
>>> plenty if you ask somebody who knows what they are talking about.
>>>
>>>> They're at best going to
>>>> conclude the scalability will be hit-and-miss when a simple method
>>>> doesn't work. Perhaps they'll check in again in 6-12 months.
>>>
>>> Maybe so. Maybe not. I have little sympathy with people who make
>>> scatter-shot decisions like this.
>>>
>>>> ...
>>>> I see your analogy to R or SciPy - and I don't disagree. But those
>>>> projects do not put scaling front and center; if Mahout is going to
>>>> keep scalability as a "headline feature" (which I would like to see!),
>>>> I think prominently acknowledging how different methods fail to scale
>>>> would really help its credibility. For what it's worth, of the people
>>>> I know who've tried Mahout, 100% of them were using R and/or SciPy
>>>> already, but were curious about Mahout specifically for better
>>>> scalability.
>>>
>>> Did they ask on the mailing list?
>>>
>>>> I'm not sure where this information is best placed - it would be great
>>>> to see it on the wiki along with the examples, at least.
>>>
>>> Sounds OK. Maybe we should put it in the book.
>>>
>>> (oh... wait, we already did that)
>>>
>>>> It would be
>>>> awesome to see warnings at runtime ("Warning: You just trained a model
>>>> that you cannot load without at least 20GB of RAM"), but I'm not sure
>>>> how realistic that is.
>>>
>>> I think it is fine that loading the model fails with a clear error
>>> message, but putting yellow warning tape all over the user's keyboard
>>> isn't going to help anything.
>>>
>>>> I would like it to be easier to determine, at some very high level, why
>>>> something didn't work when an experiment fails. Ideally, without having
>>>> to dive into the code at all.
>>>
>>> How about you ask an expert?
>>>
>>> That really is easier. It helps the community to hear about what other
>>> people need and it helps the new user to hear what other people have
>>> done.
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>>

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
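
[Editor's note: a minimal sketch of the kind of "LDA-like generator" described
above, assuming a plain-Java utility with made-up names (nothing here is
existing Mahout API). Each of K clusters gets its own multinomial over the
vocabulary, drawn from a symmetric Dirichlet, and every document is sampled
from a known cluster, so clustering output can be scored against ground truth
at any corpus size. Writing the labeled documents out in whatever input format
the job under test expects is left out of the sketch.]

import java.util.Random;

/** Hypothetical sketch, not part of Mahout: documents drawn from K known
 *  multinomial "topics" so the true cluster of every document is known. */
public class SyntheticCorpus {

  private final double[][] topicTermProbs;   // K x V, each row sums to 1
  private final Random rng;

  public SyntheticCorpus(int numTopics, int vocabSize, long seed) {
    this.rng = new Random(seed);
    this.topicTermProbs = new double[numTopics][vocabSize];
    for (int k = 0; k < numTopics; k++) {
      // Symmetric Dirichlet(1) draw = normalized Exp(1) variates.
      double sum = 0;
      for (int v = 0; v < vocabSize; v++) {
        double x = -Math.log(1.0 - rng.nextDouble());
        topicTermProbs[k][v] = x;
        sum += x;
      }
      for (int v = 0; v < vocabSize; v++) {
        topicTermProbs[k][v] /= sum;
      }
    }
  }

  /** Returns a whitespace-joined document drawn from the given topic. */
  public String sampleDocument(int topic, int length) {
    StringBuilder doc = new StringBuilder();
    for (int i = 0; i < length; i++) {
      doc.append("term").append(sampleTerm(topic)).append(' ');
    }
    return doc.toString().trim();
  }

  private int sampleTerm(int topic) {
    double u = rng.nextDouble();
    double cumulative = 0;
    double[] probs = topicTermProbs[topic];
    for (int v = 0; v < probs.length; v++) {
      cumulative += probs[v];
      if (u <= cumulative) {
        return v;
      }
    }
    return probs.length - 1;   // guard against floating-point rounding
  }

  public static void main(String[] args) {
    SyntheticCorpus corpus = new SyntheticCorpus(20, 50000, 42L);
    Random labels = new Random(1L);
    for (int doc = 0; doc < 10; doc++) {
      int trueCluster = labels.nextInt(20);  // ground-truth label is known
      System.out.println(trueCluster + "\t" + corpus.sampleDocument(trueCluster, 200));
    }
  }
}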
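
[Editor's note: similarly, a hedged sketch of the low-rank reconstruction-error
test mentioned above. It builds an exactly low-rank matrix and measures the
relative Frobenius error of its reconstruction; Commons Math 3's dense
in-memory SVD stands in for whatever (S)SVD implementation would actually be
under test, and the class name is invented. As noted in the thread, an exactly
low-rank input says nothing about truncation error on matrices with a more
realistic spectrum.]

import java.util.Random;
import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

/** Hypothetical sketch: reconstruction-error check on an exactly low-rank matrix. */
public class LowRankReconstructionCheck {

  public static void main(String[] args) {
    int m = 500, n = 300, rank = 10;
    Random rng = new Random(42L);

    // Build A = B * C^T, so rank(A) <= rank by construction.
    RealMatrix b = randomMatrix(m, rank, rng);
    RealMatrix c = randomMatrix(n, rank, rng);
    RealMatrix a = b.multiply(c.transpose());

    // Decompose and reconstruct; a dense SVD stands in for the code under test.
    SingularValueDecomposition svd = new SingularValueDecomposition(a);
    RealMatrix reconstructed = svd.getU().multiply(svd.getS()).multiply(svd.getVT());

    // Relative Frobenius reconstruction error; near machine precision for an
    // exactly low-rank A, but not informative about spectrum truncation.
    double relativeError =
        a.subtract(reconstructed).getFrobeniusNorm() / a.getFrobeniusNorm();
    System.out.println("relative reconstruction error = " + relativeError);
  }

  private static RealMatrix randomMatrix(int rows, int cols, Random rng) {
    double[][] data = new double[rows][cols];
    for (int i = 0; i < rows; i++) {
      for (int j = 0; j < cols; j++) {
        data[i][j] = rng.nextGaussian();
      }
    }
    return new Array2DRowRealMatrix(data, false);
  }
}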