For large datasets that are not likely to disappear, an interface to download them sounds more useful to me.
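Something along these lines, say (a rough sketch; the registry name, URL, and cache location are all invented for illustration, and Base.download is Downloads.download in newer Julia):

    # Rough sketch of a download interface: fetch a registered dataset once,
    # cache it locally, and return the cached path on later calls.
    # The registry entry and URL below are placeholders, not real endpoints.
    const DATASETS = Dict(
        "shoppers" => "https://example.org/data/shoppers.csv.gz",
    )

    function fetch_dataset(name; cache = joinpath(homedir(), ".julia_datasets"))
        url = DATASETS[name]
        path = joinpath(cache, basename(url))
        isfile(path) && return path      # already downloaded; reuse local copy
        mkpath(cache)
        download(url, path)              # Base.download fetches over HTTP
        return path
    end

Benchmark code could then just call fetch_dataset("shoppers") and not worry about whether the file is local yet.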
Cheers,
Kevin

On Thursday, May 1, 2014, Viral Shah <vi...@mayin.org> wrote:

> This would certainly be useful - to have prepackaged large datasets for
> people to work with. The question is what kind of operations one would
> want to do on such a dataset. If you could provide a set of well-defined
> benchmarks (simple kernel codes that developers can work with), this
> could certainly be useful.
>
> -viral
>
> On Thursday, May 1, 2014 4:03:41 AM UTC+5:30, Cameron McBride wrote:
>
> If there is some desire for "big data" tests, there are a fair number of
> public astronomical datasets that wouldn't be too hard to package up.
>
> The catalog-level versions aren't too different from the type of dataset
> mentioned by Doug. There are a number of fairly simple analyses that
> could be done on them for testing, either simple predictions or
> classifications. These wouldn't be hard to document and/or describe. I
> can produce examples if people care.
>
> For example, SDSS (a survey I work on) has public catalog data of ~470
> million objects (rows), with something like ~3 million of those having
> more in-depth information (many more columns). Depending on the test
> questions, these can be trimmed to provide datasets of various sizes.
> Numbers pulled from: http://www.sdss3.org/dr10/scope.php.
>
> Anyhow, I guess the advantage here is that the data is public and can be
> used indefinitely. And it's astronomy data, so naturally it's awesome. ;)
> (However, it might suffer from the "who cares in the real world" issue.)
>
> Cameron
>
>
> On Wed, Apr 30, 2014 at 3:02 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
>
> Ah, ok, yes – if there aren't very many distinct values, it could
> definitely help. With strings it's always nice to convert from
> variable-length strings to fixed-size indices.
>
>
> On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates <dmba...@gmail.com> wrote:
>
> On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:
>
> Is 22GB too much? It seems like just uncompressing this and storing it
> naturally would be fine on a large machine. How big are the categorical
> integers? Would storing an index to an integer really help? It seems
> like it would only help if the integers are larger than the indices.
>
>
> For example, I just checked the first million instances of one of the
> variables and there are only 1050 distinct values, even though those
> values are 10-digit integers, as often happens with identifiers like
> this. So let's assume that we can store the indices as Uint16. We obtain
> the equivalent information by storing a relatively small vector of Ints
> representing the actual values, plus a memory-mapped file at two bytes
> per record for this variable.
>
> To me it seems that working from the original textual representation as
> a .csv.gz file is going to involve a lot of storage, I/O, and conversion
> of strings to integers.
>
>
> On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates <dmb...@gmail.com> wrote:
>
> On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:
>
> It is sometimes difficult to obtain realistic "Big" data sets. A
> Revolution Analytics blog post yesterday
>
> http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html
>
> mentioned the competition
>
> http://www.kaggle.com/c/acquire-valued-shoppers-challenge
>
> with a very large data set, which may be useful in looking at
> performance bottlenecks.
>
> You do need to sign up to be able to download the data.
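For concreteness, here is a rough sketch of the pooled-index scheme Doug describes above, in current Julia spelling (UInt16 rather than the older Uint16); the input vector `ids` and the file name "var1.idx" are made up for illustration:

    # Rough sketch of the pooling Doug describes: ~1050 distinct 10-digit
    # identifiers stored once, with each record reduced to a two-byte index.
    using Mmap

    function pool_encode(ids::Vector{Int64})
        pool = sort!(unique(ids))                # small vector of distinct values
        @assert length(pool) <= typemax(UInt16)  # 1050 levels fit easily
        lookup = Dict(v => UInt16(i) for (i, v) in enumerate(pool))
        return pool, UInt16[lookup[v] for v in ids]
    end

    pool, idx = pool_encode(ids)
    open("var1.idx", "w") do io                  # write the indices once
        write(io, idx)
    end

    # Later runs memory-map the two-bytes-per-record file instead of
    # re-parsing the .csv.gz:
    idx2 = Mmap.mmap(open("var1.idx"), Vector{UInt16}, filesize("var1.idx") ÷ 2)
    values = pool[idx2]                          # recover originals on demand

This is also exactly the variable-length-strings-to-fixed-size-indices conversion Stefan mentions, with integers in place of strings.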