Having prepackaged large datasets for people to work with would certainly be useful. The question is what kind of operations one would want to run on such a dataset. If you could provide a set of well-defined benchmarks (simple kernel codes that developers can work with), that would certainly help.
-viral

On Thursday, May 1, 2014 4:03:41 AM UTC+5:30, Cameron McBride wrote:
>
> If there is some desire for "big data" tests, there are a fair number of
> public astronomical datasets that wouldn't be too hard to package up.
>
> The catalog-level versions aren't too different from the type of dataset
> mentioned by Doug. There are a number of fairly simple analyses that
> could be done on them for testing, either simple predictions or
> classifications. These wouldn't be hard to document and/or describe. I
> can produce examples if people care.
>
> For example, SDSS (a survey I work on) has public catalog data of ~470
> million objects (rows), with something like ~3 million of those that
> have more in-depth information (many more columns). Depending on the
> test questions, these can be trimmed to provide datasets of various
> sizes. Numbers pulled from: http://www.sdss3.org/dr10/scope.php.
>
> Anyhow, I guess the advantage here is that the data is public and can be
> used indefinitely. And it's astronomy data, so naturally it's awesome. ;)
> (However, it might suffer from the "who cares in the real world" issue.)
>
> Cameron
>
> On Wed, Apr 30, 2014 at 3:02 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
>
>> Ah, ok, yes – if there aren't very many distinct values, it could
>> definitely help. With strings it's always nice to convert from
>> variable-length strings to fixed-size indices.
>>
>> On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates <dmba...@gmail.com> wrote:
>>
>>> On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:
>>>>
>>>> Is 22GB too much? It seems like just uncompressing this and storing
>>>> it naturally would be fine on a large machine. How big are the
>>>> categorical integers? Would storing an index to an integer really
>>>> help? It seems like it would only help if the integers are larger
>>>> than the indices.
>>>
>>> For example, I just checked the first million instances of one of the
>>> variables and there are only 1050 distinct values, even though those
>>> values are 10-digit integers, as often happens with identifiers like
>>> this. So let's assume that we can store the indices as Uint16. We
>>> obtain the equivalent information by storing a relatively small vector
>>> of Int's representing the actual values, plus a memory-mapped file at
>>> two bytes per record, for this variable.
>>>
>>> To me it seems that working from the original textual representation
>>> as a .csv.gz file is going to involve a lot of storage, I/O, and
>>> conversion of strings to integers.
>>>
>>>> On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates <dmb...@gmail.com> wrote:
>>>>
>>>>> On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:
>>>>>>
>>>>>> It is sometimes difficult to obtain realistic "Big" data sets. A
>>>>>> Revolution Analytics blog post yesterday
>>>>>>
>>>>>> http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html
>>>>>>
>>>>>> mentioned the competition
>>>>>>
>>>>>> http://www.kaggle.com/c/acquire-valued-shoppers-challenge
>>>>>>
>>>>>> with a very large data set, which may be useful in looking at
>>>>>> performance bottlenecks.
>>>>>>
>>>>>> You do need to sign up to be able to download the data, and you
>>>>>> must agree only to use the data for the purposes of the competition
>>>>>> and to remove the data once the competition is over.
>>>>>
>>>>> I did download the largest of the data files, which consists of
>>>>> about 350 million records on 11 variables in CSV format.
>>>>> The compressed file is around 2.6 GB; uncompressed it would be over
>>>>> 22 GB. Fortunately, the GZip package allows for working with the
>>>>> compressed file for sequential access.
>>>>>
>>>>> Most of the variables are what I would call categorical (stored as
>>>>> integer values) and could be represented as a pooled data vector.
>>>>> One variable is a date and one is a price, which could be stored as
>>>>> an integer value (number of cents) or as a Float32.
>>>>>
>>>>> So the first task would be parsing all those integers and creating a
>>>>> binary representation. This could be done using a relational
>>>>> database, but I think that might be overkill for a static table like
>>>>> this. I have been thinking of storing each column as a memory-mapped
>>>>> array in a format like pooled data. That is, store only the indices
>>>>> into a table of values, so that the indices can be represented as
>>>>> whatever size of unsigned int is large enough for the table size.
>>>>>
>>>>> To work out the storage format I should first determine the number
>>>>> of distinct values for each categorical variable. I was planning on
>>>>> using split(readline(gzfilehandle), ","), applying int() to the
>>>>> appropriate fields, and storing the values in a Set or perhaps an
>>>>> IntSet. Does this seem like a reasonable way to start?
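A minimal sketch of that counting pass, assuming the GZip package's gzopen and 0.3-era names like int; the file name and column positions in the usage line are hypothetical:

    # Stream the gzipped CSV once and collect the distinct values of each
    # categorical column, to decide how wide the pooled indices must be.
    using GZip

    function distinct_values(path, intcols)
        # Set{Int} is safe for sparse 10-digit identifiers; an IntSet
        # allocates a bit per value up to the maximum element, so it only
        # pays off when the values are small and dense.
        sets = [Set{Int}() for c in intcols]
        gzopen(path) do io
            readline(io)                        # skip the header row
            for line in eachline(io)
                flds = split(chomp(line), ",")
                for (j, c) in enumerate(intcols)
                    push!(sets[j], int(flds[c]))
                end
            end
        end
        sets
    end

    # e.g. map(length, distinct_values("transactions.csv.gz", [1, 3, 4]))

The size of each set then tells you whether Uint8, Uint16, or Uint32 indices are enough for that column.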
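And a sketch of the pooled, memory-mapped column format described further up the thread, again in 0.3-era Julia (mmap_array was later replaced by Mmap.mmap); pool, save_refs, and load_refs are hypothetical names:

    # Pooled encoding: a small vector of the distinct values plus a
    # memory-mapped vector of Uint16 indices into it, two bytes per record.

    function pool(vals::Vector{Int})
        levels = sort(unique(vals))              # e.g. the ~1050 identifiers
        @assert length(levels) <= typemax(Uint16)
        lookup = Dict{Int,Uint16}()
        for (i, v) in enumerate(levels)
            lookup[v] = uint16(i)
        end
        refs = Uint16[lookup[v] for v in vals]   # indices into levels
        levels, refs
    end

    function save_refs(path, refs::Vector{Uint16})
        open(path, "w") do io
            write(io, refs)                      # raw two-byte records
        end
    end

    function load_refs(path, n)
        io = open(path, "r")
        mmap_array(Uint16, (n,), io)             # pages shared with the file
    end

The original value of record i is then levels[refs[i]], and the refs file can be mapped read-only by any number of processes.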