For large datasets that are not likely to disappear, an interface to
download them sounds more useful to me.
Cheers, Kevin
On Thursday, May 1, 2014, Viral Shah vi...@mayin.org wrote:
This would certainly be useful - to have prepackaged large datasets for
people to work with. The question is what kinds of operations one would
want to do on such a dataset. If you could provide a set of well-defined
benchmarks (simple kernel codes that developers can work with), that
would be a big help.
-viral
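As a concrete illustration of the kind of kernel being asked for, here is a
minimal sketch in Julia: one pass over a delimited file, counting distinct
values per column. The file name is made up and the one-record-per-line
format is an assumption on my part; this uses current syntax.

    # Hypothetical benchmark kernel: count distinct values per column
    # of a delimited text file in a single streaming pass.
    function distinct_counts(path::AbstractString; delim::Char=',')
        seen = Dict{Int,Set{String}}()
        for line in eachline(path)
            for (j, field) in enumerate(split(line, delim))
                push!(get!(seen, j, Set{String}()), String(field))
            end
        end
        return Dict(j => length(s) for (j, s) in seen)
    end

    @time distinct_counts("shoppers.csv")   # file name is illustrative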
On Thursday, May 1, 2014 4:03:41 AM UTC+5:30, Cameron McBride wrote:
If there is some desire for big-data tests, there are a fair number of
public astronomical datasets that wouldn't be too hard to package up.
The catalog-level versions aren't too different from the type of dataset
mentioned by Doug. There are a number of fairly simple analyses that could
be done on them for testing, either simple predictions or classifications.
These wouldn't be hard to document and/or describe. I can produce
examples if people care.
For example, SDSS (a survey I work on) has public catalog data on ~470
million objects (rows), with something like ~3 million of those having
more in-depth information (many more columns). Depending on the test
questions, these can be trimmed to provide datasets of various sizes.
Numbers pulled from: http://www.sdss3.org/dr10/scope.php.
Anyhow, I guess the advantage here is that the data are public and can be
used indefinitely. And it's astronomy data, so naturally it's awesome. ;)
(However, it might suffer from the "who cares in the real world" issue.)
Cameron
On Wed, Apr 30, 2014 at 3:02 PM, Stefan Karpinski ste...@karpinski.org wrote:
Ah, ok, yes – if there aren't very many distinct values, it could
definitely help. With strings it's always nice to convert from
variable-length strings to fixed-size indices.
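A minimal sketch of that string-to-index conversion in Julia (current
syntax; this assumes fewer than 65536 distinct values, and the names here
are illustrative):

    # Map variable-length strings to fixed-size UInt16 indices.
    function pool_strings(values::Vector{String})
        levels = unique(values)    # distinct strings, in first-seen order
        code = Dict(v => UInt16(i) for (i, v) in enumerate(levels))
        return levels, [code[v] for v in values]   # two bytes per record
    end

    levels, codes = pool_strings(["a", "b", "a", "c"])
    # codes == UInt16[1, 2, 1, 3]; levels[codes[i]] recovers the string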
On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates dmba...@gmail.com wrote:
On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:
Is 22GB too much? It seems like just uncompressing this and storing it
naturally would be fine on a large machine. How big are the categorical
integers? Would storing an index to an integer really help? It seems like
it would only help if the integers are larger than the indices.
For example, I just checked the first million instances of one of the
variables and there are only 1050 distinct values, even though those values
are 10-digit integers, as often happens with identifiers like this. So
let's assume that we can store the indices as Uint16. We obtain the
equivalent information by storing a relatively small vector of Ints
representing the actual values, plus a memory-mapped file at two bytes per
record for this variable.
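A rough sketch of that scheme in Julia (current syntax, where Uint16 is
now spelled UInt16; the file name and stand-in data are made up):

    using Mmap

    # Stand-in data: one variable's 10-digit identifiers.
    ids = rand([1234567890, 1234567891, 1234567892], 1_000_000)

    levels = unique(ids)   # the relatively small vector of actual Int values
    code = Dict(v => UInt16(i) for (i, v) in enumerate(levels))

    # Store the indices in a memory-mapped file at two bytes per record.
    open("var1.bin", "w+") do io
        codes = Mmap.mmap(io, Vector{UInt16}, length(ids))
        for (i, v) in enumerate(ids)
            codes[i] = code[v]
        end
        Mmap.sync!(codes)
    end

    # Later passes reopen the 2-byte-per-record file and decode on demand.
    codes = open(io -> Mmap.mmap(io, Vector{UInt16}, 1_000_000), "var1.bin")
    first_value = levels[codes[1]]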
To me it seems that working from the original textual representation as a
.csv.gz file is going to involve a lot of storage, I/O, and conversion of
strings to integers.
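For comparison, each pass over the compressed text looks something like the
sketch below (it assumes the third-party CodecZlib package for gzip
streams, and the file name is made up):

    using CodecZlib   # assumed: provides GzipDecompressorStream

    # Every pass re-decompresses the file and re-parses identifiers from text.
    open("train.csv.gz") do raw
        io = GzipDecompressorStream(raw)
        n = 0
        for line in eachline(io)
            fields = split(line, ',')
            id = parse(Int, fields[1])   # string-to-integer work on every row
            n += 1
        end
        println("parsed ", n, " rows")
    end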
On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates dmb...@gmail.com wrote:
On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:
It is sometimes difficult to obtain realistic big datasets. A
Revolution Analytics blog post yesterday,
http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html
mentioned the competition
http://www.kaggle.com/c/acquire-valued-shoppers-challenge
with a very large dataset, which may be useful in looking at performance
bottlenecks.
You do need to sign up to be able to download the data.