For large datasets that are not likely to disappear, an interface to
download them sounds more useful to me.

Cheers, Kevin

On Thursday, May 1, 2014, Viral Shah <vi...@mayin.org> wrote:

> This would certainly be useful - having prepackaged large datasets for
> people to work with. The question is what kinds of operations one would
> want to do on such a dataset. If you could provide a set of well-defined
> benchmarks (simple kernel codes that developers can work with), that would
> certainly help.
>
> -viral
>
> On Thursday, May 1, 2014 4:03:41 AM UTC+5:30, Cameron McBride wrote:
>
> If there is some desire for "big data" tests, there are a fair number of
> public astronomical datasets that wouldn't be too hard to package up.
>
> The catalog-level versions aren't too different from the type of dataset
> mentioned by Doug. There are a number of fairly simple analyses that could
> be done on them for testing, either simple predictions or classifications.
> These wouldn't be hard to document and/or describe.  I can produce
> examples if people care.
>
> For example, SDSS (a survey I work on) has public catalog data of ~470
> million objects (rows), with something like ~3 million of those having
> more in-depth information (many more columns).  Depending on the test
> questions, these can be trimmed to provide datasets of various sizes.
> Numbers pulled from: http://www.sdss3.org/dr10/scope.php.
>
> Anyhow, I guess the advantage here is that the data is public and can be
> used indefinitely.  And it's astronomy data, so naturally it's awesome.  ;)
> (However, it might suffer from the "who cares in the real world" issue.)
>
> Cameron
>
>
>
> On Wed, Apr 30, 2014 at 3:02 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
>
> Ah, ok, yes – if there aren't very many distinct values, it could
> definitely help. With strings it's always nice to convert from
> variable-length strings to fixed-size indices.
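>
> A minimal sketch of that string-to-index conversion (hypothetical data;
> essentially what a pooled/categorical column does):
>
>     strs    = ["FGK", "M", "FGK", "WD", "M", "FGK"]   # variable-length strings
>     levels  = unique(strs)                            # small pool of distinct values
>     lookup  = Dict(s => UInt16(i) for (i, s) in enumerate(levels))
>     refs    = UInt16[lookup[s] for s in strs]         # fixed-size 2-byte indices
>     decoded = levels[refs]                            # round-trip back to the strings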
>
>
> On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates <dmba...@gmail.com> wrote:
>
> On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:
>
> Is 22GB too much? It seems like just uncompressing this and storing it
> naturally would be fine on a large machine. How big are the categorical
> integers? Would storing an index to an integer really help? It seems like
> it would only help if the integers are larger than the indices.
>
>
> For example, I just checked the first million instances of one of the
> variables and there are only 1050 distinct values, even though those values
> are 10-digit integers, as often happens with identifiers like this.  So
> let's assume that we can store the indices as Uint16.  We obtain the
> equivalent information by storing a relatively small vector of Ints
> holding the actual values, plus a memory-mapped file at two bytes per
> record for this variable.
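>
> Roughly, a sketch of that layout (made-up data standing in for the real
> column, a hypothetical "ids.codes" file name, and the current UInt16/Mmap
> spellings rather than the older Uint16):
>
>     using Mmap
>
>     ids = rand([1234567890, 1234567891, 1234567892], 1_000_000)  # stand-in column
>
>     pool  = unique(ids)                                # the ~1050 distinct values
>     code  = Dict(v => UInt16(i) for (i, v) in enumerate(pool))
>     codes = UInt16[code[v] for v in ids]               # 2 bytes per record
>
>     # Write the codes once; later sessions can memory-map the file instead
>     # of re-parsing the gzipped CSV.
>     open(io -> write(io, codes), "ids.codes", "w")
>
>     io = open("ids.codes", "r")
>     mapped    = Mmap.mmap(io, Vector{UInt16}, length(ids))
>     originals = pool[mapped]                           # recover actual values on demand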
>
> To me it seems that working from the original textual representation as a
> .csv.gz file is going to involve a lot of storage, I/O, and conversion of
> strings to integers.
>
>
>
>
> On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates <dmb...@gmail.com> wrote:
>
> On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:
>
> It is sometimes difficult to obtain realistic "Big" data sets.  A
> Revolution Analytics blog post yesterday
>
> http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html
>
> mentioned the competition
>
> http://www.kaggle.com/c/acquire-valued-shoppers-challenge
>
> with a very large data set, which may be useful in looking at performance
> bottlenecks.
>
> You do need to sign up to be able to download the data.
>
>
