Re: [julia-users] A Big Data stress test

2014-05-14 Thread David
I'm glad you found this useful.  We continue to host data from many of the
older competitions so you might enjoy those as well.

David
 On Apr 30, 2014 9:30 AM, Douglas Bates dmba...@gmail.com wrote:

 It is sometimes difficult to obtain realistic Big data sets.  A
 Revolution Analytics blog post yesterday


 http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html

 mentioned the competition

 http://www.kaggle.com/c/acquire-valued-shoppers-challenge

 with a very large data set, which may be useful in looking at performance
 bottlenecks.

 You do need to sign up to be able to download the data and you must agree
 only to use the data for the purposes of the competition and to remove the
 data once the competition is over.




Re: [julia-users] A Big Data stress test

2014-05-06 Thread Viral Shah
An RDatasets-like approach, with datasets in increasing sizes and kernel 
codes, is certainly the way to go. The Julia perf tests essentially do some 
of this for the kernels at http://alioth.debian.org, and combined with 
codespeed, this has been incredibly useful.
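
For concreteness, a rough sketch of what such a kernel harness could look
like (hypothetical, not an existing package; the loader and the kernels
below are placeholders):

struct BenchKernel
    name::String
    run::Function              # takes a dataset, returns whatever the kernel computes
end

# Time each registered kernel on datasets of increasing size.
function run_benchmarks(kernels::Vector{BenchKernel}, sizes::Vector{Int}, loader)
    for n in sizes
        data = loader(n)       # e.g. the first n rows of a large table
        for k in kernels
            t = @elapsed k.run(data)
            println("$(k.name) on $n rows: $(round(t, digits = 3)) s")
        end
    end
end

# Usage sketch (kernels and loader are made up for illustration):
# kernels = [BenchKernel("sum", d -> sum(d)), BenchKernel("sort", d -> sort(d))]
# run_benchmarks(kernels, [10^4, 10^5, 10^6], n -> rand(n))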

-viral

On Thursday, May 1, 2014 7:48:06 PM UTC+5:30, Cameron McBride wrote:

 On Thursday, May 1, 2014, Viral Shah vi...@mayin.org wrote:

 This would certainly be useful - to have prepackaged large datasets for 
 people to work with. The question is what kind of operations would one want 
 to do on such a dataset. If you could provide a set of well defined 
 benchmarks (simple kernel codes that developers can work with), this could 
 certainly be useful.


 This is basically what I had in mind. However, there is a long list of 
 possibilities.  It'd be helpful to get a sense of which tests would be most 
 useful first, and then create specific examples (i.e. data sizes and 
 algorithms).

 On Thu, May 1, 2014 at 8:51 AM, Kevin Squire kevin.squ...@gmail.com wrote:

 For large datasets that are not likely to disappear, an interface to 
 download them sounds more useful to me. 


 This is certainly possible.  Feasibility will (obviously) be an issue 
 almost by definition for big data. 

 For a few cases, it is easy to create an interface now using the available 
 public hosts.  For example, a recent book focusing on Python does just this:

 http://www.astroml.org/examples/datasets/compute_sdss_pca.html#example-datasets-compute-sdss-pca
 (This highlights a specific case of ~700 MB of data that takes ~30 mins 
 (reportedly) to snag. Clearly, it is an interface limitation, and not a bit 
 shuffling issue.)
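
 A minimal sketch of what such a fetch-and-cache interface could look like in
 Julia, assuming the dataset is served as a single file at a stable public URL
 (the URL and cache directory below are placeholders):

 function fetch_dataset(url::AbstractString,
                        cachedir::AbstractString = joinpath(homedir(), ".julia_datasets"))
     mkpath(cachedir)
     dest = joinpath(cachedir, basename(url))
     if !isfile(dest)
         @info "Downloading $url (may take a while for large files)"
         download(url, dest)        # Base download; time depends on host and connection
     end
     return dest
 end

 # Usage sketch with a placeholder URL:
 # path = fetch_dataset("http://example.org/sdss_catalog_subset.csv.gz")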

 Alternative ways exist, but the data is packaged in ways that make it more 
 difficult to access (FITS tables will make you cry).

 If the idea was benchmarks and example tests, I'm sure we could curate and 
 make a few datasets available in easily digestible formats and of 
 increasing size. (Basically, like a few big RDatasets.) 

 Cameron



Re: [julia-users] A Big Data stress test

2014-05-01 Thread Kevin Squire
For large datasets that are not likely to disappear, an interface to
download them sounds more useful to me.

Cheers, Kevin

On Thursday, May 1, 2014, Viral Shah vi...@mayin.org wrote:

 This would certainly be useful - to have prepackaged large datasets for
 people to work with. The question is what kind of operations would one want
 to do on such a dataset. If you could provide a set of well defined
 benchmarks (simple kernel codes that developers can work with), this could
 certainly be useful.

 -viral

 On Thursday, May 1, 2014 4:03:41 AM UTC+5:30, Cameron McBride wrote:

 If there is some desire for big data tests, there are a fair number of
 public astronomical datasets that wouldn't be too hard to package up.

 The catalog-level versions aren't too different from the type of dataset
 mentioned by Doug. There are a number of fairly simple analyses that could
 be done on them for testing, either simple predictions or classifications.
  These wouldn't be hard to document and/or describe.  I can produce
 examples if people care.

 For example, SDSS (a survey I work on) has public catalog data of ~470
 million objects (rows), with something like ~3 million of those that have
 more in-depth information (many more columns).   Depending on the test
 questions, these can be trimmed to provide datasets of various sizes.
  Numbers pulled from: http://www.sdss3.org/dr10/scope.php.

 Anyhow, I guess the advantage here is the data is public and can be used
 indefinitely.  And it's astronomy data, so naturally it's awesome.  ;)
  (However, it might suffer from the "who cares in the real world" issue.)

 Cameron



 On Wed, Apr 30, 2014 at 3:02 PM, Stefan Karpinski ste...@karpinski.org wrote:

 Ah, ok, yes – if there aren't very many distinct values, it could
 definitely help. With strings it's always nice to convert from
 variable-length strings to fixed-size indices.


 On Wed, Apr 30, 2014 at 2:54 PM, Douglas Bates dmba...@gmail.com wrote:

 On Wednesday, April 30, 2014 1:20:26 PM UTC-5, Stefan Karpinski wrote:

 Is 22GB too much? It seems like just uncompressing this and storing it
 naturally would be fine on a large machine. How big are the categorical
 integers? Would storing an index to an integer really help? It seems like
 it would only help if the integers are larger than the indices.


 For example, I just checked the first million instances of one of the
 variables and there are only 1050 distinct values, even though those values
 are 10-digit integers, as often happens with identifiers like this.  So
 let's assume that we can store the indices as Uint16.  We obtain the
 equivalent information by storing a relatively small vector of Ints
 representing the actual values, plus a memory-mapped file at two bytes per
 record, for this variable.
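
 A rough sketch of that scheme in current Julia (the column values and file
 path below are placeholders; UInt16 is the modern spelling of Uint16):

 using Mmap

 function pool_column(values::Vector{Int}, path::AbstractString)
     levels = unique(values)                 # e.g. ~1050 distinct identifiers
     @assert length(levels) <= typemax(UInt16)
     code = Dict(v => UInt16(i) for (i, v) in enumerate(levels))

     open(path, "w+") do io
         # Two bytes per record, backed by a memory-mapped file.
         codes = Mmap.mmap(io, Vector{UInt16}, length(values))
         @inbounds for (i, v) in enumerate(values)
             codes[i] = code[v]
         end
         Mmap.sync!(codes)
     end
     return levels                           # levels[code] recovers the original value
 end

 # Usage sketch with made-up identifiers:
 # vals = rand([1234567890, 9876543210, 1111111111], 10^6)
 # levels = pool_column(vals, "var1_codes.bin")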

 To me it seems that working from the original textual representation as a
 .csv.gz file is going to involve a lot of storage, I/O, and conversion of
 strings to integers.




 On Wed, Apr 30, 2014 at 2:10 PM, Douglas Bates dmb...@gmail.com wrote:

 On Wednesday, April 30, 2014 11:30:41 AM UTC-5, Douglas Bates wrote:

 It is sometimes difficult to obtain realistic Big data sets.  A
 Revolution Analytics blog post yesterday

 http://blog.revolutionanalytics.com/2014/04/predict-which-shoppers-will-become-repeat-buyers.html

 mentioned the competition

 http://www.kaggle.com/c/acquire-valued-shoppers-challenge

 with a very large data set, which may be useful in looking at performance
 bottlenecks.

 You do need to sign up to be able to download the data and you must agree
 only to use the data for the purposes of the competition and to remove the
 data once the competition is over.