An RDatasets-like approach, with datasets of increasing size and kernel 
codes, is certainly the way to go. The Julia perf tests essentially do some 
of this for the kernels at http://alioth.debian.org, and combined with 
codespeed, this has been incredibly useful.
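
As a rough illustration of the idea (a minimal sketch in Python, not the 
actual Julia perf suite), a harness could time one simple kernel over 
datasets of increasing size. The names make_dataset, kernel_mean, and 
run_benchmark are hypothetical, and the synthetic random data only stands in 
for real packaged datasets:

```python
import random
import time


def make_dataset(n_rows, seed=0):
    """Hypothetical stand-in for loading a packaged dataset of n_rows values."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_rows)]


def kernel_mean(values):
    """A deliberately simple kernel: the arithmetic mean of the values."""
    return sum(values) / len(values)


def run_benchmark(kernel, sizes, repeats=3):
    """Time `kernel` on datasets of each size; keep the best of `repeats` runs."""
    results = []
    for n in sizes:
        data = make_dataset(n)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(data)
            best = min(best, time.perf_counter() - start)
        results.append((n, best))
    return results


if __name__ == "__main__":
    for n, seconds in run_benchmark(kernel_mean, [10**3, 10**4, 10**5]):
        print(f"{n:>8} rows: {seconds:.6f} s")
```

Recording the (size, time) pairs per commit is the kind of output a tool like 
codespeed could then track over time.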

-viral

On Thursday, May 1, 2014 7:48:06 PM UTC+5:30, Cameron McBride wrote:
>
> On Thursday, May 1, 2014, Viral Shah <vi...@mayin.org> wrote:
>
>> This would certainly be useful - to have prepackaged large datasets for 
>> people to work with. The question is what kind of operations would one want 
>> to do on such a dataset. If you could provide a set of well defined 
>> benchmarks (simple kernel codes that developers can work with), this could 
>> certainly be useful.
>>
>
> This is basically what I had in mind. However, there is a long list of 
> possibilities.  It'd be helpful to get a sense of which tests would be most 
> useful first, and then create specific examples (i.e., data sizes and 
> algorithms).
>
> On Thu, May 1, 2014 at 8:51 AM, Kevin Squire <kevin.squ...@gmail.com> wrote:
>
>> For large datasets that are not likely to disappear, an interface to 
>> download them sounds more useful to me. 
>>
>
> This is certainly possible.  Feasibility will (obviously) be an issue 
> almost by definition for "big data". 
>
> For a few cases, it is easy to create an interface now using the available 
> public hosts.  For example, a recent book focusing on python does just this:
>
> http://www.astroml.org/examples/datasets/compute_sdss_pca.html#example-datasets-compute-sdss-pca
> (This highlights a specific case: ~700 MB of data that reportedly takes 
> ~30 mins to download. Clearly, that is an interface limitation, not a 
> bit-shuffling issue.)
>
> Alternative ways exist, but the data is packaged in ways that make it more 
> difficult to access (FITS tables will make you cry).
>
> If the idea is benchmarks and example tests, I'm sure we could curate and 
> make a few datasets available in easily digestible formats and of 
> increasing size. (Basically, like a few big RDatasets.) 
>
> Cameron
>
