On Thursday, May 1, 2014, Viral Shah <vi...@mayin.org> wrote:

> This would certainly be useful - to have prepackaged large datasets for
> people to work with. The question is what kind of operations would one want
> to do on such a dataset. If you could provide a set of well defined
> benchmarks (simple kernel codes that developers can work with), this could
> certainly be useful.
>

This is basically what I had in mind. However, there is a long list of
possibilities.  It'd be helpful to first get a sense of which tests would be
most useful, and then create specific examples (e.g., data sizes and
algorithms).
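
To make that concrete, here is a minimal sketch of what I mean by a kernel
benchmark -- the size and the operation below are placeholders, and the real
work is deciding which sizes and which kernels to standardize on:

    # Hypothetical kernel benchmark in Julia; the size and operation are made up.
    n = 10^7            # vary this upward to probe different data sizes
    x = rand(n)         # stand-in for a real dataset column
    @time sum(x)        # swap in whatever kernel (sort, group-by, regression, ...) we settle on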

On Thu, May 1, 2014 at 8:51 AM, Kevin Squire <kevin.squ...@gmail.com> wrote:

> For large datasets that are not likely to disappear, an interface to
> download them sounds more useful to me.
>

This is certainly possible.  Download size and time will (obviously) be an
issue almost by definition for "big data".

For a few cases, it is easy to create such an interface now using the
available public hosts.  For example, a recent book focusing on Python does just this:
http://www.astroml.org/examples/datasets/compute_sdss_pca.html#example-datasets-compute-sdss-pca
(That page highlights a specific case of ~700 MB of data that reportedly
takes ~30 minutes to fetch. Clearly, the bottleneck is the hosting interface,
not the raw data transfer.)
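
A minimal sketch of such a download interface in Julia might look like the
following; the cache location, naming scheme, and example URL are all
placeholders, not a real dataset service:

    # Sketch of a download-and-cache helper; directory and URL are invented.
    function fetch_dataset(url::AbstractString, cachedir::AbstractString="cache")
        mkpath(cachedir)                          # make sure the cache directory exists
        dest = joinpath(cachedir, basename(url))  # local copy named after the URL
        isfile(dest) || download(url, dest)       # only hit the network on first use
        return dest
    end

    # e.g. path = fetch_dataset("http://example.org/some_big_table.csv")  # placeholder URL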

Alternative access routes exist, but the data is packaged in formats that
make it harder to work with (FITS tables will make you cry).

If the idea is benchmarks and example tests, I'm sure we could curate a few
datasets of increasing size and make them available in easily digestible
formats. (Basically, like a few big RDatasets.)
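
For comparison, RDatasets exposes its (small) datasets through a single call,
and I'd imagine a big-data analogue looking much the same, just backed by a
download/cache step. The second package and dataset names below are invented
for illustration:

    using RDatasets
    iris = dataset("datasets", "iris")    # existing small-data case, a few KB

    # flights = dataset("BigDatasets", "airline_ontime")   # hypothetical big-data analogue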

Cameron
