On Thursday, May 1, 2014, Viral Shah <vi...@mayin.org> wrote:

> This would certainly be useful - to have prepackaged large datasets for
> people to work with. The question is what kind of operations would one want
> to do on such a dataset. If you could provide a set of well defined
> benchmarks (simple kernel codes that developers can work with), this could
> certainly be useful.
This is basically what I had in mind. However, there is a long list of
potential benchmarks. It would be helpful to get a sense of which tests
would be most useful first, and then create specific examples (i.e., data
sizes and algorithms).

On Thu, May 1, 2014 at 8:51 AM, Kevin Squire <kevin.squ...@gmail.com> wrote:

> For large datasets that are not likely to disappear, an interface to
> download them sounds more useful to me.

This is certainly possible. Feasibility will (obviously) be an issue almost
by definition for "big data". For a few cases, it is easy to create an
interface now using the available public hosts. For example, a recent book
focusing on Python does just this:

http://www.astroml.org/examples/datasets/compute_sdss_pca.html#example-datasets-compute-sdss-pca

(This highlights a specific case of ~700 MB of data that reportedly takes
~30 minutes to fetch. Clearly, that is an interface limitation, not a
bit-shuffling issue.)

Alternative access methods exist, but the data is packaged in ways that
make it more difficult to use (FITS tables will make you cry).

If the idea is benchmarks and example tests, I'm sure we could curate a few
datasets in easily digestible formats and of increasing size (basically,
like a few big RDatasets).

Cameron
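P.S. The download-then-cache interface Kevin describes could be quite
small. A rough sketch in Python (the names `fetch_dataset` and `cache_dir`
are illustrative, not an existing package's API):

```python
import os
import urllib.request

def fetch_dataset(url, cache_dir="datasets"):
    """Download `url` into `cache_dir` unless a cached copy exists.

    Returns the local path to the file. Subsequent calls reuse the
    cached copy instead of re-downloading.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        # Only hit the network on a cache miss.
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

The same pattern (lazy download, local cache keyed by filename) is what
astroML uses for the SDSS data above, and something similar would work for
curated benchmark datasets of increasing size.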