Jim, do you need barebones RDDs or some of the more structured types (Spark DataFrame, Dataset)? How about loading the data via HDF5/JDBC?
G.

From: Hdf-forum [mailto:[email protected]] On Behalf Of Rowe, Jim
Sent: Monday, January 30, 2017 9:23 AM
To: HDF Users Discussion List <[email protected]>
Cc: Smith, Jacob <[email protected]>
Subject: [Hdf-forum] Azure, DataLake, Spark, Hadoop suggestions....

Hello HDF Gurus,

We are doing some machine learning work against HDF5 data (several hundred files, 5-50 GB each) and are looking for others who may have blazed, or may be blazing, this trail. We are in Azure, using Microsoft Data Lake storage, and are working through how to read the data into RDDs for use in Spark.

We have been working with h5py but are running into an issue: we cannot access the files that Microsoft exposes under the "adl://" URI scheme. Our assumption is that, however that scheme is implemented, it does not translate to a filesystem the underlying HDF5 libraries can read. Our best option so far is to copy the files locally, which introduces an extra step and delay in the process.

If anyone has suggestions or insights on how to architect a cloud solution as roughly described, we would love to talk to you. We are also potentially looking for some paid consulting help in this area if anyone is interested.

Warm regards,
--Jim
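For what it's worth, here is a minimal sketch of the "copy locally, then read" workaround Jim describes, assuming the azure-datalake-store Python SDK and PySpark are available. The tenant/client IDs, store name, paths, and dataset name are all placeholders, and shipping the staged file with sc.addFile is only practical well below the 5-50 GB sizes mentioned; on a real cluster you would stage the file to a shared mount or to each worker instead.

import h5py
from azure.datalake.store import core, lib, multithread
from pyspark import SparkContext, SparkFiles

# --- Stage the file locally (the "copy first" workaround) -------------------
# Service-principal auth; all IDs and paths below are placeholders.
token = lib.auth(tenant_id='<tenant-id>',
                 client_id='<client-id>',
                 client_secret='<client-secret>')
adls = core.AzureDLFileSystem(token, store_name='<store-name>')

# h5py (and the HDF5 C library underneath it) needs a real filesystem path,
# not an adl:// URI, so pull the file down to local disk first.
multithread.ADLDownloader(adls, rpath='/data/sample.h5',
                          lpath='/tmp/sample.h5', overwrite=True)

# --- Fan the dataset out into an RDD ----------------------------------------
sc = SparkContext(appName='hdf5-ingest')
sc.addFile('/tmp/sample.h5')  # ship the staged file to every executor

def read_rows(span):
    # Read a half-open row range [start, stop) of one dataset with h5py.
    start, stop = span
    with h5py.File(SparkFiles.get('sample.h5'), 'r') as f:
        return f['/my_dataset'][start:stop].tolist()

with h5py.File('/tmp/sample.h5', 'r') as f:
    n_rows = f['/my_dataset'].shape[0]

step = 100000  # rows per partition; tune to row size and cluster memory
spans = [(i, min(i + step, n_rows)) for i in range(0, n_rows, step)]
rdd = sc.parallelize(spans, len(spans)).flatMap(read_rows)
print(rdd.count())

On the RDD-versus-DataFrame question: starting from row lists does not lock you out of the structured APIs, since once a schema is pinned down the same RDD converts with spark.createDataFrame(rdd, schema).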
