Jake, are you using 100%, 80%, 60%, ... of the data that you'd be copying?
If you're using just a fraction (< 20%), copying all those files sounds like
a waste.

[OK, I'm peddling HDF5/JDBC server here...]

With the HDF5/JDBC server you could:

1. Limit (SELECT) the amount of data brought in over the network.
2. Use something like Sqoop to save the data in any BigData format you like
   (rough Spark sketch below).
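
To make that concrete, here is a rough PySpark sketch of the idea; Spark's
generic JDBC reader stands in for Sqoop here, and the JDBC URL, driver class,
table/column names, and output path are made-up placeholders, not the server's
actual settings:

    # Pull only the rows/columns you need over JDBC, then let Spark write the
    # reduced data out in a columnar BigData format for later jobs.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdf5-jdbc-subset").getOrCreate()

    subset = (spark.read.format("jdbc")
              .option("url", "jdbc:hdf5://example-server:9000/mystore")  # placeholder URL
              .option("driver", "org.example.hdf5.jdbc.Driver")          # placeholder driver class
              .option("dbtable",
                      "(SELECT col_a, col_b FROM measurements "
                      "WHERE run_id = 42) AS subset")                    # push the SELECT to the server
              .load())

    subset.write.mode("overwrite").parquet(
        "adl://yourstore.azuredatalakestore.net/reduced/run42")          # placeholder output path

The point is that only the selected subset ever crosses the network; everything
else stays on the server side.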

G.

________________________________________
From: Smith, Jacob <[email protected]>
Sent: Monday, January 30, 2017 11:19:55 AM
To: Gerd Heber; HDF Users Discussion List
Subject: RE: Azure, DataLake, Spark, Hadoop suggestions....

Gerd,

Thanks for the response!  My name is Jake Smith and I'll be working on this
cloud solution.  Currently, our HDF5 files are in DataLake, and we use a Python
Jupyter notebook on Azure HDInsight with a Spark cluster.  We want to load our
data from HDF5 into an H2O frame to build additional models.  We are using
Sparkling Water (the integration of H2O and Spark).  Since h5py (the Python
module) doesn't seem to support remote querying of HDF5 files (I'm not sure
whether that's a limitation of HDF5 itself or just of this Python client), we
are wondering whether it is a good idea to download these files to the Spark
cluster before transforming them into RDDs.
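
For concreteness, the download-then-load path we have in mind looks roughly
like the sketch below.  The local path, dataset name, and column naming are
made up, and the pysparkling calls should be checked against the Sparkling
Water version installed on the cluster:

    # Sketch: read a locally copied HDF5 file with h5py and hand it to H2O
    # via Sparkling Water.  File path and dataset name are hypothetical.
    import h5py
    import pandas as pd
    from pyspark.sql import SparkSession
    from pysparkling import H2OContext

    spark = SparkSession.builder.appName("hdf5-to-h2o").getOrCreate()
    hc = H2OContext.getOrCreate(spark)

    with h5py.File("/tmp/run42.h5", "r") as f:       # already copied to local disk
        data = f["/signals"][:]                      # hypothetical 2-D dataset -> NumPy array

    # NumPy -> pandas -> Spark DataFrame -> H2O frame
    pdf = pd.DataFrame(data, columns=["c{}".format(i) for i in range(data.shape[1])])
    sdf = spark.createDataFrame(pdf)
    h2o_frame = hc.as_h2o_frame(sdf)                 # the frame H2O models train on

The open question is whether the local copy in the first step is really
necessary, or whether there is a way to read straight out of DataLake.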

From: Gerd Heber [mailto:[email protected]]
Sent: Monday, January 30, 2017 8:59 AM
To: HDF Users Discussion List <[email protected]>
Cc: Smith, Jacob <[email protected]>
Subject: RE: Azure, DataLake, Spark, Hadoop suggestions....

Jim, do you need barebones RDDs or some of the more structured types (Spark 
DataFrame, Dataset)?
How about loading the data via HDF5/JDBC?

G.

From: Hdf-forum [mailto:[email protected]] On Behalf Of 
Rowe, Jim
Sent: Monday, January 30, 2017 9:23 AM
To: HDF Users Discussion List <[email protected]>
Cc: Smith, Jacob <[email protected]>
Subject: [Hdf-forum] Azure, DataLake, Spark, Hadoop suggestions....

Hello HDF Gurus,
We are doing some machine learning work against HDF5 data (several hundred 
files, 5-50GB each).

We are looking for others who may have blazed, or are blazing, this trail.  We
are in Azure using Microsoft DataLake storage and are working out how to read
the data into RDDs for use in Spark.

We have been working with h5py, but we are running into issues where we cannot
access files that Microsoft exposes via the "adl://" URI scheme.  Our
assumption is that, however that scheme is implemented, it does not translate
into a filesystem the underlying HDF5 library can read (though we are not
certain).  Our best option so far is to copy the files locally, which
introduces an extra step and delay in the process.
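
For what it's worth, the workaround looks roughly like this; the store name
and paths are placeholders, and it relies on the fact that HDInsight registers
the "adl://" scheme with Hadoop, so the standard Hadoop CLI can do the copy
before h5py opens the local file:

    # Copy-locally workaround: pull the file out of Data Lake with the Hadoop
    # CLI (which understands adl://), then read it with h5py as an ordinary
    # local HDF5 file.  Store name, remote path, and local path are placeholders.
    import subprocess
    import h5py

    remote = "adl://yourstore.azuredatalakestore.net/data/run42.h5"
    local = "/tmp/run42.h5"

    subprocess.check_call(["hdfs", "dfs", "-copyToLocal", remote, local])

    with h5py.File(local, "r") as f:
        print(list(f.keys()))        # inspect top-level groups/datasets

It works, but the copy is the extra step and delay mentioned above.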

If anyone has suggestions or insights on how to architect a cloud solution as 
roughly described, we would love to talk to you.  We are also potentially 
looking for some paid consulting help in this area if anyone is interested.


Warm regards,
--Jim

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
