Reading hdf5 formats with pyspark

Mohit Singh Mon, 28 Jul 2014 21:06:20 -0700

Hi,
   We have set up Spark on an HPC system and are trying to put a data
pipeline and some algorithms in place.
The input data is in HDF5 (these are very high-resolution brain images) and
can be read with the h5py library in Python. So my current approach (which
seems to be working) is to write a function
def process(filename):
    # logic
    pass


and then execute it via
files = [...]  # list of filenames
sc.parallelize(files).foreach(process)

Is this the right approach?
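
To make that concrete, here is a minimal sketch of the pattern. The dataset name "data", the file paths, the app name, and the run/main helpers are placeholders for illustration, not our real names. Asking for one partition per file is also just an assumption, on the idea that each image is large enough to deserve its own task.

```python
def process(filename):
    """Open one HDF5 file on a worker and run the per-file logic."""
    import h5py  # imported here so the import happens on the worker

    with h5py.File(filename, "r") as f:
        image = f["data"][...]  # "data" is a hypothetical dataset name
        # logic goes here, e.g. compute something from `image`
        # and write the result to shared storage


def run(sc, files, func=process):
    """Distribute the filenames and call func once per file."""
    # Assumption: one partition per file, so each task handles one image.
    num_slices = max(1, len(files))
    sc.parallelize(files, num_slices).foreach(func)


def main():
    from pyspark import SparkContext

    sc = SparkContext(appName="hdf5-pipeline")  # hypothetical app name
    files = ["/path/to/image_000.h5", "/path/to/image_001.h5"]  # placeholders
    run(sc, files)
    sc.stop()
```

Here main() would be called from the driver script. Note that foreach returns nothing to the driver, so in this sketch process has to write its own output somewhere the rest of the pipeline can see.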
-- 
Mohit

"When you want success as badly as you want the air, then you will get it.
There is no other secret of success."
-Socrates
