I'm looking to run a job that involves a zillion files in a format called
CDF, a NASA standard.  There are a number of libraries out there that can
read CDFs, but most of them aren't nearly as solid as the official
NASA one, which has Java bindings (via JNI).  It's a little clumsy, but I
have it working fairly well in Scala.

The way I was planning on distributing work was with
SparkContext.binaryFiles("hdfs://somepath/*"), but that effectively hands
each worker the file contents as an in-memory byte array (strictly, an RDD
of (path, PortableDataStream) pairs), and unfortunately the CDF library
doesn't accept any kind of array or stream as input.  The reason is that
CDF really wants a random-access file, for performance reasons.
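
For concreteness, here's roughly what that first step looks like (the
path is made up; binaryFiles() returns (path, PortableDataStream) pairs,
and calling .toArray() on the stream is what pulls the file into memory):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.input.PortableDataStream

    val sc = new SparkContext(new SparkConf().setAppName("cdf-job"))

    // One record per file: (path, lazily-readable stream of its bytes).
    val files = sc.binaryFiles("hdfs://somepath/*")

    // Materialize each file as a byte array -- this is the part the
    // CDF library can't consume directly.
    val asBytes = files.mapValues(_.toArray())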

What's worse, all of this code is implemented down at the native layer, in C.

I think my best option here is to distribute the job using .binaryFiles(),
but make the worker's first task writing all those bytes out to a file on a
ramdisk (or maybe a real local file, we'll see), then have the CDF library
open it as if it were an ordinary local file.  This feels clumsy and awful,
but I haven't come up with any better ideas.
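
Here's the shape of what I'm picturing, as a sketch -- openCdf() and
extractData() below are stand-ins for whatever the JNI binding actually
exposes (I haven't pinned that part down yet), and /dev/shm is the Linux
tmpfs mount:

    import java.nio.file.{Files, Paths}

    val results = sc.binaryFiles("hdfs://somepath/*").map { case (path, stream) =>
      // Write the bytes somewhere the native CDF code can seek() on.
      val tmp = Files.createTempFile(Paths.get("/dev/shm"), "cdf-", ".cdf")
      try {
        Files.write(tmp, stream.toArray())
        val cdf = openCdf(tmp.toString)  // placeholder for the real open call
        extractData(cdf)                 // placeholder for the actual parsing
      } finally {
        Files.deleteIfExists(tmp)        // keep the ramdisk from filling up
      }
    }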

Has anybody else worked with these files and found a better way?  Some info
on the library that parses all this:

https://cdf.gsfc.nasa.gov/html/cdf_docs.html


--Chris
