Develop your own Hadoop FileInputFormat and use
https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/SparkContext.html#newAPIHadoopRDD(org.apache.hadoop.conf.Configuration,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class)
to load it. The Spark data source API, in its upcoming version 2, will also be
relevant for you as an alternative.
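
A rough, untested sketch of what that could look like in Scala (CdfInputFormat and CdfRecordReader are made-up names; the actual call into the CDF bindings is left as a comment since it depends on the NASA Java API):

import java.io.File

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// One record per file; never split a CDF, the native library needs the whole
// file and random access to it.
class CdfInputFormat extends FileInputFormat[Text, Text] {
  override def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[Text, Text] =
    new CdfRecordReader
}

// Copies each file out of HDFS onto the executor's local disk so that the JNI
// bindings can open it, then emits a single (hdfs path, value) record for it.
class CdfRecordReader extends RecordReader[Text, Text] {
  private var hdfsPath: Path = _
  private var localCopy: File = _
  private var value: Text = _
  private var done = false

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    hdfsPath = split.asInstanceOf[FileSplit].getPath
    val fs = hdfsPath.getFileSystem(context.getConfiguration)
    localCopy = File.createTempFile("cdf-", ".cdf")
    localCopy.delete() // copyToLocalFile will recreate it
    fs.copyToLocalFile(hdfsPath, new Path(localCopy.getAbsolutePath))
  }

  override def nextKeyValue(): Boolean = {
    if (done) {
      false
    } else {
      // Here you would open localCopy with the NASA CDF bindings and turn it
      // into whatever you actually need; Text is only a placeholder value type.
      value = new Text(localCopy.getAbsolutePath)
      done = true
      true
    }
  }

  override def getCurrentKey: Text = new Text(hdfsPath.toString)
  override def getCurrentValue: Text = value
  override def getProgress: Float = if (done) 1.0f else 0.0f
  override def close(): Unit = { if (localCopy != null) localCopy.delete() }
}

// Driver side:
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs://somepath/*")
val cdfFiles = sc.newAPIHadoopRDD(conf, classOf[CdfInputFormat], classOf[Text], classOf[Text])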

> On 16. Dec 2017, at 03:33, Christopher Piggott <cpigg...@gmail.com> wrote:
> 
> I'm looking to run a job that involves a zillion files in a format called 
> CDF, a NASA standard.  There are a number of libraries out there that can 
> read CDFs, but most of them are not high quality compared to the official NASA 
> one, which has Java bindings (via JNI).  It's a little clumsy but I have it 
> working fairly well in Scala.
> 
> The way I was planning on distributing work was with 
> SparkContext.binaryFiles("hdfs://somepath/*"), but that's really sending in an 
> RDD of byte[], and unfortunately the CDF library doesn't support any kind of 
> array or stream as input.  The reason is that CDF is really looking for a 
> random-access file, for performance reasons.
> 
> What's worse, all this code is implemented down at the native layer, in C.
> 
> I think my best choice here is to distribute the job using .binaryFiles() but 
> then have the first task of the worker be to write all those bytes to a 
> ramdisk file (or maybe a real file, we'll see)... then have the CDF library 
> open it as if it were a local file.  This seems clumsy and awful but I 
> haven't come up with any other good ideas.
> 
> Has anybody else worked with these files and have a better idea?  Some info 
> on the library that parses all this:
> 
> https://cdf.gsfc.nasa.gov/html/cdf_docs.html
> 
> 
> --Chris
> 

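If you stay with binaryFiles() for now, the temp-file trick you describe is easy enough to sketch. Untested, and parseCdf below is only a stand-in for whatever you actually do with the NASA bindings:

import java.io.File
import java.nio.file.Files

// Stand-in for the real work done with the NASA CDF Java bindings.
def parseCdf(hdfsPath: String, localFile: File): Long = localFile.length()

val results = sc.binaryFiles("hdfs://somepath/*").map { case (path, stream) =>
  // Point java.io.tmpdir at a ramdisk (e.g. /dev/shm) if you want the ramdisk variant.
  val tmp = File.createTempFile("cdf-", ".cdf")
  try {
    Files.write(tmp.toPath, stream.toArray()) // materialise the bytes as a real local file
    parseCdf(path, tmp)                       // the library can now do random access on tmp
  } finally {
    tmp.delete()
  }
}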