Hi,
I am trying to use Spark on YARN together with third-party code that is
unaware of distributed file systems, so passing HDFS file references to it
does not work.
My idea to work around this was the following: within a function, I take the
HDFS file reference I receive as a parameter, copy the file to the local file
system, and hand the third-party components the local path they expect.
textFolder.map(new Function<....>()
{
    public List<...> call(String inputFile)
        throws Exception
    {
        // resolve the HDFS reference and copy the file to the local file system
        // get a local file pointer (this function is executed on a worker node,
        // so there should be a local file system available)
        // call the 3rd party library with the local file reference
        // do other stuff and return the result
    }
});
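For the copy step itself, a minimal sketch of what I do (the class and method
names here are my own, and I am assuming the HDFS file is obtained as a plain
InputStream, e.g. via Hadoop's FileSystem.open) is:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalStaging {
    // Copy the contents of a (remote) input stream into a local temp file
    // and return the local path, which an HDFS-unaware library can open.
    public static Path stageLocally(InputStream remote, String suffix)
            throws IOException {
        Path local = Files.createTempFile("staged-", suffix);
        // Delete the local copy when the executor JVM exits, so large
        // files do not accumulate on the worker's local disk.
        local.toFile().deleteOnExit();
        Files.copy(remote, local, StandardCopyOption.REPLACE_EXISTING);
        return local;
    }
}
```

Inside the call() above I then pass stageLocally(...).toString() to the
third-party library instead of the HDFS path.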
This seems to work, but I am not sure whether it might cause other problems
at production file sizes. For example, the files I copy to the local file
system might be large. Would this affect YARN somehow? Are there more
advisable ways to make HDFS-unaware libraries work with HDFS file references?
Regards,