Hi,
I am trying to use Spark on YARN together with third-party code that is
unaware of distributed file systems, so passing HDFS file references to it
does not work.
My idea to work around this was the following: within a function, I take the
HDFS file reference I receive as a parameter, copy the file to the local file
system, and hand the third-party components the local path they expect.
textFolder.map(new Function<....>()
{
    public List<...> call(String inputFile)
        throws Exception
    {
        // resolve the HDFS reference and copy the file to the local file system
        // get a local file pointer (this function is executed on a worker node,
        // so there should be a local file system available)
        // call the 3rd party library with the local file reference
        // do other stuff and return the result
    }
});
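For the copy step itself, a minimal sketch of what I do (the class and method
names here are my own, and I am assuming the HDFS file is obtained as a plain
InputStream, e.g. via Hadoop's FileSystem.open) is:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalStaging {
    // Copy the contents of a (remote) input stream into a local temp file
    // and return the local path, which an HDFS-unaware library can open.
    public static Path stageLocally(InputStream remote, String suffix)
            throws IOException {
        Path local = Files.createTempFile("staged-", suffix);
        // Delete the local copy when the executor JVM exits, so large
        // files do not accumulate on the worker's local disk.
        local.toFile().deleteOnExit();
        Files.copy(remote, local, StandardCopyOption.REPLACE_EXISTING);
        return local;
    }
}
```

Inside the call() above I then pass stageLocally(...).toString() to the
third-party library instead of the HDFS path.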
This seems to work, but I am not sure whether it might cause other problems
at production file sizes. For example, the files I copy to the local file
system might be large. Would this affect YARN somehow? Are there more
advisable ways to make HDFS-unaware libraries work with HDFS file references?
Regards,