I am attempting to distribute the execution of a C-based program onto a Hadoop 
cluster, without using MapReduce.  I read that YARN can be used to schedule 
non-MapReduce applications by programming against the ASM/RM interfaces.  As I 
understand it, I ultimately end up specifying each sub-task's command line via 
ContainerLaunchContext.setCommands().
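
For concreteness, here is roughly what I picture for the per-container command 
(just a sketch; the binary name, arguments, and paths are placeholders I made up):

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.ApplicationConstants;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.util.Records;

    // Inside the ApplicationMaster, once a container has been allocated:
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);

    // Run the native binary and redirect its output into the container's log dir
    ctx.setCommands(Collections.singletonList(
        "./my_native_app --data data/"          // placeholder command line
        + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
        + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));

    // ctx would then be handed to the NodeManager (e.g. via NMClient.startContainer)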

However, the program and its shared libraries need to be on each worker's 
local disk in order to run.  In addition, the application uses a hefty data 
set (say, 4 GB) that a library accesses via regular open()/read() calls.  I 
thought a decent strategy would be to push the program+data package to a known 
folder in HDFS, then launch a "bootstrap" that compares the HDFS folder 
against a local folder, copies any files that have changed, and then launches 
the native application task.
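
In code, the bootstrap step would look something like this (a minimal sketch: 
the HDFS and local paths are made up, it assumes a flat folder, and it only 
compares modification times):

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class Bootstrap {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path hdfsDir = new Path("/apps/myapp");        // placeholder HDFS folder
            File localDir = new File("/var/cache/myapp");  // placeholder local cache
            localDir.mkdirs();

            for (FileStatus stat : fs.listStatus(hdfsDir)) {
                File local = new File(localDir, stat.getPath().getName());
                // Copy only files that are missing or older than the HDFS copy
                if (!local.exists()
                        || local.lastModified() < stat.getModificationTime()) {
                    fs.copyToLocalFile(stat.getPath(),
                                       new Path(local.getAbsolutePath()));
                }
            }
            // ...then exec the native application from localDir
        }
    }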

Are there better approaches?  I notice that "local resources" can be copied 
implicitly as part of the container launch, but I don't want to copy 4 GB 
every time, only occasionally when the application or reference data is 
updated.  Also, will my bootstrapper be allowed to set executable-mode bits on 
the programs after they are copied?
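
For that last point, what I had in mind is simply something like the following 
after the copy finishes (assuming the bootstrapper owns the files it just 
copied; the path is a placeholder):

    import java.io.File;

    // Mark the copied binary executable
    File prog = new File("/var/cache/myapp/my_native_app");  // placeholder path
    prog.setExecutable(true, false);  // true = executable, false = not owner-only

    // org.apache.hadoop.fs.FileUtil.chmod(prog.getAbsolutePath(), "755")
    // would presumably also do the job.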

Thanks
John
