I am attempting to distribute the execution of a C-based program across a Hadoop cluster without using MapReduce. I have read that YARN can schedule non-MapReduce applications if you program against the ASM/RM (ApplicationsManager/ResourceManager) interfaces. As I understand it, this eventually comes down to specifying each sub-task's command line via ContainerLaunchContext.setCommands().
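To make that concrete, the launch code I have in mind looks roughly like the sketch below. The binary path and argument are placeholders for whatever my bootstrap step (described next) has installed on the node's local disk:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class NativeTaskLauncher {
    // Build the launch context for one native sub-task. The binary path
    // and argument are placeholders; the point is just that the command
    // is an ordinary shell command line, not a Java class.
    static ContainerLaunchContext buildLaunchContext(String taskArg) {
        ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
        ctx.setCommands(Collections.singletonList(
            "/opt/myapp/bin/worker " + taskArg
            + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
            + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
        return ctx;
    }
}
```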
However, the program and its shared libraries need to be on each worker's local disk in order to run. In addition, there is a hefty data set (say, 4 GB) that the application accesses through regular open()/read() calls in a library, so it too must be an ordinary local file.

My thought was to push the program-plus-data package to a known folder in HDFS, then launch a "bootstrap" that compares the version in that HDFS folder against a local folder, copies any updated files as needed, and then launches the native application task (a sketch of that bootstrap is at the end of this message). Are there better approaches? I notice that "local resources" can be copied implicitly as part of the launch, but I don't want to copy 4 GB every time, only occasionally, when the application or the reference data has been updated. Also, will my bootstrapper be allowed to set the executable-mode bits on the programs after they are copied?
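For reference, the bootstrap logic I am picturing is roughly the following. All paths are made up, the "version check" here is just a modification-time comparison, and File.setExecutable() is my guess at how to restore the execute bit after the copy (which is exactly the part I am unsure the container will be permitted to do):

```java
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Bootstrap {
    // Copy one file from HDFS to local disk if the HDFS copy is newer,
    // then mark it executable. All paths are placeholders.
    static void syncFile(FileSystem fs, Path src, File dst) throws Exception {
        FileStatus status = fs.getFileStatus(src);
        if (!dst.exists() || dst.lastModified() < status.getModificationTime()) {
            fs.copyToLocalFile(src, new Path(dst.getAbsolutePath()));
            dst.setExecutable(true); // the step I am asking about
        }
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        syncFile(fs, new Path("/apps/myapp/bin/worker"),
                 new File("/opt/myapp/bin/worker"));
    }
}
```

Thanks,
John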