Your first two items are spot on.  We don't expect to have the cluster to 
ourselves.  We also expect to interop with existing HDFS data and want to 
schedule for data locality.

From: Vinod Kumar Vavilapalli
Sent: Friday, May 17, 2013 11:08 AM
Subject: Re: Distribution of native executables and data for YARN-based 

I have a bit of a conflict of interest, given that I have worked on Hadoop YARN 
all this time, but...

I have worked on torque/condor based resource management systems too. There are 
many advantages of working on top of YARN, a couple that should be specifically 
relevant here:
 - MR and non-MR jobs all on the same cluster (there are a few not-so-ready MR 
implementations on existing schedulers, but with lots of limitations)
 - A data locality feature that is native in Hadoop YARN and hard to simulate 
in other schedulers (we have experience trying this in the past)
 - Elastic resource management - jobs can grow and shrink elastically

+Vinod Kumar Vavilapalli
Hortonworks Inc.

On May 17, 2013, at 7:20 AM, Tim St Clair wrote:

Hi John -

If you are doing extensive levels of non-MR C-style batch, you may be better 
served to look at myriad universes of existing schedulers (torque, condor, 
etc.).  Or investigate the space around interop (1 cluster, many schedulers).

Either way, I recommend minimizing the dependency graph of your C application 
where possible if you are working in a heterogeneous environment.


From: "John Lilley"
Sent: Friday, May 17, 2013 8:35:53 AM
Subject: RE: Distribution of native executables and data for YARN-based 

Thanks!  This sounds exactly like what I need.  PUBLIC is right.

Do you know if this works for executables as well?  Like, would there be any 
issue transferring the executable bit on the file?


From: Vinod Kumar Vavilapalli
Sent: Friday, May 17, 2013 12:56 AM
Subject: Re: Distribution of native executables and data for YARN-based 

The "local resources" you mentioned are exactly the solution for this. For each 
LocalResource, you also specify a LocalResourceVisibility, which today takes 
one of three values: PUBLIC, PRIVATE and APPLICATION.

PUBLIC resources are downloaded only once and shared by any application running 
on that node.

PRIVATE resources are downloaded only once and shared by any application run by 
the same user on that node.

APPLICATION resources are downloaded per application and removed after the 
application finishes.

Seems like you want PUBLIC or PRIVATE.

Note that for PUBLIC resources to work, the corresponding files need to be 
public on HDFS too.

Also, if the remote files on HDFS are updated, the local copies will be 
downloaded afresh on each node where your containers run.
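
To make the three scopes concrete, here is a rough sketch of the sharing 
rules described above, written as a toy model of one node's cache. It does 
not use the YARN classes: LocalCacheModel, cacheKey, localize and 
finishApplication are hypothetical names made up for illustration, and the 
real NodeManager bookkeeping is certainly more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of one node's resource cache. Hypothetical types, NOT the YARN API.
class LocalCacheModel {
    enum Visibility { PUBLIC, PRIVATE, APPLICATION }

    // Cache key -> timestamp of the remote file when it was last downloaded.
    private final Map<String, Long> cached = new HashMap<>();
    private int downloads = 0;

    // The visibility decides who shares a cached copy on this node.
    static String cacheKey(Visibility v, String user, String appId, String remotePath) {
        switch (v) {
            case PUBLIC:  return "public|" + remotePath;               // everyone
            case PRIVATE: return "user:" + user + "|" + remotePath;    // same user
            default:      return "app:" + appId + "|" + remotePath;    // same application
        }
    }

    // Returns true if a download was needed: either nothing is cached for this
    // scope, or the remote file's timestamp differs from the cached copy's.
    boolean localize(Visibility v, String user, String appId,
                     String remotePath, long remoteTimestamp) {
        String key = cacheKey(v, user, appId, remotePath);
        Long have = cached.get(key);
        if (have != null && have.longValue() == remoteTimestamp) {
            return false; // reuse the copy already on local disk
        }
        cached.put(key, remoteTimestamp);
        downloads++;
        return true;
    }

    // APPLICATION-scoped copies are dropped when the application finishes.
    void finishApplication(String appId) {
        cached.keySet().removeIf(k -> k.startsWith("app:" + appId + "|"));
    }

    int downloads() { return downloads; }
}
```

In this model a 4GB PUBLIC file is fetched once per node and again only 
when its timestamp changes, which is the behaviour you are after.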


+Vinod Kumar Vavilapalli
Hortonworks Inc.

On May 16, 2013, at 2:21 PM, John Lilley wrote:

I am attempting to distribute the execution of a C-based program onto a Hadoop 
cluster, without using MapReduce.  I read that YARN can be used to schedule 
non-MapReduce applications by programming to the ASM/RM interfaces.  As I 
understand it, eventually I get down to specifying each sub-task via 

However, the program and shared libraries need to be stored on each worker's 
local disk to run.  In addition there is a hefty data set that the application 
uses (say, 4GB) that is accessed via regular open()/read() calls by a library.  
I thought a decent strategy would be to push the program+data package to a 
known folder in HDFS, then launch a "bootstrap" that compared the HDFS folder 
version to a local folder, copying any updated files as needed before launching 
the native application task.

Are there better approaches?  I notice that one can implicitly copy "local 
resources" as part of the launch, but I don't want to copy 4GB every time, only 
occasionally when the application or reference data is updated.  Also, will my 
bootstrapper be allowed to set executable-mode bits on the programs after they 
are copied?
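
For reference, the bootstrap idea above could be sketched roughly as follows. 
This is only an illustration under stated assumptions: a plain local 
directory stands in for the HDFS folder, "newer or different size" stands in 
for whatever version check is really wanted, and BootstrapSync, syncUpdated, 
isStale and markExecutable are hypothetical names. Whether the container's 
user is actually permitted to change mode bits on a given cluster is not 
something this sketch answers.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical bootstrap helper: copy only the files that changed.
class BootstrapSync {

    // Copies every regular file under 'source' whose counterpart under 'target'
    // is missing or stale. Returns the relative paths that were copied.
    static List<Path> syncUpdated(Path source, Path target) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(source)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<Path> copied = new ArrayList<>();
        for (Path src : files) {
            Path rel = source.relativize(src);
            Path dst = target.resolve(rel);
            if (isStale(src, dst)) {
                Files.createDirectories(dst.getParent());
                Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
                copied.add(rel);
            }
        }
        return copied;
    }

    // A local copy is stale if it is missing, differs in size, or is older
    // than the source file.
    static boolean isStale(Path src, Path dst) throws IOException {
        if (!Files.exists(dst)) {
            return true;
        }
        return Files.size(src) != Files.size(dst)
                || Files.getLastModifiedTime(src)
                        .compareTo(Files.getLastModifiedTime(dst)) > 0;
    }

    // Sets the executable bit for all users; returns false if not permitted.
    static boolean markExecutable(Path program) {
        return program.toFile().setExecutable(true, false);
    }
}
```

The bootstrap would call syncUpdated once per launch, then markExecutable on 
each program before starting the native task.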

