On 02/03/2011 12:16 PM, Keith Wiley wrote:
I've seen this asked before, but haven't seen a response yet.

If the input to a streaming job is not actual data splits but simply
HDFS file names that are then read by the mappers, how can data
locality be achieved?

Likewise, is there any easier way to make those files accessible
than using the -cacheFile flag?  That requires building a very
long hadoop command (potentially hundreds of files).  I'm worried
about exceeding some command-line length limit... plus it would be
nice to do this programmatically, say with the
DistributedCache.addCacheFile() method, but that requires writing
your own driver, which I don't see how to do with streaming.

Thoughts?
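For illustration, a minimal sketch of the programmatic route mentioned above: driver-side code that lists a (hypothetical) HDFS directory and registers each file with the distributed cache via DistributedCache.addCacheFile(), instead of passing hundreds of -cacheFile flags. The directory path and class name are invented for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CacheFileSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Register every file under the (hypothetical) side-data directory
            // with the distributed cache; no long command line required.
            for (FileStatus status : fs.listStatus(new Path("/user/me/sidedata"))) {
                DistributedCache.addCacheFile(status.getPath().toUri(), conf);
            }

            // conf is then handed to whatever driver actually submits the job.
        }
    }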

Submit the job in a Java app instead of via streaming? Have a big loop where you repeatedly call FileInputFormat.addInputPath(job, path). (Or, if you're going to have a large number of input files, use CombineFileInputFormat for efficiency.)
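To sketch what that might look like (untested, using the org.apache.hadoop.mapreduce API; mapper/reducer setup and the output path are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ManyFilesDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "many-input-files");
            job.setJarByClass(ManyFilesDriver.class);

            // One input path per file; Hadoop computes the splits from the
            // files themselves, so the scheduler can place map tasks near
            // the HDFS blocks (data locality).
            for (String file : args) {
                FileInputFormat.addInputPath(job, new Path(file));
            }

            // job.setMapperClass(...); job.setReducerClass(...); etc. go here.
            // With very many small files, a CombineFileInputFormat subclass
            // can pack several files into one split to cut per-task overhead.

            FileOutputFormat.setOutputPath(job, new Path("/user/me/output"));  // placeholder
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The list of files could just as easily come from a manifest file or from FileSystem.listStatus() rather than the command line.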

HTH,

DR
