On 02/03/2011 12:16 PM, Keith Wiley wrote:
I've seen this asked before, but haven't seen a response yet.

If the input to a streaming job is not actual data splits but simply
HDFS file names that are then read by the mappers, how can data
locality be achieved?

Likewise, is there any easier way to make those files accessible
than using the -cacheFile flag?  That requires building a very
long hadoop command (potentially hundreds of files).  I'm worried
about exceeding some command-line length limit... plus it would be
nice to do this programmatically, say with the
DistributedCache.addCacheFile() method, but that requires writing
your own driver, which I don't see how to do with streaming.

Thoughts?
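For illustration, a minimal sketch of the programmatic route mentioned above: driver-side code that lists a (hypothetical) HDFS directory and registers each file with the distributed cache via DistributedCache.addCacheFile(), instead of passing hundreds of -cacheFile flags. The directory path and class name are invented for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CacheFileSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Register every file under the (hypothetical) side-data directory
            // with the distributed cache; no long command line required.
            for (FileStatus status : fs.listStatus(new Path("/user/me/sidedata"))) {
                DistributedCache.addCacheFile(status.getPath().toUri(), conf);
            }

            // conf is then handed to whatever driver actually submits the job.
        }
    }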

Submit the job in a Java app instead of via streaming? Have a big loop where you repeatedly call FileInputFormat.addInputPath(job, path). (Or, if you're going to have a large number of input files, use CombineFileInputFormat for efficiency.)
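To sketch what that might look like (untested, using the org.apache.hadoop.mapreduce API; mapper/reducer setup and the output path are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ManyFilesDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "many-input-files");
            job.setJarByClass(ManyFilesDriver.class);

            // One input path per file; Hadoop computes the splits from the
            // files themselves, so the scheduler can place map tasks near
            // the HDFS blocks (data locality).
            for (String file : args) {
                FileInputFormat.addInputPath(job, new Path(file));
            }

            // job.setMapperClass(...); job.setReducerClass(...); etc. go here.
            // With very many small files, a CombineFileInputFormat subclass
            // can pack several files into one split to cut per-task overhead.

            FileOutputFormat.setOutputPath(job, new Path("/user/me/output"));  // placeholder
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The list of files could just as easily come from a manifest file or from FileSystem.listStatus() rather than the command line.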

HTH,

DR
