Hi,

We've been using Hadoop streaming for the last 3-4 months, and it has all worked fine except for one small problem:
in some situations a Hadoop reduce job receives multiple key groups and needs to write a separate binary output file for each group. However, when a reduce task takes too long and there is spare capacity, Hadoop may speculatively run a duplicate of the task on another node, and the two copies essentially race each other: one finishes cleanly and the other is terminated. Hadoop takes care to remove the terminated task's output from HDFS, but since we're writing files from scripts ourselves, it's up to us to separate the output of cleanly finished tasks from the output of tasks that were terminated prematurely.

Does anybody have answers to the following questions?

1. Is there an easy way for a script launched by Hadoop streaming to tell whether it was terminated before it received its complete input? As far as I was able to ascertain, no signals are sent to those Unix processes; they simply stop receiving data on STDIN. The only way that seemed to work for me was to process all input, then write something to STDOUT/STDERR and check whether that causes a SIGPIPE. But this is ugly, and I hope there is a better solution.

2. Is there any good way to write multiple HDFS files from a streaming script *and have Hadoop clean up those files* when it decides to destroy the task? If there were just one file, I could simply use STDOUT, but dumping multiple binary files to STDOUT is not pretty.

We write the output files to an NFS partition shared among all reducers, which makes it all slightly more complicated because of possible file overwrites. Our current solution, which is not pretty but avoids directly addressing this problem, is to write out files with random names (created with mktemp) and write to STDOUT the command that renames each file to its desired name. Then, as a post-processing stage, I execute all those commands and delete the remaining temporary files as duplicates/incompletes.

Thanks,
-Yuri
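For illustration, the end-of-input probe from question 1 might be sketched like this. This is only a hypothetical plain streaming script, not a Hadoop API; `consume_and_probe` and the processing placeholder are made-up names:

```shell
#!/bin/sh
# Hypothetical sketch of the end-of-input probe described in question 1.
trap '' PIPE   # don't die on SIGPIPE; let the write fail with an error instead

consume_and_probe() {
  while IFS= read -r line; do
    : # process "$line" here (placeholder for the real reduce logic)
  done
  # STDIN has closed. Probe STDOUT: if the framework killed the task and
  # tore down the pipe, this write fails and the input was truncated.
  if printf 'COMPLETE\n' 2>/dev/null; then
    return 0   # input was complete
  else
    return 1   # write failed: assume we were terminated early
  fi
}
```

When the downstream pipe is still intact, the probe succeeds and the script can trust its input; when the write fails, the script can delete its own partial files before exiting.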
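The mktemp/rename workaround at the end could be sketched as follows. `OUTDIR`, `emit_group`, and the group keys are hypothetical placeholders standing in for the shared NFS layout:

```shell
#!/bin/sh
# Sketch of the mktemp/rename workaround described above.
OUTDIR=${OUTDIR:-./out}   # stand-in for the NFS partition shared by reducers
mkdir -p "$OUTDIR"

# Write one group's data under a random temp name, then emit the rename
# command on STDOUT, so only a cleanly finished task's renames end up in
# the job output that the post-processing stage executes.
emit_group() {
  key=$1
  data=$2
  tmp=$(mktemp "$OUTDIR/tmp.XXXXXX")
  printf '%s' "$data" > "$tmp"
  printf 'mv %s %s\n' "$tmp" "$OUTDIR/$key.bin"
}
```

The post-processing stage then runs the collected rename commands (for example by piping the clean task's output through `sh`) and deletes any remaining `tmp.*` files as duplicates/incompletes.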