Hi,

We've been using Hadoop streaming for the last 3-4 months, and it has all worked fine except for one small problem:
in some situations a Hadoop reduce job receives multiple key groups and needs to write a separate binary output file for each group. However, when a reduce task takes too long and there is spare capacity, Hadoop may speculatively run a duplicate of the task on another node, and the two copies essentially race each other: one finishes cleanly and the other is terminated. Hadoop takes care to remove the terminated task's output from HDFS, but since we're writing files from scripts ourselves, it's up to us to separate the output of cleanly finished tasks from the output of tasks that were terminated prematurely.

Does anybody have answers to the following questions?

1. Is there an easy way for a script launched by Hadoop streaming to tell whether it was terminated before it received its complete input? As far as I was able to ascertain, no signals are sent to those Unix processes; they simply stop receiving data on STDIN. The only way that seemed to work for me was to process all input, then write something to STDOUT/STDERR and check whether that causes a SIGPIPE. But this is ugly, and I hope there is a better solution.

2. Is there any good way to write multiple HDFS files from a streaming script *and have Hadoop clean up those files* when it decides to destroy the task? If there were just one file, I could simply use STDOUT, but dumping multiple binary files to STDOUT is not pretty.

We write the output files to an NFS partition shared among all reducers, which makes it all slightly more complicated because of possible file overwrites. Our current solution, which is not pretty but avoids directly addressing this problem, is to write out files with random names (created with mktemp) and write to STDOUT the command that renames each file to its desired name. Then, as a post-processing stage, I execute all those commands and delete the remaining temporary files as duplicates/incompletes.

Thanks,
-Yuri
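For illustration, the end-of-input probe from question 1 might be sketched like this. This is only a hypothetical plain streaming script, not a Hadoop API; `consume_and_probe` and the processing placeholder are made-up names:

```shell
#!/bin/sh
# Hypothetical sketch of the end-of-input probe described in question 1.
trap '' PIPE   # don't die on SIGPIPE; let the write fail with an error instead

consume_and_probe() {
  while IFS= read -r line; do
    : # process "$line" here (placeholder for the real reduce logic)
  done
  # STDIN has closed. Probe STDOUT: if the framework killed the task and
  # tore down the pipe, this write fails and the input was truncated.
  if printf 'COMPLETE\n' 2>/dev/null; then
    return 0   # input was complete
  else
    return 1   # write failed: assume we were terminated early
  fi
}
```

When the downstream pipe is still intact, the probe succeeds and the script can trust its input; when the write fails, the script can delete its own partial files before exiting.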
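The mktemp/rename workaround at the end could be sketched as follows. `OUTDIR`, `emit_group`, and the group keys are hypothetical placeholders standing in for the shared NFS layout:

```shell
#!/bin/sh
# Sketch of the mktemp/rename workaround described above.
OUTDIR=${OUTDIR:-./out}   # stand-in for the NFS partition shared by reducers
mkdir -p "$OUTDIR"

# Write one group's data under a random temp name, then emit the rename
# command on STDOUT, so only a cleanly finished task's renames end up in
# the job output that the post-processing stage executes.
emit_group() {
  key=$1
  data=$2
  tmp=$(mktemp "$OUTDIR/tmp.XXXXXX")
  printf '%s' "$data" > "$tmp"
  printf 'mv %s %s\n' "$tmp" "$OUTDIR/$key.bin"
}
```

The post-processing stage then runs the collected rename commands (for example by piping the clean task's output through `sh`) and deletes any remaining `tmp.*` files as duplicates/incompletes.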