Here is how we (attempt to) do it:

The reducer (in streaming) writes one file for each different key it receives as
input.  Here's some example code in Perl:
    my $envdir = $ENV{'mapred_output_dir'};
    # s/// strips a leading "file:" and returns true if it was there
    my $fs = ($envdir =~ s/^file://);
    if ($fs) {
        # output goes onto NFS
        open(FILEOUT, ">$envdir/${filename}.png")
            or die "$0: cannot open $envdir/${filename}.png: $!\n";
    } else {
        # output specifies DFS
        open(FILEOUT, ">/tmp/${filename}.png")
            or die "Cannot open /tmp/${filename}.png: $!\n"; # or pipe to dfs -put
    }
    ... # write FILEOUT
    close(FILEOUT); # flush before chmod / -put
    if ($fs) {
        # for NFS just fix permissions
        chmod 0664, "$envdir/${filename}.png";
        chmod 0775, "$envdir";
    } else {
        # for HDFS -put the file
        my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
        # system() returns 0 on success, so check for that before
        # removing the temporary file
        system($hadoop, "dfs", "-put", "/tmp/${filename}.png",
               "$envdir/${filename}.png") == 0
            and unlink "/tmp/${filename}.png";
    }
If the -output option to streaming specifies an NFS directory, everything works,
but it doesn't scale.  We must use the mapred_output_dir environment variable
because it points to the temporary directory, and you don't want two or more
instances of the same task writing to the same file.
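
For what it's worth, the per-key loop around the code above can be sketched
roughly as follows.  This is only an illustration, not our production code:
split_by_key is a made-up helper name, the ".txt" suffix is arbitrary, and it
assumes that keys are safe to use as file names and that lines arrive as
"key<TAB>value" grouped by key, which is how streaming hands them to a reducer.

```perl
use strict;
use warnings;

# Hypothetical helper: read "key<TAB>value" lines from $in and write the
# values for each key to its own file, "$outdir/$key.txt".
# Assumes the input is already grouped by key.
sub split_by_key {
    my ($in, $outdir) = @_;
    my ($current, $out, @files);
    while (my $line = <$in>) {
        chomp $line;
        next unless length $line;            # skip blank lines
        my ($key, $value) = split /\t/, $line, 2;
        $value = '' unless defined $value;   # key with no value
        if (!defined $current or $key ne $current) {
            # Key changed: finish the previous file, start the next one.
            if ($out) { close($out) or die "$0: close failed: $!\n"; }
            my $path = "$outdir/$key.txt";
            open(my $fh, '>', $path) or die "$0: cannot open $path: $!\n";
            $out = $fh;
            push @files, $path;
            $current = $key;
        }
        print {$out} "$value\n";
    }
    if ($out) { close($out) or die "$0: close failed: $!\n"; }
    return @files;                           # paths written, in key order
}
```

In a reducer this would be called as split_by_key(\*STDIN, $envdir), with the
chmod or -put step applied to each returned path afterwards.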

If -output points to HDFS, however, the code above bombs while trying to -put a
file, with an error something like "could not reserve enough memory for java vm",
at which point Java dies.  If anyone has any suggestions on how to fix that, I'd
appreciate it.


On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote:
> Hi, I am using Hadoop streaming and I am trying to create a MapReduce that
> will generate output where a single key is found in a single output part
> file.
> Does anyone know how to ensure this condition? I want the reduce task (no
> matter how many are specified), to only receive
> key-value output from a single key each, process the key-value pairs for
> this key, write an output part-XXX file, and only
> then process the next key.
> Here is the task that I am trying to accomplish:
> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> Output: Each part-XXX should contain the lines of T that contain the word
> from line XXX in V.
> Any help/ideas are appreciated.
> Ashish

