Here is how we (attempt to) do it:

The reducer (in streaming) writes one file for each distinct key it receives as input.
Here's some example code in Perl:
    my $envdir = $ENV{'mapred_output_dir'};
    my $fs = ($envdir =~ s/^file://);   # strip "file:" prefix; true means local/NFS output
    if ($fs) {
        # output goes onto NFS
        open(FILEOUT, ">$envdir/${filename}.png")
            or die "$0: cannot open $envdir/${filename}.png: $!\n";
    } else {
        # output specifies DFS: write locally first, then -put
        open(FILEOUT, ">/tmp/${filename}.png")
            or die "Cannot open /tmp/${filename}.png: $!\n"; # or pipe to dfs -put
    }
    ... # write FILEOUT (and close it before the chmod/-put below)
    if ($fs) {
        # for NFS just fix permissions
        chmod 0664, "$envdir/${filename}.png";
        chmod 0775, $envdir;
    } else {
        # for HDFS, -put the file; system() returns 0 on success,
        # so only unlink the local copy once the put has succeeded
        my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
        $ENV{HADOOP_HEAPSIZE} = 300;
        system($hadoop, "dfs", "-put", "/tmp/${filename}.png", "$envdir/${filename}.png") == 0
            and unlink "/tmp/${filename}.png";
    }
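
The snippet above only decides where a single key's output file goes; the surrounding
key-change loop is not shown.  As a minimal sketch (assuming streaming hands the reducer
tab-separated key/value lines already sorted by key, and assuming a hypothetical
write_values_for_key() that wraps the open/chmod/-put logic above), that loop might look
like this:

    # Hedged sketch: write_values_for_key() is a hypothetical helper that runs
    # the open/write/-put code above for one key's accumulated values.
    my $current_key;
    my @values;
    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;   # streaming emits key<TAB>value
        if (defined $current_key && $key ne $current_key) {
            write_values_for_key($current_key, \@values);   # finish previous key's file
            @values = ();
        }
        $current_key = $key;
        push @values, $value;
    }
    write_values_for_key($current_key, \@values) if defined $current_key;   # flush last key

Inside write_values_for_key() you would derive $filename from the key before running the
code shown above.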
   
If the -output option to streaming specifies an NFS directory, everything works, except
that it doesn't scale.  We must use the mapred_output_dir environment variable because it
points to the task's temporary output directory, and you don't want two or more instances
of the same task writing to the same file.
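
Since everything hinges on that variable being set by the streaming framework, a small
defensive check at the top of the reducer (my addition, not part of the code above) makes
it fail loudly if run outside streaming:

    # Hedged addition: bail out early if the framework didn't export mapred_output_dir.
    die "$0: mapred_output_dir is not set; not running under streaming?\n"
        unless defined $ENV{'mapred_output_dir'} && length $ENV{'mapred_output_dir'};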

If -output points to HDFS, however, the code above bombs while trying to -put the file:
the JVM dies with an error along the lines of "could not reserve enough memory for
java vm heap/libs".  If anyone has any suggestions on how to fix that, I'd appreciate it.

Thanks,

 -Yuri
 
On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote:
> Hi, I am using Hadoop streaming and I am trying to create a MapReduce that
> will generate output where a single key is found in a single output part
> file.
> Does anyone know how to ensure this condition? I want the reduce task (no
> matter how many are specified), to only receive
> key-value output from a single key each, process the key-value pairs for
> this key, write an output part-XXX file, and only
> then process the next key.
>
> Here is the task that I am trying to accomplish:
>
> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> Output: Each part-XXX should contain the lines of T that contain the word
> from line XXX in V.
>
> Any help/ideas are appreciated.
>
> Ashish

