Here is how we (attempt to) do it: Reducer (in streaming) writes one file for each different key it receives as input. Here's some example code in perl: my $envdir = $ENV{'mapred_output_dir'}; my $fs = ($envdir =~ s/^file://); if ($fs) { #output goes onto NFS open(FILEOUT, ">$envdir/${filename}.png") or die "$0: cannot open $envdir/${filename}.png: $!\n"; } else { #output specifies DFS open(FILEOUT, ">/tmp/${filename}.png") or die "Cannot open /tmp/${filename}.png: $!\n"; #or pipe to dfs -put } ... #write FILEOUT if ($fs) { #for NFS just fix permissions chmod 0664, "$envdir/$filename.png"; chmod 0775, "$envdir"; } else { #for HDFS -put the file my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop"; $ENV{HADOOP_HEAPSIZE}=300; system($hadoop, "dfs", "-put", "/tmp/${filename}.png", "$envdir/${filename}.png") and unlink "/tmp/${filename}.png"; } If -output option to streaming specifies an NFS directory, everything works except it doesn't scale. We must use mapred_output_dir environment because it points to the temporary directory and you don't want 2 or more instances of the same tasks writing to the same file.
If -output points to HDFS, however, the code above bombs while trying to -put a file with an error something like "couldn't not reserve enough memory for java vm heap/libs" at which point Java dies. If anyone has any suggestions on how to fix that, I'd appreciate it. Thanks, -Yuri On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote: > Hi, I am using Hadoop streaming and I am trying to create a MapReduce that > will generate output where a single key is found in a single output part > file. > Does anyone know how to ensure this condition? I want the reduce task (no > matter how many are specified), to only receive > key-value output from a single key each, process the key-value pairs for > this key, write an output part-XXX file, and only > then process the next key. > > Here is the task that I am trying to accomplish: > > Input: Corpus T (lines of text), Corpus V (each line has 1 word) > Output: Each part-XXX should contain the lines of T that contain the word > from line XXX in V. > > Any help/ideas are appreciated. > > Ashish