Thanks Yuri! I followed your pattern here, and the version where you make the system call directly to -put onto DFS works for me. I did not set $ENV{HADOOP_HEAPSIZE}=300; and it seems to work fine (I didn't try setting this variable to see if it failed). I also used Perl's built-in File::Temp mechanism to avoid worrying about manually deleting the temp file.
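
For concreteness, here is roughly what that combination looks like (just a sketch; $image_data and $filename are placeholders for whatever your reducer actually produces, and $envdir is mapred_output_dir as in your code):

    use File::Temp qw(tempfile);

    # tempfile() registers the file for automatic deletion when the
    # program exits (UNLINK => 1), so no manual unlink is needed.
    my ($fh, $tmpname) = tempfile(SUFFIX => '.png', UNLINK => 1);
    print $fh $image_data;    # placeholder for the real output
    close($fh) or die "close $tmpname: $!\n";

    # -put the temp file onto DFS; $envdir comes from mapred_output_dir
    my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
    system($hadoop, "dfs", "-put", $tmpname, "$envdir/${filename}.png") == 0
        or die "dfs -put failed: $?\n";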
Thanks!
Ashish

On Thu, Apr 3, 2008 at 12:07 PM, Yuri Pradkin <[EMAIL PROTECTED]> wrote:
> Here is how we (attempt to) do it:
>
> The reducer (in streaming) writes one file for each different key it
> receives as input. Here's some example code in Perl:
>
>     my $envdir = $ENV{'mapred_output_dir'};
>     my $fs = ($envdir =~ s/^file://);
>     if ($fs) {
>         # output goes onto NFS
>         open(FILEOUT, ">$envdir/${filename}.png")
>             or die "$0: cannot open $envdir/${filename}.png: $!\n";
>     } else {
>         # output specifies DFS
>         open(FILEOUT, ">/tmp/${filename}.png")
>             or die "Cannot open /tmp/${filename}.png: $!\n";  # or pipe to dfs -put
>     }
>     ...  # write FILEOUT
>     if ($fs) {
>         # for NFS, just fix permissions
>         chmod 0664, "$envdir/${filename}.png";
>         chmod 0775, "$envdir";
>     } else {
>         # for HDFS, -put the file; system() returns 0 on success,
>         # so unlink the temp copy only when the put succeeded
>         my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
>         $ENV{HADOOP_HEAPSIZE} = 300;
>         system($hadoop, "dfs", "-put", "/tmp/${filename}.png",
>                "$envdir/${filename}.png") == 0
>             and unlink "/tmp/${filename}.png";
>     }
>
> If the -output option to streaming specifies an NFS directory, everything
> works, except that it doesn't scale. We must use the mapred_output_dir
> environment variable because it points to the task's temporary directory,
> and you don't want two or more instances of the same task writing to the
> same file.
>
> If -output points to HDFS, however, the code above bombs while trying to
> -put a file, with an error something like "could not reserve enough memory
> for java vm heap/libs", at which point Java dies. If anyone has any
> suggestions on how to fix that, I'd appreciate it.
>
> Thanks,
>
>   -Yuri
>
> On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote:
> > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> > that will generate output where a single key is found in a single
> > output part file. Does anyone know how to ensure this condition? I want
> > each reduce task (no matter how many are specified) to receive the
> > key-value output for a single key at a time, process the key-value
> > pairs for that key, write an output part-XXX file, and only then
> > process the next key.
> >
> > Here is the task that I am trying to accomplish:
> >
> > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > Output: Each part-XXX should contain the lines of T that contain the
> > word from line XXX in V.
> >
> > Any help/ideas are appreciated.
> >
> > Ashish