Thanks Yuri! I followed your pattern here, and the version where you make the system call directly to -put onto DFS works for me. I did not set $ENV{HADOOP_HEAPSIZE}=300; and it seems to work fine (I didn't try setting this variable to see if it failed). I also used Perl's built-in File::Temp mechanism to avoid worrying about manually deleting the temp file.
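
For concreteness, here is roughly what that combination looks like (just a sketch; $image_data and $filename are placeholders for whatever your reducer actually produces, and $envdir is mapred_output_dir as in your code):

    use File::Temp qw(tempfile);

    # tempfile() registers the file for automatic deletion when the
    # program exits (UNLINK => 1), so no manual unlink is needed.
    my ($fh, $tmpname) = tempfile(SUFFIX => '.png', UNLINK => 1);
    print $fh $image_data;    # placeholder for the real output
    close($fh) or die "close $tmpname: $!\n";

    # -put the temp file onto DFS; $envdir comes from mapred_output_dir
    my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
    system($hadoop, "dfs", "-put", $tmpname, "$envdir/${filename}.png") == 0
        or die "dfs -put failed: $?\n";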
Thanks!
Ashish

On Thu, Apr 3, 2008 at 12:07 PM, Yuri Pradkin <[EMAIL PROTECTED]> wrote:
> Here is how we (attempt to) do it:
>
> The reducer (in streaming) writes one file for each different key it
> receives as input. Here's some example code in Perl:
>
>     my $envdir = $ENV{'mapred_output_dir'};
>     my $fs = ($envdir =~ s/^file://);
>     if ($fs) {
>         # output goes onto NFS
>         open(FILEOUT, ">$envdir/${filename}.png")
>             or die "$0: cannot open $envdir/${filename}.png: $!\n";
>     } else {
>         # output specifies DFS
>         open(FILEOUT, ">/tmp/${filename}.png")
>             or die "Cannot open /tmp/${filename}.png: $!\n";  # or pipe to dfs -put
>     }
>     ...  # write FILEOUT
>     if ($fs) {
>         # for NFS, just fix permissions
>         chmod 0664, "$envdir/${filename}.png";
>         chmod 0775, "$envdir";
>     } else {
>         # for HDFS, -put the file; system() returns 0 on success,
>         # so unlink the temp copy only when the put succeeded
>         my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
>         $ENV{HADOOP_HEAPSIZE} = 300;
>         system($hadoop, "dfs", "-put", "/tmp/${filename}.png",
>                "$envdir/${filename}.png") == 0
>             and unlink "/tmp/${filename}.png";
>     }
>
> If the -output option to streaming specifies an NFS directory, everything
> works, except that it doesn't scale. We must use the mapred_output_dir
> environment variable because it points to the task's temporary directory,
> and you don't want two or more instances of the same task writing to the
> same file.
>
> If -output points to HDFS, however, the code above bombs while trying to
> -put a file, with an error something like "could not reserve enough memory
> for java vm heap/libs", at which point Java dies. If anyone has any
> suggestions on how to fix that, I'd appreciate it.
>
> Thanks,
>
>   -Yuri
>
> On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote:
> > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> > that will generate output where a single key is found in a single
> > output part file. Does anyone know how to ensure this condition? I want
> > each reduce task (no matter how many are specified) to receive the
> > key-value output for a single key at a time, process the key-value
> > pairs for that key, write an output part-XXX file, and only then
> > process the next key.
> >
> > Here is the task that I am trying to accomplish:
> >
> > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > Output: Each part-XXX should contain the lines of T that contain the
> > word from line XXX in V.
> >
> > Any help/ideas are appreciated.
> >
> > Ashish