Re: one key per output part file
Thanks Yuri! I followed your pattern here, and the version where you make the
system call directly to -put onto DFS works for me. I did not set
$ENV{HADOOP_HEAPSIZE}=300; and it seems to work fine (I didn't try setting
this variable to see if it failed). I also used Perl's built-in File::Temp
mechanism to avoid worrying about manually deleting the temp file.

Thanks!

Ashish

On Thu, Apr 3, 2008 at 12:07 PM, Yuri Pradkin <[EMAIL PROTECTED]> wrote:
> Here is how we (attempt to) do it:
>
> Reducer (in streaming) writes one file for each different key it receives
> as input. Here's some example code in perl:
>
>     my $envdir = $ENV{'mapred_output_dir'};
>     my $fs = ($envdir =~ s/^file://);
>     if ($fs) {
>         # output goes onto NFS
>         open(FILEOUT, ">$envdir/${filename}.png")
>             or die "$0: cannot open $envdir/${filename}.png: $!\n";
>     } else {
>         # output specifies DFS
>         open(FILEOUT, ">/tmp/${filename}.png")
>             or die "Cannot open /tmp/${filename}.png: $!\n";  # or pipe to dfs -put
>     }
>     ...  # write FILEOUT
>     if ($fs) {
>         # for NFS just fix permissions
>         chmod 0664, "$envdir/$filename.png";
>         chmod 0775, "$envdir";
>     } else {
>         # for HDFS, -put the file; system() returns 0 on success
>         my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
>         $ENV{HADOOP_HEAPSIZE} = 300;
>         system($hadoop, "dfs", "-put", "/tmp/${filename}.png",
>                "$envdir/${filename}.png") == 0
>             and unlink "/tmp/${filename}.png";
>     }
>
> If the -output option to streaming specifies an NFS directory, everything
> works, except it doesn't scale. We must use the mapred_output_dir
> environment variable because it points to the task's temporary directory,
> and you don't want 2 or more instances of the same task writing to the
> same file.
>
> If -output points to HDFS, however, the code above bombs while trying to
> -put a file, with an error something like "could not reserve enough memory
> for java vm heap/libs", at which point Java dies. If anyone has any
> suggestions on how to fix that, I'd appreciate it.
>
> Thanks,
>
>   -Yuri
>
> On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote:
> > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> > that will generate output where a single key is found in a single
> > output part file.
> > Does anyone know how to ensure this condition? I want each reduce task
> > (no matter how many are specified) to receive the key-value output for
> > a single key at a time: process the key-value pairs for that key, write
> > an output part-XXX file, and only then process the next key.
> >
> > Here is the task that I am trying to accomplish:
> >
> > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > Output: Each part-XXX should contain the lines of T that contain the
> > word from line XXX in V.
> >
> > Any help/ideas are appreciated.
> >
> > Ashish
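[Editor's sketch] A minimal version of the File::Temp variant Ashish
describes, under the same mapred_output_dir/HADOOP_HOME environment as
Yuri's code; $filename and the written data are illustrative stand-ins:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    my $filename = "examplekey";               # illustrative; derived from the key in practice
    my $data     = "...";                      # whatever the reducer produced
    my $envdir   = $ENV{'mapred_output_dir'};  # task-specific output dir
    my $hadoop   = "$ENV{HADOOP_HOME}/bin/hadoop";

    # UNLINK => 1 makes File::Temp delete the temp file automatically at
    # program exit, so no manual cleanup whether -put succeeds or fails.
    my ($fh, $tmpname) = tempfile(SUFFIX => ".png", UNLINK => 1);
    print $fh $data;
    close($fh) or die "close $tmpname: $!\n";

    # system() returns 0 on success
    system($hadoop, "dfs", "-put", $tmpname, "$envdir/${filename}.png") == 0
        or die "dfs -put failed for $tmpname\n";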
Re: one key per output part file
Here is how we (attempt to) do it:

Reducer (in streaming) writes one file for each different key it receives
as input. Here's some example code in perl:

    my $envdir = $ENV{'mapred_output_dir'};
    my $fs = ($envdir =~ s/^file://);
    if ($fs) {
        # output goes onto NFS
        open(FILEOUT, ">$envdir/${filename}.png")
            or die "$0: cannot open $envdir/${filename}.png: $!\n";
    } else {
        # output specifies DFS
        open(FILEOUT, ">/tmp/${filename}.png")
            or die "Cannot open /tmp/${filename}.png: $!\n";  # or pipe to dfs -put
    }
    ...  # write FILEOUT
    if ($fs) {
        # for NFS just fix permissions
        chmod 0664, "$envdir/$filename.png";
        chmod 0775, "$envdir";
    } else {
        # for HDFS, -put the file; system() returns 0 on success
        my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
        $ENV{HADOOP_HEAPSIZE} = 300;
        system($hadoop, "dfs", "-put", "/tmp/${filename}.png",
               "$envdir/${filename}.png") == 0
            and unlink "/tmp/${filename}.png";
    }

If the -output option to streaming specifies an NFS directory, everything
works, except it doesn't scale. We must use the mapred_output_dir
environment variable because it points to the task's temporary directory,
and you don't want 2 or more instances of the same task writing to the
same file.

If -output points to HDFS, however, the code above bombs while trying to
-put a file, with an error something like "could not reserve enough memory
for java vm heap/libs", at which point Java dies. If anyone has any
suggestions on how to fix that, I'd appreciate it.

Thanks,

  -Yuri

On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote:
> Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> that will generate output where a single key is found in a single output
> part file.
> Does anyone know how to ensure this condition? I want each reduce task
> (no matter how many are specified) to receive the key-value output for a
> single key at a time: process the key-value pairs for that key, write an
> output part-XXX file, and only then process the next key.
>
> Here is the task that I am trying to accomplish:
>
> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> Output: Each part-XXX should contain the lines of T that contain the
> word from line XXX in V.
>
> Any help/ideas are appreciated.
>
> Ashish
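[Editor's sketch] Yuri's "# or pipe to dfs -put" comment points at a variant
that avoids the local temp file entirely. A sketch, assuming your Hadoop
version's -put accepts "-" to mean "read from stdin" (later releases do;
check yours); $filename and $data are illustrative:

    use strict;
    use warnings;

    my $filename = "examplekey";               # illustrative
    my $data     = "...";                      # the bytes to store
    my $envdir   = $ENV{'mapred_output_dir'};
    my $hadoop   = "$ENV{HADOOP_HOME}/bin/hadoop";

    # open a pipe into "hadoop dfs -put - <dst>"; the list form of open
    # bypasses the shell, so no quoting worries in $envdir
    open(my $out, "|-", $hadoop, "dfs", "-put", "-", "$envdir/${filename}.png")
        or die "cannot spawn $hadoop: $!\n";
    print $out $data;
    close($out) or die "dfs -put exited with status $?\n";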
Re: one key per output part file
Thanks for this information - I might be missing something here, but can my
perl script reducer (which is run via streaming, and is not linked to HDFS
libraries) just start writing to HDFS? I thought I would have to write it
locally, i.e. in "." for the reduce script, and then rely on the MapReduce
mechanism to promote the file into the output directory...

Thanks for all the help!

Ashish

On Wed, Apr 2, 2008 at 11:22 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> Writing to HDFS leaves the files as accessible as anything else, if not
> more so.
>
> You can retrieve a file using a URL of the form:
>
> http://<namenode:port>/data/<path>
>
> Similarly, you can list a directory using a similar URL (whose details I
> forget for the nonce).
>
> On 4/2/08 7:57 AM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>
> > On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]>
> > wrote:
> >
> > > curious - why do we need a file per XXX?
> > >
> > > - if the data needs to be exported (either to a sql db or an external
> > > file system) - then why not do so directly from the reducer (instead
> > > of trying to create these intermediate small files in hdfs)? data can
> > > be written to tmp tables/files and can be overwritten in case the
> > > reducer re-runs (and then committed to the final location once the
> > > job is complete)
> >
> > The second case (data needs to be exported) is the reason that I have.
> > Each of these small files is used in an external process. This seems
> > like a good solution - the only question then is where can these files
> > be written to safely? Local directory? /tmp?
> >
> > Ashish
> >
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
> > > Sent: Tue 4/1/2008 6:42 PM
> > > To: core-user@hadoop.apache.org
> > > Subject: Re: one key per output part file
> > >
> > > This seems like a reasonable solution - but I am using Hadoop
> > > streaming, and my reducer is a perl script. Is it possible to handle
> > > side-effect files in streaming? I haven't found anything that
> > > indicates that you can...
> > >
> > > Ashish
> > >
> > > On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > >
> > > > Try opening the desired output file in the reduce method. Make sure
> > > > that the output files are relative to the correct task-specific
> > > > directory (look for side-effect files on the wiki).
> > > >
> > > > On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi, I am using Hadoop streaming and I am trying to create a
> > > > > MapReduce that will generate output where a single key is found
> > > > > in a single output part file.
> > > > > Does anyone know how to ensure this condition? I want each reduce
> > > > > task (no matter how many are specified) to receive the key-value
> > > > > output for a single key at a time: process the key-value pairs
> > > > > for that key, write an output part-XXX file, and only then
> > > > > process the next key.
> > > > >
> > > > > Here is the task that I am trying to accomplish:
> > > > >
> > > > > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > > > > Output: Each part-XXX should contain the lines of T that contain
> > > > > the word from line XXX in V.
> > > > >
> > > > > Any help/ideas are appreciated.
> > > > >
> > > > > Ashish
Re: one key per output part file
Writing to HDFS leaves the files as accessible as anything else, if not
more so.

You can retrieve a file using a URL of the form:

http://<namenode:port>/data/<path>

Similarly, you can list a directory using a similar URL (whose details I
forget for the nonce).

On 4/2/08 7:57 AM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:

> On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]>
> wrote:
>
> > curious - why do we need a file per XXX?
> >
> > - if the data needs to be exported (either to a sql db or an external
> > file system) - then why not do so directly from the reducer (instead of
> > trying to create these intermediate small files in hdfs)? data can be
> > written to tmp tables/files and can be overwritten in case the reducer
> > re-runs (and then committed to the final location once the job is
> > complete)
>
> The second case (data needs to be exported) is the reason that I have.
> Each of these small files is used in an external process. This seems like
> a good solution - the only question then is where can these files be
> written to safely? Local directory? /tmp?
>
> Ashish
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
> > Sent: Tue 4/1/2008 6:42 PM
> > To: core-user@hadoop.apache.org
> > Subject: Re: one key per output part file
> >
> > This seems like a reasonable solution - but I am using Hadoop
> > streaming, and my reducer is a perl script. Is it possible to handle
> > side-effect files in streaming? I haven't found anything that indicates
> > that you can...
> >
> > Ashish
> >
> > On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > > Try opening the desired output file in the reduce method. Make sure
> > > that the output files are relative to the correct task-specific
> > > directory (look for side-effect files on the wiki).
> > >
> > > On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi, I am using Hadoop streaming and I am trying to create a
> > > > MapReduce that will generate output where a single key is found in
> > > > a single output part file.
> > > > Does anyone know how to ensure this condition? I want each reduce
> > > > task (no matter how many are specified) to receive the key-value
> > > > output for a single key at a time: process the key-value pairs for
> > > > that key, write an output part-XXX file, and only then process the
> > > > next key.
> > > >
> > > > Here is the task that I am trying to accomplish:
> > > >
> > > > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > > > Output: Each part-XXX should contain the lines of T that contain
> > > > the word from line XXX in V.
> > > >
> > > > Any help/ideas are appreciated.
> > > >
> > > > Ashish
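[Editor's sketch] One way to do the HTTP retrieval Ted mentions, using
LWP::Simple; the host, port (50070 was the usual namenode web port in that
era), and path are illustrative assumptions:

    use strict;
    use warnings;
    use LWP::Simple qw(getstore);

    # the namenode's /data servlet serves file contents, redirecting to a
    # datanode for the actual bytes; getstore follows the redirect
    my $url = "http://namenode.example.com:50070/data/user/ashish/out/part-00000";
    my $rc  = getstore($url, "part-00000");   # returns the HTTP status code
    die "fetch failed: HTTP $rc\n" unless $rc == 200;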
Re: one key per output part file
On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]> wrote:

> curious - why do we need a file per XXX?
>
> - if the data needs to be exported (either to a sql db or an external
> file system) - then why not do so directly from the reducer (instead of
> trying to create these intermediate small files in hdfs)? data can be
> written to tmp tables/files and can be overwritten in case the reducer
> re-runs (and then committed to the final location once the job is
> complete)

The second case (data needs to be exported) is the reason that I have. Each
of these small files is used in an external process. This seems like a good
solution - the only question then is where can these files be written to
safely? Local directory? /tmp?

Ashish

> -----Original Message-----
> From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
> Sent: Tue 4/1/2008 6:42 PM
> To: core-user@hadoop.apache.org
> Subject: Re: one key per output part file
>
> This seems like a reasonable solution - but I am using Hadoop streaming,
> and my reducer is a perl script. Is it possible to handle side-effect
> files in streaming? I haven't found anything that indicates that you
> can...
>
> Ashish
>
> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> > Try opening the desired output file in the reduce method. Make sure
> > that the output files are relative to the correct task-specific
> > directory (look for side-effect files on the wiki).
> >
> > On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> > > that will generate output where a single key is found in a single
> > > output part file.
> > > Does anyone know how to ensure this condition? I want each reduce
> > > task (no matter how many are specified) to receive the key-value
> > > output for a single key at a time: process the key-value pairs for
> > > that key, write an output part-XXX file, and only then process the
> > > next key.
> > >
> > > Here is the task that I am trying to accomplish:
> > >
> > > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > > Output: Each part-XXX should contain the lines of T that contain the
> > > word from line XXX in V.
> > >
> > > Any help/ideas are appreciated.
> > >
> > > Ashish
RE: one key per output part file
curious - why do we need a file per XXX?

- if further processing is going to be done in hadoop itself - then it's
hard to see a reason. One can always have multiple entries in the same hdfs
file. note that it's possible to align map task splits on sort key
boundaries in pre-sorted data (it's not something that hadoop supports
natively right now - but you can write your own InputFormat to do this).
meaning - that subsequent processing that wants all entries corresponding
to XXX in one group (as in a reducer) can do so in the map phase itself
(i.e. - it's damned cheap and doesn't require sorting data all over again).

- if the data needs to be exported (either to a sql db or an external file
system) - then why not do so directly from the reducer (instead of trying
to create these intermediate small files in hdfs)? data can be written to
tmp tables/files and can be overwritten in case the reducer re-runs (and
then committed to the final location once the job is complete)

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
Sent: Tue 4/1/2008 6:42 PM
To: core-user@hadoop.apache.org
Subject: Re: one key per output part file

This seems like a reasonable solution - but I am using Hadoop streaming,
and my reducer is a perl script. Is it possible to handle side-effect files
in streaming? I haven't found anything that indicates that you can...

Ashish

On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Try opening the desired output file in the reduce method. Make sure that
> the output files are relative to the correct task-specific directory
> (look for side-effect files on the wiki).
>
> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>
> > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> > that will generate output where a single key is found in a single
> > output part file.
> > Does anyone know how to ensure this condition? I want each reduce task
> > (no matter how many are specified) to receive the key-value output for
> > a single key at a time: process the key-value pairs for that key, write
> > an output part-XXX file, and only then process the next key.
> >
> > Here is the task that I am trying to accomplish:
> >
> > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > Output: Each part-XXX should contain the lines of T that contain the
> > word from line XXX in V.
> >
> > Any help/ideas are appreciated.
> >
> > Ashish
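[Editor's sketch] The write-then-commit pattern Joydeep describes, done from
a Perl streaming reducer. Streaming exports jobconf values such as
mapred.task.id into the environment with dots replaced by underscores; the
/export paths are illustrative, and the final rename happens once, after the
whole job succeeds, not in the reducer itself:

    use strict;
    use warnings;

    # name the file after the task attempt so a re-run of the same reducer
    # overwrites its own partial output instead of clobbering another's
    my $attempt = $ENV{'mapred_task_id'}
        or die "mapred_task_id not set - not running under streaming?\n";
    my $tmp = "/export/tmp/result.$attempt";

    open(my $out, ">", $tmp) or die "cannot open $tmp: $!\n";
    print $out "...exported records...\n";
    close($out) or die "close $tmp: $!\n";

    # later, in the job driver, after the job reports success:
    #   rename("/export/tmp/result.$winning_attempt", "/export/final/result")
    #       or die "commit failed: $!\n";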
Re: one key per output part file
No. That is a limitation of streaming.

On 4/1/08 6:42 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:

> This seems like a reasonable solution - but I am using Hadoop streaming,
> and my reducer is a perl script. Is it possible to handle side-effect
> files in streaming? I haven't found anything that indicates that you
> can...
>
> Ashish
>
> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> > Try opening the desired output file in the reduce method. Make sure
> > that the output files are relative to the correct task-specific
> > directory (look for side-effect files on the wiki).
> >
> > On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> > > that will generate output where a single key is found in a single
> > > output part file.
> > > Does anyone know how to ensure this condition? I want each reduce
> > > task (no matter how many are specified) to receive the key-value
> > > output for a single key at a time: process the key-value pairs for
> > > that key, write an output part-XXX file, and only then process the
> > > next key.
> > >
> > > Here is the task that I am trying to accomplish:
> > >
> > > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > > Output: Each part-XXX should contain the lines of T that contain the
> > > word from line XXX in V.
> > >
> > > Any help/ideas are appreciated.
> > >
> > > Ashish
Re: one key per output part file
This seems like a reasonable solution - but I am using Hadoop streaming,
and my reducer is a perl script. Is it possible to handle side-effect files
in streaming? I haven't found anything that indicates that you can...

Ashish

On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Try opening the desired output file in the reduce method. Make sure that
> the output files are relative to the correct task-specific directory
> (look for side-effect files on the wiki).
>
> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>
> > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> > that will generate output where a single key is found in a single
> > output part file.
> > Does anyone know how to ensure this condition? I want each reduce task
> > (no matter how many are specified) to receive the key-value output for
> > a single key at a time: process the key-value pairs for that key, write
> > an output part-XXX file, and only then process the next key.
> >
> > Here is the task that I am trying to accomplish:
> >
> > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > Output: Each part-XXX should contain the lines of T that contain the
> > word from line XXX in V.
> >
> > Any help/ideas are appreciated.
> >
> > Ashish
Re: one key per output part file
Try opening the desired output file in the reduce method. Make sure that
the output files are relative to the correct task-specific directory (look
for side-effect files on the wiki).

On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:

> Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> that will generate output where a single key is found in a single output
> part file.
> Does anyone know how to ensure this condition? I want each reduce task
> (no matter how many are specified) to receive the key-value output for a
> single key at a time: process the key-value pairs for that key, write an
> output part-XXX file, and only then process the next key.
>
> Here is the task that I am trying to accomplish:
>
> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> Output: Each part-XXX should contain the lines of T that contain the
> word from line XXX in V.
>
> Any help/ideas are appreciated.
>
> Ashish
one key per output part file
Hi, I am using Hadoop streaming and I am trying to create a MapReduce that
will generate output where a single key is found in a single output part
file.

Does anyone know how to ensure this condition? I want each reduce task (no
matter how many are specified) to receive the key-value output for a single
key at a time: process the key-value pairs for that key, write an output
part-XXX file, and only then process the next key.

Here is the task that I am trying to accomplish:

Input: Corpus T (lines of text), Corpus V (each line has 1 word)
Output: Each part-XXX should contain the lines of T that contain the word
from line XXX in V.

Any help/ideas are appreciated.

Ashish
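[Editor's sketch] For reference, a minimal skeleton of the per-key reducer
the thread converges on (Yuri's post above fleshes it out): streaming hands
the reducer sorted key<TAB>value lines, so a new output file is opened
whenever the key changes. It assumes NFS-style output as in Yuri's $fs
branch and keys that are safe to use as file names:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $envdir = $ENV{'mapred_output_dir'};   # task-specific output dir
    my ($prev, $fh);

    while (my $line = <STDIN>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        $value = "" unless defined $value;    # key-only lines
        if (!defined $prev or $key ne $prev) {
            close $fh if $fh;                 # finished the previous key
            # one output file per distinct key
            open($fh, ">", "$envdir/$key.txt")
                or die "cannot open $envdir/$key.txt: $!\n";
            $prev = $key;
        }
        print $fh "$value\n";                 # all values for one key, one file
    }
    close $fh if $fh;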