Re: one key per output part file

2008-04-03 Thread Ashish Venugopal
Thanks Yuri! I followed your pattern here, and the version where you make the
system call directly to -put onto DFS works for me. I did not set
$ENV{HADOOP_HEAPSIZE}=300;
and it seems to work fine (I didn't try setting this variable to see if it
would fail).
I also used Perl's built-in File::Temp module to avoid worrying about
manually deleting the temp file.
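Roughly, the combination looks like this (just a sketch; the filename, image
data, and output path are placeholders, not our actual code):

use strict;
use warnings;
use File::Temp qw(tempfile);

# Sketch: File::Temp for the local staging file, then a direct system()
# call to -put.  $filename and $png_data are placeholders for whatever
# the reducer computes per key.
my $envdir   = $ENV{'mapred_output_dir'};
my $hadoop   = "$ENV{HADOOP_HOME}/bin/hadoop";
my $filename = "some_key";        # placeholder: one output file per key
my $png_data = "...";             # placeholder: the bytes for this key

# tempfile() hands back a unique local file and, with UNLINK => 1,
# deletes it automatically when the process exits.
my ($fh, $tmpname) = tempfile("reduceXXXXXX", SUFFIX => ".png",
                              TMPDIR => 1, UNLINK => 1);
print $fh $png_data;
close($fh) or die "close $tmpname: $!\n";

system($hadoop, "dfs", "-put", $tmpname, "$envdir/$filename.png") == 0
    or die "dfs -put $tmpname failed: $?\n";
# no manual unlink needed -- File::Temp cleans up $tmpname for us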

Thanks!

Ashish


On Thu, Apr 3, 2008 at 12:07 PM, Yuri Pradkin <[EMAIL PROTECTED]> wrote:

> Here is how we (attempt to) do it:
>
> Reducer (in streaming) writes one file for each different key it receives
> as input.
> Here's some example code in perl:
>    my $envdir = $ENV{'mapred_output_dir'};
>    my $fs = ($envdir =~ s/^file://);
>    if ($fs) {
>        # output goes onto NFS
>        open(FILEOUT, ">$envdir/${filename}.png")
>            or die "$0: cannot open $envdir/${filename}.png: $!\n";
>    } else {
>        # output specifies DFS
>        open(FILEOUT, ">/tmp/${filename}.png")
>            or die "Cannot open /tmp/${filename}.png: $!\n";  # or pipe to dfs -put
>    }
>    ... # write FILEOUT
>    if ($fs) {
>        # for NFS just fix permissions
>        chmod 0664, "$envdir/$filename.png";
>        chmod 0775, "$envdir";
>    } else {
>        # for HDFS, -put the file
>        my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
>        $ENV{HADOOP_HEAPSIZE} = 300;
>        # system() returns 0 on success, so unlink the local copy
>        # only once the -put has succeeded
>        system($hadoop, "dfs", "-put", "/tmp/${filename}.png",
>               "$envdir/${filename}.png") == 0
>            and unlink "/tmp/${filename}.png";
>    }
>
> If -output option to streaming specifies an NFS directory, everything
> works except
> it doesn't scale.  We must use mapred_output_dir environment because it
> points to
> the temporary directory and you don't want 2 or more instances of the same
> tasks writing
> to the same file.
>
> If -output points to HDFS, however, the code above bombs while trying to
> -put a file
> with an error something like "couldn't not reserve enough memory for java
> vm heap/libs"
> at which point Java dies.  If anyone has any suggestions on how to fix
> that, I'd
> appreciate it.
>
> Thanks,
>
>  -Yuri
>
> On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote:
> > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> that
> > will generate output where a single key is found in a single output part
> > file.
> > Does anyone know how to ensure this condition? I want the reduce task
> (no
> > matter how many are specified), to only receive
> > key-value output from a single key each, process the key-value pairs for
> > this key, write an output part-XXX file, and only
> > then process the next key.
> >
> > Here is the task that I am trying to accomplish:
> >
> > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > Output: Each part-XXX should contain the lines of T that contain the
> word
> > from line XXX in V.
> >
> > Any help/ideas are appreciated.
> >
> > Ashish
>
>
>


Re: one key per output part file

2008-04-03 Thread Yuri Pradkin
Here is how we (attempt to) do it:

Reducer (in streaming) writes one file for each different key it receives as 
input.  
Here's some example code in perl:
my $envdir = $ENV{'mapred_output_dir'};
my $fs = ($envdir =~ s/^file://);
if ($fs) {
    # output goes onto NFS
    open(FILEOUT, ">$envdir/${filename}.png")
        or die "$0: cannot open $envdir/${filename}.png: $!\n";
} else {
    # output specifies DFS
    open(FILEOUT, ">/tmp/${filename}.png")
        or die "Cannot open /tmp/${filename}.png: $!\n";  # or pipe to dfs -put
}
... # write FILEOUT
if ($fs) {
    # for NFS just fix permissions
    chmod 0664, "$envdir/$filename.png";
    chmod 0775, "$envdir";
} else {
    # for HDFS, -put the file
    my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
    $ENV{HADOOP_HEAPSIZE} = 300;
    # system() returns 0 on success, so unlink the local copy
    # only once the -put has succeeded
    system($hadoop, "dfs", "-put", "/tmp/${filename}.png",
           "$envdir/${filename}.png") == 0
        and unlink "/tmp/${filename}.png";
}
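
(The "pipe to dfs -put" alternative mentioned in that comment would look
roughly like this - just a sketch, and it assumes this version's -put accepts
"-" to mean "read from stdin":)

# Sketch: stream one key's output straight into HDFS via "dfs -put -",
# skipping the local staging file entirely.  Assumes -put treats "-" as
# standard input; $filename and $envdir are as in the code above.
my $hadoop = $ENV{HADOOP_HOME} . "/bin/hadoop";
open(my $put, "|-", $hadoop, "dfs", "-put", "-", "$envdir/${filename}.png")
    or die "cannot start dfs -put: $!\n";
print $put $data_for_this_key;   # placeholder for the bytes to write
close($put) or die "dfs -put to $envdir/${filename}.png failed: exit $?\n";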
   
If the -output option to streaming specifies an NFS directory, everything
works, except that it doesn't scale.  We must use the mapred_output_dir
environment variable because it points to the task's temporary directory,
and you don't want two or more instances of the same task writing
to the same file.
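
(A related wrinkle: with -output on HDFS, two attempts of the same task
running on one machine would both stage to /tmp/${filename}.png.  One way
around that - a sketch, assuming the streaming environment also exports
mapred_task_id, which is worth verifying on your version:)

# Sketch: make the local staging file unique per task attempt, so a
# speculative or re-run attempt on the same node can't clobber it.
# Falls back to the process id if mapred_task_id isn't set.
my $attempt = $ENV{'mapred_task_id'} || $$;
my $tmpfile = "/tmp/${filename}.${attempt}.png";
open(FILEOUT, ">$tmpfile") or die "Cannot open $tmpfile: $!\n";
# ... write FILEOUT, then -put $tmpfile to "$envdir/${filename}.png" as above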

If -output points to HDFS, however, the code above bombs while trying to -put
a file, with an error something like "could not reserve enough memory for
java vm heap/libs", at which point Java dies.  If anyone has any suggestions
on how to fix that, I'd appreciate it.

Thanks,

 -Yuri
 
On Tuesday 01 April 2008 05:57:31 pm Ashish Venugopal wrote:
> Hi, I am using Hadoop streaming and I am trying to create a MapReduce that
> will generate output where a single key is found in a single output part
> file.
> Does anyone know how to ensure this condition? I want the reduce task (no
> matter how many are specified), to only receive
> key-value output from a single key each, process the key-value pairs for
> this key, write an output part-XXX file, and only
> then process the next key.
>
> Here is the task that I am trying to accomplish:
>
> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> Output: Each part-XXX should contain the lines of T that contain the word
> from line XXX in V.
>
> Any help/ideas are appreciated.
>
> Ashish




Re: one key per output part file

2008-04-02 Thread Ashish Venugopal
Thanks for this information - I might be missing something here, but can my
Perl reducer script (which is run via streaming and is not linked against the
HDFS libraries) just start writing to HDFS?
I thought I would have to write the file locally, i.e. in "." for the reduce
script, and then rely on the MapReduce mechanism to promote the file into the
output directory...
Thanks for all the help!

Ashish



On Wed, Apr 2, 2008 at 11:22 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
>
> Writing to HDFS leaves the files as accessible as anything else, if not
> more
> so.
>
> You can retrieve a file using a URL of the form:
>
>  http:///data/
>
> Similarly, you can list a directory using a similar URL (whose details I
> forget for the nonce).
>
> On 4/2/08 7:57 AM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>
> > On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]>
> > wrote:
> >
> >> curious - why do we need a file per XXX?
> >>
> >> - if the data needs to be exported (either to a sql db or an external
> file
> >> system) - then why not do so directly from the reducer (instead of
> trying to
> >> create these intermediate small files in hdfs)? data can be written to
> tmp
> >> tables/files and can be overwritten in case the reducer re-runs (and
> then
> >> committed to final location once the job is complete)
> >>
> >
> > The second case (data needs to be exported) is the reason that I have.
> Each
> > of these small files is used in an external process. This seems like a
> good
> > solution - only question then is where can these files be written to
> safely?
> > Local directory? /tmp?
> >
> > Ashish
> >
> >
> >
> >>
> >>
> >>
> >> -Original Message-
> >> From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
> >> Sent: Tue 4/1/2008 6:42 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: Re: one key per output part file
> >>
> >> This seems like a reasonable solution - but I am using Hadoop streaming
> >> and
> >> byreducer is a perl script. Is it possible to handle side-effect files
> in
> >> streaming? I havent found
> >> anything that indicates that you can...
> >>
> >> Ashish
> >>
> >> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >>
> >>>
> >>>
> >>> Try opening the desired output file in the reduce method.  Make sure
> >> that
> >>> the output files are relative to the correct task specific directory
> >> (look
> >>> for side-effect files on the wiki).
> >>>
> >>>
> >>>
> >>> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> >>> that
> >>>> will generate output where a single key is found in a single output
> >> part
> >>>> file.
> >>>> Does anyone know how to ensure this condition? I want the reduce task
> >>> (no
> >>>> matter how many are specified), to only receive
> >>>> key-value output from a single key each, process the key-value pairs
> >> for
> >>>> this key, write an output part-XXX file, and only
> >>>> then process the next key.
> >>>>
> >>>> Here is the task that I am trying to accomplish:
> >>>>
> >>>> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> >>>> Output: Each part-XXX should contain the lines of T that contain the
> >>> word
> >>>> from line XXX in V.
> >>>>
> >>>> Any help/ideas are appreciated.
> >>>>
> >>>> Ashish
> >>>
> >>>
> >>
> >>
>
>


Re: one key per output part file

2008-04-02 Thread Ted Dunning


Writing to HDFS leaves the files as accessible as anything else, if not more
so.

You can retrieve a file using a URL of the form:

  http:///data/

Similarly, you can list a directory using a similar URL (whose details I
forget for the nonce).
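
Something along these lines, for illustration (the namenode host, port, and
file path below are placeholders, not the real details of that URL form):

# Sketch only: pull a reducer-written file back out of HDFS over the
# namenode's HTTP interface.  The host, port, and path are placeholders.
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

my $url = "http://namenode.example.com:50070/data/user/ashish/out/part-00000";
my $rc  = getstore($url, "part-00000.local");
die "fetch of $url failed: HTTP $rc\n" unless is_success($rc);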

On 4/2/08 7:57 AM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:

> On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]>
> wrote:
> 
>> curious - why do we need a file per XXX?
>> 
>> - if the data needs to be exported (either to a sql db or an external file
>> system) - then why not do so directly from the reducer (instead of trying to
>> create these intermediate small files in hdfs)? data can be written to tmp
>> tables/files and can be overwritten in case the reducer re-runs (and then
>> committed to final location once the job is complete)
>> 
> 
> The second case (data needs to be exported) is the reason that I have. Each
> of these small files is used in an external process. This seems like a good
> solution - only question then is where can these files be written to safely?
> Local directory? /tmp?
> 
> Ashish
> 
> 
> 
>> 
>> 
>> 
>> -Original Message-
>> From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
>> Sent: Tue 4/1/2008 6:42 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: one key per output part file
>> 
>> This seems like a reasonable solution - but I am using Hadoop streaming
>> and
>> byreducer is a perl script. Is it possible to handle side-effect files in
>> streaming? I havent found
>> anything that indicates that you can...
>> 
>> Ashish
>> 
>> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> 
>>> 
>>> 
>>> Try opening the desired output file in the reduce method.  Make sure
>> that
>>> the output files are relative to the correct task specific directory
>> (look
>>> for side-effect files on the wiki).
>>> 
>>> 
>>> 
>>> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>>> 
>>>> Hi, I am using Hadoop streaming and I am trying to create a MapReduce
>>> that
>>>> will generate output where a single key is found in a single output
>> part
>>>> file.
>>>> Does anyone know how to ensure this condition? I want the reduce task
>>> (no
>>>> matter how many are specified), to only receive
>>>> key-value output from a single key each, process the key-value pairs
>> for
>>>> this key, write an output part-XXX file, and only
>>>> then process the next key.
>>>> 
>>>> Here is the task that I am trying to accomplish:
>>>> 
>>>> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
>>>> Output: Each part-XXX should contain the lines of T that contain the
>>> word
>>>> from line XXX in V.
>>>> 
>>>> Any help/ideas are appreciated.
>>>> 
>>>> Ashish
>>> 
>>> 
>> 
>> 



Re: one key per output part file

2008-04-02 Thread Ashish Venugopal
On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]>
wrote:

> curious - why do we need a file per XXX?
>
> - if the data needs to be exported (either to a sql db or an external file
> system) - then why not do so directly from the reducer (instead of trying to
> create these intermediate small files in hdfs)? data can be written to tmp
> tables/files and can be overwritten in case the reducer re-runs (and then
> committed to final location once the job is complete)
>

The second case (data needs to be exported) is the reason that I have. Each
of these small files is used in an external process. This seems like a good
solution - only question then is where can these files be written to safely?
Local directory? /tmp?

Ashish



>
>
>
> -Original Message-
> From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
> Sent: Tue 4/1/2008 6:42 PM
> To: core-user@hadoop.apache.org
> Subject: Re: one key per output part file
>
> This seems like a reasonable solution - but I am using Hadoop streaming
> and
> byreducer is a perl script. Is it possible to handle side-effect files in
> streaming? I havent found
> anything that indicates that you can...
>
> Ashish
>
> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> >
> >
> > Try opening the desired output file in the reduce method.  Make sure
> that
> > the output files are relative to the correct task specific directory
> (look
> > for side-effect files on the wiki).
> >
> >
> >
> > On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> > that
> > > will generate output where a single key is found in a single output
> part
> > > file.
> > > Does anyone know how to ensure this condition? I want the reduce task
> > (no
> > > matter how many are specified), to only receive
> > > key-value output from a single key each, process the key-value pairs
> for
> > > this key, write an output part-XXX file, and only
> > > then process the next key.
> > >
> > > Here is the task that I am trying to accomplish:
> > >
> > > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > > Output: Each part-XXX should contain the lines of T that contain the
> > word
> > > from line XXX in V.
> > >
> > > Any help/ideas are appreciated.
> > >
> > > Ashish
> >
> >
>
>


RE: one key per output part file

2008-04-02 Thread Joydeep Sen Sarma
curious - why do we need a file per XXX?

- if further processing is going to be done in Hadoop itself - then it's hard
to see a reason. One can always have multiple entries in the same HDFS file.
Note that it's possible to align map task splits on sort-key boundaries in
pre-sorted data (it's not something that Hadoop supports natively right now -
but you can write your own InputFormat to do this). Meaning - subsequent
processing that wants all entries corresponding to XXX in one group (as in a
reducer) can do so in the map phase itself (i.e. - it's damned cheap and
doesn't require sorting the data all over again). A rough sketch of that
map-phase grouping follows after the next point.

- if the data needs to be exported (either to a SQL db or an external file
system) - then why not do so directly from the reducer (instead of trying to
create these intermediate small files in HDFS)? Data can be written to tmp
tables/files and can be overwritten in case the reducer re-runs (and then
committed to the final location once the job is complete). That pattern is
also sketched below.
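
Sketch of the map-phase grouping idea (assumptions: the input is pre-sorted
by key, split boundaries fall on key boundaries via a custom InputFormat, and
process_group is a hypothetical stand-in for whatever per-key work you need):

#!/usr/bin/perl
# Streaming mapper sketch: because every record for a given key is
# guaranteed to arrive in this one mapper (pre-sorted data, splits
# aligned on key boundaries), we can group by key without a reduce.
use strict;
use warnings;

my ($cur, @group);
while (my $line = <STDIN>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;
    if (defined $cur && $key ne $cur) {
        process_group($cur, \@group);   # hypothetical per-key handler
        @group = ();
    }
    $cur = $key;
    push @group, $value;
}
process_group($cur, \@group) if defined $cur;

sub process_group {
    my ($key, $values) = @_;
    # placeholder: e.g. write one side-effect file per key
    print scalar(@$values), " values for key $key\n";
}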

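And a minimal sketch of the export-and-commit pattern from the second point
(the directory, key, and payload are placeholders; a re-run of the reducer
simply overwrites the .tmp file and redoes the rename):

use strict;
use warnings;

# Sketch: write the export to a temporary name, then rename it into
# place only after a clean write, so partial output from a failed or
# re-run reducer never lands under the final name.
my $exportdir = "/nfs/export/placeholder";   # hypothetical destination
my $key       = "some_key";                  # placeholder
my $tmp   = "$exportdir/$key.tmp";
my $final = "$exportdir/$key.out";

open(my $out, ">", $tmp) or die "cannot open $tmp: $!\n";
print $out "...data for $key...\n";          # placeholder payload
close($out) or die "close $tmp: $!\n";

rename($tmp, $final) or die "commit $tmp -> $final failed: $!\n";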


-Original Message-
From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
Sent: Tue 4/1/2008 6:42 PM
To: core-user@hadoop.apache.org
Subject: Re: one key per output part file
 
This seems like a reasonable solution - but I am using Hadoop streaming and
byreducer is a perl script. Is it possible to handle side-effect files in
streaming? I havent found
anything that indicates that you can...

Ashish

On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
>
> Try opening the desired output file in the reduce method.  Make sure that
> the output files are relative to the correct task specific directory (look
> for side-effect files on the wiki).
>
>
>
> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>
> > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> that
> > will generate output where a single key is found in a single output part
> > file.
> > Does anyone know how to ensure this condition? I want the reduce task
> (no
> > matter how many are specified), to only receive
> > key-value output from a single key each, process the key-value pairs for
> > this key, write an output part-XXX file, and only
> > then process the next key.
> >
> > Here is the task that I am trying to accomplish:
> >
> > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > Output: Each part-XXX should contain the lines of T that contain the
> word
> > from line XXX in V.
> >
> > Any help/ideas are appreciated.
> >
> > Ashish
>
>



Re: one key per output part file

2008-04-01 Thread Ted Dunning

No.  That is a limitation of streaming.


On 4/1/08 6:42 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:

> This seems like a reasonable solution - but I am using Hadoop streaming and
> byreducer is a perl script. Is it possible to handle side-effect files in
> streaming? I havent found
> anything that indicates that you can...
> 
> Ashish
> 
> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> 
>> 
>> 
>> Try opening the desired output file in the reduce method.  Make sure that
>> the output files are relative to the correct task specific directory (look
>> for side-effect files on the wiki).
>> 
>> 
>> 
>> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>> 
>>> Hi, I am using Hadoop streaming and I am trying to create a MapReduce
>> that
>>> will generate output where a single key is found in a single output part
>>> file.
>>> Does anyone know how to ensure this condition? I want the reduce task
>> (no
>>> matter how many are specified), to only receive
>>> key-value output from a single key each, process the key-value pairs for
>>> this key, write an output part-XXX file, and only
>>> then process the next key.
>>> 
>>> Here is the task that I am trying to accomplish:
>>> 
>>> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
>>> Output: Each part-XXX should contain the lines of T that contain the
>> word
>>> from line XXX in V.
>>> 
>>> Any help/ideas are appreciated.
>>> 
>>> Ashish
>> 
>> 



Re: one key per output part file

2008-04-01 Thread Ashish Venugopal
This seems like a reasonable solution - but I am using Hadoop streaming and
my reducer is a Perl script. Is it possible to handle side-effect files in
streaming? I haven't found
anything that indicates that you can...

Ashish

On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
>
> Try opening the desired output file in the reduce method.  Make sure that
> the output files are relative to the correct task specific directory (look
> for side-effect files on the wiki).
>
>
>
> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>
> > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> that
> > will generate output where a single key is found in a single output part
> > file.
> > Does anyone know how to ensure this condition? I want the reduce task
> (no
> > matter how many are specified), to only receive
> > key-value output from a single key each, process the key-value pairs for
> > this key, write an output part-XXX file, and only
> > then process the next key.
> >
> > Here is the task that I am trying to accomplish:
> >
> > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > Output: Each part-XXX should contain the lines of T that contain the
> word
> > from line XXX in V.
> >
> > Any help/ideas are appreciated.
> >
> > Ashish
>
>


Re: one key per output part file

2008-04-01 Thread Ted Dunning


Try opening the desired output file in the reduce method.  Make sure that
the output files are relative to the correct task specific directory (look
for side-effect files on the wiki).
 


On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:

> Hi, I am using Hadoop streaming and I am trying to create a MapReduce that
> will generate output where a single key is found in a single output part
> file.
> Does anyone know how to ensure this condition? I want the reduce task (no
> matter how many are specified), to only receive
> key-value output from a single key each, process the key-value pairs for
> this key, write an output part-XXX file, and only
> then process the next key.
> 
> Here is the task that I am trying to accomplish:
> 
> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> Output: Each part-XXX should contain the lines of T that contain the word
> from line XXX in V.
> 
> Any help/ideas are appreciated.
> 
> Ashish



one key per output part file

2008-04-01 Thread Ashish Venugopal
Hi, I am using Hadoop streaming and I am trying to create a MapReduce job that
will generate output where a single key is found in a single output part
file.
Does anyone know how to ensure this condition? I want each reduce task (no
matter how many are specified) to receive the key-value output for only one
key at a time, process the key-value pairs for that key, write them to an
output part-XXX file, and only then move on to the next key.

Here is the task that I am trying to accomplish:

Input: Corpus T (lines of text), Corpus V (each line has 1 word)
Output: Each part-XXX should contain the lines of T that contain the word
from line XXX in V.

Any help/ideas are appreciated.

Ashish