Re: architecture diagram

2008-10-08 Thread Alex Loddengaard
Glad we could help, Terrence.  The second pivot might be tricky; you may
have to run a second iteration.  I haven't thought the problem all the way
through, though.

Good luck.

Alex

On Wed, Oct 8, 2008 at 1:02 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]
> wrote:

> I think I can figure this out now and get it to work. I will check back in
> if I get it. All that is missing at the moment is my pivot-back mapping
> step. Thanks for the help.
>
> Terrence A. Pietrondi
>
>
> --- On Tue, 10/7/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Tuesday, October 7, 2008, 1:55 PM
> > Thanks for the clarification, Samuel.  I wasn't aware that parts of a
> > line might end up in different splits when using TextInputFormat.
> > Terrence, this means that you'll have to take the approach of collecting
> > key => column_number, value => column_contents in your map step.
> >
> > Alex
> >
> > On Mon, Oct 6, 2008 at 6:41 PM, Samuel Guo
> > <[EMAIL PROTECTED]> wrote:
> >
> > > I think the 'split' Alex talked about is the MapReduce system's
> > > action; the 'split' you described is your mapper's action.
> > >
> > > I guess that your map/reduce application uses *TextInputFormat*
> > > to read your input file.
> > >
> > > Your input file will first be split into a few file splits. These
> > > splits may look like <start offset, length>. What Alex said about
> > > 'The location of these splits is semi-arbitrary' means that each
> > > file split's offset in your input file is semi-arbitrary. Am I
> > > right, Alex? Then *TextInputFormat* will translate these file
> > > splits into a sequence of lines, where the offset is treated as
> > > the key and the line is treated as the value.
> > >
> > > Because these file splits are made by byte offset, some lines in
> > > your file may be split across different file splits. The
> > > *LineRecordReader* used by *TextInputFormat* will skip the
> > > half-baked (partial) line at the start of a split, and the reader
> > > for the previous split will read past its boundary to finish that
> > > line, so that every mapper gets complete lines one by one.
> > >
> > > For example, take a file like this:
> > > 
> > > AAA BBB CCC DDD
> > > EEE FFF GGG HHH
> > > AAA BBB CCC DDD
> > > 
> > >
> > > It may be split into two file splits (assuming there are two
> > > mappers):
> > > split one:
> > > 
> > > AAA BBB CCC
> > >
> > > split two:
> > > DDD
> > > EEE FFF GGG HHH
> > > AAA BBB CCC DDD
> > > 
> > >
> > > Take split two as an example: TextInputFormat will use
> > > LineRecordReader to translate split two into a sequence of
> > > <offset, line> pairs, and it will skip the first half-baked
> > > (partial) line, "DDD". So the sequence will be:
> > >
> > > <offset1, "EEE FFF GGG HHH">
> > > <offset2, "AAA BBB CCC DDD">
> > >
> > > Then what to do with the lines depends on your job.
> > >
> > >
> > > On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi
> > <
> > > [EMAIL PROTECTED]
> > > > wrote:
> > >
> > > > So looking at the following mapper...
> > > >
> > > >
> > > >
> > >
> >
> http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
> > > >
> > > > On line 32, you can see the row split via a
> > delimiter. On line 43, you
> > > can
> > > > see that the field index (the column index) is
> > the map key, and the map
> > > > value is the field contents. How is this
> > incorrect? I think this follows
> > > > your earlier suggestion of:
> > > >
> > > > "You may want to play with the following
> > idea: collect key =>
> > > column_number
> > > > and value => column_contents in your map
> > step."
> > > >
> > > > Terrence A. Pietrondi
> > > >
> > > >
> > > > --- On Mon, 10/6/08, Alex Loddengaard
> > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > From: Alex Lod

Re: architecture diagram

2008-10-08 Thread Terrence A. Pietrondi
I think I can figure this out now and get it to work. I will check back in if I
get it. All that is missing at the moment is my pivot-back mapping step.
Thanks for the help.

Terrence A. Pietrondi


--- On Tue, 10/7/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> From: Alex Loddengaard <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Tuesday, October 7, 2008, 1:55 PM
> Thanks for the clarification, Samuel.  I wasn't aware that parts of a line
> might end up in different splits when using TextInputFormat.
> Terrence, this means that you'll have to take the approach of collecting
> key => column_number, value => column_contents in your map step.
> 
> Alex
> 
> On Mon, Oct 6, 2008 at 6:41 PM, Samuel Guo
> <[EMAIL PROTECTED]> wrote:
> 
> > I think the 'split' Alex talked about is the MapReduce system's
> > action; the 'split' you described is your mapper's action.
> >
> > I guess that your map/reduce application uses *TextInputFormat* to
> > read your input file.
> >
> > Your input file will first be split into a few file splits. These
> > splits may look like <start offset, length>. What Alex said about
> > 'The location of these splits is semi-arbitrary' means that each file
> > split's offset in your input file is semi-arbitrary. Am I right,
> > Alex? Then *TextInputFormat* will translate these file splits into a
> > sequence of lines, where the offset is treated as the key and the
> > line is treated as the value.
> >
> > Because these file splits are made by byte offset, some lines in your
> > file may be split across different file splits. The *LineRecordReader*
> > used by *TextInputFormat* will skip the half-baked (partial) line at
> > the start of a split, and the reader for the previous split will read
> > past its boundary to finish that line, so that every mapper gets
> > complete lines one by one.
> >
> > For example, take a file like this:
> > 
> > AAA BBB CCC DDD
> > EEE FFF GGG HHH
> > AAA BBB CCC DDD
> > 
> >
> > It may be split into two file splits (assuming there are two
> > mappers):
> > split one:
> > 
> > AAA BBB CCC
> >
> > split two:
> > DDD
> > EEE FFF GGG HHH
> > AAA BBB CCC DDD
> > 
> >
> > Take split two as an example: TextInputFormat will use
> > LineRecordReader to translate split two into a sequence of
> > <offset, line> pairs, and it will skip the first half-baked
> > (partial) line, "DDD". So the sequence will be:
> >
> > <offset1, "EEE FFF GGG HHH">
> > <offset2, "AAA BBB CCC DDD">
> >
> > Then what to do with the lines depends on your job.
> >
> >
> > On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi
> <
> > [EMAIL PROTECTED]
> > > wrote:
> >
> > > So looking at the following mapper...
> > >
> > >
> > >
> >
> http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
> > >
> > > On line 32, you can see the row split via a
> delimiter. On line 43, you
> > can
> > > see that the field index (the column index) is
> the map key, and the map
> > > value is the field contents. How is this
> incorrect? I think this follows
> > > your earlier suggestion of:
> > >
> > > "You may want to play with the following
> idea: collect key =>
> > column_number
> > > and value => column_contents in your map
> step."
> > >
> > > Terrence A. Pietrondi
> > >
> > >
> > > --- On Mon, 10/6/08, Alex Loddengaard
> <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard
> <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Monday, October 6, 2008, 12:55 PM
> > > > As far as I know, splits will never be made
> within a line,
> > > > only between
> > > > rows.  To answer your question about ways to
> control the
> > > > splits, see below:
> > > >
> > > >
> <http://wiki.apache.org/hadoop/HowManyMapsAndReduces>
> > > > <
> > > >
> > >
> >
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
> > > > >
> > > >
> > > > Alex
> > > >
> > > > On Mon, Oct 6, 2008 at 6:38 AM, Terrence A.
> Pietrondi
> > > > <[EM

Re: architecture diagram

2008-10-07 Thread Alex Loddengaard
Thanks for the clarification, Samuel.  I wasn't aware that parts of a line
might end up in different splits when using TextInputFormat.
Terrence, this means that you'll have to take the approach of collecting key
=> column_number, value => column_contents in your map step.

Alex

On Mon, Oct 6, 2008 at 6:41 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:

> I think the 'split' Alex talked about is the MapReduce system's action;
> the 'split' you described is your mapper's action.
>
> I guess that your map/reduce application uses *TextInputFormat* to read
> your input file.
>
> Your input file will first be split into a few file splits. These splits
> may look like <start offset, length>. What Alex said about 'The location
> of these splits is semi-arbitrary' means that each file split's offset in
> your input file is semi-arbitrary. Am I right, Alex?
> Then *TextInputFormat* will translate these file splits into a sequence of
> lines, where the offset is treated as the key and the line is treated as
> the value.
>
> Because these file splits are made by byte offset, some lines in your file
> may be split across different file splits. The *LineRecordReader* used by
> *TextInputFormat* will skip the half-baked (partial) line at the start of
> a split, and the reader for the previous split will read past its boundary
> to finish that line, so that every mapper gets complete lines one by one.
>
> For example, take a file like this:
> 
> AAA BBB CCC DDD
> EEE FFF GGG HHH
> AAA BBB CCC DDD
> 
>
> It may be split into two file splits (assuming there are two mappers):
> split one:
> 
> AAA BBB CCC
>
> split two:
> DDD
> EEE FFF GGG HHH
> AAA BBB CCC DDD
> 
>
> Take split two as an example: TextInputFormat will use LineRecordReader to
> translate split two into a sequence of <offset, line> pairs, and it will
> skip the first half-baked (partial) line, "DDD". So the sequence will be:
>
> <offset1, "EEE FFF GGG HHH">
> <offset2, "AAA BBB CCC DDD">
>
> Then what to do with the lines depends on your job.
>
>
> On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi <
> [EMAIL PROTECTED]
> > wrote:
>
> > So looking at the following mapper...
> >
> >
> >
> http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
> >
> > On line 32, you can see the row split via a delimiter. On line 43, you
> can
> > see that the field index (the column index) is the map key, and the map
> > value is the field contents. How is this incorrect? I think this follows
> > your earlier suggestion of:
> >
> > "You may want to play with the following idea: collect key =>
> column_number
> > and value => column_contents in your map step."
> >
> > Terrence A. Pietrondi
> >
> >
> > --- On Mon, 10/6/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> >
> > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Monday, October 6, 2008, 12:55 PM
> > > As far as I know, splits will never be made within a line,
> > > only between
> > > rows.  To answer your question about ways to control the
> > > splits, see below:
> > >
> > > <http://wiki.apache.org/hadoop/HowManyMapsAndReduces>
> > > <
> > >
> >
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
> > > >
> > >
> > > Alex
> > >
> > > On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
> > > <[EMAIL PROTECTED]
> > > > wrote:
> > >
> > > > Can you explain "The location of these splits is
> > > semi-arbitrary"? What if
> > > > the example was...
> > > >
> > > > AAA|BBB|CCC|DDD
> > > > EEE|FFF|GGG|HHH
> > > >
> > > >
> > > > Does this mean the split might be between CCC such
> > > that it results in
> > > > AAA|BBB|C and C|DDD for the first line? Is there a way
> > > to control this
> > > > behavior to split on my delimiter?
> > > >
> > > >
> > > > Terrence A. Pietrondi
> > > >
> > > >
> > > > --- On Sun, 10/5/08, Alex Loddengaard
> > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > From: Alex Loddengaard
> > > <[EMAIL PROTECTED]>
> > > > > Subject: Re: architecture diagram
> > > > > To: core-user@hadoop.apache.org
> > > > > Date: Sunday, October 5, 2008, 9:26 PM

Re: architecture diagram

2008-10-06 Thread Samuel Guo
I think the 'split' Alex talked about is the MapReduce system's action;
the 'split' you described is your mapper's action.

I guess that your map/reduce application uses *TextInputFormat* to read
your input file.

Your input file will first be split into a few file splits. These splits may
look like <start offset, length>. What Alex said about 'The location of
these splits is semi-arbitrary' means that each file split's offset in your
input file is semi-arbitrary. Am I right, Alex?
Then *TextInputFormat* will translate these file splits into a sequence of
lines, where the offset is treated as the key and the line is treated as the
value.

Because these file splits are made by byte offset, some lines in your file
may be split across different file splits. The *LineRecordReader* used by
*TextInputFormat* will skip the half-baked (partial) line at the start of a
split, and the reader for the previous split will read past its boundary to
finish that line, so that every mapper gets complete lines one by one.

For example, take a file like this:

AAA BBB CCC DDD
EEE FFF GGG HHH
AAA BBB CCC DDD


It may be split into two file splits (assuming there are two mappers):
split one:

AAA BBB CCC

split two:
DDD
EEE FFF GGG HHH
AAA BBB CCC DDD


Take split two as an example: TextInputFormat will use LineRecordReader to
translate split two into a sequence of <offset, line> pairs, and it will skip
the first half-baked (partial) line, "DDD". So the sequence will be:

<offset1, "EEE FFF GGG HHH">
<offset2, "AAA BBB CCC DDD">

Then what to do with the lines depends on your job.
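
Concretely, a mapper behind TextInputFormat always sees whole lines, with the
byte offset as the key. A minimal sketch, assuming the old
org.apache.hadoop.mapred API:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LineEchoMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {
      public void map(LongWritable offset, Text line,
                      OutputCollector<LongWritable, Text> output,
                      Reporter reporter) throws IOException {
        // offset: where this line starts in the file (the key)
        // line:   one complete line, e.g. "EEE FFF GGG HHH" (the value);
        //         never a fragment, no matter where the split boundary fell
        output.collect(offset, line);
      }
    }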


On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]
> wrote:

> So looking at the following mapper...
>
>
> http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
>
> On line 32, you can see the row split via a delimiter. On line 43, you can
> see that the field index (the column index) is the map key, and the map
> value is the field contents. How is this incorrect? I think this follows
> your earlier suggestion of:
>
> "You may want to play with the following idea: collect key => column_number
> and value => column_contents in your map step."
>
> Terrence A. Pietrondi
>
>
> --- On Mon, 10/6/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Monday, October 6, 2008, 12:55 PM
> > As far as I know, splits will never be made within a line,
> > only between
> > rows.  To answer your question about ways to control the
> > splits, see below:
> >
> > <http://wiki.apache.org/hadoop/HowManyMapsAndReduces>
> > <
> >
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
> > >
> >
> > Alex
> >
> > On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
> > <[EMAIL PROTECTED]
> > > wrote:
> >
> > > Can you explain "The location of these splits is
> > semi-arbitrary"? What if
> > > the example was...
> > >
> > > AAA|BBB|CCC|DDD
> > > EEE|FFF|GGG|HHH
> > >
> > >
> > > Does this mean the split might be between CCC such
> > that it results in
> > > AAA|BBB|C and C|DDD for the first line? Is there a way
> > to control this
> > > behavior to split on my delimiter?
> > >
> > >
> > > Terrence A. Pietrondi
> > >
> > >
> > > --- On Sun, 10/5/08, Alex Loddengaard
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard
> > <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Sunday, October 5, 2008, 9:26 PM
> > > > Let's say you have one very large input file
> > of the
> > > > form:
> > > >
> > > > A|B|C|D
> > > > E|F|G|H
> > > > ...
> > > > |1|2|3|4
> > > >
> > > > This input file will be broken up into N pieces,
> > where N is
> > > > the number of
> > > > mappers that run.  The location of these splits
> > is
> > > > semi-arbitrary.  This
> > > > means that unless you have one mapper, you
> > won't be
> > > > able to see the entire
> > > > contents of a column in your mapper.  Given that
> > you would
> > > > need one mapper
> > > > to be able to see the entirety of a column,
> > you've now
> > > > essentially reduced
> > > > your problem to a single machine.
> > > >
> > > > Yo

Re: architecture diagram

2008-10-06 Thread Alex Loddengaard
This mapper does follow my original suggestion, though I'm not familiar with
how the delimiter works in this example.  Anyone else?

Alex

On Mon, Oct 6, 2008 at 2:55 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]
> wrote:

> So looking at the following mapper...
>
>
> http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
>
> On line 32, you can see the row split via a delimiter. On line 43, you can
> see that the field index (the column index) is the map key, and the map
> value is the field contents. How is this incorrect? I think this follows
> your earlier suggestion of:
>
> "You may want to play with the following idea: collect key => column_number
> and value => column_contents in your map step."
>
> Terrence A. Pietrondi
>
>
> --- On Mon, 10/6/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Monday, October 6, 2008, 12:55 PM
> > As far as I know, splits will never be made within a line,
> > only between
> > rows.  To answer your question about ways to control the
> > splits, see below:
> >
> > <http://wiki.apache.org/hadoop/HowManyMapsAndReduces>
> > <
> >
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
> > >
> >
> > Alex
> >
> > On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
> > <[EMAIL PROTECTED]
> > > wrote:
> >
> > > Can you explain "The location of these splits is
> > semi-arbitrary"? What if
> > > the example was...
> > >
> > > AAA|BBB|CCC|DDD
> > > EEE|FFF|GGG|HHH
> > >
> > >
> > > Does this mean the split might be between CCC such
> > that it results in
> > > AAA|BBB|C and C|DDD for the first line? Is there a way
> > to control this
> > > behavior to split on my delimiter?
> > >
> > >
> > > Terrence A. Pietrondi
> > >
> > >
> > > --- On Sun, 10/5/08, Alex Loddengaard
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard
> > <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Sunday, October 5, 2008, 9:26 PM
> > > > Let's say you have one very large input file
> > of the
> > > > form:
> > > >
> > > > A|B|C|D
> > > > E|F|G|H
> > > > ...
> > > > |1|2|3|4
> > > >
> > > > This input file will be broken up into N pieces,
> > where N is
> > > > the number of
> > > > mappers that run.  The location of these splits
> > is
> > > > semi-arbitrary.  This
> > > > means that unless you have one mapper, you
> > won't be
> > > > able to see the entire
> > > > contents of a column in your mapper.  Given that
> > you would
> > > > need one mapper
> > > > to be able to see the entirety of a column,
> > you've now
> > > > essentially reduced
> > > > your problem to a single machine.
> > > >
> > > > You may want to play with the following idea:
> > collect key
> > > > => column_number
> > > > and value => column_contents in your map step.
> >  This
> > > > means that you would be
> > > > able to see the entirety of a column in your
> > reduce step,
> > > > though you're
> > > > still faced with the tasks of shuffling and
> > re-pivoting.
> > > >
> > > > Does this clear up your confusion?  Let me know
> > if
> > > > you'd like me to clarify
> > > > more.
> > > >
> > > > Alex
> > > >
> > > > On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
> > Pietrondi
> > > > <[EMAIL PROTECTED]
> > > > > wrote:
> > > >
> > > > > I am not sure why this doesn't fit,
> > maybe you can
> > > > help me understand. Your
> > > > > previous comment was...
> > > > >
> > > > > "The reason I'm making this claim
> > is because
> > > > in order to do the pivot
> > > > > operation you must know about every row.
> > Your input

Re: architecture diagram

2008-10-06 Thread Terrence A. Pietrondi
So looking at the following mapper...

http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup

On line 32, you can see the row split via a delimiter. On line 43, you can see 
that the field index (the column index) is the map key, and the map value is 
the field contents. How is this incorrect? I think this follows your earlier 
suggestion of:

"You may want to play with the following idea: collect key => column_number and 
value => column_contents in your map step."
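
An illustrative reconstruction of that map step (the actual PivotMapper is at
the URL above; this sketch assumes a '|' delimiter and the old
org.apache.hadoop.mapred API):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ColumnMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
      public void map(LongWritable offset, Text row,
                      OutputCollector<IntWritable, Text> output,
                      Reporter reporter) throws IOException {
        String[] fields = row.toString().split("\\|");  // split the row on the delimiter
        for (int col = 0; col < fields.length; col++) {
          // key = field index (column index), value = field contents
          output.collect(new IntWritable(col), new Text(fields[col]));
        }
      }
    }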

Terrence A. Pietrondi


--- On Mon, 10/6/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> From: Alex Loddengaard <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Monday, October 6, 2008, 12:55 PM
> As far as I know, splits will never be made within a line,
> only between
> rows.  To answer your question about ways to control the
> splits, see below:
> 
> <http://wiki.apache.org/hadoop/HowManyMapsAndReduces>
> <
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
> >
> 
> Alex
> 
> On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi
> <[EMAIL PROTECTED]
> > wrote:
> 
> > Can you explain "The location of these splits is
> semi-arbitrary"? What if
> > the example was...
> >
> > AAA|BBB|CCC|DDD
> > EEE|FFF|GGG|HHH
> >
> >
> > Does this mean the split might be between CCC such
> that it results in
> > AAA|BBB|C and C|DDD for the first line? Is there a way
> to control this
> > behavior to split on my delimiter?
> >
> >
> > Terrence A. Pietrondi
> >
> >
> > --- On Sun, 10/5/08, Alex Loddengaard
> <[EMAIL PROTECTED]> wrote:
> >
> > > From: Alex Loddengaard
> <[EMAIL PROTECTED]>
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Sunday, October 5, 2008, 9:26 PM
> > > Let's say you have one very large input file
> of the
> > > form:
> > >
> > > A|B|C|D
> > > E|F|G|H
> > > ...
> > > |1|2|3|4
> > >
> > > This input file will be broken up into N pieces,
> where N is
> > > the number of
> > > mappers that run.  The location of these splits
> is
> > > semi-arbitrary.  This
> > > means that unless you have one mapper, you
> won't be
> > > able to see the entire
> > > contents of a column in your mapper.  Given that
> you would
> > > need one mapper
> > > to be able to see the entirety of a column,
> you've now
> > > essentially reduced
> > > your problem to a single machine.
> > >
> > > You may want to play with the following idea:
> collect key
> > > => column_number
> > > and value => column_contents in your map step.
>  This
> > > means that you would be
> > > able to see the entirety of a column in your
> reduce step,
> > > though you're
> > > still faced with the tasks of shuffling and
> re-pivoting.
> > >
> > > Does this clear up your confusion?  Let me know
> if
> > > you'd like me to clarify
> > > more.
> > >
> > > Alex
> > >
> > > On Sun, Oct 5, 2008 at 3:54 PM, Terrence A.
> Pietrondi
> > > <[EMAIL PROTECTED]
> > > > wrote:
> > >
> > > > I am not sure why this doesn't fit,
> maybe you can
> > > help me understand. Your
> > > > previous comment was...
> > > >
> > > > "The reason I'm making this claim
> is because
> > > in order to do the pivot
> > > > operation you must know about every row.
> Your input
> > > files will be split at
> > > > semi-arbitrary places, essentially making it
> > > impossible for each mapper to
> > > > know every single row."
> > > >
> > > > Are you saying that my row segments might
> not actually
> > > be the entire row so
> > > > I will get a bad key index? If so, how would the
> row
> > > segments be determined? I
> > > > based my initial work off of the word count
> example,
> > > where the lines are
> > > > tokenized. Does this mean in this example
> the row
> > > tokens may not be the
> > > > complete row?
> > > >
> > > > Thanks.
> > > >
> > > > Terrence A. Pietrondi
> > > >

Re: architecture diagram

2008-10-06 Thread Alex Loddengaard
As far as I know, splits will never be made within a line, only between
rows.  To answer your question about ways to control the splits, see below:

<http://wiki.apache.org/hadoop/HowManyMapsAndReduces>
<
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html
>
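
A related sketch, assuming the old org.apache.hadoop.mapred API: one blunt
way to take split placement out of the picture entirely, at the cost of
map-side parallelism, is an input format that refuses to split files:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class NonSplittingTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;  // one split per file: a single mapper sees the whole file
      }
    }

Registered with conf.setInputFormat(NonSplittingTextInputFormat.class), each
file then goes to exactly one mapper.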

Alex

On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]
> wrote:

> Can you explain "The location of these splits is semi-arbitrary"? What if
> the example was...
>
> AAA|BBB|CCC|DDD
> EEE|FFF|GGG|HHH
>
>
> Does this mean the split might be between CCC such that it results in
> AAA|BBB|C and C|DDD for the first line? Is there a way to control this
> behavior to split on my delimiter?
>
>
> Terrence A. Pietrondi
>
>
> --- On Sun, 10/5/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Sunday, October 5, 2008, 9:26 PM
> > Let's say you have one very large input file of the
> > form:
> >
> > A|B|C|D
> > E|F|G|H
> > ...
> > |1|2|3|4
> >
> > This input file will be broken up into N pieces, where N is
> > the number of
> > mappers that run.  The location of these splits is
> > semi-arbitrary.  This
> > means that unless you have one mapper, you won't be
> > able to see the entire
> > contents of a column in your mapper.  Given that you would
> > need one mapper
> > to be able to see the entirety of a column, you've now
> > essentially reduced
> > your problem to a single machine.
> >
> > You may want to play with the following idea: collect key
> > => column_number
> > and value => column_contents in your map step.  This
> > means that you would be
> > able to see the entirety of a column in your reduce step,
> > though you're
> > still faced with the tasks of shuffling and re-pivoting.
> >
> > Does this clear up your confusion?  Let me know if
> > you'd like me to clarify
> > more.
> >
> > Alex
> >
> > On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi
> > <[EMAIL PROTECTED]
> > > wrote:
> >
> > > I am not sure why this doesn't fit, maybe you can
> > help me understand. Your
> > > previous comment was...
> > >
> > > "The reason I'm making this claim is because
> > in order to do the pivot
> > > operation you must know about every row. Your input
> > files will be split at
> > > semi-arbitrary places, essentially making it
> > impossible for each mapper to
> > > know every single row."
> > >
> > > Are you saying that my row segments might not actually
> > be the entire row so
> > > I will get a bad key index? If so, how would the row
> > segments be determined? I
> > > based my initial work off of the word count example,
> > where the lines are
> > > tokenized. Does this mean in this example the row
> > tokens may not be the
> > > complete row?
> > >
> > > Thanks.
> > >
> > > Terrence A. Pietrondi
> > >
> > >
> > > --- On Fri, 10/3/08, Alex Loddengaard
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard
> > <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Friday, October 3, 2008, 7:14 PM
> > > > The approach that you've described does not
> > fit well in
> > > > to the MapReduce
> > > > paradigm.  You may want to consider randomizing
> > your data
> > > > in a different
> > > > way.
> > > >
> > > > Unfortunately some things can't be solved
> > well with
> > > > MapReduce, and I think
> > > > this is one of them.
> > > >
> > > > Can someone else say more?
> > > >
> > > > Alex
> > > >
> > > > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A.
> > Pietrondi
> > > > <[EMAIL PROTECTED]
> > > > > wrote:
> > > >
> > > > > Sorry for the confusion, I did make some
> > typos. My
> > > > example should have
> > > > > looked like...
> > > > >
> > > > > > A|B|C
> > > > > > D|E|G
> > > > > >
> > > > > > pivots too...

Re: architecture diagram

2008-10-06 Thread Terrence A. Pietrondi
Can you explain "The location of these splits is semi-arbitrary"? What if the 
example was...

AAA|BBB|CCC|DDD
EEE|FFF|GGG|HHH


Does this mean the split might be between CCC such that it results in AAA|BBB|C 
and C|DDD for the first line? Is there a way to control this behavior to split 
on my delimiter?


Terrence A. Pietrondi


--- On Sun, 10/5/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> From: Alex Loddengaard <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Sunday, October 5, 2008, 9:26 PM
> Let's say you have one very large input file of the
> form:
> 
> A|B|C|D
> E|F|G|H
> ...
> |1|2|3|4
> 
> This input file will be broken up into N pieces, where N is
> the number of
> mappers that run.  The location of these splits is
> semi-arbitrary.  This
> means that unless you have one mapper, you won't be
> able to see the entire
> contents of a column in your mapper.  Given that you would
> need one mapper
> to be able to see the entirety of a column, you've now
> essentially reduced
> your problem to a single machine.
> 
> You may want to play with the following idea: collect key
> => column_number
> and value => column_contents in your map step.  This
> means that you would be
> able to see the entirety of a column in your reduce step,
> though you're
> still faced with the tasks of shuffling and re-pivoting.
> 
> Does this clear up your confusion?  Let me know if
> you'd like me to clarify
> more.
> 
> Alex
> 
> On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi
> <[EMAIL PROTECTED]
> > wrote:
> 
> > I am not sure why this doesn't fit, maybe you can
> help me understand. Your
> > previous comment was...
> >
> > "The reason I'm making this claim is because
> in order to do the pivot
> > operation you must know about every row. Your input
> files will be split at
> > semi-arbitrary places, essentially making it
> impossible for each mapper to
> > know every single row."
> >
> > Are you saying that my row segments might not actually
> be the entire row so
> > I will get a bad key index? If so, how would the row
> segments be determined? I
> > based my initial work off of the word count example,
> where the lines are
> > tokenized. Does this mean in this example the row
> tokens may not be the
> > complete row?
> >
> > Thanks.
> >
> > Terrence A. Pietrondi
> >
> >
> > --- On Fri, 10/3/08, Alex Loddengaard
> <[EMAIL PROTECTED]> wrote:
> >
> > > From: Alex Loddengaard
> <[EMAIL PROTECTED]>
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Friday, October 3, 2008, 7:14 PM
> > > The approach that you've described does not
> fit well in
> > > to the MapReduce
> > > paradigm.  You may want to consider randomizing
> your data
> > > in a different
> > > way.
> > >
> > > Unfortunately some things can't be solved
> well with
> > > MapReduce, and I think
> > > this is one of them.
> > >
> > > Can someone else say more?
> > >
> > > Alex
> > >
> > > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A.
> Pietrondi
> > > <[EMAIL PROTECTED]
> > > > wrote:
> > >
> > > > Sorry for the confusion, I did make some
> typos. My
> > > example should have
> > > > looked like...
> > > >
> > > > > A|B|C
> > > > > D|E|G
> > > > >
> > > > > pivots to...
> > > > >
> > > > > D|A
> > > > > E|B
> > > > > G|C
> > > > >
> > > > > Then for each row, shuffle the contents
> around
> > > randomly...
> > > > >
> > > > > D|A
> > > > > B|E
> > > > > C|G
> > > > >
> > > > > Then pivot the data back...
> > > > >
> > > > > A|E|G
> > > > > D|B|C
> > > >
> > > > The general goal is to shuffle the elements
> in each
> > > column in the input
> > > > data. Meaning, the ordering of the elements
> in each
> > > column will not be the
> > > > same as in input.
> > > >
> > > > If you look at the initial input and compare
> to the
> > > final output, you'll
> > > > see that during the shuffling, B and E are
> swappe

Re: architecture diagram

2008-10-05 Thread Alex Loddengaard
Let's say you have one very large input file of the form:

A|B|C|D
E|F|G|H
...
|1|2|3|4

This input file will be broken up into N pieces, where N is the number of
mappers that run.  The location of these splits is semi-arbitrary.  This
means that unless you have one mapper, you won't be able to see the entire
contents of a column in your mapper.  Given that you would need one mapper
to be able to see the entirety of a column, you've now essentially reduced
your problem to a single machine.

You may want to play with the following idea: collect key => column_number
and value => column_contents in your map step.  This means that you would be
able to see the entirety of a column in your reduce step, though you're
still faced with the tasks of shuffling and re-pivoting.
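
A sketch of that reduce step, assuming the old org.apache.hadoop.mapred API
and that a whole column fits in one reducer's memory: every value for a given
column_number arrives at a single reduce() call, where it can be shuffled
before being written back out:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ShuffleColumnReducer extends MapReduceBase
        implements Reducer<IntWritable, Text, IntWritable, Text> {
      public void reduce(IntWritable columnNumber, Iterator<Text> columnContents,
                         OutputCollector<IntWritable, Text> output,
                         Reporter reporter) throws IOException {
        List<String> fields = new ArrayList<String>();
        while (columnContents.hasNext()) {
          fields.add(columnContents.next().toString());  // gather the whole column
        }
        Collections.shuffle(fields);  // randomize this column's ordering
        for (String field : fields) {
          output.collect(columnNumber, new Text(field));  // re-pivoting is still a separate step
        }
      }
    }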

Does this clear up your confusion?  Let me know if you'd like me to clarify
more.

Alex

On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]
> wrote:

> I am not sure why this doesn't fit, maybe you can help me understand. Your
> previous comment was...
>
> "The reason I'm making this claim is because in order to do the pivot
> operation you must know about every row. Your input files will be split at
> semi-arbitrary places, essentially making it impossible for each mapper to
> know every single row."
>
> Are you saying that my row segments might not actually be the entire row so
> I will get a bad key index? If so, would the row segments be determined? I
> based my initial work off of the word count example, where the lines are
> tokenized. Does this mean in this example the row tokens may not be the
> complete row?
>
> Thanks.
>
> Terrence A. Pietrondi
>
>
> --- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Friday, October 3, 2008, 7:14 PM
> > The approach that you've described does not fit well in
> > to the MapReduce
> > paradigm.  You may want to consider randomizing your data
> > in a different
> > way.
> >
> > Unfortunately some things can't be solved well with
> > MapReduce, and I think
> > this is one of them.
> >
> > Can someone else say more?
> >
> > Alex
> >
> > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi
> > <[EMAIL PROTECTED]
> > > wrote:
> >
> > > Sorry for the confusion, I did make some typos. My
> > example should have
> > > looked like...
> > >
> > > > A|B|C
> > > > D|E|G
> > > >
> > > > pivots to...
> > > >
> > > > D|A
> > > > E|B
> > > > G|C
> > > >
> > > > Then for each row, shuffle the contents around
> > randomly...
> > > >
> > > > D|A
> > > > B|E
> > > > C|G
> > > >
> > > > Then pivot the data back...
> > > >
> > > > A|E|G
> > > > D|B|C
> > >
> > > The general goal is to shuffle the elements in each
> > column in the input
> > > data. Meaning, the ordering of the elements in each
> > column will not be the
> > > same as in input.
> > >
> > > If you look at the initial input and compare to the
> > final output, you'll
> > > see that during the shuffling, B and E are swapped,
> > and G and C are swapped,
> > > while A and D were shuffled back into their
> > originating positions in the
> > > column.
> > >
> > > Once again, sorry for the typos and confusion.
> > >
> > > Terrence A. Pietrondi
> > >
> > > --- On Fri, 10/3/08, Alex Loddengaard
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard
> > <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Friday, October 3, 2008, 11:01 AM
> > > > Can you confirm that the example you've
> > presented is
> > > > accurate?  I think you
> > > > may have made some typos, because the letter
> > "G"
> > > > isn't in the final result;
> > > > I also think your first pivot accidentally
> > swapped C and G.
> > > >  I'm having a
> > > > hard time understanding what you want to do,
> > because it
> > > > seems like your
> > > > operations differ from your example.
> > > 

Re: architecture diagram

2008-10-05 Thread Terrence A. Pietrondi
I am not sure why this doesn't fit; maybe you can help me understand. Your
previous comment was...

"The reason I'm making this claim is because in order to do the pivot operation 
you must know about every row. Your input files will be split at semi-arbitrary 
places, essentially making it impossible for each mapper to know every single 
row."

Are you saying that my row segments might not actually be the entire row so I 
will get a bad key index? If so, how would the row segments be determined? I based
my initial work off of the word count example, where the lines are tokenized. 
Does this mean in this example the row tokens may not be the complete row?

Thanks.

Terrence A. Pietrondi


--- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> From: Alex Loddengaard <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Friday, October 3, 2008, 7:14 PM
> The approach that you've described does not fit well in
> to the MapReduce
> paradigm.  You may want to consider randomizing your data
> in a different
> way.
> 
> Unfortunately some things can't be solved well with
> MapReduce, and I think
> this is one of them.
> 
> Can someone else say more?
> 
> Alex
> 
> On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi
> <[EMAIL PROTECTED]
> > wrote:
> 
> > Sorry for the confusion, I did make some typos. My
> example should have
> > looked like...
> >
> > > A|B|C
> > > D|E|G
> > >
> > > pivots to...
> > >
> > > D|A
> > > E|B
> > > G|C
> > >
> > > Then for each row, shuffle the contents around
> randomly...
> > >
> > > D|A
> > > B|E
> > > C|G
> > >
> > > Then pivot the data back...
> > >
> > > A|E|G
> > > D|B|C
> >
> > The general goal is to shuffle the elements in each
> column in the input
> > data. Meaning, the ordering of the elements in each
> column will not be the
> > same as in input.
> >
> > If you look at the initial input and compare to the
> final output, you'll
> > see that during the shuffling, B and E are swapped,
> and G and C are swapped,
> > while A and D were shuffled back into their
> originating positions in the
> > column.
> >
> > Once again, sorry for the typos and confusion.
> >
> > Terrence A. Pietrondi
> >
> > --- On Fri, 10/3/08, Alex Loddengaard
> <[EMAIL PROTECTED]> wrote:
> >
> > > From: Alex Loddengaard
> <[EMAIL PROTECTED]>
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Friday, October 3, 2008, 11:01 AM
> > > Can you confirm that the example you've
> presented is
> > > accurate?  I think you
> > > may have made some typos, because the letter
> "G"
> > > isn't in the final result;
> > > I also think your first pivot accidentally
> swapped C and G.
> > >  I'm having a
> > > hard time understanding what you want to do,
> because it
> > > seems like your
> > > operations differ from your example.
> > >
> > > With that said, at first glance, this problem may
> not fit
> > > well in to the
> > > MapReduce paradigm.  The reason I'm making
> this claim
> > > is because in order to
> > > do the pivot operation you must know about every
> row.  Your
> > > input files will
> > > be split at semi-arbitrary places, essentially
> making it
> > > impossible for each
> > > mapper to know every single row.  There may be a
> way to do
> > > this by
> > > collecting, in your map step, key => column
> number (0,
> > > 1, 2, etc) and value
> > > => (A, B, C, etc), though you may run in to
> problems
> > > when you try to pivot
> > > back.  I say this because when you pivot back,
> you need to
> > > have each column,
> > > which means you'll need one reduce step. 
> There may be
> > > a way to put the
> > > pivot-back operation in a second iteration,
> though I
> > > don't think that would
> > > help you.
> > >
> > > Terrence, please confirm that you've defined
> your
> > > example correctly.  In the
> > > meantime, can someone else confirm that this
> problem does
> > > not fit well into
> > > the MapReduce paradigm?
> > >
> > > Alex
> > >
> > > On Thu, Oct 2, 2008 at 10:48

Re: architecture diagram

2008-10-03 Thread Alex Loddengaard
The approach that you've described does not fit well into the MapReduce
paradigm.  You may want to consider randomizing your data in a different
way.

Unfortunately some things can't be solved well with MapReduce, and I think
this is one of them.

Can someone else say more?

Alex

On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]
> wrote:

> Sorry for the confusion, I did make some typos. My example should have
> looked like...
>
> > A|B|C
> > D|E|G
> >
> > pivots to...
> >
> > D|A
> > E|B
> > G|C
> >
> > Then for each row, shuffle the contents around randomly...
> >
> > D|A
> > B|E
> > C|G
> >
> > Then pivot the data back...
> >
> > A|E|G
> > D|B|C
>
> The general goal is to shuffle the elements in each column in the input
> data. Meaning, the ordering of the elements in each column will not be the
> same as in input.
>
> If you look at the initial input and compare to the final output, you'll
> see that during the shuffling, B and E are swapped, and G and C are swapped,
> while A and D were shuffled back into their originating positions in the
> column.
>
> Once again, sorry for the typos and confusion.
>
> Terrence A. Pietrondi
>
> --- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Friday, October 3, 2008, 11:01 AM
> > Can you confirm that the example you've presented is
> > accurate?  I think you
> > may have made some typos, because the letter "G"
> > isn't in the final result;
> > I also think your first pivot accidentally swapped C and G.
> >  I'm having a
> > hard time understanding what you want to do, because it
> > seems like your
> > operations differ from your example.
> >
> > With that said, at first glance, this problem may not fit
> > well in to the
> > MapReduce paradigm.  The reason I'm making this claim
> > is because in order to
> > do the pivot operation you must know about every row.  Your
> > input files will
> > be split at semi-arbitrary places, essentially making it
> > impossible for each
> > mapper to know every single row.  There may be a way to do
> > this by
> > collecting, in your map step, key => column number (0,
> > 1, 2, etc) and value
> > => (A, B, C, etc), though you may run in to problems
> > when you try to pivot
> > back.  I say this because when you pivot back, you need to
> > have each column,
> > which means you'll need one reduce step.  There may be
> > a way to put the
> > pivot-back operation in a second iteration, though I
> > don't think that would
> > help you.
> >
> > Terrence, please confirm that you've defined your
> > example correctly.  In the
> > meantime, can someone else confirm that this problem does
> > not fit well into
> > the MapReduce paradigm?
> >
> > Alex
> >
> > On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi <
> > [EMAIL PROTECTED]> wrote:
> >
> > > I am trying to write a map reduce implementation to do
> > the following:
> > >
> > > 1) read tabular data delimited in some fashion
> > > 2) pivot that data, so the rows are columns and the
> > columns are rows
> > > 3) shuffle the rows (that were the columns) to
> > randomize the data
> > > 4) pivot the data back
> > >
> > > For example.....
> > >
> > > A|B|C
> > > D|E|G
> > >
> > > pivots to...
> > >
> > > D|A
> > > E|B
> > > C|G
> > >
> > > Then for each row, shuffle the contents around
> > randomly...
> > >
> > > D|A
> > > B|E
> > > G|C
> > >
> > > Then pivot the data back...
> > >
> > > A|E|C
> > > D|B|C
> > >
> > > You can reference my progress so far...
> > >
> > >
> > http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
> > >
> > > Terrence A. Pietrondi
> > >
> > >
> > > --- On Thu, 10/2/08, Alex Loddengaard
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard
> > <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Thursday, October 2, 200

Re: architecture diagram

2008-10-03 Thread Terrence A. Pietrondi
Sorry for the confusion, I did make some typos. My example should have looked 
like... 

> A|B|C
> D|E|G
>
> pivots to...
>
> D|A
> E|B
> G|C
>
> Then for each row, shuffle the contents around randomly...
>
> D|A
> B|E
> C|G
>
> Then pivot the data back...
>
> A|E|G
> D|B|C

The general goal is to shuffle the elements in each column of the input data.
Meaning, the ordering of the elements in each column will not be the same as
in the input.

If you look at the initial input and compare to the final output, you'll see 
that during the shuffling, B and E are swapped, and G and C are swapped, while 
A and D were shuffled back into their originating positions in the column. 

Once again, sorry for the typos and confusion.

Terrence A. Pietrondi

--- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> From: Alex Loddengaard <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Friday, October 3, 2008, 11:01 AM
> Can you confirm that the example you've presented is
> accurate?  I think you
> may have made some typos, because the letter "G"
> isn't in the final result;
> I also think your first pivot accidentally swapped C and G.
>  I'm having a
> hard time understanding what you want to do, because it
> seems like your
> operations differ from your example.
> 
> With that said, at first glance, this problem may not fit
> well in to the
> MapReduce paradigm.  The reason I'm making this claim
> is because in order to
> do the pivot operation you must know about every row.  Your
> input files will
> be split at semi-arbitrary places, essentially making it
> impossible for each
> mapper to know every single row.  There may be a way to do
> this by
> collecting, in your map step, key => column number (0,
> 1, 2, etc) and value
> => (A, B, C, etc), though you may run in to problems
> when you try to pivot
> back.  I say this because when you pivot back, you need to
> have each column,
> which means you'll need one reduce step.  There may be
> a way to put the
> pivot-back operation in a second iteration, though I
> don't think that would
> help you.
> 
> Terrence, please confirm that you've defined your
> example correctly.  In the
> meantime, can someone else confirm that this problem does
> not fit well into
> the MapReduce paradigm?
> 
> Alex
> 
> On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi <
> [EMAIL PROTECTED]> wrote:
> 
> > I am trying to write a map reduce implementation to do
> the following:
> >
> > 1) read tabular data delimited in some fashion
> > 2) pivot that data, so the rows are columns and the
> columns are rows
> > 3) shuffle the rows (that were the columns) to
> randomize the data
> > 4) pivot the data back
> >
> > For example.
> >
> > A|B|C
> > D|E|G
> >
> > pivots to...
> >
> > D|A
> > E|B
> > C|G
> >
> > Then for each row, shuffle the contents around
> randomly...
> >
> > D|A
> > B|E
> > G|C
> >
> > Then pivot the data back...
> >
> > A|E|C
> > D|B|C
> >
> > You can reference my progress so far...
> >
> >
> http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
> >
> > Terrence A. Pietrondi
> >
> >
> > --- On Thu, 10/2/08, Alex Loddengaard
> <[EMAIL PROTECTED]> wrote:
> >
> > > From: Alex Loddengaard
> <[EMAIL PROTECTED]>
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Thursday, October 2, 2008, 1:36 PM
> > > I think it really depends on the job as to where
> logic goes.
> > >  Sometimes your
> > > reduce step is as simple as an identity function,
> and
> > > sometimes it can be
> > > more complex than your map step.  It all depends
> on your
> > > data and the
> > > operation(s) you're trying to perform.
> > >
> > > Perhaps we should step out of the abstract.  Do
> you have a
> > > specific problem
> > > you're trying to solve?  Can you describe it?
> > >
> > > Alex
> > >
> > > On Thu, Oct 2, 2008 at 4:55 AM, Terrence A.
> Pietrondi
> > > <[EMAIL PROTECTED]
> > > > wrote:
> > >
> > > > I am sorry for the confusion. I meant
> distributed
> > > data.
> > > >
> > > > So help me out here. For example, if I am
> reducing to
> > > a single file, then
> > >

Re: architecture diagram

2008-10-03 Thread Alex Loddengaard
Can you confirm that the example you've presented is accurate?  I think you
may have made some typos, because the letter "G" isn't in the final result;
I also think your first pivot accidentally swapped C and G.  I'm having a
hard time understanding what you want to do, because it seems like your
operations differ from your example.

With that said, at first glance, this problem may not fit well into the
MapReduce paradigm.  The reason I'm making this claim is that in order to
do the pivot operation you must know about every row.  Your input files will
be split at semi-arbitrary places, essentially making it impossible for each
mapper to know every single row.  There may be a way to do this by
collecting, in your map step, key => column number (0, 1, 2, etc.) and value
=> (A, B, C, etc.), though you may run into problems when you try to pivot
back.  I say this because when you pivot back, you need to have each column,
which means you'll need one reduce step.  There may be a way to put the
pivot-back operation in a second iteration, though I don't think that would
help you.

Terrence, please confirm that you've defined your example correctly.  In the
meantime, can someone else confirm that this problem does not fit well into
the MapReduce paradigm?

Alex

On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi <
[EMAIL PROTECTED]> wrote:

> I am trying to write a map reduce implementation to do the following:
>
> 1) read tabular data delimited in some fashion
> 2) pivot that data, so the rows are columns and the columns are rows
> 3) shuffle the rows (that were the columns) to randomize the data
> 4) pivot the data back
>
> For example.
>
> A|B|C
> D|E|G
>
> pivots to...
>
> D|A
> E|B
> C|G
>
> Then for each row, shuffle the contents around randomly...
>
> D|A
> B|E
> G|C
>
> Then pivot the data back...
>
> A|E|C
> D|B|C
>
> You can reference my progress so far...
>
> http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
>
> Terrence A. Pietrondi
>
>
> --- On Thu, 10/2/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Thursday, October 2, 2008, 1:36 PM
> > I think it really depends on the job as to where logic goes.
> >  Sometimes your
> > reduce step is as simple as an identity function, and
> > sometimes it can be
> > more complex than your map step.  It all depends on your
> > data and the
> > operation(s) you're trying to perform.
> >
> > Perhaps we should step out of the abstract.  Do you have a
> > specific problem
> > you're trying to solve?  Can you describe it?
> >
> > Alex
> >
> > On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi
> > <[EMAIL PROTECTED]
> > > wrote:
> >
> > > I am sorry for the confusion. I meant distributed
> > data.
> > >
> > > So help me out here. For example, if I am reducing to
> > a single file, then
> > > my main transformation logic would be in my mapping
> > step since I am reducing
> > > away from the data?
> > >
> > > Terrence A. Pietrondi
> > > http://del.icio.us/tepietrondi
> > >
> > >
> > > --- On Wed, 10/1/08, Alex Loddengaard
> > <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard
> > <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Wednesday, October 1, 2008, 7:44 PM
> > > > I'm not sure what you mean by
> > "disconnected parts
> > > > of data," but Hadoop is
> > > > implemented to try and perform map tasks on
> > machines that
> > > > have input data.
> > > > This is to lower the amount of network traffic,
> > hence
> > > > making the entire job
> > > > run faster.  Hadoop does all this for you under
> > the hood.
> > > > From a user's
> > > > point of view, all you need to do is store data
> > in HDFS
> > > > (the distributed
> > > > filesystem), and run MapReduce jobs on that data.
> >  Take a
> > > > look here:
> > > >
> > > > <http://wiki.apache.org/hadoop/WordCount>
> > > >
> > > > Alex
> > > >
> > > > On Wed, Oct 1, 2008 at 1:11 PM, Terrence A.
> > Pietrondi
> > > > <[EMAIL PROTECTED]

Re: architecture diagram

2008-10-02 Thread Terrence A. Pietrondi
I am trying to write a map reduce implementation to do the following:

1) read tabular data delimited in some fashion
2) pivot that data, so the rows are columns and the columns are rows
3) shuffle the rows (that were the columns) to randomize the data
4) pivot the data back 

For example:

A|B|C
D|E|G

pivots to...

D|A
E|B
C|G

Then for each row, shuffle the contents around randomly...

D|A
B|E
G|C

Then pivot the data back...

A|E|C
D|B|C
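
For reference, the whole transformation on a single machine (plain Java, no
Hadoop) is just a couple of loops; the hard part is doing it across splits:

    import java.util.Arrays;
    import java.util.Collections;

    public class PivotShuffleDemo {
      // Transpose: rows become columns and columns become rows.
      static String[][] pivot(String[][] t) {
        String[][] p = new String[t[0].length][t.length];
        for (int r = 0; r < t.length; r++)
          for (int c = 0; c < t[0].length; c++)
            p[c][r] = t[r][c];
        return p;
      }

      public static void main(String[] args) {
        String[][] table = {{"A", "B", "C"}, {"D", "E", "G"}};
        String[][] pivoted = pivot(table);            // step 2: pivot
        for (String[] row : pivoted)
          Collections.shuffle(Arrays.asList(row));    // step 3: shuffle each row in place
        for (String[] row : pivot(pivoted))           // step 4: pivot back
          System.out.println(String.join("|", row));  // each original column is now shuffled
      }
    }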

You can reference my progress so far...

http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/

Terrence A. Pietrondi


--- On Thu, 10/2/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> From: Alex Loddengaard <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Thursday, October 2, 2008, 1:36 PM
> I think it really depends on the job as to where logic goes.
>  Sometimes your
> reduce step is as simple as an identity function, and
> sometimes it can be
> more complex than your map step.  It all depends on your
> data and the
> operation(s) you're trying to perform.
> 
> Perhaps we should step out of the abstract.  Do you have a
> specific problem
> you're trying to solve?  Can you describe it?
> 
> Alex
> 
> On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi
> <[EMAIL PROTECTED]
> > wrote:
> 
> > I am sorry for the confusion. I meant distributed
> data.
> >
> > So help me out here. For example, if I am reducing to
> a single file, then
> > my main transformation logic would be in my mapping
> step since I am reducing
> > away from the data?
> >
> > Terrence A. Pietrondi
> > http://del.icio.us/tepietrondi
> >
> >
> > --- On Wed, 10/1/08, Alex Loddengaard
> <[EMAIL PROTECTED]> wrote:
> >
> > > From: Alex Loddengaard
> <[EMAIL PROTECTED]>
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Wednesday, October 1, 2008, 7:44 PM
> > > I'm not sure what you mean by
> "disconnected parts
> > > of data," but Hadoop is
> > > implemented to try and perform map tasks on
> machines that
> > > have input data.
> > > This is to lower the amount of network traffic,
> hence
> > > making the entire job
> > > run faster.  Hadoop does all this for you under
> the hood.
> > > From a user's
> > > point of view, all you need to do is store data
> in HDFS
> > > (the distributed
> > > filesystem), and run MapReduce jobs on that data.
>  Take a
> > > look here:
> > >
> > > <http://wiki.apache.org/hadoop/WordCount>
> > >
> > > Alex
> > >
> > > On Wed, Oct 1, 2008 at 1:11 PM, Terrence A.
> Pietrondi
> > > <[EMAIL PROTECTED]
> > > > wrote:
> > >
> > > > So to be "distributed" in a sense,
> you would
> > > want to do your computation on
> > > > the disconnected parts of data in the map
> phase I
> > > would guess?
> > > >
> > > > Terrence A. Pietrondi
> > > > http://del.icio.us/tepietrondi
> > > >
> > > >
> > > > --- On Wed, 10/1/08, Arun C Murthy
> > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > From: Arun C Murthy
> <[EMAIL PROTECTED]>
> > > > > Subject: Re: architecture diagram
> > > > > To: core-user@hadoop.apache.org
> > > > > Date: Wednesday, October 1, 2008, 2:16
> PM
> > > > > On Oct 1, 2008, at 10:17 AM, Terrence
> A.
> > > Pietrondi wrote:
> > > > >
> > > > > > I am trying to plan out my
> map-reduce
> > > implementation
> > > > > and I have some
> > > > > > questions of where computation
> should be
> > > split in
> > > > > order to take
> > > > > > advantage of the distributed
> nodes.
> > > > > >
> > > > > > Looking at the architecture
> diagram
> > > > >
> > >
> (http://hadoop.apache.org/core/images/architecture.gif
> > > > > > ), are the map boxes the major
> computation
> > > areas or is
> > > > > the reduce
> > > > > > the major computation area?
> > > > > >
> > > > >
> > > > > Usually the maps perform the
> 'embarrassingly
> > > > > parallel' computational
> > > > > steps where-in each map works
> independently on a
> > > > > 'split' on your input
> > > > > and the reduces perform the
> 'aggregate'
> > > > > computations.
> > > > >
> > > > >  From http://hadoop.apache.org/core/ :
> > > > >
> > > > > Hadoop implements MapReduce, using the
> Hadoop
> > > Distributed
> > > > > File System
> > > > > (HDFS). MapReduce divides applications
> into many
> > > small
> > > > > blocks of work.
> > > > > HDFS creates multiple replicas of data
> blocks for
> > > > > reliability, placing
> > > > > them on compute nodes around the
> cluster.
> > > MapReduce can
> > > > > then process
> > > > > the data where it is located.
> > > > >
> > > > > The Hadoop Map-Reduce framework is
> quite good at
> > > scheduling
> > > > > your
> > > > > 'maps' on the actual data-nodes
> where the
> > > > > input-blocks are present,
> > > > > leading to i/o efficiencies...
> > > > >
> > > > > Arun
> > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Terrence A. Pietrondi
> > > > > >
> > > > > >
> > > > > >
> > > >
> > > >
> > > >
> > > >
> >
> >
> >
> >


  


Re: architecture diagram

2008-10-02 Thread Alex Loddengaard
I think it really depends on the job as to where logic goes.  Sometimes your
reduce step is as simple as an identity function, and sometimes it can be
more complex than your map step.  It all depends on your data and the
operation(s) you're trying to perform.

Perhaps we should step out of the abstract.  Do you have a specific problem
you're trying to solve?  Can you describe it?

Alex
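
A minimal sketch of such an identity reduce, assuming the old
org.apache.hadoop.mapred API current at the time (the stock
org.apache.hadoop.mapred.lib.IdentityReducer covers the same case);
the class name here is hypothetical:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Forwards every key/value pair unchanged; all the real work is
// assumed to have happened in the map step.
public class PassThroughReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}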



Re: architecture diagram

2008-10-02 Thread Terrence A. Pietrondi
I am sorry for the confusion. I meant distributed data. 

So help me out here. For example, if I am reducing to a single file, then my 
main transformation logic would be in my mapping step since I am reducing away 
from the data?

Terrence A. Pietrondi
http://del.icio.us/tepietrondi




Re: architecture diagram

2008-10-01 Thread Alex Loddengaard
I'm not sure what you mean by "disconnected parts of data," but Hadoop is
implemented to try and perform map tasks on machines that have input data.
This is to lower the amount of network traffic, hence making the entire job
run faster.  Hadoop does all this for you under the hood.  From a user's
point of view, all you need to do is store data in HDFS (the distributed
filesystem), and run MapReduce jobs on that data.  Take a look here:

<http://wiki.apache.org/hadoop/WordCount>

Alex
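
As a sketch of that user's-eye view (assuming the old
org.apache.hadoop.mapred API; the WordCountMapper and WordCountReducer
class names are hypothetical, sketched further below), the driver only
names the HDFS paths and the map/reduce classes - placement is the
framework's job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountDriver.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(WordCountMapper.class);
    conf.setReducerClass(WordCountReducer.class);

    // Both paths are in HDFS; Hadoop schedules maps near the input blocks.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}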



Re: architecture diagram

2008-10-01 Thread Terrence A. Pietrondi
So to be "distributed" in a sense, you would want to do your computation on the 
disconnected parts of data in the map phase I would guess?

Terrence A. Pietrondi
http://del.icio.us/tepietrondi




Re: architecture diagram

2008-10-01 Thread Arun C Murthy


Usually the maps perform the 'embarrassingly parallel' computational
steps, wherein each map works independently on a 'split' of your input,
and the reduces perform the 'aggregate' computations.

From http://hadoop.apache.org/core/ :

Hadoop implements MapReduce, using the Hadoop Distributed File System
(HDFS). MapReduce divides applications into many small blocks of work.
HDFS creates multiple replicas of data blocks for reliability, placing
them on compute nodes around the cluster. MapReduce can then process
the data where it is located.

The Hadoop Map-Reduce framework is quite good at scheduling your
'maps' on the actual data-nodes where the input-blocks are present,
leading to i/o efficiencies...

Arun
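
To make the split/aggregate division concrete, a word-count-style
sketch against the old org.apache.hadoop.mapred API (the same
hypothetical class names used in the driver sketch earlier on this
page):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// The 'embarrassingly parallel' step: each map call sees one line of
// one split and emits (word, 1) pairs with no knowledge of other splits.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer itr = new StringTokenizer(line.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, ONE);
    }
  }
}

// The 'aggregate' step: all counts for a given word, from every map,
// arrive at one reduce call, which sums them.
class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text word, Iterator<IntWritable> counts,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    output.collect(word, new IntWritable(sum));
  }
}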




Re: architecture diagram

2008-10-01 Thread Tim Wintle
I normally find the intermediate stage of copying data from the mappers to
the reducers to be a significant step - but that's not over the best-quality
switches...

The mappers and reducers work on the same boxes, close to the data.
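
One standard way to shrink that copy (shuffle) step - assuming the
reduce operation is associative and commutative, as a sum is - is to
also run the reducer as a combiner on each map node, so partial
aggregates cross the network instead of raw pairs. A sketch reusing
the hypothetical word-count classes from above:

JobConf conf = new JobConf(WordCountDriver.class);
conf.setMapperClass(WordCountMapper.class);
conf.setCombinerClass(WordCountReducer.class);  // local pre-aggregation after each map
conf.setReducerClass(WordCountReducer.class);   // final, global aggregation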





Re: architecture diagram

2008-10-01 Thread Alex Loddengaard
Hi Terrence,

It really depends on your job I think.  Often reduce steps can be the
bottleneck if you want a single output file (one reducer).

Hope this helps.

Alex
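
For reference, the single-output-file case is just the one-reducer
configuration - a sketch against the old JobConf API, with the
hypothetical driver class from above; the trade-off is that all map
output funnels through that lone reduce:

JobConf conf = new JobConf(WordCountDriver.class);
conf.setNumReduceTasks(1);  // one reducer: one output file (part-00000),
                            // and a potential bottleneck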



architecture diagram

2008-10-01 Thread Terrence A. Pietrondi
I am trying to plan out my map-reduce implementation and I have some questions 
of where computation should be split in order to take advantage of the 
distributed nodes. 

Looking at the architecture diagram 
(http://hadoop.apache.org/core/images/architecture.gif), are the map boxes the 
major computation areas or is the reduce the major computation area?

Thanks.

Terrence A. Pietrondi