Glad we could help, Terrence.  The second pivot might be tricky; you may
have to run a second iteration.  I haven't thought the problem all the way
through, though.

Good luck.

Alex

On Wed, Oct 8, 2008 at 1:02 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:

> I think I can figure this out now and get it to work. I will check back in
> if I get it. All that is missing at the moment is in my pivot back mapping
> step. Thanks for the help.
>
> Terrence A. Pietrondi
>
>
> --- On Tue, 10/7/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Tuesday, October 7, 2008, 1:55 PM
> > Thanks for the clarification, Samuel.  I wasn't aware that parts of a
> > line might be emitted depending on the split, while using
> > TextInputFormat.  Terrence, this means that you'll have to take the
> > approach of collecting key => column_count, value => column_contents in
> > your map step.
> >
> > Alex
> >
> > On Mon, Oct 6, 2008 at 6:41 PM, Samuel Guo <[EMAIL PROTECTED]> wrote:
> >
> > > I think the 'split' Alex talked about is the MapReduce system's
> > > action. The 'split' you described is your mapper's action.
> > >
> > > I guess that your map/reduce application uses *TextInputFormat* to
> > > read your input file.
> > >
> > > Your input file will first be divided into a few splits. These splits
> > > look like <filename, offset, length>. What Alex said about 'The
> > > location of these splits is semi-arbitrary' means that a file split's
> > > offset into your input file is semi-arbitrary. Am I right, Alex?
> > > Then *TextInputFormat* will translate these file splits into a
> > > sequence of lines, where the offset is treated as the key and the
> > > line is treated as the value.
> > >
> > > Because the file splits are cut by byte offset, a line in your file
> > > may be divided across two file splits. The *LineRecordReader* used by
> > > *TextInputFormat* will skip the partial line at the start of a split
> > > (the reader of the previous split finishes it), so every mapper gets
> > > whole lines, one by one.
> > >
> > > For example, a file as below:
> > > ....
> > > AAA BBB CCC DDD
> > > EEE FFF GGG HHH
> > > AAA BBB CCC DDD
> > > ....
> > >
> > > may be divided into two file splits (assuming there are two mappers):
> > >
> > > split one:
> > > ....
> > > AAA BBB CCC
> > >
> > > split two:
> > > DDD
> > > EEE FFF GGG HHH
> > > AAA BBB CCC DDD
> > > ....
> > >
> > > Take split two as an example: TextInputFormat will use
> > > LineRecordReader to translate split two into a sequence of
> > > <offset, line> pairs, skipping the leading partial line "DDD", so the
> > > sequence will be:
> > >
> > > <offset1, "EEE FFF GGG HHH">
> > > <offset2, "AAA BBB CCC DDD">
> > > ....
> > >
> > > Then what to do with the lines depends on your job.
> > >
> > >
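Samuel's point about split boundaries can be checked outside Hadoop. The sketch below is a plain-Python simulation, not Hadoop code; `read_lines_for_split` and the byte offsets are illustrative. It shows a reader skipping the partial line at the start of a non-initial split and reading past its split's end to finish the last line, so between them the two readers emit every line exactly once:

```python
def read_lines_for_split(data, start, length):
    """Mimic LineRecordReader: skip the partial line at the start of a
    non-initial split, and read past the split end to finish the last line."""
    pos = start
    if start != 0:
        # Skip the (possibly partial) first line; the previous reader
        # finishes it when it reads past its own split boundary.
        nl = data.find("\n", start)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    end = start + length
    while pos < len(data) and pos < end:
        nl = data.find("\n", pos)
        nl = len(data) if nl == -1 else nl
        lines.append((pos, data[pos:nl]))  # (offset, line) pairs
        pos = nl + 1
    return lines

data = "AAA BBB CCC DDD\nEEE FFF GGG HHH\nAAA BBB CCC DDD\n"
# Cut the file at byte 12, mid-way through the first line.
split_one = read_lines_for_split(data, 0, 12)
split_two = read_lines_for_split(data, 12, len(data) - 12)
```

Every line ends up in exactly one reader's output even though the cut landed inside a line.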
> > > On Tue, Oct 7, 2008 at 5:55 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:
> > >
> > > > So looking at the following mapper...
> > > >
> > > > http://csvdatamix.svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/src/com/datamix/pivot/PivotMapper.java?view=markup
> > > >
> > > > On line 32, you can see the row split via a delimiter. On line 43,
> > > > you can see that the field index (the column index) is the map key,
> > > > and the map value is the field contents. How is this incorrect? I
> > > > think this follows your earlier suggestion of:
> > > >
> > > > "You may want to play with the following idea: collect key =>
> > > > column_number and value => column_contents in your map step."
> > > >
> > > > Terrence A. Pietrondi
> > > >
> > > >
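The map step described above (split the row on a delimiter, emit the field index as key and the field contents as value) can be sketched as follows; this is a plain-Python stand-in for the linked Java PivotMapper, not its actual code:

```python
def pivot_map(line, delimiter="|"):
    """Emit (column_index, field_contents) pairs for one input row."""
    return list(enumerate(line.split(delimiter)))

pairs = pivot_map("A|B|C")  # [(0, 'A'), (1, 'B'), (2, 'C')]
```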
> > > > --- On Mon, 10/6/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > Subject: Re: architecture diagram
> > > > > To: core-user@hadoop.apache.org
> > > > > Date: Monday, October 6, 2008, 12:55 PM
> > > > > As far as I know, splits will never be made within a line, only
> > > > > between rows.  To answer your question about ways to control the
> > > > > splits, see below:
> > > > >
> > > > > <http://wiki.apache.org/hadoop/HowManyMapsAndReduces>
> > > > > <http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html>
> > > > >
> > > > > Alex
> > > > >
> > > > > On Mon, Oct 6, 2008 at 6:38 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > Can you explain "The location of these splits is
> > > > > > semi-arbitrary"? What if the example was...
> > > > > >
> > > > > > AAA|BBB|CCC|DDD
> > > > > > EEE|FFF|GGG|HHH
> > > > > >
> > > > > > Does this mean the split might be between CCC such that it
> > > > > > results in AAA|BBB|C and C|DDD for the first line? Is there a
> > > > > > way to control this behavior to split on my delimiter?
> > > > > >
> > > > > > Terrence A. Pietrondi
> > > > > >
> > > > > >
> > > > > > --- On Sun, 10/5/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > > > Subject: Re: architecture diagram
> > > > > > > To: core-user@hadoop.apache.org
> > > > > > > Date: Sunday, October 5, 2008, 9:26 PM
> > > > > > > Let's say you have one very large input file of the form:
> > > > > > >
> > > > > > > A|B|C|D
> > > > > > > E|F|G|H
> > > > > > > ...
> > > > > > > |1|2|3|4
> > > > > > >
> > > > > > > This input file will be broken up into N pieces, where N is
> > > > > > > the number of mappers that run.  The location of these splits
> > > > > > > is semi-arbitrary.  This means that unless you have one
> > > > > > > mapper, you won't be able to see the entire contents of a
> > > > > > > column in your mapper.  Given that you would need one mapper
> > > > > > > to be able to see the entirety of a column, you've now
> > > > > > > essentially reduced your problem to a single machine.
> > > > > > >
> > > > > > > You may want to play with the following idea: collect key =>
> > > > > > > column_number and value => column_contents in your map step.
> > > > > > > This means that you would be able to see the entirety of a
> > > > > > > column in your reduce step, though you're still faced with
> > > > > > > the tasks of shuffling and re-pivoting.
> > > > > > >
> > > > > > > Does this clear up your confusion?  Let me know if you'd like
> > > > > > > me to clarify more.
> > > > > > >
> > > > > > > Alex
> > > > > > >
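Alex's suggestion above — the map emits key => column_number, value => column_contents, so each reduce group holds one whole column — can be simulated in-process. A plain-Python sketch, where the dict-based grouping stands in for Hadoop's shuffle/sort (note Hadoop does not guarantee value order within a key; all names here are illustrative):

```python
from collections import defaultdict

rows = [["A", "B", "C", "D"], ["E", "F", "G", "H"]]

# Map: each row emits (column_number, cell) pairs, regardless of which
# split the row came from.
emitted = [(col, cell) for row in rows for col, cell in enumerate(row)]

# Shuffle/sort: Hadoop groups the map output by key before reducing.
grouped = defaultdict(list)
for col, cell in emitted:
    grouped[col].append(cell)

# Reduce: each call now sees the entirety of one column.
columns = {col: cells for col, cells in grouped.items()}
```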
> > > > > > > On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:
> > > > > > >
> > > > > > > > I am not sure why this doesn't fit, maybe you can help me
> > > > > > > > understand. Your previous comment was...
> > > > > > > >
> > > > > > > > "The reason I'm making this claim is because in order to do
> > > > > > > > the pivot operation you must know about every row. Your
> > > > > > > > input files will be split at semi-arbitrary places,
> > > > > > > > essentially making it impossible for each mapper to know
> > > > > > > > every single row."
> > > > > > > >
> > > > > > > > Are you saying that my row segments might not actually be
> > > > > > > > the entire row, so I will get a bad key index? If so, how
> > > > > > > > would the row segments be determined? I based my initial
> > > > > > > > work off of the word count example, where the lines are
> > > > > > > > tokenized. Does this mean in this example the row tokens
> > > > > > > > may not be the complete row?
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > Terrence A. Pietrondi
> > > > > > > >
> > > > > > > > --- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > > > > > > >
> > > > > > > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > Date: Friday, October 3, 2008, 7:14 PM
> > > > > > > > > The approach that you've described does not fit well into
> > > > > > > > > the MapReduce paradigm.  You may want to consider
> > > > > > > > > randomizing your data in a different way.
> > > > > > > > >
> > > > > > > > > Unfortunately some things can't be solved well with
> > > > > > > > > MapReduce, and I think this is one of them.
> > > > > > > > >
> > > > > > > > > Can someone else say more?
> > > > > > > > >
> > > > > > > > > Alex
> > > > > > > > >
> > > > > > > > > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:
> > > > > > > > >
> > > > > > > > > > Sorry for the confusion, I did make some typos. My
> > > > > > > > > > example should have looked like...
> > > > > > > > > >
> > > > > > > > > > > A|B|C
> > > > > > > > > > > D|E|G
> > > > > > > > > > >
> > > > > > > > > > > pivots to...
> > > > > > > > > > >
> > > > > > > > > > > D|A
> > > > > > > > > > > E|B
> > > > > > > > > > > G|C
> > > > > > > > > > >
> > > > > > > > > > > Then for each row, shuffle the contents around
> > > > > > > > > > > randomly...
> > > > > > > > > > >
> > > > > > > > > > > D|A
> > > > > > > > > > > B|E
> > > > > > > > > > > C|G
> > > > > > > > > > >
> > > > > > > > > > > Then pivot the data back...
> > > > > > > > > > >
> > > > > > > > > > > A|E|G
> > > > > > > > > > > D|B|C
> > > > > > > > > >
> > > > > > > > > > The general goal is to shuffle the elements in each
> > > > > > > > > > column in the input data. Meaning, the ordering of the
> > > > > > > > > > elements in each column will not be the same as in the
> > > > > > > > > > input.
> > > > > > > > > >
> > > > > > > > > > If you look at the initial input and compare to the
> > > > > > > > > > final output, you'll see that during the shuffling, B
> > > > > > > > > > and E are swapped, and G and C are swapped, while A and
> > > > > > > > > > D were shuffled back into their originating positions
> > > > > > > > > > in the column.
> > > > > > > > > >
> > > > > > > > > > Once again, sorry for the typos and confusion.
> > > > > > > > > >
> > > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > >
> > > > > > > > > > --- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > > > > > > > > >
> > > > > > > > > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > > Date: Friday, October 3, 2008, 11:01 AM
> > > > > > > > > > > Can you confirm that the example you've presented is
> > > > > > > > > > > accurate?  I think you may have made some typos,
> > > > > > > > > > > because the letter "G" isn't in the final result; I
> > > > > > > > > > > also think your first pivot accidentally swapped C
> > > > > > > > > > > and G.  I'm having a hard time understanding what you
> > > > > > > > > > > want to do, because it seems like your operations
> > > > > > > > > > > differ from your example.
> > > > > > > > > > >
> > > > > > > > > > > With that said, at first glance, this problem may not
> > > > > > > > > > > fit well into the MapReduce paradigm.  The reason I'm
> > > > > > > > > > > making this claim is because in order to do the pivot
> > > > > > > > > > > operation you must know about every row.  Your input
> > > > > > > > > > > files will be split at semi-arbitrary places,
> > > > > > > > > > > essentially making it impossible for each mapper to
> > > > > > > > > > > know every single row.  There may be a way to do this
> > > > > > > > > > > by collecting, in your map step, key => column number
> > > > > > > > > > > (0, 1, 2, etc) and value => (A, B, C, etc), though
> > > > > > > > > > > you may run in to problems when you try to pivot
> > > > > > > > > > > back.  I say this because when you pivot back, you
> > > > > > > > > > > need to have each column, which means you'll need one
> > > > > > > > > > > reduce step.  There may be a way to put the
> > > > > > > > > > > pivot-back operation in a second iteration, though I
> > > > > > > > > > > don't think that would help you.
> > > > > > > > > > >
> > > > > > > > > > > Terrence, please confirm that you've defined your
> > > > > > > > > > > example correctly.  In the meantime, can someone else
> > > > > > > > > > > confirm that this problem does not fit well into the
> > > > > > > > > > > MapReduce paradigm?
> > > > > > > > > > >
> > > > > > > > > > > Alex
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > I am trying to write a map reduce implementation
> > > > > > > > > > > > to do the following:
> > > > > > > > > > > >
> > > > > > > > > > > > 1) read tabular data delimited in some fashion
> > > > > > > > > > > > 2) pivot that data, so the rows are columns and the
> > > > > > > > > > > > columns are rows
> > > > > > > > > > > > 3) shuffle the rows (that were the columns) to
> > > > > > > > > > > > randomize the data
> > > > > > > > > > > > 4) pivot the data back
> > > > > > > > > > > >
> > > > > > > > > > > > For example.....
> > > > > > > > > > > >
> > > > > > > > > > > > A|B|C
> > > > > > > > > > > > D|E|G
> > > > > > > > > > > >
> > > > > > > > > > > > pivots to...
> > > > > > > > > > > >
> > > > > > > > > > > > D|A
> > > > > > > > > > > > E|B
> > > > > > > > > > > > C|G
> > > > > > > > > > > >
> > > > > > > > > > > > Then for each row, shuffle the contents around
> > > > > > > > > > > > randomly...
> > > > > > > > > > > >
> > > > > > > > > > > > D|A
> > > > > > > > > > > > B|E
> > > > > > > > > > > > G|C
> > > > > > > > > > > >
> > > > > > > > > > > > Then pivot the data back...
> > > > > > > > > > > >
> > > > > > > > > > > > A|E|C
> > > > > > > > > > > > D|B|C
> > > > > > > > > > > >
> > > > > > > > > > > > You can reference my progress so far...
> > > > > > > > > > > >
> > > > > > > > > > > > http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
> > > > > > > > > > > >
> > > > > > > > > > > > Terrence A. Pietrondi
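The pivot → shuffle-each-row → pivot-back sequence described above is easy to validate on one machine before distributing it. A minimal plain-Python sketch (the helper names and fixed seed are illustrative, not part of the csvdatamix code):

```python
import random

def pivot(rows):
    """Transpose: rows become columns and columns become rows."""
    return [list(col) for col in zip(*rows)]

def shuffle_columns(rows, seed=0):
    """Pivot, shuffle each (former-column) row in place, pivot back."""
    pivoted = pivot(rows)
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    for row in pivoted:
        rng.shuffle(row)
    return pivot(pivoted)

table = [line.split("|") for line in ["A|B|C", "D|E|G"]]
mixed = shuffle_columns(table)
```

Each column of the output is a permutation of the same column of the input, which is exactly the property the shuffle is meant to preserve.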
> > > > > > > > > > > > --- On Thu, 10/2/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > > > > Date: Thursday, October 2, 2008, 1:36 PM
> > > > > > > > > > > > > I think it really depends on the job as to where
> > > > > > > > > > > > > logic goes.  Sometimes your reduce step is as
> > > > > > > > > > > > > simple as an identity function, and sometimes it
> > > > > > > > > > > > > can be more complex than your map step.  It all
> > > > > > > > > > > > > depends on your data and the operation(s) you're
> > > > > > > > > > > > > trying to perform.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Perhaps we should step out of the abstract.  Do
> > > > > > > > > > > > > you have a specific problem you're trying to
> > > > > > > > > > > > > solve?  Can you describe it?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Alex
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I am sorry for the confusion. I meant
> > > > > > > > > > > > > > distributed data.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So help me out here. For example, if I am
> > > > > > > > > > > > > > reducing to a single file, then my main
> > > > > > > > > > > > > > transformation logic would be in my mapping
> > > > > > > > > > > > > > step, since I am reducing away from the data?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --- On Wed, 10/1/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > > > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > > > > > > Date: Wednesday, October 1, 2008, 7:44 PM
> > > > > > > > > > > > > > > I'm not sure what you mean by "disconnected
> > > > > > > > > > > > > > > parts of data," but Hadoop is implemented to
> > > > > > > > > > > > > > > try and perform map tasks on machines that
> > > > > > > > > > > > > > > have input data.  This is to lower the
> > > > > > > > > > > > > > > amount of network traffic, hence making the
> > > > > > > > > > > > > > > entire job run faster.  Hadoop does all this
> > > > > > > > > > > > > > > for you under the hood.  From a user's point
> > > > > > > > > > > > > > > of view, all you need to do is store data in
> > > > > > > > > > > > > > > HDFS (the distributed filesystem), and run
> > > > > > > > > > > > > > > MapReduce jobs on that data.  Take a look
> > > > > > > > > > > > > > > here:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > <http://wiki.apache.org/hadoop/WordCount>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Alex
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > > > > > > So to be "distributed" in a sense, you
> > > > > > > > > > > > > > > > would want to do your computation on the
> > > > > > > > > > > > > > > > disconnected parts of data in the map
> > > > > > > > > > > > > > > > phase, I would guess?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --- On Wed, 10/1/08, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > > > > > > > > From: Arun C Murthy <[EMAIL PROTECTED]>
> > > > > > > > > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > > > > > > > > Date: Wednesday, October 1, 2008, 2:16 PM
> > > > > > > > > > > > > > > > > On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I am trying to plan out my map-reduce
> > > > > > > > > > > > > > > > > > implementation and I have some
> > > > > > > > > > > > > > > > > > questions of where computation should
> > > > > > > > > > > > > > > > > > be split in order to take advantage of
> > > > > > > > > > > > > > > > > > the distributed nodes.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Looking at the architecture diagram
> > > > > > > > > > > > > > > > > > (http://hadoop.apache.org/core/images/architecture.gif),
> > > > > > > > > > > > > > > > > > are the map boxes the major
> > > > > > > > > > > > > > > > > > computation areas or is the reduce the
> > > > > > > > > > > > > > > > > > major computation area?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Usually the maps perform the
> > > > > > > > > > > > > > > > > 'embarrassingly parallel' computational
> > > > > > > > > > > > > > > > > steps where-in each map works
> > > > > > > > > > > > > > > > > independently on a 'split' of your input
> > > > > > > > > > > > > > > > > and the reduces perform the 'aggregate'
> > > > > > > > > > > > > > > > > computations.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > From http://hadoop.apache.org/core/ :
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hadoop implements MapReduce, using the
> > > > > > > > > > > > > > > > > Hadoop Distributed File System (HDFS).
> > > > > > > > > > > > > > > > > MapReduce divides applications into many
> > > > > > > > > > > > > > > > > small blocks of work. HDFS creates
> > > > > > > > > > > > > > > > > multiple replicas of data blocks for
> > > > > > > > > > > > > > > > > reliability, placing them on compute
> > > > > > > > > > > > > > > > > nodes around the cluster. MapReduce can
> > > > > > > > > > > > > > > > > then process the data where it is
> > > > > > > > > > > > > > > > > located.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > The Hadoop Map-Reduce framework is quite
> > > > > > > > > > > > > > > > > good at scheduling your 'maps' on the
> > > > > > > > > > > > > > > > > actual data-nodes where the input-blocks
> > > > > > > > > > > > > > > > > are present, leading to i/o
> > > > > > > > > > > > > > > > > efficiencies...
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Arun
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Terrence A. Pietrondi
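Arun's division of labor in the thread above — maps do the embarrassingly parallel per-split work, reduces do the aggregation — is exactly the word-count pattern. A minimal in-process sketch (plain Python; the split boundaries are chosen arbitrarily for illustration):

```python
from collections import Counter

splits = ["the quick brown fox", "the lazy dog the end"]

# Map: each split is processed independently ('embarrassingly parallel').
map_outputs = [Counter(split.split()) for split in splits]

# Reduce: aggregate the per-split partial counts into one result.
totals = Counter()
for partial in map_outputs:
    totals.update(partial)
```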
