Let's say you have one very large input file of the form:

A|B|C|D
E|F|G|H
...
1|2|3|4

This input file will be broken up into N pieces, where N is the number of
mappers that run. The location of these splits is semi-arbitrary. This means
that unless you have one mapper, no single mapper will be able to see the
entire contents of a column. Given that you would need one mapper to be able
to see the entirety of a column, you've now essentially reduced your problem
to a single machine. You may want to play with the following idea: collect
key => column_number and value => column_contents in your map step. This
means that you would be able to see the entirety of a column in your reduce
step, though you're still faced with the tasks of shuffling and re-pivoting.
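Concretely, the map and reduce steps might look something like this (a
minimal, untested sketch against the 0.18 mapred API; the class names are
made up, and the pivot back is still left undone):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ColumnShuffle {

      // Map: emit (column_number, column_contents) for every cell, so
      // that each reduce call sees the complete contents of one column.
      public static class PivotMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, IntWritable, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<IntWritable, Text> out,
                        Reporter reporter) throws IOException {
          String[] cells = line.toString().split("\\|");
          for (int col = 0; col < cells.length; col++) {
            out.collect(new IntWritable(col), new Text(cells[col]));
          }
        }
      }

      // Reduce: one call per column; shuffle the column and write it out
      // as one row of the pivoted table. Pivoting back is not handled.
      public static class ShuffleReducer extends MapReduceBase
          implements Reducer<IntWritable, Text, IntWritable, Text> {
        public void reduce(IntWritable col, Iterator<Text> values,
                           OutputCollector<IntWritable, Text> out,
                           Reporter reporter) throws IOException {
          List<String> column = new ArrayList<String>();
          while (values.hasNext()) {
            column.add(values.next().toString()); // copy; Text is reused
          }
          Collections.shuffle(column);
          StringBuilder row = new StringBuilder();
          for (int i = 0; i < column.size(); i++) {
            if (i > 0) row.append('|');
            row.append(column.get(i));
          }
          out.collect(col, new Text(row.toString()));
        }
      }
    }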
Does this clear up your confusion? Let me know if you'd like me to clarify
more.

Alex

On Sun, Oct 5, 2008 at 3:54 PM, Terrence A. Pietrondi <[EMAIL PROTECTED]>
wrote:

> I am not sure why this doesn't fit, maybe you can help me understand. Your
> previous comment was...
>
> "The reason I'm making this claim is because in order to do the pivot
> operation you must know about every row. Your input files will be split at
> semi-arbitrary places, essentially making it impossible for each mapper to
> know every single row."
>
> Are you saying that my row segments might not actually be the entire row,
> so I will get a bad key index? If so, how would the row segments be
> determined? I based my initial work on the word count example, where the
> lines are tokenized. Does this mean that in this example the row tokens
> may not be the complete row?
>
> Thanks.
>
> Terrence A. Pietrondi
>
>
> --- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Friday, October 3, 2008, 7:14 PM
> > The approach that you've described does not fit well into the
> > MapReduce paradigm. You may want to consider randomizing your data in
> > a different way.
> >
> > Unfortunately some things can't be solved well with MapReduce, and I
> > think this is one of them.
> >
> > Can someone else say more?
> >
> > Alex
> >
> > On Fri, Oct 3, 2008 at 8:15 AM, Terrence A. Pietrondi
> > <[EMAIL PROTECTED]> wrote:
> >
> > > Sorry for the confusion, I did make some typos. My example should
> > > have looked like...
> > >
> > > > A|B|C
> > > > D|E|G
> > > >
> > > > pivots to...
> > > >
> > > > D|A
> > > > E|B
> > > > G|C
> > > >
> > > > Then for each row, shuffle the contents around randomly...
> > > >
> > > > D|A
> > > > B|E
> > > > C|G
> > > >
> > > > Then pivot the data back...
> > > >
> > > > A|E|G
> > > > D|B|C
> > >
> > > The general goal is to shuffle the elements in each column of the
> > > input data. Meaning, the ordering of the elements in each column
> > > will not be the same as in the input.
> > >
> > > If you look at the initial input and compare it to the final output,
> > > you'll see that during the shuffling, B and E are swapped, and G and
> > > C are swapped, while A and D were shuffled back into their
> > > originating positions in the column.
> > >
> > > Once again, sorry for the typos and confusion.
> > >
> > > Terrence A. Pietrondi
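On a single machine, the whole pivot-shuffle-pivot sequence collapses to
shuffling each column in place; a plain-Java sketch of the transformation
the example above describes (hypothetical class name, unrelated to the SVN
code referenced further down):

    import java.util.Arrays;
    import java.util.Collections;

    public class ColumnShuffleDemo {
      public static void main(String[] args) {
        String[][] table = { { "A", "B", "C" },
                             { "D", "E", "G" } };

        // Shuffling each column in place has the same effect as
        // pivot -> shuffle each row -> pivot back.
        for (int col = 0; col < table[0].length; col++) {
          String[] column = new String[table.length];
          for (int row = 0; row < table.length; row++) {
            column[row] = table[row][col];
          }
          // Arrays.asList is backed by the array, so this shuffles it.
          Collections.shuffle(Arrays.asList(column));
          for (int row = 0; row < table.length; row++) {
            table[row][col] = column[row];
          }
        }

        // One possible output: A|E|G then D|B|C.
        for (String[] row : table) {
          StringBuilder sb = new StringBuilder();
          for (int i = 0; i < row.length; i++) {
            if (i > 0) sb.append('|');
            sb.append(row[i]);
          }
          System.out.println(sb);
        }
      }
    }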
> > >
> > > --- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Friday, October 3, 2008, 11:01 AM
> > > > Can you confirm that the example you've presented is accurate? I
> > > > think you may have made some typos, because the letter "G" isn't
> > > > in the final result; I also think your first pivot accidentally
> > > > swapped C and G. I'm having a hard time understanding what you
> > > > want to do, because it seems like your operations differ from your
> > > > example.
> > > >
> > > > With that said, at first glance, this problem may not fit well
> > > > into the MapReduce paradigm. The reason I'm making this claim is
> > > > because in order to do the pivot operation you must know about
> > > > every row. Your input files will be split at semi-arbitrary
> > > > places, essentially making it impossible for each mapper to know
> > > > every single row. There may be a way to do this by collecting, in
> > > > your map step, key => column number (0, 1, 2, etc) and value =>
> > > > (A, B, C, etc), though you may run into problems when you try to
> > > > pivot back. I say this because when you pivot back, you need to
> > > > have each column, which means you'll need one reduce step. There
> > > > may be a way to put the pivot-back operation in a second
> > > > iteration, though I don't think that would help you.
> > > >
> > > > Terrence, please confirm that you've defined your example
> > > > correctly. In the meantime, can someone else confirm that this
> > > > problem does not fit well into the MapReduce paradigm?
> > > >
> > > > Alex
> > > >
> > > > On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > I am trying to write a map reduce implementation to do the
> > > > > following:
> > > > >
> > > > > 1) read tabular data delimited in some fashion
> > > > > 2) pivot that data, so the rows are columns and the columns are
> > > > >    rows
> > > > > 3) shuffle the rows (that were the columns) to randomize the data
> > > > > 4) pivot the data back
> > > > >
> > > > > For example...
> > > > >
> > > > > A|B|C
> > > > > D|E|G
> > > > >
> > > > > pivots to...
> > > > >
> > > > > D|A
> > > > > E|B
> > > > > C|G
> > > > >
> > > > > Then for each row, shuffle the contents around randomly...
> > > > >
> > > > > D|A
> > > > > B|E
> > > > > G|C
> > > > >
> > > > > Then pivot the data back...
> > > > >
> > > > > A|E|C
> > > > > D|B|C
> > > > >
> > > > > You can reference my progress so far...
> > > > >
> > > > > http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
> > > > >
> > > > > Terrence A. Pietrondi
> > > > >
> > > > >
> > > > > --- On Thu, 10/2/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > > Subject: Re: architecture diagram
> > > > > > To: core-user@hadoop.apache.org
> > > > > > Date: Thursday, October 2, 2008, 1:36 PM
> > > > > > I think it really depends on the job as to where logic goes.
> > > > > > Sometimes your reduce step is as simple as an identity
> > > > > > function, and sometimes it can be more complex than your map
> > > > > > step. It all depends on your data and the operation(s) you're
> > > > > > trying to perform.
> > > > > >
> > > > > > Perhaps we should step out of the abstract. Do you have a
> > > > > > specific problem you're trying to solve? Can you describe it?
> > > > > >
> > > > > > Alex
> > > > > >
> > > > > > On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi
> > > > > > <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > I am sorry for the confusion. I meant distributed data.
> > > > > > >
> > > > > > > So help me out here. For example, if I am reducing to a
> > > > > > > single file, then my main transformation logic would be in
> > > > > > > my mapping step, since I am reducing away from the data?
> > > > > > >
> > > > > > > Terrence A. Pietrondi
> > > > > > > http://del.icio.us/tepietrondi
> > > > > > >
> > > > > > >
> > > > > > > --- On Wed, 10/1/08, Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > > > > Subject: Re: architecture diagram
> > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > Date: Wednesday, October 1, 2008, 7:44 PM
> > > > > > > > I'm not sure what you mean by "disconnected parts of
> > > > > > > > data," but Hadoop is implemented to try and perform map
> > > > > > > > tasks on machines that have input data. This is to lower
> > > > > > > > the amount of network traffic, hence making the entire job
> > > > > > > > run faster. Hadoop does all this for you under the hood.
> > > > > > > > From a user's point of view, all you need to do is store
> > > > > > > > data in HDFS (the distributed filesystem), and run
> > > > > > > > MapReduce jobs on that data. Take a look here:
> > > > > > > >
> > > > > > > > <http://wiki.apache.org/hadoop/WordCount>
> > > > > > > >
> > > > > > > > Alex
> > > > > > > >
> > > > > > > > On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi
> > > > > > > > <[EMAIL PROTECTED]> wrote:
> > > > > > > >
> > > > > > > > > So to be "distributed" in a sense, you would want to do
> > > > > > > > > your computation on the disconnected parts of data in
> > > > > > > > > the map phase, I would guess?
> > > > > > > > >
> > > > > > > > > Terrence A. Pietrondi
> > > > > > > > > http://del.icio.us/tepietrondi
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --- On Wed, 10/1/08, Arun C Murthy <[EMAIL PROTECTED]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > From: Arun C Murthy <[EMAIL PROTECTED]>
> > > > > > > > > > Subject: Re: architecture diagram
> > > > > > > > > > To: core-user@hadoop.apache.org
> > > > > > > > > > Date: Wednesday, October 1, 2008, 2:16 PM
> > > > > > > > > > On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I am trying to plan out my map-reduce
> > > > > > > > > > > implementation, and I have some questions about
> > > > > > > > > > > where computation should be split in order to take
> > > > > > > > > > > advantage of the distributed nodes.
> > > > > > > > > > >
> > > > > > > > > > > Looking at the architecture diagram
> > > > > > > > > > > (http://hadoop.apache.org/core/images/architecture.gif),
> > > > > > > > > > > are the map boxes the major computation areas, or is
> > > > > > > > > > > the reduce the major computation area?
> > > > > > > > > >
> > > > > > > > > > Usually the maps perform the 'embarrassingly parallel'
> > > > > > > > > > computational steps, where-in each map works
> > > > > > > > > > independently on a 'split' of your input, and the
> > > > > > > > > > reduces perform the 'aggregate' computations.
> > > > > > > > > >
> > > > > > > > > > From http://hadoop.apache.org/core/ :
> > > > > > > > > >
> > > > > > > > > > Hadoop implements MapReduce, using the Hadoop
> > > > > > > > > > Distributed File System (HDFS). MapReduce divides
> > > > > > > > > > applications into many small blocks of work. HDFS
> > > > > > > > > > creates multiple replicas of data blocks for
> > > > > > > > > > reliability, placing them on compute nodes around the
> > > > > > > > > > cluster. MapReduce can then process the data where it
> > > > > > > > > > is located.
> > > > > > > > > >
> > > > > > > > > > The Hadoop Map-Reduce framework is quite good at
> > > > > > > > > > scheduling your 'maps' on the actual data-nodes where
> > > > > > > > > > the input-blocks are present, leading to i/o
> > > > > > > > > > efficiencies...
> > > > > > > > > >
> > > > > > > > > > Arun
> > > > > > > > > >
> > > > > > > > > > > Thanks.
> > > > > > > > > > >
> > > > > > > > > > > Terrence A. Pietrondi
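Arun's split between 'embarrassingly parallel' maps and 'aggregate' reduces
is exactly the shape of the WordCount example linked earlier in the thread;
a condensed sketch of it against the 0.18-era mapred API (driver code
omitted):

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

      // "Embarrassingly parallel": each map sees only its own split.
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out,
                        Reporter reporter) throws IOException {
          StringTokenizer tokens = new StringTokenizer(line.toString());
          while (tokens.hasMoreTokens()) {
            out.collect(new Text(tokens.nextToken()), ONE);
          }
        }
      }

      // "Aggregate": each reduce sees every count emitted for one word.
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> out,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (counts.hasNext()) {
            sum += counts.next().get();
          }
          out.collect(word, new IntWritable(sum));
        }
      }
    }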