Sorry for the confusion, I did make some typos. My example should have looked 
like... 

> A|B|C
> D|E|G
>
> pivots too...
>
> D|A
> E|B
> G|C
>
> Then for each row, shuffle the contents around randomly...
>
> D|A
> B|E
> C|G
>
> Then pivot the data back...
>
> A|E|G
> D|B|C

The general goal is to shuffle the elements in each column of the input data. 
That is, the ordering of the elements within each column of the output will 
not be the same as in the input.

If you look at the initial input and compare it to the final output, you'll 
see that during the shuffling, B and E were swapped and G and C were swapped, 
while A and D happened to be shuffled back into their original positions in 
their column.
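
To make the intended behavior concrete, here is the whole transformation in
plain Java (no Hadoop, just the in-memory semantics). This is only a sketch;
the class and method names are made up for illustration:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ColumnShuffle {
    // Shuffle each column independently: pivot, shuffle each pivoted
    // row, then pivot back.
    static String[][] shuffleColumns(String[][] rows) {
        int numRows = rows.length, numCols = rows[0].length;

        // Pivot: column i of the input becomes row i of the pivoted table.
        String[][] pivoted = new String[numCols][numRows];
        for (int r = 0; r < numRows; r++)
            for (int c = 0; c < numCols; c++)
                pivoted[c][r] = rows[r][c];

        // Shuffle the contents of each pivoted row (an original column).
        for (String[] row : pivoted) {
            List<String> cells = new ArrayList<String>(Arrays.asList(row));
            Collections.shuffle(cells);
            cells.toArray(row);
        }

        // Pivot back to the original orientation.
        String[][] out = new String[numRows][numCols];
        for (int r = 0; r < numRows; r++)
            for (int c = 0; c < numCols; c++)
                out[r][c] = pivoted[c][r];
        return out;
    }

    public static void main(String[] args) {
        String[][] input = {{"A", "B", "C"}, {"D", "E", "G"}};
        for (String[] row : shuffleColumns(input))
            System.out.println(String.join("|", row));
    }
}

Running this on the input above prints two pipe-delimited rows whose columns
are permutations of the input columns.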

Once again, sorry for the typos and confusion.

Terrence A. Pietrondi

--- On Fri, 10/3/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:

> From: Alex Loddengaard <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Friday, October 3, 2008, 11:01 AM
> Can you confirm that the example you've presented is accurate?  I
> think you may have made some typos, because the letter "G" isn't in
> the final result; I also think your first pivot accidentally swapped
> C and G.  I'm having a hard time understanding what you want to do,
> because it seems like your operations differ from your example.
> 
> With that said, at first glance, this problem may not fit well into
> the MapReduce paradigm.  The reason I'm making this claim is that in
> order to do the pivot operation you must know about every row.  Your
> input files will be split at semi-arbitrary places, essentially
> making it impossible for each mapper to know every single row.  There
> may be a way to do this by collecting, in your map step, key =>
> column number (0, 1, 2, etc.) and value => (A, B, C, etc.), though
> you may run into problems when you try to pivot back.  I say this
> because when you pivot back, you need to have each column, which
> means you'll need one reduce step.  There may be a way to put the
> pivot-back operation in a second iteration, though I don't think that
> would help you.
> 
> Terrence, please confirm that you've defined your example correctly.
> In the meantime, can someone else confirm that this problem does not
> fit well into the MapReduce paradigm?
> 
> Alex
> 
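
For what it's worth, below is a rough, untested sketch of the map and reduce
steps Alex describes, written against the old org.apache.hadoop.mapred API
and assuming pipe-delimited input (the class names are made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class PivotShuffle {

    // Map: key => column number, value => the cell in that column.
    public static class PivotMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, IntWritable, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<IntWritable, Text> output,
                        Reporter reporter) throws IOException {
            String[] cells = line.toString().split("\\|");
            for (int col = 0; col < cells.length; col++) {
                output.collect(new IntWritable(col), new Text(cells[col]));
            }
        }
    }

    // Reduce: called once per column; gathers the column's cells,
    // shuffles them, and emits the shuffled column as one
    // pipe-delimited row of the pivoted table.
    public static class ShuffleReducer extends MapReduceBase
            implements Reducer<IntWritable, Text, IntWritable, Text> {
        public void reduce(IntWritable col, Iterator<Text> values,
                           OutputCollector<IntWritable, Text> output,
                           Reporter reporter) throws IOException {
            List<String> cells = new ArrayList<String>();
            while (values.hasNext()) {
                cells.add(values.next().toString());
            }
            Collections.shuffle(cells);
            StringBuilder row = new StringBuilder();
            for (int i = 0; i < cells.size(); i++) {
                if (i > 0) row.append('|');
                row.append(cells.get(i));
            }
            output.collect(col, new Text(row.toString()));
        }
    }
}

Note that the row order within a column is not preserved on the way into the
reducer. That doesn't matter for the shuffle itself, but it is exactly why
the pivot-back step is the hard part.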
> On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi
> <[EMAIL PROTECTED]> wrote:
> 
> > I am trying to write a map reduce implementation to do the
> > following:
> >
> > 1) read tabular data delimited in some fashion
> > 2) pivot that data, so the rows are columns and the columns are rows
> > 3) shuffle the rows (that were the columns) to randomize the data
> > 4) pivot the data back
> > For example.....
> >
> > A|B|C
> > D|E|G
> >
> > pivots too...
> >
> > D|A
> > E|B
> > C|G
> >
> > Then for each row, shuffle the contents around randomly...
> >
> > D|A
> > B|E
> > G|C
> >
> > Then pivot the data back...
> >
> > A|E|C
> > D|B|C
> >
> > You can reference my progress so far...
> >
> >
> > http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
> >
> > Terrence A. Pietrondi
> >
> >
> > --- On Thu, 10/2/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> >
> > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > Subject: Re: architecture diagram
> > > To: core-user@hadoop.apache.org
> > > Date: Thursday, October 2, 2008, 1:36 PM
> > > I think it really depends on the job as to where logic goes.
> > > Sometimes your reduce step is as simple as an identity function,
> > > and sometimes it can be more complex than your map step.  It all
> > > depends on your data and the operation(s) you're trying to
> > > perform.
> > >
> > > Perhaps we should step out of the abstract.  Do you have a
> > > specific problem you're trying to solve?  Can you describe it?
> > >
> > > Alex
> > >
> > > On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi
> > > <[EMAIL PROTECTED]> wrote:
> > >
> > > > I am sorry for the confusion. I meant distributed data.
> > > >
> > > > So help me out here. For example, if I am reducing to a single
> > > > file, then my main transformation logic would be in my mapping
> > > > step since I am reducing away from the data?
> > > >
> > > > Terrence A. Pietrondi
> > > > http://del.icio.us/tepietrondi
> > > >
> > > >
> > > > --- On Wed, 10/1/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > > Subject: Re: architecture diagram
> > > > > To: core-user@hadoop.apache.org
> > > > > Date: Wednesday, October 1, 2008, 7:44 PM
> > > > > I'm not sure what you mean by "disconnected parts of data,"
> > > > > but Hadoop is implemented to try and perform map tasks on
> > > > > machines that have input data.  This is to lower the amount
> > > > > of network traffic, hence making the entire job run faster.
> > > > > Hadoop does all this for you under the hood.  From a user's
> > > > > point of view, all you need to do is store data in HDFS (the
> > > > > distributed filesystem), and run MapReduce jobs on that data.
> > > > > Take a look here:
> > > > >
> > > > > <http://wiki.apache.org/hadoop/WordCount>
> > > > >
> > > > > Alex
> > > > >
> > > > > On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi
> > > > > <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > So to be "distributed" in a sense, you would want to do
> > > > > > your computation on the disconnected parts of data in the
> > > > > > map phase, I would guess?
> > > > > >
> > > > > > Terrence A. Pietrondi
> > > > > > http://del.icio.us/tepietrondi
> > > > > >
> > > > > >
> > > > > > --- On Wed, 10/1/08, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> > > > > >
> > > > > > > From: Arun C Murthy <[EMAIL PROTECTED]>
> > > > > > > Subject: Re: architecture diagram
> > > > > > > To: core-user@hadoop.apache.org
> > > > > > > Date: Wednesday, October 1, 2008, 2:16 PM
> > > > > > > On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:
> > > > > > >
> > > > > > > > I am trying to plan out my map-reduce implementation
> > > > > > > > and I have some questions of where computation should
> > > > > > > > be split in order to take advantage of the distributed
> > > > > > > > nodes.
> > > > > > > >
> > > > > > > > Looking at the architecture diagram
> > > > > > > > (http://hadoop.apache.org/core/images/architecture.gif),
> > > > > > > > are the map boxes the major computation areas or is the
> > > > > > > > reduce the major computation area?
> > > > > > > >
> > > > > > >
> > > > > > > Usually the maps perform the 'embarrassingly parallel'
> > > > > > > computational steps, wherein each map works independently
> > > > > > > on a 'split' of your input, and the reduces perform the
> > > > > > > 'aggregate' computations.
> > > > > > >
> > > > > > > From http://hadoop.apache.org/core/ :
> > > > > > >
> > > > > > > Hadoop implements MapReduce, using the Hadoop Distributed
> > > > > > > File System (HDFS). MapReduce divides applications into
> > > > > > > many small blocks of work. HDFS creates multiple replicas
> > > > > > > of data blocks for reliability, placing them on compute
> > > > > > > nodes around the cluster. MapReduce can then process the
> > > > > > > data where it is located.
> > > > > > >
> > > > > > > The Hadoop Map-Reduce framework is quite good at
> > > > > > > scheduling your 'maps' on the actual data-nodes where the
> > > > > > > input-blocks are present, leading to i/o efficiencies...
> > > > > > >
> > > > > > > Arun
> > > > > > >
> > > > > > > > Thanks.
> > > > > > > >
> > > > > > > > Terrence A. Pietrondi
> > > > > > > >