Can you confirm that the example you've presented is accurate?  I think you
may have made some typos, because the letter "G" isn't in the final result;
I also think your first pivot accidentally swapped C and G.  I'm having a
hard time understanding what you want to do, because it seems like your
operations differ from your example.
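
For what it's worth, here's a quick local sketch (plain Java, no MapReduce;
the class and method names are mine, not from your code) of the three steps
as I understand them.  Note that any consistent pivot/shuffle/pivot-back
keeps every cell, so G should still appear in the final result:

import java.util.Arrays;
import java.util.Collections;

public class PivotShuffleDemo {
    // Transpose a rectangular grid: rows become columns and vice versa.
    static String[][] pivot(String[][] grid) {
        String[][] out = new String[grid[0].length][grid.length];
        for (int r = 0; r < grid.length; r++)
            for (int c = 0; c < grid[0].length; c++)
                out[c][r] = grid[r][c];
        return out;
    }

    public static void main(String[] args) {
        String[][] grid = {{"A", "B", "C"}, {"D", "E", "G"}};
        String[][] pivoted = pivot(grid);
        // Shuffle each pivoted row in place; Arrays.asList is a live view
        // of the row array, so the shuffle writes straight through to it.
        for (String[] row : pivoted)
            Collections.shuffle(Arrays.asList(row));
        // Pivot back and print; all six letters must still be present.
        for (String[] row : pivot(pivoted)) {
            StringBuilder sb = new StringBuilder();
            for (int c = 0; c < row.length; c++) {
                if (c > 0) sb.append('|');
                sb.append(row[c]);
            }
            System.out.println(sb);
        }
    }
}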

With that said, at first glance this problem may not fit well into the
MapReduce paradigm.  I make this claim because the pivot operation requires
knowing about every row, and your input files will be split at
semi-arbitrary places, essentially making it impossible for each mapper to
know every single row.  There may be a way to do this by collecting, in
your map step, key => column number (0, 1, 2, etc.) and value => the cell
contents (A, B, C, etc.); see the sketch below.  You may run into problems
when you try to pivot back, though, because pivoting back requires having
each complete column, which means you'll need a single reduce step.  There
may be a way to put the pivot-back operation in a second MapReduce
iteration, though I don't think that would help you.
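
Something like this (untested) mapper is what I have in mind, using the
old org.apache.hadoop.mapred API; the class name is mine, and it assumes a
single pipe-delimited input file so that byte offsets give a global row
order:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emit (column number -> "offset:cell").  Each reduce key then holds one
// full column of the original data, i.e. one row of the pivoted data, and
// the byte-offset tag lets the reducer sort the cells back into row order.
public class PivotMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<IntWritable, Text> output,
                    Reporter reporter) throws IOException {
        String[] cells = line.toString().split("\\|");
        for (int col = 0; col < cells.length; col++) {
            output.collect(new IntWritable(col),
                           new Text(offset.get() + ":" + cells[col]));
        }
    }
}

The matching reducer would sort each key's values by the offset tag,
shuffle them, and emit the shuffled column; the pivot-back is where I
expect trouble, since it again needs every (shuffled) column in one place.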

Terrence, please confirm that you've defined your example correctly.  In
the meantime, can someone else confirm that this problem does not fit well
into the MapReduce paradigm?

Alex

On Thu, Oct 2, 2008 at 10:48 AM, Terrence A. Pietrondi
<[EMAIL PROTECTED]> wrote:

> I am trying to write a map reduce implementation to do the following:
>
> 1) read tabular data delimited in some fashion
> 2) pivot that data, so the rows are columns and the columns are rows
> 3) shuffle the rows (that were the columns) to randomize the data
> 4) pivot the data back
>
> For example.....
>
> A|B|C
> D|E|G
>
> pivots to...
>
> D|A
> E|B
> C|G
>
> Then for each row, shuffle the contents around randomly...
>
> D|A
> B|E
> G|C
>
> Then pivot the data back...
>
> A|E|C
> D|B|C
>
> You can reference my progress so far...
>
> http://svn.sourceforge.net/viewvc/csvdatamix/branches/datamix_mapreduce/
>
> Terrence A. Pietrondi
>
>
> --- On Thu, 10/2/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
>
> > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > Subject: Re: architecture diagram
> > To: core-user@hadoop.apache.org
> > Date: Thursday, October 2, 2008, 1:36 PM
> > I think it really depends on the job as to where the logic goes.
> > Sometimes your reduce step is as simple as an identity function, and
> > sometimes it can be more complex than your map step.  It all depends on
> > your data and the operation(s) you're trying to perform.
> >
> > Perhaps we should step out of the abstract.  Do you have a specific
> > problem you're trying to solve?  Can you describe it?
> >
> > Alex
> >
> > On Thu, Oct 2, 2008 at 4:55 AM, Terrence A. Pietrondi
> > <[EMAIL PROTECTED]> wrote:
> >
> > > I am sorry for the confusion. I meant distributed data.
> > >
> > > So help me out here. For example, if I am reducing to a single file,
> > > then my main transformation logic would be in my mapping step, since
> > > I am reducing away from the data?
> > >
> > > Terrence A. Pietrondi
> > > http://del.icio.us/tepietrondi
> > >
> > >
> > > --- On Wed, 10/1/08, Alex Loddengaard <[EMAIL PROTECTED]> wrote:
> > >
> > > > From: Alex Loddengaard <[EMAIL PROTECTED]>
> > > > Subject: Re: architecture diagram
> > > > To: core-user@hadoop.apache.org
> > > > Date: Wednesday, October 1, 2008, 7:44 PM
> > > > I'm not sure what you mean by "disconnected parts of data," but
> > > > Hadoop is implemented to try and perform map tasks on machines that
> > > > have input data.  This is to lower the amount of network traffic,
> > > > hence making the entire job run faster.  Hadoop does all this for
> > > > you under the hood.  From a user's point of view, all you need to
> > > > do is store data in HDFS (the distributed filesystem), and run
> > > > MapReduce jobs on that data.  Take a look here:
> > > >
> > > > <http://wiki.apache.org/hadoop/WordCount>
> > > >
> > > > Alex
> > > >
> > > > On Wed, Oct 1, 2008 at 1:11 PM, Terrence A. Pietrondi
> > > > <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > So to be "distributed" in a sense, you would want to do your
> > > > > computation on the disconnected parts of data in the map phase,
> > > > > I would guess?
> > > > >
> > > > > Terrence A. Pietrondi
> > > > > http://del.icio.us/tepietrondi
> > > > >
> > > > >
> > > > > --- On Wed, 10/1/08, Arun C Murthy <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > > From: Arun C Murthy <[EMAIL PROTECTED]>
> > > > > > Subject: Re: architecture diagram
> > > > > > To: core-user@hadoop.apache.org
> > > > > > Date: Wednesday, October 1, 2008, 2:16 PM
> > > > > > On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:
> > > > > >
> > > > > > > I am trying to plan out my map-reduce implementation, and I
> > > > > > > have some questions about where computation should be split
> > > > > > > in order to take advantage of the distributed nodes.
> > > > > > >
> > > > > > > Looking at the architecture diagram
> > > > > > > (http://hadoop.apache.org/core/images/architecture.gif), are
> > > > > > > the map boxes the major computation areas, or is the reduce
> > > > > > > the major computation area?
> > > > > >
> > > > > > Usually the maps perform the 'embarrassingly parallel'
> > > > > > computational steps, wherein each map works independently on a
> > > > > > 'split' of your input, and the reduces perform the 'aggregate'
> > > > > > computations.
> > > > > >
> > > > > > From http://hadoop.apache.org/core/ :
> > > > > >
> > > > > > Hadoop implements MapReduce, using the Hadoop Distributed File
> > > > > > System (HDFS). MapReduce divides applications into many small
> > > > > > blocks of work. HDFS creates multiple replicas of data blocks
> > > > > > for reliability, placing them on compute nodes around the
> > > > > > cluster. MapReduce can then process the data where it is
> > > > > > located.
> > > > > >
> > > > > > The Hadoop Map-Reduce framework is quite good at scheduling
> > > > > > your 'maps' on the actual data-nodes where the input-blocks are
> > > > > > present, leading to I/O efficiencies...
> > > > > >
> > > > > > Arun
> > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > Terrence A. Pietrondi
