So to be "distributed" in a sense, you would want to do your computation on the independent parts of the data in the map phase, I would guess?
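To make that concrete for myself, here is a minimal, Hadoop-free sketch of the pattern Arun describes below: each map task works independently on one input split, and the reduce aggregates by key. This is just a single-process simulation in Python (the function names and the word-count task are illustrative, not any Hadoop API):

```python
from collections import defaultdict

def map_phase(split):
    # Embarrassingly parallel step: word count on one split,
    # with no shared state between splits.
    for word in split.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Aggregate step: combine all values seen for one key.
    return (key, sum(values))

# Two input "splits"; in Hadoop these would live on different data-nodes.
splits = ["the quick brown fox", "the lazy dog the end"]

# Shuffle: group map outputs by key (the framework does this between phases).
grouped = defaultdict(list)
for split in splits:  # each iteration could run on a separate node
    for key, value in map_phase(split):
        grouped[key].append(value)

counts = dict(reduce_phase(k, vs) for k, vs in grouped.items())
print(counts["the"])  # 3
```

So the per-split computation lives entirely in the map, and only the aggregation needs to see data from more than one split.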
Terrence A. Pietrondi
http://del.icio.us/tepietrondi

--- On Wed, 10/1/08, Arun C Murthy <[EMAIL PROTECTED]> wrote:

> From: Arun C Murthy <[EMAIL PROTECTED]>
> Subject: Re: architecture diagram
> To: core-user@hadoop.apache.org
> Date: Wednesday, October 1, 2008, 2:16 PM
>
> On Oct 1, 2008, at 10:17 AM, Terrence A. Pietrondi wrote:
>
> > I am trying to plan out my map-reduce implementation and I have some
> > questions of where computation should be split in order to take
> > advantage of the distributed nodes.
> >
> > Looking at the architecture diagram
> > (http://hadoop.apache.org/core/images/architecture.gif),
> > are the map boxes the major computation areas or is the reduce
> > the major computation area?
>
> Usually the maps perform the 'embarrassingly parallel' computational
> steps wherein each map works independently on a 'split' of your input,
> and the reduces perform the 'aggregate' computations.
>
> From http://hadoop.apache.org/core/ :
>
> Hadoop implements MapReduce, using the Hadoop Distributed File System
> (HDFS). MapReduce divides applications into many small blocks of work.
> HDFS creates multiple replicas of data blocks for reliability, placing
> them on compute nodes around the cluster. MapReduce can then process
> the data where it is located.
>
> The Hadoop Map-Reduce framework is quite good at scheduling your
> 'maps' on the actual data-nodes where the input-blocks are present,
> leading to i/o efficiencies...
>
> Arun
>
> > Thanks.
> >
> > Terrence A. Pietrondi