Hello Aaron
Yes it makes a lot of sense! Thank you! :)

The incremental wavefront model is another option we are looking at.
Currently we have two map/reduce levels; the upper level has to wait until
the lower map/reduce has produced its entire result set. We want to avoid
this. We were thinking of using two separate clusters so that each level can
run on its own cluster - hoping to achieve better resource utilization. We
were hoping to connect the two clusters in some way so that the processes
can interact, but it seems like Hadoop is limited in that sense. I was
wondering how a common HDFS could be set up for this purpose.

I tried looking for material on synchronization between two map-reduce
clusters - there is little to no information available on the Web! If we
stick to the incremental wavefront model, then we could probably work with
one cluster.

Mithila

On Tue, Apr 7, 2009 at 7:05 PM, Aaron Kimball <aa...@cloudera.com> wrote:

> Hi Mithila,
>
> Unfortunately, Hadoop MapReduce jobs determine their inputs as soon as they
> begin; the inputs for the job are then fixed. So additional files that
> arrive in the input directory after processing has begun do not participate
> in the job.
>
> And HDFS does not currently support appends to files, so existing files
> cannot be updated.
>
> A typical way in which this sort of problem is handled is to do processing
> in incremental "wavefronts": process A generates some data which goes into
> an "incoming" directory for process B; process B starts on a timer every so
> often, collects the new input files, and works on them. After it's done, it
> moves the inputs it processed into a "done" directory. In the meantime, new
> files may have arrived. After another time interval, another round of
> process B starts. The major limitation of this model is that it requires
> your process to work incrementally, or that each round of process B emits a
> small enough volume of data that subsequent iterations can load a summary
> table of previous results into memory. Look into using the DistributedCache
> to disseminate such files.
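>
> A rough sketch of what one round of process B's driver could look like (this
> uses the org.apache.hadoop.mapred API; the class name and all paths are
> purely illustrative, and you'd set your own mapper/reducer on the JobConf):
>
>   import java.net.URI;
>   import org.apache.hadoop.filecache.DistributedCache;
>   import org.apache.hadoop.fs.FileStatus;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.mapred.FileInputFormat;
>   import org.apache.hadoop.mapred.FileOutputFormat;
>   import org.apache.hadoop.mapred.JobClient;
>   import org.apache.hadoop.mapred.JobConf;
>
>   public class ProcessBRound {
>     public static void main(String[] args) throws Exception {
>       JobConf job = new JobConf(ProcessBRound.class);
>       job.setJobName("process-B-round");
>       FileSystem fs = FileSystem.get(job);
>
>       // Pick up whatever process A has dropped off since the last round.
>       Path incoming = new Path("/pipeline/incoming");
>       Path done = new Path("/pipeline/done");
>       FileStatus[] newFiles = fs.listStatus(incoming);
>       if (newFiles == null || newFiles.length == 0) return;
>       for (FileStatus f : newFiles) {
>         FileInputFormat.addInputPath(job, f.getPath());
>       }
>       FileOutputFormat.setOutputPath(job,
>           new Path("/pipeline/out/round-" + System.currentTimeMillis()));
>
>       // Ship the (small) summary of earlier rounds to every task.
>       DistributedCache.addCacheFile(
>           new URI("/pipeline/summary/part-00000"), job);
>
>       // runJob blocks until the job finishes and throws if it fails, so if
>       // we get past it this round succeeded; retire the inputs it consumed.
>       JobClient.runJob(job);
>       for (FileStatus f : newFiles) {
>         fs.rename(f.getPath(), new Path(done, f.getPath().getName()));
>       }
>     }
>   }
>
> The point is just the incoming/done bookkeeping and the DistributedCache
> hand-off; an external timer (cron, or a loop with a sleep) kicks off each
> round.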
>
> Also, why are you using two MapReduce clusters for this, as opposed to one?
> Is there a common HDFS cluster behind them? You'll probably get much better
> performance for the overall process if the output data from one job does not
> need to be transferred to another cluster before it is further processed.
>
> Does this model make sense?
> - Aaron
>
> On Tue, Apr 7, 2009 at 1:06 AM, Mithila Nagendra <mnage...@asu.edu> wrote:
>
> > Aaron,
> > We hope to achieve a level of pipelining between the two clusters, similar
> > to how pipelining is done in executing RDB queries. You can look at it as
> > the producer-consumer problem: one cluster produces some data and the
> > other cluster consumes it. The issue that has to be dealt with here is the
> > data exchange between the clusters - synchronized interaction between the
> > map-reduce jobs on the two clusters is what I'm hoping to achieve.
> >
> > Mithila
> >
> > On Tue, Apr 7, 2009 at 10:10 AM, Aaron Kimball <aa...@cloudera.com> wrote:
> >
> > > Clusters don't really have identities beyond the addresses of the
> > > NameNodes and JobTrackers. In the example below, nn1 and nn2 are the
> > > hostnames of the namenodes of the source and destination clusters. The
> > > 8020 in each address assumes that they're on the default port.
> > >
> > > Hadoop provides no inter-task or inter-job synchronization primitives,
> > > on purpose (even within a cluster, the most you get in terms of
> > > synchronization is the ability to "join" on the status of a running job
> > > to determine that it's completed). The model is designed to be as
> > > identity-independent as possible to make it more resilient to failure.
> > > If individual jobs/tasks could lock common resources, then the
> > > intermittent failure of tasks could easily cause deadlock.
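> > >
> > > A bare-bones illustration of that "join" (the class and method names
> > > here are made up, and the two JobConfs would be configured elsewhere):
> > >
> > >   import org.apache.hadoop.mapred.JobClient;
> > >   import org.apache.hadoop.mapred.JobConf;
> > >
> > >   public class ChainedJobs {
> > >     // Waiting for one job to finish before submitting the next is the
> > >     // only synchronization the framework hands you.
> > >     public static void runInSequence(JobConf jobA, JobConf jobB)
> > >         throws Exception {
> > >       JobClient.runJob(jobA);  // blocks; throws if jobA fails
> > >       JobClient.runJob(jobB);  // jobB can now read jobA's committed output
> > >     }
> > >   }
> > >
> > > There are no locks, barriers, or shared counters across jobs beyond that.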
> > >
> > > Using a file as a "scoreboard" or other communication mechanism between
> > > multiple jobs is not something explicitly designed for, and likely to
> > > end in frustration. Can you describe the goal you're trying to
> > > accomplish? It's likely that there's another, more MapReduce-y way of
> > > looking at the job and refactoring the code to make it work more cleanly
> > > with the intended programming model.
> > >
> > > - Aaron
> > >
> > > On Mon, Apr 6, 2009 at 10:08 PM, Mithila Nagendra <mnage...@asu.edu>
> > > wrote:
> > >
> > > > Thanks! I was looking at the link sent by Philip. The copy is done
> > > > with the following command:
> > > >
> > > > hadoop distcp hdfs://nn1:8020/foo/bar \
> > > >               hdfs://nn2:8020/bar/foo
> > > >
> > > > I was wondering if nn1 and nn2 are the names of the clusters or the
> > > > names of the masters on each cluster.
> > > >
> > > > I wanted the map/reduce tasks running on each of the two clusters to
> > > > communicate with each other. I don't know if Hadoop provides for
> > > > synchronization between two map/reduce tasks. The tasks run
> > > > simultaneously, and they need to access a common file - something like
> > > > a map/reduce task at a higher level utilizing the data produced by the
> > > > map/reduce at the lower level.
> > > >
> > > > Mithila
> > > >
> > > > On Tue, Apr 7, 2009 at 7:57 AM, Owen O'Malley <omal...@apache.org> wrote:
> > > >
> > > > >
> > > > > On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote:
> > > > >
> > > > >> Hey all
> > > > >> I'm trying to connect two separate Hadoop clusters. Is it possible
> > > > >> to do so? I need data to be shuttled back and forth between the two
> > > > >> clusters. Any suggestions?
> > > > >>
> > > > >
> > > > > You should use hadoop distcp. It is a map/reduce program that copies
> > > > > data, typically from one cluster to another. If you have the hftp
> > > > > interface enabled, you can use that to copy between hdfs clusters
> > > > > that are different versions.
> > > > >
> > > > > hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar
> > > > >
> > > > > -- Owen
> > > > >
> > > >
> > >
> >
>
