Hi Mithila,

Unfortunately, Hadoop MapReduce jobs determine their inputs as soon as they
begin; from that point on, the job's input set is fixed. Additional files
that arrive in the input directory after processing has begun do not
participate in the job.
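
To see where that happens, here is a tiny sketch against the
org.apache.hadoop.mapred API (the class name and paths are made up for the
example): the input directory is expanded into a fixed list of splits at the
moment the job is submitted, and that list is never revisited.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class StaticInputExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(StaticInputExample.class);
    // Everything under /data/incoming at submission time becomes the job's
    // input; files created there afterwards are invisible to this job.
    // (Paths are illustrative; the default identity mapper/reducer is used.)
    FileInputFormat.setInputPaths(conf, new Path("/data/incoming"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
    JobClient.runJob(conf);  // splits are computed here, then fixed
  }
}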

And HDFS does not currently support appends to files, so existing files
cannot be updated.

A typical way to handle this sort of problem is to do the processing in
incremental "wavefronts": process A generates some data, which lands in an
"incoming" directory for process B; process B starts on a timer every so
often, collects the new input files, and works on them. When it's done, it
moves the inputs it processed into a "done" directory. In the meantime, new
files may have arrived; after another time interval, another round of
process B starts. The major limitation of this model is that it requires
your process to work incrementally, or that each round of process B emits a
small enough volume of data that subsequent iterations can load a summary
table of the previous results into memory. Look into using the
DistributedCache to disseminate such files.
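
To make that concrete, here is a rough sketch of one round of process B,
written against the org.apache.hadoop.mapred API. The directory layout,
class name, and job configuration are all made up for the example, not
anything Hadoop prescribes:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WavefrontRound {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WavefrontRound.class);
    FileSystem fs = FileSystem.get(conf);

    // All paths below are illustrative.
    Path incoming = new Path("/pipeline/incoming");
    Path done     = new Path("/pipeline/done");
    Path output   = new Path("/pipeline/output/" + System.currentTimeMillis());

    // Snapshot whatever process A has dropped off so far; anything that
    // arrives after this listing waits for the next round.
    FileStatus[] batch = fs.listStatus(incoming);
    if (batch == null || batch.length == 0) {
      return;                        // nothing to do this round
    }
    for (FileStatus file : batch) {
      FileInputFormat.addInputPath(conf, file.getPath());
    }
    FileOutputFormat.setOutputPath(conf, output);

    // conf.setMapperClass(...), conf.setReducerClass(...), output types, etc.
    // would be set here for the real process B job.

    JobClient.runJob(conf);          // blocks until the job succeeds or fails

    // Move only the inputs this round actually consumed into "done".
    fs.mkdirs(done);
    for (FileStatus file : batch) {
      fs.rename(file.getPath(), new Path(done, file.getPath().getName()));
    }
  }
}

You could kick a round like this off from cron at whatever interval matches
the arrival rate of process A's output. If a round needs a summary of
earlier results, write it to a well-known HDFS path and ship it to the tasks
with DistributedCache.addCacheFile() rather than having every task re-read
the previous outputs.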

Also, why are you using two MapReduce clusters for this, as opposed to one?
Is there a common HDFS cluster behind them?  You'll probably get much better
performance for the overall process if the output data from one job does not
need to be transferred to another cluster before it is further processed.

Does this model make sense?
- Aaron

On Tue, Apr 7, 2009 at 1:06 AM, Mithila Nagendra <mnage...@asu.edu> wrote:

> Aaron,
> We hope to achieve a level of pipelining between two clusters - similar to
> how pipelining is done in executing RDB queries. You can look at it as the
> producer-consumer problem: one cluster produces some data and the other
> cluster consumes it. The issue that has to be dealt with here is the data
> exchange between the clusters - synchronized interaction between the
> map-reduce jobs on the two clusters is what I'm hoping to achieve.
>
> Mithila
>
> On Tue, Apr 7, 2009 at 10:10 AM, Aaron Kimball <aa...@cloudera.com> wrote:
>
> > Clusters don't really have identities beyond the addresses of the
> > NameNodes and JobTrackers. In the example below, nn1 and nn2 are the
> > hostnames of the namenodes of the source and destination clusters. The
> > 8020 in each address assumes that they're on the default port.
> >
> > Hadoop provides no inter-task or inter-job synchronization primitives,
> > on purpose (even within a cluster, the most you get in terms of
> > synchronization is the ability to "join" on the status of a running job
> > to determine that it's completed). The model is designed to be as
> > identity-independent as possible to make it more resilient to failure.
> > If individual jobs/tasks could lock common resources, then the
> > intermittent failure of tasks could easily cause deadlock.
> >
> > Using a file as a "scoreboard" or other communication mechanism between
> > multiple jobs is not something explicitly designed for, and likely to
> > end in frustration. Can you describe the goal you're trying to
> > accomplish? It's likely that there's another, more MapReduce-y way of
> > looking at the job and refactoring the code to make it work more
> > cleanly with the intended programming model.
> >
> > - Aaron
> >
> > On Mon, Apr 6, 2009 at 10:08 PM, Mithila Nagendra <mnage...@asu.edu>
> > wrote:
> >
> > > Thanks! I was looking at the link sent by Philip. The copy is done
> > > with the following command:
> > > hadoop distcp hdfs://nn1:8020/foo/bar \
> > >               hdfs://nn2:8020/bar/foo
> > >
> > > I was wondering if nn1 and nn2 are the names of the clusters or the
> > > names of the masters on each cluster.
> > >
> > > I wanted map/reduce tasks running on each of the two clusters to
> > > communicate with each other. I don't know if Hadoop provides for
> > > synchronization between two map/reduce tasks. The tasks run
> > > simultaneously, and they need to access a common file - something
> > > like a map/reduce task at a higher level utilizing the data produced
> > > by the map/reduce at the lower level.
> > >
> > > Mithila
> > >
> > > On Tue, Apr 7, 2009 at 7:57 AM, Owen O'Malley <omal...@apache.org> wrote:
> > >
> > > >
> > > > On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote:
> > > >
> > > >> Hey all
> > > >> I'm trying to connect two separate Hadoop clusters. Is it possible
> > > >> to do so?
> > > >> I need data to be shuttled back and forth between the two clusters.
> > > >> Any suggestions?
> > > >>
> > > >
> > > > You should use hadoop distcp. It is a map/reduce program that
> > > > copies data, typically from one cluster to another. If you have the
> > > > hftp interface enabled, you can use that to copy between hdfs
> > > > clusters that are different versions.
> > > >
> > > > hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar
> > > >
> > > > -- Owen
> > > >
> > >
> >
>
