Hello Aaron

Yes, it makes a lot of sense! Thank you! :) The incremental wavefront model is another option we are looking at. Currently we have two map/reduce levels; the upper level has to wait until the lower map/reduce has produced the entire result set. We want to avoid this. We were thinking of using two separate clusters so that these levels can run on them, hoping to achieve better resource utilization. We were also hoping to connect the two clusters in some way so that the processes can interact - but it seems Hadoop is limited in that sense. I was wondering how a common HDFS system could be set up for this purpose.
I tried looking for material on synchronization between two map-reduce clusters - there is little to no material available out on the Web! If we stick to the incremental wavefront model, then we could probably work with one cluster.

Mithila

On Tue, Apr 7, 2009 at 7:05 PM, Aaron Kimball <aa...@cloudera.com> wrote:

> Hi Mithila,
>
> Unfortunately, Hadoop MapReduce jobs determine their inputs as soon as
> they begin; the inputs for the job are then fixed. So additional files
> that arrive in the input directory after processing has begun do not
> participate in the job.
>
> And HDFS does not currently support appends to files, so existing files
> cannot be updated.
>
> A typical way in which this sort of problem is handled is to do
> processing in incremental "wavefronts": process A generates some data
> which goes into an "incoming" directory for process B; process B starts
> on a timer every so often, collects the new input files, and works on
> them. After it's done, it moves the inputs it processed into a "done"
> directory. In the meantime, new files may have arrived. After another
> time interval, another round of process B starts. The major limitation
> of this model is that it requires your process to work incrementally, or
> that you emit a small enough volume of data each time in process B that
> subsequent iterations can load into memory a summary table of results
> from previous iterations. Look into using the DistributedCache to
> disseminate such files.
>
> Also, why are you using two MapReduce clusters for this, as opposed to
> one? Is there a common HDFS cluster behind them? You'll probably get
> much better performance for the overall process if the output data from
> one job does not need to be transferred to another cluster before it is
> further processed.
>
> Does this model make sense?
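The wavefront round Aaron describes can be sketched as a small driver script. This is a minimal sketch using local directories as a stand-in for HDFS - the directory names and the placeholder processing step are illustrative, not from the thread; on a real cluster the moves would be `hadoop fs -mv` and the processing step would be a MapReduce job:

```shell
#!/bin/sh
# Sketch of one "wavefront" round for process B (local-FS stand-in).
# Directory names incoming/, working/, done/ are hypothetical.
set -e
mkdir -p incoming working done

run_round() {
    # Snapshot the inputs that exist right now; files arriving
    # after this point wait for the next round.
    for f in incoming/*; do
        [ -e "$f" ] || continue   # skip if the directory was empty
        mv "$f" working/
    done
    # Placeholder for the real job, e.g.:
    #   hadoop jar processB.jar -input working -output out_$(date +%s)
    cat working/* > round_output.txt 2>/dev/null || true
    # Mark the snapshot as processed so it is not picked up again.
    for f in working/*; do
        [ -e "$f" ] || continue
        mv "$f" done/
    done
}

# Stand-in for files produced by process A:
echo alpha > incoming/part-0
echo beta > incoming/part-1
run_round
```

In the real version, `run_round` would fire from cron or a driver loop, and each round would only see whatever process A had dropped off since the previous round.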
> - Aaron
>
> On Tue, Apr 7, 2009 at 1:06 AM, Mithila Nagendra <mnage...@asu.edu> wrote:
>
> > Aaron,
> > We hope to achieve a level of pipelining between two clusters - similar
> > to how pipelining is done in executing RDB queries. You can look at it
> > as the producer-consumer problem: one cluster produces some data and
> > the other cluster consumes it. The issue that has to be dealt with here
> > is the data exchange between the clusters - synchronized interaction
> > between the map-reduce jobs on the two clusters is what I'm hoping to
> > achieve.
> >
> > Mithila
> >
> > On Tue, Apr 7, 2009 at 10:10 AM, Aaron Kimball <aa...@cloudera.com> wrote:
> >
> > > Clusters don't really have identities beyond the addresses of the
> > > NameNodes and JobTrackers. In the example below, nn1 and nn2 are the
> > > hostnames of the namenodes of the source and destination clusters.
> > > The 8020 in each address assumes that they're on the default port.
> > >
> > > Hadoop provides no inter-task or inter-job synchronization
> > > primitives, on purpose (even within a cluster, the most you get in
> > > terms of synchronization is the ability to "join" on the status of a
> > > running job to determine that it's completed). The model is designed
> > > to be as identity-independent as possible to make it more resilient
> > > to failure. If individual jobs/tasks could lock common resources,
> > > then the intermittent failure of tasks could easily cause deadlock.
> > >
> > > Using a file as a "scoreboard" or other communication mechanism
> > > between multiple jobs is not something explicitly designed for, and
> > > likely to end in frustration. Can you describe the goal you're
> > > trying to accomplish? It's likely that there's another, more
> > > MapReduce-y way of looking at the job and refactoring the code to
> > > make it work more cleanly with the intended programming model.
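The "join on the status of a running job" that Aaron mentions is often done by checking for the `_SUCCESS` marker file that Hadoop's FileOutputCommitter can write into a completed job's output directory. A minimal local-filesystem sketch of a consumer polling for that marker - paths and timings here are hypothetical, and on a real cluster the check would be something like `hadoop fs -test -e hdfs://nn1:8020/jobA/out/_SUCCESS`:

```shell
#!/bin/sh
# Sketch: consumer waits for the producer's completion marker, then reads.
set -e
OUT=job_output
mkdir -p "$OUT"

# Stand-in producer: writes its data, then touches the marker LAST,
# so the consumer never sees the marker before the data is complete.
( sleep 1; echo result > "$OUT/part-00000"; touch "$OUT/_SUCCESS" ) &

# Consumer: poll (with a timeout) until the marker appears.
tries=0
until [ -e "$OUT/_SUCCESS" ]; do
    tries=$((tries + 1))
    [ "$tries" -lt 30 ] || { echo "timed out waiting for producer" >&2; exit 1; }
    sleep 1
done
wait
cat "$OUT"/part-*
```

This keeps the two sides decoupled - no shared locks, so a failed producer task can simply be retried without any risk of the deadlock Aaron warns about.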
> > > - Aaron
> > >
> > > On Mon, Apr 6, 2009 at 10:08 PM, Mithila Nagendra <mnage...@asu.edu> wrote:
> > >
> > > > Thanks! I was looking at the link sent by Philip. The copy is done
> > > > with the following command:
> > > >
> > > >     hadoop distcp hdfs://nn1:8020/foo/bar \
> > > >         hdfs://nn2:8020/bar/foo
> > > >
> > > > I was wondering if nn1 and nn2 are the names of the clusters or the
> > > > names of the masters on each cluster.
> > > >
> > > > I wanted map/reduce tasks running on each of the two clusters to
> > > > communicate with each other. I don't know if Hadoop provides for
> > > > synchronization between two map/reduce tasks. The tasks run
> > > > simultaneously, and they need to access a common file - something
> > > > like a map/reduce task at a higher level utilizing the data
> > > > produced by the map/reduce at the lower level.
> > > >
> > > > Mithila
> > > >
> > > > On Tue, Apr 7, 2009 at 7:57 AM, Owen O'Malley <omal...@apache.org> wrote:
> > > >
> > > > > On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote:
> > > > >
> > > > > > Hey all
> > > > > > I'm trying to connect two separate Hadoop clusters. Is it
> > > > > > possible to do so? I need data to be shuttled back and forth
> > > > > > between the two clusters. Any suggestions?
> > > > >
> > > > > You should use hadoop distcp. It is a map/reduce program that
> > > > > copies data, typically from one cluster to another. If you have
> > > > > the hftp interface enabled, you can use that to copy between hdfs
> > > > > clusters that are different versions.
> > > > >
> > > > >     hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar
> > > > >
> > > > > -- Owen
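Pulling the thread's commands together, the hand-off between the two levels might look like the following fragment. The hostnames, ports, and paths are hypothetical; note that a cross-version copy over HFTP has to run on the destination cluster, since HFTP is read-only:

```shell
# Run on the destination cluster. nn1 and nn2 are NameNode hostnames,
# not cluster names; 8020 is the default NameNode RPC port.
hadoop distcp hdfs://nn1:8020/jobA/output hdfs://nn2:8020/jobB/incoming

# Across differing Hadoop versions, read the source over HFTP instead
# (HFTP is served on the NameNode's HTTP port, commonly 50070):
hadoop distcp hftp://nn1:50070/jobA/output hdfs://nn2:8020/jobB/incoming
```

Combined with the wavefront layout, the copied files would land in the lower level's "incoming" directory, ready for its next round.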