Since the MapReduce programming model defines only one "communication" point within a job (the one that occurs after all Map tasks are done and before the Reduce tasks begin), I believe any solution to your problem will come at the price of lower performance.
That said, with 10 workers I don't think you should use the "setNumReduceTasks(1)" tactic, since performance will be badly degraded. Of course, this depends a lot on N: if N is quite small and everything you output from your reduce task is going to end up on one datanode, then *maybe* your strategy can be considered. If N is really big and there is a lot of work to do before the Reducers should stop, then I'd consider communicating by storing information about the progress in DFS (the implementation will not be straightforward, though, since we don't want to hurt performance much).

Alex.

On Tue, Jan 26, 2010 at 1:22 AM, Something Something <[email protected]> wrote:

> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class
> be instantiated only on one machine.. always? I mean if I have a cluster of
> say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed to
> be instantiated only on 1 machine?
>
> If answer is yes, then I will use static variable as a counter to see how
> many rows have been added to my HBase table so far. In my use case, I want
> to write only N number of rows to a table. Is there a better way to do
> this? Please let me know. Thanks.
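For the small-N, single-reducer case discussed above, here is a minimal sketch of the counting logic in plain Java, without any Hadoop or HBase dependencies. The class name CappedWriter, its methods, and the in-memory list standing in for the HBase table are all hypothetical, chosen only for illustration. The sketch assumes the cap is enforced by one reducer instance, so an ordinary instance field is used rather than a static variable. Note that a failed task is retried and speculative execution can launch a duplicate attempt, so even with one reduce task more than one instance may run over the lifetime of a job.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper: stops emitting rows once a fixed cap (the "N" from
 * the question) is reached. In a real job this logic would live inside the
 * Reducer, and the sink would be the HBase table rather than a list.
 */
final class CappedWriter {
    private final long maxRows;                           // the cap, N
    private long written = 0;                             // per-instance counter, not static
    private final List<String> sink = new ArrayList<>();  // stand-in for the table

    CappedWriter(long maxRows) {
        if (maxRows < 0) {
            throw new IllegalArgumentException("maxRows must be >= 0");
        }
        this.maxRows = maxRows;
    }

    /** Writes the row and returns true, or returns false once the cap is hit. */
    boolean write(String row) {
        if (written >= maxRows) {
            return false;
        }
        sink.add(row);
        written++;
        return true;
    }

    long written() {
        return written;
    }

    List<String> rows() {
        return sink;
    }
}
```

With a cap of 3 and ten candidate rows, only the first three write() calls return true and the rest are skipped. The same check does not carry over to several reducers, because each instance would keep its own count; that is the situation where shared progress information (for example in DFS, as suggested above) would be needed.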
