Re: setNumReduceTasks(1)

Alex Baranov Wed, 27 Jan 2010 22:04:56 -0800

Since the MapReduce programming model defines only one "communication" point
between tasks (the one that occurs after all Map tasks are done and before
Reduce tasks begin), I believe any solution to your problem will
come at the price of lower performance.

That said, with 10 workers I don't think you should use the
"setNumReduceTasks(1)" tactic, since performance will degrade badly.
Of course this depends a lot on N: if N is quite small and everything
you output from your reduce task ends up on one
datanode, then *maybe* your strategy can be considered.
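
To make the single-reducer counter idea concrete, here is a minimal plain-Java
sketch. It does not use the Hadoop or HBase APIs: CappedRowWriter and its
in-memory list are hypothetical stand-ins for the one reduce task and the
target table, and the counter is an ordinary instance field, on the assumption
that only one task instance ever writes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a single reduce task; not the Hadoop Reducer API.
class CappedRowWriter {
    private final int limit;                               // N: maximum number of rows to write
    private int written = 0;                               // counter owned by this one task instance
    private final List<String> table = new ArrayList<>();  // stand-in for the HBase table

    CappedRowWriter(int limit) {
        this.limit = limit;
    }

    // Writes the row and returns true, or returns false once N rows exist.
    boolean write(String row) {
        if (written >= limit) {
            return false;
        }
        table.add(row);
        written++;
        return true;
    }

    int written() {
        return written;
    }

    List<String> rows() {
        return table;
    }
}
```

With a limit of 3, the first three write() calls are accepted and any later
ones are rejected, which is the behaviour the static counter was meant to give.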

If N is really big and there is a lot of work to do before the Reducers
should stop, then I'd consider communicating through progress info stored
in DFS (the implementation will not be straightforward though, since
we don't want to affect performance much).
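
A rough sketch of what sharing progress through the file system could look
like follows. It is plain Java and uses a local directory via java.nio.file as
a stand-in for DFS. The SharedProgress class, the one-file-per-task layout and
the flushEvery parameter are all assumptions made for illustration, not an
existing Hadoop facility. Each task publishes its own count only every
flushEvery rows, to keep the extra I/O small.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: each reduce task owns one small file holding its row count.
class SharedProgress {
    private final Path dir;        // shared progress directory (stand-in for a DFS path)
    private final Path own;        // this task's progress file
    private final int flushEvery;  // publish the count once per this many rows
    private long local = 0;        // rows written by this task so far

    SharedProgress(Path dir, String taskId, int flushEvery) {
        this.dir = dir;
        this.own = dir.resolve(taskId);
        this.flushEvery = flushEvery;
        try {
            Files.createDirectories(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Call once per row written; touches the file system only every flushEvery rows.
    void rowWritten() {
        local++;
        if (local % flushEvery == 0) {
            publish();
        }
    }

    // Overwrites this task's file with its current count.
    void publish() {
        try {
            Files.write(own, Long.toString(local).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Sums the counts published by all tasks; unpublished rows are not included.
    long total() {
        long sum = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                String text = new String(Files.readAllBytes(f), StandardCharsets.UTF_8);
                sum += Long.parseLong(text.trim());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sum;
    }
}
```

A reducer would check total() from time to time and stop once it reaches N.
Because counts are published in batches, the total can lag behind the real
count, so the reducers may overshoot N somewhat; a smaller flushEvery narrows
that gap at the cost of more file system traffic.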

Alex.

On Tue, Jan 26, 2010 at 1:22 AM, Something Something <
[email protected]> wrote:

> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class
> be instantiated only on one machine.. always?  I mean if I have a cluster of
> say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed to
> be instantiated only on 1 machine?
>
> If the answer is yes, then I will use a static variable as a counter to see how
> many rows have been added to my HBase table so far.  In my use case, I want
> to write only N number of rows to a table.  Is there a better way to do
> this?  Please let me know.  Thanks.
>
