I am sorry, but I forgot to add one important piece of information. I don't want to write just any N rows to the table. I want to write the *top* N rows - meaning, I want to write the Reducer's "key" values in descending order. Does this make sense? Sorry for the confusion.
On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <[email protected]> wrote:

> A possible solution is to emit only N rows from each mapper and then use 1
> reduce task [*] - if the value of N is not very high.
> So you end up with at most m * N rows on the reducer instead of the full
> input set - and so the limit can be done more easily.
>
>
> If you are OK with some variance in the number of rows inserted (and if
> the value of N is very high), you can do more interesting things like N/m'
> rows per mapper - and multiple reducers (r), with the assumption that each
> reducer will see at least N/r rows - and so you can limit to N/r per
> reducer. Of course, there is a possible error that gets introduced here ...
>
>
> Regards,
> Mridul
>
> [*] Assuming you just want a simple limit - nothing else.
> Also note, each mapper might want to emit N rows instead of 'tweaks' like
> N/m rows, since it is possible that multiple mappers might have fewer than
> N/m rows to emit to begin with!
>
>
> Something Something wrote:
>
>> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the
>> class be instantiated only on one machine... always? I mean, if I have a
>> cluster of say 1 master, 10 workers & 3 zookeepers, is the Reducer class
>> guaranteed to be instantiated on only 1 machine?
>>
>> If the answer is yes, then I will use a static variable as a counter to
>> see how many rows have been added to my HBase table so far. In my use
>> case, I want to write only N rows to a table. Is there a better way to
>> do this? Please let me know. Thanks.
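For what it's worth, the "emit only N rows per mapper, then trim again in a single reducer" idea above can be sketched without any Hadoop dependencies. The core of it is a bounded, sorted buffer: each mapper keeps at most N rows, dropping the smallest key whenever it overflows, and emits the survivors in descending key order at cleanup time; a lone reducer applies the same trimming to the m * N candidates it receives. The class and method names below are illustrative, not Hadoop API calls:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Sketch of per-task top-N buffering. In a real job, a Mapper would call
// offer() from map() and emit topKeysDescending() from cleanup(); the single
// Reducer would reuse the same buffer over all incoming rows.
public class TopNBuffer {
    private final int n;
    // TreeMap keeps keys sorted ascending, so the smallest key is cheap to
    // evict. Note: equal keys overwrite each other here; real code would use
    // a composite key or a multimap to keep ties.
    private final TreeMap<Long, String> buffer = new TreeMap<>();

    public TopNBuffer(int n) { this.n = n; }

    // Called once per input row.
    public void offer(long key, String value) {
        buffer.put(key, value);
        if (buffer.size() > n) {
            buffer.pollFirstEntry(); // evict the current smallest key
        }
    }

    // Called at cleanup time: the buffered keys, largest first.
    public List<Long> topKeysDescending() {
        return new ArrayList<>(buffer.descendingKeySet());
    }

    public static void main(String[] args) {
        TopNBuffer b = new TopNBuffer(3);
        for (long k : new long[] {5, 1, 9, 7, 3}) {
            b.offer(k, "row-" + k);
        }
        System.out.println(b.topKeysDescending()); // [9, 7, 5]
    }
}
```

This also addresses the descending-order requirement directly, instead of relying on a static counter in the Reducer - static state is per-JVM, so it is only safe with exactly one reduce task and no JVM reuse surprises, whereas the bounded buffer works the same with any number of map tasks.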
