N could be up to 1000, and the output from the Map job could be about 5 million. We only want the top 1000 because the rest of it could be just noise. Thanks for your help.
On Fri, Jan 29, 2010 at 11:43 AM, Alex Baranov <[email protected]> wrote:

> How big is N? How big is the outcome of the Map job?
>
> Alex.
>
> On Fri, Jan 29, 2010 at 7:36 PM, Something Something <[email protected]> wrote:
>
> > I am sorry, but I forgot to add one important piece of information.
> >
> > I don't want to write any random N rows to the table. I want to write the
> > *top* N rows - meaning - I want to write the "key" values of the Reducer in
> > descending order. Does this make sense? Sorry for the confusion.
> >
> > On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan <[email protected]> wrote:
> >
> > > A possible solution is to emit only N rows from each mapper and then use 1
> > > reduce task [*] - if the value of N is not very high.
> > > So you end up with at most m * N rows on the reducer instead of the full
> > > input set - and so the limit can be done more easily.
> > >
> > > If you are OK with some sort of variance in the number of rows inserted (and if
> > > the value of N is very high), you can do more interesting things like N/m' rows
> > > per mapper - and multiple reducers (r): with the assumption that each reducer
> > > will see at least N/r rows - and so you can limit to N/r per reducer:
> > > of course, there is a possible error that gets introduced here ...
> > >
> > > Regards,
> > > Mridul
> > >
> > > [*] Assuming you just want a simple limit - nothing else.
> > > Also note, each mapper might want to emit N rows instead of 'tweaks' like
> > > N/m rows, since it is possible that multiple mappers might have fewer than
> > > N/m rows to emit to begin with!
> > >
> > > Something Something wrote:
> > >
> > >> If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class
> > >> be instantiated only on one machine.. always? I mean if I have a cluster of
> > >> say 1 master, 10 workers & 3 zookeepers, is the Reducer class guaranteed to
> > >> be instantiated only on 1 machine?
> > >>
> > >> If the answer is yes, then I will use a static variable as a counter to see how
> > >> many rows have been added to my HBase table so far. In my use case, I want
> > >> to write only N number of rows to a table. Is there a better way to do
> > >> this? Please let me know. Thanks.
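
For illustration, the suggestion quoted above (each mapper emits only its own top N keys, and a single reduce task merges at most m * N keys) could be sketched in plain Java roughly as follows. This is only a sketch under stated assumptions: it does not use the Hadoop or HBase APIs, the names `TopNSketch`, `topN`, and `mergeTopN` are hypothetical, and keys are assumed to be `Long` values where "top" means largest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for the mapper-side and reducer-side logic; not Hadoop code.
final class TopNSketch {

    // Keep only the n largest keys, using a min-heap that never grows beyond n entries.
    // This models what a single mapper would emit.
    static List<Long> topN(Iterable<Long> keys, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        PriorityQueue<Long> heap = new PriorityQueue<>(); // head = smallest retained key
        for (Long key : keys) {
            if (heap.size() < n) {
                heap.add(key);
            } else if (key > heap.peek()) {
                heap.poll();   // drop the smallest retained key
                heap.add(key); // keep the larger one
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder()); // descending order
        return result;
    }

    // Models the single reducer: it sees at most m * n keys and keeps the global top n.
    static List<Long> mergeTopN(List<List<Long>> perMapper, int n) {
        List<Long> all = new ArrayList<>();
        for (List<Long> part : perMapper) {
            all.addAll(part);
        }
        return topN(all, n);
    }

    public static void main(String[] args) {
        // Two made-up "mappers", each emitting its own top 2 keys.
        List<List<Long>> perMapper = Arrays.asList(
                topN(Arrays.asList(5L, 1L, 9L, 3L), 2),
                topN(Arrays.asList(7L, 8L, 2L), 2));
        System.out.println(mergeTopN(perMapper, 2)); // prints [9, 8]
    }
}
```

With N around 1000, the heap on each side stays small regardless of how many millions of rows the map phase produces, which is why the single-reducer variant seems workable here.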
