how to find top N values using map-reduce ?

2013-02-01 Thread praveenesh kumar
I am looking for a better solution for this.

1 way to do this would be to find top N values from each mappers and
then find out the top N out of them in 1 reducer.  I am afraid that
this won't work effectively if my N is larger than number of values in
my inputsplit (or mapper input).

Otherway is to just sort all of them in 1 reducer and then do the cat of top-N.

Wondering if there is any better approach to do this ?

Regards
Praveenesh


Re: how to find top N values using map-reduce ?

2013-02-01 Thread praveenesh kumar
Actually what I am trying to find to top n% of the whole data.
This n could be very large if my data is large.

Assuming I have uniform rows of equal size and if the total data size
is 10 GB, using the above mentioned approach, if I have to take top
10% of the whole data set, I need 10% of 10GB which could be rows
worth of 1 GB (roughly) in my mappers.
I think that would not be possible given my input splits are of
64/128/512 MB (based on my block size) or am I making wrong
assumptions. I can increase the inputsplit size, but is there a better
way to find top n%.


My whole actual problem is to give ranks to some values and then find
out the top 10 ranks.

I think this context can give more idea about the problem ?

Regards
Praveenesh

On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov ekirpic...@gmail.com wrote:
 Hi,

 Can you tell more about:
  * How big is N
  * How big is the input dataset
  * How many mappers you have
  * Do input splits correlate with the sorting criterion for top N?

 Depending on the answers, very different strategies will be optimal.



 On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar praveen...@gmail.comwrote:

 I am looking for a better solution for this.

 1 way to do this would be to find top N values from each mappers and
 then find out the top N out of them in 1 reducer.  I am afraid that
 this won't work effectively if my N is larger than number of values in
 my inputsplit (or mapper input).

 Otherway is to just sort all of them in 1 reducer and then do the cat of
 top-N.

 Wondering if there is any better approach to do this ?

 Regards
 Praveenesh




 --
 Eugene Kirpichov
 http://www.linkedin.com/in/eugenekirpichov
 http://jkff.info/software/timeplotters - my performance visualization tools


Re: how to find top N values using map-reduce ?

2013-02-01 Thread praveenesh kumar
Thanks for that Russell. Unfortunately I can't use Pig. Need to write
my own MR job. I was wondering how its usually done in the best way
possible.

Regards
Praveenesh

On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney russell.jur...@gmail.com wrote:
 Pig. Datafu. 7 lines of code.

 https://gist.github.com/4696443
 https://github.com/linkedin/datafu


 On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar praveen...@gmail.comwrote:

 Actually what I am trying to find to top n% of the whole data.
 This n could be very large if my data is large.

 Assuming I have uniform rows of equal size and if the total data size
 is 10 GB, using the above mentioned approach, if I have to take top
 10% of the whole data set, I need 10% of 10GB which could be rows
 worth of 1 GB (roughly) in my mappers.
 I think that would not be possible given my input splits are of
 64/128/512 MB (based on my block size) or am I making wrong
 assumptions. I can increase the inputsplit size, but is there a better
 way to find top n%.


 My whole actual problem is to give ranks to some values and then find
 out the top 10 ranks.

 I think this context can give more idea about the problem ?

 Regards
 Praveenesh

 On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov ekirpic...@gmail.com
 wrote:
  Hi,
 
  Can you tell more about:
   * How big is N
   * How big is the input dataset
   * How many mappers you have
   * Do input splits correlate with the sorting criterion for top N?
 
  Depending on the answers, very different strategies will be optimal.
 
 
 
  On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar praveen...@gmail.com
 wrote:
 
  I am looking for a better solution for this.
 
  1 way to do this would be to find top N values from each mappers and
  then find out the top N out of them in 1 reducer.  I am afraid that
  this won't work effectively if my N is larger than number of values in
  my inputsplit (or mapper input).
 
  Otherway is to just sort all of them in 1 reducer and then do the cat of
  top-N.
 
  Wondering if there is any better approach to do this ?
 
  Regards
  Praveenesh
 
 
 
 
  --
  Eugene Kirpichov
  http://www.linkedin.com/in/eugenekirpichov
  http://jkff.info/software/timeplotters - my performance visualization
 tools




 --
 Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com