I am looking for a better solution for this.
One way to do this would be to find the top N values in each mapper and
then find the top N out of those in a single reducer. I am afraid that
this won't work effectively if my N is larger than the number of values in
my input split (or mapper input).
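For reference, the approach described above can be sketched as a plain-Python simulation. This is only an illustration, not Hadoop code: the function names `map_top_n` and `reduce_top_n` and the sample splits are made up for the example. It assumes values are directly comparable and that each mapper can hold N values in memory. Note that when a split has fewer than N values, the mapper in this sketch simply emits everything it saw, so the final result should still be correct; the cost is that the single reducer receives more candidates.

```python
import heapq


def map_top_n(split, n):
    """Mapper side (simulated): keep at most the n largest values of one split."""
    if n <= 0:
        return []
    heap = []  # min-heap; heap[0] is the smallest of the values kept so far
    for value in split:
        if len(heap) < n:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # New value beats the smallest kept value, so swap it in.
            heapq.heapreplace(heap, value)
    return heap


def reduce_top_n(mapper_outputs, n):
    """Reducer side (simulated): merge all candidates and keep the global top n."""
    candidates = (v for output in mapper_outputs for v in output)
    return heapq.nlargest(n, candidates)


# Hypothetical input: three splits, one of them smaller than N.
splits = [[5, 1, 9], [7, 3], [8, 2, 6, 4]]
result = reduce_top_n([map_top_n(s, 4) for s in splits], 4)
print(result)
```

In a real job the mapper would typically emit its kept values once at the end of the split rather than per record, and all output would be routed to one reducer; those details are left out here.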
What I am actually trying to find is the top n% of the whole data set.
This n could be very large if my data is large.
Assuming I have uniform rows of equal size and the total data size
is 10 GB, then with the approach mentioned above, to take the top
10% of the whole data set I would need 10% of 10 GB, i.e. about 1 GB of rows.
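To make the arithmetic concrete, here is a small sketch that turns "top n%" into a row count and a byte count. The 1 KiB row size is purely an assumption for illustration (the thread only says rows are uniform), and `top_percent_budget` is a made-up helper name.

```python
import math

GB = 1024 ** 3


def top_percent_budget(total_bytes, row_bytes, percent):
    """Return (rows_to_keep, bytes_to_keep) for the top `percent` of uniform rows."""
    total_rows = total_bytes // row_bytes
    rows_to_keep = math.ceil(total_rows * percent / 100)
    return rows_to_keep, rows_to_keep * row_bytes


# Assumed figures: 10 GB of data, 1 KiB per row (hypothetical), top 10%.
rows, size = top_percent_budget(10 * GB, 1024, 10)
print(rows, size / GB)
```

Under these assumptions N comes out at roughly a million rows, and the rows that make up the final top 10% add up to about 1 GB.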
Thanks for that, Russell. Unfortunately I can't use Pig; I need to write
my own MR job. I was wondering how this is usually done, and what the
best way to do it is.
Regards
Praveenesh
On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney russell.jur...@gmail.com wrote:
Pig. Datafu. 7 lines of code.