My suggestion is to use secondary sort with a single reducer. That way you
can easily extract the top N. If you want the top N%, you'll need an
additional phase to determine how many records that N% actually is.
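
As a rough illustration of the two-phase idea, here is a plain-Python simulation (not Hadoop code). The function names are made up for this sketch, the "splits" are just in-memory lists, and rounding the N% cutoff up is an assumption. A real job would get the descending order from the shuffle with a custom sort comparator rather than from sorted().

```python
import math

def count_records(splits):
    """Phase 1: a counting pass over every input split."""
    return sum(len(split) for split in splits)

def single_reducer_top(splits, k):
    """Phase 2: stand-in for a single reducer that receives all values
    already sorted in descending order and stops after k of them."""
    sorted_stream = sorted((v for split in splits for v in split), reverse=True)
    emitted = []
    for value in sorted_stream:
        if len(emitted) == k:
            break  # the reducer can ignore everything after the cutoff
        emitted.append(value)
    return emitted

def top_percent(splits, percent):
    """Turn N% into a record count (rounded up, an assumption), then take that many."""
    total = count_records(splits)
    k = math.ceil(total * percent / 100)
    return single_reducer_top(splits, k)

splits = [[5, 1, 9], [7, 3], [8, 2, 6, 4, 10]]
print(top_percent(splits, 30))
```

The counting phase has to finish before the cutoff is known, which is why it needs to be a separate pass over the data.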
--
Kind regards,
Niels Basjes
(Sent from mobile)
On 2 Feb. 2013
My actual problem is to rank all values, then apply logic 1 to the top n%
of values and logic 2 to the remaining values.
1st - Ranking? (need major suggestions here)
2nd - Find the top n% of them.
Then the rest is covered.
Regards
Praveenesh
On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang wrote:
> there's one thing
Maybe look at the Pig source to see how it does it?
Russell Jurney http://datasyndrome.com
On Feb 1, 2013, at 11:37 PM, praveenesh kumar wrote:
Thanks for that, Russell. Unfortunately I can't use Pig. I need to write
my own MR job. I was wondering how it's usually done in the best way
possible.
Regards
Praveenesh
On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney wrote:
Pig. Datafu. 7 lines of code.
https://gist.github.com/4696443
https://github.com/linkedin/datafu
On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar wrote:
Actually, what I am trying to find is the top n% of the whole data.
This n could be very large if my data is large.
Assuming I have uniform rows of equal size and the total data size
is 10 GB, using the above-mentioned approach, if I have to take the top
10% of the whole data set, I need 10% of 10 GB whi
Hi,
Can you tell us more about:
* How big is N?
* How big is the input dataset?
* How many mappers do you have?
* Do input splits correlate with the sorting criterion for top N?
Depending on the answers, very different strategies will be optimal.
On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar wro
I am looking for a better solution for this.
One way to do this would be to find the top N values in each mapper and
then find the top N among those in a single reducer. I am afraid that
this won't work effectively if my N is larger than the number of values in
my input split (or mapper input).
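
The per-mapper approach described above can be sketched in plain Python (a simulation, not Hadoop code). The function names are hypothetical and each "split" is just an in-memory list; heapq.nlargest stands in for whatever bounded structure a real mapper would keep.

```python
import heapq

def mapper_top_n(split, n):
    """Each mapper keeps only its n largest values and emits them."""
    return heapq.nlargest(n, split)

def reducer_top_n(candidate_lists, n):
    """The single reducer merges all mapper candidates and keeps the n largest."""
    merged = (value for candidates in candidate_lists for value in candidates)
    return heapq.nlargest(n, merged)

def top_n(splits, n):
    """Run the simulated mappers, then the simulated single reducer."""
    return reducer_top_n([mapper_top_n(split, n) for split in splits], n)

splits = [[5, 1, 9], [7, 3], [8, 2, 6, 4, 10]]
print(top_n(splits, 4))
```

In this sketch the result is still correct when N exceeds the size of a split, but each mapper then emits its entire split, so the single reducer ends up handling all of the data and the local pre-filtering saves nothing.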
Otherway is