My suggestion is to use secondary sort with a single reducer. That way you can easily extract the top N. If you want the top N%, you'll need an additional phase to determine how many records that N% actually is.
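
To make the two phases concrete, here is a rough sketch in plain Python (not Hadoop API code). It assumes the single reducer receives the values already sorted in descending order, and that ties simply get consecutive ranks. The names count_records, top_percent_cutoff and rank_and_split are made up for illustration, as are the logic1/logic2 callbacks standing in for the two pieces of logic mentioned in the thread.

```python
import math
from typing import Callable, Iterable, Iterator, Tuple


def count_records(records: Iterable[float]) -> int:
    """Phase 1 (hypothetical): count all records.

    In a real job this number might come from a counter or a small
    preliminary pass; here we just iterate over the input.
    """
    return sum(1 for _ in records)


def top_percent_cutoff(total: int, percent: int) -> int:
    """Number of records that make up the top `percent` of `total`, rounded up."""
    return math.ceil(total * percent / 100)


def rank_and_split(
    sorted_desc: Iterable[float],
    cutoff: int,
    logic1: Callable[[float], object],
    logic2: Callable[[float], object],
) -> Iterator[Tuple[int, object]]:
    """Phase 2 (hypothetical): what the single reducer would do.

    Walks the already-sorted stream once, assigns rank 1, 2, 3, ...
    and applies logic1 to the first `cutoff` values and logic2 to the rest.
    """
    for rank, value in enumerate(sorted_desc, start=1):
        if rank <= cutoff:
            yield rank, logic1(value)
        else:
            yield rank, logic2(value)


if __name__ == "__main__":
    values = [5, 1, 9, 3, 7, 2, 8, 6, 4, 10]
    total = count_records(values)            # phase 1
    cutoff = top_percent_cutoff(total, 20)   # top 20% of the records
    # sorted() stands in for the shuffle/secondary sort of the real job.
    for rank, result in rank_and_split(
        sorted(values, reverse=True),
        cutoff,
        lambda v: ("top", v),
        lambda v: ("rest", v),
    ):
        print(rank, result)
```

The point of the extra phase is only that the reducer must know the cutoff before it starts streaming; once it has that number, a single pass over the sorted values is enough.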
--
Kind regards,
Niels Basjes
(Sent from mobile)

On 2 Feb 2013 12:08, "praveenesh kumar" <praveen...@gmail.com> wrote:

> My actual problem is to rank all values and then run logic 1 to top n%
> values and logic 2 to rest values.
> 1st - Ranking ? (need major suggestions here)
> 2nd - Find top n% out of them.
> Then rest is covered.
>
> Regards
> Praveenesh
>
> On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang <lakech...@gmail.com> wrote:
> > there's one thing i want to clarify that you can use multi-reducers to sort
> > the data globally and then cat all the parts to get the top n records. The
> > data in all parts are globally in order.
> > Then you may find the problem is much easier.
> >
> > On 2013-2-2 at 3:18 PM, "praveenesh kumar" <praveen...@gmail.com> wrote:
> >
> >> Actually what I am trying to find to top n% of the whole data.
> >> This n could be very large if my data is large.
> >>
> >> Assuming I have uniform rows of equal size and if the total data size
> >> is 10 GB, using the above mentioned approach, if I have to take top
> >> 10% of the whole data set, I need 10% of 10GB which could be rows
> >> worth of 1 GB (roughly) in my mappers.
> >> I think that would not be possible given my input splits are of
> >> 64/128/512 MB (based on my block size) or am I making wrong
> >> assumptions. I can increase the inputsplit size, but is there a better
> >> way to find top n%.
> >>
> >>
> >> My whole actual problem is to give ranks to some values and then find
> >> out the top 10 ranks.
> >>
> >> I think this context can give more idea about the problem ?
> >>
> >> Regards
> >> Praveenesh
> >>
> >> On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov <ekirpic...@gmail.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > Can you tell more about:
> >> > * How big is N
> >> > * How big is the input dataset
> >> > * How many mappers you have
> >> > * Do input splits correlate with the sorting criterion for top N?
> >> >
> >> > Depending on the answers, very different strategies will be optimal.
> >> >
> >> > On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar
> >> > <praveen...@gmail.com> wrote:
> >> >
> >> >> I am looking for a better solution for this.
> >> >>
> >> >> 1 way to do this would be to find top N values from each mappers and
> >> >> then find out the top N out of them in 1 reducer. I am afraid that
> >> >> this won't work effectively if my N is larger than number of values in
> >> >> my inputsplit (or mapper input).
> >> >>
> >> >> Otherway is to just sort all of them in 1 reducer and then do the cat of
> >> >> top-N.
> >> >>
> >> >> Wondering if there is any better approach to do this ?
> >> >>
> >> >> Regards
> >> >> Praveenesh
> >> >>
> >> >
> >> > --
> >> > Eugene Kirpichov
> >> > http://www.linkedin.com/in/eugenekirpichov
> >> > http://jkff.info/software/timeplotters - my performance visualization
> >> > tools