how to find top N values using map-reduce ?
I am looking for a better solution for this. 1 way to do this would be to find top N values from each mappers and then find out the top N out of them in 1 reducer. I am afraid that this won't work effectively if my N is larger than number of values in my inputsplit (or mapper input). Otherway is to just sort all of them in 1 reducer and then do the cat of top-N. Wondering if there is any better approach to do this ? Regards Praveenesh
Re: how to find top N values using map-reduce ?
Actually what I am trying to find to top n% of the whole data. This n could be very large if my data is large. Assuming I have uniform rows of equal size and if the total data size is 10 GB, using the above mentioned approach, if I have to take top 10% of the whole data set, I need 10% of 10GB which could be rows worth of 1 GB (roughly) in my mappers. I think that would not be possible given my input splits are of 64/128/512 MB (based on my block size) or am I making wrong assumptions. I can increase the inputsplit size, but is there a better way to find top n%. My whole actual problem is to give ranks to some values and then find out the top 10 ranks. I think this context can give more idea about the problem ? Regards Praveenesh On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov ekirpic...@gmail.com wrote: Hi, Can you tell more about: * How big is N * How big is the input dataset * How many mappers you have * Do input splits correlate with the sorting criterion for top N? Depending on the answers, very different strategies will be optimal. On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar praveen...@gmail.comwrote: I am looking for a better solution for this. 1 way to do this would be to find top N values from each mappers and then find out the top N out of them in 1 reducer. I am afraid that this won't work effectively if my N is larger than number of values in my inputsplit (or mapper input). Otherway is to just sort all of them in 1 reducer and then do the cat of top-N. Wondering if there is any better approach to do this ? Regards Praveenesh -- Eugene Kirpichov http://www.linkedin.com/in/eugenekirpichov http://jkff.info/software/timeplotters - my performance visualization tools
Re: how to find top N values using map-reduce ?
Thanks for that Russell. Unfortunately I can't use Pig. Need to write my own MR job. I was wondering how its usually done in the best way possible. Regards Praveenesh On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney russell.jur...@gmail.com wrote: Pig. Datafu. 7 lines of code. https://gist.github.com/4696443 https://github.com/linkedin/datafu On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar praveen...@gmail.comwrote: Actually what I am trying to find to top n% of the whole data. This n could be very large if my data is large. Assuming I have uniform rows of equal size and if the total data size is 10 GB, using the above mentioned approach, if I have to take top 10% of the whole data set, I need 10% of 10GB which could be rows worth of 1 GB (roughly) in my mappers. I think that would not be possible given my input splits are of 64/128/512 MB (based on my block size) or am I making wrong assumptions. I can increase the inputsplit size, but is there a better way to find top n%. My whole actual problem is to give ranks to some values and then find out the top 10 ranks. I think this context can give more idea about the problem ? Regards Praveenesh On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov ekirpic...@gmail.com wrote: Hi, Can you tell more about: * How big is N * How big is the input dataset * How many mappers you have * Do input splits correlate with the sorting criterion for top N? Depending on the answers, very different strategies will be optimal. On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar praveen...@gmail.com wrote: I am looking for a better solution for this. 1 way to do this would be to find top N values from each mappers and then find out the top N out of them in 1 reducer. I am afraid that this won't work effectively if my N is larger than number of values in my inputsplit (or mapper input). Otherway is to just sort all of them in 1 reducer and then do the cat of top-N. Wondering if there is any better approach to do this ? Regards Praveenesh -- Eugene Kirpichov http://www.linkedin.com/in/eugenekirpichov http://jkff.info/software/timeplotters - my performance visualization tools -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com