Niels,

Sorry it has taken me so long to respond. Today has been a very crazy day.
I am just guessing at your algorithm for auto-complete. I don't actually know it, so I will design a back-of-the-envelope one myself as a starting point. My guess is that you have a few map/reduce jobs. The first M/R job is mostly a glorified word count that produces each word or phrase along with how often it is searched for. In the next job, the map splits the phrases up so that they are output with an ever-increasing number of letters as the key, along with the original phrase and its weight as the value. The reducer/combiner groups them by key and produces a top-N list of the phrases with the highest weights for each key.

If we want the keys to come out in sorted order, we need a sequence file with the partition keys for the total order partitioner. TeraSort generates a partition file by getting the number of splits and then reading the first N records from each split, where N is based on the number of samples desired and the number of splits. The keys of all the sampled entries are sorted and divided into roughly equal-length partitions that are stored in the partition file. This only really works for TeraSort because it assumes that all of the partitions are more or less random already. The worst input dataset for TeraSort would be one where each partition is sorted internally but is made up of fairly evenly distributed data. That is exactly what the output of a typical map/reduce job looks like when the reduce does not change the keys passed in and each reducer's output is less than a block in size — and that sure sounds like what wordcount produces.

The only real way to get around that is to do the sampling as part of a map/reduce job, and to sample randomly instead of reading the first N records. It should be a map/reduce job because it is going to read a lot more data than TeraSort's partition generation code does. In this case you would have a second M/R job that runs after the first and randomly samples words/phrases to work on.
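To make the guessed pipeline concrete, here is a plain-Python sketch (not actual Hadoop code) of the prefix-expansion map and the top-N reduce described above. The phrases, weights, and N are made up for the example:

```python
from collections import defaultdict

def map_prefixes(phrase, weight):
    """Emit every prefix of the phrase as a key, with the full phrase
    and its search weight as the value."""
    for i in range(1, len(phrase) + 1):
        yield phrase[:i], (phrase, weight)

def reduce_top_n(pairs, n):
    """Group (prefix, (phrase, weight)) pairs by prefix and keep the
    n highest-weight phrases for each prefix."""
    groups = defaultdict(list)
    for prefix, value in pairs:
        groups[prefix].append(value)
    return {
        prefix: sorted(values, key=lambda v: v[1], reverse=True)[:n]
        for prefix, values in groups.items()
    }

# Toy input: output of the word-count job (phrase -> search frequency).
counts = {"car": 90, "cat": 50, "cab": 20}
pairs = [kv for phrase, w in counts.items() for kv in map_prefixes(phrase, w)]
top = reduce_top_n(pairs, n=2)
# top["ca"] is now [("car", 90), ("cat", 50)]
```

In a real Hadoop job the two functions would be the map and reduce methods, and the grouping by prefix would be done by the shuffle rather than in memory.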
It would then generate the increasingly long phrases and send them all to a single reducer, which would buffer them up; when the reducer has no more input, it would output every Nth key so that you get the proper number of partitions for the reducers. You could sort these keys yourself to be sure, but they should already arrive in sorted order, so why bother resorting.

If my assumptions are totally wrong here please let me know.

--Bobby Evans

On 2/29/12 4:59 AM, "Niels Basjes" <ni...@basjes.nl> wrote:

Robert,

On Tue, Feb 28, 2012 at 23:28, Robert Evans <ev...@yahoo-inc.com> wrote:
> I am not sure I can help with that unless I know better what "a special
> distribution" means.

The thing is that this application is an "Auto Complete" feature whose key is "the letters that have been typed so far". Now, for several reasons, we need this to be sorted by the length of the input: the '1 letter suggestions' first, then the '2 letter suggestions', etc. I've been trying to come up with an automatic partitioning that would split the dataset into something like 30 parts that, when concatenated, do what you suggest.

> Unless you are doing a massive amount of processing in your reducer,
> having a partition that only comes close to balancing the distribution is
> a big win over all of the other options that put the data on a single
> machine and sort it there. Even if you are doing a lot of processing in
> the reducer, or you need a special grouping to make the reduce work
> properly, I would suspect that a second map/reduce job to sort data that
> is just close to balanced would beat out all of the other options.

Thanks, this is a useful suggestion. I'll see if there is a pattern in the data and from there simply define the partitions manually based on the pattern we find.
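The every-Nth-key selection that Bobby describes for the single sampling reducer can be sketched like this (again plain Python, not Hadoop; the sampled keys are invented, and a real job would write the chosen split points to the TotalOrderPartitioner's partition file):

```python
def partition_keys(sorted_keys, num_reducers):
    """From an already-sorted list of sampled keys, pick
    num_reducers - 1 split points: one key every
    len(sorted_keys) / num_reducers entries, skipping index 0."""
    step = len(sorted_keys) / num_reducers
    return [sorted_keys[int(step * i)] for i in range(1, num_reducers)]

# Toy sample of prefixes buffered by the single reducer.
samples = sorted(["a", "ab", "b", "ba", "c", "ca", "d", "da"])
splits = partition_keys(samples, num_reducers=4)
# 8 samples, 4 reducers -> split points at indices 2, 4, 6
```

With N - 1 split points, reducer i handles all keys between split point i - 1 and split point i, which is what gives each of the N reducers a roughly equal share of the sampled distribution.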