Niels,

Sorry it has taken me so long to respond. Today has been a very crazy day.
I am just guessing at your algorithm for auto-complete. I don't actually know it, so I will design a back-of-the-envelope one myself as a starting point. My guess is that you have a few map/reduce jobs. The first M/R job is mostly a glorified word count that produces each word or phrase along with how often it is searched for. In the next job, the map splits the phrases up so that they are output with an ever-increasing number of letters as the key, along with the original phrase and its weight as the value. The reducer/combiner groups them by key and produces a top-N list of the phrases with the highest weights for each key.

If we want the keys to come out in sorted order, we need a sequence file with the partition keys for the total order partitioner. TeraSort generates a partition file by getting the number of splits and then reading the first N records from each split, where N is based on the number of samples desired and the number of splits. The keys of all the sampled entries are sorted and divided into roughly equal-length partitions that are stored in the partition file. This only really works for TeraSort because it assumes that all of the partitions are more or less random already. The worst input dataset for TeraSort would be one where each partition is sorted internally but is made up of fairly evenly distributed data. That is exactly what the output of a typical map/reduce job looks like when the reduce does not change the keys passed in and each reducer's output is less than a block in size — and that sure sounds like what wordcount produces.

The only real way to get around that is to do the sampling as part of a map/reduce job, and to sample randomly instead of reading the first N records. It should be a map/reduce job because it is going to read a lot more data than TeraSort's partition generation code does. In this case you would have a second M/R job that runs after the first and randomly samples words/phrases to work on.
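To make the guessed pipeline concrete, here is a plain-Python sketch (not actual Hadoop code) of the prefix-expansion map and the top-N reduce described above. The phrases, weights, and N are made up for the example:

```python
from collections import defaultdict

def map_prefixes(phrase, weight):
    """Emit every prefix of the phrase as a key, with the full phrase
    and its search weight as the value."""
    for i in range(1, len(phrase) + 1):
        yield phrase[:i], (phrase, weight)

def reduce_top_n(pairs, n):
    """Group (prefix, (phrase, weight)) pairs by prefix and keep the
    n highest-weight phrases for each prefix."""
    groups = defaultdict(list)
    for prefix, value in pairs:
        groups[prefix].append(value)
    return {
        prefix: sorted(values, key=lambda v: v[1], reverse=True)[:n]
        for prefix, values in groups.items()
    }

# Toy input: output of the word-count job (phrase -> search frequency).
counts = {"car": 90, "cat": 50, "cab": 20}
pairs = [kv for phrase, w in counts.items() for kv in map_prefixes(phrase, w)]
top = reduce_top_n(pairs, n=2)
# top["ca"] is now [("car", 90), ("cat", 50)]
```

In a real Hadoop job the two functions would be the map and reduce methods, and the grouping by prefix would be done by the shuffle rather than in memory.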
It would then generate the increasingly long phrases and send them all to a single reducer, which would buffer them up; when the reducer has no more input, it would output every Nth key so that you get the proper number of partitions for the reducers. You could sort these keys yourself to be sure, but they should already arrive in sorted order, so why bother resorting.

If my assumptions are totally wrong here please let me know.

--Bobby Evans

On 2/29/12 4:59 AM, "Niels Basjes" <ni...@basjes.nl> wrote:

Robert,

On Tue, Feb 28, 2012 at 23:28, Robert Evans <ev...@yahoo-inc.com> wrote:
> I am not sure I can help with that unless I know better what "a special
> distribution" means.

The thing is that this application is an "Auto Complete" feature whose key is "the letters that have been typed so far". Now, for several reasons, we need this to be sorted by the length of the input: the '1 letter suggestions' first, then the '2 letter suggestions', etc. I've been trying to come up with an automatic partitioning that would split the dataset into something like 30 parts that, when concatenated, do what you suggest.

> Unless you are doing a massive amount of processing in your reducer,
> having a partition that only comes close to balancing the distribution is
> a big win over all of the other options that put the data on a single
> machine and sort it there. Even if you are doing a lot of processing in
> the reducer, or you need a special grouping to make the reduce work
> properly, I would suspect that a second map/reduce job to sort data that
> is just close to balanced would beat out all of the other options.

Thanks, this is a useful suggestion. I'll see if there is a pattern in the data and from there simply define the partitions manually based on the pattern we find.
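The every-Nth-key selection that Bobby describes for the single sampling reducer can be sketched like this (again plain Python, not Hadoop; the sampled keys are invented, and a real job would write the chosen split points to the TotalOrderPartitioner's partition file):

```python
def partition_keys(sorted_keys, num_reducers):
    """From an already-sorted list of sampled keys, pick
    num_reducers - 1 split points: one key every
    len(sorted_keys) / num_reducers entries, skipping index 0."""
    step = len(sorted_keys) / num_reducers
    return [sorted_keys[int(step * i)] for i in range(1, num_reducers)]

# Toy sample of prefixes buffered by the single reducer.
samples = sorted(["a", "ab", "b", "ba", "c", "ca", "d", "da"])
splits = partition_keys(samples, num_reducers=4)
# 8 samples, 4 reducers -> split points at indices 2, 4, 6
```

With N - 1 split points, reducer i handles all keys between split point i - 1 and split point i, which is what gives each of the N reducers a roughly equal share of the sampled distribution.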