I would recommend doing what TeraSort does and using a different partitioner, to ensure that all keys within a given range go to a single reducer. If your partitioner is set up correctly, then all you have to do is concatenate the output files together, if you even need to do that.
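To illustrate the idea (this is a hedged standalone sketch of range partitioning, not Hadoop's actual TotalOrderPartitioner implementation; the split points here are made up), a partitioner routes each key to a reducer by comparing it against sorted split points, so reducer i only ever sees keys in range i and the reducer outputs are globally ordered end to end:

```java
import java.util.Arrays;

// Sketch of range partitioning: keys are assigned to reducers by
// binary-searching sorted split points, so reducer i receives only
// keys falling in the i-th range. numReducers = splitPoints.length + 1.
public class RangePartitionSketch {
    static int partition(String key, String[] splitPoints) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is absent
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        String[] splits = {"g", "n", "t"};  // hypothetical split points -> 4 reducers
        System.out.println(partition("apple", splits));  // 0
        System.out.println(partition("hadoop", splits)); // 1
        System.out.println(partition("sort", splits));   // 2
        System.out.println(partition("zebra", splits));  // 3
    }
}
```

In the real TotalOrderPartitioner the split points are not hand-picked; they are read from a partition file, typically produced by sampling the input keys.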
Look at TotalOrderPartitioner. It should do what you want.

--Bobby Evans

On 2/28/12 2:10 PM, "Niels Basjes" <ni...@basjes.nl> wrote:

Hi,

We have a job that outputs a set of files of several hundred MB of text each. Using the comparators and such, we can produce output files that are each sorted by themselves. What we want is one giant output file (outside of the cluster) that is sorted. We see the following options:

1) Run the last job with 1 reducer. This is not really an option, because it would push a significant part of the processing through a single CPU (this would take too long).
2) Create an additional job that sorts the existing files and has 1 reducer.
3) Download all of the files and run the standard command-line tool "sort -m".
4) Install HDFS FUSE and run the standard command-line tool "sort -m".
5) Create a Hadoop-specific tool that can do "hadoop fs -text" and "sort -m" in one go.

During our discussion we were wondering: what is the best way of doing this? What do you recommend?
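For options 3-5, the merge step itself is cheap because "sort -m" only merges inputs that are already individually sorted, without re-sorting them. A minimal local sketch (the file names and contents here are made up; against HDFS each input could instead come from a stream such as "hadoop fs -text part-r-00000" via process substitution):

```shell
# Two already-sorted "part" files standing in for reducer output
printf 'apple\norange\n' > part-0.txt
printf 'banana\nzebra\n' > part-1.txt

# sort -m merges pre-sorted inputs in a single linear pass
sort -m part-0.txt part-1.txt > merged.txt
cat merged.txt
```

This prints apple, banana, orange, zebra in order. Note that sort -m assumes each input is already sorted under the same collation; mismatched locales between the cluster and the local machine can silently break the merge.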