Niels,

I am not sure I can help with that unless I know better what "a special distribution" means. Unless you are doing a massive amount of processing in your reducer, a partitioning that only comes close to balancing the distribution is a big win over all of the other options, which put the data on a single machine and sort it there. Even if you are doing a lot of processing in the reducer, or you need a special grouping to make the reduce work properly, I suspect that a second map/reduce job that sorts the data with a roughly balanced partitioning would beat all of the other options.
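To illustrate why an imperfectly balanced range partitioning still yields a globally sorted result, here is a minimal Python sketch. It is not Hadoop code. The function names and the hand-picked split points are assumptions made for illustration only. Each "reducer" receives one contiguous key range and sorts it locally, so concatenating the outputs in reducer order is already sorted, however uneven the bucket sizes are.

```python
import bisect


def partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`.

    `split_points` must be sorted; N split points define N + 1 ranges.
    """
    return bisect.bisect_right(split_points, key)


def range_partitioned_sort(records, split_points):
    """Route each record to a bucket by key range, then sort each bucket."""
    buckets = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        buckets[partition(record, split_points)].append(record)
    # Each bucket stands in for one reducer's locally sorted output file.
    return [sorted(bucket) for bucket in buckets]


def concatenate(parts):
    """Join the reducer outputs in reducer order; no merge step is needed."""
    return [record for part in parts for record in part]


if __name__ == "__main__":
    records = ["pear", "apple", "fig", "kiwi", "apple", "zebra", "mango"]
    # Hypothetical, deliberately uneven split points.
    parts = range_partitioned_sort(records, ["b", "c"])
    print([len(p) for p in parts])
    print(concatenate(parts) == sorted(records))
```

Balance only affects how long the slowest reducer runs, not the correctness of the final order.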
If that is still not an option, then I would run some benchmarks comparing a mapreduce job with a single reducer against copying the data to a single machine and running sort -m. I don't have much experience with HDFS fuse, but from what I have seen most older versions of it have some performance issues, especially with the page cache. sort -m should not really need the page cache, though, so it may not be a problem. If download-then-sort is even close to as fast as the mapreduce job, then you might want to try option 5, because it would reduce the disk overhead when moving the data, and I suspect it would be faster than just downloading and sorting.

A disclaimer: the only real way to know what to do is to run benchmarks on all of these options. Hadoop is so complex that trying to reason about exactly which will be faster is very difficult.

--Bobby Evans

On 2/28/12 3:46 PM, "Niels Basjes" <ni...@basjes.nl> wrote:

Hi Robert,

On Tue, Feb 28, 2012 at 21:41, Robert Evans <ev...@yahoo-inc.com> wrote:

I would recommend that you do what TeraSort does and use a different partitioner, to ensure that all keys within a given range will go to a single reducer. If your partitioner is set up correctly, then all you have to do is concatenate the files together, if you even need to do that. Look at TotalOrderPartitioner. It should do what you want.

I know about that partitioner. The trouble I have is coming up with a partitioning that "evenly" balances the data for this specific problem. Taking a sample and basing the partitioning on that (like the one used in TeraSort) wouldn't help. The data has a special distribution...

Niels Basjes

--Bobby Evans

On 2/28/12 2:10 PM, "Niels Basjes" <ni...@basjes.nl> wrote:

Hi,

We have a job that outputs a set of files that are several hundred MB of text each. Using the comparators and such, we can produce output files that are each sorted by themselves.
What we want is to have one giant output file (outside of the cluster) that is sorted. Now we see the following options:

1) Run the last job with 1 reducer. This is not really an option because that would put a significant part of the processing time through 1 CPU (this would take too long).
2) Create an additional job that sorts the existing files and has 1 reducer.
3) Download all of the files and run the standard command-line tool "sort -m".
4) Install HDFS fuse and run the standard command-line tool "sort -m".
5) Create a Hadoop-specific tool that can do "hadoop fs -text" and "sort -m" in one go.

During our discussion we were wondering: What is the best way of doing this? What do you recommend?
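For reference, the merge step that options 3 to 5 rely on can be sketched in a few lines of Python. This is only an illustration of a streaming k-way merge over files that are already sorted, not a replacement for sort -m and not Hadoop code. The function name merge_sorted_files is hypothetical. The sketch assumes plain text with newline-terminated lines and byte-wise (code point) ordering, roughly what sort gives under LC_ALL=C; a job that sorted with a custom comparator would need a matching key function.

```python
import heapq


def merge_sorted_files(input_paths, output_path):
    """Merge already-sorted text files into one sorted file.

    Only one pending line per input file is held in memory, so the
    inputs can be much larger than RAM.
    """
    files = [open(path, "r", encoding="utf-8") for path in input_paths]
    try:
        with open(output_path, "w", encoding="utf-8") as out:
            # heapq.merge lazily yields the smallest pending line each time.
            # Comparing without the trailing newline keeps the ordering
            # consistent for lines that contain control characters.
            merged = heapq.merge(*files, key=lambda line: line.rstrip("\n"))
            out.writelines(merged)
    finally:
        for handle in files:
            handle.close()
```

Because the merge is a single sequential pass over each input, its cost is dominated by reading and writing the data, which is why where the files live (local disk, HDFS fuse, or a stream from the cluster) matters more than the merge itself.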