I would recommend that you do what TeraSort does and use a different
partitioner, so that all keys within a given range go to a single
reducer.  If your partitioner is set up correctly then all you have to do is
concatenate the output files together, if you even need to do that.

Look at TotalOrderPartitioner.  It should do what you want.
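Roughly, the driver setup would look something like the sketch below (old
org.apache.hadoop.mapred API; the SequenceFile input format, paths, sampler
parameters and reducer count are placeholders for whatever your job actually
uses, and map/reduce stay the identity defaults so the global ordering comes
from the shuffle):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class TotalSortSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TotalSortSketch.class);
    conf.setJobName("total order sort");

    // Identity map and reduce (the JobConf defaults); the sorting happens in the shuffle.
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(20);  // pick to match your cluster

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Sample the input keys to pick split points, then have TotalOrderPartitioner
    // send each key range to exactly one reducer.
    conf.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(conf, new Path(args[1] + "_partitions"));
    InputSampler.writePartitionFile(conf,
        new InputSampler.RandomSampler<Text, Text>(0.01, 1000, 10));

    JobClient.runJob(conf);
  }
}

With that in place the part-00000, part-00001, ... files are already in global
key order, so "hadoop fs -getmerge <outdir> <localfile>" (or simply cat-ing the
parts in order) gives you the single sorted file.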

--Bobby Evans

On 2/28/12 2:10 PM, "Niels Basjes" <ni...@basjes.nl> wrote:

Hi,

We have a job that outputs a set of files, each several hundred MB of text.
Using custom comparators and the like we can produce output files that are
each sorted internally.

What we want is one giant output file (outside of the cluster) that is sorted
as a whole.

Now we see the following options:
1) Run the last job with 1 reducer. This is not really an option because it
would push a significant part of the processing through a single CPU, which
would take too long.
2) Create an additional job that sorts the existing files and has 1 reducer.
3) Download all of the files and run the standard command-line tool "sort -m".
4) Install HDFS FUSE and run the standard command-line tool "sort -m".
5) Create a Hadoop-specific tool that does "hadoop fs -text" and "sort -m"
in one go (sketched below).
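For option 5, a rough sketch of what we have in mind (assuming the part files
are plain uncompressed text, so it reduces to a k-way merge of already-sorted
streams; the class name and arguments are made up):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.PriorityQueue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMergeSort {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // One open reader per sorted part file, keyed by its current line.
    PriorityQueue<Src> heap = new PriorityQueue<Src>();
    for (FileStatus st : fs.listStatus(new Path(args[0]))) {
      if (!st.getPath().getName().startsWith("part-")) continue;
      BufferedReader r = new BufferedReader(
          new InputStreamReader(fs.open(st.getPath()), "UTF-8"));
      String line = r.readLine();
      if (line != null) heap.add(new Src(line, r));
    }
    // Repeatedly emit the globally smallest line, then refill from that reader.
    while (!heap.isEmpty()) {
      Src s = heap.poll();
      System.out.println(s.line);
      String next = s.reader.readLine();
      if (next != null) { s.line = next; heap.add(s); } else { s.reader.close(); }
    }
  }

  static class Src implements Comparable<Src> {
    String line;
    BufferedReader reader;
    Src(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
    public int compareTo(Src o) { return line.compareTo(o.line); }
  }
}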

During our discussion we were wondering: What is the best way of doing this?
What do you recommend?
