Niels,

I am not sure I can help with that unless I know better what "a special 
distribution" means.  Unless you are doing a massive amount of processing in 
your reducer, a partitioning that only comes close to balancing the 
distribution is a big win over all of the other options that put the data on a 
single machine and sort it there.  Even if you are doing a lot of processing in 
the reducer, or you need a special grouping to make the reduce work properly, I 
would suspect that a second map/reduce job that sorts the data with a 
close-to-balanced partitioning would still beat out all of the other options.
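
To make that concrete, here is a rough, untested sketch of what that second 
sort-only job could look like: TotalOrderPartitioner with a hand-written 
partition file, so you are not dependent on sampling.  The Text key type, the 
split points, the reducer count, and the paths are all placeholders you would 
have to pick for your data, and it assumes a Hadoop version where the new-API 
TotalOrderPartitioner (org.apache.hadoop.mapreduce.lib.partition) is available; 
on older versions the org.apache.hadoop.mapred.lib one is the equivalent:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

  public class SortJob {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();

      // Hand-written split points: with N reducers you need N-1 keys, in
      // ascending order, chosen to roughly balance YOUR distribution.
      Path partitionFile = new Path("/tmp/sort-partitions");
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Writer w = SequenceFile.createWriter(
          fs, conf, partitionFile, Text.class, NullWritable.class);
      for (String splitPoint : new String[] {"f", "m", "t"}) { // placeholders
        w.append(new Text(splitPoint), NullWritable.get());
      }
      w.close();

      Job job = new Job(conf, "total order sort");
      job.setJarByClass(SortJob.class);
      // No mapper or reducer set: the identity defaults pass records through
      // and the shuffle does the actual sorting.
      job.setInputFormatClass(SequenceFileInputFormat.class);
      job.setOutputFormatClass(SequenceFileOutputFormat.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      job.setNumReduceTasks(4); // one more than the number of split points
      job.setPartitionerClass(TotalOrderPartitioner.class);
      TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
          partitionFile);
      SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
      SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

The partitioner guarantees that reducer i gets a contiguous key range, so the 
sorted part files can simply be concatenated into one sorted whole afterwards.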

If that is still not an option, then I would run some benchmarks comparing a 
mapreduce job with a single reducer against copying the data to a single 
machine and running sort -m.  I don't have much experience with HDFS fuse, but 
from what I have seen most older versions of it have some performance issues, 
especially with the page cache.  Although sort -m really would not need the 
page cache, so it may not be a problem.  If download-then-sort is even close to 
as fast as the mapreduce job, then you might want to try option 5, because it 
would reduce the disk overhead when moving the data, and I would suspect it 
would be faster than just downloading and sorting.
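
For what it is worth, option 5 does not have to be much code.  Here is an 
untested sketch that merges the already sorted part files straight out of HDFS 
into one local file.  It assumes plain text, one record per line, sorted in 
plain lexicographic order; the class name and arguments are made up, and a real 
version would need to handle compressed or sequence file input the way 
hadoop fs -text does:

  import java.io.BufferedReader;
  import java.io.BufferedWriter;
  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.io.OutputStreamWriter;
  import java.io.Writer;
  import java.util.Comparator;
  import java.util.PriorityQueue;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Merge already-sorted HDFS part files into one sorted local file,
  // reading each part as a stream instead of downloading it first.
  public class HdfsMergeSort {

    private static class Head {
      String line;
      BufferedReader in;
      Head(String line, BufferedReader in) { this.line = line; this.in = in; }
    }

    public static void main(String[] args) throws IOException {
      // args[0] = HDFS dir with sorted part files, args[1] = local output file
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      FileStatus[] parts = fs.globStatus(new Path(args[0], "part-*"));

      // Heap ordered by the current first line of each part file.
      PriorityQueue<Head> heap = new PriorityQueue<Head>(
          Math.max(1, parts.length), new Comparator<Head>() {
            public int compare(Head a, Head b) { return a.line.compareTo(b.line); }
          });
      for (FileStatus part : parts) {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(part.getPath()), "UTF-8"));
        String first = in.readLine();
        if (first != null) { heap.add(new Head(first, in)); } else { in.close(); }
      }

      Writer out = new BufferedWriter(
          new OutputStreamWriter(new FileOutputStream(args[1]), "UTF-8"));
      while (!heap.isEmpty()) {
        Head smallest = heap.poll();  // part with the smallest current line
        out.write(smallest.line);
        out.write('\n');
        String next = smallest.in.readLine();
        if (next != null) { smallest.line = next; heap.add(smallest); }
        else { smallest.in.close(); }
      }
      out.close();
    }
  }

Because every part file is read as a stream, nothing gets spooled to local 
disk before the merge, which is where the disk overhead savings would come from.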

A disclaimer: the only real way to know what to do is to run benchmarks on all 
of these options.  Hadoop is so complex that trying to reason about exactly 
which option will be faster is very difficult.

--Bobby Evans

On 2/28/12 3:46 PM, "Niels Basjes" <ni...@basjes.nl> wrote:

Hi Robert,

On Tue, Feb 28, 2012 at 21:41, Robert Evans <ev...@yahoo-inc.com> wrote:
I would recommend that you do what terasort does and use a different 
partitioner, to ensure that all keys within a given range will go to a single 
reducer.  If your partitioner is set up correctly then all you have to do is 
concatenate the files together, if you even need to do that.

Look at TotalOrderPartitioner.  It should do what you want.
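
In case it helps, the terasort-style setup is only a few extra lines in the 
job driver.  A hedged sketch (untested, new mapreduce API; the Text key type 
and the partition file path are placeholders):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
  import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

  // In the job setup, after the input format and reducer count are configured:
  job.setPartitionerClass(TotalOrderPartitioner.class);
  TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
      new Path("/tmp/sort-partitions")); // placeholder path
  // Sample roughly 1% of the input records (at most 10000 samples) and write
  // reducer-count-minus-one split points to the partition file.
  InputSampler.writePartitionFile(job,
      new InputSampler.RandomSampler<Text, Text>(0.01, 10000));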

I know about that partitioner.
The trouble I have is coming up with a partitioning that "evenly" balances the 
data for this specific problem.
Taking a sample and basing the partitioning on that (like terasort does) 
wouldn't help.
The data has a special distribution...


Niels Basjes



--Bobby Evans


On 2/28/12 2:10 PM, "Niels Basjes" <ni...@basjes.nl> wrote:

Hi,

We have a job that outputs a set of files that are several hundred MB of text 
each.
Using the comparators and such we can produce output files that are each sorted 
by themselves.

What we want is to have one giant output file (outside of the cluster) that is 
sorted.

Now we see the following options:
1) Run the last job with 1 reducer. This is not really an option because it 
would push a significant part of the processing through a single CPU (this 
would take too long).
2) Create an additional job that sorts the existing files and has 1 reducer.
3) Download all of the files and run the standard command-line tool "sort -m".
4) Install HDFS fuse and run the standard command-line tool "sort -m".
5) Create a Hadoop-specific tool that can do "hadoop fs -text" and "sort -m" 
in one go.

During our discussion we were wondering: What is the best way of doing this?
What do you recommend?

