Hi Robert,

On Tue, Feb 28, 2012 at 21:41, Robert Evans <ev...@yahoo-inc.com> wrote:
> I would recommend that you do what terasort does and use a different
> partitioner, to ensure that all keys within a given range will go to a
> single reducer. If your partitioner is set up correctly then all you have
> to do is to concatenate the files together, if you even need to do that.
>
> Look at TotalOrderPartitioner. It should do what you want.

I know about that partitioner. The trouble I have is coming up with a
partitioning that "evenly" balances the data for this specific problem.
Taking a sample and basing the partitioning on that (like the sampling used
in terasort) wouldn't help; the data has a special distribution...

Niels Basjes

> --Bobby Evans
>
> On 2/28/12 2:10 PM, "Niels Basjes" <ni...@basjes.nl> wrote:
>
> Hi,
>
> We have a job that outputs a set of files that are several hundred MB of
> text each.
> Using the comparators and such we can produce output files that are each
> sorted by themselves.
>
> What we want is to have one giant output file (outside of the cluster)
> that is sorted.
>
> Now we see the following options:
> 1) Run the last job with 1 reducer. This is not really an option because
> that would put a significant part of the processing time through 1 CPU
> (this would take too long).
> 2) Create an additional job that sorts the existing files and has 1
> reducer.
> 3) Download all of the files and run the standard command-line tool
> "sort -m".
> 4) Install HDFS FUSE and run the standard command-line tool "sort -m".
> 5) Create a Hadoop-specific tool that can do "hadoop fs -text" and
> "sort -m" in one go.
>
> During our discussion we were wondering: what is the best way of doing
> this? What do you recommend?
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
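
If the key distribution is known, Bobby's approach can still work without
the sampling step: a hand-written Partitioner with fixed split points gives
the same "just concatenate the part files afterwards" property as
TotalOrderPartitioner. A minimal sketch, assuming Text keys; the class name
and the split points below are hypothetical placeholders, to be derived
from whatever is known about the distribution:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: range partitioner with hand-picked split points, for the case
// where sampling is useless because the distribution is already known.
public class KnownDistributionPartitioner extends Partitioner<Text, Text> {

    // Upper bounds (exclusive) of the key range for every reducer except
    // the last one. These values are placeholders; the job must run with
    // numPartitions == SPLIT_POINTS.length + 1 (here: 4 reducers).
    private static final Text[] SPLIT_POINTS = {
        new Text("f"), new Text("m"), new Text("t")
    };

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // The first split point greater than the key decides the
        // partition; keys >= the last split point go to the last reducer.
        for (int i = 0; i < SPLIT_POINTS.length; i++) {
            if (key.compareTo(SPLIT_POINTS[i]) < 0) {
                return i;
            }
        }
        return SPLIT_POINTS.length;
    }
}

With job.setPartitionerClass(KnownDistributionPartitioner.class) and
job.setNumReduceTasks(4), reducer i receives exactly the i-th key range, so
the individually sorted part files concatenate into one globally sorted
file.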
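
Option 5 from the original mail can also stay small if the output is plain
text (one record per line). A sketch, assuming the part files are already
individually sorted with the same ordering as String.compareTo; the class
name and the argument handling are hypothetical:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.PriorityQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: stream the sorted part files from HDFS and merge them with a
// priority queue ("sort -m" style), writing one sorted stream to stdout.
public class HdfsSortMerge {

    // One open reader plus its current line, ordered by that line.
    static final class Head implements Comparable<Head> {
        final BufferedReader reader;
        String line;
        Head(BufferedReader reader, String line) {
            this.reader = reader;
            this.line = line;
        }
        public int compareTo(Head other) {
            return line.compareTo(other.line);
        }
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        PriorityQueue<Head> queue = new PriorityQueue<Head>();

        // args[0] is the job output directory on HDFS.
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue;
            }
            BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(status.getPath()), "UTF-8"));
            String first = reader.readLine();
            if (first != null) {
                queue.add(new Head(reader, first));
            }
        }

        // Repeatedly emit the smallest current line, then refill the queue
        // from the file that line came from.
        while (!queue.isEmpty()) {
            Head smallest = queue.poll();
            System.out.println(smallest.line);
            String next = smallest.reader.readLine();
            if (next != null) {
                smallest.line = next;
                queue.add(smallest);
            } else {
                smallest.reader.close();
            }
        }
    }
}

Run it with "hadoop jar" against the output directory and redirect stdout
to the final file: one pass over the data, no extra MapReduce job, and no
single-reducer bottleneck.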