Hi Robert,

On Tue, Feb 28, 2012 at 21:41, Robert Evans <ev...@yahoo-inc.com> wrote:
> I would recommend that you do what terasort does and use a different
> partitioner, to ensure that all keys within a given range will go to a
> single reducer. If your partitioner is set up correctly then all you have
> to do is to concatenate the files together, if you even need to do that.
>
> Look at TotalOrderPartitioner. It should do what you want.

I know about that partitioner. The trouble I have is coming up with a
partitioning that "evenly" balances the data for this specific problem.
Taking a sample and basing the partitioning on that (like the sampling used
in terasort) wouldn't help; the data has a special distribution...

Niels Basjes

> --Bobby Evans
>
> On 2/28/12 2:10 PM, "Niels Basjes" <ni...@basjes.nl> wrote:
>
> Hi,
>
> We have a job that outputs a set of files that are several hundred MB of
> text each.
> Using the comparators and such we can produce output files that are each
> sorted by themselves.
>
> What we want is to have one giant output file (outside of the cluster)
> that is sorted.
>
> Now we see the following options:
> 1) Run the last job with 1 reducer. This is not really an option because
> that would put a significant part of the processing time through 1 CPU
> (this would take too long).
> 2) Create an additional job that sorts the existing files and has 1
> reducer.
> 3) Download all of the files and run the standard command-line tool
> "sort -m".
> 4) Install HDFS FUSE and run the standard command-line tool "sort -m".
> 5) Create a Hadoop-specific tool that can do "hadoop fs -text" and
> "sort -m" in one go.
>
> During our discussion we were wondering: what is the best way of doing
> this? What do you recommend?
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
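
If the key distribution is known, Bobby's approach can still work without
the sampling step: a hand-written Partitioner with fixed split points gives
the same "just concatenate the part files afterwards" property as
TotalOrderPartitioner. A minimal sketch, assuming Text keys; the class name
and the split points below are hypothetical placeholders, to be derived
from whatever is known about the distribution:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: range partitioner with hand-picked split points, for the case
// where sampling is useless because the distribution is already known.
public class KnownDistributionPartitioner extends Partitioner<Text, Text> {

    // Upper bounds (exclusive) of the key range for every reducer except
    // the last one. These values are placeholders; the job must run with
    // numPartitions == SPLIT_POINTS.length + 1 (here: 4 reducers).
    private static final Text[] SPLIT_POINTS = {
        new Text("f"), new Text("m"), new Text("t")
    };

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // The first split point greater than the key decides the
        // partition; keys >= the last split point go to the last reducer.
        for (int i = 0; i < SPLIT_POINTS.length; i++) {
            if (key.compareTo(SPLIT_POINTS[i]) < 0) {
                return i;
            }
        }
        return SPLIT_POINTS.length;
    }
}

With job.setPartitionerClass(KnownDistributionPartitioner.class) and
job.setNumReduceTasks(4), reducer i receives exactly the i-th key range, so
the individually sorted part files concatenate into one globally sorted
file.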
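
Option 5 from the original mail can also stay small if the output is plain
text (one record per line). A sketch, assuming the part files are already
individually sorted with the same ordering as String.compareTo; the class
name and the argument handling are hypothetical:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.PriorityQueue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: stream the sorted part files from HDFS and merge them with a
// priority queue ("sort -m" style), writing one sorted stream to stdout.
public class HdfsSortMerge {

    // One open reader plus its current line, ordered by that line.
    static final class Head implements Comparable<Head> {
        final BufferedReader reader;
        String line;
        Head(BufferedReader reader, String line) {
            this.reader = reader;
            this.line = line;
        }
        public int compareTo(Head other) {
            return line.compareTo(other.line);
        }
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        PriorityQueue<Head> queue = new PriorityQueue<Head>();

        // args[0] is the job output directory on HDFS.
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue;
            }
            BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(status.getPath()), "UTF-8"));
            String first = reader.readLine();
            if (first != null) {
                queue.add(new Head(reader, first));
            }
        }

        // Repeatedly emit the smallest current line, then refill the queue
        // from the file that line came from.
        while (!queue.isEmpty()) {
            Head smallest = queue.poll();
            System.out.println(smallest.line);
            String next = smallest.reader.readLine();
            if (next != null) {
                smallest.line = next;
                queue.add(smallest);
            } else {
                smallest.reader.close();
            }
        }
    }
}

Run it with "hadoop jar" against the output directory and redirect stdout
to the final file: one pass over the data, no extra MapReduce job, and no
single-reducer bottleneck.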