Newbie - question - how do I use Hadoop to sort a very large file

Steve Lewis Wed, 23 Jun 2010 10:15:45 -0700

Assume I have a large file called *BigData.unsorted* (say 500 GB)
consisting of lines of text, and that these lines are in random order.
I understand how to assign a key to each line, and that Hadoop will pass
the lines to each reducer in order of that key.


Now assume I want a single file called *BigData.sorted* with the lines in
the order of the keys.
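To make the goal concrete, here is a tiny single-machine Python model of the result I am after. It is not Hadoop code, only a sketch of the intended behaviour: each line gets a key, the lines are ordered by key, and only the lines (not the keys) end up in one gzip-compressed file. The names `assign_key` and `sort_lines_to_gzip` are made up for illustration, and using the first whitespace-separated field as the key is purely an assumption.

```python
import gzip
import os
import tempfile


def assign_key(line):
    # Hypothetical key choice: the first whitespace-separated field.
    fields = line.split(None, 1)
    return fields[0] if fields else ""


def sort_lines_to_gzip(lines, out_path):
    # "Map" step: pair every line with its key.
    keyed = [(assign_key(line), line) for line in lines]
    # "Shuffle" step: order the pairs by key (stable, so ties keep input order).
    keyed.sort(key=lambda pair: pair[0])
    # "Reduce" step: write only the line, dropping the key, to one gzip file.
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for _key, line in keyed:
            out.write(line + "\n")


if __name__ == "__main__":
    sample = ["pear 3", "apple 1", "mango 2"]
    path = os.path.join(tempfile.mkdtemp(), "BigData.sorted.gz")
    sort_lines_to_gzip(sample, path)
    with gzip.open(path, "rt", encoding="utf-8") as f:
        print(f.read(), end="")
```

This obviously only works when everything fits in memory on one machine, which a 500 GB file will not; the question is how to get the same end result out of a Hadoop job.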

I think I understand how to get the files part-00000, part-00001, ... but not:
1) How I get just the lines from the reducer, not the keys.
2) How I make the reducer generate a file with the name that I want,
"BigData.sorted".
3) How I get a single output file without using a single reducer instance,
or whether a single reducer is the right choice for this task.

Also, it would be very nice if the output of the reducer were compressed,
say BigData.sorted.gz.

Any suggestions?
--
Steven M. Lewis PhD
Institute for Systems Biology
Seattle WA
