Hi,

Regarding getting part-00000, part-00001, ... joined together, assuming the files are numbered in order, you can use:
hadoop fs -getmerge

This is used to concatenate the files. See the following URL for details:

http://hadoop.apache.org/common/docs/current/hdfs_shell.html#getmerge

As for removing the keys, if the file is tab-separated you could remove them using the unix/linux 'cut' command, e.g.:

cut -f2,3,4 file.txt

This will give you the 2nd, 3rd and 4th columns from file.txt. I don't know if there's a similar command for Windows, though.

Regards,

James

On Wed, Jun 23, 2010 at 6:15 PM, Steve Lewis <[email protected]> wrote:

> Assume I have a large file called *BigData.unsorted* (say 500GB)
> consisting of lines of text. Assume that these lines are in random order -
> I understand how to assign a key to lines and that Hadoop will pass the
> lines to my reducers in order of that key.
>
> Now assume I want a single file called *BigData.sorted* with the lines in
> the order of the keys.
>
> I think I understand how to get files part00000, part00001, ... but not:
> 1) How I get just the lines from the reducer, not the keys
> 2) How I make the reducer generate a file with the name that I want,
> "BigData.sorted"
> 3) How, without using a single reducer instance, I get a single output
> file - or is a single reducer the right choice for this task.
>
> Also it would be very nice if the output of the reducer were compressed -
> say BigData.sorted.gz
>
> Any suggestions?
>
> --
> Steven M. Lewis PhD
> Institute for Systems Biology
> Seattle WA

--
James Hammerton | Senior Data Mining Engineer
www.mendeley.com/profiles/james-hammerton

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015
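P.S. A minimal sketch of the whole pipeline (merge, strip the key column, gzip), run locally with plain `cat` standing in for `hadoop fs -getmerge` so it can be tried without a cluster. The part-file names, keys and column contents below are invented for illustration; on a real cluster the first step would be `hadoop fs -getmerge <hdfs-output-dir> BigData.sorted.tsv` instead:

```shell
# Fake tab-separated reducer output: key in column 1, data in columns 2-3.
# (Sample contents are invented for illustration.)
printf 'k1\talpha\tone\nk2\tbeta\ttwo\n' > part-00000
printf 'k3\tgamma\tthree\n' > part-00001

# On a cluster this would be: hadoop fs -getmerge <hdfs-dir> BigData.sorted.tsv
# Locally, cat does the same concatenation in file-name order.
cat part-00000 part-00001 > BigData.sorted.tsv

# Drop the key (field 1), keep everything from field 2 onwards, and compress.
cut -f2- BigData.sorted.tsv | gzip > BigData.sorted.gz

# Inspect the result: the three data rows, keys removed.
gunzip -c BigData.sorted.gz
```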
