Hi,

Regarding getting part-00000, part-00001, ... joined together, assuming the files are numbered in order, you can use:
hadoop fs -getmerge

This is used to concatenate the files. See the following URL for details:

http://hadoop.apache.org/common/docs/current/hdfs_shell.html#getmerge

As for removing the keys, if the file is tab-separated you could remove them using the unix/linux 'cut' command, e.g.:

cut -f2,3,4 file.txt

This will give you the 2nd, 3rd and 4th columns from file.txt. I don't know if there's a similar command for Windows, though.

Regards,

James

On Wed, Jun 23, 2010 at 6:15 PM, Steve Lewis <[email protected]> wrote:

> Assume I have a large file called *BigData.unsorted* (say 500GB)
> consisting of lines of text. Assume that these lines are in random order -
> I understand how to assign a key to lines and that Hadoop will pass the
> lines to my reducers in order of that key.
>
> Now assume I want a single file called *BigData.sorted* with the lines in
> the order of the keys.
>
> I think I understand how to get files part00000, part00001, ... but not:
> 1) How I get just the lines from the reducer, not the keys
> 2) How I make the reducer generate a file with the name that I want,
> "BigData.sorted"
> 3) How, without using a single reducer instance, I get a single output
> file - or is a single reducer the right choice for this task.
>
> Also it would be very nice if the output of the reducer were compressed -
> say BigData.sorted.gz
>
> Any suggestions?
>
> --
> Steven M. Lewis PhD
> Institute for Systems Biology
> Seattle WA

--
James Hammerton | Senior Data Mining Engineer
www.mendeley.com/profiles/james-hammerton

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015
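P.S. A minimal sketch of the whole pipeline (merge, strip the key column, gzip), run locally with plain `cat` standing in for `hadoop fs -getmerge` so it can be tried without a cluster. The part-file names, keys and column contents below are invented for illustration; on a real cluster the first step would be `hadoop fs -getmerge <hdfs-output-dir> BigData.sorted.tsv` instead:

```shell
# Fake tab-separated reducer output: key in column 1, data in columns 2-3.
# (Sample contents are invented for illustration.)
printf 'k1\talpha\tone\nk2\tbeta\ttwo\n' > part-00000
printf 'k3\tgamma\tthree\n' > part-00001

# On a cluster this would be: hadoop fs -getmerge <hdfs-dir> BigData.sorted.tsv
# Locally, cat does the same concatenation in file-name order.
cat part-00000 part-00001 > BigData.sorted.tsv

# Drop the key (field 1), keep everything from field 2 onwards, and compress.
cut -f2- BigData.sorted.tsv | gzip > BigData.sorted.gz

# Inspect the result: the three data rows, keys removed.
gunzip -c BigData.sorted.gz
```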
