Jim,

As far as I know, the OutputFormat you use has no effect on the number of output partitions: Hadoop writes one part-NNNNN file per reduce task, so a lone part-00000 means the job ran with a single reducer.

If you want to sample your output file, I'd suggest you write a new MR job that uses a random number generator to sample your output files, and outputs text key/value pairs in the mapper, and uses exactly one reducer with the TextOutputFormat. You don't even need to supply a reducer class if your mapper outputs Text/Text key/value pairs.

-- Stefan

> From: Jim Twensky <jim.twen...@gmail.com>
> Reply-To: <core-user@hadoop.apache.org>
> Date: Sun, 11 Jan 2009 01:55:35 -0600
> To: <core-user@hadoop.apache.org>
> Subject: Merging reducer outputs into a single part-00000 file
>
> Hello, The original map-reduce paper states: "After successful completion,
> the output of the map-reduce execution is available in the R output files
> (one per reduce task, with file names as specified by the user)." However,
> when using Hadoop's TextOutputFormat, all the reducer outputs are combined in
> a single file called part-00000. I was wondering how and when this
> merging process is done. When the reducer calls output.collect(key,value), is
> this record written to a local temporary output file in the reducer's disk
> and then these local files (a total of R) are later merged into one single
> file with a final thread or is it directly written to the final output
> file (part-00000)? I am asking this because I'd like to get an ordered sample
> of the final output data, ie. one record per every 1000 records or
> something similar and I don't want to run a serial process that iterates on
> the final output file. Thanks, Jim
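
For reference, the random-sampling check suggested in the reply could look roughly like the sketch below. It is plain Java with no Hadoop dependency, so it only illustrates the keep-or-drop decision; the class name SampleSketch, the 1/1000 rate, and the fixed seed are assumptions made for illustration, not anything taken from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for the sampling mapper; not a Hadoop class.
class SampleSketch {
    // Assumed rate: keep roughly one record in every 1000.
    static final double SAMPLE_RATE = 1.0 / 1000.0;

    // Keep a record with probability `rate`. In a real job this check would
    // sit at the top of the mapper's map() method, before output.collect(...).
    static boolean keep(Random rng, double rate) {
        return rng.nextDouble() < rate;
    }

    // Simulates the map phase: each input record passes through the check,
    // and kept records retain their original relative order.
    static List<String> sample(List<String> records, double rate, long seed) {
        Random rng = new Random(seed);
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (keep(rng, rate)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 100000; i++) {
            records.add(String.format("key%06d\tvalue%d", i, i));
        }
        List<String> kept = sample(records, SAMPLE_RATE, 42L);
        System.out.println("kept " + kept.size() + " of " + records.size() + " records");
    }
}
```

In an actual job the kept records would be emitted as Text/Text pairs, and calling JobConf.setNumReduceTasks(1) would send them all to one reducer, which receives them sorted by key and writes a single part-00000. Note that this gives about one record per 1000 on average, not exactly every 1000th record.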