The implied reason is that it would put a bottleneck on I/O: to write one file (with current HDFS semantics), all of the bytes for that file have to pass through a single host. So you can have N reduces writing to HDFS in parallel, or you can have one output file written from one machine. It also means, in the current implementation, that you must have enough room (x2 or x3 at this point) for that whole output file on a single drive of a single machine.
If your output is being read from Java, it's pretty easy to make your next process read all of the output files in parallel. I've even done this when generating MapFiles from jobs; there is already code in place to make it work. Alternately, you can force there to be a single reducer in the job settings, as in the sketch below.
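For the single-reducer route, here's a minimal sketch (assuming the org.apache.hadoop.mapred API; the class name, job name, and mapper/reducer setup are placeholders, not anything from your job):

  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class SingleOutputJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(SingleOutputJob.class);
      conf.setJobName("single-output");

      // ... set mapper, reducer, and input/output paths as usual ...

      // Force exactly one reduce task so all output lands in one part file.
      // Note this funnels the whole reduce output through a single machine,
      // which is exactly the bottleneck described above.
      conf.setNumReduceTasks(1);

      JobClient.runJob(conf);
    }
  }

Keep in mind that a single reducer trades the parallel write for the convenience of one file, so it's only reasonable when the output is fairly small.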
On 7/19/06, Thomas FRIOL <[EMAIL PROTECTED]> wrote:

Hi all,

Each reduce task produces one part file in the DFS. Why doesn't the job tracker merge them at the end of the job to produce only one file? It seems to me that it would make processing the results easier. I think there is certainly a reason for the current behavior, but I really need to get the results of my map reduce job in a single file. Maybe someone can give me a clue to solve my problem.

Thanks for any help.

Thomas.

--
Thomas FRIOL
Eclipse Developer
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel: +33 (0)561 000 653
Mobile: +33 (0)609 704 810
Fax: +33 (0)561 005 146
www.anyware-tech.com
-- Bryan A. Pendleton Ph: (877) geek-1-bp
