Hi,

Can I restrict the output of mappers running on a node so that it goes only to reducer(s) running on the same node?
Let me explain why I want to do this. I am converting a huge number of XML files into SequenceFiles. In theory I don't even need reducers: the mappers would read the XML files and write SequenceFiles directly. The problem with that approach is that I end up with a huge number of small output files.

To avoid generating a large number of small files, I can use Identity reducers. But by running reducers, I am unnecessarily transferring data over the network. I ran a test on a small subset of my data (~90 GB). With a map-only job, my cluster finished the conversion in only 6 minutes, but with map plus Identity reducers it takes around 38 minutes. I have to process close to a terabyte of data.

So I was thinking of faster alternatives:

* Writing a custom OutputFormat
* Somehow restricting the output of mappers running on a node to go to reducers running on the same node. Maybe I can write my own partitioner (simple), but I am not sure how Hadoop's framework assigns partitions to reduce tasks.

Any pointers? Or is this not possible at all?

Thanks,
Tarandeep
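
P.S. To make the partitioner idea concrete, here is a minimal sketch in plain Java (no Hadoop dependency) of what I have in mind. `hashPartition` uses the same arithmetic as Hadoop's default HashPartitioner. `hostPartition`, the class name, and the host list are hypothetical names made up for illustration. As far as I understand, a partitioner only chooses a partition index, and partition i is consumed by reduce task i. Which node that reduce task runs on is decided by the scheduler, so this alone would not pin data to the local node.

```java
// Sketch only: plain Java, not a real Hadoop Partitioner subclass.
class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner:
    // mask off the sign bit, then take the remainder by the reducer count.
    static int hashPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical node-aware variant: the partition index is the position
    // of the mapper's host in a fixed host list (one reducer per host).
    // ASSUMPTION: reduce task i would actually run on hosts[i]; nothing
    // in this code enforces that.
    static int hostPartition(String mapperHost, String[] hosts) {
        for (int i = 0; i < hosts.length; i++) {
            if (hosts[i].equals(mapperHost)) {
                return i;
            }
        }
        throw new IllegalArgumentException("unknown host: " + mapperHost);
    }

    public static void main(String[] args) {
        String[] hosts = {"node1", "node2", "node3"}; // made-up host names
        System.out.println(hashPartition("abc", 4));
        System.out.println(hostPartition("node2", hosts));
    }
}
```

In a real job the second method would have to live inside a Partitioner implementation and find out the mapper's host name somehow. The open question remains whether reduce task i can be made to run on a particular node.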