Hi,

Can I restrict the output of mappers running on a node to go to reducer(s)
running on the same node?

Let me explain why I want to do this:

I am converting a huge number of XML files into SequenceFiles. In theory I
don't even need reducers: the mappers would read the XML files and write
SequenceFiles directly. The problem with this approach is that I end up
with a huge number of small output files.

To avoid generating a large number of small files, I can use Identity
reducers. But by running reducers, I am unnecessarily transferring data over
the network. I ran a test on a small subset of my data (~90GB). As a map-only
job, the conversion finished on my cluster in only 6 minutes. With map and
Identity reducers, it takes around 38 minutes.

I have to process close to a terabyte of data, so I was thinking of faster
alternatives:

* Writing a custom OutputFormat
* Somehow restrict the output of mappers running on a node to go to reducers
running on the same node. Maybe I can write my own partitioner (simple), but
I am not sure how Hadoop's framework assigns partitions to reduce tasks.
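
To make the partitioner idea concrete, here is a minimal sketch in plain Java
(no Hadoop classes, so it is not a real Partitioner). The class name, the node
names and the nodePartition helper are all hypothetical and only for
illustration. hashPartition mirrors what the default HashPartitioner computes;
nodePartition is what I would like instead, assuming the mapper can find out
which host it is running on and that the list of nodes is fixed and known in
advance.

```java
import java.util.Arrays;
import java.util.List;

public class NodeLocalPartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner:
    // a non-negative hash of the key, modulo the number of reduce tasks.
    public static int hashPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical alternative: ignore the key and use the index of the
    // mapper's host in a fixed node list as the partition number.
    // Assumes one reduce task per node.
    public static int nodePartition(String mapperHost, List<String> nodes) {
        int index = nodes.indexOf(mapperHost);
        if (index < 0) {
            throw new IllegalArgumentException("unknown host: " + mapperHost);
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up host names standing in for the cluster's nodes.
        List<String> nodes = Arrays.asList("node-a", "node-b", "node-c");

        // Default behaviour: records spread over all reducers by key hash.
        System.out.println(hashPartition("some-file.xml", nodes.size()));

        // Desired behaviour: every record from node-b lands in partition 1.
        System.out.println(nodePartition("node-b", nodes));
    }
}
```

This only decides the partition number. What I don't know is whether the
reduce task for partition 1 can be made to run on node-b, which is the part
the sketch cannot express.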

Any pointers?

Or is this not possible at all?

Thanks,
Tarandeep
