how to do a reduce-only job

David Hawthorne Thu, 15 Jul 2010 13:27:21 -0700

I have two previously created output files in the following format:

key[tab]value


where key is text, value is an integer sum of how many times the key appeared.

I would like to reduce these output files together into one new output file.  
I'm having problems finding out how to do this.

I've found ways to specify a job with no reducers, but it doesn't look like 
there's a way to specify a reduce-only job, aside from using the streaming 
interface with 'cat' as the mapper.  I'm not opposed to this, but I couldn't 
find a way to combine 'cat' as the mapper with the reducer from my Java 
class.  I'm also not sure this would work, as the reducer might simply see 
the entire line emitted by cat as the key.  I could use awk as the reducer, 
but I've heard that streaming is slower than Java, and I've already got the 
Java class written.  I could write another Java class with a mapper that 
splits each input line on the tab and emits the two fields as <key, value>, 
but that seems like extra work and less efficient than being able to run a 
reduce-only job.
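
To make the tab-splitting idea concrete, here is a rough plain-Java sketch of 
what that extra mapper plus the summing reducer would compute.  It has no 
Hadoop dependencies and is not a job definition; the names MergeCounts, 
parseLine and merge are made up for illustration, and it assumes every line 
is exactly key[tab]integer.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: mimics the map and reduce steps in memory.
public class MergeCounts {

    // Map-side step: split one "key[tab]value" line at the first tab.
    public static Map.Entry<String, Long> parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no tab in line: " + line);
        }
        String key = line.substring(0, tab);
        long value = Long.parseLong(line.substring(tab + 1).trim());
        return new AbstractMap.SimpleImmutableEntry<>(key, value);
    }

    // Reduce-side step: sum the values seen for each key.
    public static Map<String, Long> merge(List<String> lines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Long> kv = parseLine(line);
            totals.merge(kv.getKey(), kv.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Lines as they might appear across the two existing output files.
        List<String> lines = Arrays.asList("apple\t3", "pear\t1", "apple\t4");
        merge(lines).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

In a real job the parseLine step would live in the mapper and the summing in 
the existing reducer; the sketch only shows that the mapper itself is a few 
lines of string handling.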

So... what are the options?  Is there a way to specify a reduce-only job?
