The downside of this (which appears to be the only way) is that your entire input data set has to pass through the identity mapper and then go through the shuffle and sort before it reaches the reducer. If you have a large input data set, this takes real resources: CPU, disk, network, and wall-clock time.

What we have been doing is building MapFiles of our data sets and running the join code against them in the mapper, which gives us reduce-equivalent capability in the map phase.
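For what it's worth, here's a minimal sketch of that map-side lookup approach, assuming the old org.apache.hadoop.mapred API and Text keys/values; the MapFile path and the tab-separated output are placeholders, not the actual Attributor code:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapSideJoinMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private MapFile.Reader sideData;
  private final Text sideValue = new Text();

  @Override
  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // Hypothetical location of the pre-built MapFile for the other data set.
      sideData = new MapFile.Reader(fs, "/data/joinside/part-00000", job);
    } catch (IOException e) {
      throw new RuntimeException("cannot open side MapFile", e);
    }
  }

  public void map(Text key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Both data sets are keyed the same way, so a direct lookup in the
    // MapFile joins the two records without any shuffle or reduce.
    if (sideData.get(key, sideValue) != null) {
      output.collect(key, new Text(value + "\t" + sideValue));
    }
  }

  @Override
  public void close() throws IOException {
    sideData.close();
  }
}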

Richard Tomsett wrote:
Leandro Alvim wrote:
How can I use only a reducer, without a mapper?

I don't know if there's a way to run just a reduce task without a map stage, but you could get the same effect by using the IdentityMapper class for the map stage (it passes the data through to the reducers unchanged), so you are effectively just doing a reduce.
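A minimal sketch of a job wired up this way, again assuming the old org.apache.hadoop.mapred API; the pass-through MyReducer and the command-line paths are just placeholders for whatever reduce logic you actually need:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ReduceOnlyJob {

  // Placeholder reducer: emits every value unchanged. Real work goes here.
  public static class MyReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        output.collect(key, values.next());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ReduceOnlyJob.class);
    conf.setJobName("reduce-only");

    // Map stage passes records straight through to the shuffle;
    // all of the logic lives in the reducer.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(MyReducer.class);

    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}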
--
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if interested
