The downside of this (which appears to be the only way) is that your
entire input data set has to pass through the identity mapper and then
go through the shuffle and sort before it reaches the reducer.
If you have a large input data set, this costs real resources: CPU,
disk, network, and wall-clock time.
What we have been doing is building MapFiles of our data sets and
running the join code on them, which gives us reduce-equivalent
capability in the mapper.
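To illustrate the idea (this is a plain-Java sketch, not the Hadoop MapFile API, and the data-set names are made up): because MapFiles keep records sorted by key and support indexed lookups, two of them can be joined inside the mapper with a single pass, avoiding the shuffle and sort entirely.

```java
import java.util.*;

// Plain-Java sketch of the idea behind a map-side join: when both data
// sets are already sorted by key (as Hadoop MapFiles are), the join can
// be done in the mapper with key lookups, no shuffle/sort needed.
// The sample data and class name here are illustrative assumptions.
public class MapSideJoinSketch {

    // Join two key-sorted data sets on their common keys.
    static List<String> mapSideJoin(SortedMap<String, String> left,
                                    SortedMap<String, String> right) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            // Indexed lookup, analogous to MapFile.Reader.get()
            String match = right.get(e.getKey());
            if (match != null) {
                joined.add(e.getKey() + "\t" + e.getValue() + "\t" + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        SortedMap<String, String> users = new TreeMap<>(
            Map.of("u1", "alice", "u2", "bob"));
        SortedMap<String, String> orders = new TreeMap<>(
            Map.of("u2", "order-9"));
        // Only the key present in both sets ("u2") survives the join.
        System.out.println(mapSideJoin(users, orders));
    }
}
```

The key property is that no repartitioning is needed: since both inputs share the same sort order, the mapper can produce the joined output directly.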
Richard Tomsett wrote:
Leandro Alvim wrote:
How can I use only a reduce without a map?
I don't know if there's a way to run just a reduce task without a map
stage, but you could achieve the same effect by using the
IdentityMapper class for the map stage (it passes the data through to
the reducers unchanged), so you are effectively doing just a reduce.
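To make the pass-through behavior concrete, here is a plain-Java sketch (not the Hadoop API; the class and method names are illustrative) of what an identity map followed by the shuffle and sort produces: each record is emitted unchanged, and the reducer then sees the original input grouped and sorted by key.

```java
import java.util.*;
import java.util.stream.*;

// Plain-Java illustration of an identity map plus a toy shuffle/sort.
// Records pass through the "mapper" untouched, then arrive at the
// "reducer" grouped by key in sorted order -- which is why an
// IdentityMapper stage effectively gives you a reduce-only job.
public class IdentityMapSketch {

    // The identity "mapper": emits each (key, value) pair as-is.
    static Map.Entry<String, String> identityMap(String key, String value) {
        return Map.entry(key, value);
    }

    // A toy "shuffle and sort": group mapper output by key, sorted.
    static SortedMap<String, List<String>> shuffle(
            List<Map.Entry<String, String>> mapped) {
        SortedMap<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> e : mapped) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(e.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = List.of(
            Map.entry("b", "2"), Map.entry("a", "1"), Map.entry("a", "3"));
        List<Map.Entry<String, String>> mapped = input.stream()
            .map(e -> identityMap(e.getKey(), e.getValue()))
            .collect(Collectors.toList());
        // Prints {a=[1, 3], b=[2]}: the untouched input, grouped by key.
        System.out.println(shuffle(mapped));
    }
}
```

The cost Jason describes is visible here too: even though the map does nothing, every record still has to flow through it and through the grouping step before any reduce logic can run.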
--
Jason Venner
Attributor - Program the Web <http://www.attributor.com/>
Attributor is hiring Hadoop Wranglers and coding wizards, contact if
interested