I am wondering: does Hadoop always respect Job.setNumReduceTasks(int)?

As I emit items from the mapper, I expect (and want) only one reducer to
receive them, because I want to assign each key of the key-value input
pairs a unique integer id. With a single reducer, I can simply keep a local
counter (scoped to that reducer instance) and increment it.

On my local Hadoop cluster, I noticed that most, if not all, of my jobs have
only one reducer, regardless of whether or not I set
Job.setNumReduceTasks(int).

However, as soon as I moved the code onto Amazon's Elastic MapReduce (EMR),
I noticed that there are multiple reducers. If I set the number of reduce
tasks to 1, is that always guaranteed? I ask because I don't know whether
there is a gotcha like the combiner (which may or may not run at all).
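
For reference, here is roughly how I configure the job in the driver (a
simplified sketch; the class and job names are placeholders, and the
mapper/reducer classes are the ones sketched further down):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordIdDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "assign word ids");
        job.setJarByClass(WordIdDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class); // input is a sequence file of (Text, Text)
        job.setMapperClass(WordFrequencyMapper.class);          // placeholder, sketched further down
        job.setReducerClass(IdAssignReducer.class);             // placeholder, sketched further down
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(1);                               // the setting in question
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}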

Also, it looks like having just one reducer might not be a good idea (it
won't scale). It is probably better to have more than one reducer, but in
that case I lose the ability to assign unique numbers to the incoming
key-value pairs. Is there a design pattern out there that addresses this
issue?
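
The only workaround I have thought of so far (I don't know if it is a
recognized pattern) is to fold the reducer's own partition number into the
id: put it in the high bits and keep a local counter in the low bits, so
ids from different reducers cannot collide. A rough sketch of that idea,
with made-up class name and bit widths:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch only: each reducer builds ids from (its partition number, a local
// counter), so ids are unique across reducers but no longer consecutive.
public class PartitionedIdReducer extends Reducer<Text, IntWritable, IntWritable, Text> {
    private int partition;   // which reduce task this is (0 .. numReduceTasks - 1)
    private int counter = 0; // local to this reducer instance

    @Override
    protected void setup(Context context) {
        partition = context.getTaskAttemptID().getTaskID().getId();
    }

    @Override
    protected void reduce(Text word, Iterable<IntWritable> frequencies, Context context)
            throws IOException, InterruptedException {
        // assumes fewer than 2^8 reducers and fewer than 2^23 words per reducer
        int id = (partition << 23) | counter++;
        context.write(new IntWritable(id), word);
    }
}

The ids are then unique across reducers but not consecutive, which may or
may not matter for my use case.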

My mapper/reducer key-value pair signatures look something like the
following.

mapper(Text, Text, Text, IntWritable)
reducer(Text, IntWritable, IntWritable, Text)

The mapper reads a sequence file whose key-value pairs are of type Text and
Text. I then emit Text (say, a word) and IntWritable (say, the word's
frequency).

The reducer gets the word and its frequencies, and then assigns the word an
integer id. It emits IntWritable (the id) and Text (the word).
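
Put into code, the mapper and reducer look roughly like this (a sketch with
placeholder class names, each class in its own file; how the frequency is
actually computed is omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (word, frequency) for each (Text, Text) record in the sequence file.
// Pretend the record's value already holds the count; the real logic is omitted.
public class WordFrequencyMapper extends Mapper<Text, Text, Text, IntWritable> {
    @Override
    protected void map(Text word, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(word, new IntWritable(Integer.parseInt(value.toString())));
    }
}

// Assigns each word the next integer id. The counter lives in this reducer
// instance, so the ids are only globally unique if there is exactly one reducer.
public class IdAssignReducer extends Reducer<Text, IntWritable, IntWritable, Text> {
    private int nextId = 0;

    @Override
    protected void reduce(Text word, Iterable<IntWritable> frequencies, Context context)
            throws IOException, InterruptedException {
        context.write(new IntWritable(nextId++), new Text(word));
    }
}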

I remember seeing code in Mahout's API where they assign integer ids to
items. The items were already given an id of type long. The conversion they
make is as follows.

// folds the 64-bit id down to 32 bits (upper half XOR lower half) and clears
// the sign bit so the result is a non-negative int
public static int idToIndex(long id) {
    return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
}

Is there something equivalent for Text or a "word"? I was thinking about
simply taking the hash value of the string/word, but of course, different
strings can map to the same hash value.
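
To make that concrete, the closest analog to idToIndex I can think of for a
word is something like the sketch below; it clears the sign bit the same
way, but distinct words can still collide:

// Sketch only: derive a non-negative int from a word via String.hashCode().
// Different words can still map to the same value, which is exactly my concern.
public static int wordToIndex(String word) {
    return 0x7FFFFFFF & word.hashCode();
}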
