I am wondering: does Hadoop always respect Job.setNumReduceTasks(int)? As I emit items from the mapper, I expect/want only one reducer to receive them, because I want to assign each key of the key-value input pairs a unique integer id. If I had one reducer, I could just keep a local counter (scoped to the reducer instance) and increment it.
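To make that concrete, here is a minimal sketch of what I have in mind (IdAssignReducer is just a name I made up, not real code I have); the ids it hands out are only globally unique if the job really does run with a single reduce task, i.e. job.setNumReduceTasks(1):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IdAssignReducer extends Reducer<Text, IntWritable, IntWritable, Text> {

        // local counter, scoped to this reducer instance; only yields
        // globally unique ids if the job runs with exactly one reducer
        private int counter = 0;

        @Override
        protected void reduce(Text word, Iterable<IntWritable> frequencies, Context context)
                throws IOException, InterruptedException {
            // the frequencies are ignored here; each distinct word
            // simply gets the next id in sequence
            context.write(new IntWritable(counter++), word);
        }
    }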
On my local Hadoop cluster, I noticed that most, if not all, of my jobs had only one reducer, regardless of whether or not I called Job.setNumReduceTasks(int). However, as soon as I moved the code onto Amazon's Elastic MapReduce (EMR), I noticed multiple reducers. If I set the number of reduce tasks to 1, is that always guaranteed? I ask because I don't know whether there is a gotcha like the combiner (which may or may not run at all).

It also looks like having just one reducer might not be a good idea, since it won't scale. It is most likely better to have more than one reducer, but in that case I lose the ability to assign unique numbers to the incoming key-value pairs. Is there a design pattern out there that addresses this issue?

My mapper/reducer key-value pair signatures look something like the following:

    mapper(Text, Text, Text, IntWritable)
    reducer(Text, IntWritable, IntWritable, Text)

The mapper reads a sequence file whose key-value pairs are of type Text and Text. It then emits a Text (say, a word) and an IntWritable (say, the frequency of the word). The reducer gets the word and its frequencies, and then assigns the word an integer id. It emits an IntWritable (the id) and a Text (the word).

I remember seeing code in Mahout's API where they assign integer ids to items. The items were already given an id of type long, and the conversion they make is as follows:

    public static int idToIndex(long id) {
      return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
    }

Is there something equivalent for Text or a "word"? I was thinking about simply taking the hash value of the string/word, but of course, different strings can map to the same hash value.
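For the record, the closest string analogue I can think of is something like the sketch below (wordToIndex is my own hypothetical name, not a Mahout method), but it suffers from exactly the collision problem I mentioned:

    // Hypothetical analogue of Mahout's idToIndex for strings; NOT collision-free,
    // since distinct words can share a hashCode.
    public static int wordToIndex(String word) {
        // String.hashCode() may be negative, so mask off the sign bit,
        // just like the 0x7FFFFFFF mask in idToIndex above
        return 0x7FFFFFFF & word.hashCode();
    }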