Instead of String.hashCode() you can use the MD5 hashcode generator. This has not "in the wild" created a duplicate. (It has been hacked, but that's not relevant here.)
http://snippets.dzone.com/posts/show/3686 I think the Partitioner class guarantees that you will have multiple reducers. On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne <jane.wayne2...@gmail.com> wrote: > i am wondering if hadoop always respect Job.setNumReduceTasks(int)? > > as i am emitting items from the mapper, i expect/desire only 1 reducer to > get these items because i want to assign each key of the key-value input > pair a unique integer id. if i had 1 reducer, i can just keep a local > counter (with respect to the reducer instance) and increment it. > > on my local hadoop cluster, i noticed that most, if not all, my jobs have > only 1 reducer, regardless of whether or not i set > Job.setNumReduceTasks(int). > > however, as soon as i moved the code unto amazon's elastic mapreduce (emr), > i notice that there are multiple reducers. if i set the number of reduce > tasks to 1, is this always guaranteed? i ask because i don't know if there > is a gotcha like the combiner (where it may or may not run at all). > > also, it looks like this might not be a good idea just having 1 reducer (it > won't scale). it is most likely better if there are +1 reducers, but in > that case, i lose the ability to assign unique numbers to the key-value > pairs coming in. is there a design pattern out there that addresses this > issue? > > my mapper/reducer key-value pair signatures looks something like the > following. > > mapper(Text, Text, Text, IntWritable) > reducer(Text, IntWritable, IntWritable, Text) > > the mapper reads a sequence file whose key-value pairs are of type Text and > Text. i then emit Text (let's say a word) and IntWritable (let's say > frequency of the word). > > the reducer gets the word and its frequencies, and then assigns the word an > integer id. it emits IntWritable (the id) and Text (the word). > > i remember seeing code from mahout's API where they assign integer ids to > items. the items were already given an id of type long. the conversion they > make is as follows. > > public static int idToIndex(long id) { > return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32)); > } > > is there something equivalent for Text or a "word"? i was thinking about > simply taking the hash value of the string/word, but of course, different > strings can map to the same hash value. -- Lance Norskog goks...@gmail.com