Instead of String.hashCode() you can use the MD5 hashcode generator.
This has not "in the wild" created a duplicate. (It has been hacked,
but that's not relevant here.)

http://snippets.dzone.com/posts/show/3686

I think the Partitioner class guarantees that you will have multiple reducers.

On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne <jane.wayne2...@gmail.com> wrote:
> i am wondering if hadoop always respect Job.setNumReduceTasks(int)?
>
> as i am emitting items from the mapper, i expect/desire only 1 reducer to
> get these items because i want to assign each key of the key-value input
> pair a unique integer id. if i had 1 reducer, i can just keep a local
> counter (with respect to the reducer instance) and increment it.
>
> on my local hadoop cluster, i noticed that most, if not all, my jobs have
> only 1 reducer, regardless of whether or not i set
> Job.setNumReduceTasks(int).
>
> however, as soon as i moved the code unto amazon's elastic mapreduce (emr),
> i notice that there are multiple reducers. if i set the number of reduce
> tasks to 1, is this always guaranteed? i ask because i don't know if there
> is a gotcha like the combiner (where it may or may not run at all).
>
> also, it looks like this might not be a good idea just having 1 reducer (it
> won't scale). it is most likely better if there are +1 reducers, but in
> that case, i lose the ability to assign unique numbers to the key-value
> pairs coming in. is there a design pattern out there that addresses this
> issue?
>
> my mapper/reducer key-value pair signatures looks something like the
> following.
>
> mapper(Text, Text, Text, IntWritable)
> reducer(Text, IntWritable, IntWritable, Text)
>
> the mapper reads a sequence file whose key-value pairs are of type Text and
> Text. i then emit Text (let's say a word) and IntWritable (let's say
> frequency of the word).
>
> the reducer gets the word and its frequencies, and then assigns the word an
> integer id. it emits IntWritable (the id) and Text (the word).
>
> i remember seeing code from mahout's API where they assign integer ids to
> items. the items were already given an id of type long. the conversion they
> make is as follows.
>
> public static int idToIndex(long id) {
>  return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
> }
>
> is there something equivalent for Text or a "word"? i was thinking about
> simply taking the hash value of the string/word, but of course, different
> strings can map to the same hash value.



-- 
Lance Norskog
goks...@gmail.com

Reply via email to