Thanks Lance.

On Thu, Mar 8, 2012 at 9:38 PM, Lance Norskog <goks...@gmail.com> wrote:

> Instead of String.hashCode() you can use the MD5 hashcode generator.
> This has not "in the wild" created a duplicate. (It has been hacked,
> but that's not relevant here.)
>
> http://snippets.dzone.com/posts/show/3686
>
> I think the Partitioner class guarantees that you will have multiple
> reducers.
>
> On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne <jane.wayne2...@gmail.com>
> wrote:
> > i am wondering if hadoop always respect Job.setNumReduceTasks(int)?
> >
> > as i am emitting items from the mapper, i expect/desire only 1 reducer to
> > get these items because i want to assign each key of the key-value input
> > pair a unique integer id. if i had 1 reducer, i can just keep a local
> > counter (with respect to the reducer instance) and increment it.
> >
> > on my local hadoop cluster, i noticed that most, if not all, my jobs have
> > only 1 reducer, regardless of whether or not i set
> > Job.setNumReduceTasks(int).
> >
> > however, as soon as i moved the code unto amazon's elastic mapreduce
> (emr),
> > i notice that there are multiple reducers. if i set the number of reduce
> > tasks to 1, is this always guaranteed? i ask because i don't know if
> there
> > is a gotcha like the combiner (where it may or may not run at all).
> >
> > also, it looks like this might not be a good idea just having 1 reducer
> (it
> > won't scale). it is most likely better if there are +1 reducers, but in
> > that case, i lose the ability to assign unique numbers to the key-value
> > pairs coming in. is there a design pattern out there that addresses this
> > issue?
> >
> > my mapper/reducer key-value pair signatures looks something like the
> > following.
> >
> > mapper(Text, Text, Text, IntWritable)
> > reducer(Text, IntWritable, IntWritable, Text)
> >
> > the mapper reads a sequence file whose key-value pairs are of type Text
> and
> > Text. i then emit Text (let's say a word) and IntWritable (let's say
> > frequency of the word).
> >
> > the reducer gets the word and its frequencies, and then assigns the word
> an
> > integer id. it emits IntWritable (the id) and Text (the word).
> >
> > i remember seeing code from mahout's API where they assign integer ids to
> > items. the items were already given an id of type long. the conversion
> they
> > make is as follows.
> >
> > public static int idToIndex(long id) {
> >  return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
> > }
> >
> > is there something equivalent for Text or a "word"? i was thinking about
> > simply taking the hash value of the string/word, but of course, different
> > strings can map to the same hash value.
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>

Reply via email to