Re: does hadoop always respect setNumReduceTasks?

2012-03-14 Thread Jane Wayne
Thanks, Lance.

On Thu, Mar 8, 2012 at 9:38 PM, Lance Norskog  wrote:

> [quoted text trimmed; Lance's reply appears in full below, and the
> original question is quoted in full in WangRamon's reply]


RE: does hadoop always respect setNumReduceTasks?

2012-03-10 Thread WangRamon

Jane, I think you have mapred.tasktracker.reduce.tasks.maximum or
mapred.reduce.tasks set to 1 on your local cluster and set to other values
on EMR; that is why you always get one reducer locally but not on EMR.

Cheers,
Ramon
> Date: Thu, 8 Mar 2012 21:30:26 -0500
> Subject: does hadoop always respect setNumReduceTasks?
> From: jane.wayne2...@gmail.com
> To: common-user@hadoop.apache.org
> 
> I am wondering: does Hadoop always respect Job.setNumReduceTasks(int)?
> 
> As I emit items from the mapper, I expect (and want) only one reducer to
> get them, because I want to assign each key of the key-value input pair
> a unique integer id. With one reducer, I can just keep a local counter
> (local to the reducer instance) and increment it.
> 
> On my local Hadoop cluster, I noticed that most, if not all, of my jobs
> have only one reducer, regardless of whether or not I set
> Job.setNumReduceTasks(int).
> 
> However, as soon as I moved the code onto Amazon's Elastic MapReduce
> (EMR), I noticed that there were multiple reducers. If I set the number
> of reduce tasks to 1, is that always guaranteed? I ask because I don't
> know whether there is a gotcha like the combiner (which may or may not
> run at all).
> 
> Also, it looks like having just one reducer might not be a good idea (it
> won't scale). It is most likely better to have more than one reducer,
> but in that case I lose the ability to assign unique numbers to the
> incoming key-value pairs. Is there a design pattern out there that
> addresses this issue?
> 
> My mapper/reducer key-value pair signatures look something like the
> following:
> 
> mapper(Text, Text, Text, IntWritable)
> reducer(Text, IntWritable, IntWritable, Text)
> 
> The mapper reads a sequence file whose key-value pairs are both of type
> Text. I then emit Text (say, a word) and IntWritable (say, the word's
> frequency).
> 
> The reducer gets the word and its frequencies, and then assigns the word
> an integer id. It emits IntWritable (the id) and Text (the word).
> 
> I remember seeing code in Mahout's API where they assign integer ids to
> items. The items were already given an id of type long. The conversion
> they make is as follows:
> 
> public static int idToIndex(long id) {
>   return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
> }
> 
> Is there something equivalent for Text or a "word"? I was thinking about
> simply taking the hash value of the string/word, but of course, different
> strings can map to the same hash value.

Re: does hadoop always respect setNumReduceTasks?

2012-03-09 Thread Bejoy Ks
Hi Jane
  Adding to Lance's comments (answers to your other queries):

> I am wondering: does Hadoop always respect Job.setNumReduceTasks(int)?
  Yes, unless you mark the property final in mapred-site.xml (which you
normally never do).

> I noticed that most, if not all, of my jobs have only one reducer,
> regardless of whether or not I set Job.setNumReduceTasks(int).
  The default value in mapred-site.xml is 1, but you should be able to
override it per job; doing so is a common requirement in MapReduce jobs.
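
For example, a minimal driver sketch (untested; the job name and the
omitted mapper/reducer wiring are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class Driver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "assign word ids");
    // This per-job setting overrides the mapred.reduce.tasks default,
    // as long as the property is not marked final in mapred-site.xml.
    job.setNumReduceTasks(1);
    // ... set mapper, reducer, input/output paths here ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}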

> If I set the number of reduce tasks to 1, is that always guaranteed? I
> ask because I don't know whether there is a gotcha like the combiner
> (which may or may not run at all).
  Yes, it is guaranteed. A combiner comes into play only if you specify
one; by default no combiner is used.
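
So with a single reduce task, the local-counter approach you describe is
safe. A minimal sketch (untested; the class name is just a placeholder):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// With exactly one reduce task, a counter kept in the reducer instance
// yields a globally unique id per key, since every key passes through
// this one instance.
public class IdAssigningReducer
    extends Reducer<Text, IntWritable, IntWritable, Text> {

  private int nextId = 0;

  @Override
  protected void reduce(Text word, Iterable<IntWritable> frequencies,
      Context context) throws IOException, InterruptedException {
    // Each distinct word reaches reduce() exactly once, so it gets one id.
    context.write(new IntWritable(nextId++), word);
  }
}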

Regards
Bejoy KS

On Fri, Mar 9, 2012 at 8:08 AM, Lance Norskog  wrote:

> [quoted text trimmed; see Lance Norskog's message below]


Re: does hadoop always respect setNumReduceTasks?

2012-03-08 Thread Lance Norskog
Instead of String.hashCode() you can use an MD5-based hash. MD5 has never
produced an accidental collision "in the wild". (It has been deliberately
broken by researchers, but that's not relevant here.)

http://snippets.dzone.com/posts/show/3686
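
The rough idea, as an untested sketch (note that folding the 128-bit
digest down to an int gives up most of MD5's collision resistance):

import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class Md5Hash {
  // Fold the first four bytes of the word's MD5 digest into a
  // non-negative int.
  public static int hash(String word) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] d = md5.digest(word.getBytes(Charset.forName("UTF-8")));
      int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
            | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
      return h & 0x7FFFFFFF; // clear the sign bit, like Mahout's idToIndex
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e); // MD5 ships with every JVM
    }
  }
}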

I think the Partitioner class is what decides which reducer each key goes
to once you run with multiple reducers.
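
Roughly what the default HashPartitioner does (a sketch; the class name
and types are just illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Essentially the default HashPartitioner logic: each key is routed to a
// reduce task by its hash, so keys only spread out when the job actually
// runs with more than one reduce task.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}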

On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne  wrote:
> [original message trimmed; quoted in full in WangRamon's reply above]



-- 
Lance Norskog
goks...@gmail.com