Re: does hadoop always respect setNumReduceTasks?
Thanks Lance.

On Thu, Mar 8, 2012 at 9:38 PM, Lance Norskog wrote:
> Instead of String.hashCode() you can use the MD5 hashcode generator.
> This has not "in the wild" created a duplicate. (It has been hacked,
> but that's not relevant here.)
>
> http://snippets.dzone.com/posts/show/3686
>
> I think the Partitioner class guarantees that you will have multiple
> reducers.
>
> On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne wrote:
> > i am wondering if hadoop always respects Job.setNumReduceTasks(int)?
> >
> > as i am emitting items from the mapper, i expect/desire only 1 reducer to
> > get these items, because i want to assign each key of the key-value input
> > pair a unique integer id. if i had 1 reducer, i could just keep a local
> > counter (with respect to the reducer instance) and increment it.
> >
> > on my local hadoop cluster, i noticed that most, if not all, my jobs have
> > only 1 reducer, regardless of whether or not i set
> > Job.setNumReduceTasks(int).
> >
> > however, as soon as i moved the code onto amazon's elastic mapreduce (emr),
> > i noticed that there are multiple reducers. if i set the number of reduce
> > tasks to 1, is this always guaranteed? i ask because i don't know if there
> > is a gotcha like the combiner (which may or may not run at all).
> >
> > also, it looks like having just 1 reducer might not be a good idea (it
> > won't scale). it is most likely better to have more than 1 reducer, but in
> > that case, i lose the ability to assign unique numbers to the key-value
> > pairs coming in. is there a design pattern out there that addresses this
> > issue?
> >
> > my mapper/reducer key-value pair signatures look something like the
> > following:
> >
> > mapper(Text, Text, Text, IntWritable)
> > reducer(Text, IntWritable, IntWritable, Text)
> >
> > the mapper reads a sequence file whose key-value pairs are of type Text
> > and Text. i then emit Text (say, a word) and IntWritable (say, the
> > frequency of the word).
> >
> > the reducer gets the word and its frequencies, and then assigns the word
> > an integer id. it emits IntWritable (the id) and Text (the word).
> >
> > i remember seeing code in mahout's API where they assign integer ids to
> > items. the items were already given an id of type long. the conversion
> > they make is as follows:
> >
> > public static int idToIndex(long id) {
> >     return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
> > }
> >
> > is there something equivalent for Text or a "word"? i was thinking about
> > simply taking the hash value of the string/word, but of course, different
> > strings can map to the same hash value.
>
> --
> Lance Norskog
> goks...@gmail.com
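The single-reducer numbering scheme described in the question can be sketched in plain Java. This is a hypothetical stand-in for the body of a reduce() method, not actual Hadoop code; the class and method names are invented for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: with exactly one reducer, a counter local to the
// reducer instance hands out dense, unique integer ids as words arrive.
// In a real job this logic would live inside
// reduce(Text, Iterable<IntWritable>, Context).
public class WordIdAssigner {
    private int nextId = 0;
    private final Map<String, Integer> wordToId = new LinkedHashMap<>();

    // Called once per distinct key, just as reduce() is; the word's
    // frequencies play no part in the id assignment.
    public int assign(String word) {
        return wordToId.computeIfAbsent(word, w -> nextId++);
    }
}
```

Because only one reducer instance exists, the counter never races with another task. With two or more reducers, each instance would restart its counter at 0 and ids would collide across partitions, which is exactly the scaling problem the question raises.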
RE: does hadoop always respect setNumReduceTasks?
Jane,

I think you have mapred.tasktracker.reduce.tasks.maximum or
mapred.reduce.tasks set to 1 on your local cluster, and set to some other
values on EMR. That is why you always get one reducer locally but not on
EMR.

Cheers,
Ramon

> Date: Thu, 8 Mar 2012 21:30:26 -0500
> Subject: does hadoop always respect setNumReduceTasks?
> From: jane.wayne2...@gmail.com
> To: common-user@hadoop.apache.org
>
> i am wondering if hadoop always respects Job.setNumReduceTasks(int)?
> [snip]
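To illustrate Ramon's point, any cluster-side default for the reduce count can be overridden at submission time. This is a hedged sketch; the jar, class, and path names are placeholders, and mapred.reduce.tasks is the pre-YARN property name used in this thread (newer releases call it mapreduce.job.reduces):

```shell
# Pin the job to exactly one reduce task for this run, overriding
# whatever default the cluster's mapred-site.xml carries.
hadoop jar myjob.jar com.example.MyDriver \
    -D mapred.reduce.tasks=1 \
    input/ output/
```

The -D flag only takes effect if the driver goes through ToolRunner/GenericOptionsParser; a Job.setNumReduceTasks(1) call made afterwards in driver code wins, since the last setting applied is the one the job uses.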
Re: does hadoop always respect setNumReduceTasks?
Hi Jane,

Adding on to Lance's comments (answers to your other queries):

> i am wondering if hadoop always respects Job.setNumReduceTasks(int)?

Yes, unless you mark it final in mapred-site.xml (you normally never do).

> i noticed that most, if not all, my jobs have only 1 reducer, regardless
> of whether or not i set Job.setNumReduceTasks(int).

The default value in mapred-site.xml would be 1, but you should be able to
override it per job, and that is a common requirement in map reduce jobs.

> if i set the number of reduce tasks to 1, is this always guaranteed? i ask
> because i don't know if there is a gotcha like the combiner (where it may
> or may not run at all).

Yes, it is guaranteed. The combiner comes into play only if you specify
one; no combiner is used by default.

Regards,
Bejoy KS

On Fri, Mar 9, 2012 at 8:08 AM, Lance Norskog wrote:
> Instead of String.hashCode() you can use the MD5 hashcode generator.
> This has not "in the wild" created a duplicate. (It has been hacked,
> but that's not relevant here.)
>
> http://snippets.dzone.com/posts/show/3686
>
> I think the Partitioner class guarantees that you will have multiple
> reducers.
>
> On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne wrote:
> > i am wondering if hadoop always respects Job.setNumReduceTasks(int)?
> > [snip]
>
> --
> Lance Norskog
> goks...@gmail.com
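The guarantee follows from how keys are routed to reduce tasks. Hadoop's default HashPartitioner computes the partition as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; a plain-Java rendering of that rule (not the actual Hadoop class) makes the single-reducer case obvious:

```java
// Plain-Java rendering of the default HashPartitioner's routing rule.
// The mask clears the sign bit so the modulo result is never negative,
// even for keys whose hashCode() is negative.
public class PartitionRule {
    public static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

With numReduceTasks == 1 the modulo is always 0, so every key reaches the single reducer; the Partitioner only spreads keys across reducers when more than one reduce task is configured.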
Re: does hadoop always respect setNumReduceTasks?
Instead of String.hashCode() you can use the MD5 hashcode generator.
This has not "in the wild" created a duplicate. (It has been hacked,
but that's not relevant here.)

http://snippets.dzone.com/posts/show/3686

I think the Partitioner class guarantees that you will have multiple
reducers.

On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne wrote:
> i am wondering if hadoop always respects Job.setNumReduceTasks(int)?
> [snip]

--
Lance Norskog
goks...@gmail.com
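Lance's MD5 suggestion can be sketched with the JDK's MessageDigest; the class and method names below are invented for illustration:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of the MD5 approach suggested above: hash the word
// with MD5 and fold the 128-bit digest down to a non-negative int.
public class Md5WordHash {
    public static int hashWord(String word) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(word.getBytes(StandardCharsets.UTF_8));
            // Interpret the digest as a positive integer, then keep the
            // low 31 bits so the result is never negative.
            return new BigInteger(1, digest).intValue() & 0x7FFFFFFF;
        } catch (NoSuchAlgorithmException e) {
            // MD5 ships with every standard JDK, so this should not occur.
            throw new IllegalStateException(e);
        }
    }
}
```

Note that this narrows the collision concern rather than eliminating it: by the pigeonhole principle, any mapping from arbitrary strings to 31 bits can collide, so this yields well-spread ids, not guaranteed-unique ones.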