Re: one question in the book Hadoop: The Definitive Guide, 2nd edition
On Fri, 5 Aug 2011 08:50:02 +0800 (CST), Daniel,Wu hadoop...@163.com wrote:
> The book also mentioned the value is mutable. I think the key might also be
> mutable, meaning that as we loop over each value in Iterable<NullWritable>,
> the content of the key object is reset.

The mutability of the value is one of the weirdnesses of Hadoop you have to get used to, and one of the few times it becomes important that Java object semantics are pointer semantics. Anyway, it wouldn't surprise me if the key were also replaced on iteration, but I'd have to dig into the Hadoop code to check on that. Or you could!
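If the key does turn out to be reused, a defensive copy before iterating sidesteps the problem. A minimal sketch, assuming the book's IntPair writable and the new API (WritableUtils.clone deep-copies any Writable):

    import java.io.IOException;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.WritableUtils;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxReducer extends Reducer<IntPair, NullWritable, IntPair, NullWritable> {
      @Override
      public void reduce(IntPair key, Iterable<NullWritable> values, Context context)
          throws IOException, InterruptedException {
        // Copy the key NOW: Hadoop may reuse the same IntPair instance for
        // every record in the group, so 'key' mutates as the iterator advances.
        IntPair first = WritableUtils.clone(key, context.getConfiguration());
        int count = 0;
        for (NullWritable v : values) {
          count++; // 'key' is overwritten here; 'first' is untouched
        }
        // 'first' still holds the group's first pair, which is the maximum
        // given the descending sort on the second field.
        context.write(first, NullWritable.get());
      }
    }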
Re:Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition
Thanks John. I am confused again by the result of my test case; could you please take a look? The relevant code is:

    public static class IntSumReducer
        extends Reducer<IntPair, NullWritable, IntPair, NullWritable> {
      public void reduce(IntPair key, Iterable<NullWritable> values, Context context)
          throws IOException, InterruptedException {
        int count = 0;
        for (NullWritable iw : values) {
          count++;
          System.out.print(key.getFirst());
          System.out.print(" : ");
          System.out.println(key.getSecond());
        }
        System.out.println("number of records for this group " + Integer.toString(count));
        System.out.println("-biggest key is--");
        System.out.print(key.getFirst());
        System.out.print(" -");
        System.out.println(key.getSecond());
        context.write(key, NullWritable.get());
      }
    }

I am using the new API (the release is from Cloudera). We can see from the output that for each call of the reduce function, 100 records were processed. But since reduce is defined as reduce(IntPair key, Iterable<NullWritable> values, Context context), the key should be fixed (not change) during every single execution. The strange thing is that on each pass of the loop over Iterable<NullWritable> values, the key is different! Using your explanation, the same pair (0:97) should be repeated 100 times, but actually it is 0:97, 0:97, 0:96 ... 0:0, as below:

    0 : 97
    0 : 97
    0 : 96
    0 : 96
    0 : 94
    0 : 93
    0 : 93
    0 : 91
    0 : 90
    0 : 89
    0 : 86
    0 : 85
    [deleted to save space]
    0 : 2
    0 : 1
    0 : 1
    0 : 0
    0 : 0
    number of records for this group 100
    -biggest key is--
    0 -0
    4 : 99
    4 : 99
    4 : 98
    4 : 96
    4 : 95
    4 : 94
    4 : 93
    4 : 92
    4 : 91
    4 : 91
    4 : 90

At 2011-08-03 20:02:34, John Armstrong john.armstr...@ccri.com wrote:
> On Wed, 3 Aug 2011 10:35:51 +0800 (CST), Daniel,Wu hadoop...@163.com wrote:
>> So the key of a group is determined by the first record to arrive in the
>> group. If we have 3 records in a group, 1: (1900,35), 2: (1900,34),
>> 3: (1900,33), and (1900,35) comes in as the first row, then the resulting
>> key will be (1900,35); when the second row (1900,34) comes in, it won't
>> impact the key of the group, meaning it will not overwrite the key
>> (1900,35) with (1900,34). Correct?
>
> Effectively, yes. Remember that on the inside it's using the comparator
> something like this:
>
> (1900,35).. do I have that key already? [searches collection of keys with,
> say, a BST] no! I'll add it here.
>
> (1900,34).. do I have that key already? [searches again, now getting a
> result of 0 when comparing to (1900,35)] yes! [it's not the same key, but
> according to the GroupComparator it is!] so I'll add its value to the key's
> iterable of values.
>
> etc.
Re:Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition
On Thu, 4 Aug 2011 14:07:12 +0800 (CST), Daniel,Wu hadoop...@163.com wrote:
> I am using the new API (the release is from Cloudera). We can see from the
> output that for each call of the reduce function, 100 records were
> processed. But since reduce is defined as reduce(IntPair key,
> Iterable<NullWritable> values, Context context), the key should be fixed
> (not change) during every single execution. The strange thing is that on
> each pass of the loop over Iterable<NullWritable> values, the key is
> different! Using your explanation, the same pair (0:97) should be repeated
> 100 times, but actually it is 0:97, 0:97, 0:96 ... 0:0, as below

Ah, but they're NOT different! That's the whole point! Think carefully: how does Hadoop decide which keys are the same when sorting and grouping reducer inputs? It uses a comparator. If the comparator says compare(key1,key2)==0, then as far as Hadoop is concerned the keys are the same.

So here the comparator only really checks the first int in the pair: compare(0:97, 0:96)? Well, let's compare 0 and 0... Integer.compare(0,0)==0, so these are the same key.

You have to be careful about the semantics of equality whenever you're using nonstandard comparators.
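Concretely, the grouping comparator in the book's secondary-sort example looks roughly like this (a sketch assuming the book's IntPair with its static compare(int, int) helper):

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Two IntPairs are "equal" for grouping whenever their first ints match,
    // so (0,97) and (0,96) land in the same reduce group even though they
    // are distinct pairs.
    public class GroupComparator extends WritableComparator {
      protected GroupComparator() {
        super(IntPair.class, true); // true = create key instances for deserialization
      }
      @Override
      public int compare(WritableComparable w1, WritableComparable w2) {
        IntPair ip1 = (IntPair) w1;
        IntPair ip2 = (IntPair) w2;
        return IntPair.compare(ip1.getFirst(), ip2.getFirst()); // first int only
      }
    }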
Re:Re:Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition
Hi John,

Another finding: if I remove the loop over the values (remove "for (NullWritable iw : values)"), then the result is the MAX temperature for each year, while my original test returned the MIN temperature for each year.

The book also mentioned the value is mutable. I think the key might also be mutable, meaning that as we loop over each value in Iterable<NullWritable>, the content of the key object is reset. Since the input is in order, if we don't loop at all (as in the new test), the key we have at the end of the reduce function is from the first record in the group, which has the max value. If we loop over each value in the value list, say 100 times, the content of the key will also change 100 times, and the key we have at the end of the reduce function will be the last key, which has the MIN value. This theory of a mutable key explains how the test works. I just need to figure out why each pass of the statement "for (NullWritable iw : values)" can change the content of the key. If anyone knows, please help tell me.

    public void reduce(IntPair key, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      /*for (NullWritable iw : values) {
        count++;
        System.out.print(key.getFirst());
        System.out.print(" : ");
        System.out.println(key.getSecond());
      }*/
      // System.out.println("number of records for this group " + Integer.toString(count));
      System.out.println("-biggest key is--");
      System.out.print(key.getFirst());
      System.out.print(" -");
      System.out.println(key.getSecond());
      context.write(key, NullWritable.get());
    }

    -biggest key is--
    0 -97
    -biggest key is--
    4 -99
    -biggest key is--
    8 -99
    -biggest key is--
    12 -97
    -biggest key is--
    16 -98

At 2011-08-04 20:51:01, John Armstrong john.armstr...@ccri.com wrote:
> On Thu, 4 Aug 2011 14:07:12 +0800 (CST), Daniel,Wu hadoop...@163.com wrote:
>> I am using the new API (the release is from Cloudera). We can see from the
>> output that for each call of the reduce function, 100 records were
>> processed. But since reduce is defined as reduce(IntPair key,
>> Iterable<NullWritable> values, Context context), the key should be fixed
>> (not change) during every single execution. The strange thing is that on
>> each pass of the loop over Iterable<NullWritable> values, the key is
>> different! Using your explanation, the same pair (0:97) should be repeated
>> 100 times, but actually it is 0:97, 0:97, 0:96 ... 0:0, as below
>
> Ah, but they're NOT different! That's the whole point! Think carefully: how
> does Hadoop decide which keys are the same when sorting and grouping
> reducer inputs? It uses a comparator. If the comparator says
> compare(key1,key2)==0, then as far as Hadoop is concerned the keys are the
> same. So here the comparator only really checks the first int in the pair:
> compare(0:97, 0:96)? Well, let's compare 0 and 0... Integer.compare(0,0)==0,
> so these are the same key. You have to be careful about the semantics of
> equality whenever you're using nonstandard comparators.
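The mechanism behind the mystery is object reuse: the new-API ReduceContext keeps a single key object and deserializes each incoming record into it as the values iterator advances. Here is a toy, non-Hadoop illustration of the pattern (the class and data are made up purely for illustration):

    import java.util.Arrays;
    import java.util.List;

    // Why the key "changes" inside the loop: the framework owns ONE key
    // object and overwrites its fields for each record, just as
    // ReduceContext does when the values iterator advances.
    public class ReuseDemo {
      public static void main(String[] args) {
        List<int[]> records = Arrays.asList(
            new int[]{0, 97}, new int[]{0, 96}, new int[]{0, 95});
        int[] key = new int[2];       // the single, reused "key" object
        int[] grabbed = key;          // keeping a reference does NOT freeze it

        for (int[] record : records) {
          key[0] = record[0];         // "deserialize" in place
          key[1] = record[1];
          System.out.println(key[0] + " : " + key[1]);
        }
        // The reference now reflects the LAST record, not the first:
        System.out.println("grabbed key now reads " + grabbed[0] + " : " + grabbed[1]);
        // prints: grabbed key now reads 0 : 95
      }
    }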
Re:Re:Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition
I understand now. And it looks like the job will print the min value instead of the max value, per my test. In the stdout I can see the following data: 3 is the year (I faked the data myself), 99 is the max, and 0 is the min. We can see that for year 3 there are 100 records. So inside a group the key can differ, and context.write(key, NullWritable.get()) will write the LAST key to the output; since the temperatures are ordered descending, the last key has the min temperature.

    3 99
    3 0
    number of records for this group 100
    -biggest key is--
    3 0

    public void reduce(IntPair key, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (NullWritable iw : values) {
        count++;
        System.out.print(key.getFirst());
        System.out.print(' ');
        System.out.println(key.getSecond());
      }
      System.out.println("number of records for this group " + Integer.toString(count));
      System.out.println("-biggest key is--");
      System.out.print(key.getFirst());
      System.out.print(' ');
      System.out.println(key.getSecond());
      context.write(key, NullWritable.get());
    }

At 2011-08-03 11:41:23, Daniel,Wu hadoop...@163.com wrote:
> Or I should ask: should the input of the reducer for the group of year 1900
> be key,value pairs like
>
>     (1900,35), null
>     (1900,34), null
>     (1900,33), null
>
> or like
>
>     (1900,35), null
>     (1900,35), null  <== since (1900,34) is in the same group as (1900,35), it uses (1900,35) as the key
>     (1900,35), null
>
> At 2011-08-03 10:35:51, Daniel,Wu hadoop...@163.com wrote:
>> So the key of a group is determined by the first record to arrive in the
>> group. If we have 3 records in a group, 1: (1900,35), 2: (1900,34),
>> 3: (1900,33), and (1900,35) comes in as the first row, then the resulting
>> key will be (1900,35); when the second row (1900,34) comes in, it won't
>> impact the key of the group, meaning it will not overwrite the key
>> (1900,35) with (1900,34). Correct?
>
>>> In the KeyComparator, these are guaranteed to come in reverse order in the
>>> second slot. That is, if 35 is the maximum temperature then (1900,35) will
>>> come before ANY other (1900,t). Then as the GroupComparator does its
>>> thing, any time (1900,t) comes up it gets compared AND FOUND EQUAL TO
>>> (1900,35), and thus its (null) value is added to the (1900,35) group. The
>>> reducer then gets a (1900,35) key with an Iterable of null values, which
>>> it pretty much discards and just emits the key, which contains the
>>> maximum value.
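One way to keep the count and still emit the maximum (a sketch against the book's types, not code from the thread): write the key out before consuming the iterator, while it still holds the first, largest pair. context.write serializes the key at the moment of the call, so later mutation of the object does not affect what was written:

    public void reduce(IntPair key, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {
      // Write first: 'key' still holds the group's first record here, which
      // the descending sort guarantees is the maximum temperature.
      context.write(key, NullWritable.get());
      int count = 0;
      for (NullWritable v : values) {
        count++; // 'key' is overwritten as we iterate, but it was already written
      }
      System.out.println("number of records for this group " + count);
    }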
Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition
On Wed, 3 Aug 2011 10:35:51 +0800 (CST), Daniel,Wu hadoop...@163.com wrote:
> So the key of a group is determined by the first record to arrive in the
> group. If we have 3 records in a group, 1: (1900,35), 2: (1900,34),
> 3: (1900,33), and (1900,35) comes in as the first row, then the resulting
> key will be (1900,35); when the second row (1900,34) comes in, it won't
> impact the key of the group, meaning it will not overwrite the key
> (1900,35) with (1900,34). Correct?

Effectively, yes. Remember that on the inside it's using the comparator something like this:

(1900,35).. do I have that key already? [searches collection of keys with, say, a BST] no! I'll add it here.

(1900,34).. do I have that key already? [searches again, now getting a result of 0 when comparing to (1900,35)] yes! [it's not the same key, but according to the GroupComparator it is!] so I'll add its value to the key's iterable of values.

etc.
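A toy, non-Hadoop illustration of that search (types and data made up for illustration): a TreeMap whose comparator looks only at the year behaves exactly like the grouping step, keeping the first-inserted key and just growing its value list:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.TreeMap;

    // "Equality is whatever the comparator says": keyed on the year alone,
    // (1900,35) and (1900,34) are the same key, so later pairs never replace
    // the first one -- its value list just grows.
    public class ComparatorEquality {
      public static void main(String[] args) {
        Comparator<int[]> byYearOnly = (a, b) -> Integer.compare(a[0], b[0]);
        TreeMap<int[], List<Integer>> groups = new TreeMap<>(byYearOnly);

        for (int[] pair : new int[][]{{1900, 35}, {1900, 34}, {1900, 33}}) {
          // computeIfAbsent keeps the EXISTING key when the comparator says 0
          groups.computeIfAbsent(pair, k -> new ArrayList<>()).add(pair[1]);
        }
        int[] groupKey = groups.firstKey();
        System.out.println(groupKey[0] + "," + groupKey[1] + " -> "
            + groups.firstEntry().getValue());
        // prints: 1900,35 -> [35, 34, 33]
      }
    }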
Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition
We usually use something like values.next() to loop over every row in a specific group, but I didn't see any code to loop over the list; at the least it needs to get the first row in the list, something like values.get(). Or will NullWritable.get() get the first row in the group?

    static class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<IntPair, NullWritable, IntPair, NullWritable> {
      public void reduce(IntPair key, Iterator<NullWritable> values,
          OutputCollector<IntPair, NullWritable> output, Reporter reporter)
          throws IOException {
        output.collect(key, NullWritable.get());
      }
    }

"If we group values in the reducer by the year part of the key, then we will see all the records for the same year in one reduce group. And since they are sorted by temperature in descending order, the first is the maximum temperature."

At 2011-08-02 21:34:57, John Armstrong john.armstr...@ccri.com wrote:
> On Tue, 2 Aug 2011 21:25:47 +0800 (CST), Daniel,Wu hadoop...@163.com wrote:
>> On page 243: per my understanding, the reducer is supposed to output the
>> first value (the maximum) for each year. But I just don't know how it
>> works. Suppose we have the data
>>
>>     1901 200
>>     1901 300
>>     1901 400
>>
>> Since grouping is done by the year, we have only one group, but we have 3
>> different keys, as the key is a combination of year and temperature. For
>> the reduce, the output should be key, list(value) pairs; since we have 3
>> keys, we should output 3 rows, but since we have only one group, we only
>> output 1 row. So where is the conflict? Where do I misunderstand?
>
> Keep reading the section in the book: "This still isn't enough to achieve
> our goal, however. A partitioner ensures only that one reducer receives all
> the records for a year; it doesn't change the fact that the reducer groups
> by key within the partition... The final piece of the puzzle is the setting
> to control the grouping. If we group values in the reducer by the year part
> of the key, then we will see all the records for the same year in one
> reduce group. And since they are sorted by temperature in descending order,
> the first is the maximum temperature."
>
> That is, in that example they also change the way the reducer groups its
> inputs.
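No explicit loop is needed because all the work happens in the job configuration. The wiring in the book's (old-API) driver looks roughly like this, using the example's class names:

    JobConf conf = new JobConf(MaxTemperatureUsingSecondarySort.class);
    // Partition on the year alone, so one reducer sees all records for a year.
    conf.setPartitionerClass(FirstPartitioner.class);
    // Sort by year ascending, then by temperature DESCENDING within each year.
    conf.setOutputKeyComparatorClass(KeyComparator.class);
    // Group by year alone: all (year, temp) pairs for one year arrive in a
    // single reduce call, whose key is the first -- i.e., maximum -- pair.
    conf.setOutputValueGroupingComparator(GroupComparator.class);

As for NullWritable.get(): it simply returns the NullWritable singleton (there is only ever one instance); it has nothing to do with fetching rows from the group.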
Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition
On Tue, 2 Aug 2011 21:49:22 +0800 (CST), Daniel,Wu hadoop...@163.com wrote:
> We usually use something like values.next() to loop over every row in a
> specific group, but I didn't see any code to loop over the list; at the
> least it needs to get the first row in the list, something like
> values.get(). Or will NullWritable.get() get the first row in the group?

No; like you said before, the value is now in the key. The grouping comparator receives (1900,35), (1900,34), (1900,34), and so on. Due to the line

    return -IntPair.compare(ip1.getSecond(), ip2.getSecond());

in the KeyComparator, these are guaranteed to come in reverse order in the second slot. That is, if 35 is the maximum temperature then (1900,35) will come before ANY other (1900,t). Then as the GroupComparator does its thing, any time (1900,t) comes up it gets compared AND FOUND EQUAL TO (1900,35), and thus its (null) value is added to the (1900,35) group. The reducer then gets a (1900,35) key with an Iterable of null values, which it pretty much discards and just emits the key, which contains the maximum value.

I admit it's a pretty subtle trick, and I'm actually glad you brought it up, since I think I may be able to use it to solve a problem I've been having...
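For context, the comparator that line comes from looks roughly like this in the book's example (a sketch assuming the book's IntPair and its static compare(int, int) helper):

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Sort by the first int ascending, then by the second int in REVERSE,
    // so within each year the maximum temperature sorts first and becomes
    // the group's key.
    public class KeyComparator extends WritableComparator {
      protected KeyComparator() {
        super(IntPair.class, true);
      }
      @Override
      public int compare(WritableComparable w1, WritableComparable w2) {
        IntPair ip1 = (IntPair) w1;
        IntPair ip2 = (IntPair) w2;
        int cmp = IntPair.compare(ip1.getFirst(), ip2.getFirst());
        if (cmp != 0) {
          return cmp;
        }
        return -IntPair.compare(ip1.getSecond(), ip2.getSecond()); // reversed
      }
    }

The leading minus sign is the whole trick: it reverses the temperature order without touching the year order.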
Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition
So the key of a group is determined by the first record to arrive in the group. If we have 3 records in a group, 1: (1900,35), 2: (1900,34), 3: (1900,33), and (1900,35) comes in as the first row, then the resulting key will be (1900,35); when the second row (1900,34) comes in, it won't impact the key of the group, meaning it will not overwrite the key (1900,35) with (1900,34). Correct?

> In the KeyComparator, these are guaranteed to come in reverse order in the
> second slot. That is, if 35 is the maximum temperature then (1900,35) will
> come before ANY other (1900,t). Then as the GroupComparator does its thing,
> any time (1900,t) comes up it gets compared AND FOUND EQUAL TO (1900,35),
> and thus its (null) value is added to the (1900,35) group. The reducer then
> gets a (1900,35) key with an Iterable of null values, which it pretty much
> discards and just emits the key, which contains the maximum value.
Re:Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition
Or I should ask: should the input of the reducer for the group of year 1900 be key,value pairs like

    (1900,35), null
    (1900,34), null
    (1900,33), null

or like

    (1900,35), null
    (1900,35), null  <== since (1900,34) is in the same group as (1900,35), it uses (1900,35) as the key
    (1900,35), null

At 2011-08-03 10:35:51, Daniel,Wu hadoop...@163.com wrote:
> So the key of a group is determined by the first record to arrive in the
> group. If we have 3 records in a group, 1: (1900,35), 2: (1900,34),
> 3: (1900,33), and (1900,35) comes in as the first row, then the resulting
> key will be (1900,35); when the second row (1900,34) comes in, it won't
> impact the key of the group, meaning it will not overwrite the key
> (1900,35) with (1900,34). Correct?
>
>> In the KeyComparator, these are guaranteed to come in reverse order in the
>> second slot. That is, if 35 is the maximum temperature then (1900,35) will
>> come before ANY other (1900,t). Then as the GroupComparator does its
>> thing, any time (1900,t) comes up it gets compared AND FOUND EQUAL TO
>> (1900,35), and thus its (null) value is added to the (1900,35) group. The
>> reducer then gets a (1900,35) key with an Iterable of null values, which
>> it pretty much discards and just emits the key, which contains the
>> maximum value.