Re: one question in the book Hadoop: The Definitive Guide, 2nd edition

2011-08-05 Thread John Armstrong
On Fri, 5 Aug 2011 08:50:02 +0800 (CST), Daniel,Wu hadoop...@163.com
wrote:
 The book also
 mentioned the value is mutable. I think the key might also be mutable,
 meaning that as we loop over each value in Iterable<NullWritable>, the
 content of the key object is reset.

The mutability of the value is one of the weirdnesses of Hadoop you have
to get used to, and one of the few times it becomes important that Java
object semantics are pointer semantics.  Anyway, it wouldn't surprise me if
the key were also replaced on iteration, but I'd have to dig into the
Hadoop code to check on that.  Or you could!
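For the curious, here is a self-contained toy (a sketch of the mechanism, not
actual Hadoop source) that mimics what the new API's ReduceContext appears to
do: one key object is created and reused, and advancing the values iterator
reads the next record into that same object. The simplified IntPair below is a
stand-in for the book's class.

import java.util.Iterator;

public class KeyReuseDemo {

    // Minimal stand-in for the book's IntPair: a mutable two-int holder.
    static class IntPair {
        private int first, second;
        void set(int first, int second) { this.first = first; this.second = second; }
        @Override public String toString() { return first + " : " + second; }
    }

    public static void main(String[] args) {
        // One reduce group's records, already sorted by temperature descending.
        final int[][] records = { {0, 97}, {0, 96}, {0, 94} };
        final IntPair key = new IntPair();  // a single object, reused per record
        key.set(records[0][0], records[0][1]);

        // Advancing the iterator re-reads the NEXT record into the SAME key
        // object; the reducer's "key" reference never changes, its contents do.
        Iterator<Object> values = new Iterator<Object>() {
            private int i = 0;
            @Override public boolean hasNext() { return i < records.length; }
            @Override public Object next() {
                key.set(records[i][0], records[i][1]);
                i++;
                return null;
            }
            @Override public void remove() { throw new UnsupportedOperationException(); }
        };

        while (values.hasNext()) {
            values.next();
            System.out.println(key);  // 0 : 97, then 0 : 96, then 0 : 94
        }
        System.out.println("key after the loop: " + key);  // 0 : 94, not 0 : 97
    }
}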


Re:Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition

2011-08-04 Thread Daniel,Wu
Thanks John,

I am confused again by the result of my test case; could you please take a look?
The relevant code is:

  public static class IntSumReducer
       extends Reducer<IntPair,NullWritable,IntPair,NullWritable> {

    public void reduce(IntPair key, Iterable<NullWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int count = 0;
      for (NullWritable iw : values) {
        count++;
        System.out.print(key.getFirst());
        System.out.print(" : ");
        System.out.println(key.getSecond());
      }
      System.out.println("number of records for this group "
          + Integer.toString(count));
      System.out.println("-biggest key is--");
      System.out.print(key.getFirst());
      System.out.print("   -");
      System.out.println(key.getSecond());
      context.write(key, NullWritable.get());
    }
  }

I am using the new API (the release is from Cloudera).  We can see from the 
output that each call of the reduce function processed 100 records, but as 
reduce is defined as
reduce(IntPair key, Iterable<NullWritable> values, Context context), the key 
should stay fixed (not change) during every single execution.  The strange 
thing is that on each pass of the loop over Iterable<NullWritable> values, 
the key is different!  By your explanation, the same line (0:97) should be 
repeated 100 times, but actually it is 0:97, 0:97, 0:96... 0:0, as below


0 : 97
0 : 97
0 : 96
0 : 96
0 : 94
0 : 93
0 : 93
0 : 91
0 : 90
0 : 89
0 : 86
0 : 85
   deleted to save space
0 : 2
0 : 1
0 : 1
0 : 0
0 : 0
number of records for this group 100
-biggest key is--
0   -0
4 : 99
4 : 99
4 : 98
4 : 96
4 : 95
4 : 94
4 : 93
4 : 92
4 : 91
4 : 91
4 : 90





At 2011-08-03 20:02:34,John Armstrong john.armstr...@ccri.com wrote:
On Wed, 3 Aug 2011 10:35:51 +0800 (CST), Daniel,Wu hadoop...@163.com
wrote:
 So the key of a group is determined by the first coming record in the
 group. If we have 3 records in a group
 1: (1900,35)
 2: (1900,34)
 3: (1900,33)
 
 and (1900,35) comes in as the first row, then the result key will be
 (1900,35); when the second row (1900,34) comes in, it won't impact the
 key of the group, meaning it will not overwrite the key (1900,35) with
 (1900,34). Correct?

Effectively, yes.  Remember that on the inside it's using the comparator
something like this:

(1900, 35).. do I have that key already? [searches collection of keys
with, say, a BST] no! I'll add it here.
(1900,34).. do I have that key already? [searches again, now getting a
result of 0 when comparing to (1900,35)] yes! [it's not the same key, but
according to the GroupComparator it is!] so I'll add its value to the key's
iterable of values.
etc.


Re:Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition

2011-08-04 Thread John Armstrong
On Thu, 4 Aug 2011 14:07:12 +0800 (CST), Daniel,Wu hadoop...@163.com
wrote:
 I am using the new API (the release is from Cloudera).  We can see from the
 output that each call of the reduce function processed 100 records, but as
 reduce is defined as
 reduce(IntPair key, Iterable<NullWritable> values, Context context), the
 key should stay fixed (not change) during every single execution.  The
 strange thing is that on each pass of the loop over Iterable<NullWritable>
 values, the key is different!  By your explanation, the same line (0:97)
 should be repeated 100 times, but actually it is 0:97, 0:97, 0:96... 0:0,
 as below

Ah, but they're NOT different! That's the whole point!

Think carefully: how does Hadoop decide what keys are the same when
sorting and grouping reducer inputs?  It uses a comparator.  If the
comparator says compare(key1,key2)==0, then as far as Hadoop is concerned
the keys are the same.

So here the comparator only really checks the first int in the pair:

compare(0:97,0:96)?  well let's compare 0 and 0...
Integer.compare(0,0)==0, so these are the same key.

You have to be careful about the semantics of equality whenever you're
using nonstandard comparators.
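A tiny self-contained check of that equality semantics, in plain Java (the
int[] pairs here are hypothetical stand-ins for the book's IntPair):

import java.util.Comparator;

public class SameKeyCheck {
    public static void main(String[] args) {
        // A comparator that inspects only the first int of a pair.
        Comparator<int[]> firstOnly = new Comparator<int[]>() {
            @Override public int compare(int[] a, int[] b) {
                return a[0] < b[0] ? -1 : (a[0] == b[0] ? 0 : 1);
            }
        };
        int[] k1 = {0, 97};
        int[] k2 = {0, 96};
        System.out.println(firstOnly.compare(k1, k2)); // 0: "the same key" to Hadoop
        System.out.println(k1.equals(k2));             // false: still distinct objects
    }
}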


Re:Re:Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition

2011-08-04 Thread Daniel,Wu
Hi John,

Another finding: if I remove the loop over the values (remove for (NullWritable 
iw : values)), then the result is the MAX temperature for each year, while the 
original test I did returned the MIN temperature for each year. The book also 
mentioned the value is mutable; I think the key might also be mutable, meaning 
that as we loop over each value in Iterable<NullWritable>, the content of the 
key object is reset. Since the input is in order, if we don't loop at all (as 
in the new test), the key we have at the end of the reduce function is the 
first record in the group, which has the max value. If we loop over each value 
in the value list, say 100 times, the content of the key will also change 100 
times, and the key at the end of the reduce function will be the last key, 
which has the MIN value. This theory of a mutable key can explain how the test 
works. I just need to figure out why each pass of the loop for (NullWritable 
iw : values) can change the content of the key. If anyone knows, please tell me.

public void reduce(IntPair key, Iterable<NullWritable> values,
                   Context context
                   ) throws IOException, InterruptedException {
  int count = 0;
  /*for (NullWritable iw : values) {
    count++;
    System.out.print(key.getFirst());
    System.out.print(" : ");
    System.out.println(key.getSecond());
  }*/
  // System.out.println("number of records for this group "
  //     + Integer.toString(count));
  System.out.println("-biggest key is--");
  System.out.print(key.getFirst());
  System.out.print("   -");
  System.out.println(key.getSecond());
  context.write(key, NullWritable.get());
 }
   }


-biggest key is--
0   -97
-biggest key is--
4   -99
-biggest key is--
8   -99
-biggest key is--
12   -97
-biggest key is--
16   -98
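One possible workaround (a sketch, not from the book): copy the key's fields
before consuming the iterable, since the framework reuses the key object as
the iterator advances. This assumes the book's IntPair with
getFirst()/getSecond(), a no-arg constructor, and a set(int,int) method.

public void reduce(IntPair key, Iterable<NullWritable> values,
                   Context context) throws IOException, InterruptedException {
  int maxYear = key.getFirst();   // primitive copies taken BEFORE the loop
  int maxTemp = key.getSecond();  // survive the key object being overwritten
  int count = 0;
  for (NullWritable iw : values) {
    count++;                      // iterating mutates 'key' in place
  }
  System.out.println("number of records for this group " + count);
  IntPair max = new IntPair();
  max.set(maxYear, maxTemp);      // rebuild the group's first (maximum) key
  context.write(max, NullWritable.get());
}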



At 2011-08-04 20:51:01,John Armstrong john.armstr...@ccri.com wrote:
On Thu, 4 Aug 2011 14:07:12 +0800 (CST), Daniel,Wu hadoop...@163.com
wrote:
 I am using the new API (the release is from Cloudera).  We can see from the
 output that each call of the reduce function processed 100 records, but as
 reduce is defined as
 reduce(IntPair key, Iterable<NullWritable> values, Context context), the
 key should stay fixed (not change) during every single execution.  The
 strange thing is that on each pass of the loop over Iterable<NullWritable>
 values, the key is different!  By your explanation, the same line (0:97)
 should be repeated 100 times, but actually it is 0:97, 0:97, 0:96... 0:0,
 as below

Ah, but they're NOT different! That's the whole point!

Think carefully: how does Hadoop decide what keys are the same when
sorting and grouping reducer inputs?  It uses a comparator.  If the
comparator says compare(key1,key2)==0, then as far as Hadoop is concerned
the keys are the same.

So here the comparator only really checks the first int in the pair:

compare(0:97,0:96)?  well let's compare 0 and 0...
Integer.compare(0,0)==0, so these are the same key.

You have to be careful about the semantics of equality whenever you're
using nonstandard comparators.


Re:Re:Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition

2011-08-03 Thread Daniel,Wu
I understand now. And it looks like the job prints the MIN value instead of the 
MAX value in my test. In the stdout I can see the following data: 3 is the 
year (I faked the data myself), 99 is the max, and 0 is the min. We can see 
that for year 3 there are 100 records. So inside a group the key can differ, 
and context.write(key, NullWritable.get()) will write the LAST key to the 
output; since the temperatures are ordered descending, the last key has the 
MIN temperature.

3 99

3 0
number of records for this group 100
-biggest key is--
3 0


public void reduce(IntPair key, Iterable<NullWritable> values,
                   Context context
                   ) throws IOException, InterruptedException {
  int count = 0;
  for (NullWritable iw : values) {
    count++;
    System.out.print(key.getFirst());
    System.out.print(' ');
    System.out.println(key.getSecond());
  }
  System.out.println("number of records for this group "
      + Integer.toString(count));
  System.out.println("-biggest key is--");
  System.out.print(key.getFirst());
  System.out.print(' ');
  System.out.println(key.getSecond());
  context.write(key, NullWritable.get());
 }




At 2011-08-03 11:41:23,Daniel,Wu hadoop...@163.com wrote:
Or I should ask: should the input of the reducer for the group of year 1900 be 
like this (key, value pairs):
(1900,35), null
(1900,34), null
(1900,33), null


or like this:
(1900,35), null
(1900,35), null  <== since (1900,34) is in the same group as (1900,35), it 
uses (1900,35) as the key.
(1900,35), null


At 2011-08-03 10:35:51,Daniel,Wu hadoop...@163.com wrote:

So the key of a group is determined by the first coming record in the group.  
If we have 3 records in a group
1: (1900,35)
2: (1900,34)
3: (1900,33)

and (1900,35) comes in as the first row, then the result key will be 
(1900,35); when the second row (1900,34) comes in, it won't impact the 
key of the group, meaning it will not overwrite the key (1900,35) with 
(1900,34). Correct?

in the KeyComparator, these are guaranteed to come in reverse order in the 
second slot.  That is, if 35 is the maximum temperature then (1900,35) will 
come before ANY other (1900,t).  Then as the GroupComparator does its 
thing, any time (1900,t) comes up it gets compared AND FOUND EQUAL TO 
(1900,35), and thus its (null) value is added to the (1900,35) group.  
The reducer then gets a (1900,35) key with an Iterable of null values, 
which it pretty much discards and just emits the key, which contains the 
maximum value.


Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition

2011-08-03 Thread John Armstrong
On Wed, 3 Aug 2011 10:35:51 +0800 (CST), Daniel,Wu hadoop...@163.com
wrote:
 So the key of a group is determined by the first coming record in the
 group. If we have 3 records in a group
 1: (1900,35)
 2: (1900,34)
 3: (1900,33)
 
 and (1900,35) comes in as the first row, then the result key will be
 (1900,35); when the second row (1900,34) comes in, it won't impact the
 key of the group, meaning it will not overwrite the key (1900,35) with
 (1900,34). Correct?

Effectively, yes.  Remember that on the inside it's using the comparator
something like this:

(1900, 35).. do I have that key already? [searches collection of keys
with, say, a BST] no! I'll add it here.
(1900,34).. do I have that key already? [searches again, now getting a
result of 0 when comparing to (1900,35)] yes! [it's not the same key, but
according to the GroupComparator it is!] so I'll add its value to the key's
iterable of values.
etc.
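That search-structure analogy can be made concrete in plain Java (the real
framework merges sorted runs rather than building a tree, but the equality
semantics are the same): a TreeMap whose comparator looks only at the year
treats (1900,35), (1900,34), and (1900,33) as one key, and keeps whichever
pair arrived first as the stored key.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupingAnalogy {
    public static void main(String[] args) {
        Comparator<int[]> byYearOnly = new Comparator<int[]>() {
            @Override public int compare(int[] a, int[] b) {
                return a[0] < b[0] ? -1 : (a[0] == b[0] ? 0 : 1);
            }
        };
        TreeMap<int[], List<Object>> groups =
            new TreeMap<int[], List<Object>>(byYearOnly);

        int[][] input = { {1900, 35}, {1900, 34}, {1900, 33} };
        for (int[] pair : input) {
            List<Object> vals = groups.get(pair); // "do I have that key already?"
            if (vals == null) {
                vals = new ArrayList<Object>();
                groups.put(pair, vals);           // first arrival becomes THE key
            }
            vals.add(null);                       // later "equal" pairs only add a value
        }

        for (Map.Entry<int[], List<Object>> e : groups.entrySet()) {
            System.out.println(Arrays.toString(e.getKey())
                + " -> " + e.getValue().size() + " values");
        }
        // prints: [1900, 35] -> 3 values
    }
}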


Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition

2011-08-02 Thread Daniel,Wu

We usually use something like values.next() to loop over every row in a 
specific group, but I didn't see any code looping over the list; at the least 
it would need to get the first row in the list, with something like
values.get().
Or will NullWritable.get() get the first row in the group?


static class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<IntPair, NullWritable, IntPair, NullWritable> {
  public void reduce(IntPair key, Iterator<NullWritable> values,
      OutputCollector<IntPair, NullWritable> output, Reporter reporter)
      throws IOException {
    output.collect(key, NullWritable.get());
  }
}
 If we group values in the reducer by the year part of the key,
then we will see all the records for the same year in one reduce group. 
And since they are sorted by temperature in descending order, the first is
the maximum temperature.

At 2011-08-02 21:34:57,John Armstrong john.armstr...@ccri.com wrote:
On Tue, 2 Aug 2011 21:25:47 +0800 (CST), Daniel,Wu hadoop...@163.com
wrote:
 at page 243:
 Per my understanding, the reducer is supposed to output the first value
 (the maximum) for each year. But I just don't know how it works.
 
 Suppose we have the data
 1901  200
 1901  300
 1901  400
 
 Since grouping is done by the year, we have only one group, but we have 3
 different keys, as the key is a combination of year and temperature. For
 the reducer, the output should be (key, list(value)) pairs; since we have
 3 keys, we should output 3 rows, but since we have only one group, we only
 output 1 row. So where is the conflict? What do I misunderstand?

Keep reading the section in the book:

This still isn't enough to achieve our goal, however.  A partitioner
ensures only that one reducer receives all the records for a year; it
doesn't change the fact that the reducer groups by key within the
partition... The final piece of the puzzle is the setting to control the
grouping.  If we group values in the reducer by the year part of the key,
then we will see all the records for the same year in one reduce group. 
And since they are sorted by temperature in descending order, the first is
the maximum temperature.

That is, in that example they also change the way the reducer groups its
inputs.
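For reference, a hedged sketch of the driver wiring that passage alludes to,
written against the new (org.apache.hadoop.mapreduce) API. The mapper,
partitioner, comparators, and IntPair are assumed to be the classes from the
book's secondary-sort example; they are not shown in this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureUsingSecondarySort {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "secondary sort");
        job.setJarByClass(MaxTemperatureUsingSecondarySort.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);         // emits (IntPair, NullWritable)
        job.setPartitionerClass(FirstPartitioner.class);        // one reducer sees a whole year
        job.setSortComparatorClass(KeyComparator.class);        // year asc, temperature desc
        job.setGroupingComparatorClass(GroupComparator.class);  // group by year only
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(IntPair.class);
        job.setOutputValueClass(NullWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}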


Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition

2011-08-02 Thread John Armstrong
On Tue, 2 Aug 2011 21:49:22 +0800 (CST), Daniel,Wu hadoop...@163.com
wrote:
 We usually use something like values.next() to loop over every row in a
 specific group, but I didn't see any code looping over the list; at the
 least it would need to get the first row in the list, with something like
 values.get().
 Or will NullWritable.get() get the first row in the group?

No; as you said before, the value is now in the key.

The grouping comparator receives (1900,35),(1900,34),(1900,34), and so on.
Due to the line

return -IntPair.compare(ip1.getSecond(),ip2.getSecond());

in the KeyComparator, these are guaranteed to come in reverse order in the
second slot.  That is, if 35 is the maximum temperature then (1900,35) will
come before ANY other (1900,t).  Then as the GroupComparator does its
thing, any time (1900,t) comes up it gets compared AND FOUND EQUAL TO
(1900,35), and thus its (null) value is added to the (1900,35) group.

The reducer then gets a (1900,35) key with an Iterable of null values,
which it pretty much discards and just emits the key, which contains the
maximum value.

I admit, it's a pretty subtle trick, and I'm actually glad you brought it
up since I think I may be able to use it to solve a problem I've been
having...
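For completeness, a reconstructed sketch of the two comparators described
above, assuming the book's IntPair (int getFirst()/getSecond() and the static
IntPair.compare(int,int) visible in the quoted line):

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class SecondarySortComparators {

    // Sort comparator: year ascending, then temperature DESCENDING
    // (note the minus sign, as quoted above).
    public static class KeyComparator extends WritableComparator {
        protected KeyComparator() { super(IntPair.class, true); }
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            IntPair ip1 = (IntPair) w1;
            IntPair ip2 = (IntPair) w2;
            int cmp = IntPair.compare(ip1.getFirst(), ip2.getFirst());
            if (cmp != 0) {
                return cmp;
            }
            return -IntPair.compare(ip1.getSecond(), ip2.getSecond());
        }
    }

    // Grouping comparator: the year alone decides which reduce group a
    // pair belongs to, so any (1900,t) is found equal to (1900,35).
    public static class GroupComparator extends WritableComparator {
        protected GroupComparator() { super(IntPair.class, true); }
        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            return IntPair.compare(((IntPair) w1).getFirst(),
                                   ((IntPair) w2).getFirst());
        }
    }
}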


Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition

2011-08-02 Thread Daniel,Wu

So the key of a group is determined by the first coming record in the group.  
If we have 3 records in a group
1: (1900,35)
2: (1900,34)
3: (1900,33)

and (1900,35) comes in as the first row, then the result key will be (1900,35); 
when the second row (1900,34) comes in, it won't impact the key of the 
group, meaning it will not overwrite the key (1900,35) with (1900,34). Correct?

in the KeyComparator, these are guaranteed to come in reverse order in the 
second slot.  That is, if 35 is the maximum temperature then (1900,35) will 
come before ANY other (1900,t).  Then as the GroupComparator does its thing, 
any time (1900,t) comes up it gets compared AND FOUND EQUAL TO (1900,35), and 
thus its (null) value is added to the (1900,35) group.  The reducer then 
gets a (1900,35) key with an Iterable of null values, which it pretty much 
discards and just emits the key, which contains the maximum value.


Re:Re:Re:Re: one question in the book Hadoop: The Definitive Guide, 2nd edition

2011-08-02 Thread Daniel,Wu
Or I should ask: should the input of the reducer for the group of year 1900 be 
like this (key, value pairs):
(1900,35), null
(1900,34), null
(1900,33), null


or like this:
(1900,35), null
(1900,35), null  <== since (1900,34) is in the same group as (1900,35), it 
uses (1900,35) as the key.
(1900,35), null


At 2011-08-03 10:35:51,Daniel,Wu hadoop...@163.com wrote:

So the key of a group is determined by the first coming record in the group.  
If we have 3 records in a group
1: (1900,35)
2: (1900,34)
3: (1900,33)

and (1900,35) comes in as the first row, then the result key will be (1900,35); 
when the second row (1900,34) comes in, it won't impact the key of the 
group, meaning it will not overwrite the key (1900,35) with (1900,34). Correct?

in the KeyComparator, these are guaranteed to come in reverse order in the 
second slot.  That is, if 35 is the maximum temperature then (1900,35) will 
come before ANY other (1900,t).  Then as the GroupComparator does its 
thing, any time (1900,t) comes up it gets compared AND FOUND EQUAL TO 
(1900,35), and thus its (null) value is added to the (1900,35) group.  The 
reducer then gets a (1900,35) key with an Iterable of null values, which it 
pretty much discards and just emits the key, which contains the maximum 
value.