Re:Hadoop Meetup in Sept in Shanghai

2011-08-16 Thread Daniel,Wu


I'd like to attend; I'd especially like to hear more about Hive.


At 2011-08-17 07:42:07, "Michael Lv" wrote:
>Hi,
>
>We plan to organize a developer meetup to talk about Hadoop and big data
>during the week of Sept 12 in Shanghai. We'll have presenters from the U.S.,
>and the topics look very interesting. Suggestions and guest presentations
>are welcome.
>
>If you are interested in attending, please reply to this thread or contact
>me directly.
>
>Regards,
>Michael  
>
>


why two values in the log: data buffer = 79691776/99614720

2011-08-13 Thread Daniel,Wu
There are two values for each buffer in the log; do they mean max and min?

2011-08-13 18:18:16,661 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720
2011-08-13 18:18:16,661 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680
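
A plausible reading, assuming the 0.20-era MapOutputBuffer defaults
(io.sort.mb=100, io.sort.record.percent=0.05, io.sort.spill.percent=0.80):
the second number is the buffer's capacity and the first is the soft limit at
which a background spill starts, not min and max. A sketch of the arithmetic
under those assumed defaults:

  // Hypothetical derivation, not taken from the Hadoop source verbatim.
  long buffer = 100L * 1024 * 1024;             // io.sort.mb       -> 104857600 bytes
  long recordBytes = (long) (buffer * 0.05);    // record accounting ->  5242880 bytes
  long recordCap = recordBytes / 16;            // 16 bytes/record  ->   327680 records
  long dataCap = buffer - recordBytes;          // raw data area    -> 99614720 bytes
  long dataSoft = (long) (dataCap * 0.80);      // spill threshold  -> 79691776 bytes
  long recordSoft = (long) (recordCap * 0.80);  // spill threshold  ->   262144 records

So "data buffer = 79691776/99614720" would be spill-threshold/capacity in
bytes, and "record buffer = 262144/327680" the same in records.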

Re:Re:Re:Re:Re: one question in the book of "hadoop:definitive guide 2 edition"

2011-08-04 Thread Daniel,Wu
Hi John,

Another finding: if I remove the loop over the values (remove the
for (NullWritable iw : values) loop), then the result is the MAX temperature
for each year, while the original test I did returned the MIN temperature for
each year. The book also mentioned that the value is mutable; I think the key
might also be mutable, meaning that as we loop over each value in the
Iterable, the content of the key object is reset. Since the input is in
order, if we don't loop at all (as in the new test), the key seen at the end
of the reduce function is the first record in the group, which has the max
value. If we loop over each value in the value list, say 100 times, the
content of the key will also change 100 times, and the key seen at the end of
the reduce function will be the last key, which has the MIN value. This
theory of a mutable key can explain how the test works. I just need to figure
out why each iteration of for (NullWritable iw : values) can change the
content of the key. If anyone knows, please help explain.

public void reduce(IntPair key, Iterable<NullWritable> values,
                   Context context
                   ) throws IOException, InterruptedException {
  int count = 0;
  /*for (NullWritable iw : values) {
    count++;
    System.out.print(key.getFirst());
    System.out.print(" : ");
    System.out.println(key.getSecond());
  }*/
  // System.out.println("number of records for this group " + Integer.toString(count));
  System.out.println("-biggest key is--");
  System.out.print(key.getFirst());
  System.out.print("   -");
  System.out.println(key.getSecond());
  context.write(key, NullWritable.get());
}


-biggest key is--
0   -97
-biggest key is--
4   -99
-biggest key is--
8   -99
-biggest key is--
12   -97
-biggest key is--
16   -98
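
For what it's worth, the usual explanation is that the framework allocates
one key object and one value object per reduce call and deserializes each
incoming record into them as the values iterator advances; that is why the
key's fields change inside the loop. A minimal sketch of a workaround,
assuming the same IntPair/NullWritable types as above (the variable names are
illustrative; WritableUtils is org.apache.hadoop.io.WritableUtils):

  // Copy the key before consuming the values; WritableUtils.clone makes a
  // deep copy, so 'first' keeps the maximum even while 'key' is rewritten.
  public void reduce(IntPair key, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    IntPair first = WritableUtils.clone(key, context.getConfiguration());
    for (NullWritable nw : values) {
      // 'key' mutates on each iteration; 'first' does not
    }
    context.write(first, NullWritable.get());  // the max, even after looping
  }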



At 2011-08-04 20:51:01, "John Armstrong" wrote:
>On Thu, 4 Aug 2011 14:07:12 +0800 (CST), "Daniel,Wu" wrote:
>> I am using the new API (the release is from Cloudera). We can see from the
>> output that for each call of the reduce function, 100 records were
>> processed, but as the reduce is defined as
>> reduce(IntPair key, Iterable<NullWritable> values, Context context), the
>> key should be fixed (not change) during every single execution. The
>> strange thing is that for each loop over the Iterable<NullWritable>
>> values, the key is different!! Using your explanation, the same
>> information (0:97) should be repeated 100 times, but actually it is
>> 0:97, 0:97, 0:96... 0:0, as below
>
>Ah, but they're NOT different! That's the whole point!
>
>Think carefully: how does Hadoop decide what keys are "the same" when
>sorting and grouping reducer inputs?  It uses a comparator.  If the
>comparator says compare(key1,key2)==0, then as far as Hadoop is concerned
>the keys are the same.
>
>So here the comparator only really checks the first int in the pair:
>
>"compare(0:97,0:96)?  well let's compare 0 and 0...
>Integer.compare(0,0)==0, so these are the same key."
>
>You have to be careful about the semantics of "equality" whenever you're
>using nonstandard comparators.


Re:Re:Re:Re: one question in the book of "hadoop:definitive guide 2 edition"

2011-08-03 Thread Daniel,Wu
Thanks John,

I am confused again by the result of my test case; could you please take a look?
The relevant code is:

  public static class IntSumReducer
      extends Reducer<IntPair, NullWritable, IntPair, NullWritable> {

    public void reduce(IntPair key, Iterable<NullWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int count = 0;
      for (NullWritable iw : values) {
        count++;
        System.out.print(key.getFirst());
        System.out.print(" : ");
        System.out.println(key.getSecond());
      }
      System.out.println("number of records for this group " + Integer.toString(count));
      System.out.println("-biggest key is--");
      System.out.print(key.getFirst());
      System.out.print("   -");
      System.out.println(key.getSecond());
      context.write(key, NullWritable.get());
    }
  }

I am using the new API (the release is from Cloudera). We can see from the
output that for each call of the reduce function, 100 records were processed,
but as the reduce is defined as
reduce(IntPair key, Iterable<NullWritable> values, Context context), the key
should be fixed (not change) during every single execution. The strange thing
is that for each loop over the Iterable<NullWritable> values, the key is
different!! Using your explanation, the same information (0:97) should be
repeated 100 times, but actually it is 0:97, 0:97, 0:96... 0:0, as below


0 : 97
0 : 97
0 : 96
0 : 96
0 : 94
0 : 93
0 : 93
0 : 91
0 : 90
0 : 89
0 : 86
0 : 85
   deleted to save space
0 : 2
0 : 1
0 : 1
0 : 0
0 : 0
number of records for this group 100
-biggest key is--
0   -0
4 : 99
4 : 99
4 : 98
4 : 96
4 : 95
4 : 94
4 : 93
4 : 92
4 : 91
4 : 91
4 : 90





At 2011-08-03 20:02:34, "John Armstrong" wrote:
>On Wed, 3 Aug 2011 10:35:51 +0800 (CST), "Daniel,Wu" wrote:
>> So the key of a group is determined by the first record to arrive in the
>> group. If we have 3 records in a group
>> 1: (1900,35)
>> 2: (1900,34)
>> 3: (1900,33)
>> 
>> and (1900,35) comes in as the first row, then the resulting key will be
>> (1900,35); when the second row (1900,34) comes in, it won't impact the
>> key of the group, meaning it will not overwrite the key (1900,35) with
>> (1900,34). Correct?
>
>Effectively, yes.  Remember that on the inside it's using the comparator
>something like this:
>
>(1900, 35).. do I have that key already? [searches collection of keys
>with, say, a BST] no! I'll add it here.
>(1900,34).. do I have that key already? [searches again, now getting a
>result of 0 when comparing to (1900,35)] yes! [it's not the same key, but
>according to the GroupComparator it is!] so I'll add its value to the key's
>iterable of values.
>etc.
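
For reference, a sketch of the sort comparator John is describing,
reconstructed from memory of the book's secondary-sort example (treat the
details as approximate): it orders by year ascending and temperature
descending, while the GroupComparator quoted later in this thread compares
the year only.

  public static class KeyComparator extends WritableComparator {
    protected KeyComparator() {
      super(IntPair.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
      IntPair ip1 = (IntPair) w1;
      IntPair ip2 = (IntPair) w2;
      int cmp = IntPair.compare(ip1.getFirst(), ip2.getFirst());  // year ascending
      if (cmp != 0) {
        return cmp;
      }
      return -IntPair.compare(ip1.getSecond(), ip2.getSecond());  // temperature descending
    }
  }

With this ordering, (1900,35) really does arrive first in its group, which is
what makes the "first record wins" behavior work.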


Re:Re:Re:Re:Re: one question in the book of "hadoop:definitive guide 2 edition"

2011-08-03 Thread Daniel,Wu
I understand now. And it looks like the job will print the min value instead
of the max value, per my test. In the stdout I can see the following data: 3
is the year (I faked the data myself), 99 is the max, and 0 is the min. We
can see that for year 3 there are 100 records. So inside a group the key can
be different, and
context.write(key, NullWritable.get()) will write the LAST key to the output;
since the temperatures are ordered descending, the last key has the min
temperature.

3 99

3 0
number of records for this group 100
-biggest key is--
3 0


public void reduce(IntPair key, Iterable<NullWritable> values,
                   Context context
                   ) throws IOException, InterruptedException {
  int count = 0;
  for (NullWritable iw : values) {
    count++;
    System.out.print(key.getFirst());
    System.out.print(' ');
    System.out.println(key.getSecond());
  }
  System.out.println("number of records for this group " + Integer.toString(count));
  System.out.println("-biggest key is--");
  System.out.print(key.getFirst());
  System.out.print(' ');
  System.out.println(key.getSecond());
  context.write(key, NullWritable.get());
}




At 2011-08-03 11:41:23, "Daniel,Wu" wrote:
>Or I should ask: should the input of the reducer for the group of year 1900
>be like
>
>key, value pair
>(1900,35), null
>(1900,34), null
>(1900,33), null
>
>or like
>
>(1900,35), null
>(1900,35), null ==> since (1900,34) is in the same group as (1900,35), it
>uses (1900,35) as the key.
>(1900,35), null
>
>
>At 2011-08-03 10:35:51, "Daniel,Wu" wrote:
>>
>>So the key of a group is determined by the first record to arrive in the
>>group. If we have 3 records in a group
>>1: (1900,35)
>>2: (1900,34)
>>3: (1900,33)
>>
>>and (1900,35) comes in as the first row, then the resulting key will be
>>(1900,35); when the second row (1900,34) comes in, it won't impact the
>>key of the group, meaning it will not overwrite the key (1900,35) with
>>(1900,34). Correct?
>>
>>>in the KeyComparator, these are guaranteed to come in reverse order in the
>>>second slot.  That is, if 35 is the maximum temperature then (1900,35) will
>>>come before ANY other (1900,t).  Then as the GroupComparator does its
>>>thing, any time (1900,t) comes up it gets compared AND FOUND EQUAL TO
>>>(1900,35), and thus its (null) value is added to the (1900,35) group.
>>>The reducer then gets a (1900,35) key with an Iterable of null values,
>>>which it pretty much discards and just emits the key, which contains the
>>>maximum value.


Re:Re:Re:Re: one question in the book of "hadoop:definitive guide 2 edition"

2011-08-02 Thread Daniel,Wu
Or I should ask: should the input of the reducer for the group of year 1900
be like

key, value pair
(1900,35), null
(1900,34), null
(1900,33), null

or like

(1900,35), null
(1900,35), null ==> since (1900,34) is in the same group as (1900,35), it
uses (1900,35) as the key.
(1900,35), null


At 2011-08-03 10:35:51, "Daniel,Wu" wrote:
>
>So the key of a group is determined by the first record to arrive in the
>group. If we have 3 records in a group
>1: (1900,35)
>2: (1900,34)
>3: (1900,33)
>
>and (1900,35) comes in as the first row, then the resulting key will be
>(1900,35); when the second row (1900,34) comes in, it won't impact the key
>of the group, meaning it will not overwrite the key (1900,35) with
>(1900,34). Correct?
>
>>in the KeyComparator, these are guaranteed to come in reverse order in the
>>second slot.  That is, if 35 is the maximum temperature then (1900,35) will
>>come before ANY other (1900,t).  Then as the GroupComparator does its
>>thing, any time (1900,t) comes up it gets compared AND FOUND EQUAL TO
>>(1900,35), and thus its (null) value is added to the (1900,35) group.
>>The reducer then gets a (1900,35) key with an Iterable of null values,
>>which it pretty much discards and just emits the key, which contains the
>>maximum value.


Re:Re:Re: one question in the book of "hadoop:definitive guide 2 edition"

2011-08-02 Thread Daniel,Wu

So the key of a group is determined by the first record to arrive in the
group. If we have 3 records in a group
1: (1900,35)
2: (1900,34)
3: (1900,33)

and (1900,35) comes in as the first row, then the resulting key will be
(1900,35); when the second row (1900,34) comes in, it won't impact the key of
the group, meaning it will not overwrite the key (1900,35) with (1900,34).
Correct?

>in the KeyComparator, these are guaranteed to come in reverse order in the
>second slot.  That is, if 35 is the maximum temperature then (1900,35) will
>come before ANY other (1900,t).  Then as the GroupComparator does its thing,
>any time (1900,t) comes up it gets compared AND FOUND EQUAL TO (1900,35),
>and thus its (null) value is added to the (1900,35) group.  The reducer then
>gets a (1900,35) key with an Iterable of null values, which it pretty much
>discards and just emits the key, which contains the maximum value.


Re:Re: one question in the book of "hadoop:definitive guide 2 edition"

2011-08-02 Thread Daniel,Wu

We usually use something like values.next() to loop over every row in a
specific group, but I don't see any code that loops over the list; at the
least it needs to get the first row in the list, with something like
values.get().
Or will NullWritable.get() get the first row in the group?


static class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<IntPair, NullWritable, IntPair, NullWritable> {
  public void reduce(IntPair key, Iterator<NullWritable> values,
      OutputCollector<IntPair, NullWritable> output, Reporter reporter)
      throws IOException {
    output.collect(key, NullWritable.get());
  }
}
> If we group values in the reducer by the year part of the key,
>then we will see all the records for the same year in one reduce group. 
>And since they are sorted by temperature in descending order, the first is
>the maximum temperature."
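
(On the last question: NullWritable.get() does not fetch a row from the
group; NullWritable is a stateless singleton and get() just returns the
shared instance.)

  // NullWritable carries no data; both calls return the same object.
  NullWritable a = NullWritable.get();
  NullWritable b = NullWritable.get();
  assert a == b;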

At 2011-08-02 21:34:57, "John Armstrong" wrote:
>On Tue, 2 Aug 2011 21:25:47 +0800 (CST), "Daniel,Wu" wrote:
>> at page 243:
>> Per my understanding, the reducer is supposed to output the first value
>> (the maximum) for each year. But I just don't know how it works.
>> 
>> suppose we have the data
>> 1901  200
>> 1901  300
>> 1901  400
>> 
>> Since grouping is done by the year, we have only one group, but we have 3
>> different keys, as the key is a combination of year and temperature. For
>> the reduce, the output should be a key, list(value) pair; since we have 3
>> keys, we should output 3 rows, but since we have only one group, we only
>> output 1 row. So where is the conflict? Where do I misunderstand?
>
>Keep reading the section in the book:
>
>"This still isn't enough to achieve our coal, however.  A partitioner
>ensures only that one reducer receives all the records for a year; it
>doesn't change the fact that the reducer groups by key within the
>partition... The final piece of the puzzle is the setting to control the
>grouping.  If we group values in the reducer by the year part of the key,
>then we will see all the records for the same year in one reduce group. 
>And since they are sorted by temperature in descending order, the first is
>the maximum temperature."
>
>That is, in that example they also change the way the reducer groups its
>inputs.


one question in the book of "hadoop:definitive guide 2 edition"

2011-08-02 Thread Daniel,Wu
  at page 243:
Per my understanding, the reducer is supposed to output the first value (the
maximum) for each year. But I just don't know how it works.

suppose we have the data
1901  200
1901  300
1901  400

Since grouping is done by the year, we have only one group, but we have 3
different keys, as the key is a combination of year and temperature. For the
reduce, the output should be a key, list(value) pair; since we have 3 keys,
we should output 3 rows, but since we have only one group, we only output 1
row. So where is the conflict? Where do I misunderstand?

public static class GroupComparator extends WritableComparator {
  protected GroupComparator() {
    super(IntPair.class, true);
  }
  @Override
  public int compare(WritableComparable w1, WritableComparable w2) {
    IntPair ip1 = (IntPair) w1;
    IntPair ip2 = (IntPair) w2;
    return IntPair.compare(ip1.getFirst(), ip2.getFirst());
  }
}

static class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<IntPair, NullWritable, IntPair, NullWritable> {
  public void reduce(IntPair key, Iterator<NullWritable> values,
      OutputCollector<IntPair, NullWritable> output, Reporter reporter)
      throws IOException {
    output.collect(key, NullWritable.get());
  }
}
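
For completeness, a hedged sketch of the job wiring that makes this grouping
take effect, in the old (mapred) API style of the snippet above.
FirstPartitioner and KeyComparator are assumed to be defined as in the book's
example (partitioning on the year, sorting year ascending and temperature
descending), so the class names here are placeholders:

  JobConf conf = new JobConf(MaxTemperature.class);  // MaxTemperature is a placeholder
  conf.setPartitionerClass(FirstPartitioner.class);             // partition by year
  conf.setOutputKeyComparatorClass(KeyComparator.class);        // sort: year asc, temp desc
  conf.setOutputValueGroupingComparator(GroupComparator.class); // group by year only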


Re:Re: error:Type mismatch in value from map

2011-07-29 Thread Daniel,Wu
Thanks Joey,

It works, but there is one thing I don't understand:

1: in the map

 extends Mapper<Text, Text, Text, IntWritable>

so the output value is of type IntWritable.
2: in the reduce

 extends Reducer<Text, Text, Text, IntWritable>

So the input value is of type Text.

The output type of the map should be the same as the input type of the
reduce, correct? But here IntWritable != Text.

And the code can run without any error; shouldn't it complain about a type
mismatch?
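
My understanding (not verified against the exact source) is that the generic
parameters on Mapper and Reducer are erased at compile time, so nothing
checks them at runtime; the only runtime check happens in the map output
collector, which compares each emitted object's class against what was
registered via job.setOutputValueClass (or setMapOutputValueClass). Roughly,
as a paraphrase from memory of the 0.20-era MapTask:

  // Sketch, not the exact Hadoop source. The misspelling "recieved" matches
  // the log message quoted in this thread.
  if (value.getClass() != valClass) {
    throw new IOException("Type mismatch in value from map: expected "
        + valClass.getName() + ", recieved " + value.getClass().getName());
  }

Since the fixed map emits IntWritable, this check passes; the mismatched
generic declaration on the reducer is never inspected by the framework.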

At 2011-07-29 22:49:31, "Joey Echeverria" wrote:
>If you want to use a combiner, your map has to output the same types
>as your combiner outputs. In your case, modify your map to look like
>this:
>
>  public static class TokenizerMapper
>      extends Mapper<Text, Text, Text, IntWritable> {
>    public void map(Text key, Text value, Context context
>        ) throws IOException, InterruptedException {
>      context.write(key, new IntWritable(1));
>    }
>  }
>
>>  11/07/29 22:22:22 INFO mapred.JobClient: Task Id : 
>> attempt_201107292131_0011_m_00_2, Status : FAILED
>> java.io.IOException: Type mismatch in value from map: expected 
>> org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.Text
>>
>> But I already set IntWritable in 2 places,
>> 1: Reducer
>> 2:job.setOutputValueClass(IntWritable.class);
>>
>> So where am I wrong?
>>
>> public class MyTest {
>>
>>  public static class TokenizerMapper
>>   extends Mapper<Text, Text, Text, Text> {
>>public void map(Text key, Text value, Context context
>>) throws IOException, InterruptedException {
>>context.write(key, value);
>>}
>>  }
>>
>>  public static class IntSumReducer
>>   extends Reducer<Text, Text, Text, IntWritable> {
>>
>>public void reduce(Text key, Iterable<Text> values,
>>   Context context
>>   ) throws IOException, InterruptedException {
>>   int count = 0;
>>   for (Text iw:values) {
>>count++;
>>   }
>>  context.write(key, new IntWritable(count));
>> }
>>   }
>>
>>  public static void main(String[] args) throws Exception {
>>Configuration conf = new Configuration();
>> // the configuration of the separator should be done in conf
>>conf.set("key.value.separator.in.input.line", ",");
>>String[] otherArgs = new GenericOptionsParser(conf, 
>> args).getRemainingArgs();
>>if (otherArgs.length != 2) {
>>  System.err.println("Usage: wordcount <in> <out>");
>>  System.exit(2);
>>}
>>Job job = new Job(conf, "word count");
>>job.setJarByClass(WordCount.class);
>>job.setMapperClass(TokenizerMapper.class);
>>job.setCombinerClass(IntSumReducer.class);
>> //job.setReducerClass(IntSumReducer.class);
>>job.setInputFormatClass(KeyValueTextInputFormat.class);
>>// job.set("key.value.separator.in.input.line", ",");
>>job.setOutputKeyClass(Text.class);
>>job.setOutputValueClass(IntWritable.class);
>>FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
>>FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
>>System.exit(job.waitForCompletion(true) ? 0 : 1);
>>  }
>> }
>>
>
>
>
>-- 
>Joseph Echeverria
>Cloudera, Inc.
>443.305.9434


error:Type mismatch in value from map

2011-07-29 Thread Daniel,Wu
When I run the job, it throws the following error.

  11/07/29 22:22:22 INFO mapred.JobClient: Task Id : 
attempt_201107292131_0011_m_00_2, Status : FAILED
java.io.IOException: Type mismatch in value from map: expected 
org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.Text

But I already set IntWritable in 2 places,
1: Reducer
2:job.setOutputValueClass(IntWritable.class);

So where am I wrong?

public class MyTest {

  public static class TokenizerMapper
   extends Mapper<Text, Text, Text, Text> {
public void map(Text key, Text value, Context context
) throws IOException, InterruptedException {
context.write(key, value);
}
  }
 
  public static class IntSumReducer
   extends Reducer<Text, Text, Text, IntWritable> {

public void reduce(Text key, Iterable<Text> values,
   Context context
   ) throws IOException, InterruptedException {
   int count = 0;
   for (Text iw:values) {
count++;
   }
  context.write(key, new IntWritable(count));
 }
   }

  public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// the configuration of the separator should be done in conf
conf.set("key.value.separator.in.input.line", ",");
String[] otherArgs = new GenericOptionsParser(conf, 
args).getRemainingArgs();
if (otherArgs.length != 2) {
  System.err.println("Usage: wordcount <in> <out>");
  System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
//job.setReducerClass(IntSumReducer.class);
job.setInputFormatClass(KeyValueTextInputFormat.class);
// job.set("key.value.separator.in.input.line", ",");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}


/tmp/hadoop-oracle/dfs/name is in an inconsistent state

2011-07-28 Thread Daniel,Wu
When I started Hadoop, the namenode failed to start because of the following
error. The strange thing is that it says /tmp/hadoop-oracle/dfs/name is
inconsistent, but I don't think I have configured it as
/tmp/hadoop-oracle/dfs/name. Where should I check for the related configuration?
  2011-07-28 21:07:35,383 ERROR 
org.apache.hadoop.hdfs.server.namenode.NameNode: 
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory 
/tmp/hadoop-oracle/dfs/name is in an inconsistent state: storage directory does 
not exist or is not accessible.
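
In 0.20-era Hadoop, dfs.name.dir defaults to ${hadoop.tmp.dir}/dfs/name, and
hadoop.tmp.dir defaults to /tmp/hadoop-${user.name}; running as the oracle
user gives exactly /tmp/hadoop-oracle/dfs/name. So the value comes from the
defaults, not from anything you set, and since /tmp is often cleaned on
reboot, the namenode image can simply disappear. A sketch of an explicit
setting in hdfs-site.xml (the path here is only an example):

  <property>
    <name>dfs.name.dir</name>
    <value>/var/hadoop/dfs/name</value>
  </property>

After pointing it at a persistent directory, you would need to run
"hadoop namenode -format" once (or copy the old image in) before starting
the namenode.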



where to find the log info

2011-07-27 Thread Daniel,Wu
Hi everyone,

I am new to this and want to do some debugging/logging. I'd like to check
what the value is for each mapper execution. If I add the println line below,
where can I find the log output? If I can't do it this way, what should I do?

 public void map(Object key, Text value, Context context
     ) throws IOException, InterruptedException {
   StringTokenizer itr = new StringTokenizer(value.toString());
   System.out.println(value.toString());  // the added debug line
   while (itr.hasMoreTokens()) {
     word.set(itr.nextToken());
     context.write(word, one);
   }
 }
  }
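
In 0.20-era Hadoop, anything a task writes to System.out ends up in the
task's stdout file under ${HADOOP_LOG_DIR}/userlogs/<task-attempt-id>/ on the
node that ran it, and it is also viewable by drilling into the task attempt
from the jobtracker web UI. An alternative sketch using Commons Logging,
which writes to the task's syslog file instead (the class name and types here
are illustrative, not from the original post):

  import java.io.IOException;
  import org.apache.commons.logging.Log;
  import org.apache.commons.logging.LogFactory;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class DebugMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final Log LOG = LogFactory.getLog(DebugMapper.class);

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      LOG.info("map input value: " + value);  // appears in the task's syslog
      // ... tokenizing and writing would go here, as in the snippet above
    }
  }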