Re: A brief report of Second Hadoop in China Salon
Congratulations! Nicholas Sze - Original Message From: He Yongqiang heyongqi...@software.ict.ac.cn To: core-...@hadoop.apache.org; core-user@hadoop.apache.org Sent: Friday, May 15, 2009 6:09:50 PM Subject: A brief report of Second Hadoop in China Salon Hi, all On May 9, we held the second Hadoop in China salon. About 150 people attended; 46% of them were engineers/managers from industry and 38% were students/professors from universities and institutes. The salon was successfully held with great technical support from Yahoo! Beijing R&D, Zheng Shao from Facebook Inc., Wang Shouyan from Baidu Inc., and many other high-technology companies in China. We got over one hundred pieces of feedback from attendees; most of them are interested in the details and want more discussion, and about a third want us to include more topics or more sessions on the Hadoop subprojects. Most students/professors want to become more familiar with Hadoop and to find new research topics on top of it. Most students also want to get involved and contribute to Hadoop, but do not know how, or find it a little difficult because of language/time-zone problems. Thank you to all the attendees again. Without you, it would never have succeeded. We have already put the slides on the site www.hadooper.cn, and the videos are coming soon. BTW, I insist on keeping this event nonprofit. In the past two meetings, we did not charge anyone for anything.
Re: Is there any performance issue with Jrockit JVM for Hadoop
To follow up on this question: I have also asked for help on the JRockit forum. They kindly offered some useful and detailed suggestions based on the JRA results. After updating the option list, the performance did get better to some extent, but it is still not comparable with the Sun JVM. Maybe it is due to this use case's short task durations and the different implementations at the JVM layer between Sun and JRockit. I will go back to using the Sun JVM for now. Thanks all for your time and help. On Tue, May 12, 2009 at 1:21 PM, Grace syso...@gmail.com wrote: Thanks quite a lot for your time and kind advice. You are so right. From Mission Control, I found that the map process allocated 90% large objects and 10% small ones. I will try the TLA settings to see if they work. Thanks again. On Tue, May 12, 2009 at 6:43 AM, Scott Carey sc...@richrelevance.com wrote: Before Java 1.6, JRockit was almost always faster than Sun, and often by a lot (especially in the 1.4 days). Now it's much more use-case dependent: some apps are faster on one than the other, and vice versa. I have tested many other applications with both in the past (and IBM's VM on AIX, and HP's VM on HP-UX), but not Hadoop. I suppose it just may be a use case that Sun has done a bit better. The JRockit settings that remain that may be of use are the TLA settings. You can use Mission Control to do a memory profile and see if the average object sizes are large enough to warrant increasing the thread-local allocation size thresholds. That's the only major tuning knob I recall that I don't see below. If Hadoop is creating a lot of medium-sized (~1000 byte to 32 kbyte) objects, JRockit isn't so optimized by default for that. You should consider sending the data to the JRockit team; they are generally on the lookout for example use cases where they do poorly relative to Sun. However, now that they are all under the Oracle-Larry umbrella, it wouldn't shock me if that changes.
On 5/7/09 6:34 PM, Grace syso...@gmail.com wrote: Thanks all for replying. I have run several times with different Java options for the Map/Reduce tasks, but there is not much difference. The following is an example of my test settings: Test A: -Xmx1024m -server -XXlazyUnlocking -XlargePages -XgcPrio:deterministic -XXallocPrefetch -XXallocRedoPrefetch Test B: -Xmx1024m Test C: -Xmx1024m -XXaggressive Is there any trick or special setting for the JRockit VM on Hadoop? The Hadoop Quick Start guide says "Java 1.6.x, preferably from Sun". Is there any known concern about JRockit performance? I'd highly appreciate your time and consideration. On Fri, May 8, 2009 at 7:36 AM, JQ Hadoop jq.had...@gmail.com wrote: There are a lot of tuning knobs for the JRockit JVM when it comes to performance; that tuning can make a huge difference. I'm very interested if there are some tuning tips for Hadoop. Grace, what are the parameters that you used in your testing? Thanks, JQ On Thu, May 7, 2009 at 11:35 PM, Steve Loughran ste...@apache.org wrote: Chris Collins wrote: A couple of years back we did a lot of experimentation between Sun's VM and JRockit. We had initially assumed that JRockit was going to scream, since that's what the press were saying. In short, what we discovered was that certain JDK library usage was a little bit faster with JRockit, but for core VM performance such as synchronization and primitive operations, the Sun VM outperformed it. We were not taking account of startup time, just raw code execution. As I said, this was a couple of years back, so things may have changed. C I run JRockit as it's what some of our key customers use, and we need to test things. One lovely feature: tests time out before the stack runs out on a recursive operation; clearly different stack management at work. Another: no PermGen heap space to fiddle with. * I have to turn debug logging off in Hadoop test runs, or there are problems.
* It uses short pointers (32 bits long) for near memory on a 64-bit JVM, so your memory footprint on sub-4GB VM images is better. Java 7 promises this, and with the merger, who knows what we will see. This is unimportant on 32-bit boxes. * Debug single-stepping doesn't work. That's OK, I use functional tests instead :) I haven't looked at outright performance.
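For anyone who wants to experiment with the JVM options discussed in this thread on a Hadoop cluster, they would go into the mapred.child.java.opts property so that they apply to the child task JVMs. A minimal configuration sketch, reusing only flags already mentioned above (whether these flags actually help is workload-dependent, as the thread makes clear):

```xml
<!-- Sketch: pass the JRockit flags from this thread to map/reduce task JVMs. -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m -XXaggressive -XXlazyUnlocking -XlargePages</value>
</property>
```

Note that these are JRockit-specific flags; the Sun JVM would reject them, so the value has to match whichever JVM the tasktrackers actually launch.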
Tasks killed when running: bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
Hello everybody, I am a new Hadoop user. I started running Hadoop using the site http://hadoop.apache.org/core/docs/current/quickstart.html, but when I run the command bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' in pseudo-distributed mode, I get errors like: Task task_200801251420_0007_m_06_0 failed to report status for 601 seconds. Killing! Task task_200801251420_0007_m_07_0 failed to report status for 602 seconds. Killing! and so on. Then all the tasks get killed, but the datanode is still alive. I have been running Hadoop in VMware on a machine with 512MB of RAM, so please help me solve this problem. Thanks in advance. Regards, Ashish
Re: Loading FSEditLog fails
OK, I've just solved the problem with minor data loss. Steps to solve:
1) Comment out FSEditLog.java:542
2) Compile a new hadoop-core jar
3) Start the cluster with the new jar. The namenode will skip the bad records in name/current/edits and write a new edits file back to the filesystem. As the bad records stand for actual IO operations, some files in HDFS may be deleted, since they consist of blocks that do not correspond to edits entries. In my situation, I lost the files from the last fortnight.
4) Wait for some time while the datanodes remove the blocks that do not correspond to entries in the edits file
5) Stop the cluster
6) Replace the hadoop-core jar with the release one
7) Start the cluster
hadoop MapReduce and stop words
Hi, I am trying to include stop words in Hadoop MapReduce and, later on, in Hive. What is the accepted solution for handling stop words in Hadoop? All I can think of is to load all the stop words into an array in the mapper and then check each token against the stop words (this would be O(n*m) for n tokens and m stop words). Regards
Re: hadoop MapReduce and stop words
Perhaps some kind of in-memory index would be better than iterating over an array? A binary tree or so. I did something similar with polygon indexes and point data. It requires careful memory planning on the nodes if the indexes are large (mine were several GB). Just a thought, Tim
Re: hadoop MapReduce and stop words
Can you please elaborate more on the in-memory index? What kind of software did you use to implement it? Regards
Re: hadoop MapReduce and stop words
Try googling "binary tree java" and you will get loads of hits... This is a simple implementation, but I am sure there are better ones that handle balancing better. Cheers Tim

public class BinaryTree {
    public static void main(String[] args) {
        BinaryTree bt = new BinaryTree();
        for (int i = 0; i < 1000; i++) {
            bt.insert("" + i);
        }
        System.out.println(bt.lookup("999"));
        System.out.println(bt.lookup("100"));
        System.out.println(bt.lookup("a")); // should be null
    }

    private Node root;

    private static class Node {
        Node left;
        Node right;
        String value;

        public Node(String value) {
            this.value = value;
        }
    }

    public String lookup(String key) {
        return lookup(root, key);
    }

    private String lookup(Node node, String value) {
        if (node == null) {
            return null;
        }
        if (value.equals(node.value)) {
            return node.value;
        } else if (value.compareTo(node.value) < 0) {
            return lookup(node.left, value);
        } else {
            return lookup(node.right, value);
        }
    }

    public void insert(String value) {
        root = insert(root, value);
    }

    private Node insert(Node node, String value) {
        if (node == null) {
            node = new Node(value);
        } else {
            if (value.compareTo(node.value) <= 0) {
                node.left = insert(node.left, value);
            } else {
                node.right = insert(node.right, value);
            }
        }
        return node;
    }
}
Re: hadoop MapReduce and stop words
Just use a java.util.HashSet for this. There should only be a few dozen stopwords, so load them into a HashSet when the Mapper starts up, and then check your tokens against it while you're processing records. -- Stefan From: tim robertson timrobertson...@gmail.com Reply-To: core-user@hadoop.apache.org Date: Sat, 16 May 2009 15:48:23 +0200 To: core-user@hadoop.apache.org Subject: Re: hadoop MapReduce and stop words
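To make the HashSet suggestion concrete, here is a minimal self-contained sketch. The word list and tokenization are made up for illustration; in a real Mapper you would populate the set once when the task starts up (e.g. from a file shipped with the job) and apply the check inside map():

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordFilter {

    // Illustrative stop-word list; a real job would load it from a file.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
            Arrays.asList("a", "an", "and", "the", "of", "to"));

    // HashSet membership is O(1) per token, versus O(m) for scanning an array.
    public static boolean isStopWord(String token) {
        return STOP_WORDS.contains(token.toLowerCase());
    }

    public static void main(String[] args) {
        StringBuilder kept = new StringBuilder();
        for (String token : "The quick brown fox and the dog".split(" ")) {
            if (!isStopWord(token)) {
                if (kept.length() > 0) {
                    kept.append(' ');
                }
                kept.append(token);
            }
        }
        System.out.println(kept); // quick brown fox dog
    }
}
```

This turns the O(n*m) array scan from the original question into O(n) overall, and the set easily fits in memory for any realistic stop-word list.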
Re: A brief report of Second Hadoop in China Salon
Congratulations! Wished I were there. :-) Best, Arber
sort example
Hi, I am trying to sort some data with Hadoop (streaming mode). The input looks like: $ cat small_numbers.txt 9971681 9686036 2592322 4518219 1467363 To send my job to the cluster I use: hadoop jar /home/drio/hadoop-0.20.0/contrib/streaming/hadoop-0.20.0-streaming.jar \ -D mapred.reduce.tasks=2 \ -D stream.num.map.output.key.fields=1 \ -D mapred.text.key.comparator.options=-k1,1n \ -input /input \ -output /output \ -mapper sort_mapper.rb \ -file `pwd`/scripts_sort/sort_mapper.rb \ -reducer sort_reducer.rb \ -file `pwd`/scripts_sort/sort_reducer.rb The mapper basically writes key, value = input_line, input_line. The reducer just prints the keys from standard input. In case you care: $ cat scripts_sort/sort_* #!/usr/bin/ruby STDIN.each_line { |l| puts "#{l.chomp}\t#{l.chomp}" } - #!/usr/bin/ruby STDIN.each_line { |line| puts line.split[0] } I run the job and it completes without problems; the output looks like: d...@milhouse:~/tmp $ cat output/part-1 1380664 1467363 32485 3857847 422538 4354952 4518219 5719091 7838358 9686036 d...@milhouse:~/tmp $ cat output/part-0 1453024 2592322 3875994 4689583 5340522 607354 6447778 6535495 8647464 9971681 These are my questions: 1. It seems the sorting (per reducer) is working, but I don't know why, for example, 607354 is not the first number in the output. 2. How can I tell Hadoop to send data to the reducers in such a way that inputReduce1keys < inputReduce2keys < ... < inputReduceNkeys? That way I would ensure the data is fully sorted once the job is done. I've also tried using the identity classes for the mapper and reducer, but the job dies, generating exceptions about the input format. Can anyone show me or point me to some code showing how to properly perform sorting? Thanks in advance, -drd
Re: sort example
BTW, basically, this is the Unix equivalent of what I am trying to do: $ cat input_file.txt | sort -n -drd
Re: sort example
1) It is doing an alphabetical sort by default. You can force Hadoop Streaming to sort numerically with: -D mapred.text.key.comparator.options=-k2,2nr \ (see the section "A Useful Comparator Class" in the streaming docs: http://hadoop.apache.org/core/docs/current/streaming.html and https://issues.apache.org/jira/browse/HADOOP-2302). 2) For the second issue, I think you will need to use 1 reducer to guarantee global sort order, or use another MR pass. -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com http://delicious.com/pskomoroch http://twitter.com/peteskomoroch
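The per-reducer output order in the question is consistent with the alphabetical-sort diagnosis: streaming keys are compared as text unless a numeric comparator is in effect, and lexicographic order differs from numeric order. A small standalone check, using keys taken from the part file above:

```java
import java.util.Arrays;

public class KeyOrderDemo {
    public static void main(String[] args) {
        // The first few keys from the reducer output in the question.
        String[] asText = {"1380664", "1467363", "32485", "3857847", "422538"};
        // Lexicographic sort: "32485" lands after "1467363" because the
        // strings are compared character by character ('3' > '1').
        Arrays.sort(asText);
        System.out.println(Arrays.toString(asText));

        int[] asNumbers = {1380664, 1467363, 32485, 3857847, 422538};
        // Numeric sort: 32485 comes first, as the poster expected.
        Arrays.sort(asNumbers);
        System.out.println(Arrays.toString(asNumbers));
    }
}
```

The text sort reproduces exactly the order seen in the part file, which is why 607354 is not first in its reducer's output either.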
Re: sort example
I just copied and pasted that comparator option from the docs; the -n part is what you want in this case.