Re: How to verify all my master/slave name/data nodes have been configured correctly?
Hi, Use the JobTracker web UI at master:50030 and the NameNode web UI at master:50070. On Fri, Feb 10, 2012 at 9:03 AM, Wq Az azq...@gmail.com wrote: Hi, Is there a quick way to check this? Thanks ahead, Will -- Join me at http://hadoopworkshop.eventbrite.com/
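For a quick command-line check as well (assuming a 0.20/1.x-style install started with the usual scripts; the commands are standard, the paths are whatever your install uses):

$ bin/hadoop dfsadmin -report   # live/dead datanodes as seen by the namenode
$ jps                           # which daemons are actually running on this box

If a datanode or tasktracker you expect is missing from the report or from the web UIs, its log under HADOOP_HOME/logs usually says why.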
Re: Can I start a Hadoop job from an EJB?
Yes, you can. Please make sure all the Hadoop jars and the conf directory are on the classpath. On Thu, Feb 9, 2012 at 7:02 AM, Sanjeev Verma sanjeev.x.ve...@gmail.com wrote: This is based on my understanding and no real life experience, so going to go out on a limb here :-)...assuming that you are planning on kicking off this map-reduce job based on an event of sorts (a file arrived and is ready to be processed?), and no direct user wait is involved, then yes, I would imagine you should be able to do something like this from inside an MDB (asynchronous, so no one is held up in the queue). Some random thoughts: 1. The user under which the app server is running will need to be set up as a hadoop client user - this is rather obvious, just wanted to list it for completeness. 2. Hadoop, AFAIK, does not support transactions, and no XA. I assume you have no need for any of that stuff either. 3. Your MDB could potentially log job start/end times, but that info is available from Hadoop's monitoring infrastructure also. I would be very interested in hearing what senior members on the list have to say... HTH Sanjeev On Wed, Feb 8, 2012 at 2:18 PM, Andy Doddington a...@doddington.net wrote: OK, I have a working Hadoop application that I would like to integrate into an application server environment. So, the question arises: can I do this? E.g. can I create a JobClient instance inside an EJB and run it in the normal way, or is something more complex required? In addition, are there any unpleasant interactions between the application server and the hadoop runtime? Thanks for any guidance. Andy D. -- https://github.com/zinnia-phatak-dev/Nectar
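A minimal sketch of the kind of launcher Andy describes, using the old mapred API (0.20/1.x). The surrounding bean and the /in and /out paths are placeholders I made up for the example, not anything from this thread; the identity mapper/reducer just stand in for real classes, and the Hadoop conf directory is assumed to be on the app server's classpath so the job goes to the real cluster rather than the local runner.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class JobLauncherBean {                        // imagine this as an MDB's onMessage body
    public void launch() throws Exception {
        JobConf conf = new JobConf(JobLauncherBean.class);
        conf.setJobName("launched-from-ejb");
        conf.setOutputKeyClass(LongWritable.class);   // matches the identity mapper's pass-through types
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(IdentityMapper.class);    // swap in your real mapper/reducer here
        conf.setReducerClass(IdentityReducer.class);
        FileInputFormat.setInputPaths(conf, new Path("/in"));    // placeholder HDFS paths
        FileOutputFormat.setOutputPath(conf, new Path("/out"));
        JobClient.runJob(conf);   // blocks until the job finishes; JobClient.submitJob(conf) returns immediately
    }
}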
Re: Standalone operation - file permission, Pseudo-Distributed operation - no output
Hello, Can you please tell us which version of Hadoop you are using, and does your error match the message below? Failed to set permissions of path: file:/tmp/hadoop-jj/mapred/staging/jj-1931875024/.staging to 0700 Thanks Jagat On Thu, Mar 8, 2012 at 5:10 PM, madhu phatak phatak@gmail.com wrote: Hi, Just make sure both the task tracker and the data node are up. Go to localhost:50030 and see whether it shows the number of nodes equal to 1. On Thu, Feb 9, 2012 at 9:18 AM, Kyong-Ho Min kyong-ho@sydney.edu.au wrote: Hello, I am a hadoop newbie and I have 2 questions. I followed the Hadoop standalone mode testing. I got a file permission error message from the Cygwin terminal. I checked the mailing list and changed the relevant part in RawLocalFileSystem.java, but it is not working. I still have a file permission error in the directory: c:/tmp/hadoop../mapred/staging... I followed the instructions for Pseudo-Distributed operation. SSH is OK and namenode -format is OK. But it did not return any results and the processing just halted. The Cygwin console scripts are - $ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+' 12/02/09 14:25:44 INFO mapred.FileInputFormat: Total input paths to process : 17 12/02/09 14:25:44 INFO mapred.JobClient: Running job: job_201202091423_0001 12/02/09 14:25:45 INFO mapred.JobClient: map 0% reduce 0% - Any help pls. Thanks. Kyongho Min -- https://github.com/zinnia-phatak-dev/Nectar
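If the web UI does not make it obvious, running jps in the Cygwin shell (assuming the daemons were started with the standard start-dfs.sh / start-mapred.sh scripts and a Sun JDK, so jps is available) should list NameNode, DataNode, JobTracker and TaskTracker; whichever one is missing usually has the reason - often exactly that staging-directory permission failure - near the end of its log under HADOOP_HOME/logs.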
Convergence on File Format?
Hi, It seems that Avro is poised to become the file format, is that still the case? We've looked at Text, RCFile and Avro. Text is nice, but we'd really need to extend it. RCFile is great for Hive, but it has been a challenge using it outside of Hive. Avro has a great feature set, but is comparably (to RCFile) significantly slower and larger on disk in our testing, but if it has the highest rate of development, it may be the right choice. If you were choosing a File Format today to build a general purpose cluster (general purpose in the sense of using all the Hadoop tools, not just Hive), what would you choose? (one of the choices being development of a Custom format) Thanks, Mike
Re: Convergence on File Format?
We started using Avro a few months ago and the results are great! Easy to use, reliable, feature rich, great integration with MapReduce. On 3/8/12 3:07 PM, Michal Klos mk...@compete.com wrote: Hi, It seems that Avro is poised to become the file format, is that still the case? We've looked at Text, RCFile and Avro. Text is nice, but we'd really need to extend it. RCFile is great for Hive, but it has been a challenge using it outside of Hive. Avro has a great feature set, but is comparably (to RCFile) significantly slower and larger on disk in our testing, but if it has the highest rate of development, it may be the right choice. If you were choosing a File Format today to build a general purpose cluster (general purpose in the sense of using all the Hadoop tools, not just Hive), what would you choose? (one of the choices being development of a Custom format) Thanks, Mike
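In case a concrete starting point helps anyone evaluating it: a minimal sketch of writing an Avro container file with the generic API (plain Avro, not the MapReduce integration; the schema, field names and file name are made up for the example):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-field record schema; in practice this usually lives in a .avsc file.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Hit\",\"fields\":["
            + "{\"name\":\"url\",\"type\":\"string\"},"
            + "{\"name\":\"count\",\"type\":\"int\"}]}");

        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("hits.avro"));   // the schema is stored in the file header

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("url", "http://example.com/");
        rec.put("count", 42);
        writer.append(rec);
        writer.close();
    }
}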
Re: Profiling Hadoop Job
Does anyone have any idea how to solve this problem? Regardless of whether I'm using plain HPROF or profiling through Starfish, I am getting the same error: Exception in thread main java.io.FileNotFoundException: attempt_201203071311_0004_m_ 00_0.profile (Permission denied) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:194) at java.io.FileOutputStream.init(FileOutputStream.java:84) at org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226) at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251) at com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) But I can't find what permissions to change to fix this issue. Any ideas? Thanks in advance, Best, -Leo On Wed, Mar 7, 2012 at 3:52 PM, Leonardo Urbina lurb...@mit.edu wrote: Thanks, -Leo On Wed, Mar 7, 2012 at 3:47 PM, Jie Li ji...@cs.duke.edu wrote: Hi Leo, Thanks for pointing out the outdated README file. Glad to tell you that we do support the old API in the latest version. See here: http://www.cs.duke.edu/starfish/previous.html Welcome to join our mailing list and your questions will reach more of our group members. Jie On Wed, Mar 7, 2012 at 3:37 PM, Leonardo Urbina lurb...@mit.edu wrote: Hi Jie, According to the Starfish README, the hadoop programs must be written using the new Hadoop API. This is not my case (I am using MultipleInputs among other non-new API supported features). Is there any way around this? Thanks, -Leo On Wed, Mar 7, 2012 at 3:19 PM, Jie Li ji...@cs.duke.edu wrote: Hi Leonardo, You might want to try Starfish which supports the memory profiling as well as cpu/disk/network profiling for the performance tuning. Jie -- Starfish is an intelligent performance tuning tool for Hadoop. Homepage: www.cs.duke.edu/starfish/ Mailing list: http://groups.google.com/group/hadoop-starfish On Wed, Mar 7, 2012 at 2:36 PM, Leonardo Urbina lurb...@mit.edu wrote: Hello everyone, I have a Hadoop job that I run on several GBs of data that I am trying to optimize in order to reduce the memory consumption as well as improve the speed. 
I am following the steps outlined in Tom White's Hadoop: The Definitive Guide for profiling using HPROF (p161), by setting the following properties in the JobConf: job.setProfileEnabled(true); job.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,depth=6," + "force=n,thread=y,verbose=n,file=%s"); job.setProfileTaskRange(true, "0-2"); job.setProfileTaskRange(false, "0-2"); I am trying to run this locally on a single pseudo-distributed install of hadoop (0.20.2) and it gives the following error: Exception in thread "main" java.io.FileNotFoundException: attempt_201203071311_0004_m_00_0.profile (Permission denied) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:194) at java.io.FileOutputStream.<init>(FileOutputStream.java:84) at org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226) at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251) at com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) However, I can access these logs directly from the tasktracker's logs (through the web UI). For the sake of running this locally, I could just
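For readability, the same profiling settings as one contiguous fragment, with the option meanings spelled out (old mapred API, Hadoop 0.20.x; MyDriver is just a placeholder for the driver class):

JobConf job = new JobConf(MyDriver.class);    // placeholder driver class
job.setProfileEnabled(true);                  // turn HPROF on for the selected task attempts
job.setProfileParams("-agentlib:hprof=cpu=samples,heap=sites,depth=6,"
    + "force=n,thread=y,verbose=n,file=%s");  // %s is replaced with the per-attempt profile file
job.setProfileTaskRange(true, "0-2");         // profile map task attempts 0-2
job.setProfileTaskRange(false, "0-2");        // profile reduce task attempts 0-2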
Re: Why is hadoop build I generated from a release branch different from release build?
Hi Pawan, The complete way releases are built (for v0.20/v1.0) is documented at http://wiki.apache.org/hadoop/HowToRelease#Building However, that does a bunch of stuff you don't need, like generate the documentation and do a ton of cross-checks. The full set of ant build targets is defined in build.xml in the top level of the source code tree. binary may be the target you want. --Matt On Thu, Mar 8, 2012 at 3:35 PM, Pawan Agarwal pawan.agar...@gmail.com wrote: Hi, I am trying to generate hadoop binaries from source and execute hadoop from the build I generate. I am able to build, however I am seeing that the *bin* folder which comes with the hadoop installation is not generated as part of my build. Can someone tell me how to do a build so that I can generate a build equivalent to the hadoop release build, which can be used directly to run hadoop? Here are the details. Desktop: Ubuntu Server 11.10 Hadoop version for installation: 0.20.203.0 (link: http://mirrors.gigenet.com/apache//hadoop/common/hadoop-0.20.203.0/) Hadoop branch used to build: branch-0.20-security-203 Build command used: ant maven-install Here are the directory structures from the build I generated vs. the official hadoop release build. *Hadoop directory which I generated:* pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls -1 ant c++ classes contrib examples hadoop-0.20-security-203-pawan hadoop-ant-0.20-security-203-pawan.jar hadoop-core-0.20-security-203-pawan.jar hadoop-examples-0.20-security-203-pawan.jar hadoop-test-0.20-security-203-pawan.jar hadoop-tools-0.20-security-203-pawan.jar ivy jsvc src test tools webapps *Official Hadoop build installation* pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls /hadoop -1 bin build.xml c++ CHANGES.txt conf contrib docs hadoop-ant-0.20.203.0.jar hadoop-core-0.20.203.0.jar hadoop-examples-0.20.203.0.jar hadoop-test-0.20.203.0.jar hadoop-tools-0.20.203.0.jar input ivy ivy.xml lib librecordio LICENSE.txt logs NOTICE.txt README.txt src webapps Any pointers are greatly appreciated. Also, if there are any other resources for understanding the hadoop build system, pointers to those would be helpful too. Thanks Pawan
Re: Convergence on File Format?
Avro support in Pig will be fairly mature in 0.10. Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com On Mar 8, 2012, at 3:10 PM, Serge Blazhievsky serge.blazhiyevs...@nice.com wrote: We started using Avro few month ago and results are great! Easy to use, reliable, feature rich, great integration with MapReduce On 3/8/12 3:07 PM, Michal Klos mk...@compete.com wrote: Hi, It seems that Avro is poised to become the file format, is that still the case? We've looked at Text, RCFile and Avro. Text is nice, but we'd really need to extend it. RCFile is great for Hive, but it has been a challenge using it outside of Hive. Avro has a great feature set, but is comparably (to RCFile) significantly slower and larger on disk in our testing, but if it has the highest rate of development, it may be the right choice. If you were choosing a File Format today to build a general purpose cluster (general purpose in the sense of using all the Hadoop tools, not just Hive), what would you choose? (one of the choices being development of a Custom format) Thanks, Mike
Re: Profiling Hadoop Job
Can you check which user you are running this process as and compare it with the ownership on the directory? On Thu, Mar 8, 2012 at 3:13 PM, Leonardo Urbina lurb...@mit.edu wrote: Does anyone have any idea how to solve this problem? Regardless of whether I'm using plain HPROF or profiling through Starfish, I am getting the same error: Exception in thread main java.io.FileNotFoundException: attempt_201203071311_0004_m_ 00_0.profile (Permission denied) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:194) at java.io.FileOutputStream.init(FileOutputStream.java:84) at org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226) at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251) at com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) But I can't find what permissions to change to fix this issue. Any ideas? Thanks in advance, Best, -Leo On Wed, Mar 7, 2012 at 3:52 PM, Leonardo Urbina lurb...@mit.edu wrote: Thanks, -Leo On Wed, Mar 7, 2012 at 3:47 PM, Jie Li ji...@cs.duke.edu wrote: Hi Leo, Thanks for pointing out the outdated README file. Glad to tell you that we do support the old API in the latest version. See here: http://www.cs.duke.edu/starfish/previous.html Welcome to join our mailing list and your questions will reach more of our group members. Jie On Wed, Mar 7, 2012 at 3:37 PM, Leonardo Urbina lurb...@mit.edu wrote: Hi Jie, According to the Starfish README, the hadoop programs must be written using the new Hadoop API. This is not my case (I am using MultipleInputs among other non-new API supported features). Is there any way around this? Thanks, -Leo On Wed, Mar 7, 2012 at 3:19 PM, Jie Li ji...@cs.duke.edu wrote: Hi Leonardo, You might want to try Starfish which supports the memory profiling as well as cpu/disk/network profiling for the performance tuning. Jie -- Starfish is an intelligent performance tuning tool for Hadoop. Homepage: www.cs.duke.edu/starfish/ Mailing list: http://groups.google.com/group/hadoop-starfish On Wed, Mar 7, 2012 at 2:36 PM, Leonardo Urbina lurb...@mit.edu wrote: Hello everyone, I have a Hadoop job that I run on several GBs of data that I am trying to optimize in order to reduce the memory consumption as well as improve the speed. 
I am following the steps outlined in Tom White's Hadoop: The Definitive Guide for profiling using HPROF (p161), by setting the following properties in the JobConf: job.setProfileEnabled(true); job.setProfileParams(-agentlib:hprof=cpu=samples,heap=sites,depth=6, + force=n,thread=y,verbose=n,file=%s); job.setProfileTaskRange(true, 0-2); job.setProfileTaskRange(false, 0-2); I am trying to run this locally on a single pseudo-distributed install of hadoop (0.20.2) and it gives the following error: Exception in thread main java.io.FileNotFoundException: attempt_201203071311_0004_m_00_0.profile (Permission denied) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:194) at java.io.FileOutputStream.init(FileOutputStream.java:84) at org.apache.hadoop.mapred.JobClient.downloadProfile(JobClient.java:1226) at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1302) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251) at com.BitSight.hadoopAggregator.AggregatorDriver.run(AggregatorDriver.java:89) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at com.BitSight.hadoopAggregator.AggregatorDriver.main(AggregatorDriver.java:94) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at
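One more thing worth checking (going from memory of JobClient.downloadProfile, so treat this as a guess): the client appears to write the attempt_*.profile files into the directory the job was launched from, so "Permission denied" can simply mean that directory is not writable by the user running the job, or that a stale .profile file owned by another user (e.g. from an earlier run as root) is already sitting there. Running whoami and ls -ld . in the launch directory, and looking for leftover attempt_*.profile files, would be my first check.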
RE: Why is hadoop build I generated from a release branch different from release build?
Hi Pawan, ant -p (not for 0.23+) will tell you the available build targets. Use mvn (maven) for 0.23 or newer -Original Message- From: Matt Foley [mailto:mfo...@hortonworks.com] Sent: Thursday, March 08, 2012 3:52 PM To: common-user@hadoop.apache.org Subject: Re: Why is hadoop build I generated from a release branch different from release build? Hi Pawan, The complete way releases are built (for v0.20/v1.0) is documented at http://wiki.apache.org/hadoop/HowToRelease#Building However, that does a bunch of stuff you don't need, like generate the documentation and do a ton of cross-checks. The full set of ant build targets are defined in build.xml in the top level of the source code tree. binary may be the target you want. --Matt On Thu, Mar 8, 2012 at 3:35 PM, Pawan Agarwal pawan.agar...@gmail.comwrote: Hi, I am trying to generate hadoop binaries from source and execute hadoop from the build I generate. I am able to build, however I am seeing that as part of build *bin* folder which comes with hadoop installation is not generated in my build. Can someone tell me how to do a build so that I can generate build equivalent to hadoop release build and which can be used directly to run hadoop. Here's the details. Desktop: Ubuntu Server 11.10 Hadoop version for installation: 0.20.203.0 (link: http://mirrors.gigenet.com/apache//hadoop/common/hadoop-0.20.203.0/) Hadoop Branch used build: branch-0.20-security-203 Build Command used: Ant maven-install Here's the directory structures from build I generated vs hadoop official release build. *Hadoop directory which I generated:* pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls -1 ant c++ classes contrib examples hadoop-0.20-security-203-pawan hadoop-ant-0.20-security-203-pawan.jar hadoop-core-0.20-security-203-pawan.jar hadoop-examples-0.20-security-203-pawan.jar hadoop-test-0.20-security-203-pawan.jar hadoop-tools-0.20-security-203-pawan.jar ivy jsvc src test tools webapps *Official Hadoop build installation* pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls /hadoop -1 bin build.xml c++ CHANGES.txt conf contrib docs hadoop-ant-0.20.203.0.jar hadoop-core-0.20.203.0.jar hadoop-examples-0.20.203.0.jar hadoop-test-0.20.203.0.jar hadoop-tools-0.20.203.0.jar input ivy ivy.xml lib librecordio LICENSE.txt logs NOTICE.txt README.txt src webapps Any pointers for help are greatly appreciated? Also, if there are any other resources for understanding hadoop build system, pointers to that would be also helpful. Thanks Pawan
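For anyone else who lands here: on the 0.20/1.x ant build, what has worked for me (targets can be confirmed with ant -p, as Leo says) is roughly:

$ ant -p        # list the available targets
$ ant binary    # build a release-style layout

If I remember right, the binary target puts a complete hadoop-* tree, including bin/ and conf/, under the build/ directory rather than leaving those folders next to the jars.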
does hadoop always respect setNumReduceTasks?
I am wondering if hadoop always respects Job.setNumReduceTasks(int)? As I am emitting items from the mapper, I expect/desire only 1 reducer to get these items, because I want to assign each key of the key-value input pair a unique integer id. If I had 1 reducer, I could just keep a local counter (with respect to the reducer instance) and increment it. On my local hadoop cluster, I noticed that most, if not all, my jobs have only 1 reducer, regardless of whether or not I set Job.setNumReduceTasks(int). However, as soon as I moved the code onto amazon's elastic mapreduce (emr), I noticed that there are multiple reducers. If I set the number of reduce tasks to 1, is this always guaranteed? I ask because I don't know if there is a gotcha like the combiner (where it may or may not run at all).

Also, it looks like it might not be a good idea to have just 1 reducer (it won't scale). It is most likely better if there is more than 1 reducer, but in that case, I lose the ability to assign unique numbers to the key-value pairs coming in. Is there a design pattern out there that addresses this issue?

My mapper/reducer key-value pair signatures look something like the following: mapper(Text, Text, Text, IntWritable) reducer(Text, IntWritable, IntWritable, Text). The mapper reads a sequence file whose key-value pairs are of type Text and Text. I then emit Text (let's say a word) and IntWritable (let's say the frequency of the word). The reducer gets the word and its frequencies, and then assigns the word an integer id. It emits IntWritable (the id) and Text (the word).

I remember seeing code in mahout's API where they assign integer ids to items. The items were already given an id of type long. The conversion they make is as follows: public static int idToIndex(long id) { return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32)); } Is there something equivalent for Text or a word? I was thinking about simply taking the hash value of the string/word, but of course, different strings can map to the same hash value.
Re: does hadoop always respect setNumReduceTasks?
Instead of String.hashCode() you can use the MD5 hashcode generator. This has not in the wild created a duplicate. (It has been hacked, but that's not relevant here.) http://snippets.dzone.com/posts/show/3686 I think the Partitioner class guarantees that you will have multiple reducers. On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne jane.wayne2...@gmail.com wrote: i am wondering if hadoop always respect Job.setNumReduceTasks(int)? as i am emitting items from the mapper, i expect/desire only 1 reducer to get these items because i want to assign each key of the key-value input pair a unique integer id. if i had 1 reducer, i can just keep a local counter (with respect to the reducer instance) and increment it. on my local hadoop cluster, i noticed that most, if not all, my jobs have only 1 reducer, regardless of whether or not i set Job.setNumReduceTasks(int). however, as soon as i moved the code unto amazon's elastic mapreduce (emr), i notice that there are multiple reducers. if i set the number of reduce tasks to 1, is this always guaranteed? i ask because i don't know if there is a gotcha like the combiner (where it may or may not run at all). also, it looks like this might not be a good idea just having 1 reducer (it won't scale). it is most likely better if there are +1 reducers, but in that case, i lose the ability to assign unique numbers to the key-value pairs coming in. is there a design pattern out there that addresses this issue? my mapper/reducer key-value pair signatures looks something like the following. mapper(Text, Text, Text, IntWritable) reducer(Text, IntWritable, IntWritable, Text) the mapper reads a sequence file whose key-value pairs are of type Text and Text. i then emit Text (let's say a word) and IntWritable (let's say frequency of the word). the reducer gets the word and its frequencies, and then assigns the word an integer id. it emits IntWritable (the id) and Text (the word). i remember seeing code from mahout's API where they assign integer ids to items. the items were already given an id of type long. the conversion they make is as follows. public static int idToIndex(long id) { return 0x7FFF ((int) id ^ (int) (id 32)); } is there something equivalent for Text or a word? i was thinking about simply taking the hash value of the string/word, but of course, different strings can map to the same hash value. -- Lance Norskog goks...@gmail.com
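A rough sketch of the MD5 idea (my own illustration, not code from Mahout or this thread): derive a stable, non-negative int for a word, in the spirit of the idToIndex conversion quoted above. Different words can still collide in 31 bits, just far less often than with String.hashCode().

import java.security.MessageDigest;

public final class WordIds {
    // Map a word to a stable 31-bit id using the first four bytes of its MD5 digest.
    public static int wordToIndex(String word) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5").digest(word.getBytes("UTF-8"));
        int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
              | ((d[2] & 0xFF) << 8)  |  (d[3] & 0xFF);
        return h & 0x7FFFFFFF;   // clear the sign bit, like the Mahout snippet does
    }
}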
Best way for setting up a large cluster
Hi all, I installed hadoop in a pilot cluster with 3 machines and am now going to build our actual cluster with 32 nodes. As you know, setting up hadoop separately on every node is time consuming and not a perfect way. What's the best way or tool to set up a hadoop cluster (except Cloudera)? Thanks, B.S
Re: Best way for setting up a large cluster
Something like Puppet is a good choice. There are example puppet manifests available for most Hadoop-related projects in Apache BigTop, for example: https://svn.apache.org/repos/asf/incubator/bigtop/branches/branch-0.2/bigtop-deploy/puppet/ -Joey On Thu, Mar 8, 2012 at 9:42 PM, Masoud mas...@agape.hanyang.ac.kr wrote: Hi all, I installed hadoop in a pilot cluster with 3 machines and am now going to build our actual cluster with 32 nodes. As you know, setting up hadoop separately on every node is time consuming and not a perfect way. What's the best way or tool to set up a hadoop cluster (except Cloudera)? Thanks, B.S -- Joseph Echeverria Cloudera, Inc. 443.305.9434
Getting different results every time I run the same job on the cluster
Hi, I have to admit, I am lost. My code (http://frd.org/) is stable on a pseudo-distributed cluster, but every time I run it on a 4-slave cluster, I get different results, ranging from 100 output lines to 4,000 output lines, whereas the real answer on my standalone is about 2,000. I look at the logs and see no exceptions, so I am totally lost. Where should I look? Thank you, Mark
Re: Why is hadoop build I generated from a release branch different from release build?
Thanks for all the replies. It turns out that build generated by ant has bin conf etc folders in one level above. And I looked at hadoop scripts and apparently it looks for right jars both in root directory and root/build/ directory as well. so I think I am covered for now. Thanks again! On Thu, Mar 8, 2012 at 4:15 PM, Leo Leung lle...@ddn.com wrote: Hi Pawan, ant -p (not for 0.23+) will tell you the available build targets. Use mvn (maven) for 0.23 or newer -Original Message- From: Matt Foley [mailto:mfo...@hortonworks.com] Sent: Thursday, March 08, 2012 3:52 PM To: common-user@hadoop.apache.org Subject: Re: Why is hadoop build I generated from a release branch different from release build? Hi Pawan, The complete way releases are built (for v0.20/v1.0) is documented at http://wiki.apache.org/hadoop/HowToRelease#Building However, that does a bunch of stuff you don't need, like generate the documentation and do a ton of cross-checks. The full set of ant build targets are defined in build.xml in the top level of the source code tree. binary may be the target you want. --Matt On Thu, Mar 8, 2012 at 3:35 PM, Pawan Agarwal pawan.agar...@gmail.com wrote: Hi, I am trying to generate hadoop binaries from source and execute hadoop from the build I generate. I am able to build, however I am seeing that as part of build *bin* folder which comes with hadoop installation is not generated in my build. Can someone tell me how to do a build so that I can generate build equivalent to hadoop release build and which can be used directly to run hadoop. Here's the details. Desktop: Ubuntu Server 11.10 Hadoop version for installation: 0.20.203.0 (link: http://mirrors.gigenet.com/apache//hadoop/common/hadoop-0.20.203.0/) Hadoop Branch used build: branch-0.20-security-203 Build Command used: Ant maven-install Here's the directory structures from build I generated vs hadoop official release build. *Hadoop directory which I generated:* pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls -1 ant c++ classes contrib examples hadoop-0.20-security-203-pawan hadoop-ant-0.20-security-203-pawan.jar hadoop-core-0.20-security-203-pawan.jar hadoop-examples-0.20-security-203-pawan.jar hadoop-test-0.20-security-203-pawan.jar hadoop-tools-0.20-security-203-pawan.jar ivy jsvc src test tools webapps *Official Hadoop build installation* pawan@ubuntu01:/hadoop0.20.203.0/hadoop-common/build$ ls /hadoop -1 bin build.xml c++ CHANGES.txt conf contrib docs hadoop-ant-0.20.203.0.jar hadoop-core-0.20.203.0.jar hadoop-examples-0.20.203.0.jar hadoop-test-0.20.203.0.jar hadoop-tools-0.20.203.0.jar input ivy ivy.xml lib librecordio LICENSE.txt logs NOTICE.txt README.txt src webapps Any pointers for help are greatly appreciated? Also, if there are any other resources for understanding hadoop build system, pointers to that would be also helpful. Thanks Pawan
Hadoop-Pig setup question
Hi Hadoop users, I am a new member, so please let me know if this is not the correct format to ask questions. I am trying to set up a small Hadoop cluster where I will run Pig queries. The Hadoop cluster is running fine, but when I run a Pig query it just hangs. Note - Pig runs fine in local mode. So I narrowed down the errors to the following - I have a secondary name node on a different machine (e.g. node 2). Point 1. When I execute start-mapred.sh on node 2, I get an ssh_exchange_identification closed by remote host message. BUT the secondary name node starts with no error messages in the log. I can even access it through port 50030. So far no errors. Point 2. When I try to run a map-reduce job, I get a java.net.ConnectException: Connection refused error in the secondary name node log files. Are point 1 and point 2 related? Any hints / pointers on how to solve this? Also, the ssh timeout is set to 20, so I am assuming that error is not because of this. Thanks for reading -- Warm Regards Atul
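One thing worth double-checking (a guess from the connection-refused symptom, not a diagnosis): that the configuration on node 2 points at the master by name rather than at localhost, and that the master's ports are reachable from node 2. In core-site.xml and mapred-site.xml on every node (the hostname and ports below are only examples, use whatever your cluster uses):

  fs.default.name = hdfs://master:9000      (core-site.xml)
  mapred.job.tracker = master:9001          (mapred-site.xml)

Also note that start-mapred.sh is normally run on the JobTracker machine and reaches the slaves over ssh, which may explain the ssh_exchange_identification message when it is run on node 2 instead.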
Hadoop node name problem
Hi All: I'm trying to use hadoop, zookeeper and hbase to build a NoSQL database, but after I got hadoop and zookeeper working well and went to install hbase, it reported an exception: BindException: Problem binding to /202.106.199.37:60020: Cannot assign requested address. My PC's IP/host is 192.168.1.91 slave1. Then I looked at http://192.168.1.90:50070/dfsnodelist.jsp?whatNodes=LIVE on the master and saw the live node list like this: Node web30 (http://web30.bbn.com.cn:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=%2F), bt-199-036 (http://bt-199-036.bta.net.cn:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=%2F), 202.106.199.37 (http://202.106.199.37:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=%2F). I want to know why the nodes' names look like this and how to solve it.
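In case it helps: the names in that list come from how the datanodes' addresses resolve, so entries like web30.bbn.com.cn, bt-199-036.bta.net.cn and 202.106.199.37 usually mean the machines are resolving hostnames through an outside DNS server instead of your LAN addresses; that would also be consistent with HBase trying to bind to 202.106.199.37:60020 on a box whose real address is 192.168.1.91. A common fix (an assumption on my part, not something confirmed in this thread) is to make every node's hostname resolve to its 192.168.1.x address on every machine, for example in /etc/hosts:

  192.168.1.90  master
  192.168.1.91  slave1

with one line per node, no mapping of the local hostname to a public or loopback-only address, and then restart hadoop/hbase so they re-resolve.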
Re: Java Heap space error
I'm curious whether you have been able to track down the cause of the error. We've seen similar problems with loading data, and I've discovered that if I presort my data before the load, things go a LOT smoother. When running queries against our data, sometimes we've seen the jobtracker just freeze. I've seen heap out-of-memory errors when I cranked up jobtracker logging to debug. Still working on figuring this one out. Should be an interesting ride :D On 03/06/2012 11:10 AM, Mohit Anchlia wrote: I am still trying to see how to narrow this down. Is it possible to set the HeapDumpOnOutOfMemoryError option on these individual tasks? On Mon, Mar 5, 2012 at 5:49 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
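For Mohit's question about heap dumps from individual tasks: not from this thread, just a sketch of the usual approach. The task child JVM options can carry the standard HotSpot dump flags (the heap size and dump path below are examples, and the dump directory has to exist on every tasktracker node, since the dump is written on whichever node ran the task):

// Old mapred API; set on the job's configuration before submitting.
JobConf conf = new JobConf(MyDriver.class);   // MyDriver is a placeholder driver class
conf.set("mapred.child.java.opts",
    "-Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdumps");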
state of HOD
(My apologies to those who have received this already. I posted this mail a few days back on the common-dev list, as this is more a development-related mail; but one of the original authors/maintainers suggested to also post it here.)

Hi all, I am a system administrator/user support person/... for the HPC team at Ghent University (Ghent, Flanders, Belgium). Recently we have been asked to look into support for hadoop. For the moment we are holding off on a dedicated cluster (esp. a dedicated hdfs setup), but as all our systems are torque/pbs based, we looked into HOD to help out our users.

We have started from the HOD code that was part of the hadoop 1.0.0 release (in the contrib part). At first it was not working, but we have been patching and cleaning up the code for a few weeks and now have a version that works for us (we had to add some features besides fixing a few things). It looks sufficient for now, although we will add some more features soon to get the users started.

My question is the following: what is the state of HOD atm? Is it still maintained/supported? Are there forks somewhere that have more up-to-date code? What we are now missing most is the documentation (eg http://hadoop.apache.org/common/docs/r0.16.4/hod.html) so we can update this with our extra features. Is the source available somewhere? I could contribute back all patches, but a few of them are indentation fixes (to use 4-space indentation throughout the code) and other cosmetic changes, so this messes up patches a lot. I have also shuffled a bit with the options (rename and/or move to other sections), so there is no 100% backwards compatibility with the current HOD code.

Current main improvements:
- works with python 2.5 and up (we have been testing with 2.7.2)
- set options through environment variables
- better default values (we can now run with an empty hodrc file)
- support for mail and nodes:ppn for pbs
- no deprecation warnings from hadoop (nearly finished)
- host-mask to bind xrs addr on non-default ip (in case you have non-standard network on the compute nodes)
- more debug statements
- gradual code cleanup (using pylint)

On the todo list:
- further tuning of hadoop parameters (I'm not a hadoop user myself, so this will take some time)
- 0.23.X support

Many thanks, stijn