Hadoop's datajoin
Hi, I am trying to use Hadoop's datajoin contrib for joining two relations. According to the README file of datajoin, the syntax is:

$HADOOP_HOME/bin/hadoop jar hadoop-datajoin-examples.jar org.apache.hadoop.contrib.utils.join.DataJoinJob datajoin/input datajoin/output Text 1 org.apache.hadoop.contrib.utils.join.SampleDataJoinMapper org.apache.hadoop.contrib.utils.join.SampleDataJoinReducer org.apache.hadoop.contrib.utils.join.SampleTaggedMapOutput Text

But I cannot find hadoop-datajoin-examples.jar anywhere in my $HADOOP_HOME. Can anyone tell me how to build it or where to find it? Thanks in advance.
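The DataJoinJob driver and the Sample* classes live in the datajoin contrib module, so the jar has to be built from (or shipped with) that module. A minimal sketch of where to look, assuming an 0.20-era layout; the directory and jar names vary by release and distribution, so treat these paths as guesses:

ls $HADOOP_HOME/contrib/datajoin/*.jar      # some distributions ship a prebuilt datajoin contrib jar
ls $HADOOP_HOME/src/contrib/data_join/      # otherwise the contrib sources usually live here and can be built with ant

Whichever jar you end up with can be substituted for hadoop-datajoin-examples.jar in the command above.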
Re: Terasort problem
Thank you for your response Owen. It is true, I hadn't done that; I figured it out a few hours after posting here. I'm having problems understanding these variables:

mapred.tasktracker.reduce.tasks.maximum - Is this configured on every datanode separately? What number should I put here?
mapred.tasktracker.map.tasks.maximum - same question as for mapred.tasktracker.reduce.tasks.maximum
mapred.reduce.tasks - Is this configured ONLY on the namenode, and what value should it have for my 8-node cluster?
mapred.map.tasks - same question as for mapred.reduce.tasks

I've tried playing with these variables but I am getting the error: Too many fetch-failures... If anyone has any idea how to set this up the right way, please let me know. Thank you.

On 9 July 2010 15:33, Owen O'Malley omal...@apache.org wrote: I would guess that you didn't set the number of reducers for the job, and it defaulted to 2. -- Owen
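If the per-job values are the issue, one way to sanity-check them is to pass them directly on the terasort command line instead of editing the config files. A minimal sketch, assuming an 0.20-era examples jar; the jar name, the reducer count, and the HDFS paths are placeholders, not recommendations:

hadoop jar $HADOOP_HOME/hadoop-*-examples.jar terasort \
  -Dmapred.reduce.tasks=16 \
  /terasort/input /terasort/output

The mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum settings are per-tasktracker slot limits read from each node's own config, while mapred.map.tasks and mapred.reduce.tasks are per-job settings, which is why they can be supplied at submit time like this.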
java.lang.OutOfMemoryError: Java heap space
Hi All, I am facing a hard problem. I am running a map reduce job using streaming, but it fails with the following error:

Caught: java.lang.OutOfMemoryError: Java heap space
        at Nodemapper5.parseXML(Nodemapper5.groovy:25)
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
        at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362)
        at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572)
        at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
        at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

I have increased the heap size in hadoop-env.sh and made it 2000M. I also passed it to the job manually with the following option:

-D mapred.child.java.opts=-Xmx2000M \

but it still gives the error. The same job runs fine if I run it on the shell with a 1024M heap, like:

cat file.xml | /root/Nodemapper5.groovy

Any clue? Thanks in advance. -- Regards Shuja-ur-Rehman Baig _ MS CS - School of Science and Engineering Lahore University of Management Sciences (LUMS) Sector U, DHA, Lahore, 54792, Pakistan Cell: +92 3214207445
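Judging from the top frame (Nodemapper5.parseXML at Nodemapper5.groovy:25), the OutOfMemoryError is thrown inside the Groovy script, which streaming runs as a separate subprocess, so mapred.child.java.opts only sizes the streaming task's own JVM, not the script's. A minimal sketch of one way to give the script itself a bigger heap, assuming the groovy launcher on the task nodes honors the JAVA_OPTS environment variable (the wrapper name is hypothetical; it would be shipped with -file and used as the -mapper in place of the script):

#!/bin/sh
# run_mapper.sh -- hypothetical wrapper around the Groovy mapper
export JAVA_OPTS=-Xmx2000m      # heap for the Groovy JVM, separate from mapred.child.java.opts
exec ./Nodemapper5.groovy "$@"  # assumes the script is also shipped with -file into the task's working directory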
Re: java.lang.OutOfMemoryError: Java heap space
Hi Shuja, It looks like the OOM is happening in your code. Are you running MapReduce in a cluster? If so, can you send the exact command line your code is invoked with -- you can get it with a 'ps -Af | grep Nodemapper5.groovy' command on one of the nodes which is running the task? Thanks, Alex K On Sat, Jul 10, 2010 at 10:40 AM, Shuja Rehman shujamug...@gmail.com wrote: Hi All I am facing a hard problem. I am running a map reduce job using streaming but it fails and it gives the following error. Caught: java.lang.OutOfMemoryError: Java heap space at Nodemapper5.parseXML(Nodemapper5.groovy:25) java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572) at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) I have increased the heap size in hadoop-env.sh and make it 2000M. Also I tell the job manually by following line. -D mapred.child.java.opts=-Xmx2000M \ but it still gives the error. The same job runs fine if i run on shell using 1024M heap size like cat file.xml | /root/Nodemapper5.groovy Any clue? Thanks in advance. -- Regards Shuja-ur-Rehman Baig _ MS CS - School of Science and Engineering Lahore University of Management Sciences (LUMS) Sector U, DHA, Lahore, 54792, Pakistan Cell: +92 3214207445
Re: java.lang.OutOfMemoryError: Java heap space
Hi Alex, Yeah, I am running the job on a cluster of 2 machines and using the Cloudera distribution of Hadoop. Here is the output of that command:

root 5277 5238 3 12:51 pts/2 00:00:00 /usr/jdk1.6.0_03/bin/java -Xmx1023m -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop-0.20 -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -classpath /usr/lib/hadoop-0.20/conf:/usr/jdk1.6.0_03/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2+320.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.3.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2+320.jar:/usr/lib/hadoop-0.20/lib/hadoop-scribe-log4j-0.20.2+320.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hsqldb.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.0.1.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.0.1.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.14.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.14.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/libfb303.jar:/usr/lib/hadoop-0.20/lib/libthrift.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar org.apache.hadoop.util.RunJar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2+320.jar -D mapred.child.java.opts=-Xmx2000M -inputformat StreamInputFormat -inputreader StreamXmlRecordReader,begin=mdc xmlns:HTML=http://www.w3.org/TR/REC-xml;,end=/mdc -input /user/root/RNCDATA/MDFDORKUCRAR02/A20100531.-0700-0015-0700_RNCCN-MDFDORKUCRAR02 -jobconf mapred.map.tasks=1 -jobconf mapred.reduce.tasks=0 -output RNC11 -mapper /home/ftpuser1/Nodemapper5.groovy -reducer org.apache.hadoop.mapred.lib.IdentityReducer -file /home/ftpuser1/Nodemapper5.groovy
root 5360 5074 0 12:51 pts/1 00:00:00 grep Nodemapper5.groovy

Also, what is meant by OOM? Thanks for helping. Best Regards

On Sun, Jul 11, 2010 at 12:30 AM, Alex Kozlov ale...@cloudera.com wrote: Hi Shuja, It looks like the OOM is happening in your code. Are you running MapReduce in a cluster? If so, can you send the exact command line your code is invoked with -- you can get it with a 'ps -Af | grep Nodemapper5.groovy' command on one of the nodes which is running the task? Thanks, Alex K On Sat, Jul 10, 2010 at 10:40 AM, Shuja Rehman shujamug...@gmail.com wrote: Hi All I am facing a hard problem. I am running a map reduce job using streaming but it fails and it gives the following error.
Caught: java.lang.OutOfMemoryError: Java heap space at Nodemapper5.parseXML(Nodemapper5.groovy:25) java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572) at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170) I have increased the heap size in hadoop-env.sh and make it 2000M. Also I tell the job manually by following line. -D mapred.child.java.opts=-Xmx2000M \ but it still gives the error. The same job runs fine if i run on shell using 1024M heap size like cat file.xml | /root/Nodemapper5.groovy Any clue? Thanks in advance. -- Regards Shuja-ur-Rehman Baig _ MS CS - School of Science and Engineering Lahore University of Management Sciences (LUMS) Sector U, DHA, Lahore, 54792,
Re: reading distributed cache returns null pointer
Hi, Thanks. OK, Path[] ps = DistributedCache.getLocalCacheFiles(cnf); retrieves the correct path for me in pseudo-distributed mode. But when I run my program in fully-distributed mode with 5 nodes, I get a null pointer. Theoretically, if it worked in pseudo-distributed mode, it should work in fully-distributed mode as well. What could be the reasons for this behavior? Cheers

From: Hemanth Yamijala yhema...@gmail.com To: common-user@hadoop.apache.org Sent: Fri, July 9, 2010 10:21:19 AM Subject: Re: reading distributed cache returns null pointer

Hi, Thanks for the information. I got your point. What I specifically want to ask is: if I use the following method to read my file in each mapper:

FileSystem hdfs = FileSystem.get(conf);
URI[] uris = DistributedCache.getCacheFiles(conf);
Path my_path = new Path(uris[0].getPath());
if (hdfs.exists(my_path)) {
    FSDataInputStream fs = hdfs.open(my_path);
    while ((str = fs.readLine()) != null)
        System.out.println(str);
}

would this method retrieve the file from HDFS, since I am using the Hadoop API and not the local file API?

It would be instructive to look at the test code in src/test/mapred/org/apache/hadoop/mapred/TestMRWithDistributedCache.java. This gives a fair idea of how to access the files of DistributedCache from within the mapper. Specifically, see how the LocalFileSystem is used to access the files. You could look at the same class in the branch-20 source code if you are using an older version of Hadoop.

I may be understanding something horribly wrong. The situation is that now my_path contains DCache/Orders.txt, and if I am reading from here, this is the path of the file on HDFS as well. How does it know to pick the file from the local file system, and not HDFS? Thanks again

From: Rahul Jain rja...@gmail.com To: common-user@hadoop.apache.org Sent: Fri, July 9, 2010 12:19:44 AM Subject: Re: reading distributed cache returns null pointer

Yes, distributed cache writes files to the local file system for each mapper / reducer. So you should be able to access the file(s) using local file system APIs. If the files were staying in HDFS there would be no point to using distributed cache since all mappers already have access to the global HDFS directories :). -Rahul

On Thu, Jul 8, 2010 at 3:03 PM, abc xyz fabc_xyz...@yahoo.com wrote: Hi Rahul, Thanks. It worked. I was using getFileClassPaths() to get the paths to the files in the cache and then using this path to access the file. It should have worked, but I don't know why it doesn't produce the required result. I added the HDFS file DCache/Orders.txt to my distributed cache. After calling DistributedCache.getCacheFiles(conf); in the configure method of the mapper node, if I read the file now from the returned path (which happens to be DCache/Orders.txt) using the Hadoop API, would the file be read from the local directory of the mapper node? More specifically I am doing this:

FileSystem hdfs = FileSystem.get(conf);
URI[] uris = DistributedCache.getCacheFiles(conf);
Path my_path = new Path(uris[0].getPath());
if (hdfs.exists(my_path)) {
    FSDataInputStream fs = hdfs.open(my_path);
    while ((str = fs.readLine()) != null)
        System.out.println(str);
}

Thanks

From: Rahul Jain rja...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, July 8, 2010 8:15:58 PM Subject: Re: reading distributed cache returns null pointer

I am not sure why you are using getFileClassPaths() API to access files...
here is what works for us:

Add the file(s) to the distributed cache using:

DistributedCache.addCacheFile(p.toUri(), conf);

Read the files on the mapper using:

URI[] uris = DistributedCache.getCacheFiles(conf);
// access one of the files:
paths[0] = new Path(uris[0].getPath());
// now follow hadoop or local file APIs to access the file...

Did you try the above and did it not work? -Rahul

On Thu, Jul 8, 2010 at 12:04 PM, abc xyz fabc_xyz...@yahoo.com wrote: Hello all, As a new user of hadoop, I am having some problems with understanding some things. I am writing a program to load a file into the distributed cache and read this file in each mapper. In my driver program, I have added the file to my distributed cache using:

Path p = new Path("hdfs://localhost:9100/user/denimLive/denim/DCache/Orders.txt");
DistributedCache.addCacheFile(p.toUri(), conf);

In the configure method of the mapper, I am reading the file from the cache using:

Path[] cacheFiles = DistributedCache.getFileClassPaths(conf);
BufferedReader
Re: java.lang.OutOfMemoryError: Java heap space
Hi Shuja, First, thank you for using CDH3. Can you also check what mapred.child.ulimit you are using? Try adding -D mapred.child.ulimit=3145728 to the command line, as sketched below. I would also recommend upgrading Java to JDK 1.6 update 8 at a minimum, which you can download from the Java SE homepage: http://java.sun.com/javase/downloads/index.jsp . Let me know how it goes. Alex K

On Sat, Jul 10, 2010 at 12:59 PM, Shuja Rehman shujamug...@gmail.com wrote: Hi Alex, Yeah, I am running the job on a cluster of 2 machines and using the Cloudera distribution of Hadoop. Here is the output of that command:

root 5277 5238 3 12:51 pts/2 00:00:00 /usr/jdk1.6.0_03/bin/java -Xmx1023m -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop-0.20 -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -classpath /usr/lib/hadoop-0.20/conf:/usr/jdk1.6.0_03/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2+320.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.3.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2+320.jar:/usr/lib/hadoop-0.20/lib/hadoop-scribe-log4j-0.20.2+320.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hsqldb.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.0.1.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.0.1.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.14.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.14.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/libfb303.jar:/usr/lib/hadoop-0.20/lib/libthrift.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/mysql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar org.apache.hadoop.util.RunJar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2+320.jar -D mapred.child.java.opts=-Xmx2000M -inputformat StreamInputFormat -inputreader StreamXmlRecordReader,begin=mdc xmlns:HTML=http://www.w3.org/TR/REC-xml;,end=/mdc -input /user/root/RNCDATA/MDFDORKUCRAR02/A20100531.-0700-0015-0700_RNCCN-MDFDORKUCRAR02 -jobconf mapred.map.tasks=1 -jobconf mapred.reduce.tasks=0 -output RNC11 -mapper /home/ftpuser1/Nodemapper5.groovy -reducer org.apache.hadoop.mapred.lib.IdentityReducer -file /home/ftpuser1/Nodemapper5.groovy
root 5360 5074 0 12:51 pts/1 00:00:00 grep Nodemapper5.groovy

Also, what is meant by OOM? Thanks for helping. Best Regards

On Sun, Jul 11, 2010 at 12:30 AM, Alex Kozlov ale...@cloudera.com wrote: Hi Shuja, It looks like the OOM is happening in your code. Are you running MapReduce in a cluster?
If so, can you send the exact command line your code is invoked with -- you can get it with a 'ps -Af | grep Nodemapper5.groovy' command on one of the nodes which is running the task? Thanks, Alex K On Sat, Jul 10, 2010 at 10:40 AM, Shuja Rehman shujamug...@gmail.com wrote: Hi All I am facing a hard problem. I am running a map reduce job using streaming but it fails and it gives the following error. Caught: java.lang.OutOfMemoryError: Java heap space at Nodemapper5.parseXML(Nodemapper5.groovy:25) java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1 at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:362) at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:572) at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57) at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.Child.main(Child.java:170)
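A minimal sketch of where the flag would go on the streaming command line from the ps output above (the paths and jar version are taken from that output; with streaming, the generic -D options have to come before the streaming-specific options, and mapred.child.ulimit is in KB, so 3145728 corresponds to 3 GB):

hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2+320.jar \
  -D mapred.child.java.opts=-Xmx2000M \
  -D mapred.child.ulimit=3145728 \
  -inputformat StreamInputFormat \
  ... (remaining -inputreader/-input/-output/-mapper/-reducer/-file options unchanged)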
Re: Next Release of Hadoop version number and Kerberos
On Wed, Jul 7, 2010 at 8:54 AM, Todd Lipcon t...@cloudera.com wrote: On Wed, Jul 7, 2010 at 8:29 AM, Ananth Sarathy ananth.t.sara...@gmail.com wrote:

The Security/Kerberos support is a huge project that has been in progress for several months, so the implementation spans tens (if not hundreds?) of patches. Manually adding these patches to a prior Apache release will take days if not weeks of work, is my guess.

Based on a quick check from Yahoo's github (http://github.com/yahoo/hadoop-common), between yahoo 0.20.10 and yahoo 0.20.104.2:
- 421 commits
- a combined diff of 8.75 MB
- 12 person-years worth of work
- consists almost exclusively of security work

For a single person who doesn't know the code, it will take months to apply it to one of the Apache branches. -- Owen