Re: Create and write files on mounted HDFS via java api
Are you using FUSE for mounting HDFS ? On Fri, Apr 19, 2013 at 4:30 PM, lijinlong wakingdrea...@163.com wrote: I mounted HDFS to a local directory, /mnt/hdfs, for storage. I can do basic file operations such as create, remove, copy etc. just using Linux commands and the GUI. But when I tried to do the same thing in the mounted directory via the Java API (not the Hadoop API), I got exceptions. The details can be seen here: http://stackoverflow.com/questions/16083497/java-exception-when-creating-or-writing-files-on-mounted-hdfs . Now my question is whether I can do file operations on the mounted HDFS via the Java API as described in the URL. If not, what would be the proper way to accomplish that?
Re: Mapreduce
As this is an HBase-specific question, it would be better to ask it on the HBase user mailing list. Thanks Hemanth On Fri, Apr 19, 2013 at 10:46 PM, Adrian Acosta Mitjans amitj...@estudiantes.uci.cu wrote: Hello: I'm working on a project and I'm using HBase to store the data. I have this method that works well, but not with the performance I'm looking for, so what I want is to do the same thing using MapReduce.

public ArrayList<MyObject> findZ(String z) throws IOException {
    ArrayList<MyObject> rows = new ArrayList<MyObject>();
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test");
    Scan s = new Scan();
    s.addColumn(Bytes.toBytes("x"), Bytes.toBytes("y"));
    ResultScanner scanner = table.getScanner(s);
    try {
        for (Result rr : scanner) {
            if (Bytes.toString(rr.getValue(Bytes.toBytes("x"), Bytes.toBytes("y"))).equals(z)) {
                rows.add(getInformation(Bytes.toString(rr.getRow())));
            }
        }
    } finally {
        scanner.close();
    }
    return rows;
}

The getInformation method takes all the columns and converts the row into a MyObject instance. I just want an example or a link to a tutorial that does something like this: I want to get a result type as the answer, not a number counting words, like many examples I found. My native language is Spanish, so sorry if something is not written well. Thanks. http://www.uci.cu/
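As a starting point, here is a minimal sketch of how the scan-and-filter above could be expressed as a map-only HBase job. This is an illustration only: it assumes the same "test" table and "x"/"y" family/qualifier from the code above and 0.94-era HBase mapreduce APIs; the findz.value configuration key is invented for the example, and building MyObject from the matching rows is left out.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FindZJob {

    public static class ZFilterMapper extends TableMapper<Text, NullWritable> {
        private String z;

        @Override
        protected void setup(Context context) {
            z = context.getConfiguration().get("findz.value"); // hypothetical config key
        }

        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            byte[] cell = value.getValue(Bytes.toBytes("x"), Bytes.toBytes("y"));
            if (cell != null && Bytes.toString(cell).equals(z)) {
                // Emit matching row keys; assembling MyObject would happen here
                // or in a follow-up step, as getInformation does in the code above.
                context.write(new Text(Bytes.toString(row.get())), NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("findz.value", args[0]);
        Job job = new Job(conf, "findZ");
        job.setJarByClass(FindZJob.class);
        Scan s = new Scan();
        s.addColumn(Bytes.toBytes("x"), Bytes.toBytes("y"));
        TableMapReduceUtil.initTableMapperJob("test", s, ZFilterMapper.class,
                Text.class, NullWritable.class, job);
        job.setNumReduceTasks(0); // map-only: the filter needs no reduce phase
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}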
Re: Errors about MRunit
Hi, If your goal is to use the new API, I am able to get it to work with the following Maven configuration:

<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>0.9.0-incubating</version>
    <classifier>hadoop1</classifier>
</dependency>

If I switch to classifier hadoop2, I get the same errors as what you are facing. Thanks Hemanth On Sat, Apr 20, 2013 at 3:42 PM, 姚吉龙 geelong...@gmail.com wrote: Hi Everyone I am testing my MR program with MRUnit; its version is mrunit-0.9.0-incubating-hadoop2. My Hadoop version is 1.0.4 The error trace is below:

java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskInputOutputContext, but interface was expected
    at org.apache.hadoop.mrunit.mapreduce.mock.MockContextWrapper.createCommon(MockContextWrapper.java:53)
    at org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper.create(MockMapContextWrapper.java:70)
    at org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper.init(MockMapContextWrapper.java:62)
    at org.apache.hadoop.mrunit.mapreduce.MapDriver.run(MapDriver.java:217)
    at org.apache.hadoop.mrunit.MapDriverBase.runTest(MapDriverBase.java:150)
    at org.apache.hadoop.mrunit.TestDriver.runTest(TestDriver.java:137)
    at UnitTest.testMapper(UnitTest.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:168)
    at junit.framework.TestCase.runBare(TestCase.java:134)
    at junit.framework.TestResult$1.protect(TestResult.java:110)
    at junit.framework.TestResult.runProtected(TestResult.java:128)
    at junit.framework.TestResult.run(TestResult.java:113)
    at junit.framework.TestCase.run(TestCase.java:124)
    at junit.framework.TestSuite.runTest(TestSuite.java:232)
    at junit.framework.TestSuite.run(TestSuite.java:227)
    at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:79)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Anyone has an idea? BRs Geelong -- From Good To Great
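For context, a minimal sketch of the kind of MRUnit test (new org.apache.hadoop.mapreduce API) that triggers the error above when the hadoop2 classifier jar is used against Hadoop 1.0.4. WordCountMapper is a made-up mapper for illustration:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class UnitTest {

    // Hypothetical mapper under test: emits (word, 1) for each token.
    static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String w : value.toString().split("\\s+")) {
                context.write(new Text(w), new IntWritable(1));
            }
        }
    }

    @Test
    public void testMapper() throws Exception {
        // MapDriver from org.apache.hadoop.mrunit.mapreduce targets the new API;
        // this is the call chain in the stack trace above.
        MapDriver<LongWritable, Text, Text, IntWritable> mapDriver =
                MapDriver.newMapDriver(new WordCountMapper());
        mapDriver.withInput(new LongWritable(0), new Text("hello"))
                 .withOutput(new Text("hello"), new IntWritable(1))
                 .runTest();
    }
}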
Re: Create and write files on mounted HDFS via java api
Sorry - no. I just wanted to know if you were using FUSE, because I knew of no other way of mounting HDFS. Basically I was wondering if some libraries needed to be on the system path for the Java programs to work. From your response it looks like you aren't using FUSE. So what are you using to mount ? Hemanth On Sat, Apr 20, 2013 at 4:24 PM, 金龙 李 wakingdrea...@live.com wrote: Yes, I tried both FUSE and NTFS, but both failed. Have you done this before? And do you know why? -- Date: Sat, 20 Apr 2013 15:48:36 +0530 Subject: Re: Create and write files on mounted HDFS via java api From: yhema...@thoughtworks.com To: user@hadoop.apache.org Are you using FUSE for mounting HDFS ? On Fri, Apr 19, 2013 at 4:30 PM, lijinlong wakingdrea...@163.com wrote: I mounted HDFS to a local directory, /mnt/hdfs, for storage. I can do basic file operations such as create, remove, copy etc. just using Linux commands and the GUI. But when I tried to do the same thing in the mounted directory via the Java API (not the Hadoop API), I got exceptions. The details can be seen here: http://stackoverflow.com/questions/16083497/java-exception-when-creating-or-writing-files-on-mounted-hdfs . Now my question is whether I can do file operations on the mounted HDFS via the Java API as described in the URL. If not, what would be the proper way to accomplish that?
Re: Errors about MRunit
+ user@ Please do continue the conversation on the mailing list, in case others like you can benefit from / contribute to the discussion. Thanks Hemanth On Sat, Apr 20, 2013 at 5:32 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, My code is working with mrunit-0.9.0-incubating-hadoop1.jar as a dependency. So, can you pull this from the mrunit download tarball, add it to the dependencies in Eclipse and try? Of course, remove any other MRUnit jar you have already added. Thanks Hemanth On Sat, Apr 20, 2013 at 5:02 PM, 姚吉龙 geelong...@gmail.com wrote: Sorry, I have not used the Maven things. Could you tell me how to set this up with Eclipse? BRs geelong 2013/4/20 Hemanth Yamijala yhema...@thoughtworks.com Hi, If your goal is to use the new API, I am able to get it to work with the following Maven configuration:

<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>0.9.0-incubating</version>
    <classifier>hadoop1</classifier>
</dependency>

If I switch to classifier hadoop2, I get the same errors as what you are facing. Thanks Hemanth On Sat, Apr 20, 2013 at 3:42 PM, 姚吉龙 geelong...@gmail.com wrote: Hi Everyone I am testing my MR program with MRUnit; its version is mrunit-0.9.0-incubating-hadoop2. My Hadoop version is 1.0.4 The error trace is below:

java.lang.IncompatibleClassChangeError: Found class org.apache.hadoop.mapreduce.TaskInputOutputContext, but interface was expected
    at org.apache.hadoop.mrunit.mapreduce.mock.MockContextWrapper.createCommon(MockContextWrapper.java:53)
    at org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper.create(MockMapContextWrapper.java:70)
    at org.apache.hadoop.mrunit.mapreduce.mock.MockMapContextWrapper.init(MockMapContextWrapper.java:62)
    at org.apache.hadoop.mrunit.mapreduce.MapDriver.run(MapDriver.java:217)
    at org.apache.hadoop.mrunit.MapDriverBase.runTest(MapDriverBase.java:150)
    at org.apache.hadoop.mrunit.TestDriver.runTest(TestDriver.java:137)
    at UnitTest.testMapper(UnitTest.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:168)
    at junit.framework.TestCase.runBare(TestCase.java:134)
    at junit.framework.TestResult$1.protect(TestResult.java:110)
    at junit.framework.TestResult.runProtected(TestResult.java:128)
    at junit.framework.TestResult.run(TestResult.java:113)
    at junit.framework.TestCase.run(TestCase.java:124)
    at junit.framework.TestSuite.runTest(TestSuite.java:232)
    at junit.framework.TestSuite.run(TestSuite.java:227)
    at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:79)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Anyone has an idea? BRs Geelong -- From Good To Great -- From Good To Great
Re: Which version of Hadoop
2.x.x provides NN high availability. http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html However, it is in the alpha stage right now. Thanks hemanth On Sat, Apr 20, 2013 at 5:30 PM, Ascot Moss ascot.m...@gmail.com wrote: Hi, I am new to Hadoop; from the Hadoop downloads I can find 4 versions: 1.0.x / 1.1.x / 2.x.x / 0.23.x May I know which one is the latest stable version that provides Namenode high availability for a production environment? regards
Re: How to configure mapreduce archive size?
Well, since the DistributedCache is used by the tasktracker, you need to update the log4j configuration file used by the tasktracker daemon. And you need to get the tasktracker log file - from the machine where you see the distributed cache problem. On Fri, Apr 19, 2013 at 6:27 AM, xia_y...@dell.com wrote: Hi Hemanth, I tried http://machine:50030. It did not work for me. In the hbase_home/conf folder, I updated the log4j configuration properties and got the attached log. Do you find what is happening for the map reduce job? Thanks, Jane From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Wednesday, April 17, 2013 9:11 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? The check for cache file cleanup is controlled by the property mapreduce.tasktracker.distributedcache.checkperiod. It defaults to 1 minute (which should be sufficient for your requirement). I am not sure why the JobTracker UI is inaccessible. If you know where JT is running, try hitting http://machine:50030. If that doesn't work, maybe check if ports have been changed in mapred-site.xml for a property similar to mapred.job.tracker.http.address. There is logging in the code of the tasktracker component that can help debug the distributed cache behaviour. In order to get those logs you need to enable debug logging in the log4j configuration properties and restart the daemons. Hopefully that will help you get some hints on what is happening. Thanks hemanth On Wed, Apr 17, 2013 at 11:49 PM, xia_y...@dell.com wrote: Hi Hemanth and Bejoy KS, I have tried both mapred-site.xml and core-site.xml. They do not work. I set the value to 50K just for testing purposes; however, the folder size already goes to 900M now. As in your email, "After they are done, the property will help cleanup the files due to the limit set." How frequently will the cleanup task be triggered? Regarding the job.xml, I cannot use the JT web UI to find it. It seems that when hadoop is packaged within HBase, this is disabled. I am only using HBase jobs. I was suggested by HBase people to get help from the Hadoop mailing list. I will contact them again. Thanks, Jane From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Tuesday, April 16, 2013 9:35 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? You can limit the size by setting local.cache.size in the mapred-site.xml (or core-site.xml if that works for you). I mistakenly mentioned mapred-default.xml in my last mail - apologies for that. However, please note that this does not prevent whatever is writing into the distributed cache from creating those files when they are required. After they are done, the property will help cleanup the files due to the limit set. That's why I am more keen on finding what is using the files in the Distributed cache. It may be useful if you can ask on the HBase list as well if the APIs you are using are creating the files you mention (assuming you are only running HBase jobs on the cluster and nothing else) Thanks Hemanth On Tue, Apr 16, 2013 at 11:15 PM, xia_y...@dell.com wrote: Hi Hemanth, I did not explicitly use DistributedCache in my code. I did not use any command line arguments like -libjars either. Where can I find job.xml? I am using the HBase MapReduce API and not setting any job.xml. The key point is I want to limit the size of /tmp/hadoop-root/mapred/local/archive. Could you help? Thanks.
Xia From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Thursday, April 11, 2013 9:09 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? TableMapReduceUtil has APIs like addDependencyJars which will use DistributedCache. I don't think you are explicitly using that. Are you using any command line arguments like -libjars etc when you are launching the MapReduce job ? Alternatively you can check job.xml of the launched MR job to see if it has set properties having prefixes like mapred.cache. If nothing's set there, it would seem like some other process or user is adding jars to DistributedCache when using the cluster. Thanks hemanth On Thu, Apr 11, 2013 at 11:40 PM, xia_y...@dell.com wrote: Hi Hemanth, Attached is some sample folders within my /tmp/hadoop-root/mapred/local/archive. There are some jar and class files inside. My application uses a MapReduce job to purge old HBase data. I am using the basic HBase MapReduce API to delete rows from HBase
Re: Run multiple HDFS instances
Are you trying to implement something like namespace federation, which is part of Hadoop 2.0 - http://hadoop.apache.org/docs/r2.0.3-alpha/hadoop-project-dist/hadoop-hdfs/Federation.html On Thu, Apr 18, 2013 at 10:02 PM, Lixiang Ao aolixi...@gmail.com wrote: Actually I'm trying to do something like combining multiple namenodes so that they present themselves to clients as a single namespace, implementing basic namenode functionalities. On Thursday, April 18, 2013, Chris Embree wrote: Glad you got this working... can you explain your use case a little? I'm trying to understand why you might want to do that. On Thu, Apr 18, 2013 at 11:29 AM, Lixiang Ao aolixi...@gmail.com wrote: I modified sbin/hadoop-daemon.sh, where HADOOP_PID_DIR is set. It works! Everything looks fine now. Seems the direct command hdfs namenode gives a better sense of control :) Thanks a lot. On Thursday, April 18, 2013, Harsh J wrote: Yes you can, but if you want the scripts to work, you should have them use a different PID directory (I think it's called HADOOP_PID_DIR) every time you invoke them. I instead prefer to start the daemons up via their direct commands such as hdfs namenode and so on, and move them to the background, with a redirect for logging. On Thu, Apr 18, 2013 at 2:34 PM, Lixiang Ao aolixi...@gmail.com wrote: Hi all, Can I run multiple HDFS instances, that is, n separate namenodes and n datanodes, on a single machine? I've modified core-site.xml and hdfs-site.xml to avoid port and file conflicts between the HDFS instances, but when I started the second HDFS, I got the errors: Starting namenodes on [localhost] localhost: namenode running as process 20544. Stop it first. localhost: datanode running as process 20786. Stop it first. Starting secondary namenodes [0.0.0.0] 0.0.0.0: secondarynamenode running as process 21074. Stop it first. Is there a way to solve this? Thank you in advance, Lixiang Ao -- Harsh J
Re: How to configure mapreduce archive size?
The check for cache file cleanup is controlled by the property mapreduce.tasktracker.distributedcache.checkperiod. It defaults to 1 minute (which should be sufficient for your requirement). I am not sure why the JobTracker UI is inaccessible. If you know where JT is running, try hitting http://machine:50030. If that doesn't work, maybe check if ports have been changed in mapred-site.xml for a property similar to mapred.job.tracker.http.address. There is logging in the code of the tasktracker component that can help debug the distributed cache behaviour. In order to get those logs you need to enable debug logging in the log4j configuration properties and restart the daemons. Hopefully that will help you get some hints on what is happening. Thanks hemanth On Wed, Apr 17, 2013 at 11:49 PM, xia_y...@dell.com wrote: Hi Hemanth and Bejoy KS, I have tried both mapred-site.xml and core-site.xml. They do not work. I set the value to 50K just for testing purposes; however, the folder size already goes to 900M now. As in your email, "After they are done, the property will help cleanup the files due to the limit set." How frequently will the cleanup task be triggered? Regarding the job.xml, I cannot use the JT web UI to find it. It seems that when hadoop is packaged within HBase, this is disabled. I am only using HBase jobs. I was suggested by HBase people to get help from the Hadoop mailing list. I will contact them again. Thanks, Jane From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Tuesday, April 16, 2013 9:35 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? You can limit the size by setting local.cache.size in the mapred-site.xml (or core-site.xml if that works for you). I mistakenly mentioned mapred-default.xml in my last mail - apologies for that. However, please note that this does not prevent whatever is writing into the distributed cache from creating those files when they are required. After they are done, the property will help cleanup the files due to the limit set. That's why I am more keen on finding what is using the files in the Distributed cache. It may be useful if you can ask on the HBase list as well if the APIs you are using are creating the files you mention (assuming you are only running HBase jobs on the cluster and nothing else) Thanks Hemanth On Tue, Apr 16, 2013 at 11:15 PM, xia_y...@dell.com wrote: Hi Hemanth, I did not explicitly use DistributedCache in my code. I did not use any command line arguments like -libjars either. Where can I find job.xml? I am using the HBase MapReduce API and not setting any job.xml. The key point is I want to limit the size of /tmp/hadoop-root/mapred/local/archive. Could you help? Thanks. Xia From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Thursday, April 11, 2013 9:09 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? TableMapReduceUtil has APIs like addDependencyJars which will use DistributedCache. I don't think you are explicitly using that. Are you using any command line arguments like -libjars etc when you are launching the MapReduce job ? Alternatively you can check job.xml of the launched MR job to see if it has set properties having prefixes like mapred.cache. If nothing's set there, it would seem like some other process or user is adding jars to DistributedCache when using the cluster.
Thanks hemanth On Thu, Apr 11, 2013 at 11:40 PM, xia_y...@dell.com wrote: Hi Hemanth, Attached is some sample folders within my /tmp/hadoop-root/mapred/local/archive. There are some jar and class files inside. My application uses a MapReduce job to purge old HBase data. I am using the basic HBase MapReduce API to delete rows from an HBase table. I do not specify to use the Distributed cache. Maybe HBase uses it? Some code here:

Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
scan.setTimeRange(Long.MIN_VALUE, timestamp); // set other scan attrs
// the purge start time
Date date = new Date();
TableMapReduceUtil.initTableMapperJob(
    tableName, // input table
    scan, // Scan instance to control CF and attribute selection
    MapperDelete.class, // mapper class
    null, // mapper output key
    null, // mapper output value
Re: Hadoop fs -getmerge
I don't think that is possible. When we use -getmerge, the destination filesystem happens to be a LocalFileSystem, which extends from ChecksumFileSystem. I believe that's why the CRC files are getting in. Would it not be possible for you to ignore them, since they have a fixed extension ? Thanks Hemanth On Wed, Apr 17, 2013 at 8:09 PM, Fabio Pitzolu fabio.pitz...@gr-ci.com wrote: Hi all, is there a way to use the getmerge fs command and not generate the .crc files in the output local directory? Thanks, Fabio Pitzolu
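If ignoring them is not an option, one possible workaround is to do the merge in a small program that writes through the raw, non-checksummed local filesystem, so no .crc side files are produced. This is a sketch only, assuming FileUtil.copyMerge and ChecksumFileSystem.getRawFileSystem behave this way on your Hadoop version; paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class GetMergeNoCrc {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        LocalFileSystem local = FileSystem.getLocal(conf);
        // getRawFileSystem() bypasses ChecksumFileSystem, so no .crc
        // files are written alongside the merged output.
        FileUtil.copyMerge(hdfs, new Path(args[0]),
                local.getRawFileSystem(), new Path(args[1]),
                false /* don't delete source */, conf, null /* no separator */);
    }
}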
Re: How to configure mapreduce archive size?
You can limit the size by setting local.cache.size in the mapred-site.xml (or core-site.xml if that works for you). I mistakenly mentioned mapred-default.xml in my last mail - apologies for that. However, please note that this does not prevent whatever is writing into the distributed cache from creating those files when they are required. After they are done, the property will help cleanup the files due to the limit set. That's why I am more keen on finding what is using the files in the Distributed cache. It may be useful if you can ask on the HBase list as well if the APIs you are using are creating the files you mention (assuming you are only running HBase jobs on the cluster and nothing else) Thanks Hemanth On Tue, Apr 16, 2013 at 11:15 PM, xia_y...@dell.com wrote: Hi Hemanth, I did not explicitly use DistributedCache in my code. I did not use any command line arguments like -libjars either. Where can I find job.xml? I am using the HBase MapReduce API and not setting any job.xml. The key point is I want to limit the size of /tmp/hadoop-root/mapred/local/archive. Could you help? Thanks. Xia From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Thursday, April 11, 2013 9:09 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? TableMapReduceUtil has APIs like addDependencyJars which will use DistributedCache. I don't think you are explicitly using that. Are you using any command line arguments like -libjars etc when you are launching the MapReduce job ? Alternatively you can check job.xml of the launched MR job to see if it has set properties having prefixes like mapred.cache. If nothing's set there, it would seem like some other process or user is adding jars to DistributedCache when using the cluster. Thanks hemanth On Thu, Apr 11, 2013 at 11:40 PM, xia_y...@dell.com wrote: Hi Hemanth, Attached is some sample folders within my /tmp/hadoop-root/mapred/local/archive. There are some jar and class files inside. My application uses a MapReduce job to purge old HBase data. I am using the basic HBase MapReduce API to delete rows from an HBase table. I do not specify to use the Distributed cache. Maybe HBase uses it? Some code here:

Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
scan.setTimeRange(Long.MIN_VALUE, timestamp); // set other scan attrs
// the purge start time
Date date = new Date();
TableMapReduceUtil.initTableMapperJob(
    tableName, // input table
    scan, // Scan instance to control CF and attribute selection
    MapperDelete.class, // mapper class
    null, // mapper output key
    null, // mapper output value
    job);

job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
job.setNumReduceTasks(0);
boolean b = job.waitForCompletion(true);

From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Thursday, April 11, 2013 12:29 AM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? Could you paste the contents of the directory ? Not sure whether that will help, but just giving it a shot. What application are you using ? Is it custom MapReduce jobs in which you use Distributed cache (I guess not) ?
Thanks Hemanth On Thu, Apr 11, 2013 at 3:34 AM, xia_y...@dell.com wrote: Hi Arun, I stopped my application, then restarted my hbase (which includes hadoop). After that I started my application. After one evening, my /tmp/hadoop-root/mapred/local/archive goes to more than 1G. It does not work. Is this the right place to change the value? local.cache.size in file core-default.xml, which is in hadoop-core-1.0.3.jar Thanks, Jane From: Arun C Murthy [mailto:a...@hortonworks.com] Sent: Wednesday, April 10, 2013 2:45 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? Ensure no jobs are running (the cache limit is only for non-active cache files), and check after a little while (it takes some time for the cleaner thread to kick in). Arun On Apr 11, 2013, at 2:29 AM, xia_y...@dell.com xia_y...@dell.com wrote
Re: How to configure mapreduce archive size?
Could you paste the contents of the directory ? Not sure whether that will help, but just giving it a shot. What application are you using ? Is it custom MapReduce jobs in which you use Distributed cache (I guess not) ? Thanks Hemanth On Thu, Apr 11, 2013 at 3:34 AM, xia_y...@dell.com wrote: Hi Arun, I stopped my application, then restarted my hbase (which includes hadoop). After that I started my application. After one evening, my /tmp/hadoop-root/mapred/local/archive goes to more than 1G. It does not work. Is this the right place to change the value? local.cache.size in file core-default.xml, which is in hadoop-core-1.0.3.jar Thanks, Jane From: Arun C Murthy [mailto:a...@hortonworks.com] Sent: Wednesday, April 10, 2013 2:45 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? Ensure no jobs are running (the cache limit is only for non-active cache files), and check after a little while (it takes some time for the cleaner thread to kick in). Arun On Apr 11, 2013, at 2:29 AM, xia_y...@dell.com xia_y...@dell.com wrote: Hi Hemanth, For hadoop 1.0.3, I can only find local.cache.size in file core-default.xml, which is in hadoop-core-1.0.3.jar. It is not in mapred-default.xml. I updated the value in file default.xml and changed the value to 50. This is just for my testing purpose. However, the folder /tmp/hadoop-root/mapred/local/archive already goes to more than 1G now. Looks like it does not do the work. Could you advise if what I did is correct?

<name>local.cache.size</name>
<value>50</value>

Thanks, Xia From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Monday, April 08, 2013 9:09 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? Hi, This directory is used as part of the 'DistributedCache' feature. (http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#DistributedCache). There is a configuration key local.cache.size which controls the amount of data stored under DistributedCache. The default limit is 10GB. However, the files under this cannot be deleted if they are being used. Also, some frameworks on Hadoop could be using DistributedCache transparently to you. So you could check what is being stored here and based on that lower the limit of the cache size if you feel that will help. The property needs to be set in mapred-default.xml. Thanks Hemanth On Mon, Apr 8, 2013 at 11:09 PM, xia_y...@dell.com wrote: Hi, I am using the hadoop which is packaged within hbase-0.94.1. It is hadoop 1.0.3. There are some mapreduce jobs running on my server. After some time, I found that my folder /tmp/hadoop-root/mapred/local/archive has 14G size. How to configure this and limit the size? I do not want to waste my space for archive. Thanks, Xia -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Copy Vs DistCP
AFAIK, the cp command works fully from the DFS client. It reads bytes from the InputStream created when the file is opened and writes the same to the OutputStream of the file. It does not work at the level of data blocks. A configuration io.file.buffer.size is used as the size of the buffer used in the copy - set to 4096 by default. Thanks Hemanth On Thu, Apr 11, 2013 at 9:42 AM, KayVajj vajjalak...@gmail.com wrote: If the cp command is not parallel, how does it work for a file partitioned across various data nodes? On Wed, Apr 10, 2013 at 6:30 PM, Azuryy Yu azury...@gmail.com wrote: The cp command is not parallel. It just calls the FileSystem, even if DFSClient has multiple threads. DistCp can work well on the same cluster. On Thu, Apr 11, 2013 at 8:17 AM, KayVajj vajjalak...@gmail.com wrote: The File System Copy utility copies files byte by byte, if I'm not wrong. Could it be possible that the cp command works with blocks and moves them, which could be significantly more efficient? Also, how does the cp command work if the file is distributed on different data nodes?? Thanks Kay On Wed, Apr 10, 2013 at 4:48 PM, Jay Vyas jayunit...@gmail.com wrote: DistCP is a full-blown mapreduce job (mapper only, where the mappers do a fully parallel copy to the destination). CP appears (correct me if I'm wrong) to simply invoke the FileSystem and issue a copy command for every source file. I have an additional question: how is CP, which is internal to a cluster, optimized (if at all) ? On Wed, Apr 10, 2013 at 7:28 PM, 麦树荣 shurong@qunar.com wrote: Hi, I think it's better to use Copy within the same cluster and distCP between clusters; the cp command is a hadoop internal parallel process and will not copy files locally. -- 麦树荣 From: KayVajj vajjalak...@gmail.com Date: 2013-04-11 06:20 To: user@hadoop.apache.org Subject: Copy Vs DistCP I have a few questions regarding the usage of DistCP for copying files in the same cluster. 1) Which one is better within the same cluster and what factors (like file size etc.) would influence the usage of one over the other? 2) When we run a cp command like the one below from a client node of the cluster (not a data node), how does the cp command work: i) like an MR job, or ii) copy files locally and then copy them back to the new location? Example of the copy command hdfs dfs -cp /some_location/file /new_location/ Thanks, your responses are appreciated. -- Kay -- Jay Vyas http://jayunit100.blogspot.com
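To illustrate the description above, a rough sketch of what the per-file copy amounts to: a client-side byte copy between streams, with no block-level move. Paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ClientSideCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // All bytes flow through this client, buffer by buffer.
        FSDataInputStream in = fs.open(new Path(args[0]));
        FSDataOutputStream out = fs.create(new Path(args[1]));
        // io.file.buffer.size (default 4096) sizes the copy buffer;
        // the final 'true' closes both streams when the copy finishes.
        IOUtils.copyBytes(in, out, conf.getInt("io.file.buffer.size", 4096), true);
    }
}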
Re: How to configure mapreduce archive size?
TableMapReduceUtil has APIs like addDependencyJars which will use DistributedCache. I don't think you are explicitly using that. Are you using any command line arguments like -libjars etc when you are launching the MapReduce job ? Alternatively you can check job.xml of the launched MR job to see if it has set properties having prefixes like mapred.cache. If nothing's set there, it would seem like some other process or user is adding jars to DistributedCache when using the cluster. Thanks hemanth On Thu, Apr 11, 2013 at 11:40 PM, xia_y...@dell.com wrote: Hi Hemanth, Attached is some sample folders within my /tmp/hadoop-root/mapred/local/archive. There are some jar and class files inside. My application uses a MapReduce job to purge old HBase data. I am using the basic HBase MapReduce API to delete rows from an HBase table. I do not specify to use the Distributed cache. Maybe HBase uses it? Some code here:

Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
scan.setTimeRange(Long.MIN_VALUE, timestamp); // set other scan attrs
// the purge start time
Date date = new Date();
TableMapReduceUtil.initTableMapperJob(
    tableName, // input table
    scan, // Scan instance to control CF and attribute selection
    MapperDelete.class, // mapper class
    null, // mapper output key
    null, // mapper output value
    job);

job.setOutputFormatClass(TableOutputFormat.class);
job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
job.setNumReduceTasks(0);
boolean b = job.waitForCompletion(true);

From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Thursday, April 11, 2013 12:29 AM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? Could you paste the contents of the directory ? Not sure whether that will help, but just giving it a shot. What application are you using ? Is it custom MapReduce jobs in which you use Distributed cache (I guess not) ? Thanks Hemanth On Thu, Apr 11, 2013 at 3:34 AM, xia_y...@dell.com wrote: Hi Arun, I stopped my application, then restarted my hbase (which includes hadoop). After that I started my application. After one evening, my /tmp/hadoop-root/mapred/local/archive goes to more than 1G. It does not work. Is this the right place to change the value? local.cache.size in file core-default.xml, which is in hadoop-core-1.0.3.jar Thanks, Jane From: Arun C Murthy [mailto:a...@hortonworks.com] Sent: Wednesday, April 10, 2013 2:45 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? Ensure no jobs are running (the cache limit is only for non-active cache files), and check after a little while (it takes some time for the cleaner thread to kick in). Arun On Apr 11, 2013, at 2:29 AM, xia_y...@dell.com xia_y...@dell.com wrote: Hi Hemanth, For hadoop 1.0.3, I can only find local.cache.size in file core-default.xml, which is in hadoop-core-1.0.3.jar. It is not in mapred-default.xml. I updated the value in file default.xml and changed the value to 50. This is just for my testing purpose. However, the folder /tmp/hadoop-root/mapred/local/archive already goes to more than 1G now. Looks like it does not do the work.
Could you advise if what I did is correct?

<name>local.cache.size</name>
<value>50</value>

Thanks, Xia From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Monday, April 08, 2013 9:09 PM To: user@hadoop.apache.org Subject: Re: How to configure mapreduce archive size? Hi, This directory is used as part of the 'DistributedCache' feature. (http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#DistributedCache). There is a configuration key local.cache.size which controls the amount of data stored under DistributedCache. The default limit is 10GB. However, the files under this cannot be deleted if they are being used. Also, some frameworks on Hadoop could be using DistributedCache transparently to you. So you could check what is being stored here and based on that lower the limit of the cache size if you feel that will help. The property needs to be set in mapred-default.xml
Re: How to configure mapreduce archive size?
Hi, This directory is used as part of the 'DistributedCache' feature. (http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#DistributedCache). There is a configuration key local.cache.size which controls the amount of data stored under DistributedCache. The default limit is 10GB. However, the files under this cannot be deleted if they are being used. Also, some frameworks on Hadoop could be using DistributedCache transparently to you. So you could check what is being stored here and based on that lower the limit of the cache size if you feel that will help. The property needs to be set in mapred-default.xml. Thanks Hemanth On Mon, Apr 8, 2013 at 11:09 PM, xia_y...@dell.com wrote: Hi, I am using the hadoop which is packaged within hbase-0.94.1. It is hadoop 1.0.3. There are some mapreduce jobs running on my server. After some time, I found that my folder /tmp/hadoop-root/mapred/local/archive has 14G size. How to configure this and limit the size? I do not want to waste my space for archive. Thanks, Xia
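For reference, a sketch of how the limit would be set in the configuration file, assuming local.cache.size is interpreted in bytes (its default of 10737418240 corresponds to the 10GB limit mentioned above):

<property>
  <name>local.cache.size</name>
  <!-- assumed to be in bytes: 1073741824 = 1 GB -->
  <value>1073741824</value>
</property>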
Re: Find reducer for a key
Hi, Not sure if I am answering your question, but this is the background. Every MapReduce job has a partitioner associated to it. The default partitioner is a HashPartitioner. You can as a user write your own partitioner as well and plug it into the job. The partitioner is responsible for splitting the map output key space among the reducers. So, to know which reducer a key will go to, it is basically the value returned by the partitioner's getPartition method. For e.g. this is the code in the HashPartitioner:

public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

mapred.task.partition is the key that defines the partition number of this reducer. I guess you can piece together these bits into what you'd want.. However, I am interested in understanding why you want to know this ? Can you share some info ? Thanks Hemanth On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli cordioli.albe...@gmail.com wrote: Hi everyone, how can I know the keys that are associated with a particular reducer in the setup method? Let's assume that in the setup method we read from a file where each line is a string that will become a key emitted from the mappers. For each of these lines I would like to know if the string will be a key associated with the current reducer or not. I read something about mapred.task.partition and mapred.task.id, but I didn't understand the usage. Thanks, Alberto -- Alberto Cordioli
Re: Find reducer for a key
Hmm. That feels like a join. Can't you read the input file on the map side and output those keys along with the original map output keys.. That way the reducer would automatically get both together ? On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli cordioli.albe...@gmail.com wrote: Hi Hemanth, thanks for your reply. Yes, this partially answered my question. I know how the hash partitioner works and I guessed something similar. The piece that I missed was that mapred.task.partition returns the partition number of the reducer. So, putting all the pieces together I understand that: for each key in the file I have to call the HashPartitioner. Then I have to compare the returned index with the one retrieved by Configuration.getInt("mapred.task.partition"). If it is equal, then such a key will be served by that reducer. Is this correct? To answer your question: On the reduce side of an MR job, I want to load some data from a file into an in-memory structure. Actually, I don't need to store the whole file for each reducer, but only the lines that are related to the keys a particular reducer will receive. So, my intention is to know the keys in the setup method to store only the needed lines. Thanks, Alberto On 28 March 2013 11:01, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, Not sure if I am answering your question, but this is the background. Every MapReduce job has a partitioner associated to it. The default partitioner is a HashPartitioner. You can as a user write your own partitioner as well and plug it into the job. The partitioner is responsible for splitting the map output key space among the reducers. So, to know which reducer a key will go to, it is basically the value returned by the partitioner's getPartition method. For e.g. this is the code in the HashPartitioner:

public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

mapred.task.partition is the key that defines the partition number of this reducer. I guess you can piece together these bits into what you'd want.. However, I am interested in understanding why you want to know this ? Can you share some info ? Thanks Hemanth On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli cordioli.albe...@gmail.com wrote: Hi everyone, how can I know the keys that are associated with a particular reducer in the setup method? Let's assume that in the setup method we read from a file where each line is a string that will become a key emitted from the mappers. For each of these lines I would like to know if the string will be a key associated with the current reducer or not. I read something about mapred.task.partition and mapred.task.id, but I didn't understand the usage. Thanks, Alberto -- Alberto Cordioli -- Alberto Cordioli
Re: Find reducer for a key
Hi, The way I understand your requirement - you have a file that contains a set of keys. You want to read this file on every reducer and take only those entries of the set whose keys correspond to the current reducer. If the above summary is correct, can I assume that you are potentially reading the entire intermediate output key space on every reducer? Would that even work (considering memory constraints, etc.)? It seemed to me that your solution is implementing what the framework can already do for you. That was the rationale behind my suggestion. Maybe you should try and implement both approaches to see which one works better for you. Thanks hemanth On Thu, Mar 28, 2013 at 6:37 PM, Alberto Cordioli cordioli.albe...@gmail.com wrote: Yes, that is a possible solution. But since the MR job has another scope, the mappers already read other files (very large) and output tuples. You cannot control the number of mappers, and hence the risk is that a lot of mappers will be created, each of them also reading the other file, instead of a small number of reducers. Do you think that the solution I proposed is not so elegant or efficient? Alberto On 28 March 2013 13:12, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hmm. That feels like a join. Can't you read the input file on the map side and output those keys along with the original map output keys.. That way the reducer would automatically get both together ? On Thu, Mar 28, 2013 at 5:20 PM, Alberto Cordioli cordioli.albe...@gmail.com wrote: Hi Hemanth, thanks for your reply. Yes, this partially answered my question. I know how the hash partitioner works and I guessed something similar. The piece that I missed was that mapred.task.partition returns the partition number of the reducer. So, putting all the pieces together I understand that: for each key in the file I have to call the HashPartitioner. Then I have to compare the returned index with the one retrieved by Configuration.getInt("mapred.task.partition"). If it is equal, then such a key will be served by that reducer. Is this correct? To answer your question: On the reduce side of an MR job, I want to load some data from a file into an in-memory structure. Actually, I don't need to store the whole file for each reducer, but only the lines that are related to the keys a particular reducer will receive. So, my intention is to know the keys in the setup method to store only the needed lines. Thanks, Alberto On 28 March 2013 11:01, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, Not sure if I am answering your question, but this is the background. Every MapReduce job has a partitioner associated to it. The default partitioner is a HashPartitioner. You can as a user write your own partitioner as well and plug it into the job. The partitioner is responsible for splitting the map output key space among the reducers. So, to know which reducer a key will go to, it is basically the value returned by the partitioner's getPartition method. For e.g. this is the code in the HashPartitioner:

public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

mapred.task.partition is the key that defines the partition number of this reducer. I guess you can piece together these bits into what you'd want.. However, I am interested in understanding why you want to know this ? Can you share some info ?
Thanks Hemanth On Thu, Mar 28, 2013 at 2:17 PM, Alberto Cordioli cordioli.albe...@gmail.com wrote: Hi everyone, how can I know the keys that are associated with a particular reducer in the setup method? Let's assume that in the setup method we read from a file where each line is a string that will become a key emitted from the mappers. For each of these lines I would like to know if the string will be a key associated with the current reducer or not. I read something about mapred.task.partition and mapred.task.id, but I didn't understand the usage. Thanks, Alberto -- Alberto Cordioli -- Alberto Cordioli -- Alberto Cordioli
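Putting the pieces of this thread together, a sketch of the setup-method approach under discussion: compute each candidate key's partition with the same HashPartitioner the job uses and keep only the lines whose partition matches mapred.task.partition. The lookup-file reading is elided behind a hypothetical readLookupFile helper, and the key/value types are examples:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class FilteringReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final List<String> myKeys = new ArrayList<String>();

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        // The partition number assigned to this reduce task.
        int myPartition = conf.getInt("mapred.task.partition", -1);
        int numReduceTasks = context.getNumReduceTasks();
        HashPartitioner<Text, IntWritable> partitioner =
                new HashPartitioner<Text, IntWritable>();
        for (String line : readLookupFile(conf)) { // hypothetical helper
            // HashPartitioner ignores the value, so null is fine here.
            if (partitioner.getPartition(new Text(line), null, numReduceTasks) == myPartition) {
                myKeys.add(line); // this key will be served by this reducer
            }
        }
    }

    private List<String> readLookupFile(Configuration conf) {
        // Placeholder: read the lines from the distributed-cache file here.
        return new ArrayList<String>();
    }
}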
Re: Child JVM memory allocation / Usage
Couple of things to check: Does your class com.hadoop.publicationMrPOC.Launcher implement the Tool interface ? You can look at an example at (http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Source+Code-N110D0). That's what accepts the -D params on the command line. Alternatively, you can also set the same in the configuration object like this, in your launcher code:

Configuration conf = new Configuration();
conf.set("mapred.create.symlink", "yes");
conf.set("mapred.cache.files", "hdfs:///user/hemanty/scripts/copy_dump.sh#copy_dump.sh");
conf.set("mapred.child.java.opts", "-Xmx200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heapdump.hprof -XX:OnOutOfMemoryError=./copy_dump.sh");

Second, the position of the arguments matters. I think the command should be

hadoop jar LL.jar com.hadoop.publicationMrPOC.Launcher -Dmapred.create.symlink=yes -Dmapred.cache.files=hdfs:///user/ims-b/dump.sh#dump.sh -Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh' Fudan\ Univ

Thanks Hemanth On Wed, Mar 27, 2013 at 1:58 PM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Hi Hemanth/Koji, Seems the above script doesn't work for me. Can you look into the following and suggest what more I can do?

hadoop fs -cat /user/ims-b/dump.sh
#!/bin/sh
hadoop dfs -put myheapdump.hprof /tmp/myheapdump_ims/${PWD//\//_}.hprof

hadoop jar LL.jar com.hadoop.publicationMrPOC.Launcher Fudan\ Univ -Dmapred.create.symlink=yes -Dmapred.cache.files=hdfs:///user/ims-b/dump.sh#dump.sh -Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'

I am not able to see the heap dump at /tmp/myheapdump_ims Error in the mapper :

Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 17 more
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2734)
    at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
    at java.util.ArrayList.add(ArrayList.java:351)
    at com.hadoop.publicationMrPOC.PublicationMapper.configure(PublicationMapper.java:59)
    ... 22 more

On Wed, Mar 27, 2013 at 10:16 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Koji, Works beautifully. Thanks a lot. I learnt at least 3 different things with your script today ! Hemanth On Tue, Mar 26, 2013 at 9:41 PM, Koji Noguchi knogu...@yahoo-inc.com wrote: Create a dump.sh on hdfs.

$ hadoop dfs -cat /user/knoguchi/dump.sh
#!/bin/sh
hadoop dfs -put myheapdump.hprof /tmp/myheapdump_knoguchi/${PWD//\//_}.hprof

Run your job with

-Dmapred.create.symlink=yes -Dmapred.cache.files=hdfs:///user/knoguchi/dump.sh#dump.sh -Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'

This should create the heap dump on hdfs at /tmp/myheapdump_knoguchi. Koji On Mar 26, 2013, at 11:53 AM, Hemanth Yamijala wrote: Hi, I tried to use the -XX:+HeapDumpOnOutOfMemoryError. Unfortunately, like I suspected, the dump goes to the current work directory of the task attempt as it executes on the cluster. This directory is cleaned up once the task is done.
There are options to keep failed task files or task files matching a pattern. However, these are NOT retaining the current working directory. Hence, there is no option to get this from a cluster AFAIK. You are effectively left with the jmap option on a pseudo-distributed cluster, I think. Thanks Hemanth On Tue, Mar 26, 2013 at 11:37 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: If your task is running out of memory, you could add the option -XX:+HeapDumpOnOutOfMemoryError to mapred.child.java.opts (along with the heap memory). However, I am not sure where it stores the dump.. You might need to experiment a little on it.. Will try and send out the info if I get time to try it out. Thanks Hemanth On Tue, Mar 26, 2013 at 10:23 AM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Hi hemanth, This sounds interesting, I will try that out on the pseudo cluster. But the real problem for me is, the cluster is being maintained by a third party. I only have an edge node through which I can submit the jobs. Is there any other way of getting the dump instead of physically going to that machine and checking it out? On Tue, Mar 26, 2013 at 10:12 AM, Hemanth Yamijala yhema
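As a footnote to the Tool-interface point raised earlier in this thread, a minimal sketch of what a Tool-based launcher looks like, so that -D options on the command line are parsed by ToolRunner. The class name Launcher is borrowed from the thread; the actual job setup is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Launcher extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D options parsed by ToolRunner;
        // args holds only the remaining program arguments.
        Configuration conf = getConf();
        // ... set up and submit the job here ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Launcher(), args));
    }
}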
Re: Child JVM memory allocation / Usage
Hi,

Dumping heap to ./heapdump.hprof
put: File myheapdump.hprof does not exist.

The file names don't match - can you check your script / command line args? Thanks hemanth On Wed, Mar 27, 2013 at 3:21 PM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Hi Hemanth, Nice to see this. I did not know about this till now. But one more issue: the dump file did not get created. The following are the logs

attempt_201302211510_81218_m_00_0: /data/1/mapred/local/taskTracker/distcache/8776089957260881514_-363500746_715125253/cmp111wcd/user/ims-b/nagarjuna/AddressId_Extractor/Numbers
attempt_201302211510_81218_m_00_0: java.lang.OutOfMemoryError: Java heap space
attempt_201302211510_81218_m_00_0: Dumping heap to ./heapdump.hprof ...
attempt_201302211510_81218_m_00_0: Heap dump file created [210641441 bytes in 3.778 secs]
attempt_201302211510_81218_m_00_0: #
attempt_201302211510_81218_m_00_0: # java.lang.OutOfMemoryError: Java heap space
attempt_201302211510_81218_m_00_0: # -XX:OnOutOfMemoryError=./dump.sh
attempt_201302211510_81218_m_00_0: # Executing /bin/sh -c ./dump.sh...
attempt_201302211510_81218_m_00_0: put: File myheapdump.hprof does not exist.
attempt_201302211510_81218_m_00_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).

On Wed, Mar 27, 2013 at 2:29 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Couple of things to check: Does your class com.hadoop.publicationMrPOC.Launcher implement the Tool interface ? You can look at an example at (http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Source+Code-N110D0). That's what accepts the -D params on the command line. Alternatively, you can also set the same in the configuration object like this, in your launcher code:

Configuration conf = new Configuration();
conf.set("mapred.create.symlink", "yes");
conf.set("mapred.cache.files", "hdfs:///user/hemanty/scripts/copy_dump.sh#copy_dump.sh");
conf.set("mapred.child.java.opts", "-Xmx200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./heapdump.hprof -XX:OnOutOfMemoryError=./copy_dump.sh");

Second, the position of the arguments matters. I think the command should be

hadoop jar LL.jar com.hadoop.publicationMrPOC.Launcher -Dmapred.create.symlink=yes -Dmapred.cache.files=hdfs:///user/ims-b/dump.sh#dump.sh -Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh' Fudan\ Univ

Thanks Hemanth On Wed, Mar 27, 2013 at 1:58 PM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Hi Hemanth/Koji, Seems the above script doesn't work for me.
Can you look into the following and suggest what more I can do?

hadoop fs -cat /user/ims-b/dump.sh
#!/bin/sh
hadoop dfs -put myheapdump.hprof /tmp/myheapdump_ims/${PWD//\//_}.hprof

hadoop jar LL.jar com.hadoop.publicationMrPOC.Launcher Fudan\ Univ -Dmapred.create.symlink=yes -Dmapred.cache.files=hdfs:///user/ims-b/dump.sh#dump.sh -Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'

I am not able to see the heap dump at /tmp/myheapdump_ims Error in the mapper :

Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 17 more
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2734)
    at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
    at java.util.ArrayList.add(ArrayList.java:351)
    at com.hadoop.publicationMrPOC.PublicationMapper.configure(PublicationMapper.java:59)
    ... 22 more

On Wed, Mar 27, 2013 at 10:16 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Koji, Works beautifully. Thanks a lot. I learnt at least 3 different things with your script today ! Hemanth On Tue, Mar 26, 2013 at 9:41 PM, Koji Noguchi knogu...@yahoo-inc.com wrote: Create a dump.sh on hdfs.

$ hadoop dfs -cat /user/knoguchi/dump.sh
#!/bin/sh
hadoop dfs -put myheapdump.hprof /tmp/myheapdump_knoguchi/${PWD//\//_}.hprof

Run your job with

-Dmapred.create.symlink=yes -Dmapred.cache.files=hdfs:///user/knoguchi/dump.sh#dump.sh -Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'

This should create the heap dump on hdfs at /tmp/myheapdump_knoguchi. Koji On Mar 26, 2013, at 11:53 AM, Hemanth Yamijala wrote: Hi, I tried to use
Re: Child JVM memory allocation / Usage
If your task is running out of memory, you could add the option -XX:+HeapDumpOnOutOfMemoryError to mapred.child.java.opts (along with the heap memory). However, I am not sure where it stores the dump.. You might need to experiment a little on it.. Will try and send out the info if I get time to try it out. Thanks Hemanth On Tue, Mar 26, 2013 at 10:23 AM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Hi hemanth, This sounds interesting, I will try that out on the pseudo cluster. But the real problem for me is, the cluster is being maintained by a third party. I only have an edge node through which I can submit the jobs. Is there any other way of getting the dump instead of physically going to that machine and checking it out? On Tue, Mar 26, 2013 at 10:12 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, One option to find what could be taking the memory is to use jmap on the running task. The steps I followed are: - I ran a sleep job (which comes in the examples jar of the distribution - effectively does nothing in the mapper / reducer). - From the JobTracker UI looked at a map task attempt ID. - Then on the machine where the map task is running, got the PID of the running task - ps -ef | grep task attempt id - On the same machine executed jmap -histo pid This will give you an idea of the count of objects allocated and size. Jmap also has options to get a dump, which will contain more information, but this should help to get you started with debugging. For my sleep job task - I saw allocations worth roughly 130 MB. Thanks hemanth On Mon, Mar 25, 2013 at 6:43 PM, Nagarjuna Kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: I have a lookup file which I need in the mapper. So I am trying to read the whole file and load it into a list in the mapper. For each and every record I look in this file, which I got from the distributed cache. — Sent from iPhone On Mon, Mar 25, 2013 at 6:39 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hmm. How are you loading the file into memory ? Is it some sort of memory mapping etc ? Are they being read as records ? Some details of the app will help On Mon, Mar 25, 2013 at 2:14 PM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Hi Hemanth, I tried out your suggestion, loading a 420 MB file into memory. It threw a java heap space error. I am not sure where this 1.6 GB of configured heap went to ? On Mon, Mar 25, 2013 at 12:01 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, The free memory might be low just because GC hasn't reclaimed what it can. Can you just try reading in the data you want to read and see if that works ? Thanks Hemanth On Mon, Mar 25, 2013 at 10:32 AM, nagarjuna kanamarlapudi wrote: io.sort.mb = 256 MB On Monday, March 25, 2013, Harsh J wrote: The MapTask may consume some memory of its own as well. What is your io.sort.mb (MR1) or mapreduce.task.io.sort.mb (MR2) set to? On Sun, Mar 24, 2013 at 3:40 PM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote: Hi, I configured my child jvm heap to 2 GB. So, I thought I could really read 1.5GB of data and store it in memory (mapper/reducer). I wanted to confirm the same and wrote the following piece of code in the configure method of the mapper.
@Override public void configure(JobConf job) { System.out.println(FREE MEMORY -- + Runtime.getRuntime().freeMemory()); System.out.println(MAX MEMORY --- + Runtime.getRuntime().maxMemory()); } Surprisingly the output was FREE MEMORY -- 341854864 = 320 MB MAX MEMORY ---1908932608 = 1.9 GB I am just wondering what processes are taking up that extra 1.6GB of heap which I configured for the child jvm heap. Appreciate in helping me understand the scenario. Regards Nagarjuna K -- Harsh J -- Sent from iPhone
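For the lookup-file pattern described in this thread, here is a minimal sketch of loading a distributed-cache file into a map in the old-API configure() method (the tab-separated layout and the base-class name are illustrative assumptions; the file is whatever was shipped via DistributedCache.addCacheFile):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class LookupMapperBase extends MapReduceBase {
    protected final Map<String, String> lookup = new HashMap<String, String>();

    @Override
    public void configure(JobConf job) {
        try {
            // local paths of files shipped through the distributed cache
            Path[] cached = DistributedCache.getLocalCacheFiles(job);
            if (cached == null || cached.length == 0) {
                throw new RuntimeException("Lookup file missing from distributed cache");
            }
            BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    // assumed "key<TAB>value" layout for the lookup file
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            } finally {
                in.close();
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to load lookup file", e);
        }
    }
}

Note that per-entry object overhead in a HashMap (or ArrayList) can easily inflate a 420 MB file to well beyond the configured heap, which is consistent with the OutOfMemoryError reported in this thread.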
Re: Child JVM memory allocation / Usage
Hi, I tried to use the -XX:+HeapDumpOnOutOfMemoryError. Unfortunately, like I suspected, the dump goes to the current work directory of the task attempt as it executes on the cluster. This directory is cleaned up once the task is done. There are options to keep failed task files or task files matching a pattern. However, these are NOT retaining the current working directory. Hence, there is no option to get this from a cluster AFAIK. You are effectively left with the jmap option on pseudo distributed cluster I think. Thanks Hemanth
Re: Child JVM memory allocation / Usage
Koji, Works beautifully. Thanks a lot. I learnt at least 3 different things with your script today ! Hemanth

On Tue, Mar 26, 2013 at 9:41 PM, Koji Noguchi knogu...@yahoo-inc.com wrote: Create a dump.sh on hdfs.

$ hadoop dfs -cat /user/knoguchi/dump.sh
#!/bin/sh
hadoop dfs -put myheapdump.hprof /tmp/myheapdump_knoguchi/${PWD//\//_}.hprof

Run your job with

-Dmapred.create.symlink=yes
-Dmapred.cache.files=hdfs:///user/knoguchi/dump.sh#dump.sh
-Dmapred.reduce.child.java.opts='-Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=./myheapdump.hprof -XX:OnOutOfMemoryError=./dump.sh'

This should create the heap dump on hdfs at /tmp/myheapdump_knoguchi. Koji
Re: How to tell my Hadoop cluster to read data from an external server
The stack trace indicates the job client is trying to submit a job to the MR cluster and it is failing. Are you certain that at the time of submitting the job, the JobTracker is running (on localhost:54312) ? Regarding using a different file system - it depends a lot on what file system you are using, and whether it will match the requirements of large scale distributed processing that Hadoop MR can offer. Suggest you be very sure about this, before you take that route. Thanks Hemanth

On Tue, Mar 26, 2013 at 4:22 PM, Agarwal, Nikhil nikhil.agar...@netapp.com wrote: Hi, Thanks for your reply. I do not know about cascading. Should I google it as "cascading in hadoop"? Also, what I was thinking is to implement a file system which overrides the functions provided by the fs.FileSystem interface in Hadoop. I tried to write some portions of the filesystem (for my external server) so that it recompiles successfully, but when I submit a MR job I get the following error:

13/03/26 06:09:10 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54312. Already tried 0 time(s).
13/03/26 06:09:11 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54312. Already tried 1 time(s).
13/03/26 06:09:12 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54312. Already tried 2 time(s).
13/03/26 06:09:13 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54312. Already tried 3 time(s).
13/03/26 06:09:14 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54312. Already tried 4 time(s).
13/03/26 06:09:15 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54312. Already tried 5 time(s).
13/03/26 06:09:16 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54312. Already tried 6 time(s).
13/03/26 06:09:17 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54312. Already tried 7 time(s).
13/03/26 06:09:18 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54312. Already tried 8 time(s).
13/03/26 06:09:19 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:54312. Already tried 9 time(s).
13/03/26 06:10:20 ERROR security.UserGroupInformation: PriviledgedActionException as:nikhil cause:java.net.ConnectException: Call to localhost/127.0.0.1:54312 failed on connection exception: java.net.ConnectException: Connection refused
java.net.ConnectException: Call to localhost/127.0.0.1:54312 failed on connection exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1099)
at org.apache.hadoop.ipc.Client.call(Client.java:1075)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:480)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:474)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:457)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:513)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:511)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:499)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
at org.apache.hadoop.examples.WordCount.main(WordCount.java:67)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
at
Re: Child JVM memory allocation / Usage
Hi, The free memory might be low, just because GC hasn't reclaimed what it can. Can you just try reading in the data you want to read and see if that works ? Thanks Hemanth
Re: Child JVM memory allocation / Usage
Hmm. How are you loading the file into memory ? Is it some sort of memory mapping etc ? Are they being read as records ? Some details of the app will help
Re: Child JVM memory allocation / Usage
Hi, One option to find what could be taking the memory is to use jmap on the running task. The steps I followed are:
- I ran a sleep job (which comes in the examples jar of the distribution - effectively does nothing in the mapper / reducer).
- From the JobTracker UI looked at a map task attempt ID.
- Then on the machine where the map task is running, got the PID of the running task - ps -ef | grep <task attempt id>
- On the same machine executed jmap -histo <pid>
This will give you an idea of the count of objects allocated and size. Jmap also has options to get a dump, that will contain more information, but this should help to get you started with debugging. For my sleep job task - I saw allocations worth roughly 130 MB. Thanks hemanth
Re: MapReduce Failed and Killed
Any MapReduce task needs to communicate periodically with the tasktracker that launched it, in order to let the tasktracker know it is still alive and active. The time for which silence is tolerated is controlled by the configuration property mapred.task.timeout. It looks like in your case this has already been bumped up to 20 minutes (from the default 10 minutes), and it also looks like this is not sufficient. You could bump this value even further up. However, the correct approach would be to see what the reducer is actually doing to become inactive during this time. Can you look at the reducer attempt's logs (which you can access from the web UI of the JobTracker) and post them here ? Thanks hemanth

On Fri, Mar 22, 2013 at 5:32 PM, Jinchun Kim cien...@gmail.com wrote: Hi, All. I'm trying to create category-based splits of the Wikipedia dataset (41 GB) and the training data set (5 GB) using Mahout. I'm using the following command:

$MAHOUT_HOME/bin/mahout wikipediaDataSetCreator -i wikipedia/chunks -o wikipediainput -c $MAHOUT_HOME/examples/temp/categories.txt

I had no problem with the training data set, but Hadoop showed the following messages when I tried to do the same job with the Wikipedia dataset:

13/03/21 22:31:00 INFO mapred.JobClient: map 27% reduce 1%
13/03/21 22:40:31 INFO mapred.JobClient: map 27% reduce 2%
13/03/21 22:58:49 INFO mapred.JobClient: map 27% reduce 3%
13/03/21 23:22:57 INFO mapred.JobClient: map 27% reduce 4%
13/03/21 23:46:32 INFO mapred.JobClient: map 27% reduce 5%
13/03/22 00:27:14 INFO mapred.JobClient: map 27% reduce 6%
13/03/22 01:06:55 INFO mapred.JobClient: map 27% reduce 7%
13/03/22 01:14:06 INFO mapred.JobClient: map 27% reduce 3%
13/03/22 01:15:35 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_r_00_1, Status : FAILED
Task attempt_201303211339_0002_r_00_1 failed to report status for 1200 seconds. Killing!
13/03/22 01:20:09 INFO mapred.JobClient: map 27% reduce 4%
13/03/22 01:33:35 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_m_37_1, Status : FAILED
Task attempt_201303211339_0002_m_37_1 failed to report status for 1228 seconds. Killing!
13/03/22 01:35:12 INFO mapred.JobClient: map 27% reduce 5%
13/03/22 01:40:38 INFO mapred.JobClient: map 27% reduce 6%
13/03/22 01:52:28 INFO mapred.JobClient: map 27% reduce 7%
13/03/22 02:16:27 INFO mapred.JobClient: map 27% reduce 8%
13/03/22 02:19:02 INFO mapred.JobClient: Task Id : attempt_201303211339_0002_m_18_1, Status : FAILED
Task attempt_201303211339_0002_m_18_1 failed to report status for 1204 seconds. Killing!
13/03/22 02:49:03 INFO mapred.JobClient: map 27% reduce 9%
13/03/22 02:52:04 INFO mapred.JobClient: map 28% reduce 9%

Because I just started to learn how to run Hadoop, I have no idea how to solve this problem... Does anyone have an idea how to handle this weird thing? -- Jinchun Kim
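If the reducer genuinely needs long stretches of work per key, the usual alternative to raising mapred.task.timeout is to report progress from the task itself. A minimal old-API sketch (the summing body is just a placeholder for the real per-record work):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class LongRunningReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
            // tell the tasktracker we are alive, so a long-running group
            // does not trip mapred.task.timeout
            reporter.progress();
        }
        output.collect(key, new IntWritable(sum));
    }
}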
Re: Too many open files error with YARN
There is a way to confirm if it is the same bug. Can you take a jstack of the process that has established a connection to 50010 and post it here. Thanks hemanth

On Thu, Mar 21, 2013 at 12:13 PM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi Hemanth and Sandy, Thanks for your reply. Yes, that indicates it is in close wait state, exactly like below:

java 30718 dsadm 200u IPv4 1178376459 0t0 TCP *:50010 (LISTEN)
java 31512 dsadm 240u IPv6 1178391921 0t0 TCP node1:51342->node1:50010 (CLOSE_WAIT)

I just checked at the link https://issues.apache.org/jira/browse/HDFS-3357 that it shows 2.0.0-alpha both in affected versions and fix versions. There is another bug, HDFS-3591, at https://issues.apache.org/jira/browse/HDFS-3591, which says it is for backporting HDFS-3357 to branch 0.23. So, I don't understand whether the fix is really in 2.0.0-alpha; request you to please clarify. Thanks, Kishore
Re: Too many open files error with YARN
There was an issue related to hung connections (HDFS-3357). But the JIRA indicates the fix is available in Hadoop-2.0.0-alpha. Still, it would be worth checking on Sandy's suggestion.

On Wed, Mar 20, 2013 at 11:09 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Kishore, 50010 is the datanode port. Does your lsof indicate that the sockets are in CLOSE_WAIT? I had come across an issue like this where that was a symptom. -Sandy

On Wed, Mar 20, 2013 at 4:24 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi, I am running a date command with YARN's distributed shell example in a loop of 1000 times, in this way:

yarn jar /home/kbonagir/yarn/hadoop-2.0.0-alpha/share/hadoop/mapreduce/hadoop-yarn-applications-distributedshell-2.0.0-alpha.jar org.apache.hadoop.yarn.applications.distributedshell.Client --jar /home/kbonagir/yarn/hadoop-2.0.0-alpha/share/hadoop/mapreduce/hadoop-yarn-applications-distributedshell-2.0.0-alpha.jar --shell_command date --num_containers 2

Around the 730th time or so, I get an error in the node manager's log saying that it failed to launch a container because there are "Too many open files". When I observe through the lsof command, I find that one connection of this kind is left behind for each run of the Application Master, and the count keeps growing as I run it in a loop:

node1:44871->node1:50010

Is this a known issue? Or am I missing something? Please help. Note: I am working on hadoop-2.0.0-alpha. Thanks, Kishore
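Independent of the HDFS-3357 angle, connections to port 50010 stuck in CLOSE_WAIT can also come from client code that never closes its HDFS streams, so that is worth ruling out. A generic sketch of the close-in-finally hygiene (the path is hypothetical):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadOnce {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/tmp/example.txt")); // hypothetical path
        try {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
        } finally {
            in.close(); // an unclosed stream can keep a DataNode socket open
        }
    }
}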
Re: map reduce and sync
I am using the same version of Hadoop as you. Can you look at something like Scribe, which AFAIK fits the use case you describe. Thanks Hemanth

On Sun, Feb 24, 2013 at 3:33 AM, Lucas Bernardi luc...@gmail.com wrote: That is exactly what I did, but in my case, it is as if the file were empty; the job counters say no bytes read. I'm using hadoop 1.0.3. Which version did you try? What I'm trying to do is just some basic analytics on a product search system. There is a search service; every time a user performs a search, the search string and the results are stored in this file, and the file is sync'ed. I'm actually using pig to do some basic counts. It doesn't work, like I described, because the file looks empty to the map reduce components. I thought it was about pig, but I wasn't sure, so I tried a simple mr job, and used the word count to test whether the map reduce components actually see the sync'ed bytes. Of course, if I close the file, everything works perfectly, but I don't want to close the file every once in a while, since that means I should create another one (since there is no append support), and that would end up with too many tiny files, something we know is bad for mr performance, and I don't want to add more parts to this (like a file merging tool). I think using sync is a clean solution, since we don't care about writing performance, so I'd rather keep it like this if I can make it work. Any idea besides hadoop version? Thanks! Lucas
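For reference, a minimal sketch of the write-and-sync experiment discussed in this thread, on the hadoop 1.x API (the path and the written line are hypothetical):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncWriter {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // hypothetical log path; the writer keeps the file open while readers come and go
        FSDataOutputStream out = fs.create(new Path("/tmp/search-log.txt"));
        out.writeBytes("query=foo results=42\n");
        // push the bytes to readers without closing the file; note that
        // hadoop fs -ls may still report length 0 until the file is closed
        out.sync();
        // ... keep writing; close only when the writer shuts down
    }
}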
Re: Trouble in running MapReduce application
Can you try this ? Pick a class like WordCount from your package and execute this command:

javap -classpath <path to your jar> -verbose org.myorg.WordCount | grep version

For example, here's what I get for my class:

$ javap -verbose WCMapper | grep version
minor version: 0
major version: 50

Please paste the output of this - we can verify what the problem is. Thanks Hemanth

On Sat, Feb 23, 2013 at 4:45 PM, Fatih Haltas fatih.hal...@nyu.edu wrote: Hi again, Thanks for your help, but now I am struggling with the same problem on another machine. As with the previous problem, I just decreased the Java version to Java 6, but this time I could not solve the problem. These are the outputs that may explain the situation:

1. I could not run my own code. To check the system, I just tried to run the basic wordcount example without any modification, except the package info.

COMMAND EXECUTED: hadoop jar my.jar org.myorg.WordCount NetFlow NetFlow.out
Warning: $HADOOP_HOME is deprecated.
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/myorg/WordCount : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:266)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

2. Java version:

COMMAND EXECUTED: java -version
java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.6) (rhel-1.33.1.11.6.el5_9-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

3. JAVA_HOME variable:

COMMAND EXECUTED: echo $JAVA_HOME
/usr/lib/jvm/jre-1.6.0-openjdk.x86_64

4. HADOOP version:

COMMAND EXECUTED: hadoop version
Warning: $HADOOP_HOME is deprecated.
Hadoop 1.0.4
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290
Compiled by hortonfo on Wed Oct 3 05:13:58 UTC 2012
From source with checksum fe2baea87c4c81a2c505767f3f9b71f4

Are these still incompatible with each other (Hadoop version and Java version)? Thank you very much.

On Tue, Feb 19, 2013 at 10:26 PM, Fatih Haltas fatih.hal...@nyu.edu wrote: Thank you all very much

On Tuesday, February 19, 2013, Harsh J wrote: Oops. I just noticed Hemanth has been answering on a dupe thread as well. Let's drop this thread and carry on there :) On Tue, Feb 19, 2013 at 11:14 PM, Harsh J ha...@cloudera.com wrote: Hi, The new error usually happens if you compile using Java 7 and try to run via Java 6 (for example). That is, an incompatibility in the runtimes for the binary artifact produced. On Tue, Feb 19, 2013 at 10:09 PM, Fatih Haltas fatih.hal...@nyu.edu wrote: Thank you very much Harsh, Now, as I promised earlier, I am much obliged to you. But now I solved that problem by just changing the directories and then again creating a jar file of org, and I am getting this error: 1.)
What I got:

[hadoop@ADUAE042-LAP-V flowclasses_18_02]$ hadoop jar flow19028pm.jar org.myorg.MapReduce /home/hadoop/project/hadoop-data/NetFlow 19_02.out
Warning: $HADOOP_HOME is deprecated.
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/myorg/MapReduce : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at
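As an alternative to javap, the class file version can be read directly: a .class file begins with the magic number, then the minor and major version (major 50 = Java 6, major 51 = Java 7, which is exactly what the UnsupportedClassVersionError above complains about). A small self-contained checker:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ClassVersion {
    public static void main(String[] args) throws IOException {
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            int magic = in.readInt();           // 0xCAFEBABE for a valid class file
            int minor = in.readUnsignedShort();
            int major = in.readUnsignedShort(); // 50 = Java 6, 51 = Java 7
            System.out.printf("magic=%x minor=%d major=%d%n", magic, minor, major);
        } finally {
            in.close();
        }
    }
}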
Re: Reg job tracker page
Yes. It corresponds to the JT start time. Thanks hemanth

On Sat, Feb 23, 2013 at 5:37 PM, Manoj Babu manoj...@gmail.com wrote: Bharath, I can understand that it's a timestamp, but what does the identifier mean? Does it hold the time the job tracker instance was started? Cheers! Manoj.

On Sat, Feb 23, 2013 at 5:25 PM, bharath vissapragada bharathvissapragada1...@gmail.com wrote: Read it this way: 2013/02/21-12:22 (yyyymmdd-HHmm).

On Sat, Feb 23, 2013 at 4:20 PM, Manoj Babu manoj...@gmail.com wrote: Hi All, What does this identifier mean in the job tracker page?

State: RUNNING
Started: Thu Feb 21 12:22:03 CST 2013
Version: 0.20.2-cdh3u1, bdafb1dbffd0d5f2fbc6ee022e1c8df6500fd638
Compiled: Mon Jul 18 09:40:29 PDT 2011 by root from Unknown
Identifier: 201302211222

Thanks in advance. Cheers! Manoj.
Re: map reduce and sync
Hi Lucas, I tried something like this but got different results. I wrote code that opened a file on HDFS, wrote a line and called sync. Without closing the file, I ran a wordcount with that file as input. It did work fine and was able to count the words that were sync'ed (even though the file length seems to come as 0, like you noted in fs -ls). So, not sure what's happening in your case. In the MR job, do the job counters indicate no bytes were read ? On a different note, though, if you can describe a little more what you are trying to accomplish, we could probably work out a better solution. Thanks hemanth

On Sat, Feb 23, 2013 at 7:15 PM, Lucas Bernardi luc...@gmail.com wrote: Hello Hemanth, thanks for answering. The file is open by a separate process, not map reduce related at all. You can think of it as a servlet, receiving requests and writing them to this file; every time a request is received it is written and org.apache.hadoop.fs.FSDataOutputStream.sync() is invoked. At the same time, I want to run a map reduce job over this file. Simply running the word count example doesn't seem to work; it is as if the file were empty. hadoop fs -tail works just fine, and reading the file using org.apache.hadoop.fs.FSDataInputStream also works ok. Last thing, the web interface doesn't see the contents, and the command hadoop fs -ls says the file is empty. What am I doing wrong? Thanks! Lucas

On Sat, Feb 23, 2013 at 4:37 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Could you please clarify, are you opening the file in your mapper code and reading from there ? Thanks Hemanth

On Friday, February 22, 2013, Lucas Bernardi wrote: Hello there, I'm trying to use hadoop map reduce to process an open file. The writing process writes a line to the file and syncs the file to readers (org.apache.hadoop.fs.FSDataOutputStream.sync()). If I try to read the file from another process, it works fine, at least using org.apache.hadoop.fs.FSDataInputStream.
Re: map reduce and sync
Could you please clarify, are you opening the file in your mapper code and reading from there ? Thanks Hemanth
Re: Database insertion by HAdoop
Sqoop can be used to export as well. Thanks Hemanth

On Tuesday, February 19, 2013, Masoud wrote: Dear Tariq, No, exactly the opposite way: actually we compute the similarity between documents and insert the results into the database; every table has almost 2,000,000 records. Best Regards

On 02/19/2013 06:41 PM, Mohammad Tariq wrote: Hello Masoud, So you want to pull your data from SQL server to your Hadoop cluster first and then do the processing. Please correct me if I am wrong. You can do that using Sqoop, as mentioned by Hemanth sir. BTW, what exactly is the kind of processing which you are planning to do on your data? Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com

On Tue, Feb 19, 2013 at 6:44 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, You could consider using sqoop. http://sqoop.apache.org/ There seems to be a SQL Server connector from Microsoft: http://www.microsoft.com/en-gb/download/details.aspx?id=27584 Thanks Hemanth

On Tuesday, February 19, 2013, Masoud wrote: Hello Tariq, Our database is sql server 2008, and we don't need to develop a professional app; we just need to develop it fast and get our experiment results soon. Thanks
Re: ClassNotFoundException in Main
I am not sure if that will actually work, because the class is defined to be in the org.myorg package. I suggest you repackage to reflect the right package structure. Also, the error you are getting seems to indicate that you have compiled using JDK 7. Note that some versions of Hadoop are supported only on JDK 6. Which version of Hadoop are you using ? Thanks Hemanth

On Tuesday, February 19, 2013, Fatih Haltas wrote: Thank you very much. When I tried with wordcount_classes.org.myorg.WordCount, I got the following error:

[hadoop@ADUAE042-LAP-V project]$ hadoop jar wordcount_19_02.jar wordcount_classes.org.myorg.WordCount /home/hadoop/project/hadoop-data/NetFlow 19_02_wordcount.out
Warning: $HADOOP_HOME is deprecated.
Exception in thread "main" java.lang.UnsupportedClassVersionError: wordcount_classes/org/myorg/WordCount : Unsupported major.minor version 51.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:266)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

On Tue, Feb 19, 2013 at 8:10 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Sorry. I did not read the mail correctly. I think the error is in how the jar has been created. The classes start with root as wordcount_classes, instead of org. Thanks Hemanth

On Tuesday, February 19, 2013, Hemanth Yamijala wrote: Have you used the API setJarByClass in your main program? http://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/mapreduce/Job.html#setJarByClass(java.lang.Class)

On Tuesday, February 19, 2013, Fatih Haltas wrote: Hi everyone, I know it is a common mistake to not specify the class address while trying to run a jar; however, although I specified it, I am still getting the ClassNotFoundException. What may be the reason for it? I have been struggling with this problem for more than 2 days. I just wrote a different MapReduce application for some analysis and got this problem. To check whether there is something wrong with my system, I tried to run the WordCount example. When I just run hadoop-examples wordcount, it is working fine. But when I add just the "package org.myorg;" declaration at the beginning, it does not work. Here is what I have done so far:

1. I just copied the wordcount code from Apache's own examples source code and changed the package declaration to "package org.myorg;".

2. Then I tried to run this command:

hadoop jar wordcount_19_02.jar org.myorg.WordCount /home/hadoop/project/hadoop-data/NetFlow 19_02_wordcount.output

3. I got the following error:

[hadoop@ADUAE042-LAP-V project]$ hadoop jar wordcount_19_02.jar org.myorg.WordCount /home/hadoop/project/hadoop-data/NetFlow 19_02_wordcount.output
Warning: $HADOOP_HOME is deprecated.
Exception in thread "main" java.lang.ClassNotFoundException: org.myorg.WordCount
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:266)
at org.apache.hadoop.util.RunJar.main(RunJar.java:149)

4. This is the content of my .jar file:

[hadoop@ADUAE042-LAP-V project]$ jar tf wordcount_19_02.jar
META-INF/
META-INF/MANIFEST.MF
wordcount_classes/
wordcount_classes/org/
wordcount_classes/org/myorg/
wordcount_classes/org/myorg/WordCount.class
wordcount_classes/org/myorg/WordCount$TokenizerMapper.class
wordcount_classes/org/myorg/WordCount$IntSumReducer.class
Re: JUint test failing in HDFS when building Hadoop from source.
Hi, In the past, some tests have been flaky. It would be good if you can search JIRA and see whether this is a known issue. Else, please file it, and if possible, provide a patch. :) Regarding whether this will be a reliable build, it depends a little bit on what you are going to use it for. For development / test purposes, personally, I would still go with it. Thanks Hemanth

On Tuesday, February 19, 2013, Leena Rajendran wrote: Hi, I am posting for the first time. Please let me know if this needs to go to any other mailing list. I am trying to build Hadoop from source code, and I am able to successfully build until the Hadoop-Common-Project. However, in the case of HDFS, the test called TestHftpURLTimeouts is failing intermittently. Please note that when I run the test individually, it passes. I have taken care of all steps given in the HowToContribute twiki page of Hadoop. Please let me know:

1. Whether this kind of behaviour is expected
2. Whether this intermittent test case failure can be ignored
3. Whether it will be a reliable build if I use the -DskipTests option in the mvn command

Adding build results below for the following command: mvn -e package -Pdist,native -Dtar

Results:

Failed tests: testHftpSocketTimeout(org.apache.hadoop.hdfs.TestHftpURLTimeouts): expected:<connect timed out> but was:<null>
Tests run: 705, Failures: 1, Errors: 0, Skipped: 3

[INFO] Reactor Summary:
[INFO] Apache Hadoop Main ................................ SUCCESS [2.280s]
[INFO] Apache Hadoop Project POM ......................... SUCCESS [2.339s]
[INFO] Apache Hadoop Annotations ......................... SUCCESS [3.916s]
[INFO] Apache Hadoop Assemblies .......................... SUCCESS [0.323s]
[INFO] Apache Hadoop Project Dist POM .................... SUCCESS [2.762s]
[INFO] Apache Hadoop Auth ................................ SUCCESS [7.695s]
[INFO] Apache Hadoop Auth Examples ....................... SUCCESS [1.997s]
[INFO] Apache Hadoop Common .............................. SUCCESS [14:13.276s]
[INFO] Apache Hadoop Common Project ...................... SUCCESS [0.026s]
[INFO] Apache Hadoop HDFS ................................ FAILURE [1:36:02.602s]
[INFO] Apache Hadoop HttpFS .............................. SKIPPED
[INFO] Apache Hadoop HDFS BookKeeper Journal ............. SKIPPED
[INFO] Apache Hadoop HDFS Project ........................ SKIPPED
[INFO] hadoop-yarn ....................................... SKIPPED
[INFO] hadoop-yarn-api ................................... SKIPPED
[INFO] hadoop-yarn-common ................................ SKIPPED
[INFO] hadoop-yarn-server ................................ SKIPPED
[INFO] hadoop-yarn-server-common ......................... SKIPPED
[INFO] hadoop-yarn-server-nodemanager .................... SKIPPED
[INFO] hadoop-yarn-server-web-proxy ...................... SKIPPED
[INFO] hadoop-yarn-server-resourcemanager ................ SKIPPED
[INFO] hadoop-yarn-server-tests .......................... SKIPPED
[INFO] hadoop-yarn-client ................................ SKIPPED
[INFO] hadoop-yarn-applications .......................... SKIPPED
[INFO] hadoop-yarn-applications-distributedshell ......... SKIPPED
[INFO] hadoop-mapreduce-client ........................... SKIPPED
[INFO] hadoop-mapreduce-client-core ...................... SKIPPED
[INFO] hadoop-yarn-applications-unmanaged-am-launcher .... SKIPPED
[INFO] hadoop-yarn-site .................................. SKIPPED
[INFO] hadoop-yarn-project ............................... SKIPPED
[INFO] hadoop-mapreduce-client-common .................... SKIPPED
[INFO] hadoop-mapreduce-client-shuffle ................... SKIPPED
[INFO] hadoop-mapreduce-client-app ....................... SKIPPED
[INFO] hadoop-mapreduce-client-hs ........................ SKIPPED
[INFO] hadoop-mapreduce-client-jobclient ................. SKIPPED
[INFO] hadoop-mapreduce-client-hs-plugins ................ SKIPPED
[INFO] Apache Hadoop MapReduce Examples .................. SKIPPED
[INFO] hadoop-mapreduce .................................. SKIPPED
[INFO] Apache Hadoop MapReduce Streaming ................. SKIPPED
[INFO] Apache Hadoop Distributed Copy .................... SKIPPED
[INFO] Apache Hadoop Archives ............................ SKIPPED
[INFO] Apache Hadoop Rumen ............................... SKIPPED
[INFO] Apache Hadoop Gridmix ............................. SKIPPED
[INFO] Apache Hadoop Data Join ........................... SKIPPED
[INFO] Apache Hadoop Extras .............................. SKIPPED
[INFO] Apache Hadoop Pipes ............................... SKIPPED
[INFO] Apache Hadoop Tools Dist .......................... SKIPPED
[INFO] Apache Hadoop Tools ............................... SKIPPED
[INFO] Apache Hadoop Distribution ........................ SKIPPED
[INFO] Apache Hadoop Client
Re: Database insertion by HAdoop
What database is this ? Was hbase mentioned ?

On Monday, February 18, 2013, Mohammad Tariq wrote: Hello Masoud, You can use the Bulk Load feature. You might find it more efficient than normal client APIs or using the TableOutputFormat. The bulk load feature uses a MapReduce job to output table data in HBase's internal data format, and then directly loads the generated StoreFiles into a running cluster. Using bulk load will use less CPU and network resources than simply using the HBase API. For detailed info you can go here : http://hbase.apache.org/book/arch.bulk.load.html Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com

On Mon, Feb 18, 2013 at 5:00 PM, Masoud mas...@agape.hanyang.ac.kr wrote: Dear All, We are going to run the experiments for a scientific paper. We must insert data into our database for later consideration; there are almost 300 tables, each with 2,000,000 records. As you know, it takes a lot of time to do this with a single machine, so we are going to use our Hadoop cluster (32 machines) and divide the 300 insertion tasks between them. I need some hints to progress faster: 1- As far as I know, we don't need a Reducer; just a Mapper is enough. 2- So we just need to implement the Mapper class with the needed code. Please let me know if there is any point. Best Regards Masoud
Re: How to understand DataNode usages ?
This seems to be related to the % used capacity at a datanode. The values are computed for all the live datanodes, and the range / central limits / deviations are computed based on a sorted list of those values. Thanks hemanth

On Thu, Feb 14, 2013 at 2:42 PM, Dhanasekaran Anbalagan bugcy...@gmail.com wrote: Hi Guys, On the DataNode UI page they give DataNode usage. What does it actually mean? Please guide me on Min, Median, Max and stdev:

DataNodes usages:  Min %    Median %   Max %    stdev %
                   22.15 %  24.33 %    58.09 %  15.4 %

-Dhanasekaran. Did I learn something today? If not, I wasted it.
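For intuition, here is a sketch of how such statistics could be derived from the per-datanode used-capacity percentages; the exact formula the UI uses is not shown in this thread, so treat the sample values and the population-stdev choice as illustrative:

import java.util.Arrays;

public class UsageStats {
    public static void main(String[] args) {
        double[] used = {22.15, 24.33, 58.09, 24.33}; // hypothetical per-node used %
        Arrays.sort(used);
        int n = used.length;
        double min = used[0];
        double max = used[n - 1];
        double median = (n % 2 == 1) ? used[n / 2]
                : (used[n / 2 - 1] + used[n / 2]) / 2.0;
        double mean = 0;
        for (double u : used) mean += u;
        mean /= n;
        double var = 0;
        for (double u : used) var += (u - mean) * (u - mean);
        double stdev = Math.sqrt(var / n);
        System.out.printf("min=%.2f median=%.2f max=%.2f stdev=%.2f%n",
                min, median, max, stdev);
    }
}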
Re: Java submit job to remote server
Can you please include the complete stack trace and not just the root. Also, have you set fs.default.name to a hdfs location like hdfs://localhost:9000 ? Thanks Hemanth

On Wednesday, February 13, 2013, Alex Thieme wrote: Thanks for the prompt reply, and I'm sorry I forgot to include the exception. My bad. I've included it below. There certainly appears to be a server running on localhost:9001. At least, I was able to telnet to that address. While in development, I'm treating the server on localhost as the remote server. Moving to production, there'd obviously be a different remote server address configured.

Root Exception stack trace:
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
+ 3 more (set debug level logging or '-Dmule.verbose.exceptions=true' for everything)

On Feb 12, 2013, at 4:22 PM, Nitin Pawar nitinpawar...@gmail.com wrote: conf.set("mapred.job.tracker", "localhost:9001"); means that your jobtracker is on port 9001 on localhost. If you change it to the remote host, and that's the port it is running on, then it should work as expected. What's the exception you are getting?

On Wed, Feb 13, 2013 at 2:41 AM, Alex Thieme athi...@athieme.com wrote: I apologize for asking what seems to be such a basic question, but I could use some help with submitting a job to a remote server. I have downloaded and installed hadoop locally in pseudo-distributed mode. I have written some Java code to submit a job. Here's the org.apache.hadoop.util.Tool and org.apache.hadoop.mapreduce.Mapper I've written. If I enable the conf.set("mapred.job.tracker", "localhost:9001") line, then I get the exception included below. If that line is disabled, then the job is completed. However, in reviewing the hadoop server administration page (http://localhost:50030/jobtracker.jsp) I don't see the job as processed by the server. Instead, I wonder if my Java code is simply running the necessary mapper Java code, bypassing the locally installed server. Thanks in advance. Alex

public class OfflineDataTool extends Configured implements Tool {
    public int run(final String[] args) throws Exception {
        final Configuration conf = getConf();
        // conf.set("mapred.job.tracker", "localhost:9001");
        final Job job = new Job(conf);
        job.setJarByClass(getClass());
        job.setJobName(getClass().getName());
        job.setMapperClass(OfflineDataMapper.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new org.apache.hadoop.fs.Path(args[0]));
        final org.apache.hadoop.fs.Path output = new org.a
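For completeness, a minimal sketch of pointing a client-side Configuration at a remote cluster, using the MR1-era keys discussed in this thread (the host name and ports are hypothetical placeholders for your cluster's NameNode and JobTracker addresses):

import org.apache.hadoop.conf.Configuration;

public class RemoteClusterConf {
    public static Configuration create() {
        Configuration conf = new Configuration();
        // NameNode address; must be an hdfs:// URI
        conf.set("fs.default.name", "hdfs://remotehost:9000");
        // JobTracker address
        conf.set("mapred.job.tracker", "remotehost:9001");
        return conf;
    }
}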
Re: Cannot use env variables in hodrc
Hi, Hadoop On Demand is no longer supported with recent releases of Hadoop. There is no separate user list for HOD related questions. Which version of Hadoop are you using right now ? Thanks hemanth

On Wed, Feb 6, 2013 at 8:59 PM, Mehmet Belgin mehmet.bel...@oit.gatech.edu wrote: Hello again, Considering that I have not received any replies, I was wondering whether this is not the correct list for hod questions? Please let me know if I should better direct this question to another list, a hod-specific one perhaps? Thank you! On a related note, env-vars is also being ignored:

env-vars = HOD_PYTHON_HOME=/usr/local/packages/python/2.5.1/bin/python2.5

And hod picks the system-default python and terminates with errors unless I manually export HOD_PYTHON_HOME:

export HOD_PYTHON_HOME=`which python2.5`

I am also having problems having hod use the cluster I created, but I assume those issues are also related. How can I make sure that the hodrc contents are passed correctly into hod? Thanks a lot in advance!

On Feb 5, 2013, at 4:41 PM, Mehmet Belgin wrote: Hello everyone, I am setting up Hadoop for the first time, so please bear with me while I ask all these beginner questions :) I followed the instructions to create a hodrc, but it looks like I cannot use env variables in this file:

error: bin/hod failed to start.
error: invalid 'java-home' specified in section hod (--hod.java-home): ${JAVA_HOME}
error: invalid 'batch-home' specified in section resource_manager (--resource_manager.batch-home): ${RM_HOME}

... despite the fact that I have ${JAVA_HOME} and ${RM_HOME} correctly defined in my environment. When I replace these variables with full explicit paths, it works. I checked the permissions, and everything else looks fine. What am I missing here? Thanks!
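For reference, the workaround that worked above amounts to writing the expanded values into hodrc by hand. A sketch of the relevant entries (the section and key names come from the error messages above; the paths are hypothetical and should point at your actual JDK and resource manager installs):

[hod]
java-home = /usr/lib/jvm/java-1.6.0

[resource_manager]
batch-home = /opt/torque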
Re: How to find Blacklisted Nodes via cli.
Hi, A partial answer: you can get the blacklisted tasktrackers using the command line: mapred job -list-blacklisted-trackers. Also, I think that a blacklisted tasktracker becomes 'unblacklisted' if it works fine after some time, though I am not very sure about this. Thanks hemanth

On Wed, Jan 30, 2013 at 9:35 PM, Dhanasekaran Anbalagan bugcy...@gmail.com wrote: Hi Guys, How do I find blacklisted nodes via the command line? I want to see the Job Tracker blacklisted nodes and the HDFS blacklisted nodes, and also how to clear the blacklisted nodes for a clean start. Is the only option to restart the service, or is there some other way to clear the blacklisted nodes? Please guide me. -Dhanasekaran. Did I learn something today? If not, I wasted it.
Re: Filesystem closed exception
FS Caching is enabled on the cluster (i.e. the default is not changed). Our code isn't actually mapper code, but a standalone java program being run as part of Oozie. It just seemed confusing and not a very clear strategy to leave unclosed resources. Hence my suggestion to get an uncached FS handle for this use case alone. Note, I am not suggesting to disable FS caching in general. Thanks Hemanth

On Thu, Jan 31, 2013 at 12:19 AM, Alejandro Abdelnur t...@cloudera.com wrote: Hemanth, Is FS caching enabled or not in your cluster? A simple solution would be to modify your mapper code not to close the FS. It will go away when the task ends anyway. Thx -- Alejandro
Re: Filesystem closed exception
Thanks, Harsh. Particularly for pointing out HADOOP-7973. On Fri, Jan 25, 2013 at 11:51 AM, Harsh J ha...@cloudera.com wrote: It is pretty much the same in 0.20.x as well, IIRC. Your two points are also correct (for a fix to this). Also see: https://issues.apache.org/jira/browse/HADOOP-7973. On Fri, Jan 25, 2013 at 6:56 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, We are noticing a problem where we get a filesystem closed exception when a map task is done and is finishing execution. By map task, I literally mean the MapTask class of the map reduce code. Debugging this we found that the mapper is getting a handle to the filesystem object and itself calling a close on it. Because filesystem objects are cached, I believe the behaviour is as expected in terms of the exception. I just wanted to confirm that: - if we do have a requirement to use a filesystem object in a mapper or reducer, we should either not close it ourselves - or (seems better to me) ask for a new version of the filesystem instance by setting the fs.hdfs.impl.disable.cache property to true in job configuration. Also, does anyone know if this behaviour was any different in Hadoop 0.20 ? For some context, this behaviour is actually seen in Oozie, which runs a launcher mapper for a simple java action. Hence, the java action could very well interact with a file system. I know this is probably better addressed in Oozie context, but wanted to get the map reduce view of things. Thanks, Hemanth -- Harsh J
Re: mappers-node relationship
This may be of some use, about how maps are decided: http://wiki.apache.org/hadoop/HowManyMapsAndReduces Thanks Hemanth On Friday, January 25, 2013, jamal sasha wrote: Hi. A very very lame question. Does the number of mappers depend on the number of nodes I have? How I imagine map-reduce is this: for example, in the word count example I have a bunch of slave nodes. The documents are distributed across these slave nodes. Now depending on how big the data is, it will spread across the slave nodes, and that is how my number of mappers is decided. I am sure this is a wrong understanding, as in pseudo-distributed mode I can see multiple mappers. So the question is: how does a single node machine run multiple mappers? Are they run in parallel or sequentially? Any resources to learn these? Thanks
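To make the wiki page's point concrete: the number of map tasks is driven by the number of input splits, not by the number of nodes. A simplified sketch of the FileInputFormat-style split-size arithmetic (an approximation for illustration, not the exact Hadoop source):

    // Simplified: how many map tasks a single file typically produces.
    public class SplitMath {
      public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;    // e.g. dfs.block.size
        long minSize = 1;                      // min split size (default)
        long maxSize = Long.MAX_VALUE;         // max split size (default)
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        long fileLength = 1024L * 1024 * 1024; // one 1 GB input file
        long splits = (fileLength + splitSize - 1) / splitSize;
        System.out.println(splits + " splits -> " + splits + " map tasks"); // 16
      }
    }

On a pseudo-distributed single node, those map tasks still exist; they run in parallel up to the number of configured map slots on the tasktracker, and the rest wait their turn.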
Re: TT nodes distributed cache failure
Could you post the stack trace from the job logs ? Also, looking at the task tracker logs on the failed nodes may help. Thanks Hemanth On Friday, January 25, 2013, Terry Healy wrote: Running hadoop-0.20.2 on a 20 node cluster. When running a Map/Reduce job that uses several .jars loaded into the distributed cache, several (~4) nodes have their map jobs fail because of ClassNotFoundException. All the other nodes proceed through the job normally and the job completes. But this is wasting 20-25% of my TT nodes. Can anyone explain why some nodes might fail to read all the .jars from the distributed cache? Thanks
Filesystem closed exception
Hi, We are noticing a problem where we get a filesystem closed exception when a map task is done and is finishing execution. By map task, I literally mean the MapTask class of the map reduce code. Debugging this we found that the mapper is getting a handle to the filesystem object and itself calling a close on it. Because filesystem objects are cached, I believe the behaviour is as expected in terms of the exception. I just wanted to confirm that: - if we do have a requirement to use a filesystem object in a mapper or reducer, we should either not close it ourselves - or (seems better to me) ask for a new version of the filesystem instance by setting the fs.hdfs.impl.disable.cache property to true in job configuration. Also, does anyone know if this behaviour was any different in Hadoop 0.20 ? For some context, this behaviour is actually seen in Oozie, which runs a launcher mapper for a simple java action. Hence, the java action could very well interact with a file system. I know this is probably better addressed in Oozie context, but wanted to get the map reduce view of things. Thanks, Hemanth
Re: Where do/should .jar files live?
On top of what Bejoy said, just wanted to add that when you submit a job to Hadoop using the hadoop jar command, the jars which you reference in the command on the edge/client node will be picked up by Hadoop and made available to the cluster nodes where the mappers and reducers run. Thanks Hemanth On Wed, Jan 23, 2013 at 8:24 AM, bejoy.had...@gmail.com wrote: Hi Chris In larger clusters it is better to have an edge/client node where all the user jars reside and you trigger your MR jobs from there. A client/edge node is a server with hadoop jars and conf but hosting no daemons. In smaller clusters one DN might act as the client node and you can execute your jars from there. Here you have a risk of that DN getting filled if the files are copied to hdfs from this DN (as per the block placement policy one replica would always be on this node). In oozie you put your executables into hdfs. But oozie comes at an integration level. In the initial development phase, developers put the jar into the LFS on the client node, execute and test their code. Regards Bejoy KS Sent from remote device, Please excuse typos From: Chris Embree cemb...@gmail.com Date: Tue, 22 Jan 2013 14:24:40 -0500 To: user@hadoop.apache.org ReplyTo: user@hadoop.apache.org Subject: Where do/should .jar files live? Hi List, This should be a simple question, I think. Disclosure, I am not a java developer. ;) We're getting ready to build our Dev and Prod clusters. I'm pretty comfortable with HDFS and how it sits atop several local file systems on multiple servers. I'm fairly comfortable with the concept of Map/Reduce and why it's cool and we want it. Now for the question. Where should my developers put and store their jar files? Or asked another way, what's the best entry point for submitting jobs? We have separate physical systems for NN, Checkpoint Node (formerly 2nn), Job Tracker and Standby NN. Should I run from the JT node? Do I keep all of my finished .jar's on the JT local file system? Or should I expect that jobs will be run via Oozie? Do I put jars on the local Oozie FS? Thanks in advance. Chris
Re: passing arguments to hadoop job
Hi, Please note that you are referring to a very old version of Hadoop; the current stable release is Hadoop 1.x. The API has changed in 1.x. Take a look at the wordcount example here: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Example%3A+WordCount+v2.0 But, in principle your method should work. I wrote it using the new API in a similar fashion and it worked fine. Can you show the code of your driver program (i.e. where you have main) ? Thanks hemanth On Tue, Jan 22, 2013 at 5:22 AM, jamal sasha jamalsha...@gmail.com wrote: Hi, Let's say I have the standard helloworld program http://hadoop.apache.org/docs/r0.17.0/mapred_tutorial.html#Example%3A+WordCount+v2.0 Now, let's say, I want to start the counting not from zero but from 20. So my reference line is 20. I modified the Reduce code as follows: public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { private static int baseSum; public void configure(JobConf job) { baseSum = Integer.parseInt(job.get("basecount")); } public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = baseSum; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } And in main added: conf.setInt("basecount", 20); So my hope was this should have done the trick.. But it's not working; the code is running normally :( How do I resolve this? Thanks
Re: passing arguments to hadoop job
OK. The easiest way I can think of for debugging this is to add a System.out.println in your Reduce.configure code. The output will come in the logs specific to your reduce tasks. You can access these logs from the web UI of the jobtracker. Navigate to your job page from the Jobtracker UI -> reduce -> select any task -> click on the task log links. Look under 'stdout'. Thanks Hemanth On Tue, Jan 22, 2013 at 11:19 AM, jamal sasha jamalsha...@gmail.com wrote: Hi, The driver code is actually the same as the old java word count example; copying from the site: public static void main(String[] args) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setInt("basecount", 20); // added this line conf.setMapperClass(Map.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); } Reducer class: public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { private static int baseSum; public void configure(JobConf job) { baseSum = Integer.parseInt(job.get("basecount")); } public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = baseSum; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } On Mon, Jan 21, 2013 at 8:29 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, Please note that you are referring to a very old version of Hadoop; the current stable release is Hadoop 1.x. The API has changed in 1.x. Take a look at the wordcount example here: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html#Example%3A+WordCount+v2.0 But, in principle your method should work. I wrote it using the new API in a similar fashion and it worked fine. Can you show the code of your driver program (i.e. where you have main) ? Thanks hemanth On Tue, Jan 22, 2013 at 5:22 AM, jamal sasha jamalsha...@gmail.com wrote: Hi, Let's say I have the standard helloworld program http://hadoop.apache.org/docs/r0.17.0/mapred_tutorial.html#Example%3A+WordCount+v2.0 Now, let's say, I want to start the counting not from zero but from 20. So my reference line is 20. I modified the Reduce code as follows: public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { private static int baseSum; public void configure(JobConf job) { baseSum = Integer.parseInt(job.get("basecount")); } public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = baseSum; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } And in main added: conf.setInt("basecount", 20); So my hope was this should have done the trick.. But it's not working; the code is running normally :( How do I resolve this? Thanks
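For reference, the same pattern written against the new (org.apache.hadoop.mapreduce) API, as mentioned earlier in the thread; a minimal sketch, with the property name taken from the code above:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class BaseCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private int baseSum;

      @Override
      protected void setup(Context context) {
        // Reads the value set in the driver via conf.setInt("basecount", 20).
        baseSum = context.getConfiguration().getInt("basecount", 0);
      }

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = baseSum;
        for (IntWritable v : values) {
          sum += v.get();
        }
        context.write(key, new IntWritable(sum));
      }
    }

One thing worth checking while debugging the original code: the driver also registers the same class as the combiner, and a combiner may run zero or more times, so the base value could be added more than once per key. Dropping the setCombinerClass line is an easy way to rule that out.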
Re: How to unit test mappers reading data from DistributedCache?
Hi, Not sure how to do it using MRUnit, but it should be possible to do this using a mocking framework like Mockito or EasyMock. In a mapper (or reducer), you'd use the Context classes to get the DistributedCache files. By mocking these to return what you want, you could potentially run a true unit test. Thanks Hemanth On Fri, Jan 18, 2013 at 1:37 AM, Barak Yaish barak.ya...@gmail.com wrote: Hi, I've found MRUnit a very easy way to unit test jobs; is it possible as well to test mappers reading data from DistributedCache? If yes, can you share an example of how the test's setup() should look? Thanks.
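One sketch of that idea: isolate the cache lookup behind an overridable method so the test can substitute a local file (class and file names below are hypothetical; the lookup call itself is the standard 1.x DistributedCache API):

    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
      // Isolated so tests can substitute the DistributedCache lookup.
      protected Path[] getCacheFiles(Context context) throws IOException {
        return DistributedCache.getLocalCacheFiles(context.getConfiguration());
      }

      @Override
      protected void setup(Context context) throws IOException {
        Path[] cached = getCacheFiles(context);
        // ... load lookup data from 'cached' ...
      }
    }

    // In the unit test, override (or spy with Mockito) to point at a test file:
    // LookupMapper mapper = new LookupMapper() {
    //   @Override
    //   protected Path[] getCacheFiles(Context context) {
    //     return new Path[] { new Path("src/test/resources/lookup.txt") };
    //   }
    // };

The overridden mapper instance can then be handed to MRUnit's MapDriver as usual, so the two approaches combine rather than conflict.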
Re: tcp error
Coincidentally, I faced the same issue just now. In my case, it turned out that I was running Hadoop daemons in pseudo-distributed mode, and in between a machine suspend and restart, the network configuration changed. The logs link was referring to the older IP address in use in the URL and thus failed when I tried to open it. Restarting the daemons helped. I don't think this problem will come up in a normal up-and-running production cluster. Thanks hemanth On Thu, Jan 17, 2013 at 9:48 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: At the place where you get the error, can you cross check what the URL is that is being accessed ? Also, can you compare it with the URLs of pages before this that work ? Thanks hemanth On Thu, Jan 17, 2013 at 1:08 AM, jamal sasha jamalsha...@gmail.com wrote: I am inside a network where I need proxy settings to access the internet. I have a weird problem. The internet is working fine. But it is one particular instance when I get this error: Network Error (tcp_error) A communication error occurred: Operation timed out The Web Server may be down, too busy, or experiencing other problems preventing it from responding to requests. You may wish to try again at a later time. For assistance, contact your network support team. This happens when I use hadoop in local mode. I can access the UI interface. I can see the jobs running. But when I try to see the logs of each task, I am not able to access those logs. UI -> job -> map -> task -> all -> this is where the error is. Any clues? Thanks
Re: Biggest cluster running YARN in the world?
You may get more updated information from folks at Yahoo!, but here is a mail on the hadoop-general mailing list that has some statistics: http://www.mail-archive.com/general@hadoop.apache.org/msg05592.html Please note it is a little dated, so things should be better now :-) Thanks hemanth On Tue, Jan 15, 2013 at 7:26 AM, Tan, Wangda wangda@emc.com wrote: Hi guys, I've had a question in my head for a long time: what's the biggest cluster running YARN? I've heard some rumor about a big cluster running map-reduce 1.0 with 10,000+ nodes, but rarely hear such rumors about YARN. Welcome any message about this, like inside information or rumor :-p. -- Thanks, Wangda
Re: Compile error using contrib.utils.join package with new mapreduce API
On the dev mailing list, Harsh pointed out that there is another join related package: http://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/join/ This seems to be available in 2.x and trunk. Could you check if this provides the functionality you require - so we at least know there is new API support in later versions ? Thanks Hemanth On Mon, Jan 14, 2013 at 7:45 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, No. I didn't find any reference to a working sample. I also didn't find any JIRA that asks for a migration of this package to the new API. Not sure why. I have asked on the dev list. Thanks hemanth On Mon, Jan 14, 2013 at 6:25 PM, Michael Forage michael.for...@livenation.co.uk wrote: Thanks Hemanth I appreciate your response Did you find any working example of it in use? It looks to me like I'd still be tied to the old API Thanks Mike From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: 14 January 2013 05:08 To: user@hadoop.apache.org Subject: Re: Compile error using contrib.utils.join package with new mapreduce API Hi, The datajoin package has a class called DataJoinJob ( http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/contrib/utils/join/DataJoinJob.html ) I think using this will help you get around the issue you are facing. From the source, this is the command line usage of the class: usage: DataJoinJob inputdirs outputdir map_input_file_format numofParts mapper_class reducer_class map_output_value_class output_value_class [maxNumOfValuesPerGroup [descriptionOfJob]]] Internally the class uses the old API to set the mapper and reducer passed as arguments above. Thanks hemanth On Fri, Jan 11, 2013 at 9:00 PM, Michael Forage michael.for...@livenation.co.uk wrote: Hi I'm using Hadoop 1.0.4 and the hadoop.mapreduce API, and am having problems compiling a simple class to implement a reduce-side data join of 2 files. I'm trying to do this using contrib.utils.join and in Eclipse it all compiles fine other than: job.setMapperClass(MapClass.class); job.setReducerClass(Reduce.class); ...which both complain that the referenced class no longer extends either Mapper or Reducer. It's my understanding that they should instead extend DataJoinMapperBase and DataJoinReducerBase. I have searched for a solution everywhere but unfortunately, all the examples I can find are based on the deprecated mapred API. Assuming this package actually works with the new API, can anyone offer any advice? Complete compile errors: The method setMapperClass(Class<? extends Mapper>) in the type Job is not applicable for the arguments (Class<DataJoin.MapClass>) The method setReducerClass(Class<? extends Reducer>) in the type Job is not applicable for the arguments (Class<DataJoin.Reduce>) ...and the code... package JoinTest; import java.io.DataInput; import java.io.DataOutput; import java.io.IOException; import java.util.Iterator; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.io.Writable; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.Mapper.Context; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase; import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase; import org.apache.hadoop.contrib.utils.join.TaggedMapOutput; public class DataJoin extends Configured implements Tool { public static class MapClass extends DataJoinMapperBase { protected Text generateInputTag(String inputFile) { String datasource = inputFile.split("-")[0]; return new Text(datasource); } protected Text generateGroupKey
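For anyone landing on this thread later: a hedged sketch of how the mapreduce.lib.join package linked above is typically wired up in a 2.x driver. The paths and join operation are illustrative; the inputs must already be sorted by key and identically partitioned, and the exact constant and method names should be checked against the version in use:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;

    public class JoinDriverSketch {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-side-join");
        job.setInputFormatClass(CompositeInputFormat.class);
        // Compose an inner join over two sorted, equally partitioned inputs.
        job.getConfiguration().set(CompositeInputFormat.JOIN_EXPR,
            CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
                new Path("/data/left"), new Path("/data/right")));
        // Map tasks then receive (Text key, TupleWritable value) pairs,
        // one tuple position per joined input.
      }
    }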
Re: FileSystem.workingDir vs mapred.local.dir
Hi, AFAIK, the mapred.local.dir property refers to a set of directories under which different types of data related to mapreduce jobs are stored - e.g. intermediate data, localized files for a job etc. The working directory for a mapreduce job is configured under a sub directory within one of the directories configured in this property. The workingDir property of the FileSystem class simply seems to indicate the working directory for a given filesystem as set by applications. They don't seem very related per se, unless I am missing something ? Thanks Hemanth On Tue, Jan 15, 2013 at 2:54 AM, Jay Vyas jayunit...@gmail.com wrote: Hi guys: What is the relationship between the working directory in the FileSystem class (filesystem.workingDir) compared with the mapred.local.dir properties ? It seems like these would essentially refer to the same thing? -- Jay Vyas http://jayunit100.blogspot.com
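A tiny sketch to make the distinction concrete (1.x property name; output will vary by cluster):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class WorkingDirVsLocalDir {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Per-FileSystem notion: base for relative paths, e.g. /user/<name> on HDFS.
        System.out.println("fs working dir:   " + fs.getWorkingDirectory());
        // TaskTracker-side notion: local disks for intermediate/localized job data.
        System.out.println("mapred.local.dir: " + conf.get("mapred.local.dir"));
      }
    }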
Re: config file locations in Hadoop 2.0.2
Hi, One place where I could find the capacity-scheduler.xml was from source - hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/resources. AFAIK, the masters file is only used for starting the secondary namenode - which in 2.x has been replaced by a proper HA solution. So, I think there is no need for this file anymore. Please refer to this link for more details on the HA solution: http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailability.html Thanks hemanth On Wed, Jan 16, 2013 at 3:15 AM, Panshul Whisper ouchwhis...@gmail.com wrote: Hello, I was setting up Hadoop 2.0.2. Since I have already set up Hadoop 1.0.x, I have a fair idea where all the files are and what to put in each of them. In the case of Hadoop 2.0.2, the config files have been moved to [hadoop install directory]/etc/hadoop, but I still cannot find: capacity-scheduler.xml, and masters - for listing master nodes. Please help me set up this version. Thanking You, -- Regards, Ouch Whisper 010101010101
Re: Compile error using contrib.utils.join package with new mapreduce API
Hi, No. I didn't find any reference to a working sample. I also didn't find any JIRA that asks for a migration of this package to the new API. Not sure why. I have asked on the dev list. Thanks hemanth On Mon, Jan 14, 2013 at 6:25 PM, Michael Forage michael.for...@livenation.co.uk wrote: Thanks Hemanth I appreciate your response Did you find any working example of it in use? It looks to me like I'd still be tied to the old API Thanks Mike From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: 14 January 2013 05:08 To: user@hadoop.apache.org Subject: Re: Compile error using contrib.utils.join package with new mapreduce API Hi, The datajoin package has a class called DataJoinJob ( http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/contrib/utils/join/DataJoinJob.html ) I think using this will help you get around the issue you are facing. From the source, this is the command line usage of the class: usage: DataJoinJob inputdirs outputdir map_input_file_format numofParts mapper_class reducer_class map_output_value_class output_value_class [maxNumOfValuesPerGroup [descriptionOfJob]]] Internally the class uses the old API to set the mapper and reducer passed as arguments above. Thanks hemanth On Fri, Jan 11, 2013 at 9:00 PM, Michael Forage michael.for...@livenation.co.uk wrote: Hi I'm using Hadoop 1.0.4 and the hadoop.mapreduce API, and am having problems compiling a simple class to implement a reduce-side data join of 2 files. I'm trying to do this using contrib.utils.join and in Eclipse it all compiles fine other than: job.setMapperClass(MapClass.class); job.setReducerClass(Reduce.class); ...which both complain that the referenced class no longer extends either Mapper or Reducer. It's my understanding that they should instead extend DataJoinMapperBase and DataJoinReducerBase. I have searched for a solution everywhere but unfortunately, all the examples I can find are based on the deprecated mapred API. Assuming this package actually works with the new API, can anyone offer any advice? Complete compile errors: The method setMapperClass(Class<? extends Mapper>) in the type Job is not applicable for the arguments (Class<DataJoin.MapClass>) The method setReducerClass(Class<? extends Reducer>) in the type Job is not applicable for the arguments (Class<DataJoin.Reduce>) ...and the code... package JoinTest; import java.io.DataInput; import java.io.DataOutput; import java.io.IOException; import java.util.Iterator; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf.Configured; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.io.Writable; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.Mapper.Context; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase; import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase; import org.apache.hadoop.contrib.utils.join.TaggedMapOutput; public class DataJoin extends Configured implements Tool { public static class MapClass extends DataJoinMapperBase { protected Text generateInputTag(String inputFile) { String datasource = inputFile.split("-")[0]; return new Text(datasource); } protected Text generateGroupKey(TaggedMapOutput aRecord) { String line = ((Text) aRecord.getData()).toString(); String[] tokens = line.split(","); String groupKey = tokens[0]; return new Text(groupKey); } protected TaggedMapOutput generateTaggedMapOutput(Object value) { TaggedWritable retv = new TaggedWritable((Text) value); retv.setTag(this.inputTag); return retv
Re: log server for hadoop MR jobs??
To add to that, log aggregation is a feature available with Hadoop 2.0 (where mapreduce is re-written to YARN). The functionality is available via the History Server: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html Thanks hemanth On Sat, Jan 12, 2013 at 12:08 AM, shashwat shriparv dwivedishash...@gmail.com wrote: Have a look at flume. On Fri, Jan 11, 2013 at 11:58 PM, Xiaowei Li sell...@gmail.com wrote: ...collect all log generated from ... Shashwat Shriparv
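As a concrete starting point for the REST interface linked above: the endpoints are served by the history server daemon, so something like http://<history-server-host>:19888/ws/v1/history/mapreduce/jobs should list completed jobs (19888 is the usual default web port; the host, port and exact paths can vary with configuration and version, so treat this URL as an example rather than a guarantee).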
Re: JobCache directory cleanup
Hmm. Unfortunately, there is another config variable that may be affecting this: keep.task.files.pattern. This is set to .* in the job.xml file you sent. I suspect this may be causing the problem. Can you please remove this, assuming you have not set it intentionally ? Thanks Hemanth On Fri, Jan 11, 2013 at 3:28 PM, Ivan Tretyakov itretya...@griddynamics.com wrote: Thanks for replies! keep.failed.task.files is set to false. Config of one of the jobs attached. On Fri, Jan 11, 2013 at 5:44 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Good point. Forgot that one :-) On Thu, Jan 10, 2013 at 10:53 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Can you check the job configuration for these ~100 jobs? Do they have keep.failed.task.files set to true? If so, these files won't be deleted. If it doesn't, it could be a bug. Sharing your configs for these jobs will definitely help. Thanks, +Vinod On Wed, Jan 9, 2013 at 6:41 AM, Ivan Tretyakov itretya...@griddynamics.com wrote: Hello! I've found that the jobcache directory became very large on our cluster, e.g.: # du -sh /data?/mapred/local/taskTracker/user/jobcache 465G /data1/mapred/local/taskTracker/user/jobcache 464G /data2/mapred/local/taskTracker/user/jobcache 454G /data3/mapred/local/taskTracker/user/jobcache And it stores information for about 100 jobs: # ls -1 /data?/mapred/local/taskTracker/persona/jobcache/ | sort | uniq | wc -l -- Best Regards Ivan Tretyakov Deployment Engineer Grid Dynamics +7 812 640 38 76 Skype: ivan.tretyakov www.griddynamics.com itretya...@griddynamics.com
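To double-check the two properties discussed in this thread from job code, a small sketch using the old-API JobConf accessors (both should normally be false/unset, or completed tasks' files stay on local disk):

    import org.apache.hadoop.mapred.JobConf;

    public class KeepFilesCheck {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        boolean keepFailed = conf.getKeepFailedTaskFiles();  // keep.failed.task.files
        String keepPattern = conf.getKeepTaskFilesPattern(); // keep.task.files.pattern
        System.out.println("keep failed: " + keepFailed
            + ", keep pattern: " + keepPattern); // pattern is null if unset
      }
    }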
Re: queues in haddop
Queues in the capacity scheduler are logical data structures into which MapReduce jobs are placed to be picked up by the JobTracker / Scheduler framework, according to some capacity constraints that can be defined for a queue. So, given your use case, I don't think the Capacity Scheduler is going to directly help you (since you only spoke about data-in, and not processing). So, yes, something like Flume or Scribe. Thanks Hemanth On Fri, Jan 11, 2013 at 11:34 AM, Harsh J ha...@cloudera.com wrote: Your question is unclear: HDFS has no queues for ingesting data (it is a simple, distributed FileSystem). The Hadoop M/R and Hadoop YARN components have queues for data processing purposes. On Fri, Jan 11, 2013 at 8:42 AM, Panshul Whisper ouchwhis...@gmail.com wrote: Hello, I have a hadoop cluster setup of 10 nodes and I am in need of implementing queues in the cluster for receiving high volumes of data. Please suggest what will be more efficient to use in the case of receiving 24 million JSON files, approx 5 KB each, in every 24 hours: 1. Using Capacity Scheduler 2. Implementing RabbitMQ and receiving data from them using Spring Integration data pipelines. I cannot afford to lose any of the JSON files received. Thanking You, -- Regards, Ouch Whisper 010101010101
Re: JobCache directory cleanup
Hi, On Thu, Jan 10, 2013 at 5:17 PM, Ivan Tretyakov itretya...@griddynamics.com wrote: Thanks for replies! Hemanth, I could see the following exception in the TaskTracker log: https://issues.apache.org/jira/browse/MAPREDUCE-5 But I'm not sure if it is related to this issue. Now, when a job completes, the directories under the jobCache must get automatically cleaned up. However it doesn't look like this is happening in your case. So, if I've no running jobs, the jobcache directory should be empty. Is it correct? That is correct. I just verified it with my Hadoop 1.0.2 version. Thanks Hemanth On Thu, Jan 10, 2013 at 8:18 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, The directory name you have provided is /data?/mapred/local/taskTracker/persona/jobcache/. This directory is used by the TaskTracker (slave) daemons to localize job files when the tasks are run on the slaves. Hence, I don't think this is related to the parameter mapreduce.jobtracker.retiredjobs.cache.size, which is a parameter related to the jobtracker process. Now, when a job completes, the directories under the jobCache must get automatically cleaned up. However it doesn't look like this is happening in your case. Could you please look at the logs of the tasktracker machine where this has gotten filled up to see if there are any errors that could give clues ? Also, since this is a CDH release, it could be a problem specific to that - and maybe reaching out on the CDH mailing lists will also help. Thanks hemanth On Wed, Jan 9, 2013 at 8:11 PM, Ivan Tretyakov itretya...@griddynamics.com wrote: Hello! I've found that the jobcache directory became very large on our cluster, e.g.: # du -sh /data?/mapred/local/taskTracker/user/jobcache 465G /data1/mapred/local/taskTracker/user/jobcache 464G /data2/mapred/local/taskTracker/user/jobcache 454G /data3/mapred/local/taskTracker/user/jobcache And it stores information for about 100 jobs: # ls -1 /data?/mapred/local/taskTracker/persona/jobcache/ | sort | uniq | wc -l I've found that there is the following parameter: <property> <name>mapreduce.jobtracker.retiredjobs.cache.size</name> <value>1000</value> <description>The number of retired job status to keep in the cache.</description> </property> So, if I got it right, it is intended to control job cache size by limiting the number of jobs to store cache for. Also, I've seen that some hadoop users use a cron approach to clean up jobcache: http://grokbase.com/t/hadoop/common-user/102ax9bze1/cleaning-jobcache-manually ( http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201002.mbox/%3c99484d561002100143s4404df98qead8f2cf687a7...@mail.gmail.com%3E ) Are there other approaches to control jobcache size? What is the more correct way to do it? Thanks in advance! P.S. We are using CDH 4.1.1. -- Best Regards Ivan Tretyakov Deployment Engineer Grid Dynamics +7 812 640 38 76 Skype: ivan.tretyakov www.griddynamics.com itretya...@griddynamics.com
Re: Not committing output in map reduce
Is this the same as: http://stackoverflow.com/questions/6137139/how-to-save-only-non-empty-reducers-output-in-hdfs? i.e. LazyOutputFormat, etc. ? On Thu, Jan 10, 2013 at 4:51 PM, Pratyush Chandra chandra.praty...@gmail.com wrote: Hi, I am using s3n as the file system. I do not wish to create output folders and files if there is no output from the RecordReader implementation. Currently it creates empty part* files and _SUCCESS. Is there a way to do so ? -- Pratyush Chandra
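If the LazyOutputFormat route fits, the wiring is a one-liner in the driver (new API shown; a sketch, not tested against s3n specifically):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Wrap the real output format so part files are only created on first write;
    // tasks that emit nothing leave no empty part-* files behind.
    // LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

Note that the _SUCCESS marker is written by the output committer and is controlled separately; if memory serves, setting mapreduce.fileoutputcommitter.marksuccessfuljobs to false suppresses it, but that is worth verifying on the version in use.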
Re: JobCache directory cleanup
Good point. Forgot that one :-) On Thu, Jan 10, 2013 at 10:53 PM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: Can you check the job configuration for these ~100 jobs? Do they have keep.failed.task.files set to true? If so, these files won't be deleted. If it doesn't, it could be a bug. Sharing your configs for these jobs will definitely help. Thanks, +Vinod On Wed, Jan 9, 2013 at 6:41 AM, Ivan Tretyakov itretya...@griddynamics.com wrote: Hello! I've found that jobcache directory became very large on our cluster, e.g.: # du -sh /data?/mapred/local/taskTracker/user/jobcache 465G/data1/mapred/local/taskTracker/user/jobcache 464G/data2/mapred/local/taskTracker/user/jobcache 454G/data3/mapred/local/taskTracker/user/jobcache And it stores information for about 100 jobs: # ls -1 /data?/mapred/local/taskTracker/persona/jobcache/ | sort | uniq | wc -l
Re: JobCache directory cleanup
Hi, The directory name you have provided is /data?/mapred/local/taskTracker/persona/jobcache/. This directory is used by the TaskTracker (slave) daemons to localize job files when the tasks are run on the slaves. Hence, I don't think this is related to the parameter mapreduce.jobtracker.retiredjobs.cache.size, which is a parameter related to the jobtracker process. Now, when a job completes, the directories under the jobCache must get automatically cleaned up. However it doesn't look like this is happening in your case. Could you please look at the logs of the tasktracker machine where this has gotten filled up to see if there are any errors that could give clues ? Also, since this is a CDH release, it could be a problem specific to that - and maybe reaching out on the CDH mailing lists will also help. Thanks hemanth On Wed, Jan 9, 2013 at 8:11 PM, Ivan Tretyakov itretya...@griddynamics.com wrote: Hello! I've found that the jobcache directory became very large on our cluster, e.g.: # du -sh /data?/mapred/local/taskTracker/user/jobcache 465G /data1/mapred/local/taskTracker/user/jobcache 464G /data2/mapred/local/taskTracker/user/jobcache 454G /data3/mapred/local/taskTracker/user/jobcache And it stores information for about 100 jobs: # ls -1 /data?/mapred/local/taskTracker/persona/jobcache/ | sort | uniq | wc -l I've found that there is the following parameter: <property> <name>mapreduce.jobtracker.retiredjobs.cache.size</name> <value>1000</value> <description>The number of retired job status to keep in the cache.</description> </property> So, if I got it right, it is intended to control job cache size by limiting the number of jobs to store cache for. Also, I've seen that some hadoop users use a cron approach to clean up jobcache: http://grokbase.com/t/hadoop/common-user/102ax9bze1/cleaning-jobcache-manually ( http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201002.mbox/%3c99484d561002100143s4404df98qead8f2cf687a7...@mail.gmail.com%3E ) Are there other approaches to control jobcache size? What is the more correct way to do it? Thanks in advance! P.S. We are using CDH 4.1.1. -- Best Regards Ivan Tretyakov Deployment Engineer Grid Dynamics +7 812 640 38 76 Skype: ivan.tretyakov www.griddynamics.com itretya...@griddynamics.com
Re: Why the official Hadoop Documents are so messy?
Hi, I am not sure if your complaint is as much about the changing interfaces as it is about documentation. Please note that versions prior to 1.0 did not have stable interfaces as a major requirement. Not by choice, but because the focus was on seemingly more important functionality, stability, performance etc. Specifically with respect to the shell commands you refer to, they were going through the same evolution. From now on, 1.x releases will not change these kinds of public interfaces and APIs. I don't mean that documentation is unimportant, just that this might be less of an issue now, post the 1.x release. As others have mentioned, it would be great if you can participate to improve documentation by filing or fixing JIRAs. Thanks Hemanth On Tuesday, January 8, 2013, javaLee wrote: For example, look at the documents about the HDFS shell guide: In 0.17, the prefix of the HDFS shell is hadoop dfs: http://hadoop.apache.org/docs/r0.17.2/hdfs_shell.html In 0.19, the prefix of the HDFS shell is hadoop fs: http://hadoop.apache.org/docs/r0.19.1/hdfs_shell.html#lsr In 1.0.4, the prefix of the HDFS shell is hdfs dfs: http://hadoop.apache.org/docs/r1.0.4/file_system_shell.html#ls Reading the official Hadoop documents is such a suffering. As an end user, I am confused...
Re: Reg: Fetching TaskAttempt Details from a RunningJob
Hi, In Hadoop 1.0, I don't think this information is exposed. TaskInProgress is an internal class and hence cannot / should not be used from client applications. The only way out seems to be to screen scrape the information from the Jobtracker web UI. If you can live with completed events, then there is something called TaskCompletionEvents that seems to provide some of this information. You could look at JobClient.getTaskCompletionEvents. Please note that in Hadoop 2.0, where mapreduce has been re-architected into YARN, there are JSON APIs that seem to expose the information you require: http://hadoop.apache.org/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html#Response_Examples Look there for taskAttempts. Thanks hemanth On Sun, Jan 6, 2013 at 8:30 PM, Hadoop Learner hadooplearner1...@gmail.com wrote: Hi All, Working on a requirement of hadoop job monitoring. The requirement is to get details of every task attempt of a running job. Details are as follows: task attempt start time, task tracker name where the task attempt is executing, and errors or exceptions in a running task attempt. I have used the JobClient, RunningJob, JobStatus and TaskReport APIs to get the task attempt IDs (I have the task attempt IDs for which I need the required details). By using these APIs I was unable to get task attempt details as per my requirement mentioned above. Can you please help me with some ways to get the required details using Java APIs? Any pointers will be helpful. (Note: I have tried using the TaskInProgress API, but am not sure how to get a TaskInProgress instance from a task ID. Also I noticed the TaskInProgress class is not visible to my classes.) Thanks and Regards, Shyam
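A sketch of the completed-events route mentioned above, using old-API client classes (the job ID string is a placeholder; only completed attempts appear here, so start times of still-running attempts are out of reach this way):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;
    import org.apache.hadoop.mapred.TaskCompletionEvent;

    public class AttemptEvents {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(JobID.forName("job_201301010000_0001"));
        int from = 0;
        TaskCompletionEvent[] events;
        // Page through completion events as they accumulate.
        while ((events = job.getTaskCompletionEvents(from)).length > 0) {
          for (TaskCompletionEvent e : events) {
            System.out.println(e.getTaskAttemptId() + " " + e.getTaskStatus()
                + " " + e.getTaskTrackerHttp()); // tracker serving this attempt
          }
          from += events.length;
        }
      }
    }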
Re: Differences between 'mapped' and 'mapreduce' packages
From a user perspective, at a high level, the mapreduce package can be thought of as having user facing client code that can be invoked, extended etc. as applicable from client programs. The mapred package is to be treated as internal to the mapreduce system, and shouldn't directly be used unless no alternative in the mapreduce package is available. Thanks Hemanth On Mon, Jan 7, 2013 at 11:44 PM, Oleg Zhurakousky oleg.zhurakou...@gmail.com wrote: What are the differences between the two? It seems like an MR job could be configured using one or the other (e.g., extends MapReduceBase implements Mapper vs. extends Mapper). Cheers Oleg
Re: Skipping entire task
Hi, Are tasks being executed multiple times due to failures? Sorry, it was not very clear from your question. Thanks hemanth On Sat, Jan 5, 2013 at 7:44 PM, David Parks davidpark...@yahoo.com wrote: Thinking here... if you submitted the task programmatically you should be able to capture the failure of the task and gracefully move past it to your next tasks. To say it in a long-winded way: let's say you submit a job to Hadoop, a java jar, and your main class implements Tool. That code has the responsibility to submit a series of jobs to hadoop, something like this: try { Job myJob = new MyJob(getConf()); myJob.submitAndWait(); } catch(Exception uhhohh) { // Deal with the issue and move on } Job myNextJob = new MyNextJob(getConf()); myNextJob.submit(); Just pseudo code there to demonstrate my thought. David -----Original Message----- From: Håvard Wahl Kongsgård [mailto:haavard.kongsga...@gmail.com] Sent: Saturday, January 05, 2013 4:54 PM To: user Subject: Skipping entire task Hi, hadoop can skip bad records http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code. But is it also possible to skip entire tasks? -Håvard -- Håvard Wahl Kongsgård Faculty of Medicine Department of Mathematical Sciences NTNU http://havard.security-review.net/
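A runnable variant of the pseudo code above, using the 1.x new-API calls it gestures at (the job names and configuration steps are hypothetical; the point is only the control flow around a failed job):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ChainedJobs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try {
          Job first = new Job(conf, "first-job");
          // ... set mapper/reducer/paths for 'first' here ...
          if (!first.waitForCompletion(true)) {
            // The job failed; log it and deliberately fall through.
          }
        } catch (Exception uhhohh) {
          // Deal with submission/monitoring errors and move on.
        }
        Job next = new Job(conf, "next-job");
        // ... set mapper/reducer/paths for 'next' here ...
        next.submit(); // fire-and-forget; poll next.isComplete() if needed
      }
    }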
Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer
If it is a small number, A seems the best way to me. On Friday, December 28, 2012, Kshiva Kps wrote: Which one is correct? What is the preferred way to pass a small number of configuration parameters to a mapper or reducer? A. As key-value pairs in the jobconf object. B. As a custom input key-value pair passed to each mapper or reducer. C. Using a plain text file via the DistributedCache, which each mapper or reducer reads. D. Through a static variable in the MapReduce driver class (i.e., the class that submits the MapReduce job). Answer: B
Re: Selecting a task for the tasktracker
Hi, Firstly, I am talking about Hadoop 1.0. Please note that in Hadoop 2.x and trunk, the MapReduce framework is completely revamped to YARN ( http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html ) and you may need to look at different interfaces for building your own scheduler. In 1.0, the primary function of the TaskScheduler is the assignTasks method. Given a TaskTracker object as input, this method figures out how many free map and reduce slots exist on that particular tasktracker and selects one or more tasks that can be scheduled on it. Since task selection is the primary responsibility and the granularity is at a task level, the class is called TaskScheduler. The method of choosing a job and then a task within the job is customised by the different schedulers already present in Hadoop. Also, the core logic of selecting a map task with data locality optimizations is not implemented in the schedulers per se; they rely on the JobInProgress object in the MapReduce framework for achieving the same. To implement your own scheduler, it may be best to look at the sources of existing schedulers: JobQueueTaskScheduler, CapacityTaskScheduler or FairScheduler. In particular, the last two are in the contrib modules of mapreduce, and hence will be fairly independent to follow. Their build files will also tell you how to resolve any compile problems like the one you are facing. Thanks Hemanth On Thu, Dec 27, 2012 at 4:10 PM, Yaron Gonen yaron.go...@gmail.com wrote: Hi, If I understand correctly, the job scheduler (why is the class called TaskScheduler?) is responsible for assigning the task whose split is as close as possible to the tasktracker. Meaning that the job scheduler is responsible for two things: 1. Selecting a job. 2. Once a job is selected, assigning the closest task to the tasktracker that sent the heartbeat. Is this correct? I want to write my own job scheduler to change the logic above, but it says "The type TaskScheduler is not visible". How can I write my own scheduler? Thanks
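A heavily hedged skeleton of what such a scheduler looks like in 1.x. TaskScheduler and its collaborators are internal classes whose exact signatures have changed across releases, so this is only meant to show the shape; the JobQueueTaskScheduler source of the specific release is the authority. The "not visible" compile error is addressed by placing the class in the org.apache.hadoop.mapred package:

    package org.apache.hadoop.mapred; // must live here: TaskScheduler is not public

    import java.io.IOException;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.List;

    public class MyTaskScheduler extends TaskScheduler {

      @Override
      public List<Task> assignTasks(TaskTracker taskTracker) throws IOException {
        // 1. Pick a job (your scheduling policy goes here).
        // 2. Ask the chosen job for a task near this tracker, e.g. via
        //    JobInProgress.obtainNewMapTask(...), which handles data locality.
        return Collections.emptyList(); // placeholder: assign nothing
      }

      @Override
      public Collection<JobInProgress> getJobs(String queueName) {
        return Collections.emptyList();
      }
    }

If my memory is right, the JobTracker is then pointed at the new class through the mapred.jobtracker.taskScheduler property in mapred-site.xml, with the scheduler's jar on the JobTracker's classpath.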
Re: Sane max storage size for DN
This is a dated blog post, so it would help if someone with current HDFS knowledge can validate it: http://developer.yahoo.com/blogs/hadoop/posts/2010/05/scalability_of_the_hadoop_dist/ . There is a bit about the RAM required for the Namenode and how to compute it: you can look at the 'Namespace limitations' section. Thanks hemanth On Thu, Dec 13, 2012 at 10:57 AM, Mohammad Tariq donta...@gmail.com wrote: Hello Chris, Thank you so much for the valuable insights. I was actually using the same principle. I did the blunder and did the maths for the entire (9*3)PB. Seems I am higher than you, that too without drinking ;) Many thanks. Regards, Mohammad Tariq On Thu, Dec 13, 2012 at 10:38 AM, Chris Embree cemb...@gmail.com wrote: Hi Mohammad, The amount of RAM on the NN is related to the number of blocks... so let's do some math. :) 1G of RAM to 1M blocks seems to be the general rule. I'll probably mess this up so someone check my math: 9 PB ~ 9,216 TB ~ 9,437,184 GB of data. Let's put that in 128MB blocks: according to kcalc that's 75,497,472 128 MB blocks. Unless I missed this by an order of magnitude (entirely possible... I've been drinking since 6), that sounds like 76G of RAM (above OS requirements). 128G should kick its ass; 256G seems like a waste of $$. Hmm... That makes the NN sound extremely efficient. Someone validate me or kick me to the curb. YMMV ;) On Wed, Dec 12, 2012 at 10:52 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Michael, It's an array. The actual size of the data could be somewhere around 9PB (exclusive of replication) and we want to keep the no of DNs as small as possible. Computations are not too frequent, as I have specified earlier. If I have 500TB in 1 DN, the no of DNs would be around 49. And, if the block size is 128MB, the no of blocks would be 201326592. So, I was thinking of having 256GB RAM for the NN. Does this make sense to you? Many thanks. Regards, Mohammad Tariq On Thu, Dec 13, 2012 at 12:28 AM, Michael Segel michael_se...@hotmail.com wrote: 500 TB? How many nodes in the cluster? Is this attached storage or is it in an array? I mean if you have 4 nodes for a total of 2PB, what happens when you lose 1 node? On Dec 12, 2012, at 9:02 AM, Mohammad Tariq donta...@gmail.com wrote: Hello list, I don't know if this question makes any sense, but I would like to ask: does it make sense to store 500TB (or more) data on a single DN? If yes, then what should be the spec of other parameters viz. NN and DN RAM, N/W etc? If no, what could be the alternative? Many thanks. Regards, Mohammad Tariq
Re: attempt* directories in user logs
However, in the case Oleg is talking about the attempts are: attempt_201212051224_0021_m_000000_0 attempt_201212051224_0021_m_000002_0 attempt_201212051224_0021_m_000003_0 These aren't multiple attempts of a single task, are they ? They are actually different tasks. If they were multiple attempts, I would expect the last digit to get incremented, like attempt_201212051224_0021_m_000000_0 and attempt_201212051224_0021_m_000000_1, for instance. It looks like at least 3 different tasks were launched on this node. One of them could be a setup task. Oleg, how many map tasks does the Jobtracker UI show for this job ? Thanks hemanth On Tue, Dec 11, 2012 at 12:19 AM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: MR launches multiple attempts for a single Task in case of TaskAttempt failures or when speculative execution is turned on. In either case, a given Task will only ever have one successful TaskAttempt whose output will be accepted (committed). The number of reduces is set to 1 by default in mapred-default.xml - you should explicitly set it to zero if you don't want reducers. By master, I suppose you mean JobTracker. The JobTracker doesn't show all the attempts for a given Task; you should navigate to the per-task page to see that. Thanks, +Vinod Kumar Vavilapalli Hortonworks Inc. http://hortonworks.com/ On Dec 9, 2012, at 6:53 AM, Oleg Zhurakousky wrote: I am studying user logs on the two node cluster that I have set up and I was wondering if anyone can shed some light on these 'attempt*' directories: $ ls attempt_201212051224_0021_m_000000_0 attempt_201212051224_0021_m_000003_0 job-acls.xml attempt_201212051224_0021_m_000002_0 attempt_201212051224_0021_r_000000_0 I mean it's obvious that it's talking about 3 attempts for Map tasks and 1 attempt for a Reduce task. However my current MR job only results in some output written to attempt_201212051224_0021_m_000000_0. Nothing in the reduce part (understandably, since I don't even have a reducer), so my question is: 1. The two more M attempts... what are they? 2. Why was there an attempt to do a Reduce when no reducer was provided/implemented? 3. Why did my master node only have 1 attempt for an M task but the slave had all that's displayed and questioned above (the 'ls' output above is from the slave node)? Thanks Oleg
Re: Map tasks processing some files multiple times
David, You are using FileNameTextInputFormat. This is not in the Hadoop source, as far as I can see. Can you please confirm where this is being used from ? It seems like the isSplitable method of this input format may need checking. Another thing: given you are adding the same input format for all files, do you need MultipleInputs ? Thanks Hemanth On Thu, Dec 6, 2012 at 1:06 PM, David Parks davidpark...@yahoo.com wrote: I believe I just tracked down the problem, maybe you can help confirm if you're familiar with this. I see that FileInputFormat is specifying that gzip files (.gz extension) from the s3n filesystem are being reported as splittable, and I see that it's creating multiple input splits for these files. I'm mapping the files directly off S3: Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*"); MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class); I see in the map phase, based on my counters, that it's actually processing the entire file (I set up a counter per file input). So the 2 files which were processed twice had 2 splits (I now see that in some debug logs I created), and the 1 file that was processed 3 times had 3 splits (the rest were smaller and were only assigned one split by default anyway). Am I wrong in expecting all files on the s3n filesystem to come through as not-splittable? This seems to be a bug in hadoop code if I'm right. David From: Raj Vishwanathan [mailto:rajv...@yahoo.com] Sent: Thursday, December 06, 2012 1:45 PM To: user@hadoop.apache.org Subject: Re: Map tasks processing some files multiple times Could it be due to spec-ex? Does it make a difference in the end? Raj From: David Parks davidpark...@yahoo.com To: user@hadoop.apache.org Sent: Wednesday, December 5, 2012 10:15 PM Subject: Map tasks processing some files multiple times I've got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times. This is the code I use to set up the mapper: Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*"); for(FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir)) log.info("Identified linkshare catalog: " + f.getPath().toString()); if( lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0 ) { MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class); } I can see from the logs that it sees only 1 copy of each of these files, and correctly identifies 167 files. I also have the following confirmation that it found the 167 files correctly: 2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167 When I look through the syslogs I can see that the file in question was opened by two different map attempts: ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading This is only happening to these 3 files, all others seem to be fine. For the life of me I can't see a reason why these files might be processed multiple times. Notably, map attempt 173 is more map attempts than should be possible. There are 167 input files (from S3, gzipped), thus there should be 167 map attempts. But I see a total of 176 map tasks. Any thoughts/ideas/guesses?
Re: Map tasks processing some files multiple times
Glad it helps. Could you also explain the reason for using MultipleInputs ? On Thu, Dec 6, 2012 at 2:59 PM, David Parks davidpark...@yahoo.com wrote: Figured it out; it is, as usual, a problem with my code. I had wrapped TextInputFormat to replace the LongWritable key with a key representing the file name. It was a bit tricky to do because of changing the generics from <LongWritable, Text> to <Text, Text> and I goofed up and mis-directed a call to isSplitable, which was causing the issue. It now works fine. Thanks very much for the response, it gave me pause to think enough to work out what I had done. Dave From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com] Sent: Thursday, December 06, 2012 3:25 PM To: user@hadoop.apache.org Subject: Re: Map tasks processing some files multiple times David, You are using FileNameTextInputFormat. This is not in the Hadoop source, as far as I can see. Can you please confirm where this is being used from ? It seems like the isSplitable method of this input format may need checking. Another thing: given you are adding the same input format for all files, do you need MultipleInputs ? Thanks Hemanth On Thu, Dec 6, 2012 at 1:06 PM, David Parks davidpark...@yahoo.com wrote: I believe I just tracked down the problem, maybe you can help confirm if you're familiar with this. I see that FileInputFormat is specifying that gzip files (.gz extension) from the s3n filesystem are being reported as splittable, and I see that it's creating multiple input splits for these files. I'm mapping the files directly off S3: Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*"); MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class); I see in the map phase, based on my counters, that it's actually processing the entire file (I set up a counter per file input). So the 2 files which were processed twice had 2 splits (I now see that in some debug logs I created), and the 1 file that was processed 3 times had 3 splits (the rest were smaller and were only assigned one split by default anyway). Am I wrong in expecting all files on the s3n filesystem to come through as not-splittable? This seems to be a bug in hadoop code if I'm right. David From: Raj Vishwanathan [mailto:rajv...@yahoo.com] Sent: Thursday, December 06, 2012 1:45 PM To: user@hadoop.apache.org Subject: Re: Map tasks processing some files multiple times Could it be due to spec-ex? Does it make a difference in the end? Raj From: David Parks davidpark...@yahoo.com To: user@hadoop.apache.org Sent: Wednesday, December 5, 2012 10:15 PM Subject: Map tasks processing some files multiple times I've got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times. This is the code I use to set up the mapper: Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*"); for(FileStatus f : lsDir.getFileSystem(getConf()).globStatus(lsDir)) log.info("Identified linkshare catalog: " + f.getPath().toString()); if( lsDir.getFileSystem(getConf()).globStatus(lsDir).length > 0 ) { MultipleInputs.addInputPath(job, lsDir, FileNameTextInputFormat.class, LinkShareCatalogImportMapper.class); } I can see from the logs that it sees only 1 copy of each of these files, and correctly identifies 167 files. I also have the following confirmation that it found the 167 files correctly: 2012-12-06 04:56:41,213 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat (main): Total input paths to process : 167 When I look through the syslogs I can see that the file in question was opened by two different map attempts: ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000005_0/syslog:2012-12-06 03:56:05,265 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading ./task-attempts/job_201212060351_0001/attempt_201212060351_0001_m_000173_0/syslog:2012-12-06 03:53:18,765 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3n://fruggmapreduce/input/catalogs/linkshare_catalogs/linkshare~CD%20Universe~85.csv.gz' for reading This is only happening to these 3 files, all others seem to be fine. For the life of me I can't see a reason why these files might be processed multiple
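For reference, a sketch of the kind of wrapper input format discussed in this thread, with the isSplitable decision made explicitly so compressed inputs stay unsplit. The class name matches the thread; the body is illustrative, not David's actual code (new API, 1.x):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class FileNameTextInputFormat extends FileInputFormat<Text, Text> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        // Mirror TextInputFormat: anything with a compression codec
        // (.gz etc.) must be read as a single split.
        return new CompressionCodecFactory(context.getConfiguration())
            .getCodec(file) == null;
      }

      @Override
      public RecordReader<Text, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        final Text fileName = new Text(((FileSplit) split).getPath().getName());
        final LineRecordReader lines = new LineRecordReader();
        return new RecordReader<Text, Text>() {
          public void initialize(InputSplit s, TaskAttemptContext c)
              throws IOException { lines.initialize(s, c); }
          public boolean nextKeyValue() throws IOException { return lines.nextKeyValue(); }
          public Text getCurrentKey() { return fileName; } // file name, not byte offset
          public Text getCurrentValue() { return lines.getCurrentValue(); }
          public float getProgress() { return lines.getProgress(); }
          public void close() throws IOException { lines.close(); }
        };
      }
    }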
Re: Changing hadoop configuration without restarting service
Generally true for the framework config files, but some of the supplementary features can be refreshed without a restart, e.g. scheduler configuration, host files (for included / excluded nodes) ... On Tue, Dec 4, 2012 at 5:33 AM, Cristian Cira cmc0...@tigermail.auburn.edu wrote: No. You will have to restart hadoop. Hot/online configuration is not supported in hadoop. All the best, Cristian Cira Graduate Research Assistant Parallel Architecture and System Laboratory (PASL) Shelby Center 2105 Auburn University, AL 36849 From: Pankaj Gupta [pan...@brightroll.com] Sent: Monday, December 03, 2012 5:59 PM To: user@hadoop.apache.org Subject: Changing hadoop configuration without restarting service Hi, Is it possible to change hadoop configuration files such as core-site.xml and get the changes to take effect without having to restart hadoop services? Thanks, Pankaj
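Two concrete examples of the refreshable features mentioned above, as they exist in 1.x (run as the appropriate admin user; availability can vary by version): hadoop dfsadmin -refreshNodes makes the namenode re-read its include/exclude host files, and hadoop mradmin -refreshQueues makes the jobtracker re-read the scheduler's queue configuration, both without restarting the daemons.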
Re: Failed to call hadoop API
Hi, Little confused about where JNI comes in here (you mentioned this in your original email). Also, where do you want to get the information for the hadoop job ? Is it in a program that is submitting a job, or some sort of monitoring application that is monitoring jobs submitted to a cluster by others ? I think some of this information will drive an answer. FWIW, JobID.forName(job_id_as_string) would give you a handle to a JobID tied to a job. Reference: http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/JobID.html Thanks hemanth On Wed, Nov 28, 2012 at 6:58 AM, ugiwgh ugi...@gmail.com wrote: I want to code a program to get hadoop job information, such as jobid, jobname, owner, running nodes. -- Original -- From: Mahesh Balija balijamahesh@gmail.com Date: Tue, Nov 27, 2012 10:02 PM To: user user@hadoop.apache.org Subject: Re: Failed to call hadoop API Hi Hui, The JobID constructor is not a public constructor; it has default visibility, so you can only create an instance from within the same package. Usually you cannot create a JobID yourself; rather, you can get one from the Job instance by invoking getJobID(). If this does not work for you, please tell us what you are trying to do. Thanks, Mahesh Balija, Calsoft Labs. On Tue, Nov 27, 2012 at 5:37 PM, GHui ugi...@gmail.com wrote: I call the statement JobID id = new JobID() of the hadoop API with JNI. But when my program runs to this statement, it exits, and no errors are output. I can't make any sense of this. The hadoop is hadoop-core-1.0.3.jar. The Java SDK is jdk1.6.0-34. Any help will be appreciated. -GHui
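To illustrate the JobID.forName route mentioned above, here is a minimal sketch of a monitoring client using the Hadoop 1.x mapred API (the job id string is a placeholder; pass a real one, e.g. from the jobtracker UI):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class JobInfo {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(); // picks up mapred-site.xml from the classpath
            JobClient client = new JobClient(conf);
            // placeholder id; substitute a real job id
            RunningJob job = client.getJob(JobID.forName("job_201211280001_0001"));
            if (job != null) {
                System.out.println(job.getJobName() + ", complete: " + job.isComplete());
            }
        }
    }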
Re: problem using s3 instead of hdfs
Hi, I've not tried this on S3. However, the directory mentioned in the exception is based on the value of this particular configuration key: mapreduce.jobtracker.staging.root.dir. This defaults to ${hadoop.tmp.dir}/mapred/staging. Can you please set this to an S3 location and try ? Thanks Hemanth On Mon, Oct 15, 2012 at 10:43 PM, Parth Savani pa...@sensenetworks.com wrote: Hello, I am trying to run hadoop on s3 using distributed mode. However I am having issues running my job successfully on it. I get the following error. I followed the instructions provided in this article - http://wiki.apache.org/hadoop/AmazonS3 I replaced the fs.default.name value in my hdfs-site.xml with s3n://ID:SECRET@BUCKET and I am running my job using the following: hadoop jar /path/to/my/jar/abcd.jar /input /output where /input is the folder name inside the s3 bucket (s3n://ID:SECRET@BUCKET/input) and the /output folder should be created in my bucket (s3n://ID:SECRET@BUCKET/output). Below is the error I get. It is looking for job.jar on s3, and that path is on my server from where I am launching my job. java.io.FileNotFoundException: No such file or directory '/opt/data/hadoop/hadoop-mapred/mapred/staging/psavani/.staging/job_201207021606_1036/job.jar' at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:412) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:207) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1371) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1352) at org.apache.hadoop.mapred.JobLocalizer.localizeJobJarFile(JobLocalizer.java:273) at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:381) at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:371) at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:222) at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1372) at java.security.AccessController.doPri
Re: problem using s3 instead of hdfs
Parth, I notice in the stack trace below that the LocalJobRunner, rather than the JobTracker, is being used. Are you sure this is a distributed cluster ? Could you please check the value of mapred.job.tracker ? Thanks Hemanth On Tue, Oct 16, 2012 at 8:02 PM, Parth Savani pa...@sensenetworks.com wrote: Hello Hemanth, I set the hadoop staging directory to the s3 location. However, it complains. Below is the error: 12/10/16 10:22:47 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: s3n://ABCD:ABCD@ABCD/tmp/mapred/staging/psavani1821193643/.staging, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:410) at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:322) at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:79) at org.apache.hadoop.mapred.LocalJobRunner.getStagingAreaDir(LocalJobRunner.java:541) at org.apache.hadoop.mapred.JobClient.getStagingAreaDir(JobClient.java:1204) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:102) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:839) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1177) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833) at org.apache.hadoop.mapreduce.Job.submit(Job.java:476) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:506) at com.sensenetworks.macrosensedata.ParseLogsMacrosense.run(ParseLogsMacrosense.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.sensenetworks.macrosensedata.ParseLogsMacrosense.main(ParseLogsMacrosense.java:121) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:197) On Tue, Oct 16, 2012 at 3:11 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, I've not tried this on S3. However, the directory mentioned in the exception is based on the value of this particular configuration key: mapreduce.jobtracker.staging.root.dir. This defaults to ${hadoop.tmp.dir}/mapred/staging. Can you please set this to an S3 location and try ? Thanks Hemanth On Mon, Oct 15, 2012 at 10:43 PM, Parth Savani pa...@sensenetworks.com wrote: Hello, I am trying to run hadoop on s3 using distributed mode. However I am having issues running my job successfully on it. I followed the instructions provided in this article - http://wiki.apache.org/hadoop/AmazonS3 I replaced the fs.default.name value in my hdfs-site.xml with s3n://ID:SECRET@BUCKET and I am running my job using the following: hadoop jar /path/to/my/jar/abcd.jar /input /output where /input is the folder name inside the s3 bucket (s3n://ID:SECRET@BUCKET/input) and the /output folder should be created in my bucket (s3n://ID:SECRET@BUCKET/output). Below is the error I get (the same trace as in the previous message): it is looking for job.jar on s3, and that path is on my server from where I am launching my job.
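A quick way to confirm which of the two is in play is to print the value the client resolves (a sketch; "local" is the default that selects the LocalJobRunner):

    import org.apache.hadoop.mapred.JobConf;

    public class CheckJobTracker {
        public static void main(String[] args) {
            JobConf conf = new JobConf(); // loads mapred-site.xml from the classpath
            // "local" means jobs run in-process via the LocalJobRunner;
            // anything else should be the JobTracker's host:port
            System.out.println(conf.get("mapred.job.tracker", "local"));
        }
    }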
Re: Question about how to find which file takes the longest time to process and how to assign more mappers to process that particular file
Hi, Roughly, this information will be available under the 'Hadoop map task list' page in the MapReduce web UI (in Hadoop-1.0, which I am assuming is what you are using). You can reach this page by selecting the running tasks link from the job information page. The page has a table that lists all the tasks, and under the status column tells you which part of the input is being processed. Please note that, depending on the input format chosen, a task may be processing a part of a file, and not necessarily a file itself. Another good source of information to see why these particular tasks are slow will be to look at the job's counters. Again these counters can be accessed from the web UI of the task list page. It would help more if you can provide more information - like what job you're trying to run, the input format specified etc. Thanks hemanth On Fri, Oct 5, 2012 at 3:33 AM, Huanchen Zhang iamzhan...@gmail.com wrote: Hello, I have a question about how to find which file takes the longest time to process and how to assign more mappers to process that particular file. Currently, about three mappers take about five times longer to complete. So, how can I detect which specific files those three mappers are processing? If the above is doable, how can I assign more mappers to process those specific files? Thank you ! Best, Huanchen
Re: A small portion of map tasks slows down the job
Hi, Would reducing the output from the map tasks solve the problem ? i.e. are the reducers slowing down because a lot of data is being shuffled ? If that's the case, you could see if the map output size can be reduced by using the framework's combiner or an in-mapper combining technique. Thanks Hemanth On Wed, Oct 3, 2012 at 6:34 AM, Huanchen Zhang iamzhan...@gmail.com wrote: Hello, I have a small portion of map tasks whose output is much larger than the others (more spills). So the reducers are mainly waiting for these few map tasks. Is there a good solution for this problem ? Thank you. Best, Huanchen
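In case it helps, a minimal sketch of the in-mapper combining idea (a hypothetical word-count-style mapper; the real job's types will differ): aggregate into a map and emit once per distinct key in cleanup(), so far fewer records are spilled and shuffled:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class InMapperCombiningMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            for (String token : value.toString().split("\\s+")) {
                Integer current = counts.get(token);
                counts.put(token, current == null ? 1 : current + 1);
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // one record per distinct key leaves the mapper,
            // instead of one record per input token
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
            }
        }
    }

The in-memory map has to fit in the task's heap, so this only works when the number of distinct keys per task is bounded.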
Re: Can we write output directly to HDFS from Mapper
You can certainly do that. Indeed, if you set the number of reducers to 0, the map output will be directly written to HDFS by the framework itself. You may also want to look at http://hadoop.apache.org/docs/stable/mapred_tutorial.html#Task+Side-Effect+Files to see some things that need to be taken care of if you are writing files on your own. Thanks hemanth On Fri, Sep 28, 2012 at 9:45 AM, Balaraman, Anand anand_balara...@syntelinc.com wrote: Hi In Map-Reduce, is it appropriate to write the output directly to HDFS from the Mapper (without using a reducer) ? Are there any adverse effects in doing so, or are there any best practices to be followed in this aspect ? Comments are much appreciated. Thanks and Regards Anand B
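For reference, a sketch of the zero-reducer setup with the new API (the job name is illustrative; set the mapper, formats and paths as usual):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MapOnlyJobSetup {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "map-only-example");
            // with 0 reducers, the framework writes map output straight to HDFS
            job.setNumReduceTasks(0);
            // ... setMapperClass, input/output formats and paths, then
            // job.waitForCompletion(true)
        }
    }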
Re: Passing Command-line Parameters to the Job Submit Command
By java environment variables, do you mean the ones passed as -Dkey=value ? That's one way of passing them. I suppose another way is to have a client-side site configuration (like mapred-site.xml) that is in the classpath of the client app. Thanks Hemanth On Tue, Sep 25, 2012 at 12:20 AM, Varad Meru meru.va...@gmail.com wrote: Thanks Hemanth, But in general, if we want to pass arguments to any job (not only PiEstimator from the examples jar) and submit the job to the job queue scheduler, by the looks of it, we might always need to use the java environment variables only. Is my above assumption correct? Thanks, Varad On Mon, Sep 24, 2012 at 9:48 AM, Hemanth Yamijala yhema...@gmail.com wrote: Varad, Looking at the code for the PiEstimator class which implements the 'pi' example, the two arguments are mandatory and are used before the job is submitted for execution - i.e. on the client side. In particular, one of them (nSamples) is used not by the MapReduce job, but by the client code (i.e. PiEstimator) to generate some input. Hence, I believe all of this additional work that is being done by the PiEstimator class will be bypassed if we directly use the job -submit command. In other words, I don't think these two ways of running the job: - using hadoop jar examples pi - using hadoop job -submit are equivalent. As a general answer to your question though, if additional parameters are used by the mappers or reducers, then they will generally be set as additional job-specific configuration items. So, one way of using them with the job -submit command will be to find out the specific names of the configuration items (from code, or some other documentation), and include them in the job.xml used when submitting the job. Thanks Hemanth On Sun, Sep 23, 2012 at 1:24 PM, Varad Meru meru.va...@gmail.com wrote: Hi, I want to run the PiEstimator example using the following command: $ hadoop job -submit pieestimatorconf.xml which contains all the info required by hadoop to run the job, e.g. the input file location, the output file location and other details:

    <property><name>mapred.jar</name><value>file:Users/varadmeru/Work/Hadoop/hadoop-examples-1.0.3.jar</value></property>
    <property><name>mapred.map.tasks</name><value>20</value></property>
    <property><name>mapred.reduce.tasks</name><value>2</value></property>
    ...
    <property><name>mapred.job.name</name><value>PiEstimator</value></property>
    <property><name>mapred.output.dir</name><value>file:Users/varadmeru/Work/out</value></property>

Now, as we know, to run the PiEstimator we can use the following command too: $ hadoop jar hadoop-examples.1.0.3 pi 5 10 where 5 and 10 are the arguments to the main class of the PiEstimator. How can I pass the same arguments (5 and 10) using the job -submit command, through the conf. file or any other way, without changing the code of the examples to reflect the use of environment variables? Thanks in advance, Varad - Varad Meru Software Engineer, Business Intelligence and Analytics, Persistent Systems and Solutions Ltd., Pune, India.
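As a sketch of the -Dkey=value route (the property name my.param is made up): a driver that goes through ToolRunner gets such properties parsed into its Configuration automatically:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyDriver extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            // picked up from e.g.: hadoop jar myjob.jar MyDriver -Dmy.param=10
            int myParam = getConf().getInt("my.param", 1);
            System.out.println("my.param = " + myParam);
            // ... configure and submit the job using getConf() ...
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
        }
    }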
Re: Will all the intermediate output with the same key go to the same reducer?
Hi, Yes. By contract, all intermediate output with the same key goes to the same reducer. In your example, suppose that of the two keys generated from the mapper, one key goes to reducer 1 and the second goes to reducer 2; reducer 3 will not have any records to process and will end without producing any output. If the intermediate key space is very large, 1 reducer would certainly be a bottleneck, as you rightly note. Hence, configuring the right number of reducers is certainly important. Thanks hemanth On 9/20/12, Jason Yang lin.yang.ja...@gmail.com wrote: Hi, all I have a question: does all the intermediate output with the same key go to the same reducer or not? If it does, then in case only two keys are generated from the mapper but there are 3 reducers running in this job, what would happen? If not, how could I do some processing over all the data, like counting? I think some would suggest setting the number of reducers to 1, but I thought this would make the reducer a bottleneck when there is a large volume of intermediate output, wouldn't it? -- YANG, Lin
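For reference, the routing contract is implemented by the job's partitioner; the default is equivalent to this sketch (mirroring Hadoop's HashPartitioner), which is why equal keys always land on the same reducer:

    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // same key => same hash => same reducer index
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }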
Re: About ant Hadoop
Can you please look at the jobtracker and tasktracker logs on the nodes where the task has been launched ? Also see if the job logs are picking up anything. They'll probably give you clues on what is happening. Also, is HDFS OK ? i.e. are you able to read files already loaded etc. Thanks hemanth On 9/19/12, Li Shengmei lisheng...@ict.ac.cn wrote: Hi, all I revised the source code of hadoop-1.0.3 and used ant to recompile hadoop. It compiles successfully. Then I run 'jar cvf hadoop-core-1.0.3.jar *' and copy the new hadoop-core-1.0.3.jar to overwrite the $HADOOP_HOME/hadoop-core-1.0.3.jar on every node machine. Then I use hadoop to test the wordcount application. But the application halts at map 0% reduce 0%. Does anyone have suggestions? Thanks a lot. May
Re: What's the basic idea of pseudo-distributed Hadoop ?
One thing to be careful about is the paths of dependent libraries or executables, like streaming binaries. In pseudo-distributed mode, since all processes are looking things up on the same machine, it is likely that they will find paths that are really local to only the machine where the job is being launched from. When you start to run them in a true distributed environment, and if these files are not packaged and distributed to the cluster in some way, they will start failing. Thanks hemanth On Fri, Sep 14, 2012 at 1:04 PM, Jason Yang lin.yang.ja...@gmail.com wrote: All right, I got it. Thanks for all of you. 2012/9/14 Bertrand Dechoux decho...@gmail.com The only difference between pseudo-distributed and fully distributed would be scale. You could say that code that runs fine on the former runs fine too on the latter. But it does not necessarily mean that the performance will scale the same way (i.e. if you keep a list of elements in memory, at bigger scale you could receive an OOME). Of course, like it has been implied in previous answers, you can't say the same with standalone. With this mode, you could use global mutable static state thinking it's fine, without caring about distribution between the nodes. In that case, the same code launched on pseudo-distributed will fail to replicate the same results. Regards Bertrand On Fri, Sep 14, 2012 at 9:24 AM, Harsh J ha...@cloudera.com wrote: Hi Jason, I think you're confusing the standalone mode with a pseudo-distributed mode. The former is a limited mode of MR where no daemons need to be deployed and the tasks run in a single JVM (via threads). A pseudo-distributed cluster is a cluster where all daemons are running on one node itself. Hence, it is not distributed in the sense of multi-nodes (no use of any network gear) but works in the same way between nodes (RPC, etc.) as a fully-distributed one. If an MR program works fine in pseudo-distributed mode, it should work (no guarantee) fine in a fully-distributed mode iff all nodes have the same arch/OS, same JVM, and job-specific configurations. This is because tasks execute on various nodes and may be affected by a node's behavior or setup that is different from the others - and that's something you'd have to detect/know about if it exhibits failures more than others. On Fri, Sep 14, 2012 at 11:58 AM, Jason Yang lin.yang.ja...@gmail.com wrote: Hey, Kai Thanks for your reply. I was wondering what's the difference between the pseudo-distributed and fully-distributed hadoop, except the maximum number of map/reduce tasks. And if an MR program works fine in a pseudo-distributed cluster, will it work exactly as well in a fully-distributed cluster ? 2012/9/14 Kai Voigt k...@123.org The default setting is that a tasktracker can run up to two map and reduce tasks in parallel (mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum), so you will actually see some concurrency on your one machine. -- YANG, Lin -- Harsh J -- Bertrand Dechoux -- YANG, Lin
Re: Ignore keys while scheduling reduce jobs
Hi, When do you know the keys to ignore ? You mentioned after the map stage .. is this at the end of each map task, or at the end of all map tasks ? Thanks hemanth On Fri, Sep 14, 2012 at 4:36 PM, Aseem Anand aseem.ii...@gmail.com wrote: Hi, Is there any way I can ignore all keys except a certain key (determined after the map stage), so as to start only 1 reduce task, using a partitioner? If so, could someone suggest such a method. Regards, Aseem
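If the key turns out to be known by job-submission time (say, passed in the configuration; the property name interesting.key below is made up), a partitioner can route it to a single reducer, though it cannot drop the other keys - that filtering would have to happen in the mapper or in the reducers. A hypothetical sketch:

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class SingleKeyPartitioner extends Partitioner<Text, Text>
            implements Configurable {
        private Configuration conf;
        public void setConf(Configuration conf) { this.conf = conf; }
        public Configuration getConf() { return conf; }

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            if (key.toString().equals(conf.get("interesting.key"))) {
                return 0; // the one key we care about goes to reducer 0
            }
            // spread everything else over the remaining reducers
            return numPartitions == 1
                    ? 0
                    : 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
        }
    }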
Re: Question about the task assignment strategy
Hi, Task assignment takes data locality into account first, not block sequence. In hadoop, tasktrackers ask the jobtracker to be assigned tasks. When such a request comes to the jobtracker, it will try to look for an unassigned task which needs data that is close to the tasktracker, and will assign it. Thanks Hemanth On Tue, Sep 11, 2012 at 6:31 PM, Hiroyuki Yamada mogwa...@gmail.com wrote: Hi, I want to make sure whether my understanding about task assignment in hadoop is correct or not. When scanning a file with multiple tasktrackers, I am wondering how a task is assigned to each tasktracker. Is it based on the block sequence or on data locality ? Let me explain my question by example. There is a file which is composed of 10 blocks (block1 to block10); block1 is the beginning of the file and block10 is the tail. When scanning the file with 3 tasktrackers (tt1 to tt3), I am wondering whether task assignment is based on the block sequence, like: first tt1 takes block1, tt2 takes block2, tt3 takes block3, then tt1 takes block4 and so on; or whether task assignment is based on task (data) locality, like: first tt1 takes block2 (because it's located locally), tt2 takes block1 (because it's located locally), tt3 takes block4 (because it's located locally), and so on. As far as I have experienced, and as the definitive guide book says, I think the first case is the task assignment strategy (and if there are many replicas, the closest one is picked). Is this right ? If this is right, is there any way to get behaviour like the second case with the current implementation ? Thanks, Hiroyuki
Re: Error in : hadoop fsck /
Could you please review your configuration to see if you are pointing to the right namenode address ? (This will be in core-site.xml.) Please paste it here so we can look for clues. Thanks hemanth On Tue, Sep 11, 2012 at 9:25 PM, yogesh dhari yogeshdh...@live.com wrote: Hi all, I am running hadoop-0.20.2 on a single node cluster. When I run the command hadoop fsck / it shows the error: Exception in thread "main" java.net.UnknownHostException: http at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at sun.net.NetworkClient.doConnect(NetworkClient.java:180) at sun.net.www.http.HttpClient.openServer(HttpClient.java:378) at sun.net.www.http.HttpClient.openServer(HttpClient.java:473) at sun.net.www.http.HttpClient.<init>(HttpClient.java:203) at sun.net.www.http.HttpClient.New(HttpClient.java:290) at sun.net.www.http.HttpClient.New(HttpClient.java:306) at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:995) at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:931) at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:849) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1299) at org.apache.hadoop.hdfs.tools.DFSck.run(DFSck.java:123) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.hdfs.tools.DFSck.main(DFSck.java:159) Please suggest why this is so; it should show the health status.
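A quick client-side sanity check (a sketch; assumes core-site.xml is on the classpath) is to print the filesystem URI the client actually resolves:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class CheckFs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // should print something like hdfs://namenode-host:9000;
            // a malformed fs.default.name would show up here
            System.out.println(FileSystem.get(conf).getUri());
        }
    }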
Re: Question about the task assignment strategy
Hi, I tried a similar experiment to yours but couldn't replicate the issue. I generated 64 MB files and added them to my DFS - one file from every machine, with a replication factor of 1, like you did. My block size was 64MB. I verified the blocks were located on the same machine as where I added them from. Then, I launched a wordcount (without the min split size config). As expected, it created 8 maps, and I could verify that it ran all the tasks as data local - i.e. every task read off its own datanode. From the launch times of the tasks, I could roughly tell that this scheduling behaviour was independent of the order in which the tasks were launched. This behaviour was retained even with the min split size config. Could you share the size of the input you generated (i.e. the size of data01..data14) ? Also, what job are you running - specifically, what is the input format ? BTW, this wiki entry: http://wiki.apache.org/hadoop/HowManyMapsAndReduces talks a little bit about how the maps are created. Thanks Hemanth On Wed, Sep 12, 2012 at 7:49 AM, Hiroyuki Yamada mogwa...@gmail.com wrote: I figured out the cause. The HDFS block size is 128MB, but I specify mapred.min.split.size as 512MB, and data-local I/O processing goes wrong for some reason. When I remove the mapred.min.split.size configuration, tasktrackers pick data-local tasks. Why does it happen ? It seems like a bug. A split is a logical container of blocks, so nothing is wrong logically. On Wed, Sep 12, 2012 at 1:20 AM, Hiroyuki Yamada mogwa...@gmail.com wrote: Hi, thank you for the comment. Task assignment takes data locality into account first and not block sequence. Does it work like that when the replication factor is set to 1 ? I just ran an experiment to check the behavior. There are 14 nodes (node01 to node14), with 14 datanodes and 14 tasktrackers working. I first created data to be processed on each node (say data01 to data14), and I put each file into HDFS from its node (in the /data directory: /data/data01, ..., /data/data14). The replication factor is set to 1, so according to the default block placement policy, each file is stored on the local node (data01 is stored on node01, data02 on node02, and so on). With that setting, I launched a job that processes /data, and what happened is that the tasktrackers read from data01 to data14 sequentially, which means the tasktrackers first take all data from node01, then node02, then node03, and so on. If a tasktracker takes data locality into account as you say, each tasktracker should take the local task (data). (The tasktracker at node02 should take data02 blocks if there are any.) But it didn't work like that. Why is this happening ? Is there any documentation about this ? What part of the source code is doing this ? Regards, Hiroyuki On Tue, Sep 11, 2012 at 11:27 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, Task assignment takes data locality into account first and not block sequence. In hadoop, tasktrackers ask the jobtracker to be assigned tasks. When such a request comes to the jobtracker, it will try to look for an unassigned task which needs data that is close to the tasktracker and will assign it. Thanks Hemanth On Tue, Sep 11, 2012 at 6:31 PM, Hiroyuki Yamada mogwa...@gmail.com wrote: Hi, I want to make sure whether my understanding about task assignment in hadoop is correct or not. When scanning a file with multiple tasktrackers, I am wondering how a task is assigned to each tasktracker. Is it based on the block sequence or on data locality ? Let me explain my question by example.
There is a file which is composed of 10 blocks (block1 to block10); block1 is the beginning of the file and block10 is the tail. When scanning the file with 3 tasktrackers (tt1 to tt3), I am wondering whether task assignment is based on the block sequence, like: first tt1 takes block1, tt2 takes block2, tt3 takes block3, then tt1 takes block4 and so on; or whether task assignment is based on task (data) locality, like: first tt1 takes block2 (because it's located locally), tt2 takes block1 (because it's located locally), tt3 takes block4 (because it's located locally), and so on. As far as I have experienced, and as the definitive guide book says, I think the first case is the task assignment strategy (and if there are many replicas, the closest one is picked). Is this right ? If this is right, is there any way to get behaviour like the second case with the current implementation ? Thanks, Hiroyuki
Re: Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)
Hi, I am not sure if there's any way to restrict the tasks to specific machines. However, I think there are some ways of restricting the number of 'slots' that can be used by the job. Also, I'm not sure which version of Hadoop you are on. The CapacityScheduler (http://hadoop.apache.org/common/docs/r2.0.1-alpha/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html) has ways by which you can set up a queue with a hard capacity limit. The capacity controls the number of slots that can be used by jobs submitted to the queue. So, if you submit a job to the queue, irrespective of the number of tasks it has, it should limit it to those slots. However, please note that this does not restrict the tasks to specific machines. Thanks Hemanth On Mon, Sep 10, 2012 at 2:36 PM, Safdar Kureishy safdar.kurei...@gmail.com wrote: Hi, I need to run some benchmarking tests for a given mapreduce job on a subset of a 10-node Hadoop cluster. Not that it matters, but the current cluster settings allow for ~20 map slots and 10 reduce slots per node. Without loss of generality, let's say I want a job with these constraints below: - to use only 5 out of the 10 nodes for running the mappers, - to use only 5 out of the 10 nodes for running the reducers. Is there any other way of achieving this through Hadoop property overrides during job-submission time? I understand that the Fair Scheduler can potentially be used to create pools of a proportionate # of mappers and reducers, to achieve a similar outcome, but the problem is that I still cannot tie such a pool to a fixed # of machines (right?). Essentially, regardless of the # of map/reduce tasks involved, I only want a fixed # of machines to handle the job. Any tips on how I can go about achieving this? Thanks, Safdar
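Once such a queue is configured (the queue name 'benchmark' below is made up), pointing a job at it is a one-line client-side change:

    import org.apache.hadoop.mapred.JobConf;

    public class SubmitToQueue {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setQueueName("benchmark"); // i.e. sets mapred.job.queue.name
            // ... configure the rest of the job and submit as usual ...
        }
    }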
Re: Reading from HDFS from inside the mapper
Hi, You could check DistributedCache ( http://hadoop.apache.org/common/docs/stable/mapred_tutorial.html#DistributedCache). It would allow you to distribute data to the nodes where your tasks are run. Thanks Hemanth On Mon, Sep 10, 2012 at 3:27 PM, Sigurd Spieckermann sigurd.spieckerm...@gmail.com wrote: Hi, I would like to perform a map-side join of two large datasets where dataset A consists of m*n elements and dataset B consists of n elements. For the join, every element in dataset B needs to be accessed m times. Each mapper would join one element from A with the corresponding element from B. Elements here are actually data blocks. Is there a performance problem (and difference compared to a slightly modified map-side join using the join-package) if I set dataset A as the map-reduce input and load the relevant element from dataset B directly from the HDFS inside the mapper? I could store the elements of B in a MapFile for faster random access. In the second case without the join-package I would not have to partition the datasets manually which would allow a bit more flexibility, but I'm wondering if HDFS access from inside a mapper is strictly bad. Also, does Hadoop have a cache for such situations by any chance? I appreciate any comments! Sigurd
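A minimal sketch of the DistributedCache route for this join (paths and types are illustrative; Hadoop 1.x API):

    import java.io.IOException;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class JoinMapper extends Mapper<Text, Text, Text, Text> {
        private Path[] localB;

        @Override
        protected void setup(Context context) throws IOException {
            // files added in the driver, e.g.
            //   DistributedCache.addCacheFile(new URI("/data/B/part-00000"), conf);
            // appear here as paths on the task node's local disk
            localB = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            // ... open localB[0], e.g. as a MapFile.Reader, for random access ...
        }

        @Override
        protected void map(Text key, Text value, Context context) {
            // join each record of dataset A against the locally cached dataset B
        }
    }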
Re: Understanding of the hadoop distribution system (tuning)
Hi, Responses inline to some points. On Tue, Sep 11, 2012 at 7:26 AM, Elaine Gan elaine-...@gmo.jp wrote: Hi, I'm new to hadoop and I've just played around with map reduce. I would like to check if my understanding of hadoop is correct, and I would appreciate it if anyone could correct me where I'm wrong. I have data of around 518MB, and I wrote an MR program to process it. Here are some of my settings in my mapred-site.xml: mapred.tasktracker.map.tasks.maximum = 20 mapred.tasktracker.reduce.tasks.maximum = 20 These two configurations essentially tell the tasktrackers that they can run 20 maps and 20 reduces in parallel on a machine. Is this what you intended ? (Generally the sum of these two values should equal the number of cores on your tasktracker node, or a little more.) Also, it would help if you can tell us your cluster size - i.e. the number of slaves. My block size is the default, 64MB. With my data size = 518MB, I guess setting the maximum for MR tasks to 20 is far more than enough (518/64 ≈ 8); did I get that right? I suppose what you want is to run all the maps in parallel. For that, the number of map slots in your cluster should be more than the number of maps of your job (assuming there's a single job running). If the number of slots is less than the number of maps, the maps will be scheduled in multiple waves. On your jobtracker main page, the Cluster Summary > Map Task Capacity gives you the total slots available in your cluster. When I run the MR program, I can see in the Map/Reduce Administration page that the number of Maps Total = 8, so I assume that everything is going well here; once again, if I'm wrong please correct me. (Sometimes it shows only Maps Total = 3.) This value tells us the number of maps that will run for the job. There's one thing which I'm uncertain about in hadoop's distribution. Does Maps Total = 8 mean that there are 8 map tasks split among all the data nodes (task trackers)? Is there any way I can check whether the tasks are shared among the datanodes (where the task trackers are working)? There's no easy way to check this. The task page for every task shows the attempts that ran for each task and where they ran, under the 'Machine' column. When I click on each link under that Task Id, I can see Input Split Locations stated under each task's details; if the inputs are split between data nodes, does that mean that everything is working well? I think this is just the location of the splits, including the replicas. What you could check is whether enough data-local maps ran - which means that the tasks mostly got their inputs from datanodes running on the same machine as themselves. This is given by the counter Data-local map tasks on the job UI page. I need to make sure everything is running well because my MR job took around 6 hours to finish even though the input size is small. (Well, I know hadoop is not meant for small data.) I'm not sure whether it's my configuration that's wrong or whether hadoop is just not suitable for my case. I'm actually running a Mahout k-means analysis. Thank you for your time.
Re: [Cosmos-dev] Out of memory in identity mapper?
Harsh, Could IsolationRunner be used here? I'd put up a patch for HADOOP-8765, after applying which IsolationRunner works for me. Maybe we could use it to re-run the map task that's failing and debug. Thanks hemanth On Thu, Sep 6, 2012 at 9:42 PM, Harsh J ha...@cloudera.com wrote: Protobuf involvement makes me more suspicious that this is possibly a corruption or an issue with serialization as well. Perhaps if you can share some stack traces, people can help better. If it is reliably reproducible, then I'd also check the count of records read until this occurs, and see if the stack traces are always the same. Serialization formats such as protobufs allocate objects based on read sizes (for example, a string's size may be read before the string's bytes are read, and upon reading the size, that length is pre-allocated for the bytes to be read into), and in cases of corrupt data or bugs in the deserialization code, it is quite easy for it to make a large allocation request due to a badly read value. It's one possibility. Is the input compressed too, btw? Can you seek out the input file the specific map fails on, and try to read it in an isolated manner to validate it? Or do all maps seem to fail? On Thu, Sep 6, 2012 at 9:01 PM, SEBASTIAN ORTEGA TORRES sort...@tid.es wrote: Input files are small fixed-size protobuf records and yes, it is reproducible (but it takes some time). In this case I cannot use combiners since I need to process all the elements with the same key altogether. Thanks for the prompt response -- Sebastián Ortega Torres Product Development Innovation / Telefónica Digital C/ Don Ramón de la Cruz 82-84 Madrid 28006 On 06/09/2012, at 17:13, Harsh J wrote: I can imagine a huge record size possibly causing this. Is this reliably reproducible? Do you also have combiners enabled, which may run the reducer logic on the map side itself? On Thu, Sep 6, 2012 at 8:20 PM, JOAQUIN GUANTER GONZALBEZ x...@tid.es wrote: Hello hadoopers! In a reduce-only Hadoop job, input files are handled by the identity mapper and sent to the reducers without modification. In one of my jobs I was surprised to see the job failing in the map phase with an Out of memory error and GC overhead limit exceeded. In my understanding, a memory leak in the identity mapper is out of the question. What can be the cause of such an error? Thanks, Ximo. P.S. The logs show no stack trace other than the messages I mentioned before. -- Harsh J
Re: Error using hadoop in non-distributed mode
Hi, The path /tmp/hadoop-pat/mapred/local/archive/-4686065962599733460_1587570556_150738331/snip is a location used by the tasktracker process for the 'DistributedCache' - a mechanism to distribute files to all tasks running in a map reduce job (http://hadoop.apache.org/common/docs/r1.0.3/mapred_tutorial.html#DistributedCache). You have mentioned Mahout, so I am assuming that the specific analysis job you are running is using this feature to distribute the output of the file /Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 to the job that is causing a failure. Also, I found links stating that the distributed cache does not work in the local (non-HDFS) mode (http://stackoverflow.com/questions/9148724/multiple-input-into-a-mapper-in-hadoop). Look at the second answer. Thanks hemanth On Tue, Sep 4, 2012 at 10:33 PM, Pat Ferrel pat.fer...@gmail.com wrote: The job is creating several output and intermediate files, all under the location Users/pat/Projects/big-data/b/ssvd/. Several output directories and files are created correctly, and the file Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 is created and exists at the time of the error. We seem to be passing in Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 as the input file. Under what circumstances would an input path passed in as Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 be turned into pat/mapred/local/archive/6590995089539988730_1587570556_37122331/file/Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 ??? On Sep 4, 2012, at 1:14 AM, Narasingu Ramesh ramesh.narasi...@gmail.com wrote: Hi Pat, Please specify the correct input file location. Thanks and Regards, Ramesh.Narasingu On Mon, Sep 3, 2012 at 9:28 PM, Pat Ferrel p...@occamsmachete.com wrote: Using hadoop with mahout in a local filesystem/non-hdfs config for debugging purposes inside IntelliJ IDEA. When I run one particular part of the analysis I get the error below. I didn't write the code, but we are looking for some hint about what might cause it. This job completes without error in a single-node pseudo-clustered config outside of IDEA. Several jobs in the pipeline complete without error, creating part files just fine in the local file system. The file /tmp/hadoop-pat/mapred/local/archive/6590995089539988730_1587570556_37122331/file/Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 - which is the subject of the error - does not exist. Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 does exist at the time of the error. So the code is looking for the data in the wrong place? ... 12/09/02 14:56:29 INFO compress.CodecPool: Got brand-new decompressor 12/09/02 14:56:29 INFO compress.CodecPool: Got brand-new decompressor 12/09/02 14:56:29 INFO compress.CodecPool: Got brand-new decompressor 12/09/02 14:56:29 WARN mapred.LocalJobRunner: job_local_0002 java.io.FileNotFoundException: File /tmp/hadoop-pat/mapred/local/archive/-4686065962599733460_1587570556_150738331/file/Users/pat/Projects/big-data/b/ssvd/Q-job/R-m-0 does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterator.<init>(SequenceFileDirValueIterator.java:92) at org.apache.mahout.math.hadoop.stochasticsvd.BtJob$BtMapper.setup(BtJob.java:219) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) Exception in thread "main" java.io.IOException: Bt job unsuccessful. at org.apache.mahout.math.hadoop.stochasticsvd.BtJob.run(BtJob.java:609) at org.apache.mahout.math.hadoop.stochasticsvd.SSVDSolver.run(SSVDSolver.java:397) at com.finderbots.analysis.AnalysisPipeline.SSVDTransformAndBack(AnalysisPipeline.java:257) at com.finderbots.analysis.AnalysisJob.run(AnalysisJob.java:20) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.finderbots.analysis.AnalysisJob.main(AnalysisJob.java:34) Disconnected from the target VM, address: '127.0.0.1:63483', transport: 'socket'
Re: Exception while running a Hadoop example on a standalone install on Windows 7
Though I agree with others that it would probably be easier to get Hadoop up and running on Unix-based systems, I couldn't help noticing that this path: \tmp \hadoop-upendyal\mapred\staging\upendyal-1075683580\.staging seems to have a space in the first component, i.e. '\tmp ' and not '\tmp'. Is that a copy-paste issue, or is it really the case? Again, not sure if it could cause the specific error you're seeing, but you could try removing the space if it does exist. Also assuming that you've set up Cygwin etc. if you still want to try on Windows. Thanks hemanth On Wed, Sep 5, 2012 at 12:12 AM, Marcos Ortiz mlor...@uci.cu wrote: On 09/04/2012 02:35 PM, Udayini Pendyala wrote: Hi Bejoy, Thanks for your response. I first started to install on Ubuntu Linux and ran into a bunch of problems. So, I wanted to back off a bit and try something simple first. Hence, my attempt to install on my Windows 7 laptop. Well, if you tell us the problems that you had on Ubuntu, we can give you a hand. Michael Noll has great tutorials for this: Running Hadoop on Ubuntu Linux (Single node cluster) http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ Running Hadoop on Ubuntu Linux (Multi node cluster) http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ I am doing the standalone mode - as per the documentation (link in my original email), I don't need ssh unless I am doing the distributed mode. Is that not correct? Yes, but I give you the same recommendation that Bejoy gave you: use a Unix-based platform for Hadoop; it's more tested and has better performance than Windows. Best wishes Thanks again for responding Udayini --- On Tue, 9/4/12, Bejoy Ks bejoy.had...@gmail.com wrote: From: Bejoy Ks bejoy.had...@gmail.com Subject: Re: Exception while running a Hadoop example on a standalone install on Windows 7 To: user@hadoop.apache.org Date: Tuesday, September 4, 2012, 11:11 AM Hi Udayini, By default hadoop works well on Linux and Linux-based OSes. Since you are on Windows, you need to install and configure ssh using Cygwin before you start the hadoop daemons. On Tue, Sep 4, 2012 at 6:16 PM, Udayini Pendyala udayini_pendy...@yahoo.com wrote: Hi, Following is a description of what I am trying to do and the steps I followed. GOAL: a) Install Hadoop 1.0.3; b) Hadoop in a standalone (or local) mode; c) OS: Windows 7. STEPS FOLLOWED: 1. I followed instructions from: http://www.oreillynet.com/pub/a/other-programming/excerpts/hadoop-tdg/installing-apache-hadoop.html. Listing the steps I did - a. I went to: http://hadoop.apache.org/core/releases.html. b. I installed hadoop-1.0.3 by downloading "hadoop-1.0.3.tar.gz" and unzipping/untarring the file. c. I installed JDK 1.6 and set up JAVA_HOME to point to it. d. I set up HADOOP_INSTALL to point to my Hadoop install location and updated my PATH variable to have $HADOOP_INSTALL/bin. e. After the above steps, I ran the command "hadoop version" and got the following information: $ hadoop version Hadoop 1.0.3 Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192 Compiled by hortonfo on Tue May 8 20:31:25 UTC 2012 From source with checksum e6b0c1e23dcf76907c5fecb4b832f3be 2. The standalone install was very easy, as described above.
Then, I tried to run a sample command as given in: http://hadoop.apache.org/common/docs/r0.17.2/quickstart.html#Local Specifically, the steps followed were: a. cd $HADOOP_INSTALL b. mkdir input c. cp conf/*.xml input d. bin/hadoop jar hadoop-examples-1.0.3.jar grep input output 'dfs[a-z.]+' and got the following error: $ bin/hadoop jar hadoop-examples-1.0.3.jar grep input output 'dfs[a-z.]+' 12/09/03 15:41:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 12/09/03 15:41:57 ERROR security.UserGroupInformation: PriviledgedActionException as:upendyal cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-upendyal\mapred\staging\upendyal-1075683580\.staging to 0700 java.io.IOException: Failed to set permissions of path: \tmp\hadoop-upendyal\mapred\staging\upendyal-1075683580\.staging to 0700 at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689) at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189) at
Re: Integrating hadoop with java UI application deployed on tomcat
Hi, If you are getting the LocalFileSystem, you could try putting core-site.xml in a directory that's in the classpath for the Tomcat app (or include such a path in the classpath, if that's possible). Thanks hemanth On Mon, Sep 3, 2012 at 4:01 PM, Visioner Sadak visioner.sa...@gmail.com wrote: Thanks Steve, there's nothing in the logs and no exceptions as well. I found that some file is created in my F:\user with the directory name, but it's not visible inside my hadoop browse filesystem directories. I also added the config by using the method hadoopConf.addResource("F:/hadoop-0.22.0/conf/core-site.xml"); When running through the WAR and printing out the filesystem, I get org.apache.hadoop.fs.LocalFileSystem@9cd8db; when running an independent jar within hadoop, I get DFS[DFSClient[clientName=DFSClient_296231340, ugi=dell]]; when running an independent jar I am able to do uploads. I just wanted to know: will I have to add something to the classpath of tomcat, or is there any other configuration of core-site.xml that I am missing? Thanks for your help. On Sat, Sep 1, 2012 at 1:38 PM, Steve Loughran ste...@hortonworks.com wrote: well, it's worked for me in the past outside Hadoop itself: http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/utils/DfsUtils.java?revision=8882&view=markup Turn logging up to DEBUG. Make sure that the filesystem you've just loaded is what you expect, by logging its value. It may turn out to be file:///, because the normal Hadoop site-config.xml isn't being picked up. On Fri, Aug 31, 2012 at 1:08 AM, Visioner Sadak visioner.sa...@gmail.com wrote: but the problem is that my code gets executed with the warning, but the file is not copied to hdfs; actually I am trying to copy a file from local to hdfs:

    Configuration hadoopConf = new Configuration();
    // get the default associated file system
    FileSystem fileSystem = FileSystem.get(hadoopConf);
    // HarFileSystem harFileSystem = new HarFileSystem(fileSystem);
    // copy from lfs to hdfs
    fileSystem.copyFromLocalFile(new Path("E:/test/GANI.jpg"), new Path("/user/TestDir/"));
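One detail worth checking in the snippet above (an assumption about which overload is being used): Configuration.addResource(String) treats its argument as a classpath resource name, so an absolute file path passed as a plain string may be silently ignored; the Path overload loads the file from disk:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class LoadConf {
        public static void main(String[] args) {
            Configuration hadoopConf = new Configuration();
            // String overload = classpath resource; Path overload = file on disk
            hadoopConf.addResource(new Path("F:/hadoop-0.22.0/conf/core-site.xml"));
            System.out.println(hadoopConf.get("fs.default.name"));
        }
    }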
Yarn defaults for local directories
Hi, Is there a reason why Yarn's directory paths are not defaulting to be relative to hadoop.tmp.dir ? For example, yarn.nodemanager.local-dirs defaults to /tmp/nm-local-dir. Could it be ${hadoop.tmp.dir}/nm-local-dir instead ? Similarly for the log directories, I guess... Thanks hemanth
Re: knowing the nodes on which reduce tasks will run
Hi, You are right that a change to mapred.tasktracker.reduce.tasks.maximum will require a restart of the tasktrackers. AFAIK, there is no way of modifying this property without restarting. On a different note, could you see if the amount of intermediate data can be reduced using a combiner, or some other form of local aggregation ? Thanks hemanth On Mon, Sep 3, 2012 at 9:06 PM, Abhay Ratnaparkhi abhay.ratnapar...@gmail.com wrote: How can I set 'mapred.tasktracker.reduce.tasks.maximum' to 0 on a running tasktracker? It seems that I need to restart the tasktracker, and in that case I'll lose the output of the map tasks on that particular tasktracker. Can I change 'mapred.tasktracker.reduce.tasks.maximum' to 0 without restarting the tasktracker? ~Abhay On Mon, Sep 3, 2012 at 8:53 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Abhay, The TaskTrackers on which the reduce tasks are triggered are chosen at random, based on reduce slot availability. So if you don't want reduce tasks to be scheduled on some particular nodes, you need to set 'mapred.tasktracker.reduce.tasks.maximum' on those nodes to 0. The bottleneck here is that this property is not a job-level one; you need to set it at the cluster level. A cleaner approach would be to configure each of your nodes with the right number of map and reduce slots based on the resources available on each machine. On Mon, Sep 3, 2012 at 7:49 PM, Abhay Ratnaparkhi abhay.ratnapar...@gmail.com wrote: Hello, How can one get to know the nodes on which reduce tasks will run? One of my jobs is running and it's completing all the map tasks. My map tasks write lots of intermediate data. The intermediate directory is getting full on all the nodes. If the reduce tasks land on any node in the cluster, they'll try to copy the data to the same disk and will eventually fail due to disk-space-related exceptions. I have added a few more tasktracker nodes to the cluster and now want to run the reducers on the new nodes only. Is it possible to choose a node on which a reducer will run? What algorithm does hadoop use to pick a node to run a reducer? Thanks in advance. Bye Abhay