Archives not getting unarchived at tasktrackers
Hi All, I am using DistributedCache.addCacheArchive() to distribute a tar file to the tasktrackers, using the following statement:

DistributedCache.addCacheArchive(new URI("/home/akhil1988/sample.tar"), conf);

According to the documentation, it should get unarchived at the tasktrackers. But the statement

DistributedCache.getLocalCacheArchives(conf);

returns the following Path:

/hadoop/tmp/hadoop/mapred/local/taskTracker/archive/cn1.cloud.cs.illinois.edu/home/akhil1988/sample.tar

That means sample.tar did not get unarchived; nor am I able to access the file sample.txt in the above folder. Can anyone tell me where I am going wrong? I tarred the file sample.txt using the following command:

tar -cvf sample.tar sample.txt

Thanks, Akhil
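A minimal sketch of the pattern that is usually intended here, assuming the 0.20-era API; the hdfs:// authority, the /user/... destination, and the #sample link name are illustrative, not taken from the thread. The key points are that the archive must already sit in HDFS, and that a URI fragment names the symlink created in each task's working directory:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CacheArchiveSetup {
        public static void setUp(Configuration conf) throws Exception {
            FileSystem fs = FileSystem.get(conf);

            // The archive must live in HDFS first; a bare local path in the
            // URI will not be localized and unpacked for the tasks.
            fs.copyFromLocalFile(new Path("/home/akhil1988/sample.tar"),
                                 new Path("/user/akhil1988/sample.tar"));

            // "#sample" names the symlink placed in the task working directory.
            DistributedCache.addCacheArchive(
                new URI("hdfs://namenode:9000/user/akhil1988/sample.tar#sample"), conf);
            DistributedCache.createSymlink(conf);

            // Inside a map or reduce task the unpacked contents are then
            // reachable through the link, e.g.
            //   FileInputStream fin = new FileInputStream("sample/sample.txt");
        }
    }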
Re: Using addCacheArchive
Thanks Chris for your reply! Well, I could not understand much of what has been discussed on that forum, and I am unaware of Cascading. My problem is simple: I want a directory to be present in the local working directory of tasks so that I can access it from my map task in the following manner:

FileInputStream fin = new FileInputStream("Config/file1.config");

where Config is a directory which contains many files/directories, one of which is file1.config. It would be helpful if you could tell me what statements to use to distribute a directory to the tasktrackers. The API doc (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that archives are unzipped on the tasktrackers, but I want an example of how to use this in the case of a directory.

Thanks, Akhil

Chris Curtin-2 wrote: Hi, I've found it much easier to write the file to HDFS using the API, then pass the 'path' to the file in HDFS as a property. You'll need to remember to clean up the file after you're done with it. Example details are in this thread: http://groups.google.com/group/cascading-user/browse_thread/thread/d5c619349562a8d6# Hope this helps, Chris
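For the directory case specifically, one common way to do this, sketched under the same assumptions as above (0.20-era API; paths and link name illustrative): pack the directory into an archive on the client, upload it to HDFS, and register it with a fragment so the unpacked copy appears behind a symlink:

    // On the client machine, before submitting the job:
    //   $ zip -r Config.zip Config
    //   $ hadoop fs -put Config.zip /user/akhil1988/Config.zip

    // In the job driver (conf is the job's configuration):
    DistributedCache.addCacheArchive(
        new URI("hdfs://namenode:9000/user/akhil1988/Config.zip#Config"), conf);
    DistributedCache.createSymlink(conf);

    // In the map task; note that depending on how the zip was built, the file
    // may land one level deeper, i.e. at "Config/Config/file1.config", because
    // the link points at the unpack directory, not at the zipped folder itself.
    FileInputStream fin = new FileInputStream("Config/file1.config");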
Using addCacheArchive
Hi All! I want a directory to be present in the local working directory of the task, for which I am using the following statements:

DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf);
DistributedCache.createSymlink(conf);

Here Config is a directory which I have zipped and put at the given location in HDFS. I have zipped the directory because the API doc of DistributedCache (http://hadoop.apache.org/core/docs/r0.20.0/api/index.html) says that archive files are unzipped in the local cache directory:

"DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives, jars etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes."

So, from my understanding of the API docs, I expect that the Config.zip file will be unzipped to a Config directory, and since I have symlinked them, I can access the directory in the following manner from my map function:

FileInputStream fin = new FileInputStream("Config/file1.config");

But I get a FileNotFoundException on the execution of this statement. Please let me know where I am going wrong.

Thanks, Akhil
Re: Using addCacheArchive
Please ask any questions if I am not clear above about the problem I am facing.

Thanks, Akhil
Re: Using addCacheArchive
Thanks Amareshwari for your reply!

The file Config.zip is in HDFS; if it were not, the jobtracker itself would have reported an error while executing the statement:

DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip"), conf);

But I get the error in the map function when I try to access the Config directory. Now I am using the following statement but still getting the same error:

DistributedCache.addCacheArchive(new URI("/home/akhil1988/Config.zip#Config"), conf);

Do you think there could be any problem in distributing a zipped directory and having Hadoop unzip it recursively?

Thanks! Akhil

Amareshwari Sriramadasu wrote: Hi Akhil, DistributedCache.addCacheArchive takes a path on HDFS. From your code, it looks like you are passing a local path. Also, if you want to create a symlink, you should pass the URI as hdfs://path#linkname, besides calling DistributedCache.createSymlink(conf); Thanks, Amareshwari
Re: Using addCacheArchive
Yes, my HDFS paths are of the form /home/user-name/, and I have used these in DistributedCache's addCacheFile method successfully.

Thanks, Akhil

Amareshwari Sriramadasu wrote: Is your hdfs path /home/akhil1988/Config.zip? Usually an hdfs path is of the form /user/akhil1988/Config.zip. Just wondering if you are giving a wrong path in the URI! Thanks, Amareshwari
Re: Strange Exception
Thanks Jason! I gave your suggestion to my cluster administrator and now it is working. The following was his reply to me:

"But /hadoop/tmp is not /scratch, and the only thing that I clean is /scratch. It looks like the disks in the job tracker machine died. I swapped the disks from another node and rebuilt it. As far as I can tell it is working."

Thanks Again!

jason hadoop wrote: The directory specified by the configuration parameter mapred.system.dir, defaulting to /tmp/hadoop/mapred/system, doesn't exist. Most likely your tmp cleaner task has removed it, and I am guessing it is only created at cluster start time.
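For anyone who hits the same symptom from the tmp-cleaner side rather than from dead disks: mapred.system.dir can be pointed away from /tmp in the cluster configuration so that periodic cleaners never remove it. A sketch for hadoop-site.xml (the value is illustrative, and the daemons must be restarted for it to take effect):

    <property>
      <name>mapred.system.dir</name>
      <value>/hadoop/mapred/system</value>
    </property>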
Strange Exception
Hi All! I have been running Hadoop jobs through my user account on a cluster, for a while now. But now I am getting this strange exception when I try to execute a job. If anyone knows, please let me know why this is happening.

[akhil1...@altocumulus WordCount]$ hadoop jar wordcount_classes_dir.jar org.uiuc.upcrc.extClasses.WordCount /home/akhil1988/input /home/akhil1988/output JO

09/06/22 19:19:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
org.apache.hadoop.ipc.RemoteException: java.io.FileNotFoundException: /hadoop/tmp/hadoop/mapred/local/jobTracker/job_200906111015_0167.xml (Read-only file system)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:187)
        at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:183)
        at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:241)
        at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:327)
        at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:360)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:208)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1214)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1195)
        at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:212)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:2230)
        at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
        at org.apache.hadoop.ipc.Client.call(Client.java:696)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:828)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1127)
        at org.uiuc.upcrc.extClasses.WordCount.main(WordCount.java:70)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

Thanks, Akhil
Re: Neither "OOM Java Heap Space" nor "GC Overhead Limit Exceeded"
Hi Jason! I finally found out that there was some problem in reserving the heap size, which I have now resolved. It turns out we cannot change HADOOP_HEAPSIZE using export from a user account after Hadoop has been started; it has to be changed by root. I have a user account on the cluster, and I was trying to change the heap size from my user account using 'export', which had no effect. So I had to request my cluster administrator to increase HADOOP_HEAPSIZE in hadoop-env.sh and then restart Hadoop. Now the program is running absolutely fine. Thanks for your help.

One thing that I would like to ask you: can we use DistributedCache for transferring directories to the local cache of the tasks?

Thanks, Akhil
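Two different heap settings are involved here, and only one of them needs root. HADOOP_HEAPSIZE in conf/hadoop-env.sh sizes the Hadoop daemons themselves (jobtracker, tasktrackers) and takes effect only after whoever runs the daemons restarts them:

    # conf/hadoop-env.sh (cluster-wide, requires a daemon restart)
    export HADOOP_HEAPSIZE=2500

The JVMs that actually execute map and reduce tasks on a real cluster are separate child processes, and their heap can be raised per job without touching the cluster at all; a one-line sketch for the job driver, value illustrative:

    conf.set("mapred.child.java.opts", "-Xmx2500m");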
Re: Neither "OOM Java Heap Space" nor "GC Overhead Limit Exceeded"
Thanks Jason. I went inside the code of the statement and found out that it eventually makes a binaryRead function call to read a binary file, and there it gets stuck. Do you know whether there is any problem in giving a binary file for addition to the distributed cache?

In the statement

DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf);

Data is a directory which contains some text as well as some binary files. In the statement

Parameters.readConfigAndLoadExternalData("Config/allLayer1.config");

I can see (in the output messages) that it is able to read the text files, but it gets stuck at the binary files. So I think the problem is that it is not able to read the binary files: either they have not been transferred to the cache, or a binary file cannot be read. Do you know the solution to this?

Thanks, Akhil

jason hadoop wrote: Something is happening inside of your (Parameters.readConfigAndLoadExternalData("Config/allLayer1.config");) code, and the framework is killing the job for not heartbeating for 600 seconds.
Re: Neither "OOM Java Heap Space" nor "GC Overhead Limit Exceeded"
Hi Jason! Thanks for sticking with me to solve my problem. To restate things and make them easier to understand: I am working in local mode, in the directory which contains the job jar and also the Config and Data directories. I just removed the following three statements from my code:

DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf);
DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Config/"), conf);
DistributedCache.createSymlink(conf);

The program executes till the same point as before and terminates. That means the above three statements are of no use while working in local mode: in local mode, the working directory for the MapReduce tasks is the current working directory in which you started the hadoop command to execute the job. Since I have removed the DistributedCache.addCacheFile statements, there should be no issue of whether I was giving a file name or a directory name as the argument. Now it seems to me that there is some problem in reading the binary file using binaryRead. Please let me know if I am going wrong anywhere.

Thanks, Akhil

jason hadoop wrote: I have only ever used the distributed cache to add files, including binary files such as shared libraries. It looks like you are adding a directory. The DistributedCache is not generally used for passing data, but for passing file names. The files must be stored in a shared file system (HDFS for simplicity) already. The distributed cache makes the names available to the tasks, and the files are extracted from HDFS and stored in the task-local work area on each tasktracker node. It looks like you may be storing the contents of your files in the distributed cache.
Neither "OOM Java Heap Space" nor "GC Overhead Limit Exceeded"
Hi All, I am running my mapred program in local mode, by setting mapred.job.tracker to local, so that I can debug my code. The mapred program is a direct porting of my original sequential code. There is no reduce phase; basically, I have just put my program in the map class.

My program takes around 1-2 minutes to instantiate the data objects in the constructor of the Map class (it loads some data model files, therefore it takes some time). After the instantiation part in the constructor of the Map class, the map function is supposed to process the input split. The problem is that the data objects never get instantiated completely; in between (while it is still in the constructor) the program stops, giving the exceptions pasted at the bottom.

The program runs fine without MapReduce and does not require more than 2GB memory, but in MapReduce, even after doing export HADOOP_HEAPSIZE=2500 (I am working on machines with 16GB RAM), the program fails. I have also set HADOOP_OPTS="-server -XX:-UseGCOverheadLimit", as sometimes I was getting "GC Overhead Limit Exceeded" exceptions too. Somebody please help me with this problem: I have been trying to debug it for the last 3 days, but unsuccessfully. Thanks!

java.lang.OutOfMemoryError: Java heap space
        at sun.misc.FloatingDecimal.toJavaFormatString(FloatingDecimal.java:889)
        at java.lang.Double.toString(Double.java:179)
        at java.text.DigitList.set(DigitList.java:272)
        at java.text.DecimalFormat.format(DecimalFormat.java:584)
        at java.text.DecimalFormat.format(DecimalFormat.java:507)
        at java.text.NumberFormat.format(NumberFormat.java:269)
        at org.apache.hadoop.util.StringUtils.formatPercent(StringUtils.java:110)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1147)
        at LbjTagger.NerTagger.main(NerTagger.java:109)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

09/06/16 12:34:41 WARN mapred.LocalJobRunner: job_local_0001
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:79)
        ... 5 more
Caused by: java.lang.ThreadDeath
        at java.lang.Thread.stop(Thread.java:715)
        at org.apache.hadoop.mapred.LocalJobRunner.killJob(LocalJobRunner.java:310)
        at org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:315)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1224)
        at LbjTagger.NerTagger.main(NerTagger.java:109)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
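One caveat worth adding for local-mode debugging (an observation about the stock 0.20-era bin/hadoop launcher, not something stated in this thread): with mapred.job.tracker set to local, the whole job runs inside the single client JVM, so mapred.child.java.opts, which only configures the child JVMs forked by a tasktracker, has no effect. The client JVM's heap has to be raised instead, for example:

    # bin/hadoop appends HADOOP_OPTS after the -Xmx it derives from
    # HADOOP_HEAPSIZE, and HotSpot honours the last -Xmx it sees.
    # The value and the jar/class names are illustrative.
    export HADOOP_OPTS="-server -XX:-UseGCOverheadLimit -Xmx2500m"
    hadoop jar myjob.jar my.pkg.MyDriver input output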
Re: Neither "OOM Java Heap Space" nor "GC Overhead Limit Exceeded"
One more thing: finally it terminates there (after some time) by giving this final exception:

java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
        at LbjTagger.NerTagger.main(NerTagger.java:109)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

akhil1988 wrote:

Thank you Jason for your reply. My Map class is an inner class and it is a static class. Here is the structure of my code:

    public class NerTagger {

        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            private Text word = new Text();
            private static NETaggerLevel1 tagger1 = new NETaggerLevel1();
            private static NETaggerLevel2 tagger2 = new NETaggerLevel2();

            Map() {
                System.out.println("HI2\n");
                Parameters.readConfigAndLoadExternalData("Config/allLayer1.config");
                System.out.println("HI3\n");
                Parameters.forceNewSentenceOnLineBreaks = Boolean.parseBoolean("true");
                System.out.println("loading the tagger");
                tagger1 = (NETaggerLevel1) Classifier.binaryRead(Parameters.pathToModelFile + ".level1");
                System.out.println("HI5\n");
                tagger2 = (NETaggerLevel2) Classifier.binaryRead(Parameters.pathToModelFile + ".level2");
                System.out.println("Done- loading the tagger");
            }

            public void map(LongWritable key, Text value,
                            OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                String inputline = value.toString();
                /* Processing of the input pair is done here */
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(NerTagger.class);
            conf.setJobName("NerTagger");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setNumReduceTasks(0);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            conf.set("mapred.job.tracker", "local");
            conf.set("fs.default.name", "file:///");

            DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf);
            DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Config/"), conf);
            DistributedCache.createSymlink(conf);

            conf.set("mapred.child.java.opts", "-Xmx4096m");

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            System.out.println("HI1\n");
            JobClient.runJob(conf);
        }
    }

Jason, when the program executes, HI1 and HI2 are printed but it never reaches HI3. In the statement Parameters.readConfigAndLoadExternalData("Config/allLayer1.config"); it is able to access the Config/allLayer1.config file (while executing this statement it prints some messages, like which data it is loading, etc.), but it gets stuck there (while loading some classifier) and never reaches HI3. This program runs fine when executed normally (without MapReduce).

Thanks, Akhil

jason hadoop wrote: Is it possible that your map class is an inner class and not static?
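A structural side note on the code above: in the old mapred API, per-task setup of this kind normally lives in configure(JobConf), which MapReduceBase lets you override and which the framework calls once per task after instantiating the mapper, rather than in the constructor. A sketch against the class above (imports and the map method as in the original; the timeout remark is a general observation, since a long load can still trip mapred.task.timeout):

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private NETaggerLevel1 tagger1;
        private NETaggerLevel2 tagger2;

        // Called once per task, with the job configuration available.
        public void configure(JobConf job) {
            Parameters.readConfigAndLoadExternalData("Config/allLayer1.config");
            Parameters.forceNewSentenceOnLineBreaks = Boolean.parseBoolean("true");
            tagger1 = (NETaggerLevel1) Classifier.binaryRead(Parameters.pathToModelFile + ".level1");
            tagger2 = (NETaggerLevel2) Classifier.binaryRead(Parameters.pathToModelFile + ".level2");
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // processing as in the original
        }
    }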
Re: Implementing Client-Server architecture using MapReduce
Can anyone help me with this issue? I have an account on the cluster, but I cannot go and start a server process on each tasktracker myself.

Akhil
Implementing Client-Server architecture using MapReduce
Hi All, I am porting a machine learning application to Hadoop using MapReduce. The architecture of the application goes like this:

1. Run a number of server processes which take around 2-3 minutes to start and then remain as daemons waiting for a client to ask for a connection. During startup, these server processes get trained on the training dataset.

2. A client is then run which connects to the servers and processes or tests any data that it wants to. The client is basically our job, which will be converted to the MapReduce model of Hadoop.

Now, since each server takes a good amount of time to start, needless to say we want each of these server processes to be pre-running on all the tasktrackers (all nodes), so that when a MapReduce (client) task comes to a node, the servers are already running and the client just uses them for its purpose. The server process keeps running, waiting for another map task that may be assigned to that node. That means a server process is started on each node once and waits for connections from clients. When clients (implemented as map tasks) come to that node, they connect to the server, do their processing, and leave (or finish).

Can you please tell me how I should go about starting the server on each node? If I am not clear, please ask any questions. Any help in this regard will be greatly appreciated.

Thank You! Akhil
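On the client side of such a setup, the map task itself can stay thin; below is a minimal sketch of a map-side helper talking to a per-node daemon, assuming the daemons were started out of band (by an admin or an init script) and listen on a known local port. The port number and the one-line request/response protocol are invented for illustration:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.Socket;

    public class TaggerClient {
        // Send one record to the daemon on this node and read back the result.
        public static String process(String record) throws IOException {
            Socket s = new Socket("localhost", 4444);
            try {
                PrintWriter out = new PrintWriter(s.getOutputStream(), true);
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(s.getInputStream()));
                out.println(record);
                return in.readLine();
            } finally {
                s.close();
            }
        }
    }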
Giving classpath in "hadoop jar" command, i.e. while executing a MapReduce job
I wish to give the path of a jar file as an argument when executing the "hadoop jar" command, as my mapper uses that jar file for its operation. I found that the -libjars option can be used, but for me it is not working; it gives an exception. Can anyone tell me how to use the -libjars generic option, and whether any change needs to be made in the code while using -libjars?

Thank You in Advance! Akhil Langer
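A likely cause, for what it is worth: -libjars is one of the generic options consumed by GenericOptionsParser, which only runs when the driver is launched through ToolRunner, so a main() that builds a JobConf and calls JobClient.runJob directly never sees it (the warning "Use GenericOptionsParser for parsing the arguments" elsewhere in this archive points at the same thing). A sketch of a Tool-based driver; the class, jar, and path names are hypothetical:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyDriver extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            // getConf() already has -libjars and the other generic options applied.
            JobConf conf = new JobConf(getConf(), MyDriver.class);
            // ... set mapper, input and output paths from args ...
            JobClient.runJob(conf);
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new MyDriver(), args));
        }
    }

Invocation would then look like:

    hadoop jar myjob.jar MyDriver -libjars /path/to/dependency.jar input output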
Processing files lying in a directory structure
Hi! I am working on applying the WordCount example to the entire Wikipedia dump. The entire English Wikipedia is around 200GB, which I have stored in HDFS on a cluster to which I have access. The problem: the Wikipedia dump contains many directories (it has a very big directory structure) containing HTML files, but FileInputFormat requires all the files to be processed to be present in a single directory. Can anybody give any idea, or does something already exist, for applying WordCount to these HTML files without changing the directory structure?

Akhil
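One approach, sketched under the assumption that the 0.20 mapred API is in use: walk the tree once in the driver and register every file as an input path, leaving the dump's layout untouched (the class and method names here are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class RecursiveInputs {
        // Add every regular file under dir (at any depth) as a job input.
        public static void addInputsRecursively(JobConf conf, FileSystem fs, Path dir)
                throws IOException {
            for (FileStatus stat : fs.listStatus(dir)) {
                if (stat.isDir()) {
                    addInputsRecursively(conf, fs, stat.getPath());
                } else {
                    FileInputFormat.addInputPath(conf, stat.getPath());
                }
            }
        }
    }

If the tree has a fixed depth, a glob achieves the same in one line, e.g. FileInputFormat.setInputPaths(conf, new Path("/wikipedia/*/*")), since input paths may contain wildcards.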