Re: The reduce copier failed
Hmm... could this be something specific to org.apache.lucene.index.MergePolicy?

Arun

Joe wrote:

2008-09-25 17:12:18,250 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200809180916_0027_r_07_2: Got 2 new map-outputs number of known map outputs is 21
2008-09-25 17:12:18,251 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809180916_0027_r_07_2 Merge of the inmemory files threw an exception: java.io.IOException: Intermedate merge failed
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2133)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2064)
Caused by: org.apache.lucene.index.MergePolicy$MergeException: segment _bfu exists in external directory yet the MergeScheduler executed the merge in a separate thread
    at org.apache.lucene.index.IndexWriter.copyExternalSegments(IndexWriter.java:2362)
    at org.apache.lucene.index.IndexWriter.addIndexesNoOptimize(IndexWriter.java:2307)
    at org.apache.hadoop.contrib.index.mapred.IntermediateForm.process(IntermediateForm.java:135)
    at org.apache.hadoop.contrib.index.mapred.IndexUpdateCombiner.reduce(IndexUpdateCombiner.java:56)
    at org.apache.hadoop.contrib.index.mapred.IndexUpdateCombiner.reduce(IndexUpdateCombiner.java:38)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.combineAndSpill(ReduceTask.java:2160)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$3100(ReduceTask.java:341)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2120)
    ... 1 more
2008-09-25 17:12:19,087 INFO org.apache.hadoop.mapred.ReduceTask: Read 14523136 bytes from map-output for attempt_200809180916_0027_m_16_0
2008-09-25 17:12:19,087 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200809180916_0027_m_16_0 - (41, 10651153) from ars1dev6
2008-09-25 17:12:19,110 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 14506735 bytes (14506735 raw bytes) into RAM from attempt_200809180916_0027_m_09_0
2008-09-25 17:12:19,226 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
2008-09-25 17:12:19,228 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.IOException: attempt_200809180916_0027_r_07_2 The reduce copier failed
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:255)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)

Thanks,
Joe
Re: 1 file per record
hi, I'm writing an application which computes using the entire data from a file. For that purpose I don't want to split my file; the entire file should go to one map task. I've been able to override isSplitable() to do this, and the file is not getting split now. Then I had to store the input values into an array (in the map function) and then proceed with my computation. When I displayed that array, I found only the last line of the file getting displayed... Does this mean that the data is read line by line by the line reader and not continuously? If so, what should I do in order to read the complete contents of the file? Thank you, Chandravadana S

Enis Soztutar wrote: Yes, you can use MultiFileInputFormat. You can extend MultiFileInputFormat to return a RecordReader, which reads a record for each file in the MultiFileSplit. Enis

chandra wrote: hi.. By setting isSplitable false, we can send one file with n records to one mapper. Is there any way to get one complete file per record? Thanks in advance, Chandravadana S
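One common cause of "only the last line shows up" (not confirmed from this thread, but a well-known Hadoop gotcha) is that the framework reuses the same Writable value object across calls to map(), so storing the reference each time leaves you with N references to one object that ends up holding the last line. A minimal pure-Java sketch of the effect, with a hypothetical mutable Holder standing in for Hadoop's Text:

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {
    // Stand-in for org.apache.hadoop.io.Text: a mutable, reusable holder.
    static class Holder {
        private String s;
        void set(String s) { this.s = s; }
        @Override public String toString() { return s; }
    }

    public static void main(String[] args) {
        String[] lines = {"first", "second", "third"};
        Holder reused = new Holder();            // the framework reuses one value object
        List<Holder> byReference = new ArrayList<>();
        List<String> byCopy = new ArrayList<>();
        for (String line : lines) {
            reused.set(line);                    // simulates the next call to map()
            byReference.add(reused);             // WRONG: every element is the same object
            byCopy.add(reused.toString());       // RIGHT: take a copy of the contents
        }
        System.out.println(byReference);         // [third, third, third]
        System.out.println(byCopy);              // [first, second, third]
    }
}
```

If this is what's happening, copying the value (e.g. new Text(value), or value.toString()) before storing it in the array should fix the symptom even with the line-by-line reader.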
Re: Questions about Hadoop
Edward, Can you describe more about Hama with respect to Hadoop? I've read through the Incubator proposal and your blog -- it's a great approach. Are there any benchmarks available? E.g., size of data sets used, kinds of operations performed, etc. Will this project be able to make use of existing libraries? Best, Paco

On Thu, Sep 25, 2008 at 9:31 PM, Edward J. Yoon [EMAIL PROTECTED] wrote: The decision making system seems interesting to me. :) The question I want to ask is whether it is possible to perform statistical analysis on the data using Hadoop and MapReduce. I'm sure Hadoop could do it. FYI, the Hama project provides easy-to-use matrix algebra and its use in statistical analysis on Hadoop and HBase. (It is still in its early stage.) /Edward
Re: small hadoop cluster setup question
Could you please attach your configurations and logs?

On Fri, Sep 26, 2008 at 6:12 AM, Ski Gh3 [EMAIL PROTECTED] wrote: Hi all, I'm trying to set up a small cluster with 3 machines. I'd like to have one machine serve as the namenode and the jobtracker, while all 3 serve as datanodes and tasktrackers. After following the setup instructions, I got an exception running $HADOOP_HOME/bin/start-dfs.sh: java.net.BindException: Address already in use, and my secondarynamenode cannot be started. But as far as I can see, the namenode and the three datanodes all started. Does this matter? However, when I go to the web interface at localhost:50070, I see there is only 1 datanode. Is there any reason why the datanodes are not started on the other two machines? The site.xml file and the java path, installation path, etc. are all set up. Has anyone had this problem before? I'd really appreciate any help! Thanks!
Re: Adding $CLASSPATH to Map/Reduce tasks
maybe you can use bin/hadoop jar -libjars ${your-depends-jars} your.mapred.jar args

see details: http://hadoop.apache.org/core/docs/r0.18.1/api/org/apache/hadoop/mapred/JobShell.html

On Thu, Sep 25, 2008 at 12:26 PM, David Hall [EMAIL PROTECTED] wrote: On Sun, Sep 21, 2008 at 9:41 PM, David Hall [EMAIL PROTECTED] wrote: On Sun, Sep 21, 2008 at 9:35 PM, Arun C Murthy [EMAIL PROTECTED] wrote: On Sep 21, 2008, at 2:05 PM, David Hall wrote: (New to this list) Hi, My research group is setting up a small (20-node) cluster. All of these machines are linked by NFS. We have a fairly entrenched codebase/development cycle, and in particular we'd like to be able to access user $CLASSPATHs in the forked JVMs run by the Map and Reduce tasks. However, TaskRunner.java (http://tinyurl.com/4enkg4) seems to disallow this by specifying its own. Using jars on NFS for too many tasks might hurt if you have thousands of tasks, causing too much load. The better solution might be to use the DistributedCache: http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache Specifically: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addArchiveToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration) and http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addFileToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration) Arun Good point.. I hadn't thought of that, but at the moment we're dealing with barrier-to-adoption rather than efficiency.
We'll have to go back to PBS if we can't get users (read: picky PhD students) on board. I'd rather avoid that scenario... In the meantime, I think I figured out a hack that I'm going to try. In case anyone's curious, the hack is to create a jar file with a manifest that has the Class-Path field set to all the directories and jars you want, to put that jar in the lib/ folder of another jar, and to pass that final jar in as the User Jar to a job. Works like a charm. :-) -- David Thanks! -- David Is there any easy way to trick hadoop into making these visible? If not, if I were to submit a patch that would (optionally) add $CLASSPATH to the forked JVMs' classpath, would it be considered? Thanks, David Hall
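The "manifest with a Class-Path field" part of the hack above can be sketched with the JDK's java.util.jar API (the jar name and Class-Path entries below are hypothetical examples; real entries are space-separated and resolved relative to the jar's own location):

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ClasspathJar {
    public static void main(String[] args) throws IOException {
        Manifest mf = new Manifest();
        Attributes attrs = mf.getMainAttributes();
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        // Hypothetical entries: directories end with '/', jars are named directly.
        attrs.put(Attributes.Name.CLASS_PATH, "lib/lucene-core.jar classes/");

        // Write a jar that contains nothing but the manifest; the manifest
        // alone carries the Class-Path information.
        try (JarOutputStream jos =
                 new JarOutputStream(new FileOutputStream("classpath.jar"), mf)) {
            // no entries needed
        }

        // Read it back to confirm what was written.
        try (JarFile jf = new JarFile("classpath.jar")) {
            System.out.println(
                jf.getManifest().getMainAttributes().getValue("Class-Path"));
        }
    }
}
```

The resulting classpath.jar would then go into the lib/ folder of the job jar, as described in the message above.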
Re: Adding $CLASSPATH to Map/Reduce tasks
Hi, On Fri, Sep 26, 2008 at 10:50 AM, Samuel Guo [EMAIL PROTECTED] wrote: maybe you can use bin/hadoop jar -libjars ${your-depends-jars} your.mapred.jar args see details: http://hadoop.apache.org/core/docs/r0.18.1/api/org/apache/hadoop/mapred/JobShell.html Indeed, I was having the same issue trying to get a Lucene jar file into a running task. Despite what the docs say, it works with the jar option to the hadoop command. (The docs I read said it only worked with job and a couple other commands; unfortunately I don't have a link to that page at the moment.) Joe
Re: IPC Client error | Too many files open
What does jstack show for this? Probably better suited for jira discussion. Raghu.

Goel, Ankur wrote: Hi Folks, We have developed a simple log writer in Java that is plugged into Apache custom log and writes log entries directly to our hadoop cluster (50 machines, quad core, each with 16 GB RAM and 800 GB hard-disk; 1 machine as dedicated Namenode, another machine as JobTracker, TaskTracker + DataNode). There are around 8 Apache servers dumping logs into HDFS via our writer. Everything was working fine and we were getting around 15 - 20 MB log data per hour from each server. Recently we have been experiencing problems with 2-3 of our Apache servers, where a file is opened by the log-writer in HDFS for writing but it never receives any data. Looking at the Apache error logs shows the following errors:

08/09/22 05:02:13 INFO ipc.Client: java.io.IOException: Too many open files
    at sun.nio.ch.IOUtil.initPipe(Native Method)
    at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:49)
    at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:312)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:227)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:149)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:203)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at java.io.DataInputStream.readInt(DataInputStream.java:370)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:289)
    ...
    ...

Followed by connection errors saying "Retrying to connect to server: hadoop-server.com:9000. Already tried 'n' times." ... and it is retrying constantly (the log-writer is set up so that it waits and retries). Doing an lsof on the log writer java process shows that it got stuck in a lot of pipe/event poll and eventually ran out of file handles. Below is part of the lsof output:

lsof -p 2171
COMMAND  PID  USER FD  TYPE DEVICE SIZE NODE     NAME
java     2171 root 20r FIFO 0,7         24090207 pipe
java     2171 root 21w FIFO 0,7         24090207 pipe
java     2171 root 22r      0,8    0    24090208 eventpoll
java     2171 root 23r FIFO 0,7         23323281 pipe
java     2171 root 24r FIFO 0,7         23331536 pipe
java     2171 root 25w FIFO 0,7         23306764 pipe
java     2171 root 26r      0,8    0    23306765 eventpoll
java     2171 root 27r FIFO 0,7         23262160 pipe
java     2171 root 28w FIFO 0,7         23262160 pipe
java     2171 root 29r      0,8    0    23262161 eventpoll
java     2171 root 30w FIFO 0,7         23299329 pipe
java     2171 root 31r      0,8    0    23299330 eventpoll
java     2171 root 32w FIFO 0,7         23331536 pipe
java     2171 root 33r FIFO 0,7         23268961 pipe
java     2171 root 34w FIFO 0,7         23268961 pipe
java     2171 root 35r      0,8    0    23268962 eventpoll
java     2171 root 36w FIFO 0,7         23314889 pipe
...

What in the DFS client (if any) could have caused this? Could it be something else? Is it not ideal to use an HDFS writer to directly write logs from Apache into HDFS? Is Chukwa (the hadoop log collection and analysis framework contributed by Yahoo) a better fit for our case? I would highly appreciate help on any or all of the above questions. Thanks and Regards -Ankur
job details
Hi, I'm trying to figure out which log files are used by the job tracker's web interface to display the following information:

Job Name: my job
Job File: hdfs://localhost:9000/tmp/hadoop-scohen/mapred/system/job_200809260816_0001/job.xml
Status: Succeeded
Started at: Fri Sep 26 08:18:04 CDT 2008
Finished at: Fri Sep 26 08:18:25 CDT 2008
Finished in: 20sec

What I would like to do is back up the log files that are needed to display this information so that we can look at it later if the need arises. When I copy everything from the hadoop home/logs directory into another hadoop home/logs directory, the jobs show up in the history page. However, all I see are the job name and starting time, but not the completion time or the length of the job. Does anyone have any suggestions? Thanks, Shirley
Re: job details
Shirley Cohen wrote: Hi, I'm trying to figure out which log files are used by the job tracker's web interface to display the following information: Job Name: my job Job File: hdfs://localhost:9000/tmp/hadoop-scohen/mapred/system/job_200809260816_0001/job.xml Status: Succeeded Started at: Fri Sep 26 08:18:04 CDT 2008 Finished at: Fri Sep 26 08:18:25 CDT 2008 Finished in: 20sec What I would like to do is backup the log files that are needed to display this information so that we can look at it later if the need arises. When I copy everything from the hadoop home/logs directory into another hadoop home/logs directory, the jobs show up in the history page. However, all I see are the job name and starting time, but not the completion time or the length of the job. Looks like the JobStatus structure. Perhaps you will need to tune the logging information to display this when the log finishes, if it is not listed adequately. -steve
RE: How to input a hdfs file inside a mapper?
I would imagine something like:

FSDataInputStream inFileStream = dfsFileSystem.open(dfsFilePath);

Don't forget to close it after. Thanks, Htin

-----Original Message-----
From: Amit_Gupta [mailto:[EMAIL PROTECTED]
Sent: Friday, September 26, 2008 5:47 AM
To: core-user@hadoop.apache.org
Subject: How to input a hdfs file inside a mapper?

How can I get an input stream on a file stored in HDFS inside a mapper or a reducer? thanks Amit
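Since FSDataInputStream is a java.io.InputStream, one way to keep the mapper-side reading logic simple and testable is to code against InputStream. A sketch under that assumption (the HDFS path and FileSystem call in the comment are illustrative, not from this thread; the test below feeds the same logic an in-memory stream):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class SideFileReader {
    // In a real mapper you might obtain the stream from the configured
    // filesystem, roughly:
    //   FileSystem fs = FileSystem.get(conf);
    //   InputStream in = fs.open(new Path("/some/hdfs/side-file"));  // hypothetical path
    // and then hand it to a method like this one, which stays Hadoop-free.
    static String readFirstLine(InputStream in) throws IOException {
        try (BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
            return r.readLine();
        } // the stream is closed here, even on error
    }

    public static void main(String[] args) throws IOException {
        InputStream fake = new ByteArrayInputStream("lookup-data\nmore".getBytes());
        System.out.println(readFirstLine(fake));
    }
}
```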
Re: Adding $CLASSPATH to Map/Reduce tasks
On Fri, Sep 26, 2008 at 7:50 AM, Samuel Guo [EMAIL PROTECTED] wrote: maybe you can use bin/hadoop jar -libjars ${your-depends-jars} your.mapred.jar args see details: http://hadoop.apache.org/core/docs/r0.18.1/api/org/apache/hadoop/mapred/JobShell.html

Most of our classes aren't packaged in jars. I suppose it wouldn't be too bad to tell ant to jar them up, but with the hack, it's easy enough not to bother. -- David

On Thu, Sep 25, 2008 at 12:26 PM, David Hall [EMAIL PROTECTED] wrote: On Sun, Sep 21, 2008 at 9:41 PM, David Hall [EMAIL PROTECTED] wrote: On Sun, Sep 21, 2008 at 9:35 PM, Arun C Murthy [EMAIL PROTECTED] wrote: On Sep 21, 2008, at 2:05 PM, David Hall wrote: (New to this list) Hi, My research group is setting up a small (20-node) cluster. All of these machines are linked by NFS. We have a fairly entrenched codebase/development cycle, and in particular we'd like to be able to access user $CLASSPATHs in the forked JVMs run by the Map and Reduce tasks. However, TaskRunner.java (http://tinyurl.com/4enkg4) seems to disallow this by specifying its own. Using jars on NFS for too many tasks might hurt if you have thousands of tasks, causing too much load.
The better solution might be to use the DistributedCache: http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache Specifically: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addArchiveToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration) and http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html#addFileToClassPath(org.apache.hadoop.fs.Path,%20org.apache.hadoop.conf.Configuration) Arun Good point.. I hadn't thought of that, but at the moment we're dealing with barrier-to-adoption rather than efficiency. We'll have to go back to PBS if we can't get users (read: picky PhD students) on board. I'd rather avoid that scenario... In the meantime, I think I figured out a hack that I'm going to try. In case anyone's curious, the hack is to create a jar file with a manifest that has the Class-Path field set to all the directories and jars you want, to put that jar in the lib/ folder of another jar, and to pass that final jar in as the User Jar to a job. Works like a charm. :-) -- David Thanks! -- David Is there any easy way to trick hadoop into making these visible? If not, if I were to submit a patch that would (optionally) add $CLASSPATH to the forked JVMs' classpath, would it be considered? Thanks, David Hall
Re: extracting input to a task from a (streaming) job?
I've created a jira describing my problems running under IsolationRunner: https://issues.apache.org/jira/browse/HADOOP-4041 If anyone is using I.R. successfully to re-run failed tasks in a single JVM, can you please, pretty please, describe how you do that? Thank you, -Yuri

On Friday 08 August 2008 10:09:48 Yuri Pradkin wrote: On Thursday 07 August 2008 16:43:10 John Heidemann wrote: On Thu, 07 Aug 2008 19:42:05 +0200, Leon Mergen wrote: Hello John, On Thu, Aug 7, 2008 at 6:30 PM, John Heidemann [EMAIL PROTECTED] wrote: I have a large Hadoop streaming job that generally works fine, but a few (2-4) of the ~3000 maps and reduces have problems. To make matters worse, the problems are system-dependent (we run on a cluster with machines of slightly different OS versions). I'd of course like to debug these problems, but they are embedded in a large job. Is there a way to extract the input given to a reducer from a job, given the task identity? (This would also be helpful for mappers.) I believe you should set keep.failed.tasks.files to true -- this way, given a task id, you can see what input files it has in ~/taskTracker/${taskid}/work (source: http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#IsolationRunner )

IsolationRunner does not work as described in the tutorial. After the task hung, I failed it via the web interface. Then I went to the node that was running this task:

$ cd ...local/taskTracker/jobcache/job_200808071645_0001/work

(this path is already different from the tutorial's)

$ hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
Exception in thread main java.lang.NullPointerException
    at org.apache.hadoop.mapred.IsolationRunner.main(IsolationRunner.java:164)

Looking at the IsolationRunner code, I see this:

164       File workDirName = new File(lDirAlloc.getLocalPathToRead(
165           TaskTracker.getJobCacheSubdir()
166           + Path.SEPARATOR + taskId.getJobID()
167           + Path.SEPARATOR + taskId
168           + Path.SEPARATOR + "work",
169           conf).toString());

I.e. it assumes there is supposed to be a taskID subdirectory under the job dir, but:

$ pwd
...mapred/local/taskTracker/jobcache/job_200808071645_0001
$ ls
jars  job.xml  work

-- it's not there. Any suggestions? Thanks, -Yuri
too many open files error
Hi, I encountered the following FileNotFoundException, resulting from a "too many open files" error, when I tried to run a job. The job had been run several times before without problems. I am confused by the exception because my code closes all the files, and even if it didn't, the job only has 10-20 small input/output files. The limit on open files on my box is 1024. Besides, the error seemed to happen even before the task was executed. I am using the 0.17 version. I'd appreciate it if somebody can shed some light on this issue. BTW, the job ran OK after I restarted hadoop. Yes, the hadoop-site.xml did exist in that directory.

java.lang.RuntimeException: java.io.FileNotFoundException: /home/y/conf/hadoop/hadoop-site.xml (Too many open files)
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:901)
    at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:804)
    at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:772)
    at org.apache.hadoop.conf.Configuration.get(Configuration.java:272)
    at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:414)
    at org.apache.hadoop.mapred.JobConf.getKeepFailedTaskFiles(JobConf.java:306)
    at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.setJobConf(TaskTracker.java:1487)
    at org.apache.hadoop.mapred.TaskTracker.launchTaskForJob(TaskTracker.java:722)
    at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:716)
    at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1274)
    at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:915)
    at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1310)
    at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2251)
Caused by: java.io.FileNotFoundException: /home/y/conf/hadoop/hadoop-site.xml (Too many open files)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at java.io.FileInputStream.<init>(FileInputStream.java:66)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:653)
    at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:225)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:180)
    at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:832)
    ... 12 more

Sometimes it gave me this message:

java.io.IOException: Cannot run program "bash": java.io.IOException: error=24, Too many open files
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
    at org.apache.hadoop.util.Shell.run(Shell.java:134)
    at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:133)
Caused by: java.io.IOException: java.io.IOException: error=24, Too many open files
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
    at java.lang.ProcessImpl.start(ProcessImpl.java:65)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
    ... 6 more

--
Eric Zhang
408-349-2466
Vespa Content team
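Not a diagnosis of this particular failure, but since descriptor leaks accumulate silently until the hard limit is hit, it's worth making sure every stream is closed on all code paths. In modern Java, try-with-resources does this automatically; a minimal self-contained sketch (using a local temp file, not HDFS):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CloseDemo {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        Files.write(tmp, "hello".getBytes());

        // The reader is closed automatically when the block exits,
        // even if readLine() throws -- no leaked file descriptor.
        String line;
        try (BufferedReader r = Files.newBufferedReader(tmp)) {
            line = r.readLine();
        }
        System.out.println(line);
        Files.delete(tmp);
    }
}
```

(In the Java 5/6 era this thread dates from, the equivalent is a finally block that calls close().)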
Could not get block locations. Aborting... exception
Hey all. We've been running into a very annoying problem pretty frequently lately. We'll be running some job, for instance a distcp, and it'll be moving along quite nicely, until all of a sudden it sort of freezes up. It takes a while, and then we'll get an error like this one:

attempt_200809261607_0003_m_02_0: Exception closing file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile
attempt_200809261607_0003_m_02_0: java.io.IOException: Could not get block locations. Aborting...
attempt_200809261607_0003_m_02_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
attempt_200809261607_0003_m_02_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
attempt_200809261607_0003_m_02_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

At approximately the same time, we start seeing lots of these errors in the namenode log:

2008-09-26 16:19:26,502 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
2008-09-26 16:19:26,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 7276, call create(/tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile, rwxr-xr-x, DFSClient_attempt_200809261607_0003_m_02_1, true, 3, 67108864) from 10.100.11.83:60056: error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:952)
    at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:903)
    at org.apache.hadoop.dfs.NameNode.create(NameNode.java:284)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

Eventually, the job fails because of these errors. Subsequent job runs also experience this problem and fail. The only way we've been able to recover is to restart the DFS. It doesn't happen every time, but it does happen often enough that I'm worried. Does anyone have any ideas as to why this might be happening? I thought that https://issues.apache.org/jira/browse/HADOOP-2669 might be the culprit, but today we upgraded to hadoop 0.18.1 and the problem still happens.

Thanks,
Bryan
RE: Could not get block locations. Aborting... exception
Does your failed map task open a lot of files to write? Could you please check the log of the datanode running on the machine where the map tasks failed? Do you see any error message containing "exceeds the limit of concurrent xcievers"?

Hairong

-----Original Message-----
From: Bryan Duxbury [mailto:[EMAIL PROTECTED]
Sent: Fri 9/26/2008 4:36 PM
To: core-user@hadoop.apache.org
Subject: Could not get block locations. Aborting... exception

Hey all. We've been running into a very annoying problem pretty frequently lately. We'll be running some job, for instance a distcp, and it'll be moving along quite nicely, until all of a sudden it sort of freezes up. It takes a while, and then we'll get an error like this one:

attempt_200809261607_0003_m_02_0: Exception closing file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile
attempt_200809261607_0003_m_02_0: java.io.IOException: Could not get block locations. Aborting...
attempt_200809261607_0003_m_02_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
attempt_200809261607_0003_m_02_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
attempt_200809261607_0003_m_02_0: at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

At approximately the same time, we start seeing lots of these errors in the namenode log:

2008-09-26 16:19:26,502 WARN org.apache.hadoop.dfs.StateChange: DIR* NameSystem.startFile: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
2008-09-26 16:19:26,502 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 7276, call create(/tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile, rwxr-xr-x, DFSClient_attempt_200809261607_0003_m_02_1, true, 3, 67108864) from 10.100.11.83:60056: error: org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
org.apache.hadoop.dfs.AlreadyBeingCreatedException: failed to create file /tmp/dustin/input/input_dataunits/_distcp_tmp_1dk90o/part-01897.bucketfile for DFSClient_attempt_200809261607_0003_m_02_1 on client 10.100.11.83 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:952)
    at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:903)
    at org.apache.hadoop.dfs.NameNode.create(NameNode.java:284)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

Eventually, the job fails because of these errors. Subsequent job runs also experience this problem and fail. The only way we've been able to recover is to restart the DFS. It doesn't happen every time, but it does happen often enough that I'm worried. Does anyone have any ideas as to why this might be happening? I thought that https://issues.apache.org/jira/browse/HADOOP-2669 might be the culprit, but today we upgraded to hadoop 0.18.1 and the problem still happens.

Thanks,
Bryan
Re: too many open files error
On 26-Sep-08, at 3:09 PM, Eric Zhang wrote: Hi, I encountered the following FileNotFoundException, resulting from a "too many open files" error, when I tried to run a job. The job had been run several times before without problems. I am confused by the exception because my code closes all the files, and even if it didn't, the job only has 10-20 small input/output files. The limit on open files on my box is 1024. Besides, the error seemed to happen even before the task was executed. I am using the 0.17 version. I'd appreciate it if somebody can shed some light on this issue. BTW, the job ran OK after I restarted hadoop. Yes, the hadoop-site.xml did exist in that directory.

I had the same errors, including the bash one. Running one particular job would cause all subsequent jobs of any kind to fail, even after all running jobs had completed or failed out. This was confusing because the failing jobs themselves often had no relationship to the cause; they were just in a bad environment. If you can't successfully run a dummy job (with the identity mapper and reducer, or a streaming job with cat) once you start getting failures, then you are probably in the same situation. I believe that the problem was caused by increasing the timeout, but I never pinned it down enough to submit a Jira issue. It might have been the XML reader or something else. I was using streaming, hadoop-ec2, and either 0.17.0 or 0.18.0. It would happen just as rapidly after I made an ec2 image with a higher open file limit. Eventually I figured it out by running each job in my pipeline 5 or so times before trying the next one, which let me see which job was causing the problem (because it would eventually fail itself, rather than hosing a later job).

Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra
Re: Failed to start datanodes
Did you configure the hostname correctly on all nodes?

2008/9/26 Jeremy Chow [EMAIL PROTECTED]

Hi list, I've created my hadoop cluster following the tutorial at http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster), but it failed. When I use bin/hadoop dfsadmin -report, it shows that only one datanode is up:

$ bin/hadoop dfsadmin -report
Safe mode is ON
Total raw bytes: 20317106176 (18.92 GB)
Remaining raw bytes: 12607342427 (11.74 GB)
Used raw bytes: 834834432 (796.16 MB)
% used: 4.11%
Total effective bytes: 0 (0 KB)
Effective replication multiplier: Infinity
-
Datanodes available: 1

Name: 192.168.3.8:50010
State : In Service
Total raw bytes: 20317106176 (18.92 GB)
Remaining raw bytes: 12607342427 (11.74 GB)
Used raw bytes: 834834432 (796.16 MB)
% used: 4.11%
Last contact: Fri Sep 26 15:46:19 CST 2008

Then I checked the logs of the datanodes that were expected to run on the other hosts, and found:

2008-09-26 15:47:54,744 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = localhost.jobui.com/127.0.0.1
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.18.1
STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 694836; compiled by 'hadoopqa' on Fri Sep 12 23:29:35 UTC 2008
2008-09-26 15:48:15,896 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /192.168.3.8:54310. Already tried 0 time(s).
2008-09-26 15:48:36,898 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /192.168.3.8:54310. Already tried 1 time(s).
2008-09-26 15:48:57,900 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /192.168.3.8:54310. Already tried 2 time(s).
...

I use 192.168.3.8 as the namenode, and 192.168.3.7, 192.168.3.8, and 192.168.3.9 as datanodes. But the remote datanodes obviously cannot start successfully. When I use jps on 192.168.3.7, it seems to work fine.
$ jps
5131 Jps
4561 DataNode

But the namenode cannot find it. Can anyone give me a solution? Thanks a lot.

Jeremy
-- My research interests are distributed systems, parallel computing and bytecode-based virtual machines. http://coderplay.javaeye.com
-- Sorry for my English!! 明 Please help me correct my English expression and errors in syntax.
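Two quick checks worth running on one of the nodes that fails to join (the addresses below are the ones from this thread; adjust for your cluster). The "host = localhost.jobui.com/127.0.0.1" line in the startup banner suggests the node resolves its own hostname to the loopback address, and the "Retrying connect" loop is also what a firewalled namenode port looks like. A sketch, assuming bash on Linux:

```shell
#!/bin/bash
# 1. The datanode should resolve its own hostname to its LAN address,
#    not 127.0.0.1. See what it currently resolves to:
hostname -f
getent hosts "$(hostname)"
#    If this prints 127.0.0.1, fix /etc/hosts with a real mapping, e.g.
#      192.168.3.7  node7.example.com  node7    (hypothetical names)

# 2. Check that the namenode RPC port is reachable, using bash's
#    built-in /dev/tcp pseudo-device (no nc/telnet required):
if (exec 3<>/dev/tcp/192.168.3.8/54310) 2>/dev/null; then
  echo "namenode port reachable"
else
  echo "connection blocked - check firewall rules on the namenode"
fi
```

If check 2 fails while the namenode process is running, a firewall between the hosts is the first suspect.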
Re: Failed to start datanodes
Hey, I've fixed it. :) The server had a firewall turned on. Regards, Jeremy
Re: Could not get block locations. Aborting... exception
Well, I did find some more errors in the datanode log. Here's a sampling:

2008-09-26 10:43:57,287 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Block blk_-3923611845661840838_176295 is not valid.
    at org.apache.hadoop.dfs.FSDataset.getBlockFile(FSDataset.java:716)
    at org.apache.hadoop.dfs.FSDataset.getLength(FSDataset.java:704)
    at org.apache.hadoop.dfs.DataNode$BlockSender.<init>(DataNode.java:1678)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1101)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)

2008-09-26 10:56:19,325 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.EOFException: while trying to read 65557 bytes
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.readToBuf(DataNode.java:2464)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.readNextPacket(DataNode.java:2508)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.receivePacket(DataNode.java:2572)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2698)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1283)

2008-09-26 10:56:19,779 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.EOFException
    at java.io.DataInputStream.readShort(DataInputStream.java:298)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1021)
    at java.lang.Thread.run(Thread.java:619)

2008-09-26 10:56:21,816 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver:
java.io.IOException: Could not read from stream
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:119)
    at java.io.DataInputStream.readByte(DataInputStream.java:248)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:324)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:345)
    at org.apache.hadoop.io.Text.readString(Text.java:410)

2008-09-26 10:56:28,380 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
    at sun.nio.ch.IOUtil.read(IOUtil.java:206)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)

2008-09-26 10:56:52,387 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.100.11.115:50010, storageID=DS-1784982905-10.100.11.115-50010-1221785192226, infoPort=50075, ipcPort=50020):DataXceiver: java.io.IOException: Too many open files
    at sun.nio.ch.EPollArrayWrapper.epollCreate(Native Method)
    at sun.nio.ch.EPollArrayWrapper.<init>(EPollArrayWrapper.java:59)
    at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:52)
    at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:18)
    at sun.nio.ch.Util.getTemporarySelector(Util.java:123)

The most interesting one in my eyes is the "too many open files" one. My ulimit is 1024. How much should it be? I don't think I have that many files open in my mappers; they should only be operating on a single file at a time. I can try to run the job again and get an lsof if it would be interesting. Thanks for taking the time to reply, by the way.

-Bryan

On Sep 26, 2008, at 4:48 PM, Hairong Kuang wrote: Does your failed map task open a lot of files to write?
Could you please check the log of the datanode running on the machine where the map tasks failed? Do you see any error message containing "exceeds the limit of concurrent xcievers"?

Hairong

From: Bryan Duxbury [mailto:[EMAIL PROTECTED]]
Sent: Fri 9/26/2008 4:36 PM
To: core-user@hadoop.apache.org
Subject: Could not get block locations. Aborting... exception

Hey all. We've been running into a very annoying problem pretty frequently lately. We'll be running some job, for instance a distcp, and it'll be moving along quite nicely, until all of a sudden it sort of freezes up. It takes a while, and then we'll get an error like this one: attempt_200809261607_0003_m_02_0: Exception
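For readers hitting the same wall: the "concurrent xcievers" message Hairong mentions refers to the datanode's cap on simultaneous DataXceiver threads, which in the 0.18 line can be raised in hadoop-site.xml. The property name genuinely carries the "xcievers" misspelling; the value below is only illustrative, not a recommendation:

```xml
<!-- hadoop-site.xml sketch: raise the datanode's concurrent
     DataXceiver thread limit (4096 is illustrative). -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```

Note that each xceiver thread holds file descriptors, so raising this limit without also raising the datanode's ulimit can simply move the failure to "Too many open files".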