[ https://issues.apache.org/jira/browse/HADOOP-3466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12616222#action_12616222 ]
Raghu Angadi commented on HADOOP-3466:
--------------------------------------
There is no file descriptor or socket leak that we are aware of in 0.17.0. Any more
detailed information you can provide will help us see whether there is a bug, or
to check what is happening in your case. A couple of things you could do:
# Run jstack on a datanode when you suspect it is in a bad state (see the sketch after this list).
# Attach the datanode log from a node that exhibited this problem.
# Attach the corresponding NameNode log.
# etc.
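For example, once you have a jstack dump from a suspect datanode, a rough sketch
like the one below (a hypothetical stand-alone helper, not part of Hadoop) could
count how many DataXceiver and PacketResponder threads the process is holding,
which you could compare against the ~4,500 DataNode threads reported in the
traceback below:

{code}
// Hypothetical helper (not part of Hadoop): count DataXceiver and
// PacketResponder threads in a jstack dump to gauge how many connections a
// DataNode is holding open. Assumes jstack's usual output format, where each
// thread header line begins with the quoted thread name.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CountDataNodeThreads {
  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: java CountDataNodeThreads <jstack-output-file>");
      System.exit(1);
    }
    int total = 0, xceivers = 0, responders = 0;
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        if (!line.startsWith("\"")) {
          continue;                      // not a thread header line
        }
        total++;
        if (line.contains("DataXceiver")) {
          xceivers++;                    // roughly one per open connection
        } else if (line.contains("PacketResponder")) {
          responders++;                  // roughly one per block being received
        }
      }
    } finally {
      in.close();
    }
    System.out.println("threads total:           " + total);
    System.out.println("DataXceiver threads:     " + xceivers);
    System.out.println("PacketResponder threads: " + responders);
  }
}
{code}

If the DataXceiver count stays in the thousands while the job runs, that would
suggest connections are not being closed on the datanode side.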
> DataNode fails to deliver blocks, holds thousands of open socket connections
> ----------------------------------------------------------------------------
>
> Key: HADOOP-3466
> URL: https://issues.apache.org/jira/browse/HADOOP-3466
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.17.0
> Environment: Hadoop-0.17.0 (and 0.16.4), 25 node Linux grid (24
> DataNodes, 1 master node)
> Reporter: Christopher Gillett
> Priority: Critical
>
> A job that used to run correctly on our grid (in 0.15.0) now fails. The
> failure occurs after the map phase is complete, and about two-thirds of the way
> through the reduce phase. This job is processing a modest amount of input
> data (approximately 220 GB).
> When the error occurs, the nodes hosting DataNodes have literally thousands of
> open socket connections on them. The DataNode instances are holding large
> amounts of memory. Sometimes the DataNodes crash or exit; other times they
> continue to run.
> The error reported from the application's perspective is:
> 08/05/27 11:30:08 INFO mapred.JobClient: map 100% reduce 89%
> 08/05/27 11:30:41 INFO mapred.JobClient: map 100% reduce 90%
> 08/05/27 11:32:45 INFO mapred.JobClient: map 100% reduce 86%
> 08/05/27 11:32:45 INFO mapred.JobClient: Task Id :
> task_200805271056_0001_r_000007_0, Status : FAILED
> java.io.IOException: Could not get block locations. Aborting...
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1832)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1487)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1579)
> I then discovered that one or more DataNode instances on the slave nodes
> are down (we run one DataNode instance per machine). The cause of at
> least some of the DataNode failures is a JVM internal error raised by a
> complete out-of-memory condition (on a 4 GB, 4-way machine).
> Watching the DataNodes run, I can see them consuming more and more
> memory. For those failures that produce a JVM traceback, I see (in
> part; NOTE: this traceback is from 0.16.4):
> #
> # java.lang.OutOfMemoryError: requested 16 bytes for CHeapObj-new. Out of swap space?
> #
> # Internal Error (414C4C4F434154494F4E0E494E4C494E450E4850500017), pid=4246, tid=2283883408
> #
> # Java VM: Java HotSpot(TM) Server VM (1.6.0_02-b05 mixed mode)
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
> --------------- T H R E A D ---------------
> Current thread (0x8a942000): JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_Java, id=15064]
> Stack: [0x881c4000,0x88215000), sp=0x882139e0, free space=318k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
> V [libjvm.so+0x53b707]
> V [libjvm.so+0x225fe1]
> V [libjvm.so+0x16fdc5]
> V [libjvm.so+0x22aef3]
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> v blob 0xf4f235a7
> J java.io.DataInputStream.readInt()I
> j org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(Ljava/io/DataOutputStream;Ljava/io/DataInputStream;Ljava/io/DataOutputStream;Ljava/lang/String;Lorg/apache/hadoop/dfs/DataNode$Throttler;I)V+126
> j org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(Ljava/io/DataInputStream;)V+746
> j org.apache.hadoop.dfs.DataNode$DataXceiver.run()V+174
> j java.lang.Thread.run()V+11
> v ~StubRoutines::call_stub
> --------------- P R O C E S S ---------------
> Java Threads: ( => current thread )
>   0x0ae3f400 JavaThread "process reaper" daemon [_thread_blocked, id=26870]
>   0x852e6000 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_vm, id=26869]
>   0x08a1cc00 JavaThread "PacketResponder 0 for Block blk_-6186975972786687394" daemon [_thread_blocked, id=26769]
>   0x852e5000 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26768]
>   0x0956e000 JavaThread "PacketResponder 0 for Block blk_-2322514873363546651" daemon [_thread_blocked, id=26767]
>   0x852e4400 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26766]
>   0x09d3a400 JavaThread "PacketResponder 0 for Block blk_8926941945313450801" daemon [_thread_blocked, id=26764]
>   0x852e3c00 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26763]
>   0x0953d000 JavaThread "PacketResponder 0 for Block blk_4785883052769066976" daemon [_thread_blocked, id=26762]
>   0xb13a5c00 JavaThread "[EMAIL PROTECTED]" daemon [_thread_in_native, id=26761]
> The interesting part here is that if I count the number of JavaThreads
> running org.apache.hadoop.dfs.DataNode, I see 4,538 (!) in the
> traceback. That number of threads was surprising.
> Other DataNodes just exit without panicking the JVM. In either failure
> mode, the last few lines of the DataNode log file are apparently
> innocuous:
> 2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 got response for connect ack from downstream datanode with firstbadlink as
> 2008-05-27 11:31:47,663 INFO org.apache.hadoop.dfs.DataNode: Datanode 2 forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:48,268 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_-2241766430103062484 src: /10.2.14.10:33626 dest: /10.2.14.10:50010
> 2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_313239508245918539 src: /10.2.14.24:37836 dest: /10.2.14.24:50010
> 2008-05-27 11:31:48,740 INFO org.apache.hadoop.dfs.DataNode: Datanode 0 forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_1684581399908730353 src: /10.2.14.16:51605 dest: /10.2.14.16:50010
> 2008-05-27 11:31:49,044 INFO org.apache.hadoop.dfs.DataNode: Datanode 0 forwarding connect ack to upstream firstbadlink is
> 2008-05-27 11:31:49,509 INFO org.apache.hadoop.dfs.DataNode: Receiving block blk_2493969670086107736 src: /10.2.14.18:47557 dest: /10.2.14.18:50010
> 2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1 got response for connect ack from downstream datanode with firstbadlink as
> 2008-05-27 11:31:49,513 INFO org.apache.hadoop.dfs.DataNode: Datanode 1 forwarding connect ack to upstream firstbadlink is
> Finally, the task-level output (in userlogs) doesn't reveal much
> either:
> 2008-05-27 11:38:30,724 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Need 34 map output(s)
> 2008-05-27 11:38:30,753 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 done copying task_200805271056_0001_m_001976_0 output from worker9.
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Got 33 known map output location(s); scheduling...
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Scheduled 1 of 33 known outputs (0 slow hosts and 32 dup hosts)
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Copying task_200805271056_0001_m_001248_0 output from worker8.
> 2008-05-27 11:38:31,727 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 Need 33 map output(s)
> 2008-05-27 11:38:31,752 INFO org.apache.hadoop.mapred.ReduceTask: task_200805271056_0001_r_000007_1 done copying task_200805271056_0001_m_001248_0 output from worker8.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.