[ https://issues.apache.org/jira/browse/HDFS-8429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhouyingchao updated HDFS-8429:
-------------------------------
    Attachment: HDFS-8429-003.patch

Tested cases include TestParallelShortCircuitLegacyRead, TestParallelShortCircuitRead, TestParallelShortCircuitReadNoChecksum, TestParallelShortCircuitReadUnCached, TestShortCircuitCache, TestShortCircuitLocalRead, TestShortCircuitShm, TemporarySocketDirectory, TestDomainSocket, TestDomainSocketWatcher

> The DomainSocketWatcher thread should not block other threads if it dies
> ------------------------------------------------------------------------
>
>                 Key: HDFS-8429
>                 URL: https://issues.apache.org/jira/browse/HDFS-8429
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: zhouyingchao
>            Assignee: zhouyingchao
>         Attachments: HDFS-8429-001.patch, HDFS-8429-002.patch, HDFS-8429-003.patch
>
>
> In our cluster, an application hung while doing a short-circuit read of a
> local HDFS block. Looking into the log, we found that the DataNode's
> DomainSocketWatcher.watcherThread had exited with the following log:
> {code}
> ERROR org.apache.hadoop.net.unix.DomainSocketWatcher: Thread[Thread-25,5,main] terminating on unexpected exception
> java.lang.NullPointerException
>         at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:463)
>         at java.lang.Thread.run(Thread.java:662)
> {code}
> Line 463 is in the following code snippet:
> {code}
>          try {
>             for (int fd : fdSet.getAndClearReadableFds()) {
>               sendCallbackAndRemove("getAndClearReadableFds", entries, fdSet, fd);
>             }
> {code}
> getAndClearReadableFds is a native method that mallocs an int array.
> Since our memory was very tight, it looks like the malloc failed and a NULL
> pointer was returned, so the for-each loop threw the NullPointerException.
> The bad thing is that other threads then blocked with a stack like this:
> {code}
> "DataXceiver for client unix:/home/work/app/hdfs/c3prc-micloud/datanode/dn_socket [Waiting for operation #1]" daemon prio=10 tid=0x00007f0c9c086d90 nid=0x8fc3 waiting on condition [0x00007f09b9856000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000007b0174808> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:323)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:322)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:403)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:214)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:95)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
>         at java.lang.Thread.run(Thread.java:662)
> {code}
> IMO, we should exit the DN so that users know that something went
> wrong and can fix it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
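
To make the failure mode in the description concrete, here is a minimal, self-contained Java sketch. It is not the DomainSocketWatcher code and not necessarily what the attached patches do. The class WatcherSketch, its fields, and the readableFds() stand-in for the native call are all hypothetical names introduced for illustration. It shows two things: a for-each loop over a null int[] throws NullPointerException, and a watcher thread that marks itself closed and signals its condition in a finally block lets callers of add() fail fast instead of parking forever. This is one alternative to exiting the process, which is what the description proposes.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a watcher thread; not the real DomainSocketWatcher. */
class WatcherSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition processedCond = lock.newCondition();
  private final boolean simulateAllocFailure;
  private final Thread watcherThread;
  private long requested = 0;      // number of add() calls made so far
  private long processed = 0;      // number of add() calls the watcher has handled
  private boolean closed = false;  // set once the watcher thread is exiting

  WatcherSketch(boolean simulateAllocFailure) {
    this.simulateAllocFailure = simulateAllocFailure;
    this.watcherThread = new Thread(this::watchLoop, "watcher-sketch");
    this.watcherThread.setDaemon(true);
  }

  void start() { watcherThread.start(); }

  void close() { watcherThread.interrupt(); }

  /** Stand-in for the native call; returns null to mimic a failed malloc. */
  private int[] readableFds() {
    return simulateAllocFailure ? null : new int[0];
  }

  private void watchLoop() {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        for (int fd : readableFds()) {  // throws NullPointerException if the array is null
          // a real watcher would dispatch a callback for fd here
        }
        lock.lock();
        try {
          processed = requested;        // acknowledge every pending add()
          processedCond.signalAll();
        } finally {
          lock.unlock();
        }
        Thread.sleep(5);
      }
    } catch (InterruptedException e) {
      // close() was called; fall through to the cleanup below
    } catch (Throwable t) {
      System.err.println("watcher terminating on unexpected exception: " + t);
    } finally {
      // Whatever the reason for exiting, wake every waiter so nobody parks forever.
      lock.lock();
      try {
        closed = true;
        processedCond.signalAll();
      } finally {
        lock.unlock();
      }
    }
  }

  /** Blocks until the watcher acknowledges the request, or fails if the watcher has died. */
  void add(int fd) {
    lock.lock();
    try {
      long ticket = ++requested;
      while (processed < ticket && !closed) {
        processedCond.awaitUninterruptibly();
      }
      if (processed < ticket) {
        throw new IllegalStateException("watcher is closed; fd " + fd + " was not added");
      }
    } finally {
      lock.unlock();
    }
  }

  public static void main(String[] args) {
    WatcherSketch w = new WatcherSketch(true);  // simulate the failed allocation
    w.start();
    try {
      w.add(3);
    } catch (IllegalStateException e) {
      System.out.println("add() failed fast: " + e.getMessage());
    }
  }
}
```

In this sketch the waiter re-checks the closed flag every time it wakes, so it does not matter whether add() is called before or after the watcher dies. Without the finally block, a caller that arrives after the NullPointerException would wait on the condition indefinitely, which matches the parked DataXceiver stack above.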