[ https://issues.apache.org/jira/browse/HDFS-13828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581356#comment-16581356 ]
Amithsha edited comment on HDFS-13828 at 8/15/18 5:06 PM:
----------------------------------------------------------

I agree that the xceiver count may not be sufficient, but why does this happen only on particular nodes? Also, it is not happening on just one node; it is happening on a particular set of nodes.

Adding the thread dump and datanode log.

"DataXceiver for client DFSClient_attempt_1526704594842_1801529_m_008193_0_1144052212_1 at /x.x.x.x:38313 [Waiting for operation #28]" #55366018 daemon prio=5 os_prio=0 tid=0x00007fcdaa0ca000 nid=0x128c runnable [0x00007fcd24485000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x0000000794fde658> (a sun.nio.ch.Util$2)
        - locked <0x0000000794fde640> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000007b072db98> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        - locked <0x00000005e9dc1de0> (a java.io.BufferedInputStream)
        at java.io.DataInputStream.readShort(DataInputStream.java:312)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)
        at java.lang.Thread.run(Thread.java:745)

"DataXceiver for client DFSClient_attempt_1526704594842_1801529_m_003865_0_-1268040697_1 at /x.x.x.x:9258 [Sending block BP-1733841164-x.x.x.x-1440204182440:blk_8704233925_7644500095]" #55361352 daemon prio=5 os_prio=0 tid=0x00007fcdaa360000 nid=0xc93d runnable [0x00007fcca559e000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x00000007a5ae6e30> (a sun.nio.ch.Util$2)
        - locked <0x00000007a5ae6e18> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000079d242470> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)

> DataNode breaching Xceiver Count
> --------------------------------
>
>                 Key: HDFS-13828
>                 URL: https://issues.apache.org/jira/browse/HDFS-13828
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.1
>            Reporter: Amithsha
>            Priority: Critical
>
> We observed the xceiver count limit of 4096 being breached on a particular
> set of 5 to 8 nodes in a 900-node cluster.
> We stopped the datanode services on those nodes so that their blocks were
> re-replicated across the cluster. Even after that, we observed the same
> issue on a new set of nodes.
> Q1: Why does this happen on particular nodes? After a node is
> decommissioned, its data should be replicated across the cluster, so why
> does the issue reappear on a different set of nodes?
> Assumptions:
> Reading a particular block or piece of data on that node might be the cause,
> but that should have been mitigated by the decommission, and it was not.
> Those MR jobs are triggered from Hive, so we suspect that the query refers
> to the same block multiple times in different stages and creates this issue.
> From the thread dump:
> Of the 4090+ xceiver threads created on that node, nearly 4000 belonged to
> multiple mappers of the same AppId and were in a no-operation state.
>
> Any suggestions on this?
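
The thread-dump observation above (most xceiver threads belonging to one AppId and sitting idle) can be checked mechanically. The following is a minimal sketch and not part of the original report: it assumes the dump is plain jstack text with thread headers shaped like the two shown in this comment. The function name `summarize_xceivers` and both regular expressions are illustrative assumptions.

```python
import re
from collections import Counter

# Header of a DataXceiver thread as it appears in the jstack output above:
# "DataXceiver for client <client> at <addr> [<status>]" ...
HEADER = re.compile(
    r'^"DataXceiver for client (?P<client>\S+) at \S+ \[(?P<status>[^\]]*)\]"'
)
# An MR task attempt name embeds attempt_<cluster timestamp>_<app sequence>_...
ATTEMPT = re.compile(r"attempt_(\d+)_(\d+)_")


def summarize_xceivers(dump_text):
    """Return (threads per application id, threads per status kind)."""
    by_app = Counter()
    by_status = Counter()
    for line in dump_text.splitlines():
        match = HEADER.match(line.strip())
        if match is None:
            continue  # not a DataXceiver thread header
        attempt = ATTEMPT.search(match.group("client"))
        if attempt:
            app = "application_{}_{}".format(*attempt.groups())
        else:
            app = "unknown"  # client is not an MR task attempt
        by_app[app] += 1
        # Drop the trailing operation number or block id so that
        # threads doing the same kind of work are grouped together.
        status = re.sub(r"\s+(#\d+|BP-\S+)$", "", match.group("status"))
        by_status[status] += 1
    return by_app, by_status
```

Calling `summarize_xceivers(open(path).read())` on a saved dump (the path is whatever file the jstack output was written to) and then `by_app.most_common(5)` would show whether a single application holds most of the threads, and `by_status` would show how many of them are in "Waiting for operation" rather than "Sending block".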