[ 
https://issues.apache.org/jira/browse/HDFS-13828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581356#comment-16581356
 ] 

Amithsha commented on HDFS-13828:
---------------------------------

Agree on xceiver count may be not sufficient but why for a particular node. And 
also its no on one node its on particular set of nodes.
Adding the thread dump and datanode log.

"DataXceiver for client 
DFSClient_attempt_1526704594842_1801529_m_008193_0_1144052212_1 at 
/x.x.x.x:38313 [Waiting for operation #28]" #55366018 daemon prio=5 os_prio=0 
tid=0x00007fcdaa0ca000 nid=0x128c runnable [0x00007fcd24485000]

   java.lang.Thread.State: RUNNABLE

        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)

        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)

        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)

        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

        - locked <0x0000000794fde658> (a sun.nio.ch.Util$2)

        - locked <0x0000000794fde640> (a java.util.Collections$UnmodifiableSet)

        - locked <0x00000007b072db98> (a sun.nio.ch.EPollSelectorImpl)

        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)

        at 
org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)

        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)

        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)

        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)

        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)

        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)

        - locked <0x00000005e9dc1de0> (a java.io.BufferedInputStream)

        at java.io.DataInputStream.readShort(DataInputStream.java:312)

        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)

        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:227)

        at java.lang.Thread.run(Thread.java:745)

 

"DataXceiver for client 
DFSClient_attempt_1526704594842_1801529_m_003865_0_-1268040697_1 at 
/x.x.x.x:9258 [Sending block 
BP-1733841164-x.x.x.x-1440204182440:blk_8704233925_7644500095]" #55361352 
daemon prio=5 os_prio=0 tid=0x00007fcdaa360000 nid=0xc93d runnable 
[0x00007fcca559e000]

   java.lang.Thread.State: RUNNABLE

        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)

        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)

        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)

        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

        - locked <0x00000007a5ae6e30> (a sun.nio.ch.Util$2)

        - locked <0x00000007a5ae6e18> (a java.util.Collections$UnmodifiableSet)

        - locked <0x000000079d242470> (a sun.nio.ch.EPollSelectorImpl)

        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)

 

 

 

 

> DataNode breaching Xceiver Count
> --------------------------------
>
>                 Key: HDFS-13828
>                 URL: https://issues.apache.org/jira/browse/HDFS-13828
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.1
>            Reporter: Amithsha
>            Priority: Critical
>
> We were observing the breach of the xceiver count 4096, On a particular set 
> of nodes from 5 - 8 nodes in a 900 nodes cluster.
> And we stopped the datanode services on those nodes and made to replicate 
> across the cluster. After that also, we observed the same issue on a new set 
> of nodes.
> Q1: Why on a particular node, and also after decommissioning the node the 
> data should be replicated across the cluster, But why again difference set of 
> node?
> Assumptions :
> Reading a particular block/ data on that node might be the cause for this but 
> it should be mitigated after the decommission but not why? So suspected that 
> those MR jobs are triggered from Hive, so the query might be referring to the 
> same block mulitple times  in different stages and creating this issue?
> From Thread Dump :
> Thread dump of datanode says that out of 4090+ xceiver threads created on 
> that node nearly 4000+ where belong to the same AppId of multiple mappers 
> with state no operation.
>  
> Any suggestions on this?
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to