[ https://issues.apache.org/jira/browse/HADOOP-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548105 ]
Raghu Angadi commented on HADOOP-2341: -------------------------------------- {quote} tcp 0 50168 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:44584 ESTABLISHED When sniffing the network I see that the remote side (YY.YY.YY.YY) is returning a window size of 0 16:11:41.760474 IP XX.XX.XX.XXX.50010 > YY.YY.YY.YY.44584: . ack 3339984123 win 46 <nop,nop,timestamp 1786247180 885681789> 16:11:41.761597 IP YY.YY.YY.YY.44584 > XX.XX.XX.XXX.50010: . ack 1 win 0 <nop,nop,timestamp 885801786 1775711351> {quote} This is pretty odd. If I looked at just netstat, I would think client is reading a block. But if I look at the tcpdump, I would think client is writing data (note DataNode is acking 3Mb and client is acking 1 byte). Are you sure these two correspond to the same TCP connection? Please include client side stack if you can. Also, if this is on Linux, with strace and lsof, you might be able to match stacktrace and the tcp connection involved on the Datanode. > Datanode active connections never returns to 0 > ---------------------------------------------- > > Key: HADOOP-2341 > URL: https://issues.apache.org/jira/browse/HADOOP-2341 > Project: Hadoop > Issue Type: Bug > Components: dfs > Affects Versions: 0.16.0 > Reporter: Paul Saab > > On trunk i continue to see the following in my data node logs: > 2007-12-03 15:46:47,696 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of > active connections is: 42 > 2007-12-03 15:46:48,135 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of > active connections is: 41 > 2007-12-03 15:46:48,439 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of > active connections is: 40 > 2007-12-03 15:46:48,479 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of > active connections is: 39 > 2007-12-03 15:46:48,611 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of > active connections is: 38 > 2007-12-03 15:46:48,898 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of > active connections is: 37 > 2007-12-03 15:46:48,989 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of > active connections is: 36 > 2007-12-03 15:46:51,010 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of > active connections is: 35 > 2007-12-03 15:46:51,758 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of > active connections is: 34 > 2007-12-03 15:46:52,148 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of > active connections is: 33 > This number never returns to 0, even after many hours of no new data being > manipulated or added into the DFS. > Looking at netstat -tn i see significant amount of data in the send-q that > never goes away: > tcp 0 34240 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:55792 > ESTABLISHED > tcp 0 38968 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:38169 > ESTABLISHED > tcp 0 38456 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:35456 > ESTABLISHED > tcp 0 29640 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:59845 > ESTABLISHED > tcp 0 50168 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:44584 > ESTABLISHED > When sniffing the network I see that the remote side (YY.YY.YY.YY) is > returning a window size of 0 > 16:11:41.760474 IP XX.XX.XX.XXX.50010 > YY.YY.YY.YY.44584: . ack 3339984123 > win 46 <nop,nop,timestamp 1786247180 885681789> > 16:11:41.761597 IP YY.YY.YY.YY.44584 > XX.XX.XX.XXX.50010: . ack 1 win 0 > <nop,nop,timestamp 885801786 1775711351> > Then we look at the stack traces on each datanode, I will have tons of > threads that *never* go away in the following trace: > {code} > Thread 6516 ([EMAIL PROTECTED]): > State: RUNNABLE > Blocked count: 0 > Waited count: 0 > Stack: > java.net.SocketOutputStream.socketWrite0(Native Method) > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92) > java.net.SocketOutputStream.write(SocketOutputStream.java:136) > java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) > java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) > java.io.DataOutputStream.write(DataOutputStream.java:90) > org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1400) > org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1433) > org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:904) > org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:849) > java.lang.Thread.run(Thread.java:619) > {code} > Unfortunately there's very little in the logs with exceptions that could > point to this. I have some exceptions the following, but nothing that points > to problems between XX and YY: > {code} > 2007-12-02 11:19:47,889 WARN dfs.DataNode - Unexpected error trying to > delete block blk_4515246476002110310. Block not found in blockMap. > 2007-12-02 11:19:47,922 WARN dfs.DataNode - java.io.IOException: Error in > deleting blocks. > at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:750) > at org.apache.hadoop.dfs.DataNode.processCommand(DataNode.java:675) > at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:569) > at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1720) > at java.lang.Thread.run(Thread.java:619) > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.