Datanode active connections never returns to 0
----------------------------------------------

                 Key: HADOOP-2341
                 URL: https://issues.apache.org/jira/browse/HADOOP-2341
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.16.0
            Reporter: Paul Saab


On trunk i continue to see the following in my data node logs:

2007-12-03 15:46:47,696 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of 
active connections is: 42
2007-12-03 15:46:48,135 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of 
active connections is: 41
2007-12-03 15:46:48,439 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of 
active connections is: 40
2007-12-03 15:46:48,479 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of 
active connections is: 39
2007-12-03 15:46:48,611 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of 
active connections is: 38
2007-12-03 15:46:48,898 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of 
active connections is: 37
2007-12-03 15:46:48,989 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of 
active connections is: 36
2007-12-03 15:46:51,010 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of 
active connections is: 35
2007-12-03 15:46:51,758 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of 
active connections is: 34
2007-12-03 15:46:52,148 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of 
active connections is: 33

This number never returns to 0, even after many hours of no new data being 
manipulated or added into the DFS.

Looking at netstat -tn i see significant amount of data in the send-q that 
never goes away:

tcp        0  34240 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:55792   
ESTABLISHED 
tcp        0  38968 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:38169   
ESTABLISHED 
tcp        0  38456 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:35456   
ESTABLISHED 
tcp        0  29640 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:59845   
ESTABLISHED 
tcp        0  50168 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:44584   
ESTABLISHED 

When sniffing the network I see that the remote side (YY.YY.YY.YY) is returning 
a window size of 0
16:11:41.760474 IP XX.XX.XX.XXX.50010 > YY.YY.YY.YY.44584: . ack 3339984123 win 
46 <nop,nop,timestamp 1786247180 885681789>
16:11:41.761597 IP YY.YY.YY.YY.44584 > XX.XX.XX.XXX.50010: . ack 1 win 0 
<nop,nop,timestamp 885801786 1775711351>

Then we look at the stack traces on each datanode, I will have tons of threads 
that *never* go away in the following trace:
{code}
Thread 6516 ([EMAIL PROTECTED]):
  State: RUNNABLE
  Blocked count: 0
  Waited count: 0
  Stack:
    java.net.SocketOutputStream.socketWrite0(Native Method)
    java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
    java.net.SocketOutputStream.write(SocketOutputStream.java:136)
    java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
    java.io.DataOutputStream.write(DataOutputStream.java:90)
    org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1400)
    org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1433)
    org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:904)
    org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:849)
    java.lang.Thread.run(Thread.java:619)
{code}

Unfortunately there's very little in the logs with exceptions that could point 
to this.  I have some exceptions the following, but nothing that points to 
problems between XX and YY:
{code}
2007-12-02 11:19:47,889 WARN  dfs.DataNode - Unexpected error trying to delete 
block blk_4515246476002110310. Block not found in blockMap. 
2007-12-02 11:19:47,922 WARN  dfs.DataNode - java.io.IOException: Error in 
deleting blocks.
        at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:750)
        at org.apache.hadoop.dfs.DataNode.processCommand(DataNode.java:675)
        at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:569)
        at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1720)
        at java.lang.Thread.run(Thread.java:619)
{code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to