[
https://issues.apache.org/jira/browse/HADOOP-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
stack updated HADOOP-2341:
--------------------------
Attachment: dfsclient.patch
Ok. So now I think we're back around to Paul's original complaint: that
active connections never go to zero.
I instrumented DFSClient (see attached patch -- maybe I should go ahead and
make an lsof-like tool for HDFS clients, keeping a map of open files, adding a
file on open and removing it from the map on close? See the sketch at the end of
this comment). Every open during a test run has a corresponding close. After the
test run, I see this in the datanode log:
{code}
...
2007-12-08 00:56:06,429 INFO org.apache.hadoop.dfs.DataNode: XX.XX.XX.140:50010Served block blk_-107599750641758201 to /XX.XX.XX.140
2007-12-08 00:56:06,429 DEBUG org.apache.hadoop.dfs.DataNode: XX.XX.XX.140:50010:Number of active connections is: 10
2007-12-08 00:56:06,431 DEBUG org.apache.hadoop.dfs.DataNode: Number of active connections is: 10
2007-12-08 00:56:06,431 INFO org.apache.hadoop.dfs.DataNode: XX.XX.XX.140:50010Served block blk_8573562922806881814 to /XX.XX.XX.140
2007-12-08 00:56:06,431 DEBUG org.apache.hadoop.dfs.DataNode: XX.XX.XX.140:50010:Number of active connections is: 10
..
{code}
... and this when I look w/ netstat:
{code}
[EMAIL PROTECTED] ~]$ netstat -tn|grep -v ESTABLISHED|grep CLOSE_WAIT
tcp 38 0 XX.XX.XX.140:51141 208.76.46.110:389 CLOSE_WAIT
tcp 38 0 XX.XX.XX.140:55612 208.76.46.110:389 CLOSE_WAIT
tcp 38 0 fdfb:c1e8:7640:2:2e0::53696 fdfb:c1e8:7640:6:2e0:81:389 CLOSE_WAIT
tcp 1 0 ::ffff:XX.XX.XX.140:56087 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 1 0 ::ffff:XX.XX.XX.140:56088 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 1 0 ::ffff:XX.XX.XX.140:53432 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 37585 0 ::ffff:XX.XX.XX.140:53428 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 1893 0 ::ffff:XX.XX.XX.140:53430 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 28174 0 ::ffff:XX.XX.XX.140:53684 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 1078 0 ::ffff:XX.XX.XX.140:53686 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 1 0 ::ffff:XX.XX.XX.140:39643 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 28230 0 ::ffff:XX.XX.XX.140:39777 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 28230 0 ::ffff:XX.XX.XX.140:39782 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 1 0 ::ffff:XX.XX.XX.140:39831 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 1 0 ::ffff:XX.XX.XX.140:39832 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 0 0 ::ffff:XX.XX.XX.140:40871 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 1 0 ::ffff:XX.XX.XX.140:40872 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 1 0 ::ffff:XX.XX.XX.140:41448 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 34056 0 ::ffff:XX.XX.XX.140:41444 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
tcp 1848 0 ::ffff:XX.XX.XX.140:41446 ::ffff:XX.XX.XX.140:50010 CLOSE_WAIT
{code}
As far as I can tell, the application -- an hbase regionserver -- is calling
close, but the close is not being run properly inside DFSClient. Anyone
have suggestions on how I might figure out what's up with DFSClient or, if it's
hbase, how else I might go about looking for an improper close?
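For reference, here is a rough sketch of the lsof-like tracking I have in mind.
The class and method names below are made up for illustration and are not part of
the attached dfsclient.patch; the idea is just to keep a map of open streams,
record who opened them, and remove the entry on close, so anything left in the
map after a test run is an unclosed stream.
{code}
// Hypothetical sketch only -- not part of dfsclient.patch.
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class OpenFileTracker {
  // id -> description (path plus the stack frame that opened it)
  private static final Map<Long, String> OPEN = new ConcurrentHashMap<Long, String>();
  private static final AtomicLong IDS = new AtomicLong();

  /** Wrap a freshly opened stream; returns a handle that must be closed. */
  public static TrackedStream track(String path, Closeable stream) {
    long id = IDS.incrementAndGet();
    OPEN.put(id, path + " opened at " + callerOf());
    return new TrackedStream(id, stream);
  }

  /** Dump whatever is still open -- the analogue of running lsof. */
  public static void dumpOpenFiles() {
    for (Map.Entry<Long, String> e : OPEN.entrySet()) {
      System.err.println("STILL OPEN: #" + e.getKey() + " " + e.getValue());
    }
  }

  private static String callerOf() {
    // Record the first frame outside this class so leaks can be traced to a caller.
    for (StackTraceElement f : Thread.currentThread().getStackTrace()) {
      if (!f.getClassName().contains("OpenFileTracker")
          && !f.getClassName().startsWith("java.lang.Thread")) {
        return f.toString();
      }
    }
    return "unknown";
  }

  /** Delegating wrapper that removes its entry from the map on close(). */
  public static class TrackedStream implements Closeable {
    private final long id;
    private final Closeable delegate;

    TrackedStream(long id, Closeable delegate) {
      this.id = id;
      this.delegate = delegate;
    }

    public void close() throws IOException {
      OPEN.remove(id);   // a second close is harmless; the entry is already gone
      delegate.close();
    }
  }
}
{code}
In practice every FileSystem.open() in the client (or in the hbase regionserver)
would be routed through track(), and dumpOpenFiles() at the end of a test run
would list exactly which paths, if any, were opened but never closed.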
> Datanode active connections never returns to 0
> ----------------------------------------------
>
> Key: HADOOP-2341
> URL: https://issues.apache.org/jira/browse/HADOOP-2341
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.16.0
> Reporter: Paul Saab
> Attachments: dfsclient.patch, hregionserver-stack.txt, stacks-XX.XX.XX.XXX.txt, stacks-YY.YY.YY.YY.txt
>
>
> On trunk I continue to see the following in my datanode logs:
> 2007-12-03 15:46:47,696 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 42
> 2007-12-03 15:46:48,135 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 41
> 2007-12-03 15:46:48,439 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 40
> 2007-12-03 15:46:48,479 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 39
> 2007-12-03 15:46:48,611 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 38
> 2007-12-03 15:46:48,898 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 37
> 2007-12-03 15:46:48,989 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 36
> 2007-12-03 15:46:51,010 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 35
> 2007-12-03 15:46:51,758 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 34
> 2007-12-03 15:46:52,148 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 33
> This number never returns to 0, even after many hours of no new data being
> manipulated or added into the DFS.
> Looking at netstat -tn I see a significant amount of data in the send-q that
> never goes away:
> tcp 0 34240 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:55792 ESTABLISHED
> tcp 0 38968 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:38169 ESTABLISHED
> tcp 0 38456 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:35456 ESTABLISHED
> tcp 0 29640 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:59845 ESTABLISHED
> tcp 0 50168 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:44584 ESTABLISHED
> When sniffing the network I see that the remote side (YY.YY.YY.YY) is
> returning a window size of 0:
> 16:11:41.760474 IP XX.XX.XX.XXX.50010 > YY.YY.YY.YY.44584: . ack 3339984123 win 46 <nop,nop,timestamp 1786247180 885681789>
> 16:11:41.761597 IP YY.YY.YY.YY.44584 > XX.XX.XX.XXX.50010: . ack 1 win 0 <nop,nop,timestamp 885801786 1775711351>
> When we look at the stack traces on each datanode, I have tons of
> threads that *never* go away, all stuck in the following trace:
> {code}
> Thread 6516 ([EMAIL PROTECTED]):
> State: RUNNABLE
> Blocked count: 0
> Waited count: 0
> Stack:
> java.net.SocketOutputStream.socketWrite0(Native Method)
> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
> java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
> java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
> java.io.DataOutputStream.write(DataOutputStream.java:90)
> org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1400)
> org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1433)
> org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:904)
> org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:849)
> java.lang.Thread.run(Thread.java:619)
> {code}
> Unfortunately there's very little in the logs with exceptions that could
> point to this. I have some exceptions like the following, but nothing that
> points to problems between XX and YY:
> {code}
> 2007-12-02 11:19:47,889 WARN dfs.DataNode - Unexpected error trying to delete block blk_4515246476002110310. Block not found in blockMap.
> 2007-12-02 11:19:47,922 WARN dfs.DataNode - java.io.IOException: Error in deleting blocks.
> at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:750)
> at org.apache.hadoop.dfs.DataNode.processCommand(DataNode.java:675)
> at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:569)
> at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1720)
> at java.lang.Thread.run(Thread.java:619)
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.