[ https://issues.apache.org/jira/browse/HADOOP-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12551214 ]
stack commented on HADOOP-2341:
-------------------------------

Pardon me Raghu, I don't follow what you mean by pread. I presume you're not referring to the C library call and instead mean some Java equivalent. MapFile as-is works for hbase because its index eases random access. At any given time, CLOSE_WAIT sockets in the client are a minority; the more usual case is an idle open connection, and yes, hbase has both issues. Above I misspoke when I said an upload last night was occupying 300+ connections all in CLOSE_WAIT; it actually went over 900 open connections in the datanode. I don't know how many were in CLOSE_WAIT (and yes, these CLOSE_WAITs are different from the client-side CLOSE_WAITs I've been going on about above). Is there an existing issue, that you know of, to move the datanode to select-based I/O?

Let's keep this issue open a while longer if you don't mind. It would be good to have more confidence that we can account for all open sockets in the datanode before closing it. For example, the 900+ open connections in the datanode cited above seem excessive when I try to count how many open MapFiles should have been present at the time. Also, I am having trouble matching the blocked reads in the client to the blocked writes cited in the thread dumps attached above.
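To pin down the terminology: "pread" on the Java side is usually taken to mean the positional read(position, buffer, offset, length) on FSDataInputStream, as opposed to the seek()-then-read() streaming access that MapFile-style lookups do. The sketch below only illustrates that distinction under that assumption; the path, offset and buffer size are made up.

{code}
// Illustrative sketch only (not from any attached patch): positional read
// ("pread") versus streaming seek()+read(). Path and offset are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path("/hbase/hypothetical/mapfile/data"));
    byte[] buf = new byte[8 * 1024];
    try {
      // Positional read: does not move the stream's current position, so many
      // random reads can share one open stream without interfering.
      int n = in.read(123456L, buf, 0, buf.length);

      // Streaming access: seek() then read(), which mutates the stream's
      // position -- roughly the access pattern MapFile lookups use today.
      in.seek(123456L);
      int m = in.read(buf, 0, buf.length);
      System.out.println("pread bytes=" + n + ", streaming bytes=" + m);
    } finally {
      in.close();
    }
  }
}
{code}

The distinction matters only in that a positional read is stateless with respect to the stream, while the streaming path keeps per-stream position state; neither by itself explains the idle or CLOSE_WAIT connections discussed above.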
> Datanode active connections never returns to 0
> ----------------------------------------------
>
>                 Key: HADOOP-2341
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2341
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.0
>            Reporter: Paul Saab
>         Attachments: dfsclient.patch, hregionserver-stack.txt, stacks-XX.XX.XX.XXX.txt, stacks-YY.YY.YY.YY.txt
>
>
> On trunk I continue to see the following in my data node logs:
> 2007-12-03 15:46:47,696 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 42
> 2007-12-03 15:46:48,135 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 41
> 2007-12-03 15:46:48,439 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 40
> 2007-12-03 15:46:48,479 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 39
> 2007-12-03 15:46:48,611 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 38
> 2007-12-03 15:46:48,898 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 37
> 2007-12-03 15:46:48,989 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 36
> 2007-12-03 15:46:51,010 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 35
> 2007-12-03 15:46:51,758 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 34
> 2007-12-03 15:46:52,148 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 33
> This number never returns to 0, even after many hours of no new data being manipulated or added into the DFS.
> Looking at netstat -tn I see a significant amount of data in the send-q that never goes away:
> tcp        0  34240 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:55792 ESTABLISHED
> tcp        0  38968 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:38169 ESTABLISHED
> tcp        0  38456 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:35456 ESTABLISHED
> tcp        0  29640 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:59845 ESTABLISHED
> tcp        0  50168 ::ffff:XX.XX.XX.XXX:50010 ::ffff:YY.YY.YY.YY:44584 ESTABLISHED
> When sniffing the network I see that the remote side (YY.YY.YY.YY) is returning a window size of 0:
> 16:11:41.760474 IP XX.XX.XX.XXX.50010 > YY.YY.YY.YY.44584: . ack 3339984123 win 46 <nop,nop,timestamp 1786247180 885681789>
> 16:11:41.761597 IP YY.YY.YY.YY.44584 > XX.XX.XX.XXX.50010: . ack 1 win 0 <nop,nop,timestamp 885801786 1775711351>
> Then, when we look at the stack traces on each datanode, there are tons of threads that *never* go away, all stuck in the following trace:
> {code}
> Thread 6516 ([EMAIL PROTECTED]):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.net.SocketOutputStream.socketWrite0(Native Method)
>     java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>     java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>     java.io.DataOutputStream.write(DataOutputStream.java:90)
>     org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1400)
>     org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1433)
>     org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:904)
>     org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:849)
>     java.lang.Thread.run(Thread.java:619)
> {code}
> Unfortunately there's very little in the logs with exceptions that could point to this. I have some exceptions like the following, but nothing that points to problems between XX and YY:
> {code}
> 2007-12-02 11:19:47,889 WARN dfs.DataNode - Unexpected error trying to delete block blk_4515246476002110310. Block not found in blockMap.
> 2007-12-02 11:19:47,922 WARN dfs.DataNode - java.io.IOException: Error in deleting blocks.
>         at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:750)
>         at org.apache.hadoop.dfs.DataNode.processCommand(DataNode.java:675)
>         at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:569)
>         at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1720)
>         at java.lang.Thread.run(Thread.java:619)
> {code}
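As a footnote to the socketWrite0 stack trace quoted above: the hang can be reproduced outside Hadoop with plain java.net. In the sketch below (self-contained and purely illustrative, with made-up names and sizes), one end writes to a peer that never reads; once the writer's send buffer and the peer's receive buffer fill, the peer advertises a zero TCP window and write() blocks indefinitely, which matches both the stuck DataNode$BlockSender threads and the ESTABLISHED connections with a non-draining send-q in the netstat output.

{code}
// Minimal demo (not Hadoop code): a blocking socket write that never returns
// once the peer stops reading and advertises a zero TCP window.
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class ZeroWindowDemo {
  public static void main(String[] args) throws Exception {
    ServerSocket listener = new ServerSocket(0);

    // Peer that connects and then deliberately never reads from its socket.
    Socket idlePeer = new Socket("localhost", listener.getLocalPort());
    Socket writerSide = listener.accept();

    OutputStream out = writerSide.getOutputStream();
    byte[] chunk = new byte[64 * 1024];
    long written = 0;
    while (true) {
      // After the local send buffer and idlePeer's receive buffer fill up,
      // this call parks forever in SocketOutputStream.socketWrite0 -- the same
      // frame the DataNode threads above are stuck in.
      out.write(chunk);
      written += chunk.length;
      System.out.println("bytes written so far: " + written);
    }
  }
}
{code}

In practice the blocked write only returns if the peer starts reading again, the connection is reset, or a write timeout is imposed on the sending side, which would explain why those DataXceiver threads and their connections never go away on their own.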