[ https://issues.apache.org/jira/browse/HADOOP-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548934 ]

stack commented on HADOOP-2341:
-------------------------------

I did a review of hbase and found a leak, HADOOP-2362, but fixing it doesn't 
seem to solve this problem.

But on review, I'm now thinking that it could be the way HBase is using HDFS 
that accounts for all the occupied threads over in the datanode.  Below I 
explain why I think this.  I would appreciate any input or alternate suggestions 
on HBase's usage of HDFS.

After doing an upload, I left the cluster alone until it hit a steady state.
Here's what netstat looks like if I only look at the ESTABLISHED connections:

{code}
[EMAIL PROTECTED] hbase]$ netstat -tn|grep ESTABLISHED|sort > /tmp/ns.txt
..
tcp        0  95488 ::ffff:XX.XX.XX.140:50010  ::ffff:XX.XX.XX.140:32853  ESTABLISHED
tcp        0  95488 ::ffff:XX.XX.XX.140:50010  ::ffff:XX.XX.XX.140:32860  ESTABLISHED
tcp        0  95488 ::ffff:XX.XX.XX.140:50010  ::ffff:XX.XX.XX.140:32866  ESTABLISHED
tcp        0  95488 ::ffff:XX.XX.XX.140:50010  ::ffff:XX.XX.XX.140:32872  ESTABLISHED
...
tcp    71976      0 ::ffff:XX.XX.XX.140:60511  ::ffff:XX.XX.XX.140:50010  ESTABLISHED
tcp    71976      0 ::ffff:XX.XX.XX.140:60513  ::ffff:XX.XX.XX.140:50010  ESTABLISHED
tcp    71976      0 ::ffff:XX.XX.XX.140:60711  ::ffff:XX.XX.XX.140:50010  ESTABLISHED
tcp    71976      0 ::ffff:XX.XX.XX.140:60713  ::ffff:XX.XX.XX.140:50010  ESTABLISHED
tcp    71976      0 ::ffff:XX.XX.XX.140:60999  ::ffff:XX.XX.XX.140:50010  ESTABLISHED
{code}

The number of connections remained constant.

Counting the open connections, I was able to make a rough correlation with the 
count of open MapFiles in HBase.

I wrote a little program (imports and a class wrapper filled in here for completeness):

{code}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.io.MapFile;

// Opens a single MapFile.Reader against HDFS and holds it open.
public class OpenOneReader {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    MapFile.Reader reader =
      new MapFile.Reader(FileSystem.get(conf), args[0], conf);
    // Sleep so the reader, and whatever connection it holds, stays open.
    Thread.sleep(1000 * 1000);
  }
}
{code}

Every time I ran my program on XX.XX.XX.140, the ESTABLISHED count went up by 
one, as did the number of occupied threads in the datanode.
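
To push on that correlation, here is a variant of the little program above -- just a 
sketch, with the class name, argument handling, and comments being mine -- that opens 
several readers at once and then closes them.  If the ESTABLISHED count really tracks 
open readers, it should go up by N while they are open and come back down after the 
close() calls:

{code}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.io.MapFile;

// Illustrative sketch: open N readers over the same MapFile, hold them,
// then close them, so netstat can be checked at each step.
public class MultiReaderCheck {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    FileSystem fs = FileSystem.get(conf);
    int n = Integer.parseInt(args[1]);
    MapFile.Reader[] readers = new MapFile.Reader[n];
    for (int i = 0; i < n; i++) {
      readers[i] = new MapFile.Reader(fs, args[0], conf);
    }
    Thread.sleep(100 * 1000);   // check netstat: count should be up by n
    for (MapFile.Reader reader : readers) {
      reader.close();           // does the count (and thread count) drop back?
    }
    Thread.sleep(100 * 1000);   // check netstat again
  }
}
{code}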

HBase keeps its MapFile.Readers open.  The number of concurrently open Readers 
goes up as a server deploys more and more regions.
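
For context, the pattern in question is roughly the following -- a simplified sketch 
under my own naming, not HBase's actual code -- where each deployed store file gets a 
MapFile.Reader that is opened once and kept open to serve random reads:

{code}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

// Simplified sketch of the usage pattern: one long-lived reader per MapFile,
// reused for every random read instead of being reopened per request.
public class LongLivedReaders {
  private final HBaseConfiguration conf = new HBaseConfiguration();
  private final FileSystem fs;
  private final Map<String, MapFile.Reader> readers =
    new HashMap<String, MapFile.Reader>();

  public LongLivedReaders() throws IOException {
    this.fs = FileSystem.get(conf);
  }

  // Called as a region/store is deployed: open the reader and keep it around.
  public void open(String mapFileDir) throws IOException {
    readers.put(mapFileDir, new MapFile.Reader(fs, mapFileDir, conf));
  }

  // Random reads go against the already-open reader.
  public Writable get(String mapFileDir, WritableComparable key, Writable val)
      throws IOException {
    return readers.get(mapFileDir).get(key, val);
  }

  // Only when the region closes does the reader go away.
  public void close(String mapFileDir) throws IOException {
    readers.remove(mapFileDir).close();
  }
}
{code}

If each such open reader pins a thread and a socket on the datanode, the steady-state 
numbers above would be what you'd expect.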

Is there something wrong with this pattern -- should HBase not be doing this -- 
or is there something wrong with the datanode such that a thread is completely 
occupied while a client has a MapFile.Reader open, or is this just the way it's 
supposed to work?

> Datanode active connections never returns to 0
> ----------------------------------------------
>
>                 Key: HADOOP-2341
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2341
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.0
>            Reporter: Paul Saab
>         Attachments: hregionserver-stack.txt, stacks-XX.XX.XX.XXX.txt, 
> stacks-YY.YY.YY.YY.txt
>
>
> On trunk I continue to see the following in my datanode logs:
> 2007-12-03 15:46:47,696 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 42
> 2007-12-03 15:46:48,135 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 41
> 2007-12-03 15:46:48,439 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 40
> 2007-12-03 15:46:48,479 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 39
> 2007-12-03 15:46:48,611 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 38
> 2007-12-03 15:46:48,898 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 37
> 2007-12-03 15:46:48,989 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 36
> 2007-12-03 15:46:51,010 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 35
> 2007-12-03 15:46:51,758 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 34
> 2007-12-03 15:46:52,148 DEBUG dfs.DataNode - XX.XX.XX.XXX:50010:Number of active connections is: 33
> This number never returns to 0, even after many hours of no new data being 
> manipulated or added into the DFS.
> Looking at netstat -tn I see a significant amount of data in the Send-Q that 
> never goes away:
> tcp        0  34240 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:55792   ESTABLISHED
> tcp        0  38968 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:38169   ESTABLISHED
> tcp        0  38456 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:35456   ESTABLISHED
> tcp        0  29640 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:59845   ESTABLISHED
> tcp        0  50168 ::ffff:XX.XX.XX.XXX:50010   ::ffff:YY.YY.YY.YY:44584   ESTABLISHED
> When sniffing the network I see that the remote side (YY.YY.YY.YY) is 
> returning a window size of 0:
> 16:11:41.760474 IP XX.XX.XX.XXX.50010 > YY.YY.YY.YY.44584: . ack 3339984123 win 46 <nop,nop,timestamp 1786247180 885681789>
> 16:11:41.761597 IP YY.YY.YY.YY.44584 > XX.XX.XX.XXX.50010: . ack 1 win 0 <nop,nop,timestamp 885801786 1775711351>
> When we look at the stack traces on each datanode, there are tons of 
> threads that *never* go away, all with the following trace:
> {code}
> Thread 6516 ([EMAIL PROTECTED]):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
>     java.net.SocketOutputStream.socketWrite0(Native Method)
>     java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
>     java.net.SocketOutputStream.write(SocketOutputStream.java:136)
>     java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>     java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
>     java.io.DataOutputStream.write(DataOutputStream.java:90)
>     org.apache.hadoop.dfs.DataNode$BlockSender.sendChunk(DataNode.java:1400)
>     org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1433)
>     org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:904)
>     org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:849)
>     java.lang.Thread.run(Thread.java:619)
> {code}
> Unfortunately there's very little in the logs with exceptions that could 
> point to this.  I have some exceptions like the following, but nothing that 
> points to problems between XX and YY:
> {code}
> 2007-12-02 11:19:47,889 WARN  dfs.DataNode - Unexpected error trying to delete block blk_4515246476002110310. Block not found in blockMap.
> 2007-12-02 11:19:47,922 WARN  dfs.DataNode - java.io.IOException: Error in deleting blocks.
>         at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:750)
>         at org.apache.hadoop.dfs.DataNode.processCommand(DataNode.java:675)
>         at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:569)
>         at org.apache.hadoop.dfs.DataNode.run(DataNode.java:1720)
>         at java.lang.Thread.run(Thread.java:619)
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
