namenode slowdown when orphan block(s) left in neededReplication
----------------------------------------------------------------

                 Key: HADOOP-1113
                 URL: https://issues.apache.org/jira/browse/HADOOP-1113
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.10.1
            Reporter: dhruba borthakur


There were about 200 files that had some under-replicated blocks. A "dfs 
-setrep 4" followed by a "dfs -setrep 3" was done on these files. Most of the 
replications took place but the namenode CPU usage got stuck at 99%. The 
cluster has about 450 datanodes.

In stack traces of the namenode, we saw that there is always one thread of the 
following type:

"IPC Server handler 3 on 8020" daemon prio=1 tid=0x0000002d941c7d30 nid=0x2d52 runnable [0x0000000042072000..0x0000000042072eb0]
        at org.apache.hadoop.dfs.FSDirectory.getFileByBlock(FSDirectory.java:745)
        - waiting to lock <0x0000002aa212f030> (a org.apache.hadoop.dfs.FSDirectory$INode)
        at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2155)
        - locked <0x0000002aa210f6b8> (a java.util.TreeSet)
        - locked <0x0000002aa21401a0> (a org.apache.hadoop.dfs.FSNamesystem)
        at org.apache.hadoop.dfs.NameNode.sendHeartbeat(NameNode.java:521)
        at sun.reflect.GeneratedMethodAccessor55.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:337)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:538)

Also, the namenode is currently not issuing any replication requests (as seen 
from the namenode log). A new "setrep" command took effect immediately.

My belief is that one or more blocks are permanently stuck in 
neededReplication. This causes every heartbeat request to do a lot of 
additional processing, thus leading to higher CPU usage. One possibility is 
that all datanodes that host the replicas of the block in neededReplication 
are down.
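
To illustrate the suspected mechanism, here is a minimal sketch. It is not the 
actual FSNamesystem code: the class OrphanBlockSketch, its fields, and the 
methods addBlock and heartbeat are hypothetical names made up for this 
example, and the scan is a simplified guess at what pendingTransfers does on 
each heartbeat. It assumes that a block leaves the set only when a datanode 
holding a live replica sends a heartbeat.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the suspected behaviour; not Hadoop's implementation.
class OrphanBlockSketch {
    // Stand-in for the namenode's neededReplication set.
    final TreeSet<Long> neededReplication = new TreeSet<>();
    // Assumed bookkeeping: block id -> datanodes currently holding a live replica.
    final Map<Long, Set<String>> liveReplicas = new HashMap<>();
    // Counts how many set entries have been inspected across all heartbeats.
    long blocksExamined = 0;

    void addBlock(long id, String... holders) {
        neededReplication.add(id);
        liveReplicas.put(id, new HashSet<>(Arrays.asList(holders)));
    }

    // Called once per heartbeat; returns the blocks this node is asked to copy.
    List<Long> heartbeat(String node) {
        List<Long> scheduled = new ArrayList<>();
        Iterator<Long> it = neededReplication.iterator();
        while (it.hasNext()) {
            long id = it.next();
            blocksExamined++;
            if (liveReplicas.get(id).contains(node)) {
                // This node can act as the source, so the block leaves the set.
                scheduled.add(id);
                it.remove();
            }
            // Otherwise the block stays and is examined again on the next heartbeat.
        }
        return scheduled;
    }

    public static void main(String[] args) {
        OrphanBlockSketch nn = new OrphanBlockSketch();
        nn.addBlock(1L, "dn1"); // one live replica on dn1
        nn.addBlock(2L);        // orphan: no live datanode holds a replica
        for (int i = 1; i <= 450; i++) {
            nn.heartbeat("dn" + i);
        }
        System.out.println("still needed: " + nn.neededReplication);
        System.out.println("blocks examined: " + nn.blocksExamined);
    }
}
```

In this model block 1 is scheduled by the first heartbeat from dn1, while 
block 2 is never removed and is re-examined by every later heartbeat. If the 
real scan behaves similarly, the extra cost per heartbeat grows with the 
number of stuck blocks and is paid by every datanode on every heartbeat.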



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
