[ https://issues.apache.org/jira/browse/HADOOP-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558871#action_12558871 ]
dhruba borthakur commented on HADOOP-2606:
------------------------------------------

It appears to me that the ReplicationMonitor thread wakes up every 3 seconds and does one iteration. Each iteration scans the neededReplication list once for every datanode. If a cluster has 2000 datanodes and 20K blocks per datanode, then decommissioning 40 nodes means that the size of the neededReplication list is almost 800 thousand (40 x 20K). Thus, this list of 800 thousand entries is scanned 2000 times every 3 seconds. Heavy CPU consumption!

> Namenode unstable when replicating 500k blocks at once
> ------------------------------------------------------
>
>                 Key: HADOOP-2606
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2606
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.14.3
>            Reporter: Koji Noguchi
>             Fix For: 0.17.0
>
>
> We tried to decommission about 40 nodes at once, each containing 12k blocks (about 500k total).
> (This also happened when we first tried to decommission 2 million blocks.)
> Clients started experiencing "java.lang.RuntimeException: java.net.SocketTimeoutException: timed out waiting for rpc response" and the namenode was in a 100% CPU state.
> It was spending most of its time on one thread,
> "[EMAIL PROTECTED]" daemon prio=10 tid=0x0000002e10702800 nid=0x6718 runnable [0x0000000041a42000..0x0000000041a42a30]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
>         at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)
>         - locked <0x0000002aa3cef720> (a org.apache.hadoop.dfs.UnderReplicatedBlocks)
>         - locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
>         at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
>         at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
>         at java.lang.Thread.run(Thread.java:619)
> We confirmed that the Namenode was not in a full GC state when these problems happened.
> Also, dfsadmin -metasave showed that "Blocks waiting for replication" was decreasing very slowly.
> I believe this is not specific to decommissioning; the same problem would happen if we lost one rack.
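
For reference, the cost estimate in the comment at the top of this message can be worked through as a small back-of-the-envelope sketch. This is not Hadoop code: the class and method names are hypothetical, and it assumes that every block on a decommissioned node ends up on the neededReplication list and that the whole list is rescanned once per datanode, as the comment describes.

```java
public class ReplicationScanCost {

    /** Size of the pending list, assuming every block on the
     *  decommissioned nodes needs a new replica (an assumption). */
    static long neededReplications(int decommissionedNodes, long blocksPerNode) {
        return decommissionedNodes * blocksPerNode;
    }

    /** Entries visited in one monitor iteration when the whole list
     *  is rescanned once for each datanode in the cluster. */
    static long visitsPerIteration(long neededReplications, int datanodes) {
        return neededReplications * datanodes;
    }

    public static void main(String[] args) {
        // Figures taken from the comment: 2000 datanodes, 20K blocks each,
        // 40 nodes decommissioned, one iteration every 3 seconds.
        long pending = neededReplications(40, 20_000L);   // 800,000
        long visits = visitsPerIteration(pending, 2000);  // 1,600,000,000

        System.out.println("pending blocks       = " + pending);
        System.out.println("visits per iteration = " + visits);
        System.out.println("visits per second    = " + visits / 3);
    }
}
```

Under these assumptions the per-iteration work grows with the product of list size and cluster size, which would be consistent with the thread dump showing the ReplicationMonitor thread inside pendingTransfers and containingNodeList.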