[ https://issues.apache.org/jira/browse/HADOOP-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sanjay Radia updated HADOOP-2448:
---------------------------------

    Fix Version/s:     (was: 0.16.0)

> Improve Block report processing and name node restarts (Master Jira)
> --------------------------------------------------------------------
>
>                 Key: HADOOP-2448
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2448
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Sanjay Radia
>            Assignee: Sanjay Radia
>
> It has been reported that for large clusters (2K datanodes), a restarted
> namenode can often take hours to leave the safe-mode.
> - admins have reported that if the data nodes are started, say 100 at a
>   time, it significantly improves the startup time of the name node
> - setting the initial heap (as opposed to max heap) to be larger also helps
>   - this avoids the GCs before more memory is added to the heap.
>
> Observations of the Name node via JConsole and instrumentation:
> - if 80% of memory is used for maintaining the names and blocks data
>   structures, then processing block reports can generate a lot of GC,
>   causing block reports to take a long time to process. This causes the
>   datanodes that sent the block reports to time out and resend them,
>   making the situation worse.
>
> Hence, to improve the situation, the following are proposed:
> 1. Have random backoffs (of say 60sec for a 1K cluster) of the initial
>    block report sent by a DN. This would match the randomization of the
>    normal hourly block reports. (Jira HADOOP-2326)
> 2. Have the NN tell the DN how much to back off (i.e. rather than a single
>    configuration parameter for the backoff). This would allow the system to
>    adjust automatically to cluster size: smaller clusters will start up
>    faster than larger clusters. (Jira HADOOP-2444)
> 3. Change the block reports to be an array of longs rather than an array of
>    block report objects. This would reduce the amount of memory used to
>    process a block report. It would help the initial startup and also the
>    block report processing during normal operation outside of the
>    safe-mode. (Jira HADOOP-2110)
> 4. Queue and acknowledge the receipt of the block reports and have a
>    separate set of threads process the block report queue. (HADOOP-2111)
>
> Four Jiras have been filed as noted.
>
> Based on experiments, we may not want to proceed with option 4. While
> option 4 did help block report processing when tried on its own, it turned
> out that in combination with option 1 it did not help much. Furthermore,
> cleaning up RPC to remove the client-side timeout (see Jira HADOOP-2188)
> would make this fix obsolete.
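
To make proposals 1 and 2 concrete, here is a minimal Java sketch of how a
datanode might pick the delay before its first block report. It is an
illustration only, not the code attached to any of the Jiras: the class name
BlockReportBackoff, both method names, and the linear scaling rule (60 ms of
window per datanode, derived from the "60sec for a 1K cluster" figure above)
are assumptions made for this example.

```java
import java.util.Random;

// Hypothetical sketch; none of these names come from the Hadoop code base.
class BlockReportBackoff {

    // Assumed scaling: 60 s of backoff window per 1000 datanodes,
    // i.e. 60 ms per datanode.
    static final int MS_PER_DATANODE = 60;

    // Proposal 2 (assumed rule): the namenode chooses a backoff window that
    // grows linearly with the cluster size, so small clusters start faster.
    static int windowForClusterMs(int numDatanodes) {
        return MS_PER_DATANODE * numDatanodes;
    }

    // Proposal 1: the datanode delays its initial block report by a value
    // drawn uniformly from [0, maxBackoffMs). A window of 0 or less
    // disables the backoff.
    static int randomDelayMs(Random rng, int maxBackoffMs) {
        if (maxBackoffMs <= 0) {
            return 0;
        }
        return rng.nextInt(maxBackoffMs);
    }
}
```

With this sketch, a 2K-datanode cluster would get a window of
windowForClusterMs(2000) = 120000 ms, and each datanode would send its first
report after randomDelayMs(new Random(), 120000) milliseconds. The reports are
then spread across the window rather than arriving at the namenode together.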