[ https://issues.apache.org/jira/browse/MAPREDUCE-6224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhihai xu updated MAPREDUCE-6224:
---------------------------------
    Attachment: MAPREDUCE-6224.branch-1.000.patch

> resolve the hosts in DNSToSwitchMapping before inter tracker server start to
> avoid IPC timeout in Task Tracker heartbeat
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6224
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6224
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv1
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: MAPREDUCE-6224.branch-1.000.patch
>
>
> Resolve the hosts to fill the cache in CachedDNSToSwitchMapping before the
> inter-tracker server starts, to avoid IPC timeouts in the Task Tracker
> heartbeat.
> We saw IPC timeouts in Task Tracker heartbeats on a large MR1 cluster that
> uses a topology script (run through ShellCommandExecutor) to resolve the
> network topology for Task Tracker hosts in ScriptBasedMapping.
> The reason is as follows:
> Right after the inter-tracker server starts in the Job Tracker, the Job
> Tracker receives a large number of heartbeats from the Task Trackers.
> The heartbeat function calls resolveAndAddToTopology to resolve the network
> topology for the Task Tracker host in ScriptBasedMapping, which extends
> CachedDNSToSwitchMapping.
> ScriptBasedMapping#resolve checks whether the host is in the cache.
> If the host is not in the cache, it runs the topology script through
> ShellCommandExecutor to get the host's network topology. Running the topology
> script is normally time-consuming, which may cause IPC timeouts if too many
> heartbeats arrive at the same time on a large MR1 cluster.
> The solution is to resolve the network topology for all hosts in the hosts
> list from HostsFileReader before receiving any heartbeat from a Task Tracker,
> so that the cache in ScriptBasedMapping is filled. When heartbeat then calls
> resolveAndAddToTopology, it gets the result from the cache instead of
> running the topology script.
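
The cache pre-warming idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch: TopologyCacheWarmup, CachedMapping, and slowCallCount are hypothetical names, the slow resolver is a stand-in for the topology script, and no Hadoop classes are used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class TopologyCacheWarmup {

    /** Hypothetical stand-in for a cached host-to-rack mapping. */
    static final class CachedMapping {
        private final Map<String, String> cache = new ConcurrentHashMap<>();
        private final Function<String, String> slowResolver;
        private final AtomicInteger slowCalls = new AtomicInteger();

        CachedMapping(Function<String, String> slowResolver) {
            this.slowResolver = slowResolver;
        }

        /** Returns one rack per host, calling the slow resolver only on a cache miss. */
        List<String> resolve(List<String> hosts) {
            List<String> racks = new ArrayList<>();
            for (String host : hosts) {
                racks.add(cache.computeIfAbsent(host, h -> {
                    slowCalls.incrementAndGet(); // counts simulated script runs
                    return slowResolver.apply(h);
                }));
            }
            return racks;
        }

        int slowCallCount() {
            return slowCalls.get();
        }
    }

    public static void main(String[] args) {
        // Assumption: this list plays the role of the hosts read by HostsFileReader.
        List<String> knownHosts = Arrays.asList("tt1", "tt2", "tt3");

        // Assumption: a fixed rack name stands in for the topology script output.
        CachedMapping mapping = new CachedMapping(host -> "/default-rack");

        // Warm-up step: resolve every known host before serving heartbeats.
        mapping.resolve(knownHosts);
        System.out.println("slow calls after warm-up: " + mapping.slowCallCount());

        // Simulated heartbeat path: the lookup is now served from the cache.
        mapping.resolve(Arrays.asList("tt2"));
        System.out.println("slow calls after heartbeat: " + mapping.slowCallCount());
    }
}
```

In this sketch the slow resolver runs once per host during warm-up, and the later heartbeat lookup does not run it again. A host missing from the hosts list would still fall through to the slow path on its first heartbeat.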
-- This message was sent by Atlassian JIRA (v6.3.4#6332)