I'm running a Hadoop 2.2.0 cluster, I observed a weird latency in launching map task on some nodes for my MR job, by looking at Yarn UI, drill down to the specific task, the UI showed me the task is not started yet, still in "localized state", this will keep like 5-15 minutes, then it can proceed. Here's MRAppMaster log,
2016-03-09 22:07:09,737 INFO [ContainerLauncher #6] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Launching attempt_1454695084051_2119658_m_000006_0 2016-03-09 22:07:09,737 INFO [ContainerLauncher #6] org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy: Opening proxy :slc10ank.us.oracle.com:1340 2016-03-09 22:07:09,828 INFO [ContainerLauncher #6] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Shuffle port returned by ContainerManager for attempt_1454695084051_2119658_m_000006_0 : 13562 2016-03-09 22:07:09,828 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: TaskAttempt: [attempt_1454695084051_2119658_m_000006_0] using containerId: [container_1454695084051_2119658_01_000011 on NM: [ slc10ank.us.oracle.com:1340] 2016-03-09 22:07:09,828 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: attempt_1454695084051_2119658_m_000006_0 TaskAttempt Transitioned from ASSIGNED to RUNNING 2016-03-09 22:07:09,829 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.MRAppMetrics: Task is starting on [ slc10ank.us.oracle.com] As you can see from above long, the task was started at 2016-03-09 22:07:09,829, then the next corresponding logging message for this task is 2016-03-09 22:21:42,406 INFO [IPC Server handler 8 on 23425] org.apache.hadoop.mapred.TaskAttemptListenerImpl: JVM with ID: jvm_1454695084051_2119658_m_000011 given task: attempt_1454695084051_2119658_m_000006_0 2016-03-09 22:21:46,282 INFO [IPC Server handler 4 on 23425] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1454695084051_2119658_m_000006_0 2016-03-09 22:21:49,301 INFO [IPC Server handler 5 on 23425] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1454695084051_2119658_m_000006_0 2016-03-09 22:21:52,318 INFO [IPC Server handler 6 on 23425] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1454695084051_2119658_m_000006_0 2016-03-09 22:21:55,330 INFO [IPC Server handler 11 on 23425] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1454695084051_2119658_m_000006_0 2016-03-09 22:21:58,343 INFO [IPC Server handler 9 on 23425] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1454695084051_2119658_m_000006_0 2016-03-09 22:22:01,362 INFO [IPC Server handler 7 on 23425] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1454695084051_2119658_m_000006_0 2016-03-09 22:22:04,371 INFO [IPC Server handler 3 on 23425] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1454695084051_2119658_m_000006_0 2016-03-09 22:22:07,387 INFO [IPC Server handler 2 on 23425] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1454695084051_2119658_m_000006_0 2016-03-09 22:22:10,407 INFO [IPC Server handler 10 on 23425] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1454695084051_2119658_m_000006_0 2016-03-09 22:22:13,422 INFO [IPC Server handler 13 on 23425] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Ping from attempt_1454695084051_2119658_m_000006_0 It's almost 15 minute later, I also can prove the task was actually started around 22:21:41 by looking at the log of the task in question as below 2016-03-09 22:21:41,585 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 2016-03-09 22:21:41,595 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 2016-03-09 22:21:41,880 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2016-03-09 22:21:41,988 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 2016-03-09 22:21:41,988 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started 2016-03-09 22:21:42,002 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens: 2016-03-09 22:21:42,003 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1454695084051_2119658, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@7364985f) 2016-03-09 22:21:42,101 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now. So my question is what can cause "localizing" take that long? As fa as I know, "localizing" is about download required binary/library/config files from HDFS, so what might be the culprit? Thanks for your help -- --Anfernee