Resource Manager wasting time in allocating many containers and AM rejecting same under a specific scenario

Sunil govind Mon, 30 Dec 2013 21:26:18 -0800

In the TaskImpl class of the MapReduce ApplicationMaster there are two 
transitions, RetroactiveKilledTransition and RetroactiveFailureTransition.
In a specific scenario, such as when a node becomes unstable [bad node] or when 
an external signal is raised to kill a successful task that has already 
completed, RetroactiveKilledTransition is invoked. But the killed attempt is 
not counted in failedAttempts, so that data structure stays empty in this case.
As a result, the map is relaunched as a normal map task and not as a failed 
map.
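
To make the point concrete, here is a minimal sketch of the behavior described 
above. It is not the actual TaskImpl code: the class TaskState and its methods 
are hypothetical names made up for illustration, and the priority values are 
the ones quoted later in this mail. It only assumes that the priority of the 
next attempt depends on whether any failed attempt was recorded.

```java
import java.util.HashSet;
import java.util.Set;

class RescheduleSketch {
    // Priority values as quoted in this mail (lower number is served first).
    static final int PRIORITY_FAILED_MAP = 5;
    static final int PRIORITY_MAP = 20;

    /** Hypothetical stand-in for the per-task bookkeeping of failed attempts. */
    static class TaskState {
        final Set<String> failedAttempts = new HashSet<>();

        /** A genuine attempt failure is recorded. */
        void onAttemptFailed(String attemptId) {
            failedAttempts.add(attemptId);
        }

        /** A succeeded attempt that is killed afterwards is NOT recorded. */
        void onSucceededAttemptKilled(String attemptId) {
            // Intentionally empty: this models the scenario in this mail.
        }

        /** Assumption: a task with no recorded failure is asked as a normal map. */
        int priorityForNextAttempt() {
            return failedAttempts.isEmpty() ? PRIORITY_MAP : PRIORITY_FAILED_MAP;
        }
    }

    public static void main(String[] args) {
        TaskState killedAfterSuccess = new TaskState();
        killedAfterSuccess.onSucceededAttemptKilled("attempt_0");
        System.out.println("killed after success: priority "
                + killedAfterSuccess.priorityForNextAttempt());

        TaskState reallyFailed = new TaskState();
        reallyFailed.onAttemptFailed("attempt_0");
        System.out.println("failed attempt:       priority "
                + reallyFailed.priorityForNextAttempt());
    }
}
```

Under these assumptions the retroactively killed map is requested with 
priority 20, while a map with a recorded failure would be requested with 
priority 5.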


Assume the cluster is taken over by reducers alone, and a successful map is 
killed, either by an external command [./mapred kill-task <ID>] or because of a 
bad node.
In this case the AM sends the ask for the map, but the ask has to wait until 
the RM has processed all the reducer requests in its queue [priority 10].
The new map task gets priority 20. Had it been requested as a failed map, with 
priority 5, it would have been processed immediately.

If there are hundreds of reducers waiting to be processed and the cluster is 
small, it may take minutes before this map task is processed.
Meanwhile, many of the containers allocated for reducers will be rejected by 
the AM.
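
The ordering effect can be illustrated with a toy model. This is only a 
sketch, not the real scheduler: it assumes that pending requests are served 
strictly by priority number (lower first, ties in arrival order), which is the 
behavior described above. The class and method names are hypothetical.

```java
import java.util.PriorityQueue;

class AllocationOrderSketch {
    static final int PRIORITY_REDUCE = 10; // value quoted in this mail

    /**
     * Returns how many reducer requests are served before the re-run map,
     * given that the map request arrives after all reducer requests.
     */
    static int requestsServedBeforeMap(int pendingReducers, int mapPriority) {
        // Each entry is {priority, arrivalSequence}; order by priority,
        // then by arrival sequence.
        PriorityQueue<int[]> queue = new PriorityQueue<>((a, b) ->
                a[0] != b[0] ? Integer.compare(a[0], b[0])
                             : Integer.compare(a[1], b[1]));
        int seq = 0;
        for (int i = 0; i < pendingReducers; i++) {
            queue.add(new int[] {PRIORITY_REDUCE, seq++});
        }
        int mapSeq = seq;
        queue.add(new int[] {mapPriority, mapSeq});

        int served = 0;
        while (!queue.isEmpty()) {
            int[] next = queue.poll();
            if (next[1] == mapSeq) {
                return served;
            }
            served++;
        }
        throw new IllegalStateException("map request was never served");
    }

    public static void main(String[] args) {
        System.out.println("priority 20: "
                + requestsServedBeforeMap(300, 20) + " requests served first");
        System.out.println("priority 5:  "
                + requestsServedBeforeMap(300, 5) + " requests served first");
    }
}
```

In this model, with 300 pending reducer requests, the map waits behind all 300 
of them at priority 20 and behind none of them at priority 5.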

Is this expected behavior? Kindly let me know whether this can be improved.
