[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546209#comment-14546209 ]
MENG DING commented on YARN-1902:
---------------------------------

I was almost going to log the same issue when I saw this thread (and also YARN-3020) :-). After reading all the discussions and the related code, I still believe this is a bug. I understand [~bikassaha]'s point that the AM-RM protocol is NOT a delta protocol, and that the user (i.e., the ApplicationMaster) is currently responsible for calling removeContainerRequest() after receiving an allocation. But consider the following simple modification to the packaged *distributedshell* application:

{code:title=ApplicationMaster.java|borderStyle=solid}
--- a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
+++ b/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java
@@ -805,6 +805,8 @@ public void onContainersAllocated(List<Container> allocatedContainers) {
       // as all containers may not be allocated at one go.
       launchThreads.add(launchThread);
       launchThread.start();
+      ContainerRequest containerAsk = setupContainerAskForRM();
+      amRMClient.removeContainerRequest(containerAsk);
     }
   }
{code}

This change simply removes a container request after the ApplicationMaster successfully receives an allocated container. When you submit this application specifying, say, 3 containers on the CLI, you will sometimes get 4 containers allocated (not counting the AM container)!

{code}
root@node2:~# hadoop org.apache.hadoop.yarn.applications.distributedshell.Client -jar /usr/local/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.0.0-SNAPSHOT.jar -shell_command "sleep 100000" -num_containers 3 -timeout 200000000
{code}

{code}
root@node2:~# yarn container -list appattempt_1431531743796_0015_000001
15/05/15 20:49:01 INFO client.RMProxy: Connecting to ResourceManager at node2/10.211.55.102:8032
15/05/15 20:49:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Total number of containers :5
                  Container-Id                      Start Time    Finish Time    State          Host    Node Http Address    LOG-URL
container_1431531743796_0015_01_000005    Fri May 15 20:44:12 +0000 2015    N/A    RUNNING    node3:50093    http://node3:8042    http://node3:8042/node/containerlogs/container_1431531743796_0015_01_000005/root
container_1431531743796_0015_01_000001    Fri May 15 20:44:06 +0000 2015    N/A    RUNNING    node3:50093    http://node3:8042    http://node3:8042/node/containerlogs/container_1431531743796_0015_01_000001/root
container_1431531743796_0015_01_000002    Fri May 15 20:44:10 +0000 2015    N/A    RUNNING    node3:50093    http://node3:8042    http://node3:8042/node/containerlogs/container_1431531743796_0015_01_000002/root
container_1431531743796_0015_01_000004    Fri May 15 20:44:11 +0000 2015    N/A    RUNNING    node3:50093    http://node3:8042    http://node3:8042/node/containerlogs/container_1431531743796_0015_01_000004/root
container_1431531743796_0015_01_000003    Fri May 15 20:44:10 +0000 2015    N/A    RUNNING    node4:41128    http://node4:8042    http://node4:8042/node/containerlogs/container_1431531743796_0015_01_000003/root
{code}

The *fundamental* problem here, I believe, is that the AMRMClient maintains an internal request table, *remoteRequestsTable*, that keeps track of the *total* container requests (i.e., including container requests that have already been satisfied as well as those not yet satisfied):

{code:title=AMRMClient.java|borderStyle=solid}
protected final Map<Priority, Map<String, TreeMap<Resource, ResourceRequestInfo>>>
    remoteRequestsTable =
    new TreeMap<Priority, Map<String, TreeMap<Resource, ResourceRequestInfo>>>();
{code}

However, the corresponding *requests* table on the scheduler side (inside AppSchedulingInfo.java) keeps track of *outstanding* container requests (i.e., container requests that are not yet satisfied):

{code:title=AppSchedulingInfo.java|borderStyle=solid}
final Map<Priority, Map<String, ResourceRequest>> requests =
    new ConcurrentHashMap<Priority, Map<String, ResourceRequest>>();
{code}

Every time an allocation is successfully made, the decResourceRequest() or decrementOutstanding() call updates the *requests* table so that it contains only outstanding requests. Unfortunately, every time an ApplicationMaster heartbeat arrives, the same *requests* table is overwritten by the updateResourceRequests() call with the total requests coming from the AMRMClient. This inconsistency, with total requests tracked on the AMRMClient side but only outstanding requests tracked on the scheduler side, is very confusing, to say the least.
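To make the race concrete, here is a toy model of the two tables (plain Java I wrote purely for illustration; all class, field, and method names are invented, and this is not actual YARN code). The client side re-sends its *total* ask on every heartbeat, while the scheduler side decrements on allocation and is then overwritten by the next heartbeat:

{code:title=AskBookkeepingDemo.java (illustration only)|borderStyle=solid}
import java.util.HashMap;
import java.util.Map;

// Toy model of the bookkeeping mismatch described above.
// All names are invented; this is NOT actual YARN code.
public class AskBookkeepingDemo {

  // Client side: total asks ever made (analogous to remoteRequestsTable).
  static Map<String, Integer> clientTotalAsk = new HashMap<String, Integer>();
  // Scheduler side: outstanding asks (analogous to AppSchedulingInfo#requests).
  static Map<String, Integer> schedulerOutstanding = new HashMap<String, Integer>();
  static int totalAllocated = 0;

  public static void main(String[] args) {
    String key = "priority=0/resource-name=*/capability=<1024MB,1vcore>";

    clientTotalAsk.put(key, 3);  // AM asks for 3 containers
    heartbeat(key);              // heartbeat sends the TOTAL ask
    allocate(key, 3);            // scheduler satisfies all 3

    // The AM only calls removeContainerRequest() after it sees the
    // allocations in a heartbeat response. If another heartbeat fires
    // before that happens, the stale total of 3 is sent again:
    heartbeat(key);
    allocate(key, schedulerOutstanding.get(key));  // up to 3 MORE containers

    System.out.println("AM asked for 3, scheduler allocated " + totalAllocated);
  }

  // Mirrors updateResourceRequests(): the heartbeat OVERWRITES the
  // scheduler's view with the client's total ask.
  static void heartbeat(String key) {
    schedulerOutstanding.put(key, clientTotalAsk.get(key));
  }

  // Mirrors decrementOutstanding(): allocation decrements outstanding.
  static void allocate(String key, int n) {
    totalAllocated += n;
    schedulerOutstanding.put(key, schedulerOutstanding.get(key) - n);
  }
}
{code}

Running this prints "AM asked for 3, scheduler allocated 6", which is exactly the shape of the over-allocation seen in the distributedshell run above.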
I see that a solution has already been proposed by [~wangda] in YARN-3020, which I think is the correct thing to do:

{quote}
maybe we should add a default implementation to deduct pending resource requests by priority/resource-name/capacity of allocated containers automatically (User can disable this default behavior, implement their own logic to deduct pending resource requests.)
{quote}

This solution would make *remoteRequestsTable* in AMRMClient keep track of only outstanding container requests, which would then be consistent with the *requests* table on the scheduler side. Any comments or thoughts?

We are currently investigating YARN-1197, and we face a similar issue of properly tracking container resource increase requests on both the client and server sides.
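In the meantime, an application can approximate that default behavior by hand. The sketch below is untested and written against the public AMRMClientAsync API (the surrounding class is hypothetical); it removes exactly one matching outstanding request per allocated container via getMatchingRequests(), rather than constructing a fresh ContainerRequest the way the distributedshell diff above does:

{code:title=DeductOnAllocate.java (sketch only)|borderStyle=solid}
import java.util.Collection;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

// Sketch: manually deduct one pending request per allocated container,
// approximating the default behavior proposed in YARN-3020.
public class DeductOnAllocate {

  private final AMRMClientAsync<ContainerRequest> amRMClient;

  public DeductOnAllocate(AMRMClientAsync<ContainerRequest> amRMClient) {
    this.amRMClient = amRMClient;
  }

  public void onContainersAllocated(List<Container> allocated) {
    for (Container container : allocated) {
      // Look up outstanding asks matching this container's priority and
      // capability (location-agnostic, hence ResourceRequest.ANY).
      List<? extends Collection<ContainerRequest>> matches =
          amRMClient.getMatchingRequests(container.getPriority(),
              ResourceRequest.ANY, container.getResource());
      // Remove exactly one matching request so that the client-side table
      // reflects only what is still outstanding.
      outer:
      for (Collection<ContainerRequest> requests : matches) {
        for (ContainerRequest request : requests) {
          amRMClient.removeContainerRequest(request);
          break outer;
        }
      }
      // ... set up and launch the container as usual ...
    }
  }
}
{code}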
Thanks,
Meng

> Allocation of too many containers when a second request is done with the same resource capability
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1902
>                 URL: https://issues.apache.org/jira/browse/YARN-1902
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.2.0, 2.3.0, 2.4.0
>            Reporter: Sietse T. Au
>            Assignee: Sietse T. Au
>              Labels: client
>         Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch
>
>
> Regarding AMRMClientImpl
> Scenario 1:
> Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called, and at least one of the z allocated containers is started, then if another addContainerRequest call is made, followed by an allocate call to the RM, (z+1) containers will be allocated, where 1 container is expected.
> Scenario 2:
> Same as Scenario 1, but no containers are started between the allocate calls.
> Analyzing debug logs of AMRMClientImpl, I have found that (z+1) containers are indeed requested in both scenarios, but that only in the second scenario is the correct behavior observed.
> Looking at the implementation, I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of Map<Resource, ResourceRequestInfo> is that ResourceRequestInfo does not hold any information about whether a request has been sent to the RM yet or not.
> There are workarounds for this, such as releasing the excess containers received.
> The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM.
> The patch includes a test in which scenario one is tested.
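One way to picture the fix the reporter describes (a ResourceRequestInfo that distinguishes what has already been sent to the RM from newly added asks) is the toy model below. This is my own simplification for illustration, with invented names; it is not the actual patch code:

{code:title=SentAwareAsk.java (illustration only)|borderStyle=solid}
// Toy model of the fix described above: remember how much of an ask has
// already been sent to the RM, so a later addContainerRequest with the
// same capability yields an ask of 1, not (z+1).
// Names are invented; this is NOT the actual YARN-1902 patch code.
public class SentAwareAsk {

  private int totalRequested = 0;  // everything the AM has asked for
  private int sentToRM = 0;        // portion already sent to the RM

  // Mirrors addContainerRequest(): bump the total.
  public void add() {
    totalRequested++;
  }

  // Mirrors building the ask during allocate(): only the not-yet-sent
  // portion goes out, and it is then marked as sent.
  public int buildAsk() {
    int ask = totalRequested - sentToRM;
    sentToRM = totalRequested;
    return ask;
  }

  public static void main(String[] args) {
    SentAwareAsk ask = new SentAwareAsk();
    for (int i = 0; i < 3; i++) ask.add();  // z = 3
    System.out.println(ask.buildAsk());     // 3
    ask.add();                              // one more request later
    System.out.println(ask.buildAsk());     // 1 (without the fix: 4)
  }
}
{code}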