[ https://issues.apache.org/jira/browse/YARN-1197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561098#comment-14561098 ]
MENG DING commented on YARN-1197:
---------------------------------

Thanks [~vinodkv] and [~leftnoteasy] for the great comments!

*To [~vinodkv]:*

bq. Expanding containers at ACQUIRED state sounds useful in theory. But agree with you that we can punt it for later.

Thanks for the confirmation :-)

bq. To your example of concurrent increase/decrease sizing requests from AM, shall we simply say that only one change-in-progress is allowed for any given container?

We really wanted to achieve this, but with the current asymmetric logic of increasing resource through the RM and decreasing resource through the NM, it doesn't seem to be possible :-( The reasons are:
* The increase action starts with the AM requesting the increase from the RM and being granted a resource increase token, then initiating the increase action on the NM, until finally the NM confirms the increase with the RM.
* Once an increase token has been granted to the AM, and before it expires (10 minutes by default), if the AM does not initiate the increase action on the NM, *the NM will have no idea that an increase is already in progress*.
* If, at this moment, the AM initiates a resource decrease action on the NM, the NM will go ahead and honor it.

So in effect, concurrent decrease/increase actions can be going on, and there doesn't seem to be a way to block this.

bq. If we do the above, this will also simplify most of the code, as we will simply have the notion of a Change, instead of an explicit increase/decrease everywhere. For e.g., we will just have a ContainerResourceChangeExpirer.

I believe the ContainerResourceChangeExpirer would only apply to the container resource increase action. The container decrease action goes directly through the NM, so it does not need expiration logic.

bq. There will be races with container-states toggling from RUNNING to finished states, depending on when AM requests a size-change and when NMs report that a container finished. We can simply say that the state at the ResourceManager wins.

Agreed.

bq. Didn't understand why we need this RM-NM confirmation. The token from RM to AM to NM should be enough for NM to update its view, right?

This is for the same reasons listed above.

bq. Instead of adding new records for ContainerResourceIncrease / decrease in AllocationResponse, should we add a new field in the API record itself stating if it is a New/Increased/Decreased container? If we move to a single change model, it's likely we will not even need this.

I am open to this suggestion. We could add a field in the existing *ContainerProto* to indicate whether this Container is a new/increased/decreased container. The only thing I am not sure about is whether we can still change the AllocateResponseProto now that ContainerResourceIncrease/Decrease is already in the trunk.

bq. Any obviously invalid change-requests should be rejected right-away. For e.g, an increase to more than cluster's max container size.

It seemed like you were suggesting that we ignore the invalid requests. Agreed that any invalid increase requests from AM to RM, and invalid decrease requests from AM to NM, should be rejected directly. The 'ignore' case I was referring to is in the context of NodeUpdate from NM to RM.

bq. Nit: In the design doc, the high-level flow for container-increase point #7 incorrectly talks about decrease instead of increase.

Yes, this is a mistake, and I will correct it.

bq. I propose we do this in a branch

Definitely. There is already a YARN-1197 branch, and we can simply work in that branch.

*To [~leftnoteasy]:*

bq. Actually the approach in design doc is this (Meng plz let me know if I misunderstood). In scheduler's implementation, it allows only one pending change request for same container, later change-request will either overwrite prior one or rejected.

The current design only allows one increase request in the whole system, which is guaranteed by the ContainerResourceIncreaseExpirer object.
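To make the increase/decrease race described above concrete, here is a minimal sketch. All class and method names here are hypothetical, not the actual YARN APIs; the point is only that the NM keeps no record of increase tokens granted by the RM, so it honors a decrease even while an increase is outstanding:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the RM-side increase / NM-side decrease asymmetry.
// All names are invented for illustration; this is not actual YARN code.
class ResourceManagerModel {
    // RM remembers granted-but-unconfirmed increases (containerId -> MB)
    final Map<String, Integer> pendingIncreases = new HashMap<>();

    void grantIncrease(String containerId, int targetMb) {
        // Stands in for handing an increase token to the AM.
        pendingIncreases.put(containerId, targetMb);
    }
}

class NodeManagerModel {
    final Map<String, Integer> containerSizeMb = new HashMap<>();

    // A decrease goes straight to the NM: it has no view of pending
    // increases at the RM, so it simply honors the request.
    void decrease(String containerId, int newMb) {
        containerSizeMb.put(containerId, newMb);
    }
}

public class ConcurrentChangeDemo {
    public static void main(String[] args) {
        ResourceManagerModel rm = new ResourceManagerModel();
        NodeManagerModel nm = new NodeManagerModel();
        nm.containerSizeMb.put("c1", 2048);

        // AM obtains an increase token (2G -> 4G) but does not use it yet.
        rm.grantIncrease("c1", 4096);

        // AM then asks the NM to decrease the same container. The NM cannot
        // block this: it never heard about the in-flight increase.
        nm.decrease("c1", 1024);

        System.out.println(nm.containerSizeMb.get("c1"));          // 1024
        System.out.println(rm.pendingIncreases.containsKey("c1")); // true
    }
}
```

Both change paths succeed, which is exactly why a single "one change-in-progress per container" rule cannot be enforced without the RM-NM confirmation step.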
However, as explained above, we cannot block a decrease action while an increase action is still in progress.

bq. 1) For the protocols between servers/AMs, mostly same to previous doc, the biggest change I can see is the ContainerResourceChangeProto in NodeHeartbeatResponseProto, which makes sense to me.

Yes, the ContainerResourceChangeProto is the biggest change. Glad that you agree with this new protocol :-)

bq. 2) For the client side change: 2.2.1, +1 to option 3.

Great. I will remove option 1 and option 2 from the design doc.

bq. 3) For 2.3.3.2 scheduling part, {{The scheduling of an outstanding resource increase request to a container will be skipped if there are either:}}. Both of the two may not needed since AM can require for more resource when container increase (e.g. container increased to 4G, and AM wants it to be 6G before notify NM).

Good point, and this could be very convenient in practice. What I have not figured out, however, is how to handle the increase token expiration logic if there are multiple increase actions in flight at the same time. The current expiration logic (section 2.3.2 in the design doc) only tracks one increase request per container (container ID + original capacity for rollback). As an example, suppose the AM is currently using 2G, asks to increase to 4G, and then asks again to increase to 6G, but never actually uses any of the tokens to increase the resource on the NM. In this case, the RM can only revert the resource allocation back to 4G after expiration, not 2G.

bq. 4) We may not need "reserved increase request", all increase request should be considered to be "reserved". But we still need to respect orders of applications in LeafQueue, no matter it's original FIFO or Fair (added after YARN-3306). We can discuss more scheduling details in separated JIRA.

For sure. My knowledge of the scheduler side is still very limited, so I will continue to learn along the way. By the way, thanks for clearing up the JIRAs.
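The 2G/4G/6G example in the reply to point 3 can be sketched as follows. This is a simplified model with hypothetical names, not the actual ContainerResourceIncreaseExpirer: because only one rollback capacity is tracked per container, the second grant overwrites the first rollback point:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the single-pending-request expiration logic described
// in section 2.3.2 of the design doc. Names are hypothetical, not YARN code.
public class IncreaseExpirerDemo {
    // containerId -> capacity to roll back to if the token expires unused
    static final Map<String, Integer> rollbackCapacityMb = new HashMap<>();
    static final Map<String, Integer> allocatedMb = new HashMap<>();

    static void grantIncrease(String id, int targetMb) {
        // The rollback point is the capacity at grant time; a second grant
        // issued before the first token is used silently overwrites it.
        rollbackCapacityMb.put(id, allocatedMb.get(id));
        allocatedMb.put(id, targetMb);
    }

    static void expire(String id) {
        // Token expired without the AM acting on the NM: revert.
        allocatedMb.put(id, rollbackCapacityMb.remove(id));
    }

    public static void main(String[] args) {
        allocatedMb.put("c1", 2048);   // AM currently uses 2G
        grantIncrease("c1", 4096);     // asks for 4G...
        grantIncrease("c1", 6144);     // ...then 6G, never telling the NM
        expire("c1");                  // token expires unused

        // The RM can only revert to 4G, not the original 2G.
        System.out.println(allocatedMb.get("c1")); // 4096
    }
}
```

Supporting stacked increase requests would require the expirer to remember the full chain of pre-grant capacities, not just the most recent one.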
It's great that you are able to work on the RM/Scheduler! I am glad to take any unassigned tasks :-)

> Support changing resources of an allocated container
> ----------------------------------------------------
>
> Key: YARN-1197
> URL: https://issues.apache.org/jira/browse/YARN-1197
> Project: Hadoop YARN
> Issue Type: Task
> Components: api, nodemanager, resourcemanager
> Affects Versions: 2.1.0-beta
> Reporter: Wangda Tan
> Attachments: YARN-1197_Design.pdf, mapreduce-project.patch.ver.1, tools-project.patch.ver.1, yarn-1197-scheduler-v1.pdf, yarn-1197-v2.pdf, yarn-1197-v3.pdf, yarn-1197-v4.pdf, yarn-1197-v5.pdf, yarn-1197.pdf, yarn-api-protocol.patch.ver.1, yarn-pb-impl.patch.ver.1, yarn-server-common.patch.ver.1, yarn-server-nodemanager.patch.ver.1, yarn-server-resourcemanager.patch.ver.1
>
> The current YARN resource management logic assumes resource allocated to a container is fixed during the lifetime of it. When users want to change a resource of an allocated container the only way is releasing it and allocating a new container with expected size.
> Allowing run-time changing resources of an allocated container will give us better control of resource usage in application side

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)