Konstantinos Karanasos updated YARN-2883:
-----------------------------------------
Attachment: YARN-2883-trunk.011.patch

[~kasha], thanks for the review. I am attaching a new patch. To keep my reply short, I am including below only the comments that required further discussion; I agree with and have addressed all the rest.

bq. We seem to be overriding both {{performContainerStart}} and {{startContainerInternal}}

True. I got rid of {{performContainerStart}}, so now only {{startContainerInternal}} is overridden.

bq. While the abbreviated variable names do make space, they make reading the code slightly harder. Can we reconsider using full names? I am comfortable with using alloc for allocated. Also, there seems to be a mismatch in variable names ending in containers and requests. How about using guaranteedAllocations, opportunisticAllocations, queuedGuaranteedAllocations and queuedOpportunisticAllocations?

I see your point about the mismatch, but I would not call them "allocations": queued containers are not really allocations, and I would also like to stress that the running ones are allocated (not just allocations). To correct the mismatch and avoid abbreviations, I used the names {{allocated(Guaranteed|Opportunistic)Containers}} and {{queued(Guaranteed|Opportunistic)Containers}}.

bq. I still think we should never queue guaranteed containers. Is there a good reason for queuing them? Since we are killing opportunistic containers anyway, why not do that synchronously and start the guaranteed containers? The killing of opportunistic containers should likely take a kill context, so we can handle it appropriately in YARN-1011, where we kill when the utilization goes over a threshold. Thinking more about this, I wonder if we should just defer to the kill-when-utilization-goes-over-a-threshold anyway once an opportunistic container has started.

There are multiple reasons why I believe we should queue guaranteed containers (see the sketch below):
# If we do the killing synchronously, the only thing we gain is avoiding the overhead of events, which I believe is negligible. On the other hand, with synchronous killing, if there is a problem killing one of the running opportunistic containers, subsequently arriving guaranteed containers will also be blocked from executing. This is not the case when we do the killing via events.
# Supporting the queuing of guaranteed containers opens up further interesting opportunities:
## Assume an opportunistic container has been running for a while and you decide to give it a couple more seconds to finish its execution before killing it to make room for a guaranteed container. Having a queue of guaranteed containers makes such policies easy to implement.
## One may decide to queue guaranteed containers on purpose. In our EuroSys paper this year ("Efficient queue management for cluster scheduling") we showed that there can be significant gains if such queuing is done carefully.
Overall, I believe it would be a pity to discard such opportunities, especially given that the code is already there.
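To make this more concrete, here is a minimal sketch (not the code in the patch) of how a queue of guaranteed containers can be drained ahead of queued opportunistic ones, with killing handled asynchronously via events. The field and method names follow the discussion above; everything else is a simplified stand-in.

{code:java}
// Hypothetical sketch only: illustrates the policy a guaranteed-container
// queue enables, not the actual NodeManager implementation.
import java.util.ArrayDeque;
import java.util.Deque;

class QueuingSketch {
  // Names mirror the fields discussed above; the bodies are illustrative.
  private final Deque<String> queuedGuaranteedContainers = new ArrayDeque<>();
  private final Deque<String> queuedOpportunisticContainers = new ArrayDeque<>();

  /** Called when resources free up or a kill completes. */
  void startPendingContainers() {
    // Guaranteed containers always get the first shot at freed resources.
    while (!queuedGuaranteedContainers.isEmpty() && hasAvailableResources()) {
      start(queuedGuaranteedContainers.poll());
    }
    // Opportunistic containers only run with whatever is left over.
    while (queuedGuaranteedContainers.isEmpty()
        && !queuedOpportunisticContainers.isEmpty()
        && hasAvailableResources()) {
      start(queuedOpportunisticContainers.poll());
    }
  }

  /**
   * A guaranteed container is queued; killing happens via events, so a stuck
   * kill does not block subsequently arriving guaranteed containers.
   */
  void enqueueGuaranteed(String containerId) {
    queuedGuaranteedContainers.add(containerId);
    sendKillEventsForOpportunisticContainersIfNeeded();
    startPendingContainers();
  }

  // Placeholders standing in for the real NM machinery.
  private boolean hasAvailableResources() { return true; }
  private void start(String containerId) { /* launch container */ }
  private void sendKillEventsForOpportunisticContainersIfNeeded() { }
}
{code}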
bq. {{killOpportContainers}} seems to not kill opportunistic containers if they started running. Doesn't this mess with the guarantees on the guaranteed containers? Am I missing something? Maybe we should just iterate in reverse through the queuedOpportunisticContainers and spawned opportunistic containers and kill as many as required?

I double-checked: {{killOpportContainers}} actually kills opportunistic containers only if they are running. Probably the method name {{stopContainerInternalIfNotQueued}} was confusing, so I renamed it to {{stopContainerInternalIfRunning}}.

bq. I wonder if we are better off using a separate thread for starting pending containers periodically, which would wait between attempts. The code here could just notify that thread? That would likely work better when YARN-1011 wants to use a queue length > 1.

Our initial internal implementation actually followed such a design. The problem we observed is that such a thread adds overhead to the start of containers (and also does not play well with the current design for starting containers, which is not thread-based). So I essentially rewrote this part of the code to avoid having such a thread. Note that for YARN-1011 we can add such a thread (very similar to the one currently used by {{ContainersMonitorImpl}}), which can call the existing {{startPendingContainers()}} without any significant changes.

bq. When queuing containers, depending on whether oversubscription is enabled or not, shouldn't we be checking utilization vs allocation respectively?

Agreed. I renamed {{hasAvailableAllocatedResources()}} to {{hasAvailableResources()}} to reflect this. Then we can keep both an {{allocatedContainersUtilization}} and a {{usedContainersUtilization}}, and use the appropriate one depending on whether over-subscription is enabled (see the first sketch below).

bq. {{resourcesToFreeUp}} takes {{opportContainersToKill}}, but I don't see any container ever being added here. Am I missing something, or do we not need it and all related fields and method parameters can be removed?

Indeed, {{resourcesToFreeUp}} does not add to {{opportContainersToKill}} directly. It calculates the resources that need to be freed up for the given container to start; it is {{opportContainersToKill}} that determines which containers get killed. However, {{resourcesToFreeUp}} needs {{opportContainersToKill}} in order to take into account the containers that are already marked for killing (due to a previous guaranteed container that also triggered some killing) but have not been killed yet (see the second sketch below).

bq. If we are using utilization to decide whether to start an opportunistic container, why not just look at the aggregate containers' utilization? Granted, we'll be able to make a decision about only one container at a time, but we don't have to iterate through container lists and do arithmetic around resource utilization? Even if we switch to allocation and not utilization, we should still be able to go off of aggregate allocation. No? And, based on what we decide, we might not need to track ProcessTreeInfo.

I am not sure I completely see what you mean. Instead of maintaining a single resource utilization, we would each time have to look into the utilization of the running containers, some of the queued containers, and some of the marked-for-killing containers. If this is what you mean, then we would not really avoid arithmetic operations on resource utilization; we might actually increase them.

bq. Have we considered adding a map from containerId -> ResourceUtilization to assist with updating the aggregate utilization when a container is removed etc., so we don't have to explicitly track those separately?

Is this related to your previous comment?
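For concreteness, a minimal sketch (again, not the patch code) of the allocation-vs-utilization check discussed above. {{hasAvailableResources}}, {{allocatedContainersUtilization}} and {{usedContainersUtilization}} are the names from the discussion; the surrounding types are simplified stand-ins rather than the actual YARN classes, and the bookkeeping that updates the snapshots is omitted.

{code:java}
// Hypothetical sketch: admission is based on allocation when over-subscription
// is off, and on actual usage when it is on.
class UtilizationSketch {
  /** Stand-in for a resource-utilization record (memory + vcores). */
  static class Snapshot {
    long memMB;
    float vcores;
  }

  private final boolean overSubscriptionEnabled;
  private final long nodeMemMB;
  private final float nodeVcores;
  private Snapshot allocatedContainersUtilization = new Snapshot();
  private Snapshot usedContainersUtilization = new Snapshot();

  UtilizationSketch(boolean overSub, long memMB, float vcores) {
    this.overSubscriptionEnabled = overSub;
    this.nodeMemMB = memMB;
    this.nodeVcores = vcores;
  }

  boolean hasAvailableResources(long reqMemMB, float reqVcores) {
    // Pick the snapshot that matters for the current mode.
    Snapshot s = overSubscriptionEnabled
        ? usedContainersUtilization : allocatedContainersUtilization;
    return s.memMB + reqMemMB <= nodeMemMB
        && s.vcores + reqVcores <= nodeVcores;
  }
}
{code}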
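And a minimal sketch of the {{resourcesToFreeUp}} accounting described above (memory only, for brevity). The method and parameter names mirror the discussion; the arithmetic is purely illustrative.

{code:java}
import java.util.List;

// Hypothetical sketch: the resources that must be freed for a guaranteed
// container are reduced by whatever is already marked for killing (by an
// earlier guaranteed container) but not yet released.
class FreeUpSketch {
  static long resourcesToFreeUpMB(long requestedMB, long allocatedMB,
      long nodeCapacityMB, List<Long> opportContainersToKillMB) {
    // Deficit if we started the guaranteed container right now.
    long needed = allocatedMB + requestedMB - nodeCapacityMB;
    for (long pendingKillMB : opportContainersToKillMB) {
      needed -= pendingKillMB; // already marked for killing, will be released
    }
    return Math.max(0, needed);
  }
}
{code}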
> Queuing of container requests in the NM
> ----------------------------------------
>
>                 Key: YARN-2883
>                 URL: https://issues.apache.org/jira/browse/YARN-2883
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Konstantinos Karanasos
>            Assignee: Konstantinos Karanasos
>         Attachments: YARN-2883-trunk.004.patch, YARN-2883-trunk.005.patch, YARN-2883-trunk.006.patch, YARN-2883-trunk.007.patch, YARN-2883-trunk.008.patch, YARN-2883-trunk.009.patch, YARN-2883-trunk.010.patch, YARN-2883-trunk.011.patch, YARN-2883-yarn-2877.001.patch, YARN-2883-yarn-2877.002.patch, YARN-2883-yarn-2877.003.patch, YARN-2883-yarn-2877.004.patch
>
> We propose to add a queue in each NM, where queueable container requests can be held.
> Based on the available resources in the node and the containers in the queue, the NM will decide when to allow the execution of a queued container.
> In order to ensure the instantaneous start of a guaranteed-start container, the NM may decide to pre-empt/kill running queueable containers.