[ https://issues.apache.org/jira/browse/YARN-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15655975#comment-15655975 ]
Konstantinos Karanasos commented on YARN-4597:
----------------------------------------------

Thanks for working on this, [~asuresh]! I am sending some first comments. I have not yet looked at the {{ContainerScheduler}} -- I will do that tomorrow.

- The {{Container}} has two new methods ({{sendLaunchEvent}} and {{sendKillEvent}}) that are public, which does not follow the design of the rest of the code, where such methods are kept private and called through transitions in the {{ContainerImpl}}. Let's try to follow the existing design if possible.
- In {{RMNodeImpl}}:
-- Instead of using {{launchedContainers}} for both the launched and the queued containers, we might want to split it in two: one set for the launched and one for the queued containers.
-- I think we should not add opportunistic containers to {{launchedContainers}} at all. If we do, they will be added to the {{newlyLaunchedContainers}}, then to the {{nodeUpdateQueue}}, and, if I am not wrong, they will be propagated to the schedulers for the guaranteed containers, which will create problems. I have to look at it a bit more, but my hunch is that we should avoid doing it. Even if it does not affect the resource accounting, I don't see any advantage to adding them.
- In the {{OpportunisticContainerAllocatorAMService}} we are now calling {{SchedulerNode::allocate}}, and then we do not update the used resources, but we do update some other counters, which leads to inconsistencies. For example, when releasing a container, I think at the moment we are not calling the release method of the {{SchedulerNode}}, which means that the container count will become inconsistent.
-- Instead, I suggest adding counters for opportunistic containers to the {{SchedulerNode}}, both for the number of containers and for the resources used. In that case, we need to make sure that those resources are released too.
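To illustrate the design point about {{sendLaunchEvent}}/{{sendKillEvent}}, here is a minimal, self-contained sketch of the pattern the existing code follows: event-sending methods stay private and are only invoked from the state-machine transition bodies, with the public surface limited to the event handler. The class, states, and event names below are simplified stand-ins, not actual YARN code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: names are invented, not the real ContainerImpl.
class ContainerSketch {
  enum State { NEW, RUNNING, KILLING }

  private State state = State.NEW;
  final List<String> dispatchedEvents = new ArrayList<>();

  // Private, following the existing design: only transitions fire these.
  private void sendLaunchEvent() {
    dispatchedEvents.add("LAUNCH");
    state = State.RUNNING;
  }

  private void sendKillEvent() {
    dispatchedEvents.add("KILL");
    state = State.KILLING;
  }

  // The public surface is the event handler; it routes through transitions,
  // so callers never invoke the send* methods directly.
  public void handle(String event) {
    if ("CONTAINER_LAUNCHED".equals(event) && state == State.NEW) {
      sendLaunchEvent();
    } else if ("CONTAINER_KILL".equals(event)) {
      sendKillEvent();
    }
  }

  public State getState() { return state; }
}
```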
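A rough sketch of the {{launchedContainers}} split suggested above, assuming two separate sets with queued containers promoted to launched. The class and method names here are hypothetical, simplified bookkeeping rather than the real {{RMNodeImpl}} fields:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: one set per container kind instead of a shared one.
class NodeContainerBook {
  private final Set<String> launchedContainers = new HashSet<>();
  private final Set<String> queuedContainers = new HashSet<>();

  void containerQueued(String id) {
    queuedContainers.add(id);
  }

  void containerLaunched(String id) {
    // A queued container moves to launched; the two sets stay disjoint.
    queuedContainers.remove(id);
    launchedContainers.add(id);
  }

  void containerCompleted(String id) {
    queuedContainers.remove(id);
    launchedContainers.remove(id);
  }

  int launchedCount() { return launchedContainers.size(); }
  int queuedCount()   { return queuedContainers.size(); }
}
```

With this split, only the launched set would feed the node-update path, which is one way to keep opportunistic/queued containers out of {{newlyLaunchedContainers}}.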
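The counters suggested for the {{SchedulerNode}} could look roughly like the sketch below: a container count plus a resource total for opportunistic containers, with allocate and release kept symmetric so the accounting cannot drift (the asymmetry is exactly the inconsistency described above). Class and method names are invented for illustration; real code would use the {{Resource}} type rather than a plain memory number.

```java
// Hypothetical sketch of per-node opportunistic accounting.
class OpportunisticAccounting {
  private int numOpportunisticContainers = 0;
  private long opportunisticMemoryMB = 0;

  void allocateOpportunistic(long memMB) {
    numOpportunisticContainers++;
    opportunisticMemoryMB += memMB;
  }

  void releaseOpportunistic(long memMB) {
    // Must mirror allocate exactly; forgetting to call this on container
    // release is what makes the counters inconsistent.
    numOpportunisticContainers--;
    opportunisticMemoryMB -= memMB;
  }

  int count() { return numOpportunisticContainers; }
  long memoryMB() { return opportunisticMemoryMB; }
}
```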
- Maybe as part of a different JIRA, we should at some point extend the {{container.metrics}} in the {{ContainerImpl}} to keep track of the scheduled/queued containers.

h6. Nits:
- There seem to be two redundant parameters in {{YarnConfiguration}} at the moment: {{NM_CONTAINER_QUEUING_MIN_QUEUE_LENGTH}} and {{NM_OPPORTUNISTIC_CONTAINERS_MAX_QUEUE_LENGTH}}. Unless I am missing something, we should keep only one of the two.
- {{yarn-default.xml}}: "numbed" -> "number" (in a comment)
- {{TestNodeManagerResync}}: I think it is better to use one of the existing methods for waiting until the container reaches the RUNNING state.
- In {{Container}}/{{ContainerImpl}} and all the associated classes, I would suggest renaming {{isMarkedToKill}} to {{isMarkedForKilling}}. I know it is minor, but it is more self-explanatory.

I will send more comments once I check the {{ContainerScheduler}}. Also, let's stress-test the code in a cluster before committing to make sure everything is good. I can help with that.

> Add SCHEDULE to NM container lifecycle
> --------------------------------------
>
>                 Key: YARN-4597
>                 URL: https://issues.apache.org/jira/browse/YARN-4597
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager
>            Reporter: Chris Douglas
>            Assignee: Arun Suresh
>              Labels: oct16-hard
>         Attachments: YARN-4597.001.patch, YARN-4597.002.patch,
> YARN-4597.003.patch, YARN-4597.004.patch, YARN-4597.005.patch,
> YARN-4597.006.patch, YARN-4597.007.patch, YARN-4597.008.patch,
> YARN-4597.009.patch
>
>
> Currently, the NM immediately launches containers after resource
> localization. Several features could be more cleanly implemented if the NM
> included a separate stage for reserving resources.