[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311957#comment-17311957 ]
Anup Agarwal edited comment on YARN-10724 at 3/31/21, 1:31 AM:
---------------------------------------------------------------
BTW, I have tested these changes with the discrete-time simulator (https://issues.apache.org/jira/browse/YARN-1187) but have not tested them on a real testbed, so it would be good to get a review. In the patch, I am unsure about the following two things:

1. LeafQueue originally checked {{ContainerExitStatus.PREEMPTED == containerStatus.getExitStatus()}}, while FSAppAttempt checks {{containerStatus.getDiagnostics().equals(SchedulerUtils.PREEMPTED_CONTAINER)}} to determine whether the container was preempted. Ideally, I think the logical AND of these two conditions should be taken.
2. It may be that someone deliberately logged preemption metrics in LeafQueue for a reason I am not aware of. That said, the change in the patch is consistent with the code already in FSAppAttempt.
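The combined condition described in point 1 could be sketched as below. This is an illustrative helper, not code from the patch; the stand-in constant values mirror YARN's {{ContainerExitStatus.PREEMPTED}} and {{SchedulerUtils.PREEMPTED_CONTAINER}} but are stubbed here so the example is self-contained.

```java
// Hypothetical sketch of the 'logical AND' of the LeafQueue and
// FSAppAttempt preemption checks discussed above. The constants are
// local stand-ins for the real YARN definitions.
public class PreemptionCheck {
    // stand-in for ContainerExitStatus.PREEMPTED
    static final int PREEMPTED_EXIT_STATUS = -102;
    // stand-in for SchedulerUtils.PREEMPTED_CONTAINER
    static final String PREEMPTED_DIAGNOSTICS = "Container preempted by scheduler";

    // true only when BOTH signals agree that the container was preempted
    static boolean wasPreempted(int exitStatus, String diagnostics) {
        return exitStatus == PREEMPTED_EXIT_STATUS
            && PREEMPTED_DIAGNOSTICS.equals(diagnostics);
    }
}
```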
> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> --------------------------------------------------------------------
>
>                 Key: YARN-10724
>                 URL: https://issues.apache.org/jira/browse/YARN-10724
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Anup Agarwal
>            Assignee: Anup Agarwal
>            Priority: Minor
>         Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
> Currently CapacityScheduler over-counts preemption metrics inside QueueMetrics.
>
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the container immediately from the launchedContainers list and instead waits for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke signalContainersIfOvercommited (AbstractYarnScheduler), which looks for containers to preempt based on the launchedContainers list. Both of these calls can create a ContainerPreemptEvent for the same container (since the RM is still waiting for the NM to kill it). This leads LeafQueue to log metrics for the same preemption multiple times.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
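One way to avoid counting the same preemption twice, regardless of how many ContainerPreemptEvents are raised for a container, is to record which container ids have already been counted. This is a minimal sketch of that idea, not the approach taken in the attached patches; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: increment the preemption counter at most once per
// container, even if duplicate preempt events arrive for the same container
// (e.g. from both NODE_RESOURCE_UPDATE and NODE_UPDATE handling).
public class PreemptionMetricsGuard {
    private final Set<String> countedContainers = new HashSet<>();
    private long preemptedCount = 0;

    public synchronized void onContainerPreempted(String containerId) {
        // Set.add returns false when the id was already recorded,
        // so duplicates do not bump the counter a second time
        if (countedContainers.add(containerId)) {
            preemptedCount++;
        }
    }

    public synchronized long getPreemptedCount() {
        return preemptedCount;
    }
}
```

In the real scheduler this state would need to be pruned as containers complete, otherwise the set grows without bound.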