[jira] [Commented] (YARN-1187) Add discrete event-based simulation to yarn scheduler simulator
[ https://issues.apache.org/jira/browse/YARN-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382305#comment-17382305 ]

Anup Agarwal commented on YARN-1187:

Currently, alarm events that have the same time instant are sorted arbitrarily using the alarm's UUID. This can cause causally dependent events to be triggered and handled out of order. To fix this, a sequence number, incremented at creation of each alarm, can be added to each alarm so that the causal order between events is preserved in the simulation.

> Add discrete event-based simulation to yarn scheduler simulator
> ---
>
> Key: YARN-1187
> URL: https://issues.apache.org/jira/browse/YARN-1187
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Wei Yan
> Assignee: Andrew Chung
> Priority: Major
> Attachments: YARN-1187 design doc.pdf,
> YARN-1187-branch-2.1.3.001.patch, YARN-1187-trunk.001.patch
>
> Follow the discussion in YARN-1021.
> Discrete event simulation decouples execution from any real-world clock.
> This allows users to step through the execution, set debug points, and
> reliably get deterministic re-execution.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
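The tie-breaking scheme proposed above can be sketched as follows. This is illustrative Python, not the actual SLS code; the class and method names are hypothetical:

```python
import heapq
import itertools

class EventQueue:
    """Discrete-event queue that breaks ties between events scheduled at
    the same simulated time with a monotonically increasing sequence
    number, so equal-time events fire in the order they were created."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # incremented at alarm creation

    def schedule(self, time, action):
        # Keying the heap on (time, seq) makes the ordering total and
        # deterministic: no arbitrary UUID-based tie-break is needed.
        heapq.heappush(self._heap, (time, next(self._seq), action))

    def run(self):
        fired = []
        while self._heap:
            _time, _seq, action = heapq.heappop(self._heap)
            fired.append(action)
        return fired
```

With this ordering, two alarms scheduled for the same instant pop in creation order, which is exactly the causal order the comment is after.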
[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313363#comment-17313363 ]

Anup Agarwal edited comment on YARN-10724 at 4/1/21, 6:20 PM:

completedContainer getting called multiple times may or may not be an issue, but logging the same event multiple times might be. SchedulerApplicationAttempt maintains a liveContainers collection and uses it to deduplicate container completion (including preemption) events, while LeafQueue does no such thing. That is why the patch moved the preemption logging to AppAttempt rather than LeafQueue, similar to FSAppAttempt.

was (Author: 108anup):
completedContainer getting called multiple times may or may not be an issue, but logging the same event multiple times might be. SchedulerApplicationAttempt maintains a liveContainers collection and uses it to deduplicate preemption events, while LeafQueue does no such thing. That is why the patch moved the preemption logging to AppAttempt rather than LeafQueue, similar to FSAppAttempt.

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Anup Agarwal
> Assignee: Anup Agarwal
> Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
> Currently CapacityScheduler over-counts preemption metrics inside
> QueueMetrics.
>
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the
> container immediately from the launchedContainers list and waits for the NM
> to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for
> containers to preempt based on the launchedContainers list. Both these calls
> can create a ContainerPreemptEvent for the same container (as the RM is
> waiting for the NM to kill the container). This leads LeafQueue to log
> metrics for the same preemption multiple times.
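The deduplication idea described in the comment can be sketched as follows. This is illustrative Python mirroring the role of the liveContainers collection; the names are hypothetical, not Hadoop's actual API:

```python
class AppAttempt:
    """Sketch of deduplicating container-completion events: a completion
    (including a preemption) is counted only while the container is still
    in the live set, so repeated signals for the same container are
    logged at most once."""

    def __init__(self, container_ids):
        self.live_containers = set(container_ids)  # analogue of liveContainers
        self.preemptions_logged = 0

    def container_completed(self, container_id, preempted):
        # completedContainer may be invoked more than once for the same
        # container (e.g. via both NODE_UPDATE and NODE_RESOURCE_UPDATE);
        # only the first call finds the container live and logs metrics.
        if container_id not in self.live_containers:
            return False
        self.live_containers.remove(container_id)
        if preempted:
            self.preemptions_logged += 1
        return True
```

Without the membership check, each duplicate signal would bump the preemption count again, which is the over-counting this issue reports.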
[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313363#comment-17313363 ]

Anup Agarwal commented on YARN-10724:

completedContainer getting called multiple times may or may not be an issue, but logging the same event multiple times might be. SchedulerApplicationAttempt maintains a liveContainers collection and uses it to deduplicate preemption events, while LeafQueue does no such thing. That is why the patch moved the preemption logging to AppAttempt rather than LeafQueue, similar to FSAppAttempt.
[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311957#comment-17311957 ]

Anup Agarwal edited comment on YARN-10724 at 3/31/21, 1:31 AM:

BTW, I have tested these changes with the discrete-time simulator (https://issues.apache.org/jira/browse/YARN-1187) but have not tested them on a real testbed. It would be good to get the patch reviewed. In the patch, I am unsure about the following two things:

(1) LeafQueue originally checked {{ContainerExitStatus.PREEMPTED == containerStatus.getExitStatus()}}, while FSAppAttempt checks {{containerStatus.getDiagnostics().equals(SchedulerUtils.PREEMPTED_CONTAINER)}} to determine whether the container was preempted. Ideally, I think the logical AND of these two conditions should be used.

(2) It may be that someone deliberately logged preemption metrics in LeafQueue for a reason I do not know about. Having said that, the change in the patch is consistent with the code already in FSAppAttempt.

was (Author: 108anup):
BTW, I have tested these changes with the discrete-time simulator (https://issues.apache.org/jira/browse/YARN-1187) but have not tested them on a real testbed. It would be good to get the patch reviewed. In the patch, I am unsure about the following two things:
# LeafQueue originally checked {{ContainerExitStatus.PREEMPTED == containerStatus.getExitStatus()}}, while FSAppAttempt checks {{containerStatus.getDiagnostics().equals(SchedulerUtils.PREEMPTED_CONTAINER)}} to determine whether the container was preempted. Ideally, I think the logical AND of these two conditions should be used.
# It may be that someone deliberately logged preemption metrics in LeafQueue for a reason I do not know about. Having said that, the change in the patch is consistent with the code already in FSAppAttempt.
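The logical AND suggested in point (1) could look like the sketch below. This is illustrative Python; the exit-status value and diagnostics string are assumptions modeled on ContainerExitStatus.PREEMPTED and SchedulerUtils.PREEMPTED_CONTAINER, not taken verbatim from the Hadoop source:

```python
# Hypothetical stand-ins for the two signals the comment compares:
PREEMPTED_EXIT_STATUS = -102                         # assumed exit code
PREEMPTED_DIAGNOSTICS = "Container preempted by scheduler"  # assumed string

def was_preempted(exit_status, diagnostics):
    """Conservative classification: count a completion as a preemption
    only when both the exit status and the diagnostics message agree,
    per the 'logical AND' suggestion in the comment."""
    return (exit_status == PREEMPTED_EXIT_STATUS
            and diagnostics == PREEMPTED_DIAGNOSTICS)
```

Requiring both conditions means a container that merely exits with the preempted status code (or merely carries the diagnostics string) is not counted, which trades possible under-counting for robustness against false positives.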
[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311957#comment-17311957 ]

Anup Agarwal commented on YARN-10724:

BTW, I have tested these changes with the discrete-time simulator (https://issues.apache.org/jira/browse/YARN-1187) but have not tested them on a real testbed. It would be good to get the patch reviewed. In the patch, I am unsure about the following two things:
# LeafQueue originally checked {{ContainerExitStatus.PREEMPTED == containerStatus.getExitStatus()}}, while FSAppAttempt checks {{containerStatus.getDiagnostics().equals(SchedulerUtils.PREEMPTED_CONTAINER)}} to determine whether the container was preempted. Ideally, I think the logical AND of these two conditions should be used.
# It may be that someone deliberately logged preemption metrics in LeafQueue for a reason I do not know about. Having said that, the change in the patch is consistent with the code already in FSAppAttempt.
[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311802#comment-17311802 ]

Anup Agarwal edited comment on YARN-10724 at 3/30/21, 8:43 PM:

Addressed checkstyle issues: [^YARN-10724-trunk.002.patch]

was (Author: 108anup):
Addressed checkstyle issues: [^YARN-10724-trunk.002.patch]
[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anup Agarwal updated YARN-10724:

Attachment: YARN-10724-trunk.002.patch
[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anup Agarwal updated YARN-10724:

Attachment: (was: YARN-10724-trunk.002.patch)
[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311802#comment-17311802 ]

Anup Agarwal commented on YARN-10724:

Addressed checkstyle issues: [^YARN-10724-trunk.002.patch]
[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anup Agarwal updated YARN-10724:

Attachment: YARN-10724-trunk.002.patch
[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311649#comment-17311649 ]

Anup Agarwal edited comment on YARN-10724 at 3/30/21, 5:13 PM:

I have added a unit test that triggers the overcounting issue, along with a fix: [^YARN-10724-trunk.001.patch]. The fix also updates FairScheduler to log other preemption metrics, including preemptedMemorySeconds and preemptedVcoreSeconds.

was (Author: 108anup):
I have added a unit test that triggers the overcounting issue, along with a fix: [^YARN-10724-trunk.001.patch]. The fix also updates FairScheduler to log preemptedMemorySeconds and preemptedVcoreSeconds.
[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anup Agarwal updated YARN-10724:

Description:
Currently CapacityScheduler over-counts preemption metrics inside QueueMetrics.

One cause of the over-counting:
When a container is already running, SchedulerNode does not remove the container immediately from the launchedContainers list and waits for the NM to kill the container.
Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke signalContainersIfOvercommited (AbstractYarnScheduler), which looks for containers to preempt based on the launchedContainers list. Both these calls can create a ContainerPreemptEvent for the same container (as the RM is waiting for the NM to kill the container). This leads LeafQueue to log metrics for the same preemption multiple times.

was: Currently CapacityScheduler over-counts preemption metrics inside QueueMetrics.
[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anup Agarwal updated YARN-10724:

Environment: (was: One cause of the over-counting: When a container is already running, SchedulerNode does not remove the container immediately from the launchedContainers list and waits for the NM to kill the container. Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke signalContainersIfOvercommited (AbstractYarnScheduler), which looks for containers to preempt based on the launchedContainers list. Both these calls can create a ContainerPreemptEvent for the same container (as the RM is waiting for the NM to kill the container). This leads LeafQueue to log metrics for the same preemption multiple times.)
[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311649#comment-17311649 ]

Anup Agarwal commented on YARN-10724:

I have added a unit test that triggers the overcounting issue along with a fix [^YARN-10724-trunk.001.patch]. The fix also updates FairScheduler to log preemptedMemorySeconds and preemptedVcoreSeconds.
[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anup Agarwal updated YARN-10724:

Attachment: YARN-10724-trunk.001.patch
[jira] [Created] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
Anup Agarwal created YARN-10724:

Summary: Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
Key: YARN-10724
URL: https://issues.apache.org/jira/browse/YARN-10724
Project: Hadoop YARN
Issue Type: Bug
Environment: One cause of the over-counting:
When a container is already running, SchedulerNode does not remove the container immediately from the launchedContainers list and waits for the NM to kill the container.
Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke signalContainersIfOvercommited (AbstractYarnScheduler), which looks for containers to preempt based on the launchedContainers list. Both these calls can create a ContainerPreemptEvent for the same container (as the RM is waiting for the NM to kill the container). This leads LeafQueue to log metrics for the same preemption multiple times.
Reporter: Anup Agarwal

Currently CapacityScheduler over-counts preemption metrics inside QueueMetrics.
[jira] [Commented] (YARN-1187) Add discrete event-based simulation to yarn scheduler simulator
[ https://issues.apache.org/jira/browse/YARN-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311100#comment-17311100 ]

Anup Agarwal commented on YARN-1187:

Trunk gets new commits almost daily. The trunk patch can be applied on the head commit: [HDFS-15918. Replace deprecated RAND_pseudo_bytes (#2811)|https://github.com/apache/hadoop/commit/654555783db0200aef3ae830e381857d2b46701e] ([PR #2811|https://github.com/apache/hadoop/pull/2811], commit hash: 654555783db0200aef3ae830e381857d2b46701e)
[jira] [Issue Comment Deleted] (YARN-1187) Add discrete event-based simulation to yarn scheduler simulator
[ https://issues.apache.org/jira/browse/YARN-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anup Agarwal updated YARN-1187:

Comment: was deleted (was: Migrated the patch over to trunk.)
[jira] [Commented] (YARN-1187) Add discrete event-based simulation to yarn scheduler simulator
[ https://issues.apache.org/jira/browse/YARN-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308835#comment-17308835 ]

Anup Agarwal commented on YARN-1187:

Migrated the patch over to trunk.