[jira] [Commented] (YARN-1187) Add discrete event-based simulation to yarn scheduler simulator

2021-07-16 Thread Anup Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382305#comment-17382305
 ] 

Anup Agarwal commented on YARN-1187:


Currently, alarm events that share the same time instant are ordered 
arbitrarily by the alarm's UUID. This can cause causally dependent events to 
be triggered/handled out of order.

To fix this, a sequence number, incremented at the creation of each alarm, 
can be added to each alarm so that the causal order between events is 
preserved in the simulation.
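A minimal sketch of the proposed tie-break (a Python illustration with made-up names; the real alarm type lives in the SLS Java code): events are ordered first by firing time, then by a monotonically increasing sequence number assigned at creation, so two alarms scheduled for the same instant fire in creation order.

```python
import heapq
import itertools

# Monotonic counter, incremented once per alarm at creation time.
_seq = itertools.count()

def new_alarm(time, name):
    # Order by firing time first; break ties by creation order (sequence
    # number) instead of an arbitrary UUID, so causal order is preserved.
    return (time, next(_seq), name)

queue = []
cause = new_alarm(100, "cause")    # created first
effect = new_alarm(100, "effect")  # causally depends on "cause"
heapq.heappush(queue, effect)
heapq.heappush(queue, cause)

# Despite identical timestamps, "cause" fires before "effect".
order = [heapq.heappop(queue)[2] for _ in range(2)]
print(order)  # ['cause', 'effect']
```

With UUID-based tie-breaking the pop order of same-instant alarms would depend on random identifiers, making runs non-deterministic.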

> Add discrete event-based simulation to yarn scheduler simulator
> ---
>
> Key: YARN-1187
> URL: https://issues.apache.org/jira/browse/YARN-1187
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: Andrew Chung
>Priority: Major
> Attachments: YARN-1187 design doc.pdf, 
> YARN-1187-branch-2.1.3.001.patch, YARN-1187-trunk.001.patch
>
>
> Follow the discussion in YARN-1021.
> Discrete-event simulation decouples the run from any real-world clock. 
> This allows users to step through the execution, set debug points, and 
> reliably get a deterministic re-execution. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-04-01 Thread Anup Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313363#comment-17313363
 ] 

Anup Agarwal edited comment on YARN-10724 at 4/1/21, 6:20 PM:
--

completedContainer getting called multiple times may or may not be an issue, 
but logging the same event multiple times might be. SchedulerApplicationAttempt 
maintains a liveContainers collection and uses it to deduplicate 
container-completion (including preemption) events, while LeafQueue does no 
such deduplication. That is why the patch moved the preemption logging to the 
AppAttempt rather than LeafQueue, similar to FSAppAttempt.
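The deduplication idea can be sketched as follows (a Python sketch with hypothetical names; the actual logic is the Java liveContainers map in SchedulerApplicationAttempt): the first completion event removes the container from the live set, so a duplicate event for the same container is a no-op and the preemption is logged exactly once.

```python
class AppAttemptSketch:
    def __init__(self):
        # Analogue of SchedulerApplicationAttempt's liveContainers map.
        self.live_containers = {"container_1": "running"}
        self.preemptions_logged = 0

    def container_completed(self, container_id, preempted):
        # First event removes the container; duplicates find nothing.
        if self.live_containers.pop(container_id, None) is None:
            return  # duplicate completion event, ignore
        if preempted:
            self.preemptions_logged += 1

attempt = AppAttemptSketch()
# The RM can emit two preemption events for the same container
# (e.g. one per node update while waiting for the NM to kill it).
attempt.container_completed("container_1", preempted=True)
attempt.container_completed("container_1", preempted=True)
print(attempt.preemptions_logged)  # 1
```

A queue-level counter that increments on every event, with no such live set to consult, would report 2 here, which is the over-count described in this issue.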


was (Author: 108anup):
completedContainer getting called multiple times may or may not be an issue, 
but logging the same event multiple times might be. SchedulerApplicationAttempt 
maintains a liveContainers collection and uses it to deduplicate preemption 
events; while leafQueue does no such thing, that's why the patch moved the 
preemption logging to AppAttempt rather than leafQueue, similar to FSAppAttempt.

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Assignee: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.






[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-04-01 Thread Anup Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313363#comment-17313363
 ] 

Anup Agarwal commented on YARN-10724:
-

completedContainer getting called multiple times may or may not be an issue, 
but logging the same event multiple times might be. SchedulerApplicationAttempt 
maintains a liveContainers collection and uses it to deduplicate preemption 
events, while LeafQueue does no such deduplication. That is why the patch 
moved the preemption logging to the AppAttempt rather than LeafQueue, similar 
to FSAppAttempt.

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Assignee: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.






[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311957#comment-17311957
 ] 

Anup Agarwal edited comment on YARN-10724 at 3/31/21, 1:31 AM:
---

BTW, I have tested these changes with the discrete-time simulator 
(https://issues.apache.org/jira/browse/YARN-1187) but have not tested them on 
a real testbed, so it would be good to get the patch reviewed.
In the patch, I am unsure about the following two things:

(1) LeafQueue originally checked {{ContainerExitStatus.PREEMPTED == 
containerStatus.getExitStatus()}}, while FSAppAttempt checks 
{{containerStatus.getDiagnostics().equals(SchedulerUtils.PREEMPTED_CONTAINER)}} 
to determine whether the container was preempted. Ideally, I think the 
logical AND of these two conditions should be taken.

(2) It may be the case that someone deliberately logged preemption metrics in 
LeafQueue for a reason I do not know about. Having said that, the change in 
the patch is consistent with the code already in FSAppAttempt.
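The combined check from point (1) could look like this (a Python sketch; the constants mirror the Java names mentioned above, and the diagnostics string is an assumption here, not copied from SchedulerUtils):

```python
# ContainerExitStatus.PREEMPTED in YARN.
PREEMPTED_EXIT_STATUS = -102
# Assumed stand-in for SchedulerUtils.PREEMPTED_CONTAINER.
PREEMPTED_DIAGNOSTICS = "Container preempted by scheduler"

def was_preempted(exit_status, diagnostics):
    # Logical AND of the LeafQueue check (exit status) and the
    # FSAppAttempt check (diagnostics string).
    return (exit_status == PREEMPTED_EXIT_STATUS
            and diagnostics == PREEMPTED_DIAGNOSTICS)

print(was_preempted(-102, "Container preempted by scheduler"))  # True
print(was_preempted(-102, "Killed by user"))                    # False
```

Requiring both signals avoids counting a container as preempted when only one of the two markers happens to match.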


was (Author: 108anup):
BTW, I have tested these changes with the discrete-time simulator 
(https://issues.apache.org/jira/browse/YARN-1187) but have not tested these on 
a real testbed. It would be good to get it reviewed.
In the patch, I am unsure about the following two things: # LeafQueue 
originally checked: {{ContainerExitStatus.PREEMPTED == 
containerStatus.getExitStatus()}}; while FSAppAttempt checks: 
{{containerStatus.getDiagnostics().equals(SchedulerUtils.PREEMPTED_CONTAINER)}} 
for determining if the container was preempted. Ideally, I think 'logical and' 
of these two conditions should be taken.
 # It may be the case that someone deliberately logged preemptions metrics in 
LeafQueue due to some reason that I don't know about. Having said that the 
change in the patch is pretty much consistent with code already in FSAppAttempt.

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Assignee: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.






[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311957#comment-17311957
 ] 

Anup Agarwal commented on YARN-10724:
-

BTW, I have tested these changes with the discrete-time simulator 
(https://issues.apache.org/jira/browse/YARN-1187) but have not tested them on 
a real testbed, so it would be good to get the patch reviewed.
In the patch, I am unsure about the following two things:
# LeafQueue originally checked {{ContainerExitStatus.PREEMPTED == 
containerStatus.getExitStatus()}}, while FSAppAttempt checks 
{{containerStatus.getDiagnostics().equals(SchedulerUtils.PREEMPTED_CONTAINER)}} 
to determine whether the container was preempted. Ideally, I think the 
logical AND of these two conditions should be taken.
# It may be the case that someone deliberately logged preemption metrics in 
LeafQueue for a reason I do not know about. Having said that, the change in 
the patch is consistent with the code already in FSAppAttempt.

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Assignee: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.






[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311802#comment-17311802
 ] 

Anup Agarwal edited comment on YARN-10724 at 3/30/21, 8:43 PM:
---

Addressed checkstyle issues: [^YARN-10724-trunk.002.patch]


was (Author: 108anup):
Addressed checkstyle issues: [^YARN-10724-trunk.002.patch]

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Assignee: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.






[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anup Agarwal updated YARN-10724:

Attachment: YARN-10724-trunk.002.patch

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Assignee: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.






[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anup Agarwal updated YARN-10724:

Attachment: (was: YARN-10724-trunk.002.patch)

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Assignee: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.






[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311802#comment-17311802
 ] 

Anup Agarwal commented on YARN-10724:
-

Addressed checkstyle issues: [^YARN-10724-trunk.002.patch]

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Assignee: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.






[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anup Agarwal updated YARN-10724:

Attachment: YARN-10724-trunk.002.patch

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Assignee: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.






[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311649#comment-17311649
 ] 

Anup Agarwal edited comment on YARN-10724 at 3/30/21, 5:13 PM:
---

I have added a unit test that triggers the overcounting issue along with a fix 
[^YARN-10724-trunk.001.patch].

The fix also updates FairScheduler to log other preemption metrics including 
preemptedMemorySeconds and preemptedVcoreSeconds.
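For reference, these metrics accumulate resource-time products. A sketch with hypothetical numbers (not the Hadoop API): a preempted container that held 2048 MB and 2 vcores for 30 simulated seconds contributes 2048 * 30 memory-seconds and 2 * 30 vcore-seconds.

```python
def preempted_resource_seconds(memory_mb, vcores, start_time, preempt_time):
    # Resource-time products credited to the preemption metrics
    # when the container is killed.
    elapsed = preempt_time - start_time
    return memory_mb * elapsed, vcores * elapsed

mem_secs, vcore_secs = preempted_resource_seconds(2048, 2, 100, 130)
print(mem_secs, vcore_secs)  # 61440 60
```

This also makes it clear why deduplication matters: logging the same preemption twice would double both products.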


was (Author: 108anup):
I have added a unit test that triggers the overcounting issue along with a fix 
[^YARN-10724-trunk.001.patch].

The fix also updates FairScheduler to log preemptedMemorySeconds and 
preemptedVcoreSeconds.

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Assignee: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.






[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anup Agarwal updated YARN-10724:

Description: 
Currently CapacityScheduler over-counts preemption metrics inside QueueMetrics.

 

One cause of the over-counting:

When a container is already running, SchedulerNode does not remove the 
container immediately from the launchedContainers list and instead waits for 
the NM to kill the container.

Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke signalContainersIfOvercommited 
(AbstractYarnScheduler), which looks for containers to preempt based on the 
launchedContainers list. Both calls can create a ContainerPreemptEvent for the 
same container (as the RM is waiting for the NM to kill it). This leads 
LeafQueue to log metrics for the same preemption multiple times.

  was:Currently CapacityScheduler over-counts preemption metrics inside 
QueueMetrics.


> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.






[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anup Agarwal updated YARN-10724:

Environment: (was: One cause of the over-counting:

When a container is already running, SchedulerNode does not remove the 
container immediately from launchedContainer list and waits from the NM to kill 
the container.

Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke signalContainersIfOvercommited 
(AbstractYarnScheduler) which look for containers to preempt based on the 
launchedContainers list. Both these calls can create a ContainerPreemptEvent 
for the same container (as RM is waiting for NM to kill the container). This 
leads LeafQueue to log metrics for the same preemption multiple times.)

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.






[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311649#comment-17311649
 ] 

Anup Agarwal commented on YARN-10724:
-

I have added a unit test that triggers the overcounting issue along with a fix 
[^YARN-10724-trunk.001.patch].

The fix also updates FairScheduler to log preemptedMemorySeconds and 
preemptedVcoreSeconds.

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.
>Reporter: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.






[jira] [Updated] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anup Agarwal updated YARN-10724:

Attachment: YARN-10724-trunk.001.patch

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
> Environment: One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits 
> for the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both calls can 
> create a ContainerPreemptEvent for the same container (as the RM is waiting 
> for the NM to kill it). This leads LeafQueue to log metrics for the same 
> preemption multiple times.
>Reporter: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.






[jira] [Created] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)
Anup Agarwal created YARN-10724:
---

 Summary: Overcounting of preemptions in CapacityScheduler 
(LeafQueue metrics)
 Key: YARN-10724
 URL: https://issues.apache.org/jira/browse/YARN-10724
 Project: Hadoop YARN
  Issue Type: Bug
 Environment: One cause of the over-counting:

When a container is already running, SchedulerNode does not remove the 
container immediately from the launchedContainers list and instead waits for 
the NM to kill the container.

Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke signalContainersIfOvercommited 
(AbstractYarnScheduler), which looks for containers to preempt based on the 
launchedContainers list. Both calls can create a ContainerPreemptEvent for the 
same container (as the RM is waiting for the NM to kill it). This leads 
LeafQueue to log metrics for the same preemption multiple times.
Reporter: Anup Agarwal


Currently CapacityScheduler over-counts preemption metrics inside QueueMetrics.






[jira] [Commented] (YARN-1187) Add discrete event-based simulation to yarn scheduler simulator

2021-03-29 Thread Anup Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311100#comment-17311100
 ] 

Anup Agarwal commented on YARN-1187:


Trunk gets new commits almost daily. The trunk patch can be applied on the 
head commit: [HDFS-15918. Replace deprecated RAND_pseudo_bytes|https://github.com/apache/hadoop/commit/654555783db0200aef3ae830e381857d2b46701e] 
([#2811|https://github.com/apache/hadoop/pull/2811])

(with hash: 654555783db0200aef3ae830e381857d2b46701e)

> Add discrete event-based simulation to yarn scheduler simulator
> ---
>
> Key: YARN-1187
> URL: https://issues.apache.org/jira/browse/YARN-1187
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: Andrew Chung
>Priority: Major
> Attachments: YARN-1187 design doc.pdf, 
> YARN-1187-branch-2.1.3.001.patch, YARN-1187-trunk.001.patch
>
>
> Follow the discussion in YARN-1021.
> Discrete-event simulation decouples the run from any real-world clock. 
> This allows users to step through the execution, set debug points, and 
> reliably get a deterministic re-execution. 






[jira] [Issue Comment Deleted] (YARN-1187) Add discrete event-based simulation to yarn scheduler simulator

2021-03-25 Thread Anup Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anup Agarwal updated YARN-1187:
---
Comment: was deleted

(was: Migrated the patch over to trunk.)

> Add discrete event-based simulation to yarn scheduler simulator
> ---
>
> Key: YARN-1187
> URL: https://issues.apache.org/jira/browse/YARN-1187
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: Andrew Chung
>Priority: Major
> Attachments: YARN-1187 design doc.pdf, 
> YARN-1187-branch-2.1.3.001.patch
>
>
> Follow the discussion in YARN-1021.
> Discrete-event simulation decouples the run from any real-world clock. 
> This allows users to step through the execution, set debug points, and 
> reliably get a deterministic re-execution. 






[jira] [Commented] (YARN-1187) Add discrete event-based simulation to yarn scheduler simulator

2021-03-25 Thread Anup Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308835#comment-17308835
 ] 

Anup Agarwal commented on YARN-1187:


Migrated the patch over to trunk.

> Add discrete event-based simulation to yarn scheduler simulator
> ---
>
> Key: YARN-1187
> URL: https://issues.apache.org/jira/browse/YARN-1187
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Wei Yan
>Assignee: Andrew Chung
>Priority: Major
> Attachments: YARN-1187 design doc.pdf, 
> YARN-1187-branch-2.1.3.001.patch
>
>
> Follow the discussion in YARN-1021.
> Discrete-event simulation decouples the run from any real-world clock. 
> This allows users to step through the execution, set debug points, and 
> reliably get a deterministic re-execution. 


