[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-04-01 Thread Anup Agarwal (Jira)


[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313363#comment-17313363 ]

Anup Agarwal edited comment on YARN-10724 at 4/1/21, 6:20 PM:
--

completedContainer getting called multiple times may or may not be an issue, 
but logging the same event multiple times might be. SchedulerApplicationAttempt 
maintains a liveContainers collection and uses it to deduplicate 
container-completion (including preemption) events, while LeafQueue does no 
such deduplication. That is why the patch moves the preemption logging to the 
app attempt rather than LeafQueue, mirroring what FSAppAttempt already does.
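As a rough illustration of that deduplication, here is a minimal, 
self-contained sketch (the class and method names are simplified stand-ins, 
not the actual Hadoop code): a single remove() on a liveContainers map 
ensures repeated completion events for the same container are counted once.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of attempt-level deduplication: a liveContainers map
// guarantees each container's completion (and hence each preemption) is
// counted once, however many times completedContainer() fires for it.
public class AttemptPreemptionDedup {

    // Stand-in for SchedulerApplicationAttempt#liveContainers.
    private final Map<String, Boolean> liveContainers = new HashMap<>();

    private int preemptedContainers = 0;

    public void launchContainer(String containerId) {
        liveContainers.put(containerId, Boolean.TRUE);
    }

    // May fire multiple times for the same container (e.g. from both the
    // NODE_UPDATE and NODE_RESOURCE_UPDATE paths).
    public void completedContainer(String containerId, boolean preempted) {
        // remove() returns null on the second call, so duplicates are ignored.
        if (liveContainers.remove(containerId) == null) {
            return; // already accounted for
        }
        if (preempted) {
            preemptedContainers++;
        }
    }

    public static void main(String[] args) {
        AttemptPreemptionDedup attempt = new AttemptPreemptionDedup();
        attempt.launchContainer("container_1");
        attempt.completedContainer("container_1", true);
        attempt.completedContainer("container_1", true); // duplicate event
        System.out.println(attempt.preemptedContainers); // prints 1, not 2
    }
}
{code}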



> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> 
>
> Key: YARN-10724
> URL: https://issues.apache.org/jira/browse/YARN-10724
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Anup Agarwal
>Assignee: Anup Agarwal
>Priority: Minor
> Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside 
> QueueMetrics.
>  
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the 
> container immediately from the launchedContainers list and instead waits for 
> the NM to kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke 
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for 
> containers to preempt based on the launchedContainers list. Both these calls 
> can create a ContainerPreemptEvent for the same container (as the RM is 
> waiting for the NM to kill the container). This leads LeafQueue to log 
> metrics for the same preemption multiple times.






[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311957#comment-17311957 ]

Anup Agarwal edited comment on YARN-10724 at 3/31/21, 1:31 AM:
---

BTW, I have tested these changes with the discrete-time simulator 
(https://issues.apache.org/jira/browse/YARN-1187) but have not tested them on 
a real testbed, so it would be good to get the patch reviewed.
In the patch, I am unsure about the following two things:

(1) LeafQueue originally checked {{ContainerExitStatus.PREEMPTED == 
containerStatus.getExitStatus()}}, while FSAppAttempt checks 
{{containerStatus.getDiagnostics().equals(SchedulerUtils.PREEMPTED_CONTAINER)}} 
to determine whether the container was preempted. Ideally, I think the logical 
AND of these two conditions should be used (see the sketch after point (2)).

(2) It may be that someone deliberately logged preemption metrics in 
LeafQueue for a reason I am not aware of. That said, the change in the patch 
is consistent with the code already in FSAppAttempt.
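For concreteness, a minimal sketch of the combined check from point (1); the 
constant values below are stand-ins for {{ContainerExitStatus.PREEMPTED}} and 
{{SchedulerUtils.PREEMPTED_CONTAINER}} rather than references to the real 
classes.

{code:java}
// Hedged sketch: treat a container as preempted only when BOTH signals
// agree, i.e. the logical AND of the LeafQueue and FSAppAttempt checks.
public final class PreemptionCheck {

    // Stand-in for ContainerExitStatus.PREEMPTED.
    static final int PREEMPTED_EXIT_STATUS = -102;

    // Stand-in for the SchedulerUtils.PREEMPTED_CONTAINER diagnostics text.
    static final String PREEMPTED_DIAGNOSTICS =
        "Container preempted by scheduler";

    static boolean wasPreempted(int exitStatus, String diagnostics) {
        return exitStatus == PREEMPTED_EXIT_STATUS
            && PREEMPTED_DIAGNOSTICS.equals(diagnostics);
    }

    public static void main(String[] args) {
        System.out.println(
            wasPreempted(-102, "Container preempted by scheduler")); // true
        System.out.println(
            wasPreempted(-102, "Killed by user"));                   // false
    }
}
{code}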






[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311802#comment-17311802 ]

Anup Agarwal edited comment on YARN-10724 at 3/30/21, 8:43 PM:
---

Addressed checkstyle issues: [^YARN-10724-trunk.002.patch]






[jira] [Comment Edited] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)

2021-03-30 Thread Anup Agarwal (Jira)


[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311649#comment-17311649 ]

Anup Agarwal edited comment on YARN-10724 at 3/30/21, 5:13 PM:
---

I have added a unit test that triggers the overcounting issue, along with a 
fix: [^YARN-10724-trunk.001.patch].

The fix also updates FairScheduler to log additional preemption metrics, 
including preemptedMemorySeconds and preemptedVcoreSeconds (a sketch follows 
below).
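For reference, a simplified sketch of the kind of resource-seconds accounting 
involved (the names and structure are illustrative, not the actual 
FairScheduler code): on preemption, the container's memory and vcores are 
multiplied by its running time in seconds and accumulated.

{code:java}
import java.util.concurrent.TimeUnit;

// Illustrative sketch of preempted resource-seconds accounting:
// accumulate memory (MB) and vcores weighted by container runtime.
public class PreemptedResourceSeconds {

    private long preemptedMemorySeconds = 0; // MB * seconds
    private long preemptedVcoreSeconds = 0;  // vcores * seconds

    public void containerPreempted(long memoryMb, int vcores,
                                   long launchTimeMs, long finishTimeMs) {
        long runSeconds =
            TimeUnit.MILLISECONDS.toSeconds(finishTimeMs - launchTimeMs);
        preemptedMemorySeconds += memoryMb * runSeconds;
        preemptedVcoreSeconds += vcores * runSeconds;
    }

    public static void main(String[] args) {
        PreemptedResourceSeconds m = new PreemptedResourceSeconds();
        // A 2048 MB, 2-vcore container preempted after running for 30 s.
        m.containerPreempted(2048, 2, 0, 30_000);
        System.out.println(m.preemptedMemorySeconds); // 61440
        System.out.println(m.preemptedVcoreSeconds);  // 60
    }
}
{code}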


