[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313363#comment-17313363 ]

Anup Agarwal commented on YARN-10724:
-------------------------------------

completedContainer being called multiple times may or may not be an issue in itself, but logging the same event multiple times might be. SchedulerApplicationAttempt maintains a liveContainers collection and uses it to deduplicate preemption events, whereas LeafQueue does no such thing. That is why the patch moves the preemption logging to the AppAttempt rather than LeafQueue, similar to FSAppAttempt.

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> --------------------------------------------------------------------
>
>                 Key: YARN-10724
>                 URL: https://issues.apache.org/jira/browse/YARN-10724
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Anup Agarwal
>            Assignee: Anup Agarwal
>            Priority: Minor
>         Attachments: YARN-10724-trunk.001.patch, YARN-10724-trunk.002.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside
> QueueMetrics.
>
> One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the
> container immediately from the launchedContainers list and waits for the NM to
> kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for
> containers to preempt based on the launchedContainers list. Both of these calls
> can create a ContainerPreemptEvent for the same container (as the RM is waiting
> for the NM to kill the container). This leads LeafQueue to log metrics for the
> same preemption multiple times.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
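The deduplication argument in the comment above can be illustrated with a toy model. This is only a sketch and not Hadoop code: the class names, method names, and the container ID are hypothetical, and the only idea taken from the thread is that a counter guarded by a set of live containers counts a preemption once, while an unguarded queue-level counter counts every event.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the over-counting; every name in this file is hypothetical. */
class PreemptionDedupSketch {

    /** Queue-level counter that keeps no record of containers already handled. */
    static final class NaiveQueueCounter {
        private int preemptions;

        void onPreemptEvent(String containerId) {
            // Every event is counted, even a repeat for the same container.
            preemptions++;
        }

        int preemptions() {
            return preemptions;
        }
    }

    /** Attempt-level counter guarded by a set of live containers. */
    static final class AttemptCounter {
        private final Set<String> liveContainers = new HashSet<>();
        private int preemptions;

        void containerLaunched(String containerId) {
            liveContainers.add(containerId);
        }

        void onPreemptEvent(String containerId) {
            // Set.remove returns true only for the first event of a container,
            // so repeated events for the same container are ignored.
            if (liveContainers.remove(containerId)) {
                preemptions++;
            }
        }

        int preemptions() {
            return preemptions;
        }
    }

    public static void main(String[] args) {
        NaiveQueueCounter queue = new NaiveQueueCounter();
        AttemptCounter attempt = new AttemptCounter();
        attempt.containerLaunched("container_1");

        // Two scheduler events each produce a preempt event for the same container.
        for (String source : new String[] {"NODE_RESOURCE_UPDATE", "NODE_UPDATE"}) {
            queue.onPreemptEvent("container_1");
            attempt.onPreemptEvent("container_1");
        }

        System.out.println("naive=" + queue.preemptions()
            + " deduplicated=" + attempt.preemptions());
    }
}
```

Running `main` prints `naive=2 deduplicated=1`, which mirrors the difference between counting in the queue and counting in the app attempt.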
[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313255#comment-17313255 ]

Zhengbo Li commented on YARN-10724:
-----------------------------------

Hi, I may be encountering the same issue, if it is as described, so I'm trying to understand it better. Do you mean that the issue is that LeafQueue's `completedContainer` method is incorrectly invoked multiple times? Thanks
[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311957#comment-17311957 ]

Anup Agarwal commented on YARN-10724:
-------------------------------------

BTW, I have tested these changes with the discrete-time simulator (https://issues.apache.org/jira/browse/YARN-1187) but have not tested them on a real testbed. It would be good to get the patch reviewed.

In the patch, I am unsure about the following two things:
# LeafQueue originally checked {{ContainerExitStatus.PREEMPTED == containerStatus.getExitStatus()}}, while FSAppAttempt checks {{containerStatus.getDiagnostics().equals(SchedulerUtils.PREEMPTED_CONTAINER)}} to determine whether the container was preempted. Ideally, I think the logical AND of these two conditions should be taken.
# It may be the case that someone deliberately logged preemption metrics in LeafQueue for some reason that I don't know about. Having said that, the change in the patch is pretty much consistent with the code already in FSAppAttempt.
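The "logical AND" suggestion from the first item above could look roughly like the following. This is a minimal sketch, not the patch itself: the class and method are hypothetical, and the two constants are local stand-ins for `ContainerExitStatus.PREEMPTED` and `SchedulerUtils.PREEMPTED_CONTAINER` whose values are assumed here for illustration only.

```java
/** Sketch of combining both preemption checks; all names are hypothetical. */
class PreemptionCheckSketch {

    // Stand-in for ContainerExitStatus.PREEMPTED (value assumed for illustration).
    static final int PREEMPTED_EXIT_STATUS = -102;

    // Stand-in for SchedulerUtils.PREEMPTED_CONTAINER (value assumed for illustration).
    static final String PREEMPTED_DIAGNOSTICS = "Container preempted by scheduler";

    /** Treats a container as preempted only if both signals agree. */
    static boolean isPreempted(int exitStatus, String diagnostics) {
        // Calling equals on the constant keeps a null diagnostics string safe.
        return exitStatus == PREEMPTED_EXIT_STATUS
            && PREEMPTED_DIAGNOSTICS.equals(diagnostics);
    }

    public static void main(String[] args) {
        System.out.println(isPreempted(PREEMPTED_EXIT_STATUS, PREEMPTED_DIAGNOSTICS));
        System.out.println(isPreempted(PREEMPTED_EXIT_STATUS, "Killed by user"));
        System.out.println(isPreempted(0, PREEMPTED_DIAGNOSTICS));
    }
}
```

Requiring both signals is stricter than either existing check, so it would only undercount if one of the two fields is not set on a genuine preemption; whether that can happen is exactly the open question raised in the comment.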
[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311859#comment-17311859 ]

Hadoop QA commented on YARN-10724:
----------------------------------

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 1m 19s | | Docker mode activated. |
|| || || || Prechecks || ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 | | 0m 0s | test4tests | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests || ||
| +1 | mvninstall | 21m 59s | | trunk passed |
| +1 | compile | 1m 0s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | compile | 0m 50s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 | checkstyle | 0m 46s | | trunk passed |
| +1 | mvnsite | 0m 54s | | trunk passed |
| +1 | shadedclient | 16m 49s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 40s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | javadoc | 0m 36s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| 0 | spotbugs | 19m 55s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 49s | | trunk passed |
|| || || || Patch Compile Tests || ||
| +1 | mvninstall | 0m 49s | | the patch passed |
| +1 | compile | 0m 53s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | javac | 0m 53s | | the patch passed |
| +1 | compile | 0m 45s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 | javac | 0m 45s | | the patch passed |
| +1 | checkstyle | 0m 41s | | the patch passed |
| +1 | mvnsite | 0m 49s | | the patch passed |
| +1 | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 | shadedclient | 14m 57s | | patch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 38s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | javadoc | 0m 35s | | the patch passed with
[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311802#comment-17311802 ]

Anup Agarwal commented on YARN-10724:
-------------------------------------

Addressed checkstyle issues: [^YARN-10724-trunk.002.patch]
[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311778#comment-17311778 ]

Hadoop QA commented on YARN-10724:
----------------------------------

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Logfile || Comment ||
| 0 | reexec | 1m 21s | | Docker mode activated. |
|| || || || Prechecks || ||
| +1 | dupname | 0m 0s | | No case conflicting files found. |
| +1 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 | | 0m 0s | test4tests | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests || ||
| +1 | mvninstall | 23m 40s | | trunk passed |
| +1 | compile | 1m 1s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | compile | 0m 51s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 | checkstyle | 0m 47s | | trunk passed |
| +1 | mvnsite | 0m 53s | | trunk passed |
| +1 | shadedclient | 17m 9s | | branch has no errors when building and testing our client artifacts. |
| +1 | javadoc | 0m 40s | | trunk passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | javadoc | 0m 40s | | trunk passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| 0 | spotbugs | 20m 26s | | Both FindBugs and SpotBugs are enabled, using SpotBugs. |
| +1 | spotbugs | 1m 57s | | trunk passed |
|| || || || Patch Compile Tests || ||
| +1 | mvninstall | 1m 17s | | the patch passed |
| +1 | compile | 1m 35s | | the patch passed with JDK Ubuntu-11.0.10+9-Ubuntu-0ubuntu1.20.04 |
| +1 | javac | 1m 35s | | the patch passed |
| +1 | compile | 0m 45s | | the patch passed with JDK Private Build-1.8.0_282-8u282-b08-0ubuntu1~20.04-b08 |
| +1 | javac | 0m 45s | | the patch passed |
| -0 | checkstyle | 0m 42s | https://ci-hadoop.apache.org/job/PreCommit-YARN-Build/875/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-resourcemanager.txt | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: The patch generated 5 new + 107 unchanged - 0 fixed = 112 total (was 107) |
| +1 | mvnsite | 0m 48s | | the patch passed |
| +1 | whitespace | 0m 0s | | The patch has no whitespace issues. |
| +1 | shadedclient | 15m 3s | | patch has no errors when building and testing our client artifacts. |
[jira] [Commented] (YARN-10724) Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
[ https://issues.apache.org/jira/browse/YARN-10724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311649#comment-17311649 ]

Anup Agarwal commented on YARN-10724:
-------------------------------------

I have added a unit test that triggers the overcounting issue, along with a fix: [^YARN-10724-trunk.001.patch]. The fix also updates FairScheduler to log preemptedMemorySeconds and preemptedVcoreSeconds.

> Overcounting of preemptions in CapacityScheduler (LeafQueue metrics)
> --------------------------------------------------------------------
>
>                 Key: YARN-10724
>                 URL: https://issues.apache.org/jira/browse/YARN-10724
>             Project: Hadoop YARN
>          Issue Type: Bug
>         Environment: One cause of the over-counting:
> When a container is already running, SchedulerNode does not remove the
> container immediately from the launchedContainers list and waits for the NM to
> kill the container.
> Both NODE_RESOURCE_UPDATE and NODE_UPDATE invoke
> signalContainersIfOvercommited (AbstractYarnScheduler), which looks for
> containers to preempt based on the launchedContainers list. Both of these calls
> can create a ContainerPreemptEvent for the same container (as the RM is waiting
> for the NM to kill the container). This leads LeafQueue to log metrics for the
> same preemption multiple times.
>            Reporter: Anup Agarwal
>            Priority: Minor
>         Attachments: YARN-10724-trunk.001.patch
>
>
> Currently CapacityScheduler over-counts preemption metrics inside
> QueueMetrics.
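The comment above does not say how preemptedMemorySeconds and preemptedVcoreSeconds are computed. One plausible reading, offered only as an assumption, is "resource amount multiplied by the time the container held it". The sketch below shows that arithmetic; the class, method, and parameter names are hypothetical and are not taken from the patch.

```java
/** Sketch of a resource-seconds calculation; all names are hypothetical. */
class PreemptedResourceSecondsSketch {

    /**
     * Returns amount * lifetime in seconds, where the lifetime is measured
     * from container start to container finish in milliseconds.
     */
    static long resourceSeconds(long amount, long startMillis, long finishMillis) {
        // Guard against clock skew producing a negative lifetime.
        long usedMillis = Math.max(0L, finishMillis - startMillis);
        return amount * usedMillis / 1000L;
    }

    public static void main(String[] args) {
        long memoryMb = 2048L;
        long vcores = 2L;
        long startMillis = 0L;
        long finishMillis = 30_000L; // container ran for 30 seconds

        System.out.println("memorySeconds=" + resourceSeconds(memoryMb, startMillis, finishMillis));
        System.out.println("vcoreSeconds=" + resourceSeconds(vcores, startMillis, finishMillis));
    }
}
```

For a 2048 MB, 2-vcore container preempted after 30 seconds this yields 61440 MB-seconds and 60 vcore-seconds. Because these values are accumulated per preemption, counting the same preemption twice would inflate them in the same way as the plain preemption counter.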