[jira] [Commented] (YARN-9088) Non-exclusive labels break QueueMetrics
[ https://issues.apache.org/jira/browse/YARN-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690806#comment-17690806 ]

INHYANG PARK commented on YARN-9088:

[~cjac] Do you have any plan to fix this issue?

> Non-exclusive labels break QueueMetrics
> ---
>
>                 Key: YARN-9088
>                 URL: https://issues.apache.org/jira/browse/YARN-9088
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, resourcemanager
>    Affects Versions: 2.8.5
>            Reporter: Brandon Scheller
>            Priority: Major
>              Labels: metrics, nodelabel
>
> QueueMetrics are broken (random/negative values) when non-exclusive labels
> are being used and unlabeled containers run on labeled nodes.
> This is caused by the change in the patch here:
> https://issues.apache.org/jira/browse/YARN-6467
> It assumes that a container's label will be the same as the node's label that
> it is running on.
> If you look within the patch, sometimes metrics are updated using the
> request.getNodeLabelExpression(). And sometimes they are updated using
> node.getPartition().
> This means that in the case where the node is labeled while the container
> request isn't, these metrics only get updated when referring to the default
> queue. This stops metrics from balancing out and results in incorrect and
> negative values in QueueMetrics.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
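The imbalance described in the issue can be illustrated with a toy model. This is only a sketch, not the real QueueMetrics code: the class QueueMetricsSketch, its methods, and the "gpu" partition name are hypothetical, and which real call site uses request.getNodeLabelExpression() versus node.getPartition() is an assumption made for illustration. The model only assumes what the description states: the counter is touched only when the key passed in is the default (empty) partition, and the allocate and release paths may be handed different keys.

```java
/** Toy model of a queue counter; hypothetical, not the YARN QueueMetrics class. */
final class QueueMetricsSketch {
    /** The default partition is modeled as the empty label. */
    static final String DEFAULT_PARTITION = "";

    private long allocatedMB;

    /** Counts an allocation only if the key is the default partition. */
    void allocate(String partitionKey, long mb) {
        if (DEFAULT_PARTITION.equals(partitionKey)) {
            allocatedMB += mb;
        }
    }

    /** Counts a release only if the key is the default partition. */
    void release(String partitionKey, long mb) {
        if (DEFAULT_PARTITION.equals(partitionKey)) {
            allocatedMB -= mb;
        }
    }

    long getAllocatedMB() {
        return allocatedMB;
    }

    /** Runs one container through allocate and release with the given keys. */
    static long runLifecycle(String allocateKey, String releaseKey, long mb) {
        QueueMetricsSketch metrics = new QueueMetricsSketch();
        metrics.allocate(allocateKey, mb);
        metrics.release(releaseKey, mb);
        return metrics.getAllocatedMB();
    }

    public static void main(String[] args) {
        String requestLabel = "";      // unlabeled container request
        String nodePartition = "gpu";  // non-exclusive labeled node (made-up name)

        // Same key on both paths: the counter returns to zero.
        System.out.println(runLifecycle(requestLabel, requestLabel, 1024L));   // 0

        // Allocation keyed by node partition (skipped), release keyed by
        // request label (applied): the counter goes negative.
        System.out.println(runLifecycle(nodePartition, requestLabel, 1024L));  // -1024

        // The opposite mix leaks: the allocation is never subtracted.
        System.out.println(runLifecycle(requestLabel, nodePartition, 1024L));  // 1024
    }
}
```

Whichever key is chosen, the sketch balances only when both paths use the same one; mixing them is what produces the drifting or negative values.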
[jira] [Commented] (YARN-9088) Non-exclusive labels break QueueMetrics
[ https://issues.apache.org/jira/browse/YARN-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17679248#comment-17679248 ]

C.J. Collier commented on YARN-9088:

I'll review the changes and see if I can pick up where karthikpal left off.

Here is a list of the files changed in that other patch ordered by number of changes to the file.

hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/\
  scheduler/QueueMetrics.java
  scheduler/AppSchedulingInfo.java
  scheduler/TestQueueMetrics.java
  scheduler/capacity/CSQueueMetrics.java
  scheduler/common/fica/FiCaSchedulerApp.java
  scheduler/fair/FSAppAttempt.java
  scheduler/capacity/LeafQueue.java
  scheduler/SchedulerApplicationAttempt.java
  scheduler/capacity/CSQueueUtils.java
  scheduler/capacity/TestNodeLabelContainerAllocation.java
  scheduler/TestSchedulerApplicationAttempt.java
  scheduler/capacity/TestCapacityScheduler.java
  monitor/invariants/TestMetricsInvariantChecker.java
  scheduler/fair/FairScheduler.java
[jira] [Commented] (YARN-9088) Non-exclusive labels break QueueMetrics
[ https://issues.apache.org/jira/browse/YARN-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062390#comment-17062390 ]

Anuj commented on YARN-9088:

We are facing a similar issue in our setup, in which the global view of pending and available resources gets messed up.
[jira] [Commented] (YARN-9088) Non-exclusive labels break QueueMetrics
[ https://issues.apache.org/jira/browse/YARN-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819455#comment-16819455 ]

Karthik Palaniappan commented on YARN-9088:
---

You'd also need to change how usedCapacity from YARN-6195 is calculated. It has similar logic for only the default partition.
[jira] [Commented] (YARN-9088) Non-exclusive labels break QueueMetrics
[ https://issues.apache.org/jira/browse/YARN-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819447#comment-16819447 ]

Karthik Palaniappan commented on YARN-9088:
---

+1. I think we should consider rolling back YARN-6467 instead of fixing it.

I believe the original behavior was correct – metrics for the root queue should include metrics for all child queues and partitions. So AllocatedMB / AvailableMB, for example, give you a global view of cluster utilization. If YARN-6492 ever gets submitted, then we'll get per-partition metrics too.

But I think YARN-6467 is the worst of both worlds – you don't get per partition metrics, and you don't get a global view of the cluster. A lot of cloud providers use cluster-level YARN metrics for autoscaling, and YARN-6467 breaks autoscaling.

Side note: YARN-6467 was a breaking change with no documentation / release note. So rolling it back (another breaking change) should be fine.

I'll attach a patch, as long as the rollback is straightforward.