[ https://issues.apache.org/jira/browse/YARN-9088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16819447#comment-16819447 ]
Karthik Palaniappan commented on YARN-9088:
-------------------------------------------

+1. I think we should consider rolling back YARN-6467 instead of fixing it.

I believe the original behavior was correct – metrics for the root queue should include metrics for all child queues and partitions. So AllocatedMB / AvailableMB, for example, give you a global view of cluster utilization. If YARN-6492 ever gets submitted, then we'll get per-partition metrics too. But I think YARN-6467 is the worst of both worlds – you don't get per partition metrics, and you don't get a global view of the cluster.

A lot of cloud providers use cluster-level YARN metrics for autoscaling, and YARN-6467 breaks autoscaling.

Side note: YARN-6467 was a breaking change with no documentation / release note. So rolling it back (another breaking change) should be fine.

I'll attach a patch, as long as the rollback is straightforward.

> Non-exclusive labels break QueueMetrics
> ---------------------------------------
>
>                 Key: YARN-9088
>                 URL: https://issues.apache.org/jira/browse/YARN-9088
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, resourcemanager
>    Affects Versions: 2.8.5
>            Reporter: Brandon Scheller
>            Priority: Major
>              Labels: metrics, nodelabel
>
> QueueMetrics are broken (random/negative values) when non-exclusive labels
> are being used and unlabeled containers run on labeled nodes.
> This is caused by the change in the patch here:
> https://issues.apache.org/jira/browse/YARN-6467
> It assumes that a container's label will be the same as the node's label that
> it is running on.
> If you look within the patch, sometimes metrics are updated using the
> request.getNodeLabelExpression(). And sometimes they are updated using
> node.getPartition().
> This means that in the case where the node is labeled while the container
> request isn't, these metrics only get updated when referring to the default
> queue.
> This stops metrics from balancing out and results in incorrect and
> negative values in QueueMetrics.
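
To make the imbalance described in the issue concrete, here is a minimal toy sketch. It is not the Hadoop code: PartitionMetricsSketch, runCycle and the "gpu" partition name are hypothetical, and which call site uses which key is an assumption made for illustration. It only shows how keying the increment on the node's partition and the decrement on the request's label expression can leave a metric for the default partition unbalanced when an unlabeled request runs on a labeled node.

```java
// Toy model only: these names are hypothetical and are not Hadoop classes.
final class PartitionMetricsSketch {
    // The empty string stands in for the default (unlabeled) partition.
    static final String DEFAULT_PARTITION = "";

    private long allocatedMB;

    // Assumption: only updates for the default partition are counted.
    void allocate(String partition, long mb) {
        if (DEFAULT_PARTITION.equals(partition)) {
            allocatedMB += mb;
        }
    }

    void release(String partition, long mb) {
        if (DEFAULT_PARTITION.equals(partition)) {
            allocatedMB -= mb;
        }
    }

    long getAllocatedMB() {
        return allocatedMB;
    }

    // Runs one allocate/release cycle and returns allocatedMB afterwards.
    // consistentKey == false: allocate is keyed by the node partition while
    // release is keyed by the request label (the mismatch described above).
    // consistentKey == true: both sides are keyed by the request label.
    static long runCycle(String requestLabel, String nodePartition,
                         long mb, boolean consistentKey) {
        PartitionMetricsSketch metrics = new PartitionMetricsSketch();
        String allocateKey = consistentKey ? requestLabel : nodePartition;
        metrics.allocate(allocateKey, mb);
        metrics.release(requestLabel, mb);
        return metrics.getAllocatedMB();
    }

    public static void main(String[] args) {
        // Unlabeled request placed on a node in the hypothetical "gpu" partition.
        System.out.println(runCycle("", "gpu", 1024, false)); // prints -1024
        System.out.println(runCycle("", "gpu", 1024, true));  // prints 0
    }
}
```

With mismatched keys the increment is skipped but the decrement is applied, so the value goes negative; with one key used on both sides the cycle nets out to zero.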