[ https://issues.apache.org/jira/browse/YARN-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eric Payne reopened YARN-9767: ------------------------------ Reopening so can close as dup of YARN-6492 > PartitionQueueMetrics Issues > ---------------------------- > > Key: YARN-9767 > URL: https://issues.apache.org/jira/browse/YARN-9767 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Manikandan R > Assignee: Manikandan R > Priority: Major > Attachments: YARN-9767.001.patch > > > The intent of the Jira is to capture the issues/observations encountered as > part of YARN-6492 development separately for ease of tracking. > Observations: > Please refer this > https://issues.apache.org/jira/browse/YARN-6492?focusedCommentId=16904027&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16904027 > 1. Since partition info are being extracted from request and node, there is a > problem. For example, > > Node N has been mapped to Label X (Non exclusive). Queue A has been > configured with ANY Node label. App A requested resources from Queue A and > its containers ran on Node N for some reasons. During > AbstractCSQueue#allocateResource call, Node partition (using SchedulerNode ) > would get used for calculation. Lets say allocate call has been fired for 3 > containers of 1 GB each, then > a. PartitionDefault * queue A -> pending mb is 3 GB > b. PartitionX * queue A -> pending mb is -3 GB > > is the outcome. Because app request has been fired without any label > specification and #a metrics has been derived. After allocation is over, > pending resources usually gets decreased. When this happens, it use node > partition info. hence #b metrics has derived. > > Given this kind of situation, We will need to put some thoughts on achieving > the metrics correctly. > > 2. Though the intent of this jira is to do Partition Queue Metrics, we would > like to retain the existing Queue Metrics for backward compatibility (as you > can see from jira's discussion). > With this patch and YARN-9596 patch, queuemetrics (for queue's) would be > overridden either with some specific partition values or default partition > values. It could be vice - versa as well. For example, after the queues (say > queue A) has been initialised with some min and max cap and also with node > label's min and max cap, Queuemetrics (availableMB) for queue A return values > based on node label's cap config. > I've been working on these observations to provide a fix and attached > .005.WIP.patch. Focus of .005.WIP.patch is to ensure availableMB, > availableVcores is correct (Please refer above #2 observation). Added more > asserts in{{testQueueMetricsWithLabelsOnDefaultLabelNode}} to ensure fix for > #2 is working properly. > Also one more thing to note is, user metrics for availableMB, availableVcores > at root queue was not there even before. Retained the same behaviour. User > metrics for availableMB, availableVcores is available only at child queue's > level and also with partitions. > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org