[ https://issues.apache.org/jira/browse/YARN-4045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14687389#comment-14687389 ]
Wangda Tan commented on YARN-4045:
----------------------------------

[~tgraves]/[~shahrs87],

I think this can happen when container reservation interacts with a node disconnect. One example:

{code}
A cluster has 6 nodes with 20G each (120G total). Usage is:
  N1-N4: fully used (20G each)
  N5-N6: 10G used on each
An app asks for a 15G container; assume it is reserved on N5, so
  total used resource = 20G * 4 + 10G * 2 + 15G (just reserved) = 115G
Then N6 disconnects. Cluster resource becomes 100G and used resource becomes 105G.
{code}

I've just checked the fixes: YARN-3361 doesn't include a related fix, and we currently don't have a fix for the corner case above.

Another problem is caused by DRC (DominantResourceCalculator). Since 2.7.1 we set availableResource = max(availableResource, Resources.none()):

{code}
childQueue.getMetrics().setAvailableResourcesToQueue(
    Resources.max(
        calculator,
        clusterResource,
        available,
        Resources.none()
    )
);
{code}

But with DRC, a resource that has availableMB < 0 and availableVCores > 0 can compare as greater than Resources.none(), so the negative availableMB is reported unchanged. We may need to fix this case as well.

Thoughts?

> Negative avaialbleMB is being reported for root queue.
> ------------------------------------------------------
>
>                 Key: YARN-4045
>                 URL: https://issues.apache.org/jira/browse/YARN-4045
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Rushabh S Shah
>
> We recently deployed 2.7 in one of our clusters.
> We are seeing negative availableMB being reported for queue=root.
> This is from the jmx output:
> {noformat}
> <clusterMetrics>
> ...
> <availableMB>-163328</availableMB>
> ...
> </clusterMetrics>
> {noformat}
> The following is the RM log:
> {noformat}
> 2015-08-10 14:42:28,280 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854
> absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:28,404 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743
> absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854
> absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:30,913 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743
> absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854
> absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:33,093 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743
> absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:35,548 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854
> absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:35,549 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743
> absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:39,088 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854
> absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:39,089 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743
> absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:39,338 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854
> absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:39,339 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743
> absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:39,757 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854
> absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:39,758 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743
> absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:43,056 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854
> absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:43,070 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743
> absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:44,486 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854
> absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:44,487 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743
> absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:44,886 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854
> absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:44,886 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: assignedContainer queue=root usedCapacity=1.0032743
> absoluteUsedCapacity=1.0032743 used=<memory:5334016, vCores:6212>
> cluster=<memory:5316608, vCores:28320>
> 2015-08-10 14:42:47,401 [ResourceManager Event Processor] INFO
> capacity.ParentQueue: completedContainer queue=root usedCapacity=1.0029854
> absoluteUsedCapacity=1.0029854 used=<memory:5332480, vCores:6202>
> cluster=<memory:5316608, vCores:28320>
> {noformat}
> bq. used=<memory:5332480, vCores:6202> cluster=<memory:5316608, vCores:28320>
> For root queue, usedCapacity is more than totalCapacity

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
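
The two cases raised in the comment (reservation plus node disconnect, and the max() taken under DRC) can be checked with a small standalone sketch. This is not Hadoop code: the class and the names `Res`, `dominantShare`, `maxByDominantShare` and `clampAtZero` are hypothetical, and the comparison is a simplified assumption about how a dominant-share ordering behaves, not the actual DominantResourceCalculator logic. The cluster and used figures are taken from the RM log in the issue description.

```java
public class AvailableResourceSketch {

    /** Hypothetical stand-in for a YARN resource: memory and virtual cores. */
    static final class Res {
        final long memory;
        final long vCores;

        Res(long memory, long vCores) {
            this.memory = memory;
            this.vCores = vCores;
        }

        @Override
        public String toString() {
            return "<memory:" + memory + ", vCores:" + vCores + ">";
        }
    }

    static final Res NONE = new Res(0L, 0L);

    /** Case 1: available memory in GB after the reservation and the N6 disconnect. */
    static long availableAfterDisconnectGb() {
        long cluster = 6 * 20;             // six 20G nodes = 120G
        long used = 20 * 4 + 10 * 2 + 15;  // N1-N4 full, N5-N6 half, 15G reserved = 115G
        cluster -= 20;                     // N6's capacity leaves the cluster -> 100G
        used -= 10;                        // N6's 10G of usage goes with it -> 105G
        return cluster - used;
    }

    /** Largest per-dimension share of the cluster (assumed DRC-style ordering key). */
    static double dominantShare(Res cluster, Res r) {
        return Math.max((double) r.memory / cluster.memory,
                        (double) r.vCores / cluster.vCores);
    }

    /** Case 2: max() that orders whole resources by dominant share only. */
    static Res maxByDominantShare(Res cluster, Res a, Res b) {
        return dominantShare(cluster, a) >= dominantShare(cluster, b) ? a : b;
    }

    /** One possible alternative: clamp each dimension at zero independently. */
    static Res clampAtZero(Res r) {
        return new Res(Math.max(r.memory, 0L), Math.max(r.vCores, 0L));
    }

    public static void main(String[] args) {
        System.out.println("available after disconnect (GB): " + availableAfterDisconnectGb());

        // Figures from the RM log: used=<memory:5332480, vCores:6202>,
        // cluster=<memory:5316608, vCores:28320>.
        Res cluster = new Res(5316608L, 28320L);
        Res used = new Res(5332480L, 6202L);
        Res available = new Res(cluster.memory - used.memory, cluster.vCores - used.vCores);

        System.out.println("max by dominant share: " + maxByDominantShare(cluster, available, NONE));
        System.out.println("component-wise clamp:  " + clampAtZero(available));
    }
}
```

In this model the first case yields -5G of available memory. In the second, the vCores share dominates, so the max() returns the resource with memory -15872 untouched, while the per-dimension clamp reports zero memory and keeps the positive vCores. Whether a component-wise clamp is the right fix for the metrics is a separate question for the patch.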