[ https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456178#comment-17456178 ]
Andras Gyori commented on YARN-10178:
-------------------------------------

Thanks [~epayne] for the details. I think the root cause you describe is correct. It is probably transitivity that is violated (if q1 > q2 and q2 > q3, then q1 > q3 must hold, but by the time the sort reaches the q1, q3 comparison the queues have already changed, which breaks the TimSort requirements), though I am not entirely sure about that. All in all, the snapshot idea seems to be the correct one.

As for
{noformat}
I read online that even the stream method of List is not a deep copy. Is that true? If we are only making a reference of the queue list, then the resource usages of each queue can change and cause the sorted list to be wrong during sorting.
{noformat}
I believe it is not a problem, because we are not copying the queue references; we create new objects from the queues and take only floats out of them, which are primitive values. However, configuredMinResource is indeed a reference to a mutable object, so we might need to clone it with Resources.clone() (I think that is the standard convention).

> Global Scheduler async thread crash caused by 'Comparison method violates its general contract'
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-10178
>                 URL: https://issues.apache.org/jira/browse/YARN-10178
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 3.2.1
>            Reporter: tuyu
>            Assignee: Qi Zhu
>            Priority: Major
>         Attachments: YARN-10178.001.patch, YARN-10178.002.patch, YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: Comparison method violates its general contract!
>     at java.util.TimSort.mergeHi(TimSort.java:899)
>     at java.util.TimSort.mergeAt(TimSort.java:516)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
>     at java.util.TimSort.sort(TimSort.java:254)
>     at java.util.Arrays.sort(Arrays.java:1512)
>     at java.util.ArrayList.sort(ArrayList.java:1462)
>     at java.util.Collections.sort(Collections.java:177)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> In Java 8, Arrays.sort uses TimSort by default, and TimSort has a few requirements on the comparison:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y, y > z --> x > z
> 3. x = y, x.compareTo(z) ==
> y.compareTo(z)
> {code}
> If the comparison does not satisfy these requirements, TimSort may throw 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can see that the Capacity Scheduler uses these queue resource usage values for the comparison:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the Global Scheduler AsyncThread uses PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a container to, constructs a CSAssignment, and uses the submitResourceCommitRequest function to add the CSAssignment to the backlogs.
> ResourceCommitterService then calls tryCommit on this CSAssignment. Looking at the tryCommit function, this is where the queue resource usage is updated:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
>     boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode> request =
>       (ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode>) r;
>
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
>     FiCaSchedulerApp app = getApplicationAttempt(attemptId);
>     // Required sanity check for attemptId - when async-scheduling enabled,
>     // proposal might be outdated if AM failover just finished
>     // and proposal queue was not be consumed in time
>     if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
>       if (app.accept(cluster, request, updatePending)
>           && app.apply(cluster, request, updatePending)) { // apply this resource
>         ...
>       }
>     }
>   }
>   return isSuccess;
> }
> }
> {code}
> {code:java}
> public boolean apply(Resource cluster, ResourceCommitRequest<FiCaSchedulerApp,
>     FiCaSchedulerNode> request, boolean updatePending) {
>   ...
>   if (!reReservation) {
>     getCSLeafQueue().apply(cluster, request);
>   }
>   ...
> }
> {code}
> LeafQueue.apply invokes allocateResource:
> {code:java}
> void allocateResource(Resource clusterResource,
>     Resource resource, String nodePartition) {
>   try {
>     writeLock.lock(); // only the leaf queue lock is taken
>     queueUsage.incUsed(nodePartition, resource);
>
>     ++numContainers;
>
>     CSQueueUtils.updateQueueStatistics(resourceCalculator, clusterResource,
>         this, labelManager, nodePartition); // queue statistics are updated here
>   } finally {
>     writeLock.unlock();
>   }
> }
> {code}
> We found that ResourceCommitterService locks only the leaf queue when it updates queue statistics, while the AsyncThread, in sortAndGetChildrenAllocationIterator, locks only the root queue:
> {code:java}
> // ParentQueue.java
> private Iterator<CSQueue> sortAndGetChildrenAllocationIterator(
>     String partition) {
>   try {
>     readLock.lock();
>     return queueOrderingPolicy.getAssignmentIterator(partition);
>   } finally {
>     readLock.unlock();
>   }
> }
> {code}
> So if multiple async threads compare queue usage statistics while ResourceCommitterService concurrently applies changes to the leaf queue statistics, the TimSort requirements are broken and the thread crashes.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
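
For illustration only, the snapshot idea discussed in the comment at the top of this message could look roughly like the sketch below. It is not the YARN-10178 patch: MutableQueue, QueueSnapshot and sortBySnapshot are hypothetical names, and the two float fields stand in for whatever values the real comparator reads. The point is that the comparator only ever sees values copied once before sorting, so a concurrent update to a queue cannot change the outcome of a comparison halfway through the sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class QueueSnapshotSortSketch {

  /** Hypothetical stand-in for a queue whose usage other threads may update. */
  static final class MutableQueue {
    final String name;
    volatile float usedCapacity;
    volatile float absoluteUsedCapacity;

    MutableQueue(String name, float usedCapacity, float absoluteUsedCapacity) {
      this.name = name;
      this.usedCapacity = usedCapacity;
      this.absoluteUsedCapacity = absoluteUsedCapacity;
    }
  }

  /** Immutable copy of the primitive values the comparator needs. */
  static final class QueueSnapshot {
    final MutableQueue queue;
    final float usedCapacity;
    final float absoluteUsedCapacity;

    QueueSnapshot(MutableQueue queue) {
      this.queue = queue;
      // Each field is read exactly once; later updates to the queue are not seen.
      this.usedCapacity = queue.usedCapacity;
      this.absoluteUsedCapacity = queue.absoluteUsedCapacity;
    }
  }

  /** Sorts by the snapshotted values and returns the queues in that order. */
  static List<MutableQueue> sortBySnapshot(List<MutableQueue> queues) {
    List<QueueSnapshot> snapshots = new ArrayList<>(queues.size());
    for (MutableQueue q : queues) {
      snapshots.add(new QueueSnapshot(q));
    }
    // The comparator reads only final fields, so it stays consistent
    // for the whole duration of the sort.
    snapshots.sort(Comparator
        .comparingDouble((QueueSnapshot s) -> s.usedCapacity)
        .thenComparingDouble(s -> s.absoluteUsedCapacity));

    List<MutableQueue> sorted = new ArrayList<>(snapshots.size());
    for (QueueSnapshot s : snapshots) {
      sorted.add(s.queue);
    }
    return sorted;
  }

  public static void main(String[] args) {
    List<MutableQueue> queues = Arrays.asList(
        new MutableQueue("a", 0.9f, 0.3f),
        new MutableQueue("b", 0.1f, 0.2f),
        new MutableQueue("c", 0.5f, 0.1f));
    for (MutableQueue q : sortBySnapshot(queues)) {
      System.out.println(q.name);
    }
  }
}
```

A mutable reference field such as configuredMinResource would additionally need a defensive copy when the snapshot is built (the comment suggests Resources.clone() for that); the sketch leaves it out to stay free of Hadoop dependencies.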