[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970224#comment-16970224
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
---------------------------------------------

Your link points to code in master, not in trunk. Master has not been updated 
since 2015; 
[trunk|https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java]
 has the read lock in the code.

I do agree that fixing the data consistency would be the best thing. However, 
locking a large number of nodes before we sort and then unlocking them again 
would have a huge performance impact.
Moving to a PriorityQueue is possible because the FairScheduler (FS) is the 
only caller of the method at the moment. It also fixes the issue, as confirmed 
by the unit test.
The old unit test, without any special locking, modifies the nodes while they 
are being sorted and runs without issues. This has been confirmed in local runs 
with extra logging:
{code}
2019-11-09 00:57:09,958 INFO  [Thread-25] scheduler.SchedulerNode (SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from null:2147
2019-11-09 00:57:09,958 INFO  [FairSchedulerContinuousScheduling] scheduler.ClusterNodeTracker (ClusterNodeTracker.java:sortedNodeList(390)) - sorting node list of size 8000
2019-11-09 00:57:09,958 INFO  [Thread-25] scheduler.SchedulerNode (SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from null:5949
2019-11-09 00:57:09,958 INFO  [Thread-25] scheduler.SchedulerNode (SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from null:4677
...
...
2019-11-09 00:57:09,961 INFO  [Thread-25] scheduler.SchedulerNode (SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from null:2212
2019-11-09 00:57:09,961 INFO  [FairSchedulerContinuousScheduling] fair.FairScheduler (FairScheduler.java:continuousSchedulingAttempt(1005)) - scheduler sorted node list of size 8000
2019-11-09 00:57:09,962 INFO  [Thread-25] scheduler.SchedulerNode (SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from null:2949
2019-11-09 00:57:09,962 INFO  [Thread-25] scheduler.SchedulerNode (SchedulerNode.java:deductUnallocatedResource(349)) - deducting resource from null:3866
{code}
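
For readers who want to see the idea in isolation, here is a minimal sketch of ordering nodes through a PriorityQueue instead of Collections.sort. It is not the code from the patch. The Node class, the unallocatedMb field, the comparator and the orderedNodes method are all hypothetical stand-ins, and it assumes nodes are ordered by unallocated memory, largest first. The point it illustrates is that the heap operations in java.util.PriorityQueue never validate the comparator contract, so a value that changes during ordering cannot trigger the TimSort IllegalArgumentException; the resulting order is best effort in that case.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.atomic.AtomicLong;

public class NodeOrderingSketch {

    /** Hypothetical stand-in for a scheduler node: only the mutable value used for ordering. */
    static final class Node {
        final String id;
        final AtomicLong unallocatedMb;

        Node(String id, long unallocatedMb) {
            this.id = id;
            this.unallocatedMb = new AtomicLong(unallocatedMb);
        }
    }

    /** Largest unallocated memory first; reads the live value on every comparison. */
    static final Comparator<Node> BY_UNALLOCATED_DESC =
        (a, b) -> Long.compare(b.unallocatedMb.get(), a.unallocatedMb.get());

    /**
     * Orders the nodes by draining a PriorityQueue rather than calling Collections.sort.
     * The heap does not check the comparator contract, so updates made by another
     * thread while this runs can skew the order but cannot make it throw.
     */
    static List<Node> orderedNodes(List<Node> nodes) {
        // The initial capacity must be at least 1, even for an empty input list.
        PriorityQueue<Node> queue =
            new PriorityQueue<>(Math.max(1, nodes.size()), BY_UNALLOCATED_DESC);
        queue.addAll(nodes);

        List<Node> ordered = new ArrayList<>(nodes.size());
        while (!queue.isEmpty()) {
            ordered.add(queue.poll());
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("n1", 2048));
        nodes.add(new Node("n2", 8192));
        nodes.add(new Node("n3", 4096));

        for (Node n : orderedNodes(nodes)) {
            System.out.println(n.id + " " + n.unallocatedMb.get());
        }
    }
}
```

The trade-off in this sketch is the same one discussed above: no per-node locking is needed, at the cost of an ordering that may be slightly stale if nodes change while the queue is being built or drained.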

New patch uploaded.

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> -------------------------------------------------------
>
>                 Key: YARN-8373
>                 URL: https://issues.apache.org/jira/browse/YARN-8373
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, resourcemanager
>    Affects Versions: 2.9.0
>            Reporter: Girish Bhat
>            Assignee: Wilfred Spiegelenburg
>            Priority: Major
>              Labels: newbie
>         Attachments: YARN-8373.001.patch, YARN-8373.002.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version
> Hadoop 2.9.0
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 756ebc8394e473ac25feac05fa493f6d612e6c50
> Compiled by arsuresh on 2017-11-13T23:15Z
> Compiled with protoc 2.5.0
> From source with checksum 0a76a9a32a5257331741f8d5932f183
> This command was run using /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar{noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> FairSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



