[ 
https://issues.apache.org/jira/browse/YARN-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969741#comment-16969741
 ] 

Wilfred Spiegelenburg commented on YARN-8373:
---------------------------------------------

YARN-6448 introduced the synchronisation around the 
ClusterNodeTracker.sortedNodeList() method. That was needed due to a change 
introduced by YARN-4719. The fix looks like it was built when the methods were 
still synchronised.

The change for YARN-6448 was checked in upstream in release 2.9 and later. 
That release also includes YARN-3139, which removed the synchronised blocks 
and changed all locking to read/write locks. This means that from the moment 
the change was added to the code it did not do anything, as it is the only 
synchronised block in the FS. It only prevents two sorts from happening at the 
same time, nothing more. My feeling is that the fix for YARN-6448 never worked 
due to that interaction.
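
To illustrate the interaction described above, here is a minimal standalone 
sketch, not YARN code. It assumes a hypothetical class 
{{IndependentLocksSketch}} with a plain monitor object and a 
{{ReentrantReadWriteLock}}. While one thread holds the write lock, another 
thread can still enter a block synchronised on the monitor, but it cannot get 
the read lock:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class, not part of the YARN code base.
final class IndependentLocksSketch {

  /**
   * Returns {enteredMonitor, readLockAcquired}, both observed while another
   * thread is holding the write lock.
   */
  static boolean[] probe() {
    final Object monitor = new Object();
    final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    final CountDownLatch writeLockHeld = new CountDownLatch(1);
    final CountDownLatch release = new CountDownLatch(1);

    // Stands in for a code path that mutates state under the write lock.
    Thread writer = new Thread(() -> {
      rwLock.writeLock().lock();
      try {
        writeLockHeld.countDown();
        release.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        rwLock.writeLock().unlock();
      }
    });
    writer.start();

    boolean enteredMonitor;
    boolean readLockAcquired;
    try {
      writeLockHeld.await();
      // The monitor is unrelated to rwLock, so this block is entered at once
      // even though the writer is still active.
      synchronized (monitor) {
        enteredMonitor = rwLock.isWriteLocked();
      }
      // The read lock does see the writer: tryLock() fails without blocking.
      readLockAcquired = rwLock.readLock().tryLock();
      if (readLockAcquired) {
        rwLock.readLock().unlock();
      }
      release.countDown();
      writer.join();
    } catch (InterruptedException e) {
      release.countDown();
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new boolean[] {enteredMonitor, readLockAcquired};
  }

  public static void main(String[] args) {
    boolean[] result = probe();
    System.out.println("enteredMonitor=" + result[0]
        + " readLockAcquired=" + result[1]);
  }
}
```

In this sketch the synchronised block gives no protection against the writer, 
which is the same situation as a single synchronised sort in code that 
otherwise uses read/write locks.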

The test that was written does not test the real issue. This is inside the 
test code:
{code}
            synchronized (scheduler) {
              node.deductUnallocatedResource(Resource.newInstance(i * 1024, i));
            }
{code}
The test uses a block that is synchronised on the scheduler, while in the real 
code {{deductUnallocatedResource()}} is not locked on the scheduler at all. 
The test should be removed as it gives a false sense that the code is tested 
and correct.

The fix should be as simple as replacing the synchronised block with a read 
lock. That would bring the fix back to the state that was intended. All the 
node changes, like releasing containers etc, run through the scheduler under a 
held write lock in {{attemptScheduling()}}, {{completedContainerInternal()}} or 
{{nodeUpdate()}}.
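
A minimal sketch of the read lock approach, under stated assumptions: 
{{ReadLockSortSketch}}, its {{Node}} class and the methods {{addNode()}}, 
{{deduct()}} and {{sortedNodeNames()}} are hypothetical stand-ins and not the 
FairScheduler or ClusterNodeTracker API. The only point is that the sort runs 
under the read lock of the same read/write lock whose write lock guards every 
node change:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a scheduler; not the real YARN classes.
final class ReadLockSortSketch {

  static final class Node {
    final String name;
    private int availableMb;

    Node(String name, int availableMb) {
      this.name = name;
      this.availableMb = availableMb;
    }

    int getAvailableMb() {
      return availableMb;
    }
  }

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<Node> nodes = new ArrayList<>();

  void addNode(String name, int availableMb) {
    lock.writeLock().lock();
    try {
      nodes.add(new Node(name, availableMb));
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Every change to a node's resources happens under the write lock.
  void deduct(String name, int mb) {
    lock.writeLock().lock();
    try {
      for (Node n : nodes) {
        if (n.name.equals(name)) {
          n.availableMb -= mb;
        }
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  // The sort holds the read lock, so no writer can change a sort key while
  // the comparator runs. Other readers are still allowed in.
  List<String> sortedNodeNames() {
    lock.readLock().lock();
    try {
      List<Node> copy = new ArrayList<>(nodes);
      // Most available memory first.
      copy.sort((a, b) ->
          Integer.compare(b.getAvailableMb(), a.getAvailableMb()));
      List<String> names = new ArrayList<>(copy.size());
      for (Node n : copy) {
        names.add(n.name);
      }
      return names;
    } finally {
      lock.readLock().unlock();
    }
  }
}
```

For example, after adding n1 (4096), n2 (8192) and n3 (2048) and then calling 
{{deduct("n2", 7000)}}, {{sortedNodeNames()}} returns n1, n3, n2. This only 
helps if all node changes really do take the write lock, which is the 
assumption stated in the paragraph above.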

Fixing the real issue: locking all the nodes while sorting, or creating a deep 
copy of the nodes list before sorting, is costly. Neither of these will be 
without performance impact, especially in large clusters. Based on the analysis 
it will also not give us anything extra.
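
For reference, this is roughly what the deep copy alternative would look like. 
It is a sketch only: {{SnapshotSortSketch}}, {{LiveNode}} and {{NodeSnapshot}} 
are hypothetical names and not YARN classes. Each sort key is copied into an 
immutable snapshot first, so the comparator always sees stable values, at the 
price of one extra object per node for every sort:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of "deep copy before sorting"; not YARN code.
final class SnapshotSortSketch {

  // A node whose available memory may change at any time.
  static final class LiveNode {
    final String name;
    final AtomicInteger availableMb;

    LiveNode(String name, int availableMb) {
      this.name = name;
      this.availableMb = new AtomicInteger(availableMb);
    }
  }

  // Immutable copy of the sort key taken at one moment.
  static final class NodeSnapshot {
    final String name;
    final int availableMb;

    NodeSnapshot(String name, int availableMb) {
      this.name = name;
      this.availableMb = availableMb;
    }
  }

  static List<String> sortBySnapshot(List<LiveNode> live) {
    // O(n) allocations on every call: this is the cost on large clusters.
    List<NodeSnapshot> snapshots = new ArrayList<>(live.size());
    for (LiveNode n : live) {
      snapshots.add(new NodeSnapshot(n.name, n.availableMb.get()));
    }
    // The keys can no longer change, so the comparator stays consistent.
    snapshots.sort((a, b) -> Integer.compare(b.availableMb, a.availableMb));
    List<String> names = new ArrayList<>(snapshots.size());
    for (NodeSnapshot s : snapshots) {
      names.add(s.name);
    }
    return names;
  }
}
```

The snapshot is not taken atomically across all nodes, so the resulting order 
can still be slightly stale. It only guarantees that the values do not move 
while the sort is running.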

[~snemeth] [~miklos.szeg...@cloudera.com] can you check please?

> RM  Received RMFatalEvent of type CRITICAL_THREAD_CRASH
> -------------------------------------------------------
>
>                 Key: YARN-8373
>                 URL: https://issues.apache.org/jira/browse/YARN-8373
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler, resourcemanager
>    Affects Versions: 2.9.0
>            Reporter: Girish Bhat
>            Assignee: Wilfred Spiegelenburg
>            Priority: Major
>              Labels: newbie
>         Attachments: YARN-8373.001.patch
>
>
>  
>  
> {noformat}
> sudo -u yarn /usr/local/hadoop/latest/bin/yarn version
> Hadoop 2.9.0
> Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r 756ebc8394e473ac25feac05fa493f6d612e6c50
> Compiled by arsuresh on 2017-11-13T23:15Z
> Compiled with protoc 2.5.0
> From source with checksum 0a76a9a32a5257331741f8d5932f183
> This command was run using /usr/local/hadoop/hadoop-2.9.0/share/hadoop/common/hadoop-common-2.9.0.jar
> {noformat}
> This is for version 2.9.0 
>  
> {noformat}
> 2018-05-25 05:53:12,742 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> FairSchedulerContinuousScheduling, that exited unexpectedly: 
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,743 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Shutting down 
> the resource manager.
> 2018-05-25 05:53:12,749 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: a critical thread, FairSchedulerContinuousScheduling, that exited 
> unexpectedly: java.lang.IllegalArgumentException: Comparison method violates 
> its general contract!
> at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1454)
> at java.util.Collections.sort(Collections.java:175)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.sortedNodeList(ClusterNodeTracker.java:340)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.continuousSchedulingAttempt(FairScheduler.java:907)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$ContinuousSchedulingThread.run(FairScheduler.java:296)
> 2018-05-25 05:53:12,772 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
>  ExpiredTokenRemover received java.lang.InterruptedException: sleep 
> interrupted{noformat}



