[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249043#comment-17249043
 ] 

Jim Brennan commented on YARN-9833:
-----------------------------------

Thinking about it more over the weekend, I suspect that the reason the 
CopyOnWriteArrayList was used was more for performance than to allow someone to 
hang onto the reference for a long time.  This ideally is a list that doesn't 
change very often, so handing out a view of a copy-on-write array is cheaper 
than making a copy every time we launch a container.  Unfortunately, 
{{checkdirs()}} as written seems to ruin any advantage we've gained by mutating 
the lists every time it runs (and multiple times at that, by first clearing and 
then adding each entry individually).   This is also where the race comes in.

My suggestion for fixing this would be to fix the {{checkdirs()}}  
implementation to operate on local copies of these arrays, and then update them 
with a single assignment only if they have changed.

 

> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> --------------------------------------------------------------------------------
>
>                 Key: YARN-9833
>                 URL: https://issues.apache.org/jira/browse/YARN-9833
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.0
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>             Fix For: 3.3.0, 3.2.2, 3.1.4
>
>         Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that cause an empty 
> {{localDirs}} being passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
>     this.writeLock.lock();
>     try {
>       localDirs.clear();
>       errorDirs.clear();
>       fullDirs.clear();
>       ...
> {code}
> This happens in critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
>     this.readLock.lock();
>     try {
>       return Collections.unmodifiableList(localDirs);
>     } finally {
>       this.readLock.unlock();
>     }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to