[ 
https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249316#comment-17249316
 ] 

Jim Brennan commented on YARN-9833:
-----------------------------------

{quote}
My worry with this is that code changes in the future will incorrectly use 
getGoodDirs or the other methods that expose the private lists from within 
DirectoryCollection.
{quote}
Isn't this how it has been for years?  It was returning an unmodifiableList 
view of the underlying List, so that limits what the caller can do.  
getGoodDirs() and the others just return a read-only List.  They don't have to 
know about the internals.
If we are going to change these to return a copy of the list, we may want to 
reconsider the use of CopyOnWriteArrayList - I'm not sure it is buying us 
anything.


> Race condition when DirectoryCollection.checkDirs() runs during container 
> launch
> --------------------------------------------------------------------------------
>
>                 Key: YARN-9833
>                 URL: https://issues.apache.org/jira/browse/YARN-9833
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.0
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>             Fix For: 3.3.0, 3.2.2, 3.1.4
>
>         Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that cause an empty 
> {{localDirs}} being passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three 
> collections:
> {code:java}
>     this.writeLock.lock();
>     try {
>       localDirs.clear();
>       errorDirs.clear();
>       fullDirs.clear();
>       ...
> {code}
> This happens in critical section guarded by a write lock. When we start a 
> container, we retrieve the local dirs by calling 
> {{dirsHandler.getLocalDirs();}} which in turn invokes 
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
>     this.readLock.lock();
>     try {
>       return Collections.unmodifiableList(localDirs);
>     } finally {
>       this.readLock.unlock();
>     }
>   }
> {code}
> So we're also in a critical section guarded by the lock. But 
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
> not a copy. After we get the view, {{MonitoringTimerTask.run()}} might be 
> scheduled to run and immediately clears {{localDirs}}.
> This caused a weird behaviour in container-executor, which exited with error 
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view, we must return a copy with 
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to