[ https://issues.apache.org/jira/browse/YARN-9833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930481#comment-16930481 ]
Adam Antal commented on YARN-9833:
----------------------------------

+1 (non-binding).

> Race condition when DirectoryCollection.checkDirs() runs during container
> launch
> --------------------------------------------------------------------------------
>
>                 Key: YARN-9833
>                 URL: https://issues.apache.org/jira/browse/YARN-9833
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.2.0
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>         Attachments: YARN-9833-001.patch
>
>
> During endurance testing, we found a race condition that causes an empty
> {{localDirs}} to be passed to container-executor.
> The problem is that {{DirectoryCollection.checkDirs()}} clears three
> collections:
> {code:java}
> this.writeLock.lock();
> try {
>   localDirs.clear();
>   errorDirs.clear();
>   fullDirs.clear();
>   ...
> {code}
> This happens in a critical section guarded by a write lock. When we start a
> container, we retrieve the local dirs by calling
> {{dirsHandler.getLocalDirs();}} which in turn invokes
> {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
> {code:java}
> List<String> getGoodDirs() {
>   this.readLock.lock();
>   try {
>     return Collections.unmodifiableList(localDirs);
>   } finally {
>     this.readLock.unlock();
>   }
> }
> {code}
> So we're also in a critical section, this time guarded by the read lock. But
> {{Collections.unmodifiableList()}} only returns a _view_ of the collection,
> not a copy. The caller keeps using that view after the read lock has been
> released, so {{MonitoringTimerTask.run()}} might be scheduled to run in the
> meantime and immediately clear {{localDirs}}.
> This caused container-executor to fail: it exited with error
> code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).
> Therefore we can't just return a view; we must return a copy with
> {{ImmutableList.copyOf()}}.
> Credits to [~snemeth] for analyzing and determining the root cause.
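
For readers following the quoted description, the view-versus-copy difference can be reproduced in a few lines. This is only a minimal sketch, not the code from the attached patch. The class {{DirSnapshotDemo}} and its methods {{getGoodDirsView()}}, {{getGoodDirsCopy()}} and {{clearDirs()}} are hypothetical stand-ins for {{DirectoryCollection}}. To stay free of external dependencies, it copies with {{new ArrayList<>(...)}} from the JDK rather than Guava's {{ImmutableList.copyOf()}}, which the issue proposes. The interleaving is simulated sequentially in one thread instead of with a real timer task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for DirectoryCollection; only what is needed to show the race.
public class DirSnapshotDemo {
    private final List<String> localDirs = new ArrayList<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public DirSnapshotDemo(List<String> dirs) {
        localDirs.addAll(dirs);
    }

    // Behaviour described in the issue: a read-only VIEW that is still backed by localDirs.
    public List<String> getGoodDirsView() {
        lock.readLock().lock();
        try {
            return Collections.unmodifiableList(localDirs);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Proposed direction: a snapshot taken while the read lock is held.
    // Assumption: a JDK copy is used here in place of Guava's ImmutableList.copyOf().
    public List<String> getGoodDirsCopy() {
        lock.readLock().lock();
        try {
            return Collections.unmodifiableList(new ArrayList<>(localDirs));
        } finally {
            lock.readLock().unlock();
        }
    }

    // Simulates the first step of checkDirs(): clearing the list under the write lock.
    public void clearDirs() {
        lock.writeLock().lock();
        try {
            localDirs.clear();
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        DirSnapshotDemo dirs = new DirSnapshotDemo(Arrays.asList("/data/1", "/data/2"));

        // Container launch obtains the directory list...
        List<String> view = dirs.getGoodDirsView();
        List<String> copy = dirs.getGoodDirsCopy();

        // ...and the periodic check runs before the list is consumed.
        dirs.clearDirs();

        System.out.println("view size: " + view.size()); // prints "view size: 0"
        System.out.println("copy size: " + copy.size()); // prints "copy size: 2"
    }
}
```

The read lock protects only the moment the list is handed out. Once the method returns, the view offers no protection against a later {{clear()}}, whereas the snapshot is unaffected by it.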