Peter Bacsko created YARN-9833:
----------------------------------

             Summary: Race condition when DirectoryCollection.checkDirs() runs during container launch
                 Key: YARN-9833
                 URL: https://issues.apache.org/jira/browse/YARN-9833
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.2.0
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko

During endurance testing, we found a race condition that causes an empty {{localDirs}} to be passed to container-executor.

The problem is that {{DirectoryCollection.checkDirs()}} clears three collections:

{code:java}
this.writeLock.lock();
try {
  localDirs.clear();
  errorDirs.clear();
  fullDirs.clear();
  ...
{code}

This happens in a critical section guarded by the write lock.

When we start a container, we retrieve the local dirs by calling {{dirsHandler.getLocalDirs();}} which in turn invokes {{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:

{code:java}
List<String> getGoodDirs() {
  this.readLock.lock();
  try {
    return Collections.unmodifiableList(localDirs);
  } finally {
    this.readLock.unlock();
  }
}
{code}

So here we're also in a critical section, guarded by the read lock. But {{Collections.unmodifiableList()}} only returns a _view_ of the collection, not a copy. The read lock is released as soon as {{getGoodDirs()}} returns, so the caller goes on using the view without holding any lock. After we get the view, {{MonitoringTimerTask.run()}} might be scheduled to run and immediately clear {{localDirs}}, which empties the view as well.

This caused a weird behaviour in container-executor, which exited with error code 35 (COULD_NOT_CREATE_WORK_DIRECTORIES).

Therefore we can't just return a view; we must return a copy with {{ImmutableList.copyOf()}}.

Credits to [~snemeth] for analyzing and determining the root cause.
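
To illustrate the difference between a view and a copy, here is a minimal, single-threaded sketch. It is not the actual {{DirectoryCollection}} code: the class {{GoodDirsSketch}}, the methods {{getGoodDirsView()}}, {{getGoodDirsCopy()}} and {{clearDirs()}}, and the directory paths are all hypothetical names made up for this example. To stay free of external dependencies, the copy is made with the JDK's {{new ArrayList<>(...)}} rather than the Guava {{ImmutableList.copyOf()}} call proposed above; the point is only that the snapshot is taken while the read lock is held.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for DirectoryCollection, reduced to one list.
class GoodDirsSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> localDirs = new ArrayList<>();

    GoodDirsSketch(List<String> dirs) {
        localDirs.addAll(dirs);
    }

    // Mirrors the getter quoted above: returns a read-only VIEW that is
    // still backed by localDirs after the read lock has been released.
    List<String> getGoodDirsView() {
        lock.readLock().lock();
        try {
            return Collections.unmodifiableList(localDirs);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Assumed fix: take a snapshot while holding the read lock, so later
    // changes to localDirs cannot affect what the caller sees.
    List<String> getGoodDirsCopy() {
        lock.readLock().lock();
        try {
            return Collections.unmodifiableList(new ArrayList<>(localDirs));
        } finally {
            lock.readLock().unlock();
        }
    }

    // Stand-in for the first step of checkDirs(): clear under the write lock.
    void clearDirs() {
        lock.writeLock().lock();
        try {
            localDirs.clear();
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        GoodDirsSketch dirs = new GoodDirsSketch(
            Arrays.asList("/data/1/nm-local-dir", "/data/2/nm-local-dir"));

        List<String> view = dirs.getGoodDirsView();
        List<String> copy = dirs.getGoodDirsCopy();

        // Simulates the monitoring task running after the getters returned.
        dirs.clearDirs();

        System.out.println("view size after clear: " + view.size()); // 0
        System.out.println("copy size after clear: " + copy.size()); // 2
    }
}
```

In the real NodeManager the clear happens on another thread, so a caller holding the view can observe the list as empty, or see it change while iterating over it. The sketch performs the clear on the same thread only to make the effect deterministic.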