Peter Bacsko created YARN-9833:
----------------------------------

             Summary: Race condition when DirectoryCollection.checkDirs() runs during container launch
                 Key: YARN-9833
                 URL: https://issues.apache.org/jira/browse/YARN-9833
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 3.2.0
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko


During endurance testing, we found a race condition that causes an empty 
{{localDirs}} list to be passed to container-executor.

The problem is that {{DirectoryCollection.checkDirs()}} clears three 
collections:
{code:java}
    this.writeLock.lock();
    try {
      localDirs.clear();
      errorDirs.clear();
      fullDirs.clear();
      ...
{code}
This happens in a critical section guarded by the write lock. When we start a 
container, we retrieve the local dirs by calling 
{{dirsHandler.getLocalDirs()}}, which in turn invokes 
{{DirectoryCollection.getGoodDirs()}}. The implementation of this method is:
{code:java}
  List<String> getGoodDirs() {
    this.readLock.lock();
    try {
      return Collections.unmodifiableList(localDirs);
    } finally {
      this.readLock.unlock();
    }
  }
{code}
So here we're also in a critical section, this time guarded by the read lock. 
But {{Collections.unmodifiableList()}} only returns a _view_ of the collection, 
not a copy. After we get the view and release the read lock, 
{{MonitoringTimerTask.run()}} might be scheduled to run and immediately clear 
{{localDirs}}, so the caller sees an empty list through the view.
This caused container-executor to exit with error code 35 
(COULD_NOT_CREATE_WORK_DIRECTORIES).

Therefore we can't just return a view; we must return a copy with 
{{ImmutableList.copyOf()}}.
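
The difference between a view and a copy can be reproduced deterministically 
in a single thread. The sketch below is only an illustration, not the Hadoop 
code: {{DirsSketch}}, {{getGoodDirsView()}}, {{getGoodDirsCopy()}} and 
{{clearDirs()}} are hypothetical names, and the copy is made with a plain 
{{new ArrayList<>(...)}} instead of Guava's {{ImmutableList.copyOf()}} so that 
the example has no dependencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical, stripped-down stand-in for DirectoryCollection. */
public class DirsSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> localDirs = new ArrayList<>();

  public DirsSketch(List<String> dirs) {
    localDirs.addAll(dirs);
  }

  /** Same pattern as the getGoodDirs() quoted above: a read-only view backed by localDirs. */
  public List<String> getGoodDirsView() {
    lock.readLock().lock();
    try {
      return Collections.unmodifiableList(localDirs);
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Proposed pattern: snapshot the list while the read lock is still held. */
  public List<String> getGoodDirsCopy() {
    lock.readLock().lock();
    try {
      return Collections.unmodifiableList(new ArrayList<>(localDirs));
    } finally {
      lock.readLock().unlock();
    }
  }

  /** Mimics only the first step of checkDirs(): clearing the list under the write lock. */
  public void clearDirs() {
    lock.writeLock().lock();
    try {
      localDirs.clear();
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    DirsSketch dirs = new DirsSketch(Arrays.asList("/data/1", "/data/2"));
    List<String> view = dirs.getGoodDirsView();
    List<String> copy = dirs.getGoodDirsCopy();

    // Simulates the disk checker running after the caller has released the read lock.
    dirs.clearDirs();

    System.out.println("view size after clear: " + view.size()); // prints 0
    System.out.println("copy size after clear: " + copy.size()); // prints 2
  }
}
```

The view goes empty as soon as the backing list is cleared, even though it was 
obtained under the read lock, while the copy keeps the entries that were 
present when it was taken.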

Credits to [~snemeth] for analyzing and determining the root cause.


