[ 
https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14346175#comment-14346175
 ] 

zhihai xu commented on YARN-2893:
---------------------------------

Hi [~vinodkv],
Sporadic job failures are due to cascading sharing the credentials between 
jobs. Because the Credentials class is not thread-safe, concurrent access to 
the shared credentials from multiple jobs creates a race condition, which 
causes the sporadic job failures.
The shared credentials are introduced in the JobConf constructor: if we create 
a new job using the JobConf from an old job, the two jobs will share the same 
credentials.
{code}
public JobConf(Configuration conf) {
  super(conf);
  if (conf instanceof JobConf) {
    JobConf that = (JobConf)conf;
    credentials = that.credentials;
  }
  checkAndWarnDeprecation();
}
{code}
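
To make the aliasing concrete, here is a minimal sketch. Creds and Conf are hypothetical stand-ins that I made up for illustration, not the real Credentials and JobConf classes. The copy constructor mirrors the reference assignment quoted above, and withOwnCredentials is only one possible way to avoid the sharing (a defensive copy), not necessarily what a patch would do.

```java
import java.util.HashMap;
import java.util.Map;

public class SharedCredentialsSketch {
    /** Hypothetical stand-in for Credentials: a plain, non-thread-safe token map. */
    static class Creds {
        final Map<String, String> tokens = new HashMap<>();

        Creds() {}

        /** Copies the entries so the new object owns its own map. */
        Creds(Creds other) {
            tokens.putAll(other.tokens);
        }
    }

    /** Hypothetical stand-in for JobConf. */
    static class Conf {
        final Creds creds;

        Conf() {
            creds = new Creds();
        }

        /** Mirrors the quoted constructor: only the reference is copied. */
        Conf(Conf that) {
            creds = that.creds;
        }

        private Conf(Creds owned) {
            creds = owned;
        }

        /** Assumed alternative: give the new job its own copy of the tokens. */
        static Conf withOwnCredentials(Conf that) {
            return new Conf(new Creds(that.creds));
        }
    }

    public static void main(String[] args) {
        Conf oldJob = new Conf();
        oldJob.creds.tokens.put("hdfs", "token-1");

        Conf aliased = new Conf(oldJob);
        Conf isolated = Conf.withOwnCredentials(oldJob);

        // A write through the aliased job is visible to the old job as well.
        aliased.creds.tokens.put("rm", "token-2");

        System.out.println("aliased shares object:  " + (aliased.creds == oldJob.creds));
        System.out.println("old job sees new token: " + oldJob.creds.tokens.containsKey("rm"));
        System.out.println("isolated sees new token: " + isolated.creds.tokens.containsKey("rm"));
    }
}
```

With the aliased constructor, any thread that touches one job's tokens is also touching the other job's tokens, which is where the unsynchronized access comes from.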

The credentials from JobConf are passed to YARNRunner#submitJob, which calls 
createApplicationSubmissionContext to configure the tokens in 
ContainerLaunchContext:
{code}
    DataOutputBuffer dob = new DataOutputBuffer();
    ts.writeTokenStorageToStream(dob);
    ByteBuffer securityTokens =
        ByteBuffer.wrap(dob.getData(), 0, dob.getLength());
    ContainerLaunchContext amContainer =
        ContainerLaunchContext.newInstance(localResources, environment,
          vargsFinal, null, securityTokens, acls);
{code}
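
As a rough illustration of how a race during this serialization step could end in an EOFException on the reading side, here is a single-threaded sketch. The format (an entry count followed by the entries) and the names TornTokenWriteSketch, write and readHitsEof are assumptions for illustration only, not the actual token storage code. The Runnable simulates another job changing the shared map between the count being written and the entries being written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class TornTokenWriteSketch {
    /** Writes the entry count, runs the simulated interference, then writes the entries. */
    static byte[] write(Map<String, String> tokens, Runnable interference) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(tokens.size());
            interference.run(); // stands in for a concurrent writer on the shared map
            for (Map.Entry<String, String> e : tokens.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if reading the declared number of entries runs past the end of the data. */
    static boolean readHitsEof(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                in.readUTF();
                in.readUTF();
            }
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new HashMap<>();
        tokens.put("hdfs", "token-1");
        tokens.put("rm", "token-2");

        byte[] clean = write(tokens, () -> { });
        byte[] torn = write(tokens, () -> tokens.remove("rm"));

        System.out.println("clean buffer hits EOF: " + readHitsEof(clean));
        System.out.println("torn buffer hits EOF:  " + readHitsEof(torn));
    }
}
```

In the torn case the header promises more entries than the buffer contains, so the reader runs off the end of the stream.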
It looks like we have two other potential issues in JobConf and Credentials.
I created MAPREDUCE-6269 and HADOOP-11667 for separate discussion.

> AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
> ------------------------------------------------------------------------------
>
>                 Key: YARN-2893
>                 URL: https://issues.apache.org/jira/browse/YARN-2893
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Gera Shegalov
>            Assignee: zhihai xu
>         Attachments: YARN-2893.000.patch
>
>
> MapReduce jobs on our clusters experience sporadic failures due to corrupt 
> tokens in the AM launch context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
