[ 
https://issues.apache.org/jira/browse/HDFS-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13130845#comment-13130845
 ] 

Daryn Sharp commented on HDFS-2447:
-----------------------------------

The problem was found to be that the JT couldn't contact the remote NN to renew 
a token due to a firewall.  The tasks on the DNs were however able to contact 
the remote NN so the job succeeded.  However, the job would have failed if it 
executed past the token expiration since the JT was unable to renew the token.

If the JT has to acquire tokens for a job, and acquisition fails, the job will 
fail.  This is the ideal behavior, but there's a loophole...  If the JT finds 
the token in the job's token cache, then it "assumes" the token must be valid.  
The reality may be that the token is invalid, canceled, long expired, or the NN 
can't even be reached.  In all of these cases, the tasks get fired off anyway, 
just to clog up a cluster while they die a long slow death.  Actually, on 0.23, 
it's been observed that tasks using an invalid token will pound on the NN every 
second -- on one cluster this happened for a month!

The JT immediately issues a token renewal and then uses a timer for future 
renewals.  However, all renewals are done in a separate thread, which means 
that if the initial renewal fails because the token is bad, the job starts 
anyway.  The simple solution is for the first renewal to occur in the job's 
context, so that an exception will kill the job, and for future renewals to 
remain thread-based.
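
A rough sketch of that idea, not the actual patch.  RenewalSketch, 
RenewableToken and registerForRenewal are hypothetical names made up for 
illustration, and the "renew at 90% of the remaining lifetime" rule is an 
assumption, not the JT's real policy.  The point is only that the first 
renew() runs on the caller's thread, so its exception reaches job submission, 
while later renewals stay on a timer thread.

```java
import java.io.IOException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative only: synchronous first renewal, timer-based renewals afterwards. */
class RenewalSketch {

    /** Hypothetical stand-in for a renewable delegation token. */
    interface RenewableToken {
        /** Renews the token and returns its new expiration time (epoch millis). */
        long renew() throws IOException;
    }

    private final ScheduledExecutorService timer;

    RenewalSketch(ScheduledExecutorService timer) {
        this.timer = timer;
    }

    /**
     * Meant to be called in the job's context during submission.  The first
     * renewal happens on the calling thread, so a bad or unreachable token
     * surfaces as an IOException and the caller can fail the job.
     */
    void registerForRenewal(RenewableToken token) throws IOException {
        long expiry = token.renew();   // synchronous: failure propagates to the caller
        scheduleNext(token, expiry);   // only reached if the token is usable
    }

    /** Later renewals run on the timer thread; failures there can only be logged. */
    private void scheduleNext(RenewableToken token, long expiry) {
        long remaining = expiry - System.currentTimeMillis();
        // Assumption: renew after 90% of the remaining lifetime has elapsed.
        long delay = Math.max(0L, remaining * 9 / 10);
        timer.schedule(() -> {
            try {
                scheduleNext(token, token.renew());
            } catch (IOException e) {
                System.err.println("background renewal failed: " + e.getMessage());
            }
        }, delay, TimeUnit.MILLISECONDS);
    }
}
```

With this shape, a token that is invalid, canceled, expired, or whose NN 
can't be reached stops the job before any tasks are launched, instead of being 
discovered later by the renewal thread.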
                
> Distcp with hdfs:// passed with error in JT log while copying from .20.204  
> to .20.205 ( with useIp=false)
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-2447
>                 URL: https://issues.apache.org/jira/browse/HDFS-2447
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 0.20.205.0
>            Reporter: Rajit Saha
>            Assignee: Daryn Sharp
>
> I tried to copy a file from .20.204 to .20.205 with distcp over hdfs:// while 
> using hadoop.security.token.service.use_ip=false in core-site.xml. The copy 
> was successful, but I found an 
> "org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal:" exception 
> in the .20.205 JT log.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
