[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969446#comment-15969446
 ] 

Jason Lowe commented on MAPREDUCE-6876:
---------------------------------------

The input format needs to fetch the delegation tokens.  The job submitter code 
is not omniscient when it comes to processing input splits because it does not, 
and cannot, know all the different types of input.  Therefore it relies on the 
pluggable input format to do anything specific to the type of input.  For 
example, an HBase input format can grab delegation tokens for accessing HBase.  
The job submitter code does not know where the input lives nor how to grab 
tokens for it -- that's the responsibility of the input format.
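
To make that division of responsibility concrete, here is a minimal Java 
sketch.  It deliberately uses simplified stand-in types rather than the real 
Hadoop classes: SketchCredentials, SketchInputFormat, the two example formats, 
SketchJobSubmitter, the service names, and the token strings are all 
hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a job's credential store (not Hadoop's own class).
final class SketchCredentials {
    private final Map<String, String> tokens = new LinkedHashMap<>();

    void addToken(String service, String token) { tokens.put(service, token); }
    boolean hasToken(String service) { return tokens.containsKey(service); }
    int size() { return tokens.size(); }
}

// Hypothetical stand-in for a pluggable input format.
interface SketchInputFormat {
    // Computes the splits and, as a side effect, stores whatever tokens
    // this particular kind of input needs.
    List<String> getSplits(SketchCredentials creds);
}

// File-style input: needs a token for the (assumed) file system service.
final class SketchFileInputFormat implements SketchInputFormat {
    private final List<String> paths;

    SketchFileInputFormat(List<String> paths) { this.paths = paths; }

    @Override
    public List<String> getSplits(SketchCredentials creds) {
        creds.addToken("namenode", "placeholder-fs-token");
        return new ArrayList<>(paths);
    }
}

// Table-style input: needs a token for a different (assumed) service.
final class SketchTableInputFormat implements SketchInputFormat {
    private final int regions;

    SketchTableInputFormat(int regions) { this.regions = regions; }

    @Override
    public List<String> getSplits(SketchCredentials creds) {
        creds.addToken("table-server", "placeholder-table-token");
        List<String> splits = new ArrayList<>();
        for (int i = 0; i < regions; i++) {
            splits.add("region-" + i);
        }
        return splits;
    }
}

// The submitter is generic: it never inspects the input type and has no
// token logic of its own.  It only calls getSplits on whatever it is given.
final class SketchJobSubmitter {
    static List<String> computeSplits(SketchInputFormat format, SketchCredentials creds) {
        return format.getSplits(creds);
    }
}
```

In this sketch, supporting a new kind of input means writing a new 
SketchInputFormat; SketchJobSubmitter does not change.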

The whole purpose of InputFormat.getSplits (which in turn calls listStatus in 
the FileInputFormat implementation) is to give the MapReduce framework the 
input-specific details it needs to run tasks.  It _is_ creating the tasks in a 
sense, because every split corresponds to a map task.  If the input format 
computes fewer splits, there will be fewer tasks.  It's a one-to-one 
relationship.  Without the tokens the tasks cannot run, and as I explained 
above, the job submitter code cannot know how to grab input-type-specific 
credentials.  The input format has only two APIs, getSplits and 
createRecordReader.  The latter is used only by the tasks when they need to 
start reading the input.  getSplits is called only by clients (e.g., job 
submitters), so it's the only place we can do input-specific actions for job 
submission.
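
The one-to-one relationship between splits and map tasks can be illustrated 
with a small Java sketch.  This is not how FileInputFormat actually sizes 
splits (it ignores block sizes and the other details the real implementation 
considers); MapTaskPlanSketch, computeSplits, and planMapTasks are hypothetical 
names, and the fixed-size cutting rule is an assumption made to keep the 
example short.

```java
import java.util.ArrayList;
import java.util.List;

final class MapTaskPlanSketch {
    // Hypothetical helper: cut a file of fileLength bytes into byte ranges
    // of at most splitSize bytes.  Each element is {offset, length}.
    static List<long[]> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    // One map task per split: the number of tasks always equals the number
    // of splits the input format produced.
    static List<String> planMapTasks(List<long[]> splits) {
        List<String> tasks = new ArrayList<>();
        for (int i = 0; i < splits.size(); i++) {
            long[] s = splits.get(i);
            tasks.add("map-" + i + "[offset=" + s[0] + ", length=" + s[1] + "]");
        }
        return tasks;
    }
}
```

For a 300-byte input, a split size of 128 bytes yields three splits and 
therefore three planned tasks, while a split size of 256 bytes yields two.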

My point is that I don't think the problem Spark is hitting with tokens is 
specific to FileInputFormat.  If Spark calls getSplits on arbitrary MapReduce 
input formats, then any format designed to work in a secure cluster 
environment will need to grab tokens for its remote input servers.

> FileInputFormat.listStatus should not fetch delegation tokens
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-6876
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6876
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Michael Gummelt
>
> {{FileInputFormat.listStatus}} fetches delegation tokens: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L213
> AFAICT, this is unnecessary.  {{listStatus}} doesn't delegate those tokens to 
> another process.  This is causing issues described in the attached Spark 
> Kerberos ticket, because {{TokenCache.obtainTokensForNameNodes}}, which is 
> used to fetch the delegation tokens, assumes that certain MapReduce 
> configuration variables are set, which isn't true in the Spark calling code.  
> This is a separate problem, but nonetheless it wouldn't have arisen if 
> {{listStatus}} weren't fetching delegation tokens.



