[
https://issues.apache.org/jira/browse/MAPREDUCE-6876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969446#comment-15969446
]
Jason Lowe commented on MAPREDUCE-6876:
---------------------------------------
The input format needs to fetch the delegation tokens. The job submitter code
is not omnipotent when it comes to processing input splits because it does not,
and cannot, know all the different types of input. Therefore it relies on the
pluggable input format to do anything specific for the type of input. For
example, an HBase input format can grab delegation tokens for accessing HBase.
The job submitting code does not know where the input lives nor how to grab
tokens for it -- that's the responsibility of the input format.
The whole purpose of InputFormat.getSplits (which in turn calls listStatus for
the FileInputFormat implementation) is to gather the input-specific details
that the MapReduce framework needs to run tasks. It _is_ creating the tasks in
a sense, because every split corresponds to a map task. If the input format
computes fewer splits, then there will be fewer tasks. It's a one-to-one
relationship. Without the tokens the tasks cannot run, and as I explained
above, the job submitter code cannot know how to grab input-type-specific
credentials. The input format has only two APIs, getSplits and
createRecordReader. The latter is used only by the tasks when they need to
start reading the input. getSplits is called only by clients (e.g., job
submitters), so it's the only place we can do input-specific actions for job
submission.
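
The division of labor described above can be sketched in plain Java. This is
a simplified model under stated assumptions, not the Hadoop API: the names
SketchCredentials, SketchInputFormat, SketchFileInputFormat, and
SketchJobSubmitter are hypothetical, splits are modeled as plain strings, and
a "token" is just a placeholder string keyed by the service name taken from
the part of each path before the first '/'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a credentials store (not Hadoop's Credentials). */
class SketchCredentials {
    private final Map<String, String> tokens = new HashMap<>();

    void addToken(String service, String token) { tokens.put(service, token); }
    boolean hasToken(String service) { return tokens.containsKey(service); }
    int size() { return tokens.size(); }
}

/** Hypothetical pluggable input format; only the client-side call is modeled. */
interface SketchInputFormat {
    List<String> getSplits(SketchCredentials creds);
}

/** File-like input: knows which services its paths live on, so it fetches tokens. */
class SketchFileInputFormat implements SketchInputFormat {
    private final List<String> paths;

    SketchFileInputFormat(List<String> paths) { this.paths = paths; }

    @Override
    public List<String> getSplits(SketchCredentials creds) {
        List<String> splits = new ArrayList<>();
        for (String path : paths) {
            // Assumption: every path looks like "service/rest/of/path".
            String service = path.substring(0, path.indexOf('/'));
            // Input-specific step: obtain one (placeholder) token per service.
            if (!creds.hasToken(service)) {
                creds.addToken(service, "token-for-" + service);
            }
            // Assumption: one split per path, to keep the model small.
            splits.add(path);
        }
        return splits;
    }
}

/** Generic submitter: knows nothing about where the input lives. */
class SketchJobSubmitter {
    /** Returns the number of map tasks, which equals the number of splits. */
    static int submit(SketchInputFormat format, SketchCredentials creds) {
        List<String> splits = format.getSplits(creds);
        return splits.size();
    }

    public static void main(String[] args) {
        SketchCredentials creds = new SketchCredentials();
        SketchInputFormat format = new SketchFileInputFormat(
                Arrays.asList("nn1/data/a", "nn1/data/b", "nn2/logs/c"));
        int tasks = submit(format, creds);
        System.out.println("tasks=" + tasks + " tokens=" + creds.size());
    }
}
```

The submitter only calls getSplits and counts the result. Everything that
depends on the type of input, including which services need tokens, stays
inside the input format.
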
So what I'm getting at here is that I don't think the problem Spark is hitting
with tokens is specific to FileInputFormat. If Spark tries to call getSplits on
arbitrary MapReduce input formats, then those formats will need to grab tokens
for any remote input servers if they are designed to work in a secure cluster
environment.
> FileInputFormat.listStatus should not fetch delegation tokens
> -------------------------------------------------------------
>
> Key: MAPREDUCE-6876
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6876
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Reporter: Michael Gummelt
>
> {{FileInputFormat.listStatus}} fetches delegation tokens:
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L213
> AFAICT, this is unnecessary. {{listStatus}} doesn't delegate those tokens to
> another process. This is causing issues described in the attached Spark
> Kerberos ticket, because {{TokenCache.obtainTokensForNameNodes}}, which is
> used to fetch the delegation tokens, assumes that certain MapReduce
> configuration variables are set, which isn't true in the Spark calling code.
> This is a separate problem, but nonetheless it wouldn't have arisen if
> {{listStatus}} weren't fetching delegation tokens.