[ https://issues.apache.org/jira/browse/SPARK-20328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968469#comment-15968469 ]

Michael Gummelt commented on SPARK-20328:
-----------------------------------------

bq. I have no idea what that means.

I'm pretty sure a delegation token is just another way for a subject to 
authenticate.  So the driver uses the delegation token provided to it by 
{{spark-submit}} to authenticate.  This is what I mean by "driver is already 
logged in via the delegation token".  Since the driver is authenticated, it can 
request further delegation tokens.  But my point is that it shouldn't need to, 
because that code is not "delegating" the tokens to any other process, which is 
the only thing delegation tokens are needed for.
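
For what it's worth, a rough sketch of what I mean, using the Hadoop security API.  This is not Spark's actual code; the env var handling and the HDFS path are just illustrative:

{code}
import java.io.File
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

val conf = new Configuration()

// Suppose the launcher handed this process a delegation token file (the env var name
// and the manual loading here are assumptions for the sake of the sketch; UGI can
// also pick such a file up automatically at login).
sys.env.get("HADOOP_TOKEN_FILE_LOCATION").foreach { tokenFile =>
  val creds = Credentials.readTokenStorageFile(new File(tokenFile), conf)
  // Attaching the tokens to the current UGI is the "login": later HDFS calls
  // authenticate with the delegation token instead of a Kerberos TGT.
  UserGroupInformation.getCurrentUser.addCredentials(creds)
}

// This read works without requesting any *further* delegation tokens, because the
// token itself is a sufficient credential for this process; new tokens are only
// needed when delegating to some other process.
val fs = FileSystem.get(conf)
fs.listStatus(new Path("/user/spark")).foreach(s => println(s.getPath))
{code}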

But this is neither here nor there.  I think I know what I have to do.

> HadoopRDDs create a MapReduce JobConf, but are not MapReduce jobs
> -----------------------------------------------------------------
>
>                 Key: SPARK-20328
>                 URL: https://issues.apache.org/jira/browse/SPARK-20328
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.1.1, 2.1.2
>            Reporter: Michael Gummelt
>
> In order to obtain {{InputSplit}} information, {{HadoopRDD}} creates a 
> MapReduce {{JobConf}} out of the Hadoop {{Configuration}}: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L138
> Semantically, this is a problem because a HadoopRDD does not represent a 
> Hadoop MapReduce job.  Practically, this is a problem because this line: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L194
>  results in this MapReduce-specific security code being called: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/security/TokenCache.java#L130,
>  which assumes the MapReduce master is configured (e.g. via 
> {{yarn.resourcemanager.*}}).  If it isn't, an exception is thrown.
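> To make the call path concrete, here is a simplified sketch of what 
> {{HadoopRDD.getPartitions}} boils down to.  It is not the literal Spark source; 
> {{TextInputFormat}} and the input path are placeholders:
> {code}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
>
> val hadoopConf = new Configuration()
>
> // HadoopRDD wraps the plain Configuration in a MapReduce JobConf...
> val jobConf = new JobConf(hadoopConf)
> FileInputFormat.setInputPaths(jobConf, "hdfs:///some/input/path")  // illustrative path
>
> // ...and asks the old-API InputFormat for splits.  With Kerberos enabled,
> // FileInputFormat.listStatus calls TokenCache.obtainTokensForNamenodes(jobConf, ...),
> // which looks up the MapReduce/YARN master principal to use as the token renewer
> // and throws if none is configured, even though no MapReduce job is involved.
> val inputFormat = new TextInputFormat()
> inputFormat.configure(jobConf)
> val splits = inputFormat.getSplits(jobConf, 2)  // 2 = minimum number of splits
> splits.foreach(println)
> {code}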
> So I'm seeing this exception thrown as I'm trying to add Kerberos support for 
> the Spark Mesos scheduler:
> {code}
> Exception in thread "main" java.io.IOException: Can't get Master Kerberos principal for use as renewer
>       at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:116)
>       at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
>       at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
>       at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:205)
>       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>       at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
> {code}
> I have a workaround where I set a YARN-specific configuration variable to 
> trick {{TokenCache}} into thinking YARN is configured, but this is obviously 
> suboptimal.
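> Concretely, that kind of workaround amounts to something like the following.  The 
> exact variable isn't named above; {{yarn.resourcemanager.principal}} is my best guess, 
> since it is what the MapReduce {{Master}}/{{TokenCache}} code consults for the renewer 
> principal when the framework is YARN:
> {code}
> import org.apache.hadoop.conf.Configuration
>
> val hadoopConf = new Configuration()
> // Any non-empty value satisfies TokenCache's renewer-principal check, even though
> // no YARN ResourceManager exists in a Mesos deployment; hence the workaround is
> // suboptimal.
> hadoopConf.set("yarn.resourcemanager.principal", "dummy@EXAMPLE.COM")
> {code}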
> The proper fix to this would likely require significant {{hadoop}} 
> refactoring to make split information available without going through 
> {{JobConf}}, so I'm not yet sure what the best course of action is.


