[ https://issues.apache.org/jira/browse/SPARK-24493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Saisai Shao updated SPARK-24493:
--------------------------------
    Summary: Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3  (was: Kerberos Ticket Renewal is failing in long running Spark job)

> Kerberos Ticket Renewal is failing in Hadoop 2.8+ and Hadoop 3
> --------------------------------------------------------------
>
>                 Key: SPARK-24493
>                 URL: https://issues.apache.org/jira/browse/SPARK-24493
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, YARN
>    Affects Versions: 2.3.0
>            Reporter: Asif M
>            Priority: Major
>
> Kerberos ticket renewal is failing in a long-running Spark job. I added the following two Kerberos properties to the HDFS configuration and ran a Spark Streaming job ([hdfs_wordcount.py|https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/hdfs_wordcount.py]):
> {noformat}
> dfs.namenode.delegation.token.max-lifetime=1800000 (30min)
> dfs.namenode.delegation.token.renew-interval=900000 (15min)
> {noformat}
>
> The Spark job failed at 15 minutes with the following error:
> {noformat}
> 18/06/04 18:56:51 INFO DAGScheduler: ShuffleMapStage 10896 (call at /usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py:2381) failed in 0.218 s due to Job aborted due to stage failure: Task 0 in stage 10896.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10896.0 (TID 7290, <GatewayNodeHostname>, executor 1): org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for abcd: HDFS_DELEGATION_TOKEN owner=a...@example.com, renewer=yarn, realUser=, issueDate=1528136773875, maxDate=1528138573875, sequenceNumber=38, masterKeyId=6) is expired, current time: 2018-06-04 18:56:51,276+0000 expected renewal time: 2018-06-04 18:56:13,875+0000
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1499)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1445)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1355)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>     at com.sun.proxy.$Proxy18.getBlockLocations(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:317)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>     at com.sun.proxy.$Proxy19.getBlockLocations(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:856)
>     at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:845)
>     at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:834)
>     at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:998)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:326)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:334)
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
>     at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:86)
>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:189)
>     at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:186)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:141)
>     at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:70)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:105)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:99)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:109)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {noformat}
> *Steps to Reproduce:*
> # Add the {{dfs.namenode.delegation.token.max-lifetime}} and {{dfs.namenode.delegation.token.renew-interval}} properties to the HDFS config and restart the affected services.
> # Run the Spark Streaming job on a gateway node of the cluster in one terminal tab:
> {noformat}
> /bin/spark-submit --master yarn --principal <SPN> --keytab <Keytab File Full Path> hdfs_wordcount.py "/tmp/streaming_input" 2>&1 | tee driver.log
> {noformat}
> After 15 minutes, the Spark application terminates with the error shown above.
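
To relate the token fields in the InvalidToken message to the configured lifetimes, the epoch-millisecond values can be decoded as follows. This is only a minimal sketch: the numbers are copied from the log line in the report, while the helper name from_epoch_ms and the variable names are made up for illustration and are not part of Spark or Hadoop.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware datetime; integer milliseconds are added to it
# so that no floating-point rounding is involved.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_ms(ms: int) -> datetime:
    """Convert epoch milliseconds (as printed in the token) to a UTC datetime.

    Hypothetical helper, named here only for this example.
    """
    return EPOCH + timedelta(milliseconds=ms)


# Values copied from the InvalidToken message in the report.
issue_date = from_epoch_ms(1528136773875)   # issueDate
max_date = from_epoch_ms(1528138573875)     # maxDate

# "current time: 2018-06-04 18:56:51,276+0000" from the same message.
failure_time = datetime(2018, 6, 4, 18, 56, 51, 276000, tzinfo=timezone.utc)

# Span between issueDate and maxDate, to compare with
# dfs.namenode.delegation.token.max-lifetime=1800000 (30min).
token_span = max_date - issue_date

# How far past maxDate the failing request arrived.
past_max_date = failure_time - max_date

print("issueDate    :", issue_date.isoformat(timespec="milliseconds"))
print("maxDate      :", max_date.isoformat(timespec="milliseconds"))
print("token span   :", token_span)
print("past maxDate :", past_max_date)
```

With these inputs, issueDate decodes to 18:26:13.875 UTC, the span to maxDate is exactly 30 minutes (matching the configured max-lifetime), and the logged "expected renewal time" of 18:56:13,875 coincides with maxDate; the failing request arrived about 37.4 seconds later. The log does not show when the application itself was started, so this says nothing about the job's total runtime.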