Spark session dies after ~40 hours when running against a secure Hadoop cluster.
spark-submit is run with --principal and --keytab, so Kerberos ticket renewal works fine according to the logs. Does something break on the HDFS connection instead? These messages come up every second (complete stack trace: http://pastebin.com/QxcQvpqm):

> 16/03/11 16:04:59 WARN hdfs.LeaseRenewer: Failed to renew lease for
> [DFSClient_NONMAPREDUCE_1534318438_13] for 2802 seconds. Will retry
> shortly ...
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> token (HDFS_DELEGATION_TOKEN token 1349 for rdautkha) can't be found in
> cache

Then after an hour it stops retrying:

> 16/03/11 16:18:17 WARN hdfs.DFSClient: Failed to renew lease for
> DFSClient_NONMAPREDUCE_1534318438_13 for 3600 seconds (>= hard-limit =3600
> seconds.) Closing all files being written ...
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> token (HDFS_DELEGATION_TOKEN token 1349 for rdautkha) can't be found in
> cache

It doesn't look like a Kerberos principal ticket renewal problem, because the ticket would expire much sooner (our default lifetime is 12 hours), and the logs show Spark's Kerberos ticket renewer working fine. Is it some other HDFS delegation token renewal process that breaks?

Environment:
RHEL 6.7
Spark 1.5
Hadoop 2.6

I found HDFS-5322 and YARN-2648, which seem relevant, but I am not sure they describe the same problem. It appears to be a Spark-specific issue, as I have only seen it with Spark.

The problem is reproducible: just wait ~40 hours and the Spark session is no good.

Thanks,
Ruslan