Mine is the same scenario. I get the HDFS_DELEGATION_TOKEN issue exactly 7 days after the Spark job starts, and the job then gets killed.
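For reference, 7 days is the default value of dfs.namenode.delegation.token.max-lifetime (604800000 ms) in hdfs-default.xml, and the default dfs.namenode.delegation.token.renew-interval is 24 hours. Below is a minimal sketch of that arithmetic, assuming the cluster has not overridden those defaults. The helper name `token_deadlines` and the job start time are hypothetical and only there for illustration.

```python
from datetime import datetime, timedelta

# Hadoop defaults from hdfs-default.xml; a cluster may override them.
MAX_LIFETIME_MS = 604_800_000    # dfs.namenode.delegation.token.max-lifetime
RENEW_INTERVAL_MS = 86_400_000   # dfs.namenode.delegation.token.renew-interval


def token_deadlines(issued_at,
                    max_lifetime_ms=MAX_LIFETIME_MS,
                    renew_interval_ms=RENEW_INTERVAL_MS):
    """Return (first_renewal_due, hard_expiry) for a token issued at issued_at.

    Hypothetical helper: it only does the date arithmetic and does not
    talk to HDFS.
    """
    first_renewal_due = issued_at + timedelta(milliseconds=renew_interval_ms)
    hard_expiry = issued_at + timedelta(milliseconds=max_lifetime_ms)
    return first_renewal_due, hard_expiry


# Hypothetical job start time, for illustration only.
job_start = datetime(2016, 3, 4, 12, 0, 0)
renew_due, expiry = token_deadlines(job_start)
print("first renewal due:", renew_due)
print("hard expiry:", expiry)
```

If the job dies at the hard expiry rather than at a renewal deadline, that would be consistent with the token reaching its maximum lifetime rather than a missed renewal, but this is only a guess from the timing.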
I'm also looking for a solution.

Regards,
Nik.

On Fri, Mar 11, 2016 at 8:10 PM, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:

> Spark session dies out after ~40 hours when running against a secure
> Hadoop cluster.
>
> spark-submit has --principal and --keytab, so Kerberos ticket renewal works
> fine according to the logs.
>
> Something happens with the HDFS dfs connection?
>
> These messages come up every 1 second:
> See complete stack: http://pastebin.com/QxcQvpqm
>
>> 16/03/11 16:04:59 WARN hdfs.LeaseRenewer: Failed to renew lease for
>> [DFSClient_NONMAPREDUCE_1534318438_13] for 2802 seconds. Will retry
>> shortly ...
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>> token (HDFS_DELEGATION_TOKEN token 1349 for rdautkha) can't be found in
>> cache
>
> Then in 1 hour it stops trying:
>
>> 16/03/11 16:18:17 WARN hdfs.DFSClient: Failed to renew lease for
>> DFSClient_NONMAPREDUCE_1534318438_13 for 3600 seconds (>= hard-limit =3600
>> seconds.) Closing all files being written ...
>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>> token (HDFS_DELEGATION_TOKEN token 1349 for rdautkha) can't be found in
>> cache
>
> It doesn't look like a Kerberos principal ticket renewal problem, because
> that would expire much sooner (by default we have 12 hours), and from the
> logs the Spark Kerberos ticket renewer works fine.
>
> Is it some other HDFS delegation token renewal process that breaks?
>
>> RHEL 6.7
>> Spark 1.5
>> Hadoop 2.6
>
> Found HDFS-5322 and YARN-2648, which seem relevant, but I am not sure if it's
> the same problem.
> It seems to be a Spark problem, as I have only seen this problem in Spark.
> This is a reproducible problem: just wait ~40 hours and the Spark session
> is no good.
>
> Thanks,
> Ruslan