Wei-Chiu Chuang created SPARK-37329:
---------------------------------------
Summary: File system delegation tokens are leaked
Key: SPARK-37329
URL: https://issues.apache.org/jira/browse/SPARK-37329
Project: Spark
Issue Type: Bug
Components: Security, YARN
Affects Versions: 2.4.0
Reporter: Wei-Chiu Chuang

On a very busy Hadoop cluster (with HDFS at-rest encryption) we found that KMS accumulated millions of delegation tokens that were not cancelled even after the jobs had finished. KMS runs out of memory within a day because of this delegation token leak.

We were able to reproduce the bug in a smaller test cluster, and found that when a Spark job starts, it acquires two delegation tokens, but only one is cancelled properly after the job finishes. The other one is left over and lingers for up to 7 days (the default Hadoop delegation token lifetime).

YARN handles the lifecycle of a delegation token properly if its renewer is 'yarn'. However, Spark intentionally (a hack?) acquires a second delegation token with the job issuer as the renewer, simply to get the token renewal interval. That token is then ignored but never cancelled.

Proposal: cancel the second delegation token immediately after the token renewal interval has been obtained.

Environment: CDH 6.3.2 (based on Apache Spark 2.4.0), but the bug has probably been present since day 1.
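To make the proposal concrete, here is a minimal, self-contained sketch of the leak and the proposed fix. It is only a model, not Spark or Hadoop code: TokenService, RenewalIntervalProbe, and the method names are hypothetical stand-ins for a token issuer such as KMS, and the 1-day renew interval is an assumed value (the 7-day maximum lifetime is the default mentioned above).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a token issuer (e.g. KMS).
// It only tracks which tokens are still live and when they were issued.
class TokenService {
    static final long RENEW_INTERVAL_MS = 24L * 60 * 60 * 1000;     // assumed: 1 day
    static final long MAX_LIFETIME_MS = 7L * 24 * 60 * 60 * 1000;   // 7 days, as in the report

    private final Map<Integer, Long> issueDates = new HashMap<>();
    private int nextId = 0;

    // Issues a new token and returns its id.
    int issue(long now) {
        int id = nextId++;
        issueDates.put(id, now);
        return id;
    }

    // Renews a live token and returns its new expiration time.
    long renew(int id, long now) {
        Long issued = issueDates.get(id);
        if (issued == null) {
            throw new IllegalStateException("unknown or cancelled token " + id);
        }
        return Math.min(now + RENEW_INTERVAL_MS, issued + MAX_LIFETIME_MS);
    }

    // Cancels a token so the issuer no longer has to keep it in memory.
    void cancel(int id) {
        issueDates.remove(id);
    }

    int liveTokens() {
        return issueDates.size();
    }
}

class RenewalIntervalProbe {
    // Behaviour described in the report: the probe token is used once and then abandoned.
    static long leaky(TokenService service, long now) {
        int probe = service.issue(now);
        return service.renew(probe, now) - now;
    }

    // Proposed behaviour: cancel the probe token as soon as the interval is known,
    // even if the renew call fails.
    static long fixed(TokenService service, long now) {
        int probe = service.issue(now);
        try {
            return service.renew(probe, now) - now;
        } finally {
            service.cancel(probe);
        }
    }

    public static void main(String[] args) {
        TokenService service = new TokenService();
        leaky(service, 0L);
        System.out.println("live tokens after leaky probe: " + service.liveTokens());
        fixed(service, 0L);
        System.out.println("live tokens after fixed probe: " + service.liveTokens());
    }
}
```

In this model both variants return the same renewal interval; the only difference is that the fixed variant leaves no extra token behind on the issuer. The try/finally is a design choice of the sketch so that a failed renew does not leak the probe token either.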