Chengbing Liu created HDFS-7798:
-----------------------------------

             Summary: Checkpointing failure caused by shared KerberosAuthenticator
                 Key: HDFS-7798
                 URL: https://issues.apache.org/jira/browse/HDFS-7798
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: security
            Reporter: Chengbing Liu
            Priority: Critical

We have occasionally observed checkpointing failures in our real cluster: the standby NameNode was not able to upload the image to the active NameNode.

After some digging, the root cause appears to be a shared {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is designed as a use-once instance and is not stateless: it holds attributes such as an {{HttpURLConnection}} and a {{URL}}. When multiple threads call {{URLConnectionFactory#openConnection(...)}} concurrently, they race on the shared authenticator's state, and the image upload fails.

Therefore, as a first step that does not break the current API, I propose creating a new {{KerberosAuthenticator}} instance for each connection so that checkpointing works. Afterwards, we may consider making the {{Authenticator}} design and implementation stateless, as {{ConnectionConfigurator}} is.
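
To illustrate the idea of one authenticator per connection, here is a minimal sketch. It does not use the real Hadoop classes: {{StatefulAuthenticator}} and {{PerConnectionFactory}} are hypothetical stand-ins, a {{URI}} replaces the {{URL}}/{{HttpURLConnection}} state, and the returned "token" string is invented purely for demonstration. The actual patch may look different.

```java
import java.net.URI;
import java.util.function.Supplier;

/** Hypothetical stand-in for a use-once authenticator that keeps per-request state. */
class StatefulAuthenticator {
    private URI target; // per-request state, overwritten on every call

    String authenticate(URI target) {
        this.target = target;
        // If the instance were shared, another thread could overwrite
        // this.target at this point and we would act on the wrong request.
        return "token-for-" + this.target.getPath();
    }
}

/** Hypothetical factory that obtains a fresh authenticator for each connection. */
class PerConnectionFactory {
    private final Supplier<StatefulAuthenticator> authenticators;

    PerConnectionFactory(Supplier<StatefulAuthenticator> authenticators) {
        this.authenticators = authenticators;
    }

    String openConnection(URI target) {
        // A new instance per call: no authenticator state is shared between threads.
        StatefulAuthenticator auth = authenticators.get();
        return auth.authenticate(target);
    }
}

class PerConnectionAuthDemo {
    public static void main(String[] args) {
        PerConnectionFactory factory =
            new PerConnectionFactory(StatefulAuthenticator::new);
        // The host name below is a placeholder, not a real endpoint.
        System.out.println(
            factory.openConnection(URI.create("http://active-nn.example.com/imagetransfer")));
    }
}
```

Passing a {{Supplier}} rather than an instance keeps the factory's public method unchanged while guaranteeing that each call works on its own authenticator; whether the real fix uses a supplier or simply calls the constructor inline is left open here.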