[ https://issues.apache.org/jira/browse/HDFS-7798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14321793#comment-14321793 ]
Chengbing Liu commented on HDFS-7798:
-------------------------------------

The checkpointing failure happens when image uploading and edit log fetching occur at the same time.

> Checkpointing failure caused by shared KerberosAuthenticator
> ------------------------------------------------------------
>
>                 Key: HDFS-7798
>                 URL: https://issues.apache.org/jira/browse/HDFS-7798
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: security
>            Reporter: Chengbing Liu
>            Assignee: Chengbing Liu
>            Priority: Critical
>         Attachments: HDFS-7798.01.patch
>
>
> We have observed occasional checkpointing failures in our real cluster. The
> standby NameNode was not able to upload the image to the active NameNode.
> After some digging, the root cause appears to be a shared
> {{KerberosAuthenticator}} in {{URLConnectionFactory}}. The authenticator is
> designed as a use-once instance and is not stateless. It has attributes such
> as {{HttpURLConnection}} and {{URL}}. When multiple threads call
> {{URLConnectionFactory#openConnection(...)}}, the shared authenticator is
> subject to a race condition, which results in a failed image upload.
> Therefore, as a first step, without breaking the current API, I propose that
> we create a new {{KerberosAuthenticator}} instance for each connection, so
> that checkpointing works. We may consider making the {{Authenticator}} design
> and implementation stateless afterwards, as {{ConnectionConfigurator}} is.
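
To illustrate the race described above, here is a minimal, self-contained sketch. It does not use the Hadoop classes: StatefulAuthenticator and PerConnectionAuthenticatorSketch are hypothetical stand-ins, and the single url field only imitates the per-request state that the issue attributes to KerberosAuthenticator. The sketch compares one shared instance against one fresh instance per call, which is the approach the proposal takes; it is not the content of the attached patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

// Hypothetical stand-in for a stateful, use-once authenticator (not the Hadoop class).
class StatefulAuthenticator {
    private String url; // per-request state kept in a field, as the issue describes

    // Stores the target, then reads it back later, as a multi-step handshake would.
    String authenticate(String target) throws InterruptedException {
        this.url = target;
        Thread.sleep(1); // widen the window between the write and the read
        return this.url;
    }
}

class PerConnectionAuthenticatorSketch {
    // Runs `threads` concurrent authentications and counts how many calls
    // read back a URL that a different thread had stored.
    static int countMismatches(Supplier<StatefulAuthenticator> supplier, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                final String target = "http://namenode.example/" + i; // made-up URL
                results.add(pool.submit(
                        () -> !target.equals(supplier.get().authenticate(target))));
            }
            int mismatches = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {
                    mismatches++;
                }
            }
            return mismatches;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Shared instance: threads may overwrite each other's state (count varies per run).
        StatefulAuthenticator shared = new StatefulAuthenticator();
        System.out.println("shared instance mismatches: " + countMismatches(() -> shared, 8));

        // Fresh instance per call: each thread owns its state, so nothing can be overwritten.
        System.out.println("per-call instance mismatches: "
                + countMismatches(StatefulAuthenticator::new, 8));
    }
}
```

With a fresh instance per call, no thread can observe another thread's URL, so the mismatch count for that case is always zero. The shared case depends on thread timing, so its count differs from run to run.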