[ https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967790#comment-16967790 ]
Graham Campbell commented on KAFKA-7987: ---------------------------------------- [~junrao] We're starting to see these errors more frequently (guess our network is getting less reliable), so I'm looking at a fix for this. Does scheduling a reinitialize() in the ZookeeperClient after notifying handlers seem like a reasonable solution? I don't see a way to get more details from the ZK client to try to tell if the auth failure was caused by a retriable error or not. > a broker's ZK session may die on transient auth failure > ------------------------------------------------------- > > Key: KAFKA-7987 > URL: https://issues.apache.org/jira/browse/KAFKA-7987 > Project: Kafka > Issue Type: Improvement > Reporter: Jun Rao > Priority: Major > > After a transient network issue, we saw the following log in a broker. > {code:java} > [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed: > javax.security.sasl.SaslException: An error: > (java.security.PrivilegedActionException: javax.security.sasl.SaslException: > GSS initiate failed [Caused by GSSException: No valid credentials provided > (Mechanism level: Server not found in Kerberos database (7))]) occurred when > evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client > will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn) > [23:37:02,102] ERROR [ZooKeeperClient] Auth failed. > (kafka.zookeeper.ZooKeeperClient) > {code} > The network issue prevented the broker from communicating to ZK. The broker's > ZK session then expired, but the broker didn't know that yet since it > couldn't establish a connection to ZK. When the network was back, the broker > tried to establish a connection to ZK, but failed due to auth failure (likely > due to a transient KDC issue). The current logic just ignores the auth > failure without trying to create a new ZK session. Then the broker will be > permanently in a state that it's alive, but not registered in ZK. > -- This message was sent by Atlassian Jira (v8.3.4#803005)