[ https://issues.apache.org/jira/browse/KAFKA-7987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775565#comment-16775565 ]
Jun Rao commented on KAFKA-7987:
--------------------------------

One potential way to fix this is to handle auth failure in ZooKeeperClient in the same way as session expiration by constantly retrying establishing the connection until success.

> a broker's ZK session may die on transient auth failure
> -------------------------------------------------------
>
>                 Key: KAFKA-7987
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7987
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Jun Rao
>            Priority: Major
>
> After a transient network issue, we saw the following log in a broker.
> {code:java}
> [23:37:02,102] ERROR SASL authentication with Zookeeper Quorum member failed:
> javax.security.sasl.SaslException: An error:
> (java.security.PrivilegedActionException: javax.security.sasl.SaslException:
> GSS initiate failed [Caused by GSSException: No valid credentials provided
> (Mechanism level: Server not found in Kerberos database (7))]) occurred when
> evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client
> will go to AUTH_FAILED state. (org.apache.zookeeper.ClientCnxn)
> [23:37:02,102] ERROR [ZooKeeperClient] Auth failed.
> (kafka.zookeeper.ZooKeeperClient)
> {code}
> The network issue prevented the broker from communicating to ZK. The broker's
> ZK session then expired, but the broker didn't know that yet since it
> couldn't establish a connection to ZK. When the network was back, the broker
> tried to establish a connection to ZK, but failed due to auth failure (likely
> due to a transient KDC issue). The current logic just ignores the auth
> failure without trying to create a new ZK session. Then the broker will be
> permanently in a state that it's alive, but not registered in ZK.
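
The retry idea in the comment can be illustrated with a minimal sketch. This is not the actual kafka.zookeeper.ZooKeeperClient code and it does not use the ZooKeeper client library. The class AuthFailureRetrySketch, the Outcome enum, the connectUntilSuccess method, and the fixed backoff are all hypothetical names and choices made for illustration. The only point it shows is that an auth failure takes the same "wait, then open a fresh session" path as a session expiration instead of being ignored.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

final class AuthFailureRetrySketch {

    /** Hypothetical connection outcomes; NOT ZooKeeper's real KeeperState enum. */
    enum Outcome { CONNECTED, SESSION_EXPIRED, AUTH_FAILED }

    /**
     * Calls connectAttempt until it reports CONNECTED and returns the number
     * of attempts made. AUTH_FAILED is handled exactly like SESSION_EXPIRED:
     * back off briefly, then try to establish a new session.
     */
    static int connectUntilSuccess(Supplier<Outcome> connectAttempt, long backoffMs)
            throws InterruptedException {
        int attempts = 0;
        while (true) {
            attempts++;
            Outcome outcome = connectAttempt.get();
            if (outcome == Outcome.CONNECTED) {
                return attempts;
            }
            // Both failure kinds share one path: wait, then retry with a new session.
            Thread.sleep(backoffMs);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated sequence: expired session, transient auth failure, then success.
        Iterator<Outcome> outcomes = Arrays.asList(
                Outcome.SESSION_EXPIRED, Outcome.AUTH_FAILED, Outcome.CONNECTED).iterator();
        int attempts = connectUntilSuccess(outcomes::next, 10L);
        System.out.println("connected after " + attempts + " attempts");
    }
}
```

In a real broker the supplier would stand in for whatever reinitializes the ZK session, and the loop would presumably also need to stop on shutdown. Both are left out here because the issue does not specify them.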