[ https://issues.apache.org/jira/browse/ZOOKEEPER-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931573#comment-15931573 ]
Rakesh R commented on ZOOKEEPER-2712:
-------------------------------------

Thanks a lot [~hanm] for the unit test results and comments.

bq. I am curious what exactly cause the races though - from the description it was not very clear to me. Do you mind to elaborate a little bit with regards to what code in test case that uses what function in the dependency library and what the race condition is?

I hope the following explanation helps in understanding the concurrency flow.

Below is the auth failed exception that we frequently hit in our {{KerberosSecurityTestcase}} related unit test cases.
{code}
2017-03-17 15:55:51,397 [myid:] - WARN  [NioProcessor-3:KerberosProtocolHandler@241] - Server not found in Kerberos database (7)
2017-03-17 15:55:51,398 [myid:] - WARN  [NioProcessor-3:KerberosProtocolHandler@242] - Server not found in Kerberos database (7)
		[Krb5LoginModule] authentication failed
Server not found in Kerberos database (7) - Server not found in Kerberos database
2017-03-17 15:55:51,409 [myid:1] - ERROR [Thread-3:QuorumPeerTestBase$MainThread@145] - unexpected exception in run
javax.security.sasl.SaslException: Failed to initialize authentication mechanism using SASL [Caused by javax.security.auth.login.LoginException: Server not found in Kerberos database (7) - Server not found in Kerberos database]
	at org.apache.zookeeper.server.quorum.auth.SaslQuorumAuthServer.<init>(SaslQuorumAuthServer.java:69)
	at org.apache.zookeeper.server.quorum.QuorumPeer.initialize(QuorumPeer.java:570)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:162)
{code}

As we know, the test case creates a ZK cluster of size 3 and uses the [QuorumAuthTestBase.startServer() #L186|https://github.com/apache/zookeeper/blob/branch-3.4/src/java/test/org/apache/zookeeper/server/quorum/auth/QuorumAuthTestBase.java#L186] function to start each server in a separate thread. So we have three servers starting in parallel in three different threads.
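
To illustrate the startup pattern described above, here is a minimal sketch (not the actual ZooKeeper test code). It starts three threads at the same moment and guards a placeholder login step with a shared lock, which is one possible way to make the logins run one at a time. All names here ({{SequentialLoginSketch}}, {{LoginAction}}, {{LOGIN_LOCK}}, {{runServers}}) are hypothetical, and the sleep stands in for a real Krb login call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

class SequentialLoginSketch {
    // Shared lock (hypothetical): only one thread may run the login step at a time.
    private static final Object LOGIN_LOCK = new Object();

    // Hypothetical stand-in for a login step; a real test would perform a Krb login here.
    interface LoginAction {
        void login() throws Exception;
    }

    // Runs the given login step while holding the shared lock.
    static void serializedLogin(LoginAction action) throws Exception {
        synchronized (LOGIN_LOCK) {
            action.login();
        }
    }

    // Starts `servers` threads that all attempt to log in at once and
    // returns the highest number of logins observed running concurrently.
    static int runServers(int servers) throws InterruptedException {
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxActive = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        List<Thread> threads = new ArrayList<>();

        for (int i = 0; i < servers; i++) {
            Thread t = new Thread(() -> {
                try {
                    start.await(); // release all threads together
                    serializedLogin(() -> {
                        int now = active.incrementAndGet();
                        maxActive.accumulateAndGet(now, Math::max);
                        Thread.sleep(20); // placeholder for the login work
                        active.decrementAndGet();
                    });
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            threads.add(t);
            t.start();
        }

        start.countDown();
        for (Thread t : threads) {
            t.join();
        }
        return maxActive.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("max concurrent logins: " + runServers(3));
    }
}
```

Because every login step runs inside the same {{synchronized}} block, {{runServers(3)}} never observes more than one login in progress, even though all three threads are released together.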
During startup, each server initializes SaslQuorumAuthServer and SaslQuorumAuthLearner (see [QuorumPeer#init|https://github.com/apache/zookeeper/blob/branch-3.4/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java#L570]) and performs the auth login. For the Krb login, it internally uses the ApacheDS library, as this test case is based on {{KerberosSecurityTestcase}}.

I experimented with a test scenario that performs multiple Krb {{javax.security.auth.login.LoginContext#login()}} calls simultaneously, and it hits exactly the same error, {{Server not found in Kerberos database}}. Later, I made the logins sequential and never hit the "Server not found" problem. I personally feel that the ApacheDS login module is sharing some resources, which results in the concurrency failure.

IMHO, fixing ApacheDS is not in our scope, and the sequential login change makes the test case more consistent. Does this make sense to you?

> MiniKdc test case intermittently failing due to principal not found in Kerberos database
> ----------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2712
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2712
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: tests
>            Reporter: Rakesh R
>            Assignee: Rakesh R
>            Priority: Critical
>             Fix For: 3.4.10
>
>         Attachments: TEST-org.apache.zookeeper.server.quorum.auth.QuorumKerberosAuthTest.txt
>
>
> MiniKdc test cases are intermittently failing due to not finding the principal. Below is the failure stacktrace.
> {code}
> 2017-03-08 13:21:10,843 [myid:] - ERROR [NioProcessor-1:AuthenticationService@187] - Error while searching for client lear...@example.com : Client not found in Kerberos database
> 2017-03-08 13:21:10,843 [myid:] - WARN [NioProcessor-2:KerberosProtocolHandler@241] - Server not found in Kerberos database (7)
> 2017-03-08 13:21:10,845 [myid:] - WARN [NioProcessor-2:KerberosProtocolHandler@242] - Server not found in Kerberos database (7)
> 2017-03-08 13:21:10,844 [myid:] - WARN [NioProcessor-1:KerberosProtocolHandler@241] - Client not found in Kerberos database (6)
> 2017-03-08 13:21:10,845 [myid:] - WARN [NioProcessor-1:KerberosProtocolHandler@242] - Client not found in Kerberos database (6)
> {code}
> Will attach the detailed log to jira.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)