Daniel Wong created ZOOKEEPER-4236:
--------------------------------------
Summary: Java Client SendThread create many unnecessary Login
objects
Key: ZOOKEEPER-4236
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4236
Project: ZooKeeper
Issue Type: Bug
Reporter: Daniel Wong
Hi, I am an Apache Phoenix committer, and I help manage many ZooKeeper
clusters at my employer, primarily using ZK for HBase use cases. We recently
had a production incident where some of our ACLs were not set up, which
prevented connectivity from the client to the ZK nodes, and the failure path
exposed 2 issues to fix: this Jira and ZOOKEEPER-4235. This Jira is the less
important of the 2 and deals with the large number of Login objects created.
We had hundreds of threads per JVM with the following stack trace.
{code:java}
java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketConnect([email protected]/Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect([email protected]/AbstractPlainSocketImpl.java:399)
    - locked <0x00000015004fde20> (a java.net.SocksSocketImpl)
    at java.net.AbstractPlainSocketImpl.connectToAddress([email protected]/AbstractPlainSocketImpl.java:242)
    at java.net.AbstractPlainSocketImpl.connect([email protected]/AbstractPlainSocketImpl.java:224)
    at java.net.SocksSocketImpl.connect([email protected]/SocksSocketImpl.java:403)
    at java.net.Socket.connect([email protected]/Socket.java:609)
    at sun.security.krb5.internal.TCPClient.<init>([email protected]/NetClient.java:62)
    at sun.security.krb5.internal.NetClient.getInstance([email protected]/NetClient.java:42)
    at sun.security.krb5.KdcComm$KdcCommunication.run([email protected]/KdcComm.java:401)
    at sun.security.krb5.KdcComm$KdcCommunication.run([email protected]/KdcComm.java:364)
    at java.security.AccessController.doPrivileged([email protected]/Native Method)
    at sun.security.krb5.KdcComm.send([email protected]/KdcComm.java:348)
    at sun.security.krb5.KdcComm.sendIfPossible([email protected]/KdcComm.java:253)
    at sun.security.krb5.KdcComm.send([email protected]/KdcComm.java:234)
    at sun.security.krb5.KdcComm.send([email protected]/KdcComm.java:200)
    at sun.security.krb5.KrbAsReqBuilder.send([email protected]/KrbAsReqBuilder.java:326)
    at sun.security.krb5.KrbAsReqBuilder.action([email protected]/KrbAsReqBuilder.java:371)
    at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication([email protected]/Krb5LoginModule.java:754)
    at com.sun.security.auth.module.Krb5LoginModule.login([email protected]/Krb5LoginModule.java:592)
    at javax.security.auth.login.LoginContext.invoke([email protected]/LoginContext.java:726)
    at javax.security.auth.login.LoginContext$4.run([email protected]/LoginContext.java:665)
    at javax.security.auth.login.LoginContext$4.run([email protected]/LoginContext.java:663)
    at java.security.AccessController.doPrivileged([email protected]/Native Method)
    at javax.security.auth.login.LoginContext.invokePriv([email protected]/LoginContext.java:663)
    at javax.security.auth.login.LoginContext.login([email protected]/LoginContext.java:574)
    at org.apache.zookeeper.Login.login(Login.java:304)
    - locked <0x000000151c477148> (a org.apache.zookeeper.Login)
    at org.apache.zookeeper.Login.<init>(Login.java:106)
    at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslClient(ZooKeeperSaslClient.java:249)
    - locked <0x000000151c476f68> (a org.apache.zookeeper.client.ZooKeeperSaslClient)
    at org.apache.zookeeper.client.ZooKeeperSaslClient.<init>(ZooKeeperSaslClient.java:141)
    at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:972)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1031)
{code}
Note that these threads were logging in to our 10 ZK nodes, yet we had
hundreds of Login objects. In theory we should need at most 10 Logins.
This Jira is intended to limit the number of Login objects/clients to the
number actually needed. Note that a combination of JIRAs
https://issues.apache.org/jira/browse/ZOOKEEPER-2375 and
https://issues.apache.org/jira/browse/ZOOKEEPER-2139 removed the singleton at
the Login level but left unnecessary synchronization code in place. This could
be improved again via either a singleton, perhaps at the SaslClient layer, or
some sort of connection -> login cache, so that new connections would reuse or
wait for the same objects in failure paths.
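
To make the connection -> login cache idea concrete, here is a minimal sketch.
It is only an illustration under stated assumptions, not existing ZooKeeper
code: the class name SharedLoginCache, its methods, and the choice of cache
key (for example a server address or a JAAS login context name) are all
hypothetical, and the login type is left generic so the sketch does not depend
on org.apache.zookeeper.Login.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/**
 * Hypothetical per-key cache of login objects (not part of ZooKeeper).
 * K is whatever identifies a login (an assumption, e.g. a server address);
 * L is the login object type.
 */
final class SharedLoginCache<K, L> {
    private final ConcurrentMap<K, L> logins = new ConcurrentHashMap<>();
    private final Function<K, L> loginFactory;

    SharedLoginCache(Function<K, L> loginFactory) {
        this.loginFactory = loginFactory;
    }

    /** Returns the login for the key, creating it at most once per key. */
    L get(K key) {
        // computeIfAbsent applies the factory atomically per key: concurrent
        // callers for the same key wait for the first creation and then share
        // its result. If the factory throws an unchecked exception, nothing
        // is cached and the next caller retries.
        return logins.computeIfAbsent(key, loginFactory);
    }

    /** Drops a cached login, e.g. after its credentials become unusable. */
    void invalidate(K key) {
        logins.remove(key);
    }

    int size() {
        return logins.size();
    }
}
```

With a cache like this, 100s of connection attempts against 10 ZK nodes would
create at most 10 login objects. A real factory would have to wrap the checked
LoginException in an unchecked exception, since Function cannot throw it.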