Swathi Mocharla created ZOOKEEPER-4842:
------------------------------------------
Summary: Zookeeper quorum is not formed intermittently with
trailing dot in the cluster domain name
Key: ZOOKEEPER-4842
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4842
Project: ZooKeeper
Issue Type: Bug
Components: quorum
Affects Versions: 3.8.4
Reporter: Swathi Mocharla
On Kubernetes, we have configured the cluster domain with a trailing dot. With
this setup, the ZooKeeper quorum frequently fails to form.
{code:java}
bash-4.4$ env -u KAFKA_OPTS zookeeper-shell localhost:2181 config
Connecting to localhost:2181
[2024-06-25 10:36:39,178] WARN Client session timed out, have not heard from server in 30031ms for session id 0x0 (org.apache.zookeeper.ClientCnxn)
[2024-06-25 10:36:39,182] WARN Session 0x0 for server localhost/[0:0:0:0:0:0:0:1]:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. (org.apache.zookeeper.ClientCnxn)
org.apache.zookeeper.ClientCnxn$SessionTimeoutException: Client session timed out, have not heard from server in 30031ms for session id 0x0
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1257)
KeeperErrorCode = ConnectionLoss for /zookeeper/config
{code}
In the ZooKeeper logs, we see many IOException, UnknownHostException and
InterruptedException entries.
{code:java}
java.io.IOException: ZooKeeperServer not running
    at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:565)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:350)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
{"type":"log", "host":"zk-swkf-2.default", "level":"WARN", "systemid":"zookeeper-2b13339237454984887b4908dc3a6df0", "system":"zookeeper", "time":"2024-06-25T10:23:16.325Z", "timezone":"UTC", "log":{"message":"NIOWorkerThread-1 - org.apache.zookeeper.server.NIOServerCnxn - Close of session 0x0"}}
java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
    at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1453)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:99)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1277)
{code}
This is the content of /etc/resolv.conf:
{code:java}
bash-4.4$ cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local bcmt
nameserver 10.254.0.10
options ndots:5
{code}
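To make the effect of the search list and ndots:5 above easier to reason about, here is a rough sketch of the order in which a glibc-style stub resolver would try names, as described in resolv.conf(5). It is an illustration only: the class and method names (ResolverCandidates, candidates) are made up for this example, and it does not claim to reflect the resolver actually in use inside the pod or anything ZooKeeper itself does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ZooKeeper: lists the fully qualified
// names a glibc-style resolver would query, in order, for a given input.
class ResolverCandidates {
    static List<String> candidates(String name, List<String> search, int ndots) {
        List<String> out = new ArrayList<>();
        if (name.endsWith(".")) {
            // A trailing dot marks the name as absolute: the search list is skipped.
            out.add(name);
            return out;
        }
        long dots = name.chars().filter(c -> c == '.').count();
        if (dots >= ndots) {
            // Enough dots: the name is tried as-is before the search list.
            out.add(name + ".");
        }
        for (String suffix : search) {
            out.add(name + "." + suffix + ".");
        }
        if (dots < ndots) {
            // Too few dots: the name is tried as-is only after the search list.
            out.add(name + ".");
        }
        return out;
    }

    public static void main(String[] args) {
        // Search list and ndots value copied from the resolv.conf above.
        List<String> search = List.of(
            "default.svc.cluster.local", "svc.cluster.local", "cluster.local", "bcmt");
        for (String n : List.of(
                "zk-swkf.default.svc.cluster.local",
                "zk-swkf.default.svc.cluster.local.")) {
            System.out.println(n + " -> " + candidates(n, search, 5));
        }
    }
}
```

Under these assumptions, the name without the trailing dot has only four dots, so it goes through all four search suffixes before being tried as-is, while the name with the trailing dot is queried exactly once.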
{code:java}
[root@vm-10-76-72-33 ckaf-kafka]# nslookup zk-swkf.default.svc.cluster.local.
Server: 10.76.72.33
Address: 10.76.72.33#53
Name: zk-swkf.default.svc.cluster.local
Address: 10.254.94.24
[root@vm-10-76-72-33 ckaf-kafka]# nslookup zk-swkf.default.svc.cluster.local
Server: 10.76.72.33
Address: 10.76.72.33#53
Name: zk-swkf.default.svc.cluster.local
Address: 10.254.94.24
{code}
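The nslookup output above appears to come from the VM host (server 10.76.72.33), not from the nameserver listed in the pod's resolv.conf (10.254.0.10), and it does not go through the JVM. A small diagnostic along the following lines, run inside the ZooKeeper pod, could show whether Java resolves both forms of the name. This is only a sketch: the class name TrailingDotLookup and its helpers are hypothetical, and the default hostname is taken from the nslookup example.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

// Hypothetical diagnostic, not part of ZooKeeper.
class TrailingDotLookup {
    // Returns the host without and with a trailing dot, in that order.
    static List<String> variants(String host) {
        String bare = host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
        return List.of(bare, bare + ".");
    }

    // Resolves the host through the JVM and reports the outcome as text.
    static String tryResolve(String host) {
        try {
            return "OK " + InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return "UnknownHostException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Default taken from the nslookup example; pass another name as args[0].
        String host = args.length > 0 ? args[0] : "zk-swkf.default.svc.cluster.local.";
        for (String v : variants(host)) {
            System.out.println(v + " -> " + tryResolve(v));
        }
    }
}
```

Running it several times against each server's own hostname may help tell whether the UnknownHostException entries are tied to one form of the name or occur intermittently for both.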