[ https://issues.apache.org/jira/browse/CURATOR-638?focusedWorklogId=791722&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-791722 ]
ASF GitHub Bot logged work on CURATOR-638:
------------------------------------------

                Author: ASF GitHub Bot
            Created on: 17/Jul/22 07:49
            Start Date: 17/Jul/22 07:49
    Worklog Time Spent: 10m
      Work Description: eolivelli merged PR #425:
URL: https://github.com/apache/curator/pull/425

Issue Time Tracking
-------------------

    Worklog Id:     (was: 791722)
    Time Spent: 0.5h  (was: 20m)

> Curator disconnect from zookeeper when IPs change
> -------------------------------------------------
>
>                 Key: CURATOR-638
>                 URL: https://issues.apache.org/jira/browse/CURATOR-638
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Client, Recipes
>    Affects Versions: 5.2.1
>         Environment: Docker or Kubernetes, docker example provided
>            Reporter: Francis Simon
>            Priority: Blocker
>             Fix For: 5.4.0
>
>         Attachments: zkissue.zip
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is blocking usage of ZooKeeper in production. I tried a few versions and all had the issue. It affects any recipe that uses ephemeral nodes. An example is attached.
>
> We use multiple Apache Curator recipes in our system, which runs in Docker and Kubernetes. The behavior I am seeing is that Curator appears to resolve to the IP addresses of the containers rather than staying tied to their DNS names. I have seen old tickets on this, but the behavior is reproducible on the latest release.
>
> We run ZooKeeper in containers on Kubernetes. In Kubernetes many things can cause a container to move hosts; the pod disruption budget ensures that a quorum is always present. But with this bug, if all nodes move for any reason and get new IP addresses, clients disconnect when they shouldn't. Disconnecting has the bad side effect that all ephemeral nodes are lost. For us this affects coordination, distributed locking, and service discovery. It causes production downtime, so this is marked as a Blocker.
>
> I have a simple sample that just uses the service discovery recipe to register a bunch of services in ZooKeeper. I run the example with docker-compose, and it is 100% reproducible:
>
> {code:bash}
> # Stand up zookeeper and wait for it to be healthy
> docker-compose up -d zookeeper1 zookeeper2 zookeeper3
> # Stand up a server and make sure it is connected and working as expected
> docker-compose up -d server1
> # Take down a single zookeeper node and stand up another agent.
> # The agent will grab the old zookeeper's IP address
> docker-compose rm -s zookeeper1
> docker-compose up -d server2
> # Bring the zookeeper node back up.
> # Wait for it to be healthy
> docker-compose up -d zookeeper1
> # Then take down the next zookeeper node and stand up another agent.
> # The agent will grab the old zookeeper's IP address
> docker-compose rm -s zookeeper2
> docker-compose up -d server3
> # Bring the zookeeper node back up.
> # Wait for it to be healthy
> docker-compose up -d zookeeper2
> # Then take down the next zookeeper node and stand up another agent.
> # The agent will grab the old zookeeper's IP address
> docker-compose rm -s zookeeper3
> docker-compose up -d server4
> # Bring the zookeeper node back up.
> # Wait for it to be healthy
> docker-compose up -d zookeeper3
> {code}
>
> When the third ZooKeeper node is taken down, server1 (the first server that was stood up) receives a Disconnected status, because the IPs of all three nodes have now changed from their original addresses.
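>
> For illustration only, here is a minimal sketch of the kind of registration and query the sample performs with the service discovery recipe. This is not the attached zkissue code: the base path {{/myservices}} and service name {{test}} are taken from the logs below, while the connect string, class name, address, and port are placeholders.
>
> {code:java}
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.retry.ExponentialBackoffRetry;
> import org.apache.curator.x.discovery.ServiceDiscovery;
> import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
> import org.apache.curator.x.discovery.ServiceInstance;
>
> public class DiscoveryExample {
>     public static void main(String[] args) throws Exception {
>         // Hostnames, not IPs, go into the connect string; per the report above, the
>         // client nevertheless appears to stay pinned to the addresses they resolved
>         // to when the connection was first established.
>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>                 "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181",
>                 new ExponentialBackoffRetry(1000, 3));
>         client.start();
>
>         // Surface the Disconnected/Reconnected transitions described above.
>         client.getConnectionStateListenable().addListener(
>                 (c, newState) -> System.out.println("Connection state: " + newState));
>
>         ServiceInstance<String> instance = ServiceInstance.<String>builder()
>                 .name("test")            // service name seen in the logs below
>                 .address("server-1")     // placeholder address
>                 .port(8080)              // placeholder port
>                 .build();
>
>         ServiceDiscovery<String> discovery = ServiceDiscoveryBuilder.builder(String.class)
>                 .client(client)
>                 .basePath("/myservices") // base path seen in the stack trace below
>                 .thisInstance(instance)  // registered as an ephemeral znode
>                 .build();
>         discovery.start();
>
>         // The same kind of query the servers in the logs run against the registry.
>         System.out.println("Query instances for service test");
>         for (ServiceInstance<String> found : discovery.queryForInstances("test")) {
>             System.out.println("service description: http://"
>                     + found.getAddress() + ":" + found.getPort());
>         }
>
>         // Keep the process, and therefore the ephemeral registration, alive.
>         Thread.sleep(Long.MAX_VALUE);
>     }
> }
> {code}
>
> Once all three ZooKeeper containers have been recycled onto new IPs, a client like this reports Disconnected even though the ensemble itself is healthy. The failure observed on server1 at that point: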
>
> {code}
> server1_1 | Query instances for servicetest
> server1_1 | Exception in thread "main" java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /myservices/test/62e23a0b-dfdb-46f5-966f-8dc7a4978c70
> server1_1 |     at org.apache.curator.shaded.com.google.common.base.Throwables.propagate(Throwables.java:241)
> server1_1 |     at org.apache.curator.utils.ExceptionAccumulator.propagate(ExceptionAccumulator.java:38)
> server1_1 |     at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.close(ServiceDiscoveryImpl.java:171)
> server1_1 |     at org.apache.curator.shaded.com.google.common.io.Closeables.close(Closeables.java:78)
> server1_1 |     at org.apache.curator.utils.CloseableUtils.closeQuietly(CloseableUtils.java:59)
> server1_1 |     at zkissue.App.main(App.java:72)
> server1_1 | Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /myservices/test/62e23a0b-dfdb-46f5-966f-8dc7a4978c70
> server1_1 |     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> server1_1 |     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> server1_1 |     at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)
> server1_1 |     at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:274)
> server1_1 |     at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:268)
> server1_1 |     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
> server1_1 |     at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:265)
> server1_1 |     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:249)
> server1_1 |     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:34)
> server1_1 |     at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.internalUnregisterService(ServiceDiscoveryImpl.java:520)
> server1_1 |     at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.close(ServiceDiscoveryImpl.java:157)
> server1_1 |     ... 3 more
> {code}
>
> This causes server1 to disconnect and lose its discovery state, which can be seen from the other services:
>
> {code}
> server2_1 | Query instances for servicetest
> server2_1 | test
> server2_1 | service description: http://server-4:57456
> server2_1 | service description: http://server-3:37740
> server2_1 | service description: http://server-2:40219
> {code}
>
> I should mention that the ZooKeeper cluster is always happy and healthy throughout; this is a client-side issue.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)