[
https://issues.apache.org/jira/browse/CURATOR-638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zili Chen updated CURATOR-638:
------------------------------
Fix Version/s: 5.4.0
> Curator disconnect from zookeeper when IPs change
> -------------------------------------------------
>
> Key: CURATOR-638
> URL: https://issues.apache.org/jira/browse/CURATOR-638
> Project: Apache Curator
> Issue Type: Bug
> Components: Client, Recipes
> Affects Versions: 5.2.1
> Environment: Docker or Kubernetes, docker example provided
> Reporter: Francis Simon
> Priority: Blocker
> Fix For: 5.4.0
>
> Attachments: zkissue.zip
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> This blocks our use of ZooKeeper in production. I tried several versions
> and all had the issue. It affects any recipe that uses ephemeral nodes. An
> example is attached.
> We use multiple Apache Curator recipes in our system, which runs in Docker
> and Kubernetes. The behavior I am seeing is that Curator appears to resolve
> the DNS names in the connection string to the containers' IP addresses and
> then stays tied to those IPs rather than re-resolving the names. I have
> seen old tickets on this, but the behavior is still reproducible with the
> latest release.
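>
> For reference, the client is built from DNS names, roughly like this (a
> sketch only; the connection string and retry settings are illustrative,
> the actual code is in the attached zkissue.zip):
>
> {code:java}
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.retry.ExponentialBackoffRetry;
>
> public class ClientSetup {
>     // The connection string uses DNS names, yet the client ends up pinned
>     // to the IPs those names resolved to when it first connected.
>     public static CuratorFramework newClient() {
>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>                 "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181",
>                 new ExponentialBackoffRetry(1000, 3));
>         client.start();
>         return client;
>     }
> }
> {code}
>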
> We run ZooKeeper in containers on Kubernetes. In Kubernetes many things
> can cause a container to move hosts; a pod disruption budget ensures that
> a quorum is always present. But with this bug, if all nodes move for any
> reason and get new IP addresses, clients disconnect when they shouldn't.
> Disconnecting has the bad side effect that all ephemeral nodes are lost,
> which for us affects coordination, distributed locking, and service
> discovery. It causes production downtime, so I have marked this as a
> Blocker.
> I have a simple sample that uses the service discovery recipe to register
> a handful of services in ZooKeeper (see the registration sketch after the
> reproduction steps below). I run the example with docker compose; it is
> 100% reproducible.
>
> {code:bash}
> # Standup zookeeper and wait for it to be healthy
> docker-compose up -d zookeeper1 zookeeper2 zookeeper3
> # Stand up a server and make sure it is connected and working as expected
> docker-compose up -d server1
> # Take down a single zookeeper node and stand up another agent.
> # The agent will grab the old zookeeper's IP address
> docker-compose rm -s zookeeper1
> docker-compose up -d server2
> # Bring the zookeeper node back up.
> # Wait for it to be healthy
> docker-compose up -d zookeeper1
> # Then take down the next zookeeper node and stand up another agent.
> # The agent will grab the old zookeeper's IP address
> docker-compose rm -s zookeeper2
> docker-compose up -d server3
> # Bring the zookeeper node back up.
> # Wait for it to be healthy
> docker-compose up -d zookeeper2
> # Then take down the next zookeeper node and stand up another agent.
> # The agent will grab the old zookeeper's IP address
> docker-compose rm -s zookeeper3
> docker-compose up -d server4
> # Bring the zookeeper node back up.
> # Wait for it to be healthy
> docker-compose up -d zookeeper3{code}
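>
> For reference, the registration side of the attached sample is roughly the
> following (a sketch with illustrative names and port; the real code is in
> zkissue.zip):
>
> {code:java}
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.x.discovery.ServiceDiscovery;
> import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
> import org.apache.curator.x.discovery.ServiceInstance;
>
> public class Register {
>     // Registers one instance under /myservices/test, matching the paths
>     // seen in the stack trace below.
>     public static ServiceDiscovery<Void> register(CuratorFramework client)
>             throws Exception {
>         ServiceInstance<Void> instance = ServiceInstance.<Void>builder()
>                 .name("test")
>                 .address("server-1")
>                 .port(8080)
>                 .build();
>         ServiceDiscovery<Void> discovery = ServiceDiscoveryBuilder
>                 .builder(Void.class)
>                 .client(client)
>                 .basePath("/myservices")
>                 .thisInstance(instance)   // stored as an ephemeral node
>                 .build();
>         discovery.start();
>         return discovery;
>     }
> }
> {code}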
>
> When the 3rd zookeeper node is taken down, server1 (the first server that
> was stood up) receives a Disconnected status, because the IPs of all three
> nodes have now changed from the addresses the client originally resolved.
>
> {code:java}
> server1_1 | Query instances for servicetest
> server1_1 | Exception in thread "main" java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /myservices/test/62e23a0b-dfdb-46f5-966f-8dc7a4978c70
> server1_1 |     at org.apache.curator.shaded.com.google.common.base.Throwables.propagate(Throwables.java:241)
> server1_1 |     at org.apache.curator.utils.ExceptionAccumulator.propagate(ExceptionAccumulator.java:38)
> server1_1 |     at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.close(ServiceDiscoveryImpl.java:171)
> server1_1 |     at org.apache.curator.shaded.com.google.common.io.Closeables.close(Closeables.java:78)
> server1_1 |     at org.apache.curator.utils.CloseableUtils.closeQuietly(CloseableUtils.java:59)
> server1_1 |     at zkissue.App.main(App.java:72)
> server1_1 | Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /myservices/test/62e23a0b-dfdb-46f5-966f-8dc7a4978c70
> server1_1 |     at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> server1_1 |     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> server1_1 |     at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:2001)
> server1_1 |     at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:274)
> server1_1 |     at org.apache.curator.framework.imps.DeleteBuilderImpl$5.call(DeleteBuilderImpl.java:268)
> server1_1 |     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
> server1_1 |     at org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:265)
> server1_1 |     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:249)
> server1_1 |     at org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:34)
> server1_1 |     at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.internalUnregisterService(ServiceDiscoveryImpl.java:520)
> server1_1 |     at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.close(ServiceDiscoveryImpl.java:157)
> server1_1 | ... 3 more
> {code}
>
> This causes server1 to disconnect and lose its discovery state, which can
> be seen from the other services: server1 is no longer in the instance list.
> {code:java}
> server2_1 | Query instances for servicetest
> server2_1 | test
> server2_1 | service description: http://server-4:57456
> server2_1 | service description: http://server-3:37740
> server2_1 | service description: http://server-2:40219{code}
>
> I should mention that the ZooKeeper cluster itself stays healthy the whole
> time; this is a client-side issue.
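>
> As a stopgap, we are looking at re-registering on session loss from a
> connection state listener (a sketch only, using Curator's standard
> ConnectionStateListener mechanism; reRegister is a hypothetical
> application callback, not part of Curator):
>
> {code:java}
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.state.ConnectionState;
>
> public class LostSessionMitigation {
>     // 'reRegister' is a hypothetical application callback that recreates
>     // the ephemeral registrations; it is not part of Curator.
>     public static void install(CuratorFramework client, Runnable reRegister) {
>         client.getConnectionStateListenable().addListener((c, state) -> {
>             if (state == ConnectionState.LOST) {
>                 // Ephemeral nodes are gone once the session is lost.
>                 reRegister.run();
>             }
>         });
>     }
> }
> {code}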
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)