[ https://issues.apache.org/jira/browse/ZOOKEEPER-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829949#comment-16829949 ]
Shawn Heisey commented on ZOOKEEPER-2348:
-----------------------------------------

I will admit that trying to trace the description of what our user has said against the description here is making my head hurt. But it sounds to me like their situation and the one described here are at least similar, if not identical.

Solr is running the ZK 3.4.13 client. The version info from the user is "For context, this is a cluster running Solr 7.7.1 and ZooKeeper 3.4.13 (being monitored by Exhibitor 1.7.1)." So I think they're running 3.4.13 on the server side as well.

Here's the detailed scenario we got:

* We have three ZooKeeper nodes: A, B, and C. A is the leader of the ensemble.
* ZooKeeper A becomes partitioned from ZooKeeper B and C and from the Solr tier.
* Some Solr nodes log "zkclient has disconnected" warnings, and ZooKeeper A expires some Solr client sessions due to timeouts. The partition between ZooKeeper A and the Solr tier ends, and Solr nodes that were connected to ZooKeeper A attempt to renew their sessions but are told their sessions have expired. [1]
* Note that I'm simplifying: some nodes that were connected to ZooKeeper A were able to move their sessions to ZooKeeper B/C before their sessions expired. [2]
* ZooKeeper A realizes it is not synced with ZooKeeper B and C, closes its connections with Solr nodes, and apparently remains partitioned from B/C.
* ZooKeeper B and C eventually elect ZooKeeper B as the leader and start accepting write requests as they form a quorum.
* Solr nodes previously connected to ZooKeeper A that had their sessions expire now connect to ZooKeeper B and C, successfully publish their state as DOWN, and then attempt to write to /live_nodes to signal that they're reconnected to ZooKeeper.
* The writes of the ephemeral znodes to /live_nodes fail with NodeExists exceptions [3]. The failed writes are logged on ZooKeeper B.
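The NodeExists failures in the last step suggest a client-side recovery pattern: on NodeExists, check whether the existing ephemeral belongs to a stale session and, if so, delete it and recreate. The sketch below is purely illustrative — the in-memory map stands in for the ZooKeeper data tree, and the names (`EphemeralRecovery`, `ensureEphemeral`) are invented; a real client would call `create()`, catch `KeeperException.NodeExistsException`, and compare `Stat.getEphemeralOwner()` with its current session id.

```java
import java.util.HashMap;
import java.util.Map;

public class EphemeralRecovery {

    public enum Outcome { CREATED, ALREADY_OWNED, REPLACED_STALE }

    /**
     * Ensure an ephemeral node exists at path and is owned by mySession.
     * store maps path -> owning session id (a stand-in for ephemeralOwner).
     */
    public static Outcome ensureEphemeral(Map<String, Long> store,
                                          String path, long mySession) {
        Long owner = store.get(path);
        if (owner == null) {
            store.put(path, mySession);    // plain create succeeded
            return Outcome.CREATED;
        }
        if (owner == mySession) {
            return Outcome.ALREADY_OWNED;  // idempotent: nothing to do
        }
        // "NodeExists" left by a previous, expired session: delete and
        // recreate instead of giving up, so the node is not stuck until
        // the server eventually reaps the stale session.
        store.remove(path);
        store.put(path, mySession);
        return Outcome.REPLACED_STALE;
    }

    public static void main(String[] args) {
        Map<String, Long> store = new HashMap<>();
        System.out.println(ensureEphemeral(store, "/live_nodes/solr1", 2L));
        store.put("/live_nodes/solr2", 1L); // stale entry from expired session 1
        System.out.println(ensureEphemeral(store, "/live_nodes/solr2", 2L));
    }
}
```

A Solr node using this pattern would retry instead of landing permanently in the GONE state after a single failed write.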
* It looks like a failure mode of "leader becomes partitioned and ephemeral znode deletions are not processed by followers" is documented in ZOOKEEPER-2348 <https://jira.apache.org/jira/browse/ZOOKEEPER-2348>. [4]
* ZooKeeper A eventually rejoins the ensemble, and the /live_nodes entries that expired after the initial partition are removed when the session expirations are reprocessed on the new leader (ZooKeeper B). [5]
* The Solr nodes whose attempts at writing to /live_nodes failed never try again and remain in the GONE state for 6+ hours.

I think there's probably some work we can do in Solr to improve how we manage the ephemeral node creation so that it's more robust.

> Data between leader and followers are not synchronized.
> -------------------------------------------------------
>
>                 Key: ZOOKEEPER-2348
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2348
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.5.1
>            Reporter: Echo Chen
>            Priority: Major
>
> When a client session expired, the leader tried to remove it from the session map and to delete its EPHEMERAL znode, for example /test_znode. This operation succeeded on the leader, but at the very same time a network fault happened, the deletion was not synced to the followers, and a new leader election was launched. After the election finished, the new leader was not the old leader. We found that the znode /test_znode still existed on the followers but not on the old leader.
>
> *Scenario:*
> 1) Create a znode, e.g.
> {{/rmstore/ZKRMStateRoot/RMAppRoot/application_1449644945944_0001/appattempt_1449644945944_0001_000001}}
> 2) Delete the znode.
> 3) Network fault between the follower and leader machines.
> 4) Leader election runs again and a follower becomes the leader.
>
> Now the data is not synced with the new leader. After this, the client is not able to create the same znode.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
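The divergence the reporter describes can be reduced to a toy model. This is illustrative only — the class and method names are invented, each server's data tree is shrunk to a set of paths, and real ZooKeeper replicates deletes via ZAB proposals rather than shared sets — but it shows why a delete applied only on a partitioned leader is lost when a follower takes over.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LostDeleteModel {
    /**
     * Returns each server's view of whether /test_znode still exists after
     * the partitioned delete: index 0 = old leader A, 1 = B, 2 = C.
     */
    public static boolean[] run() {
        Set<String> a = new HashSet<>(Arrays.asList("/test_znode"));
        Set<String> b = new HashSet<>(a);
        Set<String> c = new HashSet<>(a);

        // The session expires: leader A applies the delete to its own tree...
        a.remove("/test_znode");
        // ...but the network fault keeps the proposal from reaching B or C,
        // so no quorum ever commits it (modeled by simply not applying it).

        // B and C then elect B as the new leader; their trees never saw the
        // delete, so a client re-creating /test_znode on the new quorum
        // would get NodeExists.
        return new boolean[] {
            a.contains("/test_znode"),
            b.contains("/test_znode"),
            c.contains("/test_znode"),
        };
    }

    public static void main(String[] args) {
        boolean[] views = run();
        System.out.println("old leader A still has /test_znode: " + views[0]);
        System.out.println("new leader B still has /test_znode: " + views[1]);
    }
}
```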