Re: SolrCloud shard leader elections - Altering zookeeper sequence numbers
Daniel Collins wrote:
> Is it important where your leader is? If you just want to minimize
> leadership changes during a rolling restart, then you could restart in the
> opposite order (S3, S2, S1). That would give only 1 transition, but the end
> result would be a leader on S2 instead of S1 (not sure if that is important
> to you or not). I know it's not a fix, but it might be a workaround until
> the whole leadership-moving work is done?

I think that rolling-restarting the machines in the opposite order (S3, S2, S1) will result in S3 being the leader. It's a valid approach, but wouldn't I then have to revert to the original order (S1, S2, S3) to achieve the same result in the following rolling restart? That adds operational cost and complexity that I want to avoid.

Erick Erickson wrote:
> Just skimming, but the problem here that I ran into was with the
> listeners. Each _Solr_ instance out there is listening to one of the
> ephemeral nodes (the one in front). So deleting a node does _not_ change
> which ephemeral node the associated Solr instance is listening to. So, for
> instance, when you delete S2...n-01 and re-add it, S2 is still looking at
> S1...n-00 and will continue looking at S1...n-00 until S1...n-00 is
> deleted. Deleting S2...n-01 will wake up S3, though, which should now be
> looking at S1...n-00. Now you have two Solr listeners looking at the same
> ephemeral node. The key is that deleting S2...n-01 does _not_ wake up S2,
> just any Solr instance that has a watch on the associated ephemeral node.

Thanks for the info, Erick. I wasn't aware of this linked-list listener structure between the zk nodes. Based on what you've said, though, I've changed my implementation a bit and it seems to be working at first glance. Of course it's not reliable yet, but it looks promising. My original attempt

S1: -n_00 (no code running here)
S2: -n_04 (code deleting zknode -n_01 and creating -n_04)
S3: -n_03 (code deleting zknode -n_02 and creating -n_03)

has been changed to

S1: -n_00 (no code running here)
S2: -n_03 (code deleting zknode -n_01 and creating -n_03 using EPHEMERAL_SEQUENTIAL)
S3: -n_02 (no code running here)

Once S1 is shut down, S3 becomes leader, since according to what you've said S3 is now watching S1.

The original reason I pursued this quest to minimize leadership changes was that leader changes _could_ lead to data loss in some scenarios. I'm not entirely sure about this and you can correct me on it, but let me explain myself. If you have incoming indexing requests during a rolling restart, could there be a window during the current leader's shutdown in which the leader-to-be node does not have time to sync with the leader that is shutting down? Everyone would then sync to the new leader, thus missing some updates. I've seen an installation where the replicas had different index sizes, and the differences deteriorated over time.
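For illustration, a minimal sketch of the revised re-sequencing described above: one atomic multi() transaction that deletes a replica's election node and re-registers it as EPHEMERAL_SEQUENTIAL, so ZooKeeper assigns it the next (highest) sequence number. The class name, the method name, and the coreNodeName marker used to pick the node are assumptions for illustration, not from the original post.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ElectionResequencer {

        // Push one replica's election node to the back of the queue by deleting
        // it and re-creating it EPHEMERAL_SEQUENTIAL in a single multi() call.
        // `marker` identifies the node to move, e.g. its coreNodeName ("core_node2").
        public static void moveToEnd(ZooKeeper zk, String electionPath, String marker)
                throws KeeperException, InterruptedException {
            for (String child : zk.getChildren(electionPath, false)) {
                int i = child.lastIndexOf("-n_");
                if (i < 0 || !child.contains(marker)) {
                    continue;
                }
                // Election nodes look like <session>-<coreNodeName>-n_<seq>; keep
                // everything up to and including "-n_" as the creation prefix so
                // ZooKeeper appends a fresh, higher sequence number.
                String prefix = child.substring(0, i + 3);
                List<Op> ops = new ArrayList<Op>();
                ops.add(Op.delete(electionPath + "/" + child, -1));
                ops.add(Op.create(electionPath + "/" + prefix, new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL));
                zk.multi(ops);
                return;
            }
        }
    }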
Re: SolrCloud shard leader elections - Altering zookeeper sequence numbers
SolrCloud is intended to work in the rolling restart case...

Index size, segment counts, and segment names can (and will) be different on different replicas of the same shard without anything being amiss. Hard commits happen at different times across the replicas in a shard. Merging logic kicks in and may (in all probability eventually will) pick different segments to merge, with varying numbers of deleted docs that get purged, etc.

The numFound reported on a q=*:*&distrib=false query, or the numDocs shown in the admin screen for the core of each replica in question, should be identical, though, if:
1. you've issued a hard commit with openSearcher=true _or_ a soft commit, and
2. you haven't been indexing, or haven't issued a commit as in (1), since you started looking.

Best,
Erick

On Tue, Jan 13, 2015 at 4:20 AM, Zisis Tachtsidis <zist...@runbox.com> wrote:
> <snip>
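For reference, a sketch of the per-replica check Erick describes: run q=*:* with rows=0 and distrib=false against each replica's core directly and compare numFound. This assumes SolrJ 4.x; the replica core URLs below are hypothetical.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class ReplicaDocCount {

        // Hypothetical core URLs; point each one at a specific replica.
        private static final String[] REPLICA_URLS = {
            "http://s1:8983/solr/collection1",
            "http://s2:8983/solr/collection1",
            "http://s3:8983/solr/collection1"
        };

        public static void main(String[] args) throws SolrServerException {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);                 // we only want the count, not documents
            q.set("distrib", "false");    // stay on the core we hit; no fan-out
            for (String url : REPLICA_URLS) {
                HttpSolrServer server = new HttpSolrServer(url);
                long numFound = server.query(q).getResults().getNumFound();
                System.out.println(url + " numFound=" + numFound);
                server.shutdown();
            }
        }
    }

If the counts differ after a hard commit with openSearcher=true (and with indexing paused), the replicas really are out of sync; differing index sizes on disk alone prove nothing, for the reasons Erick gives.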
Re: SolrCloud shard leader elections - Altering zookeeper sequence numbers
Just skimming, but the problem here that I ran into was with the listeners. Each _Solr_ instance out there is listening to one of the ephemeral nodes (the one in front). So deleting a node does _not_ change which ephemeral node the associated Solr instance is listening to. So, for instance, when you delete S2...n-01 and re-add it, S2 is still looking at S1...n-00 and will continue looking at S1...n-00 until S1...n-00 is deleted. Deleting S2...n-01 will wake up S3, though, which should now be looking at S1...n-00. Now you have two Solr listeners looking at the same ephemeral node. The key is that deleting S2...n-01 does _not_ wake up S2, just any Solr instance that has a watch on the associated ephemeral node.

To understand how it all works, the code you want is in LeaderElector.checkIfIamLeader. Be aware that the sortSeqs call sorts the nodes by 1) sequence number and 2) string comparison, which has the unfortunate characteristic of a secondary sort by session ID. So two nodes with the same sequence number can sort before or after each other depending on which one gets a session ID higher or lower than the other.

This is quite tricky to get right. I once created a patch for 4.10.3 by applying things in this order (some minor tweaks required): SOLR-6115, SOLR-6512, SOLR-6577, SOLR-6513, SOLR-6517, SOLR-6670, SOLR-6691.

Good luck!
Erick

On Mon, Jan 12, 2015 at 8:54 AM, Zisis Tachtsidis <zist...@runbox.com> wrote:
> <snip>
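A small standalone sketch of the ordering Erick describes: sort first by sequence number, then by plain string comparison, so a tie on sequence number is broken by the session-ID prefix of the node name. The node names and session IDs below are made up for illustration.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class ElectionSortDemo {

        // Extract the trailing sequence number from names like
        // "<sessionId>-core_node1-n_0000000000".
        static int seq(String node) {
            return Integer.parseInt(node.substring(node.lastIndexOf("-n_") + 3));
        }

        public static void main(String[] args) {
            List<String> nodes = new ArrayList<String>();
            nodes.add("9aaa00000000-core_node3-n_0000000003"); // same sequence number...
            nodes.add("1bbb00000000-core_node2-n_0000000003"); // ...as this one
            nodes.add("5ccc00000000-core_node1-n_0000000000");

            // Sequence number first, then string comparison: the tie between
            // the two n_0000000003 nodes is decided by the session-ID prefix.
            Collections.sort(nodes, new Comparator<String>() {
                public int compare(String a, String b) {
                    int diff = seq(a) - seq(b);
                    return diff != 0 ? diff : a.compareTo(b);
                }
            });
            for (String node : nodes) {
                System.out.println(node);
            }
            // Prints core_node1 (seq 0) first, then core_node2 before
            // core_node3: "1bbb..." sorts below "9aaa..." even though both
            // carry sequence number 3.
        }
    }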
SolrCloud shard leader elections - Altering zookeeper sequence numbers
SolrCloud uses ZooKeeper sequence flags to keep track of the order in which nodes register themselves as leader candidates. The node with the lowest sequence number wins the leadership of the shard. What I'm trying to do is keep the leader re-assignments to a minimum during a rolling restart. To that end, I change the zk sequence numbers on the SolrCloud nodes while all nodes of the cluster are up and active. I'm using Solr 4.10.0, and I'm aware of SOLR-6491, which has a similar purpose, but I'm trying to do this from the outside, using the existing APIs, without editing the Solr source code.

== TYPICAL SCENARIO ==
Suppose we have 3 Solr instances S1, S2, S3. They are started in that order, and the zk sequences assigned are as follows:

S1: -n_00 (LEADER)
S2: -n_01
S3: -n_02

In a rolling restart we'll get S2 as leader (after S1's shutdown), then S3 (after S2's shutdown), and finally S1 (after S3's shutdown): 3 changes in total.

== MY ATTEMPT ==
Using SolrZkClient and the ZooKeeper multi API, I found a way to get rid of the old zknodes that participate in a shard's leader election and write new ones to which we can assign a sequence number of our liking:

S1: -n_00 (no code running here)
S2: -n_04 (code deleting zknode -n_01 and creating -n_04)
S3: -n_03 (code deleting zknode -n_02 and creating -n_03)

In a rolling restart I'd expect S3 to become leader (after S1's shutdown), no change (after S2's shutdown), and finally S1 (after S3's shutdown), that is, 2 changes in total. This count stays constant no matter how many servers are added to SolrCloud, while in the first scenario the number of re-assignments equals the number of Solr servers.

The problem occurs when S1 (the LEADER) is shut down. The election that takes place still makes S2 the leader; it's as if the new sequence numbers are ignored. When I go to /solr/#/~cloud?view=tree, the new sequence numbers are listed under /collections, and based on them S3 should have become the leader. Do you have any idea why the new state is not acknowledged during the election? Is something cached? Or, to put it bluntly, do I have any chance down this path? If not, what are my options? Is it possible to apply all patches under SOLR-6491 in isolation and continue from there?

Thank you. Extra info which might help follows.

1. Some logging related to the leader election after S1 has been shut down:

S2 - org.apache.solr.cloud.SyncStrategy - Leader's attempt to sync with shard failed, moving to the next candidate
S2 - org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
S3 - org.apache.solr.cloud.LeaderElector - Our node is no longer in line to be leader

2. And some sample code showing how I perform the ZK re-sequencing:

// Read the current zk election nodes for a specific collection
solrServer.getZkStateReader().getZkClient().getSolrZooKeeper()
    .getChildren("/collections/core/leader_elect/shard1/election", true);
// Node deletion
Op.delete(path, -1);
// Node creation
Op.create(createPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
    CreateMode.EPHEMERAL_SEQUENTIAL);
// Perform the operations in one transaction
solrServer.getZkStateReader().getZkClient().getSolrZooKeeper().multi(opsList);
// Refresh the local view of the cluster state
solrServer.getZkStateReader().updateClusterState(true);
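To make the fragment above self-contained, here is a minimal end-to-end sketch of the same re-sequencing flow: obtain the raw ZooKeeper handle through CloudSolrServer, then delete and re-create one election node in a single multi() call. The zkHost string and the core_node2 marker are assumptions for illustration. Note that with EPHEMERAL_SEQUENTIAL, ZooKeeper (not the caller) assigns the new sequence number, so the final ordering is determined by creation order.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ResequenceShard {

        public static void main(String[] args) throws Exception {
            // Assumed ZooKeeper ensemble address.
            CloudSolrServer solrServer = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            solrServer.connect();
            ZooKeeper zk = solrServer.getZkStateReader().getZkClient().getSolrZooKeeper();

            String electionPath = "/collections/core/leader_elect/shard1/election";
            List<Op> ops = new ArrayList<Op>();
            for (String child : zk.getChildren(electionPath, false)) {
                // Assumption: demote the replica whose election node carries core_node2.
                int i = child.lastIndexOf("-n_");
                if (i < 0 || !child.contains("core_node2")) {
                    continue;
                }
                ops.add(Op.delete(electionPath + "/" + child, -1));
                // Re-register under the same prefix; ZooKeeper appends the next
                // (highest) sequence number, pushing this node to the back.
                ops.add(Op.create(electionPath + "/" + child.substring(0, i + 3),
                        new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                        CreateMode.EPHEMERAL_SEQUENTIAL));
            }
            if (!ops.isEmpty()) {
                zk.multi(ops);
            }
            // Refresh the local view of the cluster state.
            solrServer.getZkStateReader().updateClusterState(true);
            solrServer.shutdown();
        }
    }

One caveat worth noting as a design point: an ephemeral node created this way belongs to this client's ZooKeeper session, not to the Solr node's own session, so it will vanish when this program disconnects.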