Timothy Potter created SOLR-5796:
------------------------------------

             Summary: With many collections, leader re-election takes too long 
when a node dies or is rebooted, leading to some shards getting into a 
"conflicting" state about who is the leader.
                 Key: SOLR-5796
                 URL: https://issues.apache.org/jira/browse/SOLR-5796
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
         Environment: Found on branch_4x
            Reporter: Timothy Potter


I'm doing some testing with a 4-node SolrCloud cluster against the latest rev 
in branch_4x, with many collections: 150 to be exact, each with 4 shards and 
rf=3, which works out to 450 cores per node. The nodes are decently resourced: 
-Xmx6g with 4 CPUs (m3.xlarge instances in EC2).

The problem occurs when rebooting one of the nodes, say as part of a rolling 
restart of the cluster. If I kill one node and then wait for an extended period 
of time, such as 3 minutes, all of the leaders on the downed node (roughly 150) 
have time to fail over to other nodes in the cluster. When I then restart the 
downed node, since the leaders have all failed over successfully, the node 
comes back up and all of its cores assume the replica role in their respective 
shards. This is expected and good.

However, if I don't wait long enough for the leader failover process to 
complete on the other nodes before restarting the downed node, 
then some bad things happen. Specifically, when the dust settles, many of the 
previous leaders on the node I restarted get stuck in the "conflicting" state 
seen in the ZkController, starting around line 852 in branch_4x:

{quote}
852       while (!leaderUrl.equals(clusterStateLeaderUrl)) {
853         if (tries == 60) {
854           throw new SolrException(ErrorCode.SERVER_ERROR,
855               "There is conflicting information about the leader of shard: "
856                   + cloudDesc.getShardId() + " our state says:"
857                   + clusterStateLeaderUrl + " but zookeeper says:" + leaderUrl);
858         }
859         Thread.sleep(1000);
860         tries++;
861         clusterStateLeaderUrl = zkStateReader.getLeaderUrl(collection, shardId,
862             timeoutms);
863         leaderUrl = getLeaderProps(collection, cloudDesc.getShardId(), timeoutms)
864             .getCoreUrl();
865       }
{quote}

As you can see, the code tries to give the problem some time to work itself 
out: 60 one-second retries, i.e. 1 minute. Unfortunately, that doesn't seem to 
be long enough for a busy cluster with many collections. One might argue that 
450 cores per node is asking too much of Solr, but I think this points to a 
bigger issue: a node coming back up has no awareness that it went down and that 
leader election for its shards is still running, slowly, on the other nodes. 
Moreover, once this problem occurs, there's no clear way to fix it other than 
shutting the node down again and waiting for leader failover to complete.

It's also interesting to me that /clusterstate.json was updated by the healthy 
node taking over the leader role, but the /collections/<coll>/leaders/shard# 
znode was not. I added some debugging, and it looks like the Overseer queue is 
extremely backed up with work.
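
For reference, the kind of quick check I mean is just counting what's pending 
under /overseer/queue. A rough sketch with the plain ZooKeeper client (the 
connect string is only an example from my setup):

{quote}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class OverseerQueueCheck {
  public static void main(String[] args) throws Exception {
    // Connect string is just an example; point it at your ZK ensemble.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new Watcher() {
      public void process(WatchedEvent event) {}
    });
    try {
      // Each child znode under /overseer/queue is a pending Overseer task.
      int pending = zk.getChildren("/overseer/queue", false).size();
      System.out.println("Pending Overseer queue items: " + pending);
    } finally {
      zk.close();
    }
  }
}
{quote}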

Maybe the solution here is simply to wait longer, but I'd also like feedback 
from the community on other options. I know there are plans to help scale the 
Overseer (e.g. SOLR-5476), so maybe that will help; in the meantime I'm adding 
more debug output to confirm whether this really is due to Overseer backlog 
(which I suspect it is).
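
If we do go the "wait longer" route, making the window configurable rather 
than hard-coding 60 one-second tries seems like the minimal change. Below is a 
standalone sketch of that shape; the class, the LeaderLookup hook, and the 
"leaderConflictResolveWait" property name are all made up here for 
illustration, not existing Solr API:

{quote}
public class ConflictWait {

  /** Hypothetical hook for fetching a leader URL from some source (ZK or cluster state). */
  public interface LeaderLookup {
    String getLeaderUrl() throws Exception;
  }

  /** Waits until both views agree on the leader, with a configurable deadline. */
  public static void waitUntilLeadersAgree(LeaderLookup zkView, LeaderLookup clusterStateView)
      throws Exception {
    // Read the window from a system property instead of hard-coding 60 x 1s.
    long waitMs = Long.getLong("leaderConflictResolveWait", 60000L);
    long deadline = System.currentTimeMillis() + waitMs;
    String zkLeader = zkView.getLeaderUrl();
    String stateLeader = clusterStateView.getLeaderUrl();
    while (!zkLeader.equals(stateLeader)) {
      if (System.currentTimeMillis() > deadline) {
        throw new IllegalStateException("Conflicting leader info: zookeeper says "
            + zkLeader + " but cluster state says " + stateLeader);
      }
      Thread.sleep(1000);
      zkLeader = zkView.getLeaderUrl();
      stateLeader = clusterStateView.getLeaderUrl();
    }
  }
}
{quote}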

In general, I'm a little confused by the fact that leader state is kept in 
multiple places in ZK. Is there any background information on why we keep 
leader state in /clusterstate.json as well as in the leader path znode?
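
For anyone who wants to compare the two records side by side, here's a rough 
sketch that dumps both; the collection name, shard id, and connect string are 
just examples from my setup:

{quote}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class LeaderStateDump {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new Watcher() {
      public void process(WatchedEvent event) {}
    });
    try {
      // 1) The per-shard leader znode written during leader election.
      byte[] leaderZnode = zk.getData("/collections/collection1/leaders/shard1", false, null);
      // 2) The leader recorded in the cluster state the Overseer publishes.
      byte[] clusterState = zk.getData("/clusterstate.json", false, null);
      System.out.println("leader znode:      " + new String(leaderZnode, "UTF-8"));
      System.out.println("clusterstate.json: " + new String(clusterState, "UTF-8"));
    } finally {
      zk.close();
    }
  }
}
{quote}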

Also, here are some interesting side observations:

a. If I use rf=2, this problem doesn't occur, presumably because leader 
failover happens more quickly and there's less Overseer work. This may be a 
red herring, but I can consistently reproduce the issue with rf=3 and not with 
rf=2; I suppose that's because there are only 300 cores per node instead of 
450, and that's just enough less work for the issue to resolve itself in time.

b. To support that many cores, I had to set -Xss256k to reduce the thread 
stack size, since Solr uses a lot of threads during startup (the high point 
was around 800). With the typical 64-bit default of roughly 1m per thread 
stack, ~800 threads works out to on the order of 800m just for stacks, versus 
~200m at 256k. Might be something we should recommend on the mailing list / 
wiki somewhere.


