[ https://issues.apache.org/jira/browse/SOLR-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217619#comment-17217619 ]
Shalin Shekhar Mangar commented on SOLR-14942:
----------------------------------------------

bq. every method should have javadocs, especially public ones

I added javadocs for ZkController.tryCancellAllElections.

bq. ZkController.electionContexts is a synchronized map – but the new code that streams over its values() doesn't synchronize on it, which smells like a bug?

Yes, thank you! ZkController.close was doing the same thing, so I have fixed that as well (a minimal sketch of the required iteration pattern follows below).

bq. if canceling & closing elections is a "slow" enough operation that it makes sense to parallelize them, then does it make sense to also check zkClient.isClosed() inside the loop, in case the client gets closed out from under us? (it's a cheap call, so I don't see any advantage to only checking once)

It is not a slow operation. ZkController.close was also using a parallelStream on electionContexts, so this was basically copied code, but it doesn't make sense. As I noted in a comment on the PR, cancelling an election just deletes a znode and sets a volatile member, so I have replaced the parallelStream with a serial forEach.

bq. are there other "inflight" cases where it's risky to shutdown in the middle of? replication? peersync? core admin?

When SolrCoreState.pauseUpdatesAndAwaitInflightRequests() is executed, Jetty has already received a SIGTERM, so it will not accept any new connections/requests. As for ongoing requests:
# All ongoing recoveries/replications (for cores on the current node) have already stopped, because ZkController.preClose is called before the pauseUpdatesAndAwaitInflightRequests method.
# The election node has not been removed yet, so peersyncs for leader election haven't started (tryCancellAllElections happens after the pauseUpdatesAndAwaitInflightRequests method).
# If another replica is recovering from this leader and a peersync is in-flight, even if we let it complete, subsequent replication requests will fail.
# As for core admin requests:
## create, unload and reload are not useful (the node is shutting down)
## split shard will eventually fail because it is a multi-step process
## requestrecovery and requestsync are not useful either; after the node comes back online, all cores will recover again
## backup and restore operations -- I don't think these should block a shutdown operation

bq. instead of hardcoding this instanceof check in HttpSolrCall would it make more sense to add a new 'default' method to SolrRequestHandler that UpdateRequestHandler (and potentially other handlers) could override to let them inspect the request and return true/false if it's "important" enough that it must be allowed to block shutdown until complete?
bq. this would also make it easier to bring back the "only block updates if I'm the leader" type logic (living inside the UpdateRequestHandler impl of the new method) at a later date if viable – w/o more API changes

I thought about that optimization (only block updates if I'm the leader), but it leads to too many race conditions when leadership is gained and lost. The problem is that we must ensure that all registered parties to the phaser eventually arrive; if we lose track of one, it can lead to IllegalStateExceptions from the phaser down the line (and even that detection is best effort). That is why I think it is safer to do this inside HttpSolrCall instead of giving this choice to plugin writers. A rough sketch of the register/arrive discipline follows below.
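For the synchronized-map point above, this is a minimal, self-contained sketch (the field and context types are stand-ins, not the actual ZkController code) of why iterating over a Collections.synchronizedMap needs an explicit synchronized block, and why a plain serial forEach is enough for the cancel work:

{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ElectionCancelSketch {

    /** Stand-in for an election context; the real Solr types are not reproduced here. */
    static class Context {
        void cancel() {
            // In the real code this deletes the election znode and sets a volatile member,
            // i.e. it is cheap, which is why a parallelStream buys nothing.
        }
    }

    // Collections.synchronizedMap only guards individual calls (get/put/remove).
    // Per its javadocs, callers must manually synchronize on the map while iterating
    // over any of its collection views, including values().
    private final Map<String, Context> electionContexts =
            Collections.synchronizedMap(new HashMap<>());

    void cancelAllElections() {
        synchronized (electionContexts) {
            electionContexts.values().forEach(Context::cancel);
        }
    }
}
{code}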
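On the phaser point, here is a rough, hypothetical sketch (class and method names are illustrative, not the actual SolrCoreState API) of the register/arrive discipline that pausing updates relies on:

{code:java}
import java.util.concurrent.Phaser;

public class InflightUpdateGate {

    // One initial party represents the "gate" itself, so there is a party
    // to arrive with when shutdown begins.
    private final Phaser inflight = new Phaser(1);
    private volatile boolean pausedForShutdown = false;

    /** Called (conceptually by HttpSolrCall) before dispatching an update request. */
    public boolean tryStart() {
        inflight.register();                // every register() must be paired with an arrival
        if (pausedForShutdown) {
            inflight.arriveAndDeregister(); // back out: shutdown already in progress
            return false;
        }
        return true;
    }

    /** Called in a finally block when the update request completes. */
    public void finish() {
        inflight.arriveAndDeregister();
    }

    /** Called once during shutdown, before election nodes are removed. */
    public void pauseUpdatesAndAwaitInflight() {
        pausedForShutdown = true;
        int phase = inflight.arriveAndDeregister(); // the gate party arrives
        inflight.awaitAdvance(phase);               // wait for all in-flight requests to arrive
    }
}
{code}

If a handler registered a party and then lost track of it (for example because leadership changed mid-request and a different code path ran), the phase would never advance, or a mismatched arriveAndDeregister would throw IllegalStateException; keeping the register/arrive pair in one place in HttpSolrCall avoids that.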
> Reduce leader election time on node shutdown
> --------------------------------------------
>
>                 Key: SOLR-14942
>                 URL: https://issues.apache.org/jira/browse/SOLR-14942
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrCloud
>    Affects Versions: 7.7.3, 8.6.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>            Priority: Major
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> The credit for this issue and investigation belongs to [~caomanhdat]. I am merely reporting the issue and creating PRs based on his work.
>
> The shutdown process waits for all replicas/cores to be closed before removing the election node of the leader. This can take some time due to index flush or merge activities on the leader cores and delays new leaders from being elected.
>
> This process happens in CoreContainer.shutdown():
> # zkController.preClose(): remove the current node from live_nodes and change the states of all cores on this node to DOWN. Assuming the current node hosts the leader of a shard, the shard becomes leaderless after calling this method, since the state of the leader is now DOWN. The leader election process is not triggered for the shard because the election node is still held by the current node.
> # Wait for all cores to finish loading (if there are any).
> # SolrCores.close(): close all cores.
> # zkController.close(): this is where all ephemeral nodes are removed from ZK, which includes the election nodes created by this node. Only from this point on can other replicas in the shard take part in the leader election.
>
> Note that CoreContainer.shutdown() is invoked when Jetty/Solr nodes receive the SIGTERM signal. On receiving SIGTERM, Jetty also stops accepting new connections and new requests. This is a very important factor: even if the leader replica is ACTIVE and its node is in live_nodes, the shard is effectively leaderless if no one can index to it. Therefore shards become leaderless as soon as the node hosting the shard's leader receives SIGTERM.
>
> So the longer steps 1, 2 and 3 take to finish, the longer shards remain leaderless. The time needed for step 3 scales with the number of cores, so the more cores a node has, the worse it gets. This time is spent in IndexWriter.close(), where the system will
> # Flush all pending updates to disk
> # Wait for all merges to finish (this is most likely the meaty part)
>
> The shutdown process is proposed to be changed to:
> # Wait for all in-flight indexing requests and replication requests to complete
> # Remove election nodes
> # Close all replicas/cores
>
> This ensures that index flushes or merges no longer block new leader elections.
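To make the proposed ordering concrete, here is a minimal, self-contained sketch of the shutdown sequence described above. All types are stand-ins rather than the real CoreContainer/ZkController/SolrCores classes; only the order of the steps is the point:

{code:java}
// Stand-in interfaces; the real Solr classes are not reproduced here.
interface ZkControllerLike {
    void preClose();               // leave live_nodes, mark local cores DOWN, stop recoveries
    void tryCancellAllElections(); // remove election znodes so other replicas can elect a leader
    void close();                  // close the ZK session; remaining ephemeral nodes go away
}

interface CoresLike {
    void pauseUpdatesAndAwaitInflightRequests(); // reject new updates, await in-flight ones
    void close();                                // close cores; IndexWriter.close() may flush and wait for merges
}

class ShutdownOrderSketch {
    private final ZkControllerLike zkController;
    private final CoresLike cores;

    ShutdownOrderSketch(ZkControllerLike zkController, CoresLike cores) {
        this.zkController = zkController;
        this.cores = cores;
    }

    void shutdown() {
        zkController.preClose();                      // step 1 (unchanged): node is no longer live
        cores.pauseUpdatesAndAwaitInflightRequests(); // new: wait for in-flight indexing/replication
        zkController.tryCancellAllElections();        // new: give up leadership before the slow part
        cores.close();                                // slow: index flush and merges happen here
        zkController.close();                         // finally drop the ZK session
    }
}
{code}

With this ordering the election znodes are gone before cores start flushing and merging, so other replicas can begin leader election immediately instead of waiting for all cores to finish closing.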