[jira] [Commented] (SOLR-12672) Implement Synchronized Disruption into Solr

Erick Erickson (JIRA) Mon, 20 Aug 2018 09:05:12 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-12672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16586145#comment-16586145
 ]


Erick Erickson commented on SOLR-12672:
---------------------------------------

The NRT indexing case is tricky. Temporarily taking a replica out of rotation 
for _queries_ might work though.

Consider the indexing cycle (NRT).
 * leader gets an update
 * leader forwards request to replica
 * if leader does _not_ get a response back from the replica, it may put the 
replica into Leader Initiated Recovery (LIR) after retries.

Under all circumstances, the update operation does not return to the client 
until all active replicas have replied. And if a replica somehow doesn't reply, 
it goes into recovery.
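
To make the risk concrete, here is a minimal sketch of that cycle. It is an in-memory model only, not Solr's update code: `NrtUpdateSketch`, `Replica`, `distribute` and `maxAttempts` are all hypothetical names, and the retry policy is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of the NRT update path; none of this is Solr's API.
class NrtUpdateSketch {

    /** A replica is modelled as a name plus a function that says whether it acked. */
    static class Replica {
        final String name;
        final Predicate<String> ack; // returns true if the replica replied to the update

        Replica(String name, Predicate<String> ack) {
            this.name = name;
            this.ack = ack;
        }
    }

    /**
     * The leader forwards one update to every active replica and only then
     * returns. Replicas that never reply within maxAttempts are reported as
     * candidates for Leader Initiated Recovery.
     */
    static List<String> distribute(String doc, List<Replica> activeReplicas, int maxAttempts) {
        List<String> toRecover = new ArrayList<>();
        for (Replica r : activeReplicas) {
            boolean acked = false;
            for (int attempt = 0; attempt < maxAttempts && !acked; attempt++) {
                acked = r.ack.test(doc);
            }
            if (!acked) {
                toRecover.add(r.name); // a replica that went quiet ends up in recovery
            }
        }
        return toRecover;
    }
}
```

In this model a replica that stops answering updates while it is "busy" is indistinguishable from a dead one, which is why the busy state can only apply to queries.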

So at least in the NRT case, whatever mechanism is in place must still accept 
updates or it would be A Bad Thing.

PULL replicas don't have the same problem, but there you wouldn't want them to 
start answering queries until after the next sync after the "I'm busy" state.

TLOG replicas. hmmmmm. Not quite sure what happens here if they don't respond 
to an update.

Next practicality: How to deal with multiple replicas all on the same Solr 
instance? Apart from having one replica per JVM you can't take a single 
_replica_ out of rotation without affecting all the other replicas hosted by 
the JVM. Practically, this may not be a problem since you want as few replicas 
for the _same_ shard hosted on a particular JVM as possible, but you'd have to 
build in some safeguards.

One way this could work (and this seems to fall "naturally" to the Overseer) is:

for (all my live_nodes) {
 * send live_node_x the "do your GC bit"
 * live_node_x broadcasts "all my replicas are going to be busy" [1]
 * live_node_x does a GC
 * live_node_x broadcasts "I'm not busy any more" [2]
}
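
The loop above can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not a proposal for the actual implementation: `RollingGcSketch`, `Node`, `broadcastBusy` and `runCycle` are hypothetical names, the "broadcast" is a plain method call rather than a ZooKeeper or HTTP message, and the GC itself is passed in as a `Runnable` stand-in.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model; none of these types exist in Solr.
class RollingGcSketch {

    /** One Solr instance with its own local view of which nodes are busy. */
    static class Node {
        final String name;
        final Set<String> busyNodes = new HashSet<>(); // local copy only

        Node(String name) {
            this.name = name;
        }

        /** Would this node route a query to a replica hosted on the target node? */
        boolean wouldQuery(String target) {
            return !busyNodes.contains(target);
        }
    }

    final Map<String, Node> liveNodes = new LinkedHashMap<>();
    final List<String> log = new ArrayList<>();

    void addNode(String name) {
        liveNodes.put(name, new Node(name));
    }

    /** Every live node updates its local busy set; updates are not affected. */
    private void broadcastBusy(String sender, boolean busy) {
        for (Node n : liveNodes.values()) {
            if (busy) {
                n.busyNodes.add(sender);
            } else {
                n.busyNodes.remove(sender);
            }
        }
        log.add(sender + (busy ? " busy" : " not busy"));
    }

    /** The coordinator's loop: one node at a time, never two at once. */
    void runCycle(Runnable gc) {
        for (String name : new ArrayList<>(liveNodes.keySet())) {
            broadcastBusy(name, true);  // [1] peers stop sending queries here
            gc.run();                   // stand-in for the real GC request
            log.add(name + " gc");
            broadcastBusy(name, false); // [2] peers resume sending queries
        }
    }
}
```

Because the loop is sequential, at most one node is out of query rotation at any moment; whether that is acceptable for a large cluster is a separate question.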

[1] this would cause each Solr instance receiving the message to mark its 
_local_ copy of the node states as "don't use replicas on this node for 
queries". How that would interact with a watch being triggered for an 
unrelated state change is interesting.

[2] there would have to be some reasonable bailout built in in case the "I'm 
not busy now" message didn't get to all live_nodes. Perhaps the Overseer itself 
also gets the "I'm not busy" message, it's a Solr instance after all.
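
One possible shape for that bailout is a busy mark that expires on its own. The sketch below assumes each instance keeps a deadline per busy node; `BusyMarks`, `ttlMillis` and the injected clock are hypothetical, and a time-based expiry is only one option among several.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical local bookkeeping; not part of Solr.
class BusyMarks {
    private final Map<String, Long> expiresAt = new HashMap<>();
    private final LongSupplier clockMillis; // injected so the expiry can be tested
    private final long ttlMillis;           // assumed upper bound on one GC pause

    BusyMarks(LongSupplier clockMillis, long ttlMillis) {
        this.clockMillis = clockMillis;
        this.ttlMillis = ttlMillis;
    }

    /** Called when the "all my replicas are going to be busy" message arrives. */
    void markBusy(String node) {
        expiresAt.put(node, clockMillis.getAsLong() + ttlMillis);
    }

    /** Called when the "I'm not busy any more" message arrives. */
    void markNotBusy(String node) {
        expiresAt.remove(node);
    }

    /** A node counts as busy only until its deadline, even if [2] was lost. */
    boolean isBusy(String node) {
        Long deadline = expiresAt.get(node);
        if (deadline == null) {
            return false;
        }
        if (clockMillis.getAsLong() >= deadline) {
            expiresAt.remove(node); // bail out: treat the node as available again
            return false;
        }
        return true;
    }
}
```

With something like this, a lost "not busy" message costs at most `ttlMillis` of reduced query capacity on that node instead of leaving it out of rotation indefinitely.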

 

NOTE: there is a JIRA out there somewhere about separating the query and update 
threads into separate pools; this all might be much easier after that is done.

> Implement Synchronized Disruption into Solr
> -------------------------------------------
>
>                 Key: SOLR-12672
>                 URL: https://issues.apache.org/jira/browse/SOLR-12672
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Trey Cahill
>            Priority: Trivial
>         Attachments: Synchronized Disruption in Solr.pdf
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> On large Solr clusters, at any given time, there is probably an instance 
> running garbage collection.  By implementing a synchronized disruption across 
> the entire cluster, the response times of a large cluster should decrease as 
> it helps prevent random instances from running GC while the rest of the 
> cluster is responding to a request.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

