[
https://issues.apache.org/jira/browse/SOLR-14897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Munendra S N reassigned SOLR-14897:
-----------------------------------
Assignee: Munendra S N
> HttpSolrCall will forward a virtually unlimited number of times until
> ClusterState ZkWatcher is updated after collection delete
> -------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-14897
> URL: https://issues.apache.org/jira/browse/SOLR-14897
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Chris M. Hostetter
> Assignee: Munendra S N
> Priority: Blocker
> Fix For: 8.6.3
>
> Attachments: SOLR-14897.patch
>
>
> While investigating the root cause of some SOLR-14896 related failures, I
> have seen evidence that if a collection is deleted, but a client makes a
> subsequent request for that collection _before_ the local ClusterState has
> been updated to remove that DocCollection, HttpSolrCall will forward/proxy
> that request a (virtually) unbounded number of times in a very short time
> period - stopping only once the "cached" local DocCollection is updated
> to indicate there are no active replicas.
> While HttpSolrCall does track & increment a {{_forwardedCount}} param on
> every request it forwards, it doesn't consult that count unless/until it
> finds a situation where the (local) DocCollection says there are no active
> replicas.
> So if you have a collection XX with 4 total replicas on 4 different nodes
> (A,B,C,D), and you delete XX (triggering sequential core deletions on
> A,B,C,D that fire successive ZkWatchers on various nodes to update the
> collection state), a request for XX can bounce back and forth between nodes C
> & D 20+ times until the ClusterState watcher fires on both of those nodes so
> they finally realize that the {{_forwardedCount=20}} is more than the 0 active
> replicas...
> In the below code snippet from HttpSolrCall, the first call to
> {{getCoreUrl(...)}} is expected to return null if there are no active
> replicas - but it uses the local cached DocCollection, which may _think_
> there is an active replica on another node, so it forwards the request to
> that node - where the replica may have been deleted, so that node runs the
> same code and may forward the request right back to the original node....
> {code:java}
> String coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>     activeSlices, byCoreName, true);
> // Avoid getting into a recursive loop of requests being forwarded by
> // stopping forwarding and erroring out after (totalReplicas) forwards
> if (coreUrl == null) {
>   if (queryParams.getInt(INTERNAL_REQUEST_COUNT, 0) > totalReplicas) {
>     throw new SolrException(SolrException.ErrorCode.INVALID_STATE,
>         "No active replicas found for collection: " + collectionName);
>   }
>   coreUrl = getCoreUrl(collectionName, origCorename, clusterState,
>       activeSlices, byCoreName, false);
> }
> {code}
> ...the check that is supposed to prevent a "recursive loop" is only consulted
> once a situation arises where the local ClusterState indicates there are no
> active replicas - which seems to defeat the point of the forward check? At
> that point, if the total number of replicas hasn't been exceeded, the code is
> happy to forward the request to a coreUrl which the local ClusterState
> indicates is _not_ active (which also seems to defeat the point?)
>
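For illustration only, here is a minimal sketch of one possible reordering: consult the forward count before trusting the (possibly stale) local replica state, so the loop is bounded even while the cached DocCollection still lists an active replica. This is not taken from the attached SOLR-14897.patch and may differ from the actual fix. The class ForwardGuard, the method chooseForwardTarget, and all of its parameters are hypothetical names that do not exist in Solr.

```java
/** Hypothetical helper, not part of Solr: applies the forward limit up front. */
final class ForwardGuard {

    private ForwardGuard() {}

    /**
     * Picks a URL to forward to, or fails once the request has been
     * forwarded more times than the collection has replicas.
     *
     * @param forwardedCount how many times this request was already forwarded
     * @param totalReplicas  replica count according to the local cluster state
     * @param activeCoreUrl  URL of a replica the local state believes is active, or null
     * @param anyCoreUrl     URL of any replica regardless of state, or null
     * @return the chosen URL; null only if both candidate URLs are null
     */
    static String chooseForwardTarget(String collectionName,
                                      int forwardedCount,
                                      int totalReplicas,
                                      String activeCoreUrl,
                                      String anyCoreUrl) {
        // Assumption: checking the limit first bounds the ping-pong even
        // when the local state wrongly reports an active replica.
        if (forwardedCount > totalReplicas) {
            throw new IllegalStateException(
                "Request forwarded too many times for collection: " + collectionName);
        }
        // Prefer a replica believed to be active, else fall back to any replica.
        return activeCoreUrl != null ? activeCoreUrl : anyCoreUrl;
    }
}
```

With this ordering, a request bouncing between two nodes with stale state would fail after (totalReplicas + 1) forwards instead of waiting for both watchers to fire.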
--
This message was sent by Atlassian Jira
(v8.3.4#803005)