[jira] [Commented] (SOLR-11484) CloudSolrClient's cache of collection clusterstate can cause RouteExceptions when attempting directUpdates after collection modifications

Christine Poerschke (JIRA) Fri, 03 Nov 2017 16:13:00 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-11484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238555#comment-16238555
 ]


Christine Poerschke commented on SOLR-11484:
--------------------------------------------

Hi [~varunthacker] - thanks for including me here.

bq. ... I guess the word "Only" in the flag would mean that the update should 
fail if there are no leaders? ...

Correct, that was the intention. For convenience 
[copy/pasting|https://issues.apache.org/jira/browse/SOLR-9512?focusedCommentId=15506019&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15506019]
 a SOLR-9512 comment here:

bq. The SOLR-9090 {{directUpdatesToLeadersOnly}} motivation/intention was for 
the flag to be not a hint but a directive and for updates to 'fail fast' if 
there is (temporarily or otherwise) no shard leader. Fail fast (and let the 
caller of the {{CloudSolrClient}} handle alarming and retries as it sees fit) 
as opposed to sending or retry-sending to a non-leader which would then forward 
to the leader (and potentially still fail eventually, 
eventually/not-fast-slowly).

As far as the

bq. ... In which case our tests should not set this flag and use the default 
behaviour ...

alternative is concerned, hmm, I'm not sure. Wouldn't that reduce test coverage 
in general? Though yes, perhaps very specific tests could opt out of 
randomising the value of the flag.

ticket cross-reference: SOLR-11507 concerns flag randomisation in the test 
CloudSolrClient - [~dsmiley] and [~gerlowskija] any thoughts on this?

(I should also mention that the {{directUpdatesToLeadersOnly}} flag's addition 
predates the new replica types and I haven't yet considered if/how that might 
change the meaning of the flag.)
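
To make the "directive, not a hint" reading of {{directUpdatesToLeadersOnly}} above concrete, here is a minimal sketch in plain Java. It does not use SolrJ: {{Replica}}, {{NoLeaderException}} and {{pickUpdateTarget}} are hypothetical names invented for illustration, and the fallback to the first listed replica is an assumption about what "sending to a non-leader" might look like, not the actual CloudSolrClient routing code.

```java
import java.util.Arrays;
import java.util.List;

final class LeaderOnlyRoutingSketch {

    /** Hypothetical stand-in for one replica entry of a shard. */
    static final class Replica {
        final String url;
        final boolean leader;
        Replica(String url, boolean leader) { this.url = url; this.leader = leader; }
    }

    /** Hypothetical exception type; not the exception SolrJ throws. */
    static final class NoLeaderException extends RuntimeException {
        NoLeaderException(String message) { super(message); }
    }

    /**
     * Picks the URL an update would be sent to.
     * With leadersOnly=true a missing leader is an immediate error (fail fast);
     * with leadersOnly=false the update falls back to a non-leader replica,
     * which would then have to forward it to the leader.
     */
    static String pickUpdateTarget(List<Replica> shardReplicas, boolean leadersOnly) {
        for (Replica r : shardReplicas) {
            if (r.leader) {
                return r.url;
            }
        }
        if (leadersOnly || shardReplicas.isEmpty()) {
            throw new NoLeaderException("no usable update target for shard");
        }
        return shardReplicas.get(0).url; // assumed fallback: first listed replica
    }

    public static void main(String[] args) {
        List<Replica> noLeader = Arrays.asList(
            new Replica("http://host1/core_a", false),
            new Replica("http://host2/core_b", false));

        // Flag off: the update is routed to a non-leader.
        System.out.println(pickUpdateTarget(noLeader, false));

        // Flag on: the caller sees the failure at once and decides on retries.
        try {
            pickUpdateTarget(noLeader, true);
        } catch (NoLeaderException e) {
            System.out.println("failed fast: " + e.getMessage());
        }
    }
}
```

In this sketch the caller, not the client, owns alarming and retry policy when the flag is set, which is the behaviour the SOLR-9512 comment describes.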

> CloudSolrClient's cache of collection clusterstate can cause RouteExceptions 
> when attempting directUpdates after collection modifications
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11484
>                 URL: https://issues.apache.org/jira/browse/SOLR-11484
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Noble Paul
>            Priority: Major
>             Fix For: 7.2, master (8.0)
>
>         Attachments: SOLR-11484.patch, SOLR-11484.patch, 
> jenkins.thetaphi.20662.txt
>
>
> This was discovered while auditing jenkins failures from 
> {{TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete}} 
> (where a test explicitly deletes and then recreates a collection with the 
> same name), but as noted in a comment below, SOLR-11392 is another example of 
> non-obvious test failures that can pop up because of this bug.
> In practice, it can affect any CloudSolrClient user after changes have been 
> made to a collection (to add/move replicas, etc...)
> ----
> Original jira notes...
> {{TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete}}
> seems to fail with non-trivial frequency, so I grabbed the logs from a recent 
> failure and started trying to follow along with the actions to figure out 
> what exactly is happening....
> https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20662/
> {noformat}
>    [junit4] ERROR   20.3s J1 | 
> TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete <<<
>    [junit4]    > Throwable #1: 
> org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from 
> server at https://127.0.0.1:42959/solr/testcollection_shard1_replica_n3: 
> Expected mime type application/octet-stream but got text/html. <html>
>    [junit4]    > <head>
>    [junit4]    > <meta http-equiv="Content-Type" 
> content="text/html;charset=ISO-8859-1"/>
>    [junit4]    > <title>Error 404 </title>
> {noformat}
> The crux of this failure appears to be a genuine bug in how CloudSolrClient 
> uses its cached ClusterState info when doing (direct) updates.  The key bits 
> seem to be:
> * CloudSolrClient does _something_ (update,query,etc...) with a collection 
> causing the current cluster state for the collection to be cached
> * The actual collection changes such that a Solr node/core no longer exists 
> as part of the collection
> * CloudSolrClient is asked to process an UpdateRequest which triggers the 
> code paths for the {{directUpdate()}} method -- which attempts to route the 
> updates directly to a replica of the appropriate shard using the (cached) 
> collection state info
> * CloudSolrClient (may) attempt to send that UpdateRequest to a node/core 
> that doesn't exist, getting a 404 -- which does not (seem to) trigger a state 
> refresh, or retry to find a correct URL to resend the update to.
> Details to follow in comment....
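
The four steps in the quoted description can be reproduced with a toy model. The sketch below is plain Java with no SolrJ dependency; {{ToyCluster}}, {{ToyClient}}, the core URLs and the refresh-and-retry-once method are all hypothetical and only illustrate the failure mode and one conceivable mitigation. It is not the attached SOLR-11484 patch.

```java
import java.util.HashMap;
import java.util.Map;

final class StaleStateSketch {

    /** Toy cluster: the authoritative shard -> core URL mapping. */
    static final class ToyCluster {
        final Map<String, String> shardToCore = new HashMap<>();

        /** Returns 200 if the core currently exists, otherwise 404. */
        int send(String coreUrl) {
            return shardToCore.containsValue(coreUrl) ? 200 : 404;
        }
    }

    /** Toy client that caches the mapping the first time it is needed. */
    static final class ToyClient {
        private final ToyCluster cluster;
        private Map<String, String> cachedState;

        ToyClient(ToyCluster cluster) { this.cluster = cluster; }

        private Map<String, String> state(boolean forceRefresh) {
            if (cachedState == null || forceRefresh) {
                cachedState = new HashMap<>(cluster.shardToCore); // snapshot
            }
            return cachedState;
        }

        /** Behaviour from the report: route from the cache and surface the 404. */
        int directUpdate(String shard) {
            return cluster.send(state(false).get(shard));
        }

        /** Assumed mitigation: on 404, refresh the cached state and retry once. */
        int directUpdateWithRetry(String shard) {
            int status = directUpdate(shard);
            if (status == 404) {
                status = cluster.send(state(true).get(shard));
            }
            return status;
        }
    }

    public static void main(String[] args) {
        ToyCluster cluster = new ToyCluster();
        cluster.shardToCore.put("shard1", "http://node1/core_old");
        ToyClient client = new ToyClient(cluster);

        System.out.println(client.directUpdate("shard1"));          // state gets cached here

        // Collection is deleted and recreated: same shard name, different core.
        cluster.shardToCore.put("shard1", "http://node2/core_new");

        System.out.println(client.directUpdate("shard1"));          // stale route
        System.out.println(client.directUpdateWithRetry("shard1")); // refresh, then resend
    }
}
```

The second call in {{main}} goes to a core that no longer exists because nothing invalidates the snapshot, which mirrors the 404 in the jenkins log above.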



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
