[ https://issues.apache.org/jira/browse/SOLR-11484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16223767#comment-16223767 ]
Noble Paul commented on SOLR-11484:
-----------------------------------

If there is no leader, send the request to any live NRT node.

> CloudSolrClient's cache of collection clusterstate can cause RouteExceptions when attempting directUpdates after collection modifications
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11484
>                 URL: https://issues.apache.org/jira/browse/SOLR-11484
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>            Assignee: Noble Paul
>             Fix For: 7.2, master (8.0)
>
>         Attachments: SOLR-11484.patch, SOLR-11484.patch, jenkins.thetaphi.20662.txt
>
>
> This was discovered while auditing jenkins failures from
> {{TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete}}
> (where a test explicitly deletes and then recreates a collection with the
> same name), but as noted in a comment below, SOLR-11392 is another example of
> non-obvious test failures that can pop up because of this bug.
> In practice, it can affect any CloudSolrClient user after changes have been
> made to a collection (to add/move replicas, etc...)
> ----
> Original jira notes...
> {{TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete}}
> seems to fail with non-trivial frequency, so I grabbed the logs from a recent
> failure and started trying to follow along with the actions to figure out
> what exactly is happening....
> https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20662/
> {noformat}
>    [junit4] ERROR   20.3s J1 | TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete <<<
>    [junit4]    > Throwable #1: org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at https://127.0.0.1:42959/solr/testcollection_shard1_replica_n3: Expected mime type application/octet-stream but got text/html. <html>
>    [junit4]    > <head>
>    [junit4]    > <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
>    [junit4]    > <title>Error 404 </title>
> {noformat}
> The crux of this failure appears to be a genuine bug in how CloudSolrClient
> uses its cached ClusterState info when doing (direct) updates. The key bits
> seem to be:
> * CloudSolrClient does _something_ (update, query, etc...) with a collection,
> causing the current cluster state for the collection to be cached
> * The actual collection changes such that a Solr node/core no longer exists
> as part of the collection
> * CloudSolrClient is asked to process an UpdateRequest which triggers the
> code paths for the {{directUpdate()}} method -- which attempts to route the
> updates directly to a replica of the appropriate shard using the (cached)
> collection state info
> * CloudSolrClient (may) attempt to send that UpdateRequest to a node/core
> that doesn't exist, getting a 404 -- which does not (seem to) trigger a state
> refresh, or retry to find a correct URL to resend the update to.
> Details to follow in comment....
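
The failure sequence in the quoted description, together with the suggestion in the comment, can be sketched roughly as below. This is a self-contained simulation, not SolrJ code and not the attached patch: `StaleRouteRetrySketch`, `Replica`, `pickTarget`, the `stateReader` supplier (standing in for a cluster state lookup) and the `transport` function (standing in for the HTTP call) are all hypothetical names. Treating a 404 as the trigger for exactly one refresh-and-retry is an assumption about one possible remedy.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch of routing with a cached state; none of these names are SolrJ APIs. */
final class StaleRouteRetrySketch {

    /** Simplified replica record: core URL, replica type, leader flag, liveness. */
    static final class Replica {
        final String url;
        final String type;
        final boolean leader;
        final boolean live;

        Replica(String url, String type, boolean leader, boolean live) {
            this.url = url;
            this.type = type;
            this.leader = leader;
            this.live = live;
        }
    }

    /** Prefer a live leader; if there is none, fall back to the first live NRT replica. */
    static Replica pickTarget(List<Replica> replicas) {
        Replica fallback = null;
        for (Replica r : replicas) {
            if (!r.live) continue;              // never route to a replica that is not live
            if (r.leader) return r;             // a live leader always wins
            if (fallback == null && r.type.equals("NRT")) fallback = r;
        }
        return fallback;                        // null when no live leader or NRT replica exists
    }

    private final Supplier<List<Replica>> stateReader;   // assumed source of fresh state
    private final Function<String, Integer> transport;   // assumed: core URL -> HTTP-like status
    private List<Replica> cached;
    int refreshCount = 0;

    StaleRouteRetrySketch(Supplier<List<Replica>> stateReader,
                          Function<String, Integer> transport) {
        this.stateReader = stateReader;
        this.transport = transport;
        this.cached = stateReader.get();        // state is cached on first use
    }

    /** Route using the cache; on a 404, refresh the cache once and retry. */
    int update() {
        int status = sendToCachedTarget();
        if (status == 404) {
            cached = stateReader.get();
            refreshCount++;
            status = sendToCachedTarget();
        }
        return status;
    }

    private int sendToCachedTarget() {
        Replica target = pickTarget(cached);
        if (target == null) return 503;         // assumed status for "nothing to route to"
        return transport.apply(target.url);
    }

    public static void main(String[] args) {
        AtomicReference<List<Replica>> state = new AtomicReference<>(
                List.of(new Replica("core_n1", "NRT", true, true)));
        StaleRouteRetrySketch client = new StaleRouteRetrySketch(
                state::get, url -> url.equals("core_n5") ? 200 : 404);
        // The collection is recreated: core_n1 is gone and no leader is elected yet.
        state.set(List.of(new Replica("core_n3", "TLOG", false, true),
                          new Replica("core_n5", "NRT", false, true)));
        System.out.println("status=" + client.update() + " refreshes=" + client.refreshCount);
    }
}
```

In this simulation the first attempt goes to the cached `core_n1` and gets a 404. The refresh then finds no leader, so `pickTarget` falls back to the live NRT replica `core_n5`, and `main` prints `status=200 refreshes=1`. Without the refresh step, the client would keep returning the 404 from the stale core, which is the behaviour the issue describes.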