[
https://issues.apache.org/jira/browse/SOLR-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057467#comment-14057467
]
Shalin Shekhar Mangar commented on SOLR-6235:
---------------------------------------------
Wow, crazy crazy bug! I finally found the root cause.
The problem is with the leader-initiated recovery code, which uses the core name
to set/get status. This works fine as long as the core names for all nodes are
different, but if they all happen to be "collection1" then we have this
problem :)
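To illustrate the collision, here is a minimal sketch in Java. It is not Solr's actual code: the class LirStatusStore, its methods, and the choice of keys are hypothetical. It only shows why a status keyed by core name is shared by every replica when all cores are named "collection1", while a key that is unique per replica (here the core node name, e.g. core_node2) is not.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-shard recovery-status store; not a Solr API. */
public class LirStatusStore {
    private final Map<String, String> status = new HashMap<>();

    /** Marks whichever replica is identified by the given key as "down". */
    public void markDown(String key) {
        status.put(key, "down");
    }

    /** Returns the recorded status, or "active" if nothing was recorded. */
    public String get(String key) {
        return status.getOrDefault(key, "active");
    }

    public static void main(String[] args) {
        // Keyed by core name: every replica in the test is called "collection1",
        // so a flag meant for core_node2 is visible to all of them.
        LirStatusStore byCoreName = new LirStatusStore();
        byCoreName.markDown("collection1");
        System.out.println("by core name, core_node3 sees: "
                + byCoreName.get("collection1"));

        // Keyed by core node name (assumed unique per replica): only the
        // intended replica is flagged.
        LirStatusStore byCoreNode = new LirStatusStore();
        byCoreNode.markDown("core_node2");
        System.out.println("by core node name, core_node3 sees: "
                + byCoreNode.get("core_node3"));
    }
}
```

With the core-name key, core_node3 reads back "down" even though the flag was intended for core_node2; with the per-replica key it reads "active".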
In this particular failure that I investigated:
http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-MacOSX/1667/consoleText
Here's the sequence of events:
# port:51916 - core_node1 was initially the leader, docs were indexed and then
it was killed
# port:51919 - core_node2 became the leader, peer sync happened, shards were
checked for consistency
# port:51916 - core_node1 was brought back online, it recovered from the
leader, consistency check passed
# port:51923 core_node3 and port:51932 core_node4 were added to the skipped
servers
# 300 docs were indexed (to go beyond the peer sync limit)
# port:51919 - core_node2 (the leader) was killed
Here is where things get interesting:
# port:51923 core_node3 tries to become the leader and initiates sync with
other replicas
# In the meantime, a commit request from checkShardConsistency makes its way
to port:51923 core_node3 (even though it's not clear whether it has indeed
become the leader)
# port:51923 core_node3 calls commit on all shards including port:51919
core_node2 which should've been down but perhaps the local state at 51923 is
not updated yet?
# port:51923 core_node3 puts replica collection1 on 127.0.0.1:51919_ into
leader-initiated recovery
# port:51923 - core_node3 fails to peersync (because the number of changes was
too large) and rejoins the election
# After this point, each replica that tries to become the leader fails because
it thinks that it has been put under leader-initiated recovery, and goes into
actual "recovery"
# Of course, since there is no leader, recovery cannot happen and each shard
eventually goes to "recovery_failed" state
# Eventually the test gives up and throws an error saying that there are no
live servers available to handle the request.
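The last few steps can be modelled with a small Java simulation. This is a deliberately simplified sketch, not Solr code: the class ElectionDeadlock, the runElection method, and the state strings are hypothetical, and the recovery-status lookup is assumed to be a plain map keyed by core name. It shows that once a "down" flag is stored under the shared core name, every surviving candidate sees itself as flagged, no leader is elected, and all of them end up in "recovery_failed".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of the failed election; not Solr code. */
public class ElectionDeadlock {

    /**
     * Each candidate in turn tries to become leader. A candidate that finds a
     * "down" flag under its own core name assumes it is in leader-initiated
     * recovery and tries to recover instead; with no leader to recover from,
     * it ends in "recovery_failed".
     */
    public static Map<String, String> runElection(List<String> candidates,
                                                  Map<String, String> lirByCoreName,
                                                  String coreName) {
        Map<String, String> finalState = new LinkedHashMap<>();
        String leader = null;
        for (String candidate : candidates) {
            if (leader != null) {
                finalState.put(candidate, "active");          // a leader exists
            } else if ("down".equals(lirByCoreName.get(coreName))) {
                finalState.put(candidate, "recovery_failed"); // nobody to recover from
            } else {
                leader = candidate;
                finalState.put(candidate, "leader");
            }
        }
        return finalState;
    }

    public static void main(String[] args) {
        // core_node2 has been killed; the flag meant for it sits under the
        // core name that every replica shares.
        Map<String, String> lir = new HashMap<>();
        lir.put("collection1", "down");
        List<String> survivors = Arrays.asList("core_node3", "core_node1", "core_node4");
        System.out.println(runElection(survivors, lir, "collection1"));
    }
}
```

In this model, running the same election with an empty status map lets the first candidate become leader, which is the behaviour the test expected.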
> SyncSliceTest fails on jenkins with no live servers available error
> -------------------------------------------------------------------
>
> Key: SOLR-6235
> URL: https://issues.apache.org/jira/browse/SOLR-6235
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud, Tests
> Reporter: Shalin Shekhar Mangar
> Assignee: Shalin Shekhar Mangar
> Fix For: 4.10
>
>
> {code}
> 1 tests failed.
> FAILED: org.apache.solr.cloud.SyncSliceTest.testDistribSearch
> Error Message:
> No live SolrServers available to handle this request
> Stack Trace:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request
> at __randomizedtesting.SeedInfo.seed([685C57B3F25C854B:E9BAD9AB8503E577]:0)
> at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:317)
> at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:659)
> at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
> at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
> at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1149)
> at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1118)
> at org.apache.solr.cloud.SyncSliceTest.doTest(SyncSliceTest.java:236)
> at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:865)
> {code}