[jira] [Commented] (SOLR-6235) SyncSliceTest fails on jenkins with no live servers available error

Shalin Shekhar Mangar (JIRA) Thu, 10 Jul 2014 06:42:32 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057467#comment-14057467
 ]


Shalin Shekhar Mangar commented on SOLR-6235:
---------------------------------------------

Wow, crazy crazy bug! I finally found the root cause.

The problem is with the leader-initiated recovery code, which uses the core 
name to set/get status. This works fine as long as the core names on all nodes 
are different, but if they all happen to be "collection1" then we have this 
problem :)
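
To make the collision concrete, here is a minimal sketch. It is not Solr's 
actual code: the class, the maps and the method names are all hypothetical, and 
keying by core node name is shown only as one possible alternative, not as the 
committed fix. It compares a recovery marker keyed by core name with one keyed 
by a per-replica name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Solr.
class LirKeyCollision {
    // Recovery markers keyed by core name (the scheme described above).
    static final Map<String, String> byCoreName = new HashMap<>();
    // Recovery markers keyed by a per-replica name such as core_node2
    // (an assumed alternative, shown for comparison).
    static final Map<String, String> byCoreNodeName = new HashMap<>();

    // Record that one replica was put into leader-initiated recovery.
    static void markDown(String coreName, String coreNodeName) {
        byCoreName.put(coreName, "down");
        byCoreNodeName.put(coreNodeName, "down");
    }

    static boolean inRecoveryByCoreName(String coreName) {
        return byCoreName.containsKey(coreName);
    }

    static boolean inRecoveryByCoreNodeName(String coreNodeName) {
        return byCoreNodeName.containsKey(coreNodeName);
    }

    public static void main(String[] args) {
        // Only core_node2 is marked, but its core name is "collection1".
        markDown("collection1", "core_node2");

        // core_node4 was never marked, yet it shares the core name.
        System.out.println(inRecoveryByCoreName("collection1"));    // true
        System.out.println(inRecoveryByCoreNodeName("core_node4")); // false
    }
}
```

With the core-name key, every replica whose core is called "collection1" 
appears to be in recovery as soon as any one of them is marked.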

In this particular failure that I investigated:
http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-MacOSX/1667/consoleText

Here's the sequence of events:
# port:51916 - core_node1 was initially the leader, docs were indexed and then 
it was killed
# port:51919 - core_node2 became the leader, peer sync happened, shards were 
checked for consistency
# port:51916 - core_node1 was brought back online, it recovered from the 
leader, consistency check passed
# port:51923 core_node3 and port:51932 core_node4 were added to the skipped 
servers
# 300 docs were indexed (to go beyond the peer sync limit)
# port:51919 - core_node2 (the leader) was killed

Here is where things get interesting:
# port:51923 core_node3 tries to become the leader and initiates sync with 
other replicas
# In the meantime, a commit request from checkShardConsistency makes its way 
to port:51923 core_node3 (even though it's not clear whether it has actually 
become the leader)
# port:51923 core_node3 calls commit on all shards including port:51919 
core_node2 which should've been down but perhaps the local state at 51923 is 
not updated yet?
# port:51923 core_node3 puts replica collection1 on 127.0.0.1:51919_ into 
leader-initiated recovery
# port:51923 - core_node3 fails to peersync (because the number of changes was 
too large) and rejoins the election
# After this point each replica that tries to become the leader fails because 
it thinks that it has been put under leader-initiated recovery and goes into 
actual "recovery"
# Of course, since there is no leader, recovery cannot happen and each replica 
eventually goes to "recovery_failed" state
# Eventually the test gives up and throws an error saying that there are no 
live servers available to handle the request.
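
The last few steps can be sketched as a toy simulation. This is a simplified 
model under stated assumptions, not Solr's election code: the class name, the 
runElection method and the state strings used as map values are all 
hypothetical. Each candidate checks for a recovery marker under its core name; 
because all candidates share the name "collection1", every one of them backs 
off and no leader is ever elected.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of the stuck election; not Solr's implementation.
class StuckElectionSketch {
    // Returns the final state of each candidate, in election order.
    static Map<String, String> runElection(List<String> candidates,
                                           String sharedCoreName,
                                           Set<String> markedCoreNames) {
        Map<String, String> states = new LinkedHashMap<>();
        String leader = null;
        for (String candidate : candidates) {
            if (markedCoreNames.contains(sharedCoreName)) {
                // The candidate believes it was put into recovery.
                states.put(candidate, "recovering");
            } else if (leader == null) {
                leader = candidate;
                states.put(candidate, "leader");
            } else {
                states.put(candidate, "active");
            }
        }
        if (leader == null) {
            // With no leader to recover from, recovery cannot succeed.
            states.replaceAll((name, state) ->
                state.equals("recovering") ? "recovery_failed" : state);
        }
        return states;
    }

    public static void main(String[] args) {
        List<String> candidates =
            Arrays.asList("core_node3", "core_node4", "core_node1");
        // A marker exists under "collection1", written for core_node2.
        Set<String> marked = Collections.singleton("collection1");
        // Every candidate ends up in recovery_failed.
        System.out.println(runElection(candidates, "collection1", marked));
    }
}
```

If the set of marked core names is empty, the same model elects core_node3 as 
leader, which is the behaviour the test expects.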

> SyncSliceTest fails on jenkins with no live servers available error
> -------------------------------------------------------------------
>
>                 Key: SOLR-6235
>                 URL: https://issues.apache.org/jira/browse/SOLR-6235
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud, Tests
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 4.10
>
>
> {code}
> 1 tests failed.
> FAILED:  org.apache.solr.cloud.SyncSliceTest.testDistribSearch
> Error Message:
> No live SolrServers available to handle this request
> Stack Trace:
> org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request
>         at __randomizedtesting.SeedInfo.seed([685C57B3F25C854B:E9BAD9AB8503E577]:0)
>         at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:317)
>         at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:659)
>         at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:91)
>         at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
>         at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1149)
>         at org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1118)
>         at org.apache.solr.cloud.SyncSliceTest.doTest(SyncSliceTest.java:236)
>         at org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:865)
> {code}



