Steven Bower created SOLR-7550:
----------------------------------

             Summary: PeerSync fails if a replica returns 500 error
                 Key: SOLR-7550
                 URL: https://issues.apache.org/jira/browse/SOLR-7550
             Project: Solr
          Issue Type: Bug
          Components: SolrCloud
    Affects Versions: 4.10.2, 4.8.1
         Environment: linux
            Reporter: Steven Bower
            Priority: Critical


In a 4-node cluster, we stopped a node and then started it back up. Before the 
node came back up, an invalid schema change was made, so when the node 
restarted its core could not load. While it was in this state the leader was 
restarted as well, leaving two nodes in this bad state. When the remaining two 
nodes attempted to become leader and ran PeerSync, they got a 500 error back 
from the failed-to-start cores and were unable to become leader. Eventually 
the remaining two nodes ended up in the "recovery_failed" state and the 
cluster went offline.

Some logs:

{noformat}
2015-05-14 17:03:20.712 INFO  ShardLeaderElectionContext [main-EventThread] - Running the leader process for shard shard1
2015-05-14 17:03:20.720 INFO  ShardLeaderElectionContext [main-EventThread] - Checking if I should try and be the leader.
2015-05-14 17:03:20.720 INFO  ShardLeaderElectionContext [main-EventThread] - My last published State was Active, it's okay to be the leader.
2015-05-14 17:03:20.720 INFO  ShardLeaderElectionContext [main-EventThread] - I may be the new leader - try and sync
2015-05-14 17:03:20.720 WARN  RecoveryStrategy [main-EventThread] - Stopping recovery for zkNodeName=host-a2:12345_solr_xxxxcore=xxxx
2015-05-14 17:03:23.220 INFO  SyncStrategy [main-EventThread] - Sync replicas to http://host-a2:12345/solr/xxxx/
2015-05-14 17:03:23.221 INFO  PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr START replicas=[http://host-b1:12345/solr/xxxx/, http://host-a1:12345/solr/xxxx_shard1/] nUpdates=100
2015-05-14 17:03:23.238 INFO  PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr  Received 96 versions from http://host-b1:12345/solr/xxxx/
2015-05-14 17:03:23.239 INFO  PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr  Our versions are newer. ourLowThreshold=1501178223728263172 otherHigh=1501178223745040385
2015-05-14 17:03:23.385 WARN  PeerSync [main-EventThread] - PeerSync: core=xxxx url=http://host-a2:12345/solr  exception talking to http://host-a1:12345/solr/xxxx_shard1/, failed
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 {msg=SolrCore 'xxxx_shard1' is not available due to init failure: Could not load conf for core xxxx_shard1: Plugin init failure for [schema.xml] fieldType "text_split_colon": Plugin init failure for [schema.xml] analyzer/filter: Error loading class 'XXXXXXXXXXXXXX'. Schema file is /configs/xxxx/schema.xml,trace=org.apache.solr.common.SolrException: SolrCore 'xxxx_shard1' is not available due to init failure: Could not load conf for core xxxx_shard1: Plugin init failure for [schema.xml] fieldType "some_field_type": Plugin init failure for [schema.xml] analyzer/filter: Error loading class 'XXXXXXXXXXXXXXX'. Schema file is /configs/xxxx/schema.xml
        at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:299)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
  ...
  ...
  ...
{noformat}

The error handling looks brittle: it tolerates connection issues and 503 and 
404 errors, but any other error from a node in a bad state causes leader 
election to fail in a cluster that needs to elect a leader.

If simply adding support for 500 errors is seen as the best approach, that is 
a simple fix and I can put up a patch quickly.
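
To make the idea concrete, here is a minimal sketch of a check that tolerates 
500 alongside the cases described above (connection issues, 503 and 404). The 
class and method names are hypothetical and are not taken from the Solr 
source; the actual PeerSync code may be structured quite differently.

```java
public class PeerSyncTolerance {

    /**
     * Hypothetical helper: decides whether a failed request to a peer can be
     * ignored during sync instead of failing the whole sync attempt.
     *
     * @param connectionError true if the peer could not be reached at all
     * @param httpStatus      HTTP status returned by the peer (ignored when
     *                        connectionError is true)
     */
    static boolean canIgnorePeerFailure(boolean connectionError, int httpStatus) {
        if (connectionError) {
            // Peer is unreachable, so it cannot hold updates we need to wait for.
            return true;
        }
        switch (httpStatus) {
            case 404: // core not found on the peer
            case 503: // peer temporarily unavailable
            case 500: // proposed addition: e.g. core failed to initialize
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // Print the decision for a few representative status codes.
        int[] statuses = {404, 500, 503, 400};
        for (int status : statuses) {
            System.out.println(status + " -> " + canIgnorePeerFailure(false, status));
        }
    }
}
```

Whether a blanket 500 should be tolerated, or only a 500 caused by a core 
init failure, is the open design question here.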



