Hey Erick,

I have tried upping the timeouts quite a bit now, including the zkClientTimeout 
setting in Solr itself (I found a few old posts on the mailing list 
suggesting this).
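
To be concrete, this is roughly the change I mean - shown against an old-style 
(4.x) solr.xml, so the attribute may live elsewhere in newer layouts, and the 
60 second value is just what we are experimenting with, not a recommendation:

  <!-- solr.xml: the ZooKeeper session timeout Solr asks for -->
  <cores adminPath="/admin/cores" host="${host:}" hostPort="${hostPort:8080}"
         zkClientTimeout="${zkClientTimeout:60000}">
    ...
  </cores>

The same value can also be passed at startup (-DzkClientTimeout=60000), since 
the solr.xml above falls back to that system property.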

I realise this is a sort of weird situation, where we are actually trying to 
work around some horrible hardware setup.

Thank you for your reply - I will post an update in a day or two, once I see 
how it performs.
--
Henrik Ossipoff Hansen
Developer, Entertainment Trading


On 7. nov. 2013 at 13.23.59, Erick Erickson (erickerick...@gmail.com) wrote:

Right, can you up your ZK timeouts significantly? It sounds like
your ZK timeout is short enough that when your system slows
down, the timeout is exceeded and it's throwing Solr
into a tailspin....

See zoo.cfg.
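
Something along these lines, for example (numbers purely illustrative) - note 
that maxSessionTimeout caps whatever session timeout the Solr client asks for, 
so raising the Solr-side timeout alone may not be enough:

  # zoo.cfg
  tickTime=2000
  initLimit=10
  syncLimit=5
  # session timeouts are negotiated within this window;
  # the default max is 20 * tickTime, i.e. 40 seconds here
  minSessionTimeout=4000
  maxSessionTimeout=90000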

Best,
Erick


On Tue, Nov 5, 2013 at 3:33 AM, Henrik Ossipoff Hansen <
h...@entertainment-trading.com> wrote:

> I previously made a post on this, but have since narrowed down the issue,
> so I am giving it another try with a different spin.
>
> We are running a 4-node setup (on Tomcat 7) with an external 3-node ZooKeeper
> ensemble. This runs on a total of 7 (4+3) different VMs, and each VM
> uses our storage system (an NFS share in VMware).
>
> Now I do realize, and have heard, that NFS is not the greatest storage to run
> Solr on, but we have never had this issue with non-SolrCloud setups.
>
> Basically, each night when we run our backup jobs, our storage becomes a
> bit slow to respond - this is obviously something we’re trying to solve,
> but the bottom line is that all our other systems somehow stay alive or
> recover gracefully once bandwidth is available again.
> SolrCloud - not so much. Typically after a session like this, 3-5 nodes
> will either go into a Down state or a Recovering state - and stay that way.
> Sometimes such a node will even be marked as leader. Such a node will have
> something like this in the log:
>
> ERROR - 2013-11-05 08:57:45.764; org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so
> ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
> at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
> at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
> at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
> at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
> at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
> at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
> at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
> at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
> at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
> at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
> at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
> at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
> at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
> at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
> at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
>
> On the other nodes, an error similar to this will be in the log:
>
> 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 returned non ok status:503, message:Service Unavailable
> 09:27:34 - ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
>
> Does anyone have any ideas or leads towards a solution - one that doesn’t
> involve getting a new storage system (something we *are* actively working
> on, but that’s not a quick fix in our case)? Shouldn’t a setup like this be
> possible? And even more so - shouldn’t SolrCloud be able to recover
> gracefully after issues like this?
>
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading
>
