Hmm… I've tried to replicate what looked like a bug from your report (the 3 Solr 
servers stop/start case), but on 5x it works without a problem for me. It 
shouldn't be any different on 4x, but I'll try that next.

As for starting up Solr without a working ZooKeeper ensemble - it won't 
work currently. Cores won't be able to register with ZooKeeper and will fail 
to load. It would probably be nicer to come up in search-only mode and keep 
trying to reconnect to ZooKeeper - file a JIRA issue if you are interested.
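
To make the "keep trying to reconnect" idea concrete, here is a rough sketch in 
Python. It is not Solr code and not how Solr behaves today: `retry_with_backoff`, 
`connect`, and all the parameters are hypothetical names, and the doubling delay 
is just one reasonable policy.

```python
import time


def retry_with_backoff(connect, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds, waiting longer after each failure.

    connect is any zero-argument callable that raises ConnectionError while
    the remote side (e.g. a ZooKeeper ensemble) is unreachable. This is an
    illustrative helper, not part of Solr or ZooKeeper.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the last failure.
                raise
            sleep(delay)
            # Double the wait each time, but never exceed max_delay.
            delay = min(delay * 2, max_delay)
```

A caller could keep serving searches from the local index while something like 
this runs in a background thread; the `sleep` parameter is only there so the 
waiting can be stubbed out in tests.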

On the zk data dir, see 
http://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html#Ongoing+Data+Directory+Cleanup
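
Short version of that page: the data dir holds snapshots (`snapshot.<zxid>`) and 
transaction logs (`log.<zxid>`), and ZooKeeper does not remove old ones unless 
you enable autopurge (`autopurge.snapRetainCount` / `autopurge.purgeInterval` in 
zoo.cfg, available since 3.4.0) or run the purge utility yourself. If you want 
to see what is taking up the space, a small script along these lines would do. 
It is only a sketch: `summarize_zk_data_dir` is a made-up helper, and it 
assumes files are classified purely by their name prefix.

```python
import os
from collections import defaultdict


def summarize_zk_data_dir(data_dir):
    """Return {kind: (file_count, total_bytes)} for files under data_dir.

    Files are classified by name prefix only (an assumption of this sketch):
    "snapshot." -> snapshot, "log." -> txn_log, anything else -> other.
    """
    totals = defaultdict(lambda: [0, 0])
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.startswith("snapshot."):
                kind = "snapshot"
            elif name.startswith("log."):
                kind = "txn_log"
            else:
                kind = "other"
            totals[kind][0] += 1
            totals[kind][1] += os.path.getsize(os.path.join(root, name))
    return {kind: tuple(counts) for kind, counts in totals.items()}
```

Pointing it at one of your zkdata directories should show whether the 200 MB is 
mostly old snapshots and logs, which is what I would expect.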

- Mark

On Dec 7, 2012, at 10:22 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Hey, I'll try and answer this tomorrow.
> 
> There is def an unreported bug in there that needs to be fixed for the 
> restarting-all-nodes case.
> 
> Also, a 404 is generally seen when Jetty is starting or stopping - there are 
> points where 404s can be returned. I'm not sure why else you'd see one. 
> Generally we do retries when that happens.
> 
> - Mark
> 
> On Dec 7, 2012, at 1:07 PM, Alain Rogister <alain.rogis...@gmail.com> wrote:
> 
>> I am reporting the results of my stress tests against Solr 4.x. As I was
>> getting many error conditions with 4.0, I switched to the 4.1 trunk in the
>> hope that some of the issues would be fixed already. Here is my setup :
>> 
>> - Everything running on a single box (2 x 4-core CPUs, 8 GB RAM). I realize
>> this is not representative of a production environment but it's a fine way
>> to find out what happens under resource-constrained conditions.
>> - 3 Solr servers, 3 cores (2 of which are very small, the third one has 410
>> MB of data)
>> - single shard
>> - 3 Zookeeper instances
>> - HAProxy load balancing requests across Solr servers
>> - JMeter or ApacheBench running the tests : 5 thread pools of 20 threads
>> each, sending search requests continuously (no updates)
>> 
>> In nominal conditions, it all works fine, i.e. it can process a million
>> requests, maxing out the CPUs at all times, without experiencing nasty
>> failures. There are errors in the logs about replication failures though;
>> they should be benign in this case as no updates are taking place but it's
>> hard to tell what is going on exactly. Example :
>> 
>> Dec 07, 2012 7:50:37 PM org.apache.solr.update.PeerSync handleResponse
>> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr
>> exception talking to
>> http://192.168.0.101:8985/solr/adressage/, failed
>> org.apache.solr.common.SolrException: Server at
>> http://192.168.0.101:8985/solr/adressage returned non ok status:404,
>> message:Not Found
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:372)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> at
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:166)
>> at
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> at java.lang.Thread.run(Thread.java:722)
>> 
>> Then I simulated various failure scenarios :
>> 
>> - 1 Solr server stop/start
>> - 2 Solr servers stop/start
>> - 3 Solr servers stop/start : it seems that in this case, the Solr servers
>> *cannot* be restarted : more exactly, the restarted server will consider
>> that it is number 1 out of 4 and wait for the other 3 to come up. The only
>> way out is to stop it again, then stop all Zookeeper instances *and* clean
>> up their zkdata directory, start them, then start the Solr servers.
>> 
>> I noticed that these zkdata directories had grown to 200 MB after a while.
>> What exactly is in there besides the configuration data ? Does it stop
>> growing ?
>> 
>> Then I tried this :
>> 
>> - kill 1 Zookeeper process
>> - kill 2 Zookeeper processes
>> - stop/start 1 Solr server
>> 
>> When doing this, I experienced (many times) situations where the Solr
>> servers could not reconnect and threw scary exceptions. The only way out
>> was to restart the whole cluster.
>> 
>> Q : when, if ever, is one supposed to clean up the zkdata directories ?
>> 
>> Here are the errors I found in the logs. It seems that some of them have
>> been reported in JIRA but 4.1-trunk seems to experience basically the same
>> issues as 4.0 in my test scenarios.
>> 
>> Dec 07, 2012 8:03:59 PM org.apache.solr.update.PeerSync handleResponse
>> WARNING: PeerSync: core=cachede url=http://192.168.0.101:8983/solr
>> couldn't connect to
>> http://192.168.0.101:8984/solr/cachede/, counting as success
>> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
>> SEVERE: Sync request error:
>> org.apache.solr.client.solrj.SolrServerException: Server refused connection
>> at: http://192.168.0.101:8984/solr/cachede
>> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
>> SEVERE: http://192.168.0.101:8983/solr/cachede/: Could not tell a replica
>> to recover:org.apache.solr.client.solrj.SolrServerException: Server refused
>> connection at: http://192.168.0.101:8984/solr
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:406)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> at org.apache.solr.cloud.SyncStrategy$1.run(SyncStrategy.java:293)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> at java.lang.Thread.run(Thread.java:722)
>> Caused by: org.apache.http.conn.HttpHostConnectException: Connection to
>> http://192.168.0.101:8984 refused
>> at
>> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
>> at
>> org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)
>> at
>> org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
>> at
>> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:575)
>> at
>> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:425)
>> at
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
>> at
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
>> at
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
>> ... 5 more
>> Caused by: java.net.ConnectException: Connection refused
>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>> at
>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>> at
>> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>> at
>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
>> at java.net.Socket.connect(Socket.java:579)
>> at
>> org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123)
>> at
>> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
>> ... 13 more
>> 
>> Dec 07, 2012 8:03:59 PM org.apache.solr.update.PeerSync handleResponse
>> WARNING: PeerSync: core=adressage url=http://192.168.0.101:8983/solr  got a
>> 404 from http://192.168.0.101:8985/solr/adressage/, counting as success
>> Dec 07, 2012 8:03:59 PM org.apache.solr.common.SolrException log
>> SEVERE: Sync request error: org.apache.solr.common.SolrException: Server at
>> http://192.168.0.101:8985/solr/adressage returned non ok status:404,
>> message:Not Found
>> Dec 07, 2012 8:04:00 PM org.apache.solr.update.PeerSync handleResponse
>> WARNING: PeerSync: core=formabanque url=http://192.168.0.101:8983/solr  got
>> a 404 from http://192.168.0.101:8985/solr/formabanque/, counting as success
>> Dec 07, 2012 8:04:00 PM org.apache.solr.common.SolrException log
>> SEVERE: Sync request error: org.apache.solr.common.SolrException: Server at
>> http://192.168.0.101:8985/solr/formabanque returned non ok status:404,
>> message:Not Found
>> 
>> Dec 07, 2012 8:04:32 PM org.apache.solr.update.PeerSync sync
>> WARNING: no frame of reference to tell of we've missed updates
>> 
>> Dec 07, 2012 8:03:58 PM org.apache.solr.common.SolrException log
>> SEVERE: Error while trying to
>> recover:org.apache.solr.client.solrj.SolrServerException: Server refused
>> connection at: http://192.168.0.101:8984/solr/adressage
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:406)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
>> at
>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
>> at
>> org.apache.solr.cloud.RecoveryStrategy.commitOnLeader(RecoveryStrategy.java:182)
>> at
>> org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:134)
>> at
>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:407)
>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
>> Caused by: org.apache.http.conn.HttpHostConnectException: Connection to
>> http://192.168.0.101:8984 refused
>> at
>> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:158)
>> at
>> org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:150)
>> at
>> org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:121)
>> at
>> org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:575)
>> at
>> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:425)
>> at
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
>> at
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
>> at
>> org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:732)
>> at
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
>> ... 6 more
>> Caused by: java.net.ConnectException: Connection refused
>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>> at
>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>> at
>> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>> at
>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
>> at java.net.Socket.connect(Socket.java:579)
>> at
>> org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:123)
>> at
>> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:148)
>> ... 14 more
>> 
>> Dec 07, 2012 8:03:58 PM org.apache.solr.cloud.RecoveryStrategy doRecovery
>> SEVERE: Recovery failed - trying again... (0) core=adressage
>> 
>> SEVERE: Error getting leader from zk
>> org.apache.solr.common.SolrException: Could not get leader props
>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:735)
>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:699)
>> at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:664)
>> at org.apache.solr.cloud.ZkController.register(ZkController.java:603)
>> at org.apache.solr.cloud.ZkController.register(ZkController.java:558)
>> at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:791)
>> at org.apache.solr.core.CoreContainer.register(CoreContainer.java:775)
>> at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:567)
>> at org.apache.solr.core.CoreContainer$2.call(CoreContainer.java:562)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> at java.lang.Thread.run(Thread.java:722)
>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
>> KeeperErrorCode = NoNode for /collections/adressage/leaders/shard1
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>> at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>> at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>> at
>> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:244)
>> at
>> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:241)
>> at
>> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:63)
>> at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:241)
>> at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:713)
>> ... 16 more
>> 
>> Dec 07, 2012 4:39:23 PM org.apache.solr.common.SolrException log
>> SEVERE: org.apache.solr.common.SolrException: no servers hosting shard:
>> at
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:159)
>> at
>> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:133)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>> at java.lang.Thread.run(Thread.java:722)
> 
