I restarted the overseer node and another took over; the queues are empty now. The server with core production_things_shard1_2 is having these errors:
shard update error RetryNode: http://10.104.59.189:8883/solr/production_things_shard11_replica1/:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://10.104.59.189:8883/solr/production_things_shard11_replica1

for shard11!!!

I also got some strange errors on the restarted node. Makes me wonder if there is a string-matching bug for shard1 vs shard11?

SEVERE: :org.apache.solr.common.SolrException: Error getting leader from zk
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:771)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:683)
        at org.apache.solr.cloud.ZkController.register(ZkController.java:634)
        at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:890)
        at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:874)
        at org.apache.solr.core.CoreContainer.register(CoreContainer.java:823)
        at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:633)
        at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.solr.common.SolrException: There is conflicting information about the leader of shard: shard1 our state says:http://10.104.59.189:8883/solr/collection1/ but zookeeper says:http://10.217.55.151:8883/solr/collection1/
        at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:756)

INFO: Releasing directory:/vol/ubuntu/talemetry_match_solr/solr_server/solr/production_things_shard11_replica1/data/index
Mar 15, 2013 5:52:34 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Error opening new searcher
        at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1423)
        at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1535)

SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state recovering for 10.76.31.67:8883_solr but I still do not see the requested state. I see state: active live:true
        at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:948)


On Fri, Mar 15, 2013 at 5:05 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Strange - we hardened that loop in 4.1 - so I'm not sure what happened here.
>
> Can you do a stack dump on the overseer and see if you see an Overseer
> thread running perhaps? Or just post the results?
>
> To recover, you should be able to just restart the Overseer node and have
> someone else take over - they should pick up processing the queue.
>
> Any logs you might be able to share could be useful too.
>
> - Mark
>
> On Mar 15, 2013, at 7:51 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>
> > Also, looking at overseer_elect, everything looks fine. node is valid and
> > live.
> >
> >
> > On Fri, Mar 15, 2013 at 4:47 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >
> >> Sorry, should have specified. 4.1
> >>
> >>
> >> On Fri, Mar 15, 2013 at 4:33 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>
> >>> What Solr version? 4.0, 4.1 4.2?
> >>>
> >>> - Mark
> >>>
> >>> On Mar 15, 2013, at 7:19 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> >>>
> >>>> my solr cloud has been running fine for weeks, but about a week ago, it
> >>>> stopped dequeueing from the overseer queue, and now there are thousands of
> >>>> tasks on the queue, most of which look like
> >>>>
> >>>> {
> >>>> "operation":"state",
> >>>> "numShards":null,
> >>>> "shard":"shard3",
> >>>> "roles":null,
> >>>> "state":"recovering",
> >>>> "core":"production_things_shard3_2",
> >>>> "collection":"production_things",
> >>>> "node_name":"10.31.41.59:8883_solr",
> >>>> "base_url":"http://10.31.41.59:8883/solr"}
> >>>>
> >>>> i'm trying to create a new collection through collection API, and
> >>>> obviously, nothing is happening...
> >>>>
> >>>> any suggestion on how to fix this? drop the queue in zk?
> >>>>
> >>>> how could it have gotten in this state in the first place?
> >>>>
> >>>> thanks,
> >>>> gary