It looks like they are not picking up the new leader state for some reason…

That's where it says the local state doesn't match the ZooKeeper state. If the local state doesn't come to match the ZooKeeper state within a short window after a new leader comes in, everything bails, because it assumes something is wrong.
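
In rough terms the check works like the sketch below. This is just a simplified illustration, not the actual ZkController code; the suppliers, timeout, and poll interval are made-up stand-ins. The node keeps comparing its cached leader with what ZooKeeper reports and gives up if they don't converge in time.

import java.util.function.Supplier;

// Simplified illustration of the behaviour described above (NOT the real
// ZkController code): keep comparing the locally cached leader with what
// ZooKeeper reports, and bail out if they haven't converged within a
// short window. The suppliers, timeout, and poll interval are made up.
public class LeaderStateCheck {

  static String waitForMatchingLeader(Supplier<String> localLeader,
                                      Supplier<String> zkLeader,
                                      long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (true) {
      String local = localLeader.get();
      String fromZk = zkLeader.get();
      if (fromZk != null && fromZk.equals(local)) {
        return fromZk;                    // local state caught up; all good
      }
      if (System.currentTimeMillis() > deadline) {
        // This is the "bail" path: the node assumes something is wrong.
        throw new IllegalStateException("Conflicting leader info: local="
            + local + " zookeeper=" + fromZk);
      }
      Thread.sleep(250);                  // poll again until the states converge
    }
  }
}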

There are a fair number of SolrCloud bug fixes in 4.2 by the way. We didn't do 
a 4.1.1, but I would recommend you update. I don't know that it solves this 
particular issue. I'm going to continue investigating.
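
If you want to see what is actually sitting in the queue, something like this rough sketch should do it. It uses the plain ZooKeeper Java client; the connect string is a placeholder for your ensemble, and /overseer/queue is the znode where SolrCloud parks pending Overseer work.

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Rough sketch: list what is sitting in the Overseer work queue.
// "zk1:2181" is a placeholder for your ZooKeeper connect string.
public class OverseerQueuePeek {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, new Watcher() {
      public void process(WatchedEvent event) { /* no-op watcher */ }
    });
    try {
      List<String> tasks = zk.getChildren("/overseer/queue", false);
      System.out.println("pending overseer tasks: " + tasks.size());
      for (String task : tasks) {
        byte[] data = zk.getData("/overseer/queue/" + task, false, null);
        System.out.println(task + " -> " + new String(data, "UTF-8"));
      }
    } finally {
      zk.close();
    }
  }
}

The same getChildren() call is also a quick way to confirm the queue is draining again after a new Overseer takes over.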

- Mark

On Mar 15, 2013, at 9:53 PM, Gary Yngve <gary.yn...@gmail.com> wrote:

> it doesn't appear to be a shard1 vs shard11 issue... 60% of my followers
> are red now in the solr cloud graph... trying to figure out what that
> means...
> 
> 
> On Fri, Mar 15, 2013 at 6:48 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> 
>> I restarted the overseer node and another took over, queues are empty now.
>> 
>> the server with core production_things_shard1_2
>> is having these errors:
>> 
>> shard update error RetryNode: http://10.104.59.189:8883/solr/production_things_shard11_replica1/:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://10.104.59.189:8883/solr/production_things_shard11_replica1
>> 
>>  for shard11!!!
>> 
>> I also got some strange errors on the restarted node.  Makes me wonder if
>> there is a string-matching bug for shard1 vs shard11?
>> 
>> SEVERE: :org.apache.solr.common.SolrException: Error getting leader from zk
>>  at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:771)
>>  at org.apache.solr.cloud.ZkController.register(ZkController.java:683)
>>  at org.apache.solr.cloud.ZkController.register(ZkController.java:634)
>>  at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:890)
>>  at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:874)
>>  at org.apache.solr.core.CoreContainer.register(CoreContainer.java:823)
>>  at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:633)
>>  at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
>>  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:722)
>> Caused by: org.apache.solr.common.SolrException: There is conflicting information about the leader of shard: shard1 our state says:http://10.104.59.189:8883/solr/collection1/ but zookeeper says:http://10.217.55.151:8883/solr/collection1/
>>  at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:756)
>> 
>> INFO: Releasing directory:/vol/ubuntu/talemetry_match_solr/solr_server/solr/production_things_shard11_replica1/data/index
>> Mar 15, 2013 5:52:34 PM org.apache.solr.common.SolrException log
>> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher
>>  at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1423)
>>  at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1535)
>> 
>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state recovering for 10.76.31.67:8883_solr but I still do not see the requested state. I see state: active live:true
>>  at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:948)
>> 
>> 
>> 
>> 
>> On Fri, Mar 15, 2013 at 5:05 PM, Mark Miller <markrmil...@gmail.com> wrote:
>> 
>>> Strange - we hardened that loop in 4.1 - so I'm not sure what happened
>>> here.
>>> 
>>> Can you do a stack dump on the overseer and see if you see an Overseer
>>> thread running perhaps? Or just post the results?
>>> 
>>> To recover, you should be able to just restart the Overseer node and have
>>> someone else take over - they should pick up processing the queue.
>>> 
>>> Any logs you might be able to share could be useful too.
>>> 
>>> - Mark
>>> 
>>> On Mar 15, 2013, at 7:51 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>> 
>>>> Also, looking at overseer_elect, everything looks fine. Node is valid and live.
>>>> 
>>>> 
>>>> On Fri, Mar 15, 2013 at 4:47 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>> 
>>>>> Sorry, should have specified.  4.1
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Mar 15, 2013 at 4:33 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>> 
>>>>>> What Solr version? 4.0, 4.1, 4.2?
>>>>>> 
>>>>>> - Mark
>>>>>> 
>>>>>> On Mar 15, 2013, at 7:19 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>>>> 
>>>>>>> my solr cloud has been running fine for weeks, but about a week ago, it
>>>>>>> stopped dequeueing from the overseer queue, and now there are thousands of
>>>>>>> tasks on the queue, most of which look like
>>>>>>> 
>>>>>>> {
>>>>>>> "operation":"state",
>>>>>>> "numShards":null,
>>>>>>> "shard":"shard3",
>>>>>>> "roles":null,
>>>>>>> "state":"recovering",
>>>>>>> "core":"production_things_shard3_2",
>>>>>>> "collection":"production_things",
>>>>>>> "node_name":"10.31.41.59:8883_solr",
>>>>>>> "base_url":"http://10.31.41.59:8883/solr"}
>>>>>>> 
>>>>>>> I'm trying to create a new collection through the collection API, and
>>>>>>> obviously, nothing is happening...
>>>>>>> 
>>>>>>> any suggestion on how to fix this?  drop the queue in zk?
>>>>>>> 
>>>>>>> how could it have gotten into this state in the first place?
>>>>>>> 
>>>>>>> thanks,
>>>>>>> gary
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
>> 
