On Mar 15, 2013, at 10:04 PM, Gary Yngve <gary.yn...@gmail.com> wrote:

> i think those followers are red from trying to forward requests to the
> overseer while it was being restarted.  i guess i'll see if they become
> green over time.  or i guess i can restart them one at a time..

Restarting the cluster should clear things up. It shouldn't take too long for
those nodes to recover, though - they should have been up to date before. The
couple of exceptions you posted definitely indicate something is out of whack.
It's something I'd like to get to the bottom of.
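
If you want to double check that the new Overseer is actually draining the
queue now, a quick sketch along these lines will print the backlog size.
This is untested - the ZooKeeper connect string is a placeholder for your
setup, but /overseer/queue is the znode Solr uses for that queue:

  import java.util.List;
  import java.util.concurrent.CountDownLatch;

  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;
  import org.apache.zookeeper.ZooKeeper;

  public class OverseerQueueCheck {
    public static void main(String[] args) throws Exception {
      final CountDownLatch connected = new CountDownLatch(1);
      // connect string is a placeholder - point it at your ZK ensemble
      ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, new Watcher() {
        public void process(WatchedEvent event) {
          if (event.getState() == Event.KeeperState.SyncConnected) {
            connected.countDown();
          }
        }
      });
      connected.await();
      // each child znode under /overseer/queue is one pending overseer task
      List<String> pending = zk.getChildren("/overseer/queue", false);
      System.out.println("pending overseer tasks: " + pending.size());
      zk.close();
    }
  }

Run it a couple of times - the count should fall steadily once a new
Overseer has taken over.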

- Mark

> 
> 
> On Fri, Mar 15, 2013 at 6:53 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
> 
>> it doesn't appear to be a shard1 vs shard11 issue... 60% of my followers
>> are red now in the solr cloud graph.. trying to figure out what that
>> means...
>> 
>> 
>> On Fri, Mar 15, 2013 at 6:48 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>> 
>>> I restarted the overseer node and another took over, queues are empty now.
>>> 
>>> the server with core production_things_shard1_2
>>> is having these errors:
>>> 
>>> shard update error RetryNode:
>>> http://10.104.59.189:8883/solr/production_things_shard11_replica1/:org.apache.solr.client.solrj.SolrServerException:
>>> Server refused connection at:
>>> http://10.104.59.189:8883/solr/production_things_shard11_replica1
>>> 
>>>  for shard11!!!
>>> 
>>> I also got some strange errors on the restarted node.  Makes me wonder if
>>> there is a string-matching bug for shard1 vs shard11?
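>>> 
>>> (To spell out the worry: "shard1" is a proper prefix of "shard11", so any
>>> code matching shard names by prefix rather than by equality would mix the
>>> two up. A contrived check, purely to illustrate what I mean:
>>> 
>>>   public class ShardNameCheck {
>>>     public static void main(String[] args) {
>>>       // a prefix match conflates the two shards...
>>>       System.out.println("shard11".startsWith("shard1")); // true
>>>       // ...while an exact comparison keeps them distinct
>>>       System.out.println("shard11".equals("shard1"));     // false
>>>     }
>>>   }
>>> 
>>> No idea yet whether that is what's actually happening here.)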
>>> 
>>> SEVERE: :org.apache.solr.common.SolrException: Error getting leader from zk
>>>  at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:771)
>>>  at org.apache.solr.cloud.ZkController.register(ZkController.java:683)
>>>  at org.apache.solr.cloud.ZkController.register(ZkController.java:634)
>>>  at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:890)
>>>  at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:874)
>>>  at org.apache.solr.core.CoreContainer.register(CoreContainer.java:823)
>>>  at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:633)
>>>  at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
>>>  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>  at java.lang.Thread.run(Thread.java:722)
>>> Caused by: org.apache.solr.common.SolrException: There is conflicting
>>> information about the leader of shard: shard1 our state says:
>>> http://10.104.59.189:8883/solr/collection1/ but zookeeper says:
>>> http://10.217.55.151:8883/solr/collection1/
>>>  at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:756)
>>> 
>>> INFO: Releasing directory:/vol/ubuntu/talemetry_match_solr/solr_server/solr/production_things_shard11_replica1/data/index
>>> Mar 15, 2013 5:52:34 PM org.apache.solr.common.SolrException log
>>> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher
>>>  at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1423)
>>>  at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1535)
>>> 
>>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on
>>> state recovering for 10.76.31.67:8883_solr but I still do not see the
>>> requested state. I see state: active live:true
>>>  at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:948)
>>> 
>>> 
>>> 
>>> 
>>> On Fri, Mar 15, 2013 at 5:05 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> 
>>>> Strange - we hardened that loop in 4.1 - so I'm not sure what happened
>>>> here.
>>>> 
>>>> Can you do a stack dump on the overseer and see if you see an Overseer
>>>> thread running perhaps? Or just post the results?
>>>> 
>>>> To recover, you should be able to just restart the Overseer node and
>>>> have someone else take over - they should pick up processing the queue.
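>>>> 
>>>> (If jstack is awkward to run there and the JVM has remote JMX enabled, a
>>>> rough sketch like this would also show whether any thread is currently
>>>> executing Overseer code - the JMX host/port below are just placeholders:
>>>> 
>>>>   import java.lang.management.ManagementFactory;
>>>>   import java.lang.management.ThreadInfo;
>>>>   import java.lang.management.ThreadMXBean;
>>>>   import javax.management.MBeanServerConnection;
>>>>   import javax.management.remote.JMXConnector;
>>>>   import javax.management.remote.JMXConnectorFactory;
>>>>   import javax.management.remote.JMXServiceURL;
>>>> 
>>>>   public class OverseerThreadDump {
>>>>     public static void main(String[] args) throws Exception {
>>>>       // connect to the overseer JVM over JMX (host/port are assumptions)
>>>>       JMXServiceURL url = new JMXServiceURL(
>>>>           "service:jmx:rmi:///jndi/rmi://localhost:9010/jmxrmi");
>>>>       JMXConnector jmxc = JMXConnectorFactory.connect(url);
>>>>       MBeanServerConnection conn = jmxc.getMBeanServerConnection();
>>>>       ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
>>>>           conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
>>>>       // flag any thread whose stack is inside org.apache.solr.cloud.Overseer
>>>>       for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
>>>>         for (StackTraceElement el : info.getStackTrace()) {
>>>>           if (el.getClassName().startsWith("org.apache.solr.cloud.Overseer")) {
>>>>             System.out.println(info.getThreadName() + " ["
>>>>                 + info.getThreadState() + "] at " + el);
>>>>             break;
>>>>           }
>>>>         }
>>>>       }
>>>>       jmxc.close();
>>>>     }
>>>>   }
>>>> 
>>>> That's just a sketch, though - a plain jstack of the process is the
>>>> simplest way to get the same information.)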
>>>> 
>>>> Any logs you might be able to share could be useful too.
>>>> 
>>>> - Mark
>>>> 
>>>> On Mar 15, 2013, at 7:51 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>> 
>>>>> Also, looking at overseer_elect, everything looks fine.  node is valid
>>>>> and live.
>>>>> 
>>>>> 
>>>>> On Fri, Mar 15, 2013 at 4:47 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>>> 
>>>>>> Sorry, should have specified.  4.1
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Fri, Mar 15, 2013 at 4:33 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>> 
>>>>>>> What Solr version? 4.0, 4.1, 4.2?
>>>>>>> 
>>>>>>> - Mark
>>>>>>> 
>>>>>>>> On Mar 15, 2013, at 7:19 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> my solr cloud has been running fine for weeks, but about a week ago,
>>>>>>>> it stopped dequeueing from the overseer queue, and now there are
>>>>>>>> thousands of tasks on the queue, most of which look like
>>>>>>>> 
>>>>>>>> {
>>>>>>>> "operation":"state",
>>>>>>>> "numShards":null,
>>>>>>>> "shard":"shard3",
>>>>>>>> "roles":null,
>>>>>>>> "state":"recovering",
>>>>>>>> "core":"production_things_shard3_2",
>>>>>>>> "collection":"production_things",
>>>>>>>> "node_name":"10.31.41.59:8883_solr",
>>>>>>>> "base_url":"http://10.31.41.59:8883/solr"}
>>>>>>>> 
>>>>>>>> i'm trying to create a new collection through the collection API, and
>>>>>>>> obviously, nothing is happening...
>>>>>>>> 
>>>>>>>> any suggestion on how to fix this?  drop the queue in zk?
>>>>>>>> 
>>>>>>>> how could it have gotten into this state in the first place?
>>>>>>>> 
>>>>>>>> thanks,
>>>>>>>> gary
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>> 
