On Mar 16, 2013, at 12:30 AM, Gary Yngve <gary.yn...@gmail.com> wrote:

> I will upgrade to 4.2 this weekend and see what happens.  We are on ec2 and
> have had a few issues with hostnames with both zk and solr. (but in this
> case i haven't rebooted any instances either)

There is actually a new feature in 4.2 that lets you specify arbitrary node 
names so that new ips can take over for old nodes. You just have to do this up 
front...
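
For example (rough sketch - the parameter name is from memory, so double check
the 4.2 docs/CHANGES), you pass an explicit coreNodeName when you first create
each core; host and names below are just placeholders:

  http://newhost:8883/solr/admin/cores?action=CREATE&name=production_things_shard1_replica1&collection=production_things&shard=shard1&coreNodeName=node1

A replacement instance that registers with the same coreNodeName then takes over
that slot in the cluster state rather than showing up as a brand new replica.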

> 
> it's a bit of a pain to do the upgrade because we have a query/scorer
> fork of lucene along with supplemental jars, and zk cannot distribute
> binary jars via the config.

There is a JIRA issue for this and it's on my list if no one gets it in before 
me.
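
In the meantime the extra jars still have to sit on each node's local disk -
only the config files themselves live in zk. E.g. a lib directive in
solrconfig.xml (the path here is just a placeholder):

  <lib dir="/opt/solr/custom-lib" regex=".*\.jar" />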

> 
> we are also multi-collection per zk... i wish it didn't require a core to
> always be defined up front for the core admin... i would love to have an
> instance with no cores and then just create the core i need..

You can do this - just modify your starting Solr example to have no cores in 
solr.xml. You won't be able to make use of the admin UI until you create at 
least one core, but the core and collection apis will both work fine.
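
Roughly, for the 4.x style solr.xml, that means an empty cores element -
something like this sketch (attribute values here are just examples):

  <solr persistent="true">
    <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:8983}">
    </cores>
  </solr>

and then a collection API call along the lines of

  http://anyhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=2

lets the Overseer create and assign the cores across the empty nodes.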

- Mark

> 
> -g
> 
> 
> 
> On Fri, Mar 15, 2013 at 7:14 PM, Mark Miller <markrmil...@gmail.com> wrote:
> 
>> 
>> On Mar 15, 2013, at 10:04 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>> 
>>> i think those followers are red from trying to forward requests to the
>>> overseer while it was being restarted.  i guess i'll see if they become
>>> green over time.  or i guess i can restart them one at a time..
>> 
>> Restarting the cluster should clear things up. It shouldn't take too long for
>> those nodes to recover though - they should have been up to date before.
>> The couple of exceptions you posted definitely indicate something is out of
>> whack. It's something I'd like to get to the bottom of.
>> 
>> - Mark
>> 
>>> 
>>> 
>>> On Fri, Mar 15, 2013 at 6:53 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>> 
>>>> it doesn't appear to be a shard1 vs shard11 issue... 60% of my followers
>>>> are red now in the solr cloud graph.. trying to figure out what that
>>>> means...
>>>> 
>>>> 
>>>> On Fri, Mar 15, 2013 at 6:48 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>> 
>>>>> I restarted the overseer node and another took over, queues are empty now.
>>>>> 
>>>>> the server with core production_things_shard1_2
>>>>> is having these errors:
>>>>> 
>>>>> shard update error RetryNode:
>>>>> http://10.104.59.189:8883/solr/production_things_shard11_replica1/:
>>>>> org.apache.solr.client.solrj.SolrServerException: Server refused connection at:
>>>>> http://10.104.59.189:8883/solr/production_things_shard11_replica1
>>>>> 
>>>>> for shard11!!!
>>>>> 
>>>>> I also got some strange errors on the restarted node.  Makes me wonder if
>>>>> there is a string-matching bug for shard1 vs shard11?
>>>>> 
>>>>> SEVERE: :org.apache.solr.common.SolrException: Error getting leader from zk
>>>>> at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:771)
>>>>> at org.apache.solr.cloud.ZkController.register(ZkController.java:683)
>>>>> at org.apache.solr.cloud.ZkController.register(ZkController.java:634)
>>>>> at org.apache.solr.core.CoreContainer.registerInZk(CoreContainer.java:890)
>>>>> at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:874)
>>>>> at org.apache.solr.core.CoreContainer.register(CoreContainer.java:823)
>>>>> at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:633)
>>>>> at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>>>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>> at java.lang.Thread.run(Thread.java:722)
>>>>> Caused by: org.apache.solr.common.SolrException: There is conflicting
>>>>> information about the leader of shard: shard1 our state says:
>>>>> http://10.104.59.189:8883/solr/collection1/ but zookeeper says:
>>>>> http://10.217.55.151:8883/solr/collection1/
>>>>> at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:756)
>>>>> 
>>>>> INFO: Releasing directory:/vol/ubuntu/talemetry_match_solr/solr_server/solr/production_things_shard11_replica1/data/index
>>>>> Mar 15, 2013 5:52:34 PM org.apache.solr.common.SolrException log
>>>>> SEVERE: org.apache.solr.common.SolrException: Error opening new searcher
>>>>> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1423)
>>>>> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1535)
>>>>> 
>>>>> SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state
>>>>> recovering for 10.76.31.67:8883_solr but I still do not see the requested
>>>>> state. I see state: active live:true
>>>>> at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:948)
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Mar 15, 2013 at 5:05 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>> 
>>>>>> Strange - we hardened that loop in 4.1 - so I'm not sure what happened
>>>>>> here.
>>>>>> 
>>>>>> Can you do a stack dump on the overseer and see if you see an Overseer
>>>>>> thread running perhaps? Or just post the results?
>>>>>> 
>>>>>> To recover, you should be able to just restart the Overseer node and
>>>>>> have someone else take over - they should pick up processing the queue.
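>>>>>> 
>>>>>> If you want to sanity check that the queue actually drains afterwards, you
>>>>>> can peek at it with ZooKeeper's own client (path and port from memory -
>>>>>> adjust to your setup):
>>>>>> 
>>>>>>   bin/zkCli.sh -server localhost:2181
>>>>>>   ls /overseer/queue
>>>>>> 
>>>>>> An empty child list there means the new Overseer has worked through the backlog.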
>>>>>> 
>>>>>> Any logs you might be able to share could be useful too.
>>>>>> 
>>>>>> - Mark
>>>>>> 
>>>>>> On Mar 15, 2013, at 7:51 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>>>> 
>>>>>>> Also, looking at overseer_elect, everything looks fine.  node is valid
>>>>>>> and live.
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Mar 15, 2013 at 4:47 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Sorry, should have specified.  4.1
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Mar 15, 2013 at 4:33 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> What Solr version? 4.0, 4.1, or 4.2?
>>>>>>>>> 
>>>>>>>>> - Mark
>>>>>>>>> 
>>>>>>>>> On Mar 15, 2013, at 7:19 PM, Gary Yngve <gary.yn...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> my solr cloud has been running fine for weeks, but about a week ago, it
>>>>>>>>>> stopped dequeueing from the overseer queue, and now there are thousands
>>>>>>>>>> of tasks on the queue, most of which look like
>>>>>>>>>> 
>>>>>>>>>> {
>>>>>>>>>> "operation":"state",
>>>>>>>>>> "numShards":null,
>>>>>>>>>> "shard":"shard3",
>>>>>>>>>> "roles":null,
>>>>>>>>>> "state":"recovering",
>>>>>>>>>> "core":"production_things_shard3_2",
>>>>>>>>>> "collection":"production_things",
>>>>>>>>>> "node_name":"10.31.41.59:8883_solr",
>>>>>>>>>> "base_url":"http://10.31.41.59:8883/solr"}
>>>>>>>>>> 
>>>>>>>>>> i'm trying to create a new collection through the collection API, and
>>>>>>>>>> obviously nothing is happening...
>>>>>>>>>> 
>>>>>>>>>> any suggestion on how to fix this?  drop the queue in zk?
>>>>>>>>>> 
>>>>>>>>>> how could it have gotten into this state in the first place?
>>>>>>>>>> 
>>>>>>>>>> thanks,
>>>>>>>>>> gary
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>> 
