[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193435#comment-16193435 ]

Scott Blum commented on SOLR-11423:
-----------------------------------

[~noble.paul] both good questions!

> This looks fine. I'm just worried about the extra cost of 
> {{Stat stat = zookeeper.exists(dir, null, true);}} for each call.

Indeed!  That's why in my second pass, I added {{offerPermits}}.  As long as 
the queue is mostly empty, each client only bothers checking the stats about 
once every 200 queued entries.  Agreed on a "create a sequential node only if 
the parent node has fewer than XXX entries" approach.
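The amortization idea above can be sketched as follows. This is a hypothetical illustration, not the actual patch: the class name, the {{IntSupplier}} standing in for the {{zookeeper.exists(...)}} round trip, and the batch size of 200 are all assumptions drawn from the comment.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Sketch of the "offerPermits" idea: amortize one expensive queue-size
// check (a ZooKeeper exists() round trip) across a batch of offers.
class BatchedCapCheck {
    private final IntSupplier queueSize;   // stands in for zookeeper.exists(dir, null, true).getNumChildren()
    private final int maxQueueSize;
    private final int batch;
    private final AtomicInteger permits = new AtomicInteger(0);

    BatchedCapCheck(IntSupplier queueSize, int maxQueueSize, int batch) {
        this.queueSize = queueSize;
        this.maxQueueSize = maxQueueSize;
        this.batch = batch;
    }

    /** Returns true if an offer may proceed; hits the real size check only once per batch. */
    boolean tryAcquire() {
        if (permits.getAndDecrement() > 0) {
            return true;                   // cheap path: spend a cached permit
        }
        int size = queueSize.getAsInt();   // expensive path: one real size check
        if (size >= maxQueueSize) {
            permits.set(0);
            return false;                  // queue is at or over the cap
        }
        // Refill permits, never handing out more than the remaining headroom.
        permits.set(Math.min(batch, maxQueueSize - size) - 1);
        return true;
    }
}
```

With a batch of 200, a mostly-empty queue costs one stat call per ~200 enqueues instead of one per enqueue.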

> Another point to consider: is the rest of the Solr code designed to handle 
> this error? I guess the thread should wait for a few seconds and retry once 
> the number of items falls below the limit. This would dramatically reduce the 
> number of other errors in the system.

I thought about this point quite a bit, and came down on the side of erroring 
immediately.  Again, I'm thinking of this mostly like an automatic emergency 
shutoff in a nuclear reactor: you hope you never need it.  The point being, if 
you're in a state where you have 20k items in the queue, you're already in a 
pathologically bad state.  I can't see how adding latency and hoping things get 
better would improve the situation vs. erroring out immediately.  I've never 
seen a Solr cluster recover on its own once the queue got that high; it always 
required manual intervention.
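The fail-fast behavior argued for above might look like this in miniature. A hypothetical sketch, not the real Overseer queue: the class, the exception type, and the in-memory deque are assumptions; the real queue lives in ZooKeeper.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a capped queue that throws immediately when full, rather
// than sleeping and retrying: by the time the cap is hit, the cluster
// is already in a pathological state, so failing fast surfaces the
// problem instead of adding latency.
class CappedQueue<T> {
    static class QueueFullException extends IllegalStateException {
        QueueFullException(String msg) { super(msg); }
    }

    private final Deque<T> items = new ArrayDeque<>();
    private final int maxSize;

    CappedQueue(int maxSize) { this.maxSize = maxSize; }

    synchronized void offer(T item) {
        if (items.size() >= maxSize) {
            throw new QueueFullException(
                "queue full: " + items.size() + " >= " + maxSize);
        }
        items.addLast(item);
    }

    synchronized int size() { return items.size(); }
}
```

A caller that hits the exception knows immediately that the system needs attention, instead of silently piling more entries into ZK.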

> Overseer queue needs a hard cap (maximum size) that clients respect
> -------------------------------------------------------------------
>
>                 Key: SOLR-11423
>                 URL: https://issues.apache.org/jira/browse/SOLR-11423
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Scott Blum
>            Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the 
> overseer queue with literally thousands and thousands of queued state 
> changes.  Many of these end up being duplicated up/down state updates.  Our 
> production cluster has gotten to the 100k queued items level many times, and 
> there's nothing useful you can do at this point except manually purge the 
> queue in ZK.  Recently, it hit 3 million queued items, at which point our 
> entire ZK cluster exploded.
> I propose a hard cap.  Any client trying to enqueue an item when a queue is 
> full would throw an exception.  I was thinking maybe 10,000 items would be a 
> reasonable limit.  Thoughts?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
