[
https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193435#comment-16193435
]
Scott Blum commented on SOLR-11423:
-----------------------------------
[~noble.paul] both good questions!
> This looks fine. I'm just worried about the extra cost of
> {{Stat stat = zookeeper.exists(dir, null, true);}} for each call.
Indeed! That's why, in my second pass, I added `offerPermits` (sketched below). As
long as the queue is mostly empty, each client only bothers checking the stat
about every 200 queued entries. Agreed on a "create sequential node if parent
node has fewer than XXX entries" approach.
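For concreteness, here's a minimal sketch of that `offerPermits` idea written
against the plain ZooKeeper client API, not the actual patch; the class and
method names (BoundedQueuePermits, checkBeforeOffer) and the batch-size constant
are hypothetical:
{code:java}
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/**
 * Hypothetical sketch: amortize the cost of the exists() stat call by only
 * re-checking the queue size once a batch of locally granted permits is used up.
 */
class BoundedQueuePermits {
  private static final int CHECK_BATCH = 200;   // re-check roughly every 200 offers
  private final AtomicInteger offerPermits = new AtomicInteger(0);
  private final ZooKeeper zk;
  private final String dir;
  private final int maxQueueSize;

  BoundedQueuePermits(ZooKeeper zk, String dir, int maxQueueSize) {
    this.zk = zk;
    this.dir = dir;
    this.maxQueueSize = maxQueueSize;
  }

  /** Throws if the queue is (approximately) over the cap; cheap in the common case. */
  void checkBeforeOffer() throws KeeperException, InterruptedException {
    if (offerPermits.getAndDecrement() > 0) {
      return; // still have local permits, no ZK round trip needed
    }
    Stat stat = zk.exists(dir, false);
    int queued = (stat == null) ? 0 : stat.getNumChildren();
    if (queued >= maxQueueSize) {
      offerPermits.set(0); // force a re-check on the next offer as well
      throw new IllegalStateException(
          "Queue " + dir + " has " + queued + " items, over the cap of " + maxQueueSize);
    }
    // Grant another batch of cheap offers; the batch shrinks as the queue fills up.
    offerPermits.set(Math.min(CHECK_BATCH, maxQueueSize - queued));
  }
}
{code}
The point of the permit counter is that in the happy path the exists() call is
paid once per batch of offers, and only near the cap does every offer pay for it.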
> Another point to consider is: is the rest of the Solr code designed to handle
> this error? I guess the thread should wait for a few seconds and retry once
> the number of items falls below the limit. This would dramatically reduce the
> number of other errors in the system.
I thought about this point quite a bit and came down on the side of erroring
immediately. Again, I'm thinking of this mostly like the automatic emergency
shutoff in a nuclear reactor: you hope you never need it. The point being, if
you have 20k items in the queue, you're already in a pathologically bad state.
I can't see how adding latency and hoping things get better would improve the
situation vs. erroring out immediately. I've never seen a Solr cluster recover
on its own once the queue got that high; it has always required manual
intervention.
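To illustrate the fail-fast choice, here's a rough, self-contained sketch of what
the enqueue path could look like; it throws right away rather than sleeping and
retrying. The FailFastOffer name is hypothetical, and the "qn-" child-node prefix
is an assumption about how the queue names its entries:
{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/** Hypothetical fail-fast enqueue: error immediately instead of sleep-and-retry. */
class FailFastOffer {
  static String offer(ZooKeeper zk, String dir, byte[] data, int maxQueueSize)
      throws KeeperException, InterruptedException {
    Stat stat = zk.exists(dir, false);
    int queued = (stat == null) ? 0 : stat.getNumChildren();
    if (queued >= maxQueueSize) {
      // No backoff loop: by the time the queue is this deep the cluster is
      // already unhealthy, and waiting has not been observed to help.
      throw new IllegalStateException(
          "Overseer queue " + dir + " has " + queued + " items (cap " + maxQueueSize + ")");
    }
    // Otherwise enqueue a sequential child node under the queue directory.
    return zk.create(dir + "/qn-", data,
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
  }
}
{code}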
> Overseer queue needs a hard cap (maximum size) that clients respect
> -------------------------------------------------------------------
>
> Key: SOLR-11423
> URL: https://issues.apache.org/jira/browse/SOLR-11423
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: SolrCloud
> Reporter: Scott Blum
> Assignee: Scott Blum
>
> When Solr gets into pathological GC thrashing states, it can fill the
> overseer queue with literally thousands and thousands of queued state
> changes. Many of these end up being duplicated up/down state updates. Our
> production cluster has gotten to the 100k queued items level many times, and
> there's nothing useful you can do at this point except manually purge the
> queue in ZK. Recently, it hit 3 million queued items, at which point our
> entire ZK cluster exploded.
> I propose a hard cap. Any client trying to enqueue an item when the queue is
> full would throw an exception. I was thinking maybe 10,000 items would be a
> reasonable limit. Thoughts?