[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284975#comment-16284975 ] Noble Paul commented on SOLR-11423: --- Sorry for the late response +1 > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > Fix For: 7.2, master (8.0) > > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284120#comment-16284120 ] ASF subversion and git services commented on SOLR-11423: Commit 3a7f1071644ffe11ee74c96cfd4946204b6544b5 in lucene-solr's branch refs/heads/master from [~dragonsinth] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=3a7f107 ] SOLR-11423: fix solr/CHANGES.txt, fixed in 7.2 not 7.1 > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284118#comment-16284118 ] ASF subversion and git services commented on SOLR-11423: Commit 576b0b5d659527a61ebdd80347c6fa14dc0a086f in lucene-solr's branch refs/heads/branch_7_2 from [~dragonsinth] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=576b0b5 ] SOLR-11423: Overseer queue needs a hard cap (maximum size) that clients respect > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284072#comment-16284072 ] ASF subversion and git services commented on SOLR-11423: Commit 311105f1b0dad7f20ff6cd55be1d2eb9cd4246d6 in lucene-solr's branch refs/heads/branch_7x from [~dragonsinth] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=311105f ] SOLR-11423: Overseer queue needs a hard cap (maximum size) that clients respect > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284021#comment-16284021 ] Scott Blum commented on SOLR-11423: --- I'm fixing solr/CHANGES.txt while cherry-picking into 7x and 7_2. I'll push a change to master to move it when that's done. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283981#comment-16283981 ] Tomás Fernández Löbbe commented on SOLR-11423: -- bq. Sounds good to me. So backport to branch_7_2 and branch_7x? +1. And lets fix CHANGES.txt > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283972#comment-16283972 ] Scott Blum commented on SOLR-11423: --- Sounds good to me. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283926#comment-16283926 ] Varun Thacker commented on SOLR-11423: -- +1 to what Tomas suggested > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283908#comment-16283908 ] Tomás Fernández Löbbe commented on SOLR-11423: -- In my experience, 20k state updates in the queue is beyond the point of no return too. I think we should backport as it is now, and if needed in the future we can use a cluster property to address Scott's point. Lets get this to 7.2 > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282711#comment-16282711 ] Scott Blum commented on SOLR-11423: --- I didn't resolve due to Noble's #comment-16203208 but I have no objection to resolving. I dropped it on master because I wasn't sure what branches we'd want to backport to. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282306#comment-16282306 ] Adrien Grand commented on SOLR-11423: - The situation is quite weird as this change has been pushed to master only but has a CHANGES entry under 7.1. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16282305#comment-16282305 ] Adrien Grand commented on SOLR-11423: - [~dragonsinth] Is this issue good to resolve? I see commits have been pushed but it is still open? > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236706#comment-16236706 ] ASF subversion and git services commented on SOLR-11423: Commit 0637407ea4bf3a49ed5bdabadcfee650a8e0a200 in lucene-solr's branch refs/heads/master from [~dragonsinth] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0637407 ] SOLR-11423: fix typo > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum >Priority: Major > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236685#comment-16236685 ] Scott Blum commented on SOLR-11423: --- Oh wow, nice catch. How'd I mess that one up? > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum >Priority: Major > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16236399#comment-16236399 ] Tomás Fernández Löbbe commented on SOLR-11423: -- Thanks for adding this [~dragonsinth]. One question {code:java} // Allow this client to push up to 1% of the remaining queue capacity without rechecking. offerPermits.set(maxQueueSize - stat.getNumChildren() / 100); {code} Don't you want {{offerPermits.set((maxQueueSize - stat.getNumChildren()) / 100);}}, or maybe {{offerPermits.set(remainingCapacity / 100);}} > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum >Priority: Major > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206976#comment-16206976 ] Cao Manh Dat commented on SOLR-11423: - Should we modify the behavior here a little bit, instead of throw an IllegalStateException() (which can lead to many errors cause this is an unchecked exception) should we try first, if the queue is full, retry until timeout. [~dragonsinth] I really want to hear about your cluster status after SOLR-11443 get applied. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204874#comment-16204874 ] Scott Blum commented on SOLR-11423: --- Is it possible to pick a number related to ZK's inherent limits? That's really the goal here, prevent Solr from destroying ZK. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204860#comment-16204860 ] Noble Paul commented on SOLR-11423: --- I understand that it's not fool proof. However, when a user has hit a limit and he has no way of changing it other than downgrading to an older solr is worse. If we are on customer support this is a nightmare > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16204004#comment-16204004 ] Scott Blum commented on SOLR-11423: --- I'm happy to defer on this issue, but I just want to be clear that I actively dislike having a system property. It feels like a not useful piece of config, and worse, what happens if you don't set the same cap on every node? Remember this is enforced client side, not server side. If you accidentally have a mix of nodes where half of them cap at 20k, and half of them cap at 40k, then the moment you get above 20k any badly behaving 40k nodes are going to starve out the 20k nodes. It becomes unfair contention. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203208#comment-16203208 ] Noble Paul commented on SOLR-11423: --- there must be a system property to change the value before we can mark this as resolved > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203192#comment-16203192 ] Cao Manh Dat commented on SOLR-11423: - [~dragonsinth] Should this ticket be backported to branch_7x? > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16194900#comment-16194900 ] Scott Blum commented on SOLR-11423: --- Perhaps you're right. In our cluster though, Overseer is not able to chew through 20k entries very fast; it takes a long time (many minutes) to get through that number of items. Our current escape valve is a tiny separate tool that just literally just goes through the queue and deletes everything. :D > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193846#comment-16193846 ] Noble Paul commented on SOLR-11423: --- Waiting can help. If every node errors out immediately after the cut off the items in queue can get cleared immediately and there will be starvation . If we wait and retry after sometime , we can ensure that the events are generated at a lower rate. It helps in cases where a GC pause caused a pile up. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193719#comment-16193719 ] ASF GitHub Bot commented on SOLR-11423: --- Github user asfgit closed the pull request at: https://github.com/apache/lucene-solr/pull/257 > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193440#comment-16193440 ] ASF subversion and git services commented on SOLR-11423: Commit 9419366a8e68a599efa2d8a230a0158d1763d10d in lucene-solr's branch refs/heads/master from [~dragonsinth] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=9419366 ] SOLR-11423: Overseer queue needs a hard cap (maximum size) that clients respect > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16193435#comment-16193435 ] Scott Blum commented on SOLR-11423: --- [~noble.paul] both good questions! > This looks fine. I'm just worried about the extra cost of the extra cost of > {{Stat stat = zookeeper.exists(dir, null, true);}} for each call. Indeed! That's why in my second pass, I added `offerPermits`. As long as the queue is mostly empty, each client will only bother checking the stats about every 200 queued entries. Agreed on a "create sequential node if parent node has fewer then XXX entries". > Another point to consider is , is the rest of Solr code designed to handle > this error? I guess, the thread should wait for a few seconds and retry if > the no: of items fell below the limit. This will dramatically reduce the no: > of other errors in the system. I thought about this point quite a bit, and came down on the side of erroring immediately. Again, I'm thinking of this mostly like an automatic emergency shutoff in a nuclear reactor: you hope you never need it. The point being, if you're in a state where you have 20k items in the queue, you're already in a pathologically bad state. I can't see how adding latency and hoping things get better will improve the situation vs. erroring out immediately. I've never seen a solr cluster recover on its own once the queue got that high, it always required manual intervention. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192769#comment-16192769 ] Noble Paul commented on SOLR-11423: --- Another point to consider is , is the rest of Solr code designed to handle this error? I guess, the thread should wait for a few seconds and retry if the no: of items fell below the limit. This will dramatically reduce the no:of other errors in the system. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192660#comment-16192660 ] Noble Paul commented on SOLR-11423: --- This looks fine. I'm just worried about the extra cost of the extra cost of {{Stat stat = zookeeper.exists(dir, null, true);}} for each call. I wish we ZK had a conditional set operation . Anyway +1 > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191962#comment-16191962 ] Scott Blum commented on SOLR-11423: --- [~shalinmangar] [~noble.paul] either of you want to take a look at this change and +1 or -1? > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189037#comment-16189037 ] ASF GitHub Bot commented on SOLR-11423: --- GitHub user dragonsinth opened a pull request: https://github.com/apache/lucene-solr/pull/257 SOLR-11423: Overseer queue needs a hard cap (maximum size) that clients respect You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/lucene-solr jira/SOLR-11423 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/257.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #257 commit ef8e0934fb27530f0c9450b58872b2b11028f50a Author: Scott BlumDate: 2017-10-02T20:50:57Z SOLR-11423: Overseer queue needs a hard cap (maximum size) that clients respect > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188949#comment-16188949 ] ASF GitHub Bot commented on SOLR-11423: --- Github user asfgit closed the pull request at: https://github.com/apache/lucene-solr/pull/256 > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1617#comment-1617 ] Scott Blum commented on SOLR-11423: --- Patch passes tests for me. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188867#comment-16188867 ] Scott Blum commented on SOLR-11423: --- [~noble.paul] I went with 20k. We could make it configurable, but again, I intend this to be a general purpose safety valve to protect Zookeeper from exploding, which makes me think we could pick a universal value. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188866#comment-16188866 ] ASF GitHub Bot commented on SOLR-11423: --- GitHub user dragonsinth opened a pull request: https://github.com/apache/lucene-solr/pull/256 SOLR-11423: Overseer queue needs a hard cap (maximum size) that clients respect You can merge this pull request into a Git repository by running: $ git pull https://github.com/apache/lucene-solr SOLR-11423 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/256.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #256 commit ef8e0934fb27530f0c9450b58872b2b11028f50a Author: Scott BlumDate: 2017-10-02T20:50:57Z SOLR-11423: Overseer queue needs a hard cap (maximum size) that clients respect > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188807#comment-16188807 ] ASF subversion and git services commented on SOLR-11423: Commit ef8e0934fb27530f0c9450b58872b2b11028f50a in lucene-solr's branch refs/heads/SOLR-11423 from [~dragonsinth] [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=ef8e093 ] SOLR-11423: Overseer queue needs a hard cap (maximum size) that clients respect > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16188803#comment-16188803 ] Scott Blum commented on SOLR-11423: --- [~erickerickson] I have backported changes to reduce overseer task counts. This isn't an issue with normal operation. Think of this more as an automatic safety shutoff on a nuclear reactor. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16187121#comment-16187121 ] Erick Erickson commented on SOLR-11423: --- What version of Solr? There has been some work done recently to reduce the number of tasks the Overseer needs to process and I'm wondering if your observations include those changes or not. > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186952#comment-16186952 ] Noble Paul commented on SOLR-11423: --- It makes sense to do a hard cap. make it a configurable cluster property with a sensible default. I guess 10K is a pretty small limit > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11423) Overseer queue needs a hard cap (maximum size) that clients respect
[ https://issues.apache.org/jira/browse/SOLR-11423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186282#comment-16186282 ] Scott Blum commented on SOLR-11423: --- CC: [~jhump] [~noble.paul] [~shalinmangar] > Overseer queue needs a hard cap (maximum size) that clients respect > --- > > Key: SOLR-11423 > URL: https://issues.apache.org/jira/browse/SOLR-11423 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SolrCloud >Reporter: Scott Blum >Assignee: Scott Blum > > When Solr gets into pathological GC thrashing states, it can fill the > overseer queue with literally thousands and thousands of queued state > changes. Many of these end up being duplicated up/down state updates. Our > production cluster has gotten to the 100k queued items level many times, and > there's nothing useful you can do at this point except manually purge the > queue in ZK. Recently, it hit 3 million queued items, at which point our > entire ZK cluster exploded. > I propose a hard cap. Any client trying to enqueue a item when a queue is > full would throw an exception. I was thinking maybe 10,000 items would be a > reasonable limit. Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org