[jira] [Comment Edited] (SOLR-10524) Explore in-memory partitioning for processing Overseer queue messages
[ https://issues.apache.org/jira/browse/SOLR-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16001514#comment-16001514 ]

Joel Bernstein edited comment on SOLR-10524 at 5/8/17 8:58 PM:
---

I'm seeing the errors below when running the StreamExpressionTest. I suspect it's related to this ticket. I've been adding tests the past couple of days but only started seeing this today:

Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239409 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239410 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239411 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239412 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239413 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239413 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239414 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239415 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop

was (Author: joel.bernstein):
I'm seeing the errors below in the StreamingExpressionTest. I suspect it's related to this ticket:

Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239409 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239410 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239411 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239412 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239413 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239413 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239414 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop
[junit4] 2> java.lang.NullPointerException
[junit4] 2> 239415 ERROR (OverseerStateUpdate-97928916256817164-127.0.0.1:51485_solr-n_00) [n:127.0.0.1:51485_solr] o.a.s.c.Overseer Exception in Overseer main queue loop

> Explore in-memory partitioning for processing Overseer queue messages
> -
>
> Key: SOLR-10524
> URL: https://issues.apache.org/jira/browse/SOLR-10524
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Erick Erickson
> Attachments: SOLR-10524-NPE-fix.patch, SOLR-10524.patch,
>
[jira] [Comment Edited] (SOLR-10524) Explore in-memory partitioning for processing Overseer queue messages
[ https://issues.apache.org/jira/browse/SOLR-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15998138#comment-15998138 ]

Shalin Shekhar Mangar edited comment on SOLR-10524 at 5/5/17 10:56 AM:
---

Yes, I like this. Same performance, much smaller changes, and no chance of something going wrong in the cluster because of processing re-ordered messages. +1 to commit.

-There are optimizations we can do on the read side using multi-get. Let's open another issue to explore that as well.- Oops, ZooKeeper has no multi-get.

As a side note, there is a bug in the nsToMs method in testOverseer -- it actually treats the nanoseconds as milliseconds and then converts them to nanoseconds! I'll fix it separately.

was (Author: shalinmangar):
Yes, I like this. Same performance, much smaller changes, and no chance of something going wrong in the cluster because of processing re-ordered messages. +1 to commit.

There are optimizations we can do on the read side using multi-get. Let's open another issue to explore that as well.

As a side note, there is a bug in the nsToMs method in testOverseer -- it actually treats the nanoseconds as milliseconds and then converts them to nanoseconds! I'll fix it separately.

> Explore in-memory partitioning for processing Overseer queue messages
> -
>
> Key: SOLR-10524
> URL: https://issues.apache.org/jira/browse/SOLR-10524
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Erick Erickson
> Attachments: SOLR-10524.patch, SOLR-10524.patch, SOLR-10524.patch,
> SOLR-10524.patch
>
>
> There are several JIRAs (I'll link in a second) about trying to be more
> efficient about processing overseer messages as the overseer can become a
> bottleneck, especially with very large numbers of replicas in a cluster. One
> of the approaches mentioned near the end of SOLR-5872 (15-Mar) was to "read
> large no:of items say 1. put them into in memory buckets and feed them
> into overseer".
> This JIRA is to break out that part of the discussion as it might be an easy
> win whereas "eliminating the Overseer queue" would be quite an undertaking.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
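
For reference, the unit mix-up described in the side note about nsToMs can be sketched as below. This is only an illustration, not the code from testOverseer: the class name NanoTimeSketch and the method invertedNsToMs are hypothetical, and the "inverted" variant is a guess at the shape of the bug based on the wording of the comment. Only java.util.concurrent.TimeUnit from the JDK is used.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper class; the method name nsToMs mirrors the one mentioned in the
// comment, but this is not the code from testOverseer.
final class NanoTimeSketch {
    private NanoTimeSketch() {}

    /** Converts a duration in nanoseconds to whole milliseconds (fractions are truncated). */
    static long nsToMs(long nanos) {
        return TimeUnit.NANOSECONDS.toMillis(nanos);
    }

    /**
     * Assumed shape of the bug: the argument is treated as milliseconds and
     * converted to nanoseconds, so the result grows instead of shrinking.
     */
    static long invertedNsToMs(long nanos) {
        return TimeUnit.MILLISECONDS.toNanos(nanos);
    }

    public static void main(String[] args) {
        long elapsedNanos = 1_500_000L; // 1.5 ms expressed in nanoseconds
        System.out.println("correct:  " + nsToMs(elapsedNanos));
        System.out.println("inverted: " + invertedNsToMs(elapsedNanos));
    }
}
```

Naming the source unit explicitly through TimeUnit, rather than multiplying or dividing by a literal, makes this kind of inversion easier to spot in review.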
[jira] [Comment Edited] (SOLR-10524) Explore in-memory partitioning for processing Overseer queue messages
[ https://issues.apache.org/jira/browse/SOLR-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15997500#comment-15997500 ]

Scott Blum edited comment on SOLR-10524 at 5/4/17 9:56 PM:
---

Couple of thoughts:

1) In the places where you've changed Collection -> List, I would go one step further and make it a concrete ArrayList, to a) explicitly convey that the returned list is a mutable copy rather than a view of internal state and b) explicitly convey that sortAndAdd() is operating efficiently on said lists.

2) DQ.remove(id): don't you want to unconditionally knownChildren.remove(id), even if the ZK delete succeeds?

3) DQ.remove(id): there is no need to loop here; in fact, you'll get stuck in an infinite loop if someone else deletes the node you're targeting. The loop in removeFirst() exists because it tries a different id on each iteration.

Suggested remove(id) impl:

{code}
public void remove(String id) throws KeeperException, InterruptedException {
  // Remove the ZK node *first*; ZK will resolve any races with peek()/poll().
  // This is counterintuitive, but peek()/poll() will not return an element if the underlying
  // ZK node has been deleted, so it's okay to update knownChildren afterwards.
  try {
    String path = dir + "/" + id;
    zookeeper.delete(path, -1, true);
  } catch (KeeperException.NoNodeException e) {
    // Another client deleted the node first, this is fine.
  }
  updateLock.lockInterruptibly();
  try {
    knownChildren.remove(id);
  } finally {
    updateLock.unlock();
  }
}
{code}

was (Author: dragonsinth):
Couple of thoughts:

1) In the places where you've changed Collection -> List, I would go one step further and make it a concrete ArrayList, to a) explicitly convey that the returned list is a mutable copy rather than a view of internal state and b) explicitly convey that sortAndAdd() is operating efficiently on said lists.

2) DQ.remove(id): don't you need to unconditionally knownChildren.remove(id), even if the ZK delete succeeds?

3) DQ.remove(id): there is no need to loop here; in fact, you'll get stuck in an infinite loop if someone else deletes the node you're targeting. The loop in removeFirst() exists because it tries a different id on each iteration.

Suggested remove(id) impl:

{code}
public void remove(String id) throws KeeperException, InterruptedException {
  // Remove the ZK node *first*; ZK will resolve any races with peek()/poll().
  // This is counterintuitive, but peek()/poll() will not return an element if the underlying
  // ZK node has been deleted, so it's okay to update knownChildren afterwards.
  try {
    String path = dir + "/" + id;
    zookeeper.delete(path, -1, true);
  } catch (KeeperException.NoNodeException e) {
    // Another client deleted the node first, this is fine.
  }
  updateLock.lockInterruptibly();
  try {
    knownChildren.remove(id);
  } finally {
    updateLock.unlock();
  }
}
{code}

> Explore in-memory partitioning for processing Overseer queue messages
> -
>
> Key: SOLR-10524
> URL: https://issues.apache.org/jira/browse/SOLR-10524
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Erick Erickson
> Attachments: SOLR-10524.patch, SOLR-10524.patch
>
>
> There are several JIRAs (I'll link in a second) about trying to be more
> efficient about processing overseer messages as the overseer can become a
> bottleneck, especially with very large numbers of replicas in a cluster. One
> of the approaches mentioned near the end of SOLR-5872 (15-Mar) was to "read
> large no:of items say 1. put them into in memory buckets and feed them
> into overseer".
> This JIRA is to break out that part of the discussion as it might be an easy
> win whereas "eliminating the Overseer queue" would be quite an undertaking.
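
The bucketing idea quoted in the issue description (read a batch of queue items, then sort them into in-memory buckets) might look roughly like the sketch below. Everything here is an assumption made for illustration: QueueBucketsSketch, Message, and partition are hypothetical names, the bucket key is assumed to be the target collection, and none of this is taken from the attached patches. Note that grouping this way preserves order within a bucket but not across buckets, which is the re-ordering concern raised elsewhere in the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of in-memory partitioning; not the SOLR-10524 patch.
final class QueueBucketsSketch {
    private QueueBucketsSketch() {}

    /** Hypothetical queue item: a queue node id plus the collection it targets. */
    static final class Message {
        final String id;
        final String collection;

        Message(String id, String collection) {
            this.id = id;
            this.collection = collection;
        }
    }

    /**
     * Groups an already-ordered batch into per-collection buckets.
     * LinkedHashMap keeps buckets in first-seen order, and each ArrayList
     * keeps the original queue order of the messages inside that bucket.
     */
    static Map<String, List<Message>> partition(List<Message> batch) {
        Map<String, List<Message>> buckets = new LinkedHashMap<>();
        for (Message m : batch) {
            buckets.computeIfAbsent(m.collection, k -> new ArrayList<>()).add(m);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Message> batch = Arrays.asList(
            new Message("qn-1", "collA"),
            new Message("qn-2", "collB"),
            new Message("qn-3", "collA"));
        // Print each bucket with the number of messages it received.
        partition(batch).forEach((collection, messages) ->
            System.out.println(collection + " -> " + messages.size() + " message(s)"));
    }
}
```

A caller could then hand each bucket to the state-update logic in turn; whether that is safe depends on whether messages for different collections are truly independent.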