[jira] [Commented] (SOLR-6769) Election bug
[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897936#comment-16897936 ] Alexander S. commented on SOLR-6769:

Hi, unfortunately I can't test with the latest versions since we are tied to Solr 5. I tuned our caches and no longer see this error, so let's close it for now.

> Election bug
>
> Key: SOLR-6769
> URL: https://issues.apache.org/jira/browse/SOLR-6769
> Project: Solr
> Issue Type: Bug
> Reporter: Alexander S.
> Priority: Major
> Attachments: Screenshot 876.png
>
> Hello, I have a very simple setup: 2 shards and 2 replicas (4 nodes in total).
> What I did was just stop the shards, but while the first shard stopped immediately, the second one took about 5 minutes to stop. You can see on the screenshot what happened next. In short:
> 1. Shard 1 stopped normally.
> 2. Replica 1 became the leader.
> 3. Shard 2 was still performing some job but wasn't accepting connections.
> 4. Replica 2 did not become the leader because Shard 2 was still there but not working.
> 5. The entire cluster was down until Shard 2 stopped and Replica 2 became the leader.
> Marked as critical because this shuts down the entire cluster. Please adjust if I am wrong.

-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
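The cache tuning mentioned above is about searcher autowarming: large autowarmCount values make every new searcher replay cached entries, which lengthens commits and the shutdown path that waits for warming. A minimal sketch of the relevant solrconfig.xml knobs (the cache classes, sizes, and counts are illustrative, not taken from the reporter's actual config):

```xml
<!-- solrconfig.xml fragment: illustrative values only.
     autowarmCount controls how many entries from the old searcher's cache
     are re-populated into a new searcher's cache; setting it to 0 disables
     autowarming and keeps searcher opening (and thus shutdown) fast. -->
<query>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
</query>
```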
[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15539639#comment-15539639 ] Alexandre Rafalovitch commented on SOLR-6769:

There have been some fixes related to that, I believe. Is this reproducible against the latest version of Solr? If yes, the case can be updated with more details so it is more visible. If not, let's close it and see if somebody sees it again.
[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanelfocusedCommentId=14255146#comment-14255146 ] Alexander S. commented on SOLR-6769:

Correct, endless warming was causing this problem. So this is a bug in Solr: it waits for searchers to finish warming, which can take up to 5 minutes in some cases. The node itself goes down and stops accepting connections, but the election does not happen.
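The warming behavior described above is governed by a few solrconfig.xml settings besides the cache autowarm counts. A hedged sketch of the knobs involved (values are illustrative, not the reporter's actual config):

```xml
<!-- solrconfig.xml fragment: illustrative values only.
     useColdSearcher=true lets a new searcher start serving requests before
     warming completes instead of blocking on it; maxWarmingSearchers bounds
     how many searchers may warm concurrently. -->
<query>
  <useColdSearcher>true</useColdSearcher>
  <maxWarmingSearchers>2</maxWarmingSearchers>
  <!-- An empty newSearcher warming-query list avoids long explicit
       warm-up queries on every commit. -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries"/>
  </listener>
</query>
```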
[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252440#comment-14252440 ] Alexander S. commented on SOLR-6769: This might be related: http://lucene.472066.n3.nabble.com/Endless-100-CPU-usage-on-searcherExecutor-thread-td4175088.html Election bug Key: SOLR-6769 URL: https://issues.apache.org/jira/browse/SOLR-6769 Project: Solr Issue Type: Bug Reporter: Alexander S. Attachments: Screenshot 876.png Hello, I have a very simple set up: 2 shards and 2 replicas (4 nodes in total). What I did is just stopped the shards, but if first shard stopped immediately the second one took about 5 minutes to stop. You can see on the screenshot what happened next. In short: 1. Shard 1 stopped normally 3. Replica 1 became a leader 2. Shard 2 still was performing some job but wasn't accepting connection 4. Replica 2 did not became a leader because Shard 2 is still there but doesn't work 5. Entire cluster went down until Shard 2 stopped and Replica 2 became a leader Marked as critical because this shuts down the entire cluster. Please adjust if I am wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239203#comment-14239203 ] Alexander S. commented on SOLR-6769: Hi, yes, my terminology about shards and replicas wasn't clear, let me explain this better. * Solr: 4.8.1 * Java: java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) * We have 5 servers, 2 of which are big (16 CPU cores, 48G of RAM each) and 3 others are small (1 CPU and 1G of RAM). All servers have rapid SSD RAID 10. Each server runs a ZK instance, so we have 5 ZK instances in total. Those big servers also run Solr: the first one runs 2 instances and the second one also runs 2 replicas (so each shard has 2 replicas, the simplest SolrCloud setup from the wiki). So the cluster looks like this: {noformat} * Small 1G node: ZK * Small 1G node: ZK * Small 1G node: ZK * Big 16G node: ZK, Solr1, Solr2 * Big 16G node: ZK, Solr1.1, Solr2.1 {noformat} Stopped manually means I tried to manually stop Solr1 and Solr2, which were the leaders, by sending a TERM signal (we have service files so I did service stop and was expecting a graceful shut down). This was working for Solr1 and it went down normally and Solr1.1 became the leader instantly. Then I tried to do the same for Solr2, but once I sent the TERM it became not operable but didn't exit completely (orange on the screenshot), the process was still running for ≈ 5-10 minutes and the election didn't happen. As a result I get no node hosting shard errors, but was expecting Solr2.1 to become the leader instantly as it was with Solr1.1. As I understand this, the Solr2 didn't shut down instantly because there could be some background jobs, e.g. 
index merging, an in process commit, etc, *but then it should not stop accepting connections and should not change its status to down* until all background jobs are finished and it s really ready to go down and pass leadership to the Solr2.1. It seems like a bug in Solr, because all services were working normally, all ZK instances were up and operable, and Solr itself wasn't under a heavy load. Otherwise could you please point me where to look for any information about how to gracefully shut down instances? It would be good to have a button in the web UI to be able to force a replica to become the leader with one click. So then I would be able to force Solr1.1 and Solr 2.1 to become the leaders, wait until this happen and safely reboot Solr1 and solr2 instances. Best, Alexander Election bug Key: SOLR-6769 URL: https://issues.apache.org/jira/browse/SOLR-6769 Project: Solr Issue Type: Bug Reporter: Alexander S. Attachments: Screenshot 876.png Hello, I have a very simple set up: 2 shards and 2 replicas (4 nodes in total). What I did is just stopped the shards, but if first shard stopped immediately the second one took about 5 minutes to stop. You can see on the screenshot what happened next. In short: 1. Shard 1 stopped normally 3. Replica 1 became a leader 2. Shard 2 still was performing some job but wasn't accepting connection 4. Replica 2 did not became a leader because Shard 2 is still there but doesn't work 5. Entire cluster went down until Shard 2 stopped and Replica 2 became a leader Marked as critical because this shuts down the entire cluster. Please adjust if I am wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
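Later Solr releases added Collections API actions that approximate the "force a replica to become leader" button asked for above (they postdate the 4.8.1 used here, so this is a sketch for newer clusters only; the host, collection, and replica names below are hypothetical):

```shell
# Hypothetical host/collection/replica names; requires a Solr release
# that supports ADDREPLICAPROP and REBALANCELEADERS (roughly 5.x onward).
# 1. Mark the replica we want to lead as the preferred leader:
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&collection=mycoll&shard=shard2&replica=core_node4&property=preferredLeader&property.value=true'
# 2. Ask Solr to re-run elections so preferred leaders take over:
curl 'http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=mycoll'
```

With that in place, each leader could be moved off a node before stopping it, avoiding the window where a dying node still holds leadership.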
[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14238693#comment-14238693 ] Anshum Gupta commented on SOLR-6769: [~aheaven] I would recommend you to post such issues on the user-list before creating a JIRA. This would come in handy: https://wiki.apache.org/solr/UsingMailingLists Though I think it's not really an issue, I'm not closing this issue for now. However, I'll reduce the Priority on this one primarily due to lack of information on the issue. * What version of Solr were you running? * What version of Java? Web server? * How were you running it? External ZK? * What do you mean by stopped normally? Shard is a logical entity, replica is a physical one. Do you mean you stopped the leader of Shard1? * What did you expect should have happened? .. Election bug Key: SOLR-6769 URL: https://issues.apache.org/jira/browse/SOLR-6769 Project: Solr Issue Type: Bug Reporter: Alexander S. Priority: Critical Attachments: Screenshot 876.png Hello, I have a very simple set up: 2 shards and 2 replicas (4 nodes in total). What I did is just stopped the shards, but if first shard stopped immediately the second one took about 5 minutes to stop. You can see on the screenshot what happened next. In short: 1. Shard 1 stopped normally 3. Replica 1 became a leader 2. Shard 2 still was performing some job but wasn't accepting connection 4. Replica 2 did not became a leader because Shard 2 is still there but doesn't work 5. Entire cluster went down until Shard 2 stopped and Replica 2 became a leader Marked as critical because this shuts down the entire cluster. Please adjust if I am wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org