[jira] [Commented] (SOLR-6769) Election bug

2019-08-01 Thread Alexander S. (JIRA)


[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16897936#comment-16897936 ]

Alexander S. commented on SOLR-6769:


Hi, unfortunately I can't test with the latest versions since we are tied to 
Solr 5. I tuned our caches and no longer see this error, so let's close this 
for now.
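For context, searcher warm-up time is largely driven by the autowarmCount settings on the caches in solrconfig.xml, which is presumably the kind of tuning meant here. A minimal sketch (values are illustrative, not the reporter's actual configuration):

```xml
<!-- solrconfig.xml, inside <query>: illustrative values only -->
<!-- Each new searcher replays up to autowarmCount entries from the old
     caches before it can serve traffic; large values stretch warm-up. -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="32"/>
<queryResultCache class="solr.LRUCache"
                  size="512"
                  initialSize="512"
                  autowarmCount="0"/>
```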

> Election bug
> 
>
> Key: SOLR-6769
> URL: https://issues.apache.org/jira/browse/SOLR-6769
> Project: Solr
> Issue Type: Bug
> Reporter: Alexander S.
> Priority: Major
> Attachments: Screenshot 876.png
>
>
> Hello, I have a very simple setup: 2 shards and 2 replicas (4 nodes in 
> total).
> What I did was just stop the shards, but while the first shard stopped 
> immediately, the second one took about 5 minutes to stop. You can see on the 
> screenshot what happened next. In short:
> 1. Shard 1 stopped normally
> 2. Replica 1 became the leader
> 3. Shard 2 was still performing some job but wasn't accepting connections
> 4. Replica 2 did not become the leader because Shard 2 was still there but 
> not working
> 5. The entire cluster was down until Shard 2 stopped and Replica 2 became 
> the leader
> Marked as critical because this shuts down the entire cluster. Please adjust 
> if I am wrong.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-6769) Election bug

2016-10-01 Thread Alexandre Rafalovitch (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15539639#comment-15539639 ]

Alexandre Rafalovitch commented on SOLR-6769:

There have been some fixes related to that, I believe.

Is this reproducible against the latest version of Solr? If yes, the case can 
be updated with more details so it is more visible.

If not, let's close it and see if somebody sees it again.




[jira] [Commented] (SOLR-6769) Election bug

2014-12-21 Thread Alexander S. (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255146#comment-14255146 ]

Alexander S. commented on SOLR-6769:


Correct, endless warming was causing this problem. So this is a bug in Solr: 
it waits for searchers to finish warming, which can take up to 5 minutes in 
some cases. The node itself goes down and stops accepting connections, but the 
election does not happen.
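Minutes-long warming of the kind described above often comes from QuerySenderListener warm-up queries combined with large autowarmCount values; the settings below bound how long a node can be stuck warming. A sketch with an illustrative, made-up warm-up query (not the reporter's actual config):

```xml
<!-- solrconfig.xml, inside <query> (illustrative only) -->
<!-- newSearcher listeners run on every commit; heavy queries here delay
     searcher registration -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str><str name="sort">created_at desc</str></lst>
  </arr>
</listener>
<!-- Serve from the new searcher even before warming completes -->
<useColdSearcher>true</useColdSearcher>
<maxWarmingSearchers>2</maxWarmingSearchers>
```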




[jira] [Commented] (SOLR-6769) Election bug

2014-12-18 Thread Alexander S. (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252440#comment-14252440 ]

Alexander S. commented on SOLR-6769:


This might be related: 
http://lucene.472066.n3.nabble.com/Endless-100-CPU-usage-on-searcherExecutor-thread-td4175088.html




[jira] [Commented] (SOLR-6769) Election bug

2014-12-09 Thread Alexander S. (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14239203#comment-14239203 ]

Alexander S. commented on SOLR-6769:


Hi, yes, my terminology about shards and replicas wasn't clear; let me explain 
this better.

* Solr: 4.8.1
* Java:
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)
* We have 5 servers, 2 of which are big (16 CPU cores, 48G of RAM each) and 3 
others are small (1 CPU and 1G of RAM). All servers have rapid SSD RAID 10. 
Each server runs a ZK instance, so we have 5 ZK instances in total. The big 
servers also run Solr: the first one runs 2 instances and the second one runs 
their 2 replicas (so each shard has 2 replicas, the simplest SolrCloud setup 
from the wiki).

So the cluster looks like this:
{noformat}
* Small 1G node: ZK
* Small 1G node: ZK
* Small 1G node: ZK
* Big 16G node: ZK, Solr1, Solr2
* Big 16G node: ZK, Solr1.1, Solr2.1
{noformat}

"Stopped manually" means I tried to stop Solr1 and Solr2, which were the 
leaders, by sending a TERM signal (we have service files, so I ran service 
stop and expected a graceful shutdown). This worked for Solr1: it went down 
normally and Solr1.1 became the leader instantly. Then I tried the same for 
Solr2, but once I sent the TERM it became inoperable yet didn't exit 
completely (orange on the screenshot); the process was still running for 
≈ 5-10 minutes and the election didn't happen. As a result I got "no node 
hosting shard" errors, but I expected Solr2.1 to become the leader instantly, 
as happened with Solr1.1.
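A pre-restart safety check for the situation above can be scripted against the cluster state Solr publishes to ZK. The field names follow the clusterstate.json shape of that era, and the sample data below is made up to mimic the 2x2 layout during the incident:

```python
import json

def shards_without_leader(clusterstate: str, collection: str):
    """Return the names of shards that have no live, active leader replica."""
    state = json.loads(clusterstate)
    bad = []
    for shard, info in state[collection]["shards"].items():
        has_leader = any(
            replica.get("leader") == "true" and replica.get("state") == "active"
            for replica in info["replicas"].values()
        )
        if not has_leader:
            bad.append(shard)
    return bad

# Made-up sample: shard2's leader is down but still holds leadership
sample = json.dumps({
    "collection1": {"shards": {
        "shard1": {"replicas": {
            "core1": {"state": "down"},
            "core2": {"state": "active", "leader": "true"}}},
        "shard2": {"replicas": {
            "core3": {"state": "down", "leader": "true"},
            "core4": {"state": "active"}}},
    }}
})
print(shards_without_leader(sample, "collection1"))  # → ['shard2']
```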

As I understand it, Solr2 didn't shut down instantly because there could be 
some background jobs, e.g. index merging or an in-progress commit, *but then 
it should not stop accepting connections and should not change its status to 
down* until all background jobs are finished and it is really ready to go down 
and pass leadership to Solr2.1.

It seems like a bug in Solr, because all services were working normally, all 
ZK instances were up and operable, and Solr itself wasn't under heavy load. 
Otherwise, could you please point me to any information about how to 
gracefully shut down instances? It would also be good to have a button in the 
web UI to force a replica to become the leader with one click. Then I could 
force Solr1.1 and Solr2.1 to become the leaders, wait until that happens, and 
safely reboot the Solr1 and Solr2 instances.

Best,
Alexander
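For what it's worth, later Solr releases (around 5.1, if I recall correctly) added Collections API actions for roughly this: ADDREPLICAPROP with the preferredLeader property, then REBALANCELEADERS. A sketch that only builds the request URLs rather than sending them; the host, collection, shard, and replica names are hypothetical:

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr/admin/collections"  # hypothetical host

def set_preferred_leader(collection: str, shard: str, replica: str) -> str:
    # Mark one replica as the preferred leader via ADDREPLICAPROP
    params = {"action": "ADDREPLICAPROP", "collection": collection,
              "shard": shard, "replica": replica,
              "property": "preferredLeader", "property.value": "true"}
    return f"{BASE}?{urlencode(params)}"

def rebalance_leaders(collection: str) -> str:
    # Ask Solr to shift leadership onto the preferredLeader replicas
    params = {"action": "REBALANCELEADERS", "collection": collection}
    return f"{BASE}?{urlencode(params)}"

print(set_preferred_leader("collection1", "shard2", "core_node4"))
print(rebalance_leaders("collection1"))
```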




[jira] [Commented] (SOLR-6769) Election bug

2014-12-08 Thread Anshum Gupta (JIRA)

[ https://issues.apache.org/jira/browse/SOLR-6769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238693#comment-14238693 ]

Anshum Gupta commented on SOLR-6769:


[~aheaven] I would recommend posting such issues to the user list before 
creating a JIRA. This would come in handy:
https://wiki.apache.org/solr/UsingMailingLists

Though I think it's not really an issue, I'm not closing it for now. However, 
I'll reduce the priority on this one, primarily due to the lack of information:
* What version of Solr were you running?
* What version of Java? Web server?
* How were you running it? External ZK?
* What do you mean by "stopped normally"? A shard is a logical entity; a 
replica is a physical one. Do you mean you stopped the leader of Shard 1?
* What did you expect should have happened?
