[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-31 Thread Anshum Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125572#comment-15125572
 ] 

Anshum Gupta commented on SOLR-8619:


This makes sense, along with a check for new collection creation, i.e. when a
replica is in the INITIALIZING state, it can get to ACTIVE if there's no other
replica in the cluster state. Or we could add this check and allow the
replica to get to RECOVERING in such a case, and then allow the transition from
RECOVERING -> ACTIVE.

> A new replica should not become leader when all current replicas are down as 
> it leads to data loss
> --
>
> Key: SOLR-8619
> URL: https://issues.apache.org/jira/browse/SOLR-8619
> Project: Solr
> Issue Type: Bug
> Reporter: Anshum Gupta
>
> Here's what I'm talking about:
> * Start a 2 node solrcloud cluster
> * Create a 1 shard/1 replica collection
> * Add documents
> * Shut down the node that has the only active shard
> * ADDREPLICA for the shard/collection, so Solr would attempt to add a new 
> replica on the other node
> * Solr waits for a while before this replica becomes an active leader.
> * Index a few new docs
> * Bring up the old node
> * The replica comes up with its old index and then syncs to only contain
> the docs from the new leader.
> All old documents are lost in this case.
> Here are a few things that might work here:
> 1. Reject an ADDREPLICA call if all current replicas for the shard are down. 
> Considering the new replica can not sync from anyone, it doesn't make sense 
> for this replica to even come up
> 2. The replica shouldn't become active/leader unless it was the last known
> leader, or was active before it went into the recovering state, or there are
> no other replicas in the clusterstate.
> This might very well be related to SOLR-8173 but we should add a check to 
> ADDREPLICA as well.
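
Option 1 in the description above can be sketched as a simple precondition
check. This is only an illustration under stated assumptions: the class, enum
and method names are hypothetical and are not Solr APIs, the check looks only
at published replica states (a real check would also have to consult the live
nodes), and accepting the call for a shard with no replicas at all is an
assumption made here so that the sketch does not block brand-new shards.

```java
import java.util.Arrays;
import java.util.Collection;

/** Hypothetical sketch of an ADDREPLICA precondition; not Solr code. */
final class AddReplicaGuard {
    /** Stand-in for the replica states mentioned in this thread. */
    enum State { ACTIVE, RECOVERING, DOWN, RECOVERY_FAILED }

    /**
     * Returns true if the ADDREPLICA request should be accepted.
     * Assumption: a shard with no replicas yet has no data to lose, so it is allowed.
     * Otherwise at least one existing replica must be ACTIVE to sync from.
     */
    static boolean shouldAccept(Collection<State> existingReplicaStates) {
        if (existingReplicaStates.isEmpty()) {
            return true;
        }
        return existingReplicaStates.contains(State.ACTIVE);
    }

    public static void main(String[] args) {
        // The scenario from the description: the only replica is down.
        System.out.println(shouldAccept(Arrays.asList(State.DOWN)));
        // A shard that still has one active replica.
        System.out.println(shouldAccept(Arrays.asList(State.DOWN, State.ACTIVE)));
    }
}
```

With a check like this, the reproduction steps would fail at the ADDREPLICA
step instead of producing an empty leader.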



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-30 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125104#comment-15125104
 ] 

Mark Miller commented on SOLR-8619:
---

{quote}Typical cloud issues lead to all replicas of a shard going to 
recovery/down/recovery_failed, and the only way is to cold start it by shutting 
all down and bringing them back up. Will checking lastPublished for ACTIVE 
interfere with that?{quote}

I'm going back to improving that this week!



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-29 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124761#comment-15124761
 ] 

Shai Erera commented on SOLR-8619:
--

There are two separate issues here:

* A new replica added via ADDREPLICA
* Which replica should become a leader

In the first case, I don't think that the replica should *ever* become the 
leader, until it has had a chance to sync w/ a leader and first published its 
state as ACTIVE. I also thought along the lines of Erick's suggestion, i.e. add 
an INITIALIZING state to a Replica. Then, replicas can transition from 
INITIALIZING -> RECOVERING -> ACTIVE, but never INITIALIZING -> DOWN. Then, the 
DOWN -> ACTIVE transition is "safe" in that only non-initializing replicas can 
become active leaders in the case of conflicts, or a whole cluster restart, 
because we know they were once ACTIVE.
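
The transition rules described above can be sketched as a small state machine.
This is a minimal illustration, not Solr code: INITIALIZING is the state
proposed in this comment, all class and method names are hypothetical, and the
ACTIVE -> DOWN and DOWN -> RECOVERING edges are assumptions added so the model
is usable. RECOVERING -> DOWN is deliberately left out, because a replica that
came from INITIALIZING could otherwise reach DOWN without ever having been
ACTIVE. The leader rule also assumes that a RECOVERING replica does not lead.

```java
import java.util.Collection;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the proposed replica lifecycle; not Solr code. */
final class ReplicaLifecycleSketch {
    enum State { INITIALIZING, RECOVERING, ACTIVE, DOWN }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        // From the comment: INITIALIZING -> RECOVERING -> ACTIVE, never INITIALIZING -> DOWN.
        ALLOWED.put(State.INITIALIZING, EnumSet.of(State.RECOVERING));
        ALLOWED.put(State.RECOVERING, EnumSet.of(State.ACTIVE));
        // Assumption: an active replica can go down.
        ALLOWED.put(State.ACTIVE, EnumSet.of(State.DOWN));
        // From the comment: DOWN -> ACTIVE is safe. DOWN -> RECOVERING is an assumption.
        ALLOWED.put(State.DOWN, EnumSet.of(State.ACTIVE, State.RECOVERING));
    }

    static boolean canTransition(State from, State to) {
        return ALLOWED.get(from).contains(to);
    }

    /**
     * A replica may lead if it is known to have been ACTIVE once (ACTIVE or DOWN
     * in this model), or if every replica of the shard is still INITIALIZING,
     * which is the new-collection case where all replicas are equal.
     */
    static boolean mayLead(State candidate, Collection<State> allShardStates) {
        if (candidate == State.ACTIVE || candidate == State.DOWN) {
            return true;
        }
        if (candidate == State.INITIALIZING) {
            for (State s : allShardStates) {
                if (s != State.INITIALIZING) {
                    return false;
                }
            }
            return true;
        }
        return false; // RECOVERING: must finish syncing first
    }
}
```

In this model a new replica added next to two DOWN replicas is not eligible
to lead, while the replicas of a freshly created collection are.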

In the case of a new collection, where all replicas are new, we have two
choices: either we note that in the internal ADDREPLICA call, so they are added
in DOWN state, or (which is simpler I guess), since all replicas will be
INITIALIZING, one can become the leader since they're all equal.

For the second case, which replica should become the leader, the proposals made
here make sense, but IMO they belong to a separate issue. Using the index
version, the commit point info, etc. is good. But the problem that we hit is
that the ONLY _live_ replica at the moment was the new (empty) one. There were
two others in DOWN state, but their nodes did not belong to the cluster (ZK
issues, network splits ...), and then that replica decided to become the leader.
When the 2 others later joined the cluster, they replicated the "empty" index
from the leader, and data was lost.

If we added the new replica in the INITIALIZING state, it would stay that way,
and when the two others returned to the cluster, they would re-compete for
leadership, using all the proposals made above, and no data would be lost.



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-29 Thread Varun Thacker (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124439#comment-15124439
 ] 

Varun Thacker commented on SOLR-8619:
-

bq. Typical cloud issues lead to all replicas of a shard going to 
recovery/down/recovery_failed, and the only way is to cold start it by shutting 
all down and bringing them back up. Will checking lastPublished for ACTIVE 
interfere with that?

Yeah, I've always found it very tricky to help clients bring up a shard when all
replicas got into recovery/down/recovery_failed state. I guess now we have
FORCELEADER to do so.



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-29 Thread Ramkumar Aiyengar (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124331#comment-15124331
 ] 

Ramkumar Aiyengar commented on SOLR-8619:
-

Typical cloud issues lead to all replicas of a shard going to 
recovery/down/recovery_failed, and the only way is to cold start it by shutting 
all down and bringing them back up. Will checking lastPublished for ACTIVE 
interfere with that?

Another cue which can be useful is the index generation being zero.



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-29 Thread Anshum Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123616#comment-15123616
 ] 

Anshum Gupta commented on SOLR-8619:


But the user wouldn't get back a usable replica. We could add retries but fail
if it's not just a short event. The idea here is that if a user is expecting
traffic, which is typically when a user would want to add a replica, the
response from the addreplica call should assure him that a _usable_ replica was
added. If that wasn't the case, ask him to retry while also communicating the
reason for the error. If we don't do that, the user would have to check the
clusterstatus to confirm whether the new replica is actually usable.



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-29 Thread Anshum Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123606#comment-15123606
 ] 

Anshum Gupta commented on SOLR-8619:


I think we could just reuse the lastPublished state instead of adding a new 
state, while adding conditions that let things flow smoothly in cases of 
collection creation and custom routed shard creation.



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-29 Thread Anshum Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123603#comment-15123603
 ] 

Anshum Gupta commented on SOLR-8619:


Sure, this should work. Might have to add a few checks there for the conditions 
that Erick mentioned though.



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-28 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123037#comment-15123037
 ] 

Mark Miller commented on SOLR-8619:
---

It may not stick long term, but currently, if you simply set the lastPublished 
state we track to anything but ACTIVE it won't become leader.
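
The rule described here can be pictured with a tiny model. It is a sketch
under assumptions, not the actual Solr implementation: the class and method
names are hypothetical, and starting a newly added replica with DOWN as its
last published state is an assumption made for the example. As others note in
this thread, collection creation would need an extra condition on top of this.

```java
/** Hypothetical model of a core that remembers the last state it published. */
final class LastPublishedSketch {
    enum State { ACTIVE, RECOVERING, DOWN, RECOVERY_FAILED }

    private State lastPublished;

    LastPublishedSketch(State initial) {
        this.lastPublished = initial;
    }

    /** Record the state this core most recently announced to the cluster. */
    void publish(State state) {
        this.lastPublished = state;
    }

    /** The rule from the comment: anything but ACTIVE means "do not lead". */
    boolean mayBecomeLeader() {
        return lastPublished == State.ACTIVE;
    }

    public static void main(String[] args) {
        // Assumption: a replica created by ADDREPLICA starts out as DOWN.
        LastPublishedSketch fresh = new LastPublishedSketch(State.DOWN);
        System.out.println(fresh.mayBecomeLeader());
        fresh.publish(State.RECOVERING);
        // Published only after a successful sync with a leader.
        fresh.publish(State.ACTIVE);
        System.out.println(fresh.mayBecomeLeader());
    }
}
```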



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-28 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122882#comment-15122882
 ] 

Erick Erickson commented on SOLR-8619:
--

Perhaps a new state? "new"? "never_syncd"? "not_eligible_for_leader_election"? 
Whatever. Point is just a flag saying "I don't care what else you do, I 
shouldn't be leader yet".

Still, you'd have to reconcile such a state with collection creation, but that 
doesn't seem like a big deal.



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-28 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122860#comment-15122860
 ] 

Mark Miller commented on SOLR-8619:
---

bq. Sure, I strongly think we need to be intelligent in electing leaders. That 
would solve this problem but why would we want a new replica to get added up 
that can't do anything but consume resources for a core? Not a ton of resources 
but still. I guess you'll agree.

Because it seems if you want to add a replica, you want to add a replica.

Let's say I do add replica right when the first replica goes down for some
reason - like it loses its ZK connection due to a GC event. But then it
connects again. It almost seems preferable to me that my add replica call still
works, but it won't become the leader - then when the first replica quickly
re-establishes its connection to ZK, it will recover from it.

My thinking is, if I want to add a replica, I don't care that it has no one to 
recover from at any given moment. I want to add a replica to the shard now. Let 
the system work out when it's safe and possible to sync up with the shard. 
Otherwise, I have to process the fail, go look at why it happened, try and get 
that straightened out, try the call again, repeat, etc.

There doesn't seem to be a strong reason to fail - the call can easily work and
when the other replicas come back online, everything will settle out. We just
want to make sure it won't become the leader without recovering first.




[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-28 Thread Jason Gerlowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122781#comment-15122781
 ] 

Jason Gerlowski commented on SOLR-8619:
---

Throwing in my 2 cents.  New to SolrCloud, so feel free to ignore...

+1 for having a check to ensure that a replica isn't marked as a leader unless 
it's had a chance to sync with a leader.

+1 for having ADDREPLICA calls fail if there are no active replicas.  I'd be 
fine with allowing API users to create not-ready-for-leadership replicas if 
there was a great way of conveying that caveat to them.  But short of adding a 
replica-state option to CLUSTERSTATUS, I can't think of a good way to do this.  
IMO, it seems cleaner conceptually to prevent users up front from getting into 
this state.  Bit hand-wavy though, so take this rationale with a grain of salt.



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-28 Thread Anshum Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122721#comment-15122721
 ] 

Anshum Gupta commented on SOLR-8619:


Sure, I strongly think we need to be intelligent in electing leaders. That 
would solve this problem but why would we want a new replica to get added up 
that can't do anything but consume resources for a core? Not a ton of resources 
but still. I guess you'll agree.



[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-28 Thread Anshum Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122707#comment-15122707
 ] 

Anshum Gupta commented on SOLR-8619:


What I have in mind should play nicely with both of those things, else it's not
even a solution :)




[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-28 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122538#comment-15122538
 ] 

Mark Miller commented on SOLR-8619:
---

bq. 1. Reject an ADDREPLICA call if all current replicas for the shard are 
down. Considering the new replica can not sync from anyone, it doesn't make 
sense for this replica to even come up

That's probably true in this case though. I don't know if we should reject it, 
but at least make it aware it probably should not become the leader until it 
has sync'd from a leader?
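
To make the idea concrete, here is a minimal sketch of such an eligibility check. It is not Solr's election code: the class `LeaderEligibility`, the `ReplicaInfo` type, and the `hasSyncedFromLeader` flag are all hypothetical names invented for illustration. The exception for a shard with no other replicas is taken from point 2 of the issue description.

```java
import java.util.List;

/** Hypothetical model for illustration only; these are not Solr classes. */
final class LeaderEligibility {

    enum State { ACTIVE, RECOVERING, DOWN }

    /** Minimal stand-in for one replica entry in the cluster state. */
    static final class ReplicaInfo {
        final String name;
        final State state;
        // Assumed flag: true once this replica has synced from a leader
        // (or has itself been the leader) at least once.
        final boolean hasSyncedFromLeader;

        ReplicaInfo(String name, State state, boolean hasSyncedFromLeader) {
            this.name = name;
            this.state = state;
            this.hasSyncedFromLeader = hasSyncedFromLeader;
        }
    }

    /**
     * A candidate may lead if it has synced from a leader before, or if it is
     * the only replica registered for the shard (e.g. a brand-new collection).
     */
    static boolean mayBecomeLeader(ReplicaInfo candidate, List<ReplicaInfo> shardReplicas) {
        if (candidate.hasSyncedFromLeader) {
            return true;
        }
        for (ReplicaInfo other : shardReplicas) {
            if (!other.name.equals(candidate.name)) {
                // Another replica exists and may hold data the candidate lacks.
                return false;
            }
        }
        return true;
    }
}
```

With a check like this, the replica created by ADDREPLICA in the reported scenario would stay out of the election while the old replica is down, instead of being rejected outright.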




[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-28 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122534#comment-15122534
 ] 

Mark Miller commented on SOLR-8619:
---

This is expected behavior given the design of the system?

You can't expect not to lose data with these cases - it's a large part of what 
the min replication param is for. If you want to ensure your data is not lost, 
you need to ensure it hits more than one replica as a minimum. A replication 
factor of at least 3 is probably best to avoid data loss.
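
As a rough illustration of the client-side check this implies, the sketch below compares the number of replicas that acknowledged an update with a requested minimum. It only models the idea and does not call SolrJ: `MinRfCheck`, `achievedRf`, and `meetsMinRf` are hypothetical names, and the assumption is that the "min replication param" refers to Solr's `min_rf` update parameter, which reports the achieved replication factor and leaves the reaction to the client.

```java
import java.util.Map;

/** Hypothetical client-side helper for illustration; it does not use SolrJ. */
final class MinRfCheck {

    /** Counts how many replicas acknowledged the update. */
    static int achievedRf(Map<String, Boolean> ackByReplica) {
        int count = 0;
        for (boolean acked : ackByReplica.values()) {
            if (acked) {
                count++;
            }
        }
        return count;
    }

    /** Returns true when the update reached at least minRf replicas. */
    static boolean meetsMinRf(Map<String, Boolean> ackByReplica, int minRf) {
        if (minRf < 1) {
            throw new IllegalArgumentException("minRf must be >= 1");
        }
        return achievedRf(ackByReplica) >= minRf;
    }
}
```

A client that gets `false` here would retry or alert, since an update that landed on a single replica is exactly the case that is lost in the reported scenario.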




[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss

2016-01-28 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122520#comment-15122520
 ] 

Erick Erickson commented on SOLR-8619:
--

I'd also not want to even _try_ to ADDREPLICA if there were no active leaders.

Although I understand that you could start the ADDREPLICA command and _then_ 
all the other replicas could go down before it synced, so it looks like a 
belt-and-suspenders kind of thing.

Hmmm, does that play nice with
1> creating a collection? There are no replicas by definition
2> adding shards (implicit router)

Just random thoughts
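
A minimal sketch of such a guard, under the assumption that a shard with no registered replicas (a new collection, or a shard just added with the implicit router) must still be allowed through. `AddReplicaGuard` and `allowAddReplica` are hypothetical names, not part of Solr's Collections API handling.

```java
import java.util.List;

/** Hypothetical pre-check for ADDREPLICA; these are not Solr classes. */
final class AddReplicaGuard {

    enum State { ACTIVE, RECOVERING, DOWN }

    /**
     * Allows the request when the shard has no replicas yet (new collection or
     * newly added shard), or when at least one existing replica is ACTIVE and
     * can therefore serve as a sync source.
     */
    static boolean allowAddReplica(List<State> existingReplicaStates) {
        if (existingReplicaStates.isEmpty()) {
            return true;
        }
        for (State state : existingReplicaStates) {
            if (state == State.ACTIVE) {
                return true;
            }
        }
        // Replicas exist but none is active: the new replica could not sync.
        return false;
    }
}
```

This check alone cannot cover the race described above, where the other replicas go down after the command has started, so a second check at election time would still be needed.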
