[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125572#comment-15125572 ]

Anshum Gupta commented on SOLR-8619:

This makes sense, along with a check for new collection creation, i.e. when a Replica is in INITIALIZING state, it can get to ACTIVE if there's no other replica in the cluster state. Or we could add this check and allow the replica to get to RECOVERING in such a case, and then allow the transition from RECOVERING -> ACTIVE.

> A new replica should not become leader when all current replicas are down as it leads to data loss
> --
>
> Key: SOLR-8619
> URL: https://issues.apache.org/jira/browse/SOLR-8619
> Project: Solr
> Issue Type: Bug
> Reporter: Anshum Gupta
>
> Here's what I'm talking about:
> * Start a 2 node solrcloud cluster
> * Create a 1 shard/1 replica collection
> * Add documents
> * Shut down the node that has the only active shard
> * ADDREPLICA for the shard/collection, so Solr would attempt to add a new replica on the other node
> * Solr waits for a while before this replica becomes an active leader.
> * Index a few new docs
> * Bring up the old node
> * The replica comes up with its old index and then syncs to only contain the docs from the new leader.
>
> All old documents are lost in this case.
>
> Here are a few things that might work here:
> 1. Reject an ADDREPLICA call if all current replicas for the shard are down. Considering the new replica cannot sync from anyone, it doesn't make sense for this replica to even come up.
> 2. The replica shouldn't become active/leader unless it was either the last known leader or active before it went into recovering state, or there are no other replicas in the clusterstate.
>
> This might very well be related to SOLR-8173 but we should add a check to ADDREPLICA as well.
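Option 1 from the description (reject ADDREPLICA when every existing replica of the shard is down) could be sketched roughly as below. This is only an illustration under stated assumptions: `AddReplicaGuard`, its `State` enum and `shouldAccept` are hypothetical names, not Solr classes, and the empty-shard exception reflects the collection-creation case raised in the comments.

```java
import java.util.List;

// Hypothetical sketch of the proposed ADDREPLICA pre-check; not Solr's actual code.
final class AddReplicaGuard {
    // Simplified replica states, assumed for illustration only.
    enum State { ACTIVE, RECOVERING, DOWN }

    // Accept the call if the shard has no replicas yet (new collection or new shard),
    // or if at least one existing replica is not DOWN.
    static boolean shouldAccept(List<State> existingReplicaStates) {
        if (existingReplicaStates.isEmpty()) {
            return true; // no existing data that could be lost
        }
        return existingReplicaStates.stream().anyMatch(s -> s != State.DOWN);
    }
}
```

With the reproduction steps above, the shard has one replica in DOWN state when ADDREPLICA arrives, so `shouldAccept` would return false and the call would be rejected. Whether a RECOVERING replica should count as a valid sync source is left open here; the description only mentions replicas being down.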
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15125104#comment-15125104 ]

Mark Miller commented on SOLR-8619:

{quote}Typical cloud issues lead to all replicas of a shard going to recovery/down/recovery_failed, and the only way is to cold start it by shutting all down and bringing them back up. Will checking lastPublished for ACTIVE interfere with that?{quote}

I'm going back to improving that this week!
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124761#comment-15124761 ]

Shai Erera commented on SOLR-8619:

There are two separate issues here:
* A new replica added via ADDREPLICA
* Which replica should become a leader

In the first case, I don't think that the replica should *ever* become the leader until it has had a chance to sync w/ a leader and has first published its state as ACTIVE. I also thought along the lines of Erick's suggestion, i.e. add an INITIALIZING state to a Replica. Then, replicas can transition from INITIALIZING -> RECOVERING -> ACTIVE, but never INITIALIZING -> DOWN. Then, the DOWN -> ACTIVE transition is "safe" in that only non-initializing replicas can become active leaders in the case of conflicts, or a whole cluster restart, because we know they were once ACTIVE.

In case of a new collection, where all replicas are new, we have two choices: either we note that in the internal ADDREPLICA call, so they are added in DOWN state, or (which is simpler I guess), since all replicas will be INITIALIZING, one can become the leader since they're all equal.

For the second case, which replica should become the leader, the proposals made here make sense, but IMO they belong to a separate issue. Using the index version, the commit point info etc. are good. But the problem that we've hit is that the ONLY _live_ replica at the moment was the new (empty) one; there were two others in DOWN state, but their nodes did not belong to the cluster (ZK issues, network splits ...), and then that replica decided to become the leader. When the 2 others later joined the cluster, they replicated the "empty" index from the leader, and data was lost. If we added the new replica in the INITIALIZING state, it would stay that way, and when the two others returned to the cluster, they would re-compete for leadership, using all the proposals made above, and no data would be lost.
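The INITIALIZING proposal in this comment can be illustrated with a small sketch. It is a hypothetical model, not Solr's real `Replica.State` enum or election code: `ReplicaState`, `LeaderEligibility`, `isAllowed` and `mayBecomeLeader` are names assumed for illustration, and only the rules stated in the comment are encoded.

```java
import java.util.List;

// Hypothetical replica states, including the proposed INITIALIZING state.
enum ReplicaState { INITIALIZING, RECOVERING, ACTIVE, DOWN }

// Hypothetical helper encoding the two rules from the comment above.
final class LeaderEligibility {
    // The one transition the proposal explicitly forbids is INITIALIZING -> DOWN,
    // so that any DOWN replica is known to have been ACTIVE at some point.
    static boolean isAllowed(ReplicaState from, ReplicaState to) {
        return !(from == ReplicaState.INITIALIZING && to == ReplicaState.DOWN);
    }

    // A non-initializing replica may compete for leadership. An INITIALIZING
    // replica may only lead when every replica of the shard is INITIALIZING,
    // which is the new-collection case where all replicas are equally empty.
    static boolean mayBecomeLeader(ReplicaState candidate, List<ReplicaState> allReplicasInShard) {
        if (candidate != ReplicaState.INITIALIZING) {
            return true;
        }
        return allReplicasInShard.stream().allMatch(s -> s == ReplicaState.INITIALIZING);
    }
}
```

In the failure scenario described (one new replica plus two DOWN replicas), `mayBecomeLeader(INITIALIZING, [INITIALIZING, DOWN, DOWN])` is false, so the empty replica would wait until the others return and compete for leadership.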
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124439#comment-15124439 ]

Varun Thacker commented on SOLR-8619:

bq. Typical cloud issues lead to all replicas of a shard going to recovery/down/recovery_failed, and the only way is to cold start it by shutting all down and bringing them back up. Will checking lastPublished for ACTIVE interfere with that?

Yeah I've always found it very tricky to help clients bring up a shard when all replicas got into recovery/down/recovery_failed state. I guess now we have forceLeaders to do so
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15124331#comment-15124331 ]

Ramkumar Aiyengar commented on SOLR-8619:

Typical cloud issues lead to all replicas of a shard going to recovery/down/recovery_failed, and the only way is to cold start it by shutting all down and bringing them back up. Will checking lastPublished for ACTIVE interfere with that?

Another cue that can be useful is the index generation being zero.
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123616#comment-15123616 ]

Anshum Gupta commented on SOLR-8619:

But the user wouldn't get back a usable replica. We could add retries but fail if it's not just a short event. The idea here being, if a user is expecting traffic, typically the case where a user would want to add a replica, the response from the addreplica call should assure him that a _usable_ replica was added. If that wasn't the case, ask him to retry while also communicating the reason for the error. If we don't do that, the user would have to check the clusterstatus to confirm if the new replica is actually usable or not.
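The "retry, but fail if it's not just a short event" idea could look roughly like the bounded wait below. This is a sketch under assumptions: `AddReplicaWaiter` and `waitUntilUsable` are hypothetical names, and the `BooleanSupplier` stands in for whatever check would report that the new replica has become usable (for example, a cluster state lookup).

```java
import java.util.function.BooleanSupplier;

// Hypothetical bounded wait for a newly added replica to become usable.
final class AddReplicaWaiter {
    // Polls isUsable up to maxAttempts times, sleeping delayMillis between polls.
    // Returns true as soon as the replica is usable, false once the attempts are
    // exhausted, so the caller can fail the request and report the reason.
    static boolean waitUntilUsable(BooleanSupplier isUsable, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isUsable.getAsBoolean()) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                    return false;
                }
            }
        }
        return false;
    }
}
```

A short outage (the old replica reconnects within a few polls) would let the call succeed, while a longer one would return false, at which point the response could ask the user to retry and explain why.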
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123606#comment-15123606 ]

Anshum Gupta commented on SOLR-8619:

I think we could just reuse the lastPublished state instead of adding a new state, while adding conditions that let things flow smoothly in cases of collection creation and custom routed shard creation.
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123603#comment-15123603 ]

Anshum Gupta commented on SOLR-8619:

Sure, this should work. Might have to add a few checks there for the conditions that Erick mentioned though.
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123037#comment-15123037 ]

Mark Miller commented on SOLR-8619:

It may not stick long term, but currently, if you simply set the lastPublished state we track to anything but ACTIVE it won't become leader.
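The lastPublished idea can be pictured with a minimal sketch. It does not reproduce Solr's actual bookkeeping or election code: `CoreStateTracker`, its `State` enum and `eligibleForLeadership` are hypothetical names, and the only rule encoded is the one stated above, that a core whose last published state is anything but ACTIVE does not become leader.

```java
// Hypothetical stand-in for the per-core "last published state" tracking.
final class CoreStateTracker {
    // Simplified states, assumed for illustration only.
    enum State { ACTIVE, RECOVERING, DOWN }

    private State lastPublished;

    CoreStateTracker(State initial) {
        this.lastPublished = initial;
    }

    // Record the most recent state this core published.
    void publish(State state) {
        this.lastPublished = state;
    }

    // Only a core that last published ACTIVE may take part in leader election.
    boolean eligibleForLeadership() {
        return lastPublished == State.ACTIVE;
    }
}
```

Under this sketch, a replica created by ADDREPLICA would start with a non-ACTIVE lastPublished state and stay ineligible until it has recovered and published ACTIVE.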
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122882#comment-15122882 ]

Erick Erickson commented on SOLR-8619:

Perhaps a new state? "new"? "never_syncd"? "not_eligible_for_leader_election"? Whatever. Point is just a flag saying "I don't care what else you do, I shouldn't be leader yet". Still, you'd have to reconcile such a state with collection creation, but that doesn't seem like a big deal.
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122860#comment-15122860 ]

Mark Miller commented on SOLR-8619:

bq. Sure, I strongly think we need to be intelligent in electing leaders. That would solve this problem but why would we want a new replica to get added up that can't do anything but consume resources for a core? Not a ton of resources but still. I guess you'll agree.

Because it seems if you want to add a replica, you want to add a replica. Let's say I do add replica right when the first replica goes down for some reason - like it loses its ZK connection due to a GC event. But then it connects again. It almost seems preferable to me that my add replica call still works, but it won't become the leader - then when the first replica quickly re-establishes its connection to ZK, the new replica will recover from it.

My thinking is, if I want to add a replica, I don't care that it has no one to recover from at any given moment. I want to add a replica to the shard now. Let the system work out when it's safe and possible to sync up with the shard. Otherwise, I have to process the fail, go look at why it happened, try and get that straightened out, try the call again, repeat, etc.

There doesn't seem to be a strong reason to fail - the call can easily work and when the other replicas come back online, everything will settle out. We just want to make sure it won't become the leader without recovering first.
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122781#comment-15122781 ]

Jason Gerlowski commented on SOLR-8619:

Throwing in my 2 cents. New to SolrCloud, so feel free to ignore...

+1 for having a check to ensure that a replica isn't marked as a leader unless it's had a chance to sync with a leader.

+1 for having ADDREPLICA calls fail if there are no active replicas. I'd be fine with allowing API users to create not-ready-for-leadership replicas if there was a great way of conveying that caveat to them. But short of adding a replica-state option to CLUSTERSTATUS, I can't think of a good way to do this. IMO, it seems cleaner conceptually to prevent users up front from getting into this state. Bit hand-wavy though, so take this rationale with a grain of salt.
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122721#comment-15122721 ]

Anshum Gupta commented on SOLR-8619:

Sure, I strongly think we need to be intelligent in electing leaders. That would solve this problem but why would we want a new replica to get added up that can't do anything but consume resources for a core? Not a ton of resources but still. I guess you'll agree.
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122707#comment-15122707 ]

Anshum Gupta commented on SOLR-8619:

What I have in mind should play nice with both of those things else it's not even a solution :)
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122538#comment-15122538 ]

Mark Miller commented on SOLR-8619:

bq. 1. Reject an ADDREPLICA call if all current replicas for the shard are down. Considering the new replica can not sync from anyone, it doesn't make sense for this replica to even come up

That's probably true in this case though. I don't know if we should reject it, but at least make it aware that it probably should not become the leader until it has synced from a leader?
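The idea in Mark's comment, that a newly added replica should hold off on leadership until it has synced from a leader, could be modelled roughly as follows. This is a toy sketch and not Solr's election code: the `Replica` class, its fields, and `may_become_leader` are hypothetical names made up for illustration. It also folds in the exception from the issue description, where a replica with no peers in the cluster state may lead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    # Hypothetical model of a replica; these fields are not Solr's data model.
    name: str
    was_leader: bool = False          # it was the last known leader
    synced_from_leader: bool = False  # it has completed a sync from a leader


def may_become_leader(candidate: Replica, shard_replicas: List[Replica]) -> bool:
    """Return True if the candidate could take leadership without risking data loss."""
    others = [r for r in shard_replicas if r.name != candidate.name]
    if not others:
        # No other replica is registered for the shard (e.g. a brand-new
        # collection), so there is no older index that could be overwritten.
        return True
    # Otherwise require evidence that the candidate holds the shard's data.
    return candidate.was_leader or candidate.synced_from_leader
```

With the scenario from the issue, `may_become_leader(Replica("new"), [Replica("old", was_leader=True), Replica("new")])` is false, so the empty replica would wait for the old node rather than become leader.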
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122534#comment-15122534 ]

Mark Miller commented on SOLR-8619:

This is expected behavior given the design of the system? You can't expect not to lose data in these cases - it's a large part of what the min replication param is for. If you want to ensure your data is not lost, you need to ensure it hits more than one replica as a minimum. A replication factor of at least 3 is probably best to avoid data loss.
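For readers unfamiliar with the min replication param mentioned above, here is a small client-side sketch of how it might be used. It rests on assumptions not stated in this thread: that the request parameter is named `min_rf`, and that the update response reports the achieved replication factor as `rf` in its `responseHeader`. The helper names, host, and collection are hypothetical. As far as I understand, Solr does not roll back an update that falls short of `min_rf`; the client has to notice and react.

```python
from urllib.parse import urlencode


def update_url(base_url: str, collection: str, min_rf: int) -> str:
    """Build an update URL that asks Solr to report the achieved replication factor."""
    query = urlencode({"min_rf": min_rf, "wt": "json"})
    return f"{base_url}/{collection}/update?{query}"


def achieved_min_rf(response: dict, min_rf: int) -> bool:
    """Check a parsed JSON update response against the requested minimum.

    Assumes the achieved factor is reported as responseHeader["rf"];
    a missing value is treated as "not achieved".
    """
    rf = response.get("responseHeader", {}).get("rf")
    return rf is not None and rf >= min_rf
```

A caller would send its documents to `update_url("http://localhost:8983/solr", "test", 2)` and retry or raise an alert when `achieved_min_rf(...)` returns false.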
[jira] [Commented] (SOLR-8619) A new replica should not become leader when all current replicas are down as it leads to data loss
[ https://issues.apache.org/jira/browse/SOLR-8619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122520#comment-15122520 ]

Erick Erickson commented on SOLR-8619:

I'd also not want to even _try_ to ADDREPLICA if there were no active leaders. Although I understand that you could start the ADDREPLICA command and _then_ all the other replicas could go down before it synced, so it looks like a belt-and-suspenders kind of thing.

Hmmm, does that play nice with
1> creating a collection? There are no replicas by definition
2> adding shards (implicit router)

Just random thoughts
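The up-front check Erick describes, together with the two cases he asks about, might look something like the sketch below. It is illustrative only: `should_reject_addreplica` is a hypothetical helper, and replica states are modelled as plain lowercase strings rather than Solr's own types.

```python
from typing import List


def should_reject_addreplica(existing_states: List[str]) -> bool:
    """Decide whether an ADDREPLICA request should be refused up front.

    existing_states holds the state of every replica already registered
    for the target shard, e.g. ["down", "active"].
    """
    if not existing_states:
        # Collection creation, or a shard just added with the implicit
        # router: there are no replicas yet, so nothing can be lost.
        return False
    # Refuse when replicas exist but none is active to sync from.
    return not any(state == "active" for state in existing_states)
```

This check alone does not close the race Erick points out: the remaining replicas can still go down after the check passes and before the new replica has synced, so a guard at leader-election time would still be needed (the belt-and-suspenders part).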