[ https://issues.apache.org/jira/browse/SOLR-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047556#comment-16047556 ]
Shalin Shekhar Mangar commented on SOLR-10704:
----------------------------------------------

Thanks Andrzej. A few comments:

# The key for the watcher should be collection_name + coreNodeName -- that is necessary and sufficient to be unique across the cluster.
# Instead of {{if (replica.getState().equals(Replica.State.ACTIVE))}}, you should use {{replica.isActive(liveNodes)}} to check whether the replica is active -- another SolrCloud gotcha that we really should fix at some point.
# The RecoveryWatcher's latch is counted down even if there exists at least one replica other than the one being moved -- which is completely fine, but a bit confusing when reading the code -- perhaps a code comment is warranted.
# The RecoveryWatcher should have additional checks for replica types, e.g. there must be at least one active NRT or TLOG replica somewhere, otherwise the slice will be left leaderless because PULL-type replicas cannot become leaders.
# Unrelated to these changes -- it looks like if anyOneFailed is true, then we delete all newly created replicas from the target AND still continue to delete the source node as well?

Overall, this kind of operation is hard to guarantee in the current state of SolrCloud because at any time the leader can put another replica into LIR. If that happens after we have checked that the replica is active, then deleting the leader will leave that slice leaderless, as replicas in LIR cannot become leaders without recovering first. However, at this point, this is the best we can do.

> REPLACENODE can make the collection lose data when replicaFactor is 1
> ----------------------------------------------------------------------
>
>                 Key: SOLR-10704
>                 URL: https://issues.apache.org/jira/browse/SOLR-10704
>             Project: Solr
>          Issue Type: Bug
>     Security Level: Public (Default Security Level. Issues are Public)
>         Components: SolrCloud
>   Affects Versions: 6.2
>        Environment: Red Hat 4.8.3-9, JDK 1.8.0_121
>           Reporter: Daisy.Yuan
>           Assignee: Andrzej Bialecki
>            Fix For: master (7.0), 6.7
>
>        Attachments: 219.log, SOLR-10704.patch
>
>
> When a collection's replicaFactor is 1, it can lose data after executing the REPLACENODE command.
> It may be that the new replica on the target node has not completed recovery, but the old replica on the source node has already been deleted.
> In the end, recovery on the target failed with the following exception:
> 2017-05-18 17:08:48,587 | ERROR | recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 r:core_node3 | Error while trying to recover. core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
> at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)
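The gotcha in point 2 above -- that a replica's recorded state can still read ACTIVE in cluster state after its node has died, so liveness must be checked against the current set of live nodes -- can be illustrated with a minimal, self-contained sketch. The {{Replica}} class below is a hypothetical stand-in for illustration only, not Solr's actual {{org.apache.solr.common.cloud.Replica}}:

```java
import java.util.Set;

public class ReplicaLiveness {

    enum State { ACTIVE, DOWN, RECOVERING }

    // Hypothetical stand-in: mirrors only the two checks being contrasted.
    static final class Replica {
        final String nodeName;
        final State state;

        Replica(String nodeName, State state) {
            this.nodeName = nodeName;
            this.state = state;
        }

        State getState() {
            return state;
        }

        // Intent of replica.isActive(liveNodes): the recorded state must be
        // ACTIVE *and* the hosting node must currently be live. The recorded
        // state alone can be stale if the node died without updating it.
        boolean isActive(Set<String> liveNodes) {
            return state == State.ACTIVE && liveNodes.contains(nodeName);
        }
    }

    public static void main(String[] args) {
        Set<String> liveNodes = Set.of("node1:8983_solr");
        Replica onLiveNode = new Replica("node1:8983_solr", State.ACTIVE);
        Replica onDeadNode = new Replica("node2:8983_solr", State.ACTIVE);

        // The naive state check reports the replica on the dead node as active.
        System.out.println(onDeadNode.getState() == State.ACTIVE); // true
        // The liveness-aware check catches it.
        System.out.println(onLiveNode.isActive(liveNodes)); // true
        System.out.println(onDeadNode.isActive(liveNodes)); // false
    }
}
```

The same distinction is what makes the RecoveryWatcher checks in points 3 and 4 subtle: a watcher that only inspects recorded state may count down its latch for a replica that is not actually reachable.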