[ https://issues.apache.org/jira/browse/SOLR-10704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16047556#comment-16047556 ]

Shalin Shekhar Mangar commented on SOLR-10704:
----------------------------------------------

Thanks Andrzej.

A few comments:
# The key for the watcher should be collection_name + coreNodeName -- that is 
necessary and sufficient to be unique across the cluster.
# Instead of {{if (replica.getState().equals(Replica.State.ACTIVE))}}, you 
should use {{replica.isActive(liveNodes)}} to check whether a replica is 
active (see the sketch below) -- another gotcha of SolrCloud that we really 
should fix at some point.
# The RecoveryWatcher's latch is counted down even if there exists at least 
one replica other than the one being moved -- which is completely fine, but a 
bit confusing when reading the code -- perhaps a code comment is pertinent.
# The RecoveryWatcher should have additional checks for replica types (also 
covered in the sketch below): there must be at least one active NRT or TLOG 
replica somewhere, otherwise the slice will be left leaderless, because PULL 
type replicas cannot become leaders.
# Unrelated to these changes -- it looks like if {{anyOneFailed}} is true, 
then we delete all newly created replicas from the target AND continue to 
delete the source node as well?
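
For points 2 and 4, here is a minimal sketch of the combined check the 
RecoveryWatcher could run before counting down its latch. The helper name 
{{sliceWillKeepALeader}} and the {{movedCoreNodeName}} parameter are 
illustrative, not existing code; it assumes the {{Replica.Type}} enum from 
the 7.0 replica-types work, plus {{Replica.isActive(liveNodes)}} and 
{{Slice.getReplicas()}} from SolrJ:

{code:java}
import java.util.Set;

import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class RecoveryWatcherChecks {

  /**
   * Sketch only: returns true if, after ignoring the replica being moved,
   * the slice still has at least one active replica that is eligible to
   * become leader. "movedCoreNodeName" is a hypothetical parameter naming
   * the replica that REPLACENODE is about to delete from the source node.
   */
  public static boolean sliceWillKeepALeader(Slice slice,
                                             Set<String> liveNodes,
                                             String movedCoreNodeName) {
    for (Replica replica : slice.getReplicas()) {
      if (replica.getName().equals(movedCoreNodeName)) {
        continue; // skip the replica we are about to delete
      }
      // PULL replicas can never become leaders, so they must not satisfy
      // the watcher on their own.
      if (replica.getType() == Replica.Type.PULL) {
        continue;
      }
      // isActive(liveNodes) checks both the published state and whether the
      // hosting node is live -- a bare state == ACTIVE comparison misses
      // replicas whose node has gone away.
      if (replica.isActive(liveNodes)) {
        return true;
      }
    }
    return false;
  }
}
{code}

Counting down the latch only when a check like this returns true would cover 
both the liveness gotcha and the leaderless-slice risk at once.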

Overall, this kind of operation is hard to guarantee in the current state of 
SolrCloud because at any time the leader can put another replica into LIR. If 
that happens after we have checked that the replica is active, then deleting 
the leader will make that slice leaderless, because replicas in LIR cannot 
become leaders without recovering first. However, at this point, this is the 
best we can do.

> REPLACENODE can cause data loss in collections whose replicationFactor is 1
> -----------------------------------------------------------------------
>
>                 Key: SOLR-10704
>                 URL: https://issues.apache.org/jira/browse/SOLR-10704
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 6.2
>         Environment: Red Hat 4.8.3-9, JDK 1.8.0_121
>            Reporter: Daisy.Yuan
>            Assignee: Andrzej Bialecki 
>             Fix For: master (7.0), 6.7
>
>         Attachments: 219.log, SOLR-10704.patch
>
>
> When a replica belongs to a collection whose replicationFactor is 1, data 
> can be lost after executing the REPLACENODE command. 
> The new replica on the target node may not have finished recovering by the 
> time the old replica on the source node is deleted. 
> In the end, recovery on the target failed with the following exception:
> 2017-05-18 17:08:48,587 | ERROR | 
> recoveryExecutor-3-thread-2-processing-n:192.168.229.137:21103_solr 
> x:replace-hdfs-coll1_shard1_replica2 s:shard1 c:replace-hdfs-coll1 
> r:core_node3 | Error while trying to recover. 
> core=replace-hdfs-coll1_shard1_replica2:java.lang.NullPointerException
>         at org.apache.solr.update.PeerSync.alreadyInSync(PeerSync.java:339)


