[ https://issues.apache.org/jira/browse/SOLR-13945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16981287#comment-16981287 ]

Shalin Shekhar Mangar edited comment on SOLR-13945 at 11/25/19 4:29 AM:
------------------------------------------------------------------------

[~ichattopadhyaya] - the final commit was added in SOLR-4997 so that documents are visible when the sub-shard replicas come up. -It is not necessary if there is a single replica.- (note it is necessary to call this commit regardless of the replication factor)


was (Author: shalinmangar):
[~ichattopadhyaya] - the final commit was added in SOLR-4997 so that documents are visible when the sub-shard replicas come up. It is not necessary if there is a single replica.

> SPLITSHARD data loss due to "rollback"
> --------------------------------------
>
>                 Key: SOLR-13945
>                 URL: https://issues.apache.org/jira/browse/SOLR-13945
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Ishan Chattopadhyaya
>            Priority: Major
>         Attachments: SOLR-13945.patch, SOLR-13945.patch, SOLR-13945.patch
>
>
> # As per SOLR-7673, there is a commit on the parent shard *after state
> changes* have happened, i.e. from active/construction/construction to
> inactive/active/active. Please see
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L586-L588
> # Due to SOLR-12509, there's now a cleanup/rollback method called
> "cleanupAfterFailure" in the finally block that resets the state to
> active/construction/construction. Please see:
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/api/collections/SplitShardCmd.java#L657
> # When 2 is entered into due to a failure in 1, we have a situation where any
> documents that went into the subshards (because they are already active by
> now) are now lost after the parent becomes active.
> If my above understanding is correct, I am wondering:
> # Why is a commit to the parent shard needed *after* the parent shard is
> inactive, the subshards are active and the split operation has completed?
> # This rollback looks very suspicious. If the state of the subshards is
> already active and the parent is inactive, then what is the need for setting
> them back to construction? Seems like a crucial check is missing there. Also,
> why do we reset the subshard status back to construction instead of inactive?
> It is extremely misleading (and, frankly, ridiculous) for any external
> clusterstate monitoring tools to see the subshards go from CONSTRUCTION to
> ACTIVE to CONSTRUCTION and then see the subshard disappear.
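
The state sequence described in the issue can be modelled with a small toy example. This is only a sketch and not Solr's code: the class `SplitRollbackSketch`, the `Shards` type and every method name are hypothetical, and the guard in `rollbackIfNotCompleted` is one possible reading of the "crucial check" the reporter suspects is missing, not the fix attached to the issue. Only the state names and the two transitions (active/construction/construction to inactive/active/active, and the reset back) come from the description above.

```java
// Toy model of the shard states discussed in SOLR-13945; NOT Solr's actual code.
// All class and method names here are hypothetical.
class SplitRollbackSketch {

    enum SliceState { ACTIVE, INACTIVE, CONSTRUCTION }

    // A parent shard plus its two sub-shards, starting mid-split.
    static final class Shards {
        SliceState parent = SliceState.ACTIVE;
        SliceState sub0 = SliceState.CONSTRUCTION;
        SliceState sub1 = SliceState.CONSTRUCTION;

        // The switch described in the issue:
        // active/construction/construction -> inactive/active/active.
        void switchToSubShards() {
            parent = SliceState.INACTIVE;
            sub0 = SliceState.ACTIVE;
            sub1 = SliceState.ACTIVE;
        }

        // True once the parent is inactive and both sub-shards are active.
        boolean splitCompleted() {
            return parent == SliceState.INACTIVE
                && sub0 == SliceState.ACTIVE
                && sub1 == SliceState.ACTIVE;
        }

        // Unconditional reset to active/construction/construction,
        // regardless of how far the split has progressed.
        void rollbackUnconditionally() {
            parent = SliceState.ACTIVE;
            sub0 = SliceState.CONSTRUCTION;
            sub1 = SliceState.CONSTRUCTION;
        }

        // Hypothetical guard: skip the reset once the switch has happened,
        // because the sub-shards may already hold new documents.
        // Returns true if a rollback was performed.
        boolean rollbackIfNotCompleted() {
            if (splitCompleted()) {
                return false;
            }
            rollbackUnconditionally();
            return true;
        }
    }

    public static void main(String[] args) {
        Shards s = new Shards();
        s.switchToSubShards();
        // A late failure (for example in the final commit on the parent)
        // now reaches the cleanup path.
        boolean rolledBack = s.rollbackIfNotCompleted();
        System.out.println("rolledBack=" + rolledBack + " parent=" + s.parent);
    }
}
```

With the unconditional variant, a failure after the switch makes the parent active again while the sub-shards fall back to construction, which is the window in which documents indexed into the sub-shards would be lost according to the report. The guarded variant leaves inactive/active/active untouched in that case.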