[ 
https://issues.apache.org/jira/browse/SOLR-9438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-9438:
----------------------------------------
    Attachment: SOLR-9438.patch

Changes:
# We record the parent leader node name and the ephemeral owner of its live 
node (zk session id which created the live node) at the start of the split 
process.
# These two pieces of information, called "shard_parent_node" and 
"shard_parent_zk_session" respectively, are stored in the cluster state along 
with the slice information.
# When all replicas of all sub-shards are live, the overseer checks whether the 
parent leader node is still live and whether its ephemeral owner is unchanged. 
If so, it switches the sub-shards to active and the parent to inactive. If 
not, it changes the sub-shard state to a newly introduced "recovery_failed" 
state.
# Any shard in "recovery_failed" state does not receive any indexing or 
querying traffic.
# I beefed up the test to check for both outcomes and to assert that all 
documents that were successfully indexed are visible on a distributed search. 
Additionally, if the split succeeds, we also assert that all replicas of the 
sub-shards are consistent, i.e. have the same number of docs.
# Fixed a test bug where concurrent watcher invocations on collection state 
would shut down the leader node again, even after the test had already 
restarted it to assert document counts.
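
For readers skimming the list above, the overseer's check can be sketched roughly as follows. This is a minimal illustration only, not code from the patch: the class, method and parameter names are hypothetical, and the map of live nodes to ephemeral owners stands in for whatever the overseer actually reads from ZooKeeper. Only the two recorded properties and the "recovery_failed" state come from the description above.

```java
import java.util.Map;

// Hypothetical sketch; none of these names are taken from the Solr code base.
class SplitDecisionSketch {

    /** Sub-shard outcomes described in this comment. */
    enum SubShardState { ACTIVE, RECOVERY_FAILED }

    /** Values recorded in the cluster state when the split starts. */
    static final class ParentSnapshot {
        final String shardParentNode;      // "shard_parent_node"
        final long shardParentZkSession;   // "shard_parent_zk_session"

        ParentSnapshot(String shardParentNode, long shardParentZkSession) {
            this.shardParentNode = shardParentNode;
            this.shardParentZkSession = shardParentZkSession;
        }
    }

    /**
     * Called once all replicas of all sub-shards are live.
     *
     * @param recorded       what was stored at the start of the split
     * @param liveNodeOwners assumed view of live nodes: node name mapped to
     *                       the ZK session id that owns its live node
     * @return ACTIVE if the parent leader never lost its session (the parent
     *         would then go inactive), otherwise RECOVERY_FAILED
     */
    static SubShardState decideSubShardState(ParentSnapshot recorded,
                                             Map<String, Long> liveNodeOwners) {
        Long currentOwner = liveNodeOwners.get(recorded.shardParentNode);
        // Parent node is gone: it died and has not come back.
        if (currentOwner == null) {
            return SubShardState.RECOVERY_FAILED;
        }
        // Parent node is live but under a new session: it restarted mid-split.
        if (currentOwner.longValue() != recorded.shardParentZkSession) {
            return SubShardState.RECOVERY_FAILED;
        }
        return SubShardState.ACTIVE;
    }
}
```

The session comparison is what catches a leader that died and came back before the check ran: the node name alone would look live in that case.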

Results of beasting are looking good as far as this particular bug is 
concerned, but there is a curious failure where one and only one core stays 
down and the wait-for-recovery check times out. I'm still digging.

> Shard split can lose data
> -------------------------
>
>                 Key: SOLR-9438
>                 URL: https://issues.apache.org/jira/browse/SOLR-9438
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: 4.10.4, 5.5.2, 6.1
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>            Priority: Critical
>              Labels: difficulty-medium, impact-high
>             Fix For: master (7.0), 6.3
>
>         Attachments: SOLR-9438-false-replication.log, 
> SOLR-9438-split-data-loss.log, SOLR-9438.patch, SOLR-9438.patch, 
> SOLR-9438.patch, SOLR-9438.patch
>
>
> Solr’s shard split can lose documents if the parent/sub-shard leader is 
> killed (or crashes) between the time the new sub-shard replica is created 
> and the time it recovers. In such a case the slice has already been set to 
> ‘recovery’ state, the sub-shard replica comes up, finds that no other 
> replica is up, waits out the leader vote wait time and then proceeds to 
> become the leader as well as publish itself as active. Once that happens, the 
> overseer, seeing that all replicas of the sub-shard are now ‘active’, sets the 
> parent slice as ‘inactive’ and the new sub-shard as ‘active’.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
