[ https://issues.apache.org/jira/browse/SOLR-13815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947902#comment-16947902 ]

Yonik Seeley commented on SOLR-13815:
-------------------------------------

Actually, doc_38 and doc_40 don't look the same.

When doc_38 is indexed on the new sub-shard, we see update.distrib=FROMLEADER.
For doc_40, we see update.distrib=TOLEADER, so doc_40 was forwarded to the new 
leader rather than replicated from it.
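
For reference, those two values map to DistributedUpdateProcessor.DistribPhase. 
A minimal sketch of the distinction (the cameFromLeader helper and its 
surrounding context are hypothetical, only the enum and parseParam are real):

{code}
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.update.processor.DistributedUpdateProcessor.DistribPhase;

static boolean cameFromLeader(SolrParams params) {
  // parseParam() returns NONE when update.distrib is absent (a direct client update)
  DistribPhase phase = DistribPhase.parseParam(params.get("update.distrib"));
  // FROMLEADER -> doc_38 case: the sub-shard leader is replicating a doc it already indexed
  // TOLEADER   -> doc_40 case: some node is forwarding the doc to whoever it thinks is leader
  return phase == DistribPhase.FROMLEADER;
}
{code}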

If we look at DistributedZkUpdateProcessor, a slice is only considered a 
sub-slice if it is in the CONSTRUCTION or RECOVERY state:
{code}
  protected List<SolrCmdDistributor.Node> getSubShardLeaders(DocCollection coll, String shardId, String docId, SolrInputDocument doc) {
    Collection<Slice> allSlices = coll.getSlices();
    List<SolrCmdDistributor.Node> nodes = null;
    for (Slice aslice : allSlices) {
      final Slice.State state = aslice.getState();
      if (state == Slice.State.CONSTRUCTION || state == Slice.State.RECOVERY) {
        // ...
{code}

This must be the source of the race condition: the state of the sub-slice has 
just changed to ACTIVE (and hence it is no longer returned by 
getSubShardLeaders), but the code that checks for and forwards to the sub-shard 
leader has already completed against the old state, so the update never reaches 
the sub-shard that now owns it.

I'm not sure what the implications of removing the state checks are. We either 
need to do that, or somehow close the hole that causes the race condition.
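
One way to close the hole might be a second look at the cluster state after the 
check/forward step. This is a rough sketch only, not a worked-out patch: 
processLocallyAndForward and reRouteToNewLeader are hypothetical placeholders 
for the existing code paths, and the real fix may belong elsewhere (e.g. in how 
the sub-slice state switch is sequenced).

{code}
// Hypothetical double-check for a sub-slice flipping to ACTIVE mid-update.
List<SolrCmdDistributor.Node> subShardLeaders =
    getSubShardLeaders(coll, cloudDesc.getShardId(), docId, doc);
processLocallyAndForward(cmd, subShardLeaders); // placeholder for the existing path

// Re-read the cluster state: if the slice that now owns this doc's hash range
// is no longer the one we just indexed into, the routing above was computed
// from stale state and the doc must be re-sent to the new sub-shard leader.
DocCollection fresh = zkController.getClusterState().getCollection(collection);
Slice target = fresh.getRouter().getTargetSlice(docId, doc, null, req.getParams(), fresh);
if (!target.getName().equals(cloudDesc.getShardId())) {
  reRouteToNewLeader(cmd, target); // placeholder: forward cmd to target's leader
}
{code}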

> Live split can lose data
> ------------------------
>
>                 Key: SOLR-13815
>                 URL: https://issues.apache.org/jira/browse/SOLR-13815
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Yonik Seeley
>            Priority: Major
>         Attachments: fail.191004_053129, fail.191004_093307
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This issue is to investigate potential data loss during a "live" split (i.e. 
> the split happens while updates are flowing).
> This was discovered during the shared storage work, which was based on a 
> non-release branch_8x sometime before 8.3; hence the first steps are to try 
> to reproduce on the master branch without any shared storage changes.


