[jira] [Updated] (SOLR-11661) Race condition between core creation thread and recovery request from leader causes inconsistent view of documents

Shalin Shekhar Mangar (JIRA) Mon, 20 Nov 2017 00:52:26 -0800

     [ https://issues.apache.org/jira/browse/SOLR-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Shalin Shekhar Mangar updated SOLR-11661:
-----------------------------------------
    Attachment: 11458-2-MoveReplicaHDFSTest-log.txt

Full logs attached.

Dat and I analyzed the logs and we found this problem:
{code}
# A new collection called MoveReplicaHDFSTest_failed_coll is being created. New 
replicas core_node7 and core_node8 for the shard are in the process of being created.
# New core MoveReplicaHDFSTest_failed_coll_shard2_replica_n4 core_node7 tries 
to become leader, asks MoveReplicaHDFSTest_failed_coll_shard2_replica_n6 
core_node8 to sync
# Sync fails because core_node8 has no versions
# core_node7 becomes leader and asks core_node8 to recover
# core_node8 gets a request to recover and starts recovery thread 
recoveryExecutor-53-thread-1-processing-n:127.0.0.1:61049_solr
# core_node8 enters buffering state
# core_node8 sends prep recovery command to core_node7 and publishes itself in 
recovery state
# core_node7 has a thread in WaitForState and sees core_node8 as down currently
# At t=70388, a DataStreamer Exception is reported from DFSClient and leader 
core_node7 logs that it could not close the HDFS transaction log because no 
more good datanodes were available. These errors do not appear to be relevant 
to the problem
# core_node7 (leader) publishes itself as active
# core_node7 create core is complete
# core_node8 create thread (qtp1713789948-2124) sees that there is a leader and 
publishes itself as active, skipping recovery
# core_node8 create core command is successful
# collection create is finished
# core_node7 remains stuck in WaitForState because from now on it only sees 
core_node8 as active, never as recovering
# the recovery thread in core_node8 remains waiting in prep recovery
# New documents are added to the collection but they aren't visible to 
searchers because core_node8 is buffering and therefore ignores commit requests
{code}

So there is a race: the core create thread publishes the local core as active 
after the leader has already asked that core to recover. This is a side effect 
of SOLR-9566, which skips recovery for replicas that are being created as part 
of a new collection.
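
To make the interleaving concrete, here is a minimal, single-threaded sketch of the ordering described above. It is not Solr code and not a proposed patch: the class ReplicaStateSketch, the State enum, and every method name are hypothetical stand-ins for "the recovery thread publishes recovering" and "the create thread publishes active". The unguarded variant overwrites the state blindly, as in the trace. The guarded variant only moves from DOWN to ACTIVE with a compare-and-set; that is just one possible way to avoid the overwrite, and it is an assumption made for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ReplicaStateSketch {
    // Hypothetical, simplified replica states; not Solr's actual state model.
    enum State { DOWN, RECOVERING, ACTIVE }

    // Hypothetical stand-in for the published state of one replica (core_node8).
    static final class Replica {
        private final AtomicReference<State> state = new AtomicReference<>(State.DOWN);

        // Recovery thread: the leader asked this replica to recover.
        void publishRecovering() {
            state.set(State.RECOVERING);
        }

        // Create thread, unguarded: overwrites whatever state was published before.
        void publishActiveUnguarded() {
            state.set(State.ACTIVE);
        }

        // Create thread, guarded (assumed mitigation): only DOWN -> ACTIVE is allowed,
        // so a recovery that was already requested is not skipped.
        boolean publishActiveGuarded() {
            return state.compareAndSet(State.DOWN, State.ACTIVE);
        }

        State state() {
            return state.get();
        }
    }

    // Replays the problematic ordering: recovery is requested first,
    // then the create thread tries to publish the replica as active.
    static State runInterleaving(boolean guarded) {
        Replica replica = new Replica();
        replica.publishRecovering();
        if (guarded) {
            replica.publishActiveGuarded();
        } else {
            replica.publishActiveUnguarded();
        }
        return replica.state();
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + runInterleaving(false));
        System.out.println("guarded:   " + runInterleaving(true));
    }
}
```

In the unguarded run the replica ends up ACTIVE even though recovery was requested, which corresponds to core_node8 being published as active while its recovery thread is still waiting in prep recovery. In the guarded run the state stays RECOVERING.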


> Race condition between core creation thread and recovery request from leader 
> causes inconsistent view of documents
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11661
>                 URL: https://issues.apache.org/jira/browse/SOLR-11661
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>             Fix For: 7.2, master (8.0)
>
>         Attachments: 11458-2-MoveReplicaHDFSTest-log.txt
>
>
> While testing SOLR-11458, [~ab] ran into an interesting failure which 
> resulted in different document counts between leader and replica. The test is 
> MoveReplicaHDFSTest on jira/solr-11458-2 branch.
> The failure is rare but reproducible on beasting:
> {code}
> reproduce with: ant test -Dtestcase=MoveReplicaHDFSTest -Dtests.method=testNormalFailedMove -Dtests.seed=161856CB543CD71C -Dtests.slow=true -Dtests.locale=ar-SA -Dtests.timezone=US/Michigan -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
>    [junit4] FAILURE 14.2s | MoveReplicaHDFSTest.testNormalFailedMove <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: expected:<100> but was:<56>
>    [junit4]    >      at __randomizedtesting.SeedInfo.seed([161856CB543CD71C:31134983787E4905]:0)
>    [junit4]    >      at org.apache.solr.cloud.MoveReplicaTest.testFailedMove(MoveReplicaTest.java:305)
>    [junit4]    >      at org.apache.solr.cloud.MoveReplicaHDFSTest.testNormalFailedMove(MoveReplicaHDFSTest.java:69)
> {code}


