[ https://issues.apache.org/jira/browse/SOLR-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shalin Shekhar Mangar updated SOLR-11661:
-----------------------------------------
    Attachment: 11458-2-MoveReplicaHDFSTest-log.txt

Full logs attached. Dat and I analyzed the logs and we found this problem:
{code}
# New collection called MoveReplicaHDFSTest_failed_coll is being created. New replicas core_node7 and core_node8 for the shard are in the process of being created.
# New core MoveReplicaHDFSTest_failed_coll_shard2_replica_n4 (core_node7) tries to become leader and asks MoveReplicaHDFSTest_failed_coll_shard2_replica_n6 (core_node8) to sync
# Sync fails because core_node8 has no versions
# core_node7 becomes leader and asks core_node8 to recover
# core_node8 gets a request to recover and starts recovery thread recoveryExecutor-53-thread-1-processing-n:127.0.0.1:61049_solr
# core_node8 enters buffering state
# core_node8 sends prep recovery command to core_node7 and publishes itself in recovery state
# core_node7 has a thread in WaitForState and currently sees core_node8 as down
# At t=70388, a DataStreamer Exception is reported from DFSClient and leader core_node7 logs that it could not close the HDFS transaction log because no more good datanodes are available -- these do not look relevant to the problem
# core_node7 (leader) publishes itself as active
# core_node7 create core is complete
# core_node8 create thread (qtp1713789948-2124) sees that there is a leader and publishes itself as active, skipping recovery
# core_node8 create core command is successful
# collection create is finished
# core_node7 remains stuck in WaitForState because from now on it only sees core_node8 as active, never as recovering
# the recovery thread in core_node8 remains waiting in prep recovery
# New documents are added to the collection but they aren't visible to searchers because core_node8 is buffering and therefore ignores commit requests
{code}
So there is a race: the core create thread publishes the local core as active after the leader has already asked that same core to recover.
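To make the interleaving above easier to follow, here is a minimal, deterministic sketch of it in plain Java. It is not Solr code: the class PublishRaceSketch, the Replica type, its fields, and the guard flag are all hypothetical names invented for illustration, and the guard is only one conceivable way to avoid the overwrite, not the fix that was committed for this issue. The leader's polling is reduced to two observations, one before the recovering state becomes visible and one after the create thread has published.

```java
public class PublishRaceSketch {
    // Simplified replica states; real Solr has more (this is an assumption-laden model).
    enum State { DOWN, RECOVERING, ACTIVE }

    /** Hypothetical stand-in for what the cluster state exposes about one replica. */
    static final class Replica {
        State published = State.DOWN;      // state visible to the leader
        boolean buffering = false;         // true while updates are buffered instead of applied
        boolean recoveryRequested = false; // set when the leader asks this core to recover

        // Recovery thread: runs after the leader's recovery request arrives.
        void startRecovery() {
            recoveryRequested = true;
            buffering = true;
            published = State.RECOVERING;
        }

        // Create-core thread: publishes ACTIVE once it sees that a leader exists.
        // With guard == false it overwrites RECOVERING, as in the analyzed log.
        void finishCreate(boolean guard) {
            if (guard && recoveryRequested) {
                return; // hypothetical guard: leave the state to the recovery thread
            }
            published = State.ACTIVE;
        }
    }

    // Replays the problematic order: leader polls, recovery starts,
    // create thread finishes, leader polls again.
    static String replay(boolean guard) {
        Replica replica = new Replica();
        State firstPoll = replica.published;   // leader still sees DOWN
        replica.startRecovery();
        replica.finishCreate(guard);
        State secondPoll = replica.published;  // leader's next observation
        boolean leaderSawRecovering =
                firstPoll == State.RECOVERING || secondPoll == State.RECOVERING;
        return "published=" + replica.published
                + " buffering=" + replica.buffering
                + " leaderSawRecovering=" + leaderSawRecovering;
    }

    public static void main(String[] args) {
        System.out.println(replay(false)); // unguarded: mirrors the log
        System.out.println(replay(true));  // guarded: leader can observe RECOVERING
    }
}
```

In the unguarded replay the replica ends up published as active while still buffering, and the leader never observes the recovering state, which matches the stuck WaitForState and the invisible documents described in the list.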
This is a side effect of SOLR-9566, which skips recovery for replicas that are being created as part of a new collection.

> Race condition between core creation thread and recovery request from leader causes inconsistent view of documents
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11661
>                 URL: https://issues.apache.org/jira/browse/SOLR-11661
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>             Fix For: 7.2, master (8.0)
>
>         Attachments: 11458-2-MoveReplicaHDFSTest-log.txt
>
>
> While testing SOLR-11458, [~ab] ran into an interesting failure which resulted in different document counts between leader and replica. The test is MoveReplicaHDFSTest on jira/solr-11458-2 branch.
> The failure is rare but reproducible on beasting:
> {code}
> reproduce with: ant test -Dtestcase=MoveReplicaHDFSTest -Dtests.method=testNormalFailedMove -Dtests.seed=161856CB543CD71C -Dtests.slow=true -Dtests.locale=ar-SA -Dtests.timezone=US/Michigan -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
>    [junit4] FAILURE 14.2s | MoveReplicaHDFSTest.testNormalFailedMove <<<
>    [junit4]    > Throwable #1: java.lang.AssertionError: expected:<100> but was:<56>
>    [junit4]    >        at __randomizedtesting.SeedInfo.seed([161856CB543CD71C:31134983787E4905]:0)
>    [junit4]    >        at org.apache.solr.cloud.MoveReplicaTest.testFailedMove(MoveReplicaTest.java:305)
>    [junit4]    >        at org.apache.solr.cloud.MoveReplicaHDFSTest.testNormalFailedMove(MoveReplicaHDFSTest.java:69)
> {code}