[ https://issues.apache.org/jira/browse/SOLR-11815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16311716#comment-16311716 ]
Shaun Sabo commented on SOLR-11815:
-----------------------------------

We spun up a test with isIndexStale commented out and found that the following block still triggers a full re-replication when the checksum mismatch is discovered. We don't think this check is necessary anymore either.

https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1136-L1141

> TLOG leaders going down and rejoining as a replica do fullCopy when not needed
> ------------------------------------------------------------------------------
>
>                 Key: SOLR-11815
>                 URL: https://issues.apache.org/jira/browse/SOLR-11815
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: replication (java)
>    Affects Versions: 7.2
>         Environment: Oracle JDK 1.8
>                      Ubuntu 16.04
>            Reporter: Shaun Sabo
>            Assignee: Ishan Chattopadhyaya
>
> I am running a collection with a persistent high volume of writes. When the
> leader goes down and recovers, it joins as a replica and asks the new leader
> for the files to sync. The isIndexStale check finds that some files
> differ in size and checksum, which forces a fullCopy. Since our indexes are
> rather large, a rolling restart results in large amounts of data
> transfer and, in some cases, disk space contention issues.
> I do not believe the fullCopy is necessary given the circumstances.
>
> Repro Steps:
> 1. Collection/shard with 1 leader and 1 replica is accepting writes
>     - Pull interval is 30 seconds
>     - Hard commit interval is 60 seconds
> 2. Replica executes an index pull and completes.
> 3. Leader process hard commits (replica index is delayed)
> 4. Leader process is killed (SIGTERM)
> 5. Replica takes over as new leader
> 6. New leader applies TLOG since last pull (cores are binary-divergent now)
> 7. Former leader comes back as new replica
> 8. New replica initiates recovery
>     - Recovery detects that the generation and version are behind and a check
> is necessary
> 9. isIndexStale() detects that a segment exists on both the new replica and
> new leader but that the size and checksum differ.
>     - This flags fullCopy
> 10. The entire index is pulled regardless of changes
>
> The majority of files should not have changes, but everything gets pulled
> because of the first file it finds with a mismatched checksum.
>
> Relevant Code:
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L516-L518
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1105-L1126
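
To make the behaviour in steps 9 and 10 concrete, here is a minimal sketch of the two strategies being contrasted. It is not the code in IndexFetcher: the class StaleFileSketch, the FileMeta type, and both method names are hypothetical and exist only for illustration. It assumes each side can list its index files with a size and checksum. The first method mirrors the reported behaviour (one mismatched shared file flags a full copy); the second shows the narrower alternative of listing only the files that are missing or different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: none of these names come from Solr.
class StaleFileSketch {

    /** Hypothetical per-file metadata: length in bytes plus a checksum. */
    static final class FileMeta {
        final long size;
        final long checksum;

        FileMeta(long size, long checksum) {
            this.size = size;
            this.checksum = checksum;
        }

        boolean sameAs(FileMeta other) {
            return other != null && size == other.size && checksum == other.checksum;
        }
    }

    /**
     * Reported behaviour: returns true as soon as one file name exists on
     * both sides with a different size or checksum, which would flag fullCopy.
     */
    static boolean anySharedFileDiffers(Map<String, FileMeta> local,
                                        Map<String, FileMeta> leader) {
        for (Map.Entry<String, FileMeta> e : leader.entrySet()) {
            FileMeta mine = local.get(e.getKey());
            if (mine != null && !mine.sameAs(e.getValue())) {
                return true;
            }
        }
        return false;
    }

    /**
     * Narrower alternative: list only the leader files that are missing
     * locally or whose size or checksum differs.
     */
    static List<String> filesToFetch(Map<String, FileMeta> local,
                                     Map<String, FileMeta> leader) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, FileMeta> e : leader.entrySet()) {
            if (!e.getValue().sameAs(local.get(e.getKey()))) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, FileMeta> local = new LinkedHashMap<>();
        local.put("_1.cfs", new FileMeta(100L, 11L));
        local.put("_2.cfs", new FileMeta(200L, 22L));

        Map<String, FileMeta> leader = new LinkedHashMap<>();
        leader.put("_1.cfs", new FileMeta(100L, 11L)); // unchanged
        leader.put("_2.cfs", new FileMeta(210L, 99L)); // same name, diverged
        leader.put("_3.cfs", new FileMeta(50L, 33L));  // new on the leader

        System.out.println("full copy flagged: " + anySharedFileDiffers(local, leader));
        System.out.println("files to fetch:    " + filesToFetch(local, leader));
    }
}
```

In this made-up example only one of the two shared files has diverged, yet the first check alone is enough to flag a full copy. Whether a per-file fetch is actually safe in Solr depends on details this sketch ignores, such as how a diverged file with the same name is replaced while the index is open.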