[ https://issues.apache.org/jira/browse/SOLR-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14260134#comment-14260134 ]

Shalin Shekhar Mangar commented on SOLR-6640:
---------------------------------------------

I had a discussion with Varun about this issue. We have two problems here:
# Solr corrupts the index during replication recovery
# Such a corrupt index puts Solr into an infinite recovery loop

For #1 the cause is clear -- open searchers hold references to uncommitted (but 
flushed) files, which get mixed with files downloaded from the leader, 
corrupting the index.

Possible solutions for #1 are: a) switch to a different index dir, move/copy 
the files belonging to committed segments, and use the index.properties 
approach to open a searcher on the new index dir; or b) close the searcher, 
roll back the writer, and then download the necessary files.

Closing the searcher is not as simple as it sounds: the searcher is 
ref-counted, so close() doesn't actually close it immediately. Also, a request 
might open a new searcher at any time, so it is a very involved change.
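To illustrate why close() is not immediate, here is a minimal sketch of a ref-counted resource (this is not Solr's actual class, just a model of the same pattern): close() only decrements the count, and the underlying searcher is truly released only when the last holder lets go.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative model of a ref-counted searcher wrapper. The creator holds
// one reference; each request takes another via incref(). "Closing" is
// really decref() -- the resource survives until the count reaches zero.
class RefCounted<T> {
    private final T resource;
    private final AtomicInteger refCount = new AtomicInteger(1);
    private volatile boolean released = false;

    RefCounted(T resource) { this.resource = resource; }

    // A request "opens" the searcher by incrementing the count.
    T incref() {
        refCount.incrementAndGet();
        return resource;
    }

    // close() is really a decrement; only the final decref releases.
    void decref() {
        if (refCount.decrementAndGet() == 0) {
            released = true; // only now is the underlying searcher closed
        }
    }

    boolean isReleased() { return released; }
}
```

So even after the core asks to close the searcher, an in-flight request that still holds a reference keeps the underlying files open -- which is exactly what makes "just close it before rollback" hard.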

For #2, everywhere we open a reader/searcher or writer, we should be ready to 
handle corrupt-index exceptions.
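A hedged sketch of that idea (openReader and the recovery flag are hypothetical stand-ins, not actual Solr methods, and the exception class here is a placeholder for Lucene's CorruptIndexException): instead of letting the exception propagate and re-enter the same loop, treat the index as unusable and request a full recovery.

```java
class CorruptIndexDemo {
    // Placeholder for org.apache.lucene.index.CorruptIndexException.
    static class CorruptIndexException extends RuntimeException {
        CorruptIndexException(String msg) { super(msg); }
    }

    static boolean recoveryRequested = false;

    // Hypothetical stand-in for opening a reader/searcher over the index dir.
    static void openReader(boolean corrupt) {
        if (corrupt) throw new CorruptIndexException("file mismatch");
    }

    // Wrap every open with a handler that converts corruption into a
    // recovery request instead of an unhandled crash-and-retry loop.
    static boolean openOrRecover(boolean corrupt) {
        try {
            openReader(corrupt);
            return true; // index is healthy
        } catch (CorruptIndexException e) {
            recoveryRequested = true; // e.g. discard index, re-replicate
            return false;
        }
    }
}
```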

I think we should first solve the problem of corrupting the index. So let's 
try the deletion approach that Varun outlined. If that fails, then we should 
switch to a new index dir, move/copy over files from commit points, fetch the 
missing segments from the leader, and use the index.properties approach to 
move to a new index directory entirely.
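The index.properties approach mentioned above works by downloading the full index into a fresh directory and then pointing the core at it via an "index=&lt;dirname&gt;" entry in index.properties. A minimal sketch of that switch (paths and directory names are illustrative, not Solr's exact code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class IndexDirSwitch {
    // Rewrite index.properties so the core opens newIndexDir on next reload.
    static boolean pointCoreAt(Path dataDir, String newIndexDir) {
        Properties p = new Properties();
        p.setProperty("index", newIndexDir);
        try (OutputStream os =
                 Files.newOutputStream(dataDir.resolve("index.properties"))) {
            p.store(os, "active index directory");
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Read back which directory the core should use (default "index").
    static String activeIndexDir(Path dataDir) {
        Path f = dataDir.resolve("index.properties");
        if (!Files.exists(f)) return "index";
        Properties p = new Properties();
        try (InputStream is = Files.newInputStream(f)) {
            p.load(is);
        } catch (IOException e) {
            return "index";
        }
        return p.getProperty("index", "index");
    }

    // Helper so the demo can run against a throwaway data dir.
    static Path tempDataDir() {
        try {
            return Files.createTempDirectory("solr-data");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The switch is effectively atomic from the searcher's point of view: the old directory stays intact until the new one is complete and the pointer is flipped, which is what avoids mixing leader files into a live index.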

The second problem that we need to solve is that a corrupted index trashes the 
server. We should be able to recover from such a scenario instead of going into 
an infinite recovery loop.

Let's fix these two problems (in that order) and then figure out ways to 
optimize recovery.

Longer term, we need to change our code such that we can close the searchers, 
roll back the writer, delete uncommitted files, and then attempt replication 
recovery.

Also my earlier comment on non-cloud Solr was wrong:
bq. In SolrCloud we could just close the searcher before rollback because a 
replica in recovery won't get any search requests but that's not practical in 
standalone Solr because it'd cause downtime.

In standalone Solr this is not a problem because indexing and soft commits do 
not happen on slaves. But in any case, changing the code to close the searcher 
etc. is a big change.

> ChaosMonkeySafeLeaderTest failure with CorruptIndexException
> ------------------------------------------------------------
>
>                 Key: SOLR-6640
>                 URL: https://issues.apache.org/jira/browse/SOLR-6640
>             Project: Solr
>          Issue Type: Bug
>          Components: replication (java)
>    Affects Versions: 5.0
>            Reporter: Shalin Shekhar Mangar
>             Fix For: 5.0
>
>         Attachments: Lucene-Solr-5.x-Linux-64bit-jdk1.8.0_20-Build-11333.txt, 
> SOLR-6640.patch, SOLR-6640.patch
>
>
> Test failure found on jenkins:
> http://jenkins.thetaphi.de/job/Lucene-Solr-5.x-Linux/11333/
> {code}
> 1 tests failed.
> REGRESSION:  org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.testDistribSearch
> Error Message:
> shard2 is not consistent.  Got 62 from 
> http://127.0.0.1:57436/collection1lastClient and got 24 from 
> http://127.0.0.1:53065/collection1
> Stack Trace:
> java.lang.AssertionError: shard2 is not consistent.  Got 62 from 
> http://127.0.0.1:57436/collection1lastClient and got 24 from 
> http://127.0.0.1:53065/collection1
>         at 
> __randomizedtesting.SeedInfo.seed([F4B371D421E391CD:7555FFCC56BCF1F1]:0)
>         at org.junit.Assert.fail(Assert.java:93)
>         at 
> org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1255)
>         at 
> org.apache.solr.cloud.AbstractFullDistribZkTestBase.checkShardConsistency(AbstractFullDistribZkTestBase.java:1234)
>         at 
> org.apache.solr.cloud.ChaosMonkeySafeLeaderTest.doTest(ChaosMonkeySafeLeaderTest.java:162)
>         at 
> org.apache.solr.BaseDistributedSearchTestCase.testDistribSearch(BaseDistributedSearchTestCase.java:869)
> {code}
> Cause of inconsistency is:
> {code}
> Caused by: org.apache.lucene.index.CorruptIndexException: file mismatch, 
> expected segment id=yhq3vokoe1den2av9jbd3yp8, got=yhq3vokoe1den2av9jbd3yp7 
> (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/ssd/jenkins/workspace/Lucene-Solr-5.x-Linux/solr/build/solr-core/test/J0/temp/solr.cloud.ChaosMonkeySafeLeaderTest-F4B371D421E391CD-001/tempDir-001/jetty3/index/_1_2.liv")))
>    [junit4]   2>              at 
> org.apache.lucene.codecs.CodecUtil.checkSegmentHeader(CodecUtil.java:259)
>    [junit4]   2>              at 
> org.apache.lucene.codecs.lucene50.Lucene50LiveDocsFormat.readLiveDocs(Lucene50LiveDocsFormat.java:88)
>    [junit4]   2>              at 
> org.apache.lucene.codecs.asserting.AssertingLiveDocsFormat.readLiveDocs(AssertingLiveDocsFormat.java:64)
>    [junit4]   2>              at 
> org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:102)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
