[ https://issues.apache.org/jira/browse/SOLR-7836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712222#comment-14712222 ]
Yonik Seeley commented on SOLR-7836:
------------------------------------

I've been running ChaosMonkeySafeLeaderTest for about 3 days with my test script, which also searches for corrupt indexes and assertion failures even when the test itself passes.

Current trunk (as of last week): 9 corrupt indexes
Patched trunk: 14 corrupt indexes and 2 test failures (inconsistent shards)

The corrupt indexes *may* not be a problem; I don't really know. We kill off servers, perhaps during replication? That could plausibly produce corrupt indexes, but I don't know whether that's the actual scenario, and an increased incidence doesn't necessarily point to a problem either. Inconsistent shards vs. not, though, does seem like a real problem if it holds. I've reviewed the locking code again and it looks solid, so I'm not sure what's going on.

Here's a typical corrupt-index trace:

{code}
  2> 21946 WARN  (RecoveryThread-collection1) [n:127.0.0.1:51815_ c:collection1 s:shard1 r:core_node2 x:collection1] o.a.s.h.IndexFetcher Could not retrieve checksum from file.
  2> org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=1698720114 vs expected footer=-1071082520 (resource=MMapIndexInput(path="/opt/code/lusolr_clean2/solr/build/solr-core/test/J0/temp/solr.cloud.ChaosMonkeySafeLeaderTest_B7DC9C42462BF20D-001/shard-2-001/cores/collection1/data/index/_0.fdt"))
  2> 	at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:416)
  2> 	at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:401)
  2> 	at org.apache.solr.handler.IndexFetcher.compareFile(IndexFetcher.java:876)
  2> 	at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:839)
  2> 	at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:437)
  2> 	at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:265)
  2> 	at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:382)
  2> 	at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:162)
  2> 	at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
  2> 	at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
{code}

> Possible deadlock when closing refcounted index writers.
> --------------------------------------------------------
>
>                 Key: SOLR-7836
>                 URL: https://issues.apache.org/jira/browse/SOLR-7836
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>             Fix For: Trunk, 5.4
>
>         Attachments: SOLR-7836-reorg.patch, SOLR-7836-synch.patch, SOLR-7836.patch, SOLR-7836.patch, SOLR-7836.patch, SOLR-7836.patch, deadlock_3.res.zip, deadlock_5_pass_iw.res.zip, deadlock_test
>
>
> Preliminary patch for what looks like a possible race condition between writerFree and pauseWriter in DefaultSolrCoreState.
> Looking for comments and/or why I'm completely missing the boat.
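For context on the trace above: a minimal sketch of how a codec footer check catches a file truncated mid-copy (for example, a replica killed during replication). This is a simplified illustration, not Lucene's actual CodecUtil implementation; only the FOOTER_MAGIC value is taken from the trace itself, where it appears as "expected footer=-1071082520".

{code}
import java.io.IOException;
import java.io.RandomAccessFile;

public class FooterCheck {
    // -1071082520 as a signed int, matching "expected footer" in the trace above.
    static final int FOOTER_MAGIC = 0xC02893E8;
    // Sketch assumption: a fixed-size footer (magic + algorithm id + checksum).
    static final int FOOTER_LENGTH = 16;

    static void validateFooter(RandomAccessFile file) throws IOException {
        if (file.length() < FOOTER_LENGTH) {
            throw new IOException("file too short to contain a footer (truncated?)");
        }
        // Seek to where the footer should start and read the magic int.
        file.seek(file.length() - FOOTER_LENGTH);
        int actual = file.readInt();
        if (actual != FOOTER_MAGIC) {
            // A copy killed mid-transfer leaves arbitrary bytes at the new end of
            // the file, producing exactly the mismatch reported in the trace.
            throw new IOException("codec footer mismatch (file truncated?): actual footer="
                + actual + " vs expected footer=" + FOOTER_MAGIC);
        }
    }
}
{code}

Since the footer sits at the very end of the file, any truncation point, even one byte short, moves unrelated data under the magic check, which is why "actual footer" in the trace is effectively a random value.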
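And for the race the quoted issue description mentions between writerFree and pauseWriter: an illustrative sketch of the general shape of that coordination, not Solr's actual DefaultSolrCoreState code. The class and method names here are hypothetical; the point is the wait/notify discipline whose violation produces the deadlock shape under discussion.

{code}
public class RefCountedWriterState {
    private final Object writerPauseLock = new Object();
    private int refCount = 0;          // outstanding users of the writer
    private boolean pauseWriter = false;

    public void incref() throws InterruptedException {
        synchronized (writerPauseLock) {
            // Block new users while a pause (e.g., a writer swap) is in progress.
            while (pauseWriter) {
                writerPauseLock.wait();
            }
            refCount++;
        }
    }

    public void decref() {
        synchronized (writerPauseLock) {
            refCount--;
            // If this notify is skipped, or done outside the lock, a thread
            // waiting in pauseAndWait() never wakes: the deadlock shape
            // this issue describes.
            writerPauseLock.notifyAll();
        }
    }

    public void pauseAndWait() throws InterruptedException {
        synchronized (writerPauseLock) {
            pauseWriter = true;
            while (refCount > 0) {     // must be a loop, never a plain 'if'
                writerPauseLock.wait();
            }
            // ... safe to close/replace the IndexWriter here ...
            pauseWriter = false;
            writerPauseLock.notifyAll();
        }
    }
}
{code}

Every state transition and every wait/notify happens under the same monitor, and both waits re-check their condition in a loop; if the real code upholds the same invariants, that is consistent with the locking "looking solid" on review.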