[ https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785941#comment-16785941 ]
Hoss Man commented on LUCENE-8692: ---------------------------------- {quote}I think we need to be careful here. From my perspective there are 3 types of exceptions here: {quote} Right .... ok. This is definitely more nuanced then I originally realized, I was lumping #2 and #3 into the same bucket in my head, and hadn't even remembered/considered the "rollback" situation. FWIW: My "mental model" when I first raised this issue was: * We've got this MockDirectoryWrapper.corruptFiles helper method that corrupts files under the covers in ways that are unrecoverable from a client perspective * Presumably then the exceptions a client gets as a result should be "tragic" * When a test uses corruptFiles, here are the code paths that can result in exceptions bubbling up to a client that are not marked tragic But your (IIUC) point is that in the general case, there is no reason some of these code paths (like maybeMerge & startCommit) can/should assume that _any_ exceptions bubbling up are in fact unrecoverable / traggic – in this test it may be, but the code doesn't know that: it's very possible that a rollback will work. ---- {quote}Now we can debate if we want to change this and we can, in-fact I am all for making it even more strict especially since it's inconsistent with what we do if addDocument fails with an aborting exception. {quote} It definitely seems like there should be _something_ we can/should do to better recognize situations like this as "unrecoverable" and be more strict in dealing with low level exceptions during things like commit – but I'm out definitely out of my depth in understanding/suggesting what that might look like. Clearly the current patch is being too aggressive in what it treats as "tragic". > IndexWriter.getTragicException() nay not reflect all corrupting exceptions > (notably: NoSuchFileException) > --------------------------------------------------------------------------------------------------------- > > Key: LUCENE-8692 > URL: https://issues.apache.org/jira/browse/LUCENE-8692 > Project: Lucene - Core > Issue Type: Bug > Reporter: Hoss Man > Priority: Major > Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, > LUCENE-8692_test.patch > > > Backstory... > Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's > {{corruptFiles}} to introduce corruption into the "leader" node's index and > then assert that this solr node gives up it's leadership of the shard and > another replica takes over. > This can currently fail sporadically (but usually reproducibly - > seeSOLR-13237) due to the leader not giving up it's leadership even after the > corruption causes an update/commit to fail. Solr's leadership code makes > this decision after encountering an exception from the IndexWriter based on > wether {{IndexWriter.getTragicException()}} is (non-)null. > ---- > While investigating this, I created an isolated Lucene-Core equivilent test > that demonstrates the same basic situation: > * Gradually cause corruption on an index untill (otherwise) valid execution > of IW.add() + IW.commit() calls throw an exception to the IW client. > * assert that if an exception is thrown to the IW client, > {{getTragicException()}} is now non-null. > It's fairly easy to make my new test fail reproducibly -- in every situation > I've seen the underlying exception is a {{NoSuchFileException}} (ie: the > randomly introduced corruption was to delete some file). -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org