[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16785941#comment-16785941
 ] 

Hoss Man commented on LUCENE-8692:
----------------------------------

{quote}I think we need to be careful here. From my perspective there are 3 
types of exceptions here:
{quote}
Right .... ok. This is definitely more nuanced then I originally realized, I 
was lumping #2 and #3 into the same bucket in my head, and hadn't even 
remembered/considered the "rollback" situation.

FWIW: My "mental model" when I first raised this issue was:
 * We've got this MockDirectoryWrapper.corruptFiles helper method that corrupts 
files under the covers in ways that are unrecoverable from a client perspective
 * Presumably then the exceptions a client gets as a result should be "tragic"
 * When a test uses corruptFiles, here are the code paths that can result in 
exceptions bubbling up to a client that are not marked tragic

But your (IIUC) point is that in the general case, there is no reason some of 
these code paths (like maybeMerge & startCommit) can/should assume that _any_ 
exceptions bubbling up are in fact unrecoverable / traggic – in this test it 
may be, but the code doesn't know that: it's very possible that a rollback will 
work.
----
{quote}Now we can debate if we want to change this and we can, in-fact I am all 
for making it even more strict especially since it's inconsistent with what we 
do if addDocument fails with an aborting exception.
{quote}
It definitely seems like there should be _something_ we can/should do to better 
recognize situations like this as "unrecoverable" and be more strict in dealing 
with low level exceptions during things like commit – but I'm out definitely 
out of my depth in understanding/suggesting what that might look like.

Clearly the current patch is being too aggressive in what it treats as "tragic".

> IndexWriter.getTragicException() nay not reflect all corrupting exceptions 
> (notably: NoSuchFileException)
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8692
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8692
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, 
> LUCENE-8692_test.patch
>
>
> Backstory...
> Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's 
> {{corruptFiles}} to introduce corruption into the "leader" node's index and 
> then assert that this solr node gives up it's leadership of the shard and 
> another replica takes over.
> This can currently fail sporadically (but usually reproducibly - 
> seeSOLR-13237) due to the leader not giving up it's leadership even after the 
> corruption causes an update/commit to fail.  Solr's leadership code makes 
> this decision after encountering an exception from the IndexWriter based on 
> wether {{IndexWriter.getTragicException()}} is (non-)null.
> ----
> While investigating this, I created an isolated Lucene-Core equivilent test 
> that demonstrates the same basic situation:
> * Gradually cause corruption on an index untill (otherwise) valid execution 
> of IW.add() + IW.commit() calls throw an exception to the IW client.
> * assert that if an exception is thrown to the IW client, 
> {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly -- in every situation 
> I've seen the underlying exception is a {{NoSuchFileException}} (ie: the 
> randomly introduced corruption was to delete some file).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to