[ 
https://issues.apache.org/jira/browse/LUCENE-8692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16789320#comment-16789320
 ] 

Simon Willnauer commented on LUCENE-8692:
-----------------------------------------

{quote}
It definitely seems like there should be something we can/should do to better 
recognize situations like this as "unrecoverable" and be more strict in dealing 
with low level exceptions during things like commit – but I'm out definitely 
out of my depth in understanding/suggesting what that might look like.
{quote}

I agree with you here, I personally question the purpose of rollback since all 
the cases I have seen a missing rollback would simply mean dataloss. if 
somebody continues after a failed commit / prepareCommit / reopen they will end 
up with inconsistency and / or dataloss. I can't think of a reason why you 
would want to do it. I am curious what [~mikemccand] [~jpountz] [~rcmuir ] 
think about that. 
If we deprecated and remove rollback() we can be more agressive when it gets to 
tragic events and prevent users from continuing after such an exception by 
closing the writer automatically.



> IndexWriter.getTragicException() may not reflect all corrupting exceptions 
> (notably: NoSuchFileException)
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8692
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8692
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: LUCENE-8692.patch, LUCENE-8692.patch, LUCENE-8692.patch, 
> LUCENE-8692_test.patch
>
>
> Backstory...
> Solr has a "LeaderTragicEventTest" which uses MockDirectoryWrapper's 
> {{corruptFiles}} to introduce corruption into the "leader" node's index and 
> then assert that this solr node gives up it's leadership of the shard and 
> another replica takes over.
> This can currently fail sporadically (but usually reproducibly - see 
> SOLR-13237) due to the leader not giving up it's leadership even after the 
> corruption causes an update/commit to fail. Solr's leadership code makes this 
> decision after encountering an exception from the IndexWriter based on wether 
> {{IndexWriter.getTragicException()}} is (non-)null.
> ----
> While investigating this, I created an isolated Lucene-Core equivilent test 
> that demonstrates the same basic situation:
>  * Gradually cause corruption on an index untill (otherwise) valid execution 
> of IW.add() + IW.commit() calls throw an exception to the IW client.
>  * assert that if an exception is thrown to the IW client, 
> {{getTragicException()}} is now non-null.
> It's fairly easy to make my new test fail reproducibly – in every situation 
> I've seen the underlying exception is a {{NoSuchFileException}} (ie: the 
> randomly introduced corruption was to delete some file).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to