[ 
https://issues.apache.org/jira/browse/SOLR-18121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18072592#comment-18072592
 ] 

ASF subversion and git services commented on SOLR-18121:
--------------------------------------------------------

Commit 1d3b5ed1204ca056bb057bb4aebb20751135f3ca in solr's branch 
refs/heads/branch_10x from Jason Gerlowski
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=1d3b5ed1204 ]

SOLR-18121: Rm 'rm-in-advance' fetcher codepath (#4258)

Prior to this commit, the IndexFetcher had support for an edge case
where if a full-index-fetch was needed but the receiving host didn't
have adequate disk space, then the IndexFetcher would delete the
existing (out of date) copy of the index to free up space.

Unfortunately, this optimization/"feature" was never documented in the
ref-guide, had very little test coverage, and contained a number of
resource-leaks that appeared to trigger 100% of the time.  While the
feature might've worked when initially written, the rare-ness of the
triggering condition and lack of tests had led to some severe bitrot
over time.

This commit removes this optimization, largely by walking back changes
in 'b061947' which added it initially.

> Ref-counting leak in IndexFetcher's "deleteFilesInAdvance" codepath
> -------------------------------------------------------------------
>
>                 Key: SOLR-18121
>                 URL: https://issues.apache.org/jira/browse/SOLR-18121
>             Project: Solr
>          Issue Type: Bug
>          Components: replication (java)
>            Reporter: Jason Gerlowski
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> While experimenting with TLOG and PULL replicas locally, I think I've 
> discovered a resource-leak in IndexFetcher.
> IndexFetcher's little-known "deleteFilesInAdvance" code-path, which deletes 
> the local copy of the index if space is needed to download the remote copy, 
> doesn't obey the contract of {{SolrCoreState.closeIndexWriter}}.  The problem 
> code in question is around 
> [L1273|https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/handler/IndexFetcher.java#L1273]:
> {code}
>     
> solrCore.getUpdateHandler().getSolrCoreState().closeIndexWriter(this.solrCore,
>  false);
>     for (String f : filesTobeDeleted) {
>       try {
>         indexDir.deleteFile(f);
>       } catch (FileNotFoundException | NoSuchFileException e) {
>         // no problem , it was deleted by someone else
>       }   
>     }   
> {code}
> The javadocs for {{closeIndexWriter}} indicate:
> bq. Expert method that closes the IndexWriter - you must call {@link 
> #openIndexWriter(SolrCore)} in a finally block after calling this method.
> ...which clearly isn't happening in this case.
> This causes tangible issues in the field - we (Houston and myself) uncovered 
> this issue while debugging a collection-delete operation that stalled out 
> indefinitely.  The bug described above messes up the ref-count for the 
> SolrCore in question, so deleting this core/replica stalled forever waiting 
> for the ref-count to hit 0.
> Side note 1: My reading of the logic here suggests that this problem would 
> happen reliably any time this codepath is invoked.  It was a tough thing to 
> debug, so I wouldn't expect user bug re
> ports necessarily.  But our tests have pretty stringent ref-count checks, and 
> the fact that no tests report this issue suggests to me that this codepath is 
> entirely untested. 
> Side note 2: I have some larger questions about whether this codepath is 
> something we really want to continue supporting in Solr today.  You can 
> imagine scenarios where this could lead to data-loss if a replica gets into 
> this state and then something happens to the leader.  Safer to keep the data 
> that this replica does have around, even if it's slightly less up to date 
> than what the leader has currently.  But I'll raise those on SOLR-12999 which 
> initially introduced the feature.  Let's assume on the JIRA ticket that we do 
> want to keep the feature around.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to