[
https://issues.apache.org/jira/browse/SOLR-17497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895334#comment-17895334
]
ASF subversion and git services commented on SOLR-17497:
--------------------------------------------------------
Commit 4c211ba5be43c63d3b9a7ccf5810b4aa73960e80 in solr's branch
refs/heads/branch_9_7 from Sanjay Dutt
[ https://gitbox.apache.org/repos/asf?p=solr.git;h=4c211ba5be4 ]
SOLR-17448 SOLR-17497: IndexFetcher, catch exception instead of bubbling up
uncaught (#2800)
(cherry picked from commit cc30093c5ee988555389b50cf2333edf743bb50f)
> Pull replicas throw AlreadyClosedException
> ---------------------------------------------
>
> Key: SOLR-17497
> URL: https://issues.apache.org/jira/browse/SOLR-17497
> Project: Solr
> Issue Type: Task
> Reporter: Sanjay Dutt
> Priority: Major
> Attachments: Screenshot 2024-10-23 at 6.01.02 PM.png
>
>
> Recently, a common exception (org.apache.lucene.store.AlreadyClosedException:
> this Directory is closed) has been seen in multiple failed test cases.
> FAILED: org.apache.solr.cloud.TestPullReplica.testKillPullReplica
> FAILED:
> org.apache.solr.cloud.SplitShardWithNodeRoleTest.testSolrClusterWithNodeRoleWithPull
> FAILED: org.apache.solr.cloud.TestPullReplica.testAddDocs
>
>
> {code:java}
> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an
> uncaught exception in thread: Thread[id=10271,
> name=fsyncService-6341-thread-1, state=RUNNABLE,
> group=TGRP-SplitShardWithNodeRoleTest]
> at
> __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4:E5DB3E97188A8EB9]:0)
> Caused by: org.apache.lucene.store.AlreadyClosedException: this Directory is
> closed
> at __randomizedtesting.SeedInfo.seed([3F7DACB3BC44C3C4]:0)
> at
> app//org.apache.lucene.store.BaseDirectory.ensureOpen(BaseDirectory.java:50)
> at
> app//org.apache.lucene.store.ByteBuffersDirectory.sync(ByteBuffersDirectory.java:237)
> at
> app//org.apache.lucene.tests.store.MockDirectoryWrapper.sync(MockDirectoryWrapper.java:214)
> at
> app//org.apache.solr.handler.IndexFetcher$DirectoryFile.sync(IndexFetcher.java:2034)
> at
> app//org.apache.solr.handler.IndexFetcher$FileFetcher.lambda$fetch$0(IndexFetcher.java:1803)
> at
> app//org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$1(ExecutorUtil.java:449)
> at
> [email protected]/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at
> [email protected]/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at [email protected]/java.lang.Thread.run(Thread.java:829)
> {code}
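Note how the failure surfaces through the test framework's uncaught-exception capture rather than through the replication call itself. That matches how the task is scheduled: `ExecutorUtil$MDCAwareThreadPoolExecutor` runs the lambda via `execute(...)`, and an exception escaping a task passed to `execute` reaches the worker thread's uncaught-exception handler (with `submit` it would instead be trapped in the returned `Future`). Also, `AlreadyClosedException` is unchecked (it extends `IllegalStateException`), so no catch clause forces it to be handled. A minimal JDK-only sketch of this behavior, with no Solr or Lucene classes involved:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class UncaughtInExecutorDemo {
    /** Runs a throwing task via execute() and returns what the uncaught-exception handler saw. */
    public static Throwable runAndCapture() throws InterruptedException {
        AtomicReference<Throwable> captured = new AtomicReference<>();
        CountDownLatch handled = new CountDownLatch(1);
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fsyncService-demo");
            // Stand-in for the randomized-testing framework's capture of escaped exceptions.
            t.setUncaughtExceptionHandler((thread, e) -> {
                captured.set(e);
                handled.countDown();
            });
            return t;
        });
        // execute(): an unchecked exception escapes the task, kills the worker thread,
        // and fires the thread's uncaught-exception handler -- the failure mode in the trace.
        pool.execute(() -> {
            throw new IllegalStateException("this Directory is closed");
        });
        handled.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        return captured.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAndCapture().getMessage());
    }
}
```

Had the lambda been scheduled with `submit(...)`, the same exception would have been stored in the returned `Future` and never reached the handler; with `execute(...)`, only a catch inside the task body can stop it from escaping.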
>
> The interesting thing about these test cases is that they all share the same
> kind of setup: each has one shard and two replicas, one NRT and one PULL.
>
> Walking through the execution of one of the failing test cases:
> FAILED: org.apache.solr.cloud.TestPullReplica.testKillPullReplica
>
> Test flow
> 1. Create a collection with 1 NRT and 1 PULL replica
> 2. waitForState
> 3. waitForNumDocsInAllActiveReplicas(0); // *Name says it all*
> 4. Index another document.
> 5. waitForNumDocsInAllActiveReplicas(1);
> 6. Stop Pull replica
> 7. Index another document
> 8. waitForNumDocsInAllActiveReplicas(2);
> 9. Start Pull Replica
> 10. waitForState
> 11. waitForNumDocsInAllActiveReplicas(2);
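The steps above can be sketched with a toy in-memory model. The `Leader`/`PullReplica` names and the poll-to-copy behavior are illustrative stand-ins, not Solr classes; the point is that a PULL replica only catches up by copying the leader's index when it polls, so after the restart in step 9 it must replicate to see the second document:

```java
import java.util.ArrayList;
import java.util.List;

public class PullReplicaFlowSketch {
    static class Leader { final List<String> docs = new ArrayList<>(); }
    static class PullReplica {
        List<String> docs = new ArrayList<>();
        boolean running = true;
        // A PULL replica never indexes; it only copies the leader's index on poll.
        void poll(Leader leader) { if (running) docs = new ArrayList<>(leader.docs); }
    }

    public static int run() {
        Leader leader = new Leader();
        PullReplica pull = new PullReplica();
        pull.poll(leader);                          // steps 1-3: both replicas agree on 0 docs
        leader.docs.add("doc1"); pull.poll(leader); // steps 4-5: both see 1 doc
        pull.running = false;                       // step 6: stop the PULL replica
        leader.docs.add("doc2");                    // steps 7-8: only the leader has 2 docs
        pull.running = true; pull.poll(leader);     // steps 9-11: restart; replication catches up
        return pull.docs.size();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```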
>
> According to the logs, the whole sequence executed successfully. Here is the
> link to the logs:
> [https://ge.apache.org/s/yxydiox3gvlf2/tests/task/:solr:core:test/details/org.apache.solr.cloud.TestPullReplica/testKillPullReplica/1/output]
> (link may stop working in the future)
>
> The last step, which verifies that every active replica has two documents,
> logged the INFO lines below, further evidence that it completed successfully.
>
> {code:java}
> 616575 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node3 (https://127.0.0.1:35647/solr/pull_replica_test_kill_pull_replica_shard1_replica_n1/) has all 2 docs
> 616606 INFO (qtp1091538342-13057-null-11348) [n:127.0.0.1:38207_solr c:pull_replica_test_kill_pull_replica s:shard1 r:core_node4 x:pull_replica_test_kill_pull_replica_shard1_replica_p2 t:null-11348] o.a.s.c.S.Request webapp=/solr path=/select params={q=*:*&wt=javabin&version=2} rid=null-11348 hits=2 status=0 QTime=0
> 616607 INFO (TEST-TestPullReplica.testKillPullReplica-seed#[F30CC837FDD0DC28]) [n: c: s: r: x: t:] o.a.s.c.TestPullReplica Replica core_node4 (https://127.0.0.1:38207/solr/pull_replica_test_kill_pull_replica_shard1_replica_p2/) has all 2 docs{code}
>
> *Where is the issue then?*
> The logs show that after the PULL replica was restarted, the recovery process
> started, and after fetching the list of files from the NRT replica, the
> replication aborted and logged "User aborted replication":
>
> {code:java}
> o.a.s.h.IndexFetcher User aborted Replication =>
> org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted replication
> 	at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1826)
> org.apache.solr.handler.IndexFetcher$ReplicationHandlerException: User aborted replication{code}
>
> Inside IndexFetcher, once the fetch is aborted, a cleanup() operation runs. It
> closes the file, and deletes it only if the number of downloaded bytes does
> not equal the expected size.
> {code:java}
> private void cleanup() {
>   try {
>     file.close();
>   } catch (Exception e) {
>     /* no-op */
>     log.error("Error closing file: {}", this.saveAs, e);
>   }
>   if (bytesDownloaded != size) {
>     // if the download is not complete then
>     // delete the file being downloaded
>     try {
>       file.delete();
>     } catch (Exception e) {
>       log.error("Error deleting file: {}", this.saveAs, e);
>     }
>     // if the failure is due to a user abort it is returned normally
>     // else an exception is thrown
>     if (!aborted)
>       throw new SolrException(
>           SolrException.ErrorCode.SERVER_ERROR,
>           "Unable to download " + fileName + " completely. Downloaded "
>               + bytesDownloaded + "!=" + size);
>   }
> }{code}
> After that, a sync operation is scheduled on the fsyncService thread, and
> that is where it fails.
> {code:java}
> fsyncService.execute(
>     () -> {
>       try {
>         file.sync();
>       } catch (IOException e) {
>         fsyncException = e;
>       }
>     });{code}
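The catch above only handles `IOException`, but `AlreadyClosedException` is unchecked, so it escapes the lambda and the pool thread. The merged fix is described in the commit message as "catch exception instead of bubbling up uncaught"; here is a JDK-only sketch of that defensive shape. The `aborted`, `fsyncException`, and `sync()` names mirror the IndexFetcher snippets above, but the `sync()` body is a stand-in and the real patch may differ in detail:

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class GuardedFsyncSketch {
    // Stand-in for file.sync() on a Directory that has already been closed.
    static void sync() {
        throw new IllegalStateException("this Directory is closed");
    }

    /** Schedules the sync (unless aborted) and returns any recorded failure. */
    public static IOException run(boolean aborted) throws InterruptedException {
        AtomicReference<IOException> fsyncException = new AtomicReference<>();
        ExecutorService fsyncService = Executors.newSingleThreadExecutor();
        CountDownLatch done = new CountDownLatch(1);
        if (!aborted) { // don't bother syncing a fetch whose file was thrown away
            fsyncService.execute(
                () -> {
                    try {
                        sync();
                    } catch (Exception e) { // was: catch (IOException e) -- too narrow
                        fsyncException.set(new IOException(e)); // record, don't bubble up
                    } finally {
                        done.countDown();
                    }
                });
        } else {
            done.countDown();
        }
        done.await(5, TimeUnit.SECONDS);
        fsyncService.shutdown();
        return fsyncException.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(true) == null);  // aborted: no sync attempted, no exception
        System.out.println(run(false) != null); // failure recorded; the thread survives
    }
}
```

With the wider catch, the exception ends up in `fsyncException`, where the main replication path can inspect it, instead of killing the fsync worker thread.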
> Now two questions:
> 1. Why would the replication be aborted in the first place? And who triggers it?
> 2. Should the sync be skipped when the replication has been aborted?
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)