hardikbajaj opened a new issue, #16783:
URL: https://github.com/apache/druid/issues/16783

   Druid indexer tasks sometimes get stuck in the PUBLISHING state because their executors are not shut down properly.
   
   ### Affected Version
   
   Druid 25.0.0
   
   ### Description
   We are running a Kafka supervisor ingestion task with task replication set to two.
   - Tasks [A1, A2] are replica tasks of the same TaskGroup. When a task rollover happens, a new set of tasks [B1, B2] is created to start ingesting, and the TaskGroup [A1, A2] is added to the pending completion task groups.
   - Task A1 completed with SUCCESS, which triggered a stop signal to both A1 and A2. An async stop request was sent to task A2, which returned 200, and the indexer started stopping the task gracefully. However, task A2 got stuck forever in the PUBLISHING state on the indexer.
   - The Overlord detects the active task A2, which is in neither an ActivelyReadingTaskGroup nor a PendingCompletionTaskGroup, so it creates a NEW PendingCompletionTaskGroup containing only task A2.
   - Task A2 remained stuck in the PUBLISHING state on the indexer. After a timeout of `PendingCompletionTimeout` minutes, the task is forcefully killed. Since the Overlord sees that the task group's completion timeout has passed and the task did not succeed, it **KILLS THE ACTIVELY READING TASKS** (see the sketch just below this list).
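
   To make that last step concrete, here is a minimal, self-contained sketch of the pending-completion check as described above. The class and field names are assumptions for illustration only; this is not the actual SeekableStreamSupervisor code.
   ```
   import java.util.List;

   // Illustrative sketch (assumed names, not Druid code) of the behaviour described above:
   // once the completion deadline of a pending-completion group passes without any replica
   // succeeding, the group is failed and the actively reading tasks are killed as well.
   public class PendingCompletionCheckSketch
   {
     record PendingGroup(List<String> taskIds, long completionDeadlineMillis, boolean anyTaskSucceeded) {}

     static void checkPendingCompletion(PendingGroup group, List<String> activelyReadingTaskIds, long nowMillis)
     {
       if (nowMillis > group.completionDeadlineMillis() && !group.anyTaskSucceeded()) {
         System.out.println("Killing pending-completion tasks: " + group.taskIds());
         // The data read for this generation is considered unusable, so the actively
         // reading replicas are killed too -- this is the damaging step from the report.
         System.out.println("Killing actively reading tasks: " + activelyReadingTaskIds);
       }
     }

     public static void main(String[] args)
     {
       // A2 is stuck in PUBLISHING, its completion deadline has passed, and it never reported success.
       PendingGroup a2Only = new PendingGroup(List.of("A2"), System.currentTimeMillis() - 1, false);
       checkPendingCompletion(a2Only, List.of("B1", "B2"), System.currentTimeMillis());
     }
   }
   ```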
   
   ### Why did task A2 get stuck in the PUBLISHING state?
   I did some debugging; the following is the probable cause of the task getting stuck.
   - When task A2 received the stop signal, its main thread was interrupted immediately. At that point, task A2 had completed publishing and was handing off the segments.
   - From the logs, we can see that the indexer was handing off 3 segments, but it logged the dropped-segment message for only one of them.
   ```
   2 Jul 2024 @ 03:05:55.583 UTC    Segment[S0] successfully handed off, dropping.    indexer-pod    coordinator_handoff_scheduled_0
   2 Jul 2024 @ 03:05:55.585 UTC    Segment[S1] successfully handed off, dropping.    indexer-pod    coordinator_handoff_scheduled_0
   2 Jul 2024 @ 03:05:55.587 UTC    Segment[S2] successfully handed off, dropping.    indexer-pod    coordinator_handoff_scheduled_0
   2 Jul 2024 @ 03:05:56.223 UTC    Stopping gracefully (status: [PUBLISHING])    indexer-pod    thread
   2 Jul 2024 @ 03:05:56.223 UTC    Closing reporter org.apache.kafka.common.metrics.JmxReporter    indexer-pod    [task_id]-threading-task-runner-executor-0
   2 Jul 2024 @ 03:05:56.224 UTC    Shutting down immediately...    indexer-pod    main thread
   2 Jul 2024 @ 03:05:56.258 UTC    Dropped segment[S0].    indexer-pod    [task_id]-appenderator-persist
   ```
   - The task seems to get stuck waiting for the persist executor to shut down. As we can see, the last log line printed by the main thread was `Shutting down immediately...`. After that, we [shut down the executors](https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/segment/realtime/appenderator/StreamAppenderator.java#L1081) and wait for them to terminate, **with a timeout of 365 days**:
   ```
   Preconditions.checkState(
       persistExecutor == null || persistExecutor.awaitTermination(365, TimeUnit.DAYS),
       "persistExecutor not terminated"
   );
   ```
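
   The snippet above is the Druid code; below is a minimal, self-contained sketch (not Druid code) of why this wait can hang effectively forever: `shutdown()` lets already-submitted work keep running, and `awaitTermination` then blocks until that work finishes or the timeout (365 days in Druid) elapses. The latch stands in for the drop/persist work that was still pending when the task was interrupted; the 5-second timeout is only so the demo returns.
   ```
   import java.util.concurrent.CountDownLatch;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.TimeUnit;

   public class StuckPersistExecutorSketch
   {
     public static void main(String[] args) throws InterruptedException
     {
       ExecutorService persistExecutor = Executors.newSingleThreadExecutor();
       CountDownLatch neverReleased = new CountDownLatch(1); // stands in for a drop that never completes

       persistExecutor.submit(() -> {
         try {
           neverReleased.await(); // simulates the stuck drop/persist work
         }
         catch (InterruptedException e) {
           Thread.currentThread().interrupt();
         }
       });

       persistExecutor.shutdown(); // graceful shutdown: already-submitted work keeps running
       // This is the call the stopping thread blocks on (365 DAYS in StreamAppenderator):
       boolean terminated = persistExecutor.awaitTermination(5, TimeUnit.SECONDS);
       System.out.println("persistExecutor terminated: " + terminated); // false -> task stays in PUBLISHING

       persistExecutor.shutdownNow(); // only so this demo process can exit
     }
   }
   ```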
   - After `PendingCompletionTimeout` minutes, the task is forcefully killed and marked as FAILED. It is stopped again, which interrupts the main ingestion thread once more, and we get this ERROR log line:
   ```
   2 Jul 2024 @ 03:40:01.670 UTC    Exception caught during execution    indexer-pod    threading-task-runner-executor-0
        java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
        at org.apache.druid.storage.s3.S3TaskLogs.pushTaskFile(S3TaskLogs.java:156)
        at org.apache.druid.storage.s3.S3TaskLogs.pushTaskReports(S3TaskLogs.java:141)
        at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:223)
        at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:152)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
   Caused by: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
        at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:148)
        at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81)
        at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:163)
        at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153)
        at org.apache.druid.storage.s3.S3Utils.retryS3Operation(S3Utils.java:101)
        at org.apache.druid.storage.s3.S3TaskLogs.pushTaskFile(S3TaskLogs.java:147)
        ... 7 more
   ```
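
   A minimal, self-contained sketch (not the actual RetryUtils implementation) of the "Current thread is interrupted after [0] tries" behaviour seen above: an interrupt-aware retry loop refuses to even attempt the operation once the calling thread has been interrupted, so the S3 report upload fails immediately when the already-interrupted main thread reaches it.
   ```
   import java.util.concurrent.Callable;

   public class InterruptedRetrySketch
   {
     static <T> T retry(Callable<T> op, int maxTries) throws Exception
     {
       for (int attempt = 0; attempt < maxTries; attempt++) {
         if (Thread.currentThread().isInterrupted()) {
           // Give up without attempting the call, mirroring the RE in the stack trace above.
           throw new RuntimeException("Current thread is interrupted after [" + attempt + "] tries");
         }
         try {
           return op.call();
         }
         catch (Exception e) {
           // fall through and retry
         }
       }
       throw new RuntimeException("Exhausted " + maxTries + " tries");
     }

     public static void main(String[] args) throws Exception
     {
       Thread.currentThread().interrupt();  // the task's main thread was interrupted by the forced stop
       retry(() -> "pushTaskReports", 3);   // throws before the first attempt, i.e. after [0] tries
     }
   }
   ```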
   That exception allowed the main thread to reach the [finally](https://github.com/apache/druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java#L910) block, which unregisters the chat handlers and stops the other task resources.
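
   A tiny sketch of that control flow, for completeness (illustrative only, not SeekableStreamIndexTaskRunner): once the interrupted report upload throws, control falls through to the finally block, which is where the chat handlers and other task resources get released.
   ```
   public class FinallyCleanupSketch
   {
     public static void main(String[] args)
     {
       try {
         runTask();
       }
       catch (RuntimeException e) {
         System.out.println("task failed: " + e.getMessage());
       }
     }

     static void runTask()
     {
       try {
         // stands in for the interrupted S3 report upload
         throw new RuntimeException("Current thread is interrupted after [0] tries");
       }
       finally {
         // analogous to unregistering chat handlers / closing other task resources
         System.out.println("cleanup runs even though the task body threw");
       }
     }
   }
   ```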
   
   
   
   
   

