hardikbajaj opened a new issue, #16783:
URL: https://github.com/apache/druid/issues/16783

   Druid indexer tasks sometimes get stuck in the PUBLISHING state because their executors are not shut down properly.
   
   ### Affected Version
   
   Druid 25.0.0
   
   ### Description
   We are running a Kafka supervisor ingestion task with task replication set to two.
   - Tasks [A1, A2] are replica tasks of the same TaskGroup. When a task rollover happens, a new set of tasks [B1, B2] is created to start ingesting, and the TaskGroup [A1, A2] is added to the pending completion task groups.
   - Task A1 completed with SUCCESS, which triggered a stop signal to both A1 and A2. An async stop request was sent to task A2, which returned 200, and the indexer started stopping the task gracefully. However, task A2 got stuck forever in the PUBLISHING state on the indexer.
   - The Overlord detects the active task A2, which is in neither an ActivelyReadingTaskGroup nor a PendingCompletionTaskGroup, so it creates a NEW PendingCompletionTaskGroup containing only task A2.
   - Task A2 remained stuck in the PUBLISHING state on the indexer. After a timeout of `PendingCompletionTimeout` minutes, the task is forcefully killed. Since the Overlord sees that the task group's completion timeout has passed and the task did not succeed, it **KILLS THE ACTIVELY READING TASKS** (see the sketch just below this list).
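
   To make that last step concrete, here is a minimal, self-contained sketch of the pending-completion check as described above. The class and field names are assumptions for illustration only; this is not the actual SeekableStreamSupervisor code.
   ```
   import java.util.List;

   // Illustrative sketch (assumed names, not Druid code) of the behaviour described above:
   // once the completion deadline of a pending-completion group passes without any replica
   // succeeding, the group is failed and the actively reading tasks are killed as well.
   public class PendingCompletionCheckSketch
   {
     record PendingGroup(List<String> taskIds, long completionDeadlineMillis, boolean anyTaskSucceeded) {}

     static void checkPendingCompletion(PendingGroup group, List<String> activelyReadingTaskIds, long nowMillis)
     {
       if (nowMillis > group.completionDeadlineMillis() && !group.anyTaskSucceeded()) {
         System.out.println("Killing pending-completion tasks: " + group.taskIds());
         // The data read for this generation is considered unusable, so the actively
         // reading replicas are killed too -- this is the damaging step from the report.
         System.out.println("Killing actively reading tasks: " + activelyReadingTaskIds);
       }
     }

     public static void main(String[] args)
     {
       // A2 is stuck in PUBLISHING, its completion deadline has passed, and it never reported success.
       PendingGroup a2Only = new PendingGroup(List.of("A2"), System.currentTimeMillis() - 1, false);
       checkPendingCompletion(a2Only, List.of("B1", "B2"), System.currentTimeMillis());
     }
   }
   ```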
   
   ### Why did task A2 get stuck in the PUBLISHING state?
   I did some debugging; the following is the probable cause of the task getting stuck.
   - When task A2 received the stop signal, its main thread was interrupted immediately. At that point, task A2 had completed publishing and was handing off the segments.
   - From the logs, we can see that the indexer was handing off 3 segments, but it logged the dropped-segment message for only one of them.
   ```
   2 Jul 2024 @ 03:05:55.583 UTC    Segment[S0] successfully handed off, dropping.    indexer-pod    coordinator_handoff_scheduled_0
   2 Jul 2024 @ 03:05:55.585 UTC    Segment[S1] successfully handed off, dropping.    indexer-pod    coordinator_handoff_scheduled_0
   2 Jul 2024 @ 03:05:55.587 UTC    Segment[S2] successfully handed off, dropping.    indexer-pod    coordinator_handoff_scheduled_0
   2 Jul 2024 @ 03:05:56.223 UTC    Stopping gracefully (status: [PUBLISHING])    indexer-pod    thread
   2 Jul 2024 @ 03:05:56.223 UTC    Closing reporter org.apache.kafka.common.metrics.JmxReporter    indexer-pod    [task_id]-threading-task-runner-executor-0
   2 Jul 2024 @ 03:05:56.224 UTC    Shutting down immediately...    indexer-pod    main thread
   2 Jul 2024 @ 03:05:56.258 UTC    Dropped segment[S0].    indexer-pod    [task_id]-appenderator-persist
   ```
   - The task seems to get stuck waiting for the persist executor to shut down. As we can see, the last log line printed by the main thread was `Shutting down immediately...`. After that, we [shut down the executors](https://github.com/apache/druid/blob/master/server/src/main/java/org/apache/druid/segment/realtime/appenderator/StreamAppenderator.java#L1081) and wait for them to terminate, **with a timeout of 365 days**:
   ```
   Preconditions.checkState(
       persistExecutor == null || persistExecutor.awaitTermination(365, TimeUnit.DAYS),
       "persistExecutor not terminated"
   );
   ```
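
   The snippet above is the Druid code; below is a minimal, self-contained sketch (not Druid code) of why this wait can hang effectively forever: `shutdown()` lets already-submitted work keep running, and `awaitTermination` then blocks until that work finishes or the timeout (365 days in Druid) elapses. The latch stands in for the drop/persist work that was still pending when the task was interrupted; the 5-second timeout is only so the demo returns.
   ```
   import java.util.concurrent.CountDownLatch;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.TimeUnit;

   public class StuckPersistExecutorSketch
   {
     public static void main(String[] args) throws InterruptedException
     {
       ExecutorService persistExecutor = Executors.newSingleThreadExecutor();
       CountDownLatch neverReleased = new CountDownLatch(1); // stands in for a drop that never completes

       persistExecutor.submit(() -> {
         try {
           neverReleased.await(); // simulates the stuck drop/persist work
         }
         catch (InterruptedException e) {
           Thread.currentThread().interrupt();
         }
       });

       persistExecutor.shutdown(); // graceful shutdown: already-submitted work keeps running
       // This is the call the stopping thread blocks on (365 DAYS in StreamAppenderator):
       boolean terminated = persistExecutor.awaitTermination(5, TimeUnit.SECONDS);
       System.out.println("persistExecutor terminated: " + terminated); // false -> task stays in PUBLISHING

       persistExecutor.shutdownNow(); // only so this demo process can exit
     }
   }
   ```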
   - After `PendingCompletionTimeout` minutes, the task is forcefully killed and marked as FAILED. It is stopped again, which interrupts the main ingestion thread once more, and we get this ERROR log line:
   ```
   2 Jul 2024 @ 03:40:01.670 UTC    Exception caught during execution    indexer-pod    threading-task-runner-executor-0
        java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
        at org.apache.druid.storage.s3.S3TaskLogs.pushTaskFile(S3TaskLogs.java:156)
        at org.apache.druid.storage.s3.S3TaskLogs.pushTaskReports(S3TaskLogs.java:141)
        at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:223)
        at org.apache.druid.indexing.overlord.ThreadingTaskRunner$1.call(ThreadingTaskRunner.java:152)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
   Caused by: org.apache.druid.java.util.common.RE: Current thread is interrupted after [0] tries
        at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:148)
        at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:81)
        at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:163)
        at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:153)
        at org.apache.druid.storage.s3.S3Utils.retryS3Operation(S3Utils.java:101)
        at org.apache.druid.storage.s3.S3TaskLogs.pushTaskFile(S3TaskLogs.java:147)
        ... 7 more
   ```
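
   A minimal, self-contained sketch (not the actual RetryUtils implementation) of the "Current thread is interrupted after [0] tries" behaviour seen above: an interrupt-aware retry loop refuses to even attempt the operation once the calling thread has been interrupted, so the S3 report upload fails immediately when the already-interrupted main thread reaches it.
   ```
   import java.util.concurrent.Callable;

   public class InterruptedRetrySketch
   {
     static <T> T retry(Callable<T> op, int maxTries) throws Exception
     {
       for (int attempt = 0; attempt < maxTries; attempt++) {
         if (Thread.currentThread().isInterrupted()) {
           // Give up without attempting the call, mirroring the RE in the stack trace above.
           throw new RuntimeException("Current thread is interrupted after [" + attempt + "] tries");
         }
         try {
           return op.call();
         }
         catch (Exception e) {
           // fall through and retry
         }
       }
       throw new RuntimeException("Exhausted " + maxTries + " tries");
     }

     public static void main(String[] args) throws Exception
     {
       Thread.currentThread().interrupt();  // the task's main thread was interrupted by the forced stop
       retry(() -> "pushTaskReports", 3);   // throws before the first attempt, i.e. after [0] tries
     }
   }
   ```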
   That exception allowed the main thread to reach the [finally](https://github.com/apache/druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/SeekableStreamIndexTaskRunner.java#L910) block, which unregisters the chat handlers and stops the other task resources.
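
   A tiny sketch of that control flow, for completeness (illustrative only, not SeekableStreamIndexTaskRunner): once the interrupted report upload throws, control falls through to the finally block, which is where the chat handlers and other task resources get released.
   ```
   public class FinallyCleanupSketch
   {
     public static void main(String[] args)
     {
       try {
         runTask();
       }
       catch (RuntimeException e) {
         System.out.println("task failed: " + e.getMessage());
       }
     }

     static void runTask()
     {
       try {
         // stands in for the interrupted S3 report upload
         throw new RuntimeException("Current thread is interrupted after [0] tries");
       }
       finally {
         // analogous to unregistering chat handlers / closing other task resources
         System.out.println("cleanup runs even though the task body threw");
       }
     }
   }
   ```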
   
   
   
   
   

