pchang388 commented on issue #12701:
URL: https://github.com/apache/druid/issues/12701#issuecomment-1185827594

   Hi @AmatyaAvadhanula,
   
   So I did let it run with the section commented out for a bit, and most of 
them failed (~50%+) with the same error messages we've seen before but I did 
see one that was different, relevant INFO level logs below (can also get you 
debug logs if needed). I'm unsure at the moment if the change we made removed a 
race condition but looking at the code I believe I understand why you suggested 
it and looks like it could possibly do so. 
   
   Example Task: `index_kafka_REDACT_daa7d95ffa54138_omkgiccg` - this one had 
the same issue where it was killed due to not pausing within the chatRetries 
(default 8) limit
   
   * Overlord Log Snippet - Shutdown task due to: `Task 
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] failed to change its status from 
[READING] to [PAUSED], aborting]`
   ```
   2022-07-14T23:47:20,270 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its 
status to [PAUSED]; will try again in [PT2S]
   2022-07-14T23:47:22,550 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its 
status to [PAUSED]; will try again in [PT4S]
   2022-07-14T23:47:26,826 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its 
status to [PAUSED]; will try again in [PT8S]
   2022-07-14T23:47:35,108 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-14T23:47:45,387 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-14T23:47:55,666 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-14T23:48:05,943 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-14T23:48:16,222 INFO [IndexTaskClient-REDACT-2] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still 
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its 
status to [PAUSED]; will try again in [PT10S]
   2022-07-14T23:49:27,234 INFO [qtp1886478937-103] 
org.apache.druid.indexing.overlord.TaskLockbox - Added 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-06-01T14:00:00.000Z/2022-06-01T15:00:00.000Z, 
version='2022-07-14T23:27:38.085Z', priority=75, revoked=false}]
   2022-07-14T23:49:27,234 INFO [qtp1886478937-103] 
org.apache.druid.indexing.overlord.MetadataTaskStorage - Adding lock on 
interval[2022-06-01T14:00:00.000Z/2022-06-01T15:00:00.000Z] 
version[2022-07-14T23:27:38.085Z] for task: 
index_kafka_REDACT_daa7d95ffa54138_omkgiccg
   2022-07-14T23:50:58,304 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown 
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] because: [An exception occured 
while waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to pause: 
[org.apache.druid.java.util.common.ISE: Task 
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] failed to change its status from 
[READING] to [PAUSED], aborting]]
   2022-07-14T23:50:58,419 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to 
worker: REDACTED.host.com:8091, status 200 OK, response: 
{"task":"index_kafka_REDACT_daa7d95ffa54138_omkgiccg"}
   2022-07-14T23:50:59,559 INFO [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.RemoteTaskRunner - 
Worker[REDACTED.host.com:8091] wrote FAILED status for task 
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] on 
[TaskLocation{host='REDACTED.host.com', port=8102, tlsPort=-1}]
   2022-07-14T23:50:59,559 INFO [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.RemoteTaskRunner - 
Worker[REDACTED.host.com:8091] completed 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] with status[FAILED]
   2022-07-14T23:50:59,559 INFO [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.TaskQueue - Received FAILED status for task: 
index_kafka_REDACT_daa7d95ffa54138_omkgiccg
   2022-07-14T23:52:28,726 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from activeTasks
   2022-07-14T23:52:28,726 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-05-29T09:00:00.000Z/2022-05-29T10:00:00.000Z, 
version='2022-07-14T23:45:45.943Z', priority=75, revoked=false}]
   2022-07-14T23:52:28,958 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-05-29T10:00:00.000Z/2022-05-29T11:00:00.000Z, 
version='2022-07-14T23:01:48.156Z', priority=75, revoked=false}]
   2022-07-14T23:52:29,186 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-05-29T17:00:00.000Z/2022-05-29T18:00:00.000Z, 
version='2022-07-14T23:18:41.116Z', priority=75, revoked=false}]
   2022-07-14T23:52:29,415 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-05-31T01:00:00.000Z/2022-05-31T02:00:00.000Z, 
version='2022-07-14T23:42:26.717Z', priority=75, revoked=false}]
   2022-07-14T23:52:29,640 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-05-31T09:00:00.000Z/2022-05-31T10:00:00.000Z, 
version='2022-07-14T23:14:55.776Z', priority=75, revoked=false}]
   2022-07-14T23:52:29,886 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-05-31T16:00:00.000Z/2022-05-31T17:00:00.000Z, 
version='2022-07-14T23:38:25.971Z', priority=75, revoked=false}]
   2022-07-14T23:52:30,120 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-06-01T14:00:00.000Z/2022-06-01T15:00:00.000Z, 
version='2022-07-14T23:27:38.085Z', priority=75, revoked=false}]
   2022-07-14T23:52:30,348 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-07-07T11:00:00.000Z/2022-07-07T12:00:00.000Z, 
version='2022-07-14T23:05:31.113Z', priority=75, revoked=false}]
   2022-07-14T23:52:30,572 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-07-07T14:00:00.000Z/2022-07-07T15:00:00.000Z, 
version='2022-07-14T22:10:07.792Z', priority=75, revoked=false}]
   2022-07-14T23:52:30,797 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-07-07T15:00:00.000Z/2022-07-07T16:00:00.000Z, 
version='2022-07-14T23:35:04.230Z', priority=75, revoked=false}]
   2022-07-14T23:52:31,029 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-07-14T16:00:00.000Z/2022-07-14T17:00:00.000Z, 
version='2022-07-14T20:20:37.672Z', priority=75, revoked=false}]
   2022-07-14T23:52:31,270 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-07-14T17:00:00.000Z/2022-07-14T18:00:00.000Z, 
version='2022-07-14T21:50:01.268Z', priority=75, revoked=false}]
   2022-07-14T23:52:31,517 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-07-14T18:00:00.000Z/2022-07-14T19:00:00.000Z, 
version='2022-07-14T21:44:27.183Z', priority=75, revoked=false}]
   2022-07-14T23:52:31,746 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-07-14T19:00:00.000Z/2022-07-14T20:00:00.000Z, 
version='2022-07-14T23:06:16.008Z', priority=75, revoked=false}]
   2022-07-14T23:52:31,988 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-07-15T01:00:00.000Z/2022-07-15T02:00:00.000Z, 
version='2022-07-14T22:57:43.141Z', priority=75, revoked=false}]
   2022-07-14T23:52:32,214 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskLockbox - Removing 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from 
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT', 
dataSource='REDACT', 
interval=2022-07-15T02:00:00.000Z/2022-07-15T03:00:00.000Z, 
version='2022-07-14T23:10:47.272Z', priority=75, revoked=false}]
   2022-07-14T23:52:32,569 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task 
index_kafka_REDACT_daa7d95ffa54138_omkgiccg to status: 
TaskStatus{id=index_kafka_REDACT_daa7d95ffa54138_omkgiccg, status=FAILED, 
duration=-1, errorMsg=An exception occured while waiting for task 
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to pause: ...}
   2022-07-14T23:52:32,690 INFO [KafkaSupervisor-REDACT-Worker-1] 
org.apache.druid.indexing.overlord.TaskQueue - Task done: 
AbstractTask{id='index_kafka_REDACT_daa7d95ffa54138_omkgiccg', 
groupId='index_kafka_REDACT', 
taskResource=TaskResource{availabilityGroup='index_kafka_REDACT_daa7d95ffa54138',
 requiredCapacity=1}, dataSource='REDACT', 
context={checkpoints={"0":{"5":13421694182}}, 
useLineageBasedSegmentAllocation=true, IS_INCREMENTAL_HANDOFF_SUPPORTED=true, 
forceTimeChunkLock=true}}
   2022-07-14T23:53:44,018 INFO [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown 
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] because: [notified status change 
from task]
   2022-07-14T23:53:44,018 INFO [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.RemoteTaskRunner - Cleaning up 
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] on 
worker[REDACTED.host.com:8091]
   2022-07-14T23:53:44,076 WARN [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.TaskQueue - Unknown task completed: 
index_kafka_REDACT_daa7d95ffa54138_omkgiccg
   2022-07-14T23:53:44,076 INFO [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.TaskQueue - Task FAILED: 
AbstractTask{id='index_kafka_REDACT_daa7d95ffa54138_omkgiccg', 
groupId='index_kafka_REDACT', 
taskResource=TaskResource{availabilityGroup='index_kafka_REDACT_daa7d95ffa54138',
 requiredCapacity=1}, dataSource='REDACT', 
context={checkpoints={"0":{"5":13421694182}}, 
useLineageBasedSegmentAllocation=true, IS_INCREMENTAL_HANDOFF_SUPPORTED=true, 
forceTimeChunkLock=true}} (3828502 run duration)
   2022-07-14T23:53:44,076 INFO [Curator-PathChildrenCache-1] 
org.apache.druid.indexing.overlord.RemoteTaskRunner - 
Task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] went bye bye.
   ```
   * Peon Log Snippet - was actually in `PAUSED` state during the shutdown - 
`Stopping forcefully (status: [PAUSED])`
   ```
   2022-07-14T23:35:25,545 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing 
segment[REDACT_2022-07-07T15:00:00.000Z_2022-07-07T16:00:00.000Z_2022-07-07T14:57:55.341Z_560]
 at existing 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:38:47,724 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing 
segment[REDACT_2022-05-31T16:00:00.000Z_2022-05-31T17:00:00.000Z_2022-05-31T22:18:01.994Z_1082]
 at existing 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:42:48,993 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing 
segment[REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329]
 at existing 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:42:48,994 INFO [task-runner-0-priority-0] 
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Persisted 
rows[302,835] and (estimated) bytes[482,175,309]
   2022-07-14T23:42:49,019 INFO 
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist] 
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed 
in-memory data for 
segment[REDACT_2022-07-07T15:00:00.000Z_2022-07-07T16:00:00.000Z_2022-07-07T14:57:55.341Z_560]
 spill[0] to disk in [24] ms (1,000 rows).
   2022-07-14T23:42:50,562 INFO 
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist] 
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed 
in-memory data for 
segment[REDACT_2022-07-14T18:00:00.000Z_2022-07-14T19:00:00.000Z_2022-07-14T19:24:01.698Z_22]
 spill[7] to disk in [1,542] ms (136,567 rows).
   2022-07-14T23:42:53,411 INFO 
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist] 
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed 
in-memory data for 
segment[REDACT_2022-07-14T19:00:00.000Z_2022-07-14T20:00:00.000Z_2022-07-14T20:41:44.206Z_7]
 spill[8] to disk in [2,837] ms (164,266 rows).
   2022-07-14T23:42:53,444 INFO 
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist] 
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed 
in-memory data for 
segment[REDACT_2022-07-15T02:00:00.000Z_2022-07-15T03:00:00.000Z_2022-07-14T21:09:08.443Z_4]
 spill[5] to disk in [18] ms (1 rows).
   2022-07-14T23:42:53,497 INFO 
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist] 
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed 
in-memory data for 
segment[REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329]
 spill[0] to disk in [52] ms (1 rows).
   2022-07-14T23:42:53,519 INFO 
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist] 
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed 
in-memory data for 
segment[REDACT_2022-05-31T16:00:00.000Z_2022-05-31T17:00:00.000Z_2022-05-31T22:18:01.994Z_1082]
 spill[0] to disk in [20] ms (1,000 rows).
   2022-07-14T23:42:53,520 INFO 
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist] 
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed 
in-memory data with commit metadata 
[AppenderatorDriverMetadata{segments={index_kafka_REDACT_daa7d95ffa54138_1=[SegmentWithState{segmentIdentifier=REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329,
 state=APPENDING}, 
SegmentWithState{segmentIdentifier=REDACT_2022-05-31T16:00:00.000Z_2022-05-31T17:00:00.000Z_2022-05-31T22:18:01.994Z_1082,
 state=APPENDING}, 
SegmentWithState{segmentIdentifier=REDACT_2022-07-07T15:00:00.000Z_2022-07-07T16:00:00.000Z_2022-07-07T14:57:55.341Z_560,
 state=APPENDING}, 
SegmentWithState{segmentIdentifier=REDACT_2022-07-14T18:00:00.000Z_2022-07-14T19:00:00.000Z_2022-07-14T19:24:01.698Z_22,
 state=APPENDING}, 
SegmentWithState{segmentIdentifier=REDACT_2022-07-14T19:00:00.000Z_2022-07-14T20:00:00.000Z_2022-07-14T20:41:44.206Z_7,
 state=APPENDING}, SegmentWith
 
State{segmentIdentifier=REDACT_2022-07-15T02:00:00.000Z_2022-07-15T03:00:00.000Z_2022-07-14T21:09:08.443Z_4,
 state=APPENDING}]}, 
lastSegmentIds={index_kafka_REDACT_daa7d95ffa54138_1=REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329},
 
callerMetadata={nextPartitions=SeekableStreamEndSequenceNumbers{stream='REDACT',
 partitionSequenceNumberMap={5=13433730900}}}}] for segments: 
REDACT_2022-07-07T15:00:00.000Z_2022-07-07T16:00:00.000Z_2022-07-07T14:57:55.341Z_560,
 
REDACT_2022-07-14T18:00:00.000Z_2022-07-14T19:00:00.000Z_2022-07-14T19:24:01.698Z_22,
 
REDACT_2022-07-14T19:00:00.000Z_2022-07-14T20:00:00.000Z_2022-07-14T20:41:44.206Z_7,
 
REDACT_2022-07-15T02:00:00.000Z_2022-07-15T03:00:00.000Z_2022-07-14T21:09:08.443Z_4,
 
REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329,
 
REDACT_2022-05-31T16:00:00.000Z_2022-05-31T17:00:00.000Z_2022-05-31T22:18:01.994Z_1082
   2022-07-14T23:42:53,520 INFO 
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist] 
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Persisted 
stats: processed rows: [12037718], persisted rows[302835], sinks: [6], total 
fireHydrants (across sinks): [26], persisted fireHydrants (across sinks): [6]
   2022-07-14T23:46:08,182 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing 
segment[REDACT_2022-05-29T09:00:00.000Z_2022-05-29T10:00:00.000Z_2022-05-31T23:59:23.591Z_782]
 at existing 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:49:27,388 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing 
segment[REDACT_2022-06-01T14:00:00.000Z_2022-06-01T15:00:00.000Z_2022-06-01T20:26:55.463Z_762]
 at existing 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:50:58,392 INFO [parent-monitor-0] 
org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Triggering JVM 
shutdown.
   2022-07-14T23:50:58,392 INFO [Thread-70] org.apache.druid.cli.CliPeon - 
Running shutdown hook
   2022-07-14T23:50:58,392 INFO [Thread-70] 
org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle 
[module] stage [ANNOUNCEMENTS]
   2022-07-14T23:50:58,398 INFO [Thread-70] 
org.apache.druid.curator.announcement.Announcer - Unannouncing 
[/druid/announcements/REDACTED.host.com:8102]
   2022-07-14T23:50:58,414 INFO [Thread-70] 
org.apache.druid.curator.announcement.Announcer - Unannouncing 
[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:50:58,417 INFO [Thread-70] 
org.apache.druid.curator.announcement.Announcer - Unannouncing 
[/druid/internal-discovery/PEON/REDACTED.host.com:8102]
   2022-07-14T23:50:58,420 INFO [Thread-70] 
org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle 
[module] stage [SERVER]
   2022-07-14T23:50:58,426 INFO [Thread-70] 
org.eclipse.jetty.server.AbstractConnector - Stopped 
ServerConnector@70cd122{HTTP/1.1, (http/1.1)}{0.0.0.0:8102}
   2022-07-14T23:50:58,426 INFO [Thread-70] org.eclipse.jetty.server.session - 
node0 Stopped scavenging
   2022-07-14T23:50:58,427 INFO [Thread-70] 
org.eclipse.jetty.server.handler.ContextHandler - Stopped 
o.e.j.s.ServletContextHandler@5a6af2c4{/,null,STOPPED}
   2022-07-14T23:50:58,429 INFO [Thread-70] 
org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle 
[module] stage [NORMAL]
   2022-07-14T23:50:58,429 INFO [Thread-70] 
org.apache.druid.server.coordination.ZkCoordinator - Stopping ZkCoordinator for 
[DruidServerMetadata{name='REDACTED.host.com:8102', 
hostAndPort='REDACTED.host.com:8102', hostAndTlsPort='null', maxSize=0, 
tier='_default_tier', type=indexer-executor, priority=0}]
   2022-07-14T23:50:58,430 INFO [Thread-70] 
org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopping...
   2022-07-14T23:50:58,430 INFO [Thread-70] 
org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopped.
   2022-07-14T23:50:58,430 INFO [Thread-70] 
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting 
graceful shutdown of task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg].
   2022-07-14T23:50:58,430 INFO [Thread-70] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - 
Stopping forcefully (status: [PAUSED])
   2022-07-14T23:50:58,431 ERROR [task-runner-0-priority-0] 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner - 
Encountered exception in run() before persisting.
   java.lang.InterruptedException: null
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
 ~[?:1.8.0_322]
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
 ~[?:1.8.0_322]
        at 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.possiblyPause(SeekableStreamIndexTaskRunner.java:1303)
 ~[druid-indexing-service-0.23.0.jar:0.23.0]
        at 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:579)
 [druid-indexing-service-0.23.0.jar:0.23.0]
        at 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:265)
 [druid-indexing-service-0.23.0.jar:0.23.0]
        at 
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:149)
 [druid-indexing-service-0.23.0.jar:0.23.0]
        at 
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:477)
 [druid-indexing-service-0.23.0.jar:0.23.0]
        at 
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:449)
 [druid-indexing-service-0.23.0.jar:0.23.0]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
[?:1.8.0_322]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_322]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_322]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
   2022-07-14T23:50:58,433 INFO [task-runner-0-priority-0] 
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Persisted 
rows[2,499] and (estimated) bytes[2,953,857]
   2022-07-14T23:50:58,433 INFO [task-runner-0-priority-0] 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer 
clientId=consumer-kafka-supervisor-ogdbelko-1, 
groupId=kafka-supervisor-ogdbelko] Resetting generation due to: consumer 
pro-actively leaving the group
   2022-07-14T23:50:58,433 INFO [task-runner-0-priority-0] 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer 
clientId=consumer-kafka-supervisor-ogdbelko-1, 
groupId=kafka-supervisor-ogdbelko] Request joining group due to: consumer 
pro-actively leaving the group
   2022-07-14T23:50:58,434 INFO [task-runner-0-priority-0] 
org.apache.kafka.common.metrics.Metrics - Metrics scheduler closed
   2022-07-14T23:50:58,434 INFO [task-runner-0-priority-0] 
org.apache.kafka.common.metrics.Metrics - Closing reporter 
org.apache.kafka.common.metrics.JmxReporter
   2022-07-14T23:50:58,434 INFO [task-runner-0-priority-0] 
org.apache.kafka.common.metrics.Metrics - Metrics reporters closed
   2022-07-14T23:50:58,436 INFO [task-runner-0-priority-0] 
org.apache.kafka.common.utils.AppInfoParser - App info kafka.consumer for 
consumer-kafka-supervisor-ogdbelko-1 unregistered
   2022-07-14T23:50:58,436 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing 
segment[REDACT_2022-05-29T09:00:00.000Z_2022-05-29T10:00:00.000Z_2022-05-31T23:59:23.591Z_782]
 at 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:50:58,437 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing 
segment[REDACT_2022-07-07T15:00:00.000Z_2022-07-07T16:00:00.000Z_2022-07-07T14:57:55.341Z_560]
 at 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:50:58,437 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing 
segment[REDACT_2022-06-01T14:00:00.000Z_2022-06-01T15:00:00.000Z_2022-06-01T20:26:55.463Z_762]
 at 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:50:58,437 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing 
segment[REDACT_2022-07-14T18:00:00.000Z_2022-07-14T19:00:00.000Z_2022-07-14T19:24:01.698Z_22]
 at 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:50:58,437 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing 
segment[REDACT_2022-07-14T19:00:00.000Z_2022-07-14T20:00:00.000Z_2022-07-14T20:41:44.206Z_7]
 at 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:50:58,437 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing 
segment[REDACT_2022-07-15T02:00:00.000Z_2022-07-15T03:00:00.000Z_2022-07-14T21:09:08.443Z_4]
 at 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:50:58,438 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing 
segment[REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329]
 at 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:50:58,438 INFO [task-runner-0-priority-0] 
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing 
segment[REDACT_2022-05-31T16:00:00.000Z_2022-05-31T17:00:00.000Z_2022-05-31T22:18:01.994Z_1082]
 at 
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
   2022-07-14T23:50:58,445 ERROR 
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist] 
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Incremental 
persist failed: 
{class=org.apache.druid.segment.realtime.appenderator.StreamAppenderator, 
segment=REDACT_2022-05-29T09:00:00.000Z_2022-05-29T10:00:00.000Z_2022-05-31T23:59:23.591Z_782,
 dataSource=REDACT, count=0}
   2022-07-14T23:50:58,446 INFO 
[LookupExtractorFactoryContainerProvider-MainThread] 
org.apache.druid.query.lookup.LookupReferencesManager - Lookup Management loop 
exited. Lookup notices are not handled anymore.
   2022-07-14T23:50:58,447 INFO [task-runner-0-priority-0] 
org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced self 
[{"druidNode":{"service":"druid/middleManager","host":"REDACTED.host.com","bindOnHost":false,"plaintextPort":8102,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"peon","services":{"dataNodeService":{"type":"dataNodeService","tier":"_default_tier","maxSize":0,"type":"indexer-executor","serverType":"indexer-executor","priority":0},"lookupNodeService":{"type":"lookupNodeService","lookupTier":"__default"}}}].
   2022-07-14T23:50:58,449 INFO [Curator-Framework-0] 
org.apache.curator.framework.imps.CuratorFrameworkImpl - 
backgroundOperationsLoop exiting
   2022-07-14T23:50:58,451 INFO [task-runner-0-priority-0] 
org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed 
with status: {
     "id" : "index_kafka_REDACT_daa7d95ffa54138_omkgiccg",
     "status" : "SUCCESS",
     "duration" : 3823392,
     "errorMsg" : null,
     "location" : {
       "host" : null,
       "port" : -1,
       "tlsPort" : -1
     }
   }
   2022-07-14T23:50:58,453 INFO [main] org.apache.druid.cli.CliPeon - Thread 
[Thread[Thread-70,5,main]] is non daemon.
   2022-07-14T23:50:58,453 INFO [main] org.apache.druid.cli.CliPeon - Thread 
[Thread[pool-3-thread-1,5,main]] is non daemon.
   2022-07-14T23:50:58,453 INFO [main] org.apache.druid.cli.CliPeon - Thread 
[Thread[pool-3-thread-2,5,main]] is non daemon.
   2022-07-14T23:50:58,453 INFO [main] org.apache.druid.cli.CliPeon - Thread 
[Thread[pool-3-thread-3,5,main]] is non daemon.
   2022-07-14T23:50:58,453 INFO [main] org.apache.druid.cli.CliPeon - Thread 
[Thread[pool-3-thread-4,5,main]] is non daemon.
   2022-07-14T23:50:58,453 INFO [main] 
org.apache.druid.java.util.common.lifecycle.Lifecycle - Lifecycle [module] 
already stopped and stop was called. Silently skipping
   Cannot remove shutdown hook, already shutting down!
   Finished peon task
   ```
   
   In this new example, the last chatRetry was at `2022-07-14T23:48:16,222` and 
since it didn't pause at the last retry, it sent a shutdown at 
`2022-07-14T23:50:58,304`. The peon did eventually pause this time but around 
~`2022-07-14T23:50:58,430`. So if it had 1-2 more minutes it would have 
returned a successful response to the pause ("/status" check) request. 
   
   Knowing this I was able to temporarily provide some relieve and let our 
existing tasks complete (for the most part 90%+) but increasing chatRetries to 
a very high number (8 => 25) since it's 10 seconds between retries after the 
3rd retry, this allows for ~2 more minutes for the Peon to actually pause. This 
is just a workaround for now since the issue still exists with the Peons not 
pausing in a reasonable time but does provide some temporarily relieve. 
   
   Seems like we will definitely need those flame graphs @abhishekagarwal87 
mentioned to get a better idea of what the peon is working on during the pause 
request coming in
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org

Reply via email to