pchang388 commented on issue #12701:
URL: https://github.com/apache/druid/issues/12701#issuecomment-1185827594
Hi @AmatyaAvadhanula,
So I did let it run with the section commented out for a bit, and most of
them failed (~50%+) with the same error messages we've seen before but I did
see one that was different, relevant INFO level logs below (can also get you
debug logs if needed). I'm unsure at the moment if the change we made removed a
race condition but looking at the code I believe I understand why you suggested
it and looks like it could possibly do so.
Example Task: `index_kafka_REDACT_daa7d95ffa54138_omkgiccg` - this one had
the same issue where it was killed due to not pausing within the chatRetries
(default 8) limit
* Overlord Log Snippet - Shutdown task due to: `Task
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] failed to change its status from
[READING] to [PAUSED], aborting]`
```
2022-07-14T23:47:20,270 INFO [IndexTaskClient-REDACT-2]
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its
status to [PAUSED]; will try again in [PT2S]
2022-07-14T23:47:22,550 INFO [IndexTaskClient-REDACT-2]
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its
status to [PAUSED]; will try again in [PT4S]
2022-07-14T23:47:26,826 INFO [IndexTaskClient-REDACT-2]
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its
status to [PAUSED]; will try again in [PT8S]
2022-07-14T23:47:35,108 INFO [IndexTaskClient-REDACT-2]
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its
status to [PAUSED]; will try again in [PT10S]
2022-07-14T23:47:45,387 INFO [IndexTaskClient-REDACT-2]
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its
status to [PAUSED]; will try again in [PT10S]
2022-07-14T23:47:55,666 INFO [IndexTaskClient-REDACT-2]
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its
status to [PAUSED]; will try again in [PT10S]
2022-07-14T23:48:05,943 INFO [IndexTaskClient-REDACT-2]
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its
status to [PAUSED]; will try again in [PT10S]
2022-07-14T23:48:16,222 INFO [IndexTaskClient-REDACT-2]
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskClient - Still
waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to change its
status to [PAUSED]; will try again in [PT10S]
2022-07-14T23:49:27,234 INFO [qtp1886478937-103]
org.apache.druid.indexing.overlord.TaskLockbox - Added
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-06-01T14:00:00.000Z/2022-06-01T15:00:00.000Z,
version='2022-07-14T23:27:38.085Z', priority=75, revoked=false}]
2022-07-14T23:49:27,234 INFO [qtp1886478937-103]
org.apache.druid.indexing.overlord.MetadataTaskStorage - Adding lock on
interval[2022-06-01T14:00:00.000Z/2022-06-01T15:00:00.000Z]
version[2022-07-14T23:27:38.085Z] for task:
index_kafka_REDACT_daa7d95ffa54138_omkgiccg
2022-07-14T23:50:58,304 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] because: [An exception occured
while waiting for task [index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to pause:
[org.apache.druid.java.util.common.ISE: Task
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] failed to change its status from
[READING] to [PAUSED], aborting]]
2022-07-14T23:50:58,419 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.RemoteTaskRunner - Sent shutdown message to
worker: REDACTED.host.com:8091, status 200 OK, response:
{"task":"index_kafka_REDACT_daa7d95ffa54138_omkgiccg"}
2022-07-14T23:50:59,559 INFO [Curator-PathChildrenCache-1]
org.apache.druid.indexing.overlord.RemoteTaskRunner -
Worker[REDACTED.host.com:8091] wrote FAILED status for task
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] on
[TaskLocation{host='REDACTED.host.com', port=8102, tlsPort=-1}]
2022-07-14T23:50:59,559 INFO [Curator-PathChildrenCache-1]
org.apache.druid.indexing.overlord.RemoteTaskRunner -
Worker[REDACTED.host.com:8091] completed
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] with status[FAILED]
2022-07-14T23:50:59,559 INFO [Curator-PathChildrenCache-1]
org.apache.druid.indexing.overlord.TaskQueue - Received FAILED status for task:
index_kafka_REDACT_daa7d95ffa54138_omkgiccg
2022-07-14T23:52:28,726 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from activeTasks
2022-07-14T23:52:28,726 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-05-29T09:00:00.000Z/2022-05-29T10:00:00.000Z,
version='2022-07-14T23:45:45.943Z', priority=75, revoked=false}]
2022-07-14T23:52:28,958 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-05-29T10:00:00.000Z/2022-05-29T11:00:00.000Z,
version='2022-07-14T23:01:48.156Z', priority=75, revoked=false}]
2022-07-14T23:52:29,186 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-05-29T17:00:00.000Z/2022-05-29T18:00:00.000Z,
version='2022-07-14T23:18:41.116Z', priority=75, revoked=false}]
2022-07-14T23:52:29,415 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-05-31T01:00:00.000Z/2022-05-31T02:00:00.000Z,
version='2022-07-14T23:42:26.717Z', priority=75, revoked=false}]
2022-07-14T23:52:29,640 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-05-31T09:00:00.000Z/2022-05-31T10:00:00.000Z,
version='2022-07-14T23:14:55.776Z', priority=75, revoked=false}]
2022-07-14T23:52:29,886 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-05-31T16:00:00.000Z/2022-05-31T17:00:00.000Z,
version='2022-07-14T23:38:25.971Z', priority=75, revoked=false}]
2022-07-14T23:52:30,120 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-06-01T14:00:00.000Z/2022-06-01T15:00:00.000Z,
version='2022-07-14T23:27:38.085Z', priority=75, revoked=false}]
2022-07-14T23:52:30,348 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-07-07T11:00:00.000Z/2022-07-07T12:00:00.000Z,
version='2022-07-14T23:05:31.113Z', priority=75, revoked=false}]
2022-07-14T23:52:30,572 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-07-07T14:00:00.000Z/2022-07-07T15:00:00.000Z,
version='2022-07-14T22:10:07.792Z', priority=75, revoked=false}]
2022-07-14T23:52:30,797 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-07-07T15:00:00.000Z/2022-07-07T16:00:00.000Z,
version='2022-07-14T23:35:04.230Z', priority=75, revoked=false}]
2022-07-14T23:52:31,029 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-07-14T16:00:00.000Z/2022-07-14T17:00:00.000Z,
version='2022-07-14T20:20:37.672Z', priority=75, revoked=false}]
2022-07-14T23:52:31,270 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-07-14T17:00:00.000Z/2022-07-14T18:00:00.000Z,
version='2022-07-14T21:50:01.268Z', priority=75, revoked=false}]
2022-07-14T23:52:31,517 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-07-14T18:00:00.000Z/2022-07-14T19:00:00.000Z,
version='2022-07-14T21:44:27.183Z', priority=75, revoked=false}]
2022-07-14T23:52:31,746 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-07-14T19:00:00.000Z/2022-07-14T20:00:00.000Z,
version='2022-07-14T23:06:16.008Z', priority=75, revoked=false}]
2022-07-14T23:52:31,988 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-07-15T01:00:00.000Z/2022-07-15T02:00:00.000Z,
version='2022-07-14T22:57:43.141Z', priority=75, revoked=false}]
2022-07-14T23:52:32,214 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskLockbox - Removing
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] from
TaskLock[TimeChunkLock{type=EXCLUSIVE, groupId='index_kafka_REDACT',
dataSource='REDACT',
interval=2022-07-15T02:00:00.000Z/2022-07-15T03:00:00.000Z,
version='2022-07-14T23:10:47.272Z', priority=75, revoked=false}]
2022-07-14T23:52:32,569 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.MetadataTaskStorage - Updating task
index_kafka_REDACT_daa7d95ffa54138_omkgiccg to status:
TaskStatus{id=index_kafka_REDACT_daa7d95ffa54138_omkgiccg, status=FAILED,
duration=-1, errorMsg=An exception occured while waiting for task
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] to pause: ...}
2022-07-14T23:52:32,690 INFO [KafkaSupervisor-REDACT-Worker-1]
org.apache.druid.indexing.overlord.TaskQueue - Task done:
AbstractTask{id='index_kafka_REDACT_daa7d95ffa54138_omkgiccg',
groupId='index_kafka_REDACT',
taskResource=TaskResource{availabilityGroup='index_kafka_REDACT_daa7d95ffa54138',
requiredCapacity=1}, dataSource='REDACT',
context={checkpoints={"0":{"5":13421694182}},
useLineageBasedSegmentAllocation=true, IS_INCREMENTAL_HANDOFF_SUPPORTED=true,
forceTimeChunkLock=true}}
2022-07-14T23:53:44,018 INFO [Curator-PathChildrenCache-1]
org.apache.druid.indexing.overlord.RemoteTaskRunner - Shutdown
[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] because: [notified status change
from task]
2022-07-14T23:53:44,018 INFO [Curator-PathChildrenCache-1]
org.apache.druid.indexing.overlord.RemoteTaskRunner - Cleaning up
task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] on
worker[REDACTED.host.com:8091]
2022-07-14T23:53:44,076 WARN [Curator-PathChildrenCache-1]
org.apache.druid.indexing.overlord.TaskQueue - Unknown task completed:
index_kafka_REDACT_daa7d95ffa54138_omkgiccg
2022-07-14T23:53:44,076 INFO [Curator-PathChildrenCache-1]
org.apache.druid.indexing.overlord.TaskQueue - Task FAILED:
AbstractTask{id='index_kafka_REDACT_daa7d95ffa54138_omkgiccg',
groupId='index_kafka_REDACT',
taskResource=TaskResource{availabilityGroup='index_kafka_REDACT_daa7d95ffa54138',
requiredCapacity=1}, dataSource='REDACT',
context={checkpoints={"0":{"5":13421694182}},
useLineageBasedSegmentAllocation=true, IS_INCREMENTAL_HANDOFF_SUPPORTED=true,
forceTimeChunkLock=true}} (3828502 run duration)
2022-07-14T23:53:44,076 INFO [Curator-PathChildrenCache-1]
org.apache.druid.indexing.overlord.RemoteTaskRunner -
Task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg] went bye bye.
```
* Peon Log Snippet - was actually in `PAUSED` state during the shutdown -
`Stopping forcefully (status: [PAUSED])`
```
2022-07-14T23:35:25,545 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing
segment[REDACT_2022-07-07T15:00:00.000Z_2022-07-07T16:00:00.000Z_2022-07-07T14:57:55.341Z_560]
at existing
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:38:47,724 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing
segment[REDACT_2022-05-31T16:00:00.000Z_2022-05-31T17:00:00.000Z_2022-05-31T22:18:01.994Z_1082]
at existing
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:42:48,993 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing
segment[REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329]
at existing
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:42:48,994 INFO [task-runner-0-priority-0]
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Persisted
rows[302,835] and (estimated) bytes[482,175,309]
2022-07-14T23:42:49,019 INFO
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist]
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed
in-memory data for
segment[REDACT_2022-07-07T15:00:00.000Z_2022-07-07T16:00:00.000Z_2022-07-07T14:57:55.341Z_560]
spill[0] to disk in [24] ms (1,000 rows).
2022-07-14T23:42:50,562 INFO
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist]
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed
in-memory data for
segment[REDACT_2022-07-14T18:00:00.000Z_2022-07-14T19:00:00.000Z_2022-07-14T19:24:01.698Z_22]
spill[7] to disk in [1,542] ms (136,567 rows).
2022-07-14T23:42:53,411 INFO
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist]
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed
in-memory data for
segment[REDACT_2022-07-14T19:00:00.000Z_2022-07-14T20:00:00.000Z_2022-07-14T20:41:44.206Z_7]
spill[8] to disk in [2,837] ms (164,266 rows).
2022-07-14T23:42:53,444 INFO
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist]
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed
in-memory data for
segment[REDACT_2022-07-15T02:00:00.000Z_2022-07-15T03:00:00.000Z_2022-07-14T21:09:08.443Z_4]
spill[5] to disk in [18] ms (1 rows).
2022-07-14T23:42:53,497 INFO
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist]
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed
in-memory data for
segment[REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329]
spill[0] to disk in [52] ms (1 rows).
2022-07-14T23:42:53,519 INFO
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist]
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed
in-memory data for
segment[REDACT_2022-05-31T16:00:00.000Z_2022-05-31T17:00:00.000Z_2022-05-31T22:18:01.994Z_1082]
spill[0] to disk in [20] ms (1,000 rows).
2022-07-14T23:42:53,520 INFO
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist]
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Flushed
in-memory data with commit metadata
[AppenderatorDriverMetadata{segments={index_kafka_REDACT_daa7d95ffa54138_1=[SegmentWithState{segmentIdentifier=REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329,
state=APPENDING},
SegmentWithState{segmentIdentifier=REDACT_2022-05-31T16:00:00.000Z_2022-05-31T17:00:00.000Z_2022-05-31T22:18:01.994Z_1082,
state=APPENDING},
SegmentWithState{segmentIdentifier=REDACT_2022-07-07T15:00:00.000Z_2022-07-07T16:00:00.000Z_2022-07-07T14:57:55.341Z_560,
state=APPENDING},
SegmentWithState{segmentIdentifier=REDACT_2022-07-14T18:00:00.000Z_2022-07-14T19:00:00.000Z_2022-07-14T19:24:01.698Z_22,
state=APPENDING},
SegmentWithState{segmentIdentifier=REDACT_2022-07-14T19:00:00.000Z_2022-07-14T20:00:00.000Z_2022-07-14T20:41:44.206Z_7,
state=APPENDING}, SegmentWith
State{segmentIdentifier=REDACT_2022-07-15T02:00:00.000Z_2022-07-15T03:00:00.000Z_2022-07-14T21:09:08.443Z_4,
state=APPENDING}]},
lastSegmentIds={index_kafka_REDACT_daa7d95ffa54138_1=REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329},
callerMetadata={nextPartitions=SeekableStreamEndSequenceNumbers{stream='REDACT',
partitionSequenceNumberMap={5=13433730900}}}}] for segments:
REDACT_2022-07-07T15:00:00.000Z_2022-07-07T16:00:00.000Z_2022-07-07T14:57:55.341Z_560,
REDACT_2022-07-14T18:00:00.000Z_2022-07-14T19:00:00.000Z_2022-07-14T19:24:01.698Z_22,
REDACT_2022-07-14T19:00:00.000Z_2022-07-14T20:00:00.000Z_2022-07-14T20:41:44.206Z_7,
REDACT_2022-07-15T02:00:00.000Z_2022-07-15T03:00:00.000Z_2022-07-14T21:09:08.443Z_4,
REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329,
REDACT_2022-05-31T16:00:00.000Z_2022-05-31T17:00:00.000Z_2022-05-31T22:18:01.994Z_1082
2022-07-14T23:42:53,520 INFO
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist]
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Persisted
stats: processed rows: [12037718], persisted rows[302835], sinks: [6], total
fireHydrants (across sinks): [26], persisted fireHydrants (across sinks): [6]
2022-07-14T23:46:08,182 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing
segment[REDACT_2022-05-29T09:00:00.000Z_2022-05-29T10:00:00.000Z_2022-05-31T23:59:23.591Z_782]
at existing
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:49:27,388 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Announcing
segment[REDACT_2022-06-01T14:00:00.000Z_2022-06-01T15:00:00.000Z_2022-06-01T20:26:55.463Z_762]
at existing
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:50:58,392 INFO [parent-monitor-0]
org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Triggering JVM
shutdown.
2022-07-14T23:50:58,392 INFO [Thread-70] org.apache.druid.cli.CliPeon -
Running shutdown hook
2022-07-14T23:50:58,392 INFO [Thread-70]
org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle
[module] stage [ANNOUNCEMENTS]
2022-07-14T23:50:58,398 INFO [Thread-70]
org.apache.druid.curator.announcement.Announcer - Unannouncing
[/druid/announcements/REDACTED.host.com:8102]
2022-07-14T23:50:58,414 INFO [Thread-70]
org.apache.druid.curator.announcement.Announcer - Unannouncing
[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:50:58,417 INFO [Thread-70]
org.apache.druid.curator.announcement.Announcer - Unannouncing
[/druid/internal-discovery/PEON/REDACTED.host.com:8102]
2022-07-14T23:50:58,420 INFO [Thread-70]
org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle
[module] stage [SERVER]
2022-07-14T23:50:58,426 INFO [Thread-70]
org.eclipse.jetty.server.AbstractConnector - Stopped
ServerConnector@70cd122{HTTP/1.1, (http/1.1)}{0.0.0.0:8102}
2022-07-14T23:50:58,426 INFO [Thread-70] org.eclipse.jetty.server.session -
node0 Stopped scavenging
2022-07-14T23:50:58,427 INFO [Thread-70]
org.eclipse.jetty.server.handler.ContextHandler - Stopped
o.e.j.s.ServletContextHandler@5a6af2c4{/,null,STOPPED}
2022-07-14T23:50:58,429 INFO [Thread-70]
org.apache.druid.java.util.common.lifecycle.Lifecycle - Stopping lifecycle
[module] stage [NORMAL]
2022-07-14T23:50:58,429 INFO [Thread-70]
org.apache.druid.server.coordination.ZkCoordinator - Stopping ZkCoordinator for
[DruidServerMetadata{name='REDACTED.host.com:8102',
hostAndPort='REDACTED.host.com:8102', hostAndTlsPort='null', maxSize=0,
tier='_default_tier', type=indexer-executor, priority=0}]
2022-07-14T23:50:58,430 INFO [Thread-70]
org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopping...
2022-07-14T23:50:58,430 INFO [Thread-70]
org.apache.druid.server.coordination.SegmentLoadDropHandler - Stopped.
2022-07-14T23:50:58,430 INFO [Thread-70]
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner - Starting
graceful shutdown of task[index_kafka_REDACT_daa7d95ffa54138_omkgiccg].
2022-07-14T23:50:58,430 INFO [Thread-70]
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner -
Stopping forcefully (status: [PAUSED])
2022-07-14T23:50:58,431 ERROR [task-runner-0-priority-0]
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner -
Encountered exception in run() before persisting.
java.lang.InterruptedException: null
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
~[?:1.8.0_322]
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
~[?:1.8.0_322]
at
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.possiblyPause(SeekableStreamIndexTaskRunner.java:1303)
~[druid-indexing-service-0.23.0.jar:0.23.0]
at
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.runInternal(SeekableStreamIndexTaskRunner.java:579)
[druid-indexing-service-0.23.0.jar:0.23.0]
at
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTaskRunner.run(SeekableStreamIndexTaskRunner.java:265)
[druid-indexing-service-0.23.0.jar:0.23.0]
at
org.apache.druid.indexing.seekablestream.SeekableStreamIndexTask.run(SeekableStreamIndexTask.java:149)
[druid-indexing-service-0.23.0.jar:0.23.0]
at
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:477)
[druid-indexing-service-0.23.0.jar:0.23.0]
at
org.apache.druid.indexing.overlord.SingleTaskBackgroundRunner$SingleTaskBackgroundRunnerCallable.call(SingleTaskBackgroundRunner.java:449)
[druid-indexing-service-0.23.0.jar:0.23.0]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[?:1.8.0_322]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[?:1.8.0_322]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[?:1.8.0_322]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_322]
2022-07-14T23:50:58,433 INFO [task-runner-0-priority-0]
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Persisted
rows[2,499] and (estimated) bytes[2,953,857]
2022-07-14T23:50:58,433 INFO [task-runner-0-priority-0]
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer
clientId=consumer-kafka-supervisor-ogdbelko-1,
groupId=kafka-supervisor-ogdbelko] Resetting generation due to: consumer
pro-actively leaving the group
2022-07-14T23:50:58,433 INFO [task-runner-0-priority-0]
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator - [Consumer
clientId=consumer-kafka-supervisor-ogdbelko-1,
groupId=kafka-supervisor-ogdbelko] Request joining group due to: consumer
pro-actively leaving the group
2022-07-14T23:50:58,434 INFO [task-runner-0-priority-0]
org.apache.kafka.common.metrics.Metrics - Metrics scheduler closed
2022-07-14T23:50:58,434 INFO [task-runner-0-priority-0]
org.apache.kafka.common.metrics.Metrics - Closing reporter
org.apache.kafka.common.metrics.JmxReporter
2022-07-14T23:50:58,434 INFO [task-runner-0-priority-0]
org.apache.kafka.common.metrics.Metrics - Metrics reporters closed
2022-07-14T23:50:58,436 INFO [task-runner-0-priority-0]
org.apache.kafka.common.utils.AppInfoParser - App info kafka.consumer for
consumer-kafka-supervisor-ogdbelko-1 unregistered
2022-07-14T23:50:58,436 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing
segment[REDACT_2022-05-29T09:00:00.000Z_2022-05-29T10:00:00.000Z_2022-05-31T23:59:23.591Z_782]
at
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:50:58,437 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing
segment[REDACT_2022-07-07T15:00:00.000Z_2022-07-07T16:00:00.000Z_2022-07-07T14:57:55.341Z_560]
at
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:50:58,437 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing
segment[REDACT_2022-06-01T14:00:00.000Z_2022-06-01T15:00:00.000Z_2022-06-01T20:26:55.463Z_762]
at
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:50:58,437 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing
segment[REDACT_2022-07-14T18:00:00.000Z_2022-07-14T19:00:00.000Z_2022-07-14T19:24:01.698Z_22]
at
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:50:58,437 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing
segment[REDACT_2022-07-14T19:00:00.000Z_2022-07-14T20:00:00.000Z_2022-07-14T20:41:44.206Z_7]
at
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:50:58,437 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing
segment[REDACT_2022-07-15T02:00:00.000Z_2022-07-15T03:00:00.000Z_2022-07-14T21:09:08.443Z_4]
at
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:50:58,438 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing
segment[REDACT_2022-05-31T01:00:00.000Z_2022-05-31T02:00:00.000Z_2022-06-06T12:50:29.291Z_329]
at
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:50:58,438 INFO [task-runner-0-priority-0]
org.apache.druid.server.coordination.BatchDataSegmentAnnouncer - Unannouncing
segment[REDACT_2022-05-31T16:00:00.000Z_2022-05-31T17:00:00.000Z_2022-05-31T22:18:01.994Z_1082]
at
path[/druid/segments/REDACTED.host.com:8102/REDACTED.host.com:8102_indexer-executor__default_tier_2022-07-14T22:49:49.993Z_d79a1d9505724d35a787458d655501b40]
2022-07-14T23:50:58,445 ERROR
[[index_kafka_REDACT_daa7d95ffa54138_omkgiccg]-appenderator-persist]
org.apache.druid.segment.realtime.appenderator.StreamAppenderator - Incremental
persist failed:
{class=org.apache.druid.segment.realtime.appenderator.StreamAppenderator,
segment=REDACT_2022-05-29T09:00:00.000Z_2022-05-29T10:00:00.000Z_2022-05-31T23:59:23.591Z_782,
dataSource=REDACT, count=0}
2022-07-14T23:50:58,446 INFO
[LookupExtractorFactoryContainerProvider-MainThread]
org.apache.druid.query.lookup.LookupReferencesManager - Lookup Management loop
exited. Lookup notices are not handled anymore.
2022-07-14T23:50:58,447 INFO [task-runner-0-priority-0]
org.apache.druid.curator.discovery.CuratorDruidNodeAnnouncer - Unannounced self
[{"druidNode":{"service":"druid/middleManager","host":"REDACTED.host.com","bindOnHost":false,"plaintextPort":8102,"port":-1,"tlsPort":-1,"enablePlaintextPort":true,"enableTlsPort":false},"nodeType":"peon","services":{"dataNodeService":{"type":"dataNodeService","tier":"_default_tier","maxSize":0,"type":"indexer-executor","serverType":"indexer-executor","priority":0},"lookupNodeService":{"type":"lookupNodeService","lookupTier":"__default"}}}].
2022-07-14T23:50:58,449 INFO [Curator-Framework-0]
org.apache.curator.framework.imps.CuratorFrameworkImpl -
backgroundOperationsLoop exiting
2022-07-14T23:50:58,451 INFO [task-runner-0-priority-0]
org.apache.druid.indexing.worker.executor.ExecutorLifecycle - Task completed
with status: {
"id" : "index_kafka_REDACT_daa7d95ffa54138_omkgiccg",
"status" : "SUCCESS",
"duration" : 3823392,
"errorMsg" : null,
"location" : {
"host" : null,
"port" : -1,
"tlsPort" : -1
}
}
2022-07-14T23:50:58,453 INFO [main] org.apache.druid.cli.CliPeon - Thread
[Thread[Thread-70,5,main]] is non daemon.
2022-07-14T23:50:58,453 INFO [main] org.apache.druid.cli.CliPeon - Thread
[Thread[pool-3-thread-1,5,main]] is non daemon.
2022-07-14T23:50:58,453 INFO [main] org.apache.druid.cli.CliPeon - Thread
[Thread[pool-3-thread-2,5,main]] is non daemon.
2022-07-14T23:50:58,453 INFO [main] org.apache.druid.cli.CliPeon - Thread
[Thread[pool-3-thread-3,5,main]] is non daemon.
2022-07-14T23:50:58,453 INFO [main] org.apache.druid.cli.CliPeon - Thread
[Thread[pool-3-thread-4,5,main]] is non daemon.
2022-07-14T23:50:58,453 INFO [main]
org.apache.druid.java.util.common.lifecycle.Lifecycle - Lifecycle [module]
already stopped and stop was called. Silently skipping
Cannot remove shutdown hook, already shutting down!
Finished peon task
```
In this new example, the last chatRetry was at `2022-07-14T23:48:16,222` and
since it didn't pause at the last retry, it sent a shutdown at
`2022-07-14T23:50:58,304`. The peon did eventually pause this time but around
~`2022-07-14T23:50:58,430`. So if it had 1-2 more minutes it would have
returned a successful response to the pause ("/status" check) request.
Knowing this I was able to temporarily provide some relieve and let our
existing tasks complete (for the most part 90%+) but increasing chatRetries to
a very high number (8 => 25) since it's 10 seconds between retries after the
3rd retry, this allows for ~2 more minutes for the Peon to actually pause. This
is just a workaround for now since the issue still exists with the Peons not
pausing in a reasonable time but does provide some temporarily relieve.
Seems like we will definitely need those flame graphs @abhishekagarwal87
mentioned to get a better idea of what the peon is working on during the pause
request coming in
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]