abhishekrb19 opened a new pull request, #19405:
URL: https://github.com/apache/druid/pull/19405
### Description
When `ioConfig.serverPriorityToReplicas` is set, a supervisor can get stuck throwing a
`DruidException` from `computeUnassignedServerPriorities`, which prevents any new task
replicas from being created for the affected task group until the old tasks roll over.
The exception is as follows:
```
Found unassignedServerPriorities[[]] of size[0] < total replicas[1] for
taskGroupId[0].
Task server priorities[[1, 0]] have already been assigned to tasks[[foo]].
```
The supervisor can remain in this unhealthy state, unable to create
additional task replicas, until the existing tasks eventually roll over.
### Root cause
`createTasksForGroup` was an *additional* writer of `group.taskIdToServerPriority`: it
recorded the priority at task-submission time, before the task was ever observed in
`group.tasks`. The group's invariant (`tasks` and `taskIdToServerPriority` stay in sync)
is otherwise enforced by `discoverTasks` (additions) and `TaskGroup.removeTask`
(removals), both of which operate on `group.tasks`.
If a task died after submission but before the next supervisor run picked it up from
the overlord, it never entered `group.tasks`, so `removeTask` never fired for it. The
priority entry was therefore orphaned and made the next
`computeUnassignedServerPriorities` call throw.
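The failure mode can be reproduced in a toy model. The sketch below is not Druid's code: `TaskGroupSketch`, `discoverTask`, `unassignedPriorities`, and `OrphanDemo` are hypothetical stand-ins that only mirror the roles described above, assuming the unassigned priorities are the configured ones minus those already recorded in the priority map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for a supervisor task group; not Druid's actual class.
class TaskGroupSketch
{
  final Set<String> tasks = new HashSet<>();
  final Map<String, Integer> taskIdToServerPriority = new HashMap<>();

  // Plays the role of discoverTasks: adds to both collections together.
  void discoverTask(String taskId, int priority)
  {
    tasks.add(taskId);
    taskIdToServerPriority.put(taskId, priority);
  }

  // Plays the role of removeTask: only cleans up tasks that were observed.
  void removeTask(String taskId)
  {
    if (tasks.remove(taskId)) {
      taskIdToServerPriority.remove(taskId);
    }
  }

  // Configured priorities that no recorded task currently holds (assumed semantics).
  List<Integer> unassignedPriorities(List<Integer> configured)
  {
    List<Integer> unassigned = new ArrayList<>(configured);
    for (Integer assigned : taskIdToServerPriority.values()) {
      unassigned.remove(assigned); // remove(Object): drops the first matching priority
    }
    return unassigned;
  }
}

class OrphanDemo
{
  static List<Integer> orphanScenario()
  {
    TaskGroupSketch group = new TaskGroupSketch();
    group.discoverTask("foo", 1);

    // The extra writer: priority recorded at submission, before "bar" is observed.
    group.taskIdToServerPriority.put("bar", 0);

    // "bar" dies before discovery, so it never enters tasks and removeTask is
    // never invoked for it. The entry for priority 0 stays behind.
    return group.unassignedPriorities(List.of(1, 0));
  }
}
```

In this model `orphanScenario()` returns an empty list even though only `foo` is a known task, which matches the shape of the exception message: both priorities look assigned, yet one replica is still missing.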
### Fix
This patch removes the additional writer, leaving `discoverTasks` and `removeTask`
as the sole mutators. `group.taskIdToServerPriority` should now stay in sync with
`group.tasks`.
The added unit tests fail on master without the fix.
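As an illustration of the invariant the fix relies on, a test-style check might look for priority entries that have no matching task. This is only a sketch: `SyncInvariantSketch` and `orphanedIds` are hypothetical names, and the actual tests in this PR may assert the condition differently.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; not part of Druid.
class SyncInvariantSketch
{
  // Returns task IDs present in the priority map but absent from tasks (orphans).
  static Set<String> orphanedIds(Set<String> tasks, Map<String, Integer> taskIdToServerPriority)
  {
    Set<String> orphans = new HashSet<>(taskIdToServerPriority.keySet());
    orphans.removeAll(tasks);
    return orphans;
  }
}
```

With `discoverTasks` and `removeTask` as the only mutators, such a check would be expected to return an empty set after every supervisor run.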
This PR has:
- [x] been self-reviewed.
- [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [x] added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.