abhishekrb19 opened a new pull request, #19405:
URL: https://github.com/apache/druid/pull/19405

   ### Description
   
   When `ioConfig.serverPriorityToReplicas` is set, a supervisor can get stuck
   throwing `DruidException` from `computeUnassignedServerPriorities`, which
   prevents new task replicas from being created until the old tasks roll over.
   The exception is as follows:
   ```
   Found unassignedServerPriorities[[]] of size[0] < total replicas[1] for taskGroupId[0].
   Task server priorities[[1, 0]] have already been assigned to tasks[[foo]].
   ```
   
   The supervisor can remain in this unhealthy state, unable to create
   additional task replicas, until the tasks eventually roll over.
   
   ### Root cause

   `createTasksForGroup` was an *additional* writer of
   `group.taskIdToServerPriority`: it recorded the priority at task-submission
   time, before the task was ever observed in `group.tasks`. The group's
   invariant (`tasks` and `taskIdToServerPriority` stay in sync) is otherwise
   enforced by `discoverTasks` (additions) and `TaskGroup.removeTask`
   (removals), both of which operate on `group.tasks`.
   
   If a task died after submission but before the next supervisor run picked it
   up from the Overlord, it never entered `group.tasks`, so `removeTask` never
   fired. The priority entry was orphaned and made the next
   `computeUnassignedServerPriorities` call throw.
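
The failure sequence can be sketched with a toy model. This is only an illustration under stated assumptions: `OrphanedPriorityDemo`, `TaskGroupSketch`, `recordPriorityAtSubmission`, `removeDeadTasks`, and `unassignedPriorities` are hypothetical names invented here, not Druid's classes or methods, and the real supervisor logic is considerably more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class OrphanedPriorityDemo {
    // Hypothetical, simplified stand-in for a supervisor task group; not Druid's actual class.
    static class TaskGroupSketch {
        final Set<String> tasks = new HashSet<>();
        final Map<String, Integer> taskIdToServerPriority = new HashMap<>();

        // Models the extra writer: the priority is recorded at submission,
        // before the task has been observed in `tasks`.
        void recordPriorityAtSubmission(String taskId, int priority) {
            taskIdToServerPriority.put(taskId, priority);
        }

        // Models removal: both collections are cleaned up together.
        void removeTask(String taskId) {
            tasks.remove(taskId);
            taskIdToServerPriority.remove(taskId);
        }

        // Cleanup walks the tracked tasks only, so an id that never entered
        // `tasks` is never passed to removeTask.
        void removeDeadTasks(Set<String> deadTaskIds) {
            for (String taskId : new ArrayList<>(tasks)) {
                if (deadTaskIds.contains(taskId)) {
                    removeTask(taskId);
                }
            }
        }

        // Configured priorities that no recorded task currently holds.
        List<Integer> unassignedPriorities(List<Integer> configured) {
            List<Integer> unassigned = new ArrayList<>(configured);
            for (Integer assigned : taskIdToServerPriority.values()) {
                unassigned.remove(assigned); // Integer argument: removes by value, not by index
            }
            return unassigned;
        }
    }

    // Two tasks are submitted, then die before they are ever discovered.
    static TaskGroupSketch groupAfterUndiscoveredDeaths() {
        TaskGroupSketch group = new TaskGroupSketch();
        group.recordPriorityAtSubmission("foo", 1);
        group.recordPriorityAtSubmission("bar", 0);
        group.removeDeadTasks(Set.of("foo", "bar"));
        return group;
    }

    public static void main(String[] args) {
        TaskGroupSketch group = groupAfterUndiscoveredDeaths();
        System.out.println("tracked tasks: " + group.tasks);
        System.out.println("unassigned priorities: " + group.unassignedPriorities(List.of(1, 0)));
    }
}
```

In this model no task is tracked after the cleanup, yet both priorities still count as assigned, so nothing is left to hand to a new replica.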
   
   ### Fix

   This patch removes the additional writer, leaving `discoverTasks` and
   `removeTask` as the sole mutators, so `group.taskIdToServerPriority` should
   now stay in sync with `group.tasks`.
   The added unit tests fail on master without the fix.
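
The single-writer arrangement can be sketched as follows. Again this is a toy model, not the patched code: `SingleWriterDemo`, `TaskGroupSketch`, `submit`, `discoverTask`, and `inSync` are hypothetical names, and passing the priority into `discoverTask` as a parameter is an assumption made only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SingleWriterDemo {
    // Hypothetical, simplified task group; not Druid's actual class.
    static class TaskGroupSketch {
        final Set<String> tasks = new HashSet<>();
        final Map<String, Integer> taskIdToServerPriority = new HashMap<>();

        // Submission no longer touches taskIdToServerPriority.
        void submit(String taskId) {
            // Intentionally empty: the task is tracked only once it is discovered.
        }

        // Models discovery: the only place where entries are added.
        // Assumption: the priority is simply passed in by the caller.
        void discoverTask(String taskId, int priority) {
            tasks.add(taskId);
            taskIdToServerPriority.put(taskId, priority);
        }

        // Models removal: the only place where entries are dropped.
        void removeTask(String taskId) {
            tasks.remove(taskId);
            taskIdToServerPriority.remove(taskId);
        }

        // The invariant: both collections cover exactly the same task ids.
        boolean inSync() {
            return tasks.equals(taskIdToServerPriority.keySet());
        }
    }

    static TaskGroupSketch scenario() {
        TaskGroupSketch group = new TaskGroupSketch();
        group.submit("foo");          // dies before discovery: leaves no trace
        group.submit("bar");
        group.discoverTask("bar", 1); // observed on a later run
        return group;
    }

    public static void main(String[] args) {
        TaskGroupSketch group = scenario();
        System.out.println("in sync after discovery: " + group.inSync());
        group.removeTask("bar");
        System.out.println("in sync after removal: " + group.inSync());
    }
}
```

Because every addition and removal touches both collections in the same method, a task that dies before discovery cannot leave a stale priority entry behind in this model.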
   
   This PR has:
   
   - [x] been self-reviewed.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

