Hi, I've been working on this issue, and I would like to get your feedback on the following approach. The idea is that instead of failing in `TaskSetManager.abortIfCompletelyBlacklisted`, when a task cannot be scheduled in any executor but dynamic allocation is enabled, we will register this task with `ExecutorAllocationManager`. Then `ExecutorAllocationManager` will request additional executors for these "unscheduleable tasks" by increasing the value returned in `ExecutorAllocationManager.maxNumExecutorsNeeded`. This way we are counting these tasks twice, but this makes sense because the current executors don't have any slot for these tasks, so we actually want to get new executors that are able to run these tasks. To avoid a deadlock due to tasks being unscheduleable forever, we store the timestamp when a task was registered as unscheduleable, and in `ExecutorAllocationManager.schedule` we abort the application if there is some task that has been unscheduleable for a configurable age threshold. This way we give an opportunity to dynamic allocation to get more executors that are able to run the tasks, but we don't make the application wait forever.
Attached to the JIRA is a patch with a draft for this approach. Looking forward to your feedback on this.