georgew5656 commented on PR #14898: URL: https://github.com/apache/druid/pull/14898#issuecomment-1692360707
> Thanks for the explanation, @YongGang.
>
> > This was due to when SeekableStreamSupervisor started, it looped through the tasks from TaskStorage and query TaskMaster for leader status. The query returns false as the becoming leader process hasn't finished yet. But that created confusion as some tasks can get right TaskRunner some doesn't, all depends on whether leader election finished.
>
> Okay, so IIUC, the supervisor has started on a node which has been recently elected leader but not fully ready yet.
>
> I think the fix should be more along the lines of supervisor waiting for the leader election to be complete before starting its duties. Alternatively, `SupervisorManager` itself should become active after leader election is complete. This might have the following impact:
>
> * While leader election is in progress, we cannot perform any supervisor CRUD, which makes sense
> * SupervisorManager initialization would be delayed a little
> * Others?
>
> In my opinion, these side effects are only to be expected. It is better than being in a state where an Overlord starts doing things before it has properly become the leader and fails inevitably.
>
> Let me know what you think.

I think we may have found a simpler solution here: move some logic from restore() (which is called asynchronously in Task.manage) to start() (which is called during LifecycleStart of the taskRunner, before SupervisorManager starts).
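
A rough sketch of that ordering, to make the idea concrete. This is not the actual Druid code: `SketchTaskRunner`, `locate`, and the two maps are hypothetical names made up for illustration. The only point is that task discovery happens in the synchronous start() step rather than in the later, asynchronous restore().

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a task runner; not the real Druid class.
public class SketchTaskRunner {
    // Tasks the runner currently knows about (taskId -> location).
    private final Map<String, String> knownTasks = new ConcurrentHashMap<>();
    // Assumed source of already-running tasks, e.g. what a cluster query would return.
    private final Map<String, String> runningTasks;

    public SketchTaskRunner(Map<String, String> runningTasks) {
        this.runningTasks = runningTasks;
    }

    // Runs synchronously during lifecycle start, before any supervisor starts.
    public void start() {
        knownTasks.putAll(runningTasks);
    }

    // Runs asynchronously later; in this sketch it no longer discovers tasks.
    public void restore() {
        // Intentionally empty: discovery moved to start().
    }

    // What a supervisor would call when looking up a task.
    public Optional<String> locate(String taskId) {
        return Optional.ofNullable(knownTasks.get(taskId));
    }

    public static void main(String[] args) {
        SketchTaskRunner runner = new SketchTaskRunner(Map.of("task-1", "pod-a"));
        runner.start();
        // A supervisor starting after this point already sees the task.
        System.out.println(runner.locate("task-1").isPresent());
    }
}
```

With this ordering, a supervisor that starts after the runner's start() sees every task regardless of whether restore() has run yet.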
