YongGang commented on PR #14898: URL: https://github.com/apache/druid/pull/14898#issuecomment-1690508647
During K8s task runner testing, we saw cases where Kafka tasks were marked as failed by the supervisor even though they had succeeded. After digging into the code, we found that the task stop is triggered from [here](https://github.com/apache/druid/blob/27.0.0/indexing-service/src/main/java/org/apache/druid/indexing/seekablestream/supervisor/SeekableStreamSupervisor.java#L1898) (I haven't found out why yet). The Overlord was **bounced** around that time and leadership changed.

While looking through the logs to understand the root cause, we saw thousands of the following messages from the Overlord instance, even though it had been **elected** as the leader (the log shows "By the power of Grayskull, I have the power!"):

> Failed to get task runner because I'm not the leader!

> Failed to get task queue because I'm not the leader!

This happened because when `SeekableStreamSupervisor` started, it looped through the tasks from TaskStorage and queried TaskMaster for leader status. The query returned false because the becoming-leader process hadn't finished yet. That creates confusion: some tasks get the right TaskRunner and some don't, depending on whether leader election has finished.

Thus I propose adding a lock around the leader query so that the result is deterministic: either the Overlord is the leader or it is not, never in an intermediate state. This matters especially when the Overlord is restarting, because the becoming-leader process can take tens of seconds to finish, and it's hard to reason about what `getTaskRunner()` and `getTaskQueue()` return during that time.

I feel this is also related to [the other issue](https://github.com/apache/druid/pull/14880) I'm trying to solve: if one component depends on another, it should wait until the component it depends on has fully started or stopped (in the case above, `SeekableStreamSupervisor` -> `TaskMaster`). We use a lock to achieve that here.
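To illustrate the idea, here is a minimal sketch of guarding the leader state with a lock. It is not the actual `TaskMaster` code or the diff in this PR: the class `LockedLeaderState`, its fields, and the `String` placeholders for the runner and queue are hypothetical names chosen for the example. The point is only that the becoming-leader transition and the queries share one lock, so a query either runs before the transition starts or after it completes.

```java
import java.util.Optional;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a leader-aware component; not Druid's TaskMaster.
class LockedLeaderState {
    private final ReentrantLock lock = new ReentrantLock();

    // All three fields are guarded by `lock`.
    private boolean leader = false;
    private String taskRunner = null; // placeholder for a real TaskRunner
    private String taskQueue = null;  // placeholder for a real TaskQueue

    // The whole transition happens under the lock, so callers of the getters
    // block until it finishes instead of observing a half-initialized state.
    void becomeLeader(Runnable slowStartup) {
        lock.lock();
        try {
            slowStartup.run(); // may take tens of seconds in practice
            taskRunner = "runner";
            taskQueue = "queue";
            leader = true;
        } finally {
            lock.unlock();
        }
    }

    void stopBeingLeader() {
        lock.lock();
        try {
            leader = false;
            taskRunner = null;
            taskQueue = null;
        } finally {
            lock.unlock();
        }
    }

    // Returns the runner only when leadership is fully established.
    Optional<String> getTaskRunner() {
        lock.lock();
        try {
            return leader ? Optional.of(taskRunner) : Optional.empty();
        } finally {
            lock.unlock();
        }
    }

    // Same contract as getTaskRunner(), for the queue.
    Optional<String> getTaskQueue() {
        lock.lock();
        try {
            return leader ? Optional.of(taskQueue) : Optional.empty();
        } finally {
            lock.unlock();
        }
    }
}
```

With this shape, a supervisor thread that calls `getTaskRunner()` while another thread is inside `becomeLeader(...)` waits for the lock and then sees the final state. The trade-off is that queries can block for as long as the transition takes.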
