vamossagar12 opened a new pull request, #13453: URL: https://github.com/apache/kafka/pull/13453
During fast consecutive rebalances in which a task is revoked from one worker and assigned to another, a small time window has been observed that creates a race condition: a RUNNING status record in the new generation is produced and is immediately followed by a delayed UNASSIGNED status record belonging to the same or a previous generation, before the worker that sends the UNASSIGNED record has read the RUNNING status record corresponding to the latest generation. Although this doesn't inhibit the actual execution of tasks, it causes an incorrect status (i.e., UNASSIGNED) to be reported for those tasks. If users have set up some kind of monitoring on task status, this could lead to false alarms, for example.

This PR aims to solve this problem by checking whether a status message is stale after reading it, and updating the status only when it is safe to do so. Note that it reuses the existing `canWriteSafely` method to check for staleness, so that method can be renamed if needed (which I haven't done in this PR). Also, the original description of the ticket only talks about the RUNNING/UNASSIGNED states, but this PR should ideally help filter out all stale messages (which might still be infrequent but are worth handling, imo).
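To illustrate the idea of dropping stale records on the read path, here is a minimal, self-contained sketch. It is not the code in this PR and does not reproduce `canWriteSafely`: the `TaskStatus` and `StatusCache` classes, the `onRead` and `isStale` methods, and the rule itself (a record is stale when it comes from a different worker and carries an older generation than the cached one) are all assumptions made for illustration. In particular, the sketch only covers a delayed record from a previous generation; the same-generation case mentioned above would need additional logic that is not shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of a task status record (not a Kafka Connect class).
final class TaskStatus {
    final String state;     // e.g. "RUNNING" or "UNASSIGNED"
    final String workerId;  // worker that produced the record
    final int generation;   // rebalance generation the record belongs to

    TaskStatus(String state, String workerId, int generation) {
        this.state = state;
        this.workerId = workerId;
        this.generation = generation;
    }
}

// Hypothetical in-memory view of the latest status per task.
final class StatusCache {
    private final Map<String, TaskStatus> latest = new HashMap<>();

    // Assumed staleness rule: a record from another worker with an older
    // generation than the cached record must not overwrite it.
    static boolean isStale(TaskStatus cached, TaskStatus incoming) {
        return !cached.workerId.equals(incoming.workerId)
                && incoming.generation < cached.generation;
    }

    // Called for each status record read back from the status topic.
    // Returns true if the record was applied, false if it was dropped as stale.
    boolean onRead(String taskId, TaskStatus incoming) {
        TaskStatus cached = latest.get(taskId);
        if (cached != null && isStale(cached, incoming)) {
            return false;
        }
        latest.put(taskId, incoming);
        return true;
    }

    TaskStatus get(String taskId) {
        return latest.get(taskId);
    }
}
```

With this rule, a RUNNING record from `worker-B` in generation 5 followed by a delayed UNASSIGNED record from `worker-A` in generation 4 leaves the cached status at RUNNING, which is the behavior the description above is after.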