Sabit created KAFKA-16025:
-----------------------------
Summary: Streams StateDirectory has orphaned locks after
rebalancing, blocking future rebalancing
Key: KAFKA-16025
URL: https://issues.apache.org/jira/browse/KAFKA-16025
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 3.4.0
Environment: Linux
Reporter: Sabit
Attachments: Screenshot 1702750558363.png
Hello,
We are encountering an issue where during rebalancing, we see streams threads
on one client get stuck in rebalancing. Upon enabling debug logs, we saw that
some tasks were having issues initializing due to failure to grab a lock in the
StateDirectory:
{{{{2023-12-14 22:51:57.352000Z stream-thread
[i-0f1a5e7a42158e04b-StreamThread-14] Could not initialize task 0_51 since:
stream-thread [i-0f1a5e7a42158e04b-StreamThread-14] standby-task [0_51] Failed
to lock the state directory for task 0_51; will retry}}}}
Reading the comments for TaskManager, this seems like an expected case, however
in our situation, the thread would get stuck on this permanently. After looking
at the logs, we came to understand that whenever tasks 0_51, 0_37, 0_107,
0_219, or 0_93 were being assigned to this client, those threads would get
stuck due to being unable to grab the lock. We took a heap dump of this jvm and
found that all of these tasks were being locks by StreamThread-21 (see
attachment). Additionally, each of these task directories exist on the instance
but are empty directories.
The sequence of events that occurred for us to arrive at this state is that
initially, all of these tasks were being processed on the impacted client,
either as active or standby tasks. We had one client drop out of the consumer
group, so these tasks were rebalanced away from the client. When we try to
bring up a new client to replace the one that dropped out, the impacted client
cannot initialize these 5 tasks it was initially processing. Sample of one
timeline:
{{# Task moved away from the original consumer thread}}
{{{{2023-12-13 22:45:58.240000Z stream-thread
[i-0f1a5e7a42158e04b-StreamThread-32] standby-task [0_51] Suspended running}}}}
{{{{2023-12-13 }}}}{{{{22:45:58.263000Z stream-thread
[i-0f1a5e7a42158e04b-StreamThread-32] standby-task [0_51] Closed clean}}}}
{{# Directory cleaned up?{{{}{}}}}}
{{{{2023-12-13 }}}}{{{{22:57:18.696000Z stream-thread
[i-0f1a5e7a42158e04b-CleanupThread] Deleting obsolete state directory 0_51 for
task 0_51 as 680455ms has elapsed (cleanup delay is 600000ms).}}}}
{{{{# Cannot initialize task when it is re-assigned to this client}}}}
{{{{2023-12-14 22:51:57.352000Z stream-thread
[i-0f1a5e7a42158e04b-StreamThread-14] Could not initialize task 0_51 since:
stream-thread [i-0f1a5e7a42158e04b-StreamThread-14] standby-task [0_51] Failed
to lock the state directory for task 0_51; will retry}}}}
{{{{Reading through the StateDirectory, it wasn't immediately obvious how we
could arrive in a situation where the task is locked by a thread it hadn't been
attempted to be assigned to yet, while the directory was cleaned up, but is now
empty instead of being deleted. We didn't observe any filesystem issues on this
client around this time either.}}}}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)