[ 
https://issues.apache.org/jira/browse/KAFKA-16025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sabit updated KAFKA-16025:
--------------------------
    Description: 
Hello,

 

We are encountering an issue where, during rebalancing, stream threads on one
client get stuck in the rebalancing state. Upon enabling debug logs, we saw that
some tasks were failing to initialize because they could not grab the lock in
the StateDirectory:

 

{{2023-12-14 22:51:57.352000Z stream-thread 
[i-0f1a5e7a42158e04b-StreamThread-14] Could not initialize task 0_51 since: 
stream-thread [i-0f1a5e7a42158e04b-StreamThread-14] standby-task [0_51] Failed 
to lock the state directory for task 0_51; will retry}}

 

Reading the comments for TaskManager, this seems like an expected case; however,
in our situation the thread would get stuck on this permanently. After looking
at the logs, we came to understand that whenever tasks 0_51, 0_37, 0_107,
0_219, or 0_93 were assigned to this client, the assigned threads would get
stuck because they could not grab the lock. We took a heap dump of this JVM and
found that all of these tasks were locked by StreamThread-21 (see attachment).
Additionally, each of these task directories exists on the client but is empty.
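
For context on what the heap dump shows: as far as we understand, in 3.4 the
StateDirectory tracks per-task locks in an in-memory map from task id to the
owning thread (rather than per-task OS file locks), which is why the dump can
show a task such as 0_51 as owned by StreamThread-21 even though its directory
on disk is empty. The sketch below is a simplified illustration of that
pattern, not the actual Kafka code; field and method names are approximations.

{code:java}
// Simplified sketch (not the actual StateDirectory code) of an in-memory
// task-lock registry: a map from task id to the owning stream thread.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TaskLockRegistrySketch {
    // In a heap dump, entries in a map like this are what show a task such as
    // 0_51 as "locked by StreamThread-21".
    private final Map<String, Thread> lockedTasksToOwner = new ConcurrentHashMap<>();

    boolean lock(final String taskId) {
        final Thread owner = lockedTasksToOwner.putIfAbsent(taskId, Thread.currentThread());
        // The lock succeeds if nobody held it, or if the caller already holds it.
        return owner == null || owner == Thread.currentThread();
    }

    void unlock(final String taskId) {
        // Only the owning thread removes its entry; if unlock is never called,
        // the entry stays behind and every other thread's lock() keeps failing.
        lockedTasksToOwner.remove(taskId, Thread.currentThread());
    }
}
{code}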

 

The sequence of events that led to this state: initially, all of these tasks
were being processed on the impacted client, either as active or standby tasks.
One client dropped out of the consumer group, so these tasks were rebalanced
away from the impacted client. When we tried to bring up a new client to
replace the one that dropped out, the impacted client could not initialize
these 5 tasks it had originally been processing. Sample of one timeline:

 

{{# Task moved away from the original consumer thread}}

{{2023-12-13 22:45:58.240000Z stream-thread 
[i-0f1a5e7a42158e04b-StreamThread-32] standby-task [0_51] Suspended running}}

{{2023-12-13 22:45:58.263000Z stream-thread 
[i-0f1a5e7a42158e04b-StreamThread-32] standby-task [0_51] Closed clean}}

{{# Directory cleaned up}}

{{2023-12-13 22:57:18.696000Z stream-thread [i-0f1a5e7a42158e04b-CleanupThread] 
Deleting obsolete state directory 0_51 for task 0_51 as 680455ms has elapsed 
(cleanup delay is 600000ms).}}

{{# Cannot initialize task when it is re-assigned to this client}}

{{2023-12-14 22:51:57.352000Z stream-thread 
[i-0f1a5e7a42158e04b-StreamThread-14] Could not initialize task 0_51 since: 
stream-thread [i-0f1a5e7a42158e04b-StreamThread-14] standby-task [0_51] Failed 
to lock the state directory for task 0_51; will retry}}
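
The cleanup delay referenced in the timeline above is the
{{state.cleanup.delay.ms}} setting (600000 ms by default), which controls how
long an unassigned task directory must sit idle before the CleanupThread
deletes it. A minimal config sketch, with a placeholder application id and
bootstrap servers:

{code:java}
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class CleanupDelayConfig {
    public static Properties streamsProps() {
        final Properties props = new Properties();
        // Placeholders; not taken from the affected application.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Matches the "cleanup delay is 600000ms" value in the log above.
        props.put(StreamsConfig.STATE_CLEANUP_DELAY_MS_CONFIG, 600_000L);
        return props;
    }
}
{code}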

 

Reading through the StateDirectory code, it wasn't immediately obvious how we
could arrive at a situation where the task is locked by a thread it had never
been assigned to, and where the directory was cleaned up but is now empty
instead of being deleted. We didn't observe any filesystem issues on this
client around this time either.
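
To make the suspected failure mode concrete, below is a toy, self-contained
illustration (not the actual Kafka code) of how an in-memory lock entry can be
orphaned: a thread takes the lock for a task and never releases it, so after
the on-disk directory has been cleaned up, every later attempt to initialize
that task on the same client fails to lock and retries forever. The thread and
task names simply mirror the ones in our logs.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class OrphanedLockDemo {
    private static final Map<String, Thread> lockedTasksToOwner = new ConcurrentHashMap<>();

    private static boolean tryLock(final String taskId) {
        final Thread owner = lockedTasksToOwner.putIfAbsent(taskId, Thread.currentThread());
        return owner == null || owner == Thread.currentThread();
    }

    public static void main(final String[] args) throws InterruptedException {
        final String taskId = "0_51";

        // "StreamThread-21" takes the lock and exits without ever releasing it.
        final Thread streamThread21 = new Thread(() -> tryLock(taskId), "StreamThread-21");
        streamThread21.start();
        streamThread21.join();

        // Any later assignment of 0_51 on this client now fails to lock,
        // matching the "Failed to lock the state directory ... will retry" loop.
        final Thread streamThread14 = new Thread(
            () -> System.out.println("StreamThread-14 locked " + taskId + "? " + tryLock(taskId)),
            "StreamThread-14");
        streamThread14.start();
        streamThread14.join();

        System.out.println("Owner still recorded: " + lockedTasksToOwner.get(taskId).getName());
    }
}
{code}

Running this prints {{false}} for the second lock attempt and
{{StreamThread-21}} as the recorded owner, which matches the permanent "will
retry" loop and the heap dump described above.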

 

> Streams StateDirectory has orphaned locks after rebalancing, blocking future 
> rebalancing
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-16025
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16025
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.4.0
>         Environment: Linux
>            Reporter: Sabit
>            Priority: Major
>         Attachments: Screenshot 1702750558363.png
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
