[jira] [Commented] (KAFKA-12679) Rebalancing a restoring or running task may cause directory livelocking with newly created task

2023-06-13 Thread Andras Hatvani (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732025#comment-17732025
 ] 

Andras Hatvani commented on KAFKA-12679:


[~lbrutschy] [~mjsax] This is a pressing issue for my client and I've worked it 
around by reducing `num.stream.threads` to `1` which shouldn't be necessary.

> Rebalancing a restoring or running task may cause directory livelocking with 
> newly created task
> ---
>
> Key: KAFKA-12679
> URL: https://issues.apache.org/jira/browse/KAFKA-12679
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.6.1
> Environment: Broker and client version 2.6.1
> Multi-node broker cluster
> Multi-node, auto scaling streams app instances
>Reporter: Peter Nahas
>Priority: Major
> Attachments: Backoff-between-directory-lock-attempts.patch
>
>
> If a task that uses a state store is in the restoring state or in a running 
> state and the task gets rebalanced to a separate thread on the same instance, 
> the newly created task will attempt to lock the state store director while 
> the first thread is continuing to use it. This is totally normal and expected 
> behavior when the first thread is not yet aware of the rebalance. However, 
> that newly created task is effectively running a while loop with no backoff 
> waiting to lock the directory:
>  # TaskManager tells the task to restore in `tryToCompleteRestoration`
>  # The task attempts to lock the directory
>  # The lock attempt fails and throws a 
> `org.apache.kafka.streams.errors.LockException`
>  # TaskManager catches the exception, stops further processing on the task 
> and reports that not all tasks have restored
>  # The StreamThread `runLoop` continues to run.
> I've seen some documentation indicate that there is supposed to be a backoff 
> when this condition occurs, but there does not appear to be any in the code. 
> The result is that if this goes on for long enough, the lock-loop may 
> dominate CPU usage in the process and starve out the old stream thread task 
> processing.
>  
> When in this state, the DEBUG level logging for TaskManager will produce a 
> steady stream of messages like the following:
> {noformat}
> 2021-03-30 20:59:51,098 DEBUG --- [StreamThread-10] o.a.k.s.p.i.TaskManager   
>   : stream-thread [StreamThread-10] Could not initialize 0_34 due 
> to the following exception; will retry
> org.apache.kafka.streams.errors.LockException: stream-thread 
> [StreamThread-10] standby-task [0_34] Failed to lock the state directory for 
> task 0_34
> {noformat}
>  
>  
> I've attached a git formatted patch to resolve the issue. Simply detect the 
> scenario and sleep for the backoff time in the appropriate StreamThread.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-12679) Rebalancing a restoring or running task may cause directory livelocking with newly created task

2023-06-13 Thread Andras Hatvani (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732025#comment-17732025
 ] 

Andras Hatvani edited comment on KAFKA-12679 at 6/13/23 11:08 AM:
--

[~lbrutschy] [~mjsax] This is a pressing issue for my client as I could only 
work it around by reducing {{num.stream.threads}} to {{1}} which shouldn't be 
necessary.


was (Author: andras hatvani):
[~lbrutschy] [~mjsax] This is a pressing issue for my client and I've worked it 
around by reducing `num.stream.threads` to `1` which shouldn't be necessary.

> Rebalancing a restoring or running task may cause directory livelocking with 
> newly created task
> ---
>
> Key: KAFKA-12679
> URL: https://issues.apache.org/jira/browse/KAFKA-12679
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.6.1
> Environment: Broker and client version 2.6.1
> Multi-node broker cluster
> Multi-node, auto scaling streams app instances
>Reporter: Peter Nahas
>Priority: Major
> Attachments: Backoff-between-directory-lock-attempts.patch
>
>
> If a task that uses a state store is in the restoring state or in a running 
> state and the task gets rebalanced to a separate thread on the same instance, 
> the newly created task will attempt to lock the state store director while 
> the first thread is continuing to use it. This is totally normal and expected 
> behavior when the first thread is not yet aware of the rebalance. However, 
> that newly created task is effectively running a while loop with no backoff 
> waiting to lock the directory:
>  # TaskManager tells the task to restore in `tryToCompleteRestoration`
>  # The task attempts to lock the directory
>  # The lock attempt fails and throws a 
> `org.apache.kafka.streams.errors.LockException`
>  # TaskManager catches the exception, stops further processing on the task 
> and reports that not all tasks have restored
>  # The StreamThread `runLoop` continues to run.
> I've seen some documentation indicate that there is supposed to be a backoff 
> when this condition occurs, but there does not appear to be any in the code. 
> The result is that if this goes on for long enough, the lock-loop may 
> dominate CPU usage in the process and starve out the old stream thread task 
> processing.
>  
> When in this state, the DEBUG level logging for TaskManager will produce a 
> steady stream of messages like the following:
> {noformat}
> 2021-03-30 20:59:51,098 DEBUG --- [StreamThread-10] o.a.k.s.p.i.TaskManager   
>   : stream-thread [StreamThread-10] Could not initialize 0_34 due 
> to the following exception; will retry
> org.apache.kafka.streams.errors.LockException: stream-thread 
> [StreamThread-10] standby-task [0_34] Failed to lock the state directory for 
> task 0_34
> {noformat}
>  
>  
> I've attached a git formatted patch to resolve the issue. Simply detect the 
> scenario and sleep for the backoff time in the appropriate StreamThread.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-16584) Make log processing summary configurable or debug

2024-04-18 Thread Andras Hatvani (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-16584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Hatvani updated KAFKA-16584:
---
Description: 
Currently *every two minutes for every stream thread* statistics will be logged 
on INFO log level. 
{code}
2024-04-18T09:18:23.790+02:00  INFO 33178 --- [service] [-StreamThread-1] 
o.a.k.s.p.internals.StreamThread         : stream-thread 
[service-149405a3-c7e3-4505-8bbd-c3bff226b115-StreamThread-1] Processed 0 total 
records, ran 0 punctuators, and committed 0 total tasks since the last update 
{code}

This is absolutely unnecessary and even harmful since it fills the logs and 
thus storage space with unwanted and useless data. Otherwise the INFO logs are 
useful and helpful, therefore it's not an option to raise the log level to WARN.
Please make the logProcessingSummary 
* either to a DEBUG level log or
* make it configurable so that it can be disabled.
This is the relevant code: 
https://github.com/apache/kafka/blob/aee9724ee15ed539ae73c09cc2c2eda83ae3c864/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java#L1073

  was:
Currently *every two minutes for every stream thread* statistics will be logged 
on INFO log level. 
{code:log}
2024-04-18T09:18:23.790+02:00  INFO 33178 --- [service] [-StreamThread-1] 
o.a.k.s.p.internals.StreamThread         : stream-thread 
[service-149405a3-c7e3-4505-8bbd-c3bff226b115-StreamThread-1] Processed 0 total 
records, ran 0 punctuators, and committed 0 total tasks since the last update 
{code}

This is absolutely unnecessary and even harmful since it fills the logs and 
thus storage space with unwanted and useless data. Otherwise the INFO logs are 
useful and helpful, therefore it's not an option to raise the log level to WARN.
Please make the logProcessingSummary 
* either to a DEBUG level log or
* make it configurable so that it can be disabled.
This is the relevant code: 
https://github.com/apache/kafka/blob/aee9724ee15ed539ae73c09cc2c2eda83ae3c864/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java#L1073


> Make log processing summary configurable or debug
> -
>
> Key: KAFKA-16584
> URL: https://issues.apache.org/jira/browse/KAFKA-16584
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.6.2
>Reporter: Andras Hatvani
>Priority: Major
>
> Currently *every two minutes for every stream thread* statistics will be 
> logged on INFO log level. 
> {code}
> 2024-04-18T09:18:23.790+02:00  INFO 33178 --- [service] [-StreamThread-1] 
> o.a.k.s.p.internals.StreamThread         : stream-thread 
> [service-149405a3-c7e3-4505-8bbd-c3bff226b115-StreamThread-1] Processed 0 
> total records, ran 0 punctuators, and committed 0 total tasks since the last 
> update {code}
> This is absolutely unnecessary and even harmful since it fills the logs and 
> thus storage space with unwanted and useless data. Otherwise the INFO logs 
> are useful and helpful, therefore it's not an option to raise the log level 
> to WARN.
> Please make the logProcessingSummary 
> * either to a DEBUG level log or
> * make it configurable so that it can be disabled.
> This is the relevant code: 
> https://github.com/apache/kafka/blob/aee9724ee15ed539ae73c09cc2c2eda83ae3c864/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java#L1073



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-16584) Make log processing summary configurable or debug

2024-04-18 Thread Andras Hatvani (Jira)
Andras Hatvani created KAFKA-16584:
--

 Summary: Make log processing summary configurable or debug
 Key: KAFKA-16584
 URL: https://issues.apache.org/jira/browse/KAFKA-16584
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 3.6.2
Reporter: Andras Hatvani


Currently *every two minutes for every stream thread* statistics will be logged 
on INFO log level. 
{code:log}
2024-04-18T09:18:23.790+02:00  INFO 33178 --- [service] [-StreamThread-1] 
o.a.k.s.p.internals.StreamThread         : stream-thread 
[service-149405a3-c7e3-4505-8bbd-c3bff226b115-StreamThread-1] Processed 0 total 
records, ran 0 punctuators, and committed 0 total tasks since the last update 
{code}

This is absolutely unnecessary and even harmful since it fills the logs and 
thus storage space with unwanted and useless data. Otherwise the INFO logs are 
useful and helpful, therefore it's not an option to raise the log level to WARN.
Please make the logProcessingSummary 
* either to a DEBUG level log or
* make it configurable so that it can be disabled.
This is the relevant code: 
https://github.com/apache/kafka/blob/aee9724ee15ed539ae73c09cc2c2eda83ae3c864/streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java#L1073



--
This message was sent by Atlassian Jira
(v8.20.10#820010)