[ 
https://issues.apache.org/jira/browse/KAFKA-9846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080978#comment-17080978
 ] 

ASF GitHub Bot commented on KAFKA-9846:
---------------------------------------

vinothchandar commented on pull request #8462: KAFKA-9846: Filter active tasks 
for running state in KafkaStreams#allLocalStorePartitionLags()
URL: https://github.com/apache/kafka/pull/8462
 
 
   
     - Added check that only treats running active tasks as having 0 lag
     - Tasks that are neither restoring, nor running will report 0 as 
currentoffset position
     - Fixed LagFetchIntegrationTest to wait till thread/instance reaches 
RUNNING before checking lag
   
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Race condition can lead to severe lag underestimate for active tasks
> --------------------------------------------------------------------
>
>                 Key: KAFKA-9846
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9846
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.5.0
>            Reporter: Sophie Blee-Goldman
>            Assignee: Vinoth Chandar
>            Priority: Critical
>             Fix For: 2.6.0
>
>
> In KIP-535 we added the ability to query still-restoring and standby tasks. 
> To give users control over how out of date the data they fetch can be, we 
> added an API to KafkaStreams that fetches the end offsets for all changelog 
> partitions and computes the lag for each local state store.
> During this lag computation, we check whether an active task is in RESTORING 
> and calculate the actual lag if so. If not, we assume it's in RUNNING and 
> return a lag of zero. However, tasks may be in other states besides running 
> and restoring; notably they first pass through the CREATED state before 
> getting to RESTORING. A CREATED task may happen to be caught-up to the end 
> offset, but in many cases it is likely to be lagging or even completely 
> uninitialized.
> This introduces a race condition where users may be led to believe that a 
> task has zero lag and is "safe" to query even with the strictest correctness 
> guarantees, while the task is actually lagging by some unknown amount.  
> During transfer of ownership of the task between different threads on the 
> same machine, tasks can actually spend a while in CREATED while the new owner 
> waits to acquire the task directory lock. So, this race condition may not be 
> particularly rare in multi-threaded Streams applications



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to