Sophie Blee-Goldman created KAFKA-9846:
------------------------------------------

             Summary: Race condition can lead to severe lag underestimate for active tasks
                 Key: KAFKA-9846
                 URL: https://issues.apache.org/jira/browse/KAFKA-9846
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 2.5.0
            Reporter: Sophie Blee-Goldman


In KIP-535 we added the ability to query still-restoring and standby tasks. To 
give users control over how out of date the data they fetch can be, we added an 
API to KafkaStreams that fetches the end offsets for all changelog partitions 
and computes the lag for each local state store.
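
As a rough illustration of how a caller might use these per-store lags to bound 
staleness, here is a minimal sketch. In 2.5.0 the API in question is, as far as 
I know, KafkaStreams#allLocalStorePartitionLags(), which returns LagInfo objects 
keyed by store name and partition. To keep the example self-contained, the 
nested map of plain offset lags below is a stand-in for that result, and the 
class name, method name, store name and threshold are all hypothetical.

```java
import java.util.Map;

public class LagGate {
    // Returns true only if a lag is known for the store partition and it is
    // within the caller's tolerance. The nested map is a simplified stand-in
    // for the per-store, per-partition lag information returned by Streams.
    static boolean isFreshEnough(Map<String, Map<Integer, Long>> lags,
                                 String store, int partition, long maxLag) {
        Map<Integer, Long> byPartition = lags.get(store);
        if (byPartition == null) {
            return false; // no local copy of this store
        }
        Long lag = byPartition.get(partition);
        return lag != null && lag <= maxLag;
    }

    public static void main(String[] args) {
        // Hypothetical lags: partition 0 is caught up, partition 1 is behind.
        Map<String, Map<Integer, Long>> lags =
            Map.of("counts-store", Map.of(0, 0L, 1, 1500L));

        System.out.println(isFreshEnough(lags, "counts-store", 0, 0L));
        System.out.println(isFreshEnough(lags, "counts-store", 1, 1000L));
    }
}
```

A check like this is only as trustworthy as the reported lag, which is why an 
incorrectly reported lag of zero matters so much.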

During this lag computation, we check whether an active task is in RESTORING 
and calculate the actual lag if so. If not, we assume it's in RUNNING and 
return a lag of zero. However, tasks may be in states other than RUNNING and 
RESTORING; notably, they first pass through the CREATED state before getting to 
RESTORING. A CREATED task may happen to be caught up to the end offset, but in 
many cases it is likely to be lagging or even completely uninitialized.
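
The difference between the two assumptions can be sketched as follows. This is 
a simplified model, not the actual Streams code: the TaskState enum, the class 
and both method names are hypothetical, and an uninitialized task is assumed to 
have a current offset of 0. The second method shows one possible way to avoid 
the underestimate (report zero only for RUNNING); it is not necessarily the fix 
that will be applied.

```java
public class LagSketch {
    // Hypothetical, simplified set of task states for illustration only.
    enum TaskState { CREATED, RESTORING, RUNNING }

    // Models the behavior described above: anything that is not RESTORING
    // is assumed to be RUNNING and therefore reported with zero lag.
    static long lagAssumingRunning(TaskState state, long currentOffset, long endOffset) {
        if (state == TaskState.RESTORING) {
            return Math.max(0L, endOffset - currentOffset);
        }
        return 0L;
    }

    // One possible alternative: only a RUNNING task is reported as caught up;
    // every other state gets a lag computed from the offsets.
    static long lagUnlessRunning(TaskState state, long currentOffset, long endOffset) {
        if (state == TaskState.RUNNING) {
            return 0L;
        }
        return Math.max(0L, endOffset - currentOffset);
    }

    public static void main(String[] args) {
        // A CREATED task that has not restored anything yet (offset assumed 0)
        // while the changelog end offset is 5000.
        System.out.println(lagAssumingRunning(TaskState.CREATED, 0L, 5000L));
        System.out.println(lagUnlessRunning(TaskState.CREATED, 0L, 5000L));
    }
}
```

Under these assumptions the first call reports a lag of 0 for the CREATED task, 
while the second reports the full distance to the end offset.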

This introduces a race condition where users may be led to believe that a task 
has zero lag and is "safe" to query even with the strictest correctness 
guarantees, while the task is actually lagging by some unknown amount. During 
transfer of ownership of the task between different threads on the same 
machine, tasks can actually spend a while in CREATED while the new owner waits 
to acquire the task directory lock. So, this race condition may not be 
particularly rare in multi-threaded Streams applications.


