Sergey Zyrianov created KAFKA-18168:
---------------------------------------
Summary: GlobalKTable does not checkpoint offsets until next 10K
events
Key: KAFKA-18168
URL: https://issues.apache.org/jira/browse/KAFKA-18168
Project: Kafka
Issue Type: Improvement
Components: streams
Affects Versions: 3.8.1, 3.4.1
Reporter: Sergey Zyrianov
As in https://issues.apache.org/jira/browse/KAFKA-5241, there is a state of
considerable size kept on a topic that backs up GlobalKTalbe. Restoring
GlobalKTable takes minutes before it is operational. After successful restore
the checkpoint file is not created until further 10K events happen on the
topic.
The following scenario illustrates the issue:
# {*}Scaling Out{*}: When a new instance (e.g., pod X) is added to an already
running set of instances (pods 0...X-1), the new instance will restore the
state successfully. However, it will not create a checkpoint file until 10K
events are processed on the {{GlobalKTable}} topic.
# {*}Lack of Traffic{*}: If there is no new traffic on the {{GlobalKTable}}
topic, there is no mechanism to force the creation of the checkpoint file. The
state remains uncheckpointed. Ref
[https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/processor/internals/StateManagerUtil.java#L78C35-L78C72]
# {*}Instance Restart{*}: If the new instance (pod X) is restarted (due to
update for ex) before 10K events have been processed, it will have to restore
the entire state from the topic again, leading to the same time-consuming
restoration process. This issue persists across restarts.
IMO, checkpointing during the restore process and upon completion/close is
missing in the current implementation
--
This message was sent by Atlassian Jira
(v8.20.10#820010)