[ https://issues.apache.org/jira/browse/FLINK-35853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866287#comment-17866287 ]
Keith Lee commented on FLINK-35853:
-----------------------------------

I dug around the code but am unfamiliar with the state backend code base. I noted a change introduced in FLINK-28699 where IncrementalRemoteKeyedStateHandle is also used for full checkpointing.

> Regression in checkpoint size when performing full checkpointing in RocksDB
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-35853
>                 URL: https://issues.apache.org/jira/browse/FLINK-35853
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 1.18.1
>         Environment: amazon-linux-2023
>            Reporter: Keith Lee
>            Priority: Major
>         Attachments: StaticStateSizeGenerator115.java,
> StaticStateSizeGenerator118.java
>
>
> We have a job with a small and static state size (states are updated rather
> than added). The job is configured to use RocksDB + full checkpointing
> (incremental disabled) because the diff between checkpoints is larger than
> the full checkpoint size.
> After migrating to 1.18, we observed a significant and steady increase in full
> checkpoint size with RocksDB + full checkpointing. The increase was not
> observed with the HashMap state backend.
> I managed to reproduce the issue with the following code:
> [^StaticStateSizeGenerator115.java]
> [^StaticStateSizeGenerator118.java]
> Result:
> On Flink 1.15, RocksDB + full checkpointing, checkpoint size is constant at
> 250KiB.
> On Flink 1.18, RocksDB + full checkpointing, max checkpoint size got up to
> 38MiB before dropping (presumably due to compaction?)
> On Flink 1.18, HashMap state backend, checkpoint size is constant at 219KiB.
> Notes:
> One observation I have is that the issue is more pronounced with higher
> parallelism; the reproduction code uses a parallelism of 8. The production
> application in which we first saw the regression reached checkpoint sizes in
> the GiB range, where we expected, and observed on 1.15, at most a couple of MiB.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)