[ 
https://issues.apache.org/jira/browse/FLINK-35853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866287#comment-17866287
 ] 

Keith Lee commented on FLINK-35853:
-----------------------------------

I dug around the code, but I am unfamiliar with the state backend code base. 
I did note a change introduced in FLINK-28699, where 
IncrementalRemoteKeyedStateHandle is also used for full checkpointing.

> Regression in checkpoint size when performing full checkpointing in RocksDB
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-35853
>                 URL: https://issues.apache.org/jira/browse/FLINK-35853
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 1.18.1
>         Environment: amazon-linux-2023
>            Reporter: Keith Lee
>            Priority: Major
>         Attachments: StaticStateSizeGenerator115.java, 
> StaticStateSizeGenerator118.java
>
>
> We have a job with a small, static state size (state entries are updated 
> rather than added). The job is configured to use RocksDB + full 
> checkpointing (incremental disabled) because the diff between checkpoints 
> is larger than the full checkpoint size. 
> After migrating to 1.18, we observed a significant and steady increase in 
> full checkpoint size with RocksDB + full checkpointing. The increase was not 
> observed with the hashmap state backend.
> I managed to reproduce the issue with the following code:
> [^StaticStateSizeGenerator115.java]
> [^StaticStateSizeGenerator118.java]
> Result:
> On Flink 1.15, RocksDB + full checkpointing, the checkpoint size is constant 
> at 250KiB.
> On Flink 1.18, RocksDB + full checkpointing, the maximum checkpoint size 
> reached 38MiB before dropping (presumably due to compaction?)
> On Flink 1.18, hashmap state backend, the checkpoint size is constant at 
> 219KiB.
> Notes:
> One observation is that the issue is more pronounced with higher 
> parallelism; the reproduction code uses a parallelism of 8. The production 
> application where we first saw the regression reached checkpoint sizes in 
> the GiB range, where we expected and observed (in 1.15) at most a couple of 
> MiB.
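
To put the figures quoted above in proportion, here is a small sketch in plain 
Java (no Flink dependency) that computes how much larger the 1.18 RocksDB peak 
is than the 1.15 baseline. The class and method names (CheckpointGrowth, 
growthFactor) are made up for illustration and are not part of Flink or of the 
attached reproducers; the sizes are the ones given in the report.

```java
public class CheckpointGrowth {
    // Binary units, matching the KiB/MiB figures used in the report.
    static final long KIB = 1024L;
    static final long MIB = 1024L * KIB;

    /** Ratio of an observed checkpoint size to a baseline size, both in bytes. */
    static double growthFactor(long observedBytes, long baselineBytes) {
        if (baselineBytes <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (double) observedBytes / baselineBytes;
    }

    public static void main(String[] args) {
        // Sizes taken from the issue description.
        long rocksDb115 = 250 * KIB;     // Flink 1.15, RocksDB, full checkpointing
        long rocksDb118Peak = 38 * MIB;  // Flink 1.18, RocksDB, full checkpointing (peak)
        long hashMap118 = 219 * KIB;     // Flink 1.18, hashmap state backend

        System.out.printf("1.18 RocksDB peak vs 1.15 RocksDB: %.1fx%n",
                growthFactor(rocksDb118Peak, rocksDb115));
        System.out.printf("1.18 RocksDB peak vs 1.18 hashmap: %.1fx%n",
                growthFactor(rocksDb118Peak, hashMap118));
    }
}
```

With the reported numbers, the 1.18 RocksDB peak works out to roughly 156 times 
the 1.15 checkpoint size, for a job whose state size is described as static.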



--
This message was sent by Atlassian Jira
(v8.20.10#820010)