[ https://issues.apache.org/jira/browse/FLINK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feifan Wang updated FLINK-24149:
--------------------------------
    Description: 
h1. Background

We have many jobs with large state in our production environment. Based on our experience operating these jobs and our analysis of some specific problems, we believe that RocksDBStateBackend's incremental checkpoints have many advantages over savepoints:
 # A savepoint takes much longer than an incremental checkpoint in jobs with large state. The figure below shows a job in our production environment: it takes nearly 7 minutes to complete a savepoint, while a checkpoint takes only a few seconds. (The checkpoint after a savepoint taking longer is a problem described in -FLINK-23949-.)
 !image-2021-09-08-17-55-46-898.png|width=723,height=161!
 # A savepoint causes excessive CPU usage. The figure below shows the CPU usage of the same job as in the figure above:
 # A savepoint may cause excessive native memory usage and eventually cause the TaskManager process memory usage to exceed the limit. (We did not investigate the cause further and did not try to reproduce the problem on other large-state jobs; we only increased the overhead memory. So this reason may not be conclusive.)

For the above reasons, we prefer to use retained incremental checkpoints to completely replace savepoints for jobs with large state.

  was:
h1. Background

We have many jobs with large state in our production environment. Based on our experience operating these jobs and our analysis of some specific problems, we believe that RocksDBStateBackend's incremental checkpoints have many advantages over savepoints:
 # A savepoint takes much longer than an incremental checkpoint in jobs with large state. The figure below shows a job in our production environment: it takes nearly 7 minutes to complete a savepoint, while a checkpoint takes only a few seconds. (The checkpoint after a savepoint taking longer is a problem described in -FLINK-23949-.)
 # A savepoint causes excessive CPU usage. The figure below shows the CPU usage of the same job as in the figure above:
 # A savepoint may cause excessive native memory usage and eventually cause the TaskManager process memory usage to exceed the limit. (We did not investigate the cause further and did not try to reproduce the problem on other large-state jobs; we only increased the overhead memory. So this reason may not be conclusive.)

For the above reasons, we prefer to use retained incremental checkpoints to completely replace savepoints for jobs with large state.

> Make checkpoint self-contained and relocatable
> ----------------------------------------------
>
>                 Key: FLINK-24149
>                 URL: https://issues.apache.org/jira/browse/FLINK-24149
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Feifan Wang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-09-08-17-06-31-560.png, 
> image-2021-09-08-17-10-28-240.png, image-2021-09-08-17-55-46-898.png, 
> image-2021-09-08-18-01-03-176.png

--
This message was sent by Atlassian Jira
(v8.3.4#803005)