[ 
https://issues.apache.org/jira/browse/FLINK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feifan Wang updated FLINK-24149:
--------------------------------
    Description: 
h1. Background

We have many jobs with large state in our production environment. Based on our 
experience operating these jobs and our analysis of some specific problems, we 
believe that RocksDBStateBackend's incremental checkpoint has several 
advantages over savepoint:
 # A savepoint takes much longer than an incremental checkpoint in jobs with 
large state. The figure below shows a job in our production environment: it 
takes nearly 7 minutes to complete a savepoint, while a checkpoint takes only a 
few seconds. (The checkpoint right after a savepoint taking longer is a 
separate problem, described in -FLINK-23949-.)
!image-2021-09-08-17-55-46-898.png|width=723,height=161!
 # Savepoint causes excessive CPU usage. The figure below shows the CPU usage 
of the same job as in the figure above:
 # Savepoint may cause excessive native memory usage and eventually push the 
TaskManager process memory usage over its limit. (We did not investigate the 
cause further or try to reproduce the problem on other large-state jobs; we 
only increased the overhead memory. So this reason may not be conclusive.)

For the above reasons, we tend to use retained incremental checkpoints to 
completely replace savepoints for jobs with large state.

 

  was:
h1. Background

We have many jobs with large state in our production environment. Based on our 
experience operating these jobs and our analysis of some specific problems, we 
believe that RocksDBStateBackend's incremental checkpoint has several 
advantages over savepoint:
 # A savepoint takes much longer than an incremental checkpoint in jobs with 
large state. The figure below shows a job in our production environment: it 
takes nearly 7 minutes to complete a savepoint, while a checkpoint takes only a 
few seconds. (The checkpoint right after a savepoint taking longer is a 
separate problem, described in -FLINK-23949-.)
 # Savepoint causes excessive CPU usage. The figure below shows the CPU usage 
of the same job as in the figure above:
 # Savepoint may cause excessive native memory usage and eventually push the 
TaskManager process memory usage over its limit. (We did not investigate the 
cause further or try to reproduce the problem on other large-state jobs; we 
only increased the overhead memory. So this reason may not be conclusive.)

For the above reasons, we tend to use retained incremental checkpoints to 
completely replace savepoints for jobs with large state.

 


> Make checkpoint self-contained and relocatable
> ----------------------------------------------
>
>                 Key: FLINK-24149
>                 URL: https://issues.apache.org/jira/browse/FLINK-24149
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Feifan Wang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-09-08-17-06-31-560.png, 
> image-2021-09-08-17-10-28-240.png, image-2021-09-08-17-55-46-898.png, 
> image-2021-09-08-18-01-03-176.png
>
>
> h1. Background
> We have many jobs with large state in our production environment. Based on 
> our experience operating these jobs and our analysis of some specific 
> problems, we believe that RocksDBStateBackend's incremental checkpoint has 
> several advantages over savepoint:
>  # A savepoint takes much longer than an incremental checkpoint in jobs with 
> large state. The figure below shows a job in our production environment: it 
> takes nearly 7 minutes to complete a savepoint, while a checkpoint takes only 
> a few seconds. (The checkpoint right after a savepoint taking longer is a 
> separate problem, described in -FLINK-23949-.)
> !image-2021-09-08-17-55-46-898.png|width=723,height=161!
>  # Savepoint causes excessive CPU usage. The figure below shows the CPU usage 
> of the same job as in the figure above:
>  # Savepoint may cause excessive native memory usage and eventually push the 
> TaskManager process memory usage over its limit. (We did not investigate the 
> cause further or try to reproduce the problem on other large-state jobs; we 
> only increased the overhead memory. So this reason may not be conclusive.)
> For the above reasons, we tend to use retained incremental checkpoints to 
> completely replace savepoints for jobs with large state.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
