A lower checkpointing interval (== more checkpoints per unit of time) will consume more resources and hence can affect job performance. It ultimately boils down to how much latency you are willing to accept when a failure occurs and data has to be reprocessed (more frequent checkpoints => less data to reprocess).

How long this catch-up takes depends on the job and the provisioning of the cluster. An over-provisioned cluster can recover more quickly from what is ultimately just a data spike, while one that is barely keeping up may incur significant latency.

We know that many users have a checkpointing interval on the order of minutes, but at the end of the day you will need to run some experiments with your job, cluster, and data to get some rough numbers.
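For illustration, here is a minimal sketch of where these knobs live in the DataStream API. The class name and the specific values (60s interval, 30s minimum pause, 10-minute timeout) are arbitrary examples to experiment from, not recommendations:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60s (example value, tune per job).
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Guarantee some processing time between the end of one checkpoint
        // and the start of the next, so checkpointing cannot starve the job.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // Abort a checkpoint attempt if it takes longer than 10 minutes.
        env.getCheckpointConfig().setCheckpointTimeout(600_000);

        // ... build the pipeline (sources, joins, sinks) and call env.execute() ...
    }
}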

On 2/18/2021 7:35 AM, Dan Hill wrote:
Hi.  I'm playing around with optimizing our checkpoint intervals and sizes.

Are there any best practices around this?  I have ~7 sequential joins and a few sinks.  I'm curious what would result in better throughput and latency trade-offs.  I'd assume less frequent checkpointing would increase throughput (but constrained by how frequently I want checkpointed sinks written).
