A lower checkpointing interval (== more checkpoints per unit of time) will consume more resources and hence can affect job performance. It ultimately boils down to how much latency you are willing to accept when a failure occurs and data has to be reprocessed (more frequent checkpoints => less data to reprocess).

How long this catch-up takes depends on the job and the provisioning of the cluster. An over-provisioned cluster can recover more quickly from what is ultimately just a data spike, while one that is barely keeping up may incur significant latency.

We know that many users have a checkpointing interval on the order of minutes, but at the end of the day you will need to run some experiments with your job, cluster, and data to get some rough numbers.
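For illustration, here is a minimal sketch of where these knobs live in the DataStream API. The class name and the specific values (60s interval, 30s minimum pause, 10-minute timeout) are arbitrary examples to experiment from, not recommendations:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60s (example value, tune per job).
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Guarantee some processing time between the end of one checkpoint
        // and the start of the next, so checkpointing cannot starve the job.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        // Abort a checkpoint attempt if it takes longer than 10 minutes.
        env.getCheckpointConfig().setCheckpointTimeout(600_000);

        // ... build the pipeline (sources, joins, sinks) and call env.execute() ...
    }
}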

On 2/18/2021 7:35 AM, Dan Hill wrote:
Hi.  I'm playing around with optimizing our checkpoint intervals and sizes.

Are there any best practices around this?  I have ~7 sequential joins and a few sinks.  I'm curious what would result in better throughput and latency trade-offs.  I'd assume less frequent checkpointing would increase throughput (but constrained by how frequently I want checkpointed sinks written).
