A lower checkpoint interval (i.e., more checkpoints per unit of time)
consumes more resources and can therefore affect job performance.
It ultimately comes down to how much latency you are willing to accept
when a failure occurs and data has to be re-processed: the more frequent
the checkpoints, the less data needs to be replayed (e.g., with a
1-minute interval, at most roughly 1 minute of input has to be
re-processed after a restore).
How long this catch-up takes depends on the job and the provisioning of
the cluster. An over-provisioned cluster can recover quickly from what
is ultimately just a data spike, while one that is barely keeping up may
incur significant latency.
We know that many users have a checkpointing interval on the order of
minutes, but at the end of the day you will need to run some experiments
with your job, cluster, and data to get some rough numbers.
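For reference, here is a minimal sketch of how those knobs can be set
via the DataStream API; the interval, pause, and timeout values below
are placeholders to tune in those experiments, not recommendations:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

// Checkpoint every 60s with exactly-once guarantees (placeholder value).
env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

// Leave at least 30s between the end of one checkpoint and the start of
// the next, so checkpointing never starves actual processing.
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);

// Abort a checkpoint that takes longer than 10 minutes to complete.
env.getCheckpointConfig().setCheckpointTimeout(600_000L);

// Never run more than one checkpoint at a time.
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

The min-pause setting is often the more forgiving lever: it bounds how
much of the job's time is spent checkpointing even if individual
checkpoints become slow.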
On 2/18/2021 7:35 AM, Dan Hill wrote:
Hi. I'm playing around with optimizing our checkpoint intervals and
sizes.
Are there any best practices around this? I have ~7 sequential
joins and a few sinks. I'm curious what would result in the better
throughput and latency trade-offs. I'd assume less frequent
checkpointing would increase throughput (but it's constrained by how
frequently I want checkpointed sinks written).