Hi,

I am trying to use Flink's checkpointing in order to support task manager 
recovery.
I'm running Flink via Beam with filesystem state storage and the following 
parameters:
checkpointingInterval=30000
checkpointingMode=EXACTLY_ONCE

What I see is that if I kill a task manager pod, it takes Flink about 30 
seconds to identify the failure and another 5-6 minutes to restart the jobs.
Is there a way to shorten the downtime? What is the expected downtime, from 
the moment the task manager is killed until the jobs are recovered? Are there 
any best practices for handling it (e.g. different configuration parameters)?
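For reference, I assume the relevant knobs are the TaskManager heartbeat and restart-strategy settings in flink-conf.yaml, something like the sketch below, but I'm not sure these are the right ones or what values are safe:

```yaml
# Detect a lost TaskManager faster (Flink defaults: interval 10 s, timeout 50 s).
heartbeat.interval: 5000
heartbeat.timeout: 20000

# Restart the job promptly once the failure is detected.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10 s
```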

Thanks,
Ifat
