Hi Sambaran,

Could you also share with us why the checkpoints could not be discarded?
With Flink 1.10, we introduced a stricter memory model for the TaskManagers. That could be a reason why you see more TaskManagers being killed by the underlying resource management system. You could check whether your resource management system logs that some containers/pods are exceeding their memory limits. If this is the case, then you should give your Flink processes a bit more memory [1] (a sketch of the relevant flink-conf.yaml settings follows the quoted message below).

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_setup.html

Cheers,
Till

On Tue, Apr 27, 2021 at 6:48 PM Sambaran <sambaran2...@gmail.com> wrote:

> Hi there,
>
> We have recently migrated to Flink 1.12 from 1.7. Although the jobs are
> running fine, sometimes the task manager is getting killed (quite
> frequently, 2-3 times a day).
>
> Logs:
> INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] -
> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>
> While checking more logs, we see Flink is not able to discard old
> checkpoints:
> org.apache.flink.runtime.checkpoint.CheckpointsCleaner [] - Could
> not discard completed checkpoint 173.
>
> We are not sure what the reason is here; has anyone faced this before?
>
> Regards
> Sambaran
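P.S. For reference, the tuning described in [1] usually comes down to a few entries in flink-conf.yaml. The sizes below are placeholders, not recommendations; the right values depend on your workload:

    # Total memory of the TaskManager process, i.e. what the container/pod
    # limit must accommodate (heap + managed memory + network + JVM overhead).
    taskmanager.memory.process.size: 4096m

    # If the kills are caused by off-heap/JVM overhead rather than heap,
    # enlarging the overhead headroom can help (the default fraction is 0.1).
    taskmanager.memory.jvm-overhead.fraction: 0.2

To confirm whether the SIGTERM really comes from the resource manager: on YARN, a memory-related kill typically shows up in the NodeManager logs as a container "running beyond physical memory limits"; on Kubernetes, the pod status would show OOMKilled.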