Hi Sambaran,

Could you also share with us why the checkpoints could not be discarded?

With Flink 1.10, we introduced a stricter memory model for the
TaskManagers. That could be one reason why you see more TaskManagers being
killed by the underlying resource management system. You could check
whether your resource management system logs that some containers/pods are
exceeding their memory limits. If that is the case, then you should give
your Flink processes a bit more memory [1].
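
To make that check a bit more concrete: on Kubernetes, a memory-killed
TaskManager pod usually shows OOMKilled as the reason of its last state,
and on YARN the NodeManager logs that the container is running beyond its
memory limits. A rough sketch (the pod name and log path below are just
placeholders):

  # Kubernetes: look at the termination reason of the TaskManager pod
  kubectl describe pod <taskmanager-pod> | grep -A 3 "Last State"

  # YARN: search the NodeManager log for the kill reason
  grep "running beyond physical memory limits" <nodemanager-log-file>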

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_setup.html
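
If the limits are indeed exceeded, here is a minimal sketch of how you
could raise the memory in flink-conf.yaml (the concrete sizes are only
illustrative, not recommendations for your setup):

  # give the whole TaskManager process more memory
  taskmanager.memory.process.size: 6g
  # if native/off-heap memory (e.g. the RocksDB state backend) is the
  # culprit, the JVM overhead budget can be raised as well
  taskmanager.memory.jvm-overhead.max: 2g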

Cheers,
Till

On Tue, Apr 27, 2021 at 6:48 PM Sambaran <sambaran2...@gmail.com> wrote:

> Hi there,
>
> We have recently migrated from Flink 1.7 to 1.12. Although the jobs are
> running fine, sometimes the TaskManager is getting killed (quite
> frequently, 2-3 times a day).
>
> Logs:
> INFO  org.apache.flink.runtime.taskexecutor.TaskManagerRunner      [] -
> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>
> While checking more logs, we see that Flink is not able to discard old
> checkpoints:
> org.apache.flink.runtime.checkpoint.CheckpointsCleaner       [] - Could
> not discard completed checkpoint 173.
>
> We are not sure what the reason is here; has anyone faced this before?
>
> Regards
> Sambaran
>
