Hi,

When a job fails and is recovered by Flink, task manager JVMs are reused.
That can cause problems when the failed job wasn't cleaned up properly, for
example leaving behind the user class loader. This would manifest in rising
base for memory usage, leading to a death spiral.

It would be good to provide an option that guarantees isolation, by
restarting the task manager processes. Managing the processes would depend
on how Flink is deployed, but the recovery sequence would need to provide a
hook for the user.

Has there been prior discussion or related work?

Thanks,
Thomas

Reply via email to