Hi devs,

Some time ago I asked about the way TaskManager (TM) pods are handled by the
native Kubernetes driver [1]. I have now looked through the source code, and
I think it would be possible to deploy TMs with a StatefulSet. That could
allow tracking OOM kills, as I mentioned in my original email, and could also
make it easier to track metrics and create alerts, since the labels wouldn't
change as much.

One challenge is probably the new elastic scaling features [2], since the
driver would have to differentiate between new pod requests caused by a TM
terminating and requests caused by scaling. I'm also not sure where
downscaling requests are currently handled.

I would be interested in taking a look at this and seeing if I can get
something working. I think it would be possible to make it configurable in
a way that maintains backwards compatibility. Would it be OK if I open a
Jira ticket and try it out?

Regards,
Alexis.

[1] https://lists.apache.org/thread/jysgdldv8swgf4fhqwqochgf6hq0qs52
[2] https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/deployment/elastic_scaling/
