Yeah, at some point I investigated performance issues with AWS K8s. They
have somewhat strict rate limits on the K8s API server.
You run into the rate limits by configuring a very high checkpoint
frequency (I guess something like 500ms) and a high
state.checkpoints.num-retained count (e.g.
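A configuration along these lines would match the scenario described (a sketch only; the 500ms interval is the guess from above, and the retention count is an assumed illustrative value):

```yaml
# flink-conf.yaml (illustrative values, not a recommendation)
execution.checkpointing.interval: 500ms  # very high checkpoint frequency
state.checkpoints.num-retained: 50       # assumed high retention count
high-availability: kubernetes            # HA metadata kept in ConfigMaps
```

With settings like these, each completed checkpoint results in additional writes of HA metadata to the API server, which is presumably how the rate limits come into play.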

Thanks for sharing your opinions on the proposal. The concerns sound
reasonable. I guess I'm going to follow up on Chesnay's idea about
combining multiple requests into one for the K8s implementation. Having a
performance test for the k8s API server access sounds like a good idea,
too. Both
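Chesnay's idea of combining multiple requests into one could look roughly like the following sketch. This is plain Java with no K8s client involved; `CoalescingUpdater` and its callback are hypothetical names, where the callback stands in for a single ConfigMap update request:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch: coalesce several pending ConfigMap mutations into one
// combined update, instead of issuing one API server request per change.
class CoalescingUpdater {
    private final Map<String, String> pending = new LinkedHashMap<>();
    // Stands in for a single K8s API call that applies all changes at once.
    private final Consumer<Map<String, String>> sendRequest;

    CoalescingUpdater(Consumer<Map<String, String>> sendRequest) {
        this.sendRequest = sendRequest;
    }

    synchronized void put(String key, String value) {
        // Later writes to the same key overwrite earlier ones before flushing,
        // so redundant intermediate updates never reach the API server.
        pending.put(key, value);
    }

    synchronized void flush() {
        if (!pending.isEmpty()) {
            // One request covering all accumulated changes.
            sendRequest.accept(new LinkedHashMap<>(pending));
            pending.clear();
        }
    }
}
```

In practice, `flush()` would be driven by a timer or triggered once the in-flight request completes, which bounds the request rate regardless of how many individual updates are produced.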

This is a nice FLIP. I particularly like how much background it provides
on the issue; something that other FLIPs could certainly benefit from...
I went over the FLIP and had a chat with Matthias about it.
Somewhat unrelated to the FLIP, we found a flaw in the current cleanup
mechanism of failed

Thanks Matthias for continuously improving the clean-up process.
Given that we depend heavily on the K8s API server for the HA
implementation, I am not in favor of storing too many entries in the
ConfigMap, nor of adding more update requests to the API server. So I
lean towards Proposal #2. It just

I would like to bring this topic up one more time. I put some more thought
into it and created FLIP-270 [1] as a follow-up to FLIP-194 [2] with an
updated version of what I summarized in my previous email. It would be
interesting to get some additional perspectives on this; more specifically,
the

Hi everyone,
I’d like to start a discussion on repeatable cleanup of checkpoint data. In
FLIP-194 [1] we introduced repeatable cleanup of HA data alongside the
introduction of the JobResultStore component. The goal was to put Flink in
charge of cleaning up the data it owns. The Flink cluster