Re: [DISCUSS] Repeatable cleanup of checkpoint data

2022-11-25 Thread Robert Metzger
Yeah, at some point I investigated performance issues with AWS K8s. They have somewhat strict rate limits on the K8s API server. You run into the rate limits by configuring a very high checkpoint frequency (I guess something like 500ms) and a high state.checkpoints.num-retained count (e.g.

Re: [DISCUSS] Repeatable cleanup of checkpoint data

2022-11-10 Thread Matthias Pohl
Thanks for sharing your opinions on the proposal. The concerns sound reasonable. I guess I'm going to follow up on Chesnay's idea about combining multiple requests into one for the k8s implementation. Having a performance test for the k8s API server access sounds like a good idea, too. Both

Re: [DISCUSS] Repeatable cleanup of checkpoint data

2022-11-07 Thread Chesnay Schepler
This is a nice FLIP. I particularly like how much background it provides on the issue; something that other FLIPs could certainly benefit from... I went over the FLIP and had a chat with Matthias about it. Somewhat unrelated to the FLIP, we found a flaw in the current cleanup mechanism of failed

Re: [DISCUSS] Repeatable cleanup of checkpoint data

2022-11-06 Thread Yang Wang
Thanks Matthias for continuously improving the clean-up process. Given that we depend heavily on the K8s APIServer for the HA implementation, I am not in favor of storing too many entries in the ConfigMap, or of adding more update requests to the APIServer. So I lean towards Proposal #2. It just

Re: [DISCUSS] Repeatable cleanup of checkpoint data

2022-10-27 Thread Matthias Pohl
I would like to bring this topic up one more time. I put some more thought into it and created FLIP-270 [1] as a follow-up of FLIP-194 [2] with an updated version of what I summarized in my previous email. It would be interesting to get some additional perspectives on this; more specifically, the

[DISCUSS] Repeatable cleanup of checkpoint data

2022-09-28 Thread Matthias Pohl
Hi everyone, I’d like to start a discussion on repeatable cleanup of checkpoint data. In FLIP-194 [1] we introduced repeatable cleanup of HA data along with the introduction of the JobResultStore component. The goal was to put Flink in charge of cleaning up the data it owns. The Flink cluster