Github user StephanEwen commented on the issue:

https://github.com/apache/flink/pull/3524

Thanks for opening this pull request. Adding a `CompositeStateHandle` and a `StateRegistry` is a good idea.

Some thoughts:

- What do you think about making the `StateRegistry` into a `SharedStateRegistry` which only contains the handles to state that is shared across checkpoints? State that is exclusive to a checkpoint is not handled by that registry, but remains only in the checkpoint. That way we "isolate" the existing behavior against the coming changes and do not risk regressions in the state cleanup code (which is very critical for current users).

- Another reason for the above suggestion is to also bring some other code into place that has some "fast paths" and "safety nets" for checkpoint cleanups (currently only with non-shared state), for example dropping a checkpoint simply by a `rm -r` (see https://github.com/apache/flink/pull/3522 ). We have seen that for various users the state cleanup problems are among the biggest problems they have, which we can address very well with the work started in the above linked pull request. These things would work together seamlessly if the registry deals only with shared state handles.

- I am wondering if it is easier to put the registry into the checkpoint coordinator rather than the checkpoint stores. That way we need the code that deals with adding / failure handling / etc only once.
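To make the first suggestion concrete, here is a minimal sketch of what a registry that tracks only shared state could look like. It assumes simple reference counting keyed by a string. The class name `SharedStateRegistrySketch` and the methods `register`, `unregister`, and `referenceCount` are hypothetical illustrations, not the API in this pull request or in Flink.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch, not Flink's actual API: tracks only state that is
 * shared across checkpoints. Exclusive state never enters this registry
 * and stays owned by its checkpoint.
 */
class SharedStateRegistrySketch {

    // Number of checkpoints currently referencing each shared state key.
    private final Map<String, Integer> referenceCounts = new HashMap<>();

    /** Adds one reference to the shared state and returns the new count. */
    synchronized int register(String sharedStateKey) {
        return referenceCounts.merge(sharedStateKey, 1, Integer::sum);
    }

    /**
     * Drops one reference. Returns true when no checkpoint references the
     * state any longer, meaning the caller may now delete it.
     */
    synchronized boolean unregister(String sharedStateKey) {
        Integer count = referenceCounts.get(sharedStateKey);
        if (count == null) {
            throw new IllegalStateException("Unknown shared state: " + sharedStateKey);
        }
        if (count == 1) {
            referenceCounts.remove(sharedStateKey);
            return true;
        }
        referenceCounts.put(sharedStateKey, count - 1);
        return false;
    }

    /** Returns the current reference count, or 0 if the key is not registered. */
    synchronized int referenceCount(String sharedStateKey) {
        return referenceCounts.getOrDefault(sharedStateKey, 0);
    }
}
```

With a design along these lines, a checkpoint that holds no shared handles never touches the registry, so its exclusive files could still be dropped with the directory-level fast path mentioned above. Shared files would be deleted only once `unregister` reports that the last reference is gone.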