Sxnan commented on PR #138: URL: https://github.com/apache/flink-agents/pull/138#issuecomment-3307846572
I spotted a few bugs that need to be addressed: 1. There is a fundamental problem that Python async execution cannot work with per-action state and checkpoint at the moment. I created an issue https://github.com/apache/flink-agents/issues/185. But it should not block this PR, as it is not introduced by this PR. 2. The operator state used to store the `recoveryMarks` has to be handled carefully. We should take the `currentProcessingKeysOpState` as an example. During recovery, we should pass the unioned list of markers to the ActionStateStore, and let the action state store decide where to recover from given all the markers. 3. We cannot simply clean up the state in notifyCheckpointComplete, since we don't know if there are states that are yet to be used to recover. Therefore, we can only clean the state up to the sequence number for each key. For 2 and 3, it requires some advanced usage of the Flink state, so I prepare some code for your reference. Feel free to use the code. https://github.com/Sxnan/flink-agents/commits/per-stask-state-poc/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
