[ https://issues.apache.org/jira/browse/IGNITE-25276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mirza Aliev updated IGNITE-25276:
---------------------------------
    Description:
h3. Motivation

Currently, when {{doStableKeySwitch}} fails because MetaStorage is unavailable (e.g., the only MetaStorage node is stopped), the intent to perform a stable switch is lost. We need to ensure that such intents are not discarded and can be recovered once MetaStorage becomes available again.

h3. Implementation details

Algo:
{code:java}
in method RebalanceRaftGroupEventsListener#doStableKeySwitchWithExceptionHandling, in the whenComplete block:
  if ex is recoverable (TimeoutException, etc.):
    if ex is TimeoutException:
      try to refresh the MS leader
      if it is available, call doStableKeySwitchWithExceptionHandling again with throttling
      if it is not available (meaning that the leader refresh timed out), then register a listener on
        MS.TopologyAwareRaftGroupService.onLeaderElected and call doStableKeySwitchWithExceptionHandling
        again when a leader is elected
{code}

h3. Definition of done
 * Investigating whether there is already an automatic recovery mechanism for stable switch intents in place.
 * If no such mechanism exists, implementing logic that detects pending intents and retries them once MetaStorage is restored.

  was:
h3. Motivation

Currently, when {{doStableKeySwitch}} fails due to MetaStorage being unavailable (e.g., the only MetaStorage node is stopped), the intent to perform a stable switch is lost. We need to ensure that such intents are not discarded and can be recovered once MetaStorage becomes available again.

h3. Definition of done
 * Investigating whether there is already an automatic recovery mechanism for stable switch intents in place.
 * If no such mechanism exists, implementing logic that detects pending intents and retries them once MetaStorage is restored.

> Implement recovery mechanism for stable switch intents after MetaStorage becomes available
> ------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-25276
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25276
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mirza Aliev
>            Priority: Major
>              Labels: ignite-3
>
> h3. Motivation
> Currently, when {{doStableKeySwitch}} fails because MetaStorage is
> unavailable (e.g., the only MetaStorage node is stopped), the intent to
> perform a stable switch is lost. We need to ensure that such intents are not
> discarded and can be recovered once MetaStorage becomes available again.
> h3. Implementation details
> Algo:
> {code:java}
> in method RebalanceRaftGroupEventsListener#doStableKeySwitchWithExceptionHandling,
> in the whenComplete block:
>   if ex is recoverable (TimeoutException, etc.):
>     if ex is TimeoutException:
>       try to refresh the MS leader
>       if it is available, call doStableKeySwitchWithExceptionHandling again
>         with throttling
>       if it is not available (meaning that the leader refresh timed out), then
>         register a listener on MS.TopologyAwareRaftGroupService.onLeaderElected
>         and call doStableKeySwitchWithExceptionHandling again when a leader
>         is elected
> {code}
> h3. Definition of done
> * Investigating whether there is already an automatic recovery mechanism for
> stable switch intents in place.
> * If no such mechanism exists, implementing logic that detects pending
> intents and retries them once MetaStorage is restored.
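
The retry flow in the "Implementation details" pseudocode could look roughly like the sketch below. This is not Ignite code: {{StableSwitchRetry}}, {{MetaStorageLeaderAccess}} and its two methods are hypothetical stand-ins for the MetaStorage Raft client, the stable switch itself is passed in as a {{Supplier}}, and throttling is modelled as a plain {{Executor}}. The ticket does not say how recoverable exceptions other than {{TimeoutException}} should be handled, so the sketch leaves that case alone.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Sketch only: keeps a stable switch intent alive until MetaStorage (MS) is reachable again. */
final class StableSwitchRetry {
    /** Hypothetical stand-in for the MS Raft client; not Ignite's actual API. */
    interface MetaStorageLeaderAccess {
        /** Completes normally if a leader was found, exceptionally if the refresh failed or timed out. */
        CompletableFuture<Void> refreshLeader();

        /** Registers a callback to run when an MS leader is elected. */
        void onLeaderElected(Runnable listener);
    }

    private final Supplier<CompletableFuture<Void>> stableSwitch;
    private final MetaStorageLeaderAccess ms;
    private final Executor throttle;

    StableSwitchRetry(Supplier<CompletableFuture<Void>> stableSwitch,
                      MetaStorageLeaderAccess ms,
                      Executor throttle) {
        this.stableSwitch = stableSwitch;
        this.ms = ms;
        this.throttle = throttle;
    }

    void doStableKeySwitchWithExceptionHandling() {
        stableSwitch.get().whenComplete((unused, ex) -> {
            if (ex == null) {
                return; // Switch succeeded, nothing to recover.
            }
            if (!(unwrap(ex) instanceof TimeoutException)) {
                return; // Handling of other failures is not specified in the ticket.
            }
            ms.refreshLeader().whenComplete((ignored, refreshEx) -> {
                if (refreshEx == null) {
                    // Leader is available: retry through the throttling executor.
                    throttle.execute(this::doStableKeySwitchWithExceptionHandling);
                } else {
                    // No leader yet: keep the intent and retry once a leader is elected.
                    ms.onLeaderElected(this::doStableKeySwitchWithExceptionHandling);
                }
            });
        });
    }

    /** Dependent stages wrap failures in CompletionException; look at the cause if so. */
    private static Throwable unwrap(Throwable ex) {
        return ex instanceof CompletionException && ex.getCause() != null ? ex.getCause() : ex;
    }
}
```

For the throttling executor, something like {{CompletableFuture.delayedExecutor(200, TimeUnit.MILLISECONDS)}} would do (the delay value is arbitrary here). A real implementation would presumably also need to unregister the leader-elected listener after it fires and guard against concurrent retries of the same intent; the sketch ignores both.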