[ 
https://issues.apache.org/jira/browse/IGNITE-25276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mirza Aliev updated IGNITE-25276:
---------------------------------
    Description: 
h3. Motivation

Currently, when {{doStableKeySwitch}} fails due to MetaStorage being 
unavailable (e.g., the only MetaStorage node is stopped), the intent to perform 
a stable switch is lost. We need to ensure that such intents are not discarded 
and can be recovered once MetaStorage becomes available again.
h3. Implementation details
Algo:

{code:java}
// In RebalanceRaftGroupEventsListener#doStableKeySwitchWithExceptionHandling,
// inside the whenComplete block:

if ex is recoverable (TimeoutException, etc.):
    if ex is TimeoutException:
        try to refresh the MetaStorage (MS) leader
        if the leader is available:
            call doStableKeySwitchWithExceptionHandling again, with throttling
        else (the leader refresh itself timed out):
            register a listener on MS.TopologyAwareRaftGroupService.onLeaderElected
            and call doStableKeySwitchWithExceptionHandling again once a leader is elected
 {code}
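The algorithm above could be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not the actual Ignite code: {{LeaderAwareService}}, {{StableSwitchRetrier}}, the {{stableKeySwitch}} supplier and the {{throttledExecutor}} parameter are hypothetical stand-ins, only {{TimeoutException}} is treated as recoverable, and any failure of the leader refresh is treated as "leader not available".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

/** Hypothetical stand-in for the MetaStorage Raft client; NOT the real Ignite API. */
interface LeaderAwareService {
    /** Completes normally if a leader is reachable, exceptionally otherwise. */
    CompletableFuture<Void> refreshLeader();

    /** Registers a callback that is invoked when a leader is elected. */
    void onLeaderElected(Runnable callback);
}

/** Hypothetical helper that retries the stable key switch until it succeeds. */
final class StableSwitchRetrier {
    private final LeaderAwareService metaStorage;
    private final Supplier<CompletableFuture<Void>> stableKeySwitch;
    private final Executor throttledExecutor;

    StableSwitchRetrier(
            LeaderAwareService metaStorage,
            Supplier<CompletableFuture<Void>> stableKeySwitch,
            Executor throttledExecutor) {
        this.metaStorage = metaStorage;
        this.stableKeySwitch = stableKeySwitch;
        this.throttledExecutor = throttledExecutor;
    }

    /** Returns a future that completes once the switch succeeds or fails unrecoverably. */
    CompletableFuture<Void> doStableKeySwitchWithExceptionHandling() {
        CompletableFuture<Void> result = new CompletableFuture<>();
        attempt(result);
        return result;
    }

    private void attempt(CompletableFuture<Void> result) {
        stableKeySwitch.get().whenComplete((unused, ex) -> {
            if (ex == null) {
                result.complete(null);
                return;
            }
            Throwable cause = unwrap(ex);
            if (!(cause instanceof TimeoutException)) {
                // Assumption: everything except a timeout is non-recoverable here.
                result.completeExceptionally(cause);
                return;
            }
            metaStorage.refreshLeader().whenComplete((ignored, refreshEx) -> {
                if (refreshEx == null) {
                    // Leader is available: retry, delayed by the throttling executor.
                    throttledExecutor.execute(() -> attempt(result));
                } else {
                    // Leader is not available: retry once when a leader is elected.
                    AtomicBoolean fired = new AtomicBoolean();
                    metaStorage.onLeaderElected(() -> {
                        if (fired.compareAndSet(false, true)) {
                            attempt(result);
                        }
                    });
                }
            });
        });
    }

    /** Strips the CompletionException wrapper added by dependent stages. */
    private static Throwable unwrap(Throwable ex) {
        return (ex instanceof CompletionException && ex.getCause() != null) ? ex.getCause() : ex;
    }
}
```

For throttling, one option is to pass {{CompletableFuture.delayedExecutor(100, TimeUnit.MILLISECONDS)}} as {{throttledExecutor}}; the 100 ms delay is an arbitrary example value. The one-shot guard keeps a repeatedly firing election listener from triggering duplicate retries; a real implementation would probably also unregister the listener.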
h3. Definition of done
 * Investigating whether there is already an automatic recovery mechanism for 
stable switch intents in place.

 * If no such mechanism exists, implementing logic that detects pending intents 
and retries them once MetaStorage is restored.

  was:
h3. Motivation
Currently, when {{doStableKeySwitch}} fails due to MetaStorage being 
unavailable (e.g., the only MetaStorage node is stopped), the intent to perform 
a stable switch is lost. We need to ensure that such intents are not discarded 
and can be recovered once MetaStorage becomes available again.

h3. Definition of done

* Investigating whether there is already an automatic recovery mechanism for 
stable switch intents in place.

* If no such mechanism exists, implementing logic that detects pending intents 
and retries them once MetaStorage is restored.



> Implement recovery mechanism for stable switch intents after MetaStorage 
> becomes available
> ------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-25276
>                 URL: https://issues.apache.org/jira/browse/IGNITE-25276
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mirza Aliev
>            Priority: Major
>              Labels: ignite-3
>
> h3. Motivation
> Currently, when {{doStableKeySwitch}} fails due to MetaStorage being 
> unavailable (e.g., the only MetaStorage node is stopped), the intent to 
> perform a stable switch is lost. We need to ensure that such intents are not 
> discarded and can be recovered once MetaStorage becomes available again.
> h3. Implementation details
> Algo:
> {code:java}
> // In RebalanceRaftGroupEventsListener#doStableKeySwitchWithExceptionHandling,
> // inside the whenComplete block:
>
> if ex is recoverable (TimeoutException, etc.):
>     if ex is TimeoutException:
>         try to refresh the MetaStorage (MS) leader
>         if the leader is available:
>             call doStableKeySwitchWithExceptionHandling again, with throttling
>         else (the leader refresh itself timed out):
>             register a listener on MS.TopologyAwareRaftGroupService.onLeaderElected
>             and call doStableKeySwitchWithExceptionHandling again once a leader is elected
>  {code}
> h3. Definition of done
>  * Investigating whether there is already an automatic recovery mechanism for 
> stable switch intents in place.
>  * If no such mechanism exists, implementing logic that detects pending 
> intents and retries them once MetaStorage is restored.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
