[ https://issues.apache.org/jira/browse/IGNITE-20771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Lapin updated IGNITE-20771:
-------------------------------------
Description:

h3. Motivation

To implement tx coordinator recovery, a node must be able to tell whether the coordinator is dead or not. Every data node has its own local volatile txn state map (txId -> org.apache.ignite.internal.tx.TxStateMeta) where, besides other fields, we can find txCoordinatorId. The liveness check assumes that if a node with the given id is available in the physical topology, then the coordinator is alive; otherwise it is considered dead. Although such a local check is fast, there is no sense in running it too often, especially given the subsequent sends of initialRecoveryRequests. Thus, it seems reasonable to add one more field to TxStateMeta that stores the timestamp of the last liveness check. Because the checks are always local, it is valid to use System.currentTimeMillis or similar instead of HybridTimestamp in order to reduce contention on the clock. Please note that the triggers that initiate liveness checks will be implemented separately.

h3. Definition of Done

* A new lastLivenessCheck timestamp field is added to TxStateMeta.
* The aforementioned field is updated locally on each tx operation with currentTimeMillis.
* A new cluster-wide tx liveness interval configuration property is introduced.
* Within the liveness check:
** if (lastLivenessCheck >= currentTimeMillis - livenessInterval) - no-op
** else
*** update lastLivenessCheck
*** do the probe: check whether txCoordinatorId is still available in the physical topology. If it is available, no further actions are required; if it is not, then
**** trigger the initiateRecovery procedure implemented in IGNITE-20685.
**** if the commit partition is also unavailable (meaning that there is no primary replica), mark the transaction as abandoned.
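The liveness check steps from the Definition of Done can be sketched roughly as below. This is only an illustration under stated assumptions, not the Ignite implementation: TxMetaSketch, LivenessCheckSketch and LivenessOutcome are hypothetical names, the physical topology is modelled as a plain set of node ids, and the sketch reports an outcome instead of calling the real initiateRecovery procedure from IGNITE-20685. It also assumes that a missing commit partition means the tx is marked abandoned rather than recovered, which is one possible reading of the last two bullets.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for TxStateMeta, reduced to the fields the check needs. */
final class TxMetaSketch {
    final String txCoordinatorId;
    /** Wall-clock millis of the last tx operation or liveness probe. */
    final AtomicLong lastLivenessCheck;

    TxMetaSketch(String txCoordinatorId, long nowMillis) {
        this.txCoordinatorId = txCoordinatorId;
        this.lastLivenessCheck = new AtomicLong(nowMillis);
    }
}

/** What the caller should do after a check (names are assumptions). */
enum LivenessOutcome { SKIPPED, COORDINATOR_ALIVE, RECOVERY_INITIATED, ABANDONED }

final class LivenessCheckSketch {
    private final long livenessIntervalMs;
    private final LongSupplier clock;            // e.g. System::currentTimeMillis
    private final Set<String> physicalTopology;  // live view of node ids, not copied

    LivenessCheckSketch(long livenessIntervalMs, LongSupplier clock, Set<String> physicalTopology) {
        this.livenessIntervalMs = livenessIntervalMs;
        this.clock = clock;
        this.physicalTopology = physicalTopology;
    }

    /** Called on each tx operation: refreshes the timestamp locally. */
    void onTxOperation(TxMetaSketch meta) {
        meta.lastLivenessCheck.set(clock.getAsLong());
    }

    LivenessOutcome check(TxMetaSketch meta, boolean commitPartitionAvailable) {
        long now = clock.getAsLong();
        long last = meta.lastLivenessCheck.get();

        // Checked recently enough: no-op.
        if (last >= now - livenessIntervalMs) {
            return LivenessOutcome.SKIPPED;
        }
        // Update the timestamp; if another thread won the race, let it do the probe.
        if (!meta.lastLivenessCheck.compareAndSet(last, now)) {
            return LivenessOutcome.SKIPPED;
        }
        // The probe: is the coordinator still in the physical topology?
        if (physicalTopology.contains(meta.txCoordinatorId)) {
            return LivenessOutcome.COORDINATOR_ALIVE;
        }
        // Coordinator is gone: recover via the commit partition if it has a primary
        // replica, otherwise treat the tx as abandoned (assumed interpretation).
        return commitPartitionAvailable
                ? LivenessOutcome.RECOVERY_INITIATED
                : LivenessOutcome.ABANDONED;
    }
}
```

Passing the clock as a LongSupplier keeps the sketch testable; production code would presumably pass System::currentTimeMillis, as the description suggests.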
> Implement tx coordinator liveness check
> ---------------------------------------
>
> Key: IGNITE-20771
> URL: https://issues.apache.org/jira/browse/IGNITE-20771
> Project: Ignite
> Issue Type: Improvement
> Reporter: Alexander Lapin
> Priority: Major