[ https://issues.apache.org/jira/browse/IGNITE-24904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Denis Chudov updated IGNITE-24904:
----------------------------------
    Description:
See IGNITE-24817 for the scenario.

The fundamental problem is that there is currently no way to tell why a tx state is absent: it may never have existed, or it may have been lost due to data loss.

An idea to start from: each time a replication group restores its majority, it writes the current time to storage. Write intents contain their creation time. If, during write intent (WI) resolution, we see that the majority restoration time is greater than the write intent creation time, then the tx state has most likely been lost.

We can also try to recover the latest known state from other cluster nodes, but it may already have been vacuumized there.

After the design is ready, we can move further with, for example:
 * lazily marking partitions with unresolvable write intents as DEGRADED (or another new status) and returning them to HEALTHY once the commit partition is recovered;
 * providing a CLI tool for listing, and possibly manually resolving, such write intents (may be blocked by IGNITE-25665, which introduces a way to get all write intents without a full storage scan).

  was:
See IGNITE-24817 for the scenario.

The fundamental problem is that there is currently no way to tell why a tx state is absent: it may never have existed, or it may have been lost due to data loss.

An idea to start from: each time a replication group restores its majority, it writes the current time to storage. Write intents contain their creation time. If, during write intent (WI) resolution, we see that the majority restoration time is greater than the write intent creation time, then the tx state has most likely been lost.

We can also try to recover the latest known state from other cluster nodes, but it may already have been vacuumized there.
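The timestamp comparison proposed in the description can be illustrated with a minimal sketch. This is not the Ignite implementation and does not use any Ignite API: the names `MissingTxStateReason` and `MissingTxStateClassifier` are hypothetical, timestamps are modelled as plain `long` values rather than whatever clock type the real design would use, and treating "no restoration recorded" as "state likely never existed" is an assumption made only for this example.

```java
import java.util.OptionalLong;

/** Hypothetical classification of why a tx state is absent from the commit partition. */
enum MissingTxStateReason {
    /** Majority was restored after the write intent was created, so the state was likely lost. */
    LIKELY_LOST,
    /** No later majority restoration is known, so the state likely never existed. */
    LIKELY_NEVER_EXISTED
}

/** Sketch of the heuristic from the description; all names here are illustrative assumptions. */
final class MissingTxStateClassifier {
    private MissingTxStateClassifier() {
        // Utility class, not meant to be instantiated.
    }

    /**
     * @param writeIntentCreationTime creation time stored in the write intent
     * @param lastMajorityRestorationTime time written by the replication group when it last
     *        restored majority, or empty if no restoration has been recorded (assumption)
     */
    static MissingTxStateReason classify(
            long writeIntentCreationTime,
            OptionalLong lastMajorityRestorationTime) {
        // Restoration time strictly greater than creation time means the write intent
        // predates the restoration, which is the "likely lost" case in the description.
        if (lastMajorityRestorationTime.isPresent()
                && lastMajorityRestorationTime.getAsLong() > writeIntentCreationTime) {
            return MissingTxStateReason.LIKELY_LOST;
        }
        return MissingTxStateReason.LIKELY_NEVER_EXISTED;
    }
}
```

For example, a write intent created at time 100 and resolved against a group that restored majority at time 200 would be classified as `LIKELY_LOST`, while one created at time 300 would not. The result is only a hint: as the description notes, the state might still be recoverable from other cluster nodes unless it has been vacuumized there.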
> Design the way to distinguish the absence of tx state due to the transaction
> from the case of loss the transaction state due to data loss in commit
> partition
> ----------------------------------------------------------------------------
>
>                 Key: IGNITE-24904
>                 URL: https://issues.apache.org/jira/browse/IGNITE-24904
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Denis Chudov
>            Priority: Major
>              Labels: ignite-3