Ivan Bessonov created IGNITE-17611:
--------------------------------------

Summary: Implement proper local storage recovery for transaction state store
Key: IGNITE-17611
URL: https://issues.apache.org/jira/browse/IGNITE-17611
Project: Ignite
Issue Type: Improvement
Reporter: Ivan Bessonov

h3. Preliminaries

The current design expects transaction states to be replicated by the same RAFT groups that process partition transactional data. In code, this means that two physical storages are associated with a single state machine. This design is easy to achieve while the system is stable, but fault tolerance and even a basic node restart introduce complications.

h3. Partition storage design

By itself, partition storage works this way:
* every write command writes the RAFT log index associated with that command;
* this index value is written atomically with the data from the command;
* updates are accumulated in a memory buffer before being written to disk;
* upon restart, we read the last applied index and resume recovery from it. This is done with the RAFT snapshots infrastructure.

h3. Changes to tx state store

Basically, everything has to be repeated:
* an applied index value must be introduced to the tx state storage;
* updates must be atomic;
* on restart, we should use the minimum of the last applied index values of the TX State and MvPartition storages ({{PartitionSnapshotStorage}} has to be changed).

h3. Other necessary changes

* atomic flush must be set up for the tx state storage. WAL should be disabled;
* the snapshot command must trigger the flush. Please refer to {{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for an implementation reference. The listener class can be generified and reused;
* the assertion in {{PartitionListener#onWrite}} should be removed or drastically improved;
* read operations on the storages must be prohibited until local recovery is completed: we should apply all commands up to the "commitIndex" value that was read at node start, otherwise the storages may contain data that is inconsistent with each other.
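
The recovery rule described above can be illustrated with a small sketch. It is not Ignite code: {{RecoverySketch}}, {{IndexedStorage}}, {{Command}} and {{recover}} are hypothetical names, the storages are plain in-memory maps, and for simplicity every command is assumed to touch both storages. It only shows the idea of replaying the log from the minimal last applied index up to "commitIndex" while skipping commands a storage has already persisted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed recovery rule; none of these types are Ignite APIs. */
final class RecoverySketch {
    /** A RAFT command reduced to its log index and a single key/value update. */
    static final class Command {
        final long index;
        final String key;
        final String value;

        Command(long index, String key, String value) {
            this.index = index;
            this.key = key;
            this.value = value;
        }
    }

    /** Stand-in for a storage that persists lastAppliedIndex together with its data. */
    static final class IndexedStorage {
        private final Map<String, String> data = new HashMap<>();
        private long lastAppliedIndex;

        IndexedStorage(long lastAppliedIndex) {
            this.lastAppliedIndex = lastAppliedIndex;
        }

        /** Applies the command unless this storage already reflects it. */
        void apply(Command cmd) {
            if (cmd.index <= lastAppliedIndex) {
                return; // Already persisted before the restart, so replay is a no-op.
            }
            // In a real storage these two writes would go into one atomic batch.
            data.put(cmd.key, cmd.value);
            lastAppliedIndex = cmd.index;
        }

        long lastAppliedIndex() {
            return lastAppliedIndex;
        }

        String get(String key) {
            return data.get(key);
        }
    }

    /** Replays log entries in (min applied index, commitIndex] and returns the starting index. */
    static long recover(IndexedStorage mvPartition, IndexedStorage txState,
                        List<Command> log, long commitIndex) {
        long from = Math.min(mvPartition.lastAppliedIndex(), txState.lastAppliedIndex());
        for (Command cmd : log) {
            if (cmd.index > from && cmd.index <= commitIndex) {
                mvPartition.apply(cmd);
                txState.apply(cmd);
            }
        }
        return from;
    }
}
```

In this sketch, reads would only be allowed after {{recover}} returns, since before that point one storage may be ahead of the other.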