Ivan Bessonov created IGNITE-17611:
--------------------------------------

             Summary: Implement proper local storage recovery for transaction 
state store
                 Key: IGNITE-17611
                 URL: https://issues.apache.org/jira/browse/IGNITE-17611
             Project: Ignite
          Issue Type: Improvement
            Reporter: Ivan Bessonov


h3. Preliminaries

The current design expects transaction states to be replicated through the same RAFT 
groups that process partition transactional data. In code, this means that two 
physical storages are associated with a single state machine. This design 
is easy to implement while the system is stable, but fault tolerance and even a basic 
node restart introduce complications.

h3. Partition storage design

By itself, partition storage works this way:
 * every write command writes the RAFT log index associated with the 
command;
 * this index value is written atomically with the data from the command;
 * updates are accumulated in a memory buffer before being written to disk;
 * upon restart, we read the value of the last applied index and resume the 
recovery process from it. This is done with the RAFT snapshots infrastructure.
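
The behaviour described above can be illustrated with a minimal in-memory sketch. It is not the actual Ignite storage API: the class {{SketchPartitionStorage}} and all of its methods are hypothetical names used only for illustration, and "disk" is modelled by a second map.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of a partition storage; not Ignite's actual API. */
class SketchPartitionStorage {
    // State that has already reached "disk".
    private final Map<String, String> persistedData = new HashMap<>();
    private long persistedAppliedIndex;

    // Updates accumulated in memory since the last flush.
    private final Map<String, String> buffer = new HashMap<>();
    private long bufferedAppliedIndex;

    /** Applies a write command: the data and its RAFT log index change together. */
    synchronized void write(long raftIndex, String key, String value) {
        buffer.put(key, value);
        bufferedAppliedIndex = raftIndex;
    }

    /** Flushes the buffer; data and applied index are persisted as one unit. */
    synchronized void flush() {
        persistedData.putAll(buffer);
        persistedAppliedIndex = bufferedAppliedIndex;
        buffer.clear();
    }

    /** Simulates a node restart: everything that was not flushed is lost. */
    synchronized void restart() {
        buffer.clear();
        bufferedAppliedIndex = persistedAppliedIndex;
    }

    /** Index to resume recovery from. */
    synchronized long lastAppliedIndex() {
        return bufferedAppliedIndex;
    }

    synchronized String read(String key) {
        String buffered = buffer.get(key);
        return buffered != null ? buffered : persistedData.get(key);
    }
}
```

In this sketch, a command written after the last flush disappears on restart together with its index, so the applied index never points past the data that actually survived.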

h3. Changes to tx state store

Basically, the same mechanism has to be repeated for the tx state storage:
 * applied index value must be introduced to tx state storage;
 * updates must be atomic;
 * on restart, we should use the minimal value of last applied index from both 
TX State and MvPartition storages ({{PartitionSnapshotStorage}} has to be 
changed).
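
A rough sketch of the "minimal index" rule follows. The names {{IndexedStorage}} and {{RecoverySketch}} are hypothetical, and the sketch additionally assumes that each storage skips commands it had already applied before the restart; that skipping behaviour is an assumption of this example, not something prescribed by this issue.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical storage stub that only tracks its last applied index. */
class IndexedStorage {
    private long lastAppliedIndex;

    /** Indexes of the commands re-applied during recovery, for illustration. */
    final List<Long> applied = new ArrayList<>();

    IndexedStorage(long lastAppliedIndex) {
        this.lastAppliedIndex = lastAppliedIndex;
    }

    long lastAppliedIndex() {
        return lastAppliedIndex;
    }

    /** Applies the command unless this storage already contains it (assumption). */
    void applyIfNew(long index) {
        if (index <= lastAppliedIndex) {
            return;
        }
        applied.add(index);
        lastAppliedIndex = index;
    }
}

class RecoverySketch {
    /** Recovery has to start from the storage that is furthest behind. */
    static long recoveryStartIndex(IndexedStorage mvPartition, IndexedStorage txState) {
        return Math.min(mvPartition.lastAppliedIndex(), txState.lastAppliedIndex());
    }

    /** Replays commands (startIndex, commitIndex] into both storages. */
    static void replay(IndexedStorage mvPartition, IndexedStorage txState, long commitIndex) {
        for (long i = recoveryStartIndex(mvPartition, txState) + 1; i <= commitIndex; i++) {
            mvPartition.applyIfNew(i);
            txState.applyIfNew(i);
        }
    }
}
```

For example, if the MvPartition storage was flushed at index 5 and the tx state storage at index 3, replay starts after index 3; the tx state storage receives commands 4 and 5, while the MvPartition storage ignores them in this sketch.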

h3. Other necessary changes
 * atomic flush must be set up for the tx state storage. WAL should be disabled;
 * snapshot command must trigger the flush. Please refer to 
{{RocksDbFlushListener}} and {{RocksDbMvPartitionStorage#flush}} for 
implementation reference. Listener class can be generified and reused;
 * assertion in {{PartitionListener#onWrite}} should be removed or drastically 
improved;
 * read operations on storages must be prohibited until local recovery is 
completed - we should apply all commands up to the "commitIndex" value that was 
read at node start, otherwise the storages may contain data that is inconsistent 
with each other.
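
The last point could be sketched as a simple gate that opens once the applied index reaches the "commitIndex" read at node start. {{RecoveryGate}} and its methods are hypothetical names; how such a gate would be wired into the actual listener and storages is left open here.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical gate that blocks reads until local recovery has finished. */
class RecoveryGate {
    private final long commitIndexAtStart;
    private final CompletableFuture<Void> recovered = new CompletableFuture<>();

    RecoveryGate(long commitIndexAtStart, long lastAppliedIndex) {
        this.commitIndexAtStart = commitIndexAtStart;

        // Nothing to replay: the storages are already up to date.
        if (lastAppliedIndex >= commitIndexAtStart) {
            recovered.complete(null);
        }
    }

    /** Must be called after every command applied during recovery. */
    void onCommandApplied(long index) {
        if (index >= commitIndexAtStart) {
            recovered.complete(null); // No-op if already completed.
        }
    }

    /** Reads may proceed only when this returns true. */
    boolean readsAllowed() {
        return recovered.isDone();
    }

    /** Lets callers chain read operations after recovery instead of polling. */
    CompletableFuture<Void> whenRecovered() {
        return recovered;
    }
}
```

A read path could either reject requests while {{readsAllowed()}} is false or chain them on {{whenRecovered()}}; which of the two fits better is a design decision outside this sketch.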



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
