Ivan Bessonov created IGNITE-16304:
--------------------------------------

             Summary: [POC] In-Memory storage integration
                 Key: IGNITE-16304
                 URL: https://issues.apache.org/jira/browse/IGNITE-16304
             Project: Ignite
          Issue Type: Task
          Components: persistence
    Affects Versions: 3.0.0-alpha3
            Reporter: Ivan Bessonov


Goals

We need an in-memory store, similar to Ignite 2. This store must reuse the 
common replication infrastructure; in other words, it must be integrated into 
the raft STM and support transactions.

The raft protocol implies some persistent state: metadata, logs, snapshot.

The simplest solution is to write the raft persistent state to disk (this is 
already implemented for 
org.apache.ignite.internal.storage.basic.ConcurrentHashMapPartitionStorage). 

Drawback: this is not a fully in-memory solution, and it doesn't differ much 
from a database cache.

Alternatively, we can go the pure in-memory way: keep all raft state in a 
volatile store.
h3. Raft metadata

Must not be persisted for a pure in-memory cluster, because the state is always 
lost on restart. 

Note: a node must always be removed from the raft group when it’s removed from 
the baseline by auto-adjust, and it should rejoin as a new node (in-memory 
always works with auto-adjust, similarly to Ignite 2). *Out of scope.*
h3. Log store

Has a working in-memory implementation (currently used in tests): 
org.apache.ignite.raft.jraft.storage.impl.LocalLogStorage

Note: generally speaking, the log is only required for "historical rebalancing" 
after the snapshot rebalance. It won't be needed at all once it is possible to 
apply a snapshot and concurrent updates at the same time, for example when a 
solution like MVCC is implemented.
h3. Snapshots

Can be implemented over any kv store extended with some kind of Copy-On-Write 
support. Not implemented currently. More details below.
h3. COW buffer

To create an in-memory snapshot, the snapshot data is written to a separate 
in-memory buffer. The buffer is populated from the state machine update thread 
either by the update operations or by a snapshot advance mini-task which is 
submitted to the state machine update thread as needed.

To maintain a snapshot, the state machine needs to keep a snapshot iterator 
boundary key. If a key being updated is smaller than or equal to the boundary 
key, no additional action is needed because the snapshot iterator has already 
processed this key. If a key being updated is larger than the boundary key, the 
old version of the key is eagerly put into the snapshot buffer and the key is 
marked with the snapshot ID (so that the key is skipped during further 
iteration). The snapshot advance mini-task iterates over the next batch of keys 
starting from the boundary key and puts into the snapshot buffer only the keys 
that are not yet marked with the snapshot ID.
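
The boundary-key bookkeeping described above can be sketched roughly as 
follows. This is only an illustration under stated assumptions, not Ignite 
code: the class CowSnapshotSketch and all of its members are hypothetical, 
keys and values are plain strings, a TreeMap stands in for the real kv store, 
and every method is assumed to be called from the single state machine update 
thread. Handling of removals and of keys created during the snapshot is one 
possible interpretation, since the description only covers updates.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/**
 * Hypothetical sketch, not an Ignite API. Every method is assumed to run on
 * the single state machine update thread, so no locking is shown.
 */
class CowSnapshotSketch {
    /** Stored value plus the ID of the snapshot that already captured the key (0 = none). */
    private static final class Versioned {
        String value;
        long snapshotId;

        Versioned(String value, long snapshotId) {
            this.value = value;
            this.snapshotId = snapshotId;
        }
    }

    private final TreeMap<String, Versioned> store = new TreeMap<>();
    private final TreeMap<String, String> snapshotBuffer = new TreeMap<>();
    private long lastSnapshotId;
    private long activeSnapshotId; // 0 = no snapshot in progress
    private String boundaryKey;    // last key the snapshot iterator processed; null = none yet

    void startSnapshot() {
        snapshotBuffer.clear();
        boundaryKey = null;
        activeSnapshotId = ++lastSnapshotId;
    }

    /** True if a snapshot is running and the iterator has not reached this key yet. */
    private boolean aheadOfIterator(String key) {
        return activeSnapshotId != 0 && (boundaryKey == null || key.compareTo(boundaryKey) > 0);
    }

    void put(String key, String value) {
        Versioned cur = store.get(key);
        boolean ahead = aheadOfIterator(key);
        if (cur == null) {
            // A key created ahead of the iterator is marked so the snapshot skips it.
            store.put(key, new Versioned(value, ahead ? activeSnapshotId : 0));
            return;
        }
        if (ahead && cur.snapshotId != activeSnapshotId) {
            snapshotBuffer.put(key, cur.value); // eagerly preserve the old version
            cur.snapshotId = activeSnapshotId;
        }
        cur.value = value;
    }

    void remove(String key) {
        boolean ahead = aheadOfIterator(key);
        Versioned old = store.remove(key);
        if (old != null && ahead && old.snapshotId != activeSnapshotId) {
            snapshotBuffer.put(key, old.value); // the snapshot must still see the removed key
        }
    }

    /** Snapshot advance mini-task: copies up to batchSize keys; returns true once the snapshot is complete. */
    boolean advance(int batchSize) {
        if (activeSnapshotId == 0) return true;
        NavigableMap<String, Versioned> rest =
            boundaryKey == null ? store : store.tailMap(boundaryKey, false);
        int processed = 0;
        for (Map.Entry<String, Versioned> e : rest.entrySet()) {
            if (processed == batchSize) return false; // more keys remain for the next mini-task
            if (e.getValue().snapshotId != activeSnapshotId) {
                snapshotBuffer.put(e.getKey(), e.getValue().value);
            }
            boundaryKey = e.getKey();
            processed++;
        }
        activeSnapshotId = 0;
        return true;
    }

    String get(String key) {
        Versioned v = store.get(key);
        return v == null ? null : v.value;
    }

    Map<String, String> snapshot() {
        return snapshotBuffer;
    }
}
```

In this sketch, updates interleaved with advance(...) calls leave the buffer 
holding the values as of startSnapshot(), while get(...) keeps returning the 
latest values. The per-key mark costs one extra long per entry, which is the 
price of not storing multiple versions in the tree itself.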

This approach has memory requirements similar to the first alternative, but 
does not require modifying the storage tree so that it can store multiple 
versions of the same key. It also allows for transparent offloading of the 
snapshot buffer to disk, which can reduce memory requirements. It is also 
simpler to implement because the code is essentially single-threaded and only 
requires synchronization for the in-memory buffer. The downside is that 
snapshot advance tasks will increase the tail latency of state machine update 
operations.

Can be implemented on top of any kv store.

Note: we should consider the possibility of streaming the snapshot instead of 
storing it in memory until it is completed.



