[ 
https://issues.apache.org/jira/browse/IGNITE-16655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roman Puchkovskiy reassigned IGNITE-16655:
------------------------------------------

    Assignee: Roman Puchkovskiy

> Volatile RAFT log for pure in-memory storages
> ---------------------------------------------
>
>                 Key: IGNITE-16655
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16655
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Sergey Chugunov
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: iep-74, ignite-3
>
> h3. Original issue description
> For in-memory storages, Raft logging can be optimized: we don't need it to be 
> active while the topology is stable.
> Each write can go directly to the in-memory storage at a much lower cost than 
> synchronizing it with disk, so it is possible to avoid writing the Raft log.
> As nodes don't have any persistent state and always join the cluster clean, we 
> always need to transfer a full snapshot during rebalancing, so there is no 
> need to keep a long Raft log for historical rebalancing purposes.
> So we need to implement an API for the Raft component that enables 
> configuration of the Raft logging process.
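> A hypothetical shape of such a configuration point might look like this (the 
> interface and class names below are purely illustrative, not Ignite's actual 
> API): each Raft group would pick a log storage implementation, with the 
> volatile variant kept on the heap and dropped entirely on restart.
> ```java
> import java.util.concurrent.ConcurrentSkipListMap;
> 
> /** Illustrative sketch, not the actual Ignite interface. */
> interface RaftLogStorage {
>     void append(long index, byte[] entry);
> 
>     byte[] entryAt(long index);
> }
> 
> /** Volatile variant: a plain heap map, lost (intentionally) on restart. */
> class VolatileLogStorage implements RaftLogStorage {
>     private final ConcurrentSkipListMap<Long, byte[]> entries = new ConcurrentSkipListMap<>();
> 
>     @Override public void append(long index, byte[] entry) {
>         entries.put(index, entry);
>     }
> 
>     @Override public byte[] entryAt(long index) {
>         return entries.get(index);
>     }
> }
> ```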
> h3. More detailed description
> Apparently, we can't completely ignore writing to the log. There are several 
> situations where it needs to be collected:
>  * During a regular workload, each node needs to keep a small portion of the 
> log in case it becomes a leader. There might be a number of "slow" nodes 
> outside of the "quorum" that require older data to be re-sent to them. A log 
> entry can be truncated only when all nodes reply with an "ack" or fail; 
> otherwise the entry must be preserved.
>  * During a clean node join, it will need to apply the part of the log that 
> wasn't included in the full-rebalance snapshot. So everything, starting with 
> the snapshot's applied index, will have to be preserved.
> It feels like the second option is just a special case of the first one: we 
> can't truncate the log until we receive all acks, and we can't receive an ack 
> from the joining node until it finishes its rebalancing procedure.
> So it all comes down to aggressive log truncation to keep the log short.
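> The truncation rule above can be sketched roughly as follows (class and 
> method names are illustrative, not Ignite's actual API): track the last 
> acknowledged index per live peer, and drop everything up to the minimum of 
> those indexes; a permanently failed peer stops holding truncation back.
> ```java
> import java.util.HashMap;
> import java.util.Map;
> import java.util.Set;
> import java.util.TreeMap;
> 
> /** Minimal sketch of ack-driven log truncation; not Ignite's actual API. */
> class VolatileLogTruncator {
>     private final TreeMap<Long, byte[]> log = new TreeMap<>(); // index -> entry
>     private final Map<String, Long> acked = new HashMap<>();   // live peer -> last acked index
> 
>     VolatileLogTruncator(Set<String> peers) {
>         // A peer that hasn't acked anything yet holds the whole log back.
>         for (String p : peers) acked.put(p, 0L);
>     }
> 
>     void append(long index, byte[] entry) {
>         log.put(index, entry);
>     }
> 
>     void onAck(String peerId, long index) {
>         acked.merge(peerId, index, Math::max);
>         truncate();
>     }
> 
>     /** A permanently failed peer no longer prevents truncation. */
>     void onPeerFailed(String peerId) {
>         acked.remove(peerId);
>         truncate();
>     }
> 
>     private void truncate() {
>         // Everything up to the minimum acknowledged index is safe to drop.
>         long safe = acked.values().stream()
>                 .mapToLong(Long::longValue)
>                 .min()
>                 .orElse(Long.MAX_VALUE); // no live peers left: drop everything
>         log.headMap(safe, true).clear();
>     }
> 
>     int size() {
>         return log.size();
>     }
> }
> ```
> The clean-join case falls out of the same rule: the joining node simply 
> doesn't ack until its rebalancing finishes, so its entries stay preserved.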
> The preserved log can be quite big in reality, so a disk offloading mechanism 
> must be available.
> The easiest way to achieve this is to write into a RocksDB instance with the 
> WAL disabled. It'll store everything in memory until a flush, and even then 
> the amount of flushed data will be small on a stable topology. The absence of 
> a WAL is not an issue: the entire RocksDB instance can be dropped on restart, 
> since it's supposed to be volatile.
> To avoid even the smallest flush, we can use an additional volatile structure, 
> such as a ring buffer or a concurrent map, to store part of the log, and 
> transfer records into RocksDB only when that structure overflows. This sounds 
> more complicated and makes memory management more difficult, but we should 
> take it into consideration anyway.
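> The two-tier idea could be sketched like this (names are illustrative, not 
> Ignite's API; the overflow sink stands in for a put into the WAL-less RocksDB 
> instance): keep the hot tail of the log in a bounded in-memory deque, and 
> hand the oldest entries to the sink only when the deque fills up.
> ```java
> import java.util.ArrayDeque;
> import java.util.function.BiConsumer;
> 
> /** Sketch of a bounded in-memory log tier that spills on overflow. */
> class SpillingLogBuffer {
>     record Entry(long index, byte[] data) {}
> 
>     private final int capacity;
>     private final ArrayDeque<Entry> hot = new ArrayDeque<>();
>     private final BiConsumer<Long, byte[]> overflowSink; // e.g. a WAL-less RocksDB put
> 
>     SpillingLogBuffer(int capacity, BiConsumer<Long, byte[]> overflowSink) {
>         this.capacity = capacity;
>         this.overflowSink = overflowSink;
>     }
> 
>     void append(long index, byte[] data) {
>         if (hot.size() == capacity) {
>             // Spill only on overflow: on a stable topology with aggressive
>             // truncation, this path should rarely (if ever) be taken.
>             Entry oldest = hot.pollFirst();
>             overflowSink.accept(oldest.index(), oldest.data());
>         }
>         hot.addLast(new Entry(index, data));
>     }
> 
>     int inMemoryCount() {
>         return hot.size();
>     }
> }
> ```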
>  * Potentially, we could use a volatile page memory region for this purpose, 
> since it already has good control over the amount of memory used. But memory 
> overflow should be handled carefully: it is usually treated as an error and 
> might even cause a node failure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
