[ https://issues.apache.org/jira/browse/IGNITE-16655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman Puchkovskiy reassigned IGNITE-16655:
------------------------------------------

    Assignee: Roman Puchkovskiy

> Volatile RAFT log for pure in-memory storages
> ---------------------------------------------
>
>                 Key: IGNITE-16655
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16655
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Sergey Chugunov
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: iep-74, ignite-3
>
> h3. Original issue description
> For in-memory storage, Raft logging can be optimized: we don't need it to be active while the topology is stable.
> Each write can go directly to the in-memory storage at a much lower cost than synchronizing it with disk, so it is possible to avoid writing the Raft log.
> Since nodes don't have any state and always join the cluster clean, we always need to transfer a full snapshot during rebalancing, so there is no need to keep a long Raft log for historical rebalancing purposes.
> So we need to implement an API for the Raft component that enables configuration of the Raft logging process.
> h3. More detailed description
> Apparently, we can't completely ignore writing to the log. There are several situations where it needs to be collected:
> * During a regular workload, each node needs to keep a small portion of the log in case it becomes a leader. There might be a number of "slow" nodes outside of the "quorum" that require older data to be re-sent to them. A log entry can be truncated only when all nodes reply with an "ack" or fail; otherwise the entry should be preserved.
> * During a clean node join, the joining node will need to apply the part of the log that wasn't included in the full-rebalance snapshot. So everything starting with the snapshot's applied index will have to be preserved.
> It feels like the second case is just a special case of the first one: we can't truncate the log until we receive all acks, and we can't receive an ack from the joining node until it finishes its rebalancing procedure.
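As a quick illustration of the truncation rule described above (this is a sketch, not Ignite code; the class and method names are hypothetical): the log prefix can only be dropped up to the minimum index acknowledged across all peers that are still expected to catch up, so a single slow or rebalancing node holds truncation back.

```java
import java.util.Map;

/** Illustrative sketch: log truncation is bounded by the slowest live peer. */
class TruncationBound {
    /**
     * Returns the highest log index (inclusive) that is safe to truncate:
     * the minimum index acknowledged across all peers that must still be
     * caught up. Permanently failed peers are excluded by the caller.
     * Returns -1 when there are no peers to wait for.
     */
    static long safeTruncationIndex(Map<String, Long> ackedIndexByPeer) {
        if (ackedIndexByPeer.isEmpty()) {
            return -1;
        }
        long min = Long.MAX_VALUE;
        for (long acked : ackedIndexByPeer.values()) {
            min = Math.min(min, acked);
        }
        return min;
    }
}
```

A joining node that hasn't finished rebalancing would simply appear here with the snapshot's applied index as its acknowledged index, which matches the "special case" observation above.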
> So it all comes down to aggressive log truncation to keep the log short.
> In reality the preserved log can still be quite big, so a disk-offloading operation must be available.
> The easiest way to achieve this is to write into a RocksDB instance with the WAL disabled. It will store everything in memory until a flush, and even then the amount of flushed data will be small on a stable topology. The absence of a WAL is not an issue: the entire RocksDB instance can be dropped on restart, since it is supposed to be volatile.
> To avoid even the smallest flush, we could use an additional volatile structure, such as a ring buffer or a concurrent map, to store part of the log, and transfer records into RocksDB only when that structure overflows. This sounds more complicated and makes memory management more difficult, but we should take it into consideration anyway.
> * Potentially, we could use a volatile page memory region for this purpose, since it already has good control over the amount of memory used. However, memory overflow would have to be handled carefully: it is usually treated as an error and might even cause node failure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
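The "volatile structure spilling into WAL-less RocksDB" idea could be sketched roughly as below. This is a minimal illustration under stated assumptions, not Ignite code: the class and the spill hook are hypothetical, a real implementation would spill batches rather than single entries, and the spill target would be a RocksDB instance written to with the WAL disabled (in the RocksDB Java API, a `WriteOptions` with `setDisableWAL(true)`).

```java
import java.util.ArrayDeque;
import java.util.function.Consumer;

/**
 * Illustrative sketch: a bounded volatile log buffer that keeps recent
 * entries in memory and hands overflowing entries to a spill target --
 * in the proposal above, a RocksDB instance opened with the WAL disabled.
 * Everything here is intentionally lost on restart.
 */
class VolatileLogBuffer {
    private final ArrayDeque<byte[]> buffer = new ArrayDeque<>();
    private final int capacity;
    private final Consumer<byte[]> spill; // e.g. a WAL-less RocksDB write

    VolatileLogBuffer(int capacity, Consumer<byte[]> spill) {
        this.capacity = capacity;
        this.spill = spill;
    }

    /** Appends an entry; on overflow the oldest entry is spilled to the disk-backed store. */
    void append(byte[] entry) {
        buffer.addLast(entry);
        if (buffer.size() > capacity) {
            spill.accept(buffer.pollFirst());
        }
    }

    int inMemorySize() {
        return buffer.size();
    }
}
```

On a stable topology with timely acks, truncation keeps the buffer short and the spill hook is rarely invoked, which is exactly the property the proposal is after.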