[ https://issues.apache.org/jira/browse/IGNITE-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698938#comment-17698938 ]
Ivan Bessonov edited comment on IGNITE-18475 at 3/10/23 1:12 PM:
-----------------------------------------------------------------

First of all, what are the implications of completely disabling fsync for the log? (A sketch of the relevant configuration knob follows the list below.)
# If a minority of nodes have been restarted with the loss of a log suffix, everything works fine. Nodes are treated according to their real state, and the log is replicated once again. This case is covered by {{{}ItTruncateSuffixAndRestartTest#testRestartSingleNode{}}}.
# If a majority of nodes have been restarted, but only a minority has lost a log suffix, everything works fine. This case is covered by {{{}ItTruncateSuffixAndRestartTest#testRestartTwoNodes{}}}. This means that, in any situation where only a minority of nodes lost the log suffix, the raft group remains healthy and consistent.
# If a majority of nodes have been restarted, with the majority experiencing the loss of the log suffix, things become unstable:
## If the leader has not been restarted, it may replicate the log suffix to the followers that experienced data loss. If this happens, data will be consistent.
## If the leader has been restarted, a re-election will occur. Everything then depends on its outcome.
### A node with the newest data is elected as leader - everything is fine, data will be consistent after replication.
### A node with data loss is elected as leader. Two things may happen:
#### If, according to the new leader, only a single RAFT log entry has been lost, the group will move into a broken state. For example:
{code:java}
// Before start:
Node 0 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ..., data=0]
2: LogEntry [type=ENTRY_TYPE_DATA, id=LogId [index=2, term=1], ..., data=1]

Node 1 (offline)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ..., data=0]

Node 2 (offline)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ..., data=0]

// After start:
Node 0 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ..., data=0]
2: LogEntry [type=ENTRY_TYPE_DATA, id=LogId [index=2, term=1], ..., data=1]

Node 1 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ..., data=0]
2: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=2, term=3], ..., data=1]

Node 2 (online)
1: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=1, term=1], ..., data=0]
2: LogEntry [type=ENTRY_TYPE_CONFIGURATION, id=LogId [index=2, term=3], ..., data=1]
{code}
The log of node 0 is silently "corrupted": both its data and its configuration are inconsistent with the rest of the group. {*}This is, most likely, a bug in JRaft{*}. The following message can be seen in such a test for node 0, instead of an error:
{code:java}
WARNING: Received entries of which the lastLog=2 is not greater than appliedIndex=2, return immediately with nothing changed.
{code}
# ## ### #### If, according to the new leader, multiple log entries have been lost, the aforementioned bug does not occur. The new majority, which consists of the old nodes, will continue working, while the old minority with the "newer" data will fail to replicate new updates. To my knowledge, no snapshot installation would be attempted. Some data is permanently lost and is not recovered automatically; some group nodes require manual cleanup. Otherwise, data is consistent.

4. Full cluster restart, where a majority of nodes lose the log suffix, seems to be equivalent to 3.2.2.
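For context, the knob being discussed is {{RaftOptions.sync}}, the same flag toggled in the benchmarks referenced in the issue description below. The following is a minimal sketch of disabling it, written against the upstream {{com.alipay.sofa.jraft}} API; Ignite 3 embeds a repackaged fork of JRaft, so the actual package names and wiring in the product differ, and the helper class here is purely illustrative:
{code:java}
import com.alipay.sofa.jraft.option.NodeOptions;
import com.alipay.sofa.jraft.option.RaftOptions;

/** Hypothetical helper, only to show where the sync flag lives. */
public class RaftLogSyncToggle {
    /**
     * Node options with per-entry log fsync disabled. With sync=false a crash
     * may drop the tail of the RAFT log - the "loss of log suffix" analyzed
     * in the scenarios above - in exchange for much cheaper log appends.
     */
    public static NodeOptions withoutLogFsync() {
        RaftOptions raftOptions = new RaftOptions();
        raftOptions.setSync(false);     // do not fsync every appended log entry
        raftOptions.setSyncMeta(false); // raft meta storage sync is a separate knob

        NodeOptions nodeOptions = new NodeOptions();
        nodeOptions.setRaftOptions(raftOptions);
        return nodeOptions;
    }
}
{code}
The trade-off is exactly what the list above walks through: a node that acknowledged entries may come back after a crash with a shorter log, and the group's behavior then depends on how many nodes lost their suffix and who wins the election.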
> Huge performance drop with enabled sync write per log entry for RAFT logs
> --------------------------------------------------------------------------
>
>                 Key: IGNITE-18475
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18475
>             Project: Ignite
>          Issue Type: Task
>            Reporter: Kirill Gusakov
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> During the YCSB benchmark runs for Ignite 3 beta1 we found out that we have significant performance issues for select/insert queries.
> One of the root causes of these issues: every log entry is written to RocksDB with the sync option enabled (which leads to frequent fsync calls).
> These issues can be reproduced by the localized jmh benchmarks [SelectBenchmark|https://github.com/gridgain/apache-ignite-3/blob/4b9de922caa4aef97a5e8e159d5db76a3fc7a3ad/modules/runner/src/test/java/org/apache/ignite/internal/benchmark/SelectBenchmark.java#L39] and [InsertBenchmark|https://github.com/gridgain/apache-ignite-3/blob/4b9de922caa4aef97a5e8e159d5db76a3fc7a3ad/modules/runner/src/test/java/org/apache/ignite/internal/benchmark/InsertBenchmark.java#L29] with RaftOptions.sync=true/false:
> * jdbc select queries: 115ms vs 4ms
> * jdbc insert queries: 70ms vs 2.5ms
> (These results were obtained on a MacBook Pro (16-inch, 2019), and it looks like macOS has a slow fsync in general, but runs on Ubuntu also show a huge difference (~26 times for the insert test). So your environment may show a different, but still large, gap.)
> Why select queries suffer from syncs even more than inserts is described in https://issues.apache.org/jira/browse/IGNITE-18474.
> Possible solutions for the issue:
> * Do not sync every raft record in RocksDB by default (see the latency sketch below), but this can break the raft invariants
> * Investigate the inner workings of RocksDB (according to syscall tracing, not every write with sync produces an fsync syscall); maybe other strategies will be suitable for our cases
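Regarding the numbers in the quoted description: when {{WriteOptions.sync=true}}, RocksDB forces the WAL to disk on every write, so each log append pays for an fsync. The following self-contained sketch (not taken from the Ignite benchmarks; the database path and iteration count are arbitrary) reproduces the effect with plain RocksDB:
{code:java}
import java.nio.charset.StandardCharsets;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteOptions;

public class SyncWriteCost {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/raft-log-sync-demo");
             WriteOptions synced = new WriteOptions().setSync(true);
             WriteOptions buffered = new WriteOptions().setSync(false)) {

            System.out.printf("sync=true : %d ms%n", putAll(db, synced, 1_000));
            System.out.printf("sync=false: %d ms%n", putAll(db, buffered, 1_000));
        }
    }

    /** Writes {@code count} small entries and returns the elapsed time in milliseconds. */
    private static long putAll(RocksDB db, WriteOptions writeOptions, int count) throws RocksDBException {
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            byte[] key = ("key-" + i).getBytes(StandardCharsets.UTF_8);
            byte[] value = ("value-" + i).getBytes(StandardCharsets.UTF_8);
            db.put(writeOptions, key, value); // sync=true forces a WAL fsync per put
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
{code}
On typical hardware the synced loop is slower by one to two orders of magnitude, which is in line with the ~26x insert difference reported above; the exact ratio depends on the filesystem and on how the drive honors flushes.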