[ https://issues.apache.org/jira/browse/IGNITE-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ivan Bessonov updated IGNITE-23240:
-----------------------------------
Description:

h1. Preface

The current implementation, based on {{RocksDB}}, is known to be much slower than it should be. There are several obvious reasons for that:
 * writing into the WAL +and+ the memtable;
 * creating a unique key for every record;
 * inability to serialize data efficiently: we must build an intermediate representation before passing data into {{RocksDB}}'s API.

h1. Benchmarks
h3. Local benchmarks

Local benchmarks ({{LogStorageBenchmarks}}) have been performed on my local environment with {{fsync}} disabled. I got the following results:
 * {{Logit}}:

{noformat}
Test write:
  Log number      : 1024000
  Log Size        : 16384
  Batch Size      : 100
  Cost time(s)    : 23.541
  Total size      : 16777216000
  Throughput(bps) : 712680684
  Throughput(rps) : 43498
Test read:
  Log number      : 1024000
  Log Size        : 16384
  Batch Size      : 100
  Cost time(s)    : 3.808
  Total size      : 16777216000
  Throughput(bps) : 4405781512
  Throughput(rps) : 268907
Test done!{noformat}
 * {{RocksDB}}:

{noformat}
Test write:
  Log number      : 1024000
  Log Size        : 16384
  Batch Size      : 100
  Cost time(s)    : 178.785
  Total size      : 16777216000
  Throughput(bps) : 93840176
  Throughput(rps) : 5727
Test read:
  Log number      : 1024000
  Log Size        : 16384
  Batch Size      : 100
  Cost time(s)    : 13.572
  Total size      : 16777216000
  Throughput(bps) : 1236163866
  Throughput(rps) : 75449
Test done!{noformat}

While testing on a local environment is not ideal, it still shows a huge improvement in write speed (7.5x) and read speed (3.5x). Enabling {{fsync}} more or less equalizes write speed, but we still expect that a simpler log implementation would be faster due to its smaller overall overhead.

h3. Integration testing

A benchmark with 3 servers and 1 client writing data in multiple threads shows a throughput improvement of 34438 vs 30299.

{{RocksDB}}:
!Screenshot from 2024-09-20 10-38-53.png!
{{Logit}}:
!Screenshot from 2024-09-20 10-38-57.png!
A benchmark of single-threaded insertions in embedded mode shows a throughput improvement of 4072 vs 3739.

{{RocksDB}}:
!Screenshot from 2024-09-20 10-42-49.png!
{{Logit}}:
!Screenshot from 2024-09-20 10-43-09.png!

h1. Observations

Despite the drastic difference in log throughput, the increase in user operation throughput is only about 10%. This means that we lose a lot of time elsewhere, and optimizing those parts could significantly increase performance too. The log optimizations would become more noticeable after that.

h1. Unsolved issues

There are multiple issues with the new log implementation; some of them have been mentioned in IGNITE-22843:
 * {{Logit}} pre-allocates _a lot_ of space on the drive. Considering that we use a "log per partition" paradigm, this is too wasteful.
 * Storing a separate log file per partition is not scalable anyway; it's too difficult to optimize batches and {{fsync}} in this approach.
 * Using the same log for all tables in a distribution zone won't really solve the issue; the best it could do is make the problem _manageable_, in some sense.

h1. Shortly about how Logit works

Each log consists of 3 sets of files:
 * "segment" files with log data;
 * "configuration" files with raft configuration;
 * "index" files with pointers into segment and configuration files.

"segment" and "configuration" files contain chunks of data in the following format:
|Magic header|Payload size|Payload itself|

"index" files contain the following pieces of data:
|Magic header|Log entry type (data/cfg)|offset|position|

Each index entry is a fixed-length tuple that contains a "link" into one of the data files. An "index" file is essentially an offset table, used to resolve a "logIndex" into the real log data.

h1. What we should change

The list of actions needed to make this log fit the required criteria includes:
 * Merge "configuration" and "segment" files into one, to have fewer files on the drive; the distinction is arbitrary anyway. Let's call it a "data" file.
 * Use the same "data" file for multiple raft groups.
It's important to note that we can't use a "data" file per stripe, because the stripe calculation function is not _stable_: it allocates {{stripeId}} dynamically in order to have a smoother distribution at runtime.
 * The log should be able to enforce checkpoints/flushes in storage engines, in order to safely truncate data upon reaching a threshold (we truncate logs of multiple raft groups at the same time, which is why we need this). This means that we will change the way Raft canonically makes snapshots; instead, we will have our own approach, similar to what we have in Ignite 2.x. Or we will abuse the snapshot logic and trigger snapshots outside of the schedule.
 * To make {{fsync}} faster, we should get rid of "index" files, leaving only "data" files. That's because we would have {{O(N)}} index files and only {{O(1)}} data files, roughly speaking.
 * Removing index files would make log scans less efficient. This means that we would have to frequently scan data files in order to find a required log entry. We would also have to encode {{raftGroupId}} efficiently into the data files. To mitigate this we could use on-heap caches.
 * Local data recovery (log replay) would require {{O(N)}} data file scans, which is too much. Ignite 2.x uses a single scan to recover all partitions. Recovery logic in Ignite 3 would have to be rewritten.
 * There's an issue with log replay: the log contains uncommitted data that would be truncated after a leader election. This means that we should only replay a part of the data, not all of it.
 * We should enrich Logit checkpoint information with the status of all raft groups, not just one.
 * This might be a real blocker: {{LogManagerImpl#checkAndResolveConflict}} uses a {{truncateSuffix}} operation, which becomes really tricky when you don't have a dedicated index file. Such an operation might also be required in disaster recovery scenarios, when you need to roll back a "poisoned" command that breaks your state machine.
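Several of the points above amount to replacing per-group index files with shared "data" files that encode {{raftGroupId}} into each record, at the cost of linear scans. A minimal sketch of that trade-off (the record layout, class and method names are all assumptions for illustration, not the actual Logit format):

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of a shared "data" file holding log entries of multiple raft groups.
 * Hypothetical record layout:
 * | magic (4) | groupId (8) | logIndex (8) | payloadLength (4) | payload |
 */
class SharedDataFileSketch {
    static final int MAGIC = 0x4C4F4749; // assumed record marker
    static final int HEADER_SIZE = 4 + 8 + 8 + 4;

    /** Appends one record to the buffer and returns its start offset. */
    static int append(ByteBuffer buf, long groupId, long logIndex, byte[] payload) {
        int offset = buf.position();
        buf.putInt(MAGIC).putLong(groupId).putLong(logIndex)
           .putInt(payload.length).put(payload);
        return offset;
    }

    /** Without an index file, finding a group's entries means a linear scan. */
    static List<Long> scanGroup(ByteBuffer buf, long groupId) {
        List<Long> indexes = new ArrayList<>();
        ByteBuffer view = buf.duplicate();
        view.flip(); // read everything written so far
        while (view.remaining() >= HEADER_SIZE) {
            if (view.getInt() != MAGIC) {
                break; // corrupt record or end of data
            }
            long gid = view.getLong();
            long logIndex = view.getLong();
            int len = view.getInt();
            view.position(view.position() + len); // skip the payload
            if (gid == groupId) {
                indexes.add(logIndex);
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        ByteBuffer file = ByteBuffer.allocate(1024);
        append(file, 1L, 10L, "a".getBytes());
        append(file, 2L, 5L, "b".getBytes());
        append(file, 1L, 11L, "cc".getBytes());
        System.out.println(scanGroup(file, 1L)); // prints [10, 11]
    }
}
```

The sketch makes the cost model concrete: {{append}} and {{fsync}} touch a single file regardless of the number of groups, while {{scanGroup}} is O(file size), which is what the on-heap caches and the rewritten recovery logic mentioned above would have to compensate for.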
Basically, what we need to do is implement the WAL the way it's implemented in Ignite 2.x (without binary records, though). There's a chance that we should just port it instead of rewriting half of Logit; we should discuss that. Parts of JRaft would have to be updated.
> Ignite 3 new log storage
> ------------------------
>
>         Key: IGNITE-23240
>         URL: https://issues.apache.org/jira/browse/IGNITE-23240
>     Project: Ignite
>  Issue Type: Epic
>    Reporter: Ivan Bessonov
>    Priority: Major
>      Labels: ignite-3
> Attachments: Screenshot from 2024-09-20 10-38-53.png, Screenshot from 2024-09-20 10-38-57.png, Screenshot from 2024-09-20 10-42-49.png, Screenshot from 2024-09-20 10-43-09.png

--
This message was sent by Atlassian Jira
(v8.20.10#820010)