[ https://issues.apache.org/jira/browse/IGNITE-23550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17903806#comment-17903806 ]
Kirill Tkalenko edited comment on IGNITE-23550 at 12/9/24 7:47 AM:
-------------------------------------------------------------------
Based on the results of running the tests from [PR 4845|https://github.com/apache/ignite-3/pull/4845] locally and on TC, I drew several conclusions.

As the JFR analysis shows, we spend a lot of time saving the checksum on each write to the metastorage, because it is written to the WAL in sync mode. Results of executing *MetaStorageManager#put* 100k times with and without sync mode for the checksum:
||Disable sync for checksum||TC/Local||Time||
|True|TC|4s 12ms 204us 192866ns, totalMs=4012, totalNs=4012204866|
|False|TC|22s 12ms 732us 720751ns, totalMs=22012, totalNs=22012732751|
|True|Local|2s 218ms 965us 747459ns, totalMs=2218, totalNs=2218965459|
|False|Local|5m 10s 157ms 117us, totalMs=310157, totalNs=310157117417|

From the table we can conclude that disabling WAL sync for the checksum improves performance several times over.

Results of a node restart with and without sync mode for the checksum, with 100k puts in the raft log:
||Disable sync for checksum||TC/Local||Time||
|True|TC|4s 744ms 454us, totalMs=4744, totalNs=4744454870|
|False|TC|24s 195ms 34us, totalMs=24195, totalNs=24195034186|
|True|Local|2s 877ms 511us, totalMs=2877, totalNs=2877511875|
|False|Local|6m 41s 771ms 593us, totalMs=401771, totalNs=401771593084|

This table also shows that fixing the checksum sync mode will yield a performance gain.
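To illustrate why per-write sync to the WAL is so expensive, here is a minimal, self-contained Java sketch. It is not Ignite code: the file names, the 8-byte payload standing in for a checksum, and the iteration count are all illustrative assumptions. It contrasts writing with and without forcing each write to disk, which is roughly the cost a sync-mode WAL pays per checksum update:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class SyncWriteCost {
    /** Writes {@code count} 8-byte records; if {@code sync} is set, fsyncs after every write. */
    static long writeAllNanos(Path file, int count, boolean sync) throws IOException {
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buf = ByteBuffer.allocate(8); // stand-in for a per-update checksum
            for (int i = 0; i < count; i++) {
                buf.clear();
                buf.putLong(i);
                buf.flip();
                ch.write(buf);
                if (sync) {
                    ch.force(false); // force data to disk on every write, like a sync-mode WAL
                }
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wal-sync-demo");
        int n = 1_000;
        long syncNs = writeAllNanos(dir.resolve("sync.bin"), n, true);
        long asyncNs = writeAllNanos(dir.resolve("async.bin"), n, false);
        System.out.printf("sync=%dms async=%dms ratio=%.1fx%n",
                syncNs / 1_000_000, asyncNs / 1_000_000, (double) syncNs / asyncNs);
    }
}
```

The exact ratio depends on the disk and filesystem, but the synced variant is typically orders of magnitude slower, which is consistent with the measurements in the tables above.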
Let's look at node restart times with and without taking a snapshot beforehand:
||TC/Local||Disable sync for checksum||Snapshot before restart||Time||
|TC|True|True|1s 890ms 270us, totalMs=1890, totalNs=1890270734|
|TC|True|False|4s 744ms 454us, totalMs=4744, totalNs=4744454870|
|TC|False|True|1s 836ms 376us, totalMs=1836, totalNs=1836376733|
|TC|False|False|24s 195ms 34us, totalMs=24195, totalNs=24195034186|
|Local|True|True|955ms 622us, totalMs=955, totalNs=955622750|
|Local|True|False|2s 877ms 511us, totalMs=2877, totalNs=2877511875|
|Local|False|True|804ms 878us 74708ns, totalMs=804, totalNs=804878708|
|Local|False|False|6m 41s 771ms 593us, totalMs=401771, totalNs=401771593084|

It can be concluded that taking a snapshot before restarting a node makes node startup much faster.

Here is how long it takes to take and restore a snapshot for a 200MB storage:
||TC/Local||do/restore snapshot||Time||
|TC|do|1s 911ms 127us, totalMs=1911, totalNs=1911127356|
|TC|restore|179ms 976us 797710ns, totalMs=179, totalNs=179976710|
|Local|do|799ms 362us, totalMs=799, totalNs=799362375|
|Local|restore|105ms 686us 581542ns, totalMs=105, totalNs=105686542|

Creating a snapshot and restoring from it can therefore provide a performance boost. Note, however, that these measurements were taken with no parallel load on the nodes.

h2. Conclusions:
# We need to sort out the sync mode for the checksum, since it affects the performance of metastorage commands both while the cluster is running and when nodes restart. What can be done?
## One option is to hook into the mechanism for sending raft commands and attach our own checksum alongside raft's, then handle it as we need. This is not an easy path, since it requires deeper changes to the raft code.
## Think of something else.
# Rethink how we work with metastorage snapshots. As the test showed, the more often we take them, the faster a node joins the cluster. Currently this does not work optimally: by default a snapshot is taken once an hour, which can slow down a node entering the topology. What can be done?
## The easiest option is to take snapshots more frequently, for example once per minute. Pros: easy to implement. Cons: taking a snapshot is not a cheap operation and can affect parallel disk activity such as raft log writes, checkpoints, etc.
## Take snapshots based on the number of operations, for example every 10k metastorage updates. Pros: a more predictable mechanism, with less impact on the node's disk when the metastorage is not being actively updated. Cons: it requires changes to the raft code, and there may still be unnecessary background disk activity with the same impact as in the point above.
## Take a snapshot only when a remote node requests one. Pros: snapshots are taken purely on demand, which may be rare and reduces disk load compared to the options above. Cons: it requires even more changes to the raft code, and frequent snapshot requests from remote nodes can still affect parallel disk activity.
## Stream the data directly, similar to partitions. Pros: no snapshots are needed at all; data is transferred over the network and read from disk without extra parallel disk activity. Cons: this requires much more code in the raft layer, and the whole mechanism needs to be thought through.
## Think of something else.

In my opinion, we can proceed as follows:
# Do point 1.1.
# For now, try point 2.3; later do 2.4, since it requires much more work.
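Option 2.2 above (snapshot every N updates) can be sketched in a few lines. This is an illustrative, hedged sketch, not the actual raft integration: {{CountingSnapshotTrigger}} and the {{Runnable}} snapshot hook are hypothetical names, and the real implementation would have to schedule the snapshot off the apply thread inside the raft state machine.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of option 2.2: trigger a snapshot every N applied metastorage updates. */
public class CountingSnapshotTrigger {
    private final long threshold;          // e.g. 10_000 updates
    private final Runnable takeSnapshot;   // hypothetical hook into the snapshot machinery
    private final AtomicLong applied = new AtomicLong();

    public CountingSnapshotTrigger(long threshold, Runnable takeSnapshot) {
        this.threshold = threshold;
        this.takeSnapshot = takeSnapshot;
    }

    /** Called after each applied update; fires the snapshot hook every {@code threshold} updates. */
    public void onUpdateApplied() {
        if (applied.incrementAndGet() % threshold == 0) {
            takeSnapshot.run(); // real code would schedule this asynchronously
        }
    }
}
```

Usage would look like {{new CountingSnapshotTrigger(10_000, stateMachine::snapshot)}}, with {{onUpdateApplied()}} invoked from the command apply path.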
> Test and optimize metastorage snapshot transfer and recovery speed for new nodes
> --------------------------------------------------------------------------------
>
>                 Key: IGNITE-23550
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23550
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Ivan Bessonov
>            Assignee: Kirill Tkalenko
>            Priority: Major
>              Labels: ignite-3
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Test and optimize metastorage snapshot transfer and recovery speed for new nodes.
> Let's assume that we have a 100Mb+ meta-storage snapshot and 100k+ entries in the raft log replicated as log.
> How long would it take for a new node to join the cluster under these conditions? Will something break? What can we do to make it work?
> The goal is that the joining process should work for long-running clusters. It should also be pretty fast: less than 10 seconds for sure, depending of course on the network capabilities. No timeout errors should occur if it takes more than 10 seconds.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)