[jira] [Comment Edited] (CASSANDRA-10195) TWCS experiments and improvement proposals
[ https://issues.apache.org/jira/browse/CASSANDRA-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876173#comment-14876173 ] Philip Thompson edited comment on CASSANDRA-10195 at 9/18/15 7:13 PM:
--
[~krummas], results with your branch from 9644 are here: http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=dtcs-9644.read.json I'll update that graph with a run that has a smaller dataset soon; it does not load very well currently. As you can see, the patch for 9644 saw incredible performance improvements over vanilla DTCS.

was (Author: philipthompson): [~krummas], results with your branch from 9644 are here: http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=dtcs-9644.read.json I'll update that graph with a run that has a smaller dataset soon, it does not load very well currently. As you can see, the patch for 9644 saw incredibly performance improvements over vanilla DTCS.

> TWCS experiments and improvement proposals
> --
>
> Key: CASSANDRA-10195
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10195
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Antti Nissinen
> Fix For: 2.1.x, 2.2.x
>
> Attachments: 20150814_1027_compaction_hierarchy.txt,
> node0_20150727_1250_time_graph.txt, node0_20150810_1017_time_graph.txt,
> node0_20150812_1531_time_graph.txt, node0_20150813_0835_time_graph.txt,
> node0_20150814_1054_time_graph.txt, node1_20150727_1250_time_graph.txt,
> node1_20150810_1017_time_graph.txt, node1_20150812_1531_time_graph.txt,
> node1_20150813_0835_time_graph.txt, node1_20150814_1054_time_graph.txt,
> node2_20150727_1250_time_graph.txt, node2_20150810_1017_time_graph.txt,
> node2_20150812_1531_time_graph.txt, node2_20150813_0835_time_graph.txt,
> node2_20150814_1054_time_graph.txt, sstable_count_figure1.png,
> sstable_count_figure2.png
>
>
> This JIRA item describes experiments with DateTieredCompactionStrategy (DTCS)
> and TimeWindowCompactionStrategy (TWCS) and proposes modifications to TWCS.
> In a test system several crashes were caused intentionally (and
> unintentionally) and repair operations were executed, leading to a flood of
> small SSTables. The target was to be able to compact those files and release
> the disk space reserved by duplicate data. The setup is as follows:
> - Three nodes
> - DateTieredCompactionStrategy, max_sstable_age_days = 5
> - Cassandra 2.1.2
> The setup and data format have been documented in detail here:
> https://issues.apache.org/jira/browse/CASSANDRA-9644.
> The test was started by dumping a few days' worth of data to the database for
> 100 000 signals. Time graphs of SSTables from the different nodes indicate
> that DTCS has been working as expected and the SSTables are nicely ordered
> time-wise.
> See files:
> node0_20150727_1250_time_graph.txt
> node1_20150727_1250_time_graph.txt
> node2_20150727_1250_time_graph.txt
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> -- Address Load Tokens Owns Host ID Rack
> UN 139.66.43.170 188.87 GB 256 ? dfc29863-c935-4909-9d7f-c59a47eda03d rack1
> UN 139.66.43.169 198.37 GB 256 ? 12e7628b-7f05-48f6-b7e4-35a82010021a rack1
> UN 139.66.43.168 191.88 GB 256 ? 26088392-f803-4d59-9073-c75f857fb332 rack1
> All nodes crashed due to a power failure (known beforehand) and repair
> operations were started for each node, one at a time. Below is the behavior
> of the SSTable count on the different nodes. New data was dumped
> simultaneously with the repair operations.
> SEE FIGURE: sstable_count_figure1.png
> Vertical lines indicate the following events:
> 1) The cluster was down due to the power shutdown and was restarted. At the
> first vertical line the repair operation (nodetool repair -pr) was started
> for the first node
> 2) The repair operation for the second node was started after the first node
> was successfully repaired
> 3) The repair operation for the third node was started
> 4) The third repair operation finished
> 5) One of the nodes crashed (unknown reason at the OS level)
> 6) A repair operation (nodetool repair -pr) was started for the first node
> 7) The repair operation for the second node was started
> 8) The repair operation for the third node was started
> 9) The repair operations finished
> These repair operations led to a huge number of small SSTables covering the
> whole time span of the data. The compaction horizon of DTCS was limited to 5
> days (max_sstable_age_days) due to the size of the SSTables on disk.
> Therefore, the small SSTables won't be compacted. Below are the time graphs
> from the SSTables after the second round of repairs.
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> -- Address Load Tokens Owns Host ID Rack
> UN xx.xx.xx.170 663.61 GB 256 ? dfc29863-c935-4909-9d7f-c59a47eda03d rack1
> UN xx.xx.xx.169 763.52 GB 256 ? 12e7628b-7f05-48f6-b7e4-35a82010021a rack1
> UN xx.xx.xx.168 651.59 GB 256 ? 26088392-f803-4d59-9073-c75f857fb332 rack1
> See files:
> node0_20150810_1017_time_graph.txt
> node1_20150810_1017_time_graph.txt
> node2_20150810_1017_time_graph.txt
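The failure mode described above (SSTables whose newest data is older than max_sstable_age_days fall outside DTCS's compaction horizon, so the small repair-generated SSTables in old time windows are never merged) can be sketched with a toy filter. This is illustrative pseudocode only, not the actual DTCS implementation:

```python
from datetime import datetime, timedelta

MAX_SSTABLE_AGE_DAYS = 5  # the setting used in this test setup

def compaction_candidates(sstables, now):
    """Toy model: DTCS only considers SSTables whose newest data is
    younger than max_sstable_age_days; anything older is never picked
    again, so small repair-generated SSTables in old windows pile up."""
    cutoff = now - timedelta(days=MAX_SSTABLE_AGE_DAYS)
    return [s for s in sstables if s["max_ts"] >= cutoff]

now = datetime(2015, 8, 10)
sstables = [
    {"name": "big-old",      "max_ts": datetime(2015, 7, 27)},
    {"name": "repair-small", "max_ts": datetime(2015, 7, 28)},  # flood lands here
    {"name": "recent",       "max_ts": datetime(2015, 8, 9)},
]
print([s["name"] for s in compaction_candidates(sstables, now)])  # ['recent']
```

Under this model the repair flood in July windows stays untouched no matter how many tiny files accumulate there, matching the SSTable counts in sstable_count_figure1.png.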
[jira] [Comment Edited] (CASSANDRA-10195) TWCS experiments and improvement proposals
[ https://issues.apache.org/jira/browse/CASSANDRA-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739091#comment-14739091 ] Philip Thompson edited comment on CASSANDRA-10195 at 9/10/15 5:31 PM:
--
I will start those tests now, but it will take a few days for them to run. Do you need me to set any special compaction options?

was (Author: philipthompson): I will start those tests now, but it will take a few days for them to run.
[jira] [Comment Edited] (CASSANDRA-10195) TWCS experiments and improvement proposals
[ https://issues.apache.org/jira/browse/CASSANDRA-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728151#comment-14728151 ] Jeff Jirsa edited comment on CASSANDRA-10195 at 9/2/15 10:36 PM:
--
[~philipthompson], thanks! Can you run this quick script on both (only looks at the latest logfiles; if logs have rotated, adjust accordingly):
{code}
export bytes_out=0
for i in `grep Compacted /var/log/cassandra/system.log | grep -v system | awk '{ print $15 }' | sed -e 's/,//g'`
do
  bytes_out=`echo $bytes_out + $i | bc`
done
echo "Total bytes compacted: $bytes_out"
{code}
(Basically just counting the total bytes compacted while running; excludes the system keyspace)

was (Author: jjirsa): [~philipthompson], thanks! Can you run this quick script on both:
{code}
export bytes_out=0
for i in `grep Compacted /var/log/cassandra/system.log | grep -v system | awk '{ print $15 }' | sed -e 's/,//g'`
do
  bytes_out=`echo $bytes_out + $i | bc`
done
echo "Total bytes compacted: $bytes_out"
{code}
(Basically just counting the total bytes compacted while running; excludes the system keyspace)
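Jeff's shell loop can be mirrored in a few lines of Python, which avoids spawning `bc` once per log line. The "15th whitespace-separated field holds the byte count" layout is carried over from his script as an assumption, not re-verified here; the demo lines below are fabricated to match that shape:

```python
def total_bytes_compacted(lines):
    """Sum the byte counts from 'Compacted' log lines, skipping the
    system keyspace, mirroring the grep | grep -v | awk pipeline."""
    total = 0
    for line in lines:
        if "Compacted" in line and "system" not in line:
            fields = line.split()
            total += int(fields[14].replace(",", ""))  # awk's $15, commas stripped
    return total

# Fabricated stand-in for /var/log/cassandra/system.log content:
demo = [
    "a b c d e f g h i j k Compacted m n 1,000",
    "a b c d e f g h i j k Compacted m n 2,500",
]
print("Total bytes compacted:", total_bytes_compacted(demo))  # 3500
```

In practice you would pass `open("/var/log/cassandra/system.log")` instead of the demo list.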
[jira] [Comment Edited] (CASSANDRA-10195) TWCS experiments and improvement proposals
[ https://issues.apache.org/jira/browse/CASSANDRA-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728032#comment-14728032 ] Philip Thompson edited comment on CASSANDRA-10195 at 9/2/15 9:23 PM:
--
I have been doing some single-node testing on an i2.2xlarge instance with both DTCS and TWCS. The DTCS node was running 2.1-HEAD at sha df9e798de6eb4. The TWCS node was running Jeff's twcs-2.1 branch at git sha 6119d119c21a. Using cassandra-stress, I loaded 4.5B rows into each node, resulting in about 1TB of data each. The stress output from loading was too large to graph, but here are the summaries:
DTCS:
{code}
Results:
op rate                   : 16010 [insert:16010]
partition rate            : 16010 [insert:16010]
row rate                  : 16010 [insert:16010]
latency mean              : 15.6 [insert:15.6]
latency median            : 4.0 [insert:4.0]
latency 95th percentile   : 21.2 [insert:21.2]
latency 99th percentile   : 59.1 [insert:59.1]
latency 99.9th percentile : 366.1 [insert:366.1]
latency max               : 12100.5 [insert:12100.5]
{code}
TWCS:
{code}
Results:
op rate                   : 16397 [insert:16397]
partition rate            : 16397 [insert:16397]
row rate                  : 16397 [insert:16397]
latency mean              : 15.2 [insert:15.2]
latency median            : 3.9 [insert:3.9]
latency 95th percentile   : 22.0 [insert:22.0]
latency 99th percentile   : 66.9 [insert:66.9]
latency 99.9th percentile : 1494.0 [insert:1494.0]
latency max               : 24187.3 [insert:24187.3]
{code}
The stress yaml profiles I used are available here:
https://gist.github.com/ptnapoleon/6e3f16049c756ba1e53a
https://gist.github.com/ptnapoleon/bdc9390edb0b004691bd
After loading the data, I gave the cluster some time to finish compactions. I then began a long-running mixed workload that I terminated after a day, in favor of a shorter 5M-op mixed workload at a 1:3 insert/read ratio.
Here are the results of that test, which show better performance from TWCS: http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=twcs_dtcs.brief.json&metric=op_rate&operation=1_MIXED&smoothing=1&show_aggregates=true&xmin=0&xmax=30877.88&ymin=0&ymax=761.2 The only non-default configuration change I made was to increase concurrent_compactors to 8 on each node. I still have both clusters up, and have a lot of log data collected. What additional compaction-specific tunings or workloads should be tested?

was (Author: philipthompson): I have been doing some single-node testing on an i2.2xlarge instance with both DTCS and TWCS. The DTCS node was running 2.1-HEAD at sha df9e798de6eb4. The TWCS node was running Jeff's twcs-2.1 branch at git sha 6119d119c21a. Using cassandra-stress, I loaded 4.5B rows into each node, resulting in about 1TB of data each. The stress yaml profiles I used are available here: https://gist.github.com/ptnapoleon/6e3f16049c756ba1e53a https://gist.github.com/ptnapoleon/bdc9390edb0b004691bd After loading the data, I gave the cluster some time to finish compactions. I then began a long-running mixed workload that I terminated after a day, in favor of a shorter 5M-op mixed workload at a 1:3 insert/read ratio. Here are the results of that test, which show better performance from TWCS: http://riptano.github.io/cassandra_performance/graph_v5/graph.html?stats=twcs_dtcs.brief.json&metric=op_rate&operation=1_MIXED&smoothing=1&show_aggregates=true&xmin=0&xmax=30877.88&ymin=0&ymax=761.2 The only non-default configuration change I made was to increase concurrent_compactors to 8 on each node. I still have both clusters up, and have a lot of log data collected. What additional compaction-specific tunings or workloads should be tested?
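A quick sanity check on the two load-phase summaries quoted in this comment: the insert op rates are nearly identical, while TWCS shows a noticeably heavier ingest tail. The numbers below are copied straight from those summaries:

```python
# Insert-phase cassandra-stress summaries (ops/sec, latencies in ms).
dtcs = {"op_rate": 16010, "p99": 59.1, "p999": 366.1, "max_ms": 12100.5}
twcs = {"op_rate": 16397, "p99": 66.9, "p999": 1494.0, "max_ms": 24187.3}

# Throughput difference during the load phase is small...
rate_delta_pct = 100.0 * (twcs["op_rate"] - dtcs["op_rate"]) / dtcs["op_rate"]
# ...but the 99.9th percentile during ingest is several times higher for TWCS.
p999_ratio = twcs["p999"] / dtcs["p999"]

print(f"op rate delta: {rate_delta_pct:.1f}%")          # ~2.4%
print(f"99.9th pct latency ratio: {p999_ratio:.1f}x")   # ~4.1x
```

So TWCS loaded slightly faster overall, at the cost of a longer latency tail; the mixed-workload graph linked above is where TWCS pulls ahead more clearly.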
[jira] [Comment Edited] (CASSANDRA-10195) TWCS experiments and improvement proposals
[ https://issues.apache.org/jira/browse/CASSANDRA-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713571#comment-14713571 ] Jeff Jirsa edited comment on CASSANDRA-10195 at 8/26/15 2:49 PM:
--
Hi Antti, I'm still not sure if TWCS is going to make it into mainline Cassandra, but thank you very much for testing it, and I'm very happy to hear that it worked well for this problem.
{quote}
1) The current time window would be determined from the newest time stamp found in SSTables. This allows the effective compaction of the SSTables when data is fed to the history in timely order. In the dumping process the time stamp of the column is set according to the time stamp of the data sample.
{quote}
Agreed, I'll be adding that shortly.
{quote}
2) The count of SSTables participating in one compaction could be limited either by the number of files given by max_threshold OR by the sum of the sizes of the files selected for the compaction bucket. The file size limitation would prevent combining large files together, potentially causing an out-of-disk-space situation or extremely long-lasting compaction tasks.
{quote}
One of the reasons I wrote a new compaction strategy instead of trying to modify DTCS is that I believe DTCS is very difficult for virtually anyone (regardless of experience) to configure properly. If I can find a way to add that parameter without losing the easy-to-understand and easy-to-reason-about nature of TWCS, I will consider it.
{quote}
3) Now time windows are handled one by one starting from the newest. This will not lead to the fastest decrease in SSTable count. An alternative might be a round-robin approach in which time windows are stepped through and only one compaction task for a given time window is done before moving to the next time window.
{quote}
I prioritized the highest window because presumably, if "now" is getting overrun, cleaning up old sstables isn't doing us any favors. It is true, however, that it's possible for old windows to have lots of tiny files that are easier to clean up than whatever may exist in the current window. I will consider the right way to approach this.
Again, there's no guarantee that the project will accept TWCS (at the moment, the person in charge of this part of the project prefers not to have 2 time/date compaction strategies, and that's not an unreasonable position), but I'm going to keep TWCS current and will continue rebasing it as needed, as I believe it is useful. If the time comes that CASSANDRA-9666 is closed as "Won't fix", I'll make the code available elsewhere.

was (Author: jjirsa): Hi Antti, I'm still not sure if TWCS is going to make it into mainline Cassandra, but thank you very much for testing it, and I'm very happy to hear that it worked well for this problem.
{quote}
1) The current time window would be determined from the newest time stamp found in SSTables. This allows the effective compaction of the SSTables when data is fed to the history in timely order. In the dumping process the time stamp of the column is set according to the time stamp of the data sample.
{quote}
Agreed, I'll be adding that shortly.
{quote}
2) The count of SSTables participating in one compaction could be limited either by the number of files given by max_threshold OR by the sum of the sizes of the files selected for the compaction bucket. The file size limitation would prevent combining large files together, potentially causing an out-of-disk-space situation or extremely long-lasting compaction tasks.
{quote}
One of the reasons I wrote a new compaction strategy instead of trying to modify DTCS is that I believe DTCS is very difficult for virtually anyone (regardless of experience) to configure properly. If I can find a way to add that parameter without losing the easy-to-understand and easy-to-reason-about nature of TWCS, I will consider it.
{quote}
3) Now time windows are handled one by one starting from the newest. This will not lead to the fastest decrease in SSTable count. An alternative might be a round-robin approach in which time windows are stepped through and only one compaction task for a given time window is done before moving to the next time window.
{quote}
I prioritized the highest window because presumably, if "now" is getting overrun, cleaning up old sstables isn't doing us any favors. It is true, however, that it's possible for old windows to have lots of tiny files that are easier to clean up than whatever may exist in the current window. I will consider the right way to approach this.
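The newest-first window selection Jeff describes, versus the round-robin alternative from the proposal, can be sketched with toy pseudocode. This is illustrative only, not the actual TWCS source; the bucketing helper and thresholds below are made up for the sketch:

```python
from collections import defaultdict

WINDOW_SECONDS = 24 * 3600  # one-day windows, an assumed setting
MIN_THRESHOLD = 4           # compact a window once it holds this many sstables

def bucket_by_window(sstables):
    """Group (epoch_ts, name) SSTables by the window their newest data falls in."""
    windows = defaultdict(list)
    for ts, name in sstables:
        windows[ts // WINDOW_SECONDS].append(name)
    return windows

def newest_first(windows):
    """Behavior Jeff describes: walk windows newest-first and return the
    first bucket eligible for compaction, so 'now' never gets overrun."""
    for w in sorted(windows, reverse=True):
        if len(windows[w]) >= MIN_THRESHOLD:
            return windows[w]
    return []

# Toy data: an old window flooded with tiny repair sstables, plus a newer window.
sstables = [(1000 * 86400 + i, f"old-{i}") for i in range(6)] + \
           [(1005 * 86400 + i, f"new-{i}") for i in range(4)]
win = bucket_by_window(sstables)
print(newest_first(win))  # the newest eligible window wins, not the flooded old one
```

A round-robin variant would instead remember the last window serviced and issue one compaction task per window before advancing, which drains flooded old windows sooner at the cost of letting the current window fall behind.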