[ https://issues.apache.org/jira/browse/CASSANDRA-10195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15008944#comment-15008944 ]
Philip Thompson commented on CASSANDRA-10195:
---------------------------------------------

[~krummas], results are finished for the first two tests. These are only at
~275GB of data, on one node, using cstar_perf.

[TWCS with jeffjirsa/twcs-2.1|http://cstar.datastax.com/tests/id/8aee77a2-888e-11e5-88b6-0256e416528f]
[DTCS with marcuse/9644|http://cstar.datastax.com/tests/id/8e17c1c2-888e-11e5-88b6-0256e416528f]
[DTCS with marcuse/9644_2|http://cstar.datastax.com/tests/id/ac416862-8c92-11e5-b5e7-0256e416528f] is still running.

As you can see from the first two graphs, load and read performance were
roughly equivalent. Each of these lower-density tests takes about 36-48 hours
to run, so let me know if you need any more after the last one finishes.

> TWCS experiments and improvement proposals
> ------------------------------------------
>
>                 Key: CASSANDRA-10195
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10195
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Antti Nissinen
>             Fix For: 2.1.x, 2.2.x
>
>         Attachments: 20150814_1027_compaction_hierarchy.txt,
> node0_20150727_1250_time_graph.txt, node0_20150810_1017_time_graph.txt,
> node0_20150812_1531_time_graph.txt, node0_20150813_0835_time_graph.txt,
> node0_20150814_1054_time_graph.txt, node1_20150727_1250_time_graph.txt,
> node1_20150810_1017_time_graph.txt, node1_20150812_1531_time_graph.txt,
> node1_20150813_0835_time_graph.txt, node1_20150814_1054_time_graph.txt,
> node2_20150727_1250_time_graph.txt, node2_20150810_1017_time_graph.txt,
> node2_20150812_1531_time_graph.txt, node2_20150813_0835_time_graph.txt,
> node2_20150814_1054_time_graph.txt, sstable_count_figure1.png,
> sstable_count_figure2.png
>
>
> This JIRA item describes experiments with DateTieredCompactionStrategy (DTCS)
> and TimeWindowCompactionStrategy (TWCS) and proposes modifications to TWCS.
> In a test system several crashes were caused intentionally (and
> unintentionally) and repair operations were executed, leading to a flood of
> small SSTables. The target was to be able to compact those files and release
> the disk space occupied by duplicate data. The setup is the following:
> - Three nodes
> - DateTieredCompactionStrategy, max_sstable_age_days = 5
> - Cassandra 2.1.2
> The setup and data format have been documented in detail in
> https://issues.apache.org/jira/browse/CASSANDRA-9644.
> The test was started by dumping a few days' worth of data into the database
> for 100 000 signals. The time graphs of SSTables from the different nodes
> indicate that DTCS has been working as expected and the SSTables are nicely
> ordered time-wise.
> See files:
> node0_20150727_1250_time_graph.txt
> node1_20150727_1250_time_graph.txt
> node2_20150727_1250_time_graph.txt
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens  Owns  Host ID                               Rack
> UN  139.66.43.170  188.87 GB  256     ?     dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  139.66.43.169  198.37 GB  256     ?     12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  139.66.43.168  191.88 GB  256     ?     26088392-f803-4d59-9073-c75f857fb332  rack1
> All nodes crashed due to a power failure (known beforehand) and repair
> operations were started for each node, one at a time. Below is the behavior
> of the SSTable count on the different nodes. New data was dumped
> simultaneously with the repair operations.
> SEE FIGURE: sstable_count_figure1.png
> The vertical lines indicate the following events:
> 1) The cluster was down due to the power shutdown and was restarted. At the
> first vertical line the repair operation (nodetool repair -pr) was started
> for the first node.
> 2) The repair of the second node was started after the first node was
> successfully repaired.
> 3) The repair of the third node was started.
> 4) The third repair operation finished.
> 5) One of the nodes crashed (unknown reason at the OS level).
> 6) A repair operation (nodetool repair -pr) was started for the first node.
> 7) The repair operation for the second node was started.
> 8) The repair operation for the third node was started.
> 9) The repair operations finished.
> These repair operations led to a huge number of small SSTables covering the
> whole time span of the data. The compaction horizon of DTCS was limited to
> 5 days (max_sstable_age_days) due to the size of the SSTables on disk.
> Therefore, the small SSTables will not be compacted. Below are the time
> graphs of the SSTables after the second round of repairs.
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens  Owns  Host ID                               Rack
> UN  xx.xx.xx.170   663.61 GB  256     ?     dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  xx.xx.xx.169   763.52 GB  256     ?     12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  xx.xx.xx.168   651.59 GB  256     ?     26088392-f803-4d59-9073-c75f857fb332  rack1
> See files:
> node0_20150810_1017_time_graph.txt
> node1_20150810_1017_time_graph.txt
> node2_20150810_1017_time_graph.txt
> To get rid of the small SSTables, TimeWindowCompactionStrategy was taken into
> use. The window size was set to 5 days and the Cassandra version was updated
> to 2.1.8. The figure below shows the behavior of the SSTable count. TWCS was
> taken into use on 10.8.2015 at 13:10. The maximum number of files to be
> compacted in one task was limited to 32 to avoid running out of disk space.
> See Figure: sstable_count_figure2.png
> The shape of the trend clearly shows how the selection of SSTables for
> buckets based on size affects the progress. Combining files gets slower as
> the files inside the time window grow. When a time window has no more
> compactions to be done, the next time window is started. Combining small
> files is again fast and the number of SSTables decreases quickly. Below are
> the time graphs for the SSTables when the compactions with TWCS had finished.
> New data was not dumped simultaneously with the compactions.
> See files:
> node0_20150812_1531_time_graph.txt
> node1_20150812_1531_time_graph.txt
> node2_20150812_1531_time_graph.txt
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens  Owns  Host ID                               Rack
> UN  xx.xx.xx.170   436.17 GB  256     ?     dfc29863-c935-4909-9d7f-c59a47eda03d  rack1
> UN  xx.xx.xx.169   454.96 GB  256     ?     12e7628b-7f05-48f6-b7e4-35a82010021a  rack1
> UN  xx.xx.xx.168   439.13 GB  256     ?     26088392-f803-4d59-9073-c75f857fb332  rack1
> Data dumping was activated again and the SSTable statistics were observed
> again on the next morning.
> See files:
> node0_20150813_0835_time_graph.txt
> node1_20150813_0835_time_graph.txt
> node2_20150813_0835_time_graph.txt
> Since the data was dumped into the history, the newest data did not fall into
> the current time window, which is determined from the system time. Because
> new small SSTables (approximately 30-50 MB in size) were appearing
> continuously, the compaction ended up compacting one large SSTable together
> with several small files. The code was modified so that the current time is
> determined from the newest timestamp in the SSTables (as in DTCS). This
> modification led to much more reasonable compaction behavior for the case
> when historical data is pushed to the database.
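A minimal sketch of the idea behind that modification, for illustration only:
instead of anchoring the current time window to the wall clock, "now" is
derived from the newest maximum timestamp across the candidate SSTables, and
each SSTable is then bucketed by the window its max timestamp falls into. The
SSTableInfo class and all names below are hypothetical stand-ins rather than
Cassandra's actual API or the modified TWCS code, and timestamps are
simplified to milliseconds.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collection;
    import java.util.List;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.concurrent.TimeUnit;

    final class WindowFromNewestTimestamp
    {
        // Hypothetical stand-in for an SSTable; only its newest cell timestamp is needed here.
        static final class SSTableInfo
        {
            final String name;
            final long maxTimestampMillis;

            SSTableInfo(String name, long maxTimestampMillis)
            {
                this.name = name;
                this.maxTimestampMillis = maxTimestampMillis;
            }
        }

        static final long WINDOW_MILLIS = TimeUnit.DAYS.toMillis(5); // 5-day window, as in the test

        // "Now" is taken from the newest SSTable rather than System.currentTimeMillis(),
        // so the current window follows the data being written even when history is dumped.
        static long deriveNow(Collection<SSTableInfo> sstables)
        {
            long now = Long.MIN_VALUE;
            for (SSTableInfo s : sstables)
                now = Math.max(now, s.maxTimestampMillis);
            return now;
        }

        // Group SSTables into time windows keyed by the window start (oldest window first).
        static SortedMap<Long, List<SSTableInfo>> bucketByWindow(Collection<SSTableInfo> sstables)
        {
            SortedMap<Long, List<SSTableInfo>> buckets = new TreeMap<>();
            for (SSTableInfo s : sstables)
            {
                long windowStart = (s.maxTimestampMillis / WINDOW_MILLIS) * WINDOW_MILLIS;
                buckets.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(s);
            }
            return buckets;
        }

        public static void main(String[] args)
        {
            List<SSTableInfo> sstables = Arrays.asList(
                new SSTableInfo("old-big-sstable", 1438000000000L),   // late July 2015
                new SSTableInfo("repair-fragment", 1438000500000L),
                new SSTableInfo("newest-sstable",  1439540000000L));  // mid August 2015

            long now = deriveNow(sstables);
            long currentWindowStart = (now / WINDOW_MILLIS) * WINDOW_MILLIS;
            System.out.println("current window starts at " + currentWindowStart);
            System.out.println("window buckets: " + bucketByWindow(sstables).keySet());
        }
    }

With the wall-clock version, SSTables written during a historical dump never
land in the "current" window, which is why the one-large-plus-many-small
compactions described above were observed.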
> Below are the time graphs from the nodes after one day. Size-tiered
> compaction was now able to work with the newest files as intended while data
> was being dumped in real time.
> See files:
> node0_20150814_1054_time_graph.txt
> node1_20150814_1054_time_graph.txt
> node2_20150814_1054_time_graph.txt
> The change in behavior is clearly visible in the compaction hierarchy graph
> below. The TWCS modification is visible starting from line 39. See the
> description of the file format in
> https://issues.apache.org/jira/browse/CASSANDRA-9644.
> See file: 20150814_1027_compaction_hierarchy.txt
> The behavior of TWCS looks really promising and also works in practice! We
> would like to propose some ideas for the future development of the algorithm:
> 1) The current time window would be determined from the newest timestamp
> found in the SSTables. This allows effective compaction of the SSTables when
> data is fed into the history in time order. In the dumping process the
> timestamp of the column is set according to the timestamp of the data sample.
> 2) The number of SSTables participating in one compaction could be limited
> either by the number of files given by max_threshold OR by the total size of
> the files selected for the compaction bucket (see the sketch after this
> list). The file size limitation would prevent combining large files together,
> which could otherwise cause an out-of-disk-space situation or extremely
> long-lasting compaction tasks.
> 3) Currently the time windows are handled one by one, starting from the
> newest. This does not lead to the fastest decrease in SSTable count. An
> alternative might be a round-robin approach in which the time windows are
> stepped through, only one compaction task is done for a given time window,
> and then the process moves on to the next time window.
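To make proposal 2 concrete, here is a minimal sketch of how a compaction
bucket could be trimmed before a task is submitted, capped both by file count
and by cumulative on-disk size. The class and parameter names (Candidate,
maxThreshold, maxBucketBytes) are hypothetical illustrations, not existing
TWCS code or options.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.List;

    final class BucketTrimmer
    {
        // Hypothetical candidate SSTable within one time window; only its on-disk size matters here.
        static final class Candidate
        {
            final String name;
            final long onDiskBytes;

            Candidate(String name, long onDiskBytes)
            {
                this.name = name;
                this.onDiskBytes = onDiskBytes;
            }
        }

        // Pick SSTables for a single compaction task, smallest first, stopping when either
        // the file-count cap (max_threshold) or the cumulative-size cap would be exceeded.
        // The size cap keeps one task from combining several huge files, which risks running
        // out of disk space or producing an extremely long-running compaction.
        static List<Candidate> trim(List<Candidate> bucket, int maxThreshold, long maxBucketBytes)
        {
            List<Candidate> sorted = new ArrayList<>(bucket);
            sorted.sort(Comparator.comparingLong((Candidate c) -> c.onDiskBytes));

            List<Candidate> selected = new ArrayList<>();
            long totalBytes = 0;
            for (Candidate c : sorted)
            {
                if (selected.size() >= maxThreshold || totalBytes + c.onDiskBytes > maxBucketBytes)
                    break;
                selected.add(c);
                totalBytes += c.onDiskBytes;
            }
            return selected;
        }

        public static void main(String[] args)
        {
            List<Candidate> bucket = Arrays.asList(
                new Candidate("big",    200L * 1024 * 1024 * 1024), // 200 GB
                new Candidate("small1", 40L * 1024 * 1024),         // 40 MB
                new Candidate("small2", 35L * 1024 * 1024),
                new Candidate("small3", 50L * 1024 * 1024));

            // Cap at 32 files or 100 GB per task; the 200 GB file is left for a later task.
            List<Candidate> task = trim(bucket, 32, 100L * 1024 * 1024 * 1024);
            System.out.println(task.size() + " SSTables selected for this compaction task");
        }
    }

A round-robin variant along the lines of proposal 3 would take one such
trimmed task per time window and then move on to the next window, instead of
draining one window completely before touching the next.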
> Side note: while observing the compaction process it appears that compaction
> intermittently uses two threads. However, sometimes during a long-lasting
> compaction task (hours) another thread was not kicking in to work on the
> small SSTables, even though there were thousands of them available for
> compaction.
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)