[ https://issues.apache.org/jira/browse/CASSANDRA-12200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376343#comment-15376343 ]

Jeff Jirsa commented on CASSANDRA-12200:
----------------------------------------

Solution here is likely implementing a PriorityQueue for compaction, as discussed in CASSANDRA-11218, and then prioritizing anticompaction in the same manner we need to prioritize index builds, user-defined compaction, etc.

> Backlogged compactions can make repair on trivially small tables wait for a long time to finish
> ------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-12200
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12200
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Wei Deng
>
> In C* 3.0 we started to use incremental repair by default. However, this seems to create a repair performance problem if you have a relatively write-heavy workload that keeps all available concurrent_compactors busy with active compactions.
> I was able to demonstrate this issue with the following scenario:
> 1. On a three-node C* 3.0.7 cluster, use "cassandra-stress write n=100000000" to generate 100GB of data in the keyspace1.standard1 table using LCS (ctrl+c the stress client once the data size on each node reaches 35+GB).
> 2. At this point there will be hundreds of L0 SSTables waiting for LCS to digest on each node, and with concurrent_compactors at its default of 2, the two compaction threads are constantly busy processing the backlogged L0 SSTables.
> 3. Now create a new keyspace called "trivial_ks" with RF=3, create a small two-column CQL table in it, and insert 6 records.
> 4. Start a "nodetool repair trivial_ks" session on one of the nodes and watch the following behavior:
> {noformat}
> automaton@wdengdse50google-98425b985-3:~$ nodetool repair trivial_ks
> [2016-07-13 01:57:28,364] Starting repair command #1, repairing keyspace trivial_ks with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3)
> [2016-07-13 01:57:31,027] Repair session 27212dd0-489d-11e6-a6d6-cd06faa0aaa2 for range [(3074457345618258602,-9223372036854775808], (-9223372036854775808,-3074457345618258603], (-3074457345618258603,3074457345618258602]] finished (progress: 66%)
> [2016-07-13 02:07:47,637] Repair completed successfully
> [2016-07-13 02:07:47,657] Repair command #1 finished in 10 minutes 19 seconds
> {noformat}
> Basically, for such a small table it took 10+ minutes to finish the repair.
> Looking at debug.log for this particular repair session UUID, you will find that all nodes were able to pass through validation compaction within 15ms, but one of the nodes got stuck waiting for a compaction slot, because it has to run an anti-compaction step before it can finally tell the initiating node that it is done with its part of the repair session. It took 10+ minutes for a compaction slot to free up, as shown in the following debug.log entries:
> {noformat}
> DEBUG [AntiEntropyStage:1] 2016-07-13 01:57:30,956 RepairMessageVerbHandler.java:149 - Got anticompaction request AnticompactionRequest{parentRepairSession=27103de0-489d-11e6-a6d6-cd06faa0aaa2} org.apache.cassandra.repair.messages.AnticompactionRequest@34449ff4
> <...>
> <snip>
> <...>
> DEBUG [CompactionExecutor:5] 2016-07-13 02:07:47,506 CompactionTask.java:217 - Compacted (286609e0-489d-11e6-9e03-1fd69c5ec46c) 32 sstables to [/var/lib/cassandra/data/keyspace1/standard1-9c02e9c1487c11e6b9161dbd340a212f/mb-499-big,] to level=0. 2,892,058,050 bytes to 2,874,333,820 (~99% of original) in 616,880ms = 4.443617MB/s. 0 total partitions merged to 12,233,340. Partition merge counts were {1:12086760, 2:146580, }
> INFO  [CompactionExecutor:5] 2016-07-13 02:07:47,512 CompactionManager.java:511 - Starting anticompaction for trivial_ks.weitest on 1/[BigTableReader(path='/var/lib/cassandra/data/trivial_ks/weitest-538b07d1489b11e6a9ef61c6ff848952/mb-1-big-Data.db')] sstables
> INFO  [CompactionExecutor:5] 2016-07-13 02:07:47,513 CompactionManager.java:540 - SSTable BigTableReader(path='/var/lib/cassandra/data/trivial_ks/weitest-538b07d1489b11e6a9ef61c6ff848952/mb-1-big-Data.db') fully contained in range (-9223372036854775808,-9223372036854775808], mutating repairedAt instead of anticompacting
> INFO  [CompactionExecutor:5] 2016-07-13 02:07:47,570 CompactionManager.java:578 - Completed anticompaction successfully
> {noformat}
> Since validation compaction has its own threads outside of the regular compaction thread pool restricted by concurrent_compactors, we were able to pass through validation compaction without any issue. If we could treat anti-compaction the same way (i.e. give it its own thread pool), we could avoid this kind of repair performance problem.
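To make the PriorityQueue approach from the comment above concrete, here is a minimal, hypothetical sketch. The class, method, and enum names are invented for illustration and are not Cassandra's actual CompactionManager/CompactionExecutor API; it only shows the idea of a fixed-size pool, still bounded by concurrent_compactors, whose work queue is ordered by task priority so a pending anticompaction is picked up ahead of backlogged L0 compactions as soon as a thread frees up.

{code:java}
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only -- not Cassandra's actual compaction executor.
// The work queue is ordered by task priority instead of FIFO, so an
// anticompaction submitted behind hundreds of backlogged L0 compactions
// runs as soon as one of the concurrent_compactors threads frees up,
// rather than after the whole backlog drains.
public class PrioritizedCompactionExecutor
{
    // Lower ordinal = higher priority (an assumed ordering for this sketch).
    public enum TaskType { ANTICOMPACTION, INDEX_BUILD, USER_DEFINED, NORMAL }

    public static final class PrioritizedTask implements Runnable, Comparable<PrioritizedTask>
    {
        private final TaskType type;
        private final Runnable work;

        public PrioritizedTask(TaskType type, Runnable work)
        {
            this.type = type;
            this.work = work;
        }

        public void run()
        {
            work.run();
        }

        public int compareTo(PrioritizedTask other)
        {
            return Integer.compare(type.ordinal(), other.type.ordinal());
        }
    }

    private final ThreadPoolExecutor executor;

    public PrioritizedCompactionExecutor(int concurrentCompactors)
    {
        // Same fixed-size pool as today (bounded by concurrent_compactors);
        // only the queue ordering changes. A running compaction is never
        // preempted -- a higher-priority task still waits for a free slot.
        executor = new ThreadPoolExecutor(
                concurrentCompactors, concurrentCompactors,
                60, TimeUnit.SECONDS,
                new PriorityBlockingQueue<Runnable>());
    }

    public void submit(TaskType type, Runnable work)
    {
        // execute() rather than submit(), so the queue holds our Comparable
        // task directly instead of a non-comparable FutureTask wrapper.
        executor.execute(new PrioritizedTask(type, work));
    }
}
{code}

The reporter's alternative (a dedicated anticompaction thread pool, mirroring how validation compaction already runs outside the pool bounded by concurrent_compactors) would sidestep queue ordering entirely, though it could allow more concurrent compaction work than concurrent_compactors nominally permits.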