[ https://issues.apache.org/jira/browse/CASSANDRA-13948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210935#comment-16210935 ]
Paulo Motta commented on CASSANDRA-13948: ----------------------------------------- I think I was able to get to the bottom of this issue with the help of [~dkinder]'s logs It turns out that the {{ColumnFamilyStore}} (and subsequently the {{CompactionStrategyManager}}) is initialized before gossip is settled, so the node's local ranges are not properly computed during startup, causing the computed disk boundaries to not match the actual boundaries. This happens because {{GossipingPropertyFileSnitch}} fallbacks to the {{FilePropertySnitch}}, so when a node is not found is gossip it will pick the DCs/racks from the {{cassandra-topology.properties}} file, and if it's not defined there it will use the {{DEFAULT}} dc/rack and mess up the local ranges and disk boundary computation. The log lines below show the following steps: * {{GossipingPropertyFileSnitch}} falling back to {{PropertyFileSnitch}} * Compactions on system keyspaces being scheduled on the same disk, while compaction on user keyspaces being run on SSTables from different disks, indicating the disk boundaries were not calculated correctly for NTS keyspaces * After gossip settles, lot's of {{SSTable from level 0 is not on corresponding level in the leveled manifest}} warnings, indicating the disk boundary layout changed after gossip settled but the compaction strategies were not reloaded with the new layout {code:none} INFO [main] 2017-10-12 14:39:32,928 GossipingPropertyFileSnitch.java:64 - Loaded cassandra-topology.properties for compatibility DEBUG [CompactionExecutor:26] 2017-10-12 15:02:09,442 CompactionTask.java:155 - Compacting (fe490ea1-af98-11e7-b5a1-57bcefdac924) [/srv/disk6/cassandra-data/walker/domain_info/.domain_info_claim_tok_idx/mc-18386-big-Data.db:level=0, /srv/disk9/cassandra-data/walker/domain_info/.domain_info_claim_tok_idx/mc-18387-big-Data.db:level=0, ] DEBUG [CompactionExecutor:31] 2017-10-12 15:02:09,442 CompactionTask.java:155 - Compacting (fe490ea0-af98-11e7-b5a1-57bcefdac924) [/srv/disk10/cassandra-data/system/peers/mc-1671-big-Data.db:level=0, /srv/disk10/cassandra-data/system/peers/mc-1659-big-Data.db:level=0, /srv/disk10/cassandra-data/system/peers/mc-1656-big-Data.db:level=0, /srv/disk10/cassandra-data/system/peers/mc-1690-big-Data.db:level=0, ] DEBUG [CompactionExecutor:17] 2017-10-12 15:02:09,442 CompactionTask.java:155 - Compacting (fe490ea8-af98-11e7-b5a1-57bcefdac924) [/srv/disk5/cassandra-data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-2625-big-Data.db:level=0, /srv/disk5/cassandra-data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-2605-big-Data.db:level=0, /srv/disk5/cassandra-data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-2597-big-Data.db:level=0, /srv/disk5/cassandra-data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/mc-2617-big-Data.db:level=0, ] DEBUG [CompactionExecutor:29] 2017-10-12 15:02:09,952 CompactionTask.java:155 - Compacting (fe9a8a00-af98-11e7-b5a1-57bcefdac924) [/srv/disk11/cassandra-data/walker/domain_info/.domain_info_dispatched_idx/mc-18474-big-Data.db:level=0, /srv/disk3/cassandra-data/walker/domain_info/.domain_info_dispatched_idx/mc-18473-big-Data.db:level=0, ] DEBUG [CompactionExecutor:18] 2017-10-12 15:04:08,939 CompactionTask.java:155 - Compacting (45865ca0-af99-11e7-b5a1-57bcefdac924) [/srv/disk8/cassandra-data/walker/links/mc-6571323-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566335-big-Data.db:level=1, /srv/disk7/cassandra-data/walker/links/mc-6566315-big-Data.db:level=1, /srv/disk7/cassandra-data/walker/links/mc-6566361-big-Data.db:level=1, /srv/disk8/cassandra-data/walker/links/mc-6571043-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566330-big-Data.db:level=1, /srv/disk7/cassandra-data/walker/links/mc-6566322-big-Data.db:level=1, /srv/disk8/cassandra-data/walker/links/mc-6571261-big-Data.db:level=0, /srv/disk8/cassandra-data/walker/links/mc-6571335-big-Data.db:level=0, /srv/disk8/cassandra-data/walker/links/mc-6570145-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566346-big-Data.db:level=1, /srv/disk8/cassandra-data/walker/links/mc-6571249-big-Data.db:level=0, /srv/disk8/cassandra-data/walker/links/mc-6571311-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566307-big-Data.db:level=1, /srv/disk8/cassandra-data/walker/links/mc-6571285-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566341-big-Data.db:level=1, /srv/disk8/cassandra-data/walker/links/mc-6570765-big-Data.db:level=0, /srv/disk8/cassandra-data/walker/links/mc-6571299-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566352-big-Data.db:level=1, ] DEBUG [CompactionExecutor:30] 2017-10-12 15:04:08,866 DiskBoundaryManager.java:69 - Cached ring version 84 != current ring version 86 DEBUG [CompactionExecutor:30] 2017-10-12 15:04:08,867 DiskBoundaryManager.java:83 - Refreshing disk boundary cache for walker.links WARN [CompactionExecutor:18] 2017-10-12 15:04:08,888 LeveledCompactionStrategy.java:140 - Could not acquire references for compacting SSTables [BigTableReader(path='/srv/disk2/cassandra-data/walke r/links/mc-6566879-big-Data.db')] which is not a problem per se,unless it happens frequently, in which case it must be reported. Will retry later. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,892 LeveledCompactionStrategy.java:140 - Could not acquire references for compacting SSTables [BigTableReader(path='/srv/disk11/cassandra-data/walk er/links/mc-6566381-big-Data.db')] which is not a problem per se,unless it happens frequently, in which case it must be reported. Will retry later. DEBUG [CompactionExecutor:30] 2017-10-12 15:04:08,939 CompactionTask.java:255 - Compacted (2f095f90-af99-11e7-b5a1-57bcefdac924) 1 sstables to [/srv/disk1/cassandra-data/walker/links/mc-6571348-big ,] to level=4. 271.077MiB to 271.077MiB (~100% of original) in 37,729ms. Read Throughput = 7.185MiB/s, Write Throughput = 7.185MiB/s, Row Throughput = ~116,380/s. 741 total partitions merged to 741. Partition merge counts were {1:741, } DEBUG [CompactionExecutor:18] 2017-10-12 15:04:08,939 CompactionTask.java:155 - Compacting (45865ca0-af99-11e7-b5a1-57bcefdac924) [/srv/disk8/cassandra-data/walker/links/mc-6571323-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566335-big-Data.db:level=1, /srv/disk7/cassandra-data/walker/links/mc-6566315-big-Data.db:level=1, /srv/disk7/cassandra-data/walker/links/mc-6566361-big-Data.db:level=1, /srv/disk8/cassandra-data/walker/links/mc-6571043-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566330-big-Data.db:level=1, /srv/disk7/cassandra-data/walker/links/mc-6566322-big-Data.db:level=1, /srv/disk8/cassandra-data/walker/links/mc-6571261-big-Data.db:level=0, /srv/disk8/cassandra-data/walker/links/mc-6571335-big-Data.db:level=0, /srv/disk8/cassandra-data/walker/links/mc-6570145-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566346-big-Data.db:level=1, /srv/disk8/cassandra-data/walker/links/mc-6571249-big-Data.db:level=0, /srv/disk8/cassandra-data/walker/links/mc-6571311-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566307-big-Data.db:level=1, /srv/disk8/cassandra-data/walker/links/mc-6571285-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566341-big-Data.db:level=1, /srv/disk8/cassandra-data/walker/links/mc-6570765-big-Data.db:level=0, /srv/disk8/cassandra-data/walker/links/mc-6571299-big-Data.db:level=0, /srv/disk7/cassandra-data/walker/links/mc-6566352-big-Data.db:level=1, ] WARN [CompactionExecutor:18] 2017-10-12 15:04:08,941 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk8/cassandra-data/walker/links/mc-6571323-big-Data.db from level 0 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,942 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk7/cassandra-data/walker/links/mc-6566335-big-Data.db from level 1 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,942 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk7/cassandra-data/walker/links/mc-6566315-big-Data.db from level 1 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,942 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk7/cassandra-data/walker/links/mc-6566361-big-Data.db from level 1 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,942 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk7/cassandra-data/walker/links/mc-6566330-big-Data.db from level 1 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,942 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk7/cassandra-data/walker/links/mc-6566322-big-Data.db from level 1 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,942 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk8/cassandra-data/walker/links/mc-6571261-big-Data.db from level 0 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,942 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk8/cassandra-data/walker/links/mc-6571335-big-Data.db from level 0 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,942 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk7/cassandra-data/walker/links/mc-6566346-big-Data.db from level 1 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,942 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk8/cassandra-data/walker/links/mc-6571311-big-Data.db from level 0 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,942 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk7/cassandra-data/walker/links/mc-6566307-big-Data.db from level 1 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,943 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk8/cassandra-data/walker/links/mc-6571285-big-Data.db from level 0 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,943 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk7/cassandra-data/walker/links/mc-6566341-big-Data.db from level 1 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. WARN [CompactionExecutor:18] 2017-10-12 15:04:08,943 LeveledCompactionStrategy.java:274 - Live sstable /srv/disk7/cassandra-data/walker/links/mc-6566352-big-Data.db from level 1 is not on corresponding level in the leveled manifest. This is not a problem per se, but may indicate an orphaned sstable due to a failed compaction not cleaned up properly. {code} >From the above I draw the following conclusions: 1. We should wait for gossip to settle before starting compactions, since the disk boundaries may be dependent on gossip when any gossiping snitch is used. (done on [this commit|https://github.com/pauloricardomg/cassandra/commit/d5bd9eab53ab423dde024400c299f01e335cdf4c]) 2. We should reload the compaction strategies when the disk layout changes, so SSTables are correctly mapped to their corresponding compaction strategies. (done on [this commit|https://github.com/pauloricardomg/cassandra/commit/073bfa4e548ac807b523a919288a5a71379bfd21], in addition to testing and some other simplifications on {{CompactionStrategyManager}}) 3. A race when acquiring references on {{CompactionStrategyManager.getNextBackgroundTask}} is quite common when there are multiple concurrent compactions, since there will be parallel flushes on multiple disks, so rather than removing the {{while true}} loop to acquire references, I updated the compaction strategies to stop when the current candidate is the same as before, indicating there is a race with the tracker ([commit|https://github.com/pauloricardomg/cassandra/commit/3f971daa48036f14ea7aba7fd1d56150e78415b3]). I will work on dtests to simulate the above case as well as test disk boundary refreshing, but this [branch|https://github.com/pauloricardomg/cassandra/tree/3.11-13948] should be ready for a first round of review (cc [~krummas]). I will submit a round of unit and dtests on internal CI and will post the results here when ready. > Avoid deadlock when not able to acquire references for compaction > ----------------------------------------------------------------- > > Key: CASSANDRA-13948 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13948 > Project: Cassandra > Issue Type: Bug > Components: Compaction > Reporter: Paulo Motta > Assignee: Paulo Motta > Fix For: 3.11.x, 4.x > > Attachments: debug.log > > > The thread dump below shows a race between an sstable replacement by the > {{IndexSummaryRedistribution}} and > {{AbstractCompactionTask.getNextBackgroundTask}}: > {noformat} > Thread 94580: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, > line=175 (Compiled frame) > - > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt() > @bci=1, line=836 (Compiled frame) > - > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.util.concurrent.locks.AbstractQueuedSynchronizer$Node, > int) @bci=67, line=870 (Compiled frame) > - java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(int) > @bci=17, line=1199 (Compiled frame) > - java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock() @bci=5, > line=943 (Compiled frame) > - > org.apache.cassandra.db.compaction.CompactionStrategyManager.handleListChangedNotification(java.lang.Iterable, > java.lang.Iterable) @bci=359, line=483 (Interpreted frame) > - > org.apache.cassandra.db.compaction.CompactionStrategyManager.handleNotification(org.apache.cassandra.notifications.INotification, > java.lang.Object) @bci=53, line=555 (Interpreted frame) > - > org.apache.cassandra.db.lifecycle.Tracker.notifySSTablesChanged(java.util.Collection, > java.util.Collection, org.apache.cassandra.db.compaction.OperationType, > java.lang.Throwable) @bci=50, line=409 (Interpreted frame) > - > org.apache.cassandra.db.lifecycle.LifecycleTransaction.doCommit(java.lang.Throwable) > @bci=157, line=227 (Interpreted frame) > - > org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.commit(java.lang.Throwable) > @bci=61, line=116 (Compiled frame) > - > org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.commit() > @bci=2, line=200 (Interpreted frame) > - > org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish() > @bci=5, line=185 (Interpreted frame) > - > org.apache.cassandra.io.sstable.IndexSummaryRedistribution.redistributeSummaries() > @bci=559, line=130 (Interpreted frame) > - > org.apache.cassandra.db.compaction.CompactionManager.runIndexSummaryRedistribution(org.apache.cassandra.io.sstable.IndexSummaryRedistribution) > @bci=9, line=1420 (Interpreted frame) > - > org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries(org.apache.cassandra.io.sstable.IndexSummaryRedistribution) > @bci=4, line=250 (Interpreted frame) > - > org.apache.cassandra.io.sstable.IndexSummaryManager.redistributeSummaries() > @bci=30, line=228 (Interpreted frame) > - org.apache.cassandra.io.sstable.IndexSummaryManager$1.runMayThrow() > @bci=4, line=125 (Interpreted frame) > - org.apache.cassandra.utils.WrappedRunnable.run() @bci=1, line=28 > (Interpreted frame) > - > org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$UncomplainingRunnable.run() > @bci=4, line=118 (Compiled frame) > - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 > (Compiled frame) > - java.util.concurrent.FutureTask.runAndReset() @bci=47, line=308 (Compiled > frame) > - > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask) > @bci=1, line=180 (Compiled frame) > - java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run() > @bci=37, line=294 (Compiled frame) > - > java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) > @bci=95, line=1149 (Compiled frame) > - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 > (Interpreted frame) > - > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(java.lang.Runnable) > @bci=1, line=81 (Interpreted frame) > - org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$8.run() @bci=4 > (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=748 (Compiled frame) > {noformat} > {noformat} > Thread 94573: (state = IN_JAVA) > - java.util.HashMap$HashIterator.nextNode() @bci=95, line=1441 (Compiled > frame; information may be imprecise) > - java.util.HashMap$KeyIterator.next() @bci=1, line=1461 (Compiled frame) > - > org.apache.cassandra.db.lifecycle.View$3.apply(org.apache.cassandra.db.lifecycle.View) > @bci=20, line=268 (Compiled frame) > - org.apache.cassandra.db.lifecycle.View$3.apply(java.lang.Object) @bci=5, > line=265 (Compiled frame) > - > org.apache.cassandra.db.lifecycle.Tracker.apply(com.google.common.base.Predicate, > com.google.common.base.Function) @bci=13, line=133 (Compiled frame) > - org.apache.cassandra.db.lifecycle.Tracker.tryModify(java.lang.Iterable, > org.apache.cassandra.db.compaction.OperationType) @bci=31, line=99 (Compiled > frame) > - > org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(int) > @bci=84, line=139 (Compiled frame) > - > org.apache.cassandra.db.compaction.CompactionStrategyManager.getNextBackgroundTask(int) > @bci=105, line=119 (Interpreted frame) > - > org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run() > @bci=84, line=265 (Interpreted frame) > - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 > (Compiled frame) > - java.util.concurrent.FutureTask.run() @bci=42, line=266 (Compiled frame) > - > java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) > @bci=95, line=1149 (Compiled frame) > - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=624 > (Interpreted frame) > - > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(java.lang.Runnable) > @bci=1, line=81 (Interpreted frame) > - org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$8.run() @bci=4 > (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=748 (Compiled frame) > {noformat} > This particular node remain in this state forever, indicating > {{LeveledCompactionStrategyTask.getNextBackgroundTask}} was looping > indefinitely. > What happened is that sstable references were replaced on the tracker by the > {{IndexSummaryRedistribution}} thread, so the > {{AbstractCompactionStrategy.getNextBackgroundTask}} could not create the > transaction with the old references, and the {{IndexSummaryRedistribution}} > could not update the sstable reference in the compaction strategy because > {{AbstractCompactionStrategy.getNextBackgroundTask}} was holding the > {{CompactionStrategyManager}} lock. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org