[ https://issues.apache.org/jira/browse/CASSANDRA-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haoze Wu updated CASSANDRA-17564: --------------------------------- Description: We have been testing Cassandra 3.11.10 for a while. During a node start, we found that a synchronization guarantee implied by the code comments is not enforced. Specifically, in the `invalidate` method called in this call stack (in version 3.11.10): {code:java} org.apache.cassandra.service.CassandraDaemon#main:786 org.apache.cassandra.service.CassandraDaemon#activate:633 org.apache.cassandra.service.CassandraDaemon#setup:261 org.apache.cassandra.schema.LegacySchemaMigrator#migrate:83 org.apache.cassandra.schema.LegacySchemaMigrator#unloadLegacySchemaTables:137 java.lang.Iterable#forEach:75 org.apache.cassandra.schema.LegacySchemaMigrator#lambda$unloadLegacySchemaTables$1:137 org.apache.cassandra.db.ColumnFamilyStore#invalidate:542 {code} In line 564~570 within `public void invalidate(boolean expectMBean)`: {code:java} latencyCalculator.cancel(false); compactionStrategyManager.shutdown(); SystemKeyspace.removeTruncationRecord(metadata.cfId); // line 566 data.dropSSTables(); // line 568 LifecycleTransaction.waitForDeletions(); // line 569 indexManager.invalidateAllIndexesBlocking(); {code} According to the code and the comments, we suppose `data.dropSSTables()` in line 568 will submit some tidier tasks to the `nonPeriodicTasks` thread pool. Call stack in version 3.11.10: {code:java} org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:233 org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:238 org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:267 org.apache.cassandra.utils.concurrent.Refs#release:241 org.apache.cassandra.utils.concurrent.Ref#release:119 org.apache.cassandra.utils.concurrent.Ref#release:225 org.apache.cassandra.utils.concurrent.Ref#release:326 org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier#tidy:2205 {code} Then, `LifecycleTransaction.waitForDeletions()` in line 569 is {code:java} /** * Deletions run on the nonPeriodicTasks executor, (both failedDeletions or global tidiers in SSTableReader) * so by scheduling a new empty task and waiting for it we ensure any prior deletion has completed. */ public static void waitForDeletions() { LogTransaction.waitForDeletions(); } {code} And then call `waitForDeletions` in `LogTransaction`: {code:java} static void waitForDeletions() { FBUtilities.waitOnFuture(ScheduledExecutors.nonPeriodicTasks.schedule(Runnables.doNothing(), 0, TimeUnit.MILLISECONDS)); } {code} >From the comments, we think it ensures that all existing tasks in >`nonPeriodicTasks` are drained. However, we found some tidier tasks are still >running in `nonPeriodicTasks` thread pool. We suspect that those tidier tasks should be guaranteed to finish during server setup, because of its exception handling. In version 3.11.10, these tidier tasks are submitted to `nonPeriodicTasks` in `SSTableReader$InstanceTidier#tidy:2205`, and have the exception handling `FileUtils.handleFSErrorAndPropagate(new FSWriteError(e, file))` (within the call stack `SSTableReader$InstanceTidier$1#run:2223` => `LogTransaction$SSTableTidier#run:386` => `LogTransaction#delete:261`). The `FileUtils.handleFSErrorAndPropagate` handles this `FSWriteError`. We found that it checks the `CassandraDaemon.setupCompleted` flag in call stack within (`FileUtils#handleFSErrorAndPropagate:507` => `JVMStabilityInspector#inspectThrowable:60` => `JVMStabilityInspector#inspectThrowable:106` => `JVMStabilityInspector#inspectDiskError:73` => `FileUtils#handleFSError:494` => `DefaultFSErrorHandler:handleFSError:58`) {code:java} if (!StorageService.instance.isDaemonSetupCompleted()) // line 58 handleStartupFSError(e); // line 59 {code} The `handleStartupFSError` in line 59 will trigger server crash immediately. It prevents the faulty state early in the startup phase. On the other hand, if the `CassandraDaemon.setupCompleted` flag is set, we found that the server tolerates the exception, even in the stop mode in `disk_failure_policy`. Since those tidier tasks still appear after `LifecycleTransaction.waitForDeletions()`, we did more experiments to further confirm that if those tasks got exceptions after the `CassandraDaemon.setupCompleted` flag is set, the server will suffer from some internal issues, e.g., fail to handle table read/write. Therefore, we suspect there could be some concurrency issues — some tidier tasks are not waited. To figure out the root cause of this concurrency issue, we re-examine line 564~570 within `public void invalidate(boolean expectMBean)` in the CassandraDaemon main thread: {code:java} latencyCalculator.cancel(false); compactionStrategyManager.shutdown(); SystemKeyspace.removeTruncationRecord(metadata.cfId); // line 566 data.dropSSTables(); // line 568 LifecycleTransaction.waitForDeletions(); // line 569 indexManager.invalidateAllIndexesBlocking();{code} After some investigation, we found that the tidier tasks generated in line 568 are actually waited by `waitForDeletions` in line 569. However, some additional tidier tasks may be asynchronously generated from `removeTruncationRecord` line 566. `SystemKeyspace#removeTruncationRecord` essentially submits some `ColumnFamilyStore$Flush` tasks, which submit `CompactionManager$BackgroundCompactionCandidate` tasks, which submit tidier tasks. We will provide the related 3 call stacks at the end of this report. Here we want to first emphasize that these tidier tasks should finish before setting the `CassandraDaemon.setupCompleted` flag. Otherwise, the potential exception may silently result in a corrupted state while the server remains “available”. To confirm that these tidier tasks are not waited, we keep investigating the remaining control-flow in `org.apache.cassandra.service.CassandraDaemon#setup`, after calling `LegacySchemaMigrator#migrate`. Actually we found some code about aborting the potential compactions (and tidier tasks): It enumerates all `ColumnFamilyStore` and calls `ColumnFamilyStore#disableAutoCompaction` in line 303 to abort their potential compaction tasks. Essentially it calls `CompactionStrategyManager#disable` to assign “false” to the `enabled` flag, so that, in line 160 in `CompactionManager#submitBackground`: {code:java} public List<Future<?>> submitBackground(final ColumnFamilyStore cfs) { if (cfs.isAutoCompactionDisabled()) { logger.trace("Autocompaction is disabled"); return Collections.emptyList(); } // ... } {code} New compaction tasks will not be submitted due to the disabled autocompaction. However, the potential existing compaction tasks and tidier tasks are still running, and they may generate more tidier tasks. Therefore, we propose that we should wait for the existing compaction tasks (in compaction executor) and tidier tasks (in nonPeriodicTasks) after disabling the autocompaction. P.S. Version 3.11.10, the call stack of `CassandraDaemon#main` submitting `ColumnFamilyStore$Flush` tasks: {code:java} org.apache.cassandra.service.CassandraDaemon#main:786 org.apache.cassandra.service.CassandraDaemon#activate:633 org.apache.cassandra.service.CassandraDaemon#setup:261 org.apache.cassandra.schema.LegacySchemaMigrator#migrate:83 org.apache.cassandra.schema.LegacySchemaMigrator#unloadLegacySchemaTables:137 java.lang.Iterable#forEach:75 org.apache.cassandra.schema.LegacySchemaMigrator#lambda$unloadLegacySchemaTables$1:137 org.apache.cassandra.db.ColumnFamilyStore#invalidate:542 org.apache.cassandra.db.ColumnFamilyStore#invalidate:566 org.apache.cassandra.db.SystemKeyspace#removeTruncationRecord:668 org.apache.cassandra.db.SystemKeyspace#forceBlockingFlush:829 org.apache.cassandra.db.ColumnFamilyStore#forceFlush:954 org.apache.cassandra.db.ColumnFamilyStore#switchMemtableIfCurrent:885 org.apache.cassandra.db.ColumnFamilyStore#switchMemtable:904 {code} Version 3.11.10, the call stack of `ColumnFamilyStore$Flush#run` submitting `CompactionManager$BackgroundCompactionCandidate`: {code:java} java.lang.Thread#run:748 org.apache.cassandra.concurrent.NamedThreadFactory#lambda$threadLocalDeallocator$0:84 java.util.concurrent.ThreadPoolExecutor$Worker#run:624 java.util.concurrent.ThreadPoolExecutor#runWorker:1149 org.apache.cassandra.db.ColumnFamilyStore$Flush#run:1128 org.apache.cassandra.db.ColumnFamilyStore$Flush#flushMemtable:1155 org.apache.cassandra.db.ColumnFamilyStore#replaceFlushed:1642 org.apache.cassandra.db.compaction.CompactionStrategyManager#replaceFlushed:400 org.apache.cassandra.db.compaction.CompactionManager#submitBackground:186 {code} Version 3.11.10, the call stack of `CompactionManager$BackgroundCompactionCandidate` submitting tidier tasks: {code:java} java.lang.Thread#run:748 org.apache.cassandra.concurrent.NamedThreadFactory#lambda$threadLocalDeallocator$0:84 java.util.concurrent.ThreadPoolExecutor$Worker#run:624 java.util.concurrent.ThreadPoolExecutor#runWorker:1149 java.util.concurrent.FutureTask#run:266 java.util.concurrent.Executors$RunnableAdapter#call:511 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate#run:272 org.apache.cassandra.db.compaction.AbstractCompactionTask#execute:61 org.apache.cassandra.db.compaction.CompactionTask#executeInternal:85 org.apache.cassandra.utils.WrappedRunnable#run:28 org.apache.cassandra.db.compaction.CompactionTask#runMayThrow:223 org.apache.cassandra.db.compaction.writers.CompactionAwareWriter#finish:123 org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional#finish:179 org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional#prepareToCommit:168 org.apache.cassandra.io.sstable.SSTableRewriter#doPrepare#385 org.apache.cassandra.db.lifecycle.LifecycleTransaction#checkpoint:327 org.apache.cassandra.db.lifecycle.LifecycleTransaction#checkpoint:356 org.apache.cassandra.utils.concurrent.Refs#release:241 org.apache.cassandra.utils.concurrent.Ref#release:119 org.apache.cassandra.utils.concurrent.Ref$State#release:225 org.apache.cassandra.utils.concurrent.Ref$GlobalState#release:326 org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier#tidy:2205 {code} was: We have been testing Cassandra 3.11.10 for a while. During a node start, we found that a synchronization guarantee implied by the code comments is not enforced. Specifically, in the `invalidate` method called in this call stack (in version 3.11.10): {code:java} org.apache.cassandra.service.CassandraDaemon#main:786 org.apache.cassandra.service.CassandraDaemon#activate:633 org.apache.cassandra.service.CassandraDaemon#setup:261 org.apache.cassandra.schema.LegacySchemaMigrator#migrate:83 org.apache.cassandra.schema.LegacySchemaMigrator#unloadLegacySchemaTables:137 java.lang.Iterable#forEach:75 org.apache.cassandra.schema.LegacySchemaMigrator#lambda$unloadLegacySchemaTables$1:137 org.apache.cassandra.db.ColumnFamilyStore#invalidate:542 {code} In line 564~570 within `public void invalidate(boolean expectMBean)`: {code:java} latencyCalculator.cancel(false); compactionStrategyManager.shutdown(); SystemKeyspace.removeTruncationRecord(metadata.cfId); // line 566 data.dropSSTables(); // line 568 LifecycleTransaction.waitForDeletions(); // line 569 indexManager.invalidateAllIndexesBlocking(); {code} According to the code and the comments, we suppose `data.dropSSTables()` in line 568 will submit some tidier tasks to the `nonPeriodicTasks` thread pool. Call stack in version 3.11.10: {code:java} org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:233 org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:238 org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:267 org.apache.cassandra.utils.concurrent.Refs#release:241 org.apache.cassandra.utils.concurrent.Ref#release:119 org.apache.cassandra.utils.concurrent.Ref#release:225 org.apache.cassandra.utils.concurrent.Ref#release:326 org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier#tidy:2205 {code} Then, `LifecycleTransaction.waitForDeletions()` in line 569 is {code:java} /** * Deletions run on the nonPeriodicTasks executor, (both failedDeletions or global tidiers in SSTableReader) * so by scheduling a new empty task and waiting for it we ensure any prior deletion has completed. */ public static void waitForDeletions() { LogTransaction.waitForDeletions(); } {code} And then call `waitForDeletions` in `LogTransaction`: {code:java} static void waitForDeletions() { FBUtilities.waitOnFuture(ScheduledExecutors.nonPeriodicTasks.schedule(Runnables.doNothing(), 0, TimeUnit.MILLISECONDS)); } {code} >From the comments, we think it ensures that all existing tasks in >`nonPeriodicTasks` are drained. However, we found some tidier tasks are still >running in `nonPeriodicTasks` thread pool. We suspect that those tidier tasks should be guaranteed to finish during server setup, because of its exception handling. In version 3.11.10, these tidier tasks are submitted to `nonPeriodicTasks` in `SSTableReader$InstanceTidier#tidy:2205`, and have the exception handling `FileUtils.handleFSErrorAndPropagate(new FSWriteError(e, file))` (within the call stack `SSTableReader$InstanceTidier$1#run:2223` => `LogTransaction$SSTableTidier#run:386` => `LogTransaction#delete:261`). The `FileUtils.handleFSErrorAndPropagate` handles this `FSWriteError`. We found that it checks the `CassandraDaemon.setupCompleted` flag in call stack within (`FileUtils#handleFSErrorAndPropagate:507` => `JVMStabilityInspector#inspectThrowable:60` => `JVMStabilityInspector#inspectThrowable:106` => `JVMStabilityInspector#inspectDiskError:73` => `FileUtils#handleFSError:494` => `DefaultFSErrorHandler:handleFSError:58`) {code:java} if (!StorageService.instance.isDaemonSetupCompleted()) // line 58 handleStartupFSError(e); // line 59 {code} The `handleStartupFSError` in line 59 will trigger server crash immediately. It prevents the faulty state early in the startup phase. On the other hand, if the `CassandraDaemon.setupCompleted` flag is set, we found that the server tolerates the exception, even in the stop mode in `disk_failure_policy`. Since those tidier tasks still appear after `LifecycleTransaction.waitForDeletions()`, we did more experiments to further confirm that if those tasks got exceptions after the `CassandraDaemon.setupCompleted` flag is set, the server will suffer from some internal issues, e.g., fail to handle table read/write. Therefore, we suspect there could be some concurrency issues — some tidier tasks are not waited. To figure out the root cause of this concurrency issue, we re-examine line 564~570 within `public void invalidate(boolean expectMBean)` in the CassandraDaemon main thread: {code:java} latencyCalculator.cancel(false); compactionStrategyManager.shutdown(); SystemKeyspace.removeTruncationRecord(metadata.cfId); // line 566 data.dropSSTables(); // line 568 LifecycleTransaction.waitForDeletions(); // line 569 indexManager.invalidateAllIndexesBlocking();{code} After some investigation, we found that the tidier tasks generated in line 568 are actually waited by `waitForDeletions` in line 569. However, some additional tidier tasks may be asynchronously generated from `removeTruncationRecord` line 566. `SystemKeyspace#removeTruncationRecord` essentially submits some `ColumnFamilyStore$Flush` tasks, which submit `CompactionManager$BackgroundCompactionCandidate` tasks, which submit tidier tasks. We will provide the related 3 call stacks at the end of this report. Here we want to first emphasize that these tidier tasks should finish before setting the `CassandraDaemon.setupCompleted` flag. Otherwise, the potential exception may silently result in a corrupted state while the server remains “available”. To confirm that these tidier tasks are not waited, we keep investigating the remaining control-flow in `org.apache.cassandra.service.CassandraDaemon#setup`, after calling `LegacySchemaMigrator#migrate`. Actually we found some code about aborting the potential compactions (and tidier tasks): It enumerates all `ColumnFamilyStore` and calls `ColumnFamilyStore#disableAutoCompaction` in line 303 to abort their potential compaction tasks. Essentially it calls `CompactionStrategyManager#disable` to assign “false” to the `enabled` flag, so that, in line 160 in `CompactionManager#submitBackground`: {code:java} public List<Future<?>> submitBackground(final ColumnFamilyStore cfs) { if (cfs.isAutoCompactionDisabled()) { logger.trace("Autocompaction is disabled"); return Collections.emptyList(); } // ... } {code} New compaction tasks will not be submitted due to the disabled autocompaction. However, the potential existing compaction tasks and tidier tasks are still running, and they may generate more tidier tasks. Therefore, we propose that we should wait for the existing compaction tasks (in compaction executor) and tidier tasks (in nonPeriodicTasks) after disabling the autocompaction. P.S. Version 3.11.10, the call stack of `CassandraDaemon#main` submitting `ColumnFamilyStore$Flush` tasks: {code:java} org.apache.cassandra.service.CassandraDaemon#main:786 org.apache.cassandra.service.CassandraDaemon#activate:633 org.apache.cassandra.service.CassandraDaemon#setup:261 org.apache.cassandra.schema.LegacySchemaMigrator#migrate:83 org.apache.cassandra.schema.LegacySchemaMigrator#unloadLegacySchemaTables:137 java.lang.Iterable#forEach:75 org.apache.cassandra.schema.LegacySchemaMigrator#lambda$unloadLegacySchemaTables$1:137 org.apache.cassandra.db.ColumnFamilyStore#invalidate:542 org.apache.cassandra.db.ColumnFamilyStore#invalidate:566 org.apache.cassandra.db.SystemKeyspace#removeTruncationRecord:668 org.apache.cassandra.db.SystemKeyspace#forceBlockingFlush:829 org.apache.cassandra.db.ColumnFamilyStore#forceFlush:954 org.apache.cassandra.db.ColumnFamilyStore#switchMemtableIfCurrent:885 org.apache.cassandra.db.ColumnFamilyStore#switchMemtable:904 {code} Version 3.11.10, the call stack of `ColumnFamilyStore$Flush#run` submitting `CompactionManager$BackgroundCompactionCandidate`: {code:java} java.lang.Thread#run:748 org.apache.cassandra.concurrent.NamedThreadFactory#lambda$threadLocalDeallocator$0:84 java.util.concurrent.ThreadPoolExecutor$Worker#run:624 java.util.concurrent.ThreadPoolExecutor#runWorker:1149 org.apache.cassandra.db.ColumnFamilyStore$Flush#run:1128 org.apache.cassandra.db.ColumnFamilyStore$Flush#flushMemtable:1155 org.apache.cassandra.db.ColumnFamilyStore#replaceFlushed:1642 org.apache.cassandra.db.compaction.CompactionStrategyManager#replaceFlushed:400 org.apache.cassandra.db.compaction.CompactionManager#submitBackground:186 {code} Version 3.11.10, the call stack of `CompactionManager$BackgroundCompactionCandidate` submitting tidier tasks: {code:java} java.lang.Thread#run:748 org.apache.cassandra.concurrent.NamedThreadFactory#lambda$threadLocalDeallocator$0:84 java.util.concurrent.ThreadPoolExecutor$Worker#run:624 java.util.concurrent.ThreadPoolExecutor#runWorker:1149 java.util.concurrent.FutureTask#run:266 java.util.concurrent.Executors$RunnableAdapter#call:511 org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate#run:272 org.apache.cassandra.db.compaction.AbstractCompactionTask#execute:61 org.apache.cassandra.db.compaction.CompactionTask#executeInternal:85 org.apache.cassandra.utils.WrappedRunnable#run:28 org.apache.cassandra.db.compaction.CompactionTask#runMayThrow:223 org.apache.cassandra.db.compaction.writers.CompactionAwareWriter#finish:123 org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional#finish:179 org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional#prepareToCommit:168 org.apache.cassandra.io.sstable.SSTableRewriter#doPrepare#385 org.apache.cassandra.db.lifecycle.LifecycleTransaction#checkpoint:327 org.apache.cassandra.db.lifecycle.LifecycleTransaction#checkpoint:356 org.apache.cassandra.utils.concurrent.Refs#release:241 org.apache.cassandra.utils.concurrent.Ref#release:119 org.apache.cassandra.utils.concurrent.Ref$State#release:225 org.apache.cassandra.utils.concurrent.Ref$GlobalState#release:326 org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier#tidy:2205 {code} > Add synchronization to wait for outstanding tasks in the compaction executor > and nonPeriodicTasks during CassandraDaemon setup > ------------------------------------------------------------------------------------------------------------------------------ > > Key: CASSANDRA-17564 > URL: https://issues.apache.org/jira/browse/CASSANDRA-17564 > Project: Cassandra > Issue Type: Improvement > Components: Local/Compaction > Reporter: Haoze Wu > Priority: Normal > > We have been testing Cassandra 3.11.10 for a while. During a node start, we > found that a synchronization guarantee implied by the code comments is not > enforced. Specifically, in the `invalidate` method called in this call stack > (in version 3.11.10): > {code:java} > org.apache.cassandra.service.CassandraDaemon#main:786 > org.apache.cassandra.service.CassandraDaemon#activate:633 > org.apache.cassandra.service.CassandraDaemon#setup:261 > org.apache.cassandra.schema.LegacySchemaMigrator#migrate:83 > org.apache.cassandra.schema.LegacySchemaMigrator#unloadLegacySchemaTables:137 > java.lang.Iterable#forEach:75 > org.apache.cassandra.schema.LegacySchemaMigrator#lambda$unloadLegacySchemaTables$1:137 > org.apache.cassandra.db.ColumnFamilyStore#invalidate:542 {code} > In line 564~570 within `public void invalidate(boolean expectMBean)`: > {code:java} > latencyCalculator.cancel(false); > compactionStrategyManager.shutdown(); > SystemKeyspace.removeTruncationRecord(metadata.cfId); // line 566 > data.dropSSTables(); // line 568 > LifecycleTransaction.waitForDeletions(); // line 569 > indexManager.invalidateAllIndexesBlocking(); > {code} > According to the code and the comments, we suppose `data.dropSSTables()` in > line 568 will submit some tidier tasks to the `nonPeriodicTasks` thread pool. > Call stack in version 3.11.10: > {code:java} > org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:233 > org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:238 > org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:267 > org.apache.cassandra.utils.concurrent.Refs#release:241 > org.apache.cassandra.utils.concurrent.Ref#release:119 > org.apache.cassandra.utils.concurrent.Ref#release:225 > org.apache.cassandra.utils.concurrent.Ref#release:326 > org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier#tidy:2205 > {code} > Then, `LifecycleTransaction.waitForDeletions()` in line 569 is > {code:java} > /** > * Deletions run on the nonPeriodicTasks executor, (both failedDeletions > or global tidiers in SSTableReader) > * so by scheduling a new empty task and waiting for it we ensure any > prior deletion has completed. > */ > public static void waitForDeletions() > { > LogTransaction.waitForDeletions(); > } > {code} > And then call `waitForDeletions` in `LogTransaction`: > {code:java} > static void waitForDeletions() > { > > FBUtilities.waitOnFuture(ScheduledExecutors.nonPeriodicTasks.schedule(Runnables.doNothing(), > 0, TimeUnit.MILLISECONDS)); > } > {code} > From the comments, we think it ensures that all existing tasks in > `nonPeriodicTasks` are drained. However, we found some tidier tasks are still > running in `nonPeriodicTasks` thread pool. > We suspect that those tidier tasks should be guaranteed to finish during > server setup, because of its exception handling. In version 3.11.10, these > tidier tasks are submitted to `nonPeriodicTasks` in > `SSTableReader$InstanceTidier#tidy:2205`, and have the exception handling > `FileUtils.handleFSErrorAndPropagate(new FSWriteError(e, file))` (within the > call stack `SSTableReader$InstanceTidier$1#run:2223` => > `LogTransaction$SSTableTidier#run:386` => `LogTransaction#delete:261`). > The `FileUtils.handleFSErrorAndPropagate` handles this `FSWriteError`. We > found that it checks the `CassandraDaemon.setupCompleted` flag in call stack > within (`FileUtils#handleFSErrorAndPropagate:507` => > `JVMStabilityInspector#inspectThrowable:60` => > `JVMStabilityInspector#inspectThrowable:106` => > `JVMStabilityInspector#inspectDiskError:73` => `FileUtils#handleFSError:494` > => `DefaultFSErrorHandler:handleFSError:58`) > {code:java} > if (!StorageService.instance.isDaemonSetupCompleted()) // line 58 > handleStartupFSError(e); // line 59 > {code} > The `handleStartupFSError` in line 59 will trigger server crash immediately. > It prevents the faulty state early in the startup phase. On the other hand, > if the `CassandraDaemon.setupCompleted` flag is set, we found that the server > tolerates the exception, even in the stop mode in `disk_failure_policy`. > Since those tidier tasks still appear after > `LifecycleTransaction.waitForDeletions()`, we did more experiments to further > confirm that if those tasks got exceptions after the > `CassandraDaemon.setupCompleted` flag is set, the server will suffer from > some internal issues, e.g., fail to handle table read/write. Therefore, we > suspect there could be some concurrency issues — some tidier tasks are not > waited. > To figure out the root cause of this concurrency issue, we re-examine line > 564~570 within `public void invalidate(boolean expectMBean)` in the > CassandraDaemon main thread: > {code:java} > latencyCalculator.cancel(false); > compactionStrategyManager.shutdown(); > SystemKeyspace.removeTruncationRecord(metadata.cfId); // line 566 > data.dropSSTables(); // line 568 > LifecycleTransaction.waitForDeletions(); // line 569 > indexManager.invalidateAllIndexesBlocking();{code} > After some investigation, we found that the tidier tasks generated in line > 568 are actually waited by `waitForDeletions` in line 569. However, some > additional tidier tasks may be asynchronously generated from > `removeTruncationRecord` line 566. `SystemKeyspace#removeTruncationRecord` > essentially submits some `ColumnFamilyStore$Flush` tasks, which submit > `CompactionManager$BackgroundCompactionCandidate` tasks, which submit tidier > tasks. We will provide the related 3 call stacks at the end of this report. > Here we want to first emphasize that these tidier tasks should finish before > setting the `CassandraDaemon.setupCompleted` flag. Otherwise, the potential > exception may silently result in a corrupted state while the server remains > “available”. > To confirm that these tidier tasks are not waited, we keep investigating the > remaining control-flow in > `org.apache.cassandra.service.CassandraDaemon#setup`, after calling > `LegacySchemaMigrator#migrate`. Actually we found some code about aborting > the potential compactions (and tidier tasks): > It enumerates all `ColumnFamilyStore` and calls > `ColumnFamilyStore#disableAutoCompaction` in line 303 to abort their > potential compaction tasks. Essentially it calls > `CompactionStrategyManager#disable` to assign “false” to the `enabled` flag, > so that, in line 160 in `CompactionManager#submitBackground`: > {code:java} > public List<Future<?>> submitBackground(final ColumnFamilyStore cfs) > { > if (cfs.isAutoCompactionDisabled()) > { > logger.trace("Autocompaction is disabled"); > return Collections.emptyList(); > } > // ... > } {code} > New compaction tasks will not be submitted due to the disabled autocompaction. > However, the potential existing compaction tasks and tidier tasks are still > running, and they may generate more tidier tasks. > Therefore, we propose that we should wait for the existing compaction tasks > (in compaction executor) and tidier tasks (in nonPeriodicTasks) after > disabling the autocompaction. > > P.S. > Version 3.11.10, the call stack of `CassandraDaemon#main` submitting > `ColumnFamilyStore$Flush` tasks: > {code:java} > org.apache.cassandra.service.CassandraDaemon#main:786 > org.apache.cassandra.service.CassandraDaemon#activate:633 > org.apache.cassandra.service.CassandraDaemon#setup:261 > org.apache.cassandra.schema.LegacySchemaMigrator#migrate:83 > org.apache.cassandra.schema.LegacySchemaMigrator#unloadLegacySchemaTables:137 > java.lang.Iterable#forEach:75 > org.apache.cassandra.schema.LegacySchemaMigrator#lambda$unloadLegacySchemaTables$1:137 > org.apache.cassandra.db.ColumnFamilyStore#invalidate:542 > org.apache.cassandra.db.ColumnFamilyStore#invalidate:566 > org.apache.cassandra.db.SystemKeyspace#removeTruncationRecord:668 > org.apache.cassandra.db.SystemKeyspace#forceBlockingFlush:829 > org.apache.cassandra.db.ColumnFamilyStore#forceFlush:954 > org.apache.cassandra.db.ColumnFamilyStore#switchMemtableIfCurrent:885 > org.apache.cassandra.db.ColumnFamilyStore#switchMemtable:904 > {code} > Version 3.11.10, the call stack of `ColumnFamilyStore$Flush#run` submitting > `CompactionManager$BackgroundCompactionCandidate`: > {code:java} > java.lang.Thread#run:748 > org.apache.cassandra.concurrent.NamedThreadFactory#lambda$threadLocalDeallocator$0:84 > java.util.concurrent.ThreadPoolExecutor$Worker#run:624 > java.util.concurrent.ThreadPoolExecutor#runWorker:1149 > org.apache.cassandra.db.ColumnFamilyStore$Flush#run:1128 > org.apache.cassandra.db.ColumnFamilyStore$Flush#flushMemtable:1155 > org.apache.cassandra.db.ColumnFamilyStore#replaceFlushed:1642 > org.apache.cassandra.db.compaction.CompactionStrategyManager#replaceFlushed:400 > org.apache.cassandra.db.compaction.CompactionManager#submitBackground:186 > {code} > Version 3.11.10, the call stack of > `CompactionManager$BackgroundCompactionCandidate` submitting tidier tasks: > {code:java} > java.lang.Thread#run:748 > org.apache.cassandra.concurrent.NamedThreadFactory#lambda$threadLocalDeallocator$0:84 > java.util.concurrent.ThreadPoolExecutor$Worker#run:624 > java.util.concurrent.ThreadPoolExecutor#runWorker:1149 > java.util.concurrent.FutureTask#run:266 > java.util.concurrent.Executors$RunnableAdapter#call:511 > org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate#run:272 > org.apache.cassandra.db.compaction.AbstractCompactionTask#execute:61 > org.apache.cassandra.db.compaction.CompactionTask#executeInternal:85 > org.apache.cassandra.utils.WrappedRunnable#run:28 > org.apache.cassandra.db.compaction.CompactionTask#runMayThrow:223 > org.apache.cassandra.db.compaction.writers.CompactionAwareWriter#finish:123 > org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional#finish:179 > org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional#prepareToCommit:168 > org.apache.cassandra.io.sstable.SSTableRewriter#doPrepare#385 > org.apache.cassandra.db.lifecycle.LifecycleTransaction#checkpoint:327 > org.apache.cassandra.db.lifecycle.LifecycleTransaction#checkpoint:356 > org.apache.cassandra.utils.concurrent.Refs#release:241 > org.apache.cassandra.utils.concurrent.Ref#release:119 > org.apache.cassandra.utils.concurrent.Ref$State#release:225 > org.apache.cassandra.utils.concurrent.Ref$GlobalState#release:326 > org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier#tidy:2205 > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org