[ 
https://issues.apache.org/jira/browse/CASSANDRA-17564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haoze Wu updated CASSANDRA-17564:
---------------------------------
    Description: 
We have been testing Cassandra 3.11.10 for a while. During node startup, we
found that a synchronization guarantee implied by the code comments is not
enforced. Specifically, in the `invalidate` method reached via this call stack
(in version 3.11.10):
{code:java}
org.apache.cassandra.service.CassandraDaemon#main:786
org.apache.cassandra.service.CassandraDaemon#activate:633
org.apache.cassandra.service.CassandraDaemon#setup:261
org.apache.cassandra.schema.LegacySchemaMigrator#migrate:83
org.apache.cassandra.schema.LegacySchemaMigrator#unloadLegacySchemaTables:137
java.lang.Iterable#forEach:75
org.apache.cassandra.schema.LegacySchemaMigrator#lambda$unloadLegacySchemaTables$1:137
org.apache.cassandra.db.ColumnFamilyStore#invalidate:542
{code}
In lines 564-570 of `public void invalidate(boolean expectMBean)`:
{code:java}
        latencyCalculator.cancel(false);
        compactionStrategyManager.shutdown();
        SystemKeyspace.removeTruncationRecord(metadata.cfId);  // line 566

        data.dropSSTables();                                   // line 568
        LifecycleTransaction.waitForDeletions();               // line 569
        indexManager.invalidateAllIndexesBlocking();
{code}
According to the code and its comments, we expect `data.dropSSTables()` in
line 568 to submit tidier tasks to the `nonPeriodicTasks` thread pool. Call
stack in version 3.11.10:
{code:java}
org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:233
org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:238
org.apache.cassandra.db.lifecycle.Tracker#dropSSTables:267
org.apache.cassandra.utils.concurrent.Refs#release:241
org.apache.cassandra.utils.concurrent.Ref#release:119
org.apache.cassandra.utils.concurrent.Ref$State#release:225
org.apache.cassandra.utils.concurrent.Ref$GlobalState#release:326
org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier#tidy:2205 
{code}
Then `LifecycleTransaction.waitForDeletions()` in line 569 is:
{code:java}
    /**
     * Deletions run on the nonPeriodicTasks executor, (both failedDeletions or global tidiers in SSTableReader)
     * so by scheduling a new empty task and waiting for it we ensure any prior deletion has completed.
     */
    public static void waitForDeletions()
    {
        LogTransaction.waitForDeletions();
    }
{code}
This delegates to `waitForDeletions` in `LogTransaction`:
{code:java}
    static void waitForDeletions()
    {
        FBUtilities.waitOnFuture(ScheduledExecutors.nonPeriodicTasks.schedule(Runnables.doNothing(), 0, TimeUnit.MILLISECONDS));
    }
{code}
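For clarity, here is a minimal, self-contained sketch (plain Java, not Cassandra code; class and task names are ours) of why this sentinel idiom works on a single-threaded executor such as `nonPeriodicTasks`: the empty task is queued behind everything already submitted, so waiting on its future waits for all of them.
{code:java}
import java.util.concurrent.*;

public class SentinelBarrierDemo
{
    public static void main(String[] args) throws Exception
    {
        // Like nonPeriodicTasks, a single-threaded scheduled executor:
        // tasks run one at a time, in submission order.
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

        executor.submit(() -> System.out.println("tidier task 1"));
        executor.submit(() -> System.out.println("tidier task 2"));

        // The "empty task" barrier: scheduled behind everything already queued,
        // so get() returns only after both tasks above have completed.
        executor.schedule(() -> {}, 0, TimeUnit.MILLISECONDS).get();

        System.out.println("all previously queued tasks have finished");
        executor.shutdown();
    }
}
{code}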
From the comments, we understand this is meant to ensure that all existing
tasks in `nonPeriodicTasks` are drained. However, we found that some tidier
tasks are still running in the `nonPeriodicTasks` thread pool afterwards.

We believe those tidier tasks should be guaranteed to finish during server
setup, because of their exception handling. In version 3.11.10, these tidier
tasks are submitted to `nonPeriodicTasks` in
`SSTableReader$InstanceTidier#tidy:2205`, and their exception handling is
`FileUtils.handleFSErrorAndPropagate(new FSWriteError(e, file))` (within the
call stack `SSTableReader$InstanceTidier$1#run:2223` =>
`LogTransaction$SSTableTidier#run:386` => `LogTransaction#delete:261`).

`FileUtils.handleFSErrorAndPropagate` handles this `FSWriteError`. We found
that it checks the `CassandraDaemon.setupCompleted` flag in the call stack
`FileUtils#handleFSErrorAndPropagate:507` =>
`JVMStabilityInspector#inspectThrowable:60` =>
`JVMStabilityInspector#inspectThrowable:106` =>
`JVMStabilityInspector#inspectDiskError:73` => `FileUtils#handleFSError:494` =>
`DefaultFSErrorHandler#handleFSError:58`:
{code:java}
        if (!StorageService.instance.isDaemonSetupCompleted())  // line 58
            handleStartupFSError(e);                            // line 59
{code}
`handleStartupFSError` in line 59 crashes the server immediately, preventing a
faulty state early in the startup phase. On the other hand, once the
`CassandraDaemon.setupCompleted` flag is set, we found that the server
tolerates the exception, even with `disk_failure_policy` set to `stop`.
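
To summarize the behavior, here is a hypothetical, self-contained rendering of this two-phase policy (illustrative plain Java; the names `setupCompleted` and `handleDiskError` are ours and do not match Cassandra's actual classes):
{code:java}
public class StartupErrorPolicyDemo
{
    // Stands in for CassandraDaemon.setupCompleted / isDaemonSetupCompleted().
    private static volatile boolean setupCompleted = false;

    static void handleDiskError(RuntimeException e)
    {
        if (!setupCompleted)
        {
            // Startup phase: fail fast, so a faulty node never appears available.
            System.err.println("startup disk error, aborting: " + e.getMessage());
            System.exit(3);
        }
        // After setup: the error is tolerated (subject to disk_failure_policy),
        // potentially leaving the server running in a corrupted state.
        System.err.println("disk error tolerated after setup: " + e.getMessage());
    }

    public static void main(String[] args)
    {
        setupCompleted = true; // simulate setup having completed
        handleDiskError(new RuntimeException("late tidier failure")); // logged, not fatal
    }
}
{code}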

Since those tidier tasks still appear after
`LifecycleTransaction.waitForDeletions()`, we ran more experiments and
confirmed that if those tasks hit exceptions after the
`CassandraDaemon.setupCompleted` flag is set, the server suffers internal
issues, e.g., it fails to handle table reads and writes. Therefore, we suspect
a concurrency issue: some tidier tasks are never waited for.

To find the root cause of this concurrency issue, we re-examined lines 564-570
of `public void invalidate(boolean expectMBean)`, which runs on the
CassandraDaemon main thread:
{code:java}
        latencyCalculator.cancel(false);
        compactionStrategyManager.shutdown();
        SystemKeyspace.removeTruncationRecord(metadata.cfId);  // line 566

        data.dropSSTables();                                   // line 568
        LifecycleTransaction.waitForDeletions();               // line 569
        indexManager.invalidateAllIndexesBlocking();
{code}
After some investigation, we found that the tidier tasks generated in line 568
are indeed waited for by `waitForDeletions` in line 569. However, additional
tidier tasks may be generated asynchronously by `removeTruncationRecord` in
line 566. `SystemKeyspace#removeTruncationRecord` essentially submits
`ColumnFamilyStore$Flush` tasks, which submit
`CompactionManager$BackgroundCompactionCandidate` tasks, which in turn submit
tidier tasks. We provide the three related call stacks at the end of this
report. Here we want to emphasize that these tidier tasks should finish before
the `CassandraDaemon.setupCompleted` flag is set. Otherwise, a potential
exception may silently result in a corrupted state while the server remains
“available”.
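
To make the race concrete, here is a minimal, self-contained sketch (plain Java; executor and task names are ours) of why the sentinel barrier misses this work: a tidier enqueued by a background thread after the sentinel is scheduled is not covered by the wait.
{code:java}
import java.util.concurrent.*;

public class BarrierRaceDemo
{
    public static void main(String[] args) throws Exception
    {
        ScheduledExecutorService nonPeriodicTasks = Executors.newSingleThreadScheduledExecutor();
        ExecutorService flushExecutor = Executors.newSingleThreadExecutor();

        // Stands in for a ColumnFamilyStore$Flush task that, via a background
        // compaction, eventually submits a tidier task -- but only later.
        flushExecutor.submit(() -> {
            try { Thread.sleep(100); } catch (InterruptedException e) { return; }
            nonPeriodicTasks.submit(() -> System.out.println("late tidier task"));
        });

        // Main thread, like waitForDeletions: the late tidier is not queued yet,
        // so the barrier completes without waiting for it.
        nonPeriodicTasks.schedule(() -> {}, 0, TimeUnit.MILLISECONDS).get();
        System.out.println("waitForDeletions returned; a tidier is still pending");

        flushExecutor.shutdown();
        flushExecutor.awaitTermination(1, TimeUnit.SECONDS);
        nonPeriodicTasks.shutdown();
        nonPeriodicTasks.awaitTermination(1, TimeUnit.SECONDS);
    }
}
{code}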

To confirm that these tidier tasks are not waited for, we kept investigating
the remaining control flow in
`org.apache.cassandra.service.CassandraDaemon#setup` after the call to
`LegacySchemaMigrator#migrate`. We did find code that aborts potential
compactions (and tidier tasks):

`setup` enumerates all `ColumnFamilyStore` instances and calls
`ColumnFamilyStore#disableAutoCompaction` in line 303 to abort their potential
compaction tasks. Essentially, this calls `CompactionStrategyManager#disable`
to set the `enabled` flag to false, so that, in line 160 of
`CompactionManager#submitBackground`:
{code:java}
    public List<Future<?>> submitBackground(final ColumnFamilyStore cfs)
    {
        if (cfs.isAutoCompactionDisabled())
        {
            logger.trace("Autocompaction is disabled");
            return Collections.emptyList();
        }
        // ...
    }
{code}
New compaction tasks will not be submitted while autocompaction is disabled.

However, any existing compaction tasks and tidier tasks are still running, and
they may generate more tidier tasks.

Therefore, we propose waiting for the existing compaction tasks (in the
compaction executor) and the tidier tasks (in `nonPeriodicTasks`) after
disabling autocompaction.
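
As a rough sketch of the wait we have in mind (illustrative plain Java; the real change would live in `CassandraDaemon#setup` and use Cassandra's own executors, and the real compaction executor is multi-threaded, so it would need proper accounting rather than a single sentinel): once autocompaction is disabled, draining the producer pool first and the tidier pool second covers all outstanding tidier tasks.
{code:java}
import java.util.concurrent.*;

public class DrainOrderDemo
{
    // Hypothetical barrier: waits for everything already queued on a
    // single-threaded executor (the same idiom as waitForDeletions).
    static void drain(ScheduledExecutorService executor) throws Exception
    {
        executor.schedule(() -> {}, 0, TimeUnit.MILLISECONDS).get();
    }

    public static void main(String[] args) throws Exception
    {
        ScheduledExecutorService compactionExecutor = Executors.newSingleThreadScheduledExecutor();
        ScheduledExecutorService nonPeriodicTasks = Executors.newSingleThreadScheduledExecutor();

        // An in-flight compaction enqueues a tidier task when it finishes.
        compactionExecutor.submit(() ->
            nonPeriodicTasks.submit(() -> System.out.println("tidier done")));

        // With autocompaction disabled, no new compactions can appear, so:
        drain(compactionExecutor); // 1) all in-flight compactions have finished
        drain(nonPeriodicTasks);   // 2) all tidiers they produced have finished

        System.out.println("now it is safe to set setupCompleted");
        compactionExecutor.shutdown();
        nonPeriodicTasks.shutdown();
    }
}
{code}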

 

P.S.

Version 3.11.10, the call stack of `CassandraDaemon#main` submitting 
`ColumnFamilyStore$Flush` tasks:
{code:java}
org.apache.cassandra.service.CassandraDaemon#main:786
org.apache.cassandra.service.CassandraDaemon#activate:633
org.apache.cassandra.service.CassandraDaemon#setup:261
org.apache.cassandra.schema.LegacySchemaMigrator#migrate:83
org.apache.cassandra.schema.LegacySchemaMigrator#unloadLegacySchemaTables:137
java.lang.Iterable#forEach:75
org.apache.cassandra.schema.LegacySchemaMigrator#lambda$unloadLegacySchemaTables$1:137
org.apache.cassandra.db.ColumnFamilyStore#invalidate:542
org.apache.cassandra.db.ColumnFamilyStore#invalidate:566
org.apache.cassandra.db.SystemKeyspace#removeTruncationRecord:668
org.apache.cassandra.db.SystemKeyspace#forceBlockingFlush:829
org.apache.cassandra.db.ColumnFamilyStore#forceFlush:954
org.apache.cassandra.db.ColumnFamilyStore#switchMemtableIfCurrent:885
org.apache.cassandra.db.ColumnFamilyStore#switchMemtable:904
{code}
Version 3.11.10, the call stack of `ColumnFamilyStore$Flush#run` submitting 
`CompactionManager$BackgroundCompactionCandidate`:
{code:java}
java.lang.Thread#run:748
org.apache.cassandra.concurrent.NamedThreadFactory#lambda$threadLocalDeallocator$0:84
java.util.concurrent.ThreadPoolExecutor$Worker#run:624
java.util.concurrent.ThreadPoolExecutor#runWorker:1149
org.apache.cassandra.db.ColumnFamilyStore$Flush#run:1128
org.apache.cassandra.db.ColumnFamilyStore$Flush#flushMemtable:1155
org.apache.cassandra.db.ColumnFamilyStore#replaceFlushed:1642
org.apache.cassandra.db.compaction.CompactionStrategyManager#replaceFlushed:400
org.apache.cassandra.db.compaction.CompactionManager#submitBackground:186
{code}
Version 3.11.10, the call stack of 
`CompactionManager$BackgroundCompactionCandidate` submitting tidier tasks:
{code:java}
java.lang.Thread#run:748
org.apache.cassandra.concurrent.NamedThreadFactory#lambda$threadLocalDeallocator$0:84
java.util.concurrent.ThreadPoolExecutor$Worker#run:624
java.util.concurrent.ThreadPoolExecutor#runWorker:1149
java.util.concurrent.FutureTask#run:266
java.util.concurrent.Executors$RunnableAdapter#call:511
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate#run:272
org.apache.cassandra.db.compaction.AbstractCompactionTask#execute:61
org.apache.cassandra.db.compaction.CompactionTask#executeInternal:85
org.apache.cassandra.utils.WrappedRunnable#run:28
org.apache.cassandra.db.compaction.CompactionTask#runMayThrow:223
org.apache.cassandra.db.compaction.writers.CompactionAwareWriter#finish:123
org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional#finish:179
org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional#prepareToCommit:168
org.apache.cassandra.io.sstable.SSTableRewriter#doPrepare:385
org.apache.cassandra.db.lifecycle.LifecycleTransaction#checkpoint:327
org.apache.cassandra.db.lifecycle.LifecycleTransaction#checkpoint:356
org.apache.cassandra.utils.concurrent.Refs#release:241
org.apache.cassandra.utils.concurrent.Ref#release:119
org.apache.cassandra.utils.concurrent.Ref$State#release:225
org.apache.cassandra.utils.concurrent.Ref$GlobalState#release:326
org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier#tidy:2205 
{code}



> Add synchronization to wait for outstanding tasks in the compaction executor 
> and nonPeriodicTasks during CassandraDaemon setup
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17564
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17564
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction
>            Reporter: Haoze Wu
>            Priority: Normal
>


