Joey Lynch created CASSANDRA-16552:
--------------------------------------

             Summary: Anticompaction appears to race with Compaction, preventing forward compaction progress after an incremental repair
                 Key: CASSANDRA-16552
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16552
             Project: Cassandra
          Issue Type: Improvement
          Components: Local/Compaction
            Reporter: Joey Lynch
         Attachments: CompactionStuck.png, anticompaction_before_issue.txt

While testing 4.0-rc1 on a 12 i3en.2xlarge x 2 region (AWS us-east-1 and 
eu-west-1) cluster I attempted to run {{nodetool repair}} while the cluster was 
taking moderate read/write load. 

The first repair worked as expected, but when I ran an incremental repair the 
second time, multiple nodes got stuck trying to compact the unrepaired sstables. 
They are now spinning with:
{noformat}
$ nt compactionstats
pending tasks: 827
- acceptance_josephl.acceptance_josephl_cass4: 827

$ nt tpstats            
Pool Name                    Active Pending Completed Blocked All time blocked
RequestResponseStage         0      0       422359133 0       0               
MutationStage                0      0       164540628 0       0               
ReadStage                    0      0       198857844 0       0               
CompactionExecutor           0      0       60782     0       0    


$ tail system.log
DEBUG [CompactionExecutor:684] 2021-03-31 15:13:59,902 LeveledManifest.java:292 - L0 is too far behind, performing size-tiering there first
DEBUG [CompactionExecutor:684] 2021-03-31 15:13:59,908 LeveledManifest.java:292 - L0 is too far behind, performing size-tiering there first
WARN  [CompactionExecutor:684] 2021-03-31 15:13:59,912 LeveledCompactionStrategy.java:154 - Could not acquire references for compacting SSTables 
[BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4826-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4872-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4849-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4874-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4841-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4897-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4924-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4837-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4926-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4729-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4723-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4875-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4922-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4920-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4869-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4823-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4846-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4873-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4840-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4833-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4829-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4726-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4923-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4925-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4905-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4876-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4901-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4732-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4909-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4915-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4921-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4860-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4693-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4694-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4692-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4691-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4696-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4697-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4700-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4698-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4688-big-Data.db'),
 BigTableReader(path='/mnt/data/cassandra/data/acceptance_josephl/acceptance_josephl_cass4-6144e790917c11ebafb40b81cbd6fb3d/na-4689-big-Data.db')]
 which is not a problem per se,unless it happens frequently, in which case it must be reported. Will retry later.
{noformat}

I've attached some starting breadcrumbs. I believe the issue is a potential 
race in 
[marking sstables for compaction|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/compaction/LeveledCompactionStrategy.java#L152-L161],
which gets null back from 
[tryModify|https://github.com/apache/cassandra/blob/d42087a63309178b96909c012dd0073fe0b6ea11/src/java/org/apache/cassandra/db/lifecycle/Tracker.java#L100].
I think that can only happen under a 
[small number of circumstances|https://github.com/apache/cassandra/blob/d42087a63309178b96909c012dd0073fe0b6ea11/src/java/org/apache/cassandra/db/lifecycle/View.java#L269].
From the initial investigation, it appears that only the unrepaired 
products get into this state.
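
To make the suspected failure mode concrete, here is a minimal sketch. It is not Cassandra's actual code: {{TryModifySketch}}, {{tryMarkCompacting}} and the string sstable names are hypothetical stand-ins. It assumes the marking step is all-or-nothing, returning null when any candidate is already marked compacting or is no longer live. Under that assumption, if some other operation (anticompaction is the suspect here) holds even one of the candidates and never releases it, every retry with the same candidate set fails in the same way.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for the tracker state; sstables are plain strings.
final class TryModifySketch {
    private final Set<String> live = new HashSet<>();
    private final Set<String> compacting = new HashSet<>();

    synchronized void addLive(String sstable) {
        live.add(sstable);
    }

    // Assumed all-or-nothing semantics: returns null if any candidate is
    // already marked compacting or is not in the live set.
    synchronized Set<String> tryMarkCompacting(Collection<String> candidates) {
        for (String s : candidates) {
            if (compacting.contains(s) || !live.contains(s)) {
                return null;
            }
        }
        compacting.addAll(candidates);
        return new HashSet<>(candidates);
    }

    synchronized void unmarkCompacting(Collection<String> sstables) {
        compacting.removeAll(sstables);
    }

    public static void main(String[] args) {
        TryModifySketch tracker = new TryModifySketch();
        tracker.addLive("na-4688");
        tracker.addLive("na-4689");

        // Assumption: another operation marks na-4689 and does not release it.
        tracker.tryMarkCompacting(Arrays.asList("na-4689"));

        // The compaction strategy keeps proposing the same overlapping candidates.
        List<String> candidates = Arrays.asList("na-4688", "na-4689");
        for (int attempt = 1; attempt <= 3; attempt++) {
            Set<String> marked = tracker.tryMarkCompacting(candidates);
            System.out.println("attempt " + attempt + ": "
                    + (marked == null ? "could not acquire references" : "marked " + marked));
        }
    }
}
```

In this model the candidates only become markable again once the holder calls {{unmarkCompacting}}, which would be consistent with pending tasks piling up while CompactionExecutor shows no active work.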

I have a heap dump containing the View state, but it contains potentially 
sensitive infrastructure details, so if you're debugging this, just message me 
on Slack and I can send it to you directly.

The following mitigation appears to unstick the nodes via a forced full 
compaction:
{noformat}
nodetool stop COMPACTION; nodetool compact <ks> <table>
{noformat}
I'm not confident in this mitigation though.



