Hello - I have a 48-node C* cluster spread across 4 AWS regions with RF=3. A few months ago I started noticing disk usage on some nodes increasing consistently. At first I solved the problem by destroying and rebuilding the affected nodes, but the problem keeps coming back.
I did some more investigation recently, and this is what I found:

- I narrowed the problem down to a CF that uses TWCS, simply by looking at disk space usage
- in each region, 3 nodes have this problem of growing disk space (matches the replication factor)
- on each node, I tracked the problem down to a particular SSTable using `sstableexpiredblockers`
- in that SSTable, using `sstabledump`, I found a row that does not have a TTL like the other rows; it appears to be from someone else on the team testing something and forgetting to include a TTL
- all other rows show "expired: true" except this one, hence my suspicion
- when I query for that particular partition key, I get no results
- I tried deleting the row anyway, but that didn't seem to change anything
- I also tried `nodetool scrub`, but that didn't help either

Would this rogue row without a TTL explain the problem? If so, why? If not, does anyone have any other ideas? And why does the row show up in `sstabledump` but not when I query for it?

I appreciate any help or suggestions!

- Mike
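
For anyone who wants to reproduce the check described above, here is a minimal sketch of how the suspect cell can be found programmatically. It assumes the JSON layout that `sstabledump` emits in Cassandra 3.x (a list of partitions, each with `rows`, each row with `cells` where expiring cells carry `ttl` and `expired` fields); the file path is a placeholder.

```python
import json

def find_unexpired_cells(dump_path):
    """Scan sstabledump JSON output for cells that were written without a
    TTL or have not yet expired (assumes the Cassandra 3.x dump layout)."""
    with open(dump_path) as f:
        partitions = json.load(f)
    suspects = []
    for part in partitions:
        key = part.get("partition", {}).get("key")
        for row in part.get("rows", []):
            for cell in row.get("cells", []):
                # Expiring cells carry "ttl" and "expired"; a cell written
                # without a TTL has neither, so TWCS can never drop its
                # SSTable as fully expired.
                if "ttl" not in cell or not cell.get("expired", False):
                    suspects.append((key, row.get("clustering"), cell.get("name")))
    return suspects

# Usage (path is a placeholder for the actual Data.db dump):
#   sstabledump /var/lib/cassandra/data/ks/cf/md-42-big-Data.db > dump.json
#   then call find_unexpired_cells("dump.json")
```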