[ https://issues.apache.org/jira/browse/CASSANDRA-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955393#comment-14955393 ]
Alan Boudreault commented on CASSANDRA-10510:
---------------------------------------------

Hello [~Bj0rn], Unfortunately, Cassandra 2.0 is EOL (End of Life). Please reopen this ticket if you can reproduce the bug in 2.1+.

> Compacted SSTables failing to get removed, overflowing disk
> -----------------------------------------------------------
>
>          Key: CASSANDRA-10510
>          URL: https://issues.apache.org/jira/browse/CASSANDRA-10510
>      Project: Cassandra
>   Issue Type: Bug
>     Reporter: Björn Hegerfors
>  Attachments: nonReleasedSSTables.txt
>
>
> Short version: it appears that if the resulting SSTable of a compaction enters another compaction soon after, the SSTables participating in the former compaction don't get deleted from disk until Cassandra is restarted.
>
> We ran into a big problem after applying CASSANDRA-10276 and CASSANDRA-10280, backported to 2.0.14. But the bug we're seeing is not introduced by these patches; they have just made it very apparent and harmful.
>
> Here's what has happened. We had repair running on our table, which is a time series and uses DTCS. The ring was split into 5016 small ranges being repaired one after the other (using parallel repair, i.e. not snapshot repair). This causes a flood of tiny SSTables to get streamed into all nodes (we don't use vnodes), with timestamp ranges similar to those of existing SSTables on disk. The problem with that is the sheer number of SSTables; disk usage is not affected. This has been reported before, see CASSANDRA-9644. These SSTables are streamed continuously for up to a couple of days.
>
> The patches were applied to fix the problem of ending up with tens of thousands of SSTables that would never get touched by DTCS. But now that DTCS does touch them, we have run into a new problem instead. While disk usage was in the 25-30% neighborhood before repairs began, it started growing fast when these continuous streams started coming in. Eventually, a couple of nodes ran out of disk, which led us to stop all repairing on the cluster. This didn't reduce the disk usage. Compactions were, of course, very active. More than doubling disk usage should not be possible, regardless of the choices the compaction strategy makes. And we were not getting magnitudes more data streamed in: large quantities of SSTables, yes, but the nodes were creating more data out of thin air.
>
> We have a tool that shows timestamp and size metadata of SSTables. What we found, looking at all non-tmp data files, was something akin to duplicates of almost all the largest SSTables. Not quite exact replicas, but there were multi-gigabyte SSTables covering exactly the same range of timestamps and differing in size by mere kilobytes. There were typically 3 copies of each of the largest SSTables, sometimes even more.
>
> Here's what I suspect: DTCS is the only compaction strategy that would commonly finish compacting a really large SSTable and, on the very next run of the compaction strategy, nominate the result for yet another compaction, even together with tiny SSTables, which certainly happens in our scenario. Potentially, the large SSTable that participated in the first compaction might even get nominated again by DTCS, if for some reason it can be returned by getUncompactingSSTables.
>
> Whatever the reason, I have collected evidence showing that these large "duplicate" SSTables are of the same "lineage". Only one should remain on disk: the latest one.
> The older ones have already been compacted, resulting in the newer ones. But for some reason, they never got deleted from disk. And this was really harmful when combining DTCS with continuously streaming in tiny SSTables. The same, but worse, would happen without the patches and an uncapped max_sstable_age_days.
>
> Attached is one occurrence of 3 duplicated SSTables, their metadata and the log lines about their compactions. You can see how similar they were to each other. SSTable generations 374277, 374249, 373702 (the large one), 374305, 374231 and 374333 completed compaction at 04:05:26,878, yet they were all still on disk over 6 hours later. At 04:05:26,898 the result, 374373, entered another compaction with 375174. They also stayed around after that compaction finished. Literally all SSTables named in these log lines were still on disk when I checked! Only one should have remained: 375189.
>
> Now, this was just one random example from the data I collected. This happened everywhere. Some SSTables should probably have been deleted a day earlier. However, once we restarted the nodes, all of the duplicates were suddenly gone!
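For illustration, here is a minimal sketch of the kind of SSTable metadata inspection the report describes: list size and min/max timestamps for every non-tmp data file in a directory, so compacted-but-undeleted "duplicates" (same timestamp range, size differing by kilobytes) stand out. This is not the reporter's actual tool; it assumes Cassandra's bundled sstablemetadata script is on the PATH and that its output contains "Minimum timestamp" / "Maximum timestamp" lines, which may vary between versions.

{code}
#!/usr/bin/env python
# Hypothetical helper (not the reporter's tool): print size and min/max
# timestamps for each non-tmp SSTable data file, sorted by timestamp range,
# so near-identical "duplicate" SSTables group together.
#
# Assumes `sstablemetadata` is on the PATH and emits lines like
# "Minimum timestamp: <n>" / "Maximum timestamp: <n>"; labels can vary
# between Cassandra versions.
import os
import re
import subprocess
import sys

def sstable_timestamps(data_file):
    """Return (min_ts, max_ts) as reported by sstablemetadata, or (None, None)."""
    out = subprocess.check_output(["sstablemetadata", data_file],
                                  stderr=subprocess.STDOUT).decode("utf-8", "replace")
    min_ts = max_ts = None
    for line in out.splitlines():
        m = re.match(r"Minimum timestamp:\s*(\d+)", line)
        if m:
            min_ts = int(m.group(1))
        m = re.match(r"Maximum timestamp:\s*(\d+)", line)
        if m:
            max_ts = int(m.group(1))
    return min_ts, max_ts

def main(data_dir):
    rows = []
    for name in sorted(os.listdir(data_dir)):
        # Skip in-flight compaction output ("tmp" files) and non-data components.
        if not name.endswith("-Data.db") or "tmp" in name:
            continue
        path = os.path.join(data_dir, name)
        min_ts, max_ts = sstable_timestamps(path)
        rows.append((min_ts, max_ts, os.path.getsize(path), name))
    # Sorting by timestamp range makes SSTables covering the same range adjacent.
    rows.sort(key=lambda r: (r[0] if r[0] is not None else 0,
                             r[1] if r[1] is not None else 0))
    for min_ts, max_ts, size, name in rows:
        print("%-20s %-20s %12d  %s" % (min_ts, max_ts, size, name))

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")
{code}

Run against a table's data directory; adjacent lines sharing a timestamp range but differing only slightly in size and in generation number would correspond to the "lineages" of undeleted compaction inputs described above.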