[ https://issues.apache.org/jira/browse/CASSANDRA-10510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955393#comment-14955393 ]

Alan Boudreault commented on CASSANDRA-10510:
---------------------------------------------

Hello [~Bj0rn], 

Unfortunately, Cassandra 2.0 is EOL (End of Life). Please reopen this ticket if 
you can reproduce the bug in 2.1+.

> Compacted SSTables failing to get removed, overflowing disk
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-10510
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10510
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Björn Hegerfors
>         Attachments: nonReleasedSSTables.txt
>
>
> Short version: it appears that if the resulting SSTable of a compaction 
> enters another compaction soon after, the SSTables participating in the 
> former compaction don't get deleted from disk until Cassandra is restarted.
>
> We have run into a big problem after applying CASSANDRA-10276 and 
> CASSANDRA-10280, backported to 2.0.14. But the bug we're seeing was not 
> introduced by these patches; it has just become very apparent and harmful.
>
> Here's what has happened. We had repair running on our table, which is a 
> time series and uses DTCS. The ring was split into 5016 small ranges being 
> repaired one after the other (using parallel repair, i.e. not snapshot 
> repair). This causes a flood of tiny SSTables to get streamed into all nodes 
> (we don't use vnodes), with timestamp ranges similar to existing SSTables on 
> disk. The problem with that is the sheer number of SSTables; disk usage is 
> not affected. This has been reported before, see CASSANDRA-9644. These 
> SSTables are streamed continuously for up to a couple of days.
>
> The patches were applied to fix the problem of ending up with tens of 
> thousands of SSTables that would never get touched by DTCS. But now that 
> DTCS does touch them, we have run into a new problem instead. While disk 
> usage was in the 25-30% neighborhood before repairs began, it started 
> growing fast when these continuous streams started coming in. Eventually, a 
> couple of nodes ran out of disk, which led us to stop all repairing on the 
> cluster.
>
> This didn't reduce the disk usage. Compactions were, of course, very active. 
> More than doubling disk usage should not be possible, regardless of the 
> choices your compaction strategy makes. And we were not getting orders of 
> magnitude more data streamed in. Large quantities of SSTables, yes, but the 
> extra data was being created by the nodes themselves, out of thin air.
>
> We have a tool to show timestamp and size metadata of SSTables. What we 
> found, looking at all non-tmp data files, was something akin to duplicates of 
> almost all the largest SSTables. Not quite exact replicas, but there were 
> these multi-gigabyte SSTables covering exactly the same range of timestamps 
> and differing in size by mere kilobytes. There were typically 3 of each of 
> the largest SSTables, sometimes even more.
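>
> For illustration only (this is a minimal sketch, not the actual tool, and 
> the generations, timestamps and sizes below are made-up sample values): the 
> pattern can be surfaced by grouping data files on their exact (min, max) 
> timestamp range and flagging any range that has more than one SSTable.
>
> import java.util.*;
> import java.util.stream.*;
>
> // Sketch: flag SSTables whose min/max timestamp ranges are identical,
> // i.e. likely stale ancestors of the same compaction lineage that
> // should already have been deleted.
> public class DuplicateSSTableFinder {
>     static final class Meta {
>         final int generation; final long minTs, maxTs, bytes;
>         Meta(int g, long min, long max, long b) {
>             generation = g; minTs = min; maxTs = max; bytes = b;
>         }
>     }
>
>     public static void main(String[] args) {
>         // Hypothetical sample values, not taken from the attachment.
>         List<Meta> sstables = Arrays.asList(
>             new Meta(101, 1_000L, 2_000L, 8_400_000_123L),
>             new Meta(102, 1_000L, 2_000L, 8_400_012_456L),
>             new Meta(103, 1_000L, 2_000L, 8_400_020_789L),
>             new Meta(104, 3_000L, 4_000L,    12_345_678L));
>
>         // Group by identical (minTs, maxTs) range.
>         Map<String, List<Meta>> byRange = sstables.stream()
>             .collect(Collectors.groupingBy(m -> m.minTs + ":" + m.maxTs));
>
>         // More than one member per range looks like a leaked ancestor.
>         byRange.values().stream()
>             .filter(group -> group.size() > 1)
>             .forEach(group -> {
>                 System.out.println("Same timestamp range, " + group.size()
>                                    + " SSTables:");
>                 group.forEach(m -> System.out.printf("  gen=%d size=%d bytes%n",
>                                                      m.generation, m.bytes));
>             });
>     }
> }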
>
> Here's what I suspect: DTCS is the only compaction strategy that would 
> commonly finish compacting a really large SSTable and, on the very next run 
> of the compaction strategy, nominate the result for yet another compaction, 
> even together with tiny SSTables, which certainly happens in our scenario. 
> Potentially, the large SSTable that participated in the first compaction 
> might even get nominated again by DTCS, if for some reason it can be 
> returned by getUncompactingSSTables.
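>
> A highly simplified illustration of that suspicion (this is not the real 
> DateTieredCompactionStrategy code; the window size, names and nominate() 
> helper are arbitrary): bucketing by timestamp window means a freshly 
> produced compaction result keeps landing in the same window as the tiny 
> SSTables still streaming in, so it is nominated again on the very next run.
>
> import java.util.*;
> import java.util.stream.*;
>
> public class DtcsRenominationSketch {
>     static final long WINDOW_MICROS = 3_600_000_000L; // pretend 1-hour window
>
>     static final class SSTable {
>         final String name; final long maxTimestampMicros;
>         SSTable(String name, long maxTs) { this.name = name; maxTimestampMicros = maxTs; }
>         public String toString() { return name; }
>     }
>
>     // Nominate every window that holds more than one uncompacting SSTable.
>     static List<List<SSTable>> nominate(Collection<SSTable> uncompacting) {
>         return uncompacting.stream()
>             .collect(Collectors.groupingBy(s -> s.maxTimestampMicros / WINDOW_MICROS))
>             .values().stream()
>             .filter(bucket -> bucket.size() > 1)
>             .collect(Collectors.toList());
>     }
>
>     public static void main(String[] args) {
>         long now = 0L; // arbitrary time origin
>         // Run 1: the big result of an earlier compaction plus freshly streamed tiny SSTables.
>         System.out.println("Run 1 nominates: " + nominate(Arrays.asList(
>             new SSTable("big-A", now),
>             new SSTable("tiny-1", now + 5),
>             new SSTable("tiny-2", now + 9))));
>
>         // Run 2: the new result replaces its inputs, but its timestamps are unchanged
>         // and streaming never stops, so it is immediately nominated again.
>         System.out.println("Run 2 nominates: " + nominate(Arrays.asList(
>             new SSTable("big-B", now + 9),
>             new SSTable("tiny-3", now + 12))));
>     }
> }
>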
> Whatever the reason, I have collected evidence showing that these large 
> "duplicate" SSTables are of the same "lineage". Only one should remain on 
> disk: the latest one. The older ones have already been compacted, resulting 
> in the newer ones. But for some reason, they never got deleted from disk. And 
> this was really harmful when combining DTCS with continuously streaming in 
> tiny SSTables. The same thing, but worse, would happen without the patches 
> and with max_sstable_age_days uncapped.
>
> Attached is one occurrence of 3 duplicated SSTables, their metadata and log 
> lines about their compactions. You can see how similar they were to each 
> other. SSTable generations 374277, 374249, 373702 (the large one), 374305, 
> 374231 and 374333 completed compaction at 04:05:26,878, yet they were all 
> still on disk over 6 hours later. At 04:05:26,898 the result, 374373, entered 
> another compaction with 375174. They also stayed around after that compaction 
> finished. Literally all SSTables named in these log lines were still on disk 
> when I checked! Only one should have remained: 375189.
>
> Now, this was just one random example from the data I collected. This 
> happened everywhere; some SSTables should probably have been deleted a day 
> earlier.
>
> However, once we restarted the nodes, all of the duplicates were suddenly 
> gone!


