[ https://issues.apache.org/jira/browse/CASSANDRA-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Philip Thompson updated CASSANDRA-8571:
---------------------------------------
    Reproduced In: 2.1.2
    Fix Version/s: 2.1.3

> Free space management does not work very well
> ---------------------------------------------
>
>                 Key: CASSANDRA-8571
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8571
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Bartłomiej Romański
>             Fix For: 2.1.3
>
>
> Hi all,
> We've got an 18-node cluster running 2.1.2, each node equipped with 3x 480 GB SSDs (JBOD). We mostly use LCS.
> Recently, our nodes started failing with 'no space left on device'. It all started with our own mistake: we let LCS accumulate too much in L0.
> As a result, STCS woke up and we ended up with some big sstables on each node (roughly 5-10 sstables of 20-50 GB each).
> During normal operation we keep our disks about 50% full. This leaves about 200 GB free on each of them, which was too little to compact all the accumulated L0 sstables at once. Cassandra kept trying to do exactly that and kept failing...
> Eventually, we managed to stabilize the situation (with some crazy code hacking, manually moving sstables, etc.). However, there are a few things that would make recovering from such situations much more automatic...
> First, please look at DiskAwareRunnable.runMayThrow(). This method initializes a local variable, writeSize. I believe we should check somewhere here whether the chosen disk has enough space. The problem is that writeSize is never read... Am I missing something here? (See sketch 1 after the quoted message.)
> Btw, in STCS we first look for the least overloaded disk, and only then (if there is more than one such disk) for the one with the most free space (please note the sort order in Directories.getWriteableLocation()). That's often suboptimal (it's usually better to wait for the bigger disk than to compact fewer sstables right away), but probably not crucial. (Sketch 2 below illustrates the ordering.)
> Second, the strategy (used by LCS) of first choosing a target disk and then using it for the whole compaction is not the best one. Big compactions (e.g. after massive operations like bootstrap or repair, or after LCS trouble like in our case) on small drives (e.g. a JBOD of SSDs) will never succeed this way. A much better strategy would be to choose the target drive for each output sstable separately, or at least to round-robin them. (See sketch 3 below.)
> Third, it would be helpful if the MAX_COMPACTING_L0 check in LeveledManifest.getCandidatesFor() were expanded to also limit the total space of the selected candidates. After an STCS fallback in L0 you end up with very big sstables, and 32 of them is just too much for one compaction on small drives. (See sketch 4 below.)
> We finally used a hack similar to the last option (it was the easiest one to implement in a hurry), but any of the improvements described above would have saved us from all this.
> Thanks,
> BR
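
Sketch 1. A minimal standalone illustration of the check suggested for DiskAwareRunnable.runMayThrow(): once writeSize has been estimated, verify that the chosen data directory can actually hold that much before starting the write. This is not Cassandra's actual code; the class and method names here are made up for illustration.

{code:java}
import java.io.File;
import java.io.IOException;

public class FreeSpaceCheck
{
    // Fail fast instead of letting the compaction die later with
    // 'no space left on device'. writeSize is the estimated on-disk
    // size of the sstable(s) about to be written.
    public static void ensureEnoughSpace(File dataDir, long writeSize) throws IOException
    {
        long usable = dataDir.getUsableSpace();
        if (usable < writeSize)
            throw new IOException(String.format(
                "Not enough space on %s: need %d bytes, only %d usable",
                dataDir, writeSize, usable));
    }

    public static void main(String[] args) throws IOException
    {
        // Pretend the compaction is about to write 10 MB into the temp dir.
        File dir = new File(System.getProperty("java.io.tmpdir"));
        ensureEnoughSpace(dir, 10L * 1024 * 1024);
        System.out.println("Enough space on " + dir);
    }
}
{code}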
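
Sketch 2. An illustration of the ordering described in the "btw" paragraph: candidates are sorted first by pending compaction load, with free space only breaking ties, so a nearly idle but small disk beats a busy disk with far more room. Candidate is a simplified stand-in, not Cassandra's Directories code, and the sizes are invented.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DiskChoice
{
    static final class Candidate
    {
        final String path;
        final int estimatedTasks; // pending compactions targeting this disk
        final long freeSpace;     // usable bytes

        Candidate(String path, int estimatedTasks, long freeSpace)
        {
            this.path = path;
            this.estimatedTasks = estimatedTasks;
            this.freeSpace = freeSpace;
        }
    }

    public static void main(String[] args)
    {
        List<Candidate> disks = new ArrayList<>();
        disks.add(new Candidate("/data1", 0, 50L << 30));  // idle, 50 GB free
        disks.add(new Candidate("/data2", 1, 200L << 30)); // busy, 200 GB free

        // Ordering as described in the report: least loaded wins,
        // free space is only a tie-breaker.
        disks.sort(Comparator.comparingInt((Candidate c) -> c.estimatedTasks)
                             .thenComparing(Comparator.comparingLong(
                                 (Candidate c) -> c.freeSpace).reversed()));
        System.out.println("least-loaded first: " + disks.get(0).path); // /data1

        // The reporter's point: for a big compaction it may be better
        // to wait for the disk with the most free space instead.
        disks.sort(Comparator.comparingLong((Candidate c) -> c.freeSpace).reversed());
        System.out.println("most-free first:    " + disks.get(0).path); // /data2
    }
}
{code}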
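
Sketch 3. One way to implement the per-output-sstable placement suggested for LCS: instead of pinning a single target disk for the whole compaction, hand out a data directory per output sstable, here with a simple round-robin. All names are illustrative; a real version would presumably also skip directories that lack the space for the next sstable (as in sketch 1).

{code:java}
import java.io.File;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinWriter
{
    private final List<File> dataDirs;
    private final AtomicInteger next = new AtomicInteger();

    public RoundRobinWriter(List<File> dataDirs)
    {
        this.dataDirs = dataDirs;
    }

    // Directory for the next output sstable; floorMod keeps the
    // index valid even if the counter eventually overflows.
    public File nextDirectory()
    {
        return dataDirs.get(Math.floorMod(next.getAndIncrement(), dataDirs.size()));
    }

    public static void main(String[] args)
    {
        RoundRobinWriter writer = new RoundRobinWriter(List.of(
            new File("/data1"), new File("/data2"), new File("/data3")));
        // A big compaction's output is now spread across all three disks
        // instead of landing on whichever one was picked up front.
        for (int sstable = 0; sstable < 5; sstable++)
            System.out.println("sstable " + sstable + " -> " + writer.nextDirectory());
    }
}
{code}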
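
Sketch 4. The third suggestion as a standalone sketch: cap the L0 candidates chosen in LeveledManifest.getCandidatesFor() not only by count (MAX_COMPACTING_L0 is 32) but also by their total on-disk size. SSTable is a stand-in type here, and the 160 GB byte cap is an arbitrary example value, not an actual Cassandra constant.

{code:java}
import java.util.ArrayList;
import java.util.List;

public class CandidateLimit
{
    static final int MAX_COMPACTING_L0 = 32;
    static final long MAX_COMPACTING_L0_BYTES = 160L << 30; // hypothetical cap

    record SSTable(String name, long onDiskLength) {}

    // Take candidates in order until either the count or the size limit
    // would be exceeded, so one compaction can never outgrow the disk.
    static List<SSTable> limitCandidates(List<SSTable> candidates)
    {
        List<SSTable> picked = new ArrayList<>();
        long totalBytes = 0;
        for (SSTable s : candidates)
        {
            if (picked.size() >= MAX_COMPACTING_L0
                || totalBytes + s.onDiskLength() > MAX_COMPACTING_L0_BYTES)
                break;
            picked.add(s);
            totalBytes += s.onDiskLength();
        }
        return picked;
    }

    public static void main(String[] args)
    {
        List<SSTable> l0 = new ArrayList<>();
        for (int i = 0; i < 10; i++)
            l0.add(new SSTable("big-" + i, 40L << 30)); // ten 40 GB sstables
        // Only four 40 GB sstables fit under the 160 GB cap, long before
        // the 32-sstable count limit would kick in.
        System.out.println(limitCandidates(l0).size() + " sstables selected");
    }
}
{code}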