[ https://issues.apache.org/jira/browse/CASSANDRA-10501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974967#comment-14974967 ]
Yuki Morishita commented on CASSANDRA-10501:
--------------------------------------------

Patch looks good to me, though can you run the tests on cassci to make sure it doesn't break anything?

> Failure to start up Cassandra when temporary compaction files are not all renamed after kill/crash (FSReadError)
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10501
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10501
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Cassandra 2.1.6
> Redhat Linux
>            Reporter: Mathieu Roy
>            Assignee: Marcus Eriksson
>              Labels: compaction, triage
>             Fix For: 2.1.x, 2.2.x, 3.0.0
>
>
> We have seen an issue intermittently but repeatedly over the last few months
> where, after exiting the Cassandra process, it fails to start with an
> FSReadError (stack trace below). The FSReadError refers to a 'statistics'
> file that doesn't exist, though a corresponding temporary file does
> exist (e.g. there is no
> /media/data/cassandraDB/data/clusteradmin/singleton_token-01a92ed069b511e59b2c53679a538c14/clusteradmin-singleton_token-ka-9-Statistics.db
> file, but there is a
> /media/data/cassandraDB/data/clusteradmin/singleton_token-01a92ed069b511e59b2c53679a538c14/clusteradmin-singleton_token-tmp-ka-9-Statistics.db
> file).
> We tracked down the issue to the fact that the process exited with leftover
> compactions and some of the 'tmp' files for the SSTable had been renamed to
> final files, but not all of them. The issue happens if the 'Statistics' file
> is not renamed but others are. The scenario we've seen on the last two
> occurrences involves the 'CompressionInfo' file being a final file while all
> other files for the SSTable generation were left with 'tmp' names.
> When this occurs, Cassandra cannot start until the file issue is resolved;
> we've worked around it by deleting the SSTable files from the same
> generation, both final and tmp, which at least allows Cassandra to start.
> Renaming all files to either tmp or final names would also work.
> We've done some debugging in Cassandra and have been unable to cause the
> issue without renaming the files manually. The rename code at
> SSTableWriter.rename() looks like it could result in this if the process
> exits in the middle of the rename, but in every occurrence we've debugged
> through, the Set of components is ordered and Statistics is the first file
> renamed.
> However, the comments in SSTableWriter.rename() suggest that the 'Data' file
> is meant to serve as the marker that the files were completely renamed. The
> method ColumnFamilyStore.removeUnfinishedCompactionLeftovers(), however, will
> proceed assuming the compaction is complete if any of the component files has
> a final name, and will skip temporary files when reading the list. If the
> 'Statistics' file is temporary then it won't be read, and the defaults do
> not include a list of ancestors, leading to the NullPointerException.
> It appears that ColumnFamilyStore.removeUnfinishedCompactionLeftovers()
> should perhaps either ensure that all 'tmp' files are properly renamed before
> it uses them, or skip SSTable files that don't have either the 'Data' or
> 'Statistics' file in final form.
> Stack trace:
> {code}
> FSReadError in Failed to remove unfinished compaction leftovers (file:
> /media/data/cassandraDB/data/clusteradmin/singleton_token-01a92ed069b511e59b2c53679a538c14/clusteradmin-singleton_token-ka-9-Statistics.db).
> See log for details.
>         at org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:617)
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:302)
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:536)
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625)
> Caused by: java.lang.NullPointerException
>         at org.apache.cassandra.db.ColumnFamilyStore.removeUnfinishedCompactionLeftovers(ColumnFamilyStore.java:609)
>         ... 3 more
> Exception encountered during startup: java.lang.NullPointerException
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
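
The partially renamed state described in the report can be spotted from a directory listing before restarting the node. The following is only a rough diagnostic sketch in Python, not part of the patch and not Cassandra code. The file-name pattern is an assumption inferred from the two paths quoted in the description, the helper names `classify_generations` and `partially_renamed` are hypothetical, and the sample listing is illustrative rather than taken from an affected node.

```python
import re
from collections import defaultdict

# Assumed naming, inferred from the paths quoted in the report:
#   <keyspace>-<table>-[tmp-]<version>-<generation>-<Component>.<ext>
SSTABLE_NAME = re.compile(
    r"^(?P<keyspace>[^-]+)-(?P<table>[^-]+)-(?P<tmp>tmp-)?"
    r"(?P<version>[a-z]+)-(?P<generation>\d+)-(?P<component>[A-Za-z]+)\.\w+$"
)


def classify_generations(filenames):
    """Group component files by generation, split into final and tmp sets."""
    generations = defaultdict(lambda: {"final": set(), "tmp": set()})
    for name in filenames:
        match = SSTABLE_NAME.match(name)
        if match is None:
            continue  # not an SSTable component under the assumed naming
        state = "tmp" if match.group("tmp") else "final"
        generation = int(match.group("generation"))
        generations[generation][state].add(match.group("component"))
    return dict(generations)


def partially_renamed(generations, required=("Data", "Statistics")):
    """Map each generation that mixes final and tmp components to the
    required components that have no final-named file."""
    suspects = {}
    for generation, state in generations.items():
        if state["final"] and state["tmp"]:
            suspects[generation] = sorted(
                c for c in required if c not in state["final"]
            )
    return suspects


if __name__ == "__main__":
    # Illustrative listing: CompressionInfo is final, the rest are still tmp.
    listing = [
        "clusteradmin-singleton_token-ka-9-CompressionInfo.db",
        "clusteradmin-singleton_token-tmp-ka-9-Statistics.db",
        "clusteradmin-singleton_token-tmp-ka-9-Data.db",
    ]
    print(partially_renamed(classify_generations(listing)))
```

A generation reported with a non-empty list is one where 'Data' or 'Statistics' is missing in final form while some other component is already final, which is the condition the description suggests removeUnfinishedCompactionLeftovers() should skip. The sketch only reports such generations; it does not delete or rename anything.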