[ https://issues.apache.org/jira/browse/CASSANDRA-11889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303217#comment-15303217 ]
Stefania commented on CASSANDRA-11889:
--------------------------------------

This is not only a different problem, but a much more serious one too. We had too many files open and it looks like the Cassandra process exited. Could you open a new ticket with as many details as you can, please?

> LogRecord: file system race condition may cause verify() to fail
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-11889
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11889
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 3.0.x, 3.x
>
>
> The following exception was reported in CASSANDRA-11470. It occurred whilst listing files with compaction in progress:
> {code}
> WARN  [CompactionExecutor:2006] 2016-05-23 18:23:31,694 BigTableWriter.java:171 - Writing large partition test_keyspace/test_columnfamily:eda6b9c36f8df6fe596492c3438d7a38e9b109a6 (123663388 bytes)
> INFO  [IndexSummaryManager:1] 2016-05-23 18:24:23,731 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> WARN  [CompactionExecutor:2006] 2016-05-23 18:24:56,669 BigTableWriter.java:171 - Writing large partition test_keyspace/test_columnfamily:05b6b424194dd19ab7cfbcd53c4979768cd859e9 (256286063 bytes)
> WARN  [CompactionExecutor:2006] 2016-05-23 18:26:23,575 BigTableWriter.java:171 - Writing large partition test_keyspace/test_columnfamily:04e9fac15552b9ae77c27a6cb8d0fd11fdcc24d7 (212445557 bytes)
> INFO  [CompactionExecutor:2005] 2016-05-23 18:29:26,839 LeveledManifest.java:437 - Adding high-level (L3) BigTableReader(path='/data/cassandra/data/test_keyspace/test_columnfamily_2-d29dd71045a811e59aff6776bf484396/ma-61041-big-Data.db') to candidates
> WARN  [CompactionExecutor:2006] 2016-05-23 18:30:34,154 BigTableWriter.java:171 - Writing large partition test_keyspace/test_columnfamily:edbe6f178503be90911dbf29a55b97a4b095a9ec (183852539 bytes)
> INFO  [CompactionExecutor:2006] 2016-05-23 18:31:21,080 LeveledManifest.java:437 - Adding high-level (L3) BigTableReader(path='/data/cassandra/data/test_keyspace/test_columnfamily_2-d29dd71045a811e59aff6776bf484396/ma-61042-big-Data.db') to candidates
> ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-23 18:31:21,207 LogFile.java:173 - Unexpected files detected for sstable [ma-91034-big], record [REMOVE:[/data/cassandra/data/test_keyspace/test_columnfamily-3996ce80b7ac11e48a9b6776bf484396/ma-91034-big,1463992176000,8][457420186]]: last update time [00:00:00] should have been [08:29:36]
> ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-23 18:31:21,208 ScheduledReporter.java:119 - RuntimeException thrown from GraphiteReporter#report. Exception was suppressed.
> java.lang.RuntimeException: Failed to list files in /data/cassandra/data/test_keyspace/test_columnfamily-3996ce80b7ac11e48a9b6776bf484396
>     at org.apache.cassandra.db.lifecycle.LogAwareFileLister.list(LogAwareFileLister.java:57) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.db.lifecycle.LifecycleTransaction.getFiles(LifecycleTransaction.java:547) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:691) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.db.Directories$SSTableLister.listFiles(Directories.java:662) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.db.Directories$TrueFilesSizeVisitor.<init>(Directories.java:981) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.db.Directories.getTrueAllocatedSizeIn(Directories.java:893) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.db.Directories.trueSnapshotsSize(Directories.java:883) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.db.ColumnFamilyStore.trueSnapshotsSize(ColumnFamilyStore.java:2332) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.metrics.TableMetrics$32.getValue(TableMetrics.java:637) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.metrics.TableMetrics$32.getValue(TableMetrics.java:634) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.metrics.TableMetrics$33.getValue(TableMetrics.java:692) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.metrics.TableMetrics$33.getValue(TableMetrics.java:686) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:281) ~[metrics-graphite-3.1.0.jar:3.1.0]
>     at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:158) ~[metrics-graphite-3.1.0.jar:3.1.0]
>     at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) ~[metrics-core-3.1.0.jar:3.1.0]
>     at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) ~[metrics-core-3.1.0.jar:3.1.0]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_91]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_91]
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_91]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
>     at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
> Caused by: org.apache.cassandra.db.lifecycle.LogTransaction$CorruptTransactionLogException: Some records failed verification. See earlier in log for details.
>     at org.apache.cassandra.db.lifecycle.LogAwareFileLister.readTxnLog(LogAwareFileLister.java:114) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.db.lifecycle.LogAwareFileLister.classifyFiles(LogAwareFileLister.java:106) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[na:1.8.0_91]
>     at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_91]
>     at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[na:1.8.0_91]
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_91]
>     at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_91]
>     at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[na:1.8.0_91]
>     at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[na:1.8.0_91]
>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_91]
>     at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[na:1.8.0_91]
>     at org.apache.cassandra.db.lifecycle.LogAwareFileLister.innerList(LogAwareFileLister.java:75) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     at org.apache.cassandra.db.lifecycle.LogAwareFileLister.list(LogAwareFileLister.java:53) ~[apache-cassandra-3.0.6.jar:3.0.6]
>     ... 22 common frames omitted
> INFO  [CompactionExecutor:2005] 2016-05-23 18:32:27,284 LeveledManifest.java:437 - Adding high-level (L3) BigTableReader(path='/data/cassandra/data/test_keyspace/test_columnfamily_3-37a9cb90b7ac11e48a9b6776bf484396/ma-98189-big-Data.db') to candidates
> INFO  [CompactionExecutor:2005] 2016-05-23 18:34:42,562 LeveledManifest.java:437 - Adding high-level (L3) BigTableReader(path='/data/cassandra/data/test_keyspace/test_columnfamily_3-37a9cb90b7ac11e48a9b6776bf484396/ma-98210-big-Data.db') to candidates
> {code}
> The error of interest is:
> {code}
> ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-23 18:31:21,207 LogFile.java:173 - Unexpected files detected for sstable [ma-91034-big], record [REMOVE:[/data/cassandra/data/test_keyspace/test_columnfamily-3996ce80b7ac11e48a9b6776bf484396/ma-91034-big,1463992176000,8][457420186]]: last update time [00:00:00] should have been [08:29:36]
> {code}
> The problem is this code here in LogRecord:
> {code}
> public LogRecord withExistingFiles()
> {
>     return make(type, getExistingFiles(), 0, absolutePath.get());
> }
>
> public static LogRecord make(Type type, List<File> files, int minFiles, String absolutePath)
> {
>     long lastModified = files.stream().map(File::lastModified).reduce(0L, Long::max);
>     return new LogRecord(type, absolutePath, lastModified, Math.max(minFiles, files.size()));
> }
> {code}
> If the sstable files are completely deleted after getExistingFiles() has returned them and before lastModified is calculated, then we may report the last update time as zero, because the files no longer exist, while still indicating that there are files on disk, so that this code in LogFile.verifyRecord fails:
> {code}
> record.status.onDiskRecord = record.withExistingFiles();
> if (record.updateTime != record.status.onDiskRecord.updateTime && record.status.onDiskRecord.numFiles > 0)
> {
>     record.setError(String.format("Unexpected files detected for sstable [%s], " +
>                                   "record [%s]: last update time [%tT] should have been [%tT]",
>                                   record.fileName(),
>                                   record,
>                                   record.status.onDiskRecord.updateTime,
>                                   record.updateTime));
> }
> {code}
> As a consequence, we throw an exception when listing files. In this case we were calculating the true snapshot size for the metrics.
> Note that we are careful to delete files from oldest to newest, so that the last update time should always match what is stored in the txn log; however, if all the files are deleted, then we report the last update time as zero. So, if my analysis is correct, this problem should not be very frequent.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
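The race described in the ticket hinges on a documented quirk of the JDK: `File.lastModified()` returns `0L` when the file no longer exists, rather than throwing. A minimal standalone sketch (hypothetical class name, not Cassandra code) reproduces the inconsistent state that trips `LogFile.verifyRecord`, by deleting a file between the point where the file list is captured and the point where the mtime reduce runs:

```java
import java.io.File;
import java.io.IOException;
import java.util.Collections;
import java.util.List;

// Sketch of the race: the file list is captured first (standing in for
// getExistingFiles()), the file is then deleted by a "concurrent" actor,
// and only afterwards is lastModified computed -- exactly the window
// described in the ticket.
public class LastModifiedRaceSketch
{
    public static void main(String[] args) throws IOException
    {
        File f = File.createTempFile("ma-91034-big", "-Data.db");
        List<File> files = Collections.singletonList(f); // snapshot of the listing

        f.delete(); // the concurrent deletion wins the race

        // Same reduce as LogRecord.make(): File.lastModified() now returns 0L
        long lastModified = files.stream().map(File::lastModified).reduce(0L, Long::max);
        int numFiles = files.size(); // still 1 -- the list was captured earlier

        // Inconsistent state: "files on disk" with last update time 00:00:00
        System.out.println(lastModified == 0L && numFiles > 0);
    }
}
```

Running this prints `true`: the record would claim `numFiles > 0` with an update time of zero, which is the `[00:00:00] should have been [08:29:36]` mismatch in the log above.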
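One way to avoid the inconsistency (a hypothetical mitigation sketch, not the fix that was actually committed for this ticket) is to read each file's mtime exactly once and let a `0L` reading stand in for "file vanished", so that `numFiles` and `lastModified` are always derived from the same snapshot and can never disagree:

```java
import java.io.File;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: compute lastModified and the file count in a single
// pass, dropping entries whose mtime reads as 0L (i.e. files deleted between
// the directory listing and this call). numFiles then counts only files that
// were demonstrably still present when their mtime was read.
public class ConsistentSnapshotSketch
{
    /** Returns { lastModified, numFiles } over the files still present. */
    public static long[] snapshot(List<File> files)
    {
        List<Long> mtimes = files.stream()
                                 .map(File::lastModified)
                                 .filter(m -> m > 0L)   // 0L => file no longer exists
                                 .collect(Collectors.toList());
        long lastModified = mtimes.stream().reduce(0L, Long::max);
        return new long[] { lastModified, mtimes.size() };
    }
}
```

With this shape, the all-files-deleted case yields `{ 0L, 0 }`, so the `numFiles > 0` guard in `verifyRecord` would no longer fire on a stale listing. Note a per-file race remains inherent to any non-atomic file-system API; the point is only that the two values are derived from one consistent observation per file.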