[ https://issues.apache.org/jira/browse/CASSANDRA-11889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303217#comment-15303217 ]

Stefania commented on CASSANDRA-11889:
--------------------------------------

This is not only a different problem but also a much more serious one: there 
were too many open files and it looks like the Cassandra process exited. Could 
you open a new ticket with as many details as you can, please?

> LogRecord: file system race condition may cause verify() to fail
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-11889
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11889
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>            Reporter: Stefania
>            Assignee: Stefania
>             Fix For: 3.0.x, 3.x
>
>
> The following exception was reported in CASSANDRA-11470. It occurred whilst 
> listing files with compaction in progress:
> {code}
> WARN  [CompactionExecutor:2006] 2016-05-23 18:23:31,694 BigTableWriter.java:171 - Writing large partition test_keyspace/test_columnfamily:eda6b9c36f8df6fe596492c3438d7a38e9b109a6 (123663388 bytes)
> INFO  [IndexSummaryManager:1] 2016-05-23 18:24:23,731 IndexSummaryRedistribution.java:74 - Redistributing index summaries
> WARN  [CompactionExecutor:2006] 2016-05-23 18:24:56,669 BigTableWriter.java:171 - Writing large partition test_keyspace/test_columnfamily:05b6b424194dd19ab7cfbcd53c4979768cd859e9 (256286063 bytes)
> WARN  [CompactionExecutor:2006] 2016-05-23 18:26:23,575 BigTableWriter.java:171 - Writing large partition test_keyspace/test_columnfamily:04e9fac15552b9ae77c27a6cb8d0fd11fdcc24d7 (212445557 bytes)
> INFO  [CompactionExecutor:2005] 2016-05-23 18:29:26,839 LeveledManifest.java:437 - Adding high-level (L3) BigTableReader(path='/data/cassandra/data/test_keyspace/test_columnfamily_2-d29dd71045a811e59aff6776bf484396/ma-61041-big-Data.db') to candidates
> WARN  [CompactionExecutor:2006] 2016-05-23 18:30:34,154 BigTableWriter.java:171 - Writing large partition test_keyspace/test_columnfamily:edbe6f178503be90911dbf29a55b97a4b095a9ec (183852539 bytes)
> INFO  [CompactionExecutor:2006] 2016-05-23 18:31:21,080 LeveledManifest.java:437 - Adding high-level (L3) BigTableReader(path='/data/cassandra/data/test_keyspace/test_columnfamily_2-d29dd71045a811e59aff6776bf484396/ma-61042-big-Data.db') to candidates
> ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-23 18:31:21,207 LogFile.java:173 - Unexpected files detected for sstable [ma-91034-big], record [REMOVE:[/data/cassandra/data/test_keyspace/test_columnfamily-3996ce80b7ac11e48a9b6776bf484396/ma-91034-big,1463992176000,8][457420186]]: last update time [00:00:00] should have been [08:29:36]
> ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-23 18:31:21,208 ScheduledReporter.java:119 - RuntimeException thrown from GraphiteReporter#report. Exception was suppressed.
> java.lang.RuntimeException: Failed to list files in /data/cassandra/data/test_keyspace/test_columnfamily-3996ce80b7ac11e48a9b6776bf484396
>       at org.apache.cassandra.db.lifecycle.LogAwareFileLister.list(LogAwareFileLister.java:57) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.db.lifecycle.LifecycleTransaction.getFiles(LifecycleTransaction.java:547) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.db.Directories$SSTableLister.filter(Directories.java:691) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.db.Directories$SSTableLister.listFiles(Directories.java:662) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.db.Directories$TrueFilesSizeVisitor.<init>(Directories.java:981) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.db.Directories.getTrueAllocatedSizeIn(Directories.java:893) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.db.Directories.trueSnapshotsSize(Directories.java:883) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.db.ColumnFamilyStore.trueSnapshotsSize(ColumnFamilyStore.java:2332) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.metrics.TableMetrics$32.getValue(TableMetrics.java:637) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.metrics.TableMetrics$32.getValue(TableMetrics.java:634) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.metrics.TableMetrics$33.getValue(TableMetrics.java:692) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.metrics.TableMetrics$33.getValue(TableMetrics.java:686) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at com.codahale.metrics.graphite.GraphiteReporter.reportGauge(GraphiteReporter.java:281) ~[metrics-graphite-3.1.0.jar:3.1.0]
>       at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:158) ~[metrics-graphite-3.1.0.jar:3.1.0]
>       at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) ~[metrics-core-3.1.0.jar:3.1.0]
>       at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) ~[metrics-core-3.1.0.jar:3.1.0]
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
>       at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_91]
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_91]
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_91]
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_91]
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_91]
>       at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
> Caused by: org.apache.cassandra.db.lifecycle.LogTransaction$CorruptTransactionLogException: Some records failed verification. See earlier in log for details.
>       at org.apache.cassandra.db.lifecycle.LogAwareFileLister.readTxnLog(LogAwareFileLister.java:114) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.db.lifecycle.LogAwareFileLister.classifyFiles(LogAwareFileLister.java:106) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[na:1.8.0_91]
>       at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_91]
>       at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374) ~[na:1.8.0_91]
>       at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_91]
>       at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_91]
>       at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[na:1.8.0_91]
>       at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[na:1.8.0_91]
>       at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_91]
>       at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[na:1.8.0_91]
>       at org.apache.cassandra.db.lifecycle.LogAwareFileLister.innerList(LogAwareFileLister.java:75) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       at org.apache.cassandra.db.lifecycle.LogAwareFileLister.list(LogAwareFileLister.java:53) ~[apache-cassandra-3.0.6.jar:3.0.6]
>       ... 22 common frames omitted
> INFO  [CompactionExecutor:2005] 2016-05-23 18:32:27,284 LeveledManifest.java:437 - Adding high-level (L3) BigTableReader(path='/data/cassandra/data/test_keyspace/test_columnfamily_3-37a9cb90b7ac11e48a9b6776bf484396/ma-98189-big-Data.db') to candidates
> INFO  [CompactionExecutor:2005] 2016-05-23 18:34:42,562 LeveledManifest.java:437 - Adding high-level (L3) BigTableReader(path='/data/cassandra/data/test_keyspace/test_columnfamily_3-37a9cb90b7ac11e48a9b6776bf484396/ma-98210-big-Data.db') to candidates
> {code}
> The error of interest is:
> {code}
> ERROR [metrics-graphite-reporter-1-thread-1] 2016-05-23 18:31:21,207 LogFile.java:173 - Unexpected files detected for sstable [ma-91034-big], record [REMOVE:[/data/cassandra/data/test_keyspace/test_columnfamily-3996ce80b7ac11e48a9b6776bf484396/ma-91034-big,1463992176000,8][457420186]]: last update time [00:00:00] should have been [08:29:36]
> {code}
> The problem is this code in LogRecord:
> {code}
>     public LogRecord withExistingFiles()
>     {
>         return make(type, getExistingFiles(), 0, absolutePath.get());
>     }
>
>     public static LogRecord make(Type type, List<File> files, int minFiles, String absolutePath)
>     {
>         long lastModified = files.stream().map(File::lastModified).reduce(0L, Long::max);
>         return new LogRecord(type, absolutePath, lastModified, Math.max(minFiles, files.size()));
>     }
> {code}
> If the sstable files are completely deleted after getExistingFiles() has 
> returned them but before lastModified is computed, we may report a last 
> update time of zero, because the files no longer exist, while still 
> indicating that there are files on disk. As a result, this check in 
> LogFile.verifyRecord fails:
> {code}
>         record.status.onDiskRecord = record.withExistingFiles();
>         if (record.updateTime != record.status.onDiskRecord.updateTime && record.status.onDiskRecord.numFiles > 0)
>         {
>             record.setError(String.format("Unexpected files detected for sstable [%s], " +
>                                           "record [%s]: last update time [%tT] should have been [%tT]",
>                                           record.fileName(),
>                                           record,
>                                           record.status.onDiskRecord.updateTime,
>                                           record.updateTime));
>         }
> {code}
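> To make the race window concrete, here is a minimal standalone sketch 
> (the file name is hypothetical; java.io.File#lastModified is documented 
> to return 0L when the file does not exist):
> {code}
> import java.io.File;
> import java.io.IOException;
> import java.util.Collections;
> import java.util.List;
>
> public class LastModifiedRace
> {
>     public static void main(String[] args) throws IOException
>     {
>         // Stand-in for a single sstable component file.
>         File f = File.createTempFile("ma-tmp-big", "-Data.db");
>
>         // Step 1: the file is listed while it still exists, as
>         // getExistingFiles() would have done.
>         List<File> files = Collections.singletonList(f);
>
>         // Step 2: before make() computes the update time, another
>         // thread finishes deleting the sstable's files.
>         f.delete();
>
>         // Step 3: lastModified() now returns 0L for the missing file,
>         // so the reduction yields 0 while files.size() is still 1,
>         // exactly the state that makes verifyRecord report
>         // "last update time [00:00:00]".
>         long lastModified = files.stream().map(File::lastModified).reduce(0L, Long::max);
>         System.out.println(lastModified + " across " + files.size() + " file(s)");
>     }
> }
> {code}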
> As a consequence, we throw an exception when listing files; in this case we 
> were calculating the true snapshot size for the metrics.
> Note that we are careful to delete files from oldest to newest, so the last 
> update time should always match what is stored in the txn log; however, if 
> all the files have been deleted, we report the last update time as zero. So, 
> if my analysis is correct, this problem should not be very frequent.
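> If so, one possible mitigation, sketched here purely as an illustration 
> (not as the committed fix), would be for verifyRecord to treat an on-disk 
> update time of zero as evidence of this race rather than of corruption:
> {code}
>         record.status.onDiskRecord = record.withExistingFiles();
>         // Hypothetical guard: an update time of zero means the files were
>         // deleted between the directory listing and the lastModified()
>         // calls, i.e. the race described above, not a genuine mismatch.
>         if (record.updateTime != record.status.onDiskRecord.updateTime &&
>             record.status.onDiskRecord.numFiles > 0 &&
>             record.status.onDiskRecord.updateTime > 0)
>         {
>             record.setError(String.format("Unexpected files detected for sstable [%s], " +
>                                           "record [%s]: last update time [%tT] should have been [%tT]",
>                                           record.fileName(),
>                                           record,
>                                           record.status.onDiskRecord.updateTime,
>                                           record.updateTime));
>         }
> {code}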


