[ https://issues.apache.org/jira/browse/CASSANDRA-15284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17764312#comment-17764312 ]
Sebastian Marsching commented on CASSANDRA-15284:
-------------------------------------------------

If you are not interested in the detailed analysis of this problem and just want to learn where I believe the root of this bug lies, you can safely skip to the _Conclusion_ section below.

We just encountered this bug in Cassandra 4.1.3 when using {{sstablescrub}}, and we also saw it when using {{nodetool scrub}}. For Cassandra 4.1.3, the stack trace looks like this:

{code:java}
java.lang.AssertionError: null
	at org.apache.cassandra.io.compress.CompressionMetadata$Chunk.<init>(CompressionMetadata.java:492)
	at org.apache.cassandra.io.compress.CompressionMetadata.chunkFor(CompressionMetadata.java:251)
	at org.apache.cassandra.io.util.MmappedRegions.updateState(MmappedRegions.java:163)
	at org.apache.cassandra.io.util.MmappedRegions.<init>(MmappedRegions.java:73)
	at org.apache.cassandra.io.util.MmappedRegions.<init>(MmappedRegions.java:61)
	at org.apache.cassandra.io.util.MmappedRegions.map(MmappedRegions.java:104)
	at org.apache.cassandra.io.util.FileHandle$Builder.complete(FileHandle.java:373)
	at org.apache.cassandra.io.util.FileHandle$Builder.complete(FileHandle.java:340)
	at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinal(BigTableWriter.java:429)
	at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinalEarly(BigTableWriter.java:404)
	at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:316)
	at org.apache.cassandra.io.sstable.SSTableRewriter.doPrepare(SSTableRewriter.java:364)
	at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:168)
	at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish(Transactional.java:179)
	at org.apache.cassandra.io.sstable.SSTableRewriter.finish(SSTableRewriter.java:351)
	at org.apache.cassandra.db.compaction.Scrubber.scrub(Scrubber.java:318)
	at org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:1276)
	at org.apache.cassandra.db.compaction.CompactionManager$2.execute(CompactionManager.java:518)
	at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:438)
	at org.apache.cassandra.concurrent.FutureTask.call(FutureTask.java:61)
	at org.apache.cassandra.concurrent.FutureTask.run(FutureTask.java:71)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
{code}

For {{sstablescrub}}, the stack trace looks slightly different because {{Scrubber.scrub}} is called from a different location.

So, I started {{sstablescrub}} inside a debugger and intercepted the {{AssertionError}} to get a better idea of what is happening here. The {{CompressionMetadata}} instance that causes the problem has been created by calling {{CompressionMetadata$Writer.open}}, and {{CompressionMetadata.chunkFor}} is called for a position beyond the range covered by the offsets added to the {{CompressionMetadata$Writer}} (and thus beyond the offsets available to {{CompressionMetadata.chunkFor}}). {{CompressionMetadata$Writer.open}} copies the {{chunkOffsets}} to the new {{CompressionMetadata}} object, but it receives the {{dataLength}} as a parameter and calculates the {{chunkOffsetsSize}} from that number.
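To illustrate the mechanism, here is a minimal, self-contained sketch (not the actual Cassandra code; the names {{ChunkForSketch}} and {{chunkIndexFor}} are mine) of how a chunkFor-style lookup breaks when the {{dataLength}} implies more chunks than the writer actually recorded offsets for:

{code:java}
// Minimal sketch (illustrative only, not Cassandra's implementation) of how a
// chunkFor-style lookup fails when dataLength implies more chunks than were
// actually recorded by the writer.
public class ChunkForSketch {
    static final int CHUNK_SIZE = 65536; // 64 KiB, as in the affected table

    // Offsets recorded by the writer, one per flushed chunk: 3 chunks here.
    static final long[] chunkOffsets = {0L, 40000L, 81000L};

    // The chunk index is derived purely from the requested position.
    static int chunkIndexFor(long position) {
        return (int) (position / CHUNK_SIZE);
    }

    public static void main(String[] args) {
        long dataLength = 4L * CHUNK_SIZE;  // caller claims 4 chunks' worth of data
        long position = dataLength - 1;     // a valid position according to dataLength
        int idx = chunkIndexFor(position);  // = 3, but only indices 0..2 were recorded
        System.out.println("index=" + idx + " recorded=" + chunkOffsets.length);
        // In Cassandra, reading chunk offsets at such an index yields nonsense
        // values, and the Chunk constructor's length assertion fires.
    }
}
{code}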
This means that when the {{dataLength}} is greater than the compression chunk size multiplied by the number of chunks for which the {{CompressionMetadata$Writer}} has offsets, the resulting {{CompressionMetadata}} will trigger the {{AssertionError}} in {{chunkFor}} when asked for a position that is within the {{dataLength}} but not covered by any chunk with a known offset.

But why does the {{dataLength}} passed to {{CompressionMetadata$Writer.open}} not match the number of chunk offsets? Chunk offsets are added to the {{CompressionMetadata$Writer}} in {{CompressedSequentialWriter.flushData}}. This is also where {{CompressedSequentialWriter.uncompressedSize}}, which is later passed to {{CompressionMetadata$Writer.open}} when {{CompressedSequentialWriter.open}} is called, is incremented. {{CompressedSequentialWriter.addOffset}} is called every time a chunk is written. As the buffer size is equal to the chunk size, and {{addOffset}} is called every time {{uncompressedSize}} is incremented, {{uncompressedSize}} should be less than or equal to the number of chunk offsets multiplied by the chunk size. However, when observing the {{AssertionError}}, the {{uncompressedSize}} is actually greater.

Here are the numbers from my debugging session (attributes of the {{CompressedSequentialWriter}}):

{noformat}
chunkCount       = 1051767
chunkOffset      = 10238962958
compressedSize   = 10247354746
lastFlushOffset  = 69023412828
uncompressedSize = 69023412828
{noformat}

The associated {{CompressionMetadata$Writer}} has a {{count}} of 1051767 (which matches the {{chunkCount}} in the {{CompressedSequentialWriter}}), and the chunk size is 64 KiB. When we multiply the chunk size by the chunk count, we find that the {{uncompressedSize}} should be no more than 68,928,602,112 bytes, but it is 69,023,412,828 bytes.
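The violated invariant can be checked directly from the numbers above:

{code:java}
// Verifying the invariant from the debugging session: uncompressedSize must
// not exceed chunkCount * chunkSize, but with the observed values it does.
public class InvariantCheck {
    public static void main(String[] args) {
        long chunkCount = 1_051_767L;           // observed chunkCount
        long chunkSize = 64L * 1024;            // 64 KiB
        long uncompressedSize = 69_023_412_828L; // observed uncompressedSize

        long maxCovered = chunkCount * chunkSize;
        System.out.println("maxCovered=" + maxCovered);
        System.out.println("excess=" + (uncompressedSize - maxCovered));
        // The last ~94.8 MB of uncompressed data have no recorded chunk
        // offset, so chunkFor trips the AssertionError for those positions.
    }
}
{code}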
The {{uncompressedSize}} is not modified outside {{CompressedSequentialWriter.flushData}}, but the {{chunkCount}} is also modified in {{CompressedSequentialWriter.resetAndTruncate}}.

*Conclusion*

I believe that the root of this bug is in {{CompressedSequentialWriter.resetAndTruncate}}: this method should either reset {{uncompressedSize}}, or it should not reset the {{chunkCount}} (and in the latter case not call {{CompressionMetadata$Writer.resetAndTruncate}} either). I am about 95% sure that the first option is the right one, but I would appreciate it if anyone with a better understanding of this logic could chime in.

> AssertionError while scrubbing sstable
> --------------------------------------
>
>                 Key: CASSANDRA-15284
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15284
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Gianluigi Tiesi
>            Priority: Normal
>       Attachments: assert-comp-meta.diff
>
> I've got a damaged data file, but while trying to run scrub (online or offline) I always get this error:
>
> {code:java}
> -- StackTrace --
> java.lang.AssertionError
> 	at org.apache.cassandra.io.compress.CompressionMetadata$Chunk.<init>(CompressionMetadata.java:474)
> 	at org.apache.cassandra.io.compress.CompressionMetadata.chunkFor(CompressionMetadata.java:239)
> 	at org.apache.cassandra.io.util.MmappedRegions.updateState(MmappedRegions.java:163)
> 	at org.apache.cassandra.io.util.MmappedRegions.<init>(MmappedRegions.java:73)
> 	at org.apache.cassandra.io.util.MmappedRegions.<init>(MmappedRegions.java:61)
> 	at org.apache.cassandra.io.util.MmappedRegions.map(MmappedRegions.java:104)
> 	at org.apache.cassandra.io.util.FileHandle$Builder.complete(FileHandle.java:362)
> 	at org.apache.cassandra.io.util.FileHandle$Builder.complete(FileHandle.java:331)
> 	at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinal(BigTableWriter.java:336)
> 	at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinalEarly(BigTableWriter.java:318)
> 	at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:322)
> 	at org.apache.cassandra.io.sstable.SSTableRewriter.doPrepare(SSTableRewriter.java:370)
> 	at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:173)
> 	at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish(Transactional.java:184)
> 	at org.apache.cassandra.io.sstable.SSTableRewriter.finish(SSTableRewriter.java:357)
> 	at org.apache.cassandra.db.compaction.Scrubber.scrub(Scrubber.java:291)
> 	at org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:1010)
> 	at org.apache.cassandra.db.compaction.CompactionManager.access$200(CompactionManager.java:83)
> 	at org.apache.cassandra.db.compaction.CompactionManager$3.execute(CompactionManager.java:391)
> 	at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:312)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> At the moment I've moved the corrupted file away. If you need more info, feel free to ask.
>
> According to the source
> [https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/io/compress/CompressionMetadata.java#L474]
> it looks like the requested chunk length is <= 0
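To make the first option from the conclusion above concrete, here is a hedged sketch of the proposed change. The field and method names mirror {{CompressedSequentialWriter}}, but the body is illustrative only: it is not the actual patch, and the real method has additional bookkeeping (buffer state, checksums, the call into {{CompressionMetadata$Writer.resetAndTruncate}}) that is omitted here.

{code:java}
// Illustrative sketch of the proposed fix: when resetAndTruncate rewinds the
// chunk bookkeeping, it should rewind uncompressedSize to the truncation
// target as well, so the dataLength later passed to
// CompressionMetadata$Writer.open never exceeds chunkCount * CHUNK_SIZE.
public class ResetAndTruncateSketch {
    static final int CHUNK_SIZE = 64 * 1024;

    long uncompressedSize;
    int chunkCount;

    void resetAndTruncate(long truncateTarget) {
        // Existing behavior: rewind the chunk count to the truncated position...
        chunkCount = (int) (truncateTarget / CHUNK_SIZE);
        // ...proposed addition: keep uncompressedSize consistent with it.
        uncompressedSize = truncateTarget;
    }

    public static void main(String[] args) {
        ResetAndTruncateSketch w = new ResetAndTruncateSketch();
        w.chunkCount = 100;
        w.uncompressedSize = 100L * CHUNK_SIZE;
        w.resetAndTruncate(10L * CHUNK_SIZE);
        // Invariant restored: uncompressedSize <= chunkCount * CHUNK_SIZE
        System.out.println(w.uncompressedSize <= (long) w.chunkCount * CHUNK_SIZE);
    }
}
{code}

Without the added line, {{uncompressedSize}} keeps the pre-truncation value while {{chunkCount}} shrinks, which is exactly the mismatch observed in the debugger.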