[ https://issues.apache.org/jira/browse/CASSANDRA-13545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037233#comment-16037233 ]

Dmitry Erokhin commented on CASSANDRA-13545:
--------------------------------------------

One of our engineers found at least one issue that leads to this condition. His findings are below.
---

With a consistent reproduction outside of the production cluster, I downloaded the Cassandra source code, set up a remote debugger (Eclipse), and connected it to the Cassandra process running on my node.
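For reference, attaching a remote debugger like this typically means starting the JVM with a JDWP agent and pointing Eclipse at that port, e.g. something like the following in cassandra-env.sh (the exact flags and port are an assumption on my part, not what was necessarily used here):
{noformat}
JVM_OPTS="$JVM_OPTS -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=4242"
{noformat}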
 
At this point I was able to set breakpoints and examine the live system, starting at the last frame in the traceback (org.apache.cassandra.io.sstable.IndexSummary.<init>(IndexSummary.java:86)). Stepping through the code during a live compaction, I was able to determine that the issue is indeed a bug in Cassandra that occurs when it runs a compaction job with a very large number of partitions.
 
The SafeMemoryWriter class is used to build the index summary for the new 
sstable.
{code:java}
public class SafeMemoryWriter extends DataOutputBuffer
{
    private SafeMemory memory;
 
    @SuppressWarnings("resource")
    public SafeMemoryWriter(long initialCapacity)
    {
        this(new SafeMemory(initialCapacity));
    }
 
    private SafeMemoryWriter(SafeMemory memory)
    {
        super(tailBuffer(memory).order(ByteOrder.BIG_ENDIAN));
        this.memory = memory;
    }
 
    public SafeMemory currentBuffer()
    {
        return memory;
    }
 
    @Override
    protected void reallocate(long count)
    {
        long newCapacity = calculateNewSize(count);
        if (newCapacity != capacity())
        {
            long position = length();
            ByteOrder order = buffer.order();
 
            SafeMemory oldBuffer = memory;
            memory = this.memory.copy(newCapacity);
            buffer = tailBuffer(memory);
 
            int newPosition = (int) (position - tailOffset(memory));
            buffer.position(newPosition);
            buffer.order(order);
 
            oldBuffer.free();
        }
    }
 
    public void setCapacity(long newCapacity)
    {
        reallocate(newCapacity);
    }
 
    public void close()
    {
        memory.close();
    }
 
    public Throwable close(Throwable accumulate)
    {
        return memory.close(accumulate);
    }
 
    public long length()
    {
        return tailOffset(memory) + buffer.position();
    }
 
    public long capacity()
    {
        return memory.size();
    }
 
    @Override
    public SafeMemoryWriter order(ByteOrder order)
    {
        super.order(order);
        return this;
    }
 
    @Override
    public long validateReallocation(long newSize)
    {
        return newSize;
    }
 
    private static long tailOffset(Memory memory)
    {
        return Math.max(0, memory.size - Integer.MAX_VALUE);
    }
 
    private static ByteBuffer tailBuffer(Memory memory)
    {
        return memory.asByteBuffer(tailOffset(memory), (int) Math.min(memory.size, Integer.MAX_VALUE));
    }
}
{code}
This class appears to be intended to work with buffers larger than Integer.MAX_VALUE. However, if the initial size of the buffer is already larger than that, the initial value of length() will be incorrect (it won't be zero), and writes via the DataOutputBuffer will land in the wrong location (they won't start at offset 0).
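To make the failure mode concrete, here is a minimal sketch (my own illustration, not Cassandra code; the capacity is the value that appears in the error further down):
{code:java}
// A SafeMemoryWriter created with an initial capacity above Integer.MAX_VALUE
// starts life with a non-zero length() (assumed example values):
long initialCapacity = 3_355_443_200L;  // 40 bytes * ~83.9M expected entries

// tailOffset(memory) = max(0, size - Integer.MAX_VALUE)
long tailOffset = Math.max(0L, initialCapacity - Integer.MAX_VALUE);  // 1,207,959,553

// buffer.position() is 0 before anything has been written, so
// length() = tailOffset + 0 = 1,207,959,553 instead of 0, and the first
// byte written lands at that offset in the Memory rather than at offset 0.
{code}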
 
 
{code:java}
    public IndexSummaryBuilder(long expectedKeys, int minIndexInterval, int samplingLevel)
    {
        this.samplingLevel = samplingLevel;
        this.startPoints = Downsampling.getStartPoints(BASE_SAMPLING_LEVEL, samplingLevel);

        long maxExpectedEntries = expectedKeys / minIndexInterval;
        if (maxExpectedEntries > Integer.MAX_VALUE)
        {
            // that's a _lot_ of keys, and a very low min index interval
            int effectiveMinInterval = (int) Math.ceil((double) Integer.MAX_VALUE / expectedKeys);
            maxExpectedEntries = expectedKeys / effectiveMinInterval;
            assert maxExpectedEntries <= Integer.MAX_VALUE : maxExpectedEntries;
            logger.warn("min_index_interval of {} is too low for {} expected keys; using interval of {} instead",
                        minIndexInterval, expectedKeys, effectiveMinInterval);
            this.minIndexInterval = effectiveMinInterval;
        }
        else
        {
            this.minIndexInterval = minIndexInterval;
        }

        // for initializing data structures, adjust our estimates based on the sampling level
        maxExpectedEntries = Math.max(1, (maxExpectedEntries * samplingLevel) / BASE_SAMPLING_LEVEL);
        offsets = new SafeMemoryWriter(4 * maxExpectedEntries).order(ByteOrder.nativeOrder());
        entries = new SafeMemoryWriter(40 * maxExpectedEntries).order(ByteOrder.nativeOrder());

        // the summary will always contain the first index entry (downsampling will never remove it)
        nextSamplePosition = 0;
        indexIntervalMatches++;
    }
{code}
The bug occurs when the entries table in the index summary for the new sstable is larger than Integer.MAX_VALUE bytes (2 GiB). This happens when expectedKeys > Integer.MAX_VALUE / 40 * minIndexInterval. The partitions in our blocks table have a mean size of 179 bytes, so we would expect to see issues on this table for compactions over about 1.12 TiB.
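As a sanity check on that estimate (my arithmetic, using the 40-byte entry size and the default interval from the constructor above):
{code:java}
// Threshold at which the initial size of the 'entries' buffer exceeds 2 GiB:
long bytesPerEntry = 40L;
long defaultMinIndexInterval = 128L;
long keyThreshold = Integer.MAX_VALUE / bytesPerEntry * defaultMinIndexInterval;
// keyThreshold ≈ 6.87e9 keys; at a 179-byte mean partition size that is
// about 6.87e9 * 179 ≈ 1.23e12 bytes ≈ 1.12 TiB of data in one compaction.
{code}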
 
The default value of min_index_interval is 128, but it is adjustable per table and can be used to avoid this condition. It should be set to a power of 2. I ran this CQL on my test node:
{code:sql}
ALTER TABLE tablename.blocks WITH min_index_interval = 512;
{code}
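Since the threshold scales linearly with min_index_interval, this setting should buy roughly 4x headroom. A back-of-the-envelope check (my arithmetic, assuming the same 179-byte mean partition size):
{code:java}
// Hypothetical estimate with the new setting:
long newKeyThreshold = Integer.MAX_VALUE / 40L * 512;  // ~2.75e10 keys
// ~2.75e10 keys * 179 bytes ≈ 4.9e12 bytes ≈ 4.5 TiB per compaction before
// the entries buffer again exceeds 2 GiB.
{code}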
Since making this change, I haven't seen the assertion. The compaction has proceeded much farther than before, but it has not completed yet because it is so large.
{noformat}
$ nodetool compactionstats -H
pending tasks: 1
                                     id   compaction type    keyspace    table    completed   total     unit    progress
   9965f4b0-4749-11e7-b21c-91cb0a91f895        Compaction    tablename   blocks   629.51 GB   1.34 TB   bytes   45.71%
Active compaction remaining time :        n/a
{noformat}
I would expect that making this change would fix the issue for all future 
compactions on all nodes.
 
The index summary is used to reduce disk I/O against the sstable index. A larger index interval produces a less efficient (coarser) index summary and therefore more I/O against the sstable index. However, min_index_interval is only a floor; the effective interval is tuned automatically by Cassandra. On p10 it is already 2048 for the larger blocks sstables, so I would not expect a performance impact.
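To put rough numbers on that tradeoff (my own estimate, with an assumed 10-billion-partition sstable purely for illustration):
{code:java}
// Rough memory/IO tradeoff of the index summary (assumed example numbers):
long partitions = 10_000_000_000L;            // hypothetical sstable partition count
long interval = 2048L;                        // effective interval observed on p10
long summaryEntries = partitions / interval;  // ~4.9M sampled entries kept in memory
// A point read binary-searches the summary, then scans at most ~interval
// entries of the on-disk sstable index from the nearest sampled position.
{code}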

Compaction later failed with a new error:
{code}
ERROR [CompactionExecutor:6] 2017-06-04 10:15:26,115 CassandraDaemon.java:185 - Exception in thread Thread[CompactionExecutor:6,1,RMI Runtime]
java.lang.AssertionError: Illegal bounds [-2147483648..-2147483640); size: 3355443200
        at org.apache.cassandra.io.util.Memory.checkBounds(Memory.java:339) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.SafeMemory.checkBounds(SafeMemory.java:104) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.Memory.getLong(Memory.java:260) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.compress.CompressionMetadata.chunkFor(CompressionMetadata.java:224) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.CompressedSegmentedFile.createMappedSegments(CompressedSegmentedFile.java:80) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile.<init>(CompressedPoolingSegmentedFile.java:38) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile$Builder.complete(CompressedPoolingSegmentedFile.java:101) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:188) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.util.SegmentedFile$Builder.complete(SegmentedFile.java:179) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinal(BigTableWriter.java:345) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openFinalEarly(BigTableWriter.java:333) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:297) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.io.sstable.SSTableRewriter.doPrepare(SSTableRewriter.java:345) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:169) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.doPrepare(CompactionAwareWriter.java:79) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.prepareToCommit(Transactional.java:169) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.utils.concurrent.Transactional$AbstractTransactional.finish(Transactional.java:179) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.writers.CompactionAwareWriter.finish(CompactionAwareWriter.java:89) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:196) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:74) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:256) ~[apache-cassandra-2.2.5.jar:2.2.5]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_131]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
{code}


> Exception in CompactionExecutor leading to tmplink files not being removed
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13545
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13545
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction
>            Reporter: Dmitry Erokhin
>
> We are facing an issue where compactions fail on a few nodes with the following message:
> {code}
> ERROR [CompactionExecutor:1248] 2017-05-22 15:32:55,390 CassandraDaemon.java:185 - Exception in thread Thread[CompactionExecutor:1248,1,main]
> java.lang.AssertionError: null
>       at org.apache.cassandra.io.sstable.IndexSummary.<init>(IndexSummary.java:86) ~[apache-cassandra-2.2.5.jar:2.2.5]
>       at org.apache.cassandra.io.sstable.IndexSummaryBuilder.build(IndexSummaryBuilder.java:235) ~[apache-cassandra-2.2.5.jar:2.2.5]
>       at org.apache.cassandra.io.sstable.format.big.BigTableWriter.openEarly(BigTableWriter.java:316) ~[apache-cassandra-2.2.5.jar:2.2.5]
>       at org.apache.cassandra.io.sstable.SSTableRewriter.maybeReopenEarly(SSTableRewriter.java:170) ~[apache-cassandra-2.2.5.jar:2.2.5]
>       at org.apache.cassandra.io.sstable.SSTableRewriter.append(SSTableRewriter.java:115) ~[apache-cassandra-2.2.5.jar:2.2.5]
>       at org.apache.cassandra.db.compaction.writers.DefaultCompactionWriter.append(DefaultCompactionWriter.java:64) ~[apache-cassandra-2.2.5.jar:2.2.5]
>       at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:184) ~[apache-cassandra-2.2.5.jar:2.2.5]
>       at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-2.2.5.jar:2.2.5]
>       at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:74) ~[apache-cassandra-2.2.5.jar:2.2.5]
>       at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) ~[apache-cassandra-2.2.5.jar:2.2.5]
>       at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:256) ~[apache-cassandra-2.2.5.jar:2.2.5]
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_121]
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_121]
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_121]
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
>       at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> {code}
> Also, the number of tmplink files in /var/lib/cassandra/data/<keyspace name>/blocks/tmplink* is growing constantly until the node runs out of space. Restarting Cassandra removes all tmplink files, but the issue continues.
> We are using Cassandra 2.2.5 on Debian 8 with Oracle Java 8:
> {code}
> root@cassandra-p10:/var/lib/cassandra/data/mugenstorage/blocks-33167ef0447a11e68f3e5b42fc45b62f# dpkg -l | grep -E "java|cassandra"
> ii  cassandra                      2.2.5                        all          distributed storage system for structured data
> ii  cassandra-tools                2.2.5                        all          distributed storage system for structured data
> ii  java-common                    0.52                         all          Base of all Java packages
> ii  javascript-common              11                           all          Base support for JavaScript library packages
> ii  oracle-java8-installer         8u121-1~webupd8~0            all          Oracle Java(TM) Development Kit (JDK) 8
> ii  oracle-java8-set-default       8u121-1~webupd8~0            all          Set Oracle JDK 8 as default Java
> {code}


