[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923125#comment-13923125 ]
Benedict commented on CASSANDRA-6285: ------------------------------------- So, I think there may potentially be at least two races in the off heap deallocation. I suspect that may not be everything, though, as these two races probably won't cause the problem often. These are predicated on the assumption that thrift doesn't copy data from the DirectByteBuffer that the hsha server provides to it, so could be wrong, but anyway: 1) CL appends can be lagged behind the memtable update and, as a result, the acknowledgment to the client of success writing. If the CL record contains the ByteBuffer when it is freed, and that data is then reused in another allocation, it will write incorrect data to the commit log. 2) I believe thrift calls are two stage. If this is the case, and the client disconnects in between sending the first stage and receiving the result in the second stage, the buffer could be freed whilst still in flight to the memtable/CL These are just quick ideas for where it might be, I haven't familiarised myself fully with thrift, the disruptor etc. to be certain if these are plausible, but it may turn out to be useful so thought I'd share. > 2.0 HSHA server introduces corrupt data > --------------------------------------- > > Key: CASSANDRA-6285 > URL: https://issues.apache.org/jira/browse/CASSANDRA-6285 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2 > Reporter: David Sauer > Assignee: Pavel Yaskevich > Priority: Critical > Fix For: 2.0.6 > > Attachments: 6285_testnotes1.txt, > CASSANDRA-6285-disruptor-heap.patch, compaction_test.py > > > After altering everything to LCS the table OpsCenter.rollups60 amd one other > none OpsCenter-Table got stuck with everything hanging around in L0. > The compaction started and ran until the logs showed this: > ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java > (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime] > java.lang.RuntimeException: Last written key > DecoratedKey(1326283851463420237, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) > >= current key DecoratedKey(954210699457429663, > 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) > writing into > /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db > at > org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141) > at > org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164) > at > org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160) > at > org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60) > at > org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59) > at > org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296) > at > org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:724) > Moving back to STC worked to keep the compactions running. > Especialy my own Table i would like to move to LCS. > After a major compaction with STC the move to LCS fails with the same > Exception. -- This message was sent by Atlassian JIRA (v6.2#6252)