[ https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182626#comment-14182626 ]
Alexander Sterligov edited comment on CASSANDRA-6285 at 10/24/14 9:29 AM:
--------------------------------------------------------------------------

[~xedin] That NPE happened only once and unfortunately I did not save it. If I get it again, I will save the sstable. I completely removed the OpsCenter keyspace (including its sstables) and recreated it. I no longer get the "Last written key DecoratedKey" error. By the way, that error definitely causes streams to hang at 100%.

Several strange things are happening now:
- I've noticed that about 30 minutes pass between running "nodetool repair" and the first pending AntiEntropySession. Is that normal?
- The repair has already been running for 24 hours (~13 GB per node, 17 nodes). How many AntiEntropySessions does a single repair need to finish? Is it the number of key ranges?
{quote}
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
CounterMutationStage              0         0              0         0                 0
ReadStage                         0         0         392196         0                 0
RequestResponseStage              0         0        5271906         0                 0
MutationStage                     0         0       19832506         0                 0
ReadRepairStage                   0         0           2280         0                 0
GossipStage                       0         0         453830         0                 0
CacheCleanupExecutor              0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
ValidationExecutor                0         0          39446         0                 0
MemtableReclaimMemory             0         0          29927         0                 0
InternalResponseStage             0         0         588279         0                 0
AntiEntropyStage                  0         0        5325285         0                 0
MiscStage                         0         0              0         0                 0
CommitLogArchiver                 0         0              0         0                 0
MemtableFlushWriter               0         0          29927         0                 0
PendingRangeCalculator            0         0             30         0                 0
MemtablePostFlush                 0         0         135734         0                 0
CompactionExecutor               31        31         502175         0                 0
AntiEntropySessions               3         3           3446         0                 0
HintedHandoff                     0         0             44         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
PAGED_RANGE                  0
BINARY                       0
READ                         0
MUTATION                     2
_TRACE                       0
REQUEST_RESPONSE             0
COUNTER_MUTATION             0
{quote}
- Some validation compactions report more than 100% progress (1923%). I think that is CASSANDRA-7239, right?
- The number of sstables for some CFs is about 15,000 and keeps growing during the repair.
- There are several exceptions like the following during repair (a manual snapshot check is sketched after the excerpt):
{quote}
ERROR [RepairJobTask:80] 2014-10-24 13:27:31,717 RepairJob.java:127 - Error occurred during snapshot phase
java.lang.RuntimeException: Could not create snapshot at /37.140.189.163
	at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:77) ~[apache-cassandra-2.1.0.jar:2.1.0]
	at org.apache.cassandra.net.MessagingService$5$1.run(MessagingService.java:347) ~[apache-cassandra-2.1.0.jar:2.1.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
	at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
ERROR [AntiEntropySessions:141] 2014-10-24 13:27:31,724 RepairSession.java:303 - [repair #da2cb020-5b5f-11e4-a45e-d9cec1206f33] session completed with the following error
java.io.IOException: Failed during snapshot creation.
	at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.0.jar:2.1.0]
	at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:128) ~[apache-cassandra-2.1.0.jar:2.1.0]
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
	at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
ERROR [AntiEntropySessions:141] 2014-10-24 13:27:31,724 CassandraDaemon.java:166 - Exception in thread Thread[AntiEntropySessions:141,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
	at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.jar:na]
	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) ~[apache-cassandra-2.1.0.jar:2.1.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
	at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_51]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
	at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
Caused by: java.io.IOException: Failed during snapshot creation.
	at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.0.jar:2.1.0]
	at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:128) ~[apache-cassandra-2.1.0.jar:2.1.0]
	at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
	... 3 common frames omitted
{quote}
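Regarding the snapshot failures: it may help to check whether a snapshot can be taken at all outside of repair, directly on the node that reports "Could not create snapshot". A minimal sketch (the repair-test tag name is arbitrary; the data directory matches the paths in this ticket):
{quote}
# Take a snapshot of the OpsCenter keyspace by hand on 37.140.189.163;
# if this also fails, the problem is not specific to repair.
nodetool snapshot -t repair-test OpsCenter

# The snapshot directories should appear under each table's data directory.
ls /var/lib/cassandra/data/OpsCenter/*/snapshots/repair-test

# Clean up the test snapshot afterwards.
nodetool clearsnapshot -t repair-test OpsCenter
{quote}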
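To get an empirical answer to the AntiEntropySessions question, one can poll the repair-related thread pools and streams while the repair runs. A rough monitoring loop (the 5-second interval is arbitrary):
{quote}
# Poll repair activity: AntiEntropySessions and ValidationExecutor counters
# from tpstats, validation compactions from compactionstats, and streams
# (including any stuck at 100%) from netstats.
while true; do
  date
  nodetool tpstats | grep -E 'AntiEntropy|ValidationExecutor|CompactionExecutor'
  nodetool compactionstats
  nodetool netstats
  sleep 5
done
{quote}
The difference in the AntiEntropySessions "Completed" counter before and after a full run should give the number of sessions a single repair actually used on that node.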
> 2.0 HSHA server introduces corrupt data
> ---------------------------------------
>
>                 Key: CASSANDRA-6285
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6285
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2
>            Reporter: David Sauer
>            Assignee: Pavel Yaskevich
>            Priority: Critical
>             Fix For: 2.0.8
>
>         Attachments: 6285_testnotes1.txt, CASSANDRA-6285-disruptor-heap.patch, cassandra-attack-src.zip, compaction_test.py, disruptor-high-cpu.patch, disruptor-memory-corruption.patch, enable_reallocate_buffers.txt
>
>
> After altering everything to LCS, the table OpsCenter.rollups60 and one other non-OpsCenter table got stuck with everything hanging around in L0.
> The compaction started and ran until the logs showed this:
> ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime]
> java.lang.RuntimeException: Last written key DecoratedKey(1326283851463420237, 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) >= current key DecoratedKey(954210699457429663, 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) writing into /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db
> 	at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141)
> 	at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164)
> 	at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160)
> 	at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
> 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> 	at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
> 	at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
> 	at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296)
> 	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:724)
> Moving back to STCS kept the compactions running.
> I would especially like to move my own table to LCS.
> After a major compaction with STCS, the move to LCS fails with the same exception.
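For reference, the LCS/STCS switches described in the report correspond to CQL along these lines (a sketch; the keyspace name must be double-quoted because of its mixed case):
{quote}
$ cqlsh
cqlsh> -- Move the table to leveled compaction ...
cqlsh> ALTER TABLE "OpsCenter".rollups60 WITH compaction = {'class': 'LeveledCompactionStrategy'};
cqlsh> -- ... and back to size-tiered if compactions get stuck.
cqlsh> ALTER TABLE "OpsCenter".rollups60 WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
{quote}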