[ 
https://issues.apache.org/jira/browse/CASSANDRA-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182626#comment-14182626
 ] 

Alexander Sterligov edited comment on CASSANDRA-6285 at 10/24/14 9:29 AM:
--------------------------------------------------------------------------

[~xedin] That NPE happened only once and unfortunately I did not save it. If I get it once more, I'll save the sstable.
I completely removed the OpsCenter keyspace (including its sstables) and recreated it. I don't get "Last written key DecoratedKey" any more. By the way, that error definitely causes streams to hang at 100%.
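(For the record, the drop/recreate was nothing fancy; roughly along the lines below. The data path shown is the default data_file_directories location and may differ on other setups.)
{quote}
# drop the OpsCenter keyspace, then clear any leftover sstable directories
echo 'DROP KEYSPACE "OpsCenter";' | cqlsh
sudo rm -rf /var/lib/cassandra/data/OpsCenter   # default data path; adjust to your data_file_directories
# the keyspace and its tables were then recreated
{quote}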

I have several strange things happening now:
  - I've noticed that it takes about 30 minutes between starting "nodetool repair" and the first pending AntiEntropySession appearing. Is that ok?
  - Repair has already been running for 24 hours (~13GB per node, 17 nodes). How many AntiEntropySessions does a single repair need to complete? Is it the number of key ranges?
{quote}
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
CounterMutationStage              0         0              0         0                 0
ReadStage                         0         0         392196         0                 0
RequestResponseStage              0         0        5271906         0                 0
MutationStage                     0         0       19832506         0                 0
ReadRepairStage                   0         0           2280         0                 0
GossipStage                       0         0         453830         0                 0
CacheCleanupExecutor              0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
ValidationExecutor                0         0          39446         0                 0
MemtableReclaimMemory             0         0          29927         0                 0
InternalResponseStage             0         0         588279         0                 0
AntiEntropyStage                  0         0        5325285         0                 0
MiscStage                         0         0              0         0                 0
CommitLogArchiver                 0         0              0         0                 0
MemtableFlushWriter               0         0          29927         0                 0
PendingRangeCalculator            0         0             30         0                 0
MemtablePostFlush                 0         0         135734         0                 0
CompactionExecutor               31        31         502175         0                 0
AntiEntropySessions               3         3           3446         0                 0
HintedHandoff                     0         0             44         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  0
PAGED_RANGE                  0
BINARY                       0
READ                         0
MUTATION                     2
_TRACE                       0
REQUEST_RESPONSE             0
COUNTER_MUTATION             0
{quote}
  - Some validation compactions report more than 100% progress (e.g. 1923%). I think that's CASSANDRA-7239, right?
  - The number of sstables for some CFs is about 15,000 and continues to grow during repair (a rough way to watch this from the shell is sketched after the exception trace below).
  - There are several exceptions like the following during repair:
{quote}
ERROR [RepairJobTask:80] 2014-10-24 13:27:31,717 RepairJob.java:127 - Error occurred during snapshot phase
java.lang.RuntimeException: Could not create snapshot at /37.140.189.163
        at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:77) ~[apache-cassandra-2.1.0.jar:2.1.0]
        at org.apache.cassandra.net.MessagingService$5$1.run(MessagingService.java:347) ~[apache-cassandra-2.1.0.jar:2.1.0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
        at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
ERROR [AntiEntropySessions:141] 2014-10-24 13:27:31,724 RepairSession.java:303 - [repair #da2cb020-5b5f-11e4-a45e-d9cec1206f33] session completed with the following error
java.io.IOException: Failed during snapshot creation.
        at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.0.jar:2.1.0]
        at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:128) ~[apache-cassandra-2.1.0.jar:2.1.0]
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
        at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
ERROR [AntiEntropySessions:141] 2014-10-24 13:27:31,724 CassandraDaemon.java:166 - Exception in thread Thread[AntiEntropySessions:141,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
        at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.jar:na]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) ~[apache-cassandra-2.1.0.jar:2.1.0]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_51]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_51]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_51]
        at java.lang.Thread.run(Thread.java:744) [na:1.7.0_51]
Caused by: java.io.IOException: Failed during snapshot creation.
        at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.0.jar:2.1.0]
        at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:128) ~[apache-cassandra-2.1.0.jar:2.1.0]
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
        ... 3 common frames omitted
{quote}
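(For reference, a rough way to watch the repair from the shell; these are just the standard nodetool commands, and the grep pattern and 5-minute interval are arbitrary choices:)
{quote}
# crude monitoring loop, run per node
while true; do
  date
  nodetool tpstats | grep -E 'AntiEntropySessions|ValidationExecutor|CompactionExecutor'  # pending/active repair sessions and compactions
  nodetool compactionstats   # per-compaction progress, including validation compactions
  nodetool netstats          # streaming status
  sleep 300
done
{quote}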


> 2.0 HSHA server introduces corrupt data
> ---------------------------------------
>
>                 Key: CASSANDRA-6285
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6285
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 4 nodes, shortly updated from 1.2.11 to 2.0.2
>            Reporter: David Sauer
>            Assignee: Pavel Yaskevich
>            Priority: Critical
>             Fix For: 2.0.8
>
>         Attachments: 6285_testnotes1.txt, 
> CASSANDRA-6285-disruptor-heap.patch, cassandra-attack-src.zip, 
> compaction_test.py, disruptor-high-cpu.patch, 
> disruptor-memory-corruption.patch, enable_reallocate_buffers.txt
>
>
> After altering everything to LCS, the table OpsCenter.rollups60 and one other 
> non-OpsCenter table got stuck with everything hanging around in L0.
> The compaction started and ran until the logs showed this:
> ERROR [CompactionExecutor:111] 2013-11-01 19:14:53,865 CassandraDaemon.java (line 187) Exception in thread Thread[CompactionExecutor:111,1,RMI Runtime]
> java.lang.RuntimeException: Last written key DecoratedKey(1326283851463420237, 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574426c6f6f6d46696c746572537061636555736564) >= current key DecoratedKey(954210699457429663, 37382e34362e3132382e3139382d6a7576616c69735f6e6f72785f696e6465785f323031335f31305f30382d63616368655f646f63756d656e74736c6f6f6b75702d676574546f74616c4469736b5370616365557365640b0f) writing into /var/lib/cassandra/data/OpsCenter/rollups60/OpsCenter-rollups60-tmp-jb-58656-Data.db
>       at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:141)
>       at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:164)
>       at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:160)
>       at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
>       at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>       at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:60)
>       at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
>       at org.apache.cassandra.db.compaction.CompactionManager$6.runMayThrow(CompactionManager.java:296)
>       at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:724)
> Moving back to STCS worked to keep the compactions running.
> It is especially my own table that I would like to move to LCS.
> After a major compaction with STCS, the move to LCS fails with the same 
> exception.


