[ 
https://issues.apache.org/jira/browse/CASSANDRA-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557767#comment-13557767
 ] 

Robert Coli commented on CASSANDRA-4446:
----------------------------------------

How to reproduce it, from the multiple reports:

1) Drain and stop cluster with counters on 1.0.x
2) Start same cluster on 1.1.x
3) Notice commitlog replay of the counter columnfamily and that your counters 
have over-counted
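The over-count in step 3 follows from counter mutations being deltas rather than absolute values, so replaying an already-flushed commitlog segment is not idempotent. A toy sketch of that effect (illustrative names only, not Cassandra internals):

```python
# Toy model of why replaying counter mutations over-counts.
# Counter increments are deltas, so applying the same commitlog
# segment twice doubles their effect.

def apply_mutations(counter_value, commitlog):
    """Apply each logged increment (a delta) to the counter."""
    for delta in commitlog:
        counter_value += delta
    return counter_value

log = [1, 1, 1]                  # three +1 increments, already flushed
value = apply_mutations(0, log)  # correct value after the flush: 3

# If drain fails to mark the segment flushed, restart replays it:
value = apply_mutations(value, log)
print(value)                     # 6: the counter has over-counted
```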

Attached is a log from the latest reporter, 
CASSANDRA-4446--1.0.12_to_1.1.8.txt. It shows the following:

1) Drain starts and completes on 1.0.12
2) Cluster then starts on 1.1.8, and replays the commit log
3) As part of commitlog replay, it flushes various CFs, including 
titan3/RMEntityCount/, which is a counter columnfamily; the machine has 4 GB of 
heap, and the flush happens while Thrift is down and before the node has jumped 
state to normal, so it seems reasonable to conjecture that this flush is part of 
commitlog replay
4) It then logs "10698 replayed mutations", which adds further support to the 
idea that these counter writes are part of the replay
5) The operator then noticed that a significant percentage of records in this 
columnfamily had over-counted
                
> nodetool drain sometimes doesn't mark commitlog fully flushed
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-4446
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4446
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core, Tools
>         Environment: ubuntu 10.04 64bit
> Linux HOSTNAME 2.6.32-345-ec2 #48-Ubuntu SMP Wed May 2 19:29:55 UTC 2012 
> x86_64 GNU/Linux
> sun JVM
> cassandra 1.0.10 installed from apache deb
>            Reporter: Robert Coli
>            Assignee: Jonathan Ellis
>            Priority: Minor
>             Fix For: 1.2.1
>
>         Attachments: 4446.txt, 
> cassandra.1.0.10.replaying.log.after.exception.during.drain.txt, 
> CASSANDRA-4446--1.0.12_to_1.1.8.txt
>
>
> I recently wiped a customer's QA cluster. I drained each node and verified 
> that they were drained. When I restarted the nodes, I saw the commitlog 
> replay create a memtable and then flush it. I have attached a sanitized log 
> snippet from a representative node at the time. 
> It appears to show the following:
> 1) Drain begins
> 2) Drain triggers flush
> 3) Flush triggers compaction
> 4) StorageService logs DRAINED message
> 5) compaction thread throws an exception
> 6) on restart, same CF creates a memtable
> 7) and then flushes it [1]
> The columnfamily involved in the replay in 7) is the CF for which the 
> compaction thread threw an exception in 5). This suggests a timing issue 
> whereby the exception in 5) prevents the flush in 3) from marking all the 
> segments flushed, causing them to replay after restart.
> In case it might be relevant, I did an online change of compaction strategy 
> from Leveled to SizeTiered during the uptime period preceding this drain.
> [1] Isn't commitlog replay supposed to not automatically trigger a flush in 
> modern Cassandra?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
