[ 
https://issues.apache.org/jira/browse/CASSANDRA-6797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua McKenzie updated CASSANDRA-6797:
---------------------------------------
    Component/s: Lifecycle
                 Compaction

> compaction and scrub data directories race on startup
> -----------------------------------------------------
>
>                 Key: CASSANDRA-6797
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6797
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Compaction, Lifecycle
>         Environment: macos (and linux)
>            Reporter: Matt Byrd
>            Assignee: Joshua McKenzie
>            Priority: Minor
>              Labels: compaction, concurrency, starting
>             Fix For: 2.0.6, 2.1 beta2
>
>         Attachments: trunk-6797.patch
>
>
>  
> Hi,  
> During a rolling restart of a 2.0.5 cluster in several environments I'm 
> seeing the following error:
> {code}
>  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,549 CompactionTask.java (line 115) Compacting [SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13-Data.db'), SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-15-Data.db'), SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-16-Data.db'), SSTableReader(path='/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-14-Data.db')]
>  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,557 ColumnFamilyStore.java (line 254) Initializing system_traces.sessions
>  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,560 ColumnFamilyStore.java (line 254) Initializing system_traces.events
>  WARN [main] 2014-03-03 17:11:07,608 ColumnFamilyStore.java (line 473) Removing orphans for /Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-13: [CompressionInfo.db, Filter.db, Index.db, TOC.txt, Summary.db, Data.db, Statistics.db]
> ERROR [main] 2014-03-03 17:11:07,609 CassandraDaemon.java (line 479) Exception encountered during startup
> java.lang.AssertionError: attempted to delete non-existing file system-local-jb-13-CompressionInfo.db
>         at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:111)
>         at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:106)
>         at org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:476)
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:264)
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:462)
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:552)
>  INFO [CompactionExecutor:1] 2014-03-03 17:11:07,612 CompactionTask.java (line 275) Compacted 4 sstables to [/Users/Matthew/.ccm/compaction_race/node1/data/system/local/system-local-jb-17,].  10,963 bytes to 5,572 (~50% of original) in 57ms = 0.093226MB/s.  4 total partitions merged to 1.  Partition merge counts were {4:1, }
> {code}
> This looks like a race: compactions are running while the existing data 
> directories are being scrubbed.
> An in-progress compaction probably looks like an incomplete one, so the 
> scrub tries to remove its sstables while the compaction is still running. 
> When scrubDataDirectories then attempts the delete, the file no longer 
> exists, presumably because it has already been compacted away. 
> This causes an AssertionError and the node fails to start up. 
> Here is a ccm script that simply stops and starts a 3-node 2.0.5 cluster 
> repeatedly. 
> It reproduces the problem fairly reliably, in fewer than ten iterations: 
> {code}
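> #!/bin/bash
> # Create and start a 3-node 2.0.5 cluster, then restart it repeatedly.
> ccm create compaction_race -v 2.0.5
> ccm populate -n 3
> ccm start
> for i in $(seq 0 1000); do
>     echo "$i"
>     ccm stop
>     ccm start
>     # Print any ERROR lines logged by any node so far.
>     grep ERR ~/.ccm/compaction_race/*/logs/system.log
> done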
> {code}
>  
> Someone else should probably confirm that this is what is going wrong. 
> If it is, the fix might be as simple as disabling autocompaction slightly 
> earlier in CassandraDaemon.setup. 
>  
> Alternatively, if there is no good reason to scrub the system tables first 
> and then scrub all keyspaces (including the system keyspace) again, the 
> second scrub could cover only the non-system keyspaces.
> Please let me know if there is anything else I can provide.
> Thanks,
> Matt
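
The interleaving suspected in the report can be replayed deterministically
with plain files. The sketch below is not Cassandra code: delete_with_confirm
and simulate_scrub_race are hypothetical helpers, and the three steps run in a
fixed order instead of on separate threads. It only shows why a strict delete
fails once the directory listing it works from has gone stale.

```shell
#!/bin/bash
# Illustrative only: these helpers are hypothetical and are not Cassandra code.

# Strict delete in the spirit of the failing assertion: a file that is
# already gone is treated as an error rather than ignored.
delete_with_confirm() {
    if [ ! -e "$1" ]; then
        echo "attempted to delete non-existing file $(basename "$1")" >&2
        return 1
    fi
    rm -- "$1"
}

# Replays the suspected interleaving in a fixed order.
# Prints "race" if a strict delete hits a missing file, "clean" otherwise.
simulate_scrub_race() {
    local dir result f
    dir=$(mktemp -d) || return 1
    result=clean
    touch "$dir/system-local-jb-13-CompressionInfo.db" \
          "$dir/system-local-jb-13-Data.db"

    # 1. The scrub lists the components it believes are orphaned.
    local components=("$dir"/system-local-jb-13-*)

    # 2. The compaction finishes and removes its input sstable.
    rm -- "$dir"/system-local-jb-13-*

    # 3. The scrub deletes from its now stale listing.
    for f in "${components[@]}"; do
        if ! delete_with_confirm "$f"; then
            result=race
            break
        fi
    done

    rm -rf -- "$dir"
    echo "$result"
}

simulate_scrub_race
```

Swapping steps 1 and 2, or holding step 2 back until step 3 has finished,
removes the failure, which is the effect that disabling autocompaction earlier
during startup would be expected to have.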

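The alternative raised in the report, restricting the second scrub to
non-system keyspaces, amounts to filtering the keyspace directories before
scrubbing them. A minimal sketch follows, assuming the data/<keyspace>/<table>
layout visible in the log paths; non_system_keyspaces is a hypothetical helper,
and treating only "system" as a system keyspace is an assumption made for
illustration.

```shell
#!/bin/bash
# Illustrative only: a hypothetical helper, not part of Cassandra or ccm.

# Prints the name of every keyspace directory under the given data directory,
# skipping the "system" keyspace (assumed to have been scrubbed already).
non_system_keyspaces() {
    local data_dir=$1 ks name
    for ks in "$data_dir"/*/; do
        # An unmatched glob stays literal, so skip anything that is not a directory.
        [ -d "$ks" ] || continue
        name=$(basename "$ks")
        if [ "$name" = "system" ]; then
            continue
        fi
        echo "$name"
    done
}

# Example invocation against one node of the ccm cluster from the report:
#   non_system_keyspaces ~/.ccm/compaction_race/node1/data
```

A second scrub that iterates over this list never touches files under
data/system, so it cannot collide with a compaction of system.local.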


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
