[jira] [Commented] (CASSANDRA-8683) Incremental repairs broken with early opening of compaction results

Marcus Eriksson (JIRA) Mon, 26 Jan 2015 06:41:08 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-8683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291895#comment-14291895
 ]


Marcus Eriksson commented on CASSANDRA-8683:
--------------------------------------------

from irc:
{noformat}
15:27 < marcuse> belliottsmith: we create a new sstable reader over the same 
file, the reference counter is not transfered to the new file, so if i bump the 
reference count on the file i hold a reference to, it will not affect the 
latest one
15:27 < marcuse> does it make sense/
15:27 < marcuse> * transfered to the new SSTR instance
15:28 < belliottsmith> but you should have taken a reference against the one 
you're using, no?
15:28 < marcuse> belliottsmith: no, we don't want to hold references during the 
whole duration of the repair (hours)
15:28 < belliottsmith> each instance is distinct, but so long as you take a 
reference against the one you're using, it shouldn't be a problem...
15:28 < marcuse> so, if a file is gone, we simply remove it from the set and 
dont anticompact it
15:29 < belliottsmith> ah
15:29 < belliottsmith> hmm. that is tricky, since we need to repeatedly refetch 
the set of files we want to use, since we could have new data replacing it as 
well, surely?
15:30 < belliottsmith> i don't really see a way around holding a reference for 
the lifetime of the repair. but if we stream oldest files first we can drop 
references to them as we go, no>?
15:30 < marcuse> no, for incremental repairs we repair best-effort, if a file 
is gone, it means it has been compacted away and will get repaired next time 
around
15:31 < belliottsmith> so where/when do we grab our reference to the sstable 
then?
15:31 < belliottsmith> because in that case it sounds like we should simply be 
failing to grab our reference count
15:31 < marcuse> 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/ActiveRepairService.java#L409
15:32 < marcuse> and yes, we probably should fail, but we don't
15:32 < belliottsmith> that code looks a bit odd. why do we check for data 
component first? shouldn't an attempt to acquire the ref be sufficient? 
(clearly it isn't here though, since it's not removing it from the set)
15:33 < marcuse> yes, it should, but i don't think we decrease ref count when 
we replace an instance?
15:34 < marcuse> ie, when we open early and have a new instance, we should 
decrease the reference on the old file
15:34 < marcuse> *old instance\
15:34 < belliottsmith> we have to
15:34 < belliottsmith> otherwise it would never get deleted
15:34 < marcuse> but its the same underlying file right? it will get deleted by 
the new instance
15:34 < belliottsmith> if we don't that's definitely a bug, but i'm pretty sure 
we do
15:34 < belliottsmith> but lifetimes don't expire in order necessarily
15:35 < belliottsmith> so the older files ref count to zero, then remove 
themselves from the linked-list of replacements
15:35 < belliottsmith> if there's nobody left, they delete the file
15:36 < belliottsmith> this all is in dire need of cleaning up in 8568
{noformat}

> Incremental repairs broken with early opening of compaction results
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-8683
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8683
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Marcus Eriksson
>            Assignee: Marcus Eriksson
>             Fix For: 2.1.3
>
>
> Incremental repairs holds a set of the sstables it started the repair on (we 
> need to know which sstables were actually validated to be able to anticompact 
> them). This includes any tmplink files that existed when the compaction 
> started (if we wouldn't include those, we would miss data since we move the 
> start point of the existing non-tmplink files)
> With CASSANDRA-6916 we swap out those instances with new ones 
> (SSTR.cloneWithNewStart / SSTW.openEarly), meaning that the underlying file 
> can get deleted even though we hold a reference.
> This causes the unit test error: 
> http://cassci.datastax.com/job/trunk_utest/1330/testReport/junit/org.apache.cassandra.db.compaction/LeveledCompactionStrategyTest/testValidationMultipleSSTablePerLevel/
> (note that it only fails on trunk though, in 2.1 we don't hold references to 
> the repairing files for non-incremental repairs, but the bug should exist in 
> 2.1 as well)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (CASSANDRA-8683) Incremental repairs broken with early opening of compaction results

Reply via email to