[ https://issues.apache.org/jira/browse/CASSANDRA-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054283#comment-13054283 ]

Peter Schuller commented on CASSANDRA-2811:
-------------------------------------------

One of the huge benefits of concurrent compaction is that it significantly 
helps with a mix of small and large column families. If compaction is forced to 
be serial, we're back to the situation where a 'nodetool repair' of a small 1 
gig CF can block for 3 days waiting on a huge repair of an 800 gig CF.

It would be nice if that could be avoided, or at least made tweakable. Could we, 
for example, just make repair (without a CF specified) do one repair at a time? 
I.e., fully repair a single CF, then move on to the next, and so on.

That seems like it would provide sensible out-of-the-box behavior, while still 
retaining the concurrency where desired for cases in which specific CFs are 
repaired at different intervals.

If the concurrency came not just from multiple CFs but also from multiple 
ranges, then it would be nice if all the ranges for a given CF could be treated 
as "one" compaction, I think.


> Repair doesn't stagger flushes
> ------------------------------
>
>                 Key: CASSANDRA-2811
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2811
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.8.0
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 0.8.2
>
>
> When you do a nodetool repair (with no options), the following things occur:
> * For each keyspace, a call to SS.forceTableRepair is issued
> * In each of those calls: for each token range the node is responsible for, a 
> repair session is created and started
> * Each of these sessions will request one merkle tree per column family (from 
> each node for which it makes sense, which includes the node the repair is 
> started on)
> All those merkle tree requests are made at basically the same time. And now 
> that compaction is multi-threaded, this means that usually more than one 
> validation compaction will be started at the same time. The problem is that a 
> validation compaction starts with a flush. Given that by default 
> flush_queue_size is 4, that the number of compaction threads equals the number 
> of processors, and that on any recent machine the number of cores will be >= 4, 
> this will easily end up blocking writes for some period of time.
> It turns out this also has a more subtle consequence for repair itself. If two 
> validation compactions for the same column family (but different ranges) are 
> started within a very short time interval, the first validation will block on 
> the flush, but the second one may not block at all if the memtable is clean 
> when it requests its own flush. In that case the second validation will be 
> executed on data older than it should be.
> I think the simplest fix is to make sure we only ever do one validation 
> compaction at a time. It's probably a better use of resources anyway.
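
For illustration, a minimal sketch of the "one validation compaction at a time" 
idea from the description above, assuming a dedicated single-threaded executor; 
CompactionManager's real API is not shown and the names are hypothetical:

{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only, not CompactionManager's actual API: funnel every validation
// compaction through a single-threaded executor, so simultaneous merkle tree
// requests queue up instead of each triggering its own flush at the same time.
public class SerialValidation
{
    private static final ExecutorService VALIDATION_EXECUTOR =
        Executors.newSingleThreadExecutor();

    public static Future<?> submitValidation(Runnable validationCompaction)
    {
        return VALIDATION_EXECUTOR.submit(validationCompaction);
    }
}
{code}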

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira