[ https://issues.apache.org/jira/browse/CASSANDRA-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054283#comment-13054283 ]
Peter Schuller commented on CASSANDRA-2811:
-------------------------------------------

One of the huge benefits of concurrent compaction is that it significantly helps with a mix of small and large column families. If compaction is forced to be serial, we're back to the situation where a 'nodetool repair' of a small 1 gig CF can block for 3 days waiting on a huge repair of an 800 gig CF. It would be nice if that could be avoided, or at least be tweakable.

Can we, for example, just make repair (without a CF specified) do one repair at a time? I.e., fully repair a single CF, then do the next, etc. It seems that this should provide sensible out-of-the-box behavior, while still retaining the concurrency as desired for cases where specific CFs are repaired at different intervals.

If the concurrency came not just from multiple CFs but also from multiple ranges, then I think it would be nice if all the ranges for a given CF could be treated as "one" compaction.

> Repair doesn't stagger flushes
> ------------------------------
>
>                 Key: CASSANDRA-2811
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2811
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.8.0
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 0.8.2
>
>
> When you do a nodetool repair (with no options), the following things occur:
>   * For each keyspace, a call to SS.forceTableRepair is issued
>   * In each of those calls: for each token range the node is responsible for,
>     a repair session is created and started
>   * Each of these sessions will request one merkle tree per column family (to
>     each node for which it makes sense, which includes the node the repair is
>     started on)
> All those merkle tree requests are done basically at the same time. And now
> that compaction is multi-threaded, this means that usually more than one
> validation compaction will be started at the same time. The problem is that a
> validation compaction starts with a flush.
> Given that by default the flush_queue_size is 4 and the number of compaction
> threads is the number of processors, and given that on any recent machine the
> number of cores will be >= 4, this means that this will easily end up
> blocking writes for some period of time.
> It turns out to also have a more subtle problem for repair itself. If two
> validation compactions for the same column family (but different ranges) are
> started in a very short time interval, the first validation will block on the
> flush, but the second one may not block at all if the memtable is clean when
> it requests its own flush. In which case that second validation will be
> executed on data older than it should be.
> I think the simplest fix is to make sure we only ever do one validation
> compaction at a time. It's probably a better use of resources anyway.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
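
The stale-validation problem in the issue description (two validation compactions for the same column family, where only the first one waits on the flush) can be illustrated with a toy model. This is only a sketch in Python, not Cassandra's code: the ColumnFamily class, its request_flush and complete_flushes methods, and the two scenario functions are hypothetical names made up for the illustration, and the "concurrent" interleaving is written out by hand rather than run on real threads.

```python
class ColumnFamily:
    """Toy stand-in for a column family (hypothetical, not Cassandra's class)."""

    def __init__(self):
        self.memtable = []   # writes not yet flushed
        self.flushing = []   # memtables switched out but not yet on disk
        self.sstables = []   # data a validation compaction can see

    def write(self, value):
        self.memtable.append(value)

    def request_flush(self):
        """Switch out the memtable. Returns True only if there was dirty data,
        i.e. only if the caller has a flush to wait on."""
        if not self.memtable:
            return False
        self.flushing.append(self.memtable)
        self.memtable = []
        return True

    def complete_flushes(self):
        """Finish all pending flushes, making their data visible."""
        for memtable in self.flushing:
            self.sstables.extend(memtable)
        self.flushing = []


def concurrent_validations(cf):
    """Hand-written interleaving of two validations started back to back."""
    cf.request_flush()                      # first: memtable dirty, must wait
    second_waits = cf.request_flush()       # second: memtable already clean
    second_view = None
    if not second_waits:
        second_view = list(cf.sstables)     # proceeds before the flush lands
    cf.complete_flushes()
    first_view = list(cf.sstables)          # first resumes after the flush
    return first_view, second_view


def serial_validations(cf):
    """One validation at a time: each finishes its flush before the next starts."""
    views = []
    for _ in range(2):
        cf.request_flush()
        cf.complete_flushes()
        views.append(list(cf.sstables))
    return tuple(views)
```

In this model the second concurrent validation builds its view before the pending write has reached the sstables, while the serial variant gives both validations the same view. That is the behavior the "one validation compaction at a time" fix is meant to guarantee, under the assumptions stated above.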