[ https://issues.apache.org/jira/browse/CASSANDRA-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628109#comment-14628109 ]

Benedict commented on CASSANDRA-9669:
-------------------------------------

So, I've been mulling this over and would like to outline the tradeoffs and get 
agreement before implementation:

The simplest approach is to _delay flush completion_ until all prior flushes 
have succeeded, i.e. all flushes for a given table will complete in the order 
they were submitted (CommitLog order). Later flushes will remain as tmp files, 
and their memtables will continue to service requests for that data, until all 
earlier flushes for that table have also completed. There is one serious flaw 
here, though: CASSANDRA-7275 becomes even more devastating. In this scenario, 
the server's memtable space will rapidly be completely exhausted, and there is 
simply no escaping it. 
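
To make the ordering concrete, here is a minimal sketch of completing flushes 
strictly in submission order. All names (OrderedFlushCompletion, submit, 
finishedWriting) are hypothetical and for illustration only; this is not 
Cassandra code, and it models a tmp file as a plain string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: a flush is promoted from "tmp" to "complete" only
// once every flush submitted before it (for the same table) has completed.
final class OrderedFlushCompletion {
    private long nextToSubmit = 0;    // sequence number for the next flush
    private long nextToComplete = 0;  // lowest sequence number not yet promoted
    // Flushes that have finished writing but are still waiting as tmp files.
    private final TreeMap<Long, String> written = new TreeMap<>();

    // Register a flush in CommitLog order and return its sequence number.
    synchronized long submit() {
        return nextToSubmit++;
    }

    // Called when a flush has finished writing its tmp file. Returns the tmp
    // files that can now be promoted, in submission order (possibly none).
    synchronized List<String> finishedWriting(long seq, String tmpFile) {
        written.put(seq, tmpFile);
        List<String> promoted = new ArrayList<>();
        while (!written.isEmpty() && written.firstKey() == nextToComplete) {
            promoted.add(written.pollFirstEntry().getValue());
            nextToComplete++;
        }
        return promoted;
    }

    // Number of flushes that are written but blocked behind an earlier one.
    synchronized int pendingTmpFiles() {
        return written.size();
    }

    public static void main(String[] args) {
        OrderedFlushCompletion flushes = new OrderedFlushCompletion();
        long first = flushes.submit();
        long second = flushes.submit();
        // The second (smaller) flush finishes first and has to wait.
        System.out.println(flushes.finishedWriting(second, "second-tmp"));
        // Once the first finishes, both are promoted in order.
        System.out.println(flushes.finishedWriting(first, "first-tmp"));
    }
}
```

Note that if the first flush never finishes, pendingTmpFiles() only grows, 
which is the memtable exhaustion problem described above.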

In 2.1+ we can mitigate this by "opening early" (even though it's a flush) and 
evicting the memtable immediately, only upgrading it to a permanent file once 
prior flushes have completed. However, 2.0's stability for some users may be 
significantly affected.

Another possibility is to begin saving the start replay position with sstables 
(in addition to the end position), and on restart ensure we replay anything 
present in a gap. This would mean a number of small changes, e.g. to metadata, 
but would also introduce some extra compaction complexity, which I'm reluctant 
to introduce: we would have to prevent compaction of any sstable that was 
"ahead" of its prior flushes. Compaction logic is already unpleasantly complex 
around these kinds of decisions, especially in 2.1, and this would introduce 
risks to 2.0 and 2.1 that I would like to avoid.
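
As a rough illustration of the gap idea, the sketch below computes which 
CommitLog ranges are not covered by any sstable on disk. It assumes replay 
positions can be modelled as plain longs and that each sstable covers the 
half-open range [start, end); ReplayGaps and its parameters are hypothetical 
names, not existing Cassandra APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: find CommitLog ranges that no sstable on disk covers.
final class ReplayGaps {
    // Each element of flushed is {start, end}, covering positions [start, end).
    // Returns the uncovered ranges within [logStart, logEnd), in order.
    static List<long[]> gaps(List<long[]> flushed, long logStart, long logEnd) {
        List<long[]> sorted = new ArrayList<>(flushed);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));
        List<long[]> result = new ArrayList<>();
        long cursor = logStart; // everything before cursor is accounted for
        for (long[] r : sorted) {
            if (cursor >= logEnd) break;
            if (r[0] > cursor) {
                // Nothing on disk covers [cursor, r[0]): it must be replayed.
                result.add(new long[] { cursor, Math.min(r[0], logEnd) });
            }
            cursor = Math.max(cursor, r[1]);
        }
        if (cursor < logEnd) {
            result.add(new long[] { cursor, logEnd });
        }
        return result;
    }

    public static void main(String[] args) {
        // The first flush would have covered [0, 100) but never reached disk;
        // only the second flush, covering [100, 120), is present.
        List<long[]> onDisk = Arrays.asList(new long[][] { { 100, 120 } });
        for (long[] gap : gaps(onDisk, 0, 150)) {
            System.out.println(Arrays.toString(gap));
        }
    }
}
```

In this example the range [0, 100) is reported as a gap and replayed, whereas 
taking only the maximum end position would start replay at 120.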

Any solution is going to be more involved than I would like for 2.0 or 2.1, 
though.

> Commit Log Replay is Broken
> ---------------------------
>
>                 Key: CASSANDRA-9669
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9669
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Benedict
>            Assignee: Benedict
>            Priority: Critical
>              Labels: correctness
>             Fix For: 3.x, 2.1.x, 2.2.x, 3.0.x
>
>
> While {{postFlushExecutor}} ensures it never expires CL entries out-of-order, 
> on restart we simply take the maximum replay position of any sstable on disk, 
> and ignore anything prior. 
> It is quite possible for there to be two flushes triggered for a given table, 
> and for the second to finish first by virtue of containing a much smaller 
> quantity of live data (or perhaps the disk is just under less pressure). If 
> we crash before the first sstable has been written, then on restart the data 
> it would have represented will disappear, since we will not replay the CL 
> records.
> This looks to be a bug present since time immemorial, and also seems pretty 
> serious.
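
The failure described in the report can be illustrated with a small sketch. It 
is a simplification under stated assumptions: replay positions are plain 
longs, a record is replayed only if its position is at or after the chosen 
starting point, and MaxReplayPositionDemo is a hypothetical name rather than 
the actual restart code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the restart behaviour described in the report.
final class MaxReplayPositionDemo {
    // Start replay at the highest replay position recorded by any sstable.
    static long replayStart(List<Long> sstableReplayPositions) {
        long max = 0;
        for (long position : sstableReplayPositions) {
            max = Math.max(max, position);
        }
        return max;
    }

    // Assumption: records before the starting point are ignored on restart.
    static boolean isReplayed(long recordPosition, long replayStart) {
        return recordPosition >= replayStart;
    }

    public static void main(String[] args) {
        // Flush 1 (records up to position 100) was still being written at the
        // time of the crash. Flush 2 (records up to 120) finished first.
        List<Long> onDisk = Arrays.asList(120L);
        long start = replayStart(onDisk);
        // A record at position 50 belonged to flush 1 and is on no sstable,
        // yet it is skipped because 50 < 120.
        System.out.println("replayed(50) = " + isReplayed(50, start));
        System.out.println("replayed(130) = " + isReplayed(130, start));
    }
}
```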



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
