[ 
https://issues.apache.org/jira/browse/CASSANDRA-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968613#comment-16968613
 ] 

Dimitar Dimitrov commented on CASSANDRA-15368:
----------------------------------------------

Thanks for the super-quick reply, [~benedict]!
I'll definitely check out your patch for CASSANDRA-15367.
As for this problem, if my taking potentially (much) longer to fix it isn't a 
problem for you, I can certainly take a stab at it.

Also here's the analysis that I mentioned in my previous reply - all comments 
appreciated.

h4. Defining the problem

Let's assume we have a single table (no indexes, no MVs) that's being 
continuously written to when a single flush for it is requested.
 We want to examine whether the old memtable can accept a write with a higher 
commit log (CL) position while the new memtable accepts a write with a lower CL 
position - the latter also implies the old memtable rejecting that write.
 * Below we'll be calling the write with the higher CL position *HW*, its 
assigned {{OpOrder.Group}} (and the action for assigning it) *HW group*, and 
its assigned CL position (and the action for assigning it) *HW position*.
 * Similarly for the write with the lower CL position - *LW*, *LW group*, and 
*LW position*.

So to get the (un)desired ordering, we need the following specific results from 
3 executions of {{Memtable.accepts(OpOrder.Group, CommitLogPosition)}}:
 - {{oldMemtable.accepts(<HW>)}} (called *HW accept?* below), which should 
return true
 - {{oldMemtable.accepts(<LW>)}} (called *LW accept?* below), which should 
return false
 - {{newMemtable.accepts(<LW>)}}, which should return true (not necessary for 
the analysis below)

h4. Some constraints

 A. For each of the writes, the {{OpOrder.Group}} assignment happens-before the 
CL position allocation for the corresponding write, which happens-before the 
{{Memtable.accepts(OpOrder.Group, CommitLogPosition)}} call for the 
corresponding write.
 * HW group --hb-> HW position --hb-> HW accept?
 * LW group --hb-> LW position --hb-> LW accept?

B. The CL position allocations are totally (and numerically) ordered by 
happens-before, due to the way {{CommitLogSegment}}-s are advanced and the way 
their internal {{allocatePosition}} markers are CAS-ed.
 * LW position --hb-> HW position
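
To make B. more concrete, here's a minimal sketch of CAS-based position 
allocation within a single segment. {{SegmentModel}}, its {{capacity}} and the 
{{-1}} "no room" return value are illustrative assumptions, not the actual 
{{CommitLogSegment}} code. The point is only that each successful CAS claims a 
range starting exactly where the previous one ended, so the allocations form a 
single, numerically increasing chain.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified stand-in for one commit log segment. */
final class SegmentModel {
    // Offset of the first unallocated byte; only ever advanced by CAS.
    private final AtomicInteger allocatePosition = new AtomicInteger();
    private final int capacity;

    SegmentModel(int capacity) {
        this.capacity = capacity;
    }

    /**
     * Claims {@code size} bytes and returns their start offset,
     * or -1 if the segment has no room left (the caller would then
     * have to advance to a new segment, which is not modelled here).
     */
    int allocate(int size) {
        while (true) {
            int prev = allocatePosition.get();
            int next = prev + size;
            if (next > capacity)
                return -1;
            // Only one thread can move the marker from prev to next;
            // losers re-read the marker and retry from the new value.
            if (allocatePosition.compareAndSet(prev, next))
                return prev;
        }
    }
}
```

With a capacity of 100, three consecutive {{allocate(40)}} calls would return 
0, 40 and -1 in this model, regardless of which threads make them.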

C. If {{writeBarrier.issue()}} in the {{Flush}} constructor happens-before HW 
group, then the final upper CL bound for the old memtable (called *UB* below) 
has already been set, and is guaranteed to be less than HW position. HW accept? 
is then guaranteed to return false (because it will see {{writeBarrier}} as 
non-{{null}}, and HW position is guaranteed to be greater than UB) => 
contradiction
 * If {{writeBarrier.issue()}} --hb-> HW group => UB --hb-> HW group => UB 
--hb-> HW position => contradiction
 * Therefore HW group --hb-> {{writeBarrier.issue()}}
 * Note that this was not true before the fix for CASSANDRA-8383.

D. If {{writeBarrier.issue()}} happens-before LW group, then UB has been set, 
and is guaranteed to be less than LW position, and therefore less than HW 
position. Also, {{writeBarrier.issue()}} would happen-before HW position (via 
LW group and LW position), which happens-before HW accept?. That means that HW 
accept? will see {{writeBarrier}} as non-{{null}}, and UB as set and less than 
HW position, so it is guaranteed to return false => contradiction
 * If {{writeBarrier.issue()}} --hb-> LW group => UB --hb-> LW position --hb-> 
HW position && {{writeBarrier.issue()}} --hb-> HW accept? => contradiction
 * Therefore LW group --hb-> {{writeBarrier.issue()}}

E. As a corollary of C. and D., LW group and HW group should both be before the 
barrier issued by the flush, and therefore *the placements of LW and HW will 
both be determined by LW position, HW position, and UB*.
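
For reference, the acceptance check that C. through E. reason about can be 
modelled roughly as below. This is a simplified sketch under stated 
assumptions, not the actual {{Memtable}} code: {{MemtableModel}}, {{Bound}} and 
{{finalizeUpperBound}} are hypothetical names, CL positions are plain longs, 
and the {{OpOrder}} barrier is reduced to a number (groups with a smaller id 
started before the barrier was issued).

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified model of the old memtable's acceptance check. */
final class MemtableModel {
    /** Upper bound candidate; isFinal marks the bound fixed by the flush (UB). */
    static final class Bound {
        final long position;
        final boolean isFinal;

        Bound(long position, boolean isFinal) {
            this.position = position;
            this.isFinal = isFinal;
        }
    }

    // null until the flush marks this memtable as discarding
    private volatile Long writeBarrier;
    private final AtomicReference<Bound> upperBound =
        new AtomicReference<>(new Bound(Long.MIN_VALUE, false));

    void setDiscarding(long barrierGroupId) {
        writeBarrier = barrierGroupId;
    }

    /**
     * Fixes UB at the larger of the highest position accepted so far and
     * the log position observed by the flush (an assumption of this model).
     */
    long finalizeUpperBound(long currentLogPosition) {
        while (true) {
            Bound current = upperBound.get();
            Bound fixed = new Bound(Math.max(current.position, currentLogPosition), true);
            if (upperBound.compareAndSet(current, fixed))
                return fixed.position;
        }
    }

    boolean accepts(long groupId, long position) {
        Long barrier = writeBarrier;
        if (barrier == null)
            return true;                 // no flush in progress: take every write
        if (groupId >= barrier)
            return false;                // group started after the barrier
        while (true) {
            Bound current = upperBound.get();
            if (current.isFinal)
                return current.position >= position;   // UB decides
            if (current.position >= position)
                return true;             // candidate already covers this write
            // Raise the candidate so that the final UB cannot end up below us.
            if (upperBound.compareAndSet(current, new Bound(position, false)))
                return true;
        }
    }
}
```

In this model, once both groups are before the barrier, the only remaining 
way for {{accepts}} to return false is the {{isFinal}} branch, i.e. a 
comparison of the write's position against UB.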

h4. The case work

In order for HW accept? to return true:
# ...it could be seeing {{writeBarrier}} as {{null}}, which means it must have 
started before the {{writeBarrier}} is set in {{oldMemtable.setDiscarding}}.
## This implies that LW accept? started after HW accept? had started - 
otherwise LW accept? would also have seen {{writeBarrier}} as {{null}} and 
returned true already => contradiction
 ## So LW accept? started after HW accept? had started, and needs to return 
false because of LW position (see E. for why it cannot be due to LW group).
 This could happen only if UB has been set and is less than LW position. But as 
setting UB happens after {{oldMemtable.setDiscarding}}, and HW accept? had 
started before the {{writeBarrier}} was set in {{oldMemtable.setDiscarding}}, UB 
should be at least HW position, which is more than LW position => contradiction
 #* If HW accept? start --hb-> writeBarrier set in 
{{oldMemtable.setDiscarding}} => HW position --hb-> writeBarrier set in 
{{oldMemtable.setDiscarding}} --hb-> UB => LW position --hb-> UB => 
contradiction
 #* Therefore writeBarrier set in {{oldMemtable.setDiscarding}} --hb-> HW 
accept? start
 # ...it could have been started before UB is set. In that case the UB 
candidate is guaranteed to be at least HW position. But then it's impossible 
for LW accept? to find UB already set and less than LW position => contradiction
 #* If HW accept? start --hb-> UB => HW position --hb-> UB => LW position 
--hb-> UB => contradiction
 # ...it could have been executed after UB is set, but in order for HW accept? 
to return true, UB must end up being at least HW position. But then it's 
impossible for LW accept? to find UB already set and less than LW position => 
contradiction
 #* If UB --hb-> HW accept? start && HW accept? returns true => UB >= HW 
position => UB >= LW position => contradiction

> Failing to flush Memtable without terminating process results in permanent 
> data loss
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15368
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15368
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Commit Log, Local/Memtable
>            Reporter: Benedict Elliott Smith
>            Priority: Normal
>             Fix For: 4.0, 2.2.x, 3.0.x, 3.11.x
>
>
> {{Memtable}} do not contain records that cover a precise contiguous range of 
> {{ReplayPosition}}, since there are only weak ordering constraints when 
> rolling over to a new {{Memtable}} - the last operations for the old 
> {{Memtable}} may obtain their {{ReplayPosition}} after the first operations 
> for the new {{Memtable}}.
> Unfortunately, we treat the {{Memtable}} range as contiguous, and invalidate 
> the entire range on flush.  Ordinarily we only invalidate records when all 
> prior {{Memtable}} have also successfully flushed.  However, in the event of 
> a flush that does not terminate the process (either because of disk failure 
> policy, or because it is a software error), the later flush is able to 
> invalidate the region of the commit log that includes records that should 
> have been flushed in the prior {{Memtable}}.
> More problematically, this can also occur on restart without any associated 
> flush failure, as we use commit log boundaries written to our flushed 
> sstables to filter {{ReplayPosition}} on recovery, which is meant to 
> replicate our {{Memtable}} flush behaviour above.  However, we do not know 
> that earlier flushes have completed, and they may complete successfully 
> out-of-order.  So any flush that completes before the process terminates, but 
> began after another flush that _doesn’t_ complete before the process 
> terminates, has the potential to cause permanent data loss.


