[ 
https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235707#comment-16235707
 ] 

Jason Brown commented on CASSANDRA-13987:
-----------------------------------------

Just to add these here for completeness, I spoke with several other 
contributors, and here is a brief summary of each idea and my reasoning for not 
pursuing each.

[~mkjellman] proposed to reintroduce a lock to the commitlog path, albeit with 
a smaller scope. The basic idea would still use multiple threads to serialize 
the mutation into the log, but we would lock around getting the {{Allocation}} 
buffer and writing the mutation's length and checksum. This would allow us to 
replay everything that successfully serialized into the commitlog; 
we could skip entries that did not completely serialize (and thus fail on 
deserialization) as we would be guaranteed the entry's length was written at 
the beginning of the entry (and thus we could skip to the next entry if 
possible).
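
To make the skip-on-replay idea concrete, here is a minimal sketch in Java. It is not Cassandra's commitlog format: the class {{LengthPrefixedReplay}}, the entry layout (4-byte length, 4-byte CRC32 of the length, payload, 4-byte CRC32 of the payload) and all method names are assumptions made for illustration. The locking itself is not shown; in the proposal only the header write would happen under the lock.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch, not Cassandra code. Assumed entry layout:
// [int length][int CRC32(length)][payload][int CRC32(payload)]
final class LengthPrefixedReplay {
    static final int HEADER = 8;  // length + checksum of the length
    static final int TRAILER = 4; // checksum of the payload

    static int crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return (int) c.getValue();
    }

    static int crcOfInt(int v) {
        return crc(ByteBuffer.allocate(4).putInt(v).array());
    }

    // The only step that would need the lock in the proposal.
    static void writeHeader(ByteBuffer log, int length) {
        log.putInt(length).putInt(crcOfInt(length));
    }

    static void writeEntry(ByteBuffer log, byte[] payload) {
        writeHeader(log, payload.length);
        log.put(payload).putInt(crc(payload));
    }

    // Returns every payload whose checksum verifies. An entry with a valid
    // header but a bad body is skipped using the trusted length.
    static List<byte[]> replay(ByteBuffer log) {
        List<byte[]> replayed = new ArrayList<>();
        while (log.remaining() >= HEADER) {
            int length = log.getInt();
            int lengthCrc = log.getInt();
            if (length <= 0 || lengthCrc != crcOfInt(length)
                    || log.remaining() < length + TRAILER) {
                break; // header cannot be trusted, or end of written data
            }
            byte[] payload = new byte[length];
            log.get(payload);
            if (log.getInt() == crc(payload)) {
                replayed.add(payload);
            }
            // On a mismatch the buffer is already positioned at the next entry.
        }
        return replayed;
    }

    public static void main(String[] args) {
        ByteBuffer log = ByteBuffer.allocate(256);
        writeEntry(log, "one".getBytes(StandardCharsets.UTF_8));
        int second = log.position();
        writeEntry(log, "two".getBytes(StandardCharsets.UTF_8));
        writeEntry(log, "three".getBytes(StandardCharsets.UTF_8));
        log.put(second + HEADER, (byte) 'X'); // simulate a half-written body
        log.flip();
        System.out.println("replayed " + replay(log).size() + " of 3 entries");
    }
}
```

In the demo the second entry's body is damaged, so it is skipped, and the third entry is still recovered because the second entry's length header remained valid.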

The biggest downside here was the reintroduction of the lock, which is a larger 
topic than what I want to address here, and should involve a wider community 
discussion.

[~aweisberg] proposed having a mmaped sidekick file where we would capture the 
position (and checksum of the position) of each entry in the main commitlog 
file. The entries in the sidekick file would be fixed-size values (8 bytes), so 
we would always be able to read the values. We would use something like the 
main commitlog's CAS to allocate space for the sidekick entry, but the sidekick 
entries are not guaranteed to be in the same order as the commit log's 
entries. On replay, we would need to read in the sidekick file to 
know the offsets, and we would need to attempt to replay as many of the entries 
from the main commitlog as appeared in the sidekick file.
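
A rough sketch of what such a sidekick index could look like, again in Java. Everything here is an assumption for illustration: the class {{SidekickIndex}}, the split of the 8 bytes into a 4-byte position plus a 4-byte CRC32 of that position, and the use of a plain {{ByteBuffer}} in place of a memory-mapped file. Slots are claimed with an atomic counter, so slot order need not match commitlog order, and the reader sorts the positions it recovers.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.zip.CRC32;

// Hypothetical sketch, not Cassandra code. Assumed 8-byte slot layout:
// [int commitlog position][int CRC32(position)]
final class SidekickIndex {
    static final int ENTRY_SIZE = 8;

    private final ByteBuffer buffer; // stands in for the mmaped sidekick file
    private final AtomicInteger nextSlot = new AtomicInteger();

    SidekickIndex(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    static int crcOfInt(int v) {
        CRC32 c = new CRC32();
        c.update(ByteBuffer.allocate(4).putInt(v).array(), 0, 4);
        return (int) c.getValue();
    }

    // Claims a slot atomically, then fills it with absolute writes.
    // Returns false when the sidekick buffer is full.
    boolean record(int commitLogPosition) {
        int offset = nextSlot.getAndAdd(ENTRY_SIZE);
        if (offset + ENTRY_SIZE > buffer.capacity()) {
            return false;
        }
        buffer.putInt(offset, commitLogPosition);
        buffer.putInt(offset + 4, crcOfInt(commitLogPosition));
        return true;
    }

    // Reads every slot, keeps positions whose checksum verifies, and sorts
    // them so replay can walk the main commitlog in order.
    static List<Integer> readPositions(ByteBuffer buffer) {
        List<Integer> positions = new ArrayList<>();
        for (int offset = 0; offset + ENTRY_SIZE <= buffer.capacity(); offset += ENTRY_SIZE) {
            int position = buffer.getInt(offset);
            if (buffer.getInt(offset + 4) == crcOfInt(position)) {
                positions.add(position);
            }
        }
        Collections.sort(positions);
        return positions;
    }

    public static void main(String[] args) {
        ByteBuffer sidekick = ByteBuffer.allocate(64);
        SidekickIndex index = new SidekickIndex(sidekick);
        index.record(300); // slot order differs from commitlog order
        index.record(100);
        index.record(200);
        System.out.println(readPositions(sidekick));
    }
}
```

Because every slot has the same size and carries its own checksum, an unwritten or torn slot fails verification and is simply ignored. The sort step is part of the extra replay cost discussed below.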

While being a reasonably good idea, the downside for me is that introducing 
another file for ensuring more commitlog replayability seemed more involved 
than probably necessary for the stated goal. Coordinated failures are already an 
edge condition, and imposing the sidekick file tax on every commitlog might be 
more than required. Also, I am concerned about the additional cost on replay to 
read the sidekick file, order the entries, and then ensure at least all those 
entries are replayed. We are sensitive to startup times, and this would add to 
them (albeit perhaps slightly). Another complicating factor for this idea is that 
it does not work with compressed or encrypted commitlogs.



> Multithreaded commitlog subtly changed durability
> -------------------------------------------------
>
>                 Key: CASSANDRA-13987
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13987
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>            Priority: Major
>             Fix For: 4.x
>
>
> When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly 
> changed the way that commitlog durability worked. Everything still gets 
> written to an mmap file. However, not everything is replayable from the 
> mmaped file after a process crash, in periodic mode.
> In brief, the reason this changed is the chained markers that are 
> required for the multithreaded commit log. At each msync, we wait for 
> outstanding mutations to serialize into the commitlog, and update a marker 
> before and after the commits that have accumulated since the last sync. With 
> those markers, we can safely replay that section of the commitlog. Without 
> the markers, we have no guarantee that the commits in that section were 
> successfully written, thus we abandon those commits on replay.
> If you have correlated process failures of multiple nodes at "nearly" the 
> same time (see ["There Is No 
> Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have 
> data loss if none of the nodes msync the commitlog. For example, with RF=3, 
> if quorum write succeeds on two nodes (and we acknowledge the write back to 
> the client), and then the process on both nodes OOMs (say, due to reading the 
> index for a 100GB partition), the write will be lost if neither process 
> msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. 
> The reason why this data is silently lost is due to the chained markers that 
> were introduced with CASSANDRA-3578.
> The problem we are addressing with this ticket is incrementally improving 
> 'durability' due to process crash, not host crash. (Note: operators should 
> use batch mode to ensure greater durability, but batch mode in its current 
> implementation is a) borked, and b) will burn through, *very* rapidly, SSDs 
> that don't have a non-volatile write cache sitting in front.) 
> The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which 
> means that a node could lose up to ten seconds of data due to process crash. 
> The unfortunate thing is that the data is still available, in the mmap file, 
> but we can't replay it due to incomplete chained markers.
> ftr, I don't believe we've ever had a stated policy about commitlog 
> durability wrt process crash. Pre-2.0 we naturally piggy-backed off the 
> memory mapped file and the fact that every mutation acquired a lock and 
> wrote into the mmap buffer, and the ability to replay everything out of it 
> came for free. With CASSANDRA-3578, that was subtly changed. 
> Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust 
> the durability 
> guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit]
> of each commit in innodb via the {{innodb_flush_log_at_trx_commit}} setting. I'm 
> using that idea as a loose springboard for what to do here.


