[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Brown resolved CASSANDRA-13987. ------------------------------------- Resolution: Fixed Fix Version/s: (was: 4.x) 4.0 3.11.2 3.0.16 committed as sha {{05cb556f90dbd1929a180254809e05620265419b}} > Multithreaded commitlog subtly changed durability > ------------------------------------------------- > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement > Reporter: Jason Brown > Assignee: Jason Brown > Fix For: 3.0.16, 3.11.2, 4.0 > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org