[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278437#comment-16278437 ] Sam Tunnicliffe commented on CASSANDRA-13987: - bq. The original code did this (across all three branches)... Thanks [~jasobrown], I was having a brain fart and overlooking the fact that {{lastMarkerOffset}} is of course the start of the marker and so we have necessarily allocated past it to write the marker. I'm +1 on this, once CI is happy (as happy as it can be). I just have one niggle, which I'll leave to you to decide whether to address or not... On 3.0, in the 'should we flush or just update markers' condition, I think it would be clearer to check {{shutdown}}, rather than {{!run}} given that {{run = !shutdown}}. > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277811#comment-16277811 ] Jason Brown commented on CASSANDRA-13987: - I've addressed the nits, and pushed to the usual spots: ||3.0||3.11||trunk|| |[branch|https://github.com/jasobrown/cassandra/tree/13987-3.0]|[branch|https://github.com/jasobrown/cassandra/tree/13987-3.11]|[branch|https://github.com/jasobrown/cassandra/tree/13987-trunk]| |[utests & dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/13987-3.0]|[utests & dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/13987-3.11]|[utests & dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/13987-trunk]| || bq. In trunk, the comment near the top of CLS::sync is missing a word. I meant to remove this comment from trunk; already did it on 3.0/3.11 bq. Also in CLS::sync, why does the calculation of needToMarkData include SYNC_MARKER_SIZE? I could be missing something, but it seems to me that we only need check allocatePosition.get() > lastMarkerOffset The [original code|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/commitlog/CommitLogSegment.java#L311] did this (across all three branches). More properly, though, the call to {{#allocate(int)}} returns the offset of where to begin writing into th\ e buffer, and that's what we store in {{lastMarkerOffset}}. So then on the next time {{#sync()}} is invoked, we know another mutation has come in because {{allocatePosition}} will be greater than {{lastMarkerOffset + SYNC_MARKER_SIZE}}. bq. dtests seem to be pretty broken on all 3 branches, is that teething troubles with the custom circleci yaml This is due to the circleci "teething troubles" bq. On 3.0, in ACLS's runnable shouldn't this be checking shutdown not run because we want to flush if shutdown has been requested? D'oh! You are correct. fixed > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277025#comment-16277025 ] Sam Tunnicliffe commented on CASSANDRA-13987: - bq. I've added CommitLogChainedMarkersTest. Looking at it now, perhaps it could do with a better name and/or a comment at the top of the file explaining what, specifically, it's testing. Thanks for adding the test, I definitely agree that an explanatory comment would be useful, took me a few minutes to figure out what was going on there :) Few more minor-ish comments: * The validation on {{markerIntervalMillis}} in the latest commit on 3.0/3.11 also needs adding to trunk. Without it, use of the default value results in wakeup time always being before {{now}} * In ACLS, Can you rename {{requestSync}} to something like {{syncRequested}} or {{explicitSyncReqested}} to be consistent with {{shutdownRequested}}? * In trunk, the comment near the top of {{CLS::sync}} is missing a word, I'm guessing it should read "this will evaluate to true *when* the allocate position is jacked...". Also, this comment is missing on 3.0/3.11 * Also in {{CLS::sync}}, why does the calculation of {{needToMarkData}} include {{SYNC_MARKER_SIZE}}? I could be missing something, but it seems to me that we only need check {{allocatePosition.get() > lastMarkerOffset}}. * On 3.0, in ACLS's runnable shouldn't this be checking {{shutdown}} not {{run}} because we want to flush if shutdown *has* been requested? {code}if (lastSyncedAt + syncIntervalMillis <= pollStarted || run || requestSync){code} The dtests seem to be pretty broken on all 3 branches, is that teething troubles with the custom circleci yaml, or are you still working on those? > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16272601#comment-16272601 ] Jason Brown commented on CASSANDRA-13987: - Spent about a week tracking down a race condition, and thankfully it was just a stupid bug which is fixed. I've also backported to 3.0 and 3.11 ||3.0||3.11||trunk|| |[branch|https://github.com/jasobrown/cassandra/tree/13987-3.0]|[branch|https://github.com/jasobrown/cassandra/tree/13987-3.11]|[branch|https://github.com/jasobrown/cassandra/tree/commitlog_mmap-more-frequent-markers]| |[utests & dtests|https://circleci.com/gh/jasobrown/cassandra/tree/13987-3.0]|[utests & dtests|https://circleci.com/gh/jasobrown/cassandra/tree/13987-3.11]|[utests & dtests|https://circleci.com/gh/jasobrown/cassandra/tree/commitlog_mmap-more-frequent-markers]| || The trunk branch is a continuation of the previous development branch, while the 3.0/3.11 branched are squashed backports. 3.11 is trivially close to the trunk code (minor compilation fixes were needed), but 3.0 required a bit more work. utests look good across the branches, and I'm waiting for the dtests to finish. Note: I'm using an updated circleci config which won't be committed into the apache repo. bq. Should we move the call to writeCDCIndexFile ... Yup, certainly seems the correct thing to do ;) I addressed the other nits, as well. While running the utests on circleci, I there were some failures, related to not forcing the flush when shutting down the {{AbstractCommitLogService}}. That's fixed in the latest commit. bq. do you think it's worth adding a unit test or two for this? Yes, and I've added {{CommitLogChainedMarkersTest}}. Looking at it now, perhaps it could do with a better name and/or a comment at the top of the file explaining what, specifically, it's testing. Also, I've fixed a few minor things for a few tests. - {{AbstractCommitLogService#requestExtraSync()}} was correctly unparking the sync thread, but commit log data was not being flushed to disk. Thus, I added a volatile boolean to the class for {{#requestExtraSync()}} to indicate that a sync should happen. This is mostly to support batch commit log mode. > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally pigg
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271720#comment-16271720 ] Jason Brown commented on CASSANDRA-13987: - [~jolynch] It looks like 2.1's shutdown hook in {{StorageService}} does [shutdown the commit log|https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/service/StorageService.java#L716]. So, I think you should be OK, assuming that a gentle {{kill}} is issued (not {{kill -9}} or similar ilk). > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16257892#comment-16257892 ] Joseph Lynch commented on CASSANDRA-13987: -- If I understand this bug correctly, and process crash could lead to up to 10s of un-replayable data in the commitlogs, then isn't it really bad that prior to [db198e684 | https://github.com/apache/cassandra/commit/5115c106db198e684b47c614b237925c45c71da8] merged in 3.0.10 and 3.10, the shutdown hook only flushed [non durable cfs| https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/service/StorageService.java#L699-L702], meaning that we were counting on commitlog replay during a normal restart of Cassandra? It would seem that it is possible if an operator didn't do an explicit {{nodetool drain}} before sending a TERM signal (e.g. if they just called {{/etc/init.d/cassandra restart}}) they would potentially have up to 10s of un-replayable (but acknowledged) data as a result? If this is correct and we can't backport this change to 2.1 should we at least backport that the shutdown handler flushes all column families? > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251628#comment-16251628 ] Benedict commented on CASSANDRA-13987: -- That still leaves the question of what behaviour we restore. For plain commit logs, it's probably best if we either remove the sync markers or start ignoring them. It occurs to me that they were only introduced to handle the file reuse that no longer happens. Without the sync markers we'll pretty much fully restore prior behaviour. Not suggesting that this supersedes this ticket, of course. But it's probably a good thing to do (while perhaps further caveating our treatment of corrupt records, since they may be more expected) > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250428#comment-16250428 ] Jason Brown commented on CASSANDRA-13987: - I don't believe we've had a policy or guarantee in the past about the availability of commit log data that was unflushed (not {{mysyc}}'ed), thus I'm not sure how much of a 'regression' this changed behavior is. It's unfortunate that some previous assumptions that both developers and operators may have had were altered, and the end result may result in data loss. So I'm kind of on the fence with how far to go back, but I think 3.0 and up is reasonable. > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250373#comment-16250373 ] Jeff Jirsa commented on CASSANDRA-13987: I get that the ship has sailed on 2.1/2.1, and I accept that. I'd like it in 3.0/3.11 because I think it's a guarantee people expect, but I'm open to arguments that it's too dangerous (I haven't touched that code in months, you have, I'll defer to you if you think it's straightforward enough). > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250370#comment-16250370 ] Benedict commented on CASSANDRA-13987: -- The discussed behavioural change was introduced in 2.1, so if it's considered a regression it should probably go all the way back, at least to 2.2 (I think we're still servicing that, right?) However if we consider it a regression this doesn't fundamentally fix the problem, and we should probably file a follow up ticket if we want to restore 2.0 behaviour. For the record, it's quite likely that for unencrypted segments we can get very nearly identical behaviour to before with only changes to replay, by just skipping corrupted sync markers and continuing to replay records while we are able to. Some changes to the file format and/or the time at which we serialize the size/checksum could make this more reliable, but here we're talking about race conditions, which arguably isn't a regression given these could have equivalently simply been held up in the queue for the commit log thread before. For encrypted segments, I don't know if we need to "restore" behaviour since it was never available before, but it would make sense to do so (least surprise and all that). In which case we'd probably want to modify our segment writing to happen concurrently (but serially writing the bytes, of course). This probably isn't actually such a dramatic change, though it's been a while since I've looked at the code. This way we could "just" do the same as above, but also abort when we hit a corrupted/abrupt end of encrypted/compressed stream. > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboar
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250368#comment-16250368 ] Jason Brown commented on CASSANDRA-13987: - [~jjirsa] This is a change in behavior from when multithreaded commitlog (CASSANDRA-3578) was introduced, in 2.1. I'm pretty sure we don't want to update 2.1, and 2.2 is highly doubtful, as well, but I'm fine with 3.0 and higher if folks think it's worth it. > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250338#comment-16250338 ] sankalp kohli commented on CASSANDRA-13987: --- +1 for doing this in 3.0+ > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250333#comment-16250333 ] Jeff Jirsa commented on CASSANDRA-13987: Am I the only one who thinks this belongs in 3.0+ instead of 4.0? It's a regression (though not from 2.1/2.2, I guess), and it impacts data safety. > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250195#comment-16250195 ] Sam Tunnicliffe commented on CASSANDRA-13987: - Previously, {{writeCDCIndexFile}} was only called ever called after a flush, which would be consistent with its comment that states: {code}We persist the offset of the last data synced to disk so clients can parse only durable data if they choose{code} So currently this definition of durable would include durability in the face of host failures, whereas with this patch the index file may contain offsets for segments that are durable under process crash, but which have not yet been msynced/fsynced and so may not survive a host failure. Should we move the call to {{writeCDCIndexFile}} into the {{if (flush || close)}} block, to after the flush has completed? That question aside, the code seems solid and I've manually tested both as-is and with some added hacks to inject failures etc, but I feel like it could still benefit from some automated testing to cover the new behaviour. I know that writing tests for this area is non-trivial and usually involves byteman, but do you think it's worth adding a unit test or two for this? Nits: * Typo in cassandra.yaml #380 s/mmaped/mmapped * The comment atop {{AbstractCommitLogSegmentManager::sync}} could use updating. The fact that it says it flushes, but also takes a boolean flush arg is a bit confusing. * {{CompressedSegment}} and {{EncryptedSegment}} no longer need to import {{SyncUtil}} > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16242915#comment-16242915 ] Jason Brown commented on CASSANDRA-13987: - [~benedict] thanks for the comments, and for (indirectly) confirming that my understanding of the multithreaded commit log is more-or-less correct. > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240517#comment-16240517 ] Benedict commented on CASSANDRA-13987: -- (To clarify, I'm all in favour if this current approach to improving the status quo, just suggesting a follow up that would be non-breaking and relatively straight forward) > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16239552#comment-16239552 ] Benedict commented on CASSANDRA-13987: -- There is another option - similar to Ariel's approach. If we leave the commit log files as they are, the only period of time we need to account for is the duration between flushes, in which we may have some completely written mutations but no way to know their boundaries until the next sync happens. So, we could fairly easily have a "superfluous" universal side-kick file, containing just a buffer of start/end offset pairs (possibly with some generation bits, or other book keeping info) written _since_ the last sync, that can optionally be consulted to replay records that occurred after this. Every time we sync we can start overwriting this buffer from the beginning. If a universal buffer, the first few bytes can state the file it is associated with at present. Without consulting this file, replay happens exactly as it does now. It just permits us to improve recovery of the final sync-segment. > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235731#comment-16235731 ] Ariel Weisberg commented on CASSANDRA-13987: bq. but the ordering in the sidekick entries are not guaranteed to be in the same order as the commit log's entries. Just a heads up they would. You would increment the offsets atomically using CAS of two 4-byte values packed into one 8-byte value. > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown >Priority: Major > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235707#comment-16235707 ] Jason Brown commented on CASSANDRA-13987: - Just to add these here for completeness, I spoke with several other contributors, and here is a brief summary of each idea and my reasoning for not pursuing each. [~mkjellman] proposed to reintroduce a lock to the commitlog path, albeit with a smaller scope. The basic idea would still use multiple threads to serialize the mutation into the log, but we would lock around getting the {{Allocation}} buffer and writing the mutation's length and checksum. This would allow us to be able to replay everything that successfully serialized into the commitlog; we could skip entries that did not completely serialize (and thus fail on deserialization) as we would be guaranteed the entry's length was written at the beginning of the entry (and thus we could skip to the next entry if possible). The biggest downside here was the reintroduction of the lock, which is a larger topic than what I want to address here, and should involve a wider community discussion. [~aweisberg] proposed having a mmaped sidekick file where we would capture the position (and checksum of the position) of each entry in the main commitlog file. The entries in the sidekick file would be fixed-size values (8 bytes), so we would always be able to read the values. We would use something like the main commitlog's CAS to allocate space for the sidekick entry, but the ordering in the sidekick entries are not guaranteed to be in the same order as the commit log's entries. On replay, we would need to read in the sidekick file to know the offsets, and we would need to attempt to replay as many of the entries from the main commitlog as appeared in the sidekick file. While being a reasonably good idea, the downside for me is that introducing another file for ensuring more commitlog replayablility seemed more involved than probably necessary for the stated goal. Coorinated failures are already an edge condition, and imposing the sidekick file tax on every commitlog might be more than required. Also, I am concerned about the additional cost on replay to read the sidekick file, order the entries, and then ensure at least all those entries are replayed. We are sensitive to startup times, and this would add to it (albeit perhaps slightly). Another complicating factor for this idea is that is does not work with compressed or encrypted commitlogs. > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown >Priority: Major > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 sec
[jira] [Commented] (CASSANDRA-13987) Multithreaded commitlog subtly changed durability
[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235703#comment-16235703 ] Jason Brown commented on CASSANDRA-13987: - Here is a branch that takes the simplest path: it updates the commit log chained markers (in periodic mode) much more quickly than it mysnc's. ||trunk|| |[branch|https://github.com/jasobrown/cassandra/tree/commitlog_mmap-more-frequent-markers]| |[utests|https://circleci.com/gh/jasobrown/cassandra/tree/commitlog_mmap-more-frequent-markers]| |[dtest|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/407/]| The basic idea is that if we can update the chained markers (say, (a configurable) once every 100 milliseconds), that should probably be more than enough time to survive a correlated failure between two nodes about OOM. This is *not* a silver bullet to ensure complete replayability, as in that case you should use batch commit log mode for each commit to ensure durability. There are alternatives that I discussed with others (see next comment). This branch does not solve the problem for compressed/encrypted commitlog (in periodic mode), as those implementations do not use mmaped files. I am not sure how best (or if) to address those. Switching them to use a memory mapped file might not be too difficult, code-wise, but i'm not sure about performance implications. Apparently, [~benedict] and I had some discussion about the use of mmap and the commitlog a lng time ago (CASSANDRA-6809), but I honestly can't remember the details beyond our comments on that ticket. > Multithreaded commitlog subtly changed durability > - > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement >Reporter: Jason Brown >Assignee: Jason Brown >Priority: Major > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard f