This is an automated email from the ASF dual-hosted git repository.

mck pushed a commit to branch trunk
in repository https://gitbox.apache.org/repos/asf/cassandra-website.git


The following commit(s) were added to refs/heads/trunk by this push:
     new 2db0aea21 Sept 2022 blog "Learn How CommitLog Works in Apache 
Cassandra"
2db0aea21 is described below

commit 2db0aea213a6b73b72d9c386c29585f7a1361977
Author: Diogenese Topper <diotop...@gmail.com>
AuthorDate: Wed Sep 14 18:59:24 2022 -0700

    Sept 2022 blog "Learn How CommitLog Works in Apache Cassandra"
    
     patch by Alex Sorokoumov, Chris Thornett, Diogenese Topper; reviewed by 
Mick Semb Wever for CASSANDRA-17860
    
    Co-authored by: Alex Sorokoumov
    Co-authored by: Chris Thornett <ch...@constantia.io>
    Co-authored by: Diogenese Topper <diogen...@constantia.io>
---
 .../images/blog/Allocating-and-active-segments.png | Bin 0 -> 661957 bytes
 .../ROOT/images/blog/Compressed-Segment-layout.png | Bin 0 -> 54667 bytes
 .../modules/ROOT/images/blog/Dirty-intervals.png   | Bin 0 -> 23619 bytes
 .../ROOT/images/blog/Encrypted-Segment-layout.png  | Bin 0 -> 93448 bytes
 ...rks-in-Apache-Cassandra-unsplash-sandip-roy.jpg | Bin 0 -> 158681 bytes
 .../ROOT/images/blog/Memtables-and-CommitLog.png   | Bin 0 -> 113334 bytes
 .../ROOT/images/blog/Mmaped-Segment-layout.png     | Bin 0 -> 54194 bytes
 .../modules/ROOT/images/blog/Segment layout.png    | Bin 0 -> 52523 bytes
 site-content/source/modules/ROOT/pages/blog.adoc   |  25 ++++
 ...rn-How-CommitLog-Works-in-Apache-Cassandra.adoc | 139 +++++++++++++++++++++
 10 files changed, 164 insertions(+)

diff --git 
a/site-content/source/modules/ROOT/images/blog/Allocating-and-active-segments.png
 
b/site-content/source/modules/ROOT/images/blog/Allocating-and-active-segments.png
new file mode 100644
index 000000000..939b6d4a8
Binary files /dev/null and 
b/site-content/source/modules/ROOT/images/blog/Allocating-and-active-segments.png
 differ
diff --git 
a/site-content/source/modules/ROOT/images/blog/Compressed-Segment-layout.png 
b/site-content/source/modules/ROOT/images/blog/Compressed-Segment-layout.png
new file mode 100644
index 000000000..18f19d188
Binary files /dev/null and 
b/site-content/source/modules/ROOT/images/blog/Compressed-Segment-layout.png 
differ
diff --git a/site-content/source/modules/ROOT/images/blog/Dirty-intervals.png 
b/site-content/source/modules/ROOT/images/blog/Dirty-intervals.png
new file mode 100644
index 000000000..30e7676d4
Binary files /dev/null and 
b/site-content/source/modules/ROOT/images/blog/Dirty-intervals.png differ
diff --git 
a/site-content/source/modules/ROOT/images/blog/Encrypted-Segment-layout.png 
b/site-content/source/modules/ROOT/images/blog/Encrypted-Segment-layout.png
new file mode 100644
index 000000000..bfaa3654d
Binary files /dev/null and 
b/site-content/source/modules/ROOT/images/blog/Encrypted-Segment-layout.png 
differ
diff --git 
a/site-content/source/modules/ROOT/images/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra-unsplash-sandip-roy.jpg
 
b/site-content/source/modules/ROOT/images/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra-unsplash-sandip-roy.jpg
new file mode 100644
index 000000000..2e1224b27
Binary files /dev/null and 
b/site-content/source/modules/ROOT/images/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra-unsplash-sandip-roy.jpg
 differ
diff --git 
a/site-content/source/modules/ROOT/images/blog/Memtables-and-CommitLog.png 
b/site-content/source/modules/ROOT/images/blog/Memtables-and-CommitLog.png
new file mode 100644
index 000000000..b0a6034ac
Binary files /dev/null and 
b/site-content/source/modules/ROOT/images/blog/Memtables-and-CommitLog.png 
differ
diff --git 
a/site-content/source/modules/ROOT/images/blog/Mmaped-Segment-layout.png 
b/site-content/source/modules/ROOT/images/blog/Mmaped-Segment-layout.png
new file mode 100644
index 000000000..63a1269af
Binary files /dev/null and 
b/site-content/source/modules/ROOT/images/blog/Mmaped-Segment-layout.png differ
diff --git a/site-content/source/modules/ROOT/images/blog/Segment layout.png 
b/site-content/source/modules/ROOT/images/blog/Segment layout.png
new file mode 100644
index 000000000..fc2168021
Binary files /dev/null and 
b/site-content/source/modules/ROOT/images/blog/Segment layout.png differ
diff --git a/site-content/source/modules/ROOT/pages/blog.adoc 
b/site-content/source/modules/ROOT/pages/blog.adoc
index 2820f57be..4d74c1b7c 100644
--- a/site-content/source/modules/ROOT/pages/blog.adoc
+++ b/site-content/source/modules/ROOT/pages/blog.adoc
@@ -8,6 +8,31 @@ NOTES FOR CONTENT CREATORS
 - Replace post tile, date, description and link to you post.
 ////
 
+//start card
+[openblock,card shadow relative test]
+----
+[openblock,card-header]
+------
+[discrete]
+=== Learn How CommitLog Works in Apache Cassandra
+[discrete]
+==== September 26, 2022
+------
+[openblock,card-content]
+------
+Learn how Apache Cassandra’s CommittLog works, how Cassandra ensures data 
durability, and how various tuning parameters affect its behavior.
+
+[openblock,card-btn card-btn--blog]
+--------
+
+[.btn.btn--alt]
+xref:blog/Learn-How-CommitLog-Works-in-Apache-Cassandra.adoc[Read More]
+--------
+
+------
+----
+//end card
+
 //start card
 [openblock,card shadow relative test]
 ----
diff --git 
a/site-content/source/modules/ROOT/pages/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra.adoc
 
b/site-content/source/modules/ROOT/pages/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra.adoc
new file mode 100644
index 000000000..16f98af2f
--- /dev/null
+++ 
b/site-content/source/modules/ROOT/pages/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra.adoc
@@ -0,0 +1,139 @@
+= Learn How CommitLog Works in Apache Cassandra
+:page-layout: single-post
+:page-role: blog-post
+:page-post-date: September 26, 2022
+:page-post-author: Alex Sorokoumov
+:description: A Comprehensive Guide to CommitLog
+:keywords:
+
+:!figure-caption:
+
+._Image credit: https://unsplash.com/@sandiproy_kolkata[Sandip Roy on 
Unsplash^]_
+image::blog/Learn-How-CommitLog-Works-in-Apache-Cassandra-unsplash-sandip-roy.jpg[Golden
 Bridge, Hòa Ninh, Hòa Vang, Danang, Vietnam. A bridge held up by stone hands]
+
+CommitLog (aka write-ahead log, WAL) is a standard component of many 
databases. In Apache Cassandra, it is an efficient append-only on-disk data 
structure that guarantees durability.
+
+Learning more about how CommitLog works will be helpful to database 
administrators who want to better understand the guarantees and trade-offs 
Cassandra provides. This post also serves as an introduction for any users who 
want to dig into this subsystem. Finally, database enthusiasts and developers 
might find it interesting to read how Cassandra’s write-ahead log is 
implemented in practice.
+
+As part of our overview of the CommitLog features, we will go through the 
following:
+
+*  A Recap of the Write Path
+*  An overview of the CommitLog Lifecycle
+* How to Append to the CommitLog
+* CommitLog Segment Types 
+*  Segment Recycling
+* CommitLog and Change-Data-Capture(CDC)
+
+=== Write Path Recap
+
+This section briefly summarizes the Cassandra write path to establish the role 
CommitLog plays in the database system.
+
+When Cassandra accepts new write requests, it saves new mutations to an 
in-memory write-back cache called a memtable and appends them to the CommitLog. 
The former allows serving reads without accessing the disk, while the latter 
guarantees durability. If Cassandra crashes before flushing the memtable, it 
will restore acknowledged writes by replaying the CommitLog.
+
+Once the database flushes a memtable to disk as an 
https://cassandra.apache.org/doc/latest/cassandra/architecture/storage_engine.html#sstables[SSTable],
 which is an immutable file for persisting data, it can eliminate the 
corresponding log entries. We are going to learn how this happens in the next 
section.
+
+=== CommitLog Lifecycle Explained
+
+This section describes the CommitLog structure and how it knows what data to 
keep or remove.
+
+The CommitLog is an append-only data structure comprising a series of segments 
- files stored on disk. Segments persist `mutations` - internal objects 
containing information about new writes. Besides the changed rows, mutations 
contain relevant metadata - keyspace and table names, creation timestamp, GC 
grace seconds, etc. Mutations are 
https://en.wikipedia.org/wiki/Idempotence[idempotent^], i.e. Mutations can be 
applied multiple times while changing the state only once.
+
+CommitLog segments are shared between tables so that all incoming writes land 
in the same segment. At any point in time, there is:
+
+* an `allocating` segment that accepts new mutations
+* an `available` segment to be used next 
+* 0 or more `active` segments to be deleted once the corresponding memtables 
are flushed. 
+
+As soon as the `allocating` segment exceeds 
`https://github.com/apache/cassandra/blob/cassandra-4.1/conf/cassandra.yaml#L500-L517[commitlog_segment_size^]`
 (32MiB by default), the database syncs it to disk and switches to the next 
available segment. *Figure 1* below illustrates different segment types and 
their function.
+
+:!figure-caption:
+
+.*Figure 1*. _The Segment Lifecycle. The numbers are globally increasing 
positions in segments. The allocating segment accepts new mutations. Once it is 
full, CommitLog marks it as active and starts allocating to the pre-baked 
available segment. As soon as there are no dirty mutations in an active 
segment, CommitLog removes the segment._
+image::blog/Allocating-and-active-segments.png[Allocating and active segments]
+
+Cassandra can only delete a segment after all its mutations are persisted in 
SSTables. Knowing if a file does not hold any mutations that haven’t been 
flushed yet requires a bit of bookkeeping. 
+
+Each segment maintains a hash table with `dirty` intervals. Dirty intervals 
contain mutation positions that haven’t yet been flushed as SSTables. *Figure 
2* illustrates how the CommitLog maintains dirty positions for each segment.
+
+:!figure-caption:
+
+.*Figure 2*. _Each segment maintains 1 hash map for dirty intervals in the 
form of_ `[table id -> intervals]`. _This figure demonstrates a segment with 
the dirty map equal to_ `{ Table 1: [[9, 11)], Table 2: [[7, 9), [13, 15)] }`.
+image::blog/Dirty-intervals.png[Dirty intervals]
+
+Each memtable maintains high and low CommitLog positions to mark the 
corresponding mutations as clean on flush (see *Figure 3*). The high position 
is the position of the latest mutation written to CommitLog; memtables update 
it on each new write. The low position is a high position of a previously 
flushed memtable. The low position cannot change anymore as that memtable no 
longer accepts writes.
+
+:!figure-caption:
+
+.*Figure 3*. _Memtables maintain low and high CommitLog positions. The low CL 
position of the i+1th Memtable is a high position in the i-th Memtable._
+image::blog/Memtables-and-CommitLog.png[Memtables and CommitLog]
+
+On memtable flush, Cassandra marks the corresponding CommitLog positions as 
clean. As soon as the entire segment is clean, the CommitLog deletes it. 
+
+=== Appending to the CommitLog
+
+In the previous section, we learned that the CommitLog appends mutations from 
different tables to the same segment. The benefit of this approach is faster 
flush due to sequential write I/O. But doesn’t it create contention when 
concurrent requests write to the same segment? Let’s see now how Cassandra 
addresses this issue.
+
+Appending to the CommitLog takes several steps. First, the CommitLog reserves 
an in-memory buffer in the allocating segment and writes the serialized 
mutation to the allocated space. Then the CommitLog flushes the entire segment 
block to disk by calling 
https://docs.oracle.com/javase/8/docs/api/java/nio/channels/FileChannel.html#force-boolean-[FileChannel.force()^].
 
+
+The only contention point for concurrent writes is allocating space in the 
in-memory buffer, a relatively fast operation.
+
+Flushing to disk happens according to the 
`https://github.com/apache/cassandra/blob/cassandra-4.1/conf/cassandra.yaml#L472-L493[commitlog_sync^]`
 configuration property. It supports the following options:
+
+* `periodic` (default) - a write is successful after writing to a buffer in 
memory. Sync to disk happens every `commitlog_sync_period_in_ms` (10,000ms by 
default) or after reaching the segment size limit.
+* `batch` - a write is successful only after flushing to disk. Every mutation 
invokes sync (note: `commitlog_sync_batch_window_in_ms` is ignored by Apache 
Cassandra 4.0).
+* `group` - a write is successful only after flushing to disk. Mutations form 
a group (hence the name) that waits for the same sync that happens every 
`commitlog_sync_group_window_in_ms` (1,000ms by default).
+
+With `periodic` mode, the server does not wait for the sync to disk and 
responds to the client after writing Mutation(s) to the in-memory buffer. While 
`commitlog_sync_period_in_ms` *acts* as an upper bound for the sync frequency, 
usually, the main sync trigger in workloads for Cassandra is the allocating 
segment reaching its maximum size. Accordingly, one can decrease the expected 
time to sync by reducing the segment size controlled by the 
`commitlog_segment_size` option. As a side effe [...]
+
+Decoupling of syncing to disk from acknowledging requests reduces an upper 
bound on throughput and lower bound on latency and provides a trade-off between 
sync frequency and durability via `commitlog_sync_period_in_ms` option. A 
potential data loss scenario for already acknowledged writes is simultaneous 
OS/hardware crashes on multiple replicas within the sync period.
+
+Alternative sync strategies are `batch` and `group`. The `batch` *strategy* is 
essentially a paranoid option that ensures that every successful write is 
persisted to disk. Rarely required, thorough evaluation is recommended before 
using the feature. With the `group` strategy, write requests will be delayed up 
to `commitlog_sync_group_window_in_ms` depending on how long ago the previous 
sync happened. This option allows balancing throughput and latency by changing 
the window size. A bigge [...]
+
+=== CommitLog Segment Types
+
+The previous section described _how_ CommitLog appends and flushes data. In 
this section, we will go through _what_ the CommitLog writes to disk, i.e., the 
structure of 
https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/commitlog/CommitLogSegment.java#L60[CommitLog
 segments^].
+
+Cassandra supports three segment types: memory-mapped, compressed, and 
encrypted. The database selects a segment type to use depending on 
`commitlog_compression` and `transparent_data_encryption_options` configuration 
options in `cassandra.yaml`. `commitlog_compression` controls segment 
compression and supports three compression types: _LZ4_, _Snappy_, and 
_Deflate_. The latter option controls data encryption on disk, including both 
CommitLog segments and hints. Cassandra uses encrypted  [...]
+
+Let’s describe a layout of a memory-mapped segment and build on top of it to 
show how compressed and encrypted segments work. All segment types use the same 
pattern. Any data in a segment is followed by its checksum so that readers can 
discard only corrupted data and recover as much information as possible on 
error. A segment starts with a header that contains information about its 
version, compression, and encryption. The header format is the same for all 
segment types. Sync blocks that [...]
+
+:!figure-caption:
+
+.*Figure 4*. _The layout of a memory-mapped segment. The header consists of a 
version, a segment ID, parameters, and CRC. The version is incremented if there 
are changes in the CommitLog structure. ID is a unique segment identifier. 
Parameter length describes how much space the parameters block occupies. The 
parameters block contains a JSON string with compression and encryption 
parameters. CRC finishes the header. A sync block starts with a marker followed 
by the mutations. The sync mar [...]
+image::blog/Mmaped-Segment-layout.png[Mmaped Segment layout]
+
+While memory-mapped segments maintain a single memory-mapped file that is 
periodically flushed to disk, compressed and encrypted segments use in-memory 
fixed-size buffers to serialize, compress, and encrypt mutations. Besides that, 
sync markers of compressed and encrypted segments contain an additional value: 
the total size of uncompressed data. The compressed segment compresses the 
entire in-memory buffer with mutations before writing them to the segment file. 
See *Figure 5* for the det [...]
+ 
+:!figure-caption:
+
+.*Figure 5*. _The layout of a compressed segment. Sync marker has an 
additional field - uncompressed size._
+image::blog/Compressed-Segment-layout.png[Compressed Segment layout]
+
+Unlike compressed segments, encrypted segments write mutations in data blocks. 
These blocks are small chunks whose size is controlled by 
`transparent_data_encryption_options.chunk_length_kb`. Each data block is 
compressed, encrypted, and written to the segment file individually. See 
*Figure 6* for details on the layout of each data block.
+
+:!figure-caption:
+
+.*Figure 6*. _The layout of an encrypted segment. The total block length and 
length of encrypted compressed data are unencrypted. The length of unencrypted 
compressed data as well as the data itself are encrypted._
+image::blog/Encrypted-Segment-layout.png[Encrypted Segment layout]
+
+=== Segment Recycling
+
+At this point, we need to clarify the meaning of the term ‘segment recycling,’ 
which occurs in the Cassandra 
https://cassandra.apache.org/doc/latest/cassandra/architecture/storage_engine.html[documentation^]
 and the codebase. Segment recycling was introduced in Cassandra 1.1.0 and 
removed in 2.2.0.
+
+Back in version 1.1.0 
(https://issues.apache.org/jira/browse/CASSANDRA-3411[CASSANDRA-3411^]), 
Cassandra pre-allocated empty 128MiB files as Commit Log segments. The idea 
behind pre-allocation was to avoid changing the metadata on append. 
Accordingly, recycling old segments amortized pre-allocation overhead for 
subsequent segments. Instead of deleting clean segments, Cassandra wrote an 
`end-of-segment` marker at the file's beginning. New writes overwrote the 
marker. Restoring from an emp [...]
+
+Segment recycling was removed in Cassandra 2.2.0 
(https://issues.apache.org/jira/browse/CASSANDRA-6809[CASSANDRA-6809^]). In 
practice, recycling didn’t demonstrate significant performance improvements 
(https://issues.apache.org/jira/browse/CASSANDRA-8771[CASSANDRA-8771^]) while 
complicating segment lifecycle and introducing non-trivial bugs (for example, 
https://issues.apache.org/jira/browse/CASSANDRA-8729[CASSANDRA-8729^]). 
Starting from 2.2.0, recycling a segment means closing the file [...]
+
+=== Change-Data-Capture (CDC)
+
+This section describes Change-Data-Capture in the context of the CommitLog and 
refers to the state of CDC as of C* 4.0 
(https://issues.apache.org/jira/browse/CASSANDRA-12148[CASSANDRA-12148^]). For 
a complete CDC guide, please refer to the 
https://cassandra.apache.org/doc/latest/operating/cdc.html[documentation^]. 
https://en.wikipedia.org/wiki/Change_data_capture[Change-Data-Capture^] allows 
external consumers to consume new writes that happen on the cluster. CDC is 
configured per-table  [...]
+
+CDC in Cassandra exposes synced parts of CommitLog segments to external 
consumers. On sync, CDC creates a hard link in `cdc_raw_directory` and a 
`<segment_file>_cdc.idx` file. This index file holds the offset for the final 
byte of the last sync block in the corresponding segment. Consumers should read 
the segment only until the specified offset as it indicates the point where the 
segment was safely persisted on disk.
+
+Once the segment is discarded, the index file contains the word `COMPLETED.` 
It is the responsibility of the consumer to delete hard links to read segments. 
If the folder fills up to its max allowed space, `cdc_free_space_in_mb`, new 
writes on this table are rejected.
+
+The CommitLog is one of the key components of Apache Cassandra as it offers 
one of the most important database guarantees: durability. In this article, we 
covered the CommitLog from multiple perspectives. First, we presented its role 
in the write path and its interactions with other database components. Then, we 
discussed the specifics of the sync mechanism as well as relevant 
configuration. After that, we looked into different segment types and their 
on-disk representation, as well as t [...]
+
+If you would like to learn more about the CommitLog, you can follow the JIRA 
issues linked in this article and ask questions on the 
xref:community.adoc[Mailing List^] and https://the-asf.slack.com/[ASF Slack^] 
in the #cassandra Slack channel.
+
+Thanks to Frank Rosner, Branimir Lambov, and Chris Thornett for their 
discussions and corrections.


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to