This is an automated email from the ASF dual-hosted git repository. mck pushed a commit to branch trunk in repository https://gitbox.apache.org/repos/asf/cassandra-website.git
The following commit(s) were added to refs/heads/trunk by this push: new 2db0aea21 Sept 2022 blog "Learn How CommitLog Works in Apache Cassandra" 2db0aea21 is described below commit 2db0aea213a6b73b72d9c386c29585f7a1361977 Author: Diogenese Topper <diotop...@gmail.com> AuthorDate: Wed Sep 14 18:59:24 2022 -0700 Sept 2022 blog "Learn How CommitLog Works in Apache Cassandra" patch by Alex Sorokoumov, Chris Thornett, Diogenese Topper; reviewed by Mick Semb Wever for CASSANDRA-17860 Co-authored by: Alex Sorokoumov Co-authored by: Chris Thornett <ch...@constantia.io> Co-authored by: Diogenese Topper <diogen...@constantia.io> --- .../images/blog/Allocating-and-active-segments.png | Bin 0 -> 661957 bytes .../ROOT/images/blog/Compressed-Segment-layout.png | Bin 0 -> 54667 bytes .../modules/ROOT/images/blog/Dirty-intervals.png | Bin 0 -> 23619 bytes .../ROOT/images/blog/Encrypted-Segment-layout.png | Bin 0 -> 93448 bytes ...rks-in-Apache-Cassandra-unsplash-sandip-roy.jpg | Bin 0 -> 158681 bytes .../ROOT/images/blog/Memtables-and-CommitLog.png | Bin 0 -> 113334 bytes .../ROOT/images/blog/Mmaped-Segment-layout.png | Bin 0 -> 54194 bytes .../modules/ROOT/images/blog/Segment layout.png | Bin 0 -> 52523 bytes site-content/source/modules/ROOT/pages/blog.adoc | 25 ++++ ...rn-How-CommitLog-Works-in-Apache-Cassandra.adoc | 139 +++++++++++++++++++++ 10 files changed, 164 insertions(+) diff --git a/site-content/source/modules/ROOT/images/blog/Allocating-and-active-segments.png b/site-content/source/modules/ROOT/images/blog/Allocating-and-active-segments.png new file mode 100644 index 000000000..939b6d4a8 Binary files /dev/null and b/site-content/source/modules/ROOT/images/blog/Allocating-and-active-segments.png differ diff --git a/site-content/source/modules/ROOT/images/blog/Compressed-Segment-layout.png b/site-content/source/modules/ROOT/images/blog/Compressed-Segment-layout.png new file mode 100644 index 000000000..18f19d188 Binary files /dev/null and 
b/site-content/source/modules/ROOT/images/blog/Compressed-Segment-layout.png differ diff --git a/site-content/source/modules/ROOT/images/blog/Dirty-intervals.png b/site-content/source/modules/ROOT/images/blog/Dirty-intervals.png new file mode 100644 index 000000000..30e7676d4 Binary files /dev/null and b/site-content/source/modules/ROOT/images/blog/Dirty-intervals.png differ diff --git a/site-content/source/modules/ROOT/images/blog/Encrypted-Segment-layout.png b/site-content/source/modules/ROOT/images/blog/Encrypted-Segment-layout.png new file mode 100644 index 000000000..bfaa3654d Binary files /dev/null and b/site-content/source/modules/ROOT/images/blog/Encrypted-Segment-layout.png differ diff --git a/site-content/source/modules/ROOT/images/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra-unsplash-sandip-roy.jpg b/site-content/source/modules/ROOT/images/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra-unsplash-sandip-roy.jpg new file mode 100644 index 000000000..2e1224b27 Binary files /dev/null and b/site-content/source/modules/ROOT/images/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra-unsplash-sandip-roy.jpg differ diff --git a/site-content/source/modules/ROOT/images/blog/Memtables-and-CommitLog.png b/site-content/source/modules/ROOT/images/blog/Memtables-and-CommitLog.png new file mode 100644 index 000000000..b0a6034ac Binary files /dev/null and b/site-content/source/modules/ROOT/images/blog/Memtables-and-CommitLog.png differ diff --git a/site-content/source/modules/ROOT/images/blog/Mmaped-Segment-layout.png b/site-content/source/modules/ROOT/images/blog/Mmaped-Segment-layout.png new file mode 100644 index 000000000..63a1269af Binary files /dev/null and b/site-content/source/modules/ROOT/images/blog/Mmaped-Segment-layout.png differ diff --git a/site-content/source/modules/ROOT/images/blog/Segment layout.png b/site-content/source/modules/ROOT/images/blog/Segment layout.png new file mode 100644 index 000000000..fc2168021 Binary files /dev/null and 
b/site-content/source/modules/ROOT/images/blog/Segment layout.png differ diff --git a/site-content/source/modules/ROOT/pages/blog.adoc b/site-content/source/modules/ROOT/pages/blog.adoc index 2820f57be..4d74c1b7c 100644 --- a/site-content/source/modules/ROOT/pages/blog.adoc +++ b/site-content/source/modules/ROOT/pages/blog.adoc @@ -8,6 +8,31 @@ NOTES FOR CONTENT CREATORS - Replace post tile, date, description and link to you post. //// +//start card +[openblock,card shadow relative test] +---- +[openblock,card-header] +------ +[discrete] +=== Learn How CommitLog Works in Apache Cassandra +[discrete] +==== September 26, 2022 +------ +[openblock,card-content] +------ +Learn how Apache Cassandra’s CommitLog works, how Cassandra ensures data durability, and how various tuning parameters affect its behavior. + +[openblock,card-btn card-btn--blog] +-------- + +[.btn.btn--alt] +xref:blog/Learn-How-CommitLog-Works-in-Apache-Cassandra.adoc[Read More] +-------- + +------ +---- +//end card + //start card [openblock,card shadow relative test] ---- diff --git a/site-content/source/modules/ROOT/pages/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra.adoc b/site-content/source/modules/ROOT/pages/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra.adoc new file mode 100644 index 000000000..16f98af2f --- /dev/null +++ b/site-content/source/modules/ROOT/pages/blog/Learn-How-CommitLog-Works-in-Apache-Cassandra.adoc @@ -0,0 +1,139 @@ += Learn How CommitLog Works in Apache Cassandra +:page-layout: single-post +:page-role: blog-post +:page-post-date: September 26, 2022 +:page-post-author: Alex Sorokoumov +:description: A Comprehensive Guide to CommitLog +:keywords: + +:!figure-caption: + +._Image credit: https://unsplash.com/@sandiproy_kolkata[Sandip Roy on Unsplash^]_ +image::blog/Learn-How-CommitLog-Works-in-Apache-Cassandra-unsplash-sandip-roy.jpg[Golden Bridge, Hòa Ninh, Hòa Vang, Danang, Vietnam. 
A bridge held up by stone hands] + +CommitLog (aka write-ahead log, WAL) is a standard component of many databases. In Apache Cassandra, it is an efficient append-only on-disk data structure that guarantees durability. + +Learning more about how CommitLog works will be helpful to database administrators who want to better understand the guarantees and trade-offs Cassandra provides. This post also serves as an introduction for any users who want to dig into this subsystem. Finally, database enthusiasts and developers might find it interesting to read how Cassandra’s write-ahead log is implemented in practice. + +As part of our overview of the CommitLog features, we will go through the following: + +* A Recap of the Write Path +* An overview of the CommitLog Lifecycle +* How to Append to the CommitLog +* CommitLog Segment Types +* Segment Recycling +* CommitLog and Change-Data-Capture(CDC) + +=== Write Path Recap + +This section briefly summarizes the Cassandra write path to establish the role CommitLog plays in the database system. + +When Cassandra accepts new write requests, it saves new mutations to an in-memory write-back cache called a memtable and appends them to the CommitLog. The former allows serving reads without accessing the disk, while the latter guarantees durability. If Cassandra crashes before flushing the memtable, it will restore acknowledged writes by replaying the CommitLog. + +Once the database flushes a memtable to disk as an https://cassandra.apache.org/doc/latest/cassandra/architecture/storage_engine.html#sstables[SSTable], which is an immutable file for persisting data, it can eliminate the corresponding log entries. We are going to learn how this happens in the next section. + +=== CommitLog Lifecycle Explained + +This section describes the CommitLog structure and how it knows what data to keep or remove. + +The CommitLog is an append-only data structure comprising a series of segments - files stored on disk. 
Segments persist `mutations` - internal objects containing information about new writes. Besides the changed rows, mutations contain relevant metadata - keyspace and table names, creation timestamp, GC grace seconds, etc. Mutations are https://en.wikipedia.org/wiki/Idempotence[idempotent^], i.e., they can be applied multiple times while changing the state only once. + +CommitLog segments are shared between tables so that all incoming writes land in the same segment. At any point in time, there is: + +* an `allocating` segment that accepts new mutations +* an `available` segment to be used next +* 0 or more `active` segments to be deleted once the corresponding memtables are flushed. + +As soon as the `allocating` segment exceeds `https://github.com/apache/cassandra/blob/cassandra-4.1/conf/cassandra.yaml#L500-L517[commitlog_segment_size^]` (32MiB by default), the database syncs it to disk and switches to the next available segment. *Figure 1* below illustrates the different segment states and their functions. + +:!figure-caption: + +.*Figure 1*. _The Segment Lifecycle. The numbers are globally increasing positions in segments. The allocating segment accepts new mutations. Once it is full, CommitLog marks it as active and starts allocating to the pre-baked available segment. As soon as there are no dirty mutations in an active segment, CommitLog removes the segment._ +image::blog/Allocating-and-active-segments.png[Allocating and active segments] + +Cassandra can only delete a segment after all its mutations are persisted in SSTables. Knowing whether a segment still holds mutations that haven’t been flushed yet requires a bit of bookkeeping. + +Each segment maintains a hash table with `dirty` intervals. Dirty intervals contain mutation positions that haven’t yet been flushed to SSTables. *Figure 2* illustrates how the CommitLog maintains dirty positions for each segment. + +:!figure-caption: + +.*Figure 2*. 
_Each segment maintains 1 hash map for dirty intervals in the form of_ `[table id -> intervals]`. _This figure demonstrates a segment with the dirty map equal to_ `{ Table 1: [[9, 11)], Table 2: [[7, 9), [13, 15)] }`. +image::blog/Dirty-intervals.png[Dirty intervals] + +Each memtable maintains high and low CommitLog positions to mark the corresponding mutations as clean on flush (see *Figure 3*). The high position is the position of the latest mutation written to CommitLog; memtables update it on each new write. The low position is the high position of the previously flushed memtable. The low position can no longer change, as that memtable no longer accepts writes. + +:!figure-caption: + +.*Figure 3*. _Memtables maintain low and high CommitLog positions. The low CommitLog position of the (i+1)-th Memtable is the high position of the i-th Memtable._ +image::blog/Memtables-and-CommitLog.png[Memtables and CommitLog] + +On memtable flush, Cassandra marks the corresponding CommitLog positions as clean. As soon as the entire segment is clean, the CommitLog deletes it. + +=== Appending to the CommitLog + +In the previous section, we learned that the CommitLog appends mutations from different tables to the same segment. The benefit of this approach is faster flush due to sequential write I/O. But doesn’t it create contention when concurrent requests write to the same segment? Let’s see how Cassandra addresses this issue. + +Appending to the CommitLog takes several steps. First, the CommitLog reserves an in-memory buffer in the allocating segment and writes the serialized mutation to the allocated space. Then the CommitLog flushes the entire segment block to disk by calling https://docs.oracle.com/javase/8/docs/api/java/nio/channels/FileChannel.html#force-boolean-[FileChannel.force()^]. + +The only contention point for concurrent writes is allocating space in the in-memory buffer, a relatively fast operation. 
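To make the single contention point concrete, here is a minimal sketch of reserving space with a compare-and-set loop and then writing into the reserved region without further coordination. It is an illustration under stated assumptions, not Cassandra's actual `CommitLogSegment` code: the class `SegmentBufferSketch` and its methods `reserve`, `write`, and `allocated` are hypothetical names, and a heap `ByteBuffer` stands in for the real segment buffer.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: illustrates the reservation idea only,
// it is NOT Cassandra's CommitLogSegment implementation.
final class SegmentBufferSketch {
    private final ByteBuffer buffer;
    // Next free offset in the buffer; the only state writers contend on.
    private final AtomicInteger allocateOffset = new AtomicInteger(0);

    SegmentBufferSketch(int capacity) {
        this.buffer = ByteBuffer.allocate(capacity);
    }

    /** Reserves size bytes; returns the start offset, or -1 if the request does not fit. */
    int reserve(int size) {
        while (true) {
            int start = allocateOffset.get();
            // Written as a subtraction so a huge size cannot overflow the int.
            if (size < 0 || size > buffer.capacity() - start) {
                return -1; // caller would switch to the next segment
            }
            // Only one thread wins the CAS for a given start; losers retry.
            if (allocateOffset.compareAndSet(start, start + size)) {
                return start;
            }
        }
    }

    /** Copies a serialized mutation into a region returned earlier by reserve(). */
    void write(int start, byte[] serializedMutation) {
        // duplicate() shares the bytes but has its own position,
        // so writers to disjoint regions do not interfere.
        ByteBuffer view = buffer.duplicate();
        view.position(start);
        view.put(serializedMutation);
    }

    /** Number of bytes reserved so far. */
    int allocated() {
        return allocateOffset.get();
    }
}
```

A writer would call `reserve(serialized.length)`, fall back to a fresh segment when it gets `-1`, and otherwise call `write` with the returned offset. Because each reservation covers a disjoint byte range, the copy itself needs no lock.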
+ +Flushing to disk happens according to the `https://github.com/apache/cassandra/blob/cassandra-4.1/conf/cassandra.yaml#L472-L493[commitlog_sync^]` configuration property. It supports the following options: + +* `periodic` (default) - a write is successful after writing to a buffer in memory. Sync to disk happens every `commitlog_sync_period_in_ms` (10,000ms by default) or after reaching the segment size limit. +* `batch` - a write is successful only after flushing to disk. Every mutation invokes sync (note: `commitlog_sync_batch_window_in_ms` is ignored by Apache Cassandra 4.0). +* `group` - a write is successful only after flushing to disk. Mutations form a group (hence the name) that waits for the same sync that happens every `commitlog_sync_group_window_in_ms` (1,000ms by default). + +With `periodic` mode, the server does not wait for the sync to disk and responds to the client after writing Mutation(s) to the in-memory buffer. While `commitlog_sync_period_in_ms` *acts* as an upper bound on the time between syncs, usually, the main sync trigger in workloads for Cassandra is the allocating segment reaching its maximum size. Accordingly, one can decrease the expected time to sync by reducing the segment size controlled by the `commitlog_segment_size` option. As a side effe [...] + +Decoupling syncing to disk from acknowledging requests raises the upper bound on throughput, lowers the lower bound on latency, and exposes the trade-off between sync cost and durability through the `commitlog_sync_period_in_ms` option. A potential data loss scenario for already acknowledged writes is simultaneous OS/hardware crashes on multiple replicas within the sync period. + +Alternative sync strategies are `batch` and `group`. The `batch` *strategy* is essentially a paranoid option that ensures that every successful write is persisted to disk. It is rarely required, and thorough evaluation is recommended before using it. 
With the `group` strategy, write requests will be delayed up to `commitlog_sync_group_window_in_ms` depending on how long ago the previous sync happened. This option allows balancing throughput and latency by changing the window size. A bigge [...] + +=== CommitLog Segment Types + +The previous section described _how_ CommitLog appends and flushes data. In this section, we will go through _what_ the CommitLog writes to disk, i.e., the structure of https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/db/commitlog/CommitLogSegment.java#L60[CommitLog segments^]. + +Cassandra supports three segment types: memory-mapped, compressed, and encrypted. The database selects a segment type to use depending on `commitlog_compression` and `transparent_data_encryption_options` configuration options in `cassandra.yaml`. `commitlog_compression` controls segment compression and supports three compression types: _LZ4_, _Snappy_, and _Deflate_. The latter option controls data encryption on disk, including both CommitLog segments and hints. Cassandra uses encrypted [...] + +Let’s describe a layout of a memory-mapped segment and build on top of it to show how compressed and encrypted segments work. All segment types use the same pattern. Any data in a segment is followed by its checksum so that readers can discard only corrupted data and recover as much information as possible on error. A segment starts with a header that contains information about its version, compression, and encryption. The header format is the same for all segment types. Sync blocks that [...] + +:!figure-caption: + +.*Figure 4*. _The layout of a memory-mapped segment. The header consists of a version, a segment ID, parameters, and CRC. The version is incremented if there are changes in the CommitLog structure. ID is a unique segment identifier. Parameter length describes how much space the parameters block occupies. 
The parameters block contains a JSON string with compression and encryption parameters. CRC finishes the header. A sync block starts with a marker followed by the mutations. The sync mar [...] +image::blog/Mmaped-Segment-layout.png[Mmaped Segment layout] + +While memory-mapped segments maintain a single memory-mapped file that is periodically flushed to disk, compressed and encrypted segments use in-memory fixed-size buffers to serialize, compress, and encrypt mutations. Besides that, sync markers of compressed and encrypted segments contain an additional value: the total size of uncompressed data. The compressed segment compresses the entire in-memory buffer with mutations before writing them to the segment file. See *Figure 5* for the det [...] + +:!figure-caption: + +.*Figure 5*. _The layout of a compressed segment. Sync marker has an additional field - uncompressed size._ +image::blog/Compressed-Segment-layout.png[Compressed Segment layout] + +Unlike compressed segments, encrypted segments write mutations in data blocks. These blocks are small chunks whose size is controlled by `transparent_data_encryption_options.chunk_length_kb`. Each data block is compressed, encrypted, and written to the segment file individually. See *Figure 6* for details on the layout of each data block. + +:!figure-caption: + +.*Figure 6*. _The layout of an encrypted segment. The total block length and length of encrypted compressed data are unencrypted. The length of unencrypted compressed data as well as the data itself are encrypted._ +image::blog/Encrypted-Segment-layout.png[Encrypted Segment layout] + +=== Segment Recycling + +At this point, we need to clarify the meaning of the term ‘segment recycling,’ which occurs in the Cassandra https://cassandra.apache.org/doc/latest/cassandra/architecture/storage_engine.html[documentation^] and the codebase. Segment recycling was introduced in Cassandra 1.1.0 and removed in 2.2.0. 
+ +Back in version 1.1.0 (https://issues.apache.org/jira/browse/CASSANDRA-3411[CASSANDRA-3411^]), Cassandra pre-allocated empty 128MiB files as Commit Log segments. The idea behind pre-allocation was to avoid changing the metadata on append. Accordingly, recycling old segments amortized pre-allocation overhead for subsequent segments. Instead of deleting clean segments, Cassandra wrote an `end-of-segment` marker at the file's beginning. New writes overwrote the marker. Restoring from an emp [...] + +Segment recycling was removed in Cassandra 2.2.0 (https://issues.apache.org/jira/browse/CASSANDRA-6809[CASSANDRA-6809^]). In practice, recycling didn’t demonstrate significant performance improvements (https://issues.apache.org/jira/browse/CASSANDRA-8771[CASSANDRA-8771^]) while complicating segment lifecycle and introducing non-trivial bugs (for example, https://issues.apache.org/jira/browse/CASSANDRA-8729[CASSANDRA-8729^]). Starting from 2.2.0, recycling a segment means closing the file [...] + +=== Change-Data-Capture (CDC) + +This section describes Change-Data-Capture in the context of the CommitLog and refers to the state of CDC as of C* 4.0 (https://issues.apache.org/jira/browse/CASSANDRA-12148[CASSANDRA-12148^]). For a complete CDC guide, please refer to the https://cassandra.apache.org/doc/latest/operating/cdc.html[documentation^]. https://en.wikipedia.org/wiki/Change_data_capture[Change-Data-Capture^] allows external consumers to consume new writes that happen on the cluster. CDC is configured per-table [...] + +CDC in Cassandra exposes synced parts of CommitLog segments to external consumers. On sync, CDC creates a hard link in `cdc_raw_directory` and a `<segment_file>_cdc.idx` file. This index file holds the offset for the final byte of the last sync block in the corresponding segment. Consumers should read the segment only until the specified offset as it indicates the point where the segment was safely persisted on disk. 
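As an illustration of the consumer side, the sketch below parses the contents of an index file and works out how many bytes of the hard-linked segment are safe to read. It rests on assumptions: the class `CdcIndexSketch` and its methods are hypothetical, and the layout it expects (the offset as a decimal number on the first line, optionally followed by a `COMPLETED` line once the segment is discarded) is our reading of the format, so please check it against the CDC documentation for your version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical consumer-side helper; not part of Cassandra.
final class CdcIndexSketch {
    final long safeOffset;   // last byte known to be synced to disk
    final boolean completed; // true once the segment will receive no more data

    CdcIndexSketch(long safeOffset, boolean completed) {
        this.safeOffset = safeOffset;
        this.completed = completed;
    }

    /** Parses index-file lines. Assumed layout: offset first, optional COMPLETED second. */
    static CdcIndexSketch parse(List<String> lines) {
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("empty index file");
        }
        long offset = Long.parseLong(lines.get(0).trim());
        boolean done = lines.size() > 1 && "COMPLETED".equals(lines.get(1).trim());
        return new CdcIndexSketch(offset, done);
    }

    /** Reads and parses an index file from disk. */
    static CdcIndexSketch read(Path indexFile) throws IOException {
        return parse(Files.readAllLines(indexFile, StandardCharsets.UTF_8));
    }

    /** Bytes that can be read safely, given how many were consumed already. */
    long readableBytes(long alreadyConsumed) {
        return Math.max(0L, safeOffset - alreadyConsumed);
    }
}
```

A consumer would poll `read(...)` for each index file, process at most `readableBytes(...)` bytes past its own bookmark, and stop polling a segment once `completed` is true.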
+ +Once the segment is discarded, the index file contains the word `COMPLETED`. It is the responsibility of the consumer to delete the hard links to segments it has finished reading. If the folder fills up to its max allowed space, `cdc_free_space_in_mb`, new writes to CDC-enabled tables are rejected. + +The CommitLog is one of the key components of Apache Cassandra as it offers one of the most important database guarantees: durability. In this article, we covered the CommitLog from multiple perspectives. First, we presented its role in the write path and its interactions with other database components. Then, we discussed the specifics of the sync mechanism as well as relevant configuration. After that, we looked into different segment types and their on-disk representation, as well as t [...] + +If you would like to learn more about the CommitLog, you can follow the JIRA issues linked in this article and ask questions on the xref:community.adoc[Mailing List^] and in the #cassandra channel on https://the-asf.slack.com/[ASF Slack^]. + +Thanks to Frank Rosner, Branimir Lambov, and Chris Thornett for their discussions and corrections.