[ https://issues.apache.org/jira/browse/CASSANDRA-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17469325#comment-17469325 ]
Brandon Williams edited comment on CASSANDRA-17237 at 1/5/22, 1:57 PM:
-----------------------------------------------------------------------

This seems to be the crux of the issue:
{quote}
The default readahead for CentOS 7 VMs is 4 MB (as opposed to the default readahead for non-VM CentOS 7, which is 128 KB). Even though this is reduced by the kernel (cf `max_sane_readahead()`) to something around 450 KB, it is still far too large for an average Cassandra read.
{quote}
CASSANDRA-16436 will at least catch and warn about this so the user can hopefully set their RA to something reasonable.

I guess as a counter to distros that have pathologically set RA to the point that the kernel won't even allow it, we could revive disk_access_mode in the yaml for users who choose not to correct the RA. It was removed 12 years ago in 0ee0bc6d7be7f since mmap was clearly the superior mode, and likely still is with sane readahead settings, so I'm hesitant to muddy those waters again.

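As a rough illustration of what checking the RA could look like, here is a minimal sketch that reads the readahead value Linux exposes under `/sys/block/<device>/queue/read_ahead_kb`. This is not the check that CASSANDRA-16436 implements; the function names and the 64 KB threshold (borrowed from the "recommended 64kb" in the issue description) are assumptions made for the example.

```python
from pathlib import Path

# Assumed threshold: taken from the "recommended 64kb" in the issue
# description, not from Cassandra's source.
RECOMMENDED_MAX_KB = 64


def read_ahead_kb(device: str, sysfs_root: str = "/sys/block") -> int:
    """Return the readahead setting, in KB, that Linux exposes for a block device."""
    path = Path(sysfs_root) / device / "queue" / "read_ahead_kb"
    return int(path.read_text().strip())


def readahead_warning(device: str, sysfs_root: str = "/sys/block"):
    """Return a warning string if readahead exceeds the threshold, else None."""
    kb = read_ahead_kb(device, sysfs_root)
    if kb > RECOMMENDED_MAX_KB:
        return (f"{device}: readahead is {kb} KB, "
                f"above the suggested {RECOMMENDED_MAX_KB} KB")
    return None
```

Calling `readahead_warning("sda")` on a Linux host would inspect the device named `sda` (a hypothetical device name here). Note that `blockdev --setra` takes its argument in 512-byte sectors, so 64 KB corresponds to a value of 128.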
> Pathological interaction between Cassandra and readahead, particularly on
> CentOS 7 VMs
> --------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17237
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17237
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local/Config
>            Reporter: Daniel Cranford
>            Priority: Normal
>             Fix For: 4.x
>
>
> Cassandra defaults to using mmap for IO, except on 32 bit systems. The config
> value `disk_access_mode` that controls this isn't even included or
> documented in cassandra.yaml.
> While this may be a reasonable default config for Cassandra, we've noticed a
> pathological interplay between the way Linux implements readahead for mmap
> and Cassandra's IO patterns, particularly on vanilla CentOS 7 VMs.
> A read that misses all levels of cache in Cassandra is (typically) going to
> involve two IOs: one into the index file and one into the data file. These IOs
> will both be effectively random given the nature of the murmur3 hash
> partitioner.
> The amount of data read from the index file IO will be relatively small,
> perhaps 4-8 KB, compared to the data file IO which (assuming the entire
> partition fits in a single compressed chunk and a compression ratio of 1/2)
> will require 32 KB.
> However, applications using `mmap()` have no way to tell the OS the desired
> IO size. They can only tell the OS the desired IO location, by reading from
> the mapped address and triggering a page fault. This is unlike `read()`, where
> the application provides both the size and location to the OS. So for
> `mmap()` the OS has to guess how large the IO submitted to the backing device
> should be and whether the application is performing sequential or random IO,
> unless the application provides hints (eg `fadvise()`, `madvise()`,
> `readahead()`).
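
The contrast between `read()` and `mmap()` described above can be sketched in a few lines of Python. This is only an illustration, not Cassandra's IO path (which is Java); the helper names are made up for the example, and `mmap.madvise` with `MADV_RANDOM` is only available on some platforms and on Python 3.8 or later, so the hint is applied only when present.

```python
import mmap
import os
import tempfile


def read_with_pread(fd: int, offset: int, size: int) -> bytes:
    # Standard IO: the application tells the OS both where and how much to read.
    return os.pread(fd, size, offset)


def read_with_mmap(fd: int, offset: int, size: int, random_hint: bool = True) -> bytes:
    # Mapped IO: touching the mapping triggers page faults, and the kernel
    # decides how large the IO submitted to the device is.
    with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
        # Optional hint that access is random; guarded because it is
        # platform- and version-dependent.
        if random_hint and hasattr(m, "madvise") and hasattr(mmap, "MADV_RANDOM"):
            m.madvise(mmap.MADV_RANDOM)
        return m[offset:offset + size]


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(os.urandom(1024 * 1024))  # 1 MiB of sample data
        f.flush()
        a = read_with_pread(f.fileno(), 8192, 32 * 1024)
        b = read_with_mmap(f.fileno(), 8192, 32 * 1024)
        print(a == b)  # both paths return the same bytes
```

Both functions return identical data; the difference is in what the kernel is told. With `pread` it knows the request is 32 KB, while with the mapping it only sees a fault at some address and has to size the IO itself.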
> This is how Linux determines the size of IO for mmap during a page fault:
> * Outside of hints (eg FADV_RANDOM), the default IO size is the maximum
> readahead value with the faulting address in the middle of the IO, eg IO is
> requested for the range
> [fault_addr - max_readahead / 2, fault_addr + max_readahead / 2]. This
> is sometimes referred to as "read around" (ie read around the faulting
> address). See
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2989]
> * The kernel maintains a cache miss counter for the file. Every time the
> kernel submits an IO for a page fault, this counts as a miss. Every time the
> application faults in a page that is already in the page cache (presumably
> from a previous page fault's IO), this counts as a cache hit and decrements
> the counter. If the miss counter exceeds a threshold, the kernel stops
> inflating the IOs to the max readahead and falls back to reading a *single*
> 4 KB page for each page fault. See summary
> [here|https://www.quora.com/What-heuristics-does-the-adaptive-readahead-implementation-in-the-Linux-kernel-use/answer/Robert-Love-1]
> and implementation
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2955]
> and
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L3005]
> * This means an application that, on average, references more than one 4 KB
> page around the initial page fault will consistently have page fault IOs
> inflated to the maximum readahead value. Note, there is no ramping up of a
> window the way there is with standard IO. The kernel only submits IOs of 1
> page and max_readahead as far as I can tell.
> Observations:
> * mmap'ed IO on Linux wastes half the IO bandwidth. This may or may not be a
> big deal depending on your setup.
> * Cassandra will always have IOs inflated to the maximum readahead because
> more than 1 page is referenced for the data file and (depending on the size
> and cardinality of your keys) more than one page is referenced from the index
> file.
> * The device's readahead is a crude system wide knob for controlling IO size.
> Cassandra cannot perform smaller IOs for the index file (unless your keyset
> is such that only 1 page from the index file needs to be referenced).
> CentOS 7 VMs:
> * The default readahead for CentOS 7 VMs is 4 MB (as opposed to the default
> readahead for non-VM CentOS 7, which is 128 KB).
> * Even though this is reduced by the kernel (cf `max_sane_readahead()`) to
> something around 450 KB, it is still far too large for an average Cassandra
> read.
> * Even once this readahead is reduced to the recommended 64 KB, standard IO
> still has a 10% performance advantage in our tests, likely because the
> readahead algorithm for standard IO is more flexible and converges on smaller
> reads from the index file and larger reads from the data file.
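
To put rough numbers on the observations in the description, the following sketch models the read-around rule and the resulting read amplification. It is a simplified toy model, not the kernel's code: it assumes every page fault is inflated to the full maximum readahead, that one Cassandra read causes exactly two such faults (one in the index file, one in the data file), and that the useful data is the 4 KB + 32 KB from the description. The function names are invented for the example.

```python
PAGE = 4096


def read_around_window(fault_offset: int, max_readahead: int, file_size: int):
    """Byte range submitted for one fault under the described read-around rule.

    Simplification: the window starts max_readahead / 2 before the fault
    (clamped to the start of the file, aligned down to a page) and spans
    max_readahead bytes, clamped to the end of the file.
    """
    start = max(0, fault_offset - max_readahead // 2)
    start -= start % PAGE
    end = min(file_size, start + max_readahead)
    return start, end


def amplification(useful_bytes: int, max_readahead: int, faults: int = 2) -> float:
    """Bytes submitted to the device per useful byte, if every fault is inflated."""
    return faults * max_readahead / useful_bytes


if __name__ == "__main__":
    useful = (4 + 32) * 1024  # index read + data read, from the description
    for ra_kb in (450, 128, 64):
        ratio = amplification(useful, ra_kb * 1024)
        print(f"readahead {ra_kb} KB -> {ratio:.1f}x bytes read per useful byte")
```

Under these assumptions a 450 KB effective readahead reads about 25 times more data than the query needs, and even at 64 KB the device still transfers several times the useful payload, which is in line with the description's point that readahead is a crude, system-wide knob.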