[ 
https://issues.apache.org/jira/browse/CASSANDRA-17237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Miklosovic updated CASSANDRA-17237:
------------------------------------------
    Summary: Pathological interaction between Cassandra and readahead, 
particularly on Centos 7 VMs  (was: Pathalogical interaction between Cassandra 
and readahead, particularly on Centos 7 VMs)

> Pathological interaction between Cassandra and readahead, particularly on 
> Centos 7 VMs
> --------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17237
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17237
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Local/Config
>            Reporter: Daniel Cranford
>            Assignee: Brandon Williams
>            Priority: Normal
>             Fix For: 5.x
>
>
> Cassandra defaults to using mmap for IO, except on 32 bit systems. The config 
> value `disk_access_mode` that controls this isn't included in or documented 
> in cassandra.yaml.
> While this may be a reasonable default config for Cassandra, we've noticed a 
> pathological interplay between the way Linux implements readahead for mmap 
> and Cassandra's IO patterns, particularly on vanilla Centos 7 VMs.
> A read that misses all levels of cache in Cassandra is (typically) going to 
> involve two IOs: one into the index file and one into the data file. These 
> IOs will both be effectively random given the nature of the murmur3 hash 
> partitioner.
> The amount of data read from the index file IO will be relatively small, 
> perhaps 4-8kb, compared to the data file IO which (assuming the entire 
> partition fits in a single compressed chunk and a compression ratio of 1/2) 
> will require 32kb.
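The arithmetic above can be made explicit. A minimal sketch, assuming a 64 KiB uncompressed compression chunk (a common `chunk_length_in_kb` default) and the 1/2 compression ratio stated above; the constants are illustrative, not measured:

```python
# Back-of-envelope IO sizes for one cache-missing Cassandra read.
# Assumptions: 64 KiB uncompressed chunk, 0.5 compression ratio, and a
# small 4-8 KiB index-file read, as described in the text above.
CHUNK_SIZE = 64 * 1024        # uncompressed compression chunk (assumption)
COMPRESSION_RATIO = 0.5       # compressed size / uncompressed size

index_io = 8 * 1024                            # index-file read, ~4-8 KiB
data_io = int(CHUNK_SIZE * COMPRESSION_RATIO)  # one compressed chunk on disk

print(f"index IO ~{index_io // 1024} KiB, data IO ~{data_io // 1024} KiB")
```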
> However, applications using `mmap()` have no way to tell the OS the desired 
> IO size - they can only tell the OS the desired IO location - by reading from 
> the mapped address and triggering a page fault. This is unlike `read()` where 
> the application provides both the size and location to the OS. So for 
> `mmap()` the OS has to guess how large the IO submitted to the backing device 
> should be and whether the application is performing sequential or random IO 
> unless the application provides hints (eg `fadvise()`, `madvise()`, 
> `readahead()`).
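The hinting mechanism mentioned above can be sketched from userspace. A minimal example using Python's `mmap.madvise()` (Python 3.8+); `MADV_RANDOM` is only exposed on platforms that support it, hence the guard, and the claim that it suppresses read-around is the Linux-specific behavior discussed here:

```python
import mmap
import os
import tempfile

# Sketch: advise the kernel that accesses to this mapping will be random,
# so page-fault IO should not be inflated to the readahead window.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (4 * mmap.PAGESIZE))
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    mm = mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    if hasattr(mmap, "MADV_RANDOM"):        # platform-dependent constant
        mm.madvise(mmap.MADV_RANDOM)        # disable read-around on Linux
    first_byte = mm[0]                      # this fault reads ~one 4k page
    mm.close()
finally:
    os.close(fd)
    os.unlink(path)
```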
> This is how Linux determines the size of IO for mmap during a page fault:
>  * Outside of hints (eg FADV_RANDOM), the default IO size is the maximum 
> readahead value with the faulting address in the middle of the IO, eg IO is 
> requested for the range [fault_addr - max_readahead / 2, fault_addr + 
> max_readahead / 2]. This is sometimes referred to as "read around" (ie read 
> around the faulting address). See 
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2989]
>   * The kernel maintains a cache miss counter for the file. Every time the 
> kernel submits an IO for a page fault, this counts as a miss. Every time the 
> application faults in a page that is already in the page cache (presumably 
> from a previous page fault's IO), this counts as a hit and decrements the 
> counter. If the miss counter exceeds a threshold, the kernel stops inflating 
> the IOs to the max readahead and falls back to reading a *single* 4k page 
> for each page fault. See summary 
> [here|https://www.quora.com/What-heuristics-does-the-adaptive-readahead-implementation-in-the-Linux-kernel-use/answer/Robert-Love-1]
>  and implementation 
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L2955]
>  and 
> [here|https://github.com/torvalds/linux/blob/0c941cf30b913d4a684d3f8d9eee60e0daffdc79/mm/filemap.c#L3005]
>   * This means an application that, on average, references more than one 4k 
> page around the initial page fault will consistently have page fault IOs 
> inflated to the maximum readahead value. Note, there is no ramping up of a 
> window the way there is with standard IO. The kernel only submits IOs of one 
> page or of max_readahead size, as far as I can tell.
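The heuristic described in the bullets above can be modeled in a few lines. This is a toy sketch, not the kernel's implementation; the miss threshold and default readahead are illustrative constants, not the values of `MMAP_LOTSAMISS` or any particular device:

```python
# Toy model of Linux mmap fault readahead: read-around while the miss
# counter is low, single-page IO once misses exceed a threshold.
PAGE = 4 * 1024

class ReadAroundModel:
    def __init__(self, max_readahead=128 * 1024, miss_threshold=100):
        self.max_readahead = max_readahead      # illustrative default
        self.miss_threshold = miss_threshold    # illustrative threshold
        self.mmap_miss = 0

    def fault(self, addr, cached):
        """Return (start, length) of the IO issued for a fault, or None."""
        if cached:                              # page already in page cache
            self.mmap_miss = max(0, self.mmap_miss - 1)
            return None                         # no IO needed
        self.mmap_miss += 1
        if self.mmap_miss > self.miss_threshold:
            return (addr - addr % PAGE, PAGE)   # fall back: single 4k page
        start = max(0, addr - self.max_readahead // 2)
        return (start, self.max_readahead)      # read-around window

model = ReadAroundModel()
io_cold = model.fault(1024 * PAGE, cached=False)    # inflated read-around
model.mmap_miss = 500                               # simulate heavy missing
io_thrash = model.fault(1024 * PAGE, cached=False)  # single-page fallback
```

Note the two extremes: the cold fault is inflated to the full 128k window centred on the faulting page, while the thrashing fault reads exactly one page.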
> Observations:
> * mmap'ed IO on Linux wastes half the IO bandwidth. This may or may not be a 
> big deal depending on your setup.
> * Cassandra will always have IOs inflated to the maximum readahead because 
> more than one page is referenced from the data file and (depending on the 
> size and cardinality of your keys) more than one page is referenced from the 
> index file.
> * The device's readahead is a crude system wide knob for controlling IO size. 
> Cassandra cannot perform smaller IOs for the index file (unless your keyset 
> is such that only 1 page from the index file needs to be referenced).
> Centos 7 VMs:
> * The default readahead for Centos 7 VMs is 4MB (as opposed to the default 
> readahead for non-VM Centos 7 which is 128kb).
> * Even though this is reduced by the kernel (cf `max_sane_readahead()`) to 
> something around 450k, it is still far too large for an average Cassandra 
> read.
> * Even once this readahead is reduced to the recommended 64kb, standard IO 
> still has a 10% performance advantage in our tests, likely because the 
> readahead algorithm for standard IO is more flexible and converges on smaller 
> reads from the index file and larger reads from the data file.
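For reference, checking and converting the readahead setting discussed above can be sketched as follows. This assumes the Linux sysfs layout, and the device name "sda" is a placeholder; `blockdev --setra` takes 512-byte sectors, so the recommended 64kb corresponds to 128 sectors:

```python
from pathlib import Path

# Sketch: read a block device's readahead from sysfs (Linux only) and
# convert the recommended 64 KiB into the 512-byte-sector units that
# `blockdev --setra` expects. Device name "sda" is an assumption.
def read_ahead_kb(device="sda"):
    p = Path(f"/sys/block/{device}/queue/read_ahead_kb")
    return int(p.read_text()) if p.exists() else None

RECOMMENDED_KB = 64
recommended_sectors = RECOMMENDED_KB * 1024 // 512  # 64 KiB -> 128 sectors

current = read_ahead_kb()
if current is not None and current > RECOMMENDED_KB:
    print(f"readahead is {current} kb; consider `blockdev --setra "
          f"{recommended_sectors} /dev/sda`")
```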



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
