[ https://issues.apache.org/jira/browse/CASSANDRA-19494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jon Haddad updated CASSANDRA-19494:
-----------------------------------
    Resolution: Duplicate
        Status: Resolved  (was: Triage Needed)

Will be resolved as part of CASSANDRA-15452, very exciting.

> Optimize I/O during table scans
> -------------------------------
>
>                 Key: CASSANDRA-19494
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19494
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jon Haddad
>            Priority: Normal
>         Attachments: reads.txt
>
>
> The storage engine reads chunk by chunk during table scans. We'd be much
> better off if we could perform larger I/O operations to an internal buffer,
> perform fewer I/O operations, and avoid making excessive system calls.
> For example, doing a scan against this table:
> {noformat}
> CREATE TABLE easy_cass_stress.keyvalue (
>     key text PRIMARY KEY,
>     value text
> ) WITH additional_write_policy = '99p'
>     AND allow_auto_snapshot = true
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND cdc = false
>     AND comment = ''
>     AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4'}
>     AND compression = {'chunk_length_in_kb': '16', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND memtable = 'default'
>     AND crc_check_chance = 1.0
>     AND default_time_to_live = 0
>     AND extensions = {}
>     AND gc_grace_seconds = 864000
>     AND incremental_backups = true
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair = 'BLOCKING'
>     AND speculative_retry = '99p';{noformat}
> I see the following I/O activity (sample only, see attachment for full
> accounting of all reads)
>
> {noformat}
> TIME     COMM        PID  T BYTES OFF_KB LAT(ms) FILENAME
> 16:59:23 ReadStage-2 2523 R 15051 0      0.02    nb-6-big-Data.db
> 16:59:23 ReadStage-2 2523 R 15049 0      0.01    nb-8-big-Data.db
> 16:59:23 ReadStage-2 2523 R 15025 0      0.01    nb-5-big-Data.db
> 16:59:23 ReadStage-2 2523 R 15064 0      0.01    nb-7-big-Data.db
> 16:59:25 ReadStage-2 2523 R 15051 0      0.01    nb-6-big-Data.db
> 16:59:25 ReadStage-2 2523 R 15049 0      0.01    nb-8-big-Data.db
> 16:59:25 ReadStage-2 2523 R 15025 0      0.01    nb-5-big-Data.db
> 16:59:25 ReadStage-2 2523 R 15064 0      0.00    nb-7-big-Data.db
> 16:59:25 ReadStage-2 2523 R 15064 14     0.01    nb-5-big-Data.db
> 16:59:25 ReadStage-2 2523 R 15051 0      0.01    nb-6-big-Data.db
> 16:59:25 ReadStage-2 2523 R 15049 0      0.00    nb-8-big-Data.db
> 16:59:25 ReadStage-2 2523 R 15064 14     0.00    nb-5-big-Data.db
> 16:59:25 ReadStage-2 2523 R 15064 0      0.00    nb-7-big-Data.db
> 16:59:25 ReadStage-2 2523 R 15012 29     0.01    nb-5-big-Data.db
> {noformat}
> with a sample of our off-cpu time looking like this (after dropping caches)
> {noformat}
> cpudist -O -p $(cassandra-pid) -m 1 30
>      msecs       : count    distribution
>        0 -> 1    : 5259     |****************************************|
>        2 -> 3    : 486      |***                                     |
>        4 -> 7    : 0        |                                        |
>        8 -> 15   : 1        |                                        |
>       16 -> 31   : 0        |                                        |
>       32 -> 63   : 29       |                                        |
>       64 -> 127  : 77       |                                        |
>      128 -> 255  : 4        |                                        |
>      256 -> 511  : 6        |                                        |
>      512 -> 1023 : 6        |                                        |
> {noformat}
> We pay a pretty serious throughput penalty for excessive I/O.
> We should be able to leverage the work in CASSANDRA-15452 for this.
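
The issue's core claim, that reading in larger units means fewer I/O operations and fewer system calls, can be illustrated with a minimal sketch. This is not Cassandra's read path and does not model compression or the chunk cache: it only counts unbuffered read calls over a scratch file for two request sizes. CHUNK_SIZE mirrors the chunk_length_in_kb of 16 from the table definition; SCAN_BUFFER_SIZE, count_reads, make_sample_file and compare are hypothetical names and values chosen purely for illustration.

```python
import os
import tempfile

CHUNK_SIZE = 16 * 1024          # mirrors chunk_length_in_kb = 16 in the table above
SCAN_BUFFER_SIZE = 256 * 1024   # hypothetical larger scan buffer (assumption)


def count_reads(path, request_size):
    """Scan a file front to back with unbuffered reads of request_size bytes.

    Returns (read_calls, total_bytes). Every os.read() call is counted,
    including the final one that returns no data at end of file.
    """
    calls = 0
    total = 0
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        while True:
            data = os.read(fd, request_size)
            calls += 1
            if not data:
                break
            total += len(data)
    finally:
        os.close(fd)
    return calls, total


def make_sample_file(size):
    """Create a scratch file of `size` random bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size))
    return path


def compare(size=4 * 1024 * 1024):
    """Return ((calls, bytes) for chunk-sized reads, (calls, bytes) for large reads)."""
    path = make_sample_file(size)
    try:
        small = count_reads(path, CHUNK_SIZE)
        large = count_reads(path, SCAN_BUFFER_SIZE)
    finally:
        os.remove(path)
    return small, large


if __name__ == "__main__":
    small, large = compare()
    print("chunk-sized reads:", small[0], "calls for", small[1], "bytes")
    print("large-buffer reads:", large[0], "calls for", large[1], "bytes")
```

Both scans return the same bytes; only the number of read calls differs. For a 4 MiB file, 16 KiB requests need about 16 times as many calls as 256 KiB requests, which is the kind of reduction the issue is asking for during table scans.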