[ https://issues.apache.org/jira/browse/CASSANDRA-8894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631203#comment-14631203 ]
Benedict commented on CASSANDRA-8894:
-------------------------------------

I may aim to integrate this work prior to our running full performance tests, as I would like to see this safely hit pre-3.0, and we know it is already effective for in-memory workloads. The open question now is the tuning parameters and how we might yet tweak them, and that's something that can be done much closer to release if we need to.

Some quick feedback:
* crossingProbability is always zero, I think? We need to use {{ / 4096d}}
* disk_optimization_record_size_percentile: disk_optimization_estimate_percentile?
* disk_optimization_crossing_chance: disk_optimization_page_cross_chance?

No super strong feelings about the names, though. Just suggestions; not 100% certain they're even better from my POV, nor that it's important. Otherwise this all LGTM, and I'm keen to commit.

When performance testing, we should figure out how (via cstar) we can tweak read-ahead settings on the machine. [~enigmacurry]: is there any way we could have that as a GUI option? This new code should make read-ahead a bad idea for SSD clusters, and disabling it will likely see standard mode become a superior option to mmap, since we can predict how much we should read better than the OS can.
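The {{ / 4096d}} remark refers to a classic integer-division pitfall. A minimal sketch of the bug and its fix (the method and variable names here are hypothetical, chosen only for illustration; the actual patch may differ):

```java
public class CrossingChance
{
    // Buggy version: both operands are ints, so the division truncates.
    // For any record size below 4096 the result is 0, and the computed
    // probability is always 0.0.
    static double crossingProbabilityBuggy(int recordSize)
    {
        return recordSize / 4096;
    }

    // Fixed version: the double literal 4096d promotes the division to
    // floating point, yielding the intended fractional probability.
    static double crossingProbabilityFixed(int recordSize)
    {
        return recordSize / 4096d;
    }

    public static void main(String[] args)
    {
        System.out.println(crossingProbabilityBuggy(140)); // 0.0
        System.out.println(crossingProbabilityFixed(140)); // ~0.0342
    }
}
```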
> Our default buffer size for (uncompressed) buffered reads should be smaller,
> and based on the expected record size
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8894
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8894
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Stefania
>              Labels: benedict-to-commit
>             Fix For: 3.x
>
>         Attachments: 8894_25pct.yaml, 8894_5pct.yaml, 8894_tiny.yaml
>
> A large contributor to buffered reads being slower than mmapped reads is
> likely that we read a full 64Kb at once, when average record sizes may be as
> low as 140 bytes in our stress tests. The TLB has only 128 entries on a
> modern core, and each read touches 32 of them, meaning we will almost never
> hit in the TLB and will incur at least 30 unnecessary misses each time (as
> well as the other costs of larger-than-necessary accesses). When working
> with an SSD there is little to no benefit to reading more than 4Kb at once,
> and in either case reading more data than we need is wasteful. So, I propose
> selecting a buffer size that is the next larger power of 2 than our average
> record size (with a minimum of 4Kb), so that we expect to read each record
> in a single operation. I also propose that we create a pool of these buffers
> up-front, and that we ensure they are all exactly aligned to a virtual page,
> so that the source and target operations each touch exactly one virtual page
> per 4Kb of expected record size.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
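The buffer-size rule the issue proposes (next power of two above the average record size, floored at one 4Kb page) can be sketched as follows. This is an illustrative helper under that description, not the actual Cassandra implementation; the class and method names are hypothetical:

```java
public class BufferSizing
{
    static final int MIN_BUFFER_SIZE = 4096; // one 4Kb virtual page

    // Round the average record size up to the next power of two,
    // with a 4Kb floor, so a typical record is read in one operation.
    static int bufferSizeFor(int averageRecordSize)
    {
        // Highest power of two <= averageRecordSize (at least 1).
        int size = Integer.highestOneBit(Math.max(averageRecordSize, 1));
        if (size < averageRecordSize)
            size <<= 1; // bump to the next larger power of two
        return Math.max(size, MIN_BUFFER_SIZE);
    }

    public static void main(String[] args)
    {
        System.out.println(bufferSizeFor(140));   // 4096 (4Kb floor applies)
        System.out.println(bufferSizeFor(5000));  // 8192
        System.out.println(bufferSizeFor(65536)); // 65536 (already a power of two)
    }
}
```

The page-aligned buffer pool mentioned alongside this would typically be built with direct ByteBuffers so their start addresses can be aligned to the 4Kb virtual-page boundary.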