Hi,

Previously there was a years-long thread about the 4 GB (32-bit) buffer cache constraint on amd64, ref https://marc.info/?t=146824436600004&r=1&w=2 .
What I gather is:

* The problem is that on amd64, DMA is limited to 32-bit addressing. I guess this is because, unlike amd64 CPUs, which all support 64-bit DMA, popular PCI peripherals and supporting hardware out there, like bridges, have DMA functionality limited to 32-bit addressing. (Is this a feature of lower-quality hardware, or of very old PCI devices, or is it systemic to the whole amd64 ecosystem today? Could a system be configured to use 64-bit DMA on amd64 and be expected to work, presuming recent or higher-quality / well-selected hardware?)

* The OS asks the disk hardware to load disk data to given memory locations via DMA, and then userland fread() and mmap() are fed that data directly - no further data moving or mapping is needed. These are the dynamics leading to the 4 GB cap. And the 4 GB cap is quite constraining for any computer with much RAM and lots of disk reading: many reads that wouldn't need to hit the disk (since the data could be cached using all that free memory) aren't cached and are directed to disk anyhow, which takes a lot of time. Is that right?

* This was recognized a long time ago, and Bob wrote a solution in the form of a "buffer cache flipper" that would push buffer cache data out of the 32-bit area (to "high memory", i.e. above 4 GB), hence lifting the limit, via a "(generic) backpressure" mechanism that as a bonus used the DMA engine to do the memory moving. I guess this means the buffer cache would be pretty much zero-cost to the CPU - sounds incredibly neat! And then it didn't really work: it malfunctioned and irritated people (it was "busted" - for reasons I never saw explained; why was it, actually?), and Theo wrote that it would be fixed in the future. Has it been fixed since?
Also - once fixed, fread() and mmap() reads of data that's in the buffer cache will be incredibly fast, right? In optimal conditions the mmap'ed addresses will already be mapped to the buffer cache data, and so mmap'ed buffer cache reads will have the speed of any memory access?

(The ML thread also mentioned an undeadly.org post discussing this topic; however, both searching and browsing, I can't find it. The closest I find is five words here: https://undeadly.org/cgi?action=article;sid=20170815171854 - do you have any URL?)

Last, OpenBSD's biggest limit as an OS seems to be that the disk/file subsystem is sequential. A modern SSD can read at 2.8 GB/s, but that requires parallelism; without multi-queueing, and with small reads of e.g. 4 KB or less, speeds stay around 70-120 MB/s, i.e. roughly 3.5% of the hardware's potential performance. This would be a really worthwhile goal to donate toward, for instance, in particular as OpenBSD leads the way in many other areas. Are there any thoughts about implementing this in the future?

Thanks,
Joseph