Hi,

On 2025-07-22 22:50:00 +0200, Tomas Vondra wrote:
> Yes. It's definitely true we could construct examples where the complex
> patch beats the simple one for this reason. And I believe some of those
> examples could be quite realistic, even if not very common (like when
> very few index tuples fit on a leaf page).
>
> However, I'm not sure the pgbench example with only 6 heap blocks per
> leaf is very significant. Sure, the simple patch can't prefetch TIDs
> from the following leaf, but AFAICS the complex patch won't do that
> either. Not because it couldn't, but because with that many hits the
> distance will drop to ~1 (or close to it). (It'll probably prefetch a
> couple TIDs from the next leaf at the very end of the page, but I don't
> think that matters overall.)
>
> I'm not sure what prefetch distances will be sensible in queries that do
> other stuff. The queries in the benchmark do just the index scan, but if
> the query does something with the tuple (in the nodes on top), that
> shortens the required prefetch distance. Of course, simple queries will
> benefit from prefetching far ahead.

That may be true with local fast NVMe disks, but won't be true for networked
storage like in common clouds. Latencies of 0.3 - 4ms leave a lot of CPU
cycles for actual processing of the data.

The high latencies for such storage also mean that you need fairly deep
queues and that missing prefetches can introduce substantial slowdowns. A
hypothetical disk that can do 20k iops at 3ms latency needs an average IO
depth of 60. If you have a bubble after every few dozen IOs, you're not going
to reach that effective IO depth.

And even for local NVMes, the IO depth required to fully utilize the capacity
for small random IO can be fairly high. I have a raid-10 of four SSDs that
peaks at a depth of around 350.

Also, plenty of indexes are on multiple columns and/or wider datatypes, making
bubbles triggered by "crossing-the-leaf-page" more common.


> Thanks. I wonder how difficult would it be to add something like this to
> pgstattuple. I mean, it shouldn't be difficult to look at leaf pages and
> count distinct blocks, right? Seems quite useful.

+1


> Explain would also greatly benefit from tracking something like this.
> The buffer "hits" and "reads" can be very difficult to interpret.

Indeed. I actually observed that sometimes the reason that the real iodepth
(i.e. measured at the OS level) ends up lower than one would hope is that,
while prefetching, we again need a heap buffer that is already being
prefetched. Currently the behaviour in that case is to synchronously wait for
IO on that buffer to complete. That obviously causes a "pipeline bubble"...

Greetings,

Andres Freund
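
For reference, the IO-depth figure in the message (20k iops at 3ms latency
needing an average depth of 60) follows from Little's law: average IOs in
flight = throughput * latency. The sketch below reproduces that arithmetic.
The second function is not from the message; it is a deliberately crude,
hypothetical model of a "bubble", assuming that each bubble costs one full IO
latency during which nothing new is issued, and that the queue otherwise runs
at the target depth. The function names and the choice of 36 for "a few dozen"
IOs are illustrative assumptions.

```python
def required_io_depth(iops: float, latency_s: float) -> float:
    """Little's law: average IOs in flight = throughput * latency."""
    return iops * latency_s


def effective_iops_with_bubbles(
    target_depth: float, latency_s: float, ios_between_bubbles: int
) -> float:
    """Throughput under a crude stall model (an assumption, not a measurement).

    Between bubbles the queue is assumed to run at target_depth, so it
    completes target_depth / latency_s IOs per second. Each bubble is assumed
    to add one full latency during which no new IO is issued.
    """
    full_rate = target_depth / latency_s          # iops while the queue is full
    busy_time = ios_between_bubbles / full_rate   # time to complete one run of IOs
    stall_time = latency_s                        # assumed cost of one bubble
    return ios_between_bubbles / (busy_time + stall_time)


if __name__ == "__main__":
    latency_s = 0.003  # 3ms, as in the message
    depth = required_io_depth(20_000, latency_s)
    print(f"required average IO depth: {depth:.1f}")

    # "A bubble after every few dozen IOs": 36 is an arbitrary stand-in.
    iops = effective_iops_with_bubbles(depth, latency_s, ios_between_bubbles=36)
    print(f"effective iops with bubbles: {iops:.0f}")
    print(f"effective average IO depth: {required_io_depth(iops, latency_s):.1f}")
```

Under these assumptions the effective depth lands well below the target of 60,
which is the qualitative point made above; the exact numbers depend entirely
on the assumed stall cost and should not be read as a prediction.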