Hi,

On 2026-02-27 15:11:24 -0500, Peter Geoghegan wrote:
> On Thu, Feb 26, 2026 at 11:18 PM Andres Freund <[email protected]> wrote:
> > Note how the increase in scanned heap pages actually *decreases* the
> > overall time rather substantially.
> >
> > It's quite visible, both in iostat, and a query like
> >   SELECT pid, target_desc, off, length FROM pg_aios \watch 0.5
> > that for the first query has basically no IO concurrency, the second has
> > very intermittent IO concurrency and the third one has nice IO
> > concurrency.
> >
> > If I disable the yield logic, the fillfactor=90 case is good:
>
> I can recreate your results. Including the part where you found that
> the problem would go away once yields were completely disabled.
Oops, I just sent a review that I started writing a few hours ago that
would have benefited from having read this first...

> I can certainly understand why you're suspicious of the yielding
> mechanism. I wonder if I gave undue weight to the merge join query I
> showed you [1] (and one or two others like it). Declaring that the
> underlying merge join/yielding issue is not worth the complexity
> required to yield would certainly be convenient. Yielding *isn't*
> helpful for the vast majority of individual queries, so I'm certainly
> tempted. But I can't help but feel nervous about the large disparity
> in the number of *index* pages read by that particular query, once the
> yielding mechanism is disabled.

I do continue to wonder if we ought to pass down some hints from the
planner about how much data an index scan is likely to read, to influence
readahead aggressiveness.

I do agree that it's right to be concerned about the increase in index
fetches in such mark/restore cases.

> With that being said, it seems as if yielding isn't the only factor in
> play here.
> I also notice that even master exhibits roughly the same
> performance disparity (also while using direct I/O, though with
> shared_buffers set to 16GB rather than your 2GB):
>
> =================================
> EXPLAIN OUTPUT (best run, master)
> =================================
>
> --- Fillfactor 90 ---
> Index Scan using pgbench_accounts_ff90_pkey on pgbench_accounts_ff90
>   Index Searches: 1
>   Buffers: shared hit=27325 read=181819
>   I/O Timings: shared read=16822.256
> Planning Time: 0.035 ms
> Execution Time: 18048.198 ms
>
> --- Fillfactor 50 ---
> Index Scan using pgbench_accounts_ff50_pkey on pgbench_accounts_ff50
>   Index Searches: 1
>   Buffers: shared hit=27325 read=333334
>   I/O Timings: shared read=30685.965
> Planning Time: 0.028 ms
> Execution Time: 32005.962 ms
>
> --- Fillfactor 25 ---
> Index Scan using pgbench_accounts_ff25_pkey on pgbench_accounts_ff25
>   Index Searches: 1
>   Buffers: shared hit=27325 read=666667
>   I/O Timings: shared read=10278.124
> Planning Time: 0.034 ms
> Execution Time: 11796.573 ms
>
> While fillfactor 90 is fastest, fillfactor 25 is almost 3x faster than
> fillfactor 50, despite performing about twice as many reads. I have to
> imagine this relates to my Samsung 980 Pro SSD performing its own
> read-ahead, in a way that works inconsistently across workloads.
>
> Note again that this effect with master only appears when
> shared_buffers is set to 16GB. With your 2GB shared_buffers setting,
> master takes 17930.381 ms for FF 90, 31822.473 ms for FF 50, and
> 61094.676 ms for FF 25 (which is at least consistent-ish in the way
> that one would expect).

Is this, by any chance, with starting the server and then running these
queries in that order? Or are you repeating these runs within one server
start, evicting the buffers in between?
If you don't, you'll often get very inconsistent performance: the first
time a buffer pool page is accessed, you'll get a page fault, during which
the kernel has to find free memory to back the page, which then also has
to be zeroed out. With a small buffer pool you reach the point where
individual buffers are reused much more quickly, which would explain why
it only happens with the larger s_b.

Are you using huge pages? I see rather different performance results
with/without them when not prefetching. In fact, when not repeating the
benchmarks, and instead running them in order within the same "server
start cycle", I get timings similar to yours when not using huge pages.
But just running the queries in a different order gives very different
results.

> For context, here is how the patch compares to master with
> shared_buffers=16GB (here master uses the same query execution/query
> plans as those shown above) once the patch/Pfetch's yielding is
> disabled:
>
> FF   Heap Pages   Master    Pfetch ON   ON/Master
> ----------------------------------------------------
> 90   181819       18048.2   1465.0      0.081x
> 50   333334       32006.0   1682.2      0.053x
> 25   666667       11796.6   1928.4      0.163x
>
> I also noticed that the patch isn't at all sensitive to whether
> shared_buffers is set to 2GB or 16GB -- not once yielding is disabled
> like this.

It really shouldn't be sensitive -- that query will never be able to reuse
heapam pages within a query, and evicting a clean buffer isn't that
expensive.

> I'm not sure how relevant this later point about "shared_buffers
> sensitivity with yielding" really is. Nor am I sure if the effect with
> master (and the possible role of device-level readahead) is all that
> significant. I'm pointing all of this out in the hope that you can
> offer an explanation that'll help me to improve my own intuitions
> about this stuff.

I suspect it's really related to running multiple different queries in a
specific order, without restarting in between.
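As an aside (not from the thread): the first-touch cost described above --
a page fault during which the kernel allocates and zeroes backing memory --
is easy to see with a plain anonymous mapping. This is just an illustrative
sketch; the 256 MiB size and 4 KiB stride are arbitrary choices:

```python
import mmap
import time

SIZE = 256 * 1024 * 1024  # 256 MiB anonymous mapping, initially unbacked


def touch_all(buf, size, step=4096):
    # Read one byte per (assumed 4 KiB) page. On the first pass each read
    # faults the page in, forcing the kernel to find and zero backing
    # memory; on later passes the pages are already resident.
    total = 0
    for off in range(0, size, step):
        total += buf[off]
    return total


# -1 => anonymous, zero-fill-on-demand mapping (no file backing)
m = mmap.mmap(-1, SIZE)

t0 = time.perf_counter()
first = touch_all(m, SIZE)
t1 = time.perf_counter()
second = touch_all(m, SIZE)
t2 = time.perf_counter()

# Anonymous memory is zero-filled by the kernel, so both sums are 0.
print(f"first touch:  {t1 - t0:.4f}s (page faults + zeroing)")
print(f"second touch: {t2 - t1:.4f}s (pages already resident)")
m.close()
```

On most systems the first pass is noticeably slower than the second, which
is the same effect as a freshly started server with a large, never-touched
shared_buffers (and part of why huge pages, with fewer/larger faults, can
change the picture).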
Exacerbated perhaps by not using huge pages.

Greetings,

Andres Freund
