Hi,

On Tue, Mar 24, 2026 at 9:18 AM SATYANARAYANA NARLAPURAM <[email protected]> wrote:
> Hi Hackers,
>
> While reviewing the patch in the thread [1] I noticed the following:
>
> When the WAL prefetcher encounters a block reference that carries a full
> page image (FPW) or has BKPBLOCK_WILL_INIT set, it correctly skips issuing
> a prefetch for that block, because the old on-disk content is irrelevant:
> replay will overwrite or zero the page entirely. However, if a later
> WAL record within the look-ahead window references the same block without
> an FPW, the prefetcher would still issue a fadvise64 syscall for it,
> because the block was never recorded in the duplicate-detection window.
>
> Fixed this by marking these blocks as recently seen in the FPW and
> WILL_INIT skip paths. The existing duplicate-check loop then naturally
> suppresses prefetch attempts for subsequent references to the same block,
> counting them under the skip_rep stat. This is particularly effective for
> workloads that produce many sequential writes to the same page (e.g., bulk
> inserts into heap-only tables), where each page's first post-checkpoint
> touch generates an FPW and subsequent inserts to the same page follow
> shortly after in WAL.
>
> To further reduce the wasted prefetch calls, we could increase the window
> size by changing XLOGPREFETCHER_SEQ_WINDOW_SIZE according to the maximum
> number of blocks that can be prefetched, or maintain a hash table. I
> did not attempt to do this in this patch because it can impact redo
> performance (more CPU cycles). In the worst case, the current fix may fail
> in scenarios where the table has more than four indexes, for example.
> However, I still believe it is an improvement over the baseline. If we
> decide to spend more cycles on optimizing the window size, that can be in
> a different patch.
>
> Benchmarked recovery with 10 GB of WAL from an insert-only workload into a
> no-index table, replayed from an identical crash snapshot:
>
> Fast disk (NVMe)
> Baseline: redo 37.30s, system CPU 9.38s, 1,204,992 fadvise calls
> Patched:  redo 25.78s, system CPU 3.39s, 122,753 fadvise calls
>
> This is nearly 31% faster redo, 90% fewer fadvise syscalls
>
> *Prefetch Counters*
> Counter                     Baseline     Patched      Delta
> prefetch (fadvise issued)   1,204,992    122,753      −89.8%
> hit                         924,457      911,785      −1.4%
> skip_init                   1,097,536    1,097,536    0
> skip_fpw                    28           28           0
> skip_rep                    80,020,209   81,115,120   +1,094,911
>
> Slower disk (with ~2ms latency)
> Baseline: redo 188.04s, system CPU 6.87s, 1,204,992 fadvise calls
> Patched:  redo 60.02s, system CPU 3.39s, 122,753 fadvise calls
>
> This is nearly 68% faster redo, 3.1× overall speedup
>
>
> *Configuration:*
>
> shared_buffers = '124GB'
> huge_pages = on
> wal_buffers = '512MB'
> max_wal_size = '100GB'
> checkpoint_timeout = '30min'
> full_page_writes = on
> maintenance_io_concurrency = 50
> recovery_prefetch = on
>
> *Workload:*
> CREATE TABLE test_noindex(id bigint, val1 int, val2 int, payload text);
> -- No indexes, no primary key.
>
>
> -- Then insert in batches of 1M rows until WAL reaches 10 GB:
> INSERT INTO test_noindex
> SELECT g, (g*7+13)%100000, (g*31+17)%100000, repeat(chr(65+(g%26)),60)
> FROM generate_series(1, 1000000) g;
>
>
> Thanks,
> Satya
>
> [1]
> https://www.postgresql.org/message-id/flat/CA%2B3i_M8C%2BrK9vhwBm8U%2Bys2hbDifoBb4Xnws5Wmn2f4u7iqOpA%40mail.gmail.com#8eac90e696baf6e4f58f91482af28e07
>

Rebased the patch.
0001-xlogprefetcher-record-FPW-WILL_INIT-blocks-in-the-re.patch
