neoremind commented on PR #16145: URL: https://github.com/apache/lucene/pull/16145#issuecomment-4594402925
I spent some time spinning up a [JMH benchmark](https://github.com/apache/lucene/compare/main...neoremind:lucene:verify_16145?expand=1) to simulate and vet this.

The setup is an 8G file on my EC2 m5.2xlarge (8 vCPU, 16G mem, io2 EBS with ~338us 4K random-read latency, measured with `fio` using direct I/O to bypass the page cache); openjdk "25.0.2" 2026-01-20.

The idea is: each read (4k/8k/16k) randomly picks between a small hot region (16MB, stays cached) and a random offset across the full 8G file. I assume this roughly mimics the HNSW scenario: some nodes are warm while others are deep in the graph and cold. The kernel page cache is cleared before each iteration. Each iteration runs for 6s, which is not enough time to fully warm the page cache.

### Benchmark results (focus on 4k read)

**With this PR, compare all hint modes:**

| coldReadPct | noPrefetch | Normal | Sequential | **Random** |
|---|---|---|---|---|
| 10% | 81 us | 82 us | 86 us | **40 us** |
| 50% | 400 us | 334 us | 328 us | **188 us** |
| 90% | 734 us | 349 us | 345 us | **348 us** |

**With PR vs. without PR comparison:**

| coldReadPct | Without PR | With PR | Speed up |
|---|---|---|---|
| 10% | 66 us | 40 us | **1.7x faster** |
| 50% | 270 us | 188 us | **1.4x faster** |
| 90% | 361 us | 348 us | almost same |

**Fully warm page cache (no page cache clear before each iter), with PR applied:**

| coldReadPct | noPrefetch / Normal / Sequential | Random |
|---|---|---|
| 10% | ~0.30 us | 1.37 us |
| 50% | ~0.47 us | 1.54 us |
| 90% | ~0.61 us | 1.64 us |

### Key findings

1. The change does help the random access pattern. At 10%/50% cold reads, it is ~1.7x/~1.4x faster than without the PR (40us/188us vs. 66us/270us). This is because the existing backoff logic uses a shared counter and skips `madvise` after a run of warm hits, which makes prefetch a no-op for the cold reads that follow.

   Note that even without this PR, the `RANDOM` hint already helps compared to noPrefetch and `NORMAL`, because the existing backoff lets some `madvise` calls through by chance. This PR makes `RANDOM` access faster by prefetching consistently.
2. At 90% cold there is no improvement, because cold reads keep resetting the shared counter to 0, so the backoff never kicks in. The problem only arises when warm hits push the shared counter up before a cold read arrives.
3. With a fully warm page cache there is indeed overhead, about 1.1us per read call. This aligns with what @mikemccand points out: `isLoaded()` is somewhat costly. But this is the tradeoff: as long as there are any cold pages, the savings from prefetching outweigh this 1.1us overhead, and the net win grows with more cold reads. As @mikemccand and @jimczi point out, if we remove `isLoaded()`, we also need to measure the overhead of the `isLoaded()` probe against always calling `madvise`.

Note that this is a microbenchmark. I think it would be more sound to vet this with a real-world workload such as an HNSW scenario, but the direction is positive.
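To make findings 1 and 2 concrete, here is a toy model of a shared-counter backoff. It is only a sketch under stated assumptions, not the code in Lucene or in this PR: the class and method names are hypothetical, the rule "probe only when the counter is zero or a power of two" is an assumed backoff policy, and each read is drawn as cold independently with probability `coldReadPct`. No real I/O happens; the model only counts how many cold reads would still get an `madvise`.

```java
import java.util.Random;

/** Toy model of a shared-counter prefetch backoff (hypothetical, not Lucene's code). */
public class BackoffModel {

  /** Assumed backoff rule: probe the page only when the counter is 0 or a power of two. */
  static boolean isZeroOrPowerOfTwo(int x) {
    return (x & (x - 1)) == 0;
  }

  /** Returns the fraction of cold reads for which this model would issue madvise. */
  static double prefetchedColdFraction(int coldReadPct, int reads, long seed) {
    Random random = new Random(seed);
    int counter = 0; // one counter shared by warm and cold reads
    int cold = 0;
    int prefetched = 0;
    for (int i = 0; i < reads; i++) {
      boolean isCold = random.nextInt(100) < coldReadPct;
      if (isCold) {
        cold++;
      }
      boolean probe = isZeroOrPowerOfTwo(counter);
      counter++;
      if (!probe) {
        continue; // backed off: neither the residency probe nor madvise runs
      }
      if (isCold) {
        counter = 0; // a cold page was found, so the backoff resets
        prefetched++; // this is where madvise would be issued
      }
    }
    return cold == 0 ? 0.0 : (double) prefetched / cold;
  }

  public static void main(String[] args) {
    for (int pct : new int[] {10, 50, 90}) {
      System.out.printf(
          "coldReadPct=%d%% -> %.4f of cold reads prefetched%n",
          pct, prefetchedColdFraction(pct, 1_000_000, 42L));
    }
  }
}
```

In this model, a mostly warm mix lets the counter run far from zero, so most cold reads arrive while the backoff is active and are skipped. A mostly cold mix keeps resetting the counter, so nearly every cold read is prefetched, which is consistent with seeing no difference at 90% cold.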
JMH result details

<details>
<summary>candidate benchmark (clear page cache - mimic cold page access)</summary>

```
Benchmark (coldReadPct) (dataDir) (fileName) (fileSizeGB) (hotRegionMB) (prefetchLength) Mode Cnt Score Error Units
PrefetchBenchmark.noPrefetchReadRandom 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 80.835 ± 70.514 us/op
PrefetchBenchmark.noPrefetchReadRandom 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 80.708 ± 119.700 us/op
PrefetchBenchmark.noPrefetchReadRandom 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 82.326 ± 80.527 us/op
PrefetchBenchmark.noPrefetchReadRandom 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 399.633 ± 419.646 us/op
PrefetchBenchmark.noPrefetchReadRandom 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 408.596 ± 326.960 us/op
PrefetchBenchmark.noPrefetchReadRandom 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 405.251 ± 474.788 us/op
PrefetchBenchmark.noPrefetchReadRandom 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 734.384 ± 1010.659 us/op
PrefetchBenchmark.noPrefetchReadRandom 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 734.215 ± 932.702 us/op
PrefetchBenchmark.noPrefetchReadRandom 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 739.150 ± 701.159 us/op
PrefetchBenchmark.prefetchNormalThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 82.132 ± 121.437 us/op
PrefetchBenchmark.prefetchNormalThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 85.855 ± 107.871 us/op
PrefetchBenchmark.prefetchNormalThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 85.510 ± 73.271 us/op
PrefetchBenchmark.prefetchNormalThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 334.346 ± 131.156 us/op
PrefetchBenchmark.prefetchNormalThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 347.240 ± 478.895 us/op
PrefetchBenchmark.prefetchNormalThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 373.239 ± 842.645 us/op
PrefetchBenchmark.prefetchNormalThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 348.674 ± 311.629 us/op
PrefetchBenchmark.prefetchNormalThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 382.573 ± 104.376 us/op
PrefetchBenchmark.prefetchNormalThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 423.491 ± 202.213 us/op
PrefetchBenchmark.prefetchRandomThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 39.944 ± 15.283 us/op
PrefetchBenchmark.prefetchRandomThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 43.664 ± 20.228 us/op
PrefetchBenchmark.prefetchRandomThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 50.725 ± 7.693 us/op
PrefetchBenchmark.prefetchRandomThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 188.247 ± 55.046 us/op
PrefetchBenchmark.prefetchRandomThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 207.542 ± 143.460 us/op
PrefetchBenchmark.prefetchRandomThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 236.605 ± 120.293 us/op
PrefetchBenchmark.prefetchRandomThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 348.257 ± 206.218 us/op
PrefetchBenchmark.prefetchRandomThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 368.074 ± 325.322 us/op
PrefetchBenchmark.prefetchRandomThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 431.100 ± 241.599 us/op
PrefetchBenchmark.prefetchSequentialThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 86.163 ± 74.133 us/op
PrefetchBenchmark.prefetchSequentialThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 84.713 ± 85.901 us/op
PrefetchBenchmark.prefetchSequentialThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 84.914 ± 105.107 us/op
PrefetchBenchmark.prefetchSequentialThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 327.904 ± 134.254 us/op
PrefetchBenchmark.prefetchSequentialThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 338.804 ± 127.809 us/op
PrefetchBenchmark.prefetchSequentialThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 354.642 ± 39.018 us/op
PrefetchBenchmark.prefetchSequentialThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 345.348 ± 148.671 us/op
PrefetchBenchmark.prefetchSequentialThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 370.560 ± 364.124 us/op
PrefetchBenchmark.prefetchSequentialThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 429.385 ± 311.389 us/op
```

</details>

<details>
<summary>candidate benchmark (all warm pages in page cache)</summary>

`cat file > /dev/null` was run before the benchmark, and page cache clearing was disabled.

```
Benchmark (coldReadPct) (dataDir) (fileName) (fileSizeGB) (hotRegionMB) (prefetchLength) Mode Cnt Score Error Units
PrefetchBenchmark.noPrefetchReadRandom 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 0.286 ± 0.098 us/op
PrefetchBenchmark.noPrefetchReadRandom 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 0.463 ± 0.040 us/op
PrefetchBenchmark.noPrefetchReadRandom 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 0.832 ± 0.055 us/op
PrefetchBenchmark.noPrefetchReadRandom 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 0.472 ± 0.076 us/op
PrefetchBenchmark.noPrefetchReadRandom 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 0.733 ± 0.119 us/op
PrefetchBenchmark.noPrefetchReadRandom 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 1.286 ± 0.049 us/op
PrefetchBenchmark.noPrefetchReadRandom 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 0.600 ± 0.106 us/op
PrefetchBenchmark.noPrefetchReadRandom 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 0.948 ± 0.100 us/op
PrefetchBenchmark.noPrefetchReadRandom 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 1.654 ± 0.221 us/op
PrefetchBenchmark.prefetchNormalThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 0.296 ± 0.028 us/op
PrefetchBenchmark.prefetchNormalThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 0.468 ± 0.007 us/op
PrefetchBenchmark.prefetchNormalThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 0.851 ± 0.043 us/op
PrefetchBenchmark.prefetchNormalThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 0.471 ± 0.097 us/op
PrefetchBenchmark.prefetchNormalThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 0.758 ± 0.061 us/op
PrefetchBenchmark.prefetchNormalThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 1.325 ± 0.219 us/op
PrefetchBenchmark.prefetchNormalThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 0.612 ± 0.063 us/op
PrefetchBenchmark.prefetchNormalThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 0.951 ± 0.014 us/op
PrefetchBenchmark.prefetchNormalThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 1.643 ± 0.033 us/op
PrefetchBenchmark.prefetchRandomThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 1.366 ± 0.512 us/op
PrefetchBenchmark.prefetchRandomThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 1.560 ± 0.299 us/op
PrefetchBenchmark.prefetchRandomThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 1.967 ± 0.644 us/op
PrefetchBenchmark.prefetchRandomThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 1.541 ± 0.040 us/op
PrefetchBenchmark.prefetchRandomThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 1.815 ± 0.177 us/op
PrefetchBenchmark.prefetchRandomThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 2.424 ± 0.078 us/op
PrefetchBenchmark.prefetchRandomThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 1.641 ± 0.056 us/op
PrefetchBenchmark.prefetchRandomThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 1.986 ± 0.107 us/op
PrefetchBenchmark.prefetchRandomThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 2.699 ± 0.150 us/op
PrefetchBenchmark.prefetchSequentialThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 0.295 ± 0.015 us/op
PrefetchBenchmark.prefetchSequentialThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 0.490 ± 0.097 us/op
PrefetchBenchmark.prefetchSequentialThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 0.839 ± 0.038 us/op
PrefetchBenchmark.prefetchSequentialThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 0.484 ± 0.077 us/op
PrefetchBenchmark.prefetchSequentialThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 0.746 ± 0.021 us/op
PrefetchBenchmark.prefetchSequentialThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 1.286 ± 0.043 us/op
PrefetchBenchmark.prefetchSequentialThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 0.608 ± 0.081 us/op
PrefetchBenchmark.prefetchSequentialThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 0.945 ± 0.005 us/op
PrefetchBenchmark.prefetchSequentialThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 1.654 ± 0.252 us/op
```

</details>

<details>
<summary>baseline benchmark</summary>

Run without this PR's changes. Only `prefetchRandomThenRead` is tested, for an apples-to-apples comparison.

```
Benchmark (coldReadPct) (dataDir) (fileName) (fileSizeGB) (hotRegionMB) (prefetchLength) Mode Cnt Score Error Units
PrefetchBenchmark.prefetchRandomThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 66.179 ± 38.100 us/op
PrefetchBenchmark.prefetchRandomThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 98.342 ± 38.340 us/op
PrefetchBenchmark.prefetchRandomThenRead 10 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 164.219 ± 14.844 us/op
PrefetchBenchmark.prefetchRandomThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 270.401 ± 138.404 us/op
PrefetchBenchmark.prefetchRandomThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 368.736 ± 564.566 us/op
PrefetchBenchmark.prefetchRandomThenRead 50 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 643.159 ± 1996.410 us/op
PrefetchBenchmark.prefetchRandomThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 4096 avgt 3 360.715 ± 227.521 us/op
PrefetchBenchmark.prefetchRandomThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 8192 avgt 3 394.207 ± 381.675 us/op
PrefetchBenchmark.prefetchRandomThenRead 90 /home/ec2-user/environment prefetch_bench_data_8G 8 16 16384 avgt 3 446.483 ± 270.331 us/op
```

</details>

<details>
<summary>fio test</summary>

- Average: 338 µs per 4K random read
- P50: 310 µs
- P90: 383 µs
- P99: 840 µs

```
$ sudo fio --name=randread --ioengine=libaio --direct=1 --bs=4k \
> --iodepth=1 --rw=randread --size=1G --runtime=10 --time_based \
> --filename=/home/ec2-user/environment/prefetch_bench_data_8G
randread: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.32
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=11.4MiB/s][r=2930 IOPS][eta 00m:00s]
randread: (groupid=0, jobs=1): err= 0: pid=65728: Mon Jun 1 10:04:31 2026
  read: IOPS=2954, BW=11.5MiB/s (12.1MB/s)(115MiB/10001msec)
    slat (nsec): min=2224, max=25262, avg=2487.67, stdev=450.97
    clat (usec): min=215, max=5466, avg=335.40, stdev=133.15
     lat (usec): min=217, max=5469, avg=337.88, stdev=133.15
    clat percentiles (usec):
     | 1.00th=[ 265], 5.00th=[ 277], 10.00th=[ 281], 20.00th=[ 293],
     | 30.00th=[ 297], 40.00th=[ 306], 50.00th=[ 310], 60.00th=[ 318],
     | 70.00th=[ 326], 80.00th=[ 343], 90.00th=[ 383], 95.00th=[ 457],
     | 99.00th=[ 840], 99.50th=[ 1074], 99.90th=[ 2114], 99.95th=[ 2474],
     | 99.99th=[ 4752]
   bw ( KiB/s): min=11248, max=12104, per=100.00%, avg=11829.89, stdev=237.18, samples=19
   iops : min= 2812, max= 3026, avg=2957.47, stdev=59.30, samples=19
  lat (usec) : 250=0.09%, 500=96.27%, 750=2.25%, 1000=0.76%
  lat (msec) : 2=0.52%, 4=0.09%, 10=0.02%
  cpu : usr=0.45%, sys=1.48%, ctx=29548, majf=0, minf=12
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=29548,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=11.5MiB/s (12.1MB/s), 11.5MiB/s-11.5MiB/s (12.1MB/s-12.1MB/s), io=115MiB (121MB), run=10001-10001msec

Disk stats (read/write):
  nvme0n1: ios=29257/106, merge=0/37, ticks=9714/54, in_queue=9769, util=96.35%
```

</details>

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
