Hi, I thought it might be interesting to revive this thread, because the improvements I saw from Thomas's work, and even from simple prefetching of bucket headers in the in-memory probe phase (just to isolate the effect of prefetching), are still substantial. Here are the results for simple prefetching in the probe phase only, on Thomas's last benchmark query (an in-memory self join):

*Task clock*: -25.6%
*Page faults*: -21.46%
*Cycles*: -17.39%
*L1 dcache loads*: -13.78%
*L1 dcache load misses*: -30.1%
*LLC loads*: -36.7%
*LLC load misses*: -55.1%
*dTLB loads*: -13.77%
*dTLB misses*: +0.5%
*Cache references*: -9.5%
*Cache misses*: -7.9%
*IPC*: -6.4%
So I thought it might be worth looking at this again, even if we avoid the major architectural changes to the hash join executor that the more advanced techniques would require. It will take a lot of perf benchmarking to prove the improvements, but I think it is doable to prove (or disprove) what we can gain with minimal architectural changes. Also, about the Linux experience: it concerned prefetching for linked lists (pointer chasing; see the Linux thread <https://lwn.net/Articles/444346/>), where on Intel a prefetch(NULL) was issued whenever prefetching ran off the end of a short list, which happens very often with chained hash tables. This is still noticeable in Postgres: if we try to prefetch during the intra-bucket scan, performance stays roughly the same or even gets worse. Thoughts?