Re: Prefetch the next tuple's memory during seqscans
On Sat, 20 Jan 2024 at 16:35, vignesh C wrote:
> I'm seeing that there has been no activity in this thread for more
> than 6 months, I'm planning to close this in the current commitfest
> unless someone is planning to take it forward.

Thanks for the reminder about this. Since the heapgettup/heapgettup_pagemode refactor I've been unable to see the same performance gains as I did before.

Also, since reading "The Art of Writing Efficient Programs" I'm led to believe that modern processor hardware prefetchers can detect and prefetch on both forward and backward access patterns. I also saw some discussion on Twitter about this [1]. I'm not sure yet how this translates to non-uniform access patterns, e.g. tuples are a varying number of cachelines apart and we do something like only deform attributes in the first cacheline. Will the prefetcher still see the pattern in this case? If it's non-uniform, then how does it know which cacheline to fetch? If the tuple spans multiple cachelines and we deform the whole tuple, will accessing the next cacheline in a forward direction make the hardware prefetcher forget about the more general backward access that's going on?

These are questions I'll need to learn the answers to before I can understand what's the best thing to do in this area. The only way to tell is to design a benchmark and see how far we can go before the hardware prefetcher no longer detects the pattern.

I've withdrawn the patch. I can resubmit once I've done some more experimentation, if that experimentation yields positive results.

David

[1] https://twitter.com/ID_AA_Carmack/status/1470832912149135360
Re: Prefetch the next tuple's memory during seqscans
On Mon, 10 Jul 2023 at 15:04, Daniel Gustafsson wrote:
> > On 10 Jul 2023, at 11:32, Daniel Gustafsson wrote:
> >
> >> On 4 Apr 2023, at 06:50, David Rowley wrote:
> >>
> >> Updated patches attached.
> >
> > This patch is marked Waiting on Author, but from the thread it seems Needs
> > Review is more apt. I've changed status and also attached a new version of the
> > patch as the posted v1 no longer applied due to changes in formatting for Perl
> > code.
>
> ..and again with both patches attached. Doh.

I'm seeing that there has been no activity in this thread for more than 6 months, I'm planning to close this in the current commitfest unless someone is planning to take it forward.

Regards,
Vignesh
Re: Prefetch the next tuple's memory during seqscans
> On 10 Jul 2023, at 11:32, Daniel Gustafsson wrote:
>
>> On 4 Apr 2023, at 06:50, David Rowley wrote:
>>
>> Updated patches attached.
>
> This patch is marked Waiting on Author, but from the thread it seems Needs
> Review is more apt. I've changed status and also attached a new version of the
> patch as the posted v1 no longer applied due to changes in formatting for Perl
> code.

..and again with both patches attached. Doh.

--
Daniel Gustafsson

v2-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch
Description: Binary data

prefetch_in_PageRepairFragmentation.patch
Description: Binary data
Re: Prefetch the next tuple's memory during seqscans
> On 4 Apr 2023, at 06:50, David Rowley wrote:
>
> Updated patches attached.

This patch is marked Waiting on Author, but from the thread it seems Needs Review is more apt. I've changed status and also attached a new version of the patch as the posted v1 no longer applied due to changes in formatting for Perl code.

--
Daniel Gustafsson

v2-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch
Description: Binary data
Re: Prefetch the next tuple's memory during seqscans
On Tue, 4 Apr 2023 at 07:47, Gregory Stark (as CFM) wrote:
> The referenced patch was committed March 19th but there's been no
> comment here. Is this patch likely to go ahead this release or should
> I move it forward again?

Thanks for the reminder on this. I have done some work on it but just didn't post it here as I didn't have good news.

The problem I'm facing is that after Melanie's recent refactor work done around heapgettup() [1], I can no longer get the same speedup as before with pg_prefetch_mem(). While testing Melanie's patches, I did do some performance tests and did see a good increase in performance from them. I really don't know the reason why the prefetching does not show the gains as it did before. Perhaps the rearranged code is better able to make use of hardware prefetching of cache lines.

I am, however, inclined not to drop the pg_prefetch_mem() macro altogether just because I can no longer demonstrate any performance gains during sequential scans, so I decided to go and try what Thomas mentioned in [2]: use the prefetching macro to fetch the required tuples in PageRepairFragmentation() so that they're cached in CPU cache by the time we get to compactify_tuples().

I tried this using the same test as I described in [3], after adjusting the following ereport() call to use PANIC instead of LOG:

ereport(LOG,
        (errmsg("redo done at %X/%X system usage: %s",
                LSN_FORMAT_ARGS(xlogreader->ReadRecPtr),
                pg_rusage_show(&ru0))));

Doing that allows me to repeat the test using the same WAL each time. The test machine is an AMD 3990x CPU on Ubuntu 22.10 with 64GB RAM.
shared_buffers = 10GB
checkpoint_timeout = '1 h'
max_wal_size = 100GB
max_connections = 300

Master:
2023-04-04 15:54:55.635 NZST [15958] PANIC: redo done at 0/DC447610 system usage: CPU: user: 44.46 s, system: 0.97 s, elapsed: 45.45 s
2023-04-04 15:56:33.380 NZST [16109] PANIC: redo done at 0/DC447610 system usage: CPU: user: 43.80 s, system: 0.86 s, elapsed: 44.69 s
2023-04-04 15:57:25.968 NZST [16134] PANIC: redo done at 0/DC447610 system usage: CPU: user: 44.08 s, system: 0.74 s, elapsed: 44.84 s
2023-04-04 15:58:53.820 NZST [16158] PANIC: redo done at 0/DC447610 system usage: CPU: user: 44.20 s, system: 0.72 s, elapsed: 44.94 s

Prefetch Memory in PageRepairFragmentation():
2023-04-04 16:03:16.296 NZST [25921] PANIC: redo done at 0/DC447610 system usage: CPU: user: 41.73 s, system: 0.77 s, elapsed: 42.52 s
2023-04-04 16:04:07.384 NZST [25945] PANIC: redo done at 0/DC447610 system usage: CPU: user: 40.87 s, system: 0.86 s, elapsed: 41.74 s
2023-04-04 16:05:01.090 NZST [25968] PANIC: redo done at 0/DC447610 system usage: CPU: user: 41.20 s, system: 0.72 s, elapsed: 41.94 s
2023-04-04 16:05:49.235 NZST [25996] PANIC: redo done at 0/DC447610 system usage: CPU: user: 41.56 s, system: 0.66 s, elapsed: 42.24 s

That's about a 6.7% performance increase over master.

Since I really just did the seqscan patch as a means to get the pg_prefetch_mem() patch in, I wonder if it's ok to scrap that in favour of the PageRepairFragmentation patch.

Updated patches attached.

David

[1] https://postgr.es/m/CAAKRu_YSOnhKsDyFcqJsKtBSrd32DP-jjXmv7hL0BPD-z0TGXQ%40mail.gmail.com
[2] https://postgr.es/m/CA%2BhUKGJRtzbbhVmb83vbCiMRZ4piOAi7HWLCqs%3DGQ74mUPrP_w%40mail.gmail.com
[3] https://postgr.es/m/CAApHDvoKwqAzhiuxEt8jSquPJKDpH8DNUZDFUSX9P7DXrJdc3Q%40mail.gmail.com

v1-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch
Description: Binary data

prefetch_in_PageRepairFragmentation.patch
Description: Binary data
Re: Prefetch the next tuple's memory during seqscans
On Sun, 29 Jan 2023 at 21:24, David Rowley wrote:
>
> I've moved this patch to the next CF. This patch has a dependency on
> what's being proposed in [1].

The referenced patch was committed March 19th but there's been no comment here. Is this patch likely to go ahead this release or should I move it forward again?

--
Gregory Stark
As Commitfest Manager
Re: Prefetch the next tuple's memory during seqscans
On Wed, 4 Jan 2023 at 23:06, vignesh C wrote:
> patching file src/backend/access/heap/heapam.c
> Hunk #1 FAILED at 451.
> 1 out of 6 hunks FAILED -- saving rejects to file
> src/backend/access/heap/heapam.c.rej

I've moved this patch to the next CF. This patch has a dependency on what's being proposed in [1]. I'd rather wait until that goes in before rebasing this. Having this go in first will just make Melanie's job harder on her heapam.c refactoring work.

David

[1] https://commitfest.postgresql.org/41/3987/
Re: Prefetch the next tuple's memory during seqscans
On Wed, 23 Nov 2022 at 03:28, David Rowley wrote:
>
> On Thu, 3 Nov 2022 at 06:25, Andres Freund wrote:
> > Attached is an experimental patch/hack for that. It ended up being more
> > beneficial to make the access ordering more optimal than prefetching the tuple
> > contents, but I'm not at all sure that's the be-all-end-all.
>
> Thanks for writing that patch. I've been experimenting with it.
>
> I tried unrolling the loop (patch 0003) as you mentioned in:
>
> + * FIXME: Worth unrolling so that we don't fetch the same cacheline
> + * over and over, due to line items being smaller than a cacheline?
>
> but didn't see any gains from doing that.
>
> I also adjusted your patch a little so that instead of doing:
>
> - OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
> + OffsetNumber *rs_vistuples;
> + OffsetNumber rs_vistuples_d[MaxHeapTuplesPerPage]; /* their offsets */
>
> to work around the issue of having to populate rs_vistuples_d in
> reverse, I added a new field called rs_startindex to mark where the
> first element in the rs_vistuples array is. The way you wrote it seems
> to require fewer code changes, but per the FIXME comment you left, I
> get the idea you just did it the way you did to make it work enough
> for testing.
>
> I'm quite keen to move forward in committing the 0001 patch to add the
> pg_prefetch_mem macro. What I'm a little undecided about is what the
> best patch is to commit first to make use of the new macro.
>
> I did some tests on the attached set of patches:
>
> alter system set max_parallel_workers_per_gather = 0;
> select pg_reload_conf();
>
> create table t as select a from generate_series(1,1000)a;
> alter table t set (autovacuum_enabled=false);
>
> $ cat bench.sql
> select * from t where a = 0;
>
> psql -c "select pg_prewarm('t');" postgres
>
> -- Test 1 no frozen tuples in "t"
>
> Master (@9c6ad5eaa):
> $ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 383.332 ms
> latency average = 375.747 ms
> latency average = 376.090 ms
>
> Master + 0001 + 0002:
> $ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 370.133 ms
> latency average = 370.149 ms
> latency average = 370.157 ms
>
> Master + 0001 + 0005:
> $ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 372.662 ms
> latency average = 371.034 ms
> latency average = 372.709 ms
>
> -- Test 2 "select count(*) from t" with all tuples frozen
>
> $ cat bench1.sql
> select count(*) from t;
>
> psql -c "vacuum freeze t;" postgres
> psql -c "select pg_prewarm('t');" postgres
>
> Master (@9c6ad5eaa):
> $ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 406.238 ms
> latency average = 407.029 ms
> latency average = 406.962 ms
>
> Master + 0001 + 0005:
> $ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 345.470 ms
> latency average = 345.775 ms
> latency average = 345.354 ms
>
> My current thoughts are that it might be best to go with 0005 to start
> with. I know Melanie is working on making some changes in this area,
> so perhaps it's best to leave 0002 until that work is complete.
The patch does not apply on top of HEAD as in [1], please post a rebased patch:

=== Applying patches on top of PostgreSQL commit ID 5212d447fa53518458cbe609092b347803a667c5 ===
=== applying patch ./v2-0001-Add-pg_prefetch_mem-macro-to-load-cache-lines.patch
=== applying patch ./v2-0002-Perform-memory-prefetching-in-heapgetpage.patch
patching file src/backend/access/heap/heapam.c
Hunk #1 FAILED at 451.
1 out of 6 hunks FAILED -- saving rejects to file src/backend/access/heap/heapam.c.rej

[1] - http://cfbot.cputube.org/patch_41_3978.log

Regards,
Vignesh
Re: Prefetch the next tuple's memory during seqscans
On Thu, 1 Dec 2022 at 18:18, John Naylor wrote:
> I then tested a Power8 machine (also kernel 3.10 gcc 4.8). Configure reports
> "checking for __builtin_prefetch... yes", but I don't think it does anything
> here, as the results are within noise level. A quick search didn't turn up
> anything informative on this platform, and I'm not motivated to dig deeper.
> In any case, it doesn't make things worse.

Thanks for testing the Power8 hardware.

Andres just let me test on some Apple M1 hardware (those cores are insanely fast!). Using the table and running the script from [1], with trimmed-down output, I see:

Master @ edf12e7bbd
Testing a -> 158.037 ms
Testing a2 -> 164.442 ms
Testing a3 -> 171.523 ms
Testing a4 -> 189.892 ms
Testing a5 -> 217.197 ms
Testing a6 -> 186.790 ms
Testing a7 -> 189.491 ms
Testing a8 -> 195.384 ms
Testing a9 -> 200.547 ms
Testing a10 -> 206.149 ms
Testing a11 -> 211.708 ms
Testing a12 -> 217.976 ms
Testing a13 -> 224.565 ms
Testing a14 -> 230.642 ms
Testing a15 -> 237.372 ms
Testing a16 -> 244.110 ms

(checking for __builtin_prefetch... yes)

Master + v2-0001 + v2-0005
Testing a -> 157.477 ms
Testing a2 -> 163.720 ms
Testing a3 -> 171.159 ms
Testing a4 -> 186.837 ms
Testing a5 -> 205.220 ms
Testing a6 -> 184.585 ms
Testing a7 -> 189.879 ms
Testing a8 -> 195.650 ms
Testing a9 -> 201.220 ms
Testing a10 -> 207.162 ms
Testing a11 -> 213.255 ms
Testing a12 -> 219.313 ms
Testing a13 -> 225.763 ms
Testing a14 -> 237.337 ms
Testing a15 -> 239.440 ms
Testing a16 -> 245.740 ms

It does not seem like there's any improvement on this architecture. There is a very small increase from "a" to "a6", but a very small decrease in performance from "a7" to "a16". It's likely within the expected noise level.

David

[1] https://postgr.es/m/caaphdvqwexy_6jgmb39vr3oqxz_w6stafkq52hodvwaw-19...@mail.gmail.com
Re: Prefetch the next tuple's memory during seqscans
On Wed, Nov 23, 2022 at 4:58 AM David Rowley wrote:
> My current thoughts are that it might be best to go with 0005 to start
> with.

+1

> I know Melanie is working on making some changes in this area,
> so perhaps it's best to leave 0002 until that work is complete.

There seem to be some open questions about that one as well.

I reran the same test in [1] (except I don't have the ability to lock clock speed or affect huge pages) on an older CPU from 2014 (Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz, kernel 3.10 gcc 4.8) with good results:

HEAD:
Testing a1
latency average = 965.462 ms
Testing a2
latency average = 1054.608 ms
Testing a3
latency average = 1078.263 ms
Testing a4
latency average = 1120.933 ms
Testing a5
latency average = 1162.753 ms
Testing a6
latency average = 1298.876 ms
Testing a7
latency average = 1228.775 ms
Testing a8
latency average = 1293.535 ms

0001+0005:
Testing a1
latency average = 791.224 ms
Testing a2
latency average = 876.421 ms
Testing a3
latency average = 911.039 ms
Testing a4
latency average = 981.693 ms
Testing a5
latency average = 998.176 ms
Testing a6
latency average = 979.954 ms
Testing a7
latency average = 1066.523 ms
Testing a8
latency average = 1030.235 ms

I then tested a Power8 machine (also kernel 3.10 gcc 4.8). Configure reports "checking for __builtin_prefetch... yes", but I don't think it does anything here, as the results are within noise level. A quick search didn't turn up anything informative on this platform, and I'm not motivated to dig deeper. In any case, it doesn't make things worse.
HEAD:
Testing a1
latency average = 1402.163 ms
Testing a2
latency average = 1442.971 ms
Testing a3
latency average = 1599.188 ms
Testing a4
latency average = 1664.397 ms
Testing a5
latency average = 1782.091 ms
Testing a6
latency average = 1860.655 ms
Testing a7
latency average = 1929.120 ms
Testing a8
latency average = 2021.100 ms

0001+0005:
Testing a1
latency average = 1433.080 ms
Testing a2
latency average = 1428.369 ms
Testing a3
latency average = 1542.406 ms
Testing a4
latency average = 1642.452 ms
Testing a5
latency average = 1737.173 ms
Testing a6
latency average = 1828.239 ms
Testing a7
latency average = 1920.909 ms
Testing a8
latency average = 2036.922 ms

[1] https://www.postgresql.org/message-id/CAFBsxsHqmH_S%3D4apc5agKsJsF6xZ9f6NaH0Z83jUYv3EgySHfw%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com
Re: Prefetch the next tuple's memory during seqscans
On Wed, 23 Nov 2022 at 10:58, David Rowley wrote:
> My current thoughts are that it might be best to go with 0005 to start
> with. I know Melanie is working on making some changes in this area,
> so perhaps it's best to leave 0002 until that work is complete.

I tried running TPC-H @ scale 5 with master (@d09dbeb9) vs master + 0001 + 0005 patch. The results look quite promising. Query 15 seems to run 15% faster and overall it's 4.23% faster. Full results are attached.

David

query   master    master + 0001 + 0005   compare
1       25999.5   25793.6                100.8%
2        1171.0    1152.0                101.6%
3        6180.5    5456.5                113.3%
4        1167.1    1107.0                105.4%
5        4968.3    4604.8                107.9%
6        3696.6    3306.4                111.8%
7        5501.4    4905.6                112.1%
8        1394.8    1345.2                103.7%
9       10861.2   11159.8                 97.3%
10       4354.3    4356.4                100.0%
11        382.5     386.3                 99.0%
12       3888.6    3838.0                101.3%
13       6905.0    6622.5                104.3%
14       3886.1    3429.8                113.3%
15       8009.7    6927.8                115.6%
16       2406.2    2363.9                101.8%
17         14.6      14.9                 98.1%
18      11735.6   11453.9                102.5%
19         44.9      44.7                100.4%
20        262.8     246.6                106.6%
21       3014.1    3027.6                 99.6%
22        176.4     179.3                 98.4%
Re: Prefetch the next tuple's memory during seqscans
On Wed, Nov 23, 2022 at 11:03:22AM -0500, Bruce Momjian wrote:
> > CPUs have several different kinds of 'hardware prefetchers' (worth
> > reading about), that look out for sequential and striding patterns and
> > try to get the cache line ready before you access it. Using the
> > prefetch instructions explicitly is called 'software prefetching'
> > (special instructions inserted by programmers or compilers). The
> > theory here would have to be that the hardware prefetchers couldn't
> > pick up the pattern, but we know how to do it. The exact details of
> > the hardware prefetchers vary between chips, and there are even some
> > parameters you can adjust in BIOS settings. One idea is that the
> > hardware prefetchers are generally biased towards increasing
> > addresses, but our tuples tend to go backwards on the page[1]. It's
> > possible that some other CPUs can detect backwards strides better, but
> > since real world tuples aren't of equal size anyway, there isn't
> > really a fixed stride at all, so software prefetching seems quite
> > promising for this...
> >
> > [1] https://www.postgresql.org/docs/current/storage-page-layout.html#STORAGE-PAGE-LAYOUT-FIGURE
>
> I remember someone showing that having our item pointers at the _end_ of
> the page and tuples at the start moving toward the end increased
> performance significantly.

Ah, I found it, from 2017, with a 15-25% slowdown:

https://www.postgresql.org/message-id/20171108205943.tps27i2tujsstrg7%40alap3.anarazel.de

--
Bruce Momjian        https://momjian.us
EDB                  https://enterprisedb.com

Indecision is a decision. Inaction is an action.  Mark Batterson
Re: Prefetch the next tuple's memory during seqscans
On Wed, Nov 2, 2022 at 12:42:11AM +1300, Thomas Munro wrote:
> On Wed, Nov 2, 2022 at 12:09 AM Andy Fan wrote:
> > In theory, why does the prefetch make things better? I am asking this
> > because I think we need to read the data from buffer to cache line once
> > in either case (I'm obviously wrong in face of the test result.)
>
> CPUs have several different kinds of 'hardware prefetchers' (worth
> reading about), that look out for sequential and striding patterns and
> try to get the cache line ready before you access it. Using the
> prefetch instructions explicitly is called 'software prefetching'
> (special instructions inserted by programmers or compilers). The
> theory here would have to be that the hardware prefetchers couldn't
> pick up the pattern, but we know how to do it. The exact details of
> the hardware prefetchers vary between chips, and there are even some
> parameters you can adjust in BIOS settings. One idea is that the
> hardware prefetchers are generally biased towards increasing
> addresses, but our tuples tend to go backwards on the page[1]. It's
> possible that some other CPUs can detect backwards strides better, but
> since real world tuples aren't of equal size anyway, there isn't
> really a fixed stride at all, so software prefetching seems quite
> promising for this...
>
> [1] https://www.postgresql.org/docs/current/storage-page-layout.html#STORAGE-PAGE-LAYOUT-FIGURE

I remember someone showing that having our item pointers at the _end_ of the page and tuples at the start moving toward the end increased performance significantly.

--
Bruce Momjian        https://momjian.us
EDB                  https://enterprisedb.com

Indecision is a decision. Inaction is an action.  Mark Batterson
Re: Prefetch the next tuple's memory during seqscans
On Wed, 23 Nov 2022 at 21:26, sirisha chamarthi wrote:
> Master
> After vacuum:
> latency average = 393.880 ms
>
> Master + 0001 + 0005
> After vacuum:
> latency average = 369.591 ms

Thank you for running those again. Those results make more sense. Would you mind also testing the count(*) query too?

David
Re: Prefetch the next tuple's memory during seqscans
On Tue, Nov 22, 2022 at 11:44 PM David Rowley wrote:
> On Wed, 23 Nov 2022 at 20:29, sirisha chamarthi wrote:
> > I ran your test1 exactly like your setup except the row count is 300
> > (with 13275 blocks). Shared_buffers is 128MB and the hardware configuration
> > details at the bottom of the mail. It appears Master + 0001 + 0005
> > regressed compared to master slightly.
>
> Thank you for running these tests.
>
> Can you share if the plans used for these queries was a parallel plan?
> I had set max_parallel_workers_per_gather to 0 to remove the
> additional variability from parallel query.
>
> Also, 13275 blocks is 104MBs, does EXPLAIN (ANALYZE, BUFFERS) indicate
> that all pages were in shared buffers? I used pg_prewarm() to ensure
> they were so that the runs were consistent.

I reran the test with max_parallel_workers_per_gather = 0 and with pg_prewarm. It appears I missed some step while testing on master; thanks for sharing the details. The new numbers show master has higher latency than *Master + 0001 + 0005*.

*Master*
Before vacuum:
latency average = 452.881 ms
After vacuum:
latency average = 393.880 ms

*Master + 0001 + 0005*
Before vacuum:
latency average = 441.832 ms
After vacuum:
latency average = 369.591 ms
Re: Prefetch the next tuple's memory during seqscans
On Wed, 23 Nov 2022 at 20:29, sirisha chamarthi wrote:
> I ran your test1 exactly like your setup except the row count is 300
> (with 13275 blocks). Shared_buffers is 128MB and the hardware configuration
> details at the bottom of the mail. It appears Master + 0001 + 0005 regressed
> compared to master slightly.

Thank you for running these tests.

Can you share if the plans used for these queries were parallel plans? I had set max_parallel_workers_per_gather to 0 to remove the additional variability from parallel query.

Also, 13275 blocks is 104MB; does EXPLAIN (ANALYZE, BUFFERS) indicate that all pages were in shared buffers? I used pg_prewarm() to ensure they were, so that the runs were consistent.

David
Re: Prefetch the next tuple's memory during seqscans
On Tue, Nov 22, 2022 at 1:58 PM David Rowley wrote:
> On Thu, 3 Nov 2022 at 06:25, Andres Freund wrote:
> > Attached is an experimental patch/hack for that. It ended up being more
> > beneficial to make the access ordering more optimal than prefetching the tuple
> > contents, but I'm not at all sure that's the be-all-end-all.
>
> Thanks for writing that patch. I've been experimenting with it.
>
> I tried unrolling the loop (patch 0003) as you mentioned in:
>
> + * FIXME: Worth unrolling so that we don't fetch the same cacheline
> + * over and over, due to line items being smaller than a cacheline?
>
> but didn't see any gains from doing that.
>
> I also adjusted your patch a little so that instead of doing:
>
> - OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
> + OffsetNumber *rs_vistuples;
> + OffsetNumber rs_vistuples_d[MaxHeapTuplesPerPage]; /* their offsets */
>
> to work around the issue of having to populate rs_vistuples_d in
> reverse, I added a new field called rs_startindex to mark where the
> first element in the rs_vistuples array is. The way you wrote it seems
> to require fewer code changes, but per the FIXME comment you left, I
> get the idea you just did it the way you did to make it work enough
> for testing.
>
> I'm quite keen to move forward in committing the 0001 patch to add the
> pg_prefetch_mem macro. What I'm a little undecided about is what the
> best patch is to commit first to make use of the new macro.
>
> I did some tests on the attached set of patches:
>
> alter system set max_parallel_workers_per_gather = 0;
> select pg_reload_conf();
>
> create table t as select a from generate_series(1,1000)a;
> alter table t set (autovacuum_enabled=false);
>
> $ cat bench.sql
> select * from t where a = 0;
>
> psql -c "select pg_prewarm('t');" postgres
>
> -- Test 1 no frozen tuples in "t"
>
> Master (@9c6ad5eaa):
> $ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 383.332 ms
> latency average = 375.747 ms
> latency average = 376.090 ms
>
> Master + 0001 + 0002:
> $ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 370.133 ms
> latency average = 370.149 ms
> latency average = 370.157 ms
>
> Master + 0001 + 0005:
> $ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 372.662 ms
> latency average = 371.034 ms
> latency average = 372.709 ms
>
> -- Test 2 "select count(*) from t" with all tuples frozen
>
> $ cat bench1.sql
> select count(*) from t;
>
> psql -c "vacuum freeze t;" postgres
> psql -c "select pg_prewarm('t');" postgres
>
> Master (@9c6ad5eaa):
> $ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 406.238 ms
> latency average = 407.029 ms
> latency average = 406.962 ms
>
> Master + 0001 + 0005:
> $ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
> latency average = 345.470 ms
> latency average = 345.775 ms
> latency average = 345.354 ms
>
> My current thoughts are that it might be best to go with 0005 to start
> with. I know Melanie is working on making some changes in this area,
> so perhaps it's best to leave 0002 until that work is complete.

I ran your test1 exactly like your setup except the row count is 300 (with 13275 blocks). Shared_buffers is 128MB and the hardware configuration details at the bottom of the mail.
It appears *Master + 0001 + 0005* regressed compared to master slightly.

*Master (@56d0ed3b756b2e3799a7bbc0ac89bc7657ca2c33)*
Before vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10 postgres | grep -E "^latency"
latency average = 430.287 ms
After vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10 postgres | grep -E "^latency"
latency average = 369.046 ms

*Master + 0001 + 0002:*
Before vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10 postgres | grep -E "^latency"
latency average = 427.983 ms
After vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10 postgres | grep -E "^latency"
latency average = 367.185 ms

*Master + 0001 + 0005:*
Before vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10 postgres | grep -E "^latency"
latency average = 447.045 ms
After vacuum:
/usr/local/pgsql/bin/pgbench -n -f bench.sql -M prepared -T 30 -P 10 postgres | grep -E "^latency"
latency average = 374.484 ms

lscpu output:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       46 bits physical, 48 bits virtual
CPU(s):              1
On-line CPU(s) list: 0
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:
Re: Prefetch the next tuple's memory during seqscans
On Wed, Nov 23, 2022 at 5:00 AM David Rowley wrote:
>
> On Thu, 3 Nov 2022 at 22:09, John Naylor wrote:
> > I tried a similar test, but with text fields of random length, and there is improvement here:
>
> Thank you for testing that. Can you share which CPU this was on?

That was an Intel Core i7-10750H.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: Prefetch the next tuple's memory during seqscans
On Thu, 3 Nov 2022 at 22:09, John Naylor wrote:
> I tried a similar test, but with text fields of random length, and there is
> improvement here:

Thank you for testing that. Can you share which CPU this was on?

My tests were all on AMD Zen 2. I'm keen to see what the results are on Intel hardware.

David
Re: Prefetch the next tuple's memory during seqscans
On Thu, 3 Nov 2022 at 06:25, Andres Freund wrote:
> Attached is an experimental patch/hack for that. It ended up being more
> beneficial to make the access ordering more optimal than prefetching the tuple
> contents, but I'm not at all sure that's the be-all-end-all.

Thanks for writing that patch. I've been experimenting with it.

I tried unrolling the loop (patch 0003) as you mentioned in:

+ * FIXME: Worth unrolling so that we don't fetch the same cacheline
+ * over and over, due to line items being smaller than a cacheline?

but didn't see any gains from doing that.

I also adjusted your patch a little so that instead of doing:

- OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
+ OffsetNumber *rs_vistuples;
+ OffsetNumber rs_vistuples_d[MaxHeapTuplesPerPage]; /* their offsets */

to work around the issue of having to populate rs_vistuples_d in reverse, I added a new field called rs_startindex to mark where the first element in the rs_vistuples array is. The way you wrote it seems to require fewer code changes, but per the FIXME comment you left, I get the idea you just did it the way you did to make it work enough for testing.

I'm quite keen to move forward in committing the 0001 patch to add the pg_prefetch_mem macro. What I'm a little undecided about is what the best patch is to commit first to make use of the new macro.
I did some tests on the attached set of patches:

alter system set max_parallel_workers_per_gather = 0;
select pg_reload_conf();

create table t as select a from generate_series(1,1000)a;
alter table t set (autovacuum_enabled=false);

$ cat bench.sql
select * from t where a = 0;

psql -c "select pg_prewarm('t');" postgres

-- Test 1 no frozen tuples in "t"

Master (@9c6ad5eaa):
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 383.332 ms
latency average = 375.747 ms
latency average = 376.090 ms

Master + 0001 + 0002:
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 370.133 ms
latency average = 370.149 ms
latency average = 370.157 ms

Master + 0001 + 0005:
$ pgbench -n -f bench.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 372.662 ms
latency average = 371.034 ms
latency average = 372.709 ms

-- Test 2 "select count(*) from t" with all tuples frozen

$ cat bench1.sql
select count(*) from t;

psql -c "vacuum freeze t;" postgres
psql -c "select pg_prewarm('t');" postgres

Master (@9c6ad5eaa):
$ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 406.238 ms
latency average = 407.029 ms
latency average = 406.962 ms

Master + 0001 + 0005:
$ pgbench -n -f bench1.sql -M prepared -T 10 postgres | grep -E "^latency"
latency average = 345.470 ms
latency average = 345.775 ms
latency average = 345.354 ms

My current thoughts are that it might be best to go with 0005 to start with. I know Melanie is working on making some changes in this area, so perhaps it's best to leave 0002 until that work is complete.

David

From 491df9d6ab87a54bbc76b876484733d02d6c94ea Mon Sep 17 00:00:00 2001
From: David Rowley
Date: Wed, 19 Oct 2022 08:54:01 +1300
Subject: [PATCH v2 1/5] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC, Clang and MSVC builtins.
Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17
 configure                  | 40 ++
 configure.ac               |  3 +++
 meson.build                |  3 ++-
 src/include/c.h            |  8
 src/include/pg_config.h.in |  3 +++
 src/tools/msvc/Solution.pm |  1 +
 7 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 000b075312..582a47501c 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@
 AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
   [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
+
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# ---
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+	$1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+  [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC

 # PGAC_CHECK_BUILTIN_FUNC_PTR

diff --git a/configure b/configure
index 3966368b8d..c4685b8a1e 100755
--- a/configure
+++ b/configure
@@ -15988,6 +15988,46 @@
 fi

+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
Re: Prefetch the next tuple's memory during seqscans
On Tue, Nov 1, 2022 at 5:17 AM David Rowley wrote:
>
> My test is to run 16 queries changing the WHERE clause each time to
> have WHERE a = 0, then WHERE a2 = 0 ... WHERE a16 = 0. I wanted to
> know if prefetching only the first cache line of the tuple would be
> less useful when we require evaluation of say, the "a16" column vs the
> "a" column.

I tried a similar test, but with text fields of random length, and there is improvement here:

Intel laptop, turbo boost off
shared_buffers = '4GB'
huge_pages = 'on'
max_parallel_workers_per_gather = '0'

create table text8 as
select
  repeat('X', int4(random() * 20)) a1,
  repeat('X', int4(random() * 20)) a2,
  repeat('X', int4(random() * 20)) a3,
  repeat('X', int4(random() * 20)) a4,
  repeat('X', int4(random() * 20)) a5,
  repeat('X', int4(random() * 20)) a6,
  repeat('X', int4(random() * 20)) a7,
  repeat('X', int4(random() * 20)) a8
from generate_series(1,1000) a;
vacuum freeze text8;

psql -c "select pg_prewarm('text8')" && \
for i in a1 a2 a3 a4 a5 a6 a7 a8;
do
  echo Testing $i
  echo "select * from text8 where $i = 'ZZZ';" > bench.sql
  pgbench -f bench.sql -M prepared -n -T 10 postgres | grep latency
done

Master:
Testing a1
latency average = 980.595 ms
Testing a2
latency average = 1045.081 ms
Testing a3
latency average = 1107.736 ms
Testing a4
latency average = 1162.188 ms
Testing a5
latency average = 1213.985 ms
Testing a6
latency average = 1272.156 ms
Testing a7
latency average = 1318.281 ms
Testing a8
latency average = 1363.359 ms

Patch 0001+0003:
Testing a1
latency average = 812.548 ms
Testing a2
latency average = 897.303 ms
Testing a3
latency average = 955.997 ms
Testing a4
latency average = 1023.497 ms
Testing a5
latency average = 1088.494 ms
Testing a6
latency average = 1149.418 ms
Testing a7
latency average = 1213.134 ms
Testing a8
latency average = 1282.760 ms

--
John Naylor
EDB: http://www.enterprisedb.com
Re: Prefetch the next tuple's memory during seqscans
Hi,

On 2022-11-02 10:25:44 -0700, Andres Freund wrote:
> server is started with
> local: numactl --membind 1 --physcpubind 10
> remote: numactl --membind 0 --physcpubind 10
> interleave: numactl --interleave=all --physcpubind 10

Argh, forgot to say that this is with max_parallel_workers_per_gather=0, s_b=8GB, huge_pages=on.

Greetings,

Andres Freund
Re: Prefetch the next tuple's memory during seqscans
Hi,

On 2022-11-01 20:00:43 -0700, Andres Freund wrote:
> I suspect that prefetching in heapgetpage() would provide gains as well, at
> least for pages that aren't marked all-visible, pretty common in the real
> world IME.

Attached is an experimental patch/hack for that. It ended up being more beneficial to make the access ordering more optimal than prefetching the tuple contents, but I'm not at all sure that's the be-all-end-all.

I separately benchmarked pinning the CPU and memory to the same socket, a different socket, and interleaved memory. I did this for HEAD, your patch, your patch plus mine, and mine alone.

BEGIN;
DROP TABLE IF EXISTS large;
CREATE TABLE large(a int8 not null, b int8 not null default '0', c int8);
INSERT INTO large SELECT generate_series(1, 5000);
COMMIT;

server is started with
local: numactl --membind 1 --physcpubind 10
remote: numactl --membind 0 --physcpubind 10
interleave: numactl --interleave=all --physcpubind 10

benchmark started with:
psql -qX -f ~/tmp/prewarm.sql && \
pgbench -n -f ~/tmp/seqbench.sql -t 1 -r > /dev/null && \
perf stat -e task-clock,LLC-loads,LLC-load-misses,cycles,instructions -C 10 \
pgbench -n -f ~/tmp/seqbench.sql -t 3 -r

seqbench.sql:
SELECT count(*) FROM large WHERE c IS NOT NULL;
SELECT sum(a), sum(b), sum(c) FROM large;
SELECT sum(c) FROM large;

branch        memory      time s  miss %
head          local       31.612  74.03
david         local       32.034  73.54
david+andres  local       31.644  42.80
andres        local       30.863  48.05
head          remote      33.350  72.12
david         remote      33.425  71.30
david+andres  remote      32.428  49.57
andres        remote      30.907  44.33
head          interleave  32.465  71.33
david         interleave  33.176  72.60
david+andres  interleave  32.590  46.23
andres        interleave  30.440  45.13

It's cool seeing how optimizing heapgetpage seems to pretty much remove the performance difference between local / remote memory.

It makes some sense that David's patch doesn't help in this case - without all-visible being set, the tuple headers will have already been pulled in for the HTSV call.
I've not yet experimented with moving the prefetch for the tuple contents from David's location to before the HTSV. I suspect that might benefit both workloads.

Greetings,

Andres Freund

diff --git i/src/include/access/heapam.h w/src/include/access/heapam.h
index 9dab35551e1..dff7616abeb 100644
--- i/src/include/access/heapam.h
+++ w/src/include/access/heapam.h
@@ -74,7 +74,8 @@ typedef struct HeapScanDescData
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	int			rs_cindex;		/* current tuple's index in vistuples */
 	int			rs_ntuples;		/* number of visible tuples on page */
-	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
+	OffsetNumber *rs_vistuples;
+	OffsetNumber rs_vistuples_d[MaxHeapTuplesPerPage];	/* their offsets */
 } HeapScanDescData;

 typedef struct HeapScanDescData *HeapScanDesc;

diff --git i/src/backend/access/heap/heapam.c w/src/backend/access/heap/heapam.c
index 12be87efed4..632f315f4e1 100644
--- i/src/backend/access/heap/heapam.c
+++ w/src/backend/access/heap/heapam.c
@@ -448,30 +448,99 @@ heapgetpage(TableScanDesc sscan, BlockNumber page)
 	 */
 	all_visible = PageIsAllVisible(dp) && !snapshot->takenDuringRecovery;

-	for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
-		 lineoff <= lines;
-		 lineoff++, lpp++)
+	if (all_visible)
 	{
-		if (ItemIdIsNormal(lpp))
-		{
-			HeapTupleData loctup;
-			bool		valid;
+		HeapTupleData loctup;
+
+		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+
+		scan->rs_vistuples = scan->rs_vistuples_d;
+
+		for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
+			 lineoff <= lines;
+			 lineoff++, lpp++)
+		{
+			if (!ItemIdIsNormal(lpp))
+				continue;

-			loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
 			loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
 			loctup.t_len = ItemIdGetLength(lpp);
 			ItemPointerSet(&(loctup.t_self), page, lineoff);

-			if (all_visible)
-				valid = true;
-			else
-				valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
+			HeapCheckForSerializableConflictOut(true, scan->rs_base.rs_rd,
+												&loctup, buffer, snapshot);
+			scan->rs_vistuples[ntup++] = lineoff;
+		}
+	}
+	else
+	{
+		HeapTupleData loctup;
+		int			normcount = 0;
+		OffsetNumber normoffsets[MaxHeapTuplesPerPage];
+
+		loctup.t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+
+		for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
+			 lineoff <= lines;
+			 lineoff++, lpp++)
+
+		/*
+		 * Iterate forward over line items, they're laid out in increasing
+		 * order in memory. Doing this separately allows to benefit from
+		 * out-of-order capabilities of the CPU and simplifies the next loop.
Re: Prefetch the next tuple's memory during seqscans
Hi,

On 2022-10-31 16:52:52 +1300, David Rowley wrote:
> As part of the AIO work [1], Andres mentioned to me that he found that
> prefetching tuple memory during hot pruning showed significant wins.
> I'm not proposing anything to improve HOT pruning here

I did try to reproduce my old results, and it does look like we already get most of the gains from prefetching via 18b87b201f7. I see gains from prefetching before that patch, but see it hurt after. If I reverse the iteration order from 18b87b201f7, prefetching helps again.

> but as a segue to get the prefetching infrastructure in so that there are
> fewer AIO patches, I'm proposing we prefetch the next tuple during sequential
> scans while in page mode.

> Time: 328.225 ms (avg ~7.7% faster)
> ...
> Time: 410.843 ms (avg ~22% faster)

That's a pretty impressive result. I suspect that prefetching in heapgetpage() would provide gains as well, at least for pages that aren't marked all-visible, which is pretty common in the real world IME.

Greetings,

Andres Freund
Re: Prefetch the next tuple's memory during seqscans
On Wed, 2 Nov 2022 at 00:09, Andy Fan wrote:
> I just happen to have a different platform at hand. Here is my test with an
> Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz.
> shared_buffers has been set big enough to hold all the data.

Many thanks for testing that. Those numbers look much better than the ones I got from my AMD machine.

> In theory, why does the prefetch make things better? I am asking this
> because I think we need to read the data from buffer to cache line once
> in either case (I'm obviously wrong in the face of the test results.)

That's a good question. I didn't really explain that in my email.

There's quite a bit of information in [1]. My basic understanding is that many modern CPU architectures are OK at "Sequential Prefetching" of cache lines from main memory when the direction is forward, but I believe that they're not very good at detecting access patterns that are scanning memory addresses in a backwards direction.

Because of our page layout, we have the page header followed by item pointers at the start of the page. These item pointers are fixed width and point to the tuples, which are variable width. Tuples are written starting at the end of the page. The page is full when the tuples would overlap with the item pointers. See diagrams in [2].

We do our best to keep those tuples in reverse order of the item pointer array. This means when we're performing a forward sequence scan, we're (generally) reading tuples starting at the end of the page and working backwards. Since the CPU is not very good at noticing this and prefetching the preceding cacheline, we can make things go faster (seemingly) by issuing a manual prefetch operation by way of pg_prefetch_mem().

The key here is that accessing RAM is far slower than accessing CPU caches. Modern CPUs can perform multiple operations in parallel and these can be rearranged by the CPU so they're not in the same order as the instructions are written in the programme.
It's possible that high-latency operations such as accessing RAM could hold up other operations which depend on the value of what's waiting to come in from RAM. If the CPU is held up like this, it's called a pipeline stall [3]. The prefetching in this case is helping to reduce the time spent stalled waiting for memory access.

David

[1] https://en.wikipedia.org/wiki/Cache_prefetching
[2] https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/speeding-up-recovery-and-vacuum-in-postgres-14/ba-p/2234071
[3] https://en.wikipedia.org/wiki/Pipeline_stall
Re: Prefetch the next tuple's memory during seqscans
On Wed, Nov 2, 2022 at 12:09 AM Andy Fan wrote:
> In theory, why does the prefetch make things better? I am asking this
> because I think we need to read the data from buffer to cache line once
> in either case (I'm obviously wrong in the face of the test results.)

CPUs have several different kinds of 'hardware prefetchers' (worth reading about) that look out for sequential and striding patterns and try to get the cache line ready before you access it. Using the prefetch instructions explicitly is called 'software prefetching' (special instructions inserted by programmers or compilers). The theory here would have to be that the hardware prefetchers couldn't pick up the pattern, but we know how to do it.

The exact details of the hardware prefetchers vary between chips, and there are even some parameters you can adjust in BIOS settings. One idea is that the hardware prefetchers are generally biased towards increasing addresses, but our tuples tend to go backwards on the page [1]. It's possible that some other CPUs can detect backwards strides better, but since real-world tuples aren't of equal size anyway, there isn't really a fixed stride at all, so software prefetching seems quite promising for this...

[1] https://www.postgresql.org/docs/current/storage-page-layout.html#STORAGE-PAGE-LAYOUT-FIGURE
Re: Prefetch the next tuple's memory during seqscans
Hi:

> Different platforms would be good. Certainly, 1 platform isn't a good
> enough indication that this is going to be useful.

I just happen to have a different platform at hand. Here is my test with an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz. shared_buffers has been set big enough to hold all the data.

columns  Master   Patched  Improvement
a        310.931  289.251  6.972608071
a2       329.577  299.975  8.981816085
a3       336.887  313.502  6.941496704
a4       352.099  325.345  7.598431123
a5       358.582  336.486  6.162049406
a6       375.004  349.12   6.902326375
a7       379.699  362.998  4.398484062
a8       391.911  371.41   5.231034597
a9       404.3    383.779  5.075686372
a10      425.48   396.114  6.901852026
a11      449.944  431.826  4.026723326
a12      461.876  443.579  3.961452857
a13      470.59   460.237  2.2425
a14      483.332  467.078  3.362905829
a15      490.798  472.262  3.776706507
a16      503.321  484.322  3.774728255

In theory, why does the prefetch make things better? I am asking this because I think we need to read the data from buffer to cache line once in either case (I'm obviously wrong in the face of the test results.)

Another simple point is that the styles below are equivalent, but format 3 looks clearest to me; it tells the code reader more. Just FYI:

pg_prefetch_mem(PageGetItem((Page) dp, lpp));
pg_prefetch_mem(tuple->t_data);
pg_prefetch_mem(scan->rs_ctup.t_data);

--
Best Regards
Andy Fan
Re: Prefetch the next tuple's memory during seqscans
On Tue, 1 Nov 2022 at 03:12, Aleksander Alekseev wrote:
> I wonder if we can be sure and/or check that there is no performance
> degradation under different loads and different platforms...

Different platforms would be good. Certainly, 1 platform isn't a good enough indication that this is going to be useful.

As for different loads, I imagine the worst case for this will be that the prefetched tuple is flushed from the cache by some other operation in the plan, making the prefetch useless. I tried the following so that we read 1 million tuples from a Sort node before coming back and reading another tuple from the seqscan.

create table a as select 1 as a from generate_series(1,2) a;
create table b as select 1 as a from generate_series(1,1000) a;
vacuum freeze a,b;
select pg_prewarm('a'),pg_prewarm('b');
set work_mem = '256MB';

select * from a, lateral (select * from b order by a) b offset 2000;

Master (@ a9f8ca600)
Time: 1414.590 ms (00:01.415)
Time: 1373.584 ms (00:01.374)
Time: 1373.057 ms (00:01.373)
Time: 1383.033 ms (00:01.383)
Time: 1378.865 ms (00:01.379)

Master + 0001 + 0003:
Time: 1352.726 ms (00:01.353)
Time: 1348.306 ms (00:01.348)
Time: 1358.033 ms (00:01.358)
Time: 1354.348 ms (00:01.354)
Time: 1353.971 ms (00:01.354)

As I'd have expected, I see no regression. It's hard to imagine we'd be able to measure the regression over the overhead of some operation that would evict everything out of cache. FWIW, this CPU has a 256MB L3 cache and the Sort node's EXPLAIN ANALYZE looks like:

Sort Method: quicksort Memory: 262144kB

> Also I see 0001 and 0003 but no 0002. Just wanted to double check that
> there is no patch missing.

Perhaps I should resequence the patches to avoid confusion. I didn't send 0002 on purpose. The 0002 is Andres' patch to prefetch during HOT pruning. Here I'm only interested in seeing if we can get the pg_prefetch_mem macros into core to reduce the number of AIO patches by 1.
Another thing about this is that I'm really only fetching the first cache line of the tuple. All columns in the t2 table (from the earlier email) are fixed width, so accessing the a16 column is a cached offset.

I ran a benchmark using the same t2 table as my earlier email, i.e:

-- table with 64 bytes of user columns
create table t2 as select a,a a2,a a3,a a4,a a5,a a6,a a7,a a8,a a9,a a10,a a11,a a12,a a13,a a14,a a15,a a16 from generate_series(1,1000)a;
vacuum freeze t2;

My test is to run 16 queries, changing the WHERE clause each time to have WHERE a = 0, then WHERE a2 = 0 ... WHERE a16 = 0. I wanted to know if prefetching only the first cache line of the tuple would be less useful when we require evaluation of, say, the "a16" column vs the "a" column.

The times below (in milliseconds) are what I got from a 10-second pgbench run:

column  master   patched
a       490.571  409.748
a2      428.004  430.927
a3      449.156  453.858
a4      474.945  479.73
a5      514.646  507.809
a6      517.525  519.956
a7      543.587  539.023
a8      562.718  559.387
a9      585.458  584.63
a10     609.143  604.606
a11     645.273  638.535
a12     658.848  657.377
a13     696.395  685.389
a14     702.779  716.722
a15     727.161  723.567
a16     756.186  749.396

I'm not sure how to explain why only the "a" column seems to improve and the rest seem mostly unaffected.

David

#!/bin/bash
psql -c "select pg_prewarm('t2');" postgres
for i in a a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16;
do
	echo Testing $i
	echo "select * from t2 where $i = 0;" > bench.sql
	pgbench -f bench.sql -M prepared -n -T 10 postgres | grep latency
done
Re: Prefetch the next tuple's memory during seqscans
Hi David,

> I'll add this to the November CF.

Thanks for the patch. I wonder if we can be sure and/or check that there is no performance degradation under different loads and different platforms...

Also I see 0001 and 0003 but no 0002. Just wanted to double check that there is no patch missing.

--
Best regards,
Aleksander Alekseev
Prefetch the next tuple's memory during seqscans
As part of the AIO work [1], Andres mentioned to me that he found that prefetching tuple memory during hot pruning showed significant wins. I'm not proposing anything to improve HOT pruning here, but as a segue to get the prefetching infrastructure in so that there are fewer AIO patches, I'm proposing we prefetch the next tuple during sequential scans while in page mode.

It turns out the gains are pretty good when we apply this:

-- table with 4 bytes of user columns
create table t as select a from generate_series(1,1000)a;
vacuum freeze t;
select pg_prewarm('t');

Master @ a9f8ca600
# select * from t where a = 0;
Time: 355.001 ms
Time: 354.573 ms
Time: 354.490 ms
Time: 354.556 ms
Time: 354.335 ms

Master + 0001 + 0003:
# select * from t where a = 0;
Time: 328.578 ms
Time: 329.387 ms
Time: 329.349 ms
Time: 329.704 ms
Time: 328.225 ms
(avg ~7.7% faster)

-- table with 64 bytes of user columns
create table t2 as select a,a a2,a a3,a a4,a a5,a a6,a a7,a a8,a a9,a a10,a a11,a a12,a a13,a a14,a a15,a a16 from generate_series(1,1000)a;
vacuum freeze t2;
select pg_prewarm('t2');

Master:
# select * from t2 where a = 0;
Time: 501.725 ms
Time: 501.815 ms
Time: 503.225 ms
Time: 501.242 ms
Time: 502.394 ms

Master + 0001 + 0003:
# select * from t2 where a = 0;
Time: 412.076 ms
Time: 410.669 ms
Time: 410.490 ms
Time: 409.782 ms
Time: 410.843 ms
(avg ~22% faster)

This was tested on an AMD 3990x CPU. I imagine the CPU matters quite a bit here. It would be interesting to see if the same or similar gains can be seen on some modern Intel chip too.

I believe Thomas wrote the 0001 patch (same as the patch in [2]?). I only quickly put together the 0003 patch. I wondered if we might want to add a macro to 0001 that says whether pg_prefetch_mem() is empty or not, then use that to #ifdef out the code I added to heapam.c. Although, perhaps most compilers will be able to optimise away the extra lines that are figuring out the address of the next tuple.
My tests above are likely the best case for this. It seems plausible to me that if there was a much more complex plan that found a reasonable number of tuples and did something with them, we wouldn't see the same sort of gains. It also does not seem impossible that the prefetch just results in evicting some useful-to-some-other-exec-node cache line, or that the prefetched tuple gets flushed out of the cache by the time we get around to fetching the next tuple from the scan again, due to various other node processing that's occurred since the seq scan was last called. I imagine such things would be indistinguishable from noise, but I've not tested.

I also tried prefetching out by 2 tuples. It didn't help any further than prefetching 1 tuple.

I'll add this to the November CF.

David

[1] https://www.postgresql.org/message-id/flat/20210223100344.llw5an2akleng...@alap3.anarazel.de
[2] https://www.postgresql.org/message-id/CA%2BhUKG%2Bpi63ZbcZkYK3XB1pfN%3DkuaDaeV0Ha9E%2BX_p6TTbKBYw%40mail.gmail.com

From 2fd10f1266550f26f4395de080bcdcf89b6859b6 Mon Sep 17 00:00:00 2001
From: David Rowley
Date: Wed, 19 Oct 2022 08:54:01 +1300
Subject: [PATCH 1/3] Add pg_prefetch_mem() macro to load cache lines.

Initially mapping to GCC, Clang and MSVC builtins.

Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com
---
 config/c-compiler.m4       | 17
 configure                  | 40 ++
 configure.ac               |  3 +++
 meson.build                |  3 ++-
 src/include/c.h            |  8
 src/include/pg_config.h.in |  3 +++
 src/tools/msvc/Solution.pm |  1 +
 7 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 000b075312..582a47501c 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -355,6 +355,23 @@
 AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
   [Define to 1 if your compiler understands $1.])
 fi])# PGAC_CHECK_BUILTIN_FUNC
+
+# PGAC_CHECK_BUILTIN_VOID_FUNC
+# ---
+# Variant for void functions.
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC],
+[AC_CACHE_CHECK(for $1, pgac_cv$1,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([
+void
+call$1($2)
+{
+	$1(x);
+}], [])],
+[pgac_cv$1=yes],
+[pgac_cv$1=no])])
+if test x"${pgac_cv$1}" = xyes ; then
+AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1,
+  [Define to 1 if your compiler understands $1.])
+fi])# PGAC_CHECK_BUILTIN_VOID_FUNC

 # PGAC_CHECK_BUILTIN_FUNC_PTR

diff --git a/configure b/configure
index 3966368b8d..c4685b8a1e 100755
--- a/configure
+++ b/configure
@@ -15988,6 +15988,46 @@
 fi

+# Can we use a built-in to prefetch memory?
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end c