Re: Streaming read-ready sequential scan code
On Fri, Aug 30, 2024 at 1:00 AM Alexander Lakhin wrote:
> I looked at two perf profiles of such out-of-sync processes and found no
> extra calls whatsoever in the slow one; it just has the number of perf
> samples increased proportionally. That made me suspect CPU frequency
> scaling... Indeed, with the "performance" governor set and boost mode
> disabled, I'm now seeing much more stable numbers (I do this tuning
> before running performance tests, but I had forgotten about that when I
> ran your test, my bad).

Aha, mystery solved. I have pushed the fix. Thanks!
Re: Streaming read-ready sequential scan code
Hello Thomas,

29.08.2024 01:16, Thomas Munro wrote:
> Yeah. That's quite interesting, and must destabilise that simple-minded
> demo. I'm curious to know exactly what contention is causing that (about
> 3/4 of a millisecond that I don't see, and now I want to know what it's
> waiting for), but it's a very crude test lacking timer resolution in the
> earlier messages, and it's an unrelated topic and a distraction. Perhaps
> it explains why you saw two different behaviours in Q15 with the patch
> and I didn't, though. Really it shouldn't be so sensitive to such
> variations; it's obviously a terrible plan, and TPC-DS needs a planner
> hacker mega-brain applied to it; I'm going to try to nerd-snipe one...

I looked at two perf profiles of such out-of-sync processes and found no
extra calls whatsoever in the slow one; it just has the number of perf
samples increased proportionally. That made me suspect CPU frequency
scaling... Indeed, with the "performance" governor set and boost mode
disabled, I'm now seeing much more stable numbers (I do this tuning before
running performance tests, but I had forgotten about that when I ran your
test, my bad). I'm sorry for the noise and the distraction.

Still, now I can confirm your results. Without the patch, two parallel
workers gave "Buffers: shared hit=217 / Buffers: shared hit=226" 10 times
out of 10. And with the patch, I've got "shared hit=219 / shared hit=224"
3 times, "shared hit=220 / shared hit=223" 4 times, and "shared hit=221 /
shared hit=222" 3 times out of 10.

On b7b0f3f27~1, my results are:
"shared hit=219 / shared hit=224": 2
"shared hit=220 / shared hit=223": 3
"shared hit=221 / shared hit=222": 4
"shared hit=218 / shared hit=225": 1

Best regards,
Alexander
Re: Streaming read-ready sequential scan code
On Wed, Aug 28, 2024 at 9:00 PM Alexander Lakhin wrote:
> 2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262"
> 2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262"
> 2024-08-28 08:41:59.305 UTC [338263] DEBUG: InitPostgres
> 2024-08-28 08:41:59.305 UTC [338264] DEBUG: InitPostgres
> 2024-08-28 08:41:59.309 UTC [338263] NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263
> 2024-08-28 08:41:59.310 UTC [338264] NOTICE: 2024-08-28 08:41:59.310079+00 pid 338264
>
> It looks like the two parallel workers were started simultaneously, but
> then the second one lagged behind...

Yeah. That's quite interesting, and must destabilise that simple-minded
demo. I'm curious to know exactly what contention is causing that (about
3/4 of a millisecond that I don't see, and now I want to know what it's
waiting for), but it's a very crude test lacking timer resolution in the
earlier messages, and it's an unrelated topic and a distraction. Perhaps
it explains why you saw two different behaviours in Q15 with the patch and
I didn't, though. Really it shouldn't be so sensitive to such variations;
it's obviously a terrible plan, and TPC-DS needs a planner hacker
mega-brain applied to it; I'm going to try to nerd-snipe one...
Re: Streaming read-ready sequential scan code
28.08.2024 07:45, Thomas Munro wrote:
> Huh. I can reproduce what I showed with pretty low variance, on my
> FreeBSD, Linux and macOS systems. I included
> parallel_leader_participation=off so that the workers would hopefully
> start as closely together in time as possible, and hopefully allow only
> a block or so of variation in the outcome. If I do:
>
> create or replace function f(i int) returns int language plpgsql
> parallel safe as $$
> begin
>   raise notice '% pid %', clock_timestamp(), pg_backend_pid();
>   return i;
> end;
> $$;
>
> then
>
> postgres=# explain (analyze) select f(i) from t limit 1;
> NOTICE: 2024-08-28 16:41:32.845747+12 pid 47019
> NOTICE: 2024-08-28 16:41:32.845746+12 pid 47018
>
> shows start times differ by only a few microseconds at most, often 0. I
> wonder if your system is more variable at beginning execution? Maybe
> you're on a multi-socket system and forking/startup times vary in some
> way I can't see on small systems or something like that?

I use a single-socket system with AMD Ryzen 5900X, running Ubuntu 20.04.
With no active background tasks running, I'm seeing:

NOTICE: 2024-08-28 08:17:36.917162+00 pid 320103
NOTICE: 2024-08-28 08:17:36.917163+00 pid 320102

NOTICE: 2024-08-28 08:17:40.592333+00 pid 320143
NOTICE: 2024-08-28 08:17:40.592645+00 pid 320144

With log_min_messages = DEBUG3, I get:

NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263
NOTICE: 2024-08-28 08:41:59.310079+00 pid 338264

And the following messages in the server.log:

2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262"
2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262"
2024-08-28 08:41:59.305 UTC [338263] DEBUG: InitPostgres
2024-08-28 08:41:59.305 UTC [338264] DEBUG: InitPostgres
2024-08-28 08:41:59.309 UTC [338263] NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263
2024-08-28 08:41:59.309 UTC [338263] CONTEXT: PL/pgSQL function f(integer) line 3 at RAISE
2024-08-28 08:41:59.309 UTC [338262] NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263
2024-08-28 08:41:59.309 UTC [338262] CONTEXT: PL/pgSQL function f(integer) line 3 at RAISE
	parallel worker
2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(0): 5 before_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(0): 6 on_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG: proc_exit(0): 1 callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG: exit(0)
2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG: proc_exit(-1): 0 callbacks to make
2024-08-28 08:41:59.310 UTC [338264] NOTICE: 2024-08-28 08:41:59.310079+00 pid 338264
2024-08-28 08:41:59.310 UTC [338264] CONTEXT: PL/pgSQL function f(integer) line 3 at RAISE

It looks like the two parallel workers were started simultaneously, but
then the second one lagged behind...

Best regards,
Alexander
Re: Streaming read-ready sequential scan code
On Wed, Aug 28, 2024 at 1:00 AM Alexander Lakhin wrote:
> gives me unstable numbers on unpatched master:
> Buffers: shared hit=226
> Buffers: shared hit=217
> Buffers: shared hit=162
> Buffers: shared hit=281
> Buffers: shared hit=210
> Buffers: shared hit=233
> and also with the patch applied:
> Buffers: shared hit=246
> Buffers: shared hit=197
> Buffers: shared hit=247
> Buffers: shared hit=196
> Buffers: shared hit=224
> Buffers: shared hit=219
> Buffers: shared hit=291
> Buffers: shared hit=152
>
> Please correct me if I'm doing something wrong.

Huh. I can reproduce what I showed with pretty low variance, on my
FreeBSD, Linux and macOS systems. I included
parallel_leader_participation=off so that the workers would hopefully
start as closely together in time as possible, and hopefully allow only a
block or so of variation in the outcome. If I do:

create or replace function f(i int) returns int language plpgsql
parallel safe as $$
begin
  raise notice '% pid %', clock_timestamp(), pg_backend_pid();
  return i;
end;
$$;

then

postgres=# explain (analyze) select f(i) from t limit 1;
NOTICE: 2024-08-28 16:41:32.845747+12 pid 47019
NOTICE: 2024-08-28 16:41:32.845746+12 pid 47018

shows start times differ by only a few microseconds at most, often 0. I
wonder if your system is more variable at beginning execution? Maybe
you're on a multi-socket system and forking/startup times vary in some way
I can't see on small systems or something like that?
Re: Streaming read-ready sequential scan code
Hello Thomas,

27.08.2024 09:52, Thomas Munro wrote:
> Here's a really simple way to see the new unfairness at the end of a
> parallel scan:
>
> drop table if exists t;
> create table t (i int);
> insert into t select generate_series(1, 100000);
> alter table t set (parallel_workers = 2);
> set parallel_setup_cost = 0;
> set parallel_leader_participation = off;
> explain (analyze, buffers, verbose) select count(*) from t;
>
> On my machine, unpatched master shows:
>
> Worker 0: actual time=0.036..12.452 rows=51076 loops=1
>   Buffers: shared hit=226
> Worker 1: actual time=0.037..12.003 rows=48924 loops=1
>   Buffers: shared hit=217
>
> The attached patch, which I'd like to push, is effectively what
> Alexander tested (blocknums[16] -> blocknums[1]). There's no point in
> using an array of size 1, so I've turned it into a simple variable and
> deleted the relevant comments. My machine shows:
>
> Worker 0: actual time=0.038..12.115 rows=49946 loops=1
>   Buffers: shared hit=221
> Worker 1: actual time=0.038..12.109 rows=50054 loops=1
>   Buffers: shared hit=222
>
> That difference may not seem huge, but other pre-existing things are
> going pathologically wrong in the reported query that magnify it (see my
> earlier analysis). It's an interesting problem that will require more
> study (my earlier analysis missed a detail that I'll write about
> separately), but it doesn't seem to be new or have easy fixes, so that
> will have to be for later work.

I've tried your query and could not get stable results, unfortunately.
The following script:

rm -rf "$PGDATA"; initdb -D "$PGDATA" >initdb.log 2>&1
pg_ctl -s -l server.log start
cat << EOF | psql | grep 'Parallel Seq Scan' -A10 | grep 'Worker ' -A1
create table t (i int);
insert into t select generate_series(1, 100000);
alter table t set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_leader_participation = off;
explain (analyze, buffers, verbose) select count(*) from t;
EOF
pg_ctl -s stop

gives me unstable numbers on unpatched master:

Worker 0: actual time=0.024..5.814 rows=51076 loops=1
  Buffers: shared hit=226
Worker 1: actual time=0.023..5.614 rows=48924 loops=1
  Buffers: shared hit=217
---
Worker 0: actual time=0.027..5.130 rows=36612 loops=1
  Buffers: shared hit=162
Worker 1: actual time=0.013..5.605 rows=63388 loops=1
  Buffers: shared hit=281
---
Worker 0: actual time=0.025..5.447 rows=47460 loops=1
  Buffers: shared hit=210
Worker 1: actual time=0.019..5.688 rows=52540 loops=1
  Buffers: shared hit=233

and also with the patch applied:

Worker 0: actual time=0.012..4.486 rows=55478 loops=1
  Buffers: shared hit=246
Worker 1: actual time=0.014..4.430 rows=44522 loops=1
  Buffers: shared hit=197
---
Worker 0: actual time=0.013..4.269 rows=55822 loops=1
  Buffers: shared hit=247
Worker 1: actual time=0.017..4.238 rows=44178 loops=1
  Buffers: shared hit=196
---
Worker 0: actual time=0.014..4.974 rows=50624 loops=1
  Buffers: shared hit=224
Worker 1: actual time=0.016..4.932 rows=49376 loops=1
  Buffers: shared hit=219
---
Worker 0: actual time=0.012..5.459 rows=65648 loops=1
  Buffers: shared hit=291
Worker 1: actual time=0.022..5.109 rows=34352 loops=1
  Buffers: shared hit=152

Please correct me if I'm doing something wrong.

Best regards,
Alexander
Re: Streaming read-ready sequential scan code
Here's a really simple way to see the new unfairness at the end of a
parallel scan:

drop table if exists t;
create table t (i int);
insert into t select generate_series(1, 100000);
alter table t set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_leader_participation = off;
explain (analyze, buffers, verbose) select count(*) from t;

On my machine, unpatched master shows:

Worker 0: actual time=0.036..12.452 rows=51076 loops=1
  Buffers: shared hit=226
Worker 1: actual time=0.037..12.003 rows=48924 loops=1
  Buffers: shared hit=217

The attached patch, which I'd like to push, is effectively what Alexander
tested (blocknums[16] -> blocknums[1]). There's no point in using an
array of size 1, so I've turned it into a simple variable and deleted the
relevant comments. My machine shows:

Worker 0: actual time=0.038..12.115 rows=49946 loops=1
  Buffers: shared hit=221
Worker 1: actual time=0.038..12.109 rows=50054 loops=1
  Buffers: shared hit=222

That difference may not seem huge, but other pre-existing things are going
pathologically wrong in the reported query that magnify it (see my earlier
analysis). It's an interesting problem that will require more study (my
earlier analysis missed a detail that I'll write about separately), but it
doesn't seem to be new or have easy fixes, so that will have to be for
later work.

From 07ff31ad30bf9e383e42336e28143852e3793c5b Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 27 Aug 2024 15:11:53 +1200
Subject: [PATCH] Fix unfairness in all-cached parallel seq scan.

Commit b5a9b18c introduced block streaming infrastructure with a special
fast path for all-cached blocks, and commit b7b0f3f2 connected the
infrastructure up to sequential scans.  One of the fast path
optimizations had an unintended consequence: it interfered with the
underlying parallel sequential scan's block allocator, which has its own
ramp-up and ramp-down algorithm.  A scan of a small all-cached table
could give more blocks to one worker.  In some plans (probably already
very bad plans, such as the one reported by Alexander), the unfairness
could be magnified.

Now all-cached scans will call the next-block-number callback just once
each time it wants a new block, instead of trying to buffer 16 block
numbers at once.

Back-patch to 17.

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/63a63690-dd92-c809-0b47-af05459e95d1%40gmail.com
---
 src/backend/storage/aio/read_stream.c | 82 ---
 1 file changed, 24 insertions(+), 58 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index a83c18c2a4b..57d9e93c001 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -117,13 +117,10 @@ struct ReadStream
 	bool		advice_enabled;
 
 	/*
-	 * Small buffer of block numbers, useful for 'ungetting' to resolve flow
-	 * control problems when I/Os are split.  Also useful for batch-loading
-	 * block numbers in the fast path.
+	 * One-block buffer to support 'ungetting' a block number, to resolve flow
+	 * control problems when I/Os are split.
 	 */
-	BlockNumber blocknums[16];
-	int16		blocknums_count;
-	int16		blocknums_next;
+	BlockNumber buffered_blocknum;
 
 	/*
 	 * The callback that will tell us which block numbers to read, and an
@@ -167,68 +164,39 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index)
 }
 
 /*
- * Ask the callback which block it would like us to read next, with a small
- * buffer in front to allow read_stream_unget_block() to work and to allow the
- * fast path to skip this function and work directly from the array.
+ * Ask the callback which block it would like us to read next, with a one-block
+ * buffer in front to allow read_stream_unget_block() to work.
  */
 static inline BlockNumber
 read_stream_get_block(ReadStream *stream, void *per_buffer_data)
 {
-	if (stream->blocknums_next < stream->blocknums_count)
-		return stream->blocknums[stream->blocknums_next++];
+	BlockNumber blocknum;
 
-	/*
-	 * We only bother to fetch one at a time here (but see the fast path which
-	 * uses more).
-	 */
-	return stream->callback(stream,
-							stream->callback_private_data,
-							per_buffer_data);
+	blocknum = stream->buffered_blocknum;
+	if (blocknum != InvalidBlockNumber)
+		stream->buffered_blocknum = InvalidBlockNumber;
+	else
+		blocknum = stream->callback(stream,
+									stream->callback_private_data,
+									per_buffer_data);
+
+	return blocknum;
 }
 
 /*
  * In order to deal with short reads in StartReadBuffers(), we sometimes need
- * to defer handling of a block until later.
+ * to defer handling of a block number we've already received from the callback
+ * until later.
  */
 static inline void
 read_stream_unget_block(ReadStream *stream, BlockNumber block
Re: Streaming read-ready sequential scan code
On Tue, May 21, 2024 at 9:11 AM Melanie Plageman wrote:
> So, if you are seeing the slow-down mostly go away by reducing the
> blocknums array size, does the regression only appear when the scan
> data is fully in shared buffers? Or is this blocknums' other use
> (dealing with short reads)?

That must be true (the blocknums array is normally only "filled" in the
"fast path", where all buffers are found in cache).

> Is your theory that one worker ends up reading 16 blocks that should
> have been distributed across multiple workers?

Yes, it just jiggles the odds around a bit, introducing a bit of extra
unfairness by calling the callback in a tighter loop to build a little
batch, revealing a pre-existing problem.

The mistake in PHJ (problem #2 above) is that, once a worker decides it
would like all workers to stop inserting so it can increase the number of
buckets, it sets a flag to ask them to do that, and waits for them to see
it, but if there is a worker filtering all tuples out, it never checks the
"growth" flag. So it scans all the way to the end while the other guy
waits. Normally it checks that flag when it is time to allocate a new
chunk of memory, which seemed to make sense to me at the time: if we've
hit the needs-more-buckets (or needs-more-batches) logic, then surely
workers are inserting tuples and will soon allocate a new chunk! But, of
course, here is the edge case where that isn't true: we had bad estimates,
so the hash table was too small (problem #1); we got lots of tuples
clustered over a few heap pages and decided to expand the hash table, but
right at that moment the matching tuples ran out, so somebody had to
finish the whole scan without ever checking the flag (problem #2); and
that someone happened to have all the rest of the pages because we made
the lookahead a bit less fair (problem #3). A nice confluence of
problems.

I expect #2 and #3 to be easy to fix, and I didn't look at the estimation
problem #1 at all (perhaps a stats puzzle designed by the TPC to trip us
up?).
Re: Streaming read-ready sequential scan code
Thank you to all of you for looking into this.

On Sat, May 18, 2024 at 12:47 AM Thomas Munro wrote:
> On Sat, May 18, 2024 at 11:30 AM Thomas Munro wrote:
> > Andres happened to have TPC-DS handy, and reproduced that regression
> > in q15. We tried some stuff and figured out that it requires
> > parallel_leader_participation=on, ie that this looks like some kind of
> > parallel fairness and/or timing problem. It seems to be a question of
> > which worker finishes up processing matching rows, and the leader gets
> > a ~10ms head start but may be a little more greedy with the new
> > streaming code. He tried reordering the table contents and then saw
> > 17 beat 16. So for q15, initial indications are that this isn't a
> > fundamental regression, it's just a test that is sensitive to some
> > arbitrary conditions.
> >
> > I'll try to figure out some more details about that, ie is it being
> > too greedy on small-ish tables,
>
> After more debugging, we learned a lot more things...
>
> 1. That query produces spectacularly bad estimates, so we finish up
> having to increase the number of buckets in a parallel hash join many
> times. That is quite interesting, but unrelated to new code.
>
> 2. Parallel hash join is quite slow at negotiating an increase in the
> number of hash buckets, if all of the input tuples are being filtered
> out by quals, because of the choice of where workers check for
> PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I
> think. I have put that on my todo list 'cause that's also my code, but
> it's not a new issue, it's just one that is now highlighted...
>
> 3. This bit of read_stream.c is exacerbating unfairness in the
> underlying scan, so that 1 and 2 come together and produce a nasty
> slowdown, which goes away if you change it like so:
>
> - BlockNumber blocknums[16];
> + BlockNumber blocknums[1];
>
> I will follow up after some more study.

So, if you are seeing the slow-down mostly go away by reducing the
blocknums array size, does the regression only appear when the scan data
is fully in shared buffers? Or is this blocknums' other use (dealing with
short reads)?

Is your theory that one worker ends up reading 16 blocks that should have
been distributed across multiple workers?

- Melanie
Re: Streaming read-ready sequential scan code
On Sun, May 19, 2024 at 7:00 AM Alexander Lakhin wrote:
> With blocknums[1], timing is changed, but the effect is not persistent.
> 10 query15 executions in a row, b7b0f3f27:
> 277.932 ms
> 281.805 ms
> 278.335 ms
> 281.565 ms
> 284.167 ms
> 283.171 ms
> 281.165 ms
> 281.615 ms
> 285.394 ms
> 277.301 ms

The bad time 10/10.

> b7b0f3f27~1:
> 159.789 ms
> 165.407 ms
> 160.893 ms
> 159.343 ms
> 160.936 ms
> 161.577 ms
> 161.637 ms
> 163.421 ms
> 163.143 ms
> 167.109 ms

The good time 10/10.

> b7b0f3f27 + blocknums[1]:
> 164.133 ms
> 280.920 ms
> 160.748 ms
> 163.182 ms
> 161.709 ms
> 161.998 ms
> 161.239 ms
> 276.256 ms
> 161.601 ms
> 160.384 ms

The good time 8/10, the bad time 2/10.

Thanks for checking! I bet all branches can show that flip/flop
instability in these adverse conditions, depending on random scheduling
details. I will start a new thread with a patch for the root cause of
that, ie problem #2 (this will need back-patching), and post a fix for #3
(the v17 blocknums[N] tweak affecting fairness/likelihood, which was
probably basically a bit of ill-advised premature optimisation) here in a
few days.
Re: Streaming read-ready sequential scan code
Hello Thomas,

18.05.2024 07:47, Thomas Munro wrote:
> After more debugging, we learned a lot more things...
>
> 1. That query produces spectacularly bad estimates, so we finish up
> having to increase the number of buckets in a parallel hash join many
> times. That is quite interesting, but unrelated to new code.
>
> 2. Parallel hash join is quite slow at negotiating an increase in the
> number of hash buckets, if all of the input tuples are being filtered
> out by quals, because of the choice of where workers check for
> PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I
> think. I have put that on my todo list 'cause that's also my code, but
> it's not a new issue, it's just one that is now highlighted...
>
> 3. This bit of read_stream.c is exacerbating unfairness in the
> underlying scan, so that 1 and 2 come together and produce a nasty
> slowdown, which goes away if you change it like so:
>
> - BlockNumber blocknums[16];
> + BlockNumber blocknums[1];
>
> I will follow up after some more study.

Thank you for the information!

Unfortunately, I can't see significant differences in my environment with
parallel_leader_participation=off.

With blocknums[1], timing is changed, but the effect is not persistent.
10 query15 executions in a row, b7b0f3f27:
277.932 ms
281.805 ms
278.335 ms
281.565 ms
284.167 ms
283.171 ms
281.165 ms
281.615 ms
285.394 ms
277.301 ms

b7b0f3f27~1:
159.789 ms
165.407 ms
160.893 ms
159.343 ms
160.936 ms
161.577 ms
161.637 ms
163.421 ms
163.143 ms
167.109 ms

b7b0f3f27 + blocknums[1]:
164.133 ms
280.920 ms
160.748 ms
163.182 ms
161.709 ms
161.998 ms
161.239 ms
276.256 ms
161.601 ms
160.384 ms

I placed PGDATA on tmpfs to rule out any blockdev specifics (increasing
the blockdev ra setting from 256 to 4096 didn't help me with PGDATA on
NVMe either.)

Best regards,
Alexander
Re: Streaming read-ready sequential scan code
On Sat, May 18, 2024 at 11:30 AM Thomas Munro wrote:
> Andres happened to have TPC-DS handy, and reproduced that regression
> in q15. We tried some stuff and figured out that it requires
> parallel_leader_participation=on, ie that this looks like some kind of
> parallel fairness and/or timing problem. It seems to be a question of
> which worker finishes up processing matching rows, and the leader gets
> a ~10ms head start but may be a little more greedy with the new
> streaming code. He tried reordering the table contents and then saw
> 17 beat 16. So for q15, initial indications are that this isn't a
> fundamental regression, it's just a test that is sensitive to some
> arbitrary conditions.
>
> I'll try to figure out some more details about that, ie is it being
> too greedy on small-ish tables,

After more debugging, we learned a lot more things...

1. That query produces spectacularly bad estimates, so we finish up having
to increase the number of buckets in a parallel hash join many times.
That is quite interesting, but unrelated to new code.

2. Parallel hash join is quite slow at negotiating an increase in the
number of hash buckets, if all of the input tuples are being filtered out
by quals, because of the choice of where workers check for
PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I
think. I have put that on my todo list 'cause that's also my code, but
it's not a new issue, it's just one that is now highlighted...

3. This bit of read_stream.c is exacerbating unfairness in the underlying
scan, so that 1 and 2 come together and produce a nasty slowdown, which
goes away if you change it like so:

- BlockNumber blocknums[16];
+ BlockNumber blocknums[1];

I will follow up after some more study.
Re: Streaming read-ready sequential scan code
On Sat, May 18, 2024 at 8:09 AM Thomas Munro wrote:
> On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin wrote:
> > I decided to compare v17 vs v16 performance (as I did last year [1])
> > and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
> > benchmark, query15 (and several others, but I focused on this one):
> > Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percent
> > (229.84 > 151.03): pg_tpcds.query15
> > Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percent
> > (234.20 > 152.64): pg_tpcds.query15
> > Please look at the full html report attached in case you're interested.
> >
> > (I used my pg-mark tool to measure/analyze performance, but I believe
> > the same results can be seen without it.)
>
> Will investigate, but if it's easy for you to rerun, does it help if
> you increase Linux readahead, eg the blockdev --setra setting?

Andres happened to have TPC-DS handy, and reproduced that regression in
q15. We tried some stuff and figured out that it requires
parallel_leader_participation=on, ie that this looks like some kind of
parallel fairness and/or timing problem. It seems to be a question of
which worker finishes up processing matching rows, and the leader gets a
~10ms head start but may be a little more greedy with the new streaming
code. He tried reordering the table contents and then saw 17 beat 16.
So for q15, initial indications are that this isn't a fundamental
regression, it's just a test that is sensitive to some arbitrary
conditions.

I'll try to figure out some more details about that, ie is it being too
greedy on small-ish tables, and generally I do wonder about the
interactions between the heuristics and batching working at different
levels (OS, seq scan, read stream, hence my earlier readahead question,
which is likely a red herring) and how there might be unintended
consequences/interference patterns, but this particular case seems more
data dependent.
Re: Streaming read-ready sequential scan code
On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin wrote:
> I decided to compare v17 vs v16 performance (as I did last year [1])
> and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
> benchmark, query15 (and several others, but I focused on this one):
> Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percent
> (229.84 > 151.03): pg_tpcds.query15
> Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percent
> (234.20 > 152.64): pg_tpcds.query15
> Please look at the full html report attached in case you're interested.
>
> (I used my pg-mark tool to measure/analyze performance, but I believe
> the same results can be seen without it.)

Will investigate, but if it's easy for you to rerun, does it help if you
increase Linux readahead, eg the blockdev --setra setting?
Re: Streaming read-ready sequential scan code
Hello,

I decided to compare v17 vs v16 performance (as I did last year [1]) and
discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds) benchmark,
query15 (and several others, but I focused on this one):

Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percent (229.84 > 151.03): pg_tpcds.query15
Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percent (234.20 > 152.64): pg_tpcds.query15

Please look at the full html report attached in case you're interested.

(I used my pg-mark tool to measure/analyze performance, but I believe the
same results can be seen without it.)

`git bisect` for this performance degradation pointed at b7b0f3f27...

[1] https://www.postgresql.org/message-id/b32bed1b-0746-9b20-1472-4bdc9ca66d52%40gmail.com

Best regards,
Alexander

[Attachment: "Postgres benchmarking results" HTML report (run started
2024-05-14T12:04:26, ended 2024-05-15T05:44:15), comparing PostgreSQL
16.3 (REL_16_STABLE/c1664c8eef) and 17devel (master/2e810bdb7f) instances
across the pgbench_native, pgbench_reference, pg_tpch, and pg_tpcds
benchmarks; the flattened table data is not reproduced here.]
Re: Streaming read-ready sequential scan code
Hi, On 2024-04-08 09:36:59 +1200, Thomas Munro wrote: > I've been on the fence about that flag for sequential scan... Some > days I want to consider changing to READ_STREAM_DEFAULT and relying on > our "anti-heuristics" to suppress advice, which would work out the > same in most cases but might occasionally win big. Agreed, it's pretty easy to end up with a fairly "fragmented" set of a relation's buffers in s_b. OTOH, there might not be any need for the heuristic if we actually trigger reads asynchronously. > BTW looking at the branching in read-stream user patches that have an > initialisation step like yours, I wonder if it might ever make sense > to be able to change the callback on the fly from inside the callback, > so that you finish up with a branchless one doing most of the work. I > have no idea if it's worth it... I was wondering about that too, I dislike those branches. But instead of changing the callback, it seems like a better design would be to have another dedicated callback for that? There already is a dedicated branch for the "just starting up" path in read_stream_next_buffer(), so it'd be pretty much free to call another callback there. Greetings, Andres Freund
Re: Streaming read-ready sequential scan code
On Sun, Apr 7, 2024 at 1:34 PM Melanie Plageman wrote: > Attached v13 0001 is your fix and 0002 is a new version of the > sequential scan streaming read user. Off-list Andres mentioned that I > really ought to separate the parallel and serial sequential scan users > into two different callbacks. I've done that in the attached. It > actually makes the code used by the callbacks nicer and more readable > anyway (putting aside performance). I was able to measure a small > performance difference as well. Thanks. I changed a couple of very trivial things before pushing. +BlockNumber (*cb) (ReadStream *pgsr, void *private_data, + void *per_buffer_data); This type has a friendly name: ReadStreamBlockNumberCB. +scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL, I've been on the fence about that flag for sequential scan... Some days I want to consider changing to READ_STREAM_DEFAULT and relying on our "anti-heuristics" to suppress advice, which would work out the same in most cases but might occasionally win big. It might also hurt, I dunno, so I suspect we'd have to make it better first, which my patch in [1] was a first swing at, but I haven't researched that enough. So, kept this way! - * Read the next block of the scan relation into a buffer and pin that buffer - * before saving it in the scan descriptor. + * Read the next block of the scan relation from the read stream and pin that + * buffer before saving it in the scan descriptor. Changed to: * Read the next block of the scan relation from the read stream and save it * in the scan descriptor. It is already pinned. +static BlockNumber +heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data, +void *per_buffer_data) Changed argument names to match the function pointer type definition, "stream" and "callback_private_data". 
BTW looking at the branching in read-stream user patches that have an initialisation step like yours, I wonder if it might ever make sense to be able to change the callback on the fly from inside the callback, so that you finish up with a branchless one doing most of the work. I have no idea if it's worth it... [1] https://www.postgresql.org/message-id/CA%2BhUKGLLFvou5rx5FDhm-Pc9r4STQTFFmrx6SUV%2Bvk8fwMbreA%40mail.gmail.com
Re: Streaming read-ready sequential scan code
On Sat, Apr 6, 2024 at 9:25 PM Thomas Munro wrote: > > I found a bug in read_stream.c that could be hit with Melanie's > streaming seq scan patch with parallelism enabled and certain buffer > pool conditions. Short version: there is an edge case where an "if" > needed to be a "while", or we could lose a few blocks. Here's the fix > for that, longer explanation in commit message. Attached v13 0001 is your fix and 0002 is a new version of the sequential scan streaming read user. Off-list Andres mentioned that I really ought to separate the parallel and serial sequential scan users into two different callbacks. I've done that in the attached. It actually makes the code used by the callbacks nicer and more readable anyway (putting aside performance). I was able to measure a small performance difference as well. I've also added a few comments and improved existing comments. - Melanie From eded321df22bf472f147bd8f94b596d465355c70 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 5 Apr 2024 13:32:14 +1300 Subject: [PATCH v13 2/2] Use streaming IO in heapam sequential and TID range scans Instead of calling ReadBuffer() for each block, heap sequential scans and TID range scans now use the streaming read API introduced in b5a9b18cd0. Author: Melanie Plageman Reviewed-by: Andres Freund Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com --- src/backend/access/heap/heapam.c | 234 +-- src/include/access/heapam.h | 15 ++ 2 files changed, 176 insertions(+), 73 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 01bb2f4cc16..9d10d42b69b 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -223,6 +223,66 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = * */ +/* + * Streaming read API callback for parallel sequential scans. Returns the next + * block the caller wants from the read stream or InvalidBlockNumber when done. 
+ */ +static BlockNumber +heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + Assert(ScanDirectionIsForward(scan->rs_dir)); + Assert(scan->rs_base.rs_parallel); + + if (unlikely(!scan->rs_inited)) + { + /* parallel scan */ + table_block_parallelscan_startblock_init(scan->rs_base.rs_rd, + scan->rs_parallelworkerdata, + (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel); + + /* may return InvalidBlockNumber if there are no more blocks */ + scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd, + scan->rs_parallelworkerdata, + (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel); + scan->rs_inited = true; + } + else + { + scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd, + scan->rs_parallelworkerdata, (ParallelBlockTableScanDesc) + scan->rs_base.rs_parallel); + } + + return scan->rs_prefetch_block; +} + +/* + * Streaming read API callback for serial sequential and TID range scans. + * Returns the next block the caller wants from the read stream or + * InvalidBlockNumber when done. + */ +static BlockNumber +heap_scan_stream_read_next_serial(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + if (unlikely(!scan->rs_inited)) + { + scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir); + scan->rs_inited = true; + } + else + scan->rs_prefetch_block = heapgettup_advance_block(scan, + scan->rs_prefetch_block, + scan->rs_dir); + + return scan->rs_prefetch_block; +} + /* * initscan - scan code common to heap_beginscan and heap_rescan * @@ -325,6 +385,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock) scan->rs_cbuf = InvalidBuffer; scan->rs_cblock = InvalidBlockNumber; + /* + * Initialize to ForwardScanDirection because it is most common and + * because heap scans go forward before going backward (e.g. 
CURSORs). + */ + scan->rs_dir = ForwardScanDirection; + scan->rs_prefetch_block = InvalidBlockNumber; + /* page-at-a-time fields are always invalid when not rs_inited */ /* @@ -462,12 +529,14 @@ heap_prepare_pagescan(TableScanDesc sscan) /* * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM. * - * Read the next block of the scan relation into a buffer and pin that buffer - * before saving it in the scan descriptor. + * Read the next block of the scan relation from the read stream and pin that + * buffer before saving it in the scan descriptor. */ static inline void heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir) { + Assert(scan->rs_read_
Re: Streaming read-ready sequential scan code
I found a bug in read_stream.c that could be hit with Melanie's streaming seq scan patch with parallelism enabled and certain buffer pool conditions. Short version: there is an edge case where an "if" needed to be a "while", or we could lose a few blocks. Here's the fix for that, longer explanation in commit message. From 43cef2d58141ba048e9349b0027afff148be5553 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 7 Apr 2024 12:36:44 +1200 Subject: [PATCH] Fix bug in read_stream.c. When we determine that a wanted block can't be combined with the current pending read, it's time to start that pending read to get it out of the way. An "if" in that code path should have been a "while", because it might take more than one go to get that job done. Otherwise the remaining part of a partially started read could be clobbered and we could lose some blocks. This was only broken for smaller ranges, as the more common case of io_combine_limit-sized ranges is handled earlier in the code and knows how to loop. Discovered while testing parallel sequential scans of partially cached tables. They have a ramp-down phase with ever smaller ranges of contiguous blocks, to be fair to parallel workers as the work runs out. Defect in commit b5a9b18c. --- src/backend/storage/aio/read_stream.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 9a70a81f7ae..f54dacdd914 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -363,7 +363,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice) } /* We have to start the pending read before we can build another. */ - if (stream->pending_read_nblocks > 0) + while (stream->pending_read_nblocks > 0) { read_stream_start_pending_read(stream, suppress_advice); suppress_advice = false; -- 2.44.0
Re: Streaming read-ready sequential scan code
On Fri, Apr 5, 2024 at 7:28 PM Thomas Munro wrote: > > On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman > wrote: > > I haven't looked into or reviewed the memory prefetching part. > > > > While reviewing 0002, I realized that I don't quite see how > > read_stream_get_block() will be used in the fastpath -- which it > > claims in its comments. > > Comments improved. Ah, makes sense now. > The occasional CI problem I mentioned turned out to be > read_stream_reset() remembering a little too much state across > rescans. Fixed. Nice investigative work figuring this out. > Thanks both for the feedback on the ring buffer tweaks. Comments > updated. Here is the full stack of patches I would like to commit > very soon, though I may leave the memory prefetching part out for a > bit longer to see if I can find any downside. 0001 LGTM. I did not review 0002 or 0003. 0004 looks good except for one comment typo: /* * Tell call not to pin more than half the buffers in the ring. * This is a trade-off between look ahead distance and deferring * writeback and associated WAL traffic. */ call -> caller 0006, I noticed the commit message is missing an important comma: Instead of calling ReadBuffer() for each block heap sequential scans and TID range scans now use the streaming read API introduced in b5a9b18cd0. should be Instead of calling ReadBuffer() for each block, heap sequential scans and TID range scans now use the streaming read API introduced in b5a9b18cd0. - Melanie
Re: Streaming read-ready sequential scan code
On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman wrote: > On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro wrote: > > The interesting column is hot. The 200ms->211ms regression is due to > > the extra bookkeeping in the slow path. The rejiggered fastpath code > > fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can > > you reproduce that? > > I am able to reproduce the fast path solving the issue using Heikki's > example here [1] but in shared buffers (hot). > > master: 25 ms > stream read: 29 ms > stream read + fast path: 25 ms Great, thanks. > I haven't looked into or reviewed the memory prefetching part. > > While reviewing 0002, I realized that I don't quite see how > read_stream_get_block() will be used in the fastpath -- which it > claims in its comments. Comments improved. > Oh and why does READ_STREAM_DISABLE_FAST_PATH macro exist? Just for testing purposes. Behaviour should be identical to external observers either way, it's just a hand-rolled specialisation for certain parameters, and it's useful to be able to verify that and measure the speedup. I think it's OK to leave a bit of developer/testing scaffolding in the tree if it's likely to be useful again and especially if like this case it doesn't create any dead code. (Perhaps in later work we might find the right way to get the compiler to do the specialisation? It's so simple though.) The occasional CI problem I mentioned turned out to be read_stream_reset() remembering a little too much state across rescans. Fixed. Thanks both for the feedback on the ring buffer tweaks. Comments updated. Here is the full stack of patches I would like to commit very soon, though I may leave the memory prefetching part out for a bit longer to see if I can find any downside. From 54efd755040507b55672092907d53b4db30f5a06 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 5 Apr 2024 12:08:24 +1300 Subject: [PATCH v12 1/6] Improve read_stream.c's fast path. 
Unfortunately the "fast path" for cached scans that don't do any I/O was accidentally coded in a way that could only be triggered by pg_prewarm's usage pattern. We really need it to work for the streaming sequential scan patch. Refactor. Reviewed-by: Melanie Plageman Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com --- src/backend/storage/aio/read_stream.c | 75 +++ 1 file changed, 31 insertions(+), 44 deletions(-) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 4f21262ff5e..b9e11a28312 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -169,7 +169,7 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index) /* * Ask the callback which block it would like us to read next, with a small * buffer in front to allow read_stream_unget_block() to work and to allow the - * fast path to work in batches. + * fast path to skip this function and work directly from the array. */ static inline BlockNumber read_stream_get_block(ReadStream *stream, void *per_buffer_data) @@ -578,13 +578,12 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data) if (likely(stream->fast_path)) { BlockNumber next_blocknum; - bool need_wait; /* Fast path assumptions. */ Assert(stream->ios_in_progress == 0); Assert(stream->pinned_buffers == 1); Assert(stream->distance == 1); - Assert(stream->pending_read_nblocks == 1); + Assert(stream->pending_read_nblocks == 0); Assert(stream->per_buffer_data_size == 0); /* We're going to return the buffer we pinned last time. */ @@ -594,40 +593,29 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data) buffer = stream->buffers[oldest_buffer_index]; Assert(buffer != InvalidBuffer); - /* - * Pin a buffer for the next call. Same buffer entry, and arbitrary - * I/O entry (they're all free). 
- */ - need_wait = StartReadBuffer(&stream->ios[0].op, - &stream->buffers[oldest_buffer_index], - stream->pending_read_blocknum, - stream->advice_enabled ? - READ_BUFFERS_ISSUE_ADVICE : 0); - - /* Choose the block the next call will pin. */ + /* Choose the next block to pin. */ if (unlikely(stream->blocknums_next == stream->blocknums_count)) read_stream_fill_blocknums(stream); next_blocknum = stream->blocknums[stream->blocknums_next++]; - /* - * Fast return if the next call doesn't require I/O for the buffer we - * just pinned, and we have a block number to give it as a pending - * read. - */ - if (likely(!need_wait && next_blocknum != InvalidBlockNumber)) + if (likely(next_blocknum != InvalidBlockNumber)) { - stream->pending_read_blocknum = next_blocknum; - return buffer; - } - - /* - * For anything more complex, set up some more state and take the slow - * path next time. - */ - stream->fast_path = false; +
Re: Streaming read-ready sequential scan code
On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro wrote: > > Yeah, I plead benchmarking myopia, sorry. The fastpath as committed > is only reached when distance goes 2->1, as pg_prewarm does. Oops. > With the attached minor rearrangement, it works fine. I also poked > some more at that memory prefetcher. Here are the numbers I got on a > desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled, > cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NVMe, > GCC -O3). > > create table t (i int, filler text) with (fillfactor=10); > insert into t > select g, repeat('x', 900) from generate_series(1, 56) g; > vacuum freeze t; > set max_parallel_workers_per_gather = 0; > > select count(*) from t; > > cold = must be read from actual disk (Linux drop_caches) > warm = read from linux page cache > hot = already in pg cache via pg_prewarm > > cold warmhot > master2479ms 886ms 200ms > seqscan 2498ms 716ms 211ms <-- regression > seqscan + fastpath2493ms 711ms 200ms <-- fixed, I think? > seqscan + memprefetch 2499ms 716ms 182ms > seqscan + fastpath + memprefetch 2505ms 710ms 170ms <-- \O/ > > Cold has no difference. That's just my disk demonstrating Linux RA at > 128kB (default); random I/O is obviously a more interesting story. > It's consistently a smidgen faster with Linux RA set to 2MB (as in > blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect > probably also increases on fancier faster storage than what I have on > hand: > > cold > master1775ms > seqscan + fastpath + memprefetch 1700ms > > Warm is faster as expected (fewer system calls schlepping data > kernel->userspace). > > The interesting column is hot. The 200ms->211ms regression is due to > the extra bookkeeping in the slow path. The rejiggered fastpath code > fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can > you reproduce that? I am able to reproduce the fast path solving the issue using Heikki's example here [1] but in shared buffers (hot). 
master: 25 ms stream read: 29 ms stream read + fast path: 25 ms I haven't looked into or reviewed the memory prefetching part. While reviewing 0002, I realized that I don't quite see how read_stream_get_block() will be used in the fastpath -- which it claims in its comments. read_stream_next_buffer() is the only caller of read_stream_look_ahead()->read_stream_get_block(), and if fast_path is true, read_stream_next_buffer() always returns before calling read_stream_look_ahead(). Maybe I am missing something. I see fast_path uses read_stream_fill_blocknums() to invoke the callback. Oh and why does READ_STREAM_DISABLE_FAST_PATH macro exist? Otherwise 0002 looks good to me. I haven't reviewed 0003 or 0004. I attached a new version (v11) because I noticed an outdated comment in my seq scan streaming read user patch (0001). The other patches in the set are untouched from your versions besides adding author/reviewer info in commit message for 0002. - Melanie [1] https://www.postgresql.org/message-id/3b0f3701-addd-4629-9257-cf28e1a6e6a1%40iki.fi From acbd7172f10857d49d4b5d4afc4efab704b33486 Mon Sep 17 00:00:00 2001 From: David Rowley Date: Mon, 10 Jul 2023 11:22:34 +0200 Subject: [PATCH v11 3/4] Add pg_prefetch_mem() macro to load cache lines. Initially mapping to GCC, Clang and MSVC builtins. Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com --- config/c-compiler.m4 | 17 configure | 40 ++ configure.ac | 3 +++ meson.build| 1 + src/include/c.h| 8 src/include/pg_config.h.in | 3 +++ 6 files changed, 72 insertions(+) diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb0..4cc02f97601 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1, [Define to 1 if your compiler understands $1.]) fi])# PGAC_CHECK_BUILTIN_FUNC +# PGAC_CHECK_BUILTIN_VOID_FUNC +# --- +# Variant for void functions. 
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC], +[AC_CACHE_CHECK(for $1, pgac_cv$1, +[AC_LINK_IFELSE([AC_LANG_PROGRAM([ +void +call$1($2) +{ +$1(x); +}], [])], +[pgac_cv$1=yes], +[pgac_cv$1=no])]) +if test x"${pgac_cv$1}" = xyes ; then +AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1, + [Define to 1 if your compiler understands $1.]) +fi])# PGAC_CHECK_BUILTIN_VOID_FUNC # PGAC_CHECK_BUILTIN_FUNC_PTR diff --git a/configure b/configure index 36feeafbb23..79b78c33ddc 100755 --- a/configure +++ b/configure @@ -15543,6 +15543,46 @@ _ACEOF fi +# Can we use a built-in to prefetch memory? +{ $as_echo "$as_me:${a
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 12:39 PM Andres Freund wrote: > > On 2024-04-04 22:37:39 +1300, Thomas Munro wrote: > > On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro wrote: > > > Alright what about this? > > I think it's probably worth adding a bit more of the commit message to the > function comment. Yes, there's a bit in one of the return branches, but that's > not what you're going to look at when just checking what the function does. Agreed about the comment. I kept thinking that BAS_BULKREAD should maybe return nbuffers - 1, but I couldn't convince myself why. Otherwise v2-0001-Allow-BufferAccessStrategy-to-limit-pin-count LGTM. - Melanie
Re: Streaming read-ready sequential scan code
Yeah, I plead benchmarking myopia, sorry. The fastpath as committed is only reached when distance goes 2->1, as pg_prewarm does. Oops. With the attached minor rearrangement, it works fine. I also poked some more at that memory prefetcher. Here are the numbers I got on a desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled, cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NVMe, GCC -O3).

create table t (i int, filler text) with (fillfactor=10);
insert into t select g, repeat('x', 900) from generate_series(1, 56) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;

select count(*) from t;

cold = must be read from actual disk (Linux drop_caches)
warm = read from linux page cache
hot = already in pg cache via pg_prewarm

                                  cold    warm   hot
master                            2479ms  886ms  200ms
seqscan                           2498ms  716ms  211ms  <-- regression
seqscan + fastpath                2493ms  711ms  200ms  <-- fixed, I think?
seqscan + memprefetch             2499ms  716ms  182ms
seqscan + fastpath + memprefetch  2505ms  710ms  170ms  <-- \O/

Cold has no difference. That's just my disk demonstrating Linux RA at 128kB (default); random I/O is obviously a more interesting story. It's consistently a smidgen faster with Linux RA set to 2MB (as in blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect probably also increases on fancier faster storage than what I have on hand:

                                  cold
master                            1775ms
seqscan + fastpath + memprefetch  1700ms

Warm is faster as expected (fewer system calls schlepping data kernel->userspace). The interesting column is hot. The 200ms->211ms regression is due to the extra bookkeeping in the slow path. The rejiggered fastpath code fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can you reproduce that? The memory prefetching trick, on top of that, seems to be a good optimisation so far. 
Note that that's not an entirely independent trick, it's something we can only do now that we can see into the future; it's the next level up of prefetching, worth doing around 60ns before you need the data I guess. Who knows how thrashed the cache might be before the caller gets around to accessing that page, but there doesn't seem to be much of a cost or downside to this bet. We know there are many more opportunities like that[1] but I don't want to second-guess the AM here, I'm just betting that the caller is going to look at the header. Unfortunately there seems to be a subtle bug hiding somewhere in here, visible on macOS on CI. Looking into that, going to find my Mac... [1] https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com From 74b8cde45a8babcec7b52b06bdb8ea046a0a966f Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 5 Apr 2024 13:32:14 +1300 Subject: [PATCH v10 1/4] Use streaming I/O in heapam sequential scan. 
--- src/backend/access/heap/heapam.c | 100 +-- src/include/access/heapam.h | 15 + 2 files changed, 97 insertions(+), 18 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index dada2ecd1e3..f7946a39fd9 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -223,6 +223,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = * */ +static BlockNumber +heap_scan_stream_read_next(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + if (unlikely(!scan->rs_inited)) + { + scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir); + scan->rs_inited = true; + } + else + scan->rs_prefetch_block = heapgettup_advance_block(scan, + scan->rs_prefetch_block, + scan->rs_dir); + + return scan->rs_prefetch_block; +} + /* * initscan - scan code common to heap_beginscan and heap_rescan * @@ -325,6 +344,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock) scan->rs_cbuf = InvalidBuffer; scan->rs_cblock = InvalidBlockNumber; + /* + * Initialize to ForwardScanDirection because it is most common and heap + * scans usually must go forwards before going backward. + */ + scan->rs_dir = ForwardScanDirection; + scan->rs_prefetch_block = InvalidBlockNumber; + /* page-at-a-time fields are always invalid when not rs_inited */ /* @@ -462,12 +488,14 @@ heap_prepare_pagescan(TableScanDesc sscan) /* * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM. * - * Read the next block of the scan relation into a buffer and pin that buffer - * before saving it in the scan descriptor. + * Read the next block of the scan relation from the read stream and pin that + * buff
Re: Streaming read-ready sequential scan code
On Fri, Apr 5, 2024 at 4:20 AM Melanie Plageman wrote: > So, sequential scan does not have per-buffer data. I did some logging > and the reason most fully-in-SB sequential scans don't use the fast > path is because read_stream->pending_read_nblocks is always 0. Hnghghghgh... right, sorry I guessed the wrong reason, it turns out that I made a fast path just a little too specialised for pg_prewarm. Working on it...
Re: Streaming read-ready sequential scan code
Hi, On 2024-04-04 22:37:39 +1300, Thomas Munro wrote: > On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro wrote: > > Alright what about this? I think it's probably worth adding a bit more of the commit message to the function comment. Yes, there's a bit in one of the return branches, but that's not what you're going to look at when just checking what the function does. > From e610bc78a2e3ecee50bd897e35084469d00bbac5 Mon Sep 17 00:00:00 2001 > From: Thomas Munro > Date: Thu, 4 Apr 2024 21:11:06 +1300 > Subject: [PATCH v2 2/2] Increase default vacuum_buffer_usage_limit to 2MB. > > The BAS_VACUUM ring size has been 256kB since commit d526575f. Commit > 1cbbee03 made it configurable but retained the traditional default. > The correct default size has been debated for years, but 256kB is > certainly very small. VACUUM soon needs to write back data it dirtied > only 32 blocks ago, which usually requires flushing the WAL. New > experiments in prefetching pages for VACUUM exacerbated the problem by > crashing into dirty data even sooner. Let's make the default 2MB. > That's 1.5% of the default toy buffer pool size, and 0.2% of 1GB, which > would be considered a small shared_buffers setting for a real system > these days. Users are still free to set the GUC to a different value. +1. Independent of any other changes, this improves the default vacuum performance substantially. We might want to dynamically size the default at some point - but we probably should overhaul the infrastructure first... Greetings, Andres Freund
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 7:45 AM Thomas Munro wrote: > > On Thu, Apr 4, 2024 at 8:02 PM David Rowley wrote: > > 3a4a3537a > > latency average = 34.497 ms > > latency average = 34.538 ms > > > > 3a4a3537a + read_stream_for_seqscans.patch > > latency average = 40.923 ms > > latency average = 41.415 ms > > > > i.e. no meaningful change from the refactor, but a regression from a > > cached workload that changes the page often without doing much work in > > between with the read stream patch. > > I ran Heikki's test except I ran the "insert" 4 times to get a table > of 4376MB according to \d+. On my random cloud ARM server (SB=8GB, > huge pages, parallelism disabled), I see a speedup 1290ms -> 1046ms > when the data is in Linux cache and PG is not prewarmed, roughly as he > reported. Good. > > If I pg_prewarm first, I see that slowdown 260ms -> 294ms. Trying > things out to see what works, I got that down to 243ms (ie beat > master) by inserting a memory prefetch: > > --- a/src/backend/storage/aio/read_stream.c > +++ b/src/backend/storage/aio/read_stream.c > @@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void > **per_buffer_data) > /* Prepare for the next call. */ > read_stream_look_ahead(stream, false); > > + > __builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index])); > > Maybe that's a solution to a different problem that just happens to > more than make up the difference in this case, and it may be > questionable whether that cache line will survive long enough to help > you, but this one-tuple-per-page test likes it... Hmm, to get a more > realistic table than the one-tuple-per-page one, I tried doubling a > tenk1 table until it reached 2759MB and then I got a post-prewarm > regression 702ms -> 721ms, and again I can beat master by memory > prefetching: 689ms. > > Annoyingly, with the same table I see no difference between the actual > pg_prewarm('giga') time: around 155ms for both. 
> pg_prewarm is able to > use the 'fast path' I made pretty much just to be able to minimise > regression in that (probably stupid) all-cached test that doesn't even > look at the page contents. Unfortunately seq scan can't use it > because it has per-buffer data, which is one of the things it can't do > (because of a space management issue). Maybe I should try to find a > way to fix that. So, sequential scan does not have per-buffer data. I did some logging and the reason most fully-in-SB sequential scans don't use the fast path is because read_stream->pending_read_nblocks is always 0. When read_stream->distance stays 1 (expected for all-in-SB as it is initialized to 1 and we don't want distance to ramp up), read_stream_look_ahead() never increments read_stream->pending_read_nblocks because it sets it to 1 the first time it is called and then the conditions of the while loop are not met again: while (stream->ios_in_progress < stream->max_ios && stream->pinned_buffers + stream->pending_read_nblocks < stream->distance) distance is 1, pending_read_nblocks is 1, thus we only loop once and don't increment pending_read_nblocks. prewarm is only able to use the fast path because it passes READ_STREAM_FULL and thus initializes read_stream->distance to a higher initial value. I added some logging to see if any of the sequential scans in the regression suite used the fast path. The one example I see of the fast path being used is a temp table IO stats test in src/test/regress/sql/stats.sql. I didn't check exactly what conditions led it to do this. But we probably want seq scans which are all in SB to use the fast path. - Melanie
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 3:02 AM David Rowley wrote: > > On Thu, 4 Apr 2024 at 16:45, David Rowley wrote: > > I've pushed the v9-0001 with that rename done. > > I've now just pushed the 0002 patch with some revisions: Thanks! > 1. The function declarations you added for heapgettup_advance_block() > and heapgettup_initial_block() didn't match the properties of their > definitions. You'd declared both of these static inline but neither > of these were. Ah, they needed to be defined static but I intentionally left off the inline in the definition and only put it in the forward declaration because I thought that was correct. Anyway, I'm fine with how you did it in the end. > 2. I felt inclined to rename heapfetchbuf() to heapfetchnextbuf() as > that's effectively what it does with v8-0002, however, that's just too > many words all shoved together, so I renamed it to > heap_fetch_next_buffer(). Sounds good. > 3. I changed heapgettup_initial_block() to pg_noinline as it both > makes more sense to have this out of line and it also fixed a small > performance regression. Ah, I guess it is in the unlikely path. I often forget that noinline exists. It's interesting that that made a noticeable difference since it is a pretty short function. Thanks for taking such a close look! > Looks like I also failed to grep for all the remaining instances of > "heapgetpage" in 44086b097. Those are now fixed by 3a4a3537a. > > I also rebased the 0003 patch which I've attached as a raw patch. Thanks! > FWIW, using Heikki's test in [1] with a pg_prewarm each time after > restarting the instance. No parallel aggregate. 
> > drowley@amd3990x:~$ cat bench.sql > select count(*) from giga; > > drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres | grep latency > > 44086b097~1 > latency average = 34.323 ms > latency average = 34.332 ms > > 44086b097 > latency average = 34.811 ms > latency average = 34.862 ms > > 3a4a3537a > latency average = 34.497 ms > latency average = 34.538 ms > > 3a4a3537a + read_stream_for_seqscans.patch > latency average = 40.923 ms > latency average = 41.415 ms > > i.e. no meaningful change from the refactor, but a regression from a > cached workload that changes the page often without doing much work in > between with the read stream patch. Cool. Thanks for testing this out. Sounds like Thomas did some analysis of how to resolve this for the streaming read user, and I will do some too. - Melanie
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 8:02 PM David Rowley wrote: > 3a4a3537a > latency average = 34.497 ms > latency average = 34.538 ms > > 3a4a3537a + read_stream_for_seqscans.patch > latency average = 40.923 ms > latency average = 41.415 ms > > i.e. no meaningful change from the refactor, but a regression from a > cached workload that changes the page often without doing much work in > between with the read stream patch. I ran Heikki's test except I ran the "insert" 4 times to get a table of 4376MB according to \d+. On my random cloud ARM server (SB=8GB, huge pages, parallelism disabled), I see a speedup 1290ms -> 1046ms when the data is in Linux cache and PG is not prewarmed, roughly as he reported. Good. If I pg_prewarm first, I see that slowdown 260ms -> 294ms. Trying things out to see what works, I got that down to 243ms (ie beat master) by inserting a memory prefetch:

--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	/* Prepare for the next call. */
 	read_stream_look_ahead(stream, false);
+
+	__builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));

Maybe that's a solution to a different problem that just happens to more than make up the difference in this case, and it may be questionable whether that cache line will survive long enough to help you, but this one-tuple-per-page test likes it... Hmm, to get a more realistic table than the one-tuple-per-page one, I tried doubling a tenk1 table until it reached 2759MB and then I got a post-prewarm regression 702ms -> 721ms, and again I can beat master by memory prefetching: 689ms. Annoyingly, with the same table I see no difference between the actual pg_prewarm('giga') time: around 155ms for both. pg_prewarm is able to use the 'fast path' I made pretty much just to be able to minimise regression in that (probably stupid) all-cached test that doesn't even look at the page contents. 
Unfortunately seq scan can't use it because it has per-buffer data, which is one of the things it can't do (because of a space management issue). Maybe I should try to find a way to fix that. > I'm happy to run further benchmarks, but for the remainder of the > committer responsibility for the next patches, I'm going to leave that > to Thomas. Thanks!
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro wrote: > Alright what about this? Forgot to git add a change, new version. From 6dea2983abf8d608c34e02351d70694de99f25f2 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 4 Apr 2024 20:31:26 +1300 Subject: [PATCH v2 1/2] Allow BufferAccessStrategy to limit pin count. When pinning extra buffers to look ahead, users of a strategy are in danger of pinning a lot of the buffers in the ring, or even more than the ring size. For some strategies, that means "escaping" from the ring, and in others it means forcing dirty data to disk very frequently with associated WAL flushing. Since external code has no insight into any of that, allow individual strategy types to expose a clamp that should be applied when deciding how many buffers to pin at once. Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com --- src/backend/storage/aio/read_stream.c | 5 src/backend/storage/buffer/freelist.c | 35 +++ src/include/storage/bufmgr.h | 1 + 3 files changed, 41 insertions(+) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 4f21262ff5..eab87f6f73 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -419,6 +419,7 @@ read_stream_begin_relation(int flags, size_t size; int16 queue_size; int16 max_ios; + int strategy_pin_limit; uint32 max_pinned_buffers; Oid tablespace_id; SMgrRelation smgr; @@ -460,6 +461,10 @@ read_stream_begin_relation(int flags, max_pinned_buffers = Min(max_pinned_buffers, PG_INT16_MAX - io_combine_limit - 1); + /* Give the strategy a chance to limit the number of buffers we pin. */ + strategy_pin_limit = GetAccessStrategyPinLimit(strategy); + max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers); + /* Don't allow this backend to pin more than its share of buffers. 
*/ if (SmgrIsTemp(smgr)) LimitAdditionalLocalPins(&max_pinned_buffers); diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 3611357fa3..c69590d6d8 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -629,6 +629,41 @@ GetAccessStrategyBufferCount(BufferAccessStrategy strategy) return strategy->nbuffers; } +/* + * GetAccessStrategyPinLimit -- get cap of number of buffers that can be pinned + * + * Strategies can specify the maximum number of buffers that a user should pin + * at once when performing look-ahead. Callers should combine this number with + * other relevant limits and take the minimum. + */ +int +GetAccessStrategyPinLimit(BufferAccessStrategy strategy) +{ + if (strategy == NULL) + return NBuffers; + + switch (strategy->btype) + { + case BAS_BULKREAD: + + /* +* Since BAS_BULKREAD uses StrategyRejectBuffer(), dirty buffers +* shouldn't be a problem and the caller is free to pin up to the +* entire ring at once. +*/ + return strategy->nbuffers; + + default: + + /* +* Tell call not to pin more than half the buffers in the ring. +* This is a trade-off between look ahead distance and deferring +* writeback and associated WAL traffic. 
+*/ + return strategy->nbuffers / 2; + } +} + /* * FreeAccessStrategy -- release a BufferAccessStrategy object * diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index f380f9d9a6..07ba1a6050 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -318,6 +318,7 @@ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype); extern BufferAccessStrategy GetAccessStrategyWithSize(BufferAccessStrategyType btype, int ring_size_kb); extern int GetAccessStrategyBufferCount(BufferAccessStrategy strategy); +extern int GetAccessStrategyPinLimit(BufferAccessStrategy strategy); extern void FreeAccessStrategy(BufferAccessStrategy strategy); -- 2.39.3 (Apple Git-146) From e610bc78a2e3ecee50bd897e35084469d00bbac5 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 4 Apr 2024 21:11:06 +1300 Subject: [PATCH v2 2/2] Increase default vacuum_buffer_usage_limit to 2MB. The BAS_VACUUM ring size has been 256kB since commit d526575f. Commit 1cbbee03 made it configurable but retained the traditional default.
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 11:13 AM Andres Freund wrote: > The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so, > should we apply that only if the strategy doesn't use the > StrategyRejectBuffer() logic? Hmm, I don't really know, but that sounds plausible. What do you think about the attached? > I think for VACUUM we should probably go a bit further. There's no comparable > L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more > expensive, compared to a seqscan. I'd go for at least 4x-8x. Alright, what about this? From 6dea2983abf8d608c34e02351d70694de99f25f2 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 4 Apr 2024 20:31:26 +1300 Subject: [PATCH 1/2] Allow BufferAccessStrategy to limit pin count. When pinning extra buffers to look ahead, users of a strategy are in danger of pinning a lot of the buffers in the ring, or even more than the ring size. For some strategies, that means "escaping" from the ring, and in others it means forcing dirty data to disk very frequently with associated WAL flushing. Since external code has no insight into any of that, allow individual strategy types to expose a clamp that should be applied when deciding how many buffers to pin at once. 
Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com --- src/backend/storage/aio/read_stream.c | 5 src/backend/storage/buffer/freelist.c | 35 +++ src/include/storage/bufmgr.h | 1 + 3 files changed, 41 insertions(+) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 4f21262ff5..eab87f6f73 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -419,6 +419,7 @@ read_stream_begin_relation(int flags, size_t size; int16 queue_size; int16 max_ios; + int strategy_pin_limit; uint32 max_pinned_buffers; Oid tablespace_id; SMgrRelation smgr; @@ -460,6 +461,10 @@ read_stream_begin_relation(int flags, max_pinned_buffers = Min(max_pinned_buffers, PG_INT16_MAX - io_combine_limit - 1); + /* Give the strategy a chance to limit the number of buffers we pin. */ + strategy_pin_limit = GetAccessStrategyPinLimit(strategy); + max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers); + /* Don't allow this backend to pin more than its share of buffers. */ if (SmgrIsTemp(smgr)) LimitAdditionalLocalPins(&max_pinned_buffers); diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 3611357fa3..c69590d6d8 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -629,6 +629,41 @@ GetAccessStrategyBufferCount(BufferAccessStrategy strategy) return strategy->nbuffers; } +/* + * GetAccessStrategyPinLimit -- get cap of number of buffers that can be pinned + * + * Strategies can specify the maximum number of buffers that a user should pin + * at once when performing look-ahead. Callers should combine this number with + * other relevant limits and take the minimum. 
+ */ +int +GetAccessStrategyPinLimit(BufferAccessStrategy strategy) +{ + if (strategy == NULL) + return NBuffers; + + switch (strategy->btype) + { + case BAS_BULKREAD: + + /* +* Since BAS_BULKREAD uses StrategyRejectBuffer(), dirty buffers +* shouldn't be a problem and the caller is free to pin up to the +* entire ring at once. +*/ + return strategy->nbuffers; + + default: + + /* +* Tell call not to pin more than half the buffers in the ring. +* This is a trade-off between look ahead distance and deferring +* writeback and associated WAL traffic. +*/ + return strategy->nbuffers / 2; + } +} + /* * FreeAccessStrategy -- release a BufferAccessStrategy object * diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index f380f9d9a6..07ba1a6050 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -318,6 +318,7 @@ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype); extern BufferAccessStrategy GetAccessStrategyWithSize(BufferAccessStrategyType btype, int ring_size_kb); extern int GetAccessStrategyBufferCount(BufferAccessStrategy strategy); +extern int GetAccessStrategyPinLimit(BufferAccessStrategy
Re: Streaming read-ready sequential scan code
On Thu, 4 Apr 2024 at 16:45, David Rowley wrote: > I've pushed the v9-0001 with that rename done. I've now just pushed the 0002 patch with some revisions: 1. The function declarations you added for heapgettup_advance_block() and heapgettup_initial_block() didn't match the properties of their definitions. You'd declared both of these static inline but neither of the definitions was. 2. I felt inclined to rename heapfetchbuf() to heapfetchnextbuf() as that's effectively what it does with v8-0002; however, that's just too many words all shoved together, so I renamed it to heap_fetch_next_buffer(). 3. I changed heapgettup_initial_block() to pg_noinline as it both makes more sense to have this out of line and it also fixed a small performance regression. Looks like I also failed to grep for all the remaining instances of "heapgetpage" in 44086b097. Those are now fixed by 3a4a3537a. I also rebased the 0003 patch, which I've attached as a raw patch. FWIW, using Heikki's test in [1] with a pg_prewarm each time after restarting the instance. No parallel aggregate.

drowley@amd3990x:~$ cat bench.sql
select count(*) from giga;

drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres | grep latency

44086b097~1
latency average = 34.323 ms
latency average = 34.332 ms

44086b097
latency average = 34.811 ms
latency average = 34.862 ms

3a4a3537a
latency average = 34.497 ms
latency average = 34.538 ms

3a4a3537a + read_stream_for_seqscans.patch
latency average = 40.923 ms
latency average = 41.415 ms

i.e. no meaningful change from the refactor, but a regression from a cached workload that changes the page often without doing much work in between with the read stream patch. I'm happy to run further benchmarks, but for the remainder of the committer responsibility for the next patches, I'm going to leave that to Thomas. 
David [1] https://www.postgresql.org/message-id/3b0f3701-addd-4629-9257-cf28e1a6e...@iki.fi diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index dada2ecd1e..f7946a39fd 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -223,6 +223,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = * */ +static BlockNumber +heap_scan_stream_read_next(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + if (unlikely(!scan->rs_inited)) + { + scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir); + scan->rs_inited = true; + } + else + scan->rs_prefetch_block = heapgettup_advance_block(scan, + scan->rs_prefetch_block, + scan->rs_dir); + + return scan->rs_prefetch_block; +} + /* * initscan - scan code common to heap_beginscan and heap_rescan * @@ -325,6 +344,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock) scan->rs_cbuf = InvalidBuffer; scan->rs_cblock = InvalidBlockNumber; + /* +* Initialize to ForwardScanDirection because it is most common and heap +* scans usually must go forwards before going backward. +*/ + scan->rs_dir = ForwardScanDirection; + scan->rs_prefetch_block = InvalidBlockNumber; + /* page-at-a-time fields are always invalid when not rs_inited */ /* @@ -462,12 +488,14 @@ heap_prepare_pagescan(TableScanDesc sscan) /* * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM. * - * Read the next block of the scan relation into a buffer and pin that buffer - * before saving it in the scan descriptor. + * Read the next block of the scan relation from the read stream and pin that + * buffer before saving it in the scan descriptor. 
*/ static inline void heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir) { + Assert(scan->rs_read_stream); + /* release previous scan buffer, if any */ if (BufferIsValid(scan->rs_cbuf)) { @@ -482,25 +510,23 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir) */ CHECK_FOR_INTERRUPTS(); - if (unlikely(!scan->rs_inited)) + /* +* If the scan direction is changing, reset the prefetch block to the +* current block. Otherwise, we will incorrectly prefetch the blocks +* between the prefetch block and the current block again before +* prefetching blocks in the new, correct scan direction. +*/ + if (unlikely(scan->rs_dir != dir)) { - scan->rs_cblock = heapgettup_initial_block(scan, dir
Re: Streaming read-ready sequential scan code
On Thu, 4 Apr 2024 at 14:38, Melanie Plageman wrote: > Looking at it in the code, I am wondering if we should call > heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear > that it is prepping a page to be scanned. You choose whatever you > think is best. I ended up calling it heap_prepare_pagescan() as I started to think prep/prepare should come first. I don't think it's perfect as the intended meaning is heap_prepare_page_for_scanning_in_pagemode(), but that's obviously too long. I've pushed the v9-0001 with that rename done. David
Re: Streaming read-ready sequential scan code
On Wed, Apr 3, 2024 at 9:08 PM David Rowley wrote: > > On Thu, 4 Apr 2024 at 06:03, Melanie Plageman > wrote: > > Attached v8 is rebased over current master (which now has the > > streaming read API). > > I've looked at the v8-0001 patch. Thanks for taking a look! > I wasn't too keen on heapbuildvis() as a function name for an external > function. Since it also does pruning work, it seemed weird to make it > sound like it only did visibility work. Per our offline discussion > about names, I've changed it to what you suggested which is > heap_page_prep(). Looking at it in the code, I am wondering if we should call heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear that it is prepping a page to be scanned. You choose whatever you think is best. > Aside from that, there was an outdated comment: "In pageatatime mode, > heapgetpage() already did visibility checks," which was no longer true > as that's done in heapbuildvis() (now heap_page_prep()). > > I also did a round of comment adjustments as there were a few things I > didn't like, e.g: > > + * heapfetchbuf - subroutine for heapgettup() > > This is also used in heapgettup_pagemode(), so I thought it was a bad > idea for a function to list places it thinks it's being called. I > also thought it redundant to write "This routine" in the function head > comment. I think "this routine" is implied by the context. I ended up > with: > > /* > * heapfetchbuf - read and pin the given MAIN_FORKNUM block number. > * > * Read the specified block of the scan relation into a buffer and pin that > * buffer before saving it in the scan descriptor. > */ > > I'm happy with your changes to heapam_scan_sample_next_block(). I did > adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively > the same as the seqscan version, just with "seqscan" swapped for > "sample scan". That all is fine with me. 
> The only other thing I adjusted there was to use "blockno" in some > places where you were using hscan->rs_cblock. These all come after > the "hscan->rs_cblock = blockno;" line. My thoughts here are that this > is more likely to avoid reading the value from the struct again if the > compiler isn't happy that the two values are still equivalent for some > reason. Even if the compiler is happy today, it would only take a > code change to pass hscan to some external function for the compiler > to perhaps be uncertain if that function has made an adjustment to > rs_cblock and go with the safe option of pulling the value out the > struct again which is a little more expensive as it requires some > maths to figure out the offset. > > I've attached v9-0001 and a delta of just my changes from v8. All sounds good and LGTM - Melanie
Re: Streaming read-ready sequential scan code
On Thu, 4 Apr 2024 at 06:03, Melanie Plageman wrote: > Attached v8 is rebased over current master (which now has the > streaming read API). I've looked at the v8-0001 patch. I wasn't too keen on heapbuildvis() as a function name for an external function. Since it also does pruning work, it seemed weird to make it sound like it only did visibility work. Per our offline discussion about names, I've changed it to what you suggested which is heap_page_prep(). Aside from that, there was an outdated comment: "In pageatatime mode, heapgetpage() already did visibility checks," which was no longer true as that's done in heapbuildvis() (now heap_page_prep()). I also did a round of comment adjustments as there were a few things I didn't like, e.g: + * heapfetchbuf - subroutine for heapgettup() This is also used in heapgettup_pagemode(), so I thought it was a bad idea for a function to list places it thinks it's being called. I also thought it redundant to write "This routine" in the function head comment. I think "this routine" is implied by the context. I ended up with: /* * heapfetchbuf - read and pin the given MAIN_FORKNUM block number. * * Read the specified block of the scan relation into a buffer and pin that * buffer before saving it in the scan descriptor. */ I'm happy with your changes to heapam_scan_sample_next_block(). I did adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively the same as the seqscan version, just with "seqscan" swapped for "sample scan". The only other thing I adjusted there was to use "blockno" in some places where you were using hscan->rs_cblock. These all come after the "hscan->rs_cblock = blockno;" line. My thoughts here are that this is more likely to avoid reading the value from the struct again if the compiler isn't happy that the two values are still equivalent for some reason. 
Even if the compiler is happy today, it would only take a code change to pass hscan to some external function for the compiler to perhaps be uncertain if that function has made an adjustment to rs_cblock and go with the safe option of pulling the value out the struct again which is a little more expensive as it requires some maths to figure out the offset. I've attached v9-0001 and a delta of just my changes from v8. David From 90bfc63097c556d0921d8f9165731fb07ec04167 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v9] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for read streams. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former is now heapfetchbuf() and the latter is now heapbuildvis(). Future commits will push the logic for selecting the next block into heapfetchbuf() in cases when streaming reads are not supported (such as backwards sequential scans). Because this logic differs for sample scans and sequential scans, inline the code to read the block into a buffer for sample scans. This has the added benefit of allowing for a bit of refactoring in heapam_scan_sample_next_block(), including unpinning the previous buffer before invoking the callback to select the next block. 
--- src/backend/access/heap/heapam.c | 78 ++-- src/backend/access/heap/heapam_handler.c | 38 src/include/access/heapam.h | 2 +- 3 files changed, 74 insertions(+), 44 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index a9d5b109a5..6524fc44a5 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -360,17 +360,17 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk } /* - * heapgetpage - subroutine for heapgettup() + * heap_page_prep - Prepare the current scan page to be scanned in pagemode * - * This routine reads and pins the specified page of the relation. - * In page-at-a-time mode it performs additional work, namely determining - * which tuples on the page are visible. + * Preparation currently consists of 1. prune the scan's rs_cbuf page, and 2. + * fill the rs_vistuples array with the OffsetNumbers of visible tuples. */ void -heapgetpage(TableScanDesc sscan, BlockNumber block) +heap_page_prep(TableScanDesc sscan) { HeapScanDesc scan = (HeapScanDesc) sscan; - Buffer buffer; + Buffer buffer = scan->rs_cbuf; + BlockNumber block = scan->rs_
Re: Streaming read-ready sequential scan code
Hi, On 2024-04-04 09:27:35 +1300, Thomas Munro wrote: > Anecdotally by all reports I've seen so far, all-in-cache seems to be > consistently a touch faster than master if anything, for streaming seq > scan and streaming ANALYZE. That's great!, but it doesn't seem to be > due to intentional changes. No efficiency is coming from batching > anything. Perhaps it has to do with CPU pipelining effects: though > it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work > itself is cut up into stages in a kind of pipeline: > read_stream_next_buffer() chooses page n + 2, pins page n + 1 and > returns page n each time you call it, so maybe we get more CPU > parallelism due to spreading the data dependencies out? Another theory is that it's due to the plain ReadBuffer() path needing to do RelationGetSmgr(reln) on every call, whereas the streaming read path doesn't need to. > > On the topic of BAS_BULKREAD buffer access strategy, I think the least > > we could do is add an assert like this to read_stream_begin_relation() > > after calculating max_pinned_buffers. > > > > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers); > > > > Perhaps we should do more? I think with a ring size of 16 MB, large > > SELECTs are safe for now. But I know future developers will make > > changes and it would be good not to assume they will understand that > > pinning more buffers than the size of the ring effectively invalidates > > the ring. > > Yeah I think we should clamp max_pinned_buffers if we see a strategy. > What do you think about: > > if (strategy) > { > int strategy_buffers = GetAccessStrategyBufferCount(strategy); > max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers); > } > I just don't know where to get that '2'. The idea would be to > hopefully never actually be constrained by it in practice, except in > tiny/toy setups, so we can't go too wrong with our number '2' there. 
The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so, should we apply that only if the strategy doesn't use the StrategyRejectBuffer() logic? I think it's fine to add a handwavy justification for the 2, saying that we want to balance readahead distance and reducing WAL write frequency, and that at some point more sophisticated logic will be needed. > Then we should increase the default ring sizes for BAS_BULKREAD and > BAS_VACUUM to make the previous statement true. I'm not sure it's right tying them together. The concerns for both are fairly different. > The size of main memory and L2 cache have increased dramatically since those > strategies were invented. I think we should at least double them, and more > likely quadruple them. I realise you already made them configurable per > command in commit 1cbbee033, but I mean the hard coded default 256 in > freelist.c. It's not only to get more space for our prefetching plans, it's > also to give the system more chance of flushing WAL asynchronously/in some > other backend before you crash into dirty data; as you discovered, > prefetching accidentally makes that effect worse in a BAS_VACUUM strategy, > by taking away space that is effectively deferring WAL flushes, so I'd at > least like to double the size if we use "/ 2" above. I think for VACUUM we should probably go a bit further. There's no comparable L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more expensive, compared to a seqscan. I'd go for at least 4x-8x. Greetings, Andres Freund
Re: Streaming read-ready sequential scan code
On Wed, Apr 3, 2024 at 4:28 PM Thomas Munro wrote: > > On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman > wrote: > > On the topic of BAS_BULKREAD buffer access strategy, I think the least > > we could do is add an assert like this to read_stream_begin_relation() > > after calculating max_pinned_buffers. > > > > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers); > > > > Perhaps we should do more? I think with a ring size of 16 MB, large > > SELECTs are safe for now. But I know future developers will make > > changes and it would be good not to assume they will understand that > > pinning more buffers than the size of the ring effectively invalidates > > the ring. > > Yeah I think we should clamp max_pinned_buffers if we see a strategy. > What do you think about: > > if (strategy) > { > int strategy_buffers = GetAccessStrategyBufferCount(strategy); > max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers); > } > > I just don't know where to get that '2'. The idea would be to > hopefully never actually be constrained by it in practice, except in > tiny/toy setups, so we can't go too wrong with our number '2' there. Yea, I don't actually understand why not just use strategy_buffers - 1 or something. 1/2 seems like a big limiting factor for those strategies with small rings. I don't really think it will come up for SELECT queries since they rely on readahead and not prefetching. It does seem like it could easily come up for analyze. But I am on board with the idea of clamping for sequential scan/TID range scan. For vacuum, we might have to think harder. If the user specifies buffer_usage_limit and io_combine_limit and they are never reaching IOs of io_combine_limit because of their buffer_usage_limit value, then we should probably inform them. - Melanie
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman wrote: > On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas wrote: > > On 01/04/2024 22:58, Melanie Plageman wrote: > > > Attached v7 has version 14 of the streaming read API as well as a few > > > small tweaks to comments and code. > > > > I saw benchmarks in this thread to show that there's no regression when > > the data is in cache, but I didn't see any benchmarks demonstrating the > > benefit of this. So I ran this quick test: > > Good point! It would be good to show why we would actually want this > patch. Attached v8 is rebased over current master (which now has the > streaming read API). Anecdotally by all reports I've seen so far, all-in-cache seems to be consistently a touch faster than master if anything, for streaming seq scan and streaming ANALYZE. That's great, but it doesn't seem to be due to intentional changes. No efficiency is coming from batching anything. Perhaps it has to do with CPU pipelining effects: though it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work itself is cut up into stages in a kind of pipeline: read_stream_next_buffer() chooses page n + 2, pins page n + 1 and returns page n each time you call it, so maybe we get more CPU parallelism due to spreading the data dependencies out? (Makes me wonder what happens if you insert a memory prefetch for the page header into that production line, and if there are more opportunities to spread dependencies out, e.g. hashing the buffer tag ahead of time.) > On the topic of BAS_BULKREAD buffer access strategy, I think the least > we could do is add an assert like this to read_stream_begin_relation() > after calculating max_pinned_buffers. > > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers); > > Perhaps we should do more? I think with a ring size of 16 MB, large > SELECTs are safe for now. 
But I know future developers will make > changes and it would be good not to assume they will understand that > pinning more buffers than the size of the ring effectively invalidates > the ring. Yeah, I think we should clamp max_pinned_buffers if we see a strategy. What do you think about: if (strategy) { int strategy_buffers = GetAccessStrategyBufferCount(strategy); max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers); } I just don't know where to get that '2'. The idea would be to hopefully never actually be constrained by it in practice, except in tiny/toy setups, so we can't go too wrong with our number '2' there. Then we should increase the default ring sizes for BAS_BULKREAD and BAS_VACUUM to make the previous statement true. The sizes of main memory and L2 cache have increased dramatically since those strategies were invented. I think we should at least double them, and more likely quadruple them. I realise you already made them configurable per command in commit 1cbbee033, but I mean the hard-coded default 256 in freelist.c. It's not only to get more space for our prefetching plans, it's also to give the system more chance of flushing WAL asynchronously/in some other backend before you crash into dirty data; as you discovered, prefetching accidentally makes that effect worse in a BAS_VACUUM strategy, by taking away space that is effectively deferring WAL flushes, so I'd at least like to double the size if we use "/ 2" above.
Re: Streaming read-ready sequential scan code
On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas wrote: > > On 01/04/2024 22:58, Melanie Plageman wrote: > > Attached v7 has version 14 of the streaming read API as well as a few > > small tweaks to comments and code. > > I saw benchmarks in this thread to show that there's no regression when > the data is in cache, but I didn't see any benchmarks demonstrating the > benefit of this. So I ran this quick test: Good point! It would be good to show why we would actually want this patch. Attached v8 is rebased over current master (which now has the streaming read API). On the topic of BAS_BULKREAD buffer access strategy, I think the least we could do is add an assert like this to read_stream_begin_relation() after calculating max_pinned_buffers. Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers); Perhaps we should do more? I think with a ring size of 16 MB, large SELECTs are safe for now. But I know future developers will make changes and it would be good not to assume they will understand that pinning more buffers than the size of the ring effectively invalidates the ring. - Melanie From cfccafec650a77c53b1d78180b52db31742181ff Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Wed, 27 Mar 2024 20:25:06 -0400 Subject: [PATCH v8 3/3] Sequential scans and TID range scans stream reads Implementing streaming read support for heap sequential scans and TID range scans includes three parts: Allocate the read stream object in heap_beginscan(). On rescan, reset the stream by releasing all pinned buffers and resetting the prefetch block. Implement a callback returning the next block to prefetch to the read stream infrastructure. Invoke the read stream API when a new page is needed. When the scan direction changes, reset the stream. 
--- src/backend/access/heap/heapam.c | 94 src/include/access/heapam.h | 15 + 2 files changed, 97 insertions(+), 12 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 6b26f5bf8af..3546f637c13 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -221,6 +221,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = * */ +static BlockNumber +heap_scan_stream_read_next(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + if (unlikely(!scan->rs_inited)) + { + scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir); + scan->rs_inited = true; + } + else + scan->rs_prefetch_block = heapgettup_advance_block(scan, + scan->rs_prefetch_block, + scan->rs_dir); + + return scan->rs_prefetch_block; +} + /* * initscan - scan code common to heap_beginscan and heap_rescan * @@ -323,6 +342,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock) scan->rs_cbuf = InvalidBuffer; scan->rs_cblock = InvalidBlockNumber; + /* + * Initialize to ForwardScanDirection because it is most common and heap + * scans usually must go forwards before going backward. + */ + scan->rs_dir = ForwardScanDirection; + scan->rs_prefetch_block = InvalidBlockNumber; + /* page-at-a-time fields are always invalid when not rs_inited */ /* @@ -459,12 +485,14 @@ heapbuildvis(TableScanDesc sscan) /* * heapfetchbuf - subroutine for heapgettup() * - * This routine reads the next block of the relation into a buffer and returns - * with that pinned buffer saved in the scan descriptor. + * This routine gets gets the next block of the relation from the read stream + * and saves that pinned buffer in the scan descriptor. 
*/ static inline void heapfetchbuf(HeapScanDesc scan, ScanDirection dir) { + Assert(scan->rs_read_stream); + /* release previous scan buffer, if any */ if (BufferIsValid(scan->rs_cbuf)) { @@ -479,19 +507,23 @@ heapfetchbuf(HeapScanDesc scan, ScanDirection dir) */ CHECK_FOR_INTERRUPTS(); - if (unlikely(!scan->rs_inited)) + /* + * If the scan direction is changing, reset the prefetch block to the + * current block. Otherwise, we will incorrectly prefetch the blocks + * between the prefetch block and the current block again before + * prefetching blocks in the new, correct scan direction. + */ + if (unlikely(scan->rs_dir != dir)) { - scan->rs_cblock = heapgettup_initial_block(scan, dir); - Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf)); - scan->rs_inited = true; + scan->rs_prefetch_block = scan->rs_cblock; + read_stream_reset(scan->rs_read_stream); } - else - scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir); - /* read block if valid */ - if (BlockNumberIsValid(scan->rs_cblock)) - scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, - scan->rs_cblock, RBM_NORMAL, scan->rs_strategy); + s
Re: Streaming read-ready sequential scan code
On 01/04/2024 22:58, Melanie Plageman wrote: Attached v7 has version 14 of the streaming read API as well as a few small tweaks to comments and code. I saw benchmarks in this thread to show that there's no regression when the data is in cache, but I didn't see any benchmarks demonstrating the benefit of this. So I ran this quick test:

-- create a ~1 GB table with only 1 row per page.
CREATE TABLE giga (i int, filler text) with (fillfactor=10);
insert into giga select g, repeat('x', 900) from generate_series(1, 140000) g;
vacuum freeze giga;
\timing on
select count(*) from giga;

The SELECT takes about 390 ms on 'master', and 230 ms with the patch. This is pretty much the best case for this patch, real world gains will be much smaller. Nevertheless, nice speedup! -- Heikki Linnakangas Neon (https://neon.tech)
Re: Streaming read-ready sequential scan code
On Wed, Mar 27, 2024 at 08:47:03PM -0400, Melanie Plageman wrote: > On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman > wrote: > > > > On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > > > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > > > wrote: > > > > > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > > > wrote: > > > > > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > > > wrote: > > > > > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > > > PgStreamingRead object for sequential scans > > > > > > > > > > > > I've written three alternative implementations of the actual > > > > > > streaming > > > > > > read user for sequential scan which handle the question of where to > > > > > > allocate the streaming read object and how to handle changing scan > > > > > > direction in different ways. > > > > > > > > > > > > Option A) > > > > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > > > know the scan direction at this time, if the scan ends up not being > > > > > > a > > > > > > forwards scan, the streaming read object must later be freed -- so > > > > > > this will sometimes allocate a streaming read object it never uses. > > > > > > - Only supports ForwardScanDirection and once the scan direction > > > > > > changes, streaming is never supported again -- even if we return to > > > > > > ForwardScanDirection > > > > > > - Must maintain a "fallback" codepath that does not use the > > > > > > streaming read API > > > > > > > > > > Attached is a version of this patch which implements a "reset" > > > > > function for the streaming read API which should be cheaper than the > > > > > full pg_streaming_read_free() on rescan. 
This can easily be ported to > > > > > work on any of my proposed implementations (A/B/C). I implemented it > > > > > on A as an example. > > > > > > > > Attached is the latest version of this patchset -- rebased in light of > > > > Thomas' updatees to the streaming read API [1]. I have chosen the > > > > approach I think we should go with. It is a hybrid of my previously > > > > proposed approaches. > > > > > > While investigating some performance concerns, Andres pointed out that > > > the members I add to HeapScanDescData in this patch push rs_cindex and > > > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > > > v4's HeapScanDescData is as well-packed as on master and its members > > > are reordered so that rs_cindex and rs_ntuples are back on the second > > > cacheline of the struct's data. > > > > I did some additional profiling and realized that dropping the > > unlikely() from the places we check rs_inited frequently was negatively > > impacting performance. v5 adds those back and also makes a few other > > very minor cleanups. > > > > Note that this patch set has a not yet released version of Thomas > > Munro's Streaming Read API with a new ramp-up logic which seems to fix a > > performance issue I saw with my test case when all of the sequential > > scan's blocks are in shared buffers. Once he sends the official new > > version, I will rebase this and point to his explanation in that thread. > > Attached v6 has the version of the streaming read API mentioned here > [1]. This resolved the fully-in-shared-buffers regressions > investigated in that thread by Andres, Bilal, and Thomas. Attached v7 has version 14 of the streaming read API as well as a few small tweaks to comments and code. I noticed that 0001 in the set posed a small regression from master for a sequential scan of a relation already in shared buffers. 
While investigating this, I saw that heapfetchbuf() was still not being inlined (compiled at -O2) and when I promoted heapfetchbuf() from static inline to static pg_attribute_always_inline, most of the very small regression I saw went away. I don't know if I squashed the issue entirely, though. - Melanie >From db6219d9ea689fdd0150aa0fbbeaa2ee11364aa8 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v7 1/4] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for streaming reads. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former i
Re: Streaming read-ready sequential scan code
On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman wrote: > > On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > > wrote: > > > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > > wrote: > > > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > > wrote: > > > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > > PgStreamingRead object for sequential scans > > > > > > > > > > I've written three alternative implementations of the actual streaming > > > > > read user for sequential scan which handle the question of where to > > > > > allocate the streaming read object and how to handle changing scan > > > > > direction in different ways. > > > > > > > > > > Option A) > > > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > > know the scan direction at this time, if the scan ends up not being a > > > > > forwards scan, the streaming read object must later be freed -- so > > > > > this will sometimes allocate a streaming read object it never uses. > > > > > - Only supports ForwardScanDirection and once the scan direction > > > > > changes, streaming is never supported again -- even if we return to > > > > > ForwardScanDirection > > > > > - Must maintain a "fallback" codepath that does not use the streaming > > > > > read API > > > > > > > > Attached is a version of this patch which implements a "reset" > > > > function for the streaming read API which should be cheaper than the > > > > full pg_streaming_read_free() on rescan. This can easily be ported to > > > > work on any of my proposed implementations (A/B/C). I implemented it > > > > on A as an example. 
> > > > > > Attached is the latest version of this patchset -- rebased in light of > > > Thomas' updatees to the streaming read API [1]. I have chosen the > > > approach I think we should go with. It is a hybrid of my previously > > > proposed approaches. > > > > While investigating some performance concerns, Andres pointed out that > > the members I add to HeapScanDescData in this patch push rs_cindex and > > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > > v4's HeapScanDescData is as well-packed as on master and its members > > are reordered so that rs_cindex and rs_ntuples are back on the second > > cacheline of the struct's data. > > I did some additional profiling and realized that dropping the > unlikely() from the places we check rs_inited frequently was negatively > impacting performance. v5 adds those back and also makes a few other > very minor cleanups. > > Note that this patch set has a not yet released version of Thomas > Munro's Streaming Read API with a new ramp-up logic which seems to fix a > performance issue I saw with my test case when all of the sequential > scan's blocks are in shared buffers. Once he sends the official new > version, I will rebase this and point to his explanation in that thread. Attached v6 has the version of the streaming read API mentioned here [1]. This resolved the fully-in-shared-buffers regressions investigated in that thread by Andres, Bilal, and Thomas. The one outstanding item for the sequential scan streaming read user is deciding how the BAS_BULKREAD buffer access strategy should interact with the streaming read infrastructure. We discussed a bit off-list, and it seems clear that the ring must be at least as large as io_combine_limit. This should be no problem for BAS_BULKREAD because its ring is 16 MB. 
The question is whether or not we need to do anything right now to ensure there aren't adverse interactions between io_combine_limit, max_ios, and the buffer access strategy ring buffer size. - Melanie [1] https://www.postgresql.org/message-id/CA%2BhUKGJTwrS7F%3DuJPx3SeigMiQiW%2BLJaOkjGyZdCntwyMR%3DuAw%40mail.gmail.com From bed26d391190f4411eccc4533d188e5dba6e8f72 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v6 1/6] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for streaming reads. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former is now heapfetchbuf() and the latter is now heapbuildvis(). Future commits will push
Re: Streaming read-ready sequential scan code
On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > wrote: > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > wrote: > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > wrote: > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > PgStreamingRead object for sequential scans > > > > > > > > I've written three alternative implementations of the actual streaming > > > > read user for sequential scan which handle the question of where to > > > > allocate the streaming read object and how to handle changing scan > > > > direction in different ways. > > > > > > > > Option A) > > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > know the scan direction at this time, if the scan ends up not being a > > > > forwards scan, the streaming read object must later be freed -- so > > > > this will sometimes allocate a streaming read object it never uses. > > > > - Only supports ForwardScanDirection and once the scan direction > > > > changes, streaming is never supported again -- even if we return to > > > > ForwardScanDirection > > > > - Must maintain a "fallback" codepath that does not use the streaming > > > > read API > > > > > > Attached is a version of this patch which implements a "reset" > > > function for the streaming read API which should be cheaper than the > > > full pg_streaming_read_free() on rescan. This can easily be ported to > > > work on any of my proposed implementations (A/B/C). I implemented it > > > on A as an example. > > > > Attached is the latest version of this patchset -- rebased in light of > > Thomas' updatees to the streaming read API [1]. I have chosen the > > approach I think we should go with. 
It is a hybrid of my previously > > proposed approaches. > > While investigating some performance concerns, Andres pointed out that > the members I add to HeapScanDescData in this patch push rs_cindex and > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > v4's HeapScanDescData is as well-packed as on master and its members > are reordered so that rs_cindex and rs_ntuples are back on the second > cacheline of the struct's data. I did some additional profiling and realized that dropping the unlikely() from the places we check rs_inited frequently was negatively impacting performance. v5 adds those back and also makes a few other very minor cleanups. Note that this patch set has a not yet released version of Thomas Munro's Streaming Read API with a new ramp-up logic which seems to fix a performance issue I saw with my test case when all of the sequential scan's blocks are in shared buffers. Once he sends the official new version, I will rebase this and point to his explanation in that thread. - Melanie >From 29827c23a11061846a7c145f430aa4712c1c30f3 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v5 1/5] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for streaming reads. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former is now heapfetchbuf() and the latter is now heapbuildvis(). 
Future commits will push the logic for selecting the next block into heapfetchbuf() in cases when streaming reads are not supported (such as backwards sequential scans). Because this logic differs for sample scans and sequential scans, inline the code to read the block into a buffer for sample scans. This has the added benefit of allowing for a bit of refactoring in heapam_scan_sample_next_block(), including unpinning the previous buffer before invoking the callback to select the next block. --- src/backend/access/heap/heapam.c | 74 ++-- src/backend/access/heap/heapam_handler.c | 40 + src/include/access/heapam.h | 2 +- 3 files changed, 72 insertions(+), 44 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 34bc60f625f..aef90d28473 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -363,17 +363,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk } /* - * heapget
Re: Streaming read-ready sequential scan code
On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman wrote: > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > wrote: > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > wrote: > > > > > > > > There is an outstanding question about where to allocate the > > > > PgStreamingRead object for sequential scans > > > > > > I've written three alternative implementations of the actual streaming > > > read user for sequential scan which handle the question of where to > > > allocate the streaming read object and how to handle changing scan > > > direction in different ways. > > > > > > Option A) > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > - Allocates the streaming read object in initscan(). Since we do not > > > know the scan direction at this time, if the scan ends up not being a > > > forwards scan, the streaming read object must later be freed -- so > > > this will sometimes allocate a streaming read object it never uses. > > > - Only supports ForwardScanDirection and once the scan direction > > > changes, streaming is never supported again -- even if we return to > > > ForwardScanDirection > > > - Must maintain a "fallback" codepath that does not use the streaming > > > read API > > > > Attached is a version of this patch which implements a "reset" > > function for the streaming read API which should be cheaper than the > > full pg_streaming_read_free() on rescan. This can easily be ported to > > work on any of my proposed implementations (A/B/C). I implemented it > > on A as an example. > > Attached is the latest version of this patchset -- rebased in light of > Thomas' updatees to the streaming read API [1]. I have chosen the > approach I think we should go with. It is a hybrid of my previously > proposed approaches. 
While investigating some performance concerns, Andres pointed out that the members I add to HeapScanDescData in this patch push rs_cindex and rs_ntuples to another cacheline and introduce a 4-byte hole. Attached v4's HeapScanDescData is as well-packed as on master and its members are reordered so that rs_cindex and rs_ntuples are back on the second cacheline of the struct's data. - Melanie From 614195777b3ee675d74d98953b086e0f8a4a494d Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 26 Feb 2024 15:33:39 -0500 Subject: [PATCH v4 4/5] Add pg_streaming_read_reset For rescan, we want to reuse the streaming read object and simply release the buffers that were pinned by the streaming read infrastructure. --- src/backend/storage/aio/streaming_read.c | 18 ++ src/include/storage/streaming_read.h | 1 + 2 files changed, 19 insertions(+) diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c index 71f2c4a70b6..70f3ef051f8 100644 --- a/src/backend/storage/aio/streaming_read.c +++ b/src/backend/storage/aio/streaming_read.c @@ -610,3 +610,21 @@ pg_streaming_read_free(PgStreamingRead *pgsr) pfree(pgsr); } + + +/* + * Reset a streaming read object by releasing all of the buffers. Note that + * max_ios is not recalculated, so any changes to maintenance_io_concurrency and + * effective_io_concurrency will have no effect. + */ +void +pg_streaming_read_reset(PgStreamingRead *pgsr) +{ + Buffer buffer; + + /* Stop looking ahead, and unpin anything that wasn't consumed. 
*/ + pgsr->finished = true; + while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer) + ReleaseBuffer(buffer); + pgsr->finished = false; +} diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h index c4d3892bb26..63cef719e42 100644 --- a/src/include/storage/streaming_read.h +++ b/src/include/storage/streaming_read.h @@ -48,5 +48,6 @@ extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags, extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr); extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private); extern void pg_streaming_read_free(PgStreamingRead *pgsr); +extern void pg_streaming_read_reset(PgStreamingRead *pgsr); #endif -- 2.40.1 From 4b6d9059aa32625e91d38f1d414ec514d4073197 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 29 Jan 2024 11:50:01 -0500 Subject: [PATCH v4 2/5] Replace blocks with buffers in heapgettup control flow Future commits will introduce the streaming read API and the sequential scan streaming read API user. Streaming read API users implement a callback which returns the next block to read. Sequential scans previously looped through the blocks in the relation, synchronously reading in a block and then processing it. An InvalidBlockNumber returned by heapgettup_advance_block() meant that the relation was exhausted and all blocks had been processed. The streaming read API may exhaust the blocks in a relation (having read all of them into buffers) before they have all be
Re: Streaming read-ready sequential scan code
On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > wrote: > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > wrote: > > > > > > There is an outstanding question about where to allocate the > > > PgStreamingRead object for sequential scans > > > > I've written three alternative implementations of the actual streaming > > read user for sequential scan which handle the question of where to > > allocate the streaming read object and how to handle changing scan > > direction in different ways. > > > > Option A) > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > - Allocates the streaming read object in initscan(). Since we do not > > know the scan direction at this time, if the scan ends up not being a > > forwards scan, the streaming read object must later be freed -- so > > this will sometimes allocate a streaming read object it never uses. > > - Only supports ForwardScanDirection and once the scan direction > > changes, streaming is never supported again -- even if we return to > > ForwardScanDirection > > - Must maintain a "fallback" codepath that does not use the streaming read > > API > > Attached is a version of this patch which implements a "reset" > function for the streaming read API which should be cheaper than the > full pg_streaming_read_free() on rescan. This can easily be ported to > work on any of my proposed implementations (A/B/C). I implemented it > on A as an example. Attached is the latest version of this patchset -- rebased in light of Thomas' updatees to the streaming read API [1]. I have chosen the approach I think we should go with. It is a hybrid of my previously proposed approaches. The streaming read is allocated in heap_beginscan() and then reset on rescan and when the scan direction changes. I only check if the scan direction changes when a new page is needed. 
This implementation means no fallback method is needed, so we can remove the non-streaming read code for heap sequential scans. Because heapgettup() and heapgettup_pagemode() are also used for TID range scans, this patch also happens to implement streaming reads for TID range scans. - Melanie [1] https://www.postgresql.org/message-id/CA%2BhUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ%40mail.gmail.com >From 9227b1b621c473ab189ac0cefba1b9aaed02aa09 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v3 1/5] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for streaming reads. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former is now heapfetchbuf() and the latter is now heapbuildvis(). Future commits will push the logic for selecting the next block into heapfetchbuf() in cases when streaming reads are not supported (such as backwards sequential scans). Because this logic differs for sample scans and sequential scans, inline the code to read the block into a buffer for sample scans. This has the added benefit of allowing for a bit of refactoring in heapam_scan_sample_next_block(), including unpinning the previous buffer before invoking the callback to select the next block. 
--- src/backend/access/heap/heapam.c | 74 ++-- src/backend/access/heap/heapam_handler.c | 40 + src/include/access/heapam.h | 2 +- 3 files changed, 72 insertions(+), 44 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 707460a5364..449221da6ac 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -367,17 +367,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk } /* - * heapgetpage - subroutine for heapgettup() + * heapbuildvis - Utility function for heap scans. * - * This routine reads and pins the specified page of the relation. - * In page-at-a-time mode it performs additional work, namely determining - * which tuples on the page are visible. + * Given a page residing in a buffer saved in the scan descriptor, prune the + * page and determine which of its tuples are all visible, saving their offsets + * in an array in the scan descriptor. */ void -heapgetpage(TableScanDesc sscan, BlockNumber block) +heapbuildvis(TableScanDesc sscan) { HeapScanDesc scan = (HeapScanDesc) sscan; - Buffer buffer; + Buffer buffer = scan->rs_
Re: Streaming read-ready sequential scan code
On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman wrote:
>
> On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman wrote:
> >
> > There is an outstanding question about where to allocate the
> > PgStreamingRead object for sequential scans
>
> I've written three alternative implementations of the actual streaming
> read user for sequential scan which handle the question of where to
> allocate the streaming read object and how to handle changing scan
> direction in different ways.
>
> Option A)
> https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
> - Allocates the streaming read object in initscan(). Since we do not
> know the scan direction at this time, if the scan ends up not being a
> forwards scan, the streaming read object must later be freed -- so
> this will sometimes allocate a streaming read object it never uses.
> - Only supports ForwardScanDirection and once the scan direction
> changes, streaming is never supported again -- even if we return to
> ForwardScanDirection
> - Must maintain a "fallback" codepath that does not use the streaming read API

Attached is a version of this patch which implements a "reset" function for the streaming read API which should be cheaper than the full pg_streaming_read_free() on rescan. This can easily be ported to work on any of my proposed implementations (A/B/C). I implemented it on A as an example.

- Melanie

From ee93bf41b41aefd6fe20c41be15d2fd5e5d75181 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Mon, 29 Jan 2024 11:50:01 -0500
Subject: [PATCH v2 2/5] Replace blocks with buffers in heapgettup control flow

Future commits will introduce the streaming read API and the sequential
scan streaming read API user. Streaming read API users implement a
callback which returns the next block to read. Sequential scans
previously looped through the blocks in the relation, synchronously
reading in a block and then processing it.
An InvalidBlockNumber returned by heapgettup_advance_block() meant that
the relation was exhausted and all blocks had been processed. The
streaming read API may exhaust the blocks in a relation (having read all
of them into buffers) before they have all been processed by the
sequential scan. As such, the sequential scan should continue processing
blocks until heapfetchbuf() returns InvalidBuffer.

Note that this commit does not implement the streaming read API user. It
simply restructures heapgettup() and heapgettup_pagemode() to use
buffers instead of blocks for control flow. Not all sequential scans
will support streaming reads. As such, this code will remain for
compatibility even after sequential scans support streaming reads.
---
 src/backend/access/heap/heapam.c | 79 ++--
 1 file changed, 35 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 449221da6ac..e0fe3d9c326 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -87,6 +87,9 @@ static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
 static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
 								 LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool *have_tuple_lock);
+static inline BlockNumber heapgettup_advance_block(HeapScanDesc scan,
+												   BlockNumber block, ScanDirection dir);
+static inline BlockNumber heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir);
 static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  uint16 old_infomask2, TransactionId add_to_xmax,
 									  LockTupleMode mode, bool is_update,
@@ -463,14 +466,12 @@ heapbuildvis(TableScanDesc sscan)
 /*
  * heapfetchbuf - subroutine for heapgettup()
  *
- * This routine reads the specified block of the relation into a buffer and
- * returns with that pinned buffer saved in the scan descriptor.
+ * This routine reads the next block of the relation into a buffer and returns
+ * with that pinned buffer saved in the scan descriptor.
  */
 static inline void
-heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
-	Assert(block < scan->rs_nblocks);
-
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -485,10 +486,19 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
+	if (!scan->rs_inited)
+	{
+		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+
+	/* read block if valid */
+	if (BlockNumberIsValid(scan->rs_cblock))
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_
Re: Streaming read-ready sequential scan code
On Tue, Feb 20, 2024 at 6:13 AM Robert Haas wrote:
>
> On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman wrote:
> > I've written three alternative implementations of the actual streaming
> > read user for sequential scan which handle the question of where to
> > allocate the streaming read object and how to handle changing scan
> > direction in different ways.
>
> It's weird to me that the prospect of changing the scan direction
> causes such complexity. I mean, why doesn't a streaming read object
> have a forget_all_my_previous_requests() method or somesuch?

Basically, that is what pg_streaming_read_free() does. It goes through and releases the buffers it had pinned and frees any per-buffer data allocated. The complexity with the sequential scan streaming read user and scan direction is just that it has to detect when the scan direction changes and do the releasing/freeing and reallocation. The scan direction is passed to heapgettup[_pagemode](), so this is something that can change on a tuple-to-tuple basis.

It is less that doing this is complicated and more that it is annoying and distracting to have to check for and handle a very unimportant and uncommon case in the main path of the common case.

- Melanie
Re: Streaming read-ready sequential scan code
On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman wrote:
> I've written three alternative implementations of the actual streaming
> read user for sequential scan which handle the question of where to
> allocate the streaming read object and how to handle changing scan
> direction in different ways.

It's weird to me that the prospect of changing the scan direction causes such complexity. I mean, why doesn't a streaming read object have a forget_all_my_previous_requests() method or somesuch?

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Streaming read-ready sequential scan code
On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman wrote:
>
> There is an outstanding question about where to allocate the
> PgStreamingRead object for sequential scans

I've written three alternative implementations of the actual streaming read user for sequential scan which handle the question of where to allocate the streaming read object and how to handle changing scan direction in different ways.

Option A)
https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not know the scan direction at this time, if the scan ends up not being a forwards scan, the streaming read object must later be freed -- so this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction changes, streaming is never supported again -- even if we return to ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Option B)
https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_heapgettup_alloc_forward_only
- Allocates the streaming read object in heapgettup[_pagemode]() when it has not been previously allocated. To do this it has to record and switch into a different memory context than the per-tuple context. It only allocates the streaming read object if it is a forwards scan. It frees the streaming read object if the scan direction is later changed.
- Only supports ForwardScanDirection and once the scan direction changes, streaming is never supported again -- even if we return to ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Option C)
https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_all_dir_stream
- Allocates the streaming read object in heapgettup[_pagemode]() when it has not been previously allocated. To do this it has to record and switch into a different memory context than the per-tuple context.
- All scan directions support streaming. To do this, the scan direction has to be tracked and we must check if the direction has changed on every heapgettup[_pagemode]() invocation to avoid returning wrong results.
- No "fallback" codepath as all heap sequential scans will use the streaming read API

As you can see, each option has pros and cons. I'm interested in what others think about which we should choose.

- Melanie
Re: Streaming read-ready sequential scan code
On Mon, Jan 29, 2024 at 4:24 PM David Rowley wrote:
>
> On Tue, 30 Jan 2024 at 10:17, Melanie Plageman wrote:
> > Though logically the performance with 0001 and 0002 should be the same
> > as master (no new non-inline function calls, no additional looping),
> > I've done a bit of profiling anyway. I created a large multi-GB table,
> > read it all into shared buffers (disabling the large sequential scan
> > bulkread optimization), and did a sequential SELECT count(*) from the
> > table. From the profiles below, you'll notice that master and the
> > patch are basically the same. Actual percentages vary from run-to-run.
> > Execution time is the same.
>
> Can you also run a test on a Seqscan with a filter that filters out
> all tuples? There's less overhead in other parts of the executor with
> such a query.

Yes, of course. Thank you so much for taking a look!

While I was at it, I changed the table schema to be entirely composed of INT type columns and regenerated the data. Note that, both in this example and my previous example, I ensured that the table was vacuumed beforehand (and autovacuum disabled for the table) so there wasn't any on-access pruning happening (heapgetpage() does that in pagemode).

This is the schema:

CREATE TABLE foo(id INT, a INT, b INT, c INT, d INT, e INT, f INT, g INT) with (autovacuum_enabled=false);

I added 4600 rows to the table, making it 2.6 GB. Shared buffers is double that. Before profiling, I did a SELECT * from the table with the large sequential scan bulkread optimization disabled. Then I vacuumed the table. Finally, I turned up parallel_setup_cost high enough to disable query parallelism.

The query I profiled was:

SELECT * FROM foo WHERE id = 0;

With the data I generated, 0 rows match that condition. Profiles below. Execution time essentially the same.

patch:

  17.08%  postgres  postgres  [.] ExecInterpExpr
  11.17%  postgres  postgres  [.] tts_buffer_heap_getsomeattrs
  10.64%  postgres  postgres  [.] ExecStoreBufferHeapTuple
   9.82%  postgres  postgres  [.] heap_getnextslot
   9.13%  postgres  postgres  [.] heapgettup_pagemode
   8.98%  postgres  postgres  [.] heapbuildvis
   5.40%  postgres  postgres  [.] HeapCheckForSerializableConflictOut
   5.16%  postgres  postgres  [.] SeqNext

master:

  17.89%  postgres  postgres  [.] ExecInterpExpr
  12.28%  postgres  postgres  [.] tts_buffer_heap_getsomeattrs
  10.54%  postgres  postgres  [.] ExecStoreBufferHeapTuple
  10.11%  postgres  postgres  [.] heapgettup_pagemode
   8.52%  postgres  postgres  [.] heapgetpage
   8.28%  postgres  postgres  [.] heap_getnextslot
   5.00%  postgres  postgres  [.] HeapCheckForSerializableConflictOut
   4.71%  postgres  postgres  [.] SeqNext

- Melanie
Re: Streaming read-ready sequential scan code
On Tue, 30 Jan 2024 at 10:17, Melanie Plageman wrote:
> Though logically the performance with 0001 and 0002 should be the same
> as master (no new non-inline function calls, no additional looping),
> I've done a bit of profiling anyway. I created a large multi-GB table,
> read it all into shared buffers (disabling the large sequential scan
> bulkread optimization), and did a sequential SELECT count(*) from the
> table. From the profiles below, you'll notice that master and the
> patch are basically the same. Actual percentages vary from run-to-run.
> Execution time is the same.

Can you also run a test on a Seqscan with a filter that filters out all tuples? There's less overhead in other parts of the executor with such a query.

David