Re: Streaming read-ready sequential scan code
On Fri, Aug 30, 2024 at 1:00 AM Alexander Lakhin wrote:
> I looked at two perf profiles of such out-of-sync processes and found no
> extra calls whatsoever in the slow one; it just has the number of perf
> samples increased proportionally. That made me suspect CPU frequency
> scaling... Indeed, with the "performance" governor set and boost mode
> disabled, I'm now seeing much more stable numbers (I do this tuning
> before running performance tests, but I had forgotten about that when I
> ran your test, my bad).

Aha, mystery solved. I have pushed the fix. Thanks!
Re: Streaming read-ready sequential scan code
Hello Thomas,

29.08.2024 01:16, Thomas Munro wrote:
> Yeah. That's quite interesting, and must destabilise that simple-minded
> demo. I'm curious to know exactly what contention is causing that (about
> 3/4 of a millisecond that I don't see, and now I want to know what it's
> waiting for), but it's a very crude test lacking timer resolution in the
> earlier messages, and it's an unrelated topic and a distraction. Perhaps
> it explains why you saw two different behaviours in Q15 with the patch
> and I didn't, though. Really it shouldn't be so sensitive to such
> variations; it's obviously a terrible plan, and TPC-DS needs a planner
> hacker mega-brain applied to it; I'm going to try to nerd-snipe one...

I looked at two perf profiles of such out-of-sync processes and found no
extra calls whatsoever in the slow one; it just has the number of perf
samples increased proportionally. That made me suspect CPU frequency
scaling... Indeed, with the "performance" governor set and boost mode
disabled, I'm now seeing much more stable numbers (I do this tuning before
running performance tests, but I had forgotten about that when I ran your
test, my bad). I'm sorry for the noise and the distraction.

Still, now I can confirm your results. Without the patch, two parallel
workers gave "Buffers: shared hit=217 / Buffers: shared hit=226" 10 times
out of 10. And with the patch, I've got "shared hit=219 / shared hit=224"
3 times, "shared hit=220 / shared hit=223" 4 times, and "shared hit=221 /
shared hit=222" 3 times out of 10.

On b7b0f3f27~1, my results are:
"shared hit=219 / shared hit=224": 2
"shared hit=220 / shared hit=223": 3
"shared hit=221 / shared hit=222": 4
"shared hit=218 / shared hit=225": 1

Best regards,
Alexander
Re: Streaming read-ready sequential scan code
On Wed, Aug 28, 2024 at 9:00 PM Alexander Lakhin wrote:
> 2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262"
> 2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262"
> 2024-08-28 08:41:59.305 UTC [338263] DEBUG: InitPostgres
> 2024-08-28 08:41:59.305 UTC [338264] DEBUG: InitPostgres
> 2024-08-28 08:41:59.309 UTC [338263] NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263
> 2024-08-28 08:41:59.310 UTC [338264] NOTICE: 2024-08-28 08:41:59.310079+00 pid 338264
>
> It looks like the two parallel workers were started simultaneously, but
> then the second one lagged behind...

Yeah. That's quite interesting, and must destabilise that simple-minded
demo. I'm curious to know exactly what contention is causing that (about
3/4 of a millisecond that I don't see, and now I want to know what it's
waiting for), but it's a very crude test lacking timer resolution in the
earlier messages, and it's an unrelated topic and a distraction. Perhaps
it explains why you saw two different behaviours in Q15 with the patch and
I didn't, though. Really it shouldn't be so sensitive to such variations;
it's obviously a terrible plan, and TPC-DS needs a planner hacker
mega-brain applied to it; I'm going to try to nerd-snipe one...
Re: Streaming read-ready sequential scan code
28.08.2024 07:45, Thomas Munro wrote:
> Huh. I can reproduce what I showed with pretty low variance, on my
> FreeBSD, Linux and macOS systems. I included
> parallel_leader_participation=off so that the workers would hopefully
> start as closely together in time as possible, and hopefully allow only
> a block or so of variation in the outcome. If I do:
>
> create or replace function f(i int) returns int language plpgsql
> parallel safe as $$
> begin
>   raise notice '% pid %', clock_timestamp(), pg_backend_pid();
>   return i;
> end;
> $$;
>
> then
>
> postgres=# explain (analyze) select f(i) from t limit 1;
> NOTICE: 2024-08-28 16:41:32.845747+12 pid 47019
> NOTICE: 2024-08-28 16:41:32.845746+12 pid 47018
>
> shows start times differ by only a few microseconds at most, often 0. I
> wonder if your system is more variable at beginning execution? Maybe
> you're on a multi-socket system and forking/startup times vary in some
> way I can't see on small systems or something like that?

I use a single-socket system with AMD Ryzen 5900X, running Ubuntu 20.04.
With no active background tasks running, I'm seeing:

NOTICE: 2024-08-28 08:17:36.917162+00 pid 320103
NOTICE: 2024-08-28 08:17:36.917163+00 pid 320102

NOTICE: 2024-08-28 08:17:40.592333+00 pid 320143
NOTICE: 2024-08-28 08:17:40.592645+00 pid 320144

With log_min_messages = DEBUG3, I get:

NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263
NOTICE: 2024-08-28 08:41:59.310079+00 pid 338264

And the following messages in the server.log:

2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262"
2024-08-28 08:41:59.304 UTC [338251] DEBUG: starting background worker process "parallel worker for PID 338262"
2024-08-28 08:41:59.305 UTC [338263] DEBUG: InitPostgres
2024-08-28 08:41:59.305 UTC [338264] DEBUG: InitPostgres
2024-08-28 08:41:59.309 UTC [338263] NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263
2024-08-28 08:41:59.309 UTC [338263] CONTEXT: PL/pgSQL function f(integer) line 3 at RAISE
2024-08-28 08:41:59.309 UTC [338262] NOTICE: 2024-08-28 08:41:59.309364+00 pid 338263
2024-08-28 08:41:59.309 UTC [338262] CONTEXT: PL/pgSQL function f(integer) line 3 at RAISE
	parallel worker
2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(0): 5 before_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(0): 6 on_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG: proc_exit(0): 1 callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG: exit(0)
2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2024-08-28 08:41:59.309 UTC [338263] DEBUG: proc_exit(-1): 0 callbacks to make
2024-08-28 08:41:59.310 UTC [338264] NOTICE: 2024-08-28 08:41:59.310079+00 pid 338264
2024-08-28 08:41:59.310 UTC [338264] CONTEXT: PL/pgSQL function f(integer) line 3 at RAISE

It looks like the two parallel workers were started simultaneously, but
then the second one lagged behind...

Best regards,
Alexander
Re: Streaming read-ready sequential scan code
On Wed, Aug 28, 2024 at 1:00 AM Alexander Lakhin wrote:
> gives me unstable numbers on unpatched master:
> Buffers: shared hit=226
> Buffers: shared hit=217
> Buffers: shared hit=162
> Buffers: shared hit=281
> Buffers: shared hit=210
> Buffers: shared hit=233
> and also with the patch applied:
> Buffers: shared hit=246
> Buffers: shared hit=197
> Buffers: shared hit=247
> Buffers: shared hit=196
> Buffers: shared hit=224
> Buffers: shared hit=219
> Buffers: shared hit=291
> Buffers: shared hit=152
>
> Please correct me if I'm doing something wrong.

Huh. I can reproduce what I showed with pretty low variance, on my
FreeBSD, Linux and macOS systems. I included
parallel_leader_participation=off so that the workers would hopefully
start as closely together in time as possible, and hopefully allow only a
block or so of variation in the outcome. If I do:

create or replace function f(i int) returns int language plpgsql
parallel safe as $$
begin
  raise notice '% pid %', clock_timestamp(), pg_backend_pid();
  return i;
end;
$$;

then

postgres=# explain (analyze) select f(i) from t limit 1;
NOTICE: 2024-08-28 16:41:32.845747+12 pid 47019
NOTICE: 2024-08-28 16:41:32.845746+12 pid 47018

shows start times differ by only a few microseconds at most, often 0. I
wonder if your system is more variable at beginning execution? Maybe
you're on a multi-socket system and forking/startup times vary in some way
I can't see on small systems or something like that?
Re: Streaming read-ready sequential scan code
Hello Thomas,

27.08.2024 09:52, Thomas Munro wrote:
> Here's a really simple way to see the new unfairness at the end of a
> parallel scan:
>
> drop table if exists t;
> create table t (i int);
> insert into t select generate_series(1, 100000);
> alter table t set (parallel_workers = 2);
> set parallel_setup_cost = 0;
> set parallel_leader_participation = off;
> explain (analyze, buffers, verbose) select count(*) from t;
>
> On my machine, unpatched master shows:
>
> Worker 0: actual time=0.036..12.452 rows=51076 loops=1
>   Buffers: shared hit=226
> Worker 1: actual time=0.037..12.003 rows=48924 loops=1
>   Buffers: shared hit=217
>
> The attached patch, which I'd like to push, is effectively what
> Alexander tested (blocknums[16] -> blocknums[1]). There's no point in
> using an array of size 1, so I've turned it into a simple variable and
> deleted the relevant comments. My machine shows:
>
> Worker 0: actual time=0.038..12.115 rows=49946 loops=1
>   Buffers: shared hit=221
> Worker 1: actual time=0.038..12.109 rows=50054 loops=1
>   Buffers: shared hit=222
>
> That difference may not seem huge, but other pre-existing things are
> going pathologically wrong in the reported query that magnify it (see my
> earlier analysis). It's an interesting problem that will require more
> study (my earlier analysis missed a detail that I'll write about
> separately), but it doesn't seem to be new or have easy fixes, so that
> will have to be for later work.

I've tried your query and could not get stable results, unfortunately.
The following script:

rm -rf "$PGDATA"; initdb -D "$PGDATA" >initdb.log 2>&1
pg_ctl -s -l server.log start
cat << EOF | psql | grep 'Parallel Seq Scan' -A10 | grep 'Worker ' -A1
create table t (i int);
insert into t select generate_series(1, 100000);
alter table t set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_leader_participation = off;
explain (analyze, buffers, verbose) select count(*) from t;
EOF
pg_ctl -s stop

gives me unstable numbers on unpatched master:

Worker 0: actual time=0.024..5.814 rows=51076 loops=1
  Buffers: shared hit=226
Worker 1: actual time=0.023..5.614 rows=48924 loops=1
  Buffers: shared hit=217
---
Worker 0: actual time=0.027..5.130 rows=36612 loops=1
  Buffers: shared hit=162
Worker 1: actual time=0.013..5.605 rows=63388 loops=1
  Buffers: shared hit=281
---
Worker 0: actual time=0.025..5.447 rows=47460 loops=1
  Buffers: shared hit=210
Worker 1: actual time=0.019..5.688 rows=52540 loops=1
  Buffers: shared hit=233

and also with the patch applied:

Worker 0: actual time=0.012..4.486 rows=55478 loops=1
  Buffers: shared hit=246
Worker 1: actual time=0.014..4.430 rows=44522 loops=1
  Buffers: shared hit=197
---
Worker 0: actual time=0.013..4.269 rows=55822 loops=1
  Buffers: shared hit=247
Worker 1: actual time=0.017..4.238 rows=44178 loops=1
  Buffers: shared hit=196
---
Worker 0: actual time=0.014..4.974 rows=50624 loops=1
  Buffers: shared hit=224
Worker 1: actual time=0.016..4.932 rows=49376 loops=1
  Buffers: shared hit=219
---
Worker 0: actual time=0.012..5.459 rows=65648 loops=1
  Buffers: shared hit=291
Worker 1: actual time=0.022..5.109 rows=34352 loops=1
  Buffers: shared hit=152

Please correct me if I'm doing something wrong.

Best regards,
Alexander
Re: Streaming read-ready sequential scan code
Here's a really simple way to see the new unfairness at the end of a
parallel scan:

drop table if exists t;
create table t (i int);
insert into t select generate_series(1, 100000);
alter table t set (parallel_workers = 2);
set parallel_setup_cost = 0;
set parallel_leader_participation = off;
explain (analyze, buffers, verbose) select count(*) from t;

On my machine, unpatched master shows:

Worker 0: actual time=0.036..12.452 rows=51076 loops=1
  Buffers: shared hit=226
Worker 1: actual time=0.037..12.003 rows=48924 loops=1
  Buffers: shared hit=217

The attached patch, which I'd like to push, is effectively what Alexander
tested (blocknums[16] -> blocknums[1]). There's no point in using an
array of size 1, so I've turned it into a simple variable and deleted the
relevant comments. My machine shows:

Worker 0: actual time=0.038..12.115 rows=49946 loops=1
  Buffers: shared hit=221
Worker 1: actual time=0.038..12.109 rows=50054 loops=1
  Buffers: shared hit=222

That difference may not seem huge, but other pre-existing things are going
pathologically wrong in the reported query that magnify it (see my earlier
analysis). It's an interesting problem that will require more study (my
earlier analysis missed a detail that I'll write about separately), but it
doesn't seem to be new or have easy fixes, so that will have to be for
later work.

From 07ff31ad30bf9e383e42336e28143852e3793c5b Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 27 Aug 2024 15:11:53 +1200
Subject: [PATCH] Fix unfairness in all-cached parallel seq scan.

Commit b5a9b18c introduced block streaming infrastructure with a special
fast path for all-cached blocks, and commit b7b0f3f2 connected the
infrastructure up to sequential scans.  One of the fast path
optimizations had an unintended consequence: it interfered with the
underlying parallel sequential scan's block allocator, which has its own
ramp-up and ramp-down algorithm.  A scan of a small all-cached table
could give more blocks to one worker.  In some plans (probably already
very bad plans, such as the one reported by Alexander), the unfairness
could be magnified.

Now all-cached scans will call the next-block-number callback just once
each time it wants a new block, instead of trying to buffer 16 block
numbers at once.

Back-patch to 17.

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/63a63690-dd92-c809-0b47-af05459e95d1%40gmail.com
---
 src/backend/storage/aio/read_stream.c | 82 ---
 1 file changed, 24 insertions(+), 58 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index a83c18c2a4b..57d9e93c001 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -117,13 +117,10 @@ struct ReadStream
 	bool		advice_enabled;
 
 	/*
-	 * Small buffer of block numbers, useful for 'ungetting' to resolve flow
-	 * control problems when I/Os are split.  Also useful for batch-loading
-	 * block numbers in the fast path.
+	 * One-block buffer to support 'ungetting' a block number, to resolve flow
+	 * control problems when I/Os are split.
 	 */
-	BlockNumber blocknums[16];
-	int16		blocknums_count;
-	int16		blocknums_next;
+	BlockNumber buffered_blocknum;
 
 	/*
 	 * The callback that will tell us which block numbers to read, and an
@@ -167,68 +164,39 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index)
 }
 
 /*
- * Ask the callback which block it would like us to read next, with a small
- * buffer in front to allow read_stream_unget_block() to work and to allow the
- * fast path to skip this function and work directly from the array.
+ * Ask the callback which block it would like us to read next, with a one-block
+ * buffer in front to allow read_stream_unget_block() to work.
  */
 static inline BlockNumber
 read_stream_get_block(ReadStream *stream, void *per_buffer_data)
 {
-	if (stream->blocknums_next < stream->blocknums_count)
-		return stream->blocknums[stream->blocknums_next++];
+	BlockNumber blocknum;
 
-	/*
-	 * We only bother to fetch one at a time here (but see the fast path which
-	 * uses more).
-	 */
-	return stream->callback(stream,
-							stream->callback_private_data,
-							per_buffer_data);
+	blocknum = stream->buffered_blocknum;
+	if (blocknum != InvalidBlockNumber)
+		stream->buffered_blocknum = InvalidBlockNumber;
+	else
+		blocknum = stream->callback(stream,
+									stream->callback_private_data,
+									per_buffer_data);
+
+	return blocknum;
 }
 
 /*
  * In order to deal with short reads in StartReadBuffers(), we sometimes need
- * to defer handling of a block until later.
+ * to defer handling of a block number we've already received from the callback
+ * until later.
  */
 static inline void
 read_stream_unget_block(ReadStream *stream, BlockNumber block
Re: Streaming read-ready sequential scan code
On Tue, May 21, 2024 at 9:11 AM Melanie Plageman wrote:
> So, if you are seeing the slow-down mostly go away by reducing the
> blocknums array size, does the regression only appear when the scan
> data is fully in shared buffers? Or is this blocknums' other use
> (dealing with short reads)?

That must be true (the blocknums array is normally only "filled" in the
"fast path", where all buffers are found in cache).

> Is your theory that one worker ends up reading 16 blocks that should
> have been distributed across multiple workers?

Yes, it just jiggles the odds around a bit, introducing a bit of extra
unfairness by calling the callback in a tighter loop to build a little
batch, revealing a pre-existing problem.

The mistake in PHJ (problem #2 above) is that, once a worker decides it
would like all workers to stop inserting so it can increase the number of
buckets, it sets a flag to ask them to do that, and waits for them to see
it, but if there is a worker filtering all tuples out, it never checks the
"growth" flag. So it scans all the way to the end while the other guy
waits. Normally it checks that flag when it is time to allocate a new
chunk of memory, which seemed to make sense to me at the time: if we've
hit the needs-more-buckets (or needs-more-batches) logic, then surely
workers are inserting tuples and will soon allocate a new chunk! But, of
course, here is the edge case where that isn't true: we had bad estimates,
so the hash table was too small (problem #1); we got lots of tuples
clustered over a few heap pages and decided to expand the hash table, but
right at that moment the matching tuples ran out, so somebody had to
finish the whole scan without ever checking the flag (problem #2); and
that someone happened to have all the rest of the pages because we made
the lookahead a bit less fair (problem #3). A nice confluence of
problems.

I expect #2 and #3 to be easy to fix, and I didn't look at the estimation
problem #1 at all (perhaps a stats puzzle designed by the TPC to trip us
up?).
Re: Streaming read-ready sequential scan code
Thank you to all of you for looking into this.

On Sat, May 18, 2024 at 12:47 AM Thomas Munro wrote:
> On Sat, May 18, 2024 at 11:30 AM Thomas Munro wrote:
> > Andres happened to have TPC-DS handy, and reproduced that regression
> > in q15. We tried some stuff and figured out that it requires
> > parallel_leader_participation=on, ie that this looks like some kind of
> > parallel fairness and/or timing problem. It seems to be a question of
> > which worker finishes up processing matching rows, and the leader gets
> > a ~10ms head start but may be a little more greedy with the new
> > streaming code. He tried reordering the table contents and then saw
> > 17 beat 16. So for q15, initial indications are that this isn't a
> > fundamental regression, it's just a test that is sensitive to some
> > arbitrary conditions.
> >
> > I'll try to figure out some more details about that, ie is it being
> > too greedy on small-ish tables,
>
> After more debugging, we learned a lot more things...
>
> 1. That query produces spectacularly bad estimates, so we finish up
> having to increase the number of buckets in a parallel hash join many
> times. That is quite interesting, but unrelated to new code.
>
> 2. Parallel hash join is quite slow at negotiating an increase in the
> number of hash buckets, if all of the input tuples are being filtered
> out by quals, because of the choice of where workers check for
> PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I
> think. I have put that on my todo list 'cause that's also my code, but
> it's not a new issue, it's just one that is now highlighted...
>
> 3. This bit of read_stream.c is exacerbating unfairness in the
> underlying scan, so that 1 and 2 come together and produce a nasty
> slowdown, which goes away if you change it like so:
>
> - BlockNumber blocknums[16];
> + BlockNumber blocknums[1];
>
> I will follow up after some more study.

So, if you are seeing the slow-down mostly go away by reducing the
blocknums array size, does the regression only appear when the scan data
is fully in shared buffers? Or is this blocknums' other use (dealing with
short reads)?

Is your theory that one worker ends up reading 16 blocks that should have
been distributed across multiple workers?

- Melanie
Re: Streaming read-ready sequential scan code
On Sun, May 19, 2024 at 7:00 AM Alexander Lakhin wrote:
> With blocknums[1], timing is changed, but the effect is not persistent.
> 10 query15 executions in a row, b7b0f3f27:
> 277.932 ms
> 281.805 ms
> 278.335 ms
> 281.565 ms
> 284.167 ms
> 283.171 ms
> 281.165 ms
> 281.615 ms
> 285.394 ms
> 277.301 ms

The bad time 10/10.

> b7b0f3f27~1:
> 159.789 ms
> 165.407 ms
> 160.893 ms
> 159.343 ms
> 160.936 ms
> 161.577 ms
> 161.637 ms
> 163.421 ms
> 163.143 ms
> 167.109 ms

The good time 10/10.

> b7b0f3f27 + blocknums[1]:
> 164.133 ms
> 280.920 ms
> 160.748 ms
> 163.182 ms
> 161.709 ms
> 161.998 ms
> 161.239 ms
> 276.256 ms
> 161.601 ms
> 160.384 ms

The good time 8/10, the bad time 2/10.

Thanks for checking! I bet all branches can show that flip/flop
instability in these adverse conditions, depending on random scheduling
details. I will start a new thread with a patch for the root cause of
that, ie problem #2 (this will need back-patching), and post a fix for #3
(the v17 blocknums[N] tweak affecting fairness/likelihood, which was
probably basically a bit of ill-advised premature optimisation) here in a
few days.
Re: Streaming read-ready sequential scan code
Hello Thomas,

18.05.2024 07:47, Thomas Munro wrote:
> After more debugging, we learned a lot more things...
>
> 1. That query produces spectacularly bad estimates, so we finish up
> having to increase the number of buckets in a parallel hash join many
> times. That is quite interesting, but unrelated to new code.
>
> 2. Parallel hash join is quite slow at negotiating an increase in the
> number of hash buckets, if all of the input tuples are being filtered
> out by quals, because of the choice of where workers check for
> PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I
> think. I have put that on my todo list 'cause that's also my code, but
> it's not a new issue, it's just one that is now highlighted...
>
> 3. This bit of read_stream.c is exacerbating unfairness in the
> underlying scan, so that 1 and 2 come together and produce a nasty
> slowdown, which goes away if you change it like so:
>
> - BlockNumber blocknums[16];
> + BlockNumber blocknums[1];
>
> I will follow up after some more study.

Thank you for the information!

Unfortunately, I can't see significant differences in my environment with
parallel_leader_participation=off.

With blocknums[1], timing is changed, but the effect is not persistent.
10 query15 executions in a row, b7b0f3f27:
277.932 ms
281.805 ms
278.335 ms
281.565 ms
284.167 ms
283.171 ms
281.165 ms
281.615 ms
285.394 ms
277.301 ms

b7b0f3f27~1:
159.789 ms
165.407 ms
160.893 ms
159.343 ms
160.936 ms
161.577 ms
161.637 ms
163.421 ms
163.143 ms
167.109 ms

b7b0f3f27 + blocknums[1]:
164.133 ms
280.920 ms
160.748 ms
163.182 ms
161.709 ms
161.998 ms
161.239 ms
276.256 ms
161.601 ms
160.384 ms

I placed PGDATA on tmpfs to rule out any blockdev specifics (increasing
the blockdev ra setting from 256 to 4096 didn't help me with PGDATA on
NVMe either.)

Best regards,
Alexander
Re: Streaming read-ready sequential scan code
On Sat, May 18, 2024 at 11:30 AM Thomas Munro wrote:
> Andres happened to have TPC-DS handy, and reproduced that regression
> in q15. We tried some stuff and figured out that it requires
> parallel_leader_participation=on, ie that this looks like some kind of
> parallel fairness and/or timing problem. It seems to be a question of
> which worker finishes up processing matching rows, and the leader gets
> a ~10ms head start but may be a little more greedy with the new
> streaming code. He tried reordering the table contents and then saw
> 17 beat 16. So for q15, initial indications are that this isn't a
> fundamental regression, it's just a test that is sensitive to some
> arbitrary conditions.
>
> I'll try to figure out some more details about that, ie is it being
> too greedy on small-ish tables,

After more debugging, we learned a lot more things...

1. That query produces spectacularly bad estimates, so we finish up having
to increase the number of buckets in a parallel hash join many times.
That is quite interesting, but unrelated to new code.

2. Parallel hash join is quite slow at negotiating an increase in the
number of hash buckets, if all of the input tuples are being filtered out
by quals, because of the choice of where workers check for
PHJ_GROWTH_NEED_MORE_BUCKETS. That could be improved quite easily I
think. I have put that on my todo list 'cause that's also my code, but
it's not a new issue, it's just one that is now highlighted...

3. This bit of read_stream.c is exacerbating unfairness in the underlying
scan, so that 1 and 2 come together and produce a nasty slowdown, which
goes away if you change it like so:

- BlockNumber blocknums[16];
+ BlockNumber blocknums[1];

I will follow up after some more study.
Re: Streaming read-ready sequential scan code
On Sat, May 18, 2024 at 8:09 AM Thomas Munro wrote:
> On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin wrote:
> > I decided to compare v17 vs v16 performance (as I did last year [1])
> > and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
> > benchmark, query15 (and several others, but I focused on this one):
> > Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percent
> > (229.84 > 151.03): pg_tpcds.query15
> > Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percent
> > (234.20 > 152.64): pg_tpcds.query15
> > Please look at the full html report attached in case you're interested.
> >
> > (I used my pg-mark tool to measure/analyze performance, but I believe
> > the same results can be seen without it.)
>
> Will investigate, but if it's easy for you to rerun, does it help if
> you increase Linux readahead, eg the blockdev --setra setting?

Andres happened to have TPC-DS handy, and reproduced that regression in
q15. We tried some stuff and figured out that it requires
parallel_leader_participation=on, ie that this looks like some kind of
parallel fairness and/or timing problem. It seems to be a question of
which worker finishes up processing matching rows, and the leader gets a
~10ms head start but may be a little more greedy with the new streaming
code. He tried reordering the table contents and then saw 17 beat 16.
So for q15, initial indications are that this isn't a fundamental
regression, it's just a test that is sensitive to some arbitrary
conditions.

I'll try to figure out some more details about that, ie is it being too
greedy on small-ish tables, and generally I do wonder about the
interactions between the heuristics and batching working at different
levels (OS, seq scan, read stream, hence my earlier readahead question,
which is likely a red herring) and how there might be unintended
consequences/interference patterns, but this particular case seems more
data dependent.
Re: Streaming read-ready sequential scan code
On Sat, May 18, 2024 at 1:00 AM Alexander Lakhin wrote:
> I decided to compare v17 vs v16 performance (as I did last year [1])
> and discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds)
> benchmark, query15 (and several others, but I focused on this one):
> Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percent
> (229.84 > 151.03): pg_tpcds.query15
> Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percent
> (234.20 > 152.64): pg_tpcds.query15
> Please look at the full html report attached in case you're interested.
>
> (I used my pg-mark tool to measure/analyze performance, but I believe
> the same results can be seen without it.)

Will investigate, but if it's easy for you to rerun, does it help if you
increase Linux readahead, eg the blockdev --setra setting?
Re: Streaming read-ready sequential scan code
Hello,

I decided to compare v17 vs v16 performance (as I did last year [1]) and
discovered that v17 loses to v16 in the pg_tpcds (s64da_tpcds) benchmark,
query15 (and several others, but I focused on this one):

Best pg-src-master--.* worse than pg-src-16--.* by 52.2 percent (229.84 > 151.03): pg_tpcds.query15
Average pg-src-master--.* worse than pg-src-16--.* by 53.4 percent (234.20 > 152.64): pg_tpcds.query15

Please look at the full html report attached in case you're interested.

(I used my pg-mark tool to measure/analyze performance, but I believe the
same results can be seen without it.)

`git bisect` for this performance degradation pointed at b7b0f3f27...

[1] https://www.postgresql.org/message-id/b32bed1b-0746-9b20-1472-4bdc9ca66d52%40gmail.com

Best regards,
Alexander

[Attachment: "Postgres benchmarking results" HTML report (run started
2024-05-14T12:04:26, ended 2024-05-15T05:44:15), comparing PostgreSQL
16.3 (REL_16_STABLE/c1664c8eef) and 17devel (master/2e810bdb7f) instances
across the pgbench_native, pgbench_reference, pg_tpch, and pg_tpcds
benchmarks; the flattened table data is not reproduced here.]
Re: Streaming read-ready sequential scan code
Hi, On 2024-04-08 09:36:59 +1200, Thomas Munro wrote: > I've been on the fence about that flag for sequential scan... Some > days I want to consider changing to READ_STREAM_DEFAULT and relying on > our "anti-heuristics" to suppress advice, which would work out the > same in most cases but might occasionally win big. Agreed, it's pretty easy to end up with a fairly "fragmented" set of a relation's buffers in s_b. OTOH, there might not be any need for the heuristic if we actually trigger reads asynchronously. > BTW looking at the branching in read-stream user patches that have an > initialisation step like yours, I wonder if it might ever make sense > to be able to change the callback on the fly from inside the callback, > so that you finish up with a branchless one doing most of the work. I > have no idea if it's worth it... I was wondering about that too, I dislike those branches. But instead of changing the callback, it seems like a better design would be to have another dedicated callback for that? There already is a dedicated branch for the "just starting up" path in read_stream_next_buffer(), so it'd be pretty much free to call another callback there. Greetings, Andres Freund
Re: Streaming read-ready sequential scan code
On Sun, Apr 7, 2024 at 1:34 PM Melanie Plageman wrote: > Attached v13 0001 is your fix and 0002 is a new version of the > sequential scan streaming read user. Off-list Andres mentioned that I > really ought to separate the parallel and serial sequential scan users > into two different callbacks. I've done that in the attached. It > actually makes the code used by the callbacks nicer and more readable > anyway (putting aside performance). I was able to measure a small > performance difference as well. Thanks. I changed a couple of very trivial things before pushing. +BlockNumber (*cb) (ReadStream *pgsr, void *private_data, + void *per_buffer_data); This type has a friendly name: ReadStreamBlockNumberCB. +scan->rs_read_stream = read_stream_begin_relation(READ_STREAM_SEQUENTIAL, I've been on the fence about that flag for sequential scan... Some days I want to consider changing to READ_STREAM_DEFAULT and relying on our "anti-heuristics" to suppress advice, which would work out the same in most cases but might occasionally win big. It might also hurt, I dunno, so I suspect we'd have to make it better first, which my patch in [1] was a first swing at, but I haven't researched that enough. So, kept this way! - * Read the next block of the scan relation into a buffer and pin that buffer - * before saving it in the scan descriptor. + * Read the next block of the scan relation from the read stream and pin that + * buffer before saving it in the scan descriptor. Changed to: * Read the next block of the scan relation from the read stream and save it * in the scan descriptor. It is already pinned. +static BlockNumber +heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data, +void *per_buffer_data) Changed argument names to match the function pointer type definition, "stream" and "callback_private_data". 
BTW looking at the branching in read-stream user patches that have an initialisation step like yours, I wonder if it might ever make sense to be able to change the callback on the fly from inside the callback, so that you finish up with a branchless one doing most of the work. I have no idea if it's worth it... [1] https://www.postgresql.org/message-id/CA%2BhUKGLLFvou5rx5FDhm-Pc9r4STQTFFmrx6SUV%2Bvk8fwMbreA%40mail.gmail.com
Re: Streaming read-ready sequential scan code
On Sat, Apr 6, 2024 at 9:25 PM Thomas Munro wrote: > > I found a bug in read_stream.c that could be hit with Melanie's > streaming seq scan patch with parallelism enabled and certain buffer > pool conditions. Short version: there is an edge case where an "if" > needed to be a "while", or we could lose a few blocks. Here's the fix > for that, longer explanation in commit message. Attached v13 0001 is your fix and 0002 is a new version of the sequential scan streaming read user. Off-list Andres mentioned that I really ought to separate the parallel and serial sequential scan users into two different callbacks. I've done that in the attached. It actually makes the code used by the callbacks nicer and more readable anyway (putting aside performance). I was able to measure a small performance difference as well. I've also added a few comments and improved existing comments. - Melanie From eded321df22bf472f147bd8f94b596d465355c70 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 5 Apr 2024 13:32:14 +1300 Subject: [PATCH v13 2/2] Use streaming IO in heapam sequential and TID range scans Instead of calling ReadBuffer() for each block, heap sequential scans and TID range scans now use the streaming read API introduced in b5a9b18cd0. Author: Melanie Plageman Reviewed-by: Andres Freund Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com --- src/backend/access/heap/heapam.c | 234 +-- src/include/access/heapam.h | 15 ++ 2 files changed, 176 insertions(+), 73 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 01bb2f4cc16..9d10d42b69b 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -223,6 +223,66 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = * */ +/* + * Streaming read API callback for parallel sequential scans. Returns the next + * block the caller wants from the read stream or InvalidBlockNumber when done. 
+ */ +static BlockNumber +heap_scan_stream_read_next_parallel(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + Assert(ScanDirectionIsForward(scan->rs_dir)); + Assert(scan->rs_base.rs_parallel); + + if (unlikely(!scan->rs_inited)) + { + /* parallel scan */ + table_block_parallelscan_startblock_init(scan->rs_base.rs_rd, + scan->rs_parallelworkerdata, + (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel); + + /* may return InvalidBlockNumber if there are no more blocks */ + scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd, + scan->rs_parallelworkerdata, + (ParallelBlockTableScanDesc) scan->rs_base.rs_parallel); + scan->rs_inited = true; + } + else + { + scan->rs_prefetch_block = table_block_parallelscan_nextpage(scan->rs_base.rs_rd, + scan->rs_parallelworkerdata, (ParallelBlockTableScanDesc) + scan->rs_base.rs_parallel); + } + + return scan->rs_prefetch_block; +} + +/* + * Streaming read API callback for serial sequential and TID range scans. + * Returns the next block the caller wants from the read stream or + * InvalidBlockNumber when done. + */ +static BlockNumber +heap_scan_stream_read_next_serial(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + if (unlikely(!scan->rs_inited)) + { + scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir); + scan->rs_inited = true; + } + else + scan->rs_prefetch_block = heapgettup_advance_block(scan, + scan->rs_prefetch_block, + scan->rs_dir); + + return scan->rs_prefetch_block; +} + /* * initscan - scan code common to heap_beginscan and heap_rescan * @@ -325,6 +385,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock) scan->rs_cbuf = InvalidBuffer; scan->rs_cblock = InvalidBlockNumber; + /* + * Initialize to ForwardScanDirection because it is most common and + * because heap scans go forward before going backward (e.g. 
CURSORs). + */ + scan->rs_dir = ForwardScanDirection; + scan->rs_prefetch_block = InvalidBlockNumber; + /* page-at-a-time fields are always invalid when not rs_inited */ /* @@ -462,12 +529,14 @@ heap_prepare_pagescan(TableScanDesc sscan) /* * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM. * - * Read the next block of the scan relation into a buffer and pin that buffer - * before saving it in the scan descriptor. + * Read the next block of the scan relation from the read stream and pin that + * buffer before saving it in the scan descriptor. */ static inline void heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir) { + Assert(scan->rs_read_
Re: Streaming read-ready sequential scan code
I found a bug in read_stream.c that could be hit with Melanie's streaming seq scan patch with parallelism enabled and certain buffer pool conditions. Short version: there is an edge case where an "if" needed to be a "while", or we could lose a few blocks. Here's the fix for that, longer explanation in commit message. From 43cef2d58141ba048e9349b0027afff148be5553 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 7 Apr 2024 12:36:44 +1200 Subject: [PATCH] Fix bug in read_stream.c. When we determine that a wanted block can't be combined with the current pending read, it's time to start that pending read to get it out of the way. An "if" in that code path should have been a "while", because it might take more than one go to get that job done. Otherwise the remaining part of a partially started read could be clobbered and we could lose some blocks. This was only broken for smaller ranges, as the more common case of io_combine_limit-sized ranges is handled earlier in the code and knows how to loop. Discovered while testing parallel sequential scans of partially cached tables. They have a ramp-down phase with ever smaller ranges of contiguous blocks, to be fair to parallel workers as the work runs out. Defect in commit b5a9b18c. --- src/backend/storage/aio/read_stream.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 9a70a81f7ae..f54dacdd914 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -363,7 +363,7 @@ read_stream_look_ahead(ReadStream *stream, bool suppress_advice) } /* We have to start the pending read before we can build another. */ - if (stream->pending_read_nblocks > 0) + while (stream->pending_read_nblocks > 0) { read_stream_start_pending_read(stream, suppress_advice); suppress_advice = false; -- 2.44.0
Re: Streaming read-ready sequential scan code
On Fri, Apr 5, 2024 at 7:28 PM Thomas Munro wrote: > > On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman > wrote: > > I haven't looked into or reviewed the memory prefetching part. > > > > While reviewing 0002, I realized that I don't quite see how > > read_stream_get_block() will be used in the fastpath -- which it > > claims in its comments. > > Comments improved. Ah, makes sense now. > The occasional CI problem I mentioned turned out to be > read_stream_reset() remembering a little too much state across > rescans. Fixed. Nice investigative work figuring this out. > Thanks both for the feedback on the ring buffer tweaks. Comments > updated. Here is the full stack of patches I would like to commit > very soon, though I may leave the memory prefetching part out for a > bit longer to see if I can find any downside. 0001 LGTM. I did not review 0002 or 0003. 0004 looks good except for one comment typo: /* * Tell call not to pin more than half the buffers in the ring. * This is a trade-off between look ahead distance and deferring * writeback and associated WAL traffic. */ call -> caller 0006, I noticed the commit message is missing an important comma: Instead of calling ReadBuffer() for each block heap sequential scans and TID range scans now use the streaming read API introduced in b5a9b18cd0. should be Instead of calling ReadBuffer() for each block, heap sequential scans and TID range scans now use the streaming read API introduced in b5a9b18cd0. - Melanie
Re: Streaming read-ready sequential scan code
On Sat, Apr 6, 2024 at 6:55 AM Melanie Plageman wrote: > On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro wrote: > > The interesting column is hot. The 200ms->211ms regression is due to > > the extra bookkeeping in the slow path. The rejiggered fastpath code > > fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can > > you reproduce that? > > I am able to reproduce the fast path solving the issue using Heikki's > example here [1] but in shared buffers (hot). > > master: 25 ms > stream read: 29 ms > stream read + fast path: 25 ms Great, thanks. > I haven't looked into or reviewed the memory prefetching part. > > While reviewing 0002, I realized that I don't quite see how > read_stream_get_block() will be used in the fastpath -- which it > claims in its comments. Comments improved. > Oh and why does READ_STREAM_DISABLE_FAST_PATH macro exist? Just for testing purposes. Behaviour should be identical to external observers either way, it's just a hand-rolled specialisation for certain parameters, and it's useful to be able to verify that and measure the speedup. I think it's OK to leave a bit of developer/testing scaffolding in the tree if it's likely to be useful again and especially if like this case it doesn't create any dead code. (Perhaps in later work we might find the right way to get the compiler to do the specialisation? It's so simple though.) The occasional CI problem I mentioned turned out to be read_stream_reset() remembering a little too much state across rescans. Fixed. Thanks both for the feedback on the ring buffer tweaks. Comments updated. Here is the full stack of patches I would like to commit very soon, though I may leave the memory prefetching part out for a bit longer to see if I can find any downside. From 54efd755040507b55672092907d53b4db30f5a06 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 5 Apr 2024 12:08:24 +1300 Subject: [PATCH v12 1/6] Improve read_stream.c's fast path. 
Unfortunately the "fast path" for cached scans that don't do any I/O was accidentally coded in a way that could only be triggered by pg_prewarm's usage pattern. We really need it to work for the streaming sequential scan patch. Refactor. Reviewed-by: Melanie Plageman Discussion: https://postgr.es/m/CA%2BhUKGKXZALJ%3D6aArUsXRJzBm%3Dqvc4AWp7%3DiJNXJQqpbRLnD_w%40mail.gmail.com --- src/backend/storage/aio/read_stream.c | 75 +++ 1 file changed, 31 insertions(+), 44 deletions(-) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 4f21262ff5e..b9e11a28312 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -169,7 +169,7 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index) /* * Ask the callback which block it would like us to read next, with a small * buffer in front to allow read_stream_unget_block() to work and to allow the - * fast path to work in batches. + * fast path to skip this function and work directly from the array. */ static inline BlockNumber read_stream_get_block(ReadStream *stream, void *per_buffer_data) @@ -578,13 +578,12 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data) if (likely(stream->fast_path)) { BlockNumber next_blocknum; - bool need_wait; /* Fast path assumptions. */ Assert(stream->ios_in_progress == 0); Assert(stream->pinned_buffers == 1); Assert(stream->distance == 1); - Assert(stream->pending_read_nblocks == 1); + Assert(stream->pending_read_nblocks == 0); Assert(stream->per_buffer_data_size == 0); /* We're going to return the buffer we pinned last time. */ @@ -594,40 +593,29 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data) buffer = stream->buffers[oldest_buffer_index]; Assert(buffer != InvalidBuffer); - /* - * Pin a buffer for the next call. Same buffer entry, and arbitrary - * I/O entry (they're all free). 
- */ - need_wait = StartReadBuffer(&stream->ios[0].op, - &stream->buffers[oldest_buffer_index], - stream->pending_read_blocknum, - stream->advice_enabled ? - READ_BUFFERS_ISSUE_ADVICE : 0); - - /* Choose the block the next call will pin. */ + /* Choose the next block to pin. */ if (unlikely(stream->blocknums_next == stream->blocknums_count)) read_stream_fill_blocknums(stream); next_blocknum = stream->blocknums[stream->blocknums_next++]; - /* - * Fast return if the next call doesn't require I/O for the buffer we - * just pinned, and we have a block number to give it as a pending - * read. - */ - if (likely(!need_wait && next_blocknum != InvalidBlockNumber)) + if (likely(next_blocknum != InvalidBlockNumber)) { - stream->pending_read_blocknum = next_blocknum; - return buffer; - } - - /* - * For anything more complex, set up some more state and take the slow - * path next time. - */ - stream->fast_path = false; +
Re: Streaming read-ready sequential scan code
On Fri, Apr 5, 2024 at 12:15 AM Thomas Munro wrote: > > Yeah, I plead benchmarking myopia, sorry. The fastpath as committed > is only reached when distance goes 2->1, as pg_prewarm does. Oops. > With the attached minor rearrangement, it works fine. I also poked > some more at that memory prefetcher. Here are the numbers I got on a > desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled, > cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NVMe, > GCC -O3). > > create table t (i int, filler text) with (fillfactor=10); > insert into t > select g, repeat('x', 900) from generate_series(1, 56) g; > vacuum freeze t; > set max_parallel_workers_per_gather = 0; > > select count(*) from t; > > cold = must be read from actual disk (Linux drop_caches) > warm = read from linux page cache > hot = already in pg cache via pg_prewarm > > cold warmhot > master2479ms 886ms 200ms > seqscan 2498ms 716ms 211ms <-- regression > seqscan + fastpath2493ms 711ms 200ms <-- fixed, I think? > seqscan + memprefetch 2499ms 716ms 182ms > seqscan + fastpath + memprefetch 2505ms 710ms 170ms <-- \O/ > > Cold has no difference. That's just my disk demonstrating Linux RA at > 128kB (default); random I/O is obviously a more interesting story. > It's consistently a smidgen faster with Linux RA set to 2MB (as in > blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect > probably also increases on fancier faster storage than what I have on > hand: > > cold > master1775ms > seqscan + fastpath + memprefetch 1700ms > > Warm is faster as expected (fewer system calls schlepping data > kernel->userspace). > > The interesting column is hot. The 200ms->211ms regression is due to > the extra bookkeeping in the slow path. The rejiggered fastpath code > fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can > you reproduce that? I am able to reproduce the fast path solving the issue using Heikki's example here [1] but in shared buffers (hot). 
master: 25 ms stream read: 29 ms stream read + fast path: 25 ms I haven't looked into or reviewed the memory prefetching part. While reviewing 0002, I realized that I don't quite see how read_stream_get_block() will be used in the fastpath -- which it claims in its comments. read_stream_next_buffer() is the only caller of read_stream_look_ahead()->read_stream_get_block(), and if fast_path is true, read_stream_next_buffer() always returns before calling read_stream_look_ahead(). Maybe I am missing something. I see fast_path uses read_stream_fill_blocknums() to invoke the callback. Oh and why does READ_STREAM_DISABLE_FAST_PATH macro exist? Otherwise 0002 looks good to me. I haven't reviewed 0003 or 0004. I attached a new version (v11) because I noticed an outdated comment in my seq scan streaming read user patch (0001). The other patches in the set are untouched from your versions besides adding author/reviewer info in commit message for 0002. - Melanie [1] https://www.postgresql.org/message-id/3b0f3701-addd-4629-9257-cf28e1a6e6a1%40iki.fi From acbd7172f10857d49d4b5d4afc4efab704b33486 Mon Sep 17 00:00:00 2001 From: David Rowley Date: Mon, 10 Jul 2023 11:22:34 +0200 Subject: [PATCH v11 3/4] Add pg_prefetch_mem() macro to load cache lines. Initially mapping to GCC, Clang and MSVC builtins. Discussion: https://postgr.es/m/CAEepm%3D2y9HM9QP%2BHhRZdQ3pU6FShSMyu%3DV1uHXhQ5gG-dketHg%40mail.gmail.com --- config/c-compiler.m4 | 17 configure | 40 ++ configure.ac | 3 +++ meson.build| 1 + src/include/c.h| 8 src/include/pg_config.h.in | 3 +++ 6 files changed, 72 insertions(+) diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 3268a780bb0..4cc02f97601 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -355,6 +355,23 @@ AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1, [Define to 1 if your compiler understands $1.]) fi])# PGAC_CHECK_BUILTIN_FUNC +# PGAC_CHECK_BUILTIN_VOID_FUNC +# --- +# Variant for void functions. 
+AC_DEFUN([PGAC_CHECK_BUILTIN_VOID_FUNC], +[AC_CACHE_CHECK(for $1, pgac_cv$1, +[AC_LINK_IFELSE([AC_LANG_PROGRAM([ +void +call$1($2) +{ +$1(x); +}], [])], +[pgac_cv$1=yes], +[pgac_cv$1=no])]) +if test x"${pgac_cv$1}" = xyes ; then +AC_DEFINE_UNQUOTED(AS_TR_CPP([HAVE$1]), 1, + [Define to 1 if your compiler understands $1.]) +fi])# PGAC_CHECK_BUILTIN_VOID_FUNC # PGAC_CHECK_BUILTIN_FUNC_PTR diff --git a/configure b/configure index 36feeafbb23..79b78c33ddc 100755 --- a/configure +++ b/configure @@ -15543,6 +15543,46 @@ _ACEOF fi +# Can we use a built-in to prefetch memory? +{ $as_echo "$as_me:${a
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 12:39 PM Andres Freund wrote: > > On 2024-04-04 22:37:39 +1300, Thomas Munro wrote: > > On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro wrote: > > > Alright what about this? > > I think it's probably worth adding a bit more of the commit message to the > function comment. Yes, there's a bit in one of the return branches, but that's > not what you're going to look at when just checking what the function does. Agreed about the comment. I kept thinking that BAS_BULKREAD should maybe return nbuffers - 1, but I couldn't convince myself why. Otherwise v2-0001-Allow-BufferAccessStrategy-to-limit-pin-count LGTM. - Melanie
Re: Streaming read-ready sequential scan code
Yeah, I plead benchmarking myopia, sorry. The fastpath as committed is only reached when distance goes 2->1, as pg_prewarm does. Oops. With the attached minor rearrangement, it works fine. I also poked some more at that memory prefetcher. Here are the numbers I got on a desktop system (Intel i9-9900 @ 3.1GHz, Linux 6.1, turbo disabled, cpufreq governor=performance, 2MB huge pages, SB=8GB, consumer NVMe, GCC -O3).

create table t (i int, filler text) with (fillfactor=10);
insert into t select g, repeat('x', 900) from generate_series(1, 56) g;
vacuum freeze t;
set max_parallel_workers_per_gather = 0;

select count(*) from t;

cold = must be read from actual disk (Linux drop_caches)
warm = read from linux page cache
hot = already in pg cache via pg_prewarm

                                  cold    warm   hot
master                            2479ms  886ms  200ms
seqscan                           2498ms  716ms  211ms  <-- regression
seqscan + fastpath                2493ms  711ms  200ms  <-- fixed, I think?
seqscan + memprefetch             2499ms  716ms  182ms
seqscan + fastpath + memprefetch  2505ms  710ms  170ms  <-- \O/

Cold has no difference. That's just my disk demonstrating Linux RA at 128kB (default); random I/O is obviously a more interesting story. It's consistently a smidgen faster with Linux RA set to 2MB (as in blockdev --setra 4096 /dev/nvmeXXX), and I believe this effect probably also increases on fancier faster storage than what I have on hand:

                                  cold
master                            1775ms
seqscan + fastpath + memprefetch  1700ms

Warm is faster as expected (fewer system calls schlepping data kernel->userspace). The interesting column is hot. The 200ms->211ms regression is due to the extra bookkeeping in the slow path. The rejiggered fastpath code fixes it for me, or maybe sometimes shows an extra 1ms. Phew. Can you reproduce that? The memory prefetching trick, on top of that, seems to be a good optimisation so far. 
Note that that's not an entirely independent trick, it's something we can only do now that we can see into the future; it's the next level up of prefetching, worth doing around 60ns before you need the data I guess. Who knows how thrashed the cache might be before the caller gets around to accessing that page, but there doesn't seem to be much of a cost or downside to this bet. We know there are many more opportunities like that[1] but I don't want to second-guess the AM here, I'm just betting that the caller is going to look at the header. Unfortunately there seems to be a subtle bug hiding somewhere in here, visible on macOS on CI. Looking into that, going to find my Mac... [1] https://www.postgresql.org/message-id/flat/CAApHDvpTRx7hqFZGiZJ%3Dd9JN4h1tzJ2%3Dxt7bM-9XRmpVj63psQ%40mail.gmail.com From 74b8cde45a8babcec7b52b06bdb8ea046a0a966f Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Fri, 5 Apr 2024 13:32:14 +1300 Subject: [PATCH v10 1/4] Use streaming I/O in heapam sequential scan. 
--- src/backend/access/heap/heapam.c | 100 +-- src/include/access/heapam.h | 15 + 2 files changed, 97 insertions(+), 18 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index dada2ecd1e3..f7946a39fd9 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -223,6 +223,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = * */ +static BlockNumber +heap_scan_stream_read_next(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + if (unlikely(!scan->rs_inited)) + { + scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir); + scan->rs_inited = true; + } + else + scan->rs_prefetch_block = heapgettup_advance_block(scan, + scan->rs_prefetch_block, + scan->rs_dir); + + return scan->rs_prefetch_block; +} + /* * initscan - scan code common to heap_beginscan and heap_rescan * @@ -325,6 +344,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock) scan->rs_cbuf = InvalidBuffer; scan->rs_cblock = InvalidBlockNumber; + /* + * Initialize to ForwardScanDirection because it is most common and heap + * scans usually must go forwards before going backward. + */ + scan->rs_dir = ForwardScanDirection; + scan->rs_prefetch_block = InvalidBlockNumber; + /* page-at-a-time fields are always invalid when not rs_inited */ /* @@ -462,12 +488,14 @@ heap_prepare_pagescan(TableScanDesc sscan) /* * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM. * - * Read the next block of the scan relation into a buffer and pin that buffer - * before saving it in the scan descriptor. + * Read the next block of the scan relation from the read stream and pin that + * buff
Re: Streaming read-ready sequential scan code
On Fri, Apr 5, 2024 at 4:20 AM Melanie Plageman wrote: > So, sequential scan does not have per-buffer data. I did some logging > and the reason most fully-in-SB sequential scans don't use the fast > path is because read_stream->pending_read_nblocks is always 0. Hnghghghgh... right, sorry I guessed the wrong reason, it turns out that I made a fast path just a little too specialised for pg_prewarm. Working on it...
Re: Streaming read-ready sequential scan code
Hi, On 2024-04-04 22:37:39 +1300, Thomas Munro wrote: > On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro wrote: > > Alright what about this? I think it's probably worth adding a bit more of the commit message to the function comment. Yes, there's a bit in one of the return branches, but that's not what you're going to look at when just checking what the function does. > From e610bc78a2e3ecee50bd897e35084469d00bbac5 Mon Sep 17 00:00:00 2001 > From: Thomas Munro > Date: Thu, 4 Apr 2024 21:11:06 +1300 > Subject: [PATCH v2 2/2] Increase default vacuum_buffer_usage_limit to 2MB. > > The BAS_VACUUM ring size has been 256kB since commit d526575f. Commit > 1cbbee03 made it configurable but retained the traditional default. > The correct default size has been debated for years, but 256kB is > certainly very small. VACUUM soon needs to write back data it dirtied > only 32 blocks ago, which usually requires flushing the WAL. New > experiments in prefetching pages for VACUUM exacerbated the problem by > crashing into dirty data even sooner. Let's make the default 2MB. > That's 1.5% of the default toy buffer pool size, and 0.2% of 1GB, which > would be considered a small shared_buffers setting for a real system > these days. Users are still free to set the GUC to a different value. +1. Independent of any other changes, this improves the default vacuum performance substantially. We might want to dynamically size the default at some point - but we probably should overhaul the infrastructure first... Greetings, Andres Freund
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 7:45 AM Thomas Munro wrote: > > On Thu, Apr 4, 2024 at 8:02 PM David Rowley wrote: > > 3a4a3537a > > latency average = 34.497 ms > > latency average = 34.538 ms > > > > 3a4a3537a + read_stream_for_seqscans.patch > > latency average = 40.923 ms > > latency average = 41.415 ms > > > > i.e. no meaningful change from the refactor, but a regression from a > > cached workload that changes the page often without doing much work in > > between with the read stream patch. > > I ran Heikki's test except I ran the "insert" 4 times to get a table > of 4376MB according to \d+. On my random cloud ARM server (SB=8GB, > huge pages, parallelism disabled), I see a speedup 1290ms -> 1046ms > when the data is in Linux cache and PG is not prewarmed, roughly as he > reported. Good. > > If I pg_prewarm first, I see that slowdown 260ms -> 294ms. Trying > things out to see what works, I got that down to 243ms (ie beat > master) by inserting a memory prefetch: > > --- a/src/backend/storage/aio/read_stream.c > +++ b/src/backend/storage/aio/read_stream.c > @@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void > **per_buffer_data) > /* Prepare for the next call. */ > read_stream_look_ahead(stream, false); > > + > __builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index])); > > Maybe that's a solution to a different problem that just happens to > more than make up the difference in this case, and it may be > questionable whether that cache line will survive long enough to help > you, but this one-tuple-per-page test likes it... Hmm, to get a more > realistic table than the one-tuple-per-page one, I tried doubling a > tenk1 table until it reached 2759MB and then I got a post-prewarm > regression 702ms -> 721ms, and again I can beat master by memory > prefetching: 689ms. > > Annoyingly, with the same table I see no difference between the actual > pg_prewarm('giga') time: around 155ms for both. 
> pg_prewarm is able to > use the 'fast path' I made pretty much just to be able to minimise > regression in that (probably stupid) all-cached test that doesn't even > look at the page contents. Unfortunately seq scan can't use it > because it has per-buffer data, which is one of the things it can't do > (because of a space management issue). Maybe I should try to find a > way to fix that. So, sequential scan does not have per-buffer data. I did some logging and the reason most fully-in-SB sequential scans don't use the fast path is because read_stream->pending_read_nblocks is always 0. When read_stream->distance stays 1 (expected for all-in-SB as it is initialized to 1 and we don't want distance to ramp up), read_stream_look_ahead() never increments read_stream->pending_read_nblocks because it sets it to 1 the first time it is called and then the conditions of the while loop are not met again: while (stream->ios_in_progress < stream->max_ios && stream->pinned_buffers + stream->pending_read_nblocks < stream->distance) distance is 1, pending_read_nblocks is 1, thus we only loop once and don't increment pending_read_nblocks. prewarm is only able to use the fast path because it passes READ_STREAM_FULL and thus initializes read_stream->distance to a higher initial value. I added some logging to see if any of the sequential scans in the regression suite used the fast path. The one example I see of the fast path being used is a temp table IO stats test in src/test/regress/sql/stats.sql. I didn't check exactly what conditions led it to do this. But we probably want seq scans which are all in SB to use the fast path. - Melanie
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 3:02 AM David Rowley wrote: > > On Thu, 4 Apr 2024 at 16:45, David Rowley wrote: > > I've pushed the v9-0001 with that rename done. > > I've now just pushed the 0002 patch with some revisions: Thanks! > 1. The function declarations you added for heapgettup_advance_block() > and heapgettup_initial_block() didn't match the properties of their > definitions. You'd declared both of these static inline but neither > of these were. Ah, they needed to be defined static but I intentionally left off the inline in the definition and only put it in the forward declaration because I thought that was correct. Anyway, I'm fine with how you did it in the end. > 2. I felt inclined to rename heapfetchbuf() to heapfetchnextbuf() as > that's effectively what it does with v8-0002, however, that's just too > many words all shoved together, so I renamed it to > heap_fetch_next_buffer(). Sounds good. > 3. I changed heapgettup_initial_block() to pg_noinline as it both > makes more sense to have this out of line and it also fixed a small > performance regression. Ah, I guess it is in the unlikely path. I often forget that noinline exists. It's interesting that that made a noticeable difference since it is a pretty short function. Thanks for taking such a close look! > Looks like I also failed to grep for all the remaining instances of > "heapgetpage" in 44086b097. Those are now fixed by 3a4a3537a. > > I also rebased the 0003 patch which I've attached as a raw patch. Thanks! > FWIW, using Heikki's test in [1] with a pg_prewarm each time after > restarting the instance. No parallel aggregate. 
> > drowley@amd3990x:~$ cat bench.sql > select count(*) from giga; > > drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres | grep latency > > 44086b097~1 > latency average = 34.323 ms > latency average = 34.332 ms > > 44086b097 > latency average = 34.811 ms > latency average = 34.862 ms > > 3a4a3537a > latency average = 34.497 ms > latency average = 34.538 ms > > 3a4a3537a + read_stream_for_seqscans.patch > latency average = 40.923 ms > latency average = 41.415 ms > > i.e. no meaningful change from the refactor, but a regression from a > cached workload that changes the page often without doing much work in > between with the read stream patch. Cool. Thanks for testing this out. Sounds like Thomas did some analysis of how to resolve this for the streaming read user, and I will do some too. - Melanie
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 8:02 PM David Rowley wrote: > 3a4a3537a > latency average = 34.497 ms > latency average = 34.538 ms > > 3a4a3537a + read_stream_for_seqscans.patch > latency average = 40.923 ms > latency average = 41.415 ms > > i.e. no meaningful change from the refactor, but a regression from a > cached workload that changes the page often without doing much work in > between with the read stream patch. I ran Heikki's test except I ran the "insert" 4 times to get a table of 4376MB according to \d+. On my random cloud ARM server (SB=8GB, huge pages, parallelism disabled), I see a speedup 1290ms -> 1046ms when the data is in Linux cache and PG is not prewarmed, roughly as he reported. Good. If I pg_prewarm first, I see that slowdown 260ms -> 294ms. Trying things out to see what works, I got that down to 243ms (ie beat master) by inserting a memory prefetch:

--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
 	/* Prepare for the next call. */
 	read_stream_look_ahead(stream, false);
+
+	__builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));

Maybe that's a solution to a different problem that just happens to more than make up the difference in this case, and it may be questionable whether that cache line will survive long enough to help you, but this one-tuple-per-page test likes it... Hmm, to get a more realistic table than the one-tuple-per-page one, I tried doubling a tenk1 table until it reached 2759MB and then I got a post-prewarm regression 702ms -> 721ms, and again I can beat master by memory prefetching: 689ms. Annoyingly, with the same table I see no difference between the actual pg_prewarm('giga') time: around 155ms for both. pg_prewarm is able to use the 'fast path' I made pretty much just to be able to minimise regression in that (probably stupid) all-cached test that doesn't even look at the page contents. 
Unfortunately seq scan can't use it because it has per-buffer data, which is one of the things it can't do (because of a space management issue). Maybe I should try to find a way to fix that. > I'm happy to run further benchmarks, but for the remainder of the > committer responsibility for the next patches, I'm going to leave that > to Thomas. Thanks!
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 10:31 PM Thomas Munro wrote: > Alright what about this? Forgot to git add a change, new version. From 6dea2983abf8d608c34e02351d70694de99f25f2 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 4 Apr 2024 20:31:26 +1300 Subject: [PATCH v2 1/2] Allow BufferAccessStrategy to limit pin count. When pinning extra buffers to look ahead, users of a strategy are in danger of pinning a lot of the buffers in the ring, or even more than the ring size. For some strategies, that means "escaping" from the ring, and in others it means forcing dirty data to disk very frequently with associated WAL flushing. Since external code has no insight into any of that, allow individual strategy types to expose a clamp that should be applied when deciding how many buffers to pin at once. Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com --- src/backend/storage/aio/read_stream.c | 5 src/backend/storage/buffer/freelist.c | 35 +++ src/include/storage/bufmgr.h | 1 + 3 files changed, 41 insertions(+) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 4f21262ff5..eab87f6f73 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -419,6 +419,7 @@ read_stream_begin_relation(int flags, size_t size; int16 queue_size; int16 max_ios; + int strategy_pin_limit; uint32 max_pinned_buffers; Oid tablespace_id; SMgrRelation smgr; @@ -460,6 +461,10 @@ read_stream_begin_relation(int flags, max_pinned_buffers = Min(max_pinned_buffers, PG_INT16_MAX - io_combine_limit - 1); + /* Give the strategy a chance to limit the number of buffers we pin. */ + strategy_pin_limit = GetAccessStrategyPinLimit(strategy); + max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers); + /* Don't allow this backend to pin more than its share of buffers. 
*/ if (SmgrIsTemp(smgr)) LimitAdditionalLocalPins(&max_pinned_buffers); diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 3611357fa3..c69590d6d8 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -629,6 +629,41 @@ GetAccessStrategyBufferCount(BufferAccessStrategy strategy) return strategy->nbuffers; } +/* + * GetAccessStrategyPinLimit -- get cap of number of buffers that can be pinned + * + * Strategies can specify the maximum number of buffers that a user should pin + * at once when performing look-ahead. Callers should combine this number with + * other relevant limits and take the minimum. + */ +int +GetAccessStrategyPinLimit(BufferAccessStrategy strategy) +{ + if (strategy == NULL) + return NBuffers; + + switch (strategy->btype) + { + case BAS_BULKREAD: + + /* +* Since BAS_BULKREAD uses StrategyRejectBuffer(), dirty buffers +* shouldn't be a problem and the caller is free to pin up to the +* entire ring at once. +*/ + return strategy->nbuffers; + + default: + + /* +* Tell call not to pin more than half the buffers in the ring. +* This is a trade-off between look ahead distance and deferring +* writeback and associated WAL traffic. 
+*/ + return strategy->nbuffers / 2; + } +} + /* * FreeAccessStrategy -- release a BufferAccessStrategy object * diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index f380f9d9a6..07ba1a6050 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -318,6 +318,7 @@ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype); extern BufferAccessStrategy GetAccessStrategyWithSize(BufferAccessStrategyType btype, int ring_size_kb); extern int GetAccessStrategyBufferCount(BufferAccessStrategy strategy); +extern int GetAccessStrategyPinLimit(BufferAccessStrategy strategy); extern void FreeAccessStrategy(BufferAccessStrategy strategy); -- 2.39.3 (Apple Git-146) From e610bc78a2e3ecee50bd897e35084469d00bbac5 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 4 Apr 2024 21:11:06 +1300 Subject: [PATCH v2 2/2] Increase default vacuum_buffer_usage_limit to 2MB. The BAS_VACUUM ring size has been 256kB since commit d526575f. Commit 1cbbee03 made it configurable but retained the traditional default.
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 11:13 AM Andres Freund wrote: > The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so, > should we apply that only if the strategy doesn't use the > StrategyRejectBuffer() logic? Hmm, I don't really know, but that sounds plausible. What do you think about the attached? > I think for VACUUM we should probably go a bit further. There's no comparable > L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more > expensive, compared to a seqscan. I'd go for at least 4x-8x. Alright, what about this? From 6dea2983abf8d608c34e02351d70694de99f25f2 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Thu, 4 Apr 2024 20:31:26 +1300 Subject: [PATCH 1/2] Allow BufferAccessStrategy to limit pin count. When pinning extra buffers to look ahead, users of a strategy are in danger of pinning a lot of the buffers in the ring, or even more than the ring size. For some strategies, that means "escaping" from the ring, and in others it means forcing dirty data to disk very frequently with associated WAL flushing. Since external code has no insight into any of that, allow individual strategy types to expose a clamp that should be applied when deciding how many buffers to pin at once. 
Discussion: https://postgr.es/m/CAAKRu_aJXnqsyZt6HwFLnxYEBgE17oypkxbKbT1t1geE_wvH2Q%40mail.gmail.com --- src/backend/storage/aio/read_stream.c | 5 src/backend/storage/buffer/freelist.c | 35 +++ src/include/storage/bufmgr.h | 1 + 3 files changed, 41 insertions(+) diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c index 4f21262ff5..eab87f6f73 100644 --- a/src/backend/storage/aio/read_stream.c +++ b/src/backend/storage/aio/read_stream.c @@ -419,6 +419,7 @@ read_stream_begin_relation(int flags, size_t size; int16 queue_size; int16 max_ios; + int strategy_pin_limit; uint32 max_pinned_buffers; Oid tablespace_id; SMgrRelation smgr; @@ -460,6 +461,10 @@ read_stream_begin_relation(int flags, max_pinned_buffers = Min(max_pinned_buffers, PG_INT16_MAX - io_combine_limit - 1); + /* Give the strategy a chance to limit the number of buffers we pin. */ + strategy_pin_limit = GetAccessStrategyPinLimit(strategy); + max_pinned_buffers = Min(strategy_pin_limit, max_pinned_buffers); + /* Don't allow this backend to pin more than its share of buffers. */ if (SmgrIsTemp(smgr)) LimitAdditionalLocalPins(&max_pinned_buffers); diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 3611357fa3..c69590d6d8 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -629,6 +629,41 @@ GetAccessStrategyBufferCount(BufferAccessStrategy strategy) return strategy->nbuffers; } +/* + * GetAccessStrategyPinLimit -- get cap of number of buffers that can be pinned + * + * Strategies can specify the maximum number of buffers that a user should pin + * at once when performing look-ahead. Callers should combine this number with + * other relevant limits and take the minimum. 
+ */ +int +GetAccessStrategyPinLimit(BufferAccessStrategy strategy) +{ + if (strategy == NULL) + return NBuffers; + + switch (strategy->btype) + { + case BAS_BULKREAD: + + /* +* Since BAS_BULKREAD uses StrategyRejectBuffer(), dirty buffers +* shouldn't be a problem and the caller is free to pin up to the +* entire ring at once. +*/ + return strategy->nbuffers; + + default: + + /* +* Tell call not to pin more than half the buffers in the ring. +* This is a trade-off between look ahead distance and deferring +* writeback and associated WAL traffic. +*/ + return strategy->nbuffers / 2; + } +} + /* * FreeAccessStrategy -- release a BufferAccessStrategy object * diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index f380f9d9a6..07ba1a6050 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -318,6 +318,7 @@ extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype); extern BufferAccessStrategy GetAccessStrategyWithSize(BufferAccessStrategyType btype, int ring_size_kb); extern int GetAccessStrategyBufferCount(BufferAccessStrategy strategy); +extern int GetAccessStrategyPinLimit(BufferAccessStrategy
Re: Streaming read-ready sequential scan code
On Thu, 4 Apr 2024 at 16:45, David Rowley wrote: > I've pushed the v9-0001 with that rename done. I've now just pushed the 0002 patch with some revisions: 1. The function declarations you added for heapgettup_advance_block() and heapgettup_initial_block() didn't match the properties of their definitions. You'd declared both of these static inline but neither of the definitions was. 2. I felt inclined to rename heapfetchbuf() to heapfetchnextbuf() as that's effectively what it does with v8-0002; however, that's just too many words all shoved together, so I renamed it to heap_fetch_next_buffer(). 3. I changed heapgettup_initial_block() to pg_noinline as it both makes more sense to have this out of line and it also fixed a small performance regression. Looks like I also failed to grep for all the remaining instances of "heapgetpage" in 44086b097. Those are now fixed by 3a4a3537a. I also rebased the 0003 patch, which I've attached as a raw patch. FWIW, using Heikki's test in [1] with a pg_prewarm each time after restarting the instance. No parallel aggregate.

drowley@amd3990x:~$ cat bench.sql
select count(*) from giga;

drowley@amd3990x:~$ pgbench -n -f bench.sql -T 120 postgres | grep latency

44086b097~1
latency average = 34.323 ms
latency average = 34.332 ms

44086b097
latency average = 34.811 ms
latency average = 34.862 ms

3a4a3537a
latency average = 34.497 ms
latency average = 34.538 ms

3a4a3537a + read_stream_for_seqscans.patch
latency average = 40.923 ms
latency average = 41.415 ms

i.e. no meaningful change from the refactor, but a regression from a cached workload that changes the page often without doing much work in between with the read stream patch. I'm happy to run further benchmarks, but for the remainder of the committer responsibility for the next patches, I'm going to leave that to Thomas. 
David [1] https://www.postgresql.org/message-id/3b0f3701-addd-4629-9257-cf28e1a6e...@iki.fi diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index dada2ecd1e..f7946a39fd 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -223,6 +223,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = * */ +static BlockNumber +heap_scan_stream_read_next(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + if (unlikely(!scan->rs_inited)) + { + scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir); + scan->rs_inited = true; + } + else + scan->rs_prefetch_block = heapgettup_advance_block(scan, + scan->rs_prefetch_block, + scan->rs_dir); + + return scan->rs_prefetch_block; +} + /* * initscan - scan code common to heap_beginscan and heap_rescan * @@ -325,6 +344,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock) scan->rs_cbuf = InvalidBuffer; scan->rs_cblock = InvalidBlockNumber; + /* +* Initialize to ForwardScanDirection because it is most common and heap +* scans usually must go forwards before going backward. +*/ + scan->rs_dir = ForwardScanDirection; + scan->rs_prefetch_block = InvalidBlockNumber; + /* page-at-a-time fields are always invalid when not rs_inited */ /* @@ -462,12 +488,14 @@ heap_prepare_pagescan(TableScanDesc sscan) /* * heap_fetch_next_buffer - read and pin the next block from MAIN_FORKNUM. * - * Read the next block of the scan relation into a buffer and pin that buffer - * before saving it in the scan descriptor. + * Read the next block of the scan relation from the read stream and pin that + * buffer before saving it in the scan descriptor. 
*/ static inline void heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir) { + Assert(scan->rs_read_stream); + /* release previous scan buffer, if any */ if (BufferIsValid(scan->rs_cbuf)) { @@ -482,25 +510,23 @@ heap_fetch_next_buffer(HeapScanDesc scan, ScanDirection dir) */ CHECK_FOR_INTERRUPTS(); - if (unlikely(!scan->rs_inited)) + /* +* If the scan direction is changing, reset the prefetch block to the +* current block. Otherwise, we will incorrectly prefetch the blocks +* between the prefetch block and the current block again before +* prefetching blocks in the new, correct scan direction. +*/ + if (unlikely(scan->rs_dir != dir)) { - scan->rs_cblock = heapgettup_initial_block(scan, dir
Re: Streaming read-ready sequential scan code
On Thu, 4 Apr 2024 at 14:38, Melanie Plageman wrote: > Looking at it in the code, I am wondering if we should call > heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear > that it is prepping a page to be scanned. You choose whatever you > think is best. I ended up calling it heap_prepare_pagescan() as I started to think prep/prepare should come first. I don't think it's perfect as the intended meaning is heap_prepare_page_for_scanning_in_pagemode(), but that's obviously too long. I've pushed the v9-0001 with that rename done. David
Re: Streaming read-ready sequential scan code
On Wed, Apr 3, 2024 at 9:08 PM David Rowley wrote: > > On Thu, 4 Apr 2024 at 06:03, Melanie Plageman > wrote: > > Attached v8 is rebased over current master (which now has the > > streaming read API). > > I've looked at the v8-0001 patch. Thanks for taking a look! > I wasn't too keen on heapbuildvis() as a function name for an external > function. Since it also does pruning work, it seemed weird to make it > sound like it only did visibility work. Per our offline discussion > about names, I've changed it to what you suggested which is > heap_page_prep(). Looking at it in the code, I am wondering if we should call heap_page_prep() heap_scan_page_prep(). Just wondering if it is clear that it is prepping a page to be scanned. You choose whatever you think is best. > Aside from that, there was an outdated comment: "In pageatatime mode, > heapgetpage() already did visibility checks," which was no longer true > as that's done in heapbuildvis() (now heap_page_prep()). > > I also did a round of comment adjustments as there were a few things I > didn't like, e.g: > > + * heapfetchbuf - subroutine for heapgettup() > > This is also used in heapgettup_pagemode(), so I thought it was a bad > idea for a function to list places it thinks it's being called. I > also thought it redundant to write "This routine" in the function head > comment. I think "this routine" is implied by the context. I ended up > with: > > /* > * heapfetchbuf - read and pin the given MAIN_FORKNUM block number. > * > * Read the specified block of the scan relation into a buffer and pin that > * buffer before saving it in the scan descriptor. > */ > > I'm happy with your changes to heapam_scan_sample_next_block(). I did > adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively > the same as the seqscan version, just with "seqscan" swapped for > "sample scan". That all is fine with me. 
> The only other thing I adjusted there was to use "blockno" in some > places where you were using hscan->rs_cblock. These all come after > the "hscan->rs_cblock = blockno;" line. My thoughts here are that this > is more likely to avoid reading the value from the struct again if the > compiler isn't happy that the two values are still equivalent for some > reason. Even if the compiler is happy today, it would only take a > code change to pass hscan to some external function for the compiler > to perhaps be uncertain if that function has made an adjustment to > rs_cblock and go with the safe option of pulling the value out the > struct again which is a little more expensive as it requires some > maths to figure out the offset. > > I've attached v9-0001 and a delta of just my changes from v8. All sounds good and LGTM - Melanie
Re: Streaming read-ready sequential scan code
On Thu, 4 Apr 2024 at 06:03, Melanie Plageman wrote: > Attached v8 is rebased over current master (which now has the > streaming read API). I've looked at the v8-0001 patch. I wasn't too keen on heapbuildvis() as a function name for an external function. Since it also does pruning work, it seemed weird to make it sound like it only did visibility work. Per our offline discussion about names, I've changed it to what you suggested which is heap_page_prep(). Aside from that, there was an outdated comment: "In pageatatime mode, heapgetpage() already did visibility checks," which was no longer true as that's done in heapbuildvis() (now heap_page_prep()). I also did a round of comment adjustments as there were a few things I didn't like, e.g: + * heapfetchbuf - subroutine for heapgettup() This is also used in heapgettup_pagemode(), so I thought it was a bad idea for a function to list places it thinks it's being called. I also thought it redundant to write "This routine" in the function head comment. I think "this routine" is implied by the context. I ended up with: /* * heapfetchbuf - read and pin the given MAIN_FORKNUM block number. * * Read the specified block of the scan relation into a buffer and pin that * buffer before saving it in the scan descriptor. */ I'm happy with your changes to heapam_scan_sample_next_block(). I did adjust the comment above CHECK_FOR_INTERRUPTS() so it was effectively the same as the seqscan version, just with "seqscan" swapped for "sample scan". The only other thing I adjusted there was to use "blockno" in some places where you were using hscan->rs_cblock. These all come after the "hscan->rs_cblock = blockno;" line. My thoughts here are that this is more likely to avoid reading the value from the struct again if the compiler isn't happy that the two values are still equivalent for some reason. 
Even if the compiler is happy today, it would only take a code change to pass hscan to some external function for the compiler to perhaps be uncertain if that function has made an adjustment to rs_cblock and go with the safe option of pulling the value out the struct again which is a little more expensive as it requires some maths to figure out the offset. I've attached v9-0001 and a delta of just my changes from v8. David From 90bfc63097c556d0921d8f9165731fb07ec04167 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v9] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for read streams. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former is now heapfetchbuf() and the latter is now heapbuildvis(). Future commits will push the logic for selecting the next block into heapfetchbuf() in cases when streaming reads are not supported (such as backwards sequential scans). Because this logic differs for sample scans and sequential scans, inline the code to read the block into a buffer for sample scans. This has the added benefit of allowing for a bit of refactoring in heapam_scan_sample_next_block(), including unpinning the previous buffer before invoking the callback to select the next block. 
--- src/backend/access/heap/heapam.c | 78 ++-- src/backend/access/heap/heapam_handler.c | 38 src/include/access/heapam.h | 2 +- 3 files changed, 74 insertions(+), 44 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index a9d5b109a5..6524fc44a5 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -360,17 +360,17 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk } /* - * heapgetpage - subroutine for heapgettup() + * heap_page_prep - Prepare the current scan page to be scanned in pagemode * - * This routine reads and pins the specified page of the relation. - * In page-at-a-time mode it performs additional work, namely determining - * which tuples on the page are visible. + * Preparation currently consists of 1. prune the scan's rs_cbuf page, and 2. + * fill the rs_vistuples array with the OffsetNumbers of visible tuples. */ void -heapgetpage(TableScanDesc sscan, BlockNumber block) +heap_page_prep(TableScanDesc sscan) { HeapScanDesc scan = (HeapScanDesc) sscan; - Buffer buffer; + Buffer buffer = scan->rs_cbuf; + BlockNumber block = scan->rs_
Re: Streaming read-ready sequential scan code
Hi, On 2024-04-04 09:27:35 +1300, Thomas Munro wrote: > Anecdotally by all reports I've seen so far, all-in-cache seems to be > consistently a touch faster than master if anything, for streaming seq > scan and streaming ANALYZE. That's great!, but it doesn't seem to be > due to intentional changes. No efficiency is coming from batching > anything. Perhaps it has to do with CPU pipelining effects: though > it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work > itself is cut up into stages in a kind of pipeline: > read_stream_next_buffer() chooses page n + 2, pins page n + 1 and > returns page n each time you call it, so maybe we get more CPU > parallelism due to spreading the data dependencies out? Another theory is that it's due to the plain ReadBuffer() path needing to do RelationGetSmgr(reln) on every call, whereas the streaming read path doesn't need to. > > On the topic of BAS_BULKREAD buffer access strategy, I think the least > > we could do is add an assert like this to read_stream_begin_relation() > > after calculating max_pinned_buffers. > > > > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers); > > > > Perhaps we should do more? I think with a ring size of 16 MB, large > > SELECTs are safe for now. But I know future developers will make > > changes and it would be good not to assume they will understand that > > pinning more buffers than the size of the ring effectively invalidates > > the ring. > > Yeah I think we should clamp max_pinned_buffers if we see a strategy. > What do you think about: > > if (strategy) > { > int strategy_buffers = GetAccessStrategyBufferCount(strategy); > max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers); > } > I just don't know where to get that '2'. The idea would be to > hopefully never actually be constrained by it in practice, except in > tiny/toy setups, so we can't go too wrong with our number '2' there. 
The / 2 is to avoid causing unnecessarily frequent WAL flushes, right? If so, should we apply that only if the strategy doesn't use the StrategyRejectBuffer() logic? I think it's fine to add a handwavy justification for the 2, saying that we want to balance readahead distance and reducing WAL write frequency, and that at some point more sophisticated logic will be needed. > Then we should increase the default ring sizes for BAS_BULKREAD and > BAS_VACUUM to make the previous statement true. I'm not sure it's right tying them together. The concerns for both are fairly different. > The size of main memory and L2 cache have increased dramatically since those > strategies were invented. I think we should at least double them, and more > likely quadruple them. I realise you already made them configurable per > command in commit 1cbbee033, but I mean the hard coded default 256 in > freelist.c. It's not only to get more space for our prefetching plans, it's > also to give the system more chance of flushing WAL asynchronously/in some > other backend before you crash into dirty data; as you discovered, > prefetching accidentally makes that effect worse in a BAS_VACUUM strategy, > by taking away space that is effectively deferring WAL flushes, so I'd at > least like to double the size if we use "/ 2" above. I think for VACUUM we should probably go a bit further. There's no comparable L1/L2 issue, because the per-buffer processing + WAL insertion is a lot more expensive, compared to a seqscan. I'd go for at least 4x-8x. Greetings, Andres Freund
Re: Streaming read-ready sequential scan code
On Wed, Apr 3, 2024 at 4:28 PM Thomas Munro wrote: > > On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman > wrote: > > On the topic of BAS_BULKREAD buffer access strategy, I think the least > > we could do is add an assert like this to read_stream_begin_relation() > > after calculating max_pinned_buffers. > > > > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers); > > > > Perhaps we should do more? I think with a ring size of 16 MB, large > > SELECTs are safe for now. But I know future developers will make > > changes and it would be good not to assume they will understand that > > pinning more buffers than the size of the ring effectively invalidates > > the ring. > > Yeah I think we should clamp max_pinned_buffers if we see a strategy. > What do you think about: > > if (strategy) > { > int strategy_buffers = GetAccessStrategyBufferCount(strategy); > max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers); > } > > I just don't know where to get that '2'. The idea would be to > hopefully never actually be constrained by it in practice, except in > tiny/toy setups, so we can't go too wrong with our number '2' there. Yea, I don't actually understand why not just use strategy_buffers - 1 or something. 1/2 seems like a big limiting factor for those strategies with small rings. I don't really think it will come up for SELECT queries since they rely on readahead and not prefetching. It does seem like it could easily come up for analyze. But I am on board with the idea of clamping for sequential scan/TID range scan. For vacuum, we might have to think harder. If the user specifies buffer_usage_limit and io_combine_limit and they are never reaching IOs of io_combine_limit because of their buffer_usage_limit value, then we should probably inform them. - Melanie
Re: Streaming read-ready sequential scan code
On Thu, Apr 4, 2024 at 6:03 AM Melanie Plageman wrote: > On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas wrote: > > On 01/04/2024 22:58, Melanie Plageman wrote: > > > Attached v7 has version 14 of the streaming read API as well as a few > > > small tweaks to comments and code. > > > > I saw benchmarks in this thread to show that there's no regression when > > the data is in cache, but I didn't see any benchmarks demonstrating the > > benefit of this. So I ran this quick test: > > Good point! It would be good to show why we would actually want this > patch. Attached v8 is rebased over current master (which now has the > streaming read API). Anecdotally by all reports I've seen so far, all-in-cache seems to be consistently a touch faster than master if anything, for streaming seq scan and streaming ANALYZE. That's great, but it doesn't seem to be due to intentional changes. No efficiency is coming from batching anything. Perhaps it has to do with CPU pipelining effects: though it's doing the same work as ReadBuffer()-when-it-gets-a-hit, the work itself is cut up into stages in a kind of pipeline: read_stream_next_buffer() chooses page n + 2, pins page n + 1 and returns page n each time you call it, so maybe we get more CPU parallelism due to spreading the data dependencies out? (Makes me wonder what happens if you insert a memory prefetch for the page header into that production line, and if there are more opportunities to spread dependencies out, e.g. hashing the buffer tag ahead of time.) > On the topic of BAS_BULKREAD buffer access strategy, I think the least > we could do is add an assert like this to read_stream_begin_relation() > after calculating max_pinned_buffers. > > Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers); > > Perhaps we should do more? I think with a ring size of 16 MB, large > SELECTs are safe for now. 
But I know future developers will make > changes and it would be good not to assume they will understand that > pinning more buffers than the size of the ring effectively invalidates > the ring. Yeah, I think we should clamp max_pinned_buffers if we see a strategy. What do you think about: if (strategy) { int strategy_buffers = GetAccessStrategyBufferCount(strategy); max_pinned_buffers = Min(strategy_buffers / 2, max_pinned_buffers); } I just don't know where to get that '2'. The idea would be to hopefully never actually be constrained by it in practice, except in tiny/toy setups, so we can't go too wrong with our number '2' there. Then we should increase the default ring sizes for BAS_BULKREAD and BAS_VACUUM to make the previous statement true. The sizes of main memory and L2 cache have increased dramatically since those strategies were invented. I think we should at least double them, and more likely quadruple them. I realise you already made them configurable per command in commit 1cbbee033, but I mean the hard-coded default 256 in freelist.c. It's not only to get more space for our prefetching plans, it's also to give the system more chance of flushing WAL asynchronously/in some other backend before you crash into dirty data; as you discovered, prefetching accidentally makes that effect worse in a BAS_VACUUM strategy, by taking away space that is effectively deferring WAL flushes, so I'd at least like to double the size if we use "/ 2" above.
Re: Streaming read-ready sequential scan code
On Tue, Apr 2, 2024 at 1:10 PM Heikki Linnakangas wrote: > > On 01/04/2024 22:58, Melanie Plageman wrote: > > Attached v7 has version 14 of the streaming read API as well as a few > > small tweaks to comments and code. > > I saw benchmarks in this thread to show that there's no regression when > the data is in cache, but I didn't see any benchmarks demonstrating the > benefit of this. So I ran this quick test: Good point! It would be good to show why we would actually want this patch. Attached v8 is rebased over current master (which now has the streaming read API). On the topic of BAS_BULKREAD buffer access strategy, I think the least we could do is add an assert like this to read_stream_begin_relation() after calculating max_pinned_buffers. Assert(GetAccessStrategyBufferCount(strategy) > max_pinned_buffers); Perhaps we should do more? I think with a ring size of 16 MB, large SELECTs are safe for now. But I know future developers will make changes and it would be good not to assume they will understand that pinning more buffers than the size of the ring effectively invalidates the ring. - Melanie From cfccafec650a77c53b1d78180b52db31742181ff Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Wed, 27 Mar 2024 20:25:06 -0400 Subject: [PATCH v8 3/3] Sequential scans and TID range scans stream reads Implementing streaming read support for heap sequential scans and TID range scans includes three parts: Allocate the read stream object in heap_beginscan(). On rescan, reset the stream by releasing all pinned buffers and resetting the prefetch block. Implement a callback returning the next block to prefetch to the read stream infrastructure. Invoke the read stream API when a new page is needed. When the scan direction changes, reset the stream. 
--- src/backend/access/heap/heapam.c | 94 src/include/access/heapam.h | 15 + 2 files changed, 97 insertions(+), 12 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 6b26f5bf8af..3546f637c13 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -221,6 +221,25 @@ static const int MultiXactStatusLock[MaxMultiXactStatus + 1] = * */ +static BlockNumber +heap_scan_stream_read_next(ReadStream *pgsr, void *private_data, + void *per_buffer_data) +{ + HeapScanDesc scan = (HeapScanDesc) private_data; + + if (unlikely(!scan->rs_inited)) + { + scan->rs_prefetch_block = heapgettup_initial_block(scan, scan->rs_dir); + scan->rs_inited = true; + } + else + scan->rs_prefetch_block = heapgettup_advance_block(scan, + scan->rs_prefetch_block, + scan->rs_dir); + + return scan->rs_prefetch_block; +} + /* * initscan - scan code common to heap_beginscan and heap_rescan * @@ -323,6 +342,13 @@ initscan(HeapScanDesc scan, ScanKey key, bool keep_startblock) scan->rs_cbuf = InvalidBuffer; scan->rs_cblock = InvalidBlockNumber; + /* + * Initialize to ForwardScanDirection because it is most common and heap + * scans usually must go forwards before going backward. + */ + scan->rs_dir = ForwardScanDirection; + scan->rs_prefetch_block = InvalidBlockNumber; + /* page-at-a-time fields are always invalid when not rs_inited */ /* @@ -459,12 +485,14 @@ heapbuildvis(TableScanDesc sscan) /* * heapfetchbuf - subroutine for heapgettup() * - * This routine reads the next block of the relation into a buffer and returns - * with that pinned buffer saved in the scan descriptor. + * This routine gets gets the next block of the relation from the read stream + * and saves that pinned buffer in the scan descriptor. 
*/ static inline void heapfetchbuf(HeapScanDesc scan, ScanDirection dir) { + Assert(scan->rs_read_stream); + /* release previous scan buffer, if any */ if (BufferIsValid(scan->rs_cbuf)) { @@ -479,19 +507,23 @@ heapfetchbuf(HeapScanDesc scan, ScanDirection dir) */ CHECK_FOR_INTERRUPTS(); - if (unlikely(!scan->rs_inited)) + /* + * If the scan direction is changing, reset the prefetch block to the + * current block. Otherwise, we will incorrectly prefetch the blocks + * between the prefetch block and the current block again before + * prefetching blocks in the new, correct scan direction. + */ + if (unlikely(scan->rs_dir != dir)) { - scan->rs_cblock = heapgettup_initial_block(scan, dir); - Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf)); - scan->rs_inited = true; + scan->rs_prefetch_block = scan->rs_cblock; + read_stream_reset(scan->rs_read_stream); } - else - scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir); - /* read block if valid */ - if (BlockNumberIsValid(scan->rs_cblock)) - scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, - scan->rs_cblock, RBM_NORMAL, scan->rs_strategy); + s
Re: Streaming read-ready sequential scan code
On 01/04/2024 22:58, Melanie Plageman wrote: Attached v7 has version 14 of the streaming read API as well as a few small tweaks to comments and code. I saw benchmarks in this thread to show that there's no regression when the data is in cache, but I didn't see any benchmarks demonstrating the benefit of this. So I ran this quick test:

-- create a ~1 GB table with only 1 row per page.
CREATE TABLE giga (i int, filler text) with (fillfactor=10);
insert into giga select g, repeat('x', 900) from generate_series(1, 140000) g;
vacuum freeze giga;
\timing on
select count(*) from giga;

The SELECT takes about 390 ms on 'master', and 230 ms with the patch. This is pretty much the best case for this patch, real world gains will be much smaller. Nevertheless, nice speedup! -- Heikki Linnakangas Neon (https://neon.tech)
Re: Streaming read-ready sequential scan code
On Wed, Mar 27, 2024 at 08:47:03PM -0400, Melanie Plageman wrote: > On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman > wrote: > > > > On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > > > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > > > wrote: > > > > > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > > > wrote: > > > > > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > > > wrote: > > > > > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > > > PgStreamingRead object for sequential scans > > > > > > > > > > > > I've written three alternative implementations of the actual > > > > > > streaming > > > > > > read user for sequential scan which handle the question of where to > > > > > > allocate the streaming read object and how to handle changing scan > > > > > > direction in different ways. > > > > > > > > > > > > Option A) > > > > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > > > know the scan direction at this time, if the scan ends up not being > > > > > > a > > > > > > forwards scan, the streaming read object must later be freed -- so > > > > > > this will sometimes allocate a streaming read object it never uses. > > > > > > - Only supports ForwardScanDirection and once the scan direction > > > > > > changes, streaming is never supported again -- even if we return to > > > > > > ForwardScanDirection > > > > > > - Must maintain a "fallback" codepath that does not use the > > > > > > streaming read API > > > > > > > > > > Attached is a version of this patch which implements a "reset" > > > > > function for the streaming read API which should be cheaper than the > > > > > full pg_streaming_read_free() on rescan. 
This can easily be ported to > > > > > work on any of my proposed implementations (A/B/C). I implemented it > > > > > on A as an example. > > > > > > > > Attached is the latest version of this patchset -- rebased in light of > > > > Thomas' updatees to the streaming read API [1]. I have chosen the > > > > approach I think we should go with. It is a hybrid of my previously > > > > proposed approaches. > > > > > > While investigating some performance concerns, Andres pointed out that > > > the members I add to HeapScanDescData in this patch push rs_cindex and > > > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > > > v4's HeapScanDescData is as well-packed as on master and its members > > > are reordered so that rs_cindex and rs_ntuples are back on the second > > > cacheline of the struct's data. > > > > I did some additional profiling and realized that dropping the > > unlikely() from the places we check rs_inited frequently was negatively > > impacting performance. v5 adds those back and also makes a few other > > very minor cleanups. > > > > Note that this patch set has a not yet released version of Thomas > > Munro's Streaming Read API with a new ramp-up logic which seems to fix a > > performance issue I saw with my test case when all of the sequential > > scan's blocks are in shared buffers. Once he sends the official new > > version, I will rebase this and point to his explanation in that thread. > > Attached v6 has the version of the streaming read API mentioned here > [1]. This resolved the fully-in-shared-buffers regressions > investigated in that thread by Andres, Bilal, and Thomas. Attached v7 has version 14 of the streaming read API as well as a few small tweaks to comments and code. I noticed that 0001 in the set posed a small regression from master for a sequential scan of a relation already in shared buffers. 
While investigating this, I saw that heapfetchbuf() was still not being inlined (compiled at -O2) and when I promoted heapfetchbuf() from static inline to static pg_attribute_always_inline, most of the very small regression I saw went away. I don't know if I squashed the issue entirely, though. - Melanie >From db6219d9ea689fdd0150aa0fbbeaa2ee11364aa8 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v7 1/4] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for streaming reads. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former i
Re: Streaming read-ready sequential scan code
On Fri, Mar 8, 2024 at 4:56 PM Melanie Plageman wrote: > > On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > > wrote: > > > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > > wrote: > > > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > > wrote: > > > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > > PgStreamingRead object for sequential scans > > > > > > > > > > I've written three alternative implementations of the actual streaming > > > > > read user for sequential scan which handle the question of where to > > > > > allocate the streaming read object and how to handle changing scan > > > > > direction in different ways. > > > > > > > > > > Option A) > > > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > > know the scan direction at this time, if the scan ends up not being a > > > > > forwards scan, the streaming read object must later be freed -- so > > > > > this will sometimes allocate a streaming read object it never uses. > > > > > - Only supports ForwardScanDirection and once the scan direction > > > > > changes, streaming is never supported again -- even if we return to > > > > > ForwardScanDirection > > > > > - Must maintain a "fallback" codepath that does not use the streaming > > > > > read API > > > > > > > > Attached is a version of this patch which implements a "reset" > > > > function for the streaming read API which should be cheaper than the > > > > full pg_streaming_read_free() on rescan. This can easily be ported to > > > > work on any of my proposed implementations (A/B/C). I implemented it > > > > on A as an example. 
> > > > > > Attached is the latest version of this patchset -- rebased in light of > > > Thomas' updatees to the streaming read API [1]. I have chosen the > > > approach I think we should go with. It is a hybrid of my previously > > > proposed approaches. > > > > While investigating some performance concerns, Andres pointed out that > > the members I add to HeapScanDescData in this patch push rs_cindex and > > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > > v4's HeapScanDescData is as well-packed as on master and its members > > are reordered so that rs_cindex and rs_ntuples are back on the second > > cacheline of the struct's data. > > I did some additional profiling and realized that dropping the > unlikely() from the places we check rs_inited frequently was negatively > impacting performance. v5 adds those back and also makes a few other > very minor cleanups. > > Note that this patch set has a not yet released version of Thomas > Munro's Streaming Read API with a new ramp-up logic which seems to fix a > performance issue I saw with my test case when all of the sequential > scan's blocks are in shared buffers. Once he sends the official new > version, I will rebase this and point to his explanation in that thread. Attached v6 has the version of the streaming read API mentioned here [1]. This resolved the fully-in-shared-buffers regressions investigated in that thread by Andres, Bilal, and Thomas. The one outstanding item for the sequential scan streaming read user is deciding how the BAS_BULKREAD buffer access strategy should interact with the streaming read infrastructure. We discussed a bit off-list, and it seems clear that the ring must be at least as large as io_combine_limit. This should be no problem for BAS_BULKREAD because its ring is 16 MB. 
The question is whether or not we need to do anything right now to ensure there aren't adverse interactions between io_combine_limit, max_ios, and the buffer access strategy ring buffer size. - Melanie [1] https://www.postgresql.org/message-id/CA%2BhUKGJTwrS7F%3DuJPx3SeigMiQiW%2BLJaOkjGyZdCntwyMR%3DuAw%40mail.gmail.com From bed26d391190f4411eccc4533d188e5dba6e8f72 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v6 1/6] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for streaming reads. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former is now heapfetchbuf() and the latter is now heapbuildvis(). Future commits will push
Re: Streaming read-ready sequential scan code
On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote: > On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman > wrote: > > > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > > wrote: > > > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > > wrote: > > > > > > > > > > There is an outstanding question about where to allocate the > > > > > PgStreamingRead object for sequential scans > > > > > > > > I've written three alternative implementations of the actual streaming > > > > read user for sequential scan which handle the question of where to > > > > allocate the streaming read object and how to handle changing scan > > > > direction in different ways. > > > > > > > > Option A) > > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > > - Allocates the streaming read object in initscan(). Since we do not > > > > know the scan direction at this time, if the scan ends up not being a > > > > forwards scan, the streaming read object must later be freed -- so > > > > this will sometimes allocate a streaming read object it never uses. > > > > - Only supports ForwardScanDirection and once the scan direction > > > > changes, streaming is never supported again -- even if we return to > > > > ForwardScanDirection > > > > - Must maintain a "fallback" codepath that does not use the streaming > > > > read API > > > > > > Attached is a version of this patch which implements a "reset" > > > function for the streaming read API which should be cheaper than the > > > full pg_streaming_read_free() on rescan. This can easily be ported to > > > work on any of my proposed implementations (A/B/C). I implemented it > > > on A as an example. > > > > Attached is the latest version of this patchset -- rebased in light of > > Thomas' updatees to the streaming read API [1]. I have chosen the > > approach I think we should go with. 
It is a hybrid of my previously > > proposed approaches. > > While investigating some performance concerns, Andres pointed out that > the members I add to HeapScanDescData in this patch push rs_cindex and > rs_ntuples to another cacheline and introduce a 4-byte hole. Attached > v4's HeapScanDescData is as well-packed as on master and its members > are reordered so that rs_cindex and rs_ntuples are back on the second > cacheline of the struct's data. I did some additional profiling and realized that dropping the unlikely() from the places we check rs_inited frequently was negatively impacting performance. v5 adds those back and also makes a few other very minor cleanups. Note that this patch set has a not yet released version of Thomas Munro's Streaming Read API with a new ramp-up logic which seems to fix a performance issue I saw with my test case when all of the sequential scan's blocks are in shared buffers. Once he sends the official new version, I will rebase this and point to his explanation in that thread. - Melanie >From 29827c23a11061846a7c145f430aa4712c1c30f3 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v5 1/5] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for streaming reads. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former is now heapfetchbuf() and the latter is now heapbuildvis(). 
Future commits will push the logic for selecting the next block into heapfetchbuf() in cases when streaming reads are not supported (such as backwards sequential scans). Because this logic differs for sample scans and sequential scans, inline the code to read the block into a buffer for sample scans. This has the added benefit of allowing for a bit of refactoring in heapam_scan_sample_next_block(), including unpinning the previous buffer before invoking the callback to select the next block. --- src/backend/access/heap/heapam.c | 74 ++-- src/backend/access/heap/heapam_handler.c | 40 + src/include/access/heapam.h | 2 +- 3 files changed, 72 insertions(+), 44 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 34bc60f625f..aef90d28473 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -363,17 +363,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk } /* - * heapget
Re: Streaming read-ready sequential scan code
On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman wrote: > > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > > wrote: > > > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > > wrote: > > > > > > > > There is an outstanding question about where to allocate the > > > > PgStreamingRead object for sequential scans > > > > > > I've written three alternative implementations of the actual streaming > > > read user for sequential scan which handle the question of where to > > > allocate the streaming read object and how to handle changing scan > > > direction in different ways. > > > > > > Option A) > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > > - Allocates the streaming read object in initscan(). Since we do not > > > know the scan direction at this time, if the scan ends up not being a > > > forwards scan, the streaming read object must later be freed -- so > > > this will sometimes allocate a streaming read object it never uses. > > > - Only supports ForwardScanDirection and once the scan direction > > > changes, streaming is never supported again -- even if we return to > > > ForwardScanDirection > > > - Must maintain a "fallback" codepath that does not use the streaming > > > read API > > > > Attached is a version of this patch which implements a "reset" > > function for the streaming read API which should be cheaper than the > > full pg_streaming_read_free() on rescan. This can easily be ported to > > work on any of my proposed implementations (A/B/C). I implemented it > > on A as an example. > > Attached is the latest version of this patchset -- rebased in light of > Thomas' updatees to the streaming read API [1]. I have chosen the > approach I think we should go with. It is a hybrid of my previously > proposed approaches. 
While investigating some performance concerns, Andres pointed out that the members I add to HeapScanDescData in this patch push rs_cindex and rs_ntuples to another cacheline and introduce a 4-byte hole. Attached v4's HeapScanDescData is as well-packed as on master and its members are reordered so that rs_cindex and rs_ntuples are back on the second cacheline of the struct's data. - Melanie From 614195777b3ee675d74d98953b086e0f8a4a494d Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 26 Feb 2024 15:33:39 -0500 Subject: [PATCH v4 4/5] Add pg_streaming_read_reset For rescan, we want to reuse the streaming read object and simply release the buffers that were pinned by the streaming read infrastructure. --- src/backend/storage/aio/streaming_read.c | 18 ++ src/include/storage/streaming_read.h | 1 + 2 files changed, 19 insertions(+) diff --git a/src/backend/storage/aio/streaming_read.c b/src/backend/storage/aio/streaming_read.c index 71f2c4a70b6..70f3ef051f8 100644 --- a/src/backend/storage/aio/streaming_read.c +++ b/src/backend/storage/aio/streaming_read.c @@ -610,3 +610,21 @@ pg_streaming_read_free(PgStreamingRead *pgsr) pfree(pgsr); } + + +/* + * Reset a streaming read object by releasing all of the buffers. Note that + * max_ios is not recalculated, so any changes to maintenance_io_concurrency and + * effective_io_concurrency will have no effect. + */ +void +pg_streaming_read_reset(PgStreamingRead *pgsr) +{ + Buffer buffer; + + /* Stop looking ahead, and unpin anything that wasn't consumed. 
*/ + pgsr->finished = true; + while ((buffer = pg_streaming_read_buffer_get_next(pgsr, NULL)) != InvalidBuffer) + ReleaseBuffer(buffer); + pgsr->finished = false; +} diff --git a/src/include/storage/streaming_read.h b/src/include/storage/streaming_read.h index c4d3892bb26..63cef719e42 100644 --- a/src/include/storage/streaming_read.h +++ b/src/include/storage/streaming_read.h @@ -48,5 +48,6 @@ extern PgStreamingRead *pg_streaming_read_buffer_alloc(int flags, extern void pg_streaming_read_prefetch(PgStreamingRead *pgsr); extern Buffer pg_streaming_read_buffer_get_next(PgStreamingRead *pgsr, void **per_buffer_private); extern void pg_streaming_read_free(PgStreamingRead *pgsr); +extern void pg_streaming_read_reset(PgStreamingRead *pgsr); #endif -- 2.40.1 From 4b6d9059aa32625e91d38f1d414ec514d4073197 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Mon, 29 Jan 2024 11:50:01 -0500 Subject: [PATCH v4 2/5] Replace blocks with buffers in heapgettup control flow Future commits will introduce the streaming read API and the sequential scan streaming read API user. Streaming read API users implement a callback which returns the next block to read. Sequential scans previously looped through the blocks in the relation, synchronously reading in a block and then processing it. An InvalidBlockNumber returned by heapgettup_advance_block() meant that the relation was exhausted and all blocks had been processed. The streaming read API may exhaust the blocks in a relation (having read all of them into buffers) before they have all be
Re: Streaming read-ready sequential scan code
On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote: > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman > wrote: > > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman > > wrote: > > > > > > There is an outstanding question about where to allocate the > > > PgStreamingRead object for sequential scans > > > > I've written three alternative implementations of the actual streaming > > read user for sequential scan which handle the question of where to > > allocate the streaming read object and how to handle changing scan > > direction in different ways. > > > > Option A) > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation > > - Allocates the streaming read object in initscan(). Since we do not > > know the scan direction at this time, if the scan ends up not being a > > forwards scan, the streaming read object must later be freed -- so > > this will sometimes allocate a streaming read object it never uses. > > - Only supports ForwardScanDirection and once the scan direction > > changes, streaming is never supported again -- even if we return to > > ForwardScanDirection > > - Must maintain a "fallback" codepath that does not use the streaming read > > API > > Attached is a version of this patch which implements a "reset" > function for the streaming read API which should be cheaper than the > full pg_streaming_read_free() on rescan. This can easily be ported to > work on any of my proposed implementations (A/B/C). I implemented it > on A as an example. Attached is the latest version of this patchset -- rebased in light of Thomas' updatees to the streaming read API [1]. I have chosen the approach I think we should go with. It is a hybrid of my previously proposed approaches. The streaming read is allocated in heap_beginscan() and then reset on rescan and when the scan direction changes. I only check if the scan direction changes when a new page is needed. 
This implementation means no fallback method is needed, so we can remove the non-streaming read code for heap sequential scans. Because heapgettup() and heapgettup_pagemode() are also used for TID range scans, this patch also happens to implement streaming reads for TID range scans. - Melanie [1] https://www.postgresql.org/message-id/CA%2BhUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ%40mail.gmail.com >From 9227b1b621c473ab189ac0cefba1b9aaed02aa09 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sat, 27 Jan 2024 18:39:37 -0500 Subject: [PATCH v3 1/5] Split heapgetpage into two parts heapgetpage(), a per-block utility function used in heap scans, read a passed-in block into a buffer and then, if page-at-a-time processing was enabled, pruned the page and built an array of its visible tuples. This was used for sequential and sample scans. Future commits will add support for streaming reads. The streaming read API will read in the blocks specified by a callback, but any significant per-page processing should be done synchronously on the buffer yielded by the streaming read API. To support this, separate the logic in heapgetpage() to read a block into a buffer from that which prunes the page and builds an array of the visible tuples. The former is now heapfetchbuf() and the latter is now heapbuildvis(). Future commits will push the logic for selecting the next block into heapfetchbuf() in cases when streaming reads are not supported (such as backwards sequential scans). Because this logic differs for sample scans and sequential scans, inline the code to read the block into a buffer for sample scans. This has the added benefit of allowing for a bit of refactoring in heapam_scan_sample_next_block(), including unpinning the previous buffer before invoking the callback to select the next block. 
--- src/backend/access/heap/heapam.c | 74 ++-- src/backend/access/heap/heapam_handler.c | 40 + src/include/access/heapam.h | 2 +- 3 files changed, 72 insertions(+), 44 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 707460a5364..449221da6ac 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -367,17 +367,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk } /* - * heapgetpage - subroutine for heapgettup() + * heapbuildvis - Utility function for heap scans. * - * This routine reads and pins the specified page of the relation. - * In page-at-a-time mode it performs additional work, namely determining - * which tuples on the page are visible. + * Given a page residing in a buffer saved in the scan descriptor, prune the + * page and determine which of its tuples are all visible, saving their offsets + * in an array in the scan descriptor. */ void -heapgetpage(TableScanDesc sscan, BlockNumber block) +heapbuildvis(TableScanDesc sscan) { HeapScanDesc scan = (HeapScanDesc) sscan; - Buffer buffer; + Buffer buffer = scan->rs_
Re: Streaming read-ready sequential scan code
On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman wrote:
>
> On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman wrote:
> >
> > There is an outstanding question about where to allocate the
> > PgStreamingRead object for sequential scans
>
> I've written three alternative implementations of the actual streaming
> read user for sequential scan which handle the question of where to
> allocate the streaming read object and how to handle changing scan
> direction in different ways.
>
> Option A)
> https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
> - Allocates the streaming read object in initscan(). Since we do not
> know the scan direction at this time, if the scan ends up not being a
> forwards scan, the streaming read object must later be freed -- so
> this will sometimes allocate a streaming read object it never uses.
> - Only supports ForwardScanDirection and once the scan direction
> changes, streaming is never supported again -- even if we return to
> ForwardScanDirection
> - Must maintain a "fallback" codepath that does not use the streaming read API

Attached is a version of this patch which implements a "reset" function for the streaming read API which should be cheaper than the full pg_streaming_read_free() on rescan. This can easily be ported to work on any of my proposed implementations (A/B/C). I implemented it on A as an example.

- Melanie

From ee93bf41b41aefd6fe20c41be15d2fd5e5d75181 Mon Sep 17 00:00:00 2001
From: Melanie Plageman
Date: Mon, 29 Jan 2024 11:50:01 -0500
Subject: [PATCH v2 2/5] Replace blocks with buffers in heapgettup control flow

Future commits will introduce the streaming read API and the sequential
scan streaming read API user. Streaming read API users implement a
callback which returns the next block to read. Sequential scans
previously looped through the blocks in the relation, synchronously
reading in a block and then processing it.
An InvalidBlockNumber returned by heapgettup_advance_block() meant that
the relation was exhausted and all blocks had been processed. The
streaming read API may exhaust the blocks in a relation (having read all
of them into buffers) before they have all been processed by the
sequential scan. As such, the sequential scan should continue processing
blocks until heapfetchbuf() returns InvalidBuffer.

Note that this commit does not implement the streaming read API user. It
simply restructures heapgettup() and heapgettup_pagemode() to use
buffers instead of blocks for control flow. Not all sequential scans
will support streaming reads. As such, this code will remain for
compatibility even after sequential scans support streaming reads.
---
 src/backend/access/heap/heapam.c | 79 ++--
 1 file changed, 35 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 449221da6ac..e0fe3d9c326 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -87,6 +87,9 @@ static Bitmapset *HeapDetermineColumnsInfo(Relation relation,
 static bool heap_acquire_tuplock(Relation relation, ItemPointer tid,
 								 LockTupleMode mode, LockWaitPolicy wait_policy,
 								 bool *have_tuple_lock);
+static inline BlockNumber heapgettup_advance_block(HeapScanDesc scan,
+												   BlockNumber block, ScanDirection dir);
+static inline BlockNumber heapgettup_initial_block(HeapScanDesc scan, ScanDirection dir);
 static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  uint16 old_infomask2, TransactionId add_to_xmax,
 									  LockTupleMode mode, bool is_update,
@@ -463,14 +466,12 @@ heapbuildvis(TableScanDesc sscan)
 /*
  * heapfetchbuf - subroutine for heapgettup()
  *
- * This routine reads the specified block of the relation into a buffer and
- * returns with that pinned buffer saved in the scan descriptor.
+ * This routine reads the next block of the relation into a buffer and returns
+ * with that pinned buffer saved in the scan descriptor.
  */
 static inline void
-heapfetchbuf(HeapScanDesc scan, BlockNumber block)
+heapfetchbuf(HeapScanDesc scan, ScanDirection dir)
 {
-	Assert(block < scan->rs_nblocks);
-
 	/* release previous scan buffer, if any */
 	if (BufferIsValid(scan->rs_cbuf))
 	{
@@ -485,10 +486,19 @@ heapfetchbuf(HeapScanDesc scan, BlockNumber block)
 	 */
 	CHECK_FOR_INTERRUPTS();
 
-	/* read page using selected strategy */
-	scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, block,
-									   RBM_NORMAL, scan->rs_strategy);
-	scan->rs_cblock = block;
+	if (!scan->rs_inited)
+	{
+		scan->rs_cblock = heapgettup_initial_block(scan, dir);
+		Assert(scan->rs_cblock != InvalidBlockNumber || !BufferIsValid(scan->rs_cbuf));
+		scan->rs_inited = true;
+	}
+	else
+		scan->rs_cblock = heapgettup_advance_block(scan, scan->rs_cblock, dir);
+
+	/* read block if valid */
+	if (BlockNumberIsValid(scan->rs_cblock))
+		scan->rs_cbuf = ReadBufferExtended(scan->rs_
Re: Streaming read-ready sequential scan code
On Tue, Feb 20, 2024 at 6:13 AM Robert Haas wrote:
>
> On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman wrote:
> > I've written three alternative implementations of the actual streaming
> > read user for sequential scan which handle the question of where to
> > allocate the streaming read object and how to handle changing scan
> > direction in different ways.
>
> It's weird to me that the prospect of changing the scan direction
> causes such complexity. I mean, why doesn't a streaming read object
> have a forget_all_my_previous_requests() method or somesuch?

Basically, that is what pg_streaming_read_free() does. It goes through and releases the buffers it had pinned and frees any per-buffer data allocated. The complexity with the sequential scan streaming read user and scan direction is just that it has to detect when the scan direction changes and do the releasing/freeing and reallocation. The scan direction is passed to heapgettup[_pagemode](), so this is something that can change on a tuple-to-tuple basis.

It is less that doing this is complicated and more that it is annoying and distracting to have to check for and handle a very unimportant and uncommon case in the main path of the common case.

- Melanie
Re: Streaming read-ready sequential scan code
On Tue, Feb 20, 2024 at 4:35 AM Melanie Plageman wrote:
> I've written three alternative implementations of the actual streaming
> read user for sequential scan which handle the question of where to
> allocate the streaming read object and how to handle changing scan
> direction in different ways.

It's weird to me that the prospect of changing the scan direction causes such complexity. I mean, why doesn't a streaming read object have a forget_all_my_previous_requests() method or somesuch?

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Streaming read-ready sequential scan code
On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman wrote:
>
> There is an outstanding question about where to allocate the
> PgStreamingRead object for sequential scans

I've written three alternative implementations of the actual streaming read user for sequential scan which handle the question of where to allocate the streaming read object and how to handle changing scan direction in different ways.

Option A)
https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
- Allocates the streaming read object in initscan(). Since we do not know the scan direction at this time, if the scan ends up not being a forwards scan, the streaming read object must later be freed -- so this will sometimes allocate a streaming read object it never uses.
- Only supports ForwardScanDirection and once the scan direction changes, streaming is never supported again -- even if we return to ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Option B)
https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_heapgettup_alloc_forward_only
- Allocates the streaming read object in heapgettup[_pagemode]() when it has not been previously allocated. To do this it has to record and switch into a different memory context than the per-tuple context. It only allocates the streaming read object if it is a forwards scan. It frees the streaming read object if the scan direction is later changed.
- Only supports ForwardScanDirection and once the scan direction changes, streaming is never supported again -- even if we return to ForwardScanDirection
- Must maintain a "fallback" codepath that does not use the streaming read API

Option C)
https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_all_dir_stream
- Allocates the streaming read object in heapgettup[_pagemode]() when it has not been previously allocated. To do this it has to record and switch into a different memory context than the per-tuple context.
- All scan directions support streaming. To do this, the scan direction has to be tracked and we must check if the direction has changed on every heapgettup[_pagemode]() invocation to avoid returning wrong results.
- No "fallback" codepath as all heap sequential scans will use the streaming read API

As you can see, each option has pros and cons. I'm interested in what others think about which we should choose.

- Melanie
Re: Streaming read-ready sequential scan code
On Mon, Jan 29, 2024 at 4:24 PM David Rowley wrote:
>
> On Tue, 30 Jan 2024 at 10:17, Melanie Plageman wrote:
> > Though logically the performance with 0001 and 0002 should be the same
> > as master (no new non-inline function calls, no additional looping),
> > I've done a bit of profiling anyway. I created a large multi-GB table,
> > read it all into shared buffers (disabling the large sequential scan
> > bulkread optimization), and did a sequential SELECT count(*) from the
> > table. From the profiles below, you'll notice that master and the
> > patch are basically the same. Actual percentages vary from run-to-run.
> > Execution time is the same.
>
> Can you also run a test on a Seqscan with a filter that filters out
> all tuples? There's less overhead in other parts of the executor with
> such a query.

Yes, of course. Thank you so much for taking a look!

While I was at it, I changed the table schema to be entirely composed of INT type columns and regenerated the data. Note that, both in this example and my previous example, I ensured that the table was vacuumed beforehand (and autovacuum disabled for the table) so there wasn't any on-access pruning happening (heapgetpage() does that in pagemode).

This is the schema:

CREATE TABLE foo(id INT, a INT, b INT, c INT, d INT, e INT, f INT, g INT) with (autovacuum_enabled=false);

I added 4600 rows to the table, making it 2.6 GB. Shared buffers is double that. Before profiling, I did a SELECT * from the table with the large sequential scan bulkread optimization disabled. Then I vacuumed the table. Finally, I turned up parallel_setup_cost high enough to disable query parallelism.

The query I profiled was:

SELECT * FROM foo WHERE id = 0;

With the data I generated, 0 rows match that condition. Profiles below. Execution time essentially the same.

patch:

  17.08%  postgres  postgres  [.] ExecInterpExpr
  11.17%  postgres  postgres  [.] tts_buffer_heap_getsomeattrs
  10.64%  postgres  postgres  [.] ExecStoreBufferHeapTuple
   9.82%  postgres  postgres  [.] heap_getnextslot
   9.13%  postgres  postgres  [.] heapgettup_pagemode
   8.98%  postgres  postgres  [.] heapbuildvis
   5.40%  postgres  postgres  [.] HeapCheckForSerializableConflictOut
   5.16%  postgres  postgres  [.] SeqNext

master:

  17.89%  postgres  postgres  [.] ExecInterpExpr
  12.28%  postgres  postgres  [.] tts_buffer_heap_getsomeattrs
  10.54%  postgres  postgres  [.] ExecStoreBufferHeapTuple
  10.11%  postgres  postgres  [.] heapgettup_pagemode
   8.52%  postgres  postgres  [.] heapgetpage
   8.28%  postgres  postgres  [.] heap_getnextslot
   5.00%  postgres  postgres  [.] HeapCheckForSerializableConflictOut
   4.71%  postgres  postgres  [.] SeqNext

- Melanie
Re: Streaming read-ready sequential scan code
On Tue, 30 Jan 2024 at 10:17, Melanie Plageman wrote:
> Though logically the performance with 0001 and 0002 should be the same
> as master (no new non-inline function calls, no additional looping),
> I've done a bit of profiling anyway. I created a large multi-GB table,
> read it all into shared buffers (disabling the large sequential scan
> bulkread optimization), and did a sequential SELECT count(*) from the
> table. From the profiles below, you'll notice that master and the
> patch are basically the same. Actual percentages vary from run-to-run.
> Execution time is the same.

Can you also run a test on a Seqscan with a filter that filters out all tuples? There's less overhead in other parts of the executor with such a query.

David