Re: Use streaming read API in ANALYZE

2024-05-20 Thread Melanie Plageman
On Wed, May 15, 2024 at 2:18 PM Nazir Bilal Yavuz  wrote:
>
> On Mon, 29 Apr 2024 at 18:41, Nazir Bilal Yavuz  wrote:
> >
> > On Mon, 8 Apr 2024 at 04:21, Thomas Munro  wrote:
> > I wanted to discuss what will happen to this patch now that
> > 27bc1772fc8 is reverted. I am continuing this thread but I can create
> > another thread if you prefer so.
>
> 041b96802ef is discussed in the 'Table AM Interface Enhancements'
> thread [1]. The main problems discussed about this commit is that the
> read stream API is not pushed to the heap-specific code and, because
> of that, the other AM implementations need to use read streams. To
> push read stream API to the heap-specific code, it is pretty much
> required to pass BufferAccessStrategy and BlockSamplerData to the
> initscan().
>
> I am sharing the alternative version of this patch. The first patch
> just reverts 041b96802ef and the second patch is the alternative
> version.
>
> In this alternative version, the read stream API is not pushed to the
> heap-specific code, but it is controlled by the heap-specific code.
> The SO_USE_READ_STREAMS_IN_ANALYZE flag is introduced and set in the
> heap-specific code if the scan type is 'ANALYZE'. This flag is used to
> decide whether streaming API in ANALYZE will be used or not. If this
> flag is set, this means heap AMs and read stream API will be used. If
> it is not set, this means heap AMs will not be used and code falls
> back to the version before read streams.

Personally, I think the alternative version here is the best option
other than leaving what is in master. However, I would vote for
keeping what is in master because 1) where we are in the release
timeline and 2) the acquire_sample_rows() code, before streaming read,
was totally block-based anyway.

If we kept what was in master, do we need to document for table AMs
how to use read_stream_next_buffer() or can we assume they will look
at the heap AM implementation?

- Melanie




Re: Use streaming read API in ANALYZE

2024-05-15 Thread Nazir Bilal Yavuz
Hi,

On Mon, 29 Apr 2024 at 18:41, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> On Mon, 8 Apr 2024 at 04:21, Thomas Munro  wrote:
> >
> > Pushed.  Thanks Bilal and reviewers!
>
> I wanted to discuss what will happen to this patch now that
> 27bc1772fc8 is reverted. I am continuing this thread but I can create
> another thread if you prefer so.

041b96802ef is discussed in the 'Table AM Interface Enhancements'
thread [1]. The main problems discussed about this commit is that the
read stream API is not pushed to the heap-specific code and, because
of that, the other AM implementations need to use read streams. To
push read stream API to the heap-specific code, it is pretty much
required to pass BufferAccessStrategy and BlockSamplerData to the
initscan().

I am sharing the alternative version of this patch. The first patch
just reverts 041b96802ef and the second patch is the alternative
version.

In this alternative version, the read stream API is not pushed to the
heap-specific code, but it is controlled by the heap-specific code.
The SO_USE_READ_STREAMS_IN_ANALYZE flag is introduced and set in the
heap-specific code if the scan type is 'ANALYZE'. This flag is used to
decide whether streaming API in ANALYZE will be used or not. If this
flag is set, this means heap AMs and read stream API will be used. If
it is not set, this means heap AMs will not be used and code falls
back to the version before read streams.

Pros of the alternative version:

* The existing AM implementations other than heap AM can continue to
use their AMs without any change.
* AM implementations other than heap do not need to use read streams.
* Upstream code uses the read stream API and benefits from that.

Cons of the alternative version:

* 6 if cases are added to the acquire_sample_rows() function and 3 of
them are in the while loop.
* Because of these changes, the code looks messy.

Any kind of feedback would be appreciated.

[1] 
https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 323f28ff979cde8e4dbde8b4654bded74abf1fbc Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 15 May 2024 00:03:56 +0300
Subject: [PATCH v13 1/2] Revert "Use streaming I/O in ANALYZE."

This commit reverts 041b96802ef.

041b96802ef revised the changes on 27bc1772fc8 but 27bc1772fc8 and
dd1f6b0c172 are reverted together in 6377e12a5a5. So, this commit
reverts all 27bc1772fc, 041b96802ef and dd1f6b0c172 together.

Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/include/access/tableam.h | 26 +++
 src/backend/access/heap/heapam_handler.c | 38 +-
 src/backend/commands/analyze.c   | 96 ++--
 3 files changed, 98 insertions(+), 62 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e583b45cd5..e08b9627f30 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,7 +21,6 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "executor/tuptable.h"
-#include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -655,16 +654,6 @@ typedef struct TableAmRoutine
 	struct VacuumParams *params,
 	BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze the next block in the read stream.  Returns false if
-	 * the stream is exhausted and true otherwise. The scan must have been
-	 * started with SO_TYPE_ANALYZE option.
-	 *
-	 * This routine holds a buffer pin and lock on the heap page.  They are
-	 * held until heapam_scan_analyze_next_tuple() returns false.  That is
-	 * until all the items of the heap page are analyzed.
-	 */
-
 	/*
 	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
 	 * with table_beginscan_analyze().  See also
@@ -683,7 +672,8 @@ typedef struct TableAmRoutine
 	 * isn't one yet.
 	 */
 	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-			ReadStream *stream);
+			BlockNumber blockno,
+			BufferAccessStrategy bstrategy);
 
 	/*
 	 * See table_scan_analyze_next_tuple().
@@ -1721,17 +1711,19 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 }
 
 /*
- * Prepare to analyze the next block in the read stream. The scan needs to
- * have been  started with table_beginscan_analyze().  Note that this routine
- * might acquire resources like locks that are held until
+ * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
+ * started with table_beginscan_analyze().  Note that this routine might
+ * acquire resources like locks that are held until
  * table_scan_analyze_next_tuple() returns false.
  *
  * Returns false if block is unsuitable for sampling, true otherwise.
  */
 static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
+table_scan_analyze_next_block(TableScanDesc scan, 

Re: Use streaming read API in ANALYZE

2024-04-29 Thread Nazir Bilal Yavuz
Hi,

On Mon, 8 Apr 2024 at 04:21, Thomas Munro  wrote:
>
> Pushed.  Thanks Bilal and reviewers!

I wanted to discuss what will happen to this patch now that
27bc1772fc8 is reverted. I am continuing this thread but I can create
another thread if you prefer so.

After the revert of 27bc1772fc8, acquire_sample_rows() became table-AM
agnostic again. So, read stream changes might have to be pushed down
now but there are a couple of roadblocks like Melanie mentioned [1]
before.

Quote from Melanie [1]:

On Thu, 11 Apr 2024 at 19:19, Melanie Plageman
 wrote:
>
> I am working on pushing streaming ANALYZE into heap AM code, and I ran
> into a few roadblocks.
>
> If we want ANALYZE to make the ReadStream object in heap_beginscan()
> (like the read stream implementation of heap sequential and TID range
> scans do), I don't see any way around changing the scan_begin table AM
> callback to take a BufferAccessStrategy at the least (and perhaps also
> the BlockSamplerData).
>
> read_stream_begin_relation() doesn't just save the
> BufferAccessStrategy in the ReadStream, it uses it to set various
> other things in the ReadStream object. callback_private_data (which in
> ANALYZE's case is the BlockSamplerData) is simply saved in the
> ReadStream, so it could be set later, but that doesn't sound very
> clean to me.
>
> As such, it seems like a cleaner alternative would be to add a table
> AM callback for creating a read stream object that takes the
> parameters of read_stream_begin_relation(). But, perhaps it is a bit
> late for such additions.

If we do not want to add a new table AM callback like Melanie
mentioned, it is pretty much required to pass BufferAccessStrategy and
BlockSamplerData to the initscan().

> It also opens us up to the question of whether or not sequential scan
> should use such a callback instead of making the read stream object in
> heap_beginscan().
>
> I am happy to write a patch that does any of the above. But, I want to
> raise these questions, because perhaps I am simply missing an obvious
> alternative solution.

I wonder the same, I could not think of any alternative solution to
this problem.

Another quote from Melanie [2] in the same thread:

On Thu, 11 Apr 2024 at 20:48, Melanie Plageman
 wrote:
>
> I will also say that, had this been 6 months ago, I would probably
> suggest we restructure ANALYZE's table AM interface to accommodate
> read stream setup and to address a few other things I find odd about
> the current code. For example, I think creating a scan descriptor for
> the analyze scan in acquire_sample_rows() is quite odd. It seems like
> it would be better done in the relation_analyze callback. The
> relation_analyze callback saves some state like the callbacks for
> acquire_sample_rows() and the Buffer Access Strategy. But at least in
> the heap implementation, it just saves them in static variables in
> analyze.c. It seems like it would be better to save them in a useful
> data structure that could be accessed later. We have access to pretty
> much everything we need at that point (in the relation_analyze
> callback). I also think heap's implementation of
> table_beginscan_analyze() doesn't need most of
> heap_beginscan()/initscan(), so doing this instead of something
> ANALYZE specific seems more confusing than helpful.

If we want to implement ANALYZE specific counterparts of
heap_beginscan()/initscan(); we may think of passing
BufferAccessStrategy and BlockSamplerData to them.

Also, there is an ongoing(?) discussion about a few problems /
improvements about the acquire_sample_rows() mentioned at the end of
the 'Table AM Interface Enhancements' thread [3]. Should we wait for
these discussions to be resolved or can we resume working on this
patch?

Any kind of feedback would be appreciated.

[1] 
https://www.postgresql.org/message-id/CAAKRu_ZxU6hucckrT1SOJxKfyN7q-K4KU1y62GhDwLBZWG%2BROg%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/CAAKRu_YkphAPNbBR2jcLqnxGhDEWTKhYfLFY%3D0R_oG5LHBH7Gw%40mail.gmail.com
[3] 
https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use streaming read API in ANALYZE

2024-04-16 Thread Nazir Bilal Yavuz
Hi,

On Wed, 3 Apr 2024 at 22:25, Nazir Bilal Yavuz  wrote:
>
> Hi,
>
> Thank you for looking into this!
>
> On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas  wrote:
> >
> > On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:
> > > Streaming API has been committed but the committed version has a minor
> > > change, the read_stream_begin_relation function takes Relation instead
> > > of BufferManagerRelation now. So, here is a v5 which addresses this
> > > change.
> >
> > I'm getting a repeatable segfault / assertion failure with this:
> >
> > postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
> > CREATE TABLE
> > postgres=# insert into tengiga select g, repeat('x', 900) from
> > generate_series(1, 140) g;
> > INSERT 0 140
> > postgres=# set default_statistics_target = 10; ANALYZE tengiga;
> > SET
> > ANALYZE
> > postgres=# set default_statistics_target = 100; ANALYZE tengiga;
> > SET
> > ANALYZE
> > postgres=# set default_statistics_target =1000; ANALYZE tengiga;
> > SET
> > server closed the connection unexpectedly
> > This probably means the server terminated abnormally
> > before or while processing the request.
> >
> > TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File:
> > "heapam_handler.c", Line: 1079, PID: 262232
> > postgres: heikki postgres [local]
> > ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
> > postgres: heikki postgres [local]
> > ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]
> > postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
> > postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
> > postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
> > postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
> > postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
> > postgres: heikki postgres [local]
> > ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
> > postgres: heikki postgres [local]
> > ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]
> > postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
> > postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
> > postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
> > postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
> > postgres: heikki postgres [local]
> > ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]
> > postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
> > postgres: heikki postgres [local]
> > ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]
> > postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
> > postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
> > postgres: heikki postgres [local]
> > ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]
> > postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4]
> > /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
> > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
> > postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
> > 2024-04-03 20:15:49.157 EEST [262101] LOG:  server process (PID 262232)
> > was terminated by signal 6: Aborted
>
> I realized the same error while working on Jakub's benchmarking results.
>
> Cause: I was using the nblocks variable to check how many blocks will
> be returned from the streaming API. But I realized that sometimes the
> number returned from BlockSampler_Init() is not equal to the number of
> blocks that BlockSampler_Next() will return as BlockSampling algorithm
> decides how many blocks to return on the fly by using some random
> seeds.

I wanted to re-check this problem and I realized that I was wrong. I
tried using nblocks again and this time there was no failure. I looked
at block sampling logic and I am pretty sure that BlockSampler_Init()
function correctly returns the number of blocks that
BlockSampler_Next() will return. It seems 158f581923 fixed this issue
as well.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use streaming read API in ANALYZE

2024-04-07 Thread Thomas Munro
On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman
 wrote:
> On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:
> > On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote:
> > > From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman 
> > > Date: Sun, 7 Apr 2024 15:38:41 -0400
> > > Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore()
> > >
> > > A previous commit stopped using BlockSampler_HasMore() for flow control
> > > in acquire_sample_rows(). There seems little use now for
> > > BlockSampler_HasMore(). It should be sufficient to return
> > > InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore()
> > > would have returned false. Remove BlockSampler_HasMore().
> > >
> > > Author: Melanie Plageman, Nazir Bilal Yavuz
> > > Discussion: 
> > > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
> >
> > The justification here seems somewhat odd. Sure, the previous commit stopped
> > using BlockSampler_HasMore in acquire_sample_rows - but only because it was
> > moved to block_sampling_streaming_read_next()?
>
> It didn't stop using it. It stopped being useful. The reason it existed,
> as far as I can tell, was to use it as the while() loop condition in
> acquire_sample_rows(). I think it makes much more sense for
> BlockSampler_Next() to return InvalidBlockNumber when there are no more
> blocks -- not to assert you don't call it when there aren't any more
> blocks.
>
> I didn't want to change BlockSampler_Next() in the same commit as the
> streaming read user and we can't remove BlockSampler_HasMore() without
> changing BlockSampler_Next().

I agree that the code looks useless if one condition implies the
other, but isn't it good to keep that cross-check, perhaps
reformulated as an assertion?  I didn't look too hard at the maths, I
just saw the words "It is not obvious that this code matches Knuth's
Algorithm S ..." and realised I'm not sure I have time to develop a
good opinion about this today.  So I'll leave the 0002 change out for
now, as it's a tidy-up that can easily be applied in the next cycle.
From c3b8df8e4720d8b0dfb4c892c0aa3ddaef8f401f Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 8 Apr 2024 14:38:58 +1200
Subject: [PATCH v12] Remove obsolete BlockSampler_HasMore().

Commit 041b9680 stopped using BlockSampler_HasMore() for flow control in
acquire_sample_rows(). There seems to be little use for it now. We can
just return InvalidBlockNumber from BlockSampler_Next() when
BlockSample_HasMore() would have returned false.

Author: Melanie Plageman 
Author: Nazir Bilal Yavuz 
Reviewed-by: Andres Freund 
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/commands/analyze.c|  4 +---
 src/backend/utils/misc/sampling.c | 10 +++---
 src/include/utils/sampling.h  |  1 -
 3 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index da27a13a3f..e9fa3470cf 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -,9 +,7 @@ block_sampling_read_stream_next(ReadStream *stream,
 void *callback_private_data,
 void *per_buffer_data)
 {
-	BlockSamplerData *bs = callback_private_data;
-
-	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+	return BlockSampler_Next(callback_private_data);
 }
 
 /*
diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c
index 933db06702..6e2bca9739 100644
--- a/src/backend/utils/misc/sampling.c
+++ b/src/backend/utils/misc/sampling.c
@@ -54,12 +54,6 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize,
 	return Min(bs->n, bs->N);
 }
 
-bool
-BlockSampler_HasMore(BlockSampler bs)
-{
-	return (bs->t < bs->N) && (bs->m < bs->n);
-}
-
 BlockNumber
 BlockSampler_Next(BlockSampler bs)
 {
@@ -68,7 +62,9 @@ BlockSampler_Next(BlockSampler bs)
 	double		p;/* probability to skip block */
 	double		V;/* random */
 
-	Assert(BlockSampler_HasMore(bs));	/* hence K > 0 and k > 0 */
+	/* Return if no remaining blocks or no blocks to sample */
+	if (K <= 0 || k <= 0)
+		return InvalidBlockNumber;
 
 	if ((BlockNumber) k >= K)
 	{
diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h
index be48ee52ba..fb5d6820a2 100644
--- a/src/include/utils/sampling.h
+++ b/src/include/utils/sampling.h
@@ -38,7 +38,6 @@ typedef BlockSamplerData *BlockSampler;
 
 extern BlockNumber BlockSampler_Init(BlockSampler bs, BlockNumber nblocks,
 	 int samplesize, uint32 randseed);
-extern bool BlockSampler_HasMore(BlockSampler bs);
 extern BlockNumber BlockSampler_Next(BlockSampler bs);
 
 /* Reservoir sampling methods */
-- 
2.44.0



Re: Use streaming read API in ANALYZE

2024-04-07 Thread Thomas Munro
On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman
 wrote:
> On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:
> > >  src/backend/commands/analyze.c | 89 ++
> > >  1 file changed, 26 insertions(+), 63 deletions(-)
> >
> > That's a very nice demonstration of how this makes good prefetching 
> > easier...
>
> Agreed. Yay streaming read API and Bilal!

+1

I found a few comments to tweak, just a couple of places that hadn't
got the memo after we renamed "read stream", and an obsolete mention
of pinning buffers.  I adjusted those directly.

I ran some tests on a random basic Linux/ARM cloud box with a 7.6GB
table, and I got:

  coldhot
master: 9025ms  199ms
patched, io_combine_limit=1:9025ms  191ms
patched, io_combine_limit=default:  8729ms  191ms

Despite being random, occasionally some I/Os must get merged, allowing
slightly better random throughput when accessing disk blocks through a
3000 IOPS drinking straw.  Looking at strace, I see 29144 pread* calls
instead of 30071, which fits that theory.  Let's see... if you roll a
fair 973452-sided dice 30071 times, how many times do you expect to
roll consecutive numbers?  Each time you roll there is a 1/973452
chance that you get the last number + 1, and we have 30071 tries
giving 30071/973452 = ~3%.  9025ms minus 3% is 8754ms.  Seems about
right.

I am not sure why the hot number is faster exactly.  (Anecdotally, I
did notice that in the cases that beat master semi-unexpectedly like
this, my software memory prefetch patch doesn't help or hurt, while in
some cases and on some CPUs there is little difference, and then that
patch seems to get a speed-up like this, which might be a clue.
*Shrug*, investigation needed.)

Pushed.  Thanks Bilal and reviewers!




Re: Use streaming read API in ANALYZE

2024-04-07 Thread Melanie Plageman
On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote:
> > From 1dc2343661f3edb3b1bc4307afb0e956397eb76c Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Sun, 7 Apr 2024 14:55:22 -0400
> > Subject: [PATCH v10 1/3] Make heapam_scan_analyze_next_[tuple|block] static.
> > 
> > 27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
> > scan_analzye_next_tuple -- leaving their heap AM implementations only
> > called by acquire_sample_rows().
> 
> Ugh, I don't think 27bc1772fc81 makes much sense. But that's unrelated to this
> thread.  I did raise that separately
> https://www.postgresql.org/message-id/20240407214001.jgpg5q3yv33ve6y3%40awork3.anarazel.de
> 
> Unless I seriously missed something, I see no alternative to reverting that
> commit.

Noted. I'll give up on this refactor then. Lots of churn for no gain.
Attached v11 is just Bilal's v8 patch rebased to apply cleanly and with
a few tweaks (I changed one of the loop conditions. All other changes
are to comments and commit message).

> > @@ -1206,11 +1357,13 @@ acquire_sample_rows(Relation onerel, int elevel,
> > break;
> >  
> > prefetch_block = BlockSampler_Next(_bs);
> > -   PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, 
> > prefetch_block);
> > +   PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, 
> > prefetch_block);
> > }
> > }
> >  #endif
> >  
> > +   scan->rs_cbuf = InvalidBuffer;
> > +
> > /* Outer loop over blocks to sample */
> > while (BlockSampler_HasMore())
> > {
> 
> I don't think it's good to move a lot of code *and* change how it is
> structured in the same commit. Makes it much harder to actually see changes /
> makes git blame harder to use / etc.

Yep.

> > From 90d115c2401567be65bcf64393a6d3b39286779e Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Sun, 7 Apr 2024 15:28:32 -0400
> > Subject: [PATCH v10 2/3] Use streaming read API in ANALYZE
> >
> > The ANALYZE command prefetches and reads sample blocks chosen by a
> > BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for
> > each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.
> >
> > Author: Nazir Bilal Yavuz
> > Reviewed-by: Melanie Plageman
> > Discussion: 
> > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
> > ---
> >  src/backend/commands/analyze.c | 89 ++
> >  1 file changed, 26 insertions(+), 63 deletions(-)
> 
> That's a very nice demonstration of how this makes good prefetching easier...

Agreed. Yay streaming read API and Bilal!

> > From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman 
> > Date: Sun, 7 Apr 2024 15:38:41 -0400
> > Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore()
> > 
> > A previous commit stopped using BlockSampler_HasMore() for flow control
> > in acquire_sample_rows(). There seems little use now for
> > BlockSampler_HasMore(). It should be sufficient to return
> > InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore()
> > would have returned false. Remove BlockSampler_HasMore().
> > 
> > Author: Melanie Plageman, Nazir Bilal Yavuz
> > Discussion: 
> > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
> 
> The justification here seems somewhat odd. Sure, the previous commit stopped
> using BlockSampler_HasMore in acquire_sample_rows - but only because it was
> moved to block_sampling_streaming_read_next()?

It didn't stop using it. It stopped being useful. The reason it existed,
as far as I can tell, was to use it as the while() loop condition in
acquire_sample_rows(). I think it makes much more sense for
BlockSampler_Next() to return InvalidBlockNumber when there are no more
blocks -- not to assert you don't call it when there aren't any more
blocks.

I didn't want to change BlockSampler_Next() in the same commit as the
streaming read user and we can't remove BlockSampler_HasMore() without
changing BlockSampler_Next().

- Melanie
>From 3cb43693c04554f5d46e0dc9156bef36af642593 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sun, 7 Apr 2024 18:17:01 -0400
Subject: [PATCH v11 1/2] Use streaming read API in ANALYZE

The ANALYZE command prefetches and reads sample blocks chosen by a
BlockSampler algorithm. Instead of calling [Prefetch|Read]Buffer() for
each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.

Author: Nazir Bilal Yavuz
Reviewed-by: Melanie Plageman, Andres Freund
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/access/heap/heapam_handler.c | 20 +++---
 src/backend/commands/analyze.c   | 85 +++-
 src/include/access/heapam.h  |  5 +-
 3 files changed, 39 

Re: Use streaming read API in ANALYZE

2024-04-07 Thread Andres Freund
Hi,

On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote:
> From 1dc2343661f3edb3b1bc4307afb0e956397eb76c Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Sun, 7 Apr 2024 14:55:22 -0400
> Subject: [PATCH v10 1/3] Make heapam_scan_analyze_next_[tuple|block] static.
> 
> 27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
> scan_analzye_next_tuple -- leaving their heap AM implementations only
> called by acquire_sample_rows().

Ugh, I don't think 27bc1772fc81 makes much sense. But that's unrelated to this
thread.  I did raise that separately
https://www.postgresql.org/message-id/20240407214001.jgpg5q3yv33ve6y3%40awork3.anarazel.de

Unless I seriously missed something, I see no alternative to reverting that
commit.


> @@ -1206,11 +1357,13 @@ acquire_sample_rows(Relation onerel, int elevel,
>   break;
>  
>   prefetch_block = BlockSampler_Next(_bs);
> - PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, 
> prefetch_block);
> + PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, 
> prefetch_block);
>   }
>   }
>  #endif
>  
> + scan->rs_cbuf = InvalidBuffer;
> +
>   /* Outer loop over blocks to sample */
>   while (BlockSampler_HasMore())
>   {

I don't think it's good to move a lot of code *and* change how it is
structured in the same commit. Makes it much harder to actually see changes /
makes git blame harder to use / etc.



> From 90d115c2401567be65bcf64393a6d3b39286779e Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Sun, 7 Apr 2024 15:28:32 -0400
> Subject: [PATCH v10 2/3] Use streaming read API in ANALYZE
>
> The ANALYZE command prefetches and reads sample blocks chosen by a
> BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for
> each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0.
>
> Author: Nazir Bilal Yavuz
> Reviewed-by: Melanie Plageman
> Discussion: 
> https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
> ---
>  src/backend/commands/analyze.c | 89 ++
>  1 file changed, 26 insertions(+), 63 deletions(-)

That's a very nice demonstration of how this makes good prefetching easier...




> From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman 
> Date: Sun, 7 Apr 2024 15:38:41 -0400
> Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore()
> 
> A previous commit stopped using BlockSampler_HasMore() for flow control
> in acquire_sample_rows(). There seems little use now for
> BlockSampler_HasMore(). It should be sufficient to return
> InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore()
> would have returned false. Remove BlockSampler_HasMore().
> 
> Author: Melanie Plageman, Nazir Bilal Yavuz
> Discussion: 
> https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com

The justification here seems somewhat odd. Sure, the previous commit stopped
using BlockSampler_HasMore in acquire_sample_rows - but only because it was
moved to block_sampling_streaming_read_next()?

Greetings,

Andres Freund




Re: Use streaming read API in ANALYZE

2024-04-07 Thread Melanie Plageman
On Sun, Apr 7, 2024 at 3:57 PM Melanie Plageman
 wrote:
>
> On Thu, Apr 04, 2024 at 02:03:30PM +0300, Nazir Bilal Yavuz wrote:
> >
> > On Wed, 3 Apr 2024 at 23:44, Melanie Plageman  
> > wrote:
> > >
> > > I don't see the point of BlockSampler_HasMore() anymore. I removed it in
> > > the attached and made BlockSampler_Next() return InvalidBlockNumber
> > > under the same conditions. Is there a reason not to do this? There
> > > aren't other callers. If the BlockSampler_Next() wasn't part of an API,
> > > we could just make it the streaming read callback, but that might be
> > > weird as it is now.
> >
> > I agree. There is no reason to have BlockSampler_HasMore() after
> > streaming read API changes.
> >
> > > That and my other ideas in attached. Let me know what you think.
> >
> > I agree with your changes but I am not sure if others agree with all
> > the changes you have proposed. So, I didn't merge 0001 and your ideas
> > yet, instead I wrote a commit message, added some comments, changed ->
> > 'if (bs->t >= bs->N || bs->m >= bs->n)' to 'if (K <= 0 || k <= 0)' and
> > attached it as 0002.
>
> I couldn't quite let go of those changes to acquire_sample_rows(), so
> attached v9 0001 implements them as a preliminary patch before your
> analyze streaming read user. I inlined heapam_scan_analyze_next_block()
> entirely and made heapam_scan_analyze_next_tuple() a static function in
> commands/analyze.c (and tweaked the name).
>
> I made a few tweaks to your patch since it is on top of those changes
> instead of preceding them. Then 0003 is removing BlockSampler_HasMore()
> since it doesn't make sense to remove it before the streaming read user
> was added.

I realized there were a few outdated comments. Fixed in attached v10.

- Melanie
From 1dc2343661f3edb3b1bc4307afb0e956397eb76c Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sun, 7 Apr 2024 14:55:22 -0400
Subject: [PATCH v10 1/3] Make heapam_scan_analyze_next_[tuple|block] static.

27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
scan_analzye_next_tuple -- leaving their heap AM implementations only
called by acquire_sample_rows().

Move heapam_scan_analyze_next_tuple() to analyze.c as a static helper for
acquire_sample_rows() and inline heapam_scan_analyze_next_block().

Author: Melanie Plageman
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/access/heap/heapam_handler.c | 193 +--
 src/backend/commands/analyze.c   | 181 -
 src/include/access/heapam.h  |   9 --
 3 files changed, 179 insertions(+), 204 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58de2c82a70..364dd0b165b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,189 +1054,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
- *
- * This routine holds a buffer pin and lock on the heap page.  They are held
- * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
- * items of the heap page are analyzed.
- */
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-			   BufferAccessStrategy bstrategy)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-
-	/*
-	 * We must maintain a pin on the target page's buffer to ensure that
-	 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out from
-	 * under us.  Hence, pin the page until we are done looking at it.  We
-	 * also choose to hold sharelock on the buffer throughout --- we could
-	 * release and re-acquire sharelock for each tuple, but since we aren't
-	 * doing much work per tuple, the extra lock traffic is probably better
-	 * avoided.
-	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-		blockno, RBM_NORMAL, bstrategy);
-	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
- * is found, true is returned and a tuple is stored in `slot`.  When no more
- * tuples for sampling, false is returned and the pin and lock acquired by
- * heapam_scan_analyze_next_block() are released.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-bool
-heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-			   double *liverows, double *deadrows,
-			   TupleTableSlot *slot)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-	Page		targpage;
-	OffsetNumber maxoffset;
-	BufferHeapTupleTableSlot *hslot;
-
-	Assert(TTS_IS_BUFFERTUPLE(slot));
-
-	hslot = 

Re: Use streaming read API in ANALYZE

2024-04-07 Thread Melanie Plageman
On Thu, Apr 04, 2024 at 02:03:30PM +0300, Nazir Bilal Yavuz wrote:
> 
> On Wed, 3 Apr 2024 at 23:44, Melanie Plageman  
> wrote:
> >
> > I don't see the point of BlockSampler_HasMore() anymore. I removed it in
> > the attached and made BlockSampler_Next() return InvalidBlockNumber
> > under the same conditions. Is there a reason not to do this? There
> > aren't other callers. If the BlockSampler_Next() wasn't part of an API,
> > we could just make it the streaming read callback, but that might be
> > weird as it is now.
> 
> I agree. There is no reason to have BlockSampler_HasMore() after
> streaming read API changes.
> 
> > That and my other ideas in attached. Let me know what you think.
> 
> I agree with your changes but I am not sure if others agree with all
> the changes you have proposed. So, I didn't merge 0001 and your ideas
> yet, instead I wrote a commit message, added some comments, changed ->
> 'if (bs->t >= bs->N || bs->m >= bs->n)' to 'if (K <= 0 || k <= 0)' and
> attached it as 0002.

I couldn't quite let go of those changes to acquire_sample_rows(), so
attached v9 0001 implements them as a preliminary patch before your
analyze streaming read user. I inlined heapam_scan_analyze_next_block()
entirely and made heapam_scan_analyze_next_tuple() a static function in
commands/analyze.c (and tweaked the name).

I made a few tweaks to your patch since it is on top of those changes
instead of preceding them. Then 0003 is removing BlockSampler_HasMore()
since it doesn't make sense to remove it before the streaming read user
was added.

- Melanie
>From b301a73e2ede8eb1d95e6a8c8d89f09959d9c22a Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sun, 7 Apr 2024 14:55:22 -0400
Subject: [PATCH v9 1/3] Make heapam_scan_analyze_next_[tuple|block] static.

27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and
scan_analzye_next_tuple -- leaving their heap AM implementations only
called by acquire_sample_rows().

Move heapam_scan_analyze_next_tuple() to analyze.c as a static helper for
acquire_sample_rows() and inline heapam_scan_analyze_next_block().

Author: Melanie Plageman
Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/backend/access/heap/heapam_handler.c | 183 ---
 src/backend/commands/analyze.c   | 181 +-
 src/include/access/heapam.h  |   9 --
 3 files changed, 174 insertions(+), 199 deletions(-)

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 58de2c82a70..5edac76ceb6 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,189 +1054,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	pfree(isnull);
 }
 
-/*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
- *
- * This routine holds a buffer pin and lock on the heap page.  They are held
- * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
- * items of the heap page are analyzed.
- */
-void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-			   BufferAccessStrategy bstrategy)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-
-	/*
-	 * We must maintain a pin on the target page's buffer to ensure that
-	 * concurrent activity - e.g. HOT pruning - doesn't delete tuples out from
-	 * under us.  Hence, pin the page until we are done looking at it.  We
-	 * also choose to hold sharelock on the buffer throughout --- we could
-	 * release and re-acquire sharelock for each tuple, but since we aren't
-	 * doing much work per tuple, the extra lock traffic is probably better
-	 * avoided.
-	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-		blockno, RBM_NORMAL, bstrategy);
-	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
-}
-
-/*
- * Iterate over tuples in the block selected with
- * heapam_scan_analyze_next_block().  If a tuple that's suitable for sampling
- * is found, true is returned and a tuple is stored in `slot`.  When no more
- * tuples for sampling, false is returned and the pin and lock acquired by
- * heapam_scan_analyze_next_block() are released.
- *
- * *liverows and *deadrows are incremented according to the encountered
- * tuples.
- */
-bool
-heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin,
-			   double *liverows, double *deadrows,
-			   TupleTableSlot *slot)
-{
-	HeapScanDesc hscan = (HeapScanDesc) scan;
-	Page		targpage;
-	OffsetNumber maxoffset;
-	BufferHeapTupleTableSlot *hslot;
-
-	Assert(TTS_IS_BUFFERTUPLE(slot));
-
-	hslot = (BufferHeapTupleTableSlot *) slot;
-	targpage = BufferGetPage(hscan->rs_cbuf);
-	maxoffset = PageGetMaxOffsetNumber(targpage);
-
-	/* Inner loop over all tuples on the selected page */
-	for (; 

Re: Use streaming read API in ANALYZE

2024-04-04 Thread Nazir Bilal Yavuz
Hi,

On Wed, 3 Apr 2024 at 23:44, Melanie Plageman  wrote:
>
>
> I've reviewed the patches inline below and attached a patch that has
> some of my ideas on top of your patch.

Thank you!

>
> > From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
> > From: Nazir Bilal Yavuz 
> > Date: Wed, 3 Apr 2024 15:14:15 +0300
> > Subject: [PATCH v6] Use streaming read API in ANALYZE
> >
> > ANALYZE command gets random tuples using BlockSampler algorithm. Use
> > streaming reads to get these tuples by using BlockSampler algorithm in
> > streaming read API prefetch logic.
> > ---
> >  src/include/access/heapam.h  |  6 +-
> >  src/backend/access/heap/heapam_handler.c | 22 +++---
> >  src/backend/commands/analyze.c   | 85 
> >  3 files changed, 42 insertions(+), 71 deletions(-)
> >
> > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> > index a307fb5f245..633caee9d95 100644
> > --- a/src/include/access/heapam.h
> > +++ b/src/include/access/heapam.h
> > @@ -25,6 +25,7 @@
> >  #include "storage/bufpage.h"
> >  #include "storage/dsm.h"
> >  #include "storage/lockdefs.h"
> > +#include "storage/read_stream.h"
> >  #include "storage/shm_toc.h"
> >  #include "utils/relcache.h"
> >  #include "utils/snapshot.h"
> > @@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
> > struct 
> > GlobalVisState *vistest);
> >
> >  /* in heap/heapam_handler.c*/
> > -extern void heapam_scan_analyze_next_block(TableScanDesc scan,
> > -   
> >  BlockNumber blockno,
> > -   
> >  BufferAccessStrategy bstrategy);
> > +extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
> > +   
> >  ReadStream *stream);
> >  extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
> > 
> >  TransactionId OldestXmin,
> > 
> >  double *liverows, double *deadrows,
> > diff --git a/src/backend/access/heap/heapam_handler.c 
> > b/src/backend/access/heap/heapam_handler.c
> > index 0952d4a98eb..d83fbbe6af3 100644
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, 
> > Relation NewHeap,
> >  }
> >
> >  /*
> > - * Prepare to analyze block `blockno` of `scan`.  The scan has been started
> > - * with SO_TYPE_ANALYZE option.
> > + * Prepare to analyze block returned from streaming object.  If the block 
> > returned
> > + * from streaming object is valid, true is returned; otherwise false is 
> > returned.
> > + * The scan has been started with SO_TYPE_ANALYZE option.
> >   *
> >   * This routine holds a buffer pin and lock on the heap page.  They are 
> > held
> >   * until heapam_scan_analyze_next_tuple() returns false.  That is until 
> > all the
> >   * items of the heap page are analyzed.
> >   */
> > -void
> > -heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
> > -
> > BufferAccessStrategy bstrategy)
> > +bool
> > +heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
> >  {
> >   HeapScanDesc hscan = (HeapScanDesc) scan;
> >
> > @@ -1076,11 +1076,15 @@ heapam_scan_analyze_next_block(TableScanDesc scan, 
> > BlockNumber blockno,
> >* doing much work per tuple, the extra lock traffic is probably 
> > better
> >* avoided.
>
> Personally I think heapam_scan_analyze_next_block() should be inlined.
> It only has a few lines. I would find it clearer inline. At the least,
> there is no reason for it (or heapam_scan_analyze_next_tuple()) to take
> a TableScanDesc instead of a HeapScanDesc.

I agree.

>
> >*/
> > - hscan->rs_cblock = blockno;
> > - hscan->rs_cindex = FirstOffsetNumber;
> > - hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
> > -   
> >   blockno, RBM_NORMAL, bstrategy);
> > + hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
> > + if (hscan->rs_cbuf == InvalidBuffer)
> > + return false;
> > +
> >   LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
> > +
> > + hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
> > + hscan->rs_cindex = FirstOffsetNumber;
> > + return true;
> >  }
>
> >  /*
> > diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
> > index 2fb39f3ede1..764520d5aa2 100644
> > --- a/src/backend/commands/analyze.c
> > +++ b/src/backend/commands/analyze.c
> > @@ -1102,6 +1102,20 

Re: Use streaming read API in ANALYZE

2024-04-03 Thread Melanie Plageman
On Wed, Apr 03, 2024 at 10:25:01PM +0300, Nazir Bilal Yavuz wrote:
>
> I realized the same error while working on Jakub's benchmarking results.
> 
> Cause: I was using the nblocks variable to check how many blocks will
> be returned from the streaming API. But I realized that sometimes the
> number returned from BlockSampler_Init() is not equal to the number of
> blocks that BlockSampler_Next() will return as BlockSampling algorithm
> decides how many blocks to return on the fly by using some random
> seeds.
> 
> There are a couple of solutions I thought of:
> 
> 1- Use BlockSampler_HasMore() instead of nblocks in the main loop in
> the acquire_sample_rows():
> 
> Streaming API uses this function to prefetch block numbers.
> BlockSampler_HasMore() will reach to the end first as it is used while
> prefetching, so it will start to return false while there are still
> buffers to return from the streaming API. That will cause some buffers
> at the end to not be processed.
> 
> 2- Expose something (function, variable etc.) from the streaming API
> to understand if the read is finished and there is no buffer to
> return:
> 
> I think this works but I am not sure if the streaming API allows
> something like that.
> 
> 3- Check every buffer returned from the streaming API, if it is
> invalid stop the main loop in the acquire_sample_rows():
> 
> This solves the problem but there will be two if checks for each
> buffer returned,
> - in heapam_scan_analyze_next_block() to check if the returned buffer is 
> invalid
> - to break main loop in acquire_sample_rows() if
> heapam_scan_analyze_next_block() returns false
> One of the if cases can be bypassed by moving
> heapam_scan_analyze_next_block()'s code to the main loop in the
> acquire_sample_rows().
> 
> I implemented the third solution, here is v6.

I've reviewed the patches inline below and attached a patch that has
some of my ideas on top of your patch.

> From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
> From: Nazir Bilal Yavuz 
> Date: Wed, 3 Apr 2024 15:14:15 +0300
> Subject: [PATCH v6] Use streaming read API in ANALYZE
> 
> ANALYZE command gets random tuples using BlockSampler algorithm. Use
> streaming reads to get these tuples by using BlockSampler algorithm in
> streaming read API prefetch logic.
> ---
>  src/include/access/heapam.h  |  6 +-
>  src/backend/access/heap/heapam_handler.c | 22 +++---
>  src/backend/commands/analyze.c   | 85 
>  3 files changed, 42 insertions(+), 71 deletions(-)
> 
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index a307fb5f245..633caee9d95 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -25,6 +25,7 @@
>  #include "storage/bufpage.h"
>  #include "storage/dsm.h"
>  #include "storage/lockdefs.h"
> +#include "storage/read_stream.h"
>  #include "storage/shm_toc.h"
>  #include "utils/relcache.h"
>  #include "utils/snapshot.h"
> @@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
> struct 
> GlobalVisState *vistest);
>  
>  /* in heap/heapam_handler.c*/
> -extern void heapam_scan_analyze_next_block(TableScanDesc scan,
> - 
>BlockNumber blockno,
> - 
>BufferAccessStrategy bstrategy);
> +extern bool heapam_scan_analyze_next_block(TableScanDesc scan,
> + 
>ReadStream *stream);
>  extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
>   
>TransactionId OldestXmin,
>   
>double *liverows, double *deadrows,
> diff --git a/src/backend/access/heap/heapam_handler.c 
> b/src/backend/access/heap/heapam_handler.c
> index 0952d4a98eb..d83fbbe6af3 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, 
> Relation NewHeap,
>  }
>  
>  /*
> - * Prepare to analyze block `blockno` of `scan`.  The scan has been started
> - * with SO_TYPE_ANALYZE option.
> + * Prepare to analyze block returned from streaming object.  If the block 
> returned
> + * from streaming object is valid, true is returned; otherwise false is 
> returned.
> + * The scan has been started with SO_TYPE_ANALYZE option.
>   *
>   * This routine holds a buffer pin and lock on the heap page.  They are held
>   * until heapam_scan_analyze_next_tuple() returns false.  That is until all 
> the
>   * items of the heap page are analyzed.
>   */
> -void
> -heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
> - 

Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

Thank you for looking into this!

On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas  wrote:
>
> On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:
> > Streaming API has been committed but the committed version has a minor
> > change, the read_stream_begin_relation function takes Relation instead
> > of BufferManagerRelation now. So, here is a v5 which addresses this
> > change.
>
> I'm getting a repeatable segfault / assertion failure with this:
>
> postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
> CREATE TABLE
> postgres=# insert into tengiga select g, repeat('x', 900) from
> generate_series(1, 140) g;
> INSERT 0 140
> postgres=# set default_statistics_target = 10; ANALYZE tengiga;
> SET
> ANALYZE
> postgres=# set default_statistics_target = 100; ANALYZE tengiga;
> SET
> ANALYZE
> postgres=# set default_statistics_target =1000; ANALYZE tengiga;
> SET
> server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File:
> "heapam_handler.c", Line: 1079, PID: 262232
> postgres: heikki postgres [local]
> ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
> postgres: heikki postgres [local]
> ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]
> postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
> postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
> postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
> postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
> postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
> postgres: heikki postgres [local]
> ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
> postgres: heikki postgres [local]
> ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]
> postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
> postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
> postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
> postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
> postgres: heikki postgres [local]
> ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]
> postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
> postgres: heikki postgres [local]
> ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]
> postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
> postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
> postgres: heikki postgres [local]
> ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]
> postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4]
> /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
> postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
> 2024-04-03 20:15:49.157 EEST [262101] LOG:  server process (PID 262232)
> was terminated by signal 6: Aborted

I realized the same error while working on Jakub's benchmarking results.

Cause: I was using the nblocks variable to check how many blocks will
be returned from the streaming API. But I realized that sometimes the
number returned from BlockSampler_Init() is not equal to the number of
blocks that BlockSampler_Next() will return as BlockSampling algorithm
decides how many blocks to return on the fly by using some random
seeds.

There are a couple of solutions I thought of:

1- Use BlockSampler_HasMore() instead of nblocks in the main loop in
the acquire_sample_rows():

Streaming API uses this function to prefetch block numbers.
BlockSampler_HasMore() will reach to the end first as it is used while
prefetching, so it will start to return false while there are still
buffers to return from the streaming API. That will cause some buffers
at the end to not be processed.

2- Expose something (function, variable etc.) from the streaming API
to understand if the read is finished and there is no buffer to
return:

I think this works but I am not sure if the streaming API allows
something like that.

3- Check every buffer returned from the streaming API, if it is
invalid stop the main loop in the acquire_sample_rows():

This solves the problem but there will be two if checks for each
buffer returned,
- in heapam_scan_analyze_next_block() to check if the returned buffer is invalid
- to break main loop in acquire_sample_rows() if
heapam_scan_analyze_next_block() returns false
One of the if cases can be bypassed by moving
heapam_scan_analyze_next_block()'s code to the main loop in the
acquire_sample_rows().

I implemented the third solution, here is v6.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 3 Apr 2024 15:14:15 +0300
Subject: [PATCH v6] Use streaming read API in ANALYZE

ANALYZE command gets random tuples using 

Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi Jakub,

Thank you for looking into this and doing a performance analysis.

On Wed, 3 Apr 2024 at 11:42, Jakub Wartak  wrote:
>
> On Tue, Apr 2, 2024 at 9:24 AM Nazir Bilal Yavuz  wrote:
> [..]
> > v4 is rebased on top of v14 streaming read API changes.
>
> Hi Nazir, so with streaming API committed, I gave a try to this patch.
> With autovacuum=off and 30GB table on NVMe (with standard readahead of
> 256kb and ext4, Debian 12, kernel 6.1.0, shared_buffers = 128MB
> default) created using: create table t as select repeat('a', 100) || i
> || repeat('b', 500) as filler from generate_series(1, 4500) as i;
>
> on master, effect of mainteance_io_concurency [default 10] is like
> that (when resetting the fs cache after each ANALYZE):
>
> m_io_c = 0:
> Time: 3137.914 ms (00:03.138)
> Time: 3094.540 ms (00:03.095)
> Time: 3452.513 ms (00:03.453)
>
> m_io_c = 1:
> Time: 2972.751 ms (00:02.973)
> Time: 2939.551 ms (00:02.940)
> Time: 2904.428 ms (00:02.904)
>
> m_io_c = 2:
> Time: 1580.260 ms (00:01.580)
> Time: 1572.132 ms (00:01.572)
> Time: 1558.334 ms (00:01.558)
>
> m_io_c = 4:
> Time: 938.304 ms
> Time: 931.772 ms
> Time: 920.044 ms
>
> m_io_c = 8:
> Time: 666.025 ms
> Time: 660.241 ms
> Time: 648.848 ms
>
> m_io_c = 16:
> Time: 542.450 ms
> Time: 561.155 ms
> Time: 539.683 ms
>
> m_io_c = 32:
> Time: 538.487 ms
> Time: 541.705 ms
> Time: 538.101 ms
>
> with patch applied:
>
> m_io_c = 0:
> Time: 3106.469 ms (00:03.106)
> Time: 3140.343 ms (00:03.140)
> Time: 3044.133 ms (00:03.044)
>
> m_io_c = 1:
> Time: 2959.817 ms (00:02.960)
> Time: 2920.265 ms (00:02.920)
> Time: 2911.745 ms (00:02.912)
>
> m_io_c = 2:
> Time: 1581.912 ms (00:01.582)
> Time: 1561.444 ms (00:01.561)
> Time: 1558.251 ms (00:01.558)
>
> m_io_c = 4:
> Time: 908.116 ms
> Time: 901.245 ms
> Time: 901.071 ms
>
> m_io_c = 8:
> Time: 619.870 ms
> Time: 620.327 ms
> Time: 614.266 ms
>
> m_io_c = 16:
> Time: 529.885 ms
> Time: 526.958 ms
> Time: 528.474 ms
>
> m_io_c = 32:
> Time: 521.185 ms
> Time: 520.713 ms
> Time: 517.729 ms
>
> No difference to me, which seems to be good. I've double checked and
> patch used the new way
>
> acquire_sample_rows -> heapam_scan_analyze_next_block ->
> ReadBufferExtended -> ReadBuffer_common (inlined) -> WaitReadBuffers
> -> mdreadv -> FileReadV -> pg_preadv (inlined)
> acquire_sample_rows -> heapam_scan_analyze_next_block ->
> ReadBufferExtended -> ReadBuffer_common (inlined) -> StartReadBuffer
> -> ...
>
> I gave also io_combine_limit to 32 (max, 256kB) a try and got those
> slightly better results:
>
> [..]
> m_io_c = 16:
> Time: 494.599 ms
> Time: 496.345 ms
> Time: 973.500 ms
>
> m_io_c = 32:
> Time: 461.031 ms
> Time: 449.037 ms
> Time: 443.375 ms
>
> and that (last one) apparently was able to push it to ~50-60k still
> random IOPS range, the rareq-sz was still ~8 (9.9) kB as analyze was
> still reading random , so I assume no merging was done:
>
> Devicer/s rMB/s   rrqm/s  %rrqm r_await rareq-sz
> w/s wMB/s   wrqm/s  %wrqm w_await wareq-sz d/s dMB/s
> drqm/s  %drqm d_await dareq-sz f/s f_await  aqu-sz  %util
> nvme0n1   61212.00591.82 0.00   0.000.10 9.90
> 2.00  0.02 0.00   0.000.0012.000.00  0.00
> 0.00   0.000.00 0.000.000.006.28  85.20
>
> So in short it looks good to me.

My results are similar to yours, also I realized a bug while working
on your benchmarking cases. I will share the cause and the fix soon.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Re: Use streaming read API in ANALYZE

2024-04-03 Thread Heikki Linnakangas

On 03/04/2024 13:31, Nazir Bilal Yavuz wrote:

Streaming API has been committed but the committed version has a minor
change, the read_stream_begin_relation function takes Relation instead
of BufferManagerRelation now. So, here is a v5 which addresses this
change.


I'm getting a repeatable segfault / assertion failure with this:

postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10);
CREATE TABLE
postgres=# insert into tengiga select g, repeat('x', 900) from 
generate_series(1, 140) g;

INSERT 0 140
postgres=# set default_statistics_target = 10; ANALYZE tengiga;
SET
ANALYZE
postgres=# set default_statistics_target = 100; ANALYZE tengiga;
SET
ANALYZE
postgres=# set default_statistics_target =1000; ANALYZE tengiga;
SET
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File: 
"heapam_handler.c", Line: 1079, PID: 262232
postgres: heikki postgres [local] 
ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8]
postgres: heikki postgres [local] 
ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34]

postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34]
postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a]
postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9]
postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0]
postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe]
postgres: heikki postgres [local] 
ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9]
postgres: heikki postgres [local] 
ANALYZE(ProcessUtility+0x136)[0x564889f0afb1]

postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8]
postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b]
postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015]
postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6]
postgres: heikki postgres [local] 
ANALYZE(PostgresMain+0x80c)[0x564889f06fd7]

postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876]
postgres: heikki postgres [local] 
ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3]

postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e]
postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0]
postgres: heikki postgres [local] 
ANALYZE(PostmasterMain+0x152b)[0x564889e2214d]

postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305]
postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61]
2024-04-03 20:15:49.157 EEST [262101] LOG:  server process (PID 262232) 
was terminated by signal 6: Aborted


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Use streaming read API in ANALYZE

2024-04-03 Thread Nazir Bilal Yavuz
Hi,

On Tue, 2 Apr 2024 at 10:23, Nazir Bilal Yavuz  wrote:
>
> v4 is rebased on top of v14 streaming read API changes.

Streaming API has been committed but the committed version has a minor
change, the read_stream_begin_relation function takes Relation instead
of BufferManagerRelation now. So, here is a v5 which addresses this
change.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 43e2f2b32e2fdb7e1fd787b1d8595768741f4792 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Wed, 3 Apr 2024 01:22:59 +0300
Subject: [PATCH v5] Use streaming read API in ANALYZE

ANALYZE command gets random tuples using BlockSampler algorithm. Use
streaming reads to get these tuples by using BlockSampler algorithm in
streaming read API prefetch logic.
---
 src/include/access/heapam.h  |  4 +-
 src/backend/access/heap/heapam_handler.c | 16 ++---
 src/backend/commands/analyze.c   | 85 
 3 files changed, 36 insertions(+), 69 deletions(-)

diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b632fe953c4..4e35caeb42e 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,6 +25,7 @@
 #include "storage/bufpage.h"
 #include "storage/dsm.h"
 #include "storage/lockdefs.h"
+#include "storage/read_stream.h"
 #include "storage/shm_toc.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
@@ -374,8 +375,7 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
 
 /* in heap/heapam_handler.c*/
 extern void heapam_scan_analyze_next_block(TableScanDesc scan,
-		   BlockNumber blockno,
-		   BufferAccessStrategy bstrategy);
+		   ReadStream *stream);
 extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
 		   TransactionId OldestXmin,
 		   double *liverows, double *deadrows,
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index c86000d245b..0533d9660c4 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -1054,16 +1054,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 }
 
 /*
- * Prepare to analyze block `blockno` of `scan`.  The scan has been started
- * with SO_TYPE_ANALYZE option.
+ * Prepare to analyze block returned from streaming object.  The scan has been
+ * started with SO_TYPE_ANALYZE option.
  *
  * This routine holds a buffer pin and lock on the heap page.  They are held
  * until heapam_scan_analyze_next_tuple() returns false.  That is until all the
  * items of the heap page are analyzed.
  */
 void
-heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
-			   BufferAccessStrategy bstrategy)
+heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
 {
 	HeapScanDesc hscan = (HeapScanDesc) scan;
 
@@ -1076,11 +1075,12 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
 	 * doing much work per tuple, the extra lock traffic is probably better
 	 * avoided.
 	 */
-	hscan->rs_cblock = blockno;
-	hscan->rs_cindex = FirstOffsetNumber;
-	hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM,
-		blockno, RBM_NORMAL, bstrategy);
+	hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
+	Assert(BufferIsValid(hscan->rs_cbuf));
 	LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
+
+	hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
+	hscan->rs_cindex = FirstOffsetNumber;
 }
 
 /*
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2fb39f3ede1..105285c3ea2 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
 	return stats;
 }
 
+/*
+ * Prefetch callback function to get next block number while using
+ * BlockSampling algorithm
+ */
+static BlockNumber
+block_sampling_streaming_read_next(ReadStream *stream,
+   void *user_data,
+   void *per_buffer_data)
+{
+	BlockSamplerData *bs = user_data;
+
+	return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber;
+}
+
 /*
  * acquire_sample_rows -- acquire a random sample of rows from the heap
  *
@@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
-#ifdef USE_PREFETCH
-	int			prefetch_maximum = 0;	/* blocks to prefetch if enabled */
-	BlockSamplerData prefetch_bs;
-#endif
+	ReadStream *stream;
 
 	Assert(targrows > 0);
 
@@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel,
 	randseed = pg_prng_uint32(_global_prng_state);
 	nblocks = BlockSampler_Init(, totalblocks, targrows, randseed);
 
-#ifdef USE_PREFETCH
-	prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace);
-	/* Create another BlockSampler, using the same seed, for prefetching */
-	if (prefetch_maximum)
-		(void) BlockSampler_Init(_bs, totalblocks, targrows, 

Re: Use streaming read API in ANALYZE

2024-04-03 Thread Jakub Wartak
On Tue, Apr 2, 2024 at 9:24 AM Nazir Bilal Yavuz  wrote:
[..]
> v4 is rebased on top of v14 streaming read API changes.

Hi Nazir, so with streaming API committed, I gave a try to this patch.
With autovacuum=off and 30GB table on NVMe (with standard readahead of
256kb and ext4, Debian 12, kernel 6.1.0, shared_buffers = 128MB
default) created using: create table t as select repeat('a', 100) || i
|| repeat('b', 500) as filler from generate_series(1, 4500) as i;

on master, effect of mainteance_io_concurency [default 10] is like
that (when resetting the fs cache after each ANALYZE):

m_io_c = 0:
Time: 3137.914 ms (00:03.138)
Time: 3094.540 ms (00:03.095)
Time: 3452.513 ms (00:03.453)

m_io_c = 1:
Time: 2972.751 ms (00:02.973)
Time: 2939.551 ms (00:02.940)
Time: 2904.428 ms (00:02.904)

m_io_c = 2:
Time: 1580.260 ms (00:01.580)
Time: 1572.132 ms (00:01.572)
Time: 1558.334 ms (00:01.558)

m_io_c = 4:
Time: 938.304 ms
Time: 931.772 ms
Time: 920.044 ms

m_io_c = 8:
Time: 666.025 ms
Time: 660.241 ms
Time: 648.848 ms

m_io_c = 16:
Time: 542.450 ms
Time: 561.155 ms
Time: 539.683 ms

m_io_c = 32:
Time: 538.487 ms
Time: 541.705 ms
Time: 538.101 ms

with patch applied:

m_io_c = 0:
Time: 3106.469 ms (00:03.106)
Time: 3140.343 ms (00:03.140)
Time: 3044.133 ms (00:03.044)

m_io_c = 1:
Time: 2959.817 ms (00:02.960)
Time: 2920.265 ms (00:02.920)
Time: 2911.745 ms (00:02.912)

m_io_c = 2:
Time: 1581.912 ms (00:01.582)
Time: 1561.444 ms (00:01.561)
Time: 1558.251 ms (00:01.558)

m_io_c = 4:
Time: 908.116 ms
Time: 901.245 ms
Time: 901.071 ms

m_io_c = 8:
Time: 619.870 ms
Time: 620.327 ms
Time: 614.266 ms

m_io_c = 16:
Time: 529.885 ms
Time: 526.958 ms
Time: 528.474 ms

m_io_c = 32:
Time: 521.185 ms
Time: 520.713 ms
Time: 517.729 ms

No difference to me, which seems to be good. I've double checked and
patch used the new way

acquire_sample_rows -> heapam_scan_analyze_next_block ->
ReadBufferExtended -> ReadBuffer_common (inlined) -> WaitReadBuffers
-> mdreadv -> FileReadV -> pg_preadv (inlined)
acquire_sample_rows -> heapam_scan_analyze_next_block ->
ReadBufferExtended -> ReadBuffer_common (inlined) -> StartReadBuffer
-> ...

I gave also io_combine_limit to 32 (max, 256kB) a try and got those
slightly better results:

[..]
m_io_c = 16:
Time: 494.599 ms
Time: 496.345 ms
Time: 973.500 ms

m_io_c = 32:
Time: 461.031 ms
Time: 449.037 ms
Time: 443.375 ms

and that (last one) apparently was able to push it to ~50-60k still
random IOPS range, the rareq-sz was still ~8 (9.9) kB as analyze was
still reading random , so I assume no merging was done:

Devicer/s rMB/s   rrqm/s  %rrqm r_await rareq-sz
w/s wMB/s   wrqm/s  %wrqm w_await wareq-sz d/s dMB/s
drqm/s  %drqm d_await dareq-sz f/s f_await  aqu-sz  %util
nvme0n1   61212.00591.82 0.00   0.000.10 9.90
2.00  0.02 0.00   0.000.0012.000.00  0.00
0.00   0.000.00 0.000.000.006.28  85.20

So in short it looks good to me.

-Jakub Wartak.




Re: Use streaming read API in ANALYZE

2024-04-02 Thread Nazir Bilal Yavuz
Hi,

Thanks for the review!

On Wed, 27 Mar 2024 at 23:15, Melanie Plageman
 wrote:
>
> On Tue, Mar 26, 2024 at 02:51:27PM +0300, Nazir Bilal Yavuz wrote:
> > Hi,
> >
> > On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz  wrote:
> > >
> > >
> > > The new version of the streaming read API [1] is posted. I updated the
> > > streaming read API changes patch (0001), using the streaming read API
> > > in ANALYZE patch (0002) remains the same. This should make it easier
> > > to review as it can be applied on top of master
> > >
> > >
> >
> > The new version of the streaming read API is posted [1]. I rebased the
> > patch on top of master and v9 of the streaming read API.
> >
> > There is a minimal change in the 'using the streaming read API in ANALYZE
> > patch (0002)', I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
> > to copy exactly the same behavior as before. Also, some benchmarking
> > results:
> >
> > I created a 22 GB table and set the size of shared buffers to 30GB, the
> > rest is default.
> >
> > ╔═══╦═╦╗
> > ║   ║  Avg Timings in ms  ║║
> > ╠═══╬══╦══╬╣
> > ║   ║  master  ║  patched ║ percentage ║
> > ╠═══╬══╬══╬╣
> > ║ Both OS cache and ║  ║  ║║
> > ║  shared buffers are clear ║ 513.9247 ║ 463.1019 ║%9.9║
> > ╠═══╬══╬══╬╣
> > ║   OS cache is loaded but  ║  ║  ║║
> > ║  shared buffers are clear ║ 423.1097 ║ 354.3277 ║%16.3   ║
> > ╠═══╬══╬══╬╣
> > ║ Shared buffers are loaded ║  ║  ║║
> > ║   ║  89.2846 ║  84.6952 ║%5.1║
> > ╚═══╩══╩══╩╝
> >
> > Any kind of feedback would be appreciated.
>
> Thanks for working on this!
>
> A general review comment: I noticed you have the old streaming read
> (pgsr) naming still in a few places (including comments) -- so I would
> just make sure and update everywhere when you rebase in Thomas' latest
> version of the read stream API.

Done.

>
> > From c7500cc1b9068ff0b704181442999cd8bed58658 Mon Sep 17 00:00:00 2001
> > From: Nazir Bilal Yavuz 
> > Date: Mon, 19 Feb 2024 14:30:47 +0300
> > Subject: [PATCH v3 2/2] Use streaming read API in ANALYZE
> >
> > --- a/src/backend/commands/analyze.c
> > +++ b/src/backend/commands/analyze.c
> > @@ -1102,6 +1102,26 @@ examine_attribute(Relation onerel, int attnum, Node 
> > *index_expr)
> >   return stats;
> >  }
> >
> > +/*
> > + * Prefetch callback function to get next block number while using
> > + * BlockSampling algorithm
> > + */
> > +static BlockNumber
> > +pg_block_sampling_streaming_read_next(StreamingRead *stream,
> > +   
> > void *user_data,
> > +   
> > void *per_buffer_data)
>
> I don't think you need the pg_ prefix

Done.

>
> > +{
> > + BlockSamplerData *bs = user_data;
> > + BlockNumber *current_block = per_buffer_data;
>
> Why can't you just do BufferGetBlockNumber() on the buffer returned from
> the read stream API instead of allocating per_buffer_data for the block
> number?

Done.

>
> > +
> > + if (BlockSampler_HasMore(bs))
> > + *current_block = BlockSampler_Next(bs);
> > + else
> > + *current_block = InvalidBlockNumber;
> > +
> > + return *current_block;
>
>
> I think we'd like to keep the read stream code in heapam-specific code.
> Instead of doing streaming_read_buffer_begin() here, you could put this
> in heap_beginscan() or initscan() guarded by
> scan->rs_base.rs_flags & SO_TYPE_ANALYZE

In the recent changes [1], heapam_scan_analyze_next_[block | tuple]
are removed from tableam. They are directly called from
heapam-specific code now. So, IMO, no need to do this now.

v4 is rebased on top of v14 streaming read API changes.

[1] 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 725a3d875fb1d675f5d99d8602a87b5e47b765db Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v4 1/2] v14 Streaming Read API

Part 1:

Provide vectored variant of ReadBuffer().

Break ReadBuffer() up into two steps: StartReadBuffers() and
WaitReadBuffers().  This has two advantages:

1.  Multiple consecutive blocks can be read with one system call.
2.  Advice (hints of future reads) can optionally be issued to the kernel.

The traditional ReadBuffer() function is now implemented in terms of
those functions, to avoid duplication.  For now we still only read a
block at a time so there is no change to generated system calls yet, but
later 

Re: Use streaming read API in ANALYZE

2024-03-27 Thread Melanie Plageman
On Tue, Mar 26, 2024 at 02:51:27PM +0300, Nazir Bilal Yavuz wrote:
> Hi,
> 
> On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz  wrote:
> >
> >
> > The new version of the streaming read API [1] is posted. I updated the
> > streaming read API changes patch (0001), using the streaming read API
> > in ANALYZE patch (0002) remains the same. This should make it easier
> > to review as it can be applied on top of master
> >
> >
> 
> The new version of the streaming read API is posted [1]. I rebased the
> patch on top of master and v9 of the streaming read API.
> 
> There is a minimal change in the 'using the streaming read API in ANALYZE
> patch (0002)', I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
> to copy exactly the same behavior as before. Also, some benchmarking
> results:
> 
> I created a 22 GB table and set the size of shared buffers to 30GB, the
> rest is default.
> 
> ╔═══╦═╦╗
> ║   ║  Avg Timings in ms  ║║
> ╠═══╬══╦══╬╣
> ║   ║  master  ║  patched ║ percentage ║
> ╠═══╬══╬══╬╣
> ║ Both OS cache and ║  ║  ║║
> ║  shared buffers are clear ║ 513.9247 ║ 463.1019 ║%9.9║
> ╠═══╬══╬══╬╣
> ║   OS cache is loaded but  ║  ║  ║║
> ║  shared buffers are clear ║ 423.1097 ║ 354.3277 ║%16.3   ║
> ╠═══╬══╬══╬╣
> ║ Shared buffers are loaded ║  ║  ║║
> ║   ║  89.2846 ║  84.6952 ║%5.1║
> ╚═══╩══╩══╩╝
> 
> Any kind of feedback would be appreciated.

Thanks for working on this!

A general review comment: I noticed you have the old streaming read
(pgsr) naming still in a few places (including comments) -- so I would
just make sure and update everywhere when you rebase in Thomas' latest
version of the read stream API.

> From c7500cc1b9068ff0b704181442999cd8bed58658 Mon Sep 17 00:00:00 2001
> From: Nazir Bilal Yavuz 
> Date: Mon, 19 Feb 2024 14:30:47 +0300
> Subject: [PATCH v3 2/2] Use streaming read API in ANALYZE
>
> --- a/src/backend/commands/analyze.c
> +++ b/src/backend/commands/analyze.c
> @@ -1102,6 +1102,26 @@ examine_attribute(Relation onerel, int attnum, Node 
> *index_expr)
>   return stats;
>  }
>  
> +/*
> + * Prefetch callback function to get next block number while using
> + * BlockSampling algorithm
> + */
> +static BlockNumber
> +pg_block_sampling_streaming_read_next(StreamingRead *stream,
> +   void 
> *user_data,
> +   void 
> *per_buffer_data)

I don't think you need the pg_ prefix

> +{
> + BlockSamplerData *bs = user_data;
> + BlockNumber *current_block = per_buffer_data;

Why can't you just do BufferGetBlockNumber() on the buffer returned from
the read stream API instead of allocating per_buffer_data for the block
number?

> +
> + if (BlockSampler_HasMore(bs))
> + *current_block = BlockSampler_Next(bs);
> + else
> + *current_block = InvalidBlockNumber;
> +
> + return *current_block;


I think we'd like to keep the read stream code in heapam-specific code.
Instead of doing streaming_read_buffer_begin() here, you could put this
in heap_beginscan() or initscan() guarded by
scan->rs_base.rs_flags & SO_TYPE_ANALYZE

same with streaming_read_buffer_end()/heap_endscan().

You'd also then need to save the reference to the read stream in the
HeapScanDescData.

> + stream = streaming_read_buffer_begin(STREAMING_READ_MAINTENANCE,
> + 
>  vac_strategy,
> + 
>  BMR_REL(scan->rs_rd),
> + 
>  MAIN_FORKNUM,
> + 
>  pg_block_sampling_streaming_read_next,
> + 
>  ,
> + 
>  sizeof(BlockSamplerData));
>  
>   /* Outer loop over blocks to sample */

In fact, I think you could use this opportunity to get rid of the block
dependency in acquire_sample_rows() altogether.

Looking at the code now, it seems like you could just invoke
heapam_scan_analyze_next_block() (maybe rename it to
heapam_scan_analyze_next_buffer() or something) from
heapam_scan_analyze_next_tuple() and remove
table_scan_analyze_next_block() entirely.

Then table AMs can figure out how they want 

Re: Use streaming read API in ANALYZE

2024-03-26 Thread Nazir Bilal Yavuz
Hi,

On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz  wrote:
>
>
> The new version of the streaming read API [1] is posted. I updated the
> streaming read API changes patch (0001), using the streaming read API
> in ANALYZE patch (0002) remains the same. This should make it easier
> to review as it can be applied on top of master
>
>

The new version of the streaming read API is posted [1]. I rebased the
patch on top of master and v9 of the streaming read API.

There is a minimal change in the 'using the streaming read API in ANALYZE
patch (0002)', I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
to copy exactly the same behavior as before. Also, some benchmarking
results:

I created a 22 GB table and set the size of shared buffers to 30GB, the
rest is default.

╔═══╦═╦╗
║   ║  Avg Timings in ms  ║║
╠═══╬══╦══╬╣
║   ║  master  ║  patched ║ percentage ║
╠═══╬══╬══╬╣
║ Both OS cache and ║  ║  ║║
║  shared buffers are clear ║ 513.9247 ║ 463.1019 ║%9.9║
╠═══╬══╬══╬╣
║   OS cache is loaded but  ║  ║  ║║
║  shared buffers are clear ║ 423.1097 ║ 354.3277 ║%16.3   ║
╠═══╬══╬══╬╣
║ Shared buffers are loaded ║  ║  ║║
║   ║  89.2846 ║  84.6952 ║%5.1║
╚═══╩══╩══╩╝

Any kind of feedback would be appreciated.

[1]:
https://www.postgresql.org/message-id/CA%2BhUKGL-ONQnnnp-SONCFfLJzqcpAheuzZ%2B-yTrD9WBM-GmAcg%40mail.gmail.com

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 7278c354d80e4d1f21c6fa0d810a723789f2d722 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v3 1/2] Streaming read API changes that are not committed to
 master yet

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
 src/include/storage/bufmgr.h  |  45 +-
 src/include/storage/streaming_read.h  |  50 ++
 src/backend/storage/Makefile  |   2 +-
 src/backend/storage/aio/Makefile  |  14 +
 src/backend/storage/aio/meson.build   |   5 +
 src/backend/storage/aio/streaming_read.c  | 678 +
 src/backend/storage/buffer/bufmgr.c   | 706 --
 src/backend/storage/buffer/localbuf.c |  14 +-
 src/backend/storage/meson.build   |   1 +
 src/backend/utils/misc/guc_tables.c   |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml  |  14 +
 src/tools/pgindent/typedefs.list  |   2 +
 13 files changed, 1309 insertions(+), 237 deletions(-)
 create mode 100644 src/include/storage/streaming_read.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..1cc198bde21 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_BUFFER_IO_SIZE PG_IOV_MAX
+#define DEFAULT_BUFFER_IO_SIZE Min(MAX_BUFFER_IO_SIZE, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int buffer_io_size;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 		ForkNumber forkNum, BlockNumber blockNum,
 		ReadBufferMode mode, BufferAccessStrategy strategy,
 		bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int16		nblocks;
+	

Re: Use streaming read API in ANALYZE

2024-02-28 Thread Nazir Bilal Yavuz
Hi,

On Mon, 19 Feb 2024 at 18:13, Nazir Bilal Yavuz  wrote:
>
> I worked on using the currently proposed streaming read API [1] in ANALYZE. 
> The patch is attached. 0001 is the not yet merged streaming read API code 
> changes that can be applied to the master, 0002 is the actual code.
>
> The blocks to analyze are obtained by using the streaming read API now.
>
> - Since streaming read API is already doing prefetch, I removed the #ifdef 
> USE_PREFETCH code from acquire_sample_rows().
>
> - Changed 'while (BlockSampler_HasMore())' to 'while (nblocks)' because 
> the prefetch mechanism in the streaming read API will advance 'bs' before 
> returning buffers.
>
> - Removed BlockNumber and BufferAccessStrategy from the declaration of 
> scan_analyze_next_block(), passing pgsr (PgStreamingRead) instead of them.
>
> I counted syscalls of analyzing ~5GB table. It can be seen that the patched 
> version did ~1300 less read calls.
>
> Patched:
>
> % time seconds  usecs/call callserrors syscall
> -- --- --- - - 
>  39.670.012128   0 29809   pwrite64
>  36.960.011299   0 28594   pread64
>  23.240.007104   0 27611   fadvise64
>
> Master (21a71648d3):
>
> % time seconds  usecs/call callserrors syscall
> -- --- --- - - 
>  38.940.016457   0 29816   pwrite64
>  36.790.015549   0 29850   pread64
>  23.910.010106   0 29848   fadvise64
>
>
> Any kind of feedback would be appreciated.
>
> [1]: 
> https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

The new version of the streaming read API [1] is posted. I updated the
streaming read API changes patch (0001), using the streaming read API
in ANALYZE patch (0002) remains the same. This should make it easier
to review as it can be applied on top of master

[1]: 
https://www.postgresql.org/message-id/CA%2BhUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 21d9043501284c6bae996522ff2f3ac693f81986 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v2 1/2] Streaming read API changes that are not committed to
 master yet

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
 src/include/storage/bufmgr.h |  45 ++
 src/include/storage/streaming_read.h |  52 ++
 src/backend/storage/Makefile |   2 +-
 src/backend/storage/aio/Makefile |  14 +
 src/backend/storage/aio/meson.build  |   5 +
 src/backend/storage/aio/streaming_read.c | 612 
 src/backend/storage/buffer/bufmgr.c  | 687 +++
 src/backend/storage/buffer/localbuf.c|  14 +-
 src/backend/storage/meson.build  |   1 +
 src/tools/pgindent/typedefs.list |   3 +
 10 files changed, 1202 insertions(+), 233 deletions(-)
 create mode 100644 src/include/storage/streaming_read.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..b57f71f97e3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions.  This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,42 @@ extern Buffer ReadBufferWithoutRelcache(RelFileLocator rlocator,
 		ForkNumber forkNum, BlockNumber blockNum,
 		ReadBufferMode mode, BufferAccessStrategy strategy,
 		bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers().  Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int			nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int			io_buffers_len;
+};
+