Re: Use streaming read API in ANALYZE
On Wed, Sep 18, 2024 at 5:13 AM Thomas Munro wrote:

> On Sun, Sep 15, 2024 at 12:14 AM Mats Kindahl wrote:
> > I used the combination of your patch and making the computation of vacattrstats for a relation available through the API and managed to implement something that I think does the right thing. (I just sampled a few different statistics to check if they seem reasonable, like most common vals and most common freqs.) See attached patch.
>
> Cool. I went ahead and committed that small new function and will mark the open item closed.

Thank you Thomas, this will help a lot.

> > I need the vacattrstats to set up the two streams for the internal relations. I can just re-implement them in the same way as is already done, but this seems like a small change that avoids unnecessary code duplication.
>
> Unfortunately we're not in a phase where we can make non-essential changes, we're right about to release and we're only committing fixes, and it seems like you have a way forward (albeit with some duplication). We can keep talking about that for v18.

Yes, I can work around this by re-implementing the same code that is present in PostgreSQL.

> From your earlier email:
> > I'll take a look at the thread. I really think the ReadStream abstraction is a good step in the right direction.
>
> Here's something you or your colleagues might be interested in: I was looking around for a fun extension to streamify as a demo of the technology, and I finished up writing a quick patch to streamify pgvector's HNSW index scan, which worked well enough to share[1] (I think it should in principle be able to scale with the number of graph connections, at least 16x), but then people told me that it's of limited interest because everybody knows that HNSW indexes have to fit in memory (I think there may also be memory prefetch streaming opportunities, unexamined for now). But that made me wonder what the people with the REALLY big indexes do for hyperdimensional graph search on a scale required to build Skynet, and that led me back to Timescale pgvectorscale[2]. I see two obvious signs that this thing is eminently and profitably streamifiable: (1) the stated aim is optimising for indexes that don't fit in memory, hence "Disk" in the name of the research project it is inspired by; (2) I see that DiskANN[3] is aggressively using libaio (Linux) and overlapped/IOCP (Windows). So now I am waiting patiently for a Rustacean to show up with patches for pgvectorscale to use ReadStream, which would already get read-ahead advice and vectored I/O (Linux, macOS, FreeBSD soon hopefully), and hopefully also provide a nice test case for the AIO patch set which redirects buffer reads through io_uring (Linux, basically the newer better libaio) or background I/O workers (other OSes, which works surprisingly competitively). Just BTW, for comparison with DiskANN we have also had early POC-quality patches that drive AIO with overlapped/IOCP (Windows) which will eventually be rebased and proposed (Windows isn't really a primary target, but we wanted to validate that the stuff we're working on has abstractions that will map to the obvious system APIs found in the systems PostgreSQL targets). For completeness, I've also had it mostly working on the POSIX AIO of FreeBSD, HP-UX and AIX (though we dropped support for those last two, so that was a bit of a dead end).
>
> [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ_7NKd46nx1wbyXWriuZSNzsTfm%2BrhEuvU6nxZi3-KVw%40mail.gmail.com
> [2] https://github.com/timescale/pgvectorscale
> [3] https://github.com/microsoft/DiskANN

Thanks Thomas, this looks really interesting. I've forwarded it to the pgvectorscale team.

--
Best wishes,
Mats Kindahl, Timescale
Re: Use streaming read API in ANALYZE
On Sun, Sep 15, 2024 at 12:14 AM Mats Kindahl wrote:
> I used the combination of your patch and making the computation of vacattrstats for a relation available through the API and managed to implement something that I think does the right thing. (I just sampled a few different statistics to check if they seem reasonable, like most common vals and most common freqs.) See attached patch.

Cool. I went ahead and committed that small new function and will mark the open item closed.

> I need the vacattrstats to set up the two streams for the internal relations. I can just re-implement them in the same way as is already done, but this seems like a small change that avoids unnecessary code duplication.

Unfortunately we're not in a phase where we can make non-essential changes, we're right about to release and we're only committing fixes, and it seems like you have a way forward (albeit with some duplication). We can keep talking about that for v18.

From your earlier email:

> I'll take a look at the thread. I really think the ReadStream abstraction is a good step in the right direction.

Here's something you or your colleagues might be interested in: I was looking around for a fun extension to streamify as a demo of the technology, and I finished up writing a quick patch to streamify pgvector's HNSW index scan, which worked well enough to share[1] (I think it should in principle be able to scale with the number of graph connections, at least 16x), but then people told me that it's of limited interest because everybody knows that HNSW indexes have to fit in memory (I think there may also be memory prefetch streaming opportunities, unexamined for now). But that made me wonder what the people with the REALLY big indexes do for hyperdimensional graph search on a scale required to build Skynet, and that led me back to Timescale pgvectorscale[2]. I see two obvious signs that this thing is eminently and profitably streamifiable: (1) the stated aim is optimising for indexes that don't fit in memory, hence "Disk" in the name of the research project it is inspired by; (2) I see that DiskANN[3] is aggressively using libaio (Linux) and overlapped/IOCP (Windows). So now I am waiting patiently for a Rustacean to show up with patches for pgvectorscale to use ReadStream, which would already get read-ahead advice and vectored I/O (Linux, macOS, FreeBSD soon hopefully), and hopefully also provide a nice test case for the AIO patch set which redirects buffer reads through io_uring (Linux, basically the newer better libaio) or background I/O workers (other OSes, which works surprisingly competitively). Just BTW, for comparison with DiskANN we have also had early POC-quality patches that drive AIO with overlapped/IOCP (Windows) which will eventually be rebased and proposed (Windows isn't really a primary target, but we wanted to validate that the stuff we're working on has abstractions that will map to the obvious system APIs found in the systems PostgreSQL targets). For completeness, I've also had it mostly working on the POSIX AIO of FreeBSD, HP-UX and AIX (though we dropped support for those last two, so that was a bit of a dead end).

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJ_7NKd46nx1wbyXWriuZSNzsTfm%2BrhEuvU6nxZi3-KVw%40mail.gmail.com
[2] https://github.com/timescale/pgvectorscale
[3] https://github.com/microsoft/DiskANN
Re: Use streaming read API in ANALYZE
On Fri, Sep 13, 2024 at 10:10 AM Mats Kindahl wrote:
> On Tue, Sep 10, 2024 at 6:04 AM Thomas Munro wrote:
> > On Tue, Sep 10, 2024 at 10:27 AM Thomas Munro wrote:
> > > Mats, what do you think about this? (I haven't tried to preserve the prefetching behaviour, which probably didn't actually work for you in v16 anyway, at a guess; I'm just looking for the absolute simplest thing we can do to resolve this API mismatch.) TimeScale could then continue to use its v16 coding to handle the two-relations-in-a-trenchcoat problem, and we could continue discussing how to make v18 better.
> >
> > . o O { Spitballing here: if we add that tiny function I showed to get you unstuck for v17, then later in v18, if we add a multi-relation ReadStream constructor/callback (I have a patch somewhere, I want to propose that as it is needed for streaming recovery), you could construct a new ReadStream of your own that is daisy-chained from that one. You could keep using your N + M block numbering scheme if you want to, and the callback of the new stream could decode the block numbers and redirect to the appropriate relation + real block number.
>
> I think it is good to make as small changes as possible to the RC, so I agree with this approach. Looking at the patch, I think it will work, but I'll do some experimentation with it.
>
> Just asking: is there any particular reason why you do not want to *add* new functions for opaque objects inside a major release? After all, that was the reason they were opaque from the beginning, and extending with new functions would not break any existing code, not even from the ABI perspective.
>
> > That way you'd get I/O concurrency for both relations (for now just read-ahead advice, but see Andres's AIO v2 thread). That'd essentially be a more supported version of the 'access the struct internals' idea (or at least my understanding of what you had in mind), through daisy-chained streams. A little weird maybe, and maybe the redesign work will result in something completely different/better... just a thought... }
>
> I'll take a look at the thread. I really think the ReadStream abstraction is a good step in the right direction.

Hi Thomas,

I used the combination of your patch and making the computation of vacattrstats for a relation available through the API and managed to implement something that I think does the right thing. (I just sampled a few different statistics to check if they seem reasonable, like most common vals and most common freqs.) See attached patch.

I need the vacattrstats to set up the two streams for the internal relations. I can just re-implement them in the same way as is already done, but this seems like a small change that avoids unnecessary code duplication.

--
Best wishes,
Mats Kindahl, Timescale

From 8acb707cc859f1570046919295817e27f0d8b4f6 Mon Sep 17 00:00:00 2001
From: Mats Kindahl
Date: Sat, 14 Sep 2024 14:03:35 +0200
Subject: [PATCH] Support extensions wanting to do more advanced analyze

To support extensions that want to do more advanced analysis implementations, this commit adds two functions: `analyze_compute_vacattrstats`, to compute vacuum attribute stats that can be used to compute the number of target rows to analyze, and `read_stream_next_block`, which allows an extension to just use the provided analyze stream to sample blocks.
---
 src/backend/commands/analyze.c        | 120 ++
 src/backend/storage/aio/read_stream.c |  14 +++
 src/include/commands/vacuum.h         |   2 +
 src/include/storage/read_stream.h     |   2 +
 4 files changed, 85 insertions(+), 53 deletions(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c590a2adc35..8a402ad15c6 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -269,6 +269,72 @@ analyze_rel(Oid relid, RangeVar *relation,
 	pgstat_progress_end_command();
 }
 
+/*
+ * Determine which columns to analyze
+ *
+ * Note that system attributes are never analyzed, so we just reject them
+ * at the lookup stage. We also reject duplicate column mentions. (We
+ * could alternatively ignore duplicates, but analyzing a column twice
+ * won't work; we'd end up making a conflicting update in pg_statistic.)
+ */
+int
+analyze_compute_vacattrstats(Relation onerel, List *va_cols, VacAttrStats ***vacattrstats_out)
+{
+	int			tcnt,
+				i,
+				attr_cnt;
+	VacAttrStats **vacattrstats;
+
+	if (va_cols != NIL)
+	{
+		Bitmapset  *unique_cols = NULL;
+		ListCell   *le;
+
+		vacattrstats = (VacAttrStats **) palloc(list_length(va_cols) *
+												sizeof(VacAttrStats *));
+		tcnt = 0;
+		foreach(le, va_cols)
+		{
+			char	   *col = strVal(lfirst(le));
+
+			i = attnameAttNum(onerel, col, false);
+			if (i == InvalidAttrNumber)
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_COLUMN),
+						 errmsg(
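For illustration, a minimal sketch of how the proposed analyze_compute_vacattrstats() could be used to size a sample for a TAM-internal relation, mirroring the targrows computation in do_analyze_rel(); everything except the new function itself is hypothetical:

static int
compute_internal_targrows(Relation internal_rel)
{
	VacAttrStats **vacattrstats;
	int			attr_cnt;
	int			targrows = 100;	/* minimum, as in do_analyze_rel() */

	attr_cnt = analyze_compute_vacattrstats(internal_rel, NIL, &vacattrstats);
	for (int i = 0; i < attr_cnt; i++)
		targrows = Max(targrows, vacattrstats[i]->minrows);

	/* targrows can now drive BlockSampler_Init() for the internal relation */
	return targrows;
}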
Re: Use streaming read API in ANALYZE
On Tue, Sep 10, 2024 at 6:04 AM Thomas Munro wrote:
> On Tue, Sep 10, 2024 at 10:27 AM Thomas Munro wrote:
> > Mats, what do you think about this? (I haven't tried to preserve the prefetching behaviour, which probably didn't actually work for you in v16 anyway, at a guess; I'm just looking for the absolute simplest thing we can do to resolve this API mismatch.) TimeScale could then continue to use its v16 coding to handle the two-relations-in-a-trenchcoat problem, and we could continue discussing how to make v18 better.
>
> . o O { Spitballing here: if we add that tiny function I showed to get you unstuck for v17, then later in v18, if we add a multi-relation ReadStream constructor/callback (I have a patch somewhere, I want to propose that as it is needed for streaming recovery), you could construct a new ReadStream of your own that is daisy-chained from that one. You could keep using your N + M block numbering scheme if you want to, and the callback of the new stream could decode the block numbers and redirect to the appropriate relation + real block number.

I think it is good to make as small changes as possible to the RC, so I agree with this approach. Looking at the patch, I think it will work, but I'll do some experimentation with it.

Just asking: is there any particular reason why you do not want to *add* new functions for opaque objects inside a major release? After all, that was the reason they were opaque from the beginning, and extending with new functions would not break any existing code, not even from the ABI perspective.

> That way you'd get I/O concurrency for both relations (for now just read-ahead advice, but see Andres's AIO v2 thread). That'd essentially be a more supported version of the 'access the struct internals' idea (or at least my understanding of what you had in mind), through daisy-chained streams. A little weird maybe, and maybe the redesign work will result in something completely different/better... just a thought... }

I'll take a look at the thread. I really think the ReadStream abstraction is a good step in the right direction.

--
Best wishes,
Mats Kindahl, Timescale
Re: Use streaming read API in ANALYZE
On Tue, Sep 10, 2024 at 12:28 AM Thomas Munro wrote:
> On Tue, Sep 10, 2024 at 3:36 AM Michael Banck wrote:
> > I am a bit confused about the status of this thread. Robert mentioned RC1, so I guess it pertains to v17 but I don't see it on the open item wiki list?
>
> Yes, v17. Alright, I'll add an item.
>
> > Does the above mean you are going to revert it for v17, Thomas? And if so, what exactly? The ANALYZE changes on top of the streaming read API or something else about that API that is being discussed on this thread?
>
> I might have been a little pessimistic in that assessment. Another workaround that seems an awful lot cleaner and less invasive would be to offer a new ReadStream API function that provides access to block numbers and the strategy, ie the arguments of v16's scan_analyze_next_block() function. Mats, what do you think about this? (I haven't tried to preserve the prefetching behaviour, which probably didn't actually work for you in v16 anyway, at a guess; I'm just looking for the absolute simplest thing we can do to resolve this API mismatch.) TimeScale could then continue to use its v16 coding to handle the two-relations-in-a-trenchcoat problem, and we could continue discussing how to make v18 better.

In the original code we could call the methods with an "adjusted" block number, so the entire logic worked as before because we could just recursively forward the call with modified parameters. This is a little different with the new API.

> I looked briefly at another non-heap-like table AM, the Citus Columnar TAM. I am not familiar with that code and haven't studied it deeply this morning, but its _next_block() currently just returns true, so I think it will somehow need to change to counting calls and returning false when it thinks it's been called enough times (otherwise the loop in acquire_sample_rows() won't terminate, I think?). I suppose an easy way to do that without generating extra I/O or having to think hard about how to preserve the loop count from v16 would be to use this function.

Yes, but we are re-using the heapam, so we forward the call to it, which not only fetches the next block but also reads the buffer. Since you could just pass in the block number before, it just worked.

As mentioned, we intended to set up a new ReadStream for the "internal" relation ourselves (I think this is what you mean by "daisy-chain" in the follow-up to this mail), but then you need targrows, which is based on vacattrstats, which is computed with code that is currently either inline (the loop over the attributes in do_analyze_rel) or static (the examine_attribute function). We can write our own code for this, but it would help to have the code that does this work callable, or to be able to extract parameters from the existing read stream to at least get a hint. This would allow us to just get the vacuum attribute stats for an arbitrary relation and then run the same computations as in do_analyze_rel. Being able to do the same for the indexes is less important, since this is an "internal" relation and the "public" indexes are the ones that matter.

I attached a tentative patch for this, just doing some refactorings, and will see if that is sufficient for the current work by trying to use it. (I thought I would be able to verify this today, but am a little delayed, so I'm sending this anyway.) A patch like this is a minimal refactoring, so it should be safe even in an RC.

I have deliberately not tried to do a more serious refactoring, although I see that there are some duplications when doing the same work with the indexes and it would probably be possible to make a more generic function for this.

> I think there are broadly three categories of TAMs with respect to ANALYZE block sampling: those that are very heap-like (blocks of one SMgrRelation) and can just use the stream directly, those that are not at all heap-like (doing something completely different to sample tuples and ignoring the block aspect but using _next_block() to control the loop), and then Timescale's case which is sort of somewhere in between: almost heap-like from the point of view of this sampling code, ie working with blocks, but fudging the meaning of block numbers, which we didn't anticipate.

In this case the block numbers are only from a different relation, so they are still valid blocks, just encoded in a funny way. The block numbers trick is just a hack, but the gist is that we want to sample an arbitrary number of relations/forks when running analysis, not just the "front-facing" one.

> (I wonder if it fails to sample fairly across the underlying relation boundary anyway because their data densities must surely be quite different, but that's not what we're here to talk about.)

Yes, they are, so this is kind-of-a-hack-to-get-it-roughly-correct. The ideal scenario would be to be able to run the same analysis that is done in do_analyze_rel().
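The "funny encoding" described above might look roughly like this; all names here are hypothetical. Block numbers 0..N-1 address the default relation and N..N+M-1 address the internal relation:

static void
decode_sample_block(BlockNumber encoded, BlockNumber default_nblocks,
					Relation default_rel, Relation internal_rel,
					Relation *rel, BlockNumber *blockno)
{
	if (encoded < default_nblocks)
	{
		/* encoded block falls in the "front-facing" relation */
		*rel = default_rel;
		*blockno = encoded;
	}
	else
	{
		/* shift into the internal relation's real block range */
		*rel = internal_rel;
		*blockno = encoded - default_nblocks;
	}
}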
Re: Use streaming read API in ANALYZE
On Thu, Sep 5, 2024 at 11:12 AM Thomas Munro wrote:
> On Thu, Sep 5, 2024 at 6:45 PM Mats Kindahl wrote:
> > Forgive me for asking, but I am not entirely sure why the ReadStream struct is opaque. The usual reasons are:
> >
> > You want to provide an ABI to allow extensions to work with new major versions without re-compiling. Right now it is necessary to recompile extensions anyway, so this does not seem to apply. (Because there are a lot of other changes that you need when switching versions because of the lack of a stable ABI for other parts of the code. However, it might be that the goal is to support it eventually, and then it would make sense to start making structs opaque.)
> >
> > You want to ensure that you can make modifications inside a major version without breaking ABIs and requiring a re-compile. In this case, you could still follow the safe practice of adding new fields last, not relying on the size of the struct for anything (e.g., no arrays of these structures, just pointers to them), etc. However, if you want to be very safe and support very drastic changes inside a major version, it needs to be opaque, so this could be the reason.
> >
> > Is it either of these reasons, or is there another reason?
> >
> > Making the ReadStream API non-opaque (that is, moving the definition to the header file) would at least solve our problem (unless I am mistaken). However, I am ignorant about long-term plans which might affect this, so there might be a good reason to revert it for reasons I am not aware of.
>
> The second thing. Also there are very active plans[1] to change the internal design of ReadStream in 18, since the goal is to drive true asynchronous I/O, and the idea of ReadStream was to create a simple API to let many consumers start using it, so that we can drive efficient modern system interfaces below that API, so having people depending on how it works would not be great.

That is understandable, since you usually do not want to have to re-compile the extension for different minor versions. However, it would be a rare case with extensions meddling with this, so it might not turn out to be a big problem in reality, as long as it is very clear to all involved that this might change, and an effort is made to keep binary compatibility (for example, by not removing fields or changing the types of existing fields).

> But let's talk about how that would actually look, for example if we exposed the struct or you took a photocopy of it... I think your idea must be something like: if you could access struct ReadStream's internals, you could replace stream->callback with an interceptor callback, and if the BlockSampler had been given the fake N + M relation size, the interceptor could overwrite stream->ios[next_io_index].op.smgr and return x - N if the intercepted callback returned x >= N. (Small detail: need to check stream->fast_path and use 0 instead or something like that, but maybe we could change that.)

Yes, this is what I had in mind, but I did not dig too deeply into the code.

> One minor problem that jumps out is that read_stream.c could inappropriately merge blocks from the two relations into one I/O. Hmm, I guess you'd have to teach the interceptor not to allow that: if switching between the two relations, and if the block number would coincide with stream->pending_read_blocknum + stream->pending_read_nblocks, it would need to pick a new block instead (interfering with the block sampling algorithm, but only very rarely). Is this what you had in mind, or something else?

Hmmm... I didn't look too closely at this. Since the block number comes from the callback, I guess we could make sure to have a "padding" block between the regions so that we "break" any suite of blocks, which I think is what you mean by "teach the interceptor not to allow that", but I would have to write a patch to make sure.

> (BTW I have a patch to teach read_stream.c about multi-smgr-relation streams, by adding a different constructor with a different callback that returns smgr, fork, block instead of just the block, but it didn't make it into 17.)

Without having looked at the patch, this sounds like the correct way to do it.

> [1] https://www.postgresql.org/message-id/flat/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt

--
Best wishes,
Mats Kindahl, Timescale
Re: Use streaming read API in ANALYZE
On Tue, Sep 10, 2024 at 10:27 AM Thomas Munro wrote:
> Mats, what do you think about this? (I haven't tried to preserve the prefetching behaviour, which probably didn't actually work for you in v16 anyway, at a guess; I'm just looking for the absolute simplest thing we can do to resolve this API mismatch.) TimeScale could then continue to use its v16 coding to handle the two-relations-in-a-trenchcoat problem, and we could continue discussing how to make v18 better.

. o O { Spitballing here: if we add that tiny function I showed to get you unstuck for v17, then later in v18, if we add a multi-relation ReadStream constructor/callback (I have a patch somewhere, I want to propose that as it is needed for streaming recovery), you could construct a new ReadStream of your own that is daisy-chained from that one. You could keep using your N + M block numbering scheme if you want to, and the callback of the new stream could decode the block numbers and redirect to the appropriate relation + real block number.

That way you'd get I/O concurrency for both relations (for now just read-ahead advice, but see Andres's AIO v2 thread). That'd essentially be a more supported version of the 'access the struct internals' idea (or at least my understanding of what you had in mind), through daisy-chained streams. A little weird maybe, and maybe the redesign work will result in something completely different/better... just a thought... }
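A sketch of the daisy-chaining idea above; the multi-relation callback shape is hypothetical (modelled on the description of the unposted patch: return smgr, fork and block instead of just a block), as are the extension-side names:

typedef struct ChainState
{
	ReadStream *outer;			/* the stream core ANALYZE set up */
	Relation	default_rel;
	Relation	internal_rel;
	BlockNumber default_nblocks;	/* N in the N + M scheme */
} ChainState;

static BlockNumber
chained_cb(ReadStream *stream, void *callback_private_data,
		   SMgrRelation *smgr, ForkNumber *forknum)
{
	ChainState *cs = callback_private_data;
	BufferAccessStrategy strategy;
	BlockNumber encoded = read_stream_next_block(cs->outer, &strategy);

	*forknum = MAIN_FORKNUM;
	if (encoded == InvalidBlockNumber)
		return InvalidBlockNumber;	/* outer stream exhausted */
	if (encoded < cs->default_nblocks)
	{
		*smgr = RelationGetSmgr(cs->default_rel);
		return encoded;
	}
	*smgr = RelationGetSmgr(cs->internal_rel);
	return encoded - cs->default_nblocks;	/* real block in internal rel */
}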
Re: Use streaming read API in ANALYZE
On Tue, Sep 10, 2024 at 3:36 AM Michael Banck wrote:
> I am a bit confused about the status of this thread. Robert mentioned RC1, so I guess it pertains to v17 but I don't see it on the open item wiki list?

Yes, v17. Alright, I'll add an item.

> Does the above mean you are going to revert it for v17, Thomas? And if so, what exactly? The ANALYZE changes on top of the streaming read API or something else about that API that is being discussed on this thread?

I might have been a little pessimistic in that assessment. Another workaround that seems an awful lot cleaner and less invasive would be to offer a new ReadStream API function that provides access to block numbers and the strategy, ie the arguments of v16's scan_analyze_next_block() function. Mats, what do you think about this? (I haven't tried to preserve the prefetching behaviour, which probably didn't actually work for you in v16 anyway, at a guess; I'm just looking for the absolute simplest thing we can do to resolve this API mismatch.) TimeScale could then continue to use its v16 coding to handle the two-relations-in-a-trenchcoat problem, and we could continue discussing how to make v18 better.

I looked briefly at another non-heap-like table AM, the Citus Columnar TAM. I am not familiar with that code and haven't studied it deeply this morning, but its _next_block() currently just returns true, so I think it will somehow need to change to counting calls and returning false when it thinks it's been called enough times (otherwise the loop in acquire_sample_rows() won't terminate, I think?). I suppose an easy way to do that without generating extra I/O or having to think hard about how to preserve the loop count from v16 would be to use this function.

I think there are broadly three categories of TAMs with respect to ANALYZE block sampling: those that are very heap-like (blocks of one SMgrRelation) and can just use the stream directly, those that are not at all heap-like (doing something completely different to sample tuples and ignoring the block aspect but using _next_block() to control the loop), and then Timescale's case which is sort of somewhere in between: almost heap-like from the point of view of this sampling code, ie working with blocks, but fudging the meaning of block numbers, which we didn't anticipate. (I wonder if it fails to sample fairly across the underlying relation boundary anyway because their data densities must surely be quite different, but that's not what we're here to talk about.)

. o O { We need that wiki page listing TAMs with links to the open source ones... }

From db05a33e742cb292b1a2d44665582042fcd05d2f Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Tue, 10 Sep 2024 10:15:33 +1200
Subject: [PATCH] Allow raw block numbers to be read from ReadStream.

---
 src/backend/storage/aio/read_stream.c | 14 ++
 src/include/storage/read_stream.h     |  3 +++
 2 files changed, 17 insertions(+)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 7f0e07d9586..cf914a7712f 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -835,3 +835,17 @@ read_stream_end(ReadStream *stream)
 	read_stream_reset(stream);
 	pfree(stream);
 }
+
+/*
+ * Transitional support for code that would like to perform a read directly,
+ * without using the stream. Returns, and skips, the next block number that
+ * would be read by the stream's look-ahead algorithm, or InvalidBlockNumber
+ * if the end of the stream is reached. Also reports the strategy that would
+ * be used to read it.
+ */
+BlockNumber
+read_stream_next_block(ReadStream *stream, BufferAccessStrategy *strategy)
+{
+	*strategy = stream->ios[0].op.strategy;
+	return read_stream_get_block(stream, NULL);
+}
diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 2f94ee744b9..37b51934f8d 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -68,6 +68,9 @@ extern ReadStream *read_stream_begin_relation(int flags,
 											  void *callback_private_data,
 											  size_t per_buffer_data_size);
 extern Buffer read_stream_next_buffer(ReadStream *stream, void **per_buffer_data);
+extern BlockNumber read_stream_next_block(ReadStream *stream,
+										  BufferAccessStrategy *strategy);
+
 extern ReadStream *read_stream_begin_smgr_relation(int flags,
 												   BufferAccessStrategy strategy,
 												   SMgrRelation smgr,
-- 
2.46.0
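A sketch of how a table AM could consume the new function in its scan_analyze_next_block() callback: drain the block numbers the stream would have read, and perform the reads itself so they can be redirected. "TwoRelScanDesc" and the N + M decoding are hypothetical, not part of the patch:

typedef struct TwoRelScanDesc
{
	TableScanDescData rs_base;
	Relation	default_rel;	/* blocks 0..N-1 */
	Relation	internal_rel;	/* blocks N..N+M-1 */
	BlockNumber default_nblocks;	/* N */
	Buffer		cbuf;			/* current buffer */
} TwoRelScanDesc;

static bool
tworel_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
{
	TwoRelScanDesc *myscan = (TwoRelScanDesc *) scan;
	BufferAccessStrategy strategy;
	BlockNumber blockno = read_stream_next_block(stream, &strategy);

	if (blockno == InvalidBlockNumber)
		return false;			/* stream is exhausted */

	if (blockno < myscan->default_nblocks)
		myscan->cbuf = ReadBufferExtended(myscan->default_rel, MAIN_FORKNUM,
										  blockno, RBM_NORMAL, strategy);
	else
		myscan->cbuf = ReadBufferExtended(myscan->internal_rel, MAIN_FORKNUM,
										  blockno - myscan->default_nblocks,
										  RBM_NORMAL, strategy);
	return true;
}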
Re: Use streaming read API in ANALYZE
Hi,

On Thu, Sep 05, 2024 at 09:12:07PM +1200, Thomas Munro wrote:
> On Thu, Sep 5, 2024 at 6:45 PM Mats Kindahl wrote:
> > Making the ReadStream API non-opaque (that is, moving the definition to the header file) would at least solve our problem (unless I am mistaken). However, I am ignorant about long-term plans which might affect this, so there might be a good reason to revert it for reasons I am not aware of.
>
> The second thing.

I am a bit confused about the status of this thread. Robert mentioned RC1, so I guess it pertains to v17 but I don't see it on the open item wiki list?

Does the above mean you are going to revert it for v17, Thomas? And if so, what exactly? The ANALYZE changes on top of the streaming read API or something else about that API that is being discussed on this thread?

I am also asking because this feature (i.e. Use streaming read API in ANALYZE) is being mentioned in the release announcement and that was just frozen for translations.

Michael
Re: Use streaming read API in ANALYZE
On Thu, Sep 5, 2024 at 6:45 PM Mats Kindahl wrote:
> Forgive me for asking, but I am not entirely sure why the ReadStream struct is opaque. The usual reasons are:
>
> You want to provide an ABI to allow extensions to work with new major versions without re-compiling. Right now it is necessary to recompile extensions anyway, so this does not seem to apply. (Because there are a lot of other changes that you need when switching versions because of the lack of a stable ABI for other parts of the code. However, it might be that the goal is to support it eventually, and then it would make sense to start making structs opaque.)
>
> You want to ensure that you can make modifications inside a major version without breaking ABIs and requiring a re-compile. In this case, you could still follow the safe practice of adding new fields last, not relying on the size of the struct for anything (e.g., no arrays of these structures, just pointers to them), etc. However, if you want to be very safe and support very drastic changes inside a major version, it needs to be opaque, so this could be the reason.
>
> Is it either of these reasons, or is there another reason?
>
> Making the ReadStream API non-opaque (that is, moving the definition to the header file) would at least solve our problem (unless I am mistaken). However, I am ignorant about long-term plans which might affect this, so there might be a good reason to revert it for reasons I am not aware of.

The second thing. Also there are very active plans[1] to change the internal design of ReadStream in 18, since the goal is to drive true asynchronous I/O, and the idea of ReadStream was to create a simple API to let many consumers start using it, so that we can drive efficient modern system interfaces below that API, so having people depending on how it works would not be great.

But let's talk about how that would actually look, for example if we exposed the struct or you took a photocopy of it... I think your idea must be something like: if you could access struct ReadStream's internals, you could replace stream->callback with an interceptor callback, and if the BlockSampler had been given the fake N + M relation size, the interceptor could overwrite stream->ios[next_io_index].op.smgr and return x - N if the intercepted callback returned x >= N. (Small detail: need to check stream->fast_path and use 0 instead or something like that, but maybe we could change that.) One minor problem that jumps out is that read_stream.c could inappropriately merge blocks from the two relations into one I/O. Hmm, I guess you'd have to teach the interceptor not to allow that: if switching between the two relations, and if the block number would coincide with stream->pending_read_blocknum + stream->pending_read_nblocks, it would need to pick a new block instead (interfering with the block sampling algorithm, but only very rarely). Is this what you had in mind, or something else?

(BTW I have a patch to teach read_stream.c about multi-smgr-relation streams, by adding a different constructor with a different callback that returns smgr, fork, block instead of just the block, but it didn't make it into 17.)

[1] https://www.postgresql.org/message-id/flat/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt
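Sketch only: what the interception described above could look like if struct ReadStream were exposed. The fields touched here (ios, next_io_index, op.smgr) are read_stream.c internals as named in the message, not part of any public API, and this ignores the fast_path and I/O-merging caveats discussed above:

typedef struct InterceptorData
{
	ReadStreamBlockNumberCB orig_callback;	/* the block sampler's callback */
	void	   *orig_private_data;
	BlockNumber N;				/* block count of the "default" relation */
	SMgrRelation side_smgr;		/* the second relation */
} InterceptorData;

static BlockNumber
interceptor_cb(ReadStream *stream, void *callback_private_data,
			   void *per_buffer_data)
{
	InterceptorData *d = callback_private_data;
	BlockNumber x = d->orig_callback(stream, d->orig_private_data,
									 per_buffer_data);

	if (x == InvalidBlockNumber || x < d->N)
		return x;				/* default relation: pass through */

	/* redirect the pending I/O to the side relation */
	stream->ios[stream->next_io_index].op.smgr = d->side_smgr;
	return x - d->N;			/* real block number in the side relation */
}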
Re: Use streaming read API in ANALYZE
On Thu, Sep 5, 2024 at 1:34 AM Thomas Munro wrote:
> On Thu, Sep 5, 2024 at 3:36 AM Robert Haas wrote:
> > On Wed, Sep 4, 2024 at 6:38 AM Thomas Munro wrote:
> > > Thanks for the explanation. I think we should revert it. IMHO it was a nice clean example of a streaming transformation, but unfortunately it transformed an API that nobody liked in the first place, and broke some weird and wonderful workarounds. Let's try again in 18.
> >
> > The problem I have with this is that we just released RC1. I suppose if we have to make this change it's better to do it sooner than later, but are we sure we want to whack this around this close to final release?
>
> I hear you. But I definitely don't want to (and likely can't at this point) make any of the other proposed changes, and I also don't want to break Timescale. That seems to leave only one option: go back to the v16 API for RC2, and hope that the ongoing table AM discussions for v18 (CF #4866) will fix all the problems for the people whose TAMs don't quack like a "heap", and the people whose TAMs do and who would not like to duplicate the code, and the people who want streaming I/O.

Forgive me for asking, but I am not entirely sure why the ReadStream struct is opaque. The usual reasons are:

- You want to provide an ABI to allow extensions to work with new major versions without re-compiling. Right now it is necessary to recompile extensions anyway, so this does not seem to apply. (Because there are a lot of other changes that you need when switching versions because of the lack of a stable ABI for other parts of the code. However, it might be that the goal is to support it eventually, and then it would make sense to start making structs opaque.)

- You want to ensure that you can make modifications *inside* a major version without breaking ABIs and requiring a re-compile. In this case, you could still follow the safe practice of adding new fields last, not relying on the size of the struct for anything (e.g., no arrays of these structures, just pointers to them), etc. However, if you want to be *very* safe and support very drastic changes inside a major version, it needs to be opaque, so this could be the reason.

Is it either of these reasons, or is there another reason?

Making the ReadStream API non-opaque (that is, moving the definition to the header file) would at least solve our problem (unless I am mistaken). However, I am ignorant about long-term plans which might affect this, so there might be a good reason to revert it for reasons I am not aware of.

--
Best wishes,
Mats Kindahl, Timescale
Re: Use streaming read API in ANALYZE
On Thu, Sep 5, 2024 at 3:36 AM Robert Haas wrote:
> On Wed, Sep 4, 2024 at 6:38 AM Thomas Munro wrote:
> > Thanks for the explanation. I think we should revert it. IMHO it was a nice clean example of a streaming transformation, but unfortunately it transformed an API that nobody liked in the first place, and broke some weird and wonderful workarounds. Let's try again in 18.
>
> The problem I have with this is that we just released RC1. I suppose if we have to make this change it's better to do it sooner than later, but are we sure we want to whack this around this close to final release?

I hear you. But I definitely don't want to (and likely can't at this point) make any of the other proposed changes, and I also don't want to break Timescale. That seems to leave only one option: go back to the v16 API for RC2, and hope that the ongoing table AM discussions for v18 (CF #4866) will fix all the problems for the people whose TAMs don't quack like a "heap", and the people whose TAMs do and who would not like to duplicate the code, and the people who want streaming I/O.
Re: Use streaming read API in ANALYZE
On Wed, Sep 4, 2024 at 6:38 AM Thomas Munro wrote:
> Thanks for the explanation. I think we should revert it. IMHO it was a nice clean example of a streaming transformation, but unfortunately it transformed an API that nobody liked in the first place, and broke some weird and wonderful workarounds. Let's try again in 18.

The problem I have with this is that we just released RC1. I suppose if we have to make this change it's better to do it sooner than later, but are we sure we want to whack this around this close to final release?

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Use streaming read API in ANALYZE
Thanks for the explanation. I think we should revert it. IMHO it was a nice clean example of a streaming transformation, but unfortunately it transformed an API that nobody liked in the first place, and broke some weird and wonderful workarounds. Let's try again in 18.
Re: Use streaming read API in ANALYZE
On Sat, Aug 24, 2024 at 5:31 AM Thomas Munro wrote:
> On Thu, Aug 22, 2024 at 7:31 PM Mats Kindahl wrote:
> > The alternate version proposed by Nazir allows you to decide which interface to use.
> > Reverting the patch entirely would also solve the problem.

After digging through the code a little more, I discovered that there actually is another one: move the ReadStream struct into read_stream.h.

> > Passing down the block sampler and the strategy to scan_begin() and moving the ReadStream setup in analyze.c into initscan() in heapam.c, but this requires adding new parameters to this function.
> > Having accessors that allow you to get the block sampler and strategy from the ReadStream object.
>
> I'm a bit confused about how it can make sense to use the same BlockSampler with a side relation/fork. Could you point me at the code?

Sorry, that was a bit unclear. The intention was not to re-use the block sampler but to set a new one up with parameters from the original block sampler, which would require access to it. (The strategy is less of a problem since only one is used.)

To elaborate on the situation: for the TAM in question we have two different storage areas, both of which are heaps. Both relations use the same attributes "publicly" (they are internally different, but we transform them to look the same). One of the relations is the "default" one and is stored in rd_rel. In order to run ANALYZE, we need to sample blocks from both relations, in slightly different ways.

With the old interface, we faked the number of blocks in the relation_size() callback and claimed that there were N + M blocks. When then being asked about a block by block number, we could easily pick the correct relation and just forward the call.

With the new ReadStream API, a read stream is (automatically) set up on the "default" relation, but we can set up a separate read stream inside the TAM for the other relation. However, the difficulty is in setting it up correctly: we cannot use the "fake number of blocks" trick, since the read stream does not only compute the block number but actually tries to read the buffer in the relation provided when setting up the read stream, so a block number outside the range of this relation will not be found, since it is in a different relation.

If we could create our own read stream with both relations, that could be solved and we could just implement the same logic, but direct it to the correct relations depending on where we want to read the block. Unless I am mistaken, there is already support for this, since there is an array of in-progress I/O and it would be trivial to extend this with more relations+forks, if you have access to the structure definition. The ReadStream struct is, however, an opaque struct, so it's hard to hack around with it. Just making the struct declaration public would potentially solve a lot of problems here. (See attached patch, which is close to the minimum of what is needed to allow extension writers to tweak the contents.)

Since both relations are using the same attributes with the same "analyzability", having that information would be useful to compute the targrows for setting up the additional stream, but it is computed in do_analyze_rel() and not further propagated, so it needs to be re-computed if we want to set up a separate read stream.

> > It would be great if this could be fixed before the PG17 release now that 27bc1772fc8 was reverted.
>
> Ack. Thinking...

Right now I think that just making the ReadStream struct available in the header file is the best approach. It is a safe and low-risk fix (so something that can be added to a beta) and will allow extension writers to hack to their hearts' content. In addition to that, being able to select what interface to use would also help.

> Random thought: is there a wiki page or something where we can find out about all the table AM projects? For the successor to 27bc1772fc8, I hope they'll be following along.

At this point, unfortunately not; we are quite early in this. Once I have something, I'll share.

--
Best wishes,
Mats Kindahl, Timescale

From ea4bb194e0dcccac8465b3aa13950f721bde3860 Mon Sep 17 00:00:00 2001
From: Mats Kindahl
Date: Thu, 29 Aug 2024 10:39:34 +0200
Subject: Make ReadStream struct non-opaque

Move the ReadStream struct and two utility functions from read_stream.c to read_stream.h to allow extensions to modify the structure if they need to.

---
 src/backend/storage/aio/read_stream.c |  59 ---
 src/include/storage/read_stream.h     | 105 ++
 2 files changed, 105 insertions(+), 59 deletions(-)

diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index a83c18c2a4..bf2a679037 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -97,65 +97,6 @@
 #include "utils/rel.h"
 #include "utils/spccache.h"
 
-typedef struct InProgressIO
-{
-	int16		buffer_index;
-	Read
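The v16-era trick Mats describes above (fake N + M blocks, then forward each sampled block) might look roughly like this; all TAM-side names are hypothetical, and only the main fork is considered:

typedef struct TwoRelState
{
	Relation	default_rel;
	Relation	internal_rel;
	TableScanDesc default_scan;
	TableScanDesc internal_scan;
} TwoRelState;

static uint64
tworel_relation_size(Relation rel, ForkNumber forkNumber)
{
	TwoRelState *state = tworel_get_state(rel);	/* hypothetical lookup */

	/* claim N + M blocks so the block sampler spans both heaps */
	return (uint64) (RelationGetNumberOfBlocks(state->default_rel) +
					 RelationGetNumberOfBlocks(state->internal_rel)) * BLCKSZ;
}

/* v16 callback signature: block number and strategy are passed in */
static bool
tworel_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno,
							   BufferAccessStrategy bstrategy)
{
	TwoRelState *state = ((TwoRelScanDesc *) scan)->state;
	BlockNumber n = RelationGetNumberOfBlocks(state->default_rel);

	if (blockno < n)
		return table_scan_analyze_next_block(state->default_scan,
											 blockno, bstrategy);
	return table_scan_analyze_next_block(state->internal_scan,
										 blockno - n, bstrategy);
}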
Re: Use streaming read API in ANALYZE
On Thu, Aug 22, 2024 at 7:31 PM Mats Kindahl wrote:
> The alternate version proposed by Nazir allows you to decide which interface to use.
> Reverting the patch entirely would also solve the problem.
> Passing down the block sampler and the strategy to scan_begin() and moving the ReadStream setup in analyze.c into initscan() in heapam.c, but this requires adding new parameters to this function.
> Having accessors that allow you to get the block sampler and strategy from the ReadStream object.

I'm a bit confused about how it can make sense to use the same BlockSampler with a side relation/fork. Could you point me at the code?

> It would be great if this could be fixed before the PG17 release now that 27bc1772fc8 was reverted.

Ack. Thinking...

Random thought: is there a wiki page or something where we can find out about all the table AM projects? For the successor to 27bc1772fc8, I hope they'll be following along.
Re: Use streaming read API in ANALYZE
On Mon, May 20, 2024 at 10:46 PM Melanie Plageman wrote:
> On Wed, May 15, 2024 at 2:18 PM Nazir Bilal Yavuz wrote:
> > On Mon, 29 Apr 2024 at 18:41, Nazir Bilal Yavuz wrote:
> > > On Mon, 8 Apr 2024 at 04:21, Thomas Munro wrote:
> > >
> > > I wanted to discuss what will happen to this patch now that 27bc1772fc8 is reverted. I am continuing this thread but I can create another thread if you prefer.
> >
> > 041b96802ef is discussed in the 'Table AM Interface Enhancements' thread [1]. The main problem discussed about this commit is that the read stream API is not pushed to the heap-specific code and, because of that, the other AM implementations need to use read streams. To push the read stream API to the heap-specific code, it is pretty much required to pass BufferAccessStrategy and BlockSamplerData to initscan().
> >
> > I am sharing the alternative version of this patch. The first patch just reverts 041b96802ef and the second patch is the alternative version.
> >
> > In this alternative version, the read stream API is not pushed to the heap-specific code, but it is controlled by the heap-specific code. The SO_USE_READ_STREAMS_IN_ANALYZE flag is introduced and set in the heap-specific code if the scan type is 'ANALYZE'. This flag is used to decide whether the streaming API in ANALYZE will be used or not. If this flag is set, heap AMs and the read stream API will be used. If it is not set, heap AMs will not be used and the code falls back to the version before read streams.
>
> Personally, I think the alternative version here is the best option other than leaving what is in master. However, I would vote for keeping what is in master because 1) where we are in the release timeline and 2) the acquire_sample_rows() code, before streaming read, was totally block-based anyway.
>
> If we kept what was in master, do we need to document for table AMs how to use read_stream_next_buffer() or can we assume they will look at the heap AM implementation?

Hi all,

I ran into this with the PG17 beta3, and for our use case we need to set up another stream (using a different relation and/or fork, but using the same strategy) in addition to the one that is passed in to scan_analyze_next_block(), so to be able to do that it is necessary to have the block sampler and the strategy from the original stream.

Given that this makes it very difficult (see below) to set up a different ReadStream inside the TAM unless you have the BlockSampler and the BufferAccessStrategy, and the old interface did not have this problem, I would consider this a regression.

This would be possible to solve in a few different ways:

1. The alternate version proposed by Nazir allows you to decide which interface to use.
2. Reverting the patch entirely would also solve the problem.
3. Passing down the block sampler and the strategy to scan_begin() and moving the ReadStream setup in analyze.c into initscan() in heapam.c, but this requires adding new parameters to this function.
4. Having accessors that allow you to get the block sampler and strategy from the ReadStream object.

The proposed solution 1 above would still not solve the problem of allowing a user to set up a different or extra ReadStream if they want to use the new ReadStream interface.

Reverting the ReadStream patch entirely would also deal with the regression, but I find the ReadStream interface very elegant, since it moves the block sampling into a separate abstraction, and I would like to use it; but right now there are some limitations if you want to use it fully.

The third solution above would allow that, but it requires a change in the signature of scan_begin(), which might not be the best at this stage of development.

Proposal 4 would allow you to construct a new stream based on the old one and might be a simple alternative solution as well, with fewer changes to the current code.

It is possible to capture the information in ProcessUtility() and re-compute all the parameters, but that is quite a lot of work to get right, especially considering that these computations are all over the place and part of different functions at different stages. (For example, the variable ring_size, needed to set up the buffer access strategy, is computed in ExecVacuum(); the variable targrows, used to set up the block sampler, is computed inside acquire_sample_rows(), which in turn requires deciding what attributes to analyze, which is computed in do_analyze_rel().)

It would be great if this could be fixed before the PG17 release now that 27bc1772fc8 was reverted.

--
Best wishes,
Mats Kindahl, Timescale

From f684ba1a941912bfe57c256a080cb8f6e49e8871 Mon Sep 17 00:00:00 2001
From: Mats Kindahl
Date: Thu, 22 Aug 2024 09:25:30 +0200
Subject: Add accessors for ReadStream

Add accessors to be able to retrieve the buffer access strategy as well as the callback with its private data.

---
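For illustration, proposal 4's accessors could look roughly like this (a hypothetical sketch; the attached patch may shape them differently):

/* hypothetical additions to read_stream.h */
extern BufferAccessStrategy read_stream_get_strategy(ReadStream *stream);
extern ReadStreamBlockNumberCB read_stream_get_callback(ReadStream *stream,
														void **callback_private_data);

With these, a TAM could copy the strategy (and inspect the callback state) from the stream that core ANALYZE set up, and use them to construct its own second stream.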
Re: Use streaming read API in ANALYZE
On Wed, May 15, 2024 at 2:18 PM Nazir Bilal Yavuz wrote:
> On Mon, 29 Apr 2024 at 18:41, Nazir Bilal Yavuz wrote:
> > On Mon, 8 Apr 2024 at 04:21, Thomas Munro wrote:
> >
> > I wanted to discuss what will happen to this patch now that 27bc1772fc8 is reverted. I am continuing this thread but I can create another thread if you prefer.
>
> 041b96802ef is discussed in the 'Table AM Interface Enhancements' thread [1]. The main problem discussed about this commit is that the read stream API is not pushed to the heap-specific code and, because of that, the other AM implementations need to use read streams. To push the read stream API to the heap-specific code, it is pretty much required to pass BufferAccessStrategy and BlockSamplerData to initscan().
>
> I am sharing the alternative version of this patch. The first patch just reverts 041b96802ef and the second patch is the alternative version.
>
> In this alternative version, the read stream API is not pushed to the heap-specific code, but it is controlled by the heap-specific code. The SO_USE_READ_STREAMS_IN_ANALYZE flag is introduced and set in the heap-specific code if the scan type is 'ANALYZE'. This flag is used to decide whether the streaming API in ANALYZE will be used or not. If this flag is set, heap AMs and the read stream API will be used. If it is not set, heap AMs will not be used and the code falls back to the version before read streams.

Personally, I think the alternative version here is the best option other than leaving what is in master. However, I would vote for keeping what is in master because 1) where we are in the release timeline and 2) the acquire_sample_rows() code, before streaming read, was totally block-based anyway.

If we kept what was in master, do we need to document for table AMs how to use read_stream_next_buffer() or can we assume they will look at the heap AM implementation?

- Melanie
Re: Use streaming read API in ANALYZE
Hi,

On Mon, 29 Apr 2024 at 18:41, Nazir Bilal Yavuz wrote:
> Hi,
>
> On Mon, 8 Apr 2024 at 04:21, Thomas Munro wrote:
> > Pushed. Thanks Bilal and reviewers!
>
> I wanted to discuss what will happen to this patch now that 27bc1772fc8 is reverted. I am continuing this thread but I can create another thread if you prefer.

041b96802ef is discussed in the 'Table AM Interface Enhancements' thread [1]. The main problem discussed about this commit is that the read stream API is not pushed to the heap-specific code and, because of that, the other AM implementations need to use read streams. To push the read stream API to the heap-specific code, it is pretty much required to pass BufferAccessStrategy and BlockSamplerData to initscan().

I am sharing the alternative version of this patch. The first patch just reverts 041b96802ef and the second patch is the alternative version.

In this alternative version, the read stream API is not pushed to the heap-specific code, but it is controlled by the heap-specific code. The SO_USE_READ_STREAMS_IN_ANALYZE flag is introduced and set in the heap-specific code if the scan type is 'ANALYZE'. This flag is used to decide whether the streaming API in ANALYZE will be used or not. If this flag is set, heap AMs and the read stream API will be used. If it is not set, heap AMs will not be used and the code falls back to the version before read streams.

Pros of the alternative version:

* The existing AM implementations other than heap AM can continue to use their AMs without any change.
* AM implementations other than heap do not need to use read streams.
* Upstream code uses the read stream API and benefits from that.

Cons of the alternative version:

* 6 if cases are added to the acquire_sample_rows() function and 3 of them are in the while loop.
* Because of these changes, the code looks messy.

Any kind of feedback would be appreciated.

[1] https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

From 323f28ff979cde8e4dbde8b4654bded74abf1fbc Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz
Date: Wed, 15 May 2024 00:03:56 +0300
Subject: [PATCH v13 1/2] Revert "Use streaming I/O in ANALYZE."

This commit reverts 041b96802ef. 041b96802ef revised the changes on 27bc1772fc8 but 27bc1772fc8 and dd1f6b0c172 are reverted together in 6377e12a5a5. So, this commit reverts all 27bc1772fc, 041b96802ef and dd1f6b0c172 together.

Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
---
 src/include/access/tableam.h             | 26 +++
 src/backend/access/heap/heapam_handler.c | 38 +-
 src/backend/commands/analyze.c           | 96 ++--
 3 files changed, 98 insertions(+), 62 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 8e583b45cd5..e08b9627f30 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,7 +21,6 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "executor/tuptable.h"
-#include "storage/read_stream.h"
 #include "utils/rel.h"
 #include "utils/snapshot.h"
 
@@ -655,16 +654,6 @@ typedef struct TableAmRoutine
 										struct VacuumParams *params,
 										BufferAccessStrategy bstrategy);
 
-	/*
-	 * Prepare to analyze the next block in the read stream. Returns false if
-	 * the stream is exhausted and true otherwise. The scan must have been
-	 * started with SO_TYPE_ANALYZE option.
-	 *
-	 * This routine holds a buffer pin and lock on the heap page. They are
-	 * held until heapam_scan_analyze_next_tuple() returns false. That is
-	 * until all the items of the heap page are analyzed.
-	 */
-
 	/*
 	 * Prepare to analyze block `blockno` of `scan`. The scan has been started
 	 * with table_beginscan_analyze(). See also
@@ -683,7 +672,8 @@ typedef struct TableAmRoutine
 	 * isn't one yet.
 	 */
 	bool		(*scan_analyze_next_block) (TableScanDesc scan,
-											ReadStream *stream);
+											BlockNumber blockno,
+											BufferAccessStrategy bstrategy);
 
 	/*
 	 * See table_scan_analyze_next_tuple().
@@ -1721,17 +1711,19 @@ table_relation_vacuum(Relation rel, struct VacuumParams *params,
 }
 
 /*
- * Prepare to analyze the next block in the read stream. The scan needs to
- * have been started with table_beginscan_analyze(). Note that this routine
- * might acquire resources like locks that are held until
+ * Prepare to analyze block `blockno` of `scan`. The scan needs to have been
+ * started with table_beginscan_analyze(). Note that this routine might
+ * acquire resources like locks that are held until
 * table_scan_analyze_next_tuple() returns false.
 *
 * Returns false if block is unsuitable for sampling, true otherwise.
 */
static inline bool
-table_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
+table_scan_analyze_next_block(TableScanDesc scan,
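A rough sketch of the control flow the alternative version describes, reconstructed from the description above rather than copied from the patch: heapam requests the streaming path when the scan is an ANALYZE scan, and the sampling loop branches on the flag.

/* sketch: heapam side, when beginning an ANALYZE scan */
static void
set_analyze_flags(HeapScanDesc scan, uint32 flags)
{
	if (flags & SO_TYPE_ANALYZE)
		scan->rs_base.rs_flags |= SO_USE_READ_STREAMS_IN_ANALYZE;
}

/* sketch: acquire_sample_rows() side */
static Buffer
next_sample_buffer(TableScanDesc scan, ReadStream *stream,
				   BlockNumber blockno, BufferAccessStrategy strategy)
{
	if (scan->rs_flags & SO_USE_READ_STREAMS_IN_ANALYZE)
		return read_stream_next_buffer(stream, NULL);	/* stream look-ahead */

	/* fall back to the pre-stream, block-at-a-time path */
	return ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, blockno,
							  RBM_NORMAL, strategy);
}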
Re: Use streaming read API in ANALYZE
Hi, On Mon, 8 Apr 2024 at 04:21, Thomas Munro wrote: > > Pushed. Thanks Bilal and reviewers! I wanted to discuss what will happen to this patch now that 27bc1772fc8 is reverted. I am continuing this thread but I can create another thread if you prefer so. After the revert of 27bc1772fc8, acquire_sample_rows() became table-AM agnostic again. So, read stream changes might have to be pushed down now but there are a couple of roadblocks like Melanie mentioned [1] before. Quote from Melanie [1]: On Thu, 11 Apr 2024 at 19:19, Melanie Plageman wrote: > > I am working on pushing streaming ANALYZE into heap AM code, and I ran > into a few roadblocks. > > If we want ANALYZE to make the ReadStream object in heap_beginscan() > (like the read stream implementation of heap sequential and TID range > scans do), I don't see any way around changing the scan_begin table AM > callback to take a BufferAccessStrategy at the least (and perhaps also > the BlockSamplerData). > > read_stream_begin_relation() doesn't just save the > BufferAccessStrategy in the ReadStream, it uses it to set various > other things in the ReadStream object. callback_private_data (which in > ANALYZE's case is the BlockSamplerData) is simply saved in the > ReadStream, so it could be set later, but that doesn't sound very > clean to me. > > As such, it seems like a cleaner alternative would be to add a table > AM callback for creating a read stream object that takes the > parameters of read_stream_begin_relation(). But, perhaps it is a bit > late for such additions. If we do not want to add a new table AM callback like Melanie mentioned, it is pretty much required to pass BufferAccessStrategy and BlockSamplerData to the initscan(). > It also opens us up to the question of whether or not sequential scan > should use such a callback instead of making the read stream object in > heap_beginscan(). > > I am happy to write a patch that does any of the above. But, I want to > raise these questions, because perhaps I am simply missing an obvious > alternative solution. I wonder the same, I could not think of any alternative solution to this problem. Another quote from Melanie [2] in the same thread: On Thu, 11 Apr 2024 at 20:48, Melanie Plageman wrote: > > I will also say that, had this been 6 months ago, I would probably > suggest we restructure ANALYZE's table AM interface to accommodate > read stream setup and to address a few other things I find odd about > the current code. For example, I think creating a scan descriptor for > the analyze scan in acquire_sample_rows() is quite odd. It seems like > it would be better done in the relation_analyze callback. The > relation_analyze callback saves some state like the callbacks for > acquire_sample_rows() and the Buffer Access Strategy. But at least in > the heap implementation, it just saves them in static variables in > analyze.c. It seems like it would be better to save them in a useful > data structure that could be accessed later. We have access to pretty > much everything we need at that point (in the relation_analyze > callback). I also think heap's implementation of > table_beginscan_analyze() doesn't need most of > heap_beginscan()/initscan(), so doing this instead of something > ANALYZE specific seems more confusing than helpful. If we want to implement ANALYZE specific counterparts of heap_beginscan()/initscan(); we may think of passing BufferAccessStrategy and BlockSamplerData to them. Also, there is an ongoing(?) 
discussion about a few problems / improvements to acquire_sample_rows(), mentioned at the end of the 'Table AM Interface Enhancements' thread [3]. Should we wait for those discussions to be resolved, or can we resume working on this patch? Any kind of feedback would be appreciated. [1] https://www.postgresql.org/message-id/CAAKRu_ZxU6hucckrT1SOJxKfyN7q-K4KU1y62GhDwLBZWG%2BROg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAAKRu_YkphAPNbBR2jcLqnxGhDEWTKhYfLFY%3D0R_oG5LHBH7Gw%40mail.gmail.com [3] https://www.postgresql.org/message-id/flat/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com -- Regards, Nazir Bilal Yavuz Microsoft
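[Editor's note] To make the alternative Melanie describes concrete, here is a rough sketch of what such a table AM callback could look like. Nothing like this exists in PostgreSQL today: the callback name and its placement are hypothetical, and the parameter list simply mirrors read_stream_begin_relation() (ReadStreamBlockNumberCB is the callback typedef from storage/read_stream.h):

/*
 * Hypothetical table AM callback (illustration only): let the table AM
 * build the ReadStream used for ANALYZE's block sampling, so that
 * acquire_sample_rows() can stay table-AM agnostic.  The parameters
 * mirror read_stream_begin_relation().
 */
typedef ReadStream *(*scan_analyze_begin_stream_function) (int flags,
                                                           BufferAccessStrategy strategy,
                                                           Relation rel,
                                                           ForkNumber forknum,
                                                           ReadStreamBlockNumberCB callback,
                                                           void *callback_private_data,
                                                           size_t per_buffer_data_size);

A heap implementation could forward these arguments straight to read_stream_begin_relation(), while another table AM could build a stream over whatever storage layout it uses.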
Re: Use streaming read API in ANALYZE
Hi, On Wed, 3 Apr 2024 at 22:25, Nazir Bilal Yavuz wrote: > > Hi, > > Thank you for looking into this! > > On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas wrote: > > > > On 03/04/2024 13:31, Nazir Bilal Yavuz wrote: > > > Streaming API has been committed but the committed version has a minor > > > change, the read_stream_begin_relation function takes Relation instead > > > of BufferManagerRelation now. So, here is a v5 which addresses this > > > change. > > > > I'm getting a repeatable segfault / assertion failure with this: > > > > postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10); > > CREATE TABLE > > postgres=# insert into tengiga select g, repeat('x', 900) from > > generate_series(1, 140) g; > > INSERT 0 140 > > postgres=# set default_statistics_target = 10; ANALYZE tengiga; > > SET > > ANALYZE > > postgres=# set default_statistics_target = 100; ANALYZE tengiga; > > SET > > ANALYZE > > postgres=# set default_statistics_target =1000; ANALYZE tengiga; > > SET > > server closed the connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > > > > TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File: > > "heapam_handler.c", Line: 1079, PID: 262232 > > postgres: heikki postgres [local] > > ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8] > > postgres: heikki postgres [local] > > ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34] > > postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34] > > postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a] > > postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9] > > postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0] > > postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe] > > postgres: heikki postgres [local] > > ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9] > > postgres: heikki postgres [local] > > ANALYZE(ProcessUtility+0x136)[0x564889f0afb1] > > postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8] > > postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b] > > postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015] > > postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6] > > postgres: heikki postgres [local] > > ANALYZE(PostgresMain+0x80c)[0x564889f06fd7] > > postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876] > > postgres: heikki postgres [local] > > ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3] > > postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e] > > postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0] > > postgres: heikki postgres [local] > > ANALYZE(PostmasterMain+0x152b)[0x564889e2214d] > > postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4] > > /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a] > > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305] > > postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61] > > 2024-04-03 20:15:49.157 EEST [262101] LOG: server process (PID 262232) > > was terminated by signal 6: Aborted > > I realized the same error while working on Jakub's benchmarking results. > > Cause: I was using the nblocks variable to check how many blocks will > be returned from the streaming API. 
But I realized that sometimes the > number returned from BlockSampler_Init() is not equal to the number of > blocks that BlockSampler_Next() will return as BlockSampling algorithm > decides how many blocks to return on the fly by using some random > seeds. I wanted to re-check this problem and realized that I was wrong. I tried using nblocks again and this time there was no failure. I looked at the block sampling logic and I am pretty sure that BlockSampler_Init() correctly returns the number of blocks that BlockSampler_Next() will return (see the sketch after this message). It seems 158f581923 fixed this issue as well. -- Regards, Nazir Bilal Yavuz Microsoft
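[Editor's note] To put the invariant Bilal re-checked into code form, a minimal sketch using the sampling API as it exists at this point in the thread (the variables follow acquire_sample_rows(); their initialization is elided):

BlockSamplerData bs;
BlockNumber nblocks;
BlockNumber returned = 0;

/* BlockSampler_Init() reports how many blocks will be sampled: Min(n, N). */
nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);

/* Knuth's Algorithm S then yields exactly that many blocks, in increasing order. */
while (BlockSampler_HasMore(&bs))
{
    (void) BlockSampler_Next(&bs);
    returned++;
}
Assert(returned == nblocks);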
Re: Use streaming read API in ANALYZE
On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman wrote: > On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote: > > On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote: > > > From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001 > > > From: Melanie Plageman > > > Date: Sun, 7 Apr 2024 15:38:41 -0400 > > > Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore() > > > > > > A previous commit stopped using BlockSampler_HasMore() for flow control > > > in acquire_sample_rows(). There seems little use now for > > > BlockSampler_HasMore(). It should be sufficient to return > > > InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore() > > > would have returned false. Remove BlockSampler_HasMore(). > > > > > > Author: Melanie Plageman, Nazir Bilal Yavuz > > > Discussion: > > > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com > > > > The justification here seems somewhat odd. Sure, the previous commit stopped > > using BlockSampler_HasMore in acquire_sample_rows - but only because it was > > moved to block_sampling_streaming_read_next()? > > It didn't stop using it. It stopped being useful. The reason it existed, > as far as I can tell, was to use it as the while() loop condition in > acquire_sample_rows(). I think it makes much more sense for > BlockSampler_Next() to return InvalidBlockNumber when there are no more > blocks -- not to assert you don't call it when there aren't any more > blocks. > > I didn't want to change BlockSampler_Next() in the same commit as the > streaming read user and we can't remove BlockSampler_HasMore() without > changing BlockSampler_Next(). I agree that the code looks useless if one condition implies the other, but isn't it good to keep that cross-check, perhaps reformulated as an assertion? I didn't look too hard at the maths, I just saw the words "It is not obvious that this code matches Knuth's Algorithm S ..." and realised I'm not sure I have time to develop a good opinion about this today. So I'll leave the 0002 change out for now, as it's a tidy-up that can easily be applied in the next cycle. From c3b8df8e4720d8b0dfb4c892c0aa3ddaef8f401f Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 8 Apr 2024 14:38:58 +1200 Subject: [PATCH v12] Remove obsolete BlockSampler_HasMore(). Commit 041b9680 stopped using BlockSampler_HasMore() for flow control in acquire_sample_rows(). There seems to be little use for it now. We can just return InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore() would have returned false. Author: Melanie Plageman Author: Nazir Bilal Yavuz Reviewed-by: Andres Freund Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com --- src/backend/commands/analyze.c| 4 +--- src/backend/utils/misc/sampling.c | 10 +++--- src/include/utils/sampling.h | 1 - 3 files changed, 4 insertions(+), 11 deletions(-) diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index da27a13a3f..e9fa3470cf 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -,9 +,7 @@ block_sampling_read_stream_next(ReadStream *stream, void *callback_private_data, void *per_buffer_data) { - BlockSamplerData *bs = callback_private_data; - - return BlockSampler_HasMore(bs) ? 
BlockSampler_Next(bs) : InvalidBlockNumber; + return BlockSampler_Next(callback_private_data); } /* diff --git a/src/backend/utils/misc/sampling.c b/src/backend/utils/misc/sampling.c index 933db06702..6e2bca9739 100644 --- a/src/backend/utils/misc/sampling.c +++ b/src/backend/utils/misc/sampling.c @@ -54,12 +54,6 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize, return Min(bs->n, bs->N); } -bool -BlockSampler_HasMore(BlockSampler bs) -{ - return (bs->t < bs->N) && (bs->m < bs->n); -} - BlockNumber BlockSampler_Next(BlockSampler bs) { @@ -68,7 +62,9 @@ BlockSampler_Next(BlockSampler bs) double p;/* probability to skip block */ double V;/* random */ - Assert(BlockSampler_HasMore(bs)); /* hence K > 0 and k > 0 */ + /* Return if no remaining blocks or no blocks to sample */ + if (K <= 0 || k <= 0) + return InvalidBlockNumber; if ((BlockNumber) k >= K) { diff --git a/src/include/utils/sampling.h b/src/include/utils/sampling.h index be48ee52ba..fb5d6820a2 100644 --- a/src/include/utils/sampling.h +++ b/src/include/utils/sampling.h @@ -38,7 +38,6 @@ typedef BlockSamplerData *BlockSampler; extern BlockNumber BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize, uint32 randseed); -extern bool BlockSampler_HasMore(BlockSampler bs); extern BlockNumber BlockSampler_Next(BlockSampler bs); /* Reservoir sampling methods */ -- 2.44.0
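[Editor's note] With BlockSampler_Next() handling exhaustion itself, the consuming side reduces to the pattern visible in the patches quoted elsewhere in this thread. A condensed sketch (not a verbatim excerpt) of the loop in acquire_sample_rows():

stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
                                    vac_strategy,
                                    scan->rs_base.rs_rd,
                                    MAIN_FORKNUM,
                                    block_sampling_read_stream_next,
                                    &bs,
                                    0);

/*
 * read_stream_next_buffer() returns InvalidBuffer once the callback
 * returns InvalidBlockNumber, which ends the loop.
 */
while (BufferIsValid(scan->rs_cbuf = read_stream_next_buffer(stream, NULL)))
{
    /* ... lock the page and sample its tuples ... */
}

read_stream_end(stream);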
Re: Use streaming read API in ANALYZE
On Mon, Apr 8, 2024 at 10:26 AM Melanie Plageman wrote: > On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote: > > > src/backend/commands/analyze.c | 89 ++ > > > 1 file changed, 26 insertions(+), 63 deletions(-) > > > > That's a very nice demonstration of how this makes good prefetching > > easier... > > Agreed. Yay streaming read API and Bilal! +1 I found a few comments to tweak, just a couple of places that hadn't got the memo after we renamed "read stream", and an obsolete mention of pinning buffers. I adjusted those directly. I ran some tests on a random basic Linux/ARM cloud box with a 7.6GB table, and I got:

                                     cold    hot
master:                              9025ms  199ms
patched, io_combine_limit=1:         9025ms  191ms
patched, io_combine_limit=default:   8729ms  191ms

Despite being random, occasionally some I/Os must get merged, allowing slightly better random throughput when accessing disk blocks through a 3000 IOPS drinking straw. Looking at strace, I see 29144 pread* calls instead of 30071, which fits that theory. Let's see... if you roll a fair 973452-sided die 30071 times, how many times do you expect to roll consecutive numbers? Each time you roll there is a 1/973452 chance that you get the last number + 1, and we have 30071 tries, giving 30071/973452 = ~3%. 9025ms minus 3% is 8754ms. Seems about right. I am not sure exactly why the hot number is faster. (Anecdotally, I did notice that in the cases that beat master semi-unexpectedly like this, my software memory prefetch patch doesn't help or hurt, while in some cases and on some CPUs there is little difference, and then that patch seems to get a speed-up like this, which might be a clue. *Shrug*, investigation needed.) Pushed. Thanks Bilal and reviewers!
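[Editor's note] Spelling out that back-of-the-envelope estimate with the numbers above (this restatement is the editor's, and it leans on the fact that BlockSampler_Next() hands out block numbers in increasing order, so two reads can merge exactly when adjacent blocks are both sampled):

  P(block b+1 sampled | block b sampled) ≈ n/N = 30071/973452 ≈ 3%
  E[merged reads] ≈ 30071 × (30071/973452) ≈ 929, vs. 30071 − 29144 = 927 observed
  expected runtime ≈ 9025 ms × (1 − 0.03) ≈ 8754 ms

So the strace counts and the ~3% cold speedup are consistent with purely random sampling plus occasional adjacent-block merges.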
Re: Use streaming read API in ANALYZE
On Sun, Apr 07, 2024 at 03:00:00PM -0700, Andres Freund wrote: > Hi, > > On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote: > > From 1dc2343661f3edb3b1bc4307afb0e956397eb76c Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman > > Date: Sun, 7 Apr 2024 14:55:22 -0400 > > Subject: [PATCH v10 1/3] Make heapam_scan_analyze_next_[tuple|block] static. > > > > 27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and > > scan_analzye_next_tuple -- leaving their heap AM implementations only > > called by acquire_sample_rows(). > > Ugh, I don't think 27bc1772fc81 makes much sense. But that's unrelated to this > thread. I did raise that separately > https://www.postgresql.org/message-id/20240407214001.jgpg5q3yv33ve6y3%40awork3.anarazel.de > > Unless I seriously missed something, I see no alternative to reverting that > commit. Noted. I'll give up on this refactor then. Lots of churn for no gain. Attached v11 is just Bilal's v8 patch rebased to apply cleanly and with a few tweaks (I changed one of the loop conditions. All other changes are to comments and commit message). > > @@ -1206,11 +1357,13 @@ acquire_sample_rows(Relation onerel, int elevel, > > break; > > > > prefetch_block = BlockSampler_Next(&prefetch_bs); > > - PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, > > prefetch_block); > > + PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, > > prefetch_block); > > } > > } > > #endif > > > > + scan->rs_cbuf = InvalidBuffer; > > + > > /* Outer loop over blocks to sample */ > > while (BlockSampler_HasMore(&bs)) > > { > > I don't think it's good to move a lot of code *and* change how it is > structured in the same commit. Makes it much harder to actually see changes / > makes git blame harder to use / etc. Yep. > > From 90d115c2401567be65bcf64393a6d3b39286779e Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman > > Date: Sun, 7 Apr 2024 15:28:32 -0400 > > Subject: [PATCH v10 2/3] Use streaming read API in ANALYZE > > > > The ANALYZE command prefetches and reads sample blocks chosen by a > > BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for > > each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0. > > > > Author: Nazir Bilal Yavuz > > Reviewed-by: Melanie Plageman > > Discussion: > > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com > > --- > > src/backend/commands/analyze.c | 89 ++ > > 1 file changed, 26 insertions(+), 63 deletions(-) > > That's a very nice demonstration of how this makes good prefetching easier... Agreed. Yay streaming read API and Bilal! > > From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001 > > From: Melanie Plageman > > Date: Sun, 7 Apr 2024 15:38:41 -0400 > > Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore() > > > > A previous commit stopped using BlockSampler_HasMore() for flow control > > in acquire_sample_rows(). There seems little use now for > > BlockSampler_HasMore(). It should be sufficient to return > > InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore() > > would have returned false. Remove BlockSampler_HasMore(). > > > > Author: Melanie Plageman, Nazir Bilal Yavuz > > Discussion: > > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com > > The justification here seems somewhat odd. Sure, the previous commit stopped > using BlockSampler_HasMore in acquire_sample_rows - but only because it was > moved to block_sampling_streaming_read_next()? It didn't stop using it. 
It stopped being useful. The reason it existed, as far as I can tell, was to use it as the while() loop condition in acquire_sample_rows(). I think it makes much more sense for BlockSampler_Next() to return InvalidBlockNumber when there are no more blocks -- not to assert you don't call it when there aren't any more blocks. I didn't want to change BlockSampler_Next() in the same commit as the streaming read user and we can't remove BlockSampler_HasMore() without changing BlockSampler_Next(). - Melanie >From 3cb43693c04554f5d46e0dc9156bef36af642593 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sun, 7 Apr 2024 18:17:01 -0400 Subject: [PATCH v11 1/2] Use streaming read API in ANALYZE The ANALYZE command prefetches and reads sample blocks chosen by a BlockSampler algorithm. Instead of calling [
Re: Use streaming read API in ANALYZE
Hi, On 2024-04-07 16:59:26 -0400, Melanie Plageman wrote: > From 1dc2343661f3edb3b1bc4307afb0e956397eb76c Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Sun, 7 Apr 2024 14:55:22 -0400 > Subject: [PATCH v10 1/3] Make heapam_scan_analyze_next_[tuple|block] static. > > 27bc1772fc81 removed the table AM callbacks scan_analyze_next_block and > scan_analzye_next_tuple -- leaving their heap AM implementations only > called by acquire_sample_rows(). Ugh, I don't think 27bc1772fc81 makes much sense. But that's unrelated to this thread. I did raise that separately https://www.postgresql.org/message-id/20240407214001.jgpg5q3yv33ve6y3%40awork3.anarazel.de Unless I seriously missed something, I see no alternative to reverting that commit. > @@ -1206,11 +1357,13 @@ acquire_sample_rows(Relation onerel, int elevel, > break; > > prefetch_block = BlockSampler_Next(&prefetch_bs); > - PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, > prefetch_block); > + PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, > prefetch_block); > } > } > #endif > > + scan->rs_cbuf = InvalidBuffer; > + > /* Outer loop over blocks to sample */ > while (BlockSampler_HasMore(&bs)) > { I don't think it's good to move a lot of code *and* change how it is structured in the same commit. Makes it much harder to actually see changes / makes git blame harder to use / etc. > From 90d115c2401567be65bcf64393a6d3b39286779e Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Sun, 7 Apr 2024 15:28:32 -0400 > Subject: [PATCH v10 2/3] Use streaming read API in ANALYZE > > The ANALYZE command prefetches and reads sample blocks chosen by a > BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for > each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0. > > Author: Nazir Bilal Yavuz > Reviewed-by: Melanie Plageman > Discussion: > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com > --- > src/backend/commands/analyze.c | 89 ++ > 1 file changed, 26 insertions(+), 63 deletions(-) That's a very nice demonstration of how this makes good prefetching easier... > From 862b7ac81cdafcda7b525e02721da14e46265509 Mon Sep 17 00:00:00 2001 > From: Melanie Plageman > Date: Sun, 7 Apr 2024 15:38:41 -0400 > Subject: [PATCH v10 3/3] Obsolete BlockSampler_HasMore() > > A previous commit stopped using BlockSampler_HasMore() for flow control > in acquire_sample_rows(). There seems little use now for > BlockSampler_HasMore(). It should be sufficient to return > InvalidBlockNumber from BlockSampler_Next() when BlockSample_HasMore() > would have returned false. Remove BlockSampler_HasMore(). > > Author: Melanie Plageman, Nazir Bilal Yavuz > Discussion: > https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com The justification here seems somewhat odd. Sure, the previous commit stopped using BlockSampler_HasMore in acquire_sample_rows - but only because it was moved to block_sampling_streaming_read_next()? Greetings, Andres Freund
Re: Use streaming read API in ANALYZE
am.h @@ -412,15 +412,6 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple); extern bool HeapTupleIsSurelyDead(HeapTuple htup, struct GlobalVisState *vistest); -/* in heap/heapam_handler.c*/ -extern void heapam_scan_analyze_next_block(TableScanDesc scan, - BlockNumber blockno, - BufferAccessStrategy bstrategy); -extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan, - TransactionId OldestXmin, - double *liverows, double *deadrows, - TupleTableSlot *slot); - /* * To avoid leaking too much knowledge about reorderbuffer implementation * details this is implemented in reorderbuffer.c not heapam_visibility.c -- 2.40.1 From 90d115c2401567be65bcf64393a6d3b39286779e Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sun, 7 Apr 2024 15:28:32 -0400 Subject: [PATCH v10 2/3] Use streaming read API in ANALYZE The ANALYZE command prefetches and reads sample blocks chosen by a BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0. Author: Nazir Bilal Yavuz Reviewed-by: Melanie Plageman Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com --- src/backend/commands/analyze.c | 89 ++ 1 file changed, 26 insertions(+), 63 deletions(-) diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 335ffb24302..3cfad92390d 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -1253,6 +1253,20 @@ heap_scan_analyze_next_tuple(HeapScanDesc scan, TransactionId OldestXmin, return false; } +/* + * Streaming read callback returning the next block number while using + * BlockSampling algorithm. + */ +static BlockNumber +block_sampling_streaming_read_next(ReadStream *stream, + void *user_data, + void *per_buffer_data) +{ + BlockSamplerData *bs = user_data; + + return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber; +} + /* * acquire_sample_rows -- acquire a random sample of rows from the heap * @@ -1305,10 +1319,7 @@ acquire_sample_rows(Relation onerel, int elevel, HeapScanDesc scan; BlockNumber nblocks; BlockNumber blksdone = 0; -#ifdef USE_PREFETCH - int prefetch_maximum = 0; /* blocks to prefetch if enabled */ - BlockSamplerData prefetch_bs; -#endif + ReadStream *stream; Assert(targrows > 0); @@ -1321,13 +1332,6 @@ acquire_sample_rows(Relation onerel, int elevel, randseed = pg_prng_uint32(&pg_global_prng_state); nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed); -#ifdef USE_PREFETCH - prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace); - /* Create another BlockSampler, using the same seed, for prefetching */ - if (prefetch_maximum) - (void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed); -#endif - /* Report sampling block numbers */ pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL, nblocks); @@ -1337,54 +1341,21 @@ acquire_sample_rows(Relation onerel, int elevel, scan = (HeapScanDesc) heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE); slot = table_slot_create(onerel, NULL); - -#ifdef USE_PREFETCH - - /* - * If we are doing prefetching, then go ahead and tell the kernel about - * the first set of pages we are going to want. This also moves our - * iterator out ahead of the main one being used, where we will keep it so - * that we're always pre-fetching out prefetch_maximum number of blocks - * ahead. 
- */ - if (prefetch_maximum) - { - for (int i = 0; i < prefetch_maximum; i++) - { - BlockNumber prefetch_block; - - if (!BlockSampler_HasMore(&prefetch_bs)) -break; - - prefetch_block = BlockSampler_Next(&prefetch_bs); - PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_block); - } - } -#endif + stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, + vac_strategy, + scan->rs_base.rs_rd, + MAIN_FORKNUM, + block_sampling_streaming_read_next, + &bs, + 0); scan->rs_cbuf = InvalidBuffer; /* Outer loop over blocks to sample */ - while (BlockSampler_HasMore(&bs)) + while (BufferIsValid(scan->rs_cbuf = read_stream_next_buffer(stream, NULL))) { - BlockNumber targblock = BlockSampler_Next(&bs); -#ifdef USE_PREFETCH - BlockNumber prefetch_targblock = InvalidBlockNumber; - - /* - * Make sure that every time the main BlockSampler is moved forward - * that our prefetch BlockSampler also gets moved forward, so that we - * always stay out ahead. - */ - if (prefetch_maximum && BlockSampler_HasMore(&prefetch_bs)) - prefetch_targblock = BlockSampler_Next(&prefetch_bs); -#endif - vacuum_delay_point(); - scan->rs_cblock = targblock; - scan->r
Re: Use streaming read API in ANALYZE
ample_rows(Relation onerel, int elevel, BlockSamplerData bs; ReservoirStateData rstate; TupleTableSlot *slot; - TableScanDesc scan; + HeapScanDesc scan; BlockNumber nblocks; BlockNumber blksdone = 0; #ifdef USE_PREFETCH @@ -1184,7 +1335,7 @@ acquire_sample_rows(Relation onerel, int elevel, /* Prepare for sampling rows */ reservoir_init_selection_state(&rstate, targrows); - scan = heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE); + scan = (HeapScanDesc) heap_beginscan(onerel, NULL, 0, NULL, NULL, SO_TYPE_ANALYZE); slot = table_slot_create(onerel, NULL); #ifdef USE_PREFETCH @@ -1206,11 +1357,13 @@ acquire_sample_rows(Relation onerel, int elevel, break; prefetch_block = BlockSampler_Next(&prefetch_bs); - PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block); + PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_block); } } #endif + scan->rs_cbuf = InvalidBuffer; + /* Outer loop over blocks to sample */ while (BlockSampler_HasMore(&bs)) { @@ -1229,7 +1382,21 @@ acquire_sample_rows(Relation onerel, int elevel, vacuum_delay_point(); - heapam_scan_analyze_next_block(scan, targblock, vac_strategy); + scan->rs_cblock = targblock; + scan->rs_cindex = FirstOffsetNumber; + + /* + * We must maintain a pin on the target page's buffer to ensure that + * concurrent activity - e.g. HOT pruning - doesn't delete tuples out + * from under us. Hence, pin the page until we are done looking at + * it. We also choose to hold sharelock on the buffer throughout --- + * we could release and re-acquire sharelock for each tuple, but since + * we aren't doing much work per tuple, the extra lock traffic is + * probably better avoided. + */ + scan->rs_cbuf = ReadBufferExtended(scan->rs_base.rs_rd, MAIN_FORKNUM, + targblock, RBM_NORMAL, vac_strategy); + LockBuffer(scan->rs_cbuf, BUFFER_LOCK_SHARE); #ifdef USE_PREFETCH @@ -1238,10 +1405,10 @@ acquire_sample_rows(Relation onerel, int elevel, * next one we will want, if there's any left. */ if (prefetch_maximum && prefetch_targblock != InvalidBlockNumber) - PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock); + PrefetchBuffer(scan->rs_base.rs_rd, MAIN_FORKNUM, prefetch_targblock); #endif - while (heapam_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot)) + while (heap_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot)) { /* * The first targrows sample rows are simply copied into the @@ -1291,7 +1458,7 @@ acquire_sample_rows(Relation onerel, int elevel, } ExecDropSingleTupleTableSlot(slot); - heap_endscan(scan); + heap_endscan((TableScanDesc) scan); /* * If we didn't find as many tuples as we wanted then we're done. 
No sort diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 48936826bcc..be630620d0d 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -412,15 +412,6 @@ extern bool HeapTupleHeaderIsOnlyLocked(HeapTupleHeader tuple); extern bool HeapTupleIsSurelyDead(HeapTuple htup, struct GlobalVisState *vistest); -/* in heap/heapam_handler.c*/ -extern void heapam_scan_analyze_next_block(TableScanDesc scan, - BlockNumber blockno, - BufferAccessStrategy bstrategy); -extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan, - TransactionId OldestXmin, - double *liverows, double *deadrows, - TupleTableSlot *slot); - /* * To avoid leaking too much knowledge about reorderbuffer implementation * details this is implemented in reorderbuffer.c not heapam_visibility.c -- 2.40.1 >From 8ad10261b84775fde97f71cd273041af5160fc28 Mon Sep 17 00:00:00 2001 From: Melanie Plageman Date: Sun, 7 Apr 2024 15:28:32 -0400 Subject: [PATCH v9 2/3] Use streaming read API in ANALYZE The ANALYZE command prefetches and reads sample blocks chosen by a BlockSampler algorithm. Instead of calling Prefetch|ReadBuffer() for each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0. Author: Nazir Bilal Yavuz Reviewed-by: Melanie Plageman Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com --- src/backend/commands/analyze.c | 89 ++ 1 file changed, 26 insertions(+), 63 deletions(-) diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 335ffb24302..3cfad92390d 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -1253,6 +1253,20 @@ heap_scan_analyze_next_tuple(HeapScanDesc scan, TransactionId OldestXmin, return false; } +/* + * Streaming read callback returning the next block number while using + * BlockSampling algorithm. + */ +static BlockNumber +block_sampling_streaming_read_next(ReadStream *stream, + v
Re: Use streaming read API in ANALYZE
Hi, On Wed, 3 Apr 2024 at 23:44, Melanie Plageman wrote: > > > I've reviewed the patches inline below and attached a patch that has > some of my ideas on top of your patch. Thank you! > > > From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001 > > From: Nazir Bilal Yavuz > > Date: Wed, 3 Apr 2024 15:14:15 +0300 > > Subject: [PATCH v6] Use streaming read API in ANALYZE > > > > ANALYZE command gets random tuples using BlockSampler algorithm. Use > > streaming reads to get these tuples by using BlockSampler algorithm in > > streaming read API prefetch logic. > > --- > > src/include/access/heapam.h | 6 +- > > src/backend/access/heap/heapam_handler.c | 22 +++--- > > src/backend/commands/analyze.c | 85 > > 3 files changed, 42 insertions(+), 71 deletions(-) > > > > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h > > index a307fb5f245..633caee9d95 100644 > > --- a/src/include/access/heapam.h > > +++ b/src/include/access/heapam.h > > @@ -25,6 +25,7 @@ > > #include "storage/bufpage.h" > > #include "storage/dsm.h" > > #include "storage/lockdefs.h" > > +#include "storage/read_stream.h" > > #include "storage/shm_toc.h" > > #include "utils/relcache.h" > > #include "utils/snapshot.h" > > @@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup, > > struct > > GlobalVisState *vistest); > > > > /* in heap/heapam_handler.c*/ > > -extern void heapam_scan_analyze_next_block(TableScanDesc scan, > > - > > BlockNumber blockno, > > - > > BufferAccessStrategy bstrategy); > > +extern bool heapam_scan_analyze_next_block(TableScanDesc scan, > > + > > ReadStream *stream); > > extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan, > > > > TransactionId OldestXmin, > > > > double *liverows, double *deadrows, > > diff --git a/src/backend/access/heap/heapam_handler.c > > b/src/backend/access/heap/heapam_handler.c > > index 0952d4a98eb..d83fbbe6af3 100644 > > --- a/src/backend/access/heap/heapam_handler.c > > +++ b/src/backend/access/heap/heapam_handler.c > > @@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, > > Relation NewHeap, > > } > > > > /* > > - * Prepare to analyze block `blockno` of `scan`. The scan has been started > > - * with SO_TYPE_ANALYZE option. > > + * Prepare to analyze block returned from streaming object. If the block > > returned > > + * from streaming object is valid, true is returned; otherwise false is > > returned. > > + * The scan has been started with SO_TYPE_ANALYZE option. > > * > > * This routine holds a buffer pin and lock on the heap page. They are > > held > > * until heapam_scan_analyze_next_tuple() returns false. That is until > > all the > > * items of the heap page are analyzed. > > */ > > -void > > -heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno, > > - > > BufferAccessStrategy bstrategy) > > +bool > > +heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream) > > { > > HeapScanDesc hscan = (HeapScanDesc) scan; > > > > @@ -1076,11 +1076,15 @@ heapam_scan_analyze_next_block(TableScanDesc scan, > > BlockNumber blockno, > >* doing much work per tuple, the extra lock traffic is probably > > better > >* avoided. > > Personally I think heapam_scan_analyze_next_block() should be inlined. > It only has a few lines. I would find it clearer inline. At the least, > there is no reason for it (or heapam_scan_analyze_next_tuple()) to take > a TableScanDesc instead of a HeapScanDesc. I agree. 
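[Editor's note] For context, the signature change being agreed to here looks roughly like the following; the v10 patch quoted earlier in this archive does essentially this (the "after" form is a sketch, not the exact committed code):

/* Before: table-AM-shaped and exported from heapam_handler.c */
extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan,
                                           TransactionId OldestXmin,
                                           double *liverows, double *deadrows,
                                           TupleTableSlot *slot);

/* After (sketch): file-local and typed to the heap scan descriptor */
static bool heap_scan_analyze_next_tuple(HeapScanDesc scan,
                                         TransactionId OldestXmin,
                                         double *liverows, double *deadrows,
                                         TupleTableSlot *slot);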
> > >*/ > > - hscan->rs_cblock = blockno; > > - hscan->rs_cindex = FirstOffsetNumber; > > - hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, > > -
Re: Use streaming read API in ANALYZE
On Wed, Apr 03, 2024 at 10:25:01PM +0300, Nazir Bilal Yavuz wrote: > > I realized the same error while working on Jakub's benchmarking results. > > Cause: I was using the nblocks variable to check how many blocks will > be returned from the streaming API. But I realized that sometimes the > number returned from BlockSampler_Init() is not equal to the number of > blocks that BlockSampler_Next() will return as BlockSampling algorithm > decides how many blocks to return on the fly by using some random > seeds. > > There are a couple of solutions I thought of: > > 1- Use BlockSampler_HasMore() instead of nblocks in the main loop in > the acquire_sample_rows(): > > Streaming API uses this function to prefetch block numbers. > BlockSampler_HasMore() will reach to the end first as it is used while > prefetching, so it will start to return false while there are still > buffers to return from the streaming API. That will cause some buffers > at the end to not be processed. > > 2- Expose something (function, variable etc.) from the streaming API > to understand if the read is finished and there is no buffer to > return: > > I think this works but I am not sure if the streaming API allows > something like that. > > 3- Check every buffer returned from the streaming API, if it is > invalid stop the main loop in the acquire_sample_rows(): > > This solves the problem but there will be two if checks for each > buffer returned, > - in heapam_scan_analyze_next_block() to check if the returned buffer is > invalid > - to break main loop in acquire_sample_rows() if > heapam_scan_analyze_next_block() returns false > One of the if cases can be bypassed by moving > heapam_scan_analyze_next_block()'s code to the main loop in the > acquire_sample_rows(). > > I implemented the third solution, here is v6. I've reviewed the patches inline below and attached a patch that has some of my ideas on top of your patch. > From 8d396a42186325f920d5a05e7092d8e1b66f3cdf Mon Sep 17 00:00:00 2001 > From: Nazir Bilal Yavuz > Date: Wed, 3 Apr 2024 15:14:15 +0300 > Subject: [PATCH v6] Use streaming read API in ANALYZE > > ANALYZE command gets random tuples using BlockSampler algorithm. Use > streaming reads to get these tuples by using BlockSampler algorithm in > streaming read API prefetch logic. 
> --- > src/include/access/heapam.h | 6 +- > src/backend/access/heap/heapam_handler.c | 22 +++--- > src/backend/commands/analyze.c | 85 > 3 files changed, 42 insertions(+), 71 deletions(-) > > diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h > index a307fb5f245..633caee9d95 100644 > --- a/src/include/access/heapam.h > +++ b/src/include/access/heapam.h > @@ -25,6 +25,7 @@ > #include "storage/bufpage.h" > #include "storage/dsm.h" > #include "storage/lockdefs.h" > +#include "storage/read_stream.h" > #include "storage/shm_toc.h" > #include "utils/relcache.h" > #include "utils/snapshot.h" > @@ -388,9 +389,8 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup, > struct > GlobalVisState *vistest); > > /* in heap/heapam_handler.c*/ > -extern void heapam_scan_analyze_next_block(TableScanDesc scan, > - >BlockNumber blockno, > - >BufferAccessStrategy bstrategy); > +extern bool heapam_scan_analyze_next_block(TableScanDesc scan, > + >ReadStream *stream); > extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan, > >TransactionId OldestXmin, > >double *liverows, double *deadrows, > diff --git a/src/backend/access/heap/heapam_handler.c > b/src/backend/access/heap/heapam_handler.c > index 0952d4a98eb..d83fbbe6af3 100644 > --- a/src/backend/access/heap/heapam_handler.c > +++ b/src/backend/access/heap/heapam_handler.c > @@ -1054,16 +1054,16 @@ heapam_relation_copy_for_cluster(Relation OldHeap, > Relation NewHeap, > } > > /* > - * Prepare to analyze block `blockno` of `scan`. The scan has been started > - * with SO_TYPE_ANALYZE option. > + * Prepare to analyze block returned from streaming object. If the block > returned > + * from streaming object is valid, true is returned; otherwis
Re: Use streaming read API in ANALYZE
Hi, Thank you for looking into this! On Wed, 3 Apr 2024 at 20:17, Heikki Linnakangas wrote: > > On 03/04/2024 13:31, Nazir Bilal Yavuz wrote: > > Streaming API has been committed but the committed version has a minor > > change, the read_stream_begin_relation function takes Relation instead > > of BufferManagerRelation now. So, here is a v5 which addresses this > > change. > > I'm getting a repeatable segfault / assertion failure with this: > > postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10); > CREATE TABLE > postgres=# insert into tengiga select g, repeat('x', 900) from > generate_series(1, 140) g; > INSERT 0 140 > postgres=# set default_statistics_target = 10; ANALYZE tengiga; > SET > ANALYZE > postgres=# set default_statistics_target = 100; ANALYZE tengiga; > SET > ANALYZE > postgres=# set default_statistics_target =1000; ANALYZE tengiga; > SET > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > > TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File: > "heapam_handler.c", Line: 1079, PID: 262232 > postgres: heikki postgres [local] > ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8] > postgres: heikki postgres [local] > ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34] > postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34] > postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a] > postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9] > postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0] > postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe] > postgres: heikki postgres [local] > ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9] > postgres: heikki postgres [local] > ANALYZE(ProcessUtility+0x136)[0x564889f0afb1] > postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8] > postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b] > postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015] > postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6] > postgres: heikki postgres [local] > ANALYZE(PostgresMain+0x80c)[0x564889f06fd7] > postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876] > postgres: heikki postgres [local] > ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3] > postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e] > postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0] > postgres: heikki postgres [local] > ANALYZE(PostmasterMain+0x152b)[0x564889e2214d] > postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4] > /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a] > /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305] > postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61] > 2024-04-03 20:15:49.157 EEST [262101] LOG: server process (PID 262232) > was terminated by signal 6: Aborted I realized the same error while working on Jakub's benchmarking results. Cause: I was using the nblocks variable to check how many blocks will be returned from the streaming API. But I realized that sometimes the number returned from BlockSampler_Init() is not equal to the number of blocks that BlockSampler_Next() will return as BlockSampling algorithm decides how many blocks to return on the fly by using some random seeds. 
There are a couple of solutions I thought of:

1- Use BlockSampler_HasMore() instead of nblocks in the main loop in acquire_sample_rows(): The streaming API uses this function to prefetch block numbers, so BlockSampler_HasMore() will reach the end first and will start to return false while there are still buffers left to return from the streaming API. That would cause some buffers at the end to not be processed.

2- Expose something (a function, a variable, etc.) from the streaming API to indicate that the read is finished and there is no buffer left to return: I think this works, but I am not sure if the streaming API allows something like that.

3- Check every buffer returned from the streaming API and, if it is invalid, stop the main loop in acquire_sample_rows(): This solves the problem, but there will be two if checks for each buffer returned,
- in heapam_scan_analyze_next_block(), to check if the returned buffer is invalid
- in acquire_sample_rows(), to break the main loop if heapam_scan_analyze_next_block() returns false
One of the if checks can be bypassed by moving heapam_scan_analyze_next_block()'s code into the main loop in acquire_sample_rows().

I implemented the third solution (sketched after this message); here is v6.

-- Regards, Nazir Bilal Yavuz
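[Editor's note] A minimal sketch of what solution 3 amounts to, assuming the v6 shape of heapam_scan_analyze_next_block() quoted elsewhere in this thread (simplified; the pinning/locking commentary is omitted):

bool
heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream)
{
    HeapScanDesc hscan = (HeapScanDesc) scan;

    /* An invalid buffer means the callback returned InvalidBlockNumber. */
    hscan->rs_cbuf = read_stream_next_buffer(stream, NULL);
    if (!BufferIsValid(hscan->rs_cbuf))
        return false;           /* tell the caller to stop sampling */

    LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE);
    hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf);
    hscan->rs_cindex = FirstOffsetNumber;
    return true;
}

The caller's side then reduces to: while (heapam_scan_analyze_next_block(scan, stream)) { ... }.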
Re: Use streaming read API in ANALYZE
Hi Jakub, Thank you for looking into this and doing a performance analysis. On Wed, 3 Apr 2024 at 11:42, Jakub Wartak wrote: > > On Tue, Apr 2, 2024 at 9:24 AM Nazir Bilal Yavuz wrote: > [..] > > v4 is rebased on top of v14 streaming read API changes. > > Hi Nazir, so with streaming API committed, I gave a try to this patch. > With autovacuum=off and 30GB table on NVMe (with standard readahead of > 256kb and ext4, Debian 12, kernel 6.1.0, shared_buffers = 128MB > default) created using: create table t as select repeat('a', 100) || i > || repeat('b', 500) as filler from generate_series(1, 4500) as i; > > on master, effect of mainteance_io_concurency [default 10] is like > that (when resetting the fs cache after each ANALYZE): > > m_io_c = 0: > Time: 3137.914 ms (00:03.138) > Time: 3094.540 ms (00:03.095) > Time: 3452.513 ms (00:03.453) > > m_io_c = 1: > Time: 2972.751 ms (00:02.973) > Time: 2939.551 ms (00:02.940) > Time: 2904.428 ms (00:02.904) > > m_io_c = 2: > Time: 1580.260 ms (00:01.580) > Time: 1572.132 ms (00:01.572) > Time: 1558.334 ms (00:01.558) > > m_io_c = 4: > Time: 938.304 ms > Time: 931.772 ms > Time: 920.044 ms > > m_io_c = 8: > Time: 666.025 ms > Time: 660.241 ms > Time: 648.848 ms > > m_io_c = 16: > Time: 542.450 ms > Time: 561.155 ms > Time: 539.683 ms > > m_io_c = 32: > Time: 538.487 ms > Time: 541.705 ms > Time: 538.101 ms > > with patch applied: > > m_io_c = 0: > Time: 3106.469 ms (00:03.106) > Time: 3140.343 ms (00:03.140) > Time: 3044.133 ms (00:03.044) > > m_io_c = 1: > Time: 2959.817 ms (00:02.960) > Time: 2920.265 ms (00:02.920) > Time: 2911.745 ms (00:02.912) > > m_io_c = 2: > Time: 1581.912 ms (00:01.582) > Time: 1561.444 ms (00:01.561) > Time: 1558.251 ms (00:01.558) > > m_io_c = 4: > Time: 908.116 ms > Time: 901.245 ms > Time: 901.071 ms > > m_io_c = 8: > Time: 619.870 ms > Time: 620.327 ms > Time: 614.266 ms > > m_io_c = 16: > Time: 529.885 ms > Time: 526.958 ms > Time: 528.474 ms > > m_io_c = 32: > Time: 521.185 ms > Time: 520.713 ms > Time: 517.729 ms > > No difference to me, which seems to be good. I've double checked and > patch used the new way > > acquire_sample_rows -> heapam_scan_analyze_next_block -> > ReadBufferExtended -> ReadBuffer_common (inlined) -> WaitReadBuffers > -> mdreadv -> FileReadV -> pg_preadv (inlined) > acquire_sample_rows -> heapam_scan_analyze_next_block -> > ReadBufferExtended -> ReadBuffer_common (inlined) -> StartReadBuffer > -> ... > > I gave also io_combine_limit to 32 (max, 256kB) a try and got those > slightly better results: > > [..] > m_io_c = 16: > Time: 494.599 ms > Time: 496.345 ms > Time: 973.500 ms > > m_io_c = 32: > Time: 461.031 ms > Time: 449.037 ms > Time: 443.375 ms > > and that (last one) apparently was able to push it to ~50-60k still > random IOPS range, the rareq-sz was still ~8 (9.9) kB as analyze was > still reading random , so I assume no merging was done: > > Devicer/s rMB/s rrqm/s %rrqm r_await rareq-sz > w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s > drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util > nvme0n1 61212.00591.82 0.00 0.000.10 9.90 > 2.00 0.02 0.00 0.000.0012.000.00 0.00 > 0.00 0.000.00 0.000.000.006.28 85.20 > > So in short it looks good to me. My results are similar to yours, also I realized a bug while working on your benchmarking cases. I will share the cause and the fix soon. -- Regards, Nazir Bilal Yavuz Microsoft
Re: Use streaming read API in ANALYZE
On 03/04/2024 13:31, Nazir Bilal Yavuz wrote: Streaming API has been committed but the committed version has a minor change, the read_stream_begin_relation function takes Relation instead of BufferManagerRelation now. So, here is a v5 which addresses this change. I'm getting a repeatable segfault / assertion failure with this: postgres=# CREATE TABLE tengiga (i int, filler text) with (fillfactor=10); CREATE TABLE postgres=# insert into tengiga select g, repeat('x', 900) from generate_series(1, 140) g; INSERT 0 140 postgres=# set default_statistics_target = 10; ANALYZE tengiga; SET ANALYZE postgres=# set default_statistics_target = 100; ANALYZE tengiga; SET ANALYZE postgres=# set default_statistics_target =1000; ANALYZE tengiga; SET server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. TRAP: failed Assert("BufferIsValid(hscan->rs_cbuf)"), File: "heapam_handler.c", Line: 1079, PID: 262232 postgres: heikki postgres [local] ANALYZE(ExceptionalCondition+0xa8)[0x56488a0de9d8] postgres: heikki postgres [local] ANALYZE(heapam_scan_analyze_next_block+0x63)[0x5648899ece34] postgres: heikki postgres [local] ANALYZE(+0x2d3f34)[0x564889b6af34] postgres: heikki postgres [local] ANALYZE(+0x2d2a3a)[0x564889b69a3a] postgres: heikki postgres [local] ANALYZE(analyze_rel+0x33e)[0x564889b68fa9] postgres: heikki postgres [local] ANALYZE(vacuum+0x4b3)[0x564889c2dcc0] postgres: heikki postgres [local] ANALYZE(ExecVacuum+0xd6f)[0x564889c2d7fe] postgres: heikki postgres [local] ANALYZE(standard_ProcessUtility+0x901)[0x564889f0b8b9] postgres: heikki postgres [local] ANALYZE(ProcessUtility+0x136)[0x564889f0afb1] postgres: heikki postgres [local] ANALYZE(+0x6728c8)[0x564889f098c8] postgres: heikki postgres [local] ANALYZE(+0x672b3b)[0x564889f09b3b] postgres: heikki postgres [local] ANALYZE(PortalRun+0x320)[0x564889f09015] postgres: heikki postgres [local] ANALYZE(+0x66b2c6)[0x564889f022c6] postgres: heikki postgres [local] ANALYZE(PostgresMain+0x80c)[0x564889f06fd7] postgres: heikki postgres [local] ANALYZE(+0x667876)[0x564889efe876] postgres: heikki postgres [local] ANALYZE(postmaster_child_launch+0xe6)[0x564889e1f4b3] postgres: heikki postgres [local] ANALYZE(+0x58e68e)[0x564889e2568e] postgres: heikki postgres [local] ANALYZE(+0x58b7f0)[0x564889e227f0] postgres: heikki postgres [local] ANALYZE(PostmasterMain+0x152b)[0x564889e2214d] postgres: heikki postgres [local] ANALYZE(+0xb4)[0x564889cdb4b4] /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f7d83b6724a] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f7d83b67305] postgres: heikki postgres [local] ANALYZE(_start+0x21)[0x564889971a61] 2024-04-03 20:15:49.157 EEST [262101] LOG: server process (PID 262232) was terminated by signal 6: Aborted -- Heikki Linnakangas Neon (https://neon.tech)
Re: Use streaming read API in ANALYZE
Hi, On Tue, 2 Apr 2024 at 10:23, Nazir Bilal Yavuz wrote: > > v4 is rebased on top of v14 streaming read API changes. Streaming API has been committed but the committed version has a minor change, the read_stream_begin_relation function takes Relation instead of BufferManagerRelation now. So, here is a v5 which addresses this change. -- Regards, Nazir Bilal Yavuz Microsoft From 43e2f2b32e2fdb7e1fd787b1d8595768741f4792 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Wed, 3 Apr 2024 01:22:59 +0300 Subject: [PATCH v5] Use streaming read API in ANALYZE ANALYZE command gets random tuples using BlockSampler algorithm. Use streaming reads to get these tuples by using BlockSampler algorithm in streaming read API prefetch logic. --- src/include/access/heapam.h | 4 +- src/backend/access/heap/heapam_handler.c | 16 ++--- src/backend/commands/analyze.c | 85 3 files changed, 36 insertions(+), 69 deletions(-) diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index b632fe953c4..4e35caeb42e 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -25,6 +25,7 @@ #include "storage/bufpage.h" #include "storage/dsm.h" #include "storage/lockdefs.h" +#include "storage/read_stream.h" #include "storage/shm_toc.h" #include "utils/relcache.h" #include "utils/snapshot.h" @@ -374,8 +375,7 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup, /* in heap/heapam_handler.c*/ extern void heapam_scan_analyze_next_block(TableScanDesc scan, - BlockNumber blockno, - BufferAccessStrategy bstrategy); + ReadStream *stream); extern bool heapam_scan_analyze_next_tuple(TableScanDesc scan, TransactionId OldestXmin, double *liverows, double *deadrows, diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index c86000d245b..0533d9660c4 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -1054,16 +1054,15 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap, } /* - * Prepare to analyze block `blockno` of `scan`. The scan has been started - * with SO_TYPE_ANALYZE option. + * Prepare to analyze block returned from streaming object. The scan has been + * started with SO_TYPE_ANALYZE option. * * This routine holds a buffer pin and lock on the heap page. They are held * until heapam_scan_analyze_next_tuple() returns false. That is until all the * items of the heap page are analyzed. */ void -heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno, - BufferAccessStrategy bstrategy) +heapam_scan_analyze_next_block(TableScanDesc scan, ReadStream *stream) { HeapScanDesc hscan = (HeapScanDesc) scan; @@ -1076,11 +1075,12 @@ heapam_scan_analyze_next_block(TableScanDesc scan, BlockNumber blockno, * doing much work per tuple, the extra lock traffic is probably better * avoided. 
*/ - hscan->rs_cblock = blockno; - hscan->rs_cindex = FirstOffsetNumber; - hscan->rs_cbuf = ReadBufferExtended(scan->rs_rd, MAIN_FORKNUM, - blockno, RBM_NORMAL, bstrategy); + hscan->rs_cbuf = read_stream_next_buffer(stream, NULL); + Assert(BufferIsValid(hscan->rs_cbuf)); LockBuffer(hscan->rs_cbuf, BUFFER_LOCK_SHARE); + + hscan->rs_cblock = BufferGetBlockNumber(hscan->rs_cbuf); + hscan->rs_cindex = FirstOffsetNumber; } /* diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 2fb39f3ede1..105285c3ea2 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -1102,6 +1102,20 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr) return stats; } +/* + * Prefetch callback function to get next block number while using + * BlockSampling algorithm + */ +static BlockNumber +block_sampling_streaming_read_next(ReadStream *stream, + void *user_data, + void *per_buffer_data) +{ + BlockSamplerData *bs = user_data; + + return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs) : InvalidBlockNumber; +} + /* * acquire_sample_rows -- acquire a random sample of rows from the heap * @@ -1154,10 +1168,7 @@ acquire_sample_rows(Relation onerel, int elevel, TableScanDesc scan; BlockNumber nblocks; BlockNumber blksdone = 0; -#ifdef USE_PREFETCH - int prefetch_maximum = 0; /* blocks to prefetch if enabled */ - BlockSamplerData prefetch_bs; -#endif + ReadStream *stream; Assert(targrows > 0); @@ -1170,13 +1181,6 @@ acquire_sample_rows(Relation onerel, int elevel, randseed = pg_prng_uint32(&pg_global_prng_state); nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed); -#ifdef USE_PREFETCH - prefetch_maximum = get_tablespace_maintenance_io_concurrency(onerel->rd_rel->reltablespace); - /* Create another BlockSam
Re: Use streaming read API in ANALYZE
On Tue, Apr 2, 2024 at 9:24 AM Nazir Bilal Yavuz wrote: [..] > v4 is rebased on top of v14 streaming read API changes. Hi Nazir, so with streaming API committed, I gave a try to this patch. With autovacuum=off and 30GB table on NVMe (with standard readahead of 256kb and ext4, Debian 12, kernel 6.1.0, shared_buffers = 128MB default) created using: create table t as select repeat('a', 100) || i || repeat('b', 500) as filler from generate_series(1, 4500) as i; on master, effect of mainteance_io_concurency [default 10] is like that (when resetting the fs cache after each ANALYZE): m_io_c = 0: Time: 3137.914 ms (00:03.138) Time: 3094.540 ms (00:03.095) Time: 3452.513 ms (00:03.453) m_io_c = 1: Time: 2972.751 ms (00:02.973) Time: 2939.551 ms (00:02.940) Time: 2904.428 ms (00:02.904) m_io_c = 2: Time: 1580.260 ms (00:01.580) Time: 1572.132 ms (00:01.572) Time: 1558.334 ms (00:01.558) m_io_c = 4: Time: 938.304 ms Time: 931.772 ms Time: 920.044 ms m_io_c = 8: Time: 666.025 ms Time: 660.241 ms Time: 648.848 ms m_io_c = 16: Time: 542.450 ms Time: 561.155 ms Time: 539.683 ms m_io_c = 32: Time: 538.487 ms Time: 541.705 ms Time: 538.101 ms with patch applied: m_io_c = 0: Time: 3106.469 ms (00:03.106) Time: 3140.343 ms (00:03.140) Time: 3044.133 ms (00:03.044) m_io_c = 1: Time: 2959.817 ms (00:02.960) Time: 2920.265 ms (00:02.920) Time: 2911.745 ms (00:02.912) m_io_c = 2: Time: 1581.912 ms (00:01.582) Time: 1561.444 ms (00:01.561) Time: 1558.251 ms (00:01.558) m_io_c = 4: Time: 908.116 ms Time: 901.245 ms Time: 901.071 ms m_io_c = 8: Time: 619.870 ms Time: 620.327 ms Time: 614.266 ms m_io_c = 16: Time: 529.885 ms Time: 526.958 ms Time: 528.474 ms m_io_c = 32: Time: 521.185 ms Time: 520.713 ms Time: 517.729 ms No difference to me, which seems to be good. I've double checked and patch used the new way acquire_sample_rows -> heapam_scan_analyze_next_block -> ReadBufferExtended -> ReadBuffer_common (inlined) -> WaitReadBuffers -> mdreadv -> FileReadV -> pg_preadv (inlined) acquire_sample_rows -> heapam_scan_analyze_next_block -> ReadBufferExtended -> ReadBuffer_common (inlined) -> StartReadBuffer -> ... I gave also io_combine_limit to 32 (max, 256kB) a try and got those slightly better results: [..] m_io_c = 16: Time: 494.599 ms Time: 496.345 ms Time: 973.500 ms m_io_c = 32: Time: 461.031 ms Time: 449.037 ms Time: 443.375 ms and that (last one) apparently was able to push it to ~50-60k still random IOPS range, the rareq-sz was still ~8 (9.9) kB as analyze was still reading random , so I assume no merging was done: Devicer/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util nvme0n1 61212.00591.82 0.00 0.000.10 9.90 2.00 0.02 0.00 0.000.0012.000.00 0.00 0.00 0.000.00 0.000.000.006.28 85.20 So in short it looks good to me. -Jakub Wartak.
Re: Use streaming read API in ANALYZE
Hi, Thanks for the review! On Wed, 27 Mar 2024 at 23:15, Melanie Plageman wrote: > > On Tue, Mar 26, 2024 at 02:51:27PM +0300, Nazir Bilal Yavuz wrote: > > Hi, > > > > On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz wrote: > > > > > > > > > The new version of the streaming read API [1] is posted. I updated the > > > streaming read API changes patch (0001), using the streaming read API > > > in ANALYZE patch (0002) remains the same. This should make it easier > > > to review as it can be applied on top of master > > > > > > > > > > The new version of the streaming read API is posted [1]. I rebased the > > patch on top of master and v9 of the streaming read API. > > > > There is a minimal change in the 'using the streaming read API in ANALYZE > > patch (0002)', I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE > > to copy exactly the same behavior as before. Also, some benchmarking > > results: > > > > I created a 22 GB table and set the size of shared buffers to 30GB, the > > rest is default. > > > > ╔═══╦═╦╗ > > ║ ║ Avg Timings in ms ║║ > > ╠═══╬══╦══╬╣ > > ║ ║ master ║ patched ║ percentage ║ > > ╠═══╬══╬══╬╣ > > ║ Both OS cache and ║ ║ ║║ > > ║ shared buffers are clear ║ 513.9247 ║ 463.1019 ║%9.9║ > > ╠═══╬══╬══╬╣ > > ║ OS cache is loaded but ║ ║ ║║ > > ║ shared buffers are clear ║ 423.1097 ║ 354.3277 ║%16.3 ║ > > ╠═══╬══╬══╬╣ > > ║ Shared buffers are loaded ║ ║ ║║ > > ║ ║ 89.2846 ║ 84.6952 ║%5.1║ > > ╚═══╩══╩══╩╝ > > > > Any kind of feedback would be appreciated. > > Thanks for working on this! > > A general review comment: I noticed you have the old streaming read > (pgsr) naming still in a few places (including comments) -- so I would > just make sure and update everywhere when you rebase in Thomas' latest > version of the read stream API. Done. > > > From c7500cc1b9068ff0b704181442999cd8bed58658 Mon Sep 17 00:00:00 2001 > > From: Nazir Bilal Yavuz > > Date: Mon, 19 Feb 2024 14:30:47 +0300 > > Subject: [PATCH v3 2/2] Use streaming read API in ANALYZE > > > > --- a/src/backend/commands/analyze.c > > +++ b/src/backend/commands/analyze.c > > @@ -1102,6 +1102,26 @@ examine_attribute(Relation onerel, int attnum, Node > > *index_expr) > > return stats; > > } > > > > +/* > > + * Prefetch callback function to get next block number while using > > + * BlockSampling algorithm > > + */ > > +static BlockNumber > > +pg_block_sampling_streaming_read_next(StreamingRead *stream, > > + > > void *user_data, > > + > > void *per_buffer_data) > > I don't think you need the pg_ prefix Done. > > > +{ > > + BlockSamplerData *bs = user_data; > > + BlockNumber *current_block = per_buffer_data; > > Why can't you just do BufferGetBlockNumber() on the buffer returned from > the read stream API instead of allocating per_buffer_data for the block > number? Done. > > > + > > + if (BlockSampler_HasMore(bs)) > > + *current_block = BlockSampler_Next(bs); > > + else > > + *current_block = InvalidBlockNumber; > > + > > + return *current_block; > > > I think we'd like to keep the read stream code in heapam-specific code. > Instead of doing streaming_read_buffer_begin() here, you could put this > in heap_beginscan() or initscan() guarded by > scan->rs_base.rs_flags & SO_TYPE_ANALYZE In the recent changes [1], heapam_scan_analyze_next_[block | tuple] are removed from tableam. They are directly called from heapam-specific code now. So, IMO, no need to do this now. v4 is rebased on top of v14 streaming read API changes. 
[1] 27bc1772fc814946918a5ac8ccb9b5c5ad0380aa

--
Regards,
Nazir Bilal Yavuz
Microsoft
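To make the "Done" above concrete: after dropping the per-buffer data, the
callback can collapse to something like the sketch below. The function name
is simply the v3 name minus the pg_ prefix, and the caller is assumed to
use BufferGetBlockNumber() on the returned buffers when it still needs the
block number; this is a sketch of the review outcome, not a quote from the
v4 patch:

    static BlockNumber
    block_sampling_streaming_read_next(StreamingRead *stream,
                                       void *user_data,
                                       void *per_buffer_data)
    {
        BlockSamplerData *bs = user_data;

        /* Hand back the next sampled block, or end the stream. */
        return BlockSampler_HasMore(bs) ? BlockSampler_Next(bs)
                                        : InvalidBlockNumber;
    }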
Re: Use streaming read API in ANALYZE
On Tue, Mar 26, 2024 at 02:51:27PM +0300, Nazir Bilal Yavuz wrote:
> Hi,
>
> On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz wrote:
> >
> > The new version of the streaming read API [1] is posted. I updated the
> > streaming read API changes patch (0001); the patch using the streaming
> > read API in ANALYZE (0002) remains the same. This should make it easier
> > to review, as it can be applied on top of master.
> >
> The new version of the streaming read API is posted [1]. I rebased the
> patch on top of master and v9 of the streaming read API.
>
> There is a minimal change in the 'using the streaming read API in ANALYZE'
> patch (0002): I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
> to copy exactly the same behavior as before. Also, some benchmarking
> results:
>
> I created a 22 GB table and set the size of shared buffers to 30GB; the
> rest is default.
>
> (Avg Timings in ms)
> ╔════════════════════════════╦══════════╦══════════╦════════════╗
> ║                            ║  master  ║ patched  ║ percentage ║
> ╠════════════════════════════╬══════════╬══════════╬════════════╣
> ║ Both OS cache and          ║          ║          ║            ║
> ║ shared buffers are clear   ║ 513.9247 ║ 463.1019 ║    9.9%    ║
> ╠════════════════════════════╬══════════╬══════════╬════════════╣
> ║ OS cache is loaded but     ║          ║          ║            ║
> ║ shared buffers are clear   ║ 423.1097 ║ 354.3277 ║   16.3%    ║
> ╠════════════════════════════╬══════════╬══════════╬════════════╣
> ║ Shared buffers are loaded  ║  89.2846 ║  84.6952 ║    5.1%    ║
> ╚════════════════════════════╩══════════╩══════════╩════════════╝
>
> Any kind of feedback would be appreciated.

Thanks for working on this!

A general review comment: I noticed you have the old streaming read
(pgsr) naming still in a few places (including comments) -- so I would
just make sure and update everywhere when you rebase in Thomas' latest
version of the read stream API.

> From c7500cc1b9068ff0b704181442999cd8bed58658 Mon Sep 17 00:00:00 2001
> From: Nazir Bilal Yavuz
> Date: Mon, 19 Feb 2024 14:30:47 +0300
> Subject: [PATCH v3 2/2] Use streaming read API in ANALYZE
>
> --- a/src/backend/commands/analyze.c
> +++ b/src/backend/commands/analyze.c
> @@ -1102,6 +1102,26 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
>  	return stats;
>  }
>
> +/*
> + * Prefetch callback function to get next block number while using
> + * BlockSampling algorithm
> + */
> +static BlockNumber
> +pg_block_sampling_streaming_read_next(StreamingRead *stream,
> +									  void *user_data,
> +									  void *per_buffer_data)

I don't think you need the pg_ prefix

> +{
> +	BlockSamplerData *bs = user_data;
> +	BlockNumber *current_block = per_buffer_data;

Why can't you just do BufferGetBlockNumber() on the buffer returned from
the read stream API instead of allocating per_buffer_data for the block
number?

> +
> +	if (BlockSampler_HasMore(bs))
> +		*current_block = BlockSampler_Next(bs);
> +	else
> +		*current_block = InvalidBlockNumber;
> +
> +	return *current_block;

I think we'd like to keep the read stream code in heapam-specific code.
Instead of doing streaming_read_buffer_begin() here, you could put this
in heap_beginscan() or initscan() guarded by
scan->rs_base.rs_flags & SO_TYPE_ANALYZE

Same with streaming_read_buffer_end()/heap_endscan(). You'd also then need
to save the reference to the read stream in the HeapScanDescData.

> +	stream = streaming_read_buffer_begin(STREAMING_READ_MAINTENANCE,
> +										 vac_strategy,
> +										 BMR_REL(scan->rs_rd),
> +										 MAIN_FORKNUM,
> +										 pg_block_sampling_streaming_read_next,
> +										 &bs,
> +										 sizeof(BlockSamplerData));
>
>  	/* Outer loop over blocks to sample */

In fact, I think you could use this opportunity to get rid of the block
dependency
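For readers skimming the thread, the shape being suggested above would look
roughly like the sketch below. Only SO_TYPE_ANALYZE, BMR_REL() and the
streaming_read_buffer_begin()/streaming_read_buffer_end() calls come from
the quoted code; the rs_read_stream and rs_analyze_bs field names are
invented here for illustration and do not appear in any posted patch:

    /* Hypothetical: stream setup moved into heapam's initscan(). */
    if (scan->rs_base.rs_flags & SO_TYPE_ANALYZE)
        scan->rs_read_stream =
            streaming_read_buffer_begin(STREAMING_READ_MAINTENANCE,
                                        scan->rs_strategy,
                                        BMR_REL(scan->rs_base.rs_rd),
                                        MAIN_FORKNUM,
                                        block_sampling_streaming_read_next,
                                        &scan->rs_analyze_bs,
                                        sizeof(BlockSamplerData));

    /* ... with the matching teardown in heap_endscan(): */
    if (scan->rs_read_stream)
        streaming_read_buffer_end(scan->rs_read_stream);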
Re: Use streaming read API in ANALYZE
Hi,

On Wed, 28 Feb 2024 at 14:42, Nazir Bilal Yavuz wrote:
>
> The new version of the streaming read API [1] is posted. I updated the
> streaming read API changes patch (0001); the patch using the streaming
> read API in ANALYZE (0002) remains the same. This should make it easier
> to review, as it can be applied on top of master.
>

The new version of the streaming read API is posted [1]. I rebased the
patch on top of master and v9 of the streaming read API.

There is a minimal change in the 'using the streaming read API in ANALYZE'
patch (0002): I changed STREAMING_READ_FULL to STREAMING_READ_MAINTENANCE
to copy exactly the same behavior as before. Also, some benchmarking
results:

I created a 22 GB table and set the size of shared buffers to 30GB; the
rest is default.

(Avg Timings in ms)
╔════════════════════════════╦══════════╦══════════╦════════════╗
║                            ║  master  ║ patched  ║ percentage ║
╠════════════════════════════╬══════════╬══════════╬════════════╣
║ Both OS cache and          ║          ║          ║            ║
║ shared buffers are clear   ║ 513.9247 ║ 463.1019 ║    9.9%    ║
╠════════════════════════════╬══════════╬══════════╬════════════╣
║ OS cache is loaded but     ║          ║          ║            ║
║ shared buffers are clear   ║ 423.1097 ║ 354.3277 ║   16.3%    ║
╠════════════════════════════╬══════════╬══════════╬════════════╣
║ Shared buffers are loaded  ║  89.2846 ║  84.6952 ║    5.1%    ║
╚════════════════════════════╩══════════╩══════════╩════════════╝

Any kind of feedback would be appreciated.

[1]: https://www.postgresql.org/message-id/CA%2BhUKGL-ONQnnnp-SONCFfLJzqcpAheuzZ%2B-yTrD9WBM-GmAcg%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

From 7278c354d80e4d1f21c6fa0d810a723789f2d722 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v3 1/2] Streaming read API changes that are not committed to
 master yet

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
 src/include/storage/bufmgr.h                  |  45 +-
 src/include/storage/streaming_read.h          |  50 ++
 src/backend/storage/Makefile                  |   2 +-
 src/backend/storage/aio/Makefile              |  14 +
 src/backend/storage/aio/meson.build           |   5 +
 src/backend/storage/aio/streaming_read.c      | 678 +
 src/backend/storage/buffer/bufmgr.c           | 706 --
 src/backend/storage/buffer/localbuf.c         |  14 +-
 src/backend/storage/meson.build               |   1 +
 src/backend/utils/misc/guc_tables.c           |  14 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 doc/src/sgml/config.sgml                      |  14 +
 src/tools/pgindent/typedefs.list              |   2 +
 13 files changed, 1309 insertions(+), 237 deletions(-)
 create mode 100644 src/include/storage/streaming_read.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..1cc198bde21 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -133,6 +134,10 @@ extern PGDLLIMPORT bool track_io_timing;
 extern PGDLLIMPORT int effective_io_concurrency;
 extern PGDLLIMPORT int maintenance_io_concurrency;
 
+#define MAX_BUFFER_IO_SIZE PG_IOV_MAX
+#define DEFAULT_BUFFER_IO_SIZE Min(MAX_BUFFER_IO_SIZE, (128 * 1024) / BLCKSZ)
+extern PGDLLIMPORT int buffer_io_size;
+
 extern PGDLLIMPORT int checkpoint_flush_after;
 extern PGDLLIMPORT int backend_flush_after;
 extern PGDLLIMPORT int bgwriter_flush_after;
@@ -158,7 +163,6 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
-
 /*
  * prototypes for functions in bufmgr.c
  */
@@ -177,6 +181,42 @@ extern Buffer
 ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
                           BlockNumber blockNum, ReadBufferMode mode,
                           BufferAccessStrategy strategy, bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers(). Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int16		nblocks;
+	BufferAccessStra
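To make the two-step protocol concrete, here is a rough sketch of how a
caller might drive it. StartReadBuffers() and WaitReadBuffers() are named
by the comment above, but their signatures are not visible in this excerpt,
so every parameter below is inferred from the struct fields and should be
read as an assumption rather than as the patch's actual API:

    /* Hypothetical caller of the two-step read API (signatures inferred). */
    ReadBuffersOperation operation;
    Buffer      buffers[8];
    int16       nblocks = 8;

    /* Phase 1: pin buffers and start a (possibly vectored) read. */
    if (StartReadBuffers(bmr, buffers, MAIN_FORKNUM, blocknum, &nblocks,
                         strategy, READ_BUFFERS_ISSUE_ADVICE, &operation))
    {
        /* Phase 2: wait for that I/O before touching the pages. */
        WaitReadBuffers(&operation);
    }

Separating start from wait is what lets a streaming read keep several such
operations in flight, issuing advice (READ_BUFFERS_ISSUE_ADVICE) for blocks
it will need soon instead of paying one synchronous pread per block.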
Re: Use streaming read API in ANALYZE
Hi,

On Mon, 19 Feb 2024 at 18:13, Nazir Bilal Yavuz wrote:
>
> I worked on using the currently proposed streaming read API [1] in ANALYZE.
> The patch is attached. 0001 contains the not-yet-merged streaming read API
> changes that can be applied to master; 0002 is the actual code.
>
> The blocks to analyze are obtained by using the streaming read API now.
>
> - Since the streaming read API is already doing prefetch, I removed the
>   #ifdef USE_PREFETCH code from acquire_sample_rows().
>
> - Changed 'while (BlockSampler_HasMore(&bs))' to 'while (nblocks)' because
>   the prefetch mechanism in the streaming read API will advance 'bs' before
>   returning buffers.
>
> - Removed BlockNumber and BufferAccessStrategy from the declaration of
>   scan_analyze_next_block(), passing pgsr (PgStreamingRead) instead of them.
>
> I counted syscalls while analyzing a ~5GB table. It can be seen that the
> patched version did ~1300 fewer read calls:
>
> Patched:
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  39.67    0.012128           0     29809           pwrite64
>  36.96    0.011299           0     28594           pread64
>  23.24    0.007104           0     27611           fadvise64
>
> Master (21a71648d3):
>
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  38.94    0.016457           0     29816           pwrite64
>  36.79    0.015549           0     29850           pread64
>  23.91    0.010106           0     29848           fadvise64
>
> Any kind of feedback would be appreciated.
>
> [1]: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com

The new version of the streaming read API [1] is posted. I updated the
streaming read API changes patch (0001); the patch using the streaming read
API in ANALYZE (0002) remains the same. This should make it easier to
review, as it can be applied on top of master.

[1]: https://www.postgresql.org/message-id/CA%2BhUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ%40mail.gmail.com

--
Regards,
Nazir Bilal Yavuz
Microsoft

From 21d9043501284c6bae996522ff2f3ac693f81986 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 26 Feb 2024 23:48:31 +1300
Subject: [PATCH v2 1/2] Streaming read API changes that are not committed to
 master yet

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
---
 src/include/storage/bufmgr.h             |  45 ++
 src/include/storage/streaming_read.h     |  52 ++
 src/backend/storage/Makefile             |   2 +-
 src/backend/storage/aio/Makefile         |  14 +
 src/backend/storage/aio/meson.build      |   5 +
 src/backend/storage/aio/streaming_read.c | 612
 src/backend/storage/buffer/bufmgr.c      | 687 +++
 src/backend/storage/buffer/localbuf.c    |  14 +-
 src/backend/storage/meson.build          |   1 +
 src/tools/pgindent/typedefs.list         |   3 +
 10 files changed, 1202 insertions(+), 233 deletions(-)
 create mode 100644 src/include/storage/streaming_read.h
 create mode 100644 src/backend/storage/aio/Makefile
 create mode 100644 src/backend/storage/aio/meson.build
 create mode 100644 src/backend/storage/aio/streaming_read.c

diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d51d46d3353..b57f71f97e3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -14,6 +14,7 @@
 #ifndef BUFMGR_H
 #define BUFMGR_H
 
+#include "port/pg_iovec.h"
 #include "storage/block.h"
 #include "storage/buf.h"
 #include "storage/bufpage.h"
@@ -158,6 +159,11 @@ extern PGDLLIMPORT int32 *LocalRefCount;
 #define BUFFER_LOCK_SHARE		1
 #define BUFFER_LOCK_EXCLUSIVE	2
 
+/*
+ * Maximum number of buffers for multi-buffer I/O functions. This is set to
+ * allow 128kB transfers, unless BLCKSZ and IOV_MAX imply a smaller maximum.
+ */
+#define MAX_BUFFERS_PER_TRANSFER Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
 
 /*
  * prototypes for functions in bufmgr.c
@@ -177,6 +183,42 @@ extern Buffer
 ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
                           BlockNumber blockNum, ReadBufferMode mode,
                           BufferAccessStrategy strategy, bool permanent);
+
+#define READ_BUFFERS_ZERO_ON_ERROR 0x01
+#define READ_BUFFERS_ISSUE_ADVICE 0x02
+
+/*
+ * Private state used by StartReadBuffers() and WaitReadBuffers(). Declared
+ * in public header only to allow inclusion in other structs, but contents
+ * should not be accessed.
+ */
+struct ReadBuffersOperation
+{
+	/* Parameters passed in to StartReadBuffers(). */
+	BufferManagerRelation bmr;
+	Buffer	   *buffers;
+	ForkNumber	forknum;
+	BlockNumber blocknum;
+	int			nblocks;
+	BufferAccessStrategy strategy;
+	int			flags;
+
+	/* Range of buffers, if we need to perform a read. */
+	int			io_buffers_len;
+};
+
+ty
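As a quick sanity check on the MAX_BUFFERS_PER_TRANSFER arithmetic above,
here is a worked example; BLCKSZ = 8192 and PG_IOV_MAX = 32 are
build-dependent assumptions, not values stated in the patch:

    /*
     * MAX_BUFFERS_PER_TRANSFER = Min(PG_IOV_MAX, (128 * 1024) / BLCKSZ)
     *
     * With BLCKSZ = 8192 and PG_IOV_MAX = 32 (assumed):
     *   (128 * 1024) / 8192 = 16
     *   Min(32, 16)         = 16 buffers per transfer
     * i.e. one multi-buffer I/O can cover sixteen 8kB blocks = 128kB.
     *
     * With BLCKSZ = 32768 the cap would instead be 128kB / 32kB = 4.
     */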
Use streaming read API in ANALYZE
lags without locking. This is racy, but it's OK to
+	 * return false spuriously: when CompleteReadBuffers() calls
+	 * StartBufferIO(), it'll see that it's now valid.
 	 *
 	 * Note: We deliberately avoid a Valgrind client request here.
 	 * Individual access methods can optionally superimpose buffer page
@@ -2382,7 +2504,7 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 		 * that the buffer page is legitimately non-accessible here. We
 		 * cannot meddle with that.
 		 */
-		result = true;
+		result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
 	}
 
 	ref->refcount++;
@@ -4833,6 +4955,46 @@ ConditionalLockBuffer(Buffer buffer)
 									LW_EXCLUSIVE);
 }
 
+/*
+ * Zero a buffer, and lock it as RBM_ZERO_AND_LOCK or
+ * RBM_ZERO_AND_CLEANUP_LOCK would. The buffer must be already pinned. It
+ * does not have to be valid, but it is valid and locked on return.
+ */
+void
+ZeroBuffer(Buffer buffer, ReadBufferMode mode)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	Assert(mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+
+	if (BufferIsLocal(buffer))
+		bufHdr = GetLocalBufferDescriptor(-buffer - 1);
+	else
+	{
+		bufHdr = GetBufferDescriptor(buffer - 1);
+		if (mode == RBM_ZERO_AND_LOCK)
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		else
+			LockBufferForCleanup(buffer);
+	}
+
+	memset(BufferGetPage(buffer), 0, BLCKSZ);
+
+	if (BufferIsLocal(buffer))
+	{
+		buf_state = pg_atomic_read_u32(&bufHdr->state);
+		buf_state |= BM_VALID;
+		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+	}
+	else
+	{
+		buf_state = LockBufHdr(bufHdr);
+		buf_state |= BM_VALID;
+		UnlockBufHdr(bufHdr, buf_state);
+	}
+}
+
 /*
  * Verify that this backend is pinning the buffer exactly once.
  *
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 1f02fed250e..6956d4e5b49 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -109,10 +109,9 @@ PrefetchLocalBuffer(SMgrRelation smgr, ForkNumber forkNum,
  * LocalBufferAlloc -
  *	  Find or create a local buffer for the given page of the given relation.
  *
- * API is similar to bufmgr.c's BufferAlloc, except that we do not need
- * to do any locking since this is all local. Also, IO_IN_PROGRESS
- * does not get set. Lastly, we support only default access strategy
- * (hence, usage_count is always advanced).
+ * API is similar to bufmgr.c's BufferAlloc, except that we do not need to do
+ * any locking since this is all local. We support only default access
+ * strategy (hence, usage_count is always advanced).
  */
 BufferDesc *
 LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
@@ -288,7 +287,7 @@ GetLocalVictimBuffer(void)
 }
 
 /* see LimitAdditionalPins() */
-static void
+void
 LimitAdditionalLocalPins(uint32 *additional_pins)
 {
 	uint32		max_pins;
@@ -298,9 +297,10 @@ LimitAdditionalLocalPins(uint32 *additional_pins)
 
 	/*
 	 * In contrast to LimitAdditionalPins() other backends don't play a role
-	 * here. We can allow up to NLocBuffer pins in total.
+	 * here. We can allow up to NLocBuffer pins in total, but it might not be
+	 * initialized yet so read num_temp_buffers.
 	 */
-	max_pins = (NLocBuffer - NLocalPinnedBuffers);
+	max_pins = (num_temp_buffers - NLocalPinnedBuffers);
 
 	if (*additional_pins >= max_pins)
 		*additional_pins = max_pins;
diff --git a/src/backend/storage/meson.build b/src/backend/storage/meson.build
index 40345bdca27..739d13293fb 100644
--- a/src/backend/storage/meson.build
+++ b/src/backend/storage/meson.build
@@ -1,5 +1,6 @@
 # Copyright (c) 2022-2024, PostgreSQL Global Development Group
 
+subdir('aio')
 subdir('buffer')
 subdir('file')
 subdir('freespace')
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d808aad8b05..35657008761 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2094,6 +2094,8 @@ PgStat_TableCounts
 PgStat_TableStatus
 PgStat_TableXactStatus
 PgStat_WalStats
+PgStreamingRead
+PgStreamingReadRange
 PgXmlErrorContext
 PgXmlStrictness
 Pg_finfo_record
-- 
2.43.0

From ead994a29826be6033c3acc8fca90a6f6cc58a30 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz
Date: Mon, 19 Feb 2024 14:30:47 +0300
Subject: [PATCH v1 2/2] Use streaming read API in ANALYZE

ANALYZE command gets random tuples using BlockSampler algorithm. Use
streaming reads to get these tuples by using BlockSampler algorithm in
streaming read API prefetch logic.
---
 src/include/access/tableam.h             | 16 ++--
 src/backend/access/heap/heapam_handler.c | 11 +--
 src/backend/commands/analyze.c           | 97
 3 files changed, 45 insertions(+), 79 deletions(-)

diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 5f8474871d2..7e6e99ba71d 100644
--- a/src/include/access/tableam.h
+++ b