Re: "ERROR: could not open relation with OID 16391" error was encountered when reindexing

2024-01-16 Thread feichanghong
> For this specific job, I have always wanted a try_index_open() that
> would attempt to open the index with a relkind check, perhaps we could
> introduce one and reuse it here?

Yes, replacing index_open with try_index_open solves the problem. The
idea is similar to my initial report of "after successfully opening the
heap table in reindex_index, check again whether the index still exists".
But introducing try_index_open seems better.
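
Since the attachment below shows up only as binary data in the archive, here is a minimal sketch of the shape such a helper could take, modeled on the existing try_relation_open()/try_table_open(); the attached patch may differ in detail:

```c
/*
 * try_index_open - like index_open(), but return NULL instead of
 * raising an error if the relation does not exist.
 *
 * Sketch only (modeled on try_table_open()); not the attached patch.
 */
Relation
try_index_open(Oid relationId, LOCKMODE lockmode)
{
	Relation	r;

	r = try_relation_open(relationId, lockmode);

	if (r == NULL)
		return NULL;

	if (r->rd_rel->relkind != RELKIND_INDEX &&
		r->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
		ereport(ERROR,
				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
				 errmsg("\"%s\" is not an index",
						RelationGetRelationName(r))));

	return r;
}
```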

v1-try_index_open.patch
Description: Binary data

Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.



Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-16 Thread Kyotaro Horiguchi
At Wed, 17 Jan 2024 14:38:54 +0900, torikoshia  
wrote in 
> Hi,
> 
> Thanks for applying!
> 
> > + errmsg_plural("%zd row were skipped due to data type
> > incompatibility",
> 
> Sorry, I just noticed it, but 'were' should be 'was' here?
> 
> >> BTW I'm thinking we should add a column to pg_stat_progress_copy that
> >> counts soft errors. I'll suggest this in another thread.
> > Please do!
> 
> I've started it here:
> 
> https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff...@oss.nttdata.com

Switching topics, this commit (9e2d870119) adds the following help message:


>   "COPY { %s [ ( %s [, ...] ) ] | ( %s ) }\n"
>   "TO { '%s' | PROGRAM '%s' | STDOUT }\n"
> ...
>   "SAVE_ERROR_TO '%s'\n"
> ...
>   _("location"),

On the other hand, SAVE_ERROR_TO takes 'error' or 'none', which
indicate "immediately error out" and "just ignore the failure"
respectively, but these options hardly seem to denote a 'location',
and appear more like an 'action'. I somewhat suspect that this
parameter name was initially conceived with the assumption that it
would take file names or similar parameters. I'm not sure if others
will agree, but I think the parameter name might not be the best
choice. For instance, considering the addition of a third value
'log', something like on_error_action (error, ignore, log) would be
more intuitively understandable. What do you think?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Add PQsendSyncMessage() to libpq

2024-01-16 Thread Michael Paquier
On Tue, Jan 16, 2024 at 02:55:12PM +0100, Alvaro Herrera wrote:
> On 2024-Jan-16, Michael Paquier wrote:
>> How about the attached to remove all that, then?
> 
> Looks good, thank you.

Thanks for double-checking.  Done.
--
Michael


signature.asc
Description: PGP signature


RE: speed up a logical replica setup

2024-01-16 Thread Hayato Kuroda (Fujitsu)
Dear Euler, hackers,

I found that some bugs reported by Shlok had not been fixed, so I
made a top-up patch. 0001 is unchanged, and 0002 contains the following:

* Add a timeout option for the recovery step, per [1]. The code was basically
ported from pg_ctl.c.
* Reject the operation if the target server is not a standby, per [2].
* Raise a FATAL error if --subscriber-conninfo specifies a non-local server, per [3]
  (not sure it is really needed, so feel free to reject that part).

Feel free to merge parts of 0002 if they look good to you.
Thanks to Shlok for writing part of the patch.

[1]: 
https://www.postgresql.org/message-id/CANhcyEUCt-g4JLQU3Q3ofFk_Vt-Tqh3ZdXoLcpT8fjz9LY_-ww%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CANhcyEUCt-g4JLQU3Q3ofFk_Vt-Tqh3ZdXoLcpT8fjz9LY_-ww%40mail.gmail.com
[3]: 
https://www.postgresql.org/message-id/TY3PR01MB98895BA6C1D72CB8582CACC4F5682%40TY3PR01MB9889.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



v20240117-0001-Creates-a-new-logical-replica-from-a-stand.patch
Description:  v20240117-0001-Creates-a-new-logical-replica-from-a-stand.patch


v20240117-0002-Address-some-comments-proposed-on-hackers.patch
Description:  v20240117-0002-Address-some-comments-proposed-on-hackers.patch


Re: [PATCH] Add additional extended protocol commands to psql: \parse and \bindx

2024-01-16 Thread Michael Paquier
On Tue, Jan 16, 2024 at 10:37:22AM +0100, Jelte Fennema-Nio wrote:
> I do realize the same is true for plain \bind, but it seems
> like a bug there too.

Hmm.  ignore_slash_options() is used to distinguish between active
and inactive branches with \if.  I was playing a bit with psql.sql,
but I don't really see a difference when, for example, adding some
\bind commands (say a valid SELECT $1 \bind 4) inside the big "\if
false" block that exercises all the command types (see "vars and
backticks").

Perhaps I am missing a trick?
--
Michael


signature.asc
Description: PGP signature


Re: Change GUC hashtable to use simplehash?

2024-01-16 Thread John Naylor
I spent some time rewriting the comments and a couple other cosmetic
changes, and squashed into two patches: the second one has the
optimized string hashing. They each still have just one demo use case.
It looks pretty close to committable, but I'll leave this up for a few
days in case anyone wants to have another look.

After this first step is out of the way, we can look into using this
more widely, including dynahash and the GUC hash.
From cff2bfda6a3067936ef17162a2db2609185afb24 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 16 Jan 2024 16:32:48 +0700
Subject: [PATCH v14 2/2] Add optimized C string hashing

Given an already-initialized hash state and a NUL-terminated string,
accumulate the hash of the string into the hash state and return the
length for the caller to (optionally) save for the finalizer. This
avoids a strlen call.

If the string pointer is aligned, we can use a word-at-a-time
algorithm for NUL lookahead. The aligned case is only used on 64-bit
platforms, since it's not worth the extra complexity for 32-bit.

As demonstration, use this in the search path cache. This brings the
general case performance closer to the optimization done in commit
a86c61c9ee.

There are other places that could benefit from this, but that is left
for future work.

Handling the tail of the string after finishing the word-wise loop
was inspired by NetBSD's strlen(), but no code was taken since that
is written in assembly language.

Jeff Davis and John Naylor

Discussion: https://postgr.es/m/3820f030fd008ff14134b3e9ce5cc6dd623ed479.camel%40j-davis.com
Discussion: https://postgr.es/m/b40292c99e623defe5eadedab1d438cf51a4107c.camel%40j-davis.com
---
 src/backend/catalog/namespace.c  |  20 +++--
 src/include/common/hashfn_unstable.h | 130 +++
 2 files changed, 145 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index eecc50a958..b610aa6242 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -41,7 +41,7 @@
 #include "catalog/pg_ts_template.h"
 #include "catalog/pg_type.h"
 #include "commands/dbcommands.h"
-#include "common/hashfn.h"
+#include "common/hashfn_unstable.h"
 #include "funcapi.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
@@ -253,11 +253,21 @@ static bool MatchNamedCall(HeapTuple proctup, int nargs, List *argnames,
 static inline uint32
 spcachekey_hash(SearchPathCacheKey key)
 {
-	const unsigned char *bytes = (const unsigned char *) key.searchPath;
-	int			blen = strlen(key.searchPath);
+	fasthash_state hs;
+	int			sp_len;
 
-	return hash_combine(hash_bytes(bytes, blen),
-		hash_uint32(key.roleid));
+	fasthash_init(&hs, FH_UNKNOWN_LENGTH, 0);
+
+	hs.accum = key.roleid;
+	fasthash_combine(&hs);
+
+	/*
+	 * Combine search path into the hash and save the length for tweaking the
+	 * final mix.
+	 */
+	sp_len = fasthash_accum_cstring(&hs, key.searchPath);
+
+	return fasthash_final32(&hs, sp_len);
 }
 
 static inline bool
diff --git a/src/include/common/hashfn_unstable.h b/src/include/common/hashfn_unstable.h
index 88fa613d4e..ff36114379 100644
--- a/src/include/common/hashfn_unstable.h
+++ b/src/include/common/hashfn_unstable.h
@@ -61,6 +61,24 @@
  * 2) Incremental interface. This can used for incorporating multiple
  * inputs. The standalone functions use this internally, so see fasthash64()
  * for an an example of how this works.
+ *
+ * The incremental interface is especially useful if any of the inputs
+ * are NUL-terminated C strings, since the length is not needed ahead
+ * of time. This avoids needing to call strlen(). This case is optimized
+ * in fasthash_accum_cstring() :
+ *
+ * fasthash_state hs;
+ * fasthash_init(&hs, FH_UNKNOWN_LENGTH, 0);
+ * len = fasthash_accum_cstring(&hs, *str);
+ * ...
+ * return fasthash_final32(&hs, len);
+ *
+ * Here we pass FH_UNKNOWN_LENGTH as a convention, since passing zero
+ * would zero out the internal seed as well. fasthash_accum_cstring()
+ * returns the length of the string, which is computed on-the-fly while
+ * mixing the string into the hash. Experimentation has found that
+ * SMHasher fails unless we incorporate the length, so it is passed to
+ * the finalizer as a tweak.
  */
 
 
@@ -154,6 +172,118 @@ fasthash_accum(fasthash_state *hs, const char *k, int len)
 	fasthash_combine(hs);
 }
 
+/*
+ * Set high bit in lowest byte where the input is zero, from:
+ * https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
+ */
+#define haszero64(v) \
+	(((v) - 0x0101010101010101) & ~(v) & 0x8080808080808080)
+
+/*
+ * all-purpose workhorse for fasthash_accum_cstring
+ */
+static inline int
+fasthash_accum_cstring_unaligned(fasthash_state *hs, const char *str)
+{
+	const char *const start = str;
+
+	while (*str)
+	{
+		int			chunk_len = 0;
+
+		while (chunk_len < FH_SIZEOF_ACCUM && str[chunk_len] != '\0')
+			chunk_len++;
+
+		fasthash_accum(hs, str, chunk_len);
+		str += chunk_len;
+	}
+
+	return str - start;
+}
+
+/*
+ * specialized 

Re: "ERROR: could not open relation with OID 16391" error was encountered when reindexing

2024-01-16 Thread Michael Paquier
On Wed, Jan 17, 2024 at 12:54:26AM +0800, feichanghong wrote:
>> This is extremely nonspecific, as line numbers in our code change
>> constantly.  Please quote a chunk of code surrounding that
>> and indicate which line you are trying to stop at.
> 
> Thanks for the suggestion, I've refined the steps below to reproduce:

Yeah, thanks for the steps.  I am not surprised that there are still a
few holes in this area.  CONCURRENTLY can behave differently depending
on the step where the old index is getting opened.

For this specific job, I have always wanted a try_index_open() that
would attempt to open the index with a relkind check, perhaps we could
introduce one and reuse it here?
--
Michael


signature.asc
Description: PGP signature


Re: Fix a possible socket leak at Windows (src/backend/port/win32/socket.c)

2024-01-16 Thread Michael Paquier
On Tue, Jan 16, 2024 at 05:25:39PM -0300, Ranier Vilela wrote:
> Thanks for taking care of this.

Yeah, that's a good catch.

> Do you have plans or should I register for a commitfest?

Daniel has stated that he would take care of it, so why not give
him a few days?  I don't think that a CF entry is necessary.
--
Michael


signature.asc
Description: PGP signature


Re: heavily contended lwlocks with long wait queues scale badly

2024-01-16 Thread Michael Paquier
On Tue, Jan 16, 2024 at 11:24:49PM -0500, Jonathan S. Katz wrote:
> On 1/16/24 1:11 AM, Michael Paquier wrote:
>> I'd like to apply that, just let me know if you have any comments
>> and/or objections.
> 
> Looking at the code, I understand an argument for not backpatching given we
> modify the struct, but this does seem low-risk/high-reward and should help
> PostgreSQL to run better on this higher throughput workloads.

Just to be clear here.  I repeated the tests on all the stable
branches yesterday, and the TPS falls off drastically around 256
concurrent sessions for all of them, with patterns similar to what I've
posted for 12; the patch gets back a lot of performance for the cases
with more than 1k connections.
--
Michael


signature.asc
Description: PGP signature


Reduce useless changes before reassembly during logical replication

2024-01-16 Thread li jie
Hi hackers,

During logical replication, if there is a large write transaction, some
spill files will be written to disk, depending on the setting of
logical_decoding_work_mem.

This behavior can effectively avoid OOM, but if the transaction
generates a lot of changes before commit, a large number of spill files
may fill the disk, for example when updating a TB-scale table.

However, I found an inelegant behavior: even if the modified large table is
not published, its changes are still written out as a large number of spill
files. Look at the example below:

publisher:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE PUBLICATION mypub FOR TABLE public.tbl_pub;
```

subscriber:
```
create table tbl_pub(id int, val1 text, val2 text,val3 text);
create table tbl_t1(id int, val1 text, val2 text,val3 text);
CREATE SUBSCRIPTION mysub CONNECTION 'host=127.0.0.1 port=5432
user=postgres dbname=postgres' PUBLICATION mypub;
```

publisher:
```
begin;
insert into tbl_t1 select i,repeat('xyzzy', i),repeat('abcba',
i),repeat('dfds', i) from generate_series(0,99) i;
```

Later you will see a large number of spill files in the
"/$PGDATA/pg_replslot/mysub/" directory.
```
$ll -sh
total 4.5G
4.0K -rw--- 1 postgres postgres 200 Nov 30 09:24 state
17M -rw--- 1 postgres postgres 17M Nov 30 08:22 xid-750-lsn-0-1000.spill
12M -rw--- 1 postgres postgres 12M Nov 30 08:20 xid-750-lsn-0-100.spill
17M -rw--- 1 postgres postgres 17M Nov 30 08:23 xid-750-lsn-0-1100.spill
..
```

We can see that table tbl_t1 is not published in mypub, and it won't be sent
downstream because it's not subscribed.
Only after the transaction is reassembled does the pgoutput decoding plugin
filter out changes to these unpublished relations when sending logical
changes. See function pgoutput_change.

Most importantly, if we filter out changes to unpublished relations
after constructing each change but before queuing it into the
transaction, can we reduce the workload of logical decoding and avoid
disk or memory growth as much as possible?

The patch in the attachment is a prototype, which can effectively reduce the
memory and disk space usage during logical replication.

Design:
1. Added a callback LogicalDecodeFilterByRelCB for the output plugin
(a sketch of a possible signature is shown after this list).

2. Added the callback function pgoutput_table_filter for the pgoutput plugin.
Its implementation is based on the table filter in the pgoutput_change
function: it determines whether the change needs to be published based on
the parameters of the publication, and if not, filters it out.

3. After constructing a change and before queuing it into a transaction,
use RelidByRelfilenumber to obtain the relation associated with the change,
just like obtaining the relation in the ReorderBufferProcessTXN function.

4. The relation may be a TOAST table, and there is no good way to get the
parent table's relation from the TOAST relation alone. Here, I get the
parent table's OID through the TOAST relation name, and then open the
parent table's relation.

5. This filtering takes INSERT/UPDATE/DELETE into account. Other change
types have not been considered yet and can be added in the future.
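
For illustration, here is a possible shape of that callback. This is a guess at the prototype's interface, mirroring the existing LogicalDecodeFilterByOriginCB, not code taken from the attached patch:

```c
/*
 * Hypothetical signature for the proposed filter callback; the attached
 * prototype may define it differently.  Returning true would mean that
 * changes for this relation are not published and can be dropped before
 * they are queued (and potentially spilled to disk) by the reorder buffer.
 */
typedef bool (*LogicalDecodeFilterByRelCB) (struct LogicalDecodingContext *ctx,
                                            Relation relation);
```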

Test:
1. Added a test case 034_table_filter.pl
2. Like the case above, create two tables, the published table tbl_pub and
the non-published table tbl_t1
3. Insert 10,000 rows of toast data into tbl_t1 on the publisher, and use
pg_ls_replslotdir to record the total size of the slot directory
every second.
4. Compare the size of the slot directory at the beginning of the
transaction (size1), the size at the end of the transaction (size2),
and the average size of the entire process (size3).
5. Assert(size1 == size2 == size3)

I sincerely look forward to your feedback.
Regards, lijie


v1-Reduce-useless-changes-before-reassembly-during-logi.patch
Description: Binary data


Re: index prefetching

2024-01-16 Thread Konstantin Knizhnik



On 16/01/2024 11:58 pm, Jim Nasby wrote:

On 1/16/24 2:10 PM, Konstantin Knizhnik wrote:
Amazon RDS is just vanilla Postgres with file system mounted on EBS 
(Amazon  distributed file system).
EBS provides good throughput but larger latencies comparing with 
local SSDs.

I am not sure if read-ahead works for EBS.


Actually, EBS only provides a block device - it's definitely not a 
filesystem itself (*EFS* is a filesystem - but it's also significantly 
different than EBS). So as long as readahead is happening somewheer 
above the block device I would expect it to JustWork on EBS.



Thank you for the clarification.
Yes, EBS is just a block device, and read-ahead can be used for it as for
any other local device.
There is actually a recommendation to increase read-ahead for EBS devices
to reach better performance on some workloads:


https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html

So it looks like manual prefetching is not needed for sequential access
patterns on EBS.
But at Neon the situation is quite different. Maybe Aurora Postgres is
using some other mechanism to speed up vacuum and seqscan,

but Neon is using the Postgres prefetch mechanism for it.





Re: Add tuples_skipped to pg_stat_progress_copy

2024-01-16 Thread Masahiko Sawada
On Wed, Jan 17, 2024 at 2:22 PM torikoshia  wrote:
>
> Hi,
>
> 132de9968840c introduced SAVE_ERROR_TO option to COPY and enabled to
> skip malformed data, but there is no way to watch the number of skipped
> rows during COPY.
>
> Attached patch adds tuples_skipped to pg_stat_progress_copy, which
> counts the number of skipped tuples because source data is malformed.
> If SAVE_ERROR_TO is not specified, this column remains zero.
>
> The advantage would be that users can quickly notice and stop COPYing
> when there is a larger amount of skipped data than expected, for
> example.
>
> As described in commit log, it is expected to add more choices for
> SAVE_ERROR_TO like 'log' and using such options may enable us to know
> the number of skipped tuples during COPY, but exposed in
> pg_stat_progress_copy would be easier to monitor.
>
>
> What do you think?

+1

The patch is pretty simple. Here is a comment:

+   (if SAVE_ERROR_TO is specified, otherwise zero).
+  
+ 

To be precise, this counter only advances when a value other than
'ERROR' is specified for the SAVE_ERROR_TO option.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Assertion failure with epoch when replaying standby records for 2PC

2024-01-16 Thread Michael Paquier
Hi all,

rorqual has failed today with a very interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual=2024-01-17%2005%3A06%3A31

This has caused an assertion failure for a 2PC transaction when
replaying one of the tests from the main regression suite:
2024-01-17 05:08:23.143 UTC [3242608] DETAIL:  Last completed transaction was 
at log time 2024-01-17 05:08:22.920244+00.
TRAP: failed Assert("epoch > 0"), File: 
"../pgsql/src/backend/access/transam/twophase.c", Line: 969, PID: 3242610
postgres: standby_1: startup recovering 
0001000C(ExceptionalCondition+0x83)[0x55746c7838c1]
postgres: standby_1: startup recovering 
0001000C(+0x194f0e)[0x55746c371f0e]
postgres: standby_1: startup recovering 
0001000C(StandbyTransactionIdIsPrepared+0x29)[0x55746c373120]
postgres: standby_1: startup recovering 
0001000C(StandbyReleaseOldLocks+0x3f)[0x55746c621357]
postgres: standby_1: startup recovering 
0001000C(ProcArrayApplyRecoveryInfo+0x50)[0x55746c61bbb5]
postgres: standby_1: startup recovering 
0001000C(standby_redo+0xe1)[0x55746c621490]
postgres: standby_1: startup recovering 
0001000C(PerformWalRecovery+0xa5e)[0x55746c392404]
postgres: standby_1: startup recovering 
0001000C(StartupXLOG+0x3ac)[0x55746c3862b8]
postgres: standby_1: startup recovering 
0001000C(StartupProcessMain+0xd9)[0x55746c5a60f6]
postgres: standby_1: startup recovering 
0001000C(AuxiliaryProcessMain+0x172)[0x55746c59bbdd]
postgres: standby_1: startup recovering 
0001000C(+0x3c4235)[0x55746c5a1235]
postgres: standby_1: startup recovering 
0001000C(PostmasterMain+0x1401)[0x55746c5a5a10]
postgres: standby_1: startup recovering 
0001000C(main+0x835)[0x55746c4e90ce]
/lib/x86_64-linux-gnu/libc.so.6(+0x276ca)[0x7f67bbb846ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f67bbb84785]
postgres: standby_1: startup recovering 
0001000C(_start+0x21)[0x55746c2b61d1]

This refers to the following in twophase.c with
AdjustToFullTransactionId(): 
nextXid = XidFromFullTransactionId(nextFullXid);
epoch = EpochFromFullTransactionId(nextFullXid);

if (unlikely(xid > nextXid))
{
/* Wraparound occurred, must be from a prev epoch. */
Assert(epoch > 0);
epoch--;
}

This would mean that we've found a way to get a negative epoch, which
should not be possible.

Alexander, you have added this code in 5a1dfde8334b when switching the
2PC file names to use FullTransactionIds.  Could you check please?
--
Michael


signature.asc
Description: PGP signature


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-16 Thread torikoshia

Hi,

Thanks for applying!

+   errmsg_plural("%zd row were skipped due 
to data type incompatibility",


Sorry, I just noticed it, but 'were' should be 'was' here?


BTW I'm thinking we should add a column to pg_stat_progress_copy that
counts soft errors. I'll suggest this in another thread.

Please do!


I've started it here:

https://www.postgresql.org/message-id/d12fd8c99adcae2744212cb23feff...@oss.nttdata.com


--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: partitioning and identity column

2024-01-16 Thread Ashutosh Bapat
On Wed, Jan 17, 2024 at 12:30 AM Peter Eisentraut  wrote:
>
> On 09.01.24 15:10, Ashutosh Bapat wrote:
> > Here's complete patch-set.
>
> Looks good!  Committed.
>

Thanks a lot Peter.

--
Best Wishes,
Ashutosh Bapat




Re: Make mesage at end-of-recovery less scary.

2024-01-16 Thread Michael Paquier
On Tue, Jan 16, 2024 at 02:46:02PM +0300, Aleksander Alekseev wrote:
>> For now, let me explain the basis for this patch. The fundamental
>> issue is that these warnings that always appear are, in practice, not
>> a problem in almost all cases. Some of those who encounter them for
>> the first time may feel uneasy and reach out with inquiries. On the
>> other hand, those familiar with these warnings tend to ignore them and
>> only pay attention to details when actual issues arise. Therefore, the
>> intention of this patch is to label them as "no issue" unless a
>> problem is blatantly evident, in order to prevent unnecessary concern.
> 
> I agree and don't mind affecting the error message per se.
> 
> However I see that the actual logic of how WAL is processed is being
> changed. If we do this, at very least it requires thorough thinking. I
> strongly suspect that the proposed code is wrong and/or not safe
> and/or less safe than it is now for the reasons named above.

FWIW, that pretty much sums up my feeling regarding this patch,
because an error, basically any error, would hurt back very badly.
Sure, the error messages we generate now when reaching the end of WAL
can sound scary, and they are (I suspect that's not really the case
for anybody who has a history of doing support with PostgreSQL, because a
bunch of these messages are old enough to vote, but I can understand
that anybody would freak out the first time they see them).

However, per the recent issues we've had in this area, like
cd7f19da3468 but I'm more thinking about 6b18b3fe2c2f and
bae868caf222, I am of the opinion that the header validation, the
empty page case in XLogReaderValidatePageHeader() and the record read
changes are risky enough that I am not convinced that the gains are
worth the risks taken.

The error stack in the WAL reader is complicated enough that making it
more complicated, as the patch proposes, does not sound like a good
tradeoff to me just to make the reports related to the end of WAL
cleaner for the end-user.  I agree that we should do something, but the patch
does not seem like a good step towards this goal.  Perhaps somebody
would be more excited about this proposal than I am, of course.
--
Michael


signature.asc
Description: PGP signature


Add tuples_skipped to pg_stat_progress_copy

2024-01-16 Thread torikoshia

Hi,

132de9968840c introduced the SAVE_ERROR_TO option to COPY and made it
possible to skip malformed data, but there is no way to watch the number
of skipped rows during COPY.


The attached patch adds tuples_skipped to pg_stat_progress_copy, which
counts the number of tuples skipped because the source data is malformed.

If SAVE_ERROR_TO is not specified, this column remains zero.

The advantage would be that users can quickly notice and stop COPYing
when a larger amount of data than expected is being skipped, for
example.


As described in the commit log, more choices for SAVE_ERROR_TO such as
'log' are expected to be added, and using such options may also let us
know the number of skipped tuples during COPY, but exposing this in
pg_stat_progress_copy would be easier to monitor.



What do you think?

--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation
From 98e546ff2de380175708ce003f67c993299a3fb3 Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Wed, 17 Jan 2024 13:41:44 +0900
Subject: [PATCH v1] Add tuples_skipped to pg_stat_progress_copy

132de9968840c enabled COPY to skip malformed data, but there is no way to watch the number of skipped rows during COPY.

This patch adds tuples_skipped to pg_stat_progress_copy, which counts the number of tuples skipped because the source data is malformed.
If SAVE_ERROR_TO is not specified, this column remains zero.

Needs catalog bump.
---
 doc/src/sgml/monitoring.sgml | 10 ++
 src/backend/catalog/system_views.sql |  3 ++-
 src/backend/commands/copyfrom.c  |  5 +
 src/include/commands/progress.h  |  1 +
 src/test/regress/expected/rules.out  |  3 ++-
 5 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b804eb8b5e..96ed774670 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5779,6 +5779,16 @@ FROM pg_stat_get_backend_idset() AS backendid;
WHERE clause of the COPY command.
   
  
+
+ 
+  
+   tuples_skipped bigint
+  
+  
+   Number of tuples skipped because they contain malformed data
+   (if SAVE_ERROR_TO is specified, otherwise zero).
+  
+ 
 

   
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e43e36f5ac..6288270e2b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1318,7 +1318,8 @@ CREATE VIEW pg_stat_progress_copy AS
 S.param1 AS bytes_processed,
 S.param2 AS bytes_total,
 S.param3 AS tuples_processed,
-S.param4 AS tuples_excluded
+S.param4 AS tuples_excluded,
+S.param7 AS tuples_skipped
 FROM pg_stat_get_progress_info('COPY') AS S
 LEFT JOIN pg_database D ON S.datid = D.oid;
 
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 4058b08134..fe33b0facf 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -650,6 +650,7 @@ CopyFrom(CopyFromState cstate)
 	CopyMultiInsertInfo multiInsertInfo = {0};	/* pacify compiler */
 	int64		processed = 0;
 	int64		excluded = 0;
+	int64		skipped = 0;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
 	bool		leafpart_use_multi_insert = false;
@@ -1012,6 +1013,10 @@ CopyFrom(CopyFromState cstate)
  */
 cstate->escontext->error_occurred = false;
 
+			/* Report that this tuple was skipped by the SAVE_ERROR_TO clause */
+			pgstat_progress_update_param(PROGRESS_COPY_TUPLES_SKIPPED,
+			 ++skipped);
+
 			continue;
 		}
 
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a458c8c50a..73afa77a9c 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -142,6 +142,7 @@
 #define PROGRESS_COPY_TUPLES_EXCLUDED 3
 #define PROGRESS_COPY_COMMAND 4
 #define PROGRESS_COPY_TYPE 5
+#define PROGRESS_COPY_TUPLES_SKIPPED 6
 
 /* Commands of COPY (as advertised via PROGRESS_COPY_COMMAND) */
 #define PROGRESS_COPY_COMMAND_FROM 1
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 55f2e95352..5e846b01e6 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1988,7 +1988,8 @@ pg_stat_progress_copy| SELECT s.pid,
 s.param1 AS bytes_processed,
 s.param2 AS bytes_total,
 s.param3 AS tuples_processed,
-s.param4 AS tuples_excluded
+s.param4 AS tuples_excluded,
+s.param7 AS tuples_skipped
FROM (pg_stat_get_progress_info('COPY'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
  LEFT JOIN pg_database d ON ((s.datid = d.oid)));
 pg_stat_progress_create_index| SELECT s.pid,

base-commit: 65c5864d7fac46516f17ee89085e349a87ee5bd7
-- 
2.39.2



Re: Synchronizing slots from primary to standby

2024-01-16 Thread Nisha Moond
A review on v62-006: failover-ready validation steps doc -

+   Next, check that the logical replication slots identified above exist on
+   the standby server. This step can be skipped if
+   standby_slot_names has been correctly configured.
+
+test_standby=# SELECT bool_and(synced AND NOT temporary AND
conflict_reason IS NULL) AS failover_ready
+   FROM pg_replication_slots
+   WHERE slot_name in ('sub');
+ failover_ready
+
+ t
+(1 row)

This query does not ensure that all the logical replication slots
exist on the standby. Due to the 'IN ('slots')' check, it will return
'true' even if only one or a few of the slots exist.




Re: Synchronizing slots from primary to standby

2024-01-16 Thread shveta malik
On Wed, Jan 17, 2024 at 6:43 AM Masahiko Sawada  wrote:
>
> On Tue, Jan 16, 2024 at 6:40 PM shveta malik  wrote:
> >
> > On Tue, Jan 16, 2024 at 12:59 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Tue, Jan 16, 2024 at 1:07 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Tue, Jan 16, 2024 at 9:03 AM shveta malik  
> > > > wrote:
> > > > >
> > > > > On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, Jan 12, 2024 at 5:50 PM shveta malik 
> > > > > >  wrote:
> > > > > > >
> > > > > > > There are multiple approaches discussed and tried when it comes to
> > > > > > > starting a slot-sync worker. I am summarizing all here:
> > > > > > >
> > > > > > >  1) Make slotsync worker as an Auxiliary Process (like 
> > > > > > > checkpointer,
> > > > > > > walwriter, walreceiver etc). The benefit this approach provides 
> > > > > > > is, it
> > > > > > > can control begin and stop in a more flexible way as each 
> > > > > > > auxiliary
> > > > > > > process could have different checks before starting and can have
> > > > > > > different stop conditions. But it needs code duplication for 
> > > > > > > process
> > > > > > > management(start, stop, crash handling, signals etc) and 
> > > > > > > currently it
> > > > > > > does not support db-connection smoothly (none of the auxiliary 
> > > > > > > process
> > > > > > > has one so far)
> > > > > > >
> > > > > >
> > > > > > As slotsync worker needs to perform transactions and access 
> > > > > > syscache,
> > > > > > we can't make it an auxiliary process as that doesn't initialize the
> > > > > > required stuff like syscache. Also, see the comment "Auxiliary
> > > > > > processes don't run transactions ..." in AuxiliaryProcessMain() 
> > > > > > which
> > > > > > means this is not an option.
> > > > > >
> > > > > > >
> > > > > > > 2) Make slotsync worker as a 'special' process like 
> > > > > > > AutoVacLauncher
> > > > > > > which is neither an Auxiliary process nor a bgworker one. It 
> > > > > > > allows
> > > > > > > db-connection and also provides flexibility to have start and stop
> > > > > > > conditions for a process.
> > > > > > >
> > > > > >
> > > > > > Yeah, due to these reasons, I think this option is worth considering
> > > > > > and another plus point is that this allows us to make 
> > > > > > enable_syncslot
> > > > > > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER.
> > > > > >
> > > > > > >
> > > > > > > 3) Make slotysnc worker a bgworker. Here we just need to register 
> > > > > > > our
> > > > > > > process as a bgworker (RegisterBackgroundWorker()) by providing a
> > > > > > > relevant start_time and restart_time and then the process 
> > > > > > > management
> > > > > > > is well taken care of. It does not need any code-duplication and
> > > > > > > allows db-connection smoothly in registered process. The only 
> > > > > > > thing it
> > > > > > > lacks is that it does not provide flexibility of having
> > > > > > > start-condition which then makes us to have 'enable_syncslot' as
> > > > > > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said 
> > > > > > > this, I
> > > > > > > feel enable_syncslot is something which will not be changed 
> > > > > > > frequently
> > > > > > > and with the benefits provided by bgworker infra, it seems a
> > > > > > > reasonably good option to choose this approach.
> > > > > > >
> > > > > >
> > > > > > I agree but it may be better to make it a PGC_SIGHUP parameter.
> > > > > >
> > > > > > > 4) Another option is to have Logical Replication Launcher(or a new
> > > > > > > process) to launch slot-sync worker. But going by the current 
> > > > > > > design
> > > > > > > where we have only 1 slotsync worker, it may be an overhead to 
> > > > > > > have an
> > > > > > > additional manager process maintained.
> > > > > > >
> > > > > >
> > > > > > I don't see any good reason to have an additional launcher process 
> > > > > > here.
> > > > > >
> > > > > > >
> > > > > > > Thus weighing pros and cons of all these options, we have 
> > > > > > > currently
> > > > > > > implemented the bgworker approach (approach 3).  Any feedback is
> > > > > > > welcome.
> > > > > > >
> > > > > >
> > > > > > I vote to go for (2) unless we face difficulties in doing so but (3)
> > > > > > is also okay especially if others also think so.
> > > > >
> > > > > I am not against any of the approaches but I still feel that when we
> > > > > have a standard way of doing things (bgworker) we should not keep
> > > > > adding code to do things in a special way unless there is a strong
> > > > > reason to do so. Now we need to decide if 'enable_syncslot' being
> > > > > PGC_POSTMASTER is a strong reason to go the non-standard way?
> > > > >
> > > >
> > > > Agreed and as said earlier I think it is better to make it a
> > > > PGC_SIGHUP. Also, not sure we can say it is a non-standard way as
> > > > already autovacuum launcher is handled in the same way. One more minor
> > > > thing is it will save us for 

Re: Improve the connection failure error messages

2024-01-16 Thread Nisha Moond
Thanks for reviewing, please find my response inline.

On Wed, Jan 17, 2024 at 4:56 AM Peter Smith  wrote:
>
> On Sat, Jan 13, 2024 at 12:36 AM Aleksander Alekseev
>  wrote:
> >
> > Hi,
> >
> > Thanks for the patch.
> >
> > > Due to this behavior, it is not possible to add a test to show the
> > > error message as it is done for CREATE SUBSCRIPTION.
> > > Let me know if you think there is another way to add this test.
> >
> > I believe it can be done with TAP tests, see for instance:
> >
> > contrib/auto_explain/t/001_auto_explain.pl
> >
> > However I wouldn't insist on including the test in scope of this
> > particular patch. Such a test doesn't currently exist, it can be added
> > as a separate patch, and whether this is actually a useful test (all
> > the tests consume time after all...) is somewhat debatable. Personally
> > I agree that it would be nice to have though.
> >
> > This being said...
> >
> > > The ALTER SUBSCRIPTION command does not error out on the user
> > > interface if updated with a bad connection string and the connection
> > > failure error can only be seen in the respective log file.
> >
> > I wonder if we should fix this. Again, not necessarily in scope of
> > this work, but if this is not a difficult task, again, it would be
> > nice to have.
> >
> > Other than that v2 looks good.
> >
>
> OK.  I see now that any ALTER of the subscription's connection, even
> to some value that fails, will restart a new worker (like ALTER of any
> other subscription parameters). For a bad connection, it will continue
> to relaunch-worker/ERROR over and over.
>
> test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG:  logical
> replication apply worker for subscription "sub4" has started
> 2024-01-17 09:34:28.666 AEDT [11274] ERROR:  could not connect to the
> publisher: invalid port number: "-1"
> 2024-01-17 09:34:28.667 AEDT [928] LOG:  background worker "logical
> replication apply worker" (PID 11274) exited with exit code 1
> dRs su2024-01-17 09:34:33.669 AEDT [11391] LOG:  logical replication
> apply worker for subscription "sub4" has started
> 2024-01-17 09:34:33.669 AEDT [11391] ERROR:  could not connect to the
> publisher: invalid port number: "-1"
> 2024-01-17 09:34:33.670 AEDT [928] LOG:  background worker "logical
> replication apply worker" (PID 11391) exited with exit code 1
> b4
> ...
>
> I don't really have any opinion if that behaviour is good or bad, but
> anyway, it is deliberate, and IMO it is outside the scope of your
> patch, so v2 patch LGTM.

Upon code review, ALTER SUBSCRIPTION updates the connection string
after checking for parse errors and a few other obvious errors, and does
not attempt to establish a connection. It is the apply worker running for
the respective subscription that will try to connect and fail in case of
a bad connection string.
To me, this seems to be intentional design behavior, and I agree that
deciding whether to change or maintain this behavior is out of this
patch's scope.

--
Thanks,
Nisha




Re: [PATCH] Add support function for containment operators

2024-01-16 Thread jian he
On Wed, Jan 17, 2024 at 5:46 AM Tom Lane  wrote:
>
> But perhaps someone has an argument for a different rule?
>
> Anyway, pending discussion of that point, I think the code is good
> to go.  I don't like the test cases much though: they expend many more
> cycles than necessary.  You could prove the same points just by
> looking at the expansion of expressions, eg.
>

Your patch is far better!

IMHO, I was worried that the support function would produce a transformed
plan that generates wrong results, so I added the tests to make it
bulletproof. Now I see your point: if the transformed plan is right, the
whole added code should be fine. But keeping the textrange_supp related
test still seems like a good idea, since we don't have any SUBTYPE_OPCLASS
related SQL tests.




Re: Synchronizing slots from primary to standby

2024-01-16 Thread Peter Smith
On Tue, Jan 16, 2024 at 10:57 PM shveta malik  wrote:
>
...
> v62-006:
> Separated the failover-ready validation steps into this separate
> doc-patch (which were earlier present in v61-002 and v61-003). Also
> addressed some of the doc comments by Peter in [1].
> Thanks Hou-San for providing this patch.
>
> [1]: 
> https://www.postgresql.org/message-id/CAHut%2BPteZVNx1jQ6Hs3mEdoC%3DDNALVpJJ2mZDYim7sU-04tiaw%40mail.gmail.com
>

Thanks for addressing my previous review in the new patch 0006. I
checked it again and below are a few more comments

==
1. GENERAL

I was wondering if some other documentation (like somewhere from
chapter 27, or maybe the pg_ctl promote docs?) should be referring back
to this new information about how to decide if the standby is ready
for promotion.

==
doc/src/sgml/logical-replication.sgml

2.
+
+  
+   Because the slot synchronization logic copies asynchronously, it is
+   necessary to confirm that replication slots have been synced to the standby
+   server before the failover happens. Furthermore, to ensure a successful
+   failover, the standby server must not be lagging behind the subscriber. It
+   is highly recommended to use standby_slot_names to
+   prevent the subscriber from consuming changes faster than the hot standby.
+   To confirm that the standby server is indeed ready for failover, follow
+   these 2 steps:
+  

For easier navigation, perhaps that standby_slot_names should include
a link back to where the standby_slot_names GUC is described.

~~~

3.
+
+ 
+  
+   Firstly, on the subscriber node, use the following SQL to identify the
+   slot names that should be synced to the standby that we plan to promote.

Minor change to wording.

SUGGESTION
Firstly, on the subscriber node, use the following SQL to identify
which slots should be synced to the standby that we plan to promote.

~~~

4.
+
+test_sub=# SELECT
+   array_agg(slotname) AS slots
+   FROM
+   ((
+   SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid ||
'_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname
+   FROM pg_control_system() ctl, pg_subscription_rel r,
pg_subscription s
+   WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND s.subfailover
+   ) UNION (
+   SELECT s.oid AS subid, s.subslotname as slotname
+   FROM pg_subscription s
+   WHERE s.subfailover
+   ));
+ slots
+---
+ {sub}
+(1 row)
+
+ 

I think the example might be better if the result shows > 1 slot.  e.g.
{sub1,sub2,sub3}

This would also make the next step 1.b. more clear.

~~~

5.
+
+ 
+ 
+  
+   Next, check that the logical replication slots identified above exist on
+   the standby server. This step can be skipped if
+   standby_slot_names has been correctly configured.
+
+test_standby=# SELECT bool_and(synced AND NOT temporary AND
conflict_reason IS NULL) AS failover_ready
+   FROM pg_replication_slots
+   WHERE slot_name in ('sub');
+ failover_ready
+
+ t
+(1 row)
+

5a.

(uppercase SQL keyword)

/in/IN/

~

5b.
I felt this might be easier to understand if the SQL gives a
two-column result instead of one all-or-nothing T/F where you might not
be sure which slot was the one giving a problem. e.g.

failover_ready | slot
-
t  | sub1
t  | sub2
f  | sub3
...

~~~

6.
+   
+Firstly, on the subscriber node check the last replayed WAL. If the
+query result is NULL, it indicates that the subscriber has not yet
+replayed any WAL. Therefore, the next step can be skipped, as the
+standby server must be ahead of the subscriber.

IMO all of that part "If the query result is NULL" does not really
belong here because it describes skipping the *next* step. So, it
would be better to say this in the next step. Something like:

SUGGESTION (for step 2b)
Next, on the standby server check that the last-received WAL location
is ahead of the replayed WAL location on the subscriber identified
above. If the above SQL result was NULL, it means the subscriber has
not yet replayed any WAL, so the standby server must be ahead of the
subscriber, and this step can be skipped.

~~~

7.
+
+test_sub=# SELECT
+   MAX(remote_lsn) AS remote_lsn_on_subscriber
+   FROM
+   ((
+   SELECT (CASE WHEN r.srsubstate = 'f' THEN
pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' ||
r.srrelid), false)
+   WHEN r.srsubstate IN ('s', 'r') THEN
r.srsublsn END) as remote_lsn
+   FROM pg_subscription_rel r, pg_subscription s
+   WHERE r.srsubstate IN ('f', 's', 'r') AND s.oid =
r.srsubid AND s.subfailover
+   ) UNION (
+   SELECT pg_replication_origin_progress(CONCAT('pg_' ||
s.oid), false) AS remote_lsn
+   FROM pg_subscription s
+   

Re: Synchronizing slots from primary to standby

2024-01-16 Thread Peter Smith
Here is a  review comment for the latest v62-0002 changes.

==
src/backend/replication/logical/slotsync.c

1.
+ if (namestrcmp(&slot->data.plugin, remote_slot->plugin) == 0 &&
+ slot->data.database == dbid &&
+ remote_slot->restart_lsn == slot->data.restart_lsn &&
+ remote_slot->catalog_xmin == slot->data.catalog_xmin &&
+ remote_slot->two_phase == slot->data.two_phase &&
+ remote_slot->failover == slot->data.failover &&
+ remote_slot->confirmed_lsn == slot->data.confirmed_flush)
+ return false;

For consistency, I think it would be better to always code the remote
slot value on the LHS and the local slot value on the RHS, instead of
the current random mix.

And rename 'dbid' to 'remote_dbid' for name consistency too.

SUGGESTION
if (namestrcmp(remote_slot->plugin, &slot->data.plugin) == 0 &&
  remote_dbid == slot->data.database &&
  remote_slot->restart_lsn == slot->data.restart_lsn &&
  remote_slot->catalog_xmin == slot->data.catalog_xmin &&
  remote_slot->two_phase == slot->data.two_phase &&
  remote_slot->failover == slot->data.failover &&
  remote_slot->confirmed_lsn == slot->data.confirmed_flush)
  return false;

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: heavily contended lwlocks with long wait queues scale badly

2024-01-16 Thread Jonathan S. Katz

On 1/16/24 1:11 AM, Michael Paquier wrote:

On Thu, Jan 11, 2024 at 09:47:33AM -0500, Jonathan S. Katz wrote:

I have similar data sources to Nathan/Michael and I'm trying to avoid piling
on, but one case that's interesting occurred after a major version upgrade
from PG10 to PG14 on a database supporting a very active/highly concurrent
workload. On inspection, it seems like backpatching would help this
particularly case.

With 10/11 EOL, I do wonder if we'll see more of these reports on upgrade to
< PG16.

(I was in favor of backpatching prior; opinion is unchanged).


Hearing nothing, I have prepared a set of patches for v12~v15,
checking all the lwlock paths for all the branches.  At the end the
set of changes look rather sane to me regarding the queue handlings.

I have also run some numbers on all the branches, and the test case
posted upthread falls off dramatically after 512 concurrent
connections at the top of all the stable branches :(

For example on REL_12_STABLE with and without the patch attached:
num   v12            v12+patch
1     29717.151665   29096.707588
2     63257.709301   61889.476318
4     127921.873393  124575.901330
8     231400.571662  230562.725174
16    343911.185351  312432.897015
32    291748.985280  281011.787701
64    268998.728648  269975.605115
128   297332.597018  286449.176950
256   243902.817657  240559.122309
512   190069.602270  194510.718508
768   58915.650225   165714.707198
1024  39920.950552   149433.836901
2048  16922.391688   108164.301054
4096  6229.063321    69032.338708

I'd like to apply that, just let me know if you have any comments
and/or objections.


Wow. All I can say is that my opinion remains unchanged on going forward 
with backpatching.


Looking at the code, I understand an argument for not backpatching given
we modify the struct, but this does seem low-risk/high-reward and should
help PostgreSQL to run better on these higher-throughput workloads.


Thanks,

Jonathan


OpenPGP_signature.asc
Description: OpenPGP digital signature


Re: Fix a typo of func DecodeInsert()

2024-01-16 Thread Yongtao Huang
Thank you. I prefer to keep the comments of these three functions
*DecodeInsert()*,  *DecodeUpdate()*, and *DecodeDelete()* aligned.
```
/*
 * Parse XLOG_HEAP_INSERT (not MULTI_INSERT!) records into tuplebufs.
 *
 * Inserts can contain the new tuple.
 */
static void
DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)

/*
 * Parse XLOG_HEAP_UPDATE and XLOG_HEAP_HOT_UPDATE, which have the same
layout
 * in the record, from wal into proper tuplebufs.
 *
 * Updates can possibly contain a new tuple and the old primary key.
 */
static void
DecodeUpdate(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)

/*
 * Parse XLOG_HEAP_DELETE from wal into proper tuplebufs.
 *
 * Deletes can possibly contain the old primary key.
 */
static void
DecodeDelete(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)

```

Best wishes

Yongtao Huang


Richard Guo  wrote on Wed, Jan 17, 2024 at 09:10:

>
> On Wed, Jan 17, 2024 at 8:47 AM Yongtao Huang 
> wrote:
>
>> Hi all,
>> I think the comment above the function DecodeInsert()
>> in src/backend/replication/logical/decode.c should be
>> + * *Inserts *can contain the new tuple.
>> , rather than
>> - * *Deletes *can contain the new tuple.
>>
>
> Nice catch.  +1.
>
> I kind of wonder if it would be clearer to state that "XLOG_HEAP_INSERT
> can contain the new tuple", in order to differentiate it from
> XLOG_HEAP2_MULTI_INSERT.
>
> Thanks
> Richard
>


Re: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread Tom Lane
David Rowley  writes:
> On Wed, 17 Jan 2024 at 15:28, Maiquel Grassi  wrote:
>> On Wed, 17 Jan 2024 at 14:36, David Rowley  wrote:
>>> If you were looking for something to optimize in this rough area, then
>>> perhaps adding some kind of "Backward WindowAgg" node (by overloading
>>> the existing node) to allow queries such as the following to be
>>> executed without an additional sort.
>>> 
>>> SELECT a,row_number() over (order by a desc) from t order by a;

>> David, considering this optimization, allowing for that, do you believe it 
>> is plausible to try advancing towards a possible Proof of Concept (PoC) 
>> implementation?

> I think the largest factor which would influence the success of that
> would be how much more complex nodeWindowAgg.c would become.

Even if a workable patch for that is presented, should we accept it?
I'm having a hard time believing that this requirement is common
enough to justify more than a microscopic addition of complexity.
This whole area is devilishly complicated already, and I can think of
a bunch of improvements that I'd rate as more worthy of developer
effort than this.

regards, tom lane




Re: locked reads for atomics

2024-01-16 Thread Li, Yong


> On Nov 28, 2023, at 05:00, Nathan Bossart  wrote:
> 
> External Email
> 
> Here's a v2 of the patch set in which I've attempted to address all
> feedback.  I've also added a pg_write_membarrier_u* pair of functions that
> provide an easy way to write to an atomic variable with full barrier
> semantics.  In the generic implementation, these are just aliases for an
> atomic exchange.
> 
> 0002 demonstrates how these functions might be used to eliminate the
> arch_lck spinlock, which is only ever used for one boolean variable.  My
> hope is that the membarrier functions make eliminating spinlocks for
> non-performance-sensitive code easy to reason about.
> 
> (We might be able to use a pg_atomic_flag instead for 0002, but that code
> seems intended for a slightly different use-case and has more complicated
> barrier semantics.)
> 
> --
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com

Hi Nathan,

The patch looks good to me.

The patch adds two pairs of atomic functions that provide full-barrier
semantics for atomic read/write operations. The patch also includes an example
of how these new functions can be used to replace spinlocks.
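
For readers of the archive, here is a minimal sketch of that pattern. The helper names (pg_atomic_read_membarrier_u32 / pg_atomic_write_membarrier_u32) and the struct are assumptions based on this thread's description, not code taken from the patch itself:

```c
/* Hypothetical shared state: an atomic flag instead of bool + spinlock.
 * Requires "port/atomics.h". */
typedef struct WorkerSharedState
{
	pg_atomic_uint32 request_pending;
} WorkerSharedState;

/* At shared-memory initialization time. */
static void
worker_shmem_init(WorkerSharedState *state)
{
	pg_atomic_init_u32(&state->request_pending, 0);
}

/* Writer: set the flag with full-barrier semantics. */
static void
set_request_pending(WorkerSharedState *state)
{
	pg_atomic_write_membarrier_u32(&state->request_pending, 1);
}

/* Reader: check the flag with full-barrier semantics. */
static bool
request_is_pending(WorkerSharedState *state)
{
	return pg_atomic_read_membarrier_u32(&state->request_pending) != 0;
}
```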

The patch applies cleanly to HEAD. “make check-world” also runs cleanly with no 
error.  I am moving it to Ready for Committers.

Regards,
Yong

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-16 Thread John Naylor
I wrote:

> > Hmm, I wonder if that's a side-effect of the "create" functions doing
> > their own allocations and returning a pointer. Would it be less tricky
> > if the structs were declared where we need them and passed to "init"
> > functions?

If this is a possibility, I thought I'd first send the last (I hope)
large-ish set of radix tree cleanups to avoid rebasing issues. I'm not
including tidstore/vacuum here, because recent discussion has some
up-in-the-air work.

Should be self-explanatory, but some things are worth calling out:
0012 and 0013: Some time ago I started passing insertpos as a
parameter, but now I see that is not ideal -- when growing from node16
to node48 we don't need it at all, so it's a wasted calculation. While
reverting that, I found that this also allows passing constants in
some cases.
0014 makes a cleaner separation between adding a child and growing a
node, resulting in more compact-looking functions.
0019 is a bit unpolished, but I realized that it's pointless to assign
a zero child when further up the call stack we overwrite it anyway
with the actual value. With this, that assignment is skipped. This
makes some comments and names strange, so it needs a bit of polish, but
I wanted to get it out there anyway.


v53-ART.tar.gz
Description: application/gzip


Re: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread David Rowley
On Wed, 17 Jan 2024 at 15:28, Maiquel Grassi  wrote:
> On Wed, 17 Jan 2024 at 14:36, David Rowley  wrote:
> > If you were looking for something to optimize in this rough area, then
> > perhaps adding some kind of "Backward WindowAgg" node (by overloading
> > the existing node) to allow queries such as the following to be
> > executed without an additional sort.
> >
> > SELECT a,row_number() over (order by a desc) from t order by a;
>
> David, considering this optimization, allowing for that, do you believe it is 
> plausible to try advancing towards a possible Proof of Concept (PoC) 
> implementation?

I think the largest factor which would influence the success of that
would be how much more complex nodeWindowAgg.c would become.

There's a couple of good ways to ensure such a patch fails:

1. Copy and paste all the code out of nodeWindowAgg.c and create
nodeWindowAggBackward.c and leave a huge maintenance burden. (don't do
this)
2. Make nodeWindowAgg.c much more complex and slower by adding dozens
of conditions to check if we're in backward mode.

I've not taken the time to study nodeWindowAgg.c to know how much more
complex supporting reading the tuples backwards would make it.
Certainly the use of tuplestore_trim() would have to change, and
obviously the way we read stored tuples back would need to be adjusted. It
might just add much more complexity than it would be worth.  Part of
the work would be finding this out.

If making the changes to nodeWindowAgg.c is too complex, then
adjusting nodeMaterial.c would at least put us in a better position
than having to sort twice.  You'd have to add a bool isbackward flag
to MaterialPath, and then likely add a ScanDirection normal_dir to
MaterialState, then set "dir" in ExecMaterial() using
ScanDirectionCombine of the two scan directions.   At least some of
what's there would work as a result of that, but likely a few other
things in ExecMaterial() would need to be rejiggered.  explain.c would
need to show "Backward Material", etc.
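
As a rough sketch of that nodeMaterial.c idea (the normal_dir field is hypothetical, not from an actual patch), the direction handling in ExecMaterial() could end up looking something like this:

```c
/*
 * Sketch only: "node->normal_dir" is a hypothetical new MaterialState
 * field recording the node's own scan direction.  Combining it with the
 * executor's current direction yields the effective read direction.
 */
ScanDirection dir;

dir = ScanDirectionCombine(estate->es_direction, node->normal_dir);

if (ScanDirectionIsBackward(dir))
{
	/* read the stored tuples back in reverse order */
}
```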

In both cases you'd need to modify planner.c's create_one_window_path()
and invent a function such as pathkeys_count_contained_in_backward()
or at least pathkeys_contained_in_backward() to detect when you need
to use the backward node type.

I'd go looking at nodeWindowAgg.c first, if you're interested.

David




RE: Random pg_upgrade test failure on drongo

2024-01-16 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> 
> Thanks to both of you. I have pushed the patch.
>

I have been tracking the BF animals these days, and this failure has not been
seen anymore.
I think we can close the topic. Again, thanks for all the efforts.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-16 Thread John Naylor
On Wed, Jan 17, 2024 at 8:39 AM Masahiko Sawada  wrote:
>
> On Wed, Jan 17, 2024 at 9:20 AM John Naylor  wrote:
> >
> > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada  
> > wrote:
> > > Just changing "items" to be the local tidstore struct could make the
> > > code tricky a bit, since max_bytes and num_items are on the shared
> > > memory while "items" is a local pointer to the shared tidstore.
> >
> > Thanks for trying it this way! I like the overall simplification but
> > this aspect is not great.
> > Hmm, I wonder if that's a side-effect of the "create" functions doing
> > their own allocations and returning a pointer. Would it be less tricky
> > if the structs were declared where we need them and passed to "init"
> > functions?
>
> Seems worth trying. The current RT_CREATE() API is also convenient as
> other data structure such as simplehash.h and dshash.c supports a
> similar

I don't happen to know if these paths had to solve similar trickiness
with some values being local, and some shared.

> > That may be a good idea for other reasons. It's awkward that the
> > create function is declared like this:
> >
> > #ifdef RT_SHMEM
> > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
> > dsa_area *dsa,
> > int tranche_id);
> > #else
> > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
> > #endif
> >
> > An init function wouldn't need these parameters: it could look at the
> > passed struct to know what to do.
>
> But the init function would initialize leaf_ctx etc,no? Initializing
> leaf_ctx needs max_bytes that is not stored in RT_RADIX_TREE.

I was more referring to the parameters that were different above
depending on shared memory. My first thought was that the tricky part
is because of the allocation in local memory, but it's certainly
possible I've misunderstood the problem.

> The same
> is true for dsa. I imagined that an init function would allocate a DSA
> memory for the control object.

Yes:

...
//  embedded in VacDeadItems
  TidStore items;
};

// NULL DSA in local case, etc
dead_items->items.area = dead_items_dsa;
dead_items->items.tranche_id = FOO_ID;

TidStoreInit(&dead_items->items, vac_work_mem);

That's how I imagined it would work (leaving out some details). I
haven't tried it, so not sure how much it helps. Maybe it has other
problems, but I'm hoping it's just a matter of programming.
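
In template terms, the two #ifdef'd declarations quoted above might then
collapse into a single one, roughly like this (hypothetical and untested;
RT_INIT doesn't exist today):

    /*
     * The caller owns the struct and fills in the shared-memory-specific
     * fields (dsa area, tranche id) before calling this, as in the
     * TidStore sketch above; max_bytes still has to be passed in since
     * it isn't stored in RT_RADIX_TREE.
     */
    RT_SCOPE void RT_INIT(RT_RADIX_TREE *tree, MemoryContext ctx, Size max_bytes);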

If we can't make this work nicely, I'd be okay with keeping the tid
store control object. My biggest concern is unnecessary
double-locking.




Re: introduce dynamic shared memory registry

2024-01-16 Thread Bharath Rupireddy
On Tue, Jan 16, 2024 at 9:37 PM Nathan Bossart  wrote:
>
> The autoprewarm change (0003) does use this variable.  I considered making
> it optional (i.e., you could pass in NULL if you didn't want it), but I
> didn't feel like the extra code in GetNamedDSMSegment() to allow this was
> worth it so that callers could avoid creating a single bool.

I'm okay with it.

The v8 patches look good to me.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




RE: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread Maiquel Grassi
As far as I see your proposal, you want to allow something that is
undefined to be reversed.  I don't think this is a good idea at all.
As mentioned by others, you should have ORDER BY clauses and just add
a DESC.


  *   Okay, now I'm convinced of that.

If you were looking for something to optimize in this rough area, then
perhaps adding some kind of "Backward WindowAgg" node (by overloading
the existing node) to allow queries such as the following to be
executed without an additional sort.

SELECT a,row_number() over (order by a desc) from t order by a;


  *   David, considering this optimization, do you believe it is plausible to
try advancing towards a possible Proof of Concept (PoC) implementation?

Maiquel.


Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-16 Thread Jelte Fennema-Nio
On Tue, 16 Jan 2024 at 21:36, Robert Haas  wrote:
> Sorry for not getting back to this right away; there are quite a few
> threads competing for my attention.

No worries, I know it's a busy time.

> But I think we would want to have those changes in all supported
> branches before we could think of requesting 3.1 or higher by default.
> Imagine that in v17 we both added server support for protocol version
> 3.1 and also adopted your 0001. Then, a v17 libpq would fail to
> connect to a v16 or earlier PostgreSQL instance. In effect, it would
> be a complete wire compatibility break. There's no way that such a
> patch is going to be acceptable. If I were to commit a patch from you
> or anyone else that does that, many angry people would show up to yell
> at both of us. So unless I'm misunderstanding the situation, 0001 is
> pretty much dead on arrival for now and for quite a while to come.

It's understandable you're worried about breaking clients, but afaict
**you are actually misunderstanding the situation**. I totally agree
that we cannot bump the protocol version without also merging 0002
(that's why the version bump is in patch 0005 not patch 0001). But
0002 does **not** need to be backported to all supported branches. The
only libpq versions that can ever receive a NegotiateVersionProtocol
are ones that request 3.1, and since none of the pre-17 libpq versions
ever request 3.1 there's no need for them to be able to handle
NegotiateVersionProtocol.

If you try out the patchset, you will see that you can connect with
psql16 to postgres17 and psql17 to postgres16. Both without any
problems. The only time when you will get problems is if you connect
to a server from before these versions that you mentioned (2017-11-21,
11.0, 10.2, 9.6.7, 9.5.11, 9.4.16, 9.3.21)

> Also, I'm pretty doubtful that we want
> PQunsupportedProtocolExtensions().

I definitely think we should include this API. As a client author (and
even user), you might want to know what features are supported by the
server you are connected to. That way you can avoid calling functions
that would otherwise fail. This is also the reason why
PQprotocolVersion() and PQserverVersion() exist. IMHO
PQunsupportedProtocolExtensions() is simply a natural addition to
those already existing feature-discovery APIs.

I'll move the addition of PQunsupportedProtocolExtensions() to a
separate patch though, since I do agree that it's a separate item from
the rest of 0002.

> That seems like allowing the client
> to have too much direct visibility into what's happening at the
> protocol level. That kind of interface might make sense if we were
> trying to support unknown protocol extensions from third parties, but
> for stuff in core, I think there should be specific APIs that relate
> to specific features e.g. you call PQsetWireProtocolRole(char
> *whatever) and it returns success or failure according to whether that
> capability is available without telling you how that's negotiated on
> the wire.

I think we have a very different idea of what is a desirable API for
client authors that use libpq to build their clients. libpq its API is
pretty low level, so I think it makes total sense for client authors
to know what protocol extension parameters exist. It seems totally
acceptable for me to have them call PQsetParameter directly:

PQsetParameter("_pq_.protocol_roles", "true")
PQsetParameter("_pq_.report_parameters", "role,search_path")

Otherwise we need to introduce **two** new functions for every
protocol extension that is going to be introduced, a blocking and a
non-blocking one (e.g. PQsetWireProtocolRole() and
PQsendSetWireProtocolRole()). And that seems like unnecessary API
bloat to me.

To be clear, I think it would probably make sense for client authors
to expose functions like that for the users of the client. But I think
libpq should not add an API surface that can easily be avoided (e.g.
there's also no function to begin a transaction, even though pretty
much every client exposes one).




RE: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread Maiquel Grassi
As far as I see your proposal, you want to allow something that is
undefined to be reversed.  I don't think this is a good idea at all.
As mentioned by others, you should have ORDER BY clauses and just add
a DESC.

If you were looking for something to optimize in this rough area, then
perhaps adding some kind of "Backward WindowAgg" node (by overloading
the existing node) to allow queries such as the following to be
executed without an additional sort.

SELECT a,row_number() over (order by a desc) from t order by a;

On the planner side this is likely fairly easy to implement. I
don't think we'd need to generate any additional Paths. We could
invent some pathkeys_contained_in_reverse() function and switch on the
Backward flag if it returns true.

The complexity would be in nodeWindowAgg.c... perhaps too much
complexity for it to be worthwhile while not adding additional
overhead to the non-backward case.

Or, it might be easier to invent "Backward Materialize" instead and
just have the planner use one of those instead of the final sort.

David


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-16 Thread Masahiko Sawada
On Wed, Jan 17, 2024 at 9:20 AM John Naylor  wrote:
>
> On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada  wrote:
> > Just changing "items" to be the local tidstore struct could make the
> > code tricky a bit, since max_bytes and num_items are on the shared
> > memory while "items" is a local pointer to the shared tidstore.
>
> Thanks for trying it this way! I like the overall simplification but
> this aspect is not great.
> Hmm, I wonder if that's a side-effect of the "create" functions doing
> their own allocations and returning a pointer. Would it be less tricky
> if the structs were declared where we need them and passed to "init"
> functions?

Seems worth trying. The current RT_CREATE() API is also convenient as
other data structure such as simplehash.h and dshash.c supports a
similar

>
> That may be a good idea for other reasons. It's awkward that the
> create function is declared like this:
>
> #ifdef RT_SHMEM
> RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
> dsa_area *dsa,
> int tranche_id);
> #else
> RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
> #endif
>
> An init function wouldn't need these parameters: it could look at the
> passed struct to know what to do.

But the init function would initialize leaf_ctx etc., no? Initializing
leaf_ctx needs max_bytes that is not stored in RT_RADIX_TREE. The same
is true for dsa. I imagined that an init function would allocate a DSA
memory for the control object. So I imagine we will end up still
requiring some of them.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread David Rowley
On Wed, 17 Jan 2024 at 08:51, Michał Kłeczek  wrote:
> I think that’s the main issue: what (semantically) does row_number() mean in 
> that case? You could equally well generate random numbers?

Well, not quite random as at least row_number() would ensure the
number is unique in the result set. The point I think you're trying to
make is very valid though.

To reinforce that point, here's an example of just how undefined the
behaviour that Maique is relying on is:

create table t (a int primary key);
insert into t values(3),(2),(4),(1),(5);

select a,row_number() over() from t; -- Seq Scan
 a | row_number
---+
 3 |  1
 2 |  2
 4 |  3
 1 |  4
 5 |  5

set enable_seqscan=0;
set enable_bitmapscan=0;

select a,row_number() over() from t; -- Index Scan
 a | row_number
---+
 1 |  1
 2 |  2
 3 |  3
 4 |  4
 5 |  5

i.e the row numbers are just assigned in whichever order they're given
to the WindowAgg node.

Maique,

As far as I see your proposal, you want to allow something that is
undefined to be reversed.  I don't think this is a good idea at all.
As mentioned by others, you should have ORDER BY clauses and just add
a DESC.

If you were looking for something to optimize in this rough area, then
perhaps adding some kind of "Backward WindowAgg" node (by overloading
the existing node) to allow queries such as the following to be
executed without an additional sort.

SELECT a,row_number() over (order by a desc) from t order by a;

On the planner side this is likely fairly easy to implement. I
don't think we'd need to generate any additional Paths. We could
invent some pathkeys_contained_in_reverse() function and switch on the
Backward flag if it returns true.

The complexity would be in nodeWindowAgg.c... perhaps too much
complexity for it to be worthwhile while not adding additional
overhead to the non-backward case.

Or, it might be easier to invent "Backward Materialize" instead and
just have the planner use one of those instead of the final sort.

David




Re: Add pgindent test to check if codebase is correctly formatted

2024-01-16 Thread Bruce Momjian
On Tue, Jan 16, 2024 at 07:32:47PM -0600, Tristan Partin wrote:
> It requires changes to at least the Meson build files. pg_bsd_indent is not
> marked for installation currently. There is a TODO there. pgindent has no
> install_data() for instance. pg_bsd_indent seemingly gets installed
> somewhere in autotools given the contents of its Makefile, but I didn't see
> anything in my install tree afterward.
> 
> Sure RPM/DEB packagers can solve this issue downstream, but that doesn't
> help those of us that run "meson install" or "make install" upstream. Packagers
> are probably more likely to package the tools if they are marked for
> installation by upstream too.
> 
> Hope this helps to better explain what changes would be required within the
> Postgres source tree.

Yes, it does, thanks.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Add pgindent test to check if codebase is correctly formatted

2024-01-16 Thread Tristan Partin

On Tue Jan 16, 2024 at 7:27 PM CST, Bruce Momjian wrote:

On Tue, Jan 16, 2024 at 07:22:23PM -0600, Tristan Partin wrote:
> I think a good solution would be to distribute pgindent and pg_bsd_indent.
> At Neon, we are trying to format our extension code using pgindent. I am
> sure there are other extension authors out there too that format using
> pgindent. Distributing pg_bsd_indent and pgindent in the postgresql-devel
> package would be a great help to those of us that pgindent out of tree code.
> It would also have the added benefit of adding the tools to $PREFIX/bin,
> which would make the test that I added not need a hack to get the
> pg_bsd_indent executable.

So your point is that pg_bsd_indent and pgindent are in the source tree,
but not in any package distribution?  Isn't that a packager decision?


It requires changes to at least the Meson build files. pg_bsd_indent is 
not marked for installation currently. There is a TODO there. pgindent 
has no install_data() for instance. pg_bsd_indent seemingly gets 
installed somewhere in autotools given the contents of its Makefile, but 
I didn't see anything in my install tree afterward.


Sure RPM/DEB packagers can solve this issue downstream, but that doesn't 
help those of us that run "meson install" or "make install" upstream. 
Packagers are probably more likely to package the tools if they are 
marked for installation by upstream too.


Hope this helps to better explain what changes would be required within 
the Postgres source tree.


--
Tristan Partin
Neon (https://neon.tech)




Re: Add pgindent test to check if codebase is correctly formatted

2024-01-16 Thread Bruce Momjian
On Tue, Jan 16, 2024 at 07:22:23PM -0600, Tristan Partin wrote:
> I think a good solution would be to distribute pgindent and pg_bsd_indent.
> At Neon, we are trying to format our extension code using pgindent. I am
> sure there are other extension authors out there too that format using
> pgindent. Distributing pg_bsd_indent and pgindent in the postgresql-devel
> package would be a great help to those of us that pgindent out of tree code.
> It would also have the added benefit of adding the tools to $PREFIX/bin,
> which would make the test that I added not need a hack to get the
> pg_bsd_indent executable.

So your point is that pg_bsd_indent and pgindent are in the source tree,
but not in any package distribution?  Isn't that a packager decision?

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Add pgindent test to check if codebase is correctly formatted

2024-01-16 Thread Tristan Partin
Had some time to watch code run through an extensive test suite, so 
thought I would propose this patch that is probably about 75% of the way 
to the stated $subject. I had to add in a hack for Meson, and I couldn't 
figure out a good hack for autotools.


I think a good solution would be to distribute pgindent and 
pg_bsd_indent. At Neon, we are trying to format our extension code using 
pgindent. I am sure there are other extension authors out there too that 
format using pgindent. Distributing pg_bsd_indent and pgindent in the 
postgresql-devel package would be a great help to those of us that 
pgindent out of tree code. It would also have the added benefit of 
adding the tools to $PREFIX/bin, which would make the test that I added 
not need a hack to get the pg_bsd_indent executable.


--
Tristan Partin
Neon (https://neon.tech)
From 6e9ca366b6e4976ae591012150fe77729e37c503 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 16 Jan 2024 19:00:44 -0600
Subject: [PATCH v1 1/2] Allow tests to register command line arguments in
 Meson

---
 meson.build | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/meson.build b/meson.build
index c317144b6b..12ba91109b 100644
--- a/meson.build
+++ b/meson.build
@@ -3284,6 +3284,8 @@ foreach test_dir : tests
 'env': env,
   } + t.get('test_kwargs', {})
 
+  test_args = t.get('args', [])
+
   foreach onetap : t['tests']
 # Make tap test names prettier, remove t/ and .pl
 onetap_p = onetap
@@ -3302,7 +3304,7 @@ foreach test_dir : tests
 '--testname', onetap_p,
 '--', test_command,
 test_dir['sd'] / onetap,
-  ],
+  ] + test_args,
 )
   endforeach
   install_suites += test_group
-- 
Tristan Partin
Neon (https://neon.tech)

From 13902328b24984ab2d18914b63c6874143715f48 Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Tue, 16 Jan 2024 19:03:42 -0600
Subject: [PATCH v1 2/2] Add pgindent test to check if codebase is correctly
 formatted

Should be useful for developers and committers to test code as they
develop, commit, or prepare patches.
---
 src/meson.build  |  2 +-
 src/tools/pgindent/Makefile  | 24 
 src/tools/pgindent/meson.build   | 13 +
 src/tools/pgindent/t/001_pgindent.pl | 19 +++
 4 files changed, 57 insertions(+), 1 deletion(-)
 create mode 100644 src/tools/pgindent/Makefile
 create mode 100644 src/tools/pgindent/meson.build
 create mode 100644 src/tools/pgindent/t/001_pgindent.pl

diff --git a/src/meson.build b/src/meson.build
index 65c7d17d08..0345d2fa79 100644
--- a/src/meson.build
+++ b/src/meson.build
@@ -14,7 +14,7 @@ subdir('pl')
 subdir('interfaces')
 
 subdir('tools/pg_bsd_indent')
-
+subdir('tools/pgindent')
 
 ### Generate a Makefile.global that's complete enough for PGXS to work.
 #
diff --git a/src/tools/pgindent/Makefile b/src/tools/pgindent/Makefile
new file mode 100644
index 00..45d6b71a12
--- /dev/null
+++ b/src/tools/pgindent/Makefile
@@ -0,0 +1,24 @@
+#-
+#
+# src/tools/pgindent/Makefile
+#
+# Copyright (c) 2024, PostgreSQL Global Development Group
+#
+#-
+
+subdir = src/tools/pgindent
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+clean distclean:
+	rm -rf log/ tmp_check/
+
+# Provide this alternate test name to allow testing pgindent without building
+# all of the surrounding Postgres installation.
+.PHONY: test
+
+pg_bsd_indent:
+	$(MAKE) -C ../pg_bsd_indent pg_bsd_indent
+
+test: pg_bsd_indent
+	$(prove_installcheck)
diff --git a/src/tools/pgindent/meson.build b/src/tools/pgindent/meson.build
new file mode 100644
index 00..d31ade11ce
--- /dev/null
+++ b/src/tools/pgindent/meson.build
@@ -0,0 +1,13 @@
+tests += {
+  'name': 'pgindent',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+'args': [
+  pg_bsd_indent.path(),
+],
+'tests': [
+  't/001_pgindent.pl',
+],
+  },
+}
diff --git a/src/tools/pgindent/t/001_pgindent.pl b/src/tools/pgindent/t/001_pgindent.pl
new file mode 100644
index 00..8ee93f4789
--- /dev/null
+++ b/src/tools/pgindent/t/001_pgindent.pl
@@ -0,0 +1,19 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+#
+# Check that all C code is formatted with pgindent
+#
+
+use strict;
+use warnings FATAL => 'all';
+use Test::More;
+use PostgreSQL::Test::Utils qw(command_ok);
+
+if (!$ENV{PG_TEST_EXTRA} || $ENV{PG_TEST_EXTRA} !~ /\bpgindent\b/)
+{
+	plan skip_all => "test pgindent not enabled in PG_TEST_EXTRA";
+}
+
+my $pg_bsd_indent = $ARGV[0];
+command_ok(["./pgindent", "--indent=$pg_bsd_indent", "--check", "--diff"]);
+
+done_testing();
-- 
Tristan Partin
Neon (https://neon.tech)



Re: reorganize "Shared Memory and LWLocks" section of docs

2024-01-16 Thread Bharath Rupireddy
On Tue, Jan 16, 2024 at 9:22 PM Nathan Bossart  wrote:
>
> I fixed this in v4.

LGTM.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Synchronizing slots from primary to standby

2024-01-16 Thread Masahiko Sawada
On Tue, Jan 16, 2024 at 6:40 PM shveta malik  wrote:
>
> On Tue, Jan 16, 2024 at 12:59 PM Masahiko Sawada  
> wrote:
> >
> > On Tue, Jan 16, 2024 at 1:07 PM Amit Kapila  wrote:
> > >
> > > On Tue, Jan 16, 2024 at 9:03 AM shveta malik  
> > > wrote:
> > > >
> > > > On Sat, Jan 13, 2024 at 12:54 PM Amit Kapila  
> > > > wrote:
> > > > >
> > > > > On Fri, Jan 12, 2024 at 5:50 PM shveta malik  
> > > > > wrote:
> > > > > >
> > > > > > There are multiple approaches discussed and tried when it comes to
> > > > > > starting a slot-sync worker. I am summarizing all here:
> > > > > >
> > > > > >  1) Make slotsync worker as an Auxiliary Process (like checkpointer,
> > > > > > walwriter, walreceiver etc). The benefit this approach provides is, 
> > > > > > it
> > > > > > can control begin and stop in a more flexible way as each auxiliary
> > > > > > process could have different checks before starting and can have
> > > > > > different stop conditions. But it needs code duplication for process
> > > > > > management(start, stop, crash handling, signals etc) and currently 
> > > > > > it
> > > > > > does not support db-connection smoothly (none of the auxiliary 
> > > > > > process
> > > > > > has one so far)
> > > > > >
> > > > >
> > > > > As slotsync worker needs to perform transactions and access syscache,
> > > > > we can't make it an auxiliary process as that doesn't initialize the
> > > > > required stuff like syscache. Also, see the comment "Auxiliary
> > > > > processes don't run transactions ..." in AuxiliaryProcessMain() which
> > > > > means this is not an option.
> > > > >
> > > > > >
> > > > > > 2) Make slotsync worker as a 'special' process like AutoVacLauncher
> > > > > > which is neither an Auxiliary process nor a bgworker one. It allows
> > > > > > db-connection and also provides flexibility to have start and stop
> > > > > > conditions for a process.
> > > > > >
> > > > >
> > > > > Yeah, due to these reasons, I think this option is worth considering
> > > > > and another plus point is that this allows us to make enable_syncslot
> > > > > a PGC_SIGHUP GUC rather than a PGC_POSTMASTER.
> > > > >
> > > > > >
> > > > > > 3) Make slotsync worker a bgworker. Here we just need to register 
> > > > > > our
> > > > > > process as a bgworker (RegisterBackgroundWorker()) by providing a
> > > > > > relevant start_time and restart_time and then the process management
> > > > > > is well taken care of. It does not need any code-duplication and
> > > > > > allows db-connection smoothly in registered process. The only thing 
> > > > > > it
> > > > > > lacks is that it does not provide flexibility of having
> > > > > > start-condition which then makes us to have 'enable_syncslot' as
> > > > > > PGC_POSTMASTER parameter rather than PGC_SIGHUP. Having said this, I
> > > > > > feel enable_syncslot is something which will not be changed 
> > > > > > frequently
> > > > > > and with the benefits provided by bgworker infra, it seems a
> > > > > > reasonably good option to choose this approach.
> > > > > >
> > > > >
> > > > > I agree but it may be better to make it a PGC_SIGHUP parameter.
> > > > >
> > > > > > 4) Another option is to have Logical Replication Launcher(or a new
> > > > > > process) to launch slot-sync worker. But going by the current design
> > > > > > where we have only 1 slotsync worker, it may be an overhead to have 
> > > > > > an
> > > > > > additional manager process maintained.
> > > > > >
> > > > >
> > > > > I don't see any good reason to have an additional launcher process 
> > > > > here.
> > > > >
> > > > > >
> > > > > > Thus weighing pros and cons of all these options, we have currently
> > > > > > implemented the bgworker approach (approach 3).  Any feedback is
> > > > > > welcome.
> > > > > >
> > > > >
> > > > > I vote to go for (2) unless we face difficulties in doing so but (3)
> > > > > is also okay especially if others also think so.
> > > >
> > > > I am not against any of the approaches but I still feel that when we
> > > > have a standard way of doing things (bgworker) we should not keep
> > > > adding code to do things in a special way unless there is a strong
> > > > reason to do so. Now we need to decide if 'enable_syncslot' being
> > > > PGC_POSTMASTER is a strong reason to go the non-standard way?
> > > >
> > >
> > > Agreed and as said earlier I think it is better to make it a
> > > PGC_SIGHUP. Also, not sure we can say it is a non-standard way as
> > > already autovacuum launcher is handled in the same way. One more minor
> > > thing is it will save us for having a new bgworker state
> > > BgWorkerStart_ConsistentState_HotStandby as introduced by this patch.
> >
> > Why do we need to add a new BgWorkerStart_ConsistentState_HotStandby
> > for the slotsync worker? Isn't it sufficient that the slotsync worker
> > exits if not in hot standby mode?
>
> It is doable, but that will mean starting slot-sync worker even on
> primary on every server restart which does not seem like a good idea.

Re: Fix a typo of func DecodeInsert()

2024-01-16 Thread Richard Guo
On Wed, Jan 17, 2024 at 8:47 AM Yongtao Huang 
wrote:

> Hi all,
> I think the comment above the function DecodeInsert()
> in src/backend/replication/logical/decode.c should be
+ * Inserts can contain the new tuple.
, rather than
- * Deletes can contain the new tuple.
>

Nice catch.  +1.

I kind of wonder if it would be clearer to state that "XLOG_HEAP_INSERT
can contain the new tuple", in order to differentiate it from
XLOG_HEAP2_MULTI_INSERT.

Thanks
Richard


Fix a typo of func DecodeInsert()

2024-01-16 Thread Yongtao Huang
Hi all,
I think the comment above the function DecodeInsert()
in src/backend/replication/logical/decode.c should be
+ * Inserts can contain the new tuple.
, rather than
- * Deletes can contain the new tuple.

Please correct me if I'm wrong, thanks a lot.


0001-Fix-a-typo-of-func-DecodeInsert.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-01-16 Thread Peter Smith
About v62-0001:

As stated in the patch comment:
But note that this commit does not yet include the capability to
actually sync the replication slot; the next patch will address that.

~~~

Because of this, I think it might be prudent to separate the
documentation portion from this patch so that it can be pushed later
when the actual synchronize capability also gets pushed.

It would not be good for the PG documentation on HEAD to be describing
behaviour that does not yet exist. (e.g. if patch 0001 is pushed
early, but then there is some delay or problems getting the subsequent
patches committed).

==
Kind Regards,
Peter Smith.
Fujitsu Australia




RE: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread Maiquel Grassi
But as you are the one arguing for the new feature, demonstrating that the 
status quo is deficient is your job.

--//--

I performed three tests (take a look below), quite simple but functional, 
so that we can get an idea of the performance. Apparently, we have a higher 
cost using "count(*) - row_number() + 1" than using "row_number_desc() 
over()".

Perhaps, if we think in terms of SQL standards, my suggested name may not have 
been the best. The name could be anything else. I don't have another 
suggestion. Does anyone have a better one? I leave it open for others to also 
reflect.



postgres=# select * into public.foo_1 from generate_series(1,100);
SELECT 100
postgres=# explain analyze select count(*) over() - row_number() over() + 1 
from public.foo_1;
  QUERY PLAN
--
 WindowAgg  (cost=0.00..38276.25 rows=1128375 width=8) (actual 
time=244.878..475.595 rows=100 loops=1)
   ->  Seq Scan on foo_1  (cost=0.00..15708.75 rows=1128375 width=0) (actual 
time=0.033..91.486 rows=100 loops=1)
 Planning Time: 0.073 ms
 Execution Time: 505.375 ms
(4 rows)

postgres=# explain analyze select row_number_desc() over() from public.foo_1;
  QUERY PLAN
--
 WindowAgg  (cost=0.00..26925.00 rows=100 width=8) (actual 
time=141.107..427.100 rows=100 loops=1)
   ->  Seq Scan on foo_1  (cost=0.00..14425.00 rows=100 width=0) (actual 
time=0.031..61.651 rows=100 loops=1)
 Planning Time: 0.051 ms
 Execution Time: 466.535 ms
(4 rows)



postgres=# select * into public.foo_2 from generate_series(1,1000);
SELECT 1000
postgres=# explain analyze select count(*) over() - row_number() over() + 1 
from public.foo_2;
   QUERY PLAN
-
 WindowAgg  (cost=0.00..344247.31 rows=977 width=8) (actual 
time=2621.014..5145.325 rows=1000 loops=1)
   ->  Seq Scan on foo_2  (cost=0.00..144247.77 rows=977 width=0) (actual 
time=0.031..821.533 rows=1000 loops=1)
 Planning Time: 0.085 ms
 Execution Time: 5473.422 ms
(4 rows)

postgres=# explain analyze select row_number_desc() over() from public.foo_2;
   QUERY PLAN
-
 WindowAgg  (cost=0.00..269247.48 rows=977 width=8) (actual 
time=1941.915..4527.896 rows=1000 loops=1)
   ->  Seq Scan on foo_2  (cost=0.00..144247.77 rows=977 width=0) (actual 
time=0.029..876.802 rows=1000 loops=1)
 Planning Time: 0.030 ms
 Execution Time: 4871.278 ms
(4 rows)




postgres=# select * into public.foo_3 from generate_series(1,1);
SELECT 1
postgres=# explain analyze select count(*) over() - row_number() over() + 1 
from public.foo_3;
  QUERY PLAN
---
 WindowAgg  (cost=0.00..3827434.70 rows=112831890 width=8) (actual 
time=56823.080..84295.660 rows=1 loops=1)
   ->  Seq Scan on foo_3  (cost=0.00..1570796.90 rows=112831890 width=0) 
(actual time=1.010..37735.121 rows=1 loops=1)
 Planning Time: 1.018 ms
 Execution Time: 87677.572 ms
(4 rows)

postgres=# explain analyze select row_number_desc() over() from public.foo_3;
   QUERY PLAN

 WindowAgg  (cost=0.00..2981195.53 rows=112831890 width=8) (actual 
time=29523.037..55517.349 rows=1 loops=1)
   ->  Seq Scan on foo_3  (cost=0.00..1570796.90 rows=112831890 width=0) 
(actual time=12.638..19050.614 rows=1 loops=1)
 Planning Time: 55.653 ms
 Execution Time: 59001.423 ms
(4 rows)



Regards,
Maiquel.


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-16 Thread John Naylor
On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada  wrote:
> Just changing "items" to be the local tidstore struct could make the
> code tricky a bit, since max_bytes and num_items are on the shared
> memory while "items" is a local pointer to the shared tidstore.

Thanks for trying it this way! I like the overall simplification but
this aspect is not great.
Hmm, I wonder if that's a side-effect of the "create" functions doing
their own allocations and returning a pointer. Would it be less tricky
if the structs were declared where we need them and passed to "init"
functions?

That may be a good idea for other reasons. It's awkward that the
create function is declared like this:

#ifdef RT_SHMEM
RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
dsa_area *dsa,
int tranche_id);
#else
RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
#endif

An init function wouldn't need these parameters: it could look at the
passed struct to know what to do.




Re: Improve the connection failure error messages

2024-01-16 Thread Peter Smith
On Sat, Jan 13, 2024 at 12:36 AM Aleksander Alekseev
 wrote:
>
> Hi,
>
> Thanks for the patch.
>
> > Due to this behavior, it is not possible to add a test to show the
> > error message as it is done for CREATE SUBSCRIPTION.
> > Let me know if you think there is another way to add this test.
>
> I believe it can be done with TAP tests, see for instance:
>
> contrib/auto_explain/t/001_auto_explain.pl
>
> However I wouldn't insist on including the test in scope of this
> particular patch. Such a test doesn't currently exist, it can be added
> as a separate patch, and whether this is actually a useful test (all
> the tests consume time after all...) is somewhat debatable. Personally
> I agree that it would be nice to have though.
>
> This being said...
>
> > The ALTER SUBSCRIPTION command does not error out on the user
> > interface if updated with a bad connection string and the connection
> > failure error can only be seen in the respective log file.
>
> I wonder if we should fix this. Again, not necessarily in scope of
> this work, but if this is not a difficult task, again, it would be
> nice to have.
>
> Other than that v2 looks good.
>

OK.  I see now that any ALTER of the subscription's connection, even
to some value that fails, will restart a new worker (like ALTER of any
other subscription parameters). For a bad connection, it will continue
to relaunch-worker/ERROR over and over.

test_sub=# \r2024-01-17 09:34:28.665 AEDT [11274] LOG:  logical
replication apply worker for subscription "sub4" has started
2024-01-17 09:34:28.666 AEDT [11274] ERROR:  could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:28.667 AEDT [928] LOG:  background worker "logical
replication apply worker" (PID 11274) exited with exit code 1
dRs su2024-01-17 09:34:33.669 AEDT [11391] LOG:  logical replication
apply worker for subscription "sub4" has started
2024-01-17 09:34:33.669 AEDT [11391] ERROR:  could not connect to the
publisher: invalid port number: "-1"
2024-01-17 09:34:33.670 AEDT [928] LOG:  background worker "logical
replication apply worker" (PID 11391) exited with exit code 1
b4
...

I don't really have any opinion if that behaviour is good or bad, but
anyway, it is deliberate, and IMO it is outside the scope of your
patch, so v2 patch LGTM.

~~

BTW, while experimenting with the bad connection ALTER I also tried
setting 'disable_on_error' like below:

ALTER SUBSCRIPTION sub4 SET (disable_on_error);
ALTER SUBSCRIPTION sub4 CONNECTION 'port = -1';

...but here the subscription did not become DISABLED as I expected it
would do on the next connection error iteration. It remains enabled
and just continues to loop relaunch/ERROR indefinitely same as before.

That looks like it may be a bug. Thoughts?

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-16 Thread Melanie Plageman
On Tue, Jan 16, 2024 at 2:23 PM Robert Haas  wrote:
>
> On Tue, Jan 16, 2024 at 11:28 AM Melanie Plageman
>  wrote:
> > All LGTM.
>
> Committed.

Attached v8 patch set is rebased over this.

In 0004, I've taken the approach you seem to favor and combined the FSM
updates from the prune and no prune cases in lazy_scan_heap() instead
of pushing the FSM updates into lazy_scan_prune() and
lazy_scan_noprune().

I did not guard against calling lazy_scan_new_or_empty() a second time
in the case that lazy_scan_noprune() was called. I can do this. I
mentioned upthread I found it confusing for lazy_scan_new_or_empty()
to be guarded by if (do_prune). The overhead of calling it wouldn't be
terribly high. I can change that based on your opinion of what is
better.

The big open question/functional change is when we consider vacuuming
the FSM. Previously, we only considered vacuuming the FSM in the no
indexes, has dead items case. After combining the FSM updates from
lazy_scan_prune()'s no indexes/has lpdead items case,
lazy_scan_prune()'s regular case, and lazy_scan_noprune(), all of them
consider vacuuming the FSM. I could guard against this, but I wasn't
sure why we would want to only vacuum the FSM in the no indexes/has
dead items case.

I also noticed while rebasing something I missed while reviewing
45d395cd75ffc5b -- has_lpdead_items is set in a slightly different
place in lazy_scan_noprune() than lazy_scan_prune() (with otherwise
identical surrounding code). Both are correct, but it would have been
nice for them to be the same. If the patches below are committed, we
could standardize on the location in lazy_scan_noprune().


- Melanie
From 1aaf95e70997f363b4783d023f92ad16f796e7b4 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Fri, 5 Jan 2024 11:12:58 -0500
Subject: [PATCH v8 2/4] Move VM update code to lazy_scan_prune()

The visibility map is updated after lazy_scan_prune() returns in
lazy_scan_heap(). Most of the output parameters of lazy_scan_prune() are
used to update the VM in lazy_scan_heap(). Moving that code into
lazy_scan_prune() simplifies lazy_scan_heap() and requires less
communication between the two. This is a pure lift and shift. No VM
update logic is changed. All but one of the members of the prunestate
are now obsolete. It will be removed in a future commit.
---
 src/backend/access/heap/vacuumlazy.c | 226 ++-
 1 file changed, 115 insertions(+), 111 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index c7ef879be4e..6535d8bdeea 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -249,6 +249,7 @@ static bool lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf,
    bool sharelock, Buffer vmbuffer);
 static void lazy_scan_prune(LVRelState *vacrel, Buffer buf,
 			BlockNumber blkno, Page page,
+			Buffer vmbuffer, bool all_visible_according_to_vm,
 			LVPagePruneState *prunestate);
 static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
 			  BlockNumber blkno, Page page,
@@ -1032,117 +1033,9 @@ lazy_scan_heap(LVRelState *vacrel)
 		 * tuple headers of remaining items with storage. It also determines
 		 * if truncating this block is safe.
 		 */
-		lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
-
-		Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
-
-		/*
-		 * Handle setting visibility map bit based on information from the VM
-		 * (as of last lazy_scan_skip() call), and from prunestate
-		 */
-		if (!all_visible_according_to_vm && prunestate.all_visible)
-		{
-			uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
-
-			if (prunestate.all_frozen)
-			{
-Assert(!TransactionIdIsValid(prunestate.visibility_cutoff_xid));
-flags |= VISIBILITYMAP_ALL_FROZEN;
-			}
-
-			/*
-			 * It should never be the case that the visibility map page is set
-			 * while the page-level bit is clear, but the reverse is allowed
-			 * (if checksums are not enabled).  Regardless, set both bits so
-			 * that we get back in sync.
-			 *
-			 * NB: If the heap page is all-visible but the VM bit is not set,
-			 * we don't need to dirty the heap page.  However, if checksums
-			 * are enabled, we do need to make sure that the heap page is
-			 * dirtied before passing it to visibilitymap_set(), because it
-			 * may be logged.  Given that this situation should only happen in
-			 * rare cases after a crash, it is not worth optimizing.
-			 */
-			PageSetAllVisible(page);
-			MarkBufferDirty(buf);
-			visibilitymap_set(vacrel->rel, blkno, buf, InvalidXLogRecPtr,
-			  vmbuffer, prunestate.visibility_cutoff_xid,
-			  flags);
-		}
-
-		/*
-		 * As of PostgreSQL 9.2, the visibility map bit should never be set if
-		 * the page-level bit is clear.  However, it's possible that the bit
-		 * got cleared after lazy_scan_skip() was called, so we must recheck
-		 * with buffer lock before concluding that the VM is corrupt.
-		 */
-		else if 

Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-16 Thread Robert Haas
On Tue, Jan 16, 2024 at 4:28 PM Jim Nasby  wrote:
> On 1/12/24 12:45 PM, Robert Haas wrote:
> > P.P.S. to everyone: Yikes, this logic is really confusing.
>
> Having studied all this code several years ago when it was even simpler
> - it was *still* very hard to grok even back then. I *greatly
> appreciate* the effort that y'all are putting into increasing the
> clarity here.

Thanks. And yeah, I agree.

> BTW, back in the day the whole "no indexes" optimization was a really
> tiny amount of code... I think it amounted to 2 or 3 if statements. I
> haven't yet attempted to grok this patchset, but I'm definitely
> wondering how much it's worth continuing to optimize that case. Clearly
> it'd be very expensive to memoize dead tuples just to trawl that list a
> single time to clean the heap, but outside of that I'm not sure other
> optimazations are worth it given the amount of code
> complexity/duplication they seem to require - especially for code where
> correctness is so crucial.

Personally, I don't think throwing away that optimization is the way
to go. The idea isn't intrinsically complicated, I believe. It's just
that the code has become messy because of too many hands touching it.
At least, that's my read.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: index prefetching

2024-01-16 Thread Jim Nasby

On 1/16/24 2:10 PM, Konstantin Knizhnik wrote:
Amazon RDS is just vanilla Postgres with file system mounted on EBS 
(Amazon  distributed file system).

EBS provides good throughput but larger latencies comparing with local SSDs.
I am not sure if read-ahead works for EBS.


Actually, EBS only provides a block device - it's definitely not a 
filesystem itself (*EFS* is a filesystem - but it's also significantly 
different than EBS). So as long as readahead is happening somewhere 
above the block device, I would expect it to JustWork on EBS.


Of course, Aurora Postgres (like Neon) is completely different. If you 
look at page 53 of [1] you'll note that there's two different terms 
used: prefetch and batch. I'm not sure how much practical difference 
there is, but batched IO (one IO request to Aurora Storage for many 
blocks) predates index prefetch; VACUUM in APG has used batched IO for a 
very long time (it also *only* reads blocks that aren't marked all 
visible/frozen; none of the "only skip if skipping at least 32 blocks" 
logic is used).


1: 
https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Deep_dive_on_Amazon_Aurora_with_PostgreSQL_compatibility_DAT328-R1.pdf

--
Jim Nasby, Data Architect, Austin TX





RE: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread Maiquel Grassi
The people in this community are quite capable and willing to write a contrary 
opinion to mine.  Not sure how to make “this new proposed function shouldn’t be 
added to core”, and trying to explain why not, non-oppressive.  I can add 
“thank you for taking the time to try and improve PostgreSQL” in front to 
soften the blow of rejection but I tend to just get to the point.

David J.

//

Thank you for your opinion. Together we have built one more insight into 
PostgreSQL for the community.

Best regards,
Maiquel O.


Re: [PATCH] Add support function for containment operators

2024-01-16 Thread Tom Lane
jian he  writes:
> [ v5-0001-Simplify-containment-in-range-constants-with-supp.patch ]

I spent some time reviewing and cleaning up this code.  The major
problem I noted was that it doesn't spend any effort thinking about
cases where it'd be unsafe or undesirable to apply the transformation.
In particular, it's entirely uncool to produce a double-sided
comparison if the elemExpr is volatile.  These two expressions
do not have the same result:

select random() <@ float8range(0.1, 0.2);
select random() >= 0.1 and random() < 0.2;

(Yes, I'm aware that BETWEEN is broken in this respect.  All the
more reason why we mustn't break <@.)

Another problem is that even if the elemExpr isn't volatile,
it might be expensive enough that evaluating it twice is bad.
I am not sure where we ought to put the cutoff.  There are some
existing places where we set a 10X cpu_operator_cost limit on
comparable transformations, so I stole that logic in the attached.
But perhaps someone has an argument for a different rule?
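
Concretely, the sort of guard I'm talking about is roughly this (a sketch,
not necessarily character-for-character what's in the attached):

    /*
     * Refuse to rewrite "elem <@ range" into a double-sided comparison if
     * the element expression is volatile, or looks too expensive to
     * evaluate twice.
     */
    QualCost	eval_cost;

    if (contain_volatile_functions((Node *) elemExpr))
        return NULL;
    cost_qual_eval_node(&eval_cost, (Node *) elemExpr, root);
    if (eval_cost.per_tuple > 10 * cpu_operator_cost)
        return NULL;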

Anyway, pending discussion of that point, I think the code is good
to go.  I don't like the test cases much though: they expend many more
cycles than necessary.  You could prove the same points just by
looking at the expansion of expressions, eg.

regression=# explain (verbose, costs off) select current_date <@ 
daterange(null,null);
   QUERY PLAN   

 Result
   Output: true
(2 rows)

regression=# explain (verbose, costs off) select current_date <@ 
daterange('-Infinity', '1997-04-10'::date, '[)');
   QUERY PLAN   
 
-
 Result
   Output: ((CURRENT_DATE >= '-infinity'::date) AND (CURRENT_DATE < 
'1997-04-10'::date))
(2 rows)

I'd suggest losing the temp table and just coding tests like these.

regards, tom lane

diff --git a/src/backend/utils/adt/rangetypes.c b/src/backend/utils/adt/rangetypes.c
index d99b00b590..2d94a6b877 100644
--- a/src/backend/utils/adt/rangetypes.c
+++ b/src/backend/utils/adt/rangetypes.c
@@ -30,19 +30,20 @@
  */
 #include "postgres.h"
 
-#include "access/tupmacs.h"
 #include "common/hashfn.h"
-#include "lib/stringinfo.h"
 #include "libpq/pqformat.h"
 #include "miscadmin.h"
+#include "nodes/makefuncs.h"
 #include "nodes/miscnodes.h"
-#include "port/pg_bitutils.h"
+#include "nodes/supportnodes.h"
+#include "optimizer/clauses.h"
+#include "optimizer/cost.h"
+#include "optimizer/optimizer.h"
 #include "utils/builtins.h"
 #include "utils/date.h"
 #include "utils/lsyscache.h"
 #include "utils/rangetypes.h"
 #include "utils/timestamp.h"
-#include "varatt.h"
 
 
 /* fn_extra cache entry for one of the range I/O functions */
@@ -69,6 +70,12 @@ static Size datum_compute_size(Size data_length, Datum val, bool typbyval,
 			   char typalign, int16 typlen, char typstorage);
 static Pointer datum_write(Pointer ptr, Datum datum, bool typbyval,
 		   char typalign, int16 typlen, char typstorage);
+static Node *find_simplified_clause(PlannerInfo *root,
+	Expr *rangeExpr, Expr *elemExpr);
+static Expr *build_bound_expr(Expr *elemExpr, Datum val,
+			  bool isLowerBound, bool isInclusive,
+			  TypeCacheEntry *typeCache,
+			  Oid opfamily, Oid rng_collation);
 
 
 /*
@@ -2173,6 +2180,58 @@ make_empty_range(TypeCacheEntry *typcache)
 	return make_range(typcache, &lower, &upper, true, NULL);
 }
 
+/*
+ * Planner support function for elem_contained_by_range (<@ operator).
+ */
+Datum
+elem_contained_by_range_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+	Node	   *ret = NULL;
+
+	if (IsA(rawreq, SupportRequestSimplify))
+	{
+		SupportRequestSimplify *req = (SupportRequestSimplify *) rawreq;
+		FuncExpr   *fexpr = req->fcall;
+		Expr	   *leftop,
+   *rightop;
+
+		Assert(list_length(fexpr->args) == 2);
+		leftop = linitial(fexpr->args);
+		rightop = lsecond(fexpr->args);
+
+		ret = find_simplified_clause(req->root, rightop, leftop);
+	}
+
+	PG_RETURN_POINTER(ret);
+}
+
+/*
+ * Planner support function for range_contains_elem (@> operator).
+ */
+Datum
+range_contains_elem_support(PG_FUNCTION_ARGS)
+{
+	Node	   *rawreq = (Node *) PG_GETARG_POINTER(0);
+	Node	   *ret = NULL;
+
+	if (IsA(rawreq, SupportRequestSimplify))
+	{
+		SupportRequestSimplify *req = (SupportRequestSimplify *) rawreq;
+		FuncExpr   *fexpr = req->fcall;
+		Expr	   *leftop,
+   *rightop;
+
+		Assert(list_length(fexpr->args) == 2);
+		leftop = linitial(fexpr->args);
+		rightop = lsecond(fexpr->args);
+
+		ret = find_simplified_clause(req->root, leftop, rightop);
+	}
+
+	PG_RETURN_POINTER(ret);
+}
+
 
 /*
  *--
@@ -2715,3 +2774,180 @@ datum_write(Pointer ptr, Datum datum, bool typbyval, char typalign,
 
 	return ptr;
 }
+
+/*
+ * Common code for the elem_contained_by_range and range_contains_elem
+ * support functions.  The caller has 

Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-16 Thread Jim Nasby

On 1/12/24 12:45 PM, Robert Haas wrote:

P.P.S. to everyone: Yikes, this logic is really confusing.


Having studied all this code several years ago when it was even simpler 
- it was *still* very hard to grok even back then. I *greatly 
appreciate* the effort that y'all are putting into increasing the 
clarity here.


BTW, back in the day the whole "no indexes" optimization was a really 
tiny amount of code... I think it amounted to 2 or 3 if statements. I 
haven't yet attempted to grok this patchset, but I'm definitely 
wondering how much it's worth continuing to optimize that case. Clearly 
it'd be very expensive to memoize dead tuples just to trawl that list a 
single time to clean the heap, but outside of that I'm not sure other 
optimazations are worth it given the amount of code 
complexity/duplication they seem to require - especially for code where 
correctness is so crucial.

--
Jim Nasby, Data Architect, Austin TX





Re: UUID v7

2024-01-16 Thread Jelte Fennema-Nio
On Tue, 16 Jan 2024 at 22:02, Przemysław Sztoch  wrote:
> But replace uuidv7_extract_time(uuid) with uuid_extract_time(uuid) - the function
> should be able to extract the timestamp from v1/v6/v7

I'm fine with this.

> I would highly recommend to add:
> uuidv5(namespace uuid, name text) -> uuid
> using uuid_generate_v5 from uuid-ossp extension 
> (https://www.postgresql.org/docs/current/uuid-ossp.html)
> There is an important version and it should be included into the main PG code.

I think adding more uuid versions would probably be desirable. But I
don't think it makes sense to clutter this patchset with that. I feel
like on this uuidv7 patchset we've had enough discussion that it could
reasonably get into PG17, but I think adding even more uuid versions
to this patchset would severely reduce the chances of that happening.




Re: UUID v7

2024-01-16 Thread Przemysław Sztoch

Another question: how did you choose between using TimestampTz and
Timestamp types? I realize that internally it's all the same. Maybe
Timestamp will be slightly better since the way it is displayed
doesn't depend on the session settings. Many people I talked to find
this part of TimestampTz confusing.

timestamptz internally always stores UTC.
I believe that in SQL, when operating with time in UTC, you should 
always use timestamptz.
timestamp is theoretically the same thing, but internally it does not 
convert the time to UTC, which will lead to incorrect use.

--
Przemysław Sztoch | Mobile +48 509 99 00 66


Re: UUID v7

2024-01-16 Thread Przemysław Sztoch

Jelte Fennema-Nio wrote on 1/16/2024 9:25 PM:

On Tue, 16 Jan 2024 at 19:17, Andrey M. Borodin  wrote:
So currently my preference for the function names would be:

- uuidv4() -> alias for gen_random_uuid()
- uuidv7()
- uuidv7(timestamptz)
- uuid_extract_ver(uuid)
- uuid_extract_var(uuid)
- uuidv7_extract_time(uuid)

+1
But replace uuidv7_extract_time(uuid) with uuid_extract_time(uuid) - the 
function should be able to extract the timestamp from v1/v6/v7


I would highly recommend to add:
uuidv5(namespace uuid, name text) -> uuid
using uuid_generate_v5 from uuid-ossp extension 
(https://www.postgresql.org/docs/current/uuid-ossp.html)
There is an important version and it should be included into the main PG 
code.


Jelte: Please propose the name of the function that will convert uuid 
from version 1 to 6.
v6 is almost as good as v7 for indexes. And v6 allows you to convert 
from v1 which some people use.

--
Przemysław Sztoch | Mobile +48 509 99 00 66


Re: cleanup patches for incremental backup

2024-01-16 Thread Robert Haas
On Tue, Jan 16, 2024 at 3:22 PM Matthias van de Meent
 wrote:
> > One other thought is that the incremental backup only replaces
> > relation files with incremental files, and it never does anything
> > about FSM files. So the statement that it only contains data that was
> > potentially changed isn't quite correct. It might be better to phrase
> > it the other way around i.e. it is like a full backup, except that
> > some files can be replaced by incremental files which omit blocks to
> > which no WAL-logged changes have been made.
>
> How about the attached?

I like the direction.

+ A special base backup
+ that for some WAL-logged relations only contains the pages that were
+ modified since a previous backup, as opposed to the full relation data of
+ normal base backups. Like base backups, it is generated by the tool
+ .

Could we say "that for some files may contain only those pages that
were modified since a previous backup, as opposed to the full contents
of every file"? My thoughts are (1) there's no hard guarantee that an
incremental backup will replace even one file with an incremental
file, although in practice it is probably almost always going to
happen and (2) pg_combinebackup would actually be totally fine with
any file at all being sent incrementally; it's only that the server
isn't smart enough to figure out how to do this with e.g. SLRU data
right now.

+ To restore incremental backups the tool 
+ is used, which combines the incremental backups with a base backup and
+ WAL to restore a
+ database cluster to
+ a consistent state.

I wondered if this needed to be clearer that the chain of backups
could have length > 2. But on further reflection, I think it's fine,
unless you feel otherwise.

The rest LGTM.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Revise the Asserts added to bimapset manipulation functions

2024-01-16 Thread David Rowley
On Tue, 16 Jan 2024 at 21:00, Richard Guo  wrote:
> Thank you so much for all the work you have put into making this patch
> perfect.  I reviewed through the v3 patch and I do not have further
> comments.  I think it's in good shape now.

Thanks for looking again.  I pushed the patch after removing some
comments mentioning "these operations".  I found these not to be very
useful and also misleading.

David




New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread David G. Johnston
On Tuesday, January 16, 2024, Maiquel Grassi  wrote:

> However, initially, I have one more obstacle in your feedback. If I use
> count(*) over() - row_number() over(), it gives me an offset of one unit.
> To resolve this, I need to add 1.
>
>
> This way, simulating a reverse row_number() becomes even more laborious.
>
>
> I don’t really understand why you think this reverse inserted counting is
> even a good idea so I don’t really care how laborious it is to implement
> with existing off-the-shelf tools.  A window function named “descending” is
> non-standard and seemingly non-sensical and should not be added.  You can
> specify order by in the over clause and that is what you should be doing.
> Mortgage payments are usually monthly, so order by date.
>
> David J.
>
> --//--
>
> We are just raising hypotheses and discussing healthy possibilities here.
> This is a suggestion for knowledge and community growth. Note that this is
> not about a new "feature patch.
>
>
That is not how your initial post here came across.  It seemed quite
concrete in goal and use case motivating that goal.


> I am asking for the community's opinion in general. Your responses are
> largely appearing aggressive and depreciative. Kindly request you to be
> more welcoming in your answers and not oppressive. This way, the community
> progresses more rapidly..
>
>
The people in this community are quite capable and willing to write a
contrary opinion to mine.  Not sure how to make “this new proposed function
shouldn’t be added to core”, and trying to explain why not,
non-oppressive.  I can add “thank you for taking the time to try and
improve PostgreSQL” in front to soften the blow of rejection but I tend to
just get to the point.

David J.


Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-16 Thread Robert Haas
On Tue, Jan 16, 2024 at 8:43 AM Jelte Fennema-Nio  wrote:
> I haven't removed 0008 yet, since I'd like some feedback first if that
> makes sense. But I did add two new patches in the middle of the
> patchset (which shift the later patch numbers by 2):
>
> 0007: Adds a new \parameterset meta-command to psql to be able to more
> easily test the new ParameterSet message
>
> 0008: Shows warning in psql if the server is not supporting the newest
> protocol version.

Sorry for not getting back to this right away; there are quite a few
threads competing for my attention.

I think it's flat-out not viable to bump the protocol version to 3.1
any time in the next few years. NegotiateProtocolVerison has existed
since ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed (2017-11-21, 11.0,
10.2, 9.6.7, 9.5.11, 9.4.16, 9.3.21), but libpq didn't handle it until
bbf9c282ce92272ed7bf6771daf0f9efa209e61b (2022-11-17, 16.0) -- and
even that handling doesn't really seem like what we want, because it
looks like it will reject anything where the protocol version doesn't
match exactly, rather than downgrading. To fix that, I think we need
some parts of what you've got in 0002, where we have an earliest and a
latest minor protocol version that we'll accept, and the server is
allowed to downgrade from the latest thing we support, just as long as
they don't try to downgrade below the earliest thing we support.
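
In pseudo-code, the client-side acceptance rule I'm describing amounts to
something like this (illustrative only, not actual libpq code; the names
are made up):

    /* server answered NegotiateProtocolVersion with "their_version" */
    if (their_version > requested_version ||
        their_version < earliest_supported_version)
    {
        /* outside the range we can speak: fail the connection attempt */
    }
    else
    {
        /* accept the server's (possibly lower) minor protocol version */
        conn->pversion = their_version;
    }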

But I think we would want to have those changes in all supported
branches before we could think of requesting 3.1 or higher by default.
Imagine that in v17 we both added server support for protocol version
3.1 and also adopted your 0001. Then, a v17 libpq would fail to
connect to a v16 or earlier PostgreSQL instance. In effect, it would
be a complete wire compatibility break. There's no way that such a
patch is going to be acceptable. If I were to commit a patch from you
or anyone else that does that, many angry people would show up to yell
at both of us. So unless I'm misunderstanding the situation, 0001 is
pretty much dead on arrival for now and for quite a while to come.
That doesn't necessarily mean that we couldn't *optionally* request
3.1, e.g. controlled by a connection keyword. I would imagine that the
user would write e.g. 'user=rhaas password=banana protocolroles=true'
and libpq would say "oh, because the user wanted protocolroles=true I
need to request at least 3.1" -- but if that weren't there, the server
would still request only 3.0 and nothing would break.

Also, I'm pretty doubtful that we want
PQunsupportedProtocolExtensions(). That seems like allowing the client
to have too much direct visibility into what's happening at the
protocol level. That kind of interface might make sense if we were
trying to support unknown protocol extensions from third parties, but
for stuff in core, I think there should be specific APIs that relate
to specific features e.g. you call PQsetWireProtocolRole(char
*whatever) and it returns success or failure according to whether that
capability is available without telling you how that's negotiated on
the wire.

So in summary, I think parts of 0002 are a good idea, but 0001 is not realistic.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread Maiquel Grassi
However, initially, I have one more obstacle in your feedback. If I use 
count(*) over() - row_number() over(), it gives me an offset of one unit. To 
resolve this, I need to add 1.

This way, simulating a reverse row_number() becomes even more laborious.

I don’t really understand why you think this reverse inserted counting is even 
a good idea so I don’t really care how laborious it is to implement with 
existing off-the-shelf tools.  A window function named “descending” is 
non-standard and seemingly non-sensical and should not be added.  You can 
specify order by in the over clause and that is what you should be doing.  
Mortgage payments are usually monthly, so order by date.

David J.

--//--

We are just raising hypotheses and discussing healthy possibilities here. This 
is a suggestion for knowledge and community growth. Note that this is not about 
a new "feature patch." I am asking for the community's opinion in general. Your 
responses are largely appearing aggressive and depreciative. Kindly request you 
to be more welcoming in your answers and not oppressive. This way, the 
community progresses more rapidly.

Maiquel.



Re: UUID v7

2024-01-16 Thread Jelte Fennema-Nio
On Tue, 16 Jan 2024 at 19:17, Andrey M. Borodin  wrote:
> Jelte, what is your opinion on naming the function which extracts timestamp 
> from UUID v7?

I looked at a few more datatypes: json, jsonb & hstore. The get_
prefix is not used there at all, so I'm still opposed to that. But
they seem to use either an _to_ or an _extract_ infix. _to_ is then
used for conversion of the whole object, and _extract_ is used to
extract a subset. So I think _extract_ would fit well here.
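
For reference, both spellings can be seen in existing core functions, for example:

```
-- _extract_ pulls a part out of the value
SELECT json_extract_path('{"a": {"b": 1}}'::json, 'a', 'b');   -- returns 1

-- _to_ converts the whole value into another shape
SELECT * FROM json_to_recordset('[{"x": 1}, {"x": 2}]') AS t(x int);
```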

On Fri, 5 Jan 2024 at 11:57, Sergey Prokhorenko
 wrote:
> When naming functions, I would advise using the shorter abbreviation uuidv7 
> from the new version of the RFC instead of uuid_v7.

I also agree with that, uuid_v7 looks weird to my eyes. The RFC also
abbreviates them as UUIDv7 (without a space).

The more I look at it the more I also think the gen_ prefix is quite
strange, and I already thought the gen_random_uuid name was quite
weird. But now that we will also have a uuidv7 I think it's even
stranger that one uses the name from the RFC.

The name of gen_random_uuid was taken verbatim from pgcrypto, without
any discussion on the list[0]:

> Here is a proposed patch for this. I did a fair bit of looking around
> in other systems for a naming pattern but didn't find anything
> consistent. So I ended up just taking the function name and code from
> pgcrypto.


So currently my preference for the function names would be:

- uuidv4() -> alias for gen_random_uuid()
- uuidv7()
- uuidv7(timestamptz)
- uuid_extract_ver(uuid)
- uuid_extract_var(uuid)
- uuidv7_extract_time(uuid)

[0]: 
https://www.postgresql.org/message-id/flat/6a65610c-46fc-2323-6b78-e8086340a325%402ndquadrant.com#76e40e950a44aa8b6844297e8d2efe2c




Re: Fix a possible socket leak at Windows (src/backend/port/win32/socket.c)

2024-01-16 Thread Ranier Vilela
Em seg., 15 de jan. de 2024 às 09:43, Daniel Gustafsson 
escreveu:

> > On 13 Jan 2024, at 22:38, Ranier Vilela  wrote:
>
> > In the pgwin32_socket function (src/backend/port/win32/socket.c), there
> is a possible socket leak if the socket cannot be made non-blocking.
>
> I don't know Windows well enough to comment on the implications of not
> calling
> closesocket here, but it definitely seems like a prudent thing to do
> backpatched down to 12. Unless objections I'll do that.
>
Thanks for taking care of this.
Do you have plans or should I register for a commitfest?

Best regards,
Ranier Vilela


Re: cleanup patches for incremental backup

2024-01-16 Thread Matthias van de Meent
On Tue, 16 Jan 2024 at 16:39, Robert Haas  wrote:
>
> On Mon, Jan 15, 2024 at 3:31 PM Matthias van de Meent
>  wrote:
> > Off-list I was notified that the new WAL summarizer process was not
> > yet added to the glossary, so PFA a patch that does that.
> > In passing, it also adds "incremental backup" to the glossary, and
> > updates the documented types of backends in monitoring.sgml with the
> > new backend type, too.
>
> I wonder if it's possible that you sent the wrong version of this
> patch, because:
>
> (1) The docs don't build with this applied. I'm not sure if it's the
> only problem, but one of the tags is missing the closing >.

That's my mistake, I didn't check install-world yet due to unrelated
issues building the docs. I've since sorted out these issues (this was
a good stick to get that done), so this issue is fixed in the attached
patch.

> (2) The changes to monitoring.sgml contain an unrelated change, about
> pg_stat_all_indexes.idx_scan.

Thanks for noticing, fixed in attached.

> Also, I think the "For more information, see <... />" bit should have a period after the markup tag, as we seem to do in
> other cases.

Fixed.

> One other thought is that the incremental backup only replaces
> relation files with incremental files, and it never does anything
> about FSM files. So the statement that it only contains data that was
> potentially changed isn't quite correct. It might be better to phrase
> it the other way around i.e. it is like a full backup, except that
> some files can be replaced by incremental files which omit blocks to
> which no WAL-logged changes have been made.

How about the attached?


Kind regards,

Matthias van de Meent


v2-0001-incremental-backups-Add-new-items-to-glossary-mon.patch
Description: Binary data


Re: UUID v7

2024-01-16 Thread Przemysław Sztoch

Andrey Borodin wrote on 1/16/2024 1:15 PM:

Sergey, Przemysław, Jelte, thanks for your feedback.
Here's v9. Changes:
1. Swapped type of the argument to timestamptz in gen_uuid_v7()

Please update the docs part about the optional timestamp argument.

2. Renamed get_uuid_v7_time() to uuid_v7_time()

Please rename uuid_v7_time to uuid_time() and add support for v1 and v6.
If the version is incompatible, then return NULL.

3. Added  uuid_ver() and uuid_var().

Looks good.
But for me, throwing an error is problematic. Wouldn't it be better to 
return -1?

What do you think?
Best regards, Andrey Borodin.

--
Przemysław Sztoch | Mobile +48 509 99 00 66


RE: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread Maiquel Grassi
You can do:

-(ROW_NUMBER() OVER ()) AS descending

(note “-“ in front)

I don't have a base column to use for "order by,"

I think that’s the main issue: what (semantically) does row_number() mean in 
that case? You could equally well generate random numbers?


--//--

What I want to do is invert the enumeration using a simple solution. I want to 
look at the enumeration of the dataset list from bottom to top, not from top to 
bottom. I don't want to reverse the sign of the integers. The generated 
integers in the output remain positive. The returned dataset can be from any query. 
What I need is exactly the opposite of row_number().

count(*) over() - row_number() over() + 1 works.

But I think for a large volume of data, its performance will be inferior to the 
suggested row_number_desc() over(). I may be very wrong, so I will test it.

Maiquel.
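
A sketch of how that comparison could be run against a synthetic million-row input using only the existing expression (the proposed row_number_desc() is not available in a stock build, so only the off-the-shelf form is shown):

```
EXPLAIN (ANALYZE, TIMING OFF)
SELECT count(*) OVER () - row_number() OVER () + 1 AS reverse_enum
FROM generate_series(1, 1000000);
```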



Re: UUID v7

2024-01-16 Thread Jelte Fennema-Nio
On Tue, 16 Jan 2024 at 15:44, Andrey M. Borodin  wrote:
> > On 16 Jan 2024, at 18:00, Aleksander Alekseev  
> > wrote:
> > Not 100% sure what this is for. Any chance this could be part of another 
> > patch?
> Nope, it’s necessary there. Without these changes catalog functions cannot 
> have defaults for arguments. These defaults have type pg_node_tree which has 
> a no-op input function.

That seems like the wrong way to make that work then. How about
instead we define the same function name twice, once with and once
without a timestamp argument. That's how this is done for other
functions that are overloaded in pg_catalog.
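
A generic sketch of that overloading approach, with demo_gen as a placeholder name rather than the actual catalog function:

```
-- Two entries with the same name; the parser resolves the call by its argument list.
CREATE FUNCTION demo_gen() RETURNS text
    LANGUAGE sql AS $$ SELECT 'no timestamp argument' $$;

CREATE FUNCTION demo_gen(timestamptz) RETURNS text
    LANGUAGE sql AS $$ SELECT 'timestamp argument: ' || $1::text $$;

SELECT demo_gen();       -- resolves to the zero-argument variant
SELECT demo_gen(now());  -- resolves to the timestamptz variant
```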




Re: index prefetching

2024-01-16 Thread Konstantin Knizhnik


On 16/01/2024 6:25 pm, Tomas Vondra wrote:

On 1/16/24 09:13, Konstantin Knizhnik wrote:

Hi,

On 12/01/2024 6:42 pm, Tomas Vondra wrote:

Hi,

Here's an improved version of this patch, finishing a lot of the stuff
that I alluded to earlier - moving the code from indexam.c, renaming a
bunch of stuff, etc. I've also squashed it into a single patch, to make
it easier to review.

I am thinking about testing your patch with Neon (cloud Postgres). Since
Neon separates compute and storage, prefetch is much more critical for the
Neon architecture than for vanilla Postgres.

I have few complaints:

1. It disables prefetch for the sequential access pattern (i.e. INDEX
MERGE), on the grounds that in this case OS read-ahead will be more
efficient than prefetch. It may be true for normal storage devices, but
not for Neon storage, and maybe also for Postgres on top of a DFS (i.e.
Amazon RDS). I wonder if we can delegate the decision whether to perform
prefetch in this case or not to some other level. I do not know
precisely where it should be handled. The best candidate IMHO is the
storage manager. But it most likely requires an extension of the SMGR API. Not
sure if you want to do it... A straightforward solution is to move this
logic to some callback, which can be overridden by the user.


Interesting point. You're right these decisions (whether to prefetch
particular patterns) are closely tied to the capabilities of the storage
system. So it might make sense to maybe define it at that level.

Not sure what exactly RDS does with the storage - my understanding is
that it's mostly regular Postgres code, but managed by Amazon. So how
would that modify the prefetching logic?


Amazon RDS is just vanilla Postgres with the file system mounted on EBS 
(Amazon distributed file system).

EBS provides good throughput but larger latencies compared with local SSDs.
I am not sure if read-ahead works for EBS.




4. I think that performing prefetch at the executor level is a really great
idea, and so prefetch can be used by all indexes, including custom
indexes. But prefetch will be efficient only if the index can provide fast
access to the next TID (located at the same page). I am not sure that this is
true for all builtin indexes (GIN, GIST, BRIN, ...) and especially for
custom AMs. I wonder if we should extend the AM API to let the index make a
decision whether to perform prefetch of TIDs or not.

I'm not against having a flag to enable/disable prefetching, but the
question is whether doing prefetching for such indexes can be harmful.
I'm not sure about that.


I tend to agree with you - it is hard to imagine an index implementation 
which doesn't win from prefetching heap pages.
Maybe only the filtering case you have mentioned. But it seems to me 
that the current B-Tree index scan (not IOS) implementation in Postgres
doesn't try to use the index tuple to check extra conditions - it will fetch 
the heap tuple in any case.



5. Minor notice: there are a few places where index_getnext_slot is called
with the last parameter NULL (disabled prefetch) with the following comment
"XXX Would be nice to also benefit from prefetching here." But all these
places correspond to "point lookup", i.e. unique constraint check,
find replication tuple by index... Prefetch seems unlikely to be useful
here, unless there is index bloat and we have to skip a lot of
tuples before locating the right one. But should we try to optimize the case of
bloated indexes?


Are you sure you're looking at the last patch version? Because the
current patch does not have any new parameters in index_getnext_* and
the comments were removed too (I suppose you're talking about
execIndexing, execReplication and those places).

Sorry, I looked at v20240103-0001-prefetch-2023-12-09.patch, I didn't 
notice v20240112-0001-Prefetch-heap-pages-during-index-scans.patch




regards


New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread David G. Johnston
On Tuesday, January 16, 2024, Maiquel Grassi  wrote:
>
>
> However, initially, I have one more obstacle in your feedback. If I use
> count(*) over() - row_number() over(), it gives me an offset of one unit.
> To resolve this, I need to add 1.
>
> This way, simulating a reverse row_number() becomes even more laborious.
>

I don’t really understand why you think this reverse inserted counting is
even a good idea so I don’t really care how laborious it is to implement
with existing off-the-shelf tools.  A window function named “descending” is
non-standard and seemingly non-sensical and should not be added.  You can
specify order by in the over clause and that is what you should be doing.
Mortgage payments are usually monthly, so order by date.

David J.
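
For illustration, a minimal sketch of the suggestion above; the installments table and its payment_date column are made-up names used only for this example:

```
-- Descending enumeration comes from ordering the window itself.
-- installments / payment_date are hypothetical names.
SELECT payment_date,
       row_number() OVER (ORDER BY payment_date DESC) AS installments_remaining
FROM installments;
```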


Re: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread Michał Kłeczek


> On 16 Jan 2024, at 16:51, Maiquel Grassi  wrote:
> 
> 
> Imagine I have a dataset that is returned to my front-end, and I want to 
> reverse enumerate them (exactly the concept of Math enumerating integers). 
> The row_number does the ascending enumeration, but I need the descending 
> enumeration.

You can do:

-(ROW_NUMBER() OVER ()) AS descending

(note “-“ in front)

> I don't have a base column to use for "order by,"

I think that’s the main issue: what (semantically) does row_number() mean in 
that case? You could equally well generate random numbers?


— 
Michal



RE: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread Maiquel Grassi
I doubt it is materially different, you need that count regardless so the 
effort is expended no matter if you put it in an SQL expression or build it 
into the window function.  But as you are the one arguing for the new feature 
demonstrating that the status quo is deficient is your job.

---//---

Ok, I'll run the tests to validate the performance and draw some conclusions.

However, initially, I have one more obstacle in your feedback. If I use 
count(*) over() - row_number() over(), it gives me an offset of one unit. To 
resolve this, I need to add 1. This way, simulating a reverse row_number() 
becomes even more laborious.

SELECT
  row_number() over()
  , row_number_desc() over()
  , count(*) over() - row_number() over() as count_minus_row_number
FROM pg_catalog.pg_database;
 row_number | row_number_desc | count_minus_row_number
------------+-----------------+------------------------
          1 |               3 |                      2
          2 |               2 |                      1
          3 |               1 |                      0
(3 rows)

postgres=# SELECT row_number() over(), row_number_desc() over(),
                  count(*) over() - row_number() over() AS count_minus_row_number,
                  count(*) over() - row_number() over() + 1 AS count_minus_row_number_plus_one
           FROM pg_catalog.pg_database;
 row_number | row_number_desc | count_minus_row_number | count_minus_row_number_plus_one
------------+-----------------+------------------------+---------------------------------
          1 |               3 |                      2 |                               3
          2 |               2 |                      1 |                               2
          3 |               1 |                      0 |                               1
(3 rows)

Tks,
Maiquel.


Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-16 Thread Robert Haas
On Tue, Jan 16, 2024 at 11:28 AM Melanie Plageman
 wrote:
> All LGTM.

Committed.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: psql JSON output format

2024-01-16 Thread Robert Haas
On Tue, Jan 16, 2024 at 11:07 AM Laurenz Albe  wrote:
> "Round-trip safety" is not so important.  If you want to move data from
> PostgreSQL to PostgreSQL, you use the plain or the binary format.
> The CSV format by default renders NULL and empty strings identical, and
> I don't think anybody objects to that.

As Andrew says, the part about the CSV format is not correct, but I
also don't think I agree with the larger point, either. I believe that
round-trip safety is a really desirable property. Is it absolutely
necessary in every case? Maybe not. But, it shouldn't be lacking
without a good reason, either, at least IMHO. If you postulate that
people are moving data from A to B, it is reasonable to think that
eventually someone is going to want to move some data from B back to
A. If that turns out to be hard, they'll be sad. We shouldn't make
people sad without a good reason.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: partitioning and identity column

2024-01-16 Thread Peter Eisentraut

On 09.01.24 15:10, Ashutosh Bapat wrote:

Here's complete patch-set.


Looks good!  Committed.





Re: UUID v7

2024-01-16 Thread Andrey M. Borodin



> On 16 Jan 2024, at 21:49, Sergey Prokhorenko  
> wrote:
> 
> It is not clear how to interpret uuid_v7_time():
>   • uuid_v7 to time() (extracting the timestamp)
>   • time() to uuid_v7  (generation of the uuid_v7)
> It is worth improving the naming, for example, adding prepositions.

Previously, Jelte had some thoughts on idiomatic function names.

Jelte, what is your opinion on naming the function which extracts timestamp 
from UUID v7?
Of course, it would be great to hear opinions from anyone else.


Best regards, Andrey Borodin.



New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread David G. Johnston
On Tuesday, January 16, 2024, Maiquel Grassi  wrote:

> Hi,
>
> Count() over() - row_number() over()
>
>But what if my dataset is significantly large? Wouldn't calling two window
> functions instead of one be much slower?
>Is *count() over() - row_number() over()* faster than *row_number_desc()
> over()*?
>
>
I doubt it is materially different, you need that count regardless so the
effort is expended no matter if you put it in an SQL expression or build it
into the window function.  But as you are the one arguing for the new
feature demonstrating that the status quo is deficient is your job.

David J.


Re: ALTER TYPE OWNER fails to recurse to multirange

2024-01-16 Thread Robert Haas
On Tue, Jan 16, 2024 at 11:46 AM Tom Lane  wrote:
> They're by no means independent.  What would it mean to have a
> multirange without the underlying range type?

It would mean just that - no more, and no less. If it's possible to
imagine a data type that stores pairs of values from the underlying
data type with the constraint that the first is less than the second,
plus the ability to specify inclusive or exclusive bounds and the
ability to have infinite bounds, then it's equally possible to imagine
a data type that represents a set of such ranges such that no two
ranges in the set overlap. And you need not imagine that the former
data type must exist in order for the latter to exist. Theoretically,
they're just two different data types that somebody could decide to
create.

> Also, we already
> treat the multirange as dependent for some things:

But this seems like an entirely valid point.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




RE: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread Maiquel Grassi
Hi,

Count() over() - row_number() over()

   But what if my dataset is significantly large? Wouldn't calling two window 
functions instead of one be much slower?
   Is count() over() - row_number() over() faster than row_number_desc() over()?

Maiquel.


Re: index prefetching

2024-01-16 Thread Robert Haas
On Tue, Jan 16, 2024 at 11:25 AM Tomas Vondra
 wrote:
> > 3. It doesn't perform prefetch of leaf pages for IOS, only referenced
> > heap pages which are not marked as all-visible. It seems to me that if
> > the optimizer has chosen IOS (and not a bitmap heap scan, for example), then
> > there should be a large enough fraction of all-visible pages. Also, index
> > prefetch is most efficient for OLAP queries, and those usually run on
> > historical data which is all-visible. But IOS can be
> > really handled separately in some other PR. Frankly speaking, combining
> > prefetch of leaf B-Tree pages and referenced heap pages seems to be
> > a very challenging task.
>
> I see prefetching of leaf pages as interesting / worthwhile improvement,
> but out of scope for this patch. I don't think it can be done at the
> executor level - the prefetch requests need to be submitted from the
> index AM code (by calling PrefetchBuffer, etc.)

+1. This is a good feature, and so is that, but they're not the same
feature, despite the naming problems.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: UUID v7

2024-01-16 Thread Sergey Prokhorenko
Andrey,

It is not clear how to interpret uuid_v7_time():   
   - uuid_v7 to time() (extracting the timestamp)
   - time() to uuid_v7  (generation of the uuid_v7)
 It is worth improving the naming, for example, adding prepositions.


Sergey Prokhorenko sergeyprokhore...@yahoo.com.au 

On Tuesday, 16 January 2024 at 05:44:51 pm GMT+3, Andrey M. Borodin 
 wrote:  
 
 Thanks for your review, Aleksander!

> On 16 Jan 2024, at 18:00, Aleksander Alekseev  
> wrote:
> 
> 
> ```
> +Datum
> +pg_node_tree_in(PG_FUNCTION_ARGS)
> +{
> +    if (!IsBootstrapProcessingMode())
> +        elog(ERROR, "cannot accept a value of type pg_node_tree_in");
> +    return textin(fcinfo);
> +}
> ```
> 
> Not 100% sure what this is for. Any chance this could be part of another 
> patch?
Nope, it’s necessary there. Without these changes catalog functions cannot have 
defaults for arguments. These defaults have type pg_node_tree which has a no-op 
input function.

> One thing I don't particularly like about the tests is the fact that
> they don't check if a correct UUID was actually generated. I realize
> that's not quite trivial due to the random nature of the function, but
> maybe we could use some substring/regex magic here? Something like:
> 
> ```
> select gen_uuid_v7() :: text ~ '^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$';
> ?column?
> --
> t
> 
> select regexp_replace(gen_uuid_v7('2024-01-16 15:45:33 MSK') :: text,
> '[0-9a-f]{4}-[0-9a-f]{12}$', '-' || repeat('X', 12));
>            regexp_replace
> --
> 018d124e-39c8-74c7--
> ```
Any 8 bytes which have ver and var bits (6 bits total) are correct UUID.
This is checked by tests when uuid_var() and uuid_ver() functions are exercised.

> ```
> +  proname => 'uuid_v7_time', proleakproof => 't', provolatile => 'i',
> ```
> 
> I don't think we conventionally specify IMMUTABLE volatility, it's the
> default. Other values also are worth checking.
Makes sense, I’ll drop these values in the next version.
BTW I’m in doubt whether the provided functions are leakproof. They ERROR out with 
messages that can give a clue about several bits of the UUID. Does this break 
leakproofness? I think yes, but I’m not sure.
gen_uuid_v7() seems leakproof to me.

> Another question: how did you choose between using TimestampTz and
> Timestamp types? I realize that internally it's all the same. Maybe
> Timestamp will be slightly better since the way it is displayed
> doesn't depend on the session settings. Many people I talked to find
> this part of TimestampTz confusing.

I mean, this argument is expected to be used to implement K-way sorted 
identifiers. In this context, it seems to me, it’s good to remind the 
developer that time shifts also depend on timezones.
But this is too vague.
Do you have any reasons that apply to UUID generation?

> Also I would like to point out that part of the documentation is
> missing, but I guess at this stage of the game it's OK.
> 
> Last but not least: maybe we should support casting Timestamp[Tz] to
> UUIDv7 and vice versa? Shouldn't be difficult to implement and I
> suspect somebody will request this eventually. During the cast to UUID
> we will always get the same value for the given Timestamp[Tz], which
> probably can be useful in certain applications. It can't be done with
> gen_uuid_v7() and its volatility doesn't permit it.
I’m strongly opposed to doing this cast. I was not going to add the function that 
extracts the timestamp from a UUID, because the standard does not recommend it, but a 
lot of people asked for this.
Still, supporting an easy way to do an unrecommended thing seems bad.


Best regards, Andrey Borodin.  

Re: "ERROR: could not open relation with OID 16391" error was encountered when reindexing

2024-01-16 Thread feichanghong

> This is extremely nonspecific, as line numbers in our code change
> constantly.  Please quote a chunk of code surrounding that
> and indicate which line you are trying to stop at.

Thanks for the suggestion, I've refined the steps below to reproduce:
1. Initialize the data
```
DROP TABLE IF EXISTS tbl_part;
CREATE TABLE tbl_part (a integer) PARTITION BY RANGE (a);
CREATE TABLE tbl_part_p1 PARTITION OF tbl_part FOR VALUES FROM (0) TO (10);
CREATE INDEX ON tbl_part(a);
```
2. session1 runs REINDEX, and gdb breaks after the reindex_index function 
successfully obtains the heapId, as noted in the code chunk below:

reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
              const ReindexParams *params)
{
    ...
    /*
     * Open and lock the parent heap relation.  ShareLock is sufficient since
     * we only need to be sure no schema or data changes are going on.
     */
    heapId = IndexGetRelation(indexId,
                              (params->options & REINDEXOPT_MISSING_OK) != 0);
    > gdb break at here
    /* if relation is missing, leave */
    if (!OidIsValid(heapId))
        return;
```
REINDEX INDEX tbl_part_a_idx;
```
3. session2 drop index succeed

```
DROP INDEX tbl_part_a_idx;
```
4. session1 gdb continue


Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.



Re: psql JSON output format

2024-01-16 Thread Andrew Dunstan


On 2024-01-16 Tu 11:07, Laurenz Albe wrote:

On Tue, 2024-01-09 at 16:51 +, Dean Rasheed wrote:

On Tue, 9 Jan 2024 at 14:35, Christoph Berg  wrote:

Getting it print numeric/boolean without quotes was actually easy, as
well as json(b). Implemented as the attached v2 patch.

But: not quoting json means that NULL and 'null'::json will both be
rendered as 'null'. That strikes me as a pretty undesirable conflict.
Does the COPY patch also do that?

Yes. Perhaps what needs to happen is for a NULL column to be omitted
entirely from the output. I think the COPY TO json patch would have to
do that if COPY FROM json were to be added later, to make it
round-trip safe.

I think the behavior is fine as it is.  I'd expect both NULL and JSON "null"
to be rendered as "null".  I think the main use case for a feature like this
is people who need the result in JSON for further processing somewhere else.

"Round-trip safety" is not so important.  If you want to move data from
PostgreSQL to PostgreSQL, you use the plain or the binary format.
The CSV format by default renders NULL and empty strings identical, and
I don't think anybody objects to that.



This is absolutely not true. The docs say about CSV format:

   A NULL is output as the NULL parameter string and is not quoted,
   while a non-NULL value matching the NULL parameter string is quoted.
   For example, with the default settings, a NULL is written as an
   unquoted empty string, while an empty string data value is written
   with double quotes ("").

CSV format with default settings is and has been from the beginning 
designed to be round trippable.
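
A quick way to see that behaviour, assuming nothing beyond a default installation:

```
COPY (SELECT NULL::text AS a, ''::text AS b) TO STDOUT WITH (FORMAT csv);
-- prints:  ,""
-- the NULL becomes an unquoted empty field, the empty string is quoted as ""
```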



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: ALTER TYPE OWNER fails to recurse to multirange

2024-01-16 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jan 15, 2024 at 2:28 PM Tom Lane  wrote:
>> I'm reasoning by analogy to array types, which are automatically
>> created and automatically updated to keep the same ownership
>> etc. properties as their base type.  To the extent that multirange
>> types don't act exactly like that, I say it's a bug/oversight in the
>> multirange patch.  So I think this is a backend bug, not a pg_dump
>> bug.

> Well, I guess maybe I'm just clueless. I thought that the range and
> multirange were two essentially independent objects being created by
> the same command. But I haven't studied the implementation so maybe
> I'm completely wrong.

They're by no means independent.  What would it mean to have a
multirange without the underlying range type?  Also, we already
treat the multirange as dependent for some things:

d=# create type varbitrange as range (subtype = varbit);
CREATE TYPE
d=# \dT
   List of data types
 Schema |   Name   | Description 
+--+-
 public | varbitmultirange | 
 public | varbitrange  | 
(2 rows)

d=# drop type varbitmultirange;
ERROR:  cannot drop type varbitmultirange because type varbitrange requires it
HINT:  You can drop type varbitrange instead.
d=# drop type varbitrange restrict;
DROP TYPE
d=# \dT
 List of data types
 Schema | Name | Description 
+--+-
(0 rows)

So I think we're looking at a half-baked dependency design,
not two independent objects.

regards, tom lane




Re: "ERROR: could not open relation with OID 16391" error was encountered when reindexing

2024-01-16 Thread Tom Lane
"=?ISO-8859-1?B?ZmVpY2hhbmdob25n?="  writes:
> 2. session1 reindex and gdb break at index.c:3585

This is extremely nonspecific, as line numbers in our code change
constantly.  Please quote a chunk of code surrounding that
and indicate which line you are trying to stop at.

regards, tom lane




Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-16 Thread Melanie Plageman
On Tue, Jan 16, 2024 at 10:24 AM Robert Haas  wrote:
>
> On Mon, Jan 15, 2024 at 4:03 PM Melanie Plageman
>  wrote:
>
> > Perhaps it isn't important, but I find this wording confusing. You
> > mention lazy_scan_prune() and then mention that "the whole test is in
> > one place instead of spread out" -- which kind of makes it sound like
> > you are consolidating FSM updates for both the lazy_scan_noprune() and
> > lazy_scan_prune() cases. Perhaps simply flipping the order of the "since
> > the caller" and "moreover, this way" conjunctions would solve it. I
> > defer to your judgment.
>
> I rewrote the commit message a bit. See what you think of this version.

All LGTM.




Re: index prefetching

2024-01-16 Thread Tomas Vondra
On 1/16/24 09:13, Konstantin Knizhnik wrote:
> Hi,
> 
> On 12/01/2024 6:42 pm, Tomas Vondra wrote:
>> Hi,
>>
>> Here's an improved version of this patch, finishing a lot of the stuff
>> that I alluded to earlier - moving the code from indexam.c, renaming a
>> bunch of stuff, etc. I've also squashed it into a single patch, to make
>> it easier to review.
> 
> I am thinking about testing your patch with Neon (cloud Postgres). Since
> Neon separates compute and storage, prefetch is much more critical for the
> Neon architecture than for vanilla Postgres.
> 
> I have few complaints:
> 
> 1. It disables prefetch for the sequential access pattern (i.e. INDEX
> MERGE), on the grounds that in this case OS read-ahead will be more
> efficient than prefetch. It may be true for normal storage devices, but
> not for Neon storage, and maybe also for Postgres on top of a DFS (i.e.
> Amazon RDS). I wonder if we can delegate the decision whether to perform
> prefetch in this case or not to some other level. I do not know
> precisely where it should be handled. The best candidate IMHO is the
> storage manager. But it most likely requires an extension of the SMGR API. Not
> sure if you want to do it... A straightforward solution is to move this
> logic to some callback, which can be overridden by the user.
> 

Interesting point. You're right these decisions (whether to prefetch
particular patterns) are closely tied to the capabilities of the storage
system. So it might make sense to maybe define it at that level.

Not sure what exactly RDS does with the storage - my understanding is
that it's mostly regular Postgres code, but managed by Amazon. So how
would that modify the prefetching logic?

However, I'm not against making this modular / wrapping this in some
sort of callbacks, for example.

> 2. It disables prefetch for direct_io. It seems to be even more obvious
> than 1), because prefetching using `posix_fadvise` is definitely not
> possible in the case of direct_io. But in theory, if the SMGR provides some
> alternative prefetch implementation (as in the case of Neon), this also may
> not be true. It is still unclear why we would want to use direct_io in Neon...
> But still, I prefer to move this decision outside the executor.
> 

True. I think this would / should be customizable by the callback.

> 3. It doesn't perform prefetch of leaf pages for IOS, only referenced
> heap pages which are not marked as all-visible. It seems to me that if
> the optimizer has chosen IOS (and not a bitmap heap scan, for example), then
> there should be a large enough fraction of all-visible pages. Also, index
> prefetch is most efficient for OLAP queries, and those usually run on
> historical data which is all-visible. But IOS can be
> really handled separately in some other PR. Frankly speaking, combining
> prefetch of leaf B-Tree pages and referenced heap pages seems to be
> a very challenging task.
> 

I see prefetching of leaf pages as interesting / worthwhile improvement,
but out of scope for this patch. I don't think it can be done at the
executor level - the prefetch requests need to be submitted from the
index AM code (by calling PrefetchBuffer, etc.)

> 4. I think that performing prefetch at the executor level is a really great
> idea, and so prefetch can be used by all indexes, including custom
> indexes. But prefetch will be efficient only if the index can provide fast
> access to the next TID (located at the same page). I am not sure that this is
> true for all builtin indexes (GIN, GIST, BRIN, ...) and especially for
> custom AMs. I wonder if we should extend the AM API to let the index make a
> decision whether to perform prefetch of TIDs or not.

I'm not against having a flag to enable/disable prefetching, but the
question is whether doing prefetching for such indexes can be harmful.
I'm not sure about that.

> 
> 5. Minor notice: there are a few places where index_getnext_slot is called
> with the last parameter NULL (disabled prefetch) with the following comment
> "XXX Would be nice to also benefit from prefetching here." But all these
> places correspond to "point lookup", i.e. unique constraint check,
> find replication tuple by index... Prefetch seems unlikely to be useful
> here, unless there is index bloat and we have to skip a lot of
> tuples before locating the right one. But should we try to optimize the case of
> bloated indexes?
> 

Are you sure you're looking at the last patch version? Because the
current patch does not have any new parameters in index_getnext_* and
the comments were removed too (I suppose you're talking about
execIndexing, execReplication and those places).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread David G. Johnston
On Tuesday, January 16, 2024, Maiquel Grassi  wrote:

> Hello David, how are you?
>
> Firstly, I apologize if I wasn't clear in what I intended to propose. I
> used a very specific example here, and it wasn't very clear what I really
> wanted to bring up for discussion.
>
> I understand that it's possible to order the "returned dataset" using
> "order by ... desc."
>
>
It is, but it is also possible to order a window frame/partition by
specifying order by in the over clause.  Which is what I showed, and what
you should try to use.  That orders the enumeration, you can still order,
or not, the output dataset.
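
A small sketch of that separation, using generate_series as a stand-in for real data:

```
SELECT id,
       row_number() OVER (ORDER BY id DESC) AS reverse_enum  -- enumeration fixed by the window
FROM generate_series(1, 5) AS t(id)
ORDER BY id;                                                 -- output order chosen independently
```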



> I don't have a base column to use for "order by," and I also can't use
> CTID column.
>

Then you really don’t have an ordering in the data itself.  This is unusual
and not really worth adding a new function to deal with.


>
> How can I do this without using my reversed enumeration "row_number desc"
> function?
>

Count() over() - row_number() over()

 Please don’t top-post replies, in-line and trim like I’m doing.

David J.

P.s. if you really don’t care about logical order you probably should just
let your front-end deal with it.


Re: "ERROR: could not open relation with OID 16391" error was encountered when reindexing

2024-01-16 Thread feichanghong
I have provided a Python script in the attachment as a minimal reproduction 
of the issue.

I'm sorry that I lost the attached script in my last reply, but I've added it 
in this reply.


You can also use psql to reproduce it with the following steps:
1. Initialize the data
```
DROP TABLE IF EXISTS tbl_part;
CREATE TABLE tbl_part (a integer) PARTITION BY RANGE (a);
CREATE TABLE tbl_part_p1 PARTITION OF tbl_part FOR VALUES FROM (0) TO (10);
CREATE INDEX ON tbl_part(a);
```
2. session1 reindex and gdb break at index.c:3585
```
REINDEX INDEX tbl_part_a_idx;
```
3. session2 drop index succeed


```
DROP INDEX tbl_part_a_idx;
```
4. session1 gdb continue



Best Regards,
Fei Changhong
Alibaba Cloud Computing Ltd.

reproduce.py
Description: Binary data


Re: introduce dynamic shared memory registry

2024-01-16 Thread Nathan Bossart
On Tue, Jan 16, 2024 at 10:28:29AM +0530, Bharath Rupireddy wrote:
> I think it's better for GetNamedDSMSegment() to error out on empty
> 'name' and size 0. This makes the user-facing function
> GetNamedDSMSegment more concrete.

Agreed, thanks for the suggestion.

> +void *
> +GetNamedDSMSegment(const char *name, size_t size,
> +   void (*init_callback) (void *ptr), bool *found)
> 
> +Assert(found);
> 
> Why is input parameter 'found' necessary to be passed by the caller?
> Neither the test module added, nor the pg_prewarm is using the found
> variable. The function will anyway create the DSM segment if one with
> the given name isn't found. IMO, found is an optional parameter for
> the caller. So, the assert(found) isn't necessary.

The autoprewarm change (0003) does use this variable.  I considered making
it optional (i.e., you could pass in NULL if you didn't want it), but I
didn't feel like the extra code in GetNamedDSMSegment() to allow this was
worth it so that callers could avoid creating a single bool.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 402eaf87776fb6a9d212da66947f47c63bd53f2a Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Jan 2024 21:55:25 -0600
Subject: [PATCH v8 1/3] reorganize shared memory and lwlocks documentation

---
 doc/src/sgml/xfunc.sgml | 184 +---
 1 file changed, 115 insertions(+), 69 deletions(-)

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 89116ae74c..ede2a5dea6 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray

 

-Shared Memory and LWLocks
+Shared Memory
 
-
- Add-ins can reserve LWLocks and an allocation of shared memory on server
- startup.  The add-in's shared library must be preloaded by specifying
- it in
- shared_preload_libraries.
- The shared library should register a shmem_request_hook
- in its _PG_init function.  This
- shmem_request_hook can reserve LWLocks or shared memory.
- Shared memory is reserved by calling:
+
+ Requesting Shared Memory at Startup
+
+ 
+  Add-ins can reserve shared memory on server startup.  To do so, the
+  add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries.
+  The shared library should also register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve shared memory by
+  calling:
 
-void RequestAddinShmemSpace(int size)
+void RequestAddinShmemSpace(Size size)
 
- from your shmem_request_hook.
-
-
- LWLocks are reserved by calling:
+  Each backend should obtain a pointer to the reserved shared memory by
+  calling:
+
+void *ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+
+  If this function sets foundPtr to
+  false, the caller should proceed to initialize the
+  contents of the reserved shared memory.  If foundPtr
+  is set to true, the shared memory was already
+  initialized by another backend, and the caller need not initialize
+  further.
+ 
+
+ 
+  To avoid race conditions, each backend should use the LWLock
+  AddinShmemInitLock when initializing its allocation
+  of shared memory, as shown here:
+
+static mystruct *ptr = NULL;
+boolfound;
+
+LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+ptr = ShmemInitStruct("my struct name", size, found);
+if (!found)
+{
+... initialize contents of shared memory ...
+ptr->locks = GetNamedLWLockTranche("my tranche name");
+}
+LWLockRelease(AddinShmemInitLock);
+
+  shmem_startup_hook provides a convenient place for the
+  initialization code, but it is not strictly required that all such code
+  be placed in this hook.  Each backend will execute the registered
+  shmem_startup_hook shortly after it attaches to shared
+  memory.  Note that add-ins should still acquire
+  AddinShmemInitLock within this hook, as shown in the
+  example above.
+ 
+
+ 
+  An example of a shmem_request_hook and
+  shmem_startup_hook can be found in
+  contrib/pg_stat_statements/pg_stat_statements.c in
+  the PostgreSQL source tree.
+ 
+
+   
+
+   
+LWLocks
+
+
+ Requesting LWLocks at Startup
+
+ 
+  Add-ins can reserve LWLocks on server startup.  As with shared memory,
+  the add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries,
+  and the shared library should register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve LWLocks by calling:
 
 void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks)
 
- from your shmem_request_hook.  This will ensure that an array of
- num_lwlocks LWLocks is available under the name
- 

Re: psql JSON output format

2024-01-16 Thread Laurenz Albe
On Tue, 2024-01-09 at 16:51 +, Dean Rasheed wrote:
> On Tue, 9 Jan 2024 at 14:35, Christoph Berg  wrote:
> > 
> > Getting it print numeric/boolean without quotes was actually easy, as
> > well as json(b). Implemented as the attached v2 patch.
> > 
> > But: not quoting json means that NULL and 'null'::json will both be
> > rendered as 'null'. That strikes me as a pretty undesirable conflict.
> > Does the COPY patch also do that?
> 
> Yes. Perhaps what needs to happen is for a NULL column to be omitted
> entirely from the output. I think the COPY TO json patch would have to
> do that if COPY FROM json were to be added later, to make it
> round-trip safe.

I think the behavior is fine as it is.  I'd expect both NULL and JSON "null"
to be rendered as "null".  I think the main use case for a feature like this
is people who need the result in JSON for further processing somewhere else.

"Round-trip safety" is not so important.  If you want to move data from
PostgreSQL to PostgreSQL, you use the plain or the binary format.
The CSV format by default renders NULL and empty strings identical, and
I don't think anybody objects to that.

Yours,
Laurenz Albe
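
A minimal sketch of the SQL NULL vs. 'null'::json distinction at the SQL level:

```
SELECT NULL::json IS NULL   AS sql_null,    -- true: there is no value at all
       'null'::json IS NULL AS json_null;   -- false: this is a real JSON null value
-- an unquoted JSON rendering would print both source values simply as null
```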




Re: Custom explain options

2024-01-16 Thread Konstantin Knizhnik


On 16/01/2024 5:38 pm, Tomas Vondra wrote:

By "broken" you mean that you prefetch items only from a single leaf
page, so immediately after reading the next one nothing is prefetched.
Correct?



Yes, exactly. It means that reading the first heap page from the next leaf page 
will be done without prefetch, which in the case of Neon means a roundtrip to the 
page server (~0.2 msec within one data center).




  Yeah, I had this problem initially too, when I did the
prefetching in the index AM code. One of the reasons why it got moved to
the executor.


Yeah, it works nicely for vanilla Postgres. You call index_getnext_tid() 
and when it reaches the end of a leaf page it reads the next leaf page. Because of 
OS read-ahead this read is expected to be fast even without prefetch. 
But not in the Neon case - we have to download this page from the page server 
(see above). So the ideal solution for Neon would be to prefetch both leaf 
pages and referenced heap pages. And prefetch of the latter should be 
initiated as soon as the leaf page is loaded. Unfortunately it is 
non-trivial to implement and the current index scan prefetch implementation 
for Neon is not doing it.




Re: reorganize "Shared Memory and LWLocks" section of docs

2024-01-16 Thread Nathan Bossart
On Tue, Jan 16, 2024 at 08:20:19AM -0600, Nathan Bossart wrote:
> On Tue, Jan 16, 2024 at 10:02:15AM +0530, Bharath Rupireddy wrote:
>> The v3 patch looks good to me except for a nitpick: the input
>> parameter for RequestAddinShmemSpace is 'Size' not 'int'
>> 
>>  
>>  void RequestAddinShmemSpace(int size)
>>  
> 
> Hah, I think this mistake is nearly old enough to vote (e0dece1, 5f78aa5).
> Good catch.

I fixed this in v4.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 402eaf87776fb6a9d212da66947f47c63bd53f2a Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 11 Jan 2024 21:55:25 -0600
Subject: [PATCH v4 1/1] reorganize shared memory and lwlocks documentation

---
 doc/src/sgml/xfunc.sgml | 184 +---
 1 file changed, 115 insertions(+), 69 deletions(-)

diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 89116ae74c..ede2a5dea6 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3397,90 +3397,136 @@ CREATE FUNCTION make_array(anyelement) RETURNS anyarray

 

-Shared Memory and LWLocks
+Shared Memory
 
-
- Add-ins can reserve LWLocks and an allocation of shared memory on server
- startup.  The add-in's shared library must be preloaded by specifying
- it in
- shared_preload_libraries.
- The shared library should register a shmem_request_hook
- in its _PG_init function.  This
- shmem_request_hook can reserve LWLocks or shared memory.
- Shared memory is reserved by calling:
+
+ Requesting Shared Memory at Startup
+
+ 
+  Add-ins can reserve shared memory on server startup.  To do so, the
+  add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries.
+  The shared library should also register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve shared memory by
+  calling:
 
-void RequestAddinShmemSpace(int size)
+void RequestAddinShmemSpace(Size size)
 
- from your shmem_request_hook.
-
-
- LWLocks are reserved by calling:
+  Each backend should obtain a pointer to the reserved shared memory by
+  calling:
+
+void *ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+
+  If this function sets foundPtr to
+  false, the caller should proceed to initialize the
+  contents of the reserved shared memory.  If foundPtr
+  is set to true, the shared memory was already
+  initialized by another backend, and the caller need not initialize
+  further.
+ 
+
+ 
+  To avoid race conditions, each backend should use the LWLock
+  AddinShmemInitLock when initializing its allocation
+  of shared memory, as shown here:
+
+static mystruct *ptr = NULL;
+boolfound;
+
+LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
+ptr = ShmemInitStruct("my struct name", size, found);
+if (!found)
+{
+... initialize contents of shared memory ...
+ptr->locks = GetNamedLWLockTranche("my tranche name");
+}
+LWLockRelease(AddinShmemInitLock);
+
+  shmem_startup_hook provides a convenient place for the
+  initialization code, but it is not strictly required that all such code
+  be placed in this hook.  Each backend will execute the registered
+  shmem_startup_hook shortly after it attaches to shared
+  memory.  Note that add-ins should still acquire
+  AddinShmemInitLock within this hook, as shown in the
+  example above.
+ 
+
+ 
+  An example of a shmem_request_hook and
+  shmem_startup_hook can be found in
+  contrib/pg_stat_statements/pg_stat_statements.c in
+  the PostgreSQL source tree.
+ 
+
+   
+
+   
+LWLocks
+
+
+ Requesting LWLocks at Startup
+
+ 
+  Add-ins can reserve LWLocks on server startup.  As with shared memory,
+  the add-in's shared library must be preloaded by specifying it in
+  shared_preload_libraries,
+  and the shared library should register a
+  shmem_request_hook in its
+  _PG_init function.  This
+  shmem_request_hook can reserve LWLocks by calling:
 
 void RequestNamedLWLockTranche(const char *tranche_name, int num_lwlocks)
 
- from your shmem_request_hook.  This will ensure that an array of
- num_lwlocks LWLocks is available under the name
- tranche_name.  Use GetNamedLWLockTranche
- to get a pointer to this array.
-
-
- An example of a shmem_request_hook can be found in
- contrib/pg_stat_statements/pg_stat_statements.c in the
- PostgreSQL source tree.
-
-
- There is another, more flexible method of obtaining LWLocks. First,
- allocate a tranche_id from a shared counter by
- calling:
+  This ensures that an array of num_lwlocks LWLocks is
+  available under the name tranche_name.  A pointer to
+  this array can be obtained by calling:
 
-int LWLockNewTrancheId(void)
+LWLockPadded 

RE: New Window Function: ROW_NUMBER_DESC() OVER() ?

2024-01-16 Thread Maiquel Grassi
Hello David, how are you?

Firstly, I apologize if I wasn't clear in what I intended to propose. I used a 
very specific example here, and it wasn't very clear what I really wanted to 
bring up for discussion.

I understand that it's possible to order the "returned dataset" using "order by 
... desc." However, I would like someone to help me think about the following 
scenario:

Imagine I have a dataset that is returned to my front-end, and I want to 
reverse-enumerate it (exactly the concept of Math enumerating integers). The 
row_number does the ascending enumeration, but I need the descending 
enumeration. I don't have a base column to use for "order by," and I also can't 
use the CTID column.

Furthermore, imagine that I have a list of hashes, and I would use "order by" 
on this column or another column to do the reverse enumeration. This wouldn't 
work because I wouldn't have the correct reverse enumeration, meaning the 
reversal of the data would not match the original order.

It's not about reverse ordering; it's about reverse enumeration.

I apologize again for not being clear in the first interaction.

How can I do this without using my reversed enumeration "row_number desc" 
function?

Regards,
Maiquel O. Grassi.

From: David G. Johnston 
Sent: Tuesday, 16 January 2024 11:30
To: Maiquel Grassi 
Cc: pgsql-hack...@postgresql.org 
Subject: Re: New Window Function: ROW_NUMBER_DESC() OVER() ?

On Tuesday, January 16, 2024, Maiquel Grassi 
<gra...@hotmail.com.br> wrote:
Hi developers,

I was working on loans and bank financing, specifically focusing on 
Amortization Systems. I had the need to reverse the counter for the total 
number of installments or for a specific set of installments. This "reversal" 
is essentially a reverse "row_number" function. I realized that it is "hard 
work" to write PL/foo functions for this or even to implement it in just SQL 
using little code.

I think “row_number() over (order by … desc)”  is a sufficient way to get this 
behavior and this isn’t something useful enough to warrant being the first 
ordering-specific function in the system.

David J.



Re: cleanup patches for incremental backup

2024-01-16 Thread Robert Haas
On Mon, Jan 15, 2024 at 3:31 PM Matthias van de Meent
 wrote:
> Off-list I was notified that the new WAL summarizer process was not
> yet added to the glossary, so PFA a patch that does that.
> In passing, it also adds "incremental backup" to the glossary, and
> updates the documented types of backends in monitoring.sgml with the
> new backend type, too.

I wonder if it's possible that you sent the wrong version of this
patch, because:

(1) The docs don't build with this applied. I'm not sure if it's the
only problem, but one of the tags is missing the closing >.

(2) The changes to monitoring.sgml contain an unrelated change, about
pg_stat_all_indexes.idx_scan.

Also, I think the "For more information, see <... />" bit should have a period after the markup tag, as we seem to do in
other cases.

One other thought is that the incremental backup only replaces
relation files with incremental files, and it never does anything
about FSM files. So the statement that it only contains data that was
potentially changed isn't quite correct. It might be better to phrase
it the other way around i.e. it is like a full backup, except that
some files can be replaced by incremental files which omit blocks to
which no WAL-logged changes have been made.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Custom explain options

2024-01-16 Thread Tomas Vondra



On 1/15/24 21:42, Konstantin Knizhnik wrote:
> 
> On 15/01/2024 5:08 pm, Tomas Vondra wrote:
>>
>> My patch does not care about prefetching internal index pages. Yes, it's
>> a limitation, but my assumption is the internal pages are maybe 0.1% of
>> the index, and typically very hot / cached. Yes, if the index is not
>> used very often, this may be untrue. But I consider it a possible future
>> improvement, for some other patch. FWIW there's a prefetching patch for
>> inserts into indexes (which only prefetches just the index leaf pages).
> 
> We have to prefetch pages at the height-1 level (parents of leaf pages) for
> IOS because otherwise the prefetch pipeline is broken at each transition to
> the next leaf page.
> When we start with a new leaf page we have to fill the prefetch ring from
> scratch, which certainly has a negative impact on performance.
> 

By "broken" you mean that you prefetch items only from a single leaf
page, so immediately after reading the next one nothing is prefetched.
Correct? Yeah, I had this problem initially too, when I did the
prefetching in the index AM code. One of the reasons why it got moved to
the executor.

> 
>> Not sure I understand what this is about. The patch simply calls the
>> index AM function index_getnext_tid() enough times to fill the prefetch
>> queue. It does not prefetch the next index leaf page, it however does
>> prefetch the heap pages. It does not "stall" at the boundary of the
>> index leaf page, or something.
> 
> Ok, now I fully understand your approach. Looks really elegant and works
> for all indexes.
> There is still an issue with IOS and seqscan.
> 

Not sure. For seqscan, I think this has nothing to do with it. Postgres
relies on read-ahead to do the work - of course, if that doesn't work
(e.g. for async/direct I/O that'd be the case), an improvement will be
needed. But it's unrelated to this patch, and I'm certainly not saying
this patch does that. I think Thomas/Andres did some work on that.

For IOS, I think the limitation that this does not prefetch any index
pages (especially the leafs) is there, and it'd be nice to do something
about it. But I see it as a separate thing, which I think does need to
happen in the index AM layer (not in the executor).

> 
> 
>>
>>> Another challenge - is how far we should prefetch (as far as I
>>> understand both your and our approach using dynamically extended
>>> prefetch window)
>>>
>> By dynamic extension of prefetch window you mean the incremental growth
>> of the prefetch distance from 0 to effective_io_concurrency?
> 
> Yes
> 
>> I don't
>> think there's a better solution.
> 
> I tried one more solution: propagate information about the expected number
> of fetched rows to the AM. Based on this information it is possible to
> choose a proper prefetch distance.
> Certainly it is not quite precise: we can scan a large number of rows but
> filter out only a few of them. This is why this approach was not committed in
> Neon.
> But I still think that using statistics for determining the prefetch window
> is not such a bad idea. Maybe it needs better thinking.
> 

I don't think we should rely on this information too much. It's far too
unreliable - especially the planner estimates. The run-time data may be
more accurate, but I'm worried it may be quite variable (e.g. for
different runs of the scan).

My position is to keep this as simple as possible, and prefer to be more
conservative when possible - that is, shorter prefetch distances. In my
experience the benefit of prefetching is subject to diminishing returns,
i.e. going from 0 => 16 is way bigger difference than 16 => 32. So
better to stick with lower value instead of wasting resources.

> 
>>
>> There might be additional information that we could consider (e.g.
>> expected number of rows for the plan, earlier executions of the scan,
>> ...) but each of these has a failure mode.
> 
> I wrote the reply above before reading the next fragment :)
> So I have already tried it.
> 
>> I haven't tried with pgvector, but I don't see why my patch would not
>> work for all index AMs that cna return TID.
> 
> 
> Yes, I agree. But it will be efficient only if getting the next TID is
> cheap - i.e. it is located on the same leaf page.
> 

Maybe. I haven't tried/thought about it, but yes - if it requires doing
a lot of work in between the prefetches, the benefits of prefetching
will diminish naturally. Might be worth doing some experiments.

> 
>>
>>> I have also tried to implement an alternative approach for prefetch based
>>> on access statistics.
>>> It comes from the use case of a seqscan of a table with large toasted records.
>>> So for each record we have to extract its TOAST data.
>>> It is done using a standard index scan, but unfortunately index prefetch
>>> doesn't help much here: there is usually just one TOAST segment and so
>>> prefetch just has no chance to do anything useful. But as heap
>>> records are accessed sequentially, there is a good chance that the TOAST table
>>> will also be accessed mostly sequentially. So 

Re: Emit fewer vacuum records by reaping removable tuples during pruning

2024-01-16 Thread Robert Haas
On Mon, Jan 15, 2024 at 4:03 PM Melanie Plageman
 wrote:
> It doesn't pass the number of LP_DEAD items back to the caller. It
> passes a boolean.

Oops.

> Perhaps it isn't important, but I find this wording confusing. You
> mention lazy_scan_prune() and then mention that "the whole test is in
> one place instead of spread out" -- which kind of makes it sound like
> you are consolidating FSM updates for both the lazy_scan_noprune() and
> lazy_scan_prune() cases. Perhaps simply flipping the order of the "since
> the caller" and "moreover, this way" conjunctions would solve it. I
> defer to your judgment.

I rewrote the commit message a bit. See what you think of this version.

> tests if lpdead_items > 0. It is more the inconsistency that bothered me
> than the fact that lpdead_items is signed.

Changed.

> > @@ -2159,6 +2156,9 @@ lazy_scan_noprune(LVRelState *vacrel,
> >   if (hastup)
> >   vacrel->nonempty_pages = blkno + 1;
> >
> > + /* Did we find LP_DEAD items? */
> > + *has_lpdead_items = (lpdead_items > 0);
>
> I would drop this comment. The code doesn't really need it, and the
> reason we care if there are LP_DEAD items is not because they are dead
> but because we want to know if we'll touch this page again. You don't
> need to rehash all that here, so I think omitting the comment is enough.

I want to keep the comment. I guess it's a pet peeve of mine, but I
hate it when people do this:

/* some comment */
some_code();

some_more_code();

/* some other comment */
even_more_code();

IMV, this makes it unclear whether /* some comment */ is describing
both some_code() and some_more_code(), or just the former. To be fair,
there is often no practical confusion, because if the comment is good
and the code is nothing too complicated then you understand which way
the author meant it. But sometimes the comment is bad or out of date
and sometimes the code is difficult to understand and then you're left
scratching your head as to what the author meant. I prefer to insert a
comment above some_more_code() in such cases, even if it's a bit
perfunctory. I think it makes the code easier to read.
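
That is, in the example above I'd rather write it as:

/* some comment */
some_code();

/* what some_more_code() is doing */
some_more_code();

/* some other comment */
even_more_code();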

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v3-0001-Be-more-consistent-about-whether-to-update-the-FS.patch
Description: Binary data


Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)

2024-01-16 Thread Alexander Korotkov
Hi,

> Thanks for updating the patch!

You're welcome!

> Here is a minor comment:
>
> > +/*
> > + * Extract a defGetCopySaveErrorToChoice value from a DefElem.
> > + */
>
> Should be Extract a "CopySaveErrorToChoice"?

Fixed.

> BTW I'm thinking we should add a column to pg_stat_progress_copy that
> counts soft errors. I'll suggest this in another thread.

Please do!

--
Regards,
Alexander Korotkov


0001-Add-new-COPY-option-SAVE_ERROR_TO-v6.patch
Description: Binary data


Re: POC: GROUP BY optimization

2024-01-16 Thread Alexander Korotkov
Hi!

Thank you for your review.  The revised patchset is attached.

On Tue, Jan 16, 2024 at 4:48 AM Richard Guo  wrote:
> On Mon, Jan 15, 2024 at 3:56 PM Alexander Korotkov  
> wrote:
>>
>> On Mon, Jan 15, 2024 at 8:42 AM Richard Guo  wrote:
>> > On Mon, Jan 15, 2024 at 8:20 AM Alexander Korotkov  
>> > wrote:
>> >>
>> >> Thank you for providing the test case relevant for this code change.
>> >> The revised patch incorporating this change is attached.  Now the
>> >> patchset looks good to me.  I'm going to push it if there are no
>> >> objections.
>> >
>> > Seems I'm late for the party.  Can we hold for several more days?  I'd
>> > like to have a review on this patch.
>>
>> Sure, NP.  I'll hold it till your review.
>
>
> Thanks.  Appreciate that.
>
> I have briefly reviewed this patch and here are some comments.
>
> * When trying to match the ordering of GROUP BY to that of ORDER BY in
> get_useful_group_keys_orderings, this patch checks against the length of
> path->pathkeys.  This does not make sense.  I guess this is a typo and
> the real intention is to check against root->sort_pathkeys.
>
> --- a/src/backend/optimizer/path/pathkeys.c
> +++ b/src/backend/optimizer/path/pathkeys.c
> @@ -504,7 +504,7 @@ get_useful_group_keys_orderings(PlannerInfo *root, Path 
> *path)
>root->num_groupby_pathkeys);
>
> if (n > 0 &&
> -   (enable_incremental_sort || n == list_length(path->pathkeys)))
> +   (enable_incremental_sort || n == 
> list_length(root->sort_pathkeys)))
>
> However, I think this is also incorrect.  When incremental sort is
> disabled, it is reasonable to consider reordering the GROUP BY keys only
> if the number of matching pathkeys is equal to the length of
> root->group_pathkeys i.e. if 'n == list_length(root->group_pathkeys)'.

Hmm... why should this be list_length(root->group_pathkeys) while
we're targeting root->sort_pathkeys?  Anyway, I changed it to
list_length(root->sort_pathkeys).

> * When the original ordering of GROUP BY keys matches the path's
> pathkeys or ORDER BY keys, get_useful_group_keys_orderings would
> generate duplicate PathKeyInfos.  Consequently, this duplication would
> lead to the creation and addition of identical paths multiple times.
> This is not great.  Consider the query below:
>
> create table t (a int, b int);
> create index on t (a, b);
> set enable_hashagg to off;
>
> explain select count(*) from t group by a, b;
> QUERY PLAN
> --
>  GroupAggregate  (cost=0.15..97.27 rows=226 width=16)
>Group Key: a, b
>->  Index Only Scan using t_a_b_idx on t  (cost=0.15..78.06 rows=2260 
> width=8)
> (3 rows)
>
> The same path with group keys {a, b} is created and added twice.

I tried to address that.

> * Part of the work performed in this patch overlaps with that of
> preprocess_groupclause.  They are both trying to adjust the ordering of
> the GROUP BY keys to match ORDER BY.  I wonder if it would be better to
> perform this work only once.

Andrei, could you take a look?

> * When reordering the GROUP BY keys, I think the following checks need
> some improvements.
>
> +   /*
> +* Since 1349d27 pathkey coming from underlying node can be in the
> +* root->group_pathkeys but not in the processed_groupClause. So, we
> +* should be careful here.
> +*/
> +   sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref,
> +   *group_clauses);
> +   if (!sgc)
> +   /* The grouping clause does not cover this pathkey */
> +   break;
>
> I think we need to check or assert 'pathkey->pk_eclass->ec_sortref' is
> valid before calling get_sortgroupref_clause_noerr with it.  Also, how
> can we guarantee that the returned GROUP BY item is sortable?  Should we
> check that with OidIsValid(sgc->sortop)?

Added.
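
Roughly, the added check has this shape (an approximation -- see the
attached patch for the exact code):

/* the pathkey must be tied to a grouping clause ... */
if (pathkey->pk_eclass->ec_sortref == 0)
    break;

sgc = get_sortgroupref_clause_noerr(pathkey->pk_eclass->ec_sortref,
                                    *group_clauses);

/* ... and that clause must exist and be sortable */
if (sgc == NULL || !OidIsValid(sgc->sortop))
    break;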

--
Regards,
Alexander Korotkov


0001-Generalize-common-code-of-adding-sort-befor-20240116.patch
Description: Binary data


0002-Explore-alternative-orderings-of-group-by-p-20240116.patch
Description: Binary data

