Re: Typos in reorderbuffer.c.

2024-03-13 Thread Amit Kapila
On Thu, Mar 14, 2024 at 9:58 AM Kyotaro Horiguchi
 wrote:
>
> While examining reorderbuffer.c, I found several typos. I'm not sure
> if fixing them is worthwhile, but I've attached a fix just in case.
>

LGTM. I'll push this in some time.

-- 
With Regards,
Amit Kapila.




Re: MERGE ... RETURNING

2024-03-13 Thread jian he
Hi
Mainly documentation issues. Other than that, it looks good!

MERGE not supported in COPY
MERGE not supported in WITH query
Do these entries in src/backend/po.* need to be deleted if this patch is
committed?
--
  <indexterm zone="dml-returning">
   <primary>RETURNING</primary>
  </indexterm>

  <indexterm zone="dml-returning">
   <primary>INSERT</primary>
   <secondary>RETURNING</secondary>
  </indexterm>

  <indexterm zone="dml-returning">
   <primary>UPDATE</primary>
   <secondary>RETURNING</secondary>
  </indexterm>

  <indexterm zone="dml-returning">
   <primary>DELETE</primary>
   <secondary>RETURNING</secondary>
  </indexterm>

  <indexterm zone="dml-returning">
   <primary>MERGE</primary>
   <secondary>RETURNING</secondary>
  </indexterm>

In doc/src/sgml/dml.sgml, what is the point of these? They are not rendered
in the HTML output; deleting them still generates all the HTML files.
--
The following part is about doc/src/sgml/plpgsql.sgml.


 The query used in this type of FOR
 statement can be any SQL command that returns rows to the caller:
 SELECT is the most common case,
 but you can also use INSERT, UPDATE, or
 DELETE with a RETURNING clause. Some utility
 commands such as EXPLAIN will work too.

here we need to add MERGE?
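For illustration, a minimal sketch of the FOR-loop usage in question
(hypothetical users/new_users tables; assumes the MERGE ... RETURNING patch,
including its plpgsql support, is applied):

DO $$
DECLARE
    r record;
BEGIN
    FOR r IN
        MERGE INTO users u
        USING new_users n ON u.id = n.id
        WHEN MATCHED THEN UPDATE SET name = n.name
        WHEN NOT MATCHED THEN INSERT (id, name) VALUES (n.id, n.name)
        RETURNING u.id, u.name
    LOOP
        RAISE NOTICE 'merged id=%, name=%', r.id, r.name;
    END LOOP;
END;
$$;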


   
Row-level triggers fired BEFORE can return null to signal the
trigger manager to skip the rest of the operation for this row
(i.e., subsequent triggers are not fired, and the
INSERT/UPDATE/DELETE does not occur for this row).  If a nonnull
here we need to add MERGE?


   
Variable substitution currently works only in SELECT,
INSERT, UPDATE,
DELETE, and commands containing one of
these (such as EXPLAIN and CREATE TABLE
... AS SELECT),
because the main SQL engine allows query parameters only in these
commands.  To use a non-constant name or value in other statement
types (generically called utility statements), you must construct
the utility statement as a string and EXECUTE it.
   
here we need to add MERGE?
demo:
CREATE OR REPLACE FUNCTION stamp_user2(id int, comment text) RETURNS void
AS $$
<<fn>>
DECLARE
curtime timestamp := now();
BEGIN
MERGE INTO users
USING (SELECT 1)
ON true
WHEN MATCHED and (users.id = stamp_user2.id) THEN
  update SET last_modified = fn.curtime, comment = stamp_user2.comment;
raise notice 'test';
END;
$$ LANGUAGE plpgsql;


INSTEAD OF triggers (which are always row-level triggers,
and may only be used on views) can return null to signal that they did
not perform any updates, and that the rest of the operation for this
row should be skipped (i.e., subsequent triggers are not fired, and the
row is not counted in the rows-affected status for the surrounding
INSERT/UPDATE/DELETE).
I am not sure we need to add MERGE. Maybe not.


Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-13 Thread Amit Kapila
On Thu, Mar 14, 2024 at 5:57 AM Masahiko Sawada  wrote:
>
> This fact makes me think that the slotsync worker might be able to
> accept the primary_conninfo value even if there is no dbname in the
> value. That is, if there is no dbname in the primary_conninfo, it uses
> the username in accordance with the specs of the connection string.
> Currently, the slotsync worker connects to the local database first
> and then establishes the connection to the primary server. But if we
> can reverse the two steps, it can get the dbname that has actually
> been used to establish the remote connection and use it for the local
> connection too. That way, the primary_conninfo generated by
> pg_basebackup could work even without the patch. For example, if the
> OS user executing pg_basebackup is 'postgres', the slotsync worker
> would connect to the postgres database. Given the 'postgres' database
> is created by default and the 'postgres' OS user is commonly used, I
> guess it could cover many cases in practice actually.
>

I think this is worth investigating, but I suspect that in most cases
users will end up using a replication connection without specifying
the user name, and we may not be able to give a meaningful error
message when the slotsync worker can't connect. The same will be true
even when a dbname equal to the username is used.

> Having said that, even with (or without) the above change, we might
> want to change the pg_basebackup so that it writes the dbname to the
> primary_conninfo if -R option is specified. Since the database where
> the slotsync worker connects cannot be dropped while the slotsync
> worker is running, the user might want to change the database to
> connect, and it would be useful if they can do that using
> pg_basebackup instead of modifying the configuration file manually.
>
> While the current approach makes sense to me, I'm a bit concerned that
> we might end up having the pg_basebackup search the actual database
> name (e.g. 'dbname=template1') from the .pgpass file instead of
> 'dbname=replication'. As far as I tested on my environment, suppose
> that I execute:
>
> pg_basebackup -D tmp -d "dbname=testdb" -R
>
> The pg_basebackup established a replication connection but looked for
> the password of the 'testdb' database. This could be another
> inconvenience for the existing users who want to use the slot
> synchronization.
>

This is true because it is internally using a logical replication
connection (as we will set replication=database). I feel the
mentioned behavior is expected with or without slotsync worker usage.
Anyway, this is an optional feature, so users can still choose not to
specify dbname in the connection string. The point is that commit
cca97ce6a6 allowed using dbname in the connection string in
pg_basebackup, and we are just extending that to write the dbname
along with other things in the connection info.

> A random idea I came up with is, we add a new option to the
> pg_basebackup to overwrite the full or some portion of the connection
> string that is eventually written in the primary_conninfo in
> postgresql.auto.conf. For example, the command:
>
> pg_basebackup -D tmp -d "host=1.1.1.1 port=" -R
> --primary-conninfo-ext "host=2.2.2.2 dbname=postgres"
>
> will produce the connection string that is based on -d option value
> but is overwritten by --primary-conninfo-ext option value, which will
> be like:
>
> host=2.2.2.2 dbname=postgres port=
>
> This option might help not only for users who want to use the slotsync
> worker but also for users who want to take a basebackup from a standby
> but have the new standby connect to the primary.
>

Agreed, this could be another way, though it would be good to get some
input from users about the preferred way to specify dbname. One can
also imagine using ALTER SYSTEM for this purpose.

> But it's still just an idea and I might be missing something. And
> given we're getting closer to the feature freeze, it would be a PG18
> item.
>

+1. At this stage, it is important to discuss whether we should allow
pg_basebackup to write dbname (either a specified one or a default one)
along with other parameters in primary_conninfo.

-- 
With Regards,
Amit Kapila.




Re: pg16: XX000: could not find pathkey item to sort

2024-03-13 Thread Ashutosh Bapat
On Thu, Mar 14, 2024 at 4:30 AM David Rowley  wrote:

> On Thu, 14 Mar 2024 at 06:00, Alexander Lakhin 
> wrote:
> > I've stumbled upon the same error, but this time it apparently has
> > another
> > cause. It can be produced (on REL_16_STABLE and master) as follows:
> > CREATE TABLE t (a int, b int) PARTITION BY RANGE (a);
> > CREATE TABLE td PARTITION OF t DEFAULT;
> > CREATE TABLE tp1 PARTITION OF t FOR VALUES FROM (1) TO (2);
> > SET enable_partitionwise_aggregate = on;
> > SET parallel_setup_cost = 0;
> > SELECT a, sum(b order by b) FROM t GROUP BY a ORDER BY a;
> >
> > ERROR:  could not find pathkey item to sort
> >
> > `git bisect` for this anomaly blames the same commit 1349d2790.
>
> Thanks for finding and for the recreator script.
>
> I've attached a patch which fixes the problem for me.
>
> On debugging this I uncovered some other stuff that looks broken which
> seems to be caused by partition-wise aggregates.  With your example
> query, in get_useful_pathkeys_for_relation(), we call
> relation_can_be_sorted_early() to check if the pathkey can be used as
> a set of pathkeys in useful_pathkeys_list.  The problem is that in
> your query the 'rel' is the base relation belonging to the partitioned
> table and relation_can_be_sorted_early() looks through the targetlist
> for that relation and finds columns "a" and "b" in there. The problem
> is "b" has been aggregated away as partial aggregation has taken place
> due to the partition-wise aggregation. I believe whichever rel we
> should be using there should have an Aggref in the target exprs rather
> than the plain unaggregated column.  I've added Robert and Ashutosh to
> see what their thoughts are on this.
>

I don't understand why root->query_pathkeys has both a and b. "a" is there
because of the GROUP BY and ORDER BY clauses. But why "b"?

Under the debugger this is what I observed: generate_useful_gather_paths()
gets called twice, once for the base relation and a second time for the
upper relation.

When it's called for the base relation, it includes both "a" and "b" in the
useful pathkeys. The plan doesn't use sortedness on b, but I don't think
the relation being used is the problem; it looks like root->query_pathkeys
containing "b" may be a problem.

When it's called for the upper relation, the reltarget has "a" and an
Aggref(), and it includes only "a" in the useful pathkeys, which is as per
your expectation.

-- 
Best Wishes,
Ashutosh Bapat


Re: Remove a FIXME and unused variables in Meson

2024-03-13 Thread Tristan Partin

On Thu Mar 14, 2024 at 12:15 AM CDT, Michael Paquier wrote:

On Thu, Mar 14, 2024 at 12:13:18AM -0500, Tristan Partin wrote:
> One of them adds version gates to two LLVM flags (-fwrapv,
> -fno-strict-aliasing). I believe we moved the minimum LLVM version recently,
> so these might not be necessary, but maybe it helps for historical reasons.
> If not, I'll just remove the comment in a different patch.
> 
> Second patch removes some unused variables. Were they analogous to things in
> autotools and the Meson portions haven't been added yet?
> 
> I was looking into adding LLVM JIT support to Meson since there is a TODO
> about it, but it wasn't clear what was missing except adding some variables
> into the PGXS Makefile.

It looks like you have forgotten to attach the patches.  :)


CLASSIC!

--
Tristan Partin
Neon (https://neon.tech)
From 615acd91d353defd7fbe136fd115515452d6d00b Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Thu, 14 Mar 2024 00:02:11 -0500
Subject: [PATCH v1 1/2] Protect adding llvm flags if found version is not
 enough

-fwrapv: https://github.com/llvm/llvm-project/commit/51924e517bd2d25faea6ef873db3c59ec4d09bf8
-fno-strict-aliasing: https://github.com/llvm/llvm-project/commit/10169b94cfe6838f881339f1944891f6d8451174
---
 src/backend/jit/llvm/meson.build | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/src/backend/jit/llvm/meson.build b/src/backend/jit/llvm/meson.build
index 41c759f73c5..0e1de7bc98e 100644
--- a/src/backend/jit/llvm/meson.build
+++ b/src/backend/jit/llvm/meson.build
@@ -58,8 +58,13 @@ else
 endif
 
 
-# XXX: Need to determine proper version of the function cflags for clang
-bitcode_cflags = ['-fno-strict-aliasing', '-fwrapv']
+bitcode_cflags = []
+if llvm.version().version_compare('>=2.9.0')
+  bitcode_cflags += ['-fno-strict-aliasing']
+endif
+if llvm.version().version_compare('>=2.8.0')
+  bitcode_cflags += ['-fwrapv']
+endif
 if llvm.version().version_compare('=15.0')
   bitcode_cflags += ['-Xclang', '-no-opaque-pointers']
 endif
-- 
Tristan Partin
Neon (https://neon.tech)

From 4e5541352cd582c7c7320d02c075cf7a95c6bfda Mon Sep 17 00:00:00 2001
From: Tristan Partin 
Date: Thu, 14 Mar 2024 00:05:38 -0500
Subject: [PATCH v1 2/2] Remove some unused variables in Meson

They weren't being used, so don't assign to them.
---
 meson.build | 5 -
 src/backend/meson.build | 1 -
 2 files changed, 6 deletions(-)

diff --git a/meson.build b/meson.build
index 85788f9dd8f..b0a45a7d834 100644
--- a/meson.build
+++ b/meson.build
@@ -202,7 +202,6 @@ if host_system == 'cygwin'
   dlsuffix = '.dll'
   mod_link_args_fmt = ['@0@']
   mod_link_with_name = 'lib@0@.exe.a'
-  mod_link_with_dir = 'libdir'
 
 elif host_system == 'darwin'
   dlsuffix = '.dylib'
@@ -212,7 +211,6 @@ elif host_system == 'darwin'
   export_fmt = '-Wl,-exported_symbols_list,@0@'
 
   mod_link_args_fmt = ['-bundle_loader', '@0@']
-  mod_link_with_dir = 'bindir'
   mod_link_with_name = '@0@'
 
   sysroot_args = [files('src/tools/darwin_sysroot'), get_option('darwin_sysroot')]
@@ -269,7 +267,6 @@ elif host_system == 'windows'
 mod_link_with_name = 'lib@0@.exe.a'
   endif
   mod_link_args_fmt = ['@0@']
-  mod_link_with_dir = 'libdir'
 
   shmem_kind = 'win32'
   sema_kind = 'win32'
@@ -488,10 +485,8 @@ dir_pgxs = dir_lib_pkg / 'pgxs'
 dir_include = get_option('includedir')
 
 dir_include_pkg = dir_include
-dir_include_pkg_rel = ''
 if not (dir_prefix_contains_pg or dir_include_pkg.contains('pgsql') or dir_include_pkg.contains('postgres'))
   dir_include_pkg = dir_include_pkg / pkg
-  dir_include_pkg_rel = pkg
 endif
 
 dir_man = get_option('mandir')
diff --git a/src/backend/meson.build b/src/backend/meson.build
index 436c04af080..d2c10d1abd7 100644
--- a/src/backend/meson.build
+++ b/src/backend/meson.build
@@ -154,7 +154,6 @@ if mod_link_args_fmt.length() > 0
 
   name = mod_link_with_name.format('postgres')
   link_with_uninst = meson.current_build_dir() / name
-  link_with_inst = '${@0@}/@1@'.format(mod_link_with_dir, name)
 
   foreach el : mod_link_args_fmt
 pg_mod_link_args += el.format(link_with_uninst)
-- 
Tristan Partin
Neon (https://neon.tech)



Re: Remove a FIXME and unused variables in Meson

2024-03-13 Thread Michael Paquier
On Thu, Mar 14, 2024 at 12:13:18AM -0500, Tristan Partin wrote:
> One of them adds version gates to two LLVM flags (-fwrapv,
> -fno-strict-aliasing). I believe we moved the minimum LLVM version recently,
> so these might not be necessary, but maybe it helps for historical reasons.
> If not, I'll just remove the comment in a different patch.
> 
> Second patch removes some unused variables. Were they analogous to things in
> autotools and the Meson portions haven't been added yet?
> 
> I was looking into adding LLVM JIT support to Meson since there is a TODO
> about it, but it wasn't clear what was missing except adding some variables
> into the PGXS Makefile.

It looks like you have forgotten to attach the patches.  :)
--
Michael


signature.asc
Description: PGP signature


Remove a FIXME and unused variables in Meson

2024-03-13 Thread Tristan Partin

Two meson patches.

One of them adds version gates to two LLVM flags (-fwrapv, 
-fno-strict-aliasing). I believe we moved the minimum LLVM version 
recently, so these might not be necessary, but maybe it helps for 
historical reasons. If not, I'll just remove the comment in a different 
patch.


Second patch removes some unused variables. Were they analogous to 
things in autotools and the Meson portions haven't been added yet?


I was looking into adding LLVM JIT support to Meson since there is 
a TODO about it, but it wasn't clear what was missing except adding some 
variables into the PGXS Makefile.


--
Tristan Partin
Neon (https://neon.tech)




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 1:29 PM John Naylor  wrote:
>
> On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada  wrote:
> >
> > On Thu, Mar 14, 2024 at 9:59 AM John Naylor  wrote:
> > > > BTW do we still want to test the tidstore by using a combination of
> > > > SQL functions? We might no longer need to input TIDs via a SQL
> > > > function.
> > >
> > > I'm not sure. I stopped short of doing that to get feedback on this
> > > much. One advantage with SQL functions is we can use generate_series
> > > to easily input lists of blocks with different numbers and strides,
> > > and array literals for offsets are a bit easier. What do you think?
> >
> > While I'm not a fan of the following part, I agree that it makes sense
> > to use SQL functions for test data generation:
> >
> > -- Constant values used in the tests.
> > \set maxblkno 4294967295
> > -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 
> > 291.
> > -- We use a higher number to test tidstore.
> > \set maxoffset 512
>
> I'm not really a fan of these either, and they could be removed at some
> point if we've done everything else nicely.
>
> > It would also be easier for developers to test the tidstore with their
> > own data set. So I agreed with the current approach; use SQL functions
> > for data generation and do the actual tests inside C functions.
>
> Okay, here's another idea: Change test_lookup_tids() to be more
> general and put the validation down into C as well. First we save the
> blocks from do_set_block_offsets() into a table, then with all those
> blocks lookup a sufficiently-large range of possible offsets and save
> found values in another array. So the static items structure would
> have 3 arrays: inserts, successful lookups, and iteration (currently
> the iteration output is private to check_set_block_offsets()). Then
> sort as needed and check they are all the same.

That's a promising idea. We can use the same mechanism for randomized
tests too. If you're going to work on this, I'll do other tests on my
environment in the meantime.

>
> Further thought: We may not really need to test block numbers that
> vigorously, since the radix tree tests should cover keys/values pretty
> well.

Agreed. Probably boundary block numbers: 0, 1, MaxBlockNumber - 1, and
MaxBlockNumber, would be sufficient.
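For instance, a sketch of such boundary calls (illustrative only; assumes
the test module's do_set_block_offsets(block bigint, offsets int2[]);
MaxBlockNumber is 0xFFFFFFFE, i.e. 4294967294):

SELECT do_set_block_offsets(blk, ARRAY[1, 512]::int2[])
FROM unnest(ARRAY[0, 1, 4294967293, 4294967294]::bigint[]) AS blk;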

>  The difference here is using bitmaps of tids and that should be
> well covered.

Right. We would need to test offset numbers vigorously instead.

>
> Locally (not CI), we should try big inputs to make sure we can
> actually go up to many GB -- it's easier and faster this way than
> having vacuum give us a large data set.

I'll do these tests.

>
> > Is it
> > convenient for developers if we have functions like generate_tids()
> > and generate_random_tids() to generate TIDs so that they can pass them
> > to do_set_block_offsets()?
>
> I guess I don't see the advantage of adding a layer of indirection at
> this point, but it could be useful at a later time.

Agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Can Execute commands for different portals interleave?

2024-03-13 Thread Evgeny Smirnov
Greetings!

The question (a short version): is it possible for a client to send two
selects in the same transaction using the extended query protocol (without
declaring cursors) and pull rows simultaneously by means of interleaving
portal names and restricting the fetch size in Execute messages?


The question (a long version with a motivation):

A PostgreSQL backend is capable of operating multiple portals within a
transaction and switching back and forth between them. For instance, the
following sequence (issued from a Kotlin application via the r2dbc driver,
not from psql)


```

// The table users_Fetch contains users with ids between 1 and 20

BEGIN


DECLARE fetch_test1 SCROLL CURSOR FOR SELECT userId FROM users_Fetch;

DECLARE fetch_test2 SCROLL CURSOR FOR SELECT userId FROM users_Fetch;


MOVE FORWARD 3 FROM fetch_test1;

FETCH FORWARD 5 FROM fetch_test1;

FETCH FORWARD 5 FROM fetch_test2;


select userId from users_Fetch;


FETCH BACKWARD 5 FROM fetch_test1;

FETCH FORWARD 5 FROM fetch_test2;


COMMIT;

```


results in an expected outcome:

```

4, 5, 6, 7, 8,    // MOVE FORWARD 3 FROM fetch_test1; FETCH FORWARD 5 FROM fetch_test1;

1, 2, 3, 4, 5,    // FETCH FORWARD 5 FROM fetch_test2;

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,    // select userId from users_Fetch;

7, 6, 5, 4, 3,    // FETCH BACKWARD 5 FROM fetch_test1;

6, 7, 8, 9, 10,    // FETCH FORWARD 5 FROM fetch_test2;


```


Is the same possible for conventional selects issued with extended query
protocol? From the protocol perspective it would result in the following
traffic:


```

231 53111 5432 PGSQL 109 >Q ———> BEGIN

232 5432 53111 TCP 56 5432 → 53111 [ACK] Seq=1 Ack=54 Win=6371 Len=0
TSval=2819351776 TSecr=589492423

237 5432 53111 PGSQL 73 P/B/D/E/H ———> select * from …; bind B_1; execute
B_1, fetch 2 rows; flush

240 5432 53111 TCP 56 5432 → 53111 [ACK] Seq=18 Ack=274 Win=6368 Len=0
TSval=2819351793 TSecr=589492440

245 5432 53111 PGSQL 552 <1/2/T/D/D/s ———> Data, Data, Portal suspended

…

// Then the same sequence for another prepared statement and portal (let's
say B_2), but without a limit in the Execute command and a Sync at the end.

…

// Then the client proceeds with B_1 till the completion

270 53111 5432 PGSQL 69 > E ———> execute B_1, fetch 2 rows,

271 5432 53111 TCP 56 5432 → 53111 [ACK] Seq=925 Ack=323 Win=6367 Len=0
TSval=2819351846 TSecr=589492493

272 53111 5432 PGSQL 61 >H ———> Flush

274 5432 53111 TCP 56 5432 → 53111 [ACK] Seq=925 Ack=328 Win=6367 Len=0
TSval=2819351846 TSecr=589492493

282 5432 53111 PGSQL 144  Command completion

283 53111 5432 TCP 56 53111 → 5432 [ACK] Seq=328 Ack=1013 Win=6351 Len=0
TSval=589492496 TSecr=2819351849

284 53111 5432 PGSQL 66 >C ———> Close B_1

285 5432 53111 TCP 56 5432 → 53111 [ACK] Seq=1013 Ack=338 Win=6367 Len=0
TSval=2819351849 TSecr=589492496

286 53111 5432 PGSQL 61 >S ———> Sync

287 5432 53111 TCP 56 5432 → 53111 [ACK] Seq=1013 Ack=343 Win=6366 Len=0
TSval=2819351849 TSecr=589492496

293 5432 53111 PGSQL 67 <3/Z

294 53111 5432 TCP 56 53111 → 5432 [ACK] Seq=343 Ack=1024 Win=6351 Len=0
TSval=589492498 TSecr=2819351851

295 53111 5432 PGSQL 68 >Q ———> COMMIT

```


I’m interested because such communication is intrinsic to r2dbc scenarios
like this:

```

val usersWithAccounts = Flux.defer {
    // Select all users
    databaseClient.sql("select * from users where userId >= $1 and userId <= $2")
        .bind("$1", 1)
        .bind("$2", 255)
        .flatMap { r -> r.map { row, meta -> … } }
        .flatMap { user ->
            // For each user select all its accounts
            databaseClient.sql("select login from accounts where userId=$1 limit 1")
                .bind("$1", user.id)
                .flatMap { r -> r.map { row, meta -> … } }
                .reduce …
        }
}.`as`(transactionalOperator::transactional)

```

which results in a failure owing to inner requests building up a queue
inside the driver (due to the inability to suspend a limitless Execute for
"select * from users…").


Thanks!


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-13 Thread Dilip Kumar
On Wed, Mar 13, 2024 at 9:25 PM Robert Haas  wrote:
>
> On Wed, Mar 13, 2024 at 11:39 AM Dilip Kumar  wrote:
> > > Andres already commented on the snapshot stuff on an earlier patch
> > > version, and that's much nicer with this version. However, I don't
> > > understand why a parallel bitmap heap scan needs to do anything at all
> > > with the snapshot, even before these patches. The parallel worker
> > > infrastructure already passes the active snapshot from the leader to the
> > > parallel worker. Why does bitmap heap scan code need to do that too?
> >
> > Yeah thinking on this now it seems you are right that the parallel
> > infrastructure is already passing the active snapshot so why do we
> > need it again.  Then I checked other low scan nodes like indexscan and
> > seqscan and it seems we are doing the same things there as well.
> > Check for SerializeSnapshot() in table_parallelscan_initialize() and
> > index_parallelscan_initialize() which are being called from
> > ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM()
> > respectively.
>
> I remember thinking about this when I was writing very early parallel
> query code. It seemed to me that there must be some reason why the
> EState has a snapshot, as opposed to just using the active snapshot,
> and so I took care to propagate that snapshot, which is used for the
> leader's scans, to the worker scans also. Now, if the EState doesn't
> need to contain a snapshot, then all of that mechanism is unnecessary,
> but I don't see how it can be right for the leader to do
> table_beginscan() using estate->es_snapshot and the worker to use the
> active snapshot.

Yeah, that's a very valid point. So I think now Heikki/Melanie might
have got an answer to their question about the thought process behind
serializing the snapshot for each scan node. And the same approach is
followed for the BitmapHeap node as well.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread John Naylor
On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada  wrote:
>
> On Thu, Mar 14, 2024 at 9:59 AM John Naylor  wrote:
> > > BTW do we still want to test the tidstore by using a combination of
> > > SQL functions? We might no longer need to input TIDs via a SQL
> > > function.
> >
> > I'm not sure. I stopped short of doing that to get feedback on this
> > much. One advantage with SQL functions is we can use generate_series
> > to easily input lists of blocks with different numbers and strides,
> > and array literals for offsets are a bit easier. What do you think?
>
> While I'm not a fan of the following part, I agree that it makes sense
> to use SQL functions for test data generation:
>
> -- Constant values used in the tests.
> \set maxblkno 4294967295
> -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 
> 291.
> -- We use a higher number to test tidstore.
> \set maxoffset 512

I'm not really a fan of these either, and they could be removed at some
point if we've done everything else nicely.

> It would also be easier for developers to test the tidstore with their
> own data set. So I agreed with the current approach; use SQL functions
> for data generation and do the actual tests inside C functions.

Okay, here's another idea: Change test_lookup_tids() to be more
general and put the validation down into C as well. First we save the
blocks from do_set_block_offsets() into a table, then with all those
blocks lookup a sufficiently-large range of possible offsets and save
found values in another array. So the static items structure would
have 3 arrays: inserts, successful lookups, and iteration (currently
the iteration output is private to check_set_block_offsets()). Then
sort as needed and check they are all the same.

Further thought: We may not really need to test block numbers that
vigorously, since the radix tree tests should cover keys/values pretty
well. The difference here is using bitmaps of tids and that should be
well covered.

Locally (not CI), we should try big inputs to make sure we can
actually go up to many GB -- it's easier and faster this way than
having vacuum give us a large data set.

> Is it
> convenient for developers if we have functions like generate_tids()
> and generate_random_tids() to generate TIDs so that they can pass them
> to do_set_block_offsets()?

I guess I don't see the advantage of adding a layer of indirection at
this point, but it could be useful at a later time.




Typos in reorderbuffer.c.

2024-03-13 Thread Kyotaro Horiguchi
Hello.

While examining reorderbuffer.c, I found several typos. I'm not sure
if fixing them is worthwhile, but I've attached a fix just in case.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 001f901ee6..92cf39ff74 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -32,7 +32,7 @@
  *
  *	  In order to cope with large transactions - which can be several times as
  *	  big as the available memory - this module supports spooling the contents
- *	  of a large transactions to disk. When the transaction is replayed the
+ *	  of large transactions to disk. When the transaction is replayed the
  *	  contents of individual (sub-)transactions will be read from disk in
  *	  chunks.
  *
@@ -3078,10 +3078,10 @@ ReorderBufferImmediateInvalidation(ReorderBuffer *rb, uint32 ninvalidations,
  * least once for every xid in XLogRecord->xl_xid (other places in records
  * may, but do not have to be passed through here).
  *
- * Reorderbuffer keeps some datastructures about transactions in LSN order,
+ * Reorderbuffer keeps some data structures about transactions in LSN order,
  * for efficiency. To do that it has to know about when transactions are seen
  * first in the WAL. As many types of records are not actually interesting for
- * logical decoding, they do not necessarily pass though here.
+ * logical decoding, they do not necessarily pass through here.
  */
 void
 ReorderBufferProcessXid(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn)
@@ -3513,11 +3513,11 @@ ReorderBufferLargestTXN(ReorderBuffer *rb)
  * snapshot because we don't decode such transactions.  Also, we do not select
  * the transaction which doesn't have any streamable change.
  *
- * Note that, we skip transactions that contains incomplete changes. There
+ * Note that, we skip transactions that contain incomplete changes. There
  * is a scope of optimization here such that we can select the largest
  * transaction which has incomplete changes.  But that will make the code and
  * design quite complex and that might not be worth the benefit.  If we plan to
- * stream the transactions that contains incomplete changes then we need to
+ * stream the transactions that contain incomplete changes then we need to
  * find a way to partially stream/truncate the transaction changes in-memory
  * and build a mechanism to partially truncate the spilled files.
  * Additionally, whenever we partially stream the transaction we need to
@@ -4701,8 +4701,8 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
 }
 
 /*
- * Rejigger change->newtuple to point to in-memory toast tuples instead to
- * on-disk toast tuples that may not longer exist (think DROP TABLE or VACUUM).
+ * Rejigger change->newtuple to point to in-memory toast tuples instead of
+ * on-disk toast tuples that may no longer exist (think DROP TABLE or VACUUM).
  *
  * We cannot replace unchanged toast tuples though, so those will still point
  * to on-disk toast data.
@@ -4713,7 +4713,7 @@ ReorderBufferToastAppendChunk(ReorderBuffer *rb, ReorderBufferTXN *txn,
  * at unexpected times.
  *
  * We simply subtract size of the change before rejiggering the tuple, and
- * then adding the new size. This makes it look like the change was removed
+ * then add the new size. This makes it look like the change was removed
  * and then added back, except it only tweaks the accounting info.
  *
  * In particular it can't trigger serialization, which would be pointless


Inconsistent printf placeholders

2024-03-13 Thread Kyotaro Horiguchi
Hello.

A recent commit 6612185883 introduced two error messages that are
identical in text but differ in their placeholders.

-   pg_fatal("could not read file \"%s\": read only %d of 
%d bytes",
-filename, (int) rb, (int) st.st_size);
+   pg_fatal("could not read file \"%s\": read only %zd of 
%lld bytes",
+filename, rb, (long long int) 
st.st_size);
...
-   pg_fatal("could not read file \"%s\": read only %d of 
%d bytes",
+   pg_fatal("could not read file \"%s\": read only %d of 
%u bytes",
 rf->filename, rb, length);

I'd be happy if the two messages were kept consistent. I suggest aligning
the types instead of making the messages different, as attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/pg_combinebackup/reconstruct.c b/src/bin/pg_combinebackup/reconstruct.c
index 41f06bb26b..e3b8e84289 100644
--- a/src/bin/pg_combinebackup/reconstruct.c
+++ b/src/bin/pg_combinebackup/reconstruct.c
@@ -504,15 +504,15 @@ make_rfile(char *filename, bool missing_ok)
 static void
 read_bytes(rfile *rf, void *buffer, unsigned length)
 {
-	int			rb = read(rf->fd, buffer, length);
+	ssize_t		rb = read(rf->fd, buffer, length);
 
 	if (rb != length)
 	{
 		if (rb < 0)
 			pg_fatal("could not read file \"%s\": %m", rf->filename);
 		else
-			pg_fatal("could not read file \"%s\": read only %d of %u bytes",
-	 rf->filename, rb, length);
+			pg_fatal("could not read file \"%s\": read only %zd of %lld bytes",
+	 rf->filename, rb, (long long int) length);
 	}
 }
 


Re: Recent 027_streaming_regress.pl hangs

2024-03-13 Thread Thomas Munro
On Thu, Mar 14, 2024 at 3:27 PM Michael Paquier  wrote:
> Hmm.  Perhaps 8af25652489?  That looks like the closest thing in the
> list that could have played with the way WAL is generated, hence
> potentially impacting the records that are replayed.

Yeah, I was wondering if its checkpoint delaying logic might have
got the checkpointer jammed or something like that, but I don't
currently see how.  Yeah, the replay of bulk newpages could be
relevant, but it's not exactly new technology.  One thing I wondered
about is whether the Perl "wait for catchup" thing, which generates
large volumes of useless log, could be somehow changed to actually
show the progress after some time.  Something like "I'm still waiting
for this replica to reach LSN X, but it has so far only reported LSN
Y, and here's a dump of the WAL around there"?

> 93db6cbda03, efa70c15c74 and 818fefd8fd4 came to mind, but they touch
> unrelated territory.

Hmm.




Re: Improve eviction algorithm in ReorderBuffer

2024-03-13 Thread Masahiko Sawada
On Wed, Mar 13, 2024 at 11:23 AM Peter Smith  wrote:
>
> On Wed, Mar 13, 2024 at 12:48 PM Masahiko Sawada  
> wrote:
> >
> > On Wed, Mar 13, 2024 at 10:15 AM Peter Smith  wrote:
> > >
> > > On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  
> > > > wrote:
> > > > >
> > > ...
> > > > > > > 5.
> > > > > > > + *
> > > > > > > + * If 'indexed' is true, we create a hash table to track of each 
> > > > > > > node's
> > > > > > > + * index in the heap, enabling to perform some operations such 
> > > > > > > as removing
> > > > > > > + * the node from the heap.
> > > > > > >   */
> > > > > > >  binaryheap *
> > > > > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, 
> > > > > > > void *arg)
> > > > > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > > > > > + bool indexed, void *arg)
> > > > > > >
> > > > > > > BEFORE
> > > > > > > ... enabling to perform some operations such as removing the node 
> > > > > > > from the heap.
> > > > > > >
> > > > > > > SUGGESTION
> > > > > > > ... to help make operations such as removing nodes more efficient.
> > > > > > >
> > > > > >
> > > > > > But these operations literally require the indexed binary heap as we
> > > > > > have an assertion:
> > > > > >
> > > > > > void
> > > > > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > > > > > {
> > > > > > bh_nodeidx_entry *ent;
> > > > > >
> > > > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > > > > > Assert(heap->bh_indexed);
> > > > > >
> > > > >
> > > > > I didn’t quite understand -- the operations mentioned are "operations
> > > > > such as removing the node", but binaryheap_remove_node() also removes
> > > > > a node from the heap. So I still felt the comment wording of the patch
> > > > > is not quite correct.
> > > >
> > > > Now I understand your point. That's a valid point.
> > > >
> > > > >
> > > > > Now, if the removal of a node from an indexed heap can *only* be done
> > > > > using binaryheap_remove_node_ptr() then:
> > > > > - the other removal functions (binaryheap_remove_*) probably need some
> > > > > comments to make sure nobody is tempted to call them directly for an
> > > > > indexed heap.
> > > > > - maybe some refactoring and assertions are needed to ensure those
> > > > > *cannot* be called directly for an indexed heap.
> > > > >
> > > >
> > > > If the 'index' is true, the caller can not only use the existing
> > > > functions but also newly added functions such as
> > > > binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
> > > > something like below?
> > > >
> > >
> > > You said: "can not only use the existing functions but also..."
> > >
> > > Hmm. Is that right? IIUC those existing "remove" functions should NOT
> > > be called directly if the heap was "indexed" because they'll delete
> > > the node from the heap OK, but any corresponding index for that
> > > deleted node will be left lying around -- i.e. everything gets out of
> > > sync. This was the reason for my original concern.
> > >
> >
> > All existing binaryheap functions should be available even if the
> > binaryheap is 'indexed'. For instance, with the patch,
> > binaryheap_remove_node() is:
> >
> > void
> > binaryheap_remove_node(binaryheap *heap, int n)
> > {
> > int cmp;
> >
> > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > Assert(n >= 0 && n < heap->bh_size);
> >
> > /* compare last node to the one that is being removed */
> > cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
> >heap->bh_nodes[n],
> >heap->bh_arg);
> >
> > /* remove the last node, placing it in the vacated entry */
> > replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
> >
> > /* sift as needed to preserve the heap property */
> > if (cmp > 0)
> > sift_up(heap, n);
> > else if (cmp < 0)
> > sift_down(heap, n);
> > }
> >
> > The replace_node(), sift_up() and sift_down() update node's index as
> > well if the binaryheap is indexed. When deleting the node from the
> > binaryheap, it will also delete its index from the hash table.
> >
>
> I see now. Thanks for the information.
>
> ~~~
>
> Some more review comments for v8-0002
>
> ==
>
> 1.
> +/*
> + * Remove the node's index from the hash table if the heap is indexed.
> + */
> +static bool
> +delete_nodeidx(binaryheap *heap, bh_node_type node)
> +{
> + if (!binaryheap_indexed(heap))
> + return false;
> +
> + return bh_nodeidx_delete(heap->bh_nodeidx, node);
> +}
>
> I wasn't sure if having this function was a good idea. Yes, it makes
> code more readable, but I felt the heap code ought to be as efficient
> as possible so maybe it is better for the index check to be done at
> the caller, instead of incurring any overhead of function calls that
> might do nothing.
>
> SUGGESTION
> if 

Re: Improve eviction algorithm in ReorderBuffer

2024-03-13 Thread Masahiko Sawada
On Mon, Mar 11, 2024 at 3:04 PM Peter Smith  wrote:
>
> Here are some review comments for v8-0003
>
> ==
> 0. GENERAL -- why the state enum?
>
> This patch introduced a new ReorderBufferMemTrackState with 2 states
> (REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
> REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP)
>
> It's the same as having a boolean flag OFF/ON, so I didn't see any
> benefit of the enum instead of a simple boolean flag like
> 'track_txn_sizes'.
>
> NOTE: Below in this post (see #11) I would like to propose another
> idea, which can simplify much further, eliminating the need for the
> state boolean. If adopted that will impact lots of these other review
> comments.

Good point! We used to use three states in the earlier version patch
but now that we have only two we don't necessarily need to use an
enum. I've used your idea.

>
> ==
> Commit Message
>
> 1.
> Previously, when selecting the transaction to evict during logical
> decoding, we check all transactions to find the largest
> transaction. Which could lead to a significant replication lag
> especially in case where there are many subtransactions.
>
> ~
>
> /Which could/This could/
>
> /in case/in the case/

Fixed.

>
> ==
> .../replication/logical/reorderbuffer.c
>
> 2.
>  *   We use a max-heap with transaction size as the key to efficiently find
>  *   the largest transaction. The max-heap state is managed in two states:
>  *   REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and
> REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP.
>
> /The max-heap state is managed in two states:/The max-heap is managed
> in two states:/

This part is removed.

>
> ~~~
>
> 3.
> +/*
> + * Threshold of the total number of top-level and sub transactions
> that controls
> + * whether we switch the memory track state. While using max-heap to select
> + * the largest transaction is effective when there are many transactions 
> being
> + * decoded, in many systems there is generally no need to use it as long as 
> all
> + * transactions being decoded are top-level transactions. Therefore, we use
> + * MaxConnections as the threshold* so we can prevent switch to the
> state unless
> + * we use subtransactions.
> + */
> +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections
>
> 3a.
> /memory track state./memory tracking state./
>
> /While using max-heap/Although using max-heap/
>
> "in many systems" (are these words adding anything?)
>
> /threshold*/threshold/
>
> /so we can prevent switch/so we can prevent switching/
>

Fixed.

> ~
>
> 3b.
> There's nothing really in this name to indicate the units of the
> threshold. Consider if there is some more informative name for this
> macro: e.g.
> MAXHEAP_TX_COUNT_THRESHOLD (?)

Fixed.

>
> ~~~
>
> 4.
> + /*
> + * Don't start with a lower number than
> + * REORDER_BUFFER_MEM_TRACK_THRESHOLD, since we add at least
> + * REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once.
> + */
> + buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
> + buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 
> 2,
> +ReorderBufferTXNSizeCompare,
> +true, NULL);
> +
>
> IIUC the comment intends to say:
>
> Allocate the initial heap size greater than THRESHOLD because the
> txn_heap will not be used until the threshold is exceeded.
>
> Also, maybe the comment should make a point of saying "Note: the
> binary heap is INDEXED for faster manipulations". or something
> similar.
>

Fixed.

> ~~~
>
> 5.
>  static void
>  ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb,
>   ReorderBufferChange *change,
> + ReorderBufferTXN *txn,
>   bool addition, Size sz)
>  {
> - ReorderBufferTXN *txn;
>   ReorderBufferTXN *toptxn;
>
> - Assert(change->txn);
> -
>
> There seems some trick now where the passed 'change' could be NULL,
> which was not possible before. e.g., when change is NULL then 'txn' is
> not NULL, and vice versa. Some explanation about this logic and the
> meaning of these parameters should be written in this function
> comment.

Added comments.

>
> ~
>
> 6.
> + txn = txn != NULL ? txn : change->txn;
>
> IMO it's more natural to code the ternary using the same order as the
> parameters:
>
> e.g. txn = change ? change->txn : txn;

I see your point. I changed it to:

if (txn == NULL)
txn = change->txn;

so we don't change txn if it's not NULL.

>
> ~~~
>
> 7.
> /*
>  * Build the max-heap and switch the state. We will run a heap assembly step
>  * at the end, which is more efficient.
>  */
> static void
> ReorderBufferBuildMaxHeap(ReorderBuffer *rb)
>
> /We will run a heap assembly step at the end, which is more
> efficient./The heap assembly step is deferred until the end, for
> efficiency./

Fixed.

>
> ~~~
>
> 8. ReorderBufferLargestTXN
>
> + if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD)
> + {
> + HASH_SEQ_STATUS hash_seq;
> + ReorderBufferTXNByIdEnt *ent;
> +
> + hash_seq_init(&hash_seq, rb->by_txn);
> + while ((ent = hash_seq_search(&hash_seq)) != NULL)
> + {
> + ReorderBufferTXN 

Re: ERROR: error triggered for injection point gin-leave-leaf-split-incomplete

2024-03-13 Thread Tom Lane
Thomas Munro  writes:
> I noticed 3 regression test failures like $SUBJECT in cfbot runs for
> unrelated patches that probably shouldn't affect GIN, so I guess this
> is probably a problem in master.

Hmm ... I have no insight on what's causing this, but "error triggered
for" is about as content-free an error message as I've ever seen.
Even granting that it's developer-directed, it's still content-free:
we already know it's an error, and something must have triggered that,
but you're offering nothing about what.  Can't we do better than that?

regards, tom lane




Re: Recent 027_streaming_regress.pl hangs

2024-03-13 Thread Michael Paquier
On Thu, Mar 14, 2024 at 03:00:28PM +1300, Thomas Munro wrote:
> Assuming it is due to a commit in master, and given the failure
> frequency, I think it is very likely to be a change from this 3 day
> window of commits, and more likely in the top half dozen or so:
> 
> d360e3cc60e Fix compiler warning on typedef redeclaration
> 8af25652489 Introduce a new smgr bulk loading facility.
> e612384fc78 Fix mistake in SQL features list
> d13ff82319c Fix BF failure in commit 93db6cbda0.
> efa70c15c74 Make GetSlotInvalidationCause() return RS_INVAL_NONE on
> unexpected input
> 93db6cbda03 Add a new slot sync worker to synchronize logical slots.
> 3d47b75546d pgindent fix
> b6df0798a5e Fix the intermittent buildfarm failures in 031_column_list.
> fbc93b8b5f5 Remove custom Constraint node read/write implementations
> 801792e528d Improve ERROR/LOG messages added by commits ddd5f4f54a and
> 7a424ece48.
> 011d60c4352 Speed up uuid_out() by not relying on a StringInfo
> 943f7ae1c86 Add lookup table for replication slot conflict reasons
> 28f3915b73f Remove superfluous 'pgprocno' field from PGPROC
> 4989ce72644 MERGE ... DO NOTHING: require SELECT privileges
> ed345c2728b Fix typo
> 690805ca754 doc: Fix link to pg_ident_file_mappings view
> ff9e1e764fc Add option force_initdb to PostgreSQL::Test::Cluster:init()
> 75bcba6cbd2 Remove extra check_stack_depth() from dropconstraint_internal()
> fcd210d496d Doc: improve explanation of type interval, especially extract().
> 489072ab7a9 Replace relids in lateral subquery parse tree during SJE
> 74563f6b902 Revert "Improve compression and storage support with inheritance"
> d2ca9a50b5b Minor corrections for partition pruning
> 818fefd8fd4 Fix race leading to incorrect conflict cause in
> InvalidatePossiblyObsoleteSlot()
> 01ec4d89b91 doc: Use system-username instead of system-user
> 
> I just haven't got a specific theory yet, as the logs are empty.  I
> wonder if some kind of failures could start firing signals around to
> get us a stack.

Thanks for providing this list and an analysis. 

Hmm.  Perhaps 8af25652489?  That looks like the closest thing in the
list that could have played with the way WAL is generated, hence
potentially impacting the records that are replayed.

93db6cbda03, efa70c15c74 and 818fefd8fd4 came to mind, but they touch
unrelated territory.
--
Michael


signature.asc
Description: PGP signature


RE: Synchronizing slots from primary to standby

2024-03-13 Thread Zhijie Hou (Fujitsu)
Hi,

Since the standby_slot_names patch has been committed, I am attaching the last
doc patch for review.

Best Regards,
Hou zj


v109-0001-Document-the-steps-to-check-if-the-standby-is-r.patch
Description:  v109-0001-Document-the-steps-to-check-if-the-standby-is-r.patch


Re: Recent 027_streaming_regress.pl hangs

2024-03-13 Thread Thomas Munro
On Wed, Mar 13, 2024 at 10:53 AM Thomas Munro  wrote:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2024-02-23%2015%3A44%3A35

Assuming it is due to a commit in master, and given the failure
frequency, I think it is very likely to be a change from this 3 day
window of commits, and more likely in the top half dozen or so:

d360e3cc60e Fix compiler warning on typedef redeclaration
8af25652489 Introduce a new smgr bulk loading facility.
e612384fc78 Fix mistake in SQL features list
d13ff82319c Fix BF failure in commit 93db6cbda0.
efa70c15c74 Make GetSlotInvalidationCause() return RS_INVAL_NONE on
unexpected input
93db6cbda03 Add a new slot sync worker to synchronize logical slots.
3d47b75546d pgindent fix
b6df0798a5e Fix the intermittent buildfarm failures in 031_column_list.
fbc93b8b5f5 Remove custom Constraint node read/write implementations
801792e528d Improve ERROR/LOG messages added by commits ddd5f4f54a and
7a424ece48.
011d60c4352 Speed up uuid_out() by not relying on a StringInfo
943f7ae1c86 Add lookup table for replication slot conflict reasons
28f3915b73f Remove superfluous 'pgprocno' field from PGPROC
4989ce72644 MERGE ... DO NOTHING: require SELECT privileges
ed345c2728b Fix typo
690805ca754 doc: Fix link to pg_ident_file_mappings view
ff9e1e764fc Add option force_initdb to PostgreSQL::Test::Cluster:init()
75bcba6cbd2 Remove extra check_stack_depth() from dropconstraint_internal()
fcd210d496d Doc: improve explanation of type interval, especially extract().
489072ab7a9 Replace relids in lateral subquery parse tree during SJE
74563f6b902 Revert "Improve compression and storage support with inheritance"
d2ca9a50b5b Minor corrections for partition pruning
818fefd8fd4 Fix race leading to incorrect conflict cause in
InvalidatePossiblyObsoleteSlot()
01ec4d89b91 doc: Use system-username instead of system-user

I just haven't got a specific theory yet, as the logs are empty.  I
wonder if some kind of failures could start firing signals around to
get us a stack.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread Masahiko Sawada
On Thu, Mar 14, 2024 at 9:59 AM John Naylor  wrote:
>
> On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada  wrote:
> >
> > On Wed, Mar 13, 2024 at 8:05 PM John Naylor  wrote:
> > >
> > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada  
> > > wrote:
> > >
> > > > As I mentioned above, if we implement the test cases in C, we can use
> > > > the debug-build array in the test code. And we won't use it in AND/OR
> > > > operations tests in the future.
> > >
> > > That's a really interesting idea, so I went ahead and tried that for
> > > v71. This seems like a good basis for testing larger, randomized
> > > inputs, once we decide how best to hide that from the expected output.
> > > The tests use SQL functions do_set_block_offsets() and
> > > check_set_block_offsets(). The latter does two checks against a tid
> > > array, and replaces test_dump_tids().
> >
> > Great! I think that's a very good starter.
> >
> > The lookup_test() (and test_lookup_tids()) do also test that the
> > IsMember() function returns false as expected if the TID doesn't exist
> > in it, and probably we can do these tests in a C function too.
> >
> > BTW do we still want to test the tidstore by using a combination of
> > SQL functions? We might no longer need to input TIDs via a SQL
> > function.
>
> I'm not sure. I stopped short of doing that to get feedback on this
> much. One advantage with SQL functions is we can use generate_series
> to easily input lists of blocks with different numbers and strides,
> and array literals for offsets are a bit easier. What do you think?

While I'm not a fan of the following part, I agree that it makes sense
to use SQL functions for test data generation:

-- Constant values used in the tests.
\set maxblkno 4294967295
-- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 291.
-- We use a higher number to test tidstore.
\set maxoffset 512

It would also be easier for developers to test the tidstore with their
own data set. So I agreed with the current approach; use SQL functions
for data generation and do the actual tests inside C functions. Is it
convenient for developers if we have functions like generate_tids()
and generate_random_tids() to generate TIDs so that they can pass them
to do_set_block_offsets()? Then they call check_set_block_offsets()
and others for actual data lookup and iteration tests.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




small_cleanups around login event triggers

2024-03-13 Thread Robert Treat
I was taking a look at the login event triggers work (nice work btw)
and saw a couple of minor items that I thought would be worth cleaning
up. This is mostly just clarifying the existing docs and code comments.

Robert Treat
https://xzilla.net


login_event_trigger_small_cleanups.patch
Description: Binary data


ERROR: error triggered for injection point gin-leave-leaf-split-incomplete

2024-03-13 Thread Thomas Munro
Hi,

I noticed 3 regression test failures like $SUBJECT in cfbot runs for
unrelated patches that probably shouldn't affect GIN, so I guess this
is probably a problem in master.  All three happened on FreeBSD, but I
doubt that's relevant, it's just that the FreeBSD CI task was randomly
selected to also run the "regress-running" test suite, which is a sort
of Meson equivalent of make installcheck (eg running tests against a
cluster that was already running).  As for why it might require
regress-running or have started only recently, it could be due to
yesterday's boost in the number of CPUs for FreeBSD CI, changing the
timing.  (?)

http://cfbot.cputube.org/highlights/regress-7.html




Re: Put genbki.pl output into src/include/catalog/ directly

2024-03-13 Thread Andreas Karlsson

On 3/13/24 12:41 PM, Andreas Karlsson wrote:

On 2/8/24 8:58 AM, Peter Eisentraut wrote:
I think keeping the two build systems aligned this way will be useful 
for longer-term maintenance.


Agreed, so started reviewing the patch. Attached is a rebased version of 
the patch to solve a conflict.


I have reviewed the patch now and would say it looks good. I like how we 
remove the symlinks plus make things more similar to the meson build so 
I think we should merge this.


I tried building, running tests, running make clean, running make 
install and tried building with meson (plus checking that meson really 
checks for the generated files and actually refuses to build). Everything 
worked as expected.


The code changes look clean and mostly consist of moving code. I 
personally think this is ready for committer.


Andreas




Re: RFC: Logging plan of the running query

2024-03-13 Thread jian he
On Wed, Mar 13, 2024 at 1:28 PM torikoshia  wrote:
>
> On Fri, Feb 16, 2024 at 11:42 PM torikoshia 
> wrote:
> > I'm not so sure about the implementation now, i.e. finding the next
> > node
> > to be executed from the planstate tree, but I'm going to try this
> > approach.
>
> Attached a patch which takes this approach.
>

One minor issue: I understand that the regression test compares the
expected output with the test run output, but can you enlighten me on
how you checked that the change you made in
contrib/auto_explain/t/001_auto_explain.pl is correct?
(I am not familiar with Perl.)

diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index cf195f1359..2d06bf297e 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -17,6 +17,8 @@
 #include "lib/stringinfo.h"
 #include "parser/parse_node.h"
+extern PGDLLIMPORT bool ProcessLogQueryPlanInterruptActive;

diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h
index 054dd2bf62..3c6bd1ea7c 100644
--- a/src/include/utils/elog.h
+++ b/src/include/utils/elog.h
@@ -167,6 +167,7 @@ struct Node;
 extern bool message_level_is_interesting(int elevel);
+extern PGDLLIMPORT bool ProcessLogQueryPlanInterruptActive;

utils/elog.h is already included in src/include/postgres.h, so you don't
need to declare ProcessLogQueryPlanInterruptActive in
include/commands/explain.h? Generally, a variable should be declared only
once?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread John Naylor
On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada  wrote:
>
> On Wed, Mar 13, 2024 at 8:05 PM John Naylor  wrote:
> >
> > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada  
> > wrote:
> >
> > > As I mentioned above, if we implement the test cases in C, we can use
> > > the debug-build array in the test code. And we won't use it in AND/OR
> > > operations tests in the future.
> >
> > That's a really interesting idea, so I went ahead and tried that for
> > v71. This seems like a good basis for testing larger, randomized
> > inputs, once we decide how best to hide that from the expected output.
> > The tests use SQL functions do_set_block_offsets() and
> > check_set_block_offsets(). The latter does two checks against a tid
> > array, and replaces test_dump_tids().
>
> Great! I think that's a very good starter.
>
> The lookup_test() (and test_lookup_tids()) do also test that the
> IsMember() function returns false as expected if the TID doesn't exist
> in it, and probably we can do these tests in a C function too.
>
> BTW do we still want to test the tidstore by using a combination of
> SQL functions? We might no longer need to input TIDs via a SQL
> function.

I'm not sure. I stopped short of doing that to get feedback on this
much. One advantage with SQL functions is we can use generate_series
to easily input lists of blocks with different numbers and strides,
and array literals for offsets are a bit easier. What do you think?
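
For what it's worth, a sketch of the kind of generate_series-driven input
described above (illustrative only; assumes the test module's
do_set_block_offsets(block bigint, offsets int2[]) and
check_set_block_offsets() SQL functions):

-- Insert offsets 1..5 into every 10th block up to block 1000.
SELECT do_set_block_offsets(blk, ARRAY[1, 2, 3, 4, 5]::int2[])
FROM generate_series(0, 1000, 10) AS blk;

-- Compare the stored TIDs against the reference array kept by the module.
SELECT check_set_block_offsets();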




Re: Sequence Access Methods, round two

2024-03-13 Thread Michael Paquier
On Wed, Mar 13, 2024 at 07:00:37AM +0100, Peter Eisentraut wrote:
> I don't understand what the overall benefit of this change is supposed to
> be.

In the context of this thread, this removes the dependency of sequence
value lookup on heap.

> If this route were to be pursued, it should be a different function name.
> We shouldn't change the signature of an existing function.

I'm not so sure about that.  The existing pg_sequence_last_value is
undocumented and only used in a system view.
--
Michael




Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-13 Thread Masahiko Sawada
On Fri, Feb 23, 2024 at 3:05 PM Amit Kapila  wrote:
>
> On Wed, Feb 21, 2024 at 7:46 AM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > > > Just FYI - here is an extreme case. And note that I have applied 
> > > > proposed patch.
> > > >
> > > > When `pg_basebackup -D data_N2 -R` is used:
> > > > ```
> > > > primary_conninfo = 'user=hayato ... dbname=hayato ...
> > > > ```
> > > >
> > > > But when `pg_basebackup -d "" -D data_N2 -R` is used:
> > > > ```
> > > > primary_conninfo = 'user=hayato ... dbname=replication
> > > > ```
> > >
> > > It seems like maybe somebody should look into why this is happening,
> > > and perhaps fix it.
> >
> > I think this caused from below part [1] in GetConnection().
> >
> > If both dbname and connection_string are the NULL, we will enter the else 
> > part
> > and NULL would be substituted - {"dbnmae", NULL} key-value pair is generated
> > only here.
> >
> > Then, in PQconnectdbParams()->PQconnectStartParams->pqConnectOptions2(),
> > the strange part would be found and replaced to the username [2].
> >
> > I think if both the connection string and the dbname are NULL, it should be
> > considered as the physical replication connection. here is a patch to fix 
> > it.
> >
>
> When dbname is NULL or not given, it defaults to username. This
> follows the specs of the connection string.

This fact makes me think that the slotsync worker might be able to
accept the primary_conninfo value even if there is no dbname in the
value. That is, if there is no dbname in the primary_conninfo, it uses
the username in accordance with the specs of the connection string.
Currently, the slotsync worker connects to the local database first
and then establishes the connection to the primary server. But if we
can reverse the two steps, it can get the dbname that has actually
been used to establish the remote connection and use it for the local
connection too. That way, the primary_conninfo generated by
pg_basebackup could work even without the patch. For example, if the
OS user executing pg_basebackup is 'postgres', the slotsync worker
would connect to the postgres database. Given the 'postgres' database
is created by default and 'postgres' OS user is used in common, I
guess it could cover many cases in practice actually.
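
For illustration, a minimal libpq sketch (the host/user values are made up)
of how the database name actually used by a connection made without an
explicit dbname can be read back with PQdb(), which is what the
reversed-order idea above could rely on:

#include <stdio.h>
#include "libpq-fe.h"

int
main(void)
{
	/* no dbname given: libpq falls back to the user name */
	PGconn	   *conn = PQconnectdb("host=localhost user=postgres");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	/* prints "postgres", the dbname actually used for this connection */
	printf("dbname = %s\n", PQdb(conn));

	PQfinish(conn);
	return 0;
}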

Having said that, even with (or without) the above change, we might
want to change pg_basebackup so that it writes the dbname to the
primary_conninfo if the -R option is specified. Since the database where
the slotsync worker connects cannot be dropped while the slotsync
worker is running, the user might want to change the database to
connect to, and it would be useful if they could do that using
pg_basebackup instead of modifying the configuration file manually.

While the current approach makes sense to me, I'm a bit concerned that
we might end up having the pg_basebackup search the actual database
name (e.g. 'dbname=template1') from the .pgpass file instead of
'dbname=replication'. As far as I tested on my environment, suppose
that I execute:

pg_basebackup -D tmp -d "dbname=testdb" -R

The pg_basebackup established a replication connection but looked for
the password of the 'testdb' database. This could be another
inconvenience for the existing users who want to use the slot
synchronization.

A random idea I came up with is, we add a new option to the
pg_basebackup to overwrite the full or some portion of the connection
string that is eventually written in the primary_conninfo in
postgresql.auto.conf. For example, the command:

pg_basebackup -D tmp -d "host=1.1.1.1 port=" -R
--primary-conninfo-ext "host=2.2.2.2 dbname=postgres"

will produce the connection string that is based on -d option value
but is overwritten by --primary-conninfo-ext option value, which will
be like:

host=2.2.2.2 dbname=postgres port=

This option might help not only for users who want to use the slotsync
worker but also for users who want to take a basebackup from a standby
but have the new standby connect to the primary.

But it's still just an idea and I might be missing something. And
given we're getting closer to the feature freeze, it would be a PG18
item.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Improve readability by using designated initializers when possible

2024-03-13 Thread Michael Paquier
On Wed, Mar 13, 2024 at 02:24:32PM +0100, Peter Eisentraut wrote:
> On 08.03.24 06:50, Michael Paquier wrote:
>> This is usually taken care of by committers or updated automatically.
> 
> both fixed

Looks mostly fine, thanks for the new version.

-EventTriggerSupportsObjectClass(ObjectClass objclass)
+EventTriggerSupportsObject(const ObjectAddress *object) 

The shortcut introduced here is interesting, but it is inconsistent.
HEAD treats OCLASS_SUBSCRIPTION as something supported by event
triggers, but as pg_subscription is a shared catalog it would be
discarded with your change.  Subscriptions are marked as supported in
the event trigger table:
https://www.postgresql.org/docs/devel/event-trigger-matrix.html
--
Michael




Re: May be BUG. Periodic burst growth of the checkpoint_req counter on replica.

2024-03-13 Thread Alexander Korotkov
On Mon, Mar 11, 2024 at 11:48 AM Alexander Korotkov
 wrote:
>
> On Mon, Mar 11, 2024 at 5:43 AM Anton A. Melnikov
>  wrote:
> > On 11.03.2024 03:39, Alexander Korotkov wrote:
> > > Now that we distinguish stats for checkpoints and
> > > restartpoints, we need to update the docs.  Please, check the patch
> > > attached.
> >
> > Maybe bring the pg_stat_get_checkpointer_buffers_written() description in 
> > consistent with these changes?
> > Like that:
> >
> > --- a/src/include/catalog/pg_proc.dat
> > +++ b/src/include/catalog/pg_proc.dat
> > @@ -5740 +5740 @@
> > -  descr => 'statistics: number of buffers written by the checkpointer',
> > +  descr => 'statistics: number of buffers written during checkpoints and 
> > restartpoints',
>
> This makes sense.  I've included this into the revised patch.

Pushed.

--
Regards,
Alexander Korotkov




Re: Volatile write caches on macOS and Windows, redux

2024-03-13 Thread Thomas Munro
Short sales pitch for these patches:

* the default settings eat data on Macs and Windows
* nobody understands what wal_sync_method=fsync_writethrough means anyway
* it's a weird kludge that it affects not only WAL, let's clean that up




Re: SQL:2011 application time

2024-03-13 Thread jian he
in GetOperatorFromWellKnownStrategy:
*strat = GistTranslateStratnum(opclass, instrat);
if (*strat == InvalidStrategy)
{
HeapTuple tuple;
tuple = SearchSysCache1(CLAOID, ObjectIdGetDatum(opclass));
if (!HeapTupleIsValid(tuple))
elog(ERROR, "cache lookup failed for operator class %u", opclass);
ereport(ERROR,
errcode(ERRCODE_UNDEFINED_OBJECT),
errmsg(errstr, format_type_be(opcintype)),
errdetail("Could not translate strategy number %d for operator class
\"%s\" for access method \"%s\".",
  instrat, NameStr(((Form_pg_opclass) GETSTRUCT(tuple))->opcname), "gist"));
ReleaseSysCache(tuple);
}

last `ReleaseSysCache(tuple);` is unreachable?
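
If so, one possible arrangement is simply to drop the dead call, since
ereport(ERROR) does not return and error cleanup releases the syscache
reference anyway.  A sketch (reusing the names quoted above, not the actual
patch):

if (!HeapTupleIsValid(tuple))
	elog(ERROR, "cache lookup failed for operator class %u", opclass);
ereport(ERROR,
		errcode(ERRCODE_UNDEFINED_OBJECT),
		errmsg(errstr, format_type_be(opcintype)),
		errdetail("Could not translate strategy number %d for operator class \"%s\" for access method \"%s\".",
				  instrat,
				  NameStr(((Form_pg_opclass) GETSTRUCT(tuple))->opcname),
				  "gist"));
/* not reached: no ReleaseSysCache() is needed after ereport(ERROR) */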


@@ -118,12 +120,17 @@ typedef struct RI_ConstraintInfo
  int16 confdelsetcols[RI_MAX_NUMKEYS]; /* attnums of cols to set on
  * delete */
  char confmatchtype; /* foreign key's match type */
+ bool temporal; /* if the foreign key is temporal */
  int nkeys; /* number of key columns */
  int16 pk_attnums[RI_MAX_NUMKEYS]; /* attnums of referenced cols */
  int16 fk_attnums[RI_MAX_NUMKEYS]; /* attnums of referencing cols */
  Oid pf_eq_oprs[RI_MAX_NUMKEYS]; /* equality operators (PK = FK) */
  Oid pp_eq_oprs[RI_MAX_NUMKEYS]; /* equality operators (PK = PK) */
  Oid ff_eq_oprs[RI_MAX_NUMKEYS]; /* equality operators (FK = FK) */
+ Oid period_contained_by_oper; /* operator for PERIOD SQL */
+ Oid agged_period_contained_by_oper; /* operator for PERIOD SQL */
+ Oid period_referenced_agg_proc; /* proc for PERIOD SQL */
+ Oid period_referenced_agg_rettype; /* rettype for previous */

The comments don't seem clear to me. Here is my understanding of them:
period_contained_by_oper is the operator testing whether a single period/range
is contained by a single period/range.
agged_period_contained_by_oper is the operator OID testing whether a period
is contained by an aggregated bound of periods.
period_referenced_agg_proc is the oprcode of agged_period_contained_by_oper.
period_referenced_agg_rettype is the return data type of the function
period_referenced_agg_proc.




Re: Vectored I/O in bulk_write.c

2024-03-13 Thread Thomas Munro
On Thu, Mar 14, 2024 at 10:49 AM Heikki Linnakangas  wrote:
> I tried to say that smgr implementation might have better ways to assert
> that than calling smgrnblocks(), so it would be better to leave it to
> the implementation. But what bothered me most was that smgrwrite() had a
> different signature than mdwrite(). I'm happy with the way you have it
> in the v4 patch.

Looking for other things that can be hoisted up to smgr.c level
because they are part of the contract or would surely need to be
duplicated by any implementation: I think the check that you don't
exceed the maximum possible block number could be there too, no?
Here's a version like that.
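
For concreteness, a rough sketch (parameter names assumed from the patch) of
the kind of bound check that could live at the smgr.c level instead of being
repeated by every implementation:

/* reject writes/extensions past the largest representable block number */
if ((uint64) blocknum + nblocks > (uint64) MaxBlockNumber + 1)
	ereport(ERROR,
			(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
			 errmsg("cannot extend relation beyond %u blocks",
					MaxBlockNumber)));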

Does anyone else want to object to doing this for 17?  Probably still
needs some cleanup eg comments etc that may be lurking around the
place and another round of testing.  As for the overall idea, I'll
leave it a few days and see if others have feedback.  My take is that
this is what bulk_write.c was clearly intended to do, it's just that
smgr let it down by not allowing vectored extension yet.  It's a
fairly mechanical simplification, generalisation, and net code
reduction to do so by merging, like this.
From 61b351b60d22060e5fc082645cdfc19188ac4841 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 9 Mar 2024 16:04:21 +1300
Subject: [PATCH v5 1/3] Use smgrwritev() for both overwriting and extending.

Since mdwrite() and mdextend() were basically the same and both need
vectored variants, merge them into a single interface to reduce
duplication.  We still want to be able to assert that callers know the
difference between overwriting and extending and activate slightly
different behavior during recovery, so use a new flags argument to
control that.

Likewise for the zero-extending variant, which has much in common at
the interface level, except it doesn't deal in buffers.

The traditional single-block smgrwrite() and smgrextend() functions with
skipFsync boolean argument are translated to smgrwritev() by inlinable
wrapper functions, for low-overhead backwards-compatibility.

Reviewed-by: Heikki Linnakangas 
Discussion: https://postgr.es/m/CA%2BhUKGLx5bLwezZKAYB2O_qHj%3Dov10RpgRVY7e8TSJVE74oVjg%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   |  15 +--
 src/backend/storage/buffer/localbuf.c |   3 +-
 src/backend/storage/smgr/md.c | 136 
 src/backend/storage/smgr/smgr.c   | 145 +++---
 src/include/storage/md.h  |   7 +-
 src/include/storage/smgr.h|  22 ++--
 6 files changed, 110 insertions(+), 218 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..6caf35a9d6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2064,7 +2064,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/*
-	 * Note: if smgrzeroextend fails, we will end up with buffers that are
+	 * Note: if smgrwritev fails, we will end up with buffers that are
 	 * allocated but not marked BM_VALID.  The next relation extension will
 	 * still select the same block number (because the relation didn't get any
 	 * longer on disk) and so future attempts to extend the relation will find
@@ -2073,7 +2073,8 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	 *
 	 * We don't need to set checksum for all-zero pages.
 	 */
-	smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false);
+	smgrwritev(bmr.smgr, fork, first_block, NULL, extend_by,
+			   SMGR_WRITE_EXTEND | SMGR_WRITE_ZERO);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
@@ -3720,11 +3721,11 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
 	 *
 	 * In recovery, we cache the value returned by the first lseek(SEEK_END)
 	 * and the future writes keeps the cached value up-to-date. See
-	 * smgrextend. It is possible that the value of the first lseek is smaller
-	 * than the actual number of existing blocks in the file due to buggy
-	 * Linux kernels that might not have accounted for the recent write. But
-	 * that should be fine because there must not be any buffers after that
-	 * file size.
+	 * smgrwritev(). It is possible that the value of the first lseek is
+	 * smaller than the actual number of existing blocks in the file due to
+	 * buggy Linux kernels that might not have accounted for the recent write.
+	 * But that should be fine because there must not be any buffers after
+	 * that file size.
 	 */
 	for (i = 0; i < nforks; i++)
 	{
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a5..5b2b0fe9f4 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -416,7 +416,8 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/* actually extend relation */
-	

Re: pg16: XX000: could not find pathkey item to sort

2024-03-13 Thread David Rowley
On Thu, 14 Mar 2024 at 06:00, Alexander Lakhin  wrote:
> I've stumbled upon the same error, but this time it apparently has another
> cause. It can be produced (on REL_16_STABLE and master) as follows:
> CREATE TABLE t (a int, b int) PARTITION BY RANGE (a);
> CREATE TABLE td PARTITION OF t DEFAULT;
> CREATE TABLE tp1 PARTITION OF t FOR VALUES FROM (1) TO (2);
> SET enable_partitionwise_aggregate = on;
> SET parallel_setup_cost = 0;
> SELECT a, sum(b order by b) FROM t GROUP BY a ORDER BY a;
>
> ERROR:  could not find pathkey item to sort
>
> `git bisect` for this anomaly blames the same commit 1349d2790.

Thanks for finding and for the recreator script.

I've attached a patch which fixes the problem for me.

On debugging this I uncovered some other stuff that looks broken, which
seems to be caused by partition-wise aggregates.  With your example
query, in get_useful_pathkeys_for_relation(), we call
relation_can_be_sorted_early() to check if the pathkey can be used as
a set of pathkeys in useful_pathkeys_list.  The problem is that in
your query the 'rel' is the base relation belonging to the partitioned
table and relation_can_be_sorted_early() looks through the targetlist
for that relation and finds columns "a" and "b" in there.  The problem
is "b" has been aggregated away as partial aggregation has taken place
due to the partition-wise aggregation. I believe whichever rel we
should be using there should have an Aggref in the target exprs rather
than the plain unaggregated column.  I've added Robert and Ashutosh to
see what their thoughts are on this.

David
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 443ab08d75..eaba6ddc03 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7320,6 +7320,15 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 {
 	ListCell   *lc;
 	Path	   *cheapest_partial_path;
+	List	   *groupby_pathkeys;
+
+
+	/* trim off any pathkeys added for ORDER BY / DISTINCT aggregates */
+	if (list_length(root->group_pathkeys) > root->num_groupby_pathkeys)
+		groupby_pathkeys = list_copy_head(root->group_pathkeys,
+										  root->num_groupby_pathkeys);
+	else
+		groupby_pathkeys = root->group_pathkeys;
 
 	/* Try Gather for unordered paths and Gather Merge for ordered ones. */
 	generate_useful_gather_paths(root, rel, true);
@@ -7334,7 +7343,7 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 		int			presorted_keys;
 		double		total_groups;
 
-		is_sorted = pathkeys_count_contained_in(root->group_pathkeys,
+		is_sorted = pathkeys_count_contained_in(groupby_pathkeys,
 												path->pathkeys,
 												&presorted_keys);
 
@@ -7366,7 +7375,7 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 			path = (Path *) create_incremental_sort_path(root,
 														 rel,
 														 path,
-														 root->group_pathkeys,
+														 groupby_pathkeys,
 														 presorted_keys,
 														 -1.0);
 
@@ -7375,7 +7384,7 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
 												 rel,
 												 path,
 												 rel->reltarget,
-												 root->group_pathkeys,
+												 groupby_pathkeys,
 												 NULL,
 												 &total_groups);
 


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-13 Thread Thomas Munro
On Sun, Mar 3, 2024 at 11:41 AM Tomas Vondra
 wrote:
> On 3/2/24 23:28, Melanie Plageman wrote:
> > On Sat, Mar 2, 2024 at 10:05 AM Tomas Vondra
> >  wrote:
> >> With the current "master" code, eic=1 means we'll issue a prefetch for B
> >> and then read+process A. And then issue prefetch for C and read+process
> >> B, and so on. It's always one page ahead.
> >
> > Yes, that is what I mean for eic = 1

I spent quite a few days thinking about the meaning of eic=0 and eic=1
for streaming_read.c v7[1], to make it agree with the above and with
master.  Here's why I was confused:

Both eic=0 and eic=1 are expected to generate at most 1 physical I/O
at a time, or I/O queue depth 1 if you want to put it that way.  But
this isn't just about concurrency of I/O, it's also about computation.
Duh.

eic=0 means that the I/O is not concurrent with executor computation.
So, to annotate an excerpt from [1]'s random.txt, we have:

effective_io_concurrency = 0, range size = 1
unpatched  patched
==
pread(43,...,8192,0x58000) = 8192  pread(82,...,8192,0x58000) = 8192
 *** executor now has page at 0x58000 to work on ***
pread(43,...,8192,0xb) = 8192  pread(82,...,8192,0xb) = 8192
 *** executor now has page at 0xb to work on ***

eic=1 means that a single I/O is started and then control is returned
to the executor code to do useful work concurrently with the
background read that we assume is happening:

effective_io_concurrency = 1, range size = 1
unpatched  patched
==
pread(43,...,8192,0x58000) = 8192  pread(82,...,8192,0x58000) = 8192
posix_fadvise(43,0xb,0x2000,...)   posix_fadvise(82,0xb,0x2000,...)
 *** executor now has page at 0x58000 to work on ***
pread(43,...,8192,0xb) = 8192  pread(82,...,8192,0xb) = 8192
posix_fadvise(43,0x108000,0x2000,...)  posix_fadvise(82,0x108000,0x2000,...)
 *** executor now has page at 0xb to work on ***
pread(43,...,8192,0x108000) = 8192 pread(82,...,8192,0x108000) = 8192
posix_fadvise(43,0x16,0x2000,...)  posix_fadvise(82,0x16,0x2000,...)

In other words, 'concurrency' doesn't mean 'number of I/Os running
concurrently with each other', it means 'number of I/Os running
concurrently with computation', and when you put it that way, 0 and 1
are different.
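
To spell the eic=1 pattern out in plain POSIX terms, a stripped-down sketch
(the file descriptor, block numbers and process_page() are stand-ins, not
code from the patch): read the block needed right now, hint the next one,
then do the computation while the kernel reads ahead.

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ 8192

static void process_page(const char *buf);	/* stand-in for executor work */

static void
scan_with_eic1(int fd, unsigned start, unsigned nblocks)
{
	char		buf[BLCKSZ];

	for (unsigned blkno = start; blkno < nblocks; blkno++)
	{
		off_t		off = (off_t) blkno * BLCKSZ;

		/* synchronous read of the block we need right now */
		pread(fd, buf, BLCKSZ, off);

		/* hint the *next* block so it is read while we compute */
		if (blkno + 1 < nblocks)
			posix_fadvise(fd, off + BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);

		process_page(buf);		/* computation overlaps the background read */
	}
}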

Note that the first read is a bit special: by the time the consumer is
ready to pull a buffer out of the stream when we don't have a buffer
ready yet, it is too late to issue useful advice, so we don't bother.
FWIW I think even in the AIO future we would have a synchronous read
in that specific place, at least when using io_method=worker, because
it would be stupid to ask another process to read a block for us that
we want right now and then wait for it to wake us up when it's done.

Note that even when we aren't issuing any advice because eic=0 or
because we detected sequential access and we believe the kernel can do
a better job than us, we still 'look ahead' (= call the callback to
see which block numbers are coming down the pipe), but only as far as
we need to coalesce neighbouring blocks.  (I deliberately avoid using
the word "prefetch" except in very general discussions because it
means different things to different layers of the code, hence talk of
"look ahead" and "advice".)  That's how we get this change:

effective_io_concurrency = 0, range size = 4
unpatched  patched
==
pread(43,...,8192,0x58000) = 8192  pread(82,...,8192,0x58000) = 8192
pread(43,...,8192,0x5a000) = 8192  preadv(82,...,2,0x5a000) = 16384
pread(43,...,8192,0x5c000) = 8192  pread(82,...,8192,0x5e000) = 8192
pread(43,...,8192,0x5e000) = 8192  preadv(82,...,4,0xb) = 32768
pread(43,...,8192,0xb) = 8192  preadv(82,...,4,0x108000) = 32768
pread(43,...,8192,0xb2000) = 8192  preadv(82,...,4,0x16) = 32768

And then once we introduce eic > 0 to the picture with neighbouring
blocks that can be coalesced, "patched" starts to diverge even more
from "unpatched" because it tracks the number of wide I/Os in
progress, not the number of single blocks.

[1] 
https://www.postgresql.org/message-id/ca+hukglji+c5jb3j6uvkgmyhky-qu+lpcsinahugsa5z4dv...@mail.gmail.com




Re: Statistics Import and Export

2024-03-13 Thread Corey Huinker
>
> Note that there's two different things we're talking about here- the
> lock on the relation that we're analyzing and then the lock on the
> pg_statistic (or pg_class) catalog itself.  Currently, at least, it
> looks like in the three places in the backend that we open
> StatisticRelationId, we release the lock when we close it rather than
> waiting for transaction end.  I'd be inclined to keep it that way in
> these functions also.  I doubt that one lock will end up causing much in
> the way of issues to acquire/release it multiple times and it would keep
> the code consistent with the way ANALYZE works.
>

ANALYZE takes out one lock on StatisticRelationId per relation, not one per
attribute as we do now. If we didn't release the lock after every
attribute, and we only called the function outside of a larger transaction
(as we plan to do with pg_restore), then that is the closest we're going to
get to being consistent with ANALYZE.
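
To make that concrete, a small sketch of the ANALYZE-style pattern described
above (one open/close of pg_statistic per relation, with the lock released at
close rather than held to transaction end); the helper name and loop body are
invented:

static void
update_all_attribute_stats(Relation onerel)
{
	/* one lock on pg_statistic for the whole relation's attributes */
	Relation	sd = table_open(StatisticRelationId, RowExclusiveLock);

	for (int attnum = 1; attnum <= RelationGetNumberOfAttributes(onerel); attnum++)
	{
		/* ... form and update the pg_statistic tuple for this attribute ... */
	}

	/* releases the lock here, as ANALYZE does, not at transaction end */
	table_close(sd, RowExclusiveLock);
}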


Re: Vectored I/O in bulk_write.c

2024-03-13 Thread Heikki Linnakangas

On 13/03/2024 23:12, Thomas Munro wrote:

Alright, here is a first attempt at merging all three interfaces as
you suggested.  I like it!  I especially like the way it removes lots
of duplication.

I don't understand your argument about the location of the
write-vs-extent assertions.  It seems to me that these are assertions
about what the *public* smgrnblocks() function returns.  In other
words, we assert that the caller is aware of the current relation size
(and has some kind of interlocking scheme for that to be possible),
according to the smgr implementation's public interface.  That's not
an assertion about internal details of the smgr implementation, it's
part of the "contract" for the API.


I tried to say that smgr implementation might have better ways to assert 
that than calling smgrnblocks(), so it would be better to leave it to 
the implementation. But what bothered me most was that smgrwrite() had a 
different signature than mdwrite(). I'm happy with the way you have it 
in the v4 patch.


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Vectored I/O in bulk_write.c

2024-03-13 Thread Thomas Munro
Alright, here is a first attempt at merging all three interfaces as
you suggested.  I like it!  I especially like the way it removes lots
of duplication.

I don't understand your argument about the location of the
write-vs-extent assertions.  It seems to me that these are assertions
about what the *public* smgrnblocks() function returns.  In other
words, we assert that the caller is aware of the current relation size
(and has some kind of interlocking scheme for that to be possible),
according to the smgr implementation's public interface.  That's not
an assertion about internal details of the smgr implementation, it's
part of the "contract" for the API.
From 0a57274e29369e61712941e379c24f7db1dec068 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sat, 9 Mar 2024 16:04:21 +1300
Subject: [PATCH v4 1/3] Merge smgrzeroextend() and smgrextend() with
 smgrwritev().

Since mdwrite() and mdextend() were basically the same and both need
vectored variants, merge them into a single interface.  We still want
to be able to assert that callers know the difference between
overwriting and extending and activate slightly different behavior
during recovery, so use flags to control that.

Likewise for the zero-extending variant, which has much in common at
the interface level, except it doesn't deal in buffers.

The traditional single-block smgrwrite() and smgrextend() functions with
skipFsync boolean argument are translated to smgrwritev() by inlinable
wrapper functions, for low-overhead backwards-compatibility.

Reviewed-by: Heikki Linnakangas 
Discussion: https://postgr.es/m/CA%2BhUKGLx5bLwezZKAYB2O_qHj%3Dov10RpgRVY7e8TSJVE74oVjg%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   |   7 +-
 src/backend/storage/buffer/localbuf.c |   3 +-
 src/backend/storage/smgr/md.c | 119 +---
 src/backend/storage/smgr/smgr.c   | 127 +-
 src/include/storage/md.h  |   7 +-
 src/include/storage/smgr.h|  22 +++--
 6 files changed, 91 insertions(+), 194 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f0f8d4259c..52bbdff336 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2064,7 +2064,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/*
-	 * Note: if smgrzeroextend fails, we will end up with buffers that are
+	 * Note: if smgrwritev fails, we will end up with buffers that are
 	 * allocated but not marked BM_VALID.  The next relation extension will
 	 * still select the same block number (because the relation didn't get any
 	 * longer on disk) and so future attempts to extend the relation will find
@@ -2073,7 +2073,8 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 	 *
 	 * We don't need to set checksum for all-zero pages.
 	 */
-	smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false);
+	smgrwritev(bmr.smgr, fork, first_block, NULL, extend_by,
+			   SMGR_WRITE_EXTEND | SMGR_WRITE_ZERO);
 
 	/*
 	 * Release the file-extension lock; it's now OK for someone else to extend
@@ -3720,7 +3721,7 @@ DropRelationBuffers(SMgrRelation smgr_reln, ForkNumber *forkNum,
 	 *
 	 * In recovery, we cache the value returned by the first lseek(SEEK_END)
 	 * and the future writes keeps the cached value up-to-date. See
-	 * smgrextend. It is possible that the value of the first lseek is smaller
+	 * smgrwritev(). It is possible that the value of the first lseek is smaller
 	 * than the actual number of existing blocks in the file due to buggy
 	 * Linux kernels that might not have accounted for the recent write. But
 	 * that should be fine because there must not be any buffers after that
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index fcfac335a5..5b2b0fe9f4 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -416,7 +416,8 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/* actually extend relation */
-	smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false);
+	smgrwritev(bmr.smgr, fork, first_block, NULL, extend_by,
+			   SMGR_WRITE_EXTEND | SMGR_WRITE_ZERO);
 
 	pgstat_count_io_op_time(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND,
 			io_start, extend_by);
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index bf0f3ca76d..0d560393e5 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -447,83 +447,14 @@ mdunlinkfork(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo)
 	pfree(path);
 }
 
-/*
- * mdextend() -- Add a block to the specified relation.
- *
- * The semantics are nearly the same as mdwrite(): write at the
- * specified position.  However, this is to be used for the case of
- * extending a relation (i.e., blocknum is at or beyond the 

Re: Add basic tests for the low-level backup method.

2024-03-13 Thread David Steele

On 3/13/24 19:15, Michael Paquier wrote:

On Wed, Mar 13, 2024 at 01:12:28PM +1300, David Steele wrote:


Not sure what to look for here. There are no distinct messages for crash
recovery. Perhaps there should be?


The closest thing I can think of here would be "database system was
not properly shut down; automatic recovery in progress" as we don't
have InArchiveRecovery, after checking that the canary is missing.  If
you don't like this suggestion, feel free to say so, of course :)


That works for me. I think I got it confused with "database system was 
interrupted..." when I was looking at the success vs. fail logs.



Sure, I added a check for the new log message when recovering with a
backup_label.


+ok($node_replica->log_contains('completed backup recovery with redo LSN'),
+   'verify backup recovery performed with backup_label');

Okay for this choice.  I was thinking first about "starting backup
recovery with redo LSN", closer to the area where the backup_label is
read.


I think you are right that the start message is better since it can only 
appear once when the backup_label is found. The completed message could 
in theory appear after a restart, though the backup_label must have been 
found at some point.


Regards,
-DavidFrom 91f8e8a390713760116990be05d96f8a7e0d1bde Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Wed, 13 Mar 2024 20:02:40 +
Subject: Add basic tests for the low-level backup method.

There are currently no tests for the low-level backup method. pg_backup_start()
and pg_backup_stop() are called but not exercised in any real fashion.

There is a lot more that can be done, but this at least provides some basic
tests and provides a place for future improvement.
---
 src/test/recovery/t/042_low_level_backup.pl | 114 
 1 file changed, 114 insertions(+)
 create mode 100644 src/test/recovery/t/042_low_level_backup.pl

diff --git a/src/test/recovery/t/042_low_level_backup.pl b/src/test/recovery/t/042_low_level_backup.pl
new file mode 100644
index 00..76bac7e324
--- /dev/null
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -0,0 +1,114 @@
+
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+# Test low-level backup method by using pg_backup_start() and pg_backup_stop()
+# to create backups.
+
+use strict;
+use warnings;
+
+use File::Copy qw(copy);
+use File::Path qw(remove_tree);
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Start primary node with archiving
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(has_archiving => 1, allows_streaming => 1);
+$node_primary->start;
+
+# Start backup
+my $backup_name = 'backup1';
+my $psql = $node_primary->background_psql('postgres');
+
+$psql->query_safe("SET client_min_messages TO WARNING");
+$psql->set_query_timer_restart();
+$psql->query_safe("select pg_backup_start('test label')");
+
+# Copy files
+my $backup_dir = $node_primary->backup_dir() . '/' . $backup_name;
+
+PostgreSQL::Test::RecursiveCopy::copypath(
+   $node_primary->data_dir(), $backup_dir);
+
+# Cleanup some files/paths that should not be in the backup. There is no
+# attempt to make all the exclusions done by pg_basebackup here, in part because
+# they are not required, but also to keep the test simple.
+#
+# Also remove pg_control because it needs to be copied later and it will be
+# obvious if the copy fails.
+unlink("$backup_dir/postmaster.pid")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.pid");
+unlink("$backup_dir/postmaster.opts")
+   or BAIL_OUT("unable to unlink $backup_dir/postmaster.opts");
+unlink("$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to unlink $backup_dir/global/pg_control");
+
+remove_tree("$backup_dir/pg_wal", {keep_root => 1})
+   or BAIL_OUT("unable to unlink contents of $backup_dir/pg_wal");
+
+# Create a table that will be used to verify that recovery started at the
+# correct location, rather than the location recorded in pg_control
+$psql->query_safe("create table canary (id int)");
+
+# Advance the checkpoint location in pg_control past the location where the
+# backup started. Switch the WAL to make it really obvious that the location
+# is different and to put the checkpoint in a new WAL segment.
+$psql->query_safe("select pg_switch_wal()");
+$psql->query_safe("checkpoint");
+
+# Copy pg_control last so it contains the new checkpoint
+copy($node_primary->data_dir() . '/global/pg_control',
+   "$backup_dir/global/pg_control")
+   or BAIL_OUT("unable to copy global/pg_control");
+
+# Stop backup and get backup_label
+my $backup_label = $psql->query_safe("select labelfile from pg_backup_stop()");
+
+# Rather than writing out backup_label, try to recover the backup without
+# backup_label to demonstrate that recovery will not work correctly without it,
+# i.e. the canary table will be missing and the cluster will be corrupt. Provide
+# only the WAL segment that recovery will 

Re: un-revert the MAINTAIN privilege and the pg_maintain predefined role

2024-03-13 Thread Nathan Bossart
On Wed, Mar 13, 2024 at 09:49:26AM -0700, Jeff Davis wrote:
> Looks good to me. Thank you for expanding on the comment, as well.

Thanks for reviewing!  Committed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: RFC: Logging plan of the running query

2024-03-13 Thread Robert Haas
On Wed, Mar 13, 2024 at 1:28 AM torikoshia  wrote:
> - I saw no way to find the next node to be executed from the planstate
> tree, so the patch wraps all the ExecProcNode of the planstate tree at
> CHECK_FOR_INTERRUPTS().

I don't think it does this correctly, because some node types have
children other than the left and right node. See /* special child
plans */ in ExplainNode().

But also ... having to wrap the entire plan tree like this seems
pretty awful. I don't really like the idea of a large-scale plan
modification like this in the middle of the query. I also wonder
whether it interacts properly with JIT. But at the same time, I wonder
how you're supposed to avoid it.
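
For readers following along, the wrapping under discussion is roughly of this
shape (a hypothetical sketch with an invented name, not the actual patch):
every node's ExecProcNode pointer is redirected through a function that first
services pending interrupts.

static TupleTableSlot *
ExecProcNodeWithLogCheck(PlanState *node)
{
	/* a pending plan-logging request would be handled here */
	CHECK_FOR_INTERRUPTS();

	return node->ExecProcNodeReal(node);
}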

Andres, did you have some clever idea for this feature that would
avoid the need to do this?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add system identifier to backup manifest

2024-03-13 Thread Robert Haas
On Fri, Mar 8, 2024 at 12:14 AM Amul Sul  wrote:
> Thank you for the improvement.
>
> The caller of verify_control_file() has the full path of the control file that
> can pass it and avoid recomputing. With this change, it doesn't really need
> verifier_context argument -- only the manifest's system identifier is enough
> along with the control file path.  Did the same in the attached delta patch
> for v11-0002 patch, please have a look, thanks.

Those seem like sensible changes. I incorporated them and committed. I also:

* ran pgindent, which changed a bit of your formatting
* changed some BAIL_OUT calls to die; I think we should hardly ever be
using BAIL_OUT, as that terminates the entire TAP test run, not just
the current file

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-13 Thread Jacob Champion
On Wed, Mar 13, 2024 at 12:01 PM Alvaro Herrera  wrote:
> On 2024-Mar-13, Jelte Fennema-Nio wrote:
> > Sadly I'm having a hard time reliably reproducing this race condition
> > locally. So it's hard to be sure what is happening here. Attached is a
> > patch with a wild guess as to what the issue might be (i.e. seeing an
> > outdated "active" state and thus passing the check even though the
> > query is not running yet)
>
> I tried leaving the original running in my laptop to see if I could
> reproduce it, but got no hits ... and we didn't get any other failures
> apart from the three ones already reported ... so it's not terribly high
> probability.  Anyway I pushed your patch now since the theory seems
> plausible; let's see if we still get the issue to reproduce.  If it
> does, we could make the script more verbose to hunt for further clues.

I hit this on my machine. With the attached diff I can reproduce
constantly (including with the most recent test patch); I think the
cancel must be arriving between the bind/execute steps?

Thanks,
--Jacob
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 6b7903314a..22ce7c07d9 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2073,6 +2073,9 @@ exec_bind_message(StringInfo input_message)
valgrind_report_error_query(debug_query_string);
 
debug_query_string = NULL;
+
+   if (strstr(psrc->query_string, "pg_sleep"))
+   sleep(1);
 }
 
 /*


Re: Reports on obsolete Postgres versions

2024-03-13 Thread Laurenz Albe
On Tue, 2024-03-12 at 11:56 +0100, Daniel Gustafsson wrote:
> > I liked the statement from Laurenz a while ago on his blog
> > (paraphrased): "Upgrading to the latest patch release does not require
> > application testing or recertification". I am not sure we want to put
> > that into the official page (or maybe tone down/qualify it a bit), but I
> > think a lot of users stay on older minor versions because they dread
> > their internal testing policies.
> 
> I think we need a more conservative language since a minor release might fix a
> planner bug that someone's app relied on and their plans will be worse after
> upgrading.  While rare, it can for sure happen so the official wording should
> probably avoid such bold claims.

I think we are pretty conservative with backpatching changes to the
optimizer that could destabilize existing plans.

I feel quite strongly that we should not use overly conservative language
there.  If people feel that they have to test their applications for new
minor releases, the only effect will be that they won't install minor releases.
Installing a minor release should be as routine as the operating system
patches that many companies apply automatically during weekend maintenance
windows.  They can also introduce bugs, and everybody knows and accepts that.

Yours,
Laurenz Albe




Re: clarify equalTupleDescs()

2024-03-13 Thread Tomas Vondra



On 2/27/24 12:13, jian he wrote:
> On Mon, Feb 12, 2024 at 7:47 PM Peter Eisentraut  wrote:
>>
>>
>> In principle, hashRowType() could process all the fields that
>> equalRowTypes() does.  But since it's only a hash function, it doesn't
>> have to be perfect.  (This is also the case for the current
>> hashTupleDesc().)  I'm not sure where the best tradeoff is.
> 
> That's where my confusion comes from.
> hashRowType is used in record_type_typmod_hash.
> record_type_typmod_hash is within assign_record_type_typmod.
>> in assign_record_type_typmod:
> 
> if (RecordCacheHash == NULL)
> {
> /* First time through: initialize the hash table */
> HASHCTL ctl;
> ctl.keysize = sizeof(TupleDesc); /* just the pointer */
> ctl.entrysize = sizeof(RecordCacheEntry);
> ctl.hash = record_type_typmod_hash;
> ctl.match = record_type_typmod_compare;
> RecordCacheHash = hash_create("Record information cache", 64,
>   &ctl,
>   HASH_ELEM | HASH_FUNCTION | HASH_COMPARE);
> /* Also make sure CacheMemoryContext exists */
> if (!CacheMemoryContext)
> CreateCacheMemoryContext();
> }
> /*
> * Find a hashtable entry for this tuple descriptor. We don't use
> * HASH_ENTER yet, because if it's missing, we need to make sure that all
> * the allocations succeed before we create the new entry.
> */
> recentry = (RecordCacheEntry *) hash_search(RecordCacheHash,
> &tupDesc,
> HASH_FIND, &found);
> 
> based on the comments in hash_create. The above hash_search function
> would first use
> record_type_typmod_hash to find out candidate entries in a hash table
> then use record_type_typmod_compare to compare the given tupDesc with
> candidate entries.
> 
> Is this how the hash_search in assign_record_type_typmod works?
> 

Yes.

> equalRowTypes processed more fields than hashRowType,
> hashRowType comments mentioned equalRowTypes,
> maybe we should have some comments in hashRowType explaining why only
> hashing natts, tdtypeid, atttypid will be fine.
> 

Not sure I understand what the confusion is - omitting fields with
little entropy is not uncommon, and collisions are inherent to hash
tables, and need to be handled anyway.
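
To make the tradeoff concrete, a sketch of a hash over just those
higher-entropy fields (illustrative only, not necessarily the committed
hashRowType()):

static uint32
hash_row_type(TupleDesc desc)
{
	uint32		s;

	/* fields with real entropy; attndims, defaults etc. are skipped */
	s = hash_combine(0, hash_uint32(desc->natts));
	s = hash_combine(s, hash_uint32(desc->tdtypeid));
	for (int i = 0; i < desc->natts; i++)
		s = hash_combine(s, hash_uint32(TupleDescAttr(desc, i)->atttypid));

	/* a collision only means the comparator runs on a false candidate */
	return s;
}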


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-13 Thread Alvaro Herrera
On 2024-Mar-13, Jelte Fennema-Nio wrote:

> I agree it's probably a timing issue. The cancel being received after
> the query is done seems very unlikely, since the query takes 180
> seconds (assuming PG_TEST_TIMEOUT_DEFAULT is not lowered for these
> animals). I think it's more likely that the cancel request arrives too
> early, and thus being ignored because no query is running yet. The
> test already had logic to wait until the query backend was in the
> "active" state, before sending a cancel to solve that issue. But my
> guess is that that somehow isn't enough.
> 
> Sadly I'm having a hard time reliably reproducing this race condition
> locally. So it's hard to be sure what is happening here. Attached is a
> patch with a wild guess as to what the issue might be (i.e. seeing an
> outdated "active" state and thus passing the check even though the
> query is not running yet)

I tried leaving the original running in my laptop to see if I could
reproduce it, but got no hits ... and we didn't get any other failures
apart from the three ones already reported ... so it's not terribly high
probability.  Anyway I pushed your patch now since the theory seems
plausible; let's see if we still get the issue to reproduce.  If it
does, we could make the script more verbose to hunt for further clues.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Here's a general engineering tip: if the non-fun part is too complex for you
to figure out, that might indicate the fun part is too ambitious." (John Naylor)
https://postgr.es/m/CAFBsxsG4OWHBbSDM%3DsSeXrQGOtkPiOEOuME4yD7Ce41NtaAD9g%40mail.gmail.com




Re: Reports on obsolete Postgres versions

2024-03-13 Thread Nathan Bossart
On Wed, Mar 13, 2024 at 02:21:20PM -0400, Tom Lane wrote:
> I'm +1 on rewriting these documentation pages though.  Seems like
> they could do with a whole fresh start rather than just tweaks
> around the edges --- what we've got now is an accumulation of such
> tweaks.

If no one else volunteers, I could probably give this a try once v17 is
frozen.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Reports on obsolete Postgres versions

2024-03-13 Thread Jeremy Schneider


> On Mar 13, 2024, at 11:39 AM, Tom Lane  wrote:
> 
> Jeremy Schneider  writes:
>>> On 3/13/24 11:21 AM, Tom Lane wrote:
>>> Agreed, we would probably add confusion not reduce it if we were to
>>> change our longstanding nomenclature for this.
> 
>> Before v10, the quarterly maintenance updates were unambiguously and
>> always called patch releases
> 
> I think that's highly revisionist history.  I've always called them
> minor releases, and I don't recall other people using different
> terminology.  I believe the leadoff text on
> 
> https://www.postgresql.org/support/versioning/
> 
> is much older than when we switched from two-part major version
> numbers to one-part major version numbers.

Huh, that wasn’t what I expected. I only started (in depth) working with PG 
around 9.6 and I definitely thought of “6” as the minor version. This is an 
interesting mailing list thread.

-Jeremy


Sent from my TI-83



Re: clarify equalTupleDescs()

2024-03-13 Thread Tomas Vondra
Hi,

I looked at this patch today. I went through all the calls switched to
equalRowTypes, and AFAIK all of them are correct - all the places
switched to equalRowTypes() only need the weaker checks.

There's only two places still calling equalTupleDescs() - relcache
certainly needs that, and so does the assert in execReplication().


As for attndims, I agree equalRowTypes() should not check that. We're
not really checking that anywhere, it'd be quite weird to start with it
here. Especially if the plan is to remove it entirely.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Reports on obsolete Postgres versions

2024-03-13 Thread Tom Lane
Jeremy Schneider  writes:
> On 3/13/24 11:21 AM, Tom Lane wrote:
>> Agreed, we would probably add confusion not reduce it if we were to
>> change our longstanding nomenclature for this.

> Before v10, the quarterly maintenance updates were unambiguously and
> always called patch releases

I think that's highly revisionist history.  I've always called them
minor releases, and I don't recall other people using different
terminology.  I believe the leadoff text on

https://www.postgresql.org/support/versioning/

is much older than when we switched from two-part major version
numbers to one-part major version numbers.

regards, tom lane




Re: Popcount optimization using AVX512

2024-03-13 Thread Nathan Bossart
On Wed, Mar 13, 2024 at 05:52:14PM +, Amonson, Paul D wrote:
>> I think we want these to be architecture-specific, i.e., only built for
>> x86_64 if the compiler knows how to use the relevant instructions.  There is 
>> a
>> good chance that we'll want to add similar support for other systems.
>> The CRC32C files are probably a good reference point for how to do this.
> 
> I will look at this for the 'pg_popcnt_x86_64_accel.c' file but the
> 'pg_popcnt_choose.c' file is intended to be for any platform that may
> need accelerators including a possible future ARM accelerator.

I worry that using the same file for *_choose.c for all architectures would
become rather #ifdef heavy.  Since we are already separating out this code
into new files, IMO we might as well try to avoid too many #ifdefs, too.
But this is admittedly less important right now because there's almost no
chance of any new architecture support here for v17.

>> +#ifdef TRY_POPCNT_FAST
>> 
>> IIUC this macro can be set if either 1) the popcntq test in the 
>> autoconf/meson
>> scripts passes or 2) we're building with MSVC on x86_64.  I wonder if it 
>> would
>> be better to move the MSVC/x86_64 check to the autoconf/meson scripts so
>> that we could avoid surrounding large portions of the popcount code with this
>> macro.  This might even be a necessary step towards building these files in 
>> an
>> architecture-specific fashion.
> 
> I see the point here; however, this will take some time to get right
> especially since I don't have a Windows box to do compiles on. Should I
> attempt to do this in this patch?

This might also be less important given the absence of any imminent new
architecture support in this area.  I'm okay with it, given we are just
maintaining the status quo.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Reports on obsolete Postgres versions

2024-03-13 Thread Jeremy Schneider
On 3/13/24 11:21 AM, Tom Lane wrote:
> Robert Treat  writes:
>> On Wed, Mar 13, 2024 at 1:12 PM Bruce Momjian  wrote:
>>> On Wed, Mar 13, 2024 at 09:21:27AM -0700, Jeremy Schneider wrote:
 In my view, the best thing would be to move toward consistently using
 the word "patch" and moving away from the word "minor" for the
 PostgreSQL quarterly maintenance updates.
> 
>>> I think "minor" is a better term since it contrasts with "major".  We
>>> don't actually supply patches to upgrade minor versions.
> 
>> I tend to agree with Bruce, and major/minor seems to be the more
>> common usage within the industry; iirc, debian, ubuntu, gnome, suse,
>> and mariadb all use that nomenclature; and ISTR some distro's who
>> release packaged versions of postgres with custom patches applied (ie
>> 12.4-2 for postgres 12.4 patchlevel 2).
> 
> Agreed, we would probably add confusion not reduce it if we were to
> change our longstanding nomenclature for this.


"Longstanding nomenclature"??

Before v10, the quarterly maintenance updates were unambiguously and
always called patch releases

I don't understand the line of thinking here

Bruce started this whole thread because of "an increasing number of
bug/problem reports on obsolete Postgres versions"

Across the industry the word "minor" often implies a release that will
be maintained, and I'm trying to point out that the change in v10 to
change terminology from "patch" to "minor" actually might be part of
what's responsible for the increasing number of bug reports on old patch
releases, because people don't understand that patch releases are the
way those bugfixes were already delivered.

Just taking MySQL as an example, it's clear that a "minor" like 5.7 is a
full blown release that gets separate patches from 5.6 - so I don't
understand how we're making an argument it's the opposite?

-Jeremy


-- 
http://about.me/jeremy_schneider





Re: Reports on obsolete Postgres versions

2024-03-13 Thread Tom Lane
Robert Treat  writes:
> On Wed, Mar 13, 2024 at 1:12 PM Bruce Momjian  wrote:
>> On Wed, Mar 13, 2024 at 09:21:27AM -0700, Jeremy Schneider wrote:
>>> In my view, the best thing would be to move toward consistently using
>>> the word "patch" and moving away from the word "minor" for the
>>> PostgreSQL quarterly maintenance updates.

>> I think "minor" is a better term since it contrasts with "major".  We
>> don't actually supply patches to upgrade minor versions.

> I tend to agree with Bruce, and major/minor seems to be the more
> common usage within the industry; iirc, debian, ubuntu, gnome, suse,
> and mariadb all use that nomenclature; and ISTR some distro's who
> release packaged versions of postgres with custom patches applied (ie
> 12.4-2 for postgres 12.4 patchlevel 2).

Agreed, we would probably add confusion not reduce it if we were to
change our longstanding nomenclature for this.

I'm +1 on rewriting these documentation pages though.  Seems like
they could do with a whole fresh start rather than just tweaks
around the edges --- what we've got now is an accumulation of such
tweaks.

regards, tom lane




Re: Support json_errdetail in FRONTEND builds

2024-03-13 Thread Jacob Champion
On Tue, Mar 12, 2024 at 11:38 PM Michael Paquier  wrote:
> On Tue, Mar 12, 2024 at 08:38:43PM -0400, Andrew Dunstan wrote:
> > yeah, although maybe worth a different patch.
>
> I've wanted that a few times, FWIW.  I would do a split, mainly for
> clarity.

Sounds good, split into v2-0002. (That gives me room to switch other
call sites to the new API, too.) Thanks both!

> This does not stress me much, either.  I can see that OAuth introduces
> at least two calls of json_errdetail() in libpq, and that would matter
> there.

Yep.

>  case JSON_EXPECTED_STRING:
> -return psprintf(_("Expected string, but found \"%s\"."),
> -extract_token(lex));
> +appendStringInfo(lex->errormsg,
> + _("Expected string, but found \"%.*s\"."),
> + toklen, lex->token_start);
>
> Maybe this could be wrapped in a macro to avoid copying around
> token_start and toklen for all the error cases.

I gave it a shot in 0001; see what you think.
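
As an illustration of the shape such a macro could take (a sketch only,
assuming the lexer's token_start/token_terminator pointers; not necessarily
what v2-0001 does):

/* append an error message embedding the current token, without copying it */
#define json_token_error(lex, format) \
	appendStringInfo((lex)->errormsg, _(format), \
					 (int) ((lex)->token_terminator - (lex)->token_start), \
					 (lex)->token_start)

With that, the JSON_EXPECTED_STRING case above would reduce to
json_token_error(lex, "Expected string, but found \"%.*s\".").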

Thanks,
--Jacob


v2-0002-Add-a-helper-function-for-cleaning-up-StringInfos.patch
Description: Binary data


v2-0001-common-jsonapi-support-json_errdetail-in-FRONTEND.patch
Description: Binary data


Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-03-13 Thread Jelte Fennema-Nio
Fixed some conflicts again, as well as adding a connection option to
choose the requested protocol version (as discussed in [1]). This new
connection option is not useful when connecting to any of the
supported postgres versions. But it can be useful when connecting to
PG versions before 9.3. Or when connecting to connection poolers or
other databases that implement the postgres protocol but do not
support the NegotiateProtocolVersion message.

[1]: 
https://www.postgresql.org/message-id/flat/CAGECzQRrHn52yEX%2BFc6A9uvVbwRCxjU82KNuBirwFU5HRrNxqA%40mail.gmail.com#835914cbd55c56b36e8e7691cb743a18




Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-13 Thread Tomas Vondra
On 3/13/24 14:34, Heikki Linnakangas wrote:
> ...
> 
> Lots of discussion happening on the performance results but it seems
> that there is no performance impact with the preliminary patches up to
> v5-0013-Streaming-Read-API.patch. I'm focusing purely on those
> preliminary patches now, because I think they're worthwhile cleanups
> independent of the streaming read API.
> 

Not quite true - the comparison I shared on 29/2 [1] shows a serious
regression caused by the 0010 patch. We've been investigating this with
Melanie off list, but we don't have any clear findings yet (except that
it's clearly due to moving BitmapAdjustPrefetchIterator() a bit down).

But if we revert this (and move the BitmapAdjustPrefetchIterator back),
the regression should disappear, and we can merge these preparatory
patches. We'll have to deal with the regression (or something very
similar) when merging the remaining patches.

regards


[1]
https://www.postgresql.org/message-id/91090d58-7d3f-4447-9425-f24ba66e292a%40enterprisedb.com
-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Reports on obsolete Postgres versions

2024-03-13 Thread Robert Treat
On Wed, Mar 13, 2024 at 1:12 PM Bruce Momjian  wrote:
>
> On Wed, Mar 13, 2024 at 09:21:27AM -0700, Jeremy Schneider wrote:
> > It's not just roadmaps and release pages where we mix up these terms
> > either, it's even in user-facing SQL and libpq routines: both
> > PQserverVersion and current_setting('server_version_num') return the
> > patch release version in the numeric patch field, rather than the
> > numeric minor field (which is always 0).
> >
> > In my view, the best thing would be to move toward consistently using
> > the word "patch" and moving away from the word "minor" for the
> > PostgreSQL quarterly maintenance updates.
> >
>
> I think "minor" is a better term since it contrasts with "major".  We
> don't actually supply patches to upgrade minor versions.
>

I tend to agree with Bruce, and major/minor seems to be the more
common usage within the industry; iirc, debian, ubuntu, gnome, suse,
and mariadb all use that nomenclature; and ISTR some distro's who
release packaged versions of postgres with custom patches applied (ie
12.4-2 for postgres 12.4 patchlevel 2).

BTW, as a reminder, we do have this statement, in bold, in the
"upgrading" section of the versioning page:
"We always recommend that all users run the latest available minor
release for whatever major version is in use."  There is actually
several other phrases and wording on that page that could probably be
propagated as replacement language in some of these other areas.

Robert Treat
https://xzilla.net




RE: Popcount optimization using AVX512

2024-03-13 Thread Amonson, Paul D
> -Original Message-
> From: Nathan Bossart 
> Sent: Wednesday, March 13, 2024 9:39 AM
> To: Amonson, Paul D 

> +extern int  pg_popcount32_slow(uint32 word); extern int
> +pg_popcount64_slow(uint64 word);
> 
> +/* In pg_popcnt_*_accel source file. */ extern int
> +pg_popcount32_fast(uint32 word); extern int  pg_popcount64_fast(uint64
> +word);
> 
> Can these prototypes be moved to a header file (maybe pg_bitutils.h)?  It
> looks like these are defined twice in the patch, and while I'm not positive 
> that
> it's against project policy to declare extern function prototypes in .c 
> files, it
> appears to be pretty rare.

Originally, I intentionally did not put these in the header file because I want 
them to be private, but they are not defined in this .c file, hence the extern. 
Now I realize the "extern" keyword is not needed to accomplish my goal. I will 
fix this by removing the "extern" keyword.

> +  'pg_popcnt_choose.c',
> +  'pg_popcnt_x86_64_accel.c',
> 
> I think we want these to be architecture-specific, i.e., only built for
> x86_64 if the compiler knows how to use the relevant instructions.  There is a
> good chance that we'll want to add similar support for other systems.
> The CRC32C files are probably a good reference point for how to do this.

I will look at this for the 'pg_popcnt_x86_64_accel.c' file but the 
'pg_popcnt_choose.c' file is intended to be for any platform that may need 
accelerators including a possible future ARM accelerator.
 
> +#ifdef TRY_POPCNT_FAST
> 
> IIUC this macro can be set if either 1) the popcntq test in the autoconf/meson
> scripts passes or 2) we're building with MSVC on x86_64.  I wonder if it would
> be better to move the MSVC/x86_64 check to the autoconf/meson scripts so
> that we could avoid surrounding large portions of the popcount code with this
> macro.  This might even be a necessary step towards building these files in an
> architecture-specific fashion.

I see the point here; however, this will take some time to get right especially 
since I don't have a Windows box to do compiles on. Should I attempt to do this 
in this patch?

Thanks,
Paul




Re: Reports on obsolete Postgres versions

2024-03-13 Thread Bruce Momjian
On Wed, Mar 13, 2024 at 09:21:27AM -0700, Jeremy Schneider wrote:
> It's not just roadmaps and release pages where we mix up these terms
> either, it's even in user-facing SQL and libpq routines: both
> PQserverVersion and current_setting('server_version_num') return the
> patch release version in the numeric patch field, rather than the
> numeric minor field (which is always 0).
> 
> In my view, the best thing would be to move toward consistently using
> the word "patch" and moving away from the word "minor" for the
> PostgreSQL quarterly maintenance updates.
> 

I think "minor" is a better term since it contrasts with "major".  We
don't actually supply patches to upgrade minor versions.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: pg16: XX000: could not find pathkey item to sort

2024-03-13 Thread Alexander Lakhin

Hello David,

09.10.2023 07:13, David Rowley wrote:

On Mon, 9 Oct 2023 at 12:42, David Rowley  wrote:

Maybe it's worth checking the total planning time spent in a run of
the regression tests with and without the patch to see how much
overhead it adds to the "average case".

I've now pushed the patch that trims off the Pathkeys for the ORDER BY
/ DISTINCT aggregates.



I've stumbled upon the same error, but this time it apparently has another
cause. It can be produced (on REL_16_STABLE and master) as follows:
CREATE TABLE t (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE td PARTITION OF t DEFAULT;
CREATE TABLE tp1 PARTITION OF t FOR VALUES FROM (1) TO (2);
SET enable_partitionwise_aggregate = on;
SET parallel_setup_cost = 0;
SELECT a, sum(b order by b) FROM t GROUP BY a ORDER BY a;

ERROR:  could not find pathkey item to sort

`git bisect` for this anomaly blames the same commit 1349d2790.

Best regards,
Alexander




Re: un-revert the MAINTAIN privilege and the pg_maintain predefined role

2024-03-13 Thread Jeff Davis
On Tue, 2024-03-12 at 16:05 -0500, Nathan Bossart wrote:
> It's easy enough to resolve this inconsistency by sending
> down the parent OID when recursing to a TOAST table and using that
> for the
> privilege checks.  AFAICT this avoids any kind of cache lookup
> hazards
> because we hold a session lock on the main relation in this case. 
> I've
> done this in the attached v2.

Looks good to me. Thank you for expanding on the comment, as well.

Regards,
Jeff Davis






Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-13 Thread Bharath Rupireddy
On Wed, Mar 13, 2024 at 11:13 AM shveta malik  wrote:
>
> > Thanks. v8-0001 is how it looks. Please see the v8 patch set with this 
> > change.
>
> JFYI, the patch does not apply to the head. There is a conflict in
> multiple files.

Thanks for looking into this. I noticed that the v8 patches needed
rebase. Before I go do anything with the patches, I'm trying to gain
consensus on the design. Following is the summary of design choices
we've discussed so far:
1) conflict_reason vs invalidation_reason.
2) When to compute the XID age?
3) Where to do the invalidations? Is it in the checkpointer or
autovacuum or some other process?
4) Interaction of these new invalidations with sync slots on the standby.

I hope to get on to these one after the other.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Popcount optimization using AVX512

2024-03-13 Thread Nathan Bossart
A couple of thoughts on v7-0001:

+extern int  pg_popcount32_slow(uint32 word);
+extern int  pg_popcount64_slow(uint64 word);

+/* In pg_popcnt_*_accel source file. */
+extern int  pg_popcount32_fast(uint32 word);
+extern int  pg_popcount64_fast(uint64 word);

Can these prototypes be moved to a header file (maybe pg_bitutils.h)?  It
looks like these are defined twice in the patch, and while I'm not positive
that it's against project policy to declare extern function prototypes in
.c files, it appears to be pretty rare.

+  'pg_popcnt_choose.c',
+  'pg_popcnt_x86_64_accel.c',

I think we want these to be architecture-specific, i.e., only built for
x86_64 if the compiler knows how to use the relevant instructions.  There
is a good chance that we'll want to add similar support for other systems.
The CRC32C files are probably a good reference point for how to do this.

+#ifdef TRY_POPCNT_FAST

IIUC this macro can be set if either 1) the popcntq test in the
autoconf/meson scripts passes or 2) we're building with MSVC on x86_64.  I
wonder if it would be better to move the MSVC/x86_64 check to the
autoconf/meson scripts so that we could avoid surrounding large portions of
the popcount code with this macro.  This might even be a necessary step
towards building these files in an architecture-specific fashion.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-13 Thread Bharath Rupireddy
On Wed, Mar 13, 2024 at 12:51 PM Bertrand Drouvot
 wrote:
>
> See the error messages on a standby:
>
> == wal removal
>
> postgres=#  SELECT * FROM pg_logical_slot_get_changes('lsub4_slot', NULL, 
> NULL, 'include-xids', '0');
> ERROR:  can no longer get changes from replication slot "lsub4_slot"
> DETAIL:  This slot has been invalidated because it exceeded the maximum 
> reserved size.
>
> == wal level
>
> postgres=# select conflict_reason from pg_replication_slots where slot_name = 
> 'lsub5_slot';;
>     conflict_reason
> ------------------------
>  wal_level_insufficient
> (1 row)
>
> postgres=#  SELECT * FROM pg_logical_slot_get_changes('lsub5_slot', NULL, 
> NULL, 'include-xids', '0');
> ERROR:  can no longer get changes from replication slot "lsub5_slot"
> DETAIL:  This slot has been invalidated because it was conflicting with 
> recovery.
>
> == rows removal
>
> postgres=# select conflict_reason from pg_replication_slots where slot_name = 
> 'lsub6_slot';;
>  conflict_reason
> -----------------
>  rows_removed
> (1 row)
>
> postgres=#  SELECT * FROM pg_logical_slot_get_changes('lsub6_slot', NULL, 
> NULL, 'include-xids', '0');
> ERROR:  can no longer get changes from replication slot "lsub6_slot"
> DETAIL:  This slot has been invalidated because it was conflicting with 
> recovery.
>
> As you can see, only wal level and rows removal are mentioning conflict with
> recovery.
>
> So, are we already "wrong" mentioning "wal_removed" in conflict_reason?

It looks like yes. So, how about we fix it the way proposed here -
https://www.postgresql.org/message-id/CALj2ACVd_dizYQiZwwUfsb%2BhG-fhGYo_kEDq0wn_vNwQvOrZHg%40mail.gmail.com?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Reports on obsolete Postgres versions

2024-03-13 Thread Jeremy Schneider
On 3/12/24 3:56 AM, Daniel Gustafsson wrote:
>>> but that is far down the page.  Do we need to improve this?
> 
>> I liked the statement from Laurenz a while ago on his blog
>> (paraphrased): "Upgrading to the latest patch release does not require
>> application testing or recertification". I am not sure we want to put
>> that into the official page (or maybe tone down/qualify it a bit), but I
>> think a lot of users stay on older minor versions because they dread
>> their internal testing policies.
> 
> I think we need a more conservative language since a minor release might fix a
> planner bug that someone's app relied on and their plans will be worse after
> upgrading.  While rare, it can for sure happen so the official wording should
> probably avoid such bold claims.
> 
>> The other thing that could maybe be made a bit better is the fantastic
>> patch release schedule, which however is buried in the "developer
>> roadmap". I can see how this was useful years ago, but I think this page
>> should be moved to the end-user part of the website, and maybe (also)
>> integrated into the support/versioning page?
> 
> Fair point.

Both of the above points show inconsistency in how PG uses the terms
"minor" and "patch" today.

It's not just roadmaps and release pages where we mix up these terms
either, it's even in user-facing SQL and libpq routines: both
PQserverVersion and current_setting('server_version_num') return the
patch release version in the numeric patch field, rather than the
numeric minor field (which is always 0).
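
For example, on a 16.2 server (illustrative values) both report 160002, with
the quarterly update number in the last two digits:

    SELECT current_setting('server_version');      -- '16.2'
    SELECT current_setting('server_version_num');  -- '160002'
    -- PQserverVersion(conn) returns the same integer, 160002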

In my view, the best thing would be to move toward consistently using
the word "patch" and moving away from the word "minor" for the
PostgreSQL quarterly maintenance updates.

-Jeremy


-- 
http://about.me/jeremy_schneider





Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-13 Thread vignesh C
On Wed, 13 Mar 2024 at 16:58, Amit Kapila  wrote:
>
> On Tue, Mar 12, 2024 at 5:13 PM vignesh C  wrote:
> >
> > On Mon, 11 Mar 2024 at 17:16, Amit Kapila  wrote:
> > >
> > > On Tue, Feb 27, 2024 at 2:07 PM Hayato Kuroda (Fujitsu)
> > >  wrote:
> > > >
> > > > > PSA the patch for implementing it. It is basically same as Ian's one.
> > > > > However, this patch still cannot satisfy the condition 3).
> > > > >
> > > > > `pg_basebackup -D data_N2 -d "user=postgres" -R`
> > > > > -> dbname would not be appeared in primary_conninfo.
> > > > >
> > > > > This is because the `if (connection_string)` case in GetConnection()
> > > > > explicitly overrides the dbname to "replication". I've tried to add a
> > > > > dummy {"dbname", NULL} entry before the overriding, but it is a no-op,
> > > > > because the replacement of the dbname in pqConnectOptions2() would be
> > > > > done only for the valid (=lastly specified) connection options.
> > > >
> > > > Oh, this patch missed the straightforward case:
> > > >
> > > > pg_basebackup -D data_N2 -d "user=postgres dbname=replication" -R
> > > > -> dbname would not be appeared in primary_conninfo.
> > > >
> > > > So I think it cannot be applied as-is. Sorry for sharing the bad item.
> > > >
> > >
> > > Can you please share the patch that can be considered for review?
> >
> > Here is a patch with few changes: a) Removed the version check based
> > on discussion from [1] b) updated the documentation.
> > I have tested various scenarios with the attached patch, the dbname
> > that is written in postgresql.auto.conf for each of the cases with
> > pg_basebackup is given below:
> > 1) ./pg_basebackup -U vignesh -R
> > -> primary_conninfo = "dbname=vignesh" (In this case primary_conninfo
> > will have dbname as username specified)
> > 2) ./pg_basebackup -D data -R
> > -> primary_conninfo = "dbname=vignesh" (In this case primary_conninfo
> > will have the dbname as username (which is the same as the OS user
> > while setting defaults))
> > 3) ./pg_basebackup -d "user=vignesh"  -D data -R
> > -> primary_conninfo = "dbname=replication"  (In this case
> > primary_conninfo will have dbname as replication which is the default
> > value from GetConnection as connection string is specified)
> > 4) ./pg_basebackup -d "user=vignesh dbname=postgres"  -D data -R
> > -> primary_conninfo = "dbname=postgres" (In this case primary_conninfo
> > will have the dbname as the dbname specified)
> > 5) ./pg_basebackup -d ""  -D data -R
> > -> primary_conninfo = "dbname=replication" (In this case
> > primary_conninfo will have dbname as replication which is the default
> > value from GetConnection as connection string is specified)
> >
> > I have mentioned the reasons as to why dbname is being set to
> > replication or username in each of the cases.
> >
>
> IIUC, the dbname is already allowed in connstring for pg_basebackup by
> commit cca97ce6a6 and with this patch we are just writing it in
> postgresql.auto.conf if -R option is used to write recovery info. This
> will help slot syncworker to automatically connect with database
> without user manually specifying the dbname and replication
> connection, the same will still be ignored. I don't see any problem
> with this. Does anyone else see any problem?
>
> The postgresql.auto.conf file will record the connection
>  settings and, if specified, the replication slot
>  that pg_basebackup is using, so that
> -streaming replication will use the same settings later on.
> +a) synchronization of logical replication slots on the primary to the
> +hot_standby from 
> +pg_sync_replication_slots or slot
> sync worker
> +and b) streaming replication will use the same settings later on.
>
> We can simply modify the last line as: ".. streaming replication or
> logical replication slots synchronization will use the same settings
> later on." Additionally, you can give a link reference to [1].

Thanks for the comments, the attached v4 version patch has the changes
for the same.

Regards,
Vignesh
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 88c689e725..563346dd71 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -243,7 +243,9 @@ PostgreSQL documentation
 The postgresql.auto.conf file will record the connection
 settings and, if specified, the replication slot
 that pg_basebackup is using, so that
-streaming replication will use the same settings later on.
+streaming replication or 
+logical replication slots synchronization will use the same
+settings later on.

 
   
diff --git a/src/fe_utils/recovery_gen.c b/src/fe_utils/recovery_gen.c
index 2585f11939..cb37703ab9 100644
--- a/src/fe_utils/recovery_gen.c
+++ b/src/fe_utils/recovery_gen.c
@@ -49,7 +49,6 @@ GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot)
 	{
 	

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-13 Thread Robert Haas
On Wed, Mar 13, 2024 at 11:39 AM Dilip Kumar  wrote:
> > Andres already commented on the snapshot stuff on an earlier patch
> > version, and that's much nicer with this version. However, I don't
> > understand why a parallel bitmap heap scan needs to do anything at all
> > with the snapshot, even before these patches. The parallel worker
> > infrastructure already passes the active snapshot from the leader to the
> > parallel worker. Why does bitmap heap scan code need to do that too?
>
> Yeah thinking on this now it seems you are right that the parallel
> infrastructure is already passing the active snapshot so why do we
> need it again.  Then I checked other low scan nodes like indexscan and
> seqscan and it seems we are doing the same things there as well.
> Check for SerializeSnapshot() in table_parallelscan_initialize() and
> index_parallelscan_initialize() which are being called from
> ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM()
> respectively.

I remember thinking about this when I was writing very early parallel
query code. It seemed to me that there must be some reason why the
EState has a snapshot, as opposed to just using the active snapshot,
and so I took care to propagate that snapshot, which is used for the
leader's scans, to the worker scans also. Now, if the EState doesn't
need to contain a snapshot, then all of that mechanism is unnecessary,
but I don't see how it can be right for the leader to do
table_beginscan() using estate->es_snapshot and the worker to use the
active snapshot.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-13 Thread Bharath Rupireddy
On Wed, Mar 13, 2024 at 9:21 AM Amit Kapila  wrote:
>
> > So, how about we turn conflict_reason to only report the reasons that
> > actually cause conflict with recovery for logical slots, something
> > like below, and then have invalidation_cause as a generic column for
> > all sorts of invalidation reasons for both logical and physical slots?
>
> If our above understanding is correct then coflict_reason will be a
> subset of invalidation_reason. If so, whatever way we arrange this
> information, there will be some sort of duplicity unless we just have
> one column 'invalidation_reason' and update the docs to interpret it
> correctly for conflicts.

Yes, there will be some sort of duplicity if we emit conflict_reason
as a text field. However, I still think the better way is to turn
conflict_reason text to conflict boolean and set it to true only on
rows_removed and wal_level_insufficient invalidations. When conflict
boolean is true, one (including all the tests that we've added
recently) can look for invalidation_reason text field for the reason.
This sounds reasonable to me as opposed to we just mentioning in the
docs that "if invalidation_reason is rows_removed or
wal_level_insufficient it's the reason for conflict with recovery".
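
To illustrate the proposed shape (the column names here are just placeholders
for this proposal, and the rows are invented based on the slots from the
earlier examples):

    SELECT slot_name, conflicting, invalidation_reason FROM pg_replication_slots;
     slot_name  | conflicting |  invalidation_reason
    ------------+-------------+------------------------
     lsub4_slot | f           | wal_removed
     lsub5_slot | t           | wal_level_insufficient
     lsub6_slot | t           | rows_removed
    (3 rows)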

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Constant Splitting/Refactoring

2024-03-13 Thread David Christensen
Here is a version 2 of this patch, rebased atop 97d85be365.

As before, this is a cleanup/prerequisite patch series for the page
features/reserved page size patches[1].  (Said patch series is going
to be updated shortly.)

This splits each of the 4 constants that care about page size into
Cluster-specific and -Limit variants, the first intended to become a
variable in time, and the second being the maximum value such a
variable may take (largely used for static
allocations).

Since these patches define these symbols to have the same values they
previously had, there are no changes in functionality.  These were
largely mechanical changes, and as such we should perhaps consider
making the same changes to the back branches so that context lines and
the like would be the same, simplifying maintainers' efforts when
applying code in back branches that touch similar areas.

The script I have to make these changes is simple, and could be run
against the back branches with only the comments surrounding the Calc()
pieces needing to be adjusted once.

Thanks,

David

[1] https://commitfest.postgresql.org/47/3986/


v2-0001-Split-MaxHeapTuplesPerPage-into-runtime-and-max-v.patch
Description: Binary data


v2-0002-Split-MaxHeapTupleSize-into-runtime-and-max-value.patch
Description: Binary data


v2-0004-Split-MaxTIDsPerBTreePage-into-runtime-and-max-va.patch
Description: Binary data


v2-0003-Split-MaxIndexTuplesPerPage-into-runtime-and-max-.patch
Description: Binary data


Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-13 Thread Dilip Kumar
On Wed, Mar 13, 2024 at 7:04 PM Heikki Linnakangas  wrote:
>
> (Adding Dilip, the original author of the parallel bitmap heap scan
> patch all those years ago, in case you remember anything about the
> snapshot stuff below.)
>
> On 27/02/2024 16:22, Melanie Plageman wrote:

> Andres already commented on the snapshot stuff on an earlier patch
> version, and that's much nicer with this version. However, I don't
> understand why a parallel bitmap heap scan needs to do anything at all
> with the snapshot, even before these patches. The parallel worker
> infrastructure already passes the active snapshot from the leader to the
> parallel worker. Why does bitmap heap scan code need to do that too?

Yeah thinking on this now it seems you are right that the parallel
infrastructure is already passing the active snapshot so why do we
need it again.  Then I checked other low scan nodes like indexscan and
seqscan and it seems we are doing the same things there as well.
Check for SerializeSnapshot() in table_parallelscan_initialize() and
index_parallelscan_initialize() which are being called from
ExecSeqScanInitializeDSM() and ExecIndexScanInitializeDSM()
respectively.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Support a wildcard in backtrace_functions

2024-03-13 Thread Jelte Fennema-Nio
On Wed, 13 Mar 2024 at 15:20, Peter Eisentraut  wrote:
> Hence the idea
>
>  backtrace_on_error = {all|internal|none}
>
> which could default to 'internal'.

I think one use-case that I personally, at least, would like to see
covered is being able to get backtraces on all warnings. How would
that be done with this setting?

backtrace_on_error = all
backtrace_min_level = warning

In that case backtrace_on_error seems like a weird name, since it can
include backtraces for warnings if you change backtrace_min_level. How
about the following approach? It still uses 3 GUCs, but they now all
work together. There's one entry point and two additional filters
(level and function name).

# Says what log entries to log backtraces for
log_backtrace = {all|internal|none} (default: internal)

# Excludes log entries from log_include_backtrace by level
backtrace_min_level = {debug4|...|fatal} (default: error)

# Excludes log entries from log_include_backtrace if function name
# does not match list, but empty string disables this filter (thus
# logging for all functions)
backtrace_functions = {...} (default: '')


PS. Other naming option for log_backtrace could be log_include_backtrace




Re: Support a wildcard in backtrace_functions

2024-03-13 Thread Bharath Rupireddy
On Wed, Mar 13, 2024 at 7:50 PM Peter Eisentraut  wrote:
>
> > I think it all depends on how close we consider
> > backtrace_on_internal_error and backtrace_functions. While they
> > obviously have similar functionality, I feel like
> > backtrace_on_internal_error is probably a function that we'd want to
> > turn on by default in the future. While backtrace_functions seems like
> > it's mostly useful for developers. (i.e. the current grouping of
> > backtrace_on_internal_error under DEVELOPER_OPTIONS seems wrong to me)
>
> Hence the idea
>
>  backtrace_on_error = {all|internal|none}
>
> which could default to 'internal'.

So, are you suggesting to just have backtrace_on_error =
{all|internal|none} leaving backtrace_functions_min_level aside?

In that case, I'd like to understand how backtrace_on_error and
backtrace_functions interact with each other? Does one need to set
backtrace_on_error = all to get backtrace of functions specified in
backtrace_functions?

What must be the behaviour of backtrace_on_error = all?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: meson vs tarballs

2024-03-13 Thread Tom Lane
Peter Eisentraut  writes:
> On 13.03.24 07:42, Andrew Dunstan wrote:
>> On 2024-03-13 We 02:31, Andrew Dunstan wrote:
>>>> Alternatively, PostgreSQL can be built using Meson. This is currently
>>>> experimental and only works when building from a Git checkout (not
>>>> from a distribution tarball).

>> Ah!. Darn, I missed that. Thanks.
>> Of course, when release 17 comes out this had better not be the case, 
>> since we have removed the custom Windows build system.

> Yes, this has been changed in 17.

My understanding is that pretty soon there will be no difference,
ie distribution tarballs will have the same contents as a git pull
(less the .git infrastructure).  If we're planning on making that
happen for 17, perhaps we'd better get on with it.

regards, tom lane




Re: MERGE ... WHEN NOT MATCHED BY SOURCE

2024-03-13 Thread Dean Rasheed
Rebased version attached.

Regards,
Dean
diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml
new file mode 100644
index f8f83d4..380d0c9
--- a/doc/src/sgml/mvcc.sgml
+++ b/doc/src/sgml/mvcc.sgml
@@ -394,10 +394,14 @@
 conditions for each action are re-evaluated on the updated version of
 the row, starting from the first action, even if the action that had
 originally matched appears later in the list of actions.
-On the other hand, if the row is concurrently updated or deleted so
-that the join condition fails, then MERGE will
-evaluate the condition's NOT MATCHED actions next,
-and execute the first one that succeeds.
+On the other hand, if the row is concurrently updated so that the join
+condition fails, then MERGE will evaluate the
+command's NOT MATCHED BY SOURCE and
+NOT MATCHED [BY TARGET] actions next, and execute
+the first one of each kind that succeeds.
+If the row is concurrently deleted, then MERGE
+will evaluate the command's NOT MATCHED [BY TARGET]
+actions, and execute the first one that succeeds.
 If MERGE attempts an INSERT
 and a unique index is present and a duplicate row is concurrently
 inserted, then a uniqueness violation error is raised;
diff --git a/doc/src/sgml/ref/merge.sgml b/doc/src/sgml/ref/merge.sgml
new file mode 100644
index e745fbd..6845f14
--- a/doc/src/sgml/ref/merge.sgml
+++ b/doc/src/sgml/ref/merge.sgml
@@ -33,7 +33,8 @@ USING dat
 and when_clause is:
 
 { WHEN MATCHED [ AND condition ] THEN { merge_update | merge_delete | DO NOTHING } |
-  WHEN NOT MATCHED [ AND condition ] THEN { merge_insert | DO NOTHING } }
+  WHEN NOT MATCHED BY SOURCE [ AND condition ] THEN { merge_update | merge_delete | DO NOTHING } |
+  WHEN NOT MATCHED [BY TARGET] [ AND condition ] THEN { merge_insert | DO NOTHING } }
 
 and merge_insert is:
 
@@ -72,7 +73,9 @@ DELETE
from data_source to
the target table
producing zero or more candidate change rows.  For each candidate change
-   row, the status of MATCHED or NOT MATCHED
+   row, the status of MATCHED,
+   NOT MATCHED BY SOURCE,
+   or NOT MATCHED [BY TARGET]
is set just once, after which WHEN clauses are evaluated
in the order specified.  For each candidate change row, the first clause to
evaluate as true is executed.  No more than one WHEN
@@ -254,18 +257,40 @@ DELETE
   At least one WHEN clause is required.
  
  
+  The WHEN clause may specify WHEN MATCHED,
+  WHEN NOT MATCHED BY SOURCE, or
+  WHEN NOT MATCHED [BY TARGET].
+  Note that the SQL standard only defines
+  WHEN MATCHED and WHEN NOT MATCHED
+  (which is defined to mean no matching target row).
+  WHEN NOT MATCHED BY SOURCE is an extension to the
+  SQL standard, as is the option to append
+  BY TARGET to WHEN NOT MATCHED, to
+  make its meaning more explicit.
+ 
+ 
   If the WHEN clause specifies WHEN MATCHED
   and the candidate change row matches a row in the
-  target table,
-  the WHEN clause is executed if the
+  data_source to a row in the
+  target table, the WHEN clause is executed if the
   condition is
   absent or it evaluates to true.
  
  
-  Conversely, if the WHEN clause specifies
-  WHEN NOT MATCHED
-  and the candidate change row does not match a row in the
-  target table,
+  If the WHEN clause specifies
+  WHEN NOT MATCHED BY SOURCE and the candidate change
+  row represents a row in the target table that does not match a row in the
+  data_source, the
+  WHEN clause is executed if the
+  condition is
+  absent or it evaluates to true.
+ 
+ 
+  If the WHEN clause specifies
+  WHEN NOT MATCHED [BY TARGET] and the candidate change
+  row represents a row in the
+  data_source that does not
+  match a row in the target table,
   the WHEN clause is executed if the
   condition is
   absent or it evaluates to true.
@@ -285,7 +310,10 @@ DELETE
  
   A condition on a WHEN MATCHED clause can refer to columns
   in both the source and the target relations. A condition on a
-  WHEN NOT MATCHED clause can only refer to columns from
+  WHEN NOT MATCHED BY SOURCE clause can only refer to
+  columns from the target relation, since by definition there is no matching
+  source row. A condition on a WHEN NOT MATCHED [BY TARGET]
+  clause can only refer to columns from
   the source relation, since by definition there is no matching target row.
   Only the system attributes from the target table are accessible.
  
@@ -409,8 +437,10 @@ DELETE
   WHEN MATCHED clause, the expression can use values
   from the original row in the target table, and values from the
   data_source row.
-  If used in a WHEN NOT MATCHED clause, the
-  expression can use values from the
+  If used in a WHEN NOT MATCHED BY SOURCE clause, the
+  

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread Masahiko Sawada
On Wed, Mar 13, 2024 at 8:05 PM John Naylor  wrote:
>
> On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada  wrote:
>
> > As I mentioned above, if we implement the test cases in C, we can use
> > the debug-build array in the test code. And we won't use it in AND/OR
> > operations tests in the future.
>
> That's a really interesting idea, so I went ahead and tried that for
> v71. This seems like a good basis for testing larger, randomized
> inputs, once we decide how best to hide that from the expected output.
> The tests use SQL functions do_set_block_offsets() and
> check_set_block_offsets(). The latter does two checks against a tid
> array, and replaces test_dump_tids().

Great! I think that's a very good starter.

The lookup_test() (and test_lookup_tids()) also tests that the
IsMember() function returns false as expected if the TID doesn't exist
in the store, and we can probably do these tests in a C function too.

BTW do we still want to test the tidstore by using a combination of
SQL functions? We might no longer need to input TIDs via a SQL
function.
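
If we do keep a thin SQL layer, I imagine it would look roughly like this
(function names taken from the patch above; the signatures are assumed only
for illustration):

    SELECT do_set_block_offsets(1, ARRAY[1, 2, 4]::int2[]);
    SELECT check_set_block_offsets();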

> Funnily enough, the debug array
> itself gave false failures when using a similar array in the test
> harness, because it didn't know all the places where the array should
> have been sorted -- it only worked by chance before because of what
> order things were done.

Good catch, thanks.

> I squashed everything from v70 and also took the liberty of switching
> on shared memory for tid store tests. The only reason we didn't do
> this with the radix tree tests is that the static attach/detach
> functions would raise warnings since they are not used.

Agreed to test the tidstore on shared memory.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Support a wildcard in backtrace_functions

2024-03-13 Thread Peter Eisentraut

On 08.03.24 16:55, Jelte Fennema-Nio wrote:

On Fri, 8 Mar 2024 at 15:51, Peter Eisentraut  wrote:

What is the relationship of these changes with the recently added
backtrace_on_internal_error?


I think that's a reasonable question. And the follow up ones too.

I think it all depends on how close we consider
backtrace_on_internal_error and backtrace_functions. While they
obviously have similar functionality, I feel like
backtrace_on_internal_error is probably a function that we'd want to
turn on by default in the future. While backtrace_functions seems like
it's mostly useful for developers. (i.e. the current grouping of
backtrace_on_internal_error under DEVELOPER_OPTIONS seems wrong to me)


Hence the idea

backtrace_on_error = {all|internal|none}

which could default to 'internal'.





Re: MERGE ... RETURNING

2024-03-13 Thread Dean Rasheed
On Wed, 13 Mar 2024 at 08:58, Dean Rasheed  wrote:
>
> I think I'll go make those doc changes, and back-patch them
> separately, since they're not related to this patch.
>

OK, I've done that. Here is a rebased patch on top of that, with the
other changes you suggested.

Regards,
Dean
diff --git a/doc/src/sgml/dml.sgml b/doc/src/sgml/dml.sgml
new file mode 100644
index cbbc5e2..3d95bdb
--- a/doc/src/sgml/dml.sgml
+++ b/doc/src/sgml/dml.sgml
@@ -283,10 +283,15 @@ DELETE FROM products;
RETURNING
   
 
+  
+   MERGE
+   RETURNING
+  
+
   
Sometimes it is useful to obtain data from modified rows while they are
being manipulated.  The INSERT, UPDATE,
-   and DELETE commands all have an
+   DELETE, and MERGE commands all have an
optional RETURNING clause that supports this.  Use
of RETURNING avoids performing an extra database query to
collect the data, and is especially valuable when it would otherwise be
@@ -339,6 +344,21 @@ DELETE FROM products
 
   
 
+  
+   In a MERGE, the data available to RETURNING is
+   the content of the source row plus the content of the inserted, updated, or
+   deleted target row.  Since it is quite common for the source and target to
+   have many of the same columns, specifying RETURNING *
+   can lead to a lot of duplicated columns, so it is often more useful to
+   qualify it so as to return just the source or target row.  For example:
+
+MERGE INTO products p USING new_products n ON p.product_no = n.product_no
+  WHEN NOT MATCHED THEN INSERT VALUES (n.product_no, n.name, n.price)
+  WHEN MATCHED THEN UPDATE SET name = n.name, price = n.price
+  RETURNING p.*;
+
+  
+
   
If there are triggers () on the target table,
the data available to RETURNING is the row as modified by
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index 0bb7aeb..0e2b888
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -22421,6 +22421,85 @@ SELECT count(*) FROM sometable;
 
  
 
+ 
+  Merge Support Functions
+
+  
+   MERGE
+   RETURNING
+  
+
+  
+   PostgreSQL includes one merge support function
+   that may be used in the RETURNING list of a
+command to identify the action taken for each
+   row.
+  
+
+  
+   Merge Support Functions
+
+   
+
+ 
+  
+   Function
+  
+  
+   Description
+  
+ 
+
+
+
+ 
+  
+   
+merge_action
+   
+   merge_action ( )
+   text
+  
+  
+   Returns the merge action command executed for the current row.  This
+   will be 'INSERT', 'UPDATE', or
+   'DELETE'.
+  
+ 
+
+   
+  
+
+  
+   Example:
+
+  
+
+  
+   Note that this function can only be used in the RETURNING
+   list of a MERGE command.  It is an error to use it in any
+   other part of a query.
+  
+
+ 
+
  
   Subquery Expressions
 
diff --git a/doc/src/sgml/glossary.sgml b/doc/src/sgml/glossary.sgml
new file mode 100644
index 8c2f114..a81c17a
--- a/doc/src/sgml/glossary.sgml
+++ b/doc/src/sgml/glossary.sgml
@@ -1442,9 +1442,9 @@
  to a client upon the
  completion of an SQL command, usually a
  SELECT but it can be an
- INSERT, UPDATE, or
- DELETE command if the RETURNING
- clause is specified.
+ INSERT, UPDATE,
+ DELETE, or MERGE command if the
+ RETURNING clause is specified.
 
 
  The fact that a result set is a relation means that a query can be used
diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
new file mode 100644
index c2b9c6a..406834e
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -1043,8 +1043,8 @@ INSERT INTO mytable VALUES (1,'one'), (2
 
 
 
- If the command does return rows (for example SELECT,
- or INSERT/UPDATE/DELETE
+ If the command does return rows (for example SELECT, or
+ INSERT/UPDATE/DELETE/MERGE
  with RETURNING), there are two ways to proceed.
  When the command will return at most one row, or you only care about
  the first row of output, write the command as usual but add
@@ -1172,6 +1172,7 @@ SELECT select_expressionsexpressions INTO STRICT target;
 UPDATE ... RETURNING expressions INTO STRICT target;
 DELETE ... RETURNING expressions INTO STRICT target;
+MERGE ... RETURNING expressions INTO STRICT target;
 
 
  where target can be a record variable, a row
@@ -1182,8 +1183,8 @@ DELETE ... RETURNING expres
  INTO clause) just as described above,
  and the plan is cached in the same way.
  This works for SELECT,
- INSERT/UPDATE/DELETE with
- RETURNING, and certain utility commands
+ INSERT/UPDATE/DELETE/MERGE
+ with RETURNING, and certain utility commands
  that return row sets, such as EXPLAIN.
  Except for the INTO clause, the SQL command is the same
  as it would be written outside PL/pgSQL.
@@ -1259,7 +1260,7 @@ END;
 
 
 
- For INSERT/UPDATE/DELETE with
+ For INSERT/UPDATE/DELETE/MERGE with
  RETURNING, 

Re: Regarding the order of the header file includes

2024-03-13 Thread Peter Eisentraut

On 12.03.24 12:47, Richard Guo wrote:


On Wed, Mar 6, 2024 at 5:32 PM Richard Guo wrote:


Please note that this patch only addresses the order of header file
includes in backend modules (and might not be thorough).  It is possible
that other modules may have a similar issue, but I have not evaluated
them yet.


Attached is v2, which also includes the 0002 patch that addresses the
order of header file includes in non-backend modules.


committed (as one patch)





Re: Add new error_action COPY ON_ERROR "log"

2024-03-13 Thread Bharath Rupireddy
On Wed, Mar 13, 2024 at 11:09 AM Michael Paquier  wrote:
>
> Hmm.  This NOTICE is really bugging me.  It is true that the clients
> would get more information, but the information is duplicated on the
> server side because the error context provides the same information as
> the NOTICE itself:
> NOTICE:  data type incompatibility at line 1 for column "a"
> CONTEXT:  COPY aa, line 1, column a: "a"
> STATEMENT:  copy aa from stdin with (on_error ignore, log_verbosity verbose);

Yes, if wanted, clients can also get the CONTEXT - for instance, using
'\set SHOW_CONTEXT always' in psql.

I think we can enhance the NOTICE message to include the column value
(just like CONTEXT message is showing) and leverage relname_only to
emit only the relation name in the CONTEXT message.

/*
 * We suppress error context information other than the relation name,
 * if one of the operations below fails.
 */
Assert(!cstate->relname_only);
cstate->relname_only = true;

I'm attaching the v8 patch set implementing the above idea. With this,
[1] is sent to the client, [2] is sent to the server log. This
approach not only reduces the duplicate info in the NOTICE and CONTEXT
messages, but also makes it easy for users to get all the necessary
info in the NOTICE message without having to set extra parameters to
get CONTEXT message.
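
For reference, the kind of command that produces [1] and [2] below is roughly
the following (a sketch only; the check_ign_err table and its data are assumed
from the quoted output):

    COPY check_ign_err FROM STDIN WITH (on_error ignore, log_verbosity verbose);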

Another idea is to move even the table name into the NOTICE message and
hide the context with errhidecontext when we emit the new NOTICE messages.

Thoughts?

[1]
NOTICE:  data type incompatibility at line 2 for column n: "a"
NOTICE:  data type incompatibility at line 3 for column k: "33"
NOTICE:  data type incompatibility at line 4 for column m: "{a, 4}"
NOTICE:  data type incompatibility at line 5 for column n: ""
NOTICE:  data type incompatibility at line 7 for column m: "a"
NOTICE:  data type incompatibility at line 8 for column k: "a"
NOTICE:  6 rows were skipped due to data type incompatibility
COPY 3

[2]
2024-03-13 13:49:14.138 UTC [1330270] NOTICE:  data type
incompatibility at line 2 for column n: "a"
2024-03-13 13:49:14.138 UTC [1330270] CONTEXT:  COPY check_ign_err
2024-03-13 13:49:14.138 UTC [1330270] NOTICE:  data type
incompatibility at line 3 for column k: "33"
2024-03-13 13:49:14.138 UTC [1330270] CONTEXT:  COPY check_ign_err
2024-03-13 13:49:14.138 UTC [1330270] NOTICE:  data type
incompatibility at line 4 for column m: "{a, 4}"
2024-03-13 13:49:14.138 UTC [1330270] CONTEXT:  COPY check_ign_err
2024-03-13 13:49:14.138 UTC [1330270] NOTICE:  data type
incompatibility at line 5 for column n: ""
2024-03-13 13:49:14.138 UTC [1330270] CONTEXT:  COPY check_ign_err
2024-03-13 13:49:14.138 UTC [1330270] NOTICE:  data type
incompatibility at line 7 for column m: "a"
2024-03-13 13:49:14.138 UTC [1330270] CONTEXT:  COPY check_ign_err
2024-03-13 13:49:14.138 UTC [1330270] NOTICE:  data type
incompatibility at line 8 for column k: "a"
2024-03-13 13:49:14.138 UTC [1330270] CONTEXT:  COPY check_ign_err
2024-03-13 13:49:14.138 UTC [1330270] NOTICE:  6 rows were skipped due
to data type incompatibility

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v8-0001-Add-LOG_VERBOSITY-option-to-COPY-command.patch
Description: Binary data


v8-0002-Add-detailed-info-when-COPY-skips-soft-errors.patch
Description: Binary data


Re: type cache cleanup improvements

2024-03-13 Thread Teodor Sigaev



I think that this patch should be split for clarity, as there are a
few things that are independently useful.  I guess something like
that:

Done; all patches should be applied sequentially.

> - The typcache changes.
01-map_rel_to_type.v5.patch adds a map from a relation to its type.


- Introduction of hash_initial_lookup(), that simplifies 3 places of
dynahash.c where the same code is used.  The routine should be
inlined.
- The split in hash_seq_search to force a different type of search is
weird, complicating the dynahash interface by hiding what seems like a
search mode.  Rather than hasHashvalue that's hidden in the middle of
HASH_SEQ_STATUS, could it be better to have an entirely different API
for the search?  That should be a patch on its own, as well.


02-hash_seq_init_with_hash_value.v5.patch - introduces a
hash_seq_init_with_hash_value() method. hash_initial_lookup() is marked as
inline, but I suppose modern compilers are smart enough to inline it
automatically.


Using a separate interface for scanning the hash by hash value will make the
scan code uglier in cases where we need to use a special hash value, as is
done in the cache's scans. Look, instead of this simplified code:

   if (hashvalue == 0)
       hash_seq_init(&status, TypeCacheHash);
   else
       hash_seq_init_with_hash_value(&status, TypeCacheHash, hashvalue);
   while ((typentry = hash_seq_search(&status)) != NULL)
   {
       ...
   }

we will need to code something like this:

   if (hashvalue == 0)
   {
       hash_seq_init(&status, TypeCacheHash);
       while ((typentry = hash_seq_search(&status)) != NULL)
       {
           ...
       }
   }
   else
   {
       hash_seq_init_with_hash_value(&status, TypeCacheHash, hashvalue);
       while ((typentry = hash_seq_search_with_hash_value(&status)) != NULL)
       {
           ...
       }
   }
Or I didn't understand you.

I thought about integrating the check inside the existing loop in hash_seq_search():
+ rerun:
if ((curElem = status->curEntry) != NULL)
{
/* Continuing scan of curBucket... */
status->curEntry = curElem->link;
if (status->curEntry == NULL)   /* end of this bucket */
+   {
+   if (status->hasHashvalue)
+   hash_seq_term(status);  

++status->curBucket;
+   }
+   else if (status->hasHashvalue && status->hashvalue !=
+curElem->hashvalue)
+   goto rerun;
return (void *) ELEMENTKEY(curElem);
}

But to me it looks weird and adds some checks which will take some CPU time.


03-att_with_hash_value.v5.patch - adds usage of previous patch.

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/
diff --git a/src/backend/utils/cache/attoptcache.c b/src/backend/utils/cache/attoptcache.c
index af978ccd4b1..3a18b2e9a77 100644
--- a/src/backend/utils/cache/attoptcache.c
+++ b/src/backend/utils/cache/attoptcache.c
@@ -44,12 +44,10 @@ typedef struct
 
 /*
  * InvalidateAttoptCacheCallback
- *		Flush all cache entries when pg_attribute is updated.
+ *		Flush cache entry (or entries) when pg_attribute is updated.
  *
  * When pg_attribute is updated, we must flush the cache entry at least
- * for that attribute.  Currently, we just flush them all.  Since attribute
- * options are not currently used in performance-critical paths (such as
- * query execution), this seems OK.
+ * for that attribute.
  */
 static void
 InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue)
@@ -57,7 +55,16 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue)
 	HASH_SEQ_STATUS status;
 	AttoptCacheEntry *attopt;
 
-	hash_seq_init(&status, AttoptCacheHash);
+	/*
+	 * By convention, zero hash value is passed to the callback as a sign
+	 * that it's time to invalidate the cache. See sinval.c, inval.c and
+	 * InvalidateSystemCachesExtended().
+	 */
+	if (hashvalue == 0)
+		hash_seq_init(&status, AttoptCacheHash);
+	else
+		hash_seq_init_with_hash_value(&status, AttoptCacheHash, hashvalue);
+
 	while ((attopt = (AttoptCacheEntry *) hash_seq_search(&status)) != NULL)
 	{
 		if (attopt->opts)
@@ -70,6 +77,18 @@ InvalidateAttoptCacheCallback(Datum arg, int cacheid, uint32 hashvalue)
 	}
 }
 
+/*
+ * Hash function compatible with two-arg system cache hash function.
+ */
+static uint32
+relatt_cache_syshash(const void *key, Size keysize)
+{
+	const AttoptCacheKey* ckey = key;
+
+	Assert(keysize == sizeof(*ckey));
+	return GetSysCacheHashValue2(ATTNUM, ckey->attrelid, ckey->attnum);
+}
+
 /*
  * InitializeAttoptCache
  *		Initialize the attribute options cache.
@@ -82,9 +101,17 @@ InitializeAttoptCache(void)
 	/* Initialize the hash table. */
 	ctl.keysize = sizeof(AttoptCacheKey);
 	ctl.entrysize = sizeof(AttoptCacheEntry);
+
+	/*
+	 * AttoptCacheEntry takes hash value from the system cache. For
+	 * AttoptCacheHash we use the same hash in order to speedup search by hash
+	 * value. This is used by hash_seq_init_with_hash_value().

Re: BitmapHeapScan streaming read user and prelim refactoring

2024-03-13 Thread Heikki Linnakangas
(Adding Dilip, the original author of the parallel bitmap heap scan 
patch all those years ago, in case you remember anything about the 
snapshot stuff below.)


On 27/02/2024 16:22, Melanie Plageman wrote:

On Mon, Feb 26, 2024 at 08:50:28PM -0500, Melanie Plageman wrote:

On Fri, Feb 16, 2024 at 12:35:59PM -0500, Melanie Plageman wrote:

In the attached v3, I've reordered the commits, updated some errant
comments, and improved the commit messages.

I've also made some updates to the TIDBitmap API that seem like a
clarity improvement to the API in general. These also reduce the diff
for GIN when separating the TBMIterateResult from the
TBM[Shared]Iterator. And these TIDBitmap API changes are now all in
their own commits (previously those were in the same commit as adding
the BitmapHeapScan streaming read user).

The three outstanding issues I see in the patch set are:
1) the lossy and exact page counters issue described in my previous
email


I've resolved this. I added a new patch to the set which starts counting
even pages with no visible tuples toward lossy and exact pages. After an
off-list conversation with Andres, it seems that this omission in master
may not have been intentional.

Once we have only two types of pages to differentiate between (lossy and
exact [no longer have to care about "has no visible tuples"]), it is
easy enough to pass a "lossy" boolean parameter to
table_scan_bitmap_next_block(). I've done this in the attached v4.


Thomas posted a new version of the Streaming Read API [1], so here is a
rebased v5. This should make it easier to review as it can be applied on
top of master.


Lots of discussion happening on the performance results but it seems 
that there is no performance impact with the preliminary patches up to 
v5-0013-Streaming-Read-API.patch. I'm focusing purely on those 
preliminary patches now, because I think they're worthwhile cleanups 
independent of the streaming read API.


Andres already commented on the snapshot stuff on an earlier patch 
version, and that's much nicer with this version. However, I don't 
understand why a parallel bitmap heap scan needs to do anything at all 
with the snapshot, even before these patches. The parallel worker 
infrastructure already passes the active snapshot from the leader to the 
parallel worker. Why does bitmap heap scan code need to do that too?


I disabled that with:


--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -874,7 +874,9 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
pstate = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, 
false);
node->pstate = pstate;
 
+#if 0

node->worker_snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
Assert(IsMVCCSnapshot(node->worker_snapshot));
RegisterSnapshot(node->worker_snapshot);
+#endif
 }


and ran "make check-world". All the tests passed. To be even more sure, 
I added some code there to assert that the serialized version of 
node->ss.ps.state->es_snapshot is equal to pstate->phs_snapshot_data, 
and all the tests passed with that too.


I propose that we just remove the code in BitmapHeapScan to serialize 
the snapshot, per attached patch.


--
Heikki Linnakangas
Neon (https://neon.tech)
From 265d77efa2b56d5cfc3caf7843058b7336524860 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 13 Mar 2024 15:25:07 +0200
Subject: [PATCH 1/1] Remove redundant snapshot copying from parallel leader to
 workers

The parallel query infrastructure copies the leader backend's active
snapshot to the worker processes. But BitmapHeapScan node also had
bespoken code to pass the snapshot from leader to the worker. That was
redundant, so remove it.

Discussion: XX
---
 src/backend/access/table/tableam.c| 10 --
 src/backend/executor/nodeBitmapHeapscan.c | 17 ++---
 src/include/access/tableam.h  |  5 -
 src/include/nodes/execnodes.h |  4 
 4 files changed, 2 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 6ed8cca05a1..e57a0b7ea31 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -120,16 +120,6 @@ table_beginscan_catalog(Relation relation, int nkeys, struct ScanKeyData *key)
 			NULL, flags);
 }
 
-void
-table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot)
-{
-	Assert(IsMVCCSnapshot(snapshot));
-
-	RegisterSnapshot(snapshot);
-	scan->rs_snapshot = snapshot;
-	scan->rs_flags |= SO_TEMP_SNAPSHOT;
-}
-
 
 /* 
  * Parallel table scan related functions.
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 345b67649ea..ca548e44eb4 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -721,7 +721,6 @@ 

Re: Using the %m printf format more

2024-03-13 Thread Peter Eisentraut

On 12.03.24 02:22, Michael Paquier wrote:

On Mon, Mar 11, 2024 at 11:19:16AM +, Dagfinn Ilmari Mannsåker wrote:

On closer look, fprintf() is actually the only errno-clobbering function
it calls, I was just hedging my bets in that statement.


This makes the whole simpler, so I'd be OK with that.  I am wondering
if others have opinions to offer about that.


The 0002 patch looks sensible.  It would be good to fix that, otherwise 
it could have some confusing outcomes in the future.






Re: Disabling Heap-Only Tuples

2024-03-13 Thread Laurenz Albe
On Thu, 2023-09-21 at 16:18 -0700, Andres Freund wrote:
> I think a minimal working approach could be to have the configuration be based
> on the relation size vs space known to the FSM. If the target block of an
> update is higher than ((relation_size - fsm_free_space) *
> new_reloption_or_guc), try finding the target block via the FSM, even if
> there's space on the page.

That sounds like a good way forward.
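
To put rough numbers on it (all invented for illustration): with a 1000-block
table, an FSM reporting about 200 blocks' worth of free space, and the new
setting at 0.9, the cutoff would be (1000 - 200) * 0.9 = 720, so an update
whose existing target block is beyond block 720 would go through the FSM even
if its current page still has room.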

The patch is in state "needs review", but it got review.  I'll change it to
"waiting for author".

Yours,
Laurenz Albe




Re: Improve readability by using designated initializers when possible

2024-03-13 Thread Peter Eisentraut

On 08.03.24 06:50, Michael Paquier wrote:

On Mon, Mar 04, 2024 at 09:29:03AM +0800, jian he wrote:

On Fri, Mar 1, 2024 at 5:26 PM Peter Eisentraut  wrote:

Oops, there was a second commit in my branch that I neglected to send
in.  Here is my complete patch set.


Thanks for the new patch set.  The gains are neat, giving nice
numbers:
  7 files changed, 200 insertions(+), 644 deletions(-)

+   default:
+   DropObjectById(object);
+   break;

Hmm.  I am not sure that this is a good idea.  Wouldn't it be safer to
use as default path something that generates an ERROR so as this code
path would complain immediately when adding a new catalog?


fixed in new patch


In getObjectDescription() and getObjectTypeDescription() this was a
safeguard, but we don't have that anymore.  So it seems to me that
this should be replaced with a default with elog(ERROR)?


fixed


there is a `OCLASS` at the end of getObjectIdentityParts.


Nice catch.  A comment is not updated.


There is a `ObjectClass` in typedefs.list


This is usually taken care of by committers or updated automatically.


both fixed
From a40a97de94d69d44b3a04b7987e99c8a6675bccf Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 13 Mar 2024 14:16:09 +0100
Subject: [PATCH v3] Remove ObjectClass

ObjectClass is an enum whose values correspond to catalog OIDs.  But
the extra layer of redirection, which is used only in small parts of
the code, and the similarity to ObjectType, are confusing and
cumbersome.

One advantage has been that some switches processing the OCLASS enum
don't have "default:" cases.  This is so that the compiler tells us
when we fail to add support for some new object class.  But you can
also handle that with some assertions and proper test coverage.  It's
not even clear how strong this benefit is.  For example, in
AlterObjectNamespace_oid(), you could still put a new OCLASS into the
"ignore object types that don't have schema-qualified names" case, and
it might or might not be wrong.  Also, there are already various
OCLASS switches that do have a default case, so it's not even clear
what the preferred coding style should be.

Discussion: 
https://www.postgresql.org/message-id/flat/CAGECzQT3caUbcCcszNewCCmMbCuyP7XNAm60J3ybd6PN5kH2Dw%40mail.gmail.com
---
 src/backend/catalog/dependency.c | 239 
 src/backend/catalog/objectaddress.c  | 316 ---
 src/backend/commands/alter.c |  73 ++-
 src/backend/commands/event_trigger.c | 122 +--
 src/backend/commands/tablecmds.c |  62 ++
 src/include/catalog/dependency.h |  51 -
 src/include/commands/event_trigger.h |   2 +-
 src/tools/pgindent/typedefs.list |   1 -
 8 files changed, 222 insertions(+), 644 deletions(-)

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index eadcf6af0df..6e8f6a57051 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -206,7 +206,7 @@ deleteObjectsInList(ObjectAddresses *targetObjects, 
Relation *depRel,
if (extra->flags & DEPFLAG_REVERSE)
normal = true;
 
-   if 
(EventTriggerSupportsObjectClass(getObjectClass(thisobj)))
+   if (EventTriggerSupportsObject(thisobj))
{
EventTriggerSQLDropAddObject(thisobj, original, 
normal);
}
@@ -1349,9 +1349,9 @@ deleteOneObject(const ObjectAddress *object, Relation 
*depRel, int flags)
 static void
 doDeletion(const ObjectAddress *object, int flags)
 {
-   switch (getObjectClass(object))
+   switch (object->classId)
{
-   case OCLASS_CLASS:
+   case RelationRelationId:
{
charrelKind = 
get_rel_relkind(object->objectId);
 
@@ -1382,104 +1382,102 @@ doDeletion(const ObjectAddress *object, int flags)
break;
}
 
-   case OCLASS_PROC:
+   case ProcedureRelationId:
RemoveFunctionById(object->objectId);
break;
 
-   case OCLASS_TYPE:
+   case TypeRelationId:
RemoveTypeById(object->objectId);
break;
 
-   case OCLASS_CONSTRAINT:
+   case ConstraintRelationId:
RemoveConstraintById(object->objectId);
break;
 
-   case OCLASS_DEFAULT:
+   case AttrDefaultRelationId:
RemoveAttrDefaultById(object->objectId);
break;
 
-   case OCLASS_LARGEOBJECT:
+   case LargeObjectRelationId:
LargeObjectDrop(object->objectId);
break;
 
-   case OCLASS_OPERATOR:
+   case 

Re: On disable_cost

2024-03-13 Thread Robert Haas
On Tue, Mar 12, 2024 at 4:55 PM David Rowley  wrote:
> The primary place I see issues with disabled_cost is caused by
> STD_FUZZ_FACTOR.  When you add 1.0e10 to a couple of modestly costly
> paths, it makes them appear fuzzily the same in cases where one could
> be significantly cheaper than the other. If we were to bump up the
> disable_cost it would make this problem worse.

Hmm, good point.
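
To put numbers on that (illustrative, assuming the current STD_FUZZ_FACTOR of
1.01): paths costing 1,000 and 5,000,000 become 10,000,001,000 and
10,005,000,000 after adding 1.0e10, a ratio of roughly 1.0005, so
compare_path_costs_fuzzily() would treat them as equal even though one is
5,000 times more expensive.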

> So maybe the fix could be to set disable_cost to something like
> 1.0e110 and adjust compare_path_costs_fuzzily to not apply the
> fuzz_factor for paths >= disable_cost.   However, I wonder if that
> risks the costs going infinite after a couple of cartesian joins.

Yeah, I think the disabled flag is a better answer if we can make it
work. No matter what value we pick for disable_cost, it's bound to
skew the planning sometimes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Using the %m printf format more

2024-03-13 Thread Dagfinn Ilmari Mannsåker
Michael Paquier  writes:

> On Mon, Mar 11, 2024 at 11:19:16AM +, Dagfinn Ilmari Mannsåker wrote:
>> On closer look, fprintf() is actually the only errno-clobbering function
>> it calls, I was just hedging my bets in that statement.
>
> This makes the whole simpler, so I'd be OK with that.  I am wondering
> if others have opinions to offer about that.

If no one chimes in in the next couple of days I'll add it to the
commitfest so it doesn't get lost.

> I've applied v2-0001 for now, as it is worth on its own and it shaves
> a bit of code.

Thanks!

- ilmari




Re: POC, WIP: OR-clause support for indexes

2024-03-13 Thread Andrei Lepikhov

On 13/3/2024 18:05, Alexander Korotkov wrote:

On Wed, Mar 13, 2024 at 7:52 AM Andrei Lepikhov
Given all of the above, I think moving transformation to the
canonicalize_qual() would be the right way to go.

Ok, I will try to move the code.
I have no idea about the timing so far. I recall that last time I got
bogged down in tons of duplicated code. I hope that with an almost-ready
sketch it will be easier.


--
regards,
Andrei Lepikhov
Postgres Professional





Re: Statistics Import and Export

2024-03-13 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > No, we should be keeping the lock until the end of the transaction
> > (which in this case would be just the one statement, but it would be the
> > whole statement and all of the calls in it).  See analyze.c:268 or
> > so, where we call relation_close(onerel, NoLock); meaning we're closing
> > the relation but we're *not* releasing the lock on it- it'll get
> > released at the end of the transaction.
>
> If that's the case, then changing the two table_close() statements to
> NoLock should resolve any remaining concern.

Note that there's two different things we're talking about here- the
lock on the relation that we're analyzing and then the lock on the
pg_statistic (or pg_class) catalog itself.  Currently, at least, it
looks like in the three places in the backend that we open
StatisticRelationId, we release the lock when we close it rather than
waiting for transaction end.  I'd be inclined to keep it that way in
these functions also.  I doubt that one lock will end up causing much in
the way of issues to acquire/release it multiple times and it would keep
the code consistent with the way ANALYZE works.

If it can be shown to be an issue then we could certainly revisit this.
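
As a rough sketch only (not the patch under discussion, and the helper name
is invented), the two lock lifetimes described above look like this in
backend code:

#include "postgres.h"
#include "access/table.h"
#include "catalog/pg_statistic.h"
#include "storage/lockdefs.h"
#include "utils/rel.h"

/* Hypothetical helper illustrating the two different lock lifetimes. */
static void
stats_locking_sketch(Oid relid)
{
    Relation    onerel;
    Relation    srel;

    /* Lock the relation whose statistics are being written ... */
    onerel = table_open(relid, ShareUpdateExclusiveLock);
    /* ... do the work ... */
    /* NoLock: the relation lock is kept until end of transaction. */
    table_close(onerel, NoLock);

    /*
     * The pg_statistic catalog lock, by contrast, is released right away,
     * matching how the existing backend code treats StatisticRelationId.
     */
    srel = table_open(StatisticRelationId, RowExclusiveLock);
    /* ... insert or update pg_statistic rows ... */
    table_close(srel, RowExclusiveLock);
}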

Thanks,

Stephen




Re: Put genbki.pl output into src/include/catalog/ directly

2024-03-13 Thread Andreas Karlsson

On 2/8/24 8:58 AM, Peter Eisentraut wrote:
I think keeping the two build systems aligned this way will be useful 
for longer-term maintenance.


Agreed, so started reviewing the patch. Attached is a rebased version of 
the patch to solve a conflict.


Andreas
From 2069c6d6e2ef2bc37c5af0df12c558ead8a99b15 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 8 Feb 2024 08:31:21 +0100
Subject: [PATCH] Put genbki.pl output into src/include/catalog/ directly

With the makefile rules, the output of genbki.pl was written to
src/backend/catalog/, and then the header files were linked to
src/include/catalog/.

This changes it so that the output files are written directly to
src/include/catalog/.  This makes the logic simpler, and it also makes
the behavior consistent with the meson build system.  Also, the list
of catalog files is now kept in parallel in
src/include/catalog/{meson.build,Makefile}, while before the makefiles
had it in src/backend/catalog/Makefile.
---
 src/backend/Makefile |   2 +-
 src/backend/catalog/.gitignore   |   8 --
 src/backend/catalog/Makefile | 140 +---
 src/include/Makefile |   7 +-
 src/include/catalog/.gitignore   |   4 +-
 src/include/catalog/Makefile | 152 +++
 src/include/catalog/meson.build  |   3 +-
 src/tools/pginclude/headerscheck |   2 -
 8 files changed, 163 insertions(+), 155 deletions(-)
 delete mode 100644 src/backend/catalog/.gitignore

diff --git a/src/backend/Makefile b/src/backend/Makefile
index d66e2a4b9f..3d7be09529 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -118,7 +118,7 @@ utils/activity/wait_event_types.h: utils/activity/generate-wait_event_types.pl u
 
 # run this unconditionally to avoid needing to know its dependencies here:
 submake-catalog-headers:
-	$(MAKE) -C catalog generated-header-symlinks
+	$(MAKE) -C ../include/catalog generated-headers
 
 # run this unconditionally to avoid needing to know its dependencies here:
 submake-nodes-headers:
diff --git a/src/backend/catalog/.gitignore b/src/backend/catalog/.gitignore
deleted file mode 100644
index b580f734c7..00
--- a/src/backend/catalog/.gitignore
+++ /dev/null
@@ -1,8 +0,0 @@
-/postgres.bki
-/schemapg.h
-/syscache_ids.h
-/syscache_info.h
-/system_fk_info.h
-/system_constraints.sql
-/pg_*_d.h
-/bki-stamp
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index 196ecafc90..1589a75fd5 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -50,141 +50,8 @@ OBJS = \
 
 include $(top_srcdir)/src/backend/common.mk
 
-# Note: the order of this list determines the order in which the catalog
-# header files are assembled into postgres.bki.  BKI_BOOTSTRAP catalogs
-# must appear first, and pg_statistic before pg_statistic_ext_data, and
-# there are reputedly other, undocumented ordering dependencies.
-CATALOG_HEADERS := \
-	pg_proc.h \
-	pg_type.h \
-	pg_attribute.h \
-	pg_class.h \
-	pg_attrdef.h \
-	pg_constraint.h \
-	pg_inherits.h \
-	pg_index.h \
-	pg_operator.h \
-	pg_opfamily.h \
-	pg_opclass.h \
-	pg_am.h \
-	pg_amop.h \
-	pg_amproc.h \
-	pg_language.h \
-	pg_largeobject_metadata.h \
-	pg_largeobject.h \
-	pg_aggregate.h \
-	pg_statistic.h \
-	pg_statistic_ext.h \
-	pg_statistic_ext_data.h \
-	pg_rewrite.h \
-	pg_trigger.h \
-	pg_event_trigger.h \
-	pg_description.h \
-	pg_cast.h \
-	pg_enum.h \
-	pg_namespace.h \
-	pg_conversion.h \
-	pg_depend.h \
-	pg_database.h \
-	pg_db_role_setting.h \
-	pg_tablespace.h \
-	pg_authid.h \
-	pg_auth_members.h \
-	pg_shdepend.h \
-	pg_shdescription.h \
-	pg_ts_config.h \
-	pg_ts_config_map.h \
-	pg_ts_dict.h \
-	pg_ts_parser.h \
-	pg_ts_template.h \
-	pg_extension.h \
-	pg_foreign_data_wrapper.h \
-	pg_foreign_server.h \
-	pg_user_mapping.h \
-	pg_foreign_table.h \
-	pg_policy.h \
-	pg_replication_origin.h \
-	pg_default_acl.h \
-	pg_init_privs.h \
-	pg_seclabel.h \
-	pg_shseclabel.h \
-	pg_collation.h \
-	pg_parameter_acl.h \
-	pg_partitioned_table.h \
-	pg_range.h \
-	pg_transform.h \
-	pg_sequence.h \
-	pg_publication.h \
-	pg_publication_namespace.h \
-	pg_publication_rel.h \
-	pg_subscription.h \
-	pg_subscription_rel.h
-
-GENERATED_HEADERS := $(CATALOG_HEADERS:%.h=%_d.h) schemapg.h syscache_ids.h syscache_info.h system_fk_info.h
-
-POSTGRES_BKI_SRCS := $(addprefix $(top_srcdir)/src/include/catalog/, $(CATALOG_HEADERS))
-
-# The .dat files we need can just be listed alphabetically.
-POSTGRES_BKI_DATA = $(addprefix $(top_srcdir)/src/include/catalog/,\
-	pg_aggregate.dat \
-	pg_am.dat \
-	pg_amop.dat \
-	pg_amproc.dat \
-	pg_authid.dat \
-	pg_cast.dat \
-	pg_class.dat \
-	pg_collation.dat \
-	pg_conversion.dat \
-	pg_database.dat \
-	pg_language.dat \
-	pg_namespace.dat \
-	pg_opclass.dat \
-	pg_operator.dat \
-	pg_opfamily.dat \
-	pg_proc.dat \
-	pg_range.dat \
-	pg_tablespace.dat \
-	pg_ts_config.dat \
-	pg_ts_config_map.dat \
-	pg_ts_dict.dat \
-	pg_ts_parser.dat \
-	pg_ts_template.dat \
-	pg_type.dat 

Re: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-13 Thread Amit Kapila
On Tue, Mar 12, 2024 at 5:13 PM vignesh C  wrote:
>
> On Mon, 11 Mar 2024 at 17:16, Amit Kapila  wrote:
> >
> > On Tue, Feb 27, 2024 at 2:07 PM Hayato Kuroda (Fujitsu)
> >  wrote:
> > >
> > > > PSA the patch for implementing it. It is basically same as Ian's one.
> > > > However, this patch still cannot satisfy the condition 3).
> > > >
> > > > `pg_basebackup -D data_N2 -d "user=postgres" -R`
> > > > -> dbname would not be appeared in primary_conninfo.
> > > >
> > > > This is because `if (connection_string)` case in GetConnection() 
> > > > explicy override
> > > > a dbname to "replication". I've tried to add a dummy entry {"dbname", 
> > > > NULL} pair
> > > > before the overriding, but it is no-op. Because The replacement of the 
> > > > dbname in
> > > > pqConnectOptions2() would be done only for the valid (=lastly specified)
> > > > connection options.
> > >
> > > Oh, this patch missed the straightforward case:
> > >
> > > pg_basebackup -D data_N2 -d "user=postgres dbname=replication" -R
> > > -> dbname would not be appeared in primary_conninfo.
> > >
> > > So I think it cannot be applied as-is. Sorry for sharing the bad item.
> > >
> >
> > Can you please share the patch that can be considered for review?
>
> Here is a patch with few changes: a) Removed the version check based
> on discussion from [1] b) updated the documentation.
> I have tested various scenarios with the attached patch, the dbname
> that is written in postgresql.auto.conf for each of the cases with
> pg_basebackup is given below:
> 1) ./pg_basebackup -U vignesh -R
> -> primary_conninfo = "dbname=vignesh" (In this case primary_conninfo
> will have dbname as username specified)
> 2) ./pg_basebackup -D data -R
> -> primary_conninfo = "dbname=vignesh" (In this case primary_conninfo
> will have the dbname as username (which is the same as the OS user
> while setting defaults))
> 3) ./pg_basebackup -d "user=vignesh"  -D data -R
> -> primary_conninfo = "dbname=replication"  (In this case
> primary_conninfo will have dbname as replication which is the default
> value from GetConnection as connection string is specified)
> 4) ./pg_basebackup -d "user=vignesh dbname=postgres"  -D data -R
> -> primary_conninfo = "dbname=postgres" (In this case primary_conninfo
> will have the dbname as the dbname specified)
> 5) ./pg_basebackup -d ""  -D data -R
> -> primary_conninfo = "dbname=replication" (In this case
> primary_conninfo will have dbname as replication which is the default
> value from GetConnection as connection string is specified)
>
> I have mentioned the reasons as to why dbname is being set to
> replication or username in each of the cases.
>

IIUC, the dbname is already allowed in connstring for pg_basebackup by
commit cca97ce6a6 and with this patch we are just writing it in
postgresql.auto.conf if -R option is used to write recovery info. This
will help slot syncworker to automatically connect with database
without user manually specifying the dbname and replication
connection, the same will still be ignored. I don't see any problem
with this. Does anyone else see any problem?

The postgresql.auto.conf file will record the connection
 settings and, if specified, the replication slot
 that pg_basebackup is using, so that
-streaming replication will use the same settings later on.
+a) synchronization of logical replication slots on the primary to the
+hot_standby from 
+pg_sync_replication_slots or slot
sync worker
+and b) streaming replication will use the same settings later on.

We can simply modify the last line as: ".. streaming replication or
logical replication slots synchronization will use the same settings
later on." Additionally, you can give a link reference to [1].

[1] - 
https://www.postgresql.org/docs/devel/logicaldecoding-explanation.html#LOGICALDECODING-REPLICATION-SLOTS-SYNCHRONIZATION
-- 
With Regards,
Amit Kapila.




Re: POC, WIP: OR-clause support for indexes

2024-03-13 Thread Alexander Korotkov
On Wed, Mar 13, 2024 at 7:52 AM Andrei Lepikhov
 wrote:
> On 12/3/2024 22:20, Alexander Korotkov wrote:
> > On Mon, Mar 11, 2024 at 2:43 PM Andrei Lepikhov
> >> I think you are right. It is probably a better place than any other to
> >> remove duplicates in an array. I just think we should sort and remove
> >> duplicates from entry->consts in one pass. Thus, this optimisation
> >> should be applied to sortable constants.
> >
> > Ok.
> New version of the patch set implemented all we have agreed on for now.
> We can return MAX_SAOP_ARRAY_SIZE constraint and Alena's approach to
> duplicates deletion for non-sortable cases at the end.
> >
> >> Hmm, we already tried to do it at that point. I vaguely recall some
> >> issues caused by this approach. Anyway, it should be done as quickly as
> >> possible to increase the effect of the optimization.
> >
> > I think there were provided quite strong reasons why this shouldn't be
> > implemented at the parse analysis stage [1], [2], [3].  The
> > canonicalize_qual() looks quite appropriate place for that since it
> > does similar transformations.
> Ok. Let's discuss these reasons. In Robert's opinion [1,3], we should do
> the transformation based on the cost model. But in the canonicalize_qual
> routine, we still make the transformation blindly. Moreover, the second
> patch reduces the weight of this reason, doesn't it? Maybe we shouldn't
> think about that as about optimisation but some 'general form of
> expression'?
> Peter [2] worries about the possible transformation outcomes at this
> stage. But remember, we already transform clauses like ROW() IN (...) to
> a series of ORs here, so it is not an issue. Am I wrong?
> Why did we discard the attempt with canonicalize_qual on the previous
> iteration? - The stage of parsing is much more native for building SAOP
> quals. We can reuse make_scalar_array_op and other stuff, for example.
> During the optimisation stage, the only list partitioning machinery
> creates SAOP based on a list of constants. So, in theory, it is possible
> to implement. But do we really need to make the code more complex?

As we currently do OR-to-ANY transformation at the parse stage, the
system catalog (including views, inheritance clauses, partial and
expression indexes, and others) would have a form depending on
enable_or_transformation at the moment of DDL execution.  I think this
is rather wrong.  The enable_or_transformation should be run-time
optimization which affects the resulting query plan, its result
shouldn't be persistent.

Regarding the ROW() IN (...) precedent.

1. AFAICS, this is not exactly an optimization.  This transformation
allows us to perform type matching individually for every value.
Therefore it allows executing some queries which would otherwise end
up in an error.
2. I don't think this is a sample of good design.  This is rather a
hack that is here for historical reasons, and we don't want to
replicate this experience.

Given all of the above, I think moving transformation to the
canonicalize_qual() would be the right way to go.

--
Regards,
Alexander Korotkov




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-13 Thread John Naylor
On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada  wrote:

> As I mentioned above, if we implement the test cases in C, we can use
> the debug-build array in the test code. And we won't use it in AND/OR
> operations tests in the future.

That's a really interesting idea, so I went ahead and tried that for
v71. This seems like a good basis for testing larger, randomized
inputs, once we decide how best to hide that from the expected output.
The tests use SQL functions do_set_block_offsets() and
check_set_block_offsets(). The latter does two checks against a tid
array, and replaces test_dump_tids(). Funnily enough, the debug array
itself gave false failures when using a similar array in the test
harness, because it didn't know all the places where the array should
have been sorted -- it only worked by chance before because of what
order things were done.
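
A minimal sketch of the lazy-sorting idea implied here, assuming the
tids/num_tids/tids_unordered fields and the itemptr_cmp() comparator that
the debug patch adds to TidStore (this is not the v71 code itself):

/* Sort the debug array only when a lookup happens after possibly
 * out-of-order insertions, then binary-search it. */
static bool
ts_debug_is_member_sketch(TidStore *ts, ItemPointer tid)
{
    if (ts->tids_unordered)
    {
        qsort(ts->tids, ts->num_tids, sizeof(ItemPointerData), itemptr_cmp);
        ts->tids_unordered = false;
    }

    return bsearch(tid, ts->tids, ts->num_tids,
                   sizeof(ItemPointerData), itemptr_cmp) != NULL;
}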

I squashed everything from v70 and also took the liberty of switching
on shared memory for tid store tests. The only reason we didn't do
this with the radix tree tests is that the static attach/detach
functions would raise warnings since they are not used.
From 9b6b517b9f9977999034d16b4dc33e91094ae7ba Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Tue, 12 Dec 2023 22:36:24 +0900
Subject: [PATCH v71 2/6] DEV: Debug TidStore.

---
 src/backend/access/common/tidstore.c | 203 ++-
 1 file changed, 201 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index b725b62d4c..33753d8ed2 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -29,6 +29,11 @@
 #include "utils/dsa.h"
 #include "utils/memutils.h"
 
+/* Enable TidStore debugging if USE_ASSERT_CHECKING */
+#ifdef USE_ASSERT_CHECKING
+#define TIDSTORE_DEBUG
+#include "catalog/index.h"
+#endif
 
 #define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
 #define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
@@ -88,6 +93,13 @@ struct TidStore
 
 	/* DSA area for TidStore if using shared memory */
 	dsa_area   *area;
+
+#ifdef TIDSTORE_DEBUG
+	ItemPointerData	*tids;
+	int64		max_tids;
+	int64		num_tids;
+	bool		tids_unordered;
+#endif
 };
 #define TidStoreIsShared(ts) ((ts)->area != NULL)
 
@@ -105,11 +117,25 @@ struct TidStoreIter
 
 	/* output for the caller */
 	TidStoreIterResult output;
+
+#ifdef TIDSTORE_DEBUG
+	/* iterator index for the ts->tids array */
+	int64		tids_idx;
+#endif
 };
 
 static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key,
 	   BlocktableEntry *page);
 
+/* debug functions available only when TIDSTORE_DEBUG */
+#ifdef TIDSTORE_DEBUG
+static void ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno,
+	   OffsetNumber *offsets, int num_offsets);
+static void ts_debug_iter_check_tids(TidStoreIter *iter);
+static bool ts_debug_is_member(TidStore *ts, ItemPointer tid);
+static int itemptr_cmp(const void *left, const void *right);
+#endif
+
 /*
  * Create a TidStore. The TidStore will live in the memory context that is
  * CurrentMemoryContext at the time of this call. The TID storage, backed
@@ -154,6 +180,17 @@ TidStoreCreate(size_t max_bytes, dsa_area *area)
 	else
 		ts->tree.local = local_rt_create(ts->rt_context);
 
+#ifdef TIDSTORE_DEBUG
+	{
+		int64		max_tids = max_bytes / sizeof(ItemPointerData);
+
+		ts->tids = palloc(sizeof(ItemPointerData) * max_tids);
+		ts->max_tids = max_tids;
+		ts->num_tids = 0;
+		ts->tids_unordered = false;
+	}
+#endif
+
 	return ts;
 }
 
@@ -191,6 +228,7 @@ TidStoreDetach(TidStore *ts)
 	Assert(TidStoreIsShared(ts));
 
 	shared_rt_detach(ts->tree.shared);
+
 	pfree(ts);
 }
 
@@ -241,6 +279,11 @@ TidStoreDestroy(TidStore *ts)
 
 	MemoryContextDelete(ts->rt_context);
 
+#ifdef TIDSTORE_DEBUG
+	if (!TidStoreIsShared(ts))
+		pfree(ts->tids);
+#endif
+
 	pfree(ts);
 }
 
@@ -297,6 +340,14 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
 		found = local_rt_set(ts->tree.local, blkno, page);
 
 	Assert(!found);
+
+#ifdef TIDSTORE_DEBUG
+	if (!TidStoreIsShared(ts))
+	{
+		/* Insert TIDs into the TID array too */
+		ts_debug_set_block_offsets(ts, blkno, offsets, num_offsets);
+	}
+#endif
 }
 
 /* Return true if the given TID is present in the TidStore */
@@ -310,6 +361,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
 	OffsetNumber off = ItemPointerGetOffsetNumber(tid);
 	bool		ret;
 
+#ifdef TIDSTORE_DEBUG
+	bool ret_debug = false;
+
+	if (!TidStoreIsShared(ts))
+		ret_debug = ts_debug_is_member(ts, tid);
+#endif
+
 	if (TidStoreIsShared(ts))
 		page = shared_rt_find(ts->tree.shared, blk);
 	else
@@ -317,17 +375,29 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
 
 	/* no entry for the blk */
 	if (page == NULL)
-		return false;
+	{
+		ret = false;
+		goto done;
+	}
 
 	wordnum = WORDNUM(off);
 	bitnum = BITNUM(off);
 
 	/* no bitmap for the off */
 	if (wordnum >= page->nwords)
-		return false;
+	{
+		ret = false;
+		goto done;
+	}
 
 	ret = (page->words[wordnum] & 

Re: Vectored I/O in bulk_write.c

2024-03-13 Thread Heikki Linnakangas

On 13/03/2024 12:18, Thomas Munro wrote:

On Wed, Mar 13, 2024 at 9:57 PM Heikki Linnakangas  wrote:

Here also is a first attempt at improving the memory allocation and
memory layout.
...
+typedef union BufferSlot
+{
+ PGIOAlignedBlock buffer;
+ dlist_node  freelist_node;
+}BufferSlot;
+


If you allocated the buffers in one large contiguous chunk, you could
often do one large write() instead of a gathered writev() of multiple
blocks. That should be even better, although I don't know how much of a
difference it makes. The above layout wastes a fair amount of memory too,
because 'buffer' is I/O aligned.


The patch I posted has an array of buffers with the properties you
describe, so you get a pwrite() (no 'v') sometimes, and a pwritev()
with a small iovcnt when it wraps around:


Oh I missed that it was "union BufferSlot". I thought it was a struct. 
Never mind then.
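
For readers skimming the thread, the idea behind the union is simply that a
free slot's buffer bytes are unused and can double as the freelist link, so
the freelist costs no extra space. A sketch (names invented, not the patch):

#include "postgres.h"
#include "lib/ilist.h"

/* While a slot is free its buffer memory holds the freelist link; while it
 * is in use it holds one I/O-aligned block of data. */
typedef union BufferSlotSketch
{
    PGIOAlignedBlock buffer;        /* in use: one block's worth of data */
    dlist_node  freelist_node;      /* free: link in the free-slot list */
} BufferSlotSketch;

Since PGIOAlignedBlock is already BLCKSZ bytes with I/O alignment, the union
stays exactly one block in size, which matches the sizeof of 8192 shown
later in the thread.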


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: Vectored I/O in bulk_write.c

2024-03-13 Thread Thomas Munro
On Wed, Mar 13, 2024 at 9:57 PM Heikki Linnakangas  wrote:
> Let's bite the bullet and merge the smgrwrite and smgrextend functions
> at the smgr level too. I propose the following signature:
>
> #define SWF_SKIP_FSYNC  0x01
> #define SWF_EXTEND  0x02
> #define SWF_ZERO0x04
>
> void smgrwritev(SMgrRelation reln, ForkNumber forknum,
> BlockNumber blocknum,
> const void **buffer, BlockNumber nblocks,
> int flags);
>
> This would replace smgwrite, smgrextend, and smgrzeroextend. The

That sounds pretty good to me.

> > Here also is a first attempt at improving the memory allocation and
> > memory layout.
> > ...
> > +typedef union BufferSlot
> > +{
> > + PGIOAlignedBlock buffer;
> > + dlist_node  freelist_node;
> > +}BufferSlot;
> > +
>
> If you allocated the buffers in one large contiguous chunk, you could
> often do one large write() instead of a gathered writev() of multiple
> blocks. That should be even better, although I don't know how much of a
> difference it makes. The above layout wastes a fair amount of memory too,
> because 'buffer' is I/O aligned.

The patch I posted has an array of buffers with the properties you
describe, so you get a pwrite() (no 'v') sometimes, and a pwritev()
with a small iovcnt when it wraps around:

pwrite(...) = 131072 (0x2)
pwritev(...,3,...) = 131072 (0x2)
pwrite(...) = 131072 (0x2)
pwritev(...,3,...) = 131072 (0x2)
pwrite(...) = 131072 (0x2)

Hmm, I expected pwrite() alternating with pwritev(iovcnt=2), the
latter for when it wraps around the buffer array, so I'm not sure why it's
3.  I guess the btree code isn't writing them strictly monotonically or
something...
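
To illustrate why a wrapped run needs at most two iovecs (a sketch only,
with invented names and a hard-coded block size, not the patch's code):

#include <stddef.h>
#include <sys/uio.h>

#define RING_SIZE   16
#define BLOCK_SIZE  8192

/* Fill iov[] for 'nblocks' queued blocks starting at ring slot 'start';
 * returns the iovec count: 1 when the run is contiguous in memory, 2 when
 * it wraps past the end of the ring of buffers. */
static int
build_ring_iov(struct iovec *iov, char slots[RING_SIZE][BLOCK_SIZE],
               int start, int nblocks)
{
    int     first = (nblocks < RING_SIZE - start) ? nblocks : RING_SIZE - start;
    int     cnt = 0;

    iov[cnt].iov_base = slots[start];
    iov[cnt].iov_len = (size_t) first * BLOCK_SIZE;
    cnt++;

    if (first < nblocks)
    {
        /* The remainder wrapped around to the start of the ring. */
        iov[cnt].iov_base = slots[0];
        iov[cnt].iov_len = (size_t) (nblocks - first) * BLOCK_SIZE;
        cnt++;
    }
    return cnt;
}

A caller would then issue pwrite() when the count is 1 and pwritev() when it
is 2; an iovcnt of 3 therefore suggests the queued blocks were not one
contiguous run of slots.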

I don't believe it wastes any memory on padding (except a few bytes
wasted by palloc_aligned() before BulkWriteState):

(gdb) p &bulkstate->buffer_slots[0]
$4 = (BufferSlot *) 0x15c731cb4000
(gdb) p &bulkstate->buffer_slots[1]
$5 = (BufferSlot *) 0x15c731cb6000
(gdb) p sizeof(bulkstate->buffer_slots[0])
$6 = 8192




Re: [EXTERNAL] Re: Add non-blocking version of PQcancel

2024-03-13 Thread Jelte Fennema-Nio
On Wed, 13 Mar 2024 at 04:53, Tom Lane  wrote:
> I suspect it's basically just a
> timing dependency.  Have you thought about the fact that a cancel
> request is a no-op if it arrives after the query's done?

I agree it's probably a timing issue. The cancel being received after
the query is done seems very unlikely, since the query takes 180
seconds (assuming PG_TEST_TIMEOUT_DEFAULT is not lowered for these
animals). I think it's more likely that the cancel request arrives too
early and is thus ignored because no query is running yet. The
test already had logic to wait until the query backend was in the
"active" state before sending a cancel, to solve that issue. But my
guess is that that somehow isn't enough.

Sadly I'm having a hard time reliably reproducing this race condition
locally. So it's hard to be sure what is happening here. Attached is a
patch with a wild guess as to what the issue might be (i.e. seeing an
outdated "active" state and thus passing the check even though the
query is not running yet).


v37-0001-Hopefully-make-cancel-test-more-reliable.patch
Description: Binary data


v37-0002-Start-using-new-libpq-cancel-APIs.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-13 Thread shveta malik
On Fri, Mar 8, 2024 at 10:42 PM Bharath Rupireddy
 wrote:
>
> On Wed, Mar 6, 2024 at 4:49 PM Amit Kapila  wrote:
> >
> > You might want to consider its interaction with sync slots on standby.
> > Say, there is no activity on slots in terms of processing the changes
> > for slots. Now, we won't perform sync of such slots on standby showing
> > them inactive as per your new criteria where as same slots could still
> > be valid on primary as the walsender is still active. This may be more
> > of a theoretical point as in running system there will probably be
> > some activity but I think this needs some thougths.
>
> I believe the xmin and catalog_xmin of the sync slots on the standby
> keep advancing depending on the slots on the primary, no? If yes, the
> XID age based invalidation shouldn't be a problem.

If the user has not enabled the slot-sync worker and is relying on the SQL
function pg_sync_replication_slots(), then the xmin and catalog_xmin
of synced slots may not keep advancing. These will be advanced only
on the next run of the function. But meanwhile the synced slots may be
invalidated due to 'xid_aged'.  Then the next time the user runs
pg_sync_replication_slots() again, the invalidated slots will be
dropped and recreated by this SQL function (provided they are
valid on the primary and are invalidated on the standby alone).  I am not
stating that it is a problem, but we need to think about whether this is
what we want. Secondly, the behaviour is not the same as with
'inactive_timeout' invalidation. Synced slots are immune to
'inactive_timeout' invalidation, as this invalidation happens only in the
walsender, while they are not immune to 'xid_aged' invalidation. So again,
this needs some thought.

> I believe there are no walsenders started for the sync slots on the
> standbys, right? If yes, the inactive timeout based invalidation also
> shouldn't be a problem. Because, the inactive timeouts for a slot are
> tracked only for walsenders because they are the ones that typically
> hold replication slots for longer durations and for real replication
> use. We did a similar thing in a recent commit [1].
>
> Is my understanding right? Do you still see any problems with it?

I have explained the situation above for us to think over it better.

thanks
Shveta




Re: MERGE ... RETURNING

2024-03-13 Thread Dean Rasheed
On Wed, 13 Mar 2024 at 06:44, jian he  wrote:
>
> 
> [ WITH with_query [, ...] ]
> MERGE INTO [ ONLY ] 
> here the "WITH" part should have "[ RECURSIVE ]"

Actually, no. MERGE doesn't support WITH RECURSIVE.

It's not entirely clear to me why though. I did a quick test, removing
that restriction in the parse analysis code, and it seemed to work
fine. Alvaro, do you remember why that restriction is there?

It's probably worth noting it in the docs, since it's different from
INSERT, UPDATE and DELETE. I think this would suffice:

   
with_query

 
  The WITH clause allows you to specify one or more
  subqueries that can be referenced by name in the MERGE
  query. See  and 
  for details.  Note that WITH RECURSIVE is not supported
  by MERGE.
 

   

And then maybe we can remove that restriction in HEAD, if there really
isn't any need for it anymore.

I also noticed that the "UPDATE SET ..." syntax in the synopsis is
missing a couple of options that are supported -- the optional "ROW"
keyword in the multi-column assignment syntax, and the syntax to
assign from a subquery that returns multiple columns. So this should
be updated to match update.sgml:

UPDATE SET { column_name = { expression | DEFAULT } |
             ( column_name [, ...] ) = [ ROW ] ( { expression | DEFAULT } [, ...] ) |
             ( column_name [, ...] ) = ( sub-SELECT )
           } [, ...]

and then in the parameter section:

   
sub-SELECT

 
  A SELECT sub-query that produces as many output columns
  as are listed in the parenthesized column list preceding it.  The
  sub-query must yield no more than one row when executed.  If it
  yields one row, its column values are assigned to the target columns;
  if it yields no rows, NULL values are assigned to the target columns.
  The sub-query can refer to values from the original row in the
  target table, and values from the data_source.
 

   

(basically copied verbatim from update.sgml)

I think I'll go make those doc changes, and back-patch them
separately, since they're not related to this patch.

Regards,
Dean




Re: Vectored I/O in bulk_write.c

2024-03-13 Thread Heikki Linnakangas

(Replying to all your messages in this thread together)


This made me wonder why smgrwritev() and smgrextendv() shouldn't be
backed by the same implementation, since they are essentially the same
operation.


+1 to the general idea of merging the write and extend functions.


The differences are some assertions which might as well be
moved up to the smgr.c level as they must surely apply to any future
smgr implementation too, right?, and the segment file creation policy
which can be controlled with a new argument.  I tried that here.


Currently, smgrwrite/smgrextend just calls through to mdwrite/mdextend. 
I'd like to keep it that way. Otherwise we need to guess what a 
hypothetical smgr implementation might look like.


For example this assertion:


/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
Assert(blocknum >= mdnblocks(reln, forknum));
#endif


I think that should continue to be md.c's internal affair. For example, 
imagine that you had a syscall like write() but which always writes to 
the end of the file and also returns the offset that the data was 
written to. You would want to assert the returned offset instead of the 
above.


So -1 on moving up the assertions to smgr.c.

Let's bite the bullet and merge the smgrwrite and smgrextend functions 
at the smgr level too. I propose the following signature:


#define SWF_SKIP_FSYNC  0x01
#define SWF_EXTEND  0x02
#define SWF_ZERO0x04

void smgrwritev(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,
const void **buffer, BlockNumber nblocks,
int flags);

This would replace smgwrite, smgrextend, and smgrzeroextend. The 
mdwritev() function would have the same signature. A single 'flags' arg 
looks better in the callers than booleans, because you don't need to 
remember what 'true' or 'false' means. The callers would look like this:


/* like old smgrwrite(reln, MAIN_FORKNUM, 123, buf, false) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1, 0);

/* like old smgrwrite(reln, MAIN_FORKNUM, 123, buf, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1, SWF_SKIP_FSYNC);

/* like old smgrextend(reln, MAIN_FORKNUM, 123, buf, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, buf, 1,
   SWF_EXTEND | SWF_SKIP_FSYNC);

/* like old smgrzeroextend(reln, MAIN_FORKNUM, 123, 10, true) */
smgrwritev(reln, MAIN_FORKNUM, 123, NULL, 10,
   SWF_EXTEND | SWF_ZERO | SWF_SKIP_FSYNC);


While thinking about that I realised that an existing write-or-extend
assertion in master is wrong because it doesn't add nblocks.

Hmm, it's a bit weird that we have nblocks as int or BlockNumber in
various places, which I think should probably be fixed.


+1


Here also is a first attempt at improving the memory allocation and
memory layout.
...
+typedef union BufferSlot
+{
+   PGIOAlignedBlock buffer;
+   dlist_node  freelist_node;
+}  BufferSlot;
+


If you allocated the buffers in one large contiguous chunk, you could 
often do one large write() instead of a gathered writev() of multiple 
blocks. That should be even better, although I don't know how much of a
difference it makes. The above layout wastes a fair amount of memory too,
because 'buffer' is I/O aligned.



I wonder if bulk logging should trigger larger WAL writes in the "Have
to write it ourselves" case in AdvanceXLInsertBuffer(), since writing
8kB of WAL at a time seems like an unnecessarily degraded level of
performance, especially with wal_sync_method=open_datasync.  Of course
the real answer is "make sure wal_buffers is high enough for your
workload" (usually indirectly by automatically scaling from
shared_buffers), but this problem jumps out when tracing bulk_writes.c
with default settings.  We write out the index 128kB at a time, but
the WAL 8kB at a time.


Makes sense.

On 12/03/2024 23:38, Thomas Munro wrote:

One more observation while I'm thinking about bulk_write.c... hmm, it
writes the data out and asks the checkpointer to fsync it, but doesn't
call smgrwriteback().  I assume that means that on Linux the physical
writeback sometimes won't happen until the checkpointer eventually
calls fsync() sequentially, one segment file at a time.  I see that
it's difficult to decide how to do that though; unlike checkpoints,
which have rate control/spreading, bulk writes could more easily flood
the I/O subsystem in a burst.  I expect most non-Linux systems'
write-behind heuristics to fire up for bulk sequential writes, but on
Linux where most users live, there is no write-behind heuristic AFAIK
(on the most common file systems anyway), so you have to crank that
handle if you want it to wake up and smell the coffee before it hits
internal limits, but then you have to decide how fast to crank it.

This problem will come into closer focus when we start talking about
streaming writes.  For the current non-streaming bulk_write.c coding,
I don't have 

Re: session username in default psql prompt?

2024-03-13 Thread Andrew Dunstan


On 2024-02-27 Tu 19:19, Kori Lane wrote:

Here’s a patch for this.




Reposting as the archive mail processor doesn't seem to like the Apple 
mail attachment.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index cc7d797159..7e82355425 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
testdb=> \set PROMPT1 '%[%033[1;33;40m%]%n@%/%R%[%033[0m%]%# '
 
 To insert a percent sign into your prompt, write
 %%. The default prompts are
-'%/%R%x%# ' for prompts 1 and 2, and
+'%n@%/%R%x%# ' for prompts 1 and 2, and
 ' ' for prompt 3.
 
 
diff --git a/src/bin/psql/settings.h b/src/bin/psql/settings.h
index 505f99d8e4..8477f9 100644
--- a/src/bin/psql/settings.h
+++ b/src/bin/psql/settings.h
@@ -23,8 +23,8 @@
 #define DEFAULT_EDITOR_LINENUMBER_ARG "+"
 #endif
 
-#define DEFAULT_PROMPT1 "%/%R%x%# "
-#define DEFAULT_PROMPT2 "%/%R%x%# "
+#define DEFAULT_PROMPT1 "%n@%/%R%x%# "
+#define DEFAULT_PROMPT2 "%n@%/%R%x%# "
 #define DEFAULT_PROMPT3 ">> "
 
 /*


Re: Transaction timeout

2024-03-13 Thread Alexander Korotkov
On Wed, Mar 13, 2024 at 7:56 AM Andrey M. Borodin  wrote:
> > On 13 Mar 2024, at 05:23, Alexander Korotkov  wrote:
> >
> > On Tue, Mar 12, 2024 at 10:28 AM Andrey M. Borodin  
> > wrote:
> >>> On 11 Mar 2024, at 16:18, Alexander Korotkov  wrote:
> >>>
> >>> I think if checking psql stderr is problematic, checking just logs is
> >>> fine.  Could we wait for the relevant log messages one by one with
> >>> $node->wait_for_log() just like 040_standby_failover_slots_sync.pl do?
> >>
> >> PFA version with $node->wait_for_log()
> >
> > I've slightly revised the patch.  I'm going to push it if no objections.
>
> One small note: log_offset was updated, but was not used.

Thank you. This is the updated version.

--
Regards,
Alexander Korotkov


v7-0001-Add-TAP-tests-for-timeouts.patch
Description: Binary data

