Re: postgres_fdw test timeouts

2024-03-08 Thread Alexander Lakhin

Hello Nathan,

08.03.2024 01:00, Nathan Bossart wrote:

On Sun, Dec 10, 2023 at 12:00:01PM +0300, Alexander Lakhin wrote:

So I would not say that it's a dominant failure for now, and given that
04a09ee lives in master only, maybe we can save two commits (revert +
revert of revert) while moving to a more permanent solution.

I just checked in on this one to see whether we needed to create an "open
item" for v17.  While I'm not seeing the failures anymore, the patch that
Alexander claimed should fix it [0] doesn't appear to have been committed,
either.  Perhaps this was fixed by something else...

[0] 
https://postgr.es/m/CA%2BhUKGL0bikWSC2XW-zUgFWNVEpD_gEWXndi2PE5tWqmApkpZQ%40mail.gmail.com


I have re-run the tests and found out that the issue was fixed by
d3c5f37dd. It changed the body of the loop "while (PQisBusy(conn))",
formerly contained in pgfdw_get_result(), as follows:
            /* Data available in socket? */
            if (wc & WL_SOCKET_READABLE)
            {
                if (!PQconsumeInput(conn))
                    pgfdw_report_error(ERROR, NULL, conn, false, query);
            }
->
    /* Consume whatever data is available from the socket */
    if (PQconsumeInput(conn) == 0)
    {
        /* trouble; expect PQgetResult() to return NULL */
        break;
    }

That is, the unconditional call of PQconsumeInput() eliminates the test
timeout.

Best regards,
Alexander




Re: Add comment to specify timeout unit in ConditionVariableTimedSleep()

2024-03-08 Thread Michael Paquier
On Tue, Mar 05, 2024 at 03:20:48PM +0900, Michael Paquier wrote:
> That sounds like a good idea to me, so I'm OK with your suggestion.

Applied this one as f160bf06f72a.  Thanks.
--
Michael




Re: Exposing the lock manager's WaitForLockers() to SQL

2024-03-08 Thread Will Mortensen
Rebased and fixed conflicts.

FWIW re: Andrey's comment in his excellent CF summary email[0]: we're
currently using vanilla Postgres (via Gentoo) on single nodes, and not
anything fancy like Citus. The Citus relationship is just that we were
inspired by Marco's blog post there. We have a variety of clients
written in different languages that generally don't coordinate their
table modifications, and Marco's scheme merely requires them to use
sequences idiomatically, which we can just about manage. :-)

This feature is then a performance optimization to support this scheme
while avoiding the case where one writer holding a RowExclusiveLock
blocks the reader from taking a ShareLock which in turn prevents other
writers from taking a RowExclusiveLock for a long time. Instead, the
reader can wait for the first writer without taking any locks or
blocking later writers. I've illustrated this difference in the
isolation tests.
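To make the difference concrete, here is a sketch of the two approaches (the
pg_wait_for_lockers call below is only my shorthand based on the patch title;
see the patches for the real interface):

```sql
-- Today: waiting for in-flight writers requires taking a lock that
-- itself blocks new writers for the duration of the wait.
BEGIN;
LOCK TABLE events IN SHARE MODE;   -- waits for RowExclusiveLock holders,
                                   -- but also blocks new INSERTs meanwhile
COMMIT;

-- Proposed: wait for the current lockers without acquiring any table
-- lock, so later writers are never blocked (hypothetical invocation).
SELECT pg_wait_for_lockers('{events}'::regclass[], 'RowExclusiveLock');
```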

Still hoping we can get this into 17. :-)

[0] 
https://www.postgresql.org/message-id/C8D65462-0888-4484-A72C-C99A94381ECD%40yandex-team.ru


v10-0003-Add-pg_wait_for_lockers-function.patch
Description: Binary data


v10-0001-Refactor-GetLockConflicts-into-more-general-GetL.patch
Description: Binary data


v10-0002-Allow-specifying-single-lockmode-in-WaitForLocke.patch
Description: Binary data


Re: Remove unnecessary code from psql's watch command

2024-03-08 Thread Yugo NAGATA
On Fri, 08 Mar 2024 12:09:12 -0500
Tom Lane  wrote:

> Yugo NAGATA  writes:
> > On Wed, 06 Mar 2024 13:03:39 -0500
> > Tom Lane  wrote:
> >> I don't have Windows here to test on, but does the WIN32 code
> >> path work at all?
> 
> > The outer loop is eventually exited anyway because PSQLexecWatch returns 0
> > when cancel_pressed = 0. However, that happens only after executing an extra
> > query in this function, not right after exiting the inner loop. Therefore,
> > it would be better to add the setting and checking of "done" in the WIN32 path, too.
> 
> Ah, I see now.  Agreed, that could stand improvement.
> 
> > I've attached the updated patch 
> > (v2_remove_unnecessary_code_in_psql_watch.patch).
> 
> Pushed with minor tidying.

Thanks!

-- 
Yugo NAGATA 




Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value

2024-03-08 Thread Yugo NAGATA
On Fri, 8 Mar 2024 16:17:58 -0600
Nathan Bossart  wrote:

> On Fri, Mar 08, 2024 at 03:31:55PM +0900, Yugo NAGATA wrote:
> > On Thu, 7 Mar 2024 16:56:17 -0600
> > Nathan Bossart  wrote:
> >> to me.  Do you think it's worth adding something like a
> >> pg_column_toast_num_chunks() function that returns the number of chunks for
> >> the TOASTed value, too?
> > 
> > If we want to know the number of chunks of a specified chunk_id,
> > we can get this by the following query.
> > 
> > postgres=# SELECT id, (SELECT count(*) FROM pg_toast.pg_toast_16384 WHERE 
> > chunk_id = id) 
> >   FROM (SELECT pg_column_toast_chunk_id(v) AS id FROM t);
> 
> Good point.  Overall, I think this patch is in decent shape, so I'll aim to
> commit it sometime next week.

Thank you.

> 
> > +{ oid => '8393', descr => 'chunk ID of on-disk TOASTed value',
> > +  proname => 'pg_column_toast_chunk_id', provolatile => 's', prorettype => 
> > 'oid',
> > +  proargtypes => 'any', prosrc => 'pg_column_toast_chunk_id' },
> 
> Note to self: this change requires a catversion bump.
> 
> > +INSERT INTO test_chunk_id(v1,v2)
> > +  VALUES (repeat('x', 1), repeat('x', 2048));
> 
> Is this guaranteed to be TOASTed for all possible page sizes?

Should we use block_size?

 SHOW block_size \gset
 INSERT INTO test_chunk_id(v1,v2)
  VALUES (repeat('x', 1), repeat('x', (:block_size / 4)));

I think this will work with various page sizes.
I've attached a patch in which the test is updated.
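
For reference, the intended usage looks like this (column names follow the
test table above; NULL means the value is un-TOASTed or not on disk):

```sql
SELECT pg_column_toast_chunk_id(v1) AS v1_chunk,
       pg_column_toast_chunk_id(v2) AS v2_chunk
FROM test_chunk_id;
```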

Regards,
Yugo Nagata

> 
> -- 
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com


-- 
Yugo NAGATA 
>From c3b99418d1ae3a8235db9bedd95e7d6ac1556f90 Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Wed, 29 Mar 2023 09:59:25 +0900
Subject: [PATCH v8] Add pg_column_toast_chunk_id function

This function returns the chunk_id of an on-disk TOASTed value, or
NULL if the value is un-TOASTed or not on disk. This enables users to
know which columns are actually TOASTed. This function is also useful
to identify a problematic row when an error like
  "ERROR:  unexpected chunk number ... (expected ...) for toast value"
occurs.
---
 doc/src/sgml/func.sgml   | 17 
 src/backend/utils/adt/varlena.c  | 41 
 src/include/catalog/pg_proc.dat  |  3 ++
 src/test/regress/expected/misc_functions.out | 27 +
 src/test/regress/sql/misc_functions.sql  | 24 
 5 files changed, 112 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0bb7aeb40e..3169e2aeb7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28552,6 +28552,23 @@ postgres=# SELECT '0/0'::pg_lsn + pd.segment_number * ps.setting::int + :offset

   
 
+  
+   
+
+ pg_column_toast_chunk_id
+
+pg_column_toast_chunk_id ( "any" )
+oid
+   
+   
+Shows the chunk_id of an on-disk
+TOASTed value. Returns NULL
+if the value is un-TOASTed or not on-disk.
+See  for details about
+TOAST.
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 543afb66e5..84d36781a4 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -5105,6 +5105,47 @@ pg_column_compression(PG_FUNCTION_ARGS)
 	PG_RETURN_TEXT_P(cstring_to_text(result));
 }
 
+/*
+ * Return the chunk_id of the on-disk TOASTed value.
+ * Return NULL if the value is unTOASTed or not on disk.
+ */
+Datum
+pg_column_toast_chunk_id(PG_FUNCTION_ARGS)
+{
+	int			typlen;
+	struct varlena *attr;
+	struct varatt_external toast_pointer;
+
+	/* On first call, get the input type's typlen, and save at *fn_extra */
+	if (fcinfo->flinfo->fn_extra == NULL)
+	{
+		/* Lookup the datatype of the supplied argument */
+		Oid			argtypeid = get_fn_expr_argtype(fcinfo->flinfo, 0);
+
+		typlen = get_typlen(argtypeid);
+		if (typlen == 0)		/* should not happen */
+			elog(ERROR, "cache lookup failed for type %u", argtypeid);
+
+		fcinfo->flinfo->fn_extra = MemoryContextAlloc(fcinfo->flinfo->fn_mcxt,
+	  sizeof(int));
+		*((int *) fcinfo->flinfo->fn_extra) = typlen;
+	}
+	else
+		typlen = *((int *) fcinfo->flinfo->fn_extra);
+
+	if (typlen != -1)
+		PG_RETURN_NULL();
+
+	attr = (struct varlena *) DatumGetPointer(PG_GETARG_DATUM(0));
+
+	if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+		PG_RETURN_NULL();
+
+	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
+
+	PG_RETURN_OID(toast_pointer.va_valueid);
+}
+
 /*
  * string_agg - Concatenates values and returns string.
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 291ed876fc..443ca854a6 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -7447,6 +7447,9 @@
 { oid => '2121', descr => 'compression method for the compressed datum',
   proname => 'pg_column_compression', provolatile => 's', prorettype => 'text',
   proargtypes => 'any', 

Re: Patch: Add parse_type Function

2024-03-08 Thread David E. Wheeler
On Mar 7, 2024, at 23:39, Erik Wienhold  wrote:

> I think you need to swap the examples.  The text mentions the error case
> first and the NULL case second.

🤦‍♂️

Thanks, fixed in the attached patch.

David



v11-0001-Add-to_regtypemod-SQL-function.patch
Description: Binary data


Re: Emitting JSON to file using COPY TO

2024-03-08 Thread jian he
On Sat, Mar 9, 2024 at 2:03 AM Joe Conway  wrote:
>
> On 3/8/24 12:28, Andrey M. Borodin wrote:
> > Hello everyone!
> >
> > Thanks for working on this, really nice feature!
> >
> >> On 9 Jan 2024, at 01:40, Joe Conway  wrote:
> >>
> >> Thanks -- will have a look
> >
> > Joe, recently folks proposed a lot of patches in this thread that seem like 
> > diverted from original way of implementation.
> > As an author of CF entry [0] can you please comment on which patch version 
> > needs review?
>
>
> I don't know if I agree with the proposed changes, but I have also been
> waiting to see how the parallel discussion regarding COPY extensibility
> shakes out.
>
> And there were a couple of issues found that need to be tracked down.
>
> Additionally I have had time/availability challenges recently.
>
> Overall, chances seem slim that this will make it into 17, but I have
> not quite given up hope yet either.

Hi.
Summary of the changes I've made in the v9 patches at [0]:

meta: rebased. You now need to apply the patches with `git apply` or
`git am`; previously (copyto_json.007.diff), GNU patch was needed.


At [1], Dean Rasheed found some corner cases where the returned slot's
tts_tupleDescriptor
from
`
ExecutorRun(cstate->queryDesc, ForwardScanDirection, 0, true);
processed = ((DR_copy *) cstate->queryDesc->dest)->processed;
`
cannot be used for composite_to_json.
Generally, DestReceiver->rStartup is what sends the TupleDesc to the
DestReceiver. The COPY TO DestReceiver's rStartup function is
copy_dest_startup; however, copy_dest_startup is a no-op.
That means that to make the final TupleDesc of a COPY TO (FORMAT JSON)
operation bulletproof, we need to copy the tupDesc from CopyToState's
queryDesc.
This only applies when the COPY TO source is a query (for example,
COPY (SELECT 1) TO STDOUT), not a table.
The above is my interpretation.


At [2], Masahiko Sawada made several points, mainly to split the patch
in two: one for FORMAT json, and a second for the force_array option.
Splitting it in two makes it easier to review, I think.
My changes also address all the points Masahiko Sawada mentioned.



[0] 
https://postgr.es/m/cacjufxhd6zrmjjbsdogpovavaekms-u6aorcw0ja-wyi-0k...@mail.gmail.com
[1] 
https://postgr.es/m/CAEZATCWh29787xf=4ngkoixeqrhrqi0qd33z6_-f8t2dz0y...@mail.gmail.com
[2] 
https://postgr.es/m/cad21aocb02zhzm3vxb8hsw8fwosl+irdefb--kmunv8pjpa...@mail.gmail.com




Re: sslinfo extension - add notbefore and notafter timestamps

2024-03-08 Thread Cary Huang
Hello

Thank you for the review and your patch. I have tested with the minimum
supported OpenSSL version, 1.0.2, and incorporated your changes into the
v9 patch as attached.

 > In my -08 timezone, the date doesn't match what's recorded either
 > (it's my "tomorrow"). I think those probably just need to be converted
 > to UTC explicitly? I've attached a sample diff on top of v8 that
 > passes tests on my machine.

Yes, I noticed this in the SSL test too. I am also in GMT-8, so for me the 
tests would fail too due to the time zone difference from GMT. It should be 
okay to specifically set the outputs of pg_stat_ssl, ssl_client_get_notbefore, 
and ssl_client_get_notafter to the GMT time zone. The notBefore and notAfter 
timestamps in a client certificate are generally expressed in GMT.


Thank you!

Cary Huang
-
HighGo Software Inc. (Canada)
cary.hu...@highgo.ca
www.highgo.ca


v9-0001-Add-notBefore-and-notAfter-to-SSL-cert-info-displ.patch
Description: Binary data


Re: POC, WIP: OR-clause support for indexes

2024-03-08 Thread jian he
+ if (!IsA(lfirst(lc), Invalid))
+ {
+ or_list = lappend(or_list, lfirst(lc));
+ continue;
+ }
Currently `IsA(lfirst(lc), Invalid)` works,
but is this generally OK? I didn't find any other examples.
Do you need to cast, like `(Node *) lfirst(lc)`?

If I understand the logic correctly:
in `foreach(lc, args)`, if everything goes well, it will reach
`hashkey.type = T_Invalid;`,
which makes `IsA(lfirst(lc), Invalid)` true.
Adding some comments to the above code would be great.
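
For comparison, the idiom I had in mind casts the cell first; roughly
(untested sketch):

```c
Node   *node = (Node *) lfirst(lc);

if (!IsA(node, Invalid))
{
    or_list = lappend(or_list, node);
    continue;
}
```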




Re: pipe_read_line for reading arbitrary strings

2024-03-08 Thread Daniel Gustafsson
> On 8 Mar 2024, at 19:38, Daniel Gustafsson  wrote:
> 
>> On 8 Mar 2024, at 18:13, Peter Eisentraut  wrote:
> 
>>> Good catch, that's an incorrect copy/paste, it should use ERRCODE_NO_DATA.
>> 
>> Also it shouldn't print %m, was my point.
> 
> Absolutely, I removed that in the patch upthread, it was clearly wrong.

Pushed the fix for the incorrect logline and errcode.

--
Daniel Gustafsson





Re: Call perror on popen failure

2024-03-08 Thread Daniel Gustafsson
> On 8 Mar 2024, at 18:08, Peter Eisentraut  wrote:
> 
> On 08.03.24 11:05, Daniel Gustafsson wrote:
>> If popen fails in pipe_read_line we invoke perror for the error message, and
>> pipe_read_line is in turn called by find_other_exec which is used in both
>> frontend and backend code.  This is an old codepath, and it seems like it 
>> ought
>> to be replaced with the more common logging tools we now have like in the
>> attached diff (which also makes the error translated as opposed to now).  Any
>> objections to removing this last perror() call?
> 
> This change makes it consistent with other popen() calls, so it makes sense 
> to me.

Thanks for review, committed that way.

--
Daniel Gustafsson





Re: Make query cancellation keys longer

2024-03-08 Thread Heikki Linnakangas

On 01/03/2024 07:19, Jelte Fennema-Nio wrote:

I think this behaviour makes more sense as a minor protocol version
bump instead of a parameter.
Ok, the consensus is to bump the minor protocol version for this. Works 
for me.



+ if (strcmp(conn->workBuffer.data, "_pq_.extended_query_cancel") == 0)
+ {
+ /* that's ok */
+ continue;
+ }

Please see this thread[1], which in its first few patches makes
pqGetNegotiateProtocolVersion3 actually usable for extending the
protocol. I started that because every protocol change that's
proposed on the list includes similar changes to
pqGetNegotiateProtocolVersion3, and I think we shouldn't hack these
changes ad hoc into the current code, but do them once
in a way that makes sense for all protocol changes.


Thanks, I will take a look and respond on that thread.


Documentation is still missing


I think at least protocol message type documentation would be very
helpful in reviewing, because that is really a core part of this
change.


Added some documentation. There's more work to be done there, but at 
least the message type descriptions are now up-to-date.



Based on the current code I think it should have a few changes:

+ int cancel_key_len = 5 + msgLength - (conn->inCursor - conn->inStart);
+
+ conn->be_cancel_key = malloc(cancel_key_len);
+ if (conn->be_cancel_key == NULL)

This is using the message length to determine the length of the cancel
key in BackendKey. That is not something we generally do in the
protocol. It's even documented: "Notice that although each message
includes a byte count at the beginning, the message format is defined
so that the message end can be found without reference to the byte
count." So I think the patch should be changed to include the length
of the cancel key explicitly in the message.


The nice thing about relying on the message length was that we could 
just redefine the CancelRequest message to have a variable-length 
secret, and an old CancelRequest with a 4-byte secret would be compatible 
with the new definition too. But it doesn't matter much, so I added an 
explicit length field.


FWIW I don't think that restriction makes sense. Any code that parses 
the messages needs to have the message length available, and I don't 
think it helps with sanity checking that much. I think the documentation 
is a little anachronistic. The real reason that all the message types 
include enough information to find the message end is that in protocol 
version 2, there was no message length field. The only exception that 
doesn't have that property is CopyData, and it's no coincidence that it 
was added in protocol version 3.
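
As a toy illustration of explicit-length framing (standalone C; the field
layout here is made up for the example and is not the patch's actual wire
format):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write a 32-bit big-endian length followed by the key bytes, so the
 * field carries its own length instead of relying on the message length. */
static size_t
put_cancel_key(uint8_t *buf, const uint8_t *key, uint32_t keylen)
{
	buf[0] = (uint8_t) (keylen >> 24);
	buf[1] = (uint8_t) (keylen >> 16);
	buf[2] = (uint8_t) (keylen >> 8);
	buf[3] = (uint8_t) keylen;
	memcpy(buf + 4, key, keylen);
	return 4 + keylen;
}

/* Parse the field back without consulting the outer message length. */
static uint32_t
get_cancel_key(const uint8_t *buf, uint8_t *key_out)
{
	uint32_t	keylen = ((uint32_t) buf[0] << 24) | ((uint32_t) buf[1] << 16) |
		((uint32_t) buf[2] << 8) | (uint32_t) buf[3];

	memcpy(key_out, buf + 4, keylen);
	return keylen;
}

/* Round-trip a 32-byte key; returns 1 on success. */
int
cancel_key_roundtrip_ok(void)
{
	uint8_t		key[32], buf[64], out[64];
	uint32_t	i, outlen;

	for (i = 0; i < 32; i++)
		key[i] = (uint8_t) i;
	put_cancel_key(buf, key, 32);
	outlen = get_cancel_key(buf, out);
	return outlen == 32 && memcmp(key, out, 32) == 0;
}
```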


On 03/03/2024 16:27, Jelte Fennema-Nio wrote:

On Fri, 1 Mar 2024 at 06:19, Jelte Fennema-Nio  wrote:

This is a preliminary review. I'll look at this more closely soon.


Some more thoughts after looking some more at the proposed changes

+#define EXTENDED_CANCEL_REQUEST_CODE PG_PROTOCOL(1234,5677)

nit: I think the code should be 1234,5679 because it's the newer
message type, so to me it makes more sense if the code is a larger
number.


Unfortunately 1234,5679 is already in use by NEGOTIATE_SSL_CODE. That's why 
I went the other direction. If we want to add this to "the end", it 
needs to be 1234,5681. But I wanted to keep the cancel requests together.



+ * FIXME: we used to use signal_child. I believe kill() is
+ * maybe even more correct, but verify that.

signal_child seems to be the correct one, not kill. signal_child has
this relevant comment explaining why it's better than plain kill:

  * On systems that have setsid(), each child process sets itself up as a
  * process group leader.  For signals that are generally interpreted in the
  * appropriate fashion, we signal the entire process group not just the
  * direct child process.  This allows us to, for example, SIGQUIT a blocked
  * archive_recovery script, or SIGINT a script being run by a backend via
  * system().


I changed it to signal the process group if HAVE_SETSID, like 
pg_signal_backend() does. We don't need to signal both the process group 
and the process itself like signal_child() does, because we don't have 
the race condition with recently-forked children that signal_child() 
talks about.
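
For anyone unfamiliar with the mechanism, here is a standalone toy demo of
signalling a whole process group via a negative PID (illustrative C, not
backend code):

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that makes itself a process-group leader (as backends do
 * under setsid()), then signal the whole group with kill(-pgid, sig).
 * Returns the signal that terminated the child, or -1/0 on failure. */
int
demo_group_signal(void)
{
	int			pipefd[2];
	pid_t		child;
	char		ready;
	int			status;

	if (pipe(pipefd) != 0)
		return -1;
	child = fork();
	if (child < 0)
		return -1;
	if (child == 0)
	{
		setsid();				/* become a new process-group leader */
		(void) write(pipefd[1], "r", 1);	/* tell the parent we're ready */
		pause();				/* wait to be signalled */
		_exit(0);
	}
	(void) read(pipefd[0], &ready, 1);	/* wait for the child's setsid() */
	kill(-child, SIGTERM);		/* negative pid => the whole process group */
	waitpid(child, &status, 0);
	if (WIFSIGNALED(status))
		return WTERMSIG(status);
	return 0;
}
```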



+SendCancelRequest(int backendPID, int32 cancelAuthCode)

I think the name of this function is quite confusing: it's not sending
a cancel request, it is processing one. It sends a SIGINT.


Heh, well, it's sending the cancel request signal to the right backend, 
but I see your point. Renamed to ProcessCancelRequest.



While we're at it, switch to using atomics in pmsignal.c for the state
field. That feels easier to reason about than volatile
pointers.


I feel like this refactor would benefit from being a separate commit.
That would make it easier to follow which change to pmsignal.c is
necessary for what.


Point taken. I didn't do that yet, but it makes sense.


+ MyCancelKeyLength = (MyProcPort != NULL &&
MyProcPort->extended_query_cancel) 

Re: pg_column_toast_chunk_id: a function to get a chunk ID of a TOASTed value

2024-03-08 Thread Nathan Bossart
On Fri, Mar 08, 2024 at 03:31:55PM +0900, Yugo NAGATA wrote:
> On Thu, 7 Mar 2024 16:56:17 -0600
> Nathan Bossart  wrote:
>> to me.  Do you think it's worth adding something like a
>> pg_column_toast_num_chunks() function that returns the number of chunks for
>> the TOASTed value, too?
> 
> If we want to know the number of chunks of a specified chunk_id,
> we can get this by the following query.
> 
> postgres=# SELECT id, (SELECT count(*) FROM pg_toast.pg_toast_16384 WHERE 
> chunk_id = id) 
>   FROM (SELECT pg_column_toast_chunk_id(v) AS id FROM t);

Good point.  Overall, I think this patch is in decent shape, so I'll aim to
commit it sometime next week.

> +{ oid => '8393', descr => 'chunk ID of on-disk TOASTed value',
> +  proname => 'pg_column_toast_chunk_id', provolatile => 's', prorettype => 
> 'oid',
> +  proargtypes => 'any', prosrc => 'pg_column_toast_chunk_id' },

Note to self: this change requires a catversion bump.

> +INSERT INTO test_chunk_id(v1,v2)
> +  VALUES (repeat('x', 1), repeat('x', 2048));

Is this guaranteed to be TOASTed for all possible page sizes?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases

2024-03-08 Thread Nathan Bossart
On Fri, Mar 08, 2024 at 09:33:19AM +, Dean Rasheed wrote:
> If I'm nitpicking, "[--verbose | -v]" in the clusterdb synopsis should
> be replaced with "[option...]", like the other commands, because there
> are other general-purpose options like --quiet and --echo.

Good catch.  I fixed that in v4.  We could probably back-patch this
particular change, but since it's been this way for a while, I don't think
it's terribly important to do so.

> I think this is good to go.

Thanks.  In v4, I've added a first draft of the commit messages, and I am
planning to commit this early next week.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 460da2161265b042079727c1178eff92b3d537b6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 8 Mar 2024 14:35:07 -0600
Subject: [PATCH v4 1/3] vacuumdb: Allow specifying objects to process in all
 databases.

Presently, vacuumdb's --table, --schema, and --exclude-schema
options cannot be used together with --all, i.e., you cannot
specify tables or schemas to process in all databases.  This commit
removes this unnecessary restriction, thus enabling potentially
useful commands like "vacuumdb --all --schema pg_catalog".

Reviewed-by: Kyotaro Horiguchi, Dean Rasheed
Discussion: https://postgr.es/m/20230628232402.GA1954626%40nathanxps13
---
 doc/src/sgml/ref/vacuumdb.sgml| 60 ++-
 src/bin/scripts/t/100_vacuumdb.pl | 24 ++---
 src/bin/scripts/vacuumdb.c| 19 +++---
 3 files changed, 52 insertions(+), 51 deletions(-)

diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 09356ea4fa..66fccb30a2 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -36,7 +36,13 @@ PostgreSQL documentation
 

 
-   dbname
+   
+
+ dbname
+ -a
+ --all
+
+   
   
 
   
@@ -47,40 +53,44 @@ PostgreSQL documentation

 
  
-   
-
- 
-  -n
-  --schema
- 
- schema
-
-   
-
-   
-
- 
-  -N
-  --exclude-schema
- 
- schema
-
-   
+  -n
+  --schema
  
+ schema
 

 
-   dbname
+   
+
+ dbname
+ -a
+ --all
+
+   
   
 
   
vacuumdb
connection-option
option
-   
--a
---all
-   
+
+   
+
+ 
+  -N
+  --exclude-schema
+ 
+ schema
+
+   
+
+   
+
+ dbname
+ -a
+ --all
+
+   
   
  
 
diff --git a/src/bin/scripts/t/100_vacuumdb.pl b/src/bin/scripts/t/100_vacuumdb.pl
index 0601fde205..1d8558c780 100644
--- a/src/bin/scripts/t/100_vacuumdb.pl
+++ b/src/bin/scripts/t/100_vacuumdb.pl
@@ -184,18 +184,18 @@ $node->command_fails_like(
 	[ 'vacuumdb', '-n', 'pg_catalog', '-N', '"Foo"', 'postgres' ],
 	qr/cannot vacuum all tables in schema\(s\) and exclude schema\(s\) at the same time/,
 	'cannot use options -n and -N at the same time');
-$node->command_fails_like(
-	[ 'vacuumdb', '-a', '-N', '"Foo"' ],
-	qr/cannot exclude specific schema\(s\) in all databases/,
-	'cannot use options -a and -N at the same time');
-$node->command_fails_like(
-	[ 'vacuumdb', '-a', '-n', '"Foo"' ],
-	qr/cannot vacuum specific schema\(s\) in all databases/,
-	'cannot use options -a and -n at the same time');
-$node->command_fails_like(
-	[ 'vacuumdb', '-a', '-t', '"Foo".bar' ],
-	qr/cannot vacuum specific table\(s\) in all databases/,
-	'cannot use options -a and -t at the same time');
+$node->issues_sql_like(
+	[ 'vacuumdb', '-a', '-N', 'pg_catalog' ],
+	qr/(?:(?!VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class).)*/,
+	'vacuumdb -a -N');
+$node->issues_sql_like(
+	[ 'vacuumdb', '-a', '-n', 'pg_catalog' ],
+	qr/VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class/,
+	'vacuumdb -a -n');
+$node->issues_sql_like(
+	[ 'vacuumdb', '-a', '-t', 'pg_class' ],
+	qr/VACUUM \(SKIP_DATABASE_STATS\) pg_catalog.pg_class/,
+	'vacuumdb -a -t');
 $node->command_fails_like(
 	[ 'vacuumdb', '-a', '-d', 'postgres' ],
 	qr/cannot vacuum all databases and a specific one at the same time/,
diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c
index 291766793e..7138c6e97e 100644
--- a/src/bin/scripts/vacuumdb.c
+++ b/src/bin/scripts/vacuumdb.c
@@ -72,6 +72,7 @@ static void vacuum_one_database(ConnParams *cparams,
 static void vacuum_all_databases(ConnParams *cparams,
  vacuumingOptions *vacopts,
  bool analyze_in_stages,
+ SimpleStringList *objects,
  int concurrentCons,
  const char *progname, bool echo, bool quiet);
 
@@ -378,6 +379,7 @@ main(int argc, char *argv[])
 
 		vacuum_all_databases(, ,
 			 analyze_in_stages,
+			 ,
 			 concurrentCons,
 			 progname, echo, quiet);
 	}
@@ -429,18 +431,6 @@ check_objfilter(void)
 		(objfilter & OBJFILTER_DATABASE))
 		pg_fatal("cannot vacuum all databases and a specific one at the same time");
 
-	if ((objfilter & OBJFILTER_ALL_DBS) &&
-		(objfilter & 

Re: Streaming read-ready sequential scan code

2024-03-08 Thread Melanie Plageman
On Sat, Mar 02, 2024 at 06:07:48PM -0500, Melanie Plageman wrote:
> On Wed, Feb 28, 2024 at 12:30 PM Melanie Plageman
>  wrote:
> >
> > On Mon, Feb 26, 2024 at 03:56:57PM -0500, Melanie Plageman wrote:
> > > On Mon, Feb 19, 2024 at 6:05 PM Melanie Plageman
> > >  wrote:
> > > >
> > > > On Mon, Jan 29, 2024 at 4:17 PM Melanie Plageman
> > > >  wrote:
> > > > >
> > > > > There is an outstanding question about where to allocate the
> > > > > PgStreamingRead object for sequential scans
> > > >
> > > > I've written three alternative implementations of the actual streaming
> > > > read user for sequential scan which handle the question of where to
> > > > allocate the streaming read object and how to handle changing scan
> > > > direction in different ways.
> > > >
> > > > Option A) 
> > > > https://github.com/melanieplageman/postgres/tree/seqscan_pgsr_initscan_allocation
> > > > - Allocates the streaming read object in initscan(). Since we do not
> > > > know the scan direction at this time, if the scan ends up not being a
> > > > forwards scan, the streaming read object must later be freed -- so
> > > > this will sometimes allocate a streaming read object it never uses.
> > > > - Only supports ForwardScanDirection and once the scan direction
> > > > changes, streaming is never supported again -- even if we return to
> > > > ForwardScanDirection
> > > > - Must maintain a "fallback" codepath that does not use the streaming 
> > > > read API
> > >
> > > Attached is a version of this patch which implements a "reset"
> > > function for the streaming read API which should be cheaper than the
> > > full pg_streaming_read_free() on rescan. This can easily be ported to
> > > work on any of my proposed implementations (A/B/C). I implemented it
> > > on A as an example.
> >
> > Attached is the latest version of this patchset -- rebased in light of
> > Thomas' updates to the streaming read API [1]. I have chosen the
> > approach I think we should go with. It is a hybrid of my previously
> > proposed approaches.
> 
> While investigating some performance concerns, Andres pointed out that
> the members I add to HeapScanDescData in this patch push rs_cindex and
> rs_ntuples to another cacheline and introduce a 4-byte hole. Attached
> v4's HeapScanDescData is as well-packed as on master and its members
> are reordered so that rs_cindex and rs_ntuples are back on the second
> cacheline of the struct's data.

I did some additional profiling and realized that dropping the
unlikely() from the places we check rs_inited frequently was negatively
impacting performance. v5 adds those back and also makes a few other
very minor cleanups.

Note that this patch set has a not yet released version of Thomas
Munro's Streaming Read API with a new ramp-up logic which seems to fix a
performance issue I saw with my test case when all of the sequential
scan's blocks are in shared buffers. Once he sends the official new
version, I will rebase this and point to his explanation in that thread.

- Melanie
>From 29827c23a11061846a7c145f430aa4712c1c30f3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Sat, 27 Jan 2024 18:39:37 -0500
Subject: [PATCH v5 1/5] Split heapgetpage into two parts

heapgetpage(), a per-block utility function used in heap scans, read a
passed-in block into a buffer and then, if page-at-a-time processing was
enabled, pruned the page and built an array of its visible tuples. This
was used for sequential and sample scans.

Future commits will add support for streaming reads. The streaming read
API will read in the blocks specified by a callback, but any significant
per-page processing should be done synchronously on the buffer yielded
by the streaming read API. To support this, separate the logic in
heapgetpage() to read a block into a buffer from that which prunes the
page and builds an array of the visible tuples. The former is now
heapfetchbuf() and the latter is now heapbuildvis().

Future commits will push the logic for selecting the next block into
heapfetchbuf() in cases when streaming reads are not supported (such as
backwards sequential scans). Because this logic differs for sample scans
and sequential scans, inline the code to read the block into a buffer
for sample scans.

This has the added benefit of allowing for a bit of refactoring in
heapam_scan_sample_next_block(), including unpinning the previous buffer
before invoking the callback to select the next block.
---
 src/backend/access/heap/heapam.c | 74 ++--
 src/backend/access/heap/heapam_handler.c | 40 +
 src/include/access/heapam.h  |  2 +-
 3 files changed, 72 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34bc60f625f..aef90d28473 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -363,17 +363,18 @@ heap_setscanlimits(TableScanDesc sscan, BlockNumber startBlk, BlockNumber numBlk
 }
 
 /*
- * 

Re: Proposal to add page headers to SLRU pages

2024-03-08 Thread Jeff Davis
On Wed, 2024-03-06 at 12:01 +, Li, Yong wrote:
> Rebase the patch against the latest HEAD.

The upgrade logic could use more comments explaining what's going on
and why. As I understand it, it's a one-time conversion that needs to
happen between 16 and 17. Is that right?

Was the way CLOG is upgraded already decided in some earlier
discussion?

Given that the CLOG is append-only and gets truncated occasionally, I
wonder whether we can just have some marker that xids before some
number are the old CLOG, and xids beyond that number are in the new
CLOG. I'm not necessarily suggesting that; just an idea.

Regards,
Jeff Davis





Re: Proposal to add page headers to SLRU pages

2024-03-08 Thread Jeff Davis
On Fri, 2024-03-08 at 12:39 -0800, Jeff Davis wrote:
> Making a better secondary structure seems doable to me. Just to
> brainstorm:

We can also keep the group_lsns, and then just update both the CLOG
page LSN and the group_lsn when setting the transaction status. The
former would be used for all of the normal WAL-related stuff, and the
latter would be used by TransactionIdGetStatus() to return the more
precise LSN for that group.

Regards,
Jeff Davis





Re: Failures in constraints regression test, "read only 0 of 8192 bytes"

2024-03-08 Thread Tomas Vondra



On 3/8/24 21:29, Thomas Munro wrote:
> On Sat, Mar 9, 2024 at 2:36 AM Tomas Vondra
>  wrote:
>> On 3/8/24 13:21, Tomas Vondra wrote:
>>> My guess would be 8af25652489, as it's the only storage-related commit.
>>>
>>> I'm currently running tests to verify this.
>>>
>>
>> Yup, the breakage starts with this commit. I haven't looked into the
>> root cause, or whether the commit maybe just made some pre-existing
>> issue easier to hit. Also, I haven't followed the discussion on the
>> pgsql-bugs thread [1], maybe there are some interesting findings.
> 
> Adding Heikki.

I spent a bit of time investigating this today, but haven't made much
progress due to (a) my unfamiliarity with the smgr code in general and
the patch in particular, and (b) CLOBBER_CACHE_ALWAYS making it quite
time-consuming to iterate and experiment.

However, the smallest schedule that still reproduces the issue is:

---
test: test_setup

test: create_aggregate create_function_sql create_cast constraints
triggers select inherit typed_table vacuum drop_if_exists
updatable_views roleattributes create_am hash_func errors infinite_recurse
---

I tried to reduce the second step to just "constraints", but that does
not fail. Clearly there's some concurrency at play, and having just a
single backend makes that go away.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Proposal to add page headers to SLRU pages

2024-03-08 Thread Jeff Davis
On Fri, 2024-03-08 at 13:58 +0100, Alvaro Herrera wrote:
> In the new code we effectively store only one LSN per page, which I
> understand is strictly worse.

To quote from the README:

"Instead, we defer the setting of the hint bit if we have not yet
flushed WAL as far as the LSN associated with the transaction. This
requires tracking the LSN of each unflushed async commit."

In other words, the problem the group_lsns are solving is that we can't
set the hint bit on a tuple until the commit record for that
transaction has actually been flushed. For ordinary sync commit, that's
fine, because the CLOG bit isn't set until after the commit record is
flushed. But for async commit, the CLOG may be updated before the WAL
is flushed, and group_lsns are one way to track the right information
to hold off updating the hint bits.

"It is convenient to associate this data with clog buffers: because we
will flush WAL before writing a clog page, we know that we do not need
to remember a transaction's LSN longer than the clog page holding its
commit status remains in memory."

It's not clear to me that it is so convenient, if it's preventing the
SLRU from fitting in with the rest of the system architecture.

"The worst case is where a sync-commit xact shares a cached LSN with an
async-commit xact that commits a bit later; even though we paid to sync
the first xact to disk, we won't be able to hint its outputs until the
second xact is sync'd, up to three walwriter cycles later."

Perhaps we can revisit alternatives to the group_lsn? If we accept
Yong's proposal, and the SLRU has a normal LSN and was used in the
normal way, we would just need some kind of secondary structure to hold
a mapping from XID->LSN only for async transactions.

The characteristics we are looking for in this secondary structure are:

  1. cheap to test if it's empty, so it doesn't interfere with a purely
sync workload at all
  2. expire old entries (where the LSN has already been flushed)
cheaply enough so the data structure doesn't bloat
  3. look up an LSN given an XID cheaply enough that it doesn't
interfere with setting hint bits

Making a better secondary structure seems doable to me. Just to
brainstorm:

  * Have an open-addressed hash table mapping async XIDs to their
commit LSN. If you have a hash collision, opportunistically see if the
entry is old and can be removed. Try K probes, and if they are all
recent, then you need to XLogFlush. The table could get pretty big,
because it needs to hold enough async transactions for a wal writer
cycle or two, but it seems reasonable to make async workloads pay that
memory cost.

  * Another idea, if the size of the structure is a problem, is to
group K async xids into a bloom filter that points at a single LSN.
When more transactions come along, create another bloom filter for the
next K async xids. This could interfere with setting hint bits for sync
xids if the bloom filters are undersized, but that doesn't seem like a
big problem.

Regards,
Jeff Davis





Re: Failures in constraints regression test, "read only 0 of 8192 bytes"

2024-03-08 Thread Thomas Munro
On Sat, Mar 9, 2024 at 2:36 AM Tomas Vondra
 wrote:
> On 3/8/24 13:21, Tomas Vondra wrote:
> > My guess would be 8af25652489, as it's the only storage-related commit.
> >
> > I'm currently running tests to verify this.
> >
>
> Yup, the breakage starts with this commit. I haven't looked into the
> root cause, or whether the commit maybe just made some pre-existing
> issue easier to hit. Also, I haven't followed the discussion on the
> pgsql-bugs thread [1], maybe there are some interesting findings.

Adding Heikki.




Re: type cache cleanup improvements

2024-03-08 Thread Teodor Sigaev
Yep, exactly. One time in 2^32 we reset the whole cache instead of one (or several) 
entries with hash value = 0.


On 08.03.2024 18:35, Tom Lane wrote:

Aleksander Alekseev  writes:

One thing that I couldn't immediately figure out is why 0 hash value
is treated as a magic invalid value in TypeCacheTypCallback():


I've not read this patch, but IIRC in some places we have a convention
that hash value zero is passed for an sinval reset event (that is,
"flush all cache entries").

regards, tom lane




--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/




Re: btree: downlink right separator/HIKEY optimization

2024-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2024 at 2:14 PM Peter Geoghegan  wrote:
> What benchmarking have you done here?

I think that the memcmp() test is subtly wrong:

> +   if (PointerIsValid(rightsep))
> +   {
> +   /*
> +* Note: we're not in the rightmost page (see branchpoint earlier 
> in
> +* the loop), so we always have a HIKEY on this page.
> +*/
> +   ItemId  hikeyid = PageGetItemId(page, P_HIKEY);
> +   IndexTuple highkey = (IndexTuple) PageGetItem(page, hikeyid);
> +
> +   if (ItemIdGetLength(hikeyid) == IndexTupleSize(rightsep) &&
> +   memcmp(&highkey[1], &rightsep[1],
> +  IndexTupleSize(rightsep) - sizeof(IndexTupleData)) == 
> 0)
> +   {
> +   break;
> +   }
> +   }

Unlike amcheck's bt_pivot_tuple_identical, your memcmp() does not
compare relevant metadata fields from struct IndexTupleData. It
wouldn't make sense for it to compare the block number, of course (if
it did then the optimization would simply never work), but ISTM that
you still need to compare ItemPointerData.ip_posid.

Suppose, for example, that you had two very similar pivot tuples from
a multicolumn index on (a int4, b int2) columns. The first/earlier
tuple is (a,b) = "(42, -inf)", due to the influence of suffix
truncation. The second/later tuple is (a,b) = "(42, 0)", since suffix
truncation couldn't take place when the second pivot tuple was
created. (Actually, suffix truncation would have been possible, but it
would have only managed to truncate-away the tiebreak heap TID
attribute value in the case of our second tuple).

There'll be more alignment padding (more zero padding bytes) in the
second tuple, compared to the first. But the tuples are still the same
size. When you go to you memcmp() this pair of tuples using the
approach in v3, the bytes that are actually compared will be
identical, despite the fact that these are really two distinct tuples,
with distinct values. (As I said, you'd have to actually compare the
ItemPointerData.ip_posid metadata to notice this small difference.)

-- 
Peter Geoghegan




Re: speed up a logical replica setup

2024-03-08 Thread Tomas Vondra



On 3/8/24 10:44, Hayato Kuroda (Fujitsu) wrote:
> Dear Tomas, Euler,
> 
> Thanks for starting to read the thread! Since I'm not the original author,
> I want to reply to only part of it.
> 
>> I decided to take a quick look on this patch today, to see how it works
>> and do some simple tests. I've only started to get familiar with it, so
>> I have only some comments / questions regarding usage, not on the code.
>> It's quite possible I didn't understand some finer points, or maybe it
>> was already discussed earlier in this very long thread, so please feel
>> free to push back or point me to the past discussion.
>>
>> Also, some of this is rather opinionated, but considering I didn't see
>> this patch before, my opinions may easily be wrong ...
> 
> I felt your comments were quite valuable.
> 
>> 1) SGML docs
>>
>> It seems the SGML docs are more about explaining how this works on the
>> inside, rather than how to use the tool. Maybe that's intentional, but
>> as someone who didn't work with pg_createsubscriber before I found it
>> confusing and not very helpful.
>>
>> For example, the first half of the page is prerequisites + warnings, and
>> sure those are useful details, but prerequisites are checked by the
>> tool (so I can't really miss this) and warnings go into a lot of details
>> about different places where things may go wrong. Sure, worth knowing
>> and including in the docs, but maybe not right at the beginning, before
>> I learn how to even run the tool?
> 
> Hmm, right. I considered the improvements below. Tomas and Euler, what do
> you think?
> 
> * Add more descriptions to the "Description" section.
> * Move the prerequisites + warnings to the "Notes" section.
> * Add a "Usage" section which describes setup starting from a single node.
> 
>> I'm not sure FOR ALL TABLES is a good idea. Or said differently, I'm
>> sure it won't work for a number of use cases. I know large databases
>> it's common to create "work tables" (not necessarily temporary) as part
>> of a batch job, but there's no need to replicate those tables.
> 
> Indeed, the documentation does not describe that all tables in the database
> would be included in the publication.
> 
>> I do understand that FOR ALL TABLES is the simplest approach, and for v1
>> it may be an acceptable limitation, but maybe it'd be good to also
>> support restricting which tables should be replicated (e.g. blacklist or
>> whitelist based on table/schema name?).
> 
> Maybe not directly related, but we considered accepting such options as a 
> next step [1].
> 
>> Note: I now realize this might fall under the warning about DDL, which
>> says this:
>>
>> Executing DDL commands on the source server while running
>> pg_createsubscriber is not recommended. If the target server has
>> already been converted to logical replica, the DDL commands must
>> not be replicated so an error would occur.
> 
> Yeah, they would not be replicated, but they would not lead to an ERROR
> either. So should we say something like "Creating tables on the source
> server..."?
> 

Perhaps. Clarifying the docs would help, but it depends on the wording.
For example, I doubt this should talk about "creating tables" because
there are other DDL that (probably) could cause issues (like adding a
column to the table, or something like that).

>> 5) slot / publication / subscription name
>>
>> I find it somewhat annoying it's not possible to specify names for
>> objects created by the tool - replication slots, publication and
>> subscriptions. If this is meant to be a replica running for a while,
>> after a while I'll have no idea what pg_createsubscriber_569853 or
>> pg_createsubscriber_459548_2348239 was meant for.
>>
>> This is particularly annoying because renaming these objects later is
>> either not supported at all (e.g. for replication slots), or may be
>> quite difficult (e.g. publications).
>>
>> I do realize there are challenges with custom names (say, if there are
>> multiple databases to replicate), but can't we support some simple
>> formatting with basic placeholders? So we could specify
>>
>> --slot-name "myslot_%d_%p"
>>
>> or something like that?
> 
> Not sure we can do that in the first version, but it looks nice. One concern
> is that I cannot find applications which accept escape strings like
> log_line_prefix. (That may just be because we do not have a use case.) Do
> you know of examples?
> 

I can't think of a tool already doing that, but I think that's simply
because it was not needed. Why should we be concerned about this?

>> BTW what will happen if we convert multiple standbys? Can't they all get
>> the same slot name (they all have the same database OID, and I'm not
>> sure how much entropy the PID has)?
> 
> I tested, and the second try did not work. The primary reason was the name
> of the publication - pg_createsubscriber_%u (oid).
> FYI - previously we could reuse the same publications, but based on my
> comment [2] that feature was removed. It might have been too optimistic.
> 

OK. I could be convinced the other limitations are reasonable for v1 and

Re: Statistics Import and Export

2024-03-08 Thread Corey Huinker
> Perhaps it's just me ... but it doesn't seem like it's all that many
> parameters.

It's more than I can memorize, so that feels like a lot to me. Clearly it's
not insurmountable.



> > I am a bit concerned about the number of locks on pg_statistic and the
> > relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> > attribute rather than once per relation. But I also see that this will
> > mostly get used at a time when no other traffic is on the machine, and
> > whatever it costs, it's still faster than the smallest table sample
> (insert
> > joke about "don't have to be faster than the bear" here).
>
> I continue to not be too concerned about this until and unless it's
> actually shown to be an issue.  Keeping things simple and
> straight-forward has its own value.
>

Ok, I'm sold on that plan.


>
> > +/*
> > + * Set statistics for a given pg_class entry.
> > + *
> > + * pg_set_relation_stats(relation Oid, reltuples double, relpages int)
> > + *
> > + * This does an in-place (i.e. non-transactional) update of pg_class,
> just as
> > + * is done in ANALYZE.
> > + *
> > + */
> > +Datum
> > +pg_set_relation_stats(PG_FUNCTION_ARGS)
> > +{
> > + const char *param_names[] = {
> > + "relation",
> > + "reltuples",
> > + "relpages",
> > + };
> > +
> > + Oid relid;
> > + Relationrel;
> > + HeapTuple   ctup;
> > + Form_pg_class   pgcform;
> > +
> > + for (int i = 0; i <= 2; i++)
> > + if (PG_ARGISNULL(i))
> > + ereport(ERROR,
> > +
>  (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > +  errmsg("%s cannot be NULL",
> param_names[i])));
>
> Why not just mark this function as strict..?  Or perhaps we should allow
> NULLs to be passed in and just not update the current value in that
> case?


Strict could definitely apply here, and I'm inclined to make it so.



> Also, in some cases we allow the function to be called with a
> NULL but then make it a no-op rather than throwing an ERROR (eg, if the
> OID ends up being NULL).


Thoughts on it emitting a WARN or NOTICE before returning false?



>   Not sure if that makes sense here or not
> offhand but figured I'd mention it as something to consider.
>
> > + pgcform = (Form_pg_class) GETSTRUCT(ctup);
> > + pgcform->reltuples = PG_GETARG_FLOAT4(1);
> > + pgcform->relpages = PG_GETARG_INT32(2);
>
> Shouldn't we include relallvisible?
>

Yes. No idea why I didn't have that in there from the start.


> Also, perhaps we should use the approach that we have in ANALYZE, and
> only actually do something if the values are different rather than just
> always doing an update.
>

That was how it worked back in v1, more for the possibility that there was
no matching JSON to set values.

Looking again at analyze.c (currently lines 1751-1780), we just check if
there is a row in the way, and if so we replace it. I don't see where we
compare existing values to new values.


>
> > +/*
> > + * Import statistics for a given relation attribute
> > + *
> > + * pg_set_attribute_stats(relation Oid, attname name, stainherit bool,
> > + *stanullfrac float4, stawidth int, stadistinct
> float4,
> > + *stakind1 int2, stakind2 int2, stakind3 int3,
> > + *stakind4 int2, stakind5 int2, stanumbers1
> float4[],
> > + *stanumbers2 float4[], stanumbers3 float4[],
> > + *stanumbers4 float4[], stanumbers5 float4[],
> > + *stanumbers1 float4[], stanumbers2 float4[],
> > + *stanumbers3 float4[], stanumbers4 float4[],
> > + *stanumbers5 float4[], stavalues1 text,
> > + *stavalues2 text, stavalues3 text,
> > + *stavalues4 text, stavalues5 text);
> > + *
> > + *
> > + */
>
> Don't know that it makes sense to just repeat the function declaration
> inside a comment like this- it'll just end up out of date.
>

Historical artifact - previous versions had a long explanation of the JSON
format.



>
> > +Datum
> > +pg_set_attribute_stats(PG_FUNCTION_ARGS)
>
> > + /* names of columns that cannot be null */
> > + const char *required_param_names[] = {
> > + "relation",
> > + "attname",
> > + "stainherit",
> > + "stanullfrac",
> > + "stawidth",
> > + "stadistinct",
> > + "stakind1",
> > + "stakind2",
> > + "stakind3",
> > + "stakind4",
> > + "stakind5",
> > + };
>
> Same comment here as above wrt NULL being passed in.
>

In this case, the last 10 params (stanumbersN and stavaluesN) can be null,
and are NULL more often than not.


>
> > + for (int k = 0; k < 5; k++)
>
> Shouldn't we use STATISTIC_NUM_SLOTS here?
>

Yes, I 

Re: btree: downlink right separator/HIKEY optimization

2024-03-08 Thread Peter Geoghegan
On Thu, Feb 22, 2024 at 10:42 AM Matthias van de Meent
 wrote:
> I forgot to address this in the previous patch, so here's v3 which
> fixes the issue warning.

What benchmarking have you done here?

Have you tried just reordering things in _bt_search() instead? If we
delay the check until after the binary search, then the result of the
binary search is usually proof enough that we cannot possibly need to
move right. That has the advantage of not requiring that we copy
anything to the stack.

Admittedly, it's harder to make the "binary search first" approach
work on the leaf level, under the current code structure. But maybe
that doesn't matter very much. And even if it does matter, maybe we
should just move the call to _bt_binsrch() that currently takes place
in _bt_first into _bt_search() itself -- so that _bt_binsrch() is
strictly under the control of _bt_search() (obviously not doable for
non-_bt_first callers, which need to call _bt_binsrch_insert instead).
This whole approach will have been made easier by the refactoring I
did late last year, in commit c9c0589fda.

-- 
Peter Geoghegan




Re: pipe_read_line for reading arbitrary strings

2024-03-08 Thread Daniel Gustafsson
> On 8 Mar 2024, at 18:13, Peter Eisentraut  wrote:

>> Good catch, that's an incorrect copy/paste, it should use ERRCODE_NO_DATA.
> 
> Also it shouldn't print %m, was my point.


Absolutely; I removed that in the patch upthread, as it was clearly wrong.

--
Daniel Gustafsson





Re: Add new error_action COPY ON_ERROR "log"

2024-03-08 Thread Bharath Rupireddy
On Fri, Mar 8, 2024 at 4:42 PM Michael Paquier  wrote:
>
> On Fri, Mar 08, 2024 at 03:36:30PM +0530, Bharath Rupireddy wrote:
> > Please see the attached v5-0001 patch implementing LOG_VERBOSITY with
> > options 'default' and 'verbose'. v5-0002 adds the detailed info to
> > ON_ERROR 'ignore' option.
>
> I may be reading this patch set incorrectly, but why doesn't this
> patch generate one NOTICE per attribute, as suggested by Sawada-san,
> incrementing num_errors once per row when the last attribute has been
> processed?  Also, why not have a test that checks that multiple rows
> spawn more than more messages in some distributed fashion?  Did you
> look at this idea?

If NOTICE per attribute and incrementing num_errors per row is
implemented, it ends up erroring out with ERROR: missing data for
column "m" for an all-columns-empty row. Shall we treat this ERROR
softly too if on_error ignore is specified? Or shall we discuss this
idea separately?

-- tests for options on_error and log_verbosity
COPY check_ign_err FROM STDIN WITH (on_error ignore, log_verbosity 'verbose');
1   {1} 1
a   {2} 2
3   {3} 33
4   {a, 4}  4

5   {5} 5
\.

NOTICE:  detected data type incompatibility at line number 2 for
column n; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 2 for
column m; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 2 for
column k; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 3 for
column k; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 4 for
column m; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 4 for
column k; COPY check_ign_err
NOTICE:  detected data type incompatibility at line number 5 for
column n; COPY check_ign_err
ERROR:  missing data for column "m"
CONTEXT:  COPY check_ign_err, line 5: ""

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-08 Thread Melanie Plageman
On Fri, Mar 8, 2024 at 12:34 PM Melanie Plageman
 wrote:
>
> On Fri, Mar 08, 2024 at 06:07:33PM +0200, Heikki Linnakangas wrote:
> > On 08/03/2024 02:46, Melanie Plageman wrote:
> > > On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:
> > > > On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote:
> > > I will say that now all of the variable names are *very* long. I didn't
> > > want to remove the "state" from LVRelState->next_block_state. (In fact, I
> > > kind of miss the "get". But I had to draw the line somewhere.) I think
> > > without "state" in the name, next_block sounds too much like a function.
> > >
> > > Any ideas for shortening the names of next_block_state and its members
> > > or are you fine with them?
> >
> > Hmm, we can remove the inner struct and add the fields directly into
> > LVRelState. LVRelState already contains many groups of variables, like
> > "Error reporting state", with no inner structs. I did it that way in the
> > attached patch. I also used local variables more.
>
> +1; I like the result of this.

I did some perf testing of 0002 and 0003 using that fully-in-SB vacuum
test I mentioned in an earlier email. 0002 is a vacuum time reduction
from an average of 11.5 ms on master to 9.6 ms with 0002 applied. And
0003 reduces the time vacuum takes from 11.5 ms on master to 7.4 ms
with 0003 applied.

I profiled them and 0002 seems to simply spend less time in
heap_vac_scan_next_block() than master did in lazy_scan_skip().

And 0003 reduces the time vacuum takes because vacuum_delay_point()
shows up pretty high in the profile.

Here are the profiles for my test.

profile of master:

+   29.79%  postgres  postgres   [.] visibilitymap_get_status
+   27.35%  postgres  postgres   [.] vacuum_delay_point
+   17.00%  postgres  postgres   [.] lazy_scan_skip
+6.59%  postgres  postgres   [.] heap_vacuum_rel
+6.43%  postgres  postgres   [.] BufferGetBlockNumber

profile with 0001-0002:

+   40.30%  postgres  postgres   [.] visibilitymap_get_status
+   20.32%  postgres  postgres   [.] vacuum_delay_point
+   20.26%  postgres  postgres   [.] heap_vacuum_rel
+5.17%  postgres  postgres   [.] BufferGetBlockNumber

profile with 0001-0003

+   59.77%  postgres  postgres   [.] visibilitymap_get_status
+   23.86%  postgres  postgres   [.] heap_vacuum_rel
+6.59%  postgres  postgres   [.] StrategyGetBuffer

Test DDL and setup:

psql -c "ALTER SYSTEM SET shared_buffers = '16 GB';"
psql -c "CREATE TABLE foo(id INT, a INT, b INT, c INT, d INT, e INT, f
INT, g INT) with (autovacuum_enabled=false, fillfactor=25);"
psql -c "INSERT INTO foo SELECT i, i, i, i, i, i, i, i FROM
generate_series(1, 4600)i;"
psql -c "VACUUM (FREEZE) foo;"
pg_ctl restart
psql -c "SELECT pg_prewarm('foo');"
# make sure there isn't an ill-timed checkpoint
psql -c "\timing on" -c "vacuum (verbose) foo;"

- Melanie




Re: Emitting JSON to file using COPY TO

2024-03-08 Thread Joe Conway

On 3/8/24 12:28, Andrey M. Borodin wrote:

Hello everyone!

Thanks for working on this, really nice feature!


On 9 Jan 2024, at 01:40, Joe Conway  wrote:

Thanks -- will have a look


Joe, recently folks have proposed a lot of patches in this thread that seem 
to have diverted from the original implementation approach.
As the author of the CF entry [0], can you please comment on which patch 
version needs review?



I don't know if I agree with the proposed changes, but I have also been 
waiting to see how the parallel discussion regarding COPY extensibility 
shakes out.


And there were a couple of issues found that need to be tracked down.

Additionally I have had time/availability challenges recently.

Overall, chances seem slim that this will make it into 17, but I have 
not quite given up hope yet either.


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com





Re: CF entries for 17 to be reviewed

2024-03-08 Thread Andrey M. Borodin



> On 6 Mar 2024, at 18:49, Andrey M. Borodin  wrote:
> 
> Here are statuses for "Refactoring" section:

I've made a pass through the "Replication and Recovery" and "SQL commands"
sections. The "SQL commands" section seems to me the most interesting stuff in 
the commitfest (but I'm far from having inspected every thread yet). Reviewing 
patches from this section is definitely not work for a single CF, though. Be 
prepared that this is a long run, with many iterations and occasional 
architectural pivots. Typically reviewers work on these items for much more 
than one month.

Replication and Recovery
* Synchronizing slots from primary to standby
Titanic work. A lot of stuff has already been pushed, and v108 is now under 
discussion. ISTM that the entry barrier for jumping into the discussion with 
something useful is really high.
* CREATE SUBSCRIPTION ... SERVER
The discussion stopped in January. The authors posted a new version recently.
* speed up a logical replication setup (pg_createsubscriber)
Newest version posted recently, but it fails CFbot's tests. Pinged authors.
* Have pg_basebackup write "dbname" in "primary_conninfo"?
Some feedback and discussion provided. Switched to WoA.

SQL Commands
* Add SPLIT PARTITION/MERGE PARTITIONS commands
Cool new commands, very useful for sharding. The CF item was RfC recently; it 
needs a review update after the rebase.
* Add CANONICAL option to xmlserialize
Vignesh C and Chapman Flack provided some feedback back in October 2023, 
but the patch still needs review.
* Incremental View Maintenance
This is a super cool feature. IMO at this point it is too late for 17, but you 
should consider reviewing it not because it's easy, but because it's hard. 
It's real rocket science. A fine-grained 11-step patchset which could change a 
lot of workloads if committed. CFbot finds some failures, but that should not 
stop anyone from reviewing in this case.
* Implement row pattern recognition feature
SQL:2016 feature, carefully split into 8 steps. Needs review, probably 
over the long run. The feature seems big. 3 reviewers are working on this, 
but there has been no recent input for some months.
* COPY TO json
Active thread with many different authors proposing different patches. I 
could not summarize it, so I asked the CF entry author for help.
* Add new error_action COPY ON_ERROR "log"
There's an active discussion in the thread.
* add COPY option RECECT_LIMIT
While the patch seems quite similar to the previous one, there's no discussion 
in this thread...
 
Thanks!


Best regards, Andrey Borodin.



Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-08 Thread Melanie Plageman
On Fri, Mar 08, 2024 at 06:07:33PM +0200, Heikki Linnakangas wrote:
> On 08/03/2024 02:46, Melanie Plageman wrote:
> > On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:
> > > On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote:
> > I will say that now all of the variable names are *very* long. I didn't
> > want to remove the "state" from LVRelState->next_block_state. (In fact, I
> > kind of miss the "get". But I had to draw the line somewhere.) I think
> > without "state" in the name, next_block sounds too much like a function.
> > 
> > Any ideas for shortening the names of next_block_state and its members
> > or are you fine with them?
> 
> Hmm, we can remove the inner struct and add the fields directly into
> LVRelState. LVRelState already contains many groups of variables, like
> "Error reporting state", with no inner structs. I did it that way in the
> attached patch. I also used local variables more.

+1; I like the result of this.

> > However, by adding a vmbuffer to next_block_state, the callback may be
> > able to avoid extra VM fetches from one invocation to the next.
> 
> That's a good idea, holding separate VM buffer pins for the next-unskippable
> block and the block we're processing. I adopted that approach.

Cool. It can't be avoided with streaming read vacuum, but I wonder if
there would ever be adverse effects to doing it on master? Maybe if we
are doing a lot of skipping and the block of the VM for the heap blocks
we are processing ends up changing each time but we would have had the
right block of the VM if we used the one from
heap_vac_scan_next_block()?

Frankly, I'm in favor of just doing it now because it makes
lazy_scan_heap() less confusing.

> My compiler caught one small bug when I was playing with various
> refactorings of this: heap_vac_scan_next_block() must set *blkno to
> rel_pages, not InvalidBlockNumber, after the last block. The caller uses the
> 'blkno' variable also after the loop, and assumes that it's set to
> rel_pages.

Oops! Thanks for catching that.

> I'm pretty happy with the attached patches now. The first one fixes the
> existing bug I mentioned in the other email (based on the on-going
> discussion that might not how we want to fix it though).

ISTM we should still do the fix you mentioned -- seems like it has more
upsides than downsides?

> From b68cb29c547de3c4acd10f31aad47b453d154666 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Fri, 8 Mar 2024 16:00:22 +0200
> Subject: [PATCH v8 1/3] Set all_visible_according_to_vm correctly with
>  DISABLE_PAGE_SKIPPING
> 
> It's important for 'all_visible_according_to_vm' to correctly reflect
> whether the VM bit is set or not, even when we are not trusting the VM
> to skip pages, because contrary to what the comment said,
> lazy_scan_prune() relies on it.
> 
> If it's incorrectly set to 'false', when the VM bit is in fact set,
> lazy_scan_prune() will try to set the VM bit again and dirty the page
> unnecessarily. As a result, if you used DISABLE_PAGE_SKIPPING, all
> heap pages were dirtied, even if there were no changes. We would also
> fail to clear any VM bits that were set incorrectly.
> 
> This was broken in commit 980ae17310, so backpatch to v16.

LGTM.

> From 47af1ca65cf55ca876869b43bff47f9d43f0750e Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas 
> Date: Fri, 8 Mar 2024 17:32:19 +0200
> Subject: [PATCH v8 2/3] Confine vacuum skip logic to lazy_scan_skip()
> ---
>  src/backend/access/heap/vacuumlazy.c | 256 +++
>  1 file changed, 141 insertions(+), 115 deletions(-)
> 
> diff --git a/src/backend/access/heap/vacuumlazy.c 
> b/src/backend/access/heap/vacuumlazy.c
> index ac55ebd2ae5..0aa08762015 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -204,6 +204,12 @@ typedef struct LVRelState
>   int64   live_tuples;/* # live tuples remaining */
>   int64   recently_dead_tuples;   /* # dead, but not yet 
> removable */
>   int64   missed_dead_tuples; /* # removable, but not removed */

Perhaps we should add a comment to the blkno member of LVRelState
indicating that it is used for error reporting and logging?

> + /* State maintained by heap_vac_scan_next_block() */
> + BlockNumber current_block;  /* last block returned */
> + BlockNumber next_unskippable_block; /* next unskippable block */
> + boolnext_unskippable_allvis;/* its visibility 
> status */
> + Buffer  next_unskippable_vmbuffer;  /* buffer containing 
> its VM bit */
>  } LVRelState;

>  /*
> -static BlockNumber
> -lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
> -bool *next_unskippable_allvis, bool 
> *skipping_current_range)
> +static bool
> +heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
> +  bool 
> 

Re: Emitting JSON to file using COPY TO

2024-03-08 Thread Andrey M. Borodin
Hello everyone!

Thanks for working on this, really nice feature!

> On 9 Jan 2024, at 01:40, Joe Conway  wrote:
> 
> Thanks -- will have a look

Joe, recently folks have proposed a lot of patches in this thread that seem to 
have diverged from the original implementation approach.
As the author of the CF entry [0], can you please comment on which patch version 
needs review?

Thanks!


Best regards, Andrey Borodin.

[0] https://commitfest.postgresql.org/47/4716/



Re: pipe_read_line for reading arbitrary strings

2024-03-08 Thread Peter Eisentraut

On 06.03.24 10:54, Daniel Gustafsson wrote:

On 6 Mar 2024, at 10:07, Peter Eisentraut  wrote:

On 22.11.23 13:47, Alvaro Herrera wrote:

On 2023-Mar-07, Daniel Gustafsson wrote:

The attached POC diff replace fgets() with pg_get_line(), which may not be an
Ok way to cross the streams (it's clearly not a great fit), but as a POC it
provided a neater interface for reading one-off lines from a pipe IMO.  Does
anyone else think this is worth fixing before too many callsites use it, or is
this another case of my fear of silent subtle truncation bugs?  =)

I think this is generally a good change.
I think pipe_read_line should have a "%m" in the "no data returned"
error message.  pg_read_line is careful to retain errno (and it was
already zero at start), so this should be okay ... or should we set
errno again to zero after popen(), even if it works?


Is this correct? The code now looks like this:

line = pg_get_line(pipe_cmd, NULL);

if (line == NULL)
{
if (ferror(pipe_cmd))
log_error(errcode_for_file_access(),
  _("could not read from command \"%s\": %m"), cmd);
else
log_error(errcode_for_file_access(),
  _("no data was returned by command \"%s\": %m"), cmd);
}

We already handle the case where an error happened in the first branch, so 
there cannot be an error set in the second branch (unless something nonobvious 
is going on?).

It seems to me that if the command being run just happens to print nothing but is 
otherwise successful, this would print a bogus error code (or "Success")?


Good catch, that's an incorrect copy/paste, it should use ERRCODE_NO_DATA.


Also it shouldn't print %m, was my point.





Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-08 Thread Bharath Rupireddy
On Wed, Mar 6, 2024 at 4:49 PM Amit Kapila  wrote:
>
> You might want to consider its interaction with sync slots on standby.
> Say, there is no activity on slots in terms of processing the changes
> for slots. Now, we won't perform sync of such slots on standby showing
> them inactive as per your new criteria where as same slots could still
> be valid on primary as the walsender is still active. This may be more
> of a theoretical point as in running system there will probably be
> some activity but I think this needs some thoughts.

I believe the xmin and catalog_xmin of the sync slots on the standby
keep advancing depending on the slots on the primary, no? If yes, the
XID age based invalidation shouldn't be a problem.

I believe there are no walsenders started for the sync slots on the
standbys, right? If yes, the inactive timeout based invalidation also
shouldn't be a problem. Because, the inactive timeouts for a slot are
tracked only for walsenders because they are the ones that typically
hold replication slots for longer durations and for real replication
use. We did a similar thing in a recent commit [1].

Is my understanding right? Do you still see any problems with it?

[1]
commit 7c3fb505b14e86581b6a052075a294c78c91b123
Author: Amit Kapila 
Date:   Tue Nov 21 07:59:53 2023 +0530

Log messages for replication slot acquisition and release.
.
Note that these messages are emitted only for walsenders but not for
backends. This is because walsenders are the ones that typically hold
replication slots for longer durations, unlike backends which hold them
for executing replication related functions.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Remove unnecessary code from psql's watch command

2024-03-08 Thread Tom Lane
Yugo NAGATA  writes:
> On Wed, 06 Mar 2024 13:03:39 -0500
> Tom Lane  wrote:
>> I don't have Windows here to test on, but does the WIN32 code
>> path work at all?

> The outer loop is eventually exited even because PSQLexecWatch returns 0
> when cancel_pressed = 0. However, it happens after executing an extra
> query in this function not just after exit of the inner loop. Therefore,
> it would be better to add the setting and checking of "done" in WIN32, too.

Ah, I see now.  Agreed, that could stand improvement.

> I've attached the updated patch 
> (v2_remove_unnecessary_code_in_psql_watch.patch).

Pushed with minor tidying.

regards, tom lane




Re: Call perror on popen failure

2024-03-08 Thread Peter Eisentraut

On 08.03.24 11:05, Daniel Gustafsson wrote:

If popen fails in pipe_read_line we invoke perror for the error message, and
pipe_read_line is in turn called by find_other_exec which is used in both
frontend and backend code.  This is an old codepath, and it seems like it ought
to be replaced with the more common logging tools we now have like in the
attached diff (which also makes the error translated as opposed to now).  Any
objections to removing this last perror() call?


This change makes it consistent with other popen() calls, so it makes 
sense to me.






Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-08 Thread Melanie Plageman
On Fri, Mar 8, 2024 at 11:31 AM Melanie Plageman
 wrote:
>
> On Fri, Mar 8, 2024 at 11:00 AM Peter Geoghegan  wrote:
> >
> > On Fri, Mar 8, 2024 at 10:48 AM Melanie Plageman
> >  wrote:
> > > Not that it will be fun to maintain another special case in the VM
> > > update code in lazy_scan_prune(), but we could have a special case
> > > that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if
> > > all_visible_according_to_vm is true and all_visible is true, we update
> > > the VM but don't dirty the page.
> >
> > It wouldn't necessarily have to be a special case, I think.
> >
> > We already conditionally set PD_ALL_VISIBLE/call PageIsAllVisible() in
> > the block where lazy_scan_prune marks a previously all-visible page
> > all-frozen -- we don't want to dirty the page unnecessarily there.
> > Making it conditional is defensive in that particular block (this was
> > also added by this same commit of mine), and avoids dirtying the page.
>
> Ah, I see. I got confused. Even if the VM is suspect, if the page is
> all visible and the heap block is already set all-visible in the VM,
> there is no need to update it.
>
> This did make me realize that it seems like there is a case we don't
> handle in master with the current code that would be fixed by changing
> that code Heikki mentioned:
>
> Right now, even if the heap block is incorrectly marked all-visible in
> the VM, if DISABLE_PAGE_SKIPPING is passed to vacuum,
> all_visible_according_to_vm will be passed to lazy_scan_prune() as
> false. Then even if lazy_scan_prune() finds that the page is not
> all-visible, we won't call visibilitymap_clear().
>
> If we revert the code setting next_unskippable_allvis to false in
> lazy_scan_skip() when vacrel->skipwithvm is false and allow
> all_visible_according_to_vm to be true when the VM has it incorrectly
> set to true, then once lazy_scan_prune() discovers the page is not
> all-visible and assuming PD_ALL_VISIBLE is not marked so
> PageIsAllVisible() returns false, we will call visibilitymap_clear()
> to clear the incorrectly set VM bit (without dirtying the page).
>
> Here is a table of the variable states at the end of lazy_scan_prune()
> for clarity:
>
> master:
> all_visible_according_to_vm: false
> all_visible: false
> VM says all vis: true
> PageIsAllVisible:false
>
> if fixed:
> all_visible_according_to_vm: true
> all_visible: false
> VM says all vis: true
> PageIsAllVisible:false

Okay, I now see from Heikki's v8-0001 that he was already aware of this.

- Melanie




Re: Identify transactions causing highest wal generation

2024-03-08 Thread Bharath Rupireddy
On Fri, Mar 8, 2024 at 9:10 PM Tomas Vondra
 wrote:
>
> On 3/8/24 15:50, Gayatri Singh wrote:
> > Hello Team,
> >
> > Can you help me with steps to identify transactions which caused wal
> > generation to surge ?
> >
>
> You should probably take a look at pg_waldump, which prints information
> about WAL contents, including which XID generated each record.

Right. pg_walinspect too can help get the same info for the available
WAL if you are on a production database with PG15 without any access
to the host instance.

> I don't know what exactly your goal is,

Yeah, it's good to know the use-case if possible.

> but sometimes it's not an entirely
> direct relationship. For example, a transaction may delete a record,
> which generates just a little bit of WAL. But then after a checkpoint a
> VACUUM comes around, vacuums the page to reclaim the space of the entry,
> and ends up writing FPI (which is much larger). You could argue this WAL
> is also attributable to the original transaction, but that's not what
> pg_waldump will allow you to do. FPIs in general may inflate the numbers
> unpredictably, and it's not something the original transaction can
> affect very much.

Nice. If one knows the fact that there can be WAL generated without
associated transaction (no XID), there won't be surprises when the
amount of WAL generated by all transactions is compared against the
total WAL on the database.

Alternatively, one can get the correct amount of WAL generated
including the WAL without XID before and after doing some operations
as shown below:

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/52EB488
(1 row)

postgres=# create table foo as select i from generate_series(1, 100) i;
SELECT 100
postgres=# update foo set i = i +1 where i%2 = 0;
UPDATE 50
postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/D2B8000
(1 row)

postgres=# SELECT pg_wal_lsn_diff('0/D2B8000', '0/52EB488');
 pg_wal_lsn_diff
-----------------
       134007672
(1 row)

postgres=# SELECT pg_size_pretty(pg_wal_lsn_diff('0/D2B8000', '0/52EB488'));
 pg_size_pretty
----------------
 128 MB
(1 row)

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Identify transactions causing highest wal generation

2024-03-08 Thread Euler Taveira
On Fri, Mar 8, 2024, at 12:40 PM, Tomas Vondra wrote:
> On 3/8/24 15:50, Gayatri Singh wrote:
> > Hello Team,
> > 
> > Can you help me with steps to identify transactions which caused wal
> > generation to surge ?
> > 
> 
> You should probably take a look at pg_waldump, which prints information
> about WAL contents, including which XID generated each record.

You can also use pg_stat_statements to obtain this information.

postgres=# select * from pg_stat_statements order by wal_bytes desc;
-[ RECORD 1 ]------+---------------------------------------------------------------------
userid | 10
dbid   | 16385
toplevel   | t
queryid| -8403979585082616547
query  | UPDATE pgbench_accounts SET abalance = abalance + $1 
WHERE aid = $2
plans  | 0
total_plan_time| 0
min_plan_time  | 0
max_plan_time  | 0
mean_plan_time | 0
stddev_plan_time   | 0
calls  | 238260
total_exec_time| 4642.59929618
min_exec_time  | 0.011094
max_exec_time  | 0.872748
mean_exec_time | 0.01948543312347807
stddev_exec_time   | 0.006370786385582063
rows   | 238260
.
.
.
wal_records| 496659
wal_fpi| 19417
wal_bytes  | 208501147
.
.
.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-08 Thread Melanie Plageman
On Fri, Mar 8, 2024 at 11:00 AM Peter Geoghegan  wrote:
>
> On Fri, Mar 8, 2024 at 10:48 AM Melanie Plageman
>  wrote:
> > Not that it will be fun to maintain another special case in the VM
> > update code in lazy_scan_prune(), but we could have a special case
> > that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if
> > all_visible_according_to_vm is true and all_visible is true, we update
> > the VM but don't dirty the page.
>
> It wouldn't necessarily have to be a special case, I think.
>
> We already conditionally set PD_ALL_VISIBLE/call PageIsAllVisible() in
> the block where lazy_scan_prune marks a previously all-visible page
> all-frozen -- we don't want to dirty the page unnecessarily there.
> Making it conditional is defensive in that particular block (this was
> also added by this same commit of mine), and avoids dirtying the page.

Ah, I see. I got confused. Even if the VM is suspect, if the page is
all visible and the heap block is already set all-visible in the VM,
there is no need to update it.

This did make me realize that it seems like there is a case we don't
handle in master with the current code that would be fixed by changing
that code Heikki mentioned:

Right now, even if the heap block is incorrectly marked all-visible in
the VM, if DISABLE_PAGE_SKIPPING is passed to vacuum,
all_visible_according_to_vm will be passed to lazy_scan_prune() as
false. Then even if lazy_scan_prune() finds that the page is not
all-visible, we won't call visibilitymap_clear().

If we revert the code setting next_unskippable_allvis to false in
lazy_scan_skip() when vacrel->skipwithvm is false and allow
all_visible_according_to_vm to be true when the VM has it incorrectly
set to true, then once lazy_scan_prune() discovers the page is not
all-visible and assuming PD_ALL_VISIBLE is not marked so
PageIsAllVisible() returns false, we will call visibilitymap_clear()
to clear the incorrectly set VM bit (without dirtying the page).

Here is a table of the variable states at the end of lazy_scan_prune()
for clarity:

master:
all_visible_according_to_vm: false
all_visible: false
VM says all vis: true
PageIsAllVisible:false

if fixed:
all_visible_according_to_vm: true
all_visible: false
VM says all vis: true
PageIsAllVisible:false

> Seems like it might be possible to simplify/consolidate the VM-setting
> code that's now located at the end of lazy_scan_prune. Perhaps the two
> distinct blocks that call visibilitymap_set() could be combined into
> one.

I agree. I have some code to do that in an unproposed patch which
combines the VM updates into the prune record. We will definitely want
to reorganize the code when we do that record combining.

- Melanie




Re: Support a wildcard in backtrace_functions

2024-03-08 Thread Bharath Rupireddy
On Fri, Mar 8, 2024 at 9:25 PM Jelte Fennema-Nio  wrote:
>
> On Fri, 8 Mar 2024 at 15:51, Peter Eisentraut  wrote:
> > What is the relationship of these changes with the recently added
> > backtrace_on_internal_error?
>
> I think that's a reasonable question. And the follow up ones too.
>
> I think it all depends on how close we consider
> backtrace_on_internal_error and backtrace_functions. While they
> obviously have similar functionality, I feel like
> backtrace_on_internal_error is probably a function that we'd want to
> turn on by default in the future.

Hm, we may not want backtrace_on_internal_error to be on by default.
AFAICS, all ERRCODE_INTERNAL_ERROR are identifiable with the error
message, so it's sort of easy for one to track down the cause of it
without actually needing to log backtrace by default.

On Fri, Mar 8, 2024 at 8:21 PM Peter Eisentraut  wrote:
>
> What is the relationship of these changes with the recently added
> backtrace_on_internal_error?  We had similar discussions there, I feel
> like we are doing similar things here but slightly differently.  Like,
> shouldn't backtrace_functions_min_level also affect
> backtrace_on_internal_error?  Don't you really just want
> backtrace_on_any_error?  You are sneaking that in through the backdoor
> via backtrace_functions.  Can we somehow combine all these use cases
> more elegantly?  backtrace_on_error = {all|internal|none}?

I see a merit in Peter's point. I like the idea of
backtrace_functions_min_level controlling backtrace_on_internal_error
too. Less GUCs for similar functionality is always a good idea IMHO.
Here's what I think:

1. Rename the new GUC backtrace_functions_min_level to backtrace_min_level.
2. Add 'internal' to backtrace_min_level_options enum
+static const struct config_enum_entry backtrace_functions_level_options[] = {
+    {"internal", INTERNAL, false},
+    {"debug5", DEBUG5, false},
+    {"debug4", DEBUG4, false},
3. Remove backtrace_on_internal_error GUC which is now effectively
covered by backtrace_min_level = 'internal';

Does anyone see a problem with it?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-08 Thread Heikki Linnakangas

On 08/03/2024 02:46, Melanie Plageman wrote:

On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:

On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote:

I will say that now all of the variable names are *very* long. I didn't
want to remove the "state" from LVRelState->next_block_state. (In fact, I
kind of miss the "get". But I had to draw the line somewhere.) I think
without "state" in the name, next_block sounds too much like a function.

Any ideas for shortening the names of next_block_state and its members
or are you fine with them?


Hmm, we can remove the inner struct and add the fields directly into 
LVRelState. LVRelState already contains many groups of variables, like 
"Error reporting state", with no inner structs. I did it that way in the 
attached patch. I also used local variables more.



I was wondering if we should remove the "get" and just go with
heap_vac_scan_next_block(). I didn't do that originally because I didn't
want to imply that the next block was literally the sequentially next
block, but I think maybe I was overthinking it.

Another idea is to call it heap_scan_vac_next_block() and then the order
of the words is more like the table AM functions that get the next block
(e.g. heapam_scan_bitmap_next_block()). Though maybe we don't want it to
be too similar to those since this isn't a table AM callback.


I've done a version of this.


+1


However, by adding a vmbuffer to next_block_state, the callback may be
able to avoid extra VM fetches from one invocation to the next.


That's a good idea, holding separate VM buffer pins for the 
next-unskippable block and the block we're processing. I adopted that 
approach.


My compiler caught one small bug when I was playing with various 
refactorings of this: heap_vac_scan_next_block() must set *blkno to 
rel_pages, not InvalidBlockNumber, after the last block. The caller uses 
the 'blkno' variable also after the loop, and assumes that it's set to 
rel_pages.


I'm pretty happy with the attached patches now. The first one fixes the 
existing bug I mentioned in the other email (based on the on-going 
discussion that might not how we want to fix it though). Second commit 
is a squash of most of the patches. Third patch is the removal of the 
delay point, that seems worthwhile to keep separate.


--
Heikki Linnakangas
Neon (https://neon.tech)
From b68cb29c547de3c4acd10f31aad47b453d154666 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 8 Mar 2024 16:00:22 +0200
Subject: [PATCH v8 1/3] Set all_visible_according_to_vm correctly with
 DISABLE_PAGE_SKIPPING

It's important for 'all_visible_according_to_vm' to correctly reflect
whether the VM bit is set or not, even when we are not trusting the VM
to skip pages, because contrary to what the comment said,
lazy_scan_prune() relies on it.

If it's incorrectly set to 'false', when the VM bit is in fact set,
lazy_scan_prune() will try to set the VM bit again and dirty the page
unnecessarily. As a result, if you used DISABLE_PAGE_SKIPPING, all
heap pages were dirtied, even if there were no changes. We would also
fail to clear any VM bits that were set incorrectly.

This was broken in commit 980ae17310, so backpatch to v16.

Backpatch-through: 16
Reviewed-by: Melanie Plageman
Discussion: https://www.postgresql.org/message-id/3df2b582-dc1c-46b6-99b6-38eddd1b2...@iki.fi
---
 src/backend/access/heap/vacuumlazy.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8b320c3f89a..ac55ebd2ae5 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -1136,11 +1136,7 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
 
 		/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
 		if (!vacrel->skipwithvm)
-		{
-			/* Caller shouldn't rely on all_visible_according_to_vm */
-			*next_unskippable_allvis = false;
 			break;
-		}
 
 		/*
 		 * Aggressive VACUUM caller can't skip pages just because they are
-- 
2.39.2

From 47af1ca65cf55ca876869b43bff47f9d43f0750e Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Fri, 8 Mar 2024 17:32:19 +0200
Subject: [PATCH v8 2/3] Confine vacuum skip logic to lazy_scan_skip()

Rename lazy_scan_skip() to heap_vac_scan_next_block() and move more
code into the function, so that the caller doesn't need to know about
ranges or skipping anymore. heap_vac_scan_next_block() returns the
next block to process, and the logic for determining that block is all
within the function. This makes the skipping logic easier to
understand, as it's all in the same function, and makes the calling
code easier to understand as it's less cluttered. The state variables
needed to manage the skipping logic are moved to LVRelState.

heap_vac_scan_next_block() now manages its own VM buffer separately
from the caller's vmbuffer variable. The caller's vmbuffer holds the
VM page for the current block it's processing, 

Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2024 at 11:00 AM Peter Geoghegan  wrote:
> Seems like it might be possible to simplify/consolidate the VM-setting
> code that's now located at the end of lazy_scan_prune. Perhaps the two
> distinct blocks that call visibilitymap_set() could be combined into
> one.

FWIW I think that my error here might have had something to do with
hallucinating that the code already did things that way.

At the time this went in, I was working on a patchset that did things
this way (more or less). It broke the dependency on
all_visible_according_to_vm entirely, which simplified the
set-and-check-VM code that's now at the end of lazy_scan_prune.

Not sure how practical it'd be to do something like that now (not
offhand), but something to consider.

-- 
Peter Geoghegan




Re: meson: Specify -Wformat as a common warning flag for extensions

2024-03-08 Thread Tristan Partin

On Fri Mar 8, 2024 at 12:32 AM CST, Michael Paquier wrote:

On Thu, Mar 07, 2024 at 11:39:39PM -0600, Tristan Partin wrote:
> It sounds like a legitimate issue. I have confirmed the issue exists with a
> pg_config compiled with Meson. I can also confirm that this issue exists in
> the autotools build.

First time I'm hearing about that, but I'll admit that I am cheating
because -Wformat is forced in my local builds for some time now.  I'm
failing to see the issue with meson and ./configure even if I remove
the switch, though, using a recent version of gcc at 13.2.0, but
perhaps Debian does something underground.  Are there version and/or
environment requirements to be aware of?

Forcing -Wformat implies more stuff that can be disabled with
-Wno-format-contains-nul, -Wno-format-extra-args, and
-Wno-format-zero-length, but the thing is that we're usually very
conservative with such additions in the scripts.  See also
8b6f5f25102f, done, I guess, as an answer to this thread:
https://www.postgresql.org/message-id/4D431505.9010002%40dunslane.net

A quick look at the past history of pgsql-hackers does not mention
that as a problem, either, but I may have missed something.


Ok, I figured this out. -Wall implies -Wformat=1. We set warning_level 
to 1 in the Meson project() call, which implies -Wall, and set -Wall in 
CFLAGS for autoconf. That's the reason we don't get issues building 
Postgres. A user making use of the pg_config --cflags option, as Sutou 
is, *will* run into the aforementioned issues, since we don't propagate 
-Wall into pg_config.


$ gcc $(pg_config --cflags) -E - < /dev/null > /dev/null
cc1: warning: ‘-Wformat-security’ ignored without ‘-Wformat’ [-Wformat-security]
$ gcc -Wall $(pg_config --cflags) -E - < /dev/null > /dev/null
(nothing printed)

--
Tristan Partin
Neon (https://neon.tech)




Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2024 at 10:48 AM Melanie Plageman
 wrote:
> Not that it will be fun to maintain another special case in the VM
> update code in lazy_scan_prune(), but we could have a special case
> that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if
> all_visible_according_to_vm is true and all_visible is true, we update
> the VM but don't dirty the page.

It wouldn't necessarily have to be a special case, I think.

We already conditionally set PD_ALL_VISIBLE/call PageIsAllVisible() in
the block where lazy_scan_prune marks a previously all-visible page
all-frozen -- we don't want to dirty the page unnecessarily there.
Making it conditional is defensive in that particular block (this was
also added by this same commit of mine), and avoids dirtying the page.

Seems like it might be possible to simplify/consolidate the VM-setting
code that's now located at the end of lazy_scan_prune. Perhaps the two
distinct blocks that call visibilitymap_set() could be combined into
one.

-- 
Peter Geoghegan




Re: Support a wildcard in backtrace_functions

2024-03-08 Thread Jelte Fennema-Nio
On Fri, 8 Mar 2024 at 15:51, Peter Eisentraut  wrote:
> What is the relationship of these changes with the recently added
> backtrace_on_internal_error?

I think that's a reasonable question. And the follow up ones too.

I think it all depends on how close we consider
backtrace_on_internal_error and backtrace_functions. While they
obviously have similar functionality, I feel like
backtrace_on_internal_error is probably a function that we'd want to
turn on by default in the future. While backtrace_functions seems like
it's mostly useful for developers. (i.e. the current grouping of
backtrace_on_internal_error under DEVELOPER_OPTIONS seems wrong to me)

> shouldn't backtrace_functions_min_level also affect
> backtrace_on_internal_error?

I guess that depends on the default behaviour that we want. Would we
want warnings with ERRCODE_INTERNAL_ERROR to be backtraced by default
or not. There is at least one example of such a warning in the
codebase:

ereport(WARNING,
errcode(ERRCODE_INTERNAL_ERROR),
errmsg_internal("could not parse XML declaration in stored value"),
errdetail_for_xml_code(res_code));

> Btw., your code/documentation sometimes writes "stack trace".  Let's
> stick to backtrace for consistency.

Fixed that in the latest patset in the email I sent right before this one.




Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-08 Thread Melanie Plageman
On Fri, Mar 8, 2024 at 10:41 AM Peter Geoghegan  wrote:
>
> On Fri, Mar 8, 2024 at 8:49 AM Heikki Linnakangas  wrote:
> > ISTM we should revert the above hunk, and backpatch it to v16. I'm a
> > little wary because I don't understand why that change was made in the
> > first place, though. I think it was just an ill-advised attempt at
> > tidying up the code as part of the larger commit, but I'm not sure.
> > Peter, do you remember?
>
> I think that it makes sense to set the VM when indicated by
> lazy_scan_prune, independent of what either the visibility map or the
> page's PD_ALL_VISIBLE marking say. The whole point of
> DISABLE_PAGE_SKIPPING is to deal with VM corruption, after all.

Not that it will be fun to maintain another special case in the VM
update code in lazy_scan_prune(), but we could have a special case
that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if
all_visible_according_to_vm is true and all_visible is true, we update
the VM but don't dirty the page. The docs on DISABLE_PAGE_SKIPPING say
it is meant to deal with VM corruption -- it doesn't say anything
about dealing with incorrectly set PD_ALL_VISIBLE markings.

- Melanie




Re: [DOC] Add detail regarding resource consumption wrt max_connections

2024-03-08 Thread Michael Banck
Hi,
 
On Fri, Jan 12, 2024 at 10:14:38PM +, Cary Huang wrote:
> I think it is good to warn the user about the increased allocation of
> memory for certain parameters so that they do not abuse it by setting
> it to a huge number without knowing the consequences.

Right, and I think it might be useful to log (i.e. at LOG not DEBUG3
level, with a nicer message) the amount of memory we allocate on
startup, that is just one additional line per instance lifetime but
might be quite useful to admins. Or maybe two lines if we log whether we
could allocate it as huge pages or not as well:

|2024-03-08 16:46:13.117 CET [237899] DEBUG:  invoking IpcMemoryCreate(size=145145856)
|2024-03-08 16:46:13.117 CET [237899] DEBUG:  mmap(146800640) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
 
> It is true that max_connections can increase the size of proc array
> and other resources, which are allocated in the shared buffer, which
> also means less shared buffer to perform regular data operations.

AFAICT, those resources are allocated on top of shared_buffers, i.e. the
total allocated memory is shared_buffers + (some resources) *
max_connections + (other resources) * other_factors.

> Instead of stating that higher max_connections results in higher
> allocation, It may be better to tell the user that if the value needs
> to be set much higher, consider increasing the "shared_buffers"
> setting as well.

Only if what you say above is true and I am at fault.


Michael




Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-08 Thread Melanie Plageman
On Fri, Mar 8, 2024 at 8:49 AM Heikki Linnakangas  wrote:
>
> On 08/03/2024 02:46, Melanie Plageman wrote:
> > On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:
> >>> I feel heap_vac_scan_get_next_block() function could use some love. Maybe
> >>> just some rewording of the comments, or maybe some other refactoring; not
> >>> sure. But I'm pretty happy with the function signature and how it's 
> >>> called.
> >
> > I've cleaned up the comments on heap_vac_scan_next_block() in the first
> > couple patches (not so much in the streaming read user). Let me know if
> > it addresses your feelings or if I should look for other things I could
> > change.
>
> Thanks, that is better. I think I now finally understand how the
> function works, and now I can see some more issues and refactoring
> opportunities :-).
>
> Looking at current lazy_scan_skip() code in 'master', one thing now
> caught my eye (and it's the same with your patches):
>
> >   *next_unskippable_allvis = true;
> >   while (next_unskippable_block < rel_pages)
> >   {
> >   uint8   mapbits = 
> > visibilitymap_get_status(vacrel->rel,
> > 
> >  next_unskippable_block,
> > 
> >  vmbuffer);
> >
> >   if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> >   {
> >   Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
> >   *next_unskippable_allvis = false;
> >   break;
> >   }
> >
> >   /*
> >* Caller must scan the last page to determine whether it has 
> > tuples
> >* (caller must have the opportunity to set 
> > vacrel->nonempty_pages).
> >* This rule avoids having lazy_truncate_heap() take 
> > access-exclusive
> >* lock on rel to attempt a truncation that fails anyway, 
> > just because
> >* there are tuples on the last page (it is likely that there 
> > will be
> >* tuples on other nearby pages as well, but those can be 
> > skipped).
> >*
> >* Implement this by always treating the last block as unsafe 
> > to skip.
> >*/
> >   if (next_unskippable_block == rel_pages - 1)
> >   break;
> >
> >   /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
> >   if (!vacrel->skipwithvm)
> >   {
> >   /* Caller shouldn't rely on 
> > all_visible_according_to_vm */
> >   *next_unskippable_allvis = false;
> >   break;
> >   }
> >
> >   /*
> >* Aggressive VACUUM caller can't skip pages just because 
> > they are
> >* all-visible.  They may still skip all-frozen pages, which 
> > can't
> >* contain XIDs < OldestXmin (XIDs that aren't already frozen 
> > by now).
> >*/
> >   if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
> >   {
> >   if (vacrel->aggressive)
> >   break;
> >
> >   /*
> >* All-visible block is safe to skip in 
> > non-aggressive case.  But
> >* remember that the final range contains such a 
> > block for later.
> >*/
> >   skipsallvis = true;
> >   }
> >
> >   /* XXX: is it OK to remove this? */
> >   vacuum_delay_point();
> >   next_unskippable_block++;
> >   nskippable_blocks++;
> >   }
>
> Firstly, it seems silly to check DISABLE_PAGE_SKIPPING within the loop.
> When DISABLE_PAGE_SKIPPING is set, we always return the next block and
> set *next_unskippable_allvis = false regardless of the visibility map,
> so why bother checking the visibility map at all?
>
> Except at the very last block of the relation! If you look carefully,
> at the last block we do return *next_unskippable_allvis = true, if the
> VM says so, even if DISABLE_PAGE_SKIPPING is set. I think that's wrong.
> Surely the intention was to pretend that none of the VM bits were set if
> DISABLE_PAGE_SKIPPING is used, also for the last block.

I agree that having next_unskippable_allvis and, as a consequence,
all_visible_according_to_vm set to true for the last block seems
wrong. It also makes sense from a loop-efficiency standpoint to
move it up to the top. However, making that change would have us end
up dirtying all pages in your example.

> This was changed in commit 980ae17310:
>
> > @@ -1311,7 +1327,11 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, 
> > BlockNumber next_block,
> >
> > /* 

Re: Support a wildcard in backtrace_functions

2024-03-08 Thread Jelte Fennema-Nio
On Fri, 8 Mar 2024 at 14:42, Daniel Gustafsson  wrote:
> My documentation comments from upthread still
> stand, but other than those this version LGTM.

Ah yeah, I forgot about those. Fixed now.


v6-0001-Add-backtrace_functions_min_level.patch
Description: Binary data


v6-0002-Add-wildcard-support-to-backtrace_functions-GUC.patch
Description: Binary data


Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-08 Thread Peter Geoghegan
On Fri, Mar 8, 2024 at 8:49 AM Heikki Linnakangas  wrote:
> ISTM we should revert the above hunk, and backpatch it to v16. I'm a
> little wary because I don't understand why that change was made in the
> first place, though. I think it was just an ill-advised attempt at
> tidying up the code as part of the larger commit, but I'm not sure.
> Peter, do you remember?

I think that it makes sense to set the VM when indicated by
lazy_scan_prune, independent of what either the visibility map or the
page's PD_ALL_VISIBLE marking say. The whole point of
DISABLE_PAGE_SKIPPING is to deal with VM corruption, after all.

In retrospect I didn't handle this particular aspect very well in
commit 980ae17310. The approach I took is a bit crude (and in any case
slightly wrong in that it is inconsistent in how it handles the last
page). But it has the merit of fixing the case where we just have the
VM's all-frozen bit set for a given block (not the all-visible bit
set) -- which is always wrong. There was good reason to be concerned
about that possibility when 980ae17310 went in.
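For reference, the invariant in question fits in a couple of lines (a toy sketch in Python, not PostgreSQL code): the all-frozen bit must never be set without the all-visible bit.

```python
def vm_bits_valid(all_visible: bool, all_frozen: bool) -> bool:
    """All-frozen implies all-visible; frozen-without-visible is corruption."""
    return all_visible or not all_frozen

# The three legal visibility-map states, plus the one that is always wrong:
assert vm_bits_valid(all_visible=False, all_frozen=False)
assert vm_bits_valid(all_visible=True, all_frozen=False)
assert vm_bits_valid(all_visible=True, all_frozen=True)
assert not vm_bits_valid(all_visible=False, all_frozen=True)
```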

--
Peter Geoghegan




Re: Identify transactions causing highest wal generation

2024-03-08 Thread Tomas Vondra
On 3/8/24 15:50, Gayatri Singh wrote:
> Hello Team,
> 
> Can you help me with steps to identify transactions which caused wal
> generation to surge ?
> 

You should probably take a look at pg_waldump, which prints information
about WAL contents, including which XID generated each record.

I don't know what exactly your goal is, but the relationship is not always
entirely direct. For example, a transaction may delete a record,
which generates just a little bit of WAL. But then after a checkpoint a
VACUUM comes around, vacuums the page to reclaim the space of the entry,
and ends up writing FPI (which is much larger). You could argue this WAL
is also attributable to the original transaction, but that's not what
pg_waldump will allow you to do. FPIs in general may inflate the numbers
unpredictably, and it's not something the original transaction can
affect very much.
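As a rough illustration (the record lines below are made up, and real pg_waldump output carries more fields), per-XID byte totals can be tallied from pg_waldump's text output like this:

```python
# Toy aggregation of pg_waldump output by XID.  In practice you would feed
# it real output, e.g.:  pg_waldump path/to/000000010000000000000001
from collections import Counter
import re

sample = """\
rmgr: Heap  len (rec/tot): 59/59, tx: 745, lsn: 0/01512B68, desc: INSERT off: 2
rmgr: Heap  len (rec/tot): 59/59, tx: 745, lsn: 0/01512BA8, desc: INSERT off: 3
rmgr: Heap  len (rec/tot): 54/54, tx: 746, lsn: 0/01512BE8, desc: DELETE off: 1
"""

bytes_per_xid = Counter()
for line in sample.splitlines():
    # Pull out the record's total length and the XID that generated it
    m = re.search(r"len \(rec/tot\):\s*\d+/\s*(\d+), tx:\s*(\d+)", line)
    if m:
        bytes_per_xid[m.group(2)] += int(m.group(1))

# Busiest transactions first
top = bytes_per_xid.most_common()
assert top[0] == ("745", 118)
```

Keep in mind the FPI caveat above: records attributed to a VACUUM's XID (or to XID 0 for non-transactional records) may really be "caused" by an earlier transaction.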

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: type cache cleanup improvements

2024-03-08 Thread Tom Lane
Aleksander Alekseev  writes:
> One thing that I couldn't immediately figure out is why 0 hash value
> is treated as a magic invalid value in TypeCacheTypCallback():

I've not read this patch, but IIRC in some places we have a convention
that hash value zero is passed for an sinval reset event (that is,
"flush all cache entries").

regards, tom lane




Re: type cache cleanup improvements

2024-03-08 Thread Aleksander Alekseev
Hi,

> > I would like to tweak the patch a little bit - change some comments,
> > add some Asserts, etc. Don't you mind?
> You are welcome!

Thanks. PFA the updated patch with some tweaks by me. I added the
commit message as well.

One thing that I couldn't immediately figure out is why 0 hash value
is treated as a magic invalid value in TypeCacheTypCallback():

```
-   hash_seq_init(&status, TypeCacheHash);
+   if (hashvalue == 0)
+   hash_seq_init(&status, TypeCacheHash);
+   else
+   hash_seq_init_with_hash_value(&status, TypeCacheHash,
hashvalue);
```

Is there anything that prevents the actual hash value from being zero?
I don't think so, but maybe I missed something.

If zero is indeed an invalid hash value I would like to reference the
corresponding code. If zero is a correct hash value we should either
change this by adding something like `if(!hash) hash++` or use an
additional boolean argument here.
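To illustrate the concern with a toy model (not the actual typcache code): if 0 is both a legal hash value and the "flush everything" sentinel, a targeted invalidation whose hash happens to be 0 silently degrades into a full-cache scan, which is over-broad but still safe:

```python
def entries_to_flush(cache, hashvalue):
    """Toy model: hashvalue == 0 means 'scan everything', as in the patch."""
    if hashvalue == 0:
        return list(cache)  # full sequential scan
    return [k for k, h in cache.items() if h == hashvalue]

# cache maps entry key -> its stored hash value
cache = {"int4": 0, "text": 17, "point": 42}

# Targeted invalidation only visits matching entries:
assert entries_to_flush(cache, 17) == ["text"]
# An entry whose real hash is 0 can only ever be reached via the full scan:
assert entries_to_flush(cache, 0) == ["int4", "text", "point"]
```

So correctness is preserved either way; the question is only whether a genuine hash of 0 ever arrives here and quietly loses the optimization.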

-- 
Best regards,
Aleksander Alekseev


v3-0001-Improve-performance-of-type-cache-cleanup.patch
Description: Binary data


Re: Call perror on popen failure

2024-03-08 Thread Tom Lane
Daniel Gustafsson  writes:
> If popen fails in pipe_read_line we invoke perror for the error message, and
> pipe_read_line is in turn called by find_other_exec which is used in both
> frontend and backend code.  This is an old codepath, and it seems like it 
> ought
> to be replaced with the more common logging tools we now have like in the
> attached diff (which also makes the error translated as opposed to now).  Any
> objections to removing this last perror() call?

I didn't check your replacement code in detail, but I think we have
a policy against using perror, mainly because we can't count on it
to localize consistently with ereport et al.  My grep confirms this
is the only use, so +1 for removing it.

regards, tom lane




Re: Spurious pgstat_drop_replslot() call

2024-03-08 Thread Bertrand Drouvot
Hi,

On Fri, Mar 08, 2024 at 07:55:39PM +0900, Michael Paquier wrote:
> On Fri, Mar 08, 2024 at 10:19:11AM +, Bertrand Drouvot wrote:
> > Indeed, it does not seem appropriate to remove stats during slot 
> > invalidation as
> > one could still be interested to look at them.
> 
> Yeah, my take is that this can still be interesting to know, at least
> for debugging.  This would limit the stats to be dropped when the slot
> is dropped, and that looks like a sound idea.

Thanks for looking at it!

> > This spurious call has been introduced in be87200efd. I think that 
> > initially we
> > designed to drop slots when a recovery conflict occurred during logical 
> > decoding
> > on a standby. But we changed our mind to invalidate such a slot instead.
> > 
> > The spurious call is probably due to the initial design but has not been 
> > removed.
> 
> This is not a subject that has really been touched on the original
> thread mentioned on the commit, so it is a bit hard to be sure.  The
> only comment is that a previous version of the patch did the stats
> drop in the slot invalidation path at an incorrect location.

The switch in the patch from "drop" to "invalidation" is in [1], see:

"
Given the precedent of max_slot_wal_keep_size, I think it's wrong to
just drop the logical slots. Instead we should just mark them as
invalid, like InvalidateObsoleteReplicationSlots().

Makes fully sense and done that way in the attached patch.
"

But yeah, hard to be sure why this call is there, at least I don't remember...

> > I don't think it's worth to add a test but can do if one feel the need.
> 
> Well, that would not be complicated while being cheap, no?  We have a
> few paths in the 035 test where we know that a slot has been
> invalidated so it is just a matter of querying once
> pg_stat_replication_slot and see if some data is still there.

We cannot be 100% sure that the stats are up to date when the process holding
the active slot is killed.

So the attached v2 adds a test where we first ensure that we have non-empty
stats before triggering the invalidation.

[1]: 
https://www.postgresql.org/message-id/26c6f320-98f0-253c-f8b5-df1e7c1f6315%40amazon.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From b97a42edd66ccb7f8f4d5d2fa24df0b02b6f4f68 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 8 Mar 2024 09:29:29 +
Subject: [PATCH v2] Remove spurious pgstat_drop_replslot() call

There is no need to remove stats for an invalidated slot as one could still
be interested to look at them.
---
 src/backend/replication/slot.c |  1 -
 .../recovery/t/035_standby_logical_decoding.pl | 18 ++
 2 files changed, 18 insertions(+), 1 deletion(-)
   3.5% src/backend/replication/
  96.4% src/test/recovery/t/

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index b8bf98b182..91ca397857 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1726,7 +1726,6 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			ReplicationSlotMarkDirty();
 			ReplicationSlotSave();
 			ReplicationSlotRelease();
-			pgstat_drop_replslot(s);
 
 			ReportSlotInvalidation(conflict, false, active_pid,
    slotname, restart_lsn,
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index 0710bccc17..855f37016e 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -494,6 +494,8 @@ $node_subscriber->stop;
 ##
 # Recovery conflict: Invalidate conflicting slots, including in-use slots
 # Scenario 1: hot_standby_feedback off and vacuum FULL
+# In passing, ensure that replication slot stats are not removed when the active
+# slot is invalidated.
 ##
 
 # One way to produce recovery conflict is to create/drop a relation and
@@ -502,6 +504,15 @@ $node_subscriber->stop;
 reactive_slots_change_hfs_and_wait_for_xmins('behaves_ok_', 'vacuum_full_',
 	0, 1);
 
+# Ensure replication slot stats are not empty before triggering the conflict
+$node_primary->safe_psql('testdb',
+	qq[INSERT INTO decoding_test(x,y) SELECT 100,'100';]
+);
+
+$node_standby->poll_query_until('testdb',
+	qq[SELECT total_txns > 0 FROM pg_stat_replication_slots WHERE slot_name = 'vacuum_full_activeslot']
+) or die "slot stat total_txns not > 0 for vacuum_full_activeslot";
+
 # This should trigger the conflict
 wait_until_vacuum_can_remove(
 	'full', 'CREATE TABLE conflict_test(x integer, y text);
@@ -515,6 +526,13 @@ check_for_invalidation('vacuum_full_', 1, 'with vacuum FULL on pg_class');
 # Verify conflict_reason is 'rows_removed' in pg_replication_slots
 check_slots_conflict_reason('vacuum_full_', 'rows_removed');
 
+# Ensure replication slot stats 

Re: [DOC] Add detail regarding resource consumption wrt max_connections

2024-03-08 Thread Robert Treat
On Mon, Jan 22, 2024 at 8:58 AM  wrote:
> On Fri, 2024-01-19 at 17:37 -0500, reid.thomp...@crunchydata.com wrote:
> > On Sat, 2024-01-13 at 10:31 -0700, Roberto Mello wrote:
> > >
> > > I can add a suggestion for the user to consider increasing
> > > shared_buffers in accordance with higher max_connections, but it
> > > would be better if there was a "rule of thumb" guideline to go
> > > along. I'm open to suggestions.
> > >
> > > I can revise with a similar warning in max_prepared_xacts as well.
> > >
> > > Sincerely,
> > >
> > > Roberto
> >
> > Can a "close enough" rule of thumb be calculated from:
> > postgresql.conf -> log_min_messages = debug3
> >
> > start postgresql with varying max_connections to get
> > CreateSharedMemoryAndSemaphores() sizes to generate a rough equation
> >
>
> or maybe it would be sufficient to advise to set log_min_messages =
> debug3 on a test DB and start/stop it with varying values of
> max_connections and look at the differing values in
> DEBUG: invoking IpcMemoryCreate(size=...) log messages for themselves.
>
>

I'm of the opinion that advice suggesting DBAs set things to DEBUG3
is unfriendly at best. If you really want to add more, there is an
existing unfriendly section of the docs at
https://www.postgresql.org/docs/devel/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
that mentions this problem, specifically:

"If PostgreSQL itself is the cause of the system running out of
memory, you can avoid the problem by changing your configuration. In
some cases, it may help to lower memory-related configuration
parameters, particularly shared_buffers, work_mem, and
hash_mem_multiplier. In other cases, the problem may be caused by
allowing too many connections to the database server itself. In many
cases, it may be better to reduce max_connections and instead make use
of external connection-pooling software."

I couldn't really find a spot to add in your additional info, but
maybe you can find a spot that fits? Or maybe a well written
walk-through of this would make for a good wiki page in case people
really want to dig in.

In any case, I think Roberto's original language is an improvement
over what we have now, so I'd probably recommend just going with that,
along with a similar note to max_prepared_xacts, and optionally a
pointer to the shared mem section of the docs.

Robert Treat
https://xzilla.net




Re: Support a wildcard in backtrace_functions

2024-03-08 Thread Peter Eisentraut

On 08.03.24 12:25, Jelte Fennema-Nio wrote:

On Fri, 8 Mar 2024 at 10:59, Alvaro Herrera  wrote:


On 2024-Mar-08, Bharath Rupireddy wrote:


This works only if '*' is specified as the only character in
backtrace_functions = '*', right? If yes, what if someone sets
backtrace_functions = 'foo, bar, *, baz'?


It throws an error, as expected.  This is a useless waste of resources:
checking for "foo" and "bar" is pointless, since the * is going to give
a positive match anyway.  And the "baz" is a waste of memory which is
never going to be checked.


Makes sense. Attached is a new patchset that implements it that way.
I've not included Bharath's 0003 patch, since it's a much bigger
change than the others, and thus might need some more discussion.


What is the relationship of these changes with the recently added 
backtrace_on_internal_error?  We had similar discussions there, I feel 
like we are doing similar things here but slightly differently.  Like, 
shouldn't backtrace_functions_min_level also affect 
backtrace_on_internal_error?  Don't you really just want 
backtrace_on_any_error?  You are sneaking that in through the backdoor 
via backtrace_functions.  Can we somehow combine all these use cases 
more elegantly?  backtrace_on_error = {all|internal|none}?


Btw., your code/documentation sometimes writes "stack trace".  Let's 
stick to backtrace for consistency.






Identify transactions causing highest wal generation

2024-03-08 Thread Gayatri Singh
Hello Team,

Can you help me with steps to identify transactions which caused wal
generation to surge ?

Regards,
Gayatri.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-08 Thread Bharath Rupireddy
On Wed, Mar 6, 2024 at 4:28 PM Amit Kapila  wrote:
>
> IIUC, the current conflict_reason is primarily used to determine
> logical slots on standby that got invalidated due to recovery time
> conflict. On the primary, it will also show logical slots that got
> invalidated due to the corresponding WAL got removed. Is that
> understanding correct?

That's right.

> If so, we are already sort of overloading this
> column. However, now adding more invalidation reasons that won't
> happen during recovery conflict handling will change entirely the
> purpose (as per the name we use) of this variable. I think
> invalidation_reason could depict this column correctly but OTOH I
> guess it would lose its original meaning/purpose.

Hm. I get the concern. Are you okay with having invalidation_reason
separately for both logical and physical slots? In such a case,
logical slots that got invalidated on the standby will have duplicate
info in conflict_reason and invalidation_reason; is that fine?

Another idea is to turn 'conflict_reason text' back into a 'conflicting
boolean' (revert 007693f2a3), and have 'invalidation_reason
text' for both logical and physical slots. So, whenever 'conflicting'
is true, one can look at invalidation_reason for the reason for
conflict. How does this sound?
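A toy model of that second idea, with column names as proposed and the derivation logic purely illustrative:

```python
def slot_view_row(invalidation_reason, is_logical, on_standby):
    """Sketch: 'conflicting' becomes a boolean derived from the textual
    invalidation_reason, which would now exist for all slot types."""
    # Reasons that arise from recovery-conflict handling on a standby
    recovery_conflicts = {"rows_removed", "wal_level_insufficient"}
    conflicting = (is_logical and on_standby
                   and invalidation_reason in recovery_conflicts)
    return {"conflicting": conflicting,
            "invalidation_reason": invalidation_reason}

# A logical slot invalidated on a standby: conflicting, with the reason text
row = slot_view_row("rows_removed", is_logical=True, on_standby=True)
assert row["conflicting"] and row["invalidation_reason"] == "rows_removed"

# A physical slot invalidated because WAL was removed is not "conflicting":
row = slot_view_row("wal_removed", is_logical=False, on_standby=False)
assert not row["conflicting"]
```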

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Support a wildcard in backtrace_functions

2024-03-08 Thread Daniel Gustafsson
> On 8 Mar 2024, at 15:01, Bharath Rupireddy 
>  wrote:

> So, to get backtraces of all functions at
> backtrace_functions_min_level level, one has to specify
> backtrace_functions = '*'; combining it with function names is not
> allowed. This looks cleaner.
> 
> postgres=# ALTER SYSTEM SET backtrace_functions = '*, 
> pg_create_restore_point';
> ERROR:  invalid value for parameter "backtrace_functions": "*,
> pg_create_restore_point"
> DETAIL:  Invalid character

If we want to be extra helpful here we could add something like the below to
give an errhint when a wildcard was found.  Also, the errdetail should read
like a full sentence, so it should be slightly expanded anyway.

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index ca621ea3ff..7bc655ecd2 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -2151,7 +2151,9 @@ check_backtrace_functions(char **newval, void **extra, 
GucSource source)
  ", \n\t");
if (validlen != newvallen)
{
-   GUC_check_errdetail("Invalid character");
+   GUC_check_errdetail("Invalid character in function name.");
+   if ((*newval)[validlen] == '*')
+   GUC_check_errhint("For wildcard matching, use a single 
\"*\" without any other function names.");
return false;
}
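The intended behavior of the check, hint included, can be modeled in a few lines (a toy re-implementation, not the actual C code; the accepted character set is an assumption):

```python
import string

# Characters assumed valid in a function-name list (letters, digits,
# underscore, plus the separators the C code skips over).
VALID_CHARS = set(string.ascii_letters + string.digits + "_, \n\t")

def check_backtrace_functions(newval):
    """Toy model: returns (ok, errdetail, errhint).  '*' is only accepted
    when it is the sole entry in the list."""
    if newval.strip() == "*":
        return (True, None, None)
    for ch in newval:
        if ch not in VALID_CHARS:
            hint = None
            if ch == "*":
                hint = ('For wildcard matching, use a single "*" '
                        'without any other function names.')
            return (False, "Invalid character in function name.", hint)
    return (True, None, None)

assert check_backtrace_functions("*") == (True, None, None)
ok, detail, hint = check_backtrace_functions("*, pg_create_restore_point")
assert not ok and hint is not None
assert check_backtrace_functions("foo, bar")[0]
```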

--
Daniel Gustafsson





Re: Support a wildcard in backtrace_functions

2024-03-08 Thread Bharath Rupireddy
On Fri, Mar 8, 2024 at 7:12 PM Daniel Gustafsson  wrote:
>
> > On 8 Mar 2024, at 12:25, Jelte Fennema-Nio  wrote:
> >
> > On Fri, 8 Mar 2024 at 10:59, Alvaro Herrera  wrote:
> >>
> >> On 2024-Mar-08, Bharath Rupireddy wrote:
> >>
> >>> This works only if '*' is specified as the only character in
> >>> backtrace_functions = '*', right? If yes, what if someone sets
> >>> backtrace_functions = 'foo, bar, *, baz'?
> >>
> >> It throws an error, as expected.  This is a useless waste of resources:
> >> checking for "foo" and "bar" is pointless, since the * is going to give
> >> a positive match anyway.  And the "baz" is a waste of memory which is
> >> never going to be checked.
> >
> > Makes sense. Attached is a new patchset that implements it that way.
>
> This version addresses the concerns raised by Alvaro, and even simplifies the
> code over earlier revisions.  My documentation comments from upthread still
> stand, but other than those this version LGTM.

So, to get backtraces of all functions at
backtrace_functions_min_level level, one has to specify
backtrace_functions = '*'; combining it with function names is not
allowed. This looks cleaner.

postgres=# ALTER SYSTEM SET backtrace_functions = '*, pg_create_restore_point';
ERROR:  invalid value for parameter "backtrace_functions": "*,
pg_create_restore_point"
DETAIL:  Invalid character

I have one comment on 0002, otherwise all looks good.

+   
+A single * character can be used instead of a list
+of C functions. This * is interpreted as a wildcard
+and will cause all errors in the log to contain backtraces.
+   

It's not always the ERRORs for which backtraces get logged; it really
depends on the new GUC backtrace_functions_min_level. If my
understanding is right, can we mention that in the above note?

> > I've not included Bharath's 0003 patch, since it's a much bigger
> > change than the others, and thus might need some more discussion.

+1. I'll see if I can start a new thread for this.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Confine vacuum skip logic to lazy_scan_skip

2024-03-08 Thread Heikki Linnakangas

On 08/03/2024 02:46, Melanie Plageman wrote:

On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:

I feel heap_vac_scan_get_next_block() function could use some love. Maybe
just some rewording of the comments, or maybe some other refactoring; not
sure. But I'm pretty happy with the function signature and how it's called.


I've cleaned up the comments on heap_vac_scan_next_block() in the first
couple patches (not so much in the streaming read user). Let me know if
it addresses your feelings or if I should look for other things I could
change.


Thanks, that is better. I think I now finally understand how the 
function works, and now I can see some more issues and refactoring 
opportunities :-).


Looking at current lazy_scan_skip() code in 'master', one thing now 
caught my eye (and it's the same with your patches):



*next_unskippable_allvis = true;
while (next_unskippable_block < rel_pages)
{
uint8   mapbits = visibilitymap_get_status(vacrel->rel,

   next_unskippable_block,

   vmbuffer);

if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
{
Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
*next_unskippable_allvis = false;
break;
}

/*
 * Caller must scan the last page to determine whether it has 
tuples
 * (caller must have the opportunity to set 
vacrel->nonempty_pages).
 * This rule avoids having lazy_truncate_heap() take 
access-exclusive
 * lock on rel to attempt a truncation that fails anyway, just 
because
 * there are tuples on the last page (it is likely that there 
will be
 * tuples on other nearby pages as well, but those can be 
skipped).
 *
 * Implement this by always treating the last block as unsafe 
to skip.
 */
if (next_unskippable_block == rel_pages - 1)
break;

/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
if (!vacrel->skipwithvm)
{
/* Caller shouldn't rely on all_visible_according_to_vm 
*/
*next_unskippable_allvis = false;
break;
}

/*
 * Aggressive VACUUM caller can't skip pages just because they 
are
 * all-visible.  They may still skip all-frozen pages, which 
can't
 * contain XIDs < OldestXmin (XIDs that aren't already frozen 
by now).
 */
if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
{
if (vacrel->aggressive)
break;

/*
 * All-visible block is safe to skip in non-aggressive 
case.  But
 * remember that the final range contains such a block 
for later.
 */
skipsallvis = true;
}

/* XXX: is it OK to remove this? */
vacuum_delay_point();
next_unskippable_block++;
nskippable_blocks++;
}


Firstly, it seems silly to check DISABLE_PAGE_SKIPPING within the loop. 
When DISABLE_PAGE_SKIPPING is set, we always return the next block and 
set *next_unskippable_allvis = false regardless of the visibility map, 
so why bother checking the visibility map at all?


Except at the very last block of the relation! If you look carefully, 
at the last block we do return *next_unskippable_allvis = true, if the 
VM says so, even if DISABLE_PAGE_SKIPPING is set. I think that's wrong. 
Surely the intention was to pretend that none of the VM bits were set if 
DISABLE_PAGE_SKIPPING is used, also for the last block.
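In other words, the DISABLE_PAGE_SKIPPING case could be decided before consulting the VM at all. A toy sketch of that control flow (aggressive/all-frozen handling omitted):

```python
def next_unskippable(next_block, rel_pages, vm_all_visible, skipwithvm):
    """Return (next_unskippable_block, next_unskippable_allvis) with the
    skipwithvm test hoisted out of the VM-scanning loop."""
    if not skipwithvm:
        # DISABLE_PAGE_SKIPPING: never skip, and pretend no VM bit is set,
        # including for the last block.
        return next_block, False
    while next_block < rel_pages:
        if not vm_all_visible[next_block]:
            return next_block, False
        if next_block == rel_pages - 1:   # always scan the last block
            return next_block, True
        next_block += 1
    return next_block, False

vm = [True, True, False, True]            # per-block all-visible bits
assert next_unskippable(0, 4, vm, skipwithvm=True) == (2, False)
assert next_unskippable(3, 4, vm, skipwithvm=True) == (3, True)
# With DISABLE_PAGE_SKIPPING the VM is ignored even for the last block:
assert next_unskippable(3, 4, vm, skipwithvm=False) == (3, False)
```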


This was changed in commit 980ae17310:


@@ -1311,7 +1327,11 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, 
BlockNumber next_block,
 
/* DISABLE_PAGE_SKIPPING makes all skipping unsafe */

if (!vacrel->skipwithvm)
+   {
+   /* Caller shouldn't rely on all_visible_according_to_vm 
*/
+   *next_unskippable_allvis = false;
break;
+   }


Before that, *next_unskippable_allvis was set correctly according to the 
VM, even when DISABLE_PAGE_SKIPPING was used. It's not clear to me why 
that was changed. And I think setting it to 'true' would be a more 
failsafe value than 'false'. When *next_unskippable_allvis is set to 
true, the caller cannot rely on it because a concurrent 

Re: Support a wildcard in backtrace_functions

2024-03-08 Thread Daniel Gustafsson
> On 8 Mar 2024, at 12:25, Jelte Fennema-Nio  wrote:
> 
> On Fri, 8 Mar 2024 at 10:59, Alvaro Herrera  wrote:
>> 
>> On 2024-Mar-08, Bharath Rupireddy wrote:
>> 
>>> This works only if '*' is specified as the only character in
>>> backtrace_functions = '*', right? If yes, what if someone sets
>>> backtrace_functions = 'foo, bar, *, baz'?
>> 
>> It throws an error, as expected.  This is a useless waste of resources:
>> checking for "foo" and "bar" is pointless, since the * is going to give
>> a positive match anyway.  And the "baz" is a waste of memory which is
>> never going to be checked.
> 
> Makes sense. Attached is a new patchset that implements it that way.

This version addresses the concerns raised by Alvaro, and even simplifies the
code over earlier revisions.  My documentation comments from upthread still
stand, but other than those this version LGTM.

> I've not included Bharath's 0003 patch, since it's a much bigger
> change than the others, and thus might need some more discussion.

Agreed.

--
Daniel Gustafsson





Re: Failures in constraints regression test, "read only 0 of 8192 bytes"

2024-03-08 Thread Ronan Dunklau
Le vendredi 8 mars 2024, 14:36:48 CET Tomas Vondra a écrit :
> > My guess would be 8af25652489, as it's the only storage-related commit.
> > 
> > I'm currently running tests to verify this.
> 
> Yup, the breakage starts with this commit. I haven't looked into the
> root cause, or whether the commit maybe just made some pre-existing
> issue easier to hit. Also, I haven't followed the discussion on the
> pgsql-bugs thread [1], maybe there are some interesting findings.
> 

If that happens only on HEAD and not on 16, and doesn't involve WAL replay, 
then it's not the same bug. 

--
Ronan Dunklau






Re: Failures in constraints regression test, "read only 0 of 8192 bytes"

2024-03-08 Thread Tomas Vondra



On 3/8/24 13:21, Tomas Vondra wrote:
> On 3/8/24 09:33, Thomas Munro wrote:
>> Happened again.  I see this is OpenSUSE.  Does that mean the file
>> system is Btrfs?
> 
> 
> It is, but I don't think that matters - I've been able to reproduce this
> locally on my laptop using ext4 filesystem. I'd bet the important piece
> here is -DCLOBBER_CACHE_ALWAYS (and it seems avocet/trilobite are the
> only animals running with this).
> 
> Also, if this really is a filesystem (or environment) issue, it seems
> very strange it'd only affect HEAD and not the other branches. So it
> seems quite likely this is actually triggered by a commit.
> 
> Looking at the commits from the good/bad runs, I see this:
> 
> avocet: good=4c2369a bad=f5a465f
> trilobite: good=d13ff82 bad=def0ce3
> 
> That means the commit would have to be somewhere here:
> 
> f5a465f1a07 Promote assertion about !ReindexIsProcessingIndex to ...
> 57b28c08305 Doc: fix minor typos in two ECPG function descriptions.
> 28e858c0f95 Improve documentation and GUC description for ...
> a661bf7b0f5 Remove flaky isolation tests for timeouts
> 874d817baa1 Multiple revisions to the GROUP BY reordering tests
> 466979ef031 Replace lateral references to removed rels in subqueries
> a6b2a51e16d Avoid dangling-pointer problem with partitionwise ...
> d360e3cc60e Fix compiler warning on typedef redeclaration
> 8af25652489 Introduce a new smgr bulk loading facility.
> e612384fc78 Fix mistake in SQL features list
> d13ff82319c Fix BF failure in commit 93db6cbda0.
> 
> My guess would be 8af25652489, as it's the only storage-related commit.
> 
> I'm currently running tests to verify this.
> 

Yup, the breakage starts with this commit. I haven't looked into the
root cause, or whether the commit maybe just made some pre-existing
issue easier to hit. Also, I haven't followed the discussion on the
pgsql-bugs thread [1], maybe there are some interesting findings.

[1]
https://www.postgresql.org/message-id/flat/1878547.tdWV9SEqCh%40aivenlaptop


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Proposal to add page headers to SLRU pages

2024-03-08 Thread Alvaro Herrera
On 2024-Mar-08, Stephen Frost wrote:

> Thanks for testing!  That strikes me as perfectly reasonable, and it seems
> unlikely that we'd get much benefit from parallelizing it, so I'd say it
> makes sense to keep this code simple.

Okay, agreed, that amount of time sounds reasonable to me too; but I
don't want to be responsible for this at least for pg17.  If some other
committer wants to take it, be my guest.  However, I think this is
mostly a foundation for building other things on top, so committing
during the last commitfest is perhaps not very useful.


Another aspect of this patch is the removal of the LSN groups.  There's
an explanation of the LSN groups in src/backend/access/transam/README,
and while this patch removes the LSN group feature, it doesn't update
that text.  That's already a problem which needs fixed, but the text
says

: In fact, we store more than one LSN for each clog page.  This relates to
: the way we set transaction status hint bits during visibility tests.
: We must not set a transaction-committed hint bit on a relation page and
: have that record make it to disk prior to the WAL record of the commit.
: Since visibility tests are normally made while holding buffer share locks,
: we do not have the option of changing the page's LSN to guarantee WAL
: synchronization.  Instead, we defer the setting of the hint bit if we have
: not yet flushed WAL as far as the LSN associated with the transaction.
: This requires tracking the LSN of each unflushed async commit.
: It is convenient to associate this data with clog buffers: because we
: will flush WAL before writing a clog page, we know that we do not need
: to remember a transaction's LSN longer than the clog page holding its
: commit status remains in memory.  However, the naive approach of storing
: an LSN for each clog position is unattractive: the LSNs are 32x bigger
: than the two-bit commit status fields, and so we'd need 256K of
: additional shared memory for each 8K clog buffer page.  We choose
: instead to store a smaller number of LSNs per page, where each LSN is
: the highest LSN associated with any transaction commit in a contiguous
: range of transaction IDs on that page.  This saves storage at the price
: of some possibly-unnecessary delay in setting transaction hint bits.

In the new code we effectively store only one LSN per page, which I
understand is strictly worse.  Maybe the idea of doing away with these
LSN groups should be reconsidered ... unless I completely misunderstand
the whole thing.
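
The arithmetic in that README passage is easy to sanity-check (a sketch;
the 32-xacts-per-LSN group size is an assumption here, not quoted from the
README):

```python
BLCKSZ = 8192                # clog page size in bytes
XACTS_PER_BYTE = 4           # two status bits per transaction
LSN_BYTES = 8                # sizeof(XLogRecPtr)

xacts_per_page = BLCKSZ * XACTS_PER_BYTE            # 32768 xacts per 8K page

# Naive scheme: one LSN per transaction slot on the page.
naive_lsn_bytes = xacts_per_page * LSN_BYTES        # 262144 bytes = 256 KB

# Grouped scheme: one LSN per contiguous group of xacts, holding the
# highest commit LSN of any xact in the group.
XACTS_PER_LSN_GROUP = 32                            # assumed group size
grouped_lsn_bytes = (xacts_per_page // XACTS_PER_LSN_GROUP) * LSN_BYTES

print(naive_lsn_bytes // 1024, "KB naive vs", grouped_lsn_bytes // 1024, "KB grouped")
# -> 256 KB naive vs 8 KB grouped
```

which matches the README's "256K of additional shared memory for each 8K
clog buffer page" figure for the naive layout, and shows how much the
grouped scheme saves per page.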

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: a wrong index choose when statistics is out of date

2024-03-08 Thread Andy Fan


After some more thought about the difference between the two ideas, I
find we are resolving two different issues; it is just that in the
wrong-index-choice cases, both of them should work generally.

Your idea actually adds some rule-based logic named certainty_factor,
just with a very graceful implementation. For the example in this case,
it takes effect *when both indexes have the same cost*. I believe that
can resolve the index choice here, but how about the rows-estimation
issue? Given that the design will not fudge the cost anyway, I assume
you will not fudge the rows or selectivity either. Then if the optimizer
statistics are missing, what can we do for both index choice and rows
estimation? I think that's where my idea comes in.

Given that optimizer statistics can't be up to date by design, and
assuming we have a situation where the customer's queries need those
statistics often, how about doing the prediction with the historical
statistics? That can cover both index choice and rows estimation. Then
the following objections may be raised: a) we can't decide
*automatically* when the missing optimizer statistics are wanted; b) if
we base the estimation on the historical statistics, the value of the
MCV information is lost. My answers: a) it is controlled by a human with
"alter table t alter column a set (force_generic=on)"; b) it can't be
resolved, I think, and it only matters when the real Const is very
different from the ones in the history; a generic plan has the same
issue, I think.

I just reviewed the bad query plans from the past half year internally,
and I found that many of them used a Nested Loop, which was the direct
cause. Now I think I have found a new reason for this: the missing
optimizer statistics cause the estimated rows of the outer relation to
be 1, which makes the Nested Loop be chosen. I'm not sure whether your
idea could help with this, or could help more than mine, in this
respect.

-- 
Best Regards
Andy Fan





Re: Failures in constraints regression test, "read only 0 of 8192 bytes"

2024-03-08 Thread Tomas Vondra
On 3/8/24 09:33, Thomas Munro wrote:
> Happened again.  I see this is OpenSUSE.  Does that mean the file
> system is Btrfs?


It is, but I don't think that matters - I've been able to reproduce this
locally on my laptop using ext4 filesystem. I'd bet the important piece
here is -DCLOBBER_CACHE_ALWAYS (and it seems avocet/trilobite are the
only animals running with this).

Also, if this really is a filesystem (or environment) issue, it seems
very strange it'd only affect HEAD and not the other branches. So it
seems quite likely this is actually triggered by a commit.

Looking at the commits from the good/bad runs, I see this:

avocet: good=4c2369a bad=f5a465f
trilobite: good=d13ff82 bad=def0ce3

That means the commit would have to be somewhere here:

f5a465f1a07 Promote assertion about !ReindexIsProcessingIndex to ...
57b28c08305 Doc: fix minor typos in two ECPG function descriptions.
28e858c0f95 Improve documentation and GUC description for ...
a661bf7b0f5 Remove flaky isolation tests for timeouts
874d817baa1 Multiple revisions to the GROUP BY reordering tests
466979ef031 Replace lateral references to removed rels in subqueries
a6b2a51e16d Avoid dangling-pointer problem with partitionwise ...
d360e3cc60e Fix compiler warning on typedef redeclaration
8af25652489 Introduce a new smgr bulk loading facility.
e612384fc78 Fix mistake in SQL features list
d13ff82319c Fix BF failure in commit 93db6cbda0.

My guess would be 8af25652489, as it's the only storage-related commit.

I'm currently running tests to verify this.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Statistics Import and Export

2024-03-08 Thread Stephen Frost
Greetings,

* Corey Huinker (corey.huin...@gmail.com) wrote:
> > Having some discussion around that would be useful.  Is it better to
> > have a situation where there are stats for some columns but no stats for
> > other columns?  There would be a good chance that this would lead to a
> > set of queries that were properly planned out and a set which end up
> > with unexpected and likely poor query plans due to lack of stats.
> > Arguably that's better overall, but either way an ANALYZE needs to be
> > done to address the lack of stats for those columns and then that
> > ANALYZE is going to blow away whatever stats got loaded previously
> > anyway and all we did with a partial stats load was maybe have a subset
> > of queries have better plans in the interim, after having expended the
> > cost to try and individually load the stats and dealing with the case of
> > some of them succeeding and some failing.
> 
> It is my (incomplete and entirely second-hand) understanding that
> pg_upgrade doesn't STOP autovacuum, but sets a delay to a very long value
> and then resets it on completion, presumably because analyzing a table
> before its data is loaded and indexes are created would just be a waste of
> time.

No, pg_upgrade starts the postmaster with -b (undocumented
binary-upgrade mode) which prevents autovacuum (and logical replication
workers) from starting, so we don't need to worry about autovacuum
coming in and causing issues during binary upgrade.  That doesn't
entirely eliminate the concerns around pg_dump vs. autovacuum because we
may restore a dump into a non-binary-upgrade-in-progress system though,
of course.

> > Overall, I'd suggest we wait to see what Corey comes up with in terms of
> > doing the stats load for all attributes in a single function call,
> > perhaps using the VALUES construct as you suggested up-thread, and then
> > we can contemplate if that's clean enough to work or if it's so grotty
> > that the better plan would be to do per-attribute function calls.  If it
> > ends up being the latter, then we can revisit this discussion and try to
> > answer some of the questions raised above.
> 
> In the patch below, I ended up doing per-attribute function calls, mostly
> because it allowed me to avoid creating a custom data type for the portable
> version of pg_statistic. This comes at the cost of a very high number of
> parameters, but that's the breaks.

Perhaps it's just me ... but it doesn't seem like it's all that many
parameters.

> I am a bit concerned about the number of locks on pg_statistic and the
> relation itself, doing CatalogOpenIndexes/CatalogCloseIndexes once per
> attribute rather than once per relation. But I also see that this will
> mostly get used at a time when no other traffic is on the machine, and
> whatever it costs, it's still faster than the smallest table sample (insert
> joke about "don't have to be faster than the bear" here).

I continue to not be too concerned about this until and unless it's
actually shown to be an issue.  Keeping things simple and
straight-forward has its own value.

> This raises questions about whether a failure in one attribute update
> statement should cause the others in that relation to roll back or not, and
> I can see situations where both would be desirable.
> 
> I'm putting this out there ahead of the pg_dump / fe_utils work, mostly
> because what I do there heavily depends on how this is received.
> 
> Also, I'm still seeking confirmation that I can create a pg_dump TOC entry
> with a chain of commands (e.g. BEGIN; ...  COMMIT; ) or if I have to fan
> them out into multiple entries.

If we do go with this approach, we'd certainly want to make sure to do
it in a manner which would allow pg_restore's single-transaction mode to
still work, which definitely complicates this some.

Given that and the other general feeling that the locking won't be a big
issue, I'd suggest the simple approach on the pg_dump side of just
dumping the stats out without the BEGIN/COMMIT.

> Anyway, here's v7. Eagerly awaiting feedback.

> Subject: [PATCH v7] Create pg_set_relation_stats, pg_set_attribute_stats.

> diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
> index 291ed876fc..d12b6e3ca3 100644
> --- a/src/include/catalog/pg_proc.dat
> +++ b/src/include/catalog/pg_proc.dat
> @@ -8818,7 +8818,6 @@
>  { oid => '3813', descr => 'generate XML text node',
>proname => 'xmltext', proisstrict => 't', prorettype => 'xml',
>proargtypes => 'text', prosrc => 'xmltext' },
> -
>  { oid => '2923', descr => 'map table contents to XML',
>proname => 'table_to_xml', procost => '100', provolatile => 's',
>proparallel => 'r', prorettype => 'xml',
> @@ -12163,8 +12162,24 @@
>  
>  # GiST stratnum implementations
>  { oid => '8047', descr => 'GiST support',
> -  proname => 'gist_stratnum_identity', prorettype => 'int2',
> +  proname => 'gist_stratnum_identity',prorettype => 'int2',
>proargtypes => 'int2',
>prosrc 

Re: Support a wildcard in backtrace_functions

2024-03-08 Thread Jelte Fennema-Nio
On Fri, 8 Mar 2024 at 10:59, Alvaro Herrera  wrote:
>
> On 2024-Mar-08, Bharath Rupireddy wrote:
>
> > This works only if '* 'is specified as the only one character in
> > backtrace_functions = '*', right? If yes, what if someone sets
> > backtrace_functions = 'foo, bar, *, baz'?
>
> It throws an error, as expected.  This is a useless waste of resources:
> checking for "foo" and "bar" is pointless, since the * is going to give
> a positive match anyway.  And the "baz" is a waste of memory which is
> never going to be checked.

Makes sense. Attached is a new patchset that implements it that way.
I've not included Bharath's 0003 patch, since it's a much bigger
change than the others and thus might need some more discussion.
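
The resulting rule is simple to state as code (an illustrative sketch only;
the function name and error text are assumptions, not the actual GUC check
hook):

```python
def parse_backtrace_functions(value):
    """Parse a comma-separated function-name list; '*' must stand alone."""
    names = [n.strip() for n in value.split(",") if n.strip()]
    if "*" in names and names != ["*"]:
        raise ValueError('"*" cannot be combined with other function names')
    if names == ["*"]:
        return lambda func: True             # wildcard: every function matches
    wanted = set(names)
    return lambda func: func in wanted

matches = parse_backtrace_functions("foo, bar")
print(matches("foo"), matches("baz"))    # True False
```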


v5-0002-Add-wildcard-support-to-backtrace_functions-GUC.patch
Description: Binary data


v5-0001-Add-backtrace_functions_min_level.patch
Description: Binary data


Re: Dump-restore loosing 'attnotnull' bit for DEFERRABLE PRIMARY KEY column(s).

2024-03-08 Thread Dean Rasheed
On Thu, 7 Mar 2024 at 17:32, Alvaro Herrera  wrote:
>
> This seems to work okay.
>

Yes, this looks good. I tested it against CREATE TABLE ... LIKE, and
it worked as expected. It might be worth adding a test case for that,
to ensure that it doesn't get broken in the future. Do we also want a
test case that does what pg_dump would do:

 - Add a NOT NULL constraint
 - Add a deferrable PK constraint
 - Drop the NOT NULL constraint
 - Check the column is still not nullable

Looking at rel.h, I think that the new field should probably come
after rd_pkindex, under the comment "data managed by
RelationGetIndexList", and have its own comment.

Also, if I'm nitpicking, the new field and local variables should use
the term "deferrable" rather than "deferred". A DEFERRABLE constraint
can be set to be either DEFERRED or IMMEDIATE within a transaction,
but "deferrable" is the right term to use to describe the persistent
property of an index/constraint that can be deferred. (The same
objection applies to the field name "indimmediate", but it's too late
to change that.)

Also, for neatness/consistency, the new field should probably be reset
in load_relcache_init_file(), alongside rd_pkindex, though I don't
think it can matter in practice.

Regards,
Dean




Re: Add new error_action COPY ON_ERROR "log"

2024-03-08 Thread Michael Paquier
On Fri, Mar 08, 2024 at 03:36:30PM +0530, Bharath Rupireddy wrote:
> Please see the attached v5-0001 patch implementing LOG_VERBOSITY with
> options 'default' and 'verbose'. v5-0002 adds the detailed info to
> ON_ERROR 'ignore' option.

I may be reading this patch set incorrectly, but why doesn't this
patch generate one NOTICE per attribute, as suggested by Sawada-san,
incrementing num_errors once per row when the last attribute has been
processed?  Also, why not have a test that checks that multiple rows
spawn more than one message in some distributed fashion?  Did you
look at this idea?
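
To make the row-vs-attribute question concrete, here is a toy model of the
ignore-and-log semantics under discussion (not the actual COPY code; the
function shape and message wording are invented): errors are counted once
per row, and per-row detail is only emitted at the higher verbosity.

```python
def copy_ignore(rows, ncols, log_verbosity="default"):
    """Toy COPY ... WITH (ON_ERROR ignore, LOG_VERBOSITY ...) over CSV-ish rows."""
    kept, notices, num_errors = [], [], 0
    for lineno, raw in enumerate(rows, start=1):
        try:
            values = [int(f) for f in raw.split(",")[:ncols]]
        except ValueError:
            num_errors += 1                   # one increment per bad row
            if log_verbosity == "verbose":
                notices.append(f"skipping row {lineno}: invalid input {raw!r}")
            continue
        kept.append(values)
    notices.append(f"{num_errors} rows were skipped due to data type incompatibility")
    return kept, notices

kept, notices = copy_ignore(["1,2", "x,y", "5,6"], 2, "verbose")
print(kept)        # [[1, 2], [5, 6]]
print(notices)
```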

> We have a CF entry https://commitfest.postgresql.org/47/4798/ for the
> original idea proposed in this thread, that is, to have the ON_ERROR
> 'log' option. I'll probably start a new thread and add a new CF entry
> in the next commitfest if there's no objection from anyone.

Hmm.  You are referring to the part where you'd want to control where
the errors are sent, right?  At least, what you have here would
address point 1) mentioned in the first message of this thread.
--
Michael


signature.asc
Description: PGP signature


Re: Commitfest Manager for March

2024-03-08 Thread Andrey M. Borodin



> On 4 Mar 2024, at 17:09, Aleksander Alekseev  wrote:
> 
> If you need any help please let me know.

Aleksander, I would greatly appreciate it if you joined me in managing the CF.
Together we can move more stuff :)
Currently, I'm going through "SQL Commands", and so far I have not gotten to
"Performance" and "Server Features" at all... So if you can handle updating the
statuses of those sections, that would be great.


Best regards, Andrey Borodin.



Re: Regarding the order of the header file includes

2024-03-08 Thread Bharath Rupireddy
On Thu, Mar 7, 2024 at 12:39 PM Richard Guo  wrote:
>
> While rebasing one of my patches I noticed that the header file includes
> in relnode.c are not sorted in order.  So I wrote a naive script to see
> if any other C files have the same issue.  The script is:
>
> #!/bin/bash
>
> find . -name "*.c" | while read -r file; do
>   headers=$(grep -o '#include "[^>]*"' "$file" |
> grep -v "postgres.h" | grep -v "postgres_fe.h" |
> sed 's/\.h"//g')
>
>   sorted_headers=$(echo "$headers" | sort)
>
>   results=$(diff <(echo "$headers") <(echo "$sorted_headers"))
>
>   if [[ $? != 0 ]]; then
> echo "Headers in '$file' are out of order"
> echo $results
> echo
>   fi
> done

Cool. Isn't it a better idea to improve this script to auto-order the
header files and land it under src/tools/pginclude/headerssort? It can
then be reusable and be another code beautification weapon one can use
before the code release.
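
A rough sketch of what such a reusable tool could look like (purely
illustrative; the headerssort name and the sorting policy are assumptions,
and the existing src/tools scripts are Perl rather than Python):

```python
import re

INCLUDE_RE = re.compile(r'\s*#include "([^"]+)"')
PINNED = {"postgres.h", "postgres_fe.h"}     # always stay first, unsorted

def sort_includes(lines):
    """Sort each contiguous run of #include "..." lines alphabetically,
    leaving postgres.h/postgres_fe.h and all other lines in place."""
    out, block = [], []

    def flush():
        block.sort(key=lambda l: INCLUDE_RE.match(l).group(1))
        out.extend(block)
        block.clear()

    for line in lines:
        m = INCLUDE_RE.match(line)
        if m and m.group(1) not in PINNED:
            block.append(line)
        else:
            flush()
            out.append(line)
    flush()
    return out

src = ['#include "postgres.h"',
       '#include "utils/rel.h"',
       '#include "access/heapam.h"']
print(sort_includes(src))
# -> ['#include "postgres.h"', '#include "access/heapam.h"', '#include "utils/rel.h"']
```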

FWIW, I'm getting a syntax error when running the above shell script,
likely because it was run with sh, which lacks bash's process substitution:

headerssort.sh: 10: Syntax error: "(" unexpected

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Stack overflow issue

2024-03-08 Thread Alexander Korotkov
On Thu, Mar 7, 2024 at 1:49 AM Tom Lane  wrote:
>
> Alexander Korotkov  writes:
> > Sorry for tediousness, but isn't pre-order a variation of depth-first order
> > [1]?
>
> To me, depth-first implies visiting children before parents.
> Do I have the terminology wrong?

According to Wikipedia, depth-first is a general term for a tree
traversal algorithm which goes as deep as possible in one branch
before visiting other branches.  The order between parents and
children, and between siblings, specifies the variation of depth-first
search, and pre-order is one of them.  But "pre-order" is the most
accurate term for MemoryContextTraverseNext() anyway.
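
The terminology difference in miniature: both traversals below are
depth-first, and they differ only in whether a parent is visited before or
after its children (pre-order being what the thread settles on for
MemoryContextTraverseNext()):

```python
class Node:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def preorder(node):
    """Visit the parent before its children."""
    yield node.name
    for child in node.children:
        yield from preorder(child)

def postorder(node):
    """Visit the children before their parent."""
    for child in node.children:
        yield from postorder(child)
    yield node.name

tree = Node("root", [Node("a", [Node("a1")]), Node("b")])
print(list(preorder(tree)))   # ['root', 'a', 'a1', 'b']
print(list(postorder(tree)))  # ['a1', 'a', 'b', 'root']
```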

--
Regards,
Alexander Korotkov




Re: Spurious pgstat_drop_replslot() call

2024-03-08 Thread Michael Paquier
On Fri, Mar 08, 2024 at 10:19:11AM +, Bertrand Drouvot wrote:
> Indeed, it does not seem appropriate to remove stats during slot invalidation,
> as one could still be interested in looking at them.

Yeah, my take is that this can still be interesting to know, at least
for debugging.  This would mean that the stats are only dropped when the
slot itself is dropped, and that looks like a sound idea.

> This spurious call was introduced in be87200efd. I think that initially we
> intended to drop slots when a recovery conflict occurred during logical
> decoding on a standby, but we changed our minds and invalidated such slots
> instead.
> 
> The spurious call is probably a leftover of the initial design that was
> never removed.

This is not a subject that was really touched on in the original
thread mentioned in the commit, so it is a bit hard to be sure.  The
only comment is that a previous version of the patch did the stats
drop at an incorrect location in the slot invalidation path.

> I don't think it's worth adding a test but can do so if someone feels the need.

Well, that would not be complicated while being cheap, no?  We have a
few paths in the 035 test where we know that a slot has been
invalidated, so it is just a matter of querying
pg_stat_replication_slots once and seeing if some data is still there.
--
Michael


signature.asc
Description: PGP signature


Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-08 Thread Michael Paquier
On Fri, Mar 08, 2024 at 10:26:21AM +, Bertrand Drouvot wrote:
> Yeah, good point: I'll create a dedicated patch for that.

Sounds good to me.

> Note that currently pgstat_drop_replslot() would not satisfy this new Assert
> when called from InvalidatePossiblyObsoleteSlot(). I think that call should
> be removed, and I have created a dedicated thread for that [1].
> 
> [1]: 
> https://www.postgresql.org/message-id/ZermH08Eq6YydHpO%40ip-10-97-1-34.eu-west-3.compute.internal

Thanks.
--
Michael


signature.asc
Description: PGP signature


Re: New Table Access Methods for Multi and Single Inserts

2024-03-08 Thread Bharath Rupireddy
On Sat, Mar 2, 2024 at 12:02 PM Bharath Rupireddy
 wrote:
>
> On Mon, Jan 29, 2024 at 5:16 PM Bharath Rupireddy
>  wrote:
> >
> > > Please find the attached v9 patch set.
>
> I've had to rebase the patches due to commit 874d817, please find the
> attached v11 patch set.

Rebase needed. Please see the v12 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 8a3552e65e62afc40db99fbd7bf4f98990d45390 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Fri, 8 Mar 2024 10:11:17 +
Subject: [PATCH v12 1/4] New TAMs for inserts

---
 src/backend/access/heap/heapam.c | 224 +++
 src/backend/access/heap/heapam_handler.c |   9 +
 src/include/access/heapam.h  |  49 +
 src/include/access/tableam.h | 138 ++
 4 files changed, 420 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 34bc60f625..497940d74a 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -64,6 +64,7 @@
 #include "storage/standby.h"
 #include "utils/datum.h"
 #include "utils/inval.h"
+#include "utils/memutils.h"
 #include "utils/relcache.h"
 #include "utils/snapmgr.h"
 #include "utils/spccache.h"
@@ -2442,6 +2443,229 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	pgstat_count_heap_insert(relation, ntuples);
 }
 
+/*
+ * Initialize state required to insert a single tuple or multiple tuples
+ * into a heap.
+ */
+TableInsertState *
+heap_insert_begin(Relation rel, CommandId cid, int am_flags, int insert_flags)
+{
+	TableInsertState *tistate;
+
+	tistate = palloc0(sizeof(TableInsertState));
+	tistate->rel = rel;
+	tistate->cid = cid;
+	tistate->am_flags = am_flags;
+	tistate->insert_flags = insert_flags;
+
+	if ((am_flags & TABLEAM_MULTI_INSERTS) != 0 ||
+		(am_flags & TABLEAM_BULKWRITE_BUFFER_ACCESS_STRATEGY))
+		tistate->am_data = palloc0(sizeof(HeapInsertState));
+
+	if ((am_flags & TABLEAM_MULTI_INSERTS) != 0)
+	{
+		HeapMultiInsertState *mistate;
+
+		mistate = palloc0(sizeof(HeapMultiInsertState));
+		mistate->slots = palloc0(sizeof(TupleTableSlot *) * HEAP_MAX_BUFFERED_SLOTS);
+
+		mistate->context = AllocSetContextCreate(CurrentMemoryContext,
+ "heap_multi_insert_v2 memory context",
+ ALLOCSET_DEFAULT_SIZES);
+
+		((HeapInsertState *) tistate->am_data)->mistate = mistate;
+	}
+
+	if ((am_flags & TABLEAM_BULKWRITE_BUFFER_ACCESS_STRATEGY) != 0)
+		((HeapInsertState *) tistate->am_data)->bistate = GetBulkInsertState();
+
+	return tistate;
+}
+
+/*
+ * Insert a single tuple into a heap.
+ */
+void
+heap_insert_v2(TableInsertState * state, TupleTableSlot *slot)
+{
+	bool		shouldFree = true;
+	HeapTuple	tuple = ExecFetchSlotHeapTuple(slot, true, &shouldFree);
+	BulkInsertState bistate = NULL;
+
+	Assert(state->am_data != NULL &&
+		   ((HeapInsertState *) state->am_data)->mistate == NULL);
+
+	/* Update tuple with table oid */
+	slot->tts_tableOid = RelationGetRelid(state->rel);
+	tuple->t_tableOid = slot->tts_tableOid;
+
+	if (state->am_data != NULL &&
+		((HeapInsertState *) state->am_data)->bistate != NULL)
+		bistate = ((HeapInsertState *) state->am_data)->bistate;
+
+	/* Perform insertion, and copy the resulting ItemPointer */
+	heap_insert(state->rel, tuple, state->cid, state->insert_flags,
+				bistate);
+	ItemPointerCopy(&tuple->t_self, &slot->tts_tid);
+
+	if (shouldFree)
+		pfree(tuple);
+}
+
+/*
+ * Create/return next free slot from multi-insert buffered slots array.
+ */
+TupleTableSlot *
+heap_multi_insert_next_free_slot(TableInsertState * state)
+{
+	TupleTableSlot *slot;
+	HeapMultiInsertState *mistate;
+
+	Assert(state->am_data != NULL &&
+		   ((HeapInsertState *) state->am_data)->mistate != NULL);
+
+	mistate = ((HeapInsertState *) state->am_data)->mistate;
+	slot = mistate->slots[mistate->cur_slots];
+
+	if (slot == NULL)
+	{
+		slot = table_slot_create(state->rel, NULL);
+		mistate->slots[mistate->cur_slots] = slot;
+	}
+	else
+		ExecClearTuple(slot);
+
+	return slot;
+}
+
+/*
+ * Store passed-in tuple into in-memory buffered slots. When full, insert
+ * multiple tuples from the buffers into heap.
+ */
+void
+heap_multi_insert_v2(TableInsertState * state, TupleTableSlot *slot)
+{
+	TupleTableSlot *dstslot;
+	HeapMultiInsertState *mistate;
+
+	Assert(state->am_data != NULL &&
+		   ((HeapInsertState *) state->am_data)->mistate != NULL);
+
+	mistate = ((HeapInsertState *) state->am_data)->mistate;
+	dstslot = mistate->slots[mistate->cur_slots];
+
+	if (dstslot == NULL)
+	{
+		dstslot = table_slot_create(state->rel, NULL);
+		mistate->slots[mistate->cur_slots] = dstslot;
+	}
+
+	/*
+	 * Caller may have got the slot using heap_multi_insert_next_free_slot,
+	 * filled it and passed. So, skip copying in such a case.
+	 */
+	if ((state->am_flags & TABLEAM_SKIP_MULTI_INSERTS_FLUSH) == 0)
+	{
+		ExecClearTuple(dstslot);
+		ExecCopySlot(dstslot, slot);
+	}
+	else
+		

Re: Add SPLIT PARTITION/MERGE PARTITIONS commands

2024-03-08 Thread Andrey M. Borodin



> On 26 Jan 2024, at 23:36, Dmitry Koval  wrote:
> 
> 

The CF entry was in Ready for Committer state not so long ago.
Stephane, you might want to review the recent version now that it has been
rebased on current HEAD. CFbot's tests passed successfully.

Thanks!


Best regards, Andrey Borodin.



Re: Missing LWLock protection in pgstat_reset_replslot()

2024-03-08 Thread Bertrand Drouvot
Hi,

On Thu, Mar 07, 2024 at 02:17:53PM +0900, Michael Paquier wrote:
> On Wed, Mar 06, 2024 at 09:05:59AM +, Bertrand Drouvot wrote:
> > Yeah, all of the above done in v3 attached.
> 
> In passing..  pgstat_create_replslot() and pgstat_drop_replslot() rely
> on the assumption that the LWLock ReplicationSlotAllocationLock is
> taken while calling these routines.  Perhaps that would be worth some
> extra Assert(LWLockHeldByMeInMode()) in pgstat_replslot.c for these
> two?  Not directly related to this problem.

Yeah, good point: I'll create a dedicated patch for that.

Note that currently pgstat_drop_replslot() would not satisfy this new Assert
when called from InvalidatePossiblyObsoleteSlot(). I think that call should
be removed, and I have created a dedicated thread for that [1].

[1]: 
https://www.postgresql.org/message-id/ZermH08Eq6YydHpO%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: a wrong index choose when statistics is out of date

2024-03-08 Thread Andrei Lepikhov

On 7/3/2024 17:32, David Rowley wrote:
> On Thu, 7 Mar 2024 at 21:17, Andrei Lepikhov  wrote:
>> I would like to ask David why the var_eq_const estimator doesn't have an
>> option for estimation with a histogram. Having that would relieve a
>> problem with skewed data. Detecting the situation with an incoming const
>> that is out of the covered area would allow us to fall back to ndistinct
>> estimation or something else. At least, histogram usage could be
>> restricted by the reltuples value and the ratio between the total number
>> of MCV values and the total number of distinct values in the table.
>
> If you can think of a way to calculate it, you should propose a patch.
>
> IIRC, we try to make the histogram buckets evenly sized based on the
> number of occurrences. I've not followed the code in detail, but I'd
> guess that doing that allows us to just subtract off the MCV
> frequencies and assume the remainder is evenly split over each
> histogram bucket, so unless we had an n_distinct per histogram bucket,
> or at the very least n_distinct_for_histogram_values, then how would
> the calculation look for what we currently record?

Yeah, it is my mistake; I see nothing special here with such a kind of
histogram: with a coarse histogram net, the level of uncertainty within
one bin is too high to make a better estimation. I am just pondering how
to detect situations where the estimated constant is simply outside the
scope of the statistics, so that we can fall back to alternative, more
expensive logic involving the number of index pages beyond the boundary,
the index tuple width, and the number of distinct values. The left and
right boundaries of the histogram are suitable detectors for such a
situation.
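
For reference, the estimate David describes for a constant that is not in
the MCV list can be sketched as follows (simplified; the real var_eq_const
also accounts for nullfrac and clamps the result):

```python
def non_mcv_eq_selectivity(mcv_freqs, n_distinct):
    """Selectivity of "col = const" when const is not a most-common value:
    the frequency mass left over after the MCVs, assumed to be spread
    evenly over the remaining distinct values."""
    remaining_freq = 1.0 - sum(mcv_freqs)
    remaining_distinct = n_distinct - len(mcv_freqs)
    if remaining_distinct <= 0:
        return 0.0
    return remaining_freq / remaining_distinct

# Three MCVs covering 60% of the rows, 103 distinct values in total:
# each of the other 100 values is assumed to hold 0.4% of the rows.
print(non_mcv_eq_selectivity([0.3, 0.2, 0.1], 103))
```

Without a per-bucket n_distinct, there is indeed nothing in the recorded
statistics that would let the histogram refine this for an in-range but
non-MCV constant.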


--
regards,
Andrei Lepikhov
Postgres Professional





Spurious pgstat_drop_replslot() call

2024-03-08 Thread Bertrand Drouvot
Hi hackers,

Please find attached a tiny patch to remove a $SUBJECT.

Indeed, it does not seem appropriate to remove stats during slot invalidation,
as one could still be interested in looking at them.

This spurious call was introduced in be87200efd. I think that initially we
intended to drop slots when a recovery conflict occurred during logical
decoding on a standby, but we changed our minds and invalidated such slots
instead.

The spurious call is probably a leftover of the initial design that was never
removed.

I don't think it's worth adding a test but can do so if someone feels the need.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 56ade7f561c180ee0120cb8c33d1c39e32ab7863 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Fri, 8 Mar 2024 09:29:29 +
Subject: [PATCH v1] Remove spurious pgstat_drop_replslot() call

There is no need to remove stats for an invalidated slot, as one could still
be interested in looking at them.
---
 src/backend/replication/slot.c | 1 -
 1 file changed, 1 deletion(-)
 100.0% src/backend/replication/

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index b8bf98b182..91ca397857 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1726,7 +1726,6 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			ReplicationSlotMarkDirty();
 			ReplicationSlotSave();
 			ReplicationSlotRelease();
-			pgstat_drop_replslot(s);
 
 			ReportSlotInvalidation(conflict, false, active_pid,
    slotname, restart_lsn,
-- 
2.34.1



Re: Proposal to add page headers to SLRU pages

2024-03-08 Thread Stephen Frost
Greetings,

* Li, Yong (y...@ebay.com) wrote:
> > On Mar 7, 2024, at 03:09, Stephen Frost  wrote:
> > * Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote:
> >> I suppose this is important to do if we ever want to move SLRUs into
> >> shared buffers.  However, I wonder about the extra time this adds to
> >> pg_upgrade.  Is this something we should be concerned about?  Is there
> >> any measurement/estimates to tell us how long this would be?  Right now,
> >> if you use a cloning strategy for the data files, the upgrade should be
> >> pretty quick ... but the amount of data in pg_xact and pg_multixact
> >> could be massive, and the rewrite is likely to take considerable time.
> > 
> > While I definitely agree that there should be some consideration of
> > this concern, it feels on-par with the visibility-map rewrite which was
> > done previously.  Larger systems will likely have more to deal with than
> > smaller systems, but it's still a relatively small portion of the data
> > overall.
> > 
> > The benefit of this change, beyond just the possibility of moving them
> > into shared buffers some day in the future, is that this would mean that
> > SLRUs will have checksums (if the cluster has them enabled).  That
> > benefit strikes me as well worth the cost of the rewrite taking some
> > time and the minor loss of space due to the page header.
> > 
> > Would it be useful to consider parallelizing this work?  There's already
> > parts of pg_upgrade which can be parallelized and so this isn't,
> > hopefully, a big lift to add, but I'm not sure if there's enough work
> > being done here CPU-wise, compared to the amount of IO being done, to
> > have it make sense to run it in parallel.  Might be worth looking into
> > though, at least, as disks have gotten to be quite fast.
> 
> Thank Alvaro and Stephen for your thoughtful comments.
> 
> I did a quick benchmark regarding pg_upgrade time, and here are the results.

> For clog, 2048 segments can host about 2 billion transactions, right at the 
> limit for wraparound.
> That’s the maximum we can have.  2048 segments are also big for pg_multixact 
> SLRUs.
> 
> Therefore, on a modern hardware, in the worst case, pg_upgrade will run for 7 
> seconds longer.

Thanks for testing!  That strikes me as perfectly reasonable and seems
unlikely that we'd get much benefit from parallelizing it, so I'd say it
makes sense to keep this code simple.
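
For the record, the capacity figure quoted above checks out against the
standard SLRU layout of 32 pages of 8 kB per segment with two clog status
bits per transaction (constants assumed here for the sketch, not taken
from the message):

```python
BLCKSZ = 8192
SLRU_PAGES_PER_SEGMENT = 32
XACTS_PER_BYTE = 4           # two clog status bits per transaction

xacts_per_segment = BLCKSZ * SLRU_PAGES_PER_SEGMENT * XACTS_PER_BYTE
total = 2048 * xacts_per_segment
print(f"{xacts_per_segment:,} xacts/segment; {total:,} in 2048 segments")
# -> 1,048,576 xacts/segment; 2,147,483,648 in 2048 segments
```

i.e. 2048 segments is exactly the 2^31 transaction-ID wraparound horizon.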

Thanks again!

Stephen


signature.asc
Description: PGP signature


Re: Invalid query generated by postgres_fdw with UNION ALL and ORDER BY

2024-03-08 Thread Richard Guo
On Fri, Mar 8, 2024 at 10:13 AM David Rowley  wrote:

> The fix could also be to use deparseConst() in appendOrderByClause()
> and have that handle Const EquivalenceMember instead.  I'd rather just
> skip them. To me, that seems less risky than ensuring deparseConst()
> handles all Const types correctly.


I've looked at this patch a bit.  I once wondered why we don't check
pathkey->pk_eclass->ec_has_const with EC_MUST_BE_REDUNDANT to see if the
pathkey is not needed.  Then I realized that a child member would not be
marked as constant even if the child expr is a Const, as explained in
add_child_rel_equivalences().

BTW, I wonder if it is possible that we have a pseudoconstant expression
that is not of type Const.  In such cases we would need to check
'bms_is_empty(pull_varnos(em_expr))' instead of 'IsA(em_expr, Const)'.
However, I'm unable to think of such an expression in this context.

The patch looks good to me otherwise.

Thanks
Richard


Re: Add new error_action COPY ON_ERROR "log"

2024-03-08 Thread Bharath Rupireddy
On Thu, Mar 7, 2024 at 12:54 PM Michael Paquier  wrote:
>
> On Thu, Mar 07, 2024 at 12:48:12PM +0530, Bharath Rupireddy wrote:
> > I'm okay with it. But, help me understand it better. We want the
> > 'log_verbosity' clause to have options 'default' and 'verbose', right?
> > And, later it can also be extended to contain all the LOG levels like
> > 'notice', 'error', 'info' , 'debugX' etc. depending on the need,
> > right?
>
> You could, or names that have some status like row_details, etc.
>
> > One more thing, how does it sound using both verbosity and verbose in
> > log_verbosity verbose something like below? Is this okay?
>
> There's some history with this pattern in psql at least with \set
> VERBOSITY verbose.  For the patch, I would tend to choose these two,
> but that's as far as my opinion goes and I am OK other ideas gather
> more votes.

Please see the attached v5-0001 patch implementing LOG_VERBOSITY with
options 'default' and 'verbose'. v5-0002 adds the detailed info to
ON_ERROR 'ignore' option.

We have a CF entry https://commitfest.postgresql.org/47/4798/ for the
original idea proposed in this thread, that is, to have the ON_ERROR
'log' option. I'll probably start a new thread and add a new CF entry
in the next commitfest if there's no objection from anyone.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v5-0001-Add-LOG_VERBOSITY-option-to-COPY-command.patch
Description: Binary data


v5-0002-Add-detailed-info-when-COPY-skips-soft-errors.patch
Description: Binary data


Call perror on popen failure

2024-03-08 Thread Daniel Gustafsson
If popen fails in pipe_read_line we invoke perror for the error message, and
pipe_read_line is in turn called by find_other_exec which is used in both
frontend and backend code.  This is an old code path, and it seems like it ought
to be replaced with the more common logging tools we now have, as in the
attached diff (which also makes the error message translatable, which it is not
today).  Any objections to removing this last perror() call?

--
Daniel Gustafsson



no_perror.diff
Description: Binary data


Re: speed up a logical replica setup

2024-03-08 Thread Andrey M. Borodin



> On 8 Mar 2024, at 12:03, Shlok Kyal  wrote:
> 
> 

I haven't dug into the thread, but the recent version fails some of CFbot's tests.

http://commitfest.cputube.org/euler-taveira.html
https://cirrus-ci.com/task/4833499115421696
==29928==ERROR: AddressSanitizer: heap-use-after-free on address 0x61a01458 
at pc 0x7f3b29fdedce bp 0x7ffe68fcf1c0 sp 0x7ffe68fcf1b8

Thanks!


Best regards, Andrey Borodin.



Re: Support a wildcard in backtrace_functions

2024-03-08 Thread Alvaro Herrera
On 2024-Mar-08, Bharath Rupireddy wrote:

> This works only if '* 'is specified as the only one character in
> backtrace_functions = '*', right? If yes, what if someone sets
> backtrace_functions = 'foo, bar, *, baz'?

It throws an error, as expected.  This is a useless waste of resources:
checking for "foo" and "bar" is pointless, since the * is going to give
a positive match anyway.  And the "baz" is a waste of memory which is
never going to be checked.

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"I love the Postgres community. It's all about doing things _properly_. :-)"
(David Garamond)




Re: a wrong index choose when statistics is out of date

2024-03-08 Thread Andy Fan


David Rowley  writes:

> On Thu, 7 Mar 2024 at 23:40, Andy Fan  wrote:
>>
>> David Rowley  writes:
>> > If you don't want the planner to use the statistics for the column why
>> > not just do the following?
>>
>> Actually I didn't want the planner to ignore the statistics totally; I
>> want the planner to treat a "Const" that the statistics probably miss
>> as an average value, just like what we do in the generic plan for the
>> query below.
>
> I'm with Andrei on this one and agree with his "And it is just luck
> that you've got the right answer".

Any example to support this conclusion? And what's the new problem after
this?

-- 
Best Regards
Andy Fan





RE: speed up a logical replica setup

2024-03-08 Thread Hayato Kuroda (Fujitsu)
Dear Tomas, Euler,

Thanks for starting to read the thread! Since I'm not the original author,
I will reply to only part of it.

> I decided to take a quick look on this patch today, to see how it works
> and do some simple tests. I've only started to get familiar with it, so
> I have only some comments / questions regarding usage, not on the code.
> It's quite possible I didn't understand some finer points, or maybe it
> was already discussed earlier in this very long thread, so please feel
> free to push back or point me to the past discussion.
> 
> Also, some of this is rather opinionated, but considering I didn't see
> this patch before, my opinions may easily be wrong ...

I felt your comments were quite valuable.

> 1) SGML docs
> 
> It seems the SGML docs are more about explaining how this works on the
> inside, rather than how to use the tool. Maybe that's intentional, but
> as someone who didn't work with pg_createsubscriber before I found it
> confusing and not very helpful.
>
> For example, the first half of the page is prerequisites+warnings, and
> sure those are useful details, but prerequisites are checked by the
> tool (so I can't really miss this) and warnings go into a lot of details
> about different places where things may go wrong. Sure, worth knowing
> and including in the docs, but maybe not right at the beginning, before
> I learn how to even run the tool?

Hmm, right. I considered the improvements below. Tomas and Euler, what do you think?

* Adds more descriptions in "Description" section.
* Moves prerequisites+warnings to "Notes" section.
* Adds "Usage" section which describes from a single node.

> I'm not sure FOR ALL TABLES is a good idea. Or said differently, I'm
> sure it won't work for a number of use cases. I know large databases
> it's common to create "work tables" (not necessarily temporary) as part
> of a batch job, but there's no need to replicate those tables.

Indeed, the documentation does not mention that all tables in the database
would be included in the publication.

> I do understand that FOR ALL TABLES is the simplest approach, and for v1
> it may be an acceptable limitation, but maybe it'd be good to also
> support restricting which tables should be replicated (e.g. blacklist or
> whitelist based on table/schema name?).

This may not be directly related, but we considered accepting such options
as a next step [1].

> Note: I now realize this might fall under the warning about DDL, which
> says this:
> 
> Executing DDL commands on the source server while running
> pg_createsubscriber is not recommended. If the target server has
> already been converted to logical replica, the DDL commands must
> not be replicated so an error would occur.

Yeah, they would not be replicated, but that would not lead to an ERROR.
So should we say something like "Creating tables on the source server..."?

> 5) slot / publication / subscription name
> 
> I find it somewhat annoying it's not possible to specify names for
> objects created by the tool - replication slots, publication and
> subscriptions. If this is meant to be a replica running for a while,
> after a while I'll have no idea what pg_createsubscriber_569853 or
> pg_createsubscriber_459548_2348239 was meant for.
> 
> This is particularly annoying because renaming these objects later is
> either not supported at all (e.g. for replication slots), or may be
> quite difficult (e.g. publications).
> 
> I do realize there are challenges with custom names (say, if there are
> multiple databases to replicate), but can't we support some simple
> formatting with basic placeholders? So we could specify
> 
> --slot-name "myslot_%d_%p"
> 
> or something like that?

Not sure we can do that in the first version, but it looks nice. One concern is
that I cannot find applications which accept escape strings like
log_line_prefix. (It may be just because we do not have a use case.) Do you
know of examples?

> BTW what will happen if we convert multiple standbys? Can't they all get
> the same slot name (they all have the same database OID, and I'm not
> sure how much entropy the PID has)?

I tested, and the second try did not work. The primary reason was the name of
the publication - pg_createsubscriber_%u (oid).
FYI - previously we can reuse same publications, but based on my comment [2] the
feature was removed. It might be too optimistic.

[1]: 
https://www.postgresql.org/message-id/TY3PR01MB9889CCBD4D9DAF8BD2F18541F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/TYCPR01MB12077756323B79042F29DDAEDF54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



Re: vacuumdb/clusterdb/reindexdb: allow specifying objects to process in all databases

2024-03-08 Thread Dean Rasheed
On Wed, 6 Mar 2024 at 22:22, Nathan Bossart  wrote:
>
> Thanks for taking a look.  I updated the synopsis sections in v3.

OK, that looks good. The vacuumdb synopsis in particular looks a lot
better now that "-N | --exclude-schema" is on its own line, because it
was hard to read previously, and easy to mistakenly think that -n
could be combined with -N.

If I'm nitpicking, "[--verbose | -v]" in the clusterdb synopsis should
be replaced with "[option...]", like the other commands, because there
are other general-purpose options like --quiet and --echo.

> I also spent some more time on the reindexdb patch (0003).  I previously
> had decided to restrict combinations of tables, schemas, and indexes
> because I felt it was "ambiguous and inconsistent with vacuumdb," but
> looking closer, I think that's the wrong move.  reindexdb already supports
> such combinations, which it interprets to mean it should reindex each
> listed object.  So, I removed that change in v3.

Makes sense.

> Even though reindexdb allows combinations of tables, schema, and indexes,
> it doesn't allow combinations of "system catalogs" and other objects, and
> it's not clear why.  In v3, I've removed this restriction, which ended up
> simplifying the 0003 patch a bit.  Like combinations of tables, schemas,
> and indexes, reindexdb will now interpret combinations that include
> --system to mean it should reindex each listed object as well as the system
> catalogs.

OK, that looks useful, especially given that most people will still
probably use this against a single database, and it's making that more
flexible.

I think this is good to go.

Regards,
Dean




Re: meson: Specify -Wformat as a common warning flag for extensions

2024-03-08 Thread Sutou Kouhei
Hi,

In 
  "Re: meson: Specify -Wformat as a common warning flag for extensions" on Fri, 
8 Mar 2024 15:32:22 +0900,
  Michael Paquier  wrote:

> Are there version and/or
> environment requirements to be aware of?

I'm using Debian GNU/Linux sid and I can reproduce with gcc
8-13:

$ for x in {8..13}; do; echo gcc-${x}; gcc-${x} -Wformat-security -E - < 
/dev/null > /dev/null; done
gcc-8
cc1: warning: -Wformat-security ignored without -Wformat [-Wformat-security]
gcc-9
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]
gcc-10
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]
gcc-11
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]
gcc-12
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]
gcc-13
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]
$

I tried this on Ubuntu 22.04 too but this isn't reproduced:

$ gcc-11 -Wformat-security -E - < /dev/null > /dev/null
$

It seems that Ubuntu enables -Wformat by default:

$ gcc-11 -Wno-format -Wformat-security -E - < /dev/null > /dev/null
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]

I tried this on AlmaLinux 9 too and this is reproduced:

$ gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ gcc -Wformat-security -E - < /dev/null > /dev/null
cc1: warning: '-Wformat-security' ignored without '-Wformat' [-Wformat-security]

> Forcing -Wformat implies more stuff that can be disabled with
> -Wno-format-contains-nul, -Wno-format-extra-args, and
> -Wno-format-zero-length, but the thing is that we're usually very
> conservative with such additions in the scripts.  See also
> 8b6f5f25102f, done, I guess, as an answer to this thread:
> https://www.postgresql.org/message-id/4D431505.9010002%40dunslane.net

I think that this is not a problem, because the comment
added by 8b6f5f25102f ("This was included in -Wall/-Wformat
in older GCC versions") implies that we always want to use
-Wformat-security. -Wformat-security doesn't work without
-Wformat:

https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wformat-security

> If -Wformat is specified, also warn about uses of format
> functions that represent possible security problems.


Thanks,
-- 
kou




Re: MERGE ... RETURNING

2024-03-08 Thread walther

Jeff Davis:

To summarize, most of the problem has been in retrieving the action
(INSERT/UPDATE/DELETE) taken or the WHEN-clause number applied to a
particular matched row. The reason this is important is because the row
returned is the old row for a DELETE action, and the new row for an
INSERT or UPDATE action. Without a way to distinguish the particular
action, the RETURNING clause returns a mixture of old and new rows,
which would be hard to use sensibly.


It seems to me that all of this is only a problem because there is only
one RETURNING clause.

Dean Rasheed wrote in the very first post to this thread:

I considered allowing a separate RETURNING list at the end of each
action, but rapidly dismissed that idea. Firstly, it introduces
shift/reduce conflicts to the grammar. These can be resolved by making
the "AS" before column aliases non-optional, but that's pretty ugly,
and there may be a better way. More serious drawbacks are that this
syntax is much more cumbersome for the end user, having to repeat the
RETURNING clause several times, and the implementation is likely to be
pretty complex, so I didn't pursue it.


I can't judge the grammar and complexity issues, but as a potential user
it seems to me to be less complex to have multiple RETURNING clauses, 
where I could inject my own constants about the specific actions, than 
to have to deal with any of the suggested functions / clauses. More 
repetitive, yes - but not more complex.


More importantly, I could add RETURNING to only some of the actions and 
not always all at the same time - which seems pretty useful to me.


Best,

Wolfgang




Re: speed up a logical replica setup

2024-03-08 Thread vignesh C
On Thu, 7 Mar 2024 at 18:31, Shlok Kyal  wrote:
>
> Hi,
>
> > Thanks for the feedback. I'm attaching v26 that addresses most of your 
> > comments
> > and some issues pointed by Vignesh [1].
>
> I have created a top-up patch v27-0004. It contains additional test
> cases for pg_createsubscriber.
>
> Currently, two testcases (in v27-0004 patch) are failing. These
> failures are related to running pg_createsubscriber on nodes in
> cascade physical replication and are already reported in [1] and [2].
> I think these cases should be fixed. Thoughts?

We can fix these issues; if we are not planning to fix any of them, we
can add documentation for the same.

> The idea of this patch is to keep track of testcases, so that any
> future patch does not break any scenario which has already been worked
> on. These testcases can be used to test in our development process,
> but which tests should actually be committed can be discussed later.
> Thoughts?

Few comments for v27-0004-Add-additional-testcases.patch:
1) We could use command_fails_like to verify the reason of the error:
+# set max_replication_slots
+$node_p->append_conf('postgresql.conf', 'max_replication_slots = 3');
+$node_p->restart;
+command_fails(
+   [
+   'pg_createsubscriber', '--verbose',
+   '--dry-run', '--pgdata',
+   $node_s->data_dir, '--publisher-server',
+   $node_p->connstr('pg1'), '--socket-directory',
+   $node_s->host, '--subscriber-port',
+   $node_s->port, '--database',
+   'pg1', '--database',
+   'pg2',
+   ],
+   'max_replication_slots are less in number in publisher');

2) Add a comment saying what is being verified
+# set max_replication_slots
+$node_p->append_conf('postgresql.conf', 'max_replication_slots = 3');
+$node_p->restart;
+command_fails(
+   [
+   'pg_createsubscriber', '--verbose',
+   '--dry-run', '--pgdata',
+   $node_s->data_dir, '--publisher-server',
+   $node_p->connstr('pg1'), '--socket-directory',
+   $node_s->host, '--subscriber-port',
+   $node_s->port, '--database',
+   'pg1', '--database',
+   'pg2',
+   ],
+   'max_replication_slots are less in number in publisher');

3) We could rename this file to something like
pg_create_subscriber_failure_cases, or something better:
 src/bin/pg_basebackup/t/041_tests.pl | 285 +++
 1 file changed, 285 insertions(+)
 create mode 100644 src/bin/pg_basebackup/t/041_tests.pl

diff --git a/src/bin/pg_basebackup/t/041_tests.pl
b/src/bin/pg_basebackup/t/041_tests.pl
new file mode 100644
index 00..2889d60d54
--- /dev/null
+++ b/src/bin/pg_basebackup/t/041_tests.pl
@@ -0,0 +1,285 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group


4) This success case is not required as this would have already been
covered in 040_pg_createsubscriber.pl:
+$node_p->append_conf('postgresql.conf', 'max_replication_slots = 4');
+$node_p->restart;
+command_ok(
+   [
+   'pg_createsubscriber', '--verbose',
+   '--dry-run', '--pgdata',
+   $node_s->data_dir, '--publisher-server',
+   $node_p->connstr('pg1'), '--socket-directory',
+   $node_s->host, '--subscriber-port',
+   $node_s->port, '--database',
+   'pg1', '--database',
+   'pg2',
+   ],
+   'max_replication_slots are accurate on publisher');

5)  We could use command_fails_like to verify the reason of the error:
$node_s->append_conf('postgresql.conf', 'max_replication_slots = 1');
$node_s->restart;
command_fails(
[
'pg_createsubscriber', '--verbose',
'--dry-run', '--pgdata',
$node_s->data_dir, '--publisher-server',
$node_p->connstr('pg1'), '--socket-directory',
$node_s->host, '--subscriber-port',
$node_s->port, '--database',
'pg1', '--database',
'pg2',
],
'max_replication_slots are less in number in subscriber');

6) Add a comment saying what is being verified
$node_s->append_conf('postgresql.conf', 'max_replication_slots = 1');
$node_s->restart;
command_fails(
[
'pg_createsubscriber', '--verbose',
'--dry-run', '--pgdata',
$node_s->data_dir, '--publisher-server',
$node_p->connstr('pg1'), '--socket-directory',
$node_s->host, '--subscriber-port',
$node_s->port, '--database',
'pg1', '--database',
'pg2',
],
'max_replication_slots are less in number in subscriber');

7) This success case is not required as this would have already been
covered in 040_pg_createsubscriber.pl:
$node_s->append_conf('postgresql.conf', 'max_replication_slots = 2');
$node_s->restart;
command_ok(
[
'pg_createsubscriber', '--verbose',
'--dry-run', '--pgdata',
$node_s->data_dir, '--publisher-server',
$node_p->connstr('pg1'), '--socket-directory',
$node_s->host, '--subscriber-port',
$node_s->port, '--database',
'pg1', '--database',
'pg2',
],
'max_replication_slots are less in number in subscriber');

8) We could use 

Re: Failures in constraints regression test, "read only 0 of 8192 bytes"

2024-03-08 Thread Thomas Munro
Happened again.  I see this is OpenSUSE.  Does that mean the file
system is Btrfs?




Re: [17] CREATE SUBSCRIPTION ... SERVER

2024-03-08 Thread Jeff Davis
On Wed, 2024-01-31 at 11:10 +0530, Ashutosh Bapat wrote:
> > I like the idea -- it further decouples the logic from the core
> > server.
> > I suspect it will make postgres_fdw the primary way (though not the
> > only possible way) to use this feature. There would be little need
> > to
> > create a new builtin FDW to make this work.
> 
> That's what I see as well. I am glad that we are on the same page.

Implemented in v11, attached.

Is this what you had in mind? It leaves a lot of the work to
postgres_fdw and it's almost unusable without postgres_fdw.

That's not a bad thing, but it makes the core functionality a bit
harder to test standalone. I can work on the core tests some more. The
postgres_fdw tests passed without modification, though, and offer a
simple example of how to use it.

Regards,
Jeff Davis

From 88fa1333ace4d15d72534d20d2cccb37748277f2 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Tue, 2 Jan 2024 13:42:48 -0800
Subject: [PATCH v11] CREATE SUBSCRIPTION ... SERVER.

Allow specifying a foreign server for CREATE SUBSCRIPTION, rather than
a raw connection string with CONNECTION.

Using a foreign server as a layer of indirection improves management
of multiple subscriptions to the same server. It also provides
integration with user mappings in case different subscriptions have
different owners or a subscription changes owners.

Discussion: https://postgr.es/m/61831790a0a937038f78ce09f8dd4cef7de7456a.ca...@j-davis.com
Reviewed-by: Ashutosh Bapat
---
 contrib/postgres_fdw/Makefile |   4 +-
 contrib/postgres_fdw/connection.c |  74 
 .../postgres_fdw/expected/postgres_fdw.out|   8 +
 contrib/postgres_fdw/meson.build  |   6 +
 .../postgres_fdw/postgres_fdw--1.1--1.2.sql   |  11 ++
 contrib/postgres_fdw/postgres_fdw.control |   2 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql |   7 +
 contrib/postgres_fdw/t/010_subscription.pl|  71 
 doc/src/sgml/ref/alter_subscription.sgml  |  18 +-
 doc/src/sgml/ref/create_subscription.sgml |  11 +-
 src/backend/catalog/pg_subscription.c |  38 +++-
 src/backend/commands/foreigncmds.c|  57 +-
 src/backend/commands/subscriptioncmds.c   | 167 --
 src/backend/foreign/foreign.c |  42 +
 src/backend/parser/gram.y |  22 +++
 src/backend/replication/logical/worker.c  |  16 +-
 src/bin/pg_dump/pg_dump.c |  27 ++-
 src/bin/pg_dump/pg_dump.h |   1 +
 src/bin/psql/tab-complete.c   |   2 +-
 src/include/catalog/pg_foreign_data_wrapper.h |   3 +
 src/include/catalog/pg_subscription.h |   7 +-
 src/include/foreign/foreign.h |   3 +
 src/include/nodes/parsenodes.h|   3 +
 src/test/regress/expected/oidjoins.out|   1 +
 24 files changed, 563 insertions(+), 38 deletions(-)
 create mode 100644 contrib/postgres_fdw/postgres_fdw--1.1--1.2.sql
 create mode 100644 contrib/postgres_fdw/t/010_subscription.pl

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index c1b0cad453..995a30c297 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -14,10 +14,12 @@ PG_CPPFLAGS = -I$(libpq_srcdir)
 SHLIB_LINK_INTERNAL = $(libpq)
 
 EXTENSION = postgres_fdw
-DATA = postgres_fdw--1.0.sql postgres_fdw--1.0--1.1.sql
+DATA = postgres_fdw--1.0.sql postgres_fdw--1.0--1.1.sql postgres_fdw--1.1--1.2.sql
 
 REGRESS = postgres_fdw
 
+TAP_TESTS = 1
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 4931ebf591..a011e6df5f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -113,6 +113,7 @@ static uint32 pgfdw_we_get_result = 0;
 PG_FUNCTION_INFO_V1(postgres_fdw_get_connections);
 PG_FUNCTION_INFO_V1(postgres_fdw_disconnect);
 PG_FUNCTION_INFO_V1(postgres_fdw_disconnect_all);
+PG_FUNCTION_INFO_V1(postgres_fdw_connection);
 
 /* prototypes of private functions */
 static void make_new_connection(ConnCacheEntry *entry, UserMapping *user);
@@ -1972,6 +1973,79 @@ pgfdw_finish_abort_cleanup(List *pending_entries, List *cancel_requested,
 	}
 }
 
+/*
+ * Values in connection strings must be enclosed in single quotes. Single
+ * quotes and backslashes must be escaped with backslash. NB: these rules are
+ * different from the rules for escaping a SQL literal.
+ */
+static void
+appendEscapedValue(StringInfo str, const char *val)
+{
+	appendStringInfoChar(str, '\'');
+	for (int i = 0; val[i] != '\0'; i++)
+	{
+		if (val[i] == '\\' || val[i] == '\'')
+			appendStringInfoChar(str, '\\');
+		appendStringInfoChar(str, val[i]);
+	}
+	appendStringInfoChar(str, '\'');
+}
+
+Datum
+postgres_fdw_connection(PG_FUNCTION_ARGS)
+{
+	/* TODO: consider memory usage */
+	Oid userid = PG_GETARG_OID(0);
+	Oid serverid = PG_GETARG_OID(1);
+	ForeignServer